Parsing csv strings into webpages - csv

I'm a geneticist trying to automate a very laborious search task over my data. I understand that this question may have been asked before; please bear with me, as I'm not entirely sure what keywords I should use. Thanks in advance!
What I want to achieve:
Search a website for a specific string of numbers taken from my list of data (a CSV file), then select the top option. Once on the page, search for specific keyword(s) and write the results to a CSV file.
Rinse and repeat for the remaining numbers.
That's it. It took me a day to do a couple of hundred entries by hand. That takes up too much manpower, and I really hope to use my time better than this.
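For anyone searching later, the usual keywords for this are "web scraping" and "browser automation". As a starting point, here is a minimal sketch in Python with requests and BeautifulSoup; the URL, query parameter, and CSS selector below are hypothetical placeholders for whatever site is being searched, and if the site builds its results with JavaScript you would need a browser-automation tool such as Selenium instead.

import csv
import requests
from bs4 import BeautifulSoup

KEYWORDS = ["keyword1", "keyword2"]  # terms to look for on each result page

with open("ids.csv", newline="") as src, open("results.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    writer.writerow(["query", "keyword", "found"])
    for row in csv.reader(src):
        query = row[0]  # assumes the search string is in the first column
        # Run the search and take the top result (URL and selector are hypothetical)
        search = requests.get("https://example.org/search", params={"q": query})
        soup = BeautifulSoup(search.text, "html.parser")
        top_link = soup.select_one("a.result")["href"]
        # Fetch the result page and record whether each keyword appears
        page = requests.get(top_link).text
        for kw in KEYWORDS:
            writer.writerow([query, kw, kw in page])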


Writing CSV files - fill columns with whitespace or not? [closed]

When doing data analysis, it often makes sense to save some intermediate results as a CSV file. It could be for documentation, to hand over to colleagues who want to work with Excel or similar, or to have a quick way to do a sanity check yourself.
But how do I best format such a CSV file? Let's assume I want to have a classic spreadsheet with a header row and the data in columns. Like so:
Device_id;Location;Mean_reading;Error_count
opti-1;Upper-Underburg Backroad 2;1.45;42
ac-4;Valley 23;0.1;2
level-245;Lower-Underburg Central Market Place;1034;5
For opening it in Excel or reading it in with pandas, this works flawlessly, as long as you specify the use of ; as the separator. However, as you can see from this example, it's quite hard to read when opening it in a simple text editor, the use of which might be preferable in many cases (remote access, faster opening, no assumptions needed about separator or decimal dot vs. comma, etc.).
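For reference, reading such a file with pandas just means specifying the separator; a minimal example, assuming the file name data.csv:

import pandas as pd

# sep=";" matches the separator used above; pass decimal="," as well
# if the file uses decimal commas instead of dots
df = pd.read_csv("data.csv", sep=";")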
So I could simply add some whitespace to make the CSV look like this:
Device_id ;Location ;Mean_reading ;Error_count
opti-1 ;Upper-Underburg Backroad 2 ;1.45 ;42
ac-4 ;Valley 23 ;0.1 ;2
level-245 ;Lower-Underburg Central Market Place ;1034 ;5
But should I?
Are there any documented best practices or standards on how to write CSV files in such cases?
I can see pros and cons for both ways (see below), so I'm wondering if there's any guidelines on which way to go.
I'm leaning towards the latter way, and looking at what kind of CSV files I get out of various data loggers and other software, this seems to be the preferred way; but on the other hand, searching for CSV whitespace on this here site mostly results in questions about how to get rid of it.
And I can see some potential issues with the needed length of the fields, since I either need to make assumptions (e.g. Location needs a length of 40 characters) that might or might not be correct (what happens when I place a device at Underburg western motorway industrial estate northern fence?), or I need some potentially expensive logic to figure out the needed field lengths.
I work daily with CSV data files (in the printing industry, where CSV is still the common denominator). I usually tell customers that the format to choose depends on the purpose.
CSV files without whitespace are for machine (software) reading, or for cases where you can use a common separator that is not used elsewhere, if you want to avoid the path of escaping the separators.
Fixed-width files are better for humans, or where the chosen separator will at times be part of the text. This comes at a penalty if you use spaces to separate, since fixed-width files take up more space. And, as you point out, you need to know the longest possible field in advance. For my customers, files in this format are mostly result exports from legacy software dating back many years.
A variant to consider could be TAB-separated files, since you can choose on the viewer/editor side how wide a TAB should be. That way, you are less dependent on the field size.
Or, keep the compact version for machine reading, and make yourself a temporary copy using AWK as a filter. It's trivial to do, and you can make the field lengths anything you want, without modifying the original file.
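The same two-pass filter idea works in any scripting language; here is a sketch in Python rather than AWK, assuming ';' as the separator to match the example above. First measure the widest entry per column, then write a padded copy, so no field lengths need to be guessed in advance.

import csv

# Pass 1: read the compact CSV and measure the widest cell per column
# (assumes all rows have the same number of fields)
with open("data.csv", newline="") as f:
    rows = list(csv.reader(f, delimiter=";"))
widths = [max(len(row[i]) for row in rows) for i in range(len(rows[0]))]

# Pass 2: write a human-readable, space-padded temporary copy
with open("data_padded.csv", "w", newline="") as f:
    for row in rows:
        f.write(";".join(cell.ljust(w) for cell, w in zip(row, widths)) + "\n")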

unable to import csv table DATEs columns to BigQuery

I am unable to import a CSV table's DATE columns to BigQuery.
The dates are not recognized, even though they have the correct YYYY-MM-DD format according to this documentation:
https://cloud.google.com/bigquery/docs/schema-detect
So the date columns are not recognized and are renamed to _2020-01-22, _2020-01-23, ...
Is the issue that the dates are in the first row, as column names?
But how can I then import the dates when I want to use them in time series charts (Data Studio)?
Here is a sample of the source CSV:
Province/State,Country/Region,Lat,Long,2020-01-22,2020-01-23,2020-01-24,2020-01-25,2020-01-26
Anhui,China,31.8257,117.2264,1,9,15,39,60
Beijing,China,40.1824,116.4142,14,22,36,41,68
Chongqing,China,30.0572,107.874,6,9,27,57,75
If you have a finite number of day columns, you can try to unpivot the table when using it. See blog post.
Otherwise, if you don't know how many day columns the CSV file has:
Choose a unique character as the CSV delimiter, then just load the whole file into a single-column staging table and use the SPLIT function; you'll also need UNNEST. This approach requires a full scan and will be more expensive, especially as the file gets bigger.
The issue is that you cannot have a date type in column names; for this reason, when the CSV is imported, BigQuery takes the dates and transforms them to the format with underscores.
The first way to face the problem would be to modify the CSV file, because any import with the first row as a header will change the date format, and then it will be harder to get back to a date type. If you have any experience in any programming language, you can do the transformation very easily. I can help with this, but I do not know your use case, so maybe this is not possible. Where does this CSV come from?
If modifying the CSV beforehand is not possible, then the second option is what ktopcuoglu said: importing the whole file as one column and processing it with SQL functions. This is way harder than the first option, and since you import all the data into a single column, all the data will have the same data type, which will be a headache too.
If you could explain where the CSV comes from, we may be able to influence it before it is ingested by BigQuery. Otherwise, you'll need to dig into SQL a bit.
Hope it helps!
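For the first option (reshaping the CSV before loading), here is a minimal sketch with pandas, assuming the column layout shown in the question and a hypothetical file name; melt turns the date columns into rows, so BigQuery can then detect a proper DATE column:

import pandas as pd

df = pd.read_csv("covid.csv")  # hypothetical file name
id_cols = ["Province/State", "Country/Region", "Lat", "Long"]

# Unpivot: one row per (province, date) instead of one column per date
long_df = df.melt(id_vars=id_cols, var_name="date", value_name="confirmed")
long_df["date"] = pd.to_datetime(long_df["date"]).dt.date
long_df.to_csv("covid_long.csv", index=False)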
Hi, now I can help you further.
First, I found some COVID datasets among the public BigQuery datasets. The one you are taking from GitHub is already in BigQuery, but there are many others that may work better for your task, such as the one called "covid19_ecdc" inside bigquery-public-data. This one has the confirmed cases and deaths per date and country, so it should be easy to make a time series.
Second, I found an interesting link performing what you meant with Python and Data Studio. It's a Kaggle discussion, so you may not be familiar with it, but it deserves a look for sure. Moreover, it uses the dataset you are trying to use.
Hope it helps. Do not hesitate to ask!

Batch file to verify a .csv file

I am hoping someone can point me in the right direction, in relation to the scenario I am faced with.
Essentially, I am given a CSV each day containing 200+ lines of payment information.
As the Payment reference is input by the user at source, this isn't always in the format I need.
The process is currently done manually, and can take considerable time, therefore I was hoping to come up with a batch file to isolate the reference I require, based on a set of parameters.
Each reference should be 11 digits in length, numeric only, and start with 1, 2 or 3.
I have attached a basic example with this post.
It may be that this isn't possible in batch, but any ideas would be appreciated.
Thanks in advance :-)
I'm not too sure about batch, but Python and Regex can help you out here.
Here is a great tutorial on using csv's with python.
Once you have that down, you could use Regex to filter out the correct values.
Here is an expression to help you out: ^[123][0-9]{10}$ (note that inside a character class, | just matches a literal pipe character, so [1|2|3] would wrongly accept | as well; [123] is what you want).
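Putting the two together, a minimal sketch; since the attached example isn't available here, the assumption that the reference sits in the first column is a placeholder you would need to adjust:

import csv
import re

# 11 digits, numeric only, starting with 1, 2 or 3
REF = re.compile(r"^[123][0-9]{10}$")

with open("payments.csv", newline="") as src, open("valid.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        if REF.match(row[0]):  # assumes the reference is the first column
            writer.writerow(row)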

ABAP TVRO field TRAZTD, Route Customizing Data

A customer of mine is looking to mass-create some customizing data related to routes, and as such I have a small program which reads in a CSV file with all of the fields as they would be in the customizing transaction.
I'm having a particular problem wrapping my head around the field TVRO-TRAZTD, for a couple of reasons:
The user is only filling in a number which represents a number of days.
There is a conversion exit on TRAZTD, except it's obsolete ("use CONVERT TIMESTAMP", they say)
I don't have a timestamp, I have a decimal number representing a part of a day
For example, TRAZTD would be entered as 0,58 from the CSV file, so why is it represented in the table as 135.512?
I tried it the old-fashioned way and multiplied 0,58 * 24, which gives me 13,92. If I take 13,92 * 10.000 I get 139.200, which isn't the same, but it's the closest I can get. But I don't get it: why 10.000?
Using the conversion exit, even though it's obsolete, doesn't give me a result either; no matter what number I give it, I always get 0 back. I can't use CONVERT TIMESTAMP either, because, well, it's not a timestamp (or I didn't look up carefully enough how to use it; I didn't see anything other than strings and characters).
The other thing I tried was just saying "screw it" and placing the data from the CSV directly into the field, hoping the conversion routine would take care of the work, but that doesn't happen either.
Is there anybody out here who can maybe shed some light on where the number after the conversion comes from?
Everybody, I came to a solution, just in case anybody stumbles upon this same problem.
I took the value from the Excel document and multiplied it by 24 to get the number of hours, and then multiplied that by 10.000 because... I don't know, I picked it randomly.
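For what it's worth, a plausible explanation for where the number comes from, inferred from the values above rather than from SAP documentation: the field appears to store the time packed as HHMMSS. 0,58 days is 13,92 hours, which is 13 hours and 55,2 minutes, i.e. 13:55:12, and that packs to exactly the 135.512 seen in the table. Multiplying the hours by 10.000 only approximates this (139.200 would read as the invalid time 13:92:00). A sketch of the exact conversion:

# Convert a fraction of a day (e.g. 0.58) to an HHMMSS-packed integer,
# the format TVRO-TRAZTD appears to use (an inference, not documented)
def day_fraction_to_hhmmss(frac):
    total_seconds = round(frac * 24 * 60 * 60)
    hours, rem = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rem, 60)
    return hours * 10000 + minutes * 100 + seconds

print(day_fraction_to_hhmmss(0.58))  # 135512, i.e. 13:55:12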

Formatting data for use with JSON and D3.js

I have the following data in my MySQL database. These three columns are a subset of a table that I have selected using a query.
Value Date Time
230.8 13/08/08 15:01:22+22
233.7 13/08/08 15:13:12+22
234.5 13/08/08 15:40:33+22
I want to represent this data on a graph of (Value) versus (Date & Time) in chronological order. What format do I need to put the above data into before using JSON? I've had a look at a few tutorials, and when I apply the same logic (like this: http://www.d3noob.org/2013/02/using-mysql-database-as-source-of-data.html) I don't seem to get any graph at all.
Or will JSON and D3.js not work for my requirement? Do I need to look at something else, like some other JavaScript library?
Your question is a little bit vague, but I'll try to address a few of your topics to help you get started.
Firstly, I would suggest finding the visualization that fits your needs. From the data subset that you showed in the question, I would suggest maybe this one. It is interesting because if you have multiple values for different times in a given day, you could construct various time series graphs and compare them interactively. There are other options, so you should explore and find a good starting point to improve and adapt to your needs.
Regarding the origin/format of the data, if you are able to extract the data you showed into a variable (with PHP, for example), you can then manipulate it and build a structure from it. It doesn't necessarily have to be JSON and/or CSV, as long as you can handle it with d3.js's API functions. It isn't very difficult, but it is something that requires you to understand and read about the topic. First understand how to query for your needs with MySQL. Then, I would suggest starting here if you decide to go with JSON.
The example visualization I mentioned above uses a CSV file as a data source. Other option could be for instance to build a CSV file (or data structure - ie, an array) to feed into d3.js. There are various questions covering "how to create CSV with PHP", so you shouldn't have much difficulty finding the info you need.
Either way, after you feel comfortable with what you know about these topics, start breaking your problem into smaller tasks and finding answers to one question at a time. If you need to, post more questions here on SO and include your attempts at coding a solution; this will definitely get you all the help you might need.
In Python it would look like this:
import json

# Build the JSON string; the leading 'data' label and per-row keys
# mirror the structure of the original snippet
output = json.dumps(['data',
                     {'data_1': ('230.8', '13/08/08', '15:01:22+22')},
                     {'data_2': ('233.7', '13/08/08', '15:13:12+22')},
                     {'data_3': ('234.5', '13/08/08', '15:40:33+22')}])
print(output)  # print() works in both Python 2 and 3
More information about Python and JSON can be found here.
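As a follow-up to the snippet above: a shape that is often easier to consume in d3.js is a flat list of objects with named fields, one object per reading. A sketch, with illustrative field names:

import json

records = [
    {"value": 230.8, "date": "13/08/08", "time": "15:01:22+22"},
    {"value": 233.7, "date": "13/08/08", "time": "15:13:12+22"},
    {"value": 234.5, "date": "13/08/08", "time": "15:40:33+22"},
]
print(json.dumps(records, indent=2))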