I am unable to import a CSV table's DATE columns into BigQuery.
The dates are not recognized, even though they are in the correct format (YYYY-MM-DD) according to this documentation:
https://cloud.google.com/bigquery/docs/schema-detect
So the date columns are not recognized and are renamed to _2020-0122, 2020-01-23...
Is the issue that the dates are in the first row as column names?
But how can I then import the dates when I want to use them in time-series charts (Data Studio)?
Here is a sample of the source CSV:
Province/State,Country/Region,Lat,Long,2020-01-22,2020-01-23,2020-01-24,2020-01-25,2020-01-026
Anhui,China,31.8257,117.2264,1,9,15,39,60
Beijing,China,40.1824,116.4142,14,22,36,41,68
Chongqing,China,30.0572,107.874,6,9,27,57,75
Here is an image from BigQuery.
If you have a finite number of day columns, you can try unpivoting the table when you query it. See the blog post.
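As a rough sketch of that unpivot idea (not a drop-in solution): the table name and the sanitized column names _2020_01_22, _2020_01_23, ... below are assumptions, so check what BigQuery actually produced in your schema and adjust the IN (...) list accordingly.

from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()
# UNPIVOT turns the one-column-per-date layout into (report_date, cases) rows.
sql = """
SELECT
  Province_State,
  Country_Region,
  PARSE_DATE('%Y_%m_%d', LTRIM(report_date, '_')) AS report_date,  -- column name back to a DATE
  cases
FROM `my_project.my_dataset.covid`  -- assumed table name
UNPIVOT(cases FOR report_date IN (_2020_01_22, _2020_01_23, _2020_01_24))
"""
for row in client.query(sql).result():
    print(row.Country_Region, row.report_date, row.cases)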
Otherwise, if you don't know how many day columns the CSV file has:
Choose a character that never appears in the data as the CSV delimiter, then load the whole file into a single-column staging table and use the SPLIT function; you'll also need UNNEST. This approach requires a full scan and will be more expensive, especially as the file gets bigger.
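Here is a hedged sketch of that staging-table approach; the staging table name, the single `line` column, and the assumption that the date columns start at position 5 and begin at 2020-01-22 all come from the sample above, not from anything verified.

from google.cloud import bigquery

client = bigquery.Client()
# `staging` holds one raw CSV line per row in a single STRING column called `line`
# (load it with a delimiter that never occurs in the data so nothing gets split).
sql = """
WITH parsed AS (
  SELECT SPLIT(line, ',') AS cols
  FROM `my_project.my_dataset.staging`              -- assumed staging table
  WHERE NOT STARTS_WITH(line, 'Province/State')     -- skip the header line
)
SELECT
  cols[OFFSET(0)] AS province_state,
  cols[OFFSET(1)] AS country_region,
  DATE_ADD(DATE '2020-01-22', INTERVAL day_offset DAY) AS report_date,
  CAST(cols[OFFSET(4 + day_offset)] AS INT64) AS cases
FROM parsed,
     UNNEST(GENERATE_ARRAY(0, ARRAY_LENGTH(cols) - 5)) AS day_offset
"""
for row in client.query(sql).result():
    print(row.country_region, row.report_date, row.cases)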
The issue is that column names cannot be of a date type; for this reason, when the CSV is imported, the dates are transformed into a format with underscores.
The first way to tackle the problem would be to modify the CSV file, because any import with the first row as a header will change the date format, and then it will be harder to get back to a date type. If you have any experience in any programming language, you can do the transformation very easily. I can help with this, but I do not know your use case, so maybe this is not possible. Where does this CSV come from?
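For example, a minimal sketch of that pre-import transformation with pandas (the file names and the output column names here are my own illustrative choices):

import pandas as pd

# Wide CSV: one column per date, as in the sample above
wide = pd.read_csv('time_series_covid.csv')   # assumed input file name

# melt() unpivots the date headers into a proper (date, confirmed) pair of columns
long = wide.melt(
    id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'],
    var_name='date',
    value_name='confirmed',
)
long['date'] = pd.to_datetime(long['date']).dt.date

# BigQuery can now auto-detect a DATE column instead of seeing dates as headers
long.to_csv('time_series_covid_long.csv', index=False)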
If modifying the CSV beforehand is not possible, then the second option is what ktopcuoglu said: importing the whole file as one column and processing it using SQL functions. This is much harder than the first option, and since you import all the data into a single column, everything will have the same data type, which will be a headache too.
If you could explain where the CSV comes from, we may be able to influence it before it is ingested by BigQuery. Otherwise, you'll need to dig into SQL a bit.
Hope it helps!
Hi, now I can help you further.
First, I found some COVID datasets among the public BigQuery datasets. The one you are taking from GitHub is already in BigQuery, but there are many others that may work better for your task, such as the one called “covid19_ecdc” inside bigquery-public-data. That one has the confirmed cases and deaths per date and country, so it should be easy to make a time series.
Second, I found an interesting link that does what you meant with Python and Data Studio. It's a Kaggle discussion, so you may not be familiar with it, but it is definitely worth a look. Moreover, it uses the dataset you are trying to use.
Hope it helps. Do not hesitate to ask!
Have you ever faced an issue while saving the results (export) of a GA4 query in BigQuery as JSONL (delimited), where columns containing NULLs were removed from the downloaded JSON? This can also be noticed after uploading it back as a table (upload -> JSONL) and querying it. Another thing I noticed was that the sequence of the columns after importing was different compared to the original table before exporting. Has anyone faced this issue? If so, I'd appreciate hearing how you worked around it and how one can download and re-upload a JSON with the schema intact.
P.S.: If you are wondering why one would download and re-upload a JSONL, it was just to see whether an option outside the Google Cloud ecosystem was viable. Also, to be a bit more specific about the NULLs being removed: I mean, for example, float_value or double_value from the GA4 BigQuery export (event_params) being eliminated if they were all NULLs.
Thanks a ton in advance
Columns containing NULLs were removed; I expected it to retain the data structure like a local JSON file.
The sequence of the columns after importing was different / changed compared to before the export.
I need to import a CSV file into SAS, and it gets stuck on empty fields with a date format. My log shows this field is properly recognised as DATETIME. and ANYDTDTM40., just like the other datetime fields. In the first records this field is empty, and the log then gives a NOTE about invalid data. When I enter a date in the first rows with empty fields, the message moves along. So it clearly has to do with missing values. Can someone help me out?
In the future please make sure to show your actual code and the log - feel free to omit the data part of the log if it's confidential information.
PROC IMPORT is a guessing procedure and guesses at types. For production processes it's not a good idea to use PROC IMPORT.
You can add the GUESSINGROWS=MAX; option to your code to force SAS to scan the entire file before guessing at types. This will increase the run time of the process but will likely fix your issue. Also, ensure your datetime fields are consistent and correct. If the data does have mixed date formats, i.e. MMDDYY and DDMMYY, then it can be a bit of a pain to manage. And if it has DDMMYY and SAS guesses MMDDYY (or vice versa), you'll get a bunch of errors. In that case you need to write your own data step code to read in the data. You can use the code from the log as a starting point.
I often have to cleanse and import messy CSV and Excel files into my MS SQL Server 2014 (but the question would be the same if I were using Oracle or another database).
I have found a way to do this with Alteryx. Can you help me understand if I can do the same with Pentaho Kettle or SSIS? Alternatively, can you recommend another ETL software which addresses my points below?
I often have tables of, say, 100,000 records where the first 90,000 records may be null. Most ETL tools scan only the first few hundred records to guess data types and therefore fail to guess the types of these fields. Can I force Pentaho or SSIS to scan the WHOLE file before guessing types? I understand this may not be efficient for huge files of many GBs, but for the files I handle, scanning the entire file is much better than wasting a lot of time trying to guess each field manually.
As above, but with the length of a string. If the first 10,000 records are, say, a 3-character string but the subsequent ones are longer, SSIS and Pentaho tend to guess nvarchar(3) and the import will fail. Can I force them to scan all rows before guessing the length of the strings? Or, alternatively, can I easily force all strings to be nvarchar(x), where I set x myself?
Alteryx has a multi-field tool, which is particularly convenient when cleansing or converting multiple fields. E.g. I have 10 date columns whose data type was not guessed automatically. I can use the multi-field formula to get Alteryx to convert all 10 fields to date and create new fields called $oldfield_reformatted. Do Pentaho and SSIS have anything similar?
Thank you!
A silly suggestion: in Excel, add a row at the top of the list that has a formula that creates a text string with the same length as the longest value in the column.
This formula, entered as an array formula, would do it:
=REPT("X",MAX(LEN(A:A)))
You could also use a more advanced VBA function to create other dummy values to force datatypes in SSIS.
I've not used SSIS or anything like it, but in the past I would have loaded a file into a table with all columns as, say, varchar(1000) so that all the data loaded, then processed it across into the main table using SQL that casts or removes data values as required.
This gives YOU ultimate control, not a package or driver. I was very surprised to hear how this works!
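Not Pentaho or SSIS, but the same "load everything as text, then cast it yourself" idea can be sketched in Python with pandas; every file and column name below is made up for illustration:

import pandas as pd

# Read every column as text so no rows are lost to bad type guesses
raw = pd.read_csv('messy_export.csv', dtype=str)   # assumed file name

# Cast explicitly, so you decide what counts as bad data
clean = pd.DataFrame({
    'order_id':   pd.to_numeric(raw['order_id'], errors='coerce'),     # assumed column
    'order_date': pd.to_datetime(raw['order_date'], errors='coerce'),  # assumed column
    'notes':      raw['notes'].fillna(''),                             # assumed column
})
print(clean.dtypes)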
I have the following data in my MySQL database. These three columns are a subset of a table that I have selected using a query.
Value Date Time
230.8 13/08/08 15:01:22+22
233.7 13/08/08 15:13:12+22
234.5 13/08/08 15:40:33+22
I want to represent this data on a graph of (Value) versus (Date & Time) in chronological order. What format do I need to put the above data into before using JSON? I've had a look at a few tutorials, and when I apply the same logic (like this: http://www.d3noob.org/2013/02/using-mysql-database-as-source-of-data.html) I don't seem to get any graph at all.
Or will JSON and D3.js not work for my requirement? Do I need to look at something else? Like some other JavaScript?
Your question is a little bit vague, but I'll try to address a few of your topics to help you get started.
Firstly, I would suggest finding the visualization that fits your needs. From the data subset that you showed in the question, I would suggest maybe this one. It is interesting because if you have multiple values for different times in a given day, you could construct various time series graphs and compare them interactively. There are other options, so you should explore and find a good starting point to improve and adapt to your needs.
Regarding the origin/format of the data: if you are able to extract the data you showed into a variable (with PHP, for example), you can then manipulate it and build a structure from it. It doesn't necessarily have to be JSON and/or CSV, as long as you can handle it with d3.js's API functions. It isn't very difficult, but it is something that requires you to read about and understand the topic. First understand how to query for your needs with MySQL. Then, I would suggest starting here if you decide to go with JSON.
The example visualization I mentioned above uses a CSV file as a data source. Another option could be, for instance, to build a CSV file (or a data structure, i.e. an array) to feed into d3.js. There are various questions covering "how to create CSV with PHP", so you shouldn't have much difficulty finding the info you need.
Either way, after you feel comfortable with what you know about these topics, start breaking your problem into smaller tasks and finding answers to one question at a time. If you need to, post more questions here on SO and include your attempts at coding a solution; this will definitely get you all the help you might need.
In Python it would look like this:
import json
output = json.dumps(['data', {'data_1': ('230.8', '13/08/08', '15:01:22+22')}, {'data_2': ('233.7', '13/08/08', '15:13:12+22')}, {'data_3': ('234.5', '13/08/08', '15:40:33+22')}])
print(output)
More information about Python and JSON can be found here.
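A follow-up sketch of my own (not part of the original answer): for a d3.js time series it usually helps to merge Date and Time into one ISO-8601 timestamp and emit one object per reading. The DD/MM/YY date format and the handling of the trailing '+22' (simply stripped here) are assumptions, so adjust them to your actual data.

import json
from datetime import datetime

rows = [('230.8', '13/08/08', '15:01:22+22'),
        ('233.7', '13/08/08', '15:13:12+22'),
        ('234.5', '13/08/08', '15:40:33+22')]

records = []
for value, date_str, time_str in rows:
    clean_time = time_str.split('+')[0]   # drop the '+22' suffix (its meaning is unclear from the question)
    ts = datetime.strptime(date_str + ' ' + clean_time, '%d/%m/%y %H:%M:%S')
    records.append({'date': ts.isoformat(), 'value': float(value)})

print(json.dumps(records))
# d3 can then parse the 'date' field, e.g. with d3.timeParse('%Y-%m-%dT%H:%M:%S') in v4+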
I have a problem loading a .CSV file, as the connection manager editor settings are beyond my knowledge.
When I load the .CSV file with up to 18 rows I have no problem; it loads into the table.
However, from the 19th column onward the data is not partitioning correctly.
The row delimiter is {CR}{LF}.
The column delimiter is a comma {,}.
How can I partition the data correctly?
Any help?
Here are some ideas, though I have no details to go on.
What happens when you try to import the same .CSV file into Excel? Anything interesting around row 19?
Does there appear to be anything different about row 19?
If you delete row 19, what happens?
See, I bet you've thought of these things as well, and probably more, since you have the details. If you want anything more than superficial bad guesses, you'll have to provide a little detail.
I've found the CSV import to be a bit limited with regard to bad data. If you're having trouble with the 19th column, I would suggest figuring out why that column is failing. You can try telling the import task's error conditions to ignore errors with data truncation, etc., but that may not fix the issue.
I have often switched complicated or error-prone CSV imports to simply use a SSIS Script Task, then just write my own code to parse out the CSV and handle bad data.
If it's not partitioning correctly, it might be something as trivial as one of your field values on row 19 containing a comma, thus throwing out the import by making that row seem to have more columns. If this is the case, I hope you can get a revised version of the CSV file - this time with a text qualifier set. If possible, use something like | rather than " as the qualifier so that it's less likely to appear in the field values.
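Not SSIS, but as a small illustration of what a text qualifier buys you, here is Python's csv module with a made-up row whose field contains a comma:

import csv, io

raw = 'id,name,notes\n19,Smith,"likes apples, oranges"\n'   # assumed sample data
reader = csv.reader(io.StringIO(raw), delimiter=',', quotechar='"')
for row in reader:
    print(row)   # the quoted field stays as one value: ['19', 'Smith', 'likes apples, oranges']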
Put the file in a text editor such as notepad++ or textpad and change the view to show control characters. You will probably find your culprit there.
Nothing unusual. When I paste it into Excel as one column and convert text to columns, there is no problem. But I can see in the SSIS preview that the field value where the problem starts has two square boxes, followed by the data of the next row.
If anyone wants to see the file, let me know and I will e-mail it to you.