How to export newline-delimited JSON (JSONL) from BigQuery with NULL values - json

Have you ever faced an issue where saving (exporting) the results of a GA4 query in BigQuery as JSONL (newline-delimited JSON) removes the columns containing NULLs from the downloaded JSON? The same thing can be noticed after uploading it back as a table (upload -> JSONL) and querying it. Another thing I noticed was that the column order after importing was different from the original table before exporting. Has anyone faced this issue? If so, I'd appreciate hearing how you worked around it and how one can download and re-upload a JSONL while keeping the schema intact.
P.S.: If you are wondering why one would download and re-upload a JSONL at all, it was just to see whether a viable option exists outside the Google Cloud ecosystem. Also, to be more specific about the NULLs being removed: I mean, for example, float_value or double_value from the GA4 BigQuery export (event_params) being eliminated when they were all NULLs.
Thanks a ton in advance
Columns containing NULLs were removed; I expected the export to retain the data structure, as a local JSON file would.
The column order after importing was different from the pre-export table.
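One thing I'm experimenting with as a workaround is re-uploading the JSONL with an explicit schema instead of autodetect, since autodetect can only infer columns it actually sees in the file. Roughly something like this (the schema below is a trimmed-down placeholder, not the full GA4 export schema):

```python
# Sketch: re-upload a JSONL export with an explicit schema so that all-NULL
# columns and the intended column order are preserved. Field names below are
# placeholders, not the full GA4 export schema.
from google.cloud import bigquery

client = bigquery.Client()

schema = [
    bigquery.SchemaField("event_date", "STRING"),
    bigquery.SchemaField("event_name", "STRING"),
    bigquery.SchemaField(
        "event_params", "RECORD", mode="REPEATED",
        fields=[
            bigquery.SchemaField("key", "STRING"),
            bigquery.SchemaField(
                "value", "RECORD",
                fields=[
                    bigquery.SchemaField("string_value", "STRING"),
                    bigquery.SchemaField("int_value", "INTEGER"),
                    bigquery.SchemaField("float_value", "FLOAT"),
                    bigquery.SchemaField("double_value", "FLOAT"),
                ],
            ),
        ],
    ),
]

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    schema=schema,                      # explicit schema instead of autodetect
    write_disposition="WRITE_TRUNCATE",
)

with open("ga4_export.jsonl", "rb") as f:
    load_job = client.load_table_from_file(
        f, "my-project.my_dataset.ga4_reupload", job_config=job_config
    )
load_job.result()  # keys missing from a row should simply load as NULL
```

With an explicit schema, keys that are absent from a row should come back as NULLs and the column order should follow the schema rather than whatever autodetect infers. I'd still like to know whether the export itself can be made to keep the NULL keys.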

Related

Unable to import CSV table DATE columns to BigQuery

I am unable to import the DATE columns of a CSV table into BigQuery.
The dates are not recognized, even though they have the correct format according to this documentation:
https://cloud.google.com/bigquery/docs/schema-detect (YYYY-MM-DD)
So the DATE columns are not recognized and are renamed to _2020-01-22, _2020-01-23, and so on.
Is the issue that the dates are in the first row, as column names?
But how can I then import the dates when I want to use them in time-series charts (Data Studio)?
Here is a sample of the source CSV:
Province/State,Country/Region,Lat,Long,2020-01-22,2020-01-23,2020-01-24,2020-01-25,2020-01-26
Anhui,China,31.8257,117.2264,1,9,15,39,60
Beijing,China,40.1824,116.4142,14,22,36,41,68
Chongqing,China,30.0572,107.874,6,9,27,57,75
Here is an image from BigQuery.
If you have a finite number of day columns, you can try to unpivot the table when using it. See the blog post.
Otherwise, if you don't know how many day columns are in the CSV file, choose a unique character as the CSV delimiter and load the whole file into a single-column staging table, then use the SPLIT function. You'll also need UNNEST. This approach requires a full scan and will be more expensive, especially as the file gets bigger.
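A rough sketch of that second option, with the project, staging table and column names made up, just to show the SPLIT/UNNEST shape (mapping day_offset back to an actual date still needs the header row, which is part of why this is more work):

```python
# Sketch of the single-column staging approach: each raw CSV line sits in one
# STRING column ("line"); SPLIT + UNNEST turn the unknown number of date
# columns into rows. Table and column names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT
  SPLIT(line, ',')[OFFSET(0)] AS province_state,
  SPLIT(line, ',')[OFFSET(1)] AS country_region,
  day_offset,                          -- position of the value within the row
  SAFE_CAST(value AS INT64) AS cases
FROM `my-project.my_dataset.covid_staging`,
UNNEST(SPLIT(line, ',')) AS value WITH OFFSET AS day_offset
WHERE day_offset >= 4                                    -- skip Province/State, Country/Region, Lat, Long
  AND SPLIT(line, ',')[OFFSET(0)] != 'Province/State'    -- drop the header row
"""

for row in client.query(sql).result():
    print(row.province_state, row.day_offset, row.cases)
```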
The issue is that you cannot have a date type in column names; for this reason, when the CSV is imported, BigQuery takes the dates and transforms them into the underscore format.
The first way to approach the problem would be to modify the CSV file, because any import that treats the first row as a header will change the date format, and then it will be harder to get back to a date type. If you have any experience in a programming language, you can do the transformation very easily (there is a rough sketch below). I can help with this, but I do not know your use case, so maybe this is not possible. Where does this CSV come from?
If modifying the CSV beforehand is not possible, then the second option is what ktopcuoglu said: import the whole file as one column and process it using SQL functions. This is much harder than the first option, and since you import all the data into a single column, all of it will have the same data type, which will be a headache too.
If you could explain where the CSV comes from, we may be able to influence it before it is ingested by BigQuery. Otherwise, you'll need to dig into SQL a bit.
Hope it helps!
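If you do get access to the CSV before loading, the transformation is essentially an unpivot from wide to long, which is a few lines of Python with pandas (file names here are made up; the id columns are taken from your sample header):

```python
# Sketch: reshape the wide per-date columns into (date, cases) rows before
# loading into BigQuery, so dates become values instead of column names.
# File names are placeholders.
import pandas as pd

wide = pd.read_csv("covid_confirmed_wide.csv")

id_cols = ["Province/State", "Country/Region", "Lat", "Long"]
long = wide.melt(
    id_vars=id_cols,                 # keep these as regular columns
    var_name="report_date",          # former column headers become a date column
    value_name="confirmed_cases",
)
long["report_date"] = pd.to_datetime(long["report_date"]).dt.date

long.to_csv("covid_confirmed_long.csv", index=False)
```

The resulting file has a real report_date column, so BigQuery's schema auto-detection should pick it up as DATE and Data Studio can chart it as a time series.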
Hi, now I can help you further.
First, I found some COVID datasets among the public BigQuery datasets. The one you are taking from GitHub is already in BigQuery, but there are many others that may work better for your task, such as the one called "covid19_ecdc" inside bigquery-public-data. That one has confirmed cases and deaths per date and country, so it should be easy to build a time series.
Second, I found an interesting link that does what you describe with Python and Data Studio. It's a Kaggle discussion, so you may not be familiar with it, but it is worth a look for sure. Moreover, it uses the dataset you are trying to use.
Hope it helps. Do not hesitate to ask!

Index multiple CSV files with different headers in Solr

I am trying to index multiple CSV files with different "schemas" in a Solr index. There are possibly some common schema elements (header columns) across these CSVs. My requirement is to be able to provide search across these CSVs, amongst other things.
From what I understand, one way to index would be to treat each entire CSV as a giant text string and index that. I am not sure which searchability aspects are affected if I index that way.
The other way is basically to define a common schema and then programmatically extract the columns from each document and index it line by line, with the caveat that if a file doesn't share any of the common schema I may not be able to index it. (BTW, this last part may be a non-starter for me, but let's indulge the possibility for now.)
Are there any other ways? Is there any advantage of one over the other?
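To be concrete, for the second approach I'm picturing something along these lines (the core name, field names and header mapping are just made up for illustration):

```python
# Sketch of the "common schema" approach: read each CSV, map whatever headers
# it has onto a fixed set of Solr fields, and post the docs to the update
# handler. Core name, field names and the mapping are illustrative only.
import csv
import json
import uuid

import requests

SOLR_UPDATE_URL = "http://localhost:8983/solr/mixed_csv/update?commit=true"

# Map the various CSV header spellings onto one common schema.
HEADER_MAP = {
    "Name": "name_s",
    "Full Name": "name_s",
    "Description": "description_t",
    "Notes": "description_t",
}

def index_csv(path):
    docs = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            doc = {"id": str(uuid.uuid4()), "source_file_s": path}
            for header, value in row.items():
                field = HEADER_MAP.get(header)
                if field and value:
                    doc[field] = value
            docs.append(doc)
    resp = requests.post(
        SOLR_UPDATE_URL,
        data=json.dumps(docs),
        headers={"Content-Type": "application/json"},
    )
    resp.raise_for_status()

index_csv("customers.csv")
index_csv("suppliers.csv")
```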
BTW, I tried schemaless mode, but it doesn't work for me. I can index the first file, but the moment I index the next file and it has some different columns, it gives back an error. Is this expected behaviour, or am I doing something wrong?
Appreciate any pointers, thanks!
Update: the error with schemaless mode is "Invalid date format". After doing some research, it seems this is a different issue than I had thought: Solr auto-detects the data as a date and expects it to be in UTC format, which it is not. Is there any way for me to turn off auto-detection of dates?

ETL: how to guess data types for messy CSVs with lots of nulls

I often have to cleanse and import messy CSV and Excel files into my MS SQL Server 2014 (but the question would be the same if I were using Oracle or another database).
I have found a way to do this with Alteryx. Can you help me understand if I can do the same with Pentaho Kettle or SSIS? Alternatively, can you recommend another ETL software which addresses my points below?
I often have tables of, say, 100,000 records where the first 90,000 values of a field may be null. Most ETL tools scan only the first few hundred records to guess data types and therefore fail to guess the types of these fields. Can I force Pentaho or SSIS to scan the WHOLE file before guessing types? I understand this may not be efficient for huge files of many GBs, but for the files I handle, scanning the entire file is much better than wasting a lot of time guessing each field manually.
As above, but with the length of a string: if the first 10,000 records are, say, 3-character strings but the subsequent ones are longer, SSIS and Pentaho tend to guess nvarchar(3) and the import will fail. Can I force them to scan all rows before guessing the length of the strings? Or, alternatively, can I easily force all strings to be nvarchar(x), where I set x myself?
Alteryx has a multi-field tool, which is particularly convenient when cleansing or converting multiple fields. E.g. I have 10 date columns whose data type was not guessed automatically. I can use the multi-field formula to get Alteryx to convert all 10 fields to dates and create new fields called $oldfield_reformatted. Do Pentaho and SSIS have anything similar?
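To illustrate the kind of multi-field conversion I mean, here is roughly the equivalent in pandas (the column names are made up; Alteryx does this through its GUI):

```python
# Rough pandas equivalent of the multi-field conversion described above:
# convert several date-like text columns in one go and keep the originals.
# Column names are made up for illustration.
import pandas as pd

df = pd.read_csv("messy_export.csv", dtype=str)  # read everything as text first

date_cols = ["order_date", "ship_date", "invoice_date"]  # ... up to 10 columns
for col in date_cols:
    # new field alongside the original, like $oldfield_reformatted in Alteryx
    df[col + "_reformatted"] = pd.to_datetime(df[col], errors="coerce")

df.to_csv("cleansed_export.csv", index=False)
```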
Thank you!
A silly suggestion: in Excel, add a row at the top of the list that has a formula creating a text string with the same length as the longest value in the column.
This formula, entered as an array formula, would do it:
=REPT("X",MAX(LEN(A:A)))
You could also use a more advanced VBA function to create other dummy values to force datatypes in SSIS.
I've not used SSIS or anything like it, but in the past I would have loaded the file into a table whose columns are ALL varchar(1000), say, so that all the data loads, then processed it across into the main table using SQL that casts or discards the data values as required.
This gives YOU ultimate control, not a package or driver. I was very surprised to hear how this works!
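Along the same lines, the "scan the whole file first" step from the question is easy to script yourself before creating that staging table; a rough sketch (the file name and date format are placeholders):

```python
# Sketch: scan the ENTIRE file up front to find, per column, the maximum
# string length and whether every non-empty value parses as a number or date,
# then size the varchar/nvarchar columns from that instead of a sampled guess.
# File name and date format are placeholders.
import csv
from datetime import datetime

def profile_csv(path):
    stats = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            for col, value in row.items():
                s = stats.setdefault(
                    col, {"max_len": 0, "all_numeric": True, "all_dates": True}
                )
                if not value:
                    continue  # nulls don't disqualify a type guess
                s["max_len"] = max(s["max_len"], len(value))
                try:
                    float(value)
                except ValueError:
                    s["all_numeric"] = False
                try:
                    datetime.strptime(value, "%Y-%m-%d")
                except ValueError:
                    s["all_dates"] = False
    return stats

for col, s in profile_csv("messy_import.csv").items():
    print(col, s)  # e.g. decide nvarchar(max_len), numeric, or date per column
```

From the per-column stats you can size nvarchar(x) yourself and decide which columns are safe to cast to numeric or date in the follow-up SQL.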

Appending CSV files with columns in different orders

I need to regularly merge data from multiple CSV files into a single spreadsheet by appending the rows from each source file. Only OpenOffice/LibreOffice is able to read the UTF-8 CSV files, which have quote-delimited fields containing newline characters.
Now, each CSV file has column headings, but the order of the columns varies from file to file. Some files also have missing columns, and some have extra columns.
I have my master list of column names and the order in which I would like them all to go. What is the best way to tackle this? LibreOffice gets the CSV parsing right (Excel certainly does not). Ultimately the files will all go into a single merged spreadsheet. Every row from each source file must be kept intact, apart from the column ordering.
The steps also need to be handed over to a non-technical third party eventually, so I am looking for an approach that does not present too many technical hurdles for a non-expert.
Okay, I'm approaching this problem a different way. I have instead gone back to the source application (WooCommerce) to fix the export, so the spreadsheets list all the same columns, in the same order, on every export. This does have other consequences that I need to follow up on, such as managing patches and trying to get the changes accepted by the source project. But it does avoid having to append CSV files with mismatched columns, which seems to be a common issue that no one has any real solution for (yes, I have searched, a lot).
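For anyone who cannot fix the source export, the append against a master column list can also be scripted in a few lines; a rough sketch with made-up file and column names:

```python
# Sketch: append rows from several CSV exports onto one master column list.
# Missing columns come out empty, extra columns are dropped, and every file's
# rows are written in the master order. File and column names are placeholders.
import csv

MASTER_COLUMNS = ["order_id", "customer", "email", "total", "currency"]

SOURCE_FILES = ["export_january.csv", "export_february.csv"]

with open("merged.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.DictWriter(
        out,
        fieldnames=MASTER_COLUMNS,
        restval="",              # fill columns a source file is missing
        extrasaction="ignore",   # silently drop columns not in the master list
        quoting=csv.QUOTE_ALL,   # keep quoted fields with embedded newlines intact
    )
    writer.writeheader()
    for path in SOURCE_FILES:
        with open(path, newline="", encoding="utf-8") as src:
            for row in csv.DictReader(src):
                writer.writerow(row)
```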

SSIS CSV File load to table

I have a problem loading a .CSV file, as the connection manager editor settings are beyond my knowledge.
When I load the .CSV file, up to 18 rows there is no problem: they load into the table.
However, from the 19th column the data is not partitioned correctly.
The row delimiter is {CR}{LF}.
The column delimiter is Comma {,}.
How can I partition the data correctly?
Any help?
Here are some ideas, though I have no details to go on.
What happens when you try to import the same .CSV file into Excel? Anything interesting around row 19?
Does there appear to be anything different about row 19?
If you delete row 19, what happens?
See, I bet you've thought of these things as well, and probably more, since you have the details. If you want anything more than superficial bad guesses, you'll have to provide a little detail.
I've found the CSV import to be a bit limited with regard to bad data. If you're having trouble with the 19th column, I would suggest figuring out why that column is failing. You can try telling the import task's error conditions to ignore errors with data truncation, etc., but that may not fix the issue.
I have often switched complicated or error-prone CSV imports to simply use an SSIS Script Task and then just write my own code to parse the CSV and handle bad data.
If it's not partitioning correctly, it might be something as trivial as one of your field values on row 19 containing a comma, thus throwing out the import by making that row seem to have more columns. If this is the case, I hope you can get a revised version of the CSV file - this time with a text qualifier set. If possible, use something like | rather than " as the qualifier so that it's less likely to appear in the field values.
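Outside of SSIS, you can see the effect of a text qualifier with a tiny check (the values are made up):

```python
# Quick illustration of the embedded-comma problem and the text-qualifier fix.
# Values are made up.
import csv
import io

# A row as it might look WITHOUT a text qualifier: the address comma adds a column.
bad = "19,John Smith,1, Main Street,London"
print(next(csv.reader(io.StringIO(bad))))
# ['19', 'John Smith', '1', ' Main Street', 'London']  -> 5 columns instead of 4

# The same row WITH a text qualifier around the troublesome field.
good = '19,John Smith,"1, Main Street",London'
print(next(csv.reader(io.StringIO(good), quotechar='"')))
# ['19', 'John Smith', '1, Main Street', 'London']     -> 4 columns, as intended
```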
Put the file in a text editor such as notepad++ or textpad and change the view to show control characters. You will probably find your culprit there.
Nothing unusual. When I paste it into Excel as one column and convert text to columns, there is no problem. But in the SSIS preview I can see that the field value where the problem starts has two square boxes followed by the data of the next row.
If anyone wants to see the file, let me know and I will e-mail it to you.