BigQuery auto-converting fields in data - JSON

Background
In BigQuery autodetect, I have the following JSON data being loaded to a BQ table:
{"a":"","b":"q"}
{"a":"","b":"q1"}
{"a":"1","b":"w2"}
Now, when this JSON is uploaded, BQ throws the error: cannot convert field "a" to integer.
Thoughts
I guess that, after reading the first two rows, BQ infers field "a" as a string, and then later when "a":"1" comes along, BQ tries to convert it to an integer (but why?).
So, to investigate more, I modified the JSON as follows:
"a":"f","b":"q"
"a":"v","b":"q1"
"a":"1","b":"w2"
Now,when i use this json,no errors,data is smoothly loaded to table.
I don't see as to why in this scenario,if BQ infers field "a" as string,how come it throws no error (why does it not try to convert "a":"1" to integer)?
Query
What I assume is that BQ infers a field to a particular type only when it sees data in the field ("a":"1" or "a":"f"), but what I don't get is why BQ tries to automatically convert "a":"1" to an integer when the field is of type string.
This auto-conversion could create issues.
Please let me know if my assumptions are correct and what could be done to avoid such errors, because the real-time data is not in my control; I can only control my code (which uses autodetect).

It is a bug with autodetect. We are working on a fix.
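Until the fix lands, one possible workaround is to skip autodetect and pass an explicit schema so that field "a" is always loaded as a STRING. A minimal sketch with the bq CLI (assuming a newline-delimited JSON file; mydataset.mytable and data.json are placeholder names):
bq load \
  --source_format=NEWLINE_DELIMITED_JSON \
  mydataset.mytable \
  ./data.json \
  a:STRING,b:STRING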

Related

Using Kafka Connect sink connectors to write to GCS and/or BigQuery

This question is a bit vague and I wouldn't be surprised if it gets closed for that reason, but here goes.
I have successfully been able to use this connector https://github.com/confluentinc/kafka-connect-bigquery to stream Kafka topics into a BigQuery instance. There are some columns in the data which end up looking like this:
[{ "key": "product", "value": "cell phone" }, { "key": "weight", "value": "heavy" }]
Let's call this column product_details; it automatically goes into a BQ table. I don't create the table or anything, it's created automatically, and this column is created as a RECORD type in BigQuery whose subfields are always KEY and VALUE, no matter what.
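For reference, the BigQuery JSON schema for a column of that shape might look something like this (a sketch; the field names and STRING types are assumptions based on the sample above):
[
  {
    "name": "product_details",
    "type": "RECORD",
    "mode": "REPEATED",
    "fields": [
      {"name": "key", "type": "STRING"},
      {"name": "value", "type": "STRING"}
    ]
  }
]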
Now, when I use the Aiven GCS connector https://github.com/aiven/gcs-connector-for-apache-kafka, I end up loading the data in as a JSON file in JSONL format into a GCS bucket. I then create an external table to query these files in the GCS bucket. The problem is that the column product_details above will look like this instead:
{ "product": "cell phone", "weight": "heavy" }
as you can see, the subfields in this case are NOT KEY and VALUE. Instead, the subfields are PRODUCT and WEIGHT, and this is fine as long as they are not null. If there are null values, then BigQuery throws an error saying Unsupported empty struct type for field product_details. This is because BigQuery does not currently support empty RECORD types.
Now, when I try to create an external table with a schema definition identical to the table created when using the WePay BigQuery sink connector (i.e., using KEY and VALUE as in the BigQuery sink connector), I get an error saying JSON parsing error in row starting at position 0: No such field <ANOTHER, COMPLETELY UNRELATED COLUMN HERE>. So the error relates to a totally different column, and that part I can't figure out.
What I would like to know is where in the WePay BigQuery connector it inserts data as KEY and VALUE as opposed to the names of the subfields. I would like to somehow modify the Aiven GCS connector to do the same, but after searching endlessly through the repositories, I cannot seem to figure it out.
I'm relatively new to Kafka Connect, so contributing to or modifying existing connectors is a very tall order. I would like to try to do that, but I really can't seem to figure out where in the WePay code it serializes the data as KEY and VALUE subfields instead of the actual names of the subfields. I would like to then apply that to the Aiven GCS connector.
On another note, I originally wanted to (maybe?) mitigate this whole issue by using CSV format in the Aiven connector, but it seems CSV files must have BASE64 format or "none". BigQuery cannot seem to decode either of these formats (BASE64 just inserts a single column of data as a large BASE64 string). "None" inserts characters into the CSV file which are not supported by BigQuery when using an external table.
Any help on where I can identify the code which uses KEY and VALUE instead of the actual subfield names would be really great.

SQL compilation error: JSON file format can produce one and only one column of type variant or object or array when copying from S3 to Snowflake

I have the following JSON stored in S3:
{"data":"this is a test for firehose"}
I have created the table test_firehose with a VARCHAR column data, and a file format called JSON with type JSON and the rest left at default values. I want to copy the content from S3 to Snowflake, and I have tried the following statement:
COPY INTO test_firehose
FROM 's3://s3_bucket/firehose/2020/12/30/09/tracking-1-2020-12-30-09-38-46'
FILE_FORMAT = 'JSON';
And I receive the error:
SQL compilation error: JSON file format can produce one and only one column of type
variant or object or array. Use CSV file format if you want to load more than one column.
How could I solve this? Thanks
If you want to keep your data as JSON (rather than just as text) then you need to load it into a column with a datatype of VARIANT, not VARCHAR.
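A minimal sketch of that approach (reusing the table name, S3 path and named file format from the question; the final SELECT just shows how to read the value back out as text):
CREATE OR REPLACE TABLE test_firehose (data VARIANT);

COPY INTO test_firehose
FROM 's3://s3_bucket/firehose/2020/12/30/09/tracking-1-2020-12-30-09-38-46'
FILE_FORMAT = (FORMAT_NAME = 'JSON');

-- Pull the "data" key out of the VARIANT column as a string
SELECT data:data::string FROM test_firehose;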

Need help creating schema for loading CSV into BigQuery

I am trying to load some CSV files into BigQuery from Google Cloud Storage and wrestling with schema generation. There is an auto-generate option but it is poorly documented. The problem is that if I choose to let BigQuery generate the schema, it does a decent job of guessing data types, but only sometimes does it recognize the first row of the data as a header row, and sometimes it does not (it treats the 1st row as data and generates column names like string_field_N). The first rows of my data are always header rows. Some of the tables have many columns (over 30), and I do not want to mess around with schema syntax because BigQuery always bombs with an uninformative error message when something (I have no idea what) is wrong with the schema.
So: How can I force it to recognize the first row as a header row? If that isn't possible, how do I get it to spit out the schema it generated in the proper syntax so that I can edit it (for appropriate column names) and use that as the schema on import?
I would recommend doing 2 things here:
Preprocess your file and store the final layout of the file sans the first row, i.e. the header row.
BQ load accepts an additional parameter in the form of a JSON schema file; use this to explicitly define the table schema and pass the file as a parameter (see the sketch below). This gives you the flexibility to alter the schema at any point in time, if required.
Allowing BQ to autodetect the schema is not advised.
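A sketch of that approach (the column names, bucket and table are placeholders, and the CSV is assumed to already have its header row stripped as per the first step):
myschema.json:
[
  {"name": "col_a", "type": "STRING"},
  {"name": "col_b", "type": "INTEGER"}
]
bq load \
  --source_format=CSV \
  mydataset.mytable \
  gs://my_bucket/data_without_header.csv \
  ./myschema.json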
Schema auto detection in BigQuery should be able to detect the first row of your CSV file as column names in most cases. One of the cases for which column name detection fails is when you have similar data types all over your CSV file. For instance, BigQuery schema auto detect would not be able to detect header names for the following file since every field is a String.
headerA, headerB
row1a, row1b
row2a, row2b
row3a, row3b
The "Header rows to skip" option in the UI would not help fixing this shortcoming of schema auto detection in BigQuery.
If you are following the GCP documentation for Loading CSV Data from Google Cloud Storage, you have the option to skip the first n rows:
(Optional) An integer indicating the number of header rows in the source data.
The option is called "Header rows to skip" in the web UI, but it's also available as a CLI flag (--skip_leading_rows) and as a BigQuery API property (skipLeadingRows).
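A sketch of the flag in use with the bq CLI (the bucket and table names are placeholders; the inline schema reuses the headerA/headerB columns from the sample above, since, as noted earlier, autodetect alone may still treat an all-string header row as data):
bq load \
  --source_format=CSV \
  --skip_leading_rows=1 \
  mydataset.mytable \
  gs://my_bucket/data.csv \
  headerA:STRING,headerB:STRING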
Yes, you can get the schema BigQuery generated (so you can edit it) using bq show:
bq show --schema --format=prettyjson project_id:dataset.table > myschema.json
Note that reloading with the edited schema will result in you creating a new BQ table altogether.
I have a way to build a schema for loading CSV into BigQuery: just edit the values in the offending columns. For example:
weight|total|summary
2|4|just string
2.3|89.5|just string
If you use the schema generated by BigQuery, the fields weight and total will be defined as INT64, but inserting the second row will then error or fail. So you just edit the first row like this:
weight|total|summary
'2'|'4'|just string
2.3|89.5|just string
You must set the fields weight and total as STRING, and if you want to aggregate you just cast the type back in BigQuery, as in the sketch below.
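A hedged sketch of casting back at query time (mydataset.mytable is a placeholder; SAFE_CAST returns NULL for any value that does not parse, such as the quoted '2' from the edited first row):
SELECT
  SAFE_CAST(weight AS FLOAT64) AS weight,
  SAFE_CAST(total AS FLOAT64) AS total,
  summary
FROM mydataset.mytable;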
cheers
If the 'column name' row and the data beneath it have the same data type all over the CSV file, then BigQuery misreads the column names as data and adds self-generated names for the columns. I couldn't find any technical way to solve this, so I took another approach.
If the data is not sensitive, add another column whose 'column name' is a string but whose values are all numbers, e.g. a column named 'Test' with every value set to 0, so the header row no longer looks like data. Upload the file to BigQuery and use this query to drop the extra column:
ALTER TABLE <table name> DROP COLUMN <Test>
Change <table name> and <Test> according to your table.

AWS Athena - Create table from an array of JSON objects

Can I get help creating a table in AWS Athena?
For a sample of the data:
[{"lts": 150}]
AWS Glue generates the schema as:
array (array<struct<lts:int>>)
When I try to preview the table created by AWS Glue, I get this error:
HIVE_BAD_DATA: Error parsing field value for field 0: org.openx.data.jsonserde.json.JSONObject cannot be cast to org.openx.data.jsonserde.json.JSONArray
The error message is clear, but I can't find the source of the problem!
Hive running under AWS Athena uses Hive-JSON-Serde to serialize/deserialize JSON. For some reason, it doesn't support just any standard JSON: it asks for one record per line, without an array. In their words:
The following example will work.
{ "key" : 10 }
{ "key" : 20 }
But this won't:
{
"key" : 20,
}
Nor this:
[{"key" : 20}]
You should create a JSON classifier to convert the array into a list of objects instead of a single array object. Use the JSON path $[*] in your classifier and then set up the crawler to use it:
Edit the crawler
Expand 'Description and classifiers'
Click 'Add' on the left pane to associate your classifier with the crawler
After that, remove the previously created table and re-run the crawler. It will create a table with the proper schema, but I think Athena will still complain when you try to query it. However, you can now read from that table using a Glue ETL job and process single record objects instead of array objects.
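The same classifier can also be created from the AWS CLI, roughly like this (a sketch; the classifier name is a placeholder):
aws glue create-classifier \
  --json-classifier '{"Name": "array-to-records", "JsonPath": "$[*]"}'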
This JSON - [{"lts": 150}] - would work like a charm with the query below:
SELECT n.lts FROM table_name
CROSS JOIN UNNEST(table_name.array) AS t (n)
The output would be 150.
But I have faced a challenge with JSON like [{"lts": 150},{"lts": 250},{"lts": 350}].
Even though there are 3 elements in the JSON, the query returns only the first element. This may be because of the limitation described by @artikas.
We can definitely change the JSON as below to make it work (a jq one-liner for doing that follows the example):
{"lts": 150}
{"lts": 250}
{"lts": 350}
Please post if anyone has a better solution to this.

An error occurred while attempting to perform a type cast

Hi, I am trying to load data from a CSV into SQL Server. The data types in the flat file source's external and output columns are (DT_STR,50). I am converting them to their respective data types in a Derived Column transformation and trying to send all the bad rows to an error file. But when I try to load, I get the following error:
Error: 0xC0049064 at Data Flow Task, Derived Column [668]: An error occurred while attempting to perform a type cast.
Error: 0xC0209029 at Data Flow Task, Derived Column [668]: SSIS Error Code DTS_E_INDUCEDTRANSFORMFAILUREONERROR.
The value on which it is failing is 0.234, which I am trying to convert from (DT_STR,50) to (DT_NUMERIC,7,5). I do not understand why this is failing. Please help.
Unfortunately, SSIS throws some pretty generic errors and there are probably dozens of ways you can encounter this one.
I ran into this when I was unaware that my flat file contained a footer with a different set of fields than the normal data rows.
I discovered this after I redirected my error rows to a Multicast and enabled the data viewer on my output which let me see what was failing.
In my case, I could see that I had a footer with a reliable value that I could detect with a Conditional Split to skip it. After that, my numeric cast in the derived column behaved correctly.
It's likely that at least one of the values in your (DT_STR,50) field cannot fit into (DT_NUMERIC,7,5) because it has more than 7 numeric characters. Enable a Data Viewer on the path feeding into the Data Conversion step and you will probably see what I mean (depending on buffer size, you will likely have to sort on the string field in question). If you don't see one that's too long, buffer through until you do, at which point the Data Conversion step will fail.
I had this same error thrown while trying to convert an INT that was too big for the NUMERIC length and precision I was casting to.