aws athena - Create table from an array of JSON objects

Can I get help creating a table on AWS Athena?
Here is a sample of the data:
[{"lts": 150}]
AWS Glue generates the schema as:
array (array<struct<lts:int>>)
When I try to preview the table created by AWS Glue, I get this error:
HIVE_BAD_DATA: Error parsing field value for field 0: org.openx.data.jsonserde.json.JSONObject cannot be cast to org.openx.data.jsonserde.json.JSONArray
The error message is clear, but I can't find the source of the problem!

Hive running under AWS Athena uses Hive-JSON-Serde to serialize/deserialize JSON. For some reason, it does not support just any standard JSON: it expects one record per line, without an array. In their words:
The following example will work.
{ "key" : 10 }
{ "key" : 20 }
But this won't:
{
"key" : 20,
}
Nor this:
[{"key" : 20}]
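For reference, once the file has been rewritten as one object per line, a table definition along these lines should work (just a sketch; the table name and S3 location are placeholders):
-- expects one JSON object per line in the files under LOCATION
CREATE EXTERNAL TABLE sample_data (
  lts int
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://your-bucket/your-prefix/';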

You should create a JSON classifier to convert the array into a list of objects instead of a single array object. Use the JSON path $[*] in your classifier, then set up the crawler to use it:
Edit crawler
Expand 'Description and classifiers'
Click 'Add' on the left pane to associate your classifier with the crawler
After that, remove the previously created table and re-run the crawler. It will create a table with the proper schema, but I think Athena will still complain when you try to query it. However, you can now read from that table with a Glue ETL job and process single record objects instead of array objects.

This JSON - [{"lts": 150}] - works like a charm with the query below:
select n.lts from table_name
cross join UNNEST(table_name.array) as t (n)
The output is a single row with lts = 150.
But I have faced a challenge with JSON like [{"lts": 150},{"lts": 250},{"lts": 350}].
Even though there are 3 elements in the JSON, the query returns only the first element. This may be because of the limitation listed by #artikas.
We can definitely change the JSON as below to make it work:
{"lts": 150}
{"lts": 250}
{"lts": 350}
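Once the records are newline-delimited like this, no UNNEST is needed; a plain projection should return all three rows (a sketch, assuming the table has been re-created with a plain lts column, e.g. with a DDL like the one shown earlier):
-- each line is now its own record, so each lts value becomes a row
select lts from table_name;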
Please post if anyone has a better solution to this.

Related

Using Kafka Connect sink connectors to write to GCS and/or BigQuery

This question is a bit vague and I wouldn't be surprised if it gets closed for that reason, but here goes.
I have successfully been able to use this connector https://github.com/confluentinc/kafka-connect-bigquery to stream Kafka topics into a BigQuery instance. There are some columns in the data which end up looking like this:
[{ "key": "product", "value": "cell phone" }, { "key": "weight", "value": "heavy" }]
Let's call this column product_details; it goes into a BQ table automatically. I don't create the table or anything: it's created automatically, and this column is created as a RECORD type in BigQuery whose subfields are always KEY and VALUE, no matter what.
Now, when I use the Aiven GCS connector https://github.com/aiven/gcs-connector-for-apache-kafka, I end up loading the data as a JSON file in JSONL format into a GCS bucket. I then create an external table to query these files in the GCS bucket. The problem is that the column product_details above will look like this instead:
{ "product": "cell phone", "weight": "heavy" }
As you can see, the subfields in this case are NOT KEY and VALUE. Instead, the subfields are PRODUCT and WEIGHT, and this is fine as long as they are not null. If there are null values, then BigQuery throws an error saying Unsupported empty struct type for field product_details. This is because BigQuery does not currently support empty RECORD types.
Now, when I try to create an external table with a schema definition identical to the table created by the Wepay BigQuery sink connector (i.e., using KEY and VALUE as in the BigQuery sink connector), I get an error saying JSON parsing error in row starting at position 0: No such field <ANOTHER, COMPLETELY UNRELATED COLUMN HERE>. So the error relates to a totally different column, and that is the part I can't figure out.
What I would like to know is where the Wepay BigQuery connector inserts data as KEY and VALUE, as opposed to the names of the subfields. I would like to somehow modify the Aiven GCS connector to do the same, but after searching endlessly through the repositories, I cannot seem to figure it out.
I'm relatively new to Kafka Connect, so contributing to or modifying existing connectors is a very tall order. I would like to try to do that, but I really can't seem to figure out where in the Wepay code it serializes the data as KEY and VALUE subfields instead of the actual names of the subfields. I would then like to apply that to the Aiven GCS connector.
On another note, I originally wanted to (maybe?) mitigate this whole issue by using CSV format in the Aiven connector, but it seems CSV files must use BASE64 format or "none". BigQuery cannot seem to decode either of these formats (BASE64 just inserts a single column of data as a large BASE64 string, and "none" inserts characters into the CSV file which are not supported by BigQuery when using an external table).
Any help on where I can identify the code which uses KEY and VALUE instead of the actual subfield names would be really great.

Custom Connector (convert JSON result into parameters)

I am a newbie in the Azure environment. I am getting a JSON response from my custom connector, but I need to convert that JSON into parameters so I can use them in further actions. Does anybody know how this is possible?
You want to use these parameters in further actions, but you don't mention what type the response is stored as.
1. Stored as a string. In this case the whole JSON is a string, so selecting a property is not supported. You need to parse it to JSON with the Parse JSON action, and then you will be able to select a property. For the Parse JSON schema, just click Use sample payload to generate schema and paste your JSON value; the schema will be generated. Then, to select your property, just use #{body('Parse_JSON')?['name']} and it will work.
2. If it's stored as an object, just use the expression variables('test1')['name'] to get it.

Insert JSON file into HBase using Hive

I have a simple JSON file that I would like to insert into an HBase table.
My JSON file has the following format:
{
"word1":{
"doc_01":4,
"doc_02":7
},
"word2":{
"doc_06":1,
"doc_02":3,
"doc_12":8
}
}
The HBase table is called inverted_index and it has one column family, matches.
I would like to use the keys word1, word2, etc. as row keys and have their values inserted into the column family matches.
I know that Hive supports JSON parsing and I've already tried it, but only when I know the keys in the JSON beforehand so I can access the records.
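For instance, when the keys are hard-coded, a query along these lines works (just a sketch; raw_json and json_string are hypothetical names for a staging table holding each JSON document as a single string):
-- only works because word1 and doc_01 are known in advance
SELECT get_json_object(json_string, '$.word1.doc_01') AS doc_01_count
FROM raw_json;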
My problem is that I don't know which or how many words my JSON file contains, or how many matches each word will have (it can't be empty, though).
My question: is this even doable using Hive only? If so, kindly provide some pointers on which Hive queries/functions to use to achieve that.
If not, is there any alternative to tackle this? Thanks in advance.

Bigquery autoconverting fields in data

Background
Using BigQuery autodetect, I have the following JSON data being loaded into a BQ table.
"a":"","b":"q"
"a":"","b":"q1"
"a":"1","b":"w2"
Now, when this JSON is uploaded, BQ throws the error: cannot convert field "a" to integer.
Thoughts
I guess that after reading the first two rows, BQ infers field "a" as a string, and then later, when "a":"1" comes, BQ tries to convert it to an integer (but why?).
So, to investigate more, I modified the JSON as follows.
"a":"f","b":"q"
"a":"v","b":"q1"
"a":"1","b":"w2"
Now, when I use this JSON, there are no errors and the data is loaded smoothly into the table.
I don't see why, in this scenario, if BQ infers field "a" as a string, it throws no error (why does it not try to convert "a":"1" to an integer)?
Query
What I assume is that BQ infers a field as a particular type only when it sees data in the field ("a":"1" or "a":"f"), but what I don't get is why BQ tries to automatically convert "a":"1" to an integer when the field is of type string.
This autoconversion could create issues.
Please let me know if my assumptions are correct and what could be done to avoid such errors, because the real-time data is not under my control; I can only control my code (which uses autodetect).
It is a bug with autodetect. We are working on a fix.

Convert file of JSON objects to Parquet file

Motivation: I want to load the data into Apache Drill. I understand that Drill can handle JSON input, but I want to see how it performs on Parquet data.
Is there any way to do this without first loading the data into Hive, etc., and then using one of the Parquet connectors to generate an output file?
Kite has support for importing JSON to both Avro and Parquet formats via its command-line utility, kite-dataset.
First, you would infer the schema of your JSON:
kite-dataset json-schema sample-file.json -o schema.avsc
Then you can use that file to create a Parquet Hive table:
kite-dataset create mytable --schema schema.avsc --format parquet
And finally, you can load your JSON into the dataset.
kite-dataset json-import sample-file.json mytable
You can also import an entire directory stored in HDFS. In that case, Kite will use an MR job to do the import.
You can actually use Drill itself to create a parquet file from the output of any query.
create table student_parquet as select * from `student.json`;
The above line should be good enough. Drill interprets the types based on the data in the fields. You can substitute your own query and create a parquet file.
To complete the answer of #rahul: you can use Drill to do this, but I needed to add more to the query to get it working out of the box with Drill.
create table dfs.tmp.`filename.parquet` as select * from dfs.`/tmp/filename.json` t
I needed to give it the storage plugin (dfs). The "root" config can read from the whole disk but is not writable, while the tmp config (dfs.tmp) is writable and writes to /tmp, so I wrote there.
But the problem is that if the JSON is nested or perhaps contains unusual characters, I would get a cryptic
org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR: java.lang.IndexOutOfBoundsException:
If I have a structure that looks like members: {id:123, name:"joe"}, I would have to change the select to
select members.id as members_id, members.name as members_name
or
select members.id as `members.id`, members.name as `members.name`
to get it to work.
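Putting it together, the working form looked roughly like this (a sketch; the paths and the members fields are just the examples above):
-- write a Parquet file to the writable dfs.tmp workspace, flattening the nested struct
CREATE TABLE dfs.tmp.`filename.parquet` AS
SELECT t.members.id AS members_id,
       t.members.name AS members_name
FROM dfs.`/tmp/filename.json` t;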
I assume the reason is that Parquet is a column store, so you need columns; JSON isn't columnar by default, so you need to convert it.
The problem is that I have to know my JSON schema and build the select to include all the possibilities. I'd be happy if someone knows a better way to do this.