How to define a nested array to ingest data and convert? - json

I am using Firehose and Glue to ingest data and convert JSON to Parquet files in S3.
I was able to achieve this with flat JSON (not nested, no arrays), but it fails with a nested JSON array. What I have done:
The JSON structure:
{
  "class_id": "test0001",
  "students": [{
    "student_id": "xxxx",
    "student_name": "AAAABBBCCC",
    "student_gpa": 123
  }]
}
The Glue schema:
class_id : string
students : array ARRAY<STRUCT<student_id:STRING,student_name:STRING,student_gpa:INT>>
I receive the error:
The schema is invalid. Error parsing the schema: Error: type expected at the position 0 of 'ARRAY<STRUCT<student_id:STRING,student_name:STRING,student_gpa:INT>>' but 'ARRAY' is found.
Any suggestion is appreciated.

I ran into this because I created the schemas manually in the AWS console. The problem is that the console shows help text next to the form for entering your nested data which capitalizes everything, but the Parquet conversion only works with lowercase type definitions.
Despite the example given by AWS, write:
array<struct<student_id:string,student_name:string,student_gpa:int>>
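If you prefer to define the schema programmatically rather than in the console, the same lowercase type string can be passed through the Glue API. A minimal boto3 sketch is below; the database, table, bucket and region names are placeholders and not from the original post, the key point is the lowercase array<struct<...>> column type.

import boto3

# Placeholder region; the column types mirror the lowercase definition above.
glue = boto3.client("glue", region_name="us-east-1")

glue.create_table(
    DatabaseName="my_database",  # placeholder name
    TableInput={
        "Name": "classes",       # placeholder name
        "StorageDescriptor": {
            "Columns": [
                {"Name": "class_id", "Type": "string"},
                {
                    "Name": "students",
                    "Type": "array<struct<student_id:string,student_name:string,student_gpa:int>>",
                },
            ],
            "Location": "s3://my-bucket/classes/",  # placeholder path
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    },
)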

Related

Efficient way to parse a file with different json schemas in spark

I am trying to find the best way to parse a JSON file with an inconsistent schema (but the schema for each type is known and consistent) in Spark, in order to split it by "type" and store it in Parquet:
{"type":1, "data":{"data_of_type1" : 1}}
{"type":2, "data":{"data_of_type2" : "can be any type"}}
{"type":3, "data":{"data_of_type3" : "value1", "anotherone": 1}}
I also want to reduce the IO because I am dealing with huge volumes, so I don't want to do a first split (by type) and then process each type independently...
Current idea (not working):
Load the JSON and parse only the type ("data" is loaded as a string)
Attach to each row the corresponding schema (a DDL string in a new column)
Try to parse "data" with the DDL from the previous column (using from_json)
=> This throws the error: Schema should be specified in DDL format as a string literal or output of the schema_of_json/schema_of_csv functions instead of schema
Do you have any idea if this is possible?
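For reference, a minimal PySpark sketch of one common workaround (my own assumption, not taken from the post): read the file once as text, cache it, and branch per known type so from_json always receives a literal schema. The paths and DDL schema strings below are placeholders.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, get_json_object

spark = SparkSession.builder.getOrCreate()

# One DDL schema per known type; these strings are assumptions based on the sample rows.
schemas = {
    1: "data_of_type1 INT",
    2: "data_of_type2 STRING",
    3: "data_of_type3 STRING, anotherone INT",
}

# Read once as plain text so "data" stays a raw JSON string, then cache
# so the source is not re-read for every type branch.
lines = spark.read.text("s3://my-bucket/input.json")  # placeholder path
parsed = lines.select(
    get_json_object(col("value"), "$.type").cast("int").alias("type"),
    get_json_object(col("value"), "$.data").alias("data"),
).cache()

# from_json only accepts a literal schema, so branch per type instead of per row.
for t, ddl in schemas.items():
    (parsed.filter(col("type") == t)
           .select(from_json(col("data"), ddl).alias("data"))
           .write.mode("append")
           .parquet(f"s3://my-bucket/output/type={t}/"))  # placeholder layout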

Transform Avro file with Wrangler into JSON in cloud Datafusion

I am trying to read an Avro file, apply a basic transformation (remove records with name = Ben) using Wrangler, and write the result as a JSON file to Google Cloud Storage.
The Avro file has the following schema:
{
  "type": "record",
  "name": "etlSchemaBody",
  "fields": [
    {
      "type": "string",
      "name": "name"
    }
  ]
}
The transformation in Wrangler is the following:
(transformation screenshot)
The following is the output schema for the JSON file:
(output schema screenshot)
When I run the pipeline it runs successfully and the JSON file is created in Cloud Storage, but the JSON output is empty.
When trying a preview run I get the following message:
(warning message screenshot)
Why is the JSON output file in gcloud storage empty?
When using the Wrangler to make transformations, the default values for the GCS source are format: text and body: string (data type); however, to work properly with an Avro file in the Wrangler you need to change that: set the format to blob and the body data type to bytes, as follows:
After that, the preview for your pipeline should produce output records. You can see my working example next:
Sample data (screenshot)
Transformations (screenshot)
Input records preview for the GCS sink, i.e. the final output (screenshot)
Edit:
You need to set format: blob and the output schema to body: bytes if you want to parse the file as Avro within the Wrangler, as described above, because it needs the content of the file in binary format.
On the other hand, if you only want to apply filters within the Wrangler, you could do the following:
Open the file using format: avro, see img.
Set the output schema according to the fields that your Avro file has, in this case name with string data type, see img.
Use only filters on the Wrangler (no parsing to Avro here), see img.
And this way you can also get the desired result.
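If it helps to sanity-check the transformation outside Data Fusion, here is a small local sketch using the fastavro library (my own example, not part of the pipeline) that drops records with name = Ben and writes the rest as JSON lines; the file names are placeholders.

import json
from fastavro import reader

# Placeholder file names; the Avro schema is the single string field "name" shown above.
with open("input.avro", "rb") as src, open("output.json", "w") as dst:
    for record in reader(src):
        if record.get("name") != "Ben":  # remove records with name = Ben
            dst.write(json.dumps(record) + "\n")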

In a Greenplum PXF external table, getting an empty string while fetching an element from a JSON array of objects

While accessing JSON data through an external table created with the PXF JSON plugin (the multiline JSON table example), the following column definition
"coordinates.values[0]" INTEGER,
easily fetches 8 from the JSON below:
"coordinates": {
  "type": "Point",
  "values": [
    8,
    52
  ]
}
But if we change the JSON to something like this:
"coordinates": {
  "type": "geoloc",
  "values": [
    {
      "latitude": 72,
      "longtitue": 80
    }
  ]
}
and change the column definition like this:
"coordinates.values[0].latitude" INTEGER,
it fetches an empty string.
Unfortunately the JSON profile in PXF does not support accessing JSON objects inside arrays. However, Greenplum has very good support for JSON and you can achieve the same result by doing the following:
CREATE EXTERNAL TABLE pxf_read_json (j1 json)
LOCATION ('pxf://tmp/file.json?PROFILE=hdfs:text:multi&FILE_AS_ROW=true')
FORMAT 'CSV';
The pxf_read_json table will access JSON files on the external system. Each file is read as a multi-line text file, and each file represents a single table row in Greenplum. You can then query the external data as follows:
SELECT v->>'latitude' AS latitude, v->>'longtitue' AS longitude
FROM pxf_read_json
JOIN LATERAL json_array_elements(j1->'coordinates'->'values') AS v
ON true;
With this approach, you can still take advantage of PXF's support for accessing external systems while also leveraging the powerful JSON support in Greenplum.
Additional information about reading a multi-line text file into a single table row can be found here, and information about Greenplum's JSON support can be found here.

Clickhouse/Kafka: reading a JSON Object type into a field

I have this kind of data in a Kafka Topic:
{..., fields: { "a": "aval", "b": "bval" } }
If I create a Kafka Engine table, I get an error when using a field definition like this:
fields String
because it (correctly) doesn't recognize it as a String:
2018.07.09 17:09:54.362061 [ 27 ] <Error> void DB::StorageKafka::streamThread(): Code: 26, e.displayText() = DB::Exception: Cannot parse JSON string: expected opening quote: (while read the value of key fields): (at row 1)
As ClickHouse does not currently have a Map or JSONObject type, what would be the best way to work over it, provided I don't know in advance the name of the inner fields ("a" or "b" in the example - so I cannot see Nested structures helping)?
Apparently, at the moment ClickHouse does not support complex JSON parsing.
From this answer on the ClickHouse GitHub:
ClickHouse uses a quick and dirty JSON parser which does not know how to read complex deep structures, so it can't skip that field, as it does not know where that nested structure ends.
Sorry. :/
So you should preprocess your JSON with some external tool, or you can contribute to ClickHouse and improve the JSON parser.
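To illustrate the preprocessing idea, here is a minimal Python sketch (my own example, not from the linked answer) that re-serializes the nested fields object into a plain JSON string before the message reaches ClickHouse, so the column can be declared as fields String:

import json

def flatten_fields(raw_line: str) -> str:
    # Re-encode the nested "fields" object as a JSON string so ClickHouse
    # can ingest it into a plain String column.
    record = json.loads(raw_line)
    if isinstance(record.get("fields"), dict):
        record["fields"] = json.dumps(record["fields"])
    return json.dumps(record)

# Example: the nested object becomes an escaped string value.
print(flatten_fields('{"id": 1, "fields": {"a": "aval", "b": "bval"}}'))
# -> {"id": 1, "fields": "{\"a\": \"aval\", \"b\": \"bval\"}"}

In practice this step would run in whatever produces or relays the Kafka messages, before the Kafka Engine table reads them.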

JSON SerDe for Hive that supports JSON arrays

I have tried the JSON SerDe that Amazon provides for EMR instances and it works great if you need to address/map JSON dictionary fields to columns. However, I haven't been able to figure out how to do the same with JSON arrays. For example, if there is a JSON array as follows:
[23123.32, "Text Text", { "key1": "value1" } ]
Is there a way to map the first element of an array to a column in a Hive table? What about the embedded dictionary fields?
I was struggling with the same problem till I found this SerDe on GitHub:
https://github.com/rcongiu/Hive-JSON-Serde
Just include it using the 'add jar' command once you start Hive and it works like a charm.