JSON SerDe for Hive that supports JSON arrays - json

I have tried the JSON SerDe that Amazon provides for EMR instances, and it works great if you need to address/map JSON dictionary fields to columns. However, I wasn't able to figure out how to do the same with JSON arrays. For example, if there is a JSON array as follows:
[23123.32, "Text Text", { "key1": "value1" } ]
Is there a way to map the first element of the array to a column in a Hive table? What about the embedded dictionary fields?

I was struggling with the same problem until I found this SerDe on GitHub:
https://github.com/rcongiu/Hive-JSON-Serde
Just include it using the 'add jar' command once you start Hive, and it works like a charm.
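For reference, a minimal sketch of how this SerDe is usually wired up on EMR; the jar path, table name, columns, and S3 location below are placeholders, and the layout assumes the array lives inside a JSON object rather than at the top level as in the question:

ADD JAR /path/to/json-serde-with-dependencies.jar;   -- placeholder path to the downloaded jar

CREATE EXTERNAL TABLE json_events (                  -- hypothetical table
  measurements ARRAY<DOUBLE>,                        -- a JSON array maps to a Hive array
  attrs STRUCT<key1:STRING>                          -- an embedded dictionary maps to a struct
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://your-bucket/json-events/';            -- placeholder location

-- index into the array and reach into the nested dictionary
SELECT measurements[0], attrs.key1 FROM json_events;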

Related

Greenplum PXF external table returns an empty string when fetching an element from a JSON array of objects

While accessing JSON data by creating an external table with the PXF JSON plugin (following the multi-line JSON table example), the following column definition
"coordinates.values[0]" INTEGER,
easily fetches 8 from the JSON below:
"coordinates":{
"type":"Point",
"values":[
8,
52
]
}
But if we change the JSON to something like this:
"coordinates": {
"type": "geoloc",
"values":[
{
"latitude" : 72,
"longtitue" : 80
}
]
}
and change the column definition like this:
"coordinates.values[0].latitude" INTEGER,
it fetches an empty string.
Unfortunately the JSON profile in PXF does not support accessing JSON objects inside arrays. However, Greenplum has very good support for JSON and you can achieve the same result by doing the following:
CREATE EXTERNAL TABLE pxf_read_json (j1 json)
LOCATION ('pxf://tmp/file.json?PROFILE=hdfs:text:multi&FILE_AS_ROW=true')
FORMAT 'CSV';
The pxf_read_json table will access the JSON files on the external system. Each file is read as a multi-line text file, and each file represents a single table row in Greenplum. You can then query the external data as follows:
SELECT values->>'latitude' as latitude, values->>'longtitue' as longitude
FROM pxf_read_json
JOIN LATERAL json_array_elements(j1->'coordinates'->'values') values
ON true;
With this approach, you can still take advantage of PXF's support for accessing external systems while leveraging Greenplum's powerful JSON support.
Additional information about reading a multi-line text file into a single table row can be found here, and information about Greenplum's support for JSON can be found here.

Hive json serde selection

I am confused about choosing between the two JSON SerDes given in the link below
(OpenX and HCatalog).
https://docs.aws.amazon.com/athena/latest/ug/json.html
My JSON is not nested; it is simple JSON: a file with one JSON record per line.
Please let me know which would be apt in my case.
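For what it's worth, with flat newline-delimited records like that, either SerDe listed on that page should do; a minimal Athena sketch with made-up table and column names (the HCatalog variant would use 'org.apache.hive.hcatalog.data.JsonSerDe' instead):

CREATE EXTERNAL TABLE simple_json (                  -- hypothetical table and columns
  id STRING,
  value DOUBLE
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://your-bucket/simple-json/';            -- placeholder location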

How to define nested array to ingest data and convert?

I am using Firehose and Glue to ingest data and convert JSON to Parquet files in S3.
I was able to achieve this with flat JSON (not nested, no arrays), but I am failing with a nested JSON array. What I have done:
The JSON structure:
{
  "class_id": "test0001",
  "students": [{
    "student_id": "xxxx",
    "student_name": "AAAABBBCCC",
    "student_gpa": 123
  }]
}
The Glue schema:
class_id : string
students : array ARRAY<STRUCT<student_id:STRING,student_name:STRING,student_gpa:INT>>
I receive the following error:
The schema is invalid. Error parsing the schema: Error: type expected at the position 0 of 'ARRAY<STRUCT<student_id:STRING,student_name:STRING,student_gpa:INT>>' but 'ARRAY' is found.
Any suggestion is appreciated.
I ran into this because I created the schemas manually in the AWS console. The problem is that the console shows help text next to the form for entering nested data which capitalizes everything, but Parquet can only work with lowercase definitions.
Despite the example given by AWS, write:
array<struct<student_id:string,student_name:string,student_gpa:int>>
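Applied to the schema in the question, the column definitions would then read:

class_id : string
students : array<struct<student_id:string,student_name:string,student_gpa:int>>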

Multi-line JSON file querying in hive

I understand that the majority of JSON SerDe formats expect .json files to be stored with one record per line.
I have an S3 bucket with multi-line indented .json files (I don't control the source) that I'd like to query using Amazon Athena (though I suppose this applies just as well to Hive generally).
Is there a SerDe format out there that is able to parse multi-line indented .json files?
If there isn't a SerDe format to do this:
Is there a best practice for dealing with files like this?
Should I plan on flattening these records out using a different tool like python?
Is there a standard way of writing custom SerDe formats, so I can write one myself?
Example file body:
[
  {
    "id": 1,
    "name": "ryan",
    "stuff": {
      "x": true,
      "y": [
        123,
        456
      ]
    }
  },
  ...
]
There is unfortunately no SerDe that supports multi-line JSON content. There is the specialized CloudTrail SerDe, which handles a format similar to yours, but it is hard-coded for the CloudTrail JSON format only; still, it shows that this is at least theoretically possible. Currently there is no way to write your own SerDes to use with Athena, though.
You won't be able to consume these files with Athena; you will have to use EMR, Glue, or some other tool to reformat them into JSON stream files first.
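Once the files have been rewritten as one JSON object per line (a small Glue or EMR job can do the reformatting), a regular JSON SerDe table can read them. A rough sketch for Athena, assuming the record shape from the example above; the table name, S3 location, and choice of the OpenX SerDe are assumptions:

CREATE EXTERNAL TABLE items (                        -- hypothetical table
  id INT,
  name STRING,
  stuff STRUCT<x:BOOLEAN,y:ARRAY<INT>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://your-bucket/items-ndjson/';           -- placeholder: the reformatted files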

How to convert a nested JSON file into CSV in Scala

I want to convert my nested JSON into CSV. I used:
df.write.format("com.databricks.spark.csv").option("header", "true").save("mydata.csv")
But it works for normal JSON, not for nested JSON. Is there any way I can convert my nested JSON to CSV? Help will be appreciated, thanks!
When you ask Spark to convert a JSON structure to a CSV, Spark can only map the first level of the JSON.
This happens because of the simplicity of the CSV format: it just assigns a value to a name. That is why {"name1":"value1", "name2":"value2"...} can be represented as a CSV with this structure:
name1,name2, ...
value1,value2,...
In your case, you are converting JSON with several levels, so the Spark exception is saying that it cannot figure out how to convert such a complex structure into a CSV.
If your JSON has only a second level, it will work, but be careful: the names of the second level are dropped and only the values are included in an array.
You can have a look at this link to see the example for JSON datasets.
As I have no information about the nature of the data, I can't say much more about it. But if you need to write the information as a CSV, you will need to simplify the structure of your data.
Read the JSON file in Spark and create a DataFrame:
val path = "examples/src/main/resources/people.json"
val people = sqlContext.read.json(path)
Save the DataFrame using spark-csv:
people.write
.format("com.databricks.spark.csv")
.option("header", "true")
.save("newcars.csv")
Source :
read json
save to csv
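As a follow-up to the note above about simplifying the structure: one way to flatten nested fields before saving to CSV is to select them out as top-level columns with Spark SQL, after registering the DataFrame as a temporary view. This is only a sketch; the view name and fields below are made up for illustration:

-- assumes the nested JSON was read into a DataFrame and registered as the view `people_nested`,
-- with records shaped like {"name": "...", "address": {"city": "...", "zip": "..."}}
SELECT
  name,
  address.city AS address_city,   -- pull nested struct fields up into plain columns
  address.zip  AS address_zip
FROM people_nested;

The resulting flat result can then be written out with the same spark-csv call shown above.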