Loading JSON data into Hive tables

I've tried loading simple JSON records from a file into Hive tables, as shown below. Each JSON record is on a separate line.
{"Industry":"Manufacturing","Phone":null,"Id":null,"type":"Account","Name":"Manufacturing"}
{"Industry":null,"Phone":"(738) 244-5566","Id":null,"type":"Account","Name":"Sales"}
{"Industry":"Government","Phone":null,"Id":null,"type":"Account","Name":"Kansas City Brewery & Co"}
But I couldn't find any SerDe that loads an array of comma-separated JSON records into a Hive table. The input is a file containing the JSON records as a single array, as shown below:
[{"Industry":"Manufacturing","Phone":null,"Id":null,"type":"Account","Name":"Manufacturing"},{"Industry":null,"Phone":"(738) 244-5566","Id":null,"type":"Account","Name":"Sales"},{"Industry":"Government","Phone":null,"Id":null,"type":"Account","Name":"Kansas City Brewery & Co"}]
Can someone suggest a SerDe that can parse this JSON file?
Thanks

You can check this SerDe: https://github.com/rcongiu/Hive-JSON-Serde
Another related post: Parse json arrays using HIVE
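For reference, a minimal sketch of a table definition over the line-per-record data using that SerDe; the jar name, table name, and HDFS path are illustrative, and the single-line array form would first need to be split so each record sits on its own line:

ADD JAR /path/to/json-serde-with-dependencies.jar;

CREATE EXTERNAL TABLE accounts (
  Industry STRING,
  Phone    STRING,
  Id       STRING,
  type     STRING,
  Name     STRING
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/user/hive/warehouse/accounts';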

Related

HIVE JSON INPUT - should the JSON be in one line

I am trying to load JSON data into a Hive table.
I was wondering whether the JSON has to be on one single line only. I have tested it with the data formatted like this:
"associatedDrug": {"name":"asprin", "dose":"","strength":"500 mg"}
"associatedDrug": {"name":"asprin2", "dose":"","strength2":"500 mg"}
or whether it can be provided pretty-printed like this:
"associatedDrug": {
"name":"asprin",
"dose":"",
"strength":"500 mg"
}
And if it is pretty-printed, is there a SERDEPROPERTIES entry I can include so that the SerDe knows where the record's end of line is?

Loading Json data to Redshift using Spark results in null values

I am using the spark-redshift library, and I want to load the data in either array or JSON format, as shared in this link: querying-json-data-in-amazon-redshift.
In a DataFrame, I am reading data from MongoDB; I convert the DataFrame to JSON and then push it to Redshift, and it loads without any errors or exceptions. But in Redshift, the column values show up as NULL.
I am using the spark-mongodb connector, and I want to know how I can store the MongoDB data in array or JSON format. Is it possible to pull MongoDB data and put it into Redshift in either array or JSON format?

How to parse Nested Json messages from Kafka topic to hive

I'm pretty new to Spark Streaming and Scala. I have JSON data and some other random log data coming in from a Kafka topic. I was able to filter out just the JSON data like this:
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)
  .map(_._2)
  .filter(x => x.matches("^[{].*"))
My json data looks like this.
{"time":"125573","randomcol":"abchdre","address":{"city":"somecity","zip":"123456"}}
I'm trying to parse the JSON data and put it into a Hive table.
Can someone please point me in the right direction?
Thanks
There are multiple ways to do this.
Create an external Hive table with the required columns, pointing at this data location.
When you create the table, you could keep the raw JSON as a string column and use the get_json_object Hive function to load this raw data into a final table, as sketched below. Refer to this for the function details.
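A minimal sketch of that first option; the table names and HDFS path are illustrative, and it assumes the filtered messages land in HDFS as one JSON document per line:

-- Keep the raw message as a single string column:
CREATE EXTERNAL TABLE raw_messages (json STRING)
LOCATION '/data/kafka/raw';

-- Parse the nested fields with get_json_object into a final table:
CREATE TABLE final_messages AS
SELECT
  get_json_object(json, '$.time')         AS event_time,
  get_json_object(json, '$.randomcol')    AS randomcol,
  get_json_object(json, '$.address.city') AS city,
  get_json_object(json, '$.address.zip')  AS zip
FROM raw_messages;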
OR
Alternatively, you could try the Avro SerDe and provide an Avro schema matching your JSON message when you create the Hive table. Refer to this for an Avro SerDe example.
Hope it helps.

How to store and query a flat file containing a JSON string as part of each line in a Hive table?

I have a plain text file in HDFS like this:
44,UK,{"name":{"name1":"John","name2":"marry","name3":"michel"},"fruits":{"fruit1":"apple","fruit2":"orange"}},31-07-2016
91,INDIA,{"name":{"name1":"Ram","name2":"Sam"},"fruits":{}},31-07-2016
and want to store this in a Hive table with a schema like
create table data (SerNo int, country string, detail string, date string)
What should the table definition be so that {"name": ..... } comes out as one column, detail, and the rest go to the other columns?
What should the column separator be, so that I can query the detail column with the get_json_object UDF along with the other columns?
Thank you.
Hive works well with JSON-format data as long as the JSON is not nested too many levels deep; in that case it is better to flatten your JSON.
Refer to https://pkghosh.wordpress.com/2012/05/06/hive-plays-well-with-json/
There you can find an explained answer to your question.
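One way to get the layout you describe is Hive's built-in RegexSerDe, which splits each line on regex capture groups instead of a single-character delimiter, so the commas inside the JSON object are preserved. A minimal sketch, assuming the four-field layout shown above (note that RegexSerDe requires every column to be declared as string, so SerNo is cast when queried):

CREATE TABLE data (SerNo STRING, country STRING, detail STRING, `date` STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  -- group 1: serial number, group 2: country, group 3: the JSON object
  -- (greedy, so its internal commas are kept), group 4: the date
  "input.regex" = "^(\\d+),([^,]*),(\\{.*\\}),([^,]*)$"
);

SELECT CAST(SerNo AS INT), country, get_json_object(detail, '$.name.name1') AS name1
FROM data;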

How to serialise a spark sql row when it contains a comma

I am using Spark Jobserver https://github.com/spark-jobserver/spark-jobserver and Apache Spark for some analytic processing.
I am receiving the following structure back from the job server when a job finishes:
"status": "OK",
"result": [
"[17799.91015625,null,hello there how areyou?]",
"[50000.0,null,Hi, im fine]",
"[0.0,null,All good]"
]
The result doesn't contain valid JSON, as explained here:
https://github.com/spark-jobserver/spark-jobserver/issues/176
So I'm trying to convert the returned structure into a JSON structure; however, I can't simply insert quotes into the result string based on the comma delimiter, as the result sometimes contains a comma itself.
How can I convert a Spark SQL row into a JSON object in the above situation?
I actually found a better way in the end.
From 1.3.0 onwards you can use .toJSON on a DataFrame to convert it to JSON:
df.toJSON.collect()
To output a DataFrame's schema as JSON you can use:
df.schema.json
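For illustration, a small sketch of how toJSON handles the embedded comma; the SparkSession named spark and the column names are made up for the example:

import spark.implicits._

// Rows mirroring the question, including the embedded comma:
val df = Seq(
  (17799.91015625, Option.empty[String], "hello there how areyou?"),
  (50000.0,        Option.empty[String], "Hi, im fine"),
  (0.0,            Option.empty[String], "All good")
).toDF("amount", "flag", "message")

// Each row becomes a self-contained JSON string, so the comma inside
// "Hi, im fine" stays inside a quoted field instead of breaking the structure:
df.toJSON.collect().foreach(println)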