I am trying to load JSON data into a Hive table.
I was wondering: does the JSON have to be on a single line per record? I have tested it with the data formatted like this:
"associatedDrug": {"name":"asprin", "dose":"","strength":"500 mg"}
"associatedDrug": {"name":"asprin2", "dose":"","strength2":"500 mg"}
or can it be provided pretty-printed, like this:
"associatedDrug": {
"name":"asprin",
"dose":"",
"strength":"500 mg"
}
And if it is pretty-printed, is there a SERDEPROPERTIES setting I can include so that the SerDe knows where the end of each record is?
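For context on why the one-line form matters: Hive's default TextInputFormat splits records on newlines before the SerDe ever sees them, so text-based JSON SerDes expect one complete document per line, and there is no SERDEPROPERTIES record terminator that changes that. A minimal sketch using the built-in HCatalog JsonSerDe (the table name and columns are assumptions based on the sample data):

CREATE TABLE drugs (
  associatedDrug STRUCT<name:STRING, dose:STRING, strength:STRING>
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;

Pretty-printed, multi-line JSON would need to be compacted to one record per line (or read through a custom InputFormat) before a table like this can parse it.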
I am trying to store my PySpark output as CSV, but when I save it, the output does not look the same. I have the output in this form:
When I try to convert this to CSV, the Concat tasks column does not show up properly, due to the size of the data. Given my requirements, it is necessary for me to store the data in CSV format. Is there a way around this? (P.S. I also see columns showing nonsensical values, even though the PySpark output shows the correct values.)
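The output screenshot did not carry over here, but a common cause of columns not surviving the trip to CSV is large values containing embedded commas, quotes, or newlines that are not quoted on write. A sketch with explicit quoting and escaping (the DataFrame df and the output path are placeholders):

# Write with every field quoted so embedded commas and newlines survive
(df.write
    .mode("overwrite")
    .option("header", True)
    .option("quoteAll", True)  # quote all fields, not just the ones that need it
    .option("escape", '"')     # escape embedded double quotes CSV-style
    .csv("output_dir"))

Note that spreadsheet viewers may still truncate very long cells for display even when the underlying CSV file is intact.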
I am having an issue loading some JSON data.
The data looks like this:
{"geometry":{"coordinates":[12.5263,55.7664],"type":"Point"},"properties":{"created":"2021-01-19T17:08:14.114216Z","observed":"2020-01-01T23:50:00Z","parameterId":"pressure_at_sea","stationId":"06181","value":1025.1},"type":"Feature","id":"00e3bc2b-9a55-03dc-3740-005fd752f840"}
{"geometry":{"coordinates":[10.6217,55.8315],"type":"Point"},"properties":{"created":"2021-01-19T23:26:37.906088Z","observed":"2020-01-01T23:50:00Z","parameterId":"radia_glob","stationId":"06132","value":1},"type":"Feature","id":"00f7c039-6096-e2c2-5063-594c1c3bc16e"}
{"geometry":{"coordinates":[-37.6367,65.6111],"type":"Point"},"properties":{"created":"2021-01-19T23:26:37.913180Z","observed":"2020-01-01T23:50:00Z","parameterId":"radia_glob","stationId":"04360","value":0},"type":"Feature","id":"0142e2d1-9c28-e884-8d88-4b8fac5cae5d"}
The challenge is that when I query the JSON data like this:
select $1, metadata$filename from @my_bucket/2020/2020-01.json.gz limit 3;
It only returns a part of the JSON:
$1 METADATA$FILENAME
{"geometry":{"coordinates":[12.5263 2020/2020-01.json.gz
{"geometry":{"coordinates":[10.6217 2020/2020-01.json.gz
{"geometry":{"coordinates":[-37.6367 2020/2020-01.json.gz
It seems like everything after the first comma in the coordinates gets truncated, but I cannot figure out how to avoid that.
Br.
Thomas
$1 is column one, which helps show that your data is being treated as CSV, the default file format.
Try adding FILE_FORMAT => 'my_json_file' to the query, where that file format is created via:
CREATE OR REPLACE FILE FORMAT my_json_file TYPE = JSON;
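Applied to the query from the question, this should return one complete JSON document per row (a sketch reusing the same stage and path):

-- Query the staged file with the JSON file format applied
select $1, metadata$filename
from @my_bucket/2020/2020-01.json.gz (file_format => 'my_json_file')
limit 3;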
I've tried loading simple JSON records from a file into Hive tables as shown below, where each JSON record is on a separate line.
{"Industry":"Manufacturing","Phone":null,"Id":null,"type":"Account","Name":"Manufacturing"}
{"Industry":null,"Phone":"(738) 244-5566","Id":null,"type":"Account","Name":"Sales"}
{"Industry":"Government","Phone":null,"Id":null,"type":"Account","Name":"Kansas City Brewery & Co"}
But I couldn't find a SerDe that can load an array of comma-separated JSON records into a Hive table. The input is a file containing JSON records as shown below:
[{"Industry":"Manufacturing","Phone":null,"Id":null,"type":"Account","Name":"Manufacturing"},{"Industry":null,"Phone":"(738) 244-5566","Id":null,"type":"Account","Name":"Sales"},{"Industry":"Government","Phone":null,"Id":null,"type":"Account","Name":"Kansas City Brewery & Co"}]
Can someone suggest a SerDe that can parse this JSON file?
Thanks
You can check this SerDe: https://github.com/rcongiu/Hive-JSON-Serde
Another related post: Parse json arrays using HIVE
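For the one-record-per-line case, a minimal sketch of a table backed by that SerDe (the jar path, table name, and location are assumptions based on the sample data):

ADD JAR /path/to/json-serde-jar-with-dependencies.jar;

CREATE EXTERNAL TABLE accounts (
  Industry STRING,
  Phone STRING,
  Id STRING,
  `type` STRING,
  Name STRING
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/path/to/json/files';

Like most text-based SerDes, this one still expects one JSON document per line, so a file whose entire contents are a single JSON array generally has to be flattened to one object per line (or read through a custom InputFormat) first, as discussed in the linked post.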
What's the syntax for writing an array for Solr in a CSV file? I need to update a multivalued field, but when I upload the file, all the data ends up in the array as just one element, like this:
multiField:["data1,data2,data3"]
instead of this:
multiField:["data1", "data2" , "data3"]
How can I write this in the CSV file so it loads as separate values?
You can use the split parameter, together with a per-field separator, to split a single field into multiple values:
&f.multiField.split=true&f.multiField.separator=,
...should do what you want.
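Put together as a full update request (the core name and file name are placeholders):

curl 'http://localhost:8983/solr/mycore/update?commit=true&f.multiField.split=true&f.multiField.separator=,' --data-binary @data.csv -H 'Content-type:application/csv'

With this, a CSV cell containing data1,data2,data3 (quoted in the file so the commas survive CSV parsing) is indexed as three separate values of multiField.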
How can MapReduce parse a CSV file with 80 columns where each logical row (one row in the Excel original) spans two to three lines of the CSV? TextInputFormat doesn't work in this case. Does KeyValueTextInputFormat work here?
You can write your own InputFormat and RecordReader that read multiple lines and send them to your Mapper as a single record.
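A minimal sketch of that idea (the class names are mine, and it assumes a line break stands in for a field separator, so a logical record is complete once 80 comma-separated fields have accumulated):

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class MultiLineCsvInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // Records span physical lines, so never cut a file mid-record.
        return false;
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
        return new MultiLineCsvRecordReader();
    }

    public static class MultiLineCsvRecordReader extends RecordReader<LongWritable, Text> {
        private static final int EXPECTED_FIELDS = 80; // known, fixed column count
        private final LineRecordReader lineReader = new LineRecordReader();
        private LongWritable key;
        private final Text value = new Text();

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) throws IOException {
            lineReader.initialize(split, context);
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            StringBuilder record = new StringBuilder();
            int fields = 0;
            boolean readAnything = false;
            // Append physical lines until the logical record holds all 80 fields.
            while (fields < EXPECTED_FIELDS && lineReader.nextKeyValue()) {
                if (!readAnything) {
                    key = new LongWritable(lineReader.getCurrentKey().get());
                    readAnything = true;
                } else {
                    record.append(','); // assume the line break replaced a separator
                }
                record.append(lineReader.getCurrentValue().toString());
                fields = record.toString().split(",", -1).length;
            }
            if (!readAnything) {
                return false; // end of split
            }
            value.set(record.toString());
            return true;
        }

        @Override public LongWritable getCurrentKey() { return key; }
        @Override public Text getCurrentValue() { return value; }
        @Override public float getProgress() throws IOException { return lineReader.getProgress(); }
        @Override public void close() throws IOException { lineReader.close(); }
    }
}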