JSON Data Read in Hive Table - json

I am able to create Hive table using JSON Serde org.openx.data.jsonserde.JsonSerDe but when I am reading the data from Hive table I am unable to read.
hive> create table emp (EmpId int , EmpFirstName string , EmpLastName string) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';
OK
Time taken: 2.148 seconds
hive> LOAD DATA INPATH '/user/cloudera/EmpData/emp.json' INTO table emp;
Loading data to table employee.emp
chgrp: changing ownership of 'hdfs://quickstart.cloudera:8020/user/hive/warehouse/employee.db/emp/emp.json': User does not belong to supergroup
Table employee.emp stats: [numFiles=1, totalSize=4163]
OK
Time taken: 1.141 seconds
hive> select * from emp;
OK
Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: Row is not a valid JSON Object - JSONException: A JSONObject text must end with '}' at 2 [character 3 line 1]
Time taken: 0.504 seconds

ERROR: Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: Row is not a valid JSON Object - JSONException: A JSONObject text must end with '}' at 2 [character 3 line 1]
check the json provided in /user/cloudera/EmpData/emp.json is valid
You can eliminate the invalid row by
ALTER TABLE table emp SET SERDEPROPERTIES ( "ignore.malformed.json" = "true");
check this link -> https://github.com/rcongiu/Hive-JSON-Serde
Edit:
this is invalid json
{ "cols": [ "EmpId", "EmpFirstName", "EmpLastName" ], "data": [ [ 1, "Hannah", "Walton" ], [ 2, "Barrett", "Mendoza" ], [ 3, "Camden", "Kidd" ], [ 4, "Illiana", "Collier" ] ] }
the json provided by you has
key:cols and value:[ "EmpId", "EmpFirstName", "EmpLastName" ]
and
key :data and value :[ [ 1, "Hannah", "Walton" ], [ 2, "Barrett", "Mendoza" ], [ 3, "Camden", "Kidd" ], [ 4, "Illiana", "Collier" ]
the json should be something like
{"EmpId":1,"EmpFirstName":"Hannah","EmpLastName":"Walton"}
{"EmpId":2,"EmpFirstName":"Barrett","EmpLastName":"Mendoza"}
{"EmpId":3,"EmpFirstName":"Camden","EmpLastName":"Kidd"}

Related

Table from nested list, struct

I have this json data:
consumption_json = """
{
"count": 48,
"next": null,
"previous": null,
"results": [
{
"consumption": 0.063,
"interval_start": "2018-05-19T00:30:00+0100",
"interval_end": "2018-05-19T01:00:00+0100"
},
{
"consumption": 0.071,
"interval_start": "2018-05-19T00:00:00+0100",
"interval_end": "2018-05-19T00:30:00+0100"
},
{
"consumption": 0.073,
"interval_start": "2018-05-18T23:30:00+0100",
"interval_end": "2018-05-18T00:00:00+0100"
}
]
}
"""
and I would like to covert the results list to an Arrow table.
I have managed this by first converting it to python data structure, using python's json library, and then converting that to an Arrow table.
import json
consumption_python = json.loads(consumption_json)
results = consumption_python['results']
table = pa.Table.from_pylist(results)
print(table)
pyarrow.Table
consumption: double
interval_start: string
interval_end: string
----
consumption: [[0.063,0.071,0.073]]
interval_start: [["2018-05-19T00:30:00+0100","2018-05-19T00:00:00+0100","2018-05-18T23:30:00+0100"]]
interval_end: [["2018-05-19T01:00:00+0100","2018-05-19T00:30:00+0100","2018-05-18T00:00:00+0100"]]
But, for reasons of performance, I'd rather just use pyarrow exclusively for this.
I can use pyarrow's json reader to make a table.
reader = pa.BufferReader(bytes(consumption_json, encoding='ascii'))
table_from_reader = pa.json.read_json(reader)
And 'results' is a struct nested inside a list. (Actually, everything seems to be nested).
print(table_from_reader['results'].type)
list<item: struct<consumption: double, interval_start: timestamp[s], interval_end: timestamp[s]>>
How do I turn this into a table directly?
following this https://stackoverflow.com/a/72880717/3617057
I can get closer...
import pyarrow.compute as pc
flat = pc.list_flatten(table_from_reader["results"])
print(flat)
[
-- is_valid: all not null
-- child 0 type: double
[
0.063,
0.071,
0.073
]
-- child 1 type: timestamp[s]
[
2018-05-18 23:30:00,
2018-05-18 23:00:00,
2018-05-18 22:30:00
]
-- child 2 type: timestamp[s]
[
2018-05-19 00:00:00,
2018-05-18 23:30:00,
2018-05-17 23:00:00
]
]
flat is a ChunkedArray whose underlying arrays are StructArray. To convert it to a table, you need to convert each chunks to a RecordBatch and concatenate them in a table:
pa.Table.from_batches(
[
pa.RecordBatch.from_struct_array(s)
for s in flat.iterchunks()
]
)
If flat is just a StructArray (not a ChunkedArray), you can call:
pa.Table.from_batches(
[
pa.RecordBatch.from_struct_array(flat)
]
)

Analysing and formatting JSON using PostgreSQL

I have a table called api_details where i dump the below JSON value into the JSON column raw_data.
Now i need to make a report from this JSON string and the expected output is something like below,
action_name. sent_timestamp Sent. Delivered
campaign_2475 1600416865.928737 - 1601788183.440805. 7504. 7483
campaign_d_1084_SUN15_ex 1604220248.153903 - 1604222469.087918. 63095. 62961
Below is the sample JSON OUTPUT
{
"header": [
"#0 action_name",
"#1 sent_timestamp",
"#0 Sent",
"#1 Delivered"
],
"name": "campaign - lifetime",
"rows": [
[
"campaign_2475",
"1600416865.928737 - 1601788183.440805",
7504,
7483
],
[
"campaign_d_1084_SUN15_ex",
"1604220248.153903 - 1604222469.087918",
63095,
62961
],
[
"campaign_SUN15",
"1604222469.148829 - 1604411016.029794",
63303,
63211
]
],
"success": true
}
I tried like below, but is not getting the results.I can do it using python by lopping through all the elements in row list.
But is there an easy solution in PostgreSQL(version 11).
SELECT raw_data->'rows'->0
FROM api_details
You can use JSONB_ARRAY_ELEMENTS() function such as
SELECT (j.value)->>0 AS action_name,
(j.value)->>1 AS sent_timestamp,
(j.value)->>2 AS Sent,
(j.value)->>3 AS Delivered
FROM api_details
CROSS JOIN JSONB_ARRAY_ELEMENTS(raw_data->'rows') AS j
Demo
P.S. in this case the data type of raw_data is assumed to be JSONB, otherwise the argument within the function raw_data->'rows' should be replaced with raw_data::JSONB->'rows' in order to perform explicit type casting.

Databricks - explode JSON from SQL column with PySpark

New to Databricks. Have a SQL database table that I am creating a dataframe from. One of the columns is a JSON string. I need to explode the nested JSON into multiple columns. Have used this post and this post to get me to where I am at now.
Example JSON:
{
"Module": {
"PCBA Serial Number": "G7456789",
"Manufacturing Designator": "DISNEY",
"Firmware Version": "0.0.0",
"Hardware Revision": "46858",
"Manufacturing Date": "10/17/2018 4:04:25 PM",
"Test Result": "Fail",
"Test Start Time": "10/22/2018 6:14:14 AM",
"Test End Time": "10/22/2018 6:16:11 AM"
}
Code so far:
#define schema
schema = StructType(
[
StructField('Module',ArrayType(StructType(Seq
StructField('PCBA Serial Number',StringType,True),
StructField('Manufacturing Designator',StringType,True),
StructField('Firmware Version',StringType,True),
StructField('Hardware Revision',StringType,True),
StructField('Test Result',StringType,True),
StructField('Test Start Time',StringType,True),
StructField('Test End Time',StringType,True))), True) ,True),
StructField('Test Results',StringType(),True),
StructField('HVM Code Errors',StringType(),True)
]
#use from_json to explode json by applying it to column
df.withColumn("ActivityName", from_json("ActivityName", schema))\
.select(col('ActivityName'))\
.show()
Error:
SyntaxError: invalid syntax
File "<command-1632344621139040>", line 10
StructField('PCBA Serial Number',StringType,True),
^
SyntaxError: invalid syntax
As you are using pyspark then types should be StringType() instead of StringType and remove Seq replace it with []
schema = StructType([StructField('Module',ArrayType(StructType([
StructField('PCBA Serial Number',StringType(),True),
StructField('Manufacturing Designator',StringType(),True),
StructField('Firmware Version',StringType(),True),
StructField('Hardware Revision',StringType(),True),
StructField('Test Result',StringType(),True),
StructField('Test Start Time',StringType(),True),
StructField('Test End Time',StringType(),True)])), True),
StructField('Test Results',StringType(),True),
StructField('HVM Code Errors',StringType(),True)])

PostgreSQL jsonb string format

I'm using PostgreSQL jsonb and have the following in my database record:
{"tags": "[\"apple\",\" orange\",\" pineapple\",\" fruits\"]",
"filename": "testname.jpg", "title_en": "d1", "title_ja": "1",
"description_en": "d1", "description_ja": "1"}
and both SELECT statements below retrived no results:
SELECT "photo"."id", "photo"."datadoc", "photo"."created_timestamp","photo"."modified_timestamp"
FROM "photo"
WHERE datadoc #> '{"tags":> ["apple"]}';
SELECT "photo"."id", "photo"."datadoc", "photo"."created_timestamp", "photo"."modified_timestamp"
FROM "photo"
WHERE datadoc -> 'tags' ? 'apple';
I wonder it is because of the extra backslash added to the json array string, or the SELECT statement is incorrect.
I'm running "PostgreSQL 10.1, compiled by Visual C++ build 1800, 64-bit" on Windows 10.
PostgreSQL doc is here.
As far as any JSON parser is concerned, the value of your tags key is a string, not an array.
"tags": "[\"apple\",\" orange\",\" pineapple\",\" fruits\"]"
The string itself happens to be another JSON document, like the common case in XML where the contents of a string happen to be an XML or HTML document.
["apple"," orange"," pineapple"," fruits"]
What you need to do is extract that string, then parse it as a new JSON object, and then query that new object.
I can't test it right now, but I think that would look something like this:
(datadoc ->> 'tags') ::jsonb ? 'apple'
That is, "extract the tags value as text, cast that text value as jsonb, then query that new jsonb value.
Hey there i know this is very late answer, but here is the good approach, with data i have.
initital data in db:
"{\"data\":{\"title\":\"test\",\"message\":\"string\",\"image\":\"string\"},\"registration_ids\":[\"s
tring\"],\"isAllUsersNotification\":false}"
to convert it to json
select (notificationData #>> '{}')::jsonb from sent_notification
result:
{"data": {"image": "string", "title": "string", "message": "string"}, "registration_ids": ["string"], "isAllUsersNotification": false}
getting a data object from json
select (notificationData #>> '{}' )::jsonb -> 'data' from sent_notification;
result:
{"image": "string", "title": "string", "message": "string"}
getting a field from above result:
select (notificationData #>> '{}' )::jsonb -> 'data' ->>'title' from sent_notification;
result:
string
performing where operations,
Q: get records where title ='string'
ans:
select * from sent_notification where (notificationData #>> '{}' )::jsonb -> 'data' ->>'title' ='string'

Hive Table Error loading json data

json file:
{
"DocId":"ABC",
"User":{
"Id":1234,
"Username":"sam1234",
"Name":"Sam",
"ShippingAddress":{
"Address1":"123 Main St.",
"Address2":null,
"City":"Durham",
"State":"NC"
},
"Orders":[{
"ItemId":6789,
"OrderDate":"11/11/2012"
},
{
"ItemId":4352,
"OrderDate":"12/12/2012"
}
]
}
}}
schema:
create external table sample_json(DocId string,User struct<Id:int,Username:string,Name:string,ShippingAddress:struct<Address1:string,Address2:string,City:string,State:string>,Orders:array<struct<ItemId:int,OrderDate:string>>>)ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe' location '/user/babu/sample_json';
--loading data to the hive table
load data inpath '/user/samplejson/samplejson.json' into table sample_json;
Error:
when I am firing the select query like
select * from sample_json;
Exception:
Failed with exception
java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException:
org.codehaus.jackson.JsonParseException: Unexpected end-of-input:
expected close marker for OBJECT (from [Source:
java.io.StringReader#8c3770; line: 1, column: 0]) at [Source:
java.io.StringReader#8c3770; line: 1, column: 3]
First please ensure that json file is valid through http://jsonlint.com and then remove any newline characters or unwanted spaces in the json file before loading the file into the hive table. Also please drop the table and create a new table if you have already loaded json files having newline characters into the table.
Following is the input you can try
{"DocId":"ABC",
"User":{"Id":1234,
"Username":"sam1234",
"Name":"Sam",
"ShippingAddress":{"Address1":"123 Main St.","Address2":null,"City":"Durham","State":"NC"},
"Orders":[{"ItemId":6789,"OrderDate":"11/11/2012"},
{"ItemId":4352,"OrderDate":"12/12/2012"}
]
}
}
remove the newline from the json file.
{"DocId": "ABC", "Userdetails": {"Id": 1234, "Username": "sam1234", "Name": "Sam", "ShippingAddress": {"Address1": "123 Main St.", "Address2": null, "City": "Durham", "State": "NC" }, "Orders":[{"ItemId": 6789, "OrderDate": "11/11/2012"}, {"ItemId": 4352, "OrderDate": "12/12/2012"}]}}
change User to userdetails as it's a identifier check the error which I got.
3.either use location or load data inpath. because both does the same work. Location does not create a folder in HDFS while load inpath does create folder.
following are the commands :
hive>
create external table sample_json(DocId string, userdetails struct< Id:int , Username:string,Name:string,ShippingAddress:struct,Orders:array< struct< ItemId:int, OrderDate:string>>>)ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' location '/user/admin';
OK
Time taken: 0.13 seconds
hive>
select * from sample_json;
OK
sample_json.docid sample_json.userdetails
ABC {"id":1234,"username":"sam1234","name":"Sam","shippingaddress":{"address1":"123 Main St.","address2":null,"city":"Durham","state":"NC"},"orders":[{"itemid":6789,"orderdate":"11/11/2012"},{"itemid":4352,"orderdate":"12/12/2012"}]}
Time taken: 0.106 seconds, Fetched: 1 row(s)