Hive Table Error loading json data

JSON file:
{
    "DocId": "ABC",
    "User": {
        "Id": 1234,
        "Username": "sam1234",
        "Name": "Sam",
        "ShippingAddress": {
            "Address1": "123 Main St.",
            "Address2": null,
            "City": "Durham",
            "State": "NC"
        },
        "Orders": [
            {
                "ItemId": 6789,
                "OrderDate": "11/11/2012"
            },
            {
                "ItemId": 4352,
                "OrderDate": "12/12/2012"
            }
        ]
    }
}
schema:
create external table sample_json(
    DocId string,
    User struct<
        Id:int,
        Username:string,
        Name:string,
        ShippingAddress:struct<Address1:string, Address2:string, City:string, State:string>,
        Orders:array<struct<ItemId:int, OrderDate:string>>
    >
)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/babu/sample_json';
--loading data to the hive table
load data inpath '/user/samplejson/samplejson.json' into table sample_json;
Error:
When I fire a select query such as
select * from sample_json;
Exception:
Failed with exception
java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException:
org.codehaus.jackson.JsonParseException: Unexpected end-of-input:
expected close marker for OBJECT (from [Source:
java.io.StringReader@8c3770; line: 1, column: 0]) at [Source:
java.io.StringReader@8c3770; line: 1, column: 3]

First, please ensure that the JSON file is valid using http://jsonlint.com, and then remove any newline characters or unwanted spaces from the JSON file before loading it into the Hive table. Also, drop the table and create a new one if you have already loaded JSON files containing newline characters into it.
The following is input you can try:
{"DocId":"ABC",
"User":{"Id":1234,
"Username":"sam1234",
"Name":"Sam",
"ShippingAddress":{"Address1":"123 Main St.","Address2":null,"City":"Durham","State":"NC"},
"Orders":[{"ItemId":6789,"OrderDate":"11/11/2012"},
{"ItemId":4352,"OrderDate":"12/12/2012"}
]
}
}
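If you need to collapse a pretty-printed JSON file onto a single line before loading it, here is a minimal sketch in Python (the file names samplejson.json and samplejson_flat.json are assumptions):

import json

# Parse the pretty-printed document, then re-serialize it without newlines
# so the whole record sits on one line, as the SerDe expects.
with open("samplejson.json") as f:
    record = json.load(f)

with open("samplejson_flat.json", "w") as f:
    f.write(json.dumps(record) + "\n")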

1. Remove the newlines from the JSON file so that each record sits on a single line:
{"DocId": "ABC", "Userdetails": {"Id": 1234, "Username": "sam1234", "Name": "Sam", "ShippingAddress": {"Address1": "123 Main St.", "Address2": null, "City": "Durham", "State": "NC" }, "Orders":[{"ItemId": 6789, "OrderDate": "11/11/2012"}, {"ItemId": 4352, "OrderDate": "12/12/2012"}]}}
2. Change User to userdetails, since user is a reserved identifier in Hive; check the error I got when using it.
3. Use either LOCATION or LOAD DATA INPATH, not both, because they do the same work. LOCATION does not create a folder in HDFS, while LOAD DATA INPATH does create one.
Following are the commands:
hive>
create external table sample_json(DocId string, userdetails struct<Id:int, Username:string, Name:string, ShippingAddress:struct<Address1:string, Address2:string, City:string, State:string>, Orders:array<struct<ItemId:int, OrderDate:string>>>) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' LOCATION '/user/admin';
OK
Time taken: 0.13 seconds
hive>
select * from sample_json;
OK
sample_json.docid sample_json.userdetails
ABC {"id":1234,"username":"sam1234","name":"Sam","shippingaddress":{"address1":"123 Main St.","address2":null,"city":"Durham","state":"NC"},"orders":[{"itemid":6789,"orderdate":"11/11/2012"},{"itemid":4352,"orderdate":"12/12/2012"}]}
Time taken: 0.106 seconds, Fetched: 1 row(s)

Related

AWS DynamoDB Issues adding values to existing table

I have already created a table called Sensors and identified Sensor as the hash key. I am trying to add to the table with my .json file. The items in my file look like this:
{
    "Sensor": "Room Sensor",
    "SensorDescription": "Turns on lights when person walks into room",
    "ImageFile": "rmSensor.jpeg",
    "SampleRate": "1000",
    "Locations": "Baltimore, MD"
}
{
    "Sensor": "Front Porch Sensor",
    "SensorDescription": " ",
    "ImageFile": "fpSensor.jpeg",
    "SampleRate": "2000",
    "Locations": "Los Angeles, CA"
}
There are 20 different sensors in the file. I was using the following command:
aws dynamodb batch-write-item \
--table-name Sensors \
--request-items file://sensorList.json \
--return-consumed-capacity TOTAL
I get the following error:
Error parsing parameter '--request-items': Invalid JSON: Extra data: line 9 column 1 (char 189)
I've tried adding --table-name Sensors to the command line and it says Unknown options: --table-name, Sensors. I've tried put-item and a few others. I am trying to understand what my errors are, what I need to change in my .json if anything, and what I need to change in my command line. Thanks!
Your input file is not valid JSON. You are missing a comma to separate the two objects, and you need to enclose everything in brackets [ ..... ].
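A minimal sketch of that repair in Python, assuming the objects are simply concatenated in sensorList.json as shown (note this only makes the file parse as valid JSON; batch-write-item additionally expects its request-items payload keyed by table name with PutRequest entries):

import json

# Read the concatenated objects and rewrite them as a valid JSON array.
with open("sensorList.json") as f:
    text = f.read().strip()

decoder = json.JSONDecoder()
items, pos = [], 0
while pos < len(text):
    obj, pos = decoder.raw_decode(text, pos)  # parse one object at a time
    items.append(obj)
    while pos < len(text) and text[pos].isspace():
        pos += 1  # skip whitespace between objects

with open("sensorList_fixed.json", "w") as f:
    json.dump(items, f, indent=2)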

Dask how to open json with list of dicts

I'm trying to open a bunch of JSON files using read_json in order to get a DataFrame as follows:
ddf.compute()
id owner pet_id
0 1 "Charlie" "pet_1"
1 2 "Charlie" "pet_2"
3 4 "Buddy" "pet_3"
but I'm getting the following error from this code:
_meta = pd.DataFrame(
    columns=["id", "owner", "pet_id"]
).astype({
    "id": int,
    "owner": "object",
    "pet_id": "object"
})
ddf = dd.read_json("mypets/*.json", meta=_meta)
ddf.compute()
*** ValueError: Metadata mismatch found in `from_delayed`.
My JSON files look like
[
    {
        "id": 1,
        "owner": "Charlie",
        "pet_id": "pet_1"
    },
    {
        "id": 2,
        "owner": "Charlie",
        "pet_id": "pet_2"
    }
]
As far as I understand, the problem is that I'm passing a list of dicts, so I'm looking for the right way to specify it in the meta= argument.
PS:
I also tried doing it the following way
{
    "id": [1, 2],
    "owner": ["Charlie", "Charlie"],
    "pet_id": ["pet_1", "pet_2"]
}
But Dask is wrongly interpreting the data
ddf.compute()
id owner pet_id
0 [1, 2] ["Charlie", "Charlie"] ["pet_1", "pet_2"]
1 [4] ["Buddy"] ["pet_3"]
The invocation you want is the following:
dd.read_json("data.json", meta=meta,
blocksize=None, orient="records",
lines=False)
which can be largely gleaned from the docstring.
meta looks OK from your code
blocksize must be None, since you have a whole JSON object per file and cannot split the file
orient "records" means list of objects
lines=False means this is not a line-delimited JSON file, which is the more common case for Dask (you are not assuming that a newline character means a new record)
So why the error? Probably Dask split your file on some newline character, and so a partial record got parsed, which therefore did not match your given meta.
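Putting the pieces together, a minimal sketch of the corrected call (the mypets/*.json path and the meta definition come from the question; the keyword arguments are the answer's invocation):

import dask.dataframe as dd
import pandas as pd

# Empty frame declaring the expected columns and dtypes
_meta = pd.DataFrame(columns=["id", "owner", "pet_id"]).astype({
    "id": int,
    "owner": "object",
    "pet_id": "object"
})

# One whole JSON array per file: no block splitting, records orientation,
# and not line-delimited
ddf = dd.read_json("mypets/*.json", meta=_meta,
                   blocksize=None, orient="records", lines=False)
print(ddf.compute())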

how to generate schema from a newline delimited JSON file in python

I want to generate a schema from a newline-delimited JSON file, where each row in the file has a variable set of key/value pairs. File size can vary from 5 MB to 25 MB.
Sample Data:
{"col1":1,"col2":"c2","col3":100.75}
{"col1":2,"col3":200.50}
{"col1":3,"col3":300.15,"col4":"2020-09-08"}
Expected Schema:
[
{"name": "col1", "type": "INTEGER"},
{"name": "col2", "type": "STRING"},
{"name": "col3", "type": "FLOAT"},
{"name": "col4", "type": "DATE"}
]
Notes:
There is no scope to use any tool, as files are loaded into an inbound location dynamically. The code will be triggered by an event as soon as a file arrives, and it will then perform the schema comparison.
Your first problem is that JSON does not have a date type, so you will get str there.
What I would do, if I were you, is this:
import json

# Wherever your input comes from
inp = """{"col1":1,"col2":"c2","col3":100.75}
{"col1":2,"col3":200.50}
{"col1":3,"col3":300.15,"col4":"2020-09-08"}"""

schema = {}

# Split it at newlines
for line in inp.split('\n'):
    # each line contains a "dict"
    tmp = json.loads(line)
    for key in tmp:
        # if we have not seen the key before, add it
        if key not in schema:
            schema[key] = type(tmp[key])
        # otherwise check the type
        elif schema[key] != type(tmp[key]):
            raise Exception("Schema mismatch")

# format however you like
out = []
for item in schema:
    out.append({"name": item, "type": schema[item].__name__})
print(json.dumps(out, indent=2))
I'm using Python types for simplicity, but you can write your own function to get the type, e.g. if you want to check whether a string is actually a date.
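For example, a hedged sketch of such a function (the type names mirror the expected schema above; treating any string matching YYYY-MM-DD as a DATE is an assumption):

from datetime import datetime

def infer_type(value):
    # Map a JSON value onto the schema's type names; check bool before int,
    # since bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "BOOLEAN"
    if isinstance(value, int):
        return "INTEGER"
    if isinstance(value, float):
        return "FLOAT"
    if isinstance(value, str):
        try:
            datetime.strptime(value, "%Y-%m-%d")  # assumed date format
            return "DATE"
        except ValueError:
            return "STRING"
    return "STRING"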

Databricks - explode JSON from SQL column with PySpark

New to Databricks. I have a SQL database table that I am creating a dataframe from. One of the columns is a JSON string. I need to explode the nested JSON into multiple columns. I have used this post and this post to get to where I am now.
Example JSON:
{
    "Module": {
        "PCBA Serial Number": "G7456789",
        "Manufacturing Designator": "DISNEY",
        "Firmware Version": "0.0.0",
        "Hardware Revision": "46858",
        "Manufacturing Date": "10/17/2018 4:04:25 PM",
        "Test Result": "Fail",
        "Test Start Time": "10/22/2018 6:14:14 AM",
        "Test End Time": "10/22/2018 6:16:11 AM"
    }
}
Code so far:
#define schema
schema = StructType(
    [
        StructField('Module',ArrayType(StructType(Seq
            StructField('PCBA Serial Number',StringType,True),
            StructField('Manufacturing Designator',StringType,True),
            StructField('Firmware Version',StringType,True),
            StructField('Hardware Revision',StringType,True),
            StructField('Test Result',StringType,True),
            StructField('Test Start Time',StringType,True),
            StructField('Test End Time',StringType,True))), True) ,True),
        StructField('Test Results',StringType(),True),
        StructField('HVM Code Errors',StringType(),True)
    ]
#use from_json to explode json by applying it to column
df.withColumn("ActivityName", from_json("ActivityName", schema))\
.select(col('ActivityName'))\
.show()
Error:
SyntaxError: invalid syntax
File "<command-1632344621139040>", line 10
StructField('PCBA Serial Number',StringType,True),
^
SyntaxError: invalid syntax
As you are using PySpark, the types should be StringType() instead of StringType; also remove Seq and replace it with []:
from pyspark.sql.types import StructType, StructField, ArrayType, StringType

schema = StructType([
    StructField('Module', ArrayType(StructType([
        StructField('PCBA Serial Number', StringType(), True),
        StructField('Manufacturing Designator', StringType(), True),
        StructField('Firmware Version', StringType(), True),
        StructField('Hardware Revision', StringType(), True),
        StructField('Test Result', StringType(), True),
        StructField('Test Start Time', StringType(), True),
        StructField('Test End Time', StringType(), True)])), True),
    StructField('Test Results', StringType(), True),
    StructField('HVM Code Errors', StringType(), True)])
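A short usage sketch, assuming the df and ActivityName column from the question:

from pyspark.sql.functions import from_json, col

# Parse the JSON string column against the schema, then drill into the result
parsed = df.withColumn("ActivityName", from_json(col("ActivityName"), schema))
parsed.select(col("ActivityName.Module")[0]["PCBA Serial Number"].alias("serial")).show()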

JSON Data Read in Hive Table

I am able to create a Hive table using the JSON SerDe org.openx.data.jsonserde.JsonSerDe, but when I read the data from the Hive table, it fails.
hive> create table emp (EmpId int , EmpFirstName string , EmpLastName string) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';
OK
Time taken: 2.148 seconds
hive> LOAD DATA INPATH '/user/cloudera/EmpData/emp.json' INTO table emp;
Loading data to table employee.emp
chgrp: changing ownership of 'hdfs://quickstart.cloudera:8020/user/hive/warehouse/employee.db/emp/emp.json': User does not belong to supergroup
Table employee.emp stats: [numFiles=1, totalSize=4163]
OK
Time taken: 1.141 seconds
hive> select * from emp;
OK
Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: Row is not a valid JSON Object - JSONException: A JSONObject text must end with '}' at 2 [character 3 line 1]
Time taken: 0.504 seconds
Check that the JSON provided in /user/cloudera/EmpData/emp.json is valid.
You can eliminate invalid rows with:
ALTER TABLE emp SET SERDEPROPERTIES ("ignore.malformed.json" = "true");
Check this link: https://github.com/rcongiu/Hive-JSON-Serde
Edit:
This is invalid JSON:
{ "cols": [ "EmpId", "EmpFirstName", "EmpLastName" ], "data": [ [ 1, "Hannah", "Walton" ], [ 2, "Barrett", "Mendoza" ], [ 3, "Camden", "Kidd" ], [ 4, "Illiana", "Collier" ] ] }
The JSON you provided has
key cols with value [ "EmpId", "EmpFirstName", "EmpLastName" ]
and
key data with value [ [ 1, "Hannah", "Walton" ], [ 2, "Barrett", "Mendoza" ], [ 3, "Camden", "Kidd" ], [ 4, "Illiana", "Collier" ] ]
The JSON should instead be something like:
{"EmpId":1,"EmpFirstName":"Hannah","EmpLastName":"Walton"}
{"EmpId":2,"EmpFirstName":"Barrett","EmpLastName":"Mendoza"}
{"EmpId":3,"EmpFirstName":"Camden","EmpLastName":"Kidd"}