Load complex JSON in Hive using JsonSerDe

I am trying to build a table in Hive for the following JSON:
{
  "business_id": "vcNAWiLM4dR7D2nwwJ7nCA",
  "hours": {
    "Tuesday": {
      "close": "17:00",
      "open": "08:00"
    },
    "Friday": {
      "close": "17:00",
      "open": "08:00"
    }
  },
  "open": true,
  "categories": [
    "Doctors",
    "Health & Medical"
  ],
  "review_count": 9,
  "name": "Eric Goldberg, MD",
  "neighborhoods": [],
  "attributes": {
    "By Appointment Only": true,
    "Accepts Credit Cards": true,
    "Good For Groups": 1
  },
  "type": "business"
}
I can create a table using the following DDL; however, I get an exception while querying that table.
CREATE TABLE IF NOT EXISTS business (
  business_id string,
  hours map<string,string>,
  open boolean,
  categories array<string>,
  review_count int,
  name string,
  neighborhoods array<string>,
  attributes map<string,string>,
  type string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde';
The exception while retrieving data is "ClassCastException: cannot cast JSONArray to JSONObject". What is the correct schema for this JSON? Is there any tool which can help me generate the correct schema for a given JSON to be used with JsonSerDe?

It looks to me that the problem is hours, which you defined as map<string,string> but which should be map<string,map<string,string>> instead.
There's a tool you can use to generate the hive table definition automatically from your JSON data: https://github.com/quux00/hive-json-schema
but you may want to adjust its output, because when encountering a JSON object (anything between {}) the tool can't know whether to translate it to a Hive map or to a struct.
On your data, the tool gives me this:
CREATE TABLE x (
  attributes struct<accepts credit cards:boolean,
    by appointment only:boolean, good for groups:int>,
  business_id string,
  categories array<string>,
  hours map<string,struct<close:string, open:string>>,
  name string,
  neighborhoods array<string>,
  open boolean,
  review_count int,
  type string
)
but it looks like you want something like this:
CREATE TABLE x (
  attributes map<string,string>,
  business_id string,
  categories array<string>,
  hours map<string,struct<close:string, open:string>>,
  name string,
  neighborhoods array<string>,
  open boolean,
  review_count int,
  type string
) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE;
hive> load data local inpath 'json.data' overwrite into table x;
Table default.x stats: [numFiles=1, numRows=0, totalSize=416, rawDataSize=0]
OK
hive> select * from x;
OK
{"accepts credit cards":"true","by appointment only":"true",
"good for groups":"1"}
vcNAWiLM4dR7D2nwwJ7nCA
["Doctors","Health & Medical"]
{"tuesday":{"close":"17:00","open":"08:00"},
"friday":{"close":"17:00","open":"08:00"}}
Eric Goldberg, MD ["HELLO"] true 9 business
Time taken: 0.335 seconds, Fetched: 1 row(s)
hive>
A few notes though:
Notice I used a different JSON SerDe because I don't have the one you used on my system. I used this one; I like it better because, well, I wrote it. But the CREATE statement should work just as well with the other SerDe.
You may want to convert some of those maps to structs, as they may be more convenient to query. For instance, attributes could be a struct, but you'd need to map the names with a space in them, like accepts credit cards. My SerDe allows mapping a JSON attribute to a different Hive column name. That is also needed when the JSON uses an attribute that is a Hive keyword, like 'timestamp' or 'create'.
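As a sketch of that mapping feature (using the OpenX SerDe's mapping.* SERDEPROPERTIES; the table and the ts column here are made up for illustration), a remapped column would look something like this:
CREATE TABLE y (
  ts string,
  business_id string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
-- read the JSON attribute "timestamp" into the Hive column ts
WITH SERDEPROPERTIES ('mapping.ts' = 'timestamp')
STORED AS TEXTFILE;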

Related

How to query JSON data in Athena with an # symbol in the key name and duplicate keys

The data I have been tasked to query is structured like this:
{
  "#timestamp": "2022-11-17T21:00:19.191+00:00",
  "#version": 1,
  "message": "log message",
  "logger_name": "com.logger.name",
  "thread_name": "tomcat-thread-13",
  "level": "INFO",
  "level_value": 20000,
  "application_name": "app_name",
  "vpc": "vpc_name",
  "region": "eu-west-1",
  "aid": "ffffffff-ffff-ffff-ffff-ffffffffffff",
  "account": "prod",
  "rq": "ffffffff-ffff-ffff-ffff-ffffffffffff",
  "log_shipper": "firehose",
  "application_name": "app_name",
  "account": "prod",
  "region": "eu-west-1"
}
As you can see, there are some duplicate keys in here, so both the Hive and the OpenX JSON SerDe throw an error and won't query it at all.
I've created a table using the Ion SerDe, which can read the data, but the #timestamp and #version fields are always blank; all the other fields are read correctly.
The initial table definition I had was this...
CREATE EXTERNAL TABLE firehose_logs_pe (
  `#timestamp` STRING,
  `#version` STRING,
  <other columns>
)
ROW FORMAT SERDE
  'com.amazon.ionhiveserde.IonHiveSerDe'
STORED AS ION
LOCATION 's3://s3-bucket-name/folder/'
I also tried to rename the fields and use a path extractor to get the values, like this...
CREATE EXTERNAL TABLE firehose_logs_pe (
  ts STRING,
  version STRING,
  <other columns>
)
ROW FORMAT SERDE
  'com.amazon.ionhiveserde.IonHiveSerDe'
WITH SERDEPROPERTIES (
  'ion.ts.path_extractor' = '(`#timestamp`)',
  'ion.version.path_extractor' = '(`#version`)'
)
STORED AS ION
LOCATION 's3://s3-bucket-name/folder/'
However, the values of the ts and version fields are still empty. The query also seems to run slower using the path extractors.
Is there any way to query this data in this format with Athena? As a test I did a find-and-replace on one of the JSON files and removed the #, at which point everything worked as it should. However, this is not a practical solution when I have about 20 TB of data to query in hundreds of millions of files.

How to extract properly when SQLite JSON has a value that is an array

I have a SQLite database, and in one of the fields I have stored a complete JSON object. I have to make some JSON select requests. If you look at my JSON,
the ALL key has a value which is an array. We need to extract some data, like all comments where the "pod" field is fb. How do I extract properly when the SQLite JSON has a value that is an array?
select json_extract(data,'$."json"') from datatable ; gives me the entire thing. Then I do
select json_extract(data,'$."json"[0]') but I don't want to do it manually; I want to iterate.
Kindly suggest some source where I can study and work on it.
MY JSON
{
  "ALL": [{
      "comments": "your site is awesome",
      "pod": "passcode",
      "originalDirectory": "case1"
    },
    {
      "comments": "your channel is good",
      "data": ["youTube"],
      "pod": "library"
    },
    {
      "comments": "you like everything",
      "data": ["facebook"],
      "pod": "fb"
    },
    {
      "data": ["twitter"],
      "pod": "tw",
      "ALL": [{
        "data": [{
          "codeLevel": "3"
        }],
        "pod": "mo",
        "pod2": "p"
      }]
    }
  ]
}
create table datatable ( path string , data json1 );
insert into datatable values("1" , json('<abovejson in a single line>'));
Simple List
Where your JSON represents a "simple" list of comments, you want something like:
select key, value
from datatable, json_each( datatable.data, '$.ALL' )
where json_extract( value, '$.pod' ) = 'fb' ;
which, using your sample data, returns:
2|{"comments":"you like everything","data":["facebook"],"pod":"fb"}
The use of json_each() returns a row for every element of the input JSON (datatable.data), starting at the path $.ALL (where $ is the top-level, and ALL is the name of your array: the path can be omitted if the top-level of the JSON object is required). In your case, this returns one row for each comment entry.
The fields of this row are documented at 4.13. The json_each() and json_tree() table-valued functions in the SQLite documentation: the two we're interested in are key (very roughly, the "row number") and value (the JSON for the current element). The latter will contain elements called comments and pod, etc.
Because we are only interested in elements where pod is equal to fb, we add a where clause, using json_extract() to get at pod (where $.pod is relative to value returned by the json_each function).
Nested List
If your JSON contains nested elements (something I didn't notice at first), then you need to use the json_tree() function instead of json_each(). Whereas the latter will only iterate over the immediate children of the node specified, json_tree() will descend recursively through all children from the node specified.
To give us some data to work with, I have augmented your test data with an extra element:
create table datatable ( path string , data json1 );
insert into datatable values("1" , json('
{
  "ALL": [{
      "comments": "your site is awesome",
      "pod": "passcode",
      "originalDirectory": "case1"
    },
    {
      "comments": "your channel is good",
      "data": ["youTube"],
      "pod": "library"
    },
    {
      "comments": "you like everything",
      "data": ["facebook"],
      "pod": "fb"
    },
    {
      "data": ["twitter"],
      "pod": "tw",
      "ALL": [{
          "data": [{
            "codeLevel": "3"
          }],
          "pod": "mo",
          "pod2": "p"
        },
        {
          "comments": "inserted by TripeHound",
          "data": ["facebook"],
          "pod": "fb"
        }]
    }
  ]
}
'));
If we simply switch to using json_tree(), then we see that a simple query (with no where clause) will return all elements of the source JSON:
select key, value
from datatable, json_tree( datatable.data, '$.ALL' ) limit 10 ;
ALL|[{"comments":"your site is awesome","pod":"passcode","originalDirectory":"case1"},{"comments":"your channel is good","data":["youTube"],"pod":"library"},{"comments":"you like everything","data":["facebook"],"pod":"fb"},{"data":["twitter"],"pod":"tw","ALL":[{"data":[{"codeLevel":"3"}],"pod":"mo","pod2":"p"},{"comments":"inserted by TripeHound","data":["facebook"],"pod":"fb"}]}]
0|{"comments":"your site is awesome","pod":"passcode","originalDirectory":"case1"}
comments|your site is awesome
pod|passcode
originalDirectory|case1
1|{"comments":"your channel is good","data":["youTube"],"pod":"library"}
comments|your channel is good
data|["youTube"]
0|youTube
pod|library
Because JSON objects are mixed in with simple values, we can no longer simply add where json_extract( value, '$.pod' ) = 'fb' because this produces errors when value does not represent an object. The simplest way around this is to look at the type values returned by json_each()/json_tree(): these will be the string 'object' if the row represents a JSON object (see the documentation above for other values).
Adding this to the where clause (and relying on "short-circuit evaluation" to prevent json_extract() being called on non-object rows), we get:
select key, value
from datatable, json_tree( datatable.data, '$.ALL' )
where type = 'object'
and json_extract( value, '$.pod' ) = 'fb' ;
which returns:
2|{"comments":"you like everything","data":["facebook"],"pod":"fb"}
1|{"comments":"inserted by TripeHound","data":["facebook"],"pod":"fb"}
If desired, we could use json_extract() to break apart the returned objects:
.mode column
.headers on
.width 30 15 5
select json_extract( value, '$.comments' ) as Comments,
json_extract( value, '$.data' ) as Data,
json_extract( value, '$.pod' ) as POD
from datatable, json_tree( datatable.data, '$.ALL' )
where type = 'object'
and json_extract( value, '$.pod' ) = 'fb' ;
Comments Data POD
------------------------------ --------------- -----
you like everything ["facebook"] fb
inserted by TripeHound ["facebook"] fb
Note: If your structure contained other objects, of different formats, it may not be sufficient to simply select for type = 'object': you may have to devise a more subtle filtering process.
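For example, a sketch against the sample data above: you could additionally require that the object has a comments key, so that objects of a different shape are filtered out before the pod comparison:
select key, value
from datatable, json_tree( datatable.data, '$.ALL' )
where type = 'object'
and json_extract( value, '$.comments' ) is not null
and json_extract( value, '$.pod' ) = 'fb' ;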

Creating a KSQL Stream: How to extract value from complex json

I am trying to create a stream in Apache Kafka KSQL.
The topic contains somewhat complex JSON:
{
  "agreement_id": "dd8afdbe-59cf-4272-b640-b14a24d8234c",
  "created_at": "2018-02-17 16:00:00.000Z",
  "id": "6db276a8-2efe-4495-9908-4d3fc4cc16fa",
  "event_type": "data",
  "total_charged_amount": {
    "tax_free_amount": null,
    "tax_amounts": [],
    "tax_included_amount": {
      "amount": 0.0241,
      "currency": "EUR"
    }
  },
  "used_service_units": [
    {
      "amount": 2412739,
      "currency": null,
      "unit_of_measure": "bytes"
    }
  ]
}
Now creating a stream is easy for just simple stuff like event_type and created_at. That would be like this
CREATE STREAM tstream (event_type varchar, created_at varchar) WITH (kafka_topic='usage_events', value_format='json');
But now I need to access the used_service_units, and I would like to extract the "amount" in the JSON above. How would I do this?
CREATE STREAM usage (event_type varchar,create_at varchar, used_service_units[0].amount int) WITH (kafka_topic='usage_events', value_format='json');
Results in
line 1:78: mismatched input '[' expecting {'ADD', 'APPROXIMATE', ...
And if I instead create a stream like so
CREATE STREAM usage (event_type varchar,create_at varchar, used_service_units varchar) WITH (kafka_topic='usage_events', value_format='json');
And then does a SQL SELECT on the stream like this
SELECT EXTRACTJSONFIELD(used_service_units,'$.amount') FROM usage;
SELECT EXTRACTJSONFIELD(used_service_units[0],'$.amount') FROM usage;
SELECT EXTRACTJSONFIELD(used_service_units,'$[0].amount') FROM usage;
None of these alternatives work...
This one gave me
SELECT EXTRACTJSONFIELD(used_service_units[0],'$.amount') FROM usage;
Code generation failed for SelectValueMapper
It seems that ONE solution to this problem is to make the column datatype an array, i.e.
CREATE STREAM usage (event_type varchar,created_at varchar, total_charged_amount varchar, used_service_units array<varchar> ) WITH (kafka_topic='usage_events', value_format='json');
Now I am able to do the following:
SELECT EXTRACTJSONFIELD(used_service_units[0],'$.amount') FROM usage
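Note that EXTRACTJSONFIELD returns a VARCHAR, so if you need the amount as a number you can cast it; a sketch, assuming the stream definition above:
SELECT CAST(EXTRACTJSONFIELD(used_service_units[0], '$.amount') AS INTEGER) AS amount FROM usage;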

U-SQL - Extract data from complex json object

So I have a lot of JSON files structured like this:
{
  "Id": "2551faee-20e5-41e4-a7e6-57bd20b02a22",
  "Timestamp": "2016-12-06T08:09:57.5541438+01:00",
  "EventEntry": {
    "EventId": 1,
    "Payload": [
      "1a3e0c9e-ef69-4c6a-ac8c-9b2de2fbc701",
      "DHS.PlanCare.Business.BusinessLogic.VisionModels.VisionModelServiceWithoutUnitOfWork.FetchVisionModelsForClientOnReferenceDateAsync(System.Int64 clientId, System.DateTime referenceDate, System.Threading.CancellationToken cancellationToken)",
      25,
      "DHS.PlanCare.Business.BusinessLogic.VisionModels.VisionModelServiceWithoutUnitOfWork+<FetchVisionModelsForClientOnReferenceDateAsync>d__11.MoveNext\r\nDHS.PlanCare.Core.Extensions.IQueryableExtensions+<ExecuteAndThrowTaskCancelledWhenRequestedAsync>d__16`1.MoveNext\r\n",
      false,
      "2197, 6-12-2016 0:00:00, System.Threading.CancellationToken"
    ],
    "EventName": "Duration",
    "KeyWordsDescription": "Duration",
    "PayloadSchema": [
      "instanceSessionId",
      "member",
      "durationInMilliseconds",
      "minimalStacktrace",
      "hasFailed",
      "parameters"
    ]
  },
  "Session": {
    "SessionId": "0016e54b-6c4a-48bd-9813-39bb040f7736",
    "EnvironmentId": "C15E535B8D0BD9EF63E39045F1859C98FEDD47F2",
    "OrganisationId": "AC6752D4-883D-42EE-9FEA-F9AE26978E54"
  }
}
How can I create a U-SQL query that outputs the
Id,
Timestamp,
EventEntry.EventId and
EventEntry.Payload[2] (value 25 in the example above)?
I can't figure out how to extend my query:
@extract =
    EXTRACT Timestamp DateTime
    FROM @"wasb://xxx/2016/12/06/0016e54b-6c4a-48bd-9813-39bb040f7736/yyy/{*}/{*}.json"
    USING new Microsoft.Analytics.Samples.Formats.Json.JsonExtractor();

@res =
    SELECT Timestamp
    FROM @extract;

OUTPUT @res TO "/output/result.csv" USING Outputters.Csv();
I have seen some examples like:
U-SQL Unable to extract data from JSON file => this only queries one level of the document; I need data from multiple levels.
U-SQL - Extract data from json-array => this only queries one level of the document; I need data from multiple levels.
JSONTuple supports multiple JSONPaths in one go.
@extract =
    EXTRACT Id string,
            Timestamp DateTime,
            EventEntry string
    FROM @"..."
    USING new Microsoft.Analytics.Samples.Formats.Json.JsonExtractor();

@res =
    SELECT Id, Timestamp, EventEntry,
           Microsoft.Analytics.Samples.Formats.Json.JsonFunctions.JsonTuple(EventEntry,
               "EventId", "Payload[2]") AS Event
    FROM @extract;

@res =
    SELECT Id,
           Timestamp,
           Event["EventId"] AS EventId,
           Event["Payload[2]"] AS Something
    FROM @res;
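To write the result out, you can finish with the same OUTPUT pattern as in the question (the output path is just an example):
OUTPUT @res TO "/output/result.csv" USING Outputters.Csv();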
You may want to look at this GitHub example: https://github.com/Azure/usql/blob/master/Examples/JsonSample/JsonSample/NestedJsonParsing.usql
This takes two disparate data elements and combines them, like you have with the Payload and PayloadSchema. If you create key-value pairs using the "Donut" or "Cake and Batter" examples, you may be able to match the schema up to the payload and use the CROSS APPLY EXPLODE function.

How to parse JSON value of a text column in Cassandra

I have a column of text type that contains a JSON value.
{
  "customer": [
    {
      "details": {
        "customer1": {
          "name": "john",
          "addresses": {
            "address1": {
              "line1": "xyz",
              "line2": "pqr"
            },
            "address2": {
              "line1": "abc",
              "line2": "efg"
            }
          }
        },
        "customer2": {
          "name": "robin",
          "addresses": {
            "address1": null
          }
        }
      }
    }
  ]
}
How can I extract the 'address1' JSON field of the column with a query?
First I am trying to fetch the JSON value; then I will go on to parsing.
SELECT JSON customer from text_column;
With my query, I get the following error.
com.datastax.driver.core.exceptions.SyntaxError: line 1:12 no viable
alternative at input 'customer' (SELECT [JSON] customer...)
Cassandra version 2.1.13
You can't use SELECT JSON in Cassandra v2.1.x (CQL v3.2.x).
For Cassandra v2.1.x (CQL v3.2.x), the only supported operations after SELECT are:
DISTINCT
COUNT (*)
COUNT (1)
column_name AS new_name
WRITETIME (column_name)
TTL (column_name)
dateOf(), now(), minTimeuuid(), maxTimeuuid(), unixTimestampOf(), typeAsBlob() and blobAsType()
Cassandra v2.2.x (CQL v3.3.x) introduces SELECT JSON:
With SELECT statements, the new JSON keyword can be used to return each row as a single JSON-encoded map. The remainder of the SELECT statement's behavior is the same.
The result map keys are the same as the column names in a normal result set. For example, a statement like "SELECT JSON a, ttl(b) FROM ..." would result in a map with keys "a" and "ttl(b)". However, there is one notable exception: for symmetry with INSERT JSON behavior, case-sensitive column names with upper-case letters will be surrounded with double quotes. For example, "SELECT JSON myColumn FROM ..." would result in a map key "\"myColumn\"" (note the escaped quotes).
The map values will be JSON-encoded representations (as described in the Cassandra docs) of the result-set values.
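So, after an upgrade to Cassandra 2.2+ the original query shape becomes valid; a sketch (the user table name is taken from the snippet below):
SELECT JSON customer FROM user;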
If your Cassandra version is 2.1.x or below, you can use a Python-based approach.
Write a Python script using the Cassandra Python driver.
Here you have to fetch your row first and then use Python's json.loads, which will convert your JSON text column value into a Python dict. Then you can work with the dictionary and extract the nested keys you need. See the code snippet below.
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider
import json

if __name__ == '__main__':
    auth_provider = PlainTextAuthProvider(username='xxxx', password='xxxx')
    cluster = Cluster(['0.0.0.0'],
                      port=9042, auth_provider=auth_provider)
    session = cluster.connect("keyspace_name")
    print("session created successfully")

    rows = session.execute('select * from user limit 10')
    for user_row in rows:
        # parse the JSON text column into a Python dict
        customer_dict = json.loads(user_row.customer)
        print(customer_dict.keys())
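From there, extracting the address1 values from the structure in the question is plain dictionary traversal; a sketch matching the JSON shape above, meant to go inside the loop:
# walk the parsed dict down to each customer's address1
for entry in customer_dict.get('customer', []):
    for cust_name, cust in entry.get('details', {}).items():
        # 'addresses' may hold objects or null, as in the sample data
        address1 = (cust.get('addresses') or {}).get('address1')
        print(cust_name, address1)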