Creating a KSQL Stream: How to extract a value from complex JSON

I am trying to create a stream in Apache Kafka KSQL.
The topic contains somewhat complex JSON:
{
"agreement_id": "dd8afdbe-59cf-4272-b640-b14a24d8234c",
"created_at": "2018-02-17 16:00:00.000Z",
"id": "6db276a8-2efe-4495-9908-4d3fc4cc16fa",
"event_type": "data",
"total_charged_amount": {
"tax_free_amount": null,
"tax_amounts": [],
"tax_included_amount": {
"amount": 0.0241,
"currency": "EUR"
}
},
"used_service_units": [
{
"amount": 2412739,
"currency": null,
"unit_of_measure": "bytes"
}
]
}
Creating a stream is easy for simple fields like event_type and created_at. That would look like this:
CREATE STREAM tstream (event_type varchar, created_at varchar) WITH (kafka_topic='usage_events', value_format='json');
But now I need to access used_service_units, and I would like to extract the "amount" from the JSON above.
How would I do this?
CREATE STREAM usage (event_type varchar,create_at varchar, used_service_units[0].amount int) WITH (kafka_topic='usage_events', value_format='json');
Results in
line 1:78: mismatched input '[' expecting {'ADD', 'APPROXIMATE', ...
And if I instead create a stream like so
CREATE STREAM usage (event_type varchar,create_at varchar, used_service_units varchar) WITH (kafka_topic='usage_events', value_format='json');
and then run a SQL SELECT on the stream like this:
SELECT EXTRACTJSONFIELD(used_service_units,'$.amount') FROM usage;
SELECT EXTRACTJSONFIELD(used_service_units[0],'$.amount') FROM usage;
SELECT EXTRACTJSONFIELD(used_service_units,'$[0].amount') FROM usage;
None of these alternatives works.
This one:
SELECT EXTRACTJSONFIELD(used_service_units[0],'$.amount') FROM usage;
gave me:
Code generation failed for SelectValueMapper

It seems that one solution to this problem is to make the column data type an array, i.e.
CREATE STREAM usage (event_type varchar,created_at varchar, total_charged_amount varchar, used_service_units array<varchar> ) WITH (kafka_topic='usage_events', value_format='json');
Now I am able to do the following:
SELECT EXTRACTJSONFIELD(used_service_units[0],'$.amount') FROM usage;
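The two-step access used in that query (index into the array first, then pull a field out of the selected element) can be sketched outside KSQL. This is only an illustration in Python, not KSQL itself: `extract_json_field` is a hypothetical stand-in for EXTRACTJSONFIELD that handles top-level fields only, and the event is a trimmed copy of the one in the question.

```python
import json

# Trimmed copy of the event from the question (only the fields used here).
event = json.loads("""
{
  "event_type": "data",
  "used_service_units": [
    {"amount": 2412739, "currency": null, "unit_of_measure": "bytes"}
  ]
}
""")

# Model ARRAY<VARCHAR>: every array element is kept as a JSON string.
units_as_strings = [json.dumps(unit) for unit in event["used_service_units"]]


def extract_json_field(json_text, field):
    """Hypothetical stand-in for EXTRACTJSONFIELD(col, '$.field'); top-level fields only."""
    return json.loads(json_text).get(field)


# Rough equivalent of EXTRACTJSONFIELD(used_service_units[0], '$.amount')
first_amount = extract_json_field(units_as_strings[0], "amount")
print(first_amount)
```

The point of the sketch is the order of operations: the array index is applied to the column, and the JSON path is applied to the single element that comes out.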

Related

How to query JSON data in Athena with an @ symbol in the key name and duplicate keys

The data I have been tasked to query is structured like this:
{
"@timestamp": "2022-11-17T21:00:19.191+00:00",
"@version": 1,
"message": "log message",
"logger_name": "com.logger.name",
"thread_name": "tomcat-thread-13",
"level": "INFO",
"level_value": 20000,
"application_name": "app_name",
"vpc": "vpc_name",
"region": "eu-west-1",
"aid": "ffffffff-ffff-ffff-ffff-ffffffffffff",
"account": "prod",
"rq": "ffffffff-ffff-ffff-ffff-ffffffffffff",
"log_shipper": "firehose",
"application_name": "app_name",
"account": "prod",
"region": "eu-west-1"
}
As you can see, there are some duplicate keys in here, so both the Hive and OpenX JSON SerDes throw an error and won't query it at all.
I've created a table using the Ion SerDe, which can read the data, but the @timestamp and @version fields are always blank; all the other fields are read correctly.
The initial table definition I had was this:
CREATE EXTERNAL TABLE firehose_logs_pe (
`@timestamp` STRING,
`@version` STRING,
<other columns>
)
ROW FORMAT SERDE
'com.amazon.ionhiveserde.IonHiveSerDe'
STORED AS ION
LOCATION 's3://s3-bucket-name/folder/'
I also tried to rename the fields and use a path extractor to get the values, like this:
CREATE EXTERNAL TABLE firehose_logs_pe (
ts STRING,
version STRING,
<other columns>
)
ROW FORMAT SERDE
'com.amazon.ionhiveserde.IonHiveSerDe'
WITH SERDEPROPERTIES (
'ion.ts.path_extractor' = '(`@timestamp`)',
'ion.version.path_extractor' = '(`@version`)'
)
STORED AS ION
LOCATION 's3://s3-bucket-name/folder/'
However, the values of the ts and version fields are still empty. The query also seems to run slower with the path extractors.
Is there any way to query this data in this format with Athena? As a test, I did a find and replace on one of the JSON files and removed the @, at which point everything worked as it should. However, this is not a practical solution when I have about 20 TB of data to query in hundreds of millions of files.
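The duplicate-key part of the problem can be checked on a single line before involving any SerDe. The sketch below is an assumption-laden illustration in Python, not an Athena feature: `duplicate_keys` and `dedupe` are hypothetical helper names, and the sample line is a shortened version of the record above. It relies on the fact that Python's `json` module accepts duplicate keys and keeps the last value.

```python
import json
from collections import Counter


def duplicate_keys(line):
    """Return the top-level key names that occur more than once."""
    # object_pairs_hook=list preserves every (key, value) pair, duplicates included.
    pairs = json.loads(line, object_pairs_hook=list)
    counts = Counter(key for key, _ in pairs)
    return sorted(key for key, n in counts.items() if n > 1)


def dedupe(line):
    """Re-serialise a line so each key appears once (the last value wins)."""
    return json.dumps(json.loads(line))


# Shortened sample with the same kind of duplication as the record above.
sample = (
    '{"message": "log message", "region": "eu-west-1", '
    '"account": "prod", "account": "prod", "region": "eu-west-1"}'
)

print(duplicate_keys(sample))
print(dedupe(sample))
```

Rewriting the data this way has the same cost problem as the find-and-replace test described above, so it is mainly useful for confirming which keys are duplicated in a sample of files.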

Querying on mysql json array using mysql workbench

Here is my json data:
{
"TransactionId": "1",
"PersonApplicant": [
{
"PersonalId": "1005",
"ApplicantPhone": [
{
"PhoneType": "LANDLINE",
"PhoneNumber": "8085063644",
"IsPrimaryPhone": true
}
]
},
{
"PersonalId": "1006",
"ApplicantPhone": [
{
"PhoneType": "LANDLINE",
"PhoneNumber": "9643645364",
"IsPrimaryPhone": true
},
{
"PhoneType": "HOME",
"PhoneNumber": "987654321",
"IsPrimaryPhone": false
}
]
}
]
}
I want to get the phone numbers of the people whose phone type is LANDLINE.
How can I do that?
I tried this approach:
#find phoneNumber when phoneType='LANDLINE'
SELECT
@path_to_name := json_unquote(json_search(applicationData, 'one', 'LANDLINE')) AS path_to_name,
@path_to_parent := trim(TRAILING '.PhoneType' from @path_to_name) AS path_to_parent,
@event_object := json_extract(applicationData, @path_to_parent) as event_object,
json_unquote(json_extract(@event_object, '$.PhoneNumber')) as PhoneNumber
FROM application;
The issue with this is that I am using 'one', so I only get a single result, but my JSON contains two people who have a phone of type LANDLINE.
Using json_search with 'all', I get an array of paths, and I cannot work out how to turn the elements of that array into rows from which I can extract values.
SELECT
@path_to_name := json_unquote(json_search(applicationData, 'all', 'LANDLINE')) from application;
Result:
As you can see, in the 3rd and 4th rows I get two paths back as an array.
How do I store this data to get the appropriate result?
I also tried one more query, but I was not able to retrieve results for an array of data.
I cannot use a stored procedure, and I have to use MySQL Workbench.
Please note that I am new to this, so I don't know how to approach more complex queries, for example retrieving the id of every person who has a phone of type LANDLINE (multiple people in a single array).
SELECT test.id, jsontable.*
FROM test
CROSS JOIN JSON_TABLE(test.data,
'$.PersonApplicant[*]'
COLUMNS ( PersonalId INT PATH '$.PersonalId',
PhoneType VARCHAR(255) PATH '$.ApplicantPhone[0].PhoneType',
PhoneNumber VARCHAR(255) PATH '$.ApplicantPhone[0].PhoneNumber')) jsontable
WHERE jsontable.PhoneType = 'LANDLINE';
https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=4089207ccfba5068a48e06b52865e759
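Note that the query above reads only ApplicantPhone[0] for each applicant, so a landline stored as a second or later phone would not be returned. The intended result (one row per matching phone, across all applicants and all of their phones) can be sketched in Python as a reference for what any SQL solution should produce. `phones_of_type` is a hypothetical helper, and the data is the document from the question.

```python
import json

application_data = json.loads("""
{
  "TransactionId": "1",
  "PersonApplicant": [
    {
      "PersonalId": "1005",
      "ApplicantPhone": [
        {"PhoneType": "LANDLINE", "PhoneNumber": "8085063644", "IsPrimaryPhone": true}
      ]
    },
    {
      "PersonalId": "1006",
      "ApplicantPhone": [
        {"PhoneType": "LANDLINE", "PhoneNumber": "9643645364", "IsPrimaryPhone": true},
        {"PhoneType": "HOME", "PhoneNumber": "987654321", "IsPrimaryPhone": false}
      ]
    }
  ]
}
""")


def phones_of_type(doc, phone_type):
    """Return (PersonalId, PhoneNumber) for every phone of the given type."""
    rows = []
    for person in doc["PersonApplicant"]:        # outer array: applicants
        for phone in person["ApplicantPhone"]:   # inner array: all phones, not just [0]
            if phone["PhoneType"] == phone_type:
                rows.append((person["PersonalId"], phone["PhoneNumber"]))
    return rows


landlines = phones_of_type(application_data, "LANDLINE")
print(landlines)
```

The nested loop corresponds to expanding both arrays into rows before filtering, rather than looking at a fixed index of the inner array.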

MS SQL Query a field containing JSON

I have the following JSON in a SQL field in a table:
{
"type": "info",
"date": "2019/11/12 14:28:51",
"state": {
"6ee8587f-3b8c-4e5c-89a9-9f04752607f0": {
"state": "open",
"color": "#0000ff"
}
},
...
}
I query this in MS SQL using the following:
SELECT
JSON_VALUE(json_data, '$.type') AS msg_type
,JSON_VALUE(json_data, '$."date"') AS event_date
,JSON_QUERY(json_data, '$.state."6ee8587f-3b8c-4e5c-89a9-9f04752607f0".state') AS json_state
,JSON_QUERY(json_data, '$.state."6ee8587f-3b8c-4e5c-89a9-9f04752607f0".color') AS json_color
FROM
[dbo].[tbl_json_dump]
To get the date (a reserved word) back, I have to write the field name as $."date".
I cannot get the data back for the state or color fields, and I think it is because they are nested under "6ee8587f-3b8c-4e5c-89a9-9f04752607f0", because when I query:
JSON_QUERY(json_data, '$.state."6ee8587f-3b8c-4e5c-89a9-9f04752607f0"') AS json_state
I get the object back -
{"state":"open","color":"#0000ff"}
but using
JSON_QUERY(json_data, '$.state."6ee8587f-3b8c-4e5c-89a9-9f04752607f0".state') AS json_state
does not work.
Any suggestions on what I'm doing wrong?
Just replace JSON_QUERY with JSON_VALUE, since you're interested in getting the value.
JSON_QUERY is supposed to return a JSON fragment and is designed to work on objects and arrays, not scalar values.
Salman A already provided the answer. Just to add a few points:
JSON_VALUE() - Extracts a scalar value from a JSON string.
JSON_QUERY() - Extracts an object or an array from a JSON string.
If you look at the syntax, JSON_QUERY ( expression [ , path ] ) and JSON_VALUE ( expression , path ) are more or less the same, except for the square brackets around path in JSON_QUERY, which mean the path is optional. That is because JSON_QUERY() can return the whole JSON document if required.
As for the return types:
JSON_VALUE() returns a single text value of type nvarchar(4000)
JSON_QUERY() returns a JSON fragment of type nvarchar(max)
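The scalar-versus-fragment split can be mimicked in a few lines of Python. This is a rough model under stated assumptions, not SQL Server's implementation: `json_value` and `json_query` here are hypothetical look-alikes, paths are given as lists of keys instead of '$.a.b' strings, a mismatch yields None (standing in for NULL), and scalars keep their native type rather than being converted to text.

```python
import json

KEY = "6ee8587f-3b8c-4e5c-89a9-9f04752607f0"

doc = json.loads("""
{
  "type": "info",
  "date": "2019/11/12 14:28:51",
  "state": {
    "6ee8587f-3b8c-4e5c-89a9-9f04752607f0": {"state": "open", "color": "#0000ff"}
  }
}
""")


def walk(document, path):
    """Follow a list of object keys; return None if any step is missing."""
    node = document
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def json_value(document, path):
    """Model of JSON_VALUE: scalars only, None for objects and arrays."""
    node = walk(document, path)
    return None if isinstance(node, (dict, list)) else node


def json_query(document, path):
    """Model of JSON_QUERY: objects and arrays only (as JSON text), None for scalars."""
    node = walk(document, path)
    return json.dumps(node) if isinstance(node, (dict, list)) else None


print(json_value(doc, ["state", KEY, "state"]))  # scalar leaf
print(json_query(doc, ["state", KEY, "state"]))  # scalar leaf, wrong function
print(json_value(doc, ["state", KEY]))           # object, wrong function
print(json_query(doc, ["state", KEY]))           # object
```

Each path is answered by exactly one of the two functions, which matches the behaviour described in the question: the nested object comes back from JSON_QUERY, while the leaf values need JSON_VALUE.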
Overall comparison
DECLARE @data NVARCHAR(4000)
SET @data=N'{
"type": "info",
"date": "2019/11/12 14:28:51",
"state": {
"6ee8587f-3b8c-4e5c-89a9-9f04752607f0": {
"state": "open",
"color": "#0000ff"
}
}
}'
SELECT
JSON_VALUE(@data,'$.state."6ee8587f-3b8c-4e5c-89a9-9f04752607f0"') AS 'JSON_VALUE_FAILED',
JSON_QUERY(@data,'$.state."6ee8587f-3b8c-4e5c-89a9-9f04752607f0"') AS 'JSON_QUERY_SUCCEED',
JSON_VALUE(@data,'$.state."6ee8587f-3b8c-4e5c-89a9-9f04752607f0".state') AS 'JSON_VALUE_SUCCEED',
JSON_QUERY(@data,'$.state."6ee8587f-3b8c-4e5c-89a9-9f04752607f0".state') AS 'JSON_QUERY_FAILED';
You may try another possible (more complicated) approach, which parses all nested JSON objects.
Table:
CREATE TABLE Data (
JsonData nvarchar(max)
)
INSERT INTO Data
(JsonData)
VALUES
(N'{
"type": "info",
"date": "2019/11/12 14:28:51",
"state": {
"6ee8587f-3b8c-4e5c-89a9-9f04752607f0": {
"state": "open",
"color": "#0000ff"
},
"6ee8587f-3b8c-4e5c-89a9-9f04752607f1": {
"state": "open",
"color": "#0000ff"
}
}
}')
Statement:
SELECT
j1.[type], j1.[date], j2.[key], j3.state, j3.color
FROM Data d
CROSS APPLY OPENJSON(d.JsonData) WITH (
[type] nvarchar(100) '$.type',
[date] datetime '$.date',
[state] nvarchar(max) '$.state' AS JSON
) j1
CROSS APPLY OPENJSON(j1.state) j2
CROSS APPLY OPENJSON(j2.[value]) WITH (
state nvarchar(10) '$.state',
color nvarchar(10) '$.color'
) j3
Result:
type  date                 key                                   state  color
info  12/11/2019 14:28:51  6ee8587f-3b8c-4e5c-89a9-9f04752607f0  open   #0000ff
info  12/11/2019 14:28:51  6ee8587f-3b8c-4e5c-89a9-9f04752607f1  open   #0000ff
Notes:
If the input JSON has only one key "6ee8587f-3b8c-4e5c-89a9-9f04752607f0" in the "state" JSON object, you may get the value with JSON_VALUE() using the correct path $.state."6ee8587f-3b8c-4e5c-89a9-9f04752607f0".state.
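What the chained OPENJSON calls achieve (one output row per key found under "state", without knowing the key names in advance) can be sketched in Python for comparison. This is an illustration only; the variable names are made up for the example and the input is the same document inserted into the Data table above.

```python
import json

doc = json.loads("""
{
  "type": "info",
  "date": "2019/11/12 14:28:51",
  "state": {
    "6ee8587f-3b8c-4e5c-89a9-9f04752607f0": {"state": "open", "color": "#0000ff"},
    "6ee8587f-3b8c-4e5c-89a9-9f04752607f1": {"state": "open", "color": "#0000ff"}
  }
}
""")

# One row per dynamic key: the key plays the role of j2.[key],
# and the nested object is unpacked like the WITH clause of j3.
rows = [
    (doc["type"], doc["date"], key, entry["state"], entry["color"])
    for key, entry in doc["state"].items()
]

for row in rows:
    print(row)
```

The top-level fields are repeated on every row, which is the same effect CROSS APPLY has in the statement above.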

U-SQL - Extract data from complex json object

So I have a lot of json files structured like this:
{
"Id": "2551faee-20e5-41e4-a7e6-57bd20b02a22",
"Timestamp": "2016-12-06T08:09:57.5541438+01:00",
"EventEntry": {
"EventId": 1,
"Payload": [
"1a3e0c9e-ef69-4c6a-ac8c-9b2de2fbc701",
"DHS.PlanCare.Business.BusinessLogic.VisionModels.VisionModelServiceWithoutUnitOfWork.FetchVisionModelsForClientOnReferenceDateAsync(System.Int64 clientId, System.DateTime referenceDate, System.Threading.CancellationToken cancellationToken)",
25,
"DHS.PlanCare.Business.BusinessLogic.VisionModels.VisionModelServiceWithoutUnitOfWork+<FetchVisionModelsForClientOnReferenceDateAsync>d__11.MoveNext\r\nDHS.PlanCare.Core.Extensions.IQueryableExtensions+<ExecuteAndThrowTaskCancelledWhenRequestedAsync>d__16`1.MoveNext\r\n",
false,
"2197, 6-12-2016 0:00:00, System.Threading.CancellationToken"
],
"EventName": "Duration",
"KeyWordsDescription": "Duration",
"PayloadSchema": [
"instanceSessionId",
"member",
"durationInMilliseconds",
"minimalStacktrace",
"hasFailed",
"parameters"
]
},
"Session": {
"SessionId": "0016e54b-6c4a-48bd-9813-39bb040f7736",
"EnvironmentId": "C15E535B8D0BD9EF63E39045F1859C98FEDD47F2",
"OrganisationId": "AC6752D4-883D-42EE-9FEA-F9AE26978E54"
}
}
How can I create a U-SQL query that outputs the following?
Id,
Timestamp,
EventEntry.EventId and
EventEntry.Payload[2] (value 25 in the example above)
I can't figure out how to extend my query:
@extract =
EXTRACT
Timestamp DateTime
FROM @"wasb://xxx/2016/12/06/0016e54b-6c4a-48bd-9813-39bb040f7736/yyy/{*}/{*}.json"
USING new Microsoft.Analytics.Samples.Formats.Json.JsonExtractor();
@res =
SELECT Timestamp
FROM @extract;
OUTPUT @res TO "/output/result.csv" USING Outputters.Csv();
I have seen some examples like:
U-SQL Unable to extract data from JSON file => this only queries one level of the document, I need data from multiple levels.
U-SQL - Extract data from json-array => this only queries one level of the document, I need data from multiple levels.
JSONTuple supports multiple JSONPaths in one go.
@extract =
EXTRACT
Id String,
Timestamp DateTime,
EventEntry String
FROM @"..."
USING new Microsoft.Analytics.Samples.Formats.Json.JsonExtractor();
@res =
SELECT Id, Timestamp, EventEntry,
Microsoft.Analytics.Samples.Formats.Json.JsonFunctions.JsonTuple(EventEntry,
"EventId", "Payload[2]") AS Event
FROM @extract;
@res =
SELECT Id,
Timestamp,
Event["EventId"] AS EventId,
Event["Payload[2]"] AS Something
FROM @res;
You may want to look at this Git example: https://github.com/Azure/usql/blob/master/Examples/JsonSample/JsonSample/NestedJsonParsing.usql
It takes two disparate data elements and combines them, much like your Payload and PayloadSchema. If you create key-value pairs as in the "Donut" or "Cake and Batter" examples, you may be able to match the schema up to the payload and use the CROSS APPLY EXPLODE function.
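The idea of matching the schema up to the payload can be shown compactly in Python. This is only a sketch of the pairing itself, not U-SQL: the event below is an abbreviated copy of the one in the question, with the long member, stack trace and parameter strings replaced by placeholders.

```python
import json

# Abbreviated copy of the event; placeholder strings stand in for the long values.
event = json.loads("""
{
  "Id": "2551faee-20e5-41e4-a7e6-57bd20b02a22",
  "Timestamp": "2016-12-06T08:09:57.5541438+01:00",
  "EventEntry": {
    "EventId": 1,
    "Payload": [
      "1a3e0c9e-ef69-4c6a-ac8c-9b2de2fbc701",
      "<member>",
      25,
      "<stacktrace>",
      false,
      "<parameters>"
    ],
    "PayloadSchema": [
      "instanceSessionId",
      "member",
      "durationInMilliseconds",
      "minimalStacktrace",
      "hasFailed",
      "parameters"
    ]
  }
}
""")

entry = event["EventEntry"]

# Pair each schema name with the payload value at the same position.
named_payload = dict(zip(entry["PayloadSchema"], entry["Payload"]))

row = (
    event["Id"],
    event["Timestamp"],
    entry["EventId"],
    named_payload["durationInMilliseconds"],  # same value as Payload[2]
)
print(row)
```

Looking values up by schema name instead of by position keeps the query correct if the order of the payload elements ever changes.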

Load complex json in hive using jsonserde

I am trying to build a table in Hive for the following JSON:
{
"business_id": "vcNAWiLM4dR7D2nwwJ7nCA",
"hours": {
"Tuesday": {
"close": "17:00",
"open": "08:00"
},
"Friday": {
"close": "17:00",
"open": "08:00"
}
},
"open": true,
"categories": [
"Doctors",
"Health & Medical"
],
"review_count": 9,
"name": "Eric Goldberg, MD",
"neighborhoods": [],
"attributes": {
"By Appointment Only": true,
"Accepts Credit Cards": true,
"Good For Groups": 1
},
"type": "business"
}
I can create a table using the following DDL; however, I get an exception while querying that table.
CREATE TABLE IF NOT EXISTS business (
business_id string,
hours map<string,string>,
open boolean,
categories array<string>,
review_count int,
name string,
neighborhoods array<string>,
attributes map<string,string>,
type string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde';
The exception while retrieving data is "ClassCast:Cant cast jsoanarray to json object". What is the correct schema for this JSON? Is there any tool which can help me generate the correct schema for a given JSON document to be used with the JSON SerDe?
It looks to me that the problem is hours, which you defined as hours map<string,string> but which should be a map<string,map<string,string>> instead.
There's a tool you can use to generate the Hive table definition automatically from your JSON data: https://github.com/quux00/hive-json-schema
but you may want to adjust its output, because when it encounters a JSON object (anything between {}), the tool can't know whether to translate it to a Hive map or to a struct.
On your data, the tool gives me this:
CREATE TABLE x (
attributes struct<accepts credit cards:boolean,
by appointment only:boolean, good for groups:int>,
business_id string,
categories array<string>,
hours map<string:struct<close:string, open:string>
name string,
neighborhoods array<string>,
open boolean,
review_count int,
type string
)
but it looks like you want something like this:
CREATE TABLE x (
attributes map<string,string>,
business_id string,
categories array<string>,
hours map<string,struct<close:string, open:string>>,
name string,
neighborhoods array<string>,
open boolean,
review_count int,
type string
) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE;
hive> load data local inpath 'json.data' overwrite into table x;
hive> Table default.x stats: [numFiles=1, numRows=0, totalSize=416,rawDataSize=0]
OK
hive> select * from x;
OK
{"accepts credit cards":"true","by appointment only":"true",
"good for groups":"1"}
vcNAWiLM4dR7D2nwwJ7nCA
["Doctors","Health & Medical"]
{"tuesday":{"close":"17:00","open":"08:00"},
"friday":{"close":"17:00","open":"08:00"}}
Eric Goldberg, MD ["HELLO"] true 9 business
Time taken: 0.335 seconds, Fetched: 1 row(s)
hive>
A few notes though:
Notice I used a different JSON SerDe because I don't have the one you used on my system. I used this one; I like it better because, well, I wrote it. But the create statement should work just as well with the other SerDe.
You may want to convert some of those maps to structs, as they may be more convenient to query. For instance, attributes could be a struct, but you'd need to map the names with a space in them, like accepts credit cards. My SerDe allows you to map a JSON attribute to a different Hive column name. That is also needed when the JSON uses an attribute that is a Hive keyword like 'timestamp' or 'create'.
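The map-versus-struct ambiguity mentioned above is easy to see in a tiny type-inference sketch. This is not the algorithm of the hive-json-schema tool, just a minimal illustration in Python with hypothetical function names; it assumes empty arrays and nulls default to string, and that a map is only sensible when all values share one type.

```python
def hive_type(value):
    """Infer a Hive type for a parsed JSON value, treating objects as structs."""
    if isinstance(value, bool):  # bool must be tested before int (bool subclasses int)
        return "boolean"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "double"
    if isinstance(value, list):
        # Assumption: an empty array gives no type information, default to string.
        return "array<%s>" % (hive_type(value[0]) if value else "string")
    if isinstance(value, dict):
        fields = ",".join("%s:%s" % (k, hive_type(v)) for k, v in value.items())
        return "struct<%s>" % fields
    return "string"  # strings, and nulls by assumption


def hive_map_type(obj):
    """Infer a Hive map type for an object whose values all share one type."""
    value_types = {hive_type(v) for v in obj.values()}
    if len(value_types) != 1:
        raise ValueError("values have different types; a struct fits better")
    return "map<string,%s>" % value_types.pop()


hours = {
    "Tuesday": {"close": "17:00", "open": "08:00"},
    "Friday": {"close": "17:00", "open": "08:00"},
}

print(hive_type(hours))      # every day becomes a named struct field
print(hive_map_type(hours))  # day names become map keys
```

Both results describe the same data; which one to use depends on whether the set of keys is fixed (struct) or open-ended (map), and that is a decision the data alone cannot make.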