Loading JSON file with repeating elements into Hive table

Given this simple JSON file:
{
  "EVT": {
    "EVT_ID": "12345",
    "LINES": {
      "LINE": {
        "LINE_NUM": 1,
        "AMT": 100,
        "EVT_DT": "2018-01-01"
      },
      "LINE": {
        "LINE_NUM": 2,
        "AMT": 150,
        "EVT_DT": "2018-01-02"
      }
    }
  }
}
We need to load that into a Hive table. The ultimate goal is to flatten the JSON, something like this:
+--------+----------+-----+------------+
| EVT_ID | Line_Num | Amt | Evt_Dt     |
+--------+----------+-----+------------+
| 12345  | 1        | 100 | 2018-01-01 |
| 12345  | 2        | 150 | 2018-01-02 |
+--------+----------+-----+------------+
Here's my current DDL for the table:
create table foo.bar (
  `EVT` struct<
    `EVT_ID`:string,
    `LINES`:struct<
      LINE: struct<`LINE_NUM`: int, `AMT`:int, `EVT_DT`:string>
    >
  >)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';
It seems like the second "LINE" is overwriting the first. A simple select * from the table returns:
{"evt_id":"12345","lines":{"line":{"line_num":2,"amt":150,"evt_dt":"2018-01-02"}}}
What am I doing wrong?

The JSON and the table definition are wrong. "Repeating elements" means an array. LINES should be array<struct>, not struct<struct> (note the square brackets):
{
  "EVT": {
    "EVT_ID": "12345",
    "LINES": [
      {
        "LINE_NUM": 1,
        "AMT": 100,
        "EVT_DT": "2018-01-01"
      },
      {
        "LINE_NUM": 2,
        "AMT": 150,
        "EVT_DT": "2018-01-02"
      }
    ]
  }
}
And you do not need the "LINE" key either, because each object is just an array element.
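A corrected DDL along these lines should work; this is a sketch based on the corrected JSON above, reusing the table name from the question:
create table foo.bar (
  `EVT` struct<
    `EVT_ID`:string,
    `LINES`:array<
      struct<`LINE_NUM`:int, `AMT`:int, `EVT_DT`:string>
    >
  >)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';
To get the flattened result from the question, explode the array with a LATERAL VIEW:
SELECT b.evt.evt_id, line.line_num, line.amt, line.evt_dt
FROM foo.bar b
LATERAL VIEW explode(b.evt.lines) l AS line;
Note that this SerDe typically expects each JSON record to be on a single line of the file.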

Related

Parse nested JSON to Splunk query which has string

I have multiple results for a macAddress which contain the device details.
This is the sample data:
"data": {
"a1:b2:c3:d4:11:22": {
"deviceIcons": {
"type": "Phone",
"icons": {
"3x": null,
"2x": "image.png"
}
},
"advancedDeviceId": {
"agentId": 113,
"partnerAgentId": "131",
"dhcpHostname": "Galaxy-J7",
"mac": "a1:b2:c3:d4:11:22",
"lastSeen": 12,
"model": "Android Phoe",
"id": 1
}
},
"a0:b2:c3:d4:11:22": {
"deviceIcons": {
"type": "Phone",
"icons": {
"3x": null,
"2x": "image.png"
}
},
"advancedDeviceId": {
"agentId": 113,
"partnerAgentId": "131",
"dhcpHostname": "Galaxy",
"mac": "a0:b2:c3:d4:11:22",
"lastSeen": 12,
"model": "Android Phoe",
"id": 1
}
}
}
}
How can I query in Splunk, for all results like the above sample, to get the advancedDeviceId.model and advancedDeviceId.id in tabular format?
I think this will do what you want
| spath
| untable _time column value
| rex field=column "data.(?<address>[^.]+)\.advancedDeviceId\.(?<item>[^.]+)"
| table _time address item value
| eval {item}=value
| stats list(model) as model
list(id) as id
list(dhcpHostname) as dhcpHostname
list(mac) as mac
by address
Here is a "run anywhere" example that has two events each with two addresses:
| makeresults
| eval _raw="{\"data\":{\"a1:b2:c3:d4:11:21\":{\"deviceIcons\":{\"type\":\"Phone\",\"icons\":{\"3x\":null,\"2x\":\"image.png\"}},\"advancedDeviceId\":{\"agentId\":113,\"partnerAgentId\":\"131\",\"dhcpHostname\":\"Galaxy-J7\",\"mac\":\"a1:b2:c3:d4:11:21\",\"lastSeen\":12,\"model\":\"Android Phoe\",\"id\":1}},\"a0:b2:c3:d4:11:22\":{\"deviceIcons\":{\"type\":\"Phone\",\"icons\":{\"3x\":null,\"2x\":\"image.png\"}},\"advancedDeviceId\":{\"agentId\":113,\"partnerAgentId\":\"131\",\"dhcpHostname\":\"iPhone 6\",\"mac\":\"a0:b2:c3:d4:11:22\",\"lastSeen\":12,\"model\":\"Apple Phoe\",\"id\":2}}}}"
| append [
| makeresults
| eval _raw="{\"data\":{\"b1:b2:c3:d4:11:23\":{\"deviceIcons\":{\"type\":\"Phone\",\"icons\":{\"3x\":null,\"2x\":\"image.png\"}},\"advancedDeviceId\":{\"agentId\":113,\"partnerAgentId\":\"131\",\"dhcpHostname\":\"Nokia\",\"mac\":\"b1:b2:c3:d4:11:23\",\"lastSeen\":12,\"model\":\"Symbian Phoe\",\"id\":3}},\"b0:b2:c3:d4:11:24\":{\"deviceIcons\":{\"type\":\"Phone\",\"icons\":{\"3x\":null,\"2x\":\"image.png\"}},\"advancedDeviceId\":{\"agentId\":113,\"partnerAgentId\":\"131\",\"dhcpHostname\":\"Windows\",\"mac\":\"b0:b2:c3:d4:11:24\",\"lastSeen\":12,\"model\":\"Windows Phoe\",\"id\":4}}}}"
]
| spath
| untable _time column value
| rex field=column "data.(?<address>[^.]+)\.advancedDeviceId\.(?<item>[^.]+)"
| table _time address item value
| eval {item}=value
| stats list(model) as model
list(id) as id
list(dhcpHostname) as dhcpHostname
list(mac) as mac
by address

Create a composite object from a complex json object using jq

I have a complex configuration file in JSON:
{
  "config": {
    ...,
    "extra": {
      ...
      "auth_namespace.com": {
        ...
        "name": "some_name",
        "id": 1,
        ...
      }
    },
    ...,
    "endpoints": [
      { ...,
        "extra": {
          "namespace_1.com": {...},
          "namespace_auth.com": { "scope": "scope1" }
      }},
      { ...
        # object without "extra" property
        ...
      },
      ...,
      { ...
        "extra": {
          "namespace_1.com": {...},
          "namespace_auth.com": { "scope": "scope2" }
      }},
      { ...
        "extra": {
          # scopes may repeat
          "namespace_auth.com": { "scope": "scope2" }
      }}
    ]
  }
}
And I want to get an output object with the properties "name", "id", and "scopes", where "scopes" is an array of unique values.
Something like this:
{
"name": "some_name",
"id": 1,
"scopes": ["scope1", "scope2" ... "scopeN"]
}
I can get these properties separately, but I don't know how to combine them.
[
  .config |
  (
    .extra["auth_namespace.com"] |
    select(.name) |
    {name, id}
  ) as $name_id |
  .endpoints[] |
  .extra["namespace_auth.com"].scope |
  select(.)
] | unique | {scopes: .}
Perhaps the following is closer to what you're looking for:
.config
| (.extra."auth_namespace.com" | {id, name})
+ {scopes: .endpoints
| map( select(has("extra"))
| .extra."namespace_auth.com"
| select(has("scope"))
| .scope )
| unique }
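For reference, assuming the filter above is saved to a file (filter.jq here is just an illustrative name), it can be run against the configuration with:
jq -f filter.jq config.json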
Well, I found a solution. It's ugly, but it works.
Would be grateful if someone could write a more elegant version.
.config
| (
    .endpoints
    | map(.extra["namespace_auth.com"] | select(.scope) | .[])
    | unique
  ) as $s
| .extra["auth_namespace.com"] | select(.name)
| {name, id, scopes: $s}

Karate API framework: how to match the response values with the table columns?

I have the below API response sample:
{
  "items": [
    {
      "id": 11,
      "name": "SMITH",
      "prefix": "SAM",
      "code": "SSO"
    },
    {
      "id": 10,
      "name": "James",
      "prefix": "JAM",
      "code": "BBC"
    }
  ]
}
As per the above response, my test says that whenever I hit the API, ID 11 should be SMITH and ID 10 should be James.
So I thought to store this in a table and assert against the actual response:
* table person
| id | name |
| 11 | SMITH |
| 10 | James |
| 9 | RIO |
Now how would I match them one by one? For example, first it parses the first ID and first name from the API response and matches them against the table's first ID and first name.
Please share any convenient way of doing this in Karate.
There are a few possible ways, here is one:
* def lookup = { 11: 'SMITH', 10: 'James' }
* def items =
"""
[
{
"id":11,
"name":"SMITH",
"prefix":"SAM",
"code":"SSO"
},
{
"id":10,
"name":"James",
"prefix":"JAM",
"code":"BBC"
}
]
"""
* match each items contains { name: "#(lookup[_$.id+''])" }
And you already know how to use a table instead of JSON.
Please read the docs and other Stack Overflow answers to get more ideas.
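For completeness, here is a sketch of the table-driven variant; it assumes items holds the items array from the response (as in the answer above) and builds the same lookup map from the table with karate.forEach:
* table person
  | id | name    |
  | 11 | 'SMITH' |
  | 10 | 'James' |
* def lookup = {}
* eval karate.forEach(person, function(row){ lookup[row.id + ''] = row.name })
* match each items contains { name: "#(lookup[_$.id+''])" }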

Parse a table with Unicode chars in variable names from JSON with SAS Base

I've run into a problem parsing JSON with Unicode chars in the variable names.
So, I have the following JSON (example):
{
"SASJSONExport":"1.0",
"SASTableData+TEST":[
{
"\u041f\u0435\u0440\u0435\u043c\u0435\u043d\u043d\u0430\u044f":2,
"\u0421\u0440\u0435\u0434\u043d\u0435\u0435":4,
"\u0421\u0442\u0440\u043e\u043a\u0430":"\u0427\u0442\u043e\u002d\u0442\u043e\u0031"
},
{
"\u041f\u0435\u0440\u0435\u043c\u0435\u043d\u043d\u0430\u044f":2,
"\u0421\u0440\u0435\u0434\u043d\u0435\u0435":2,
"\u0421\u0442\u0440\u043e\u043a\u0430":"\u0427\u0442\u043e\u002d\u0442\u043e\u0032"
},
{
"\u041f\u0435\u0440\u0435\u043c\u0435\u043d\u043d\u0430\u044f":1,
"\u0421\u0440\u0435\u0434\u043d\u0435\u0435":42,
"\u0421\u0442\u0440\u043e\u043a\u0430":"\u0427\u0442\u043e\u002d\u0442\u043e\u0033"
}
]
}
To parse the table from the JSON I use the SAS JSON engine:
libname jsonfl JSON fileref=injson ;
The code above decodes the chars in the cell values, but the variable names look like missing values:
+--------------+---------------------------+------------+---------+---------+
| ordinal_root | ordinal_SASTableData_TEST | __________ | _______ | ______  |
+--------------+---------------------------+------------+---------+---------+
| 1            | 1                         | 2          | 4       | Что-то1 |
| 1            | 2                         | 2          | 2       | Что-то2 |
| 1            | 3                         | 1          | 42      | Что-то3 |
+--------------+---------------------------+------------+---------+---------+
The header must look like:
+--------------+---------------------------+------------+---------+---------+
| ordinal_root | ordinal_SASTableData_TEST | Переменная | Среднее | Строка |
+--------------+---------------------------+------------+---------+---------+
So I've decided to replace the Unicode variable names with names like DIM_N_.
And for that I must find all strings that match the following regexp: /([\s\w\d\\]+)\"\:/
But to extract those strings from the JSON I need to set the following chars as delimiters: '{','}','[',']',','.
However, if I set those chars as delimiters, I won't be able to reassemble the JSON afterwards.
So I've decided to insert the char ~ before each of them and use ~ as the delimiter.
data delim;
infile injson lrecl=1073741823 nopad;
file delim;
input char1 $char1. ##;
if char1 in ('{','}','[',']',',') then
put '7E'x;
put char1 $CHAR1. ##;
run;
I get this invalid JSON file:
~
{"SASJSONExport":"1.0"~
,"SASTableData+TEST":~
[ ~
{"\u0056\u0061\u0072":2~
,"\u006d\u0065\u0061\u006e":4~
,"\u004e\u0061\u006d\u0065":"\u0073\u006d\u0074\u0068\u0031"~
}~
, ~
{"\u0056\u0061\u0072":2~
,"\u006d\u0065\u0061\u006e":2~
,"\u004e\u0061\u006d\u0065":"\u0073\u006d\u0074\u0068\u0032"~
}~
, ~
{"\u0056\u0061\u0072":1~
,"\u006d\u0065\u0061\u006e":42~
,"\u004e\u0061\u006d\u0065":"\u0073\u006d\u0074\u0068\u0033"~
} ~
]~
}
So as the next step I parse the JSON, using ~ as the delimiter:
data transfer;
length column $2000;
retain r;
infile delim delimiter='7E'x nopad;
input char1 : $4000. ##;
r = prxparse('/([\s\w\d\\]+)\"\:/');
pos = prxmatch(r,char1);
column = prxposn(r,1,char1);
n= _n_;
run;
It works... but I feel these are bad practices, and the approach has its limits.
UPD1
Setting the options
options validfmtname=long validmemname=extend validvarname=any;
returns:
+--------------+---------------------------+----------------------------+---------+--------------+
| ordinal_root | ordinal_SASTableData_TEST | __________                 | _______ | ______       |
+--------------+---------------------------+----------------------------+---------+--------------+
| 1            | 1                         | авфа2 фвафв = фвыа - тфвыа | 4       | Что-то1 ,,,, |
| 1            | 2                         | авфа2 фвафв = фвыа - тфвыа | 2       | Что-то2      |
| 1            | 3                         | авфа2 фвафв = фвыа - тфвыа | 2017    | Что-то3      |
+--------------+---------------------------+----------------------------+---------+--------------+
So my questions are:
Can I decode the whole file without the infile statement?
Can I use an infile delimiter, but set some option so that the delimiter is not removed?
Adequate criticism is welcome.
UPD: I came to a solution without having to manually edit the JSON map file, using a regex instead.
libname _all_ clear;
filename _all_ clear;
filename _PDFOUT temp;
filename _GSFNAME temp;
proc datasets lib=work kill memtype=data nolist; quit;
filename jsf '~/sasuser.v94/.json' encoding='utf-8';
data _null_;
file jsf;
length js varchar(*);
retain js;
input;
js=unicode(_infile_);
put js;
datalines;
{
"SASJSONExport":"1.0",
"SASTableData+TEST":[
{
"\u041f\u0435\u0440\u0435\u043c\u0435\u043d\u043d\u0430\u044f":2,
"\u0421\u0440\u0435\u0434\u043d\u0435\u0435":4,
"\u0421\u0442\u0440\u043e\u043a\u0430":"\u0427\u0442\u043e\u002d\u0442\u043e\u0031"
},
{
"\u041f\u0435\u0440\u0435\u043c\u0435\u043d\u043d\u0430\u044f":2,
"\u0421\u0440\u0435\u0434\u043d\u0435\u0435":2,
"\u0421\u0442\u0440\u043e\u043a\u0430":"\u0427\u0442\u043e\u002d\u0442\u043e\u0032"
},
{
"\u041f\u0435\u0440\u0435\u043c\u0435\u043d\u043d\u0430\u044f":1,
"\u0421\u0440\u0435\u0434\u043d\u0435\u0435":42,
"\u0421\u0442\u0440\u043e\u043a\u0430":"\u0427\u0442\u043e\u002d\u0442\u043e\u0033"
}
]
}
;
run;
filename jsm '~/sasuser.v94/.json.map' encoding='utf-8';
libname jsd json fileref=jsf map=jsm automap=replace;
libname jsm json fileref=jsm;
data jsmm;
merge jsm.datasets jsm.datasets_variables;
by ordinal_DATASETS;
run;
proc sort data=jsmm; by ordinal_root ordinal_DATASETS; run;
data _null_;
set work.jsmm end=last;
if _N_=1 then do;
length s varchar(*) ds varchar(*);
retain s ds prx;
s='{"DATASETS":[';
ds='';
prx=prxparse('/[^_]/');
end;
if ds=dsname then s=s||',';
else do;
ds=dsname;
if _N_^=1 then s=s||']},';
s=cats(s,'{"DSNAME":"',ds,'","TABLEPATH":"',tablepath,'","VARIABLES":[');
end;
s=cats(s,'{"NAME":"',name,'","TYPE":"',type,'","PATH":"',path,'"');
if prxmatch(prx,name) > length(name) then
s=cats(s,',"LABEL":"',scan(path,-1,'/'),'"');
s=s||'}';
if last then do;
s=s||']}]}';
file jsm;
put s;
end;
run;
libname jsd json fileref=jsf map=jsm;
proc print data=jsd.SASTableData_TEST label noobs; run;
The first variant of the solution
It is the quick'n'dirty solution. First, prepare the input data:
libname _all_ clear;
filename _all_ clear;
filename jsf '~/sasuser.v94/.json' encoding='utf-8';
data _null_;
file jsf;
length js varchar(*);
input;
js=unicode(_infile_);
put js;
datalines;
{
"SASJSONExport":"1.0",
"SASTableData+TEST": [
{
"\u041f\u0435\u0440\u0435\u043c\u0435\u043d\u043d\u0430\u044f":2,
"\u0421\u0440\u0435\u0434\u043d\u0435\u0435":4,
"\u0421\u0442\u0440\u043e\u043a\u0430":"\u0427\u0442\u043e\u002d\u0442\u043e\u0031"
},
{
"\u041f\u0435\u0440\u0435\u043c\u0435\u043d\u043d\u0430\u044f":2,
"\u0421\u0440\u0435\u0434\u043d\u0435\u0435":2,
"\u0421\u0442\u0440\u043e\u043a\u0430":"\u0427\u0442\u043e\u002d\u0442\u043e\u0032"
},
{
"\u041f\u0435\u0440\u0435\u043c\u0435\u043d\u043d\u0430\u044f":1,
"\u0421\u0440\u0435\u0434\u043d\u0435\u0435":42,
"\u0421\u0442\u0440\u043e\u043a\u0430":"\u0427\u0442\u043e\u002d\u0442\u043e\u0033"
}
]
}
;
run;
The output file .json:
{
"SASJSONExport":"1.0",
"SASTableData+TEST": [
{
"Переменная":2,
"Среднее":4,
"Строка":"Что-то1"
},
{
"Переменная":2,
"Среднее":2,
"Строка":"Что-то2"
},
{
"Переменная":1,
"Среднее":42,
"Строка":"Что-то3"
}
]
}
Then create the json map file .json.map:
filename jsmf '~/sasuser.v94/.json.map' encoding='utf-8';
libname jsm json fileref=jsf map=jsmf automap=create;
The .json.map contents:
{
"DATASETS": [
{
"DSNAME": "root",
"TABLEPATH": "/root",
"VARIABLES": [
{
"NAME": "ordinal_root",
"TYPE": "ORDINAL",
"PATH": "/root"
},
{
"NAME": "SASJSONExport",
"TYPE": "CHARACTER",
"PATH": "/root/SASJSONExport",
"CURRENT_LENGTH": 3
}
]
},
{
"DSNAME": "SASTableData_TEST",
"TABLEPATH": "/root/SASTableData+TEST",
"VARIABLES": [
{
"NAME": "ordinal_root",
"TYPE": "ORDINAL",
"PATH": "/root"
},
{
"NAME": "ordinal_SASTableData_TEST",
"TYPE": "ORDINAL",
"PATH": "/root/SASTableData+TEST"
},
{
"NAME": "____________________",
"TYPE": "NUMERIC",
"PATH": "/root/SASTableData+TEST/Переменная"
},
{
"NAME": "______________",
"TYPE": "NUMERIC",
"PATH": "/root/SASTableData+TEST/Среднее"
},
{
"NAME": "____________",
"TYPE": "CHARACTER",
"PATH": "/root/SASTableData+TEST/Строка",
"CURRENT_LENGTH": 12
}
]
}
]
}
Let's change the file a bit by removing the description of the unnecessary dataset and adding labels:
{
"DATASETS": [
{
"DSNAME": "SASTableData_TEST",
"TABLEPATH": "/root/SASTableData+TEST",
"VARIABLES": [
{
"NAME": "ordinal_root",
"TYPE": "ORDINAL",
"PATH": "/root"
},
{
"NAME": "ordinal_SASTableData_TEST",
"TYPE": "ORDINAL",
"PATH": "/root/SASTableData+TEST"
},
{
"NAME": "____________________",
"TYPE": "NUMERIC",
"PATH": "/root/SASTableData+TEST/Переменная",
"LABEL": "Переменная"
},
{
"NAME": "______________",
"TYPE": "NUMERIC",
"PATH": "/root/SASTableData+TEST/Среднее",
"LABEL": "Среднее"
},
{
"NAME": "____________",
"TYPE": "CHARACTER",
"PATH": "/root/SASTableData+TEST/Строка",
"LABEL": "Строка",
"CURRENT_LENGTH": 12
}
]
}
]
}
And try again:
libname jsd json fileref=jsf map=jsmf;
proc print data=jsd.SASTableData_TEST label noobs; run;
The result:
+--------------+---------------------------+------------+---------+---------+
| ordinal_root | ordinal_SASTableData_TEST | Переменная | Среднее | Строка  |
+--------------+---------------------------+------------+---------+---------+
| 1            | 1                         | 2          | 4       | Что-то1 |
| 1            | 2                         | 2          | 2       | Что-то2 |
| 1            | 3                         | 1          | 42      | Что-то3 |
+--------------+---------------------------+------------+---------+---------+
All of this was done in SAS University Edition.

How to Simulate subquery in MongoDB query condition

Let's suppose that I have a product logs collection; all changes made to my products will be recorded in this collection, i.e.:
+-----------+--------+---------+
| productId | status | comment |
+-----------+--------+---------+
| 1         | 0      | ....    |
| 2         | 0      | ....    |
| 1         | 1      | ....    |
| 2         | 1      | ....    |
| 1         | 2      | ....    |
| 3         | 0      | ....    |
+-----------+--------+---------+
I want to get all products whose status is 1 but has never become 2. In SQL the query would look something like:
select productId from productLog as PL1
where
status = 1
and productId not in (
select productId from productLog as PL2 where
PL1.productId = PL2.productId and PL2.status = 2
)
group by productId
I'm using the native PHP MongoDB driver.
Well, since the logic of the subquery join here is simply that exactly the same key matches, then:
Setup
db.status.insert([
{ "productId": 1, "status": 0 },
{ "productId": 2, "status": 0 },
{ "productId": 1, "status": 1 },
{ "productId": 2, "status": 1 },
{ "productId": 1, "status": 2 },
{ "productId": 3, "status": 0 }
])
Then use .aggregate():
db.status.aggregate([
{ "$match": {
"status": { "$ne": 2 }
}},
{ "$group": {
"_id": "$productId"
}}
])
Or using map reduce (with a DBRef):
db.status.mapReduce(
    function() {
        if ( this.productId.$oid == 2 ) {
            emit( this.productId.$oid, null )
        }
    },
    function(key,values) {
        return null;
    },
    { "out": { "inline": 1 } }
);
But again the SQL here was as simple as:
select productId
from productLog
where status <> 2
group by productId
Without the superfluous join on exactly the same key value.
The mongo query above doesn't meet the requirements in the question: its result includes documents with productId=1, whereas the result of the SQL in the question doesn't, because the sample data contains a record with status=2 whose productId is 1.
So, assuming the db.productLog.insert was executed as stated above, you can use the code below to get the results:
// First: a sub-query to collect the productIds of records having status=2:
var productsWithStatus2 = db.productLog.find({"status":2}).map(function(rec) { return rec.productId; });
// Second: the final query to get the productIds for which no record with status=2 and the same productId exists:
db.productLog.aggregate([ {"$match":{productId:{$nin:productsWithStatus2}}}, {"$group": {"_id": "$productId"}} ]);
// Alternative for the second query:
// db.productLog.distinct("productId",{productId:{$nin:productsWithStatus2}});
// Alternative for the second query, returning results with product and status detail:
// db.productLog.find({productId:{$nin:productsWithStatus2}});
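For completeness, here is a sketch of a single aggregation pipeline that also enforces the "status is 1" condition from the original SQL; this is my own variant, not part of the answers above. It collects the set of statuses per productId and then filters on that set:
db.productLog.aggregate([
    // collect every status value seen for each productId
    { "$group": { "_id": "$productId", "statuses": { "$addToSet": "$status" } } },
    // keep only the ids that have status 1 but never had status 2
    { "$match": { "statuses": { "$all": [1], "$ne": 2 } } }
])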