Deserialize complex JSON with pandas
I am trying to deserialize some complex, somewhat inconsistent JSON with pandas, and I am struggling to get the parsing right:
{
  "STATUS": "REQUEST_OK",
  "DATA": [
    {
      "companyID": "AABBCCDD",
      "ITEMS": [
        {
          "ind": "12345",
          "pt": "1231",
          "code": "E333",
          "name": "Pop ,",
          "RES": [
            {
              "i": 1,
              "D": {
                "e": 123674,
                "p": "",
                "s": "",
                "t": 1000
              },
              "lot": "073",
              "V": [
                {
                  "t": 6,
                  "v": 0.1
                }
              ],
              "p": 1
            },
            {}
          ]
        },
        {
          "ind": "423",
          "pt": "571",
          "code": "E1",
          "name": "Dam ,",
          "RES": [
            {
              "i": 5,
              "D": {
                "e": 120751,
                "p": "",
                "s": "",
                "t": 800
              },
              "lot": "9",
              "V": [
                {
                  "t": 4543,
                  "v": 1.33
                }
              ],
              "p": 1
            },
            {}
          ]
        },
        {
          "ind": "0323",
          "pt": "123221",
          "code": "LS",
          "name": "Paint ,",
          "RES": [
            {
              "i": 61,
              "D": {
                "e": 946,
                "p": "",
                "s": "",
                "t": 11100
              },
              "lot": "8",
              "V": [
                {
                  "t": 9,
                  "v": 0.06
                }
              ],
              "p": 1
            },
            {}
          ]
        }
      ]
    }
  ]
}
The data here is supposed to build this table
| companyID | ind   | pt   | code | name | i | e      | p | s | t    | lot | t | v   | p |
|-----------|-------|------|------|------|---|--------|---|---|------|-----|---|-----|---|
| AABBCCDD  | 12345 | 1231 | E333 | Pop  | 1 | 123674 |   |   | 1000 | 073 | 6 | 0.1 | 1 |
And so on.
The biggest pain for me is that there can be only one level of this tag:
{
  "ind": "423",
  "pt": "571",
  "code": "E1",
  "name": "Dam ,",
  "RES": [
    {
but inside it there can be multiple objects like this:
{
  "i": 61,
  "D": {
    "e": 946,
    "p": "",
    "s": "",
    "t": 11100
  },
  "lot": "8",
  "V": [
    {
      "t": 9,
      "v": 0.06
    }
  ],
  "p": 1
},
For example,
"i":61
means there are 61 of those JSON objects inside the first tag.
Any clues on how to parse this JSON?
Try it this way:
import pandas as pd

data = """your json above"""

key = []
value = []
# Naive text-based approach: split the raw string on commas, braces and colons
# instead of parsing it as JSON
new_dat = data.split(',')
for n in new_dat:
    if '{' in n:
        m = n.split('{')[1].strip()
    else:
        m = n.strip().replace('}', '').replace('\n', '')
    if ':' in m:
        key.append(m.split(':')[0])
        value.append(m.split(':')[1])
pd.DataFrame([value], columns=key)
Output:
"STATUS" "companyID" "ind" "pt" "code" "name" "i" "e" "p" "s" ... "name" "i" "e" "p" "s" "t" "lot" "t" "v" "p"
0 "REQUEST_OK" "AABBCCDD" "12345" "1231" "E333" "Pop 1 123674 "" "" ... "Paint 61 946 "" "" 11100 "8" 9 0.06 ] 1
You can then use standard pandas methods to drop unnecessary columns, etc.
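For comparison, if the payload parses as valid JSON, pandas.json_normalize can flatten the nested structure directly. This is only a sketch, not a drop-in solution; it assumes the JSON text is stored in the same data variable used above and that the overall shape matches the sample in the question:
import json
import pandas as pd

# Assumption: `data` holds the JSON text shown above and it is valid JSON
parsed = json.loads(data)

# One row per element of each RES list; nested "D" dicts are flattened to D.e, D.p, D.s, D.t
df = pd.json_normalize(
    parsed["DATA"],
    record_path=["ITEMS", "RES"],
    meta=[
        "companyID",
        ["ITEMS", "ind"],
        ["ITEMS", "pt"],
        ["ITEMS", "code"],
        ["ITEMS", "name"],
    ],
)

# "V" remains a column of lists here; it can be exploded and normalized in a second pass
print(df.head())
The empty {} entries in RES come through as rows of NaN in the record-derived columns and can be dropped afterwards.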
Related
How to bring json format to relational form?
This is my code:
%spark.pyspark
df_principalBody = spark.sql("""
    SELECT gtin
         , principalBodyConstituents
         --, principalBodyConstituents.coatings.materialType.value
    FROM v_df_source""")
df_principalBody.createOrReplaceTempView("v_df_principalBody")
df_principalBody.collect();
And this is the output:
[Row(gtin='7617014161936', principalBodyConstituents=[Row(coatings=[Row(materialType=Row(value='003', valueRange='405')
How can I read the value and valueRange fields in relational format? I tried with explode and flatten, but it did not work.
Part of my JSON:
{
  "gtin": "7617014161936",
  "timePeriods": [
    {
      "fractionData": {
        "principalBody": {
          "constituents": [
            {
              "coatings": [
                {
                  "materialType": {
                    "value": "003",
                    "valueRange": "405"
                  },
                  "percentage": 0.1
                }
              ],
              ...
You can use data_dict.items() to list key/value pairs. I used part of your JSON as below:
str1 = """{"gtin": "7617014161936","timePeriods": [{"fractionData": {"principalBody": {"constituents": [{"coatings": [ { "materialType": { "value": "003", "valueRange": "405" }, "percentage": 0.1 } ]}]}}}]}"""
import json
res = json.loads(str1)
res_dict = res['timePeriods'][0]['fractionData']['principalBody']['constituents'][0]['coatings'][0]['materialType']
df = spark.createDataFrame(data=res_dict.items())
Output:
+----------+---+
|        _1| _2|
+----------+---+
|     value|003|
|valueRange|405|
+----------+---+
You can even specify your schema:
from pyspark.sql.types import *
df = spark.createDataFrame(res_dict.items(),
    schema=StructType(fields=[
        StructField("key", StringType()),
        StructField("value", StringType())])).show()
Resulting in:
+----------+-----+
|       key|value|
+----------+-----+
|     value|  003|
|valueRange|  405|
+----------+-----+
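For completeness, here is a rough sketch of the explode-based approach the question mentions. It assumes the DataFrame schema matches the Row output shown above (an array of structs, each with a coatings array containing a materialType struct); the column names may need adjusting for the real dataset:
from pyspark.sql import functions as F

# Sketch only: schema and column names are taken from the Row output above
df = spark.sql("SELECT gtin, principalBodyConstituents FROM v_df_source")

flat = (
    df.select("gtin", F.explode("principalBodyConstituents").alias("constituent"))
      .select("gtin", F.explode("constituent.coatings").alias("coating"))
      .select(
          "gtin",
          F.col("coating.materialType.value").alias("value"),
          F.col("coating.materialType.valueRange").alias("valueRange"),
      )
)
flat.show()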
jq parse json with stream flag into different json file
I have a JSON file called data.json, shown below. I want to parse the data with the jq tool in streaming mode (without loading the whole file into memory), because the real data is about 20 GB. Streaming mode in jq seems to require adding the --stream flag, and it parses the JSON file row by row.
{
  "id": {
    "bioguide": "E000295",
    "thomas": "02283",
    "govtrack": 412667,
    "opensecrets": "N00035483",
    "lis": "S376"
  },
  "bio": {
    "gender": "F",
    "birthday": "1970-07-01"
  },
  "tooldatareports": [
    {
      "name": "A",
      "tooldata": [
        {
          "toolid": 12345,
          "data": [
            { "time": "2021-01-01", "value": 1 },
            { "time": "2021-01-02", "value": 10 },
            { "time": "2021-01-03", "value": 5 }
          ]
        },
        {
          "toolid": 12346,
          "data": [
            { "time": "2021-01-01", "value": 10 },
            { "time": "2021-01-02", "value": 100 },
            { "time": "2021-01-03", "value": 50 }
          ]
        }
      ]
    }
  ]
}
The final result I hope to get is as below: a list containing two dicts, each dict containing two keys.
[
  {
    "data": [
      { "time": "2021-01-01", "value": 1 },
      { "time": "2021-01-02", "value": 10 },
      { "time": "2021-01-03", "value": 5 }
    ]
  },
  {
    "data": [
      { "time": "2021-01-01", "value": 10 },
      { "time": "2021-01-02", "value": 100 },
      { "time": "2021-01-03", "value": 50 }
    ]
  }
]
For this problem, I used the command line below to get a result, but it still has some differences:
cat data.json | jq --stream 'select(.[0][0]=="tooldatareports" and .[0][2]=="tooldata" and .[1]!=null) | .'
The result is not a list; it contains a lot of dicts, and each time and value end up in separate lists. Does anyone have any idea about this?
Here's a solution that does not use truncate_stream:
jq -n --stream '
  [fromstream(
     inputs
     | (.[0] | index("data")) as $ix
     | select($ix)
     | .[0] |= .[$ix:]
   )]
' input.json
The following produces the required output:
jq -n --stream '
  [{data: fromstream(5|truncate_stream(inputs))}]
' input.json
Needless to say, there are other variations ...
Here's a step-by-step explanation of peak's answers.
First let's convert the JSON to a stream (https://jqplay.org/s/VEunTmDSkf):
[["id","bioguide"],"E000295"]
[["id","thomas"],"02283"]
[["id","govtrack"],412667]
[["id","opensecrets"],"N00035483"]
[["id","lis"],"S376"]
[["id","lis"]]
[["bio","gender"],"F"]
[["bio","birthday"],"1970-07-01"]
[["bio","birthday"]]
[["tooldatareports",0,"name"],"A"]
[["tooldatareports",0,"tooldata",0,"toolid"],12345]
[["tooldatareports",0,"tooldata",0,"data",0,"time"],"2021-01-01"]
[["tooldatareports",0,"tooldata",0,"data",0,"value"],1]
[["tooldatareports",0,"tooldata",0,"data",0,"value"]]
[["tooldatareports",0,"tooldata",0,"data",1,"time"],"2021-01-02"]
[["tooldatareports",0,"tooldata",0,"data",1,"value"],10]
[["tooldatareports",0,"tooldata",0,"data",1,"value"]]
[["tooldatareports",0,"tooldata",0,"data",2,"time"],"2021-01-03"]
[["tooldatareports",0,"tooldata",0,"data",2,"value"],5]
[["tooldatareports",0,"tooldata",0,"data",2,"value"]]
[["tooldatareports",0,"tooldata",0,"data",2]]
[["tooldatareports",0,"tooldata",0,"data"]]
[["tooldatareports",0,"tooldata",1,"toolid"],12346]
[["tooldatareports",0,"tooldata",1,"data",0,"time"],"2021-01-01"]
[["tooldatareports",0,"tooldata",1,"data",0,"value"],10]
[["tooldatareports",0,"tooldata",1,"data",0,"value"]]
[["tooldatareports",0,"tooldata",1,"data",1,"time"],"2021-01-02"]
[["tooldatareports",0,"tooldata",1,"data",1,"value"],100]
[["tooldatareports",0,"tooldata",1,"data",1,"value"]]
[["tooldatareports",0,"tooldata",1,"data",2,"time"],"2021-01-03"]
[["tooldatareports",0,"tooldata",1,"data",2,"value"],50]
[["tooldatareports",0,"tooldata",1,"data",2,"value"]]
[["tooldatareports",0,"tooldata",1,"data",2]]
[["tooldatareports",0,"tooldata",1,"data"]]
[["tooldatareports",0,"tooldata",1]]
[["tooldatareports",0,"tooldata"]]
[["tooldatareports",0]]
[["tooldatareports"]]
Now do .[0] to extract the path portion of the stream (https://jqplay.org/s/XdPrp8RuEj):
["id","bioguide"]
["id","thomas"]
["id","govtrack"]
["id","opensecrets"]
["id","lis"]
["id","lis"]
["bio","gender"]
["bio","birthday"]
["bio","birthday"]
["tooldatareports",0,"name"]
["tooldatareports",0,"tooldata",0,"toolid"]
["tooldatareports",0,"tooldata",0,"data",0,"time"]
["tooldatareports",0,"tooldata",0,"data",0,"value"]
["tooldatareports",0,"tooldata",0,"data",0,"value"]
["tooldatareports",0,"tooldata",0,"data",1,"time"]
["tooldatareports",0,"tooldata",0,"data",1,"value"]
["tooldatareports",0,"tooldata",0,"data",1,"value"]
["tooldatareports",0,"tooldata",0,"data",2,"time"]
["tooldatareports",0,"tooldata",0,"data",2,"value"]
["tooldatareports",0,"tooldata",0,"data",2,"value"]
["tooldatareports",0,"tooldata",0,"data",2]
["tooldatareports",0,"tooldata",0,"data"]
["tooldatareports",0,"tooldata",1,"toolid"]
["tooldatareports",0,"tooldata",1,"data",0,"time"]
["tooldatareports",0,"tooldata",1,"data",0,"value"]
["tooldatareports",0,"tooldata",1,"data",0,"value"]
["tooldatareports",0,"tooldata",1,"data",1,"time"]
["tooldatareports",0,"tooldata",1,"data",1,"value"]
["tooldatareports",0,"tooldata",1,"data",1,"value"]
["tooldatareports",0,"tooldata",1,"data",2,"time"]
["tooldatareports",0,"tooldata",1,"data",2,"value"]
["tooldatareports",0,"tooldata",1,"data",2,"value"]
["tooldatareports",0,"tooldata",1,"data",2]
["tooldatareports",0,"tooldata",1,"data"]
["tooldatareports",0,"tooldata",1]
["tooldatareports",0,"tooldata"]
["tooldatareports",0]
["tooldatareports"]
Let me first quickly explain index/1.
index("data") of
[["tooldatareports",0,"tooldata",0,"data",0,"time"],"2021-01-01"]
is 4, since that is the index of the first occurrence of "data". Knowing that, let's now do .[0] | index("data") (https://jqplay.org/s/ny0bV1xEED):
null null null null null null null null null null null 4 4 4 4 4 4 4 4 4 4 4 null 4 4 4 4 4 4 4 4 4 4 4 null null null null
As you can see, in our case the indexes are either 4 or null. We want to filter each input such that the corresponding index is not null; those are the inputs that have "data" as part of their path. (.[0] | index("data")) as $ix | select($ix) does just that. Remember that each $ix is mapped to each input, so only inputs whose $ix is not null are displayed. For example, see https://jqplay.org/s/NwcD7_USZE: here inputs | select(null) gives no output, but inputs | select(true) outputs every input. This is the filtered stream (https://jqplay.org/s/SgexvhtaGe):
[["tooldatareports",0,"tooldata",0,"data",0,"time"],"2021-01-01"]
[["tooldatareports",0,"tooldata",0,"data",0,"value"],1]
[["tooldatareports",0,"tooldata",0,"data",0,"value"]]
[["tooldatareports",0,"tooldata",0,"data",1,"time"],"2021-01-02"]
[["tooldatareports",0,"tooldata",0,"data",1,"value"],10]
[["tooldatareports",0,"tooldata",0,"data",1,"value"]]
[["tooldatareports",0,"tooldata",0,"data",2,"time"],"2021-01-03"]
[["tooldatareports",0,"tooldata",0,"data",2,"value"],5]
[["tooldatareports",0,"tooldata",0,"data",2,"value"]]
[["tooldatareports",0,"tooldata",0,"data",2]]
[["tooldatareports",0,"tooldata",0,"data"]]
[["tooldatareports",0,"tooldata",1,"data",0,"time"],"2021-01-01"]
[["tooldatareports",0,"tooldata",1,"data",0,"value"],10]
[["tooldatareports",0,"tooldata",1,"data",0,"value"]]
[["tooldatareports",0,"tooldata",1,"data",1,"time"],"2021-01-02"]
[["tooldatareports",0,"tooldata",1,"data",1,"value"],100]
[["tooldatareports",0,"tooldata",1,"data",1,"value"]]
[["tooldatareports",0,"tooldata",1,"data",2,"time"],"2021-01-03"]
[["tooldatareports",0,"tooldata",1,"data",2,"value"],50]
[["tooldatareports",0,"tooldata",1,"data",2,"value"]]
[["tooldatareports",0,"tooldata",1,"data",2]]
[["tooldatareports",0,"tooldata",1,"data"]]
Before we go further, let's review update assignment. Have a look at https://jqplay.org/s/g4P6j8f9FG. Let's say we have the input
[["tooldatareports",0,"tooldata",0,"data",0,"time"],"2021-01-01"]
Then the filter .[0] |= .[4:] produces
[["data",0,"time"],"2021-01-01"]
Why? Remember that the right-hand side (.[4:]) inherits the context of the left-hand side (.[0]). So in this case it has the effect of updating the path ["tooldatareports",0,"tooldata",0,"data",0,"time"] to ["data",0,"time"].
Let's move on then. So (.[0] | index("data")) as $ix | select($ix) | .[0] |= .[$ix:] has the output (https://jqplay.org/s/AwcQpVyHO2):
[["data",0,"time"],"2021-01-01"]
[["data",0,"value"],1]
[["data",0,"value"]]
[["data",1,"time"],"2021-01-02"]
[["data",1,"value"],10]
[["data",1,"value"]]
[["data",2,"time"],"2021-01-03"]
[["data",2,"value"],5]
[["data",2,"value"]]
[["data",2]]
[["data"]]
[["data",0,"time"],"2021-01-01"]
[["data",0,"value"],10]
[["data",0,"value"]]
[["data",1,"time"],"2021-01-02"]
[["data",1,"value"],100]
[["data",1,"value"]]
[["data",2,"time"],"2021-01-03"]
[["data",2,"value"],50]
[["data",2,"value"]]
[["data",2]]
[["data"]]
Now all we need to do is convert this stream back to JSON.
https://jqplay.org/s/j2uyzEU_Rc
[fromstream(inputs)] gives:
[
  {
    "data": [
      { "time": "2021-01-01", "value": 1 },
      { "time": "2021-01-02", "value": 10 },
      { "time": "2021-01-03", "value": 5 }
    ]
  },
  {
    "data": [
      { "time": "2021-01-01", "value": 10 },
      { "time": "2021-01-02", "value": 100 },
      { "time": "2021-01-03", "value": 50 }
    ]
  }
]
This is the output we wanted.
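As an aside, if the tooling is flexible, the same streaming extraction can be sketched in Python with the third-party ijson library (an assumption: it is installed separately), which also avoids loading the whole 20 GB file into memory:
import json
import ijson  # third-party streaming JSON parser; assumed to be available

result = []
with open("data.json", "rb") as f:
    # Yields each "data" array as it is parsed, without materializing the full document
    for data in ijson.items(f, "tooldatareports.item.tooldata.item.data"):
        result.append({"data": data})

print(json.dumps(result, indent=2, default=str))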
jq: count nested object values based on group by
JSON:
[
  {
    "account": "1",
    "cost": [
      { "usage": "low", "totalcost": "2.01" }
    ]
  },
  {
    "account": "2",
    "cost": [
      { "usage": "low", "totalcost": "2.25" }
    ]
  },
  {
    "account": "1",
    "cost": [
      { "usage": "low", "totalcost": "15" }
    ]
  },
  {
    "anotheraccount": "a",
    "cost": [
      { "usage": "low", "totalcost": "2" }
    ]
  }
]
Results expected:
account         cost
1               17.01
2               2.25
anotheraccount  cost
a               2
I am able to pull out the data but not sure how to aggregate it:
jq '.[] | {account, cost: .cost[].totalcost}'
Is there a way to do this using jq, so I get all types of accounts and the costs associated with them?
Two helper functions will help get you to your destination:
def sigma( f ):
  reduce .[] as $o (null; . + ($o | f )) ;

def group( keyname ):
  map(select(has(keyname)))
  | group_by( .[keyname] )
  | map({(keyname) : .[0][keyname],
         cost: sigma(.cost[].totalcost | tonumber) }) ;
With these, the following invocations:
group("account"),
group("anotheraccount")
yield:
[{"account":"1","cost":17.009999999999998},{"account":"2","cost":2.25}]
[{"anotheraccount":"a","cost":2}]
You should be able to manage the final formatting step in jq.
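If jq is not a hard requirement, the same aggregation can be sketched in pandas (tying back to the main question above). This is only a sketch; raw_json is a hypothetical variable standing in for the JSON text from the question:
import json
import pandas as pd

records = json.loads(raw_json)  # hypothetical: raw_json holds the JSON shown in the question

# One row per (account key, account value, cost entry)
rows = [
    {"key": k, "account": rec[k], "totalcost": float(c["totalcost"])}
    for rec in records
    for k in rec
    if k != "cost"
    for c in rec["cost"]
]

df = pd.DataFrame(rows)
print(df.groupby(["key", "account"], as_index=False)["totalcost"].sum())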
Using jq to parse keys present in two lists (even though it might not exist in one of those)
(It was hard to come up with a title that summarizes the issue, so feel free to improve it.)
I have a JSON file with the following content:
{
  "Items": [
    {
      "ID": { "S": "ID_Complete" },
      "oldProperties": {
        "L": [
          { "S": "[property_A : value_A_old]" },
          { "S": "[property_B : value_B_old]" }
        ]
      },
      "newProperties": {
        "L": [
          { "S": "[property_A : value_A_new]" },
          { "S": "[property_B : value_B_new]" }
        ]
      }
    },
    {
      "ID": { "S": "ID_Incomplete" },
      "oldProperties": {
        "L": [
          { "S": "[property_B : value_B_old]" }
        ]
      },
      "newProperties": {
        "L": [
          { "S": "[property_A : value_A_new]" },
          { "S": "[property_B : value_B_new]" }
        ]
      }
    }
  ]
}
I would like to manipulate the data using jq in such a way that, for each item in Items[] that has a new value for property_A (under the newProperties list), I generate an output with the corresponding id, old and new fields (see desired output below), regardless of the value that property has in the oldProperties list. Moreover, if property_A does not exist in oldProperties, I still need the old field to be populated with null (or any fixed string, for what it's worth).
Desired output:
{
  "id": "id_Complete",
  "old": "[property_A : value_A_old]",
  "new": "[property_A : value_A_new]"
}
{
  "id": "ID_Incomplete",
  "old": null,
  "new": "[property_A : value_A_new]"
}
Note: even though property_A doesn't exist in the oldProperties list, other properties may (and will) exist.
The problem I am facing is that I am not able to get any output when the desired property does not exist in the oldProperties list. My current jq command looks like this:
jq -r '.Items[] | {
  id: .ID.S,
  old: .oldProperties.L[].S | select(. | contains("property_A")),
  new: .newProperties.L[].S | select(. | contains("property_A"))
}'
This renders only the ID_Complete case, while I need the other one as well. Is there any way to achieve this using this tool? Thanks in advance.
Your list of properties appears to be the values of some object. You could map them into an object, diff the two objects, then report on the results. You could do something like this:
def make_object_from_properties:
  [.L[].S | capture("\\[(?<key>\\w+) : (?<value>\\w+)\\]")]
  | from_entries
  ;
def diff_objects($old; $new):
  def _prop($key): select(has($key))[$key];
  ([($old | keys[]), ($new | keys[])] | unique) as $keys
  | [ $keys[] as $k
      | ({ value: $old | _prop($k) } // { none: true }) as $o
      | ({ value: $new | _prop($k) } // { none: true }) as $n
      | (if $o.none then "add"
         elif $n.none then "remove"
         elif $o.value != $n.value then "change"
         else "same"
         end) as $s
      | { key: $k, status: $s, old: $o.value, new: $n.value }
    ]
  ;
def diff_properties:
  (.oldProperties | make_object_from_properties) as $old
  | (.newProperties | make_object_from_properties) as $new
  | diff_objects($old; $new) as $diff
  | foreach $diff[] as $d ({ id: .ID.S };
      select($d.status != "same")
      | .old = ((select(any("remove", "change"; . == $d.status)) | "[\($d.key) : \($d.old)]") // null)
      | .new = ((select(any("add", "change"; . == $d.status)) | "[\($d.key) : \($d.new)]") // null)
    )
  ;
[.Items[] | diff_properties]
This yields the following output:
[
  {
    "id": "ID_Complete",
    "old": "[property_A : value_A_old]",
    "new": "[property_A : value_A_new]"
  },
  {
    "id": "ID_Complete",
    "old": "[property_B : value_B_old]",
    "new": "[property_B : value_B_new]"
  },
  {
    "id": "ID_Incomplete",
    "old": null,
    "new": "[property_A : value_A_new]"
  },
  {
    "id": "ID_Incomplete",
    "old": "[property_B : value_B_old]",
    "new": "[property_B : value_B_new]"
  }
]
It seems like your data is in some kind of encoded format too. For a more robust solution, you should consider defining some functions to decode them. Consider approaches found here on how you could do that.
This filter produces the desired output:
def parse: capture("(?<key>\\w+)\\s*:\\s*(?<value>\\w+)");
def print: "[\(.key) : \(.value)]";
def norm: [.[][][] | parse | select(.key=="property_A") | print][0];

.Items | map({id: .ID.S, old: .oldProperties|norm, new: .newProperties|norm})[]
Sample run (assumes the filter is in filter.jq and the data in data.json):
$ jq -M -f filter.jq data.json
{
  "id": "ID_Complete",
  "old": "[property_A : value_A_old]",
  "new": "[property_A : value_A_new]"
}
{
  "id": "ID_Incomplete",
  "old": null,
  "new": "[property_A : value_A_new]"
}
Try it online!
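For readers coming from the pandas question above, the same idea can be sketched in plain Python. It assumes the question's JSON is saved in data.json, as in the sample run above:
import json

# Assumes the question's JSON is stored in data.json
with open("data.json") as f:
    doc = json.load(f)

def find_property(props, key="property_A"):
    # Return the first "[key : value]" string under props["L"], or None if absent
    for entry in props.get("L", []):
        if entry["S"].startswith(f"[{key} "):
            return entry["S"]
    return None

for item in doc["Items"]:
    print(json.dumps({
        "id": item["ID"]["S"],
        "old": find_property(item["oldProperties"]),
        "new": find_property(item["newProperties"]),
    }))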
Replacing specific fields in JSON from text file
I have a JSON structure and would like to replace strings in 2 fields with values that are in a separate text file. Here is the JSON file with 2 records:
{
  "events": {
    "-KKQQIUR7FAVxBOPOFhr": {
      "dateAdded": 1487592568926,
      "owner": "62e6aaa0-a50c-4448-a381-f02efde2316d",
      "type": "boycott"
    },
    "-KKjjM-pAXvTuEjDjoj_": {
      "dateAdded": 1487933370561,
      "owner": "62e6aaa0-a50c-4448-a381-f02efde2316d",
      "type": "boycott"
    }
  },
  "geo": {
    "-KKQQIUR7FAVxBOPOFhr": {
      ".priority": "qw3yttz1k9",
      "g": "qw3yttz1k9",
      "l": [ 40.762632, -73.973837 ]
    },
    "-KKjjM-pAXvTuEjDjoj_": {
      ".priority": "qw3yttx6bv",
      "g": "qw3yttx6bv",
      "l": [ 41.889019, -87.626291 ]
    }
  },
  "log": "null",
  "users": {
    "62e6aaa0-a50c-4448-a381-f02efde2316d": {
      "events": {
        "-KKQQIUR7FAVxBOPOFhr": { "type": "boycott" },
        "-KKjjM-pAXvTuEjDjoj_": { "type": "boycott" }
      }
    }
  }
}
And here is the txt file that I want to substitute in:
49.287130, -123.124026
36.129770, -115.172811
There are lots more records but I kept this to 2 for brevity. Any help would be appreciated. Thank you.
The problem description seems to assume that the ordering of the key-value pairs within a JSON object is fixed. Different JSON-oriented tools (and indeed different versions of jq) have different takes on this. In any case, the following assumes a version of jq that respects the ordering (e.g. jq 1.5); it also assumes that inputs is available, though that is inessential.
The key to the following solution is the helper function, map_nth_value/2, which modifies the value of the nth key in a JSON object:
def map_nth_value(n; filter):
  to_entries
  | (.[n] |= {"key": .key, "value": (.value | filter)} )
  | from_entries ;

[inputs | select(length > 0) | split(",") | map(tonumber)] as $lists
| reduce range(0; $lists|length) as $i ( $object;
    .geo |= map_nth_value($i; .l = $lists[$i] ) )
With the above jq program in a file (say program.jq), the text file in a file (say input.txt), and the JSON object in a file (say object.json), the following invocation:
jq -R -n --argfile object object.json -f program.jq input.txt
produces:
{
  "events": {
    "-KKQQIUR7FAVxBOPOFhr": {
      "dateAdded": 1487592568926,
      "owner": "62e6aaa0-a50c-4448-a381-f02efde2316d",
      "type": "boycott"
    },
    "-KKjjM-pAXvTuEjDjoj_": {
      "dateAdded": 1487933370561,
      "owner": "62e6aaa0-a50c-4448-a381-f02efde2316d",
      "type": "boycott"
    }
  },
  "geo": {
    "-KKQQIUR7FAVxBOPOFhr": {
      ".priority": "qw3yttz1k9",
      "g": "qw3yttz1k9",
      "l": [ 49.28713, -123.124026 ]
    },
    "-KKjjM-pAXvTuEjDjoj_": {
      ".priority": "qw3yttx6bv",
      "g": "qw3yttx6bv",
      "l": [ 36.12977, -115.172811 ]
    }
  },
  "log": "null",
  "users": {
    "62e6aaa0-a50c-4448-a381-f02efde2316d": {
      "events": {
        "-KKQQIUR7FAVxBOPOFhr": { "type": "boycott" },
        "-KKjjM-pAXvTuEjDjoj_": { "type": "boycott" }
      }
    }
  }
}
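The same substitution can also be sketched in Python, assuming the same file names (object.json and input.txt) and a Python version whose dicts preserve insertion order (3.7+):
import json

# Assumes the same file names as the jq invocation above
with open("input.txt") as f:
    coords = [[float(x) for x in line.split(",")] for line in f if line.strip()]

with open("object.json") as f:
    obj = json.load(f)

# Pair the nth geo entry with the nth coordinate line; dicts keep insertion order in 3.7+
for (key, entry), pair in zip(obj["geo"].items(), coords):
    entry["l"] = pair

print(json.dumps(obj, indent=2))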