I have a JSON file called data.json, shown below. I want to parse it with jq in streaming mode (without loading the whole file into memory), because the real data is 20 GB.
As far as I understand, streaming mode is enabled with the --stream flag, which makes jq emit the file as a sequence of small path/value events instead of reading it as one document.
{
"id": {
"bioguide": "E000295",
"thomas": "02283",
"govtrack": 412667,
"opensecrets": "N00035483",
"lis": "S376"
},
"bio": {
"gender": "F",
"birthday": "1970-07-01"
},
"tooldatareports": [
{
"name": "A",
"tooldata": [
{
"toolid": 12345,
"data": [
{
"time": "2021-01-01",
"value": 1
},
{
"time": "2021-01-02",
"value": 10
},
{
"time": "2021-01-03",
"value": 5
}
]
},
{
"toolid": 12346,
"data": [
{
"time": "2021-01-01",
"value": 10
},
{
"time": "2021-01-02",
"value": 100
},
{
"time": "2021-01-03",
"value": 50
}
]
}
]
}
]
}
The result I am hoping for is shown below:
a list of two dicts, each with a "data" list whose entries have two keys (time and value).
[
{
"data": [
{
"time": "2021-01-01",
"value": 1
},
{
"time": "2021-01-02",
"value": 10
},
{
"time": "2021-01-03",
"value": 5
}
]
},
{
"data": [
{
"time": "2021-01-01",
"value": 10
},
{
"time": "2021-01-02",
"value": 100
},
{
"time": "2021-01-03",
"value": 50
}
]
}
]
I used the command below. It gets part of the way there, but the result is still not what I want.
cat data.json | jq --stream 'select(.[0][0]=="tooldatareports" and .[0][2]=="tooldata" and .[1]!=null) | .'
The result is not a list of dicts.
Instead, every time and every value comes out separately, each in its own array.
Does anyone have an idea how to do this?
Here's a solution that does not use truncate_stream:
jq -n --stream '
[fromstream(
inputs
| (.[0] | index("data")) as $ix
| select($ix)
| .[0] |= .[$ix:] )]
' input.json
The following produces the required output:
jq -n --stream '
[{data: fromstream(5|truncate_stream(inputs))}]
' input.json
Needless to say, there are other variations ...
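To see why a depth of 5 works in the truncate_stream variant, here is a minimal Python sketch of what truncation does to the events. It is only an illustration of the idea, not jq's implementation; the function name and the shortened event list are assumptions made for the example.

```python
def truncate_stream(depth, events):
    """Drop the first `depth` path components; skip events that are not deeper than that."""
    for event in events:
        path = event[0]
        if len(path) > depth:
            # Keep the leaf value (if any) and shorten only the path.
            yield [path[depth:]] + event[1:]


# Shortened, illustrative events for a "data" array with a single entry.
events = [
    [["tooldatareports", 0, "tooldata", 0, "toolid"], 12345],
    [["tooldatareports", 0, "tooldata", 0, "data", 0, "time"], "2021-01-01"],
    [["tooldatareports", 0, "tooldata", 0, "data", 0, "value"], 1],
    [["tooldatareports", 0, "tooldata", 0, "data", 0, "value"]],
    [["tooldatareports", 0, "tooldata", 0, "data", 0]],
    [["tooldatareports", 0, "tooldata", 0, "data"]],
]
truncated = list(truncate_stream(5, events))
for event in truncated:
    print(event)
```

The toolid event and the final closing event have paths of length 5, so they are dropped. What remains describes the bare array that used to live under "data", which is why the jq command wraps each result in {data: ...}.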
Here's a step-by-step explanation of peak's answers.
First, let's convert the JSON to a stream of events. This is what jq --stream produces.
https://jqplay.org/s/VEunTmDSkf
[["id","bioguide"],"E000295"]
[["id","thomas"],"02283"]
[["id","govtrack"],412667]
[["id","opensecrets"],"N00035483"]
[["id","lis"],"S376"]
[["id","lis"]]
[["bio","gender"],"F"]
[["bio","birthday"],"1970-07-01"]
[["bio","birthday"]]
[["tooldatareports",0,"name"],"A"]
[["tooldatareports",0,"tooldata",0,"toolid"],12345]
[["tooldatareports",0,"tooldata",0,"data",0,"time"],"2021-01-01"]
[["tooldatareports",0,"tooldata",0,"data",0,"value"],1]
[["tooldatareports",0,"tooldata",0,"data",0,"value"]]
[["tooldatareports",0,"tooldata",0,"data",1,"time"],"2021-01-02"]
[["tooldatareports",0,"tooldata",0,"data",1,"value"],10]
[["tooldatareports",0,"tooldata",0,"data",1,"value"]]
[["tooldatareports",0,"tooldata",0,"data",2,"time"],"2021-01-03"]
[["tooldatareports",0,"tooldata",0,"data",2,"value"],5]
[["tooldatareports",0,"tooldata",0,"data",2,"value"]]
[["tooldatareports",0,"tooldata",0,"data",2]]
[["tooldatareports",0,"tooldata",0,"data"]]
[["tooldatareports",0,"tooldata",1,"toolid"],12346]
[["tooldatareports",0,"tooldata",1,"data",0,"time"],"2021-01-01"]
[["tooldatareports",0,"tooldata",1,"data",0,"value"],10]
[["tooldatareports",0,"tooldata",1,"data",0,"value"]]
[["tooldatareports",0,"tooldata",1,"data",1,"time"],"2021-01-02"]
[["tooldatareports",0,"tooldata",1,"data",1,"value"],100]
[["tooldatareports",0,"tooldata",1,"data",1,"value"]]
[["tooldatareports",0,"tooldata",1,"data",2,"time"],"2021-01-03"]
[["tooldatareports",0,"tooldata",1,"data",2,"value"],50]
[["tooldatareports",0,"tooldata",1,"data",2,"value"]]
[["tooldatareports",0,"tooldata",1,"data",2]]
[["tooldatareports",0,"tooldata",1,"data"]]
[["tooldatareports",0,"tooldata",1]]
[["tooldatareports",0,"tooldata"]]
[["tooldatareports",0]]
[["tooldatareports"]]
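The shape of these events can be reproduced with a short Python sketch. Note the assumption: this version walks a value that is already in memory, so it only illustrates the event format (leaf events [path, value] and closing events [path]), not the memory behaviour of jq --stream. The function name is made up for the example.

```python
def to_stream(value, path=()):
    """Yield jq-style stream events for an already-parsed JSON value."""
    if isinstance(value, dict) and value:
        items = list(value.items())
    elif isinstance(value, list) and value:
        items = list(enumerate(value))
    else:
        # Scalars and empty containers are leaves: [path, value].
        yield [list(path), value]
        return
    for key, child in items:
        yield from to_stream(child, path + (key,))
    # Closing event: the path of the last child, without a value.
    yield [list(path) + [items[-1][0]]]


doc = {"id": {"lis": "S376"}, "bio": {"gender": "F", "birthday": "1970-07-01"}}
for event in to_stream(doc):
    print(event)
```

A closing event carries the path of the last child of the container being closed, which is why [["id","lis"]] follows [["id","lis"],"S376"] in the listing above.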
Now do .[0] to extract the path portion of each event.
https://jqplay.org/s/XdPrp8RuEj
["id","bioguide"]
["id","thomas"]
["id","govtrack"]
["id","opensecrets"]
["id","lis"]
["id","lis"]
["bio","gender"]
["bio","birthday"]
["bio","birthday"]
["tooldatareports",0,"name"]
["tooldatareports",0,"tooldata",0,"toolid"]
["tooldatareports",0,"tooldata",0,"data",0,"time"]
["tooldatareports",0,"tooldata",0,"data",0,"value"]
["tooldatareports",0,"tooldata",0,"data",0,"value"]
["tooldatareports",0,"tooldata",0,"data",1,"time"]
["tooldatareports",0,"tooldata",0,"data",1,"value"]
["tooldatareports",0,"tooldata",0,"data",1,"value"]
["tooldatareports",0,"tooldata",0,"data",2,"time"]
["tooldatareports",0,"tooldata",0,"data",2,"value"]
["tooldatareports",0,"tooldata",0,"data",2,"value"]
["tooldatareports",0,"tooldata",0,"data",2]
["tooldatareports",0,"tooldata",0,"data"]
["tooldatareports",0,"tooldata",1,"toolid"]
["tooldatareports",0,"tooldata",1,"data",0,"time"]
["tooldatareports",0,"tooldata",1,"data",0,"value"]
["tooldatareports",0,"tooldata",1,"data",0,"value"]
["tooldatareports",0,"tooldata",1,"data",1,"time"]
["tooldatareports",0,"tooldata",1,"data",1,"value"]
["tooldatareports",0,"tooldata",1,"data",1,"value"]
["tooldatareports",0,"tooldata",1,"data",2,"time"]
["tooldatareports",0,"tooldata",1,"data",2,"value"]
["tooldatareports",0,"tooldata",1,"data",2,"value"]
["tooldatareports",0,"tooldata",1,"data",2]
["tooldatareports",0,"tooldata",1,"data"]
["tooldatareports",0,"tooldata",1]
["tooldatareports",0,"tooldata"]
["tooldatareports",0]
["tooldatareports"]
Let me first quickly explain index/1.
For the event [["tooldatareports",0,"tooldata",0,"data",0,"time"],"2021-01-01"], .[0] | index("data") is 4, since that is the (zero-based) position of the first occurrence of "data" in the path.
Knowing that, let's now do .[0] | index("data").
https://jqplay.org/s/ny0bV1xEED
null
null
null
null
null
null
null
null
null
null
null
4
4
4
4
4
4
4
4
4
4
4
null
4
4
4
4
4
4
4
4
4
4
4
null
null
null
null
As you can see, in our case the index is either 4 or null. We want to keep only the events whose index is not null; those are the events that have "data" somewhere in their path.
(.[0] | index("data")) as $ix | select($ix) does just that. $ix is computed separately for each event, and select($ix) passes an event through only when its $ix is neither null nor false.
For example, see https://jqplay.org/s/NwcD7_USZE. There, inputs | select(null) gives no output, but inputs | select(true) outputs every input.
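The index-then-select step can be mirrored in Python as a rough sketch. The helper names are hypothetical. One difference worth noting: jq treats only null and false as falsy, so an index of 0 would pass select($ix), whereas in Python the check has to be written as "is not None".

```python
def index_of(path, key):
    """Position of the first occurrence of `key` in `path`, or None (like jq's index/1)."""
    for position, part in enumerate(path):
        if part == key:
            return position
    return None


def select_with_key(events, key):
    """Keep only the events whose path contains `key`, paired with the index found."""
    for event in events:
        ix = index_of(event[0], key)
        if ix is not None:  # an index of 0 must pass, as it would in jq
            yield ix, event


events = [
    [["id", "bioguide"], "E000295"],
    [["tooldatareports", 0, "tooldata", 0, "toolid"], 12345],
    [["tooldatareports", 0, "tooldata", 0, "data", 0, "time"], "2021-01-01"],
    [["tooldatareports", 0, "tooldata", 0, "data"]],
]
selected = list(select_with_key(events, "data"))
for ix, event in selected:
    print(ix, event)
```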
This is the filtered stream:
https://jqplay.org/s/SgexvhtaGe
[["tooldatareports",0,"tooldata",0,"data",0,"time"],"2021-01-01"]
[["tooldatareports",0,"tooldata",0,"data",0,"value"],1]
[["tooldatareports",0,"tooldata",0,"data",0,"value"]]
[["tooldatareports",0,"tooldata",0,"data",1,"time"],"2021-01-02"]
[["tooldatareports",0,"tooldata",0,"data",1,"value"],10]
[["tooldatareports",0,"tooldata",0,"data",1,"value"]]
[["tooldatareports",0,"tooldata",0,"data",2,"time"],"2021-01-03"]
[["tooldatareports",0,"tooldata",0,"data",2,"value"],5]
[["tooldatareports",0,"tooldata",0,"data",2,"value"]]
[["tooldatareports",0,"tooldata",0,"data",2]]
[["tooldatareports",0,"tooldata",0,"data"]]
[["tooldatareports",0,"tooldata",1,"data",0,"time"],"2021-01-01"]
[["tooldatareports",0,"tooldata",1,"data",0,"value"],10]
[["tooldatareports",0,"tooldata",1,"data",0,"value"]]
[["tooldatareports",0,"tooldata",1,"data",1,"time"],"2021-01-02"]
[["tooldatareports",0,"tooldata",1,"data",1,"value"],100]
[["tooldatareports",0,"tooldata",1,"data",1,"value"]]
[["tooldatareports",0,"tooldata",1,"data",2,"time"],"2021-01-03"]
[["tooldatareports",0,"tooldata",1,"data",2,"value"],50]
[["tooldatareports",0,"tooldata",1,"data",2,"value"]]
[["tooldatareports",0,"tooldata",1,"data",2]]
[["tooldatareports",0,"tooldata",1,"data"]]
Before we go further, let's review update assignment (|=).
Have a look at https://jqplay.org/s/g4P6j8f9FG
Say the input is [["tooldatareports",0,"tooldata",0,"data",0,"time"],"2021-01-01"].
Then the filter .[0] |= .[4:] produces [["data",0,"time"],"2021-01-01"].
Why?
The right-hand side (.[4:]) is evaluated with the value selected by the left-hand side (.[0]) as its input, and the result is written back to that same location. So here the path ["tooldatareports",0,"tooldata",0,"data",0,"time"] is replaced by ["data",0,"time"], while the leaf value "2021-01-01" is left untouched.
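As a rough Python analogy of .[0] |= .[$ix:], the sketch below rewrites only the first element of an event (the path) and passes the rest through unchanged. The function name is an assumption made for the example.

```python
def trim_path(event, start):
    """Return a copy of `event` whose path begins at position `start`."""
    # event[0] is the path; event[1:] is the leaf value, if the event has one.
    return [event[0][start:]] + event[1:]


leaf = [["tooldatareports", 0, "tooldata", 0, "data", 0, "time"], "2021-01-01"]
closing = [["tooldatareports", 0, "tooldata", 0, "data"]]
print(trim_path(leaf, 4))
print(trim_path(closing, 4))
```

Closing events have no value, so the same rewrite works for them without a special case.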
Let's move on then.
So (.[0] | index("data")) as $ix | select($ix) | .[0] |= .[$ix:] has the output:
https://jqplay.org/s/AwcQpVyHO2
[["data",0,"time"],"2021-01-01"]
[["data",0,"value"],1]
[["data",0,"value"]]
[["data",1,"time"],"2021-01-02"]
[["data",1,"value"],10]
[["data",1,"value"]]
[["data",2,"time"],"2021-01-03"]
[["data",2,"value"],5]
[["data",2,"value"]]
[["data",2]]
[["data"]]
[["data",0,"time"],"2021-01-01"]
[["data",0,"value"],10]
[["data",0,"value"]]
[["data",1,"time"],"2021-01-02"]
[["data",1,"value"],100]
[["data",1,"value"]]
[["data",2,"time"],"2021-01-03"]
[["data",2,"value"],50]
[["data",2,"value"]]
[["data",2]]
[["data"]]
Now all we need to do is convert this stream back to json.
https://jqplay.org/s/j2uyzEU_Rc
[fromstream(inputs)] gives:
[
{
"data": [
{
"time": "2021-01-01",
"value": 1
},
{
"time": "2021-01-02",
"value": 10
},
{
"time": "2021-01-03",
"value": 5
}
]
},
{
"data": [
{
"time": "2021-01-01",
"value": 10
},
{
"time": "2021-01-02",
"value": 100
},
{
"time": "2021-01-03",
"value": 50
}
]
}
]
This is the output we wanted.
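For readers curious how a stream of events turns back into values, here is a small Python sketch of the idea behind fromstream. It is a simplified illustration under stated assumptions, not jq's actual code: a value is emitted when a top-level leaf arrives or when a closing event with a path of length 1 is seen. The helper names and the shortened event list are made up for the example.

```python
def set_path(root, path, value):
    """Set `value` at `path`, creating dicts (string keys) or lists (integer keys) as needed."""
    if not path:
        return value
    key = path[0]
    if root is None:
        root = {} if isinstance(key, str) else []
    if isinstance(key, int):
        while len(root) <= key:
            root.append(None)
        root[key] = set_path(root[key], path[1:], value)
    else:
        root[key] = set_path(root.get(key), path[1:], value)
    return root


def from_stream(events):
    """Rebuild complete values from jq-style stream events."""
    current = None
    for event in events:
        path = event[0]
        if len(event) == 2:
            current = set_path(current, path, event[1])
            if not path:  # a top-level scalar is complete immediately
                yield current
                current = None
        elif len(path) == 1:  # closing event of a top-level container
            yield current
            current = None


# Shortened, illustrative events: two "data" groups with one entry each.
events = [
    [["data", 0, "time"], "2021-01-01"],
    [["data", 0, "value"], 1],
    [["data", 0, "value"]],
    [["data", 0]],
    [["data"]],
    [["data", 0, "time"], "2021-01-01"],
    [["data", 0, "value"], 10],
    [["data", 0, "value"]],
    [["data", 0]],
    [["data"]],
]
rebuilt = list(from_stream(events))
print(rebuilt)
```

Only one top-level value is held at a time, which is the property that keeps memory use small when the individual "data" groups are small compared with the whole file.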
I am looking to ingest telemetry data, and the output is a multi-layered nested JSON file. I am interested in a few very specific fields, but I cannot work out how to parse the JSON to get to them.
Data Sample:
{ "version_str": "1.0.0", "node_id_str": "router-01", "encoding_path":
"sys/intf", "collection_id": 241466, "collection_start_time": 0,
"collection_end_time": 0, "msg_timestamp": 0, "subscription_id": [ ],
"sensor_group_id": [ ], "data_source": "DME", "data": {
"interfaceEntity": { "attributes": { "childAction": "", "descr": "",
"dn": "sys/intf", "modTs": "2017-09-19T13:24:14.751+00:00",
"monPolDn": "uni/fabric/monfab-default", "persistentOnReload": "true",
"status": "" }, "children": [ { "l3LbRtdIf": { "attributes": {
"adminSt": "up", "childAction": "", "descr": "Nothing", "id":
"lo103", "linkLog": "default", "modTs":
"2017-11-06T23:18:02.974+00:00", "monPolDn":
"uni/fabric/monfab-default", "name": "", "persistentOnReload": "true",
"rn": "lb-[lo103]", "status": "", "uid": "0" }, "children": [ {
"ethpmLbRtdIf": { "attributes": { "currErrIndex": "4294967295",
"ifIndex": "335544423", "iod": "14", "lastErrors": "0,0,0,0",
"operBitset": "", "operDescr": "Nothing", "operMtu": "1500",
"operSt": "up", "operStQual": "none", "rn": "lbrtdif" } } }, {
"nwRtVrfMbr": { "attributes": { "childAction": "", "l3vmCfgFailedBmp":
"", "l3vmCfgFailedTs": "00:00:00:00.000", "l3vmCfgState": "0",
"modTs": "2017-11-06T23:18:02.945+00:00", "monPolDn": "",
"parentSKey": "unspecified", "persistentOnReload": "true", "rn":
"rtvrfMbr", "status": "", "tCl": "l3Inst", "tDn": "sys/inst-default",
"tSKey": "" } } } ] } }, { "l3LbRtdIf": { "attributes": { "adminSt":
"up", "childAction": "", "descr": "Nothing", "id": "lo104",
"linkLog": "default", "modTs": "2018-01-25T15:54:20.367+00:00",
"monPolDn": "uni/fabric/monfab-default", "name": "",
"persistentOnReload": "true", "rn": "lb-[lo104]", "status": "", "uid":
"0" }, "children": [ { "ethpmLbRtdIf": { "attributes": {
"currErrIndex": "4294967295", "ifIndex": "335544424", "iod": "77",
"lastErrors": "0,0,0,0", "operBitset": "", "operDescr":
"Nothing", "operMtu": "1500", "operSt": "up", "operStQual":
"none", "rn": "lbrtdif" } } }, { "nwRtVrfMbr": { "attributes": {
"childAction": "", "l3vmCfgFailedBmp": "", "l3vmCfgFailedTs":
"00:00:00:00.000", "l3vmCfgState": "0", "modTs":
"2018-01-25T15:53:55.757+00:00", "monPolDn": "", "parentSKey":
"unspecified", "persistentOnReload": "true", "rn": "rtvrfMbr",
"status": "", "tCl": "l3Inst", "tDn": "sys/inst-default", "tSKey": ""
} } } ] } }, { "l3LbRtdIf": { "attributes": { "adminSt": "up",
"childAction": "", "descr": "Nothing", "id": "lo101",
"linkLog": "default", "modTs": "2017-11-13T21:39:58.910+00:00",
"monPolDn": "uni/fabric/monfab-default", "name": "",
"persistentOnReload": "true", "rn": "lb-[lo101]", "status": "", "uid":
"0" }, "children": [ { "ethpmLbRtdIf": { "attributes": {
"currErrIndex": "4294967295", "ifIndex": "335544421", "iod": "12",
"lastErrors": "0,0,0,0", "operBitset": "", "operDescr":
"Nothing", "operMtu": "1500", "operSt": "up", "operStQual":
"none", "rn": "lbrtdif" } } }, { "nwRtVrfMbr": { "attributes": {
"childAction": "", "l3vmCfgFailedBmp": "", "l3vmCfgFailedTs":
"00:00:00:00.000", "l3vmCfgState": "0", "modTs":
"2017-11-13T21:39:58.880+00:00", "monPolDn": "", "parentSKey":
"unspecified", "persistentOnReload": "true", "rn": "rtvrfMbr",
"status": "", "tCl": "l3Inst", "tDn": "sys/inst-default", "tSKey": ""
} } } ] } }, { "l3LbRtdIf": { "attributes": { "adminSt": "up",
"childAction": "", "descr": "\"^:tier2:if:loopback:mgmt:l3\"", "id":
"lo0", "linkLog": "default", "modTs": "2017-09-25T20:29:54.003+00:00",
"monPolDn": "uni/fabric/monfab-default", "name": "",
"persistentOnReload": "true", "rn": "lb-[lo0]", "status": "", "uid":
"0" }, "children": [ { "ethpmLbRtdIf": { "attributes": {
"currErrIndex": "4294967295", "ifIndex": "335544320", "iod": "11",
"lastErrors": "0,0,0,0", "operBitset": "", "operDescr":
"\"^:tier2:if:loopback:mgmt:l3\"", "operMtu": "1500", "operSt": "up",
"operStQual": "none", "rn": "lbrtdif" } } }, { "nwRtVrfMbr":...
I am interested in these attributes:
| | | | | | | |-- rmonIfIn: struct (nullable = true)
| | | | | | | | |-- attributes: struct (nullable = true)
| | | | | | | | | |-- broadcastPkts: string (nullable = true)
| | | | | | | | | |-- discards: string (nullable = true)
| | | | | | | | | |-- errors: string (nullable = true)
| | | | | | | | | |-- multicastPkts: string (nullable = true)
| | | | | | | | | |-- nUcastPkts: string (nullable = true)
| | | | | | | | | |-- packetRate: string (nullable = true)
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.explode
val spark = SparkSession.builder().getOrCreate()
// spark.implicits._ provides the $"..." column syntax; it can only be imported once spark exists
import spark.implicits._
val df = spark.read.option("header","true").option("inferSchema","true").json("file:///usr/local/Projects/out.txt")
// This is the select that fails with the error shown below
val mapDF = df.select($"node_id_str" as "nodename", $"data".getItem("InterfaceEntity").getItem("children").getItem("l1PhysIf").getItem("children").getItem("element"))
Whenever I try to go any deeper, I get a data type error:
stringJsonDF: org.apache.spark.sql.DataFrame = [nestDevice: string]
org.apache.spark.sql.AnalysisException: cannot resolve '`data`.`InterfaceEntity`.`children`.`l1PhysIf`.`children`['element']' due to data type mismatch: argument 2 requires integral type, however, ''element'' is of string type.;;
You can use Google's Gson library to work with JSON. It converts objects to JSON and back again. Here is an example of converting an object to JSON:
Gson gson = new Gson();
List<Map<Long, String>> listOfMaps = new ArrayList<>();
// Create some maps here and add them to listOfMaps.
String listOfMapsInJsonFormat = gson.toJson(listOfMaps);
The sample above converts an object to JSON. To go the other way, you can do this:
Gson gson = new Gson();
List list = gson.fromJson(listOfMapsInJsonFormat, List.class);
The code above turns the input JSON string into a list of maps. The maps Gson builds may, however, differ in type from the ones you had before converting the original object to JSON, because the raw List.class carries no generic type information. To avoid that, you can use the TypeToken class:
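// Requires: com.google.gson.Gson, com.google.gson.reflect.TypeToken, java.lang.reflect.Type
Gson gson = new Gson();
// The anonymous TypeToken subclass captures the full generic type for Gson to use at runtime.
Type type = new TypeToken<ArrayList<Map<Long, String>>>(){}.getType();
ArrayList<Map<Long, String>> list = gson.fromJson(listOfMapsInJsonFormat, type);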
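The type drift mentioned above is not specific to Gson: JSON object keys are always strings, so typed keys have to be restored after parsing. As a language-neutral illustration, here is the same round trip sketched in Python with the standard json module. The sample values are made up for the example.

```python
import json

# A list of maps with integer keys, analogous to List<Map<Long, String>>.
original = [{1: "one", 2: "two"}]

encoded = json.dumps(original)   # integer keys are written as JSON strings
decoded = json.loads(encoded)    # ...and come back as str, not int

# Restoring the intended key type is an explicit extra step.
restored = [{int(key): value for key, value in item.items()} for item in decoded]

print(encoded)
print(decoded)
print(restored)
```

In Gson, the TypeToken plays the role of that last step by telling the parser which key and value types to build.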
Since the fields sit inside several nested arrays, I assume you are interested in every occurrence of those fields per record (so if one record contains n rmonIfIn items because of the nested arrays, you want each of them).
If so, it makes sense to explode these nested arrays and work on the expanded dataframe.
Based on your code and the incomplete JSON example, it could look something like this:
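// Assumes the layout implied by the sample and by your select: each "children" is an array,
// and rmonIfIn sits under data.interfaceEntity.children[].l1PhysIf.children[]
val nested = df
.select(explode($"data.interfaceEntity.children").alias("l1"))
.select(explode($"l1.l1PhysIf.children").alias("l2"))
.select($"l2.rmonIfIn.attributes".alias("l3"))
.where($"l3".isNotNull) // skip children that are not rmonIfIn
.select($"l3.broadcastPkts", $"l3.discards", $"l3.errors", $"l3.multicastPkts", $"l3.packetRate")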
Returning a dataframe that could look like
+-------------+--------+------+-------------+----------+
|broadcastPkts|discards|errors|multicastPkts|packetRate|
+-------------+--------+------+-------------+----------+
|1            |1       |1     |1            |1         |
|2            |2       |2     |2            |2         |
|3            |3       |3     |3            |3         |
|4            |4       |4     |4            |4         |
+-------------+--------+------+-------------+----------+
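To make the effect of exploding nested arrays concrete, here is a plain-Python sketch of the same flattening on an ordinary dict. It assumes the nesting suggested by the question (children arrays under interfaceEntity and under l1PhysIf); the record, its values and the key "otherChild" are hypothetical and exist only for the example.

```python
def rmon_rows(record):
    """Yield one attributes dict per rmonIfIn entry found under any l1PhysIf child."""
    for child in record["data"]["interfaceEntity"]["children"]:
        phys = child.get("l1PhysIf")
        if phys is None:
            continue  # e.g. an l3LbRtdIf entry
        for grandchild in phys.get("children", []):
            rmon = grandchild.get("rmonIfIn")
            if rmon is not None:
                yield rmon["attributes"]


# Hypothetical record; field values are invented for illustration.
record = {
    "node_id_str": "router-01",
    "data": {"interfaceEntity": {"children": [
        {"l1PhysIf": {"attributes": {"id": "eth1/1"}, "children": [
            {"rmonIfIn": {"attributes": {"broadcastPkts": "1", "discards": "0"}}},
            {"otherChild": {"attributes": {"operSt": "up"}}},
        ]}},
        {"l3LbRtdIf": {"attributes": {"id": "lo103"}, "children": []}},
    ]}},
}
rows = list(rmon_rows(record))
print(rows)
```

Each loop level corresponds to one explode call: one output row per array element, with elements of the wrong kind skipped.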