I have a JSON file called data.json, shown below. I want to parse it with the jq tool in streaming mode (without loading the whole file into memory), because the real data is 20 GB.
As far as I understand, streaming mode in jq is enabled by adding the --stream flag, and jq then parses the JSON file incrementally instead of all at once.
{
"id": {
"bioguide": "E000295",
"thomas": "02283",
"govtrack": 412667,
"opensecrets": "N00035483",
"lis": "S376"
},
"bio": {
"gender": "F",
"birthday": "1970-07-01"
},
"tooldatareports": [
{
"name": "A",
"tooldata": [
{
"toolid": 12345,
"data": [
{
"time": "2021-01-01",
"value": 1
},
{
"time": "2021-01-02",
"value": 10
},
{
"time": "2021-01-03",
"value": 5
}
]
},
{
"toolid": 12346,
"data": [
{
"time": "2021-01-01",
"value": 10
},
{
"time": "2021-01-02",
"value": 100
},
{
"time": "2021-01-03",
"value": 50
}
]
}
]
}
]
}
The final result I hope for is shown below:
a list containing two dicts, each holding a "data" array whose entries have 2 keys (time and value).
[
{
"data": [
{
"time": "2021-01-01",
"value": 1
},
{
"time": "2021-01-02",
"value": 10
},
{
"time": "2021-01-03",
"value": 5
}
]
},
{
"data": [
{
"time": "2021-01-01",
"value": 10
},
{
"time": "2021-01-02",
"value": 100
},
{
"time": "2021-01-03",
"value": 50
}
]
}
]
I used the command below to get a result, but it is not quite what I want.
cat data.json | jq --stream 'select(.[0][0]=="tooldatareports" and .[0][2]=="tooldata" and .[1]!=null) | .'
The result is not a list of dicts;
instead, each time and value ends up in a separate list.
Does anyone have any idea how to fix this?
Here's a solution that does not use truncate_stream:
jq -n --stream '
[fromstream(
inputs
| (.[0] | index("data")) as $ix
| select($ix)
| .[0] |= .[$ix:] )]
' input.json
The following produces the required output:
jq -n --stream '
[{data: fromstream(5|truncate_stream(inputs))}]
' input.json
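As a small illustration of what the depth argument to truncate_stream does, the sketch below feeds it three hand-written events (hypothetical, not taken from data.json) with a depth of 1. The first path component is stripped from each event, and events whose path is not longer than the depth are dropped.

```shell
# Depth 1: drop the leading "a" from every path; [["a"]] is too short and is discarded.
truncated=$(jq -n -c '[1 | truncate_stream([["a","b"],1], [["a","b"]], [["a"]])]')
echo "$truncated"
# [[["b"],1],[["b"]]]
```

With a depth of 5, as in the command above, the same rule removes "tooldatareports", 0, "tooldata", the tool index and "data" from every path under a data array.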
Needless to say, there are other variations ...
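One variation that may matter for a 20 GB input: wrapping the result in [...] makes jq collect every reconstructed object before printing anything. If one object per line is acceptable, the outer brackets can be dropped so that each object is printed as soon as it is complete. The sketch below assumes a tiny stand-in for data.json, written to a temporary file purely for the demonstration.

```shell
# Hypothetical miniature of data.json, written to a temporary file for the demo.
sample=$(mktemp)
cat > "$sample" <<'EOF'
{"tooldatareports":[{"name":"A","tooldata":[
  {"toolid":12345,"data":[{"time":"2021-01-01","value":1}]},
  {"toolid":12346,"data":[{"time":"2021-01-01","value":10}]}]}]}
EOF

# No surrounding [...]: each {"data": ...} object is emitted on its own line.
objects=$(jq -n -c --stream '
  fromstream(
    inputs
    | (.[0] | index("data")) as $ix
    | select($ix)
    | .[0] |= .[$ix:] )
' "$sample")
rm -f "$sample"
echo "$objects"
# {"data":[{"time":"2021-01-01","value":1}]}
# {"data":[{"time":"2021-01-01","value":10}]}
```

Like the original filter, this relies on "data" being the last key of each tooldata entry, because the closing event for that key is what tells fromstream the object is finished.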
Here's a step-by-step explanation of peak's answers.
First, let's convert the JSON to a stream.
https://jqplay.org/s/VEunTmDSkf
[["id","bioguide"],"E000295"]
[["id","thomas"],"02283"]
[["id","govtrack"],412667]
[["id","opensecrets"],"N00035483"]
[["id","lis"],"S376"]
[["id","lis"]]
[["bio","gender"],"F"]
[["bio","birthday"],"1970-07-01"]
[["bio","birthday"]]
[["tooldatareports",0,"name"],"A"]
[["tooldatareports",0,"tooldata",0,"toolid"],12345]
[["tooldatareports",0,"tooldata",0,"data",0,"time"],"2021-01-01"]
[["tooldatareports",0,"tooldata",0,"data",0,"value"],1]
[["tooldatareports",0,"tooldata",0,"data",0,"value"]]
[["tooldatareports",0,"tooldata",0,"data",1,"time"],"2021-01-02"]
[["tooldatareports",0,"tooldata",0,"data",1,"value"],10]
[["tooldatareports",0,"tooldata",0,"data",1,"value"]]
[["tooldatareports",0,"tooldata",0,"data",2,"time"],"2021-01-03"]
[["tooldatareports",0,"tooldata",0,"data",2,"value"],5]
[["tooldatareports",0,"tooldata",0,"data",2,"value"]]
[["tooldatareports",0,"tooldata",0,"data",2]]
[["tooldatareports",0,"tooldata",0,"data"]]
[["tooldatareports",0,"tooldata",1,"toolid"],12346]
[["tooldatareports",0,"tooldata",1,"data",0,"time"],"2021-01-01"]
[["tooldatareports",0,"tooldata",1,"data",0,"value"],10]
[["tooldatareports",0,"tooldata",1,"data",0,"value"]]
[["tooldatareports",0,"tooldata",1,"data",1,"time"],"2021-01-02"]
[["tooldatareports",0,"tooldata",1,"data",1,"value"],100]
[["tooldatareports",0,"tooldata",1,"data",1,"value"]]
[["tooldatareports",0,"tooldata",1,"data",2,"time"],"2021-01-03"]
[["tooldatareports",0,"tooldata",1,"data",2,"value"],50]
[["tooldatareports",0,"tooldata",1,"data",2,"value"]]
[["tooldatareports",0,"tooldata",1,"data",2]]
[["tooldatareports",0,"tooldata",1,"data"]]
[["tooldatareports",0,"tooldata",1]]
[["tooldatareports",0,"tooldata"]]
[["tooldatareports",0]]
[["tooldatareports"]]
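To reproduce this kind of listing locally, the following sketch streams a much smaller, made-up document. It shows the two kinds of events: two-element events [path, leaf value], and one-element closing events [path] that mark the end of an array or object.

```shell
# Stream a tiny hypothetical document; -c prints one event per line.
events=$(printf '%s' '{"a":[1,{"b":2}]}' | jq -c --stream .)
echo "$events"
# [["a",0],1]
# [["a",1,"b"],2]
# [["a",1,"b"]]
# [["a",1]]
# [["a"]]
```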
Now apply .[0] to extract the path portion of each event in the stream.
https://jqplay.org/s/XdPrp8RuEj
["id","bioguide"]
["id","thomas"]
["id","govtrack"]
["id","opensecrets"]
["id","lis"]
["id","lis"]
["bio","gender"]
["bio","birthday"]
["bio","birthday"]
["tooldatareports",0,"name"]
["tooldatareports",0,"tooldata",0,"toolid"]
["tooldatareports",0,"tooldata",0,"data",0,"time"]
["tooldatareports",0,"tooldata",0,"data",0,"value"]
["tooldatareports",0,"tooldata",0,"data",0,"value"]
["tooldatareports",0,"tooldata",0,"data",1,"time"]
["tooldatareports",0,"tooldata",0,"data",1,"value"]
["tooldatareports",0,"tooldata",0,"data",1,"value"]
["tooldatareports",0,"tooldata",0,"data",2,"time"]
["tooldatareports",0,"tooldata",0,"data",2,"value"]
["tooldatareports",0,"tooldata",0,"data",2,"value"]
["tooldatareports",0,"tooldata",0,"data",2]
["tooldatareports",0,"tooldata",0,"data"]
["tooldatareports",0,"tooldata",1,"toolid"]
["tooldatareports",0,"tooldata",1,"data",0,"time"]
["tooldatareports",0,"tooldata",1,"data",0,"value"]
["tooldatareports",0,"tooldata",1,"data",0,"value"]
["tooldatareports",0,"tooldata",1,"data",1,"time"]
["tooldatareports",0,"tooldata",1,"data",1,"value"]
["tooldatareports",0,"tooldata",1,"data",1,"value"]
["tooldatareports",0,"tooldata",1,"data",2,"time"]
["tooldatareports",0,"tooldata",1,"data",2,"value"]
["tooldatareports",0,"tooldata",1,"data",2,"value"]
["tooldatareports",0,"tooldata",1,"data",2]
["tooldatareports",0,"tooldata",1,"data"]
["tooldatareports",0,"tooldata",1]
["tooldatareports",0,"tooldata"]
["tooldatareports",0]
["tooldatareports"]
Let me first quickly explain index/1.
For the event [["tooldatareports",0,"tooldata",0,"data",0,"time"],"2021-01-01"], index("data") applied to its path (.[0]) is 4, since that is the index of the first occurrence of "data" in the path.
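A quick way to check this outside jqplay is to run index("data") on two literal paths, one that contains "data" and one that does not:

```shell
# Path containing "data": index of its first occurrence.
with_data=$(jq -n -c '["tooldatareports",0,"tooldata",0,"data",0,"time"] | index("data")')
# Path without "data": index/1 returns null.
without_data=$(jq -n -c '["id","bioguide"] | index("data")')
echo "$with_data $without_data"
# 4 null
```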
Knowing that, let's now run .[0] | index("data").
https://jqplay.org/s/ny0bV1xEED
null
null
null
null
null
null
null
null
null
null
null
4
4
4
4
4
4
4
4
4
4
4
null
4
4
4
4
4
4
4
4
4
4
4
null
null
null
null
As you can see, in our case the indexes are either 4 or null. We want to keep only the events whose index is not null. Those are the events that have "data" as part of their path.
(.[0] | index("data")) as $ix | select($ix) does just that. Remember that each event gets its own $ix, so only the events whose $ix is not null are passed through.
For example, see https://jqplay.org/s/NwcD7_USZE. There, inputs | select(null) gives no output, but inputs | select(true) outputs every input.
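The same behaviour can be checked with a few literal values. In jq only null and false count as false, so an index of 0 would also pass through select; that is why select($ix) is safe even if "data" were the first path component.

```shell
# select(.) keeps every value that is neither null nor false.
kept=$(jq -n -c '[(null, false, 0, 4) | select(.)]')
echo "$kept"
# [0,4]
```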
This is the filtered stream:
https://jqplay.org/s/SgexvhtaGe
[["tooldatareports",0,"tooldata",0,"data",0,"time"],"2021-01-01"]
[["tooldatareports",0,"tooldata",0,"data",0,"value"],1]
[["tooldatareports",0,"tooldata",0,"data",0,"value"]]
[["tooldatareports",0,"tooldata",0,"data",1,"time"],"2021-01-02"]
[["tooldatareports",0,"tooldata",0,"data",1,"value"],10]
[["tooldatareports",0,"tooldata",0,"data",1,"value"]]
[["tooldatareports",0,"tooldata",0,"data",2,"time"],"2021-01-03"]
[["tooldatareports",0,"tooldata",0,"data",2,"value"],5]
[["tooldatareports",0,"tooldata",0,"data",2,"value"]]
[["tooldatareports",0,"tooldata",0,"data",2]]
[["tooldatareports",0,"tooldata",0,"data"]]
[["tooldatareports",0,"tooldata",1,"data",0,"time"],"2021-01-01"]
[["tooldatareports",0,"tooldata",1,"data",0,"value"],10]
[["tooldatareports",0,"tooldata",1,"data",0,"value"]]
[["tooldatareports",0,"tooldata",1,"data",1,"time"],"2021-01-02"]
[["tooldatareports",0,"tooldata",1,"data",1,"value"],100]
[["tooldatareports",0,"tooldata",1,"data",1,"value"]]
[["tooldatareports",0,"tooldata",1,"data",2,"time"],"2021-01-03"]
[["tooldatareports",0,"tooldata",1,"data",2,"value"],50]
[["tooldatareports",0,"tooldata",1,"data",2,"value"]]
[["tooldatareports",0,"tooldata",1,"data",2]]
[["tooldatareports",0,"tooldata",1,"data"]]
Before we go further let's review update assignment.
Have a look at https://jqplay.org/s/g4P6j8f9FG
Let's say we have input [["tooldatareports",0,"tooldata",0,"data",0,"time"],"2021-01-01"].
Then the filter .[0] |= .[4:] produces [["data",0,"time"],"2021-01-01"].
Why?
Remember that the right-hand side (.[4:]) is evaluated with the value at the left-hand side's path (.[0]) as its input, and its result is written back to that path. So in this case it has the effect of updating the path ["tooldatareports",0,"tooldata",0,"data",0,"time"] to ["data",0,"time"].
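The update can be tried directly on that literal event. The second command is a sketch of the equivalent plain assignment, which spells out where the right-hand side gets its input from.

```shell
# Update assignment: .[4:] runs against the current value of .[0].
trimmed=$(jq -n -c '[["tooldatareports",0,"tooldata",0,"data",0,"time"],"2021-01-01"] | .[0] |= .[4:]')
# Equivalent plain assignment for this case.
explicit=$(jq -n -c '[["tooldatareports",0,"tooldata",0,"data",0,"time"],"2021-01-01"] | .[0] = (.[0] | .[4:])')
echo "$trimmed"
# [["data",0,"time"],"2021-01-01"]
```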
Let's move on then.
So (.[0] | index("data")) as $ix | select($ix) | .[0] |= .[$ix:] has the output:
https://jqplay.org/s/AwcQpVyHO2
[["data",0,"time"],"2021-01-01"]
[["data",0,"value"],1]
[["data",0,"value"]]
[["data",1,"time"],"2021-01-02"]
[["data",1,"value"],10]
[["data",1,"value"]]
[["data",2,"time"],"2021-01-03"]
[["data",2,"value"],5]
[["data",2,"value"]]
[["data",2]]
[["data"]]
[["data",0,"time"],"2021-01-01"]
[["data",0,"value"],10]
[["data",0,"value"]]
[["data",1,"time"],"2021-01-02"]
[["data",1,"value"],100]
[["data",1,"value"]]
[["data",2,"time"],"2021-01-03"]
[["data",2,"value"],50]
[["data",2,"value"]]
[["data",2]]
[["data"]]
Now all we need to do is convert this stream back to JSON.
https://jqplay.org/s/j2uyzEU_Rc
[fromstream(inputs)] gives:
[
{
"data": [
{
"time": "2021-01-01",
"value": 1
},
{
"time": "2021-01-02",
"value": 10
},
{
"time": "2021-01-03",
"value": 5
}
]
},
{
"data": [
{
"time": "2021-01-01",
"value": 10
},
{
"time": "2021-01-02",
"value": 100
},
{
"time": "2021-01-03",
"value": 50
}
]
}
]
This is the output we wanted.
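To see why the truncated stream yields two separate objects rather than one, here is a minimal sketch with hand-written events (shortened from the ones above). fromstream emits a value each time it sees a closing event whose path has length 1, here [["data"]], and then starts over with a fresh object.

```shell
# Two runs of events, each ended by the top-level closing event [["data"]].
rebuilt=$(jq -n -c '[fromstream(
  [["data",0,"time"],"2021-01-01"],
  [["data",0,"time"]],
  [["data",0]],
  [["data"]],
  [["data",0,"time"],"2021-01-02"],
  [["data",0,"time"]],
  [["data",0]],
  [["data"]]
)]')
echo "$rebuilt"
# [{"data":[{"time":"2021-01-01"}]},{"data":[{"time":"2021-01-02"}]}]
```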