Unnest a huge JSON array into individual JSON objects

I have a single JSON object, as below:
{
"someOtherArray": [ {} , {} ],
"a": [
{
"item1": "item1_value",
"item2": "item2_value"
},
{
"item1": "item1_value",
"item2": "item2_value"
},
{
....
},
100 million more objects
]
}
I'm trying to make each element in the array a separate JSON object, as below:
{ "a": { "item1": "item1_value", "item2": "item2_value" } }
{ "a": { "item1": "item1_value", "item2": "item2_value" } }
The raw file has millions of nested objects in a single JSON array, which I want to split into multiple individual JSON documents.

This is a response to the revised question (i.e., "I just want 'a'").
You could just tweak the standard answer:
jq --stream -nc '
{"a": fromstream(2|truncate_stream(inputs | select(.[0][0]=="a")) )}
'
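For the sample document, the select drops the streamed events that belong to someOtherArray (events of the form [["someOtherArray",0],{}]), since their paths do not start with "a"; only the elements of .a are reconstructed, each one wrapped as {"a": ...}.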
Footnote: Execution Times
The jq streaming parser is economical with memory at the expense of execution speed. If the input consists of an array of N small objects, then the execution time should very roughly be linear in N, and the memory requirements should be roughly constant.
To give some idea of what to expect, I created an array of 10^8 objects similar to those described in the Q. The file size was 4GB. On a 3GHz machine, reading the file took about 16 minutes of u+s time, but the "peak memory footprint" was only 1.2MB.
gojq was slightly slower but required significantly more memory, the "peak memory footprint" being 8.4MB, and I suspect that the required memory grows with N.
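As an aside, if you want to gather such figures yourself and GNU time is available, its verbose mode reports the peak memory on the "Maximum resident set size" line (a sketch; huge.json is a placeholder for your input file, and on macOS the BSD time flag is -l instead of -v):
/usr/bin/time -v jq --stream -nc '{"a": fromstream(2|truncate_stream(inputs | select(.[0][0]=="a")))}' huge.json > /dev/null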

To process a huge file, possibly larger than what fits into memory, you can break it down into pieces using the --stream option. This stream can then be read sequentially using inputs in combination with the --null-input (or -n) flag. To achieve the overall effect of
jq '.a[]' file.json
you need to truncate the streamed parts by stripping off the first two levels of their structure information (essentially their location path: the outer object's a field, and the contained array's indices). fromstream will then reconstruct each entity once it has been read in completely.
jq --stream -n 'fromstream(2 | truncate_stream(inputs))' file.json
{
"item1": "item1_value",
"item2": "item2_value"
}
{
"item1": "item1_value",
"item2": "item2_value"
}
:
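To see what the truncation is doing here, compare the raw streamed events for the first array element with their truncated counterparts. The streaming parser emits path/value pairs roughly like
[["a",0,"item1"],"item1_value"]
[["a",0,"item2"],"item2_value"]
[["a",0,"item2"]]
and 2|truncate_stream strips the leading "a" and the array index from each path, leaving
[["item1"],"item1_value"]
[["item2"],"item2_value"]
[["item2"]]
which is why fromstream reconstructs each array element as a top-level entity.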
To create your final structure, re-create the resulting object with the output of fromstream, and use the --compact-output (or -c) option to have each object on its own line:
jq --stream -nc '{a: fromstream(2 | truncate_stream(inputs))}' file.json
{"a":{"item1":"item1_value","item2":"item2_value"}}
{"a":{"item1":"item1_value","item2":"item2_value"}}
:
If you also want the top-level field name (here a) to be read in and re-created dynamically, you will have to construct your own stream truncation:
jq --stream -nc '
fromstream(inputs | if first | has(2) then
setpath([0]; first | del(.[1])),
if has(1) then empty else map(.[:1]) end
else empty end)
' file.json
{"a":{"item1":"item1_value","item2":"item2_value"}}
{"a":{"item1":"item1_value","item2":"item2_value"}}
:

Here's a straightforward "streaming" solution to the problem assuming the top-level key must be determined dynamically:
< input.json jq -cn --stream '
input as $in
| $in[0][0] as $key
| fromstream(2|truncate_stream($in,inputs))
| {($key): .}
'

Related

How to get max value of a date field in a large json file?

I have a large JSON file, around 500MB, which is the response of a URL call. I need to get the max value of the "date" field in the JSON file, in the "results" array, using a shell script (bash). Currently I am using jq as below. It works fine for smaller files, but for larger files it returns null.
maxDate=$(cat ${jsonfilePath} | jq '[ .results[]?.date ] | max')
Please help. Thanks! I am new to shell scripting, JSON, and jq.
sample/input json file contents:
{
"results": [
{
"Id": "123",
"date": 1588910400000,
"col": "test"
},
{
"Id": "1234",
"date": 1588910412345,
"col": "test2"
}
],
"col2": 123
}
Given the --stream option on the command line, jq won't load the whole input into memory; instead, it will read the input token by token, producing arrays in this fashion:
[["results",0,"Id"],"123"]
[["results",0,"date"],1588910400000]
...
[["results",1,"date"],1588910412345]
...
Thanks to this feature, we can pick only the dates from the input and find the maximum one without exhausting memory (at the expense of speed). For example:
jq -n --stream 'reduce (inputs|select(.[0][-1]=="date" and length==2)[1]) as $d (null; [.,$d]|max)' file
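With the sample file above, this outputs:
1588910412345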
500MB should not be so large as to require the --stream option, which generally slows things down. Here then is a fast and efficient(*) solution that does not use the streaming option, but instead uses a generic, stream-oriented "max_by" function defined as follows:
# max_by(empty;1) yields null
def max_by(s; f):
  reduce s as $s (null;
    if . == null then {s: $s, m: ($s|f)}
    else ($s|f) as $m
    | if $m > .m then {s: $s, m: $m} else . end
    end)
  | .s ;
With this in our toolkit, we can simply write:
max_by(.results[].date; .)
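For instance, with the definition pasted inline, a complete invocation might look like the following (just a sketch; the def could equally be kept in a separate file and pulled in with -f):
jq 'def max_by(s; f):
  reduce s as $s (null;
    if . == null then {s: $s, m: ($s|f)}
    else ($s|f) as $m
    | if $m > .m then {s: $s, m: $m} else . end
    end)
  | .s;
max_by(.results[].date; .)' "$jsonfilePath"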
This of course assumes that there is a "results" field containing an array of JSON objects. (**) From the problem statement, it would appear that this assumption does not always hold, so you will probably want to modify whichever approach you choose accordingly (e.g. by checking whether there is a results field, whether it's array-valued, etc.)
(*) Using max_by/2 here is more efficient, both in terms of space and time, than using the built-in max_by/1.
(**) The absence of a "date" subfield should not matter as null is less than every number.
 jq '.results | max_by(.date) | .date' "$jsonfilePath"
is a more efficient way to get the maximum date value out of that JSON that might work better for you. It avoids the Useless Use Of Cat, doesn't create a temporary array of just the date values, and thus only needs one pass through the array.

Use jq to concatenate JSON arrays in multiple files

I have a series of JSON files containing an array of records, e.g.
$ cat f1.json
{
"records": [
{"a": 1},
{"a": 3}
]
}
$ cat f2.json
{
"records": [
{"a": 2}
]
}
I want to 1) extract a single field from each record and 2) output a single array containing all the field values from all input files.
The first part is easy:
jq '.records | map(.a)' f?.json
[
1,
3
]
[
2
]
But I cannot figure out how to get jq to concatenate those output arrays into a single array!
I'm not married to jq; I'll happily use another tool if necessary. But I would love to know how to do this with jq, because it's something I have been trying to figure out for years.
Assuming your jq has inputs (which is true of jq 1.5 and later), it would be most efficient to use it, e.g. along the lines of:
jq -n '[inputs.records[].a]' f*.json
Use -s (or --slurp):
jq -s 'map(.records[].a)' f?.json
You need to use --slurp so that jq will apply its filter to the aggregation of all inputs rather than to each input separately. When using this option, jq's input will be an array of the inputs, which you need to account for.
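For the two sample files, the slurped input that the filter operates on is the single array:
[{"records":[{"a":1},{"a":3}]},{"records":[{"a":2}]}]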
I would use the following:
jq --slurp 'map(.records | map(.a)) | add' f?.json
We apply your current transformation to each element of the slurped array of inputs (your previous individual inputs), then we merge those transformed arrays into one with add.
If your input files are large, slurping them could eat up lots of memory, in which case you can use reduce, which works in an iterative manner, appending the contents of the .a arrays one object at a time:
jq -n 'reduce inputs.records[].a as $d (.; . += [$d])' f?.json
The -n flag ensures that the output JSON is constructed from scratch from the data available via inputs. The reduce expression starts from the initial value of ., which because of the null input is just null. Then, for each of the input values, . += [$d] appends the contents of the .a arrays together.
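With the two sample files, the accumulator therefore evolves like this:
null
[1]
[1,3]
[1,3,2]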
As a compromise between the readability of --slurp and the efficiency of reduce, you can run jq twice. The first is a slightly altered version of your original command, the second slurps the undifferentiated output into a single array.
$ jq '.records[] | .a' f?.json | jq -s .
[
1,
3,
2
]
The --slurp (-s) option is needed, together with map(), to do it in one shot:
$ cat f1.json
{
"records": [
{"a": 1},
{"a": 3}
]
}
$ cat f2.json
{
"records": [
{"a": 2}
]
}
$ jq -s 'map(.records[].a)' f?.json
[
1,
3,
2
]

jq streaming - filter nested list and retain global structure

In a large json file, I want to remove some elements from a nested list, but keep the overall structure of the document.
My example input is this (but the real one is large enough to demand streaming).
{
"keep_untouched": {
"keep_this": [
"this",
"list"
]
},
"filter_this":
[
{"keep" : "true"},
{
"keep": "true",
"extra": "keeper"
} ,
{
"keep": "false",
"extra": "non-keeper"
}
]
}
The required output just has one element of the 'filter_this' block removed:
{
"keep_untouched": {
"keep_this": [
"this",
"list"
]
},
"filter_this":
[
{"keep" : "true"},
{
"keep": "true",
"extra": "keeper"
}
]
}
The standard way to handle such cases appears to be using 'truncate_stream' to reconstitute streamed objects, before filtering those in the usual jq way. Specifically, the command:
jq -nc --stream 'fromstream(1|truncate_stream(inputs))'
gives access to a stream of objects:
{"keep_this":["this","list"]}
[{"keep":"true"},{"keep":"true","extra":"keeper"},
{"keep":"false","extra":"non-keeper"}]
at which point it is easy to filter for the required objects. However, this strips the results from the context of their parent object, which is not what I want.
Looking at the streaming structure:
[["keep_untouched","keep_this",0],"this"]
[["keep_untouched","keep_this",1],"list"]
[["keep_untouched","keep_this",1]]
[["keep_untouched","keep_this"]]
[["filter_this",0,"keep"],"true"]
[["filter_this",0,"keep"]]
[["filter_this",1,"keep"],"true"]
[["filter_this",1,"extra"],"keeper"]
[["filter_this",1,"extra"]]
[["filter_this",2,"keep"],"false"]
[["filter_this",2,"extra"],"non-keeper"]
[["filter_this",2,"extra"]]
[["filter_this",2]]
[["filter_this"]]
it seems I need to select all the 'filter_this' rows, truncate those rows only (using truncate_stream), rebuild them as objects (using fromstream), filter those objects, and turn them back into the stream data format (using tostream) to rejoin the 'keep_untouched' rows, which are still in the streaming format. At that point it would be possible to re-build the whole JSON document. If that is the right approach - which seems overly convoluted to me - how do I do that? Or is there a better way?
If your input file consists of a single very large JSON entity that is too big for the regular jq parser to handle in your environment, then there is the distinct possibility that you won't have enough memory to reconstitute the JSON document.
With that caveat, the following may be worth a try. The key insight is that reconstruction can be accomplished using reduce.
The following uses a bunch of temporary files for the sake of clarity:
TMP=/tmp/$$
jq -c --stream 'select(length==2)' input.json > $TMP.streamed
jq -c 'select(.[0][0] != "filter_this")' $TMP.streamed > $TMP.1
jq -c 'select(.[0][0] == "filter_this")' $TMP.streamed |
jq -nc 'reduce inputs as [$p,$x] (null; setpath($p;$x))
| .filter_this |= map(select(.keep=="true"))
| tostream
| select(length==2)' > $TMP.2
# Reconstruction
jq -n 'reduce inputs as [$p,$x] (null; setpath($p;$x))' $TMP.1 $TMP.2
Output
{
"keep_untouched": {
"keep_this": [
"this",
"list"
]
},
"filter_this": [
{
"keep": "true"
},
{
"keep": "true",
"extra": "keeper"
}
]
}
Many thanks to @peak. I found his approach really useful, but unrealistic in terms of performance. Stealing some of @peak's ideas, though, I came up with the following:
Extract the 'parent' object:
jq -c --stream 'select(length==2)' input.json |
jq -c 'select(.[0][0] != "filter_this")' |
jq -n 'reduce inputs as [$p,$x] (null; setpath($p;$x))' > $TMP.parent
Extract the 'keepers' - though this means reading the file twice (:-<):
jq -nc --stream '[fromstream(2|truncate_stream(inputs))
| select(type == "object" and .keep == "true")]
' input.json > $TMP.keepers
Insert the filtered list into the parent object.
jq -nc -s 'inputs as $items
| $items[0] as $parent
| $parent
| .filter_this |= $items[1]
' $TMP.parent $TMP.keepers > result.json
Here is a simplified version of @PeteC's script. It requires one fewer invocation of jq.
In both cases, please note that the invocation of jq that uses "2|truncate_stream(_)" requires a more recent version of jq than 1.5.
TMP=/tmp/$$
INPUT=input.json
# Extract all but .filter_this
< $INPUT jq -c --stream 'select(length==2 and .[0][0] != "filter_this")' |
jq -nc 'reduce inputs as [$p,$x] (null; setpath($p;$x))
' > $TMP.parent
# Need jq > 1.5
# Extract the 'keepers'
< $INPUT jq -n -c --stream '
[fromstream(2|truncate_stream(inputs))
| select(type == "object" and .keep == "true")]
' > $TMP.keepers
# Insert the filtered list into the parent object:
jq -s '. as $in | .[0] | (.filter_this |= $in[1])
' $TMP.parent $TMP.keepers > result.json

Parsing JSON record-per-line with jq?

I've got a tool that outputs a JSON record on each line, and I'd like to process it with jq.
The output looks something like this:
{"ts":"2017-08-15T21:20:47.029Z","id":"123","elapsed_ms":10}
{"ts":"2017-08-15T21:20:47.044Z","id":"456","elapsed_ms":13}
When I pass this to jq as follows:
./tool | jq 'group_by(.id)'
...it outputs an error:
jq: error (at <stdin>:1): Cannot index string with string "id"
How do I get jq to handle JSON-record-per-line data?
Use the --slurp (or -s) switch:
./tool | jq --slurp 'group_by(.id)'
It outputs the following:
[
[
{
"ts": "2017-08-15T21:20:47.029Z",
"id": "123",
"elapsed_ms": 10
}
],
[
{
"ts": "2017-08-15T21:20:47.044Z",
"id": "456",
"elapsed_ms": 13
}
]
]
...which you can then process further. For example:
./tool | jq -s 'group_by(.id) | map({id: .[0].id, count: length})'
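With the two sample records above, that produces (shown compactly):
[{"id":"123","count":1},{"id":"456","count":1}]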
As @JeffMercado pointed out, jq handles streams of JSON just fine, but if you use group_by, then you'd have to ensure its input is an array. That could be done in this case using the -s command-line option; if your jq has the inputs filter, then it can also be done using that filter in conjunction with the -n option.
If you have a version of jq with inputs (which is available in jq 1.5), however, then a better approach would be to use the following streaming variant of group_by:
# sort-free stream-oriented variant of group_by/1
# f should always evaluate to a string.
# Output: a stream of arrays, one array per group
def GROUPS_BY(stream; f): reduce stream as $x ({}; .[$x|f] += [$x] ) | .[] ;
Usage example: GROUPS_BY(inputs; .id)
Note that you will want to use this with the -n command line option.
Such a streaming variant has two main advantages:
it generally requires less memory in that it does not require a copy of the entire input stream to be kept in memory while it is being processed;
it is potentially faster because it does not require any sort operation, unlike group_by/1.
Please note that the above definition of GROUPS_BY/2 follows the convention for such streaming filters in that it produces a stream. Other variants are of course possible.
Handling a large amount of data
The following illustrates how to economize on memory. Suppose the task is to produce a frequency count of .id values. The humdrum solution would be:
GROUPS_BY(inputs; .id) | [(.[0]|.id), length]
A more economical and indeed far better solution would be:
GROUPS_BY(inputs|.id; .) | [.[0], length]
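Putting it together, a full invocation of that frequency-count variant might look like this (a sketch, reusing the definition above with the -n and -c options):
./tool | jq -nc '
  def GROUPS_BY(stream; f): reduce stream as $x ({}; .[$x|f] += [$x] ) | .[] ;
  GROUPS_BY(inputs|.id; .) | [.[0], length]
'
For the two sample records shown earlier, this emits ["123",1] and ["456",1].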

Process huge GeoJSON file with jq

Given a GeoJSON file as follows:-
{
"type": "FeatureCollection",
"features": [
{
"type": "Feature",
"properties": {
"FEATCODE": 15014
},
"geometry": {
"type": "Polygon",
"coordinates": [
.....
I want to end up with the following:-
{
"type": "FeatureCollection",
"features": [
{
"tippecanoe" : {"minzoom" : 13},
"type": "Feature",
"properties": {
"FEATCODE": 15014
},
"geometry": {
"type": "Polygon",
"coordinates": [
.....
i.e. I have added the tippecanoe object to each feature in the features array.
I can make this work with:-
jq '.features[].tippecanoe.minzoom = 13' <GEOJSON FILE> > <OUTPUT FILE>
This is fine for small files, but processing a large file of 414MB seems to take forever, with the processor maxing out and nothing being written to the OUTPUT FILE.
Reading further into jq, it appears that the --stream command-line parameter may help, but I am completely confused as to how to use this for my purposes.
I would be grateful for an example command line that serves my purposes along with an explanation as to what --stream is doing.
A one-pass jq-only approach may require more RAM than is available. If that is the case, then a simple all-jq approach is shown below, together with a more economical approach based on using jq along with awk.
The two approaches are the same except for the reconstitution of the stream of objects into a single JSON document. This step can be accomplished very economically using awk.
In both cases, the large JSON input file with objects of the required form is assumed to be named input.json.
jq-only
jq -c '.features[]' input.json |
jq -c '.tippecanoe.minzoom = 13' |
jq -c -s '{type: "FeatureCollection", features: .}'
jq and awk
jq -c '.features[]' input.json |
jq -c '.tippecanoe.minzoom = 13' | awk '
BEGIN {print "{\"type\": \"FeatureCollection\", \"features\": ["; }
NR==1 { print; next }
{print ","; print}
END {print "] }";}'
Performance comparison
For comparison, an input file with 10,000,000 objects in .features[] was used. Its size is about 1GB.
u+s:
jq-only: 15m 15s
jq-awk: 7m 40s
jq one-pass using map: 6m 53s
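(The one-pass map variant timed in the last line is presumably something along the lines of jq '.features |= map(.tippecanoe.minzoom = 13)' input.json, i.e. the map-based filter discussed further down.)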
An alternative solution could be for example:
jq '.features |= map_values(.tippecanoe.minzoom = 13)'
To test this, I created a sample JSON as
d = {'features': [{"type":"Feature", "properties":{"FEATCODE": 15014}} for i in range(0,N)]}
and inspected the execution time as a function of N. Interestingly, while the map_values approach seems to have linear complexity in N, .features[].tippecanoe.minzoom = 13 exhibits quadratic behavior (already for N=50000, the former method finishes in about 0.8 seconds, while the latter needs around 47 seconds).
Alternatively, one might just do it manually with, e.g., Python:
import json
import sys
data = {}
with open(sys.argv[1], 'r') as F:
    data = json.load(F)
extra_item = {"minzoom" : 13}
for feature in data['features']:
    feature["tippecanoe"] = extra_item
with open(sys.argv[2], 'w') as F:
    F.write(json.dumps(data))
In this case, map rather than map_values is far faster (*):
.features |= map(.tippecanoe.minzoom = 13)
However, using this approach will still require enough RAM.
p.s. If you want to use jq to generate a large file for timing, consider:
def N: 1000000;
def data:
{"features": [range(0;N) | {"type":"Feature", "properties": {"FEATCODE": 15014}}] };
(*) Using map, 20s for 100MB, and approximately linear.
Here, based on the work of @nicowilliams at GitHub, is a solution that uses the streaming parser available with jq. The solution is very economical with memory, but is currently quite slow if the input is large.
The solution has two parts: a function for injecting the update into the stream produced using the --stream command-line option; and a function for converting the stream back to JSON in the original form.
Invocation:
jq -cnr --stream -f program.jq input.json
program.jq
# inject the given object into the stream produced from "inputs" with the --stream option
def inject(object):
  [object|tostream] as $object
  | 2
  | truncate_stream(inputs)
  | if (.[0]|length == 1) and length == 1
    then $object[]
    else .
    end ;
# Input: the object to be added
# Output: text
def output:
  . as $object
  | ( "[",
      foreach fromstream( inject($object) ) as $o
        (0;
         if .==0 then 1 else 2 end;
         if .==1 then $o else ",", $o end),
      "]" ) ;
{}
| .tippecanoe.minzoom = 13
| output
Generation of test data
def data(N):
  {"features":
    [range(0;N) | {"type":"Feature", "properties": {"FEATCODE": 15014}}] };
Example output
With N=2:
[
{"type":"Feature","properties":{"FEATCODE":15014},"tippecanoe":{"minzoom":13}}
,
{"type":"Feature","properties":{"FEATCODE":15014},"tippecanoe":{"minzoom":13}}
]