Iterate through huge JSON in PowerShell

I have a 19 GB JSON file: a huge array of rather small objects.
[{
    "name": "Joe Blow",
    "address": "Gotham, CA",
    "log": [{},{},{}]
},
...
]
I want to iterate through the root array of this JSON. Every object, with its log, takes no more than 2 MB of memory, so it is possible to load one object into memory, work with it, and throw it away.
Yet the file itself is 19 GB and contains millions of those objects. I found it is possible to iterate through such an array using C# and the Newtonsoft.Json library: you read the file as a stream and, as soon as a complete object has been read, deserialize it and spit it out.
But I want to see whether PowerShell can do the same: not read the whole thing as one chunk, but iterate over whatever is in the hopper right now.
Any ideas?

As far as I know, ConvertFrom-Json doesn't have a streaming mode, but jq does: see "Processing huge json-array files with jq". This code turns a giant array into just the contents of the array, which can then be output piece by piece. Otherwise a 6 MB, 400,000-line JSON file can use 1 GB of memory after conversion (about 400 MB in PowerShell 7).
Get-Content file.json |
  jq -cn --stream 'fromstream(1|truncate_stream(inputs))' |
  ForEach-Object { $_ | ConvertFrom-Json }
So for example this:
[
{"name":"joe"},
{"name":"john"}
]
becomes this:
{"name":"joe"}
{"name":"john"}
The streaming format of jq looks very different from JSON. For example, the array looks like this, with a path to each value plus object and array end-markers:
'[{"name":"joe"},{"name":"john"}]' | jq --stream -c .
[[0,"name"],"joe"]
[[0,"name"]] # end object
[[1,"name"],"john"]
[[1,"name"]] # end object
[[1]] # end array
And this is what it looks like after truncating one level of "parent folder" from the path of each value (the "1" in truncate_stream is the number of leading path elements to drop):
'[{"name":"joe"},{"name":"john"}]' | jq -cn --stream '1|truncate_stream(inputs)'
[["name"],"joe"]
[["name"]] # end object
[["name"],"john"]
[["name"]] # end object
# no more end array
"fromstream()" turns it back into json...
'[{"name":"joe"},{"name":"john"}]' | jq -cn --stream 'fromstream(1|truncate_stream(inputs))'
{"name":"joe"}
{"name":"john"}

Related

How to get max value of a date field in a large json file?

I have a large JSON file, around 500 MB, which is the response of a URL call. I need to get the max value of the "date" field in the "results" array of the JSON file, using a shell script (bash). Currently I am using jq as below. It works fine for smaller files, but for larger files it returns null.
maxDate=$(cat ${jsonfilePath} | jq '[ .results[]?.date ] | max')
Please help, thanks! I am new to shell scripting, JSON and jq.
Sample input JSON file contents:
{
  "results": [
    {
      "Id": "123",
      "date": 1588910400000,
      "col": "test"
    },
    {
      "Id": "1234",
      "date": 1588910412345,
      "col": "test2"
    }
  ],
  "col2": 123
}
Given the --stream option on the command line, jq won't load the whole input into memory; instead it reads the input token by token, producing arrays in this fashion:
[["results",0,"Id"],"123"]
[["results",0,"date"],1588910400000]
...
[["results",1,"date"],1588910412345]
...
Thanks to this feature, we can pick only the dates from the input and find the maximum one without exhausting memory (at the expense of speed). For example:
jq -n --stream 'reduce (inputs|select(.[0][-1]=="date" and length==2)[1]) as $d (null; [.,$d]|max)' file
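Against the sample input above, this prints the larger of the two timestamps:
1588910412345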
500 MB should not be so large as to require the --stream option, which generally slows things down. Here then is a fast and efficient(*) solution that does not use the streaming option, but instead uses a generic, stream-oriented "max_by" function defined as follows:
# max_by(empty; 1) yields null
def max_by(s; f):
  reduce s as $s (null;
    if . == null then {s: $s, m: ($s|f)}
    else ($s|f) as $m
    | if $m > .m then {s: $s, m: $m} else . end
    end)
  | .s ;
With this in our toolkit, we can simply write:
max_by(.results[].date; .)
This of course assumes that there is a "results" field containing an array of JSON objects. (**) From the problem statement, it would appear that this assumption does not always hold, so you will probably want to modify whichever approach you choose accordingly (e.g. by checking whether there is a results field, whether it's array-valued, etc.)
(*) Using max_by/2 here is more efficient, both in terms of space and time, than using the built-in max_by/1.
(**) The absence of a "date" subfield should not matter as null is less than every number.
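Putting the definition and the call together against the sample file:
$ jq 'def max_by(s; f):
        reduce s as $s (null;
          if . == null then {s: $s, m: ($s|f)}
          else ($s|f) as $m
          | if $m > .m then {s: $s, m: $m} else . end
          end)
        | .s;
      max_by(.results[].date; .)' "$jsonfilePath"
1588910412345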
 jq '.results | max_by(.date) | .date' "$jsonfilePath"
is a more efficient way to get the maximum date value out of that JSON, and it might work better for you. It avoids the Useless Use Of Cat, doesn't create a temporary array of just the date values, and thus only needs one pass through the array.

Create/append JSON array from text file in Linux with loop

I have the below file in txt format. I want to arrange the data in JSON array format in Linux, and append more such data to the same JSON array with a for/while loop based on a condition. Please help me with the best way to achieve this.
File:
Name:Rock
Name:Clock
Desired output:
{
  "Array": [
    {
      "Name": "Rock"
    },
    {
      "Name": "Clock"
    }
  ]
}
Suppose your initial file is object.json and that it contains an empty object, {};
and that at the beginning of each iteration, the key:value pairs are defined in another file, kv.txt.
Then at each iteration, you can update object.json using the invocation:
< kv.txt jq -Rn --argfile object object.json -f program.jq | sponge object.json
where program.jq contains the jq program:
$object | .Array +=
reduce inputs as $in ([]; . + [$in | capture("(?<k>^[^:]*): *(?<v>.*)") | {(.k):.v} ])
(sponge is part of the moreutils package. If it cannot be used, then you will have to use another method of updating object.json.)
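For the sample kv.txt from the question (Name:Rock and Name:Clock) and an object.json that starts out as {}, the first iteration produces:
$ < kv.txt jq -Rn --argfile object object.json -f program.jq
{
  "Array": [
    {
      "Name": "Rock"
    },
    {
      "Name": "Clock"
    }
  ]
}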

Parsing JSON record-per-line with jq?

I've got a tool that outputs a JSON record on each line, and I'd like to process it with jq.
The output looks something like this:
{"ts":"2017-08-15T21:20:47.029Z","id":"123","elapsed_ms":10}
{"ts":"2017-08-15T21:20:47.044Z","id":"456","elapsed_ms":13}
When I pass this to jq as follows:
./tool | jq 'group_by(.id)'
...it outputs an error:
jq: error (at <stdin>:1): Cannot index string with string "id"
How do I get jq to handle JSON-record-per-line data?
Use the --slurp (or -s) switch:
./tool | jq --slurp 'group_by(.id)'
It outputs the following:
[
  [
    {
      "ts": "2017-08-15T21:20:47.029Z",
      "id": "123",
      "elapsed_ms": 10
    }
  ],
  [
    {
      "ts": "2017-08-15T21:20:47.044Z",
      "id": "456",
      "elapsed_ms": 13
    }
  ]
]
...which you can then process further. For example:
./tool | jq -s 'group_by(.id) | map({id: .[0].id, count: length})'
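For the two sample records, that produces:
[
  {
    "id": "123",
    "count": 1
  },
  {
    "id": "456",
    "count": 1
  }
]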
As @JeffMercado pointed out, jq handles streams of JSON just fine, but if you use group_by, then you'd have to ensure its input is an array. That could be done in this case using the -s command-line option; if your jq has the inputs filter, then it can also be done using that filter in conjunction with the -n option.
If you have a version of jq with inputs (which is available in jq 1.5), however, then a better approach would be to use the following streaming variant of group_by:
# sort-free stream-oriented variant of group_by/1
# f should always evaluate to a string.
# Output: a stream of arrays, one array per group
def GROUPS_BY(stream; f): reduce stream as $x ({}; .[$x|f] += [$x] ) | .[] ;
Usage example: GROUPS_BY(inputs; .id)
Note that you will want to use this with the -n command line option.
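Putting the definition and the usage example together with the sample input:
$ ./tool | jq -nc 'def GROUPS_BY(stream; f): reduce stream as $x ({}; .[$x|f] += [$x]) | .[]; GROUPS_BY(inputs; .id)'
[{"ts":"2017-08-15T21:20:47.029Z","id":"123","elapsed_ms":10}]
[{"ts":"2017-08-15T21:20:47.044Z","id":"456","elapsed_ms":13}]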
Such a streaming variant has two main advantages:
it generally requires less memory in that it does not require a copy of the entire input stream to be kept in memory while it is being processed;
it is potentially faster because it does not require any sort operation, unlike group_by/1.
Please note that the above definition of GROUPS_BY/2 follows the convention for such streaming filters in that it produces a stream. Other variants are of course possible.
Handling a large amount of data
The following illustrates how to economize on memory. Suppose the task is to produce a frequency count of .id values. The humdrum solution would be:
GROUPS_BY(inputs; .id) | [(.[0]|.id), length]
A more economical and indeed far better solution would be:
GROUPS_BY(inputs|.id; .) | [.[0], length]
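For the two sample records above, this economical version emits one [id, count] pair per group:
$ ./tool | jq -nc 'def GROUPS_BY(stream; f): reduce stream as $x ({}; .[$x|f] += [$x]) | .[]; GROUPS_BY(inputs|.id; .) | [.[0], length]'
["123",1]
["456",1]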

Get the last element in JSON file

I have this JSON file:
{
  "system.timestamp": "{system.timestamp}",
  "error.state": "{error.state}",
  "system.timestamp": "{system.timestamp}",
  "error.state": "{error.state}",
  "system.timestamp": "{system.timestamp}",
  "error.state": "{error.state}",
  "error.content": "{custom.error.content}"
}
I would like to get only the last entry of the JSON file, as I need to check that in every case the last entry is error.content. The attached snippet is just a sample file; every file generated in reality will contain around 40 to 50 entries, so in every case I need to check that the last entry is error.content.
I have calculated the length using jq '. | length'. How do I do this using the jq command in Linux?
Note: it's a plain JSON file without any arrays.
Objects with duplicate keys can be handled in jq using the --stream option, e.g.:
$ jq -s --stream '.[length-2] | { (.[0][0]): (.[1]) }' input.json
{
  "error.content": "{custom.error.content}"
}
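The length-2 index is needed because the slurped stream ends with an end-of-object marker; the last key-value event sits just before it, as the raw stream for the sample file shows:
$ jq -c --stream . input.json
[["system.timestamp"],"{system.timestamp}"]
[["error.state"],"{error.state}"]
[["system.timestamp"],"{system.timestamp}"]
[["error.state"],"{error.state}"]
[["system.timestamp"],"{system.timestamp}"]
[["error.state"],"{error.state}"]
[["error.content"],"{custom.error.content}"]
[["error.content"]]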
For large files, the following would probably be better, as it avoids "slurping" the input file: it streams the input and keeps only the most recent key-value event, which by the end of the file is the last one:
$ jq -n --stream 'last(inputs | select(length == 2)) | {(.[0][0]): .[1]}' input.json
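Against the sample file this also prints:
{
  "error.content": "{custom.error.content}"
}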

Splitting / chunking JSON files with JQ in Bash or Fish shell?

I have been using the wonderful JQ library to parse and extract JSON data to facilitate re-importing. I am able to extract a range easily enough, but am unsure as to how you could loop through in a script and detect the end of the file, preferably in a bash or fish shell script.
Given a JSON file that is wrapped in a "results" dictionary, how can I detect the end of the file?
From testing, I can see that I will get an empty array nested in my desired structure once the range runs past the end of the data, but how can you detect that end-of-file condition?
jq '{ "results": .results[0:500] }' Foo.json > 0000-0500/Foo.json
Thanks!
I'd recommend using jq to split-up the array into a stream of the JSON objects you want (one per line), and then using some other tool (e.g. awk) to populate the files. Here's how the first part can be done:
def splitup(n):
  def _split:
    if length == 0 then empty
    else .[0:n], (.[n:] | _split)
    end;
  if n == 0 then empty elif n > 0 then _split else reverse|splitup(-n) end;

# For the sake of illustration:
def data: { results: [range(0,20)] };

data | .results | { results: splitup(5) }
Invocation:
$ jq -nc -f splitup.jq
{"results":[0,1,2,3,4]}
{"results":[5,6,7,8,9]}
{"results":[10,11,12,13,14]}
{"results":[15,16,17,18,19]}
For the second part, you could (for example) pipe the jq output to:
awk '{ file="file."++n; print > file; close(file); }'
A variant you might be interested in would have the jq filter emit both the filename and the JSON on alternate lines; the awk script would then read the filename as well.
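A minimal sketch of that variant, assuming the program is saved as splitup2.jq and an illustrative "chunk-N.json" naming scheme (the chunk size of 5 and the data definition are the same as above):
def splitup(n):
  def _split:
    if length == 0 then empty
    else .[0:n], (.[n:] | _split)
    end;
  if n == 0 then empty elif n > 0 then _split else reverse|splitup(-n) end;

def data: { results: [range(0,20)] };

# Emit the target filename and the corresponding chunk on alternate lines.
data | .results
| foreach splitup(5) as $chunk (-1; . + 1;
    "chunk-\(.).json", { results: $chunk })
Invocation (-r prints the filename strings without quotes, and the awk script writes each JSON line to the filename given on the line before it):
$ jq -ncr -f splitup2.jq | awk 'NR % 2 { file = $0; next } { print > file; close(file) }'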