Splitting / chunking JSON files with JQ in Bash or Fish shell?

I have been using the wonderful jq library to parse and extract JSON data to facilitate re-importing. I can extract a range easily enough, but I am unsure how to loop through the data in a script and detect the end of the file, preferably in a bash or fish shell script.
Given a JSON file that is wrapped in a "results" dictionary, how can I detect the end of the file?
From testing, I can see that once the range runs past the end I get an empty array nested in my desired structure, but how can I detect that end-of-file condition?
jq '{ "results": .results[0:500] }' Foo.json > 0000-0500/Foo.json
Thanks!

I'd recommend using jq to split up the array into a stream of the JSON objects you want (one per line), and then using some other tool (e.g. awk) to populate the files. Here's how the first part can be done:
def splitup(n):
  def _split:
    if length == 0 then empty
    else .[0:n], (.[n:] | _split)
    end;
  if n == 0 then empty elif n > 0 then _split else reverse | splitup(-n) end;

# For the sake of illustration:
def data: { results: [range(0,20)] };

data | .results | { results: splitup(5) }
Invocation:
$ jq -nc -f splitup.jq
{"results":[0,1,2,3,4]}
{"results":[5,6,7,8,9]}
{"results":[10,11,12,13,14]}
{"results":[15,16,17,18,19]}
For the second part, you could (for example) pipe the jq output to:
awk '{ file="file."++n; print > file; close(file); }'
A variant you might be interested in would have the jq filter emit both the filename and the JSON on alternate lines; the awk script would then read the filename as well.
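Putting the two parts together for the original Foo.json (a sketch; the chunk size of 500 and the chunk-NNNN.json naming are just illustrative, and the splitup filter is the one defined above):
jq -c '
  def splitup(n):
    def _split:
      if length == 0 then empty
      else .[0:n], (.[n:] | _split)
      end;
    if n == 0 then empty elif n > 0 then _split else reverse | splitup(-n) end;
  # emit one {"results": [...]} object per line, 500 elements each
  .results | { results: splitup(500) }
' Foo.json |
awk '{ file = sprintf("chunk-%04d.json", ++n); print > file; close(file); }'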

Related

Create/append JSON array from text file in Linux with loop

I have the below file in txt format. I want to arrange the data in JSON array format in Linux and append more such data to the same JSON array with a for/while loop, based on a condition. Please help me with the best way to achieve this.
File:
Name:Rock
Name:Clock

Desired output:
{
  "Array": [
    { "Name": "Rock" },
    { "Name": "Clock" }
  ]
}
Suppose your initial file is object.json and that it contains an empty object, {};
and that at the beginning of each iteration, the key:value pairs are defined in another file, kv.txt.
Then at each iteration, you can update object.json using the invocation:
< kv.txt jq -Rn --argfile object object.json -f program.jq | sponge object.json
where program.jq contains the jq program:
$object | .Array +=
reduce inputs as $in ([]; . + [$in | capture("(?<k>^[^:]*): *(?<v>.*)") | {(.k):.v} ])
(sponge is part of the moreutils package. If it cannot be used, then you will have to use another method of updating object.json.)
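For illustration (a sketch, assuming kv.txt holds the two lines from the question and object.json starts out as {}; -c is added here only to keep the output on one line):
$ cat kv.txt
Name:Rock
Name:Clock
$ < kv.txt jq -c -Rn --argfile object object.json -f program.jq
{"Array":[{"Name":"Rock"},{"Name":"Clock"}]}
Because sponge writes the result back to object.json, the next iteration's $object already contains the accumulated array, so += keeps appending to it.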

Iterate through huge JSON in PowerShell

I have a 19 gigs JSON file. A huge array of rather small objects.
[{
"name":"Joe Blow",
"address":"Gotham, CA"
"log": [{},{},{}]
},
...
]
I want to iterate through the root array of this JSON. Every object, with its log, takes no more than 2 MB of memory; it is possible to load one object into memory, work with it, and throw it away.
Yet the file itself is 19 gigs and contains millions of those objects. I found it is possible to iterate through such an array using C# and the Newtonsoft.Json library: you read the file as a stream and, as soon as you see a finished object, deserialize it and spit it out.
But I want to see whether PowerShell can do the same: not read the whole thing as one chunk, but rather iterate over what's in the hopper right now.
Any ideas?
As far as I know, ConvertFrom-Json doesn't have a streaming mode, but jq does (see Processing huge json-array files with jq). This turns a giant array into just the contents of the array, which can then be output piece by piece. Otherwise a 6 MB, 400,000-line JSON file can use 1 GB of memory after conversion (400 MB in PowerShell 7).
get-content file.json |
jq -cn --stream 'fromstream(1|truncate_stream(inputs))' |
% { $_ | convertfrom-json }
So for example this:
[
{"name":"joe"},
{"name":"john"}
]
becomes this:
{"name":"joe"}
{"name":"john"}
The streaming format of jq looks very different from json. For example, the array looks like this, with paths to each value and object or array end-markers.
'[{"name":"joe"},{"name":"john"}]' | jq --stream -c
[[0,"name"],"joe"]
[[0,"name"]] # end object
[[1,"name"],"john"]
[[1,"name"]] # end object
[[1]] # end array
And then after truncating one "parent folder" level from the path of each value:
'[{"name":"joe"},{"name":"john"}]' | jq -cn --stream '1|truncate_stream(inputs)'
[["name"],"joe"]
[["name"]] # end object
[["name"],"john"]
[["name"]] # end object
# no more end array
"fromstream()" turns it back into json...
'[{"name":"joe"},{"name":"john"}]' | jq -cn --stream 'fromstream(1|truncate_stream(inputs))'
{"name":"joe"}
{"name":"john"}

Get JSON files from particular interval based on date field

I have a lot of JSON files whose structure looks like the one below:
{
key1: 'val1'
key2: {
'key21': 'someval1',
'key22': 'someval2',
'key23': 'someval3',
'date': '2018-07-31T01:30:30Z',
'key25': 'someval4'
}
key3: []
... some other objects
}
My goal is to get only those files where the date field falls within some period,
for example from 2018-05-20 to 2018-07-20.
I can't rely on the creation date of these files, because all of them were generated on the same day.
Maybe it is possible using sed or a similar program?
Fortunately, the date in this format can be compared as a string. You only need something to parse the JSONs, e.g. Perl:
perl -l -0777 -MJSON::PP -ne '
$date = decode_json($_)->{key2}{date};
print $ARGV if $date gt "2018-07-01T00:00:00Z";
' *.json
-0777 makes perl slurp the whole files instead of reading them line by line
-l adds a newline to print
$ARGV contains the name of the currently processed file
See JSON::PP for details. If you have JSON::XS or Cpanel::JSON::XS, you can switch to them for faster processing.
I had to fix the input (replace ' by ", add commas, etc.) in order to make the parser happy.
If your files actually contain valid JSON, the task can be accomplished in a one-liner with jq, e.g.:
jq 'if .key2.date[0:10] | (. >= "2018-05-20" and . <= "2018-07-31") then input_filename else empty end' *.json
This is just an illustration. jq has date-handling functions for dealing with more complex requirements.
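For instance (a sketch; the bounds are just the ones from the question), fromdateiso8601 converts the timestamp to epoch seconds so the comparison is numeric rather than lexicographic:
jq '(.key2.date | fromdateiso8601) as $t
    | if $t >= ("2018-05-20T00:00:00Z" | fromdateiso8601)
         and $t <= ("2018-07-20T23:59:59Z" | fromdateiso8601)
      then input_filename else empty end' *.json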
Handling quasi-JSON
If your files contain quasi-JSON, then you could use jq in conjunction with a JSON rectifier. If your sample is representative, then hjson could be used, e.g.:
for f in *.qjson
do
  hjson -j "$f" | jq --arg f "$f" '
    if .key2.date[0:7] == "2018-07" then $f else empty end'
done
Try it like this:
Find an online converter (for example: https://codebeautify.org/json-to-excel-converter#) and convert the JSON to CSV.
Open the CSV file with Excel.
Filter your data.

Get the last element in JSON file

I have this JSON file:
{
"system.timestamp": "{system.timestamp}",
"error.state": "{error.state}",
"system.timestamp": "{system.timestamp}",
"error.state": "{error.state}",
"system.timestamp": "{system.timestamp}",
"error.state": "{error.state}",
"error.content": "{custom.error.content}"
}
I would like to get only the last object in the JSON file, as I need to check that in every case the last object is error.content. The attached snippet is just a sample; every file generated in reality will contain around 40 to 50 objects, and in every case I need to check that the last one is error.content.
I have calculated the length by using jq '. | length'. How do I do it using the jq command in Linux?
Note: it's a plain JSON file without any arrays.
Objects with duplicate keys can be handled in jq using the --stream option, e.g.:
$ jq -s --stream '.[length-2] | { (.[0][0]): (.[1]) }' input.json
{
"error.content": "{custom.error.content}"
}
For large files, the following would probably be better as it avoids "slurping" the input file:
$ jq -n --stream 'last(inputs | select(length == 2)) | {(.[0][0]): .[1]}' input.json
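And if you only need the yes/no check the question asks for, the same streaming idea can compare the final key directly (a sketch):
$ jq -n --stream 'last(inputs | select(length == 2)) | .[0][0] == "error.content"' input.json
true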

How can I completely sort arbitrary JSON using jq?

I want to diff two JSON text files. Unfortunately they're constructed in arbitrary order, so I get diffs when they're semantically identical. I'd like to use jq (or whatever) to sort them in any kind of full order, to eliminate differences due only to element ordering.
--sort-keys solves half the problem, but it doesn't sort arrays.
I'm pretty ignorant of jq and don't know how to write a jq recursive filter that preserves all data; any help would be appreciated.
I realize that line-by-line 'diff' output isn't necessarily the best way to compare two complex objects, but in this case I know the two files are very similar (nearly identical) and line-by-line diffs are fine for my purposes.
Using jq or alternative command line tools to diff JSON files answers a very similar question, but doesn't print the differences. Also, I want to save the sorted results, so what I really want is just a filter program to sort JSON.
Here is a solution using a generic function sorted_walk/1 (so named for the reason described in the postscript below).
normalize.jq:
# Apply f to composite entities recursively using keys[], and to atoms
def sorted_walk(f):
  . as $in
  | if type == "object" then
      reduce keys[] as $key
        ( {}; . + { ($key): ($in[$key] | sorted_walk(f)) } ) | f
    elif type == "array" then map( sorted_walk(f) ) | f
    else f
    end;

def normalize: sorted_walk(if type == "array" then sort else . end);

normalize
Example using bash:
diff <(jq -S -f normalize.jq FILE1) <(jq -S -f normalize.jq FILE2)
POSTSCRIPT: The builtin definition of walk/1 was revised after this response was first posted: it now uses keys_unsorted rather than keys.
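With a reasonably recent jq you can get much the same effect from the builtin walk plus -S (a sketch, not part of the original answer; walk no longer sorts keys itself, so -S does that part at output):
diff <(jq -S 'walk(if type == "array" then sort else . end)' FILE1) \
     <(jq -S 'walk(if type == "array" then sort else . end)' FILE2)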
I want to diff two JSON text files.
Use jd with the -set option:
No output means no difference.
$ jd -set A.json B.json
Differences are shown as an @ path and + or -.
$ jd -set A.json C.json
@ ["People",{}]
+ "Carla"
The output diffs can also be used as patch files with the -p option.
$ jd -set -o patch A.json C.json; jd -set -p patch B.json
{"City":"Boston","People":["John","Carla","Bryan"],"State":"MA"}
https://github.com/josephburnett/jd#command-line-usage
I'm surprised this isn't a more popular question/answer. I haven't seen any other json deep sort solutions. Maybe everyone likes solving the same problem over and over.
Here's a wrapper for @peak's excellent solution above that packages it as a shell script, usable in a pipe or with file args.
#!/usr/bin/env bash
# json normalizer function
# Recursively sort an entire json file, keys and arrays
# jq --sort-keys is top level only
# Alphabetize a json file's dict's such that they are always in the same order
# Makes json diff'able and should be run on any json data that's in source control to prevent excessive diffs from dict reordering.
[ "${DEBUG}" ] && set -x
TMP_FILE="$(mktemp)"
trap 'rm -f -- "${TMP_FILE}"' EXIT
cat > "${TMP_FILE}" <<-EOT
# Apply f to composite entities recursively using keys[], and to atoms
def sorted_walk(f):
  . as \$in
  | if type == "object" then
      reduce keys[] as \$key
        ( {}; . + { (\$key): (\$in[\$key] | sorted_walk(f)) } ) | f
    elif type == "array" then map( sorted_walk(f) ) | f
    else f
    end;

def normalize: sorted_walk(if type == "array" then sort else . end);

normalize
EOT
# Don't pollute stdout with debug output
[ "${DEBUG}" ] && cat $TMP_FILE > /dev/stderr
if [ "$1" ] ; then
jq -S -f ${TMP_FILE} $1
else
jq -S -f ${TMP_FILE} < /dev/stdin
fi
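Usage looks like this (a sketch; normalize_json.sh is just an assumed name for the script above, after chmod +x):
./normalize_json.sh FILE1.json > a.json
cat FILE2.json | ./normalize_json.sh > b.json
diff a.json b.json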