JQ Group Multiple Files - json

I have a set of files that all contain JSON in the following format:
File 1:
{ "host" : "127.0.0.1", "port" : "80", "data": {}}
File 2:
{ "host" : "127.0.0.2", "port" : "502", "data": {}}
File 3:
{ "host" : "127.0.0.1", "port" : "443", "data": {}}
These files can be rather large, up to several gigabytes.
I want to use JQ or some other bash json processing tool that can merge these json files into one file with a grouped format like so:
[{ "host" : "127.0.0.1", "data": {"80": {}, "443" : {}}},
{ "host" : "127.0.0.2", "data": {"502": {}}}]
Is this possible with jq and if yes, how could I possibly do this? I have looked at the group_by function in jq, but it seems like I need to combine all files first and then group on this big file. However, since the files can be very large, it might make sense to stream the data and group them on the fly.

With really big files, I'd look into a primarily disk based approach instead of trying to load everything into memory. The following script leverages sqlite's JSON1 extension to load the JSON files into a database and generate the grouped results:
#!/usr/bin/env bash
DB=json.db
# Delete existing database if any.
rm -f "$DB"
# Create table. Assuming each host,port pair is unique.
sqlite3 -batch "$DB" <<'EOF'
CREATE TABLE data(host TEXT, port INTEGER, data TEXT,
PRIMARY KEY (host, port)) WITHOUT ROWID;
EOF
# Insert the objects from the files into the database.
for file in file*.json; do
sqlite3 -batch "$DB" <<EOF
INSERT INTO data(host, port, data)
SELECT json_extract(j, '\$.host'), json_extract(j, '\$.port'), json_extract(j, '\$.data')
FROM (SELECT json(readfile('$file')) AS j) as json;
EOF
done
# And display the results of joining the objects. Could use
# json_group_array() instead of this sed hackery, but we're trying to
# avoid building a giant string with the entire results. It might still
# run into sqlite maximum string length limits...
sqlite3 -batch -noheader -list "$DB" <<'EOF' | sed '1s/^/[/; $s/,$/]/'
SELECT json_object('host', host,
'data', json_group_object(port, json(data))) || ','
FROM data
GROUP BY host
ORDER BY host;
EOF
Running this on your sample data prints out:
[{"host":"127.0.0.1","data":{"80":{},"443":{}}},
{"host":"127.0.0.2","data":{"502":{}}}]

If the goal is really to produce a single ginormous JSON entity, then presumably that entity is still small enough to have a chance of fitting into the memory of some computer, say C. So there is a good chance of jq being up to the job on C. At any rate, to utilize memory efficiently, you would:
use inputs while performing the grouping operation;
avoid the built-in group_by (since it requires an in-memory sort).
Here then is a two-step candidate using jq, which assumes grouping.jq contains the following program:
# emit a stream of arrays assuming that f is always string-valued
def GROUPS_BY(stream; f):
reduce stream as $x ({}; ($x|f) as $s | .[$s] += [$x]) | .[];
GROUPS_BY(inputs | .data=.port | del(.port); .host)
| {host: .[0].host, data: map({(.data): {}}) | add}
If the JSON files can be captured by *.json, you could then consider:
jq -n -f grouping.jq *.json | jq -s .
One advantage of this approach is that if it fails, you could try using a temporary file to hold the output of the first step, and then processing it later, either by "slurping" it, or perhaps more sensibly distributing it amongst several files, one per .host.
Removing extraneous data
Obviously, if the input files contain extraneous data, you might want to remove it first, e.g. by running
for f in *.json ; do
jq '{host,port}' "$f" | sponge "$f"
done
or by performing the projection in grouping.jq, e.g. using:
GROUPS_BY(inputs | {host, data: .port}; .host)
| {host: .[0].host, data: map( {(.data):{}} ) | add}

Here's a script which uses jq to solve the problem without requiring more memory than is needed for the largest group. For simplicity:
it reads *.json and directs output to $OUT as defined at the top of the script.
it uses sponge
#!/usr/bin/env bash
# Requires: sponge
OUT=big.json
/bin/rm -i "$OUT"
if [ -s "$OUT" ] ; then
echo $OUT already exists
exit 1
fi
### Step 0: setup
TDIR=$(mktemp -d /tmp/grouping.XXXX)
function cleanup {
if [ -d "$TDIR" ] ; then
/bin/rm -r "$TDIR"
fi
}
trap cleanup EXIT
### Step 1: find the groups
for f in *.json ; do
host=$(jq -r '.host' "$f")
echo "$f" >> "$TDIR/$host"
done
for f in "$TDIR"/* ; do
echo "$f" ...
# Merge each {port: {}} object into .data so the result is a single object.
jq -n 'reduce (inputs | {host, data: {(.port): {} }}) as $in (null;
.host=$in.host | .data += $in.data)' $(cat "$f") | sponge "$f"
done
### Step 2: assembly
i=0
echo "[" > $OUT
find $TDIR -type f | while read f ; do
i=$((i + 1))
if [ $i -gt 1 ] ; then echo , >> $OUT ; fi
cat "$f" >> $OUT
done
echo "]" >> $OUT
Discussion
Besides requiring enough memory to handle the largest group, the main deficiencies of the above implementation are:
it assumes that the .host string is suitable as a file name.
the resultant file is not strictly speaking pretty-printed.
These two issues could however be addressed quite easily with minor modifications to the script without requiring additional memory.

Related

How can I write a batch file using jq to find json files with certain attribute and copy them to new location

I have 100,000's of lined json files that I need to split out based on whether or not, they contain a certain value for an attribute and then I need to convert them into valid json that can be read in by another platform.
I'm using a batch file to do this and I've managed to convert them into valid json using the following:
for /r %%f in (*.json*) do jq -s -c "." "%%f" >> "C:\Users\me\my-folder\%%~nxf.json"
I just can't figure out how to only copy the files that contain a certain value. So logic should be:
Look at all the files in the folders and subfolders
If the file contains an attribute "event" with a value of "abcd123"
then: convert the file into valid json and persist it with the same filename over to location "C:\Users\me\my-folder\"
else: ignore it
Example of files it should select:
{"name":"bob","event":"abcd123"}
and
{"name":"ann","event":"abcd123"},{"name":"bob","event":"8745LLL"}
Example of files it should NOT select:
{"name":"ann","event":"778PPP"}
and
{"name":"ann","event":"778PPP"},{"name":"bob","event":"8745LLL"}
Would love help to figure out the filtering part.
Since there are probably more file names than will fit on the command line, this response will assume a shell loop through the file names will be necessary, as the question itself envisions. Since I'm currently working with a bash shell, I'll present a bash solution, which hopefully can readily be translated to other shells.
The complication in the question is that the input file might contain one or more valid JSON values, or one or more comma-separated JSON values.
The key to a simple solution using jq is jq's -e command-line option, since this sets the return code to 0 if and only if
(a) the program ran normally; and (b) the last result was a truthy value.
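As a rough illustration of how -e turns a jq result into an exit status, here is a minimal sketch. The helper name has_event is made up for this example, and it assumes jq is on the PATH and that each input is a single JSON object:

```shell
#!/usr/bin/env bash
# has_event: hypothetical helper. Succeeds only if stdin is a JSON object
# whose .event equals the given value. With -e, jq exits 0 when the last
# output is truthy and non-zero when it is false or null (or on error).
has_event() {
  jq -e --arg v "$1" '.event == $v' > /dev/null 2>&1
}

if has_event abcd123 <<< '{"name":"bob","event":"abcd123"}'; then
  echo "match"
fi
if ! has_event abcd123 <<< '{"name":"ann","event":"778PPP"}'; then
  echo "no match"
fi
```

Because the exit status carries the answer, the helper can be used directly in an if statement without inspecting jq's output.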
For clarity, let's encapsulate the relevant selection criterion in two bash functions:
# If the input is a valid stream of JSON objects
function try {
jq -e -n 'any( inputs | objects; select( .event == "abcd123") | true)' 2> /dev/null > /dev/null
}
# If the input is a valid JSON array whose elements are to be checked
function try_array {
jq -e 'any( .[] | objects; select( .event == "abcd123") | true)' 2> /dev/null > /dev/null
}
Now a comprehensive solution can be constructed along the following lines:
find . -maxdepth 1 -type f -name '*.json' | while read -r f
do
< "$f" try
rc=$?   # save the status: each [ ... ] test below would overwrite $?
if [ "$rc" = 0 ] ; then
echo copy "$f"
elif [ "$rc" = 5 ] ; then
(echo '['; cat "$f"; echo ']') | try_array
if [ $? = 0 ] ; then
echo copy "$f"
fi
fi
done
Have you considered using findstr?
%SystemRoot%\System32\findstr.exe /SRM "\"event\":\"abcd123\"" "C:\Users\me\my-folder\*.json"
Please open a Command Prompt window, type findstr /?, press the ENTER key, and read its usage information. (You may want to consider the /I option too, for instance).
You could then use that within another for loop to propagate those files into a variable for your copy command.
batch-file example:
@For /F "EOL=? Delims=" %%G In (
'%SystemRoot%\System32\findstr.exe /SRM "\"event\":\"abcd123\"" "C:\Users\me\my-folder\*.json"'
) Do @Copy /Y "%%G" "S:\omewhere Else"
cmd example:
For /F "EOL=? Delims=" %G In ('%SystemRoot%\System32\findstr.exe /SRM "\"event\":\"abcd123\"" "C:\Users\me\my-folder\*.json"') Do @Copy /Y "%G" "S:\omewhere Else"

Loop through JSON array shell script

I am trying to write a shell script that loops through a JSON file and does some logic based on every object's properties. The script was initially written for Windows but it does not work properly on macOS.
The initial code is as follows
documentsJson=""
jsonStrings=$(cat "$file" | jq -c '.[]')
while IFS= read -r document; do
# Get the properties from the document (JSON string)
currentKey=$(echo "$document" | jq -r '.Key')
encrypted=$(echo "$document" | jq -r '.IsEncrypted')
# If not encrypted then don't do anything with it
if [[ $encrypted != true ]]; then
echoComment " Skipping '$currentKey' as it's not marked for encryption"
documentsJson+="$document,"
continue
fi
# some more code
done <<< $jsonStrings
When run on macOS, the whole file is processed at once, so it does not loop through the objects.
The closest I got to making it work - after trying a lot of suggestions - is as follows:
jq -r '.[]' "$file" | while read i; do
for config in $i ; do
currentKey=$(echo "$config" | jq -r '.Key')
echo "$currentKey"
done
done
The console result is parse error: Invalid numeric literal at line 1, column 6
I just cannot find a proper way of grabbing the JSON object and reading its properties.
JSON file example
[
{
"Key": "PdfMargins",
"Value": {
"Left":0,
"Right":0,
"Top":20,
"Bottom":15
}
},
{
"Key": "configUrl",
"Value": "someUrl",
"IsEncrypted": true
}
]
Thank you in advance!
Try putting the $jsonStrings in doublequotes: done <<< "$jsonStrings"
Otherwise the standard shell splitting applies on the variable expansion and you probably want to retain the line structure of the output of jq.
You could also use this in bash:
while IFS= read -r document; do
...
done < <(jq -c '.[]' < "$file")
That would save some resources. I am not sure about making this work on MacOS, though, so test this first.
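A small self-contained sketch of the quoted here-string form follows. The jsonStrings value is hard-coded as a stand-in for what jq -c '.[]' would print (one compact object per line), so the loop can be tried without the real file:

```shell
#!/usr/bin/env bash
# Stand-in for the output of: jq -c '.[]' "$file"  (one compact object per line)
jsonStrings='{"Key":"PdfMargins"}
{"Key":"configUrl","IsEncrypted":true}'

count=0
while IFS= read -r document; do
  count=$((count + 1))
  printf 'document %d: %s\n' "$count" "$document"
done <<< "$jsonStrings"   # quoted, so the newlines between objects survive

echo "processed $count documents"
```

Since the loop reads from a redirection rather than a pipe, it runs in the current shell and count is still set after the loop ends.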

Bash loop to merge files in batches for mongoimport

I have a directory with 2.5 million small JSON files in it. It's 104gb on disk. They're multi-line files.
I would like to create a set of JSON arrays from the files so that I can import them using mongoimport in a reasonable amount of time. The files can be no bigger than 16mb, but I'd be happy even if I managed to get them in sets of ten.
So far, I can use this to do them one at a time at about 1000/minute:
for i in *.json; do mongoimport --writeConcern 0 --db mydb --collection all --quiet --file $i; done
I think I can use "jq" to do this, but I have no idea how to make the bash loop pass 10 files at a time to jq.
Note that using bash find results in an error as there are too many files.
With jq you can use --slurp to create arrays, and -c to make multiline json single line. However, I can't see how to combine the two into a single command.
Please help with both parts of the problem if possible.
Here's one approach. To illustrate, I've used awk as it can read the list of files in small batches and because it has the ability to execute jq and mongoimport. You will probably need to make some adjustments to make the whole thing more robust, to test for errors, and so on.
The idea is either to generate a script that can be reviewed and then executed, or to use awk's system() command to execute the commands directly. First, let's generate the script:
ls *.json | awk -v group=10 -v tmpfile=json.tmp '
function out() {
print "jq -s . " files " > " tmpfile;
print "mongoimport --writeConcern 0 --db mydb --collection all --quiet --file " tmpfile;
print "rm " tmpfile;
files="";
}
BEGIN {n=1; files="";
print "test -r " tmpfile " && rm " tmpfile;
}
n % group == 0 {
out();
}
{ files = files " \""$0 "\"";
n++;
}
END { if (files) {out();}}
'
Once you've verified this works, you can either execute the generated script, or change the "print ..." lines to use "system(....)"
Using jq to generate the script
Here's a jq-only approach for generating the script.
Since the number of files is very large, the following uses features that were only introduced in jq 1.5, so its memory usage is similar to the awk script above:
def read(n):
# state: [answer, hold]
foreach (inputs, null) as $i
([null, null];
if $i == null then .[0] = .[1]
elif .[1]|length == n then [.[1],[$i]]
else [null, .[1] + [$i]]
end;
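.[0] | select(.) );
# Helper assumed here (its definition is not shown above): emits the same
# mongoimport command line as the awk script.
def mongo(f): "mongoimport --writeConcern 0 --db mydb --collection all --quiet --file \(f)";
"test -r json.tmp && rm json.tmp",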
(read($group|tonumber)
| map("\"\(.)\"")
| join(" ")
| ("jq -s . \(.) > json.tmp", mongo("json.tmp"), "rm json.tmp") )
Invocation:
ls *.json | jq -nRr --arg group 10 -f generate.jq
Here is what I came up with. It seems to work and is importing at roughly 80 a second into an external hard drive.
#!/bin/bash
files=(*.json)
for((I=0;I<${#files[@]};I+=500)); do jq -c '.' "${files[@]:I:500}" | mongoimport --writeConcern 0 --numInsertionWorkers 16 --db mydb --collection all --quiet;echo $I; done
However, some are failing. I've imported 105k files but only 98547 appeared in the mongo collection. I think it's because some documents are > 16mb.
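The batching in that loop relies on bash array slicing, ${array[@]:offset:length}. A minimal sketch of just that mechanism, with a made-up helper name (batches) and made-up file names, might look like this:

```shell
#!/usr/bin/env bash
# batches: hypothetical helper that prints its arguments in groups of $1,
# one group per line, using ${array[@]:offset:length} slicing.
batches() {
  local size=$1; shift
  local items=("$@")
  local i
  for ((i = 0; i < ${#items[@]}; i += size)); do
    # The last slice is simply shorter when the count is not a multiple of size.
    echo "${items[@]:i:size}"
  done
}

batches 2 a.json b.json c.json d.json e.json
```

In the real loop, each slice would be passed to jq -c '.' instead of echo, as in the script above.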

Find and edit a Json file using bash

I have multiple files in the following format with different categories like:
{
"id": 1,
"flags": ["a", "b", "c"],
"name": "test",
"category": "video",
"notes": ""
}
Now I want to append all the files flags whose category is video with string d. So my final file should look like the file below:
{
"id": 1,
"flags": ["a", "b", "c", "d"],
"name": "test",
"category": "video",
"notes": ""
}
Using the following command I am able to find the files of interest, but I am stuck on the editing part, and there are hundreds of files, too many to edit manually:
find . -name '*' | xargs grep "\"category\": \"video\"" | awk '{print $1}' | sed 's/://g'
You can do this
find . -type f | xargs grep -l '"category": "video"' | xargs sed -i -e '/flags/ s/]/, "d"]/'
This will find all the filenames that contain a line with "category": "video", and then add the "d" flag.
Details:
find . -type f
=> Will get all the filenames in your directory
xargs grep -l '"category": "video"'
=> Will get those filenames which contain the line "category": "video"
xargs sed -i -e '/flags/ s/]/, "d"]/'
=> Will add the "d" entry to the flags line.
"TWEET!!" ... (yellow flag thrown to the ground) ... Time Out!
What you have, here, is "a JSON file." You also have, at your #!shebang command, your choice of(!) full-featured programming languages ... with intimate and thoroughly-knowledgeable support for JSON ... with which you can very-speedily write your command-file.
Even if it is "theoretically possible" to do this using "bash scripts," this is roughly equivalent to "putting a beautiful stone archway over the front-entrance to a supermarket." Therefore, "waste ye no time" in such an utterly-profitless pursuit. Write a script, using a language that "honest-to-goodness knows about(!) JSON," to decode the contents of the file, then manipulate it (as a data-structure), then re-encode it again.
Here is a more appropriate approach using PHP in shell:
FILE=foo2.json php -r '$file = $_SERVER["FILE"]; $arr = json_decode(file_get_contents($file)); if ($arr->category == "video") { $arr->flags[] = "d"; file_put_contents($file,json_encode($arr)); }'
This loads the file, decodes it into an object, appends "d" to the flags property only when category is video, then writes the result back to the file as JSON.
To run this for every json file, you can use find command, e.g.
find . -name "*.json" -print0 | while IFS= read -r -d '' file; do
FILE=$file
# run above PHP command in here
done
If the files are in the same format, this command may help (version for a single file):
ex +':/category.*video/norm kkf]i, "d"' -scwq file1.json
or:
ex +':/flags/,/category/s/"c"/"c", "d"/' -scwq file1.json
which is basically using Ex editor (now part of Vim).
Explanation:
+ - executes Vim command (man ex)
:/pattern_or_range/cmd - find pattern, if successful execute another Vim command (:h :/)
norm kkf]i - executes keystrokes in normal mode
kk - move cursor up twice
f] - find ]
i, "d" - insert , "d"
-s - silent mode
-cwq - executes wq (write & quit)
For multiple files, use find and -execdir or extend above ex command to:
ex +'bufdo!:/category.*video/norm kkf]i, "d"' -scxa *.json
Where bufdo! executes command for every file, and -cxa saves every file. Add -V1 for extra verbose messages.
If the flags line is not 2 lines above, you can perform a backward search instead, or use an approach similar to @sps's and replace the ] on the flags line.
See also: How to change previous line when the pattern is found? at Vim.SE.
Using jq:
find . -type f | xargs cat | jq 'select(.category=="video") | .flags |= . + ["d"]'
Explanation:
jq 'select(.category=="video") | .flags |= . + ["d"]'
# select(.category=="video") => filters by category field
# .flags |= . + ["d"] => Updates the flags array
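Note that the command above prints the modified objects to standard output and leaves the files themselves untouched. One possible way to write the change back is sketched below. The helper name add_flag is made up, and the sketch assumes each file holds a single JSON object. It uses if/else rather than select so that non-video files are rewritten unchanged instead of being emptied:

```shell
#!/usr/bin/env bash
# add_flag: hypothetical helper. Appends "d" to .flags in one file, but only
# when .category is "video". jq cannot edit in place, so the result goes to a
# temporary file first and is copied back only if jq succeeded.
add_flag() {
  local file=$1 tmp status=0
  tmp=$(mktemp) || return 1
  jq 'if .category == "video" then .flags += ["d"] else . end' "$file" > "$tmp" \
    && cat "$tmp" > "$file" || status=1
  rm -f "$tmp"
  return "$status"
}

# Usage, assuming the files are named *.json:
# find . -type f -name '*.json' -print0 |
#   while IFS= read -r -d '' f; do add_flag "$f"; done
```

Copying the temporary file's contents back with cat, rather than moving it over the original, keeps the original file's permissions.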

How can I completely sort arbitrary JSON using jq?

I want to diff two JSON text files. Unfortunately they're constructed in arbitrary order, so I get diffs when they're semantically identical. I'd like to use jq (or whatever) to sort them in any kind of full order, to eliminate differences due only to element ordering.
--sort-keys solves half the problem, but it doesn't sort arrays.
I'm pretty ignorant of jq and don't know how to write a jq recursive filter that preserves all data; any help would be appreciated.
I realize that line-by-line 'diff' output isn't necessarily the best way to compare two complex objects, but in this case I know the two files are very similar (nearly identical) and line-by-line diffs are fine for my purposes.
Using jq or alternative command line tools to diff JSON files answers a very similar question, but doesn't print the differences. Also, I want to save the sorted results, so what I really want is just a filter program to sort JSON.
Here is a solution using a generic function sorted_walk/1 (so named for the reason described in the postscript below).
normalize.jq:
# Apply f to composite entities recursively using keys[], and to atoms
def sorted_walk(f):
. as $in
| if type == "object" then
reduce keys[] as $key
( {}; . + { ($key): ($in[$key] | sorted_walk(f)) } ) | f
elif type == "array" then map( sorted_walk(f) ) | f
else f
end;
def normalize: sorted_walk(if type == "array" then sort else . end);
normalize
Example using bash:
diff <(jq -S -f normalize.jq FILE1) <(jq -S -f normalize.jq FILE2)
POSTSCRIPT: The builtin definition of walk/1 was revised after this response was first posted: it now uses keys_unsorted rather than keys.
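On jq versions where walk/1 is built in (assumed here to be 1.6 or later), a shorter variant should give the same result, because -S takes care of key order on output and walk sorts each array after its elements have been processed. The helper name normalize_json is made up for this sketch:

```shell
#!/usr/bin/env bash
# normalize_json: hypothetical helper. Sorts every array and, via -S, every
# object's keys. Assumes a jq with the builtin walk/1.
normalize_json() {
  jq -S -c 'walk(if type == "array" then sort else . end)'
}

# Two semantically identical documents with different ordering:
normalize_json <<< '{"b":[3,1,2],"a":{"d":1,"c":["y","x"]}}'
normalize_json <<< '{"a":{"c":["x","y"],"d":1},"b":[2,3,1]}'
```

Both calls print the same line, {"a":{"c":["x","y"],"d":1},"b":[1,2,3]}, so a line-by-line diff of the two outputs would be empty.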
I want to diff two JSON text files.
Use jd with the -set option:
No output means no difference.
$ jd -set A.json B.json
Differences are shown as an @ path and + or -.
$ jd -set A.json C.json
@ ["People",{}]
+ "Carla"
The output diffs can also be used as patch files with the -p option.
$ jd -set -o patch A.json C.json; jd -set -p patch B.json
{"City":"Boston","People":["John","Carla","Bryan"],"State":"MA"}
https://github.com/josephburnett/jd#command-line-usage
I'm surprised this isn't a more popular question/answer. I haven't seen any other json deep sort solutions. Maybe everyone likes solving the same problem over and over.
Here's a wrapper for @peak's excellent solution above that turns it into a shell script that works in a pipe or with file args.
#!/usr/bin/env bash
# json normalizer function
# Recursively sort an entire json file, keys and arrays
# jq --sort-keys sorts object keys but leaves arrays alone
# Alphabetize a json file's dict's such that they are always in the same order
# Makes json diff'able and should be run on any json data that's in source control to prevent excessive diffs from dict reordering.
[ "${DEBUG}" ] && set -x
TMP_FILE="$(mktemp)"
trap 'rm -f -- "${TMP_FILE}"' EXIT
cat > "${TMP_FILE}" <<-EOT
# Apply f to composite entities recursively using keys[], and to atoms
def sorted_walk(f):
. as \$in
| if type == "object" then
reduce keys[] as \$key
( {}; . + { (\$key): (\$in[\$key] | sorted_walk(f)) } ) | f
elif type == "array" then map( sorted_walk(f) ) | f
else f
end;
def normalize: sorted_walk(if type == "array" then sort else . end);
normalize
EOT
# Don't pollute stdout with debug output
[ "${DEBUG}" ] && cat "${TMP_FILE}" >&2
if [ "$1" ] ; then
jq -S -f "${TMP_FILE}" "$1"
else
jq -S -f "${TMP_FILE}"
fi