Bash loop to merge files in batches for mongoimport - json

I have a directory with 2.5 million small JSON files in it. It's 104 GB on disk. They're multi-line files.
I would like to create a set of JSON arrays from the files so that I can import them using mongoimport in a reasonable amount of time. The merged files can be no bigger than 16 MB, but I'd be happy even if I only managed to get the originals into sets of ten.
So far, I can use this to do them one at a time at about 1000/minute:
for i in *.json; do mongoimport --writeConcern 0 --db mydb --collection all --quiet --file "$i"; done
I think I can use "jq" to do this, but I have no idea how to make the bash loop pass 10 files at a time to jq.
Note that using bash find results in an error as there are too many files.
With jq you can use --slurp to create arrays, and -c to make multiline json single line. However, I can't see how to combine the two into a single command.
Please help with both parts of the problem if possible.

Here's one approach. To illustrate, I've used awk, since it can read the list of files in small batches and can run jq and mongoimport for each batch. You will probably need to make some adjustments to make the whole thing more robust, to test for errors, and so on.
The idea is either to generate a script that can be reviewed and then executed, or to use awk's system() command to execute the commands directly. First, let's generate the script:
ls *.json | awk -v group=10 -v tmpfile=json.tmp '
  function out() {
    print "jq -s . " files " > " tmpfile;
    print "mongoimport --writeConcern 0 --db mydb --collection all --quiet --file " tmpfile;
    print "rm " tmpfile;
    files = "";
  }
  BEGIN {
    n = 0; files = "";
    print "test -r " tmpfile " && rm " tmpfile;
  }
  {
    files = files " \"" $0 "\"";
    n++;
    # flush a complete batch of "group" files
    if (n % group == 0) { out(); }
  }
  END { if (files != "") { out(); } }
'
Once you've verified this works, you can either execute the generated script, or change the "print ..." lines to use "system(....)"
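For example, here's a sketch of the out() function rewritten to run the commands directly via awk's system() rather than printing them (untested; system() returns the command's exit status, which could also be checked for errors):
function out() {
  system("jq -s . " files " > " tmpfile);
  system("mongoimport --writeConcern 0 --db mydb --collection all --quiet --file " tmpfile);
  system("rm " tmpfile);
  files = "";
}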
Using jq to generate the script
Here's a jq-only approach for generating the script.
Since the number of files is very large, the following uses streaming features that were only introduced in jq 1.5, so that its memory usage stays similar to that of the awk script above:
# emit the mongoimport command line for a given temporary file
def mongo(file):
  "mongoimport --writeConcern 0 --db mydb --collection all --quiet --file \(file)";

def read(n):
  # state: [answer, hold]
  foreach (inputs, null) as $i
    ([null, null];
     if $i == null then .[0] = .[1]
     elif .[1]|length == n then [.[1], [$i]]
     else [null, .[1] + [$i]]
     end;
     .[0] | select(.) );

"test -r json.tmp && rm json.tmp",
(read($group|tonumber)
 | map("\"\(.)\"")
 | join(" ")
 | ("jq -s . \(.) > json.tmp", mongo("json.tmp"), "rm json.tmp") )
Invocation:
ls *.json | jq -nRr --arg group 10 -f generate.jq
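To give a feel for what this produces, running it with --arg group 2 over three hypothetical files a.json, b.json and c.json should generate a script roughly like this:
test -r json.tmp && rm json.tmp
jq -s . "a.json" "b.json" > json.tmp
mongoimport --writeConcern 0 --db mydb --collection all --quiet --file json.tmp
rm json.tmp
jq -s . "c.json" > json.tmp
mongoimport --writeConcern 0 --db mydb --collection all --quiet --file json.tmp
rm json.tmp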

Here is what I came up with. It seems to work and is importing at roughly 80 files a second onto an external hard drive.
#!/bin/bash
files=(*.json)
for ((I=0; I<${#files[*]}; I+=500)); do
  jq -c '.' "${files[@]:I:500}" | mongoimport --writeConcern 0 --numInsertionWorkers 16 --db mydb --collection all --quiet
  echo "$I"
done
However, some are failing: I imported 105k files but only 98,547 appeared in the Mongo collection. I think it's because some documents are larger than 16 MB.
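One way to check that theory is to look for individual files above MongoDB's 16 MB per-document limit, e.g. (a sketch, assuming GNU find's -size suffixes):
find . -maxdepth 1 -name '*.json' -size +16M -exec ls -lh {} +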

Related

How can I write a batch file using jq to find json files with certain attribute and copy the to new location

I have hundreds of thousands of JSON files, each containing one or more JSON values, that I need to split out based on whether or not they contain a certain value for an attribute, and then I need to convert them into valid JSON that can be read by another platform.
I'm using a batch file to do this and I've managed to convert them into valid json using the following:
for /r %%f in (*.json*) do jq -s -c "." "%%f" >> "C:\Users\me\my-folder\%%~nxf.json"
I just can't figure out how to copy only the files that contain a certain value. So the logic should be:
Look at all the files in the folder and subfolders
If the file contains an attribute "event" with a value of "abcd123"
then: convert the file into valid json and persist it with the same filename over to location "C:\Users\me\my-folder\"
else: ignore it
Example of files it should select:
{"name":"bob","event":"abcd123"}
and
{"name":"ann","event":"abcd123"},{"name":"bob","event":"8745LLL"}
Example of files it should NOT select:
{"name":"ann","event":"778PPP"}
and
{"name":"ann","event":"778PPP"},{"name":"bob","event":"8745LLL"}
Would love help to figure out the filtering part.
Since there are probably more file names than will fit on the command line, this response will assume a shell loop through the file names will be necessary, as the question itself envisions. Since I'm currently working with a bash shell, I'll present a bash solution, which hopefully can readily be translated to other shells.
The complication in the question is that the input file might contain one or more valid JSON values, or one or more comma-separated JSON values.
The key to a simple solution using jq is jq's -e command-line option, since this sets the return code to 0 if and only if
(a) the program ran normally; and (b) the last result was a truthy value.
For clarity, let's encapsulate the relevant selection criterion in two bash functions:
# If the input is a valid stream of JSON objects
function try {
  jq -e -n 'any( inputs | objects; select( .event == "abcd123") | true)' 2> /dev/null > /dev/null
}

# If the input is a valid JSON array whose elements are to be checked
function try_array {
  jq -e 'any( .[] | objects; select( .event == "abcd123") | true)' 2> /dev/null > /dev/null
}
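A quick smoke test of the two helpers, using the sample documents from the question (try reads a stream of objects, try_array expects an already-wrapped array):
printf '%s' '{"name":"bob","event":"abcd123"}' | try && echo selected
printf '%s' '{"name":"ann","event":"778PPP"}' | try || echo not selected
printf '%s' '[{"name":"ann","event":"abcd123"},{"name":"bob","event":"8745LLL"}]' | try_array && echo selected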
Now a comprehensive solution can be constructed along the following lines:
find . -maxdepth 1 -type f -name '*.json' | while read -r f
do
  < "$f" try
  rc=$?   # capture try's status before it is clobbered by the next test
  if [ $rc = 0 ] ; then
    echo copy "$f"
  elif [ $rc = 5 ] ; then
    (echo '['; cat "$f"; echo ']') | try_array
    if [ $? = 0 ] ; then
      echo copy "$f"
    fi
  fi
done
Have you considered using findstr?
%SystemRoot%\System32\findstr.exe /SRM "\"event\":\"abcd123\"" "C:\Users\me\my-folder\*.json"
Please open a Command Prompt window, type findstr /?, press the ENTER key, and read its usage information. (You may want to consider the /I option too, for instance).
You could then use that within another for loop to propagate those files into a variable for your copy command.
batch-file example:
@For /F "EOL=? Delims=" %%G In (
'%SystemRoot%\System32\findstr.exe /SRM "\"event\":\"abcd123\"" "C:\Users\me\my-folder\*.json"'
) Do @Copy /Y "%%G" "S:\omewhere Else"
cmd example:
For /F "EOL=? Delims=" %G In ('%SystemRoot%\System32\findstr.exe /SRM "\"event\":\"abcd123\"" "C:\Users\me\my-folder\*.json"') Do @Copy /Y "%G" "S:\omewhere Else"

JQ Group Multiple Files

I have a set of JSON files that all contain JSON in the following format:
File 1:
{ "host" : "127.0.0.1", "port" : "80", "data": {}}
File 2:
{ "host" : "127.0.0.2", "port" : "502", "data": {}}
File 3:
{ "host" : "127.0.0.1", "port" : "443", "data": {}}
These files can be rather large, up to several gigabytes.
I want to use JQ or some other bash json processing tool that can merge these json files into one file with a grouped format like so:
[{ "host" : "127.0.0.1", "data": {"80": {}, "443" : {}}},
{ "host" : "127.0.0.2", "data": {"502": {}}}]
Is this possible with jq and if yes, how could I possibly do this? I have looked at the group_by function in jq, but it seems like I need to combine all files first and then group on this big file. However, since the files can be very large, it might make sense to stream the data and group them on the fly.
With really big files, I'd look into a primarily disk based approach instead of trying to load everything into memory. The following script leverages sqlite's JSON1 extension to load the JSON files into a database and generate the grouped results:
#!/usr/bin/env bash
DB=json.db
# Delete existing database if any.
rm -f "$DB"
# Create table. Assuming each host,port pair is unique.
sqlite3 -batch "$DB" <<'EOF'
CREATE TABLE data(host TEXT, port INTEGER, data TEXT,
PRIMARY KEY (host, port)) WITHOUT ROWID;
EOF
# Insert the objects from the files into the database.
for file in file*.json; do
sqlite3 -batch "$DB" <<EOF
INSERT INTO data(host, port, data)
SELECT json_extract(j, '\$.host'), json_extract(j, '\$.port'), json_extract(j, '\$.data')
FROM (SELECT json(readfile('$file')) AS j) as json;
EOF
done
# And display the results of joining the objects. Could use
# json_group_array() instead of this sed hackery, but we're trying to
# avoid building a giant string with the entire results. It might still
# run into sqlite maximum string length limits...
sqlite3 -batch -noheader -list "$DB" <<'EOF' | sed '1s/^/[/; $s/,$/]/'
SELECT json_object('host', host,
'data', json_group_object(port, json(data))) || ','
FROM data
GROUP BY host
ORDER BY host;
EOF
Running this on your sample data prints out:
[{"host":"127.0.0.1","data":{"80":{},"443":{}}},
{"host":"127.0.0.2","data":{"502":{}}}]
If the goal is really to produce a single ginormous JSON entity, then presumably that entity is still small enough to have a chance of fitting into the memory of some computer, say C. So there is a good chance of jq being up to the job on C. At any rate, to utilize memory efficiently, you would:
use inputs while performing the grouping operation;
avoid the built-in group_by (since it requires an in-memory sort).
Here then is a two-step candidate using jq, which assumes grouping.jq contains the following program:
# emit a stream of arrays assuming that f is always string-valued
def GROUPS_BY(stream; f):
reduce stream as $x ({}; ($x|f) as $s | .[$s] += [$x]) | .[];
GROUPS_BY(inputs | .data=.port | del(.port); .host)
| {host: .[0].host, data: map({(.data): {}}) | add}
If the JSON files can be captured by *.json, you could then consider:
jq -n -f grouping.jq *.json | jq -s .
One advantage of this approach is that if it fails, you could try using a temporary file to hold the output of the first step, and then processing it later, either by "slurping" it, or perhaps more sensibly distributing it amongst several files, one per .host.
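For instance, assuming the three sample objects from the question are stored in file1.json, file2.json and file3.json, the first step on its own should emit one object per host (shown here with -c to get one object per line):
jq -nc -f grouping.jq file1.json file2.json file3.json
# {"host":"127.0.0.1","data":{"80":{},"443":{}}}
# {"host":"127.0.0.2","data":{"502":{}}}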
Removing extraneous data
Obviously, if the input files contain extraneous data, you might want to remove it first, e.g. by running
for f in *.json ; do
  jq '{host,port}' "$f" | sponge "$f"
done
or by performing the projection in grouping.jq, e.g. using:
GROUPS_BY(inputs | {host, data: .port}; .host)
| {host: .[0].host, data: map( {(.data):{}} ) | add}
Here's a script which uses jq to solve the problem without requiring more memory than is needed for the largest group. For simplicity:
it reads *.json and directs output to $OUT as defined at the top of the script.
it uses sponge
#!/usr/bin/env bash
# Requires: sponge
OUT=big.json
/bin/rm -i "$OUT"
if [ -s "$OUT" ] ; then
  echo "$OUT already exists"
  exit 1
fi

### Step 0: setup
TDIR=$(mktemp -d /tmp/grouping.XXXX)
function cleanup {
  if [ -d "$TDIR" ] ; then
    /bin/rm -r "$TDIR"
  fi
}
trap cleanup EXIT

### Step 1: find the groups
for f in *.json ; do
  host=$(jq -r '.host' "$f")
  echo "$f" >> "$TDIR/$host"
done

for f in "$TDIR"/* ; do
  echo "$f" ...
  # Merge the per-port {port: {}} objects for this host into a single .data object.
  jq -n 'reduce (inputs | {host, data: {(.port): {} }}) as $in (null;
           .host = $in.host | .data += $in.data)' $(cat "$f") | sponge "$f"
done

### Step 2: assembly
i=0
echo "[" > "$OUT"
find "$TDIR" -type f | while read -r f ; do
  i=$((i + 1))
  if [ $i -gt 1 ] ; then echo , >> "$OUT" ; fi
  cat "$f" >> "$OUT"
done
echo "]" >> "$OUT"
Discussion
Besides requiring enough memory to handle the largest group, the main deficiencies of the above implementation are:
it assumes that the .host string is suitable as a file name.
the resultant file is not strictly speaking pretty-printed.
These two issues could however be addressed quite easily with minor modifications to the script without requiring additional memory.
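For instance, the first issue could be handled in Step 1 by keying the temporary files on a hash of .host rather than on the raw string (a sketch, assuming md5sum from GNU coreutils; the rest of the script never relies on the file name, since the host is re-read from the JSON itself):
host=$(jq -r '.host' "$f")
key=$(printf '%s' "$host" | md5sum | cut -d' ' -f1)   # a hash is always a safe file name
echo "$f" >> "$TDIR/$key"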

Using jq to combine json files, getting file list length too long error

Using jq to concat json files in a directory.
The directory contains a few hundred thousand files.
jq -s '.' *.json > output.json
returns an error that the file list is too long. Is there a way to write this that uses a method that will take in more files?
If jq -s . *.json > output.json produces "argument list too long"; you could fix it using zargs in zsh:
$ zargs *.json -- cat | jq -s . > output.json
That you could emulate using find as shown in @chepner's answer:
$ find -maxdepth 1 -name \*.json -exec cat {} + | jq -s . > output.json
"Data in jq is represented as streams of JSON values ... This is a cat-friendly format - you can just join two JSON streams together and get a valid JSON stream.":
$ echo '{"a":1}{"b":2}' | jq -s .
[
{
"a": 1
},
{
"b": 2
}
]
The problem is that the length of a command line is limited, and *.json produces too many arguments for one command line. One workaround is to expand the pattern in a for loop, which does not have the same limits as a command line, because bash can iterate over the result internally rather than having to construct an argument list for an external command:
for f in *.json; do
cat "$f"
done | jq -s '.' > output.json
This is rather inefficient, though, since it requires running cat once for each file. A more efficient solution is to use find to call cat with as many files as possible each time.
find . -name '*.json' -exec cat '{}' + | jq -s '.' > output.json
(You may be able to simply use
find . -name '*.json' -exec jq -s '.' '{}' + > output.json
as well; it may depend on what is in the files and on how multiple calls to jq using the -s option compare to a single call.)
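If find does end up splitting the list across several jq invocations, each invocation emits its own top-level array, so the output would be several arrays back to back rather than a single one. A sketch of one way to still end up with one array is to merge them in a final pass:
find . -name '*.json' -exec jq -s '.' '{}' + | jq -s 'add' > output.json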
[EDITED to use find]
One obvious thing to consider would be to process one file at a time, and then "slurp" them:
$ while IFS= read -r f ; do cat "$f" ; done < <(find . -maxdepth 1 -name "*.json") | jq -s .
This however would presumably require a lot of memory. Thus the following may be closer to what you need:
#!/bin/bash
# "slurp" a bunch of files
# Requires a version of jq with 'inputs'.
echo "["
while read -r f
do
  jq -nr 'inputs | (., ",")' "$f"
done < <(find . -maxdepth 1 -name "*.json") | sed '$d'
echo "]"

verify that a json field exists with jq and bash?

I have a script that uses jq for parsing a JSON string MESSAGE (that is read from another application). Meanwhile the JSON has changed and a field has been split into 2 fields: file_path is now split into folder and file. The script was reading file_path; now the folder may not be present, so to build the path of the file I have to verify whether the field is there. I have searched for a while on the internet and managed to do:
echo $(echo $MESSAGE | jq .folder -r)$'/'$(echo $MESSAGE | jq .file -r)
if [ $MESSAGE | jq 'has(".folder")' -r ]
then
echo $(echo $MESSAGE | jq .folder -r)$'/'$(echo $MESSAGE | jq .file -r)
else
echo $(echo $MESSAGE | jq .file -r)
fi
where MESSAGE='{"folder":"FLDR","file":"fl"}' or MESSAGE='{"file":"fl"}'
The first line prints FLDR/fl, or null/fl if the folder field is not present. So I thought I would create an if that verifies whether the folder field is present, but it seems I am doing it wrong and I cannot figure out what is wrong. The output is
bash: [: missing `]'
jq: ]: No such file or directory
null/fl
I'd do the whole thing in a jq filter:
echo "$MESSAGE" | jq -r '[ .folder, .file ] | join("/")'
In the event that you want to do it with bash (or to learn how to do this sort of thing in bash), two points:
Shell variables should almost always be quoted when they are used (i.e., "$MESSAGE" instead of $MESSAGE). You will run into funny problems if one of the strings in your JSON ever contains a shell metacharacter (such as *) and you forgot to do that; the string will be subject to shell expansion (and that * will be expanded into a list of files in the current working directory).
A shell if accepts as condition a command, and the decision where to branch is made depending on the exit status of that command (true if the exit status is 0, false otherwise). The [ you attempted to use is just a command (an alias for test, see man test) and not special in any way.
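For instance, these two lines are equivalent, and if merely branches on the exit status of whatever command follows it (using a hypothetical variable):
if [ -n "$folder" ]; then echo "non-empty"; fi
if test -n "$folder"; then echo "non-empty"; fi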
So, the goal is to construct a command that exits with 0 if the JSON object has a folder property, non-zero otherwise. jq has a -e option that makes it return 0 if the last output value was not false or null and non-zero otherwise, so we can write
if echo "$MESSAGE" | jq -e 'has("folder")' > /dev/null; then
echo "$MESSAGE" | jq -r '.folder + "/" + .file'
else
echo "$MESSAGE" | jq -r .file
fi
The > /dev/null bit redirects the output from jq to /dev/null (where it is ignored) so that we don't see it on the console.
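A quick check with the two sample messages from the question (expected output shown in the comments):
for MESSAGE in '{"folder":"FLDR","file":"fl"}' '{"file":"fl"}'; do
  if echo "$MESSAGE" | jq -e 'has("folder")' > /dev/null; then
    echo "$MESSAGE" | jq -r '.folder + "/" + .file'
  else
    echo "$MESSAGE" | jq -r .file
  fi
done
# FLDR/fl
# fl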

Bourne shell function return variable always empty

The following Bourne shell script, given a path, is supposed to test each component of the path for existence; then set a variable comprising only those components that actually exist.
#! /bin/sh
set -x # for debugging
test_path() {
path=""
echo $1 | tr ':' '\012' | while read component
do
if [ -d "$component" ]
then
if [ -z "$path" ]
then path="$component"
else path="$path:$component"
fi
fi
done
echo "$path" # this prints nothing
}
paths=/usr/share/man:\
/usr/X11R6/man:\
/usr/local/man
MANPATH=`test_path $paths`
echo $MANPATH
When run, it always prints nothing. The trace using set -x is:
+ paths=/usr/share/man:/usr/X11R6/man:/usr/local/man
++ test_path /usr/share/man:/usr/X11R6/man:/usr/local/man
++ path=
++ echo /usr/share/man:/usr/X11R6/man:/usr/local/man
++ tr : '\012'
++ read component
++ '[' -d /usr/share/man ']'
++ '[' -z '' ']'
++ path=/usr/share/man
++ read component
++ '[' -d /usr/X11R6/man ']'
++ read component
++ '[' -d /usr/local/man ']'
++ '[' -z /usr/share/man ']'
++ path=/usr/share/man:/usr/local/man
++ read component
++ echo ''
+ MANPATH=
+ echo
Why is the final echo $path empty? The $path variable within the while loop was incrementally set for each iteration just fine.
The pipe runs all commands involved in sub-shells, including the entire while ... loop. Therefore, all changes to variables in that loop are confined to the sub-shell and invisible to the parent shell script.
One way to work around that is putting the while ... loop and the echo into a list that executes entirely in the sub-shell, so that the modified variable $path is visible to echo:
test_path()
{
echo "$1" | tr ':' '\n' | {
while read component
do
if [ -d "$component" ]
then
if [ -z "$path" ]
then
path="$component"
else
path="$path:$component"
fi
fi
done
echo "$path"
}
}
However, I suggest using something like this:
test_path()
{
echo "$1" | tr ':' '\n' |
while read dir
do
[ -d "$dir" ] && printf "%s:" "$dir"
done |
sed 's/:$/\n/'
}
... but that's a matter of taste.
Edit: As others have said, the behaviour you are observing depends on the shell. The POSIX standard describes pipelined commands as run in sub-shells, but that is not a requirement:
Additionally, each command of a multi-command pipeline is in a subshell environment; as an extension, however, any or all commands in a pipeline may be executed in the current environment.
Bash runs them in sub-shells, but some shells run the last command in the context of the main script, when only the preceding commands in the pipeline are run in sub-shells.
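A minimal illustration of the difference (with bash's default settings):
x=0
echo hi | { read line; x=1; }
echo "$x"   # bash prints 0: the assignment happened in a subshell
            # ksh93 and zsh print 1: the last element ran in the current shell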
This should work in a Bourne shell that understands functions (and would work in Bash and other shells too):
test_path() {
echo $1 | tr ':' '\012' |
{
path=""
while read component
do
if [ -d "$component" ]
then
if [ -z "$path" ]
then path="$component"
else path="$path:$component"
fi
fi
done
echo "$path" # this prints nothing
}
}
The inner set of braces groups the commands into a unit, so path is only set in the subshell but is echoed from the same subshell.
Why is the final echo $path empty?
By default, Bash gives every component of a pipeline its own process, separate from the shell process in which the pipeline is run.
Separate process == separate address space, and no variable sharing.
In ksh93 and in recent Bash (may need a shopt setting), the shell will run the last component of a pipeline in the calling shell, so any variables changed inside the loop are preserved when the loop exits.
Another way to accomplish what you want is to make sure that the echo $path is in the same process as the loop, using parentheses:
#! /bin/sh
set -x # for debugging
test_path() {
path=""
echo $1 | tr ':' '\012' | ( while read component
do
[ -d "$component" ] || continue
path="${path:+$path:}$component"
done
echo "$path"
)
}
Note: I simplified the inner if. There was no else, so the test can be replaced with a shortcut. Also, the two path assignments can be combined into one, using the ${var:+ ...} parameter substitution trick.
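To see the ${var:+ ...} trick in isolation: it expands to the alternate text only when the variable is set and non-empty:
path=""; echo "${path:+$path:}next"                  # -> next
path="/usr/share/man"; echo "${path:+$path:}next"    # -> /usr/share/man:next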
Your script works just fine with no change under Solaris 11, and probably also under most commercial Unix systems like AIX and HP-UX, because on these OSes the underlying implementation of /bin/sh is provided by ksh. This would also be the case if /bin/sh were backed by zsh.
It likely doesn't work for you because your /bin/sh is implemented by one of bash, dash, mksh or busybox sh, which all process each component of a pipeline in a subshell, while ksh and zsh both keep the last element of a pipeline in the current shell, saving an unnecessary fork.
It is possible to "fix" your script for it to work when sh is provided by bash by adding this line somewhere before the pipeline:
shopt -s lastpipe
or better, if you want to keep portability:
command -v shopt > /dev/null && shopt -s lastpipe
This will keep the script working for ksh and zsh, but it still won't work for dash, mksh or the original Bourne shell.
Note that both bash and ksh behaviors are allowed by the POSIX standard.
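A quick way to check which behaviour a given shell exhibits (a sketch; it prints 1 when the last pipeline element runs in the current shell, 0 otherwise):
# replace "sh" with the shell you want to test, e.g. bash, dash, mksh, ksh or zsh
sh -c 'x=0; echo | { read _; x=1; }; echo "$x"'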