jq: extract value based on different (calculated) value - json

I am trying to filter down a very large json file (AWS output from aws rds describe-db-snapshots) into just a list of snapshots for deletion.
The final list should contain only snapshots older than 60 days. I can discern their age via their SnapshotCreateTime, but I need their DBSnapshotIdentifier value to be able to delete them.
Greatly stripped down for SO purposes, below is the input.json file.
{
  "Engine": "postgres",
  "SnapshotCreateTime": "2017-08-22T16:35:42.302Z",
  "AvailabilityZone": "us-east-1b",
  "DBSnapshotIdentifier": "alex2-20170822-0108-bkup",
  "AllocatedStorage": 5
}
{
  "Engine": "postgres",
  "SnapshotCreateTime": "2017-06-02T16:35:42.302Z",
  "AvailabilityZone": "us-east-1a",
  "DBSnapshotIdentifier": "alex-dbs-16opfr84gq4h9-snapshot-rtsmdbinstance-fr84gq4h9",
  "AllocatedStorage": 5
}
{
  "Engine": "postgres",
  "SnapshotCreateTime": "2017-04-22T16:35:42.302Z",
  "AvailabilityZone": "us-east-1a",
  "DBSnapshotIdentifier": "alex3-20170422-update",
  "AllocatedStorage": 5
}
I know about select, but from what I can tell it can't handle the math needed for the time comparison in a one-liner. I figured I'd need to branch out to bash, so I've been messing with the following (clunky) workaround. It's not working, but I'm including it as proof of effort.
THEN=$(date +'%Y%m%d' -d "`date`-60days")
while IFS= read -r i
do
awsDate=$(jq -r '.SnapshotCreateTime' < $i) // get time
snapDate=$(date -d $awsDate +'%Y%m%d') //convert to correct format
if [ $snapDate -gt $THEN ] //compare times
then
// something to copy the ID
fi
done < input.json
In this case I'd be looking for an output of
alex-dbs-16opfr84gq4h9-snapshot-rtsmdbinstance-fr84gq4h9
alex3-20170422-update

Here is an all-jq solution (i.e. one that does not depend on calling the date command). You might like to try a variation, e.g. passing some form of the date in, using one of the command-line options such as --arg.
jq currently does not quite understand the SnapshotCreateTime format; that's where the call to sub comes in:
def ago(days): now - (days*24*3600);
select(.SnapshotCreateTime | sub("\\.[0-9]*";"") < (ago(60) | todate))
| .DBSnapshotIdentifier
After fixing the sample input so that it is valid JSON, the output would be:
"alex-dbs-16opfr84gq4h9-snapshot-rtsmdbinstance-fr84gq4h9"
"alex3-20170422-update"
To strip the quotation marks, use the -r command-line option.
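Returning to the --arg variation mentioned above, here is a rough sketch: the cutoff is computed in the shell (this assumes GNU date) and passed in with --arg instead of being derived inside jq with now and todate:
# compute the cutoff in the shell and pass it in as a string
cutoff=$(date -u -d '60 days ago' +%Y-%m-%dT%H:%M:%SZ)
jq -r --arg cutoff "$cutoff" '
  select(.SnapshotCreateTime | sub("\\.[0-9]*"; "") < $cutoff)
  | .DBSnapshotIdentifier' input.json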

Here is a solution which defines a filter function using select, sub, fromdate and now.
def too_old:
  select( .SnapshotCreateTime
          | sub("[.][0-9]+Z";"Z")   # remove fractional seconds
          | fromdate                # convert to unix time
          | now - .                 # convert to age in seconds
          | . > (86400 * 60)        # true if older than 60 days in seconds
        )
;

too_old
| .DBSnapshotIdentifier
If you place this in a file filter.jq and run jq with the -r option, e.g.
jq -M -r -f filter.jq input.json
it will produce the output you requested:
alex-dbs-16opfr84gq4h9-snapshot-rtsmdbinstance-fr84gq4h9
alex3-20170422-update
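As a possible follow-up (a sketch, not part of the question): the real describe-db-snapshots output wraps each snapshot in a .DBSnapshots array, so the identifiers could be extracted from the live output and fed to the AWS CLI for deletion, assuming filter.jq holds one of the filters above:
# break the wrapper array into a stream of snapshot objects, apply the filter,
# then delete each selected snapshot by identifier
aws rds describe-db-snapshots \
  | jq -c '.DBSnapshots[]' \
  | jq -r -f filter.jq \
  | while IFS= read -r id; do
        aws rds delete-db-snapshot --db-snapshot-identifier "$id"
    done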

Related

jq: filter result by value (contains) is very slow

I am trying to use jq to filter a large number of JSON files and extract the ids of each object that belongs to a specific domain, as well as the full URL within that domain. Here's a sample of the data:
{
"items": [
{
"completeness": 5,
"dcLanguageLangAware": {
"def": [
"de"
]
},
"edmIsShownBy": [
"https://gallica.example/image/2IC6BQAEGWUEG4OP7AYBDGIGYAX62KZ6H366KXP2IKVAF4LKY37Q/presentation_images/5591be60-01fc-11e6-8e10-fa163e091926/node-3/image/SBB/Berliner_Börsenzeitung/1920/02/27/F_065_098_0/F_SBB_00007_19200227_065_098_0_001/full/full/0/default.jpg"
],
"id": "/9200355/BibliographicResource_3000117730632",
"type": "TEXT",
"ugc": [
false
]
}
]
}
Bigger sample here: https://www.dropbox.com/s/0s0zjtxe01mecjc/AoQhRn%2B56KDm5AJJPwEvOTIwMDUyMC9hcmtfXzEyMTQ4X2JwdDZrMTAyNzY2Nw%3D%3D.json?dl=0
I can extract both the ids and the URLs which contain the string "gallica" using the following command:
jq '[ .items[] | select(.edmIsShownBy[] | contains ("gallica")) | {id: .id, link: .edmIsShownBy[] }]'
However, I have more than 28000 JSON files to process and it is taking a long time (around 1 file per minute). I am processing the files using bash with the command:
find . -name "*.json" -exec cat '{}' ';' | jq '[ .items[] | select(.edmIsShownBy[] | contains ("gallica")) | {id: .id, link: .edmIsShownBy[] }]'
I was wondering if the slowness is due to the filter given to jq, and if that is the case, is there a faster way to filter on a string contained in a chosen value? Any ideas?
It would probably be wise not to attempt to cat all the files at once; indeed, it would probably be best to avoid cat altogether.
For example, assuming program.jq contains whichever jq program you decide on (and there is nothing wrong with using contains here), you could try:
find . -name "*.json" -exec jq -f program.jq '{}' +
Using the non-standard + instead of ';' minimizes the number of times jq must be called, though the overhead of invoking jq is actually quite small. If your find does not support + and you wish to avoid calling jq once per file, then consider using xargs, or GNU parallel with the --xargs option.
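For instance, a rough sketch of the xargs form (assuming GNU find and xargs for -print0/-0):
# one jq invocation per batch of file names, with no cat involved
find . -name "*.json" -print0 | xargs -0 jq -f program.jq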
If you know the JSON files of interest are in the pwd, you could also speed up find by specifying -maxdepth 1.

JQ Group Multiple Files

I have a set of JSON files that all contain JSON in the following format:
File 1:
{ "host" : "127.0.0.1", "port" : "80", "data": {}}
File 2:
{ "host" : "127.0.0.2", "port" : "502", "data": {}}
File 3:
{ "host" : "127.0.0.1", "port" : "443", "data": {}}
These files can be rather large, up to several gigabytes.
I want to use JQ or some other bash json processing tool that can merge these json files into one file with a grouped format like so:
[{ "host" : "127.0.0.1", "data": {"80": {}, "443" : {}}},
{ "host" : "127.0.0.2", "data": {"502": {}}}]
Is this possible with jq and if so, how could I do this? I have looked at the group_by function in jq, but it seems like I need to combine all files first and then group on this big file. However, since the files can be very large, it might make sense to stream the data and group it on the fly.
With really big files, I'd look into a primarily disk-based approach instead of trying to load everything into memory. The following script leverages sqlite's JSON1 extension to load the JSON files into a database and generate the grouped results:
#!/usr/bin/env bash
DB=json.db
# Delete existing database if any.
rm -f "$DB"
# Create table. Assuming each host,port pair is unique.
sqlite3 -batch "$DB" <<'EOF'
CREATE TABLE data(host TEXT, port INTEGER, data TEXT,
                  PRIMARY KEY (host, port)) WITHOUT ROWID;
EOF
# Insert the objects from the files into the database.
for file in file*.json; do
    sqlite3 -batch "$DB" <<EOF
INSERT INTO data(host, port, data)
  SELECT json_extract(j, '\$.host'), json_extract(j, '\$.port'), json_extract(j, '\$.data')
  FROM (SELECT json(readfile('$file')) AS j) AS json;
EOF
done
# And display the results of joining the objects. Could use
# json_group_array() instead of this sed hackery, but we're trying to
# avoid building a giant string with the entire results. It might still
# run into sqlite maximum string length limits...
sqlite3 -batch -noheader -list "$DB" <<'EOF' | sed '1s/^/[/; $s/,$/]/'
SELECT json_object('host', host,
                   'data', json_group_object(port, json(data))) || ','
  FROM data
 GROUP BY host
 ORDER BY host;
EOF
Running this on your sample data prints out:
[{"host":"127.0.0.1","data":{"80":{},"443":{}}},
{"host":"127.0.0.2","data":{"502":{}}}]
If the goal is really to produce a single ginormous JSON entity, then presumably that entity is still small enough to have a chance of fitting into the memory of some computer, say C. So there is a good chance of jq being up to the job on C. At any rate, to utilize memory efficiently, you would:
use inputs while performing the grouping operation;
avoid the built-in group_by (since it requires an in-memory sort).
Here then is a two-step candidate using jq, which assumes grouping.jq contains the following program:
# emit a stream of arrays assuming that f is always string-valued
def GROUPS_BY(stream; f):
  reduce stream as $x ({}; ($x|f) as $s | .[$s] += [$x]) | .[];

GROUPS_BY(inputs | .data=.port | del(.port); .host)
| {host: .[0].host, data: map({(.data): {}}) | add}
If the JSON files can be captured by *.json, you could then consider:
jq -n -f grouping.jq *.json | jq -s .
One advantage of this approach is that if it fails, you could try using a temporary file to hold the output of the first step, and then processing it later, either by "slurping" it, or perhaps more sensibly distributing it amongst several files, one per .host.
Removing extraneous data
Obviously, if the input files contain extraneous data, you might want to remove it first, e.g. by running
for f in *.json ; do
    jq '{host,port}' "$f" | sponge "$f"
done
or by performing the projection in grouping.jq, e.g. using:
GROUPS_BY(inputs | {host, data: .port}; .host)
| {host: .[0].host, data: map( {(.data):{}} ) | add}
Here's a script which uses jq to solve the problem without requiring more memory than is needed for the largest group. For simplicity:
it reads *.json and directs output to $OUT as defined at the top of the script.
it uses sponge
#!/usr/bin/env bash
# Requires: sponge
OUT=big.json
/bin/rm -i "$OUT"
if [ -s "$OUT" ] ; then
    echo "$OUT already exists"
    exit 1
fi

### Step 0: setup
TDIR=$(mktemp -d /tmp/grouping.XXXX)
function cleanup {
    if [ -d "$TDIR" ] ; then
        /bin/rm -r "$TDIR"
    fi
}
trap cleanup EXIT

### Step 1: find the groups
for f in *.json ; do
    host=$(jq -r '.host' "$f")
    echo "$f" >> "$TDIR/$host"
done

for f in "$TDIR"/* ; do
    echo "$f ..."
    jq -n 'reduce (inputs | {host, data: {(.port): {} }}) as $in (null;
               .host=$in.host | .data += $in.data)' $(cat "$f") | sponge "$f"
done

### Step 2: assembly
i=0
echo "[" > "$OUT"
find "$TDIR" -type f | while read f ; do
    i=$((i + 1))
    if [ $i -gt 1 ] ; then echo , >> "$OUT" ; fi
    cat "$f" >> "$OUT"
done
echo "]" >> "$OUT"
Discussion
Besides requiring enough memory to handle the largest group, the main deficiencies of the above implementation are:
it assumes that the .host string is suitable as a file name.
the resultant file is not strictly speaking pretty-printed.
These two issues could however be addressed quite easily with minor modifications to the script without requiring additional memory.
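For example (a sketch only, not part of the script above), the .host value could be sanitized before it is used as a file name in Step 1, and the assembled file could be pretty-printed in a final pass, assuming it is small enough to run through jq at that point:
# hypothetical sanitizer for Step 1: replace characters that are awkward in file names
host=$(jq -r '.host | gsub("[^A-Za-z0-9._-]"; "_")' "$f")

# optional final pass: pretty-print big.json in place
jq . "$OUT" | sponge "$OUT"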

Loop through JSON array shell script

I am trying to write a shell script that loops through a JSON file and does some logic based on every object's properties. The script was initially written for Windows but it does not work properly on macOS.
The initial code is as follows:
documentsJson=""
jsonStrings=$(cat "$file" | jq -c '.[]')
while IFS= read -r document; do
    # Get the properties from the document (json string)
    currentKey=$(echo "$document" | jq -r '.Key')
    encrypted=$(echo "$document" | jq -r '.IsEncrypted')
    # If not encrypted then don't do anything with it
    if [[ $encrypted != true ]]; then
        echoComment " Skipping '$currentKey' as it's not marked for encryption"
        documentsJson+="$document,"
        continue
    fi
    # some more code
done <<< $jsonStrings
When run on macOS, the whole file is processed at once, so it does not loop through the objects.
The closest I got to making it work - after trying a lot of suggestions - is as follows:
jq -r '.[]' "$file" | while read i; do
    for config in $i ; do
        currentKey=$(echo "$config" | jq -r '.Key')
        echo "$currentKey"
    done
done
The console result is parse error: Invalid numeric literal at line 1, column 6
I just cannot find a proper way of grabbing the JSON object and reading its properties.
JSON file example
[
  {
    "Key": "PdfMargins",
    "Value": {
      "Left": 0,
      "Right": 0,
      "Top": 20,
      "Bottom": 15
    }
  },
  {
    "Key": "configUrl",
    "Value": "someUrl",
    "IsEncrypted": true
  }
]
Thank you in advance!
Try putting $jsonStrings in double quotes: done <<< "$jsonStrings"
Otherwise standard shell word splitting applies to the variable expansion, and you probably want to retain the line structure of the output of jq.
You could also use this in bash:
while IFS= read -r document; do
...
done < <(jq -c '.[]' < "$file")
That would save some resources. I am not sure about making this work on macOS, though, so test this first.
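Putting it together, a sketch of the full loop with process substitution could look like this; the .IsEncrypted // false default and the plain echo are assumptions rather than part of the original script:
documentsJson=""
while IFS= read -r document; do
    # Get the properties from the document (JSON string)
    currentKey=$(jq -r '.Key' <<< "$document")
    encrypted=$(jq -r '.IsEncrypted // false' <<< "$document")
    # If not encrypted then don't do anything with it
    if [[ $encrypted != true ]]; then
        echo "Skipping '$currentKey' as it's not marked for encryption"
        documentsJson+="$document,"
        continue
    fi
    # some more code
done < <(jq -c '.[]' < "$file")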

How to get newline on every iteration in jq

I have the following file
[
  {
    "id": 1,
    "name": "Arthur",
    "age": "21"
  },
  {
    "id": 2,
    "name": "Richard",
    "age": "32"
  }
]
To display the names, I am using the following command:
$ jq '.[] | .name' test
"Arthur"
"Richard"
But when I put it in a shell script and try to assign it to a variable, the whole output is displayed on a single line, like below:
#!/bin/bash
names=$(jq '.[] | .name' test)
echo $names
$ ./script.sh
"Arthur" "Richard"
I want a line break at every iteration, similar to how it works on the command line.
There are a couple of issues in the information you have provided. The jq filter .[] | .login, .id will not produce the output you claimed on jq 1.5. For your original JSON:
{
"login":"dmaxfield",
"id":7449977
}
{
"login":"stackfield",
"id":2342323
}
It will produce four lines of output:
jq -r '.login, .id' < json
dmaxfield
7449977
stackfield
2342323
If you are interested in storing them side by side, you need to use string interpolation, as in:
jq -r '"\(.login), \(.id)"' < json
dmaxfield, 7449977
stackfield, 2342323
If you feel the output stored in a variable is not working, it is probably because of a lack of double quotes when you printed the variable in the shell.
jqOutput=$(jq -r '"\(.login), \(.id)"' < json)
printf "%s\n" "$jqOutput"
dmaxfield, 7449977
stackfield, 2342323
This way the embedded new lines in the command output are not swallowed by the shell.
For your updated JSON (a totally new one compared to the old one), all you need to do is:
jqOutput=$(jq -r '.[] | .name' < json)
printf "%s\n" "$jqOutput"
Arthur
Richard
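If you would rather have the names in a shell array than in a single string, one possible variation (assuming bash 4+ for mapfile) is:
# each line of jq's output becomes its own array element
mapfile -t names < <(jq -r '.[] | .name' test)
printf '%s\n' "${names[@]}"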
In case the .login or .id contains embedded spaces or other characters that might cause problems, a more robust approach is to ensure each JSON value is on a separate line. Consider, for example:
jq -c .login,.id input.json | while read login ; do read id; echo login="$login" and id="$id" ; done
login="dmaxfield" and id=7449977
login="stackfield" and id=2342323

Find and edit a Json file using bash

I have multiple files in the following format with different categories like:
{
  "id": 1,
  "flags": ["a", "b", "c"],
  "name": "test",
  "category": "video",
  "notes": ""
}
Now I want to append the string "d" to the flags of all files whose category is video. So my final file should look like the file below:
{
  "id": 1,
  "flags": ["a", "b", "c", "d"],
  "name": "test",
  "category": "video",
  "notes": ""
}
Using the following command I am able to find the files of interest, but now I need the editing part, which I am unable to work out, as there are hundreds of files to edit manually:
find . -name * | xargs grep "\"category\": \"video\"" | awk '{print $1}' | sed 's/://g'
You can do this
find . -type f | xargs grep -l '"category": "video"' | xargs sed -i -e '/flags/ s/]/, "d"]/'
This will find all the filenames which contain a line with "category": "video", and then add the "d" flag.
Details:
find . -type f
=> Will get all the filenames in your directory
xargs grep -l '"category": "video"'
=> Will get those filenames which contain the line "category": "video"
xargs sed -i -e '/flags/ s/]/, "d"]/'
=> Will add the letter "d" to the flags line.
"TWEET!!" ... (yellow flag thown to the ground) ... Time Out!
What you have, here, is "a JSON file." You also have, at your #!shebang command, your choice of(!) full-featured programming languages ... with intimate and thoroughly-knowledgeale support for JSON ... with which you can very-speedily write your command-file.
Even if it is "theoretically possible" to do this using "bash scripts," this is roughly equivalent to "putting a beautiful stone archway over the front-entrance to a supermarket." Therefore, "waste ye no time" in such an utterly-profitless pursuit. Write a script, using a language that "honest-to-goodness knows about(!) JSON," to decode the contents of the file, then manipulate it (as a data-structure), then re-encode it again.
Here is a more appropriate approach using PHP in shell:
FILE=foo2.json php -r '$file = $_SERVER["FILE"]; $arr = json_decode(file_get_contents($file)); if ($arr->category == "video") { $arr->flags[] = "d"; file_put_contents($file,json_encode($arr)); }'
Which will load the file, decode it into an object, add "d" to the flags property only when the category is video, then write it back to the file in JSON format.
To run this for every JSON file, you can use the find command, e.g.
find . -name "*.json" -print0 | while IFS= read -r -d '' file; do
    FILE=$file
    # run above PHP command in here
done
If the files are in the same format, this command may help (version for a single file):
ex +':/category.*video/norm kkf]i, "d"' -scwq file1.json
or:
ex +':/flags/,/category/s/"c"/"c", "d"/' -scwq file1.json
which is basically using the Ex editor (now part of Vim).
Explanation:
+ - executes Vim command (man ex)
:/pattern_or_range/cmd - find pattern, if successful execute another Vim commands (:h :/)
norm kkf]i - executes keystrokes in normal mode
kk - move cursor up twice
f] - find ]
i, "d" - insert , "d"
-s - silent mode
-cwq - executes wq (write & quit)
For multiple files, use find and -execdir, or extend the above ex command to:
ex +'bufdo!:/category.*video/norm kkf]i, "d"' -scxa *.json
Where bufdo! executes the command for every file, and -cxa saves every file. Add -V1 for extra verbose messages.
If the flags line is not 2 lines above, then you may perform a backward search instead, or use an approach similar to sps's answer, replacing ] to insert "d".
See also: How to change previous line when the pattern is found? at Vim.SE.
Using jq:
find . -type f | xargs cat | jq 'select(.category=="video") | .flags |= . + ["d"]'
Explanation:
jq 'select(.category=="video") | .flags |= . + ["d"]'
# select(.category=="video") => filters by category field
# .flags |= . + ["d"] => Updates the flags array