Extracting values using jq streaming

I am trying to extract the values from a top-level JSON object using streaming with jq. For the sake of illustration, this is what the data look like (the actual data are rather large, hence needing to use streaming):
{
"empty": null,
"name": "John Smith",
"sex": "male",
"age": 51,
"hobbies": [
"running",
"kayaking",
"camping",
"foraging"
]
}
Without streaming it's easy to get what I need:
$ jq ".name" sample.json
"John Smith"
$ jq ".age" sample.json
51
$ jq ".hobbies" sample.json
[
"running",
"kayaking",
"camping",
"foraging"
]
When I use streaming I can get the value for the "hobbies" key:
$ jq -cn --stream 'fromstream(1|truncate_stream(inputs | select(.[0][0] == "hobbies")))' <sample.json
["running","kayaking","camping","foraging"]
But using the analogous command for the "name" or "age" keys gives an empty result:
$ jq -cn --stream 'fromstream(1|truncate_stream(inputs | select(.[0][0] == "name")))' <sample.json
$ jq -cn --stream 'fromstream(1|truncate_stream(inputs | select(.[0][0] == "age")))' <sample.json
I suspect that this is because the value is a scalar. But I'm not sure that this is the reason and, even if I was, I'm not sure how to use that information.
I discovered the debug operation, which seems to shed some light on the situation.
$ jq -cn --stream 'fromstream(1|truncate_stream(inputs | select(.[0][0] == "hobbies") | debug))' <sample.json
["DEBUG:",[["hobbies",0],"running"]]
["DEBUG:",[["hobbies",1],"kayaking"]]
["DEBUG:",[["hobbies",2],"camping"]]
["DEBUG:",[["hobbies",3],"foraging"]]
["DEBUG:",[["hobbies",3]]]
["running","kayaking","camping","foraging"]
["DEBUG:",[["hobbies"]]]
$ jq -cn --stream 'fromstream(1|truncate_stream(inputs | select(.[0][0] == "name") | debug))' <sample.json
["DEBUG:",[["name"],"John Smith"]]
$ jq -cn --stream 'fromstream(1|truncate_stream(inputs | select(.[0][0] == "age") | debug))' <sample.json
["DEBUG:",[["age"],51]]
So it looks like these values are being selected, but they are just not making it through to the output.
Any suggestions would be appreciated! Thank you.

Before applying further filters, it helps to understand what 1 | truncate_stream(...) does. Given a depth as input (here 1), truncate_stream drops that many leading components from the path of each streaming event.
For example, suppose the unfiltered stream consists of the following [path, value] events:
jq -cn --stream 'inputs' json
[["empty"],null]
[["name"],"John Smith"]
[["sex"],"male"]
[["age"],51]
[["hobbies",0],"running"]
[["hobbies",1],"kayaking"]
[["hobbies",2],"camping"]
[["hobbies",3],"foraging"]
[["hobbies",3]]
[["hobbies"]]
Truncating at depth 1 removes the first component of each path. Events whose path has only that one component would be left with an empty path, so they are discarded from the output entirely:
jq -cn --stream '1|truncate_stream(inputs)' json
[[0],"running"]
[[1],"kayaking"]
[[2],"camping"]
[[3],"foraging"]
[[3]]
Your original attempt worked because the select expression passed through every event under hobbies, and once the leading hobbies key was stripped, the remaining events still described a complete array.
The same does not work for age, because its only event has a one-component path. Stripping ["age"] would leave [[],51], an event with an empty path, and truncate_stream drops such events.
jq -cn --stream 'inputs|select(.[0][0] == "age")' json
[["age"],51]
With a depth of 1, i.e. 1 | truncate_stream(...), that event is dropped altogether, so fromstream receives nothing and has no value to reconstruct.
So for a top-level scalar, skip truncate_stream and simply take the value, which is the second element of the event:
jq -cn --stream 'inputs|select(.[0][0] == "age")[1]' json
51
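For reference, here is a minimal sketch covering both the scalar and the array case. It assumes jq 1.5 or later and a compact copy of the question's data held in a shell variable (the variable names are illustrative only). The length == 2 guard skips closing events, which carry no value, and first(...) stops reading as soon as the scalar has been found.

```shell
# Compact copy of the sample data from the question (illustrative variable name)
json='{"empty":null,"name":"John Smith","sex":"male","age":51,"hobbies":["running","kayaking","camping","foraging"]}'

# Scalar under a top-level key: match the whole path, keep only [path, value]
# events, and emit the value itself
age=$(printf '%s\n' "$json" |
  jq -cn --stream 'first(inputs | select(length == 2 and .[0] == ["age"]) | .[1])')
echo "$age"

# Array (or object) under a top-level key: strip the key and rebuild the value
hobbies=$(printf '%s\n' "$json" |
  jq -cn --stream 'fromstream(1 | truncate_stream(inputs | select(.[0][0] == "hobbies")))')
echo "$hobbies"
```

Without the length guard, a closing event such as [["age"]] (emitted when age happens to be the last key of the object) would also match and produce a spurious null.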

jq how to pass json keys from a shell variable

I have a json file I am parsing with jq. This is a sample of the file
[{
"key1":{...},
"key2":{...}
}]
[{
"key1":{...},
"key2":{...}
}]
...
Each line is a list containing a JSON object (I know the file as a whole is not technically valid JSON, but jq still works on such a file).
The below jq command works:
cat file.json | jq -r '.[] | [.key1,.key2]'
The above correctly shows:
[
<value_of_key1>,<value_of_key2>
]
[
<value_of_key1>,<value_of_key2>
]
However, I want .key1,.key2 to be dynamic since these keys can change. So I want to pass a variable to jq. Something like:
KEYS=.key1,.key2
cat file.json | jq -r --arg var "$KEYS" '.[] | [$var]'
But the above is returning the keys themselves:
[
".key1,.key2"
]
[
".key1,.key2"
]
Why is this happening? What is the correct command to make this work?
This answer does not help me. I am not getting any errors as the OP in that question.
Fetching the value of a jq variable doesn't cause it to be executed as jq code.
Furthermore, jq lacks the facility to take a string, compile it as jq code, and evaluate the result. (This is commonly known as eval.)
So, short of writing a jq parser and evaluator in jq, you will need to impose limits and/or accept a different format.
For example,
keys='[ [ "key1", "childkey" ], [ "key2", "childkey2" ] ]' # JSON
jq --argjson keys "$keys" '.[] | [ getpath( $keys[] ) ]' file.json
or
keys='key1.childkey,key2.childkey2'
jq --arg keys "$keys" '
( ( $keys / "," ) | map( . / "." ) ) as $keys |
.[] | [ getpath( $keys[] ) ]
' file.json
Suppose you have:
cat file
[{
"key1":1,
"key2":2
}]
[{
"key1":1,
"key2":2
}]
You can use a jq command like so:
jq '.[] | [.key1,.key2]' file
[
1,
2
]
[
1,
2
]
You can use -f to execute a filter from a file and nothing keeps you from creating the file separately from the shell variables.
Example:
keys=".key1"
echo ".[] | [${keys}]" >jqf
jq -f jqf file
[
1
]
[
1
]
Or just build the string directly into jq:
# note double " causing string interpolation
jq ".[] | [${keys}]" file
You can use --argjson option and destructuring.
file.json
[{"key1":{"a":1},"key2":{"b":2}}]
[{"key1":{"c":1},"key2":{"d":2}}]
$ in='["key1","key2"]'; jq -c --argjson keys "$in" '$keys as [$key1,$key2] | .[] | [.[$key1,$key2]]' file.json
output:
[{"a":1},{"b":2}]
[{"c":1},{"d":2}]
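The destructuring above fixes the number of keys at two. A possible variant for any number of top-level keys is sketched below. It assumes the keys arrive as a comma-separated list of plain names (no leading dots) and that no key name itself contains a comma. The names data, keys and $names are illustrative only.

```shell
# Hypothetical input: one JSON array per line, as in the question
data='[{"key1":{"a":1},"key2":{"b":2}}]
[{"key1":{"c":1},"key2":{"d":2}}]'

# Comma-separated list of top-level key names
keys='key1,key2'

# Split the list inside jq, then index each object with every name in turn
result=$(printf '%s\n' "$data" |
  jq -c --arg keys "$keys" '($keys / ",") as $names | .[] | [ .[$names[]] ]')
echo "$result"
```

Because the key names are passed as data rather than spliced into the program text, a stray quote or pipe character in the shell variable cannot change the filter that jq runs.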
Elaborating on ikegami's answer.
To start with here's my version of the answer:
$ in='key1.a,key2.b'; jq -c --arg keys "$in" '($keys/","|map(./".")) as $paths | .[] | [getpath($paths[])]' <<<$'[{"key1":{"a":1},"key2":{"b":2}}] [{"key1":{"a":3},"key2":{"b":4}}]'
This gives output
[1,2]
[3,4]
Let's walk through it.
We have the input
[{"key1":{"a":1},"key2":{"b":2}}]
[{"key1":{"a":3},"key2":{"b":4}}]
And we want to construct the array
[["key1","a"],["key2","b"]]
then pass it to the getpath(PATHS) builtin to extract values from our input.
To start with, we are given the shell variable in, with the string value key1.a,key2.b. Inside jq it is available as $keys.
Then $keys/"," gives
["key1.a","key2.b"]
["key1.a","key2.b"]
After that $keys/","|map(./".") gives what we want.
[["key1","a"],["key2","b"]]
[["key1","a"],["key2","b"]]
Let's call this $paths.
Now if we do .[]|[getpath($paths[])] we get the values from our input equivalent to
[.[] | .key1.a, .key2.b]
which is
[1,2]
[3,4]

Reading and Looping Through A JSON File in BASH

I've got a JSON file (see below) called department_groups.json.
Essentially if I gave an argument of commercial I'd like it to return:
commercial-team#domain.com
commercial-updates#domain.com
Can anyone guide/help me with doing this?
{
"legal": {
"google_groups":[
["Legal", "legal#domain.com"],
["Legal Team", "legal-team#domain.com"],
["Compliance Checks", "compliance#domain.com"]
],
"samba_groups": ""
},
"commercial":{
"google_groups":[
["Commercial Team", "commercial-team#domain.com"],
["Commercial Updates", "commercial-updates#domain.com"]
],
"samba_groups": ""
},
"technology":{
"google_groups":[
["Technology", "technology#domain.com"],
["Incidents", "incidents#domain.com"]
],
"samba_groups": ""
}
}
This returns the second element in each array in the google_groups property of the commercial property:
jq --arg key commercial '.[$key].google_groups | .[] | .[1]' file
Use jq -r to output in "raw" format (lose the double quotes).
$ key=commercial
$ jq -r --arg key "$key" '.[$key].google_groups | .[] | .[1]' file
commercial-team#domain.com
commercial-updates#domain.com
I used --arg in these examples to show how it is used, optionally with a shell variable. If, on the other hand, commercial was just a fixed string, then you could simplify:
jq -r '.commercial.google_groups | .[] | .[1]' file
To process each line of the output, you can just use a shell while read loop:
key=commercial
while read -r email; do
echo "$email"
# process each email individually here
done < <(jq -r --arg key "$key" '.[$key].google_groups | .[] | .[1]' file)
Here I am using a process substitution <(), which acts like a file that can be read by the shell. One advantage of doing this, over using a pipe, is that the while loop runs in the current shell rather than in a subshell. Among other things, this means that variables set within the loop remain in scope after the while block, so you can use them later.
If you prefer to use a pipe, just remove the part after done and move the command up to the first line:
jq ... | while read -r email; do # etc.
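If the shell at hand has no process substitution (plain POSIX sh, for instance), a here-document fed by a command substitution is one way to keep the loop in the current shell. The sketch below assumes a cut-down copy of the data in a variable named json and uses a counter named count to show that a variable set inside the loop is still visible afterwards; both names are illustrative.

```shell
# Cut-down copy of department_groups.json (illustrative variable name)
json='{"commercial":{"google_groups":[["Commercial Team","commercial-team#domain.com"],["Commercial Updates","commercial-updates#domain.com"]],"samba_groups":""}}'

key=commercial
count=0
while read -r email; do
  # An empty result still yields one blank line in the here-document; skip it
  [ -n "$email" ] || continue
  count=$((count + 1))
  echo "$email"
done <<EOF
$(printf '%s\n' "$json" | jq -r --arg key "$key" '.[$key].google_groups | .[] | .[1]')
EOF

# The loop ran in the current shell, so count keeps its final value
echo "processed $count addresses"
```

Unlike the process substitution, the here-document collects all of jq's output before the loop starts, so it suits modest result sizes best.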
As @TomFenech noted, the requirements are somewhat unclear, but if it's the email addresses you want, the following variant of his answer may be of interest:
$ key=commercial
$ jq -r --arg key "$key" '.[$key].google_groups[][] | select(test("#"))' department_groups.json
commercial-team#domain.com
commercial-updates#domain.com

jq streaming - filter nested list and retain global structure

In a large json file, I want to remove some elements from a nested list, but keep the overall structure of the document.
My example input is this (but the real one is large enough to demand streaming).
{
"keep_untouched": {
"keep_this": [
"this",
"list"
]
},
"filter_this":
[
{"keep" : "true"},
{
"keep": "true",
"extra": "keeper"
} ,
{
"keep": "false",
"extra": "non-keeper"
}
]
}
The required output just has one element of the 'filter_this' block removed:
{
"keep_untouched": {
"keep_this": [
"this",
"list"
]
},
"filter_this":
[
{"keep" : "true"},
{
"keep": "true",
"extra": "keeper"
}
]
}
The standard way to handle such cases appears to be using 'truncate_stream' to reconstitute streamed objects, before filtering those in the usual jq way. Specifically, the command:
jq -nc --stream 'fromstream(1|truncate_stream(inputs))'
gives access to a stream of objects:
{"keep_this":["this","list"]}
[{"keep":"true"},{"keep":"true","extra":"keeper"},
{"keep":"false","extra":"non-keeper"}]
at which point it is easy to filter for the required objects. However, this strips the results from the context of their parent object, which is not what I want.
Looking at the streaming structure:
[["keep_untouched","keep_this",0],"this"]
[["keep_untouched","keep_this",1],"list"]
[["keep_untouched","keep_this",1]]
[["keep_untouched","keep_this"]]
[["filter_this",0,"keep"],"true"]
[["filter_this",0,"keep"]]
[["filter_this",1,"keep"],"true"]
[["filter_this",1,"extra"],"keeper"]
[["filter_this",1,"extra"]]
[["filter_this",2,"keep"],"false"]
[["filter_this",2,"extra"],"non-keeper"]
[["filter_this",2,"extra"]]
[["filter_this",2]]
[["filter_this"]]
it seems I need to select all the 'filter_this' rows, truncate those rows only (using 'truncate_stream'), rebuild these rows as objects (using 'fromstream'), filter them, and turn the objects back into the stream data format (using 'tostream') to join the stream of 'keep_untouched' rows, which are still in the streaming format. At that point it would be possible to rebuild the whole JSON. If that is the right approach - which seems overly convoluted to me - how do I do that? Or is there a better way?
If your input file consists of a single very large JSON entity that is too big for the regular jq parser to handle in your environment, then there is the distinct possibility that you won't have enough memory to reconstitute the JSON document.
With that caveat, the following may be worth a try. The key insight is that reconstruction can be accomplished using reduce.
The following uses a bunch of temporary files for the sake of clarity:
TMP=/tmp/$$
jq -c --stream 'select(length==2)' input.json > $TMP.streamed
jq -c 'select(.[0][0] != "filter_this")' $TMP.streamed > $TMP.1
jq -c 'select(.[0][0] == "filter_this")' $TMP.streamed |
jq -nc 'reduce inputs as [$p,$x] (null; setpath($p;$x))
| .filter_this |= map(select(.keep=="true"))
| tostream
| select(length==2)' > $TMP.2
# Reconstruction
jq -n 'reduce inputs as [$p,$x] (null; setpath($p;$x))' $TMP.1 $TMP.2
Output
{
"keep_untouched": {
"keep_this": [
"this",
"list"
]
},
"filter_this": [
{
"keep": "true"
},
{
"keep": "true",
"extra": "keeper"
}
]
}
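For small and medium inputs, the same reduce-based reconstruction can be collapsed into a single jq invocation without temporary files. This is only a sketch under that assumption: it rebuilds the entire document in memory, so the caveat above about very large inputs applies in full. The variable names input and result are illustrative.

```shell
# Compact copy of the example input (illustrative variable name)
input='{"keep_untouched":{"keep_this":["this","list"]},"filter_this":[{"keep":"true"},{"keep":"true","extra":"keeper"},{"keep":"false","extra":"non-keeper"}]}'

# Rebuild the document from its [path, value] events, then filter the list
result=$(printf '%s\n' "$input" | jq -nc --stream '
  reduce (inputs | select(length == 2)) as [$p, $x] (null; setpath($p; $x))
  | .filter_this |= map(select(.keep == "true"))')
echo "$result"
```

The only gain over the multi-file version is convenience; the memory footprint is that of the whole reconstructed document.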
Many thanks to @peak. I found his approach really useful, but unrealistic in terms of performance. Stealing some of @peak's ideas, though, I came up with the following:
Extract the 'parent' object:
jq -c --stream 'select(length==2)' input.json |
jq -c 'select(.[0][0] != "filter_this")' |
jq -n 'reduce inputs as [$p,$x] (null; setpath($p;$x))' > $TMP.parent
Extract the 'keepers' - though this means reading the file twice (:-<):
jq -nc --stream '[fromstream(2|truncate_stream(inputs))
| select(type == "object" and .keep == "true")]
' input.json > $TMP.keepers
Insert the filtered list into the parent object.
jq -nc -s 'inputs as $items
| $items[0] as $parent
| $parent
| .filter_this |= $items[1]
' $TMP.parent $TMP.keepers > result.json
Here is a simplified version of @PeteC's script. It requires one fewer invocation of jq.
In both cases, please note that the invocation of jq that uses "2|truncate_stream(_)" requires a more recent version of jq than 1.5.
TMP=/tmp/$$
INPUT=input.json
# Extract all but .filter_this
< $INPUT jq -c --stream 'select(length==2 and .[0][0] != "filter_this")' |
jq -nc 'reduce inputs as [$p,$x] (null; setpath($p;$x))
' > $TMP.parent
# Need jq > 1.5
# Extract the 'keepers'
< $INPUT jq -n -c --stream '
[fromstream(2|truncate_stream(inputs))
| select(type == "object" and .keep == "true")]
' > $TMP.keepers
# Insert the filtered list into the parent object:
jq -s '. as $in | .[0] | (.filter_this |= $in[1])
' $TMP.parent $TMP.keepers > result.json

Numeric argument passed with jq --arg not matching data with ==

Here is a sample JSON response from my curl:
{
"success": true,
"message": "jobStatus",
"jobStatus": [
{
"ID": 9,
"status": "Successful"
},
{
"ID": 2,
"status": "Successful"
},
{
"ID": 99,
"status": "Failed"
}
]
}
I want to check the status of ID=2. Here is the command I tried:
cat test.txt|jq --arg v "2" '.jobStatus[]|select(.ID == $v)|.status'
response: there is none
I tried value 2 without quotes and still no result.
By contrast, if I try the command with a literal 2, it works:
cat test.txt | jq '.jobStatus[]|select(.ID == 2)|.status'
response:
"Successful"
I'm stuck. Can anyone help me identify the problem?
jq is data-type-aware:
.ID, as defined in the JSON input, is a number,
but any command-line argument passed with --arg (such as v here) is invariably a string (whether you quote the value or not),
so, in order to compare them, you must use an explicit type conversion, such as with tonumber:
jq --arg v '2' '.jobStatus[] | select(.ID == ($v | tonumber)) | .status' test.txt
Given that you're only passing a scalar argument here, the following solution, using --argjson (jq v1.5+) is a bit of an overkill, but it is an alternative to explicit type conversion in that passing a JSON argument in effect passes typed data:
jq --argjson v '{ "ID": 2 }' '.jobStatus[] | select(.ID == $v.ID) | .status' test.txt
peak's answer demonstrates that even --argjson v 2 works (in which case comparing to $v works directly), which is certainly the most concise solution, but may require an explanation:
Even though 2 may not look like JSON, it is: it is a valid JSON text containing a single value of type number (see json.org).
Specifically, it is the fact that 2 is an unquoted token that starts with a digit that makes it a number in the context of JSON (the JSON string-value equivalent is "2", which from the shell would have to be passed as '"2"' - note the embedded double quotes).
Therefore jq interprets --argjson v 2 as a number, and the comparison .ID == $v works as intended (note that the same applies to --argjson v '2' / --argjson v "2", where the shell removes the quotes before jq sees the value).
By contrast, anything you pass with --arg is always a string value that is used as-is.
In other words: --argjson, whose purpose is to accept arbitrary JSON texts as strings (such as '{ "ID": 2 }' in the example above), can also be used to pass number-string scalars to force their interpretation as numbers.
The same technique also works with Boolean strings true and false.
Tip of the hat to peak for his help.
Assuming you want to check for the JSON value 2, you have a choice to make - either convert the argument of --arg to a number, or use --argjson with a numeric argument. These alternatives are illustrated by the following:
jq --arg v 2 '.jobStatus[] | select(.ID == ($v|tonumber)) | .status'
jq --argjson v 2 '.jobStatus[] | select(.ID == $v) | .status'
Note that --argjson requires jq 1.5 or later.
Of course, if you want to "normalize" .ID so that it's always treated as a string, you could write:
jq --arg v 2 '.jobStatus[] | select((.ID|tostring) == $v) | .status'
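To see the type difference directly, the following sketch asks jq for the type of the same command-line value passed both ways, and then runs one of the lookups against a compact copy of the sample response. It assumes jq 1.5 or later; the variable names are illustrative.

```shell
# Compact copy of the sample response (illustrative variable name)
json='{"success":true,"message":"jobStatus","jobStatus":[{"ID":9,"status":"Successful"},{"ID":2,"status":"Successful"},{"ID":99,"status":"Failed"}]}'

# The same token, 2, arrives as a string via --arg and as a number via --argjson
arg_type=$(jq -rn --arg v 2 '$v | type')
argjson_type=$(jq -rn --argjson v 2 '$v | type')
echo "--arg: $arg_type, --argjson: $argjson_type"

# With a typed argument the comparison against the numeric .ID matches
status=$(printf '%s\n' "$json" |
  jq -r --argjson v 99 '.jobStatus[] | select(.ID == $v) | .status')
echo "$status"
```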

Filter only specific keys from an external file in jq

I have a JSON file with the following format:
[
{
"id": "00001",
"attr": {
"a": "foo",
"b": "bar",
...
}
},
{
"id": "00002",
"attr": {
...
},
...
},
...
]
and a text file with a list of ids, one per line. I'd like to use jq to filter only the records whose ids are mentioned in the text file. I.e. if the list contains "00001", only the first one should be printed.
Note, that I can't simply grep since each record may have an arbitrary number of attributes and sub-attributes.
There are basically two ways to proceed:
(1) read the file of ids from STDIN
(2) read the JSON from STDIN
Both are feasible, but here we illustrate (2) as it leads to a simple but efficient solution.
Suppose the JSON file is named in.json and the list of ids is in a file named ids.txt like so:
00001
00010
Notice that this file has no quotation marks. If it does, then the following can be significantly simplified as shown in the postscript.
The trick is to convert ids.txt into a JSON array. With the above assumption about quotation marks, this can be done by:
jq -R . ids.txt | jq -s .
Assuming a reasonable shell, a simple solution is now at hand:
jq --argjson ids "$(jq -R . ids.txt | jq -s .)" '
map( select( .id as $id | $ids | index($id) ))' in.json
Faster
Assuming your jq has any/2, a simpler and more efficient solution can be obtained by defining:
def isin($a): . as $in | any($a[]; $in == .);
The required jq filter is then just:
map( select( .id | isin($ids) ) )
If these two lines of jq are put into a file named select.jq, the required incantation is simply:
jq --argjson ids "$(jq -R . ids.txt | jq -s .)" -f select.jq in.json
Postscript
If the index file consists of a stream of valid JSON texts (e.g., strings with quotation marks) and if your jq supports the --slurpfile option, the invocation can be further simplified to:
jq --slurpfile ids ids.txt -f select.jq in.json
Or if you want everything as a one-liner:
jq --slurpfile ids ids.txt 'map(select(.id as $id|any($ids[];$id==.)))' in.json
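When the id list is long, scanning it once per record can become the bottleneck. One alternative, which is a different technique from the any/2 test above, is to turn the list into a JSON object and use it as a lookup set. The sketch below assumes unquoted ids, one per line, and string-valued .id fields; the sample records and the names records, ids_json and $wanted are illustrative.

```shell
# Small stand-in for in.json (illustrative data)
records='[{"id":"00001","attr":{"a":"foo","b":"bar"}},{"id":"00002","attr":{}},{"id":"00010","attr":{}}]'

# Stand-in for ids.txt: unquoted ids, one per line, converted to a JSON array
ids_json=$(printf '%s\n' 00001 00010 | jq -R . | jq -s -c .)

# Build an object keyed by id once, then test each record with a key lookup
selected=$(printf '%s\n' "$records" | jq -c --argjson ids "$ids_json" '
  ($ids | map({(.): true}) | add) as $wanted
  | map(select($wanted[.id]))')
echo "$selected"
```

A missing key looks up as null, which select treats as false, so records whose id is not in the list are dropped.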