jq: select when any value is in array - json

Given the input json
[
{"title": "first line"},
{"title": "second line"},
{"title": "third line"}
]
How can we extract only the titles that contain keywords that are listed in a second "filter" array. Using a shell variable here for instance:
filter='["second", "third"]'
The output in this case would be
[
{"title": "second line"},
{"title": "third line"}
]
Also, how to use the array filter to negate instead.
Eg: return only the "first line" entry in the previous example.
There is a similar reply but using an old version of jq.
I hope that there's a more intuitive/readable way to do this with the current version of jq.

You can use a combination of jq and shell tricks using arrays to produce the filter. Firstly to produce the shell array, use an array notation from the shell as below. Note that the below notation of bash arrays will not take , as a separator in its definition. Now we need to produce a regex filter to match the string, so we produce an alternation operator
filter=("first" "second")
echo "$(IFS="|"; echo "${filter[*]}"
first|second
You haven't mentioned if the string only matches in the first or last or could be anywhere in the .title section. The below regex matches for the string anywhere in the string.
Now we want to use this filter in the jq to match against the .title string as below. Notice the use of not to negate the result. To provide the actual match, remove the part |not.
jq --arg re "$(IFS="|"; echo "${filter[*]}")" '[.[] | select(.title|test($re)|not)]' < json

One way to solve a problem that involves the word "any" is often to use jq's any, e.g. using your shell variable:
jq --argjson filter "$filter" '
map((.title | split(" ")) as $title
| select(any( $title[] as $t
| $filter[] as $kw
| $kw == $t )))' input.json
Negation
As in formal logic, you can use all or any (in conjunction with negation) to solve the negated problem. But don't forget that if you use not,
jq's not is a zero-arity filter.
jq --argjson filter "$filter" '
map((.title | split(" ")) as $title
| select(all( $title[] as $t
| $filter[] as $kw
| $kw != $t )))' input.json
Other approaches
The above uses "keyword matching" as that is what the question specifies, but of course the above jq expressions can easily be modified to use regexes or some other type of matching.
If the list of keywords is very long, then a better algorithm for array-intersection would no doubt be desirable.

Related

Discard JSON objects if they contain substrings from a list

I want to parse a JSON file and extract some values, while also discarding or skipping certain entries if they contain substrings from another list passed in as an argument. The purpose is to exclude objects containing miscellaneous human-readable keywords from a master list.
input.json
{
"entities": [
{
"id": 600,
"name": "foo-001"
},
{
"id": 601,
"name": "foo-002"
},
{
"id": 602,
"name": "foobar-001"
}
]
}
args.json (list of keywords)
"foobar-"
"BANANA"
The output must definitely contain the foo-* entries (but not the excluded foobar- entries), but it can also contain any other names, provided they don't contain foobar- or BANANA. The exclusions are to be based on substrings, not exact matches.
I'm looking for a more performant way of doing this, because currently I just do my normal filters:
jq '[.[].entities[] | select(.name != "")] | walk(if type == "string" then gsub ("\t";"") else . end)' > file
(the input file has some erroneous tab escapes and null fields in it that are preprocessed)
At this stage, the file has only been minimally prepared. Then I iterate through this file line by line in shell and invoke grep -vf with a long list of invalid patterns from the keywords file. This gives a "master list" that is sanitized for later parsing by other applications. This seems intuitively wrong, though.
It seems like this should be done in one fell swoop on the first pass with jq instead of brute forcing it in a loop later.
I tried various invocations of INDEX and --slurpfile, but I seem to be missing something:
jq '.entities | INDEX(.name)[inputs]' input.json args.json
The above is a simplistic way of indexing the input args that at least seems to demonstrate that the patterns in the file can be matched verbatim, but doesn't account for substrings (contains ).
jq '.[] | walk(if type == "object" and (.name | contains($args[]))then empty else . end)' --slurpfile args args.json input.json
This looks to be getting closer to the idea, but something is screwy here. It seems like it's regurgitating all of the input file for each iteration of the arguments in the keywords file and returning them all for N number of arguments, and not actually emptying the original input, just dumbly checking the entire file for the presence of a single keyword and then starting over.
It seems like I need to unwrap the $args[] and map it here somehow so that the input file only gets iterated through once, with each keyword being checked for each record, rather than the entire file over and over again.
I found some conflicting information about whether a slurpfile is strictly necessary and can't determine what's the optimal approach here.
Thanks.
You could use all/2 as follows:
< input.json jq --slurpfile blacklist args.json '
.entities
| map(select(.name as $n
| all( $blacklist[]; . as $b | $n | index($b) | not) ))
'
or more concisely (but perhaps less obviously correct):
.entities | map( select( all(.name; index( $blacklist[]) | not) ))
You might wish to write .entities |= map( ... ) instead if you want to retain the original structure.

create json from bash variable and associative array [duplicate]

This question already has answers here:
Constructing a JSON object from a bash associative array
(5 answers)
Closed 5 months ago.
Lets say I have the following declared in bash:
mcD="had_a_farm"
eei="eeieeio"
declare -A animals=( ["duck"]="quack_quack" ["cow"]="moo_moo" ["pig"]="oink_oink" )
and I want the following json:
{
"oldMcD": "had a farm",
"eei": "eeieeio",
"onThisFarm":[
{
"duck": "quack_quack",
"cow": "moo_moo",
"pig": "oink_oink"
}
]
}
Now I know I could do this with an echo, printf, or assign text to a variable, but lets assume animals is actually very large and it would be onerous to do so. I could also loop through my variables and associative array and create a variable as I'm doing so. I could write either of these solutions, but both seem like the "wrong way". Not to mention its obnoxious to deal with the last item in animals, after which I do not want a ",".
I'm thinking the right solution uses jq, but I'm having a hard time finding much documentation and examples on how to use this tool to write jsons (especially those that are nested) rather than parse them.
Here is what I came up with:
jq -n --arg mcD "$mcD" --arg eei "$eei" --arg duck "${animals['duck']}" --arg cow "${animals['cow']}" --arg pig "${animals['pig']}" '{onThisFarm:[ { pig: $pig, cow: $cow, duck: $duck } ], eei: $eei, oldMcD: $mcD }'
Produces the desired result. In reality, I don't really care about the order of the keys in the json, but it's still annoying that the input for jq has to go backwards to get it in the desired order. Regardless, this solution is clunky and was not any easier to write than simply declaring a string variable that looks like a json (and would be impossible with larger associative arrays). How can I build a json like this in an efficient, logical manner?
Thanks!
Assuming that none of the keys or values in the "animals" array contains newline characters:
for i in "${!animals[#]}"
do
printf "%s\n%s\n" "${i}" "${animals[$i]}"
done | jq -nR --arg oldMcD "$mcD" --arg eei "$eei" '
def to_o:
. as $in
| reduce range(0;length;2) as $i ({};
.[$in[$i]]= $in[$i+1]);
{$oldMcD,
$eei,
onthisfarm: [inputs] | to_o}
'
Notice the trick whereby {$x} in effect expands to {(x): $x}
Using "\u0000" as the separator
If any of the keys or values contains a newline character, you could tweak the above so that "\u0000" is used as the separator:
for i in "${!animals[#]}"
do
printf "%s\0%s\0" "${i}" "${animals[$i]}"
done | jq -sR --arg oldMcD "$mcD" --arg eei "$eei" '
def to_o:
. as $in
| reduce range(0;length;2) as $i ({};
.[$in[$i]]= $in[$i+1]);
{$oldMcD,
$eei,
onthisfarm: split("\u0000") | to_o }
'
Note: The above assumes jq version 1.5 or later.
You can reduce associative array with for loop and pipe it to jq:
for i in "${!animals[#]}"; do
echo "$i"
echo "${animals[$i]}"
done |
jq -n -R --arg mcD "$mcD" --arg eei "$eei" 'reduce inputs as $i ({onThisFarm: [], mcD: $mcD, eei: $eei}; .onThisFarm[0] += {($i): (input | tonumber ? // .)})'

Get field from JSON object using jq and command line argument

Assume the following JSON file
{
"foo": "hello",
"bar": "world"
}
I want to get the foo field from the JSON object in a standalone object, and I do this:
<file jq '{foo}'
{
"foo": "hello"
}
Now the field I actually want is coming from the shell and is given to jq as an argument like this:
<file jq --arg myarg "foo" '{$myarg}'
{
"myarg": "foo"
}
Unfortunately this doesn't give the expected result {"foo":"hello"}.
Any idea why the name of the variable gets into the object?
A workaround to this is to explicitly defined the object:
<file jq '{($myarg):.[$myarg]}'
Fine, but is there a way to use the shortcut syntax as explained in the man page, but with a variable ?
You can use this to select particular fields of an object: if the input is an object with “user”, “title”, “id”, and “content” fields and you just want “user” and “title”, you can write
{user: .user, title: .title}
Because that is so common, there’s a shortcut syntax for it: {user, title}.
If that matters, I'm using jq version 1.5
In short, no. The shortcut syntax can only be used under very special conditions. For example, it cannot be used with key names that are jq keywords.
Alternatives
The method described in the Q is the preferred one, but for the record, here are two alternatives:
jq --arg myarg "foo" '
.[$myarg] as $v | {} | .[$myarg] = $v'
And of course there's the alternative that comes with numerous caveats:
myarg=foo ; jq "{ $myarg }"

Select entries based on multiple values in jq

I'm working with JQ and I absolutely love it so far. I'm running into an issue I've yet to find a solution to anywhere else, though, and wanted to see if the community had a way to do this.
Let's presume we have a JSON file that looks like so:
{"author": "Gary", "text": "Blah"}
{"author": "Larry", "text": "More Blah"}
{"author": "Jerry", "text": "Yet more Blah"}
{"author": "Barry", "text": "Even more Blah"}
{"author": "Teri", "text": "Text on text on text"}
{"author": "Bob", "text": "Another thing to say"}
Now, we want to select rows where the value of author is equal to either "Gary" OR "Larry", but no other case. In reality, I have several thousand names I'm checking against, so simply stating the direct or conditional (e.g. cat blah.json | jq -r 'select(.author == "Gary" or .author == "Larry")') isn't sufficient. I'm trying to do this via the inside function like so but get an error dialog:
cat blah.json | jq -r 'select(.author | inside(["Gary", "Larry"]))'
jq: error (at <stdin>:1): array (["Gary","La...) and string ("Gary") cannot have their containment checked
What would be the best method for doing something like this?
inside and contains are a bit weird. Here are some more straightforward solutions:
index/1
select( .author as $a | ["Gary", "Larry"] | index($a) )
any/2
["Gary", "Larry"] as $whitelist
| select( .author as $a | any( $whitelist[]; . == $a) )
Using a dictionary
If performance is an issue and if "author" is always a string, then a solution along the lines suggested by #JeffMercado should be considered. Here is a variant (to be used with the -n command-line option):
["Gary", "Larry"] as $whitelist
| ($whitelist | map( {(.): true} ) | add) as $dictionary
| inputs
| select($dictionary[.author])
IRC user gnomon answered this on the jq channel as follows:
jq 'select([.author] | inside(["Larry", "Garry", "Jerry"]))'
The intuition behind this approach, as stated by the user was: "Literally your idea, only wrapping .author as [.author] to coerce it into being a single-item array so inside() will work on it." This answer produces the desired result of filtering for a series of names provided in a list as the original question desired.
You can use objects as if they're sets to test for membership. Methods operating on arrays will be inefficient, especially if the array may be huge.
You can build up a set of values prior to reading your input, then use the set to filter your inputs.
$ jq -n --argjson names '["Larry","Garry","Jerry"]' '
(reduce $names[] as $name ({}; .[$name] = true)) as $set
| inputs | select($set[.author])
' blah.json

Numeric argument passed with jq --arg not matching data with ==

Here is a sample JSON response from my curl:
{
"success": true,
"message": "jobStatus",
"jobStatus": [
{
"ID": 9,
"status": "Successful"
},
{
"ID": 2,
"status": "Successful"
},
{
"ID": 99,
"status": "Failed"
}
]
}
I want to check the status of ID=2. Here is the command I tried:
cat test.txt|jq --arg v "2" '.jobStatus[]|select(.ID == $v)|.status'
response: there is none
I tried value 2 without quotes and still no result.
By contrast, if I try the command with a literal 2, it works:
cat test.txt | jq '.jobStatus[]|select(.ID == 2)|.status'
response:
"Successful"
I'm stuck. Can anyone help me identify the problem?
jq is data-type-aware:
.ID, as defined in the JSON input, is a number,
but any command-line argument passed with --arg (such as v here) is invariably a string (whether you quote the value or not),
so, in order to compare them, you must use an explicit type conversion, such as with tonumber/1:
jq --arg v '2' '.jobStatus[] | select(.ID == ($v | tonumber)) | .status' test.txt
Given that you're only passing a scalar argument here, the following solution, using --argjson (jq v1.5+) is a bit of an overkill, but it is an alternative to explicit type conversion in that passing a JSON argument in effect passes typed data:
jq --argjson v '{ "ID": 2 }' '.jobStatus[] | select(.ID == $v.ID) | .status' test.txt
peak's answer demonstrates that even --argjson v 2 works (in which case comparing to $v works directly), which is certainly the most concise solution, but may require an explanation:
Even though 2 may not look like JSON, it is: it is a valid JSON text containing a single value of type number (see json.org).
Specifically, it is the fact that 2 is an unquoted token that starts with a digit that makes it a number in the context of JSON (the JSON string-value equivalent is "2", which from the shell would have to be passed as '"2"' - note the embedded double quotes).
Therefore jq interprets --argjson -v 2 as a number, and comparison .ID == $v works as intended (note that the same applies to --argjson -v '2' / --argjson -v "2", where the shell removes the quotes before jq sees the value).
By contrast, anything you pass with --arg is always a string value that is used as-is.
In other words: --argjson, whose purpose is to accept arbitrary JSON texts as strings (such as '{ "ID": 2 }' in the example above), can also be used to pass number-string scalars to force their interpretation as numbers.
The same technique also works with Boolean strings true and false.
Tip of the hat to peak for his help.
Assuming you want to check for the JSON value 2, you have a choice to make - either convert the argument of --arg to a number, or use --argjson with a numeric argument. These alternatives are illustrated by the following:
jq --arg v 2 '.jobStatus[] | select(.ID == ($v|tonumber) | .status'
jq --argjson v 2 '.jobStatus[] | select(.ID == $v) | .status'
Note that --argjson requires a relatively recent version of jq.
Of course, if you want to "normalize" .ID so that it's always treated as a string, you could write:
jq --arg v 2 '.jobStatus[] | select((.ID|tostring) == $v) | .status'