Select entries based on multiple values in jq

I'm working with jq and I absolutely love it so far. I've run into an issue I've yet to find a solution to anywhere else, though, and wanted to see if the community has a way to do this.
Let's presume we have a JSON file that looks like so:
{"author": "Gary", "text": "Blah"}
{"author": "Larry", "text": "More Blah"}
{"author": "Jerry", "text": "Yet more Blah"}
{"author": "Barry", "text": "Even more Blah"}
{"author": "Teri", "text": "Text on text on text"}
{"author": "Bob", "text": "Another thing to say"}
Now, we want to select entries where the value of author is equal to either "Gary" or "Larry", but nothing else. In reality, I have several thousand names I'm checking against, so simply stating the conditions directly (e.g. cat blah.json | jq -r 'select(.author == "Gary" or .author == "Larry")') isn't sufficient. I'm trying to do this via the inside function like so, but I get an error:
cat blah.json | jq -r 'select(.author | inside(["Gary", "Larry"]))'
jq: error (at <stdin>:1): array (["Gary","La...) and string ("Gary") cannot have their containment checked
What would be the best method for doing something like this?

inside and contains are a bit weird. Here are some more straightforward solutions:
index/1
select( .author as $a | ["Gary", "Larry"] | index($a) )
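For example, applied to the sample file (using the -c option here for compact one-line output):
$ jq -c 'select( .author as $a | ["Gary", "Larry"] | index($a) )' blah.json
{"author":"Gary","text":"Blah"}
{"author":"Larry","text":"More Blah"}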
any/2
["Gary", "Larry"] as $whitelist
| select( .author as $a | any( $whitelist[]; . == $a) )
Using a dictionary
If performance is an issue and "author" is always a string, then a solution along the lines suggested by @JeffMercado should be considered. Here is a variant (to be used with the -n command-line option):
["Gary", "Larry"] as $whitelist
| ($whitelist | map( {(.): true} ) | add) as $dictionary
| inputs
| select($dictionary[.author])
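Since this variant reads its records via inputs, it must be invoked with -n, e.g.:
$ jq -cn '
  ["Gary", "Larry"] as $whitelist
  | ($whitelist | map( {(.): true} ) | add) as $dictionary
  | inputs
  | select($dictionary[.author])
' blah.json
{"author":"Gary","text":"Blah"}
{"author":"Larry","text":"More Blah"}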

IRC user gnomon answered this on the jq channel as follows:
jq 'select([.author] | inside(["Larry", "Garry", "Jerry"]))'
The intuition behind this approach, as stated by the user was: "Literally your idea, only wrapping .author as [.author] to coerce it into being a single-item array so inside() will work on it." This answer produces the desired result of filtering for a series of names provided in a list as the original question desired.

You can use objects as if they're sets to test for membership. Methods operating on arrays will be inefficient, especially if the array may be huge.
You can build up a set of values prior to reading your input, then use the set to filter your inputs.
$ jq -n --argjson names '["Larry","Garry","Jerry"]' '
  (reduce $names[] as $name ({}; .[$name] = true)) as $set
  | inputs | select($set[.author])
' blah.json
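For the sample file above, this should print the Larry and Jerry records; note that "Garry" (two r's) does not match "Gary", since the dictionary lookup is exact:
{
  "author": "Larry",
  "text": "More Blah"
}
{
  "author": "Jerry",
  "text": "Yet more Blah"
}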

Related

Discard JSON objects if they contain substrings from a list

I want to parse a JSON file and extract some values, while also discarding or skipping certain entries if they contain substrings from another list passed in as an argument. The purpose is to exclude objects containing miscellaneous human-readable keywords from a master list.
input.json
{
  "entities": [
    {
      "id": 600,
      "name": "foo-001"
    },
    {
      "id": 601,
      "name": "foo-002"
    },
    {
      "id": 602,
      "name": "foobar-001"
    }
  ]
}
args.json (list of keywords)
"foobar-"
"BANANA"
The output must definitely contain the foo-* entries (but not the excluded foobar- entries), but it can also contain any other names, provided they don't contain foobar- or BANANA. The exclusions are to be based on substrings, not exact matches.
I'm looking for a more performant way of doing this, because currently I just do my normal filters:
jq '[.[].entities[] | select(.name != "")] | walk(if type == "string" then gsub ("\t";"") else . end)' > file
(the input file has some erroneous tab escapes and null fields in it that are preprocessed)
At this stage, the file has only been minimally prepared. Then I iterate through this file line by line in shell and invoke grep -vf with a long list of invalid patterns from the keywords file. This gives a "master list" that is sanitized for later parsing by other applications. This seems intuitively wrong, though.
It seems like this should be done in one fell swoop on the first pass with jq instead of brute forcing it in a loop later.
I tried various invocations of INDEX and --slurpfile, but I seem to be missing something:
jq '.entities | INDEX(.name)[inputs]' input.json args.json
The above is a simplistic way of indexing the input args that at least seems to demonstrate that the patterns in the file can be matched verbatim, but it doesn't account for substrings (contains).
jq '.[] | walk(if type == "object" and (.name | contains($args[]))then empty else . end)' --slurpfile args args.json input.json
This looks to be getting closer to the idea, but something is screwy here. It seems to regurgitate the entire input file once per argument in the keywords file, returning N copies of it for N arguments, rather than actually emptying the matching objects from the original input; it dumbly checks the whole file for the presence of a single keyword and then starts over.
It seems like I need to unwrap $args[] and map it here somehow, so that the input file is iterated through only once, with each keyword being checked against each record, rather than the entire file being processed over and over again.
I found some conflicting information about whether a slurpfile is strictly necessary and can't determine the optimal approach here.
Thanks.
You could use all/2 as follows:
< input.json jq --slurpfile blacklist args.json '
  .entities
  | map(select(.name as $n
               | all( $blacklist[]; . as $b | $n | index($b) | not) ))
'
or more concisely (but perhaps less obviously correct):
.entities | map( select( all(.name; index( $blacklist[]) | not) ))
You might wish to write .entities |= map( ... ) instead if you want to retain the original structure.
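For the sample input.json and args.json above, either version should yield:
[
  {
    "id": 600,
    "name": "foo-001"
  },
  {
    "id": 601,
    "name": "foo-002"
  }
]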

How to filter out elements with a particular property (or keep elements without that property)

I'm processing the output from dcm2json, which converts the metadata from medical imaging data in the DICOM format to JSON. The values for this metadata are mostly strings, integers, floats, and similar, but it also includes inline binary values in the form of base64-encoded strings. We don't need those binaries, and they can get pretty large, so I need to filter out all metadata elements that have an InlineBinary property. Below is a (very small, simple) sample of the JSON output from dcm2json:
{
  "00080005": {
    "vr": "CS",
    "Value": ["ISO_IR 192"]
  },
  "00291010": {
    "vr": "OB",
    "InlineBinary": "Zm9vYmFyCg=="
  }
}
I want to transform this to:
{
  "00080005": {
    "vr": "CS",
    "Value": ["ISO_IR 192"]
  }
}
I tried a few different things that didn't work but finally ended up using this:
$ dcm2json file.dcm | jq '[to_entries | .[] | select(.value.Value)] | from_entries'
I kept playing with it though because I don't like having that expression embedded in the array (i.e. [to_entries ...]). I came up with something a bit more elegant, but I'm totally stumped as to why it works the way it does:
jq 'to_entries | . - map(select(.value | has("InlineBinary") == true)) | from_entries' | less
What's confusing is the has("InlineBinary") == true bit. I first ran this comparing against false, because what I want is the elements that don't have the InlineBinary property. Why does it work seemingly opposite to what I think I'm requesting? Given that I don't really understand what's happening with the . - map(...) structure (I totally swiped it from another post where someone asked a similar question), I'm not surprised it does something I don't understand, but I'd like to understand why that is :)
The other thing I'm confused about is to_entries/from_entries/with_entries. The manual says about these:
with_entries(foo) is a shorthand for to_entries | map(foo) | from_entries
Cool! So that would be:
jq 'with_entries(map( . - map(select(.value | has("InlineBinary") == true))))'
But that doesn't work:
$ cat 1.json | jq 'with_entries(map(. - map(select(.value | has("InlineBinary") == true))))'
jq: error (at <stdin>:848): Cannot iterate over string ("00080005")
Given that this statement is supposed to be functionally equivalent, I'm not sure why this wouldn't work.
Thanks for any info you can provide!
When selecting key-value pairs, with_entries is often the tool of choice:
with_entries( select(.value | has("InlineBinary") | not) )
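This reads directly as "keep the entries whose value does not have InlineBinary". As for the puzzle above: map(select(f)) collects the entries for which f is true (that is, the entries that do have InlineBinary), and the array subtraction . - ... then removes exactly those from the full list of entries, which is why comparing against true rather than false gives the desired result. The with_entries(map(...)) attempt fails because with_entries already applies its filter to each individual {key, value} entry, so the extra map ends up trying to iterate over the entry's contents, including the string key. A quick check against the sample above:
$ jq 'with_entries( select(.value | has("InlineBinary") | not) )' 1.json
{
  "00080005": {
    "vr": "CS",
    "Value": [
      "ISO_IR 192"
    ]
  }
}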

How to find something in a json file using Bash

I would like to search a JSON file for some key or value, and have it print where it was found.
For example, when using jq to print out my Firefox extensions.json, I get something like this (using "..." here to skip long parts):
{
  "schemaVersion": 31,
  "addons": [
    {
      "id": "wetransfer#extensions.thunderbird.net",
      "syncGUID": "{e6369308-1efc-40fd-aa5f-38da7b20df9b}",
      "version": "2.0.0",
      ...
    },
    {
      ...
    }
  ]
}
Say I would like to search for "wetransfer#extensions.thunderbird.net", and would like an output which shows me where it was found with something like this:
{ "addons": [ {"id": "wetransfer#extensions.thunderbird.net"} ] }
Is there a way to get that with jq or with some other json tool?
I also tried to simply list the various ids in that file, and hoped that I would get it with jq '.id', but that just returned null, because it apparently needs the full path.
In other words, I'm looking for a command-line JSON parser which I could use in a way similar to XPath tools.
The path() function comes in handy:
$ jq -c 'path(.. | select(. == "wetransfer#extensions.thunderbird.net"))' input.json
["addons",0,"id"]
The resulting path is interpreted as "In the addons field of the initial object, the first array element's id field matches". You can use it with getpath(), setpath(), delpaths(), etc. to get or manipulate the value it describes.
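For example, to read the value back via the path just produced (assuming input.json has been made valid JSON, as below):
$ jq 'getpath(["addons",0,"id"])' input.json
"wetransfer#extensions.thunderbird.net"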
Using your example with modifications to make it valid JSON:
< input.json jq -c --arg s wetransfer#extensions.thunderbird.net '
paths as $p | select(getpath($p) == $s) | null | setpath($p;$s)'
produces:
{"addons":[{"id":"wetransfer#extensions.thunderbird.net"}]}
Note
If there are N paths to the given value, the above will produce N lines. If you want only the first, you could wrap everything in first(...).
Listing all the "id" values
I also tried to simply list the various ids in that file
Assuming that "id" values of false and null are of no interest, you can print all the "id" values of interest using the jq filter:
.. | .id? // empty
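For the sample file, this prints the one "id" shown (plus those of any other addons present):
$ jq '.. | .id? // empty' extensions.json
"wetransfer#extensions.thunderbird.net"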

jq: select when any value is in array

Given the input json
[
  {"title": "first line"},
  {"title": "second line"},
  {"title": "third line"}
]
How can we extract only the titles that contain keywords listed in a second "filter" array? Using a shell variable here, for instance:
filter='["second", "third"]'
The output in this case would be
[
  {"title": "second line"},
  {"title": "third line"}
]
Also, how can the filter array be used to negate the selection instead?
E.g., return only the "first line" entry in the previous example.
There is a similar reply but using an old version of jq.
I hope that there's a more intuitive/readable way to do this with the current version of jq.
You can use a combination of jq and shell tricks, using a shell array to produce the filter. First, define the array using the shell's array notation as below (note that bash arrays are not defined with , as a separator). Then join its elements with | to produce a regex alternation:
filter=("first" "second")
echo "$(IFS="|"; echo "${filter[*]}"
first|second
You haven't mentioned whether the keyword should match only at the start or end of .title, or anywhere within it. The regex below matches the keyword anywhere in the string.
Now we use this filter in jq to match against the .title string, as below. Notice the use of not to negate the result; to select the actual matches instead, remove the |not part.
jq --arg re "$(IFS="|"; echo "${filter[*]}")" '[.[] | select(.title|test($re)|not)]' < json
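With filter=("first" "second") and the sample input above, this should print:
[
  {
    "title": "third line"
  }
]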
One way to solve a problem that involves the word "any" is often to use jq's any, e.g. using your shell variable:
jq --argjson filter "$filter" '
  map((.title | split(" ")) as $title
      | select(any( $title[] as $t
                    | $filter[] as $kw
                    | $kw == $t )))' input.json
Negation
As in formal logic, you can use all or any (in conjunction with negation) to solve the negated problem. But don't forget that jq's not is a zero-arity filter: it is written as f | not, not as not(f).
jq --argjson filter "$filter" '
  map((.title | split(" ")) as $title
      | select(all( $title[] as $t
                    | $filter[] as $kw
                    | $kw != $t )))' input.json
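For the sample input, this leaves only:
[
  {
    "title": "first line"
  }
]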
Other approaches
The above uses "keyword matching" as that is what the question specifies, but of course the above jq expressions can easily be modified to use regexes or some other type of matching.
If the list of keywords is very long, then a better algorithm for array-intersection would no doubt be desirable.
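For instance, following the dictionary technique shown in the first question's answers, one could build a lookup object from the keyword list once and test each word of the title against it, rather than re-scanning the keyword array for every word. A sketch, assuming exact matching on whitespace-separated words:
jq --argjson filter "$filter" '
  (reduce $filter[] as $kw ({}; .[$kw] = true)) as $dict
  | map(select(any(.title | split(" ")[]; $dict[.])))
' input.json
For the sample data, this selects the "second line" and "third line" entries.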

Count records with missing keys using jq

Below is a sample output that is returned when calling an API:
curl "https://mywebsite.com/api/cars.json&page=1" | jq '.'
Using jq, how would one count the number of records where the charge key is missing? I understand that the first bit of code would involve jq '. | length', but how would one filter objects in or out based on whether they contain a certain key?
If applied to the sample below, the output would be 1
{
  "current_page": 1,
  "items": [
    {
      "id": 1,
      "name": "vehicleA",
      "state": "available",
      "charge": 100
    },
    {
      "id": 2,
      "name": "vehicleB",
      "state": "available"
    },
    {
      "id": 3,
      "name": "vehicleB",
      "state": "available",
      "charge": 50
    }
  ]
}
Here is a solution using map and length:
.items | map(select(.charge == null)) | length
Try it online at jqplay.org
Here is a more efficient solution using reduce:
reduce (.items[] | select(.charge == null)) as $i (0;.+=1)
Try it online at jqplay.org
Sample Run (using the above JSON in data.json)
$ jq -M 'reduce (.items[] | select(.charge == null)) as $i (0;.+=1)' data.json
1
Note that each of the above takes a minor shortcut, assuming that the items won't have a "charge": null member. If some items could have a null charge, then the test for == null won't distinguish between those items and items without the charge key. If this is a concern, the following forms of the above filters, which use has, are better:
.items | map(select(has("charge")|not)) | length
reduce (.items[] | select(has("charge")|not)) as $i (0;.+=1)
Here is a solution that uses a simple but powerful utility function worthy perhaps of your standard library:
def sigma(stream): reduce stream as $s (null; . + $s);
The filter you'd use with this would be:
sigma(.items[] | select(has("charge") == false) | 1)
This is very efficient as no intermediate array is required, and no useless additions of 0 are involved. Also, as mentioned elsewhere, using has is more robust than making assumptions about the value of .charge.
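For example, with the definition inlined, the sample data above gives:
$ jq 'def sigma(stream): reduce stream as $s (null; . + $s);
      sigma(.items[] | select(has("charge") | not) | 1)' data.json
1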
Startup file
If you have no plans to use jq's module system, you can simply add the above definition of sigma to the file ~/.jq and invoke jq like so:
jq 'sigma(.items[] | select(has("charge") == false) | 1)'
Better yet, if you also add def count(s): sigma(s|1); to the file, the invocation would simply be:
jq 'count(.items[] | select(has("charge") | not))'
Standard Library
If, for example, ~/.jq/jq/jq.jq is your standard library, then assuming count/1 is included in that file, you could invoke jq like so:
jq 'include "jq"; count(.items[] | select(has("charge") == false))'