Discard JSON objects if they contain substrings from a list - json

I want to parse a JSON file and extract some values, while also discarding or skipping certain entries if they contain substrings from another list passed in as an argument. The purpose is to exclude objects containing miscellaneous human-readable keywords from a master list.
input.json
{
"entities": [
{
"id": 600,
"name": "foo-001"
},
{
"id": 601,
"name": "foo-002"
},
{
"id": 602,
"name": "foobar-001"
}
]
}
args.json (list of keywords)
"foobar-"
"BANANA"
The output must definitely contain the foo-* entries (but not the excluded foobar- entries), but it can also contain any other names, provided they don't contain foobar- or BANANA. The exclusions are to be based on substrings, not exact matches.
I'm looking for a more performant way of doing this, because currently I just do my normal filters:
jq '[.[].entities[] | select(.name != "")] | walk(if type == "string" then gsub ("\t";"") else . end)' > file
(the input file has some erroneous tab escapes and null fields in it that are preprocessed)
At this stage, the file has only been minimally prepared. Then I iterate through this file line by line in shell and invoke grep -vf with a long list of invalid patterns from the keywords file. This gives a "master list" that is sanitized for later parsing by other applications. This seems intuitively wrong, though.
It seems like this should be done in one fell swoop on the first pass with jq instead of brute forcing it in a loop later.
I tried various invocations of INDEX and --slurpfile, but I seem to be missing something:
jq '.entities | INDEX(.name)[inputs]' input.json args.json
The above is a simplistic way of indexing by the input args. It at least seems to demonstrate that the patterns in the file can be matched verbatim, but it doesn't account for substrings (contains).
jq '.[] | walk(if type == "object" and (.name | contains($args[]))then empty else . end)' --slurpfile args args.json input.json
This looks to be getting closer to the idea, but something is screwy here. It seems like it's regurgitating all of the input file for each iteration of the arguments in the keywords file and returning them all for N number of arguments, and not actually emptying the original input, just dumbly checking the entire file for the presence of a single keyword and then starting over.
It seems like I need to unwrap the $args[] and map it here somehow so that the input file only gets iterated through once, with each keyword being checked for each record, rather than the entire file over and over again.
I found some conflicting information about whether a slurpfile is strictly necessary and can't determine what's the optimal approach here.
Thanks.

You could use all/2 as follows:
< input.json jq --slurpfile blacklist args.json '
.entities
| map(select(.name as $n
| all( $blacklist[]; . as $b | $n | index($b) | not) ))
'
or more concisely (but perhaps less obviously correct):
.entities | map( select( all(.name; index( $blacklist[]) | not) ))
You might wish to write .entities |= map( ... ) instead if you want to retain the original structure.
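For reference, here is a minimal end-to-end sketch of the `|=` variant, run against the sample data from the question. The scratch directory and the compact `-c` output are assumptions made for the demo; only the filter itself comes from the answer above.

```shell
# Hypothetical scratch directory holding the sample files from the question.
tmp=$(mktemp -d)
cat > "$tmp/input.json" <<'EOF'
{"entities":[{"id":600,"name":"foo-001"},{"id":601,"name":"foo-002"},{"id":602,"name":"foobar-001"}]}
EOF
# One JSON string per line; --slurpfile collects them into the array $blacklist.
cat > "$tmp/args.json" <<'EOF'
"foobar-"
"BANANA"
EOF

# Keep an entity only if none of the blacklisted substrings occurs in its name.
# Using |= rewrites .entities in place, so the enclosing object is retained.
result=$(jq -c --slurpfile blacklist "$tmp/args.json" '
  .entities |= map(select(.name as $n
    | all($blacklist[]; . as $b | $n | index($b) | not)))
' "$tmp/input.json")

printf '%s\n' "$result"
# {"entities":[{"id":600,"name":"foo-001"},{"id":601,"name":"foo-002"}]}
rm -rf "$tmp"
```

The input is read once, and each name is checked against every keyword until the first hit, which is the single-pass behaviour the question asks for.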

Related

Fuzzy match string with jq

Let's say I have some JSON in a file, it's a subset of JSON data extracted from a larger JSON file - that's why I'll use stream later in my attempted solution - and it looks like this:
[
{"_id":"1","#":{},"article":false,"body":"Hello world","comments":"3","createdAt":"20201007200628","creator":{"id":"4a7ba8fd719d43598b977dd548eed6aa","bio":"","blocked":false,"followed":false,"human":false,"integration":false,"joined":"20201007200628","muted":false,"name":"mkscott","rss":false,"private":false,"username":"mkscott","verified":false,"verifiedComments":false,"badges":[],"score":"0","interactions":258,"state":1},"depth":"0","depthRaw":0,"hashtags":[],"id":"2d4126e342ed46509b55facb49b992a5","impressions":"3","links":[],"sensitive":false,"state":4,"upvotes":"0"},
{"_id":"2","#":{},"article":false,"body":"Goodbye world","comments":"3","createdAt":"20201007200628","creator":{"id":"4a7ba8fd719d43598b977dd548eed6aa","bio":"","blocked":false,"followed":false,"human":false,"integration":false,"joined":"20201007200628","muted":false,"name":"mkscott","rss":false,"private":false,"username":"mkscott","verified":false,"verifiedComments":false,"badges":[],"score":"0","interactions":258,"state":1},"depth":"0","depthRaw":0,"hashtags":[],"id":"2d4126e342ed46509b55facb49b992a5","impressions":"3","links":[],"sensitive":false,"state":4,"upvotes":"0"}
],
[
{"_id":"55","#":{},"article":false,"body":"Hello world","comments":"3","createdAt":"20201007200628","creator":{"id":"3a7ba8fd719d43598b977dd548eed6aa","bio":"","blocked":false,"followed":false,"human":false,"integration":false,"joined":"20201007200628","muted":false,"name":"mkscott","rss":false,"private":false,"username":"jkscott","verified":false,"verifiedComments":false,"badges":[],"score":"0","interactions":258,"state":1},"depth":"0","depthRaw":0,"hashtags":[],"id":"2d4126e342ed46509b55facb49b992a5","impressions":"3","links":[],"sensitive":false,"state":4,"upvotes":"0"},
{"_id":"56","#":{},"article":false,"body":"Goodbye world","comments":"3","createdAt":"20201007200628","creator":{"id":"3a7ba8fd719d43598b977dd548eed6aa","bio":"","blocked":false,"followed":false,"human":false,"integration":false,"joined":"20201007200628","muted":false,"name":"mkscott","rss":false,"private":false,"username":"jkscott","verified":false,"verifiedComments":false,"badges":[],"score":"0","interactions":258,"state":1},"depth":"0","depthRaw":0,"hashtags":[],"id":"2d4126e342ed46509b55facb49b992a5","impressions":"3","links":[],"sensitive":false,"state":4,"upvotes":"0"}
]
It describes 4 posts written by 2 different authors, with a unique _id field for each post. Each author wrote 2 posts, where one says "Hello world" and the other says "Goodbye world".
I want to match on the word "Hello" and return only the _id of the posts whose body contains "Hello". The expected result is:
1
55
The closest I could come in my attempt was:
jq -nr --stream '
fromstream(1|truncate_stream(inputs))
| select(.body %like% "Hello")
| ._id
' <input_file
Assuming the input is modified slightly to make it a stream of the arrays as shown in the Q:
jq -nr --stream '
fromstream(1|truncate_stream(inputs))
| select(.body | test("Hello"))
| ._id
'
produces the desired output.
test uses regex matching. In your case, it seems you could use simple substring matching instead.
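A rough sketch of the substring alternative, assuming the input is one valid JSON array rather than the comma-separated stream from the question. The trimmed-down sample data is hypothetical and keeps only the two fields the filter touches.

```shell
# Hypothetical, trimmed-down sample: only the fields used by the filter.
posts='[{"_id":"1","body":"Hello world"},{"_id":"2","body":"Goodbye world"},
        {"_id":"55","body":"Hello world"},{"_id":"56","body":"Goodbye world"}]'

# On strings, contains/1 is a literal substring test, so regex
# metacharacters in the search term need no escaping.
ids=$(printf '%s' "$posts" | jq -r '.[] | select(.body | contains("Hello")) | ._id')

printf '%s\n' "$ids"
# 1
# 55
```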
Handling extraneous commas
Assuming the input has commas between a stream of valid JSON exactly as shown, you could presumably use sed to remove them first.
Or, if you want an only-jq solution, use the following in conjunction with the -n, -r and --stream command-line options:
def iterate:
fromstream(1|truncate_stream(inputs?))
| select(.body | test("Hello"))
| ._id,
iterate;
iterate
(Notice the "?".)
The streaming parser (invoked with --stream) is usually not needed for the kind of task you describe, so in this response, I'm going to assume that the following (or a variant thereof) will suffice:
.[]
| select( .body | test("Hello") )._id
This of course assumes that the input is valid JSON.
Handling comma-delimited JSON
If your input is a comma-delimited stream of JSON as shown in the Q, you could use the following in conjunction with the -n command-line option:
# This is a variant of the built-in `recurse/1`:
def iterate(f): def r: f | (., r); r;
iterate( inputs? | .[] | select( .body | test("Hello") )._id )
Please note that this assumes that whatever occurs on a line after a delimiting comma can be ignored.

How to find something in a json file using Bash

I would like to search a JSON file for some key or value, and have it print where it was found.
For example, when using jq to print out my Firefox extensions.json, I get something like this (using "..." here to skip long parts):
{
"schemaVersion": 31,
"addons": [
{
"id": "wetransfer#extensions.thunderbird.net",
"syncGUID": "{e6369308-1efc-40fd-aa5f-38da7b20df9b}",
"version": "2.0.0",
...
},
{
...
}
]
}
Say I would like to search for "wetransfer#extensions.thunderbird.net", and would like an output which shows me where it was found with something like this:
{ "addons": [ {"id": "wetransfer#extensions.thunderbird.net"} ] }
Is there a way to get that with jq or with some other json tool?
I also tried to simply list the various ids in that file, hoping that jq '.id' would do it, but that just returned null, apparently because it needs the full path.
In other words, I'm looking for a command-line JSON parser that I could use in a way similar to XPath tools.
The path() function comes in handy:
$ jq -c 'path(.. | select(. == "wetransfer#extensions.thunderbird.net"))' input.json
["addons",0,"id"]
The resulting path is interpreted as "In the addons field of the initial object, the first array element's id field matches". You can use it with getpath(), setpath(), delpaths(), etc. to get or manipulate the value it describes.
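As a small sketch of how such a path can be fed back into getpath() and delpaths(): the one-addon document below is a hypothetical stand-in for the real extensions.json, and the variable names are made up for the demo.

```shell
# Hypothetical minimal stand-in for extensions.json.
doc='{"schemaVersion":31,"addons":[{"id":"wetransfer#extensions.thunderbird.net","version":"2.0.0"}]}'
needle='wetransfer#extensions.thunderbird.net'

# Report each matching path together with the value found at that path.
found=$(printf '%s' "$doc" | jq -c --arg s "$needle" '
  path(.. | select(. == $s)) as $p
  | {path: $p, value: getpath($p)}')

# Remove every location where the value occurs; delpaths takes an array of paths.
pruned=$(printf '%s' "$doc" | jq -c --arg s "$needle" '
  delpaths([path(.. | select(. == $s))])')

printf '%s\n' "$found" "$pruned"
# {"path":["addons",0,"id"],"value":"wetransfer#extensions.thunderbird.net"}
# {"schemaVersion":31,"addons":[{"version":"2.0.0"}]}
```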
Using your example with modifications to make it valid JSON:
< input.json jq -c --arg s wetransfer#extensions.thunderbird.net '
paths as $p | select(getpath($p) == $s) | null | setpath($p;$s)'
produces:
{"addons":[{"id":"wetransfer#extensions.thunderbird.net"}]}
Note
If there are N paths to the given value, the above will produce N lines. If you want only the first, you could wrap everything in first(...).
Listing all the "id" values
I also tried to simply list the various ids in that file
Assuming that "id" values of false and null are of no interest, you can print all the "id" values of interest using the jq filter:
.. | .id? // empty
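To illustrate what that filter keeps and drops, here is a short sketch on a hypothetical document. The ids are invented, and one of them is deliberately null.

```shell
# Hypothetical document: three addons, one of which has a null id.
doc='{"schemaVersion":31,"addons":[{"id":"a@example.test","version":"1"},{"id":null},{"id":"b@example.test"}]}'

# `..` visits every value; `.id?` silently skips arrays and scalars,
# and `// empty` drops ids that are null or false.
ids=$(printf '%s' "$doc" | jq -r '.. | .id? // empty')

printf '%s\n' "$ids"
# a@example.test
# b@example.test
```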

How to get max value of a date field in a large json file?

I have a large JSON file, around 500MB, which is the response of a URL call. I need to get the max value of the "date" field in the "results" array of the JSON file using a shell script (bash). I am currently using jq as below. This works well for smaller files, but for larger files it returns null.
maxDate=$(cat ${jsonfilePath} | jq '[ .results[]?.date ] | max')
Please help. Thanks! I am new to shell scripting, JSON, and jq.
sample/input json file contents:
{
"results": [
{
"Id": "123",
"date": 1588910400000,
"col": "test"
},
{
"Id": "1234",
"date": 1588910412345,
"col": "test2"
}
],
"col2": 123
}
Given the --stream option on the command line, jq won't load the whole input into memory. Instead it reads the input token by token, producing arrays in this fashion:
[["results",0,"Id"],"123"]
[["results",0,"date"],1588910400000]
...
[["results",1,"date"],1588910412345]
...
Thanks to this feature, we can pick only dates from the input and find out the maximum one without exhausting the memory (at the expense of speed). For example:
jq -n --stream 'reduce (inputs|select(.[0][-1]=="date" and length==2)[1]) as $d (null; [.,$d]|max)' file
500MB should not be so large as to require the --stream option, which generally slows things down. Here then is a fast and efficient(*) solution that does not use the streaming option, but instead uses a generic, stream-oriented "max_by" function defined as follows:
# max_by(empty;1) yields null
def max_by(s; f):
reduce s as $s (null;
if . == null then {s: $s, m: ($s|f)}
else ($s|f) as $m
| if $m > .m then {s: $s, m: $m} else . end
end)
| .s ;
With this in our toolkit, we can simply write:
max_by(.results[].date; .)
This of course assumes that there is a "results" field containing an array of JSON objects. (**) From the problem statement, it would appear that this assumption does not always hold, so you will probably want to modify whichever approach you choose accordingly (e.g. by checking whether there is a results field, whether it's array-valued, etc.)
(*) Using max_by/2 here is more efficient, both in terms of space and time, than using the built-in max_by/1.
(**) The absence of a "date" subfield should not matter as null is less than every number.
 jq '.results | max_by(.date) | .date' "$jsonfilePath"
is a more efficient way to get the maximum date value out of that JSON that might work better for you. It avoids the Useless Use Of Cat, doesn't create a temporary array of just the date values, and thus only needs one pass through the array.
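The three approaches above can be compared side by side on the sample input. This is only a sketch: the scratch file and variable names are hypothetical, and a file this small shows nothing about memory use or speed, only that the filters agree.

```shell
# Hypothetical scratch file containing the sample from the question.
tmp=$(mktemp -d)
cat > "$tmp/sample.json" <<'EOF'
{"results":[{"Id":"123","date":1588910400000,"col":"test"},
            {"Id":"1234","date":1588910412345,"col":"test2"}],
 "col2":123}
EOF

# Streaming: keep only [path, leaf] events whose last path component is "date".
streamed=$(jq -n --stream '
  reduce (inputs | select(length == 2 and .[0][-1] == "date")[1]) as $d
    (null; [., $d] | max)' "$tmp/sample.json")

# Non-streaming: the stream-oriented max_by/2, with the identity as the key.
reduced=$(jq '
  def max_by(s; f):
    reduce s as $s (null;
      if . == null then {s: $s, m: ($s|f)}
      else ($s|f) as $m | if $m > .m then {s: $s, m: $m} else . end
      end)
    | .s;
  max_by(.results[].date; .)' "$tmp/sample.json")

# Built-in max_by/1 applied to the results array itself.
direct=$(jq '.results | max_by(.date) | .date' "$tmp/sample.json")

printf '%s\n' "$streamed" "$reduced" "$direct"
# 1588910412345 (three times)
rm -rf "$tmp"
```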

How to get keys and key types of nested json using jq

I have a file data.json like below -
{
"parameter": {
"colA": "No",
"COLB": "No"
},
"workRequired": 0,
"work": 0,
"updateType": "AUTO"
}
I know how to get the key and key type of json -
jq -c 'to_entries[] | [.key, (.value|type)]' data.json
but the above command returns me -
["parameter","object"]
["workRequired","string"]
["work","null"]
["updateType","number"]
but I want the command to return like below, so that I get key type of nested json -
["parameter"."colA","string"]
["parameter"."COLB","string"]
["workRequired","string"]
["work","null"]
["updateType","number"]
Is there any way to do this using jq?
This gets very close to the requested output.
jq -c 'to_entries[]
| if .value|type == "object"
then .key as $k
| .value
| to_entries[]
| ["\($k).\(.key)", (.value|type)]
else [.key, (.value|type)]
end' data.json
Output:
["parameter.colA","string"]
["parameter.COLB","string"]
["workRequired","number"]
["work","number"]
["updateType","string"]
The main difference is in the first two lines. I don't think that ["parameter"."colA","string"] is valid JSON.
Some of the types are different as well, because the command reports the actual types of the values in data.json.
Explanation
One way of learning how this stuff works is to go step by step. So start with jq -c 'to_entries[]' to see what comes out and then add each step in turn. The manual is also quite good.
Here, we start with an object. The first command is to_entries[]. To quote the manual, when "to_entries is passed an object, then for each k: v entry in the input, the output array includes {"key": k, "value": v}." Adding [] at the end means the output will just be the objects in the array that to_entries produces. So this is the output after the first step:
{"key":"parameter","value":{"colA":"No","COLB":"No"}}
{"key":"workRequired","value":0}
{"key":"work","value":0}
{"key":"updateType","value":"AUTO"}
Now we have four objects. One contains another object. The original question concerns the object that contains an object. The problem was to combine the key for the containing object with the key for the contained object in the final output.
The conditional if .value|type == "object" identifies the object that contains an object.
When this condition is met, .key as $k saves the value of the containing object's "key" key as a variable named $k.
Then we repeat to_entries[] with the contained object. The contained object is the value of the key "value" in the containing object. The code .value | to_entries[] filters the contained object through to_entries[]. That gives us these two objects.
{"key":"colA","value":"No"}
{"key":"COLB","value":"No"}
To create the desired output, we need to construct an array that combines the key of the containing object, which was saved as $k, with the elements of those two objects. Here is how we do that. (See "String interpolation" in the manual for an explanation of how the part in quotes works.)
["\($k).\(.key)", (.value|type)]
For each object, the key we saved as the variable $k is combined with the value of the "key" key in the object. Then we output the type of the value of the "value" key in the object.
This yields the first two lines of the final output:
["parameter.colA","string"]
["parameter.COLB","string"]
Now we move to the other branch of our conditional. This deals with the three objects from the first step that do not contain objects. Here we just repeat the original code from the question since that was satisfactory.
else [.key, (.value|type)]
That yields this:
["workRequired","number"]
["work","number"]
["updateType","string"]
The command end ends the conditional.
One more thing. The flag -c at the very beginning tells jq we want compact output. Without it, the output would be logically the same, but spread over multiple lines.
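The walkthrough above handles exactly one level of nesting. As a hedged sketch of a different technique (not the approach explained above), paths/1 and getpath/1 can produce the same dotted keys for any depth. Here, treating every value that is not an object or array as a leaf is an assumption of the sketch, so empty objects and arrays are not reported.

```shell
# The sample data.json from the question, inlined for the demo.
data='{"parameter":{"colA":"No","COLB":"No"},"workRequired":0,"work":0,"updateType":"AUTO"}'

# For every leaf (non-object, non-array) value, join its path components
# with dots and report the type of the value found at that path.
types=$(printf '%s' "$data" | jq -c '
  paths(type != "object" and type != "array") as $p
  | [($p | map(tostring) | join(".")), (getpath($p) | type)]')

printf '%s\n' "$types"
# ["parameter.colA","string"]
# ["parameter.COLB","string"]
# ["workRequired","number"]
# ["work","number"]
# ["updateType","string"]
```

Array indices, if present, would show up as numeric path components such as items.0.name.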

Select entries based on multiple values in jq

I'm working with JQ and I absolutely love it so far. I'm running into an issue I've yet to find a solution to anywhere else, though, and wanted to see if the community had a way to do this.
Let's presume we have a JSON file that looks like so:
{"author": "Gary", "text": "Blah"}
{"author": "Larry", "text": "More Blah"}
{"author": "Jerry", "text": "Yet more Blah"}
{"author": "Barry", "text": "Even more Blah"}
{"author": "Teri", "text": "Text on text on text"}
{"author": "Bob", "text": "Another thing to say"}
Now, we want to select rows where the value of author is equal to either "Gary" OR "Larry", but no other case. In reality, I have several thousand names I'm checking against, so simply stating the direct or conditional (e.g. cat blah.json | jq -r 'select(.author == "Gary" or .author == "Larry")') isn't sufficient. I'm trying to do this via the inside function like so but get an error dialog:
cat blah.json | jq -r 'select(.author | inside(["Gary", "Larry"]))'
jq: error (at <stdin>:1): array (["Gary","La...) and string ("Gary") cannot have their containment checked
What would be the best method for doing something like this?
inside and contains are a bit weird. Here are some more straightforward solutions:
index/1
select( .author as $a | ["Gary", "Larry"] | index($a) )
any/2
["Gary", "Larry"] as $whitelist
| select( .author as $a | any( $whitelist[]; . == $a) )
Using a dictionary
If performance is an issue and if "author" is always a string, then a solution along the lines suggested by @JeffMercado should be considered. Here is a variant (to be used with the -n command-line option):
["Gary", "Larry"] as $whitelist
| ($whitelist | map( {(.): true} ) | add) as $dictionary
| inputs
| select($dictionary[.author])
IRC user gnomon answered this on the jq channel as follows:
jq 'select([.author] | inside(["Larry", "Garry", "Jerry"]))'
The intuition behind this approach, as stated by the user was: "Literally your idea, only wrapping .author as [.author] to coerce it into being a single-item array so inside() will work on it." This answer produces the desired result of filtering for a series of names provided in a list as the original question desired.
You can use objects as if they're sets to test for membership. Methods operating on arrays will be inefficient, especially if the array may be huge.
You can build up a set of values prior to reading your input, then use the set to filter your inputs.
$ jq -n --argjson names '["Larry","Garry","Jerry"]' '
(reduce $names[] as $name ({}; .[$name] = true)) as $set
| inputs | select($set[.author])
' blah.json
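With several thousand names, the list will probably live in a file rather than on the command line. A minimal sketch of the same set-based lookup, assuming a hypothetical names.json that holds one JSON string per line and an "author" field that is always a string:

```shell
# Hypothetical scratch files: a shortened blah.json and a names list.
tmp=$(mktemp -d)
cat > "$tmp/blah.json" <<'EOF'
{"author": "Gary", "text": "Blah"}
{"author": "Larry", "text": "More Blah"}
{"author": "Jerry", "text": "Yet more Blah"}
EOF
cat > "$tmp/names.json" <<'EOF'
"Gary"
"Larry"
EOF

# --slurpfile gathers the strings into the array $names; the object $set is
# built once, and each input object is then tested with a single key lookup.
selected=$(jq -n -c --slurpfile names "$tmp/names.json" '
  (reduce $names[] as $name ({}; .[$name] = true)) as $set
  | inputs
  | select($set[.author])' "$tmp/blah.json")

printf '%s\n' "$selected"
# {"author":"Gary","text":"Blah"}
# {"author":"Larry","text":"More Blah"}
rm -rf "$tmp"
```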