Using jq to count - json

Using jq-1.5 if I have a file of JSON that looks like
[{... ,"sapm_score":40.776, ...} {..., "spam_score":17.376, ...} ...]
How would I get a count of the ones where sapm_score > 40?
Thanks,
Dan
Update:
I looked at the input file and the format is actually
{... ,"sapm_score":40.776, ...}
{..., "spam_score":17.376, ...}
...
Does this change how one needs to count?

[UPDATE: If the input is not an array, see the last section below.]
count/1
I'd recommend defining a count filter (and maybe putting it in your ~/.jq), perhaps as follows:
def count(s): reduce s as $_ (0;.+1);
With this, assuming the input is an array, you'd write:
count(.[] | select(.sapm_score > 40))
or slightly more efficiently:
count(.[] | (.sapm_score > 40) // empty)
This approach (counting items in a stream) is usually preferable to using length as it avoids the costs associated with constructing an array.
count/2
Here's another definition of count that you might like to use (and perhaps add to ~/.jq as well):
def count(stream; cond): count(stream | cond // empty);
This counts the elements of the stream for which cond is neither false nor null.
Now, assuming the input consists of an array, you can simply write:
count(.[]; .sapm_score > 40)
"sapm_score" vs "spam_score"
If the point is that you want to normalize "sapm_score" to "spam_score", then (for example) you could use count/2 as defined above, like so:
count(.[]; .spam_score > 40 or .sapm_score > 40)
This assumes all the items in the array are JSON objects. If that is not the case, then you might want to try adding "?" after the key names:
count(.[]; .spam_score? > 40 or .sapm_score? > 40)
Of course all the above assumes the input is valid JSON. If that is not the case, then please see https://github.com/stedolan/jq/wiki/FAQ#processing-not-quite-valid-json
If the input is a stream of JSON objects ...
The revised question indicates the input consists of a stream of JSON objects (whereas originally the input was said to be an array of JSON objects). If the input consists of a stream of JSON objects, then the above solutions can easily be adapted, depending on the version of jq that you have. If your version of jq has inputs then (2) is recommended.
(1) All versions: use the -s command-line option.
(2) If your jq has inputs: use the -n command line option, and change .[] above to inputs, e.g.
count(inputs; .spam_score? > 40 or .sapm_score? > 40)
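For readers who want to cross-check the jq result outside jq, here is a minimal Python sketch of the same idea: count matching items in a stream of JSON objects (one per line) without first collecting them into a list. The function name count_matching and the sample data are assumptions made for illustration. Unlike jq, this sketch only counts numeric scores.

```python
import io
import json


def count_matching(lines, threshold=40):
    """Count JSON values (one per line) whose spam_score or sapm_score exceeds threshold."""
    total = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        item = json.loads(line)
        if not isinstance(item, dict):
            continue  # non-objects never match (compare the "?" in the jq filter)
        # Accept either spelling of the key, as in the jq filter above.
        scores = (item.get("spam_score"), item.get("sapm_score"))
        if any(isinstance(s, (int, float)) and s > threshold for s in scores):
            total += 1
    return total


sample = io.StringIO('{"sapm_score": 40.776}\n{"spam_score": 17.376}\n[1, 2]\n')
print(count_matching(sample))  # prints 1
```

Because the lines are consumed one at a time, memory use stays constant regardless of the input size, which is the same motivation given above for counting a stream rather than taking the length of an array.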

Filter the items that satisfy the condition then get the length.
map(select(.sapm_score > 40)) | length

Here is one way:
reduce .[] as $s(0; if $s.spam_score > 40 then .+1 else . end)
Try it online at jqplay.org
If instead of an array the input is a sequence of newline delimited objects (jsonlines)
reduce inputs as $s(0; if $s.spam_score > 40 then .+1 else . end)
will work if jq is invoked with the -n flag. Here is an example:
$ cat data.json
{ "spam_score":40.776 }
{ "spam_score":17.376 }
$ jq -Mn 'reduce inputs as $s(0; if $s.spam_score > 40 then .+1 else . end)' data.json
1
Try it online at tio.run

jq -c 'select(.sapm_score > 40)' input.json | wc -l
should do it, provided the input is a stream of objects (as in the updated question) rather than a single array.
The -c option prints each match as compact JSON on a single line, so counting the lines jq prints gives the number of matches.

Related

Discard JSON objects if they contain substrings from a list

I want to parse a JSON file and extract some values, while also discarding or skipping certain entries if they contain substrings from another list passed in as an argument. The purpose is to exclude objects containing miscellaneous human-readable keywords from a master list.
input.json
{
"entities": [
{
"id": 600,
"name": "foo-001"
},
{
"id": 601,
"name": "foo-002"
},
{
"id": 602,
"name": "foobar-001"
}
]
}
args.json (list of keywords)
"foobar-"
"BANANA"
The output must definitely contain the foo-* entries (but not the excluded foobar- entries), but it can also contain any other names, provided they don't contain foobar- or BANANA. The exclusions are to be based on substrings, not exact matches.
I'm looking for a more performant way of doing this, because currently I just do my normal filters:
jq '[.[].entities[] | select(.name != "")] | walk(if type == "string" then gsub ("\t";"") else . end)' > file
(the input file has some erroneous tab escapes and null fields in it that are preprocessed)
At this stage, the file has only been minimally prepared. Then I iterate through this file line by line in shell and invoke grep -vf with a long list of invalid patterns from the keywords file. This gives a "master list" that is sanitized for later parsing by other applications. This seems intuitively wrong, though.
It seems like this should be done in one fell swoop on the first pass with jq instead of brute forcing it in a loop later.
I tried various invocations of INDEX and --slurpfile, but I seem to be missing something:
jq '.entities | INDEX(.name)[inputs]' input.json args.json
The above is a simplistic way of indexing the input args that at least seems to demonstrate that the patterns in the file can be matched verbatim, but it doesn't account for substrings (contains).
jq '.[] | walk(if type == "object" and (.name | contains($args[]))then empty else . end)' --slurpfile args args.json input.json
This looks to be getting closer to the idea, but something is screwy here. It seems like it's regurgitating all of the input file for each iteration of the arguments in the keywords file and returning them all for N number of arguments, and not actually emptying the original input, just dumbly checking the entire file for the presence of a single keyword and then starting over.
It seems like I need to unwrap the $args[] and map it here somehow so that the input file only gets iterated through once, with each keyword being checked for each record, rather than the entire file over and over again.
I found some conflicting information about whether a slurpfile is strictly necessary and can't determine what's the optimal approach here.
Thanks.
You could use all/2 as follows:
< input.json jq --slurpfile blacklist args.json '
.entities
| map(select(.name as $n
| all( $blacklist[]; . as $b | $n | index($b) | not) ))
'
or more concisely (but perhaps less obviously correct):
.entities | map( select( all(.name; index( $blacklist[]) | not) ))
You might wish to write .entities |= map( ... ) instead if you want to retain the original structure.
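As a point of comparison, the single-pass substring filter can be sketched in Python as follows. This is a minimal sketch, assuming the input shape shown in the question; the function name drop_blacklisted is hypothetical. Each entity is visited once, and every keyword is checked against its name.

```python
import json


def drop_blacklisted(doc, blacklist):
    """Return a copy of doc keeping only entities whose name contains no blacklisted substring."""
    kept = [
        entity
        for entity in doc["entities"]
        if not any(bad in entity["name"] for bad in blacklist)
    ]
    # Keep the surrounding structure, replacing only the "entities" list.
    return {**doc, "entities": kept}


doc = json.loads("""
{"entities": [
  {"id": 600, "name": "foo-001"},
  {"id": 601, "name": "foo-002"},
  {"id": 602, "name": "foobar-001"}
]}
""")
blacklist = ["foobar-", "BANANA"]

print([e["name"] for e in drop_blacklisted(doc, blacklist)["entities"]])
# prints ['foo-001', 'foo-002']
```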

Fuzzy match string with jq

Let's say I have some JSON in a file, it's a subset of JSON data extracted from a larger JSON file - that's why I'll use stream later in my attempted solution - and it looks like this:
[
{"_id":"1","#":{},"article":false,"body":"Hello world","comments":"3","createdAt":"20201007200628","creator":{"id":"4a7ba8fd719d43598b977dd548eed6aa","bio":"","blocked":false,"followed":false,"human":false,"integration":false,"joined":"20201007200628","muted":false,"name":"mkscott","rss":false,"private":false,"username":"mkscott","verified":false,"verifiedComments":false,"badges":[],"score":"0","interactions":258,"state":1},"depth":"0","depthRaw":0,"hashtags":[],"id":"2d4126e342ed46509b55facb49b992a5","impressions":"3","links":[],"sensitive":false,"state":4,"upvotes":"0"},
{"_id":"2","#":{},"article":false,"body":"Goodbye world","comments":"3","createdAt":"20201007200628","creator":{"id":"4a7ba8fd719d43598b977dd548eed6aa","bio":"","blocked":false,"followed":false,"human":false,"integration":false,"joined":"20201007200628","muted":false,"name":"mkscott","rss":false,"private":false,"username":"mkscott","verified":false,"verifiedComments":false,"badges":[],"score":"0","interactions":258,"state":1},"depth":"0","depthRaw":0,"hashtags":[],"id":"2d4126e342ed46509b55facb49b992a5","impressions":"3","links":[],"sensitive":false,"state":4,"upvotes":"0"}
],
[
{"_id":"55","#":{},"article":false,"body":"Hello world","comments":"3","createdAt":"20201007200628","creator":{"id":"3a7ba8fd719d43598b977dd548eed6aa","bio":"","blocked":false,"followed":false,"human":false,"integration":false,"joined":"20201007200628","muted":false,"name":"mkscott","rss":false,"private":false,"username":"jkscott","verified":false,"verifiedComments":false,"badges":[],"score":"0","interactions":258,"state":1},"depth":"0","depthRaw":0,"hashtags":[],"id":"2d4126e342ed46509b55facb49b992a5","impressions":"3","links":[],"sensitive":false,"state":4,"upvotes":"0"},
{"_id":"56","#":{},"article":false,"body":"Goodbye world","comments":"3","createdAt":"20201007200628","creator":{"id":"3a7ba8fd719d43598b977dd548eed6aa","bio":"","blocked":false,"followed":false,"human":false,"integration":false,"joined":"20201007200628","muted":false,"name":"mkscott","rss":false,"private":false,"username":"jkscott","verified":false,"verifiedComments":false,"badges":[],"score":"0","interactions":258,"state":1},"depth":"0","depthRaw":0,"hashtags":[],"id":"2d4126e342ed46509b55facb49b992a5","impressions":"3","links":[],"sensitive":false,"state":4,"upvotes":"0"}
]
It describes 4 posts written by 2 different authors, with a unique _id field for each post. Each author wrote 2 posts, one saying "Hello world" and the other saying "Goodbye world".
I want to match on the word "Hello" and return the _id only for posts whose body contains "Hello". The expected result is:
1
55
The closest I could come in my attempt was:
jq -nr --stream '
fromstream(1|truncate_stream(inputs))
| select(.body %like% "Hello")
| ._id
' <input_file
Assuming the input is modified slightly to make it a stream of the arrays as shown in the Q:
jq -nr --stream '
fromstream(1|truncate_stream(inputs))
| select(.body | test("Hello"))
| ._id
'
produces the desired output.
test uses regex matching. In your case, it seems you could use simple substring matching instead.
Handling extraneous commas
Assuming the input has commas between a stream of valid JSON exactly as shown, you could presumably use sed to remove them first.
Or, if you want an only-jq solution, use the following in conjunction with the -n, -r and --stream command-line options:
def iterate:
fromstream(1|truncate_stream(inputs?))
| select(.body | test("Hello"))
| ._id,
iterate;
iterate
(Notice the "?".)
The streaming parser (invoked with --stream) is usually not needed for the kind of task you describe, so in this response, I'm going to assume that the following (or a variant thereof) will suffice:
.[]
| select( .body | test("Hello") )._id
This of course assumes that the input is valid JSON.
Handling comma-delimited JSON
If your input is a comma-delimited stream of JSON as shown in the Q, you could use the following in conjunction with the -n command-line option:
# This is a variant of the built-in `recurse/1`:
def iterate(f): def r: f | (., r); r;
iterate( inputs? | .[] | select( .body | test("Hello") )._id )
Please note that this assumes that whatever occurs on a line after a delimiting comma can be ignored.
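For comparison, a comma-delimited sequence of JSON values can also be read in Python with json.JSONDecoder.raw_decode, which parses one value at a given offset and reports where it ended. This is a minimal sketch; iter_values is a hypothetical helper, and the sample posts are trimmed to the two fields that matter here.

```python
import json


def iter_values(text):
    """Yield each top-level JSON value in text, skipping commas and whitespace between values."""
    decoder = json.JSONDecoder()
    pos = 0
    while True:
        # Skip the separators that make the input "not quite" valid JSON.
        while pos < len(text) and text[pos] in " \t\r\n,":
            pos += 1
        if pos >= len(text):
            return
        value, pos = decoder.raw_decode(text, pos)
        yield value


text = """[{"_id": "1", "body": "Hello world"}, {"_id": "2", "body": "Goodbye world"}],
[{"_id": "55", "body": "Hello world"}, {"_id": "56", "body": "Goodbye world"}]"""

ids = [
    post["_id"]
    for posts in iter_values(text)
    for post in posts
    if "Hello" in post["body"]  # plain substring match, no regex needed
]
print(ids)  # prints ['1', '55']
```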

jq 1.5 print items from array that is inside another array

The incoming JSON file contains one JSON array per row, e.g.:
["a100","a101","a102","a103","a104","a105","a106","a107","a108"]
["a100","a102","a103","a106","a107","a108"]
["a100","a99"]
["a107","a108"]
The "filter array" would be ["a99","a101","a108"], which I can load with --slurpfile.
I'm trying to figure out how to print only the values that are also in the "filter array", e.g. the output:
["a101","a108"]
["a108"]
["a99"]
["a108"]
You can port the IN function from jq 1.6 to 1.5 and use:
def IN(s): any(s == .; .);
map(select(IN($filter_array[])))
Or even shorter:
map(select(any($filter_array[]==.;.)))
I might be missing some simpler solution, but the following works :
map(select(. as $in | ["a99","a101","a108"] | contains([$in])))
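Replace the hardcoded array ["a99","a101","a108"] with your slurped variable. Note that contains performs substring matching on strings, so this filter would also select an input element such as "a10"; the any-based solutions above compare with == and do not have that problem.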
In the example, the arrays in the input stream are sorted (in jq's sort order), so it is worth noting that in such cases, a more efficient solution is possible using the bsearch built-in, or perhaps even better, the definition of intersection/2 given at https://rosettacode.org/wiki/Set#Finite_Sets_of_JSON_Entities
For ease of reference, here it is:
def intersection($A;$B):
def pop:
.[0] as $i
| .[1] as $j
| if $i == ($A|length) or $j == ($B|length) then empty
elif $A[$i] == $B[$j] then $A[$i], ([$i+1, $j+1] | pop)
elif $A[$i] < $B[$j] then [$i+1, $j] | pop
else [$i, $j+1] | pop
end;
[[0,0] | pop];
Assuming a jq invocation such as:
jq -c --argjson filter '["a99","a101","a108"]' -f intersections.jq input.json
an appropriate filter would be:
($filter | sort) as $sorted
| intersection(.; $sorted)
(Of course if $filter is already presented in jq's sort order, then the initial sort can be skipped, or replaced by a check.)
Output
["a101","a108"]
["a108"]
["a99"]
["a108"]
Unsorted arrays
In practice, jq's builtin sort filter is usually so fast that it might be worthwhile simply sorting the arrays in order to use intersection as defined above.
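The two-index walk behind intersection/2 can be sketched in Python as follows. This is a minimal sketch, assuming both lists are already sorted in ascending order; Python compares strings by code point, which agrees with jq's ordering for the strings in this example.

```python
def intersection(a, b):
    """Return the items common to two ascending-sorted lists, in a single pass over both."""
    i = j = 0
    common = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            common.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] is too small to appear in the rest of b
        else:
            j += 1  # b[j] is too small to appear in the rest of a
    return common


rows = [
    ["a100", "a101", "a102", "a103", "a104", "a105", "a106", "a107", "a108"],
    ["a100", "a102", "a103", "a106", "a107", "a108"],
    ["a100", "a99"],
    ["a107", "a108"],
]
wanted = sorted(["a99", "a101", "a108"])  # becomes ['a101', 'a108', 'a99']

for row in rows:
    print(intersection(row, wanted))
```

Each pair of lists is traversed at most once, so the cost per row is linear in the combined length rather than proportional to the product of the two lengths.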

How to delete the last character of prior line with sed

I'm trying to delete a line, together with the last character of the preceding line, using sed.
I have a json file :
{
"name":"John",
"age":"16",
"country":"Spain"
}
I would like to delete the country property from all entries. To keep the JSON valid, I also have to delete the trailing comma on the preceding line.
I'm using this pattern :
sed '/country/d' test.json
sed -n '/resolved//.$//{x;d;};1h;1!{x;p;};${x;p;}' test.json
Editor's note:
The OP later clarified the following additional requirements, which invalidated some of the existing answers:
- multiple occurrences of country properties should be removed
- across all levels of the object hierarchy
- whitespace variations should be tolerated
Using a proper JSON parser such as jq is generally the best choice (see below), but if installing a utility is not an option, try this GNU sed command:
$ sed -zr 's/,\s*"country":[^\n]+//g' test.json
{
"name":"John",
"age":"16"
}
-z splits the input into records by NULs, which, in this case means that the whole file is read at once, which enables cross-line substitutions.
-r enables extended regular expressions for a more modern syntax with more features.
s/,\s*"country":[^\n]+//g replaces all occurrences of a comma followed by a (possibly empty) run of whitespace (including possibly a newline) and then "country": through the end of that line with the empty string, i.e., effectively removes the matched strings.
Note that this assumes that no other property or closing } follows such a country property on the same line.
The following demonstrates a more robust solution based on jq.
Bertrand Martel's helpful answer contains a jq solution, which, however, does not address the requirement (added later) of removing country properties anywhere in the input object hierarchy.
In jq 1.6 and later (that is, versions newer than v1.5.2), a builtin walk/1 function is available, which enables the following simple solution:
# Walk all nodes and remove a "country" property from any object.
jq 'walk(if type == "object" then del (.country) else . end)' test.json
In v1.5.2 and below, you can define a simplified variant of walk yourself:
jq '
# Define recursive function walk_objects/1 that walks all objects in the
# hierarchy.
def walk_objects(f): . as $in |
if type == "object" then
reduce keys[] as $key
( {}; . + { ($key): ($in[$key] | walk_objects(f)) } ) | f
elif type == "array" then map( walk_objects(f) )
else . end;
# Walk all objects and remove a "country" property, if present.
walk_objects(del(.country))
' test.json
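The same recursive walk can be sketched in Python for comparison. This is a minimal sketch; the function name drop_key is hypothetical, and the nested "address" object in the sample is an assumption added to show removal below the top level.

```python
import json


def drop_key(value, key):
    """Return a copy of value with key removed from every object at any depth."""
    if isinstance(value, dict):
        return {k: drop_key(v, key) for k, v in value.items() if k != key}
    if isinstance(value, list):
        return [drop_key(v, key) for v in value]
    return value  # strings, numbers, booleans and null are returned unchanged


sample = {
    "name": "John",
    "age": "16",
    "country": "Spain",
    "address": {"city": "Madrid", "country": "Spain"},
}
cleaned = drop_key(sample, "country")
print(json.dumps(cleaned))
# prints {"name": "John", "age": "16", "address": {"city": "Madrid"}}
```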
As pointed out before, you should really consider using a JSON parser to parse JSON.
That said, you can slurp the whole file, remove the newlines and then replace accordingly:
$ sed ':a;N;$!ba;s/\n//g;s/,"country"[^}]*//' test.json
{"name":"John","age":"16"}
Breakdown:
:a; # Define label 'a'
N; # Append next line to pattern space
$!ba; # Goto 'a' unless it's the last line
s/\n//g; # Replace all newlines with nothing
s/,"country"[^}]*// # Replace ',"country...' with nothing
This might work for you (GNU sed):
sed 'N;s/,\s*\n\s*"country".*//;P;D' file
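This reads two lines into the pattern space and, if the second line holds the country property, removes everything from the trailing comma of the first line onward.
N.B. This allows for whitespace on either side of the line break.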
You can use a JSON parser like jq to parse the JSON file. The following returns the document without the country field and writes the new document to result.json:
jq 'del(.country)' file.json > result.json

Extract only one field for a specific flow from JSON

Below is the JSON I receive as the response from a URL.
{"flows":[{"version":"OF_13","cookie":"0","tableId":"0x0","packetCount":"24","byteCount":"4563","durationSeconds":"5747","priority":"0","idleTimeoutSec":"0","hardTimeoutSec":"0","flags":"0","match":{},"instructions":{"instruction_apply_actions":{"actions":"output=controller"}}},
{"version":"OF_13","cookie":"45036000240104713","tableId":"0x0","packetCount":"0","byteCount":"0","durationSeconds":"29","priority":"6","idleTimeoutSec":"0","hardTimeoutSec":"0","flags":"1","match":{"eth_type":"0x0x800","ipv4_src":"10.0.0.10","ipv4_dst":"10.0.0.12"},"instructions":{"none":"drop"}},
{"version":"OF_13","cookie":"45036000240104714","tableId":"0x0","packetCount":"0","byteCount":"0","durationSeconds":"3","priority":"7","idleTimeoutSec":"0","hardTimeoutSec":"0","flags":"1","match":{"eth_type":"0x0x800","ipv4_src":"10.0.0.10","ipv4_dst":"127.0.0.1"},"instructions":{"none":"drop"}},
{"version":"OF_13","cookie":"0","tableId":"0x1","packetCount":"0","byteCount":"0","durationSeconds":"5747","priority":"0","idleTimeoutSec":"0","hardTimeoutSec":"0","flags":"0","match":{},"instructions":{"instruction_apply_actions":{"actions":"output=controller"}}}]}
So, I have for example four flows, and I want to extract only the field "byteCount" for a specific flow identified by the ipv4_src and ipv4_dst that I give as input.
How can I do this?
function printByteCount(jsonString, myValue, otherValue) {
  const data = JSON.parse(jsonString);
  for (const element of data.flows) {
    const match = element.match;
    // "match" may be empty, so check that both properties exist first.
    if ("ipv4_src" in match && "ipv4_dst" in match &&
        match.ipv4_src === myValue && match.ipv4_dst === otherValue)
      console.log(element.byteCount);
  }
}
The above is a JavaScript sketch that finds byteCount based on ipv4_src and ipv4_dst. Note that these two properties are within the match property, which may or may not contain them. Hence, first check for their existence and then process.
Note: When formatted properly, each element in the array looks like this:
{
"version":"OF_13",
"cookie":"45036000240104713",
"tableId":"0x0",
"packetCount":"0",
"byteCount":"0",
"durationSeconds":"29",
"priority":"6",
"idleTimeoutSec":"0",
"hardTimeoutSec":"0",
"flags":"1",
"match":{
"eth_type":"0x0x800",
"ipv4_src":"10.0.0.10",
"ipv4_dst":"10.0.0.12"
},
"instructions":{
"none":"drop"
}
}
Here's how to perform the selection and extraction task using the command-line tool jq:
First create a file, say "extract.jq", with these three lines:
.flows[]
| select(.match.ipv4_src == $src and .match.ipv4_dst == $dst)
| [$src, $dst, .byteCount]
Next, assuming the desired src and dst are 10.0.0.10 and 10.0.0.12 respectively, and that the input is in a file named input.json, run this command:
jq -c --arg src 10.0.0.10 --arg dst 10.0.0.12 -f extract.jq input.json
This would produce one line per match; in the case of your example, it would produce:
["10.0.0.10","10.0.0.12","0"]
If the JSON is coming from some command (such as curl), you can use a pipeline along the following lines:
curl ... | jq -c --arg src 10.0.0.10 --arg dst 10.0.0.12 -f extract.jq
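If the selection has to happen inside a program rather than on the command line, the same filter can be sketched in Python. This is a minimal sketch; byte_counts is a hypothetical helper, and the sample document is trimmed to the fields used by the selection.

```python
import json


def byte_counts(doc, src, dst):
    """Return the byteCount of every flow whose match has the given source and destination."""
    results = []
    for flow in doc["flows"]:
        match = flow.get("match", {})  # "match" may be empty or absent
        if match.get("ipv4_src") == src and match.get("ipv4_dst") == dst:
            results.append(flow["byteCount"])
    return results


doc = json.loads("""
{"flows": [
  {"byteCount": "4563", "match": {}},
  {"byteCount": "0", "match": {"ipv4_src": "10.0.0.10", "ipv4_dst": "10.0.0.12"}},
  {"byteCount": "0", "match": {"ipv4_src": "10.0.0.10", "ipv4_dst": "127.0.0.1"}}
]}
""")

print(byte_counts(doc, "10.0.0.10", "10.0.0.12"))  # prints ['0']
```

Using dict.get for the lookups plays the same role as jq's behaviour of yielding null for a missing key: flows with an empty match simply fail the comparison instead of raising an error.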