How to conditionally do a recursive merge?

I'd like to conditionally do a recursive merge. That is, if a key exists in the second object, I'd like to use it to override values in the first. For example, this does what I want:
$ echo '{"a":"value"}{"bar": {"a":"override"}}' | jq -sS '.[0] * if (.[1].foo|length) > 0 then .[1].foo else {} end'
{
  "a": "value"
}
$ echo '{"a":"value"}{"foo": {"a":"override"}}' | jq -sS '.[0] * if (.[1].foo|length) > 0 then .[1].foo else {} end'
{
  "a": "override"
}
In the first example, the second object does not contain a "foo" key, so the override does not happen. In the second example, the second object does contain "foo", so the value is changed. (In my actual use, I always have 3 objects on the input and sometimes have a 4th which may override some of the previous values.)
Although the above works, it seems absurdly ugly. Is there a cleaner way to do this? I imagine something like jq -sS '.[0] * (.[1].foo ? .[1].foo : {})' or similar.

With the -n flag specified on the command line, this should do the trick:
reduce inputs as $in (input; . * ($in.foo // {}))
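For example, running it against the two inputs from the question (a quick sketch; -S just sorts the keys):
$ echo '{"a":"value"} {"foo":{"a":"override"}}' | jq -nS 'reduce inputs as $in (input; . * ($in.foo // {}))'
{
  "a": "override"
}
$ echo '{"a":"value"} {"bar":{"a":"override"}}' | jq -nS 'reduce inputs as $in (input; . * ($in.foo // {}))'
{
  "a": "value"
}
With -n, input reads the first object as the starting value and inputs supplies all the remaining objects, so this also handles your three- or four-object case without change.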

Related

filter keys in JSON using jq

I have a complex nested JSON:
{
  ...
  "key1": {
    "key2": [
      { ...
        "base_score": 4.5
      }
    ],
    "key3": {
      "key4": [
        { ...
          "base_score": 0.5
          ...
        }
      ]
    }
    ...
  }
}
There may be multiple "base_score" keys in the JSON (their paths are unknown), and each corresponding value will be a number. I have to check whether at least one such value is greater than some known value, 7.0, and if there is, I have to do "exit 1". I have to write this query in a shell script.
Assuming the input is valid JSON in a file named input.json, then based on my understanding of the requirements, you could go with:
jq --argjson limit 7.0 '
  any(.. | select(type=="object" and (.base_score|type=="number")) | .base_score; . > $limit)
  | halt_error(if . then 1 else 0 end)
' input.json
You can modify the argument to halt_error to set the exit code as you wish.
Note that halt_error redirects its input to stderr, so you might want to append 2> /dev/null (or the equivalent expression appropriate for your shell) to the above invocation.
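Putting that together in a script might look something like this (a sketch; the file name and limit are taken from the question):
#!/bin/bash
# Sketch: exit 1 if any base_score in input.json exceeds the limit.
if ! jq --argjson limit 7.0 '
  any(.. | select(type=="object" and (.base_score|type=="number")) | .base_score; . > $limit)
  | halt_error(if . then 1 else 0 end)
' input.json 2>/dev/null
then
  exit 1
fi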
You can easily get a stream of base_score values at any level and use that with any:
any(..|.base_score?; . > 7)
The stream will contain null values for objects without the property, but null is not greater than any number, so that shouldn't be a stopper.
You could then compare the output or specify -e/--exit-status to be used with a condition directly:
jq -e 'any(..|.base_score?; . > 7)' complexnestedfile.json >/dev/null && exit 1

insert bash variable into json using jq

I am trying to fill JSON template with increasing counter to generate huge sample data set:
#!/bin/bash
for (( counter=1; counter<2; counter++ ))
do
  NUMBER=${counter}
  JSON=$(cat template.json | jq --arg NUMBER "$NUMBER" '.')
  echo $JSON
  #aws dynamodb batch-write-item --request-items "${JSON}"
done
My template.json looks like:
{
"My_Table":[{
"PutRequest":{"Item":{"type":{"S":"test-$NUMBER-Type_A"},"name":{ ...
"PutRequest":{"Item":{"type":{"S":"test-$NUMBER-Type_A"},"name":{ ...
"PutRequest":{"Item":{"type":{"S":"test-$NUMBER-Type_A"},"name":{ ...
"PutRequest":{"Item":{"type":{"S":"test-$NUMBER-Type_B"},"name":{ ...
"PutRequest":{"Item":{"type":{"S":"test-$NUMBER-Type_B"},"name":{ ...
"PutRequest":{"Item":{"type":{"S":"test-$NUMBER-Type_B"},"name":{ ...
"PutRequest":{"Item":{"type":{"S":"test-$NUMBER-Type_C"},"name":{ ...
"PutRequest":{"Item":{"type":{"S":"test-$NUMBER-Type_C"},"name":{ ...
"PutRequest":{"Item":{"type":{"S":"test-$NUMBER-Type_D"},"name":{ ...
"PutRequest":{"Item":{"type":{"S":"test-$NUMBER-Type_E"},"name":{ ...
}]
}
Can I get any clue on how to insert a bash variable into the JSON template? I guess I am not using jq correctly here:
JSON=$(cat template.json | jq --arg NUMBER "$NUMBER" '.')
EDIT
My desired output:
{
"My_Table":[{
"PutRequest":{"Item":{"type":{"S":"test-1-Type_A"},"name":{ ...
"PutRequest":{"Item":{"type":{"S":"test-1-Type_A"},"name":{ ...
"PutRequest":{"Item":{"type":{"S":"test-1-Type_A"},"name":{ ...
"PutRequest":{"Item":{"type":{"S":"test-1-Type_B"},"name":{ ...
"PutRequest":{"Item":{"type":{"S":"test-1-Type_B"},"name":{ ...
"PutRequest":{"Item":{"type":{"S":"test-1-Type_B"},"name":{ ...
"PutRequest":{"Item":{"type":{"S":"test-1-Type_C"},"name":{ ...
"PutRequest":{"Item":{"type":{"S":"test-1-Type_C"},"name":{ ...
"PutRequest":{"Item":{"type":{"S":"test-1-Type_D"},"name":{ ...
"PutRequest":{"Item":{"type":{"S":"test-1-Type_E"},"name":{ ...
}]
}
If the template is intended to be used with jq, it should be changed to work correctly, rather than trying to force jq to work with substandard input.
{
"My_Table":[{
"PutRequest":{"Item":{"type":{"S":"test-\($NUMBER)-Type_A"},"name":{ ...
"PutRequest":{"Item":{"type":{"S":"test-\($NUMBER)-Type_A"},"name":{ ...
"PutRequest":{"Item":{"type":{"S":"test-\($NUMBER)-Type_A"},"name":{ ...
"PutRequest":{"Item":{"type":{"S":"test-\($NUMBER)-Type_B"},"name":{ ...
"PutRequest":{"Item":{"type":{"S":"test-\($NUMBER)-Type_B"},"name":{ ...
"PutRequest":{"Item":{"type":{"S":"test-\($NUMBER)-Type_B"},"name":{ ...
"PutRequest":{"Item":{"type":{"S":"test-\($NUMBER)-Type_C"},"name":{ ...
"PutRequest":{"Item":{"type":{"S":"test-\($NUMBER)-Type_C"},"name":{ ...
"PutRequest":{"Item":{"type":{"S":"test-\($NUMBER)-Type_D"},"name":{ ...
"PutRequest":{"Item":{"type":{"S":"test-\($NUMBER)-Type_E"},"name":{ ...
}]
}
Now something like your original attempt will work correctly. (Name your template template.jq to emphasize that it is not actually valid JSON.)
for (( counter=1; counter<2; counter++ ))
do
  JSON=$(jq -n -f template.jq --arg NUMBER "$counter")
  echo "$JSON"
  #aws dynamodb batch-write-item --request-items "${JSON}"
done
Unfortunately there are several parts of the question that don't quite make sense, but I believe the following should get you on your way.
First, I'll assume your template is valid JSON:
{
"My_Table": [
{"PutRequest":{"Item":{"type":{"S":"test-$NUMBER-Type_A"},"name":0}}},
{"PutRequest":{"Item":{"type":{"S":"test-$NUMBER-Type_A"},"name":1}}},
{"PutRequest":{"Item":{"type":{"S":"test-$NUMBER-Type_A"},"name":2}}},
{"PutRequest":{"Item":{"type":{"S":"test-$NUMBER-Type_B"},"name":0}}},
{"PutRequest":{"Item":{"type":{"S":"test-$NUMBER-Type_B"},"name":1}}},
{"PutRequest":{"Item":{"type":{"S":"test-$NUMBER-Type_B"},"name":2}}},
{"PutRequest":{"Item":{"type":{"S":"test-$NUMBER-Type_C"},"name":0}}},
{"PutRequest":{"Item":{"type":{"S":"test-$NUMBER-Type_D"},"name":1}}},
{"PutRequest":{"Item":{"type":{"S":"test-$NUMBER-Type_E"},"name":0}}}
]
}
Second, I'll assume you want the result along the lines of what is shown, not as described, but the following is written so you can easily adapt the code to the described problem. Specifically, the functionality for making the substitution is encapsulated in the following definition:
def resolve(s; value):
  .My_Table |= map(.PutRequest.Item.type.S |=
    sub("-" + s + "-"; "-" + (value|tostring) + "-"));
This is written using sub, the first argument of which must be a regex. So, to generate the desired output for a single substitution of "$NUMBER" by "1", one would write:
resolve("\\$NUMBER"; 1)
Since I'm not sure what your bash snippet is supposed to do exactly, I'll just suggest that you could use iteration within jq to achieve whatever result you require, rather than using bash iteration.
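For example, here is a sketch of doing the iteration entirely within jq, assuming the valid-JSON template above is in template.json (the counter range 1..3 is an assumption; adjust to taste):
jq 'def resolve(s; value):
      .My_Table |= map(.PutRequest.Item.type.S |=
        sub("-" + s + "-"; "-" + (value|tostring) + "-"));
    . as $t                       # remember the template
    | range(1; 4) as $n           # the counters 1, 2, 3
    | $t
    | resolve("\\$NUMBER"; $n)' template.json
This emits one fully resolved document per counter value, with no bash loop at all.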

Using jq to count

Using jq 1.5, if I have a file of JSON that looks like
[{... ,"sapm_score":40.776, ...} {..., "spam_score":17.376, ...} ...]
How would I get a count of the ones where sapm_score > 40?
Update:
I looked at the input file and the format is actually
{... ,"sapm_score":40.776, ...}
{..., "spam_score":17.376, ...}
...
Does this change how one needs to count?
[UPDATE: If the input is not an array, see the last section below.]
count/1
I'd recommend defining a count filter (and maybe putting it in your ~/.jq), perhaps as follows:
def count(s): reduce s as $_ (0;.+1);
With this, assuming the input is an array, you'd write:
count(.[] | select(.sapm_score > 40))
or slightly more efficiently:
count(.[] | (.sapm_score > 40) // empty)
This approach (counting items in a stream) is usually preferable to using length as it avoids the costs associated with constructing an array.
count/2
Here's another definition of count that you might like to use (and perhaps add to ~/.jq as well):
def count(stream; cond): count(stream | cond // empty);
This counts the elements of the stream for which cond is neither false nor null.
Now, assuming the input consists of an array, you can simply write:
count(.[]; .sapm_score > 40)
"sapm_score" vs "spam_score"
If the point is that you want to normalize "sapm_score" to "spam_score", then (for example) you could use count/2 as defined above, like so:
count(.[]; .spam_score > 40 or .sapm_score > 40)
This assumes all the items in the array are JSON objects. If that is not the case, then you might want to try adding "?" after the key names:
count(.[]; .spam_score? > 40 or .sapm_score? > 40)
Of course all the above assumes the input is valid JSON. If that is not the case, then please see https://github.com/stedolan/jq/wiki/FAQ#processing-not-quite-valid-json
If the input is a stream of JSON objects ...
The revised question indicates the input consists of a stream of JSON objects (whereas originally the input was said to be an array of JSON objects). If the input consists of a stream of JSON objects, then the above solutions can easily be adapted, depending on the version of jq that you have. If your version of jq has inputs then (2) is recommended.
(1) All versions: use the -s command-line option.
(2) If your jq has inputs: use the -n command line option, and change .[] above to inputs, e.g.
count(inputs; .spam_score? > 40 or .sapm_score? > 40)
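Putting the pieces together, a full invocation might look like this (a sketch, assuming the stream-of-objects input is in data.json):
jq -n '
  def count(s): reduce s as $_ (0; .+1);
  def count(stream; cond): count(stream | cond // empty);
  count(inputs; .spam_score? > 40 or .sapm_score? > 40)
' data.json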
Filter the items that satisfy the condition, then get the length.
map(select(.sapm_score > 40)) | length
Here is one way:
reduce .[] as $s(0; if $s.spam_score > 40 then .+1 else . end)
If instead of an array the input is a sequence of newline-delimited objects (JSON Lines)
reduce inputs as $s(0; if $s.spam_score > 40 then .+1 else . end)
will work if jq is invoked with the -n flag. Here is an example:
$ cat data.json
{ "spam_score":40.776 }
{ "spam_score":17.376 }
$ jq -Mn 'reduce inputs as $s(0; if $s.spam_score > 40 then .+1 else . end)' data.json
1
cat input.json | jq -c '. | select(.sapm_score > 40)' | wc -l
should do it.
The -c option prints a one-line compact JSON representation of each match, and we count the number of lines jq prints.

How can I tell if a jq filter successfully pulls data from a JSON data structure?

I want to know if a given filter succeeds in pulling data from a JSON data structure. For example:
###### For the user steve...
% Name=steve
% jq -j --arg Name "$Name" '.[]|select(.user == $Name)|.value' <<<'
[
{"user":"steve", "value":false},
{"user":"tom", "value":true},
{"user":"pat", "value":null},
{"user":"jane", "value":""}
]'
false
% echo $?
0
Note: successful results can include boolean values, null, and even the empty string.
###### Now for user not in the JSON data...
% Name=mary
% jq -j --arg Name "$Name" '.[]|select(.user == $Name)|.value' <<<'
[
{"user":"steve", "value":false},
{"user":"tom", "value":true},
{"user":"pat", "value":null},
{"user":"jane", "value":""}
]'
% echo $?
0
If the filter does not pull data from the JSON data structure, I need to know this. I would prefer the filter to return a non-zero return code.
How would I go about determining if a selector successfully pulls data from a JSON data structure vs. fails to pull data?
Important: The above filter is just an example, the solution needs to work for any jq filter.
Note: the evaluation environment is Bash 4.2+.
You can use the -e / --exit-status flag from the jq Manual, which says:
Sets the exit status of jq to 0 if the last output value was neither false nor null, 1 if the last output value was either false or null, or 4 if no valid result was ever produced. Normally jq exits with 2 if there was any usage problem or system error, 3 if there was a jq program compile error, or 0 if the jq program ran.
I can demonstrate the usage with a basic filter as below, as your given example is not working for me.
For a successful query,
dudeOnMac:~$ jq -e '.foo?' <<< '{"foo": 42, "bar": "less interesting data"}'
42
dudeOnMac:~$ echo $?
0
For an invalid query, done with a non-existent entity zoo,
dudeOnMac:~$ jq -e '.zoo?' <<< '{"foo": 42, "bar": "less interesting data"}'
null
dudeOnMac:~$ echo $?
1
For an error scenario, returning code 2, which I created by incorrectly double-quoting the jq input stream:
dudeOnMac:~$ jq -e '.zoo?' <<< "{"foo": 42, "bar": "less interesting data"}"
jq: error: Could not open file interesting: No such file or directory
jq: error: Could not open file data}: No such file or directory
dudeOnMac:~$ echo $?
2
I've found a solution that meets all of my requirements! Please let me know what you think!
The idea is to use jq -e "$Filter" as a first-pass check. Then, for a return code of 1, do a jq "path($Filter)" check. The latter will only succeed if, in fact, there is a path into the JSON data.
Select.sh
#!/bin/bash

Select()
{
    local Name="$1"
    local Filter="$2"
    local Input="$3"
    local Result Status

    Result="$(jq -e --arg Name "$Name" "$Filter" <<<"$Input")"
    Status=$?
    case $Status in
        1)  jq --arg Name "$Name" "path($Filter)" <<<"$Input" >/dev/null 2>&1
            Status=$?
            ;;
        *)  ;;
    esac
    [[ $Status -eq 0 ]] || Result="***ERROR***"
    echo "$Status $Result"
}
Filter='.[]|select(.user == $Name)|.value'
Input='[
{"user":"steve", "value":false},
{"user":"tom", "value":true},
{"user":"pat", "value":null},
{"user":"jane", "value":""}
]'
Select steve "$Filter" "$Input"
Select tom "$Filter" "$Input"
Select pat "$Filter" "$Input"
Select jane "$Filter" "$Input"
Select mary "$Filter" "$Input"
And the execution of the above:
% ./Select.sh
0 false
0 true
0 null
0 ""
4 ***ERROR***
The fundamental problem here is that when you try to retrieve a value from an object using the .key or .[key] syntax, jq, by definition, can't distinguish a missing key from a key whose value is null.
You can instead define your own lookup function:
def lookup(k):if has(k) then .[k] else error("invalid key") end;
Then use it like so:
$ jq 'lookup("a")' <<<'{}' ; echo $?
jq: error (at <stdin>:1): invalid key
5
$ jq 'lookup("a")' <<<'{"a":null}' ; echo $?
null
0
If you then use lookup consistently instead of the builtin method, I think that will give you the behaviour you want.
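For nested access, lookup calls can be chained; for example (a quick sketch):
$ jq 'def lookup(k): if has(k) then .[k] else error("invalid key") end;
      lookup("a") | lookup("b")' <<<'{"a":{"b":42}}'
42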
Here's another way to go about it, with less bash and more jq.
#!/bin/bash

lib='def value(f): ((f|tojson) // error("no such value")) | fromjson;'
users=( steve tom pat jane mary )

Select () {
    local name=$1 filter=$2 input=$3
    local -i status=0
    result=$( jq --arg name "$name" "${lib}value(${filter})" <<<"$input" 2>/dev/null )
    status=$?
    (( status )) && result="***ERROR***"
    printf '%s\t%d %s\n' "$name" $status "$result"
}

filter='.[]|select(.user == $name)|.value'
input='[{"user":"steve","value":false},
  {"user":"tom","value":true},
  {"user":"pat","value":null},
  {"user":"jane","value":""}]'

for name in "${users[@]}"
do
    Select "$name" "$filter" "$input"
done
This produces the output:
steve 0 false
tom 0 true
pat 0 null
jane 0 ""
mary 5 ***ERROR***
This takes advantage of the fact that the absence of input to a filter acts like empty, and empty will trigger the alternative of //, but a string such as "null" or "false" will not.
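A quick illustration of that distinction (a sketch):
$ jq -n '(empty | tojson) // "no value"'
"no value"
$ jq -n '(false | tojson) // "no value"'
"false"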
It should be noted that value/1 will not work for filters that are simple key/index lookups on objects/arrays, but neither will your solution. I'm reasonably sure that to cover all the cases, you'd need something like this (or yours) and something like get or lookup.
Given that jq is the way it is, and in particular that it is stream-oriented, I'm inclined to think that a better approach would be to define and use one or more filters that make the distinctions you want. Thus rather than writing .a to access the value of a field, you'd write get("a") assuming that get/1 is defined as follows:
def get(f): if has(f) then .[f] else error("\(type) is not defined at \(f)") end;
Now you can easily tell whether or not an object has a key, and you're all set to go. This definition of get can also be used with arrays.
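For example (a sketch; note that has also accepts array indices):
$ jq 'def get(f): if has(f) then .[f] else error("\(type) is not defined at \(f)") end; get("a")' <<<'{"b":1}'
jq: error (at <stdin>:1): object is not defined at a
$ jq 'def get(f): if has(f) then .[f] else error("\(type) is not defined at \(f)") end; get(0)' <<<'[10,20]'
10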

Flatten nested JSON using jq

I'd like to flatten a nested JSON object, e.g. {"a":{"b":1}} to {"a.b":1}, in order to digest it in Solr.
I have 11 TB of JSON files which are both nested and contain dots in field names, meaning neither Elasticsearch (dots) nor Solr (nested without the _childDocument_ notation) can digest them as is.
The other solution would be to replace the dots in the field names with underscores and push it to Elasticsearch, but I have far better experience with Solr, therefore I prefer the flattening solution (unless Solr can digest those nested JSONs as is??).
I will prefer Elasticsearch only if the digestion process takes far less time than Solr's, because my priority is digesting as fast as I can (thus I chose jq instead of scripting it in Python).
Kindly help.
EDIT:
I think the pair of examples 3&4 solves this for me:
https://lucidworks.com/blog/2014/08/12/indexing-custom-json-data/
I'll try soon.
You can also use the following jq command to flatten nested JSON objects in this manner:
[leaf_paths as $path | {"key": $path | join("."), "value": getpath($path)}] | from_entries
The way it works is: leaf_paths returns a stream of arrays which represent the paths on the given JSON document at which "leaf elements" appear, that is, elements which do not have child elements, such as numbers, strings and booleans. We pipe that stream into objects with key and value properties, where key contains the elements of the path array as a string joined by dots and value contains the element at that path. Finally, we put the entire thing in an array and run from_entries on it, which transforms an array of {key, value} objects into an object containing those key-value pairs.
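For example (a sketch; this input has only string keys, so join works directly):
$ jq '[leaf_paths as $path | {"key": $path | join("."), "value": getpath($path)}] | from_entries' <<<'{"a":{"b":1},"c":"x"}'
{
  "a.b": 1,
  "c": "x"
}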
This is just a variant of Santiago's jq:
. as $in
| reduce leaf_paths as $path ({};
    . + { ($path | map(tostring) | join(".")): $in | getpath($path) })
It avoids the overhead of the key/value construction and destruction.
(If you have access to a version of jq later than jq 1.5, you can omit the "map(tostring)".)
Two important points about both these jq solutions:
Arrays are also flattened.
E.g. given {"a": {"b": [0,1,2]}} as input, the output would be:
{
  "a.b.0": 0,
  "a.b.1": 1,
  "a.b.2": 2
}
If any of the keys in the original JSON contain periods, then key collisions are possible; such collisions will generally result in the loss of a value. This would happen, for example, with the following input:
{"a.b":0, "a": {"b": 1}}
Here is a solution that uses tostream, select, join, reduce and setpath:
reduce ( tostream | select(length==2) | .[0] |= [join(".")] ) as [$p,$v] (
{}
; setpath($p; $v)
)
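For example, with the same kind of string-keyed input (a sketch):
$ jq 'reduce ( tostream | select(length==2) | .[0] |= [join(".")] ) as [$p,$v] ({}; setpath($p; $v))' <<<'{"a":{"b":1},"c":"x"}'
{
  "a.b": 1,
  "c": "x"
}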
I've recently written a script called jqg that flattens arbitrarily complex JSON and searches the results using a regex; to simply flatten the JSON, your regex would be '.', which matches everything. Unlike the answers above, the script will handle embedded arrays, false and null values, and can optionally treat empty arrays and objects ([] & {}) as leaf nodes.
$ jq . test/odd-values.json
{
  "one": {
    "start-string": "foo",
    "null-value": null,
    "integer-number": 101
  },
  "two": [
    {
      "two-a": {
        "non-integer-number": 101.75,
        "number-zero": 0
      },
      "true-boolean": true,
      "two-b": {
        "false-boolean": false
      }
    }
  ],
  "three": {
    "empty-string": "",
    "empty-object": {},
    "empty-array": []
  },
  "end-string": "bar"
}
$ jqg . test/odd-values.json
{
  "one.start-string": "foo",
  "one.null-value": null,
  "one.integer-number": 101,
  "two.0.two-a.non-integer-number": 101.75,
  "two.0.two-a.number-zero": 0,
  "two.0.true-boolean": true,
  "two.0.two-b.false-boolean": false,
  "three.empty-string": "",
  "three.empty-object": {},
  "three.empty-array": [],
  "end-string": "bar"
}
jqg was tested using jq 1.6
Note: I am the author of the jqg script.
As it turns out, curl -XPOST 'http://localhost:8983/solr/flat/update/json/docs' -d @json_file does just this:
{
  "a.b":[1],
  "id":"24e3e780-3a9e-4fa7-9159-fc5294e803cd",
  "_version_":1535841499921514496
}
EDIT 1: Solr 6.0.1 with bin/solr -e cloud. The collection name is flat; all the rest are defaults (with the data-driven schema, which is also default).
EDIT 2: The final script I used: find . -name '*.json' -exec curl -XPOST 'http://localhost:8983/solr/collection1/update/json/docs' -d @{} \;.
EDIT 3: It is also possible to parallelize with xargs and to add the id field with jq: find . -name '*.json' -print0 | xargs -0 -n 1 -P 8 -I {} sh -c "cat {} | jq '. + {id: .a.b}' | curl -XPOST 'http://localhost:8983/solr/collection/update/json/docs' -d @-", where -P is the parallelism factor. I used jq to set an id so multiple uploads of the same document won't create duplicates in the collection (when I searched for the optimal value of -P, it created duplicates in the collection).
As @hraban mentioned, leaf_paths does not work as expected (furthermore, it is deprecated). leaf_paths is equivalent to paths(scalars); it returns the paths of any values for which scalars returns a truthy value. scalars returns its input value if it is a scalar, or null otherwise. The problem with that is that null and false are not truthy values, so they will be removed from the output. The following code does work, by checking the type of the values directly:
. as $in
| reduce paths(type != "object" and type != "array") as $path ({};
    . + { ($path | map(tostring) | join(".")): $in | getpath($path) })
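For example, null and false values survive this flattening (a sketch):
$ jq '. as $in
      | reduce paths(type != "object" and type != "array") as $path ({};
          . + { ($path | map(tostring) | join(".")): $in | getpath($path) })' <<<'{"a":{"b":null,"c":false}}'
{
  "a.b": null,
  "a.c": false
}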