Use JQ to select specific, arbitrarily nested objects from JSON - json

I'm looking for efficient means to search through an large JSON object for "sub-objects" that match a filter (via select(), I imagine). However, the top-level JSON is an object with arbitrary nesting contained within, including more simple values, objects and arrays of objects. For example:
{
"name": "foo",
"class": "system",
"description": "top-level-thing",
"configuration": {
"status": "normal",
"uuid": "id"
},
"children": [
{
"id": "c1",
"class": "c1",
"children": [
{
"id": "c1.1",
"class": "c1.1"
},
{
"id": "c1.1",
"class": "FINDME"
}
]
},
{
"id": "c2",
"class": "FINDME"
}
],
"thing": {
"id": "c3",
"class": "FINDME"
}
}
I have a solution which does part of what I want (and is understandable):
jq -r '.. | arrays | .[] | select(.class=="FINDME"?) | .id'
which returns:
c2
c1.1
... however, it misses c3, plus it changes the order of items output. Additionally I'm expecting this to operate on potentially very large JSON structures, I would like to make sure I find an efficient solution. Bonus points for something that remains readable by jq neophytes (myself included).
FWIW, references I was using to help me on the way, in case they help others:
Select objects based on value of variable in object using jq
How to use jq to find all paths to a certain key
Recursive search values by key

For small to modest-sized JSON input, you're on the right track with ..
but it seems you want to select objects, like so:
.. | objects | select(.class=="FINDME"?) | .id
For JSON documents that are very large, this might require too much memory, so it may be worth knowing about jq's streaming parser. Unfortunately it's much more difficult to use, so I'd suggest trying the above, and if you're interested, look in the usual places for documentation about the --stream option.

Here's a streaming-parser solution. To make sense of it, you'll need to read up on the --stream option, but the key is that the output includes lines of the form: [PATH, VALUE]
program.jq
foreach inputs as $in (null;
if has("id") and has("class") then null
else . as $x
| $in
| if length != 2 then null
elif .[0][-1] == "id" then ($x + {id: .[-1]})
elif .[0][-1] == "class"
and .[-1] == "FINDME" then ($x + {class: .[-1]})
else $x
end
end;
select(has("id") and has("class")) | .id )
Invocation
jq -n --stream -f program.jq input.json
Output with sample input
"c1.1"
"c2"
"c3"

Related

Merging multiple JSON Lines files into a single JSON object

I'm trying to merge / reduce many JSON objects and somehow I'm not getting the expected result.
I'm only interested in getting all keys, the values and the number of items inside arrays are irrelevant.
file1.json:
{
"customerId": "xx",
"emails": [
{
"address": "james#zz.com",
"customType": "",
"type": "custom"
},
{
"address": "sales#x.com",
"primary": true
},
{
"address": "info#x.com"
}
]
}
{
"id": "654",
"emails": [
{
"address": "peter#x.com",
"primary": true
}
]
}
The desired output is a JSON object with all possible keys from all input objects. The values are irrelevant, any value from any input object is OK. But all keys from input objects must be present in output object:
{
"emails": [
{
"address": "james#zz.com", <--- any existing value works
"customType": "", <--- any existing value works
"type": "custom", <--- any existing value works
"primary": true <--- any existing value works
}
],
"customerId": "xx", <--- any existing value works
"id": "654" <--- any existing value works
}
I tried reducing it, but it misses many of the keys in the array:
$ jq -s 'reduce .[] as $item ({}; . + $item)' file1.json
{
"customerId": "xx",
"emails": [
{
"address": "peter#x.com",
"primary": true
}
],
"id": "654"
}
The structure of the objects contained in file1.json is unknown, so the solution must be agnostic of any keys/values and the solution must not assume any structure or depth.
Is it possible to fix this somehow considering how jq works? Or is it possible to solve this issue using another tool?
PS: For those of you that are curious, this is useful to infer a schema that can be created in a database. Given an arbitrary number of JSON objects with an arbitrary structure, it's easy to create a single JSON squished/merged/fused structure that will "accommodate" all JSON objects.
BigQuery is able to autodetect a schema, but only 500 lines are analyzed to come up with it. This presents problems if objects have different structures past that 500 line mark.
With this approach I can squish a JSON Lines file with 1000000s of objects into one line that can be then imported into BigQuery with the autodetect schema flag and it will work every time since BigQuery only has one line to analyze and this line is the "super-schema" of all the objects. After extracting the autodetected schema I can manually fine tune it to make sure types are correct and then recreate the table specifying my tuned schema:
$ ls -1 users*.json | wc --lines
3672
$ cat users*.json > users-all.json
$ cat users-all.json | wc --lines
146482633
$ jq 'squish' users-all.json > users-all-squished.json
$ cat users-all-squished.json | wc --lines
1
$ bq load --autodetect users users-all-squished.json
$ bq show schema --format=prettyjson users > users-schema.json
$ vi users-schema.json
$ bq rm --table users
$ bq mk --table users --schema=users-schema.json
$ bq load users users-all.json
[Some options are missing or changed for readability]
Here is a solution that produces the expected result in the sample example, and seems to meet all the stated requirements. It is similar to one proposed by #pmf on this page.
jq -n --stream '
def squish: map(if type == "number" then 0 else . end);
reduce (inputs | select(length==2)) as [$p, $v] ({}; setpath($p|squish; $v))
'
Output
For the example given in the Q, the output is:
{
"customerId": "xx",
"emails": [
{
"address": "peter#x.com",
"customType": "",
"type": "custom",
"primary": true
}
],
"id": "654"
}
As #peak has pointed out, some aspects are underspecified. For instance, what should happen with .customerId and .id? Are they always the same across all files (as suggested by the sample files provided)? Do you want the items of the .emails array just thrown into one large array, or do you want to have them "merged" by some criteria (e.g. by a common value in their .address field)? Here are some stubs to start from:
Simply concatenate the .emails arrays and take all other parts from the first file:
jq 'reduce inputs as $in (.; .emails += $in.emails)' file*.json
# or simpler
jq '.emails += [inputs.emails[]]' file*.json
Demo Demo
{
"emails": [
{
"address": "cc#xx.com"
},
{
"address": "james#zz.com",
"customType": "",
"type": "custom"
},
{
"address": "james#x.com"
},
{
"address": "sales#x.com",
"primary": true
},
{
"address": "info#x.com"
},
{
"address": "james#x.com"
},
{
"address": "sales#x.com",
"primary": true
},
{
"address": "info#x.com"
}
],
"customerId": "xx",
"id": "654"
}
Merge the objects in the .emails array by a common value in their .address field, with latter values overwriting former values for other fields with colliding names, and discard all other parts from the files:
jq -n 'reduce inputs.emails[] as $e ({}; .[$e.address] += $e) | map(.)' file*.json
Demo
[
{
"address": "cc#xx.com"
},
{
"address": "james#zz.com",
"customType": "",
"type": "custom"
},
{
"address": "james#x.com"
},
{
"address": "sales#x.com",
"primary": true
},
{
"address": "info#x.com"
}
]
If you are only interested in a list of unique field names for a given address, regardless of the counts and values used, you can also go with:
jq -n '
reduce inputs.emails[] as $e ({}; .[$e.address][$e | keys_unsorted[]] = 1)
| map_values(keys)
'
Demo
{
"cc#xx.com": [
"address"
],
"james#zz.com": [
"address",
"customType",
"type"
],
"james#x.com": [
"address"
],
"sales#x.com": [
"address",
"primary"
],
"info#x.com": [
"address"
]
}
The structure of the objects contained in file1.json is unknown, so the solution must be agnostic of any keys/values and the solution must not assume any structure or depth.
You can use the --stream flag to break down the structure into an array of paths and values, discard the values part and make the paths unique:
jq --stream -nc '[inputs[0]] | unique[]' file*.json
["customerId"]
["emails"]
["emails",0,"address"]
["emails",0,"customType"]
["emails",0,"primary"]
["emails",0,"type"]
["emails",1,"address"]
["emails",2]
["emails",2,"address"]
["emails",2,"primary"]
["emails",3]
["emails",3,"address"]
["id"]
Trying to build a representation of this, similar to any of the input files, comes with a lot of caveats. For instance, how would you represent in a single structure if one file had .emails as an array of objects, and another had .emails as just an atomic value, say, a string. You would not be able to represent this plurality without introducing new, possibly ambiguous structures (e.g. putting all possibilities into an array).
Therefore, having a list of paths could be a fair compromise. Judging by your desired output, you want to focus more on the object structure, so you could further reduce complexity by discarding the array indices. Depending on your use case, you could replace them with a single value to retain the information of the presence of an array, or discard them entirely:
jq --stream -nc '[inputs[0] | map(numbers = 0)] | unique[]' file*.json
["customerId"]
["emails"]
["emails",0]
["emails",0,"address"]
["emails",0,"customType"]
["emails",0,"primary"]
["emails",0,"type"]
["id"]
jq --stream -nc '[inputs[0] | map(strings)] | unique[]' file*.json
["customerId"]
["emails"]
["emails","address"]
["emails","customType"]
["emails","primary"]
["emails","type"]
["id"]
The following program meets these two key requirements:
"all keys from input objects must be present in output object";
"the solution must be agnostic of any keys/values and the solution must not assume any structure or depth."
The approach is the same as one suggested by #pmf, and for the example given in the Q, produces results that are very similar to the one that is shown:
jq -n --stream '
def squish: map(select(type == "string"));
reduce (inputs | select(length==2)) as [$p, $v] ({};
setpath($p|squish; $v))
'
With the given input, this produces:
{
"customerId": "xx",
"emails": {
"address": "peter#x.com",
"customType": "",
"type": "custom",
"primary": true
},
"id": "654"
}

jq with multiple select statements and an array

I've got some JSON like the following (I've filtered the output here):
[
{
"Tags": [
{
"Key": "Name",
"Value": "example1"
},
{
"Key": "Irrelevant",
"Value": "irrelevant"
}
],
"c7n:MatchedFilters": [
"tag: example_tag_rule"
],
"another_key": "another_value_I_dont_want"
},
{
"Tags": [
{
"Key": "Name",
"Value": "example2"
}
],
"c7n:MatchedFilters": [
"tag:example_tag_rule",
"tag: example_tag_rule2"
]
}
]
I'd like to create a csv file with the value within the Name key and all of the "c7n:MatchedFilters" in the array. I've made a few attempts but still can't get quite the output I expect. There's some example code and the output below:
#Prints the key that I'm after.
cat new.jq | jq '.[] | [.Tags[], {"c7n:MatchedFilters"}] | .[] | select(.Key=="Name")|.Value'
"example1"
"example2"
#Prints all the filters in an array I'm after.
cat new.jq | jq -r '.[] | [.Tags[], {"c7n:MatchedFilters"}] | .[] | select(."c7n:MatchedFilters") | .[]'
[
"tag: example_tag_rule"
]
[
"tag:example_tag_rule",
"tag: example_tag_rule2"
]
#Prints *all* the tags (including ones I don't want) and all the filters in the array I'm after.
cat new.jq | jq '.[] | [.Tags[], {"c7n:MatchedFilters"}] | select((.[].Key=="Name") and (.[]."c7n:MatchedFilters"))'
[
{
"Key": "Name",
"Value": "example1"
},
{
"Key": "Irrelevant",
"Value": "irrelevant"
},
{
"c7n:MatchedFilters": [
"tag: example_tag_rule"
]
}
]
[
{
"Key": "Name",
"Value": "example2"
},
{
"c7n:MatchedFilters": [
"tag:example_tag_rule",
"tag: example_tag_rule2"
]
}
]
I hope this makes sense, let me know if I've missed anything.
Your attempts are not working because you start out with [.Tags[], {"c7n:MatchedFilters"}] to construct one array containing all the tags and an object containing the filters. You are then struggling to find a way to process this entire array at once because it jumbles together these unrelated things without any distinction. You will find it much easier if you don't combine them in the first place!
You want to find the single tag with a Key of "Name". Here's one way to find that:
first(
.Tags[]|
select(.Key=="Name")
).Value as $name
By using a variable binding we can save it for later and worry about constructing the array separately.
You say (in the comments) that you just want to concatenate the filters with spaces. You can do that easily enough:
(
."c7n:MatchedFilters"|
join(" ")
) as $filters
You can combine all this together like follows. Note that each variable binding leaves the input stream unchanged, so it's easy to compose everything.
jq --raw-output '
.[]|
first(
.Tags[]|
select(.Key=="Name")
).Value as $name|
(
."c7n:MatchedFilters"|
join(" ")
) as $filters|
[$name, $filters]|
#csv
Hopefully that's easy enough to read and separates out each concept. We break up the array into a stream of objects. For each object, we find the name and bind it to $name, we concatenate the filters and bind them to $filters, then we construct an array containing both, then we convert the array to a CSV string.
We don't need to use variables. We could just have a big array constructor wrapped around the expression to find the name and the expression to find the filters. But I hope you can see the variables make things a bit flatter and easier to understand.

Get parent value from json using jq

My json file looks like this;
{
"RQBTYFE86MFC3oL": {
"name": "Nightmode",
"lights": [
"1",
"2",
"3",
"4",
"5",
"7",
"8",
"9",
"10",
"11"
],
"owner": "kvovodUUfn2vlby9h9okdDhv8SrTzkBFjk6kPz2v",
"recycle": false,
"locked": false,
"appdata": {
"version": 1,
"data": "QSXCj_r01_d99"
},
"picture": "",
"lastupdated": "2018-08-08T03:21:39",
"version": 2
}
}
I want to get the 'RQBTYFE86MFC3oL' value by doing a query for 'Nightmode'. So far I came up with this;
jq '.[] | select(.name == "Nightmode")'
This will return me the correct part of the Json but the 'RQBTYFE86MFC3oL' part is stripped. How do I get this part as well?
A simple way to determine the key name(s) corresponding to values satisfying a certain condition is to use to_entries, as explained in the jq manual.
Using this approach, the appropriate jq filter would be:
to_entries[] | select(.value.name == "Nightmode") | .key
with the result:
"RQBTYFE86MFC3oL"
If you want to get the key-value pair, you'd use with_entries as follows:
with_entries( select(.value.name == "Nightmode") )
If the input JSON is too large to fit comfortably in memory, then it would make sense to use jq's streaming parser (invoked with the --stream command-line option):
jq --stream '
select(.[1] == "Nightmode" and (first|length) == 2 and first[1] == "name")
| first | first'
This would produce the key name.
The key idea is that the streaming parser produces arrays including pairs of the form: [ARRAYPATH, VALUE] where VALUE is the value at ARRAYPATH.
You want to get the Key Value.
So use the keys command, to return 'RQBTYFE86MFC3oL' as that is the key, the rest is the value of that key.
jq 'keys'
Here is a snippet: https://jqplay.org/s/YvpCb2PH42
Reference: https://stedolan.github.io/jq/manual/

Getting only desired properties from nested array values with jq

The structure I ultimately want would be:
{
"catalog": [
{
"name": "X",
"catalog": [
{ "name": "Y", "uniqueId": "Z" },
{ "name": "Q", "uniqueId": "B" }
]
}
]
}
This is what the existing structure looks like except there are many other properties at each level (https://gist.github.com/ajcrites/e0e0ca4ca3a08ff2dc401ec872e6094c). I just want to filter those out and get a JSON format that looks specifically like this.
I have started out with: jq '.catalog', but this returns only the array. I still want the catalog property name there. I can do this with jq '{catalog: .catalog[]}, but this prints out each catalog object individually which makes the whole output invalid JSON. I still want the properties to be in the array. Is there a way to filter specific property key-values within arrays using jq?
The following transforms the given input to the desired output and may well be what you want:
{catalog}
| .catalog |= map( {name, catalog} )
| .catalog[].catalog |= map( {name, uniqueId} )
| .catalog |= .[0:1]
However, it's not clear to me that this is really what you want, as you don't discuss the duplication in the given JSON input. So maybe you don't really want the last line in the above, or maybe you want duplicates to be handled in some other way, or ....
Anyway, the trick to keeping things simple here is to use |=.
An alternative approach would be to use del to delete the unwanted properties (rather than selecting the ones you want), but in the present case, that would be (at best) tedious.
You could start by using tostream to convert your sample.json
into a stream of [path, value] arrays as you can see by running
jq -c tostream sample.json
This will generate
[["catalog",0,"catalog",0,"name"],"Y"]
[["catalog",0,"catalog",0,"prop11"],""]
[["catalog",0,"catalog",0,"uniqueId"],"Z"]
[["catalog",0,"catalog",0,"uniqueId"]]
[["catalog",0,"catalog",1,"name"],"Y"]
[["catalog",0,"catalog",1,"prop11"],""]
...
reduce and setpath can be used to convert back into the
original form with a filter such as:
reduce (tostream|select(length==2)) as [$p,$v] (
{};
setpath($p;$v)
)
Adding conditionals makes it easy to omit properties at any level.
For example the following removes leaf attributes starting with "prop":
reduce (tostream|select(length==2)) as [$p,$v] (
{};
if $p[-1]|startswith("prop")
then .
else setpath($p;$v)
end
)
With your sample.json this produces
{
"catalog": [
{
"catalog": [
{
"name": "Y",
"uniqueId": "Z"
},
{
"name": "Y",
"uniqueId": "Z"
}
],
"name": "X"
},
{
"catalog": [
{
"name": "Y",
"uniqueId": "Z"
},
{
"name": "Y",
"uniqueId": "Z"
}
],
"name": "X"
}
]
}
If the goal is to remove certain properties, then one could do so using walk/1. For example, to remove properties whose names start with "prop":
walk(if type == "object"
then with_entries(select(.key|startswith("prop") | not))
else . end)
The same approach would also be applicable if the focus is on retaining certain properties, e.g.:
walk(if type == "object"
then with_entries(select(.key == "name" or .key == "uniqueId" or .key == "catalog"))
else . end)
You could build up a file that contains paths into the json (expressed as arrays) that you want to keep. Then filter out values that do not fit in those paths.
paths.json:
["catalog","name"]
["catalog","catalog","name"]
["catalog","catalog","uniqueId"]
Then filter values based on their paths. Using streams is a great way to go for this since it gives you access to these paths directly:
$ jq --slurpfile paths paths.json '
def keep_path($path): any($paths[]; . == [$path[] | select(strings)]);
fromstream(tostream | select(length == 1 or keep_path(.[0])))
' input.json

Using jq to list keys in a JSON object

I have a hierarchically deep JSON object created by a scientific instrument, so the file is somewhat large (1.3MB) and not readily readable by people. I would like to get a list of keys, up to a certain depth, for the JSON object. For example, given an input object like this
{
"acquisition_parameters": {
"laser": {
"wavelength": {
"value": 632,
"units": "nm"
}
},
"date": "02/03/2525",
"camera": {}
},
"software": {
"repo": "github.com/username/repo",
"commit": "a7642f",
"branch": "develop"
},
"data": [{},{},{}]
}
I would like an output like such.
{
"acquisition_parameters": [
"laser",
"date",
"camera"
],
"software": [
"repo",
"commit",
"branch"
]
}
This is mainly for the purpose of being able to enumerate what is in a JSON object. After processing the JSON objects from the instrument begin to diverge: for example, some may have a field like .frame.cross_section.stats.fwhm, while others may have .sample.species, so it would be convenient to be able to interrogate the JSON object on the command line.
The following should do exactly what you want
jq '[(keys - ["data"])[] as $key | { ($key): .[$key] | keys }] | add'
This will give the following output, using the input you described above:
{
"acquisition_parameters": [
"camera",
"date",
"laser"
],
"software": [
"branch",
"commit",
"repo"
]
}
Given your purpose you might have an easier time using the paths builtin to list all the paths in the input and then truncate at the desired depth:
$ echo '{"a":{"b":{"c":{"d":true}}}}' | jq -c '[paths|.[0:2]]|unique'
[["a"],["a","b"]]
Here is another variation uing reduce and setpath which assumes you have a specific set of top-level keys you want to examine:
. as $v
| reduce ("acquisition_parameters", "software") as $k (
{}; setpath([$k]; $v[$k] | keys)
)