As the subject states, my goal is to write an all_keys function that extracts all keys from an arbitrarily nested JSON blob, traversing contained arrays and objects as needed, and outputs an array containing the keys, without duplicates.
For instance, given the following input:
[
{"name": "/", "children": [
{"name": "/bin", "children": [
{"name": "/bin/ls", "children": []},
{"name": "/bin/sh", "children": []}]},
{"name": "/home", "children": [
{"name": "/home/stephen", "children": [
{"name": "/home/stephen/jq", "children": []}]}]}]},
{"name": "/", "children": [
{"name": "/bin", "children": [
{"name": "/bin/ls", "children": []},
{"name": "/bin/sh", "children": []}]},
{"name": "/home", "children": [
{"name": "/home/stephen", "children": [
{"name": "/home/stephen/jq", "children": []}]}]}]}
]
The all_keys function should produce this output:
[
"children",
"name"
]
To this end, I devised the following function, but it's as slow as it is convoluted, so I was wondering whether you could come up with a more concise and faster way of obtaining the same result.
def all_keys:
  . as $in |
  if type == "object" then
    reduce keys[] as $k (
      [];
      . + [$k, ($in[$k] | all_keys)[]]
    ) | unique
  elif type == "array" then (
    reduce .[] as $i (
      [];
      . + ($i | all_keys)
    ) | unique
  )
  else
    empty
  end
;
For reference, running that function on this 53MB JSON file takes roughly 22 seconds on my Intel T9300@2.50GHz CPU (I know, it's quite ancient but still works fine).
A naive approach would just collect all the keys and get the unique values.
[.. | objects | keys[]] | unique
But with that data, it's a bit on the slow side since the keys need to be collected and sorted.
We can do a little better. Since we're trying to determine all the distinct keys, a hashmap of some sort would be more efficient, and jq objects can act as one.
reduce (.. | objects | keys[]) as $k ({}; .[$k] = true) | keys
I didn't measure the time on this, but it's orders of magnitude faster than the other version. I didn't even wait for the other one to finish; this one was well within 10 seconds on my work machine (i5-2400@3.1GHz).
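As a sanity check (assuming jq 1.5 or later), the object-as-set filter agrees with the naive version on a toy input:

```shell
# Keys are collected as object keys, which deduplicates them for free.
echo '[{"a":1},{"b":2},{"a":3}]' \
  | jq -c 'reduce (.. | objects | keys[]) as $k ({}; .[$k] = true) | keys'
# → ["a","b"]
```

Note that keys on the accumulator object returns the key names in sorted order, so the output matches unique.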
I think you'll find the following variant of the OP's all_keys is actually slightly faster than the version using ..; this is probably to be expected: for jeopardy.json, .. generates 1,731,807 JSON entities altogether, whereas there are only 216,930 JSON objects:
def all_keys:
  def uniquely(f): reduce f as $x ({}; .[$x] = true) | keys;
  def rkeys:
    if type == "object" then keys[] as $k | ($k, (.[$k]|rkeys))
    elif type == "array" then .[]|rkeys
    else empty
    end;
  uniquely(rkeys);
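As a quick sanity check, the variant can be exercised from the shell on a reduced version of the question's input (the file name all_keys.jq is arbitrary):

```shell
# Save the definition plus a call to it, then run it on a small sample.
cat > all_keys.jq <<'EOF'
def all_keys:
  def uniquely(f): reduce f as $x ({}; .[$x] = true) | keys;
  def rkeys:
    if type == "object" then keys[] as $k | ($k, (.[$k]|rkeys))
    elif type == "array" then .[]|rkeys
    else empty
    end;
  uniquely(rkeys);
all_keys
EOF
echo '[{"name":"/","children":[{"name":"/bin","children":[]}]}]' \
  | jq -c -f all_keys.jq
# → ["children","name"]
```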
I'm looking for an efficient means to search through a large JSON object for "sub-objects" that match a filter (via select(), I imagine). However, the top-level JSON is an object with arbitrary nesting contained within, including simple values, objects, and arrays of objects. For example:
{
"name": "foo",
"class": "system",
"description": "top-level-thing",
"configuration": {
"status": "normal",
"uuid": "id"
},
"children": [
{
"id": "c1",
"class": "c1",
"children": [
{
"id": "c1.1",
"class": "c1.1"
},
{
"id": "c1.1",
"class": "FINDME"
}
]
},
{
"id": "c2",
"class": "FINDME"
}
],
"thing": {
"id": "c3",
"class": "FINDME"
}
}
I have a solution which does part of what I want (and is understandable):
jq -r '.. | arrays | .[] | select(.class=="FINDME"?) | .id'
which returns:
c2
c1.1
... however, it misses c3, and it changes the order in which items are output. Additionally, I'm expecting this to operate on potentially very large JSON structures, so I would like an efficient solution. Bonus points for something that remains readable by jq neophytes (myself included).
FWIW, references I was using to help me on the way, in case they help others:
Select objects based on value of variable in object using jq
How to use jq to find all paths to a certain key
Recursive search values by key
For small to modest-sized JSON input, you're on the right track with .., but it seems you want to select objects, like so:
.. | objects | select(.class=="FINDME"?) | .id
For JSON documents that are very large, this might require too much memory, so it may be worth knowing about jq's streaming parser. Unfortunately it's much more difficult to use, so I'd suggest trying the above, and if you're interested, look in the usual places for documentation about the --stream option.
Here's a streaming-parser solution. To make sense of it, you'll need to read up on the --stream option, but the key is that the output includes lines of the form: [PATH, VALUE]
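To make the [PATH, VALUE] format concrete, here is the event stream for a tiny document (every scalar yields a length-2 event; each closing container yields a length-1 event holding just a path):

```shell
# tostream and --stream produce the same event sequence.
echo '{"a":[1,2]}' | jq -c --stream .
# → [["a",0],1]
#   [["a",1],2]
#   [["a",1]]
#   [["a"]]
```

The length-1 closing events are what the "length != 2" test in the program below filters on.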
program.jq
foreach inputs as $in ({};       # {} rather than null so has() is always defined
  if has("id") and has("class") then {}   # just emitted a match: reset
  else . as $x
    | $in
    | if length != 2 then {}              # closing event: reset the state
      elif .[0][-1] == "id" then ($x + {id: .[-1]})
      elif .[0][-1] == "class"
           and .[-1] == "FINDME" then ($x + {class: .[-1]})
      else $x
      end
  end;
  select(has("id") and has("class")) | .id )
Invocation
jq -n --stream -f program.jq input.json
Output with sample input
"c1.1"
"c2"
"c3"
First ~ Thanks for taking the time to read this. If any further information or restatements are needed, please comment so that I can improve the question. I am new to jq and appreciate any assistance provided. If there is any confusion in the topic, it is due to my lack of experience with the jq tool. This seems fairly complex, so even a partial answer is welcome.
Background
I have some JSON objects within a series of JSON arrays (sample at bottom). The objects have a number of elements but only the values associated with the "data" key are of interest to me. I want to output a single array of JSON objects where the values are translated into key/value pairs based on some regular expression rules.
I want to essentially combine multiple "data" values to form a key phrase (and then a value-phrase) which I need to output as the array of target objects. I believe I should be able to use a regular expression or a set of known text (for the key-phrase) to compile the text into a single key or value.
Current Logic
Using: jq-1.5, Mac OS 10.12.6, Bash terminal
One thing I have examined is the colon (:) in the value field, which indicates the end of a key-phrase. So, for example, the following represents the key "Company Address":
"data":"Company ",
...
"data": "Address:"
...
{
"top": 333,
"left": 520,
"width": 66,
"height": 15,
"font": 5,
"data":"123 Main St. "
...
"data":"Smallville "
...
"data":"KS "
...
"data":"606101"
In this case, the colon in the value indicates that the next value attached to the following useful "data" key is the beginning of an address.
A space trailing the value indicates that the next data value found is a continuation of the key phrase or the value phrase I am attempting to combine into a new JSON object.
I have a set of values that I can use to delimit a new JSON object. Essentially the following example would allow me to create a key "Company Name":
...
"data":"Company "
...
"data":"Name"
(note that this entry does not have a colon but the pattern will be the start of each new JSON object to be generated)
Notes
I can determine when the end of a key or value is reached by whether or not its value ends with a space (if there is no space, I consider the value to be the end of the value-phrase and begin capturing the next key-phrase).
Things I've tried
Any assistance with translating this logic into one or more useful jq filter(s) would be greatly appreciated. I've taken a look at the JQ Cookbook, the JQ Manual, this article, examined other SO questions on jq, and made an evaluation of an alternate tool (underscore_cli). I am new to jq and my naive expressions keep failing...
I've tried some simple tests to attempt to select values of interest. (I am not successfully able to walk the JSON tree to get to the information under the text array. Another wrinkle is that I have multiple text arrays. Is it possible to have the same algorithm performed on each array of objects?)
jq -s '.[] | select(.data | contains(":"))'
jq: error (at :0): Cannot index array with string "data"
Sample
A sample of the header JSON
[
{
"number": 1,
"pages": 254,
"height": 1263,
"width": 892,
"fonts": [
{
"fontspec": "0",
"size": "-1",
"family": "Times",
"color": "#ffffff"
},
{
"fontspec": "1",
"size": "31",
"family": "Times",
"color": "#000000"
},
{
"fontspec": "2",
"size": "16",
"family": "Helvetica",
"color": "#000000"
},
{
"fontspec": "3",
"size": "13",
"family": "Times",
"color": "#237db8"
},
{
"fontspec": "4",
"size": "17",
"family": "Times",
"color": "#000000"
},
{
"fontspec": "5",
"size": "13",
"family": "Times",
"color": "#000000"
},
{
"fontspec": "6",
"size": "8",
"family": "Times",
"color": "#9f97a7"
},
{
"fontspec": "7",
"size": "10",
"family": "Times",
"color": "#9f97a7"
}
],
"text": [
{
"top": 83,
"left": 60,
"width": 0,
"height": 1,
"font": 0,
"data": " "
},
{
"top": 333,
"left": 68,
"width": 68,
"height": 15,
"font": 5,
"data": "Company "
},
{
"top": 333,
"left": 135,
"width": 40,
"height": 15,
"font": 5,
"data": "Name"
},
...(more of these objects with data)
]
]
I am looking to output a JSON array of objects whose keys are composed of known strings (patterns) for the key/value pair, bound by a colon (:) indicating the end of a key-phrase, whose next data-value would be the start of the value-phrase. The presence of a trailing space indicates that the data-value should be appended as part of the value-phrase until the trailing space no longer appears in the data-value. At that point, the next data-value represents the start of another key-phrase.
UPDATE #1:
The answers below are very helpful. I've gone back to the jq manual and incorporated the advice below. I am getting a string but am unable to separate out the set of data tags into a single string.
.[].text | tostring
However, I am seeing the JSON being escaped and the other tags showing up in the string: top, left, right (along with their values). I'd like to have the tokens associated only with the data key as a string, then run the regular expressions over that string to parse out a set of JSON objects where the keys and values can be defined.
From what I can tell, you're trying to get all the "data" values and concatenate them into a single string. That should be simple enough to do:
[.. | .data? | select(. != null) | tostring] | join("")
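For instance, on a trimmed-down version of the question's text array:

```shell
# ..  visits every value; .data? yields the "data" field of objects
# (and nothing, rather than an error, for arrays and scalars).
echo '[{"text":[{"data":"Company "},{"data":"Name"}]}]' \
  | jq -r '[.. | .data? | select(. != null) | tostring] | join("")'
# → Company Name
```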
There's not enough example data to know where one "grouping" of data begins and ends. But assuming every item in the root array is a single phrase, select each item first before performing the search (or map them):
map([.. | .data? | select(. != null) | tostring] | join(""))
If you ultimately want to parse the data bits into a JSON object, it's not too far off:
map(
  [.. | .data? | select(. != null) | tostring]
  | join("")
  | split(":") as [$key, $value]
  | {$key, $value}
) | from_entries
You may want to consider using jq Streaming for this. With your sample data the following filter picks out the paths to the "data" attributes:
tostream
| select(length==2) as [$p,$v]
| select($p[-1]=="data")
| [$p,$v]
If this is in filter.jq and your sample data is in data.json the command
$ jq -Mc -f filter.jq data.json
produces
[[0,"text",0,"data"]," "]
[[0,"text",1,"data"],"Company "]
[[0,"text",2,"data"],"Name"]
From this you can see your data has information in the paths .[0].text[0].data, .[0].text[1].data and .[0].text[2].data.
You can build on this using reduce to collect the values into groups based on the presence of the trailing space. With your data the following filter
reduce (
  tostream
  | select(length==2) as [$p,$v]
  | select($p[-1]=="data")
) as [$p,$v] (
  [""];
  .[-1] += $v
  | if $v|endswith(" ")|not then . += [""] else . end
)
| map(select(. != ""))
produces
[" Company Name"]
This example only groups the data into a list. You can use a more sophisticated reduce if you need to.
Here is a Try it online! link you can experiment with.
To take this further let's use the following sample data:
[
{ "data":"Company " },
{ "data": "Address:" },
{ "data":"123 Main St. " },
{ "data":"Smallville " },
{ "data":"KS " },
{ "data":"606101" }
]
The filter as is will generate
["Company Address:","123 Main St. Smallville KS 606101"]
To convert that into an object, you could add another reduce. For example, this filter
reduce (
  tostream
  | select(length==2) as [$p,$v]
  | select($p[-1]=="data")
) as [$p,$v] (
  [""];
  .[-1] += $v
  | if $v|endswith(" ")|not then . += [""] else . end
)
| map(select(. != ""))
| reduce .[] as $e (
    {k:"", o:{}};
    if $e|endswith(":") then .k = $e[:-1] else .o[.k] += $e end
  )
| .o
produces
{"Company Address":"123 Main St. Smallville KS 606101"}
One last thing: at this point the filter is getting pretty large, so it makes sense to refactor it into functions that are easier to manage and extend, e.g.:
def extract:
  [ tostream
    | select(length==2) as [$p,$v]  # collect the values
    | select($p[-1]=="data")        # for paths ending in "data"
    | $v                            # in an array
  ];

def gather:
  reduce .[] as $v (
    [""];                           # state: list of grouped values
    .[-1] += $v                     # add value to the last group
    | if $v|endswith(" ")|not       # if the value did not end with " "
      then . += [""]                # start a new group
      else .
      end
  )
  | map(select(. != ""));           # drop the trailing empty group

def combine:
  reduce .[] as $e (
    {k:"", o:{}};                   # k: current key, o: combined object
    if $e|endswith(":")             # if the value ends with ":"
    then .k = $e[:-1]               # use it as the new current key
    else .o[.k] += $e               # otherwise append to the current key's value
    end
  )
  | .o;                             # produce the final object

extract                             # extract "data" values
| gather                            # gather into groups
| combine                           # combine into an object
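A quick check that the refactored pipeline produces the same object as the monolithic filter, using the six-element sample from above (the file name phrases.jq is arbitrary):

```shell
cat > phrases.jq <<'EOF'
def extract:
  [ tostream
    | select(length==2) as [$p,$v]
    | select($p[-1]=="data")
    | $v ];
def gather:
  reduce .[] as $v (
    [""];
    .[-1] += $v
    | if $v|endswith(" ")|not then . += [""] else . end)
  | map(select(. != ""));
def combine:
  reduce .[] as $e (
    {k:"", o:{}};
    if $e|endswith(":") then .k = $e[:-1] else .o[.k] += $e end)
  | .o;
extract | gather | combine
EOF
echo '[{"data":"Company "},{"data":"Address:"},{"data":"123 Main St. "},
       {"data":"Smallville "},{"data":"KS "},{"data":"606101"}]' \
  | jq -c -f phrases.jq
# → {"Company Address":"123 Main St. Smallville KS 606101"}
```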
I'm trying to parse a JSON file in order to create a deletion list for an Artifactory instance.
I'd like to group the objects by two fields, repo and path, then keep the two objects in each group with the most recent "modified" timestamps and delete all the other objects in the JSON file.
So, an original file that looks like this:
{
"results": [
{
"repo": "repo1",
"path": "docker_image_dynamic",
"size": 3624,
"modified": "2016-10-01T06:22:16.335Z"
},
{
"repo": "repo1",
"path": "docker_image_dynamic",
"size": 3646,
"modified": "2016-10-01T07:03:58.465Z"
},
{
"repo": "repo1",
"path": "docker_image_dynamic",
"size": 3646,
"modified": "2016-10-01T07:06:36.522Z"
},
{
"repo": "repo2",
"path": "docker_image_static",
"size": 3624,
"modified": "2016-09-29T20:31:44.054Z"
}
]
}
should become this:
{
"results": [
{
"repo": "repo1",
"path": "docker_image_dynamic",
"size": 3646,
"modified": "2016-10-01T07:03:58.465Z"
},
{
"repo": "repo1",
"path": "docker_image_dynamic",
"size": 3646,
"modified": "2016-10-01T07:06:36.522Z"
},
{
"repo": "repo2",
"path": "docker_image_static",
"size": 3624,
"modified": "2016-09-29T20:31:44.054Z"
}
]
}
This should do it:
.results |= [group_by({repo,path})[] | sort_by(.modified)[-2:][]]
After grouping the items in the array by repo and path, you sort the groups by modified and keep the last two items of the sorted group. Then split the groups up again and collect them into a new array.
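To see it in action, the filter can be run against the sample input saved to a file (the file name input.json is arbitrary); the oldest repo1 entry is dropped and the other three objects survive:

```shell
cat > input.json <<'EOF'
{"results":[
  {"repo":"repo1","path":"docker_image_dynamic","size":3624,"modified":"2016-10-01T06:22:16.335Z"},
  {"repo":"repo1","path":"docker_image_dynamic","size":3646,"modified":"2016-10-01T07:03:58.465Z"},
  {"repo":"repo1","path":"docker_image_dynamic","size":3646,"modified":"2016-10-01T07:06:36.522Z"},
  {"repo":"repo2","path":"docker_image_static","size":3624,"modified":"2016-09-29T20:31:44.054Z"}
]}
EOF
# {repo,path} is shorthand for {repo: .repo, path: .path} (jq 1.5+),
# and ISO-8601 timestamps sort correctly as plain strings.
jq '.results |= [group_by({repo,path})[] | sort_by(.modified)[-2:][]]' input.json
```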
Here is a more cumbersome solution which uses reduce to maintain a temporary object with the last two values for each repo and path. It's probably not better than Jeff's solution unless the input contains a large number of entries for each combination of (repo, path):
{
  results: [
    reduce .results[] as $r (
      {};                              # temporary object
      (
        getpath([$r.repo, $r.path])    # keep the latest two
        | . + [$r]                     # elements for each
        | sort_by(.modified)[-2:]      # repo and path in the
      ) as $new                        # temporary object
      | setpath([$r.repo, $r.path]; $new)
    )
    | .[] | .[] | .[]                  # extract the saved elements
  ]
}
@jq170727 makes a good point about the potential inefficiency of using group_by, since group_by involves sorting. In practice, the sort is probably too fast to matter, but if it is of concern, we can define our own sort-free version of group_by very easily:
# sort-free variant of group_by/1
# f must always evaluate to a string.
# Output: an object
def GROUP_BY(f): reduce .[] as $x ({}; .[$x|f] += [$x] );
@JeffMercado's solution can now be used with the help of tojson as follows:
.results |= [GROUP_BY({repo,path}|tojson)[] | sort_by(.modified)[-2:][]]
GROUP_BY/2
To avoid the call to tojson, we can tweak the above to produce the following even faster solution:
def GROUP_BY(f;g): reduce .[] as $x ({}; .[$x|f][$x|g] += [$x]);
.results |= [GROUP_BY(.repo;.path)[][] | sort_by(.modified)[-2:][]]
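As a toy cross-check (field values here are arbitrary), the sort-free version produces the same result as the group_by solution, keeping the two newest r1 entries plus the single r2 entry:

```shell
echo '{"results":[
  {"repo":"r1","path":"p","modified":"2016-10-01T06:22:16Z"},
  {"repo":"r1","path":"p","modified":"2016-10-01T07:03:58Z"},
  {"repo":"r1","path":"p","modified":"2016-10-01T07:06:36Z"},
  {"repo":"r2","path":"q","modified":"2016-09-29T20:31:44Z"}
]}' | jq -c '
  def GROUP_BY(f;g): reduce .[] as $x ({}; .[$x|f][$x|g] += [$x]);
  # [][] iterates the two levels of the nested object, yielding each group.
  .results |= [GROUP_BY(.repo;.path)[][] | sort_by(.modified)[-2:][]]'
```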
Comments aside, here is a more concise (and more jq-esque (*)) way to express @jq170727's solution:
.results |= [reduce .[] as $r ( {};
.[$r.repo][$r.path] |= ((.+[$r]) | sort_by(.modified)[-2:]))
| .[][]]
(*) Specifically no getpath, setpath, or $new; and |= lessens redundancy.
I'm unsure if "transpose" is the correct term here, but I'm looking to use jq to transpose a 2-dimensional object such as this:
[
{
"name": "A",
"keys": ["k1", "k2", "k3"]
},
{
"name": "B",
"keys": ["k2", "k3", "k4"]
}
]
I'd like to transform it to:
{
"k1": ["A"],
"k2": ["A", "B"],
"k3": ["A", "B"],
"k4": ["B"]
}
I can split out the object with .[] | {key: .keys[], name} to get a list of keys and names, or I could use .[] | {(.keys[]): [.name]} to get a collection of key–value pairs {"k1": ["A"]} and so on, but I'm unsure of the final concatenation step for either approach.
Are either of these approaches heading in the right direction? Is there a better way?
This should work:
map({ name, key: .keys[] })
| group_by(.key)
| map({ key: .[0].key, value: map(.name) })
| from_entries
The basic approach is to convert each object to name/key pairs, regroup them by key, then map them out to entries of an object.
This produces the following output:
{
"k1": [ "A" ],
"k2": [ "A", "B" ],
"k3": [ "A", "B" ],
"k4": [ "B" ]
}
Here is a simple solution that may also be easier to understand. It is based on the idea that a dictionary (a JSON object) can be extended by adding details about additional (key -> value) pairs:
# input: a dictionary to be extended by key -> value
# for each key in keys
def extend_dictionary(keys; value):
reduce keys[] as $key (.; .[$key] += [value]);
reduce .[] as $o ({}; extend_dictionary($o.keys; $o.name) )
$ jq -c -f transpose-object.jq input.json
{"k1":["A"],"k2":["A","B"],"k3":["A","B"],"k4":["B"]}
Here is a better solution for the case that all the values of "name" are distinct. It is better because it uses a completely generic filter, invertMapping; that is, invertMapping could be a built-in or library function. With the help of this function, the solution becomes a simple three-liner. Furthermore, if the values of "name" are not all distinct, then the solution below can easily be tweaked by modifying the initial reduction of the input (i.e. the line immediately above the invocation of invertMapping).
# input: a JSON object of (key, values) pairs, in which "values" is an array of strings;
# output: a JSON object representing the inverse relation
def invertMapping:
reduce to_entries[] as $pair
({}; reduce $pair.value[] as $v (.; .[$v] += [$pair.key] ));
map( { (.name) : .keys} )
| add
| invertMapping
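invertMapping can also be exercised on its own, independent of this question's input shape:

```shell
# Invert a mapping of names to key lists into keys to name lists.
echo '{"A":["k1","k2"],"B":["k2"]}' \
  | jq -c 'def invertMapping:
             reduce to_entries[] as $pair
               ({}; reduce $pair.value[] as $v (.; .[$v] += [$pair.key]));
           invertMapping'
# → {"k1":["A"],"k2":["A","B"]}
```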
I have a hierarchically deep JSON object created by a scientific instrument, so the file is somewhat large (1.3MB) and not readily readable by people. I would like to get a list of keys, up to a certain depth, for the JSON object. For example, given an input object like this
{
"acquisition_parameters": {
"laser": {
"wavelength": {
"value": 632,
"units": "nm"
}
},
"date": "02/03/2525",
"camera": {}
},
"software": {
"repo": "github.com/username/repo",
"commit": "a7642f",
"branch": "develop"
},
"data": [{},{},{}]
}
I would like output like this:
{
"acquisition_parameters": [
"laser",
"date",
"camera"
],
"software": [
"repo",
"commit",
"branch"
]
}
This is mainly for the purpose of being able to enumerate what is in a JSON object. After processing, the JSON objects from the instrument begin to diverge: for example, some may have a field like .frame.cross_section.stats.fwhm, while others may have .sample.species, so it would be convenient to be able to interrogate the JSON object on the command line.
The following should do exactly what you want:
jq '[(keys - ["data"])[] as $key | { ($key): .[$key] | keys }] | add'
This will give the following output, using the input you described above:
{
"acquisition_parameters": [
"camera",
"date",
"laser"
],
"software": [
"branch",
"commit",
"repo"
]
}
Given your purpose you might have an easier time using the paths builtin to list all the paths in the input and then truncate at the desired depth:
$ echo '{"a":{"b":{"c":{"d":true}}}}' | jq -c '[paths|.[0:2]]|unique'
[["a"],["a","b"]]
Here is another variation using reduce and setpath, which assumes you have a specific set of top-level keys you want to examine:
. as $v
| reduce ("acquisition_parameters", "software") as $k (
    {}; setpath([$k]; $v[$k] | keys)
  )
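Applied to a trimmed-down version of the sample input, this variation yields the desired shape:

```shell
# keys returns each sub-object's key names in sorted order.
echo '{"acquisition_parameters":{"laser":{},"date":"02/03/2525","camera":{}},
       "software":{"repo":"github.com/username/repo","commit":"a7642f","branch":"develop"},
       "data":[{},{},{}]}' \
  | jq -c '. as $v
      | reduce ("acquisition_parameters", "software") as $k (
          {}; setpath([$k]; $v[$k] | keys)
        )'
# → {"acquisition_parameters":["camera","date","laser"],"software":["branch","commit","repo"]}
```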