Convert this JSON to a tree and find the paths to parents

data = {
    "saturn": ["planet", "american_car", "car"],
    "american_car": ["car", "gas_driven_automobile"],
    "planet": ["large_object", "celestial_body"],
    "large_object": [],
    "gas_driven_automobile": ["gas_powered_road_vehicle", "car"],
    "car": ["vehicle", "motor_vehicle"],
    "vehicle": [],
    "motor_vehicle": [],
    "gas_powered_road_vehicle": [],
    "celestial_body": []
}
I need to write an algorithm such that, given the input "saturn", I get all the possible paths from saturn up to its various ancestors. For example:
saturn -> planet -> large_object
saturn -> american_car -> car -> vehicle
saturn -> american_car -> car -> motor_vehicle
saturn -> american_car -> gas_driven_automobile -> gas_powered_road_vehicle
saturn -> american_car -> gas_driven_automobile -> car -> vehicle
and all the other possible paths.
I was thinking of converting this to a tree and then using a library to calculate the paths from the child to the parents, but I can't figure out how to start on building the tree.

Using jq, you can simply define a recursive function:
def parents($key):
  if has($key)
  then if .[$key] == [] then []
       else .[$key][] as $k | [$k] + parents($k)
       end
  else []
  end;
To use it to produce the "->"-style output, invoke jq with the -r command-line option, and call the above function like so:
["saturn"] + parents("saturn")
| join(" -> ")
More economically:
def lineages($key):
  [$key] + (lineages(.[$key][]) // []);
lineages("saturn") | join(" -> ")

Related

how to calculate time duration from two date/time values using jq

I have some JSON that looks like this:
[
  {
    "start": "20200629T202456Z",
    "end": "20200629T211459Z",
    "tags": [
      "WPP",
      "WPP review tasks, splashify popup",
      "clients",
      "work"
    ],
    "annotation": "update rules, fix drush errors, create base wpp-splash module."
  },
  {
    "start": "20200629T223000Z",
    "end": "20200629T224641Z",
    "tags": [
      "WPP",
      "WPP review tasks, splashify popup",
      "clients",
      "work"
    ]
  }
]
and I want to show a duration of hours:minutes instead of the "start" and "end" times.
The time format might be a little unusual; it comes from timewarrior. I imagine this would be easier for jq if the date/time were stored as a normal Unix timestamp, but maybe this is still possible? Could jq write the output like
[
  {
    "time": "0:50:03",
    "tags": [
      "WPP",
      "WPP review tasks, splashify popup",
      "clients",
      "work"
    ],
    "annotation": "update rules, fix drush errors, create base wpp-splash module."
  }
]
or something similar.
Is that possible?
For clarity, let's define a helper function:
def duration($finish; $start):
  def twodigits: "00" + tostring | .[-2:];
  [$finish, $start]
  | map(strptime("%Y%m%dT%H%M%SZ") | mktime)  # seconds
  | .[0] - .[1]
  | (. % 60 | twodigits) as $s
  | (((. / 60) % 60) | twodigits) as $m
  | (. / 3600 | floor) as $h
  | "\($h):\($m):\($s)" ;
The solution is now simply:
map( {time: duration(.end;.start)} + del(.start,.end) )
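To run it, you could save the def together with the map(...) line in, say, duration.jq (the filename is only an example) and invoke:
$ jq -f duration.jq input.json
For the two sample entries above this should yield "time": "0:50:03" and "time": "0:16:41" respectively, with the second object keeping only its tags since it has no annotation.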

How to filter complex JSON using jq toolset and regular expressions into new JSON array of objects

First ~ Thanks for taking the time to read this. If any further information or restatements are needed, please comment so that I can improve the question. I am new to jq and appreciate any assistance provided. If there is any confusion in the topic, it is due to my lack of experience with the jq tool. This seems fairly complex, so even a partial answer is welcome.
Background
I have some JSON objects within a series of JSON arrays (sample at bottom). The objects have a number of elements but only the values associated with the "data" key are of interest to me. I want to output a single array of JSON objects where the values are translated into key/value pairs based on some regular expression rules.
Essentially, I want to combine multiple "data" values to form a key-phrase (and then a value-phrase) and output these as the array of target objects. I believe I should be able to use a regular expression or a set of known text (for the key-phrase) to compile the text into a single key or value.
Current Logic
Using: jq-1.5, Mac OS 10.12.6, Bash terminal
One thing I have examined is the colon (:) in the value field, since it indicates the end of a key-phrase. So, for example, the following represents the key "Company Address":
"data":"Company ",
...
"data": "Address:"
...
{
  "top": 333,
  "left": 520,
  "width": 66,
  "height": 15,
  "font": 5,
  "data": "123 Main St. "
...
"data":"Smallville "
...
"data":"KS "
...
"data":"606101"
In this case, the colon in the value indicates that the next value attached to the following useful "data" key is the beginning of an address.
A space trailing the value indicates that the next data value found is a continuation of the key phrase or the value phrase I am attempting to combine into a new JSON object.
I have a set of values that I can use to delimit a new JSON object. Essentially the following example would allow me to create a key "Company Name":
...
"data":"Company "
...
"data":"Name"
(note that this entry does not have a colon but the pattern will be the start of each new JSON object to be generated)
Notes
I can determine when the end of a key or value is reached depending on whether or not its value ends with a space (if there is no space, I consider the value to be the end of the value-phrase and begin capturing the next key-phrase).
Things I've tried
Any assistance with translating this logic into one or more useful jq filter(s) would be greatly appreciated. I've taken a look at the JQ Cookbook, the JQ Manual, this article, examined other SO questions on jq, and made an evaluation of an alternate tool (underscore_cli). I am new to jq and my naive expressions keep failing...
I've tried some simple tests to attempt to select values of interest. (I am not able to successfully walk the JSON tree to get to the information under the text array. Another wrinkle is that I have multiple text arrays. Is it possible to have the same algorithm performed on each array of objects?)
jq -s '.[] | select(.data | contains(":"))'
jq: error (at :0): Cannot index array with string "data"
Sample
A sample of the header JSON
[
  {
    "number": 1,
    "pages": 254,
    "height": 1263,
    "width": 892,
    "fonts": [
      {
        "fontspec": "0",
        "size": "-1",
        "family": "Times",
        "color": "#ffffff"
      },
      {
        "fontspec": "1",
        "size": "31",
        "family": "Times",
        "color": "#000000"
      },
      {
        "fontspec": "2",
        "size": "16",
        "family": "Helvetica",
        "color": "#000000"
      },
      {
        "fontspec": "3",
        "size": "13",
        "family": "Times",
        "color": "#237db8"
      },
      {
        "fontspec": "4",
        "size": "17",
        "family": "Times",
        "color": "#000000"
      },
      {
        "fontspec": "5",
        "size": "13",
        "family": "Times",
        "color": "#000000"
      },
      {
        "fontspec": "6",
        "size": "8",
        "family": "Times",
        "color": "#9f97a7"
      },
      {
        "fontspec": "7",
        "size": "10",
        "family": "Times",
        "color": "#9f97a7"
      }
    ],
    "text": [
      {
        "top": 83,
        "left": 60,
        "width": 0,
        "height": 1,
        "font": 0,
        "data": " "
      },
      {
        "top": 333,
        "left": 68,
        "width": 68,
        "height": 15,
        "font": 5,
        "data": "Company "
      },
      {
        "top": 333,
        "left": 135,
        "width": 40,
        "height": 15,
        "font": 5,
        "data": "Name"
      },
      ...(more of these objects with data)
    ]
  }
]
I am looking to output a JSON array of objects whose keys are composed of known strings (patterns) for the key/value pair, bound by a colon (:) indicating the end of a key-phrase, and whose next data-value would be the start of the value-phrase. The presence of a trailing space indicates that the data-value should be appended as part of the value-phrase until the trailing space no longer appears in the data-value. At that point the next data-value represents the start of another key-phrase.
UPDATE #1:
The answers below are very helpful. I've gone back to the jq manual and incorporated the advice below. I am getting a string but am unable to separate out the set of data tags into a single string.
.[].text | tostring
However, I am seeing the JSON being escaped and the other tags, top, left, right (along with their values), showing up in the string. I'd like to have only the tokens associated with the data key as a string, and then run the regular expressions over that string to parse out a set of JSON objects where the keys and values can be defined.
From what I can tell, you're trying to get all the "data" elements and concatenate them into a single string.
Should be simple enough to do:
[.. | .data? | select(. != null) | tostring] | join("")
There's not enough example data to know where one "grouping" of data begins and ends. But assuming every item in the root array is a single phrase, select each item first before performing the search (or map them):
map([.. | .data? | select(. != null) | tostring] | join(""))
If you ultimately want to parse the data bits into a JSON object, it's not too far off:
map(
  [.. | .data? | select(. != null) | tostring]
  | join("")
  | split(":") as [$key, $value]
  | {$key, $value}
) | from_entries
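As a quick, hypothetical illustration (the input below is a made-up, minimal stand-in for the real document, not actual instrument output), this is roughly what the last filter yields:
$ echo '[{"text":[{"data":"Company "},{"data":"Name:"},{"data":"Acme "},{"data":"Inc."}]}]' \
  | jq 'map([.. | .data? | select(. != null) | tostring] | join("")
            | split(":") as [$key,$value] | {$key,$value}) | from_entries'
{
  "Company Name": "Acme Inc."
}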
You may want to consider using jq Streaming for this. With your sample data the following filter picks out the paths to the "data" attributes:
tostream
| select(length==2) as [$p,$v]
| select($p[-1]=="data")
| [$p,$v]
If this is in filter.jq and your sample data is in data.json, the command
$ jq -Mc -f filter.jq data.json
produces
[[0,"text",0,"data"]," "]
[[0,"text",1,"data"],"Company "]
[[0,"text",2,"data"],"Name"]
From this you can see your data has information in the paths .[0].text[0].data, .[0].text[1].data and .[0].text[2].data.
You can build on this using reduce to collect the values into groups based on the presence of the trailing space. With your data the following filter
reduce (
    tostream
    | select(length==2) as [$p,$v]
    | select($p[-1]=="data")
  ) as [$p,$v] (
    [""]
    ; .[-1] += $v
    | if $v|endswith(" ")|not then . += [""] else . end
  )
| map(select(. != ""))
produces
[" Company Name"]
This example only groups data into a list. You can use a more sophisticated reduce if you need.
Here is a Try it online! link you can experiment with.
To take this further let's use the following sample data:
[
{ "data":"Company " },
{ "data": "Address:" },
{ "data":"123 Main St. " },
{ "data":"Smallville " },
{ "data":"KS " },
{ "data":"606101" }
]
The filter as is will generate
["Company Address:","123 Main St. Smallville KS 606101"]
To convert that into an object you could add another reduce. For example this filter
reduce (
    tostream
    | select(length==2) as [$p,$v]
    | select($p[-1]=="data")
  ) as [$p,$v] (
    [""]
    ; .[-1] += $v
    | if $v|endswith(" ")|not then . += [""] else . end
  )
| map(select(. != ""))
| reduce .[] as $e (
    {k:"", o:{}}
    ; if $e|endswith(":") then .k = $e[:-1] else .o[.k] += $e end
  )
| .o
produces
{"Company Address":"123 Main St. Smallville KS 606101"}
One last thing: at this point the filter is getting pretty large, so it would make sense to refactor a bit and break it down into functions so that it's easier to manage and extend. For example:
def extract:
  [ tostream
    | select(length==2) as [$p,$v]   # collect values for
    | select($p[-1]=="data")         # paths ending in "data"
    | $v                             # in an array
  ]
;
def gather:
  reduce .[] as $v (
    [""]                             # state: list of grouped values
    ; .[-1] += $v                    # add value to last group
    | if $v|endswith(" ")|not        # if the value did not end with " "
      then . += [""]                 # start a new group
      else .
      end
  )
  | map(select(. != ""))             # drop the trailing empty group
;
def combine:
  reduce .[] as $e (
    {k:"", o:{}}                     # k: current key  o: combined object
    ; if $e|endswith(":")            # if the value ends with a ":"
      then .k = $e[:-1]              # use it as the new current key
      else .o[.k] += $e              # otherwise add to the current key's value
      end
  )
  | .o                               # produce the final object
;
extract    # extract "data" values
| gather   # gather into groups
| combine  # combine into an object
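If everything above is saved in filter.jq and the simplified sample is stored in, say, sample.json (the filename is only an example), the refactored pipeline should behave exactly like the inline version:
$ jq -Mc -f filter.jq sample.json
{"Company Address":"123 Main St. Smallville KS 606101"}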

Sorting and filtering a json file using jq

I'm trying to parse a JSON file in order to create a deletion list for an Artifactory instance.
I'd like to group the entries by two fields, repo and path, then keep, for each grouping, the two objects with the most recent "modified" timestamps and delete all the other objects in the JSON file.
So, an original file that looks like this:
{
  "results": [
    {
      "repo": "repo1",
      "path": "docker_image_dynamic",
      "size": 3624,
      "modified": "2016-10-01T06:22:16.335Z"
    },
    {
      "repo": "repo1",
      "path": "docker_image_dynamic",
      "size": 3646,
      "modified": "2016-10-01T07:03:58.465Z"
    },
    {
      "repo": "repo1",
      "path": "docker_image_dynamic",
      "size": 3646,
      "modified": "2016-10-01T07:06:36.522Z"
    },
    {
      "repo": "repo2",
      "path": "docker_image_static",
      "size": 3624,
      "modified": "2016-09-29T20:31:44.054Z"
    }
  ]
}
should become this:
{
  "results": [
    {
      "repo": "repo1",
      "path": "docker_image_dynamic",
      "size": 3646,
      "modified": "2016-10-01T07:03:58.465Z"
    },
    {
      "repo": "repo1",
      "path": "docker_image_dynamic",
      "size": 3646,
      "modified": "2016-10-01T07:06:36.522Z"
    },
    {
      "repo": "repo2",
      "path": "docker_image_static",
      "size": 3624,
      "modified": "2016-09-29T20:31:44.054Z"
    }
  ]
}
This should do it:
.results |= [group_by({repo,path})[] | sort_by(.modified)[-2:][]]
After grouping the items in the array by repo and path, you sort each group by modified and keep its last two items. Then the groups are split up again and collected into a new array.
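A minimal invocation, assuming the original document is stored in results.json (the filename is only illustrative):
$ jq '.results |= [group_by({repo,path})[] | sort_by(.modified)[-2:][]]' results.json
This should reproduce the trimmed document shown in the question.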
Here is a more cumbersome solution which uses reduce to maintain a temporary object with the last two values for each repo and path. It's probably not better than Jeff's solution unless the input contains a large number of entries for each combination of (repo, path):
{
  results: [
    reduce .results[] as $r (
      {}                                   # temporary object
      ; ( getpath([$r.repo, $r.path])      # keep the latest two
          | . + [$r]                       # elements for each
          | sort_by(.modified)[-2:]        # repo and path in a
        ) as $new                          # temporary object
      | setpath([$r.repo, $r.path]; $new)
    )
    | .[] | .[] | .[]                      # extract saved elements
  ]
}
#jq170727 makes a good point about the potential inefficiency of using group_by, since group_by involves sorting. In practice, the sort is probably too fast to matter, but if it is of concern, we can define our own sort-free version of group_by very easily:
# sort-free variant of group_by/1
# f must always evaluate to a string.
# Output: an object
def GROUP_BY(f): reduce .[] as $x ({}; .[$x|f] += [$x] );
#JeffMercado's solution can now be used with the help of tojson as follows:
.results |= [GROUP_BY({repo,path}|tojson)[] | sort_by(.modified)[-2:][]]
GROUP_BY/2
To avoid the call to tojson, we can tweak the above to produce the following even faster solution:
def GROUP_BY(f;g): reduce .[] as $x ({}; .[$x|f][$x|g] += [$x]);
.results |= [GROUP_BY(.repo;.path)[][] | sort_by(.modified)[-2:][]]
Comments aside, here is a more concise (and more jq-esque (*)) way to express #jq170727's solution:
.results |= [reduce .[] as $r ( {};
    .[$r.repo][$r.path] |= ((. + [$r]) | sort_by(.modified)[-2:]) )
  | .[][][]]
(*) Specifically no getpath, setpath, or $new; and |= lessens redundancy.

Extract all unique keys from arbitrarily nested json data with jq

As the subject states, my goal is to write an all_keys function to extract all keys from an arbitrarily nested json blob, traversing contained arrays and objects as needed, and outputting an array containing the keys, without duplicates.
For instance, given the following input:
[
{"name": "/", "children": [
{"name": "/bin", "children": [
{"name": "/bin/ls", "children": []},
{"name": "/bin/sh", "children": []}]},
{"name": "/home", "children": [
{"name": "/home/stephen", "children": [
{"name": "/home/stephen/jq", "children": []}]}]}]},
{"name": "/", "children": [
{"name": "/bin", "children": [
{"name": "/bin/ls", "children": []},
{"name": "/bin/sh", "children": []}]},
{"name": "/home", "children": [
{"name": "/home/stephen", "children": [
{"name": "/home/stephen/jq", "children": []}]}]}]}
]
The all_keys function should produce this output:
[
"children",
"name"
]
To this end, I devised the following function, but it's as slow as it is convoluted, so I was wondering whether you could come up with a more concise and faster way of obtaining the same result.
def all_keys:
  . as $in |
  if type == "object" then
    reduce keys[] as $k (
      [];
      . + [$k, ($in[$k] | all_keys)[]]
    ) | unique
  elif type == "array" then (
    reduce .[] as $i (
      [];
      . + ($i | all_keys)
    ) | unique
  )
  else
    empty
  end
;
For reference, running that function on this 53MB JSON file takes roughly 22 seconds on my Intel T9300@2.50GHz CPU (I know, it's quite ancient but still works fine).
A naive approach would just collect all the keys and get the unique values.
[.. | objects | keys[]] | unique
But with that data, it's a bit on the slow side since the keys need to be collected and sorted.
We could do a little better with this. Since we're trying to determine all the distinct keys, we'd use a hashmap of some sort to be more efficient. Well, we have objects that can act as such.
reduce (.. | objects | keys[]) as $k ({}; .[$k] = true) | keys
I didn't measure the time on this, but it's orders of magnitude faster than the other version. I didn't even wait for the other to finish; this one was well within 10 seconds on my work machine (i5-2400@3.1GHz).
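As a quick sanity check, both filters agree on the sample input from the question (saved here as input.json, an illustrative filename):
$ jq -c '[.. | objects | keys[]] | unique' input.json
["children","name"]
$ jq -c 'reduce (.. | objects | keys[]) as $k ({}; .[$k] = true) | keys' input.json
["children","name"]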
I think you'll find the following variant of the OP's all_keys is actually slightly faster than the version using ..; this is probably to be expected -- for jeopardy.json, .. generates altogether 1,731,807 JSON entities whereas there are only 216,930 JSON objects:
def all_keys:
  def uniquely(f): reduce f as $x ({}; .[$x] = true) | keys;
  def rkeys:
    if type == "object" then keys[] as $k | ($k, (.[$k]|rkeys))
    elif type == "array" then .[]|rkeys
    else empty
    end;
  uniquely(rkeys);
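To run this variant, append a line that simply invokes all_keys after the definition and save the whole thing to a file, e.g. all_keys.jq (the name is only illustrative):
$ jq -c -f all_keys.jq jeopardy.json
For the nested sample in the question this prints ["children","name"]; for larger files such as jeopardy.json it prints the full list of distinct keys.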

Using jq to list keys in a JSON object

I have a hierarchically deep JSON object created by a scientific instrument, so the file is somewhat large (1.3MB) and not readily readable by people. I would like to get a list of keys, up to a certain depth, for the JSON object. For example, given an input object like this
{
  "acquisition_parameters": {
    "laser": {
      "wavelength": {
        "value": 632,
        "units": "nm"
      }
    },
    "date": "02/03/2525",
    "camera": {}
  },
  "software": {
    "repo": "github.com/username/repo",
    "commit": "a7642f",
    "branch": "develop"
  },
  "data": [{},{},{}]
}
I would like output like this:
{
  "acquisition_parameters": [
    "laser",
    "date",
    "camera"
  ],
  "software": [
    "repo",
    "commit",
    "branch"
  ]
}
This is mainly for the purpose of being able to enumerate what is in a JSON object. After processing, the JSON objects from the instrument begin to diverge: for example, some may have a field like .frame.cross_section.stats.fwhm, while others may have .sample.species, so it would be convenient to be able to interrogate the JSON object on the command line.
The following should do exactly what you want
jq '[(keys - ["data"])[] as $key | { ($key): .[$key] | keys }] | add'
This will give the following output, using the input you described above:
{
  "acquisition_parameters": [
    "camera",
    "date",
    "laser"
  ],
  "software": [
    "branch",
    "commit",
    "repo"
  ]
}
Given your purpose you might have an easier time using the paths builtin to list all the paths in the input and then truncate at the desired depth:
$ echo '{"a":{"b":{"c":{"d":true}}}}' | jq -c '[paths|.[0:2]]|unique'
[["a"],["a","b"]]
Here is another variation using reduce and setpath which assumes you have a specific set of top-level keys you want to examine:
. as $v
| reduce ("acquisition_parameters", "software") as $k (
    {}; setpath([$k]; $v[$k] | keys)
  )
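With the sample input above, this should produce the same (key-sorted) listing as the first answer:
{
  "acquisition_parameters": [
    "camera",
    "date",
    "laser"
  ],
  "software": [
    "branch",
    "commit",
    "repo"
  ]
}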