How can I reference dynamic keys on an object in jq?

I'm trying to define some custom filter functions and one thing I need to be able to do is pass a list of strings to a filter and get the corresponding values of the input object. For example:
jq -n '{name: "Chris", age: 25, city: "Chicago"} | myFilter(["name", "age"])'
should return:
{"name": "Chris", "age": 25}.
I know I can use .[some_string] to dynamically get a value on an object for a specific string key, but I don't know how to utilize this for multiple string keys. I think the problem I'm running into is that jq by default iterates over objects streamed into a filter, but doesn't give a way to iterate over an argument to that filter, even if I use def myFilter($var) syntax that the manual recommends for value-argument behavior.

You could easily define your myFilter using reduce:
def myFilter($keys):
  . as $in
  | reduce $keys[] as $k (null; . + {($k): $in[$k]});
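With that definition in place, a quick check (pasting the def inline and using -c for compact output) reproduces the desired result:
jq -nc 'def myFilter($keys): . as $in | reduce $keys[] as $k (null; . + {($k): $in[$k]}); {name: "Chris", age: 25, city: "Chicago"} | myFilter(["name", "age"])'
{"name":"Chris","age":25}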
More interestingly, if you're willing to modify your "Query By Example" requirements slightly, you can simply specify the keys of interest in braces, as illustrated by this example:
jq -n '{name: "Chris", age: 25, city: "Chicago"} | {name, age}'
If any of the keys cannot be specified in this abbreviated format, simply double-quote them.
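For instance, with a hypothetical key containing a space (which cannot be written as a bare identifier):
jq -n '{name: "Chris", "home city": "Chicago"} | {name, "home city"}'
{
  "name": "Chris",
  "home city": "Chicago"
}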

Related

Maintain order of jq CSV output and include empty values

I am currently working on a bash script that combines the output of both the aws iam list-users and aws iam list-user-tags commands in a CSV file containing all users along with their respective information and assigned tags. To parse the JSON output of those commands I chose to use jq.
Retrieving, parsing and converting (JSON to CSV) the list-users output works fine and produces the expected comma-separated list of values.
The output of list-user-tags does not quite behave that way. Its JSON output has the following schema:
{
  Tags: [
    {
      Key: "Name",
      Value: "NameOfUser"
    },
    {
      Key: "Email",
      Value: "EmailOfUser"
    },
    {
      Key: "Company",
      Value: "CompanyOfUser"
    }
  ]
}
Unfortunately the order of the tags is not consistent across users (and possibly across queries) which currently makes it impossible for me to maintain the order defined in the CSV file. On top of that there is the possibility of one or multiple missing tags.
What I am looking for is a way to achieve the following (preferably using jq):
Select a tag's "Value" by its "Key"
Check whether it exists and, if not, add an empty entry
Put the value in the exact same place every time (maintain a certain order)
Repeat for every entry in the original output
Convert the resulting array of values into CSV
What I tried so far:
aws iam list-user-tags --user-name abcdef --no-cli-pager \
| jq -r '[.Tags[] | select(.Key=="Name"),select(.Key=="Email"),select(.Key=="Company") | .Value // ""] | @csv'
Any help is much appreciated!
Let's suppose your sample "schema" is in a file named schema.jq
Then
jq -n -f schema.jq | jq -r '
  def get($key):
    map(select(.Key == $key))
    | if length == 0 then null else .[0].Value end;
  .Tags | [get("Name"), get("Email"), get("Company")] | @csv
'
produces the following CSV:
"NameOfUser","EmailOfUser","CompanyOfUser"
It should be easy to adapt this illustration to your needs.
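For the broader goal of one CSV row per user, a rough sketch (assuming the default aws iam list-users output with a Users[].UserName field; adjust tag names and error handling as needed) could look like:
aws iam list-users --no-cli-pager \
| jq -r '.Users[].UserName' \
| while read -r user; do
    aws iam list-user-tags --user-name "$user" --no-cli-pager \
    | jq -r '
        def get($key):
          map(select(.Key == $key))
          | if length == 0 then null else .[0].Value end;
        # @csv renders null as an empty field, so missing tags keep their column
        .Tags | [get("Name"), get("Email"), get("Company")] | @csv'
  done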

How to find something in a json file using Bash

I would like to search a JSON file for some key or value, and have it print where it was found.
For example, when using jq to print out my Firefox extensions.json, I get something like this (using "..." here to skip long parts):
{
  "schemaVersion": 31,
  "addons": [
    {
      "id": "wetransfer#extensions.thunderbird.net",
      "syncGUID": "{e6369308-1efc-40fd-aa5f-38da7b20df9b}",
      "version": "2.0.0",
      ...
    },
    {
      ...
    }
  ]
}
Say I would like to search for "wetransfer#extensions.thunderbird.net", and would like an output which shows me where it was found with something like this:
{ "addons": [ {"id": "wetransfer#extensions.thunderbird.net"} ] }
Is there a way to get that with jq or with some other json tool?
I also tried to simply list the various ids in that file, and hoped that I would get them with jq '.id', but that just returned null, because it apparently needs the full path.
In other words, I'm looking for a command-line JSON parser which I could use in a way similar to XPath tools.
The path() function comes in handy:
$ jq -c 'path(.. | select(. == "wetransfer#extensions.thunderbird.net"))' input.json
["addons",0,"id"]
The resulting path is interpreted as "In the addons field of the initial object, the first array element's id field matches". You can use it with getpath(), setpath(), delpaths(), etc. to get or manipulate the value it describes.
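For instance, feeding that path back through getpath() (assuming input.json holds a valid-JSON version of the sample) recovers the matching value:
$ jq 'getpath(["addons",0,"id"])' input.json
"wetransfer#extensions.thunderbird.net"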
Using your example with modifications to make it valid JSON:
< input.json jq -c --arg s wetransfer#extensions.thunderbird.net '
paths as $p | select(getpath($p) == $s) | null | setpath($p;$s)'
produces:
{"addons":[{"id":"wetransfer#extensions.thunderbird.net"}]}
Note
If there are N paths to the given value, the above will produce N lines. If you want only the first, you could wrap everything in first(...).
Listing all the "id" values
I also tried to simply list the various ids in that file
Assuming that "id" values of false and null are of no interest, you can print all the "id" values of interest using the jq filter:
.. | .id? // empty
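For example, run against the same valid-JSON sample file:
$ jq '.. | .id? // empty' input.json
"wetransfer#extensions.thunderbird.net"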

How to get max value of a date field in a large json file?

I have a large JSON file, around 500 MB, which is the response of a URL call. I need to get the max value of the "date" field in the "results" array of the JSON file using a shell script (bash). Currently I am using jq as below. This works fine for smaller files, but for larger files it returns null.
maxDate=$(cat ${jsonfilePath} | jq '[ .results[]?.date ] | max')
Please help. Thanks! I am new to shell scripting, JSON and jq.
sample/input json file contents:
{
  "results": [
    {
      "Id": "123",
      "date": 1588910400000,
      "col": "test"
    },
    {
      "Id": "1234",
      "date": 1588910412345,
      "col": "test2"
    }
  ],
  "col2": 123
}
Given the --stream option on the command line, jq won't load the whole input into memory; instead it reads the input token by token, producing arrays in this fashion:
[["results",0,"Id"],"123"]
[["results",0,"date"],1588910400000]
...
[["results",1,"date"],1588910412345]
...
Thanks to this feature, we can pick only dates from the input and find out the maximum one without exhausting the memory (at the expense of speed). For example:
jq -n --stream 'reduce (inputs|select(.[0][-1]=="date" and length==2)[1]) as $d (null; [.,$d]|max)' file
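Against the sample input above, this prints 1588910412345.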
500MB should not be so large as to require the --stream option, which generally slows things down. Here then is a fast and efficient(*) solution that does not use the streaming option, but instead uses a generic, stream-oriented "max_by" function defined as follows:
# max_by(empty;1) yields null
def max_by(s; f):
  reduce s as $s (null;
    if . == null then {s: $s, m: ($s|f)}
    else ($s|f) as $m
    | if $m > .m then {s: $s, m: $m} else . end
    end)
  | .s;
With this in our toolkit, we can simply write:
max_by(.results[].date; .)
This of course assumes that there is a "results" field containing an array of JSON objects. (**) From the problem statement, it would appear that this assumption does not always hold, so you will probably want to modify whichever approach you choose accordingly (e.g. by checking whether there is a results field, whether it's array-valued, etc.)
(*) Using max_by/2 here is more efficient, both in terms of space and time, than using the built-in max_by/1.
(**) The absence of a "date" subfield should not matter as null is less than every number.
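Putting the definition and the filter together on the command line (just a sketch; the def could equally live in a file loaded with -f):
jq '
  def max_by(s; f):
    reduce s as $s (null;
      if . == null then {s: $s, m: ($s|f)}
      else ($s|f) as $m
      | if $m > .m then {s: $s, m: $m} else . end
      end)
    | .s;
  max_by(.results[].date; .)
' "$jsonfilePath"
With the sample input above, this prints 1588910412345.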
 jq '.results | max_by(.date) | .date' "$jsonfilePath"
is a more efficient way to get the maximum date value out of that JSON that might work better for you. It avoids the Useless Use Of Cat, doesn't create a temporary array of just the date values, and thus only needs one pass through the array.

Get field from JSON object using jq and command line argument

Assume the following JSON file
{
  "foo": "hello",
  "bar": "world"
}
I want to get the foo field from the JSON object in a standalone object, and I do this:
<file jq '{foo}'
{
  "foo": "hello"
}
Now the field I actually want is coming from the shell and is given to jq as an argument like this:
<file jq --arg myarg "foo" '{$myarg}'
{
  "myarg": "foo"
}
Unfortunately this doesn't give the expected result {"foo":"hello"}.
Any idea why the name of the variable gets into the object?
A workaround to this is to explicitly define the object:
<file jq --arg myarg "foo" '{($myarg): .[$myarg]}'
Fine, but is there a way to use the shortcut syntax as explained in the man page, but with a variable ?
You can use this to select particular fields of an object: if the input is an object with “user”, “title”, “id”, and “content” fields and you just want “user” and “title”, you can write
{user: .user, title: .title}
Because that is so common, there’s a shortcut syntax for it: {user, title}.
If that matters, I'm using jq version 1.5
In short, no. The shortcut syntax can only be used under very special conditions. For example, it cannot be used with key names that are jq keywords.
Alternatives
The method described in the Q is the preferred one, but for the record, here are two alternatives:
jq --arg myarg "foo" '
  .[$myarg] as $v | {} | .[$myarg] = $v'
And of course there's the alternative that comes with numerous caveats:
myarg=foo ; jq "{ $myarg }"
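(The caveats being chiefly that the key is expanded by the shell before jq ever sees it, so it must be a valid, non-keyword jq identifier and must come from a trusted source.)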

Flatten nested JSON using jq

I'd like to flatten a nested JSON object, e.g. {"a":{"b":1}} to {"a.b":1}, in order to digest it in Solr.
I have 11 TB of JSON files which are both nested and contain dots in field names, meaning neither Elasticsearch (dots) nor Solr (nested without the _childDocument_ notation) can digest them as is.
The other solution would be to replace dots in the field names with underscores and push them to Elasticsearch, but I have far better experience with Solr, therefore I prefer the flattening solution (unless Solr can digest those nested JSONs as is??).
I will prefer Elasticsearch only if the digestion process takes far less time than Solr's, because my priority is digesting as fast as I can (thus I chose jq instead of scripting it in Python).
Kindly help.
EDIT:
I think the pair of examples 3&4 solves this for me:
https://lucidworks.com/blog/2014/08/12/indexing-custom-json-data/
I'll try soon.
You can also use the following jq command to flatten nested JSON objects in this manner:
[leaf_paths as $path | {"key": $path | join("."), "value": getpath($path)}] | from_entries
The way it works is: leaf_paths returns a stream of arrays which represent the paths on the given JSON document at which "leaf elements" appear, that is, elements which do not have child elements, such as numbers, strings and booleans. We pipe that stream into objects with key and value properties, where key contains the elements of the path array as a string joined by dots and value contains the element at that path. Finally, we put the entire thing in an array and run from_entries on it, which transforms an array of {key, value} objects into an object containing those key-value pairs.
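For instance, with the question's own example as input:
$ echo '{"a":{"b":1}}' | jq '[leaf_paths as $path | {"key": $path | join("."), "value": getpath($path)}] | from_entries'
{
  "a.b": 1
}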
This is just a variant of Santiago's jq:
. as $in
| reduce leaf_paths as $path ({};
    . + { ($path | map(tostring) | join(".")): $in | getpath($path) })
It avoids the overhead of the key/value construction and destruction.
(If you have access to a version of jq later than jq 1.5, you can omit the "map(tostring)".)
Two important points about both these jq solutions:
Arrays are also flattened.
E.g. given {"a": {"b": [0,1,2]}} as input, the output would be:
{
  "a.b.0": 0,
  "a.b.1": 1,
  "a.b.2": 2
}
If any of the keys in the original JSON contain periods, then key collisions are possible; such collisions will generally result in the loss of a value. This would happen, for example, with the following input:
{"a.b":0, "a": {"b": 1}}
Here is a solution that uses tostream, select, join, reduce and setpath:
reduce ( tostream | select(length==2) | .[0] |= [join(".")] ) as [$p,$v] (
  {};
  setpath($p; $v)
)
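Applied to the same small example:
$ echo '{"a":{"b":1}}' | jq 'reduce ( tostream | select(length==2) | .[0] |= [join(".")] ) as [$p,$v] ({}; setpath($p; $v))'
{
  "a.b": 1
}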
I've recently written a script called jqg that flattens arbitrarily complex JSON and searches the results using a regex; to simply flatten the JSON, your regex would be '.', which matches everything. Unlike the answers above, the script will handle embedded arrays, false and null values, and can optionally treat empty arrays and objects ([] & {}) as leaf nodes.
$ jq . test/odd-values.json
{
  "one": {
    "start-string": "foo",
    "null-value": null,
    "integer-number": 101
  },
  "two": [
    {
      "two-a": {
        "non-integer-number": 101.75,
        "number-zero": 0
      },
      "true-boolean": true,
      "two-b": {
        "false-boolean": false
      }
    }
  ],
  "three": {
    "empty-string": "",
    "empty-object": {},
    "empty-array": []
  },
  "end-string": "bar"
}
$ jqg . test/odd-values.json
{
  "one.start-string": "foo",
  "one.null-value": null,
  "one.integer-number": 101,
  "two.0.two-a.non-integer-number": 101.75,
  "two.0.two-a.number-zero": 0,
  "two.0.true-boolean": true,
  "two.0.two-b.false-boolean": false,
  "three.empty-string": "",
  "three.empty-object": {},
  "three.empty-array": [],
  "end-string": "bar"
}
jqg was tested using jq 1.6
Note: I am the author of the jqg script.
As it turns out, curl -XPOST 'http://localhost:8983/solr/flat/update/json/docs' -d @json_file does just this:
{
  "a.b": [1],
  "id": "24e3e780-3a9e-4fa7-9159-fc5294e803cd",
  "_version_": 1535841499921514496
}
EDIT 1: Solr 6.0.1 with bin/solr -e cloud. The collection name is flat; everything else is default (including the data-driven schema, which is also the default).
EDIT 2: The final script I used: find . -name '*.json' -exec curl -XPOST 'http://localhost:8983/solr/collection1/update/json/docs' -d @{} \;.
EDIT 3: It is also possible to parallelize with xargs and to add the id field with jq: find . -name '*.json' -print0 | xargs -0 -n 1 -P 8 -I {} sh -c "cat {} | jq '. + {id: .a.b}' | curl -XPOST 'http://localhost:8983/solr/collection/update/json/docs' -d @-" where -P is the parallelism factor. I used jq to set an id so multiple uploads of the same document won't create duplicates in the collection (when I searched for the optimal value of -P it created duplicates in the collection).
As @hraban mentioned, leaf_paths does not work as expected (furthermore, it is deprecated). leaf_paths is equivalent to paths(scalars); it returns the paths of any values for which scalars returns a truthy value. scalars returns its input value if it is a scalar, or null otherwise. The problem with that is that null and false are not truthy values, so they will be removed from the output. The following code does work, by checking the type of the values directly:
. as $in
| reduce paths(type != "object" and type != "array") as $path ({};
    . + { ($path | map(tostring) | join(".")): $in | getpath($path) })
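As a quick check (illustrative input), null and false leaves now survive the flattening:
$ echo '{"a": {"b": null, "c": false}}' | jq '. as $in
  | reduce paths(type != "object" and type != "array") as $path ({};
      . + { ($path | map(tostring) | join(".")): $in | getpath($path) })'
{
  "a.b": null,
  "a.c": false
}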