Value map with JQ - json

I have a large JSON file in which I'd like to transform some values based on some kind of mapping.
The data I have looks like:
[
  {"id":1, "value":"yes"},
  {"id":2, "value":"no"},
  {"id":3, "value":"maybe"}
]
And I'd like to transform that into a list like this:
[
  {"id":1, "value":"10"},
  {"id":2, "value":"0"},
  {"id":3, "value":"5"}
]
With the fixed mapping:
yes => 10
no => 0
maybe => 5
My current solution is based on a simple if-elif-else combination like this:
jq 'map(.value = (if .value == "yes" then "10" elif .value == "maybe" then "5" else "0" end))' data.json
But this is really ugly and I'd love to have a more direct way to express the mapping.
Thanks for your help

If one wants to avoid having to specify the mapping on the command line, then the following two variants may be of interest.
The first variant can be used with jq 1.3, jq 1.4 and jq 1.5:
def mapping: {"yes":"10","no":"0","maybe":"5"};
map(.value |= mapping[.])
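For instance, assuming the two lines above are saved in a file (hypothetically named program.jq here), the invocation and output should look like this (with -c for compact output):
$ jq -c -f program.jq data.json
[{"id":1,"value":"10"},{"id":2,"value":"0"},{"id":3,"value":"5"}]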
The next variant uses the --argfile option (available since jq 1.4), and is of interest if the mapping object is available in a file:
jq --argfile mapping mapping.jq 'map(.value |= $mapping[.])' data.json
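For instance, if mapping.jq contains just the JSON object, this should give:
$ cat mapping.jq
{"yes":"10","no":"0","maybe":"5"}
$ jq -c --argfile mapping mapping.jq 'map(.value |= $mapping[.])' data.json
[{"id":1,"value":"10"},{"id":2,"value":"0"},{"id":3,"value":"5"}]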
Finally, in jq 1.5, other alternatives based on import are also available (!).

Here is a solution which uses an "inline" object since the mapping is small:
map(.value = {"yes":"10","no":"0","maybe":"5"}[.value])
which can be shortened with |=, as in peak's solution, to:
map(.value |= {"yes":"10","no":"0","maybe":"5"}[.])

Since you're translating string values, you should be able to use a json object to hold the mappings. Then mapping would be trivial.
$ jq --arg mapping '{"yes":"10","no":"0","maybe":"5"}' \
     'map(.value |= ($mapping | fromjson)[.])' data.json
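If your jq has --argjson (added in jq 1.5), the mapping can be passed as an object directly, dropping the fromjson step; a sketch:
$ jq --argjson mapping '{"yes":"10","no":"0","maybe":"5"}' \
     'map(.value |= $mapping[.])' data.json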

Why doesn't fromjson work as a map function on an array of objects?

I'm trying to iterate through an object and convert any value (top level only for now, no recursion) that is a valid json string to json.
I think the answer lies in using the correct incantation, perhaps something like with_entries(.value |= try fromjson), but I'm having trouble getting it to work. So I broke it down a bit to try something simpler.
Take the following list of objects: I just want to parse the value key of each of them if it is a string containing valid JSON (let's ignore the invalid cases for now; they can return null).
So I tried this:
$ jq -n '[{key: "one", value: 1},{key: "two", value: "{\"object\":true}"}] | map(.value |= try fromjson)'
[
  {
    "key": "one"
  },
  {
    "key": "two"
  }
]
Both values are missing, even though the value of the two key is a valid JSON string.
But if I try the same with a simple array, it works as expected:
$ jq -n '[1, "two", "{\"three\":3}"] | .[] | fromjson?'
{
  "three": 3
}
So my question is: what am I doing wrong here?
Thanks in advance for any pointers.
You have come across one of the (somewhat well-known) deficiencies of jq, namely that the trio of map, |=, and try (and therefore the postfix ?) do not mix well: as your output shows, .value |= try fromjson ends up deleting the key altogether, even when fromjson succeeds.
The good news is that the following will work in jq 1.5 and later:
map(.value = (.value | . as $v | try fromjson catch $v))
or equivalently:
map(.value as $v | .value = try ($v|fromjson) catch $v)
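For the sample input above, either variant should leave unparseable values untouched and parse the valid JSON string:
$ jq -n '[{key: "one", value: 1},{key: "two", value: "{\"object\":true}"}]
         | map(.value = (.value | . as $v | try fromjson catch $v))'
[
  {
    "key": "one",
    "value": 1
  },
  {
    "key": "two",
    "value": {
      "object": true
    }
  }
]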

Use jq to concatenate JSON arrays in multiple files

I have a series of JSON files containing an array of records, e.g.
$ cat f1.json
{
  "records": [
    {"a": 1},
    {"a": 3}
  ]
}
$ cat f2.json
{
  "records": [
    {"a": 2}
  ]
}
I want to 1) extract a single field from each record and 2) output a single array containing all the field values from all input files.
The first part is easy:
jq '.records | map(.a)' f?.json
[
  1,
  3
]
[
  2
]
But I cannot figure out how to get jq to concatenate those output arrays into a single array!
I'm not married to jq; I'll happily use another tool if necessary. But I would love to know how to do this with jq, because it's something I have been trying to figure out for years.
Assuming your jq has inputs (which is true of jq 1.5 and later), it would be most efficient to use it, e.g. along the lines of:
jq -n '[inputs.records[].a]' f*.json
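For the two sample files, this should produce a single array:
$ jq -n '[inputs.records[].a]' f*.json
[
  1,
  3,
  2
]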
Use -s (or --slurp):
jq -s 'map(.records[].a)' f?.json
You need to use --slurp so that jq will apply its filter to the aggregation of all inputs rather than to each input separately. When using this option, jq's input will be an array of the inputs, which you need to account for.
I would use the following:
jq --slurp 'map(.records | map(.a)) | add' f?.json
We apply your current transformation to each element of the slurped array of inputs (your previous individual inputs), then merge the transformed arrays into one with add.
If your input files are large, slurping them could eat up lots of memory, in which case you can use reduce, which works in an iterative manner, appending the .a values one object at a time:
jq -n 'reduce inputs.records[].a as $d (.; . += [$d])' f?.json
The -n flag ensures that the output JSON is constructed from scratch, using the data available from inputs. The reduce starts from the initial value ., which because of the null input is just null. Then, for each value $d produced by inputs.records[].a, . += [$d] appends that value to the result array.
As a compromise between the readability of --slurp and the efficiency of reduce, you can run jq twice. The first is a slightly altered version of your original command, the second slurps the undifferentiated output into a single array.
$ jq '.records[] | .a' f?.json | jq -s .
[
  1,
  3,
  2
]
The --slurp (-s) option is needed, along with map(), to do it in one shot. Given the same f1.json and f2.json as above:
$ jq -s 'map(.records[].a)' f?.json
[
  1,
  3,
  2
]

jq - parsing & replacement based on key-value pairs within json

I have a json file in the form of a key-value map. For example:
{
  "users": [
    {
      "key1": "user1",
      "key2": "user2"
    }
  ]
}
I have another json file. The values in the second file have to be replaced based on the keys in the first file.
For example, the 2nd file is:
{
  "info": {
    "users": ["key1","key2","key3","key4"]
  }
}
This second file should be replaced with
{
  "info": {
    "users": ["user1","user2","key3","key4"]
  }
}
Because the value of key1 in the first file is user1. This could be done with any Python program, but I am learning jq and would like to try it with jq itself. I tried different combinations of reading the file using --slurpfile, then select and walk, etc., but couldn't arrive at the required solution.
Any suggestions will be appreciated.
Since .users[0] is a JSON dictionary, it would make sense to use it as such (e.g. for efficiency):
Invocation:
jq -c --slurpfile users users.json -f program.jq input.json
program.jq:
$users[0].users[0] as $dict
| .info.users |= map($dict[.] // .)
Output:
{"info":{"users":["user1","user2","key3","key4"]}}
Note: the above assumes that the dictionary contains no null or false values, or rather that any such values in the dictionary should be ignored. This avoids the double lookup that would otherwise be required. If this assumption is invalid, then a solution using has or in (e.g. as provided by RomanPerekhrest) would be appropriate.
Solution to a supplemental problem (raised in the comments on the original question):
$users[0].users[0] as $dict
| second
| .info.users |= (map($dict[.] | select(. != null)))
sponge
It is highly inadvisable to use redirection to overwrite an input file.
If you have or can install sponge, then it would be far better to use it. For further details, see e.g. "What is jq's equivalent of sed -i?" in the jq FAQ.
jq solution:
jq --slurpfile users 1st.json '$users[0].users[0] as $users
| .info.users |= map(if in($users) then $users[.] else . end)' 2nd.json
The output:
{
  "info": {
    "users": [
      "user1",
      "user2",
      "key3",
      "key4"
    ]
  }
}

Parsing JSON record-per-line with jq?

I've got a tool that outputs a JSON record on each line, and I'd like to process it with jq.
The output looks something like this:
{"ts":"2017-08-15T21:20:47.029Z","id":"123","elapsed_ms":10}
{"ts":"2017-08-15T21:20:47.044Z","id":"456","elapsed_ms":13}
When I pass this to jq as follows:
./tool | jq 'group_by(.id)'
...it outputs an error:
jq: error (at <stdin>:1): Cannot index string with string "id"
How do I get jq to handle JSON-record-per-line data?
Use the --slurp (or -s) switch:
./tool | jq --slurp 'group_by(.id)'
It outputs the following:
[
  [
    {
      "ts": "2017-08-15T21:20:47.029Z",
      "id": "123",
      "elapsed_ms": 10
    }
  ],
  [
    {
      "ts": "2017-08-15T21:20:47.044Z",
      "id": "456",
      "elapsed_ms": 13
    }
  ]
]
...which you can then process further. For example:
./tool | jq -s 'group_by(.id) | map({id: .[0].id, count: length})'
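For the two sample records, that should yield:
[
  {
    "id": "123",
    "count": 1
  },
  {
    "id": "456",
    "count": 1
  }
]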
As @JeffMercado pointed out, jq handles streams of JSON just fine, but if you use group_by, then you'd have to ensure its input is an array. That could be done in this case using the -s command-line option; if your jq has the inputs filter, then it can also be done using that filter in conjunction with the -n option.
If, however, you have a version of jq with inputs (available in jq 1.5 and later), a better approach would be to use the following streaming variant of group_by:
# sort-free stream-oriented variant of group_by/1
# f should always evaluate to a string.
# Output: a stream of arrays, one array per group
def GROUPS_BY(stream; f): reduce stream as $x ({}; .[$x|f] += [$x] ) | .[] ;
Usage example: GROUPS_BY(inputs; .id)
Note that you will want to use this with the -n command line option.
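For example, with the record-per-line input shown in the question (and -c for compact output), this should produce one array per group:
$ ./tool | jq -cn '
    def GROUPS_BY(stream; f): reduce stream as $x ({}; .[$x|f] += [$x]) | .[];
    GROUPS_BY(inputs; .id)'
[{"ts":"2017-08-15T21:20:47.029Z","id":"123","elapsed_ms":10}]
[{"ts":"2017-08-15T21:20:47.044Z","id":"456","elapsed_ms":13}]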
Such a streaming variant has two main advantages:
it generally requires less memory in that it does not require a copy of the entire input stream to be kept in memory while it is being processed;
it is potentially faster because it does not require any sort operation, unlike group_by/1.
Please note that the above definition of GROUPS_BY/2 follows the convention for such streaming filters in that it produces a stream. Other variants are of course possible.
Handling a large amount of data
The following illustrates how to economize on memory. Suppose the task is to produce a frequency count of .id values. The humdrum solution would be:
GROUPS_BY(inputs; .id) | [(.[0]|.id), length]
A more economical and indeed far better solution would be:
GROUPS_BY(inputs|.id; .) | [.[0], length]
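For the two sample records, the economical version should emit one [id, count] pair per group:
$ ./tool | jq -cn '
    def GROUPS_BY(stream; f): reduce stream as $x ({}; .[$x|f] += [$x]) | .[];
    GROUPS_BY(inputs|.id; .) | [.[0], length]'
["123",1]
["456",1]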

Flatten nested JSON using jq

I'd like to flatten a nested JSON object, e.g. {"a":{"b":1}} to {"a.b":1}, in order to digest it in Solr.
I have 11 TB of JSON files which are both nested and contain dots in field names, meaning neither Elasticsearch (because of the dots) nor Solr (because of the nesting, without the _childDocument_ notation) can digest them as is.
The other solution would be to replace the dots in the field names with underscores and push the data to Elasticsearch, but I have far better experience with Solr, so I prefer the flattening solution (unless Solr can digest those nested JSONs as is??).
I would prefer Elasticsearch only if its digestion process takes far less time than Solr's, because my priority is digesting as fast as I can (thus I chose jq instead of scripting it in Python).
Kindly help.
EDIT:
I think the pair of examples 3 & 4 solves this for me:
https://lucidworks.com/blog/2014/08/12/indexing-custom-json-data/
I'll try soon.
You can also use the following jq command to flatten nested JSON objects in this manner:
[leaf_paths as $path | {"key": $path | join("."), "value": getpath($path)}] | from_entries
The way it works is: leaf_paths returns a stream of arrays which represent the paths on the given JSON document at which "leaf elements" appear, that is, elements which do not have child elements, such as numbers, strings and booleans. We pipe that stream into objects with key and value properties, where key contains the elements of the path array as a string joined by dots and value contains the element at that path. Finally, we put the entire thing in an array and run from_entries on it, which transforms an array of {key, value} objects into an object containing those key-value pairs.
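For instance, applying this filter to the example from the question (with -c for compact output):
$ echo '{"a":{"b":1}}' | jq -c '[leaf_paths as $path
    | {"key": $path | join("."), "value": getpath($path)}] | from_entries'
{"a.b":1}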
This is just a variant of Santiago's jq:
. as $in
| reduce leaf_paths as $path ({};
. + { ($path | map(tostring) | join(".")): $in | getpath($path) })
It avoids the overhead of the key/value construction and destruction.
(If you have access to a version of jq later than jq 1.5, you can omit the "map(tostring)".)
Two important points about both these jq solutions:
Arrays are also flattened.
E.g. given {"a": {"b": [0,1,2]}} as input, the output would be:
{
  "a.b.0": 0,
  "a.b.1": 1,
  "a.b.2": 2
}
If any of the keys in the original JSON contain periods, then key collisions are possible; such collisions will generally result in the loss of a value. This would happen, for example, with the following input:
{"a.b":0, "a": {"b": 1}}
Here is a solution that uses tostream, select, join, reduce and setpath
reduce ( tostream | select(length==2) | .[0] |= [join(".")] ) as [$p,$v] (
  {}
  ; setpath($p; $v)
)
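A quick sanity check of this approach on a small, object-only input (join requires string path elements in older jq releases, so arrays are avoided in this sketch):
$ echo '{"a":{"b":1}}' | jq -c 'reduce ( tostream | select(length==2)
    | .[0] |= [join(".")] ) as [$p,$v] ({}; setpath($p; $v))'
{"a.b":1}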
I've recently written a script called jqg that flattens arbitrarily complex JSON and searches the results using a regex; to simply flatten the JSON, your regex would be '.', which matches everything. Unlike the answers above, the script will handle embedded arrays, false and null values, and can optionally treat empty arrays and objects ([] & {}) as leaf nodes.
$ jq . test/odd-values.json
{
  "one": {
    "start-string": "foo",
    "null-value": null,
    "integer-number": 101
  },
  "two": [
    {
      "two-a": {
        "non-integer-number": 101.75,
        "number-zero": 0
      },
      "true-boolean": true,
      "two-b": {
        "false-boolean": false
      }
    }
  ],
  "three": {
    "empty-string": "",
    "empty-object": {},
    "empty-array": []
  },
  "end-string": "bar"
}
$ jqg . test/odd-values.json
{
  "one.start-string": "foo",
  "one.null-value": null,
  "one.integer-number": 101,
  "two.0.two-a.non-integer-number": 101.75,
  "two.0.two-a.number-zero": 0,
  "two.0.true-boolean": true,
  "two.0.two-b.false-boolean": false,
  "three.empty-string": "",
  "three.empty-object": {},
  "three.empty-array": [],
  "end-string": "bar"
}
jqg was tested using jq 1.6
Note: I am the author of the jqg script.
As it turns out, curl -XPOST 'http://localhost:8983/solr/flat/update/json/docs' -d @json_file does just this:
{
  "a.b": [1],
  "id": "24e3e780-3a9e-4fa7-9159-fc5294e803cd",
  "_version_": 1535841499921514496
}
EDIT 1: Solr 6.0.1 with bin/solr -e cloud. The collection name is flat; all the rest are defaults (including the data-driven schema, which is also the default).
EDIT 2: The final script I used: find . -name '*.json' -exec curl -XPOST 'http://localhost:8983/solr/collection1/update/json/docs' -d @{} \;.
EDIT 3: It is also possible to parallelize with xargs and to add the id field with jq: find . -name '*.json' -print0 | xargs -0 -n 1 -P 8 -I {} sh -c "cat {} | jq '. + {id: .a.b}' | curl -XPOST 'http://localhost:8983/solr/collection/update/json/docs' -d @-", where -P is the parallelism factor. I used jq to set an id so multiple uploads of the same document won't create duplicates in the collection (when I searched for the optimal value of -P, duplicates were created in the collection).
As @hraban mentioned, leaf_paths does not work as expected (furthermore, it is deprecated). leaf_paths is equivalent to paths(scalars); it returns the paths of any values for which scalars returns a truthy value. scalars returns its input value if it is a scalar, or null otherwise. The problem with that is that null and false are not truthy values, so they will be removed from the output. The following code does work, by checking the type of the values directly:
. as $in
| reduce paths(type != "object" and type != "array") as $path ({};
. + { ($path | map(tostring) | join(".")): $in | getpath($path) })
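A minimal check that null and false survive, unlike with leaf_paths:
$ echo '{"a":{"b":null,"c":false}}' | jq -c '. as $in
    | reduce paths(type != "object" and type != "array") as $path ({};
        . + { ($path | map(tostring) | join(".")): $in | getpath($path) })'
{"a.b":null,"a.c":false}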