Parsing JSON record-per-line with jq?

I've got a tool that outputs a JSON record on each line, and I'd like to process it with jq.
The output looks something like this:
{"ts":"2017-08-15T21:20:47.029Z","id":"123","elapsed_ms":10}
{"ts":"2017-08-15T21:20:47.044Z","id":"456","elapsed_ms":13}
When I pass this to jq as follows:
./tool | jq 'group_by(.id)'
...it outputs an error:
jq: error (at <stdin>:1): Cannot index string with string "id"
How do I get jq to handle JSON-record-per-line data?

Use the --slurp (or -s) switch:
./tool | jq --slurp 'group_by(.id)'
It outputs the following:
[
  [
    {
      "ts": "2017-08-15T21:20:47.029Z",
      "id": "123",
      "elapsed_ms": 10
    }
  ],
  [
    {
      "ts": "2017-08-15T21:20:47.044Z",
      "id": "456",
      "elapsed_ms": 13
    }
  ]
]
...which you can then process further. For example:
./tool | jq -s 'group_by(.id) | map({id: .[0].id, count: length})'
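Given the two sample records above, that command would print:
[
  {
    "id": "123",
    "count": 1
  },
  {
    "id": "456",
    "count": 1
  }
]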

As @JeffMercado pointed out, jq handles streams of JSON just fine, but if you use group_by, then you'd have to ensure its input is an array. That could be done in this case using the -s command-line option; if your jq has the inputs filter, then it can also be done using that filter in conjunction with the -n option.
If, however, you have a version of jq with inputs (available since jq 1.5), then a better approach would be to use the following streaming variant of group_by:
# sort-free stream-oriented variant of group_by/1
# f should always evaluate to a string.
# Output: a stream of arrays, one array per group
def GROUPS_BY(stream; f): reduce stream as $x ({}; .[$x|f] += [$x] ) | .[] ;
Usage example: GROUPS_BY(inputs; .id)
Note that you will want to use this with the -n command line option.
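Spelled out as a complete pipeline against the tool from the question, that would be something like the following sketch:
./tool | jq -n 'def GROUPS_BY(stream; f): reduce stream as $x ({}; .[$x|f] += [$x]) | .[]; GROUPS_BY(inputs; .id)'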
Such a streaming variant has two main advantages:
it generally requires less memory in that it does not require a copy of the entire input stream to be kept in memory while it is being processed;
it is potentially faster because it does not require any sort operation, unlike group_by/1.
Please note that the above definition of GROUPS_BY/2 follows the convention for such streaming filters in that it produces a stream. Other variants are of course possible.
Handling a large amount of data
The following illustrates how to economize on memory. Suppose the task is to produce a frequency count of .id values. The humdrum solution would be:
GROUPS_BY(inputs; .id) | [(.[0]|.id), length]
A more economical and indeed far better solution would be:
GROUPS_BY(inputs|.id; .) | [.[0], length]
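Run with -n (and -c for compact output) against the two sample records from the question, either variant emits:
["123",1]
["456",1]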

Related

jq: filter result by value (contains) is very slow

I am trying to use jq to filter a large number of JSON files and extract the ids of the objects that belong to a specific domain, as well as the full URL within that domain. Here's a sample of the data:
{
  "items": [
    {
      "completeness": 5,
      "dcLanguageLangAware": {
        "def": [
          "de"
        ]
      },
      "edmIsShownBy": [
        "https://gallica.example/image/2IC6BQAEGWUEG4OP7AYBDGIGYAX62KZ6H366KXP2IKVAF4LKY37Q/presentation_images/5591be60-01fc-11e6-8e10-fa163e091926/node-3/image/SBB/Berliner_Börsenzeitung/1920/02/27/F_065_098_0/F_SBB_00007_19200227_065_098_0_001/full/full/0/default.jpg"
      ],
      "id": "/9200355/BibliographicResource_3000117730632",
      "type": "TEXT",
      "ugc": [
        false
      ]
    }
  ]
}
Bigger sample here: https://www.dropbox.com/s/0s0zjtxe01mecjc/AoQhRn%2B56KDm5AJJPwEvOTIwMDUyMC9hcmtfXzEyMTQ4X2JwdDZrMTAyNzY2Nw%3D%3D.json?dl=0
I can extract both the ids and the URLs which contain the string "gallica" using the following command:
jq '[ .items[] | select(.edmIsShownBy[] | contains ("gallica")) | {id: .id, link: .edmIsShownBy[] }]'
However, I have more than 28000 JSON files to process and it is taking a large amount of time (around 1 file per minute). I am processing the files using bash with the command:
find . -name "*.json" -exec cat '{}' ';' | jq '[ .items[] | select(.edmIsShownBy[] | contains ("gallica")) | {id: .id, link: .edmIsShownBy[] }]'
I was wondering if the slowness is due to the instructions given to jq, and if that is the case, is there a faster way to filter for a string contained in a chosen value? Any ideas?
It would probably be wise not to attempt to cat all the files at once; indeed, it would probably be best to avoid cat altogether.
For example, assuming program.jq contains whichever jq program you decide on (and there is nothing wrong with using contains here), you could try:
find . -name "*.json" -exec jq -f program.jq '{}' +
Using + instead of ';' minimizes the number of times jq must be called, though the overhead of invoking jq is actually quite small. If your find does not support + and you wish to avoid calling jq once per file, then consider using xargs, or GNU parallel with the --xargs option.
If you know the JSON files of interest are in the pwd, you could also speed up find by specifying -maxdepth 1.
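For instance, a sketch combining these suggestions (assuming the filter from the question is saved in program.jq, and a find/xargs pair that supports -print0/-0):
find . -maxdepth 1 -name '*.json' -print0 | xargs -0 jq -f program.jq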

check if specific string is in Json file via bash

Let's say this is my JSON file content:
[ { "id":"45" }, { "id":"56" }, { "id":"13" }, { "id":"5" } ]
and I want to find out if the id "13" is in the JSON file.
Is there a way to do this in bash?
I tried to run the command with jq and all sorts of different variations of it (with contains and without, for example) and nothing answers this query for me.
Note: when the question was closed, I added this answer to the question in an effort to get the question reopened:
(( $(jq < file.json '[.[].id | select(. == "13")] | length') > 0))
The OP said it was inefficient. I do not know why. Here is what it does:
It passes the JSON through the jq program, which is a JSON parser. The bash shell has no native understanding of JSON, so any solution is going to make use of an external program. Other programs will treat JSON as text, and may work in some or most cases, but it is best to use a program like jq that follows the formal JSON specification to parse the data.
It creates an array to capture the output of...
It loops through the array, picking out all the id fields
It outputs the value of the id field if the value is "13"
It counts the length of the array, which is the number of the id fields whose value is "13"
Using native bash, it converts that output into a number and evaluates to true if the number is greater than 0 and false otherwise.
I do not think you will find something significantly more efficient that formally follows the JSON spec.
This only runs 1 program, jq, which is the de facto standard JSON processor. It is not part of the POSIX standard (which predates JSON) but it is the most likely JSON processor to be installed on a system.
This uses native bash constructs to interpret the output and to perform the test.
There is not going to be a more efficient solution that runs zero external programs (bash cannot do it alone), and there is not going to be a better program to use than jq.
There is not going to be a significantly better jq filter, because it is going to process the entire input (that is just how it works) and the select filter stops the processing of objects that fail the test, which is all or almost all of them.
The alternative "peak" suggests is more compact and more elegant (good things) but not significantly more (or less) efficient. It looks better in the post because a lot is left out. The full test would be
[[ $(jq < file.json 'any(.[]; .id == "13")') == "true" ]]
Actually, the .[]; generator is unnecessary, so the even more compact answer would be
[[ $(jq < file.json 'any(.id == "13")') == "true" ]]
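If all you need is an exit status rather than the literal string "true", another possibility (a sketch, not from the original answers) is jq's -e/--exit-status flag, which makes jq exit 0 when its last output is neither false nor null:
if jq -e 'any(.id == "13")' file.json > /dev/null; then
  echo "id 13 is present"
fi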
Here is one simple way to determine if a given "id" value is present, using perl:
echo '[ { "id":"45" }, { "id":"56" }, { "id":"13" }, { "id":"5" } ]' | perl -00 -lne 'if (/"id":"13"/) {print "true"} else {print "false"}'
true
echo '[ { "id":"45" }, { "id":"56" }, { "id":"13" }, { "id":"5" } ]' | perl -00 -lne 'if (/"id":"33"/) {print "true"} else {print "false"}'
false
Here is one possibility:
any(.[]; .id == "13")

How to get max value of a date field in a large json file?

I have a large JSON file, around 500MB, which is the response of a URL call. I need to get the max value of the "date" field in the "results" array of the JSON file using a shell script (bash). Currently I am using jq as below. It works fine for smaller files, but for larger files it returns null.
maxDate=$(cat ${jsonfilePath} | jq '[ .results[]?.date ] | max')
Please help. Thanks! I am new to shell scripting, JSON, and jq.
sample/input json file contents:
{
  "results": [
    {
      "Id": "123",
      "date": 1588910400000,
      "col": "test"
    },
    {
      "Id": "1234",
      "date": 1588910412345,
      "col": "test2"
    }
  ],
  "col2": 123
}
Given the --stream option on the command line, jq won't load the whole input into memory; instead, it'll read the input token by token, producing arrays in this fashion:
[["results",0,"Id"],"123"]
[["results",0,"date"],1588910400000]
...
[["results",1,"date"],1588910412345]
...
Thanks to this feature, we can pick only the dates from the input and find the maximum without exhausting memory (at the expense of speed). For example:
jq -n --stream 'reduce (inputs|select(.[0][-1]=="date" and length==2)[1]) as $d (null; [.,$d]|max)' file
500MB should not be so large as to require the --stream option, which generally slows things down. Here then is a fast and efficient(*) solution that does not use the streaming option, but instead uses a generic, stream-oriented "max_by" function defined as follows:
# max_by(empty;1) yields null
def max_by(s; f):
  reduce s as $s (null;
    if . == null then {s: $s, m: ($s|f)}
    else ($s|f) as $m
      | if $m > .m then {s: $s, m: $m} else . end
    end)
  | .s ;
With this in our toolkit, we can simply write:
max_by(.results[].date; .)
This of course assumes that there is a "results" field containing an array of JSON objects. (**) From the problem statement, it would appear that this assumption does not always hold, so you will probably want to modify whichever approach you choose accordingly (e.g. by checking whether there is a results field, whether it's array-valued, etc.)
(*) Using max_by/2 here is more efficient, both in terms of space and time, than using the built-in max_by/1.
(**) The absence of a "date" subfield should not matter as null is less than every number.
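A sketch of a complete run, assuming the definition above together with the call max_by(.results[].date; .) is saved in a file (hypothetically named maxdate.jq):
jq -f maxdate.jq "$jsonfilePath"
With the sample input shown above, this prints 1588910412345.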
 jq '.results | max_by(.date) | .date' "$jsonfilePath"
is a more efficient way to get the maximum date value out of that JSON that might work better for you. It avoids the Useless Use Of Cat, doesn't create a temporary array of just the date values, and thus only needs one pass through the array.

jq: Add nested numbers?

In jq, how can I add numbers that are nested in streamed objects?
Example:
{"game":
{"player1": {"score": 2}}}
{"game":
{"player1": {"score": 4}}}
I can add these numbers using two jq calls:
$ cat foo.json | jq '.game.player1.score' | jq --slurp 'add'
6
How can it be done with one jq call?
Also, how to add the scores of two different players separately?
{"game":
{"player1": {"score": 2},
"player2": {"score": 20}}}
{"game":
{"player1": {"score": 4},
"player2": {"score": 40}}}
$ cat foo.json | jq '???'
{"player1": 6, "player2": 60}
First, here's a variant of @OguzIsmail's first solution to the first problem. It serves to validate the usefulness of the generic function:
def sigma(s): reduce s as $x (0; .+$x);
With this, and using the -n command-line option, the solution to the given problem is simply:
sigma(inputs.game.player1.score)
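Written out as a single command against the foo.json from the question (a sketch), that is:
jq -n 'def sigma(s): reduce s as $x (0; .+$x); sigma(inputs.game.player1.score)' foo.json
which prints 6.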
Second problem
In the same vein of genericity:
def sigmas(stream; f):
reduce stream as $s (null;
[., ($s | f)] | transpose | map(add));
sigmas(inputs | .game; [.player1.score, .player2.score])
| {player1: .[0], player2: .[1]}
Notice that sigmas as defined here can handle arbitrarily many summands. A still more generic solution, avoiding the need to specify the summands as a list, is left as an (easy) exercise for the reader :-)
jq.jq
Typically, generic functions can be included in a "standard jq library". For example, if your utility functions are in ~/.jq/jq.jq, then assuming the pwd does not have a different jq.jq, you could write (for the solution to the first problem):
jq -n 'include "jq"; sigma(inputs.game.player1.score)' foo.json
Robust inclusion of a library
To avoid difficulties associated with module paths, it sometimes makes sense to specify the path within the include or import directive, e.g.:
jq -n 'include "jq" {search: "~/jq"}; ...'
One option is to use reduce, e.g:
jq -n 'reduce inputs.game.player1.score as $score (0; . + $score)' file
Another one is:
jq -n '[inputs.game.player1.score] | add' file
But this would not perform as well as the first with large inputs.
And here is a more generic one covering the second question too:
jq -n 'reduce inputs.game as $game ({};
  reduce ($game|keys_unsorted)[] as $player (.;
    .[$player] += $game[$player].score
  )
)' file
Here's a generic solution that arises from the perspective that the stream of objects defines a table (i.e., a spreadsheet). The solution is also efficient (no slurping) and robust (no assumptions about the ordering or presence of keys).
# The stream is assumed to consist of objects (or else arrays)
# in which same-named keys have compatible values under `+`.
# The resultant object (or array) has keys the values of which are
# the sum of the values of the corresponding keys of the
# input objects, which thus need not have the same keys.
def add_by_column(stream):
  def add(b): reduce (b|keys_unsorted[]) as $k
    (.; .[$k] += b[$k]);
  reduce stream as $x (null; add($x));
add_by_column(inputs | .game | map_values(.score) )
Notice that this assumes that the -n command-line option is used.
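For example, saving the definition and the call above in a file (hypothetically named sums.jq) and running it against the second foo.json from the question:
jq -cn -f sums.jq foo.json
produces {"player1":6,"player2":60}.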
Here's a solution with a helper function which uses:
Update assignment |= to flatten {"player1": {"score": 4}} objects to {"player1": 4}
to_entries[] to convert the rows to a canonical {"key":"player1", "value": 2} form
Once the rows are in canonical form a simple Reduce does the final aggregation.
def kvrows: inputs[] | .[] |= .[] | to_entries[] ; # {"key":"player1","value": 2}...
reduce kvrows as $e ({}; .[$e.key] += $e.value) # compute sums
Sample execution assuming the above in test.jq and data in test.json
$ jq -Mn -f test.jq test.json
{
"player1": 6,
"player2": 60
}

Flatten nested JSON using jq

I'd like to flatten a nested json object, e.g. {"a":{"b":1}} to {"a.b":1} in order to digest it in solr.
I have 11 TB of JSON files which are both nested and contain dots in field names, meaning neither elasticsearch (dots) nor solr (nested without the _childDocument_ notation) can digest it as is.
The other solutions would be to replace dots in the field names with underscores and push it to elasticsearch, but I have far better experience with solr therefore I prefer the flatten solution (unless solr can digest those nested jsons as is??).
I will prefer elasticsearch only if the digestion process takes far less time than with solr, because my priority is digesting as fast as I can (thus I chose jq instead of scripting it in python).
Kindly help.
EDIT:
I think the pair of examples 3&4 solves this for me:
https://lucidworks.com/blog/2014/08/12/indexing-custom-json-data/
I'll try soon.
You can also use the following jq command to flatten nested JSON objects in this manner:
[leaf_paths as $path | {"key": $path | join("."), "value": getpath($path)}] | from_entries
The way it works is: leaf_paths returns a stream of arrays which represent the paths on the given JSON document at which "leaf elements" appear, that is, elements which do not have child elements, such as numbers, strings and booleans. We pipe that stream into objects with key and value properties, where key contains the elements of the path array as a string joined by dots and value contains the element at that path. Finally, we put the entire thing in an array and run from_entries on it, which transforms an array of {key, value} objects into an object containing those key-value pairs.
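As a quick check on the example from the question (a sketch; -c just compacts the output):
echo '{"a":{"b":1}}' | jq -c '[leaf_paths as $path | {"key": $path | join("."), "value": getpath($path)}] | from_entries'
{"a.b":1}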
This is just a variant of Santiago's jq:
. as $in
| reduce leaf_paths as $path ({};
. + { ($path | map(tostring) | join(".")): $in | getpath($path) })
It avoids the overhead of the key/value construction and destruction.
(If you have access to a version of jq later than jq 1.5, you can omit the "map(tostring)".)
Two important points about both these jq solutions:
Arrays are also flattened.
E.g. given {"a": {"b": [0,1,2]}} as input, the output would be:
{
"a.b.0": 0,
"a.b.1": 1,
"a.b.2": 2
}
If any of the keys in the original JSON contain periods, then key collisions are possible; such collisions will generally result in the loss of a value. This would happen, for example, with the following input:
{"a.b":0, "a": {"b": 1}}
Here is a solution that uses tostream, select, join, reduce and setpath
reduce ( tostream | select(length==2) | .[0] |= [join(".")] ) as [$p,$v] (
{}
; setpath($p; $v)
)
I've recently written a script called jqg that flattens arbitrarily complex JSON and searches the results using a regex; to simply flatten the JSON, your regex would be '.', which matches everything. Unlike the answers above, the script will handle embedded arrays, false and null values, and can optionally treat empty arrays and objects ([] & {}) as leaf nodes.
$ jq . test/odd-values.json
{
  "one": {
    "start-string": "foo",
    "null-value": null,
    "integer-number": 101
  },
  "two": [
    {
      "two-a": {
        "non-integer-number": 101.75,
        "number-zero": 0
      },
      "true-boolean": true,
      "two-b": {
        "false-boolean": false
      }
    }
  ],
  "three": {
    "empty-string": "",
    "empty-object": {},
    "empty-array": []
  },
  "end-string": "bar"
}
$ jqg . test/odd-values.json
{
  "one.start-string": "foo",
  "one.null-value": null,
  "one.integer-number": 101,
  "two.0.two-a.non-integer-number": 101.75,
  "two.0.two-a.number-zero": 0,
  "two.0.true-boolean": true,
  "two.0.two-b.false-boolean": false,
  "three.empty-string": "",
  "three.empty-object": {},
  "three.empty-array": [],
  "end-string": "bar"
}
jqg was tested using jq 1.6
Note: I am the author of the jqg script.
As it turns out, curl -XPOST 'http://localhost:8983/solr/flat/update/json/docs' -d @json_file does just this:
{
"a.b":[1],
"id":"24e3e780-3a9e-4fa7-9159-fc5294e803cd",
"_version_":1535841499921514496
}
EDIT 1: solr 6.0.1 with bin/solr -e cloud. The collection name is flat; all the rest are defaults (with data-driven-schema, which is also the default).
EDIT 2: The final script I used: find . -name '*.json' -exec curl -XPOST 'http://localhost:8983/solr/collection1/update/json/docs' -d @{} \;.
EDIT 3: It is also possible to parallelize with xargs and to add the id field with jq: find . -name '*.json' -print0 | xargs -0 -n 1 -P 8 -I {} sh -c "cat {} | jq '. + {id: .a.b}' | curl -XPOST 'http://localhost:8983/solr/collection/update/json/docs' -d @-", where -P is the parallelism factor. I used jq to set an id so multiple uploads of the same document won't create duplicates in the collection (when I searched for the optimal value of -P, it created duplicates in the collection).
As @hraban mentioned, leaf_paths does not work as expected (furthermore, it is deprecated). leaf_paths is equivalent to paths(scalars); it returns the paths of any values for which scalars returns a truthy value. scalars returns its input value if it is a scalar, and nothing otherwise. The problem with that is that null and false are not truthy values, so they will be removed from the output. The following code does work, by checking the type of the values directly:
. as $in
| reduce paths(type != "object" and type != "array") as $path ({};
. + { ($path | map(tostring) | join(".")): $in | getpath($path) })
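A sketch of how this variant might be run, with the filter above saved in a hypothetical flatten.jq:
jq -f flatten.jq input.json
Given {"a":{"b":[0,1,2]},"c":null} it yields {"a.b.0":0,"a.b.1":1,"a.b.2":2,"c":null}; note that the null value is kept, unlike with leaf_paths.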