jq 1.5: print items from an array that are inside another array - json

The incoming JSON file contains one JSON array per row, e.g.:
["a100","a101","a102","a103","a104","a105","a106","a107","a108"]
["a100","a102","a103","a106","a107","a108"]
["a100","a99"]
["a107","a108"]
The "filter array" would be ["a99","a101","a108"], which I can load with --slurpfile.
I'm trying to figure out how to print only the values that are in the "filter array", e.g. the output:
["a101","a108"]
["a108"]
["a99"]
["a108"]

You can port the IN function from jq 1.6 to 1.5 and use:
def IN(s): any(s == .; .);
map(select(IN($filter_array[])))
Or even shorter:
map(select(any($filter_array[]==.;.)))

I might be missing some simpler solution, but the following works:
map(select(. as $in | ["a99","a101","a108"] | contains([$in])))
Replace the hardcoded ["a99","a101","a108"] array with your slurped variable.
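For readers comparing semantics, the membership test those jq filters perform can be sketched in Python (an illustrative translation; the names and sample data come from the question):

```python
# Keep only the values of each row that appear in the filter array,
# mirroring jq's map(select(IN($filter_array[]))).
filter_array = {"a99", "a101", "a108"}  # a set gives O(1) membership tests

rows = [
    ["a100", "a101", "a102", "a103", "a104", "a105", "a106", "a107", "a108"],
    ["a100", "a102", "a103", "a106", "a107", "a108"],
    ["a100", "a99"],
    ["a107", "a108"],
]

filtered = [[x for x in row if x in filter_array] for row in rows]
for row in filtered:
    print(row)
```

Each output row matches the expected output shown in the question.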

In the example, the arrays in the input stream are sorted (in jq's sort order), so it is worth noting that in such cases, a more efficient solution is possible using the bsearch built-in, or perhaps even better, the definition of intersection/2 given at https://rosettacode.org/wiki/Set#Finite_Sets_of_JSON_Entities
For ease of reference, here it is:
def intersection($A; $B):
  def pop:
    .[0] as $i
    | .[1] as $j
    | if $i == ($A|length) or $j == ($B|length) then empty
      elif $A[$i] == $B[$j] then $A[$i], ([$i+1, $j+1] | pop)
      elif $A[$i] < $B[$j] then [$i+1, $j] | pop
      else [$i, $j+1] | pop
      end;
  [[0,0] | pop];
Assuming a jq invocation such as:
jq -c --argjson filter '["a99","a101","a108"]' -f intersections.jq input.json
an appropriate filter would be:
($filter | sort) as $sorted
| intersection(.; $sorted)
(Of course if $filter is already presented in jq's sort order, then the initial sort can be skipped, or replaced by a check.)
Output
["a101","a108"]
["a108"]
["a99"]
["a108"]
Unsorted arrays
In practice, jq's builtin sort filter is usually so fast that it might be worthwhile simply sorting the arrays in order to use intersection as defined above.
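For readers less used to jq's recursive style, the two-pointer walk performed by intersection/2 can be sketched in Python (an illustrative translation, assuming both inputs are sorted):

```python
def intersection(a, b):
    """Intersection of two sorted lists via the same two-pointer walk
    as the jq intersection/2 definition above (illustrative sketch)."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:      # common element: emit it, advance both pointers
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:     # a is behind: advance i
            i += 1
        else:                 # b is behind: advance j
            j += 1
    return out

# Both arguments must be sorted (here in codepoint order, as jq sorts strings).
print(intersection(["a100", "a101", "a102", "a108"], ["a101", "a108", "a99"]))
```

Like the jq version, this runs in O(len(a) + len(b)) rather than scanning one list per element of the other.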

Related

In JQ is there a better way to process an array using a sliding window than using indexes?

In my specific case, I'm looking to convert input like ["a", 1, "b", 2, "c", 3] into an object like {"a": 1, "b": 2, "c": 3}, but the general technique is processing an array using a sliding window (in this case, of size 2).
I can make this work using indexes, but it's rather ugly, and it suffers from having to load the entire array into memory, so it's not great for streaming:
# Just creates input to play with, in this case, all the letters from 'a' to 'z'
function input () {
    printf '"%s" ' {a..z} | jq --slurp --compact-output '.'
}
input |
jq '. as $i | $i
  | keys
  | map(select(. % 2 == 0))
  | map({key: ($i[.]|tostring), value: $i[. + 1]})
  | from_entries'
In a perfect world, this could look something like this:
input |
jq 'sliding(2;2)
  | map({key: (.[0]|tostring), value: .[1]})
  | from_entries'
I don't see anything like that in the docs, but I'd like to know if there are any techniques that could get me to a cleaner solution.
Tangent on sliding
I used sliding(2;2) as a placeholder for "something that does this in one go", but for the curious, the semantics come from Scala's sliding(size: Int, step: Int) collection method.
Because jq returns null if you index out of range, the size argument would mostly exist to make life easier when you're looking at an intermediate result. Borrowing the while implementation from @pmf's answer, the second command below has a much easier-to-understand intermediate output once the size argument is applied:
$ input | jq --compact-output 'while(. != []; .[2:])'
["a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z"]
["c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z"]
["e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z"]
["g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z"]
["i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z"]
["k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z"]
["m","n","o","p","q","r","s","t","u","v","w","x","y","z"]
["o","p","q","r","s","t","u","v","w","x","y","z"]
["q","r","s","t","u","v","w","x","y","z"]
["s","t","u","v","w","x","y","z"]
["u","v","w","x","y","z"]
["w","x","y","z"]
["y","z"]
$ input | jq --compact-output 'while(. != []; .[2:])[:3]'
["a","b","c"]
["c","d","e"]
["e","f","g"]
["g","h","i"]
["i","j","k"]
["k","l","m"]
["m","n","o"]
["o","p","q"]
["q","r","s"]
["s","t","u"]
["u","v","w"]
["w","x","y"]
["y","z"]
I am confused by the meaning of 2 and 2 in sliding(2;2), but here's a definition for sliding that accomplishes what (I think) you are looking for (with maybe different parameter values). It generates an array of arrays using a step-size and a length parameter:
def sliding($a;$b): [while(. != []; .[$a:])[:$b]];
Examples:
sliding(2;2) | map({key: (.[0]|tostring), value: .[1]}) | from_entries
{"a":"b","c":"d","e":"f","g":"h","i":"j","k":"l","m":"n","o":"p","q":"r","s":"t","u":"v","w":"x","y":"z"}
Skipping:
sliding(3;2) | map({key: (.[0]|tostring), value: .[1]}) | from_entries
{"a":"b","d":"e","g":"h","j":"k","m":"n","p":"q","s":"t","v":"w","y":"z"}
Overlapping:
sliding(1;2) | map({key: (.[0]|tostring), value: .[1]}) | from_entries
{"a":"b","b":"c","c":"d","d":"e","e":"f","f":"g","g":"h","h":"i","i":"j","j":"k","k":"l","l":"m","m":"n","n":"o","o":"p","p":"q","q":"r","r":"s","s":"t","t":"u","u":"v","v":"w","w":"x","x":"y","y":"z","z":null}
Note: the second parameter is not really used, as you always take two items from the current window, so you could omit it entirely or hard-code it to 2.
You could use reduce to go through the array, and take two items at a time:
jq 'reduce while(. != []; .[2:]) as [$key, $val] ({}; .[$key] = $val)'
Yes, that's what _nwise is for.
reduce _nwise(2) as [$k, $v] ({}; .[$k] = $v)
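For comparison, the sliding semantics discussed above can be sketched in Python (illustrative only; the parameter order mirrors the jq definition sliding($a;$b), where the first argument is the step and the second the window size):

```python
def sliding(seq, step, size):
    """Scala-style sliding windows: advance by `step`, take `size` items.
    Mirrors the jq definition sliding($a;$b) above, where $a is the step
    and $b is the window size (illustrative sketch)."""
    for i in range(0, len(seq), step):
        yield seq[i:i + size]

letters = list("abcdefgh")
print(list(sliding(letters, 2, 2)))  # non-overlapping pairs
print(list(sliding(letters, 1, 2)))  # overlapping pairs
print(dict(sliding(letters, 2, 2)))  # the pairwise object-building use case
```

As with the jq version, the last window may be shorter than the requested size when the sequence length is not a multiple of the step.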

jq slow to join() medium size array

I am trying to join() a relatively big array (20k elements) of objects with a character ('\n' in this particular case). I have a few operations upfront which run in about 8 seconds (acceptable), but when I add '| join("\n")' at the end, the runtime jumps to over 3 minutes.
Is there any reason for join() to be that slow? Is there another way to get the same output without join()?
I am currently using jq 1.5 (latest stable).
Here is the JQ file
json2csv.jq
def json2csv:
  def tonull: if . == "null" then null else . end;
  (.[0] | keys) as $headers |
  [(
    $headers | join("\t")
  ), (
    [ .[] as $row | [ $headers[] as $h | $row[$h] | tostring | tonull ] | join("\t") ] | join("\n")
  )] | join("\n")
;
json2csv
Considering:
$ jq 'length' test.json
23717
With the script as I want it (shown above):
$ time jq -rf json2csv.jq test.json > test.csv
real 3m46.721s
user 1m48.660s
sys 1m57.698s
With the same script, removing the join("\n")
$ time jq -rf json2csv.jq test.json > test.csv
real 0m8.564s
user 0m8.301s
sys 0m0.242s
(Note: I removed the second join as well, because otherwise jq cannot join an array containing both a string and an array, which makes sense. But that join only operates on an array of 2 elements anyway, so it isn't the problem.)
You don't need to use join at all. Rather than thinking of converting the whole file to a single string, think of it as converting each row to strings. The way jq outputs streams of results will give you the desired result in the end (assuming you take the raw output).
Try something more like this:
def json2csv:
  def tonull: if . == "null" then null else . end;
  (.[0] | keys) as $headers
  # output headers followed by rows of values as arrays
  | (
      $headers
    ),
    (
      .[] | [ .[$headers[]] | tostring | tonull ]
    )
  # convert the arrays to tab-separated-values strings
  | @tsv
;
After thinking about it, I remembered that jq automatically outputs a newline ('\n') after each value when you stream an array (.[]), which means that in this particular case I can just do this:
def json2csv:
  def tonull: if . == "null" then null else . end;
  (.[0] | keys) as $headers |
  [(
    $headers | join("\t")
  ), (
    [ .[] as $row | [ $headers[] as $h | $row[$h] | tostring | tonull ] | join("\t") ] | .[]
  )] | .[]
;
json2csv
And this solved my problem:
$ time jq -rf json2csv.jq test.json > test.csv
real 0m6.725s
user 0m6.454s
sys 0m0.245s
I'm leaving the question up, because if I had wanted to use any character other than '\n', this wouldn't have solved the issue.
When producing output such as CSV or TSV, the idea is to stream the data as much as possible. The last thing you want to do is run join on an array containing all the data. If you did want to use a delimiter other than \n, you'd add it to each item in the stream, and then use the -j command-line option.
Also, I think your diagnosis is probably not quite right, as joining an array containing a large number of small strings is quite fast. Below are timings comparing joining an array of two strings with one of 100,000 strings. In case you're wondering, my machine is rather slow.
$ ./join.sh 2
3
real 0.03
user 0.02
sys 0.00
1896448 maximum resident set size
$ ./join.sh 100000
588889
real 2.20
user 2.05
sys 0.13
21188608 maximum resident set size
$ cat join.sh
#!/bin/bash
/usr/bin/time -lp jq -n --argjson n "$1" '[range(0;$n)|tostring]|join(".")|length'
The above runs used jq 1.6, but using jq 1.5 produces very similar results.
On the other hand, joining a large number (20,000) of very long strings (1K) is noticeably slow, so evidently the current jq implementation is not designed for such operations.
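To illustrate the streaming idea (each row is written as soon as it is produced, and nothing is accumulated for a final join), here is a Python sketch; json2tsv and the sample rows are hypothetical, not part of the original script:

```python
import csv
import io

# Stream each row out as it is produced instead of join()-ing the whole
# result into one giant string -- the same idea as emitting values with
# .[] (or @tsv) and letting jq's raw output insert the newlines.
# (Illustrative sketch, not the jq implementation.)
def json2tsv(rows, out):
    headers = list(rows[0].keys())
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(headers)
    for row in rows:  # each row is written immediately; nothing accumulates
        writer.writerow(None if row[h] == "null" else row[h] for h in headers)

buf = io.StringIO()
json2tsv([{"a": 1, "b": "x"}, {"a": 2, "b": "null"}], buf)
print(buf.getvalue(), end="")
```

Memory use stays proportional to one row rather than to the whole output, which is the property the join-based version gives up.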

JSON to CSV: variable number of columns per row

I need to convert JSON to CSV where JSON has arrays of variable length, for example:
JSON objects:
{"labels": ["label1"]}
{"labels": ["label2", "label3"]}
{"labels": ["label1", "label4", "label5"]}
Resulting CSV:
labels,labels,labels
"label1",,
"label2","label3",
"label1","label4","label5"
There are many other properties in the source JSON; this is just an excerpt for the sake of simplicity.
Also, I need to say that the process has to work with JSON as a stream, because the source JSON could be very large (>1GB).
I wanted to use jq with two passes, the first pass would collect the maximum length of the 'labels' array, the second pass would create CSV as the number of the resulting columns is known by this time. But jq doesn't have a concept of global variables, so I don't know where I can store the running total.
I'd like to be able to do that on Windows via CLI.
Thank you in advance.
The question shows a stream of JSON objects, so the following solutions assume that the input file is already a sequence as shown. These solutions can also easily be adapted to cover the case where the input file contains a huge array of objects, e.g. as discussed in the epilog.
A two-invocation solution
Here's a two-pass solution using two invocations of jq. The presentation assumes a bash-like environment, in case you have WSL:
n=$(jq -n 'reduce (inputs|.labels|length) as $i (-1;
      if $i > . then $i else . end)' stream.json)

jq -nr --argjson n $n '
  def fill($n): . + [range(length;$n)|null];
  [range(0;$n)|"labels"],
  (inputs | .labels | fill($n))
  | @csv' stream.json
Assuming the input is as described, this is guaranteed to produce valid CSV. Hopefully you can adapt the above to your shell as necessary -- maybe this link will help:
Assign output of a program to a variable using a MS batch file
Using input_filename and a single invocation of jq
Unfortunately, jq does not have a "rewind" facility, but there is an alternative: read the file twice within a single invocation of jq. This is more cumbersome than the two-invocation solution above, but it avoids any difficulties associated with the latter.
cat sample.json | jq -nr '
  def fill($n): . + [range(length;$n)|null];
  def max($x): if . < $x then $x else . end;
  foreach (inputs|.labels) as $in ( {n:0};
    if input_filename == "<stdin>"
    then .n |= max($in|length)
    else .printed += 1
    end;
    if .printed == null then empty
    else .n as $n
      | (if .printed == 1 then [range(0;$n)|"labels"] else empty end),
        ($in | fill($n))
    end)
  | @csv' - sample.json
Another single-invocation solution
The following solution uses a special value (here null) to delineate the two streams:
(cat stream.json; echo null; cat stream.json) | jq -nr '
  def fill($n): . + [range(length; $n) | null];
  def max($x): if . < $x then $x else . end;
  (label $loop | foreach inputs as $in (0;
     if $in == null then . else max($in|.labels|length) end;
     if $in == null then ., break $loop else empty end)) as $n
  | [range(0;$n)|"labels"],
    (inputs | .labels | fill($n))
  | @csv'
Epilog
A file with a top-level JSON array that is too large to fit into memory can be converted into a stream of the array's items by invoking jq with the --stream option, e.g. as follows:
jq -cn --stream 'fromstream(1|truncate_stream(inputs))'
For such a large file, you will probably want to do this in two separate invocations: one to get the count, then another to actually output the csv. If you wanted to read the whole file into memory, you could do this in one invocation, but we definitely don't want to do that; we'll want to stream it in where possible.
Things get a little ugly when it comes to storing the result of commands in a variable; writing to a file might be simpler. But I'd rather not use temp files if we don't have to.
REM assuming in a batch file
for /f "usebackq delims=" %%i in (`jq -n --stream "reduce (inputs | .[0][1] + 1) as $l (0; if $l > . then $l else . end)" input.json`) do set cols=%%i
jq -rn --stream --argjson cols "%cols%" "[range($cols)|\"labels\"],(fromstream(1|truncate_stream(inputs))|[.[],(range($cols-length)|null)])|@csv" input.json
> jq -n --stream "reduce (inputs | .[0][1] + 1) as $l (0; if $l > . then $l else . end)" input.json
For the first invocation, to get the count of columns, we're just taking advantage of the fact that the paths to the array values can be used to determine the lengths of the arrays. We just want to take the max across all items.
> jq -rn --stream --argjson cols "%cols%" ^
"[range($cols)|\"labels\"],(fromstream(1|truncate_stream(inputs))|[.[],(range($cols-length)|null)])|@csv" input.json
Then to output the rest, we take the labels array (assuming it's the only property on the objects) and pad it out with null up to the $cols count, then output as csv.
If the labels are in a different, deeply nested path than what's in your example here, you'll need to select based on the appropriate paths.
set labelspath=foo.bar.labels
jq -rn --stream --argjson cols "%cols%" --arg labelspath "%labelspath%" ^
"($labelspath|split(\".\")|[.,length]) as [$path,$depth] | [range($cols)|\"labels\"],(fromstream($depth|truncate_stream(inputs|select(.[0][:$depth] == $path)))|[.[],(range($cols-length)|null)])|@csv" input.json
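The two-pass idea behind all of these answers (first find the maximum number of labels, then pad each row out to that width) can be sketched in Python; the sample data comes from the question, and the variable names are illustrative:

```python
import json

# Pass 1: scan the stream once to find the maximum number of labels --
# the "running total" the question asks about is just an ordinary
# variable outside jq. Pass 2: emit a header row and padded rows.
lines = [
    '{"labels": ["label1"]}',
    '{"labels": ["label2", "label3"]}',
    '{"labels": ["label1", "label4", "label5"]}',
]

n = max(len(json.loads(line)["labels"]) for line in lines)

rows = [["labels"] * n]
for line in lines:
    labels = json.loads(line)["labels"]
    rows.append(labels + [None] * (n - len(labels)))  # pad like fill($n)

for row in rows:
    print(row)
```

Each pass only needs one line in memory at a time, which is what makes the approach workable for a >1GB input.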

jq add value of a key in nested array and given to a new key

I have a stream of JSON arrays like this
[{"id":"AQ","Count":0}]
[{"id":"AR","Count":1},{"id":"AR","Count":3},{"id":"AR","Count":13},
{"id":"AR","Count":12},{"id":"AR","Count":5}]
[{"id":"AS","Count":0}]
I want to use jq to produce new JSON like this:
{"id":"AQ","Count":0}
{"id":"AR","Count":34}
{"id":"AS","Count":0}
34 = 1 + 3 + 13 + 12 + 5, the sum of the Counts in the second array.
I don't know how to describe it in more detail, but the basic idea is shown in my example.
I use bash and prefer to use jq to solve this problem. Thank you!
If you want an efficient but generic solution that does NOT assume each input array has the same ids, then the following helper function makes a solution easy:
# Input: a JSON object representing the subtotals
# Output: the object augmented with additional subtotals
def adder(stream; id; filter):
  reduce stream as $s (.; .[$s|id] += ($s|filter));
Assuming your jq has inputs, then the most efficient approach is to use it (but remember to use the -n command-line option):
reduce inputs as $row ({}; adder($row[]; .id; .Count) )
This produces:
{"AQ":0,"AR":34,"AS":0}
From here, it's easy to get the answer you want, e.g. using to_entries[] | {(.key): .value}
If your jq does not have inputs and if you don't want to upgrade, then use the -s option (instead of -n) and replace inputs by .[]
Assuming the .id is the same in each array:
first + {Count: map(.Count) | add}
Or perhaps more intelligibly:
(map(.Count) | add) as $sum | first | .Count = $sum
Or more declaratively:
{ id: (first|.id), Count: (map(.Count) | add) }
It's a bit kludgey, but given your input:
jq -c '
  reduce .[] as $item ({}; .[($item.id)] += ($item.Count))
  | to_entries
  | .[] | {"id": .key, "Count": .value}
'
Yields the output:
{"id":"AQ","Count":0}
{"id":"AR","Count":34}
{"id":"AS","Count":0}
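The reduce pattern used in these answers amounts to accumulating sums in a dictionary keyed by id; here is a Python sketch of the same idea (illustrative only, using the question's sample data):

```python
import json

# Sum the Counts per id across a stream of arrays -- the same reduce
# pattern as adder(stream; id; filter) above (illustrative sketch).
stream = [
    [{"id": "AQ", "Count": 0}],
    [{"id": "AR", "Count": 1}, {"id": "AR", "Count": 3},
     {"id": "AR", "Count": 13}, {"id": "AR", "Count": 12},
     {"id": "AR", "Count": 5}],
    [{"id": "AS", "Count": 0}],
]

totals = {}
for array in stream:
    for item in array:
        totals[item["id"]] = totals.get(item["id"], 0) + item["Count"]

# Equivalent of to_entries[] | {id: .key, Count: .value}
for key, value in totals.items():
    print(json.dumps({"id": key, "Count": value}))
```

Like the generic jq solution, this makes no assumption that each input array carries a single id.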

Using jq, Flatten Arbitrary JSON to Delimiter-Separated Flat Dictionary

I'm looking to transform JSON using jq to a delimiter-separated and flattened structure.
There have been attempts at this. For example, Flatten nested JSON using jq.
However the solutions on that page fail if the JSON contains arrays. For example, if the JSON is:
{"a":{"b":[1]},"x":[{"y":2},{"z":3}]}
The solution above will fail to transform the above to:
{"a.b.0":1,"x.0.y":2,"x.1.z":3}
In addition, I'm looking for a solution that will also allow for an arbitrary delimiter. For example, suppose the space character is the delimiter. In this case, the result would be:
{"a b 0":1,"x 0 y":2,"x 1 z":3}
I'm looking to have this functionality accessed via a Bash (4.2+) function as is found in CentOS 7, something like this:
flatten_json()
{
    local JSONData="$1"
    # jq command to flatten $JSONData, putting the result to stdout
    jq ... <<<"$JSONData"
}
The solution should work with all JSON data types, including null and boolean. For example, consider the following input:
{"a":{"b":["p q r"]},"w":[{"x":null},{"y":false},{"z":3}]}
It should produce:
{"a b 0":"p q r","w 0 x":null,"w 1 y":false,"w 2 z":3}
If you stream the data in, you'll get pairings of paths and values for all leaf values, plus, for each object or array, a lone path marking the end of its definition. Using leaf_paths as you found would only give you paths to truthy leaf values, so you'd miss null and even false values. As a stream, you won't have this problem.
There are many ways this could be combined to an object, I'm partial to using reduce and assignment in these situations.
$ cat input.json
{"a":{"b":["p q r"]},"w":[{"x":null},{"y":false},{"z":3}]}
$ jq --arg delim '.' 'reduce (tostream|select(length==2)) as $i ({};
      .[[$i[0][]|tostring]|join($delim)] = $i[1]
  )' input.json
{
"a.b.0": "p q r",
"w.0.x": null,
"w.1.y": false,
"w.2.z": 3
}
Here's the same solution broken up a bit to allow room for explanation of what's going on.
$ jq --arg delim '.' 'reduce (tostream|select(length==2)) as $i ({};
      [$i[0][]|tostring] as $path_as_strings
      | ($path_as_strings|join($delim)) as $key
      | $i[1] as $value
      | .[$key] = $value
  )' input.json
Converting the input to a stream with tostream, we'll receive multiple values of pairs and paths as input to our filter. With this, we can pass those multiple values into reduce, which is designed to accept multiple values and do something with them. But before we do, we filter those values to keep only the pairs (select(length==2)).
Then in the reduce call, we start with a clean object and assign new values, using a key derived from the path and the corresponding leaf value. Remember that every value produced in the reduce call is used for the next iteration. Binding values to variables doesn't change the current context, and assignments effectively "modify" the current value (the initial object) and pass it along.
$path_as_strings is just the path, an array of strings and numbers, converted to all strings. [$i[0][]|tostring] is a shorthand I use as an alternative to map when the array I want to map is not the current input; it's more compact, since the mapping is done as a single expression, instead of having to write ($i[0]|map(tostring)) to get the same result.
Then from there we convert that array of strings to the desired key using the provided delimiter, and assign the appropriate value to the current object.
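For comparison, the same path-flattening can be sketched in Python (an illustrative translation; flatten is a hypothetical helper, not part of jq):

```python
def flatten(value, delim=".", path=()):
    """Yield (flattened_key, leaf_value) pairs, joining path components
    with `delim` -- the same result as the tostream/reduce filter above.
    None and False are kept, just as the tostream pairs include them."""
    if isinstance(value, dict) and value:
        for k, v in value.items():
            yield from flatten(v, delim, path + (str(k),))
    elif isinstance(value, list) and value:
        for i, v in enumerate(value):
            yield from flatten(v, delim, path + (str(i),))
    else:
        # scalars (including None and False) and empty containers
        yield delim.join(path), value

doc = {"a": {"b": ["p q r"]}, "w": [{"x": None}, {"y": False}, {"z": 3}]}
print(dict(flatten(doc)))        # dotted keys
print(dict(flatten(doc, " ")))   # arbitrary delimiter, as the question asks
```

The delimiter is just an argument, which mirrors how the jq solutions take it via --arg delim.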
The following has been tested with jq 1.4, jq 1.5 and the current "master" version. The requirement about including paths to null and false is the reason for "allpaths" and "all_leaf_paths".
# all paths, including paths to null
def allpaths:
  def conditional_recurse(f): def r: ., (select(. != null) | f | r); r;
  path(conditional_recurse(.[]?)) | select(length > 0);

def all_leaf_paths:
  def isscalar: type | (. != "object" and . != "array");
  allpaths as $p
  | select(getpath($p) | isscalar)
  | $p;

. as $in
| reduce all_leaf_paths as $path ({};
    . + { ($path | map(tostring) | join($delim)): ($in | getpath($path)) })
With this jq program in flatten.jq:
$ cat input.json
{"a":{"b":["p q r"]},"w":[{"x":null},{"y":false},{"z":3}]}
$ jq --arg delim . -f flatten.jq input.json
{
"a.b.0": "p q r",
"w.0.x": null,
"w.1.y": false,
"w.2.z": 3
}
Collisions
Here is a helper function that illustrates an alternative path-flattening algorithm. It converts keys that contain the delimiter to quoted strings, and array elements are presented in square brackets (see the example below):
def flattenPath(delim):
  reduce .[] as $s ("";
    if ($s|type) == "number"
    then ((if . == "" then "." else . end) + "[\($s)]")
    else . + ($s | tostring | if index(delim) then "\"\(.)\"" else . end)
    end );
Example: Using flattenPath instead of map(tostring) | join($delim), the object:
{"a.b": [1]}
would become:
{
"\"a.b\"[0]": 1
}
To add a new option to the solutions already given: jqg is a script I wrote to flatten any JSON file and then search it using a regex. For your purposes, the regex would simply be '.', which matches everything.
$ echo '{"a":{"b":[1]},"x":[{"y":2},{"z":3}]}' | jqg .
{
"a.b.0": 1,
"x.0.y": 2,
"x.1.z": 3
}
and can produce compact output:
$ echo '{"a":{"b":[1]},"x":[{"y":2},{"z":3}]}' | jqg -q -c .
{"a.b.0":1,"x.0.y":2,"x.1.z":3}
It also handles the more complicated example that @peak used:
$ echo '{"a":{"b":["p q r"]},"w":[{"x":null},{"y":false},{"z":3}]}' | jqg .
{
"a.b.0": "p q r",
"w.0.x": null,
"w.1.y": false,
"w.2.z": 3
}
as well as empty arrays and objects (and a few other edge-case values):
$ jqg . test/odd-values.json
{
"one.start-string": "foo",
"one.null-value": null,
"one.integer-number": 101,
"two.two-a.non-integer-number": 101.75,
"two.two-a.number-zero": 0,
"two.true-boolean": true,
"two.two-b.false-boolean": false,
"three.empty-string": "",
"three.empty-object": {},
"three.empty-array": [],
"end-string": "bar"
}
(Reporting empty arrays and objects can be turned off with the -E option.)
jqg was tested with jq 1.6
Note: I am the author of the jqg script.