jq slow to join() medium size array - json

I am trying to join() a relatively big array (20k elements) of objects with a character ('\n' in this particular case). I have a few operations upfront which resolve in about 8 seconds (acceptable), but when I try to '| join("\n")' at the end, the runtime jumps to 3+ minutes.
Is there any reason for join() to be that slow? Is there another way of getting the same output without join()?
I am currently using jq-1.5 (latest stable)
Here is the JQ file
json2csv.jq
def json2csv:
def tonull: if . == "null" then null else . end;
(.[0] | keys) as $headers |
[(
$headers | join("\t")
), (
[ .[] as $row | [ $headers[] as $h | $row[$h] | tostring | tonull ] | join("\t") ] | join("\n")
)] | join("\n")
;
json2csv
Considering:
$ jq 'length' test.json
23717
With the script as I want it (shown above):
$ time jq -rf json2csv.jq test.json > test.csv
real 3m46.721s
user 1m48.660s
sys 1m57.698s
With the same script, removing the join("\n")
$ time jq -rf json2csv.jq test.json > test.csv
real 0m8.564s
user 0m8.301s
sys 0m0.242s
(Note: I removed the second join because otherwise jq cannot concatenate an array and a string, which makes sense; but that join operates on an array of only 2 elements anyway, so it isn't the problem.)

You don't need to use join at all. Rather than thinking of converting the whole file to a single string, think of it as converting each row to strings. The way jq outputs streams of results will give you the desired result in the end (assuming you take the raw output).
Try something more like this:
def json2csv:
def tonull: if . == "null" then null else . end;
(.[0] | keys) as $headers
# output headers followed by rows of values as arrays
| (
$headers
),
(
.[] | [ .[$headers[]] | tostring | tonull ]
)
# convert the arrays to tab separated values strings
| @tsv
;
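For reference, a possible invocation (assuming the definition above is saved as json2csv.jq and, as in the original script, followed by a final json2csv line that invokes it):
$ jq -rf json2csv.jq test.json > test.csv
The -r flag is what turns each emitted string into a plain line of output rather than a quoted JSON string.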

After thinking about it, I remembered that jq automatically prints a newline ('\n') after each result if you expand an array (.[]), which means that in this particular case I can just do this:
def json2csv:
def tonull: if . == "null" then null else . end;
(.[0] | keys) as $headers |
[(
$headers | join("\t")
), (
[ .[] as $row | [ $headers[] as $h | $row[$h] | tostring | tonull ] | join("\t") ] | .[]
)] | .[]
;
json2csv
And this solved my problem:
time jq -rf json2csv.jq test.json > test.csv
real 0m6.725s
user 0m6.454s
sys 0m0.245s
I'm leaving the question up because, if I had wanted to use any character other than '\n', this wouldn't have solved the issue.

When producing output such as CSV or TSV, the idea is to stream the data as much as possible. The last thing you want to do is run join on an array containing all the data. If you did want to use a delimiter other than \n, you'd add it to each item in the stream, and then use the -j command-line option.
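For instance, if you wanted "|" as the delimiter, one sketch (the inline input here is only for illustration) is to emit the delimiter before every element after the first and let -j suppress the newlines jq would otherwise print:
$ jq -rj '.[0], (.[1:][] | "|", .), "\n"' <<< '["a","b","c"]'
a|b|c
This never builds the joined string in memory; the pieces are streamed straight to stdout.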
Also, I think your diagnosis is probably not quite right, as joining an array containing a large number of small strings is quite fast. Below are timings comparing joining an array of two strings with an array of 100,000 strings. In case you're wondering, my machine is rather slow.
$ ./join.sh 2
3
real 0.03
user 0.02
sys 0.00
1896448 maximum resident set size
$ ./join.sh 100000
588889
real 2.20
user 2.05
sys 0.13
21188608 maximum resident set size
$ cat join.sh
#!/bin/bash
/usr/bin/time -lp jq -n --argjson n "$1" '[range(0;$n)|tostring]|join(".")|length'
The above runs used jq 1.6, but using jq 1.5 produces very similar results.
On the other hand, joining a large number (20,000) of very long strings (1K) is noticeably slow, so evidently the current jq implementation is not designed for such operations.
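To reproduce that case, a variant of join.sh along these lines could be used (the 1024-character element size is an arbitrary choice):
#!/bin/bash
# Same harness as join.sh, but each element is a ~1K string instead of a short number.
/usr/bin/time -lp jq -n --argjson n "$1" '[range(0;$n) | "x" * 1024] | join(".") | length'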

Related

Inducing a schema using jq more efficiently

I ingest a lot of data. It comes from lots of different sources, and all ultimately goes into BigQuery.
I preparse into .jsonl file(s) — 1 line per record, named by destination table.
For a rough sense of scale, here's a sample from a dataset I'm doing now. (All the data below is real, just lightly redacted / cleaned up.)
% find json -type f -size +2000c -print0 | head -z | sort | wc --files0-from=-
2 387 4737 json/baz_1.jsonl
3 579 7055 json/baz_2.jsonl
1 193 2358 json/baz_3.jsonl
25 4835 58958 json/baz_4.jsonl
37 7161 87467 json/baz_5.jsonl
3 580 7072 json/baz_6.jsonl
15 2897 35393 json/baz_7.jsonl
129 24950 304262 json/baz_8.jsonl
3 373 4221 json/foo_1.jsonl
6 746 8491 json/foo_2.jsonl
224 42701 520014 total
% wc -l *.jsonl
11576 foos.jsonl
20 bars.jsonl
337770 bazzes.jsonl
349366 total
% du -m *.jsonl
3 foos.jsonl
1 bars.jsonl
93 bazzes.jsonl
This is relatively small for me. Other datasets are in the millions of rows / terabytes of data range.
Because the data comes from external sources, often undocumented, often not matching specs or just plain messy (e.g. various signal values for null, multiple date formats in the same field, etc), I don't really know the structure beforehand.
However, I want to have a nice, clean, efficient structure in my destination table — e.g. cast to the correct type like integer/bool/date, set REQUIRED/NULLABLE correctly, know which columns are actually enums, convert stringified arrays into REPEATED columns, have a good guess on what I can use effectively for partitioning / clustering, etc. etc.
It inevitably requires some manual work on samples to infer what's actually going on, but my first pass for doing this is jq (version 1.6).
This is my current code:
~/.jq
def isempty(v):
(v == null or v == "" or v == [] or v == {});
def isnotempty(v):
(isempty(v) | not);
def remove_empty:
walk(
if type == "array" then
map(select(isnotempty(.)))
elif type == "object" then
with_entries(select(isnotempty(.value))) # Note: this will remove keys with empty values
else .
end
);
# bag of words
def bow(stream):
reduce stream as $word ({}; .[($word|tostring)] += 1);
# https://stackoverflow.com/questions/46254655/how-to-merge-json-objects-by-grouping-on-key-with-jq
def add_by(f):
reduce .[] as $x ({}; ($x|f) as $f | .[$f] += [$x])
| [.[] | add];
# takes array of {string: #, ...}
def merge_counts:
map(.|to_entries)|flatten | add_by(.key)|from_entries;
induce_schema.sh (linebreaks added)
#!/bin/zsh
pv -cN ingestion -s `wc -l $1` -l $1 | \
jq -c --unbuffered --stream '{"name": ( .[0]), "encoded_type":( .[1] | type), \
"tonumber": (.[1] | if (type == "string") then try(tonumber|type) catch type else null end), \
"chars": (.[1] | if(type=="string") then try(split("") | sort | unique | join("")) else null end), \
"length":(.[1] | length),"data":.[1]}' | \
# sed -r 's/[0-9]+(,|])/"array"\1/g' | awk '!_[$0]++' | sort | \
pv -cN grouping -l | \
jq -sc '. | group_by(.name,.encoded_type,.tonumber)[] | {"name":first|.name, \
"encoded_type":([(first|.encoded_type),(first|.tonumber)]|unique - [null]|join("_")), \
"allchars": (map(.chars) | join("")|split("")|sort|unique|join("")), \
"count_null": (map(.data | select(.==null)) | length), \
"count_empty": (map(.data | select(.==[] or . == {} or . == "")) | length), \
"count_nonempty": (map(.data | select(. != null and . != "")) |length), \
"unique": (map(.data)|unique|length), "length": bow(.[] | .length) }' | \
pv -cN final -l | \
jq -sc '. | group_by(.name)[] | {"name":first|.name, \
"nullable":(map(.encoded_type) | contains(["null"])), \
"schemas_count":(map(. | select(.encoded_type != "null") )|length), \
"lengths":(map(.length)|merge_counts), "total_nonempty":(map(.count_nonempty)|add), \
"total_null":(map(.count_null)|add), "total_empty": (map(.count_empty) |add), \
"schemas":map(. | select(.encoded_type != "null") | del(.name) )}'
Here's a partial output for bars.jsonl (linebreaks added for ease of reading):
{"name":["FILING CODE"],"nullable":false,"schemas_count":1,
"lengths":{"0":1930,"2":16},
"total_nonempty":16,"total_null":0,"total_empty":1930,
"schemas":[
{"encoded_type":"string","allchars":"EGPWX",
"count_null":0,"count_empty":1930,"count_nonempty":16,"unique":6,
"length":{"0":1930,"2":16}}
]}
{"name":["LAST NAME"],"nullable":true,"schemas_count":1,
"lengths":{"0":416,"5":4650,"6":5648,"7":4796,"4":1934,"8":3042,"9":1362,"10":570,"11":226,"3":284,"14":30,"12":70,"13":54,"16":20,"15":26,"17":10,"18":8,"2":4,"19":2},
"total_nonempty":22736,"total_null":2,"total_empty":416,
"schemas":[
{"encoded_type":"string","allchars":" ABCDEFGHIJKLMNOPQRSTUVWXYZ",
"count_null":0,"count_empty":416,"count_nonempty":22736,"unique":6233,
"length":{"5":4650,"6":5648,"7":4796,"4":1934,"8":3042,"9":1362,"10":570,"11":226,"3":284,"14":30,"12":70,"0":416,"13":54,"16":20,"15":26,"17":10,"18":8,"2":4,"19":2}}
]}
{"name":["NUMBER OF COFFEES"],"nullable":false,"schemas_count":2,
"lengths":{"1":16,"0":4},
"total_nonempty":16,"total_null":0,"total_empty":4,
"schemas":[
{"encoded_type":"number_string","allchars":"1",
"count_null":0,"count_empty":0,"count_nonempty":16,"unique":1,
"length":{"1":16}},
{"encoded_type":"string","allchars":"",
"count_null":0,"count_empty":4,"count_nonempty":0,"unique":1,
"length":{"0":4}}
]}
{"name":["OFFICE CODE"],"nullable":false,"schemas_count":2,
"lengths":{"3":184,"0":22092},
"total_nonempty":1036,"total_null":0,"total_empty":22092,
"schemas":[
{"encoded_type":"number_string","allchars":"0123456789",
"count_null":0,"count_empty":0,"count_nonempty":852,"unique":254,
"length":{"3":852}},
{"encoded_type":"string","allchars":"0123456789ABCDEIJQRSX",
"count_null":0,"count_empty":22092,"count_nonempty":184,"unique":66,
"length":{"0":22092,"3":184}}
]}
{"name":["SOURCE FILE"],"nullable":true,"schemas_count":1,
"lengths":{"0":416,"7":22708},
"total_nonempty":22708,"total_null":23124,"total_empty":416,
"schemas":[
{"encoded_type":"string","allchars":"0123456789F_efil",
"count_null":0,"count_empty":416,"count_nonempty":22708,"unique":30,
"length":{"7":22708,"0":416}}
]}
...
The point of this is to get a summary of "how is this unknown dataset structured and what's in it" that I can readily transform into my BigQuery table schema / parameters, use to point at what I'll probably need to do next for turning it into something cleaner & more usable than what I got, etc.
This code works, but those -s (slurp) invocations are really hard on server RAM. (They simply wouldn't work if the dataset were any larger than this; I added those parts just today. On the bazzes dataset, it uses about 20GB of RAM in total, including swap.)
It also doesn't detect e.g. any of the date/time field types.
I believe it should be possible to make this far more efficient using @joelpurra's jq + parallel approach and/or the jq cookbook's reduce inputs, but I'm having difficulty figuring out how.
So, I'd appreciate advice on how to make this
more CPU & RAM efficient
otherwise more useful (e.g. recognize date fields, which could be in almost any format)
Using inputs is the way to go, whether or not any parallelization techniques are brought to bear.
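As a rough sketch of the pattern applied to induce_schema.sh (stage1.jsonl stands for the output of the first jq invocation, and the aggregates shown are illustrative only):
jq -nc 'reduce inputs as $r ({};
          ($r.name | tostring) as $k
          | ($r.length | tostring) as $len
          | .[$k].count += 1
          | .[$k].count_null += (if $r.data == null then 1 else 0 end)
          | .[$k].lengths[$len] += 1
        )' stage1.jsonl
Because only per-name counters are kept, memory stays proportional to the number of distinct names rather than the number of rows.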
In the jq module for inducing structural schema that I wrote some time ago (https://gist.github.com/pkoppstein/a5abb4ebef3b0f72a6ed), there's a filter, schema/1 defined as:
def schema(stream):
reduce stream as $x ("null"; typeUnion(.; $x|typeof));
This can therefore be used as suggested by this snippet:
jq -n 'include "schema"; schema(inputs)' FILESPECIFICATIONS
(This assumes that the file "schema.jq" defining the schema module has been appropriately installed.)
The point here is not so much that schema.jq might be adapted to your particular expectations, but that the above "def" can serve as a guide (whether or not using jq) for how to write an efficient schema-inference engine, in the sense of being able to handle a very large number of instances. That is, you basically have only to write a definition of typeof (which should yield the desired "type" in the most general sense), and of typeUnion (which defines how two types are to be combined).
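As a rough illustration of what those two building blocks might look like (this is a simplification, not the definitions actually used in schema.jq):
def typeof: type;             # "null", "boolean", "number", "string", "array" or "object"
def typeUnion($a; $b):
  if   $a == $b     then $a
  elif $a == "null" then $b   # "null" acts as the bottom type
  elif $b == "null" then $a
  else "JSON"                 # incompatible types: fall back to "anything"
  end;
These plug directly into the schema(stream) definition quoted above.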
Of course, inferring schemas can be a tricky business. In particular, schema(stream) will never fail, assuming the inputs are valid JSON. That is, whether or not the inferred schema will be useful depends largely on how it is used. I find an integrated approach based on these elements to be essential:
a schema specification language;
a schema inference engine that generates schemas that conform to (1);
a schema-checker.
Further thoughts
schema.jq is simple enough to be tailored to more specific requirements, e.g. to infer dates (see the sketch below).
You might be interested in JESS ("JSON Extended Structural Schemas"), which combines a JSON-based specification language with jq-oriented tools: https://github.com/pkoppstein/JESS
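On the first point, a minimal illustration of such tailoring (the regex and the "date" label are my own assumptions, not part of schema.jq as published, and test requires a jq build with regex support):
def typeof:
  if type == "string" and test("^[0-9]{4}-[0-9]{2}-[0-9]{2}$")
  then "date"
  else type
  end;
typeUnion would then also need a rule for combining "date" with "string".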

JSON to CSV: variable number of columns per row

I need to convert JSON to CSV where JSON has arrays of variable length, for example:
JSON objects:
{"labels": ["label1"]}
{"labels": ["label2", "label3"]}
{"labels": ["label1", "label4", "label5"]}
Resulting CSV:
labels,labels,labels
"label1",,
"label2","label3",
"label1","label4","label5"
There are many other properties in the source JSON; this is just an excerpt for the sake of simplicity.
Also, I need to say that the process has to work with JSON as a stream because source JSON could be very large (>1GB).
I wanted to use jq with two passes, the first pass would collect the maximum length of the 'labels' array, the second pass would create CSV as the number of the resulting columns is known by this time. But jq doesn't have a concept of global variables, so I don't know where I can store the running total.
I'd like to be able to do that on Windows via CLI.
Thank you in advance.
The question shows a stream of JSON objects, so the following solutions assume that the input file is already a sequence as shown. These solutions can also easily be adapted to cover the case where the input file contains a huge array of objects, e.g. as discussed in the epilog.
A two-invocation solution
Here's a two-pass solution using two invocations of jq. The presentation assumes a bash-like environment, in case you have wsl:
n=$(jq -n 'reduce (inputs|.labels|length) as $i (-1;
if $i > . then $i else . end)' stream.json)
jq -nr --argjson n $n '
def fill($n): . + [range(length;$n)|null];
[range(0;$n)|"labels"],
(inputs | .labels | fill($n))
| @csv' stream.json
Assuming the input is as described, this is guaranteed to produce valid CSV. Hopefully you can adapt the above to your shell as necessary -- maybe this link will help:
Assign output of a program to a variable using a MS batch file
Using input_filename and a single invocation of jq
Unfortunately, jq does not have a "rewind" facility, but
there is an alternative: read the file twice within a single invocation of jq. This is more cumbersome than the two-invocation solution above but avoids any difficulties associated with the latter.
cat sample.json | jq -nr '
def fill($n): . + [range(length;$n)|null];
def max($x): if . < $x then $x else . end;
foreach (inputs|.labels) as $in ( {n:0};
if input_filename == "<stdin>"
then .n |= max($in|length)
else .printed+=1
end;
if .printed == null then empty
else .n as $n
| (if .printed == 1 then [range(0;$n)|"labels"] else empty end),
($in | fill($n))
end)
| @csv' - sample.json
Another single-invocation solution
The following solution uses a special value (here null) to delineate the two streams:
(cat stream.json; echo null; cat stream.json) | jq -nr '
def fill($n): . + [range(length; $n) | null];
def max($x): if . < $x then $x else . end;
(label $loop | foreach inputs as $in (0;
if $in == null then . else max($in|.labels|length) end;
if $in == null then ., break $loop else empty end)) as $n
| [range(0;$n)|"labels"],
(inputs | .labels | fill($n))
| @csv'
Epilog
A file with a top-level JSON array that is too large to fit into memory can be converted into a stream of the array's items by invoking jq with the --stream option, e.g. as follows:
jq -cn --stream 'fromstream(1|truncate_stream(inputs))'
For such a large file, you will probably want to do this in two separate invocations: one to get the column count, then another to actually output the CSV. If you wanted to read the whole file into memory you could do it in one, but we definitely don't want that; we want to stream it in where possible.
Things get a little ugly when it comes to storing the result of a command in a variable; writing to a file might be simpler, but I'd rather not use temp files if we don't have to.
REM assuming in a batch file
for /f "usebackq delims=" %%i in (`jq -n --stream "reduce (inputs | .[0][1] + 1) as $l (0; if $l > . then $l else . end)" input.json`) do set cols=%%i
jq -rn --stream --argjson cols "%cols%" "[range($cols)|\"labels\"],(fromstream(1|truncate_stream(inputs))|[.[],(range($cols-length)|null)])|@csv" input.json
> jq -n --stream "reduce (inputs | .[0][1] + 1) as $l (0; if $l > . then $l else . end)" input.json
For the first invocation, which gets the column count, we're just taking advantage of the fact that the paths to the array values indicate the lengths of the arrays (last index + 1). We just want the max across all items.
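To see why that works, here is what the streamed events look like for one of the sample rows (shown with a bash-style echo purely for illustration):
$ echo '{"labels":["label2","label3"]}' | jq -c --stream .
[["labels",0],"label2"]
[["labels",1],"label3"]
[["labels",1]]
[["labels"]]
For the value events, .[0][1] is the array index, so .[0][1] + 1 is a length candidate, and the reduce keeps the largest one.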
> jq -rn --stream --argjson cols "%cols%" ^
"[range($cols)|\"labels\"],(fromstream(1|truncate_stream(inputs))|[.[],(range($cols-length)|null)])|@csv" input.json
Then, to output the rest, we take each labels array (assuming it's the only property on the objects), pad it out with nulls up to the $cols count, and output it as CSV.
If the labels are in a different, deeply nested path than what's in your example here, you'll need to select based on the appropriate paths.
set labelspath=foo.bar.labels
jq -rn --stream --argjson cols "%cols%" --arg labelspath "%labelspath%" ^
"($labelspath|split(\".\")|[.,length]) as [$path,$depth] | [range($cols)|\"labels\"],(fromstream($depth|truncate_stream(inputs|select(.[0][:$depth] == $path)))|[.[],(range($cols-length)|null)])|@csv" input.json

jq 1.5 print items from array that is inside another array

The incoming JSON file contains one JSON array per row, e.g.:
["a100","a101","a102","a103","a104","a105","a106","a107","a108"]
["a100","a102","a103","a106","a107","a108"]
["a100","a99"]
["a107","a108"]
a "filter array" would be ["a99","a101","a108"] so I can slurpfile it
I'm trying to figure out how to print only the values that are inside the "filter array", e.g. the output:
["a101","a108"]
["a108"]
["a99"]
["a108"]
You can port the IN function from jq 1.6 to 1.5 and use:
def IN(s): any(s == .; .);
map(select(IN($filter_array[])))
Or even shorter:
map(select(any($filter_array[]==.;.)))
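A possible invocation (here the filter array is passed with --argjson; if you slurpfile it instead, index the variable with [0] first, e.g. $filter_array[0][]):
$ jq -c --argjson filter_array '["a99","a101","a108"]' '
    def IN(s): any(s == .; .);
    map(select(IN($filter_array[])))' input.json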
I might be missing a simpler solution, but the following works:
map(select(. as $in | ["a99","a101","a108"] | contains([$in])))
Replace the hardcoded ["a99","a101","a108"] array with your slurped variable.
In the example, the arrays in the input stream are sorted (in jq's sort order), so it is worth noting that in such cases, a more efficient solution is possible using the bsearch built-in, or perhaps even better, the definition of intersection/2 given at https://rosettacode.org/wiki/Set#Finite_Sets_of_JSON_Entities
For ease of reference, here it is:
def intersection($A;$B):
def pop:
.[0] as $i
| .[1] as $j
| if $i == ($A|length) or $j == ($B|length) then empty
elif $A[$i] == $B[$j] then $A[$i], ([$i+1, $j+1] | pop)
elif $A[$i] < $B[$j] then [$i+1, $j] | pop
else [$i, $j+1] | pop
end;
[[0,0] | pop];
Assuming a jq invocation such as:
jq -c --argjson filter '["a99","a101","a108"]' -f intersections.jq input.json
an appropriate filter would be:
($filter | sort) as $sorted
| intersection(.; $sorted)
(Of course if $filter is already presented in jq's sort order, then the initial sort can be skipped, or replaced by a check.)
Output
["a101","a108"]
["a108"]
["a99"]
["a108"]
Unsorted arrays
In practice, jq's builtin sort filter is usually so fast that it might be worthwhile simply sorting the arrays in order to use intersection as defined above.
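That variant would amount to a one-line change to the filter shown above, e.g.:
($filter | sort) as $sorted | sort | intersection(.; $sorted)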

Optimize JSON denormalization using JQ - "cartesian product" from 1:N

I have a JSON database change log, output of wal2json. It looks like this:
{"xid":1190,"timestamp":"2018-07-19 17:18:02.905354+02","change":[
{"kind":"update","table":"mytable2","columnnames":["id","name","age"],"columnvalues":[401,"Update AA",20],"oldkeys":{"keynames":["id"],"keyvalues":[401]}},
{"kind":"update","table":"mytable2","columnnames":["id","name","age"],"columnvalues":[401,"Update BB",20],"oldkeys":{"keynames":["id"],"keyvalues":[401]}}]}
...
Each top level entry (xid) is a transaction, each item in change is, well, a change. One row may change multiple times.
To import into an OLAP system with a limited feature set, I need to have the order explicitly stated, so I need to add an sn (sequence number) for each change in a transaction.
Also, each change must be a top level entry - the OLAP can't iterate sub-items within one entry.
{"xid":1190, "sn":1, "kind":"update", "data":{"id":401,"name":"Update AA","age":20} }
{"xid":1190, "sn":2, "kind":"update", "data":{"id":401,"name":"Update BB","age":20} }
{"xid":1191, "sn":1, "kind":"insert", "data":{"id":625,"name":"Inserted","age":20} }
{"xid":1191, "sn":2, "kind":"delete", "data":{"id":625} }
(The reason is that the OLAP has limited ability to transform the data during import, and also doesn't have the order as a parameter.)
So, I do this using jq:
function transformJsonDataStructure {
## First let's reformat it to XML, then transform using XPATH, then back to JSON.
## Example input:
# {"xid":1074,"timestamp":"2018-07-18 17:49:54.719475+02","change":[
# {"kind":"update","table":"mytable2","columnnames":["id","name","age"],"columnvalues":[401,"Update AA",20],"oldkeys":{"keynames":["id"],"keyvalues":[401]}},
# {"kind":"update","table":"mytable2","columnnames":["id","name","age"],"columnvalues":[401,"Update BB",20],"oldkeys":{"keynames":["id"],"keyvalues":[401]}}]}
cat "$1" | while read -r LINE ; do
XID=`echo "$LINE" | jq -c '.xid'`;
export SN=0;
#serr "{xid: $XID, changes: $CHANGES}";
echo "$LINE" | jq -c '.change[]' | while read -r CHANGE ; do
SN=$((SN+=1))
KIND=`echo "$CHANGE" | jq -c --raw-output .kind`;
TABLE=`echo "$CHANGE" | jq -c --raw-output .table`;
DEST_FILE="$TARGET_PATH-$TABLE.json";
case "$KIND" in
update|insert)
MAP=$(convertTwoArraysToMap "$(echo "$CHANGE" | jq -c ".columnnames")" "$(echo "$CHANGE" | jq -c ".columnvalues")") ;;
delete)
MAP=$(convertTwoArraysToMap "$(echo "$CHANGE" | jq -c ".oldkeys.keynames")" "$(echo "$CHANGE" | jq -c ".oldkeys.keyvalues")") ;;
esac
#echo "{\"xid\":$XID, \"table\":\"$TABLE\", \"kind\":\"$KIND\", \"data\":$MAP }" >> "$DEST_FILE"; ;;
echo "{\"xid\":$XID, \"sn\":$SN, \"kind\":\"$KIND\", \"data\":$MAP }" | tee --append "$DEST_FILE";
done;
done;
return;
}
The problem is the performance. I am calling jq a few times per entry, which is quite slow: around 1000× slower than without the transformation.
How can I perform the transformation above in just one pass? (jq is not a must; another tool can be used too, but it should be available in CentOS packages. I want to avoid coding an extra tool for that.)
From man jq it seems that it could be capable of processing the whole file (one JSON entry per row) in one go. I could do it in XSLT, but I can't wrap my head around jq, especially iterating over the change array and combining columnnames and columnvalues into a map.
For the iteration, I think map or map_values could be used.
For the two arrays to map, I see the from_entries and with_entries functions, but can't get them to work.
Any jq master around to advise?
The following helper function converts the incoming array into an object using headers as the keys:
def objectify(headers):
[headers, .] | transpose | map({(.[0]): .[1]}) | add;
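For instance, checking the helper on one row's values (a throwaway illustration):
$ jq -nc 'def objectify(headers): [headers, .] | transpose | map({(.[0]): .[1]}) | add;
          [401,"Update AA",20] | objectify(["id","name","age"])'
{"id":401,"name":"Update AA","age":20}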
The trick now is to use range(0;length) to generate .sn:
{xid} +
(.change
| range(0;length) as $i
| .[$i]
| .columnnames as $header
| {sn: ($i + 1),
kind,
data: (.columnvalues|objectify($header)) } )
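A possible invocation (denorm.jq is just a placeholder name for a file containing the filter above; wal2json output is assumed to be one JSON object per line):
$ jq -c -f denorm.jq changelog.json
The -c flag keeps each change on a single line.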
Output
For the given log entry, the output would be:
{"xid":1190,"sn":1,"kind":"update","data":{"id":401,"name":"Update AA","age":20}}
{"xid":1190,"sn":2,"kind":"update","data":{"id":401,"name":"Update BB","age":20}}
Moral
If a solution looks too complicated, it probably is.

Using jq, Flatten Arbitrary JSON to Delimiter-Separated Flat Dictionary

I'm looking to transform JSON using jq to a delimiter-separated and flattened structure.
There have been attempts at this. For example, Flatten nested JSON using jq.
However, the solutions on that page fail if the JSON contains arrays. For example, if the JSON is:
{"a":{"b":[1]},"x":[{"y":2},{"z":3}]}
they fail to transform it to:
{"a.b.0":1,"x.0.y":2,"x.1.z":3}
In addition, I'm looking for a solution that will also allow for an arbitrary delimiter. For example, suppose the space character is the delimiter. In this case, the result would be:
{"a b 0":1,"x 0 y":2,"x 1 z":3}
I'm looking to have this functionality accessed via a Bash (4.2+) function as is found in CentOS 7, something like this:
flatten_json()
{
local JSONData="$1"
# jq command to flatten $JSONData, putting the result to stdout
jq ... <<<"$JSONData"
}
The solution should work with all JSON data types, including null and boolean. For example, consider the following input:
{"a":{"b":["p q r"]},"w":[{"x":null},{"y":false},{"z":3}]}
It should produce:
{"a b 0":"p q r","w 0 x":null,"w 1 y":false,"w 2 z":3}
If you stream the data in, you'll get path/value pairs for all leaf values. If an event is not a pair, it's a lone path marking the end of an object or array at that path. Using leaf_paths, as you found, would only give you paths to truthy leaf values, so you'd miss null or even false values. As a stream, you won't have this problem.
There are many ways this could be combined into an object; I'm partial to using reduce and assignment in these situations.
$ cat input.json
{"a":{"b":["p q r"]},"w":[{"x":null},{"y":false},{"z":3}]}
$ jq --arg delim '.' 'reduce (tostream|select(length==2)) as $i ({};
.[[$i[0][]|tostring]|join($delim)] = $i[1]
)' input.json
{
"a.b.0": "p q r",
"w.0.x": null,
"w.1.y": false,
"w.2.z": 3
}
Here's the same solution broken up a bit to allow room for explanation of what's going on.
$ jq --arg delim '.' 'reduce (tostream|select(length==2)) as $i ({};
[$i[0][]|tostring] as $path_as_strings
| ($path_as_strings|join($delim)) as $key
| $i[1] as $value
| .[$key] = $value
)' input.json
Converting the input to a stream with tostream, we receive a stream of pairs and paths as input to our filter. We can pass these multiple values into reduce, which is designed to accept multiple values and do something with them. But before we do, we want to keep only the pairs (select(length==2)).
Then in the reduce call, we start with a clean object and assign new values using a key derived from the path and the corresponding value. Remember that every value produced in the reduce call is used for the next iteration. Binding values to variables doesn't change the current context, and assignments effectively "modify" the current value (the initial object) and pass it along.
$path_as_strings is just the path (an array of strings and numbers) converted to all strings. [$i[0][]|tostring] is a shorthand I use as an alternative to map when the array I want to map is not the current input; it is more compact since the mapping is done as a single expression, instead of having to write ($i[0]|map(tostring)) to get the same result. The outer parentheses might not be necessary in general, but it's still two separate filter expressions vs one (and more text).
Then from there we convert that array of strings to the desired key using the provided delimiter. Then assign the appropriate values to the current object.
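Wrapped into the kind of Bash function the question asks for (the delimiter argument and its default are my additions), this might look like:
flatten_json()
{
    local JSONData="$1"
    local Delim="${2:-.}"    # delimiter, defaulting to "."
    jq --arg delim "$Delim" 'reduce (tostream|select(length==2)) as $i ({};
        .[[$i[0][]|tostring]|join($delim)] = $i[1]
    )' <<<"$JSONData"
}
For example, flatten_json "$(cat input.json)" ' ' produces the space-delimited variant.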
The following has been tested with jq 1.4, jq 1.5 and the current "master" version. The requirement about including paths to null and false is the reason for "allpaths" and "all_leaf_paths".
# all paths, including paths to null
def allpaths:
def conditional_recurse(f): def r: ., (select(.!=null) | f | r); r;
path(conditional_recurse(.[]?)) | select(length > 0);
def all_leaf_paths:
def isscalar: type | (. != "object" and . != "array");
allpaths as $p
| select(getpath($p)|isscalar)
| $p ;
. as $in
| reduce all_leaf_paths as $path ({};
. + { ($path | map(tostring) | join($delim)): $in | getpath($path) })
With this jq program in flatten.jq:
$ cat input.json
{"a":{"b":["p q r"]},"w":[{"x":null},{"y":false},{"z":3}]}
$ jq --arg delim . -f flatten.jq input.json
{
"a.b.0": "p q r",
"w.0.x": null,
"w.1.y": false,
"w.2.z": 3
}
Collisions
Here is a helper function that illustrates an alternative path-flattening algorithm. It converts keys that contain the delimiter to quoted strings, and array elements are presented in square brackets (see the example below):
def flattenPath(delim):
reduce .[] as $s ("";
if $s|type == "number"
then ((if . == "" then "." else . end) + "[\($s)]")
else . + ($s | tostring | if index(delim) then "\"\(.)\"" else . end)
end );
Example: Using flattenPath instead of map(tostring) | join($delim), the object:
{"a.b": [1]}
would become:
{
"\"a.b\"[0]": 1
}
To add another option to the solutions already given, jqg is a script I wrote to flatten any JSON file and then search it using a regex. For your purposes the regex would simply be '.', which matches everything.
$ echo '{"a":{"b":[1]},"x":[{"y":2},{"z":3}]}' | jqg .
{
"a.b.0": 1,
"x.0.y": 2,
"x.1.z": 3
}
and can produce compact output:
$ echo '{"a":{"b":[1]},"x":[{"y":2},{"z":3}]}' | jqg -q -c .
{"a.b.0":1,"x.0.y":2,"x.1.z":3}
It also handles the more complicated example that @peak used:
$ echo '{"a":{"b":["p q r"]},"w":[{"x":null},{"y":false},{"z":3}]}' | jqg .
{
"a.b.0": "p q r",
"w.0.x": null,
"w.1.y": false,
"w.2.z": 3
}
as well as empty arrays and objects (and a few other edge-case values):
$ jqg . test/odd-values.json
{
"one.start-string": "foo",
"one.null-value": null,
"one.integer-number": 101,
"two.two-a.non-integer-number": 101.75,
"two.two-a.number-zero": 0,
"two.true-boolean": true,
"two.two-b.false-boolean": false,
"three.empty-string": "",
"three.empty-object": {},
"three.empty-array": [],
"end-string": "bar"
}
(reporting empty arrays & objects can be turned off with the -E option).
jqg was tested with jq 1.6
Note: I am the author of the jqg script.