Inducing a schema using jq more efficiently - json

I ingest a lot of data. It comes from lots of different sources, and all ultimately goes into BigQuery.
I preparse it into .jsonl file(s) — 1 line per record, named by destination table.
For a rough sense of scale, here's a sample from a dataset I'm doing now. (All the data below is real, just lightly redacted / cleaned up.)
% find json -type f -size +2000c -print0 | head -z | sort | wc --files0-from=-
2 387 4737 json/baz_1.jsonl
3 579 7055 json/baz_2.jsonl
1 193 2358 json/baz_3.jsonl
25 4835 58958 json/baz_4.jsonl
37 7161 87467 json/baz_5.jsonl
3 580 7072 json/baz_6.jsonl
15 2897 35393 json/baz_7.jsonl
129 24950 304262 json/baz_8.jsonl
3 373 4221 json/foo_1.jsonl
6 746 8491 json/foo_2.jsonl
224 42701 520014 total
% wc -l *.jsonl
11576 foos.jsonl
20 bars.jsonl
337770 bazzes.jsonl
349366 total
% du -m *.jsonl
3 foos.jsonl
1 bars.jsonl
93 bazzes.jsonl
This is relatively small for me. Other datasets are in the millions of rows / terabytes of data range.
Because the data comes from external sources that are often undocumented, often don't match their specs, or are just plain messy (e.g. various sentinel values for null, multiple date formats in the same field, etc.), I don't really know the structure beforehand.
However, I want to have a nice, clean, efficient structure in my destination table — e.g. cast to the correct type like integer/bool/date, set REQUIRED/NULLABLE correctly, know which columns are actually enums, convert stringified arrays into REPEATED columns, have a good guess on what I can use effectively for partitioning / clustering, etc. etc.
It inevitably requires some manual work on samples to infer what's actually going on, but my first pass for doing this is jq (version 1.6).
This is my current code:
~/.jq
def isempty(v):
(v == null or v == "" or v == [] or v == {});
def isnotempty(v):
(isempty(v) | not);
def remove_empty:
walk(
if type == "array" then
map(select(isnotempty(.)))
elif type == "object" then
with_entries(select(isnotempty(.value))) # Note: this will remove keys with empty values
else .
end
);
# bag of words
def bow(stream):
reduce stream as $word ({}; .[($word|tostring)] += 1);
# https://stackoverflow.com/questions/46254655/how-to-merge-json-objects-by-grouping-on-key-with-jq
def add_by(f):
reduce .[] as $x ({}; ($x|f) as $f | .[$f] += [$x])
| [.[] | add];
# takes array of {string: #, ...}
def merge_counts:
map(.|to_entries)|flatten | add_by(.key)|from_entries;
induce_schema.sh (linebreaks added)
#!/bin/zsh
pv -cN ingestion -s $(wc -l < "$1") -l "$1" | \
jq -c --unbuffered --stream '{"name": ( .[0]), "encoded_type":( .[1] | type), \
"tonumber": (.[1] | if (type == "string") then try(tonumber|type) catch type else null end), \
"chars": (.[1] | if(type=="string") then try(split("") | sort | unique | join("")) else null end), \
"length":(.[1] | length),"data":.[1]}' | \
# sed -r 's/[0-9]+(,|])/"array"\1/g' | awk '!_[$0]++' | sort | \
pv -cN grouping -l | \
jq -sc '. | group_by(.name,.encoded_type,.tonumber)[] | {"name":first|.name, \
"encoded_type":([(first|.encoded_type),(first|.tonumber)]|unique - [null]|join("_")), \
"allchars": (map(.chars) | join("")|split("")|sort|unique|join("")), \
"count_null": (map(.data | select(.==null)) | length), \
"count_empty": (map(.data | select(.==[] or . == {} or . == "")) | length), \
"count_nonempty": (map(.data | select(. != null and . != "")) |length), \
"unique": (map(.data)|unique|length), "length": bow(.[] | .length) }' | \
pv -cN final -l | \
jq -sc '. | group_by(.name)[] | {"name":first|.name, \
"nullable":(map(.encoded_type) | contains(["null"])), \
"schemas_count":(map(. | select(.encoded_type != "null") )|length), \
"lengths":(map(.length)|merge_counts), "total_nonempty":(map(.count_nonempty)|add), \
"total_null":(map(.count_null)|add), "total_empty": (map(.count_empty) |add), \
"schemas":map(. | select(.encoded_type != "null") | del(.name) )}'
Here's a partial output for bars.jsonl (linebreaks added for ease of reading):
{"name":["FILING CODE"],"nullable":false,"schemas_count":1,
"lengths":{"0":1930,"2":16},
"total_nonempty":16,"total_null":0,"total_empty":1930,
"schemas":[
{"encoded_type":"string","allchars":"EGPWX",
"count_null":0,"count_empty":1930,"count_nonempty":16,"unique":6,
"length":{"0":1930,"2":16}}
]}
{"name":["LAST NAME"],"nullable":true,"schemas_count":1,
"lengths":{"0":416,"5":4650,"6":5648,"7":4796,"4":1934,"8":3042,"9":1362,"10":570,"11":226,"3":284,"14":30,"12":70,"13":54,"16":20,"15":26,"17":10,"18":8,"2":4,"19":2},
"total_nonempty":22736,"total_null":2,"total_empty":416,
"schemas":[
{"encoded_type":"string","allchars":" ABCDEFGHIJKLMNOPQRSTUVWXYZ",
"count_null":0,"count_empty":416,"count_nonempty":22736,"unique":6233,
"length":{"5":4650,"6":5648,"7":4796,"4":1934,"8":3042,"9":1362,"10":570,"11":226,"3":284,"14":30,"12":70,"0":416,"13":54,"16":20,"15":26,"17":10,"18":8,"2":4,"19":2}}
]}
{"name":["NUMBER OF COFFEES"],"nullable":false,"schemas_count":2,
"lengths":{"1":16,"0":4},
"total_nonempty":16,"total_null":0,"total_empty":4,
"schemas":[
{"encoded_type":"number_string","allchars":"1",
"count_null":0,"count_empty":0,"count_nonempty":16,"unique":1,
"length":{"1":16}},
{"encoded_type":"string","allchars":"",
"count_null":0,"count_empty":4,"count_nonempty":0,"unique":1,
"length":{"0":4}}
]}
{"name":["OFFICE CODE"],"nullable":false,"schemas_count":2,
"lengths":{"3":184,"0":22092},
"total_nonempty":1036,"total_null":0,"total_empty":22092,
"schemas":[
{"encoded_type":"number_string","allchars":"0123456789",
"count_null":0,"count_empty":0,"count_nonempty":852,"unique":254,
"length":{"3":852}},
{"encoded_type":"string","allchars":"0123456789ABCDEIJQRSX",
"count_null":0,"count_empty":22092,"count_nonempty":184,"unique":66,
"length":{"0":22092,"3":184}}
]}
{"name":["SOURCE FILE"],"nullable":true,"schemas_count":1,
"lengths":{"0":416,"7":22708},
"total_nonempty":22708,"total_null":23124,"total_empty":416,
"schemas":[
{"encoded_type":"string","allchars":"0123456789F_efil",
"count_null":0,"count_empty":416,"count_nonempty":22708,"unique":30,
"length":{"7":22708,"0":416}}
]}
...
The point of this is to get a summary of "how is this unknown dataset structured and what's in it" that I can readily transform into my BigQuery table schema / parameters, and use to work out what I'll probably need to do next to turn it into something cleaner & more usable than what I got, etc.
This code works, but those -s (slurp) lines are really hard on server RAM. (They simply wouldn't work if the dataset were any larger than this; I added those parts just today. On the bazzes dataset, it uses about 20GB of RAM in total, including swap.)
It also doesn't detect e.g. any of the date/time field types.
I believe it should be possible to make this far more efficient using @joelpurra's jq + parallel and/or the jq cookbook's reduce inputs, but I'm having difficulty figuring out how.
So, I'd appreciate advice on how to make this:
1. more CPU & RAM efficient
2. otherwise more useful (e.g. recognize date fields, which could be in almost any format)

Using inputs is the way to go, whether or not any parallelization techniques are brought to bear.
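For the RAM problem in particular, the -s (slurp) passes can usually be replaced by a single accumulating pass over inputs. Here is a rough sketch only: it consumes the per-value records produced by the first jq stage above (piped in or read from its saved output), keys them by name, and tracks just a few of the statistics; the remaining counters would be added in the same way.
jq -cn '
  reduce inputs as $r ({};
    ($r.name | tostring) as $k
    | .[$k].count += 1
    | .[$k].types[$r.encoded_type] += 1
    | .[$k].count_null += (if $r.data == null then 1 else 0 end))
  | to_entries[]
  | {name: .key} + .value
'
Because nothing is slurped, memory use stays proportional to the number of distinct field names rather than to the number of values.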
In the jq module for inducing structural schema that I wrote some time ago (https://gist.github.com/pkoppstein/a5abb4ebef3b0f72a6ed), there's a filter, schema/1 defined as:
def schema(stream):
reduce stream as $x ("null"; typeUnion(.; $x|typeof));
This can therefore be used as suggested by this snippet:
jq -n 'include "schema"; schema(inputs)' FILESPECIFICATIONS
(This assumes that the file "schema.jq" defining the schema module has been appropriately installed.)
The point here is not so much that schema.jq might be adapted to your particular expectations, but that the above "def" can serve as a guide (whether or not using jq) for how to write an efficient schema-inference engine, in the sense of being able to handle a very large number of instances. That is, you basically have only to write a definition of typeof (which should yield the desired "type" in the most general sense), and of typeUnion (which defines how two types are to be combined).
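A minimal illustration of that division of labour, with a crude date check thrown in (this is only a sketch that handles scalar values and assumes dates look like YYYY-MM-DD; it is not the typeof/typeUnion pair actually defined in schema.jq):
def typeof:
  # very rough: treat ISO-8601-looking strings as dates, otherwise use jq's own type
  if type == "string" and test("^[0-9]{4}-[0-9]{2}-[0-9]{2}") then "date"
  else type
  end;
def typeUnion($a; $b):
  # how two inferred types combine
  if $a == $b then $a
  elif $a == "null" then $b
  elif $b == "null" then $a
  else "mixed"
  end;
Plugged into schema/1 above, something like schema(inputs | .createdAt?) (the field name is purely illustrative) would then report "date", "string", "mixed", and so on for that field across a whole .jsonl file.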
Of course, inferring schemas can be a tricky business. In particular, schema(stream) will never fail, assuming the inputs are valid JSON. That is, whether or not the inferred schema will be useful depends largely on how it is used. I find an integrated approach based on these elements to be essential:
1. a schema specification language;
2. a schema inference engine that generates schemas that conform to (1);
3. a schema-checker.
Further thoughts
schema.jq is simple enough to be tailored to more specific requirements, e.g. to infer dates.
You might be interested in JESS ("JSON Extended Structural Schemas"), which combines a JSON-based specification language with jq-oriented tools: https://github.com/pkoppstein/JESS

Related

Fuzzy match string with jq

Let's say I have some JSON in a file. It's a subset of JSON data extracted from a larger JSON file (which is why I'll use stream later in my attempted solution), and it looks like this:
[
{"_id":"1","#":{},"article":false,"body":"Hello world","comments":"3","createdAt":"20201007200628","creator":{"id":"4a7ba8fd719d43598b977dd548eed6aa","bio":"","blocked":false,"followed":false,"human":false,"integration":false,"joined":"20201007200628","muted":false,"name":"mkscott","rss":false,"private":false,"username":"mkscott","verified":false,"verifiedComments":false,"badges":[],"score":"0","interactions":258,"state":1},"depth":"0","depthRaw":0,"hashtags":[],"id":"2d4126e342ed46509b55facb49b992a5","impressions":"3","links":[],"sensitive":false,"state":4,"upvotes":"0"},
{"_id":"2","#":{},"article":false,"body":"Goodbye world","comments":"3","createdAt":"20201007200628","creator":{"id":"4a7ba8fd719d43598b977dd548eed6aa","bio":"","blocked":false,"followed":false,"human":false,"integration":false,"joined":"20201007200628","muted":false,"name":"mkscott","rss":false,"private":false,"username":"mkscott","verified":false,"verifiedComments":false,"badges":[],"score":"0","interactions":258,"state":1},"depth":"0","depthRaw":0,"hashtags":[],"id":"2d4126e342ed46509b55facb49b992a5","impressions":"3","links":[],"sensitive":false,"state":4,"upvotes":"0"}
],
[
{"_id":"55","#":{},"article":false,"body":"Hello world","comments":"3","createdAt":"20201007200628","creator":{"id":"3a7ba8fd719d43598b977dd548eed6aa","bio":"","blocked":false,"followed":false,"human":false,"integration":false,"joined":"20201007200628","muted":false,"name":"mkscott","rss":false,"private":false,"username":"jkscott","verified":false,"verifiedComments":false,"badges":[],"score":"0","interactions":258,"state":1},"depth":"0","depthRaw":0,"hashtags":[],"id":"2d4126e342ed46509b55facb49b992a5","impressions":"3","links":[],"sensitive":false,"state":4,"upvotes":"0"},
{"_id":"56","#":{},"article":false,"body":"Goodbye world","comments":"3","createdAt":"20201007200628","creator":{"id":"3a7ba8fd719d43598b977dd548eed6aa","bio":"","blocked":false,"followed":false,"human":false,"integration":false,"joined":"20201007200628","muted":false,"name":"mkscott","rss":false,"private":false,"username":"jkscott","verified":false,"verifiedComments":false,"badges":[],"score":"0","interactions":258,"state":1},"depth":"0","depthRaw":0,"hashtags":[],"id":"2d4126e342ed46509b55facb49b992a5","impressions":"3","links":[],"sensitive":false,"state":4,"upvotes":"0"}
]
It describes 4 posts written by 2 different authors, with a unique _id field for each post. Both authors wrote 2 posts: one says "Hello world" and the other says "Goodbye world".
I want to match on the word "Hello" and return the _id only for posts whose body contains "Hello". The expected result is:
1
55
The closest I could come in my attempt was:
jq -nr --stream '
fromstream(1|truncate_stream(inputs))
| select(.body %like% "Hello")
| ._id
' <input_file
Assuming the input is modified slightly to make it a stream of the arrays as shown in the Q:
jq -nr --stream '
fromstream(1|truncate_stream(inputs))
| select(.body | test("Hello"))
| ._id
'
produces the desired output.
test uses regex matching. In your case, it seems you could use simple substring matching instead.
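For example, the select in the filter above could instead be written as:
select(.body | contains("Hello"))
(contains, applied to strings, does a simple substring check.)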
Handling extraneous commas
Assuming the input has commas between a stream of valid JSON exactly as shown, you could presumably use sed to remove them first.
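For example, with the layout exactly as shown above (the stray comma sits on its own line after a closing bracket), something along these lines should do (input.json is a stand-in filename):
sed 's/^],$/]/' input.json |
jq -nr --stream 'fromstream(1|truncate_stream(inputs)) | select(.body | test("Hello")) | ._id'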
Or, if you want an only-jq solution, use the following in conjunction with the -n, -r and --stream command-line options:
def iterate:
fromstream(1|truncate_stream(inputs?))
| select(.body | test("Hello"))
| ._id,
iterate;
iterate
(Notice the "?".)
The streaming parser (invoked with --stream) is usually not needed for the kind of task you describe, so in this response, I'm going to assume that the following (or a variant thereof) will suffice:
.[]
| select( .body | test("Hello") )._id
This of course assumes that the input is valid JSON.
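For example (posts.json being a stand-in for a file holding one such array):
jq -r '.[] | select( .body | test("Hello") )._id' posts.json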
Handling comma-delimited JSON
If your input is a comma-delimited stream of JSON as shown in the Q, you could use the following in conjunction with the -n command-line option:
# This is a variant of the built-in `recurse/1`:
def iterate(f): def r: f | (., r); r;
iterate( inputs? | .[] | select( .body | test("Hello") )._id )
Please note that this assumes that whatever occurs on a line after a delimiting comma can be ignored.

jq slow to join() medium size array

I am trying to join() a relatively big array (20k elements) of objects with a character ('\n' in this particular case). I have a few operations upfront which run in about 8 seconds (acceptable), but when I add '| join("\n")' at the end, the runtime jumps to 3+ minutes.
Is there any reason for join() to be that slow? Is there another way of getting the same output without join()?
I am currently using jq-1.5 (latest stable).
Here is the JQ file
json2csv.jq
def json2csv:
def tonull: if . == "null" then null else . end;
(.[0] | keys) as $headers |
[(
$headers | join("\t")
), (
[ .[] as $row | [ $headers[] as $h | $row[$h] | tostring | tonull ] | join("\t") ] | join("\n")
)] | join("\n")
;
json2csv
Considering:
$ jq 'length' test.json
23717
With the script as I want it (shown above):
$ time jq -rf json2csv.jq test.json > test.csv
real 3m46.721s
user 1m48.660s
sys 1m57.698s
With the same script, removing the join("\n")
$ time jq -rf json2csv.jq test.json > test.csv
real 0m8.564s
user 0m8.301s
sys 0m0.242s
(Note: I also removed the second join, because otherwise jq cannot concatenate an array and a string, which makes sense; but that one only operates on an array of 2 elements anyway, so the second join isn't the problem.)
You don't need to use join at all. Rather than thinking of converting the whole file to a single string, think of it as converting each row to strings. The way jq outputs streams of results will give you the desired result in the end (assuming you take the raw output).
Try something more like this:
def json2csv:
def tonull: if . == "null" then null else . end;
(.[0] | keys) as $headers
# output headers followed by rows of values as arrays
| (
$headers
),
(
.[] | [ .[$headers[]] | tostring | tonull ]
)
# convert the arrays to tab separated values strings
| @tsv
;
After thinking about it, I remembered that jq automatically outputs a newline ('\n') after each result when you iterate an array (.[]), which means that in this particular case I can just do this:
def json2csv:
def tonull: if . == "null" then null else . end;
(.[0] | keys) as $headers |
[(
$headers | join("\t")
), (
[ .[] as $row | [ $headers[] as $h | $row[$h] | tostring | tonull ] | join("\t") ] | .[]
)] | .[]
;
json2csv
And this solved my problem
time jq -rf json2csv.jq test.json > test.csv
real 0m6.725s
user 0m6.454s
sys 0m0.245s
I'm leaving the question up because, if I had wanted to use any character other than '\n', this wouldn't have solved the issue.
When producing output such as CSV or TSV, the idea is to stream the data as much as possible. The last thing you want to do is run join on an array containing all the data. If you did want to use a delimiter other than \n, you'd add it to each item in the stream, and then use the -j command-line option.
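To illustrate, here is a sketch of that approach using "\r\n" as a stand-in delimiter: the delimiter is appended to each streamed row, and -j stops jq from adding its own newline.
jq -j '
  def tonull: if . == "null" then null else . end;
  (.[0] | keys) as $headers
  | ( ($headers | @tsv),
      (.[] | [ .[$headers[]] | tostring | tonull ] | @tsv) )
  | . + "\r\n"
' test.json > test.csv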
Also, I think your diagnosis is probably not quite right as joining an array with a large number of small strings is quite fast. Below are timings comparing joining an array with two strings and one with 100,000 strings. In case you're wondering, my machine is rather slow.
$ ./join.sh 2
3
real 0.03
user 0.02
sys 0.00
1896448 maximum resident set size
$ ./join.sh 100000
588889
real 2.20
user 2.05
sys 0.13
21188608 maximum resident set size
$ cat join.sh
#!/bin/bash
/usr/bin/time -lp jq -n --argjson n "$1" '[range(0;$n)|tostring]|join(".")|length'
The above runs used jq 1.6, but using jq 1.5 produces very similar results.
On the other hand, joining a large number (20,000) of very long strings (1K) is noticeably slow, so evidently the current jq implementation is not designed for such operations.
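For reference, that case can be checked with a variant of join.sh, e.g. as follows (the 1K strings are built here with jq's string-repetition operator; the figures will of course depend on your machine):
/usr/bin/time -lp jq -n --argjson n 20000 '[range(0;$n) | ("x" * 1024)] | join(".") | length'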

jq 1.5 print items from array that is inside another array

The incoming JSON file contains one JSON array per row, e.g.:
["a100","a101","a102","a103","a104","a105","a106","a107","a108"]
["a100","a102","a103","a106","a107","a108"]
["a100","a99"]
["a107","a108"]
a "filter array" would be ["a99","a101","a108"] so I can slurpfile it
Trying to figure out how to print only values that are inside "filter array", eg the output:
["a101","a108"]
["a108"]
["a99"]
["a108"]
You can port the IN function from jq 1.6 to 1.5 and use:
def IN(s): any(s == .; .);
map(select(IN($filter_array[])))
Or even shorter:
map(select(any($filter_array[]==.;.)))
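Since the question mentions --slurpfile, the full invocation might look like this (filter.json being a hypothetical file holding ["a99","a101","a108"]; note that --slurpfile wraps the file's contents in an array, hence the [0]):
jq -c --slurpfile filter_array filter.json '
  def IN(s): any(s == .; .);
  map(select(IN($filter_array[0][])))
' input.json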
I might be missing some simpler solution, but the following works:
map(select(. as $in | ["a99","a101","a108"] | contains([$in])))
Replace the ["a99","a101","a108"] hardcoded array by your slurped variable.
In the example, the arrays in the input stream are sorted (in jq's sort order), so it is worth noting that in such cases, a more efficient solution is possible using the bsearch built-in, or perhaps even better, the definition of intersection/2 given at https://rosettacode.org/wiki/Set#Finite_Sets_of_JSON_Entities
For ease of reference, here it is:
def intersection($A;$B):
def pop:
.[0] as $i
| .[1] as $j
| if $i == ($A|length) or $j == ($B|length) then empty
elif $A[$i] == $B[$j] then $A[$i], ([$i+1, $j+1] | pop)
elif $A[$i] < $B[$j] then [$i+1, $j] | pop
else [$i, $j+1] | pop
end;
[[0,0] | pop];
Assuming a jq invocation such as:
jq -c --argjson filter '["a99","a101","a108"]' -f intersections.jq input.json
an appropriate filter would be:
($filter | sort) as $sorted
| intersection(.; $sorted)
(Of course if $filter is already presented in jq's sort order, then the initial sort can be skipped, or replaced by a check.)
Output
["a101","a108"]
["a108"]
["a99"]
["a108"]
Unsorted arrays
In practice, jq's builtin sort filter is usually so fast that it might be worthwhile simply sorting the arrays in order to use intersection as defined above.
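For reference, a bsearch-based variant along the lines mentioned above might look like this (a sketch; as before, it assumes $filter has been passed in with --argjson):
($filter | sort) as $sorted
| map(select(. as $x | $sorted | bsearch($x) > -1))
bsearch returns a negative number when the element is not found in the sorted array, so the > -1 test amounts to a membership check.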

Optimize JSON denormalization using JQ - "cartesian product" from 1:N

I have a JSON database change log, output of wal2json. It looks like this:
{"xid":1190,"timestamp":"2018-07-19 17:18:02.905354+02","change":[
{"kind":"update","table":"mytable2","columnnames":["id","name","age"],"columnvalues":[401,"Update AA",20],"oldkeys":{"keynames":["id"],"keyvalues":[401]}},
{"kind":"update","table":"mytable2","columnnames":["id","name","age"],"columnvalues":[401,"Update BB",20],"oldkeys":{"keynames":["id"],"keyvalues":[401]}}]}
...
Each top level entry (xid) is a transaction, each item in change is, well, a change. One row may change multiple times.
To import into an OLAP system with a limited feature set, I need to have the order explicitly stated, so I need to add an sn for each change in a transaction.
Also, each change must be a top level entry - the OLAP can't iterate sub-items within one entry.
{"xid":1190, "sn":1, "kind":"update", "data":{"id":401,"name":"Update AA","age":20} }
{"xid":1190, "sn":2, "kind":"update", "data":{"id":401,"name":"Update BB","age":20} }
{"xid":1191, "sn":1, "kind":"insert", "data":{"id":625,"name":"Inserted","age":20} }
{"xid":1191, "sn":2, "kind":"delete", "data":{"id":625} }
(The reason is that the OLAP has limited ability to transform the data during import, and also doesn't have the order as a parameter.)
So, I do this using jq:
function transformJsonDataStructure {
## First let's reformat it to XML, then transform using XPATH, then back to JSON.
## Example input:
# {"xid":1074,"timestamp":"2018-07-18 17:49:54.719475+02","change":[
# {"kind":"update","table":"mytable2","columnnames":["id","name","age"],"columnvalues":[401,"Update AA",20],"oldkeys":{"keynames":["id"],"keyvalues":[401]}},
# {"kind":"update","table":"mytable2","columnnames":["id","name","age"],"columnvalues":[401,"Update BB",20],"oldkeys":{"keynames":["id"],"keyvalues":[401]}}]}
cat "$1" | while read -r LINE ; do
XID=`echo "$LINE" | jq -c '.xid'`;
export SN=0;
#serr "{xid: $XID, changes: $CHANGES}";
echo "$LINE" | jq -c '.change[]' | while read -r CHANGE ; do
SN=$((SN+=1))
KIND=`echo "$CHANGE" | jq -c --raw-output .kind`;
TABLE=`echo "$CHANGE" | jq -c --raw-output .table`;
DEST_FILE="$TARGET_PATH-$TABLE.json";
case "$KIND" in
update|insert)
MAP=$(convertTwoArraysToMap "$(echo "$CHANGE" | jq -c ".columnnames")" "$(echo "$CHANGE" | jq -c ".columnvalues")") ;;
delete)
MAP=$(convertTwoArraysToMap "$(echo "$CHANGE" | jq -c ".oldkeys.keynames")" "$(echo "$CHANGE" | jq -c ".oldkeys.keyvalues")") ;;
esac
#echo "{\"xid\":$XID, \"table\":\"$TABLE\", \"kind\":\"$KIND\", \"data\":$MAP }" >> "$DEST_FILE"; ;;
echo "{\"xid\":$XID, \"sn\":$SN, \"kind\":\"$KIND\", \"data\":$MAP }" | tee --append "$DEST_FILE";
done;
done;
return;
}
The problem is performance. I am calling jq a few times per entry, which is quite slow: around 1000x slower than without the transformation.
How can I perform the transformation above in just one pass? (jq is not a must; another tool can be used too, but it should be available in CentOS packages. I want to avoid coding an extra tool for this.)
From man jq it seems it should be capable of processing the whole file (one JSON entry per row) in one go. I could do it in XSLT, but I can't wrap my head around jq, especially iterating the change array and combining columnnames and columnvalues into a map.
For the iteration, I think map or map_values could be used.
For the 2 arrays to map, I see the from_entries and with_entries functions, but can't get it work.
Any jq master around to advise?
The following helper function converts the incoming array into an object using headers as the keys:
def objectify(headers):
[headers, .] | transpose | map({(.[0]): .[1]}) | add;
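To illustrate what objectify does, using values from the first change in the log above:
["id","name","age"] as $headers
| [401, "Update AA", 20]
| objectify($headers)     # => {"id":401,"name":"Update AA","age":20}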
The trick now is to use range(0;length) to generate .sn:
{xid} +
(.change
| range(0;length) as $i
| .[$i]
| .columnnames as $header
| {sn: ($i + 1),
kind,
data: (.columnvalues|objectify($header)) } )
Output
For the given log entry, the output would be:
{"xid":1190,"sn":1,"kind":"update","data":{"id":401,"name":"Update AA","age":20}}
{"xid":1190,"sn":2,"kind":"update","data":{"id":401,"name":"Update BB","age":20}}
Moral
If a solution looks too complicated, it probably is.

How can I completely sort arbitrary JSON using jq?

I want to diff two JSON text files. Unfortunately they're constructed in arbitrary order, so I get diffs when they're semantically identical. I'd like to use jq (or whatever) to sort them in any kind of full order, to eliminate differences due only to element ordering.
--sort-keys solves half the problem, but it doesn't sort arrays.
I'm pretty ignorant of jq and don't know how to write a jq recursive filter that preserves all data; any help would be appreciated.
I realize that line-by-line 'diff' output isn't necessarily the best way to compare two complex objects, but in this case I know the two files are very similar (nearly identical) and line-by-line diffs are fine for my purposes.
Using jq or alternative command line tools to diff JSON files answers a very similar question, but doesn't print the differences. Also, I want to save the sorted results, so what I really want is just a filter program to sort JSON.
Here is a solution using a generic function sorted_walk/1 (so named for the reason described in the postscript below).
normalize.jq:
# Apply f to composite entities recursively using keys[], and to atoms
def sorted_walk(f):
. as $in
| if type == "object" then
reduce keys[] as $key
( {}; . + { ($key): ($in[$key] | sorted_walk(f)) } ) | f
elif type == "array" then map( sorted_walk(f) ) | f
else f
end;
def normalize: sorted_walk(if type == "array" then sort else . end);
normalize
Example using bash:
diff <(jq -S -f normalize.jq FILE1) <(jq -S -f normalize.jq FILE2)
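Since the question also asks about saving the sorted results, the same filter can simply be redirected to a file, e.g.:
jq -S -f normalize.jq FILE1 > FILE1.sorted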
POSTSCRIPT: The builtin definition of walk/1 was revised after this response was first posted: it now uses keys_unsorted rather than keys.
I want to diff two JSON text files.
Use jd with the -set option:
No output means no difference.
$ jd -set A.json B.json
Differences are shown as an # path and + or -.
$ jd -set A.json C.json
# ["People",{}]
+ "Carla"
The output diffs can also be used as patch files with the -p option.
$ jd -set -o patch A.json C.json; jd -set -p patch B.json
{"City":"Boston","People":["John","Carla","Bryan"],"State":"MA"}
https://github.com/josephburnett/jd#command-line-usage
I'm surprised this isn't a more popular question/answer. I haven't seen any other JSON deep-sort solutions. Maybe everyone likes solving the same problem over and over.
Here's a wrapper for @peak's excellent solution above that turns it into a shell script which works in a pipe or with a file argument.
#!/usr/bin/env bash
# json normalizer function
# Recursively sort an entire json file, keys and arrays
# jq --sort-keys is top level only
# Alphabetize a json file's dicts so that they are always in the same order
# Makes json diff'able and should be run on any json data that's in source control to prevent excessive diffs from dict reordering.
[ "${DEBUG}" ] && set -x
TMP_FILE="$(mktemp)"
trap 'rm -f -- "${TMP_FILE}"' EXIT
cat > "${TMP_FILE}" <<-EOT
# Apply f to composite entities recursively using keys[], and to atoms
def sorted_walk(f):
. as \$in
| if type == "object" then
reduce keys[] as \$key
( {}; . + { (\$key): (\$in[\$key] | sorted_walk(f)) } ) | f
elif type == "array" then map( sorted_walk(f) ) | f
else f
end;
def normalize: sorted_walk(if type == "array" then sort else . end);
normalize
EOT
# Don't pollute stdout with debug output
[ "${DEBUG}" ] && cat $TMP_FILE > /dev/stderr
if [ "$1" ] ; then
jq -S -f ${TMP_FILE} $1
else
jq -S -f ${TMP_FILE} < /dev/stdin
fi