Aggregate json arrays from multiple files using jq, grouping by key - json

I would like to aggregate two or more files into a single JSON document, combining the arrays stored under the same key.
file1.json
{
"shapes": [
{
"id": "1",
"name": "circle"
},
{
"id": "2",
"name": "square"
}
]
}
file2.json
{
"shapes": [
{
"id": "3",
"name": "triangle"
}
]
}
Expected result:
{
"shapes": [
{
"id": "1",
"name": "circle"
},
{
"id": "2",
"name": "square"
},
{
"id": "3",
"name": "triangle"
}
]
}
I can do this with the following jq command:
jq -s '{shapes: map(.shapes)|add }' file*.json
But this requires me to know the shapes attribute and hardcode it. Is there a simple way I can get the same result without ever using the key name explicitly?

Here is a solution that’s suitable when each top-level object has only one key, and that is both efficient and conceptually simple. It assumes jq is invoked with the -n option.
reduce inputs as $in (null;
  ($in|keys_unsorted[0]) as $k | { ($k): (.[$k] + $in[$k]) })
or slightly more compactly:
reduce inputs as $in (null; ($in|keys_unsorted[0]) as $k | .[$k] += $in[$k] )
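For example, against the two sample files the invocation could look like this (a sketch; the file names are taken from the question):
jq -n 'reduce inputs as $in (null;
         ($in|keys_unsorted[0]) as $k | .[$k] += $in[$k])' file1.json file2.json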

Here is a solution that also solves a more general problem: first, it handles arbitrarily many input files; and second, it forms the "sum" by key, for every key, on the assumption that every top-level key is array-valued.
The generic function:
# the values at each key are assumed to be arrays
def aggregate(stream):
  reduce stream as $o ({};
    reduce ($o|keys_unsorted[]) as $k (.;
      .[$k] += $o[$k] ));
To avoid "slurping", we will use inputs:
aggregate(inputs)
The invocation must therefore use the -n command-line option:
jq -n -f program.jq *.json
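For reference, program.jq would then simply contain the definition above followed by the call, assembled here as a sketch:
# program.jq -- aggregate the array under each key across all input objects
def aggregate(stream):
  reduce stream as $o ({};
    reduce ($o|keys_unsorted[]) as $k (.;
      .[$k] += $o[$k] ));

aggregate(inputs)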

Try the following code. It can handle any number of files. All inputs are assumed to be JSON objects whose values are all arrays; the arrays are grouped by key and aggregated, and the output is a single object mapping each key to its aggregated array.
jq -s 'map(to_entries)|add|group_by(.key)|
map( { "key": (.[0].key), "value": (map(.value)|add)})|
from_entries' file1.json file2.json
For your sample input this gives:
{
"shapes": [
{
"id": "1",
"name": "circle"
},
{
"id": "2",
"name": "square"
},
{
"id": "3",
"name": "triangle"
}
]
}
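For readability, here is the same pipeline broken out with comments (equivalent behavior; the file name merge.jq is only illustrative):
# merge.jq
map(to_entries)                        # each object -> array of {key, value} entries
| add                                  # concatenate the entry arrays from all inputs
| group_by(.key)                       # group entries that share the same key
| map({ key: .[0].key,                 # one entry per key ...
        value: (map(.value) | add) })  # ... with its value arrays concatenated
| from_entries                         # rebuild a single object
It would be invoked with jq -s -f merge.jq file1.json file2.json.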

Related

Merging multiple JSON Lines files into a single JSON object

I'm trying to merge / reduce many JSON objects and somehow I'm not getting the expected result.
I'm only interested in getting all keys; the values and the number of items inside arrays are irrelevant.
file1.json:
{
"customerId": "xx",
"emails": [
{
"address": "james#zz.com",
"customType": "",
"type": "custom"
},
{
"address": "sales#x.com",
"primary": true
},
{
"address": "info#x.com"
}
]
}
{
"id": "654",
"emails": [
{
"address": "peter#x.com",
"primary": true
}
]
}
The desired output is a JSON object with all possible keys from all input objects. The values are irrelevant; any value from any input object is OK. But all keys from the input objects must be present in the output object:
{
"emails": [
{
"address": "james#zz.com", <--- any existing value works
"customType": "", <--- any existing value works
"type": "custom", <--- any existing value works
"primary": true <--- any existing value works
}
],
"customerId": "xx", <--- any existing value works
"id": "654" <--- any existing value works
}
I tried reducing it, but it misses many of the keys in the array:
$ jq -s 'reduce .[] as $item ({}; . + $item)' file1.json
{
"customerId": "xx",
"emails": [
{
"address": "peter#x.com",
"primary": true
}
],
"id": "654"
}
The structure of the objects contained in file1.json is unknown, so the solution must be agnostic of any keys/values and the solution must not assume any structure or depth.
Is it possible to fix this somehow considering how jq works? Or is it possible to solve this issue using another tool?
PS: For those of you that are curious, this is useful to infer a schema that can be created in a database. Given an arbitrary number of JSON objects with an arbitrary structure, it's easy to create a single JSON squished/merged/fused structure that will "accommodate" all JSON objects.
BigQuery is able to autodetect a schema, but only 500 lines are analyzed to come up with it. This presents problems if objects have different structures past that 500 line mark.
With this approach I can squish a JSON Lines file with millions of objects into a single line that can then be imported into BigQuery with the schema autodetect flag. It works every time, since BigQuery only has one line to analyze, and that line is the "super-schema" of all the objects. After extracting the autodetected schema I can manually fine-tune it to make sure the types are correct, and then recreate the table specifying my tuned schema:
$ ls -1 users*.json | wc --lines
3672
$ cat users*.json > users-all.json
$ cat users-all.json | wc --lines
146482633
$ jq 'squish' users-all.json > users-all-squished.json
$ cat users-all-squished.json | wc --lines
1
$ bq load --autodetect users users-all-squished.json
$ bq show schema --format=prettyjson users > users-schema.json
$ vi users-schema.json
$ bq rm --table users
$ bq mk --table users --schema=users-schema.json
$ bq load users users-all.json
[Some options are missing or changed for readability]
Here is a solution that produces the expected result for the sample input, and seems to meet all the stated requirements. It is similar to one proposed by @pmf on this page.
jq -n --stream '
def squish: map(if type == "number" then 0 else . end);
reduce (inputs | select(length==2)) as [$p, $v] ({}; setpath($p|squish; $v))
'
Output
For the example given in the Q, the output is:
{
"customerId": "xx",
"emails": [
{
"address": "peter#x.com",
"customType": "",
"type": "custom",
"primary": true
}
],
"id": "654"
}
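A possible invocation against the JSON Lines file mentioned in the question might look like this (a sketch; the file names are taken from the question's shell session):
jq -n --stream '
  def squish: map(if type == "number" then 0 else . end);
  reduce (inputs | select(length==2)) as [$p, $v] ({}; setpath($p|squish; $v))
' users-all.json > users-all-squished.json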
As @peak has pointed out, some aspects are underspecified. For instance, what should happen with .customerId and .id? Are they always the same across all files (as suggested by the sample files provided)? Do you want the items of the .emails array just thrown into one large array, or do you want to have them "merged" by some criteria (e.g. by a common value in their .address field)? Here are some stubs to start from:
Simply concatenate the .emails arrays and take all other parts from the first file:
jq 'reduce inputs as $in (.; .emails += $in.emails)' file*.json
# or simpler
jq '.emails += [inputs.emails[]]' file*.json
{
"emails": [
{
"address": "cc#xx.com"
},
{
"address": "james#zz.com",
"customType": "",
"type": "custom"
},
{
"address": "james#x.com"
},
{
"address": "sales#x.com",
"primary": true
},
{
"address": "info#x.com"
},
{
"address": "james#x.com"
},
{
"address": "sales#x.com",
"primary": true
},
{
"address": "info#x.com"
}
],
"customerId": "xx",
"id": "654"
}
Merge the objects in the .emails array by a common value in their .address field, with later values overwriting earlier values for other fields with colliding names, and discard all other parts of the files:
jq -n 'reduce inputs.emails[] as $e ({}; .[$e.address] += $e) | map(.)' file*.json
[
{
"address": "cc#xx.com"
},
{
"address": "james#zz.com",
"customType": "",
"type": "custom"
},
{
"address": "james#x.com"
},
{
"address": "sales#x.com",
"primary": true
},
{
"address": "info#x.com"
}
]
If you are only interested in a list of unique field names for a given address, regardless of the counts and values used, you can also go with:
jq -n '
reduce inputs.emails[] as $e ({}; .[$e.address][$e | keys_unsorted[]] = 1)
| map_values(keys)
'
{
"cc#xx.com": [
"address"
],
"james#zz.com": [
"address",
"customType",
"type"
],
"james#x.com": [
"address"
],
"sales#x.com": [
"address",
"primary"
],
"info#x.com": [
"address"
]
}
The structure of the objects contained in file1.json is unknown, so the solution must be agnostic of any keys/values and the solution must not assume any structure or depth.
You can use the --stream flag to break down the structure into an array of paths and values, discard the values part and make the paths unique:
jq --stream -nc '[inputs[0]] | unique[]' file*.json
["customerId"]
["emails"]
["emails",0,"address"]
["emails",0,"customType"]
["emails",0,"primary"]
["emails",0,"type"]
["emails",1,"address"]
["emails",2]
["emails",2,"address"]
["emails",2,"primary"]
["emails",3]
["emails",3,"address"]
["id"]
Trying to build a representation of this, similar to any of the input files, comes with a lot of caveats. For instance, how would you represent, in a single structure, a case where one file has .emails as an array of objects while another has .emails as an atomic value, say a string? You would not be able to represent this plurality without introducing new, possibly ambiguous structures (e.g. putting all possibilities into an array).
Therefore, having a list of paths could be a fair compromise. Judging by your desired output, you want to focus more on the object structure, so you could further reduce complexity by discarding the array indices. Depending on your use case, you could replace them with a single value to retain the information of the presence of an array, or discard them entirely:
jq --stream -nc '[inputs[0] | map(numbers = 0)] | unique[]' file*.json
["customerId"]
["emails"]
["emails",0]
["emails",0,"address"]
["emails",0,"customType"]
["emails",0,"primary"]
["emails",0,"type"]
["id"]
jq --stream -nc '[inputs[0] | map(strings)] | unique[]' file*.json
["customerId"]
["emails"]
["emails","address"]
["emails","customType"]
["emails","primary"]
["emails","type"]
["id"]
The following program meets these two key requirements:
"all keys from input objects must be present in output object";
"the solution must be agnostic of any keys/values and the solution must not assume any structure or depth."
The approach is the same as one suggested by @pmf, and for the example given in the Q, produces results that are very similar to the one that is shown:
jq -n --stream '
  def squish: map(select(type == "string"));
  reduce (inputs | select(length==2)) as [$p, $v] ({};
    setpath($p|squish; $v))
'
With the given input, this produces:
{
"customerId": "xx",
"emails": {
"address": "peter#x.com",
"customType": "",
"type": "custom",
"primary": true
},
"id": "654"
}

Merge and Sort JSON using JQ

I have a file containing the following structure and an unknown number of results:
{
"results": [
[
{
"field": "AccountID",
"value": "5177497"
},
{
"field": "Requests",
"value": "50900"
}
],
[
{
"field": "AccountID",
"value": "pro"
},
{
"field": "Requests",
"value": "251"
}
]
],
"statistics": {
"Matched": 51498,
"Scanned": 8673577,
"ScannedByte": 2.72400814E10
},
"status": "HOLD"
}
{
"results": [
[
{
"field": "AccountID",
"value": "5577497"
},
{
"field": "Requests",
"value": "51900"
}
],
"statistics": {
"Matched": 51498,
"Scanned": 8673577,
"ScannedByte": 2.72400814E10
},
"status": "HOLD"
}
There are multiple such results, indexed as an array under the results field. They are not separated by a comma.
I am trying to just print the "AccountID" sorted by "Requests" in ZSH using jq. I have tried flattening them and using:
jq -r '.results[][0] |.value ' filename
jq -r '.results[][1] |.value ' filename
To get the Account ID and Requests separately and sort them. I don't think bash has a dictionary that can be used. The problem lies in the file, as the field and value are not a key/value pair but are both pairs themselves. Extracting them with the two lines above into separate arrays and sorting by the second array therefore seems a bit too long-winded; I was wondering if there is a way to combine both operations.
The other way is to combine it all into a string and sort it in ascending order. Python would probably have the best solution, but the code needs to be a zsh or bash script.
Solutions that use sed, jq or any other tools supported in ZSH are welcome. If there is a way to create a dictionary in bash, please do let me know.
The expected output requirement is just the Account ID vs Request Number.
5577497 has 51900 requests
5177497 has 50900 requests
pro has 251 requests
If you don't mind learning a little jq, it will probably be best to write a small jq program to do what you want.
To get you started, consider the following jq program, which assumes your input is a stream of valid JSON objects with a "results" key similar to your sample:
[inputs | .results[] | map( { (.field) : .value} ) | add]
After making minor changes to your input so that it consists of valid JSON objects, an invocation of jq with the -n option produces an array of AccountID/Requests objects:
[
{
"AccountID": "5177497",
"Requests": "50900"
},
{
"AccountID": "pro",
"Requests": "251"
},
{
"AccountID": "5577497",
"Requests": "51900"
}
]
You could (for example) now use jq's group_by to group these objects by AccountID, and thereby produce the result you want.
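One possible continuation is sketched below; it sorts rather than groups, and assumes the input has first been made valid JSON, that Requests should be ordered numerically in descending order as in the sample output, and that the file name is only illustrative:
jq -nr '
  [inputs | .results[] | map( { (.field): .value } ) | add]
  | sort_by(.Requests | tonumber)
  | reverse
  | .[]
  | "\(.AccountID) has \(.Requests) requests"
' results.json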
jq -S '.results[] | map( { (.field) : .value} ) | add' query-results-aggregate \
| jq -s -c 'group_by(.number_of_requests) | .[]'
This does the trick. Thanks to peak for the guidance.

Concat 2 arrays inside object based on object key/value

I have multiple JSON objects, and there could be fewer of them if I merge the arrays whenever an object's name key has the same value as the next object's. I'm trying to accomplish this with jq.
I think I have to use group_by(.name) first to group matching keys. I'm also using slurp to first wrap all objects into one big array.
I don't have anything working yet.
given:
{
"name": "a",
"list": [ "a1", "a2" ]
}
{
"name": "a",
"list": [ "a3", "a4" ]
}
{
"name": "b",
"list": [ "b1", "b2" ]
}
should result in:
{
"name": "a",
"list": [ "a1", "a2", "a3", "a4" ]
}
{
"name": "b",
"list": [ "b1", "b2" ]
}
You can use reduce like this:
$ jq -c -n 'reduce inputs as $p ({}; .[$p.name] |= { name : $p.name, list : (.list + $p.list) }) | .[]' file
{"name":"a","list":["a1","a2","a3","a4"]}
{"name":"b","list":["b1","b2"]}
Here's a simple and efficient solution that uses a common "aggregate by" technique:
reduce inputs as $kv ({}; .[$kv.name] += $kv.list)
| keys_unsorted[] as $k
| {name: $k, list: .[$k]}
Since inputs has been used here, the -n command-line option of jq should be specified.
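Putting it together against the sample input (a sketch; -c gives the compact one-object-per-line output shown above):
jq -cn 'reduce inputs as $kv ({}; .[$kv.name] += $kv.list)
        | keys_unsorted[] as $k
        | {name: $k, list: .[$k]}' file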

Update one JSON file values with values from another JSON using JQ (on all levels)

I have two JSON files:
source.json:
{
"general": {
"level1": {
"key1": "x-x-x-x-x-x-x-x",
"key3": "z-z-z-z-z-z-z-z",
"key4": "w-w-w-w-w-w-w-w"
},
"another" : {
"key": "123456",
"comments": {
"one": "111",
"other": "222"
}
}
},
"title": "The best"
}
and the
target.json:
{
"general": {
"level1": {
"key1": "xxxxxxxx",
"key2": "yyyyyyyy",
"key3": "zzzzzzzz"
},
"onemore": {
"kkeeyy": "0000000"
}
},
"specific": {
"stuff": "test"
},
"title": {
"one": "one title",
"other": "other title"
}
}
I need all the values for keys which exist in both files, copied from source.json to target.json, considering all the levels.
I've seen and tested the solution from this post.
It only copies the first level of keys, and I couldn't get it to do what I need.
The result from the solution in that post looks like this:
{
"general": {
"level1": {
"key1": "x-x-x-x-x-x-x-x",
"key3": "z-z-z-z-z-z-z-z",
"key4": "w-w-w-w-w-w-w-w"
},
"another": {
"key": "123456",
"comments": {
"one": "111",
"other": "222"
}
}
},
"specific": {
"stuff": "test"
},
"title": "The best"
}
Everything under the "general" key was copied as is.
What I need, is this:
{
"general": {
"level1": {
"key1": "x-x-x-x-x-x-x-x",
"key2": "yyyyyyyy",
"key3": "z-z-z-z-z-z-z-z"
},
"onemore": {
"kkeeyy": "0000000"
}
},
"specific": {
"stuff": "test"
},
"title": {
"one": "one title",
"other": "other title"
}
}
Only "key1" and "key3" should be copied.
Keys in target JSON must not be deleted and new keys should not be created.
Can anyone help?
One approach you could take is to get all the paths to scalar values for each input and take the set intersection, then copy the values at those paths from source to target.
First we'll need an intersect function (which was surprisingly difficult to craft):
def set_intersect($other):
  (map({ ($other[] | tojson): true }) | add) as $o
  | reduce (.[] | tojson) as $v ({}; if $o[$v] then .[$v] = true else . end)
  | keys_unsorted
  | map(fromjson);
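As a quick sanity check on hypothetical data, the function keeps only the elements present in both arrays:
jq -nc '
  def set_intersect($other):
    (map({ ($other[] | tojson): true }) | add) as $o
    | reduce (.[] | tojson) as $v ({}; if $o[$v] then .[$v] = true else . end)
    | keys_unsorted
    | map(fromjson);

  [["a"], ["b","c"], ["d"]] | set_intersect([["b","c"], ["d"], ["e"]])
'
which outputs [["b","c"],["d"]].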
Then to do the update:
$ jq --argfile s source.json '
  reduce ([paths(scalars)] | set_intersect([$s | paths(scalars)])[]) as $p (.;
    setpath($p; $s | getpath($p))
  )
' target.json
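Note that more recent jq releases mark --argfile as deprecated; a sketch of the same invocation with --slurpfile (which wraps the file's contents in an array, hence the [0] indices, and assumes the set_intersect definition is prepended as above):
$ jq --slurpfile s source.json '
  # (set_intersect definition from above goes here)
  reduce ([paths(scalars)] | set_intersect([$s[0] | paths(scalars)])[]) as $p (.;
    setpath($p; $s[0] | getpath($p))
  )
' target.json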
[Note: this response answers the original question, with respect to the original data. The OP may have had paths in mind rather than keys.]
There is no need to compute the intersection to achieve a reasonably efficient solution.
First, let's hypothesize the following invocation of jq:
jq -n --argfile source source.json --argfile target target.json -f copy.jq
In the file copy.jq, we can begin by defining a helper function:
# emit an array of the distinct terminal keys in the input entity
def keys: [paths | .[-1] | select(type=="string")] | unique;
In order to inspect all the paths to leaf elements of $source, we can use tostream:
($target | keys) as $t
| reduce ($source|tostream|select(length==2)) as [$p,$v]
    ($target;
     if $t|index($p[-1]) then setpath($p; $v) else . end)
Alternatives
Since $t is sorted, it would (at least in theory) make sense to use bsearch instead of index:
bsearch($p[-1]) > -1
Also, instead of tostream we could use paths(scalars).
Putting these alternatives together:
($target | keys) as $t
| reduce ($source|paths(scalars)) as $p
    ($target;
     if $t|bsearch($p[-1]) > -1
     then setpath($p; $source|getpath($p))
     else . end)
Output
{
"general": {
"level1": {
"key1": "x-x-x-x-x-x-x-x",
"key2": "yyyyyyyy",
"key3": "z-z-z-z-z-z-z-z"
},
"onemore": {
"kkeeyy": "0000000"
}
},
"specific": {
"stuff": "test"
}
}
The following provides a solution to the revised question, which is actually about "paths" rather than "keys".
([$target|paths(scalars)] | unique) as $paths
| reduce ($source|paths(scalars)) as $p
    ($target;
     if $paths | bsearch($p) > -1
     then setpath($p; $source|getpath($p))
     else . end)
unique is called so that binary search can be used subsequently.
Invocation:
jq -n --argfile source source.json --argfile target target.json -f program.jq

substitute certain characters in strings found in an object

I have a list of objects and want to replace all occurrences of . with : in values whose key is Name, using jq.
input:
{
"Parameters": [
{
"Name": "TEST.AB.SOMETHING",
"Value": "hvfuycsgvfiwbiwbibibewfiwbcfwifcbwibcibc"
},
{
"Name": "TEST_GF_USER",
"Value": "ssssecret"
}
]
}
expected output:
{
"Parameters": [
{
"Name": "TEST:AB:SOMETHING",
"Value": "hvfuycsgvfiwbiwbibibewfiwbcfwifcbwibcibc"
},
{
"Name": "TEST_GF_USER",
"Value": "ssssecret"
}
]
}
You may split by . and join by :
jq '(.Parameters[].Name)|=(split(".")|join(":"))' file.json
The assignment is done using the update operator.
The trick is to use .Name |= gsub("\\.";":"). In your case (a flat list), it's simple. If you want to modify the Name values of all objects in an arbitrary JSON document, the simplest approach is to use walk/1:
walk( if type == "object" and has("Name") then .Name |= gsub("\\.";":") else . end )
(If your jq does not have walk/1, then its jq definition can readily be found by googling.)
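For reference, a commonly circulated definition (essentially the one shipped as a builtin in later jq versions) is:
def walk(f):
  . as $in
  | if type == "object" then
      reduce keys_unsorted[] as $key
        ( {}; . + { ($key): ($in[$key] | walk(f)) } )
      | f
    elif type == "array" then map( walk(f) ) | f
    else f
    end;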