Using jq to merge two JSON values based on key value

I have two JSON files that I would like to merge based on the value of a key. The key name is different in the two files, but the value would be the same. I am using jq to try to get this done. Most of the examples I have found merge based on key name rather than key value.
sample1.json
[
{
"unique_id": "pp1234",
"unique_id_type": "netid",
"rfid": "12245556890478",
},
{
"unique_id": "aqe123",
"unique_id_type": "netid",
"rfid": "12234556890478",
}
]
sample2.json
[
{
"mailing_state": "New York",
"mobile_phone_number": "(982) 2541212",
"netid": "pp1234",
"netid_reachable": "Y",
},
{
"mailing_state": "New York",
"mobile_phone_number": "(982) 5551212",
"netid": "aqe123",
"netid_reachable": "Y",
}
]
i would want the output to look something like:
results.json
[
{
"unique_id": "pp1234",
"unique_id_type": "netid",
"rfid": "12245556890478",
"mailing_state": "New York",
"mobile_phone_number": "(982) 2541212",
"netid_reachable": "Y",
},
{
"unique_id": "aqe123",
"unique_id_type": "netid",
"rfid": "12234556890478",
"mailing_state": "New York",
"mobile_phone_number": "(982) 5551212",
"netid_reachable": "Y",
}
]
The order of the results would not matter as long as the records are merged based on the netid/unique_id keys. I am open to using something other than jq if necessary. Thanks in advance.

With the sample input files corrected to valid JSON as shown above (the originals contained trailing commas), the following invocation should do the trick:
jq --argfile uid sample1.json '
($uid | INDEX(.unique_id)) as $dict
| map( $dict[.netid] + del(.netid) )
' sample2.json
If you prefer not to use --argfile because it has been deprecated, you could (for example) use --slurpfile and change $uid in the jq program to $uid[0].
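For reference, a minimal sketch of the --slurpfile variant (same logic; --slurpfile wraps the file's contents in an array, hence $uid[0]):
jq --slurpfile uid sample1.json '
($uid[0] | INDEX(.unique_id)) as $dict
| map( $dict[.netid] + del(.netid) )
' sample2.json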

Merging multiple JSON Lines files into a single JSON object

I'm trying to merge / reduce many JSON objects and somehow I'm not getting the expected result.
I'm only interested in getting all keys, the values and the number of items inside arrays are irrelevant.
file1.json:
{
"customerId": "xx",
"emails": [
{
"address": "james#zz.com",
"customType": "",
"type": "custom"
},
{
"address": "sales#x.com",
"primary": true
},
{
"address": "info#x.com"
}
]
}
{
"id": "654",
"emails": [
{
"address": "peter#x.com",
"primary": true
}
]
}
The desired output is a JSON object with all possible keys from all input objects. The values are irrelevant, any value from any input object is OK. But all keys from input objects must be present in output object:
{
"emails": [
{
"address": "james#zz.com", <--- any existing value works
"customType": "", <--- any existing value works
"type": "custom", <--- any existing value works
"primary": true <--- any existing value works
}
],
"customerId": "xx", <--- any existing value works
"id": "654" <--- any existing value works
}
I tried reducing it, but it misses many of the keys in the array:
$ jq -s 'reduce .[] as $item ({}; . + $item)' file1.json
{
"customerId": "xx",
"emails": [
{
"address": "peter#x.com",
"primary": true
}
],
"id": "654"
}
The structure of the objects contained in file1.json is unknown, so the solution must be agnostic of any keys/values and the solution must not assume any structure or depth.
Is it possible to fix this somehow considering how jq works? Or is it possible to solve this issue using another tool?
PS: For those of you that are curious, this is useful to infer a schema that can be created in a database. Given an arbitrary number of JSON objects with an arbitrary structure, it's easy to create a single JSON squished/merged/fused structure that will "accommodate" all JSON objects.
BigQuery is able to autodetect a schema, but only 500 lines are analyzed to come up with it. This presents problems if objects have different structures past that 500 line mark.
With this approach I can squish a JSON Lines file with millions of objects into a single line that can then be imported into BigQuery with the autodetect schema flag, and it will work every time, since BigQuery only has one line to analyze and this line is the "super-schema" of all the objects. After extracting the autodetected schema I can manually fine-tune it to make sure types are correct and then recreate the table specifying my tuned schema:
$ ls -1 users*.json | wc --lines
3672
$ cat users*.json > users-all.json
$ cat users-all.json | wc --lines
146482633
$ jq 'squish' users-all.json > users-all-squished.json
$ cat users-all-squished.json | wc --lines
1
$ bq load --autodetect users users-all-squished.json
$ bq show schema --format=prettyjson users > users-schema.json
$ vi users-schema.json
$ bq rm --table users
$ bq mk --table users --schema=users-schema.json
$ bq load users users-all.json
[Some options are missing or changed for readability]
Here is a solution that produces the expected result for the sample example, and seems to meet all the stated requirements. It is similar to one proposed by @pmf on this page.
jq -n --stream '
def squish: map(if type == "number" then 0 else . end);
reduce (inputs | select(length==2)) as [$p, $v] ({}; setpath($p|squish; $v))
'
Output
For the example given in the Q, the output is:
{
"customerId": "xx",
"emails": [
{
"address": "peter#x.com",
"customType": "",
"type": "custom",
"primary": true
}
],
"id": "654"
}
As @peak has pointed out, some aspects are underspecified. For instance, what should happen with .customerId and .id? Are they always the same across all files (as suggested by the sample files provided)? Do you want the items of the .emails array just thrown into one large array, or do you want to have them "merged" by some criteria (e.g. by a common value in their .address field)? Here are some stubs to start from:
Simply concatenate the .emails arrays and take all other parts from the first file:
jq 'reduce inputs as $in (.; .emails += $in.emails)' file*.json
# or simpler
jq '.emails += [inputs.emails[]]' file*.json
Demo
{
"emails": [
{
"address": "cc#xx.com"
},
{
"address": "james#zz.com",
"customType": "",
"type": "custom"
},
{
"address": "james#x.com"
},
{
"address": "sales#x.com",
"primary": true
},
{
"address": "info#x.com"
},
{
"address": "james#x.com"
},
{
"address": "sales#x.com",
"primary": true
},
{
"address": "info#x.com"
}
],
"customerId": "xx",
"id": "654"
}
Merge the objects in the .emails array by a common value in their .address field, with latter values overwriting former values for other fields with colliding names, and discard all other parts from the files:
jq -n 'reduce inputs.emails[] as $e ({}; .[$e.address] += $e) | map(.)' file*.json
Demo
[
{
"address": "cc#xx.com"
},
{
"address": "james#zz.com",
"customType": "",
"type": "custom"
},
{
"address": "james#x.com"
},
{
"address": "sales#x.com",
"primary": true
},
{
"address": "info#x.com"
}
]
If you are only interested in a list of unique field names for a given address, regardless of the counts and values used, you can also go with:
jq -n '
reduce inputs.emails[] as $e ({}; .[$e.address][$e | keys_unsorted[]] = 1)
| map_values(keys)
'
Demo
{
"cc#xx.com": [
"address"
],
"james#zz.com": [
"address",
"customType",
"type"
],
"james#x.com": [
"address"
],
"sales#x.com": [
"address",
"primary"
],
"info#x.com": [
"address"
]
}
The structure of the objects contained in file1.json is unknown, so the solution must be agnostic of any keys/values and the solution must not assume any structure or depth.
You can use the --stream flag to break down the structure into an array of paths and values, discard the values part and make the paths unique:
jq --stream -nc '[inputs[0]] | unique[]' file*.json
["customerId"]
["emails"]
["emails",0,"address"]
["emails",0,"customType"]
["emails",0,"primary"]
["emails",0,"type"]
["emails",1,"address"]
["emails",2]
["emails",2,"address"]
["emails",2,"primary"]
["emails",3]
["emails",3,"address"]
["id"]
Trying to build a representation of this, similar to any of the input files, comes with a lot of caveats. For instance, how would you represent it in a single structure if one file had .emails as an array of objects, and another had .emails as just an atomic value, say, a string? You would not be able to represent this plurality without introducing new, possibly ambiguous structures (e.g. putting all possibilities into an array).
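For instance, the stream events for these two conflicting inputs (a contrived sketch, fed via a shell here-string) show the path ["emails"] carrying a scalar in the first document while the second nests further paths beneath it; no single setpath-built structure can honour both:
jq --stream -nc 'inputs' <<< '{"emails": "n/a"} {"emails": [{"address": "a@b.c"}]}'
[["emails"],"n/a"]
[["emails"]]
[["emails",0,"address"],"a@b.c"]
[["emails",0,"address"]]
[["emails",0]]
[["emails"]]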
Therefore, having a list of paths could be a fair compromise. Judging by your desired output, you want to focus more on the object structure, so you could further reduce complexity by discarding the array indices. Depending on your use case, you could replace them with a single value to retain the information of the presence of an array, or discard them entirely:
jq --stream -nc '[inputs[0] | map(numbers = 0)] | unique[]' file*.json
["customerId"]
["emails"]
["emails",0]
["emails",0,"address"]
["emails",0,"customType"]
["emails",0,"primary"]
["emails",0,"type"]
["id"]
jq --stream -nc '[inputs[0] | map(strings)] | unique[]' file*.json
["customerId"]
["emails"]
["emails","address"]
["emails","customType"]
["emails","primary"]
["emails","type"]
["id"]
The following program meets these two key requirements:
"all keys from input objects must be present in output object";
"the solution must be agnostic of any keys/values and the solution must not assume any structure or depth."
The approach is the same as one suggested by @pmf, and for the example given in the Q, it produces results that are very similar to the one that is shown:
jq -n --stream '
def squish: map(select(type == "string"));
reduce (inputs | select(length==2)) as [$p, $v] ({};
setpath($p|squish; $v))
'
With the given input, this produces:
{
"customerId": "xx",
"emails": {
"address": "peter#x.com",
"customType": "",
"type": "custom",
"primary": true
},
"id": "654"
}

Generate a separate CSV record for each array element

I have the following JSON:
{
"Country": "USA",
"State": "TX",
"Employees": [
{
"Name": "Name1",
"address": "SomeAdress1"
}
]
}
{
"Country": "USA",
"State": "FL",
"Employees": [
{
"Name": "Name2",
"address": "SomeAdress2"
},
{
"Name": "Name3",
"address": "SomeAdress3"
}
]
}
{
"Country": "USA",
"State": "CA",
"Employees": [
{
"Name": "Name4",
"address": "SomeAdress4"
}
]
}
I want to use jq to get the following result in CSV format:
Country, State, Name, Address
USA, TX, Name1, SomeAdress1
USA, FL, Name2, SomeAdress2
USA, FL, Name3, SomeAdress3
USA, CA, Name4, SomeAdress4
I have got the following jq:
jq -r '.|[.Country,.State,(.Employees[]|.Name,.address)] | @csv'
And I get the following, with the 2nd line having more columns than required; I want these extra columns on separate rows:
"USA","TX","Name1","SomeAdress1"
"USA","FL","Name2","SomeAdress2","Name3","SomeAdress3"
"USA","CA","Name4","SomeAdress4"
And I want the following result:
"USA","TX","Name1","SomeAdress1"
"USA","FL","Name2","SomeAdress2"
"USA","FL","Name3","SomeAdress3"
"USA","CA","Name4","SomeAdress4"
You need to generate a separate array for each employee.
[.Country, .State] + (.Employees[] | [.Name, .address]) | @csv
Online demo
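For reference, the complete invocation with raw output might look like this (a sketch; file.json is a placeholder name for the stream of objects):
jq -r '[.Country, .State] + (.Employees[] | [.Name, .address]) | @csv' file.json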
You can store the root object in a variable, and then expand the Employees arrays:
$ jq -r '. as $root | .Employees[]|[$root.Country, $root.State, .Name, .address] | @csv'
"USA","TX","Name1","SomeAdress1"
"USA","FL","Name2","SomeAdress2"
"USA","FL","Name3","SomeAdress3"
"USA","CA","Name4","SomeAdress4"
The other answers are good, but I want to talk about why your attempt doesn't work, as well as why it seems like it should.
You are wondering why this:
jq -r '.|[.Country,.State,(.Employees[]|.Name,.address)] | @csv'
produces this:
"USA","TX","Name1","SomeAdress1"
"USA","FL","Name2","SomeAdress2","Name3","SomeAdress3"
"USA","CA","Name4","SomeAdress4"
perhaps because this:
jq '{Country:.Country,State:.State,Name:(.Employees[]|.Name)}'
produces this:
{
"Country": "USA",
"State": "TX",
"Name": "Name1"
}
{
"Country": "USA",
"State": "FL",
"Name": "Name2"
}
{
"Country": "USA",
"State": "FL",
"Name": "Name3"
}
{
"Country": "USA",
"State": "CA",
"Name": "Name4"
}
It turns out the difference is in what exactly [...] and {...} do in a jq filter. In the array constructor [...], the entire contents of the square brackets, commas and all, is a single filter, which is fully evaluated and all the results combined into one array. Each comma inside is simply the sequencing operator, which means generate all the values from the filter on its left, then all the values from the filter on its right. In contrast, the commas in the {...} object constructor are part of the syntax and just separate the fields of the object. If any of the field expressions yield multiple values then multiple whole objects are produced. If multiple field expressions yield multiple values then you get a whole object for every combination of yielded values.
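A minimal sketch of that difference, with made-up values (-c for compact output):
$ jq -nc '[1, (2,3)]'
[1,2,3]
$ jq -nc '{x: 1, y: (2,3)}'
{"x":1,"y":2}
{"x":1,"y":3}
The commas inside [...] sequence values that are all captured into a single array, while the generator (2,3) inside {...} multiplies whole objects.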
When you do this:
jq -r '.|[.Country,.State,(.Employees[]|.Name,.address)] | @csv'
                  ^      ^                   ^
                  1      2                   3
the problem is that the commas labelled "1", "2" and "3" are all doing the same thing, evaluating all the values for the filter on the left, then all the values for the filter on the right. Then the array constructor catches all of them and produces a single array. The array constructor will never create more than one array for one input.
So with that in mind, you need to make sure that where you're expanding out .Employees[] isn't inside your array constructor. Here's another option to add to the answers you already have:
jq -r '.Employee=.Employees[]|[.Country,.State,.Employee.Name,.Employee.address]|@csv'
demo
or indeed:
jq -r '.Employees[] as $e|[.Country,.State,$e.Name,$e.address]|@csv'
demo

Rename duplicate keys in JSON array data

I have JSON data as below.
{
"Data":
[
"User": [
{"Name": "Solomon", "Age":20},
{"Name": "Absolom", "Age":30},
]
"Country": [
{"Name" : "US", "Resident" : "Permanent"},
{"Name" : "UK", "Resident" : "Temporary"}
]]}
There are two tags with the same keys:
in User there is a Name key, and in Country there is also a Name key. I need to preprocess the JSON file to differentiate the keys. My expected result is below. I tried awk and sed commands, but could not find a proper solution. Any suggestion would be helpful.
Expected result:
{
"Data":
[
"User": [
{"User_Name": "Solomon", "User_Age":20},
{"User_Name": "Absolom", "User_Age":30},
]
"Country": [
{"Country_Name" : "US", "Country_Resident" : "Permanent"},
{"Country_Name" : "UK", "Country_Resident" : "Temporary"}
]]}
The tag name should be prefixed to the attribute name.
This is what I have tried:
jq '[.[] | .["User_Name"] = .Name]' file_name.json
But it changes both tags, User as well as Country.
With the permission of the OP, here's a jtc-based solution while waiting for the jq ones (assuming the input JSON is fixed):
bash $ <file.json jtc -w'<Data>l[:]<L>k<.*>L:<>k' -u'"{L}_{}";' -tc
{
"Data": {
"Country": [
{ "Country_Name": "US", "Country_Resident": "Permanent" },
{ "Country_Name": "UK", "Country_Resident": "Temporary" }
],
"User": [
{ "User_Age": 20, "User_Name": "Solomon" },
{ "User_Age": 30, "User_Name": "Absolom" }
]
}
}
bash $
Explanation of the jtc parameters:
-w'<Data>l[:]<L>k<.*>L:<>k' :
the walk path (-w) selects the Data label (<Data>l)
and then each of the nested elements ([:]),
and memorizes each element's key/label in the namespace L (<L>k),
then finds each further labeled element using a REGEX label search (<.*>L:),
and finally reinterprets the found element's key/label as the value (<>k)
-u'"{L}_{}";':
for each label found in step 1, the update operation (-u) is applied using the template
"{L}_{}";', where {L} is interpolated with the value preserved in the namespace L, and {} is interpolated with the currently found label (at each iteration of the walk path)
the trailing ; (or any other symbol) is required to distinguish the argument of -u from a literal JSON.
-tc is used to display JSON in a semi-compact form.
PS. I'm the creator of the jtc unix JSON processing tool. The disclaimer is required by SO.
As originally posted, neither the illustrative input nor the corresponding output is valid JSON, but the following has been tested using JSON based on the shown input:
.Data |= ( (.User |= map(with_entries(.key |= ("User_" + .))))
| (.Country |= map(with_entries(.key |= ("Country_" + .)))) )
Of course, the above may need tweaking depending on the actual requirements, and can be generalized in various ways, e.g. as shown below.
A generalization
.Data |= with_entries( (.key + "_") as $newkey
| .value |= map(with_entries(.key |= ($newkey + .))))
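A usage sketch for the generalization, assuming the input has first been repaired into valid JSON with Data as an object (fixed.json is a hypothetical file name):
jq '.Data |= with_entries( (.key + "_") as $newkey
| .value |= map(with_entries(.key |= ($newkey + .))))' fixed.json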
Here is an approach using jq streaming:
fromstream(tostream | .[0] |= if length < 4 then . else .[3]="\(.[1])_\(.[3])" end)
It works by using tostream to convert your input to a stream of arrays
[["Data","Country",0,"Name"],"US"]
[["Data","Country",0,"Resident"],"Permanent"]
[["Data","Country",0,"Resident"]]
[["Data","Country",1,"Name"],"UK"]
[["Data","Country",1,"Resident"],"Temporary"]
[["Data","Country",1,"Resident"]]
[["Data","Country",1]]
[["Data","User",0,"Age"],20]
[["Data","User",0,"Name"],"Solomon"]
[["Data","User",0,"Name"]]
[["Data","User",1,"Age"],30]
[["Data","User",1,"Name"],"Absolom"]
[["Data","User",1,"Name"]]
[["Data","User",1]]
[["Data","User"]]
[["Data"]]
then applying a simple update assignment |= expression to transform the stream into
[["Data","Country",0,"Country_Name"],"US"]
[["Data","Country",0,"Country_Resident"],"Permanent"]
[["Data","Country",0,"Country_Resident"]]
[["Data","Country",1,"Country_Name"],"UK"]
[["Data","Country",1,"Country_Resident"],"Temporary"]
[["Data","Country",1,"Country_Resident"]]
[["Data","Country",1]]
[["Data","User",0,"User_Age"],20]
[["Data","User",0,"User_Name"],"Solomon"]
[["Data","User",0,"User_Name"]]
[["Data","User",1,"User_Age"],30]
[["Data","User",1,"User_Name"],"Absolom"]
[["Data","User",1,"User_Name"]]
[["Data","User",1]]
[["Data","User"]]
[["Data"]]
then reversing the transformation with fromstream.
Try it online!

Merge and Sort JSON using JQ

I have a file containing the following structure and an unknown number of results:
{
"results": [
[
{
"field": "AccountID",
"value": "5177497"
},
{
"field": "Requests",
"value": "50900"
}
],
[
{
"field": "AccountID",
"value": "pro"
},
{
"field": "Requests",
"value": "251"
}
]
],
"statistics": {
"Matched": 51498,
"Scanned": 8673577,
"ScannedByte": 2.72400814E10
},
"status": "HOLD"
}
{
"results": [
[
{
"field": "AccountID",
"value": "5577497"
},
{
"field": "Requests",
"value": "51900"
}
]
],
"statistics": {
"Matched": 51498,
"Scanned": 8673577,
"ScannedByte": 2.72400814E10
},
"status": "HOLD"
}
There are multiple such documents, each with results indexed as an array under the results key. They are not separated by commas.
I am trying to print just the "AccountID" sorted by "Requests", in ZSH using jq. I have tried flattening them and using:
jq -r '.results[][0] |.value ' filename
jq -r '.results[][1] |.value ' filename
To get the Account ID and Requests separately and sort them. I don't think bash has a dictionary that can be used. The problem lies in the file, as the field and value are not themselves a key-value pair but a pair of pairs. Therefore, extracting them with the above two lines into separate arrays and sorting by the second array seems a bit too long. I was wondering if there is a way to combine both operations.
The other way is to combine it all into a string and sort it in ascending order. Python would probably have the best solution, but the code needs to be a zsh or bash script.
Solutions that use sed, jq or any other ZSH-supported tools are welcome. If there is a way to create a dictionary in bash, please do let me know.
The projected output requirement is just the Account ID vs request number.
5577497 has 51900 requests
5177497 has 50900 requests
pro has 251 requests
If you don't mind learning a little jq, it will probably be best to write a small jq program to do what you want.
To get you started, consider the following jq program, which assumes your input is a stream of valid JSON objects with a "results" key similar to your sample:
[inputs | .results[] | map( { (.field) : .value} ) | add]
Once the input has been amended so that it consists of valid JSON objects (as above), an invocation of jq with the -n option produces an array of AccountID/Requests objects:
[
{
"AccountID": "5177497",
"Requests": "50900"
},
{
"AccountID": "pro",
"Requests": "251"
},
{
"AccountID": "5577497",
"Requests": "51900"
}
]
You could (for example) now use jq's group_by to group these objects by AccountID, and thereby produce the result you want.
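For instance, here is a sketch of one complete pipeline producing the projected output, using sort_by rather than group_by since each AccountID occurs only once in the sample (filename is a placeholder):
jq -nr '[inputs | .results[] | map( { (.field) : .value} ) | add]
| sort_by(.Requests | tonumber) | reverse[]
| "\(.AccountID) has \(.Requests) requests"' filename
5577497 has 51900 requests
5177497 has 50900 requests
pro has 251 requests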
jq -S '.results[] | map( { (.field) : .value} ) | add' query-results-aggregate \
| jq -s -c 'group_by(.number_of_requests) | .[]'
This does the trick. Thanks to peak for the guidance.

jq: how do I update a value based on a substring match?

I've got a jq question. Given a file file.json containing:
[
{
"type": "A",
"name": "name 1",
"url": "http://domain.com/path/to/filenameA.zip"
},
{
"type": "B",
"name": "name 2",
"url": "http://domain.com/otherpath/to/filenameB.zip"
},
{
"type": "C",
"name": "name 3",
"url": "http://otherdomain.com/otherpath/to/filenameB.zip"
}
]
I'm looking to create another file using jq with url modified only if the url's value matches some pattern. For example, I'd want to update any url matching the pattern:
http://otherdomain.com.*filenameB.*
to some fixed string such as:
http://yetanotherdomain.com/new/path/to/filenameC.tar.gz
with the resulting json:
[
{
"type": "A",
"name": "name 1",
"url": "http://domain.com/path/to/filenameA.zip"
},
{
"type": "B",
"name": "name 2",
"url": "http://domain.com/otherpath/to/filenameB.zip"
},
{
"type": "C",
"name": "name 3",
"url": "http://yetanotherdomain.com/new/path/to/filenameB.tar.gz"
}
]
I haven't gotten far, even on finding the url, let alone updating it. This is as far as I've gotten (wrong results, and it doesn't help me with the update issue):
% cat file.json | jq -r '.[] | select(.url | index("filenameB")).url'
http://domain.com/otherpath/to/filenameB.zip
http://otherdomain.com/otherpath/to/filenameB.zip
%
Any ideas on how to get the path of the key that has a value matching a regex? And after that, how to update the key with some new string value? If there are multiple matches, all should be updated with the same new value.
The good news is that there's a simple solution to the problem:
map( if .url | test("http://otherdomain.com.*filenameB.*")
then .url |= sub( "http://otherdomain.com.*filenameB.*";
"http://yetanotherdomain.com/new/path/to/filenameC.tar.gz")
else .
end)
The not-so-good news is that it's not so easy to explain unless you understand the key cleverness here - the "|=" filter. There is plenty of jq documentation about it, so I'll just point out that it is similar to the += family of operators in the C family of programming languages.
Specifically, .url |= sub(A;B) is like .url = (.url|sub(A;B)). That is how the update is done "in-place".
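As a minimal sketch of |= at work on a made-up object:
jq -nc '{url: "http://old.example/x.zip"} | .url |= sub("^http://old.example"; "http://new.example")'
{"url":"http://new.example/x.zip"}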
Here is a solution which identifies the paths to url members using tostream and select, and then updates the values using reduce and setpath:
"http://otherdomain.com.*filenameB.*" as $from
| "http://yetanotherdomain.com/new/path/to/filenameC.tar.gz" as $to
| reduce (tostream | select(length == 2 and .[0][-1] == "url")) as $p (
.
; setpath($p[0]; $p[1] | sub($from; $to))
)
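For arbitrarily nested input, a walk-based variant is another option (a sketch, assuming jq 1.6+ and string-valued url fields; file.json as in the question):
jq 'walk(if type == "object" and ((.url? // "") | test("http://otherdomain.com.*filenameB.*"))
then .url = "http://yetanotherdomain.com/new/path/to/filenameC.tar.gz"
else . end)' file.json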