jq: Conditional insert using "lookup" & "target" JSON objects

I'm trying to improve a bash script I wrote using jq (Python version), but can't quite get the conditional nature of the task at hand to work.
The task: insert array from one JSON object ("lookup") into another ("target") only if the key of the "lookup" matches a particular "higher-level" value in the "target". Assume that the two JSON objects are in lookup.json and target.json, respectively.
A minimal example to make this clearer:
"Lookup" JSON:
{
  "table_one": [
    "a_col_1",
    "a_col_2"
  ],
  "table_two": [
    "b_col_1",
    "b_col_2",
    "b_col_3"
  ]
}
"Target" JSON:
{
  "top_level": [
    {
      "name": "table_one",
      "tests": [
        {
          "test_1": {
            "param_1": "some_param"
          }
        },
        {
          "test_2": {
            "param_1": "another_param"
          }
        }
      ]
    },
    {
      "name": "table_two",
      "tests": [
        {
          "test_1": {
            "param_1": "some_param"
          }
        },
        {
          "test_2": {
            "param_1": "another_param"
          }
        }
      ]
    }
  ]
}
I want the output to be:
{
  "top_level": [
    {
      "name": "table_one",
      "tests": [
        {
          "test_1": {
            "param_1": "some_param"
          }
        },
        {
          "test_2": {
            "param_1": "another_param",
            "param_2": [
              "a_col_1",
              "a_col_2"
            ]
          }
        }
      ]
    },
    {
      "name": "table_two",
      "tests": [
        {
          "test_1": {
            "param_1": "some_param"
          }
        },
        {
          "test_2": {
            "param_1": "another_param",
            "param_2": [
              "b_col_1",
              "b_col_2",
              "b_col_3"
            ]
          }
        }
      ]
    }
  ]
}
Hopefully, that makes sense. Early attempts slurped both JSON blobs and assigned them to two variables. I'm trying to select for a match on [roughly] ($lookup | keys[]) == $target.top_level.name, but I can't quite get this match or the subsequent array insert working.
Any advice is well-received!

Assuming the following program is in the file "target.jq", the invocation:
jq --argfile lookup lookup.json -f target.jq target.json
produces the expected result.
target.jq
# For each object in .top_level, look up its "name" in $lookup and,
# wherever a test_2 exists, attach the looked-up array as .test_2.param_2.
.top_level |= map(
  $lookup[.name] as $value
  | .tests |= map(
      if has("test_2")
      then .test_2.param_2 = $value
      else . end) )
Caveat
Since --argfile is officially deprecated, you might wish to choose an alternative method of passing in the contents of lookup.json, but --argfile is supported by all extant versions of jq as of this writing.
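As a sketch of one such alternative: --slurpfile binds the variable to an array of all the JSON values parsed from the file, so the invocation would become
jq --slurpfile lookup lookup.json -f target.jq target.json
and the first binding in target.jq would need one extra level of indexing, e.g. $lookup[0][.name] as $value.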

The jq answer is already given, but the task itself is fascinating: it requires a cross-lookup from the source file into the file being inserted, so I could not help providing an alternative solution as well, using the jtc utility:
<target.json jtc -w'<name>l:<N>v[-1][tests][-1:][0]' \
-i lookup.json -i'<N>t:' -T'{"param_2":{{}}}'
A brief overview of the options used:
-w'<name>l:<N>v[-1][tests][-1:][0]' - selects the points of insertion in the source (target.json): it finds and memorizes into the namespace N the keys to be looked up in the inserted file, then steps back one level up the JSON tree, selects the tests label, then the last entry in it, and finally addresses the first element of that last entry
-i lookup.json - makes the insertion from the file
-i'<N>t:' - this walk over lookup.json recursively finds a tag (label) preserved in the namespace N by the respective walk -w (without this insert option's walk argument, the whole file would be inserted at the insertion points selected by -w)
-T'{"param_2":{{}}}' - finally, a template operation is applied to the insertion result, transforming the found entry (in lookup.json) into one carrying the right label
PS. I'm the developer of jtc - a multithreaded JSON processing utility for Unix.
PPS. The disclaimer is required by SO.

Related

jq query to find nested value and return parent values

Having trouble finding this; maybe it's just my search terms, or who knows.
Basically, I have a series of arrays mapping keyspaces to destination DBs for a large NoSQL migration, so that we can more easily script data movement. I'll include sample JSON below.
It's nested basically like: environment >> { [ target DB ] >> [ list of keyspaces ] }, { [ target DB ] >> [ list of keyspaces ] }
My intent was to update my migration script to more intelligently determine where things go based on which environment is specified, etc., and to require less user input or "figuring things out".
here's sample JSON:
{
  "Prod": [
    {
      "prod1": [
        "prod_db1",
        "prod_db2",
        "prod_db3",
        "prod_db4"
      ]
    },
    {
      "prod2": [
        "prod_db5",
        "prod_db6",
        "prod_db7",
        "prod_db8"
      ]
    }
  ]
}
Assuming I'm able to provide the keyspace and environment to the script and use those as variables in my jq query, is there a way to search for the keyspace and return the value one level up? I.e., I know I can do something like:
#!/bin/bash
ENV="Prod"
jq ".${ENV}[][]" env.json
to just get the DBs in the Prod environment. But if I'm searching for "prod_db6", how can I return the value "prod2"?
Use to_entries to decompose an object into an array of key-value pairs, then IN to search in the value's array, and finally return the key:
jq -r --arg env "Prod" --arg ksp "prod_db6" '
.[$env][] | to_entries[] | select(IN(.value[]; $ksp)).key
' env.json
prod2
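To wire this into a script, a minimal sketch (assuming the data lives in env.json and that the environment and keyspace arrive as positional arguments) might be:
#!/bin/bash
env="${1:-Prod}"     # environment to search in
ksp="${2:-prod_db6}" # keyspace to look up
jq -r --arg env "$env" --arg ksp "$ksp" '
  .[$env][] | to_entries[] | select(IN(.value[]; $ksp)).key
' env.json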

Get item and subsequent item based on a property of the first one

I have an event-log file generated by a third-party tool that I cannot change. This log file is a huge JSON array where the odd elements contain metadata and the even elements contain the message body associated with that metadata. I want to be able to split the file depending on the metadata, grouping the information by subject into different files.
I am working on this project on Windows and I am trying to do it using a batch file and jq.
Basically the array looks like this:
[
  { "type": "abc123" },
  { "name": "first component of type abc123" },
  { "type": "abc123" },
  { "name": "second component of type abc123" },
  { "type": "def124" },
  { "name": "first component of type def124" },
  { "type": "xyz999" },
  { "name": "first component of type xyz999" },
  { "type": "abc123" },
  { "name": "third component of type abc123" },
  { "type": "def124" },
  { "name": "second component of type def124" },
  { "type": "abc123" },
  { "name": "fifth component of type abc123" },
  { "type": "abc123" },
  { "name": "sixth component of type abc123" },
  { "type": "def124" },
  { "name": "third component of type def124" },
  { "type": "def124" },
  { "name": "fourth component of type def124" },
  { "type": "abc123" },
  { "name": "seventh component of type abc123" },
  { "type": "xyz999" },
  { "name": "second component of type xyz999" }
  ...
]
I know that I only have 3 types, so what I am trying to achieve is to create a file for each of them, something like:
First file
{
  "componentLog": {
    "type": "abc123",
    "information": [
      "first component of type abc123",
      "second component of type abc123",
      "third component of type abc123",
      ...
    ]
  }
}
Second file
{
  "componentLog": {
    "type": "def124",
    "information": [
      "first component of type def124",
      "second component of type def124",
      "third component of type def124",
      ...
    ]
  }
}
Third file
{
  "componentLog": {
    "type": "xyz999",
    "information": [
      "first component of type xyz999",
      "second component of type xyz999",
      "third component of type xyz999",
      ...
    ]
  }
}
I know that I can separate the metadata with this:
jq.exe ".[] | select(.type==\"abc123\")" file.json
And then I try to match the index. But index just returns the index of the first item that satisfies the select statement, so I don't know how to solve this...
The following bash script is a bit messy because it assumes none of the files (input or output) will fit into memory.
If you don't already have access to bash, sed and awk in your computing environment, you might want to consider installing wsl, mingw, or some such, or you could adapt the script as appropriate, e.g. using gawk for Windows, or Ruby for Windows.
The other main assumption not already embedded in the original question is that it's OK to remove the log-type*.tmp files and
overwrite log-TYPE.json for the various values of "type".
Be sure to set input to the appropriate input file name.
# The input file name:
input=file.json
/bin/rm -f log-type.*.tmp
# Use jq to produce a stream of .type and .name values
# as per the jq FAQ
jq -cn --stream '
fromstream(1|truncate_stream(inputs))
| if .type then .type else .name end' "$input" |
awk '
NR%2 {fn=$1; sub("^\"","",fn); sub("\"$","", fn); next;}
{ print > "log-type." fn ".tmp"}
'
for f in log-type.*.tmp ; do
echo formatting $f ...
g=$(sed -e 's/log-type.//' -e 's/.tmp$//' <<< "$f")
echo g="$g"
awk -v type="\"$g\"" '
BEGIN { print "{\"componentLog\": { \"type\": " type " ,";
print "\"information\": ["; }
NR==1 { print; next }
{print ",", $0}
END {print "]}}"; }' "$f" > "log-$g.json"
done
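For inputs that do fit in memory, a jq-only sketch of the same split (assuming the strict metadata/message alternation shown in the question, and invoked once per type) could be:
jq -c --arg t "abc123" '
  . as $a
  | [ range(0; length; 2)
      | select($a[.].type == $t)
      | $a[. + 1].name ]
  | {componentLog: {type: $t, information: .}}
' file.json > log-abc123.json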

ID lookup from an external file in JQ

I have a lookup file that maps IDs from one system onto another:
[
  {
    "idA": 2547,
    "idB": "5d0bf91d191c6554d14572a6"
  },
  {
    "idA": 2549,
    "idB": "5b0473f93d4e53db19f8c249"
  },
  {
    "idA": 2550,
    "idB": "5d0bfabc8f20917b92ff07dc"
  },
  ...
And I have a data file with values and an ID from one of these systems:
[
  {
    "idB": "5d0bf91d191c6554d14572a6",
    "description": "Description for 5d0bf91d191c6554d14572a6"
  },
  {
    "idB": "5d0bf49e9236c57281811cfc",
    "description": "Description for 5d0bf49e9236c57281811cfc"
  },
  {
    "idB": "5d0bfabc8f20917b92ff07dc",
    "description": "Description for 5d0bfabc8f20917b92ff07dc"
  },
  ...
I want to produce a new file of the descriptions with their IDs converted to the idA values in the lookup file. I tried this:
jq --slurpfile idmap ids.json 'map( {"description":.description, "id": (.idB as $b|$idmap[][]|select(.idB==$b)|.idA) } )' descriptions.json
But it produces only an empty array.
I have to double-dereference $idmap because slurping a file "binds an array of the parsed JSON values to the given global variable" -- so just doing $idmap[] throws an error, jq: error (at descriptions.json:70): Cannot index array with string "idB".
Can anyone explain what I'm doing wrong here?
Here's a concise and straightforward solution to the stated problem.
For simplicity, we'll begin by constructing a dictionary containing the relevant mapping using INDEX/2:
INDEX($idmap[]; .idB) | map_values(.idA)
Now the task is easy:
(INDEX($idmap[]; .idB) | map_values(.idA)) as $dict
| map( {description, "idA": $dict[.idB] } )
This assumes an invocation that uses --argfile idmap ids.json to avoid
the unwanted "slurping" caused by --slurpfile, but if the latter is used, then you would use $idmap[][] instead as noted in the original question.
With the sample snippets above, the first and third descriptions have matching "idB" values, so their "idA" fields would become 2547 and 2550 respectively, while unmatched entries get null.
Variation
If the objects in descriptions.json had other keys that should be retained, then the following variant would probably be a more useful guide:
(INDEX($idmap[]; .idB) | map_values(.idA)) as $dict # or $idmap[][] as above
| map( .idA = $dict[.idB] | del(.idB) )
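Putting it together, a complete invocation along the lines sketched above would be:
jq --argfile idmap ids.json '
  (INDEX($idmap[]; .idB) | map_values(.idA)) as $dict
  | map( {description, "idA": $dict[.idB]} )
' descriptions.json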

Conditionally changing JSON values in jq with sub() function

I need to alter some values in JSON data, and would like to include it in an already existing shell script. I'm trying to do so using jq, and will need the "sub()" function to cut off a piece of a string value.
Using this command line:
jq '._meta[][].ansible_ssh_pass | sub(" .*" ; "")'
with the data below will correctly replace the value (cutting off everything from the first space onward), but it only prints out the value, not the complete JSON structure.
Here's sample JSON data:
{
  "_meta": {
    "hostvars": {
      "10.1.1.3": {
        "hostname": "core-gw1",
        "ansible_user": "",
        "ansible_ssh_pass": "test123 / ena: test2",
        "configsicherung": "true",
        "os": "ios",
        "managementpaket": ""
      }
    }
  }
}
Output should be something like this:
{
  "_meta": {
    "hostvars": {
      "10.1.1.3": {
        "hostname": "core-gw1",
        "ansible_user": "",
        "ansible_ssh_pass": "test123",
        "configsicherung": "true",
        "os": "ios",
        "managementpaket": ""
      }
    }
  }
}
I assume I have to add some sort of "if ... then" based arguments, but I haven't been able to get jq to understand me ;) The manual is a bit sketchy, and I haven't been able to find any example I could match up with what I need to do ...
OK, as usual ... once you post a public question, you then manage to find a solution yourself ... ;)
This jq-call does what I need:
jq '._meta.hostvars[].ansible_ssh_pass |= sub(" .*"; "")'
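If some hosts might lack ansible_ssh_pass altogether (or hold a non-string value), sub() would raise an error on them; a guarded sketch of the same update:
jq '(._meta.hostvars[].ansible_ssh_pass | select(type == "string")) |= sub(" .*"; "")'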

Search and replace string in a very big file

I have a preference for shell commands to get things done. I have a very, very big file -- about 2.8 GB and the content is that of JSON. Everything is on one line, and I was told there are at least 1.5 million records in there.
I must prepare the file for consumption. Each record must be on its own line. Sample:
{"RomanCharacters":{"Alphabet":[{"RecordId":"1",...]},{"RecordId":"2",...},{"RecordId":"3",...},{"RecordId":"4",...},{"RecordId":"5",...} }}
Or, use the following...
{"Accounts":{"Customer":[{"AccountHolderId":"9c585258-c94c-442b-a2f0-1ebbcc274795","Title":"Mrs","Forename":"Tina","Surname":"Wright","DateofBirth":"1988-01-01","Contact":[{"Contact_Info":"9168777943","TypeId":"Mobile Number","PrimaryFlag":"No","Index":"1","Superseded":"No" },{"Contact_Info":"9503588153","TypeId":"Home Telephone","PrimaryFlag":"Yes","Index":"2","Superseded":"Yes" },{"Contact_Info":"acne.pimple#microchimerism.com","TypeId":"Email Address","PrimaryFlag":"No","Index":"3","Superseded":"No" },{"Contact_Info":"swati.singh#microchimerism.com","TypeId":"Email Address","PrimaryFlag":"Yes","Index":"4","Superseded":"Yes" }, {"Contact_Info":"christian.bale#hollywood.com","TypeId":"Email Address","PrimaryFlag":"No","Index":"5","Superseded":"NO" },{"Contact_Info":"15482475584","TypeId":"Mobile_Phone","PrimaryFlag":"No","Index":"6","Superseded":"No" }],"Address":[{"AddressPtr":"5","Line1":"Flat No.14","Line2":"Surya Estate","Line3":"Baner","Line4":"Pune ","Line5":"new","Addres_City":"pune","Country":"India","PostCode":"AB100KP","PrimaryFlag":"No","Superseded":"No"},{"AddressPtr":"6","Line1":"A-602","Line2":"Viva Vadegiri","Line3":"Virar","Line4":"new","Line5":"banglow","Addres_City":"Mumbai","Country":"India","PostCode":"AB10V6T","PrimaryFlag":"Yes","Superseded":"Yes"}],"Account":[{"Field_A":"6884133655531279","Field_B":"887.07","Field_C":"A Loan Product",...,"FieldY_":"2015-09-18","Field_Z":"24275627"}]},{"AccountHolderId":"92a5788f-cd8f-423d-ae5f-4eb0ceb457fd","_Title":"Dr","_Forename":"Christopher","_Surname":"Carroll","_DateofBirth":"1977-02-02","Contact":[{"Contact_Info":"9168777943","TypeId":"Mobile Number","PrimaryFlag":"No","Index":"7","Superseded":"No" },{"Contact_Info":"9503588153","TypeId":"Home Telephone","PrimaryFlag":"Yes","Index":"8","Superseded":"Yes" },{"Contact_Info":"acne.pimple#microchimerism.com","TypeId":"Email Address","PrimaryFlag":"No","Index":"9","Superseded":"No" },{"Contact_Info":"swati.singh#microchimerism.com","TypeId":"Email Address","PrimaryFlag":"Yes","Index":"10","Superseded":"Yes" }],"Address":[{"AddressPtr":"11","Line1":"Flat No.14","Line2":"Surya Estate","Line3":"Baner","Line4":"Pune ","Line5":"new","Addres_City":"pune","Country":"India","PostCode":"AB11TXF","PrimaryFlag":"No","Superseded":"No"},{"AddressPtr":"12","Line1":"A-602","Line2":"Viva Vadegiri","Line3":"Virar","Line4":"new","Line5":"banglow","Addres_City":"Mumbai","Country":"India","PostCode":"AB11O8W","PrimaryFlag":"Yes","Superseded":"Yes"}],"Account":[{"Field_A":"4121879819185553","Field_B":"887.07","Field_C":"A Loan Product",...,"Field_X":"2015-09-18","Field_Z":"25679434"}]},{"AccountHolderId":"4aa10284-d9aa-4dc0-9652-70f01d22b19e","_Title":"Dr","_Forename":"Cheryl","_Surname":"Ortiz","_DateofBirth":"1977-03-03","Contact":[{"Contact_Info":"9168777943","TypeId":"Mobile Number","PrimaryFlag":"No","Index":"13","Superseded":"No" },{"Contact_Info":"9503588153","TypeId":"Home Telephone","PrimaryFlag":"Yes","Index":"14","Superseded":"Yes" },{"Contact_Info":"acne.pimple#microchimerism.com","TypeId":"Email Address","PrimaryFlag":"No","Index":"15","Superseded":"No" },{"Contact_Info":"swati.singh#microchimerism.com","TypeId":"Email Address","PrimaryFlag":"Yes","Index":"16","Superseded":"Yes" }],"Address":[{"AddressPtr":"17","Line1":"Flat No.14","Line2":"Surya Estate","Line3":"Baner","Line4":"Pune ","Line5":"new","Addres_City":"pune","Country":"India","PostCode":"AB12SQR","PrimaryFlag":"No","Superseded":"No"},{"AddressPtr":"18","Line1":"A-602","Line2":"Viva Vadegiri","Line3":"Virar","Line4":"new","Line5":"banglow","Addres_City":"Mumbai","Country":"India","PostCode":"AB12BAQ","PrimaryFlag":"Yes","Superseded":"Yes"}],"Account":[{"Field_A":"3288214945919484","Field_B":"887.07","Field_C":"A Loan Product",...,"Field_Y":"2015-09-18","Field_Z":"66264768"}]}]}}
Final outcome should be:
{"RomanCharacters":{"Alphabet":[{"RecordId":"1",...]},
{"RecordId":"2",...},
{"RecordId":"3",...},
{"RecordId":"4",...},
{"RecordId":"5",...} }}
Attempted commands:
sed -e 's/,{"RecordId"/}]},\n{"RecordId"/g' sample.dat
awk '{gsub(",{\"RecordId\"",",\n{\"RecordId\"",$0); print $0}' sample.dat
The attempted commands work perfectly fine for small files, but not for the 2.8 GB file that I must manipulate. sed quit midway after 10 minutes for no apparent reason, having done nothing. awk errored with a segmentation fault (core dump) after many hours in. I tried perl's search and replace and got an error saying "Out of memory".
Any help/ ideas would be great!
Additional info on my machine:
More than 105 GB disk space available.
8 GB memory
4 cores CPU
Running Ubuntu 14.04
Since you've tagged your question with sed, awk AND perl, I gather that what you really need is a recommendation for a tool. While that's kind of off-topic, I believe that jq is something you could use for this. It will be better than sed or awk because it actually understands JSON. Everything shown here with jq could also be done in perl with a bit of programming.
Assuming content like the following (based on your sample):
{"RomanCharacters":{"Alphabet": [ {"RecordId":"1","data":"data"},{"RecordId":"2","data":"data"},{"RecordId":"3","data":"data"},{"RecordId":"4","data":"data"},{"RecordId":"5","data":"data"} ] }}
You can easily reformat this to "prettify" it:
$ jq '.' < data.json
{
  "RomanCharacters": {
    "Alphabet": [
      {
        "RecordId": "1",
        "data": "data"
      },
      {
        "RecordId": "2",
        "data": "data"
      },
      {
        "RecordId": "3",
        "data": "data"
      },
      {
        "RecordId": "4",
        "data": "data"
      },
      {
        "RecordId": "5",
        "data": "data"
      }
    ]
  }
}
And we can dig in to the data to retrieve only the records you're interested in (regardless of what they're wrapped in):
$ jq '.[][][]' < data.json
{
  "RecordId": "1",
  "data": "data"
}
{
  "RecordId": "2",
  "data": "data"
}
{
  "RecordId": "3",
  "data": "data"
}
{
  "RecordId": "4",
  "data": "data"
}
{
  "RecordId": "5",
  "data": "data"
}
This is much more readable, both for humans and for tools like awk which process content line-by-line. If you want to join your lines for processing as per your question, the awk becomes much simpler:
$ jq '.[][][]' < data.json | awk '{printf("%s ",$0)} /}/{printf("\n")}'
{ "RecordId": "1", "data": "data" }
{ "RecordId": "2", "data": "data" }
{ "RecordId": "3", "data": "data" }
{ "RecordId": "4", "data": "data" }
{ "RecordId": "5", "data": "data" }
Or, as @peak suggested in the comments, eliminate the awk portion of this entirely by using jq's -c (compact output) option:
$ jq -c '.[][][]' < data.json
{"RecordId":"1","data":"data"}
{"RecordId":"2","data":"data"}
{"RecordId":"3","data":"data"}
{"RecordId":"4","data":"data"}
{"RecordId":"5","data":"data"}
Regarding perl: try setting the input record separator $/ to "},", like this:
#!/usr/bin/perl
$/ = "},";
while (<>) {
    print "$_\n";
}
or, as a one-liner:
$ perl -e '$/="},";while(<>){print "$_\n"}' sample.dat
Try using } as the record separator, e.g. in Perl:
perl -l -0175 -ne 'print $_, $/' < input
You might need to glue back lines containing only }.
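As a sketch of such gluing with awk (assuming a lone } belongs at the end of the previous line):
awk '
  NR > 1 && /^}$/ { prev = prev $0; next }  # append a lone "}" to the held line
  { if (NR > 1) print prev; prev = $0 }     # otherwise emit the held line and hold this one
  END { if (NR) print prev }
' input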
The following gawk approach avoids the memory problem by not looking at the data as a single record, but it may go too far the other way with respect to performance (it processes a single character at a time). Also note that it requires gawk for the built-in RT variable (the value of the current record separator):
$ cat j.awk
BEGIN { RS = "[[:print:]]" }        # every printable character becomes its own record; RT holds it
RT == "{" { bal++ }                 # track brace nesting depth
RT == "}" { bal-- }
{ printf "%s", RT }                 # echo each character unchanged
RT == "," && bal == 2 { print "" }  # break the line after each comma at record depth
END { print "" }
$ gawk -f j.awk j.txt
{"RomanCharacters":{"Alphabet":[{"RecordId":"1",...]},
{"RecordId":"2",...},
{"RecordId":"3",...},
{"RecordId":"4",...},
{"RecordId":"5",...} }}
Using the sample data provided here (the one that begins with {"Accounts":{"Customer"... ), the solution to this problem is one that reads in the file and, as it reads, counts the number of delimiters defined in $/. For every 10,000 delimiters counted, it writes out to a new file; and after each delimiter found, it inserts a new line. Here is what the script looks like:
#!/usr/bin/perl
use strict;
use warnings;

$/ = "}]},";               # delimiter to find and insert a new line after
my $n = 0;
my $match = "";
my $filecount = 0;
my $recsPerFile = 10000;   # set number of records per output file

while (<>) {
    $match .= $_ . "\n";   # keep the record, adding a newline after the delimiter
    $n++;
    print ".";             # progress indicator, so we know it is doing something
    if ($n >= $recsPerFile) {
        my $newfile = "partfile" . $recsPerFile . "-" . $filecount . ".dat";
        open(my $out, '>', $newfile) or die "Cannot open $newfile: $!";
        print $out $match;
        close($out);
        print "Wrote file $newfile\n";
        $match = "";
        $filecount++;
        $n = 0;
    }
}
if ($n) {                  # flush whatever is left over at end of input
    my $newfile = "partfile" . $recsPerFile . "-" . $filecount . ".dat";
    open(my $out, '>', $newfile) or die "Cannot open $newfile: $!";
    print $out $match;
    close($out);
    print "Wrote file $newfile\n";
}
print "Finished\n\n";
print "Finished\n\n";
I've used this script against the big 2.8 GB file, where its content is an unformatted one-liner of JSON. The resulting output files would be missing the correct JSON headers and footers, but this can be easily fixed.
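For instance, a sketch of one such fix, assuming the wrapper seen in the sample above ({"Accounts":{"Customer":[ ... ]}}) and the partfile names produced by the script:
for f in partfile*.dat; do
  {
    printf '%s\n' '{"Accounts":{"Customer":['   # hypothetical header; adjust to the real input
    cat "$f"
    printf '%s\n' ']}}'                          # matching footer
  } > "${f%.dat}.json"
done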
Thank you so much guys for contributing!