Extract fields from object present in multiple nested levels using jq - json

I have a text file having data records as below:
{"input":{"payload":{"id":"rec1","var2":"imp_val1","var3":"","var4":"000000"},"recordName":"typeABC"}}
{"input":{"recordName":"typeBCD","payload":{"var5":"val_var66","recordType":"typeA","id":"rec2","var2":"imp_val2","var3":"","var4":"000000"}}}
{"recordName":"typeEFG","payload":{"var5":"val_var55","recordType":"typeA","id":"rec3","var2":"imp_val3","var3":"","var4":""}}
{"payload":{"id":"rec4","var2":"imp_val4","var3":"","var4":"000000"},"recordName":"typeABC"}
Each record has a recordName key and a payload key; payload holds my values of interest. Some records are wrapped inside another key, input.
What I want to extract is id and var2 from all these records into a new CSV file.
I figured that if the data format were uniform, I could do:
cat file | jq -r "[.payload.id, .payload.var2] | @csv" > newFile
OR
cat file | jq -r "[.input.payload.id, .input.payload.var2] | @csv" > newFile
Any pointers?

Since the payload object sits at a different nesting level from record to record, you can apply Recursive Descent (..) and skip any value where the key is not present:
jq -r '..| .payload? // empty | [ .id, .var2 ] | @csv' file
jqplay - Online demo
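For readers who want to see what the recursive-descent filter above is doing, here is a rough Python equivalent. It is only a sketch and not part of the original answer: the function names `find_payloads` and `records_to_csv` are made up for illustration, and it assumes one JSON record per line and that every payload is an object. It quotes every field, which for this string-only data roughly mirrors jq's @csv output.

```python
import csv
import io
import json


def find_payloads(value):
    """Yield every non-null, non-false "payload" value at any depth,
    similar in spirit to jq's `.. | .payload? // empty`."""
    if isinstance(value, dict):
        payload = value.get("payload")
        if payload is not None and payload is not False:
            yield payload
        for child in value.values():
            yield from find_payloads(child)
    elif isinstance(value, list):
        for child in value:
            yield from find_payloads(child)
    # Scalars have no keys, so they are skipped (the `?` in the jq filter).


def records_to_csv(lines):
    """Turn newline-delimited JSON records into CSV rows of (id, var2)."""
    out = io.StringIO()
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        for payload in find_payloads(json.loads(line)):
            # Assumption: payload is a dict, as in the sample data.
            writer.writerow([payload.get("id"), payload.get("var2")])
    return out.getvalue()


if __name__ == "__main__":
    sample = [
        '{"input":{"payload":{"id":"rec1","var2":"imp_val1"},"recordName":"typeABC"}}',
        '{"payload":{"id":"rec4","var2":"imp_val4"},"recordName":"typeABC"}',
    ]
    print(records_to_csv(sample), end="")
```

Like the jq filter, this picks up a payload key wherever it occurs, so a record that happened to contain a second, unrelated payload key deeper down would produce an extra row.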

BTW, I figured out that the following works too:
jq -r 'if .input then .input.payload else .payload end | [.id, .var2] | @csv' file > newFile

Related

Can this jq map be simplified?

Given this JSON:
{
"key": "/books/OL1000072M",
"source_records": [
"ia:daywithtroubadou00pern",
"bwb:9780822519157",
"marc:marc_loc_2016/BooksAll.2016.part25.utf8:103836014:1267"
]
}
Can the following jq code be simplified?
jq -r '.key as $olid | .source_records | map([$olid, .])[] | @tsv'
The use of variable assignment feels like cheating and I'm wondering if it can be eliminated. The goal is to map the key value onto each of the source_records values and output a two column TSV.
Instead of mapping into an array, and then iterating over it (map(…)[]) just create an array and collect its items ([…]). Also, you can get rid of the variable binding (as) by moving the second part into its own context using parens.
jq -r '[.key] + (.source_records[] | [.]) | @tsv'
Alternatively, instead of using @tsv you could build your tab-separated output string yourself. Either by concatenation (… + …) or by string interpolation ("\(…)"):
jq -r '.key + "\t" + .source_records[]'
jq -r '"\(.key)\t\(.source_records[])"'
Output:
/books/OL1000072M ia:daywithtroubadou00pern
/books/OL1000072M bwb:9780822519157
/books/OL1000072M marc:marc_loc_2016/BooksAll.2016.part25.utf8:103836014:1267
It's not much shorter, but I think it's clearer than the original and clearer than the other shorter answers.
jq -r '.key as $olid | .source_records[] | [ $olid, . ] | @tsv'

Using jq to stream, filter large json file and save output as csv

I have a very large json file I would like to stream (using --stream) and filter with jq, then save it as a csv.
This is the sample data with two objects:
[{"_id":"1","time":"2021-07-22","body":["text1"],"region":[{"percentage":"0.1","region":"region1"},{"percentage":"0.9","region":"region2"}],"reach":{"lower_bound":"100","upper_bound":"200"},"languages":["de"]},
{"_id":"2","time":"2021-07-23","body":["text2"],"region":[{"percentage":"0.3","region":"region1"},{"percentage":"0.7","region":"region2"}],"reach":{"lower_bound":"10","upper_bound":"20"},"languages":["en"]}]
I want to filter on the "languages" field in jq stream so I only retain objects where languages==["de"], then save it as a new csv file titled largefile.csv such that the new csv file looks like the following:
_id,time,body,percentage_region1,percentage_region2,reach_lower_bound,reach_upper_bound,languages
"1","2021-07-22","text1","0.1","0.9","100","200","de"
I have the following code so far but it doesn’t seem to work:
cat largefile.json -r | jq -cn --stream 'fromstream(1|truncate_stream(inputs | select(.))) | with_entries(select(.value.languages==[“de”])) | @csv
Any help would be much appreciated!
There are several separate tasks involved here, and some are underspecified, but hopefully the following will help you through the thicket:
jq -rn --stream '
fromstream(1|truncate_stream(inputs))
| select( .languages == ["de"] )
| [._id, .time, .body[0], .region[].percentage,
.reach.lower_bound, .reach.upper_bound, .languages[0]]
| @csv
'

How to extract json data and output to csv using jq

My json array file is one long file and has many objects [{},{},{},{}], within each object I need the value for three keys: "STK_NUM":"1004251 ", "DLR_COST":40.32 , "RTL_AMT":9.99. How would I write the jq program to get the key names as headers in my csv file: STK_NUM, DLR_COST,RTL_AMT and the values of these keys from all objects: 1004251,40.32,9.99 ? These three keys are in every object.
Desired csv:
STK_NUM,DLR_COST,RTL_AMT
1004251,40.32,9.99
1012658,29.99,4.69
1232556,18.89,2.49
I've tried:
jq -r .[].STK_NUM HSEitm.json
result:
"1004251"
"1012658"
"1232556"
All my other jq script attempts result in errors because I don't know what I'm doing.
If anyone could show me I would be very grateful.
In the spirit of DRY:
jq -r '
["STK_NUM", "DLR_COST", "RTL_AMT"] as $headers
| $headers,
# Rows (one array per object, so each object becomes its own CSV line):
(.[] | [.[$headers[]]])
| @csv
'
Try
jq -r '
# Headers
["STK_NUM", "DLR_COST", "RTL_AMT"],
# Rows
(.[] | [.STK_NUM, .DLR_COST, .RTL_AMT])
# Output Format
| @csv
' HSEitm.json
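If jq is not a hard requirement, the same "headers once, then one row per object" idea can be sketched in Python. This is only an illustration, not something from the answers above: the function name `to_csv` and the `HEADERS` constant are made up, and it assumes the input is a JSON array of objects. Unlike @csv, the csv module's default quoting leaves plain strings unquoted, which happens to match the desired output in the question.

```python
import csv
import io
import json

# The column names are listed once and reused for the header row and
# for picking values out of each object (the DRY idea from the answer).
HEADERS = ["STK_NUM", "DLR_COST", "RTL_AMT"]


def to_csv(json_text):
    """Convert a JSON array of objects into CSV text with a header row."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=HEADERS,
        extrasaction="ignore",  # silently skip keys that are not in HEADERS
        lineterminator="\n",
    )
    writer.writeheader()
    for item in json.loads(json_text):
        writer.writerow(item)
    return out.getvalue()


if __name__ == "__main__":
    # Hypothetical input shaped like the objects described in the question.
    sample = '[{"STK_NUM": "1004251", "DLR_COST": 40.32, "RTL_AMT": 9.99, "OTHER": "x"}]'
    print(to_csv(sample), end="")
```

Values are written as they appear in the JSON, so a stock number stored with trailing whitespace would keep it unless it is stripped first.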

Pretty-print valid JSONs mixed with string keys

I have a Redis hash with keys and values like string key -- serialized JSON value.
Corresponding rediscli query (hgetall some_redis_hash) being dumped in a file:
redis_key1
{"value1__key1": "value1__value1", "value1__key2": "value1__value2" ...}
redis_key2
{"value2__key1": "value2__value1", "value2__key2": "value2__value2" ...}
...
and so on.
So the question is, how do I pretty-print these values enclosed in braces? (Note that the key strings in between make the document invalid JSON if you try to parse it as a whole.)
The first thought is to get particular pairs from Redis, strip the extraneous keys, and use jq on the remaining valid JSON, as shown below:
rediscli hget some_redis_hash redis_key1 > file && tail -n +2 file
- file now contains valid JSON as value, the first string representing Redis key is stripped by tail -
cat file | jq
- produces pretty-printed value -
So the question is, how to pretty-print without such preprocessing?
Or (would be better in this particular case) how to merge keys and values in one big JSON, where Redis keys, accessible on the upper level, will be followed by dicts of their values?
Like that:
rediscli hgetall some_redis_hash > file
cat file | cool_parser
- prints { "redis_key1": {"value1__key1": "value1__value1", ...}, "redis_key2": ... }
A simple way for just pretty-printing would be the following:
cat file | jq --raw-input --raw-output '. as $raw | try fromjson catch $raw'
It tries to parse each line as json with fromjson, and just outputs the original line (with $raw) if it can't.
(The --raw-input is there so that we can invoke fromjson enclosed in a try instead of running it on every line directly, and --raw-output is there so that any non-json lines are not enclosed in quotes in the output.)
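The same "try to parse, fall back to the raw line" logic can be sketched in Python for comparison. This is a hypothetical helper, not part of the original answer; the name `pretty_or_raw` is made up, and it assumes each JSON value sits on a single line as in the dump above.

```python
import json
import sys


def pretty_or_raw(line):
    """Return the line pretty-printed if it is valid JSON, else unchanged."""
    text = line.rstrip("\n")
    try:
        return json.dumps(json.loads(text), indent=2, ensure_ascii=False)
    except json.JSONDecodeError:
        # Not JSON (for example a bare Redis key): pass it through as-is.
        return text


def main(path):
    """Print every line of the file, pretty-printing the JSON ones."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            print(pretty_or_raw(line))


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

One difference from the jq command: a line that is itself a quoted JSON string is printed with its quotes here, whereas jq with --raw-output would print the unquoted string.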
A solution for the second part of your question using only jq:
cat file \
| jq --raw-input --null-input '[inputs] | _nwise(2) | {(.[0]): .[1] | fromjson}' \
| jq --null-input '[inputs] | add'
--null-input combined with [inputs] produces the whole input as an array
which _nwise(2) then chunks into groups of two (more info on _nwise)
which {(.[0]): .[1] | fromjson} then transforms into a list of jsons
which | jq --null-input '[inputs] | add' then combines into a single json
Or in a single jq invocation:
cat file | jq --raw-input --null-input \
'[ [inputs] | _nwise(2) | {(.[0]): .[1] | fromjson} ] | add'
...but by that point you might be better off writing an easier to understand python script.
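A minimal sketch of what such a Python script might look like, assuming the dump strictly alternates between a key line and a single-line JSON value as in the question. The function name `merge_hgetall` is made up for illustration.

```python
import json
import sys


def merge_hgetall(lines):
    """Combine alternating key / JSON-value lines into one dictionary."""
    rows = [line.rstrip("\n") for line in lines]
    merged = {}
    # Even-indexed rows are Redis keys, odd-indexed rows are their values.
    # zip() stops at the shorter list, so a trailing key without a value
    # is dropped rather than raising an error.
    for key, raw_value in zip(rows[0::2], rows[1::2]):
        merged[key] = json.loads(raw_value)
    return merged


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], encoding="utf-8") as handle:
        print(json.dumps(merge_hgetall(handle), indent=2))
```

It would be run as something like `python merge.py file`, where `merge.py` is whatever name the script is saved under. A value line that is not valid JSON raises `json.JSONDecodeError`, which is probably what you want for a dump that is supposed to be well formed.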

Can't put JSON output into CSV format with jq

I'm building a list of AWS EBS volume attributes so I can store it as CSV in a variable, using jq. I'm going to output the variable to a spreadsheet.
The first command gives the values I'm looking for using jq:
aws ec2 describe-volumes | jq -r '.Volumes[] | .VolumeId, .AvailabilityZone, .Attachments[].InstanceId, .Attachments[].State, (.Tags // [] | from_entries.Name)'
Gives output that I want like this:
MIAPRBcdm0002_test_instance
vol-0105a1678373ae440
us-east-1c
i-0403bef9c0f6062e6
attached
MIAPRBcdwb00000_app1_vpc
vol-0d6048ec6b2b6f1a4
us-east-1c
MIAPRBcdwb00001 /carbon
vol-0cfcc6e164d91f42f
us-east-1c
i-0403bef9c0f6062e6
attached
However, if I put it into CSV format so I can output the variable to a spreadsheet, the command fails:
aws ec2 describe-volumes | jq -r '.Volumes[] | .VolumeId, .AvailabilityZone, .Attachments[].InstanceId, .Attachments[].State, (.Tags // [] | from_entries.Name)| @csv'
jq: error (at <stdin>:4418): string ("vol-743d1234") cannot be csv-formatted, only array
Even putting the top level of the JSON into CSV format fails for EBS volumes:
aws ec2 describe-volumes | jq -r '.Volumes[].VolumeId | @csv'
jq: error (at <stdin>:4418): string ("vol-743d1234") cannot be csv-formatted, only array
Here is the AWS EBS Volumes JSON FILE that I am working with, with these commands (the file has been cleaned of company identifiers, but is valid json).
How can I get this json into CSV format using jq?
@csv only accepts an array as its input, so enclose your filter within [..] as below:
jq -r '[.Volumes[] | .VolumeId, .AvailabilityZone, .Attachments[].InstanceId, .Attachments[].State, (.Tags // [] | from_entries.Name)]|@csv'
The output of the above still has quotes around each string value; if you don't want them, join() can be used instead:
jq -r '[.Volumes[] | .VolumeId, .AvailabilityZone, .Attachments[].InstanceId, .Attachments[].State, (.Tags // [] | from_entries.Name)] | join(",")'
The accepted Answer resolves another obscure jq error:
string ("xxx") cannot be csv-formatted, only array
In my case I did not want the entire output of jq on a single line, but rather each Elasticsearch document I supplied to jq printed as a CSV string on a line of its own. To accomplish this I simply moved the brackets to enclose only the items to be included on each line.
First, by placing my brackets only around items to be included on each line of output, I produced:
jq -r '.hits.hits[]._source | [.syscheck.path, .syscheck.size_after]'
[
"/etc/group-",
"783"
]
[
"/etc/gshadow-",
"640"
]
[
"/etc/group",
"795"
]
[
"/etc/gshadow",
"652"
]
[
"/etc/ssh/sshd_config",
"3940"
]
Piping this to | @csv prints each document's values of .syscheck.path and .syscheck.size_after, quoted and comma-separated, on a separate line:
$ jq -r '.hits.hits[]._source | [.syscheck.path, .syscheck.size_after] | @csv'
"/etc/group-","783"
"/etc/gshadow-","640"
"/etc/group","795"
"/etc/gshadow","652"
"/etc/ssh/sshd_config","3940"
Or to omit quotation marks, following the pattern noted in the accepted Answer:
$ jq -r '.hits.hits[]._source | [.syscheck.path, .syscheck.size_after] | join(",")'
/etc/group-,783
/etc/gshadow-,640
/etc/group,795
/etc/gshadow,652
/etc/ssh/sshd_config,3940