How to extract JSON data and output to CSV using jq

My JSON file is one long array of many objects [{},{},{},{}]. Within each object I need the values of three keys: "STK_NUM":"1004251 ", "DLR_COST":40.32, "RTL_AMT":9.99. How would I write the jq program so that the key names become the headers of my CSV file (STK_NUM,DLR_COST,RTL_AMT) and the values of these keys from all objects become the rows (1004251,40.32,9.99)? These three keys are present in every object.
Desired csv:
STK_NUM,DLR_COST,RTL_AMT
1004251,40.32,9.99
1012658,29.99,4.69
1232556,18.89,2.49
I've tried:
jq -r .[].STK_NUM HSEitm.json
result:
"1004251"
"1012658"
"1232556"
All my other jq script attempts result in errors because I don't know what I'm doing.
If anyone could show me I would be very grateful.

In the spirit of DRY:
jq -r '
  ["STK_NUM", "DLR_COST", "RTL_AMT"] as $headers
  | $headers,
    # Rows: look up each header key in each object
    (.[] | [.[$headers[]]])
  | @csv
' HSEitm.json
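Since $headers is bound once and reused both for the header row and for indexing each object, adding or reordering a column only requires editing that one list.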

Try
jq -r '
# Headers
["STK_NUM", "DLR_COST", "RTL_AMT"],
# Rows
(.[] | [.STK_NUM, .DLR_COST, .RTL_AMT])
# Output Format
| @csv
' HSEitm.json
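One caveat for both answers: @csv quotes string values (headers included), and since STK_NUM is stored as a string with a trailing space ("1004251 "), the rows will come out as "1004251 ",40.32,9.99 rather than the bare 1004251,40.32,9.99 from the desired CSV. If you need bare numbers, one option (a sketch, assuming STK_NUM is always numeric apart from whitespace) is to trim and convert first:
jq -r '
  ["STK_NUM", "DLR_COST", "RTL_AMT"],
  # strip the whitespace so tonumber can parse the stock number
  (.[] | [(.STK_NUM | gsub("\\s"; "") | tonumber), .DLR_COST, .RTL_AMT])
  | @csv
' HSEitm.json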

Related

Bash: Ignore key value pairs from a JSON that failed to parse using jq

I'm writing a bash script to read a JSON file and export the key-value pairs as environment variables. Though I can extract the key-value pairs, I'm struggling to skip the entries that fail to parse with jq.
JSON (KEY3 should fail to parse)
{
"KEY1":"ABC",
"KEY2":"XYZ",
"KEY3":"---ABC---\n
dskfjlksfj"
}
Here is what I tried
for pair in $(cat test.json | jq -r -R '. as $line | try fromjson catch $line | to_entries | map("\(.key)=\(.value)") | .[]'); do
  echo $pair
  export $pair
done
And this is the error
jq: error (at <stdin>:1): string ("{") has no keys
jq: error (at <stdin>:2): string (" \"key1...) has no keys
My code is based on these posts:
How to convert a JSON object to key=value format in jq?
How to ignore broken JSON line in jq?
Ignore Unparseable JSON with jq
Here's a response to the revised question. Unfortunately, it will only be useful in certain limited cases, not including the example you give. (Basically, it depends on jq's parser being able to recover before the end of file.)
while read -r line ; do
  echo export "$line"
done < <(< test.json jq -rn '
  def do:
    try inputs catch null
    | objects
    | to_entries[]
    | "\(.key)=\"\(.value|@sh)\"" ;
  recurse(do) | select(.)
')
Note that further refinements may be warranted, especially if there is potentially something fishy about the key names being used as shell variable names.
[Note: this response was made to the original question, which has since been changed. The response essentially assumes the input consists of JSON Lines interspersed with other lines.]
Since the goal seems to be to ignore lines that don't have valid key-value pairs, you can simply use catch empty:
while read -r line ; do
  echo export "$line"
done < <(< test.json jq -r -R '
  try fromjson catch empty
  | objects
  | to_entries[]
  | "\(.key)=\"\(.value|@sh)\""
')
Note also the use of @sh and of the shell's read, and the fact that .value (in jq) and $line (in the shell) are both quoted. These are all important for robustness, though further refinements might still be necessary.
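For example, given a JSON Lines style file where each object sits on its own line (a hypothetical sample, not the pretty-printed input from the question):
{"KEY1":"ABC","KEY2":"XYZ"}
this is not json
{"KEY3":"PQR"}
the loop above prints:
export KEY1="'ABC'"
export KEY2="'XYZ'"
export KEY3="'PQR'"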
Perhaps there is an algorithm that will repair the broken JSON produced by the upstream system. If not, the following is a horrible but possibly useful "hack" that will at least capture KEY1 and KEY2 in the example in the Q:
jq -Rr '
  capture("\"(?<key>[^\"]*)\"[ \t]*:[ \t]*(?<value>[^}]+)")
  | (.value |= sub("[ \t]+$"; ""))                               # trailing whitespace
  | if .value|test("^\".*\"") then .value |= sub("\"[ \t]*[,}[ \t]*$"; "\"") else . end
  | select(.value | test("^\".*\"$") or (contains("\"")|not))    # a string or not a string
  | "\(.key)=\(.value|@sh)"
'
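Run against the broken sample, this emits only the two recoverable pairs, with the JSON string quotes preserved inside the shell quoting:
KEY1='"ABC"'
KEY2='"XYZ"'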
The broken JSON in the example could be repaired in a number of ways, e.g.:
sed '/\\n$/{N; s/\\n\n/\\n/;}'
produces:
{
"KEY1":"ABC",
"KEY2":"XYZ",
"KEY3":"---ABC---\ndskfjlksfj"
}
At least that's JSON :-)
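Once repaired, the usual jq pipeline applies; for instance, reusing the key=value formatting from above on the fixed document:
sed '/\\n$/{N; s/\\n\n/\\n/;}' test.json \
  | jq -r 'to_entries[] | "\(.key)=\"\(.value|@sh)\""'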

Using jq to stream, filter large json file and save output as csv

I have a very large json file I would like to stream (using --stream) and filter with jq, then save it as a csv.
This is the sample data with two objects:
[{"_id":"1","time":"2021-07-22","body":["text1"],"region":[{"percentage":"0.1","region":"region1"},{"percentage":"0.9","region":"region2"}],"reach":{"lower_bound":"100","upper_bound":"200"},"languages":["de"]},
{"_id":"2","time":"2021-07-23","body":["text2"],"region":[{"percentage":"0.3","region":"region1"},{"percentage":"0.7","region":"region2"}],"reach":{"lower_bound":"10","upper_bound":"20"},"languages":["en"]}]
I want to filter on the "languages" field in the jq stream so I only retain objects where languages==["de"], then save the result as a new CSV file titled largefile.csv that looks like the following:
_id,time,body,percentage_region1,percentage_region2,reach_lower_bound,reach_upper_bound,languages
"1","2021-07-22","text1","0.1","0.9","100","200","de"
I have the following code so far but it doesn't seem to work:
cat largefile.json -r | jq -cn --stream 'fromstream(1|truncate_stream(inputs | select(.))) | with_entries(select(.value.languages==["de"])) | @csv'
Any help would be much appreciated!
There are several separate tasks involved here, and some are underspecified, but hopefully the following will help you through the thicket:
jq -rn --stream '
  fromstream(1|truncate_stream(inputs))
  | select( .languages == ["de"] )
  | [._id, .time, .body[0], .region[].percentage,
     .reach.lower_bound, .reach.upper_bound, .languages[0]]
  | @csv
' largefile.json
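On the two sample objects this prints just the matching row:
"1","2021-07-22","text1","0.1","0.9","100","200","de"
If you also want the header row from the desired output, you can prepend it (a sketch; the column list simply mirrors your example, and it assumes every record has exactly two region entries):
jq -rn --stream '
  ["_id","time","body","percentage_region1","percentage_region2",
   "reach_lower_bound","reach_upper_bound","languages"],
  (fromstream(1|truncate_stream(inputs))
   | select(.languages == ["de"])
   | [._id, .time, .body[0], .region[].percentage,
      .reach.lower_bound, .reach.upper_bound, .languages[0]])
  | @csv
' largefile.json > largefile.csv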

Parse multiple JSON files and output regex matches with the associated file names

Currently, piping cat output into jq lets me parse multiple JSON files in my working directory and screen them against a regex matching the email ids present in the files. However, I'd also like to identify the file in which each regex match occurs.
cat *.json | jq '. as $data | [path(..| select(scalars and (tostring | test("^[a-zA-Z0-9+_.-]+@[a-zA-Z0-9.-]+$", "ixn")))) ] | map({ (.|join(".")): (. as $path | .=$data | getpath($path)) }) | reduce .[] as $item ({}; . * $item)'
I'd appreciate any help tweaking the command to also print the file name. Thanks!
input_filename evaluates to the input file name of the file currently being read (after it has been opened). For STDIN, it evaluates to "<stdin>":
jq 'input_filename, input_filename' <<< 1
"<stdin>"
"<stdin>"
It works with the -n command-line option, but only after an input or inputs function has been called:
jq -n 'input_filename, (input | input_filename)' <<< 1
null
"<stdin>"
For a jq-internal solution use input_filename as @peak suggested. Here's an external solution which iterates over your input files and passes each file name into jq as a variable. This approach, however, calls jq once per input file (as opposed to your cat *.json | jq ... approach, which makes just one call), so you might run into performance issues when applied to a larger number of input files.
for f in *.json
do
  jq --arg f "$f" '. as $data | ... (use $f here) ...' "$f"
done

Pretty-print valid JSONs mixed with string keys

I have a Redis hash whose keys are strings and whose values are serialized JSON.
The output of the corresponding rediscli query (hgetall some_redis_hash) is dumped into a file:
redis_key1
{"value1__key1": "value1__value1", "value1__key2": "value1__value2" ...}
redis_key2
{"value2__key1": "value2__value1", "value2__key2": "value2__value2" ...}
...
and so on.
So the question is: how do I pretty-print these values enclosed in brackets? (Note that the key strings in between make the document invalid JSON if you try to parse it as a whole.)
The first thought is to get particular pairs from Redis, strip the parasite keys, and use jq on the remaining valid JSON, as shown below:
rediscli hget some_redis_hash redis_key1 > file && tail -n +2 file
(file now contains the valid JSON value; the first line holding the Redis key is stripped by tail)
cat file | jq
(produces the pretty-printed value)
So the question is: how to pretty-print without such preprocessing?
Or (better still in this particular case): how to merge the keys and values into one big JSON document, where the Redis keys are accessible at the top level, each followed by the dict of its values?
Like that:
rediscli hgetall some_redis_hash > file
cat file | cool_parser
(prints { "redis_key1": {"value1__key1": "value1__value1", ...}, "redis_key2": ... })
A simple way for just pretty-printing would be the following:
cat file | jq --raw-input --raw-output '. as $raw | try fromjson catch $raw'
It tries to parse each line as json with fromjson, and just outputs the original line (with $raw) if it can't.
(The --raw-input is there so that we can invoke fromjson enclosed in a try instead of running it on every line directly, and --raw-output is there so that any non-json lines are not enclosed in quotes in the output.)
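On the sample dump this passes the Redis key lines through untouched and pretty-prints each JSON line, e.g.:
redis_key1
{
  "value1__key1": "value1__value1",
  "value1__key2": "value1__value2",
  ...
}
redis_key2
...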
A solution for the second part of your questions using only jq:
cat file \
| jq --raw-input --null-input '[inputs] | _nwise(2) | {(.[0]): .[1] | fromjson}' \
| jq --null-input '[inputs] | add'
--null-input combined with [inputs] reads the whole input as an array
which _nwise(2) then chunks into groups of two
which {(.[0]): .[1] | fromjson} then turns into a stream of single-entry objects
which | jq --null-input '[inputs] | add' then combines into a single JSON object
Or in a single jq invocation:
cat file | jq --raw-input --null-input \
'[ [inputs] | _nwise(2) | {(.[0]): .[1] | fromjson} ] | add'
...but by that point you might be better off writing an easier to understand python script.
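If you'd rather avoid the underscore-prefixed internal helper _nwise, a similar pairing can be done by consuming the lines two at a time with inputs and input (a sketch under the same assumptions about the dump format, i.e. keys and values strictly alternate):
cat file | jq --raw-input --null-input '
  reduce inputs as $key ({}; . + { ($key): (input | fromjson) })
'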

Extract fields from object present in multiple nested levels using jq

I have a text file having data records as below:
{"input":{"payload":{"id":"rec1","var2":"imp_val1","var3":"","var4":"000000"},"recordName":"typeABC"}}
{"input":{"recordName":"typeBCD","payload":{"var5":"val_var66","recordType":"typeA","id":"rec2","var2":"imp_val2","var3":"","var4":"000000"}}}
{"recordName":"typeEFG","payload":{"var5":"val_var55","recordType":"typeA","id":"rec3","var2":"imp_val3","var3":"","var4":""}}
{"payload":{"id":"rec4","var2":"imp_val4","var3":"","var4":"000000"},"recordName":"typeABC"}
There are recordName and payload keys, where payload holds my values of interest. Some records are wrapped inside another key, input.
What I want to extract is id and var2 from all of these records into a new CSV file.
I figured if the data format were uniform, I could do:
cat file | jq -r "[.payload.id, .payload.var2] | @csv" > newFile
OR
cat file | jq -r "[.input.payload.id, .input.payload.var2] | @csv" > newFile
Any pointers?
Since the payload object is nested at different levels across the objects, you can apply recursive descent (..) and ignore the cases where the key is not present:
jq -r '.. | .payload? // empty | [ .id, .var2 ] | @csv' file
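On the four sample records this prints:
"rec1","imp_val1"
"rec2","imp_val2"
"rec3","imp_val3"
"rec4","imp_val4"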
BTW, I figured the following works too:
jq -r 'if .input then .input.payload else .payload end | [.id, .var2] | @csv' file > newFile
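A slightly shorter equivalent, assuming input is the only wrapper that ever appears, falls back to the record itself via the alternative operator:
jq -r '(.input // .) | .payload | [.id, .var2] | @csv' file > newFile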