I'm writing a bash script to read a JSON file and export its key-value pairs as environment variables. I can extract the key-value pairs, but I'm struggling to skip the entries that jq fails to parse.
JSON (key3 should fail to parse)
{
"KEY1":"ABC",
"KEY2":"XYZ",
"KEY3":"---ABC---\n
dskfjlksfj"
}
Here is what I tried
for pair in $(cat test.json | jq -r -R '. as $line | try fromjson catch $line | to_entries | map("\(.key)=\(.value)") | .[]' ); do
echo $pair
export $pair
done
And this is the error
jq: error (at <stdin>:1): string ("{") has no keys
jq: error (at <stdin>:2): string (" \"key1...) has no keys
My code is based on these posts:
How to convert a JSON object to key=value format in jq?
How to ignore broken JSON line in jq?
Ignore Unparseable JSON with jq
Here's a response to the revised question. Unfortunately, it will only be useful in certain limited cases, not including the example you give. (Basically, it depends on jq's parser being able to recover before the end of file.)
while read -r line ; do
echo export "$line"
done < <(< test.json jq -rn '
def do:
try inputs catch null
| objects
| to_entries[]
| "\(.key)=\"\(.value|@sh)\"" ;
recurse(do) | select(.)
')
Note that further refinements may be warranted, especially if there is potentially something fishy about the key names being used as shell variable names.
[Note: this response was made to the original question, which has since been changed. The response essentially assumes the input consists of JSON Lines interspersed with other lines.]
Since the goal seems to be to ignore lines that don't have valid key-value pairs, you can simply use catch empty:
while read -r line ; do
echo export "$line"
done < <(< test.json jq -r -R '
try fromjson catch empty
| objects
| to_entries[]
| "\(.key)=\"\(.value|@sh)\""
')
Note also the use of @sh and of the shell's read, and the fact that .value (in jq) and $line (in the shell) are both quoted. These are all important for robustness, though further refinements might still be necessary.
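To make the role of @sh concrete, here is a minimal sketch, assuming JSON Lines on stdin and scalar values. The function name json_to_assignments and the key-name filter are my own additions for illustration, not part of the answer above; the filter just drops keys that would not be valid shell variable names.

```shell
# Hypothetical helper (not from the original answer): print KEY='value' lines
# for every parseable JSON object on stdin, skipping lines jq cannot parse.
json_to_assignments() {
  jq -rR '
    try fromjson catch empty
    | objects
    | to_entries[]
    | select(.key | test("^[A-Za-z_][A-Za-z0-9_]*$"))
    | "\(.key)=\(.value | @sh)"
  '
}

# Demo on inline data: the middle line is not JSON and is silently skipped.
printf '%s\n' '{"KEY1":"ABC"}' 'not json' '{"KEY2":"X Y"}' | json_to_assignments
```

If the input is trusted, the output could be loaded into the current shell with eval "$(json_to_assignments < file)"; whether that is acceptable depends on where the JSON comes from.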
Perhaps there is an algorithm that will repair the broken JSON produced by the upstream system. If not, the following is a horrible but possibly useful "hack" that will at least capture KEY1 and KEY2 in the example in the Q:
jq -Rr '
capture("\"(?<key>[^\"]*)\"[ \t]*:[ \t]*(?<value>[^}]+)")
| (.value |= sub("[ \t]+$"; "") ) # trailing whitespace
| if .value|test("^\".*\"") then .value |= sub("\"[ \t]*[,}[ \t]*$"; "\"") else . end
| select(.value | test("^\".*\"$") or (contains("\"")|not) ) # a string or not a string
| "\(.key)=\(.value|@sh)"
'
The broken JSON in the example could be repaired in a number of ways, e.g.:
sed '/\\n$/{N; s/\\n\n/\\n/;}'
produces:
{
"KEY1":"ABC",
"KEY2":"XYZ",
"KEY3":"---ABC---\ndskfjlksfj"
}
At least that's JSON :-)
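As a rough check that the repair is enough for jq, the sed command can be wrapped in a function and its output piped straight into jq. This is only a sketch: the name repair_json is hypothetical, and it assumes the only breakage is a string split across two lines right after a literal \n, as in the example.

```shell
# Hypothetical wrapper around the sed repair shown above: join a line ending
# in a literal backslash-n with the line that follows it.
repair_json() {
  sed '/\\n$/{N; s/\\n\n/\\n/;}'
}

# Demo on inline data shaped like the broken example; jq lists the keys
# it can now read from the repaired document.
printf '%s\n' '{' '"KEY1":"ABC",' '"KEY2":"XYZ",' '"KEY3":"---ABC---\n' 'dskfjlksfj"' '}' \
  | repair_json \
  | jq -r 'keys | join(" ")'
```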
Let's say I have some JSON in a file. It's a subset of data extracted from a larger JSON file (which is why I use --stream in my attempted solution below), and it looks like this:
[
{"_id":"1","#":{},"article":false,"body":"Hello world","comments":"3","createdAt":"20201007200628","creator":{"id":"4a7ba8fd719d43598b977dd548eed6aa","bio":"","blocked":false,"followed":false,"human":false,"integration":false,"joined":"20201007200628","muted":false,"name":"mkscott","rss":false,"private":false,"username":"mkscott","verified":false,"verifiedComments":false,"badges":[],"score":"0","interactions":258,"state":1},"depth":"0","depthRaw":0,"hashtags":[],"id":"2d4126e342ed46509b55facb49b992a5","impressions":"3","links":[],"sensitive":false,"state":4,"upvotes":"0"},
{"_id":"2","#":{},"article":false,"body":"Goodbye world","comments":"3","createdAt":"20201007200628","creator":{"id":"4a7ba8fd719d43598b977dd548eed6aa","bio":"","blocked":false,"followed":false,"human":false,"integration":false,"joined":"20201007200628","muted":false,"name":"mkscott","rss":false,"private":false,"username":"mkscott","verified":false,"verifiedComments":false,"badges":[],"score":"0","interactions":258,"state":1},"depth":"0","depthRaw":0,"hashtags":[],"id":"2d4126e342ed46509b55facb49b992a5","impressions":"3","links":[],"sensitive":false,"state":4,"upvotes":"0"}
],
[
{"_id":"55","#":{},"article":false,"body":"Hello world","comments":"3","createdAt":"20201007200628","creator":{"id":"3a7ba8fd719d43598b977dd548eed6aa","bio":"","blocked":false,"followed":false,"human":false,"integration":false,"joined":"20201007200628","muted":false,"name":"mkscott","rss":false,"private":false,"username":"jkscott","verified":false,"verifiedComments":false,"badges":[],"score":"0","interactions":258,"state":1},"depth":"0","depthRaw":0,"hashtags":[],"id":"2d4126e342ed46509b55facb49b992a5","impressions":"3","links":[],"sensitive":false,"state":4,"upvotes":"0"},
{"_id":"56","#":{},"article":false,"body":"Goodbye world","comments":"3","createdAt":"20201007200628","creator":{"id":"3a7ba8fd719d43598b977dd548eed6aa","bio":"","blocked":false,"followed":false,"human":false,"integration":false,"joined":"20201007200628","muted":false,"name":"mkscott","rss":false,"private":false,"username":"jkscott","verified":false,"verifiedComments":false,"badges":[],"score":"0","interactions":258,"state":1},"depth":"0","depthRaw":0,"hashtags":[],"id":"2d4126e342ed46509b55facb49b992a5","impressions":"3","links":[],"sensitive":false,"state":4,"upvotes":"0"}
]
It describes 4 posts written by 2 different authors, with a unique _id field for each post. Each author wrote 2 posts: one says "Hello world" and the other says "Goodbye world".
I want to match on the word "Hello" and return the _id only for posts whose body contains "Hello". The expected result is:
1
55
The closest I could come in my attempt was:
jq -nr --stream '
fromstream(1|truncate_stream(inputs))
| select(.body %like% "Hello")
| ._id
' <input_file
Assuming the input is modified slightly to make it a stream of the arrays as shown in the Q:
jq -nr --stream '
fromstream(1|truncate_stream(inputs))
| select(.body | test("Hello"))
| ._id
'
produces the desired output.
test uses regex matching. In your case, it seems you could use simple substring matching instead.
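For plain substring matching, jq's contains can be used on strings. Here is a small sketch, assuming the input is a stream of valid JSON arrays (no commas between them) and that the streaming parser is not needed; the function name ids_with_hello is made up for the example.

```shell
# Hypothetical helper: print the _id of every post whose body contains
# the substring "Hello" (no regex involved).
ids_with_hello() {
  jq -r '.[] | select(.body | contains("Hello")) | ._id'
}

# Demo on a cut-down version of the data in the question.
printf '%s\n' \
  '[{"_id":"1","body":"Hello world"},{"_id":"2","body":"Goodbye world"}]' \
  '[{"_id":"55","body":"Hello world"},{"_id":"56","body":"Goodbye world"}]' \
  | ids_with_hello
```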
Handling extraneous commas
Assuming the input has commas between a stream of valid JSON exactly as shown, you could presumably use sed to remove them first.
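One way the sed step might look, assuming each delimiting comma sits on its own line directly after the closing bracket ("],"), exactly as in the Q. The function name strip_array_commas is hypothetical.

```shell
# Hypothetical helper: turn a line consisting of "]," into "]" so that the
# input becomes a plain stream of JSON arrays.
strip_array_commas() {
  sed 's/^],[[:space:]]*$/]/'
}

# Demo on inline data laid out like the input in the question.
printf '%s\n' '[' '{"_id":"1","body":"Hello world"}' '],' '[' '{"_id":"55","body":"Hello there"}' ']' \
  | strip_array_commas \
  | jq -r '.[] | select(.body | test("Hello")) | ._id'
```

This would not be safe if a line consisting of "]," could also occur inside a string value spanning several lines, which valid JSON does not allow.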
Or, if you want a jq-only solution, use the following in conjunction with the -n, -r and --stream command-line options:
def iterate:
fromstream(1|truncate_stream(inputs?))
| select(.body | test("Hello"))
| ._id,
iterate;
iterate
(Notice the "?".)
The streaming parser (invoked with --stream) is usually not needed for the kind of task you describe, so in this response, I'm going to assume that the following (or a variant thereof) will suffice:
.[]
| select( .body | test("Hello") )._id
This of course assumes that the input is valid JSON.
Handling comma-delimited JSON
If your input is a comma-delimited stream of JSON as shown in the Q, you could use the following in conjunction with the -n command-line option:
# This is a variant of the built-in `recurse/1`:
def iterate(f): def r: f | (., r); r;
iterate( inputs? | .[] | select( .body | test("Hello") )._id )
Please note that this assumes that whatever occurs on a line after a delimiting comma can be ignored.
I'm having difficulty converting this JSON to CSV. The data is multi-line, similar to the excerpt at the bottom, which shows how it reads once unzipped.
An example of what has been tried:
jq -r '(([["user_id","server_received_time","app","device_carrier","$schema","city","uuid","event_time","platform","os_version","amplitude_id","processed_time","user_creation_time","version_name","ip_address","paying","dma","group_properties","user_properties","client_upload_time","$insert_id","event_type","library","amplitude_attribution_ids","device_type","device_manufacturer","start_version","location_lng","server_upload_time","event_id","location_lat","os_name","amplitude_event_type","device_brand","groups","event_properties","data","device_id","language","device_model","country","region","is_attribution_event","adid","session_id","device_family","sample_rate","idfa","client_event_time"]]) + [(.table.All[] | [.user_id,.server_received_time,.app,.device_carrier,.$schema,.city,.uuid,.event_time,.platform,.os_version,.amplitude_id,.processed_time,.user_creation_time,.version_name,.ip_address,.paying,.dma,.group_properties,.user_properties,.client_upload_time,.$insert_id,.event_type,.library,.amplitude_attribution_ids,.device_type,.device_manufacturer,.start_version,.location_lng,.server_upload_time,.event_id,.location_lat,.os_name,.amplitude_event_type,.device_brand,.groups,.event_properties,.data,.device_id,.language,.device_model,.country,.region,.is_attribution_event,.adid,.session_id,.device_family,.sample_rate,.idfa,.client_event_time])])[]|@csv' test.json > test.csv
As well as some other jq options. I need every column regardless of the value, and the values as-is. Does anyone have thoughts on why we are running into issues? One error we get is:
jq: error: try .["field"] instead of .field for unusually named fields at <top-level>, line 1:
Other jq lines have given the following error:
string (...) cannot be csv-formatted, only array
This is an excerpt from one of the JSON files:
{"groups":{},"country":"United States","device_id":"3d-88c-45-b6-ed81277eR","is_attribution_event":false,"server_received_time":"2019-12-17 17:29:11.113000","language":"English","event_time":"2019-12-17 17:27:49.047000","user_creation_time":"2019-11-08 13:15:32.919000","city":"Sure","uuid":"someID","device_model":"Windows","amplitude_event_type":null,"client_upload_time":"2019-12-17 17:29:21.958000","data":{},"library":"amplitude-js\/5.2.2","device_manufacturer":null,"dma":"Washington, DC (Townville, USA)","version_name":null,"region":"Virginia","group_properties":{},"location_lng":null,"device_family":"Windows","paying":null,"client_event_time":"2019-12-17 17:27:59.892000","$schema":12,"device_brand":null,"user_id":"email#gmail.com","event_properties":{"title":"Name","id":"1-253251","applicationName":"SomeName"},"os_version":"18","device_carrier":null,"server_upload_time":"2019-12-17 17:29:11.135000","session_id":1576603675620,"app":231165,"amplitude_attribution_ids":null,"event_type":"CHANGE_PERSPECTIVE","user_properties":{},"adid":null,"device_type":"Windows","$insert_id":"e308c923-d8eb-48c6-8ea5-600","event_id":24,"amplitude_id":515,"processed_time":"2019-12-17 17:29:12.760372","platform":"Web","idfa":null,"os_name":"Edge","location_lat":null,"ip_address":"123.456.78.90","sample_rate":null,"start_version":null}
Thank you!
There are several problems with your attempt.
First, the keys with "$" in their names cannot be specified using the abbreviated .foo syntax; you could use .["$foo"] instead.
Second, @csv expects an array of atomic values. Thus the keys with JSON objects as values must be handled specially.
Third, the "+" is incorrect. The relevant connector here is ",".
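The three points can be seen on a cut-down example. This is only a sketch with three of the keys; the function name csv_demo is made up, and using tostring to make the object-valued field atomic is just one possible choice.

```shell
# Hypothetical demo: the header row and the data row are joined with ",",
# the "$"-prefixed key is read with .["$schema"], and the object-valued
# field is turned into a string before @csv sees it.
csv_demo() {
  jq -r '
    ["user_id", "$schema", "data"],
    [.user_id, .["$schema"], (.data | tostring)]
    | @csv
  '
}

# Demo on a one-object input.
printf '%s\n' '{"user_id":"u1","$schema":12,"data":{}}' | csv_demo
```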
With your sample JSON, the following will work:
(["user_id","server_received_time","app","device_carrier","$schema","city","uuid","event_time","platform","os_version","amplitude_id","processed_time","user_creation_time","version_name","ip_address","paying","dma","group_properties","user_properties","client_upload_time","$insert_id","event_type","library","amplitude_attribution_ids","device_type","device_manufacturer","start_version","location_lng","server_upload_time","event_id","location_lat","os_name","amplitude_event_type","device_brand","groups","event_properties","data","device_id","language","device_model","country","region","is_attribution_event","adid","session_id","device_family","sample_rate","idfa","client_event_time"]),
([.user_id,.server_received_time,.app,.device_carrier,.["$schema"],.city,.uuid,.event_time,.platform,.os_version,.amplitude_id,.processed_time,.user_creation_time,.version_name,.ip_address,.paying,.dma,.group_properties,.user_properties,.client_upload_time,.["$insert_id"],.event_type,.library,.amplitude_attribution_ids,.device_type,.device_manufacturer,.start_version,.location_lng,.server_upload_time,.event_id,.location_lat,.os_name,.amplitude_event_type,.device_brand,.groups,.event_properties,.data,.device_id,.language,.device_model,.country,.region,.is_attribution_event,.adid,.session_id,.device_family,.sample_rate,.idfa,.client_event_time]
| map(if type=="object"
then to_entries
| map( "\(.key):\(.value)" )
| join(";")
else . end))
| @csv
A less error-prone solution
Specifying the long list of keys twice makes the above solution error-prone. It would be better to specify the keys just once, and then programmatically generate the rows.
Here's a utility function that can be used to this end:
def toa($headers):
. as $in | $headers | map($in[.]);
Or you could handle the object-valued keys inside toa:
def toa($headers):
def flat:
if type == "object" or type == "array"
then to_entries | map( "\(.key):\(.value)" ) | join(";")
else .
end;
. as $in | $headers | map($in[.] | flat);
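Here is how toa might be put to use, as a sketch with only three of the headers and a single input object. The function name rows_as_csv and the shortened header list are assumptions made for the example.

```shell
# Hypothetical wrapper: list the headers once, then build the row from them.
rows_as_csv() {
  jq -r '
    def toa($headers):
      def flat:
        if type == "object" or type == "array"
        then to_entries | map( "\(.key):\(.value)" ) | join(";")
        else .
        end;
      . as $in | $headers | map($in[.] | flat);

    ["user_id", "$schema", "event_properties"] as $headers
    | $headers, toa($headers)
    | @csv
  '
}

# Demo on a one-object input with an object-valued field.
printf '%s\n' '{"user_id":"u1","$schema":12,"event_properties":{"title":"Name"}}' | rows_as_csv
```

With several input objects this would repeat the header for each one, so it is best suited to a single object or combined with the -n option and inputs.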
JSONL
If the input is a stream of JSON objects of the type illustrated in the question, an efficient solution would use inputs with the -n command line option. This could be along the lines of:
print_header,
(inputs | print_row)
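A possible way to fill in print_header and print_row, as a sketch only: the helper headers, the shortened key list, and the function name jsonl_to_csv are all assumptions, and the values are assumed to be atomic.

```shell
# Hypothetical JSONL-to-CSV filter: the header is printed once, then one
# row per input object, reading the objects via inputs (hence -n).
jsonl_to_csv() {
  jq -rn '
    def headers: ["user_id", "event_type", "$schema"];
    def print_header: headers | @csv;
    def print_row: . as $in | headers | map($in[.]) | @csv;

    print_header,
    (inputs | print_row)
  '
}

# Demo on two inline JSON Lines records.
printf '%s\n' \
  '{"user_id":"a","event_type":"X","$schema":12}' \
  '{"user_id":"b","event_type":"Y","$schema":13}' \
  | jsonl_to_csv
```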