Parse JSON output for particular key fields

I have the following JSON content in file.json.
I need only a particular key field from all this overwhelming information.
Let us assume that I need web_url.
The problem here is that there are multiple key fields named "web_url".
How do I get only the web_url field I am after?
[{"id":196,"iid":1,"project_id":233,"title":"DEV to Master","description":"","state":"merged","created_at":"2019-12-04T14:14:35.424-06:00","updated_at":"2019-12-04T14:14:47.310-06:00","merged_by":{"id":122,"name":"Sengoku","username":"sengk","state":"active","avatar_url":"https://secure.gravatar.com/avatar/7cvffgfgfgfgf9eb1348d0ba7795a076?s=80\u0026d=identicon","web_url":"https://gitlaboo.tests.com/sengk"},"merged_at":"2019-12-04T14:14:47.468-06:00","closed_by":null,"closed_at":null,"target_branch":"master","source_branch":"DEV","upvotes":0,"downvotes":0,"author":{"id":122,"name":"Sengoku","username":"sengk","state":"active","avatar_url":"https://secure.gravatar.com/avatar/7fgdfdgdfgdvfg9eb1348d0ba7795a076?s=80\u0026d=identicon","web_url":"https://gitlaboo.tests.com/sengk"},"assignee":{"id":122,"name":"Sengoku","username":"sengk","state":"active","avatar_url":"https://secure.gravatar.com/avatar/7afsdfdvdfvfde24f89eb1348d0ba7795a076?s=80\u0026d=identicon","web_url":"https://gitlaboo.tests.com/sengk"},"source_project_id":233,"target_project_id":233,"labels":[],"work_in_progress":false,"milestone":null,"merge_when_pipeline_succeeds":false,"merge_status":"can_be_merged","sha":"6318e51ea8czfdfsdvdfvdfbc02988ba62c71e5774107e","merge_commit_sha":"6dc5vdfvdfgdfg5bf14e97dea949b8584c0c68d6","user_notes_count":0,"discussion_locked":null,"should_remove_source_branch":null,"force_remove_source_branch":false,"web_url":"https://gitlaboo.tests.com/demo/frog/merge_requests/1","time_stats":{"time_estimate":0,"total_time_spent":0,"human_time_estimate":null,"human_total_time_spent":null},"squash":false}]

You have 4 web_url keys in your JSON. They can be reached with the following paths:
.[] | .web_url
.[] | .merged_by.web_url
.[] | .author.web_url
.[] | .assignee.web_url
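For example, to pull out just the top-level one:
jq -r '.[] | .web_url' file.json
which yields https://gitlaboo.tests.com/demo/frog/merge_requests/1.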

If the question is essentially how to find the needle in the haystack, the answer is: use paths; more specifically, in your case:
jq -c 'paths(. == "https://gitlaboo.tests.com/demo/frog/merge_requests/1")
| select(.[-1] == "web_url")
' file.json
The output gives the path as a JSON array:
[0,"web_url"]
This can be used directly in jq (using getpath/1), or as the basis for a direct query:
.[0].web_url
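For instance, using getpath/1 with the path array emitted above:
jq 'getpath([0,"web_url"])' file.json
prints the matching URL directly (add -r for raw output).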

Bash: Ignore key value pairs from a JSON that failed to parse using jq

I'm writing a bash script to read a JSON file and export the key-value pairs as environment variables. Though I can extract the key-value pairs, I'm struggling to skip the entries that jq fails to parse.
JSON (KEY3 should fail to parse):
{
"KEY1":"ABC",
"KEY2":"XYZ",
"KEY3":"---ABC---\n
dskfjlksfj"
}
Here is what I tried
for pair in $(cat test.json | jq -r -R '. as $line | try fromjson catch $line | to_entries | map("\(.key)=\(.value)") | .[]' ); do
  echo $pair
  export $pair
done
And this is the error
jq: error (at <stdin>:1): string ("{") has no keys
jq: error (at <stdin>:2): string (" \"key1...) has no keys
My code is based on these posts:
How to convert a JSON object to key=value format in jq?
How to ignore broken JSON line in jq?
Ignore Unparseable JSON with jq
Here's a response to the revised question. Unfortunately, it will only be useful in certain limited cases, not including the example you give. (Basically, it depends on jq's parser being able to recover before the end of file.)
while read -r line ; do
  echo export "$line"
done < <(< test.json jq -rn '
  def do:
    try inputs catch null
    | objects
    | to_entries[]
    | "\(.key)=\"\(.value|@sh)\"" ;
  recurse(do) | select(.)
')
Note that further refinements may be warranted, especially if there is potentially something fishy about the key names being used as shell variable names.
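One such refinement, as a sketch: skip keys that are not valid shell identifiers by adding a guard after the to_entries[] step, e.g.
| select(.key | test("^[A-Za-z_][A-Za-z0-9_]*$"))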
[Note: this response was made to the original question, which has since been changed. The response essentially assumes the input consists of JSON Lines interspersed with other lines.]
Since the goal seems to be to ignore lines that don't have valid key-value pairs, you can simply use catch empty:
while read -r line ; do
  echo export "$line"
done < <(< test.json jq -r -R '
  try fromjson catch empty
  | objects
  | to_entries[]
  | "\(.key)=\"\(.value|@sh)\""
')
Note also the use of @sh and of the shell's read, and the fact that .value (in jq) and $line (in the shell) are both quoted. These are all important for robustness, though further refinements might still be necessary.
Perhaps there is an algorithm that will repair the broken JSON produced by the upstream system. If not, the following is a horrible but possibly useful "hack" that will at least capture KEY1 and KEY2 in the example in the Q:
jq -Rr '
  capture("\"(?<key>[^\"]*)\"[ \t]*:[ \t]*(?<value>[^}]+)")
  | (.value |= sub("[ \t]+$"; "") )                             # trailing whitespace
  | if .value|test("^\".*\"") then .value |= sub("\"[ \t]*[,}[ \t]*$"; "\"") else . end
  | select(.value | test("^\".*\"$") or (contains("\"")|not) )  # a string or not a string
  | "\(.key)=\(.value|@sh)"
'
The broken JSON in the example could be repaired in a number of ways, e.g.:
sed '/\\n$/{N; s/\\n\n/\\n/;}'
produces:
{
"KEY1":"ABC",
"KEY2":"XYZ",
"KEY3":"---ABC---\ndskfjlksfj"
}
At least that's JSON :-)
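Once repaired along those lines, the extraction becomes straightforward; a minimal sketch combining the two steps:
sed '/\\n$/{N; s/\\n\n/\\n/;}' test.json |
jq -r 'to_entries[] | "\(.key)=\(.value|@sh)"'
which prints assignments such as KEY1='ABC'.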

Fuzzy match string with jq

Let's say I have some JSON in a file; it's a subset of JSON data extracted from a larger JSON file - that's why I'll use --stream later in my attempted solution - and it looks like this:
[
{"_id":"1","#":{},"article":false,"body":"Hello world","comments":"3","createdAt":"20201007200628","creator":{"id":"4a7ba8fd719d43598b977dd548eed6aa","bio":"","blocked":false,"followed":false,"human":false,"integration":false,"joined":"20201007200628","muted":false,"name":"mkscott","rss":false,"private":false,"username":"mkscott","verified":false,"verifiedComments":false,"badges":[],"score":"0","interactions":258,"state":1},"depth":"0","depthRaw":0,"hashtags":[],"id":"2d4126e342ed46509b55facb49b992a5","impressions":"3","links":[],"sensitive":false,"state":4,"upvotes":"0"},
{"_id":"2","#":{},"article":false,"body":"Goodbye world","comments":"3","createdAt":"20201007200628","creator":{"id":"4a7ba8fd719d43598b977dd548eed6aa","bio":"","blocked":false,"followed":false,"human":false,"integration":false,"joined":"20201007200628","muted":false,"name":"mkscott","rss":false,"private":false,"username":"mkscott","verified":false,"verifiedComments":false,"badges":[],"score":"0","interactions":258,"state":1},"depth":"0","depthRaw":0,"hashtags":[],"id":"2d4126e342ed46509b55facb49b992a5","impressions":"3","links":[],"sensitive":false,"state":4,"upvotes":"0"}
],
[
{"_id":"55","#":{},"article":false,"body":"Hello world","comments":"3","createdAt":"20201007200628","creator":{"id":"3a7ba8fd719d43598b977dd548eed6aa","bio":"","blocked":false,"followed":false,"human":false,"integration":false,"joined":"20201007200628","muted":false,"name":"mkscott","rss":false,"private":false,"username":"jkscott","verified":false,"verifiedComments":false,"badges":[],"score":"0","interactions":258,"state":1},"depth":"0","depthRaw":0,"hashtags":[],"id":"2d4126e342ed46509b55facb49b992a5","impressions":"3","links":[],"sensitive":false,"state":4,"upvotes":"0"},
{"_id":"56","#":{},"article":false,"body":"Goodbye world","comments":"3","createdAt":"20201007200628","creator":{"id":"3a7ba8fd719d43598b977dd548eed6aa","bio":"","blocked":false,"followed":false,"human":false,"integration":false,"joined":"20201007200628","muted":false,"name":"mkscott","rss":false,"private":false,"username":"jkscott","verified":false,"verifiedComments":false,"badges":[],"score":"0","interactions":258,"state":1},"depth":"0","depthRaw":0,"hashtags":[],"id":"2d4126e342ed46509b55facb49b992a5","impressions":"3","links":[],"sensitive":false,"state":4,"upvotes":"0"}
]
It describes 4 posts written by 2 different authors, with a unique _id field for each post. Each author wrote 2 posts, one saying "Hello world" and the other "Goodbye world".
I want to match on the word "Hello" and return the _id only for posts whose body contains "Hello". The expected result is:
1
55
The closest I could come in my attempt was:
jq -nr --stream '
fromstream(1|truncate_stream(inputs))
| select(.body %like% "Hello")
| ._id
' <input_file
Assuming the input is modified slightly to make it a stream of the arrays as shown in the Q:
jq -nr --stream '
fromstream(1|truncate_stream(inputs))
| select(.body | test("Hello"))
| ._id
'
produces the desired output.
test uses regex matching. In your case, it seems you could use simple substring matching instead.
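For example, the select step above could instead be written as:
select(.body | contains("Hello"))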
Handling extraneous commas
Assuming the input is a stream of valid JSON documents with commas between them, exactly as shown, you could presumably use sed to remove the commas first.
Or, if you want an only-jq solution, use the following in conjunction with the -n, -r and --stream command-line options:
def iterate:
  fromstream(1|truncate_stream(inputs?))
  | select(.body | test("Hello"))
  | ._id,
    iterate;
iterate
(Notice the "?".)
The streaming parser (invoked with --stream) is usually not needed for the kind of task you describe, so in this response, I'm going to assume that the following (or a variant thereof) will suffice:
.[]
| select( .body | test("Hello") )._id
This of course assumes that the input is valid JSON.
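For example, if the posts formed a single valid array (saved, hypothetically, as valid.json), the whole invocation would be just:
jq -r '.[] | select( .body | test("Hello") )._id' valid.json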
Handling comma-delimited JSON
If your input is a comma-delimited stream of JSON as shown in the Q, you could use the following in conjunction with the -n command-line option:
# This is a variant of the built-in `recurse/1`:
def iterate(f): def r: f | (., r); r;
iterate( inputs? | .[] | select( .body | test("Hello") )._id )
Please note that this assumes that whatever occurs on a line after a delimiting comma can be ignored.
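Assuming the program above is saved as, say, program.jq, a possible invocation would be:
jq -nr -f program.jq input.json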

JSON will not convert with jq in Unix

Having difficulties converting this JSON. It is multi-line, similar to what is below. The example data at the bottom shows what it looks like as-is, once unzipped.
An example of what has been tried:
jq -r '(([["user_id","server_received_time","app","device_carrier","$schema","city","uuid","event_time","platform","os_version","amplitude_id","processed_time","user_creation_time","version_name","ip_address","paying","dma","group_properties","user_properties","client_upload_time","$insert_id","event_type","library","amplitude_attribution_ids","device_type","device_manufacturer","start_version","location_lng","server_upload_time","event_id","location_lat","os_name","amplitude_event_type","device_brand","groups","event_properties","data","device_id","language","device_model","country","region","is_attribution_event","adid","session_id","device_family","sample_rate","idfa","client_event_time"]]) + [(.table.All[] | [.user_id,.server_received_time,.app,.device_carrier,.$schema,.city,.uuid,.event_time,.platform,.os_version,.amplitude_id,.processed_time,.user_creation_time,.version_name,.ip_address,.paying,.dma,.group_properties,.user_properties,.client_upload_time,.$insert_id,.event_type,.library,.amplitude_attribution_ids,.device_type,.device_manufacturer,.start_version,.location_lng,.server_upload_time,.event_id,.location_lat,.os_name,.amplitude_event_type,.device_brand,.groups,.event_properties,.data,.device_id,.language,.device_model,.country,.region,.is_attribution_event,.adid,.session_id,.device_family,.sample_rate,.idfa,.client_event_time])])[]|#csv' test.json > test.csv
As well as some other jq options. I need every column regardless of the value, and the values as-is. Does anyone have thoughts on why we are running into issues? One error we get is:
jq: error: try .["field"] instead of .field for unusually named fields at <top-level>, line 1:
Other jq lines have given the following error:
string (...) cannot be csv-formatted, only array
This is an excerpt from one of the JSON files:
{"groups":{},"country":"United States","device_id":"3d-88c-45-b6-ed81277eR","is_attribution_event":false,"server_received_time":"2019-12-17 17:29:11.113000","language":"English","event_time":"2019-12-17 17:27:49.047000","user_creation_time":"2019-11-08 13:15:32.919000","city":"Sure","uuid":"someID","device_model":"Windows","amplitude_event_type":null,"client_upload_time":"2019-12-17 17:29:21.958000","data":{},"library":"amplitude-js\/5.2.2","device_manufacturer":null,"dma":"Washington, DC (Townville, USA)","version_name":null,"region":"Virginia","group_properties":{},"location_lng":null,"device_family":"Windows","paying":null,"client_event_time":"2019-12-17 17:27:59.892000","$schema":12,"device_brand":null,"user_id":"email#gmail.com","event_properties":{"title":"Name","id":"1-253251","applicationName":"SomeName"},"os_version":"18","device_carrier":null,"server_upload_time":"2019-12-17 17:29:11.135000","session_id":1576603675620,"app":231165,"amplitude_attribution_ids":null,"event_type":"CHANGE_PERSPECTIVE","user_properties":{},"adid":null,"device_type":"Windows","$insert_id":"e308c923-d8eb-48c6-8ea5-600","event_id":24,"amplitude_id":515,"processed_time":"2019-12-17 17:29:12.760372","platform":"Web","idfa":null,"os_name":"Edge","location_lat":null,"ip_address":"123.456.78.90","sample_rate":null,"start_version":null}
Thank you!
There are several problems with your attempt.
First, the keys with "$" in their names cannot be specified using the abbreviated .foo syntax; you could use .["$foo"] instead.
Second, @csv expects an array of atomic values. Thus the keys with JSON objects as values must be handled specially.
Third, the "+" is incorrect. The relevant connector here is ",".
With your sample JSON, the following will work:
(["user_id","server_received_time","app","device_carrier","$schema","city","uuid","event_time","platform","os_version","amplitude_id","processed_time","user_creation_time","version_name","ip_address","paying","dma","group_properties","user_properties","client_upload_time","$insert_id","event_type","library","amplitude_attribution_ids","device_type","device_manufacturer","start_version","location_lng","server_upload_time","event_id","location_lat","os_name","amplitude_event_type","device_brand","groups","event_properties","data","device_id","language","device_model","country","region","is_attribution_event","adid","session_id","device_family","sample_rate","idfa","client_event_time"]),
([.user_id,.server_received_time,.app,.device_carrier,.["$schema"],.city,.uuid,.event_time,.platform,.os_version,.amplitude_id,.processed_time,.user_creation_time,.version_name,.ip_address,.paying,.dma,.group_properties,.user_properties,.client_upload_time,.["$insert_id"],.event_type,.library,.amplitude_attribution_ids,.device_type,.device_manufacturer,.start_version,.location_lng,.server_upload_time,.event_id,.location_lat,.os_name,.amplitude_event_type,.device_brand,.groups,.event_properties,.data,.device_id,.language,.device_model,.country,.region,.is_attribution_event,.adid,.session_id,.device_family,.sample_rate,.idfa,.client_event_time]
| map(if type=="object"
      then to_entries
      | map( "\(.key):\(.value)" )
      | join(";")
      else . end))
| @csv
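Assuming the program above is saved as, say, tocsv.jq, an invocation mirroring your original one might be:
jq -r -f tocsv.jq test.json > test.csv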
A less error-prone solution
Specifying the long list of keys twice makes the above solution error-prone. It would be better to specify the keys just once, and then programmatically generate the rows.
Here's a utility function that can be used to this end:
def toa($headers):
  . as $in | $headers | map($in[.]);
Or you could handle the object-valued keys inside toa:
def toa($headers):
  def flat:
    if type == "object" or type == "array"
    then to_entries | map( "\(.key):\(.value)" ) | join(";")
    else .
    end;
  . as $in | $headers | map($in[.] | flat);
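Either way, toa maps an object to a row array, with null standing in for any missing key. For instance, with toa defined as above:
{"a":1,"c":3} | toa(["a","b","c"])    # emits [1,null,3]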
JSONL
If the input is a stream of JSON objects of the type illustrated in the question, an efficient solution would use inputs with the -n command line option. This could be along the lines of:
print_header,
(inputs | print_row)
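For instance, a sketch of such a program, using a deliberately abbreviated header list purely for illustration (run with jq -nr -f tocsv.jq on the stream of objects):
def cols: ["user_id","event_type","country"];           # abbreviated for illustration
def toa($headers): . as $in | $headers | map($in[.]);
def print_header: cols | @csv;
def print_row: toa(cols) | @csv;

print_header,
(inputs | print_row)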

How does a `select (. == ["a","b"][])` predicate work in JQ?

I'm looking for ways to select JSON entries based on an array that I provide as a literal:
$ echo '["a","b","c","d"]' | jq '.[] | select (. == ["a","b"][] )'
"a"
"b"
In the code above, all entries are selected that are in the ["a","b"] array. However, I don't understand how the . == ["a","b"][] predicate works in detail and would be grateful for an explanation. The tricky part is the right-hand side of ==.
Related:
jq - How to select objects based on a 'whitelist' of property values
The key to understanding here is that jq is stream-oriented: ["a","b"][] produces a stream of values, so . == ["a","b"][] produces a stream of booleans. select emits its input for each truthy value in that stream.
To gain an understanding of how jq works, it often helps to pull things apart. In the present case, you could begin by trying:
echo '["a","b","c","d"]' | jq '.[] | (. == ["a","b"][])'
debug is also helpful, e.g.
echo '["a","b","c","d"]' | jq '.[] | select(debug == ["a","b"][])'

Convert JSON to CSV

I'm working on storing around 200,000 JSON objects in a CSV file. But the problem is that any 2 JSON objects might be different (having different key names).
I thought about creating a HashSet and traversing all the objects once so as to get the column names for my CSV file. But this process apparently takes too much time.
Is there another way to add columns to a CSV file dynamically?
One approach would be to use jq ("Json Query"):
def tocsv:
  if length == 0 then empty
  else
    (.[0] | keys_unsorted) as $keys
    | (map(keys) | add | unique) as $allkeys
    | ($keys + ($allkeys - $keys)) as $cols
    | ($cols, (.[] as $row | $cols | map($row[.])))
    | @csv
  end ;
tocsv
For example, assuming the above is in a file named json2csv.jq and that the input is in in.json:
jq -r -f json2csv.jq in.json
The above program constructs the header line by starting with the key names of the first object (in the order in which they appear there), and then extends the header line as required.
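To illustrate: if in.json held [{"a":1},{"b":2}], the output would be:
"a","b"
1,
,2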
For more about jq, see https://stedolan.github.io/jq
Another approach would be to use in2csv, part of the csvkit toolkit -- see https://csvkit.readthedocs.org