unescape backslash in jq output - csv

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/5317139/property/IsomericSMILES/JSON
For the above JSON, the following jq prints 5317139 CCC/C=C\\1/C2=C(C3C(O3)CC2)C(=O)O1.
.PropertyTable.Properties
| .[]
| [.CID, .IsomericSMILES]
| #tsv
But there are two \ before the first 1. Is it wrong, should three be just one \? How to get the correct number of backslash?

The extra backslash in the output is the result of the request to produce TSV, since "\" has a special role to play in jq's TSV (e.g. "\t" signifies the tab character).
By contrast, consider:
jq -r '
.PropertyTable.Properties
| .[]
| [.CID, .IsomericSMILES]
| join("\t")' smiles.json
5317139 CCC/C=C\1/C2=C(C3C(O3)CC2)C(=O)O1

Related

Bash: Ignore key value pairs from a JSON that failed to parse using jq

I'm writing a bash script to read a JSON file and export the key-value pairs as environment variables. Though I could extract the key-value pairs, I'm struggling to skip those entries that failed to parse by jq.
JSON (key3 should fail to parse)
{
"KEY1":"ABC",
"KEY2":"XYZ",
"KEY3":"---ABC---\n
dskfjlksfj"
}
Here is what I tried
for pair in $(cat test.json | jq -r -R '. as $line | try fromjson catch $line | to_entries | map("\(.key)=\(.value)") | .[]' ); do
echo $pair
export $pair
done
And this is the error
jq: error (at <stdin>:1): string ("{") has no keys
jq: error (at <stdin>:2): string (" \"key1...) has no keys
My code is based on these posts:
How to convert a JSON object to key=value format in jq?
How to ignore broken JSON line in jq?
Ignore Unparseable JSON with jq
Here's a response to the revised question. Unfortunately, it will only be useful in certain limited cases, not including the example you give. (Basically, it depends on jq's parser being able to recover before the end of file.)
while read -r line ; do
echo export "$line"
done < <(< test.json jq -rn '
def do:
try inputs catch null
| objects
| to_entries[]
| "\(.key)=\"\(.value|#sh)\"" ;
recurse(do) | select(.)
')
Note that further refinements may be warranted, especially if there is potentially something fishy about the key names being used as shell variable names.
[Note: this response was made to the original question, which has since been changed. The response essentially assumes the input consists of JSONLines interspersed with other lines.)
Since the goal seems to be to ignore lines that don't have valid key-value pairs, you can simply use catch empty:
while read -r line ; do
echo export "$line"
done < <(test.json jq -r -R '
try fromjson catch empty
| objects
| to_entries[]
| "\(.key)=\"\(.value|#sh)\""
')
Note also the use of #sh and of the shell's read, and the fact that .value (in jq) and $line (in the shell) are both quoted. These are all important for robustness, though further refinements might still be necessary for additional robustness.
Perhaps there is an algorithm that will repair the broken JSON produced by the upstream system. If not, the following is a horrible but possibly useful "hack" that will at least capture KEY1 and KEY2 in the example in the Q:
jq -Rr '
capture("\"(?<key>[^\"]*)\"[ \t]*:[ \t]*(?<value>[^}]+)")
| (.value |= sub("[ \t]+$"; "") ) # trailing whitespace
| if .value|test("^\".*\"") then .value |= sub("\"[ \t]*[,}[ \t]*$"; "\"") else . end
| select(.value | test("^\".*\"$") or (contains("\"")|not) ) # a string or not a string
| "\(.key)=\(.value|#sh)"
'
The broken JSON in the example could be repaired in a number of ways, e.g.:
sed '/\\n$/{N; s/\\n\n/\\n/;}'
produces:
{
"KEY1":"ABC",
"KEY2":"XYZ",
"KEY3":"---ABC---\ndskfjlksfj"
}
At least that's JSON :-)

Pretty-print valid JSONs mixed with string keys

I have a Redis hash with keys and values like string key -- serialized JSON value.
Corresponding rediscli query (hgetall some_redis_hash) being dumped in a file:
redis_key1
{"value1__key1": "value1__value1", "value1__key2": "value1__value2" ...}
redis_key2
{"value2__key1": "value2__value1", "value2__key2": "value2__value2" ...}
...
and so on.
So the question is, how do I pretty-print these values enclosed in brackets? (note that key strings between are making the document invalid, if you'll try to parse the entire one)
The first thought is to get particular pairs from Redis, strip parasite keys, and use jq on the remaining valid JSON, as shown below:
rediscli hget some_redis_hash redis_key1 > file && tail -n +2 file
- file now contains valid JSON as value, the first string representing Redis key is stripped by tail -
cat file | jq
- produces pretty-printed value -
So the question is, how to pretty-print without such preprocessing?
Or (would be better in this particular case) how to merge keys and values in one big JSON, where Redis keys, accessible on the upper level, will be followed by dicts of their values?
Like that:
rediscli hgetall some_redis_hash > file
cat file | cool_parser
- prints { "redis_key1": {"value1__key1": "value1__value1", ...}, "redis_key2": ... }
A simple way for just pretty-printing would be the following:
cat file | jq --raw-input --raw-output '. as $raw | try fromjson catch $raw'
It tries to parse each line as json with fromjson, and just outputs the original line (with $raw) if it can't.
(The --raw-input is there so that we can invoke fromjson enclosed in a try instead of running it on every line directly, and --raw-output is there so that any non-json lines are not enclosed in quotes in the output.)
A solution for the second part of your questions using only jq:
cat file \
| jq --raw-input --null-input '[inputs] | _nwise(2) | {(.[0]): .[1] | fromjson}' \
| jq --null-input '[inputs] | add'
--null-input combined with [inputs] produces the whole input as an array
which _nwise(2) then chunks into groups of two (more info on _nwise)
which {(.[0]): .[1] | fromjson} then transforms into a list of jsons
which | jq --null-input '[inputs] | add' then combines into a single json
Or in a single jq invocation:
cat file | jq --raw-input --null-input \
'[ [inputs] | _nwise(2) | {(.[0]): .[1] | fromjson} ] | add'
...but by that point you might be better off writing an easier to understand python script.

YAML format from JSON in Linux

I am looking easy ways to clean my json in linux.
Sample Input:
[{"name":"Student1","value":"John"},{"name":"Student2","value":"Jack"},{"name":"Student3","value":"Nick"},{"name":"Student4","value":"Karen"},{"name":"Student5","value":"Jonas"}]
Expected Output:
I want to save below expected output in txt/yaml file in linux.
Student1:John
Student2:Jack
Student3:Nick
Student4:Karen
Student5:Jonas
My Attempt:
jq '. | map([.name, .value] | join(": ")) | join("\n")' file.json >> all_param.yml;
You are only missing the additional flag --raw-output for jq.
The command should look like
jq --raw-output '.[] | [.name, .value] | join(":")' file.json
The --raw-output argument makes jq print the output directly instead of printing it as a JSON string, i.e., escaping it in a string.
In your expected output you have no space after the colon, but you have it in your jq command.

Fuzzy match string with jq

Let's say I have some JSON in a file, it's a subset of JSON data extracted from a larger JSON file - that's why I'll use stream later in my attempted solution - and it looks like this:
[
{"_id":"1","#":{},"article":false,"body":"Hello world","comments":"3","createdAt":"20201007200628","creator":{"id":"4a7ba8fd719d43598b977dd548eed6aa","bio":"","blocked":false,"followed":false,"human":false,"integration":false,"joined":"20201007200628","muted":false,"name":"mkscott","rss":false,"private":false,"username":"mkscott","verified":false,"verifiedComments":false,"badges":[],"score":"0","interactions":258,"state":1},"depth":"0","depthRaw":0,"hashtags":[],"id":"2d4126e342ed46509b55facb49b992a5","impressions":"3","links":[],"sensitive":false,"state":4,"upvotes":"0"},
{"_id":"2","#":{},"article":false,"body":"Goodbye world","comments":"3","createdAt":"20201007200628","creator":{"id":"4a7ba8fd719d43598b977dd548eed6aa","bio":"","blocked":false,"followed":false,"human":false,"integration":false,"joined":"20201007200628","muted":false,"name":"mkscott","rss":false,"private":false,"username":"mkscott","verified":false,"verifiedComments":false,"badges":[],"score":"0","interactions":258,"state":1},"depth":"0","depthRaw":0,"hashtags":[],"id":"2d4126e342ed46509b55facb49b992a5","impressions":"3","links":[],"sensitive":false,"state":4,"upvotes":"0"}
],
[
{"_id":"55","#":{},"article":false,"body":"Hello world","comments":"3","createdAt":"20201007200628","creator":{"id":"3a7ba8fd719d43598b977dd548eed6aa","bio":"","blocked":false,"followed":false,"human":false,"integration":false,"joined":"20201007200628","muted":false,"name":"mkscott","rss":false,"private":false,"username":"jkscott","verified":false,"verifiedComments":false,"badges":[],"score":"0","interactions":258,"state":1},"depth":"0","depthRaw":0,"hashtags":[],"id":"2d4126e342ed46509b55facb49b992a5","impressions":"3","links":[],"sensitive":false,"state":4,"upvotes":"0"},
{"_id":"56","#":{},"article":false,"body":"Goodbye world","comments":"3","createdAt":"20201007200628","creator":{"id":"3a7ba8fd719d43598b977dd548eed6aa","bio":"","blocked":false,"followed":false,"human":false,"integration":false,"joined":"20201007200628","muted":false,"name":"mkscott","rss":false,"private":false,"username":"jkscott","verified":false,"verifiedComments":false,"badges":[],"score":"0","interactions":258,"state":1},"depth":"0","depthRaw":0,"hashtags":[],"id":"2d4126e342ed46509b55facb49b992a5","impressions":"3","links":[],"sensitive":false,"state":4,"upvotes":"0"}
]
It describes 4 posts written by 2 different authors, with unique _id fields for each post. Both authors wrote 2 posts, where 1 says "Hello World" and the other says "Goodbye World".
I want to match on the word "Hello" and return the _id only for fields containing "Hello". The expected result is:
1
55
The closest I could come in my attempt was:
jq -nr --stream '
fromstream(1|truncate_stream(inputs))
| select(.body %like% "Hello")
| ._id
' <input_file
Assuming the input is modified slightly to make it a stream of the arrays as shown in the Q:
jq -nr --stream '
fromstream(1|truncate_stream(inputs))
| select(.body | test("Hello"))
| ._id
'
produces the desired output.
test uses regex matching. In your case, it seems you could use simple substring matching instead.
Handling extraneous commas
Assuming the input has commas between a stream of valid JSON exactly as shown, you could presumably use sed to remove them first.
Or, if you want an only-jq solution, use the following in conjunction with the -n, -r and --stream command-line options:
def iterate:
fromstream(1|truncate_stream(inputs?))
| select(.body | test("Hello"))
| ._id,
iterate;
iterate
(Notice the "?".)
The streaming parser (invoked with --stream) is usually not needed for the kind of task you describe, so in this response, I'm going to assume that the following (or a variant thereof) will suffice:
.[]
| select( .body | test("Hello") )._id
This of course assumes that the input is valid JSON.
Handling comma-delimited JSON
If your input is a comma-delimited stream of JSON as shown in the Q, you could use the following in conjunction with the -n command-line option:
# This is a variant of the built-in `recurse/1`:
def iterate(f): def r: f | (., r); r;
iterate( inputs? | .[] | select( .body | test("Hello") )._id )
Please note that this assumes that whatever occurs on a line after a delimiting comma can be ignored.

jq string manipulation on domain names and dns records

I am attempting to learn some jq but am running into trouble.
I am working with a dataset of dns records like {"timestamp":"1592145252","name":"0.127.9.109.rev.sfr.net","type":"a","value":"109.9.127.0"}
I cannot figure out how to
strip the subdomain details out of the name field. in this example i just want sfr.net
print the name backwards, eg: 0.127.9.109.rev.sfr.net would become ten.rfs.ver.901.9.721.0
my end goal is to print lines like this:
0.127.9.109.rev.sfr.net,ten.rfs.ver.901.9.721.0,a,sfr.net
Thanks SO!
To extract the "domain" part, you could use simple string manipulation methods to select it. Assuming anything after the .rev. part is the domain, you could do this:
split(".rev.")[1]
To reverse a string, jq doesn't have the operations to do it directly for strings. However it does have a function to reverse arrays. So you could convert to an array, reverse, then convert back.
split("") | reverse | join("")
To put it all together for your input:
.name | [
.,
(split("") | reverse | join("")),
(split(".rev.")[1])
] | join(",")
Here's one approach using reverse and capture:
jq -r '
.type as $type
| .name
| "\(.),\(explode|reverse|implode),\($type),"
+ capture("(?<subdomain>[^.]+[.][^.]+)$").subdomain'
Like this :
$ jq -r '.name' file.json | grep -oE '\w+\.\w+$'
sfr.net
$ jq -r '.name' file.json | rev
ten.rfs.ver.901.9.721.0