Using awk to extract a token from a larger JSON string - json

I have a string assigned to a variable:
#/bin/bash
fullToken='{"type":"APP","token":"l0ng_Str1ng.of.d1fF3erent_charAct3rs"}'
I need to extract only l0ng_Str1ng.of.d1fF3erent_charAct3rs without quotes and assign that to another variable.
I understand I can use awk, sed, or cut but I am having trouble getting around the special characters in the original string.
Thanks in advance!
EDIT: I was not awake I should specify this is JSON. Thanks for the replies so far.
EDIT2: I am using BSD (macOS)

It looks like you have a JSON string there. Keep in mind that JSON is unordered, so most sed, awk, cut solutions will fail if you string comes next time in a different order.
It is most robust to use a JSON parser.
You could use ruby with its JSON parser library:
$ echo "$fullToken" | ruby -r json -e 'p JSON.parse($<.read)["token"];'
"l0ng_Str1ng.of.d1fF3erent_charAct3rs"
Or, if you don't want the quoted string (which is useful for Bash):
$ echo "$fullToken" | ruby -r json -e 'puts JSON.parse($<.read)["token"];'
l0ng_Str1ng.of.d1fF3erent_charAct3rs
Or with jq:
$ echo "$fullToken" | jq '.token'
"l0ng_Str1ng.of.d1fF3erent_charAct3rs"
All these solutions will work even if the JSON string is in a different order:
$ echo '{"type":"APP","token":"l0ng_Str1ng.of.d1fF3erent_charAct3rs"}' | jq '.token'
"l0ng_Str1ng.of.d1fF3erent_charAct3rs"
$ echo '{"token":"l0ng_Str1ng.of.d1fF3erent_charAct3rs", "type":"APP"}' | jq '.token'
"l0ng_Str1ng.of.d1fF3erent_charAct3rs"
But KNOWING that you SHOULD use a JSON parser, you can also use a PCRE with a look behind in Gnu Grep:
$ echo "$fullToken" | grep -oP '(?<="token":)"([^"]*)'
Or in Perl:
$ echo "$fullToken" | perl -lane 'print $1 if /(?<="token":)"([^"]*)/'
Both of those also work if the string is in a different order.
Or, with POSIX awk:
$ echo "$fullToken" | awk -F"[,:}]" '{for(i=1;i<=NF;i++){if($i~/"token"/){print $(i+1)}}}'
Or, with POSIX sed, you can do:
$ echo "$fullToken" | sed -E 's/.*"token":"([^"]*).*/\1/'
Those solutions are presented strongest (use a JSON parser) to more fragile (sed). But the sed solution I have there is better than the other because it will support the key, values in the JSON string being in different order.
Ps: If you want to remove the quotes from a line, that is a great job for sed:
$ echo '"quoted string"'
"quoted string"
$ echo '"quoted string"' | sed -E 's/^"(.*)"$/UN\1/'
UNquoted string

In awk:
$ awk -v f="$fullToken" '
BEGIN{
while(match(f,/[^:{},]+:[^:{},]+/)) { # search key:value pairs
p=substr(f,RSTART,RLENGTH) # set pair to p
f=substr(f,RSTART+RLENGTH) # remove p from f
split(p,a,":") # split to get key and value
for(i in a) # remove leadin and trailing "
gsub(/^"|"$/,"",a[i])
if(a[1]=="token") { # if key is token
print a[2] # output value
exit # no need to process further
}
}
}'
l0ng_Str1ng.of.d1fF3erent_charAct3rs
l0ng_String can't have characters :{}.

GNU sed:
fullToken='{"type":"APP","token":"l0ng_Str1ng.of.d1fF3erent_charAct3rs"}'
echo "$fullToken"|sed -r 's/.*"(.*)".*/\1/'

grep method would be,
$ grep -oP '[^"]+(?="[^"]+$)' <<< "$fullToken"
l0ng_Str1ng.of.d1fF3erent_charAct3rs
Brief explanation,
[^"]+ : grep would extract the non-" pattern
(?="[^"]+$): extract until the pattern ahead of last "
You may also use sed method to do that,
$sed -E 's/.*"([^"]+)"[^"]+$/\1/' <<< "$fullToken"
l0ng_Str1ng.of.d1fF3erent_charAct3rs

If the source of your string is JSON, then you should use JSON-specific tools. If not, then consider:
Using awk
$ fullToken='{"type":"APP","token":"l0ng_Str1ng.of.d1fF3erent_charAct3rs"}'
$ echo "$fullToken" | awk -F'"' '{print $8}'
l0ng_Str1ng.of.d1fF3erent_charAct3rs
Using cut
$ echo "$fullToken" | cut -d'"' -f8
l0ng_Str1ng.of.d1fF3erent_charAct3rs
Using sed
$ echo "$fullToken" | sed -E 's/.*"([^"]*)"[^"]*$/\1/'
l0ng_Str1ng.of.d1fF3erent_charAct3rs
Using bash and one of the above
The above all work with POSIX shells. If the shell is bash, then we can use a here-string and eliminate the pipeline. Taking cut as the example:
$ cut -d'"' -f8 <<<"$fullToken"
l0ng_Str1ng.of.d1fF3erent_charAct3rs

Related

Processing command substitutions in strings embedded in a .json file with jq

I want to read environment variables from .json-file:
{
"PASSPHRASE": "$(cat /home/x/secret)",
}
With the following script:
IFS=$'\n'
for s in $(echo $values | jq -r "to_entries|map(\"\(.key)=\(.value|tostring)\")|.[]" $1); do
export $s
echo $s
done
unset IFS
But the I got $(cat /home/x/secret) in PASSPHRASE, and cat is not executed. When I execute the line export PASSPHRASE=$(cat /home/x/secret), I got the correct result (content of file in environment variable). What do I have to change on my script, to get it working?
When you do export PASSPHRASE=$(cat /home/x/secret) in the shell, it interpretes the $() expression, executes the command within and puts the output of it inside of the variable PASSPHRASE.
When you place $() in the json file, however, it is read by jq and treated as a normal string, which is the equivalent of doing export PASSPHRASE=\$(cat /home/x/secret) (notice the slash, which causes the dollar sign to be escaped and treated as a literal character, instead of creating a new shell). If you do that instead and try to echo the contents of the variable it will have similar results as running your script.
If you want to force bash to interpret the string as a command you could use sh -c <command> instead, for example like this:
test.json:
{
"PASSPHRASE": "cat /home/x/secret"
}
test.sh:
IFS=$'\n'
for s in $(echo $values | jq -r "to_entries|map(\"\(.value|tostring)\")|.[]" $1); do
echo $(sh -c $s)
done
unset IFS
This prints out the contents of /home/x/secret. It does not solve your problem directly but should give you an idea of how you could change your original code to achieve what you need.
Thanks to Maciej I changed the script and got it working:
IFS=$'\n'
for line in $(jq -r "to_entries|map(\"\(.key)=\(.value|tostring)\")|.[]" "$1"); do
lineExecuted=$(sh -c "echo $line")
export "$lineExecuted"
echo "$lineExecuted"
done
unset IFS

Parsing json output using jq -jr

I am running a puppet bolt command query certain information from a set of servers in json format. I am piping it to jq.. Below is what I get
$ bolt command run "cat /blah/blah" -n #hname.txt -u uid --no-host-key-check --format json |jq -jr '.items[]|[.node],[.result.stdout]'
[
"node-name"
][
"stdout data\n"
]
What do I need to do to make it appear like below
["nodename":"stdout data"]
If you really want output that is not valid JSON, you will have to construct the output string, which can easily be done using string interpolation, e.g.:
jq -r '.items[] | "[\"\(.node)\",\"\(.result.stdout)\"]"'
#peak thank you.. that helped. Below is how it looks like
$ bolt command run "cat /blah/blah" -n #hname.txt -u UID --no-host-key-check --format json |jq -r '.items[] | "[\"\(.node)\",\"\(.result.stdout)\"]"'
["node name","stdout data
"]
I used a work around to get the data I needed by using the #csv flag to the command itself. Sharing with you below what worked.
$ bolt command run "cat /blah/blah" -n #hname.txt -u uid --no-host-key-check --format json |jq -jr '.items[]|[.node],[.result.stdout]|#csv'
""node-name""stdout.data
"

JQ can't parse \u2022 character

I'm trying to perform a bulk upload to Elasticsearch (around 1mln documents). In order to do that, I'm using jq to reformat the JSON file extracted from MySQL database and curl to post the data to Elasticsearch:
cat dataset.json | jq -r -c '.[] | { "index" : { } }, .' | curl -u login:password -H "Content-Type: application/json" -XPOST "https://.../skills/default/_bulk?pretty" --data-binary #-
I get an error:
parse error: Invalid string: control characters from U+0000 through U+001F must be escaped at line 276249, column 317
I found that the character that jq can't parse is \u2022. I tried adding "-r" jq command but the error stil occurs. How can I handle this for all occurrences of \u2022?
Here's verification that \u2022 is properly handled by various versions of jq in a Mac environment:
$ echo '"\u2022"' | jq-1.4 .
"•"
$ echo '"•"' | jq-1.6 .
"•"
$ echo '"•"' | jq-1.5 .
"•"
$ echo '"•"' | jq-1.4 .
"•"
$
Perhaps the problem is related to a bug that was fixed since the release of jq 1.5 (see e.g. https://github.com/stedolan/jq/issues/1311).
If you are having difficulties with jq version 1.6 (the current version), please provide a minimal complete verifiable example
with further details about the computing environment.

Bash - How to extract JSON from a web page?

I'm trying to extract JSON from this URL: here
The output that I want is like this https://pastebin.com/BVzUrk6s .Sorry I can't paste it here because of the StackOverFlow character limit.
Here is what I have tried:
curl 'https://www.lazada.co.id/-i160040703-s181911730.html?spm=a2o4j.order_details.details_title.1.52ec6664luQAQs&urlFlag=true&mp=1' | grep -Poz '(?<=app.run\()(.*\n)*.*(?=\);)'
But that command still doesn't extract the JSON data. How do I solve this ? I want to use a pure bash script without installing any programs to do this if possible.
It's a Bad Idea (TM) to attempt JSON parsing this way.
It seems like a Good Idea (TM) to find out what is possible regardless.
#!/bin/bash
function parseUrl() {
local url=$1
echo '"childCategories": ['
curl --silent ${url} \
| awk '/<script type="text" class=J_data/ { show=1 } show; /<\/script>/ { show=0 }' \
| egrep -v "script" \
| sed -e 's/]//g' -e 's/\[//g' -e 's/{"childCategoryName":"","childCategoryUrl":""},//g' -e 's/}$/},/g' \
| sed -e 's/,{/,\'$'\n{/g' -e 's/^[ ]*//g' -e 's/{/ {/g' \
| sed -e 's/childCategoryName/name/g' -e 's/childCategoryUrl/url/g'
echo ' ]'
}
parseUrl 'https://www.lazada.co.id/-i160040703-s181911730.html?spm=a2o4j.order_details.details_title.1.52ec6664luQAQs&urlFlag=true&mp=1' \
| tee /tmp/extracted.json
So there you go: curl, awk, egrep, sed. Use at your own risk.
Code like this isn't extensible, meaning you can't extract nested JSON easily.
It is quite brittle, meaning if someone changes the layout or even CSS, it's bye-bye data extraction.

Printing column separated by comma using Awk command line

I have a problem here. I have to print a column in a text file using awk. However, the columns are not separated by spaces at all, only using a single comma. Looks something like this:
column1,column2,column3,column4,column5,column6
How would I print out 3rd column using awk?
Try:
awk -F',' '{print $3}' myfile.txt
Here in -F you are saying to awk that use , as the field separator.
If your only requirement is to print the third field of every line, with each field delimited by a comma, you can use cut:
cut -d, -f3 file
-d, sets the delimiter to a comma
-f3 specifies that only the third field is to be printed
Try this awk
awk -F, '{$0=$3}1' file
column3
, Divide fields by ,
$0=$3 Set the line to only field 3
1 Print all out. (explained here)
This could also be used:
awk -F, '{print $3}' file
A simple, although awk-less solution in bash:
while IFS=, read -r a a a b; do echo "$a"; done <inputfile
It works faster for small files (<100 lines) then awk as it uses less resources (avoids calling the expensive fork and execve system calls).
EDIT from Ed Morton (sorry for hi-jacking the answer, I don't know if there's a better way to address this):
To put to rest the myth that shell will run faster than awk for small files:
$ wc -l file
99 file
$ time while IFS=, read -r a a a b; do echo "$a"; done <file >/dev/null
real 0m0.016s
user 0m0.000s
sys 0m0.015s
$ time awk -F, '{print $3}' file >/dev/null
real 0m0.016s
user 0m0.000s
sys 0m0.015s
I expect if you get a REALY small enough file then you will see the shell script run in a fraction of a blink of an eye faster than the awk script but who cares?
And if you don't believe that it's harder to write robust shell scripts than awk scripts, look at this bug in the shell script you posted:
$ cat file
a,b,-e,d
$ cut -d, -f3 file
-e
$ awk -F, '{print $3}' file
-e
$ while IFS=, read -r a a a b; do echo "$a"; done <file
$