Bash - How to extract JSON from a web page? - json

I'm trying to extract JSON from this URL: here
The output that I want looks like this: https://pastebin.com/BVzUrk6s. Sorry, I can't paste it here because of the Stack Overflow character limit.
Here is what I have tried:
curl 'https://www.lazada.co.id/-i160040703-s181911730.html?spm=a2o4j.order_details.details_title.1.52ec6664luQAQs&urlFlag=true&mp=1' | grep -Poz '(?<=app.run\()(.*\n)*.*(?=\);)'
But that command still doesn't extract the JSON data. How do I solve this? If possible, I want to do this with a pure bash script, without installing any extra programs.

It's a Bad Idea (TM) to attempt JSON parsing this way.
It seems like a Good Idea (TM) to find out what is possible regardless.
#!/bin/bash
function parseUrl() {
local url=$1
echo '"childCategories": ['
curl --silent "${url}" \
| awk '/<script type="text" class=J_data/ { show=1 } show; /<\/script>/ { show=0 }' \
| egrep -v "script" \
| sed -e 's/]//g' -e 's/\[//g' -e 's/{"childCategoryName":"","childCategoryUrl":""},//g' -e 's/}$/},/g' \
| sed -e 's/,{/,\'$'\n{/g' -e 's/^[ ]*//g' -e 's/{/ {/g' \
| sed -e 's/childCategoryName/name/g' -e 's/childCategoryUrl/url/g'
echo ' ]'
}
parseUrl 'https://www.lazada.co.id/-i160040703-s181911730.html?spm=a2o4j.order_details.details_title.1.52ec6664luQAQs&urlFlag=true&mp=1' \
| tee /tmp/extracted.json
So there you go: curl, awk, egrep, sed. Use at your own risk.
Code like this isn't extensible, meaning you can't extract nested JSON easily.
It is quite brittle, meaning if someone changes the layout or even CSS, it's bye-bye data extraction.
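For the narrower goal in the question, pulling out the argument of app.run(...), a pure-bash sketch is possible with parameter expansion alone. It assumes the page contains the text app.run( once and that the two characters ); never occur inside the JSON itself. The sample page and the function name extract_app_run are made up for illustration.

```shell
#!/bin/bash
# Sketch: cut the argument of app.run( ... ); out of a page using only
# bash parameter expansion. Function name and sample page are hypothetical.
extract_app_run() {
  local html rest json
  html=$(cat)                               # read all of stdin
  [[ $html == *'app.run('* ]] || return 1   # no call found
  rest=${html#*'app.run('}                  # drop everything up to the first "app.run("
  json=${rest%%');'*}                       # drop everything from the first ");" onward
  printf '%s\n' "$json"
}

sample_page='<script>
app.run({"name": "demo",
  "items": [1, 2]});
</script>'

printf '%s\n' "$sample_page" | extract_app_run
```

In real use the page would come from curl instead of the sample string, for example curl --silent "$url" | extract_app_run. The same brittleness caveat applies: if the page layout changes, the extraction breaks.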

Related

JQ can't parse \u2022 character

I'm trying to perform a bulk upload to Elasticsearch (around 1 million documents). To do that, I'm using jq to reformat the JSON file extracted from a MySQL database and curl to post the data to Elasticsearch:
cat dataset.json | jq -r -c '.[] | { "index" : { } }, .' | curl -u login:password -H "Content-Type: application/json" -XPOST "https://.../skills/default/_bulk?pretty" --data-binary @-
I get an error:
parse error: Invalid string: control characters from U+0000 through U+001F must be escaped at line 276249, column 317
I found that the character that jq can't parse is \u2022. I tried adding the "-r" option to the jq command, but the error still occurs. How can I handle this for all occurrences of \u2022?
Here's verification that \u2022 is properly handled by various versions of jq in a Mac environment:
$ echo '"\u2022"' | jq-1.4 .
"•"
$ echo '"•"' | jq-1.6 .
"•"
$ echo '"•"' | jq-1.5 .
"•"
$ echo '"•"' | jq-1.4 .
"•"
$
Perhaps the problem is related to a bug that was fixed since the release of jq 1.5 (see e.g. https://github.com/stedolan/jq/issues/1311).
If you are having difficulties with jq version 1.6 (the current version), please provide a minimal complete verifiable example
with further details about the computing environment.
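The error message quoted in the question talks about control characters U+0000 through U+001F rather than about \u2022, so one way to narrow things down might be to look for raw control bytes in the input. The sketch below lists the lines that contain one. The function name find_control_chars and the demo file are hypothetical, and tabs are deliberately included, so tab-indented JSON would be reported as well.

```shell
#!/bin/bash
# Sketch: list lines containing raw control bytes (0x01-0x1F, except newline).
# Tabs (0x09) are included, so legitimate tab indentation is reported too.
find_control_chars() {
  local pattern
  pattern=$(printf '[\001-\011\013-\037]')   # bracket expression built from raw bytes
  LC_ALL=C grep -n "$pattern" "$1"           # C locale: compare byte by byte
}

# Hypothetical demo file: line 2 holds a raw 0x01 byte inside a string.
demo=$(mktemp)
printf '{"ok": "fine"}\n{"bad": "a\001b"}\n' > "$demo"
find_control_chars "$demo" | cut -d: -f1     # print only the line numbers
rm -f "$demo"
```

Against the real file this would be find_control_chars dataset.json; the line number in the jq error (276249) could then be compared with the reported lines.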

Using awk to extract a token from a larger JSON string

I have a string assigned to a variable:
#!/bin/bash
fullToken='{"type":"APP","token":"l0ng_Str1ng.of.d1fF3erent_charAct3rs"}'
I need to extract only l0ng_Str1ng.of.d1fF3erent_charAct3rs without quotes and assign that to another variable.
I understand I can use awk, sed, or cut but I am having trouble getting around the special characters in the original string.
Thanks in advance!
EDIT: I was not awake; I should have specified that this is JSON. Thanks for the replies so far.
EDIT2: I am using BSD (macOS)
It looks like you have a JSON string there. Keep in mind that the keys of a JSON object are unordered, so most sed, awk, and cut solutions will fail if your string arrives with its keys in a different order next time.
It is most robust to use a JSON parser.
You could use ruby with its JSON parser library:
$ echo "$fullToken" | ruby -r json -e 'p JSON.parse($<.read)["token"];'
"l0ng_Str1ng.of.d1fF3erent_charAct3rs"
Or, if you don't want the quoted string (which is useful for Bash):
$ echo "$fullToken" | ruby -r json -e 'puts JSON.parse($<.read)["token"];'
l0ng_Str1ng.of.d1fF3erent_charAct3rs
Or with jq:
$ echo "$fullToken" | jq '.token'
"l0ng_Str1ng.of.d1fF3erent_charAct3rs"
All these solutions will work even if the JSON string is in a different order:
$ echo '{"type":"APP","token":"l0ng_Str1ng.of.d1fF3erent_charAct3rs"}' | jq '.token'
"l0ng_Str1ng.of.d1fF3erent_charAct3rs"
$ echo '{"token":"l0ng_Str1ng.of.d1fF3erent_charAct3rs", "type":"APP"}' | jq '.token'
"l0ng_Str1ng.of.d1fF3erent_charAct3rs"
But KNOWING that you SHOULD use a JSON parser, you can also use a PCRE with a lookbehind in GNU grep:
$ echo "$fullToken" | grep -oP '(?<="token":")[^"]*'
Or in Perl:
$ echo "$fullToken" | perl -lane 'print $1 if /(?<="token":)"([^"]*)/'
Both of those also work if the string is in a different order.
Or, with POSIX awk:
$ echo "$fullToken" | awk -F"[,:}]" '{for(i=1;i<=NF;i++){if($i~/"token"/){print $(i+1)}}}'
Or, with POSIX sed, you can do:
$ echo "$fullToken" | sed -E 's/.*"token":"([^"]*).*/\1/'
These solutions are listed from most robust (use a JSON parser) to most fragile (sed). Even so, the sed command above holds up better than it might look, because it anchors on the "token" key and therefore still works when the key/value pairs in the JSON string appear in a different order.
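The order-independence of the sed variant can be checked with a short sketch that also assigns the result to another variable, as the question asks. The function name get_token and the second sample string are made up for illustration. The pattern assumes there is no whitespace between "token", the colon, and the value, and that the value contains no escaped quotes.

```shell
#!/bin/bash
# Sketch: extract the "token" value with sed, whatever the key order.
# Assumes the exact spelling "token":"..." with no spaces and no escaped quotes.
get_token() {
  printf '%s\n' "$1" | sed -E 's/.*"token":"([^"]*).*/\1/'
}

first='{"type":"APP","token":"l0ng_Str1ng.of.d1fF3erent_charAct3rs"}'
second='{"token":"l0ng_Str1ng.of.d1fF3erent_charAct3rs","type":"APP"}'

tokenFirst=$(get_token "$first")     # token is the last key
tokenSecond=$(get_token "$second")   # token is the first key
echo "$tokenFirst"
echo "$tokenSecond"
```

Both echo lines print the same unquoted token. If the input has no "token" key at all, sed prints the line unchanged, so a caller that cares would need to check for that case.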
PS: If you want to remove the quotes from a line, that is a great job for sed:
$ echo '"quoted string"'
"quoted string"
$ echo '"quoted string"' | sed -E 's/^"(.*)"$/UN\1/'
UNquoted string
In awk:
$ awk -v f="$fullToken" '
BEGIN{
while(match(f,/[^:{},]+:[^:{},]+/)) { # search key:value pairs
p=substr(f,RSTART,RLENGTH) # set pair to p
f=substr(f,RSTART+RLENGTH) # remove p from f
split(p,a,":") # split to get key and value
for(i in a) # remove leading and trailing "
gsub(/^"|"$/,"",a[i])
if(a[1]=="token") { # if key is token
print a[2] # output value
exit # no need to process further
}
}
}'
l0ng_Str1ng.of.d1fF3erent_charAct3rs
Note that with this approach the token value can't contain any of the characters : { } ,
GNU sed:
fullToken='{"type":"APP","token":"l0ng_Str1ng.of.d1fF3erent_charAct3rs"}'
echo "$fullToken"|sed -r 's/.*"(.*)".*/\1/'
A grep method would be:
$ grep -oP '[^"]+(?="[^"]+$)' <<< "$fullToken"
l0ng_Str1ng.of.d1fF3erent_charAct3rs
Brief explanation:
[^"]+ : matches a run of characters that are not "
(?="[^"]+$) : lookahead that accepts the run only if it is followed by the last " on the line
You may also use sed to do that:
$ sed -E 's/.*"([^"]+)"[^"]+$/\1/' <<< "$fullToken"
l0ng_Str1ng.of.d1fF3erent_charAct3rs
If the source of your string is JSON, then you should use JSON-specific tools. If not, then consider:
Using awk
$ fullToken='{"type":"APP","token":"l0ng_Str1ng.of.d1fF3erent_charAct3rs"}'
$ echo "$fullToken" | awk -F'"' '{print $8}'
l0ng_Str1ng.of.d1fF3erent_charAct3rs
Using cut
$ echo "$fullToken" | cut -d'"' -f8
l0ng_Str1ng.of.d1fF3erent_charAct3rs
Using sed
$ echo "$fullToken" | sed -E 's/.*"([^"]*)"[^"]*$/\1/'
l0ng_Str1ng.of.d1fF3erent_charAct3rs
Using bash and one of the above
The above all work with POSIX shells. If the shell is bash, then we can use a here-string and eliminate the pipeline. Taking cut as the example:
$ cut -d'"' -f8 <<<"$fullToken"
l0ng_Str1ng.of.d1fF3erent_charAct3rs
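Going one step further than the here-string, bash can do the extraction without any external command by using its built-in regex match. This is only a sketch: the function name token_from_json is hypothetical, and like the sed variants it assumes the exact spelling "token":"..." with no spaces and no escaped quotes in the value.

```shell
#!/bin/bash
# Sketch: extract the token with bash's [[ =~ ]] and BASH_REMATCH only.
token_from_json() {
  # Keep the regex in a variable so its quotes are not reinterpreted by the shell.
  local re='"token":"([^"]*)"'
  if [[ $1 =~ $re ]]; then
    printf '%s\n' "${BASH_REMATCH[1]}"   # first capture group: the value
  else
    return 1                             # no "token" key found
  fi
}

fullToken='{"type":"APP","token":"l0ng_Str1ng.of.d1fF3erent_charAct3rs"}'
token=$(token_from_json "$fullToken")
echo "$token"
```

Unlike the field-number approaches with awk -F'"' and cut, this does not depend on the token being the eighth quote-delimited field.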

Bash script output JSON variable to file

I am using twurl in Ubuntu's command line to connect to the Twitter Streaming API, and parse the resulting JSON with jq. I have the following command, which returns the text of tweets sent from London:
twurl -t -d locations=-5.67,50.06,1.76,58.62 language=en -H stream.twitter.com /1.1/statuses/filter.json | jq '.text'
This works great, but I'm struggling to output the result to a file called london.txt. I have tried the following, but still no luck:
twurl -t -d locations=-5.67,50.06,1.76,58.62 language=en -H stream.twitter.com /1.1/statuses/filter.json | jq '.text' > london.txt
As I'm fairly new to Bash scripting, I'm sure I've misunderstood the proper use of '>' and '>>', so if anyone could point me in the right direction that'd be awesome!
twurl -t -d locations=-5.67,50.06,1.76,58.62 language=en -H stream.twitter.com /1.1/statuses/filter.json | jq '.text' > london.txt
With > the file is truncated every time the command is started, so anything written by an earlier run is lost. With >> each write is appended to the end of the file instead. Try the following rather than the example above:
twurl -t -d locations=-5.67,50.06,1.76,58.62 language=en -H stream.twitter.com /1.1/statuses/filter.json | jq '.text' >> london.txt
You can also use the tee command to see what is being printed while it is written to the file:
twurl -t -d locations=-5.67,50.06,1.76,58.62 language=en -H stream.twitter.com /1.1/statuses/filter.json | jq '.text'| tee london.txt
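The difference between >, >> and tee can be seen in a small sketch that uses printf in place of the twurl pipeline. The function name demo_redirects and the temporary file are made up for illustration.

```shell
#!/bin/bash
# Sketch: how >, >> and tee -a treat an existing file.
demo_redirects() {
  local file=$1
  printf 'one\ntwo\n' >  "$file"                # one command, two lines: both are written
  printf 'three\n'    >  "$file"                # a second command with > truncates first
  printf 'four\n'     >> "$file"                # >> appends to what is already there
  printf 'five\n' | tee -a "$file" > /dev/null  # tee -a appends too (plain tee truncates)
  cat "$file"
}

tmp=$(mktemp)
demo_redirects "$tmp"
rm -f "$tmp"
```

The file ends up holding three, four and five. If london.txt stays empty while the stream is still running, output buffering in the pipeline may be the cause rather than the redirection; jq has an --unbuffered option that is worth trying in that case.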

Bash Script Loop through MySQL row and use curl and grep

I have a MySQL database with a table:
url | words
And data like, for example:
------Column URL------- -------Column Words------
www.firstwebsite.com | hello, hi
www.secondwebsite.com | someword, someotherword
I want to loop through that table to check if the word is present in the content of the website specified by the url.
I have something like this:
#!/bin/bash
mysql --user=USERNAME --password=PASSWORD DATABASE --skip-column-names -e "SELECT url, keyword FROM things" | while read url keyword; do
content=$(curl -sL $url)
echo $content | egrep -q $keyword
status=$?
if test $status -eq 0 ; then
# Found...
else
# Not found...
fi
done
One problem:
It's very slow. How can I set up curl to optimize the load time of each website (not loading images and things like that)?
Also, is it a good idea to do this in a shell script, or would it be better to write a PHP script and call it with curl?
Thanks !
As it stands, your script will not work as you might expect when you have multiple keywords per row, as in your example. The reason is that when you pass hello, hi to egrep it will look for the exact string "hello, hi" in its input, not for either "hello" or "hi". You can fix this without changing what's in your database by turning each list of keywords into an egrep-compatible regular expression with sed. You'll also need to remove the | from mysql's output, e.g., with awk.
curl doesn't retrieve images when downloading a webpage's HTML. If the order in which the URLs are queried does not matter to you then you can speed things up by making the whole thing asynchronous with &.
#!/bin/bash
handle_url() {
if curl -sL "$1" | egrep -q "$2"; then
echo 1 # Found...
else
echo 0 # Not found...
fi
}
mysql --user=USERNAME --password=PASSWORD DATABASE --skip-column-names -e "SELECT url, keyword FROM things" | awk -F \| '{ print $1, $2 }' | while read url keywords; do
keywords=$(echo $keywords | sed -e 's/, /|/g;s/^/(/;s/$/)/;')
handle_url "$url" "$keywords" &
done

Convert large text block to json block

I have a BASH script that attempts to capture the output of build/deployment logs and insert them into a Jira ticket using Jira's REST API and CURL:
curl -v -X POST \
-H "Content-Type: application/json" \
--data "@header.json" \
--data "@log.txt" \
--data "@footer.json" \
-H "Authorization:Basic ABC123!##" \
https://companyname.jira.com/rest/api/latest/issue/FOO-1234/comment
My problem is that the logs contain all manner of JSON tokens, which causes the insert to fail. Is there a way from BASH to clean up the text blob before posting to escape out all the illegal characters? Or a way to say "don't parse anything in this block" or similar? Worst case, I'll write some really scary AWK.
Once upon a time I used this code snippet for sending POST data using curl.
urlquote() {
echo -ne "$1" | xxd -plain | tr -d '\n' | sed 's/\(..\)/%\1/g'
}
It works great with unicode stuff as well. Maybe this is going to help.
It turns out that all I needed to escape were the quotes and to convert newlines to \n. I used the following sed actions:
sed -inplace 's/\"/\\\"/g' log.txt
sed -inplace ':a;N;$!ba;s/\n/\\n/g' log.txt
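The two sed steps can also be combined into a filter that leaves log.txt untouched. The sketch below is one possible shape, with a hypothetical function name json_escape. As an extra assumption beyond the answer above, it escapes backslashes first so that existing backslashes in the log are not misread; tabs and other control characters are not handled.

```shell
#!/bin/bash
# Sketch: make multi-line text safe to embed in a JSON string.
# Escapes backslashes and double quotes, then joins lines with a literal \n.
json_escape() {
  sed -e 's/\\/\\\\/g' -e 's/"/\\"/g' |
    awk 'NR > 1 { printf "%s", "\\n" } { printf "%s", $0 } END { print "" }'
}

# Made-up two-line log excerpt.
printf 'build "ok"\npath C:\\tmp\n' | json_escape
```

The result could then be placed into a payload with something like printf '{"body": "%s"}\n' "$(json_escape < log.txt)", where the {"body": ...} shape is an assumption about what header.json and footer.json contain.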