Extract href of a specific anchor text in bash - html

I am trying to get the href of the most recent production release from the ExifTool history page.
curl -s 'http://www.sno.phy.queensu.ca/~phil/exiftool/history.html' | grep -o -E "href=[\"'](.*)[\"'].*Version"
Actual output:
href="Image-ExifTool-10.36.tar.gz">Version
The output I want is:
Image-ExifTool-10.36.tar.gz

Using grep -P, you can use a lookahead and \K to reset the start of the reported match:
curl -s 'http://www.sno.phy.queensu.ca/~phil/exiftool/history.html' |
grep -o -P "href=[\"']\K[^'\"]+(?=[\"']>Version)"
Image-ExifTool-10.36.tar.gz
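Here \K drops everything matched before it from the reported match, and the (?= ) lookahead requires ">Version to follow without including it. You can check the pattern against the snippet from the question without hitting the live page:
echo 'href="Image-ExifTool-10.36.tar.gz">Version 10.36' |
grep -o -P "href=[\"']\K[^'\"]+(?=[\"']>Version)"
Image-ExifTool-10.36.tar.gz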

Related

Openshift remote command execution (exec)

I am trying to run the following command from a Windows machine against an OpenShift Docker container running Linux:
oc exec openjdk-app-1-l9nrx -i -t --server https://xxx.cloud.ibm.com:30450 \
--token <token> -n dev-hg jcmd \
$(ps -ef | grep java | grep -v grep | awk '{print $2}') GC.heap_dump \
/tmp/heap1.hprof
It tries to evaluate jcmd $(ps -ef | grep java | grep -v grep | awk '{print $2}') GC.heap_dump /tmp/heap1.hprof on my local Windows machine, which does not have those Linux commands. Also, I need the process ID of the application running in the container, not one on my local machine.
Any quick help is appreciated.
Try this:
oc exec -it openjdk-app-1-l9nrx --server https://xxx.cloud.ibm.com:30450 \
--token <dont-share-your-token> -n dev-hg -- /bin/sh -c \
"jcmd $(ps -ef | grep java | grep -v grep | awk '{print \$2}')"
Or even:
oc exec -it openjdk-app-1-l9nrx --server https://xxx.cloud.ibm.com:30450 \
--token <dont-share-your-token> -n dev-hg -- /bin/sh -c \
"jcmd $(ps -ef | awk '/java/{print \$2}')"
The problem is that the $( ) piece is being interpreted locally. Surrounding it in double quotes won't help, because command substitution is still performed inside double quotes.
You have to replace your double quotes with single quotes (so $( ) is not interpreted locally), and then compensate for the single quotes that awk needs:
oc exec openjdk-app-1-l9nrx -i -t --server https://xxx.cloud.ibm.com:30450 --token TOKEN -n dev-hg 'jcmd $(ps -ef | grep java | grep -v grep | awk '\''{print $2}'\'') GC.heap_dump /tmp/heap1.hprof'
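The same effect can be seen locally without a cluster; in this rough sketch a child bash stands in for the remote shell:
bash -c "echo $(echo $$)"   # the $( ) runs in the outer shell first, so this prints the outer shell's PID
bash -c 'echo $(echo $$)'   # the $( ) is passed through literally and runs in the child, so this prints the child's PID
With oc exec the child shell happens to be inside the container, which is exactly where you want the $(ps ...) to run.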
Please add the tags unix and shell to your question, as this is more of a UNIX question than an OpenShift one.

Bash - How to extract JSON from a web page?

I'm trying to extract JSON from this URL: here
The output that I want is like this: https://pastebin.com/BVzUrk6s. Sorry, I can't paste it here because of the Stack Overflow character limit.
Here is what I have tried:
curl 'https://www.lazada.co.id/-i160040703-s181911730.html?spm=a2o4j.order_details.details_title.1.52ec6664luQAQs&urlFlag=true&mp=1' | grep -Poz '(?<=app.run\()(.*\n)*.*(?=\);)'
But that command still doesn't extract the JSON data. How do I solve this? If possible, I want to use a pure bash script without installing any extra programs.
It's a Bad Idea (TM) to attempt JSON parsing this way.
It seems like a Good Idea (TM) to find out what is possible regardless.
#!/bin/bash
function parseUrl() {
    local url=$1
    echo '"childCategories": ['
    curl --silent ${url} \
        | awk '/<script type="text" class=J_data/ { show=1 } show; /<\/script>/ { show=0 }' \
        | egrep -v "script" \
        | sed -e 's/]//g' -e 's/\[//g' -e 's/{"childCategoryName":"","childCategoryUrl":""},//g' -e 's/}$/},/g' \
        | sed -e 's/,{/,\'$'\n{/g' -e 's/^[ ]*//g' -e 's/{/ {/g' \
        | sed -e 's/childCategoryName/name/g' -e 's/childCategoryUrl/url/g'
    echo ' ]'
}
parseUrl 'https://www.lazada.co.id/-i160040703-s181911730.html?spm=a2o4j.order_details.details_title.1.52ec6664luQAQs&urlFlag=true&mp=1' \
| tee /tmp/extracted.json
So there you go: curl, awk, egrep, sed. Use at your own risk.
Code like this isn't extensible, meaning you can't extract nested JSON easily.
It is quite brittle, meaning if someone changes the layout or even CSS, it's bye-bye data extraction.
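If you want to check whether what came out is actually valid JSON (the caveats above suggest it may well not be), wrapping the fragment in braces and running it through a validator is a cheap test. This assumes a python3 interpreter is already on the machine:
{ echo '{'; cat /tmp/extracted.json; echo '}'; } | python3 -m json.tool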

Grepping a word buried in a <p> on a website

I am having trouble grepping a word on a website. This is the command I'm using:
wget -q http://bcbioinformaticsgrad.ca/our-faculty/james-piret/ | grep 'medical'
which is returning nothing, when it should be returning
[name of the website]:Many recent developments in biological and medical
...
The overall goal is to find a certain word within all the pages linked from the website.
My script is written like this:
#!/bin/bash
# $1 is the parent website
# This pipeline obtains all the links located on a website
wget -qO- $1 | grep -Eoi '<a [^>]+>' | grep -Eo 'href="[^\"]+"' | cut -c 7- | rev | cut -c 2- | rev > .linksLocated
# $2 is the word being looked for
# This loop goes through every link and tries to locate the word
while IFS='' read -r line || [[ -n "$line" ]]; do
    wget -q $line | grep "$2"
done < .linksLocated
#rm .linksLocated
wget doesn't write the downloaded page to standard output; it saves it to a file on disk, so grep has nothing to read through the pipe (the -q flag only silences wget's own status messages).
Add -O - to print the page to stdout:
wget -q http://bcbioinformaticsgrad.ca/our-faculty/james-piret/ -O - | grep 'medical'
I see you used it with the first wget in your script, so just add it to the second one, too.
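So the loop body becomes (quoting the variable as well, in case a URL contains characters the shell would split on):
wget -qO- "$line" | grep "$2"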
It's also possible to use curl, which does that by default, without any parameters:
curl http://bcbioinformaticsgrad.ca/our-faculty/james-piret/ | grep 'medical'
Edit: this tool is super useful when you actually need to select specific HTML elements in the downloaded page, and might suit some use cases better than grep: https://github.com/ericchiang/pup
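For example, with pup installed, something along these lines should pull only the paragraph text before grepping (untested against this particular page; the selector syntax is taken from pup's README):
curl -s http://bcbioinformaticsgrad.ca/our-faculty/james-piret/ | pup 'p text{}' | grep 'medical'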

BATCH: grep equivalent

I need some help with the batch equivalent of grep -v Wildcard and grep -o.
This is my code in shell:
result=`mysqlshow --user=$dbUser --password=$dbPass sample | grep -v Wildcard | grep -o sample`
The batch equivalent of grep (not including third-party tools like GnuWin32 grep) is findstr.
grep -v finds lines that don't match the pattern. The findstr version of this is findstr /V.
grep -o shows only the part of the line that matches the pattern. Unfortunately, there's no findstr equivalent, but you can run the command and then check its exit code along the lines of
if %errorlevel% equ 0 echo sample
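Putting that together, a rough batch sketch of the original pipeline might look like this (untested; it assumes %dbUser% and %dbPass% are set earlier in the script, and it only distinguishes found/not found rather than reproducing grep -o exactly):
mysqlshow --user=%dbUser% --password=%dbPass% sample | findstr /V "Wildcard" | findstr "sample" >nul
if %errorlevel% equ 0 (set "result=sample") else (set "result=")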

Bash script output JSON variable to file

I am using twurl in Ubuntu's command line to connect to the Twitter Streaming API, and parse the resulting JSON with this processor. I have the following command, which returns the text of tweets sent from London:
twurl -t -d locations=-5.67,50.06,1.76,58.62 language=en -H stream.twitter.com /1.1/statuses/filter.json | jq '.text'
This works great, but I'm struggling to output the result to a file called london.txt. I have tried the following, but still no luck:
twurl -t -d locations=-5.67,50.06,1.76,58.62 language=en -H stream.twitter.com /1.1/statuses/filter.json | jq '.text' > london.txt
As I'm fairly new to Bash scripting, I'm sure I've misunderstood the proper use of '>' and '>>', so if anyone could point me in the right direction that'd be awesome!
twurl -t -d locations=-5.67,50.06,1.76,58.62 language=en -H stream.twitter.com /1.1/statuses/filter.json | jq '.text' > london.txt
With > the file is truncated and overwritten on every run. If you use >> instead, each write is appended to the end of the file. So try the following rather than the example above; I'm fairly certain it will work.
twurl -t -d locations=-5.67,50.06,1.76,58.62 language=en -H stream.twitter.com /1.1/statuses/filter.json | jq '.text' >> london.txt
Also, you can use the tee command to see what is being printed while it is also written to the file:
twurl -t -d locations=-5.67,50.06,1.76,58.62 language=en -H stream.twitter.com /1.1/statuses/filter.json | jq '.text'| tee london.txt
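If you want the appending behaviour of >> together with the on-screen output, tee also has an -a flag that combines the two:
twurl -t -d locations=-5.67,50.06,1.76,58.62 language=en -H stream.twitter.com /1.1/statuses/filter.json | jq '.text' | tee -a london.txt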