Grepping a word buried in a <p> on a website

I am having trouble grepping for a word on a website. This is the command I'm using:
wget -q http://bcbioinformaticsgrad.ca/our-faculty/james-piret/ | grep 'medical'
which is returning nothing, when it should be returning
[name of the website]:Many recent developments in biological and medical
...
The overall goal is to find a certain word within all the pages linked from the website.
My script is written like this:
#!/bin/bash
#$1 is the parent website
#This pipeline obtains all the links located on a website
wget -qO- $1 | grep -Eoi '<a [^>]+>' | grep -Eo 'href="[^\"]+"' | cut -c 7- | rev | cut -c 2- | rev > .linksLocated
#$2 is the word being looked for
#This loop goes through every link and tries to locate a word
while IFS='' read -r line || [[ -n "$line" ]]; do
wget -q $line | grep "$2"
done < .linksLocated
#rm .linksLocated

Wget doesn't write the downloaded page to standard output; it saves it to a file, so grep gets nothing on its input (and -q suppresses even the status messages).
Add -O - to print the page to stdout:
wget -q http://bcbioinformaticsgrad.ca/our-faculty/james-piret/ -O - | grep 'medical'
I see you used it with the first wget in your script, so just add it to the second one, too.
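For example, a sketch of the corrected loop from your script (only the -O - changes; everything else stays the same):
while IFS='' read -r line || [[ -n "$line" ]]; do
wget -qO- "$line" | grep "$2"
done < .linksLocated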
It's also possible to use curl, which does that by default, without any parameters:
curl http://bcbioinformaticsgrad.ca/our-faculty/james-piret/ | grep 'medical'
Edit: this tool is super useful when you actually need to select certain HTML elements in the downloaded page, might suit some use cases better than grep: https://github.com/ericchiang/pup
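For instance, with pup installed, extracting every link target from a page becomes a one-liner (attr{href} is one of pup's display functions):
wget -qO- http://bcbioinformaticsgrad.ca/our-faculty/james-piret/ | pup 'a attr{href}'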

Related

Validating the JSON files in a folder/subfolder through a shell script

find dir1 | grep .json | python -mjson.tool $1 > /dev/null
I am using the above command, but it doesn't take the files as input. What should I do to check all the JSON files in a folder and validate whether each one is proper JSON?
I found this a while back and have been using it:
for j in *.json; do python -mjson.tool "$j" > /dev/null || echo "INVALID $j" >&2; done;
I'm sure it compares similarly to dasho-o's answer, but am including it as another option.
You need to use xargs.
The find/grep pipeline places the JSON file names on the next command's standard input; you need to turn those names into command-line arguments, which is what xargs does:
find dir1 | grep .json | xargs -L1 python -mjson.tool > /dev/null
Side notes:
Since find has filtering and execution predicates, a more compact command can be written:
find dir1 -name '*.json' -exec python -mjson.tool '{}' ';'
Also consider using jq as a lightweight alternative to validating via Python.
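If jq is installed, a minimal sketch mirroring the loop above (jq's empty filter parses its input, produces no output, and exits non-zero on invalid JSON):
for j in *.json; do jq empty "$j" 2> /dev/null || echo "INVALID $j" >&2; done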

Extract href of a specific anchor text in bash

I am trying to get the href of the most recent production release from Exiftool page.
curl -s 'http://www.sno.phy.queensu.ca/~phil/exiftool/history.html' | grep -o -E "href=[\"'](.*)[\"'].*Version"
Actual output
href="Image-ExifTool-10.36.tar.gz">Version
I want this as the output:
Image-ExifTool-10.36.tar.gz
Using grep -P you can use a lookahead and \K for match reset:
curl -s 'http://www.sno.phy.queensu.ca/~phil/exiftool/history.html' |
grep -o -P "href=[\"']\K[^'\"]+(?=[\"']>Version)"
Image-ExifTool-10.36.tar.gz
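If grep -P isn't available (it needs GNU grep built with PCRE support), a sed capture can produce roughly the same result; a sketch mirroring the pattern above:
curl -s 'http://www.sno.phy.queensu.ca/~phil/exiftool/history.html' |
sed -nE "s/.*href=[\"']([^\"']+)[\"']>Version.*/\1/p"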

BATCH: grep equivalent

I need some help with the equivalent of grep -v Wildcard and grep -o in a batch file.
This is my code in shell:
result=`mysqlshow --user=$dbUser --password=$dbPass sample | grep -v Wildcard | grep -o sample`
The batch equivalent of grep (not counting third-party tools like GnuWin32 grep) is findstr.
grep -v finds lines that don't match the pattern. The findstr version of this is findstr /V
grep -o shows only the part of the line that matches the pattern. Unfortunately, there's no equivalent of this, but you can run the command and then have a check along the lines of
if %errorlevel% equ 0 echo sample

Cut file by phrase in ubuntu terminal

I downloaded some website pages with the wget utility, but the HTML pages contain too much information that isn't needed. I want the files to contain only the text before the
</article> tag. I suspect it's possible to do this with the grep command, but which parameters do I need? And how do I apply such a command to all files in a directory?
Here's the script:
for i in *.htm; do grep -i -B 9999 "</article>" "$i" > "$i.tmp" && mv "$i.tmp" "$i"; done
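An alternative sketch with sed, keeping everything up to and including the first line that contains the (lowercase) closing tag; the in-place -i flag assumes GNU sed:
for i in *.htm; do sed -i -n '1,/<\/article>/p' "$i"; done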

Bash Script Loop through MySQL row and use curl and grep

I have a MySQL database with a table:
url | words
And data like, for example:
------Column URL------- -------Column Words------
www.firstwebsite.com | hello, hi
www.secondwebsite.com | someword, someotherword
I want to loop through that table and check whether the words are present in the content of the website at the given URL.
I have something like this:
#!/bin/bash
mysql --user=USERNAME --password=PASSWORD DATABASE --skip-column-names -e "SELECT url, keyword FROM things" | while read url keyword; do
    content=$(curl -sL $url)
    echo $content | egrep -q $keyword
    status=$?
    if test $status -eq 0 ; then
        # Found...
    else
        # Not found...
    fi
done
One problem: it's very slow. How can I set up curl to optimize the load time of each website, e.g. not load images, things like that?
Also, is it a good idea to put something like this in a shell script, or is it better to create a PHP script and call it with curl?
Thanks!
As it stands, your script will not work as you might expect when you have multiple keywords per row, as in your example. The reason is that when you pass hello, hi to egrep, it looks for the exact string "hello, hi" in its input, not for either "hello" or "hi". You can fix this without changing what's in your database by turning each list of keywords into an egrep-compatible regular expression with sed. You'll also need to remove the | from mysql's output, e.g. with awk.
curl doesn't retrieve images when downloading a page's HTML. If the order in which the URLs are queried does not matter to you, you can speed things up by making the whole thing asynchronous with &.
#!/bin/bash

handle_url() {
    if curl -sL "$1" | egrep -q "$2"; then
        echo 1 # Found...
    else
        echo 0 # Not found...
    fi
}

mysql --user=USERNAME --password=PASSWORD DATABASE --skip-column-names -e "SELECT url, keyword FROM things" |
awk -F \| '{ print $1, $2 }' |
while read url keywords; do
    keywords=$(echo $keywords | sed -e 's/, /|/g;s/^/(/;s/$/)/;')
    handle_url "$url" "$keywords" &
done
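For example, the sed substitution in the loop turns a comma-separated keyword list into an egrep alternation:
echo 'hello, hi' | sed -e 's/, /|/g;s/^/(/;s/$/)/;'
(hello|hi)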