Automatically replace all dead link texts in an HTML page

I have a firmware download page that is regenerated daily and contains lots of different download links: https://freifunk.in-kiel.de/firmware/release-candidate/2018.1~exp-215/site/download/
Not all of those downloads are valid every day, but the page should still carry a note saying that those firmwares are n/a.
My idea is to parse the page with a regular expression to collect all links, then run a curl (or wget) call against each one to check whether it is dead. If it is dead, the link text should be replaced with "n/a".

This script successfully checks all links containing the string "gluon":
#!/bin/bash
# set the new version here
CUR=2018.1~ngly-234
BRANCH='nightly'
OUT_FILE=index.html
wget -k --no-check-certificate http://freifunk.in-kiel.de/firmware-rc.html -O $OUT_FILE
# replace the data from the template
sed -i 's|/sysupgrade/gluon-ffki-<VERSION>|/sysupgrade/gluon-ffki-'$CUR'|g' $OUT_FILE
sed -i 's|/factory/gluon-ffki-<VERSION>|/factory/gluon-ffki-'$CUR'|g' $OUT_FILE
sed -i 's|release-candidate|'$BRANCH'/'$CUR'|g' $OUT_FILE
echo -n "dead link check "
#sed -i "s/tube2/nixtube2/g" $OUT_FILE # for debug to create a dead link
INVALID='">n/a </a><deadlink none="'
while IFS= read -r URL; do
    if wget --no-check-certificate --spider "$URL" 2>/dev/null; then
        echo -n .
    else
        echo
        echo "$URL does not exist"
        sed -i 's|'$URL'|'$URL''"$INVALID"'|g' $OUT_FILE
    fi
#done < <(grep -Po '(?<=href=")[^"]*' $OUT_FILE|grep gluon|grep alfa) # for debug
done < <(grep -Po '(?<=href=")[^"]*' $OUT_FILE|grep gluon)
echo "dead link check done"
sed -i 's|http://|//|g' $OUT_FILE
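For reference, the same check can be done with curl instead of wget --spider, as the question suggests. A minimal sketch, reusing OUT_FILE and INVALID exactly as set in the script above:
#!/bin/bash
# Sketch of a curl-based dead-link check; OUT_FILE and INVALID as in the script above.
OUT_FILE=index.html
INVALID='">n/a </a><deadlink none="'
while IFS= read -r URL; do
    # --head sends a HEAD request, --fail returns non-zero on HTTP errors,
    # --insecure matches wget's --no-check-certificate behaviour
    if curl --silent --head --fail --insecure "$URL" >/dev/null; then
        echo -n .
    else
        echo "$URL does not exist"
        sed -i 's|'"$URL"'|'"$URL$INVALID"'|g' "$OUT_FILE"
    fi
done < <(grep -Po '(?<=href=")[^"]*' "$OUT_FILE" | grep gluon)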

Related

Pass flags to the Sphinx runner?

I have the project OpenFHE-development, and when I run the build process there are lots of warnings. Most of these warnings are fine to ignore (we vet them before pushing to the main branch).
Specifically, is there a way to take
pth/python -m sphinx -T -E -b readthedocssinglehtmllocalmedia -d _build/doctrees -D language=en . _build/localmedia
and convert it to
pth/python -m sphinx -T -E -b readthedocssinglehtmllocalmedia -d _build/doctrees -D language=en . _build/localmedia 2> errors.txt
(redirect stderr to a file instead of having it displayed in the console output)?
This does not seem to be possible at the moment. See the GitHub discussion.
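If you control the invocation yourself (e.g. in a local build script rather than the Read the Docs runner), the redirection is plain shell syntax; a sketch, assuming the same paths as above:
#!/bin/bash
# Local-only sketch: the Read the Docs runner assembles this command line itself,
# so there is no supported hook to add the redirection on their side.
pth/python -m sphinx -T -E -b readthedocssinglehtmllocalmedia \
    -d _build/doctrees -D language=en . _build/localmedia 2> errors.txt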

Grepping a word buried in a <p> on a website

I am having trouble grepping for a word on a website. This is the command I'm using:
wget -q http://bcbioinformaticsgrad.ca/our-faculty/james-piret/ | grep 'medical'
which is returning nothing, when it should be returning
[name of the website]: Many recent developments in biological and medical ...
The overall goal is to find a certain word within all the pages linked from the website.
My script is written like this:
#!/bin/bash
#$1 is the parent website
#This pipeline obtains all the links located on a website
wget -qO- $1 | grep -Eoi '<a [^>]+>' | grep -Eo 'href="[^\"]+"' | cut -c 7- | rev | cut -c 2- | rev > .linksLocated
#$2 is the word being looked for
#This loop goes though every link and tries to locate a word
while IFS='' read -r line || [[ -n "$line" ]]; do
wget -q $line | grep "$2"
done < .linksLocated
#rm .linksLocated
wget doesn't write the downloaded page to standard output by default (it saves it to a file), so grep has nothing to search; the -q flag only suppresses wget's status messages.
Add -O - to print the page to stdout:
wget -q http://bcbioinformaticsgrad.ca/our-faculty/james-piret/ -O - | grep 'medical'
I see you used it with the first wget in your script, so just add it to the second one, too.
It's also possible to use curl, which writes the page to stdout by default, without any extra parameters:
curl http://bcbioinformaticsgrad.ca/our-faculty/james-piret/ | grep 'medical'
Edit: pup is super useful when you actually need to select specific HTML elements in the downloaded page; it might suit some use cases better than grep: https://github.com/ericchiang/pup
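Putting the fix into the original script, a possible corrected version (a sketch; the temporary .linksLocated file and the positional parameters are kept as in the question):
#!/bin/bash
# $1 is the parent website, $2 is the word being looked for
# Collect all href values from the parent page
wget -qO- "$1" | grep -Eoi '<a [^>]+>' | grep -Eo 'href="[^"]+"' \
    | cut -c 7- | rev | cut -c 2- | rev > .linksLocated
# Fetch each linked page to stdout (-O -) and grep it for the word
while IFS='' read -r line || [[ -n "$line" ]]; do
    wget -q "$line" -O - | grep "$2"
done < .linksLocated
rm .linksLocated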

Docker Error: container id followed by "command not found"

I'm having difficulty with a script I'm writing. The script is largely incomplete, but so far I expect it to be able to run containers successfully. When I execute the script I'm given an error with a container ID and "command not found". For example:
./wordpress: line 73: 3c0fba4984f3b70f0eb3f1c15a7b157f4862b9b243657a3d2f7141029fb6641a: command not found
The script I'm using is as follows:
#!/bin/bash
echo "Setting Constants"
MYSQL_ROOT_PASSWORD='password'
MYSQL_DATABASE='wordpress'
WORDPRESS_DB_PASSWORD='password'
WP_PORT='80'
DB_PORT='3306'
EPOCH=$(date +%s) # append EPOCH to container names for uniqueness
#FILE='blogcontainers' # filename containing container IDs
DB_CONTAINER_NAME="myblogdb$EPOCH"
WP_CONTAINER_NAME="myblog$EPOCH"
DB_IMG_NAME='blogdb' # MySQL Docker image
WP_IMG_NAME='blog' # WordPress Docker image
cd ~/myblog
WP_CID_FILE="$PWD/blog.cid"
DB_CID_FILE="$PWD/blogdb.cid"
if [ -f $DB_CID_FILE ]; then
    DB_IMG_ID=$(sed -n '1p' $DB_CID_FILE)
else
    echo "dbcid not found"
    # set to baseline image
    DB_IMG_ID="f09a5b2903dc"
fi
if [ -f $WP_CID_FILE ]; then
    WP_IMG_ID=$(sed -n '1p' $WP_CID_FILE)
else
    echo "wpcid not found"
    # set to baseline image
    WP_IMG_ID="a8d48bc2313d"
fi
DB_PATH='/var/lib/mysql' # standard MySQL path
WP_PATH='/var/www/html' # standard WordPress path
LOCAL_DB_PATH="/$PWD$DB_PATH"
LOCAL_WP_PATH="/$PWD$WP_PATH"
echo "Starting MySQL Container"
#DB_ID=
$(docker run \
-e MYSQL_ROOT_PASSWORD=$MYSQL_ROOT_PASSWORD \
-e MYSQL_DATABASE=$MYSQL_DATABASE \
-v $LOCAL_WP_PATH:$DB_PATH \
-v /$PWD/.bash_history:$WP_PATH \
--name $DB_CONTAINER_NAME \
-p $DB_PORT:3306 \
--cidfile $DB_CID_FILE \
-d \
$DB_IMG_ID)
echo "Starting WordPress Container"
#WP_ID=
$(docker run \
-e WORDPRESS_DB_PASSWORD=$WORDPRESS_DB_PASSWORD \
--link $DB_CONTAINER_NAME:$DB_IMG_NAME \
-p $WP_PORT:80 \
-v $LOCAL_WP_PATH:$WP_PATH \
-v /$PWD/.bash_history:/root/.bash_history \
--name $WP_CONTAINER_NAME \
--cidfile $WP_CID_FILE \
-d \
$WP_IMG_ID)
echo $WP_CONTAINER_NAME
echo $WP_IMG_ID
echo "reached end"
#echo $WP_ID > $FILE # copy WordPress container ID to file
#echo $DB_ID >> $FILE # append MySQL container ID to file
After executing the code there usually is a MySQL container instance running. For example:
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
4f2e9ab14c2e f09a5b2903dc "/entrypoint.sh mysql" 2 seconds ago Up 2 seconds 0.0.0.0:3306->3306/tcp myblogdb1449768739
Also, both blog.cid and blogdb.cid are created successfully containing container IDs.
$ cat blog.cid
e6005bcb4dba524b121d02b301fbe421d67d60986c55d554a0e20443df27ed18
$ cat blogdb.cid
4f2e9ab14c2ea5361557a3714477d7758c993af3b08bbc7db529282a41f90959
I've been troubleshooting and searching around for answers, but I think it's time to have another set of eyes take a look at it. As always, any input/criticism is welcome.
You are using $(docker run ...) instead of simply docker run .... The command substitution ($(...)) runs the command, captures its output, and expands to that output; since docker run -d prints the new container's ID, the shell then tries to execute that ID as a command, which is exactly the "command not found" error you are seeing.
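A sketch of the fix for the MySQL container (the WordPress one is analogous): either drop the command substitution entirely, or restore the assignment that the commented-out #DB_ID= line hints at, since docker run -d prints the new container's ID on stdout. Volume mounts are omitted here for brevity:
# Option 1: just run the command, no command substitution
docker run \
    -e MYSQL_ROOT_PASSWORD=$MYSQL_ROOT_PASSWORD \
    -e MYSQL_DATABASE=$MYSQL_DATABASE \
    --name $DB_CONTAINER_NAME \
    -p $DB_PORT:3306 \
    --cidfile $DB_CID_FILE \
    -d \
    $DB_IMG_ID
# Option 2: capture the container ID in a variable instead of executing it
DB_ID=$(docker run -d \
    -e MYSQL_ROOT_PASSWORD=$MYSQL_ROOT_PASSWORD \
    -e MYSQL_DATABASE=$MYSQL_DATABASE \
    --name $DB_CONTAINER_NAME \
    -p $DB_PORT:3306 \
    $DB_IMG_ID)
echo "Started DB container $DB_ID"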

How to attach several files in mail using unix command?

I'm implementing a script to back up my MySQL database. The whole process works and I send an email when it finishes, but I want to attach the backup file to that email and I don't know how to do it.
My command line is:
mail -s "$1" -a "MIME-Version: 1.0;" -a "Content-type: text/html;" root@$domain -c ops@mydomain.com < $2
where $1 is my subject and $2 is my message body.
Thanks!
You are very close. You can use the mail command to send a single attachment as follows (it's best to tar/zip your files before sending):
echo "$2" | mail -s "$1" -a /path/to/file.tar.gz ops@mydomain.com
If you want more features, you can use mutt (install it with apt-get install mutt):
mutt -s "$1" -a /path/to/file1.tar.gz -a /path/to/file2.tar.gz -a /path/to/file3.tar.gz -- ops@mydomain.com < /tmp/mailbody.txt
where:
file1.tar.gz to file3.tar.gz are the file attachments
ops@mydomain.com is the recipient
/tmp/mailbody.txt contains the body of the email
or use uuencode (install with apt-get install sharutils):
uuencode /path/to/file.tar.gz /path/to/file.tar.gz | mailx -s "$1" ops@mydomain.com
Note:
you have to pass file.tar.gz twice: the first argument is the file to encode, the second is the name it will have in the encoded output (see the uuencode documentation for details)
mailx is a newer version of mail, but still an ancient command
to send multiple attachments with the mail command (well, if you insist):
$ uuencode file1.tar.gz file1.tar.gz > /tmp/out.mail
$ uuencode file2.tar.gz file2.tar.gz >> /tmp/out.mail
$ uuencode file3.tar.gz file3.tar.gz >> /tmp/out.mail
$ cat email-body.txt >> /tmp/out.mail
$ mail -s "$1" ops@mydomain.com < /tmp/out.mail
Hope the above helps.
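Tying this back to the backup use case, a sketch of the mutt variant inside the backup script; the dump path, database selection, and recipient here are made up for illustration, and credentials are assumed to come from ~/.my.cnf:
#!/bin/bash
# Hypothetical example: dump the database, compress it, and mail it as an attachment
DUMP=/tmp/backup-$(date +%F).sql.gz
mysqldump --all-databases | gzip > "$DUMP"
echo "Backup finished on $(hostname)" \
    | mutt -s "$1" -a "$DUMP" -- ops@mydomain.com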

Search and replace html tags in sed recursively

I am trying to write a script that searches all .htm and .html files recursively and strips the unwanted markup from them. The starting directory is given as input when running the script. The resulting files should be saved as new files in the same place, with _changed appended, e.g. start.html > start.html_changed.
Here is the script I have so far. It works, but the output is printed to the terminal, and I want it saved to the corresponding output file for each input file.
#!/bin/bash
sudo find $1 -name '*.html' -type f -print0 | xargs -0 sed -n '/<div/,/<\/div>/p'
sudo find $1 -name '*.htm' -type f -print0 | xargs -0 sed -n '/<div/,/<\/div>/p'
Any help is much appreciated.
The following script works just fine, but it is not recursive. How can I make it recursive?
#!/bin/bash
for l in /$1/*.html
do
    sed -n '/<div/,/<\/div>/p' $l > "${l}_nobody"
done
for m in /$1/*.htm
do
    sed -n '/<div/,/<\/div>/p' $m > "${m}_nobody"
done
Just edit the xargs part as follows:
xargs -0 -I {} sh -c "sed -n '/<div/,/<\/div>/p' {} > {}_changed"
Explanation:
-I {}: sets {} as a placeholder for each file name passed by xargs
> {}_changed: redirects sed's output to a file with the _changed suffix
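Putting it together, a sketch of the full recursive version with the redirection applied to both file types:
#!/bin/bash
# find walks the tree recursively; sh -c performs the per-file redirection
sudo find "$1" -name '*.html' -type f -print0 \
    | xargs -0 -I {} sh -c "sed -n '/<div/,/<\/div>/p' {} > {}_changed"
sudo find "$1" -name '*.htm' -type f -print0 \
    | xargs -0 -I {} sh -c "sed -n '/<div/,/<\/div>/p' {} > {}_changed"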