Finding all kinds of extensions referenced in an HTML file

Here is my problem statement:
There is a folder with many HTML and text files. I need to recursively go through each one of them and find all the file extensions referenced in these HTML/text files, like .jpg, .tif, .png, etc.
The problem is that I don't have a defined list of the extensions I want to search for.
What would be the best way to achieve this using a shell script?
Would it be to come up with a regex that searches for all occurrences of a dot followed by 3 or 4 letters, and then filter out the matches that end with a space, a comma, a quote, etc.?
Any suggestions would be helpful.

You can use a shell script to parse the file names with a regex, but the straightforward version is pretty simple:
$ cat *.{txt,html} | grep -oP '\b[A-Za-z0-9_]+\.[A-Za-z0-9]{1,4}\b' | awk -F. '{ print "." $(NF) }' | sort -u
For recursive search:
find . \( -name '*.txt' -o -name '*.html' \) -exec grep -oP '\b[A-Za-z0-9_.]+\.[A-Za-z0-9]{1,4}\b' {} \; | awk -F. '{ print "." $(NF) }' | sort -u
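If your grep supports --include filters (GNU grep does), a roughly equivalent sketch without find:
grep -rhoP --include='*.txt' --include='*.html' '\b[A-Za-z0-9_.]+\.[A-Za-z0-9]{1,4}\b' . | awk -F. '{ print "." $NF }' | sort -u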

Related

Extract data from JSON file using bash

Let's say that we have this kind of JSON file:
{
  ...
  "quotes": {
    "SOMETHING": 10,
    ...
    "SOMETHING_ELSE": 120.4,
    ...
  }
}
How can I obtain those values and use them in order to add them together?
Am I even able to do something like this?
#!/bin/bash
#code ...
echo "$SOMETHING + $SOMETHING_ELSE" | bc
#code ...
#exit
I will obtain the JSON file with the wget command. All I want is the content from this file.
Can you help me, please? I am a beginner in shell programming.
I usually use jq, a really fast JSON parser, to do this kind of thing (parsing a JSON file with tools like awk or sed is really error-prone).
Given an input file like this:
# file: input.json
{
  "quotes": {
    "SOMETHING": 10,
    "SOMETHING_ELSE": 120.4
  }
}
You can obtain the sum of the 2 fields with a simple filter:
jq '.quotes.SOMETHING + .quotes.SOMETHING_ELSE' input.json
# output -> 130.4
NOTE: jq is available in every major Linux distribution. On a Debian-derivative system you can install it with sudo apt-get install jq.
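If you need the numbers in shell variables, for example to feed them to bc as in the question, a minimal sketch (assuming the same input.json):
a=$(jq '.quotes.SOMETHING' input.json)
b=$(jq '.quotes.SOMETHING_ELSE' input.json)
echo "$a + $b" | bc
# output -> 130.4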
This will print out the sum of the selected lines' floats.
#!/bin/bash
awk '{ if ($1 ~ /"SOMETHING":/) {print}; if ($1 ~ /"SOMETHING_ELSE":/) {print} }' $1 | cut -d: -f2 | cut -d, -f1 | awk '{s+=$1};END{print s}'
This finds the lines you want, then plucks out the numbers and adds them.
You should look up and learn jq as shown in Read the json data in shell script.
The tools in a "normal" shell installation like awk and sed all predate JSON by decades, and are a very very bad fit. jq is worth the time to learn.
Or use Python instead.
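If you go the Python route, a minimal sketch driven from the shell (assuming the same input.json as above, and python3 on your PATH):
python3 -c 'import json, sys; q = json.load(sys.stdin)["quotes"]; print(q["SOMETHING"] + q["SOMETHING_ELSE"])' < input.json
# output -> 130.4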

linux command line to zip files based on mysql resultset

I have a table where some filenames are stored.
I would like to find all the files having those names under a specific folder and zip all of them.
On disk the structure is similar to this:
/folder/sub1/file1
/folder/sub1/file2
/folder/sub2/file1 <- same name as under sub1
/folder/sub2/file2
So I am looking for something similar to:
mysql -e "select file from table" | find /folder -type f -name <the value of file from mysql result set> | zip <all files found by all find commands>
thanks.
Couple of additions to your command:
Firstly, you want to use mysql in batch mode, so you do this:
mysql -Be "select file from table"
It gives you a single column table with no borders, so you get rid of the headers by piping it to tail starting at the second line:
tail -n +2
Then you pipe that to xargs, but before you do, hack it a bit with concat (you'll see why in a sec):
mysql -Be "select concat(' -o -name ', file) from table"
NOW you pipe it to xargs:
xargs find /folder -false
This does a "false" test (i.e. a no-op), but it appends a whole pile of things like -o -name somename.file, each of which performs a boolean or (with false originally, later with all other file names) and ultimately returns the list of files that match.
...which you finally pipe to zip, with another xargs:
xargs zip files.zip
Again, this puts the file names as arguments to zip.
Here's the total line:
mysql -Be "select concat(' -o -name ', file) from table" | tail -n +2 | xargs find /folder -false | xargs zip files.zip
Bear in mind that this assumes you have no spaces in your filenames. If you do, that'll add a bit of complexity: You can work around that by using -print0 and -0 in find and xargs respectively, although zip will have a harder time with that so you'd need to add another intermediate stage (or use zip -r).
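As a sketch of that workaround, handling spaces by running one find per name and feeding zip a null-delimited list (the table and column names are the question's placeholders):
mysql -N -B -e "select file from table" | while IFS= read -r name; do
  find /folder -type f -name "$name" -print0
done | xargs -0 zip files.zip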

How to Convert Regex Pattern Match to Lowercase for URL Standardization/Tidying

I am currently trying to convert all links and files and tags on my site from UPPERCASE.ext and CamelCase.ext to lowercase.ext.
I can match the links in pages using a regular expression match for href="[^"]*" and src="[^"]*"
This seems to work fine for identifying the link and images in the HTML.
However what I need to do with this is to take the match and run a ToLowercase() function on the matches. Since I have a lot of pages that I'd like to parse through, I'm looking to make a short shell script that will run on a specified directory and pattern match the specified regexes and perform a lowercase operation on them.
Perl one-liner to rename all regular files to lowercase:
perl -le 'use File::Find; find({wanted=>sub{-f && rename($_, lc)}}, "/path/to/files");'
If you want to be more specific about what files are renamed you could change -f to a regex or something:
perl -le 'use File::Find; find({wanted=>sub{/\.(txt|htm|blah)$/i && rename($_, lc)}}, "/path/to/files");'
EDIT: Sorry, after rereading the question I see you also want to replace occurrences within files as well:
find /path/to/files -name "*.html" -exec perl -pi -e 's/\b(src|href)="(.+)"/$1="\L$2"/gi;' {} \;
EDIT 2: Try this one, as its find command uses + instead of \;, which is more efficient since multiple files are passed to perl at once (thanks to @ikegami from another post). It also handles both ' and " around the URL. Finally, it uses {} instead of // for the substitution since you are substituting URLs (maybe the /s in the URL are confusing perl or your shell?). It shouldn't matter, and I tried both on my system with the same effect (both worked fine), but it's worth a shot:
find . -name "*.html" -exec perl -pi -e \
'$q=qr/"|\x27/; s{\b(src|href)=($q?.+$q?)\b}{$1=\L$2}gi;' {} +
PS: I also have a Macbook and tested these using bash shell with Perl versions 5.8.9 and 5.10.0.
With bash, you can declare a variable to only hold lower case values:
declare -l varname
read varname <<< "This Is LOWERCASE"
echo $varname # ==> this is lowercase
Or, you can convert a value to lowercase (bash version 4, I think)
x="This Is LOWERCASE"
echo ${x,,} # ==> this is lowercase
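If you also want to apply this to file names on disk, a minimal sketch using the same expansion (bash 4+; /path/to/files is a placeholder):
for f in /path/to/files/*; do
  base=$(basename -- "$f")
  lower=$(dirname -- "$f")/${base,,}
  # -n avoids clobbering when two names differ only by case
  [ "$f" != "$lower" ] && mv -n -- "$f" "$lower"
done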
you want this?
kent$ echo "aBcDEF"|sed 's/.*/\L&/g'
abcdef
or this
kent$ echo "aBcDEF"|awk '$0=tolower($0)'
abcdef
with your own regex:
kent$ echo 'FOO src="htTP://wWw.GOOGLE.CoM" BAR BlahBlah'|sed -r 's/src="[^"]*"/\L&/g'
FOO src="http://www.google.com" BAR BlahBlah
You could use sed with -i (in-place edit):
sed -i'' -re's/(href|src)="[^"]*"/\L&/g' /path/to/files/*

Can aspell output line number and not offset in pipe mode?

Can aspell output the line number instead of the offset in pipe mode for HTML and XML files? I can't read the file line by line, because in that case aspell can't identify a closing tag (if the tag is on the next line).
This will output all occurrences of misspelt words with line numbers:
# Get aspell output...
<my_document.txt aspell pipe list -d en_GB --personal=./aspell.ignore.txt |
# Process the aspell output...
grep '[a-zA-Z]\+ [0-9]\+ [0-9]\+' -oh | \
grep '[a-zA-Z]\+' -o | \
while read word; do grep -on "\<$word\>" my_document.txt; done
Where:
my_document.txt is your original document
en_GB is your primary dictionary choice (e.g. try en_US)
aspell.ignore.txt is an aspell personal dictionary (example below)
aspell_output.txt is the output of aspell in pipe mode (ispell style)
results.txt is a final results file
aspell.ignore.txt example:
personal_ws-1.1 en 500
foo
bar
example results.txt output (for an en_GB dictionary):
238:color
302:writeable
355:backends
433:dataonly
You can also print the whole line by changing the last grep -on into grep -n.
This is just an idea; I haven't really tried it yet (I'm on a Windows machine :(). But maybe you could pipe the HTML file through head (with a byte limit) and count newlines using grep to find your line number. It's neither efficient nor pretty, but it might just work.
cat icantspell.html | head -c <offset from aspell> | egrep -Uc "$"
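The same idea as a sketch with wc instead of grep (1234 stands in for the offset aspell reported; the newlines before the offset, plus one, give the line number):
echo $(( $(head -c 1234 icantspell.html | wc -l) + 1 ))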
I use the following script to perform spell-checking and to work-around the awkward output of aspell -a / ispell. At the same time, the script also works around the problem that ordinals like 2nd aren't recognized by aspell by simply ignoring everything that aspell reports which is not a word of its own.
#!/bin/bash
set +o pipefail
if [ -t 1 ] ; then
color="--color=always"
fi
! for file in "$@" ; do
<"$file" aspell pipe list -p ./dict --mode=html |
grep '[[:alpha:]]\+ [0-9]\+ [0-9]\+' -oh |
grep '[[:alpha:]]\+' -o |
while read word ; do
grep $color -n "\<$word\>" "$file"
done
done | grep .
You even get colored output if the stdout of the script is a terminal, and you get an exit status of 1 in case the script found spelling mistakes, otherwise the exit status of the script is 0.
Also, the script protects itself from pipefail, which is a somewhat popular option to set, e.g. in a Makefile, but which doesn't work for this script. Last but not least, this script explicitly uses [[:alpha:]] instead of [a-zA-Z], which is less confusing when it also matches non-ASCII characters like the German äöüÄÖÜß and others. [a-zA-Z] can match those as well, but that tends to come as a surprise.
aspell pipe / aspell -a / ispell output one empty line for each input line (after reporting the errors of the line).
Demonstration printing the line number with awk:
$ aspell pipe < testFile.txt |
awk '/^$/ { countedLine=countedLine+1; print "#L=" countedLine; next; } //'
produces this output:
@(#) International Ispell Version 3.1.20 (but really Aspell 0.60.7-20110707)
& iinternational 7 0: international, Internationale, internationally, internationals, intentional, international's, Internationale's
#L=1
*
*
*
& reelly 22 11: Reilly, really, reel, rely, rally, relay, resell, retell, Riley, rel, regally, Riel, freely, real, rill, roll, reels, reply, Greeley, cruelly, reel's, Reilly's
#L=2
*
#L=3
*
*
& sometypo 18 8: some typo, some-typo, setup, sometime, someday, smote, meetup, smarty, stupor, Smetana, somatic, symmetry, mistype, smutty, smite, Sumter, smut, steppe
#L=4
with testFile.txt
iinternational
I say this reelly.
hello
here is sometypo.
(Still not as nice as hunspell -u (https://stackoverflow.com/a/10778071/4124767). But hunspell misses some command line options I like.)
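Building on the same trick, a sketch that tags each complaint directly with its line number (assuming the same testFile.txt; the & and # lines are aspell's misspelling reports, and each empty line marks the end of one input line):
aspell pipe < testFile.txt |
  awk '/^$/ { line++; next } /^[&#]/ { print line+1 ": " $2 }'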
For others using aspell with one of the filter modes (tex, html, etc), here's a way to only print line numbers for misspelled words in the filtered text. So for example, it won't print misspellings in the comments.
ASPELL_ARGS="--mode=html --personal=./.aspell.en.pws"
for file in "$@"; do
for word in $(aspell $ASPELL_ARGS list < "$file" | sort -u); do
grep -no "\<$word\>" <(aspell $ASPELL_ARGS filter < "$file")
done | sort -n
done
This works because aspell filter does not delete empty lines. I realize this isn't using aspell pipe as requested by OP, but it's in the same spirit of making aspell print line numbers.

Easiest way to extract the urls from an html page using sed or awk only

I want to extract the URL from within the anchor tags of an html file.
This needs to be done in BASH using SED/AWK. No perl please.
What is the easiest way to do this?
You could also do something like this (provided you have lynx installed)...
Lynx versions < 2.8.8
lynx -dump -listonly my.html
Lynx versions >= 2.8.8 (courtesy of @condit)
lynx -dump -hiddenlinks=listonly my.html
You asked for it:
$ wget -O - http://stackoverflow.com | \
grep -io '<a href=['"'"'"][^"'"'"']*['"'"'"]' | \
sed -e 's/^<a href=["'"'"']//i' -e 's/["'"'"']$//i'
This is a crude tool, so all the usual warnings about attempting to parse HTML with regular expressions apply.
grep "<a href=" sourcepage.html
|sed "s/<a href/\\n<a href/g"
|sed 's/\"/\"><\/a>\n/2'
|grep href
|sort |uniq
The first grep looks for lines containing URLs. You can add more elements after it if you want to look only at local pages, i.e. no http, just relative paths.
The first sed adds a newline in front of each a href URL tag with the \n.
The second sed shortens each URL after the 2nd " in the line by replacing it with the /a tag, followed by a newline.
Both seds give you each URL on a single line, but there is garbage, so
the 2nd grep href cleans the mess up.
The sort and uniq give you one instance of each URL present in sourcepage.html.
With the Xidel - HTML/XML data extraction tool, this can be done via:
$ xidel --extract "//a/@href" http://example.com/
With conversion to absolute URLs:
$ xidel --extract "//a/resolve-uri(#href, base-uri())" http://example.com/
I made a few changes to Greg Bacon's solution:
cat index.html | grep -o '<a .*href=.*>' | sed -e 's/<a /\n<a /g' | sed -e 's/<a .*href=['"'"'"]//' -e 's/["'"'"'].*$//' -e '/^$/ d'
This fixes two problems:
We are matching cases where the anchor doesn't start with href as first attribute
We are covering the possibility of having several anchors in the same line
An example, since you didn't provide any sample
awk 'BEGIN{
  RS="</a>"
  IGNORECASE=1
}
{
  for(o=1;o<=NF;o++){
    if ( $o ~ /href/){
      gsub(/.*href=\042/,"",$o)
      gsub(/\042.*/,"",$o)
      print $(o)
    }
  }
}' index.html
You can do it quite easily with the following regex, which is quite good at finding URLs:
\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))
I took it from John Gruber's article on how to find URLs in text.
That lets you find all URLs in a file f.html as follows:
cat f.html | grep -o \
-E '\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))'
I am assuming you want to extract a URL from some HTML text, and not parse HTML (as one of the comments suggests). Believe it or not, someone has already done this.
OT: The sed website has a lot of good information and many interesting/crazy sed scripts. You can even play Sokoban in sed!
This is my first post, so I'll try to do my best explaining why I post this answer:
Of the first 7 most-voted answers, 4 include grep even though the post explicitly says "using sed or awk only".
Even though the post requires "No perl please", those answers use Perl-style regexes inside grep, which ties back to the previous point.
And because this is the simplest way (as far as I know, and as was required) to do it in BASH.
So here comes the simplest script, using GNU grep 2.28:
grep -Po 'href="\K.*?(?=")'
About the \K switch: no info was found in the man and info pages, so I came here for the answer.
The \K switch discards everything matched before it (here the href=" key itself) from the reported match.
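A quick sketch of what \K does (the sample anchor tag is made up):
echo '<a href="https://example.com/">x</a>' | grep -Po 'href="\K.*?(?=")'
# prints: https://example.com/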
Bear in mind the following advice from the man pages:
"This is highly experimental and grep -P may warn of unimplemented features."
Of course, you can modify the script to meet your tastes or needs, but I found it pretty well suited to what was requested in the post, and to what many of us need...
I hope you folks find it very useful.
Thanks!
In bash, the following should work. Note that it doesn't use sed or awk, but uses tr and grep, both very standard and not perl ;-)
$ cat source_file.html | tr '"' '\n' | tr "'" '\n' | grep -e '^https://' -e '^http://' -e'^//' | sort | uniq
for example:
$ curl "https://www.cnn.com" | tr '"' '\n' | tr "'" '\n' | grep -e '^https://' -e '^http://' -e'^//' | sort | uniq
generates
//s3.amazonaws.com/cnn-sponsored-content
//twitter.com/cnn
https://us.cnn.com
https://www.cnn.com
https://www.cnn.com/2018/10/27/us/new-york-hudson-river-bodies-identified/index.html\
https://www.cnn.com/2018/11/01/tech/google-employee-walkout-andy-rubin/index.html\
https://www.cnn.com/election/2016/results/exit-polls\
https://www.cnn.com/profiles/frederik-pleitgen\
https://www.facebook.com/cnn
etc...
Expanding on kerkael's answer:
grep "<a href=" sourcepage.html
|sed "s/<a href/\\n<a href/g"
|sed 's/\"/\"><\/a>\n/2'
|grep href
|sort |uniq
# now adding some more
|grep -v "<a href=\"#"
|grep -v "<a href=\"../"
|grep -v "<a href=\"http"
The first grep I added removes links to local bookmarks.
The second removes relative links to upper levels.
The third removes absolute links that start with http, leaving only local ones.
Pick and choose which one of these you use as per your specific requirements.
Go over it with a first pass, replacing the start of the URLs (http) with a newline (\nhttp). Then you have guaranteed that your link starts at the beginning of the line and is the only URL on the line. The rest should be easy; here is an example:
sed "s/http/\nhttp/g" <(curl "http://www.cnn.com") | sed -n "s/\(^http[s]*:[a-Z0-9/.=?_-]*\)\(.*\)/\1/p"
alias lsurls='_(){ sed "s/http/\nhttp/g" "${1}" | sed -n "s/\(^http[s]*:[a-Z0-9/.=?_-]*\)\(.*\)/\1/p"; }; _'
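Usage sketch for the alias (saved_page.html is just a placeholder for a locally saved page):
lsurls saved_page.html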
You can try:
curl --silent -u "<username>:<password>" http://<NAGIOS_HOST/nagios/cgi-bin/status.cgi|grep 'extinfo.cgi?type=1&host='|grep "status"|awk -F'</A>' '{print $1}'|awk -F"'>" '{print $3"\t"$1}'|sed 's/<\/a> <\/td>//g'| column -c2 -t|awk '{print $1}'
That's how I tried it, for a better view: create a shell file and give the link as a parameter; it will create a temp2.txt file.
a=$1
lynx -listonly -dump "$a" > temp
awk 'FNR > 2 {print$2}' temp > temp2.txt
rm temp
>sh test.sh http://link.com
Eschewing the awk/sed requirement:
urlextract is made just for such a task (documentation).
urlview is an interactive CLI solution (github repo).
I scrape websites using Bash exclusively to verify the http status of client links and report back to them on errors found. I've found awk and sed to be the fastest and easiest to understand. Props to the OP.
curl -Lk https://example.com/ | sed -r 's~(href="|src=")([^"]+).*~\n\1\2~g' | awk '/^(href|src)/,//'
Because sed works on a single line, this will ensure that all URLs are formatted properly on a new line, including any relative URLs. The first sed finds all href and src attributes and puts each on a new line while simultaneously removing the rest of the line, including the closing double quote (") at the end of the link.
Notice I'm using a tilde (~) in sed as the defining separator for substitution. This is preferred over a forward slash (/). The forward slash can confuse the sed substitution when working with html.
The awk finds any line that begins with href or src and outputs it.
Once the content is properly formatted, awk or sed can be used to collect any subset of these links. For example, you may not want base64 images, instead you want all the other images. Our new code would look like:
curl -Lk https://example.com/ | sed -r 's~(href="|src=")([^"]+).*~\n\1\2~g' | awk '/^(href|src)/,//' | awk '/^src="[^d]/,//'
Once the subset is extracted, just remove the href=" or src="
sed -r 's~(href="|src=")~~g'
This method is extremely fast and I use these in Bash functions to format the results across thousands of scraped pages for clients that want someone to review their entire site in one scrape.
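Putting those pieces together, a sketch of the complete pipeline (example.com is the same placeholder as above), ending with the bare, de-duplicated URLs:
curl -Lk https://example.com/ | sed -r 's~(href="|src=")([^"]+).*~\n\1\2~g' | awk '/^(href|src)/,//' | sed -r 's~(href="|src=")~~g' | sort -u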