How to extract data from an HTML table in a shell script?

I am trying to create a Bash script that extracts data from an HTML table.
Below is an example of the table I need to extract data from:
<table border=1>
<tr>
<td><b>Component</b></td>
<td><b>Status</b></td>
<td><b>Time / Error</b></td>
</tr>
<tr><td>SAVE_DOCUMENT</td><td>OK</td><td>0.406 s</td></tr>
<tr><td>GET_DOCUMENT</td><td>OK</td><td>0.332 s</td></tr>
<tr><td>DVK_SEND</td><td>OK</td><td>0.001 s</td></tr>
<tr><td>DVK_RECEIVE</td><td>OK</td><td>0.001 s</td></tr>
<tr><td>GET_USER_INFO</td><td>OK</td><td>0.143 s</td></tr>
<tr><td>NOTIFICATIONS</td><td>OK</td><td>0.001 s</td></tr>
<tr><td>ERROR_LOG</td><td>OK</td><td>0.001 s</td></tr>
<tr><td>SUMMARY_STATUS</td><td>OK</td><td>0.888 s</td></tr>
</table>
And I want the BASH script to output it like so:
SAVE_DOCUMENT OK 0.475 s
GET_DOCUMENT OK 0.345 s
DVK_SEND OK 0.002 s
DVK_RECEIVE OK 0.001 s
GET_USER_INFO OK 4.465 s
NOTIFICATIONS OK 0.001 s
ERROR_LOG OK 0.002 s
SUMMARY_STATUS OK 5.294 s
How to do it?
So far I have tried using sed, but I don't know it very well. I excluded the table header (Component, Status, Time / Error) with grep "<tr><td>", so that only lines starting with <tr><td> are selected for the next parsing step (sed).
This is what I used: sed 's#<\([^<>][^<>]*\)>\([^<>]*\)</\1>#\2#g'
But the <tr> tags still remain, and it also won't separate the fields. In other words, the result of this script is:
<tr>SAVE_DOCUMENTOK0.406 s</tr>
The full command of the script I'm working on is:
cat $FILENAME | grep "<tr><td>" | sed 's#<\([^<>][^<>]*\)>\([^<>]*\)</\1>#\2#g'

Go with (g)awk, it's capable :-). Here is a solution, but please note: it only works with the exact HTML table format you posted.
awk -F "</*td>|</*tr>" '/<\/*t[rd]>.*[A-Z][A-Z]/ {print $3, $5, $7 }' FILE
Here you can see it in action: https://ideone.com/zGfLe
Some explanation:
-F sets the input field separator to a regexp (any opening or closing tr or td tag),
then it works only on lines that match those tags AND contain at least two consecutive uppercase letters,
then it prints the needed fields.
HTH
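For completeness, the grep + sed approach from the question can also be made to work with one small change. This is only a sketch and, like the awk above, it assumes the rows keep the exact single-line <tr><td>...</td></tr> format posted:
grep '<tr><td>' "$FILENAME" | sed -e 's:</td><td>: :g' -e 's:</*t[rd]>::g'
The first expression turns the inner </td><td> boundaries into spaces; the second strips the remaining <tr>, </tr>, <td> and </td> tags, leaving e.g. SAVE_DOCUMENT OK 0.406 s.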

You can use bash xpath (XML::XPath perl module) to accomplish that task very easily:
xpath -e '//tr[position()>1]' test_input1.xml 2> /dev/null | sed -e 's/<\/*tr>//g' -e 's/<td>//g' -e 's/<\/td>/ /g'

You may use the html2text command and format the columns via column, e.g.:
$ html2text table.html | column -ts'|'
Component Status Time / Error
SAVE_DOCUMENT OK 0.406 s
GET_DOCUMENT OK 0.332 s
DVK_SEND OK 0.001 s
DVK_RECEIVE OK 0.001 s
GET_USER_INFO OK 0.143 s
NOTIFICATIONS OK 0.001 s
ERROR_LOG OK 0.001 s
SUMMARY_STATUS OK 0.888 s
then parse it further from there (e.g. cut, awk, ex).
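For example, to pull just each component name and its time out of the aligned output shown above, a quick sketch keyed to that sample layout (where the time occupies the last two columns) could be:
$ html2text table.html | column -ts'|' | awk 'NR > 1 {print $1, $3, $4}'
(NR > 1 skips the header row.)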
In case you'd like to sort it first, you can use ex.

There are a lot of ways of doing this but here's one:
grep '^<tr><td>' < $FILENAME \
| sed \
-e 's:<tr>::g' \
-e 's:</tr>::g' \
-e 's:</td>::g' \
-e 's:<td>: :g' \
| cut -c2-
You could use more sed(1) (-e 's:^ ::') instead of the cut -c2- to remove the leading space but cut(1) doesn't get as much love as it deserves. And the backslashes are just there for formatting, you can remove them to get a one liner or leave them in and make sure that they're immediately followed by a newline.
The basic strategy is to slowly pull the HTML apart piece by piece rather than trying to do it all at once with a single incomprehensible pile of regex syntax.
Parsing HTML with a shell pipeline isn't the best idea ever, but you can do it if the HTML is known to come in a very specific format. If there will be variation, you'd be better off with a real HTML parser in Perl, Ruby, Python, or even C.

A solution based on multi-platform web-scraping CLI xidel and XPath:
Tip of the hat to Reino for providing the simpler XPath equivalent to the original XQuery solution.[1]
xidel -s -e '//tr[position() > 1]/join(td)' file
With the sample input, this yields:
SAVE_DOCUMENT OK 0.406 s
GET_DOCUMENT OK 0.332 s
DVK_SEND OK 0.001 s
DVK_RECEIVE OK 0.001 s
GET_USER_INFO OK 0.143 s
NOTIFICATIONS OK 0.001 s
ERROR_LOG OK 0.001 s
SUMMARY_STATUS OK 0.888 s
Explanation:
//tr[position() > 1] matches the tr elements starting with the 2nd one, so as to skip the header row, and join(td) joins the values of each matching element's child td elements with an implied single space as the separator.
-s makes xidel silent (suppresses output of status information).
While html2text is convenient for display of the extracted data, providing machine-parseable output is non-trivial, unfortunately:
html2text file | awk -F' *\\|' 'NR>2 {gsub(/^\||.\b/, ""); $1=$1; print}'
The Awk command removes the hidden \b-based (backspace-based) sequences that html2text outputs by default, and parses the lines into fields by |, and then outputs them with a space as the separator (a space is Awk's default output field separator; to change it to a tab, for instance, use -v OFS='\t').
Note: Use of -nobs to suppress backspace sequences at the source is not an option, because you then won't be able to distinguish between the hidden-by-default _ instances used for padding and actual _ characters in the data.
Note: Given that html2text seemingly invariably uses | as the column separator, the above will only work robustly if there are no | instances in the data being extracted.
[1] xidel -s --xquery 'for $tr in //tr[position()>1] return join($tr/td, " ")' file

You can parse the file using Ex editor (part of Vim) by removing HTML tags, e.g.:
$ ex -s +'%s/<[^>]\+>/ /g' +'v/0/d' +'wq! /dev/stdout' table.html
SAVE_DOCUMENT OK 0.406 s
GET_DOCUMENT OK 0.332 s
DVK_SEND OK 0.001 s
DVK_RECEIVE OK 0.001 s
GET_USER_INFO OK 0.143 s
NOTIFICATIONS OK 0.001 s
ERROR_LOG OK 0.001 s
SUMMARY_STATUS OK 0.888 s
Here is a shorter version that prints the whole file without HTML tags:
$ ex +'%s/<[^>]\+>/ /g|%p' -scq! table.html
Explanation:
%s/<[^>]\+>/ /g - Substitutes every HTML tag with a space.
v/0/d - Deletes all lines that do not contain a 0 (which removes the header cells and the now-empty tag-only lines).
wq! /dev/stdout - Writes the buffer to standard output and quits the editor.

For the sake of completeness, pandoc does a good job when you have extracted the HTML table. For example,
pandoc --from html --to plain table.txt
---------------- -------- --------------
Component Status Time / Error
SAVE_DOCUMENT OK 0.406 s
GET_DOCUMENT OK 0.332 s
DVK_SEND OK 0.001 s
DVK_RECEIVE OK 0.001 s
GET_USER_INFO OK 0.143 s
NOTIFICATIONS OK 0.001 s
ERROR_LOG OK 0.001 s
SUMMARY_STATUS OK 0.888 s
---------------- -------- --------------
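From there, the data rows can be filtered out of the plain-text rendering. A small sketch keyed to this sample output (where every status happens to be OK) might be:
pandoc --from html --to plain table.txt | awk '/^[A-Z_]+ +OK/ {print $1, $2, $3, $4}'
which skips the dashed rules and the header line and prints rows like SAVE_DOCUMENT OK 0.406 s.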

Parsing JSON with BusyBox tools

I'm working on a blog theme for Hugo installable on Android (BusyBox via Termux) and plan to create a BusyBox Docker image and copy my theme and the hugo binary to it for use on ARM.
Theme releases are archived and made available on NPM, and the tools available in BusyBox have allowed me to reliably parse the version from the JSON metadata:
meta=$(wget -qO - https://registry.npmjs.org/package/latest)
vers=$(echo "$meta" | egrep -o "\"version\".*[^,]*," | cut -d ',' -f1 | cut -d ':' -f2 | tr -d '" ')
Now I would like to copy the dist value from the meta into a text file for use in Hugo:
"dist": {
"integrity": "sha512-3MH2/UKYPjr+CTC85hWGg/N3GZmSlgBWXzdXHroDfJRnEmcBKkvt1oiadN8gzCCppqCQhwtmengZzg0imm1mtg==",
"shasum": "a159699b1c5fb006a84457fcdf0eb98d72c2eb75",
"tarball": "https://registry.npmjs.org/after-dark/-/after-dark-6.4.1.tgz",
"fileCount": 98,
"unpackedSize": 5338189
},
Above pretty-printed for clarity. The actual metadata is compressed.
Is there a way I can reuse the version parsing logic above to also pull the dist field value?
Proper, robust parsing requires a tool like jq, where it could be as simple as jq '.version' ip.txt and jq '.dist' ip.txt.
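For instance, reusing the $meta variable from the question (and writing to dist.txt, which is just an example target file; note that jq is usually not part of a stock BusyBox and would have to be installed separately):
vers=$(echo "$meta" | jq -r '.version')
echo "$meta" | jq '.dist' > dist.txt   # dist.txt is an example filename; jq pretty-prints the "dist" object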
You could use sed but use it at your own risk
$ sed -n 's/.*"version":"\([^"]*\).*/\1/p' ip.txt
6.4.1
$ sed -n 's/.*\("dist":{[^}]*}\).*/\1/p' ip.txt
"dist":{"integrity":....
....}
-n option to disable automatic printing
the p flag of the s command prints only when the substitution succeeds; this means the output is empty, rather than the entire input line, when something goes wrong
.*"version":"\([^"]*\).* matches the entire line, capturing the data between the double quotes after the version tag - you'll have to adjust the regex if whitespace or other valid JSON formatting is allowed
.*\("dist":{[^}]*}\).* matches the entire line, capturing everything from "dist":{ to the first } after it - so it is not suitable if the value itself can contain }

Similar strings, different results

I'm creating a Bash script to parse the air pollution levels from the webpage:
http://aqicn.org/city/beijing/m/
There is a lot of stuff in the file, but this is the relevant bit:
"iaqi":[{"p":"pm25","v":[59,21,112],"i":"Beijing pm25 (fine
particulate matter) measured by U.S Embassy Beijing Air Quality
Monitor
(\u7f8e\u56fd\u9a7b\u5317\u4eac\u5927\u4f7f\u9986\u7a7a\u6c14\u8d28\u91cf\u76d1\u6d4b).
Values are converted from \u00b5g/m3 to AQI levels using the EPA
standard."},{"p":"pm10","v":[15,5,69],"i":"Beijing pm10
(respirable particulate matter) measured by Beijing Environmental
Protection Monitoring Center
I want the script to parse and display 2 numbers: the current PM2.5 and PM10 levels (the first values in the "v" arrays above: 59 and 15).
CITY="beijing"
AQIDATA=$(wget -q 0 http://aqicn.org/city/$CITY/m/ -O -)
PM25=$(awk -v FS="(\"p\":\"pm25\",\"v\":\\\[|,[0-9]+)" '{print $2}' <<< $AQIDATA)
PM100=$(awk -v FS="(\"p\":\"pm10\",\"v\":\\\[|,[0-9]+)" '{print $2}' <<< $AQIDATA)
echo $PM25 $PM100
Even though I can get PM2.5 levels to display correctly, I cannot get PM10 levels to display. I cannot understand why, because the strings are similar.
Anyone here able to explain?
The following approach is based on two steps:
(1) Extracting the relevant JSON;
(2) Extracting the relevant information from the JSON using a JSON-aware tool -- here jq.
(1) Ideally, the web service would provide a JSON API that would allow one to obtain the JSON directly, but as the URL you have is intended for viewing with a browser, some form of screen-scraping is needed. There is a certain amount of brittleness to such an approach, so here I'll just provide something that currently works:
wget -O - http://aqicn.org/city/beijing/m |
gawk 'BEGIN{RS="function"}
$1 ~/getAqiModel/ {
sub(/.*var model=/,"");
sub(/;return model;}/,"");
print}'
(gawk or an awk that supports multi-character RS can be used; if you have another awk, then first split on "function", using e.g.:
sed $'s/function/\\\n/g' # three backslashes )
The output of the above can be piped to the following jq command, which performs the filtering envisioned in (2) above.
(2)
jq -c '.iaqi | .[]
| select(.p? =="pm25" or .p? =="pm10") | [.p, .v[0]]'
The result:
["pm25",59]
["pm10",15]
I think your problem is that you have a single line HTML file that contains a script that contains a variable that contains the data you are looking for.
Your field delimiters are either "p":"pm25","v":[ (or the corresponding "p":"pm10","v":[) or a comma followed by some digits.
For pm25 this works, because it is the first, and there are no occurrences of ,21 or something similar before it.
However, for pm10, there are some that are associated with pm25 ahead of it. So the second field contains the empty string between ,21 and ,112.
@karakfa has a hack that seems to work, but he doesn't explain very well why it works.
What he does is use awk's record separator (which is usually a newline) and sets it to either of :, ,, or [. So in your case, one of the records would be "pm25", because it is preceded by a colon, which is a separator, and succeeded by a comma, also a separator.
Once it hits the matching record ("pm25") it sets a counter to 4. Then, for this record and each of the following ones, it counts the counter down: "pm25" itself, "v", the empty string between : and [, until it reaches the record holding the number you want to output: 4 && !3 is false, 3 && !2 is false, 2 && !1 is false, but 1 && !0 is true. Since there is no action block, awk simply prints this record, which is the value you want.
A more robust approach would probably be to use XPath to find the script, then use a JSON parser or similar to get the value.
chw21's helpful answer explains why your approach didn't work.
peak's helpful answer is the most robust, because it employs proper JSON parsing.
If you don't want to or can't use third-party utility jq for JSON parsing, I suggest using sed rather than awk, because awk is not a good fit for field-based parsing of this data.
$ sed -E 's/^.*"pm25"[^[]+\[([0-9]+).+"pm10"[^[]+\[([0-9]+).*$/\1 \2/' <<< "$AQIDATA"
59 15
The above should work with both GNU and BSD/OSX sed.
To read the result into variables:
read pm25 pm10 < \
<(sed -E 's/^.*"pm25"[^[]+\[([0-9]+).+"pm10"[^[]+\[([0-9]+).*$/\1 \2/' <<< "$AQIDATA")
Note how I've chosen lowercase variable names, because it's best to avoid all upper-case variables in shell programming, so as to avoid conflicts with special shell and environment variables.
If you can't rely on the order of the values in the source string, use two separate sed commands:
pm25=$(sed -E 's/^.*"pm25"[^[]+\[([0-9]+).*$/\1/' <<< "$AQIDATA")
pm10=$(sed -E 's/^.*"pm10"[^[]+\[([0-9]+).*$/\1/' <<< "$AQIDATA")
awk to the rescue!
If you have to, you can use this hacky way using smart counters with hand-crafted delimiters. Setting RS instead of FS transfers looping through fields to awk itself. Multi-char RS is not available for all awks (gawk supports it).
$ awk -v RS='[:,[]' '$0=="\"pm25\""{c=4} c&&!--c' file
59
$ awk -v RS='[:,[]' '$0=="\"pm10\""{c=4} c&&!--c' file
15
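To wire this back into the script from the question (assuming $AQIDATA already holds the downloaded page, as it does there), the file argument can be replaced by a here-string; this is only a sketch, not a robust parser:
PM25=$(awk -v RS='[:,[]' '$0=="\"pm25\""{c=4} c&&!--c' <<< "$AQIDATA")
PM100=$(awk -v RS='[:,[]' '$0=="\"pm10\""{c=4} c&&!--c' <<< "$AQIDATA")
echo $PM25 $PM100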

How can I extract td from html in bash?

I am querying London postcode data from geonames:
http://www.geonames.org/postalcode-search.html?q=london&country=GB
I want to turn the output into a list of just the postcode identifiers (Bethnal Green, Islington, etc.). What is the best way to extract just the names in bash?
I'm not sure whether you mean this newline-delimited list (or one in brackets and comma-delimited):
html='http://www.geonames.org/postalcode-search.html?q=london&country=GB'
wget -q "$html" -O - |
w3m -dump -T 'text/html'|
sed -nr 's/^ +[0-9]+ +(.*) +[A-Z]+[0-9]+ +United Kingdom.*/\1/p'
w3m is a: "WWW browsable pager with excellent tables/frames support"
output (first 10 lines)
London Bridge
Kilburn
Ealing
Wandsworth
Pimlico
Kensington
Leyton
Leytonstone
Plaistow
Poplar
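If you instead wanted the bracketed, comma-delimited form mentioned at the top, one rough option (a sketch reusing the same w3m/sed pipeline) is to join the lines afterwards:
wget -q "$html" -O - |
w3m -dump -T 'text/html' |
sed -nr 's/^ +[0-9]+ +(.*) +[A-Z]+[0-9]+ +United Kingdom.*/\1/p' |
paste -sd, - | sed 's/^/[/; s/$/]/'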
I see the site offers (but not for free) web services with XML or JSON data... That would be the best way to go, since the HTML page is not meant to be parsed (easily).
Anyway, nothing is impossible; nonetheless, using strictly bash commands alone would be very hard, if not impossible, and often several other common tools are piped together to achieve the result. But then, sometimes it turns out to be more convenient to stick to a single tool such as Perl, instead of combining cat, grep, awk, sed and whatever else.
Something like
sed -e 's/>/>\n/g' region.html |
egrep -i "^\s*[A-Z]+[0-9]+</td>" |
sed -e 's|</td>||g'
worked for extracting 200 lines, assuming a specific format for the postal code.
Added:
If there's no limit to the software you can use to parse the data, then you could use a line like
wget -q "http://www.geonames.org/postalcode-search.html?q=london&country=GB" -O - |
sgrep '"<table class=\"restable\"" .. "</table>"' |
sed -e 's|/tr>|/tr>\n|g; s|</td>\s*<td[^>]*>|;|g; s|</th>\s*<th[^>]*>|;|g; s|<[^>]\+>||g; s|;; .*$| |g' |
grep -v "^\s*$" |
tail -n+2 | cut -d";" -f2,3
which extracts places and postal codes separated by a ; as in a CSV. The same can be done with awk:
wget -q "$html" -O - |
w3m -dump -T 'text/html' |
awk '/\s*[0-9]+ / { print substr($0, 11, 16); }'
which is based on the answer by Peter.O and extracts the same data... and so on. But in these cases, since you are not limited to the minimal tools found on most Unix or GNU systems, I would stick to one single widespread tool, e.g. perl.
If you have access to the mojo tool from the Mojolicious project this all becomes quite a lot easier:
mojo get 'http://www.geonames.org/postalcode-search.html?q=london&country=GB' '.restable > tr > td:nth-child(2)' text | grep ^'[a-zA-Z]'
The grep at the end is just to filter out some junk results; almost (but not quite) every other line is bad, because the page structure is slightly inconsistent. Otherwise you could say tr:nth-child(even) and get nice results.
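Following that hint, the grep-free variant would look roughly like this (an untested sketch of the selector change):
mojo get 'http://www.geonames.org/postalcode-search.html?q=london&country=GB' '.restable > tr:nth-child(even) > td:nth-child(2)' text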

Can aspell output line number and not offset in pipe mode?

Can aspell output the line number instead of the offset in pipe mode for HTML and XML files? I can't read the file line by line, because in that case aspell can't identify a closing tag (if the tag sits on the next line).
This will output all occurrences of misspelt words with line numbers:
# Get aspell output...
<my_document.txt aspell pipe list -d en_GB --personal=./aspell.ignore.txt |
# Proccess the aspell output...
grep '[a-zA-Z]\+ [0-9]\+ [0-9]\+' -oh | \
grep '[a-zA-Z]\+' -o | \
while read word; do grep -on "\<$word\>" my_document.txt; done
Where:
my_document.txt is your original document
en_GB is your primary dictionary choice (e.g. try en_US)
aspell.ignore.txt is an aspell personal dictionary (example below)
aspell_output.txt is the output of aspell in pipe mode (ispell style)
result.txt is a final results file
aspell.ignore.txt example:
personal_ws-1.1 en 500
foo
bar
example results.txt output (for an en_GB dictionary):
238:color
302:writeable
355:backends
433:dataonly
You can also print the whole line by changing the last grep -on into grep -n.
This is just an idea, I haven't really tried it yet (I'm on a windows machine :(). But maybe you could pipe the html file through head (with byte limit) and count newlines using grep to find your line number. It's neither efficient nor pretty, but it might just work.
cat icantspell.html | head -c <offset from aspell> | egrep -Uc "$"
I use the following script to perform spell-checking and to work around the awkward output of aspell -a / ispell. At the same time, the script also works around the problem that ordinals like 2nd aren't recognized by aspell, by simply ignoring everything that aspell reports which is not a word on its own.
#!/bin/bash
# Don't let a failing command in the pipe abort the whole loop.
set +o pipefail
# Use color only when stdout is a terminal.
if [ -t 1 ] ; then
  color="--color=always"
fi
! for file in "$@" ; do
  <"$file" aspell pipe list -p ./dict --mode=html |   # spell-check with the HTML filter
  grep '[[:alpha:]]\+ [0-9]\+ [0-9]\+' -oh |          # keep the "word count offset" part of each error line
  grep '[[:alpha:]]\+' -o |                           # ...then just the misspelled word itself
  while read word ; do
    grep $color -n "\<$word\>" "$file"                # locate it in the file, with line numbers
  done
done | grep .
You even get colored output if the stdout of the script is a terminal, and you get an exit status of 1 in case the script found spelling mistakes, otherwise the exit status of the script is 0.
Also, the script protects itself against pipefail, which is a somewhat popular option to set, e.g. in a Makefile, but doesn't work for this script. Last but not least, this script explicitly uses [[:alpha:]] instead of [a-zA-Z], which is less confusing when it also matches non-ASCII characters like the German äöüÄÖÜß. [a-zA-Z] may match those too, depending on the locale, but that tends to come as a surprise.
aspell pipe / aspell -a / ispell output one empty line for each input line (after reporting the errors of the line).
Demonstration printing the line number with awk:
$ aspell pipe < testFile.txt |
awk '/^$/ { countedLine=countedLine+1; print "#L=" countedLine; next; } //'
produces this output:
@(#) International Ispell Version 3.1.20 (but really Aspell 0.60.7-20110707)
& iinternational 7 0: international, Internationale, internationally, internationals, intentional, international's, Internationale's
#L=1
*
*
*
& reelly 22 11: Reilly, really, reel, rely, rally, relay, resell, retell, Riley, rel, regally, Riel, freely, real, rill, roll, reels, reply, Greeley, cruelly, reel's, Reilly's
#L=2
*
#L=3
*
*
& sometypo 18 8: some typo, some-typo, setup, sometime, someday, smote, meetup, smarty, stupor, Smetana, somatic, symmetry, mistype, smutty, smite, Sumter, smut, steppe
#L=4
with testFile.txt
iinternational
I say this reelly.
hello
here is sometypo.
(Still not as nice as hunspell -u (https://stackoverflow.com/a/10778071/4124767). But hunspell misses some command line options I like.)
For others using aspell with one of the filter modes (tex, html, etc), here's a way to only print line numbers for misspelled words in the filtered text. So for example, it won't print misspellings in the comments.
ASPELL_ARGS="--mode=html --personal=./.aspell.en.pws"
for file in "$#"; do
for word in $(aspell $ASPELL_ARGS list < "$file" | sort -u); do
grep -no "\<$word\>" <(aspell $ASPELL_ARGS filter < "$file")
done | sort -n
done
This works because aspell filter does not delete empty lines. I realize this isn't using aspell pipe as requested by OP, but it's in the same spirit of making aspell print line numbers.

Easiest way to extract the urls from an html page using sed or awk only

I want to extract the URL from within the anchor tags of an html file.
This needs to be done in BASH using SED/AWK. No perl please.
What is the easiest way to do this?
You could also do something like this (provided you have lynx installed)...
Lynx versions < 2.8.8
lynx -dump -listonly my.html
Lynx versions >= 2.8.8 (courtesy of @condit)
lynx -dump -hiddenlinks=listonly my.html
You asked for it:
$ wget -O - http://stackoverflow.com | \
grep -io '<a href=['"'"'"][^"'"'"']*['"'"'"]' | \
sed -e 's/^<a href=["'"'"']//i' -e 's/["'"'"']$//i'
This is a crude tool, so all the usual warnings about attempting to parse HTML with regular expressions apply.
grep "<a href=" sourcepage.html
|sed "s/<a href/\\n<a href/g"
|sed 's/\"/\"><\/a>\n/2'
|grep href
|sort |uniq
The first grep looks for lines containing URLs. You can add more patterns after it if you want to look only at local pages, i.e. no http, just relative paths.
The first sed adds a newline in front of each a href URL tag, using \n.
The second sed shortens each URL after the 2nd " on the line, replacing it with the closing /a tag and a newline.
Both seds give you each URL on a single line, but there is still garbage, so
the second grep href cleans the mess up.
The sort and uniq give you one instance of each URL present in sourcepage.html.
With the Xidel - HTML/XML data extraction tool, this can be done via:
$ xidel --extract "//a/#href" http://example.com/
With conversion to absolute URLs:
$ xidel --extract "//a/resolve-uri(#href, base-uri())" http://example.com/
I made a few changes to Greg Bacon's solution:
cat index.html | grep -o '<a .*href=.*>' | sed -e 's/<a /\n<a /g' | sed -e 's/<a .*href=['"'"'"]//' -e 's/["'"'"'].*$//' -e '/^$/ d'
This fixes two problems:
We are matching cases where the anchor doesn't start with href as first attribute
We are covering the possibility of having several anchors in the same line
An example, since you didn't provide any sample
awk 'BEGIN{
  RS="</a>"
  IGNORECASE=1
}
{
  for(o=1;o<=NF;o++){
    if ( $o ~ /href/){
      gsub(/.*href=\042/,"",$o)
      gsub(/\042.*/,"",$o)
      print $(o)
    }
  }
}' index.html
You can do it quite easily with the following regex, which is quite good at finding URLs:
\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))
I took it from John Gruber's article on how to find URLs in text.
That lets you find all URLs in a file f.html as follows:
cat f.html | grep -o \
-E '\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))'
I am assuming you want to extract a URL from some HTML text, and not parse HTML (as one of the comments suggests). Believe it or not, someone has already done this.
OT: The sed website has a lot of good information and many interesting/crazy sed scripts. You can even play Sokoban in sed!
This is my first post, so I'll try my best to explain why I'm posting this answer:
Of the first 7 most-voted answers, 4 include grep, even though the post explicitly says "using sed or awk only".
Even though the post requires "No perl please", that point still stands, because this only uses a Perl-style regex inside grep, not Perl itself.
And because this is the simplest way (as far as I know, and as was required) to do it in Bash.
So here comes the simplest script, using GNU grep 2.28:
grep -Po 'href="\K.*?(?=")'
About the \K switch: no info was to be found in the man and info pages, so I came here for the answer...
The \K escape discards the previously matched characters (the href=" key itself) from the reported match.
Bear in mind the following advice from the man page:
"This is highly experimental and grep -P may warn of unimplemented features."
Of course, you can modify the script to meet your tastes or needs, but I found it pretty straightforward for what was requested in the post, and probably for many of us...
I hope you folks find it very useful.
Thanks!
In bash, the following should work. Note that it doesn't use sed or awk, but uses tr and grep, both very standard and not perl ;-)
$ cat source_file.html | tr '"' '\n' | tr "'" '\n' | grep -e '^https://' -e '^http://' -e'^//' | sort | uniq
for example:
$ curl "https://www.cnn.com" | tr '"' '\n' | tr "'" '\n' | grep -e '^https://' -e '^http://' -e'^//' | sort | uniq
generates
//s3.amazonaws.com/cnn-sponsored-content
//twitter.com/cnn
https://us.cnn.com
https://www.cnn.com
https://www.cnn.com/2018/10/27/us/new-york-hudson-river-bodies-identified/index.html\
https://www.cnn.com/2018/11/01/tech/google-employee-walkout-andy-rubin/index.html\
https://www.cnn.com/election/2016/results/exit-polls\
https://www.cnn.com/profiles/frederik-pleitgen\
https://www.facebook.com/cnn
etc...
Expanding on kerkael's answer:
grep "<a href=" sourcepage.html
|sed "s/<a href/\\n<a href/g"
|sed 's/\"/\"><\/a>\n/2'
|grep href
|sort |uniq
# now adding some more
|grep -v "<a href=\"#"
|grep -v "<a href=\"../"
|grep -v "<a href=\"http"
The first grep I added removes links to local bookmarks.
The second removes relative links to upper levels.
The third removes links that do start with http (i.e. absolute links), keeping only local and relative ones.
Pick and choose which one of these you use as per your specific requirements.
Make a first pass replacing the start of the URLs (http) with a newline (\nhttp). Then you have guaranteed that each link starts at the beginning of a line and is the only URL on the line. The rest should be easy; here is an example:
sed "s/http/\nhttp/g" <(curl "http://www.cnn.com") | sed -n "s/\(^http[s]*:[a-Z0-9/.=?_-]*\)\(.*\)/\1/p"
alias lsurls='_(){ sed "s/http/\nhttp/g" "${1}" | sed -n "s/\(^http[s]*:[a-Z0-9/.=?_-]*\)\(.*\)/\1/p"; }; _'
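The second line wraps the same pipeline in an lsurls alias; usage would then be along these lines (page.html being whatever file you saved the page to):
curl -s "http://www.cnn.com" -o page.html   # page.html is just an example filename
lsurls page.html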
You can try:
curl --silent -u "<username>:<password>" http://<NAGIOS_HOST/nagios/cgi-bin/status.cgi|grep 'extinfo.cgi?type=1&host='|grep "status"|awk -F'</A>' '{print $1}'|awk -F"'>" '{print $3"\t"$1}'|sed 's/<\/a> <\/td>//g'| column -c2 -t|awk '{print $1}'
Here is how I did it for a better view: create a shell file and pass the link as a parameter; it will create a temp2.txt file.
a=$1
lynx -listonly -dump "$a" > temp
awk 'FNR > 2 {print$2}' temp > temp2.txt
rm temp
>sh test.sh http://link.com
Eschewing the awk/sed requirement:
urlextract is made just for such a task (documentation).
urlview is an interactive CLI solution (github repo).
I scrape websites using Bash exclusively to verify the http status of client links and report back to them on errors found. I've found awk and sed to be the fastest and easiest to understand. Props to the OP.
curl -Lk https://example.com/ | sed -r 's~(href="|src=")([^"]+).*~\n\1\2~g' | awk '/^(href|src)/,//'
Because sed works on a single line, this will ensure that all URLs are formatted properly on a new line, including any relative URLs. The first sed finds all href and src attributes and puts each on a new line while simultaneously removing the rest of the line, including the closing double quote (") at the end of the link.
Notice I'm using a tilde (~) in sed as the defining separator for substitution. This is preferred over a forward slash (/). The forward slash can confuse the sed substitution when working with html.
The awk finds any line that begins with href or src and outputs it.
Once the content is properly formatted, awk or sed can be used to collect any subset of these links. For example, you may not want base64 images, instead you want all the other images. Our new code would look like:
curl -Lk https://example.com/ | sed -r 's~(href="|src=")([^"]+).*~\n\1\2~g' | awk '/^(href|src)/,//' | awk '/^src="[^d]/,//'
Once the subset is extracted, just remove the href=" or src="
sed -r 's~(href="|src=")~~g'
This method is extremely fast and I use these in Bash functions to format the results across thousands of scraped pages for clients that want someone to review their entire site in one scrape.