Extract pdf from html using sed

I am writing a bash script that extracts PDF links from HTML and downloads the files. Here is the pipeline that does the extraction:
curl -s https://info.uqam.ca/\~privat/INF1070/ |
sed 's/.*href="//' |
sed 's/".*//' |
sed '/^[^\.]/d' |
sed '/\.[^p][^d][^f]$/d' |
sed '/^$/d' |
sed '/\/$/d'
Result:
./07b-reseau.pdf
./07a-reseau.pdf
./06b-script.pdf
./06a-script.pdf
./05-processus.pdf
./04b-regex.pdf
./181-quiz1-g1-sujet.pdf
./03b-fichiers-solution.pdf
./04a-regex.pdf
./03d-fichiers.pdf
./03c-fichiers.pdf
./03b-fichiers.pdf
./03a-fichiers.pdf
./02-shell.pdf
./01-intro.pdf
./01-intro.pdf
./02-shell.pdf
./03a-fichiers.pdf
./03b-fichiers.pdf
./03b-fichiers-solution.pdf
./03c-fichiers.pdf
./03d-fichiers.pdf
./04a-regex.pdf
./04b-regex.pdf
./05-processus.pdf
./06a-script.pdf
./06b-script.pdf
./07a-reseau.pdf
./07b-reseau.pdf
./181-quiz1-g1-sujet.pdf
It's working fine, but I was wondering if there is a better way (still using sed) to do this with fewer sed commands.
Thank you.

You can translate your original question into something like "How to output only captured groups with sed?". This one-liner should do the trick for you:
curl -s https://info.uqam.ca/\~privat/INF1070/ | sed -rn 's/.*href="(.*\.pdf)".*$/\1/p'
which produces the desired output.
Here the combination of the -n option (do not print) and the p flag (print what is matched) prints only the lines where a substitution takes place, based on the regex .*href="(.*\.pdf)".*$. The value of the href attribute (the capture group in parentheses) is back-referenced with \1, so the whole line is replaced with it.
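To see the -n/p combination in isolation, here is a quick check on made-up input (the second line prints nothing because no substitution takes place):
$ printf '%s\n' '<li><a href="./01-intro.pdf">intro</a></li>' 'no link here' | sed -rn 's/.*href="(.*\.pdf)".*$/\1/p'
./01-intro.pdf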

This might work for you (GNU sed):
sed -r '/\n/!s/href="(\.[^"]*\.pdf)"/\n\1\n/g;/\`[^\n]*\.pdf$/MP;D' file
This splits each PDF reference onto its own line (multiple lines within the pattern space) and then prints only the lines that end in .pdf.
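For the download step the question mentions, a minimal sketch could feed the extracted names back into curl; sort -u also collapses the duplicate entries visible in the output above:
# assumes the hrefs are relative to the listing page, as in the output above
curl -s https://info.uqam.ca/~privat/INF1070/ |
sed -rn 's/.*href="(.*\.pdf)".*$/\1/p' |
sort -u |
while read -r f; do
  curl -sO "https://info.uqam.ca/~privat/INF1070/$f"
done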

Related

Extract json value with sed

I have a JSON result and I would like to extract a string value without its double quotes:
{"value1":5.0,"value2":2.5,"value3":"2019-10-24T15:26:00.000Z","modifier":[]}
With this regex I can extract the value3 (2019-10-24T15:26:00.000Z) correctly:
sed -e 's/^.*"value3":"\([^"]*\)".*$/\1/'
How can I extract the "value2" result, a string without double quotes?
I need to do this with sed since I can't install jq. That's my problem.
With GNU sed for -E to enable EREs:
$ sed -E 's/.*"value3":"?([^,"]*)"?.*/\1/' file
2019-10-24T15:26:00.000Z
$ sed -E 's/.*"value2":"?([^,"]*)"?.*/\1/' file
2.5
With any POSIX sed:
$ sed 's/.*"value3":"\{0,1\}\([^,"]*\)"\{0,1\}.*/\1/' file
2019-10-24T15:26:00.000Z
$ sed 's/.*"value2":"\{0,1\}\([^,"]*\)"\{0,1\}.*/\1/' file
2.5
The above assumes you never have commas inside quoted strings.
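To see that caveat in action (made-up input): a comma inside the target string cuts the match short, because [^,"]* stops at the first comma:
$ echo '{"value3":"Oct 24, 2019","value2":2.5}' | sed -E 's/.*"value3":"?([^,"]*)"?.*/\1/'
Oct 24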
Just run jq, a command-line JSON processor:
$ json_data='{"value1":5.0,"value2":2.5,"value3":"2019-10-24T15:26:00.000Z","modifier":[]}'
$ jq '.value2' <(echo "$json_data")
2.5
with the key .value2 to access the value you are interested in.
This link summarizes why you should NOT use regex for parsing JSON (the same goes for XML/HTML and other data structures that can, in theory, be infinitely nested):
Regex for parsing single key: values out of JSON in Javascript
If you do not have jq available, you can use the following GNU grep command:
$ echo '{"value1":5.0,"value2":2.5,"value3":"2019-10-24T15:26:00.000Z","modifier":[]}' | grep -zoP '"value2":\s*\K[^\s,]*(?=\s*,)'
2.5
using the regex detailed here:
"value2":\s*\K[^\s,]*(?=\s*,)
demo: https://regex101.com/r/82J6Cb/1/
This will even work if the JSON is not linearized.
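A quick sanity check on a constructed pretty-printed (non-linearized) version of the same document; the trailing tr strips the NUL byte that -z appends to each match:
$ printf '{\n  "value1": 5.0,\n  "value2": 2.5,\n  "value3": "2019-10-24T15:26:00.000Z"\n}\n' | grep -zoP '"value2":\s*\K[^\s,]*(?=\s*,)' | tr -d '\0'
2.5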
With Python it is also pretty direct, and it should be installed by default on your machine; even if it is not Python 3 it should work:
$ cat data.json
{"value1":5.0,"value2":2.5,"value3":"2019-10-24T15:26:00.000Z","modifier":[]}
$ cat extract_value2.py
import json

# load the JSON document and print the value stored under "value2"
with open('data.json') as f:
    data = json.load(f)
print(data["value2"])
$ python extract_value2.py
2.5
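If you'd rather not keep a separate script file, the same thing works as a shell one-liner (assuming a python on PATH; the print form runs under both Python 2 and 3):
$ python -c 'import json,sys; print(json.load(sys.stdin)["value2"])' < data.json
2.5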
You can try this :
creds=$(eval aws secretsmanager get-secret-value --region us-east-1 --secret-id dpi/dev/hivemetastore --query SecretString --output text )
passwd=$(/bin/echo "${creds}" | /bin/sed -n 's/.*"password":"\(.*\)",/\1/p' | awk -F"\"" '{print $1}')
It is definitely possible to remove the awk part, though...
To extract all values in proper list form to a file, using sed (Linux):
sed 's/["{}\]//g' <your_file.json> | sed 's/,/\n/g' >> <your_new_file_to_save>
sed 's/regexp/replacement/g' inputFileName > outputFileName
In some versions of sed, the expression must be preceded by -e to indicate that an expression follows.
The s stands for substitute, while the g stands for global, which means that all matching occurrences in the line would be replaced.
Inside the brackets [ ] I've put the characters to be removed from the .json file.
The pipe character | is used to connect the output from one command to the input of another.
Then the last thing I did was substitute each , with \n, a line break.
If you want to show a single value, see the command below:
sed 's/["{}\]//g' <your_file.json> | sed 's/,/\n/g' | sed -n '/<ur_value>/p'
Here -n suppresses sed's default output and the p command prints only the lines that match <ur_value>; i.e., "if the line matches the pattern, print it". So the complete command prints just the line containing your value.
If your data is in file d, try GNU sed:
sed -E 's/[{,]"\w+":([^,"]+)/\1\n/g ;s/(.*\n).*".*\n/\1/' d

Filtering out indented lines in JSON GitHub API with sed

I'm able to get all the names filtered out using
sed -n '/"name":/p' htop.json
but I want to filter out all the indented outputs. I'm looking for the repo titles from each GitHub repo. It's important that I use something light like sed to keep this small and portable.
Here is htop.json
https://pastebin.com/5xuH29yW
Well, just filter from the beginning of the line, including the indentation spaces:
sed -n '/^      "name":/p' htop.json
and we can also specify the number of spaces explicitly:
sed -n '/^[ ]\{6\}"name":/p' htop.json
Let's get repo names!
sed -n '/^ "name":/{s/[[:space:]]*"name":[[:space:]]*"\(.*\)",$/\1/;p}' htop.json

extract text from between html tags with specific id using sed or grep

What command should I be using to extract the text from the following HTML, which sits in a test.html file: <span id="imAnID">extractme</span>?
The file will be larger, so I need to point grep or sed at an id and have it extract only the text from the tag with this ID.
Assuming I run the terminal from the directory where the file resides, I am doing this:
cat test.html | sed -n 's/.*<span id="imAnID">\(.*\)<\/span>.*/\1/p'
What am I doing wrong? I get an empty output...
Not opposed to using grep for this if it's easier.
You can try doing it with awk instead:
#!/bin/bash
start_tag="span id=\"imAnID\""
end_tag="/span"
awk -F'[<>]' -v taga="$start_tag" -v tagb="$end_tag" '{ i=1; while (i<=NF) { if ($(i)==taga && $(i+2)==tagb) { print $(i+1) }; i++} }'
Use this by:
$ ./script < infile > outfile
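A quick check against the sample from the question, assuming the script above is saved as ./script and made executable:
$ echo '<span id="imAnID">extractme</span>' | ./script
extractme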
It is awkward to use awk, sed, or grep for this since these tools are line-based (one line at a time). Is it guaranteed that the span you are trying to extract is all on the same line? Is there any possibility of other tags used within the span (e.g. em tags)? If not, then this sounds like a job for perl.
awk, sed and grep are line-oriented tools. XML and HTML are based on tags. The two don't combine that well, though you can get by with awk, sed and grep on XML and HTML by using a pretty formatter on the XML or HTML before resorting to your line-oriented tools.
There's a program called xmlgawk that is supposed to be quite gawk-like, while still working on XML.
I personally prefer to do this sort of thing in Python using the lxml module, so that the XML/HTML can be fully understood without getting too wordy.
Using grep -o:
echo "<span id=\"imAnID\" hello>extractme</span> <span id='imAnID'>extractmetoo</span>" | grep -oE 'id=.?imAnID[^<>]*>[^<>]+' | cut -d'>' -f2
will find:
#=>extractme
#=>extractmetoo
It will work as long as the span element carrying the desired id attribute comes immediately before the text to extract.

Help with sed regex: extract text from specific tag

First time sed'er, so be gentle.
I have the following text file, 'test_file':
<Tag1>not </Tag1><Tag2>working</Tag2>
I want to extract the text in between <Tag2> using sed regex, there may be other occurrences of <Tag2> and I would like to extract those also.
So far I have this sed based regex:
cat test_file | grep -i "Tag2"| sed 's/<[^>]*[>]//g'
which gives the output:
not working
Does anyone have any idea how to get this working?
As another poster said, sed may not be the best tool for this job. You may want to use something built for XML parsing, or even a simple scripting language, such as perl.
The problem with your attempt is that you aren't analyzing the string properly.
cat test_file is good - it prints out the contents of the file to stdout.
grep -i "Tag2" is ok - it prints out only lines with "Tag2" in them. This may not be exactly what you want. Bear in mind that it will print the whole line, not just the <Tag2> part, so you will still have to search out that part later.
sed 's/<[^>]*[>]//g' isn't what you want - it simply removes the tags, including <Tag1> and <Tag2>.
You can try something like:
cat test_file | grep -i tag2 | sed 's/.*<Tag2>\(.*\)<\/Tag2>.*/\1/'
This will produce
working
but it will only work for one tag pair per line (with several pairs on a line, the greedy match keeps only the last one). A workaround follows.
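One possible workaround for multiple pairs on a single line (GNU sed, since it relies on \n in the replacement): split the tags onto separate lines first, then apply the same per-line extraction:
sed 's/></>\n</g' test_file | sed -n 's/.*<Tag2>\(.*\)<\/Tag2>.*/\1/p'
This prints one match per <Tag2>...</Tag2> pair, as long as the tag contents themselves don't contain ><.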
For your nice, friendly example, you could use
sed -e 's/^.*<Tag2>//' -e 's!</Tag2>.*!!' test_file
but the XML out there is cruel and uncaring. You're asking for serious trouble using regular expressions to scrape XML.
You can use gawk, e.g.
$ cat file
<Tag1>not </Tag1><Tag2>working here</Tag2>
<Tag1>not </Tag1><Tag2>
working
</Tag2>
$ awk -vRS="</Tag2>" '/<Tag2>/{gsub(/.*<Tag2>/,"");print}' file
working here
working
awk -F"Tag2" '{print $2}' test_1 | sed 's/[^a-zA-Z]//g'

Easiest way to extract the urls from an html page using sed or awk only

I want to extract the URL from within the anchor tags of an html file.
This needs to be done in BASH using SED/AWK. No perl please.
What is the easiest way to do this?
You could also do something like this (provided you have lynx installed)...
Lynx versions < 2.8.8
lynx -dump -listonly my.html
Lynx versions >= 2.8.8 (courtesy of @condit)
lynx -dump -hiddenlinks=listonly my.html
You asked for it:
$ wget -O - http://stackoverflow.com | \
grep -io '<a href=['"'"'"][^"'"'"']*['"'"'"]' | \
sed -e 's/^<a href=["'"'"']//i' -e 's/["'"'"']$//i'
This is a crude tool, so all the usual warnings about attempting to parse HTML with regular expressions apply.
grep "<a href=" sourcepage.html
|sed "s/<a href/\\n<a href/g"
|sed 's/\"/\"><\/a>\n/2'
|grep href
|sort |uniq
The first grep looks for lines containing URLs. You can add more elements after it if you want to look only at local pages (no http, just relative paths).
The first sed adds a newline in front of each <a href tag, so each one starts a new line.
The second sed shortens each URL by replacing everything after the 2nd " in the line with a closing </a> tag and a newline.
Both seds together give you each URL on a single line, but there is garbage left over, so
the second grep href cleans the mess up.
The sort and uniq give you one instance of each URL present in sourcepage.html.
With the Xidel - HTML/XML data extraction tool, this can be done via:
$ xidel --extract "//a/@href" http://example.com/
With conversion to absolute URLs:
$ xidel --extract "//a/resolve-uri(@href, base-uri())" http://example.com/
I made a few changes to Greg Bacon's solution:
cat index.html | grep -o '<a .*href=.*>' | sed -e 's/<a /\n<a /g' | sed -e 's/<a .*href=['"'"'"]//' -e 's/["'"'"'].*$//' -e '/^$/ d'
This fixes two problems:
We are matching cases where the anchor doesn't start with href as first attribute
We are covering the possibility of having several anchors in the same line
An example, since you didn't provide any sample input:
awk 'BEGIN{
RS="</a>"
IGNORECASE=1
}
{
for(o=1;o<=NF;o++){
if ( $o ~ /href/){
gsub(/.*href=\042/,"",$o)
gsub(/\042.*/,"",$o)
print $(o)
}
}
}' index.html
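For instance, against a minimal index.html (contents invented for illustration), the script prints one URL per anchor:
$ printf '<a href="./01-intro.pdf">intro</a><a href="./02-shell.pdf">shell</a>\n' > index.html
./01-intro.pdf
./02-shell.pdf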
You can do it quite easily with the following regex, which is quite good at finding URLs:
\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))
I took it from John Gruber's article on how to find URLs in text.
That lets you find all URLs in a file f.html as follows:
cat f.html | grep -o \
-E '\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))'
I am assuming you want to extract a URL from some HTML text, and not parse HTML (as one of the comments suggests). Believe it or not, someone has already done this.
OT: The sed website has a lot of good information and many interesting/crazy sed scripts. You can even play Sokoban in sed!
This is my first post, so I'll try my best to explain why I'm posting this answer...
Of the first 7 most-voted answers, 4 include grep even though the post explicitly says "using sed or awk only".
Even though the post requires "No perl please", this answer uses a Perl-style regex inside grep.
And because this is the simplest way (as far as I know, and as was required) to do it in bash.
So here comes the simplest script, using GNU grep 2.28:
grep -Po 'href="\K.*?(?=")'
About the \K switch: no info was to be found in the man and info pages, so I came here for the answer.
The \K escape discards the previously matched characters (the key itself included).
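A minimal demonstration on a made-up line:
$ echo '<a href="https://example.com/a.pdf">a</a>' | grep -Po 'href="\K.*?(?=")'
https://example.com/a.pdf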
Bear in mind following the advice from man pages:
"This is highly experimental and grep -P may warn of unimplemented features."
Of course, you can modify the script to meet your tastes or needs, but I found it pretty straightforward for what was requested in the post, and for many of us.
I hope you folks find it very useful.
Thanks!
In bash, the following should work. Note that it doesn't use sed or awk, but uses tr and grep, both very standard and not perl ;-)
$ cat source_file.html | tr '"' '\n' | tr "'" '\n' | grep -e '^https://' -e '^http://' -e'^//' | sort | uniq
for example:
$ curl "https://www.cnn.com" | tr '"' '\n' | tr "'" '\n' | grep -e '^https://' -e '^http://' -e'^//' | sort | uniq
generates
//s3.amazonaws.com/cnn-sponsored-content
//twitter.com/cnn
https://us.cnn.com
https://www.cnn.com
https://www.cnn.com/2018/10/27/us/new-york-hudson-river-bodies-identified/index.html\
https://www.cnn.com/2018/11/01/tech/google-employee-walkout-andy-rubin/index.html\
https://www.cnn.com/election/2016/results/exit-polls\
https://www.cnn.com/profiles/frederik-pleitgen\
https://www.facebook.com/cnn
etc...
Expanding on kerkael's answer:
grep "<a href=" sourcepage.html
|sed "s/<a href/\\n<a href/g"
|sed 's/\"/\"><\/a>\n/2'
|grep href
|sort |uniq
# now adding some more
|grep -v "<a href=\"#"
|grep -v "<a href=\"../"
|grep -v "<a href=\"http"
The first grep I added removes links to local bookmarks.
The second removes relative links to upper levels.
The third removes links that start with http.
Pick and choose which one of these you use as per your specific requirements.
Go over it with a first pass, replacing the start of the URLs (http) with a newline (\nhttp). Then you have guaranteed that each link starts at the beginning of a line and is the only URL on that line. The rest should be easy; here is an example:
sed "s/http/\nhttp/g" <(curl "http://www.cnn.com") | sed -n "s/\(^http[s]*:[a-Z0-9/.=?_-]*\)\(.*\)/\1/p"
alias lsurls='_(){ sed "s/http/\nhttp/g" "${1}" | sed -n "s/\(^http[s]*:[a-Z0-9/.=?_-]*\)\(.*\)/\1/p"; }; _'
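The alias takes a filename argument, so usage might look like this (process substitution lets you point it at a live page; page.html is any saved HTML file):
$ lsurls page.html
$ lsurls <(curl -s "https://www.cnn.com")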
You can try:
curl --silent -u "<username>:<password>" http://<NAGIOS_HOST/nagios/cgi-bin/status.cgi|grep 'extinfo.cgi?type=1&host='|grep "status"|awk -F'</A>' '{print $1}'|awk -F"'>" '{print $3"\t"$1}'|sed 's/<\/a> <\/td>//g'| column -c2 -t|awk '{print $1}'
Here's how I tried it for a better view: create a shell file and give the link as a parameter; it will create a temp2.txt file.
a=$1
lynx -listonly -dump "$a" > temp
awk 'FNR > 2 {print$2}' temp > temp2.txt
rm temp
$ sh test.sh http://link.com
Eschewing the awk/sed requirement:
urlextract is made just for such a task (documentation).
urlview is an interactive CLI solution (github repo).
I scrape websites using Bash exclusively to verify the http status of client links and report back to them on errors found. I've found awk and sed to be the fastest and easiest to understand. Props to the OP.
curl -Lk https://example.com/ | sed -r 's~(href="|src=")([^"]+).*~\n\1\2~g' | awk '/^(href|src)/,//'
Because sed works on a single line, this will ensure that all URLs are formatted properly on a new line, including any relative URLs. The first sed finds all href and src attributes and puts each on a new line while simultaneously removing the rest of the line, including the closing double quote (") at the end of the link.
Notice I'm using a tilde (~) in sed as the delimiter for the substitution. This is preferred over a forward slash (/), since forward slashes appear throughout HTML and can confuse the substitution.
The awk finds any line that begins with href or src and outputs it.
Once the content is properly formatted, awk or sed can be used to collect any subset of these links. For example, you may not want base64 images but do want all the other images. Our new code would look like:
curl -Lk https://example.com/ | sed -r 's~(href="|src=")([^"]+).*~\n\1\2~g' | awk '/^(href|src)/,//' | awk '/^src="[^d]/,//'
Once the subset is extracted, just remove the href=" or src=" prefix:
sed -r 's~(href="|src=")~~g'
This method is extremely fast and I use these in Bash functions to format the results across thousands of scraped pages for clients that want someone to review their entire site in one scrape.
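Putting the pieces together, a complete pass that prints bare URLs (same illustrative example.com target as above) might look like:
# example.com stands in for whatever site is being scraped
curl -Lk https://example.com/ |
  sed -r 's~(href="|src=")([^"]+).*~\n\1\2~g' |
  awk '/^(href|src)/,//' |
  sed -r 's~(href="|src=")~~g'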