Bash: Content between two complex patterns in HTML

I have tried multiple times to get digits between two html patterns.
Neither sed nor awk worked for me, since the examples on the internet were too simple to fit my task.
Here is the code I want to filter:
....class="a-size-base review-text">I WANT THIS TEXT</span></div> ....
So I need a command that outputs I WANT THIS TEXT, i.e. the text between ...review-text"> and </span>.
Do you have a clue? Thanks for the effort and greetings from Germany.

Try:
tr '\n' ' ' < file.html | grep -o 'review-text">[^<>]*</span> *</div>' | cut -d'>' -f2 | cut -d'<' -f 1
It should work as long as there are no tags inside "I WANT THIS TEXT".
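For reference, here is a minimal sketch of that pipeline run against a made-up two-review sample (the file contents and the variable names sample and reviews are assumptions for illustration only). Since tr reads only standard input, the file is redirected into it:

```shell
# Hypothetical sample input; the real page will look different.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<div><span class="a-size-base review-text">I WANT
THIS TEXT</span></div>
<div><span class="a-size-base review-text">Second review</span></div>
EOF

# Join all lines, isolate each review span, then keep the text between > and <.
reviews=$(tr '\n' ' ' < "$sample" \
  | grep -o 'review-text">[^<>]*</span> *</div>' \
  | cut -d'>' -f2 | cut -d'<' -f1)

printf '%s\n' "$reviews"
rm -f "$sample"
```

Joining the lines first is what lets a review that wraps across two lines come out as a single result.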

I can't see the problem here, provided the text you want to extract contains neither < nor >.
For instance, with a POSIX basic regular expression:
$ HTML_FILE=/tmp/myfile.html
$ sed -n "s/.*review-text.>\([^<]*\)<.*/\1/p" "$HTML_FILE"
This prints the text between the HTML tags.
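One caveat worth checking on your own data: the leading .* is greedy, so when several review spans share a line, sed keeps only the last one. The sketch below uses an invented one-line sample (the variable names one_line, last_only and all_reviews are illustrative) and shows one possible workaround, which is to let grep -o put each span on its own line first:

```shell
# Hypothetical input with two review spans on a single line.
one_line='<span class="review-text">First</span><span class="review-text">Second</span>'

# The greedy .* skips ahead to the last review-text span on the line.
last_only=$(printf '%s\n' "$one_line" \
  | sed -n 's/.*review-text.>\([^<]*\)<.*/\1/p')

# grep -o emits every span separately, so sed sees one span per line.
all_reviews=$(printf '%s\n' "$one_line" \
  | grep -o 'review-text">[^<]*<' \
  | sed 's/.*">\([^<]*\)</\1/')

printf 'sed alone: %s\n' "$last_only"
printf 'grep -o then sed:\n%s\n' "$all_reviews"
```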

Related

Shell bash how to delete text between pattern not inclusive?

I am trying to delete the text between the <pre></pre> HTML tags using:
sed -i '/<pre>/,/<\/pre>/d' file.html
But this deletes the <pre></pre> tags too. There is only one pre tag pair in the file.
How can I avoid deleting the pre tags?
Thanks.
This might work for you (GNU sed):
sed -n '/<pre>/{p;:a;n;/<\/pre>/!ba};p' file
Turn off implicit printing by using the -n option.
If a line contains <pre>, print it and fetch the next line.
If that line does not contain </pre> loop back and repeat.
Otherwise print all other lines.
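To see the effect without touching a real file, here is a small sketch on an invented sample (the file contents and the names page and kept are assumptions). It assumes GNU sed, like the command above, and that <pre> and </pre> each sit on their own line:

```shell
# Hypothetical sample page with one <pre> block.
page=$(mktemp)
printf '%s\n' '<h1>Title</h1>' '<pre>' 'line one' 'line two' '</pre>' '<p>After</p>' > "$page"

# Print the <pre> line, silently skip lines until </pre>, then print as usual.
kept=$(sed -n '/<pre>/{p;:a;n;/<\/pre>/!ba};p' "$page")

printf '%s\n' "$kept"
rm -f "$page"
```

Once the output looks right, adding -i should apply the same edit in place.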

Replace character in file at specific line number

I have index.html file
I want line# 88 which looks like this: <h1>Test Page 1</h1>
To be like this: <h1>Test Page 10</h1>
Tried basic procedures such as:
sed -i '88s/1/10' index.html
sed -i ‘88|\(.*\)| <h1>Test Page 10</h1>\1|' index.html
but it seems like HTML tags need different treatment?
I filled t.html with 100 lines of the same content (each line is just <h1>Test Page 1</h1>). For this demonstration I used nl t.html | grep 88 to show the 88th line (nl just numbers each line, and grep searches for a regular expression, but 88 just matches a literal 88). I run that before and after the sed command, to show line 88 before and after the change.
$ nl t.html | grep 88; sed -i -e '88 s/Page 1/Page 10/' t.html; nl t.html | grep 88
88 <h1>Test Page 1</h1>
88 <h1>Test Page 10</h1>
You have to be careful with regular expressions - if you just use s/1/10/ it will replace the 1 in the first h1 instead of the 1 in Page 1.
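The difference is easy to reproduce. The sketch below rebuilds the 100-line test file in a temporary location (the names t, naive and anchored are illustrative) and uses sed -n with the p flag instead of -i, so nothing is modified:

```shell
# Recreate the test file: 100 identical lines.
t=$(mktemp)
awk 'BEGIN { for (i = 1; i <= 100; i++) print "<h1>Test Page 1</h1>" }' > "$t"

# Too loose: the first 1 on the line is the one inside <h1>.
naive=$(sed -n '88 s/1/10/p' "$t")

# Anchored on the surrounding text, so only the page number changes.
anchored=$(sed -n '88 s/Page 1/Page 10/p' "$t")

printf 'naive:    %s\n' "$naive"
printf 'anchored: %s\n' "$anchored"
rm -f "$t"
```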
cat > replace.ed << EOF
88s/e 1/e 10/
wq
EOF
ed -s index.html < replace.ed
rm -v ./replace.ed
Use
sed -E 's,(>[^<]*)1([^<]*<),\110\2,'
s,(>[^<]*)1([^<]*<),\110\2, will find > and any text other than <, then 1, then any text other than < up to and including <, and replaces the match with the text before 1, then 10, then the text after 1.
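A quick way to check that substitution is to feed it a single line, as in this sketch (the variable names line and result are illustrative). Prefixing the command with an address, e.g. 88s,..., should restrict it to that line of the real file:

```shell
# One line of the kind found in index.html.
line='<h1>Test Page 1</h1>'

# Only a 1 that sits in the text between > and < is replaced by 10.
result=$(printf '%s\n' "$line" | sed -E 's,(>[^<]*)1([^<]*<),\110\2,')

printf '%s\n' "$result"
```

In the replacement, \1 is the first group and the 10 after it is literal text, because sed backreferences are a single digit.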
Nice little one liner:
cat data.txt | awk 'NR==2'| sed 's/Test Page 1/Test Page 10/'
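The awk command uses NR, the number of input lines read so far, so NR==2 selects line 2; change this to the line number in question. The sed command then replaces the text between the first and second slashes with the text between the second and third. Note that this pipeline prints only the modified line; it does not change the file.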
Hopefully that helps :)
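If you want the whole file back with just that one line changed, awk can probably do it on its own. This is a sketch on an invented three-line sample (the file contents and the names data and updated are assumptions); swap NR == 2 for the line number you need:

```shell
# Hypothetical three-line input.
data=$(mktemp)
printf '%s\n' '<h1>Header</h1>' '<h1>Test Page 1</h1>' '<p>Footer</p>' > "$data"

# Rewrite line 2 only; the bare { print } block prints every line.
updated=$(awk 'NR == 2 { sub(/Test Page 1/, "Test Page 10") } { print }' "$data")

printf '%s\n' "$updated"
rm -f "$data"
```

Redirecting the output to a new file and moving it over the original would make the change permanent.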

BASH-SHELL SCRIPT: Print part of the text from a file

I have a text file with part of some HTML code:
>>nano wynik.txt
with text:
1743: < a href="/currencies/lisk/#markets" class="price" data-usd="24.6933" data-btc= "0.00146882"
and I want to print only: 24.6933
I tried using the cut command, but it does not work. Can anyone give me a solution?
With GNU grep and Perl Compatible Regular Expressions:
grep -Po '(?<=data-usd=").*?(?=")' file
Output:
24.6933
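If your grep has no -P option (the BSD grep shipped with macOS, for example), a plain sed substitution is one possible fallback. A minimal sketch using the line from the question (the variable names line and price are illustrative):

```shell
# The sample line from wynik.txt.
line='1743: < a href="/currencies/lisk/#markets" class="price" data-usd="24.6933" data-btc= "0.00146882"'

# Capture everything between data-usd=" and the next double quote.
price=$(printf '%s\n' "$line" | sed -n 's/.*data-usd="\([^"]*\)".*/\1/p')

printf '%s\n' "$price"
```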

Xidel extract data inside the tag -- raw output

Pleased to be member of StackOverflow, a long time lurker in here.
I need to parse text between two tags, so far I've found a wonderful tool called Xidel
I need to parse text in between
<div class="description">
Text. <tag>Also tags.</tag> More text.
</div>
However, said text can include HTML tags in it, and I want them to be printed out in raw format. So using a command like:
xidel --xquery '//div[@class="description"]' file.html
Gets me:
Text. Also tags. More text.
And I need it to be exactly as it is, so:
Text. <tag>Also tags.</tag> More text.
How can I achieve this?
Regards, R
Can be done in a couple of ways with Xidel, which is why I love it so much.
HTML-templating:
xidel -s file.html -e "<div class='description'>{inner-html()}</div>"
XPath:
xidel -s file.html -e "//div[@class='description']/inner-html()"
CSS:
xidel -s file.html -e "inner-html(css('div.description'))"
BTW, on Linux: swap the double quotes for single and vice versa.
You can show the tags by adding the --output-format=xml option.
xidel --xquery '//div[@class="description"]' --output-format=xml file.html

How to parse out text from a span using curl in bash?

I am using Geektools (a bash desktop widget app for Mac) to try to display text from a website. I have been trying to curl the site and then grep the text, but I am finding it more difficult than I imagined. Just looking for some help.
HTML:
<div is>
<div class="page-status status-none">
<span class="status font-large">
All Systems Operational
</span>
<span class="last-updated-stamp font-small"></span>
</div>
Above is the HTML that is returned when I curl the site. I just need to display the text "All Systems Operational".
Thanks in advance for your assistance.
Getting into the habit of using regular expressions on HTML is a slippery slope; it's not the right tool for the job. I'd suggest either:
hxselect from the html-xml-utils package
xidel
Both let you use CSS3 selectors to target content in your input.
for example:
curl -s "$website_url" | hxnormalize -x | hxselect -c '.status.font-large'
All Systems Operational
You could pipe the output of curl into gawk. This gawk command seems to do the trick (I'm using Cygwin's gawk on Windows; in bash the program has to be single-quoted so that $0 reaches gawk unexpanded):
gawk '/status font-large/ {wantedLine=NR+1} {if (NR==wantedLine) {print $0}}'
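Here is a sketch of the same idea in plain awk, run on the HTML from the question instead of a live curl call, with the leading and trailing whitespace trimmed as well (the variable names html, status and want are illustrative, and the indentation in the sample is an assumption). Like the command above, it assumes the text sits on the line directly after the opening span tag:

```shell
# HTML fragment from the question, stored in a variable instead of fetched.
html='<div class="page-status status-none">
<span class="status font-large">
    All Systems Operational
</span>
<span class="last-updated-stamp font-small"></span>
</div>'

# Remember the line after the span tag, then print it without padding.
status=$(printf '%s\n' "$html" | awk '
  /status font-large/ { want = NR + 1 }
  NR == want { gsub(/^[ \t]+|[ \t]+$/, ""); print }
')

printf '%s\n' "$status"
```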
I was able to parse out the status that I was looking for by using Nokogiri.
curl -s "$Website_URL" | nokogiri -e 'puts $_.at_css("span.status").text'