I am using GeekTool (a bash desktop widget app for macOS) to try to display text from a website. I have been trying to cURL the site and then grep the text, but I am finding it more difficult than I imagined. Just looking for some help.
HTML:
<div is>
<div class="page-status status-none">
<span class="status font-large">
All Systems Operational
</span>
<span class="last-updated-stamp font-small"></span>
</div>
Above is the span that is displayed when I cURL the site. I just need to display the text "All Systems Operational".
Thanks in advance for your assistance.
Getting in the habit of using regular expressions with HTML is a slippery slope; it's not the right tool for the job, as mentioned here. I'd suggest either
hxselect from the html-xml-utils package
xidel
both of which let you use CSS3 selectors to target content in your input.
for example:
curl -s "$website_url" | hxselect -c '.status.font-large'
All Systems Operational
You could pipe the output of curl into gawk. This gawk command seems to do the trick (I'm using Cygwin's gawk on Windows):
gawk "/status font-large/ {wantedLine=NR+1} {if (NR==wantedLine) {print $0}}"
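If all you need is the line after the match, awk's getline makes for a shorter variant. This is a sketch that assumes, as in the snippet above, that the status text always sits on its own line directly below the opening span tag:

```shell
# Demo input mirrors the span markup from the question.
# /status font-large/ matches the opening tag line; getline then advances
# to the next input line (the status text), which print emits.
printf '%s\n' '<span class="status font-large">' 'All Systems Operational' '</span>' |
awk '/status font-large/{getline; print}'
# All Systems Operational
```

In practice you would feed it the page instead, e.g. `curl -s "$website_url" | awk '/status font-large/{getline; print}'`.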
I was able to parse out the status that I was looking for by using Nokogiri.
curl -s "$Website_URL" | nokogiri -e 'puts $_.at_css("span.status").text'
Related
There have been dozens of similar questions asked, but my question is about a specific selection between the tags. I don't want the entire selection from <a href to </a>; I only need to target the "> between those tags.
I am trying to convert <a href> links into wikilinks. For example, if the sample text has:
Light is light.
<div class="reasons">
I wanted to edit the file itself and change <a href="link.html">Link</a> into [[link.html|Link]]. The basic idea that I have right now uses 3 sed edits as follows:
<a href="link.html">Link</a> -> <a href="link.html|Link</a>
<a href="link.html|Link</a> -> [[link.html|Link</a>
[[link.html|Link</a> -> [[link.html|Link]]
My problem lies with the first step; I can't find the regex that only targets "> between <a href and </a>.
I understand that the basic idea would be to put the search target between a lookahead and a lookbehind, but trying it on RegExr failed. I also tried using a conditional regex; I can't find the syntax I used, but it either threw an error or it worked but also captured the div class.
Edit: I'm on Ubuntu and using a bash script using sed to do the text manipulation.
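For what it's worth, if every anchor sits on a single line, carries no extra attributes, and the href contains no embedded quotes, the three steps collapse into one sed substitution. This is a sketch for that narrow case only; real-world HTML will break it, as the answer below warns:

```shell
# Group 1 captures the href value, group 2 the link text;
# the replacement stitches them into a wikilink.
# Only handles single-line <a href="...">text</a> with no other attributes.
echo '<a href="link.html">Link</a>' |
sed -E 's,<a href="([^"]*)">([^<]*)</a>,[[\1|\2]],g'
# [[link.html|Link]]
```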
The basic idea that I have right now uses 3 sed edits
Assuming you've also read the answers underneath those dozens of similar questions, you would know that it's a bad idea to parse HTML with sed (regex).
With an HTML-parser like xidel this would be as simple as:
$ xidel -s '<a href="link.html">Link</a>' -e 'concat("[[",//a/@href,"|",//a,"]]")'
$ xidel -s '<a href="link.html">Link</a>' -e '"[["||//a/@href||"|"||//a||"]]"'
$ xidel -s '<a href="link.html">Link</a>' -e 'x"[[{//a/@href}|{//a}]]"'
[[link.html|Link]]
Three different queries to concatenate strings. The 1st query uses the XPath concat() function, the 2nd query uses the XPath || operator and the 3rd uses xidel's extended string syntax.
So I have a text file with part of some HTML code:
>>nano wynik.txt
with text:
1743: < a href="/currencies/lisk/#markets" class="price" data-usd="24.6933" data-btc= "0.00146882"
and I want to print only: 24.6933
I tried the cut command, but it does not work. Can anyone give me a solution?
With GNU grep and Perl Compatible Regular Expressions:
grep -Po '(?<=data-usd=").*?(?=")' file
Output:
24.6933
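If your grep lacks -P (PCRE support is a GNU extension and is not always compiled in, e.g. on BSD/macOS), a plain sed substitution reaches the same field. This sketch assumes at most one data-usd attribute per line:

```shell
# Keep only what sits between data-usd=" and the next double quote.
echo '1743: < a href="/currencies/lisk/#markets" class="price" data-usd="24.6933" data-btc= "0.00146882"' |
sed -n 's/.*data-usd="\([^"]*\)".*/\1/p'
# 24.6933
```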
I have tried multiple times to get the text between two HTML patterns.
Neither sed nor awk worked for me, since the examples on the internet were too simple to fit my task.
Here is the code I want to filter:
....class="a-size-base review-text">I WANT THIS TEXT</span></div> ....
So I would need a command that outputs I WANT THIS TEXT, i.e. everything between ...review-text"> and </span>.
Do you have a clue? Thanks for the effort and greetings from Germany.
Here is the plain code
Try:
tr '\n' ' ' < file.html | grep -o 'review-text">[^<>]*</span> *</div>' | cut -d'>' -f2 | cut -d'<' -f1
It should work as long as there are no tags inside "I WANT THIS TEXT".
I can't see the problem here, supposing the text you want to extract contains neither < nor >.
For instance, with a POSIX regex:
$ HTML_FILE=/tmp/myfile.html
$ sed -n "s/.*review-text.>\([^<]*\)<.*/\1/gp" "$HTML_FILE"
prints the text between the HTML tags.
Pleased to be a member of Stack Overflow; I've been a long-time lurker here.
I need to parse text between two tags, and so far I've found a wonderful tool called Xidel.
I need to parse text in between
<div class="description">
Text. <tag>Also tags.</tag> More text.
</div>
However, said text can include HTML tags in it, and I want them to be printed out in raw format. So using a command like:
xidel --xquery '//div[@class="description"]' file.html
Gets me:
Text. Also tags. More text.
And I need it to be exactly as it is, so:
Text. <tag>Also tags.</tag> More text.
How can I achieve this?
Regards, R
Can be done in a couple of ways with Xidel, which is why I love it so much.
HTML-templating:
xidel -s file.html -e "<div class='description'>{inner-html()}</div>"
XPath:
xidel -s file.html -e "//div[@class='description']/inner-html()"
CSS:
xidel -s file.html -e "inner-html(css('div.description'))"
BTW, on Linux: swap the double quotes for single and vice versa.
You can show the tags by adding the --output-format=xml option.
xidel --xquery '//div[@class="description"]' --output-format=xml file.html
I want to copy all the text in a website between tags:
<p> and </p>
using bash.
Do you have an idea how to do it?
As the comment above states: don't even try. There is no reliable way to parse HTML with Bash built-ins.
But when you're using a shell you may as well use third-party command line tools such as pup which are built for HTML parsing on the command line.
Yes, an HTML parser is a better choice. But if you are just trying to grab the text in between the first set of P tags quickly, you can use Perl:
perl -n0e 'if (/<p>(.*?)<\/p>/s) { print $1; }'
For example:
echo "
<p>A test
here
today</p>
<p>whatever</p>
" | perl -n0e 'if (/<p>(.*?)<\/p>/s) { print $1; }'
This will output:
A test
here
today