HTML::ELEMENT not finding all elements - html

I have this snippet of html:
<li class="result-row" data="2">
<p class="result-info">
<span class="icon icon-star" role="button">
<span class="screen-reader-text">favorite this post</span>
</span>
<time class="result-date" datetime="2018-12-04 09:21" title="Tue 04 Dec 09:21:50 AM">Dec 4</time>
Link Text
and this perl code (not production, so no quality comments are necessary)
my $root = $tree->elementify();
my #rows = $root->look_down('class', 'result-row');
my $item = $rows[0];
say $item->dump;
my $date = $item->look_down('class', 'result-date');
say $date;
my $title = $item->look_down('class', 'result-title hdrlnk');
All outputs are as I expected except $date isn't defined.
When I look at the $item->dump, it looks like the time element doesn't show up in the output. Here's a snippet of the output from $item->dump where I would expect to see a <time...> element. All it shows is the text from the time element.
<li class="result-row" data="2"> #0.1.9.3.2.0
<a class="result-image gallery empty" href="https://localhost/1.html"> #0.1.9.3.2.0.0
<p class="result-info"> #0.1.9.3.2.0.1
<span class="icon icon-star" role="button"> #0.1.9.3.2.0.1.0
" "
<span class="screen-reader-text"> #0.1.9.3.2.0.1.0.1
"favorite this post"
" "
" Dec 4 "
<a class="result-title hdrlnk" data="2" href="https://localhost/1.html"> #0.1.9.3.2.0.1
.2
"Link Text..."
" "
...
I've not used HTML::Element before. I rtfmed and didn't see any tag exclusions and I did a search of the package code for tags white/black lists (which wouldn't make sense, but neither does leaving out the time tag).
Does anyone know why the time element is not showing up in the dump and any search for it turns up nothing?
As an fyi, the rest of the code searches and finds elements without issue, it just appears to be the time tag that's missing.

HTML::TreeBuilder does not support HTML5 tags. Consider Mojo::DOM as an alternative that keeps up with the living HTML standard. I can't show how your whole code would look with Mojo::DOM since you've only shown a piece, but the Mojo::DOM equivalent of look_down is find (returns a Mojo::Collection arrayref) or at (returns the first element found or undef), both taking a CSS selector.

Related

return html tags in the html text using re

I have html text and i just want to determine what are the html tags available in the text.
html_text = '<p class="gmail-m3464245979397595798gmail-m6143070745855285966gmail-m-3072962113628903492gmail-m-7999079541169053160wordsection1" style="margin:0in;margin-bottom:.0001pt">Position Title: Onsite Client Services Associate<br /> Duration: 7 months<br /> Location: Tempe, AZ 85282<br /> <br /> <b><u>Roles and responsibilities</u></b><o:p></o:p></p> <p class="gmail-m3464245979397595798gmail-m6143070745855285966gmail-m-3072962113628903492gmail-m-7999079541169053160wordsection1" style="margin-top:5.0pt;margin-right:0in;margin-bottom:0in;margin-left:.25in; margin-bottom:.0001pt"><span style="font-family:Symbol">·</span><span style="font-size:7.0pt"> </span>Primary function during peak season (July-December) will be an onsite presence at our large client in the Phoenix area. <o:p></o:p></p>'
As a first step I was parsing every tag from the text for every html tag
like html_text.find('</p>'). As it is very long to parse by checking with every tag, I was trying to use of regex
re.findall(r'\<\/.>', html_text)
The output of the above is ['</p>', '</b>', '</u>']. But I want the output to be ['</p>','</span>', '<br />', '</b>', '</u>']. So If I modify
re.findall(r'\<\/.*>', html_text)
presuming i can get </span>, I am getting the whole text.
['</u></b><o:p></o:p></p> <p class="gmail-m3464245979397595798gmail-m6143070745855285966gmail-m-3072962113628903492gmail-m-7999079541169053160wordsection1" style="margin-top:5.0pt;margin-right:0in;margin-bottom:0in;margin-left:.25in; margin-bottom:.0001pt"><span style="font-family:Symbol">·</span><span style="font-size:7.0pt"> </span>Primary function during peak season (July-December) will be an onsite presence at our large client in the Phoenix area. <o:p></o:p></p>']
Is there a way I can write the expression for all tags as one expression or else should I write condition check for every tag ? In the above I couldn't determine <br />.
Finally after some little trails, I have found answer for my self, just posting it if it would help some one. It will determine all the tags, do some cleaning will determine the tags.
re.findall(re.compile("<.*?>"), html_text)
output is
['<p class="gmail-m3464245979397595798gmail-m6143070745855285966gmail-m-3072962113628903492gmail-m-7999079541169053160wordsection1" style="margin:0in;margin-bottom:.0001pt">', '<br />', '<br />', '<br />', '<br />', '<b>', '<u>', '</u>', '</b>', '<o:p>', '</o:p>', '</p>', '<p class="gmail-m3464245979397595798gmail-m6143070745855285966gmail-m-3072962113628903492gmail-m-7999079541169053160wordsection1" style="margin-top:5.0pt;margin-right:0in;margin-bottom:0in;margin-left:.25in; margin-bottom:.0001pt">', '<span style="font-family:Symbol">', '</span>', '<span style="font-size:7.0pt">', '</span>', '<o:p>', '</o:p>', '</p>']
As far as I know, what you are trying to do won't be fully achievable with just regex.
Usually, in an HTML tag there are attributes inside the opening tag. For example-
<span class="text">Some Text </span> has class="text" between the opening <span and the closing >.
So, if you want to just match <span> from <span class="text">Some Text </span>, you'll have to match <span first and then somehow skip class="text" and match > again. Which is not possible with regex as regex can only match characters one after another.
One solution that comes to my mind is, you can use this regex (<[^\/\s]+)([^>]+)>. Which will match <span class="text">Some Text </span> and return <span. You can then just add > after that using string concatenation.
Regex Explanation-
Thanks.

Use regular expressions to add new class to element using search/replace

I want to add a NewClass value to the class attribute and modify the text of the span using find/replace functionality with a pair of regular expressions.
<div>
<span class='customer' id='phone$0'>Home</span>
<br/>
<span class='customer' id='phone$1'>Business</span>
<br/>
<span class='customer' id='phone$2'>Mobile</span>
</div>
I am trying to get the following result using after search/replace:
<span class='customer NewClass' id='phone$1'>Organization</span>
Also curious to know if a single find/replace operation can been used for both tasks?
Regex can do this, but be aware the using regex to change HTML can have a lot of edge cases that you may not have accounted for.
This regex101 example shows those three <span> elements changed to add NewClass and the contents to be changed to Organization.
Other technologies, however, would be safer. jQuery, for example, could replace them regardless of the order of the attributes:
$("span#phone$1").addClass("NewClass");
$("span#phone$1").text("Organization");
So just be careful with it, and you should be fine.
EDIT
According to comments on the OP, you want to only change the span containing ID phone$1, so the regex101 link has been updated to reflect this.
EDIT 2
Permalink was too long to fit into a comment, so adding the permalink here. Click on the "Content" tab at the bottom to see the replacement.
You can use a regex like this:
'.*?' id='phone\$1'>.*?<
With substitution string:
'customer' id='phone\$1'>Organization<
Working demo
Php code
$re = "/'.*?' id='phone\\$1'>.*?</";
$str = "<div>\n <span class='customer' id='phone\$0'>Home</span>\n<br/>\n <span class='customer' id='phone\$1'>Business</span>\n<br/>\n <span class='customer' id='phone\$2'>Mobile</span>\n</div>";
$subst = "'customerNewClass' id='phone\$1'>Organization<";
$result = preg_replace($re, $subst, $str);
Result
<div>
<span class='customer' id='phone$0'>Home</span>
<br/>
<span class='customerNewClass' id='phone$1'>Organization</span>
<br/>
<span class='customer' id='phone$2'>Mobile</span>
</div>
Since your tags include preg_match and preg_replace, I think you are using PHP.
Regex is generally not a good idea to manipulate HTML or XML. See RegEx match open tags except XHTML self-contained tags SO post.
In PHP, you can use DOMDocument and DOMXPath with //span[#id="phone$1"] xpath (get all span tags with id attribute vlaue equal to phone$1):
$html =<<<DATA
<div>
<span class='customer' id='phone$0'>Home</span>
<br/>
<span class='customer' id='phone$1'>Business</span>
<br/>
<span class='customer' id='phone$2'>Mobile</span>
</div>
DATA;
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xp = new DOMXPath($dom);
$sps = $xp->query('//span[#id="phone$1"]');
foreach ($sps as $sp) {
$sp->setAttribute('class', $sp->getAttribute('class') . ' NewClass');
$sp->nodeValue = 'Organization';
}
echo $dom->saveHTML();
See IDEONE demo
Result:
<div>
<span class="customer" id="phone$0">Home</span>
<br>
<span class="customer NewClass" id="phone$1">Organization</span>
<br>
<span class="customer" id="phone$2">Mobile</span>
</div>

Replace only raw text in HTML string

I have a string:
html_string =
'<span><span class=\"ip\"></span> Do not stare <span class=\"img\"></span> at the monitor continuously </span>\r\n'
I want to replace the character s in the raw text (not in the html tags) of html_string with <span class="highlighted">s</span>.
The result should be:
'<span><span class=\"ip\"></span> Do not <span class="highlighted">s</span>tare <span class=\"img\"></span> at the monitor continuou<span class="highlighted">s</span>ly </span>\r\n'
What I did is:
html_string.gsub(/s/, '<span class="highlighted">s</span>')
but this replaces all occurrences of the s character regardless of raw text or a tag. I want to replace it skipping html tags and its attributes. How it can be done?
Do not pretend to be ideal answer, just to give you a way where to go:
require 'nokogiri'
html_string = '<span><span class="ip"></span> Do not stare <span class="img"></span> at the monitor continuously </span>'
doc = Nokogiri::HTML.fragment(html_string)
spans = doc.css('span')
spans.each do |span|
span.xpath('text()').each do |text|
if text.content =~ /stare/
text.content = text.content.sub(/stare/, '<span class="highlighted">s</span>tare')
end
end
end
p doc.to_html.gsub(/\</, '<').gsub(/\>/, '>')
Which output is:
#=> "<span><span class=\"ip\"></span> Do not <span class=\"highlighted\">s</span>tare <span class=\"img\"></span> at the monitor continuously </span>"
So, here we are looking for all spans and checking them for content that has stare word. Then we change content. That's all, and learn nokogiri.
That's really simple: parse the html, replace in the text nodes, print to html.
Nokogiri seems to be popular for that in Ruby.

Hover text is broken if text has special symbols when description is given

Example:
<a title="A web design community.'test'~`!##$%^&*()-_+=\|][{};:,<.>?/ **"new test"** " href="http://css-tricks.com">CSS-Tricks</a>
In tooltip, after the double quotes "new test" is not working.
Is there any possible to show the content in tooltip like this
ex: testing 'welcome', # 3 $ ^ & * "flow"?
The problem is that your double quotes in the title close your title automatically. Escape them by replacing " with " and also funkwurm recommends to replace < and > with < and > respectively to avoid errors in xml:
<a title="A web design community.'test'~`!##$%^&*()-_+=\|][{};:,<.>?/ **"new test"** " href="http://css-tricks.com">CSS-Tricks</a>
You can use this is also.
<a title="Answer to your's question.'Test It' :):)'B Happy' :):)"new test"**" href="http://css-tricks.com">CSS-Tricks</a>

how to retrieve data from html between <span> and </span>

I want to get the rate that is from 1 to 5 in amazon customer reviews.
I check the source, and find this part looks as
<div style="margin-bottom:0.5em;">
<span style="margin-right:5px;"><span class="swSprite s_star_5_0 " title="5.0 out of 5 stars" ><span>5.0 out of 5 stars</span></span> </span>
<span style="vertical-align:middle;"><b>Works great right out of the box with Surface Pro</b>, <nobr>October 5, 2013</nobr></span>
</div>
I want to get 5.0 out of 5 stars from
<span>5.0 out of 5 stars</span></span> </span>
how can i use xpathSApply to get it?
Thank you!
I would recommend using the selectr package, which uses css selectors in place of xpath.
library(XML)
doc <- htmlParse('
<div style="margin-bottom:0.5em;">
<span style="margin-right:5px;">
<span class="swSprite s_star_5_0 " title="5.0 out of 5 stars" >
<span>5.0 out of 5 stars</span></span> </span>
<span style="vertical-align:middle;">
<b>Works great right out of the box with Surface Pro</b>,
<nobr>October 5, 2013</nobr></span>
</div>', asText = TRUE
)
library(selectr)
xmlValue(querySelector(doc, 'div > span > span > span'))
UPDATE: If you are looking to use xpath, you can use the css_to_xpath function in selectr to figure out the appropriate xpath command, which in this case turns out to be
"descendant-or-self::div/span/span/span"
I do not know r much but I can give you the XPath string. It seems you want the first span's text which has no attribute and this would be:
//span[not(#*)][1]/text()
You can put this string into xpathSApply.