Extracting text from an element in an HTML page using Jsoup

I am extracting text from the following HTML element:
<span class="adr" style="float: none !important;">
<span class="street-address" style="float: none !important;">18, Jawaharlal Nehru
Road,
</span>
<span style="float: none !important;" class="estb_addr-HeadingTxt">
<a style="float: none !important;" href="http://kolkata.burrp.com/area/park-street" class="locality"> Park Street</a></span>
, Kolkata<span class="region" style="display: none;">Kolkata
</span>
</span>
For that I wrote the following piece of code:
for (Element element : doc.getAllElements())
{
    for (Element childelem : element.children())
    {
        if (childelem.hasText() && !childelem.ownText().isEmpty())
        {
            String currText = childelem.ownText();
            System.out.print(currText + " ");
        }
    }
    System.out.println("");
}
Ideally the output should be 18, Jawaharlal Nehru Road, Park Street, Kolkata, but I get 18, Jawaharlal Nehru Road, Kolkata and Park Street instead. I understand that the desired output is essentially an in-order (document-order) traversal of the DOM tree rooted at the outer <span>, but I don't know how to achieve that with Jsoup when the DOM tree for an element in an HTML page can have arbitrary levels of nesting.
Any help would be appreciated. Thank you.

Use either DOM navigation or CSS-selector syntax for this task; do not loop through all elements.
Element adr = doc.select("span.adr").first();
System.out.println(adr.text());
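
To see why reading the text in document order gives the expected string, here is a minimal, library-independent sketch in Python using only the standard library's html.parser. It is not Jsoup code: the class TextInOrder, the function text_in_order and the trimmed sample markup (style attributes and the hidden region span left out) are assumptions made for illustration.

```python
from html.parser import HTMLParser


class TextInOrder(HTMLParser):
    """Collect text nodes in document order (depth-first, left to right)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # handle_data fires in source order, which is document order,
        # no matter how deeply the text is nested.
        text = " ".join(data.split())  # collapse runs of whitespace
        if text:
            self.chunks.append(text)


def text_in_order(html):
    parser = TextInOrder()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Trimmed version of the address markup from the question (hypothetical sample).
sample = (
    '<span class="adr">'
    '<span class="street-address">18, Jawaharlal Nehru\nRoad,\n</span>'
    '<span class="estb_addr-HeadingTxt">'
    '<a class="locality" href="#"> Park Street</a></span>'
    ', Kolkata</span>'
)

print(text_in_order(sample))
```

Because each text node is emitted where it occurs in the source, "Park Street" lands between the street address and ", Kolkata", which is the order the question asks for.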

HTML::Element not finding all elements

I have this snippet of html:
<li class="result-row" data="2">
<p class="result-info">
<span class="icon icon-star" role="button">
<span class="screen-reader-text">favorite this post</span>
</span>
<time class="result-date" datetime="2018-12-04 09:21" title="Tue 04 Dec 09:21:50 AM">Dec 4</time>
Link Text
and this perl code (not production, so no quality comments are necessary)
my $root = $tree->elementify();
my @rows = $root->look_down('class', 'result-row');
my $item = $rows[0];
say $item->dump;
my $date = $item->look_down('class', 'result-date');
say $date;
my $title = $item->look_down('class', 'result-title hdrlnk');
All outputs are as I expected except $date isn't defined.
When I look at the $item->dump, it looks like the time element doesn't show up in the output. Here's a snippet of the output from $item->dump where I would expect to see a <time...> element. All it shows is the text from the time element.
<li class="result-row" data="2"> #0.1.9.3.2.0
<a class="result-image gallery empty" href="https://localhost/1.html"> #0.1.9.3.2.0.0
<p class="result-info"> #0.1.9.3.2.0.1
<span class="icon icon-star" role="button"> #0.1.9.3.2.0.1.0
" "
<span class="screen-reader-text"> #0.1.9.3.2.0.1.0.1
"favorite this post"
" "
" Dec 4 "
<a class="result-title hdrlnk" data="2" href="https://localhost/1.html"> #0.1.9.3.2.0.1.2
"Link Text..."
" "
...
I've not used HTML::Element before. I rtfmed and didn't see any tag exclusions and I did a search of the package code for tags white/black lists (which wouldn't make sense, but neither does leaving out the time tag).
Does anyone know why the time element is not showing up in the dump and any search for it turns up nothing?
As an fyi, the rest of the code searches and finds elements without issue, it just appears to be the time tag that's missing.
HTML::TreeBuilder does not support HTML5 tags. Consider Mojo::DOM as an alternative that keeps up with the living HTML standard. I can't show how your whole code would look with Mojo::DOM since you've only shown a piece, but the Mojo::DOM equivalent of look_down is find (returns a Mojo::Collection arrayref) or at (returns the first element found or undef), both taking a CSS selector.
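
As a rough illustration of what a class-based lookup for the time element should return when the parser has no fixed list of known tags, here is a small Python sketch using the standard library's html.parser. It is not Perl and not Mojo::DOM; the class FirstByClass and the trimmed, properly closed sample markup are assumptions for illustration, and the depth counter assumes the matched element contains no unclosed void tags such as <br>.

```python
from html.parser import HTMLParser


class FirstByClass(HTMLParser):
    """Record the first element carrying a given class, plus its text."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.depth = 0        # > 0 while inside the matched element
        self.done = False     # True once the matched element has closed
        self.found_tag = None
        self.found_attrs = {}
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            self.depth += 1   # nested element inside the match
            return
        attr_map = dict(attrs)
        # Compare against each class token, so multi-class values also match.
        if self.wanted_class in (attr_map.get("class") or "").split():
            self.found_tag, self.found_attrs, self.depth = tag, attr_map, 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1
            self.done = self.depth == 0

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


# Trimmed sample based on the snippet in the question (closing tags added).
sample = (
    '<li class="result-row" data="2"><p class="result-info">'
    '<span class="icon icon-star" role="button">'
    '<span class="screen-reader-text">favorite this post</span></span>'
    '<time class="result-date" datetime="2018-12-04 09:21">Dec 4</time>'
    '</p></li>'
)

finder = FirstByClass("result-date")
finder.feed(sample)
finder.close()
print(finder.found_tag, finder.found_attrs.get("datetime"), "".join(finder.chunks))
```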

Parsing ONLY plain text from HTML using Kanna Swift

I am using Kanna Swift for HTML parsing.
For example:
How can I parse ONLY the highlighted English Text in this situation?
To be prone to something, usually something bad, means to have
a tendency to be affected by it or to do it.
<div class="caption hide_cn">
<a class="anchor" name="prone_1"></a>
<span class="num">1</span>
<span class="st" title="能被表示程度的副词或介词词组修饰的形容词">ADJ-GRADED </span>
<span class="tips_box">
<span class="lbl type-syntax">
<span class="span"> [</span>
verb-link <span class="hi rend-sc">ADJ</span>
</span>
<span class="lbl type-syntax">
<span class="span">, </span>
<span class="hi rend-sc">ADJ</span>
to-infinitive
<span class="span">]</span>
</span>
</span>
<span class="def_cn cn_before">
<span class="chinese-text">有(不好的)倾向的;易于</span>
…
<span class="chinese-text">的;很可能</span>
…
<span class="chinese-text">的</span>
</span>
To be <b>prone to</b> something, usually something bad, means to have a tendency to be affected by it or to do it.
<span class="def_cn cn_after">
<span class="chinese-text">有(不好的)倾向的;易于</span>
…
<span class="chinese-text">的;很可能</span>
…
<span class="chinese-text">的</span>
</span>
</div>
If I use:
doc.css("div[class='caption hide_cn']")
I get all the messy part around the sentence I want.
Maybe I am wrong, but I could not find enough documentation about the usage.
For example, I learned "span[class='xxx xxx']" from Stack Overflow rather than from the documentation on the GitHub page.
Is there something like "[class != 'xxx xxx']" or "!= span"?
After some tweaking I found a workaround, in case someone needs it later.
We can use the removeChild method to remove all the other sections!
// Search for nodes by CSS
for whole in doc.css("div[class='caption hide_cn']") {
    // Look up each unwanted section starting from this div, then detach it.
    if let a1 = whole.css("span[class='num']").first {
        whole.removeChild(a1)
    }
    if let a2 = whole.css("span[class='st']").first {
        whole.removeChild(a2)
    }
    if let a3 = whole.css("span[class='tips_box']").first {
        whole.removeChild(a3)
    }
    if let s1 = whole.css("span[class='def_cn cn_before']").first {
        whole.removeChild(s1)
    }
    if let s2 = whole.css("span[class='def_cn cn_after']").first {
        whole.removeChild(s2)
    }
    // text is optional, so unwrap it before printing.
    print(whole.text ?? "")
}
It's a pity I could not find this in the documentation. I guess those packages/libs are powerful enough to do almost anything you want. You just need to tweak a little bit.
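
An alternative to deleting the unwanted children is to keep only the text that sits outside the span children of the div. The sketch below shows that idea in Python with the standard library's html.parser; it is not Kanna code. The class OwnText and the shortened sample markup are assumptions for illustration, and the parser assumes the target div contains no nested div.

```python
from html.parser import HTMLParser


class OwnText(HTMLParser):
    """Collect the text of one <div>, skipping everything inside <span> children."""

    def __init__(self, div_class):
        super().__init__()
        self.div_class = div_class
        self.in_div = False
        self.span_depth = 0   # > 0 while inside any <span> of the div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if not self.in_div:
            if tag == "div" and dict(attrs).get("class") == self.div_class:
                self.in_div = True
        elif tag == "span":
            self.span_depth += 1

    def handle_endtag(self, tag):
        if not self.in_div:
            return
        if tag == "span" and self.span_depth:
            self.span_depth -= 1
        elif tag == "div":
            self.in_div = False   # assumes the target div has no nested div

    def handle_data(self, data):
        # Text inside <b> is kept because only <span> raises span_depth.
        if self.in_div and self.span_depth == 0:
            self.chunks.append(data)

    def result(self):
        return " ".join("".join(self.chunks).split())


# Shortened version of the markup from the question (hypothetical sample).
sample = """<div class="caption hide_cn">
<a class="anchor" name="prone_1"></a>
<span class="num">1</span>
<span class="st">ADJ-GRADED </span>
<span class="tips_box"><span class="lbl type-syntax"><span class="span"> [</span>
verb-link <span class="hi rend-sc">ADJ</span></span></span>
<span class="def_cn cn_before"><span class="chinese-text">的</span></span>
To be <b>prone to</b> something, usually something bad, means to have a tendency to be affected by it or to do it.
<span class="def_cn cn_after"><span class="chinese-text">的</span></span>
</div>"""

parser = OwnText("caption hide_cn")
parser.feed(sample)
parser.close()
print(parser.result())
```

This leaves the parsed document untouched, which may matter if the removed sections are needed later.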

How to get span class text using jsoup

I am using the jsoup HTML parser and trying to navigate to a span by its class and get its text, but it returns nothing and the size of the result is always zero. I have pasted a small part of the HTML source below. Please help me extract the text.
<div class="list_carousel">
<div class="rightfloat arrow-position">
<a class="prev disabled" id="ucHome_prev" href="#"><span>prev</span></a>
<a class="next" id="ucHome_next" href="#"><span>next</span></a>
</div>
<div id="uc-container" class="carousel_wrapper">
<ul id="ucHome">
<li modelID="587">
<h3 class="margin-bottom10"> Ford Figo Aspire</h3>
<div class="border-dotted margin-bottom10"></div>
<div>Estimated Price: <span class="cw-sprite rupee-medium"></span> 5.50 - 7.50 lakhs</div>
<div class="border-dotted margin-top10"></div>
</li>
<li modelID="899">
<h3 class="margin-bottom10"> Chevrolet Trailblazer</h3>
<div class="border-dotted margin-bottom10"></div>
<div>Estimated Price: <span class="cw-sprite rupee-medium"></span> 32 - 40 lakhs</div>
<div class="border-dotted margin-top10"></div>
</li>
I have tried below code:
Elements var_1=doc.getElementsByClass("list_carousel");//four classes with name of list_carousel
Elements var_2=var_1.eq(1);//selecting first div class
Elements var_3 = var_2.select("> div > span[class=cw-sprite rupee-medium]");
System.out.println(var_3 .eq(0).text());//printing first result of span text
Please ask if anything is unclear. Thanks in advance.
There are several things to note about your code:
A) you can't get the text of the span, since it has no text in the first place:
<div>Estimated Price:
<span class="cw-sprite rupee-medium"></span>
5.50 - 7.50 lakhs
</div>
See? The text is in the div, not the span!
B) Your selector "> div > span[class=cw-sprite rupee-medium]" is not really robust. Classes in HTML can occur in any order, so both
<span class="cw-sprite rupee-medium"></span>
<span class="rupee-medium cw-sprite"></span>
are the same. Your selector only picks up the first. This is why there is a class syntax in css, which you should use instead:
"> div > span.cw-sprite.rupee-medium"
Further, you can leave out the first > if you like.
Proposed solution
Element lcEl = doc.getElementsByClass("list_carousel").first(); // first() returns a single Element
Elements spans = lcEl.select("span.cw-sprite.rupee-medium");
for (Element span : spans) {
    // The price text belongs to the enclosing div, not to the empty span.
    Element priceDiv = span.parent();
    System.out.println(priceDiv.text());
}
Try
System.out.println(doc.select("#ucHome div:nth-child(3)").text());
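
Points A and B of the first answer can be sketched independently of jsoup. The Python example below, which uses only the standard library's html.parser, treats the empty span as a marker, matches its classes in any order, and reads the text from the enclosing div. The class PriceDivText and the trimmed sample markup (with the class order swapped in the second item on purpose) are assumptions for illustration.

```python
from html.parser import HTMLParser


class PriceDivText(HTMLParser):
    """Return the text of every <div> that directly holds a marker <span>."""

    def __init__(self, required_classes):
        super().__init__()
        self.required = set(required_classes)
        self.stack = []     # one [text_chunks, has_marker] entry per open <div>
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.stack.append([[], False])
        elif tag == "span" and self.stack:
            classes = set((dict(attrs).get("class") or "").split())
            # Subset test: class order in the attribute does not matter.
            if self.required <= classes:
                self.stack[-1][1] = True

    def handle_endtag(self, tag):
        if tag == "div" and self.stack:
            chunks, has_marker = self.stack.pop()
            if has_marker:
                self.results.append(" ".join("".join(chunks).split()))

    def handle_data(self, data):
        if self.stack:
            self.stack[-1][0].append(data)   # text goes to the innermost open div


# Trimmed sample based on the question's markup (hypothetical).
sample = """<ul id="ucHome">
<li modelID="587"><h3> Ford Figo Aspire</h3>
<div class="border-dotted margin-bottom10"></div>
<div>Estimated Price: <span class="cw-sprite rupee-medium"></span> 5.50 - 7.50 lakhs</div></li>
<li modelID="899"><h3> Chevrolet Trailblazer</h3>
<div>Estimated Price: <span class="rupee-medium cw-sprite"></span> 32 - 40 lakhs</div></li>
</ul>"""

parser = PriceDivText(["cw-sprite", "rupee-medium"])
parser.feed(sample)
parser.close()
for line in parser.results:
    print(line)
```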

HTML Agility get text from paragraph tags in a div

I'm trying to get the text of paragraph tags in a div using htmlagilitypack 2.28 in a windows phone 8.1 app.
The structure of div is
<div id="55">
<p> </p>
<p><span class="dropcap">W
</span><span class="zw-portion"><strong>ith the start of festive season in India</strong>, we
will also witness the f<strong>irst London Derby</strong> of the season
between the newly London rivals <strong>Chelsea and Arsenal</strong>. It will be a great chance
for Arsene Wenger to get rid of his <strong>1000</strong></span>
<strong><span class="zw-portion">th</span><span class="zw-portion"> managed </span>
<span class="zw-portion">6-0 </spa>
<span class="zw-portion">massacre</span></strong>
<span class="zw-portion"> in March,</span>
<span class="zw-portion"> </span>
<span class="zw-portion">while the Special One will be eager to continue his winning rampage
</span>
<span class="zw-portion"> </span>
<span class="zw- portion">over his “<strong>Specialist in Failure</strong>” counterpart. Although
both clubs can boast of being unbeaten this season and both clubs can take this opportunity
</span>
<span class="zw-portion"> to bring down their rival</span><span class="zw-portion">.</span></p>
<p> </p>
<p><iframe width="640" height="360" src="https://www.youtube.com/embed/zFBN8M1pCxo?
feature=oembed" frameborder="0" allowfullscreen=""></iframe></p>
<p class="zw-paragraph" data-textformat="
{"type":"text","td":"none"}"></p>
<p class="zw-paragraph" data-textformat=
{"type":"text","td":"none"}">
<span class="zw-portion">The rivalry between Chelsea and Arsenal was not as a primary London
Derby, until Chelsea rose to top of Premier League in 2000’s, when they consistently competed
against each other. The rivalry between the two clubs rose higher as compared to their
traditional rivals. Both the clubs rivalry are now not only limited to their pitch but has also
been to the fans. In 2009 survey by Football Fans Census, Arsenal fans named Chelsea as the
<strong>most disliked club</strong> </span>
<span class="zw-portion"> ahead of their traditional rivals <strong>Manchest</strong></span>
<strong> <span class="zw-portion">er United and Tottenham Hotspur</span></strong>
<span class="zw-portion">. However the report of the other camp doesn’t differ much as Chelsea
fans ranks Arsenal as their <strong>second most-disliked club</strong></span>
<strong><span class="zw-portion">.
</span></strong></p>
</div>
I want to extract only the text contained within the paragraph elements within the div.
I have written the following code so far, where feedurl contains the address of the page from which data is to be extracted (the correct address is extracted). After that I try to get a reference to the div using its id (which is always equal to 55).
var feedurl = GetValue("feedurl");
string htmlPage = "asdsad";
HtmlDocument htmldoc = new HtmlDocument();
htmldoc.LoadHtml(feedurl);
htmldoc.OptionUseIdAttribute=true;
HtmlNode div = htmldoc.GetElementbyId("55");
if (div != null)
{
htmlPage += "done";
}
_content = htmlPage;
return _content;
htmldoc.GetElementbyId("55"); returns a null reference.
I've read that I should use htmldoc.DocumentNode.SelectNodes([arguments]), but there is no SelectNodes method available to me, and I'm lost on how to proceed. Please help.
The HtmlAgilityPack build for WP 8.1 doesn't support SelectNodes() because that method requires an XPath implementation, which is unfortunately missing from the .NET version for WP 8.1.
The solution is to use HtmlAgilityPack's LINQ API instead of XPath. For example, to get the <div> element whose id attribute equals 55:
// Descendants() and FirstOrDefault() need: using System.Linq;
HtmlNode div55 = htmldoc.DocumentNode
    .Descendants("div")
    .FirstOrDefault(o => o.GetAttributeValue("id", "") == "55");

how to retrieve data from html between <span> and </span>

I want to get the rating (from 1 to 5) in Amazon customer reviews.
I checked the page source and found that this part looks like:
<div style="margin-bottom:0.5em;">
<span style="margin-right:5px;"><span class="swSprite s_star_5_0 " title="5.0 out of 5 stars" ><span>5.0 out of 5 stars</span></span> </span>
<span style="vertical-align:middle;"><b>Works great right out of the box with Surface Pro</b>, <nobr>October 5, 2013</nobr></span>
</div>
I want to get 5.0 out of 5 stars from
<span>5.0 out of 5 stars</span></span> </span>
How can I use xpathSApply to get it?
Thank you!
I would recommend using the selectr package, which uses css selectors in place of xpath.
library(XML)
doc <- htmlParse('
<div style="margin-bottom:0.5em;">
<span style="margin-right:5px;">
<span class="swSprite s_star_5_0 " title="5.0 out of 5 stars" >
<span>5.0 out of 5 stars</span></span> </span>
<span style="vertical-align:middle;">
<b>Works great right out of the box with Surface Pro</b>,
<nobr>October 5, 2013</nobr></span>
</div>', asText = TRUE
)
library(selectr)
xmlValue(querySelector(doc, 'div > span > span > span'))
UPDATE: If you are looking to use xpath, you can use the css_to_xpath function in selectr to figure out the appropriate xpath command, which in this case turns out to be
"descendant-or-self::div/span/span/span"
I do not know R well, but I can give you the XPath string. It seems you want the text of the first span that has no attributes, which would be:
//span[not(@*)][1]/text()
You can put this string into xpathSApply.
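
Both path ideas can be checked quickly outside R. The sketch below uses Python's standard xml.etree.ElementTree, which supports only a limited XPath subset and works here only because the snippet happens to be well-formed XML; a full Amazon page would need a real HTML parser. The variable names are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# The snippet from the question; it parses as XML because every tag is closed.
snippet = """<div style="margin-bottom:0.5em;">
<span style="margin-right:5px;"><span class="swSprite s_star_5_0 " title="5.0 out of 5 stars" ><span>5.0 out of 5 stars</span></span> </span>
<span style="vertical-align:middle;"><b>Works great right out of the box with Surface Pro</b>, <nobr>October 5, 2013</nobr></span>
</div>"""

root = ET.fromstring(snippet)          # root is the outer <div>

# Child-path approach, mirroring div/span/span/span.
rating_by_path = root.find("./span/span/span").text

# Attribute-free approach, mirroring span[not(@*)]: ElementTree has no
# not(@*) predicate, so the filter is done in Python instead.
bare_spans = [el for el in root.iter("span") if not el.attrib]
rating_by_attrs = bare_spans[0].text

print(rating_by_path)
print(rating_by_attrs)
```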