How to take content from two sibling nodes separately?

I have an HTML file with a list of product names and prices:
<ul>
<li>
<label>
<span class="name">Name 1</span>
<span class="price">3.99</span>
</label>
</li>
<li>
<label>
<span class="name">Name 2</span>
<span class="price">5.49</span>
</label>
</li>
...
</ul>
and I need to take the name and the price from each <label> separately.
I'm using Nokogiri to parse the HTML file and tried
file.xpath('//ul/li/label').each do |item|
puts item.content
end
but, as you may have guessed, it returns both name and price.

The name and price span elements are children of the label element, so you can fetch them with XPath evaluated within the scope of each label:
file.xpath('//ul/li/label').each do |item|
name = item.at_xpath("span[@class='name']").text()
price = item.at_xpath("span[@class='price']").text()
puts "#{name} - #{price}"
end
or using CSS selectors:
file.xpath('//ul/li/label').each do |item|
name = item.at_css('.name').text()
price = item.at_css('.price').text()
puts "#{name} - #{price}"
end

Typically I'd use something like this:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<ul>
<li>
<label>
<span class="name">Name 1</span>
<span class="price">3.99</span>
</label>
</li>
<li>
<label>
<span class="name">Name 2</span>
<span class="price">5.49</span>
</label>
</li>
</ul>
EOT
data = doc.css('label').map { |label| [label.at('.name').text, label.at('.price').text] }.to_h
# => {"Name 1"=>"3.99", "Name 2"=>"5.49"}
As long as the .name text is unique, as it appears to be in the example HTML, the resulting hash will be valid and easy to use.
If you need the entries in order, Ruby hashes return key/value pairs in the order they were inserted when you iterate over them. I don't recommend relying on that, because other languages don't guarantee it, but your mileage might vary. Otherwise, looking up the value for a given key is extremely fast no matter how many entries there are, because it's a hash. A hash can also be passed around for a lot of useful munging.

Related

Xpath issues selecting <spans> nested in <td>

I'm trying to extract text from a lot of XHTML documents with a program that uses XPath queries to map the text into a structured table. The XHTML document looks like this:
<td class="td-3 c12" valign="top">
<p class="pa-4">
<span class="ca-5">text I would like to select </span>
</p>
</td>
<td class="td-3 c13" valign="top">
<p class="pa-2">
<span class="ca-0">some more text I want to select </span>
</p>
<p class="pa-2">
<span class="ca-0">
<br>
</br>
</span>
</p>
<p class="pa-2">
<span class="ca-5">text and values I don't want to select.</span>
</p>
<p class="pa-2">
<span class="ca-5"> also text and values I don't want to </span>
</p>
</td>
I'm able to select the spans by their class and retrieve the text/values, but they're not unique enough, so I need to filter by the table cell classes: for example, only the text from span class ca-0 that is a child of td class td-3 c13,
which would be <span class="ca-0">some more text I want to select </span>
I've tried all these combinations:
//xhtml:td[@class="td-3 c13"]/xhtml:span[@class = "ca-0"]
//xhtml:span[@class = "ca-0"] //ancestor::xhtml:td[@class= "td-3 c13"]
//xhtml:td[@class="td-3 c6"]//xhtml:span[@class = "ca-0"]
I'm not sure how much your sample xml reflects your actual xml, but strictly based on your sample xml (AND disregarding possible namespaces issues you will probably face), the following xpath expression:
//td[contains(@class,"td-3")]/p[1]/span/text()
selects
text I would like to select
some more text I want to select
According to the doc, and to support namespaces, you should write something like this (fn:...):
//*:td[fn:contains(@class,"td-3")]/*:p[1]/*:span
Or with a namespace binding:
node.xpath("//xhtml:td[fn:contains(@class,'td-3')]/xhtml:p[1]/xhtml:span", {"xhtml":"http://example.com/ns"})
This expression should work too (it selects the first span of the first p of each td element):
//*:td/*:p[1]/*:span[1]
Side notes:
Your XPath expressions could be fixed. The span is not a child but a descendant, so we use //. We wrap the expression in () to keep only the first result.
(//xhtml:td[@class="td-3 c13"]//xhtml:span[@class = "ca-0"])[1]
(//xhtml:td[@class="td-3 c6"]//xhtml:span[@class = "ca-0"])[1]
Or replace // with a predicate []:
(//xhtml:span[@class = "ca-0"][ancestor::xhtml:td[@class= "td-3 c13"]])[1]
Test your XPath with : https://docs.marklogic.com/cts.validIndexPath
The solution is
//td[(@class ="td-3") and (@class = "c13")]/p/span
For some reason it sees the
<td class="td-3 c13">
as separate classes, e.g.
<td class = "td-3" and class = "c13"
so you need to treat them as such.
Thanks to @E.Wiest and @JackFleeting for validating and pointing me in the right direction.

XPath selection by value

I want to get the value of "Square" (for example, 201). I tried to do it as described here, but it doesn't work:
./li[attributeTitle='Floor']
Html code:
<div class = "A">
<ui class = "B">
<li>
<span class = "attributeTitle"> Floor </span>
<span class = "attributeValue"> 3 </span>
</li>
<! A random more items "li" >
<li>
<span class = "attributeTitle"> Square </span>
<span class = "attributeValue"> 201 </span>
</li>
<li>
<span class = "attributeTitle"> Nrooms </span>
<span class = "attributeValue"> 4 </span>
</li>
</ui>
</div>
Thanks for any help.
You can use the contains() function in XPath to check whether the text contains some string:
"//span[@class='attributeTitle'][contains(text(),'Square')]"
This gets you this node:
<span class = "attributeTitle"> Square </span>
To get the value node right after it, you can use following-sibling::span:
"//span[@class='attributeTitle'][contains(text(),'Square')]/following-sibling::span[1]"
The [1] indicates that we want only the first sibling in case there is more than one. You can also use [@class='attributeValue'] instead to select only siblings that have this particular class, or omit the predicate entirely if you trust there will only be one sibling.

Use regular expressions to add new class to element using search/replace

I want to add a NewClass value to the class attribute and modify the text of the span using find/replace functionality with a pair of regular expressions.
<div>
<span class='customer' id='phone$0'>Home</span>
<br/>
<span class='customer' id='phone$1'>Business</span>
<br/>
<span class='customer' id='phone$2'>Mobile</span>
</div>
I am trying to get the following result after the search/replace:
<span class='customer NewClass' id='phone$1'>Organization</span>
I'm also curious to know whether a single find/replace operation can be used for both tasks.
Regex can do this, but be aware that using regex to change HTML can have a lot of edge cases that you may not have accounted for.
This regex101 example shows those three <span> elements changed to add NewClass and the contents to be changed to Organization.
Other technologies, however, would be safer. jQuery, for example, could replace them regardless of the order of the attributes:
$("span#phone\\$1").addClass("NewClass");
$("span#phone\\$1").text("Organization");
So just be careful with it, and you should be fine.
EDIT
According to comments on the OP, you want to only change the span containing ID phone$1, so the regex101 link has been updated to reflect this.
EDIT 2
Permalink was too long to fit into a comment, so adding the permalink here. Click on the "Content" tab at the bottom to see the replacement.
You can use a regex like this:
'.*?' id='phone\$1'>.*?<
With substitution string:
'customer NewClass' id='phone\$1'>Organization<
Working demo
PHP code
$re = "/'.*?' id='phone\\$1'>.*?</";
$str = "<div>\n <span class='customer' id='phone\$0'>Home</span>\n<br/>\n <span class='customer' id='phone\$1'>Business</span>\n<br/>\n <span class='customer' id='phone\$2'>Mobile</span>\n</div>";
// "\\\$1" yields \$1, so preg_replace emits a literal $1 instead of treating it as a backreference
$subst = "'customer NewClass' id='phone\\\$1'>Organization<";
$result = preg_replace($re, $subst, $str);
Result
<div>
<span class='customer' id='phone$0'>Home</span>
<br/>
<span class='customer NewClass' id='phone$1'>Organization</span>
<br/>
<span class='customer' id='phone$2'>Mobile</span>
</div>
Since your tags include preg_match and preg_replace, I think you are using PHP.
Regex is generally not a good idea to manipulate HTML or XML. See RegEx match open tags except XHTML self-contained tags SO post.
In PHP, you can use DOMDocument and DOMXPath with the //span[@id="phone$1"] XPath (which gets all span tags whose id attribute value equals phone$1):
$html =<<<DATA
<div>
<span class='customer' id='phone$0'>Home</span>
<br/>
<span class='customer' id='phone$1'>Business</span>
<br/>
<span class='customer' id='phone$2'>Mobile</span>
</div>
DATA;
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xp = new DOMXPath($dom);
$sps = $xp->query('//span[@id="phone$1"]');
foreach ($sps as $sp) {
$sp->setAttribute('class', $sp->getAttribute('class') . ' NewClass');
$sp->nodeValue = 'Organization';
}
echo $dom->saveHTML();
See IDEONE demo
Result:
<div>
<span class="customer" id="phone$0">Home</span>
<br>
<span class="customer NewClass" id="phone$1">Organization</span>
<br>
<span class="customer" id="phone$2">Mobile</span>
</div>

Scraping with Nokogiri::HTML - Can't get text from XPATH

I'm trying to scrape html with Nokogiri.
This is the html source:
<span id="J_WlAreaInfo" class="wl-areacon">
<span id="J-From">山东济南</span>
至
<span id="J-To">
<span id="J_WlAddressInfo" class="wl-addressinfo" title="全国">
全国
<s></s>
</span>
</span>
</span>
I need to get the following text: 山东济南
Checked shortest XPATH with firebug:
//*[@id="J-From"]
Here is my ruby code:
doc = Nokogiri::HTML(open("http://foo.html"), "UTF-8")
area = doc.xpath('//*[@id="J-From"]')
puts area.text
However, it returns nothing.
What am I doing wrong?
However, it returns nothing. What am I doing wrong?
xpath() returns an array containing the matches (it's actually called a NodeSet):
require 'nokogiri'
html = %q{
<span id="J_WlAreaInfo" class="wl-areacon">
<span id="J-From">山东济南</span>
至
<span id="J-To">
<span id="J_WlAddressInfo" class="wl-addressinfo" title="全国">
全国
<s></s>
</span>
</span>
</span>
}
doc = Nokogiri::HTML(html)
target_tags = doc.xpath('//*[@id="J-From"]')
target_tags.each do |target_tag|
puts target_tag.text
end
--output:--
山东济南
Edit: You can actually call text() on the Array, but it will return the concatenated results of the text for each match in the array--which is not something I've ever found useful--but because there is only one match you should have gotten the result 山东济南. There is nothing in your post that indicates why you didn't get that result.
If you only want a single result from your xpath, i.e. the first match, then you can use at_xpath():
target_tag = doc.at_xpath('//*[@id="J-From"]')
puts target_tag.text
--output:--
山东济南

how to get text of few child nodes as single string

How can I get or compare text that is split across a few nodes? For example:
<label>
<span class="first">first part</span>
<span class="second">second part</span>
<span class="third">third part</span>
</label>
How to get string "first partsecond partthird part" and use it for comparison with text() function?
And more complicated example:
<label>
<span class="first">first part</span>
<span class="second">second part</span>
something between
<span class="third">third part</span>
</label>
How to get string "first partsecond partsomething betweenthird part" and use it for comparison with text() function?
Obviously, it would be nice to strip unnecessary spaces from the result with normalize-space(), but the key is how to concatenate those children's texts.
You can use the following xpath:
//label/descendant::*/text() | //label/text()[normalize-space()]
You may see some whitespace around 'something between', which you can trim in whatever programming language you use.