Xpath issues selecting <spans> nested in <td> - html

I'm trying to extract text from a lot of XHTML documents with a program that uses Xpath queries to map the text into a structured table. the XHTML document looks like this
<td class="td-3 c12" valign="top">
<p class="pa-4">
<span class="ca-5">text I would like to select </span>
</p>
</td>
<td class="td-3 c13" valign="top">
<p class="pa-2">
<span class="ca-0">some more text I want to select </span>
</p>
<p class="pa-2">
<span class="ca-0">
<br>
</br>
</span>
</p>
<p class="pa-2">
<span class="ca-5">text and values I don't want to select.</span>
</p>
<p class="pa-2">
<span class="ca-5"> also text and values I don't want to </span>
</p>
</td>
I'm able to select the the spans by their class and retrieve the text/values, however they're not unique enough and I need to filter by table classes. for example only the text from span class ca-0 that is a child of td class td-3 c13
which would be <span class="ca-0">some more text I want to select </span>
I've tried all these combinations
//xhtml:td[#class="td-3 c13"]/xhtml:span[#class = "ca-0"]
//xhtml:span[#class = "ca-0"] //ancestor::xhtml:td[#class= "td-3 c13"]
//xhtml:td[#class="td-3 c6"]//xhtml:span[#class = "ca-0"]

I'm not sure how much your sample xml reflects your actual xml, but strictly based on your sample xml (AND disregarding possible namespaces issues you will probably face), the following xpath expression:
//td[contains(#class,"td-3")]/p[1]/span/text()
selects
text I would like to select
some more text I want to select

According to the doc, and to support namespaces, you should write something like this (fn:...) :
//*:td[fn:contains(#class,"td-3")]/*:p[1]/*:span
Or with a binding namespace :
node.xpath("//xhtml:td[fn:contains(#class,'td-3')]/xhtml:p[1]/xhtml:span", {"xhtml":"http://example.com/ns"})
This expression should work too (select the first span of the first p of each td element) :
//*:td/*:p[1]/*:span[1]
Side notes :
Your XPath expressions could be fixed. Span is not a child but a descendant, so we use //. We use () to keep the first result only.
(//xhtml:td[#class="td-3 c13"]//xhtml:span[#class = "ca-0"])[1]
(//xhtml:td[#class="td-3 c6"]//xhtml:span[#class = "ca-0"])[1]
Replace // with a predicate [] :
(//xhtml:span[#class = "ca-0"][ancestor::xhtml:td[#class= "td-3 c13"]])[1]
Test your XPath with : https://docs.marklogic.com/cts.validIndexPath

The solution is
//td[(#class ="td-3") and (#class = "c13)]/p/span
for some reason it sees the
<td class="td-3 c13">
as separate classes e.g.
<td class = "td-3" and class = "c13"
so you need to treat them as such
Thanks to #E.Wiest and #JackFleeting for validating and pointing me in the right direction.

Related

exclude html with regex and select only the text

could you help me, I'm using a content extractor in regex but I have a problem extracting a subtitle:
<h2 class="page-title">Jesse Vega Schoolgirl <span class="duration">14 min</span> </h2>
I would like to select only the text and exclude the <span class="duration">14 min</span>
just stay like this
Jesse Vega Schoolgirl or so <h2> Jesse Vega Schoolgirl </h2>
I appreciate your answers
I assume, since your text is html, that you are trying to use javascript. So, the following will do it quickly.
.title\">([0-9a-zA-Z ]+).
Result will be in group 1.
Example:
let str = '<h2 class="page-title">Jesse Vega Schoolgirl <span class="duration">14 min</span> </h2>';
let groups = str.match(/.*title\">([0-9a-zA-Z ]+).*/);
alert(groups[1]);
Adding to the response, if there are more characters to match, you just need to add them to the group, as follows:
https://jsfiddle.net/fj2146ye/

(Reverse) Traverse XPath Query for Accessing a DIV with a particular Text Value

Working with a DOM that has the same HTML loop 100+ times that looks like this
<div class="intro">
<div class="header">
<h1 class="product-code"> <span class="code">ZY001</span> <span class="intro">ZY001 Title/Intro</span> </h1>
</div>
<div>
<table>
<tbody>
<tr>
<td>Available</td>
<td> S </td>
<td> M </td>
<td> XL </td>
</tr>
I was previously using this XPath Query to get ALL the node values back (all 100+ instances of the DOM Query in connection with the variable nodes that may contain in Available
//div[#class='intro']/div/table/tbody/tr/td[contains(text(),'Available')]/following-sibling::td
object(DOMNodeList)[595]
public 'length' => int 591
Now I am needing to target the product-code / code specifically to retrieve all the td attributes for a particular code
Because the div that contains the unique identifier (in the example above, ZY001) is not a direct ancestor, my thinking is I have to do a Reverse XPath Query
Here's one of my attempts:
//h1[#class='product-code']/span[contains(#class, 'code') and text() = 'ZY001']/../../div[#class='intro']/div/table/tbody/tr/td[contains(text(),'Available')]/following-sibling::td
As I am defining /span[contains(#class, 'code') and text() = 'ZY001'] and then attempting to traverse the dom backwards twice using /../../ I was hoping/expecting to get back the div[#class='intro'] with the text ZY001 immediately above it, or rather a public 'length' => int 1
But all my attempts thus far have resulted in 0 results. Not false, indicating an improper XPath, but 0.
How can I modify my XPath Query to get back the single instance in the one-of-many <div class="intro">'s that contain the <h1 class="product-code">/<span class="code"> text value ZY001?
Use
//h1[#class='product-code']/span[contains(#class, 'code') and text() = 'ZY001']/../../../div/table/tbody
instead of
//h1[#class='product-code']/span[contains(#class, 'code') and text() = 'ZY001']/../../div[#class='intro']/div/table/tbody
You can use any of the below xpath's for that:
//div[#class='intro' and //h1[#class='product-code']/span[#class='code' and text()='ZY001']]//tbody/tr[td[text()='Available']]/td[2]
//div[#class='intro' and //span[#class='code' and text()='ZY001']]//tbody/tr[td[text()='Available']]/td[2]
//div[#class='intro' and //span[#class='code' and text()='ZY001']]//tr[td[text()='Available']]/td[2]
Change td[2] to td[3] and td[4] to get the 3rd and 4th td respectively

Use regular expressions to add new class to element using search/replace

I want to add a NewClass value to the class attribute and modify the text of the span using find/replace functionality with a pair of regular expressions.
<div>
<span class='customer' id='phone$0'>Home</span>
<br/>
<span class='customer' id='phone$1'>Business</span>
<br/>
<span class='customer' id='phone$2'>Mobile</span>
</div>
I am trying to get the following result using after search/replace:
<span class='customer NewClass' id='phone$1'>Organization</span>
Also curious to know if a single find/replace operation can been used for both tasks?
Regex can do this, but be aware the using regex to change HTML can have a lot of edge cases that you may not have accounted for.
This regex101 example shows those three <span> elements changed to add NewClass and the contents to be changed to Organization.
Other technologies, however, would be safer. jQuery, for example, could replace them regardless of the order of the attributes:
$("span#phone$1").addClass("NewClass");
$("span#phone$1").text("Organization");
So just be careful with it, and you should be fine.
EDIT
According to comments on the OP, you want to only change the span containing ID phone$1, so the regex101 link has been updated to reflect this.
EDIT 2
Permalink was too long to fit into a comment, so adding the permalink here. Click on the "Content" tab at the bottom to see the replacement.
You can use a regex like this:
'.*?' id='phone\$1'>.*?<
With substitution string:
'customer' id='phone\$1'>Organization<
Working demo
Php code
$re = "/'.*?' id='phone\\$1'>.*?</";
$str = "<div>\n <span class='customer' id='phone\$0'>Home</span>\n<br/>\n <span class='customer' id='phone\$1'>Business</span>\n<br/>\n <span class='customer' id='phone\$2'>Mobile</span>\n</div>";
$subst = "'customerNewClass' id='phone\$1'>Organization<";
$result = preg_replace($re, $subst, $str);
Result
<div>
<span class='customer' id='phone$0'>Home</span>
<br/>
<span class='customerNewClass' id='phone$1'>Organization</span>
<br/>
<span class='customer' id='phone$2'>Mobile</span>
</div>
Since your tags include preg_match and preg_replace, I think you are using PHP.
Regex is generally not a good idea to manipulate HTML or XML. See RegEx match open tags except XHTML self-contained tags SO post.
In PHP, you can use DOMDocument and DOMXPath with //span[#id="phone$1"] xpath (get all span tags with id attribute vlaue equal to phone$1):
$html =<<<DATA
<div>
<span class='customer' id='phone$0'>Home</span>
<br/>
<span class='customer' id='phone$1'>Business</span>
<br/>
<span class='customer' id='phone$2'>Mobile</span>
</div>
DATA;
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xp = new DOMXPath($dom);
$sps = $xp->query('//span[#id="phone$1"]');
foreach ($sps as $sp) {
$sp->setAttribute('class', $sp->getAttribute('class') . ' NewClass');
$sp->nodeValue = 'Organization';
}
echo $dom->saveHTML();
See IDEONE demo
Result:
<div>
<span class="customer" id="phone$0">Home</span>
<br>
<span class="customer NewClass" id="phone$1">Organization</span>
<br>
<span class="customer" id="phone$2">Mobile</span>
</div>

Selenium WebDriver how to verify Text from Span Tag

I'm trying to verify the text in the span by using WebDriver. There is the span tag:
<span class="value">
/Company Home/IRP/tranzycja
</span>
I tried something like this:
driver.findElement(By.xpath("//span[#id='/Company Home/IRP/tranzycja']'"));
driver.findElement(By.cssSelector("span./Company Home/IRP/tranzycja"));
but none of this work.
Any help would be really appreciated. Thanks
More code:
<span id="uniqName_64_0" class="alfresco-renderers-PropertyLink alfresco-renderers-Property pointer small" data-dojo-attach-point="renderedValueNode" widgetid="uniqName_64_0">
<span class="inner" tabindex="0" data-dojo-attach-event="ondijitclick:onLinkClick">
<span class="label">
In folder:
</span>
<span class="value">
/Company Home/IRP/tranzycja
</span>
</span>
uniqName shouldn't be a target because are a lot of them and they are change.
There is a full html code:
http://www.filedropper.com/spantag
Here I am assuming you are trying to verify the text in the span tag.
i.e '/Company Home/IRP/tranzycja'
Try Below code
String expected String = "/Company Home/IRP/tranzycja";
String actual_String = driver.findElement(By.xpath("//span[#class='alfresco-renderers-PropertyLink alfresco-renderers-Property pointer small']//span[#class='value']")).getText();
if(expected String.equals(actual_String))
{
System.out.println("Text is Matched");
}
else
{
System.out.println("Text is not Matched");
}
You can try using xpath ('some text' can be replaced by variable like #Rupesh suggested):
driver.findElement(By.xpath("//span/span[#class='value'][normalize-space(.) = 'some text']"))
or
driver.findElement(By.xpath("//span/span[#class='value'][contains(text(),'some text')]"))
(Be aware that this xpath will find first matching element, so if there are span elements with text 'some text 1' and 'some text 2', only first occurrence will be found.)
Of course, those two methods will throw NoSuchElementException if element (with defined text) is not found on page. If you're using Java and if needed, you can easy catch that error and print proper message.
One possible xpath to find that <span> element :
//span[normalize-space(.) = '/Company Home/IRP/tranzycja']
I think your going to want to use something like
driver.findElement(By.xpath("//span[#id='/Company Home/IRP/tranzycja'])).getText();
the getText(); will get the text within that span
You can use text() method inside Xpath. I hope this will resolve your problem
String str1 = driver.findElement(By.xpath("//span[text()='/Company Home/IRP/tranzycja']")).getText();
System.out.println("str1");
Output = /Company Home/IRP/tranzycja

how to retrieve data from html between <span> and </span>

I want to get the rate that is from 1 to 5 in amazon customer reviews.
I check the source, and find this part looks as
<div style="margin-bottom:0.5em;">
<span style="margin-right:5px;"><span class="swSprite s_star_5_0 " title="5.0 out of 5 stars" ><span>5.0 out of 5 stars</span></span> </span>
<span style="vertical-align:middle;"><b>Works great right out of the box with Surface Pro</b>, <nobr>October 5, 2013</nobr></span>
</div>
I want to get 5.0 out of 5 stars from
<span>5.0 out of 5 stars</span></span> </span>
how can i use xpathSApply to get it?
Thank you!
I would recommend using the selectr package, which uses css selectors in place of xpath.
library(XML)
doc <- htmlParse('
<div style="margin-bottom:0.5em;">
<span style="margin-right:5px;">
<span class="swSprite s_star_5_0 " title="5.0 out of 5 stars" >
<span>5.0 out of 5 stars</span></span> </span>
<span style="vertical-align:middle;">
<b>Works great right out of the box with Surface Pro</b>,
<nobr>October 5, 2013</nobr></span>
</div>', asText = TRUE
)
library(selectr)
xmlValue(querySelector(doc, 'div > span > span > span'))
UPDATE: If you are looking to use xpath, you can use the css_to_xpath function in selectr to figure out the appropriate xpath command, which in this case turns out to be
"descendant-or-self::div/span/span/span"
I do not know r much but I can give you the XPath string. It seems you want the first span's text which has no attribute and this would be:
//span[not(#*)][1]/text()
You can put this string into xpathSApply.