What is the role of parentheses in XPath 1.0? - google-chrome

In Chrome DevTools > Elements, when I search for //tr/td/span I find an element (because such an element exists on my page).
When I search for (//tr)/td/span or (//tr/td)/span I also find this element.
But neither //tr(/td)/span nor //tr/(td)/span nor //tr/(td/)span find anything.
What is the meaning of these parentheses in XPath?

Parentheses in XPath are used much as they are in other programming languages:
Function argument grouping: e.g: //tr/td[contains(.,"e")]
Evaluation precedence indication: e.g: normal arithmetic expression grouping as well as leading path grouping (trace LocationPath through to PrimaryExpr in the XPath grammar) as in (//td)[1] to find the first td in the document as opposed to //td[1] which finds the td elements that are the first child of their respective parent elements.
They're also used in
node tests: e.g: node(), text(), comment() (XPath 2.0 adds kind tests such as element())
processing-instruction tests: e.g: processing-instruction('PageBreak').
Your examples that do not find anything (e.g: //tr(/td)/span, //tr/(td)/span [1], etc.) have parentheses embedded within the path that do not fall into one of the above categories. Such uses of parentheses are syntactically invalid in XPath 1.0 and should have been reported as such rather than silently failing.
[1] Note that this expression would actually be syntactically valid under XPath 2.0/3.0. Thanks, @Andersson, for noticing.

I don't think the parentheses mean anything special in your case, but they can be used to select the required node or node-set by index.
For instance, given HTML like this:
<table>
<tr>
<td>
<span>first</span>
</td>
<td>
<span>second</span>
</td>
</tr>
<tr>
<td>
<span>third</span>
</td>
<td>
<span>fourth</span>
</td>
</tr>
</table>
(//tr)[1]/td returns the cells of the first row only (first, second)
(//tr)[2]/td returns the cells of the second row (third, fourth)
(//tr/td)[1] returns the first cell of the first row (first). Note that //tr/td[1] returns the first cell of each row (first, third)
...
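The difference between a predicate on a step and a predicate on a parenthesised path can be sketched in Python. This is only an approximation: the standard library's xml.etree.ElementTree supports a limited XPath subset without parenthesised grouping, so the grouped forms are emulated here by indexing the result list. The names HTML, texts and the result variables are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Same table as in the answer above, compacted.
HTML = """
<table>
  <tr><td><span>first</span></td><td><span>second</span></td></tr>
  <tr><td><span>third</span></td><td><span>fourth</span></td></tr>
</table>
"""
root = ET.fromstring(HTML)


def texts(cells):
    """Return the span text of each td element."""
    return [cell.findtext("span") for cell in cells]


# //tr/td[1]: the predicate is applied per parent, so every row contributes
# its own first cell.
first_cell_of_each_row = texts(root.findall(".//tr/td[1]"))

# (//tr/td)[1]: the predicate is applied to the whole node-set.
# Emulated by taking the first item of the complete result list.
first_cell_overall = texts(root.findall(".//tr/td")[:1])

# (//tr)[2]/td: pick the second row from the whole document, then its cells.
rows = root.findall(".//tr")
second_row_cells = texts(rows[1].findall("td"))

print(first_cell_of_each_row)  # ['first', 'third']
print(first_cell_overall)      # ['first']
print(second_row_cells)        # ['third', 'fourth']
```

With a full XPath 1.0 engine the grouped expressions can be passed as written; the list indexing is only a stand-in for what the parentheses do.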

Related

HtmlAgilityPack Wildcard Search in Powershell

How could I shorten the following?
$contactsBlock is an HTMLAgilityPack node, XPath: /html[1]/body[1]/div[3]/div[2]/div[2]/div[1]/div[1]/div[2]/div[1]/div[3]/div[5]/div[1]/div[2]
$contactsBlock.SelectSingleNode(".//table").SelectSingleNode(".//table")
Results in desired XPath: /html[1]/body[1]/div[3]/div[2]/div[2]/div[1]/div[1]/div[2]/div[1]/div[3]/div[5]/div[1]/div[2]/table[1]/tr[2]/td[1]/div[1]/div[2]/table[1]
The second table is nested in the first, and I'd like to shorten the above SelectSingleNode twice to something like this
$contactsBlock.SelectSingleNode(".//table/*/table") and skip the in-between.
Is there a way to wild-card like this?
An XPath expression .//table//table should match all tables nested within other tables under the current node. Double forward slashes match arbitrary length paths.
.//table/*/table is unlikely to give you a match, because the asterisk wildcard matches one node (i.e. one level of hierarchy), so the nested table would have to be a grandchild node of the first table:
<table>
<tr>
<table>...</table> <!-- nested table would have to go here -->
</tr>
</table>
which would be quite unusual. It also doesn't match the structure suggested by the XPath expression in your question.
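To see the two expressions side by side, here is a small sketch using Python's xml.etree.ElementTree instead of HtmlAgilityPack (an assumption made purely so the example is self-contained; ElementTree supports both of these path forms). The markup in DOC is a hypothetical, simplified stand-in for the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified structure: the inner table sits several levels
# below the outer one (tr > td > div > table), as in the question.
DOC = """
<div>
  <table>
    <tr><td><div><table><tr><td>inner</td></tr></table></div></td></tr>
  </table>
</div>
"""
block = ET.fromstring(DOC)

# '//' between the two steps matches a path of any length.
nested_any_depth = block.findall(".//table//table")

# '*' matches exactly one element, so the inner table would have to be a
# grandchild of the outer table.
nested_grandchild = block.findall(".//table/*/table")

print(len(nested_any_depth))   # 1
print(len(nested_grandchild))  # 0
```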

xpath expression - select element where parent contains a specific text

I struggle currently to find the correct xpath expression to select an input element where its parent/sibling element contains a specific text.
In the example below, I would like to select the "input" element where, in the same tr row, a td element with a specific text exists.
my example path - returns no match
//input[contains(../../../td/text(),"15-935-331")]
source code
<tr>
<td>xxxx, yyyyy</td>
<td>Mr</td>
<td></td>
<td> 15-935-331</td>
<form id="betreuerModel" action="xxxx" method="POST">
<td class="tRight">
<input value="Bearbeiten" id="bearbeiten" name="bearbeiten" class="submit" title="Bearbeiten" type="submit"/>
</td>
</form>
</tr>
<tr>
// .. next row with same structure
</tr>
The contains function, when given a nodeset, will only operate on the very first node in that nodeset. In your case, it is <td>xxxx, yyyyy</td>.
You could instead refactor your expression so that the predicate operates on all the nodes to check, and the contains function operates on a single item:
//input[../../../td/text()[contains(., "15-935-331")]]
This will get any input element, where the parent's parent's parent contains a td element with a text node containing the text 15-935-331.
A perhaps easier way to specify this would be to use ancestor::tr[1]/td in place of ../../../td.
//input[ancestor::tr[1]/td/text()[contains(.,"15-935-331")]]
This would get the nearest tr ancestor, and operate on that.
As an alternative to the solution posted by Keith, you can use the following XPath expression:
//tr[td[contains(., "15-935-331")]]/form//input
This makes it a bit more independent of the actual structure of the HTML. It selects the tr which contains a td containing the given text, and from that tr it takes the input element anywhere below the form element.
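The row-based approach can be sketched in plain Python as well. The standard library's ElementTree has no contains() function, so the text test is done in Python here and only the final step uses a path; this emulates the idea rather than running the XPath above. The second row, its ids, and the helper name inputs_in_rows_containing are invented for the example.

```python
import xml.etree.ElementTree as ET

# First row taken from the question (attributes trimmed); second row is hypothetical.
ROWS = """
<table>
  <tr>
    <td>xxxx, yyyyy</td><td>Mr</td><td></td><td> 15-935-331</td>
    <form id="betreuerModel" action="xxxx" method="POST">
      <td class="tRight"><input value="Bearbeiten" id="bearbeiten" type="submit"/></td>
    </form>
  </tr>
  <tr>
    <td>aaaa, bbbbb</td><td>Ms</td><td></td><td> 00-000-000</td>
    <form id="other" action="xxxx" method="POST">
      <td class="tRight"><input value="Bearbeiten" id="other-button" type="submit"/></td>
    </form>
  </tr>
</table>
"""


def inputs_in_rows_containing(root, needle):
    """Emulate //tr[td[contains(., needle)]]/form//input."""
    found = []
    for row in root.iter("tr"):
        # Test every td child of the row, not just the first one.
        if any(needle in "".join(td.itertext()) for td in row.findall("td")):
            found.extend(row.findall("./form//input"))
    return found


matches = inputs_in_rows_containing(ET.fromstring(ROWS), "15-935-331")
print([m.get("id") for m in matches])  # ['bearbeiten']
```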

XPath for text after <br/>

Looking to get the XPath of $2.00 with this block:
<td class="undefined" colspan="6">
<table class="history-bill-payments" cellspacing="0" cellpadding="0" border="0" align="center" width="99%">
<thead>
<tbody>
<tr>
<td valign="top">04/19/2016</td>
<td valign="top" style="text-align:right; height:">
$3.00
<br/>
$2.00
</td>
I have tried these but to no avail
$I->CanSeeElement("//table[contains(tbody/tr[2]/td/table/tbody/tr/td[2]/following-sibling::br)]");
$I->CanSeeElement("//table[contains(tbody/tr[2]/td/table/tbody/tr/td[2]/preceding-sibling::br/text(),'$2.00')]");
$I->CanSeeElement("//table[contains(tbody/tr[2]/td/table/tbody/tr/td[2]/following-sibling::br/text(),'$2.00')]");
Using firepath in Firefox I get this XPath
html/body/div[4]/div[2]/div/div/div/div/table/tbody/tr[2]/td/table/tbody/tr/td[2]
I was able to get the xpath of $3.00
$I->CanSeeElement("//table[contains(tbody/tr[2]/td/table/tbody/tr/td[2]/text(),'$3.00')]");
In XPath 1.0, given a node-set, contains() evaluates only the first node in the set. That's why your initial XPath successfully finds the text node that contains '$3.00', but not the one that contains '$2.00'.
An XPath expression close to the way your XPath for $3.00 works would be as follows:
//table[tbody/tr[2]/td/table/tbody/tr/td[2]/text()[contains(.,'$2.00')]]
The XPath above works by applying contains() on individual text node instead of passing multiple text nodes at once.
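The first-node behaviour can be imitated in a few lines of Python. This is an emulation of the XPath 1.0 conversion rule, not a call into an XPath engine; the helper names text_nodes, contains_xpath1 and any_node_contains are invented for the sketch.

```python
import xml.etree.ElementTree as ET

# The td from the question, reduced to its two text nodes around <br/>.
CELL = ET.fromstring('<td valign="top">\n$3.00\n<br/>\n$2.00\n</td>')


def text_nodes(elem):
    """Direct text-node children of elem in document order, like elem/text()."""
    nodes = [elem.text] + [child.tail for child in elem]
    return [n for n in nodes if n]


def contains_xpath1(nodes, needle):
    """XPath 1.0 style: a node-set argument is reduced to its first node."""
    first = nodes[0] if nodes else ""
    return needle in first


def any_node_contains(nodes, needle):
    """Like text()[contains(., needle)]: each node is tested on its own."""
    return any(needle in n for n in nodes)


nodes = text_nodes(CELL)
print(contains_xpath1(nodes, "$3.00"))    # True
print(contains_xpath1(nodes, "$2.00"))    # False: only the first node is looked at
print(any_node_contains(nodes, "$2.00"))  # True
```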
td with certain contents
From your trials, it seems you're fine with keying off of $2.00 literally, so you could use this XPath 2.0 expression to get the td that ends with $2.00:
//td[ends-with(normalize-space(), '$2.00')]
Note that browsers don't generally support XPath 2.0, so use this XPath 1.0 expression if running within a browser and you're ok with $2.00 appearing anywhere within the td:
//td[contains(.,'$2.00')]
Text following a br
If you don't want to literally specify the $2.00, you'll have to state some other invariant constraint. For example, this XPath will return the string that follows the br contained within a td that starts with $3.00:
normalize-space(//td[starts-with(normalize-space(),'$3.00')]/br/following::text())
See also
XPath contains() works differently in XPath v1.0 vs v2.0+
How to use XPath contains() here?
How to use XPath contains() for specific text?
If needed, add the table id or another more specific locator.
xpath=//table//tr/td[2]/text()[2]

Using XPath to select table that includes specific class

I have an HTML table that I need to select using XPath. The table may or may not contain multiple classes, but I only want tables that include a specific class.
Here is a sample HTML snippet:
<html>
<body>
<table class="no-border">
<tr>
<th colspan="2">Blah Blah Blah</th>
</tr>
<tr>
<td>Content</td>
<td>
<table class="info no-border">
<tr>
<!-- Inner table content -->
</tr>
</table>
</td>
</tr>
</table>
</body>
</html>
I need to use XPath to retrieve ONLY the table that includes the class info. I've tried using /html/body/table/tr/td/table[@class='info*'], but that doesn't work. The table I'm trying to retrieve may exist ANYWHERE in the HTML document - technically, not ANYWHERE, but there may be varying levels of hierarchy between the outer and inner table.
If anyone can point me in the right direction, I'd be grateful.
The closest you can do is with the contains function:
//table[contains(@class,'info')]
But please be aware that this would also capture a table whose class is information, or anything else that has info as a substring. As far as I know, XPath can't distinguish whole-word matches. So you'd have to filter the results to check for this possible condition.
What you'd ideally need is a CSS selector like table.info. Some XPath engines and toolkits for XML/HTML parsing do support these selectors, which are translated to XPath expressions internally, e.g. cssselect if you use Python (it is included in lxml), or Nokogiri for Ruby.
In the general case, to emulate a CSS selector like table.info with XPath, a common trick or pattern is to use contains() combined with concat() and space characters. In your case, it looks like this:
.//table[contains(concat(' ', normalize-space(@class), ' '), ' info ')]
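The space-padding idea is easy to check outside XPath. The sketch below redoes the same string logic in Python on a hypothetical page that also contains a decoy table with class information; it uses the standard library's ElementTree only to walk the tables, and all names in it are made up for the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: the nested table from the question plus a decoy whose
# class merely contains "info" as a substring.
PAGE = """
<body>
  <table class="no-border">
    <tr><td><table class="info no-border"><tr><td>inner</td></tr></table></td></tr>
  </table>
  <table class="information"><tr><td>decoy</td></tr></table>
</body>
"""
root = ET.fromstring(PAGE)


def class_attr(table):
    """The raw class attribute, or an empty string if it is missing."""
    return table.get("class", "")


def has_class_token(table, token):
    """Mirror concat(' ', normalize-space(@class), ' ') and a padded token."""
    padded = " " + " ".join(class_attr(table).split()) + " "
    return " " + token + " " in padded


tables = list(root.iter("table"))
substring_matches = [class_attr(t) for t in tables if "info" in class_attr(t)]
token_matches = [class_attr(t) for t in tables if has_class_token(t, "info")]

print(substring_matches)  # ['info no-border', 'information']
print(token_matches)      # ['info no-border']
```

Padding the token on both sides is what excludes the decoy; padding only on the left would still let information through.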
I know that you did not ask for this answer, but I think it will help you make your queries more precise.
//table[ (contains(@class,"result-cont") or contains(@class,"resultCont")) and not(contains(@class,"hide")) ]
This will get tables whose class contains 'result-cont' or 'resultCont' and does not contain 'hide'.
XPath 1.0 is, indeed, fairly limited in its string processing. You can do modest amounts of processing with starts-with(), substring() and similar functions. See this answer for creating something similar to a regex.
XSLT 2.0 (which not all browsers and software support) has support for regex.

Get last </td></tr> with regular expression?

I need to get all tags between the last </td> and the closing </tr> in each row. The regular expression I use, <\/TD\s*>(.*?)<\/TR\s*>, retrieves everything from the first </TD> up to the </TR> (marked in bold in the sample below).
<TABLE>
<TR><TD>TD11**</TD><TD>TD12</TD><TD>TD13</TD><SPAN><FONT>test1</FONT></SPAN></TR>**
<TR><TD>TD21**</TD><TD>TD22</TD><TD>TD23</TD><SPAN><FONT>test2</FONT></SPAN></TR>**
</TABLE>
But what I really need is
<TABLE>
<TR><TD>TD11</TD><TD>TD12</TD><TD>TD13**</TD><SPAN><FONT>test1</FONT></SPAN></TR>**
<TR><TD>TD21</TD><TD>TD22</TD><TD>TD23**</TD><SPAN><FONT>test2</FONT></SPAN></TR>**
</TABLE>
It's not recommended to use regular expressions to parse HTML. HTML is not a regular language, so regex-based parsing of it is notoriously unreliable.
Here's a good blog post explaining the logic and offering alternatives:
http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html
</TD>((?:(?!</T[DR]>).)*)</TR>
The regex starts to match at the first </TD>, but fails as soon as it reaches the second </TD> because of the (?!</T[DR]>)., which matches any character that's not the first character of a </TD> or </TR> tag. That's optional because of the enclosing (?:...)*, so it tries to match the next part of the regex, which is </TR>. That fails too, so the match attempt is abandoned.
It tries again starting at the second </TD> and fails again. Finally, it starts matching at the third </TD> and successfully matches from there to the first </TR>.
You may want to specify "single-line" or "dot-matches-all" mode, in case there are newlines that didn't show in your example. You didn't specify a regex flavor, so I can't say exactly how to do that.
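As a quick check of that pattern, here it is run with Python's re module (an arbitrary choice of flavor, since the question doesn't name one). re.DOTALL is Python's "dot-matches-all" flag; the variable names are made up for the example.

```python
import re

# Sample rows from the question, without the bold markers.
HTML = (
    "<TABLE>\n"
    "<TR><TD>TD11</TD><TD>TD12</TD><TD>TD13</TD><SPAN><FONT>test1</FONT></SPAN></TR>\n"
    "<TR><TD>TD21</TD><TD>TD22</TD><TD>TD23</TD><SPAN><FONT>test2</FONT></SPAN></TR>\n"
    "</TABLE>"
)

# After a </TD>, consume characters one at a time, but never step onto the
# start of another </TD> or </TR>; then require the closing </TR>.
LAST_TD_TO_TR = re.compile(r"</TD>((?:(?!</T[DR]>).)*)</TR>", re.DOTALL)

trailing = LAST_TD_TO_TR.findall(HTML)
print(trailing)
# ['<SPAN><FONT>test1</FONT></SPAN>', '<SPAN><FONT>test2</FONT></SPAN>']
```

Only the last </TD> of each row can start a successful match, because from any earlier </TD> the lookahead stops at the next </TD> before a </TR> is reached.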