Combining functions in xpath selector - html

I have a selection I want to make in xpath and can't seem to get it right. So I have: //td[starts-with(@id, '16276688381') and not(ends-with(@id, '_name'))]
This is the simple html
<td id="16276688381_name">I don't want this</td>
<td id="16276688381_B3" >What I want</td>
<td id="16276688381_B4" >More of these...I want them</td>
Once I add the "and" clause, my selection disappears. Any idea what is going wrong here?

As Martin points out, XPath 1.0 does not support ends-with, but you can simulate it with some string length calculations:
//td[starts-with(@id, '16276688381') and
not(substring(@id, string-length(@id) - 4) = '_name')]
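For anyone wondering where the "- 4" comes from, here is a minimal Python sketch of the same arithmetic. It is only an illustration: the standard-library ElementTree does not implement XPath functions such as starts-with() or substring(), so xpath_substring and ends_with are hypothetical helpers written for this example, and the <tr> wrapper is assumed so the sample rows parse as XML.

```python
import xml.etree.ElementTree as ET


def xpath_substring(s, start):
    """Emulate XPath 1.0 substring(s, start); positions are 1-based."""
    return s[max(int(start), 1) - 1:]


def ends_with(s, suffix):
    # Same arithmetic as substring(@id, string-length(@id) - 4) = '_name':
    # for a 5-character suffix the start position is length - 5 + 1.
    return xpath_substring(s, len(s) - len(suffix) + 1) == suffix


# Assumed wrapper element around the rows from the question.
ROWS = """<tr>
<td id="16276688381_name">I don't want this</td>
<td id="16276688381_B3">What I want</td>
<td id="16276688381_B4">More of these...I want them</td>
</tr>"""


def wanted_cells(markup, prefix="16276688381", suffix="_name"):
    """Return the text of cells whose id has the prefix but not the suffix."""
    root = ET.fromstring(markup)
    return [
        td.text
        for td in root.iter("td")
        if td.get("id", "").startswith(prefix)
        and not ends_with(td.get("id", ""), suffix)
    ]


print(wanted_cells(ROWS))
```

The suffix length drives the offset, so a different suffix needs a different constant in the XPath expression.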

ends-with is a function introduced in XPath 2.0 in 2007; unfortunately, browsers still only support XPath 1.0 from 1999.

Related

Xpath: Get Text After Element With Containing Text

I am looking for a way to get text which is not inside an HTML element:
<div class="col-sm-4">
<strong>Handelnde Personen:</strong><br><br>
<strong>Geschäftsführer</strong><br>
Mr John Doe<br>
Privatperson<br>
.....<br>
<br>
I want to get "Mr John Doe".
The only way I see is looking for a strong element which contains "Geschäftsführer" and then look for the following text.
My idea so far:
//strong[contains(text(), 'Gesch')]/br/../text()
... I simply can't make it work.
Also, is there a "wildcard" for strings? That I could use
*esch*ftsf*hr*
for "Geschäftsführer"?
I highly appreciate your help, thanks!
Try
//strong[starts-with(., 'Gesch')]/following-sibling::text()[1]
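If you are working outside an XPath engine, the same idea can be sketched in Python with the standard library. This is an approximation under stated assumptions: the snippet has been rewritten as well-formed XML (<br/> instead of <br>, plus a closing </div>), text_after_label is a hypothetical helper name, and unlike text()[1] it skips whitespace-only text nodes.

```python
import xml.etree.ElementTree as ET

# The snippet from the question, made well-formed for an XML parser (assumption).
SNIPPET = """<div class="col-sm-4">
<strong>Handelnde Personen:</strong><br/><br/>
<strong>Geschäftsführer</strong><br/>
Mr John Doe<br/>
Privatperson<br/>
</div>"""


def text_after_label(markup, prefix):
    """Return the first non-blank text following the <strong> that starts with prefix."""
    root = ET.fromstring(markup)
    children = list(root)
    for i, child in enumerate(children):
        if child.tag == "strong" and (child.text or "").startswith(prefix):
            # ElementTree stores the text that follows an element in .tail,
            # so walk the label and its following siblings in document order.
            for sibling in children[i:]:
                if sibling.tail and sibling.tail.strip():
                    return sibling.tail.strip()
    return None


print(text_after_label(SNIPPET, "Gesch"))
```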
As for wildcard matching, with XPath 2.0 you use regular expressions:
//strong[matches(., '.*esch.*ftsf.*hr.*')]
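Note that matches() succeeds if any substring matches, so the leading and trailing .* are optional. A rough Python equivalent, assuming the pattern stays within the syntax that both regex dialects share (wildcard_match is a hypothetical helper name):

```python
import re


def wildcard_match(text, pattern=r"esch.*ftsf.*hr"):
    """Unanchored search, similar in spirit to XPath 2.0 matches()."""
    return re.search(pattern, text) is not None


print(wildcard_match("Geschäftsführer"))
print(wildcard_match("Privatperson"))
```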
With XPath 3.0 you could also use the Unicode collation algorithm
//strong[compare(., 'Geschäftsführer',
'http://www.w3.org/2013/collation/UCA?strength=primary') = 0]
(strength=primary ignores case and accents)
But to get anything more advanced than XPath 1.0 in the browser, you would need to deploy Saxon-JS.
Another option with 1.0 is to use translate() to remove case and umlauts:
//strong[translate(., 'ABCD..XYZÄÖÜäöüß', 'abcd..xyzaouaous') = 'geschaftsfuhrer']
Note, in all these examples I have used "." rather than "text()" to get the string value of an element - this is recommended practice.
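The translate() folding can be mirrored in Python to check which strings would compare equal. This is a sketch, not a collation algorithm: fold is a hypothetical helper, and it maps only the umlauts and ß listed above, using lower() in place of the A-Z to a-z part of the mapping.

```python
# Map umlauts and sharp s to plain letters, as the translate() call does.
UMLAUT_TABLE = str.maketrans("ÄÖÜäöüß", "aouaous")


def fold(text):
    """Lower-case text and strip umlauts, mirroring the translate() mapping."""
    return text.translate(UMLAUT_TABLE).lower()


print(fold("Geschäftsführer") == "geschaftsfuhrer")
```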

XPath for text after <br/>

Looking to get the XPath of $2.00 with this block:
<td class="undefined" colspan="6">
<table class="history-bill-payments" cellspacing="0" cellpadding="0" border="0" align="center" width="99%">
<thead>
<tbody>
<tr>
<td valign="top">04/19/2016</td>
<td valign="top" style="text-align:right; height:">
$3.00
<br/>
$2.00
</td>
I have tried these but to no avail
$I->CanSeeElement("//table[contains(tbody/tr[2]/td/table/tbody/tr/td[2]/following-sibling::br)]");
$I->CanSeeElement("//table[contains(tbody/tr[2]/td/table/tbody/tr/td[2]/preceding-sibling::br/text(),'$2.00')]");
$I->CanSeeElement("//table[contains(tbody/tr[2]/td/table/tbody/tr/td[2]/following-sibling::br/text(),'$2.00')]");
Using firepath in Firefox I get this XPath
html/body/div[4]/div[2]/div/div/div/div/table/tbody/tr[2]/td/table/tbody/tr/td[2]
I was able to get the xpath of $3.00
$I->CanSeeElement("//table[contains(tbody/tr[2]/td/table/tbody/tr/td[2]/text(),'$3.00')]");
In XPath 1.0, when contains() is given a node-set, it evaluates only the first node in the set. That's why your initial XPath successfully finds the text node that contains '$3.00', but not the one that contains '$2.00'.
An XPath expression close to the one you used for $3.00 would be as follows:
//table[tbody/tr[2]/td/table/tbody/tr/td[2]/text()[contains(.,'$2.00')]]
The XPath above works by applying contains() to each text node individually instead of passing multiple text nodes at once.
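To make the first-node behaviour concrete, here is a small Python sketch that models the two text nodes of the cell. It emulates the XPath 1.0 rule rather than running XPath: text_nodes, contains_first_only and contains_any are hypothetical helpers, and the cell is reduced to a standalone well-formed fragment.

```python
import xml.etree.ElementTree as ET

# The cell from the question as a standalone fragment (assumption).
CELL = """<td valign="top">
$3.00
<br/>
$2.00
</td>"""


def text_nodes(elem):
    """Direct child text nodes of elem, in document order."""
    nodes = [elem.text] + [child.tail for child in elem]
    return [node for node in nodes if node is not None]


def contains_first_only(elem, needle):
    # Models contains(td/text(), needle) in XPath 1.0:
    # the node-set is converted to the string value of its first node.
    nodes = text_nodes(elem)
    return bool(nodes) and needle in nodes[0]


def contains_any(elem, needle):
    # Models td/text()[contains(., needle)]: each text node is tested on its own.
    return any(needle in node for node in text_nodes(elem))


td = ET.fromstring(CELL)
print(contains_first_only(td, "$3.00"), contains_first_only(td, "$2.00"))
print(contains_any(td, "$2.00"))
```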
td with certain contents
From your trials, it seems you're fine with keying off of $2.00 literally, so you could use this XPath 2.0 expression to get the td that ends with $2.00:
//td[ends-with(normalize-space(), '$2.00')]
Note that browsers don't generally support XPath 2.0, so use this XPath 1.0 expression if running within a browser and you're ok with $2.00 appearing anywhere within the td:
//td[contains(.,'$2.00')]
Text following a br
If you don't want to literally specify the $2.00, you'll have to state some other invariant constraint. For example, this XPath will return the string that follows the br contained within a td that starts with $3.00:
normalize-space(//td[starts-with(normalize-space(),'$3.00')]/br/following::text())
See also
XPath contains() works differently in XPath v1.0 vs v2.0+
How to use XPath contains() here?
How to use XPath contains() for specific text?
If you need, just add table id or any other specific locator.
xpath=//table//tr/td[2]/text()[2]

Special Characters in HTML Element

What I'm trying to do is output a percent sign (%) directly into a <td> tag. Below is my code:
<table width="100%" border="0" cellspacing="0" cellpadding="0">
<tr>
<td class="item_container" %%=v(@Item_Container_Style)=%%>
...
When I test the XSL I get the following error:
SAXParseException: Expected an attribute name (Set_A_Custom.xsl, line 205, column 38)
So basically it's seeing "%%=v(@Item_Container_Style)=%%" as invalid HTML but I need this code to be there.
If you are wondering why I am doing this, it is because I am writing the XSL to output HTML that contains AMPscript (an ExactTarget proprietary scripting language). You don't need to know anything about AMPscript to help me out, though; I just need to output the percent sign (%) in the HTML and everything will work.
Any ideas? For the record I'm using XSL 1.0. Thanks all!
An XSLT stylesheet must itself be well-formed XML, so you can't include this kind of construct directly in the stylesheet. If the XSLT processor you're using supports disable-output-escaping then you would be able to do something like
<table width="100%" border="0" cellspacing="0" cellpadding="0">
<tr>
<xsl:text disable-output-escaping="yes"><![CDATA[<td class="item_container" %%=v(@Item_Container_Style)=%%>]]></xsl:text>
...
<xsl:text disable-output-escaping="yes"><![CDATA[</td>]]></xsl:text>
</tr>
</table>
If it does not allow disable-output-escaping then your only option is to use the text output method, and write all the tags you want to output as text with the angle brackets escaped (or in CDATA).
What I'm trying to do is output a percent sign (%) directly into a <td> tag.
Not possible with the "html" or "xml" output modes. XSLT has been designed to create syntactically sane HTML, you cannot make it do anything else.
Of course you could switch to the "text" output mode and do whatever you like, but generating HTML this way is a lot harder.
Alternatively you can use disable-output-escaping, if your XSLT processor supports it, but this will quickly degenerate your XSLT stylesheet into a mess if you need to do it in many places.
That being said, here's a proposal. In XSLT you use the "html" output mode and this:
<td
class="item_container"
amp-1="%%=v({@Item_Container_Style})%%"
amp-2="%%=v({@Some_Other_Element})%%"
>
some text %%=v(<xsl:value-of select="Other_Stuff" />)%% more text
</td>
That is syntactically valid XSLT which covers both cases (multiple placeholders in attributes, multiple placeholders in the text) and creates syntactically valid HTML:
<td
class="item_container"
amp-1="%%=v(item container style content)%%"
amp-2="%%=v(some other element content)%%"
>
some text %%=v(other stuff)%% more text
</td>
and then you use a post-processing step to convert that HTML into AMPscript:
Regex-replace \bamp-\d+="(%%[\s\S]*?%%)" with $1, which would result in
<td
class="item_container"
%%=v(item container style content)%%
%%=v(some other element content)%%
>
some text %%=v(other stuff)%% more text
</td>
Handling HTML with regular expressions is generally strongly discouraged, but this might just be a narrow enough use case.
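The post-processing step could look roughly like this in Python. It is a sketch that assumes the generated HTML is available as a string; to_ampscript is a hypothetical helper name, and the pattern is the one given above with $1 written as \1 for Python's re module.

```python
import re

# Matches amp-N="%%...%%" and captures the AMPscript placeholder itself.
AMP_ATTR = re.compile(r'\bamp-\d+="(%%[\s\S]*?%%)"')


def to_ampscript(html):
    """Replace each amp-N="..." carrier attribute with its bare placeholder."""
    return AMP_ATTR.sub(r"\1", html)


# Output of the XSLT step, as shown above.
GENERATED = '''<td
class="item_container"
amp-1="%%=v(item container style content)%%"
amp-2="%%=v(some other element content)%%"
>
some text %%=v(other stuff)%% more text
</td>'''

print(to_ampscript(GENERATED))
```

Placeholders in text content are left alone because the pattern only matches the amp-N attribute form.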
AMPScript appears to have a standards-based syntax as an alternative to its proprietary syntax:
Delimiter Comparison
The table below demonstrates the similarities between standard AMPscript delimiters and server-side delimiters.
Standard AMPscript Delimiter | Tag-based AMPscript Delimiter
%%[ | <script runat=server language=ampscript>
etc
Does this help you?

Using XPath to select table that includes specific class

I have an HTML table that I need to select using XPath. The table may or may not contain multiple classes, but I only want tables that include a specific class.
Here is a sample HTML snippet:
<html>
<body>
<table class="no-border">
<tr>
<th colspan="2">Blah Blah Blah</th>
</tr>
<tr>
<td>Content</td>
<td>
<table class="info no-border">
<tr>
<!-- Inner table content -->
</tr>
</table>
</td>
</tr>
</table>
</body>
</html>
I need to use XPath to retrieve ONLY the table that includes the class info. I've tried using /html/body/table/tr/td/table[@class='info*'], but that doesn't work. The table I'm trying to retrieve may exist ANYWHERE in the HTML document - technically, not ANYWHERE, but there may be varying levels of hierarchy between the outer and inner table.
If anyone can point me in the right direction, I'd be grateful.
The closest you can do is with the contains function:
//table[contains(@class,'info')]
But please be aware that this would also capture a table with the class "information", or any other class that contains the substring "info". As far as I know XPath can't distinguish whole-word matches, so you'd have to filter the results to check for this possible condition.
What you'd ideally need is a CSS selector like table.info. Some XPath engines and toolkits for XML/HTML parsing do support these selectors, which are translated to XPath expressions internally, e.g. cssselect if you use Python (it is included in lxml), or Nokogiri for Ruby.
In the general case, to emulate a CSS selector like table.info with XPath, a common trick or pattern is to use contains() combined with concat() and space characters. In your case, it looks like this:
.//table[contains(concat(' ', normalize-space(@class), ' '), ' info ')]
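The padding trick is easier to see outside XPath. Below is a minimal Python sketch of the same logic, assuming the page parses as well-formed XML; has_class and tables_with_class are hypothetical helper names, and split()/join() stands in for normalize-space().

```python
import xml.etree.ElementTree as ET


def has_class(class_attr, name):
    """Whole-token class test, mirroring the concat()/normalize-space() pattern."""
    normalized = " ".join(class_attr.split())  # collapse runs of whitespace
    # Pad both sides with a space so "info" cannot match inside "information".
    return f" {name} " in f" {normalized} "


def tables_with_class(markup, name):
    """Return every <table> at any depth whose class list includes name."""
    root = ET.fromstring(markup)
    return [t for t in root.iter("table") if has_class(t.get("class", ""), name)]


# Trimmed version of the sample page from the question.
PAGE = """<html><body>
<table class="no-border">
<tr><td>Content</td><td>
<table class="info no-border"><tr><td>Inner</td></tr></table>
</td></tr>
</table>
</body></html>"""

print([t.get("class") for t in tables_with_class(PAGE, "info")])
```

Without the trailing space in ' info ', a class such as "information" would still match, which is exactly the false positive the padding is meant to prevent.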
I know that you did not ask for this answer, but I think it will help you make your queries more precise.
//table[ (contains(@class,"result-cont") or contains(@class,"resultCont")) and not(contains(@class,"hide")) ]
This will get classes that contain 'result-cont' or 'resultCont', and do not have the 'hide' class.
XPath 1.0 is, indeed, fairly limited in its string processing. You can do modest amounts of processing with starts-with(), substring() and similar functions. See this answer for creating something similar to a regex.
XSLT 2.0 (which not all browsers and software support) has support for regex.

How to get a HTML element by text using XPath?

I've encountered a problem: I cannot get an HTML element by its text. My HTML looks like:
...
<table>
...
<tr>
...
<td class="oMain">test value</td>
...
<tr>
...
</table>
...
For some special reasons, I have to get the '<td class="oMain">' element using its text 'test value'. I tried '//tr[td='test value']/td' but got no result. How can I write the XPath expression?
Any help is welcome. Thanks!
Your expression
//tr[td='test value']/td
places the predicate on the parent node "tr". Maybe that's what's causing the problem.
What you want probably is this
//td[@class = "oMain" and child::text() = 'test value']
Here's a link to the W3C specification of the XPath language for further reading: http://www.w3.org/TR/xpath/
Your XPath expression seems to be correct. Do you have a default namespace (e.g. XHTML) in your html? If so, you can modify your XPath like this:
//*[local-name()='td' and text()='test value']
If you can figure out how to use namespaces, you could also do
//xhtml:tr[xhtml:td='test value']/xhtml:td
Does that help?
In the xpath expression, first put the element node, which in your case is td, and then apply the filter text()='text node'
//td[text()='test value']
Hope this helps.
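If you want to try the difference between the two predicates, Python's standard-library ElementTree supports enough of the XPath subset to show it (the [.='text'] form needs Python 3.7 or later). This is a sketch with assumed sample markup: the extra "other" cell is added for illustration and the document has no default namespace.

```python
import xml.etree.ElementTree as ET

# Assumed sample: the row from the question plus one extra cell.
TABLE = """<table>
<tr>
<td class="other">something else</td>
<td class="oMain">test value</td>
</tr>
</table>"""

root = ET.fromstring(TABLE)

# Predicate on td: only the cell whose text equals 'test value'.
cells_by_text = root.findall(".//td[.='test value']")

# Predicate on tr: every td of any row that has such a cell.
cells_of_row = root.findall(".//tr[td='test value']/td")

print([td.get("class") for td in cells_by_text])
print([td.get("class") for td in cells_of_row])
```

Putting the predicate on tr selects the whole row's cells, so with more than one cell per row the original expression returns more than the one element you are after.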
What are you using to do the parsing? In Ruby + Hpricot, you can do
doc.search("//td.oMain").each do |cell|
if cell.inner_html == "test value"
return cell
end
end
In this case, cell would be:
<td class="oMain">test value</td>