XPath //div[contains(text(), 'string')] fails to select divs containing 'string'

This is the HTML code:
<div> <span></span> Elangovan </div>
I want to write an XPath for the div based on its contained text. I tried
//div[contains(text(),'Elangovan')]
but this is not working.

Replace text() with string():
//div[contains(string(), "Elangovan")]
Or, you can check that span's following text sibling contains the text:
//div[contains(span/following-sibling::text(), "Elangovan")]
Also see:
Difference between text() and string()

Alternatively to alecxe's correct answer (+1), the following slightly simpler and somewhat more idiomatic XPath will work the same way:
//div[contains(., "Elangovan")]
The reason that your original XPath with text() does not work is that text() selects all text node children of div. However, contains() expects a string as its first argument, and when given a node set of text nodes, XPath 1.0 uses only the first one in document order. Here, the first text node contains only whitespace, not the sought-after string, so the test fails. With the implicit . or the explicit string() as the first argument, all text node descendants are concatenated together before the contains() test is performed, so the test passes.
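To see this for yourself, here is a minimal sketch using Python's lxml (an assumption on my part: any XPath 1.0 engine should behave the same way, and lxml has to be installed). It runs both predicates against the markup from the question:

```python
from lxml import etree  # third-party; assumes lxml is installed

# The markup from the question: whitespace, an empty span, then the name.
div = etree.fromstring("<div> <span></span> Elangovan </div>")

# text() selects both text node children of the div.
text_nodes = div.xpath("text()")

# XPath 1.0 contains() stringifies only the first node (" "), so no match.
matches_text = div.xpath("//div[contains(text(), 'Elangovan')]")

# "." is converted to the div's full string-value, so the div matches.
matches_dot = div.xpath("//div[contains(., 'Elangovan')]")

print(text_nodes)
print(len(matches_text), len(matches_dot))
```

The first text node is the single space before the span, which is why the text() version comes back empty.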

To make @kjhughes's already good answer just a little more precise, what you're really asking for is a way to look for substrings in the div's string-value:
For every type of node, there is a way of determining a string-value
for a node of that type. For some types of node, the string-value is
part of the node; for other types of node, the string-value is
computed from the string-value of descendant nodes.
Both the context node (. or the div itself) and the set of nodes returned by text() -- or any other argument! -- are first converted to strings when passed to contains. It's just that they're converted in different ways, because one refers to a single element and the other refers to a node-set.
A single element's string-value is the concatenation of the string-values of all its text node descendants. A node-set's string-value, on the other hand, is the string-value of the node in the set that is first in document order.
So the real difference is in what you're converting to a string and how that conversion takes place.
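The two conversions can be compared directly by calling string() on each. This is only a sketch with lxml (assumed to be installed) and a made-up nested element, not markup from the question:

```python
from lxml import etree  # third-party; assumes lxml is installed

# Hypothetical markup with text at several nesting levels.
div = etree.fromstring("<div>a<span>b<i>c</i></span>d</div>")

# A single element: string-value is all descendant text, concatenated.
whole = div.xpath("string(.)")

# A node-set: string-value of the first node in document order only.
first_only = div.xpath("string(text())")

# The same conversions happen implicitly inside contains().
in_whole = div.xpath("contains(., 'cd')")
in_first = div.xpath("contains(text(), 'd')")

print(repr(whole), repr(first_only), in_whole, in_first)
```

Here text() selects the two text nodes "a" and "d", but only "a" survives the conversion to a string.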

Related

What does contains(.,'substring') mean in XPath and what's its equivalent in CSS?

I am able to locate all div elements in a UI page using the XPath locator //div[contains(.,'')], irrespective of the text inside them. What does .,'' mean?
Consider the example below:
The XPath to specifically select the div with the text 'Apple' would be //div[contains(.,'Apple')].
What is its CSS equivalent?
<div>
<span> Apple </span>
</div>
//div[contains(.,'Apple')]
//div means to select all div elements in the document.
//div[ predicate ] means to filter those per the given predicate.
contains( str, substr ) means to return true iff str contains the substring, substr.
. is the context node. (See Current node vs. Context node in XSLT/XPath?) Within the predicate of your XPath, it will be a div element. When passed as a function parameter with type string, it will be converted to the string value of the node.
The string-value of an element is the concatenation of the string-values of all its text node descendants, in document order.
Therefore, your XPath returns all div elements in the document whose string value contains the 'Apple' substring.
There is no CSS equivalent.
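As a quick check of why //div[contains(.,'')] matches every div, here is a small sketch with lxml (assumed installed). The body wrapper and the second div are made up for illustration; only the first div comes from the question:

```python
from lxml import etree  # third-party; assumes lxml is installed

# The Apple div from the question plus a hypothetical second div.
doc = etree.fromstring(
    "<body>"
    "<div><span> Apple </span></div>"
    "<div>Banana</div>"
    "</body>"
)

# Every string contains the empty string, so this predicate is always true.
all_divs = doc.xpath("//div[contains(., '')]")

# Only divs whose string-value contains 'Apple' pass this one.
apple_divs = doc.xpath("//div[contains(., 'Apple')]")

print(len(all_divs), len(apple_divs))
```

Note that the Apple text sits inside a span, yet the div still matches because the test runs against the div's string-value.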
See also
Is there a CSS selector for elements containing certain text?
How to use XPath contains() here?

Why is XPath contains(text(),'substring') not working as expected?

Let's say I have a piece of HTML like this:
<a>Ask Question<other/>more text</a>
I can match this piece of XPath:
//a[text() = 'Ask Question']
Or...
//a[text() = 'more text']
Or I can use dot to match the whole thing:
//a[. = 'Ask Questionmore text']
This post describes this difference between . (dot) and text(), but in short the first returns a single element, where the latter returns a list of elements. But this is where it gets a bit weird to me. Because while text() can be used to match either of the elements on the list, this is not the case when it comes to the XPath function contains(). If I do this:
//a[contains(text(), 'Ask Question')]
...I get the following error:
Error: Required cardinality of first argument of contains() is one or zero
How can it be that text() works when using a full match (equals), but doesn't work on partial matches (contains)?
For this markup,
<a>Ask Question<other/>more text</a>
notice that the a element has a text node child ("Ask Question"), an empty element child (other), and a second text node child ("more text").
Here's how to reason through what's happening when evaluating //a[contains(text(),'Ask Question')] against that markup:
1. contains(x,y) expects x to be a string, but text() matches two text nodes.
2. In XPath 1.0, the rule for converting multiple nodes to a string is this:
A node-set is converted to a string by returning the string-value of
the node in the node-set that is first in document order. If the
node-set is empty, an empty string is returned. [Emphasis added]
3. In XPath 2.0+, it is an error to provide a sequence of text nodes to a function expecting a string, so contains(text(),'substr') will cause an error for more than one matching text node.
In your case...
XPath 1.0 would treat contains(text(),'Ask Question') as
contains('Ask Question','Ask Question')
which is true. On the other hand, be sure to notice that contains(text(),'more text') will evaluate to false in XPath 1.0. Without knowing the (1)-(3) above, this can be counter-intuitive.
XPath 2.0 would treat it as an error.
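The XPath 1.0 behaviour can be reproduced with Python's lxml, which (as far as I know) delegates to libxml2's XPath 1.0 implementation. Treat this as a sketch under that assumption; it does not show the XPath 2.0 error:

```python
from lxml import etree  # third-party; assumes lxml is installed

a = etree.fromstring("<a>Ask Question<other/>more text</a>")

# Only the first text node ("Ask Question") is used for the comparison.
first_substring = a.xpath("contains(text(), 'Ask Question')")

# The second text node is never looked at, so this is false.
second_substring = a.xpath("contains(text(), 'more text')")

print(first_substring, second_substring)
```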
Better alternatives
If the goal is to find all a elements whose string value contains the substring, "Ask Question":
//a[contains(.,'Ask Question')]
This is the most common requirement.
If the goal is to find all a elements with an immediate text node child equal to "Ask Question":
//a[text()='Ask Question']
This can be useful when you wish to exclude strings from descendant elements of a, such as when you want this a,
<a>Ask Question<other/>more text</a>
but not this a:
<a>more text before <not>Ask Question</not> more text after</a>
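Here is a sketch of both alternatives run against the two a elements above, using lxml (assumed installed). The wrapping p element is hypothetical and only there to hold both samples in one document:

```python
from lxml import etree  # third-party; assumes lxml is installed

doc = etree.fromstring(
    "<p>"
    "<a>Ask Question<other/>more text</a>"
    "<a>more text before <not>Ask Question</not> more text after</a>"
    "</p>"
)

# String-value test: descendant text counts, so both a elements match.
by_string_value = doc.xpath("//a[contains(., 'Ask Question')]")

# Text node child test: only the first a has such a child.
by_text_child = doc.xpath("//a[text() = 'Ask Question']")

print(len(by_string_value), len(by_text_child))
```

In the second a, "Ask Question" belongs to the not element, so it is not a text node child of a.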
See also
How contains() handles a nodeset first arg
How to use XPath contains() for specific text?
Testing text() nodes vs string values in XPath
The reason for this is that the contains function doesn't accept a nodeset as input - it only accepts a string. (Well, it may be engine dependent, because it works for Python's lxml module. According to the specification, it should convert the value of the first node in the set to a string and act on that. See also XPath contains(text(),'some string') doesn't work when used with node with more than one Text subnode)
//a[text() = 'Ask Question'] is matching any a elements which contain a text node which equals Ask Question.
//a[text() = 'more text'] is matching any a elements which contain a text node which equals more text.
So both of these expressions match the same a element.
You can rework your query to //a[text()[contains(., 'Ask Question')]] so that the contains() function acts on only a single text node at a time.
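A short sketch of that per-node form with lxml (assumed installed), this time searching for the second text node, which the plain contains(text(), ...) form would miss:

```python
from lxml import etree  # third-party; assumes lxml is installed

a = etree.fromstring("<a>Ask Question<other/>more text</a>")

# The inner predicate is evaluated once per text node, with "." bound to it.
hits = a.xpath("//a[text()[contains(., 'more text')]]")

# The same filter can return the matching text nodes themselves.
nodes = a.xpath("//a/text()[contains(., 'text')]")

print(len(hits), nodes)
```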

Select all deepest nodes with XPath 1.0 containing text, ignoring markup

I want to extract elements from the HTML page, containing text, ignoring markup. For example, I want to extract node containing the text "Run, Sarah, run!" from https://en.wiktionary.org/wiki/run. I know about node test text() and function string(). I tried them both:
As you see, if I use string() it returns too many nodes (result includes the nodes that include the node I need) and if I use text() it returns nothing (because of the <b> tag).
How do I find required nodes?
UPD: I want all deepest nodes. That means if the Wiktionary page contained this sentence twice, I would want to select two nodes.
Also, I don't know the node type.
//*[contains(string(.), "Run, Sarah, run!")] returns all elements (from the html node down to the last descendant node) that contain that string.
//*[contains(text(), "Run, Sarah, run!")] returns nothing, as "Run, Sarah, run!" is compound text made up of several text nodes rather than a single text node.
You can use below to match italic node with required text:
'//i[normalize-space()="Run, Sarah, run!"]'
If you don't want to specify node name, you can try
'//*[normalize-space()="Run, Sarah, run!" and not(./*[normalize-space()="Run, Sarah, run!"])]'
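The three expressions can be compared on a cut-down document. This is only a sketch with lxml (assumed installed); the markup is a hypothetical stand-in for the Wiktionary page, not its real structure:

```python
from lxml import etree  # third-party; assumes lxml is installed

# Hypothetical stand-in: the sentence is split by a <b> tag.
doc = etree.fromstring(
    "<html><body><ul><li>"
    "<i>Run, <b>Sarah</b>, run!</i>"
    "</li></ul></body></html>"
)
target = "Run, Sarah, run!"

# string(.) also matches every ancestor of the node we want.
containing = [e.tag for e in doc.xpath(f'//*[contains(string(.), "{target}")]')]

# text() matches nothing because no single text node holds the sentence.
single_text = doc.xpath(f'//*[contains(text(), "{target}")]')

# Keep only elements that have no child with the same normalized text.
deepest = [
    e.tag
    for e in doc.xpath(
        f'//*[normalize-space()="{target}" '
        f'and not(./*[normalize-space()="{target}"])]'
    )
]

print(containing, single_text, deepest)
```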

How can you view the output of XPath functions like normalize-space()?

Say I have the following HTML:
<div class="instruction" id="scan-prompt">
<span class="long instruction">Scan </span>
<span id="slot-to-scan">A-2</span>
<span class="long instruction"> to prep</span>
</div>
And I'm trying to write an XPATH selector like this
//div[@id='scan-prompt' and normalize-space()='Scan A-2 to prep']
Is there a way to see what the normalize-space output actually is?
I know you can do $x("//div[@id='scan-prompt']") in the Chrome debugger, but I don't know how to go from that to seeing the output of normalize-space.
Why can you not simply use the path expression
normalize-space(//div[@id='scan-prompt'])
to see what the normalized string value would look like? Other than that, what normalize-space() does exactly is:
Removing any leading or trailing whitespaces from the string argument
Collapsing any sequence of whitespace characters to just one whitespace character
If handed an element node as an argument (as is the case with your original expression), the function evaluates the string value of that element node. The string value of an element node is the concatenation of all its descendant text nodes.
The result of normalize-space(//div[@id='scan-prompt']) is, given the input you show (whitespace marked with "+"):
Scan+A-2+to+prep
Without invoking normalize-space(), for example string(//div[@id='scan-prompt']):
+
Scan+
A-2+
to+prep+
+
So, simply use path expressions that do nothing other than return a string value or a normalized string value. In Google Chrome, you can do this by evaluating the XPath expression inside $x().
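Outside the browser, the same inspection can be sketched with Python's lxml (assumed installed), where an XPath expression that evaluates to a string is returned as a Python string. The line breaks between the tags are my assumption about how the markup in the question is laid out:

```python
from lxml import etree  # third-party; assumes lxml is installed

# Markup from the question, assuming one newline between the tags.
doc = etree.fromstring(
    '<div class="instruction" id="scan-prompt">\n'
    '<span class="long instruction">Scan </span>\n'
    '<span id="slot-to-scan">A-2</span>\n'
    '<span class="long instruction"> to prep</span>\n'
    '</div>'
)

# Raw string-value: every descendant text node, whitespace included.
raw = doc.xpath("string(//div[@id='scan-prompt'])")

# Normalized: trimmed, with each whitespace run collapsed to one space.
normalized = doc.xpath("normalize-space(//div[@id='scan-prompt'])")

print(repr(raw))
print(repr(normalized))
```

Printing with repr() makes the newlines and trailing spaces visible, which is usually what you need when a normalize-space() comparison unexpectedly fails.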

XPath Expression Problem

I have the following HTML snippet, http://paste.enzotools.org/show/1209/ , and I want to extract the tag that has a text() descendant with the value of "172.80" (it's the fourth node from that snippet). My attempts so far have been:
'descendant::td[@class="roomPrice figure" and contains(descendant::text(), "172.80")]'
'descendant::td[@class="roomPrice figure" and contains(div/text(), "172.80")]'
'descendant::td[@class="roomPrice figure" and div[contains(text(), "172.80")]]'
but none of them selects anything.
Does anyone have any suggestions?
When passing a node set to a function call, note that if the function signature doesn't declare a node set argument, then only the first node from that node set will be cast and used.
So, I think you need this XPath expression:
descendant::td[@class="roomPrice figure"][div[text()[contains(.,'172.80')]]]
Test for a text node child of div
or
descendant::td[@class="roomPrice figure"]
[div[descendant::text()[contains(.,'172.80')]]]
Test for a text node descendant of div
or
descendant::td[@class="roomPrice figure"]
[descendant::text()[contains(.,'172.80')]]
Test for a text node descendant of td
I believe you want something like this:
<xsl:for-each select="//td[contains(string(.), '172.80')]">
The string() function gives you the concatenated text of the context node and all of its descendants, whereas text() selects only the text node children of the context node.
Of course, you can extend the XPath selector to filter on the class name too...
<xsl:for-each select="//td[contains(string(.), '172.80')][@class='roomPrice figure']">
And as stated in the comments above, your posted XML/HTML is invalid as it stands.
My understanding is that you want to select the td element in specified class, that has a descendant text node containing the value "172.80".
I'm assuming the context node is the <tr> (or some ancestor of it).
The attempts you listed all suffer from the problem that contains() converts its first argument to a single string, using only the first node of the nodeset. So if the td or div has a descendant or child text node before the one that contains "172.80", the one containing "172.80" will not be noticed.
Try this:
'descendant::td[@class="roomPrice figure" and
descendant::text()[contains(., "172.80")]]'
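To illustrate the difference, here is a sketch with lxml (assumed installed). The original paste is not reproduced here, so the row below is entirely hypothetical markup in which the price is not the first text node of the div:

```python
from lxml import etree  # third-party; assumes lxml is installed

# Hypothetical row: a newline and a currency span precede the price.
tr = etree.fromstring(
    '<tr><td class="roomPrice figure">'
    '<div>\n<span>GBP</span> 172.80</div>'
    '</td></tr>'
)

# contains() sees only the first text node of the div (the newline).
first_node_only = tr.xpath(
    'descendant::td[@class="roomPrice figure" and '
    'contains(div/text(), "172.80")]'
)

# The inner predicate tests every descendant text node in turn.
any_text_node = tr.xpath(
    'descendant::td[@class="roomPrice figure" and '
    'descendant::text()[contains(., "172.80")]]'
)

print(len(first_node_only), len(any_text_node))
```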