Selecting elements based on string/text matching in XPath?


For HTML tables on a web page I am using the following XPath:
/tr/td[2]/.[contains(text(),'Some')]
This works fine in all cases, but it also matches 'Something'.
/tr/td[2]/.[normalize-space(text()) = 'Some']
doesn't work in all cases.
Can somebody comment on what's wrong with the latter XPath?

Your problem likely doesn't involve normalize-space() but rather one of two common areas of confusion in text/string matching:
Text node vs string value
text() matches text nodes.
//td[contains(text(), 'Some')] will match this
<td>Some text</td>
but not
<td><b>Some text</b></td>
To match the latter too, use //td[contains(., 'Some')] instead. This will check that the string value of td contains the string "Some".
For more details, see XPath text() = is different than XPath . =
String contains vs string equals
Note also that contains() tests for substring containment. If you want string equality, use the = operator against a string:
//td[. = 'Some']
Will match
<td><b>Some</b></td>
but not
<td><b>Some text</b></td>
Be aware of the difference.
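As a rough illustration of the two distinctions above, here is a minimal Python sketch that emulates them with the standard library's xml.etree.ElementTree. It is not an XPath engine: the helper names (text_nodes, string_value, contains_text, contains_dot) are hypothetical, and contains_text assumes XPath 1.0 behavior, where only the first text node child is used.

```python
import xml.etree.ElementTree as ET


def text_nodes(elem):
    """Immediate text node children in document order.

    ElementTree stores them as elem.text and as the .tail of each child.
    """
    nodes = [elem.text] + [child.tail for child in elem]
    return [t for t in nodes if t]


def string_value(elem):
    """Concatenation of all descendant text, like the XPath string value."""
    return "".join(elem.itertext())


def contains_text(elem, needle):
    """Emulates contains(text(), needle) under XPath 1.0 rules (assumption)."""
    nodes = text_nodes(elem)
    return needle in (nodes[0] if nodes else "")


def contains_dot(elem, needle):
    """Emulates contains(., needle)."""
    return needle in string_value(elem)


plain = ET.fromstring("<td>Some text</td>")
nested = ET.fromstring("<td><b>Some text</b></td>")
exact = ET.fromstring("<td><b>Some</b></td>")

print(contains_text(plain, "Some"))     # True: td has its own text node
print(contains_text(nested, "Some"))    # False: the text belongs to b, not td
print(contains_dot(nested, "Some"))     # True: string value includes descendants
print(string_value(exact) == "Some")    # True: equality on the string value
print(string_value(nested) == "Some")   # False: the string value is "Some text"
```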

Related

How to write a common XPath for the same text displayed in different HTML tags?

I want to write a common XPath for the results displayed for my search text 'Automation Server'.
The same text is displayed in td HTML tags as well as in div HTML tags, as shown below. Based on my understanding from reading different articles, I wrote the following XPath:
displayed_text = //td[contains(text(),'Automation Server') or div[contains(text(),' Automation Server ')]
<td role="cell" mat-cell="" class="mat-cell cdk-cell cdk-column-siteName mat-column-siteName ng-star-inserted">Automation Server</td>
<div class="change-list-value ng-star-inserted"> Automation Server </div>

The operator you are looking for in XPath is |. It is a union operator and will return both sets of elements.
The XPath you are looking for is
//td[contains(text(),'Automation Server')] | //div[contains(text(),'Automation Server')]

This XPath,
//*[self::td or self::div][text()[normalize-space()='Automation Server']]
will select all td or div elements with an immediate text node whose normalized string value equals 'Automation Server'.
Cautions regarding other answers here
| is not logical-OR or "OR-like".
It is a union operator over node sets (XPath 1.0) or sequences (XPath 2.0+), not boolean values.
See: Logical OR in XPath? Why isn't | working?
contains(text(), "string") only tests the first text node child.
See: Why is contains(text(), "string" ) not working in XPath?
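To make the per-text-node test concrete, here is a small Python sketch using only xml.etree.ElementTree. It emulates the predicate rather than running XPath. The sample document, the helper names, and the use of str.split() as an approximation of normalize-space() are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: two elements that should match, two that should not.
doc = ET.fromstring(
    "<root>"
    '<td class="mat-cell">Automation Server</td>'
    '<div class="change-list-value"> Automation Server </div>'
    "<span>Automation Server</span>"
    "<td>Automation Server 2</td>"
    "</root>"
)


def text_nodes(elem):
    """Immediate text node children (elem.text plus each child's tail)."""
    nodes = [elem.text] + [child.tail for child in elem]
    return [t for t in nodes if t]


def normalize_space(value):
    """Approximation of normalize-space(): trim and collapse whitespace."""
    return " ".join(value.split())


# Rough equivalent of:
# //*[self::td or self::div][text()[normalize-space()='Automation Server']]
matches = [
    elem
    for elem in doc.iter()
    if elem.tag in ("td", "div")
    and any(normalize_space(t) == "Automation Server" for t in text_nodes(elem))
]

print([elem.tag for elem in matches])  # ['td', 'div']
```

Each text node is tested on its own, so the span (wrong element name) and the second td (different text) are excluded.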

A few alternatives to JeffC's answer, using properties common to both elements:
1. Use * as a wildcard for any element:
//*[contains(@class,'ng-star-inserted') and normalize-space(text())='Automation Server']
2. Additionally use the local-name() function to narrow down the names of the elements:
//*[local-name()[.='td' or .='div']][contains(@class,'ng-star-inserted') and normalize-space(text())='Automation Server']
The normalize-space() function cleans up the optional whitespace, so the = operator can be used.

You could use the following XPath to test the local-name() of the element in a predicate and whether its text() contains the phrase:
//*[(local-name() = "td" or local-name() = "div") and contains(text(), "Automation Server")]

XPath to separately select each of two values in a table cell?

<td _ngcontent-wp class="align-middle">
"4.79728"
<small _ngcontent-wp class="neo_red_dark"> -0.08% </small>
</td>
My XPath as follows:
(//table[@class="table"]/tbody/tr/td[3])[1]
It works, but it gets two values together (4.79728 -0.08%). How can I get them separately?

You can get the value before the space and after the space using:
substring-before() and substring-after()
or change your XPath to target the text() descendants of the td instead of the td itself (which is producing the calculated text value).
In order to select "4.79728":
(//table[@class="table"]/tbody/tr/td[3])[1]/text()
In order to select -0.08%:
(//table[@class="table"]/tbody/tr/td[3])[1]/small/text()

You should indicate with XPath questions which XPath version you are using.
If it's version 1.0, remember that the set of data types you can return is very limited: a single string, number, or boolean, or a node-set. And some APIs only allow you to return a node-set.
Your current query is returning a node-set containing one node, namely a td element, whose string value contains the concatenation of all the text within. You could return a node-set containing all the text nodes individually by appending //text() to the query. But of course, it won't always be the case that the two numbers are in separate text nodes.
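The difference between the cell's string value and its individual text nodes can be sketched in Python with xml.etree.ElementTree. The markup below is a simplified, hypothetical copy of the cell: the valueless _ngcontent-wp attribute is dropped so that it parses as XML, and the quotes around the number are omitted on the assumption that they come from the browser inspector.

```python
import xml.etree.ElementTree as ET

td = ET.fromstring(
    '<td class="align-middle">\n'
    "4.79728\n"
    '<small class="neo_red_dark"> -0.08% </small>\n'
    "</td>"
)

# String value of td: all descendant text concatenated (what td itself yields).
whole = " ".join("".join(td.itertext()).split())

# First text node child of td, roughly td/text()[1].
price = td.text.strip()

# Text inside the small child, roughly td/small/text().
change = td.find("small").text.strip()

print(whole)   # 4.79728 -0.08%
print(price)   # 4.79728
print(change)  # -0.08%
```

Note that td/text() in this markup also selects a second, whitespace-only text node after the small element, so indexing or trimming may be needed.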

Why is XPath contains(text(),'substring') not working as expected?

Let's say I have a piece of HTML like this:
<a>Ask Question<other/>more text</a>
I can match this piece of XPath:
//a[text() = 'Ask Question']
Or...
//a[text() = 'more text']
Or I can use dot to match the whole thing:
//a[. = 'Ask Questionmore text']
This post describes this difference between . (dot) and text(), but in short the first returns a single element, where the latter returns a list of elements. But this is where it gets a bit weird to me. Because while text() can be used to match either of the elements on the list, this is not the case when it comes to the XPath function contains(). If I do this:
//a[contains(text(), 'Ask Question')]
...I get the following error:
Error: Required cardinality of first argument of contains() is one or zero
How can it be that text() works when using a full match (equals), but doesn't work on partial matches (contains)?

For this markup,
<a>Ask Question<other/>more text</a>
notice that the a element has a text node child ("Ask Question"), an empty element child (other), and a second text node child ("more text").
Here's how to reason through what's happening when evaluating //a[contains(text(),'Ask Question')] against that markup:
1. contains(x,y) expects x to be a string, but text() matches two text nodes.
2. In XPath 1.0, the rule for converting multiple nodes to a string is this:
A node-set is converted to a string by returning the string-value of
the node in the node-set that is first in document order. If the
node-set is empty, an empty string is returned. [Emphasis added]
3. In XPath 2.0+, it is an error to provide a sequence of text nodes to a function expecting a string, so contains(text(),'substr') will cause an error for more than one matching text node.
In your case...
XPath 1.0 would treat contains(text(),'Ask Question') as
contains('Ask Question','Ask Question')
which is true. On the other hand, be sure to notice that contains(text(),'more text') will evaluate to false in XPath 1.0. Without knowing the (1)-(3) above, this can be counter-intuitive.
XPath 2.0 would treat it as an error.
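The two conversion rules can be mimicked in a few lines of Python. This is only a sketch built on xml.etree.ElementTree, not a real XPath evaluator, and the function names (xpath1_contains_text, xpath2_contains_text) and the choice of TypeError are hypothetical.

```python
import xml.etree.ElementTree as ET

a = ET.fromstring("<a>Ask Question<other/>more text</a>")


def text_nodes(elem):
    """Immediate text node children (elem.text plus each child's tail)."""
    nodes = [elem.text] + [child.tail for child in elem]
    return [t for t in nodes if t]


def xpath1_contains_text(elem, needle):
    """XPath 1.0 style: a node-set becomes the string value of its first node."""
    nodes = text_nodes(elem)
    first = nodes[0] if nodes else ""
    return needle in first


def xpath2_contains_text(elem, needle):
    """XPath 2.0+ style: more than one item where a string is expected is an error."""
    nodes = text_nodes(elem)
    if len(nodes) > 1:
        raise TypeError("first argument of contains() must be zero or one item")
    return needle in (nodes[0] if nodes else "")


print(text_nodes(a))                             # ['Ask Question', 'more text']
print(xpath1_contains_text(a, "Ask Question"))   # True
print(xpath1_contains_text(a, "more text"))      # False: second node is ignored

try:
    xpath2_contains_text(a, "Ask Question")
except TypeError as exc:
    print("error:", exc)
```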
Better alternatives
If the goal is to find all a elements whose string value contains the substring, "Ask Question":
//a[contains(.,'Ask Question')]
This is the most common requirement.
If the goal is to find all a elements with an immediate text node child equal to "Ask Question":
//a[text()='Ask Question']
This can be useful when you wish to exclude strings from descendant elements in a, such as if you want this a,
<a>Ask Question<other/>more text</a>
but not this a:
<a>more text before <not>Ask Question</not> more text after</a>
See also
How contains() handles a nodeset first arg
How to use XPath contains() for specific text?
Testing text() nodes vs string values in XPath

The reason for this is that the contains function doesn't accept a nodeset as input - it only accepts a string. (Well, it may be engine dependent, because it works for Python's lxml module. According to the specification, it should convert the value of the first node in the set to a string and act on that. See also XPath contains(text(),'some string') doesn't work when used with node with more than one Text subnode)
//a[text() = 'Ask Question'] is matching any a elements which contain a text node which equals Ask Question.
//a[text() = 'more text'] is matching any a elements which contain a text node which equals more text.
So both of these expressions match the same a element.
You can re-work your query to //a[text()[contains(., 'Ask Question')]] so that the contains method will only act on a single text node at a time.
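The idea of testing each text node separately can be sketched in Python as follows. This emulates the predicate text()[contains(., ...)] with xml.etree.ElementTree rather than evaluating XPath, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET


def text_nodes(elem):
    """Immediate text node children (elem.text plus each child's tail)."""
    nodes = [elem.text] + [child.tail for child in elem]
    return [t for t in nodes if t]


def any_text_node_contains(elem, needle):
    """Rough equivalent of the predicate [text()[contains(., needle)]]."""
    return any(needle in t for t in text_nodes(elem))


first = ET.fromstring("<a>Ask Question<other/>more text</a>")
second = ET.fromstring(
    "<a>more text before <not>Ask Question</not> more text after</a>"
)

print(any_text_node_contains(first, "Ask"))            # True
print(any_text_node_contains(first, "more"))           # True: second text node
print(any_text_node_contains(second, "Ask Question"))  # False: text is inside <not>
```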

xpath find link containing HTML in page

This is not the same question as xpath find specific link in page . I've got foo <em class="bar">baz</em>.. and need to find the link by the full foo <em class="bar">baz</em>. including the closing dot.

Note: I'm following up on OP's comment
A (visually) simpler variation of OP's own answer could be:
//a[. = "foo baz."][em[@class = "bar"] = "baz"]
or even:
//a[.="foo baz." and em[@class="bar"]="baz"]
(assuming you want to select the <a> node, and not the child <em>)
Regarding OP's question:
why the [em[]= doesn't need the dot?
Inside a predicate, testing = against a string on the right will convert the left part to a string, here <em> to its string representation, i.e. what string() would return.
XPath 1.0 specification document has an example of this:
chapter[title="Introduction"] selects the chapter children of the context node that have one or more title children with string-value equal to "Introduction"
Later, the same spec says on boolean tests:
If one object to be compared is a node-set and the other is a string, then the comparison will be true if and only if there is a node in the node-set such that the result of performing the comparison on the string-value of the node and the other string is true.
In OP's answer, //a[string() = 'bar baz.']/em[@class='bar' and .='baz'], the . is needed since the test on 'baz' is on the context node
Note that my answer is somewhat naive and assumes there's only 1 <em> child of <a>, because [em[@class="bar"]="baz"] is looking for one em[@class="bar"] matching the string-value condition, not that it's the only or first one.
Consider this input (a second <em class="bar"> child, but empty):
foo <em class="bar">baz</em><em class="bar"></em>..
and this test using Scrapy selectors
>>> import scrapy
>>> s = scrapy.Selector(text="""foo <em class="bar">baz</em><em class="bar"></em>..""")
>>> s.xpath('//a[.="foo baz." and em[@class="bar"]="baz"]').extract_first()
u'foo <em class="bar">baz</em><em class="bar"></em>.'
>>>
The XPath matches but you may not want this.

In my understanding, XPath can't see the raw HTML markup; it works on an abstracted layer of the HTML document. Trying to incorporate as much of the information in the HTML markup as possible into an XPath expression would yield something like this:
//a[
node()[1][self::text() and .='foo ']
/following-sibling::node()[1][self::em[@class='bar' and .='baz']]
/following-sibling::node()[1][self::text() and .='.']
]
Brief explanation of the predicates used:
node()[1][self::text() and .='foo '] : having as first child node a text node with value equal to "foo " (including the trailing space)
/following-sibling::node()[1][self::em[@class='bar' and .='baz']] : followed directly by <em> having class equal to "bar" and value equal to "baz"
/following-sibling::node()[1][self::text() and .='.'] : followed directly by a text node having value equals "."
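The same node-by-node check can be sketched in Python over an ElementTree element. This is an emulation, not XPath: the helper names are hypothetical, and the <a href="#"> wrapper in the sample markup is an assumption, since the link markup itself is not shown here.

```python
import xml.etree.ElementTree as ET


def child_nodes(elem):
    """Children in document order as ('text', str) or ('element', Element)."""
    nodes = []
    if elem.text:
        nodes.append(("text", elem.text))
    for child in elem:
        nodes.append(("element", child))
        if child.tail:
            nodes.append(("text", child.tail))
    return nodes


def matches_foo_em_dot(a):
    """First three child nodes must be: text 'foo ', em.bar 'baz', text '.'."""
    nodes = child_nodes(a)
    if len(nodes) < 3:
        return False
    first, second, third = nodes[:3]
    return (
        first == ("text", "foo ")
        and second[0] == "element"
        and second[1].tag == "em"
        and second[1].get("class") == "bar"
        and "".join(second[1].itertext()) == "baz"
        and third == ("text", ".")
    )


link = ET.fromstring('<a href="#">foo <em class="bar">baz</em>.</a>')
extra_em = ET.fromstring(
    '<a href="#">foo <em class="bar">baz</em><em class="bar"></em>.</a>'
)

print(matches_foo_em_dot(link))      # True
print(matches_foo_em_dot(extra_em))  # False: third node is an element, not "."
```

Unlike a comparison on the string value alone, this positional check rejects the variant with an extra empty em element.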

This is not 100% because there can be other HTML tags we have stripped by calling string() but for my purposes this looks enough:
//a[string() = 'bar baz.']/em[@class='bar' and .='baz']

XPath //div[contains(text(), 'string')] fails to select divs containing 'string'

This is the HTML code:
<div> <span></span> Elangovan </div>
I want to write an XPath for the div based on its contained text. I tried
//div[contains(text(),'Elangovan')]
but this is not working.

Replace text() with string():
//div[contains(string(), "Elangovan")]
Or, you can check that span's following text sibling contains the text:
//div[contains(span/following-sibling::text(), "Elangovan")]
Also see:
Difference between text() and string()

Alternatively to alecxe's correct answer (+1), the following slightly simpler and somewhat more idiomatic XPath will work the same way:
//div[contains(., "Elangovan")]
The reason that your original XPath with text() does not work is that text() will select all text node children of div. However, contains() expects a string in its first argument, and when given a node set of text nodes, it only uses the first one. Here, the first text node contains whitespace, not the sought after string, so the test fails. With the implicit . or the explicit string() first argument, all text node descendants are concatenated together before performing the contains() test, so the test passes.

To make @kjhughes's already good answer just a little more precise, what you're really asking for is a way to look for substrings in the div's string-value:
For every type of node, there is a way of determining a string-value
for a node of that type. For some types of node, the string-value is
part of the node; for other types of node, the string-value is
computed from the string-value of descendant nodes.
Both the context node (. or the div itself) and the set of nodes returned by text() -- or any other argument! -- are first converted to strings when passed to contains. It's just that they're converted in different ways, because one refers to a single element and the other refers to a node-set.
A single element's string-value is the concatenation of the string-values of all its text node descendants. A node-set's string-value, on the other hand, is the string-value of the node in the set that is first in document order.
So the real difference is in what you're converting to a string and how that conversion takes place.
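For the div in question, the two conversions can be compared directly. The following is a small Python sketch with xml.etree.ElementTree that computes both strings by hand; the variable names are hypothetical, and the first-node rule shown is the XPath 1.0 one.

```python
import xml.etree.ElementTree as ET

div = ET.fromstring("<div> <span></span> Elangovan </div>")

# Node-set returned by text(): the immediate text node children of div.
text_children = [t for t in [div.text] + [child.tail for child in div] if t]

# XPath 1.0 converts that node-set to the string value of its first node.
first_text_node = text_children[0] if text_children else ""

# The element's own string value concatenates every descendant text node.
string_value = "".join(div.itertext())

print(len(text_children))              # 2
print("Elangovan" in first_text_node)  # False: first node is only whitespace
print("Elangovan" in string_value)     # True
```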
