XPath: "Exclude" tag in "InnerHtml" (InnerHtml<span>excludeme</span>)

I am using XPath to query HTML sites, which has worked pretty well so far, but now I've hit a (brick) wall and can't find a solution :-)
The html looks like this:
<ul>
<li>Text1<span>AnotherText1</span></li>
<li>Text2<span>AnotherText2</span></li>
<li>Text3<span>AnotherText3</span></li>
</ul>
I want to select the "TextX" part, but NOT the AnotherTextX part in the <span></span>
So far I couldn't come up with any (pure) XPath solution to do that (and in my setup I unfortunately need a pure XPath solution).
This selects kind of what I want, but it results in "TextXAnotherTextX" and I only need "TextX".
/ul/li/a
Any hints? :-)

This gets you the first direct text node child of <a>:
/ul/li/a/text()[1]
and this would get you any direct text node child (separately):
/ul/li/a/text()
Both of the above return "TextX", but if you had:
<li>Text4<span>AnotherText3</span>TrailingText</li>
then the latter would return: ["Text4", "TrailingText"], while the former would return "Text4" only.
Your expression /ul/li/a gets the string value of <a>, which is defined as the concatenation of the string value of all the children of <a>, so you get "TextXAnotherTextX".
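To make the difference between direct text nodes and the string value concrete, here is a minimal sketch in Python using only the standard library. Python's xml.etree.ElementTree does not implement text(), so the helper functions below (direct_text_nodes and string_value, both names made up for this illustration) only mimic what text() and the string value would give in a full XPath engine. The markup is the one shown in the question, with the text sitting directly in <li>, plus the Text4 item from above.

```python
import xml.etree.ElementTree as ET

# Markup from the question plus the Text4 item from the answer.
ul = ET.fromstring(
    "<ul>"
    "<li>Text1<span>AnotherText1</span></li>"
    "<li>Text4<span>AnotherText3</span>TrailingText</li>"
    "</ul>"
)

def direct_text_nodes(elem):
    """Rough analogue of elem/text(): the element's own text children, in order."""
    nodes = [elem.text] + [child.tail for child in elem]
    return [t for t in nodes if t]

def string_value(elem):
    """Rough analogue of string(elem): all descendant text concatenated."""
    return "".join(elem.itertext())

first_li, fourth_li = ul.findall("li")

# Like li/text()[1]: only the first direct text node of each li.
first_text = [direct_text_nodes(li)[0] for li in ul.findall("li")]

# Like li/text(): every direct text node, each one separately.
all_text = direct_text_nodes(fourth_li)

# Like the string value of the element: the span text is included.
full = string_value(first_li)

print(first_text)  # ['Text1', 'Text4']
print(all_text)    # ['Text4', 'TrailingText']
print(full)        # Text1AnotherText1
```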

Related

How to select all elements with a specific name under every li node with the same structure?

I have a certain bunch of XPath locators that hold the elements I want to extract, and they have a similar structure:
/div/ul/li[1]/div/div[2]/a
/div/ul/li[2]/div/div[2]/a
/div/ul/li[3]/div/div[2]/a
...
They are actually simplified from a Pixiv user page. Each /div/div[2]/a element has a title string, so they are actually artwork titles.
I want to use a single expression to fetch all the above a elements in a WebExtension called PageProbe. Although I've tried a bunch of methods, none of them returns the wanted result.
However, the following expression does return all the a elements, including the ones I don't need.
/div/
The following expression returns the a element under only the first li item.
/div/ul/li/div/div[2]/a
Sorry for not providing enough info earlier. Hope someone can help me out. Thanks.
According to the information you gave here you can simply use this xpath:
/div/ul/li/div/div[2]/a
However, I'm quite sure there is a better locator based on other attributes such as class names.
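As a quick sanity check, here is a small sketch with Python's standard xml.etree.ElementTree, which supports this subset of XPath. The markup is invented to match the locators in the question (the real Pixiv page is certainly more complex), and the path is written relative to the outer div because ElementTree evaluates paths from the element it is called on. Under these assumptions the expression returns one a per li, so if a tool returns only the first one, that is probably how the tool presents results rather than a property of the expression.

```python
import xml.etree.ElementTree as ET

# Invented markup that mirrors /div/ul/li[n]/div/div[2]/a from the question.
root = ET.fromstring(
    "<div><ul>"
    "<li><div><div>thumb</div><div><a>Title 1</a></div></div></li>"
    "<li><div><div>thumb</div><div><a>Title 2</a></div></div></li>"
    "<li><div><div>thumb</div><div><a>Title 3</a></div></div></li>"
    "</ul></div>"
)

# root is the outer <div>, so this relative path corresponds to
# /div/ul/li/div/div[2]/a in the original document.
titles = [a.text for a in root.findall("ul/li/div/div[2]/a")]

print(titles)  # ['Title 1', 'Title 2', 'Title 3']
```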

XPath concat two children

I know this is a commonly asked question (I found "Concatenate multiple node values in xpath", "XPath joining multiple elements", and a few others), but for the life of me I can't figure it out. I've got the following HTML:
<div class="product-card__price__new"><span class="product-card__price__euros">0.</span> <span class="product-card__price__cents">69</span></div>
From which I need to extract the 0.69. I tried the following XPATH:
'.//*[@class="product-card__price__new"]/concat(/span/text(), following-sibling::span[1]/text)'
'.//*[@class="product-card__price__new"]/span/text()/concat(., following-sibling::span[1]/text)'
'.//*[@class="product-card__price__new"]/text()'
But I keep getting nothing. What would be the correct expression to extract it?
It's much simpler than this. The string value of an element is the concatenation of all its descendant text nodes. So it's just
string(.//*[@class="product-card__price__new"])
and in many contexts you don't need the call on string() because it's implicit.
Most of the time when people use text() they are (a) over-complicating things and/or (b) getting it wrong.
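A small sketch of what the string value looks like for this exact markup, using Python's standard xml.etree.ElementTree. ElementTree has no string() function, so "".join(div.itertext()) stands in for it here. One thing to watch: the HTML has a space between the two spans, and that space is a text node too, so it ends up in the string value. In XPath 1.0, something like translate(string(...), ' ', '') should strip it; the sketch does the same in Python.

```python
import xml.etree.ElementTree as ET

# The HTML from the question, including the space between the two spans.
div = ET.fromstring(
    '<div class="product-card__price__new">'
    '<span class="product-card__price__euros">0.</span> '
    '<span class="product-card__price__cents">69</span>'
    '</div>'
)

# Stand-in for string(div): all descendant text nodes concatenated.
string_value = "".join(div.itertext())

# Drop the whitespace between the spans to get the bare price.
price = "".join(string_value.split())

print(repr(string_value))  # '0. 69'
print(price)               # 0.69
```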

Why won't my XPath select link/button based on its label text?

<a href="javascript:void(0)" title="home">
<span class="menu_icon">Maybe more text here</span>
Home
</a>
So for above code when I write //a as XPath, it gets highlighted, but when I write //a[contains(text(), 'Home')], it is not getting highlighted. I think this is simple and should have worked.
Where's my mistake?
Other answers have missed the actual problem here:
Yes, you could match on @title instead, but that's not why OP's XPath is failing where it may have worked previously.
Yes, XML and XPath are case sensitive, so Home is not the same as home, but there is a Home text node as a child of a, so OP is right to use Home if he doesn't trust @title to be present.
Real Problem
OP's XPath,
//a[contains(text(), 'Home')]
says to select all a elements whose first text node contains the substring Home. Yet, the first text node contains nothing but whitespace.
Explanation: text() selects all child text nodes of the context node, a. When contains() is given multiple nodes as its first argument, it takes the string value of the first node, but Home appears in the second text node, not the first.
Instead, OP should use this XPath,
//a[text()[contains(., 'Home')]]
which says to select all a elements with any text child whose string value contains the substring Home.
If there weren't surrounding whitespace, this XPath could be used to test for equality rather than substring containment:
//a[text()[.='Home']]
Or, with surrounding whitespace, this XPath could be used to trim it away:
//a[text()[normalize-space()= 'Home']]
See also:
Testing text() nodes vs string values in XPath
Why is XPath unclean constructed? Why is text() not needed in predicate?
XPath: difference between dot and text()
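To see why the first text node matters, here is a sketch in Python with the standard xml.etree.ElementTree. ElementTree does not support text() or contains(), so the list text_nodes below is built by hand from .text and .tail as an approximation of what a/text() selects; the checks then mirror contains(text(), 'Home') in XPath 1.0 (first node only), text()[contains(., 'Home')] (any node), and normalize-space().

```python
import xml.etree.ElementTree as ET

# The markup from the question, line breaks included.
a = ET.fromstring(
    '<a href="javascript:void(0)" title="home">\n'
    '<span class="menu_icon">Maybe more text here</span>\n'
    'Home\n'
    '</a>'
)

# Approximation of a/text(): the direct text children of <a>, in order.
text_nodes = [t for t in [a.text] + [c.tail for c in a] if t is not None]

# contains(text(), 'Home') in XPath 1.0 only looks at the first node.
first_only = "Home" in text_nodes[0]

# text()[contains(., 'Home')] tests each text node separately.
any_node = any("Home" in t for t in text_nodes)

# normalize-space() trims and collapses whitespace in each node.
normalized = [" ".join(t.split()) for t in text_nodes]

print(text_nodes)  # ['\n', '\nHome\n']
print(first_only)  # False
print(any_node)    # True
print(normalized)  # ['', 'Home']
```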
Yes, you are making two mistakes: you're writing Home with an uppercase H when you want to match home with a lowercase h, and you're trying to check the text content when you want to check the "title" attribute. Correct those two and you get:
//a[contains(@title, 'home')]
However, if you want to match the exact string home, instead of any a that has home anywhere in the title attribute, use @zsbappa's code.
You can try this XPath. It just selects the element by attribute:
//a[@title='home']

Using XPath to get the first child for every child of a node

I'm trying to parse some HTML with the following structure, how can I extract the first <a> element of every <li> element using xpath?
<ul>
<li>
<a>
<span>
<a>
</li>
<li>
<a>
<span>
<a>
</li>
...
</ul>
@Mathias : You are correct, I apologize. //li/a[1] did not work because it wasn't a direct child (there is an article tag in between, which I omitted for simplicity).
Then let me post this as a solution with some more explanation.
If, as you have described, //li/a[1] does not return anything while (//li//a)[1] does, then the HTML sample you show is not representative for your actual document. Then, a would be a descendant of li, but not a direct child of it.
A correct XPath expression in this case is
//li//a[1]
but only use it if the level of nesting varies, i.e. if there could be other elements nested between li and a:
<li>
<article>
<other>
<a/>
If the nesting is consistent, but it is not always the article element which is in between li and a, then use
//li/*/a[1]
which avoids the // step, which is computationally more expensive than /.
Finally, if you know that the a elements you are interested in are always grandchildren of li elements and if it is always the article element in between them, use
//li/article/a[1]
When I correct the expression to be //li/article/a[1], I get the first a for the first li.
//li/article/a[1] returns several results if there are several a elements that are children of article and grandchildren of li. If this returns only a single result, either
you invoke this XPath expression in a context where only a single result is expected, e.g. if you use an XPath library in a programming language, or
the structure of your input document is even more intricate.
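The variants above can be compared side by side with a short sketch in Python's standard xml.etree.ElementTree, which supports these particular path forms when they are written relative to an element (hence the leading dot). The sample document is made up: it assumes an article element between li and a, as described in the comment, with two a elements per article.

```python
import xml.etree.ElementTree as ET

# Made-up sample: each li wraps its links in an article element.
ul = ET.fromstring(
    "<ul>"
    "<li><article><a>1a</a><span/><a>1b</a></article></li>"
    "<li><article><a>2a</a><span/><a>2b</a></article></li>"
    "</ul>"
)

def texts(path):
    """Evaluate a path relative to <ul> and return the matched link texts."""
    return [a.text for a in ul.findall(path)]

direct = texts(".//li/a[1]")               # a as a direct child of li
any_depth = texts(".//li//a[1]")           # a at any depth below li
one_level = texts(".//li/*/a[1]")          # exactly one element in between
via_article = texts(".//li/article/a[1]")  # article in between

print(direct)       # []
print(any_depth)    # ['1a', '2a']
print(one_level)    # ['1a', '2a']
print(via_article)  # ['1a', '2a']
```

Note that the last three return one a per li, i.e. several results for the whole document, not a single node.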
I think that the XPath to accomplish that would be .//ul/li/a[position()=1] .
Explanation:
The reason I spell it all out as .//ul/li/a is because, when you use the XPath, if there is an error, your stack trace will reveal exactly what the locator pointed at, and is less vague. But you can obviously shorten it if you don't care: .//a .
Using the position clause, you can do =1 or >1 , or whatever. I would choose using [position()=1] over using [1] because XPath doesn't use 0-based arrays, which might confuse others looking at your locator. I mean position=0, by logic, means null, right?
I start my locator with a . because personally, sometimes I like to chain my locators together in a combination. You don't really need to start with the dot char, but since I use the // wildcard in this case, it's effectively the same as starting without a dot, but with the additional ability to be chained.
Answer tested on http://the-internet.herokuapp.com/

Confounded by XPath

When it comes to indexing in XPath, I feel like I'm missing something here.
If I have two table tags in an HTML document, and within the Chrome console I type $x("//table[1]");, I expect to get the first table tag on the page.
Instead, I get a list containing both table tags. I suspected it might have something to do with using // but using an absolute XPath expression yielded the same results.
I think this is a pretty simple misunderstanding, but I'm not seeing it when reading the docs.
//table[1] returns all tables that are the first table child of their respective parents.
To get the first table use /descendant::table[1] or (//table)[1]; the parenthesized form is also valid in XPath 1.0, which is what $x in the Chrome console evaluates.
Here it is in the standard:
The path expression //para[1] does not mean the same as the path expression /descendant::para[1]. The latter selects the first descendant para element; the former selects all descendant para elements that are the first para children of their respective parents.
Use
(//table)[1]
i.e. the first of all the tables.
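Here is a minimal sketch of the same effect in Python's standard xml.etree.ElementTree. ElementTree has no parenthesized (//table)[1] form or descendant:: axis, so taking index 0 of the full result list plays that role here; the two-table document and the id values are invented for the illustration.

```python
import xml.etree.ElementTree as ET

# Invented page: two tables, each the first table child of its own parent.
body = ET.fromstring(
    "<body>"
    '<div><table id="t1"/></div>'
    '<div><table id="t2"/></div>'
    "</body>"
)

# Like //table[1]: every table that is the first table child of its parent.
per_parent = [t.get("id") for t in body.findall(".//table[1]")]

# Like (//table)[1]: collect all tables first, then take the first one.
first_overall = body.findall(".//table")[0].get("id")

print(per_parent)     # ['t1', 't2']
print(first_overall)  # t1
```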