How do I use the and operator in XPath?

I am trying to select all text within ul tags and p tags on a web page. I'm finding it hard to select both at the same time. I can select each separately no problem. Here's what I've tried so far:
('//p and ul');
('//p::ul');
I'm also trying to select a specific list. There are multiple lists on the web page but I am only interested in one. How would I go about selecting all lists on a page that are within a tag with a certain id, e.g.
<div id="thisistheid"
<ul....
Thanks for any help.

As mentioned in my comment, I think you need to use //p and //ul, but I'm not certain.
I can certainly answer your second question though: //div[@id='thisistheid']//ul will select uls only if they are descendants of #thisistheid. You can also use /ul in place of //ul to match only direct children of the div (useful if you have nested lists).
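To make the difference between //ul and /ul after the id predicate concrete, here is a minimal sketch using Python's standard-library xml.etree.ElementTree, which supports only a subset of XPath. The markup and the ids outer, inner and elsewhere are made up for illustration, and the leading "." is ElementTree's way of anchoring the path at the parsed root.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: one target div with a nested list, plus an unrelated list.
doc = ET.fromstring("""
<html><body>
  <div id="thisistheid">
    <ul id="outer"><li><ul id="inner"><li>x</li></ul></li></ul>
  </div>
  <ul id="elsewhere"><li>y</li></ul>
</body></html>
""")

# //ul after the predicate: every ul descendant of the div, nested ones included.
all_lists = [ul.get("id") for ul in doc.findall(".//div[@id='thisistheid']//ul")]

# /ul after the predicate: only ul elements that are direct children of the div.
top_lists = [ul.get("id") for ul in doc.findall(".//div[@id='thisistheid']/ul")]

print(all_lists)  # ['outer', 'inner']
print(top_lists)  # ['outer']
```

The list outside the div is never matched, and the nested list only shows up with the // form.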

And and Or are boolean expressions:
An and expression is evaluated by evaluating each operand and converting its value to a boolean as if by a call to the boolean function. The result is true if both values are true and false otherwise. The right operand is not evaluated if the left operand evaluates to false.
Since a node cannot be both P and UL at the same time, your test will never return true.
Try the Union operator (a PathExpression) instead:
//p | //ul
This will give you all the p and ul elements in the document
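As a rough illustration of what the union returns, here is a sketch in Python. Note the assumption: the standard-library xml.etree.ElementTree has no | operator, so the code below does not evaluate //p | //ul itself. It emulates the same result (every p and ul element, in document order) by walking the tree and filtering on the tag name. The sample markup is invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with p and ul elements at different depths.
doc = ET.fromstring("""
<body>
  <p>first paragraph</p>
  <ul><li>item</li></ul>
  <div><p>nested paragraph</p></div>
  <ol><li>not selected</li></ol>
</body>
""")

# Emulation of //p | //ul: walk the tree in document order and keep
# every element whose tag is p or ul.
wanted = {"p", "ul"}
selected = [el for el in doc.iter() if el.tag in wanted]

tags = [el.tag for el in selected]
# All text inside each selected element, descendants included.
texts = ["".join(el.itertext()).strip() for el in selected]

print(tags)   # ['p', 'ul', 'p']
print(texts)  # ['first paragraph', 'item', 'nested paragraph']
```

With a full XPath engine you would pass //p | //ul directly and get the same node-set in one query.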
To get a node with a specific id you can use the id() function
id('thisistheid')
If this doesn't work for whatever reason, you can still use an attribute test, e.g.
//div[@id='thisistheid']
Or - if you are using the DOM API, you can use getElementById().
You can find an easy to follow XPath tutorial at http://schlitt.info/opensource/blog/0704_xpath.html

Related

How to select all elements with a specific name under every li node with the same structure?

I have a certain bunch of XPath locators that hold the elements I want to extract, and they have a similar structure:
/div/ul/li[1]/div/div[2]/a
/div/ul/li[2]/div/div[2]/a
/div/ul/li[3]/div/div[2]/a
...
They are actually simplified from Pixiv user page. Each /div/div[2]/a element has a title string, so they are actually artwork titles.
I want to use a single expression to fetch all the above a elements in a WebExtension called PageProbe. Although I've tried a bunch of methods, none of them returns the wanted result.
However, the following expression does return all the a elements, including the ones I don't need.
/div/
The following expression returns the a element under only the first li item.
/div/ul/li/div/div[2]/a
Sorry for not providing enough info earlier. Hope someone can help me out. Thanks.
According to the information you gave here you can simply use this xpath:
/div/ul/li/div/div[2]/a
However, I'm quite sure there is a better locator based on other attributes such as class names.
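For what it's worth, here is a small sketch of why dropping the index on li matches every item. It uses Python's standard-library xml.etree.ElementTree on a made-up, trimmed-down structure (the real Pixiv markup is not shown in the question, so the element contents here are assumptions).

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down version of the structure described above.
doc = ET.fromstring("""
<div><ul>
  <li><div><div>thumb</div><div><a>Title A</a></div></div></li>
  <li><div><div>thumb</div><div><a>Title B</a></div></div></li>
  <li><div><div>thumb</div><div><a>Title C</a></div></div></li>
</ul></div>
""")

# The root <div> is the context node here, so the leading /div becomes ".".
# li has no index, so the step matches every li; div[2] still picks the
# second div child inside each item.
titles = [a.text for a in doc.findall("./ul/li/div/div[2]/a")]

print(titles)  # ['Title A', 'Title B', 'Title C']
```

If a tool returns only the first match for this expression, that is usually the tool picking a single node from the result rather than the expression itself.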

Using XPath to get the first child for every child of a node

I'm trying to parse some HTML with the following structure. How can I extract the first <a> element of every <li> element using XPath?
<ul>
<li>
<a>
<span>
<a>
</li>
<li>
<a>
<span>
<a>
</li>
...
</ul>
@Mathias: You are correct, I apologize. //li/a[1] did not work because it wasn't a direct child (there is an article tag in between, which I omitted for simplicity).
Then let me post this as a solution with some more explanation.
If, as you have described, //li/a[1] does not return anything while (//li//a)[1] does, then the HTML sample you show is not representative of your actual document. In that case, a is a descendant of li, but not a direct child of it.
A correct XPath expression in this case is
//li//a[1]
but only use it if the level of nesting varies, i.e. if there could be other elements nested between li and a:
<li>
<article>
<other>
<a/>
If the nesting is consistent, but it is not always the article element which is in between li and a then use
//li/*/a[1]
which avoids the // abbreviation, which is computationally more expensive than /.
Finally, if you know that the a elements you are interested in are always grandchildren of li elements and if it is always the article element in between them, use
//li/article/a[1]
When I correct the expression to be //li/article/a[1], I get the first a for the first li.
//li/article/a[1] returns one a per article, so it returns several results if several li elements each have an article child that contains an a. If it returns only a single result, then either
you are invoking this XPath expression in a context where only a single result is expected, e.g. through an XPath library call in a programming language that returns just one node, or
the structure of your input document is even more intricate than described.
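The variants discussed above can be compared side by side. This is a minimal sketch with Python's standard-library xml.etree.ElementTree, whose XPath subset applies [1] per parent as XPath does. The sample document is an assumption based on the description (an article between li and a), and the paths start with "." because ElementTree evaluates them relative to the parsed root.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: every li wraps its links in an article element.
doc = ET.fromstring("""
<ul>
  <li><article><a>one</a><span/><a>two</a></article></li>
  <li><article><a>three</a><span/><a>four</a></article></li>
</ul>
""")

def texts(path):
    """Return the text of every a element matched by the given path."""
    return [a.text for a in doc.findall(path)]

direct = texts(".//li/a[1]")               # a is not a direct child of li
any_depth = texts(".//li//a[1]")           # first a under each parent, any depth
one_between = texts(".//li/*/a[1]")        # exactly one element in between
via_article = texts(".//li/article/a[1]")  # that element must be article

print(direct)       # []
print(any_depth)    # ['one', 'three']
print(one_between)  # ['one', 'three']
print(via_article)  # ['one', 'three']
```

Each working variant returns one a per li, which is the "several results" behaviour described above.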
I think that the XPath to accomplish that would be .//ul/li/a[position()=1] .
Explanation:
The reason I spell it all out as .//ul/li/a is that, if there is an error when you use the XPath, your stack trace will reveal exactly what the locator pointed at, which is less vague. But you can obviously shorten it if you don't care: .//a .
Using the position clause, you can do =1 or >1, or whatever. I would choose [position()=1] over [1] because XPath positions are 1-based rather than 0-based, which might confuse others looking at your locator. I mean position=0, by logic, means null, right?
I start my locator with a . because sometimes I like to chain my locators together in a combination. You don't really need to start with the dot, but since I use the // wildcard in this case, it's effectively the same as starting without a dot, with the additional ability to be chained.
Answer tested on http://the-internet.herokuapp.com/

XPath: Way to match text inside an arbitrary number of nested elements?

Is it possible for one XPath expression to match all the following <a> elements using the text in the element, in this case "Link"?
Examples:
<a>Link</a>
<a><span>Link</span></a>
<a><div>Link</div></a>
<a><div><span>Link</span></div></a>
This simple XPath expression,
//a[contains(., 'Link')]
will select the a elements of all of your examples because . represents the current node (a), and contains() will check the string value of a to see if it contains 'Link'. The string value of a already conveniently abstracts away from any descendant elements.
This even simpler XPath expression,
//a[. = 'Link']
will also select the a elements in all of your examples. It's appropriate to use if the string value of a will exactly equal, rather than just contain, "Link".
Note: The above expressions will also select <a>Li<br/>nk</a>, which may or may not be desirable.
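Here is a small sketch of the string-value comparison, using Python's standard-library xml.etree.ElementTree (the [.='text'] predicate needs Python 3.7 or later; contains() is not part of its XPath subset, so only the equality form is shown). The id attributes are hypothetical labels added so the matches can be told apart.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the same text at different nesting depths.
doc = ET.fromstring("""
<body>
  <a id="plain">Link</a>
  <a id="span"><span>Link</span></a>
  <a id="div"><div>Link</div></a>
  <a id="nested"><div><span>Link</span></div></a>
  <a id="split">Li<br/>nk</a>
  <a id="other">Something else</a>
</body>
""")

# [.='Link'] compares the complete text content of each a element,
# descendants included, against the literal 'Link'.
matched = [a.get("id") for a in doc.findall(".//a[.='Link']")]

print(matched)  # ['plain', 'span', 'div', 'nested', 'split']
```

The split entry is matched too, because the br element contributes nothing to the string value.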
You could use the following:
//a[(.//*|.)[contains(text(), "Link")]]
This will select a elements that contain the text "Link" or a elements that have a descendant element that contains the text "Link".
//a - Select all a elements
( - Open OR grouping
.//* - Select all the descendant elements
| - Or..
. - Select the current node
) - Close OR grouping
[contains(text(), "Link")] - If they contain the text "Link"
Alternatively, you could also use:
//a[(.//*|.)[.="Link"]]

using Xpath to scrape inconsistent DOM

I want to scrape the post name, which in the first pattern is located within a span,
but a forum thread can also look like this (line 7)
because the thread is a poll.
So in my case I can't target the span (line 8 in the first picture). I used descendant-or-self but couldn't get it right. What's wrong here?
$postTitle = $xpath->query("//tr/td[@class='row1'][3]/div/div[1]//descendant-or-self::text()");
With this expression you will select the first <a> in the <div> where the text you wish to extract is located:
//tr/td[@class='row1'][3]/div/div[1]/a[1]
I'm assuming you intend to select one element (and not a node-set). For that you can get the string-value of this expression (which will return all the text in the descendant nodes) using string() or normalize-space() (which trims and removes extra spaces):
normalize-space(//tr/td[@class='row1'][3]/div/div[1]/a[1])
This will extract Salary vs age or /ktards are you... depending on the node found.
If there is more than one match it will return a collection, which you should iterate over and get the string value of each one individually. Using those functions on a node-set will give you the text in the first element, discarding the others.
If you only have to deal with two cases: 1) text inside a/span, 2) text inside a, you can select the text nodes directly using a union (|) operator:
//tr/td[@class='row1'][3]/div/div[1]/a[1]/text() | //tr/td[@class='row1'][3]/div/div[1]/a[1]/span/text()
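To illustrate iterating over the matches and taking a normalized string value of each, here is a sketch in Python with the standard-library xml.etree.ElementTree. Several things are assumptions: the table markup and the poll title are invented, the path is simplified (ElementTree does not handle td[@class='row1'][3] the way XPath does), and normalized_text is a hypothetical helper that only approximates normalize-space().

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified rows: one title wrapped in a span, one (a poll) not.
doc = ET.fromstring("""
<table>
  <tr><td class="row1"><div><a><span>Salary vs age</span></a><a>last page</a></div></td></tr>
  <tr><td class="row1"><div><a>
      [ Poll ]   Some poll title
  </a></div></td></tr>
</table>
""")

def normalized_text(element):
    """Rough analogue of normalize-space(string(element)): join all
    descendant text, trim the ends, collapse whitespace runs."""
    return " ".join("".join(element.itertext()).split())

# Select the first a in each div, then take the string value of each match
# individually, whether or not the text sits inside a span.
titles = [normalized_text(a) for a in doc.findall(".//td[@class='row1']/div/a[1]")]

print(titles)  # ['Salary vs age', '[ Poll ] Some poll title']
```

Because the string value covers all descendant text, the same code handles both the a/span case and the plain a case.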

Finding a string that is split by multiple html tags

I am using XPath to find a list of strings in an HTML document. The strings appear when you type into a text box, to suggest possible results - in other words, it's auto-complete. The problem is that I'm trying to retrieve the whole list of auto-complete suggestions, but the results are all split up by <strong> tags.
To give a couple examples: I start typing "str" and the HTML will look like this:
<strong>str</strong>ing
But it gets better! If I don't type anything at all, every single character in the auto-complete results will be interrupted with opening and closing strong tags. Like so:
s
<strong></strong>
t
<strong></strong>
r
<strong></strong>
i
<strong></strong>
n
<strong></strong>
g
So, my question is, how do I construct an xpath that retrieves this string, but omits the strong tags?
For reference, the hierarchy of the HTML looks like this:
-div
--ul
---li
----(string I'm looking for)
---li
----(another string I'm looking for)
So my xpath at this point is: //div[@class='class']/ul/li/text(), which will get me the individual parts of the strings.
This XPath expression:
string(PathToYourDiv/ul/li[$n])
evaluates to the string value of the $n-th li child of the ul that is a child of YourDiv. This is the concatenation of all the text-node descendants of this li element, effectively giving you the complete string you want.
You have just to substitute YourDiv and $n with specific expressions.
Do not use the // abbreviation, because:
Its evaluation can be very slow.
Indexing such an expression with [] is not intuitive and produces surprising results, which come up often enough to be a FAQ.
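The string-value idea can be sketched in Python with the standard-library xml.etree.ElementTree. That module has no string() function, so the code below relies on itertext(), which concatenates the same descendant text nodes. The markup is reconstructed from the two examples in the question and is an assumption about the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup based on the question: each suggestion is broken up
# by strong elements in a different way.
doc = ET.fromstring("""
<div class="class"><ul>
<li><strong>str</strong>ing</li>
<li>s<strong></strong>t<strong></strong>r<strong></strong>i<strong></strong>n<strong></strong>g</li>
</ul></div>
""")

# Explicit path from the div, without //. For each li, join every descendant
# text node, which is what the string value of the element amounts to.
suggestions = ["".join(li.itertext()) for li in doc.findall("./ul/li")]

print(suggestions)  # ['string', 'string']
```

Both items come back whole, regardless of where the strong tags fall.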
That is much less code on the question than people would like to see around here.
But why don't you try a variant like this:
//div[@class='class']/ul/li/strong/text()