XPath for parent's sibling descendants - html

I have the following HTML I need to scrape, but the only reliable handle is a stable description of a text field. From there, I need to go to its parent, find that parent's next sibling, and then get the descendants (unfortunately, the data-automation-id selector repeats in every such iteration of this snippet on the site). I put together the XPath below, but my RPA tool is unable to find it in the document.
XPath
div[contains(text(),'STABLE TEXT HANDLE')]/following-sibling::div/div/div/span[data-automation-id="SOMETHING"]
HTML:
<ul>
<li>
<div>
<label>STABLE TEXT HANDLE</label>
</div>
<div>
<div>
<div>
<span></span>
<span data-automation-id="something">
<div>
<div>
<div>
DYNAMIC TEXT I WANT TO SCRAPE
</div>
</div>
</div>
</span>
<span data-automation-id="somethingelse">
<div>
<div>
<div>
DYNAMIC TEXT I WANT TO SCRAPE
</div>
</div>
</div>
</span>
</div>
</div>
</div>
</li>
</ul>
EDIT:
After further testing, it seems the issue starts with contains(text(),'STABLE TEXT HANDLE'), which fails to find that particular node (be it the label or its parent div).

Please try this:
//label[contains(text(),'STABLE TEXT HANDLE')]/../..//span[@data-automation-id="something"]
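The label, parent, parent, descendant span route can be checked outside the RPA tool with a small Python sketch. It rests on several assumptions: the snippet parses as XML, only the standard library's xml.etree.ElementTree is used, and because that module supports only a subset of XPath (no contains()), the exact-match predicate [.='...'] stands in for contains(text(), ...). The values "first value" and "second value" are placeholders for the dynamic text. The sketch also reflects two likely problems in the original attempt: the stable text sits in the label rather than in the div, and the attribute test needs the @ prefix with the value in its actual case ("something", not "SOMETHING").

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's markup; the two text values are placeholders.
SNIPPET = """
<ul>
  <li>
    <div>
      <label>STABLE TEXT HANDLE</label>
    </div>
    <div>
      <div>
        <div>
          <span></span>
          <span data-automation-id="something"><div><div><div>
            first value
          </div></div></div></span>
          <span data-automation-id="somethingelse"><div><div><div>
            second value
          </div></div></div></span>
        </div>
      </div>
    </div>
  </li>
</ul>
"""

root = ET.fromstring(SNIPPET)

# label -> its parent <div> -> the shared <li> -> any descendant <span>
# whose attribute value matches exactly (attribute values are case-sensitive).
path = (".//label[.='STABLE TEXT HANDLE']/../.."
        "//span[@data-automation-id='something']")

# Collect all text below each matching span, trimmed of layout whitespace.
texts = ["".join(span.itertext()).strip() for span in root.findall(path)]
print(texts)
```

With the placeholder markup above this prints ['first value'], i.e. only the span whose attribute value is exactly "something" is selected.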

Related

Cheerio: Add Link To Text

Using Cheerio, I'm trying to search an HTML for certain text and add a link to it only if it is not linked already.
For example:
<div>
<p> this is an example link </p>
</div>
I want to transform to:
<div>
<p> this is an example link </p>
</div>

XPath for all the //a[@href] tags in a div with the first /li tag text "Something"

I have to find all the URLs from a page in categories. The categories are the first <li> tag in a <div> tag. The page looks like below.
<div class="c1">
<ui>
<li class="d1"> someText </li>
<div>
<li> <a href="some url1">
</div>
<div>
<li> <a href="some url2">
</div>
<div>
<li> <a href="some url3">
</div>
</ui>
</div>
How to find all the hrefs corresponding to the "someText" li tag?
You can first locate the li element by the "someText" text and then go sideways to get the following sibling div elements:
//li[contains(., "someText")]/following-sibling::div/li/a
Or, with normalize-space():
//li[normalize-space(.) = "someText"]/following-sibling::div/li/a
(Not including the @href part, as you've indicated you are using Selenium: you would need to find the elements matching the XPath expression and get the href attribute with getAttribute().)
You can use the XPath following-sibling axis.
//div/ui/li[contains(text(), 'someText')]/following-sibling::div/li/a/@href
How to find all the hrefs corresponding to the "someText" li tag?
Content-based selection
See @alecxe's fine answer (+1), but your title and this part of your question,
I have to find all the URLs from a page in categories. The categories are the first <li> tag in a <div> tag.
appear to be concerned more with first position than with content...
Position-based selection
This XPath,
(//div[@class="c1"]//li[1]/following::a)[1]
selects the first a element following the first li element descendant of the noted div element.
Try this XPath-1.0 expression:
//div[@class='c1']/ui[normalize-space(li[@class='d1'])='someText']/div/li/a/@href
Its output is
some url1
some url2
some url3
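The content-based selection can also be sketched in Python. This is an illustration under assumptions, not part of the original answers: it uses only the standard library's xml.etree.ElementTree, which has no following-sibling axis and no normalize-space(), so the category check is done in Python code instead. The unclosed a and li elements of the question's markup are closed here so that the snippet parses as XML.

```python
import xml.etree.ElementTree as ET

# The question's markup, with the <a> and <li> elements closed so it parses as XML.
PAGE = """
<div class="c1">
  <ui>
    <li class="d1"> someText </li>
    <div><li><a href="some url1"/></li></div>
    <div><li><a href="some url2"/></li></div>
    <div><li><a href="some url3"/></li></div>
  </ui>
</div>
"""

root = ET.fromstring(PAGE)

hrefs = []
for ui in root.iter("ui"):
    category = ui.find("li[@class='d1']")
    # .strip() stands in for normalize-space() on leading/trailing whitespace.
    if category is not None and "".join(category.itertext()).strip() == "someText":
        # Equivalent of the trailing div/li/a/@href steps.
        hrefs.extend(a.get("href") for a in ui.findall("div/li/a"))

print(hrefs)
```

For the markup above this prints ['some url1', 'some url2', 'some url3'].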

xpath select parent elem w/ blank text() after excluding certain children

I am trying to select all div.to_get whose children have no text content, excluding certain elements
html:
<body>
<div class="to_get">
<span> </span>
<span class="exclude"> text is ignored </span>
<span> </span>
</div>
<div class="to_get">
<span> there is text here, so don't select the parent div </span>
<span class="exclude"> text is ignored </span>
<span> </span>
</div>
<div class="to_get">
<span> </span>
<span class="exclude"> text is ignored </span>
<span> there is text here, so don't select the parent div </span>
</div>
</body>
xpath attempt:
//*/body/div[@class='to_get']/descendant::text()[not(ancestor::span/@class='exclude')][normalize-space(.)='']/ancestor::div[@class='to_get']
The problem is that this still returns the 2nd (and 3rd) div.to_get because of its 3rd (and 1st) span child. But those divs should be excluded due to its 1st (and 3rd) span child.
The xpath should only select the 1st div.to_get.
The following XPath
//div[@class='to_get' and normalize-space(span[not(@class='exclude')]/text())='']
selects all div elements with the class to_get that contain only empty span elements, excluding the span elements with the class exclude. For the input HTML, this returns only the first div.
Update: As noted in a comment, the above XPath only checks the first span. The following XPath
//div[@class='to_get'][not(span[not(@class='exclude') and not(normalize-space(text())='')])]
selects all div elements with the class to_get that only contain empty span elements excluding the ones having the class exclude. For the updated input HTML only the first div is returned.
You can try this way (formatted for readability) :
//div[
    @class='to_get'
    and
    not(
        span[not(@class='exclude') and normalize-space()]
    )
]
To compare with the other answer: not(normalize-space(text())='') only tests whether the first text node in the <span> is empty, while normalize-space() tests whether all text in the <span> is empty. Consider the following example, which the former expression selects even though it contains text, while the latter correctly skips it:
<div class="to_get">
<span> </span>
<span class="exclude"> text is ignored </span>
<span> <br/> there is text here, so don't select the parent div </span>
</div>
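The difference between the two predicates can be reproduced in a short Python sketch. It assumes the standard library's xml.etree.ElementTree, which cannot evaluate not() or normalize-space(), so both checks are written as plain Python functions that mirror the XPath logic. The id attributes are hypothetical additions used only to label the results.

```python
import xml.etree.ElementTree as ET

# Two of the divs discussed above; the id attributes are hypothetical labels.
DOC = """
<body>
  <div class="to_get" id="blank">
    <span> </span>
    <span class="exclude"> text is ignored </span>
    <span> </span>
  </div>
  <div class="to_get" id="text-after-br">
    <span> </span>
    <span class="exclude"> text is ignored </span>
    <span> <br/> there is text here </span>
  </div>
</body>
"""

root = ET.fromstring(DOC)

def first_text_blank(span):
    # Mirrors normalize-space(text())='': span.text is the text before the
    # first child element, which in this example is the first text node.
    return (span.text or "").strip() == ""

def all_text_blank(span):
    # Mirrors not(normalize-space()): the whole string value is examined.
    return "".join(span.itertext()).strip() == ""

def select(is_blank):
    # Keep a div only if every non-excluded span child counts as blank.
    return [
        div.get("id")
        for div in root.findall("div[@class='to_get']")
        if all(is_blank(s) for s in div.findall("span")
               if s.get("class") != "exclude")
    ]

print(select(first_text_blank))
print(select(all_text_blank))
```

The first call prints ['blank', 'text-after-br'], because the text after the br is never inspected; the second prints ['blank'] only.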

Fetching text with xpath in dynamic html structure

I have a lot of html and want to process it via xpath. There are two possible ways text can occur:
<div>
The Text
</div>
<!-- OR -->
<div>
<span>The Text</span>
</div>
<!-- BUT NOT -->
<div> other text
<span>The Text</span>
</div> other text
Is there a way I can fetch "The Text" with a single xpath expression?
edit:
concrete structure:
<div id="content">
<h1>...</h1>
<div>
...
</div>
<div>
<span>The Text</span>
</div>
I'm getting the content node via //div[@id='content'][1] and reuse it for other purposes. On this context node, I tried to execute ./div[2]/span/text() | ./div[not(span)][2]/text(). It works if there is no span, but returns blank/null if there is a span. I'm using the Java XPath implementation. The div is always the second one of the content node.
div/span/text() | div[not(span)]/text()
should do the trick. This selects text nodes that are children of the <span> (if there is a <span>), as well as text nodes that are children of the <div> if there is no <span>.
You'll have to modify the div parts to reflect the context from which you're evaluating the XPath expression. If you want to do this with all <div> elements in the document, then change div to //div.
Update:
Based on the new context information you posted, the above XPath should be modified to:
./div[2]/span/text() | ./div[2][not(span)]/text()
However, I don't see why your version is returning no text when there is a <span> element. Can you give more context: your Java code that's evaluating the XPath, or maybe a more detailed snippet of your input HTML? Is the sample input HTML really exactly representative of your actual input? Could there be another </div> in there that's going unnoticed?
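The intent of the union expression (take the span's text if a span exists, otherwise the div's own text) can be sketched in Python. This is only an illustration under assumptions: the question uses Java, whereas this uses the standard library's xml.etree.ElementTree, and the helper name second_div_text and both sample documents are hypothetical stand-ins for the elided parts of the real input.

```python
import xml.etree.ElementTree as ET

def second_div_text(content_xml):
    """Return the text of the second <div> child, with or without a <span>."""
    content = ET.fromstring(content_xml)
    div = content.find("div[2]")  # second <div> child of the content node
    span = div.find("span")
    # Prefer the <span>'s text; fall back to the <div>'s own text
    # when there is no <span>.
    source = span if span is not None else div
    return (source.text or "").strip()

# Hypothetical inputs covering the two allowed shapes.
with_span = ('<div id="content"><h1>t</h1><div>x</div>'
             '<div><span>The Text</span></div></div>')
without_span = ('<div id="content"><h1>t</h1><div>x</div>'
                '<div> The Text </div></div>')

print(second_div_text(with_span))
print(second_div_text(without_span))
```

Both calls print The Text.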

XPath searching multiple nested elements

I have the following html document
<div class="books">
<div class="book">
<div>
there are many deep nested elements here, somewhere there will be one span with some text e.g. 'mybooktext' within these
<div>
<div>
<div>
<span>mybooktext</span>
</div>
</div>
</div>
</div>
<div>
there are also many nested elements here, somewhere there will be a link with a class called 'mylinkclass' within these. (this is the element i want to find)
<div>
<div>
<a class="mylinkclass">Bla</a>
</div>
</div>
</div>
</div>
<div class="book">
<div>
there are many deep nested elements here, somewhere there will be one span with some text e.g. 'mybooktext' within these
<div>
<span>mybooktext</span>
</div>
<div>
</div>
<div>
there are also many nested elements here, somewhere there will be a link with a class called 'mylinkclass' within these. (this is the element i want to find)
<div>
<a class="mylinkclass">Bla</a>
</div>
</div>
</div>
<div class="book">
same as above
</div>
</div>
I want to find the link element (link has class called 'mylinkclass') within the book element, this will be based on the text of the span within the same book element.
So it would be something like:
-Find span with text 'mybooktext'
-Navigate up Book div
-Find link with class 'mylinkclass' within book div
This should be done using one xpath statement
In my view, this is what you are looking for:
//span[contains(text(),'mybooktext')]
/ancestor::div[@class='book']
//a[@class='mylinkclass']
//span[contains(text(),'mybooktext')] Find the span containing "mybooktext"
/ancestor::div[@class='book'] Navigate up to the book div (at any depth)
//a[@class='mylinkclass'] Find the link with class 'mylinkclass' within the book div (at any depth)
Update:
Change the first condition to
//span[text()='mybooktext'] if mybooktext is the only text in the span.
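The three steps (find the span, go up to the book div, find the link inside it) can be sketched in Python as well. It assumes the standard library's xml.etree.ElementTree, which has no ancestor axis, so the sketch walks the other way: it visits each book div and keeps it when one of its spans carries the wanted text. The helper name links_for, the href values, and the text "otherbooktext" are hypothetical, added only so the two books can be told apart.

```python
import xml.etree.ElementTree as ET

# Condensed version of the question's structure; href values and the
# second span text are hypothetical.
BOOKS = """
<div class="books">
  <div class="book">
    <div><div><div><span>mybooktext</span></div></div></div>
    <div><div><a class="mylinkclass" href="#first">Bla</a></div></div>
  </div>
  <div class="book">
    <div><span>otherbooktext</span></div>
    <div><a class="mylinkclass" href="#second">Bla</a></div>
  </div>
</div>
"""

root = ET.fromstring(BOOKS)

def links_for(book_text):
    """Return the 'mylinkclass' links of every book containing the span text."""
    links = []
    for book in root.findall("div[@class='book']"):
        # Does any <span> at any depth inside this book carry the wanted text?
        if any((s.text or "").strip() == book_text for s in book.iter("span")):
            # Links at any depth inside the same book div.
            links.extend(book.findall(".//a[@class='mylinkclass']"))
    return links

print([a.get("href") for a in links_for("mybooktext")])
```

With the sample markup this prints ['#first'], the link that belongs to the same book as the matching span.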