XPath searching multiple nested elements - html

I have the following html document
<div class="books">
<div class="book">
<div>
there are many deep nested elements here, somewhere there will be one span with some text e.g. 'mybooktext' within these
<div>
<div>
<div>
<span>mybooktext</span>
</div>
</div>
</div>
</div>
<div>
there are also many nested elements here, somewhere there will be a link with a class called 'mylinkclass' within these. (this is the element i want to find)
<div>
<div>
<a class="mylinkclass">Bla</a>
</div>
</div>
</div>
</div>
<div class="book">
<div>
there are many deep nested elements here, somewhere there will be one span with some text e.g. 'mybooktext' within these
<div>
<span>mybooktext</span>
</div>
<div>
</div>
<div>
there are also many nested elements here, somewhere there will be a link with a class called 'mylinkclass' within these. (this is the element i want to find)
<div>
<a class="mylinkclass">Bla</a>
</div>
</div>
</div>
<div class="book">
same as above
</div>
</div>
I want to find the link element (the one with class 'mylinkclass') within a book element, based on the text of the span within the same book element.
So it would be something like:
- Find the span with text 'mybooktext'
- Navigate up to the book div
- Find the link with class 'mylinkclass' within that book div
This should be done using one XPath statement.

In my view, this is what you are looking for:
//span[contains(text(),'mybooktext')]
/ancestor::div[@class='book']
//a[@class='mylinkclass']
//span[contains(text(),'mybooktext')] finds the span containing "mybooktext"
/ancestor::div[@class='book'] navigates up to the book div (at any depth)
//a[@class='mylinkclass'] finds the link with class 'mylinkclass' within the book div (at any depth)
Update:
Change the first condition to
//span[text()='mybooktext'] if mybooktext is the only text in the span.
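For anyone who wants to try the three steps without an XPath 1.0 engine, here is a minimal sketch using only Python's standard-library ElementTree. ElementTree supports just a subset of XPath (no ancestor axis, no contains()), so instead of navigating up from the span it filters the book divs by whether they contain the span. The trimmed markup, the href values, and the helper name find_book_links are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed version of the question's markup. The href values and the
# second book's span text are hypothetical, added to tell the books apart.
HTML = """
<div class="books">
  <div class="book">
    <div><div><div><span>mybooktext</span></div></div></div>
    <div><div><a class="mylinkclass" href="#first">Bla</a></div></div>
  </div>
  <div class="book">
    <div><span>otherbooktext</span></div>
    <div><a class="mylinkclass" href="#second">Bla</a></div>
  </div>
</div>
"""


def find_book_links(root, span_text):
    """Return 'mylinkclass' links from every book holding a span with span_text."""
    links = []
    for book in root.iterfind(".//div[@class='book']"):
        # [.='...'] is an exact match on the element's text (Python 3.7+).
        # span_text must not contain a single quote in this simple sketch.
        if book.find(f".//span[.='{span_text}']") is not None:
            links.extend(book.iterfind(".//a[@class='mylinkclass']"))
    return links


root = ET.fromstring(HTML)
links = find_book_links(root, "mybooktext")
print([a.get("href") for a in links])
```

With a full XPath 1.0 engine the single expression from the answer does the same job in one call.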

Related

XPath for parent's sibling descendants

I have the following HTML I need to scrape, but the only reliable handle is a stable description of a text field. From there, I need to go to its parent, find that parent's next sibling, and then get the descendants (unfortunately, the data-automation-id selector repeats in every iteration of this snippet on the site). I put together the XPath below, but my RPA tool is unable to find it in the document.
XPath
div[contains(text(),'STABLE TEXT HANDLE')]/following-sibling::div/div/div/span[data-automation-id="SOMETHING"]
HTML:
<ul>
<li>
<div>
<label>STABLE TEXT HANDLE</label>
</div>
<div>
<div>
<div>
<span></span>
<span data-automation-id="something">
<div>
<div>
<div>
DYNAMIC TEXT I WANT TO SCRAPE
</div>
</div>
</div>
</span>
<span data-automation-id="somethingelse">
<div>
<div>
<div>
DYNAMIC TEXT I WANT TO SCRAPE
</div>
</div>
</div>
</span>
</div>
</div>
</div>
</li>
</ul>
EDIT:
After further testing, it seems the issue starts with contains(text(),'STABLE TEXT HANDLE'), which fails to find that particular node (be it the label or its parent div).
Please try this:
//label[contains(text(),'STABLE TEXT HANDLE')]/../..//span[@data-automation-id="something"]
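A quick way to sanity-check this kind of path outside the RPA tool is Python's standard-library ElementTree. The sketch below is an approximation: ElementTree has no contains(), so it uses an exact text match [.='...'] in place of contains(text(), ...). The markup is a trimmed copy of the question's HTML, and "VALUE A" / "VALUE B" are hypothetical stand-ins for the dynamic text.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's markup; "VALUE A" / "VALUE B" are
# hypothetical placeholders for the dynamic text.
HTML = """
<ul>
  <li>
    <div><label>STABLE TEXT HANDLE</label></div>
    <div><div><div>
      <span></span>
      <span data-automation-id="something"><div><div><div>VALUE A</div></div></div></span>
      <span data-automation-id="somethingelse"><div><div><div>VALUE B</div></div></div></span>
    </div></div></div>
  </li>
</ul>
"""

root = ET.fromstring(HTML)

# label -> parent div -> li -> any descendant span with the wanted attribute.
# [.='...'] is an exact match on the element's full text (Python 3.7+).
path = (".//label[.='STABLE TEXT HANDLE']/../.."
        "//span[@data-automation-id='something']")
span = root.find(path)

# The wanted text sits a few divs below the span, so join all descendant text.
value = "".join(span.itertext()).strip()
print(value)
```

Going up two levels and then searching descendants avoids having to spell out the exact div/div/div chain, which is the part most likely to change.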

How to find an element with XPath in an arbitrary position in a set of nesting divs

What XPath expression will allow me to find an element in arbitrary position in a set of nesting <div>s and only those elements?
For example, how to find all the <a> elements except the last one in this HTML fragment:
<div id="0">
<a href="first.com"/>
<div id="1"></div>
<div id="2">
<div id="2.1">
<div id="2.11">
<a href="second.com" />
</div>
</div>
</div>
<div id="3"><a href="third.com" /></div>
</div>
<a href="dont_find_this_one.com" />
This XPath,
//a
will select all a elements in the document.
Update per requirements clarification comment:
This XPath,
//div[@id="0"]//a
will select all a elements under all id="0" div elements in the document.
Another way of writing it could be:
//a[ancestor::div[@id="0"]]
Select all anchor elements with a specific common ancestor (div with a specific attribute).
Other options, but riskier:
//a[parent::div]
Select all anchor elements with a div element as a parent.
(//a)[not(position()=last())]
Select all anchor elements except the last one present on the page.
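To compare these options on the question's fragment, here is a small sketch with Python's standard-library ElementTree. Two assumptions: the fragment is wrapped in a body element, because an XML parser needs a single root, and the "all but the last" variant is emulated with list slicing, because ElementTree's XPath subset has no position()/last() over a parenthesised node set.

```python
import xml.etree.ElementTree as ET

# The question's fragment, wrapped in <body> (an assumption) so that
# it has a single root element.
FRAGMENT = """
<body>
  <div id="0">
    <a href="first.com"/>
    <div id="1"></div>
    <div id="2"><div id="2.1"><div id="2.11"><a href="second.com"/></div></div></div>
    <div id="3"><a href="third.com"/></div>
  </div>
  <a href="dont_find_this_one.com"/>
</body>
"""

root = ET.fromstring(FRAGMENT)

# Equivalent of //a : every link in the document, in document order.
all_links = [a.get("href") for a in root.iterfind(".//a")]

# Equivalent of //div[@id="0"]//a : only links nested inside div id="0".
nested_links = [a.get("href") for a in root.iterfind(".//div[@id='0']//a")]

# Emulates (//a)[not(position()=last())] : every link except the last one.
all_but_last = all_links[:-1]

print(all_links)
print(nested_links)
print(all_but_last)
```

On this fragment the last two lists happen to coincide, but only the id-based path keeps working if more links are added after the outer div.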

XPath query: How to find all the <div> that have 2 <a> as the first 2 elements?

For instance, to find all <div>, using $xpath->query(); where:
<div>
<a href="https://www.somesite.com/" id="" src="" alt="" /></a>
... more elements of various kinds ...
</div>
... more elements of various kinds ...
Any help would be greatly appreciated.
This XPath,
//div[*[1][self::a]][*[2][self::a]]
will select all div elements which have a elements in the first and second child positions.
So, for example, for this XML,
<div>
<div id="d1"></div>
<div id="d2"><a/></div>
<div id="d3"><div/><a/><a/></div>
<div id="d4"><a/><a/></div>
<div id="d5"><a/><a/><a/></div>
</div>
only these div elements,
<div id="d4"><a/><a/></div>
<div id="d5"><a/><a/><a/></div>
will be selected, as requested.
Another way to write it (more expensive to evaluate, though):
//a[count(preceding-sibling::*)=1 and preceding-sibling::*[1][self::a]][parent::div]/..
Look for an a element child of a div, with 1 preceding-sibling which is an anchor. Then get the parent.
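The positional test can also be checked outside XPath. The sketch below uses Python's standard-library ElementTree, whose XPath subset does not support *[1][self::a], so the "first two children are a elements" condition is written as plain Python over each div's child list. The helper name starts_with_two_links is made up for illustration.

```python
import xml.etree.ElementTree as ET

# The sample XML from the answer above.
XML = """
<div>
  <div id="d1"></div>
  <div id="d2"><a/></div>
  <div id="d3"><div/><a/><a/></div>
  <div id="d4"><a/><a/></div>
  <div id="d5"><a/><a/><a/></div>
</div>
"""


def starts_with_two_links(div):
    """True if the first and second child elements of div are both <a>."""
    children = list(div)  # child elements only; text nodes are not included
    return len(children) >= 2 and children[0].tag == "a" and children[1].tag == "a"


root = ET.fromstring(XML)
matches = [d.get("id") for d in root.iter("div") if starts_with_two_links(d)]
print(matches)
```

Like *[1] in the XPath, the child list counts elements only, so whitespace or text between the tags does not shift the positions.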

Unsure of correct BEM style syntax

Let's say I have a product within a collection. Is it appropriate to call the product "feature-collection__product" so it's still an element within the block of "feature-collection", or call it "feature-collection-product" so it becomes its own block, as it has other elements within it, or something different?
<div class="feature-collection">
<div class="feature-collection__product">
<h2 class="feature-collection__product-title"></h2>
<h2 class="feature-collection__product-price"></h2>
</div>
</div>
OR
<div class="feature-collection">
<div class="feature-collection-product">
<h2 class="feature-collection-product__title"></h2>
<h2 class="feature-collection-product__price"></h2>
</div>
</div>
Most likely the correct answer is both:
<div class="feature-collection">
<div class="feature-collection__product product">
<h2 class="product__title"></h2>
<h2 class="product__price"></h2>
</div>
</div>
Having different entities on the same DOM node is called a mix. In this case it's reasonable to have an independent block product that is also an element of feature-collection, so that you can set some styling for the product inside feature-collection.
For more info about mixes please take a look at https://en.bem.info/methodology/key-concepts/#mix and https://en.bem.info/methodology/faq/#mixes

Fetching text with xpath in dynamic html structure

I have a lot of html and want to process it via xpath. There are two possible ways text can occur:
<div>
The Text
</div>
<!-- OR -->
<div>
<span>The Text</span>
</div>
<!-- BUT NOT -->
<div> other text
<span>The Text</span>
</div> other text
Is there a way I can fetch "The Text" with a single xpath expression?
edit:
concrete structure:
<div id="content">
<h1>...</h1>
<div>
...
</div>
<div>
<span>The Text</span>
</div>
I'm getting the content node via //div[@id='content'][1] and reusing it for other purposes. On this context node, I tried to execute ./div[2]/span/text() | ./div[not(span)][2]/text(). It works if there is no span, but returns blank/null if there is a span. I'm using the Java XPath implementation. The div is always the second one in the content node.
div/span/text() | div[not(span)]/text()
should do the trick. This selects text nodes that are children of the <span> (if there is a <span>), as well as text nodes that are children of the <div> if there is no <span>.
You'll have to modify the div parts to reflect the context from which you're evaluating the XPath expression. If you want to do this with all <div> elements in the document, then change div to //div.
Update:
Based on the new context information you posted, the above XPath should be modified to:
./div[2]/span/text() | ./div[2][not(span)]/text()
However, I don't see why your version is returning no text when there is a <span> element. Can you give more context -- your Java code that's evaluating the XPath; maybe a more detailed snippet of your input HTML? Is the sample input HTML really exactly representative of your actual input? Could there be another </div> in there that's going unnoticed?
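As a cross-check independent of the Java setup, the same "span text if present, otherwise the div's own text" logic can be sketched with Python's standard-library ElementTree. ElementTree's XPath subset has neither text() nor the | union operator, so the fallback is written as a plain if/else. The helper name second_div_text and the two closed-off sample documents are assumptions based on the structure in the question.

```python
import xml.etree.ElementTree as ET


def second_div_text(content):
    """Return the text of the second child div, taken from its span if it has one."""
    second = content.find("div[2]")  # second <div> child of the context node
    if second is None:
        return None
    span = second.find("span")
    # Mirrors ./div[2]/span/text() | ./div[2][not(span)]/text()
    text = span.text if span is not None else second.text
    return text.strip() if text else None


# Two variants of the question's structure, closed off so that they parse.
with_span = ET.fromstring(
    '<div id="content"><h1>...</h1><div>...</div>'
    '<div><span>The Text</span></div></div>')
without_span = ET.fromstring(
    '<div id="content"><h1>...</h1><div>...</div>'
    '<div>\nThe Text\n</div></div>')

print(second_div_text(with_span))
print(second_div_text(without_span))
```

If both variants give the expected text here but not in your Java code, that points toward the input HTML or the way the context node is obtained rather than the expression itself.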