xpath select parent elem w/ blank text() after excluding certain children - html

I am trying to select all div.to_get whose children have no text content, excluding certain elements
html:
<body>
<div class="to_get">
<span> </span>
<span class="exclude"> text is ignored </span>
<span> </span>
</div>
<div class="to_get">
<span> there is text here, so don't select the parent div </span>
<span class="exclude"> text is ignored </span>
<span> </span>
</div>
<div class="to_get">
<span> </span>
<span class="exclude"> text is ignored </span>
<span> there is text here, so don't select the parent div </span>
</div>
</body>
xpath attempt:
//*/body/div[@class='to_get']/descendant::text()[not(ancestor::span/@class='exclude')][normalize-space(.)='']/ancestor::div[@class='to_get']
The problem is that this still returns the 2nd and 3rd div.to_get, because of their blank 3rd and 1st span children respectively. But those divs should be excluded because of the text in their 1st and 3rd span children.
The xpath should only select the 1st div.to_get.

The following XPath
//div[@class='to_get' and normalize-space(span[not(@class='exclude')]/text())='']
selects all div elements with the class to_get that contain only empty span elements, ignoring the span elements with the class exclude. For the input HTML, this returns only the first div.
Update: As noted in a comment, the XPath above only checks the first span. The following XPath
//div[@class='to_get'][not(span[not(@class='exclude') and not(normalize-space(text())='')])]
selects all div elements with the class to_get that contain only empty span elements, ignoring the ones with the class exclude. For the updated input HTML, only the first div is returned.

You can try this way (formatted for readability) :
//div[
    @class='to_get'
      and
    not(
      span[not(@class='exclude') and normalize-space()]
    )
]
To compare with the other answer: not(normalize-space(text())='') only tests whether the first text node in the <span> is empty, while normalize-space() tests whether all text nodes in the <span> are empty. Consider the following example, which passes the former check but not the latter:
<div class="to_get">
<span> </span>
<span class="exclude"> text is ignored </span>
<span> <br/> there is text here, so don't select the parent div </span>
</div>
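The difference between the two checks can be reproduced outside XPath. The following is a minimal sketch in Python using only the standard library; because `xml.etree.ElementTree` supports just a subset of XPath (no `not()` or `normalize-space()`), the predicate logic is written in Python instead. The function name `blank_divs` and the `first_text_only` flag are hypothetical, introduced here for illustration.

```python
import xml.etree.ElementTree as ET

HTML = """<body>
<div class="to_get">
<span> </span>
<span class="exclude"> text is ignored </span>
<span> </span>
</div>
<div class="to_get">
<span> there is text here </span>
<span class="exclude"> text is ignored </span>
<span> </span>
</div>
<div class="to_get">
<span> </span>
<span class="exclude"> text is ignored </span>
<span> <br/> there is text here </span>
</div>
</body>"""

XML_WS = " \t\r\n"  # the whitespace characters normalize-space() trims


def blank_divs(root, first_text_only=False):
    """Return div.to_get elements whose non-excluded span children are blank.

    first_text_only=True mimics normalize-space(text())='' (first text node only);
    first_text_only=False mimics normalize-space()='' (whole string value).
    """
    selected = []
    for div in root.findall(".//div[@class='to_get']"):
        spans = [s for s in div.findall("span") if s.get("class") != "exclude"]
        if first_text_only:
            texts = [(s.text or "") for s in spans]
        else:
            texts = ["".join(s.itertext()) for s in spans]
        if all(t.strip(XML_WS) == "" for t in texts):
            selected.append(div)
    return selected


root = ET.fromstring(HTML)
divs = list(root.iter("div"))
full_check = blank_divs(root)
first_node_check = blank_divs(root, first_text_only=True)
```

With this sample, `full_check` contains only the first div, while `first_node_check` also lets the third div through, because the text after `<br/>` is not part of the span's first text node.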

Related

XPath for parent's sibling descendants

I have the following HTML I need to scrape, but the only reliable handle is a stable description of a text field. From there, I need to go to its parent, find that parent's next sibling and then get the descendants (unfortunately the data-automation-id selector repeats in every such iteration of this snippet on the site). I put together the XPath below, but my RPA tool is unable to find it in the document.
XPath
div[contains(text(),'STABLE TEXT HANDLE')]/following-sibling::div/div/div/span[data-automation-id="SOMETHING"]
HTML:
<ul>
<li>
<div>
<label>STABLE TEXT HANDLE</label>
</div>
<div>
<div>
<div>
<span></span>
<span data-automation-id="something">
<div>
<div>
<div>
DYNAMIC TEXT I WANT TO SCRAPE
</div>
</div>
</div>
</span>
<span data-automation-id="somethingelse">
<div>
<div>
<div>
DYNAMIC TEXT I WANT TO SCRAPE
</div>
</div>
</div>
</span>
</div>
</div>
</div>
</li>
</ul>
EDIT:
After further testing, it seems the issue starts with contains(text(),'STABLE TEXT HANDLE'), which fails to find that particular node (be it the label or its parent div).
Please try this:
//label[contains(text(),'STABLE TEXT HANDLE')]/../..//span[@data-automation-id="something"]
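
The same navigation (label, up to the enclosing li, then down to the span) can be tried out in Python. This is only a sketch using the standard library's `xml.etree.ElementTree`, which supports `..` and simple predicates but not `contains()`, so it matches the label text exactly rather than by substring. The trimmed `SNIPPET` is a shortened stand-in for the HTML in the question.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the question's HTML (assumption: well-formed markup).
SNIPPET = """<ul><li>
<div><label>STABLE TEXT HANDLE</label></div>
<div><div><div>
<span></span>
<span data-automation-id="something"><div><div><div>
DYNAMIC TEXT I WANT TO SCRAPE
</div></div></div></span>
</div></div></div>
</li></ul>"""

root = ET.fromstring(SNIPPET)

# Find the div whose <label> child has exactly this text, then step up to its parent <li>.
li = root.find(".//div[label='STABLE TEXT HANDLE']/..")

# From the <li>, search all descendants for the span with the wanted attribute.
span = li.find(".//span[@data-automation-id='something']")

# Join every text node under the span and trim the surrounding whitespace.
scraped = "".join(span.itertext()).strip()
```

Note that the attribute value is case-sensitive: the HTML uses "something", so a test against "SOMETHING" would not match.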

Select optional nodes with XPath

I have an HTML fragment:
<td>
<span class="x-cell">something</span>
<span class="y-cell">something</span>
<span class="z-cell">something</span>
A text
<span class="foo"/>
Another text
<span class="bar"/>
Also text
</td>
I am trying to select all nodes following the <span class="z-cell"/> to move them into another node. But all the nodes within td are optional: I can have zero to three <span class="*-cell"/>, the text is optional, and there may or may not be further <span> nodes at the beginning, middle or end of the text.
In short, I have to move all nodes except the <span class="*-cell"/> into another node. I tried this XPath to select the nodes:
td/span[contains(@class,"-cell")][last()]/following-sibling::*
but it doesn't work if there aren't any <span class="*-cell"/> nodes. How could I solve that?
Have your XPath expression exclude all elements you do not want:
td/(*[not(contains(@class,"-cell"))]|text())
If you only want to copy elements without the intervening text, this simplifies to
td/*[not(contains(@class,"-cell"))]
Live Demo on XPathTester
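
As a rough illustration of the "exclude what you do not want" idea, here is a sketch in Python with the standard library's `xml.etree.ElementTree`. ElementTree has no text-node axis, so text is read from each element's `.text` and `.tail` instead of `text()`; the helper name `nodes_to_move` is hypothetical.

```python
import xml.etree.ElementTree as ET

td = ET.fromstring(
    '<td><span class="x-cell">something</span>'
    '<span class="z-cell">something</span>'
    'A text<span class="foo"/>Another text<span class="bar"/>Also text</td>'
)


def nodes_to_move(td):
    """Collect, in document order, every child of td except the *-cell spans.

    Elements are returned as Element objects, text runs as stripped strings.
    """
    kept = []
    if td.text and td.text.strip():
        kept.append(td.text.strip())
    for child in td:
        if "-cell" not in child.get("class", ""):
            kept.append(child)
        # Text that follows a child belongs to td, even after a *-cell span.
        if child.tail and child.tail.strip():
            kept.append(child.tail.strip())
    return kept


moved = nodes_to_move(td)
summary = [n if isinstance(n, str) else n.get("class") for n in moved]

# The same helper also copes with a td that has no *-cell spans at all.
no_cells = ET.fromstring('<td>Only text<span class="foo"/></td>')
summary_no_cells = [
    n if isinstance(n, str) else n.get("class") for n in nodes_to_move(no_cells)
]
```

Because nothing here depends on locating the last *-cell span, the result is the same whether zero or three such spans are present.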

How do I find SPAN tag containing DIV tag with notepad++ regex for W3C Validation?

I'm trying to fix my HTML views for W3C validation. One error is that I had a few div or other structural tags inside a span tag. Here's a fake example made from my HTML code:
<div style="margin-left:10px;">
<h2>Sub Title</h2>
<span><span class="bold_text">Phones : </span> 000-000-000000 / 000-000-000000 </span>
<br/>
<span><span class="bold_text">Email : </span>
<ul>
<li>For Support use <a href="mailto:support@email.com" >support@email.com</a></li>
<li>For CopyRights use <a href="mailto:copyright@email.com" >copyright@email.com</a></li>
<li>For Technical issue use <a href="mailto:staff@email.com" >staff@email.com</a></li>
</ul>
</span>
<span>
<span class="bold_text">Location : </span>
<div class="address_container">#0, City, Region, Country</div>
</span>
<div class="map_container" style="margin-top:10px;display:inline-block;width:90%;height:400px;" >
@yield('map_member')
</div>
I'm playing with regex101 and so far I got this :
<span[^>]*>[.\s\S]*<div[\s\S]*<\/div>[\s\S]*<\/span> /gm
It must match newlines and spaces. But this starts at the 1st span and finishes on the last closing span tag. I want it to match only:
<span>
<span class="bold_text">Location : </span>
<div class="address_container">#0, City, Region, Country</div>
</span>
How can I replace those DIVs within the SPAN, given that there can also be a SPAN within the SPAN?
One can also assume that if it ended with SPAN that it also started with SPAN.
So this regex just uses a positive lookahead to check if the DIV is followed by 0 or more enclosed DIV or SPAN, then closed with SPAN.
\s*<div[^<>]*>[^<>]*</div>(?=(?:\s*<(div|span)[^<>]*>[^<>]*</\1>)*[^<>]*</span>)
Replace with nothing and it'll be spick-and-span.
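
To see what the lookahead does before running it in Notepad++, the pattern can be tried in Python's `re` module. This is a sketch under the assumption that the two engines treat this particular pattern the same way; the sample string is a shortened version of the question's HTML.

```python
import re

# The regex from the answer, unchanged.
PATTERN = re.compile(
    r"\s*<div[^<>]*>[^<>]*</div>"
    r"(?=(?:\s*<(div|span)[^<>]*>[^<>]*</\1>)*[^<>]*</span>)"
)

html = (
    "<span>\n"
    '<span class="bold_text">Location : </span>\n'
    '<div class="address_container">City, Region, Country</div>\n'
    "</span>\n"
    '<div class="map_container">map</div>'
)

# "Replace with nothing": every div that is still followed by a closing </span> is dropped.
cleaned = PATTERN.sub("", html)
```

Only the div inside the span is matched; the map_container div is left alone because no closing span tag follows it. Note that replacing with nothing also discards the div's text, so the pattern is equally useful for just locating the offending tags.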

html and combining span ID's into one span ID

I'm working on an eBook which requires me to create an overlay. All is working fine except in some cases I have a drop cap combined with the rest of the word which need to be highlighted at the same time.
The code below is my current problem. I need to have the two span IDs combined into one without destroying the HTML.
Any ideas?
<p class="ParaOverride-1"><span id="_idTextSpan017" class="DropCap-color CharOverride-6" style="position:absolute;top:-109.78px;left:26.39px;">W</span><span id="_idTextSpan018" class="PageText-v1 CharOverride-7" style="position:absolute;top:0px;left:1626.19px;letter-spacing:-2.6px;">hat </span>
You need a nested <span>:
<span id="myID">
<span id="x">
</span>
<span id="y">
</span>
</span>

get sibling element text only when its parallel element meets a condition with xpath1.0

The goal is to get the code of the user named Nick whose title is Mr, using XPath 1.0.
<span class="user">
<span class="master">
<span class="user-title" title="Mr">
<span class="name">Nick</span>
</span>
<span class="user-info">
<span class="code">A</span>
</span>
</span>
</span>
<span class="user">
<span class="master">
<span class="user-title" title="Mr">
<span class="name">Bob</span>
</span>
<span class="user-info">
<span class="code">B</span>
</span>
</span>
</span>
I would divide it into several steps to understand how xpath works in this case.
//span[contains(., 'Nick')] can get that node, but how do I get the person's code info, which is in the next node?
You could do something like:
//span
  [@class='user']
  [.//span[@class='name']='Nick']
  //span[@class='code']
  /text()
Basically this says:
Find the user span that contains the name span with text Nick
Within that user span, find the code span
For the code span, return the text
Alternatively, you could directly navigate to the sibling element. However, it is not as readable:
//span[.='Nick']/../following-sibling::*[1]/span/text()
This says to find the span with text Nick. From there, go to the parent (the user-title span). Then go to the next sibling (the user-info span). Then get the span in there, which is the code span.
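
The first approach (filter the user span, then descend to the code span) can be sketched in Python with the standard library. `xml.etree.ElementTree` does not support nested predicates such as `[.//span[@class='name']='Nick']`, so the filtering is done in a loop. The `<root>` wrapper and the helper name `code_for` are assumptions added for the example, and the title check is an addition: the XPath expressions above match on the name only.

```python
import xml.etree.ElementTree as ET

# The two user blocks from the question, wrapped in a hypothetical <root> element.
DOC = """<root>
<span class="user"><span class="master">
<span class="user-title" title="Mr"><span class="name">Nick</span></span>
<span class="user-info"><span class="code">A</span></span>
</span></span>
<span class="user"><span class="master">
<span class="user-title" title="Mr"><span class="name">Bob</span></span>
<span class="user-info"><span class="code">B</span></span>
</span></span>
</root>"""


def code_for(root, name, title):
    """Return the code text of the first user with this name and title, else None."""
    for user in root.findall(".//span[@class='user']"):
        title_span = user.find(".//span[@class='user-title']")
        name_span = user.find(".//span[@class='name']")
        if title_span is None or name_span is None:
            continue
        if title_span.get("title") == title and name_span.text == name:
            # Within the matching user span, find the code span and return its text.
            code = user.find(".//span[@class='code']")
            return code.text if code is not None else None
    return None


root = ET.fromstring(DOC)
nick_code = code_for(root, "Nick", "Mr")
```

Scoping every lookup to the matching user span is what keeps Bob's code from being returned for Nick.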