XPath: select div with an anchor descendant whose depth is unknown

Sample html:
<div>
<div class="foo bar baz"> <!-- target 1 -->
<div>
<span>
hello world
</span>
</div>
</div>
<div class="foo bar">foo</div>
<div class="bar"> <!-- target 2 -->
<div>
<div>
<span>
hello world
</span>
</div>
</div>
</div>
</div>
I want to select the divs that 1) have the class name bar and 2) have an <a> descendant whose href contains hello.
My problem is that the <a> tag can be nested at different levels. How do I handle this correctly?

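You can use the relative descendant-or-self axis (.//) to match an <a> element at any depth below the div:
//div[contains(concat(' ', @class, ' '), ' bar ')][.//a[contains(@href,'hello')]]
Padding @class with spaces before the contains() test makes bar match only as a whole class token, so a class such as toolbar is not picked up.
Related discussion: How can I find an element by CSS class with XPath?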
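To check the expression quickly, here is a minimal sketch using lxml. The markup is a hypothetical variant of the sample above: the <a> elements, their href values, and the extra toolbar div are assumptions added for illustration.

```python
from lxml import etree

# Hypothetical markup: the <a> elements and href values are assumptions,
# added so that the second predicate has something to match.
sample = """
<div>
  <div class="foo bar baz"><div><a href="/hello/1">hello world</a></div></div>
  <div class="foo bar">foo</div>
  <div class="bar"><div><div><a href="/say/hello">hello world</a></div></div></div>
  <div class="toolbar"><a href="/hello/2">not a match</a></div>
</div>
"""

root = etree.fromstring(sample)

# First predicate: "bar" as a whole class token.
# Second predicate: an <a> at any depth whose href contains "hello".
expr = (
    "//div[contains(concat(' ', @class, ' '), ' bar ')]"
    "[.//a[contains(@href,'hello')]]"
)

matches = root.xpath(expr)
classes = [div.get("class") for div in matches]
print(classes)
```

The div with class "foo bar" is skipped because it has no matching <a>, and the toolbar div is skipped because bar is not one of its class tokens.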

Related

Traversing the DOM with querySelector

I'm using the statement document.querySelector("[data-testid='people-menu'] div:nth-child(4)") in the console to give me the below HTML snippet:
<div>
<span class="jss1">
<div class="jss2">
<p class="jss3">Owner</p>
</div>
</span>
<div class="jss4">
<div class="5" title="User Title">
<p class="jss6">UT</p>
</div>
<div class="jss7">
<p class="jss82">User Title</p>
<span class="jss9">Project Manager</span>
</div>
</div>
</div>
I'd like to extend the statement in the console to extract the title "User Title" but can't figure out what combination of nth-child or nextSibling (or something else) to use. The closest I've gotten is:
document.querySelector("[data-testid='people-menu'] div:nth-child(4) span:nth-child(1)")
which gives me the span with class jss1.
I expected document.querySelector("[data-testid='people-menu'] div:nth-child(4) span:nth-child(1).nextSibling") to give me the div with class jss4, but it returns null.
I can't use class selectors because those are generated dynamically at build.
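Why not just add [title] to your querySelector?
document.querySelector("[data-testid='people-menu'] div:nth-child(4) [title]")
From that element you can then read the title attribute, for example with .getAttribute("title"). This assumes title is a unique attribute within this section of the HTML. Note that nextSibling is a DOM property, not selector syntax: inside a selector string, .nextSibling is parsed as a class selector, which is why your last attempt returns null.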

Xpath node-set nesting order selection

Is there an XPath 1.0 expression that I could use, starting from the div[@id='rootTag'] context, to select the nested span descendants based on how deeply they are nested?
For example, could you use something like span[2] to select the second most deeply nested span tag rather than the second span child of the same parent element?
<div id='rootTag'>
<span>Test</span>
<div>
<span>Test</span>
<span>Test</span>
</div>
</div>
<span>Test</span>
</div>
<div>
<div>
<div>
<div>
<span>Test</span>
</div>
<span>Test</span>
</div>
</div>
</div>
</div>
It's a bit (a lot...) of a hack, but it can be done this way:
Assume your html is like this:
levels = """<div id='rootTag'>
<span>Level2</span>
<div>
<span>Level3</span>
<div>
<span>Level4</span>
</div>
</div>
<div>
<span>Level3</span>
</div>
<div>
<div>
<div>
<div>
<span>Level6</span>
</div>
<span>Level5</span>
</div>
</div>
</div>
</div>"""
We then do this:
#First collect the data:
from lxml import etree #you have to make sure your html is well-formed, or it won't work
root = etree.fromstring(levels)
tree = etree.ElementTree(root)
#collect the paths of all <span> elements
paths = [tree.getpath(e) for e in root.iter('span')]
#determine the nesting level of each <span> element
nests = [e.count('/') for e in paths] #or, alternatively:
#nests = [tree.getpath(e).count('/') for e in root.iter('span')]
From here, we use the position of a nesting level in the nests list to look up the corresponding path in the paths list. For example, to get the <span> element with the deepest nesting level:
deepest = nests.index(max(nests))
print(paths[deepest],root.xpath(paths[deepest])[0].text)
Output:
/div/div[3]/div/div/div/span Level6
Or to extract the <span> element with a level 4 nesting:
print(paths[nests.index(4)],root.xpath(paths[nests.index(4)])[0].text)
Output:
/div/div[1]/div/span Level4
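As an alternative to counting slashes in the generated paths, the nesting depth can be read directly with the XPath 1.0 function count(ancestor::*). The following is a minimal sketch under the same assumption of well-formed markup. The trimmed-down levels string and the variable names are illustrative, and the ancestor count is one less than the "Level" number used above because the span itself is not counted.

```python
from lxml import etree

# A trimmed-down version of the markup above (illustrative only).
levels = """<div id='rootTag'>
  <span>Level2</span>
  <div><span>Level3</span><div><span>Level4</span></div></div>
  <div><div><div><div><span>Level6</span></div><span>Level5</span></div></div></div>
</div>"""

root = etree.fromstring(levels)
spans = root.xpath("//div[@id='rootTag']//span")


def depth(span):
    # count() returns a float in lxml, so convert it to int
    return int(span.xpath("count(ancestor::*)"))


# Deepest first; by_depth[1] is then the second most deeply nested span.
by_depth = sorted(spans, key=depth, reverse=True)
ordered = [(s.text, depth(s)) for s in by_depth]
print(ordered)

# A fixed depth can also be selected in a single XPath 1.0 expression.
level4 = root.xpath("//span[count(ancestor::*) = 3]")
print([s.text for s in level4])
```

Sorting still happens in Python here, because XPath 1.0 has no built-in way to order nodes by a computed value.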

How to merge/collapse child DOM node into parent with BeautifulSoup / lxml?

I am writing some HTML pre-processing scripts that are cleaning/tagging HTML from a web crawler, for use in a semantic/link analysis step that follows. I have filtered out undesired tags from the HTML, and simplified it to contain only visible text and <div> / <a> elements.
I now am trying to write a "collapseDOM()" function to walk through the DOM tree and perform the following actions:
(1) destroy leaf nodes without any visible text
(2) collapse any <div>, replacing it with its child, if it (a) directly contains no visible text AND (b) has only a single <div> child
So for instance if I have the following HTML as input:
<html>
<body>
<div>
<div>
not collapsed into empty parent: only divs
</div>
</div>
<div>
<div>
<div>
inner div not collapsed because this contains text
<div>some more text ...</div>
but the outer nested divs do get collapsed
</div>
</div>
</div>
<div>
<div>This won't be collapsed into parent because </div>
<div>there are two children ...</div>
</div>
</body>
It should get transformed into this "collapsed" version:
<html>
<body>
<div>
not collapsed into empty parent: only divs
</div>
<div>
inner div not collapsed because this contains text
<div>some more text ...</div>
but the outer nested divs do get collapsed
</div>
<div>
<div>This won't be collapsed into parent because </div>
<div>there are two children ...</div>
</div>
</body>
I have been unable to figure out how to do this. I tried writing a recursive tree-walking function using BeautifulSoup's unwrap() and decompose() methods, but this modified the DOM while iterating over it and I couldn't figure out how to get it to work ...
Is there a simple way to do what I want? I am open to solutions either in BeautifulSoup or lxml. Thanks!
You can start with this and adjust to your own needs:
import re
from bs4 import BeautifulSoup, NavigableString

def stripTagWithNoText(soup):
    def remove(node):
        for index, item in enumerate(node.contents):
            if not isinstance(item, NavigableString):
                # child nodes that are tags or non-blank text
                currentNodes = [text for text in item.contents if not isinstance(text, NavigableString) or (isinstance(text, NavigableString) and len(re.sub(r'[\s+]', '', text)) > 0)]
                parentNodes = [text for text in item.parent.contents if not isinstance(text, NavigableString) or (isinstance(text, NavigableString) and len(re.sub(r'[\s+]', '', text)) > 0)]
                if len(currentNodes) == 1 and item.name == item.parent.name:
                    if len(parentNodes) > 1:
                        continue
                    if item.name == currentNodes[0].name and len(currentNodes) == 1:
                        item.replaceWithChildren()
                    node.unwrap()
    for tag in soup.find_all():
        remove(tag)
    print(soup)

# data is the input HTML string from the question
soup = BeautifulSoup(data, "lxml")
stripTagWithNoText(soup)
<html>
<body>
<div>
not collapsed into empty parent: only divs
</div>
<div>
inner div not collapsed because this contains text
<div>some more text ...</div>
but the outer nested divs do get collapsed
</div>
<div>
<div>This won't be collapsed into parent because </div>
<div>there are two children ...</div>
</div>
</body>
</html>
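Since the question is also open to lxml, here is a minimal sketch of the two rules written as a post-order walk, which avoids the "modifying while iterating" problem by looping over a snapshot of each element's children. The helper names (has_text, remove_keep_tail, collapse) and the compact test markup are hypothetical, and the sketch assumes the document has already been reduced to text plus <div> / <a> elements as described.

```python
from lxml import etree


def has_text(value):
    """True if the string holds anything other than whitespace."""
    return bool(value and value.strip())


def remove_keep_tail(elem):
    """Remove elem but keep the text that follows it (its tail)."""
    parent = elem.getparent()
    if has_text(elem.tail):
        previous = elem.getprevious()
        if previous is not None:
            previous.tail = (previous.tail or "") + elem.tail
        else:
            parent.text = (parent.text or "") + elem.tail
    parent.remove(elem)


def collapse(elem):
    # Post-order: handle the children first, over a snapshot of the list,
    # so edits made deeper in the tree never disturb this loop.
    for child in list(elem):
        collapse(child)
    parent = elem.getparent()
    if parent is None:
        return  # never touch the root element
    children = list(elem)
    if has_text(elem.text) or any(has_text(c.tail) for c in children):
        return  # the element directly contains visible text
    if not children:
        remove_keep_tail(elem)  # rule 1: leaf without visible text
    elif elem.tag == "div" and len(children) == 1 and children[0].tag == "div":
        only_child = children[0]
        only_child.tail = elem.tail  # keep the text that followed the wrapper
        parent.replace(elem, only_child)  # rule 2: replace div with its child


doc = etree.fromstring(
    "<body>"
    "<div><div>only divs</div></div>"
    "<div><div><div>text<div>more</div>tail</div></div></div>"
    "<div><div>one</div><div>two</div></div>"
    "<div><div></div></div>"
    "</body>"
)
collapse(doc)
result = etree.tostring(doc, encoding="unicode")
print(result)
```

Because children are processed before their parent, a chain of empty wrapper divs collapses from the inside out, and a wrapper whose only child was an empty leaf becomes an empty leaf itself and is removed as well.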

How to find the text in all div elements without the first div coming from class "B"

I'm trying to capture the text inside all the div elements, except for the text of the first div child of the element with class "B".
I've been banging my head all day, but I can't seem to get it to work properly.
<div class="A">
Text 1
</div>
<div class="B">
<div>
Welcome 1
</div>
<div>
Welcome 2
</div>
</div>
This is the expression I'm using:
//body//text()[not (//div[@class='B']/div[1])]
but it is not returning any results.
After making the XML well-formed by giving it a single root element,
<div>
<div class="A">
Text 1
</div>
<div class="B">
<div>
Welcome 1
</div>
<div>
Welcome 2
</div>
</div>
</div>
Here are all the div elements that have no div descendants and are not the first div child of a parent whose @class attribute value is 'B':
//div[not(descendant::div) and not(../@class='B' and position() = 1)]
The above XPath selects these two div elements:
<div class="A">
Text 1
</div>
<div>
Welcome 2
</div>
So you can get the associated text() nodes using this XPath:
//div[not(descendant::div) and not(../@class='B' and position() = 1)]/text()
...which will select:
Text 1
Welcome 2
without selecting Welcome 1, as requested.
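The expression can be checked against the well-formed sample with a short lxml script. This is a minimal sketch; the variable names are illustrative, and the text nodes are stripped because each one carries the surrounding line breaks.

```python
from lxml import etree

# The well-formed sample from above, with a single root element.
doc = etree.fromstring("""<div>
<div class="A">
Text 1
</div>
<div class="B">
<div>
Welcome 1
</div>
<div>
Welcome 2
</div>
</div>
</div>""")

expr = "//div[not(descendant::div) and not(../@class='B' and position() = 1)]/text()"

# Each selected text node includes its newlines, so strip them.
texts = [t.strip() for t in doc.xpath(expr)]
print(texts)
```

Here position() is evaluated among the div children of each parent, which is why only the first inner div under class "B" is excluded.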

How to split string containing comma using span tag

There is a string containing tags separated by commas:
<div class="article" id="1">
<span class="tags">Dog, Cat, Bird, Pig</span>
</div>
<div class="article" id="2">
<span class="tags">Asia, Africa, Australia, Europe</span>
</div>
...
I would like to wrap each item in a span tag inside the element with the tags class, so that each item can be styled individually, like this:
<span class="tags"><span>Dog</span><span>Cat</span>...</span>
$('.tags').each(function() {
  $(this).html($(this).html().split(', ').map(function(el) { return '<span>' + el + '</span>'; }).join(''));
});