How to merge/collapse child DOM node into parent with BeautifulSoup / lxml?

I am writing some HTML pre-processing scripts that are cleaning/tagging HTML from a web crawler, for use in a semantic/link analysis step that follows. I have filtered out undesired tags from the HTML, and simplified it to contain only visible text and <div> / <a> elements.
I am now trying to write a "collapseDOM()" function that walks the DOM tree and performs the following actions:
(1) destroy leaf nodes without any visible text
(2) collapse any <div>, replacing it with its child, if it (a) directly contains no visible text AND (b) has only a single <div> child
So for instance if I have the following HTML as input:
<html>
<body>
  <div>
    <div>
      not collapsed into empty parent: only divs
    </div>
  </div>
  <div>
    <div>
      <div>
        inner div not collapsed because this contains text
        <div>some more text ...</div>
        but the outer nested divs do get collapsed
      </div>
    </div>
  </div>
  <div>
    <div>This won't be collapsed into parent because </div>
    <div>there are two children ...</div>
  </div>
</body>
It should get transformed into this "collapsed" version:
<html>
<body>
  <div>
    not collapsed into empty parent: only divs
  </div>
  <div>
    inner div not collapsed because this contains text
    <div>some more text ...</div>
    but the outer nested divs do get collapsed
  </div>
  <div>
    <div>This won't be collapsed into parent because </div>
    <div>there are two children ...</div>
  </div>
</body>
I have been unable to figure out how to do this. I tried writing a recursive tree-walking function using BeautifulSoup's unwrap() and decompose() methods, but it modified the DOM while I was iterating over it, and I couldn't get it to work.
Is there a simple way to do what I want? I am open to solutions either in BeautifulSoup or lxml. Thanks!

You can start with this and adjust to your own needs:
import re
from bs4 import BeautifulSoup, NavigableString

def stripTagWithNoText(soup):
    def remove(node):
        for index, item in enumerate(node.contents):
            if isinstance(item, NavigableString):
                continue
            # children of the current tag, keeping only elements and
            # strings that contain some non-whitespace text
            currentNodes = [text for text in item.contents
                            if not isinstance(text, NavigableString)
                            or len(re.sub(r'\s+', '', text)) > 0]
            # the same filter applied to the current tag's parent
            parentNodes = [text for text in item.parent.contents
                           if not isinstance(text, NavigableString)
                           or len(re.sub(r'\s+', '', text)) > 0]
            if len(currentNodes) == 1 and item.name == item.parent.name:
                if len(parentNodes) > 1:
                    continue
                if item.name == currentNodes[0].name:
                    # a lone <div> inside a <div> with no visible text:
                    # collapse it into the parent
                    item.replaceWithChildren()
                    node.unwrap()

    for tag in soup.find_all():
        remove(tag)
    print(soup)

soup = BeautifulSoup(data, "lxml")   # data holds the input HTML
stripTagWithNoText(soup)
<html>
<body>
  <div>
    not collapsed into empty parent: only divs
  </div>
  <div>
    inner div not collapsed because this contains text
    <div>some more text ...</div>
    but the outer nested divs do get collapsed
  </div>
  <div>
    <div>This won't be collapsed into parent because </div>
    <div>there are two children ...</div>
  </div>
</body>
</html>
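Since the question is also open to lxml, here is a rough sketch of the same two rules using lxml.html's drop_tree() and drop_tag(), walking the <div>s bottom-up. The collapse() name is just illustrative and data is the input HTML, as above; adjust as needed.

import lxml.html

def collapse(tree):
    # Visit <div>s in reverse document order so children are handled
    # before their parents.
    for el in reversed(list(tree.iterdescendants('div'))):
        has_text = bool((el.text or '').strip()) or any(
            (child.tail or '').strip() for child in el)
        children = list(el)
        if not children and not has_text:
            # Rule 1: destroy leaf nodes without visible text
            # (drop_tree keeps any tail text).
            el.drop_tree()
        elif not has_text and len(children) == 1 and children[0].tag == 'div':
            # Rule 2: remove this <div> and hoist its single <div> child
            # into its place.
            el.drop_tag()
    return tree

tree = lxml.html.fromstring(data)
collapse(tree)
print(lxml.html.tostring(tree, pretty_print=True).decode())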

Related

How to get content of unnamed nested div inside div with class? (Use Scrapy or BeautifulSoup)

I'm trying to obtain the nested unnamed div inside:
div class="pod three columns closable content-sizes clearfix">
The nested unnamed div is also the first div inside the div above (see image)
I have tried the following:
for div in soup.findAll('div', attrs={'class': 'pod three columns closable content-sizes clearfix'}):
    print(div.text)
The length of
soup.findAll('div', attrs={'class': 'pod three columns closable content-sizes clearfix'})
is just one, despite this div having many nested divs, so the for-loop runs only once and prints everything.
I need all the text inside only the first nested div (see image):
Project...
Reference Number...
Other text
Try:
from bs4 import BeautifulSoup

html_doc = """
<div class="pod three columns closable content-sizes clearfix">
    <div>
        <b>Project: ...</b>
        <br>
        <br>
        <b>Reference Number: ...</b>
        <br>
        <br>
        <b>Other text ...</b>
        <br>
        <br>
    </div>
    <div>
        Other text I don't want
    </div>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

print(
    soup.select_one("div.pod.three.columns > div").get_text(
        strip=True, separator="\n"
    )
)
Prints:
Project: ...
Reference Number: ...
Other text ...
Or without CSS selector:
print(
    soup.find(
        "div",
        attrs={"class": "pod three columns closable content-sizes clearfix"},
    )
    .find("div")
    .get_text(strip=True, separator="\n")
)
Try this:
result = soup.find('div', class_ = "pod three columns closable content-sizes clearfix").find("div")
print(result.text)
Output:
Project: .............
Reference Number: ....
Other text ...........
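Note that .find("div") in the snippets above returns the first div descendant in document order, which here happens to be the wanted direct child. If you want to insist on a direct child (the equivalent of > in the CSS selector), one option, assuming the same soup as above, is to pass recursive=False:

outer = soup.find("div", class_="pod three columns closable content-sizes clearfix")
# recursive=False restricts find() to direct children only
first_inner = outer.find("div", recursive=False)
print(first_inner.get_text(strip=True, separator="\n"))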

Nokogiri HTML Nested Elements Extract Class and Text

I have a basic page structure with elements (span's) nested under other elements (div's and span's). Here's an example:
html = "<html>
<body>
<div class="item">
<div class="profile">
<span class="itemize">
<div class="r12321">Plains</div>
<div class="as124223">Trains</div>
<div class="qwss12311232">Automobiles</div>
</div>
<div class="profile">
<span class="itemize">
<div class="lknoijojkljl98799999">Love</div>
<div class="vssdfsd0809809">First</div>
<div class="awefsaf98098">Sight</div>
</div>
</div>
</body>
</html>"
Notice that the class names are random. Notice also that there is whitespace and tabs in the html.
I want to extract the children and end up with a hash like so:
page = Nokogiri::HTML(html)
itemhash = Hash.new
page.css('div.item div.profile span').map do |divs|
  children = divs.children
  children.each do |child|
    itemhash[child['class']] = child.text
  end
end
Result should be similar to:
{\"r12321\"=>\"Plains\", \"as124223\"=>\"Trains\", \"qwss12311232\"=>\"Automobiles\", \"lknoijojkljl98799999\"=>\"Love\", \"vssdfsd0809809\"=>\"First\", \"awefsaf98098\"=>\"Sight\"}
But I'm ending up with a mess like this:
{nil=>\"\\n\\t\\t\\t\\t\\t\\t\", \"r12321\"=>\"Plains\", nil=>\" \", \"as124223\"=>\"Trains\", \"qwss12311232\"=>\"Automobiles\", nil=>\"\\n\\t\\t\\t\\t\\t\\t\", \"lknoijojkljl98799999\"=>\"Love\", nil=>\" \", \"vssdfsd0809809\"=>\"First\", \"awefsaf98098\"=>\"Sight\"}
This is because of the tabs and whitespace in the HTML. I don't have any control over how the HTML is generated so I'm trying to work around the issue. I've tried noblanks but that's not working. I've also tried gsub but that only destroys my markup.
How can I extract the class and values of these nested elements while cleanly ignoring whitespace and tabs?
P.S. I'm not hung up on Nokogiri - so if another gem can do it better I'm game.
The children method returns all child nodes, including text nodes, even ones that contain nothing but whitespace.
To only get child elements you could do an explicit XPath query (or possibly the equivalent CSS), e.g.:
children = divs.xpath('./div')
You could also use the element_children method, which is closer to what you are already doing and only returns children that are elements:
children = divs.element_children
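For what it's worth, BeautifulSoup has the same pitfall: .children also yields whitespace-only NavigableString nodes, and find_all(recursive=False) plays the role of element_children. A minimal illustration, using a trimmed-down, hypothetical piece of the markup above:

from bs4 import BeautifulSoup

snippet = '<span class="itemize">\n\t<div class="r12321">Plains</div>\n\t<div class="as124223">Trains</div>\n</span>'
span = BeautifulSoup(snippet, "html.parser").span
print(list(span.children))              # includes the '\n\t' text nodes
print(span.find_all(recursive=False))   # element children only
print({child["class"][0]: child.get_text()
       for child in span.find_all(recursive=False)})
# {'r12321': 'Plains', 'as124223': 'Trains'}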

How to find the text in all div elements without the first div coming from class "B"

I'm trying to capture the text inside all the div elements, excluding the text of the first div that is a child of the div with class "B".
I've been banging my head all day, but I can't seem to get it to work properly.
<div class="A">
Text 1
</div>
<div class="B">
<div>
Welcome 1
</div>
<div>
Welcome 2
</div>
</div>
This is the expression I'm using:
//body//text()[not (//div[@class='B']/div[1])]
but it is not returning any results.
After making the XML well-formed by giving it a single root element,
<div>
  <div class="A">
    Text 1
  </div>
  <div class="B">
    <div>
      Welcome 1
    </div>
    <div>
      Welcome 2
    </div>
  </div>
</div>
Here are all the div elements that have no div descendants and are not the first div whose parent has a @class attribute value of 'B':
//div[not(descendant::div) and not(../@class='B' and position() = 1)]
The above XPath selects these two div elements:
<div class="A">
Text 1
</div>
<div>
Welcome 2
</div>
So you can get the associated text() nodes using this XPath:
//div[not(descendant::div) and not(../@class='B' and position() = 1)]/text()
...which will select:
Text 1
Welcome 2
without selecting Welcome 1, as requested.
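If you want to check this from Python, a quick lxml sketch of the corrected expression (using the same wrapped markup as above) looks like this:

from lxml import html

doc = html.fromstring(
    "<div>"
    "  <div class='A'>Text 1</div>"
    "  <div class='B'><div>Welcome 1</div><div>Welcome 2</div></div>"
    "</div>"
)
texts = doc.xpath(
    "//div[not(descendant::div) and not(../@class='B' and position() = 1)]/text()"
)
print([t.strip() for t in texts if t.strip()])   # ['Text 1', 'Welcome 2']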

Xpath: select div with an anchor descendant whose depth is unknown

Sample html:
<div>
  <div class="foo bar baz">  <!-- target 1 -->
    <div>
      <span>
        hello world
      </span>
    </div>
  </div>
  <div class="foo bar">foo</div>
  <div class="bar">  <!-- target 2 -->
    <div>
      <div>
        <span>
          hello world
        </span>
      </div>
    </div>
  </div>
</div>
I want to select divs that: 1) have the class name bar, and 2) have an <a> descendant whose href contains hello.
My problem is that the <a> tag could be nested at different levels. How do I handle this correctly?
You can use the relative descendant-or-self axis (.//) to check for <a> elements at any possible depth:
//div[contains(concat(' ', @class, ' '), ' bar ')][.//a[contains(@href,'hello')]]
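For completeness, here is a small lxml check of that expression; the markup is hypothetical, with <a> elements added since the sample above omits them:

from lxml import html

doc = html.fromstring(
    "<div>"
    "<div class='foo bar baz'><div><span><a href='/hello/world'>x</a></span></div></div>"
    "<div class='foo bar'>foo</div>"
    "<div class='bar'><div><div><a href='/say-hello'>y</a></div></div></div>"
    "</div>"
)
hits = doc.xpath(
    "//div[contains(concat(' ', @class, ' '), ' bar ')]"
    "[.//a[contains(@href, 'hello')]]"
)
print([d.get('class') for d in hits])   # ['foo bar baz', 'bar']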
Related discussion: How can I find an element by CSS class with XPath?

How to access div element text based on adjacent text

I have the following HTML code and am trying to access "QA1234", which is the value of the Serial Number. Can you let me know how I can access this text?
<div class="dataField">
<div class="dataName">
<span id="langSerialNumber">Serial Number</span>
</div>
<div class="dataValue">QA1234</div>
</div>
<div class="dataField">
<div class="dataName">
<span id="langHardwareRevision">Hardware Revision</span>
</div>
<div class="dataValue">05</div>
</div>
<div class="dataField">
<div class="dataName">
<span id="langManufactureDate">Manufacture Date</span>
</div>
<div class="dataValue">03/03/2011</div>
</div>
I assume you are trying to get the "QA1234" text because it is the "Serial Number" value. If that is correct, you basically need to:
1. Locate the "dataField" div that includes the serial number span.
2. Get the "dataValue" within that div.
One way is to get all the "dataField" divs and find the one that includes the span:
parent = browser.divs(class: 'dataField').find { |div| div.span(id: 'langSerialNumber').exists? }
p parent.div(class: 'dataValue').text
#=> "QA1234"
parent = browser.divs(class: 'dataField').find { |div| div.span(id: 'langManufactureDate').exists? }
p parent.div(class: 'dataValue').text
#=> "03/03/2011"
Another option is to find the serial number span and then traverse up to the parent "dataField" div:
parent = browser.span(id: 'langSerialNumber').parent.parent
p parent.div(class: 'dataValue').text
#=> "QA1234"
parent = browser.span(id: 'langManufactureDate').parent.parent
p parent.div(class: 'dataValue').text
#=> "03/03/2011"
I find the first approach more robust to changes, since it is less dependent on how the serial number is nested within the "dataField" div. However, for pages with a lot of fields, it may be less performant.
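If you are doing the same thing with BeautifulSoup rather than Watir, a rough equivalent of the first (search-by-child) approach might look like this, where page_html is a placeholder for the fetched page source:

from bs4 import BeautifulSoup

soup = BeautifulSoup(page_html, "html.parser")
# find the dataField block that contains the serial-number span ...
field = next(div for div in soup.find_all("div", class_="dataField")
             if div.find("span", id="langSerialNumber"))
# ... then read its dataValue
print(field.find("div", class_="dataValue").get_text(strip=True))   # QA1234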