following a hierarchy in beautiful soup - html

I have an HTML file with the following format:
<div class="entry">
<p>para1</p>
<p>para2</p>
<p><div class="abc"> Ignore this part1</div> </p>
<p><script class="xyz">Ignore this part2 </script></p>
</div>
Suppose there is only one div tag with the class "entry". I want to print the text of every p tag inside that div, except for p tags that contain a div or script tag. So here I want to print "para1" and "para2", but not "Ignore this part1" or "Ignore this part2".
How do I achieve this using Beautiful Soup?

Use a lambda expression to filter what you don't need.
Example:
from bs4 import BeautifulSoup
example = """<div class="entry">
<p>para1</p>
<p>para2</p>
<p><div class="abc"> Ignore this part1</div> </p>
<p><script class="xyz">Ignore this part2 </script></p>
<p>example para</p>
</div>"""
soup = BeautifulSoup(example, 'html.parser')
entry = soup.find('div', class_="entry")
p = entry.find_all(
    lambda tag: tag.name == "p" and not (tag.find("div") or tag.find("script"))
)
for content in p:
    print(content.get_text(strip=True))
Outputs:
para1
para2
example para

Alternative solution with a CSS selector
A simple alternative is to use modern CSS selectors:
soup.select('div.entry p:not(:empty,:has(div,script))')
Example
from bs4 import BeautifulSoup
example = '''<div class="entry">
<p>para1</p>
<p>para2</p>
<p><div class="abc"> Ignore this part1</div> </p>
<p><script class="xyz">Ignore this part2 </script></p>
<p>example para</p>
</div>'''
soup = BeautifulSoup(example, 'html.parser')
for e in soup.select('div.entry p:not(:empty,:has(div,script))'):
    print(e.text)
Output
para1
para2
example para
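The two answers above can be wrapped in small helpers and checked against each other. This is only a minimal sketch, assuming bs4 is installed together with its soupsieve selector engine; the function names `texts_via_lambda` and `texts_via_css` are made up for illustration.

```python
from bs4 import BeautifulSoup

EXAMPLE = """<div class="entry">
<p>para1</p>
<p>para2</p>
<p><div class="abc"> Ignore this part1</div> </p>
<p><script class="xyz">Ignore this part2 </script></p>
<p>example para</p>
</div>"""


def texts_via_lambda(html):
    """Filter the <p> tags with a function passed to find_all()."""
    soup = BeautifulSoup(html, "html.parser")
    entry = soup.find("div", class_="entry")
    paragraphs = entry.find_all(
        # keep a <p> only if it has no <div> or <script> descendant
        lambda tag: tag.name == "p" and not tag.find(["div", "script"])
    )
    return [p.get_text(strip=True) for p in paragraphs]


def texts_via_css(html):
    """Filter the <p> tags with a CSS selector passed to select()."""
    soup = BeautifulSoup(html, "html.parser")
    selector = "div.entry p:not(:empty, :has(div, script))"
    return [p.get_text(strip=True) for p in soup.select(selector)]


lambda_result = texts_via_lambda(EXAMPLE)
css_result = texts_via_css(EXAMPLE)
print(lambda_result)
print(css_result)
```

Both helpers rely on 'html.parser' leaving the div nested inside the p. A stricter parser such as html5lib may restructure that markup, in which case the filter would need to be adapted.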

Related

Getting specific span tag text in python (BeautifulSoup)

I'm scraping some information off MyAnimeList using BeautifulSoup in Python 3 and am trying to get information about a show's 'Status', but am having trouble accessing it.
Here is the html:
<h2>Information</h2>
<div>
<span class="dark_text">Type:</span>
Movie
</div>
<div class="spaceit">
<span class="dark_text">Episodes:</span>
1
</div>
<div>
<span class="dark_text">Status:</span>
Finished Airing
</div>
All of this is also contained within another div tag but I only included the portion of the html that I want to scrape. To clarify, I want to obtain the text 'Finished Airing' contained within 'Status'.
Here's the code I have so far but I'm not really sure if this is the best approach or where to go from here:
Page_soup = soup(Page_html, "html.parser")
extra_info = Page_soup.find('td', attrs={'class': 'borderClass'})
span_html = extra_info.select('span')
for i in range(len(span_html)):
    if 'Status:' in span_html[i].getText():
        ...  # not sure how to continue from here
Any help would be appreciated, thanks!
To get the text next to the <span> with "Status:", you can use:
from bs4 import BeautifulSoup
html_doc = """
<h2>Information</h2>
<div>
<span class="dark_text">Type:</span>
Movie
</div>
<div class="spaceit">
<span class="dark_text">Episodes:</span>
1
</div>
<div>
<span class="dark_text">Status:</span>
Finished Airing
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
# string=True is the current spelling of the older text=True argument
txt = soup.select_one('span:-soup-contains("Status:")').find_next_sibling(string=True)
print(txt.strip())
Prints:
Finished Airing
Or:
txt = soup.find("span", string="Status:").find_next_sibling(string=True)
print(txt.strip())
Another possible solution:
f = soup.find_all('span', attrs={'class': 'dark_text'})
for i in f:
    if i.text == 'Status:':
        # i.parent is the enclosing div, so this prints the label together with the value
        print(i.parent.text)
Change 'Status:' to whatever other label you want to find.
Hope I helped!
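If more than one field is needed, the same sibling lookup can be run over every label at once. The following is a rough sketch built on the HTML from the question; the `info` dictionary and the choice to strip the trailing colon from each label are my own assumptions, not part of the answers above.

```python
from bs4 import BeautifulSoup

html_doc = """
<h2>Information</h2>
<div>
<span class="dark_text">Type:</span>
Movie
</div>
<div class="spaceit">
<span class="dark_text">Episodes:</span>
1
</div>
<div>
<span class="dark_text">Status:</span>
Finished Airing
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

info = {}
for span in soup.select("span.dark_text"):
    # "Status:" -> "Status"
    label = span.get_text(strip=True).rstrip(":")
    # the value is the text node that follows the span inside the same div
    value = span.find_next_sibling(string=True)
    info[label] = value.strip() if value is not None else ""

print(info["Status"])
```

Looking up `info["Type"]` or `info["Episodes"]` then works the same way as `info["Status"]`.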

Get div content inside a div BeautifulSoup

I have a website in the following format:
<html lang="en">
<head>
#anything
</head>
<body>
<div id="div1">
<div id="div2">
<div class="class1">
#something
</div>
<div class="class2">
#something
</div>
<div class="class3">
<div class="sub-class1">
<div id="statHolder">
<div class="Class 1 of 15">
"Name"
<b>Bob</b>
</div>
<div class="Class 2 of 15">
"Age"
<b>24</b>
</div>
# Here are 15 of these kinds
</div>
</div>
</div>
</div>
</div>
</body>
</html>
I want to retrieve all the content in those 15 classes. How do I do that?
Edit:
My Current Approach:
import requests
from bs4 import BeautifulSoup
url = 'my-url-here'
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
name_box = soup.findAll('div', {"id": "div1"}) #I dont know what to do after this
Expected Output:
Name: Bob
Age: 24
#All 15 entries like this
I am using BeautifulSoup4 for this.
Is there any direct way to get all the contents in <div id="statHolder">?
Based on the HTML above, you can try it this way:
import requests
from bs4 import BeautifulSoup
result = {}
url = 'my-url-here'
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
stats = soup.find('div', {'id': 'statHolder'})
for data in stats.find_all('div'):
    key, value = data.text.split()
    result[key.replace('"', '')] = value
print(result)
# Prints:
# {'Name': 'Bob', 'Age': '24'}
for key, value in result.items():
    print(f'{key}: {value}')
# Prints:
# Name: Bob
# Age: 24
This finds the div with the id statHolder.
Then we find all divs inside that div and split each one's text on whitespace (using split): the first token is the key and the second is the value. We also remove the double quotes from the key using replace.
Then we add the key-value pair to our result dictionary.
Iterating through this, you can get the desired output as shown.
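One caveat with the split() call above: unpacking into exactly two names raises a ValueError as soon as a label or value contains a space. A possible variation, sketched here on invented sample data ("Bob Smith" is not from the question), reads the label from the first text string and the value from the b tag instead.

```python
from bs4 import BeautifulSoup

# Hypothetical markup in the same shape as the question; the values are invented.
html = """
<div id="statHolder">
  <div class="Class 1 of 15">
    "Name"
    <b>Bob Smith</b>
  </div>
  <div class="Class 2 of 15">
    "Age"
    <b>24</b>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
stats = soup.find("div", id="statHolder")

result = {}
for row in stats.find_all("div", recursive=False):
    if row.b is None:
        continue  # skip rows that do not follow the label/<b> layout
    # first non-empty text string is the label, e.g. '"Name"'
    label = next(row.stripped_strings, "").strip('"')
    result[label] = row.b.get_text(strip=True)

for key, value in result.items():
    print(f"{key}: {value}")
```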
If you do it according to the actual html of the webpage the following will give you the stats as a dictionary. It takes each element with class pSt as the key and then moves to the following strong tag to get the associated value.
from bs4 import BeautifulSoup as bs
#html is response.content assuming not dynamic
soup = bs(html, 'html.parser')
stats = {i.text:i.strong.text for i in soup.select('.pSt')}
For the HTML you have shown, you can use stripped_strings to get the first text string of each div as the key:
from bs4 import BeautifulSoup as bs
html = '''
<html lang="en">
<head>
#anything
</head>
<body>
<div id="div1">
<div id="div2">
<div class="class1">
#something
</div>
<div class="class2">
#something
</div>
<div class="class3">
<div class="sub-class1">
<div id="statHolder">
<div class="Class 1 of 15">
"Name"
<b>Bob</b>
</div>
<div class="Class 2 of 15">
"Age"
<b>24</b>
</div>
# Here are 15 of these kinds
</div>
</div>
</div>
</div>
</div>
</body>
</html>
'''
soup = bs(html, 'html.parser')
stats = {
    next(i.stripped_strings): i.b.text
    for i in soup.select('#statHolder [class^=Class]')
}
print(stats)

Python3 web-scraper can't extract text from every <a> tag in the site

I'm trying to write a Python 3 web scraper that extracts the text inside <a> tags from a site.
I'm using the bs4 library with this code:
from bs4 import BeautifulSoup
import requests
req = requests.get(mainUrl).text
soup = BeautifulSoup(req, 'html.parser')
for div in soup.find_all('div', 'turbolink_scroller'):
    for a in div.find_all('a', href=True, text=True):
        print(a.text)
The only problem I encounter is that it only finds text with this type of syntax:
<div class="test">
Text that i want
</div>
but not with this one:
<div class="test">
<a href="/link/to/whatIwant2">
The text
<br>
I would like
</a>
</div>
Could you explain why, and what the difference between the two is?
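It has to do with the <br> tag within the second div. With text=True, find_all only matches tags whose .string is set, and .string is None as soon as a tag has more than one child (here: two text nodes and a <br>). If you remove text=True you'll get both of them. In the sample below, the first text is also wrapped in an <a> (with a placeholder href) so that find_all('a', href=True) can match it.
from bs4 import BeautifulSoup
sample = """
<div class="test">
<a href="/link/to/whatIwant1">
Text that i want
</a>
</div>
<div class="test">
<a href="/link/to/whatIwant2">
The text
<br>
I would like
</a>
</div>
"""
for div in BeautifulSoup(sample, 'html.parser').find_all('div', 'test'):
    for a in div.find_all('a', href=True):
        print(a.getText(strip=True))
Output:
Text that i want
The textI would like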
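The difference between the two kinds of <a> tag can be checked directly by looking at .string. A small sketch, with made-up href values, that also shows how passing a separator to get_text() avoids the glued-together "The textI would like":

```python
from bs4 import BeautifulSoup

# The href values are placeholders for illustration.
sample = """
<a href="/one">Single string</a>
<a href="/two">
The text
<br>
I would like
</a>
"""

soup = BeautifulSoup(sample, "html.parser")
first, second = soup.find_all("a", href=True)

# One child (a single text node): .string is that text.
print(repr(first.string))
# Several children (text, <br>, text): .string is None.
print(repr(second.string))

# string=True (the newer name for text=True) therefore skips the second tag.
matched = soup.find_all("a", href=True, string=True)

# A separator keeps the two text fragments apart.
joined = second.get_text(" ", strip=True)
print(joined)
```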

How do i get the text inside a class while ignoring the text of the next class that is inside

I'm trying to get the text inside class="hardfact", but I'm also getting the text of class="hardfactlabel color_f_03" because that element is nested inside hardfact.
.text.strip() returns the text of both classes because they are nested.
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
import requests
import lxml
my_url = 'https://www.immowelt.de/expose/2QC5D4A?npv=52'
page = requests.get(my_url)
ct = soup(page.text, 'lxml')
specs = ct.find('div', class_="hardfacts clear").findAll('div', class_="hardfact")
for items in specs:
    e = items.text.strip()
    print(e)
I'm getting this
82.500 € 
Kaufpreis
47 m²
Wohnfläche (ca.)
1
Zimmer
and I want this
82.500 €
47 m²
1
Here is the html content you are trying to crawl:
<div class="hardfact ">
<strong>82.500 € </strong>
<div class="hardfactlabel color_f_03">
Kaufpreis
</div>
</div>
<div class="hardfact ">
47 m²
<div class="hardfactlabel color_f_03">
Wohnfläche (ca.)
</div>
</div>
<div class="hardfact rooms">
1
<div class="hardfactlabel color_f_03">
Zimmer
</div>
</div>
What you want is to drop the nested div tags, so you can simply decompose the div:
for items in specs:
    items.div.decompose()
    e = items.text.strip()
    print(e)
If every "hardfact" div wrapped its value in a tag, the way the first one does with "strong", you could just take the first child element like so
e = items.find().text.strip()
but the other two hold the value as bare text, so you have to decompose the div tag.
You can use stripped_strings. The [:3] slice is safe even if fewer than three elements match, but you probably want to check that strings is not empty before indexing it.
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.immowelt.de/expose/2QC5D4A?npv=52')
soup = bs(r.content, 'lxml')
items = soup.select('.hardfact')[:3]
for item in items:
    strings = list(item.stripped_strings)
    print(strings[0])
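A third option leaves the tree untouched and does not depend on the value being the first string: collect every text node except those inside the label div. This is only a sketch on a static copy of the HTML shown above (the live page may differ); the helper name `text_without_label` is hypothetical.

```python
from bs4 import BeautifulSoup

html = """
<div class="hardfacts clear">
<div class="hardfact ">
<strong>82.500 € </strong>
<div class="hardfactlabel color_f_03">
Kaufpreis
</div>
</div>
<div class="hardfact ">
47 m²
<div class="hardfactlabel color_f_03">
Wohnfläche (ca.)
</div>
</div>
<div class="hardfact rooms">
1
<div class="hardfactlabel color_f_03">
Zimmer
</div>
</div>
</div>
"""


def text_without_label(tag, label_class="hardfactlabel"):
    """Join the text nodes of `tag`, skipping any that sit inside a label element."""
    parts = []
    for string in tag.find_all(string=True):
        if string.find_parent(class_=label_class) is not None:
            continue  # this text belongs to the nested label div
        text = string.strip()
        if text:
            parts.append(text)
    return " ".join(parts)


soup = BeautifulSoup(html, "html.parser")
values = [text_without_label(div) for div in soup.select("div.hardfact")]
print(values)
```

Unlike decompose(), this keeps the label divs in the soup, which matters if the labels are needed later.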

Remove height and width from inline styles

I'm using BeautifulSoup to remove inline heights and widths from my elements. Solving it for images was simple:
def remove_dimension_tags(tag):
    for attribute in ["width", "height"]:
        del tag[attribute]
    return tag
But I'm not sure how to go about processing something like this:
<div id="attachment_9565" class="wp-caption aligncenter" style="width: 2010px;background-color:red">
when I would want to leave the background-color (for example) or any other style attributes other than height or width.
The only way I can think of doing it is with a regex but last time I suggested something like that the spirit of StackOverflow came out of my computer and murdered my first-born.
A full walk-through would be:
from bs4 import BeautifulSoup
import re
string = """
<div id="attachment_9565" class="wp-caption aligncenter" style="width: 2010px;background-color:red">
<p>Some line here</p>
<hr/>
<p>Some other beautiful text over here</p>
</div>
"""
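# match a width or height declaration (but not e.g. max-width), up to and including the next ;
rx = re.compile(r'(?<![\w-])(?:width|height)\s*:[^;]+;?')
soup = BeautifulSoup(string, "html5lib")
for div in soup.find_all('div', style=True):
    # apply the regex to the style attribute only, not to the whole document
    div['style'] = rx.sub("", div['style'])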
As stated by others, using regular expressions on the actual value is not a problem.
You could use a regex if you want, but there is a simpler way:
use cssutils to parse the CSS.
A simple example:
from bs4 import BeautifulSoup
import cssutils
s = '<div id="attachment_9565" class="wp-caption aligncenter" style="width: 2010px;background-color:red">'
soup = BeautifulSoup(s, "html.parser")
div = soup.find("div")
div_style = cssutils.parseStyle(div["style"])
del div_style["width"]
div["style"] = div_style.cssText
print(div)
Outputs:
<div class="wp-caption aligncenter" id="attachment_9565" style="background-color: red"></div>
import re
import bs4
html = '''<div id="attachment_9565" class="wp-caption aligncenter" style="width: 2010px;background-color:red">'''
soup = bs4.BeautifulSoup(html, 'lxml')
A tag's attributes are stored in a dict, so you can read and modify them like a dict.
Get the attributes:
soup.div.attrs
{'class': ['wp-caption', 'aligncenter'],
'id': 'attachment_9565',
'style': 'width: 2010px;background-color:red'}
Set an item (this keeps only the last declaration, so it relies on background-color coming last):
soup.div.attrs['style'] = soup.div.attrs['style'].split(';')[-1]
{'class': ['wp-caption', 'aligncenter'],
'id': 'attachment_9565',
'style': 'background-color:red'}
Or use a regex:
soup.div.attrs['style'] = re.search(r'background-color:\w+', soup.div.attrs['style']).group()
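For a version that does not depend on which properties happen to be present or in what order, the style value can be split into declarations and filtered by property name. This is a minimal standard-library sketch; the helper name `strip_dimensions` is made up, and the naive split on ";" assumes no declaration value contains a semicolon (a data URL would break it).

```python
def strip_dimensions(style, drop=("width", "height")):
    """Return `style` without the declarations whose property is in `drop`."""
    kept = []
    for declaration in style.split(";"):
        name, sep, value = declaration.partition(":")
        if not sep:
            continue  # empty or malformed fragment, e.g. after a trailing ;
        if name.strip().lower() in drop:
            continue  # exact match only, so max-width or line-height survive
        kept.append(f"{name.strip()}:{value.strip()}")
    return ";".join(kept)


original = "width: 2010px;background-color:red"
cleaned = strip_dimensions(original)
print(cleaned)

other = strip_dimensions("max-width: 100%; height: 20px; color: blue")
print(other)
```

With BeautifulSoup, this could be applied to every styled element with a loop over soup.find_all(style=True), assigning tag["style"] = strip_dimensions(tag["style"]) for each tag.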