Beautiful Soup: extracting from deeply nested <div>'s - html

Trying to extract Message text from:
<div class="Item ItemDiscussion Role_Member" id="Discussion_2318">
<div class="Discussion">
<div class="Item-BodyWrap">
<div class="Item-Body">
<div class="Message">
Hello<br/>I have a very interesting observation on nature of birds in Alaska ... <br/>
Was there 10/19/18 has anyone heard of this </div>
<div class="ReactionRecord"></div><div class="Reactions"></div> </div>
</div>
</div>
</div>
I have got this bit with:
tag = soup.find('div', {'class' : 'ItemDiscussion'})
Next I am trying to go down with:
s = str((tag.contents)[1])
sp = BeautifulSoup(s)
sp.contents
But this does not help much. How to get message text from <div class="Message"> ?

you can find the element from soup directly.
discussion_div = soup.find("div", {"class": "ItemDiscussion"})
message_text = discussion_div.find("div", {"class": "Message"}).text

You can select any element using select_one() function by entering the CSS Selector to the element. select_one() function will only return one element if you want more than one element then you can use select() which will return a list of found elements. here is the example for you,
soup = BeautifulSoup(html, "html.parser")
print soup.select_one("div.Item div.Discussion div.Item-BodyWrap div.Item-Body div.Message").text
You can also select your element using a single class if it is
unique.
print soup.select_one("div.Message").text

Related

How to get a div or span class from a related span class?

I've found the lowest class: <span class="pill css-1a10nyx e1pqc3131"> of multiple elements of a website but now I want to find the related/linked upper-class so for example the highest <div class="css-1v73czv eh8fd9011" xpath="1">. I've got the soup but can't figure out a way to get from the 'lowest' class to the 'highest' class, any idea?
<div class="css-1v73czv eh8fd9011" xpath="1">
<div class="css-19qortz eh8fd9010">
<header class="css-1idy7oy eh8fd909">
<div class="css-1rkuvma eh8fd908">
<footer class="css-f9q2sp eh8fd907">
<span class="pill css-1a10nyx e1pqc3131">
End result would be:
INPUT- Search on on all elements of a page with class <span class="pill css-1a10nyx e1pqc3131">(lowest)
OUTPUT - Get all related titles or headers of said class.
I've tried it with if-statements but that doesn't work consistently. Something with an if class = (searchable class) then get (desired higher class) should work.
I can add any more details if needed please let me know, thanks in advance!
EDIT: Picture per clarification where the title(highest class) = "Wooferland Festival 2022" and the number(lowest class) = 253
As mentioned, question needs some more information, to give a concret answer.
Assuming you like to scrape the information in the picture based on your example HTML you select your pill and use .find_previous() to locate your elements:
for e in soup.select('span.pill'):
print(e.find_previous('header').text)
print(e.find_previous('div').text)
print(e.text)
Assuming there is a cotainer tag in HTML structure like <a> or other you would select this based on the condition, that it contains a <span> wit class pill:
for e in soup.select('a:has(span.pill)'):
print(e.header.text)
print(e.header.next.text)
print(e.footer.span.text)
Note: Instead of using css classes, that can be highly dynamic, try use more static attributes or the HTML structure.
Example
See both options, for first one the <a> do not matter.
from bs4 import BeautifulSoup
html='''
<a>
<div class="css-1v73czv eh8fd9011" xpath="1">
<div class="css-19qortz eh8fd9010">
<header class="css-1idy7oy eh8fd909">some date information</header>
<div class="css-1rkuvma eh8fd908">some title</div>
<footer class="css-f9q2sp eh8fd907">
<span class="pill css-1a10nyx e1pqc3131">some number</span>
<footer>
</div>
</div>
</a>
'''
soup = BeautifulSoup(html)
for e in soup.select('span.pill'):
print(e.find_previous('header').text)
print(e.find_previous('div').text)
print(e.text)
print('---------')
for e in soup.select('a:has(span.pill)'):
print(e.header.text)
print(e.header.next.text)
print(e.footer.span.text)
Output
some date information
some title
some number
---------
some date information
some date information
some number

Getting specific span tag text in python (BeautifulSoup)

Im scraping some information off MyAnimeList using BeautifulSoup on python3 and am trying to get information about a show's 'Status', but am having trouble accessing it.
Here is the html:
<h2>Information</h2>
<div>
<span class="dark_text">Type:</span>
Movie
</div>
<div class="spaceit">
<span class="dark_text">Episodes:</span>
1
</div>
<div>
<span class="dark_text">Status:</span>
Finished Airing
</div>
All of this is also contained within another div tag but I only included the portion of the html that I want to scrape. To clarify, I want to obtain the text 'Finished Airing' contained within 'Status'.
Here's the code I have so far but I'm not really sure if this is the best approach or where to go from here:
Page_soup = soup(Page_html, "html.parser")
extra_info = Page_soup.find('td', attrs={'class': 'borderClass'})
span_html = extra_info.select('span')
for i in range(len(span_html)):
if 'Status:' in span_html[i].getText():
Any help would be appreciated, thanks!
To get the text next to the <span> with "Status:", you can use:
from bs4 import BeautifulSoup
html_doc = """
<h2>Information</h2>
<div>
<span class="dark_text">Type:</span>
Movie
</div>
<div class="spaceit">
<span class="dark_text">Episodes:</span>
1
</div>
<div>
<span class="dark_text">Status:</span>
Finished Airing
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
txt = soup.select_one('span:-soup-contains("Status:")').find_next_sibling(text=True)
print(txt.strip())
Prints:
Finished Airing
Or:
txt = soup.find("span", text="Status:").find_next_sibling(text=True)
print(txt.strip())
Another solution (maybe):
f = soup.find_all('span',attrs={'class':'dark_text'})
for i in f:
if i.text == 'Status:':
print(i.parent.text)
And change 'Status:' to whatever other thing you want to find.
Hope I helped!

how to get content within a span tag

#Example 1
<span class="levelone">
<span class="leveltwo" dir="auto">
::before
"Blue"
::after
</span>
</span>
#Example 2
<div class="itemlist">
<div dir="auto" style="text-align: start;">
"mobile"
</div>
</div>
#Example 3
<div class="quantity">
<div class="color">...</div>
<span class="num">10</span>
</div>
Hi, I am trying to use selenium to extract content from html. I managed to extract the content for example 1 & 2, the code that I have used is
example1 = driver.find_elements_by_css_selector("span[class='leveltwo']")
example2 = driver.find_elements_by_css_selector("div[class='itemlist']")
and printed out as text with
data = [dt.text for dt in example1]
print(data)
I got "Blue" for example 1 & "mobile" for example 2. For simplicity purposes, the html given above is for one iteration, I have scraped all elements with the class mentioned above
However, for the 3rd example, I tried to use
example3a = driver.find_elements_by_css_selector("div[class='quantity']")
and
example3b = driver.find_elements_by_css_selector("div[class='num']")
and
example3c = driver. find_element_by_class_name("num")
but all of it returned an empty list. I'm not sure is it because there is no dir in example 3? What method should I use to extract the "10"?
for 3rd example, you can try the below css :
div.quantity span.num
in code you can write like this :
example3a = driver.find_elements_by_css_selector("div.quantity span.num")
print(example3a.text)
or
print(example3a.get_attribute('innerHTML'))
To extract specifically the 10 you can use
example3a = driver.find_elements_by_css_selector("div.quantity span.num")
To extract both elements inside <div class="quantity"> you can use
example3 = driver.find_elements_by_xpath("//div[#class='quantity']//*")
for el in example3:
print(el.text)

Trying to get the texts between this tag but getting an empty list

\Trying to get the texts A Plus and Computers from this html:
<div class="u-space-t1">
<h1 class="biz-page-title embossed-text-white shortenough">A Plus</h1>
<div class="u-inline-block">
<h1 class="biz-page-title\ embossed-text-white\ shortenough">Computers</h1>
<div class="u-inline-block">
So I tried to get the text like this:
c = soup.findAll('h1',{"class":"biz-page-title embossed-text-white shortenough"})
print(c)
However I am getting an empty list
I have tried doing this as well:
c = soup.find('div', class_='u-inline-block').h1
I am getting a 'Nonetype' object not found.
Try this:
html = """
<div class="u-space-t1">
<h1 class="biz-page-title embossed-text-white shortenough">A Plus</h1>
<div class="u-inline-block">
<h1 class="biz-page-title\ embossed-text-white\ shortenough">Computers</h1>
<div class="u-inline-block">
"""
soup = bs4(html, 'lxml')
for i in soup.find_all('h1'):
print(i.text)
Output:
A Plus
Computers
Do it like this.
texts = soup.select("div > h1, div > div > h1")
for text in texts:
print(text.text)
"A Plus" and "Computers" will come out.

BeatifulSoup - Trying to get text inside span tags

I want to pull the text inside the span tags but when I try and use .text or get_text() I get errors (either after print spans or in the for loop). What am I missing? I have it set just now to just do this for the first div of class col, just to test if it is working, but I will want it to work for the 2nd aswell.
Thanks
My Code -
premier_soup1 = player_soup.find('div', {'class': 'row-table details -bp30'})
premier_soup_tr = premier_soup1.find_all('div', {'class': 'col'})
for x in premier_soup_tr[0]:
spans = x.find('span')
print (spans)
Output
-1
<span itemprop="name">Alisson Ramses Becker</span>
-1
<span itemprop="birthDate">02/10/1992</span>
-1
<span itemprop="nationality"> Brazil</span>
-1
>>>
The HTML
<div class="col">
<p>Name: <strong><span itemprop="name">Alisson Ramses Becker</span> </strong></p>
<p>Date of birth:<span itemprop="birthDate">02/10/1992</span></p>
<p>Place of birth:<span itemprop="nationality"> Brazil</span></p>
</div>
<div class="col">
<p>Club: <span itemprop="affiliation">Liverpool</span></p>
<p>Squad: 13</p><p>Position: Goal Keeper</p>
</div>
If you just want the text in the spans you can search specifically for the spans:
soup = BeautifulSoup(html, 'html.parser')
spans = soup.find_all('span')
for span in spans:
print(span.text)
If you want to find the spans with the specific divs, then you can do:
divs = soup.find_all( 'div', {'class': 'col'})
for div in divs:
spans = div.find_all('span')
for span in spans:
print(span.text)
If you just want all of the values after the colons, you can search for the paragraph tags:
soup = BeautifulSoup(html, 'html.parser')
divs = soup.find_all( 'div', {'class': 'col'})
for div in divs:
ps = div.find_all('p')
for p in ps:
print(p.text.split(":")[1].strip())
Kyle's answer is good, but to avoid printing the same value multiple times like you said happened, you need to change up the logic a little bit. First you parse and add all matches you find to a list and THEN you loop through the list with all the matches and print them.
Another thing that you may have to consider is this problem:
<div class=col>
<div class=col>
<span/>
</div>
</div>
By using a list instead of printing right away, you can handle any matches that are identical to any existing records
in the above html example you can see how the span could be added twice with how you find matches in the answer suggested by Kyle. It's all about making sure you create a logic that will only find the matches you need. How you do it is often/always dependant on how the html is formatted, but its also important to be creative!
Good luck.