Python 3: How to web scrape text from div that contains multiple class values - html

I'm trying to web scrape a website (Here is the link to website), but the div in the page seems to have multiple class attributes which is making me hard to scrape the data. I tried to look for historical questions posted on Stackoverflow, but could not find an answer that I wanted. The below is part of the code I extracted from the website:
<div data-reactid="118">
<div class="ue-ga base_ ue-jk" style="margin-left:-24px;margin-bottom:;" data-reactid="119">
<div style="display: flex; flex-direction: column; width: 100%; padding-left: 24px;" data-reactid="120">
<div class="ue-a3 ue-ap ue-a6 ue-gb ue-ah ue-n ue-f5 ue-ec ue-gc ue-gd ue-ge ue-gf base_ ue-jv ue-gz ue-h0 ue-h1" data-reactid="121">
<div class="ue-a6 ue-bz ue-gb ue-ah ue-gg ue-gh ue-gi" data-reactid="122">
<div class="ue-bn ue-bo ue-cc ue-bq ue-g9 ue-bs" title="Want to extract this part" data-reactid="123">
Want to extract this part
</div>
</div>
</div>
</div>
</div>
</div>
What I want to extract is the text where it states "Want to extract this part". I did think of scraping the data through data-reactid, but different pages have different data-reactid number assigned so wasn't a good idea. I also want to inform that class names are not unique.
Can anyone guide me through this? Much appreciated.

If the classes always remain the same for that specific element on each page you can target it with this selector:
.ue-bn.ue-bo.ue-cc.ue-bq.ue-g9.ue-bs
However, there are many other selectors you could use but it all depends on if they are unique and consistent across pages.

This may help you
from bs4 import BeautifulSoup
html = """<div data-reactid="118">
<div class="ue-ga base_ ue-jk" style="margin-left:-24px;margin-bottom:;" data-reactid="119">
<div style="display: flex; flex-direction: column; width: 100%; padding-left: 24px;" data-reactid="120">
<div class="ue-a3 ue-ap ue-a6 ue-gb ue-ah ue-n ue-f5 ue-ec ue-gc ue-gd ue-ge ue-gf base_ ue-jv ue-gz ue-h0 ue-h1" data-reactid="121">
<div class="ue-a6 ue-bz ue-gb ue-ah ue-gg ue-gh ue-gi" data-reactid="122">
<div class="ue-bn ue-bo ue-cc ue-bq ue-g9 ue-bs" title="Want to extract this part" data-reactid="123">
Want to extract this part
</div>
</div>
</div>
</div>
</div>
</div>"""
soup = BeautifulSoup(html,'html.parser')
tag = soup.find('div', attrs={'class':'ue-bn'})
text = (''.join(tag.stripped_strings))
print (text)

you can use jQuery as below.
$("div[title=Want to extract this part]").text();

Menus:
- all menus to use in loop, css selector: div.base_ h3
- menu by name, xpath: //div[contains(#class,'base_')]//h3[.='Big Mac® Bundles']
Food Cards
- titles, css selector: div[title]
- titles, xpath: //div[./div[#title]]/div[#title]
- prices, xpath: //div[./div[#title]]//span
If you want to loop:
cards = driver.find_elements_by_xpath("//div[./div[#title]]")
for card in cards:
title = card.find_element_by_css_selector("div[title]")
price = card.find_element_by_css_selector("span")
#or using xpath
#title = card.find_element_by_xpath("./div[#title]")
#price = card.find_element_by_xpath(".//span")
Category menu:
- all categories, css selector: a[href*='category']

As per the HTML you have shared to extract the text Want to extract this part as the element is a React element you have to induce WebDriverWait for the element to be visible and you can use either of the following solution:
Using title attribute:
myText = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.base_ div[title]"))).get_attribute("title")
Using innerHTML:
myText = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.base_ div[title]"))).get_attribute("innerHTML")
Note : You have to add the following imports :
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

Related

How do I search for an attribute using BeautifulSoup?

I am trying to scrape a that contains the following HTML.
<div class="FeedCard urn:publicid:ap.org:db2b278b7e4f9fea9a2df48b8508ed14 Component-wireStory-0-2-116 card-0-2-117" data-key="feed-card-wire-story-with-image" data-tb-region-item="true">
<div class="FeedCard urn:publicid:ap.org:2f23aa3df0f2f6916ad458785dd52c59 Component-wireStory-0-2-116 card-0-2-117" data-key="feed-card-wire-story-with-image" data-tb-region-item="true">
As you can see, "FeedCard " is something they have in common. Therefore, I am trying to use a regular expression in conjunction with BeautifulSoup. Here is the code I've tried.
pattern = r"\AFeedCard"
for card in soup.find('div', 'class'==re.compile(pattern)):
print(card)
print('**********')
I'm expecting it to give me each on of the divs from above, with the asterisks separating them. Instead it is giving me the entire HTML of the page in a single instance
Thank you,
No need to use regular expression here. Just use CSS selector or BS4 Api:
from bs4 import BeautifulSoup
html = """\
<div class="FeedCard urn:publicid:ap.org:db2b278b7e4f9fea9a2df48b8508ed14 Component-wireStory-0-2-116 card-0-2-117" data-key="feed-card-wire-story-with-image" data-tb-region-item="true">
Item 1
</div>
<div class="FeedCard urn:publicid:ap.org:2f23aa3df0f2f6916ad458785dd52c59 Component-wireStory-0-2-116 card-0-2-117" data-key="feed-card-wire-story-with-image" data-tb-region-item="true">
Item 2
</div>
"""
soup = BeautifulSoup(html, "html.parser")
for card in soup.select(".FeedCard"):
print(card.text.strip())
Prints:
Item 1
Item 2

How to get a div or span class from a related span class?

I've found the lowest class: <span class="pill css-1a10nyx e1pqc3131"> of multiple elements of a website but now I want to find the related/linked upper-class so for example the highest <div class="css-1v73czv eh8fd9011" xpath="1">. I've got the soup but can't figure out a way to get from the 'lowest' class to the 'highest' class, any idea?
<div class="css-1v73czv eh8fd9011" xpath="1">
<div class="css-19qortz eh8fd9010">
<header class="css-1idy7oy eh8fd909">
<div class="css-1rkuvma eh8fd908">
<footer class="css-f9q2sp eh8fd907">
<span class="pill css-1a10nyx e1pqc3131">
End result would be:
INPUT- Search on on all elements of a page with class <span class="pill css-1a10nyx e1pqc3131">(lowest)
OUTPUT - Get all related titles or headers of said class.
I've tried it with if-statements but that doesn't work consistently. Something with an if class = (searchable class) then get (desired higher class) should work.
I can add any more details if needed please let me know, thanks in advance!
EDIT: Picture per clarification where the title(highest class) = "Wooferland Festival 2022" and the number(lowest class) = 253
As mentioned, question needs some more information, to give a concret answer.
Assuming you like to scrape the information in the picture based on your example HTML you select your pill and use .find_previous() to locate your elements:
for e in soup.select('span.pill'):
print(e.find_previous('header').text)
print(e.find_previous('div').text)
print(e.text)
Assuming there is a cotainer tag in HTML structure like <a> or other you would select this based on the condition, that it contains a <span> wit class pill:
for e in soup.select('a:has(span.pill)'):
print(e.header.text)
print(e.header.next.text)
print(e.footer.span.text)
Note: Instead of using css classes, that can be highly dynamic, try use more static attributes or the HTML structure.
Example
See both options, for first one the <a> do not matter.
from bs4 import BeautifulSoup
html='''
<a>
<div class="css-1v73czv eh8fd9011" xpath="1">
<div class="css-19qortz eh8fd9010">
<header class="css-1idy7oy eh8fd909">some date information</header>
<div class="css-1rkuvma eh8fd908">some title</div>
<footer class="css-f9q2sp eh8fd907">
<span class="pill css-1a10nyx e1pqc3131">some number</span>
<footer>
</div>
</div>
</a>
'''
soup = BeautifulSoup(html)
for e in soup.select('span.pill'):
print(e.find_previous('header').text)
print(e.find_previous('div').text)
print(e.text)
print('---------')
for e in soup.select('a:has(span.pill)'):
print(e.header.text)
print(e.header.next.text)
print(e.footer.span.text)
Output
some date information
some title
some number
---------
some date information
some date information
some number

How to select specific div using BeautifulSoup when multiple divs have the same class name no id tag?

Please help I don't know how to select specific div using BeautifulSoup when multiple divs have the same class name no id tag.
Web page that I am trying to scrape: https://www.helpmefind.com/rose/l.php?l=2.65689.
I want to select contents of specific divs independently and then pass to a csv file. Got stuck since find_all returns multiple divs and I don't know how to restrict further.
rose_div = rose.find_all("div", class_="hdg")
Returns:
[<div class="hdg">HMF Ratings:</div>, <div class="hdg">Origin:</div>, <div class="hdg">Class:</div>, <div class="hdg">Bloom:</div>, <div class="hdg">Parentage:</div>, <div class="hdg">Notes:</div>, <div class="hdg"> </div>]
I want to select individually below divs:
<div class="hdg">Origin:</div>
<div class="hdg">Class:</div>
<div class="hdg">Bloom:</div>
<div class="hdg">Parentage:</div>
You can use CSS selector div.hdg:contains("Origin:") to select <div> with class="hdg" that contains word "Origing:". To get next element with class grp, you can add + .grp.
For example:
import requests
from bs4 import BeautifulSoup
url = 'https://www.helpmefind.com/rose/l.php?l=2.65689'
soup = BeautifulSoup( requests.get(url).content, 'html.parser' )
origin = soup.select_one('div.hdg:contains("Origin:") + .grp').text
class_ = soup.select_one('div.hdg:contains("Class:") + .grp').text
bloom = soup.select_one('div.hdg:contains("Bloom:") + .grp').text
parentage = soup.select_one('div.hdg:contains("Parentage:") + .grp').text
print(origin)
print(class_)
print(bloom)
print(parentage)
Prints:
Bred by Arai (Japan, before 2009).
Floribunda.  
Light pink and white, yellow stamens.  Single (4-8 petals), cluster-flowered bloom form.  Blooms in flushes throughout the season.  
If you know the parentage of this rose, or other details, please contact us.

BeatifulSoup - Trying to get text inside span tags

I want to pull the text inside the span tags but when I try and use .text or get_text() I get errors (either after print spans or in the for loop). What am I missing? I have it set just now to just do this for the first div of class col, just to test if it is working, but I will want it to work for the 2nd aswell.
Thanks
My Code -
premier_soup1 = player_soup.find('div', {'class': 'row-table details -bp30'})
premier_soup_tr = premier_soup1.find_all('div', {'class': 'col'})
for x in premier_soup_tr[0]:
spans = x.find('span')
print (spans)
Output
-1
<span itemprop="name">Alisson Ramses Becker</span>
-1
<span itemprop="birthDate">02/10/1992</span>
-1
<span itemprop="nationality"> Brazil</span>
-1
>>>
The HTML
<div class="col">
<p>Name: <strong><span itemprop="name">Alisson Ramses Becker</span> </strong></p>
<p>Date of birth:<span itemprop="birthDate">02/10/1992</span></p>
<p>Place of birth:<span itemprop="nationality"> Brazil</span></p>
</div>
<div class="col">
<p>Club: <span itemprop="affiliation">Liverpool</span></p>
<p>Squad: 13</p><p>Position: Goal Keeper</p>
</div>
If you just want the text in the spans you can search specifically for the spans:
soup = BeautifulSoup(html, 'html.parser')
spans = soup.find_all('span')
for span in spans:
print(span.text)
If you want to find the spans with the specific divs, then you can do:
divs = soup.find_all( 'div', {'class': 'col'})
for div in divs:
spans = div.find_all('span')
for span in spans:
print(span.text)
If you just want all of the values after the colons, you can search for the paragraph tags:
soup = BeautifulSoup(html, 'html.parser')
divs = soup.find_all( 'div', {'class': 'col'})
for div in divs:
ps = div.find_all('p')
for p in ps:
print(p.text.split(":")[1].strip())
Kyle's answer is good, but to avoid printing the same value multiple times like you said happened, you need to change up the logic a little bit. First you parse and add all matches you find to a list and THEN you loop through the list with all the matches and print them.
Another thing that you may have to consider is this problem:
<div class=col>
<div class=col>
<span/>
</div>
</div>
By using a list instead of printing right away, you can handle any matches that are identical to any existing records
in the above html example you can see how the span could be added twice with how you find matches in the answer suggested by Kyle. It's all about making sure you create a logic that will only find the matches you need. How you do it is often/always dependant on how the html is formatted, but its also important to be creative!
Good luck.

removing elements from html using BeautifulSoup and Python 3

I'm scraping data from the web and trying to remove all elements that have tag 'div' and class 'notes module' like this html below:
<div class="notes module" role="complementary">
<h3 class="heading">Notes:</h3>
<ul class="associations">
<li>
Translation into Русский available:
Два-два-один Браво Бейкер by <a rel="author" href="/users/dzenka/pseuds/dzenka">dzenka</a>, <a rel="author" href="/users/La_Ardilla/pseuds/La_Ardilla">La_Ardilla</a>
</li>
</ul>
<blockquote class="userstuff">
<p>
<i>Warnings: numerous references to and glancing depictions of combat, injury, murder, and mutilation of the dead; deaths of minor and major original characters. Numerous explicit depictions of sex between two men.</i>
</p>
</blockquote>
<p class="jump">(See the end of the work for other works inspired by this one.)</p>
</div>
source is here: view-source:http://archiveofourown.org/works/180121?view_full_work=true
I'm struggling to even find and print the elements I want to delete. So far I have:
import urllib.request, urllib.parse, urllib.error
from lxml import html
from bs4 import BeautifulSoup
url = 'http://archiveofourown.org/works/180121?view_full_work=true'
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'lxml')
removals = soup.find_all('div', {'id':'notes module'})
for match in removals:
match.decompose()
but removals returns an empty list. Can you help me select the entire div element that I've shown above so that I can select and remove all such elements from the html?
Thank you.
The div you are trying to find hasclass = "notes module", yet in your code you are trying to find those divs by id = "notes module".
Change this line:
removals = soup.find_all('div', {'id':'notes module'})
To this:
removals = soup.find_all('div', {'class':'notes module'})
Give it a go. It will kick out all available divs from that webpage under class='wrapper'.
import requests
from bs4 import BeautifulSoup
html = requests.get('http://archiveofourown.org/works/180121?view_full_work=true')
soup = BeautifulSoup(html.text, 'lxml')
for item in soup.select(".wrapper"):
[elem.extract() for elem in item("div")]
print(item)