How do I search for an attribute using BeautifulSoup? - html

I am trying to scrape a page that contains the following HTML.
<div class="FeedCard urn:publicid:ap.org:db2b278b7e4f9fea9a2df48b8508ed14 Component-wireStory-0-2-116 card-0-2-117" data-key="feed-card-wire-story-with-image" data-tb-region-item="true">
<div class="FeedCard urn:publicid:ap.org:2f23aa3df0f2f6916ad458785dd52c59 Component-wireStory-0-2-116 card-0-2-117" data-key="feed-card-wire-story-with-image" data-tb-region-item="true">
As you can see, "FeedCard " is something they have in common. Therefore, I am trying to use a regular expression in conjunction with BeautifulSoup. Here is the code I've tried.
pattern = r"\AFeedCard"
for card in soup.find('div', 'class'==re.compile(pattern)):
    print(card)
    print('**********')
I'm expecting it to give me each one of the divs from above, with the asterisks separating them. Instead, it gives me the entire HTML of the page in a single instance.
Thank you,

There is no need for a regular expression here. Just use a CSS selector or the BS4 API:
from bs4 import BeautifulSoup
html = """\
<div class="FeedCard urn:publicid:ap.org:db2b278b7e4f9fea9a2df48b8508ed14 Component-wireStory-0-2-116 card-0-2-117" data-key="feed-card-wire-story-with-image" data-tb-region-item="true">
Item 1
</div>
<div class="FeedCard urn:publicid:ap.org:2f23aa3df0f2f6916ad458785dd52c59 Component-wireStory-0-2-116 card-0-2-117" data-key="feed-card-wire-story-with-image" data-tb-region-item="true">
Item 2
</div>
"""
soup = BeautifulSoup(html, "html.parser")
for card in soup.select(".FeedCard"):
    print(card.text.strip())
Prints:
Item 1
Item 2
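As for why the code in the question misbehaves: `'class'==re.compile(pattern)` is a comparison, which evaluates to `False` before BeautifulSoup ever sees it, so the regex is never used as a filter. Also, `find` returns a single tag, and looping over a tag walks its children. If you do want to keep the regex, a minimal sketch could pass it through the `class_` keyword of `find_all`. The shortened HTML below (including the `Other` card) is made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Shortened, made-up HTML modelled on the cards in the question.
html = """
<div class="FeedCard urn:a card-0-2-117">Item 1</div>
<div class="Other card-0-2-117">Skip me</div>
<div class="FeedCard urn:b card-0-2-117">Item 2</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns every matching tag; class_ accepts a compiled regex,
# which is tested against the individual class names of each div.
cards = soup.find_all("div", class_=re.compile(r"\AFeedCard"))

texts = [card.get_text(strip=True) for card in cards]
for text in texts:
    print(text)
    print("**********")
```

For a fixed class name like this one, the CSS selector remains the simpler choice; the regex is mainly useful when the class name itself varies.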

Related

How to select specific div using BeautifulSoup when multiple divs have the same class name no id tag?

Please help. I don't know how to select a specific div using BeautifulSoup when multiple divs have the same class name and no id attribute.
Web page that I am trying to scrape: https://www.helpmefind.com/rose/l.php?l=2.65689.
I want to select contents of specific divs independently and then pass to a csv file. Got stuck since find_all returns multiple divs and I don't know how to restrict further.
rose_div = rose.find_all("div", class_="hdg")
Returns:
[<div class="hdg">HMF Ratings:</div>, <div class="hdg">Origin:</div>, <div class="hdg">Class:</div>, <div class="hdg">Bloom:</div>, <div class="hdg">Parentage:</div>, <div class="hdg">Notes:</div>, <div class="hdg"> </div>]
I want to select the divs below individually:
<div class="hdg">Origin:</div>
<div class="hdg">Class:</div>
<div class="hdg">Bloom:</div>
<div class="hdg">Parentage:</div>
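You can use the CSS selector div.hdg:contains("Origin:") to select the <div> with class="hdg" that contains the text "Origin:". To get the next sibling element with class grp, append + .grp. In recent Soup Sieve releases, :contains is deprecated in favor of :-soup-contains, which is used the same way.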
For example:
import requests
from bs4 import BeautifulSoup
url = 'https://www.helpmefind.com/rose/l.php?l=2.65689'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
origin = soup.select_one('div.hdg:contains("Origin:") + .grp').text
class_ = soup.select_one('div.hdg:contains("Class:") + .grp').text
bloom = soup.select_one('div.hdg:contains("Bloom:") + .grp').text
parentage = soup.select_one('div.hdg:contains("Parentage:") + .grp').text
print(origin)
print(class_)
print(bloom)
print(parentage)
Prints:
Bred by Arai (Japan, before 2009).
Floribunda.  
Light pink and white, yellow stamens.  Single (4-8 petals), cluster-flowered bloom form.  Blooms in flushes throughout the season.  
If you know the parentage of this rose, or other details, please contact us.
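If you would rather not depend on a CSS pseudo-class, a rough alternative is to look up the heading by its text with `find` and then step to the following `.grp` sibling. This is a different technique from the selector above, not a drop-in copy of it. The inline HTML is a made-up stand-in for the page (only the hdg/grp pairing is taken from the selector above), and `value_for` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Made-up markup that mirrors the hdg/grp pairing implied by "+ .grp".
html = """
<div class="hdg">Origin:</div><div class="grp">Bred by Arai (Japan, before 2009).</div>
<div class="hdg">Class:</div><div class="grp">Floribunda.</div>
"""

soup = BeautifulSoup(html, "html.parser")


def value_for(soup, label):
    """Return the text of the .grp element following the heading, or None."""
    heading = soup.find("div", class_="hdg", string=label)
    if heading is None:
        return None
    value = heading.find_next_sibling(class_="grp")
    return value.get_text(strip=True) if value else None


# "Bloom" is absent from the sample, so it comes back as None instead of raising.
fields = {name: value_for(soup, name + ":") for name in ("Origin", "Class", "Bloom")}
print(fields)
```

Returning None for a missing heading avoids the AttributeError that `.text` on a failed `select_one` would raise, which helps when writing rows to a CSV for pages that lack some fields.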

web crawling:how to retrieve number only among text-and number combination

How can I scrape only the number from this HTML? In this example, I want the output to be '7'.
<div class="pagination">
7 posts • Page <strong>1</strong> of <strong>1</strong>
</div>
Here's my code:
for num_replys in soup.findAll('div', {'class': 'pagination'}):
    print(num_reply)
You could use re, for example, assuming the pattern is always a number, whitespace, then "posts". You could possibly use split as well. You need to use the same name for the loop variable inside the loop, and you want to work with its .text value.
import re
from bs4 import BeautifulSoup as bs
html = '''
<div class="pagination">
7 posts • Page <strong>1</strong> of <strong>1</strong>
</div>
'''
# Capture the digits that come right before the word "posts".
p = re.compile(r'(\d+)\s+posts')
soup = bs(html, 'lxml')
for num_reply in soup.find_all('div', {'class': 'pagination'}):
    print(int(p.findall(num_reply.text)[0]))
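The split approach mentioned above might look roughly like this. It assumes the count is always the first whitespace-separated token of the div's text, which holds for the sample but may not hold for every page.

```python
from bs4 import BeautifulSoup

html = '''
<div class="pagination">
7 posts • Page <strong>1</strong> of <strong>1</strong>
</div>
'''

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="pagination")

# split() with no arguments drops the surrounding newlines and splits on any whitespace.
first_token = div.get_text().split()[0]

# Guard against pages where the first token is not a number.
num_posts = int(first_token) if first_token.isdigit() else None
print(num_posts)
```

The regex version is more tolerant of text preceding the count; split is shorter when the layout is fixed.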

BeautifulSoup - Trying to get text inside span tags

I want to pull the text inside the span tags, but when I try to use .text or get_text() I get errors (either after print spans or in the for loop). What am I missing? For now I have it set to do this only for the first div of class col, just to test whether it works, but I will want it to work for the second one as well.
Thanks
My Code -
premier_soup1 = player_soup.find('div', {'class': 'row-table details -bp30'})
premier_soup_tr = premier_soup1.find_all('div', {'class': 'col'})
for x in premier_soup_tr[0]:
    spans = x.find('span')
    print(spans)
Output
-1
<span itemprop="name">Alisson Ramses Becker</span>
-1
<span itemprop="birthDate">02/10/1992</span>
-1
<span itemprop="nationality"> Brazil</span>
-1
>>>
The HTML
<div class="col">
<p>Name: <strong><span itemprop="name">Alisson Ramses Becker</span> </strong></p>
<p>Date of birth:<span itemprop="birthDate">02/10/1992</span></p>
<p>Place of birth:<span itemprop="nationality"> Brazil</span></p>
</div>
<div class="col">
<p>Club: <span itemprop="affiliation">Liverpool</span></p>
<p>Squad: 13</p><p>Position: Goal Keeper</p>
</div>
If you just want the text in the spans you can search specifically for the spans:
soup = BeautifulSoup(html, 'html.parser')
spans = soup.find_all('span')
for span in spans:
    print(span.text)
If you want to find the spans with the specific divs, then you can do:
divs = soup.find_all('div', {'class': 'col'})
for div in divs:
    spans = div.find_all('span')
    for span in spans:
        print(span.text)
If you just want all of the values after the colons, you can search for the paragraph tags:
soup = BeautifulSoup(html, 'html.parser')
divs = soup.find_all('div', {'class': 'col'})
for div in divs:
    ps = div.find_all('p')
    for p in ps:
        print(p.text.split(":")[1].strip())
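On the `-1` lines in the question's output: iterating directly over a tag yields all of its children, including the whitespace strings between the `<p>` elements. Those are NavigableString objects, which subclass `str`, so calling `.find('span')` on them runs the ordinary string search and returns -1. A small sketch of this, using a trimmed copy of the question's HTML:

```python
from bs4 import BeautifulSoup, Tag

html = """<div class="col">
<p>Name: <span itemprop="name">Alisson Ramses Becker</span></p>
</div>"""

soup = BeautifulSoup(html, "html.parser")
col = soup.find("div", class_="col")

# Children are: a newline string, the <p> tag, another newline string.
results = [child.find("span") for child in col]
print(results)

# Keeping only Tag children avoids calling str.find on text nodes.
tag_children = [child for child in col.children if isinstance(child, Tag)]
names = [
    child.find("span").get_text(strip=True)
    for child in tag_children
    if child.find("span") is not None
]
print(names)
```

That is also why `.text` failed in the original loop: an int has no such attribute. Using `find_all('span')` on the div, as shown above, sidesteps the issue entirely.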
Kyle's answer is good, but to avoid printing the same value multiple times like you said happened, you need to change up the logic a little bit. First you parse and add all matches you find to a list and THEN you loop through the list with all the matches and print them.
Another thing that you may have to consider is this problem:
<div class=col>
<div class=col>
<span/>
</div>
</div>
By using a list instead of printing right away, you can handle any matches that are identical to existing records.
In the HTML example above, you can see how the span could be added twice with the approach from Kyle's answer, because both the outer and the inner div match. It's all about making sure your logic finds only the matches you need. How you do that usually depends on how the HTML is structured, but it's also important to be creative!
Good luck.
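One possible way to sketch the collect-then-deduplicate idea, assuming nested `col` divs like the example above. Duplicates are tracked by object identity via `id()`, since the same span object is reached through both the outer and the inner div; the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Nested divs with the same class, as in the problem case described above.
html = '<div class="col"><div class="col"><span>inner</span></div></div>'
soup = BeautifulSoup(html, "html.parser")

naive = []        # every span found, duplicates included
unique = []       # each span object only once
seen_ids = set()  # identities of spans already collected

for div in soup.find_all("div", class_="col"):
    for span in div.find_all("span"):
        naive.append(span)
        if id(span) not in seen_ids:
            seen_ids.add(id(span))
            unique.append(span)

# Print only after collecting, so each value appears once.
for span in unique:
    print(span.get_text(strip=True))
```

Identity is used rather than `==` because two distinct spans with identical markup compare equal in BeautifulSoup, and you may still want both of those.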

Python 3: How to web scrape text from div that contains multiple class values

I'm trying to scrape a website (Here is the link to website), but the div in the page has multiple class values, which makes it hard for me to scrape the data. I looked through earlier questions on Stack Overflow but could not find the answer I needed. Below is part of the code I extracted from the website:
<div data-reactid="118">
<div class="ue-ga base_ ue-jk" style="margin-left:-24px;margin-bottom:;" data-reactid="119">
<div style="display: flex; flex-direction: column; width: 100%; padding-left: 24px;" data-reactid="120">
<div class="ue-a3 ue-ap ue-a6 ue-gb ue-ah ue-n ue-f5 ue-ec ue-gc ue-gd ue-ge ue-gf base_ ue-jv ue-gz ue-h0 ue-h1" data-reactid="121">
<div class="ue-a6 ue-bz ue-gb ue-ah ue-gg ue-gh ue-gi" data-reactid="122">
<div class="ue-bn ue-bo ue-cc ue-bq ue-g9 ue-bs" title="Want to extract this part" data-reactid="123">
Want to extract this part
</div>
</div>
</div>
</div>
</div>
</div>
What I want to extract is the text "Want to extract this part". I considered selecting the element by data-reactid, but different pages assign different data-reactid numbers, so that is not a reliable option. Note also that the class names are not unique.
Can anyone guide me through this? Much appreciated.
If the classes always remain the same for that specific element on each page, you can target it with this selector:
.ue-bn.ue-bo.ue-cc.ue-bq.ue-g9.ue-bs
However, there are many other selectors you could use but it all depends on if they are unique and consistent across pages.
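In BeautifulSoup, that selector could be applied with `select_one`, roughly as sketched below on a trimmed copy of the question's HTML. The `div[title]` lookup is shown as one of the other possible selectors; whether either one is unique on the real page is an assumption you would need to check.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question.
html = """<div class="ue-a6 ue-bz ue-gb ue-ah ue-gg ue-gh ue-gi" data-reactid="122">
<div class="ue-bn ue-bo ue-cc ue-bq ue-g9 ue-bs" title="Want to extract this part" data-reactid="123">
Want to extract this part
</div>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# Chained class selectors match an element carrying all listed classes, in any order.
tag = soup.select_one(".ue-bn.ue-bo.ue-cc.ue-bq.ue-g9.ue-bs")
text = tag.get_text(strip=True) if tag is not None else None
print(text)

# Alternative: rely on the title attribute instead of the generated class names.
titled = soup.select_one("div[title]")
title = titled["title"] if titled is not None else None
print(title)
```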
This may help you
from bs4 import BeautifulSoup
html = """<div data-reactid="118">
<div class="ue-ga base_ ue-jk" style="margin-left:-24px;margin-bottom:;" data-reactid="119">
<div style="display: flex; flex-direction: column; width: 100%; padding-left: 24px;" data-reactid="120">
<div class="ue-a3 ue-ap ue-a6 ue-gb ue-ah ue-n ue-f5 ue-ec ue-gc ue-gd ue-ge ue-gf base_ ue-jv ue-gz ue-h0 ue-h1" data-reactid="121">
<div class="ue-a6 ue-bz ue-gb ue-ah ue-gg ue-gh ue-gi" data-reactid="122">
<div class="ue-bn ue-bo ue-cc ue-bq ue-g9 ue-bs" title="Want to extract this part" data-reactid="123">
Want to extract this part
</div>
</div>
</div>
</div>
</div>
</div>"""
soup = BeautifulSoup(html, 'html.parser')
tag = soup.find('div', attrs={'class': 'ue-bn'})
text = ''.join(tag.stripped_strings)
print(text)
You can use jQuery as below. The attribute value contains spaces, so it must be quoted.
$("div[title='Want to extract this part']").text();
Menus:
- all menus to use in loop, css selector: div.base_ h3
- menu by name, xpath: //div[contains(@class,'base_')]//h3[.='Big Mac® Bundles']
Food Cards:
- titles, css selector: div[title]
- titles, xpath: //div[./div[@title]]/div[@title]
- prices, xpath: //div[./div[@title]]//span
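If you want to loop (Selenium 4 syntax; the find_elements_by_* helpers were removed in newer releases, and driver is your existing WebDriver instance):
from selenium.webdriver.common.by import By
cards = driver.find_elements(By.XPATH, "//div[./div[@title]]")
for card in cards:
    title = card.find_element(By.CSS_SELECTOR, "div[title]")
    price = card.find_element(By.CSS_SELECTOR, "span")
    # or using xpath
    # title = card.find_element(By.XPATH, "./div[@title]")
    # price = card.find_element(By.XPATH, ".//span")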
Category menu:
- all categories, css selector: a[href*='category']
Per the HTML you have shared, the element is a React element, so to extract the text Want to extract this part you need to use WebDriverWait to wait until the element is visible. You can use either of the following solutions:
Using title attribute:
myText = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.base_ div[title]"))).get_attribute("title")
Using innerHTML:
myText = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.base_ div[title]"))).get_attribute("innerHTML")
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

removing elements from html using BeautifulSoup and Python 3

I'm scraping data from the web and trying to remove all elements that have tag 'div' and class 'notes module' like this html below:
<div class="notes module" role="complementary">
<h3 class="heading">Notes:</h3>
<ul class="associations">
<li>
Translation into Русский available:
Два-два-один Браво Бейкер by <a rel="author" href="/users/dzenka/pseuds/dzenka">dzenka</a>, <a rel="author" href="/users/La_Ardilla/pseuds/La_Ardilla">La_Ardilla</a>
</li>
</ul>
<blockquote class="userstuff">
<p>
<i>Warnings: numerous references to and glancing depictions of combat, injury, murder, and mutilation of the dead; deaths of minor and major original characters. Numerous explicit depictions of sex between two men.</i>
</p>
</blockquote>
<p class="jump">(See the end of the work for other works inspired by this one.)</p>
</div>
source is here: view-source:http://archiveofourown.org/works/180121?view_full_work=true
I'm struggling to even find and print the elements I want to delete. So far I have:
import urllib.request, urllib.parse, urllib.error
from lxml import html
from bs4 import BeautifulSoup
url = 'http://archiveofourown.org/works/180121?view_full_work=true'
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'lxml')
removals = soup.find_all('div', {'id':'notes module'})
for match in removals:
    match.decompose()
but removals returns an empty list. Can you help me select the entire div element that I've shown above so that I can select and remove all such elements from the html?
Thank you.
The div you are trying to find has class="notes module", yet in your code you are trying to find those divs by id="notes module".
Change this line:
removals = soup.find_all('div', {'id':'notes module'})
To this:
removals = soup.find_all('div', {'class':'notes module'})
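One caveat: passing the string 'notes module' matches the class attribute as that exact string, so a div written as class="module notes" would be missed. If class order may vary, a CSS selector is a possible alternative. The sketch below uses a small made-up document rather than the live page.

```python
from bs4 import BeautifulSoup

# Made-up document: two notes blocks (classes in different order) around real content.
html = """
<div>
<div class="notes module" role="complementary"><h3 class="heading">Notes:</h3></div>
<div class="userstuff"><p>Story text</p></div>
<div class="module notes">Reordered classes</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# div.notes.module matches divs that have both classes, regardless of order.
removals = soup.select("div.notes.module")
for match in removals:
    match.decompose()  # remove the tag and everything inside it from the tree

remaining = soup.get_text(" ", strip=True)
print(remaining)
```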
Give this a go. It removes every div nested under the elements with class='wrapper'. Note that this strips all nested divs, not only the notes module ones.
import requests
from bs4 import BeautifulSoup
html = requests.get('http://archiveofourown.org/works/180121?view_full_work=true')
soup = BeautifulSoup(html.text, 'lxml')
for item in soup.select(".wrapper"):
    # item("div") is shorthand for item.find_all("div")
    for elem in item("div"):
        elem.extract()
    print(item)