web crawling: how to retrieve only the number from a text-and-number combination - html

How can I scrape only the number from this HTML? In this example, I want the output to be '7'.
<div class="pagination">
7 posts • Page <strong>1</strong> of <strong>1</strong>
</div>
Here's my code:
for num_replys in soup.findAll('div', {'class': 'pagination'}):
    print(num_reply)

You could use re, for example, assuming you always have a number, a space, then "posts" as the pattern. You could possibly use split as well. Note that your loop variable needs to keep the same name (you define num_replys but print num_reply), and you want to work with its .text value.
import re
from bs4 import BeautifulSoup as bs

html = '''
<div class="pagination">
7 posts • Page <strong>1</strong> of <strong>1</strong>
</div>
'''

p = re.compile(r'(\d+)\s+posts')
soup = bs(html, 'lxml')
for num_reply in soup.findAll('div', {'class': 'pagination'}):
    print(int(p.findall(num_reply.text)[0]))
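The split alternative mentioned above might look like this (a minimal sketch, assuming the post count is always the first token of the div's text):

```python
from bs4 import BeautifulSoup

html = '''
<div class="pagination">
7 posts • Page <strong>1</strong> of <strong>1</strong>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
for num_reply in soup.findAll('div', {'class': 'pagination'}):
    # the first whitespace-separated token of the text is the post count
    count = int(num_reply.text.split()[0])
    print(count)
```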

Related

How do I search for an attribute using BeautifulSoup?

I am trying to scrape a page that contains the following HTML.
<div class="FeedCard urn:publicid:ap.org:db2b278b7e4f9fea9a2df48b8508ed14 Component-wireStory-0-2-116 card-0-2-117" data-key="feed-card-wire-story-with-image" data-tb-region-item="true">
<div class="FeedCard urn:publicid:ap.org:2f23aa3df0f2f6916ad458785dd52c59 Component-wireStory-0-2-116 card-0-2-117" data-key="feed-card-wire-story-with-image" data-tb-region-item="true">
As you can see, "FeedCard " is something they have in common. Therefore, I am trying to use a regular expression in conjunction with BeautifulSoup. Here is the code I've tried.
pattern = r"\AFeedCard"
for card in soup.find('div', 'class'==re.compile(pattern)):
    print(card)
    print('**********')
I'm expecting it to give me each one of the divs from above, with the asterisks separating them. Instead, it is giving me the entire HTML of the page in a single instance.
Thank you,
No need to use a regular expression here. Just use a CSS selector or the BS4 API:
from bs4 import BeautifulSoup
html = """\
<div class="FeedCard urn:publicid:ap.org:db2b278b7e4f9fea9a2df48b8508ed14 Component-wireStory-0-2-116 card-0-2-117" data-key="feed-card-wire-story-with-image" data-tb-region-item="true">
Item 1
</div>
<div class="FeedCard urn:publicid:ap.org:2f23aa3df0f2f6916ad458785dd52c59 Component-wireStory-0-2-116 card-0-2-117" data-key="feed-card-wire-story-with-image" data-tb-region-item="true">
Item 2
</div>
"""
soup = BeautifulSoup(html, "html.parser")
for card in soup.select(".FeedCard"):
    print(card.text.strip())
Prints:
Item 1
Item 2
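For completeness, the regex route from the question can work too, but the pattern has to be passed as the value of the class filter rather than compared with ==. A sketch against the same sample HTML (attribute values shortened):

```python
import re
from bs4 import BeautifulSoup

html = """\
<div class="FeedCard Component-wireStory-0-2-116 card-0-2-117">
Item 1
</div>
<div class="FeedCard Component-wireStory-0-2-116 card-0-2-117">
Item 2
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# class_ matches the regex against each individual class token
cards = soup.find_all("div", class_=re.compile(r"\AFeedCard"))
for card in cards:
    print(card.text.strip())
```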

Getting specific span tag text in python (BeautifulSoup)

I'm scraping some information off MyAnimeList using BeautifulSoup on Python 3 and am trying to get information about a show's 'Status', but am having trouble accessing it.
Here is the html:
<h2>Information</h2>
<div>
<span class="dark_text">Type:</span>
Movie
</div>
<div class="spaceit">
<span class="dark_text">Episodes:</span>
1
</div>
<div>
<span class="dark_text">Status:</span>
Finished Airing
</div>
All of this is also contained within another div tag but I only included the portion of the html that I want to scrape. To clarify, I want to obtain the text 'Finished Airing' contained within 'Status'.
Here's the code I have so far but I'm not really sure if this is the best approach or where to go from here:
Page_soup = soup(Page_html, "html.parser")
extra_info = Page_soup.find('td', attrs={'class': 'borderClass'})
span_html = extra_info.select('span')
for i in range(len(span_html)):
    if 'Status:' in span_html[i].getText():
Any help would be appreciated, thanks!
To get the text next to the <span> with "Status:", you can use:
from bs4 import BeautifulSoup
html_doc = """
<h2>Information</h2>
<div>
<span class="dark_text">Type:</span>
Movie
</div>
<div class="spaceit">
<span class="dark_text">Episodes:</span>
1
</div>
<div>
<span class="dark_text">Status:</span>
Finished Airing
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
txt = soup.select_one('span:-soup-contains("Status:")').find_next_sibling(text=True)
print(txt.strip())
Prints:
Finished Airing
Or:
txt = soup.find("span", text="Status:").find_next_sibling(text=True)
print(txt.strip())
Another solution (maybe):
f = soup.find_all('span', attrs={'class': 'dark_text'})
for i in f:
    if i.text == 'Status:':
        # print only the text node right after the label span,
        # not the whole parent (which would include 'Status:' itself)
        print(i.next_sibling.strip())
And change 'Status:' to whatever other thing you want to find.
Hope I helped!
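If you need several of those fields, a small variant of the last loop could collect every label/value pair into a dict (a sketch against the same sample HTML):

```python
from bs4 import BeautifulSoup

html_doc = """
<div>
<span class="dark_text">Type:</span>
Movie
</div>
<div class="spaceit">
<span class="dark_text">Episodes:</span>
1
</div>
<div>
<span class="dark_text">Status:</span>
Finished Airing
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")
info = {}
for span in soup.find_all("span", class_="dark_text"):
    # the value is the text node immediately after the label span
    info[span.text.rstrip(":")] = span.next_sibling.strip()
print(info["Status"])
```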

Scraping from specific div without attribute where tags can have the same name

I am trying to scrape a table from wowhead. The issue is that the span classes are the same for 2 different types of data (Sell for: and Buy for:).
The divs that the spans sit under have no class, just the strings I wrote in the brackets.
I've tried
import requests
from bs4 import BeautifulSoup
import urllib.request
import re
import lxml
session = requests.session()
url1 = 'https://classicdb.ch/?item=4291'
response = session.get(url1)
soup = BeautifulSoup(response.text, 'lxml')
x=(soup.find('table', attrs={'class': "infobox"}))
y=x.find('td')
y=y.find('ul')
sell_silver = soup.select_one('div:contains("Sells for:") .moneysilver').text
buy_silver = y.select_one('div:contains("Buy for:") .moneysilver').text
print(sell_silver)
print(buy_silver)
but then I only get the first span.
The relevant HTML after I get the table looks like this:
<div>
Buy for:
<span class="moneysilver">5</span>
</div>
</li>
<li>
<div>
Sells for:
<span class="moneysilver">1</span> <span class="moneycopper">25</span>
</div>
....
The end result should allow me to sort the data into
Buy_silver=5
Sell_silver=1
Edit: to clarify the question, with a shoutout to @QHarr.
With BS4 4.7.1+ you can use :contains to target the li by "Buy for" or "Sells for":
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://classicdb.ch/?item=4291')
soup = bs(r.content, 'lxml')
buy_silver = soup.select_one('li:contains("Buy for") .moneysilver').text
sell_silver = soup.select_one('li:contains("Sells for") .moneysilver').text
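Since that snippet depends on the live page, here is the same :contains idea run against the sample HTML from the question (a self-contained sketch):

```python
from bs4 import BeautifulSoup as bs

metal = """
<li>
<div>
Buy for:
<span class="moneysilver">5</span>
</div>
</li>
<li>
<div>
Sells for:
<span class="moneysilver">1</span> <span class="moneycopper">25</span>
</div>
</li>
"""

soup = bs(metal, 'html.parser')
# target each li by its label text, then pick the silver span inside it
buy_silver = soup.select_one('li:contains("Buy for") .moneysilver').text
sell_silver = soup.select_one('li:contains("Sells for") .moneysilver').text
print(buy_silver, sell_silver)
```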
Assuming that the buy and sell quotes are always in the same order and distance from each other, you can try to use this:
metal = """
<li>
<div>
Buy for:
<span class="moneysilver">5</span>
</div>
</li>
<li>
<div>
Sells for:
<span class="moneysilver">1</span> <span class="moneycopper">25</span>
</div>
</li>
"""
from bs4 import BeautifulSoup as bs
soup = bs(metal, 'lxml')
silver=soup.find_all('div')
print("buy silver =",silver[0].find("span", class_="moneysilver").text)
print("sell silver =",silver[1].find("span", class_="moneysilver").text)
Output:
buy silver = 5
sell silver = 1

How do I get the text inside a class while ignoring the text of the nested class inside it

I'm trying to get the text inside the class="hardfact", but I'm also getting the text of the class="hardfactlabel color_f_03" because that class is nested inside hardfact.
.text.strip() gets the text of both classes because they are nested.
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
import requests
import lxml
my_url = 'https://www.immowelt.de/expose/2QC5D4A?npv=52'
page = requests.get(my_url)
ct = soup(page.text, 'lxml')
specs = ct.find('div', class_="hardfacts clear").findAll('div', class_="hardfact")
for items in specs:
    e = items.text.strip()
    print(e)
I'm getting this
82.500 € 
Kaufpreis
47 m²
Wohnfläche (ca.)
1
Zimmer
and i want this
82.500 €
47 m²
1
Here is the html content you are trying to crawl:
<div class="hardfact ">
<strong>82.500 € </strong>
<div class="hardfactlabel color_f_03">
Kaufpreis
</div>
</div>
<div class="hardfact ">
47 m²
<div class="hardfactlabel color_f_03">
Wohnfläche (ca.)
</div>
</div>
<div class="hardfact rooms">
1
<div class="hardfactlabel color_f_03">
Zimmer
</div>
</div>
What you want to achieve is to remove the div tags within, so you can just decompose the div:
for items in specs:
    items.div.decompose()
    e = items.text.strip()
    print(e)
If every "hardfact" div wrapped its value in a tag the way the first one wraps its price in a "strong" tag, you could just grab the first child element like so
e = items.find().text.strip()
but the other divs hold their value as bare text, so items.find() would return the nested label div instead; that's why you have to decompose the div tag.
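A runnable version of the decompose approach against the snippet above (a sketch; the wrapping "hardfacts clear" div is added here so the question's find call has something to match):

```python
from bs4 import BeautifulSoup as soup

page_html = """
<div class="hardfacts clear">
<div class="hardfact ">
<strong>82.500 € </strong>
<div class="hardfactlabel color_f_03">
Kaufpreis
</div>
</div>
<div class="hardfact ">
47 m²
<div class="hardfactlabel color_f_03">
Wohnfläche (ca.)
</div>
</div>
<div class="hardfact rooms">
1
<div class="hardfactlabel color_f_03">
Zimmer
</div>
</div>
</div>
"""

ct = soup(page_html, 'html.parser')
specs = ct.find('div', class_="hardfacts clear").findAll('div', class_="hardfact")
values = []
for items in specs:
    items.div.decompose()   # remove the nested label div
    values.append(items.text.strip())
print(values)
```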
You can use stripped strings. You probably want to add a condition ensuring the list has at least three items before slicing it.
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.immowelt.de/expose/2QC5D4A?npv=52')
soup = bs(r.content, 'lxml')
items = soup.select('.hardfact')[:3]
for item in items:
    strings = [string for string in item.stripped_strings]
    print(strings[0])

removing elements from html using BeautifulSoup and Python 3

I'm scraping data from the web and trying to remove all elements that have tag 'div' and class 'notes module' like this html below:
<div class="notes module" role="complementary">
<h3 class="heading">Notes:</h3>
<ul class="associations">
<li>
Translation into Русский available:
Два-два-один Браво Бейкер by <a rel="author" href="/users/dzenka/pseuds/dzenka">dzenka</a>, <a rel="author" href="/users/La_Ardilla/pseuds/La_Ardilla">La_Ardilla</a>
</li>
</ul>
<blockquote class="userstuff">
<p>
<i>Warnings: numerous references to and glancing depictions of combat, injury, murder, and mutilation of the dead; deaths of minor and major original characters. Numerous explicit depictions of sex between two men.</i>
</p>
</blockquote>
<p class="jump">(See the end of the work for other works inspired by this one.)</p>
</div>
source is here: view-source:http://archiveofourown.org/works/180121?view_full_work=true
I'm struggling to even find and print the elements I want to delete. So far I have:
import urllib.request, urllib.parse, urllib.error
from lxml import html
from bs4 import BeautifulSoup
url = 'http://archiveofourown.org/works/180121?view_full_work=true'
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'lxml')
removals = soup.find_all('div', {'id':'notes module'})
for match in removals:
    match.decompose()
but removals returns an empty list. Can you help me select the entire div element that I've shown above so that I can select and remove all such elements from the html?
Thank you.
The div you are trying to find has class = "notes module", yet in your code you are trying to find those divs by id = "notes module".
Change this line:
removals = soup.find_all('div', {'id':'notes module'})
To this:
removals = soup.find_all('div', {'class':'notes module'})
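A quick self-contained check of the corrected class filter (a sketch; the wrapper div and paragraph are hypothetical stand-ins for the rest of the page):

```python
from bs4 import BeautifulSoup

page = """
<div id="main">
<div class="notes module" role="complementary">
<h3 class="heading">Notes:</h3>
</div>
<p>Story text that should survive.</p>
</div>
"""

soup = BeautifulSoup(page, 'html.parser')
# 'class' matching works here; 'id' would find nothing
removals = soup.find_all('div', {'class': 'notes module'})
for match in removals:
    match.decompose()
result = soup.get_text(" ", strip=True)
print(result)
```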
Give it a go. It will strip out all the divs under class='wrapper' on that webpage.
import requests
from bs4 import BeautifulSoup
html = requests.get('http://archiveofourown.org/works/180121?view_full_work=true')
soup = BeautifulSoup(html.text, 'lxml')
for item in soup.select(".wrapper"):
    [elem.extract() for elem in item("div")]
    print(item)