Need to extract the contents of two HTML classes with the same name

I want to extract the contents of the following html code
<div class="card__content">
<div class="card__title">
Sajjad Haider Khan
<svg class="icon link__icon" height="10" width="10">
<use xlink:href="#icon-arrow-right">
</use>
</svg>
</div>
<div class="card__position">
Student,
<span class="card__text">
Computer Science Engineering (CSE)
</span>
</div>
<div class="card__position">
<span class="card__text">
Career Interests:
</span>
Computer Science; Python; Machine Learning
</div>
</div>
I need all the contents of both card__position divs. In the second card__position div, there is double-quoted text after the closing span tag, and I want to extract that text as well.
So far I can extract only the card__title and the contents of the first card__position div, using the following code.
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
url = '<url>'
uClient = uReq(url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
containers = page_soup.find_all("div", {"class": "card__content"})
container = containers[0]
title = container.findAll("div", {"class": "card__title"})
#print(title[0].text)
position = container.findAll("div", {"class": "card__position"})
#print(position[0].text)
However, I want to print the results in the following manner:
Sajjad Haider Khan
Student
Computer Science Engineering (CSE)
Career Interests: Computer Science; Python; Machine Learning

The code takes the right approach. However, I'd recommend separating it into two parts: fetching the HTML and scraping the results. The scraping code should return its results in some sort of response object. My example below uses a namedtuple (for ease of use).
Use find_all() if you wish to find more than one HTML element; otherwise use find() to get the first match. Also, findAll() is the legacy BeautifulSoup 3 spelling. It still works in bs4, but find_all() is the preferred name.
Also, this question only provides a snippet of the HTML, which makes it difficult to generalize. I assume the HTML always contains two divs with class card__position, and I use assert to raise an exception if this is not true.
Below is an example that achieves the desired output:
from collections import namedtuple
from bs4 import BeautifulSoup
ExtractResponse = namedtuple('ExtractResponse', ['Name', 'Role', 'Course', 'MiscInfoTitle', 'MiscInfoDetails']) # Stores the data from the extraction. Not too sure about the correct names.
html = '''<div class="card__content">
<div class="card__title">
Sajjad Haider Khan
<svg class="icon link__icon" height="10" width="10">
<use xlink:href="#icon-arrow-right">
</use>
</svg>
</div>
<div class="card__position">
Student,
<span class="card__text">
Computer Science Engineering (CSE)
</span>
</div>
<div class="card__position">
<span class="card__text">
Career Interests:
</span>
Computer Science; Python; Machine Learning
</div>
</div>'''
def extract(soup):
    # Use find() when you know there is one instance (e.g. <div class="card__content">)
    card_container_soup = soup.find("div", {"class": "card__content"})
    title_soup = card_container_soup.find("div", {"class": "card__title"})
    card_positions_soup = card_container_soup.find_all("div", {"class": "card__position"})
    # Validate that there are always 2 card__position divs. If there are not, the scrape results will be wrong.
    assert len(card_positions_soup) == 2
    # This code assumes the first card__position contains course data and the second contains interests
    card_text_1_soup = card_positions_soup[0].find('span', {'class': 'card__text'})
    card_text_1 = card_text_1_soup.text.strip()
    card_text_1_soup.extract()  # Remove the span so that the non-span text can be scraped
    card_text_2_soup = card_positions_soup[1].find('span', {'class': 'card__text'})
    card_text_2 = card_text_2_soup.text.strip()
    card_text_2_soup.extract()
    response = ExtractResponse(title_soup.text.strip(), card_positions_soup[0].text.strip().replace(',', ''), card_text_1, card_text_2, card_positions_soup[1].text.strip())
    return response
soup = BeautifulSoup(html, "html.parser")
results = extract(soup)
print(results.Name)
print(results.Role)
print(results.Course)
print(f"{results.MiscInfoTitle} {results.MiscInfoDetails}")
Output:
Sajjad Haider Khan
Student
Computer Science Engineering (CSE)
Career Interests: Computer Science; Python; Machine Learning
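As a possible alternative to removing the spans with extract(), here is a minimal sketch that reads every text fragment through stripped_strings and leaves the parsed tree untouched. The helper name card_lines and the rule "a fragment ending in a colon is a label" are my own assumptions, not something the question states, and the markup is a trimmed copy of the question's snippet (the svg icon is omitted).

```python
from bs4 import BeautifulSoup

# Trimmed copy of the card markup from the question (the <svg> icon is omitted).
html = '''<div class="card__content">
<div class="card__title">
Sajjad Haider Khan
</div>
<div class="card__position">
Student,
<span class="card__text">
Computer Science Engineering (CSE)
</span>
</div>
<div class="card__position">
<span class="card__text">
Career Interests:
</span>
Computer Science; Python; Machine Learning
</div>
</div>'''


def card_lines(card):
    """Hypothetical helper: return the card's text as a list of output lines."""
    lines = [card.find("div", class_="card__title").get_text(strip=True)]
    for position in card.find_all("div", class_="card__position"):
        # stripped_strings yields each text fragment, inside or outside the span
        parts = [s.rstrip(",") for s in position.stripped_strings]
        if parts and parts[0].endswith(":"):
            # Assumption: a leading label such as "Career Interests:" shares a line with its value
            lines.append(" ".join(parts))
        else:
            lines.extend(parts)
    return lines


soup = BeautifulSoup(html, "html.parser")
lines = card_lines(soup.find("div", class_="card__content"))
print("\n".join(lines))
```

Because nothing is removed from the soup, the same card can be parsed again later, and the code does not depend on there being exactly two card__position divs.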

Related

Retrieve all names from html tags using BeautifulSoup

I managed to set up Beautiful Soup and find the tags that I needed. How do I extract all the names in the tags?
tags = soup.find_all("a")
print(tags)
After running the above code, I got the following output
[Alfred the Great, <a class="mw-redirect" href="/wiki/Elizabeth_I_of_England" title="Elizabeth I of England">Queen Elizabeth I</a>, Family tree of Scottish monarchs, Kenneth MacAlpin]
How do I retrieve the names (Alfred the Great, Queen Elizabeth I, Kenneth MacAlpin, etc.)? Do I need to use a regular expression? Using .string gave me an error.
You can iterate over the tags and use tag.get('title') to get the title value.
Some other ways to do the same:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#attributes
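A short sketch of that idea, assuming markup like the links in the question (only two of them are reproduced here). Note that the title attribute and the visible link text can differ, so it shows both.

```python
from bs4 import BeautifulSoup

# Two of the links from the question, reduced to a minimal snippet.
html = '''
<a href="/wiki/Alfred_the_Great" title="Alfred the Great">Alfred the Great</a>,
<a class="mw-redirect" href="/wiki/Elizabeth_I_of_England" title="Elizabeth I of England">Queen Elizabeth I</a>
'''

soup = BeautifulSoup(html, "html.parser")

# tag.get('title') reads the attribute; it returns None if the attribute is missing
titles = [tag.get("title") for tag in soup.find_all("a")]
# get_text(strip=True) reads the visible link text instead
texts = [tag.get_text(strip=True) for tag in soup.find_all("a")]

print(titles)
print(texts)
```

Which of the two you want depends on whether the name you are after is the page title or the text shown in the link.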
No need to apply re. You can easily grab all the names by iterating over all the a tags and then reading the title attribute, or calling get_text() or .find(text=True).
html='''
<html>
<body>
<a href="/wiki/Alfred_the_Great" title="Alfred the Great">
Alfred the Great
</a>
,
<a class="mw-redirect" href="/wiki/Elizabeth_I_of_England" title="Elizabeth I of England">
Queen Elizabeth I
</a>
,
<a href="/wiki/Family_tree_of_Scottish_monarchs" title="Family tree of Scottish monarchs">
Family tree of Scottish monarchs
</a>
,
<a href="/wiki/Kenneth_MacAlpin" title="Kenneth MacAlpin">
Kenneth MacAlpin
</a>
</body>
</html>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
#print(soup.prettify())
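for name in soup.find_all('a'):
    # get_text() returns the visible link text, which matches the output below
    txt = name.get_text(strip=True)
    #OR
    #txt = name.get('title')  # title attribute, e.g. "Elizabeth I of England"
    print(txt)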
Output:
Alfred the Great
Queen Elizabeth I
Family tree of Scottish monarchs
Kenneth MacAlpin

Getting specific span tag text in python (BeautifulSoup)

I'm scraping some information off MyAnimeList using BeautifulSoup in Python 3 and am trying to get information about a show's 'Status', but am having trouble accessing it.
Here is the html:
<h2>Information</h2>
<div>
<span class="dark_text">Type:</span>
Movie
</div>
<div class="spaceit">
<span class="dark_text">Episodes:</span>
1
</div>
<div>
<span class="dark_text">Status:</span>
Finished Airing
</div>
All of this is also contained within another div tag but I only included the portion of the html that I want to scrape. To clarify, I want to obtain the text 'Finished Airing' contained within 'Status'.
Here's the code I have so far but I'm not really sure if this is the best approach or where to go from here:
Page_soup = soup(Page_html, "html.parser")
extra_info = Page_soup.find('td', attrs={'class': 'borderClass'})
span_html = extra_info.select('span')
for i in range(len(span_html)):
    if 'Status:' in span_html[i].getText():
        pass  # not sure where to go from here
Any help would be appreciated, thanks!
To get the text next to the <span> with "Status:", you can use:
from bs4 import BeautifulSoup
html_doc = """
<h2>Information</h2>
<div>
<span class="dark_text">Type:</span>
Movie
</div>
<div class="spaceit">
<span class="dark_text">Episodes:</span>
1
</div>
<div>
<span class="dark_text">Status:</span>
Finished Airing
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
txt = soup.select_one('span:-soup-contains("Status:")').find_next_sibling(text=True)
print(txt.strip())
Prints:
Finished Airing
Or:
txt = soup.find("span", text="Status:").find_next_sibling(text=True)
print(txt.strip())
Another possible solution:
f = soup.find_all('span', attrs={'class': 'dark_text'})
for i in f:
    if i.text == 'Status:':
        print(i.parent.text)  # prints the whole div text, including the 'Status:' label
And change 'Status:' to whatever other label you want to find.
Hope I helped!
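If you need more than one field, a possible variation on the answers above is to collect every label/value pair into a dict in one pass. This is only a sketch: it assumes each value is the bare text node that directly follows a span with class dark_text, as in the snippet from the question, and the variable name info is my own.

```python
from bs4 import BeautifulSoup

html_doc = """
<h2>Information</h2>
<div>
<span class="dark_text">Type:</span>
Movie
</div>
<div class="spaceit">
<span class="dark_text">Episodes:</span>
1
</div>
<div>
<span class="dark_text">Status:</span>
Finished Airing
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

info = {}
for span in soup.find_all("span", class_="dark_text"):
    label = span.get_text(strip=True).rstrip(":")
    sibling = span.next_sibling  # the node right after </span>
    # Text nodes are str subclasses; anything else (a tag, or None) yields an empty value
    info[label] = sibling.strip() if isinstance(sibling, str) else ""

print(info["Status"])
```

Fields whose value is wrapped in its own tag (a link, for example) would come back empty with this sketch and need separate handling.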

web crawling:how to retrieve number only among text-and number combination

How can I scrape only the number from this HTML? In this example, I want the output to be '7'.
<div class="pagination">
7 posts • Page <strong>1</strong> of <strong>1</strong>
</div>
Here's my code:
for num_replys in soup.findAll('div', {'class': 'pagination'}):
    print(num_reply)
You could use re, for example, assuming the pattern is always a number, whitespace, then "posts". You could possibly use split as well. You need to keep the loop variable name consistent (num_reply, not num_replys), and you want to work with its .text value.
import re
from bs4 import BeautifulSoup as bs
html = '''
<div class="pagination">
7 posts • Page <strong>1</strong> of <strong>1</strong>
</div>
'''
p = re.compile(r'(\d+)\s+posts')  # capture the digits that precede "posts"
soup = bs(html, 'lxml')
for num_reply in soup.findAll('div', {'class': 'pagination'}):
    print(int(p.findall(num_reply.text)[0]))
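The split alternative mentioned above could look roughly like this. It is a sketch that assumes the post count is always the first whitespace-separated token of the div's text; if the page ever puts other text first, the regex version is the safer choice. The list name counts is my own.

```python
from bs4 import BeautifulSoup

html = '''
<div class="pagination">
7 posts • Page <strong>1</strong> of <strong>1</strong>
</div>
'''

soup = BeautifulSoup(html, "html.parser")

counts = []
for num_reply in soup.find_all("div", {"class": "pagination"}):
    # split() with no arguments splits on any whitespace and drops empty pieces
    first_token = num_reply.get_text().split()[0]
    counts.append(int(first_token))

print(counts)
```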

Data Scraping using python And bs4

<a href="/news/2018/05/israeli-army-projectiles-fired-israel-gaza-180529051139606.html">
<h2 class="top-sec-title">
Israel launches counterattacks in Gaza amid soaring tensions
</h2>
</a>
I want to select the h2 by its class, "top-sec-title", and scrape the h2 text along with the href of the enclosing a.
The example below is what I have been dealing with so far. In that HTML the a tag has a class of its own, which let me get both the href and the text of its child element (an h3 in the case below):
<a class="gs-c-promo-heading gs-o-faux-block-link__overlay-link gel-pica-bold nw-o-link-split__anchor" href="/news/world-us-canada-44294366">
<h3 class="gs-c-promo-heading__title gel-pica-bold nw-o-link-split__text">
Hurricane Maria 'killed 4,600 in Puerto Rico'
</h3>
</a>
The code below is what I used to extract data from the html source above.
news = soup.find_all('a', attrs={'class':'gs-c-promo-heading gs-o-faux-block-link__overlay-link gel-pica-bold nw-o-link-split__anchor'})
for item in news:
    print(item.get('href'))
    print(item.text)
This will get you all the elements that enclose h2 elements with that class, which lets you read the href when the enclosing element is an a.
lst_of_h2 = soup.find_all('h2', {'class': 'top-sec-title'})
for h2 in lst_of_h2:
    h2.parent  # enclosing element
Code:
html = '''
<a href="/news/2018/05/israeli-army-projectiles-fired-israel-gaza-180529051139606.html">
<h2 class="top-sec-title">
Israel launches counterattacks in Gaza amid soaring tensions
</h2>
</a>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
a_tags = [h.parent for h in soup.select('.top-sec-title')]
for a in a_tags:
    print(a['href'])
    print(a.get_text(strip=True))
Output:
/news/2018/05/israeli-army-projectiles-fired-israel-gaza-180529051139606.html
Israel launches counterattacks in Gaza amid soaring tensions
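Both answers rely on the a being the direct parent of the h2. If that is not guaranteed, a hedged variation is to use find_parent('a'), which walks up through any intermediate wrappers and returns None when the heading is not inside a link. The markup below is invented for illustration (including the href and the extra span) and is not taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one heading wrapped in an extra <span>, one with no link at all.
html = '''
<a href="/news/example.html">
<span><h2 class="top-sec-title">Example headline</h2></span>
</a>
<h2 class="top-sec-title">Headline without a link</h2>
'''

soup = BeautifulSoup(html, "html.parser")

links = []
for h2 in soup.find_all("h2", class_="top-sec-title"):
    a = h2.find_parent("a")  # nearest enclosing <a>, however deep the nesting
    if a is None:
        continue  # skip headings that are not inside a link
    links.append((a["href"], h2.get_text(strip=True)))

print(links)
```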

removing elements from html using BeautifulSoup and Python 3

I'm scraping data from the web and trying to remove all elements that have tag 'div' and class 'notes module' like this html below:
<div class="notes module" role="complementary">
<h3 class="heading">Notes:</h3>
<ul class="associations">
<li>
Translation into Русский available:
Два-два-один Браво Бейкер by <a rel="author" href="/users/dzenka/pseuds/dzenka">dzenka</a>, <a rel="author" href="/users/La_Ardilla/pseuds/La_Ardilla">La_Ardilla</a>
</li>
</ul>
<blockquote class="userstuff">
<p>
<i>Warnings: numerous references to and glancing depictions of combat, injury, murder, and mutilation of the dead; deaths of minor and major original characters. Numerous explicit depictions of sex between two men.</i>
</p>
</blockquote>
<p class="jump">(See the end of the work for other works inspired by this one.)</p>
</div>
source is here: view-source:http://archiveofourown.org/works/180121?view_full_work=true
I'm struggling to even find and print the elements I want to delete. So far I have:
import urllib.request, urllib.parse, urllib.error
from lxml import html
from bs4 import BeautifulSoup
url = 'http://archiveofourown.org/works/180121?view_full_work=true'
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'lxml')
removals = soup.find_all('div', {'id':'notes module'})
for match in removals:
    match.decompose()
but removals returns an empty list. Can you help me select the entire div element that I've shown above so that I can select and remove all such elements from the html?
Thank you.
The div you are trying to find has class="notes module", yet in your code you are trying to find those divs by id="notes module".
Change this line:
removals = soup.find_all('div', {'id':'notes module'})
To this:
removals = soup.find_all('div', {'class':'notes module'})
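One caveat worth knowing: passing the string 'notes module' matches the class attribute only when it is written in exactly that form, so a div with class="module notes" would be missed. A CSS selector does not care about the order. Here is a minimal sketch on invented markup (the workskin wrapper and the reversed class order are assumptions for the demo, not copied from the real page):

```python
from bs4 import BeautifulSoup

# Hypothetical markup with the two classes in reversed order.
html = '''
<div id="workskin">
<div class="module notes" role="complementary"><h3 class="heading">Notes:</h3></div>
<div class="userstuff"><p>Story text</p></div>
</div>
'''

soup = BeautifulSoup(html, "html.parser")

# div.notes.module matches any div carrying both classes, in any order
for match in soup.select("div.notes.module"):
    match.decompose()  # remove the element and its children from the tree

remaining = soup.get_text(strip=True)
print(remaining)
```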
Give this a go. It removes every div nested under the elements with class="wrapper" on that page.
import requests
from bs4 import BeautifulSoup
html = requests.get('http://archiveofourown.org/works/180121?view_full_work=true')
soup = BeautifulSoup(html.text, 'lxml')
for item in soup.select(".wrapper"):
    for elem in item("div"):
        elem.extract()
    print(item)