Data scraping using Python and bs4 - HTML

<a href="/news/2018/05/israeli-army-projectiles-fired-israel-gaza-180529051139606.html">
<h2 class="top-sec-title">
Israel launches counterattacks in Gaza amid soaring tensions
</h2>
</a>
I want to use the class of the h2, which is "top-sec-title", and scrape the h2 text together with the href of the a.
The example below is what I have been dealing with so far. In that HTML the a tag itself has a class, which let me get the href as well as the text of its child element, an h3 in the case below:
<a class="gs-c-promo-heading gs-o-faux-block-link__overlay-link gel-pica-bold nw-o-link-split__anchor" href="/news/world-us-canada-44294366">
<h3 class="gs-c-promo-heading__title gel-pica-bold nw-o-link-split__text">
Hurricane Maria 'killed 4,600 in Puerto Rico'
</h3>
</a>
The code below is what I used to extract data from the html source above.
news = soup.find_all('a', attrs={'class': 'gs-c-promo-heading gs-o-faux-block-link__overlay-link gel-pica-bold nw-o-link-split__anchor'})
for item in news:
    print(item.get('href'))
    print(item.text)

This gets you all the h2 elements with that class. Each one's .parent is its enclosing element, from which you can read the href if that element is an a.
lst_of_h2 = soup.find_all('h2', {'class': 'top-sec-title'})
for h2 in lst_of_h2:
    h2.parent  # enclosing element
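If the h2 is not always wrapped directly in a link, find_parent('a') may be a safer variant of h2.parent. The following is only a sketch: it assumes the built-in html.parser, the second heading without a link is made up for illustration, and extract_links is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

html = '''
<a href="/news/2018/05/israeli-army-projectiles-fired-israel-gaza-180529051139606.html">
<h2 class="top-sec-title">
Israel launches counterattacks in Gaza amid soaring tensions
</h2>
</a>
<div><h2 class="top-sec-title">Heading without a link</h2></div>
'''


def extract_links(markup):
    """Return (href, title) pairs for every h2.top-sec-title inside a link."""
    soup = BeautifulSoup(markup, 'html.parser')
    pairs = []
    for h2 in soup.find_all('h2', class_='top-sec-title'):
        link = h2.find_parent('a')  # nearest enclosing <a>, or None
        if link is None:
            continue  # this heading is not wrapped in a link
        pairs.append((link.get('href'), h2.get_text(strip=True)))
    return pairs


pairs = extract_links(html)
for href, title in pairs:
    print(href)
    print(title)
```

Headings that are not inside an a are skipped instead of raising an error when the href is read.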

Code:
html = '''
<a href="/news/2018/05/israeli-army-projectiles-fired-israel-gaza-180529051139606.html">
<h2 class="top-sec-title">
Israel launches counterattacks in Gaza amid soaring tensions
</h2>
</a>
'''
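from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# in this markup, the parent of each matching h2 is the enclosing <a>
a_tags = [h.parent for h in soup.select('.top-sec-title')]
for a in a_tags:
    print(a['href'])
    print(a.get_text(strip=True))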
Output:
/news/2018/05/israeli-army-projectiles-fired-israel-gaza-180529051139606.html
Israel launches counterattacks in Gaza amid soaring tensions

Related

Retrieve all names from html tags using BeautifulSoup

I managed to set up Beautiful Soup and find the tags that I needed. How do I extract all the names in the tags?
tags = soup.find_all("a")
print(tags)
After running the above code, I got the following output
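[<a href="/wiki/Alfred_the_Great" title="Alfred the Great">Alfred the Great</a>, <a class="mw-redirect" href="/wiki/Elizabeth_I_of_England" title="Elizabeth I of England">Queen Elizabeth I</a>, <a href="/wiki/Family_tree_of_Scottish_monarchs" title="Family tree of Scottish monarchs">Family tree of Scottish monarchs</a>, <a href="/wiki/Kenneth_MacAlpin" title="Kenneth MacAlpin">Kenneth MacAlpin</a>]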
How do I retrieve the names (Alfred the Great, Queen Elizabeth I, Kenneth MacAlpin, etc.)? Do I need to use a regular expression? Using .string gave me an error.
You can iterate over the tags and use tag.get('title') to get the title value.
Some other ways to do the same:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#attributes
No need to use re. You can grab all the names by iterating over the a tags and, for each one, reading its title attribute or calling get_text() or .find(text=True).
html='''
<html>
<body>
<a href="/wiki/Alfred_the_Great" title="Alfred the Great">
Alfred the Great
</a>
,
<a class="mw-redirect" href="/wiki/Elizabeth_I_of_England" title="Elizabeth I of England">
Queen Elizabeth I
</a>
,
<a href="/wiki/Family_tree_of_Scottish_monarchs" title="Family tree of Scottish monarchs">
Family tree of Scottish monarchs
</a>
,
<a href="/wiki/Kenneth_MacAlpin" title="Kenneth MacAlpin">
Kenneth MacAlpin
</a>
</body>
</html>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
#print(soup.prettify())
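for name in soup.find_all('a'):
    # get_text() returns the visible link text, e.g. "Queen Elizabeth I"
    txt = name.get_text(strip=True)
    # OR the title attribute, which is "Elizabeth I of England" for the redirect link:
    # txt = name.get('title')
    print(txt)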
Output:
Alfred the Great
Queen Elizabeth I
Family tree of Scottish monarchs
Kenneth MacAlpin
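The error with .string most likely came from calling it on the list returned by find_all() rather than on each tag. That is an assumption, since the traceback was not posted. A small sketch of the difference, using a shortened copy of the HTML and the built-in html.parser:

```python
from bs4 import BeautifulSoup

html = (
    '<a href="/wiki/Alfred_the_Great" title="Alfred the Great">Alfred the Great</a>, '
    '<a class="mw-redirect" href="/wiki/Elizabeth_I_of_England" '
    'title="Elizabeth I of England">Queen Elizabeth I</a>'
)
soup = BeautifulSoup(html, 'html.parser')
tags = soup.find_all('a')  # a list-like ResultSet, not a single tag

error_message = None
try:
    tags.string  # attribute access on the whole list fails
except AttributeError as exc:
    error_message = str(exc)

# Work on each tag individually instead.
names = [tag.get_text(strip=True) for tag in tags]   # visible link text
titles = [tag.get('title') for tag in tags]          # title attribute

print(names)
print(titles)
```

The two lists differ for the redirect link: the visible text is "Queen Elizabeth I", while the title attribute is "Elizabeth I of England".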

Getting specific span tag text in python (BeautifulSoup)

I'm scraping some information off MyAnimeList using BeautifulSoup on Python 3 and am trying to get information about a show's 'Status', but am having trouble accessing it.
Here is the html:
<h2>Information</h2>
<div>
<span class="dark_text">Type:</span>
Movie
</div>
<div class="spaceit">
<span class="dark_text">Episodes:</span>
1
</div>
<div>
<span class="dark_text">Status:</span>
Finished Airing
</div>
All of this is also contained within another div tag but I only included the portion of the html that I want to scrape. To clarify, I want to obtain the text 'Finished Airing' contained within 'Status'.
Here's the code I have so far but I'm not really sure if this is the best approach or where to go from here:
Page_soup = soup(Page_html, "html.parser")
extra_info = Page_soup.find('td', attrs={'class': 'borderClass'})
span_html = extra_info.select('span')
for i in range(len(span_html)):
    if 'Status:' in span_html[i].getText():
        pass  # not sure where to go from here
Any help would be appreciated, thanks!
To get the text next to the <span> with "Status:", you can use:
from bs4 import BeautifulSoup
html_doc = """
<h2>Information</h2>
<div>
<span class="dark_text">Type:</span>
Movie
</div>
<div class="spaceit">
<span class="dark_text">Episodes:</span>
1
</div>
<div>
<span class="dark_text">Status:</span>
Finished Airing
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
txt = soup.select_one('span:-soup-contains("Status:")').find_next_sibling(text=True)
print(txt.strip())
Prints:
Finished Airing
Or:
txt = soup.find("span", text="Status:").find_next_sibling(text=True)
print(txt.strip())
Another solution (maybe):
f = soup.find_all('span', attrs={'class': 'dark_text'})
for i in f:
    if i.text == 'Status:':
        print(i.parent.text)  # prints the whole div text, including the "Status:" label
And change 'Status:' to whatever other thing you want to find.
Hope I helped!
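If more than one field is needed, the same idea can be generalised by collecting every label and the text that follows it into a dict. This is a sketch, not the only way to do it: parse_info is a hypothetical helper name, and it assumes each value is the text node immediately after its dark_text span, as in the snippet from the question.

```python
from bs4 import BeautifulSoup, NavigableString

html_doc = """
<h2>Information</h2>
<div>
<span class="dark_text">Type:</span>
Movie
</div>
<div class="spaceit">
<span class="dark_text">Episodes:</span>
1
</div>
<div>
<span class="dark_text">Status:</span>
Finished Airing
</div>
"""


def parse_info(markup):
    """Map each dark_text label (without the colon) to the text after it."""
    soup = BeautifulSoup(markup, "html.parser")
    info = {}
    for span in soup.find_all("span", class_="dark_text"):
        label = span.get_text(strip=True).rstrip(":")
        value = span.next_sibling  # the node right after the span
        if isinstance(value, NavigableString):
            info[label] = value.strip()
    return info


info = parse_info(html_doc)
print(info["Status"])
```

Spans that are followed by another tag rather than plain text are skipped, so the dict only contains labels with a simple text value.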

Need to extract the contents of two html classes with same name

I want to extract the contents of the following html code
<div class="card__content">
<div class="card__title">
Sajjad Haider Khan
<svg class="icon link__icon" height="10" width="10">
<use xlink:href="#icon-arrow-right">
</use>
</svg>
</div>
<div class="card__position">
Student,
<span class="card__text">
Computer Science Engineering (CSE)
</span>
</div>
<div class="card__position">
<span class="card__text">
Career Interests:
</span>
Computer Science; Python; Machine Learning
</div>
</div>
I need all the contents of both card__position divs. In the second card__position div, there is text after the closing span tag. I want to extract that text as well.
I am able to extract only the card__title and the contents of only first card__position class with the following code.
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
url = '<url>'
uClient = uReq(url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
containers = page_soup.find_all("div", {"class": "card__content"})
container = containers[0]
title = container.findAll("div", {"class": "card__title"})
#print(name[0].text)
position = container.findAll("div", {"class": "card__position"})
#print(position[0].text)
However, I want to print the results as in the following manner:
Sajjad Haider Khan
Student
Computer Science Engineering (CSE)
Career Interests: Computer Science; Python; Machine Learning
The code has the correct approach. However, I'd recommend separating the code into fetching the HTML and scraping the results. The scraping code should return its results in some sort of response object. My example below uses a namedtuple (for ease of use).
find_all() should be used if you wish to find more than one HTML element. Otherwise use find() to get the first match. Also, findAll() is the older camelCase spelling; it still works, but find_all() is the current name.
Also this question only provides a snippet of html content. This makes it difficult to make assumptions. However I assume the html always contains 2 divs with class card__position. I use assert to throw an exception if this is not true.
Anyway below is an example, which achieves the desired output:
from collections import namedtuple
from bs4 import BeautifulSoup
ExtractResponse = namedtuple('ExtractResponse', ['Name', 'Role', 'Course', 'MiscInfoTitle', 'MiscInfoDetails'])  # Stores the data from the extract. Not too sure about the correct field names
html = '''<div class="card__content">
<div class="card__title">
Sajjad Haider Khan
<svg class="icon link__icon" height="10" width="10">
<use xlink:href="#icon-arrow-right">
</use>
</svg>
</div>
<div class="card__position">
Student,
<span class="card__text">
Computer Science Engineering (CSE)
</span>
</div>
<div class="card__position">
<span class="card__text">
Career Interests:
</span>
Computer Science; Python; Machine Learning
</div>
</div>'''
def extract(soup):
    card_container_soup = soup.find("div", {"class": "card__content"})  # use find() if you know there is 1 instance (e.g. <div class="card__content">)
    title_soup = card_container_soup.find("div", {"class": "card__title"})
    card_positions_soup = card_container_soup.find_all("div", {"class": "card__position"})
    assert len(card_positions_soup) == 2  # Validate that there are always 2 card__position divs. If there are not, the scrape results will be wrong.
    # This code assumes the first card__position contains course data and the second contains interests
    card_text_1_soup = card_positions_soup[0].find('span', {'class': 'card__text'})
    card_text_1 = card_text_1_soup.text.strip()
    card_text_1_soup.extract()  # remove the span so that the non-span text can be scraped
    card_text_2_soup = card_positions_soup[1].find('span', {'class': 'card__text'})
    card_text_2 = card_text_2_soup.text.strip()
    card_text_2_soup.extract()
    response = ExtractResponse(title_soup.text.strip(), card_positions_soup[0].text.strip().replace(',', ''), card_text_1, card_text_2, card_positions_soup[1].text.strip())
    return response
soup = BeautifulSoup(html, "html.parser")
results = extract(soup)
print(results.Name)
print(results.Role)
print(results.Course)
print(f"{results.MiscInfoTitle} {results.MiscInfoDetails}")
Output:
Sajjad Haider Khan
Student
Computer Science Engineering (CSE)
Career Interests: Computer Science; Python; Machine Learning
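If the pieces do not need to be stored separately, a shorter alternative is stripped_strings, which yields every non-empty text fragment inside a tag without modifying the tree. A minimal sketch on the same snippet, assuming the built-in html.parser. Note that it keeps "Student," and the course on one line, so the layout differs slightly from the output requested in the question.

```python
from bs4 import BeautifulSoup

html = '''<div class="card__content">
<div class="card__title">
Sajjad Haider Khan
<svg class="icon link__icon" height="10" width="10">
<use xlink:href="#icon-arrow-right">
</use>
</svg>
</div>
<div class="card__position">
Student,
<span class="card__text">
Computer Science Engineering (CSE)
</span>
</div>
<div class="card__position">
<span class="card__text">
Career Interests:
</span>
Computer Science; Python; Machine Learning
</div>
</div>'''

soup = BeautifulSoup(html, "html.parser")
card = soup.find("div", class_="card__content")

# One output line per div: join that div's text fragments with a space.
lines = [" ".join(card.find("div", class_="card__title").stripped_strings)]
for position in card.find_all("div", class_="card__position"):
    lines.append(" ".join(position.stripped_strings))

print("\n".join(lines))
```

Because nothing is extracted from the tree, the same soup can still be queried afterwards.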

removing elements from html using BeautifulSoup and Python 3

I'm scraping data from the web and trying to remove all elements that have tag 'div' and class 'notes module' like this html below:
<div class="notes module" role="complementary">
<h3 class="heading">Notes:</h3>
<ul class="associations">
<li>
Translation into Русский available:
Два-два-один Браво Бейкер by <a rel="author" href="/users/dzenka/pseuds/dzenka">dzenka</a>, <a rel="author" href="/users/La_Ardilla/pseuds/La_Ardilla">La_Ardilla</a>
</li>
</ul>
<blockquote class="userstuff">
<p>
<i>Warnings: numerous references to and glancing depictions of combat, injury, murder, and mutilation of the dead; deaths of minor and major original characters. Numerous explicit depictions of sex between two men.</i>
</p>
</blockquote>
<p class="jump">(See the end of the work for other works inspired by this one.)</p>
</div>
source is here: view-source:http://archiveofourown.org/works/180121?view_full_work=true
I'm struggling to even find and print the elements I want to delete. So far I have:
import urllib.request, urllib.parse, urllib.error
from lxml import html
from bs4 import BeautifulSoup
url = 'http://archiveofourown.org/works/180121?view_full_work=true'
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'lxml')
removals = soup.find_all('div', {'id':'notes module'})
for match in removals:
    match.decompose()
but removals returns an empty list. Can you help me select the entire div element that I've shown above so that I can select and remove all such elements from the html?
Thank you.
The div you are trying to find has class="notes module", yet in your code you are trying to find those divs by id="notes module".
Change this line:
removals = soup.find_all('div', {'id':'notes module'})
To this:
removals = soup.find_all('div', {'class':'notes module'})
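Passing the string 'notes module' matches the class attribute as an exact string, so it would miss a div whose classes appear in another order or alongside extra classes. A CSS selector avoids that. The sketch below uses a made-up three-div snippet (the second div is hypothetical, not taken from the page) and the built-in html.parser:

```python
from bs4 import BeautifulSoup

html = '''
<div class="notes module" role="complementary"><h3 class="heading">Notes:</h3></div>
<div class="module notes end"><p>More notes</p></div>
<div class="userstuff"><p>Story text</p></div>
'''

soup = BeautifulSoup(html, 'html.parser')

# div.notes.module matches any div carrying both classes, in any order.
for div in soup.select('div.notes.module'):
    div.decompose()  # remove the div and everything inside it

remaining = [div.get('class') for div in soup.find_all('div')]
print(remaining)
```

Only the userstuff div is left after the loop.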
Give this a go. It removes every div inside the elements with class='wrapper' on that page.
import requests
from bs4 import BeautifulSoup
html = requests.get('http://archiveofourown.org/works/180121?view_full_work=true')
soup = BeautifulSoup(html.text, 'lxml')
for item in soup.select(".wrapper"):
    [elem.extract() for elem in item("div")]
    print(item)

how to parse nested html tag using xpath

This is my sample HTML. I need to parse it using HtmlXpathSelector.
def parse(self, response):
    edxData = HtmlXpathSelector(response)
First I need to get all the tags that match:
edxData.xpath('//h2[@class = "title course-title"]')
Inside each of those tags I need to check the value of the a tag.
Then I need to parse the div tag with the class name subtitle course-subtitle copy-detail.
How can I parse these values? Any suggestions would be appreciated.
sample html response data:
<html>
<body>
<h2 class="title course-title">
<a href="https://www.edx.org/course/mitx/mitx-14-73x-challenges-global-poverty-1350">The Challenges of Global Poverty
</a>
</h2>
<div class="subtitle course-subtitle copy-detail">A course for those who are interested in the challenge posed by massive and persistent world poverty.
</div>
</body>
</html>
One way to loop over the inner tag could be:
>>> for h2 in sel.xpath('//h2[@class = "title course-title"]'):
...     print h2.xpath('a')
...
[<Selector xpath='a' data=u'<a href="https://www.edx.org/course/mitx'>]
Or even simply:
>>> sel.xpath('//h2[@class = "title course-title"]/a')
[<Selector xpath='//h2[@class = "title course-title"]/a' data=u'<a href="https://www.edx.org/course/mitx'>]
To match the div, simply do:
>>> sel.xpath('//div[@class="subtitle course-subtitle copy-detail"]')
[<Selector xpath='//div[@class="subtitle course-subtitle copy-detail"]' data=u'<div class="subtitle course-subtitle cop'>]
It seems like you're using Scrapy; please also tag the question as such.
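To pair each course title with its subtitle, the same XPath expressions can be combined with the following-sibling axis. The sketch below swaps in plain lxml instead of Scrapy's selector so that it runs outside a spider; the XPath strings themselves should carry over unchanged. It assumes the subtitle div is a sibling that follows the h2, as in the sample.

```python
from lxml import html as lxml_html

page = '''<html>
<body>
<h2 class="title course-title">
<a href="https://www.edx.org/course/mitx/mitx-14-73x-challenges-global-poverty-1350">The Challenges of Global Poverty
</a>
</h2>
<div class="subtitle course-subtitle copy-detail">A course for those who are interested in the challenge posed by massive and persistent world poverty.
</div>
</body>
</html>'''

tree = lxml_html.fromstring(page)

courses = []
for h2 in tree.xpath('//h2[@class="title course-title"]'):
    link = h2.xpath('./a')[0]  # the <a> directly inside this h2
    # first matching div that comes after this h2 at the same level
    subtitle = h2.xpath(
        'following-sibling::div[@class="subtitle course-subtitle copy-detail"][1]'
    )
    courses.append({
        'title': link.text_content().strip(),
        'url': link.get('href'),
        'subtitle': subtitle[0].text_content().strip() if subtitle else None,
    })

print(courses)
```

Running the relative queries from each h2 keeps a title and its subtitle together even when the page lists several courses.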