Extract academic publication information from IDEAS - html

I want to extract the list of publications from a specific IDEAS page, retrieving the name of each paper, its authors, and its year. However, I am a bit stuck in doing so. Inspecting the page, all the information is inside the div class="tab-pane fade show active" [...]; the h3 elements hold the year of publication, while inside each li class="list-group-item downfree" [...] we find each paper with its author(s) (as shown in this image). In the end, what I want to obtain is a dataframe containing three columns: title, author, and year.
Nonetheless, while I am able to retrieve each paper's name, I get confused when I also want to add the year and author(s). What I have written so far is the following short code:
from requests import get
url = 'https://ideas.repec.org/s/rtr/wpaper.html'
response = get(url)
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')
containers = soup.findAll("div", {'class': 'tab-pane fade show active'})
title_list = []
year_list = []
for container in containers:
    year = container.findAll('h3')
    year_list.append(int(year[0].text))
    title_containers = container.findAll("li", {'class': 'list-group-item downfree'})
    title = title_containers[0].a.text
    title_list.append(title)
What I get are two lists of only one element each. This is because the initial containers has size 1. As for how to retrieve the author(s) names, I have no idea; I have tried several ways without success. I think I have to split the titles using 'by' as a separator.
I hope someone can help me or redirect me to another discussion that covers a similar situation. Thank you in advance, and apologies for my (probably) silly question; I am still a beginner in web scraping with BeautifulSoup.

You can get the desired information like this:
from requests import get
import pprint
from bs4 import BeautifulSoup
url = 'https://ideas.repec.org/s/rtr/wpaper.html'
response = get(url)
soup = BeautifulSoup(response.text, 'html.parser')
container = soup.select_one("#content")
title_list = []
author_list = []
year_list = [int(h.text) for h in container.find_all('h3')]
for panel in container.select("div.panel-body"):
    title_list.append([x.text for x in panel.find_all('a')])
    author_list.append([x.next_sibling.strip() for x in panel.find_all('i')])
result = list(zip(year_list, title_list, author_list))
pp = pprint.PrettyPrinter(indent=4, width=250)
pp.pprint(result)
outputs:
[ ( 2020,
['The Role Of Public Procurement As Innovation Lever: Evidence From Italian Manufacturing Firms', 'A voyage in the role of territory: are territories capable of instilling their peculiarities in local production systems'],
['Francesco Crespi & Serenella Caravella', 'Cristina Vaquero-Piñeiro']),
( 2019,
[ 'Probability Forecasts and Prediction Markets',
'R&D Financing And Growth',
'Mission-Oriented Innovation Policies: A Theoretical And Empirical Assessment For The Us Economy',
'Public Investment Fiscal Multipliers: An Empirical Assessment For European Countries',
'Consumption Smoothing Channels Within And Between Households',
'A critical analysis of the secular stagnation theory',
'Further evidence of the relationship between social transfers and income inequality in OECD countries',
'Capital accumulation and corporate portfolio choice between liquidity holdings and financialisation'],
[ 'Julia Mortera & A. Philip Dawid',
'Luca Spinesi & Mario Tirelli',
'Matteo Deleidi & Mariana Mazzucato',
'Enrico Sergio Levrero & Matteo Deleidi & Francesca Iafrate',
'Simone Tedeschi & Luigi Ventura & Pierfederico Asdrubal',
'Stefano Di Bucchianico',
"Giorgio D'Agostino & Luca Pieroni & Margherita Scarlato",
'Giovanni Scarano']),
( 2018, ...
I got the years using a list comprehension. I got the titles and authors by appending, for each div element with the class panel-body, a list of the required elements to title_list and author_list, again using list comprehensions, and using next_sibling on each i element to get the authors. Then I zipped the three lists and cast the result to a list. Finally, I pretty-printed the result.
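Since the question ultimately wants a dataframe with columns title, author, and year, here is a minimal sketch (assuming pandas is installed) that flattens the zipped result into one row per paper:
import pandas as pd

# one row per paper: pair each title with its author string, carrying the year
rows = [
    {'title': title, 'author': author, 'year': year}
    for year, titles, authors in result
    for title, author in zip(titles, authors)
]
df = pd.DataFrame(rows, columns=['title', 'author', 'year'])
print(df.head())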

Related

How to access a div within a gridview with Beautiful Soup

I'm trying to scrape all information from archdaily over multiple pages of the website (e.g. from page 1 to 20).
The html structure looks like:
<div>
  <div class='afd-container-main afd-container-main--margin-bottom nft-container-main-search clearfix afd-mobile-margin search-container'>
    ::before
    <div>
      <div class='gridview'>
        <div>
          <div data-insights-category>
            <a href='...'>  # these are the links I want
The code I'm using is
soup = BeautifulSoup(html, 'html')
for foo in soup.find_all('div'):
    bar = foo.find('div', attrs={'class': 'afd-container-main afd-container-main--margin-bottom nft-container-main-search clearfix afd-mobile-margin search-container'})
    print(bar.text)
Error message
AttributeError: 'NoneType' object has no attribute 'text'
Am I misunderstanding something?
Note: Because the question does not reveal how you get your HTML, it is not that easy to answer.
If you use requests, you won't get the results that way, because the site serves its content dynamically.
Alternative approaches:
Get the information with requests via the API (it provides even more information: categories, company, ...)
# iterate over pages
for p in range(1, 3):
    r = requests.get(f'https://www.archdaily.com/search/api/v1/us/projects/categories/residential-architecture?page={p}')  # url of next page
    for item in r.json()['results']:
        # iterate over results and print title+url
        print(item['title'], item['url'])
Get rendered HTML via Selenium
Example
import requests

for p in range(1, 2):
    r = requests.get(f'https://www.archdaily.com/search/api/v1/us/projects/categories/residential-architecture?page={p}')  # url of next page
    for item in r.json()['results']:
        print(item['title'], item['url'])
Output
Wooden House / derksen | windt architecten https://www.archdaily.com/972995/wooden-house-derksen-windt-architecten?ad_source=search&ad_medium=projects_tab
PLA2 House / Dersyn Studio https://www.archdaily.com/972939/pla2-house-dersyn-studio?ad_source=search&ad_medium=projects_tab
gjG House / BLAF Architecten https://www.archdaily.com/951845/gjg-house-blaf-architecten?ad_source=search&ad_medium=projects_tab
Leopoldo 1201 Residential Building / aflalo/gasperini arquitetos https://www.archdaily.com/972959/leopoldo-1201-residential-building-aflalo-gasperini-arquitetos?ad_source=search&ad_medium=projects_tab
Sayang House / Carlos Gris Studio https://www.archdaily.com/972773/sayang-house-carlos-gris-studio?ad_source=search&ad_medium=projects_tab
Nong Ho 17 House / Skarn Chaiyawat https://www.archdaily.com/972911/nong-ho-17-house-skarn-chaiyawat?ad_source=search&ad_medium=projects_tab
LÂM’s Home / AD+studio https://www.archdaily.com/972794/lams-home-ad-plus-studio?ad_source=search&ad_medium=projects_tab
Limestone House / John Wardle Architects https://www.archdaily.com/972958/limestone-house-john-wardle-architects?ad_source=search&ad_medium=projects_tab
Quay Wall House / Thomas Kemme Architects https://www.archdaily.com/971781/quay-wall-house-thomas-kemme-architects?ad_source=search&ad_medium=projects_tab
...
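The Selenium route mentioned above was not shown; here is a rough sketch of it. The page URL and the final selector are assumptions based on the API route and the question's HTML, not verified against the live site:
import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()  # assumes a Chrome driver is available
driver.get('https://www.archdaily.com/search/projects/categories/residential-architecture')  # assumed page url
time.sleep(5)  # crude wait so the dynamically served content can render
html = driver.page_source  # the rendered HTML, including JS-loaded results
driver.quit()

soup = BeautifulSoup(html, 'html.parser')
for a in soup.select('div[data-insights-category] a[href]'):  # selector guessed from the question's HTML
    print(a['href'])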

Unable to get the anchor tag using beautifulsoup

I wanted to get the name and link from the list of anchor tags inside a section, but I am not able to get them.
URL https://www.snopes.com/collections/new-coronavirus-collection/
category=[]
url=[]
for ul in soup.findAll('a', {"class": "collected-list"}):
    if ul is not None:
        category.append(ul.get_text())
    else:
        category.append("")
    links = ul.findAll('a')
    if links is not None:
        for a in links:
            url.append(a['href'])
Earlier I was able to get the list and URLs, but the website structure has changed and my code no longer works. The expected output is like:
It looks like the a tag of interest is now collected-item, not collected-list anymore (that is the section's class now). You could search for all a tags with the class name collected-item and, under that same anchor, find the h5 tag with class title to get the title description, which seems to include (with some manipulation) the category you described in your expected output.
from bs4 import BeautifulSoup
import requests
source = requests.get('https://www.snopes.com/collections/new-coronavirus-collection/').text
soup = BeautifulSoup(source, 'lxml')
category=[]
url = []
for ul in soup.findAll('a', {"class": "collected-item"}):
    if ul is not None:
        title = ul.find('h5', {"class": "title"}).get_text()
        title_short = title.replace("The Coronavirus Collection: ", "")
        category.append(title_short)
        url.append(ul['href'])
for c, u in zip(category, url):
    print(c, u)
Origins and Spread https://www.snopes.com/collections/coronavirus-origins-treatments/?collection-id=238235
Prevention and Treatments https://www.snopes.com/collections/coronavirus-collection-prevention-treatments/?collection-id=238235
Prevention and Treatments II https://www.snopes.com/collections/coronavirus-collection-prevention-treatments-2/?collection-id=238235
International Response https://www.snopes.com/collections/coronavirus-international-rumors/?collection-id=238235
US Government Response https://www.snopes.com/collections/coronavirus-government-role/?collection-id=238235
Trump and the Pandemic https://www.snopes.com/collections/coronavirus-collection-trump/?collection-id=238235
Trump and the Pandemic II https://www.snopes.com/collections/coronavirus-collection-trump-2/?collection-id=238235
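As a side note, the same matching can be written more compactly with CSS selectors via select; a minimal equivalent sketch on the same soup:
for a in soup.select('a.collected-item'):
    title = a.select_one('h5.title').get_text()
    print(title.replace("The Coronavirus Collection: ", ""), a['href'])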

Scrape table with no ids or classes using only standard libraries?

I want to scrape two pieces of data from a website:
https://www.moneymetals.com/precious-metals-charts/gold-price
Specifically I want the "Gold Price per Ounce" and the "Spot Change" percent two columns to the right of it.
Using only Python standard libraries, is this possible? A lot of tutorials use the HTML element id to scrape effectively but inspecting the source for this page, it's just a table. Specifically I want the second and fourth <td> which appear on the page.
It's possible to do it with standard python libraries; ugly, but possible:
import urllib.request
from html.parser import HTMLParser

URL = 'https://www.moneymetals.com/precious-metals-charts/gold-price'
page = urllib.request.Request(URL)
result = urllib.request.urlopen(page)
resulttext = result.read()

class MyHTMLParser(HTMLParser):
    gold = []
    def handle_data(self, data):
        self.gold.append(data)

parser = MyHTMLParser()
parser.feed(str(resulttext))
for i in parser.gold:
    if 'Gold Price per Ounce' in i:
        target = parser.gold.index(i)  # get the index location of the heading
        print(parser.gold[target+2])  # your target items are 2, 5 and 9 positions down in the list
        print(parser.gold[target+5].replace('\\n', ''))
        print(parser.gold[target+9].replace('\\n', ''))
Output (as of the time the url was loaded):
$1,566.70
8.65
0.55%
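If the fixed offsets feel fragile, a slightly more structural stdlib-only sketch tracks <td> tags instead; the cell indices (second and fourth <td>) are taken from the question and are an assumption about the page layout:
import urllib.request
from html.parser import HTMLParser

class TdParser(HTMLParser):
    # collect the text of every <td> cell, in document order
    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []
    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self.in_td = True
            self.cells.append('')
    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_td = False
    def handle_data(self, data):
        if self.in_td:
            self.cells[-1] += data.strip()

html = urllib.request.urlopen('https://www.moneymetals.com/precious-metals-charts/gold-price').read().decode('utf-8')
parser = TdParser()
parser.feed(html)
print(parser.cells[1], parser.cells[3])  # second and fourth <td>, per the question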

How to extract text under "About us" for web pages using BeautifulSoup

I am new to web scraping and I am not sure how to extract the text under "About us" from a webpage.
The classes for the "About us" header differ between webpages.
Could you please guide me or provide code to extract the text under "About us" on a webpage like https://www.thestylistgroup.com/
I can see "About us" in the headers but am unable to extract the data using them.
for heading in soup.find_all(re.compile("^h[1-6]")):
    print(heading.name + ' ' + heading.text.strip())
Thanks,
Naidu
Assuming the text is always an immediate sibling, you could use the following (bs4 4.7.1+). Note that there is potential for incorrect results due to the immediate-sibling assumption.
from bs4 import BeautifulSoup as bs
import requests
r = requests.get('https://www.thestylistgroup.com/')
soup = bs(r.content, 'lxml')
for h in range(1, 7):
    header_with_sibling = soup.select('h' + str(h) + ':contains("About Us") + *')
    if header_with_sibling:
        for i in header_with_sibling:
            print(i.text)
If you want to stop at the first match:
from bs4 import BeautifulSoup as bs
import requests
r = requests.get('https://www.thestylistgroup.com/')
soup = bs(r.content, 'lxml')
for h in range(1, 7):
    header_with_sibling = soup.select_one('h' + str(h) + ':contains("About Us") + *')
    if header_with_sibling:
        print(header_with_sibling.text)
        break
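As an aside, newer Soup Sieve versions (the selector engine bundled with bs4) deprecate :contains in favour of :-soup-contains, so on a current install the selector line would become, e.g.:
header_with_sibling = soup.select_one('h' + str(h) + ':-soup-contains("About Us") + *')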
This script will select all <hN> tags that contain the string "About Us":
import re
import requests
from bs4 import BeautifulSoup
url = 'https://www.thestylistgroup.com/'
soup = BeautifulSoup(requests.get(url).text, 'lxml')
for tag in soup.find_all(lambda t: re.findall(r'h\d+', t.name) and t.text.strip().lower() == 'about us'):
    print(tag)
    print(tag.next_sibling.text)  # this will get the text from the next sibling tag
Prints:
<h2 class="css-6r2li">About Us</h2>
The Stylist Group is a leading digital publisher and media platform with pioneering brands Stylist and Emerald Street. Within an inspiring, fast-paced, entrepreneurial environment we create original magazines and digital brands for Stylist Women - our successful, sophisticated, dynamic and urban audience. These people have very little time, a considerable disposable income and no patience with inauthentic attempts to try to engage them. Our purpose is to create content Stylist Women are proud to enjoy.

Scrape from an empty class tag (HTML)

I want to scrape Directors and Actors from IMDB from a single webpage which lists top 50 films of 2018. The issue I have is that I have no idea how to scrape them as the class has no name.
Part of my code which is working fine:
response = requests.get('https://www.imdb.com/search/title?release_date=2018&sort=num_votes,desc&page=1')
soup = BeautifulSoup(response.text, 'html.parser')
for i in soup.find_all('div', class_='lister-item-content'):
    film_length = i.find('span', class_='runtime').text
    film_genre = i.find('span', class_='genre').text
    public_rating = i.find('div', class_='ratings-bar').strong.text
Part of the HTML code that I don't know how to work with:
<p class="">
Directors:
Anthony Russo,
Joe Russo
<span class="ghost">|</span>
Stars:
Robert Downey Jr.,
Chris Hemsworth,
Mark Ruffalo,
Chris Evans
</p>
I want to be able to pull all Directors and all listed Actors for each film. I want to do that from the single URL as provided in the code.
You can use :contains and specify Director: or Directors: to target the blocks for each film; then separate out the director(s) by grabbing the a tags that come before the span tag (filtering out those after it). The actors are the remaining a tag siblings that follow the span tag. Requires bs4 4.7.1+.
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.imdb.com/search/title?release_date=2018&sort=num_votes,desc&page=1')
soup = bs(r.content, 'lxml')
for item in soup.select('p:contains("Director:"), p:contains("Directors:")'):
    # print(item)
    directors = [d.text for d in item.select('a:not(span ~ a)')]
    actors = [d.text for d in item.select('span ~ a')]
    print(directors, actors)
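For the Avengers entry shown in the question's HTML snippet, the printed pair would look something like:
['Anthony Russo', 'Joe Russo'] ['Robert Downey Jr.', 'Chris Hemsworth', 'Mark Ruffalo', 'Chris Evans']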
QHarr's answer was great, but I later noticed that some films have no director(s) listed at all, and in those cases the code ignored the film. I therefore updated QHarr's code so that it takes this scenario into account:
reqs = 0  # counts the films processed
for item in soup.select('p:contains("Stars:")'):
    reqs += 1
    if item not in soup.select('p:contains("Director:"), p:contains("Directors:")'):
        actors = [d.text for d in item.select('a:not(span ~ a)')]
        directors = ['none']
    else:
        directors = str([d.text for d in item.select('a:not(span ~ a)')]).strip('[]').replace("'", "")
        actors = [d.text for d in item.select('span ~ a')]