How to access a div within a gridview with BeautifulSoup - html

I'm trying to scrape all information from archdaily over multiple pages of the website (e.g. from page 1 to 20).
The html structure looks like:
<div>
  <div class='afd-container-main afd-container-main--margin-bottom nft-container-main-search clearfix afd-mobile-margin search-container'>
    ::before
    <div>
      <div class='gridview'>
        <div>
          <div data-insights-category>
            <a href='...'>  <!-- these are the hrefs I want -->
The code I'm using is
soup = BeautifulSoup(html, 'html')
for foo in soup.find_all('div'):
    bar = foo.find('div', attrs={'class': 'afd-container-main afd-container-main--margin-bottom nft-container-main-search clearfix afd-mobile-margin search-container'})
    print(bar.text)
Error message
AttributeError: 'NoneType' object has no attribute 'text'
Am I misunderstanding something?

Note: Since the question does not reveal how you get your html, it is not that easy to answer.
If you use requests, you won't get the results that way, because the site serves its content dynamically.
Alternative approaches:
Get the information with requests via the API (it provides even more information - categories, company, ...)
# iterate over pages
for p in range(1, 3):
    r = requests.get(f'https://www.archdaily.com/search/api/v1/us/projects/categories/residential-architecture?page={p}')  # url of next page
    # iterate over results and print title + url
    for item in r.json()['results']:
        print(item['title'], item['url'])
Get rendered html via Selenium
Example
import requests

for p in range(1, 2):
    r = requests.get(f'https://www.archdaily.com/search/api/v1/us/projects/categories/residential-architecture?page={p}')  # url of next page
    for item in r.json()['results']:
        print(item['title'], item['url'])
Output
Wooden House / derksen | windt architecten https://www.archdaily.com/972995/wooden-house-derksen-windt-architecten?ad_source=search&ad_medium=projects_tab
PLA2 House / Dersyn Studio https://www.archdaily.com/972939/pla2-house-dersyn-studio?ad_source=search&ad_medium=projects_tab
gjG House / BLAF Architecten https://www.archdaily.com/951845/gjg-house-blaf-architecten?ad_source=search&ad_medium=projects_tab
Leopoldo 1201 Residential Building / aflalo/gasperini arquitetos https://www.archdaily.com/972959/leopoldo-1201-residential-building-aflalo-gasperini-arquitetos?ad_source=search&ad_medium=projects_tab
Sayang House / Carlos Gris Studio https://www.archdaily.com/972773/sayang-house-carlos-gris-studio?ad_source=search&ad_medium=projects_tab
Nong Ho 17 House / Skarn Chaiyawat https://www.archdaily.com/972911/nong-ho-17-house-skarn-chaiyawat?ad_source=search&ad_medium=projects_tab
LÂM’s Home / AD+studio https://www.archdaily.com/972794/lams-home-ad-plus-studio?ad_source=search&ad_medium=projects_tab
Limestone House / John Wardle Architects https://www.archdaily.com/972958/limestone-house-john-wardle-architects?ad_source=search&ad_medium=projects_tab
Quay Wall House / Thomas Kemme Architects https://www.archdaily.com/971781/quay-wall-house-thomas-kemme-architects?ad_source=search&ad_medium=projects_tab
...
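If you want the results collected into a structured list instead of printed, the extraction step can be factored out. A minimal sketch, run here against a hard-coded sample payload (the `results`, `title` and `url` keys match the API response used above; the sample values are shortened from the output):

```python
def extract_projects(payload):
    """Pull (title, url) pairs from one page of the search API response."""
    return [(item['title'], item['url']) for item in payload.get('results', [])]

# hard-coded sample shaped like one page of the API response
sample = {'results': [
    {'title': 'Wooden House / derksen | windt architecten',
     'url': 'https://www.archdaily.com/972995/wooden-house-derksen-windt-architecten'},
]}
print(extract_projects(sample))
```

In the paginated loop you would call `extract_projects(r.json())` per page and extend one list with the results.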

Related

selenium by_xpath not returning any results

I am using Selenium 4+, and I seem not to get back any results when requesting elements inside a div.
# wait for the page to load
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.ID, "search-key")))
driver.implicitly_wait(10)

# search for the suppliers you want to message
search_input = driver.find_element(By.ID, "search-key")
search_input.send_keys("suppliers")
search_input.send_keys(Keys.RETURN)

# find all the supplier stores on the page
supplier_stores_div = driver.find_element(By.CLASS_NAME, "list--gallery--34TropR")
print(supplier_stores_div)
supplier_stores = supplier_stores_div.find_elements(By.XPATH, "./a[@class='v3--container--31q8BOL cards--gallery--2o6yJVt']")
print(supplier_stores)
The logging statements gave me <selenium.webdriver.remote.webelement.WebElement (session="a3be4d8c5620e760177247d5b8158823", element="5ae78693-4bae-4826-baa6-bd940fa9a41b")> and an empty list for the Element objects, []
The html code is here:
<div class="list--gallery--34TropR" data-spm="main" data-spm-max-idx="24">flex
<a class="v3--container--31q8BOL cards--gallery--2o6yJVt" href="(link)" target="_blank" style="text-align: left;" data-spm-anchor-id="a2g0o.productlist.main.1"> (some divs) </a>flex
That is just one <a> element of that class; there are more.
Before scraping the supplier names, you have to scroll down the page slowly to the bottom; only then can you get all the supplier names. Try the code below:
from time import sleep

driver.get("https://www.aliexpress.com/premium/supplier.html?spm=a2g0o.best.1000002.0&initiative_id=SB_20221218233848&dida=y")
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    # scroll in small steps so lazy-loaded items have a chance to render
    driver.execute_script("window.scrollBy(0, 800);")
    sleep(1)
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

suppliers = driver.find_elements(By.XPATH, ".//*[@class='list--gallery--34TropR']//span/a")
print("Total no. of suppliers:", len(suppliers))
print("======")
for supplier in suppliers:
    print(supplier.text)
Output:
Total no. of suppliers: 60
======
Reading Life Store
ASONSTEEL Store
Paper, ink, pen and inkstone Store
Custom Stationery Store
ZHOUYANG Official Store
WOWSOCOOL Store
IFPD Official Store
The 9 Store
QuanRun Store
...
...
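The scroll-until-the-height-stops-growing loop above can be factored into a reusable helper. A sketch with simulated page heights in place of a live driver; the two callables stand in for the `driver.execute_script` calls:

```python
def scroll_until_stable(get_height, scroll, wait=lambda: None):
    """Scroll repeatedly until the document height stops growing."""
    last = get_height()
    while True:
        scroll()
        wait()
        new = get_height()
        if new == last:
            return new
        last = new

# simulate a page that grows twice as lazy content loads, then stops
heights = iter([1000, 1800, 2600, 2600])
page = {'h': next(heights)}
final = scroll_until_stable(lambda: page['h'],
                            lambda: page.update(h=next(heights, page['h'])))
print(final)  # 2600
```

With a real driver you would pass `lambda: driver.execute_script("return document.body.scrollHeight")`, `lambda: driver.execute_script("window.scrollBy(0, 800);")` and `lambda: sleep(1)`.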
It returns you the Element objects. In the Python bindings you read the .text property of each element to get its text (getText() is the Java API), and find_elements returns a list, so map over it:
[el.text for el in supplier_stores_div.find_elements(By.XPATH, "./a[@class='v3--container--31q8BOL cards--gallery--2o6yJVt']")]
or use .get_attribute() for an attribute (for example the href):
[el.get_attribute("href") for el in supplier_stores_div.find_elements(By.XPATH, "./a[@class='v3--container--31q8BOL cards--gallery--2o6yJVt']")]

Extract academic publication information from IDEAS

I want to extract the list of publications from a specific IDEAS page. I want to retrieve the name of each paper, its authors, and the year, but I am a bit stuck in doing so. Inspecting the page, all the information is inside the div class="tab-pane fade show active" [...]; each h3 holds the year of publication, while inside each li class="list-group-item downfree" [...] we can find a paper with its authors (as shown in this image). In the end, what I want to obtain is a dataframe containing three columns: title, author, and year.
Nonetheless, while I am able to retrieve each paper's name, I get confused when I also want to add the year and author(s). What I have written so far is the following short code:
from requests import get
from bs4 import BeautifulSoup

url = 'https://ideas.repec.org/s/rtr/wpaper.html'
response = get(url)
soup = BeautifulSoup(response.text, 'html.parser')

containers = soup.findAll("div", {'class': 'tab-pane fade show active'})
title_list = []
year_list = []
for container in containers:
    year = container.findAll('h3')
    year_list.append(int(year[0].text))
    title_containers = container.findAll("li", {'class': 'list-group-item downfree'})
    title = title_containers[0].a.text
    title_list.append(title)
What I get are two lists of only one element each, because the initial containers has length 1. As for how to retrieve the author names, I have no idea; I tried several ways without success. I think I have to split the titles using 'by' as a separator.
I hope someone can help me or redirect me to another discussion that faces a similar situation. Thank you in advance, and apologies for my (probably) silly question; I am still a beginner at web scraping with BeautifulSoup.
You can get the desired information like this:
from requests import get
import pprint
from bs4 import BeautifulSoup

url = 'https://ideas.repec.org/s/rtr/wpaper.html'
response = get(url)
soup = BeautifulSoup(response.text, 'html.parser')

container = soup.select_one("#content")
title_list = []
author_list = []
year_list = [int(h.text) for h in container.find_all('h3')]
for panel in container.select("div.panel-body"):
    title_list.append([x.text for x in panel.find_all('a')])
    author_list.append([x.next_sibling.strip() for x in panel.find_all('i')])
result = list(zip(year_list, title_list, author_list))
pp = pprint.PrettyPrinter(indent=4, width=250)
pp.pprint(result)
outputs:
[ ( 2020,
['The Role Of Public Procurement As Innovation Lever: Evidence From Italian Manufacturing Firms', 'A voyage in the role of territory: are territories capable of instilling their peculiarities in local production systems'],
['Francesco Crespi & Serenella Caravella', 'Cristina Vaquero-Piñeiro']),
( 2019,
[ 'Probability Forecasts and Prediction Markets',
'R&D Financing And Growth',
'Mission-Oriented Innovation Policies: A Theoretical And Empirical Assessment For The Us Economy',
'Public Investment Fiscal Multipliers: An Empirical Assessment For European Countries',
'Consumption Smoothing Channels Within And Between Households',
'A critical analysis of the secular stagnation theory',
'Further evidence of the relationship between social transfers and income inequality in OECD countries',
'Capital accumulation and corporate portfolio choice between liquidity holdings and financialisation'],
[ 'Julia Mortera & A. Philip Dawid',
'Luca Spinesi & Mario Tirelli',
'Matteo Deleidi & Mariana Mazzucato',
'Enrico Sergio Levrero & Matteo Deleidi & Francesca Iafrate',
'Simone Tedeschi & Luigi Ventura & Pierfederico Asdrubal',
'Stefano Di Bucchianico',
"Giorgio D'Agostino & Luca Pieroni & Margherita Scarlato",
'Giovanni Scarano']),
( 2018, ...
I got the years using a list comprehension. I got the titles and authors by appending, for each div element with the class panel-body, a list of the required elements to title_list and author_list respectively, again using list comprehensions, and using next_sibling on each i element to get the authors. Then I zipped the three lists and cast the result to a list. Finally, I pretty-printed the result.
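To get from the zipped result to the three-column dataframe the question asks for, each (year, titles, authors) tuple still has to be flattened into one row per paper. A sketch with a made-up sample tuple (the real `result` from the code above has the same shape):

```python
def to_rows(result):
    """Flatten (year, [titles], [authors]) tuples into one dict per paper."""
    rows = []
    for year, titles, authors in result:
        for title, author in zip(titles, authors):
            rows.append({'title': title, 'author': author, 'year': year})
    return rows

sample = [(2020, ['Paper A', 'Paper B'], ['Alice', 'Bob & Carol'])]
print(to_rows(sample))
```

Passing `to_rows(result)` to `pandas.DataFrame` then gives the title/author/year table directly.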

Scrape table with no ids or classes using only standard libraries?

I want to scrape two pieces of data from a website:
https://www.moneymetals.com/precious-metals-charts/gold-price
Specifically I want the "Gold Price per Ounce" and the "Spot Change" percent two columns to the right of it.
Using only Python standard libraries, is this possible? A lot of tutorials use the HTML element id to scrape effectively but inspecting the source for this page, it's just a table. Specifically I want the second and fourth <td> which appear on the page.
It's possible to do it with only standard Python libraries; ugly, but possible:
import urllib.request
from html.parser import HTMLParser

URL = 'https://www.moneymetals.com/precious-metals-charts/gold-price'
page = urllib.request.Request(URL)
result = urllib.request.urlopen(page)
resulttext = result.read()

class MyHTMLParser(HTMLParser):
    gold = []
    def handle_data(self, data):
        self.gold.append(data)

parser = MyHTMLParser()
parser.feed(str(resulttext))

for i in parser.gold:
    if 'Gold Price per Ounce' in i:
        target = parser.gold.index(i)  # get the index location of the heading
        print(parser.gold[target + 2])  # your target items are 2, 5 and 9 positions down in the list
        print(parser.gold[target + 5].replace('\\n', ''))
        print(parser.gold[target + 9].replace('\\n', ''))
Output (as of the time the url was loaded):
$1,566.70
8.65
0.55%
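Counting fixed offsets in parser.gold is fragile: any whitespace change on the page shifts the positions. A slightly more robust stdlib-only variant tracks whether the parser is inside a <td> and collects only cell text. It is shown here against a hard-coded sample table; the markup and values are made up for illustration, not taken from the live page:

```python
from html.parser import HTMLParser

class TableTextParser(HTMLParser):
    """Collect the stripped text of every <td> cell, in document order."""
    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self._in_td = True

    def handle_endtag(self, tag):
        if tag == 'td':
            self._in_td = False

    def handle_data(self, data):
        if self._in_td and data.strip():
            self.cells.append(data.strip())

sample = ('<table><tr><td>Gold Price per Ounce</td><td>$1,566.70</td>'
          '<td>8.65</td><td>0.55%</td></tr></table>')
p = TableTextParser()
p.feed(sample)
print(p.cells[1], p.cells[3])  # the price and the spot-change percent
```

With this approach you index into table cells rather than into every text node on the page, so headers, scripts, and stray newlines no longer affect the offsets.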

How to extract text under "About us" for web pages using BeautifulSoup

I am new to web scraping and I am not sure how to extract the text under "About us" from a webpage.
The classes of the "About us" header differ across webpages.
Could you please guide me or provide code to extract the text under "About us" on a webpage like https://www.thestylistgroup.com/
I can see "About us" in the headers but I am unable to extract the data under them.
for heading in soup.find_all(re.compile("^h[1-6]")):
    print(heading.name + ' ' + heading.text.strip())
Thanks,
Naidu
Assuming the text is always an immediate sibling, you could use the following (bs4 4.7.1+). Note that there is potential for incorrect results due to the immediate-sibling assumption.
from bs4 import BeautifulSoup as bs
import requests

r = requests.get('https://www.thestylistgroup.com/')
soup = bs(r.content, 'lxml')

for h in range(1, 7):
    header_with_sibling = soup.select('h' + str(h) + ':contains("About Us") + *')
    if header_with_sibling:
        for i in header_with_sibling:
            print(i.text)
If you want to stop at the first match:
from bs4 import BeautifulSoup as bs
import requests

r = requests.get('https://www.thestylistgroup.com/')
soup = bs(r.content, 'lxml')

for h in range(1, 7):
    header_with_sibling = soup.select_one('h' + str(h) + ':contains("About Us") + *')
    if header_with_sibling:
        print(header_with_sibling.text)
        break
This script will select all <hX> tags that contain the string "About Us":
import re
import requests
from bs4 import BeautifulSoup

url = 'https://www.thestylistgroup.com/'
soup = BeautifulSoup(requests.get(url).text, 'lxml')

for tag in soup.find_all(lambda t: re.findall(r'h\d+', t.name) and t.text.strip().lower() == 'about us'):
    print(tag)
    print(tag.next_sibling.text)  # this will get the text from the next sibling tag
Prints:
<h2 class="css-6r2li">About Us</h2>
The Stylist Group is a leading digital publisher and media platform with pioneering brands Stylist and Emerald Street. Within an inspiring, fast-paced, entrepreneurial environment we create original magazines and digital brands for Stylist Women - our successful, sophisticated, dynamic and urban audience. These people have very little time, a considerable disposable income and no patience with inauthentic attempts to try to engage them. Our purpose is to create content Stylist Women are proud to enjoy.
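The heading-matching idea above can be exercised without a network call. A sketch against a hard-coded snippet (the markup below is made up for illustration; the class name on the heading is deliberately arbitrary, since the question says it varies across sites):

```python
from bs4 import BeautifulSoup

html = """
<h2 class="css-whatever">About Us</h2>
<p>We create content our readers are proud to enjoy.</p>
<h2>Contact</h2>
<p>hello@example.com</p>
"""
soup = BeautifulSoup(html, 'html.parser')

# match any h1-h6 whose text is "About Us", regardless of its class
heading = soup.find(lambda t: t.name in {'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}
                    and t.get_text(strip=True).lower() == 'about us')
print(heading.find_next_sibling('p').get_text(strip=True))
```

Using find_next_sibling('p') instead of next_sibling skips the whitespace text node between the heading and the paragraph.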

How to get data from a specific section in MediaWiki supported wikis

I am parsing a Wikia article and trying to get the data from the highlighted block on the right-hand side. I have already got the left one using the following URL:
http://hetalia.wikia.com/api.php?action=parse&prop=revisions&prop=sections&page=America&format=json
But I don't know the reference for the right one. What would the parameter be?
The original URL is,
http://hetalia.wikia.com/wiki/America
I believe the only way to get info from the infoboxes is to get the page source, which can be done with this query:
http://hetalia.wikia.com/api.php?action=query&prop=revisions&rvprop=content&titles=America&format=json
And then parsing the text to get the information, as the source of that box is in this format
{{Character
|name = America
|jname = アメリカ
|image = America0.png
|country = [[wikipedia:United States|The United States of America]]
|human = Alfred F.Jones (アルフレッド・F・ジョーンズ, ''Arufureddo F. Joonzu'')
|age = 19
...
|japanese = [[Katsuyuki Konishi]], Ryoko Shimizu (Young America, drama CD "Prologue"), [[Ai Iwamura]] (Young America, anime), [[Axis Powers Hetalia: The CD|Osamu Ikeda]] (''Flower Of Iris'')
|english = [[Eric Vale]], Stephanie Young (young America)}}
You could use Regex to extract the data from the text, such as using \|age\s*=\s*(\d*) to get the age attribute.
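A minimal sketch of that regex extraction, run against a hard-coded sample in the template format shown above (the helper name and the generalised pattern are mine; the source answer only gives the age-specific regex):

```python
import re

# hard-coded sample in the {{Character ...}} template format
source = """{{Character
|name = America
|jname = アメリカ
|age = 19
}}"""

def get_field(text, field):
    """Pull the value of one |field = value line from the template source."""
    m = re.search(r'\|\s*' + re.escape(field) + r'\s*=\s*(.+)', text)
    return m.group(1).strip() if m else None

print(get_field(source, 'age'))   # 19
print(get_field(source, 'name'))  # America
```

Note that this simple line-based pattern breaks on multi-line values or values containing nested templates; for those, a real wikitext parser is the safer route.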