Scrape table with no ids or classes using only standard libraries? - html

I want to scrape two pieces of data from a website:
https://www.moneymetals.com/precious-metals-charts/gold-price
Specifically I want the "Gold Price per Ounce" and the "Spot Change" percent two columns to the right of it.
Using only Python standard libraries, is this possible? A lot of tutorials use the HTML element id to scrape effectively but inspecting the source for this page, it's just a table. Specifically I want the second and fourth <td> which appear on the page.

It's possible to do it with standard Python libraries; ugly, but possible:
import urllib.request
from html.parser import HTMLParser

URL = 'https://www.moneymetals.com/precious-metals-charts/gold-price'
page = urllib.request.Request(URL)
result = urllib.request.urlopen(page)
resulttext = result.read()

class MyHTMLParser(HTMLParser):
    gold = []
    def handle_data(self, data):
        self.gold.append(data)

parser = MyHTMLParser()
parser.feed(str(resulttext))
for i in parser.gold:
    if 'Gold Price per Ounce' in i:
        target = parser.gold.index(i)  # get the index location of the heading
        print(parser.gold[target+2])  # your target items are 2, 5 and 9 positions down in the list
        print(parser.gold[target+5].replace('\\n', ''))  # str() on the bytes leaves literal '\n' sequences
        print(parser.gold[target+9].replace('\\n', ''))
Output (as of the time the url was loaded):
$1,566.70
8.65
0.55%
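A less position-dependent variant (the fixed offsets 2, 5 and 9 shift whenever the page layout changes) is to track the `<td>` tags themselves with handle_starttag/handle_endtag and index cells directly. A minimal sketch, using a made-up fragment standing in for the page's table:

```python
from html.parser import HTMLParser

# Hypothetical markup standing in for the real price table.
SAMPLE = """
<table>
  <tr>
    <td>Gold Price per Ounce</td><td>$1,566.70</td>
    <td>8.65</td><td>0.55%</td>
  </tr>
</table>
"""

class TdParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []  # text of each <td>, in document order

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self.in_td = True

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_td = False

    def handle_data(self, data):
        if self.in_td and data.strip():
            self.cells.append(data.strip())

parser = TdParser()
parser.feed(SAMPLE)
print(parser.cells[1], parser.cells[3])  # the second and fourth <td>
```

This answers the question as asked (second and fourth `<td>`) without counting text nodes in the flat handle_data stream.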

Related

How to scrape only texts from specific HTML elements?

I have a problem with selecting the appropriate items from the list.
For example, I want to omit the "1." and then the first "5" (as in the example).
Additionally, I would like to write a condition that changes the letter "W" to "WIN".
import re
from selenium import webdriver
from bs4 import BeautifulSoup as BS2
from time import sleep

driver = webdriver.Chrome()
driver.get("https://www.flashscore.pl/druzyna/ajax/8UOvIwnb/tabela/")
sleep(10)
page = driver.page_source
soup = BS2(page, 'html.parser')
content = soup.find('div', {'class': 'ui-table__body'})
content_list = content.find_all('span', {"table__cell table__cell--value"})
res = []
for i in content:
    line = i.text.split()[0]
    if re.search('Ajax', line):
        res.append(line)
print(res)
results
['1.Ajax550016:315?WWWWW']
I need
Ajax;5;5;0;16;3;W;W;W;W;W
I would recommend selecting your elements more specifically:
for e in soup.select('.ui-table__row'):
Iterate the ResultSet and decompose() the unwanted tag:
    e.select_one('.wld--tbd').decompose()
Extract the texts with stripped_strings and join() them into your expected string:
    data.append(';'.join(e.stripped_strings))
Example
Also making some replacements based on a dict, just to demonstrate how this would work (not knowing what R or P should map to).
...
soup = BS2(page,'html.parser')
data = []
for e in soup.select('.ui-table__row'):
    e.select_one('.wld--tbd').decompose()
    e.select_one('.tableCellRank').decompose()
    e.select_one('.table__cell--points').decompose()
    e.select_one('.table__cell--score').string = ';'.join(e.select_one('.table__cell--score').text.split(':'))
    pattern = {'W':'WIN','R':'RRR','P':'PPP'}
    data.append(';'.join([pattern.get(i,i) for i in e.stripped_strings]))
data
To only get the result for Ajax:
data = []
for e in soup.select('.ui-table__row:-soup-contains("Ajax")'):
    e.select_one('.wld--tbd').decompose()
    e.select_one('.tableCellRank').decompose()
    e.select_one('.table__cell--points').decompose()
    e.select_one('.table__cell--score').string = ';'.join(e.select_one('.table__cell--score').text.split(':'))
    pattern = {'W':'WIN','R':'RRR','P':'PPP'}
    data.append(';'.join([pattern.get(i,i) for i in e.stripped_strings]))
data
Output
Based on the actual data, it may differ from the question's example.
['Ajax;6;6;0;0;21;3;WIN;WIN;WIN;WIN;WIN']
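The decompose() / stripped_strings pattern can be tried without Selenium on a static fragment. The markup below is made up (only the class names are taken from the answer above), so treat it as a sketch rather than the real page structure:

```python
from bs4 import BeautifulSoup

# Made-up row fragment; class names copied from the answer, markup invented.
html = """
<div class="ui-table__body">
  <div class="ui-table__row">
    <div class="tableCellRank">1.</div>
    <a class="tableCellParticipant__name">Ajax</a>
    <span class="table__cell--value">5</span>
    <span class="table__cell--value">5</span>
    <span class="table__cell--value">0</span>
    <span class="table__cell--score">16:3</span>
    <span class="table__cell--points">15</span>
    <div class="wld--tbd">?</div>
    <div class="wld">W</div>
  </div>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
pattern = {'W': 'WIN'}
data = []
for e in soup.select('.ui-table__row'):
    e.select_one('.wld--tbd').decompose()       # drop the '?' cell
    e.select_one('.tableCellRank').decompose()  # drop the '1.'
    e.select_one('.table__cell--points').decompose()
    score = e.select_one('.table__cell--score')
    score.string = ';'.join(score.text.split(':'))  # '16:3' -> '16;3'
    data.append(';'.join(pattern.get(i, i) for i in e.stripped_strings))
print(data)  # ['Ajax;5;5;0;16;3;WIN']
```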
You had the right start by using bs4 to find the table div, but then you gave up and just tried to use re to extract from the text. As you can see, that's not going to work. Here is a simple way to hack together what you want: I keep grabbing divs from the table div you found, then grab the text of the next eight divs after finding Ajax. Then I do some dirty string manipulation, because the WWWWW is all in the same top-level div.
import re
from selenium import webdriver
from bs4 import BeautifulSoup as BS2
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(ChromeDriverManager().install())
#driver = webdriver.Chrome()
driver.get("https://www.flashscore.pl/druzyna/ajax/8UOvIwnb/tabela/")
driver.implicitly_wait(10)
page = driver.page_source
soup = BS2(page, 'html.parser')
content = soup.find('div', {'class': 'ui-table__body'})
content_list = content.find_all('span', {"table__cell table__cell--value"})
res = []
found = 0
for i in content.find('div'):
    line = i.text.split()[0]
    if re.search('Ajax', line):
        found = 8
    if found:
        found -= 1
        res.append(line)
# change field 5 into separate values and skip fields 4 and 6
res = res[:4] + res[5].split(':') + res[7:]
# break the last field into separate characters and drop the first '?'
res = res[:-1] + [i for i in res[-1]][1:]
print(";".join(res))
returns
Ajax;5;5;0;16;3;W;W;W;W;W
This works, but it is very brittle and will break as soon as the website changes its content; you should add a lot of error checking. I also replaced the sleep with an implicit wait call, and added ChromeDriverManager, which lets me use Selenium with Chrome.
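The string surgery at the end can be checked in isolation. Below is a hypothetical res list as the scraping loop might leave it (the skipped fields 4 and 6 are made-up values; the rest is copied from the question's output):

```python
# Hypothetical 8-field list as the loop might produce it; fields at
# index 4 and 6 are the cells the answer skips (values invented here).
res = ['Ajax', '5', '5', '0', '?', '16:3', '15', '?WWWWW']

# change field 5 into separate values and skip fields 4 and 6
res = res[:4] + res[5].split(':') + res[7:]
# break the last field into separate characters and drop the leading '?'
res = res[:-1] + list(res[-1])[1:]
print(';'.join(res))  # Ajax;5;5;0;16;3;W;W;W;W;W
```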

Extract academic publication information from IDEAS

I want to extract the list of publications from a specific IDEAS page. I want to retrieve information about the name of the paper, the authors, and the year. However, I am a bit stuck in doing so. Inspecting the page, all the information is inside the div class="tab-pane fade show active" [...]; the h3 elements hold the year of publication, while inside each li class="list-group-item downfree" [...] we can find each paper with its authors (as shown in this image). In the end, what I want to obtain is a dataframe containing three columns: title, author, and year.
Nonetheless, while I am able to retrieve each paper's name, I get confused when I also want to add the year and author(s). What I wrote so far is the following short code:
from requests import get
from bs4 import BeautifulSoup

url = 'https://ideas.repec.org/s/rtr/wpaper.html'
response = get(url)
soup = BeautifulSoup(response.text, 'html.parser')
containers = soup.findAll("div", {'class': 'tab-pane fade show active'})
title_list = []
year_list = []
for container in containers:
    year = container.findAll('h3')
    year_list.append(int(year[0].text))
    title_containers = container.findAll("li", {'class': 'list-group-item downfree'})
    title = title_containers[0].a.text
    title_list.append(title)
What I get are two lists of only one element each, because the initial containers has size 1. As for how to retrieve the author names, I have no idea; I tried several ways without success. I think I have to split the titles using 'by' as a separator.
I hope someone can help me or redirect me to some other discussion that covers a similar situation. Thank you in advance, and apologies for my (probably) silly question; I am still a beginner in web scraping with BeautifulSoup.
You can get the desired information like this:
from requests import get
import pprint
from bs4 import BeautifulSoup

url = 'https://ideas.repec.org/s/rtr/wpaper.html'
response = get(url)
soup = BeautifulSoup(response.text, 'html.parser')
container = soup.select_one("#content")
title_list = []
author_list = []
year_list = [int(h.text) for h in container.find_all('h3')]
for panel in container.select("div.panel-body"):
    title_list.append([x.text for x in panel.find_all('a')])
    author_list.append([x.next_sibling.strip() for x in panel.find_all('i')])
result = list(zip(year_list, title_list, author_list))
pp = pprint.PrettyPrinter(indent=4, width=250)
pp.pprint(result)
outputs:
[ ( 2020,
['The Role Of Public Procurement As Innovation Lever: Evidence From Italian Manufacturing Firms', 'A voyage in the role of territory: are territories capable of instilling their peculiarities in local production systems'],
['Francesco Crespi & Serenella Caravella', 'Cristina Vaquero-Piñeiro']),
( 2019,
[ 'Probability Forecasts and Prediction Markets',
'R&D Financing And Growth',
'Mission-Oriented Innovation Policies: A Theoretical And Empirical Assessment For The Us Economy',
'Public Investment Fiscal Multipliers: An Empirical Assessment For European Countries',
'Consumption Smoothing Channels Within And Between Households',
'A critical analysis of the secular stagnation theory',
'Further evidence of the relationship between social transfers and income inequality in OECD countries',
'Capital accumulation and corporate portfolio choice between liquidity holdings and financialisation'],
[ 'Julia Mortera & A. Philip Dawid',
'Luca Spinesi & Mario Tirelli',
'Matteo Deleidi & Mariana Mazzucato',
'Enrico Sergio Levrero & Matteo Deleidi & Francesca Iafrate',
'Simone Tedeschi & Luigi Ventura & Pierfederico Asdrubal',
'Stefano Di Bucchianico',
"Giorgio D'Agostino & Luca Pieroni & Margherita Scarlato",
'Giovanni Scarano']),
( 2018, ...
I got the years using a list comprehension. I got the titles and authors by appending a list to title_list and author_list for the required elements in each div element with the class panel-body, again using list comprehensions, and using next_sibling on the i elements to get the authors. Then I zipped the three lists and cast the result to a list. Finally I pretty-printed the result.
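Since the question asks for a three-column dataframe, the nested result above can be flattened into one (year, title, author) row per paper. A pure-Python sketch with made-up data in the same shape as result:

```python
# Made-up data in the same shape as result: [(year, [titles], [authors]), ...]
result = [
    (2020, ['Paper A', 'Paper B'], ['Author One', 'Author Two']),
    (2019, ['Paper C'], ['Author Three']),
]

# One (year, title, author) tuple per paper.
rows = [(year, title, author)
        for year, titles, authors in result
        for title, author in zip(titles, authors)]
print(rows)
```

rows can then be passed straight to pandas.DataFrame(rows, columns=['year', 'title', 'author']) if a dataframe is the end goal.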

Can't seem to extract text from element using BS4

I am trying to extract the name on this web page: https://steamcommunity.com/market/listings/730/AK-47%20%7C%20Redline%20%28Field-Tested%29
the element i am trying to grab it from is
<h1 class="hover_item_name" id="largeiteminfo_item_name" style="color:
rgb(210, 210, 210);">AK-47 | Redline</h1>
I am able to search for the ID "largeiteminfo_item_name" using Selenium and retrieve the text that way, but when I duplicate this with bs4 I can't seem to find the text.
I've tried searching the class "item_desc_description" but no text could be found there either. What am I doing wrong?
a = soup.find("h1", {"id": "largeiteminfo_item_name"})
a.get_text()
a = soup.find('div', {'class': 'item_desc_description'})
a.get_text()
I expected "AK-47 | Redline" but received '' for the first try and '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n' for the second try.
The data you are trying to extract is not present in the HTML page; I guess it might be generated on the fly with JavaScript (just guessing).
However I managed to find the info in the div "market_listing_nav".
from bs4 import BeautifulSoup as bs4
import requests
lnk = "https://steamcommunity.com/market/listings/730/AK-47%20%7C%20Redline%20%28Field-Tested%29"
res = requests.get(lnk)
soup = bs4(res.text, features="html.parser")
elem = soup.find("div", {"class" : "market_listing_nav"})
print(elem.get_text())
This will output the following
Counter-Strike: Global Offensive
>
AK-47 | Redline (Field-Tested)
Have a look at the web page source for a tag with better formatting, or just clean up the one generated by my code.
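Cleaning up that output is a one-liner. A sketch assuming the text looks like the three lines above (with the blank lines get_text() tends to leave in):

```python
# Sample of the text the snippet above prints (assumed shape).
text = """
Counter-Strike: Global Offensive
>
AK-47 | Redline (Field-Tested)
"""

# Strip each line, drop empties, and keep the last line: the item name.
lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
print(lines[-1])  # AK-47 | Redline (Field-Tested)
```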

Second Scraper - If Statement

I am working on my second Python scraper and keep running into the same problem. I would like to scrape the website shown in the code below. I would like to be able to input parcel numbers and see if their Property Use Code matches. However, I am not sure if my scraper is finding the correct row in the table. I am also not sure how to use the if statement when the use code is not 3730.
Any help would be appreciated.
from bs4 import BeautifulSoup
import requests

parcel = input("Parcel Number: ")
web = "https://mcassessor.maricopa.gov/mcs.php?q="
web_page = web + parcel
web_header = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
response = requests.get(web_page, headers=web_header, timeout=100)
soup = BeautifulSoup(response.content, 'html.parser')
table = soup.find("td", class_="Property Use Code")
first_row = table.find_all("td")[1]
if first_row is '3730':
    print(parcel)
else:
    print('N/A')
There's no td with class "Property Use Code" in the html you're looking at - that is the text of a td. If you want to find that row, you can use
td = soup.find('td', text="Property Use Code")
and then, to get the next td in that row, you can use:
otherTd = td.find_next_sibling()
or, if you want them all:
otherTds = td.find_next_siblings()
It's not clear to me what you want to do with the values of these tds, but you'll want to use the text attribute to access them: your first_row is '3730' will always be False, because first_row is a bs4.element.Tag object here and '3730' is a str. You can, however, get useful information from otherTd.text == '3730'.
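Putting those pieces together on a made-up fragment of the parcel table (the real page's markup may differ, so this is only a sketch of the pattern):

```python
from bs4 import BeautifulSoup

# Invented row mimicking the assessor page's table layout.
html = """
<table>
  <tr><td>Property Use Code</td><td>3730</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
td = soup.find('td', text='Property Use Code')  # match by cell text, not class
other = td.find_next_sibling()                  # the next td in the same row
print(other.text == '3730')  # compare the text attribute, not the Tag itself
```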

Issue in outputting data scraped using beautiful soup in two columns of csv using spamwriter.writerow

I am scraping two sets of data from a website using Beautiful Soup, and I want them output in a CSV file in two columns side by side. I am using the spamwriter.writerow([x, y]) call for this, but I think that because of some error in my loop structure I am getting the wrong output in my CSV file. Below is the referred code:
import csv
import urllib2
import sys
from bs4 import BeautifulSoup

page = urllib2.urlopen('http://www.att.com/shop/wireless/devices/smartphones.html').read()
soup = BeautifulSoup(page)
soup.prettify()
with open('Smartphones_20decv2.0.csv', 'wb') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=',')
    for anchor in soup.findAll('a', {"class": "clickStreamSingleItem"}, text=True):
        if anchor.string:
            print unicode(anchor.string).encode('utf8').strip()
    for anchor1 in soup.findAll('div', {"class": "listGrid-price"}):
        textcontent = u' '.join(anchor1.stripped_strings)
        if textcontent:
            print textcontent
            spamwriter.writerow([unicode(anchor.string).encode('utf8').strip(), textcontent])
Output which I am getting in csv is:
Samsung Focus® 2 (Refurbished) $99.99
Samsung Focus® 2 (Refurbished) $99.99 to $199.99 8 to 16 GB
Samsung Focus® 2 (Refurbished) $0.99
Samsung Focus® 2 (Refurbished) $0.99
Samsung Focus® 2 (Refurbished) $149.99 to $349.99 16 to 64 GB
The problem is that I am getting only one device name in column 1 instead of all of them, while the price is coming through for all devices.
Please pardon my ignorance, as I am new to programming.
You are using anchor.string instead of anchor1. anchor is the last item from the previous loop, not the item in the current loop.
Perhaps using clearer variable names would help avoid confusion here; use singleitem and gridprice perhaps?
It could be that I misunderstood, though, and you want to combine each anchor1 with a corresponding anchor. You'll have to loop over them together, perhaps using zip():
items = soup.findAll('a', {"class": "clickStreamSingleItem"}, text=True)
prices = soup.findAll('div', {"class": "listGrid-price"})

for item, price in zip(items, prices):
    textcontent = u' '.join(price.stripped_strings)
    if textcontent:
        print textcontent
        spamwriter.writerow([unicode(item.string).encode('utf8').strip(), textcontent])
Normally it should be easier to loop over the parent table row instead, then find the cells within that row within a loop. But the zip() should work too, provided the clickStreamSingleItem cells line up with the listGrid-price matches.
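The zip() pairing itself is easy to check without the site. A Python 3 sketch with made-up names and prices (stand-ins for the anchor texts and grid prices), writing to an in-memory buffer instead of a file:

```python
import csv
import io

names = ['Samsung Focus 2', 'HTC One X']   # stand-ins for the anchor texts
prices = ['$99.99', '$149.99 to $349.99']  # stand-ins for the grid prices

buf = io.StringIO()
spamwriter = csv.writer(buf, delimiter=',')
# zip() pairs the i-th name with the i-th price, one row per pair.
for item, price in zip(names, prices):
    spamwriter.writerow([item, price])
print(buf.getvalue())
```

As long as the two findAll result lists line up one-to-one, each row gets its own name, which fixes the repeated-name symptom in the question.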