Currently I'm trying to automatically scrape/download Yahoo Finance historical data. I plan to download the data using the download link provided on the website.
My code lists all the available links and works from there; the problem is that the exact link doesn't appear in the result. Here is my code (partial):
def scrape_page(url, header):
    page = requests.get(url, headers=header)
    if page.status_code == 200:
        soup = bs.BeautifulSoup(page.content, 'html.parser')
        return soup
    return None
if __name__ == '__main__':
    symbol = 'GOOGL'
    dt_start = datetime.today() - timedelta(days=(365*5+1))
    dt_end = datetime.today()
    start = format_date(dt_start)
    end = format_date(dt_end)
    sub = subdomain(symbol, start, end)
    header = header_function(sub)
    base_url = 'https://finance.yahoo.com'
    url = base_url + sub
    soup = scrape_page(url, header)
    result = soup.find_all('a')
    for a in result:
        print('URL :', a['href'])
UPDATE 10/9/2020:
I managed to find the span which is the parent of the link with this code:
spans = soup.find_all('span',{"class":"Fl(end) Pos(r) T(-6px)"})
However, when I print it out, it does not show the link. Here is the output:
>>> spans
[<span class="Fl(end) Pos(r) T(-6px)" data-reactid="31"></span>]
To download the historical data in CSV format from Yahoo Finance, you can use this example:
import requests
from datetime import datetime
csv_link = 'https://query1.finance.yahoo.com/v7/finance/download/{quote}?period1={from_}&period2={to_}&interval=1d&events=history'
quote = 'GOOGL'
from_ = str(datetime.timestamp(datetime(2019,9,27,0,0))).split('.')[0]
to_ = str(datetime.timestamp(datetime(2020,9,27,23,59))).split('.')[0]
print(requests.get(csv_link.format(quote=quote, from_=from_, to_=to_)).text)
Prints:
Date,Open,High,Low,Close,Adj Close,Volume
2019-09-27,1242.829956,1244.989990,1215.199951,1225.949951,1225.949951,1706100
2019-09-30,1220.599976,1227.410034,1213.420044,1221.140015,1221.140015,1223500
2019-10-01,1222.489990,1232.859985,1205.550049,1206.000000,1206.000000,1225200
2019-10-02,1196.500000,1198.760010,1172.630005,1177.920044,1177.920044,1651500
2019-10-03,1183.339966,1191.000000,1163.140015,1189.430054,1189.430054,1418400
2019-10-04,1194.290039,1212.459961,1190.969971,1210.959961,1210.959961,1214100
2019-10-07,1207.000000,1218.910034,1204.359985,1208.250000,1208.250000,852000
2019-10-08,1198.770020,1206.869995,1189.479980,1190.130005,1190.130005,1004300
2019-10-09,1201.329956,1208.459961,1198.119995,1202.400024,1202.400024,797400
2019-10-10,1198.599976,1215.619995,1197.859985,1209.469971,1209.469971,642100
2019-10-11,1224.030029,1228.750000,1213.640015,1215.709961,1215.709961,1116500
2019-10-14,1213.890015,1225.880005,1211.880005,1217.770020,1217.770020,664800
2019-10-15,1221.500000,1247.130005,1220.920044,1242.239990,1242.239990,1379200
2019-10-16,1241.810059,1254.189941,1238.530029,1243.000000,1243.000000,1149300
2019-10-17,1251.400024,1263.750000,1249.869995,1252.800049,1252.800049,1047900
2019-10-18,1254.689941,1258.109985,1240.140015,1244.410034,1244.410034,1581200
2019-10-21,1248.699951,1253.510010,1239.989990,1244.280029,1244.280029,904700
2019-10-22,1244.479980,1248.729980,1239.849976,1241.199951,1241.199951,1143100
2019-10-23,1240.209961,1258.040039,1240.209961,1257.630005,1257.630005,1064100
2019-10-24,1259.109985,1262.900024,1252.349976,1259.109985,1259.109985,1011200
...and so on.
I figured it out. That link is generated by JavaScript, and the requests.get() method won't work on dynamic content. I switched to Selenium to download via that link.
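Alternatively, the direct CSV endpoint from the answer above sidesteps the JavaScript-generated link entirely. A sketch building that URL for the question's five-year window (the endpoint pattern is taken from the answer above; no request is actually sent here):

```python
from datetime import datetime, timedelta

# Endpoint pattern from the CSV answer above; {from_} and {to_} are Unix timestamps.
csv_link = ('https://query1.finance.yahoo.com/v7/finance/download/{quote}'
            '?period1={from_}&period2={to_}&interval=1d&events=history')

end = datetime.today()
start = end - timedelta(days=365 * 5 + 1)  # the five-year window from the question

from_ = str(int(start.timestamp()))
to_ = str(int(end.timestamp()))
url = csv_link.format(quote='GOOGL', from_=from_, to_=to_)
print(url)
```

The resulting URL can then be fetched with requests.get() as shown in the answer above.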
I want to scrape two pieces of data from a website:
https://www.moneymetals.com/precious-metals-charts/gold-price
Specifically, I want the "Gold Price per Ounce" and the "Spot Change" percent two columns to the right of it.
Using only Python standard libraries, is this possible? A lot of tutorials use the HTML element id to scrape effectively, but inspecting the source for this page, it's just a table. Specifically, I want the second and fourth <td> which appear on the page.
It's possible to do it with standard python libraries; ugly, but possible:
import urllib.request
from html.parser import HTMLParser

URL = 'https://www.moneymetals.com/precious-metals-charts/gold-price'
page = urllib.request.Request(URL)
result = urllib.request.urlopen(page)
resulttext = result.read()
class MyHTMLParser(HTMLParser):
    gold = []
    def handle_data(self, data):
        self.gold.append(data)
parser = MyHTMLParser()
parser.feed(str(resulttext))
for i in parser.gold:
    if 'Gold Price per Ounce' in i:
        target = parser.gold.index(i)  # get the index location of the heading
        print(parser.gold[target+2])   # your target items are 2, 5 and 9 positions down in the list
        print(parser.gold[target+5].replace('\\n', ''))
        print(parser.gold[target+9].replace('\\n', ''))
Output (as of the time the url was loaded):
$1,566.70
8.65
0.55%
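The fixed offsets (+2, +5, +9) will break if the page layout shifts. A slightly more robust stdlib-only sketch tracks <td> boundaries instead, so each table cell lands in its own list entry (the sample HTML string here is made up to mimic the page's table):

```python
from html.parser import HTMLParser

class TableTextParser(HTMLParser):
    """Collect the text of each <td> cell, in document order."""
    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []
    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self.in_td = True
            self.cells.append('')
    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_td = False
    def handle_data(self, data):
        if self.in_td:
            self.cells[-1] += data.strip()

# Hypothetical sample mimicking the page's table row
sample = ('<table><tr><td>Gold Price per Ounce</td><td>$1,566.70</td>'
          '<td>8.65</td><td>0.55%</td></tr></table>')
p = TableTextParser()
p.feed(sample)
print(p.cells)  # ['Gold Price per Ounce', '$1,566.70', '8.65', '0.55%']
```

With the real page you would feed it the downloaded HTML and then pick the cells following the "Gold Price per Ounce" label.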
I am trying to extract the name on this web page: https://steamcommunity.com/market/listings/730/AK-47%20%7C%20Redline%20%28Field-Tested%29
The element I am trying to grab it from is:
<h1 class="hover_item_name" id="largeiteminfo_item_name" style="color:
rgb(210, 210, 210);">AK-47 | Redline</h1>
I am able to search for the ID "largeiteminfo_item_name" using Selenium and retrieve the text that way, but when I duplicate this with bs4 I can't seem to find the text.
I've tried searching the class "item_desc_description", but no text could be found there either. What am I doing wrong?
a = soup.find("h1", {"id": "largeiteminfo_item_name"})
a.get_text()
a = soup.find('div', {'class': 'item_desc_description'})
a.get_text()
I expected "AK-47 | Redline" but received '' for the first try and '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n' for the second try.
The data you are trying to extract is not present in the HTML page; I guess it might be generated with JavaScript (just guessing).
However I managed to find the info in the div "market_listing_nav".
from bs4 import BeautifulSoup as bs4
import requests
lnk = "https://steamcommunity.com/market/listings/730/AK-47%20%7C%20Redline%20%28Field-Tested%29"
res = requests.get(lnk)
soup = bs4(res.text, features="html.parser")
elem = soup.find("div", {"class" : "market_listing_nav"})
print(elem.get_text())
This will output the following
Counter-Strike: Global Offensive
>
AK-47 | Redline (Field-Tested)
Have a look at the web page source for a tag with better formatting, or just clean up the one generated by my code.
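Cleaning up that get_text() output can be done with plain string operations. A sketch run on the output shown above (the raw string here just mimics that output, including the blank lines and the ">" breadcrumb separator):

```python
# Mimics the raw get_text() output from the "market_listing_nav" div
raw = """
Counter-Strike: Global Offensive
>
AK-47 | Redline (Field-Tested)
"""

# Keep only non-empty lines that aren't the breadcrumb separator;
# the last remaining line is the item name.
parts = [line.strip() for line in raw.splitlines()
         if line.strip() and line.strip() != '>']
item_name = parts[-1]
print(item_name)  # AK-47 | Redline (Field-Tested)
```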
I am trying to pull all the text from the div class 'caselawcontent searchable-content'. This code just prints the HTML without the text from the web page. What am I missing to get the text?
The following link is in the 'filteredcasesdoc.txt' file:
http://caselaw.findlaw.com/mo-court-of-appeals/1021163.html
import requests
from bs4 import BeautifulSoup
with open('filteredcasesdoc.txt', 'r') as openfile1:
    for line in openfile1:
        rulingpage = requests.get(line).text
        soup = BeautifulSoup(rulingpage, 'html.parser')
        doctext = soup.find('div', class_='caselawcontent searchable-content')
        print(doctext)
from bs4 import BeautifulSoup
import requests
url = 'http://caselaw.findlaw.com/mo-court-of-appeals/1021163.html'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
I've used a more reliable .find method (key: value):
whole_section = soup.find('div', {'class': 'caselawcontent searchable-content'})

the_title = whole_section.center.h2
# e.g. Missouri Court of Appeals, Southern District, Division Two.
second_title = whole_section.center.h3.p
# e.g. STATE of Missouri, Plaintiff-Appellant v....
number_text = whole_section.center.h3.next_sibling.next_sibling
# e.g.
the_date = number_text.next_sibling.next_sibling
# authors
authors = whole_section.center.next_sibling
para = whole_section.findAll('p')[1:]
# Because we don't want the paragraph h3.p.
# We could also do findAll('p', recursive=False), which doesn't pick up children.
Basically, I've dissected this whole tree.
As for the paragraphs (e.g. the main text, the variable para), you'll have to loop:
print(authors)
# You can add .text (e.g. print(authors.text)) to get the text without the tag,
# or use a simple function that returns only the text:
def rettext(something):
    return something.text
# Usage: print(rettext(authors))
Try printing doctext.text. This will get rid of all the HTML tags for you.
import requests
from bs4 import BeautifulSoup

cases = []
with open('filteredcasesdoc.txt', 'r') as openfile1:
    for url in openfile1:
        # GET the HTML page as a string, with HTML tags
        rulingpage = requests.get(url).text
        soup = BeautifulSoup(rulingpage, 'html.parser')
        # find the part of the HTML page we want, as an HTML element
        doctext = soup.find('div', class_='caselawcontent searchable-content')
        print(doctext.text)  # now we have the inner text as a string
        cases.append(doctext.text)  # do something useful with this!
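Note that .text keeps all the newlines and indentation from the original markup, so it often helps to normalize the whitespace before storing it. A small stdlib sketch (clean_case_text is a name I made up for illustration):

```python
import re

def clean_case_text(raw):
    # Collapse the runs of newlines and spaces left over from .text extraction
    return re.sub(r'\s+', ' ', raw).strip()

print(clean_case_text('\n\n  STATE of Missouri,\n  Plaintiff-Appellant\n'))
# STATE of Missouri, Plaintiff-Appellant
```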
So I'm trying to make a program that tells the user how far away Voyager 1 is from the Earth. NASA has this info on their website here: http://voyager.jpl.nasa.gov/where/index.html...
I can't seem to manage to get the information within the div. Here's the div: <div id="voy1_km">Distance goes here</div>
My current program is as follows:
import requests
from bs4 import BeautifulSoup

url = "http://voyager.jpl.nasa.gov/where/index.html"
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'html.parser')
test = soup.find('div', {'id': 'voy1_km'})
print(test)
So long story short, How do I get the div contents?
As you can see from the webpage itself, the distance keeps changing, which is actually driven by JavaScript. You could just read the JavaScript code, so you don't even need to scrape to get the distance... (I hate websites using JavaScript as much as you do :))
If you really want to get the number off their website, you can use Selenium.
# pip install selenium
from selenium import webdriver
import time

driver = webdriver.Firefox()
driver.get("http://voyager.jpl.nasa.gov/where/index.html")
time.sleep(5)  # give the JavaScript time to populate the table
elem = driver.find_element_by_class_name("tr_dark")
print(elem.text)
driver.close()
Here is the output:
Distance from Earth
19,964,147,071 KM
133.45208042 AU
Of course, please refer to the terms and conditions of their website regarding to what extent you may scrape their website and distribute the data.
The bigger question is why even bother scraping it. If you dive a bit deeper into the JavaScript file, you can repeat its calculation in a very simple manner:
import time

# Reference points lifted from the page's JavaScript:
# Voyager 1's distance (km) at two known epochs (Unix time)
epoch_0 = 1445270400
epoch_1 = 1445356800
dist_0_v1 = 19963672758.0152
dist_1_v1 = 19966727483.2612

# Linearly extrapolate the distance to the current time
current_time = time.time()
current_dist_km_v1 = (((current_time - epoch_0) / (epoch_1 - epoch_0))
                      * (dist_1_v1 - dist_0_v1)) + dist_0_v1
print("{:,.0f} KM".format(current_dist_km_v1))
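The same interpolated value can also be converted to the AU figure the page displays. A sketch using the IAU definition of the astronomical unit (the kilometre value below is the snapshot from the Selenium output above):

```python
KM_PER_AU = 149597870.7  # kilometres per astronomical unit (IAU 2012 definition)

def km_to_au(km):
    return km / KM_PER_AU

# e.g. the snapshot from the Selenium answer above: 19,964,147,071 KM
print("{:.8f} AU".format(km_to_au(19964147071)))
```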