Best way to deal with "key error" when scraping (yahoo finance)? - json

Hi, I am writing a scraper for Yahoo Finance. I am using JSON to get keys and then reading values from those keys, e.g.:
fwd_div_yield = data['context']['dispatcher']['stores']['QuoteSummaryStore']["summaryDetail"]['dividendYield']['raw']
The problem is that if a company doesn't pay a dividend, this produces a KeyError: instead of returning raw = 0, the response simply doesn't contain a 'raw' key at all. If a company does pay a dividend, the lookup returns 'raw', 'fmt', etc.
I was wondering what the most efficient way of dealing with this is?
Another question is: how would you access ...
[{'raw': 1595894400, 'fmt': '2020-07-28'}, {'raw': 1596412800, 'fmt': '2020-08-03'}]
My current solution is...
earnings_dates = data['context']['dispatcher']['stores']['QuoteSummaryStore']['calendarEvents']['earnings']['earningsDate'][0]['fmt']
earnings_datee = data['context']['dispatcher']['stores']['QuoteSummaryStore']['calendarEvents']['earnings']['earningsDate'][1]['fmt']
earnings_date = earnings_dates+", "+earnings_datee

To extract the dividend yield from the raw key and not get a KeyError when it's not there, do the following:
fwd_div_yield = data['context']['dispatcher']['stores']['QuoteSummaryStore']["summaryDetail"]['dividendYield'].get('raw', 0)
If raw is not there, fwd_div_yield will be 0.
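If intermediate keys can also be missing (for example summaryDetail itself), here is a minimal sketch of the same idea applied at every level, with empty dicts as defaults:
fwd_div_yield = (data.get('context', {})
                     .get('dispatcher', {})
                     .get('stores', {})
                     .get('QuoteSummaryStore', {})
                     .get('summaryDetail', {})
                     .get('dividendYield', {})
                     .get('raw', 0))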
Then to retrieve each date from the list of dictionaries, you can use a list comprehension:
earnings_dates = data['context']['dispatcher']['stores']['QuoteSummaryStore']['calendarEvents']['earnings']['earningsDate']
fmt_dates = [date['fmt'] for date in earnings_dates]
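If you then want the same comma-separated string as in your current solution, join the list:
earnings_date = ", ".join(fmt_dates)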
Also, this data is available via url: https://query2.finance.yahoo.com/v10/finance/quoteSummary/aapl?modules=summaryDetail. Just replace aapl with the symbol you're scraping.
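For illustration, a small sketch of reading the dividend yield straight from that endpoint (the quoteSummary -> result -> [0] -> summaryDetail shape is what this endpoint returns; a browser-like User-Agent header is assumed, since Yahoo sometimes rejects bare requests):
import requests

headers = {'User-Agent': 'Mozilla/5.0'}
url = 'https://query2.finance.yahoo.com/v10/finance/quoteSummary/aapl?modules=summaryDetail'
summary_detail = requests.get(url, headers=headers).json()['quoteSummary']['result'][0]['summaryDetail']
fwd_div_yield = summary_detail.get('dividendYield', {}).get('raw', 0)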

I would wrap whatever code is checking if the company pays a dividend in a try/except block.
def paysDividend(data):
    try:
        # Indexing (rather than an 'in' check) raises KeyError when 'raw' is missing
        data['raw']
        return True
    except KeyError:
        return False
Without seeing any example code, this is a quick-fix solution.
For the second question...
If you are asking how to create [{'raw': 1234, 'fmt': '2020-07-28'}, ...] based on the compiled list of companies that pay a dividend, create the list:
def dividendList(data):
    dividend_list = []
    for company in data:
        dividend_list.append({'raw': company['path']['to']['raw'], 'fmt': company['path']['to']['fmt']})
    return dividend_list
If you are trying to access each one after you have already created the list:
def accessDividend(dividend_data):
    for dividend in dividend_data:
        print(f"{dividend['raw']}, {dividend['fmt']}")
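A hypothetical usage of those two helpers (assuming companies is the compiled list of per-company dicts mentioned above):
dividend_list = dividendList(companies)
accessDividend(dividend_list)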

I created this method as a workaround.
import requests
import pandas as pd
from datetime import datetime as dt  # dt.fromtimestamp is used below

def yfinanceDataframe(symbol, interval, _range):
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
    data = requests.get(f'https://query1.finance.yahoo.com/v8/finance/chart/{symbol}?interval={interval}&range={_range}', headers=headers).json()
    timestamp = data['chart']['result'][0]['timestamp']
    data = data['chart']['result'][0]['indicators']['quote'][0]
    df = pd.DataFrame(data)
    df['Datetime'] = timestamp
    df['Datetime'] = df['Datetime'].apply(lambda x: dt.fromtimestamp(x).strftime('%m/%d/%Y %H:%M'))
    df.dropna(inplace=True)
    df.reset_index(inplace=True)
    df.rename(columns={'close': 'Close'}, inplace=True)
    return df
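A possible usage example (the interval and range strings follow Yahoo's chart API conventions, e.g. '1d', '1mo'):
df = yfinanceDataframe('AAPL', '1d', '1mo')
print(df.head())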

Related

JSONDecodeError: Expecting value: line 1 column 1 (char 0) while getting data from Pokemon API

I am trying to scrape the pokemon API and create a dataset for all pokemon. So I have written a function which looks like this:
import requests
import json
import pandas as pd

def poke_scrape(x, y):
    '''
    A function that takes in a range of pokemon (based on pokedex ID) and returns
    a pandas dataframe with information related to the pokemon using the Poke API
    '''
    # GATHERING THE DATA FROM API
    url = 'https://pokeapi.co/api/v2/pokemon/'
    ids = range(x, (y+1))
    pkmn = []
    for id_ in ids:
        url = 'https://pokeapi.co/api/v2/pokemon/' + str(id_)
        pages = requests.get(url).json()
        # content = json.dumps(pages, indent = 4, sort_keys=True)
        if 'error' not in pages:
            pkmn.append([pages['id'], pages['name'], pages['abilities'], pages['stats'], pages['types']])
    # MAKING A DATAFRAME FROM GATHERED API DATA
    cols = ['id', 'name', 'abilities', 'stats', 'types']
    df = pd.DataFrame(pkmn, columns=cols)
The code works fine for most pokemon. However, when I try to run poke_scrape(229, 229) (so trying to load ONLY the 229th pokemon), it gives me the JSONDecodeError.
So far I have tried using json.loads() instead, but that has not solved the issue. What is even more perplexing is that this specific pokemon has loaded before, and the same issue previously occurred with another ID - otherwise I could just manually enter the stats for the pokemon that fails to load into my dataframe. Any help is appreciated!
Because of the way the PokeAPI works, some links to the JSON data for each pokemon only load when the URL ends with a '/' (for example, https://pokeapi.co/api/v2/pokemon/229/ works while https://pokeapi.co/api/v2/pokemon/229 returns not found). However, other IDs respond with an error because of the added '/'. I fixed the issue with a few if statements right after the for loop at the beginning of the function.
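A minimal sketch of that retry logic (the helper name and structure here are my own; the answer only describes adding a few if statements):
import requests

def fetch_pokemon(id_):
    # Some IDs only resolve with a trailing slash, others only without it,
    # so try both forms before giving up.
    base = 'https://pokeapi.co/api/v2/pokemon/' + str(id_)
    for url in (base, base + '/'):
        response = requests.get(url)
        if response.ok:
            return response.json()
    return None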

Groovy - parse/ convert x-www-form-urlencoded to something like JSON

This is my second try at explaining a bit more precisely what I'm looking for ;-)
I set up a webhook in Mailchimp that fires every time a new subscriber is added to an audience. Mailchimp sends an HTTP POST request to a Jira ScriptRunner REST endpoint.
The content type of this request is application/x-www-form-urlencoded.
Within the Jira endpoint I would like to read the request data. How can I do that?
The payload (raw body) I receive looks like this:
type=unsubscribe&fired_at=2020-05-26+07%3A04%3A42&data%5Baction%5D=unsub&data%5Breason%5D=manual&data%5Bid%5D=34f28a4516&data%5Bemail%5D=examlple%40bla.com&data%5Bemail_type%5D=html&data%5Bip_opt%5D=xx.xxx.xxx.198&data%5Bweb_id%5D=118321378&data%5Bmerges%5D%5BEMAIL%5D=example%40bla.com&data%5Bmerges%5D%5BFNAME%5D=Horst&data%5Bmerges%5D%5BLNAME%5D=Schlemmer&data%5Bmerges%5D%5BCOMPANY%5D=First&data%5Bmerges%5D%5BADDRESS%5D%5Baddr1%5D=XXX
Now I would like to parse the raw body data into JSON or something similar.
The result might look like this:
{
"web_id": 123,
"email": "example@bla.com",
"company": "First",
...
}
Meanwhile I searched around a little and found something like the Node.js "querystring" module. It would be great if there were something similar in Groovy, or any other way to parse application/x-www-form-urlencoded data into JSON format.
Best regards and thanks in advance
Bernhard
def body = "type=unsubscribe&fired_at=2020-05-26+07%3A04%3A42&data%5Baction%5D=unsub&data%5Breason%5D=manual&data%5Bid%5D=34f28a4516&data%5Bemail%5D=examlple%40bla.com&data%5Bemail_type%5D=html&data%5Bip_opt%5D=xx.xxx.xxx.198&data%5Bweb_id%5D=118321378&data%5Bmerges%5D%5BEMAIL%5D=example%40bla.com&data%5Bmerges%5D%5BFNAME%5D=Horst&data%5Bmerges%5D%5BLNAME%5D=Schlemmer&data%5Bmerges%5D%5BCOMPANY%5D=First&data%5Bmerges%5D%5BADDRESS%5D%5Baddr1%5D=XXX"
def map = body.split('&').collectEntries{ e ->
    e.split('=').collect{ URLDecoder.decode(it, "UTF-8") }
}
assert map.'data[merges][EMAIL]' == 'example@bla.com'
map.each{ println it }
prints:
type=unsubscribe
fired_at=2020-05-26 07:04:42
data[action]=unsub
data[reason]=manual
data[id]=34f28a4516
data[email]=examlple@bla.com
data[email_type]=html
data[ip_opt]=xx.xxx.xxx.198
data[web_id]=118321378
data[merges][EMAIL]=example@bla.com
data[merges][FNAME]=Horst
data[merges][LNAME]=Schlemmer
data[merges][COMPANY]=First
data[merges][ADDRESS][addr1]=XXX
A simple no-brainer Groovy version:
def a = '''
data[email_type]: html
data[web_id]: 123
fired_at: 2020-05-26 07:28:25
data[email]: example@bla.com
data[merges][COMPANY]: First
data[merges][FNAME]: Horst
data[ip_opt]: xx.xxx.xxx.xxx
data[merges][PHONE]: xxxxx
data[merges][ADDRESS][zip]: 33615
type: subscribe
data[list_id]: xxXXyyXX
data[merges][ADDRESS][addr1]: xxx.xxx'''
def res = [:]
a.eachLine{
    def parts = it.split( /\s*:\s*/, 2 )
    if( 2 != parts.size() ) return
    def ( k, v ) = parts
    def complexKey = ( k =~ /\[(\w+)\]/ ).findAll()
    if( complexKey ) complexKey = complexKey.last().last()
    res[ ( complexKey ?: k ).toLowerCase() ] = v
}
res
gives:
[email_type:html, web_id:123, fired_at:2020-05-26 07:28:25,
email:example@bla.com, company:First, fname:Horst, ip_opt:xx.xxx.xxx.xxx,
phone:xxxxx, zip:33615, type:subscribe, list_id:xxXXyyXX, addr1:xxx.xxx]
I finally found a solution. I hope you understand it, and maybe it helps others too ;-)
Starting from daggett's answer I did the following:
import groovy.json.JsonBuilder
import groovy.json.JsonSlurper

// Split the body and decode the URL-encoded parts
def map = body.split('&').collectEntries{ e ->
    e.split('=').collect{ URLDecoder.decode(it, "UTF-8") }
}
// Pretty-print the map as a JSON string
def prettyMap = new JsonBuilder(map).toPrettyString()
// Convert the pretty-printed string into a JSON object
def slurper = new JsonSlurper()
def jsonObject = slurper.parseText(prettyMap)
(The map looks pretty much like the one in daggett's answer.)
Then I extract the keys:
// Finally extracting customer data
def type = jsonObject['type']
And I get the data I need. For example:
Type : subscribe
...
First Name : Heinz
...
Thanks to daggett!

Use Querysets and JSON correctly in API View

I am trying to write my custom API view and I am struggling a bit with querysets and JSON. It shouldn't be that complicated, but I am still stuck. I am also confused by some strange behaviour of the loop I coded.
Here is my view:
@api_view()
def BuildingGroupHeatYear(request, pk, year):
    passed_year = str(year)
    building_group_object = get_object_or_404(BuildingGroup, id=pk)
    buildings = building_group_object.buildings.all()
    for item in buildings:
        demand_heat_item = item.demandheat_set.filter(year=passed_year).values('building_id', 'year', 'demand')
        print(demand_heat_item)
        print(type(demand_heat_item))
    return Response(demand_heat_item)
OK, so this actually gives me back exactly what I want, namely:
{'building_id': 1, 'year': 2019, 'demand': 230.3}{'building_id': 1, 'year': 2019, 'demand': 234.0}
Ok, great, but why? Shouldn't the data be overwritten each time the loop goes over it?
Also when I get the type of the demand_heat_item I get back a queryset <class 'django.db.models.query.QuerySet'>
But this is an API view, so I would like to get JSON back. Shouldn't that throw an error? And how could I do this so that I get the same data structure back as JSON?
I tried to rewrite it like this, but without success because I can't serialize it:
@api_view()
def BuildingGroupHeatYear(request, pk, year):
    passed_year = str(year)
    building_group_object = get_object_or_404(BuildingGroup, id=pk)
    buildings = building_group_object.buildings.all()
    demand_list = []
    for item in buildings:
        demand_heat_item = item.demandheat_set.filter(year=passed_year).values('building_id', 'year', 'demand')
        demand_list.append(demand_heat_item)
    json_data = json.dumps(demand_list)
    return Response(json_data)
I also tried JsonResponse and the JSON decoder.
But maybe there is a better way to do this?
Or maybe my question is formulated more clearly like this: how can I get the data out of the loop and return it as JSON?
Any help is much appreciated. Thanks in advance!!
Also, I tried the following:
for item in buildings:
    demand_heat_item = item.demandheat_set.filter(year=passed_year).values('building_id', 'year', 'demand')
    json_data = json.dumps(list(demand_heat_item))
return Response(json_data)
That gives me this weird response that I don't really want:
"[{\"building_id\": 1, \"year\": 2019, \"demand\": 230.3}, {\"building_id\": 1, \"year\": 2019, \"demand\": 234.0}]"

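For what it's worth, a minimal sketch of one way around that double-encoded string: collect the rows into one plain list and hand it to Response directly, since DRF serializes to JSON itself and json.dumps is not needed:
@api_view()
def BuildingGroupHeatYear(request, pk, year):
    building_group_object = get_object_or_404(BuildingGroup, id=pk)
    buildings = building_group_object.buildings.all()
    demand_list = []
    for item in buildings:
        # .values() yields plain dicts, which are JSON-serializable
        demand_list.extend(item.demandheat_set.filter(year=str(year)).values('building_id', 'year', 'demand'))
    # Response serializes the list itself; no json.dumps required
    return Response(demand_list)
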
Scrape table from webpage when in <div> format - using Beautiful Soup

So I'm aiming to scrape two tables (in different formats) from a website - https://info.fsc.org/details.php?id=a0240000005sQjGAAU&type=certificate - after using the search bar to iterate over a list of license codes. I haven't fully included the loop yet, but I added it at the top for completeness.
My issue is that the two tables I want, Product Data and Certificate Data, are in two different formats, so I have to scrape them separately. As the Product Data is in the normal "tr" format on the webpage, this bit is easy and I've managed to extract a CSV file of it. The harder bit is extracting the Certificate Data, as it is in "div" form.
I've managed to print the Certificate Data as a list of text using the class function, but I need it in tabular form saved in a CSV file. As you can see, I've tried several unsuccessful ways of converting it to CSV, so any suggestions would be much appreciated, thank you!! Any other general tips to improve my code would be great too, as I am new to web scraping.
#namelist = open('example.csv', newline='', delimiter = 'example')
#for name in namelist:
#include all of the below

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import requests
import pandas as pd

driver = webdriver.Chrome(executable_path="/Users/jamesozden/Downloads/chromedriver")
url = "https://info.fsc.org/certificate.php"
driver.get(url)
search_bar = driver.find_element_by_xpath('//*[@id="code"]')
search_bar.send_keys("FSC-C001777")
search_bar.send_keys(Keys.RETURN)
new_url = driver.current_url
r = requests.get(new_url)
soup = BeautifulSoup(r.content, 'lxml')
table = soup.find_all('table')[0]
df, = pd.read_html(str(table))
certificate = soup.find(class_='certificatecl').text
##certificate1 = pd.read_html(str(certificate))
driver.quit()
df.to_csv("Product_Data.csv", index=False)
##certificate1.to_csv("Certificate_Data.csv", index=False)
#print(df[0].to_json(orient='records'))
print(certificate)
Output:
Status
Valid
First Issue Date
2009-04-01
Last Issue Date
2018-02-16
Expiry Date
2019-04-01
Standard
FSC-STD-40-004 V3-0
What I want but over hundreds/thousands of license codes (I just manually created this one sample in Excel):
[screenshot: desired output]
EDIT
So whilst this now works for the Certificate Data, I also want to scrape the Product Data and output it into another .csv file. However, it currently only writes five copies of the product data for the final license code, which is not what I want.
New Code:
df = pd.read_csv("MS_License_Codes.csv")
codes = df["License Code"]

def get_data_by_code(code):
    data = [
        ('code', code),
        ('submit', 'Search'),
    ]
    response = requests.post('https://info.fsc.org/certificate.php', data=data)
    soup = BeautifulSoup(response.content, 'lxml')
    status = soup.find_all("label", string="Status")[0].find_next_sibling('div').text
    first_issue_date = soup.find_all("label", string="First Issue Date")[0].find_next_sibling('div').text
    last_issue_date = soup.find_all("label", string="Last Issue Date")[0].find_next_sibling('div').text
    expiry_date = soup.find_all("label", string="Expiry Date")[0].find_next_sibling('div').text
    standard = soup.find_all("label", string="Standard")[0].find_next_sibling('div').text
    return [code, status, first_issue_date, last_issue_date, expiry_date, standard]

# Just insert here output filename and codes to parse...
OUTPUT_FILE_NAME = 'Certificate_Data.csv'
#codes = ['C001777', 'C001777', 'C001777', 'C001777']

df3 = pd.DataFrame()
with open(OUTPUT_FILE_NAME, 'w') as f:
    writer = csv.writer(f)
    for code in codes:
        print('Getting code# {}'.format(code))
        writer.writerow(get_data_by_code(code))
        table = soup.find_all('table')[0]
        df1, = pd.read_html(str(table))
        df3 = df3.append(df1)
df3.to_csv('Product_Data.csv', index=False, encoding='utf-8')
Here's all you need.
No chromedriver. No pandas. Forget about them in the context of scraping.
import requests
import csv
from bs4 import BeautifulSoup

# This is all what you need for your task. Really.
# No chromedriver. Don't use it for scraping. EVER.
# No pandas. Don't use it for writing csv. It's not what pandas was made for.

# Function to parse single data page based on single input code.
def get_data_by_code(code):
    # Parameters to build POST-request.
    # "type" and "submit" params are static. "code" is your desired code to scrape.
    data = [
        ('type', 'certificate'),
        ('code', code),
        ('submit', 'Search'),
    ]
    # POST-request to gain page data.
    response = requests.post('https://info.fsc.org/certificate.php', data=data)
    # "soup" object to parse html data.
    soup = BeautifulSoup(response.content, 'lxml')
    # "status" variable. Contains first's found [LABEL tag, with text="Status"] following sibling DIV text. Which is status.
    status = soup.find_all("label", string="Status")[0].find_next_sibling('div').text
    # Same for issue dates... etc.
    first_issue_date = soup.find_all("label", string="First Issue Date")[0].find_next_sibling('div').text
    last_issue_date = soup.find_all("label", string="Last Issue Date")[0].find_next_sibling('div').text
    expiry_date = soup.find_all("label", string="Expiry Date")[0].find_next_sibling('div').text
    standard = soup.find_all("label", string="Standard")[0].find_next_sibling('div').text
    # Returning found data as list of values.
    return [response.url, status, first_issue_date, last_issue_date, expiry_date, standard]

# Just insert here output filename and codes to parse...
OUTPUT_FILE_NAME = 'output.csv'
codes = ['C001777', 'C001777', 'C001777', 'C001777']

with open(OUTPUT_FILE_NAME, 'w') as f:
    writer = csv.writer(f)
    for code in codes:
        print('Getting code# {}'.format(code))
        # Writing list of values to file as single row.
        writer.writerow(get_data_by_code(code))
Everything is really straightforward here. I'd suggest you spend some time in the Chrome dev tools "Network" tab to get a better understanding of request forging, which is a must for scraping tasks.
In general, you don't need to run Chrome to click the "search" button; you need to forge the request generated by this click. The same goes for any form and ajax.
well... you should sharpen your skills (:
df3 = pd.DataFrame()
with open(OUTPUT_FILE_NAME, 'w') as f:
    writer = csv.writer(f)
    for code in codes:
        print('Getting code# {}'.format(code))
        writer.writerow(get_data_by_code(code))
        ### HERE'S THE PROBLEM:
        # "soup" variable is declared inside of the "get_data_by_code" function,
        # so you can't use it in the outer context.
        table = soup.find_all('table')[0]  # <--- you should move this line into
        # the definition of "get_data_by_code" and return its value somehow...
        df1, = pd.read_html(str(table))
        df3 = df3.append(df1)
df3.to_csv('Product_Data.csv', index=False, encoding='utf-8')
For example, you can return a dictionary of values from the "get_data_by_code" function:
def get_data_by_code(code):
    ...
    table = soup.find_all('table')[0]
    # 'row' stands for the list of values built above
    return dict(row=row, table=table)
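And a sketch of how the outer loop might then consume that dictionary (assuming get_data_by_code now returns {'row': [...], 'table': <table tag>}):
df3 = pd.DataFrame()
with open(OUTPUT_FILE_NAME, 'w') as f:
    writer = csv.writer(f)
    for code in codes:
        result = get_data_by_code(code)
        writer.writerow(result['row'])
        df1, = pd.read_html(str(result['table']))
        df3 = df3.append(df1)
df3.to_csv('Product_Data.csv', index=False, encoding='utf-8')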

Parsing JSON string in Python

I am new to Python, so I beg your patience!
In the following output string, I need to get the latest price (determined by timestamp) where type = 'bid'. Please suggest how I can read the output into JSON and get the latest price.
{"dollar_pound":[
{"type":"ask","price":0.01769341,"amount":1.10113151,"tid":200019988,"timestamp":1515919171},
{"type":"ask","price":0.017755,"amount":3.95681783,"tid":200019987,"timestamp":1515919154},
{"type":"bid","price":0.01778859,"amount":3.7753814,"tid":200019986,"timestamp":1515919152},
{"type":"ask","price":0.017755,"amount":0.01216145,"tid":200019985,"timestamp":1515919147},
{"type":"ask","price":0.017755,"amount":0.05679142,"tid":200019984,"timestamp":1515919135}]}
I tried this, but it didn't work:
parsed_json = json.loads(request.text)
price = parsed_json['price'][0]
I think this may be what you want - here's a short script to get the latest price of type "bid":
# Here's a few more test cases for bid prices to let you test out your script
parsed_json = {"dollar_pound":[
{"type":"ask","price":0.01769341,"amount":1.10113151,"tid":200019988,"timestamp":1515919171},
{"type":"ask","price":0.017755,"amount":3.95681783,"tid":200019987,"timestamp":1515919154},
{"type":"bid","price":0.01778859,"amount":3.7753814,"tid":200019986,"timestamp":1515919152},
{"type":"bid","price":0.01542344,"amount":3.7753814,"tid":200019983,"timestamp":1715929152},
{"type":"bid","price":0.023455,"amount":3.7753814,"tid":200019982,"timestamp":1515919552},
{"type":"ask","price":0.017755,"amount":0.01216145,"tid":200019985,"timestamp":1515919147},
{"type":"ask","price":0.017755,"amount":0.05679142,"tid":200019984,"timestamp":1515919135}]}
# To get items of type "bid"
def get_bid_prices(parsed_json):
    return filter(lambda x: x["type"] == "bid", parsed_json)
# Now, we want to get the latest "bid" price, i.e. the largest number in the "timestamp" field
latest_bid_price = max(get_bid_prices(parsed_json["dollar_pound"]), key=lambda x: x["timestamp"])
# Your result will be printed here
print(latest_bid_price) # {"type":"bid","price":0.01542344,"amount":3.7753814,"tid":200019983,"timestamp":1715929152}
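To pull out just the number afterwards:
print(latest_bid_price['price'])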
For all the good fellas struggling like me, I would like to share the answer:
json_data = json.loads(req.text)
for x in json_data['dollar_pound']:
    print(x['price'])
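Note that this prints every price, not just the latest bid. To keep only bids and take the most recent by timestamp, the max-with-key approach from the answer above applies here too:
latest_bid = max((x for x in json_data['dollar_pound'] if x['type'] == 'bid'),
                 key=lambda x: x['timestamp'])
print(latest_bid['price'])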