Q1. Is there any way to extract data from the table while still being able to trace back to the axis titles?
Q2. Which approach is better for extracting data from an HTML table: HTMLParser, BeautifulSoup, or something else?
I was trying to extract this income table:
http://investing.businessweek.com/research/stocks/financials/financials.asp?ticker=TSCO:LN
I would like the output to be:
"Currency in Millions of British Pounds","2009","2010","2011","2012"
"Revenues", "53,898.0", "56,910.0", "60,455.0", "64,539.0"
"TOTAL REVENUES", "53,898.0", "56,910.0", "60,455.0", "64,539.0"
In the meantime, I want to be able to tell that "56,910.0" is the revenue for 2010.
But I experienced two issues:
HTMLParser.HTMLParseError: malformed start tag, at line 1148, column 47 or
HTMLParser.HTMLParseError: bad end tag: "", at line 225, column 104
and I can't keep track of the axis titles.
Many thanks
I've done quite a bit of scraping and BeautifulSoup rarely disappoints.
from BeautifulSoup import BeautifulSoup
from urllib import urlopen

URL = "http://investing.businessweek.com/research/stocks/financials/financials.asp?ticker=TSCO:LN"
HTML = urlopen(URL)
soup = BeautifulSoup(HTML)
statement = soup.find('table', {'class': "financialStatement"})
rows = statement.findAll('tr')
At this point I think you will find that rows has a length of 25 and that its first item is the header and last is the final row of the desired table.
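From there, a short sketch of how you might rebuild the table with its axis titles intact, so any figure can be traced back to its row label and year. This assumes the header cells and data cells are all <td> tags; inspect the page and swap in 'th' if needed:
header = [''.join(cell.findAll(text=True)).strip() for cell in rows[0].findAll('td')]
table = {}
for row in rows[1:]:
    cells = [''.join(cell.findAll(text=True)).strip() for cell in row.findAll('td')]
    if cells:
        # key each value by its year so it can be traced back later
        table[cells[0]] = dict(zip(header[1:], cells[1:]))
# table['Revenues']['2010'] would then give '56,910.0'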
Related
I am trying to scrape the pokemon API and create a dataset for all pokemon. So I have written a function which looks like this:
import requests
import json
import pandas as pd

def poke_scrape(x, y):
    '''
    A function that takes in a range of pokemon (based on pokedex ID) and returns
    a pandas dataframe with information related to the pokemon using the Poke API
    '''
    # GATHERING THE DATA FROM THE API
    url = 'https://pokeapi.co/api/v2/pokemon/'
    ids = range(x, (y + 1))
    pkmn = []
    for id_ in ids:
        url = 'https://pokeapi.co/api/v2/pokemon/' + str(id_)
        pages = requests.get(url).json()
        # content = json.dumps(pages, indent=4, sort_keys=True)
        if 'error' not in pages:
            pkmn.append([pages['id'], pages['name'], pages['abilities'], pages['stats'], pages['types']])

    # MAKING A DATAFRAME FROM THE GATHERED API DATA
    cols = ['id', 'name', 'abilities', 'stats', 'types']
    df = pd.DataFrame(pkmn, columns=cols)
The code works fine for most pokemon. However, when I try to run poke_scrape(229, 229) (so loading ONLY the 229th pokemon), it raises a JSONDecodeError.
So far I have tried using json.loads() instead, but that has not solved the issue. What is even more perplexing is that this specific pokemon has loaded before, and the same issue has occurred with another ID; otherwise I could just manually enter the stats for the specific pokemon that fails to load into my dataframe. Any help is appreciated!
Because of the way the PokeAPI works, some links to the JSON data for each pokemon only load when the URL ends with a '/' (for example, https://pokeapi.co/api/v2/pokemon/229/ works while https://pokeapi.co/api/v2/pokemon/229 returns not found). However, others respond with an error because of the added '/', so I fixed the issue with a few if statements right after the for loop at the beginning of the function.
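The answer doesn't show the exact statements, so here is only a rough sketch of that kind of guard, assuming a 404 status check is how the failing form is detected:
for id_ in ids:
    url = 'https://pokeapi.co/api/v2/pokemon/' + str(id_)
    resp = requests.get(url)
    if resp.status_code == 404:
        # assumption: retry with a trailing slash, since some IDs only resolve that way
        resp = requests.get(url + '/')
    if resp.ok:
        pages = resp.json()
        pkmn.append([pages['id'], pages['name'], pages['abilities'], pages['stats'], pages['types']])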
I am having a problem with web scraping. I am trying to learn how to do it, but I can't seem to get past some of the basics. The error I'm getting is "TypeError: 'ResultSet' object is not callable".
I've tried a number of different things. I was originally trying to use "find" instead of the "find_all" function, but I was having an issue with BeautifulSoup returning a NoneType. I was unable to write an if statement that could handle that exception, so I tried using "find_all" instead.
page = requests.get('https://topworkplaces.com/publication/ocregister/')
soup = BeautifulSoup(page.text, 'html.parser')
all_company_list = soup.find_all(class_='sortable-table')
#all_company_list = soup.find(class_='sortable-table')
company_name_list_items = all_company_list('td')
for company_name in company_name_list_items:
    #print(company_name.prettify())
    companies = company_name.content[0]
I'd like this to pull in all the companies in Orange County, California that are on this list in a clean manner. As you can see, I've already managed to pull them in, but I want the list to be clean.
You've got the right idea. I think instead of immediately finding all the <td> tags (which is going to return one <td> for each row (140 rows) and each column in the row (4 columns)), if you want only the company names, it might be easier to find all the rows (<tr> tags) then append however many columns you want by iterating the <td>s in each row.
This will get the first column, the company names:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://topworkplaces.com/publication/ocregister/')
soup = BeautifulSoup(page.text,'html.parser')
all_company_list = soup.find_all('tr')
company_list = [c.find('td').text for c in all_company_list[1::]]
Now company_list contains all 140 company names:
>>> len(company_list)
140
>>> print(company_list)
['Advanced Behavioral Health', 'Advanced Management Company & R³ Construction Services, Inc.',
...
, 'Wes-Tec, Inc', 'Western Resources Title Company', 'Wunderman', 'Ytel, Inc.', 'Zillow Group']
Change c.find('td') to c.find_all('td') and iterate that list to get all the columns for each company.
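For instance, a minimal sketch of that variant, reusing the rows collected above:
# one sub-list per company, holding every cell of the row in page order
company_rows = [[td.text for td in c.find_all('td')] for c in all_company_list[1::]]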
Pandas:
Pandas is often useful here. The page supports multiple sort orders, including company size and rank; I show the rank sort.
import pandas as pd
table = pd.read_html('https://topworkplaces.com/publication/ocregister/')[0]
table.columns = table.iloc[0]
table = table[1:]
table.Rank = pd.to_numeric(table.Rank)
rank_sort_table = table.sort_values(by='Rank', axis=0, ascending = True)
rank_sort_table.reset_index(inplace=True, drop=True)
rank_sort_table.columns.names = ['Index']
print(rank_sort_table)
Depending on your sort, companies in order:
print(rank_sort_table.Company)
Requests:
Incidentally, you can use nth-of-type to select just the first column (company names), and use the table's id rather than its class name to identify it, which is faster.
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://topworkplaces.com/publication/ocregister/')
soup = bs(r.content, 'lxml')
names = [item.text for item in soup.select('#twpRegionalList td:nth-of-type(1)')]
print(names)
Note the default sorting is alphabetical on the name column rather than by rank.
Reference:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html
I tried to web scrape the table data from a binary signals website. The data updates after some time and I wanted to get the data as it updates. The problem is, when I scrape the code it returns empty values. The table has a table tag.
I'm not sure if it uses something other than HTML, because it updates without reloading. I had to use a browser user agent to get past the security.
When I run it, it returns the correct markup, but I have noticed the signal id increments by 1:
<table class="ui stripe hover dt-center table" id="isosignal-table" style="width:100%"><thead><tr><th></th><th class="no-sort">Current Price</th><th class="no-sort">Direction</th><th class="no-sort">Asset</th><th class="no-sort">Strike Price</th><th class="no-sort">Expiry Time</th></tr></thead><tbody><tr :class="[ signal.direction.toLowerCase() == 'call' ? 'call' : 'put' ]" :id="'signal-' + signal.id" :key="signal.id" ref="signals" v-for="signal in signals"><td style="display: none;" v-text="signal.id"></td><td v-text="signal.current_price"></td><td v-html="showDirection(signal.direction)"></td><td v-text="signal.asset"></td><td v-text="signal.strike_price"></td><td v-text="parseTime(signal.expiry)"></td></tr></tbody></table>
table = soup.table
print(table)
But when I run the whole code it returns this:
[]
['', '', '', '', '', '']
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
url = "https://signals.investingstockonline.com/free-binary-signal-page"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
page = urlopen(req)
data = page.read()
soup = BeautifulSoup(data, 'html.parser')
table = soup.table
table_rows = table.find_all('tr')
for tr in table_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    if len(row) < 1:
        pass
    print(row)
I thought it would display the whole table but it just displayed empty strings. What could be the problem?
In the HTML you've provided, there is no text content in the elements, so you're getting that correctly. When you look at the live website, text content that appears in the table was inserted dynamically by JS fetching information from a server via ajax. In other words, if you perform a request, you'll get the skeleton (HTML) but no meat (live data).
You can use something like Selenium to extract this information as follows:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(chrome_options=chrome_options)
driver.get("https://signals.investingstockonline.com/free-binary-signal-page")
for tr in driver.find_elements_by_tag_name("tr"):
    for td in tr.find_elements_by_tag_name("td"):
        print(td.get_attribute("innerText"))
Output (truncated):
EURJPY
126.044
22:00:00
1.50318
EURCAD
1.50332
22:00:00
1.12595
EURUSD
1.12604
22:00:00
0.86732
EURGBP
0.86743
22:00:00
1.29825
GBPUSD
1.29841
22:00:00
145.320
One table entry within a table row of an HTML table I am trying to scrape looks like this:
<td class="top100nation" title="PAK">
<img src="/images/flag/flags_pak.jpg" alt="PAK"></td>
The web page to which this belongs is the following: http://www.relianceiccrankings.com/datespecific/odi/?stattype=bowling&day=01&month=01&year=2014. The entire column to which this belongs in the table has similar table data (i.e. it's a column of images).
I am using lxml in a python script. (Open to using BeautifulSoup instead, if I have to for some reason.) For every other column in the table, I can extract the data I want on the given row by using 'data = entry.text_content()'. Obviously, this doesn't work for this column of images. But I don't want the image data in any case. What I want to get from this table data is the 'PAK' bit - that is, I want the name of the nation. I think this is extremely simple but unfortunately I am a simpleton who doesn't understand the library he is using.
Thanks in advance
Edit: Full script, as per request
import requests
import lxml.html as lh
import csv

url = 'http://www.relianceiccrankings.com/datespecific/odi/?stattype=bowling&day=01&month=01&year=2014'

with open('firstPageCricinfo','w') as file:
    writer = csv.writer(file)

page = requests.get(url)
doc = lh.fromstring(page.content)

#rows of the table
tr_elements = doc.xpath('//tr')

data_array = [[] for _ in range(len(tr_elements))]
del tr_elements[0]

for t in tr_elements[0]:
    name = t.text_content()
    if name == "":
        continue
    print(name)
    data_array[0].append(name)

#printing out first row of table, to check correctness
print(data_array[0])

for j in range(1, len(tr_elements)):
    T = tr_elements[j]
    i = 0
    for t in T.iterchildren():
        #column is not at issue
        if i != 3:
            data = t.text_content()
        #image-based column
        else:
            #what do I do here???
            data = t.
        data_array[j].append(data)
        i += 1

#printing last row to check correctness
print(data_array[len(tr_elements)-1])

with open('list1','w') as file:
    writer = csv.writer(file)
    for i in range(0, len(tr_elements)):
        writer.writerow(data_array[i])
Along with the lxml library, you'll need to use requests or some other library to get the website content.
Without seeing the code you have so far, I can offer a BeautifulSoup solution:
url = 'http://www.relianceiccrankings.com/datespecific/odi/?stattype=bowling&day=01&month=01&year=2014'
from bs4 import BeautifulSoup
import requests
soup = BeautifulSoup(requests.get(url).text, 'lxml')
r = soup.find_all('td', {'class': 'top100cbr'})
for td in r:
    print(td.text.split('v')[1].split(',')[0].strip())
outputs about 522 items:
South Africa
India
Sri Lanka
...
Canada
New Zealand
Australia
England
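If you'd rather stay with lxml, note that the 'PAK' bit is stored as an attribute, not as text content, so text_content() will never find it. A sketch for the image-based branch of the question's loop, where t is the <td> element (attribute names taken from the snippet in the question):
#image-based column: read the nation code from the attributes instead of the text
data = t.get('title')      # 'PAK' from <td class="top100nation" title="PAK">
if data is None:
    img = t.find('img')    # fall back to the flag image's alt text
    if img is not None:
        data = img.get('alt')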
I am scraping some data from a website and I am able to do so using the below referred code:
import csv
import urllib2
import sys
import time
from bs4 import BeautifulSoup
from itertools import islice
page = urllib2.urlopen('http://shop.o2.co.uk/mobile_phones/Pay_Monthly/smartphone/all_brands').read()
soup = BeautifulSoup(page)
soup.prettify()
with open('O2_2012-12-21.csv', 'wb') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=',')
    spamwriter.writerow(["Date","Month","Day of Week","OEM","Device Name","Price"])
    oems = soup.findAll('span', {"class": "wwFix_h2"}, text=True)
    items = soup.findAll('div', {"class": "title"})
    prices = soup.findAll('span', {"class": "handset"})
    for oem, item, price in zip(oems, items, prices):
        textcontent = u' '.join(islice(item.stripped_strings, 1, 2, 1))
        if textcontent:
            spamwriter.writerow([time.strftime("%Y-%m-%d"), time.strftime("%B"), time.strftime("%A"), unicode(oem.string).encode('utf8').strip(), textcontent, unicode(price.string).encode('utf8').strip()])
Now, the issue is that 2 of the price values I am scraping have a different HTML structure than the rest. My output CSV shows "None" for those because of this. The normal HTML structure for a price on the webpage is:
<span class="handset">
FREE to £79.99</span>
For those 2 values, the structure is:
<span class="handset">
<span class="delivery_amber">Up to 7 days delivery</span>
<br>"FREE on all tariffs"</span>
The output I am getting right now displays None for the second HTML structure instead of "FREE on all tariffs". Also, the price value "FREE on all tariffs" appears inside double quotes in the second structure, while it sits outside any quotes in the first.
Please help me solve this issue. Pardon my ignorance, as I am new to programming.
Just detect those 2 items with an additional if statement:
if price.string is None:
    price_text = u' '.join(price.stripped_strings).replace('"', '').encode('utf8')
else:
    price_text = unicode(price.string).strip().encode('utf8')
then use price_text for your CSV file. Note that I removed the " quotes with a simple replace call.
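For instance, the writerow call from the question would then become (a sketch reusing the question's own variables):
# write the pre-computed price_text instead of unicode(price.string)...
spamwriter.writerow([time.strftime("%Y-%m-%d"), time.strftime("%B"), time.strftime("%A"),
                     unicode(oem.string).encode('utf8').strip(), textcontent, price_text])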