Convert HTML Table to Pandas Data Frame in Python - html

I am trying to extract a table from the website specified in the Python code below. I can get the HTML table, but I am unable to convert it to a data frame. Here is the code:
# import libraries
import requests
from bs4 import BeautifulSoup
# specify url
url = 'http://my-trade.in/'
# request html
page = requests.get(url)
# Parse html using BeautifulSoup, you can use a different parser like lxml if present
soup = BeautifulSoup(page.content, 'html.parser')
tbl = soup.find("table", {"id": "MainContent_dataGridView1"})

You can use the pandas read_html function for that. Remember to convert the table you get from BeautifulSoup to a string first, otherwise you will get a parsing error.
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'http://my-trade.in/'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
tbl = soup.find("table",{"id":"MainContent_dataGridView1"})
data_frame = pd.read_html(str(tbl))[0]
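As a hedged aside: recent pandas releases (2.1 and later, as far as I know) warn when read_html is given a literal HTML string and ask for a file-like object instead. Here is a minimal sketch of the same conversion wrapped in io.StringIO. The inline HTML and its values are made up for illustration; only the table id is taken from the question. read_html also needs lxml (or html5lib together with bs4) installed.

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; only the table id comes from the question.
html = """
<html><body>
<table id="MainContent_dataGridView1">
  <thead><tr><th>Symbol</th><th>Price</th></tr></thead>
  <tbody>
    <tr><td>ABC</td><td>101.5</td></tr>
    <tr><td>XYZ</td><td>99.0</td></tr>
  </tbody>
</table>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
tbl = soup.find("table", {"id": "MainContent_dataGridView1"})

# read_html returns a list of DataFrames, one per <table> it finds,
# so [0] picks the only table in this snippet.
data_frame = pd.read_html(StringIO(str(tbl)))[0]
print(data_frame)
```

If soup.find returns None (no table with that id), str(tbl) becomes the text "None" and read_html will fail, so it is worth checking tbl before converting.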

Related

Can't scrape table using lxml

I'm trying to scrape the URLs for the individual players from this website.
I've already tried doing this with bs4, and it returns [] every time I try to find the table, so I switched to lxml to give that a try.
from urllib.request import urlopen
from lxml import etree
url = "https://www.espn.com/soccer/team/squad/_/id/359/arsenal"
tree = etree.HTML(urlopen(url).read())
table = tree.xpath('//*[@id="fittPageContainer"]/div[2]/div[5]/div[1]/div/article/div/section/div[5]/section/table/tbody/tr/td[1]/div/table/tbody/tr[1]/td/span')
print(table)
I expect some sort of output that I could use to get the links, but the code returns empty square brackets.
I think this is what you want.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Firefox(executable_path=r'C:\files\geckodriver.exe')
driver.set_page_load_timeout(30)
driver.get("https://www.espn.com/soccer/team/squad/_/id/359/arsenal")
continue_link = driver.find_element_by_tag_name('a')
elems = driver.find_elements_by_xpath("//a[@href]")
for elem in elems:
    print(elem.get_attribute("href"))
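The `//a[@href]` XPath is not specific to Selenium. The small sketch below runs the same expression with lxml on an inline snippet to show what it selects. The HTML and the player paths are invented for illustration. Whether lxml alone would be enough for the ESPN page depends on whether the links are present in the raw HTML, which is not verified here.

```python
from lxml import etree

# Hypothetical markup; the paths are placeholders, not real ESPN URLs.
html = """
<html><body>
<table><tr>
  <td><a href="/player/1/alpha">Alpha</a></td>
  <td><a href="/player/2/beta">Beta</a></td>
  <td><a name="no-link">No href</a></td>
</tr></table>
</body></html>
"""

tree = etree.HTML(html)

# @href is an attribute test: only <a> elements that carry an href match.
links = [a.get("href") for a in tree.xpath("//a[@href]")]
print(links)
```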

Extraction of Data using tbodyclass and Beautifulsoup

Extraction with tbody class using BeautifulSoup and Python 3.
I'm trying to extract the summary table at the top of the page using BeautifulSoup. However, I get the following error when I use the tbody class to extract the table containing name, age, info, etc.
I am aware I can use the enclosing table (class dataTable) to extract it. However, I want to try extracting it using the tbody class.
How do I extract the table with the tbody class, and what mistake am I making?
I'm a bit new to web scraping, so any detailed help would be appreciated.
Here is the code:
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import requests
urls=['https://www.reuters.com/finance/stocks/company-officers/GOOG.O',
      'https://www.reuters.com/finance/stocks/company-officers/AMZN',
      'https://www.reuters.com/finance/stocks/company-officers/AAPL']
for item in urls:
    response=requests.get(item)
    data=response.content
    soup=BeautifulSoup(data,'html.parser')
    required_data=soup.find_all(class_='moduleBody')
    real_data=required_data.find_all(tbodyclass_='dataSmall')
    print(real_data)
Here is the error:
Traceback (most recent call last):
  File "C:\Users\XXXX\Desktop\scrape.py", line 15, in <module>
    real_data=required_data.find_all(tbodyclass_='dataSmall')
  File "C:\Users\XXXX\AppData\Local\Programs\Python\Python37\lib\site-packages\bs4\element.py", line 1620, in __getattr__
    "ResultSet object has no attribute '%s'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?" % key
AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
To target only the tbody, you need to select only the first match for that tbody class, so you can use select_one.
table = soup.select_one('.dataSmall')
This gets you the table without the table tags, and you can still loop over the tr and td elements within it to write out the table. Below I show using pandas to handle it instead.
Looks like you can use pandas directly:
import pandas as pd
urls=['https://www.reuters.com/finance/stocks/company-officers/GOOG.O',
      'https://www.reuters.com/finance/stocks/company-officers/AMZN',
      'https://www.reuters.com/finance/stocks/company-officers/AAPL']
for url in urls:
    table = pd.read_html(url)[0]
    print(table)
Combining pandas and the tbody class, but pulling in the parent table tag:
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs
urls=['https://www.reuters.com/finance/stocks/company-officers/GOOG.O',
      'https://www.reuters.com/finance/stocks/company-officers/AMZN',
      'https://www.reuters.com/finance/stocks/company-officers/AAPL']
with requests.Session() as s:
    for url in urls:
        r = s.get(url)
        soup = bs(r.content, 'lxml')
        # select the <table> that contains an element with class dataSmall
        table = soup.select_one('table:has(.dataSmall)')
        print(pd.read_html(str(table)))
Ignoring the table tag (but adding it back later for pandas to parse). You don't have to do this; you can loop over the tr and td elements within the tbody rather than handing over to pandas.
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs
urls=['https://www.reuters.com/finance/stocks/company-officers/GOOG.O',
      'https://www.reuters.com/finance/stocks/company-officers/AMZN',
      'https://www.reuters.com/finance/stocks/company-officers/AAPL']
with requests.Session() as s:
    for url in urls:
        r = s.get(url)
        soup = bs(r.content, 'lxml')
        # select the tbody itself, then wrap it in <table> tags for pandas
        table = soup.select_one('.dataSmall')
        print(pd.read_html('<table>' + str(table) + '</table>'))
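The alternative mentioned above, looping over tr and td yourself instead of handing the markup to pandas, could look roughly like this. It is only a sketch: the inline HTML, names, and ages are invented, and the real Reuters markup may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics a tbody carrying the dataSmall class.
html = """
<table class="dataTable">
  <tbody class="dataSmall">
    <tr><th>Name</th><th>Age</th></tr>
    <tr><td>Jane Doe</td><td>52</td></tr>
    <tr><td>John Roe</td><td>47</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
tbody = soup.select_one(".dataSmall")  # first element with that class

rows = []
for tr in tbody.find_all("tr"):
    # collect header and data cells alike, stripped of surrounding whitespace
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    rows.append(cells)

print(rows)
```

The resulting list of lists can be written out directly or passed to a DataFrame constructor later, so pandas stays optional.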
To get the table under summary, you can try the following script:
import requests
from bs4 import BeautifulSoup
URLS = [
    'https://www.reuters.com/finance/stocks/company-officers/GOOG.O',
    'https://www.reuters.com/finance/stocks/company-officers/AMZN',
    'https://www.reuters.com/finance/stocks/company-officers/AAPL'
]
for url in URLS:
    r = requests.get(url,headers={"User-Agent":"Mozilla/5.0"})
    soup = BeautifulSoup(r.text,"lxml")
    for items in soup.select_one("h3:contains(Summary)").find_parent().find_next_sibling().select("table tr"):
        data = [' '.join(item.text.split()) for item in items.select("th,td")]
        print(data)
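You can pass the tag name and the class as separate arguments, calling it on each element of required_data rather than on the ResultSet itself:
res.find_all("tbody", class_="dataSmall")  # res is one element of required_data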
The find_all() function returns a ResultSet, which is a list of results. This is useful when more than one occurrence of the tag is present.
You need to iterate over the results and call find_all() on each one separately:
for item in urls:
    response = requests.get(item)
    data = response.content
    soup = BeautifulSoup(data, 'html.parser')
    required_data = soup.find_all(class_='moduleBody')
    for res in required_data:
        # pass the tag name and the class separately; tbodyclass_ is not a valid filter
        real_data = res.find_all('tbody', class_='dataSmall')
        print(real_data)
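To see the difference between a ResultSet and a single Tag without hitting the network, here is a small sketch on an inline snippet. The markup is made up; only the class names moduleBody and dataSmall come from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two moduleBody blocks, each holding one tbody.
html = """
<div class="moduleBody"><table><tbody class="dataSmall"><tr><td>A</td></tr></tbody></table></div>
<div class="moduleBody"><table><tbody class="dataSmall"><tr><td>B</td></tr></tbody></table></div>
"""

soup = BeautifulSoup(html, "html.parser")
required_data = soup.find_all(class_="moduleBody")  # a ResultSet (list of Tags)

# Calling find_all on the ResultSet itself reproduces the AttributeError.
try:
    required_data.find_all("tbody", class_="dataSmall")
    raised = False
except AttributeError:
    raised = True

# Calling find_all on each Tag inside the ResultSet works.
cells = []
for res in required_data:
    for tbody in res.find_all("tbody", class_="dataSmall"):
        cells.append(tbody.get_text(strip=True))

print(raised, cells)
```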
Edit: One find_all is enough:
import requests
from bs4 import BeautifulSoup
url = ["https://www.reuters.com/finance/stocks/company-officers/GOOG.O", "https://www.reuters.com/finance/stocks/company-officers/AMZN.O", "https://www.reuters.com/finance/stocks/company-officers/AAPL.O"]
for URL in url:
    req = requests.get(URL)
    soup = BeautifulSoup(req.content, 'html.parser')
    for i in soup.find_all("tbody", {"class": "dataSmall"}):
        print(i)
It's not clear exactly what information you are trying to extract, but doing:
for res in required_data:
    real_data=res.find_all("tbody", class_="dataSmall")
    for dat in real_data:
        print(dat.text.strip())
outputs lots of info about lots of people...
EDIT: If you're looking for the summary table, you need this:
import pandas as pd
tabs = pd.read_html(url)
print(tabs[0])
and there's your table.

Ipython notebook

I'm trying to download this data set, convert it into a pandas DataFrame, and display the first 10 lines in a Jupyter notebook.
import requests
url = 'https://ckannet-storage.commondatastorage.googleapis.com/2014-12-13T15:15:31.729Z/airfields.json'
resp = requests.get(url)
resp.content
I ran this and it gives me all the content. How can I limit the output so that it displays only the first 10 lines?
Convert your response to JSON and then to a pandas DataFrame. Finally, select the first 10 rows from it:
import pandas as pd
import json
j = json.loads(resp.content)
df = pd.DataFrame(j)
df[:10]
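A self-contained sketch of the same steps follows, using a made-up payload in place of resp.content. It assumes the JSON is an array of flat records, and the id and name fields are hypothetical. The real airfields.json may be shaped differently and need extra unpacking.

```python
import json

import pandas as pd

# Hypothetical stand-in for resp.content: bytes holding a JSON array of records.
payload = json.dumps(
    [{"id": i, "name": f"Airfield {i}"} for i in range(25)]
).encode("utf-8")

records = json.loads(payload)   # same call as json.loads(resp.content)
df = pd.DataFrame(records)      # one row per record, one column per key
first_ten = df.head(10)         # equivalent to df[:10]
print(first_ten)
```

In a notebook, leaving first_ten (or df.head(10)) as the last expression in a cell renders it as a table. requests can also decode the body for you with resp.json().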

Using beautiful soup beyond </html>

Is it possible to use Beautiful Soup on content beyond the </html> tag? A case in point would be the following page,
http://dsalsrv02.uchicago.edu/cgi-bin/app/biswas-bangala_query.py?page=1
which has data after the closing </html> tag.
From what I see, you can use either html.parser or html5lib for this particular page:
import requests
from bs4 import BeautifulSoup
response = requests.get("http://dsalsrv02.uchicago.edu/cgi-bin/app/biswas-bangala_query.py?page=1")
soup = BeautifulSoup(response.content, "html.parser")
# soup = BeautifulSoup(response.content, "html5lib")
The lxml parser does not handle this page well; it parses the page only partially.
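A tiny sketch of the html.parser behaviour on a made-up document with markup after the closing tag is shown below. The snippet is an assumption standing in for the real page, which is much larger and may trip parsers in other ways.

```python
from bs4 import BeautifulSoup

# Hypothetical document: one paragraph inside <html>, one after </html>.
html = "<html><body><p>inside</p></body></html><p>outside</p>"

soup = BeautifulSoup(html, "html.parser")

# html.parser keeps reading after </html>, so both paragraphs are found.
texts = [p.get_text() for p in soup.find_all("p")]
print(texts)
```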

BeautifulSoup Error in Python 3.3

I'm just trying to parse this site and I keep getting errors using BeautifulSoup. Can someone help me and identify the problem?
import re
import urllib
import urllib.request
import beautifulsoup
html = urllib.request.urlopen('http://yugioh.wikia.com/wiki/Card_Tips:Blue-Eyes_White_Dragon').read()
soup = beautifulsoup.bs4(html)
texts = soup.findAll(text=True)
def visible(element):
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    elif re.match('<!--.*-->', str(element)):
        return False
    return True
visible_texts = filter(visible, texts)
You've mixed up the module and class names. Rather than:
import beautifulsoup
You need:
import bs4
And rather than:
beautifulsoup.bs4(...)
You need:
bs4.BeautifulSoup(...)
Also, in Beautiful Soup 4, the underscore variants are preferred over the camel-case variants of the names because they fit in better with other Python conventions:
soup.find_all(...)
Also, depending on what you're doing with visible_texts, you may want a list rather than a lazy filter:
visible_texts = list(filter(visible, texts))
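Putting those corrections together, here is a runnable sketch on an inline snippet instead of the live wiki page (the HTML is made up). One deliberate change from the question's code: it checks for bs4's Comment class rather than matching a regex, because str() of a Comment node gives only the comment body without the <!-- --> delimiters, so the original pattern would not match.

```python
import bs4
from bs4.element import Comment

# Hypothetical page with a title, a style block, a comment, and a script.
html = (
    "<html><head><title>Card tips</title><style>p{color:red}</style></head>"
    "<body><p>Hello</p><!-- hidden note --><script>var x = 1;</script>"
    "<p>World</p></body></html>"
)

soup = bs4.BeautifulSoup(html, "html.parser")
texts = soup.find_all(string=True)  # every text node, including comments


def visible(element):
    # Text inside these containers is never rendered to the reader.
    if element.parent.name in ["style", "script", "[document]", "head", "title"]:
        return False
    # Comments are text nodes too, but of the Comment subclass.
    if isinstance(element, Comment):
        return False
    return True


visible_texts = list(filter(visible, texts))
print(visible_texts)
```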