Can't scrape table using lxml - html

I'm trying to scrape the URLs of the individual players from this website.
I've already tried doing this with bs4, and it just returns [] every time I try to find the table, so I switched to lxml to give that a try.
from urllib.request import urlopen
from lxml import etree
url = "https://www.espn.com/soccer/team/squad/_/id/359/arsenal"
tree = etree.HTML(urlopen(url).read())
table = tree.xpath('//*[@id="fittPageContainer"]/div[2]/div[5]/div[1]/div/article/div/section/div[5]/section/table/tbody/tr/td[1]/div/table/tbody/tr[1]/td/span')
print(table)
I expect some sort of output that I could use to get the links, but the code prints an empty list ([]).

I think this is what you want.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Firefox(executable_path=r'C:\files\geckodriver.exe')
driver.set_page_load_timeout(30)
driver.get("https://www.espn.com/soccer/team/squad/_/id/359/arsenal")
continue_link = driver.find_element_by_tag_name('a')
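# Selenium 4 removed find_elements_by_xpath; there, use driver.find_elements(By.XPATH, "//a[@href]")
elems = driver.find_elements_by_xpath("//a[@href]")
for elem in elems:
    print(elem.get_attribute("href"))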
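If the links only appear after JavaScript runs (an assumption here, though it would explain the empty list from both bs4 and lxml), the rendered HTML from Selenium's driver.page_source can still be handed to lxml. The sketch below uses a made-up HTML snippet in place of the real page, and a relative XPath instead of the long absolute one, which tends to break because browsers insert tbody elements that are not in the raw source. The player_links name and the sample markup are illustrative only.

```python
from lxml import html

# Hypothetical markup standing in for the squad table; not the real ESPN page.
SAMPLE = """
<div id="fittPageContainer">
  <table>
    <tbody>
      <tr><td><a href="/soccer/player/_/id/1/player-one">Player One</a></td></tr>
      <tr><td><a href="/soccer/player/_/id/2/player-two">Player Two</a></td></tr>
    </tbody>
  </table>
</div>
"""


def player_links(page_source):
    """Return the href of every link inside a table under the page container."""
    tree = html.fromstring(page_source)
    # Relative XPath: does not depend on the exact chain of wrapper divs.
    return tree.xpath('//*[@id="fittPageContainer"]//table//a/@href')


links = player_links(SAMPLE)
print(links)
```

With a live browser session, the equivalent call would be player_links(driver.page_source).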

Related

Convert HTML Table to Pandas Data Frame in Python

Here I am trying to extract a table from a website, as shown in the Python code below. I am able to get the HTML table, but I am unable to convert it to a data frame. Here is the code:
# import libraries
import requests
from bs4 import BeautifulSoup
# specify url
url = 'http://my-trade.in/'
# request html
page = requests.get(url)
# Parse html using BeautifulSoup, you can use a different parser like lxml if present
soup = BeautifulSoup(page.content, 'html.parser')
tbl = soup.find("table", {"id": "MainContent_dataGridView1"})
You can just use the pandas read_html function for that. Remember to convert the HTML you get to a string, or you will get a parsing error.
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'http://my-trade.in/'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
tbl = soup.find("table",{"id":"MainContent_dataGridView1"})
data_frame = pd.read_html(str(tbl))[0]
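Newer pandas releases warn when read_html is given a literal HTML string, so wrapping the markup in StringIO is a reasonable habit. The sketch below also uses the attrs parameter of read_html to pick the table by id, which can replace the BeautifulSoup lookup. The table rows are made up for illustration, and an HTML parser such as lxml is assumed to be installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical markup; the id matches the one used above, the rows are invented.
html = """
<table id="MainContent_dataGridView1">
  <tr><th>Symbol</th><th>Price</th></tr>
  <tr><td>AAA</td><td>10.5</td></tr>
  <tr><td>BBB</td><td>20.0</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per matching table.
frames = pd.read_html(StringIO(html), attrs={"id": "MainContent_dataGridView1"})
data_frame = frames[0]
print(data_frame)
```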

Extraction of Data using tbodyclass and Beautifulsoup

Extraction with tbody class using BeautifulSoup and Python 3.
I'm trying to extract the summary table at the top of the page using BeautifulSoup. However, I get the following error when I use the tbody class to extract the table containing name, age, info, etc.
I am aware that I can use the table with {class: datatable} to extract the table. However, I want to try extracting it using the tbody class.
How do I extract the table with the tbody class, and what mistake am I making?
I'm a bit new to web scraping, and any detailed help would be appreciated.
Here is the code:
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import requests
urls=['https://www.reuters.com/finance/stocks/company-officers/GOOG.O',
      'https://www.reuters.com/finance/stocks/company-officers/AMZN',
      'https://www.reuters.com/finance/stocks/company-officers/AAPL']
for item in urls:
    response=requests.get(item)
    data=response.content
    soup=BeautifulSoup(data,'html.parser')
    required_data=soup.find_all(class_='moduleBody')
    real_data=required_data.find_all(tbodyclass_='dataSmall')
    print(real_data)
Here is the Error
Traceback (most recent call last):
  File "C:\Users\XXXX\Desktop\scrape.py", line 15, in <module>
    real_data=required_data.find_all(tbodyclass_='dataSmall')
  File "C:\Users\XXXX\AppData\Local\Programs\Python\Python37\lib\site-packages\bs4\element.py", line 1620, in __getattr__
    "ResultSet object has no attribute '%s'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?" % key
AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
To target only the tbody, you need to select just the first match for that tbody class, so you can use select_one:
table = soup.select_one('.dataSmall')
This gets you the table contents without the table tags, and you can still loop over the trs and tds within it to write out the table. Below, though, I use pandas to handle that.
It looks like you can use pandas directly:
import pandas as pd
urls=['https://www.reuters.com/finance/stocks/company-officers/GOOG.O',
      'https://www.reuters.com/finance/stocks/company-officers/AMZN',
      'https://www.reuters.com/finance/stocks/company-officers/AAPL']
for url in urls:
    table = pd.read_html(url)[0]
    print(table)
Combining pandas with the tbody class, but pulling in the parent table tag:
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs
urls=['https://www.reuters.com/finance/stocks/company-officers/GOOG.O',
      'https://www.reuters.com/finance/stocks/company-officers/AMZN',
      'https://www.reuters.com/finance/stocks/company-officers/AAPL']
with requests.Session() as s:
    for url in urls:
        r = s.get(url)
        soup = bs(r.content, 'lxml')
        # :has() selects the table that contains the .dataSmall tbody
        table = soup.select_one('table:has(.dataSmall)')
        print(pd.read_html(str(table)))
Ignoring the table tag (but adding it back later for pandas to parse). You don't have to do this; you can loop over the tr and td elements within rather than hand over to pandas.
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs
urls=['https://www.reuters.com/finance/stocks/company-officers/GOOG.O',
      'https://www.reuters.com/finance/stocks/company-officers/AMZN',
      'https://www.reuters.com/finance/stocks/company-officers/AAPL']
with requests.Session() as s:
    for url in urls:
        r = s.get(url)
        soup = bs(r.content, 'lxml')
        table = soup.select_one('.dataSmall')
        # Wrap the bare tbody in table tags so pandas can parse it
        print(pd.read_html('<table>' + str(table) + '</table>'))
To get the table under summary, you can try the following script:
import requests
from bs4 import BeautifulSoup
URLS = [
    'https://www.reuters.com/finance/stocks/company-officers/GOOG.O',
    'https://www.reuters.com/finance/stocks/company-officers/AMZN',
    'https://www.reuters.com/finance/stocks/company-officers/AAPL'
]
for url in URLS:
    r = requests.get(url,headers={"User-Agent":"Mozilla/5.0"})
    soup = BeautifulSoup(r.text,"lxml")
    # Newer soupsieve releases spell this selector h3:-soup-contains(Summary)
    for items in soup.select_one("h3:contains(Summary)").find_parent().find_next_sibling().select("table tr"):
        data = [' '.join(item.text.split()) for item in items.select("th,td")]
        print(data)
You can do this:
required_data.find_all("tbody", class_="dataSmall")
The find_all() function returns a ResultSet, which is a list of results. This is useful when the tag occurs more than once.
You need to iterate over the results and call find_all() on each one separately:
for item in urls:
    response = requests.get(item)
    data = response.content
    soup = BeautifulSoup(data, 'html.parser')
    required_data = soup.find_all(class_='moduleBody')
    for res in required_data:
        real_data = res.find_all("tbody", class_="dataSmall")
        print(real_data)
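The same point can be checked offline. The sketch below runs against a made-up HTML snippet that only mimics the moduleBody / dataSmall structure (the names and ages are invented), and contrasts find_all(), which returns a list-like ResultSet, with find(), which returns a single Tag or None.

```python
from bs4 import BeautifulSoup

# Made-up markup that mimics the structure discussed above.
SAMPLE = """
<div class="moduleBody">
  <table><tbody class="dataSmall"><tr><td>Alice</td><td>52</td></tr></tbody></table>
</div>
<div class="moduleBody">
  <table><tbody class="dataSmall"><tr><td>Bob</td><td>47</td></tr></tbody></table>
</div>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")

modules = soup.find_all(class_="moduleBody")  # ResultSet: iterate over it
first = soup.find(class_="moduleBody")        # a single Tag (or None)

rows = []
for module in modules:
    for tbody in module.find_all("tbody", class_="dataSmall"):
        for tr in tbody.find_all("tr"):
            rows.append([td.get_text(strip=True) for td in tr.find_all("td")])

print(rows)
```

Calling modules.find_all(...) directly would raise the AttributeError from the question, while first.find_all(...) works because first is a Tag.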
Edit: One find_all is enough:
import requests
from bs4 import BeautifulSoup
url = ["https://www.reuters.com/finance/stocks/company-officers/GOOG.O", "https://www.reuters.com/finance/stocks/company-officers/AMZN.O", "https://www.reuters.com/finance/stocks/company-officers/AAPL.O"]
for URL in url:
    req = requests.get(URL)
    soup = BeautifulSoup(req.content, 'html.parser')
    for i in soup.find_all("tbody", {"class": "dataSmall"}):
        print(i)
It's not clear exactly what information you are trying to extract, but doing:
for res in required_data:
    real_data=res.find_all("tbody", class_="dataSmall")
    for dat in real_data:
        print(dat.text.strip())
outputs lots of info about lots of people...
EDIT: If you're looking for the summary table, you need this:
import pandas as pd
tabs = pd.read_html(url)
print(tabs[0])
and there's your table.

decode not working properly in python 2.7

I am trying to make a program that uses web crawling to retrieve stock info, but somehow the program is not able to decode the web page. I want this code to be strictly for Python 2.
import urllib2
import re
stock=str(raw_input("Give the stock name"))
url = "https://www.google.com/finance?q="
req = urllib2.Request(url)
response = urllib2.urlopen(req)
data = str(response.read())
data1 = data.decode('utf-8')
print(data)
m = re.search('meta itemprop="price"',data1)
start = m.start()
end = start+50
newString = data1[start:end]
m=re.search('content="',newString)
start = m.end()
newString1 = newString[start:]
m = re.search("/",newString1)
start=0
end=m.end()-3
final= newString1[0:end]
print(final)
This is not a direct answer to your question, but a suggestion: try using the Beautiful Soup Python library. It has many functions for web scraping and crawling, handles most of what you are trying to achieve in your question, and has releases for both Python 2 and Python 3.
Go to https://pypi.python.org/pypi/beautifulsoup4 for the documentation.
A sample example:
from bs4 import BeautifulSoup
import urllib
url = 'http://www.py4inf.com/code/romeo.txt'
html = urllib.urlopen(url).read()  # Python 2 API
soup = BeautifulSoup(html, 'html.parser')
x = soup('a')  # shorthand for soup.find_all('a')
print x
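If you would rather stay with the standard library, a single regular expression with a capture group can replace the chain of search-and-slice steps in the question, which raises AttributeError as soon as one re.search returns None. This is only a sketch: the sample fragment and the extract_price name are made up, and the real page markup may differ. It runs unchanged on Python 2.7 and Python 3.

```python
import re

# Hypothetical fragment; the real page markup may differ.
SAMPLE = '<div><meta itemprop="price" content="1,234.56" /></div>'

# Capture whatever sits inside content="..." on the price meta tag.
PRICE_RE = re.compile(r'<meta\s+itemprop="price"\s+content="([^"]*)"')


def extract_price(page_text):
    """Return the price string, or None when the tag is not present."""
    match = PRICE_RE.search(page_text)
    return match.group(1) if match else None


print(extract_price(SAMPLE))
```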

Python3 - TypeError: the JSON object must be str, not 'bytes'

I'm a Python beginner (working only with Python 3 so far) and I'm trying to present some code that uses the curses library to my classmates.
I got the code from a Python/curses tutorial, and it runs without problems in Python 2. In Python 3 it doesn't, and I get the error in the title.
Searching through the questions already asked, I found several solutions to this, but since I'm an absolute beginner at coding, I have no idea how to apply them to my specific code.
This is the code working in python2 :
import curses
from urllib2 import urlopen
from HTMLParser import HTMLParser
from simplejson import loads
def get_new_joke():
    joke_json = loads(urlopen('http://api.icndb.com/jokes/random').read())
    return HTMLParser().unescape(joke_json['value']['joke']).encode('utf-8')
Using the new modules in python3:
import curses
import json
import urllib
from html.parser import HTMLParser
def get_new_joke():
    joke_json = loads(urlopen('http://api.icndb.com/jokes/random').read())
    return HTMLParser().unescape(joke_json['value']['joke']).encode('utf-8')
Furthermore, I tried to include the solution from "Python 3, let json object accept bytes or let urlopen output strings" in my code:
response = urllib.request.urlopen('http://api.icndb.com/jokes/random')
str_response = joke_json.readall().decode('utf-8')
obj = json.loads(str_response)
I have been trying for hours now, but it tells me "json" is not defined.
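A minimal Python 3 sketch of one way this could fit together: urlopen(...).read() returns bytes, so the body is decoded to str before json.loads sees it, and html.unescape stands in for HTMLParser().unescape, which newer Python 3 releases no longer provide. The parse_joke helper and the sample payload are made up for illustration, and get_new_joke assumes the endpoint from the question still responds in that shape.

```python
import html
import json
from urllib.request import urlopen


def parse_joke(raw):
    """Decode a raw HTTP body (bytes) and return the unescaped joke text."""
    text = raw.decode("utf-8")       # bytes -> str before json.loads
    payload = json.loads(text)
    return html.unescape(payload["value"]["joke"])


def get_new_joke(url="http://api.icndb.com/jokes/random"):
    """Fetch one joke; needs network access, so it is not called below."""
    with urlopen(url) as response:
        return parse_joke(response.read())


# Made-up payload in the shape the question's code expects.
sample = b'{"type": "success", "value": {"id": 1, "joke": "Tom &amp; Jerry &quot;test&quot;"}}'
print(parse_joke(sample))
```

The function returns a str rather than UTF-8 bytes; whether the curses code needs .encode('utf-8') on top depends on how the result is drawn.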

BeautifulSoup Error in Python 3.3

I'm just trying to parse this site and I keep getting errors using BeautifulSoup. Can someone help me and identify the problem?
import urllib
import urllib.request
import beautifulsoup
html = urllib.request.urlopen('http://yugioh.wikia.com/wiki/Card_Tips:Blue-Eyes_White_Dragon').read()
soup = beautifulsoup.bs4(html)
texts = soup.findAll(text=True)
def visible(element):
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    elif re.match('<!--.*-->', str(element)):
        return False
    return True
visible_texts = filter(visible, texts)
You've mixed up the module and class names. Rather than:
import beautifulsoup
You need:
import bs4
And rather than:
beautifulsoup.bs4(...)
You need:
bs4.BeautifulSoup(...)
Also, in the newest version of Beautiful Soup, the underscore variants are preferred over the camel-case variants of the names because it fits in better with other Python conventions:
soup.find_all(...)
Also, depending on what you're doing with visible_texts, you may want a list rather than a lazy filter:
visible_texts = list(filter(visible, texts))
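To see why the list matters: in Python 3, filter returns a one-shot iterator, so a second pass over it yields nothing. The sketch below uses a stand-in visible predicate and made-up strings rather than real BeautifulSoup elements.

```python
def visible(text):
    # Stand-in predicate for illustration; the real one inspects element.parent.
    return bool(text.strip())


texts = ["Blue-Eyes", "  ", "White Dragon", ""]

lazy = filter(visible, texts)
first_pass = list(lazy)    # consumes the iterator
second_pass = list(lazy)   # nothing left to yield

visible_texts = list(filter(visible, texts))  # reusable, supports len() and indexing
print(first_pass, second_pass, len(visible_texts))
```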