Extraction with tbody class using BeautifulSoup and Python 3.
I'm trying to extract the summary table at the top of the page using BeautifulSoup. However, I get the following error when I use the tbody class to extract the table containing name, age, info, etc.
I am aware that I can use the enclosing table ({class: datatable}) to extract the table. However, I want to try extracting it using the tbody class.
How do I extract the table with the tbody class, and what error am I making?
I'm a bit new to web scraping, so any detailed help would be appreciated.
Here is the code
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import requests
urls=['https://www.reuters.com/finance/stocks/company-officers/GOOG.O',
      'https://www.reuters.com/finance/stocks/company-officers/AMZN',
      'https://www.reuters.com/finance/stocks/company-officers/AAPL']
for item in urls:
    response=requests.get(item)
    data=response.content
    soup=BeautifulSoup(data,'html.parser')
    required_data=soup.find_all(class_='moduleBody')
    real_data=required_data.find_all(tbodyclass_='dataSmall')
    print(real_data)
Here is the Error
Traceback (most recent call last):
  File "C:\Users\XXXX\Desktop\scrape.py", line 15, in <module>
    real_data=required_data.find_all(tbodyclass_='dataSmall')
  File "C:\Users\XXXX\AppData\Local\Programs\Python\Python37\lib\site-packages\bs4\element.py", line 1620, in __getattr__
    "ResultSet object has no attribute '%s'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?" % key
AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
To target only the tbody, you need to select just the first match for that tbody class, so you can use select_one.
table = soup.select_one('table:has(.dataSmall)')
This returns the parent table of the first element with class dataSmall; selecting '.dataSmall' on its own gets you the tbody without the table tags. Either way, you can still loop over the trs and tds within to write out the table. Below I show how to hand the result to pandas instead.
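The difference between the two selectors can be checked on a small inline document. This is only a sketch: the markup below is made up to stand in for the real page, and only the class name dataSmall comes from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
html = """
<table class="dataTable">
  <tbody class="dataSmall">
    <tr><th>Name</th><th>Age</th></tr>
    <tr><td>Jane Doe</td><td>50</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Parent <table> of the first element with class dataSmall.
table = soup.select_one("table:has(.dataSmall)")

# The <tbody> itself, without the enclosing <table> tags.
tbody = soup.select_one(".dataSmall")

print(table.name)
print(tbody.name)
```

Both calls return a single Tag (or None when nothing matches), so further calls such as select("tr") work on either result.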
Looks like you can use pandas
import pandas as pd
urls=['https://www.reuters.com/finance/stocks/company-officers/GOOG.O',
      'https://www.reuters.com/finance/stocks/company-officers/AMZN',
      'https://www.reuters.com/finance/stocks/company-officers/AAPL']
for url in urls:
    table = pd.read_html(url)[0]
    print(table)
Combining pandas with the tbody class, but pulling in the parent table tag:
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs
urls=['https://www.reuters.com/finance/stocks/company-officers/GOOG.O',
      'https://www.reuters.com/finance/stocks/company-officers/AMZN',
      'https://www.reuters.com/finance/stocks/company-officers/AAPL']
with requests.Session() as s:
    for url in urls:
        r = s.get(url)
        soup = bs(r.content, 'lxml')
        # Parent table of the first tbody with class dataSmall
        table = soup.select_one('table:has(.dataSmall)')
        print(pd.read_html(str(table)))
Ignoring the table tag (but adding it back later so pandas can parse the markup). You don't have to do this; you can loop over the tr and td elements within rather than handing over to pandas.
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs
urls=['https://www.reuters.com/finance/stocks/company-officers/GOOG.O',
      'https://www.reuters.com/finance/stocks/company-officers/AMZN',
      'https://www.reuters.com/finance/stocks/company-officers/AAPL']
with requests.Session() as s:
    for url in urls:
        r = s.get(url)
        soup = bs(r.content, 'lxml')
        # The tbody alone; wrap it in <table> tags so read_html accepts it
        table = soup.select_one('.dataSmall')
        print(pd.read_html('<table>' + str(table) + '</table>'))
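Looping over the rows by hand, as mentioned above, could look roughly like this. The inline markup is a hypothetical stand-in for the tbody returned by select_one('.dataSmall'); the cell contents are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical tbody, standing in for the one selected from the page.
html = """<tbody class="dataSmall">
<tr><th>Name</th><th>Age</th></tr>
<tr><td> Jane  Doe </td><td>50</td></tr>
</tbody>"""

soup = BeautifulSoup(html, "html.parser")
tbody = soup.select_one(".dataSmall")

rows = []
for tr in tbody.select("tr"):
    # Collapse runs of whitespace inside each header or data cell.
    cells = [" ".join(cell.get_text().split()) for cell in tr.select("th, td")]
    rows.append(cells)

print(rows)
```

The result is a plain list of lists, which can be written out with the csv module or passed to a DataFrame constructor if pandas is wanted after all.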
To get the table under summary, you can try the following script:
import requests
from bs4 import BeautifulSoup
URLS = [
    'https://www.reuters.com/finance/stocks/company-officers/GOOG.O',
    'https://www.reuters.com/finance/stocks/company-officers/AMZN',
    'https://www.reuters.com/finance/stocks/company-officers/AAPL'
]
for url in URLS:
    r = requests.get(url,headers={"User-Agent":"Mozilla/5.0"})
    soup = BeautifulSoup(r.text,"lxml")
    # Newer soupsieve releases prefer :-soup-contains() over :contains()
    for items in soup.select_one("h3:contains(Summary)").find_parent().find_next_sibling().select("table tr"):
        data = [' '.join(item.text.split()) for item in items.select("th,td")]
        print(data)
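You can pass the tag name and the class separately. Since required_data is a ResultSet, call this on each item rather than on the list itself:
for res in required_data:
    res.find_all("tbody", class_="dataSmall")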
The find_all() function returns a ResultSet, a list of results. This is useful when more than one occurrence of the tag is present.
You need to iterate over the results and call find_all() on each one separately:
for item in urls:
    response = requests.get(item)
    data = response.content
    soup = BeautifulSoup(data, 'html.parser')
    required_data = soup.find_all(class_='moduleBody')
    for res in required_data:
        # Tag name and class are separate arguments; tbodyclass_ is not a valid filter
        real_data = res.find_all("tbody", class_='dataSmall')
        print(real_data)
Edit: One find_all is enough:
import requests
from bs4 import BeautifulSoup
url = ["https://www.reuters.com/finance/stocks/company-officers/GOOG.O",
       "https://www.reuters.com/finance/stocks/company-officers/AMZN.O",
       "https://www.reuters.com/finance/stocks/company-officers/AAPL.O"]
for URL in url:
    req = requests.get(URL)
    soup = BeautifulSoup(req.content, 'html.parser')
    for i in soup.find_all("tbody", {"class": "dataSmall"}):
        print(i)
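Both problems in the question's code can be reproduced offline. The sketch below uses made-up markup (only the class names moduleBody and dataSmall come from the question) to show that a ResultSet has no find_all() of its own, and that an unknown keyword such as tbodyclass_ is read as a filter on an HTML attribute of that name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup reusing the class names from the question.
html = """
<div class="moduleBody">
  <table><tbody class="dataSmall"><tr><td>row</td></tr></tbody></table>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

modules = soup.find_all(class_="moduleBody")  # ResultSet (list-like)
first = soup.find(class_="moduleBody")        # a single Tag

# Calling find_all() on the ResultSet raises the error from the question.
try:
    modules.find_all("tbody")
    failed = False
except AttributeError:
    failed = True

# tbodyclass_ is not a special keyword: it is treated as a filter on an
# attribute literally named "tbodyclass_", so nothing matches.
wrong = first.find_all(tbodyclass_="dataSmall")

# Tag name as the first argument, class via class_.
right = first.find_all("tbody", class_="dataSmall")

print(failed, len(wrong), len(right))
```

Using find() (or indexing into the ResultSet) gives a single Tag, on which find_all() can then be called.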
It's not clear exactly what information you are trying to extract, but doing:
for res in required_data:
    real_data=res.find_all("tbody", class_="dataSmall")
    for dat in real_data:
        print(dat.text.strip())
outputs lots of info about lots of people...
EDIT: If you're looking for the summary table, you need this:
import pandas as pd
tabs = pd.read_html(url)
print(tabs[0])
and there's your table.
Related
I want to fetch the trends as shown in the images, but I am not able to fetch the trends that have no tweet count (refer to the image). Below is the code; it raises an error for those trends that have no tweet count. Please let me know how to handle this exception and print all the trends.
from bs4 import BeautifulSoup
import requests
URL = "https://trends24.in/india/"
html_text=requests.get(URL)
soup= BeautifulSoup(html_text.content,'html.parser')
results = soup.find(id='trend-list')
job_elems = results.find_all('li')
for job_elem in job_elems:
    print(job_elem.find('a').get_text(), job_elem.find('span').get_text())
You could select a shared parent, then wrap the attempt to grab the tweet-count element inside a try/except:
from bs4 import BeautifulSoup
import requests
URL = "https://trends24.in/india/"
html_text=requests.get(URL)
soup= BeautifulSoup(html_text.content,'lxml')
for i in soup.select('#trend-list li'):
    print(i.a.text)
    try:
        print(i.select_one('.tweet-count').text)
    except AttributeError:
        # select_one returned None: this trend has no tweet count
        print("no tweets")
List of dicts:
from bs4 import BeautifulSoup
import requests
URL = "https://trends24.in/india/"
html_text=requests.get(URL)
soup= BeautifulSoup(html_text.content,'lxml')
results = []
for i in soup.select('#trend-list li'):
    d = dict()
    d[i.a.text] = ''
    try:
        val = i.select_one('.tweet-count').text
    except AttributeError:
        # select_one returned None: this trend has no tweet count
        val = "no tweets"
    finally:
        d[i.a.text] = val
    results.append(d)
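As an alternative to try/except, the missing element can be tested for directly, since select_one returns None when nothing matches. A minimal sketch on hypothetical markup (the id trend-list and class tweet-count are taken from the code above; the trend names and counts are invented):

```python
from bs4 import BeautifulSoup

# Hypothetical list: the second trend has no tweet-count element.
html = """
<ol id="trend-list">
  <li><a href="#">#TopicA</a><span class="tweet-count">12K</span></li>
  <li><a href="#">#TopicB</a></li>
</ol>
"""
soup = BeautifulSoup(html, "html.parser")

results = []
for li in soup.select("#trend-list li"):
    count = li.select_one(".tweet-count")  # None when the element is absent
    results.append({li.a.text: count.text if count is not None else "no tweets"})

print(results)
```

This keeps the loop free of exception handling and makes the "no count" case explicit.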
I'm trying to scrape the URLs for the individual players from this website.
I've already tried doing this with bs4, and it just returns [] every time I try to find the table. I switched to lxml to give that a try.
from urllib.request import urlopen
from lxml import etree
url = "https://www.espn.com/soccer/team/squad/_/id/359/arsenal"
tree = etree.HTML(urlopen(url).read())
table = tree.xpath('//*[@id="fittPageContainer"]/div[2]/div[5]/div[1]/div/article/div/section/div[5]/section/table/tbody/tr/td[1]/div/table/tbody/tr[1]/td/span')
print(table)
I expect some sort of output that I could use to get the links, but the code returns empty square brackets.
I think this is what you want.
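from selenium import webdriver
from selenium.webdriver.common.keys import Keys
# Written against the Selenium 3 API; Selenium 4 replaced the
# find_element(s)_by_* helpers with driver.find_elements(By.XPATH, ...).
driver = webdriver.Firefox(executable_path=r'C:\files\geckodriver.exe')
driver.set_page_load_timeout(30)
driver.get("https://www.espn.com/soccer/team/squad/_/id/359/arsenal")
continue_link = driver.find_element_by_tag_name('a')
# Collect every anchor that has an href attribute and print its target
elems = driver.find_elements_by_xpath("//a[@href]")
for elem in elems:
    print(elem.get_attribute("href"))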
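One possible reason an XPath copied from browser developer tools returns [] (not confirmed for this particular page, which may also build its table with JavaScript) is that browsers insert tbody elements that are absent from the raw HTML. The sketch below illustrates the effect with the standard library's ElementTree on a made-up, well-formed snippet; it is a stand-in for lxml, not the code from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical raw markup as a server might send it: no <tbody> element.
raw = (
    "<div id='container'><table>"
    "<tr><td><a href='/player/1'>A</a></td></tr>"
    "<tr><td><a href='/player/2'>B</a></td></tr>"
    "</table></div>"
)
root = ET.fromstring(raw)

# A path that expects the browser-inserted <tbody> finds nothing.
with_tbody = root.findall(".//table/tbody/tr/td/a")

# Dropping the tbody step matches the anchors in the raw markup.
without_tbody = root.findall(".//table/tr/td/a")
links = [a.get("href") for a in without_tbody]

print(len(with_tbody), links)
```

Printing the fetched HTML and searching it for the expected table is a quick way to tell whether the data is in the response at all before adjusting the path.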
Here I am trying to extract a table from a website, as specified in the Python code below. I am able to get the HTML table, but I am unable to convert it to a data frame. Here is the code:
# import libraries
import requests
from bs4 import BeautifulSoup
# specify url
url = 'http://my-trade.in/'
# request html
page = requests.get(url)
# Parse html using BeautifulSoup, you can use a different parser like lxml if present
soup = BeautifulSoup(page.content, 'html.parser')
tbl =soup.find("table",{"id":"MainContent_dataGridView1"})
You can just use the pandas read_html function for that. Remember to convert the HTML you get to a string, otherwise you will get a parsing error.
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'http://my-trade.in/'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
tbl = soup.find("table",{"id":"MainContent_dataGridView1"})
data_frame = pd.read_html(str(tbl))[0]
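If read_html is not an option (it needs a parser backend such as lxml installed, and recent pandas releases prefer a file-like object such as io.StringIO(str(tbl)) over a raw string), the rows can be collected by hand. A rough sketch on hypothetical markup; only the table id comes from the question, and the column names and values are invented:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical table reusing the id from the question.
html = """
<table id="MainContent_dataGridView1">
  <tr><th>Symbol</th><th>Price</th></tr>
  <tr><td>ABC</td><td>10.5</td></tr>
  <tr><td>XYZ</td><td>7.25</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
tbl = soup.find("table", {"id": "MainContent_dataGridView1"})

# One list of cell texts per table row; the first row holds the headers.
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in tbl.find_all("tr")
]
data_frame = pd.DataFrame(rows[1:], columns=rows[0])

print(data_frame)
```

All values come out as strings here, so numeric columns would still need converting, for example with pd.to_numeric.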
I am trying to parse only the independent claims off of google.com/patents, but they use the same class name as the children dependent claims. I am new, but I think what I am trying to ask is how do I exclude child results if the parent has a particular class name.
I have tried to work the examples of parent / child / sibling / etc. off of this BeautifulSoup tutorial.
Unfortunately, nothing seemed to work.
from bs4 import BeautifulSoup
import requests
url = 'https://patents.google.com/patent/US7054130?oq=US7654309'
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'html.parser')
claims = soup.find_all('div', class_='claim')
for claim in claims:
    if claim.find(class_='claim-dependent style-scope patent-text'):
        continue
    print(claim.text)
I expected the dependent claim sections to be skipped and only the independent claims to be printed.
Result: all the claims, independent and dependent, get printed.
Your if statement does not do anything: the find() call inside it returns an empty result, so continue is never reached and every claim is printed by the next line.
You could filter all the claims with the dependent claim-ref tag:
from bs4 import BeautifulSoup
import requests
url = 'https://patents.google.com/patent/US7054130?oq=US7654309'
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'html.parser')
claims = soup.find_all('div', class_='claim')
for claim in claims:
    if not claim.find('claim-ref'):
        print(claim.find(class_='claim'))
Simply filter on the parent and child classes. I think this excludes claims whose parent has the class claim-dependent, which I assume are the dependent claims.
print(soup.select('.claim .claim'))
This gives 3 matches (claims 1, 6 and 19).
You can see one of each type here:
This is for claims 1 and 2. The top one, claim 1, has a parent div with class claim and a child with class claim, whereas the bottom one, claim 2, has a parent div with class claim-dependent and then a child with class claim. So you specify that relationship between parent class and child class to filter.
from bs4 import BeautifulSoup
import requests
url = 'https://patents.google.com/patent/US7054130?oq=US7654309'
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'html.parser')
data = [claim.text for claim in soup.select('.claim .claim')]
print(data)
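The parent/child filter can be tried on a stripped-down document. This is a sketch on invented markup that only mimics the nesting described above (an outer div with class claim or claim-dependent, and an inner div with class claim); the real page has more structure.

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking the nesting described in the answer.
html = """
<div class="claim"><div class="claim">Independent claim 1</div></div>
<div class="claim-dependent"><div class="claim">Dependent claim 2</div></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Descendant combinator: an element with class claim inside another
# element with class claim. The class claim-dependent is a different
# class token, so its children are not matched.
independent = [tag.get_text() for tag in soup.select(".claim .claim")]

print(independent)
```

CSS class selectors match whole class names, which is why .claim does not also match claim-dependent.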
I'm trying to download and convert this set into a pandas DataFrame structure and display the first 10 lines for viewing in a jupyter notebook.
url = 'https://ckannet-storage.commondatastorage.googleapis.com/2014-12-13T15:15:31.729Z/airfields.json'
resp = requests.get(url)
resp.content
I ran this and it gives me all the content. How can I limit the output so that it displays only the first 10 lines?
Convert your response string to JSON and then to a pandas DataFrame. Finally, select the first 10 rows from it. Code:
import pandas as pd
import json
j = json.loads(resp.content)
df = pd.DataFrame(j)
df[:10]
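The same steps can be run without a network call. In this sketch the payload is invented (a list of records with hypothetical id and name fields), since the actual layout of airfields.json is not shown in the question; content stands in for resp.content.

```python
import json
import pandas as pd

# Hypothetical payload: the real file's fields are not shown in the question.
records = [{"id": i, "name": f"Airfield {i}"} for i in range(25)]
content = json.dumps(records).encode("utf-8")  # stands in for resp.content

j = json.loads(content)   # json.loads accepts bytes as well as str
df = pd.DataFrame(j)      # one row per record, one column per key
first_ten = df.head(10)   # same rows as df[:10]

print(first_ten)
```

In a Jupyter notebook, leaving first_ten (or df.head(10)) as the last expression in a cell renders it as a table.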