How to use Python and BeautifulSoup to parse classes - html

I am trying to parse only the independent claims off of google.com/patents, but they use the same class name as the dependent child claims. I am new to this, but I think what I am asking is: how do I exclude child results when the parent has a particular class name?
I have tried to work through the parent/child/sibling examples from this BeautifulSoup tutorial.
Unfortunately, nothing seemed to work.
from bs4 import BeautifulSoup
import requests

url = 'https://patents.google.com/patent/US7054130?oq=US7654309'
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'html.parser')
claims = soup.find_all('div', class_='claim')
for claim in claims:
    if claim.find(class_='claim-dependent style-scope patent-text'):
        continue
    print(claim.text)
I expected the dependent claim sections to be skipped and only the independent claims printed.
Result: all the claims, independent and dependent, get printed.

Your if statement never does anything, because the find() call comes back empty on these pages, so the continue is never reached and the next line prints every claim.
You could instead filter out all the claims that contain the dependent claim-ref tag:
from bs4 import BeautifulSoup
import requests

url = 'https://patents.google.com/patent/US7054130?oq=US7654309'
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'html.parser')
claims = soup.find_all('div', class_='claim')
for claim in claims:
    if not claim.find('claim-ref'):
        print(claim.find(class_='claim'))
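A side note on why the original find() came back empty: on this page the claim-dependent wrapper is the parent of the claim div, not one of its descendants, and find() only searches downward. Checking the ancestors instead is one alternative; a minimal sketch (class name taken from the question, so verify it against the live page):

from bs4 import BeautifulSoup
import requests

url = 'https://patents.google.com/patent/US7054130?oq=US7654309'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
for claim in soup.find_all('div', class_='claim'):
    # find_parent searches upward; skip claims wrapped in a
    # claim-dependent container
    if claim.find_parent(class_='claim-dependent'):
        continue
    print(claim.text)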

Simply filter on the parent and child classes; this excludes claims whose parent has the class claim-dependent, which I assume are the dependents.
print(soup.select('.claim .claim'))
That gives 3 matches (claims 1, 6 and 19).
You can see one of each type in the page source for claims 1 and 2: the top one, claim 1, has a parent div with class claim and a child with class claim, whereas the bottom one, claim 2, has a parent div with class claim-dependent and then a child with class claim. So you specify that relationship of parent class to child class to filter.
from bs4 import BeautifulSoup
import requests
url = 'https://patents.google.com/patent/US7054130?oq=US7654309'
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'html.parser')
data = [claim.text for claim in soup.select('.claim .claim')]
print(data)

Related

Can't scrape table using lxml

I'm trying to scrape the URLs for the individual players from this website.
I've already tried doing this with bs4, and it just returns [] every time I try to find the table, so I switched to lxml to give this a try.
from urllib.request import urlopen
from lxml import etree

url = "https://www.espn.com/soccer/team/squad/_/id/359/arsenal"
tree = etree.HTML(urlopen(url).read())
table = tree.xpath('//*[@id="fittPageContainer"]/div[2]/div[5]/div[1]/div/article/div/section/div[5]/section/table/tbody/tr/td[1]/div/table/tbody/tr[1]/td/span')
print(table)
I expect some sort of output that I could use to get the links, but the code just returns empty square brackets.
I think this is what you want.
from selenium import webdriver

driver = webdriver.Firefox(executable_path=r'C:\files\geckodriver.exe')
driver.set_page_load_timeout(30)
driver.get("https://www.espn.com/soccer/team/squad/_/id/359/arsenal")
elems = driver.find_elements_by_xpath("//a[@href]")
for elem in elems:
    print(elem.get_attribute("href"))
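That prints every link on the page. If you only want the player pages, you can filter the hrefs; a small sketch, assuming the player profile URLs contain /soccer/player/ (verify this against the live page):

# Keep only anchors whose href looks like a player profile link
# (the '/soccer/player/' fragment is an assumption to check)
player_links = [elem.get_attribute("href") for elem in elems
                if "/soccer/player/" in (elem.get_attribute("href") or "")]
for link in sorted(set(player_links)):
    print(link)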

Convert HTML Table to Pandas Data Frame in Python

Here I am trying to extract a table from the website specified in the Python code below. I am able to get the HTML table, but I am unable to convert it to a data frame. Here is the code:
# import libraries
import requests
from bs4 import BeautifulSoup

# specify url
url = 'http://my-trade.in/'
# request html
page = requests.get(url)
# parse html using BeautifulSoup; you can use a different parser like lxml if present
soup = BeautifulSoup(page.content, 'html.parser')
tbl = soup.find("table", {"id": "MainContent_dataGridView1"})
You can just use pandas' read_html function for that; remember to convert the tag you get to a string, or else you will get a parsing error.
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'http://my-trade.in/'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
tbl = soup.find("table",{"id":"MainContent_dataGridView1"})
data_frame = pd.read_html(str(tbl))[0]
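To check the result, for example:

# Inspect the parsed table
print(data_frame.head())
print(data_frame.shape)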

Extraction of Data using tbodyclass and Beautifulsoup

Extraction with a tbody class using BeautifulSoup and Python 3.
I'm trying to extract the summary table at the top of the page using BeautifulSoup. However, I get the following error when using the tbody class to extract the table containing name, age, info etc.
I am aware I can use the previous table tag (class dataTable) to extract the table; however, I want to try extracting it using the tbody class.
How do I extract the table with the tbody class, and what error am I making?
I'm a bit new to web scraping and any detailed help would be appreciated.
Here is the code:
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import requests

urls = ['https://www.reuters.com/finance/stocks/company-officers/GOOG.O',
        'https://www.reuters.com/finance/stocks/company-officers/AMZN',
        'https://www.reuters.com/finance/stocks/company-officers/AAPL']
for item in urls:
    response = requests.get(item)
    data = response.content
    soup = BeautifulSoup(data, 'html.parser')
    required_data = soup.find_all(class_='moduleBody')
    real_data = required_data.find_all(tbodyclass_='dataSmall')
    print(real_data)
Here is the error:
Traceback (most recent call last):
  File "C:\Users\XXXX\Desktop\scrape.py", line 15, in <module>
    real_data = required_data.find_all(tbodyclass_='dataSmall')
  File "C:\Users\XXXX\AppData\Local\Programs\Python\Python37\lib\site-packages\bs4\element.py", line 1620, in __getattr__
    "ResultSet object has no attribute '%s'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?" % key
AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
To target only the tbody you need to select only the first match for that tbody class, so you can use select_one (note that the :has() selector requires BeautifulSoup 4.7+):
table = soup.select_one('table:has(.dataSmall)')
This gets you the table complete with its table tags, and you can still loop the trs and tds within to write out the table. Below I show using pandas to handle it, though.
Looks like you can just use pandas:
import pandas as pd

urls = ['https://www.reuters.com/finance/stocks/company-officers/GOOG.O',
        'https://www.reuters.com/finance/stocks/company-officers/AMZN',
        'https://www.reuters.com/finance/stocks/company-officers/AAPL']
for url in urls:
    table = pd.read_html(url)[0]
    print(table)
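If the first table on the page is not the one you want, read_html also takes a match argument that keeps only tables containing the given text; a sketch (the string 'Age' is an assumed column header, so adjust it to the actual table):

# Only parse tables whose text matches the given string or regex
tables = pd.read_html(url, match='Age')
print(tables[0])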
Combining pandas and the tbody class, but pulling in the parent table tag:
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

urls = ['https://www.reuters.com/finance/stocks/company-officers/GOOG.O',
        'https://www.reuters.com/finance/stocks/company-officers/AMZN',
        'https://www.reuters.com/finance/stocks/company-officers/AAPL']
with requests.Session() as s:
    for url in urls:
        r = s.get(url)
        soup = bs(r.content, 'lxml')
        table = soup.select_one('table:has(.dataSmall)')
        print(pd.read_html(str(table)))
Ignoring the table tag (but adding it back later for pandas to parse); you don't have to do this, and can instead loop the tr and td elements within rather than handing over to pandas, as sketched after the code.
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

urls = ['https://www.reuters.com/finance/stocks/company-officers/GOOG.O',
        'https://www.reuters.com/finance/stocks/company-officers/AMZN',
        'https://www.reuters.com/finance/stocks/company-officers/AAPL']
with requests.Session() as s:
    for url in urls:
        r = s.get(url)
        soup = bs(r.content, 'lxml')
        table = soup.select_one('.dataSmall')
        print(pd.read_html('<table>' + str(table) + '</table>'))
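For reference, the manual tr/td loop mentioned above might look like this minimal sketch, which just collects the cell text row by row:

# Walk the rows of the selected tbody and pull out the cell text
rows = []
for tr in table.select('tr'):
    cells = [cell.get_text(strip=True) for cell in tr.select('th, td')]
    if cells:
        rows.append(cells)
print(rows)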
To get the table under Summary, you can try the following script:
import requests
from bs4 import BeautifulSoup

URLS = [
    'https://www.reuters.com/finance/stocks/company-officers/GOOG.O',
    'https://www.reuters.com/finance/stocks/company-officers/AMZN',
    'https://www.reuters.com/finance/stocks/company-officers/AAPL'
]
for url in URLS:
    r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(r.text, "lxml")
    for items in soup.select_one("h3:contains(Summary)").find_parent().find_next_sibling().select("table tr"):
        data = [' '.join(item.text.split()) for item in items.select("th,td")]
        print(data)
You can do this (tbodyclass_ is not a valid keyword argument; pass the tag name and class_ instead):
required_data.find_all("tbody", class_="dataSmall")
Also, find_all() returns a ResultSet, which is useful when more than one occurrence of the tag is present.
You need to iterate over the results and call find_all() on each element separately:
for item in urls:
    response = requests.get(item)
    data = response.content
    soup = BeautifulSoup(data, 'html.parser')
    required_data = soup.find_all(class_='moduleBody')
    for res in required_data:
        real_data = res.find_all("tbody", class_="dataSmall")
        print(real_data)
Edit: one find_all is enough:
import requests
from bs4 import BeautifulSoup

url = ["https://www.reuters.com/finance/stocks/company-officers/GOOG.O", "https://www.reuters.com/finance/stocks/company-officers/AMZN.O", "https://www.reuters.com/finance/stocks/company-officers/AAPL.O"]
for URL in url:
    req = requests.get(URL)
    soup = BeautifulSoup(req.content, 'html.parser')
    for i in soup.find_all("tbody", {"class": "dataSmall"}):
        print(i)
It's not clear exactly what information you are trying to extract, but doing:
for res in required_data:
    real_data = res.find_all("tbody", class_="dataSmall")
    for dat in real_data:
        print(dat.text.strip())
outputs lots of info about lots of people...
EDIT: If you're looking for the summary table, you need this:
import pandas as pd

tabs = pd.read_html(url)
print(tabs[0])
and there's your table.

BeautifulSoup gets nothing between tags

I am a novice writing a web crawler. I want to use the search engine at http://www.creditchina.gov.cn/search_all#keyword=&searchtype=0&templateId=&creditType=&areas=&objectType=2&page=1 to check whether my input is valid.
For example, 912101127157655762 is a valid input, and 912101127157655760 is invalid.
After observing the page source in the developer tools, I found that the tags differ depending on whether the input number is valid or invalid.
So I want to determine whether the input is valid by checking whether there is anything within the ul tag with class "credit-info-results public-results-left item-template". Here is how I wrote my web crawler:
import urllib.request
from bs4 import BeautifulSoup

url = 'http://www.creditchina.gov.cn/search_all#keyword=912101127157655762&searchtype=0&templateId=&creditType=&areas=&objectType=2&page=1'
req = urllib.request.Request(url)
data = urllib.request.urlopen(req)
bs = data.read().decode('utf-8')
soup = BeautifulSoup(bs, 'lxml')
check = soup.find_all("ul", {"class": "credit-info-results public-results-left item-template"})
if check == []:
    pass  # TODO: handle invalid input
else:
    pass  # TODO: handle valid input
However, the value of check is always []. I cannot understand why there is nothing between the tags. I hope somebody can help me solve the problem.
What you get back is not HTML but a JavaScript object, which is why BeautifulSoup can't parse it.
You can use a substring search to check whether the response contains the results markup or not:
import urllib.request
from bs4 import BeautifulSoup

url = 'http://www.creditchina.gov.cn/search_all#keyword=912101127157655762&searchtype=0&templateId=&creditType=&areas=&objectType=2&page=1'
req = urllib.request.Request(url)
data = urllib.request.urlopen(req)
bs = data.read().decode('utf-8')
ul_pos = bs.find('credit-info-results public-results-left item-template')
if ul_pos != -1:  # str.find returns -1 when the substring is absent
    bs = bs[ul_pos:]
soup = BeautifulSoup(bs, 'lxml')
check = soup.find_all("ul", {"class": "credit-info-results public-results-left item-template"})
if check == []:
    pass  # TODO: handle invalid input
else:
    pass  # TODO: handle valid input

Using beautiful soup beyond </html>

Is there a possibility to use Beautiful Soup beyond the </html> tag? A case in point would be the following page:
http://dsalsrv02.uchicago.edu/cgi-bin/app/biswas-bangala_query.py?page=1
which has data after the closing html tag.
From what I see, you can use either html.parser or html5lib for this particular page:
import requests
from bs4 import BeautifulSoup
response = requests.get("http://dsalsrv02.uchicago.edu/cgi-bin/app/biswas-bangala_query.py?page=1")
soup = BeautifulSoup(response.content, "html.parser")
# soup = BeautifulSoup(response.content, "html5lib")
The lxml parser does not handle this page well; it parses the page only partially.
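To see the difference for yourself, you can compare how much text each parser recovers from the same response; a quick sketch (assuming html5lib is installed alongside lxml):

import requests
from bs4 import BeautifulSoup

response = requests.get("http://dsalsrv02.uchicago.edu/cgi-bin/app/biswas-bangala_query.py?page=1")
# The length of the extracted text is a rough proxy for how much of
# the document each parser managed to keep
for parser in ("html.parser", "html5lib", "lxml"):
    soup = BeautifulSoup(response.content, parser)
    print(parser, len(soup.get_text()))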