Using Beautiful Soup beyond </html>

Is there a way to use Beautiful Soup beyond the closing </html> tag? A case in point is the following page:
http://dsalsrv02.uchicago.edu/cgi-bin/app/biswas-bangala_query.py?page=1
which has data after the closing </html> tag.

From what I see, you can use either html.parser or html5lib for this particular page:
import requests
from bs4 import BeautifulSoup
response = requests.get("http://dsalsrv02.uchicago.edu/cgi-bin/app/biswas-bangala_query.py?page=1")
soup = BeautifulSoup(response.content, "html.parser")
# soup = BeautifulSoup(response.content, "html5lib")
The lxml parser does not handle this page well; it parses it only partially.
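As a quick sanity check, here is a minimal sketch (using an invented inline document rather than the live page) showing that html.parser keeps markup that appears after the closing </html> tag:

```python
from bs4 import BeautifulSoup

# Toy document: a paragraph deliberately placed after </html>.
doc = "<html><body><p>inside</p></body></html><p>after the end tag</p>"

soup = BeautifulSoup(doc, "html.parser")
# The trailing <p> is still in the tree, so find_all sees both.
paragraphs = [p.get_text() for p in soup.find_all("p")]
print(paragraphs)
```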


Convert HTML Table to Pandas Data Frame in Python

Here I am trying to extract a table from a website, as specified in the Python code below. I am able to get the HTML table, but I am unable to convert it to a data frame. Here is the code:
# import libraries
import requests
from bs4 import BeautifulSoup
# specify url
url = 'http://my-trade.in/'
# request html
page = requests.get(url)
# Parse html using BeautifulSoup, you can use a different parser like lxml if present
soup = BeautifulSoup(page.content, 'html.parser')
tbl = soup.find("table", {"id": "MainContent_dataGridView1"})
You can just use pandas' read_html function for that; remember to convert the tag you get to a string, or you will get a parsing error:
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'http://my-trade.in/'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
tbl = soup.find("table",{"id":"MainContent_dataGridView1"})
data_frame = pd.read_html(str(tbl))[0]
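As a side note, read_html returns a *list* of DataFrames, one per <table> it finds, which is why the result is indexed with [0]. A minimal sketch using an invented inline fragment (wrapped in StringIO, which newer pandas versions expect for literal HTML):

```python
from io import StringIO

import pandas as pd

# Invented toy table standing in for the scraped one.
html = """
<table id="MainContent_dataGridView1">
  <tr><th>Symbol</th><th>Price</th></tr>
  <tr><td>ABC</td><td>101.5</td></tr>
</table>
"""

# read_html returns a list of DataFrames; take the first (and only) one.
frames = pd.read_html(StringIO(html))
df = frames[0]
print(df)
```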

How to use Python and BeautifulSoup to parse classes

I am trying to parse only the independent claims off of google.com/patents, but they use the same class name as the children dependent claims. I am new, but I think what I am trying to ask is how do I exclude child results if the parent has a particular class name.
I have tried to work the examples of parent / child / sibling / etc. off of this BeautifulSoup tutorial.
Unfortunately, nothing seemed to work.
from bs4 import BeautifulSoup
import requests
url = 'https://patents.google.com/patent/US7054130?oq=US7654309'
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'html.parser')
claims = soup.find_all('div', class_='claim')
for claim in claims:
    if claim.find(class_='claim-dependent style-scope patent-text'):
        continue
    print(claim.text)
I expected the dependent claim sections to be skipped and only the independent claims printed.
Results - All the claims, independent and dependent, gets printed.
Your if statement has no effect: it contains just a continue, and the find result is empty anyway, so you print all the claims on the next line.
You could filter all the claims with the dependent claim-ref tag:
from bs4 import BeautifulSoup
import requests
url = 'https://patents.google.com/patent/US7054130?oq=US7654309'
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'html.parser')
claims = soup.find_all('div', class_='claim')
for claim in claims:
    if not claim.find('claim-ref'):
        print(claim.find(class_='claim'))
Simply filter on the parent and child classes, I think, as this excludes claims whose parent class is claim-dependent, which I assume are the dependents:
print(soup.select('.claim .claim'))
3 matches (claims 1,6,19)
Take claims 1 and 2 as an example of each type: claim 1 has a parent div with class claim and a child with class claim, whereas claim 2 has a parent div with class claim-dependent and then a child with class claim. So you specify that relationship of parent class and child class to filter.
from bs4 import BeautifulSoup
import requests
url = 'https://patents.google.com/patent/US7054130?oq=US7654309'
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'html.parser')
data = [claim.text for claim in soup.select('.claim .claim')]
print(data)
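To see why the selector works, here is a minimal sketch against invented toy markup mirroring the layout described above: an independent claim sits inside a parent div with class claim, a dependent one inside a parent div with class claim-dependent:

```python
from bs4 import BeautifulSoup

# Invented toy markup shaped like the patent page.
html = """
<div class="claim"><div class="claim">1. An independent claim.</div></div>
<div class="claim-dependent"><div class="claim">2. A dependent claim.</div></div>
"""

soup = BeautifulSoup(html, "html.parser")
# '.claim .claim' requires an ancestor with class claim, so the
# dependent claim (whose ancestor is claim-dependent) is excluded.
matches = [c.get_text() for c in soup.select(".claim .claim")]
print(matches)
```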

Extraction of Data using tbody class and BeautifulSoup

Extraction with the tbody class using BeautifulSoup and Python 3.
I'm trying to extract the table under the summary at the top of the page. I'm using BeautifulSoup for the extraction; however, I get the following error when using the tbody class to extract the table containing name, age, info, etc.
I am aware I can use the previous table (class datatable) to extract the table; however, I want to try extracting using the tbody class.
How do I extract the table with the tbody class, and what error am I making?
I'm a bit new to web scraping, and any detailed help would be appreciated.
Here is the code:
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import requests
urls = ['https://www.reuters.com/finance/stocks/company-officers/GOOG.O',
        'https://www.reuters.com/finance/stocks/company-officers/AMZN',
        'https://www.reuters.com/finance/stocks/company-officers/AAPL']
for item in urls:
    response = requests.get(item)
    data = response.content
    soup = BeautifulSoup(data, 'html.parser')
    required_data = soup.find_all(class_='moduleBody')
    real_data = required_data.find_all(tbodyclass_='dataSmall')
    print(real_data)
Here is the error:
Traceback (most recent call last):
  File "C:\Users\XXXX\Desktop\scrape.py", line 15, in <module>
    real_data=required_data.find_all(tbodyclass_='dataSmall')
  File "C:\Users\XXXX\AppData\Local\Programs\Python\Python37\lib\site-packages\bs4\element.py", line 1620, in __getattr__
    "ResultSet object has no attribute '%s'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?" % key
AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
To target only that tbody you need to select just the first match for its class, so you can use select_one:
table = soup.select_one('table:has(.dataSmall)')
This gets you the parent table, and you can still loop over the trs and tds within it to write out the table. Below I show using pandas to handle it instead.
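Here is a minimal sketch of the :has() selector (supported through bs4's soupsieve backend) against invented toy markup, picking out the one table that contains a .dataSmall descendant:

```python
from bs4 import BeautifulSoup

# Invented toy markup: only the second table has a tbody.dataSmall.
html = """
<table id="first"><tbody><tr><td>x</td></tr></tbody></table>
<table id="second"><tbody class="dataSmall"><tr><td>y</td></tr></tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")
# :has(.dataSmall) matches only tables with a .dataSmall descendant.
table = soup.select_one("table:has(.dataSmall)")
print(table["id"])
```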
It looks like you can just use pandas:
import pandas as pd
urls = ['https://www.reuters.com/finance/stocks/company-officers/GOOG.O',
        'https://www.reuters.com/finance/stocks/company-officers/AMZN',
        'https://www.reuters.com/finance/stocks/company-officers/AAPL']
for url in urls:
    table = pd.read_html(url)[0]
    print(table)
Combining pandas and the tbody class, but pulling in the parent table tag:
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

urls = ['https://www.reuters.com/finance/stocks/company-officers/GOOG.O',
        'https://www.reuters.com/finance/stocks/company-officers/AMZN',
        'https://www.reuters.com/finance/stocks/company-officers/AAPL']
with requests.Session() as s:
    for url in urls:
        r = s.get(url)
        soup = bs(r.content, 'lxml')
        table = soup.select_one('table:has(.dataSmall)')
        print(pd.read_html(str(table)))
Ignoring the table tag (but adding it back later so pandas can parse). You don't have to hand over to pandas; you can instead loop over the trs and tds within:
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

urls = ['https://www.reuters.com/finance/stocks/company-officers/GOOG.O',
        'https://www.reuters.com/finance/stocks/company-officers/AMZN',
        'https://www.reuters.com/finance/stocks/company-officers/AAPL']
with requests.Session() as s:
    for url in urls:
        r = s.get(url)
        soup = bs(r.content, 'lxml')
        table = soup.select_one('.dataSmall')
        print(pd.read_html('<table>' + str(table) + '</table>'))
To get the table under summary, you can try the following script:
import requests
from bs4 import BeautifulSoup
URLS = [
    'https://www.reuters.com/finance/stocks/company-officers/GOOG.O',
    'https://www.reuters.com/finance/stocks/company-officers/AMZN',
    'https://www.reuters.com/finance/stocks/company-officers/AAPL'
]
for url in URLS:
    r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(r.text, "lxml")
    for items in soup.select_one("h3:contains(Summary)").find_parent().find_next_sibling().select("table tr"):
        data = [' '.join(item.text.split()) for item in items.select("th,td")]
        print(data)
You can do this:
required_data.find_all("tbody", class_="dataSmall")
The find_all() function returns a ResultSet containing every match, which is useful when more than one occurrence of the tag is present.
You need to iterate over the results and call find_all() on each one separately:
for item in urls:
    response = requests.get(item)
    data = response.content
    soup = BeautifulSoup(data, 'html.parser')
    required_data = soup.find_all(class_='moduleBody')
    for res in required_data:
        real_data = res.find_all("tbody", class_="dataSmall")
        print(real_data)
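The pitfall can be reproduced without the live site; this sketch uses invented toy markup and shows the working pattern of iterating the ResultSet before searching again:

```python
from bs4 import BeautifulSoup

# Invented toy markup: two moduleBody divs, each with a dataSmall tbody.
html = """
<div class="moduleBody"><table><tbody class="dataSmall">
<tr><td>A</td></tr></tbody></table></div>
<div class="moduleBody"><table><tbody class="dataSmall">
<tr><td>B</td></tr></tbody></table></div>
"""

soup = BeautifulSoup(html, "html.parser")
cells = []
# find_all returns a list-like ResultSet: iterate it, then call
# find_all again on each individual element.
for module in soup.find_all(class_="moduleBody"):
    for tbody in module.find_all("tbody", class_="dataSmall"):
        cells.extend(td.get_text() for td in tbody.find_all("td"))
print(cells)
```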
Edit: One find_all is enough:
import requests
from bs4 import BeautifulSoup

urls = ["https://www.reuters.com/finance/stocks/company-officers/GOOG.O",
        "https://www.reuters.com/finance/stocks/company-officers/AMZN.O",
        "https://www.reuters.com/finance/stocks/company-officers/AAPL.O"]
for url in urls:
    req = requests.get(url)
    soup = BeautifulSoup(req.content, 'html.parser')
    for i in soup.find_all("tbody", {"class": "dataSmall"}):
        print(i)
It's not clear exactly what information you are trying to extract, but doing:
for res in required_data:
    real_data = res.find_all("tbody", class_="dataSmall")
    for dat in real_data:
        print(dat.text.strip())
outputs lots of info about lots of people.
EDIT: If you're looking for the summary table, you need this:
import pandas as pd
tabs = pd.read_html(url)
print(tabs[0])
and there's your table.

BeautifulSoup Error in Python 3.3

I'm just trying to parse this site, and I keep getting errors using BeautifulSoup. Can someone help me identify the problem?
import urllib
import urllib.request
import beautifulsoup
html = urllib.request.urlopen('http://yugioh.wikia.com/wiki/Card_Tips:Blue-Eyes_White_Dragon').read()
soup = beautifulsoup.bs4(html)
texts = soup.findAll(text=True)
def visible(element):
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    elif re.match('<!--.*-->', str(element)):
        return False
    return True

visible_texts = filter(visible, texts)
You've mixed up the module and class names. Rather than:
import beautifulsoup
You need:
import bs4
And rather than:
beautifulsoup.bs4(...)
You need:
bs4.BeautifulSoup(...)
Also, in newer versions of Beautiful Soup, the underscore variants of the names are preferred over the camel-case ones, because they fit better with other Python conventions:
soup.find_all(...)
Also, depending on what you're doing with visible_texts, you may want a list rather than a lazy filter:
visible_texts = list(filter(visible, texts))
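Putting those fixes together, here is a sketch of the corrected script; an invented inline document stands in for the downloaded page so no network call is needed, and string=True is used as the modern spelling of the text=True argument:

```python
import re

import bs4

# Invented toy page: title and script text should be filtered out.
html = ("<html><head><title>t</title><script>x=1</script></head>"
        "<body><p>hello</p></body></html>")

soup = bs4.BeautifulSoup(html, "html.parser")
texts = soup.find_all(string=True)

def visible(element):
    # Drop strings that live inside non-visible containers.
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    if re.match('<!--.*-->', str(element)):
        return False
    return True

visible_texts = list(filter(visible, texts))
print(visible_texts)
```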

How to Parse a Glossary Section of a Website with BeautifulSoup & strip HTML tags

I'm trying to parse this page:
http://www.lib.uts.edu.au/about-uts-library/corporate-information/library-glossary
and get just the title and description for each section, and that's it, no tags.
I parse the page and try to search for all <title> and <p> tags, but it doesn't produce the right results.
I am using Python 2.7 and BeautifulSoup 3.2.0.
Here is a sample of my code:
import urllib2, sys
address = sys.argv[1]
html = urllib2.urlopen('http://www.lib.uts.edu.au/about-uts-library/corporate-information/library-glossary').read()
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(html)
def printText(tags):
    for tag in tags:
        if tag._class_ == NavigableString:
            print tag,
        else:
            printText(tag)
    print ""

printText(soup.findALL("p"))
print "".join(soup.findALL("p", text=re.compile(".")))
I'm not entirely sure what you're looking for, but I suspect you are looking to get the term and definition out from this page. Looking for the <title> and <p> tags is not really what you need. You should look for attributes that make a tag unique. In this case, looking at the <span> tag shows that there is a class attribute that uniquely labels the terms. This can be used to isolate the sections that you need. I suggest looking more closely at the documentation for find/findAll. Below is some code that will get you most of the way there.
from BeautifulSoup import BeautifulSoup
import urllib
url = 'http://www.lib.uts.edu.au/about-uts-library/corporate-information/library-glossary'
soup = BeautifulSoup(urllib.urlopen(url))
paragraphs = [x.parent for x in soup.findAll(name='span', attrs={'class': 'definition'}) if x.parent.name == 'p']
for p in paragraphs:
    name = p.find(name='span', attrs={'class': 'definition'}).text
    text = p.text.replace(name, '')
    print '-' * 80
    print name
    print text
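For completeness, a sketch of the same approach in Python 3 with the current bs4 package, run against an invented inline fragment shaped like the glossary page (span.definition holds the term, the rest of the parent <p> holds the description):

```python
from bs4 import BeautifulSoup

# Invented toy markup mirroring the glossary page's structure.
html = """
<p><span class="definition">Catalogue</span> A list of library holdings.</p>
<p><span class="definition">Journal</span> A periodical publication.</p>
"""

soup = BeautifulSoup(html, "html.parser")
entries = []
for span in soup.find_all("span", class_="definition"):
    term = span.get_text()
    # Remove the term from the paragraph text to isolate the description.
    description = span.parent.get_text().replace(term, "").strip()
    entries.append((term, description))
print(entries)
```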