Scrape info from HTML with BeautifulSoup - html

Good morning, I can't extract the data correctly with BeautifulSoup, can you help me? I want to get the last numbers of this HTML, i.e. only the employee id (22219):
<td class="day list-item">
<div class="allocation-day click-area clickable" data-date="2022-02-07" data-url="anonymous-duty-details?beginDate=2022-02-07&allocatedEmployeeId=22219">
<div class="day-info">
<div class="date-with-type">7</div>
<div class="weekday">Mo.</div>
<div class="row-status-icons">
from bs4 import BeautifulSoup as bs

url = '20105.html'
with open(url, 'r') as f:
    contents = f.read()

soup = bs(contents, features="html.parser")
userfinder = soup.find_all('data-url', class_='')
tutte = soup.find_all('div', attrs={"data-url": "data-url"})
print(tutte)
print(userfinder)

You could select your targets more specifically. In this case, CSS selectors are used to select all div elements that have a data-url attribute:
soup.select('div[data-url]')
To get the values from the ResultSet, you have to iterate over it:
for url in soup.select('div[data-url]'):
    print(url['data-url'])
##output
anonymous-duty-details?beginDate=2022-02-07&allocatedEmployeeId=22219
To get only the ids, a simple approach could be to split() the url string. But be aware that this only works while the structure stays the same; otherwise you have to use a regex or another approach:
for url in soup.select('div[data-url]'):
    print(url['data-url'].split('=')[-1])
##output
22219
A simple regex approach:
import re

for url in soup.select('div[data-url]'):
    print(re.search(r"allocatedEmployeeId=(\d*)", url['data-url']).group(1))
##output
22219
EDIT
To get only the first result, use select_one():
soup.select_one('div[data-url]')['data-url'].split('=')[-1]
or
import re
re.search(r"allocatedEmployeeId=(\d*)",soup.select_one('div[data-url]')['data-url']).group(1)
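Splitting on '=' works only while allocatedEmployeeId stays the last parameter. As a more robust sketch, the standard library's urllib.parse can pick the parameter out by name (the data-url value below is copied from the question's HTML):

```python
from urllib.parse import urlparse, parse_qs

# data-url value taken from the question's HTML
data_url = "anonymous-duty-details?beginDate=2022-02-07&allocatedEmployeeId=22219"

# parse_qs maps each query parameter to a list of its values
params = parse_qs(urlparse(data_url).query)
print(params['allocatedEmployeeId'][0])  # 22219
```

This keeps working even if the site reorders the query parameters.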

Related

How to find the enclosing tags (start and end tag) after searching for a specific keyword in a HTML file?

I have a list of keywords which I need to search for in a website. I first extracted the contents of the webpage using BeautifulSoup and stored it in a text file. I wish to search for the list of keywords in the text file (which contains HTML data), and when one of the keywords matches, the respective start and end tags where the keyword was found need to be extracted.
For example-
<div class="col-md-6">
<img alt="DC Sustainable Energy Utility: Your Guide to Green" class="img-fluid" src="//d2z33q8cpwfp3p.cloudfront.net/content/dcseu-temp.png"/>
</div>
I search for the word "Energy" and I find it in the 'img' tag, BUT, I wish to extract the parent tag, which is 'div' here.
Is there a way I can do that?
from bs4 import BeautifulSoup
import urllib.request

# Extracting HTML content from a webpage
webUrl = urllib.request.urlopen("URL")
html_doc = webUrl.read()
soup = BeautifulSoup(html_doc, 'html.parser')
soup = str(soup)
with open('path to .txt file', 'w') as output:
    output.write(soup)

# Extracting start and end tags
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)

    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)

    def handle_data(self, data):
        print("Encountered some data :", data)

parser = MyHTMLParser()
parser.feed('<div class="col-md-6"><img alt="Some Energy Utility: " class="img-fluid" src="//some_image.png"/></div>')
This identifies all the start and end tags, but I wish to be able to extract the parent/master tags which holds the keyword.
You can find elements containing certain text, or imgs with a certain alt text, using a custom filter, then find the closest parent of type div (or any other criteria the .find_* methods accept):
from bs4 import BeautifulSoup, Tag

html = '''
<div class="col-md-6">
<img alt="DC Sustainable Energy Utility: Your Guide to Green" class="img-fluid" src="//d2z33q8cpwfp3p.cloudfront.net/content/dcseu-temp.png"/>
</div>
'''

keyword = 'energy'

if __name__ == '__main__':
    soup = BeautifulSoup(html, 'html.parser')

    def keyword_filter(el: Tag):
        """Pick a tag according to its text content"""
        if keyword.lower() in el.text.lower():
            return True
        try:
            if keyword.lower() in el['alt'].lower():
                return True
        except KeyError:
            return False
        return False

    for el in soup.find_all(keyword_filter):
        div = el.find_parent('div')
        print(div)
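As a small sketch of that last point, find_parent() accepts the same filters as find(), so you can also match the parent by class rather than just by tag name; the HTML here is a made-up stand-in:

```python
from bs4 import BeautifulSoup

# Made-up markup: the keyword lives in an <img> nested two levels deep.
html = '<div class="col-md-6"><p><img alt="Energy"/></p></div>'
soup = BeautifulSoup(html, 'html.parser')

img = soup.img
# find_parent walks up the tree and accepts the same criteria as find()
print(img.find_parent('div', class_='col-md-6'))  # prints the enclosing div
```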

Trying to scrape a specific part of html with Python-3.7, but it returns "None"

I am a beginner writing some simple Python code to scrape data from a web page. I have located the exact part of the HTML that I want to scrape, but it keeps returning "None." It works for other parts of the web page, but not this one specific part.
I am using BeautifulSoup to parse the html, and since I can scrape some of the code, I am assuming I will not need to use Selenium. But I still cannot find how to scrape one specific part.
Here is the Python code I have written:
import requests
from bs4 import BeautifulSoup
url = 'https://www.rent.com/new-york/tuckahoe-apartments?page=2'
response = requests.get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')
apt_listings = html_soup.find_all('div', class_='_3RRl_')
print(type(apt_listings))
print(len(apt_listings))
first_apt = apt_listings[0]
first_apt.a
first_add = first_apt.a.text
print(first_add)
apt_rents = html_soup.find_all('div', class_='_3e12V')
print(type(apt_rents))
print(len(apt_rents))
first_rent = apt_rents[0]
print(first_rent)
first_rent = first_rent.find('class', attrs={'data-tid' : 'price'})
print(first_rent)
Here is the output from CMD:
<class 'bs4.element.ResultSet'>
30
address not disclosed
<class 'bs4.element.ResultSet'>
30
<div class="_3e12V" data-tid="price">$2,350</div>
None
The "address not disclosed" is correct and was scraped successfully. I want to scrape the $2,350 but it keeps returning "None." I think I am close to getting it right but I just can't seem to get the $2,350. Any help is greatly appreciated.
You need to use the .text property of the element instead of .find(), like this:
first_rent = first_rent.text
as simple as that.
You can extract all the listings from a script tag and parse them as json. The regex looks for the script tag whose content starts with window.__APPLICATION_CONTEXT__ =. The string after that is extracted via the group in the regex, (.*), and that javascript object can be parsed as json with json.loads.
import requests
import json
import re
import pandas as pd
from bs4 import BeautifulSoup as bs

base_url = 'https://www.rent.com/'
res = requests.get('https://www.rent.com/new-york/tuckahoe-apartments?page=2')
soup = bs(res.content, 'lxml')
r = re.compile(r'window\.__APPLICATION_CONTEXT__ = (.*)')
data = soup.find('script', text=r).text
script = r.findall(data)[0]
items = json.loads(script)['store']['listings']['listings']

results = []
for item in items:
    address = item['address']
    area = ', '.join([item['city'], item['state'], item['zipCode']])
    low_price = item['aggregates']['prices']['low']
    high_price = item['aggregates']['prices']['high']
    listingId = item['listingId']
    url = base_url + item['listingSeoPath']
    # all_info = item
    record = {'address': address,
              'area': area,
              'low_price': low_price,
              'high_price': high_price,
              'listingId': listingId,
              'url': url}
    results.append(record)

df = pd.DataFrame(results, columns=['address', 'area', 'low_price', 'high_price', 'listingId', 'url'])
print(df)
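For reference, the script-tag-to-json pattern above can be reduced to a minimal, self-contained sketch; the HTML and the javascript object below are made-up stand-ins for the real page:

```python
import json
import re
from bs4 import BeautifulSoup

# Made-up stand-in for the real page's script tag
html = '''<script>
window.__APPLICATION_CONTEXT__ = {"store": {"listings": {"listings": [{"address": "1 Main St"}]}}}
</script>'''

soup = BeautifulSoup(html, 'html.parser')
pattern = re.compile(r'window\.__APPLICATION_CONTEXT__ = (.*)')

# Find the script tag whose content matches the regex
# (older bs4 versions use text= instead of string=)
script = soup.find('script', string=pattern)

# Group 1 of the regex is the javascript object literal; parse it as json
data = json.loads(pattern.search(script.text).group(1))
print(data['store']['listings']['listings'][0]['address'])  # 1 Main St
```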
Short version, selecting by class:
import requests
from bs4 import BeautifulSoup
url = 'https://www.rent.com/new-york/tuckahoe-apartments?page=2'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.select_one('._3e12V').text)
All prices:
import requests
from bs4 import BeautifulSoup
url = 'https://www.rent.com/new-york/tuckahoe-apartments?page=2'
response = requests.get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')
print([item.text for item in html_soup.select('._3e12V')])

Getting html table data other than text content (get "title" tag data)

One table entry within a table row on an html table I am trying to scrape looks like so:
<td class="top100nation" title="PAK">
<img src="/images/flag/flags_pak.jpg" alt="PAK"></td>
The web page to which this belongs is the following: http://www.relianceiccrankings.com/datespecific/odi/?stattype=bowling&day=01&month=01&year=2014. The entire column to which this belongs in the table has similar table data (i.e. it's a column of images).
I am using lxml in a python script. (Open to using BeautifulSoup instead, if I have to for some reason.) For every other column in the table, I can extract the data I want on the given row by using 'data = entry.text_content()'. Obviously, this doesn't work for this column of images. But I don't want the image data in any case. What I want to get from this table data is the 'PAK' bit - that is, I want the name of the nation. I think this is extremely simple but unfortunately I am a simpleton who doesn't understand the library he is using.
Thanks in advance
Edit: Full script, as per request
import requests
import lxml.html as lh
import csv
with open('firstPageCricinfo', 'w') as file:
    writer = csv.writer(file)

page = requests.get(url)
doc = lh.fromstring(page.content)

# rows of the table
tr_elements = doc.xpath('//tr')
data_array = [[] for _ in range(len(tr_elements))]
del tr_elements[0]
for t in tr_elements[0]:
    name = t.text_content()
    if name == "":
        continue
    print(name)
    data_array[0].append(name)

# printing out first row of table, to check correctness
print(data_array[0])

for j in range(1, len(tr_elements)):
    T = tr_elements[j]
    i = 0
    for t in T.iterchildren():
        # column is not at issue
        if i != 3:
            data = t.text_content()
        # image-based column
        else:
            # what do I do here???
            data = t.
        data_array[j].append(data)
        i += 1

# printing last row to check correctness
print(data_array[len(tr_elements)-1])

with open('list1', 'w') as file:
    writer = csv.writer(file)
    for i in range(0, len(tr_elements)):
        writer.writerow(data_array[i])
Along with the lxml library, you'll also need requests or some other library to get the website content.
Without seeing the code you have so far, I can offer a BeautifulSoup solution:
url = 'http://www.relianceiccrankings.com/datespecific/odi/?stattype=bowling&day=01&month=01&year=2014'
from bs4 import BeautifulSoup
import requests
soup = BeautifulSoup(requests.get(url).text, 'lxml')
r = soup.find_all('td', {'class': 'top100cbr'})
for td in r:
    print(td.text.split('v')[1].split(',')[0].strip())
outputs about 522 items:
South Africa
India
Sri Lanka
...
Canada
New Zealand
Australia
England
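To answer the attribute question directly with lxml: an element's attributes can be read with .get(), so the image column in the loop could use t.get('title'). A minimal sketch against the markup from the question (wrapped in a table so the HTML parser keeps the td in place):

```python
import lxml.html as lh

# The <td> markup copied from the question, wrapped in a table row
html = ('<table><tr>'
        '<td class="top100nation" title="PAK">'
        '<img src="/images/flag/flags_pak.jpg" alt="PAK"></td>'
        '</tr></table>')

# .get() reads an attribute value from an lxml element
td = lh.fromstring(html).xpath('//td')[0]
print(td.get('title'))  # PAK
```

So in the original script, the `else` branch could be `data = t.get('title')`.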

Trying to scrape the image source using Beautiful Soup and Python [duplicate]

I am trying to extract the content of a single "value" attribute in a specific "input" tag on a webpage. I use the following code:
import urllib
f = urllib.urlopen("http://58.68.130.147")
s = f.read()
f.close()
from BeautifulSoup import BeautifulStoneSoup
soup = BeautifulStoneSoup(s)
inputTag = soup.findAll(attrs={"name" : "stainfo"})
output = inputTag['value']
print str(output)
I get TypeError: list indices must be integers, not str
From the BeautifulSoup documentation, I understand that strings should not be a problem here, but I am no specialist and I may have misunderstood.
Any suggestion is greatly appreciated!
.find_all() returns a list of all found elements, so:
input_tag = soup.find_all(attrs={"name" : "stainfo"})
input_tag is a list (probably containing only one element). Depending on what exactly you want, you should either do:
output = input_tag[0]['value']
or use .find() method which returns only one (first) found element:
input_tag = soup.find(attrs={"name": "stainfo"})
output = input_tag['value']
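As a quick illustration of the error and both fixes (the HTML string is a made-up stand-in for the page):

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the page's markup
soup = BeautifulSoup('<input name="stainfo" value="42">', 'html.parser')

tags = soup.find_all(attrs={"name": "stainfo"})
# tags['value']  -> TypeError: list indices must be integers, not str

# Fix 1: index into the list first
print(tags[0]['value'])  # 42

# Fix 2: use find(), which returns a single element
print(soup.find(attrs={"name": "stainfo"})['value'])  # 42
```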
In Python 3.x, simply use get(attr_name) on your tag object that you get using find_all:
from bs4 import BeautifulSoup

with open('conf//test1.xml', 'r') as xmlFile:
    xmlData = xmlFile.read()

xmlSoup = BeautifulSoup(xmlData, 'html.parser')
repElemList = xmlSoup.find_all('repeatingelement')
for repElem in repElemList:
    print("Processing repElem...")
    repElemID = repElem.get('id')
    repElemName = repElem.get('name')
    print("Attribute id = %s" % repElemID)
    print("Attribute name = %s" % repElemName)
against XML file conf//test1.xml that looks like:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<root>
<singleElement>
<subElementX>XYZ</subElementX>
</singleElement>
<repeatingElement id="11" name="Joe"/>
<repeatingElement id="12" name="Mary"/>
</root>
prints:
Processing repElem...
Attribute id = 11
Attribute name = Joe
Processing repElem...
Attribute id = 12
Attribute name = Mary
For me, given:
<input id="color" value="Blue"/>
this can be fetched with the snippet below:
import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.abcd.com")
soup = BeautifulSoup(page.content, 'html.parser')
colorName = soup.find(id='color')
print(colorName['value'])
If you want to retrieve multiple attribute values from the source above, you can use findAll and a list comprehension to get everything you need:
import urllib
f = urllib.urlopen("http://58.68.130.147")
s = f.read()
f.close()

from BeautifulSoup import BeautifulStoneSoup
soup = BeautifulStoneSoup(s)

inputTags = soup.findAll(attrs={"name": "stainfo"})
### You may be able to do findAll("input", attrs={"name": "stainfo"})
output = [x["value"] for x in inputTags]
print output
### This will print a list of the values.
I would actually suggest a time-saving way to go with this, assuming you know what kind of tags have those attributes.
Suppose a tag xyz has the attribute named "staininfo":
full_tag = soup.findAll("xyz")
Note that full_tag is a list:
for each_tag in full_tag:
    staininfo_attrb_value = each_tag["staininfo"]
    print staininfo_attrb_value
Thus you can get all the staininfo attribute values for all the xyz tags.
You can also use this:
import requests
from bs4 import BeautifulSoup
import csv
url = "http://58.68.130.147/"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "html.parser")
get_details = soup.find_all("input", attrs={"name":"stainfo"})
for val in get_details:
    get_val = val["value"]
    print(get_val)
You could try the powerful requests_html package:
from requests_html import HTMLSession

session = HTMLSession()
r = session.get("https://www.bbc.co.uk/news/technology-54448223")
date = r.html.find('time', first=True)  # find a tag called "time"
print(date)  # you will have: <Element 'time' datetime='2020-10-07T11:41:22.000Z'>
# To get the value of the "datetime" attribute, use:
print(date.attrs['datetime'])  # you will get '2020-10-07T11:41:22.000Z'
I am using this with Beautifulsoup 4.8.1 to get the value of all class attributes of certain elements:
from bs4 import BeautifulSoup
html = "<td class='val1'/><td col='1'/><td class='val2' />"
bsoup = BeautifulSoup(html, 'html.parser')
for td in bsoup.find_all('td'):
    if td.has_attr('class'):
        print(td['class'][0])
It's important to note that the attribute key retrieves a list even when the attribute has only a single value.
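A quick illustration of that behaviour:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<td class='val1'/>", 'html.parser')

# class is a multi-valued attribute, so bs4 always returns a list
print(soup.td['class'])     # ['val1']
print(soup.td['class'][0])  # val1
```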
Here is an example of how to extract the href attributes of all a tags:
import requests as rq
from bs4 import BeautifulSoup as bs
url = "http://www.cde.ca.gov/ds/sp/ai/"
page = rq.get(url)
html = bs(page.text, 'lxml')
hrefs = html.find_all("a")
all_hrefs = []
for href in hrefs:
    # print(href.get("href"))
    links = href.get("href")
    all_hrefs.append(links)
print(all_hrefs)
You can try gazpacho:
Install it using pip install gazpacho
Get the HTML and make the Soup using:
from gazpacho import get, Soup
soup = Soup(get("http://ip.add.ress.here/")) # get directly returns the html
inputs = soup.find('input', attrs={'name': 'stainfo'})  # find the matching input tag(s)
if inputs:
    if type(inputs) is list:
        for input in inputs:
            print(input.attr.get('value'))
    else:
        print(inputs.attr.get('value'))
else:
    print('No <input> tag found with the attribute name="stainfo"')

Issue in scraping data from a html page using beautiful soup

I am scraping some data from a website and I am able to do so using the below referred code:
import csv
import urllib2
import sys
import time
from bs4 import BeautifulSoup
from itertools import islice
page = urllib2.urlopen('http://shop.o2.co.uk/mobile_phones/Pay_Monthly/smartphone/all_brands').read()
soup = BeautifulSoup(page)
soup.prettify()
with open('O2_2012-12-21.csv', 'wb') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=',')
    spamwriter.writerow(["Date", "Month", "Day of Week", "OEM", "Device Name", "Price"])
    oems = soup.findAll('span', {"class": "wwFix_h2"}, text=True)
    items = soup.findAll('div', {"class": "title"})
    prices = soup.findAll('span', {"class": "handset"})
    for oem, item, price in zip(oems, items, prices):
        textcontent = u' '.join(islice(item.stripped_strings, 1, 2, 1))
        if textcontent:
            spamwriter.writerow([time.strftime("%Y-%m-%d"), time.strftime("%B"), time.strftime("%A"), unicode(oem.string).encode('utf8').strip(), textcontent, unicode(price.string).encode('utf8').strip()])
Now, the issue is that 2 of the price values I am scraping have a different HTML structure than the rest. My output CSV shows "None" for those because of this. The normal HTML structure for a price on the webpage is:
<span class="handset">
FREE to £79.99</span>
For those 2 values the structure is:
<span class="handset">
<span class="delivery_amber">Up to 7 days delivery</span>
<br>"FREE on all tariffs"</span>
The output I am getting right now displays None for the second HTML structure instead of "FREE on all tariffs". Also, the price value "FREE on all tariffs" is inside double quotes in the second structure, while it is outside any quotes in the first.
Please help me solve this issue; pardon my ignorance, as I am new to programming.
Just detect those 2 items with an additional if statement:
if price.string is None:
    price_text = u' '.join(price.stripped_strings).replace('"', '').encode('utf8')
else:
    price_text = unicode(price.string).strip().encode('utf8')
then use price_text for your CSV file. Note that I removed the " quotes with a simple replace call.
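For reference, here is a minimal reproduction of why .string is None on the nested markup: .string only returns text when the tag contains exactly one child string. The HTML is copied from the question:

```python
from bs4 import BeautifulSoup

# The second price structure from the question
html = '''<span class="handset">
<span class="delivery_amber">Up to 7 days delivery</span>
<br>"FREE on all tariffs"</span>'''

price = BeautifulSoup(html, 'html.parser').span

# .string is None because the span has several children, not one string
print(price.string)  # None

# stripped_strings yields every non-whitespace text fragment in the tag
print(' '.join(price.stripped_strings))  # Up to 7 days delivery "FREE on all tariffs"
```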