Trying to scrape the image source using Beautiful Soup and Python [duplicate] - html

I am trying to extract the content of a single "value" attribute in a specific "input" tag on a webpage. I use the following code:
import urllib
f = urllib.urlopen("http://58.68.130.147")
s = f.read()
f.close()
from BeautifulSoup import BeautifulStoneSoup
soup = BeautifulStoneSoup(s)
inputTag = soup.findAll(attrs={"name" : "stainfo"})
output = inputTag['value']
print str(output)
I get TypeError: list indices must be integers, not str
From the BeautifulSoup documentation, I understood that strings should not be a problem here... but I am no specialist, and I may have misunderstood.
Any suggestion is greatly appreciated!

.find_all() returns a list of all found elements, so:
input_tag = soup.find_all(attrs={"name" : "stainfo"})
input_tag is a list (probably containing only one element). Depending on what exactly you want, you should either do:
output = input_tag[0]['value']
or use .find() method which returns only one (first) found element:
input_tag = soup.find(attrs={"name": "stainfo"})
output = input_tag['value']
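The difference between the two can be sketched on a minimal inline snippet (the two-input HTML here is a stand-in for the page in the question):

```python
from bs4 import BeautifulSoup

html = '<input name="stainfo" value="abc"/><input name="stainfo" value="def"/>'
soup = BeautifulSoup(html, "html.parser")

# find_all always returns a list, even for a single match
tags = soup.find_all(attrs={"name": "stainfo"})
values = [t["value"] for t in tags]
print(values)          # ['abc', 'def']

# find returns the first matching tag (or None if nothing matches)
first = soup.find(attrs={"name": "stainfo"})
print(first["value"])  # abc
```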

In Python 3.x, simply use get(attr_name) on the tag objects that you get using find_all:
from bs4 import BeautifulSoup

with open('conf//test1.xml', 'r') as xmlFile:
    xmlData = xmlFile.read()

xmlSoup = BeautifulSoup(xmlData, 'html.parser')
repElemList = xmlSoup.find_all('repeatingelement')
for repElem in repElemList:
    print("Processing repElem...")
    repElemID = repElem.get('id')
    repElemName = repElem.get('name')
    print("Attribute id = %s" % repElemID)
    print("Attribute name = %s" % repElemName)
against XML file conf//test1.xml that looks like:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<root>
<singleElement>
<subElementX>XYZ</subElementX>
</singleElement>
<repeatingElement id="11" name="Joe"/>
<repeatingElement id="12" name="Mary"/>
</root>
prints:
Processing repElem...
Attribute id = 11
Attribute name = Joe
Processing repElem...
Attribute id = 12
Attribute name = Mary
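The practical difference between get() and the subscript lookup, sketched on the same repeatingElement markup as above (note that html.parser lowercases tag names):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<repeatingElement id="11" name="Joe"/>', "html.parser")
tag = soup.find("repeatingelement")

print(tag["id"])         # 11 -- subscripting a present attribute works
print(tag.get("id"))     # 11 -- get() does the same
print(tag.get("color"))  # None -- for a missing attribute get() returns None,
                         # whereas tag["color"] would raise KeyError
```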

For me, given:
<input id="color" value="Blue"/>
this can be fetched with the snippet below:
import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.abcd.com")
soup = BeautifulSoup(page.content, 'html.parser')
colorName = soup.find(id='color')
print(colorName['value'])

If you want to retrieve multiple values of attributes from the source above, you can use findAll and a list comprehension to get everything you need:
import urllib
f = urllib.urlopen("http://58.68.130.147")
s = f.read()
f.close()
from BeautifulSoup import BeautifulStoneSoup
soup = BeautifulStoneSoup(s)
inputTags = soup.findAll(attrs={"name" : "stainfo"})
### You may be able to do findAll("input", attrs={"name" : "stainfo"})
output = [x["value"] for x in inputTags]
print output
### This will print a list of the values.

I would actually suggest a time-saving way to go about this, assuming you know what kind of tags have those attributes.
Suppose a tag xyz has an attribute named "staininfo":
full_tag = soup.findAll("xyz")
Note that full_tag is a list:
for each_tag in full_tag:
    staininfo_attrb_value = each_tag["staininfo"]
    print staininfo_attrb_value
This way you can get all the staininfo attribute values for all the xyz tags.

You can also use this:
import requests
from bs4 import BeautifulSoup
import csv
url = "http://58.68.130.147/"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "html.parser")
get_details = soup.find_all("input", attrs={"name":"stainfo"})
for val in get_details:
    get_val = val["value"]
    print(get_val)

You could try to use the new powerful package called requests_html:
from requests_html import HTMLSession
session = HTMLSession()
r = session.get("https://www.bbc.co.uk/news/technology-54448223")
date = r.html.find('time', first = True) # finding a "tag" called "time"
print(date) # you will have: <Element 'time' datetime='2020-10-07T11:41:22.000Z'>
# To get the text inside the "datetime" attribute use:
print(date.attrs['datetime']) # you will get '2020-10-07T11:41:22.000Z'

I am using this with Beautifulsoup 4.8.1 to get the value of all class attributes of certain elements:
from bs4 import BeautifulSoup
html = "<td class='val1'/><td col='1'/><td class='val2' />"
bsoup = BeautifulSoup(html, 'html.parser')
for td in bsoup.find_all('td'):
    if td.has_attr('class'):
        print(td['class'][0])
It's important to note that the attribute lookup returns a list even when the attribute has only a single value.
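With a multi-valued class the list form becomes visible; this sketch uses made-up class names, not markup from the original page:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<td class="val1 highlighted"/>', "html.parser")
td = soup.find("td")

# "class" is a multi-valued attribute in HTML, so BeautifulSoup
# always returns a list for it, never a plain string.
print(td["class"])     # ['val1', 'highlighted']
print(td["class"][0])  # val1
```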

Here is an example of how to extract the href attributes of all a tags:
import requests as rq
from bs4 import BeautifulSoup as bs
url = "http://www.cde.ca.gov/ds/sp/ai/"
page = rq.get(url)
html = bs(page.text, 'lxml')
hrefs = html.find_all("a")
all_hrefs = []
for href in hrefs:
    # print(href.get("href"))
    links = href.get("href")
    all_hrefs.append(links)
print(all_hrefs)
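A variant of the same idea: a CSS attribute selector can skip anchors that carry no href at all, so no None values end up in the list (the inline HTML below is an invented stand-in for the page):

```python
from bs4 import BeautifulSoup

html = '<a href="/ds/sp/ai/">data</a><a name="anchor-only"></a><a href="/other">other</a>'
soup = BeautifulSoup(html, "html.parser")

# a[href] matches only <a> tags that actually have an href attribute
all_hrefs = [a["href"] for a in soup.select("a[href]")]
print(all_hrefs)  # ['/ds/sp/ai/', '/other']
```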

You can try gazpacho:
Install it using pip install gazpacho
Get the HTML and make the Soup using:
from gazpacho import get, Soup
soup = Soup(get("http://ip.add.ress.here/")) # get directly returns the html
inputs = soup.find('input', attrs={'name': 'stainfo'}) # Find all the input tags
if inputs:
    if type(inputs) is list:
        for input_tag in inputs:
            print(input_tag.attrs.get('value'))
    else:
        print(inputs.attrs.get('value'))
else:
    print('No <input> tag found with the attribute name="stainfo"')

Related

Python web-scraping output

I want to make a script that prints the links to results in bing search to the console. The problem is that when I run the script there is no output. I believe the website thinks I am a bot?
from bs4 import BeautifulSoup
import requests
search = input("search for:")
params = {"q": "search"}
r = requests.get("http://www.bing.com/search", params=params)
soup = BeautifulSoup(r.text, "html.parser")
results = soup.find("ol", {"id": "b_results"})
links = results.find_all("Li", {"class": "b_algo"})
for item in links:
    item_text = item.find("a").text
    item_href = item.find("a").attrs["href"]
    if item_text and item_href:
        print(item_text)
        print(item_href)
You need to use the search variable instead of "search". You also have a typo in your script: "Li" should be lower-case "li".
Change these lines:
params = {"q": "search"}
.......
links = results.find_all("Li", {"class": "b_algo"})
To this:
params = {"q": search}
........
links = results.find_all("li", {"class": "b_algo"})
Note that some queries don't return anything. "crossword" has results, but "peanut" does not. The result page structure may be different based on the query.
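A defensive sketch of that caveat: since soup.find returns None when nothing matches, guard the container before calling find_all on it (the inline HTML here stands in for an empty result page):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div>no results markup here</div>", "html.parser")

# find returns None when the results container is absent,
# so check before chaining find_all on it.
results = soup.find("ol", {"id": "b_results"})
if results is None:
    links = []
else:
    links = results.find_all("li", {"class": "b_algo"})
print(len(links))  # 0
```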
There are 2 issues in this code:
search is a variable name, so it should not be wrapped in quotes. Change it to:
params = {"q": search}
When you put the variable name inside quotes while building the link, it becomes a static link. For a dynamic link you should do it as below:
r = requests.get("http://www.bing.com/"+search, params=params)
After making these 2 changes, if you still do not get any output , check if you are using correct tag in results variable.

Converting variables into lists and removing duplicates

I scraped a website using the below code.
The website is structured in a certain way that requires using 4 different classes to scrape all the data which causes some data to be duplicated.
For converting my variables into lists, I tried using the split(' ') method, but it only created a list for each scraped string, with \n at the beginning.
I also tried to create the variables as empty lists, api_name = [] for instance, but it did not work.
For removing duplicates, I thought of using set(), but I think it only works on lists.
I want to remove all the duplicated data from my variables before I write them to the CSV file. Do I have to convert them into lists first, or is there a way to remove the duplicates directly from the variables?
Any assistance or even feedback for the code would be appreciated.
Thanks.
import requests
from bs4 import BeautifulSoup
import csv
url = "https://www.programmableweb.com/apis/directory"
api_no = 0
urlnumber = 0
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, "html.parser")
csv_file = open('api_scraper.csv', 'w')
csv_writer = csv.writer(csv_file)
csv_writer.writerow(['api_no', 'API Name', 'Description','api_url', 'Category', 'Submitted'])
#This is the place where I parse and combine all the classes, which causes the duplicate data
directories1 = soup.find_all('tr', {'class': 'odd'})
directories2 = soup.find_all('tr', {'class': 'even'})
directories3 = soup.find_all('tr', {'class': 'odd views-row-first'})
directories4 = soup.find_all('tr', {'class': 'odd views-row-last'})
directories = directories1 + directories2 + directories3 + directories4
while urlnumber <= 765:
    for directory in directories:
        api_NameTag = directory.find('td', {'class':'views-field views-field-title col-md-3'})
        api_name = api_NameTag.text if api_NameTag else "N/A"
        description_nametag = directory.find('td', {'class': 'col-md-8'})
        description = description_nametag.text if description_nametag else 'N/A'
        api_url = 'https://www.programmableweb.com' + api_NameTag.a.get('href')
        category_nametage = directory.find('td',{'class': 'views-field views-field-field-article-primary-category'})
        category = category_nametage.text if category_nametage else 'N/A'
        submitted_nametag = directory.find('td', {'class':'views-field views-field-created'})
        submitted = submitted_nametag.text if submitted_nametag else 'N/A'
        #These are the variables I want to remove the duplicates from
        csv_writer.writerow([api_no,api_name,description,api_url,category,submitted])
        api_no += 1
    urlnumber += 1
    url = "https://www.programmableweb.com/apis/directory?page=" + str(urlnumber)
csv_file.close()
If it weren't for the API links, I would have said just use pandas read_html and take index 2. As you want the URLs as well, I suggest you change your selectors: limit the search to the table to avoid duplicates, and choose the class name that identifies each column.
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.programmableweb.com/apis/directory')
soup = bs(r.content, 'lxml')
api_names, api_links = zip(*[(item.text, 'https://www.programmableweb.com' + item['href']) for item in soup.select('.table .views-field-title a')])
descriptions = [item.text for item in soup.select('td.views-field-search-api-excerpt')]
categories = [item.text for item in soup.select('td.views-field-field-article-primary-category a')]
submitted = [item.text for item in soup.select('td.views-field-created')]
df = pd.DataFrame(list(zip(api_names, api_links, descriptions, categories, submitted)), columns = ['API name','API Link', 'Description', 'Category', 'Submitted'])
print(df)
Though you could just do
pd.read_html(url)[2]
and then add in the extra column for api_links from bs4 using selectors shown above.

Trying to scrape a specific part of html with Python-3.7, but it returns "None"

I am a beginner writing some simple Python code to scrape data from a web page. I have located the exact part of the html that I want to scrape, but it keeps returning "None." It works for other parts of the web page, but not this one specific part
I am using BeautifulSoup to parse the html, and since I can scrape some of the code, I am assuming I will not need to use Selenium. But I still cannot find how to scrape one specific part.
Here is the Python code I have written:
import requests
from bs4 import BeautifulSoup
url = 'https://www.rent.com/new-york/tuckahoe-apartments?page=2'
response = requests.get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')
apt_listings = html_soup.find_all('div', class_='_3RRl_')
print(type(apt_listings))
print(len(apt_listings))
first_apt = apt_listings[0]
first_apt.a
first_add = first_apt.a.text
print(first_add)
apt_rents = html_soup.find_all('div', class_='_3e12V')
print(type(apt_rents))
print(len(apt_rents))
first_rent = apt_rents[0]
print(first_rent)
first_rent = first_rent.find('class', attrs={'data-tid' : 'price'})
print(first_rent)
Here is the output from CMD:
<class 'bs4.element.ResultSet'>
30
address not disclosed
<class 'bs4.element.ResultSet'>
30
<div class="_3e12V" data-tid="price">$2,350</div>
None
The "address not disclosed" is correct and was scraped successfully. I want to scrape the $2,350 but it keeps returning "None." I think I am close to getting it right but I just can't seem to get the $2,350. Any help is greatly appreciated.
You need to use the .text property of the BeautifulSoup element instead of .find(), like this:
first_rent = first_rent.text
As simple as that.
You can extract all the listings from a script tag and parse them as JSON. The regex looks for the script tag whose content starts with window.__APPLICATION_CONTEXT__ =.
The string after that is extracted via the capture group in the regex, (.*). That JavaScript object can be parsed as JSON if the string is loaded with json.loads.
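That regex-plus-json step in miniature, on a made-up script body (the key path mirrors the answer's code, but the payload below is invented, not the real page data):

```python
import json
import re

# Hypothetical inline <script> contents, standing in for the page's real one.
script_text = ('window.__APPLICATION_CONTEXT__ = '
               '{"store": {"listings": {"listings": [{"address": "1 Main St"}]}}}')

# Capture everything after the assignment, then load it as JSON.
r = re.compile(r'window\.__APPLICATION_CONTEXT__ = (.*)')
payload = r.findall(script_text)[0]
items = json.loads(payload)['store']['listings']['listings']
print(items[0]['address'])  # 1 Main St
```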
import requests
import json
import re
import pandas as pd
from bs4 import BeautifulSoup as bs

base_url = 'https://www.rent.com/'
res = requests.get('https://www.rent.com/new-york/tuckahoe-apartments?page=2')
soup = bs(res.content, 'lxml')
r = re.compile(r'window.__APPLICATION_CONTEXT__ = (.*)')
data = soup.find('script', text=r).text
script = r.findall(data)[0]
items = json.loads(script)['store']['listings']['listings']
results = []
for item in items:
    address = item['address']
    area = ', '.join([item['city'], item['state'], item['zipCode']])
    low_price = item['aggregates']['prices']['low']
    high_price = item['aggregates']['prices']['high']
    listingId = item['listingId']
    url = base_url + item['listingSeoPath']
    # all_info = item
    record = {'address' : address,
              'area' : area,
              'low_price' : low_price,
              'high_price' : high_price,
              'listingId' : listingId,
              'url' : url}
    results.append(record)
df = pd.DataFrame(results, columns = ['address', 'area', 'low_price', 'high_price', 'listingId', 'url'])
print(df)
Short version with class:
import requests
from bs4 import BeautifulSoup
url = 'https://www.rent.com/new-york/tuckahoe-apartments?page=2'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.select_one('._3e12V').text)
All prices:
import requests
from bs4 import BeautifulSoup
url = 'https://www.rent.com/new-york/tuckahoe-apartments?page=2'
response = requests.get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')
print([item.text for item in html_soup.select('._3e12V')])

Scrape table from webpage when in <div> format - using Beautiful Soup

So I'm aiming to scrape 2 tables (in different formats) from a website - https://info.fsc.org/details.php?id=a0240000005sQjGAAU&type=certificate after using the search bar to iterate this over a list of license codes. I haven't included the loop fully yet but I added it at the top for completeness.
My issue is that the two tables I want, Product Data and Certificate Data, are in 2 different formats, so I have to scrape them separately. As the Product Data is in the normal "tr" format on the webpage, this bit is easy and I've managed to extract a CSV file of it. The harder bit is extracting the Certificate Data, as it is in "div" form.
I've managed to print the Certificate Data as a list of text, using the class function, however I need to have it in a tabular form saved in a CSV file. As you can see, I've tried several unsuccessful ways of converting it to a CSV but If you have any suggestions, it would be much appreciated, thank you!! Also any other general tips to improve my code would be great too, as I am new to web-scraping.
#namelist = open('example.csv', newline='', delimiter = 'example')
#for name in namelist:
#include all of the below
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import requests
import pandas as pd
from bs4 import BeautifulSoup

driver = webdriver.Chrome(executable_path="/Users/jamesozden/Downloads/chromedriver")
url = "https://info.fsc.org/certificate.php"
driver.get(url)
search_bar = driver.find_element_by_xpath('//*[@id="code"]')
search_bar.send_keys("FSC-C001777")
search_bar.send_keys(Keys.RETURN)
new_url = driver.current_url
r = requests.get(new_url)
soup = BeautifulSoup(r.content,'lxml')
table = soup.find_all('table')[0]
df, = pd.read_html(str(table))
certificate = soup.find(class_= 'certificatecl').text
##certificate1 = pd.read_html(str(certificate))
driver.quit()
df.to_csv("Product_Data.csv", index=False)
##certificate1.to_csv("Certificate_Data.csv", index=False)
#print(df[0].to_json(orient='records'))
print(certificate)
Output:
Status
Valid
First Issue Date
2009-04-01
Last Issue Date
2018-02-16
Expiry Date
2019-04-01
Standard
FSC-STD-40-004 V3-0
What I want but over hundreds/thousands of license codes (I just manually created this one sample in Excel):
Desired output
EDIT
So whilst this is now working for Certificate Data, I also want to scrape the Product Data and output that into another .csv file. However currently it is only printing 5 copies of the product data for the final license code which is not what I want.
New Code:
df = pd.read_csv("MS_License_Codes.csv")
codes = df["License Code"]

def get_data_by_code(code):
    data = [
        ('code', code),
        ('submit', 'Search'),
    ]
    response = requests.post('https://info.fsc.org/certificate.php', data=data)
    soup = BeautifulSoup(response.content, 'lxml')
    status = soup.find_all("label", string="Status")[0].find_next_sibling('div').text
    first_issue_date = soup.find_all("label", string="First Issue Date")[0].find_next_sibling('div').text
    last_issue_date = soup.find_all("label", string="Last Issue Date")[0].find_next_sibling('div').text
    expiry_date = soup.find_all("label", string="Expiry Date")[0].find_next_sibling('div').text
    standard = soup.find_all("label", string="Standard")[0].find_next_sibling('div').text
    return [code, status, first_issue_date, last_issue_date, expiry_date, standard]

# Just insert here output filename and codes to parse...
OUTPUT_FILE_NAME = 'Certificate_Data.csv'
#codes = ['C001777', 'C001777', 'C001777', 'C001777']
df3 = pd.DataFrame()
with open(OUTPUT_FILE_NAME, 'w') as f:
    writer = csv.writer(f)
    for code in codes:
        print('Getting code# {}'.format(code))
        writer.writerow((get_data_by_code(code)))
        table = soup.find_all('table')[0]
        df1, = pd.read_html(str(table))
        df3 = df3.append(df1)
df3.to_csv('Product_Data.csv', index = False, encoding='utf-8')
Here's all you need.
No chromedriver. No pandas. Forget about them in the context of scraping.
import requests
import csv
from bs4 import BeautifulSoup

# This is all you need for your task. Really.
# No chromedriver. Don't use it for scraping. EVER.
# No pandas. Don't use it for writing csv. It's not what pandas was made for.

# Function to parse a single data page based on a single input code.
def get_data_by_code(code):
    # Parameters to build the POST request.
    # "type" and "submit" params are static. "code" is your desired code to scrape.
    data = [
        ('type', 'certificate'),
        ('code', code),
        ('submit', 'Search'),
    ]
    # POST request to obtain the page data.
    response = requests.post('https://info.fsc.org/certificate.php', data=data)
    # "soup" object to parse the html data.
    soup = BeautifulSoup(response.content, 'lxml')
    # "status" variable: the text of the DIV following the first LABEL tag with text="Status".
    status = soup.find_all("label", string="Status")[0].find_next_sibling('div').text
    # Same for the issue dates... etc.
    first_issue_date = soup.find_all("label", string="First Issue Date")[0].find_next_sibling('div').text
    last_issue_date = soup.find_all("label", string="Last Issue Date")[0].find_next_sibling('div').text
    expiry_date = soup.find_all("label", string="Expiry Date")[0].find_next_sibling('div').text
    standard = soup.find_all("label", string="Standard")[0].find_next_sibling('div').text
    # Return the found data as a list of values.
    return [response.url, status, first_issue_date, last_issue_date, expiry_date, standard]

# Just insert here the output filename and codes to parse...
OUTPUT_FILE_NAME = 'output.csv'
codes = ['C001777', 'C001777', 'C001777', 'C001777']
with open(OUTPUT_FILE_NAME, 'w') as f:
    writer = csv.writer(f)
    for code in codes:
        print('Getting code# {}'.format(code))
        # Write the list of values to the file as a single row.
        writer.writerow((get_data_by_code(code)))
Everything is really straightforward here. I'd suggest you spend some time in the Chrome dev tools "Network" tab to get a better understanding of request forging, which is a must for scraping tasks.
In general, you don't need to run Chrome to click the "search" button; you need to forge the request generated by that click. The same goes for any form and AJAX.
Well... you should sharpen your skills (:
df3 = pd.DataFrame()
with open(OUTPUT_FILE_NAME, 'w') as f:
    writer = csv.writer(f)
    for code in codes:
        print('Getting code# {}'.format(code))
        writer.writerow((get_data_by_code(code)))
        ### HERE'S THE PROBLEM:
        # The "soup" variable is declared inside the "get_data_by_code" function,
        # so you can't use it in this outer context.
        table = soup.find_all('table')[0]  # <--- you should move this line into
        # the definition of "get_data_by_code" and return its value somehow...
        df1, = pd.read_html(str(table))
        df3 = df3.append(df1)
df3.to_csv('Product_Data.csv', index = False, encoding='utf-8')
For example, you can return a dictionary of values from the "get_data_by_code" function:
def get_data_by_code(code):
    ...
    table = soup.find_all('table')[0]
    return dict(row=row, table=table)

Issue in scraping data from a html page using beautiful soup

I am scraping some data from a website and I am able to do so using the below referred code:
import csv
import urllib2
import sys
import time
from bs4 import BeautifulSoup
from itertools import islice
page = urllib2.urlopen('http://shop.o2.co.uk/mobile_phones/Pay_Monthly/smartphone/all_brands').read()
soup = BeautifulSoup(page)
soup.prettify()
with open('O2_2012-12-21.csv', 'wb') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=',')
    spamwriter.writerow(["Date","Month","Day of Week","OEM","Device Name","Price"])
    oems = soup.findAll('span', {"class": "wwFix_h2"},text=True)
    items = soup.findAll('div',{"class":"title"})
    prices = soup.findAll('span', {"class": "handset"})
    for oem, item, price in zip(oems, items, prices):
        textcontent = u' '.join(islice(item.stripped_strings, 1, 2, 1))
        if textcontent:
            spamwriter.writerow([time.strftime("%Y-%m-%d"),time.strftime("%B"),time.strftime("%A") ,unicode(oem.string).encode('utf8').strip(),textcontent,unicode(price.string).encode('utf8').strip()])
Now, the issue is that 2 of the price values I am scraping have a different HTML structure than the rest of the values. My output CSV shows a "None" value for those because of this. The normal HTML structure for a price on the webpage is
<span class="handset">
FREE to £79.99</span>
For those 2 values structure is
<span class="handset">
<span class="delivery_amber">Up to 7 days delivery</span>
<br>"FREE on all tariffs"</span>
The output I am getting right now displays None for the second HTML structure instead of FREE on all tariffs. Also, the price value FREE on all tariffs is wrapped in double quotes in the second structure, while it is outside any quotes in the first.
Please help me solve this issue; pardon my ignorance, as I am new to programming.
Just detect those 2 items with an additional if statement:
if price.string is None:
    price_text = u' '.join(price.stripped_strings).replace('"', '').encode('utf8')
else:
    price_text = unicode(price.string).strip().encode('utf8')
then use price_text for your CSV file. Note that I removed the " quotes with a simple replace call.
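The two cases side by side make the behavior clear. This sketch reuses the markup from the question, written in Python 3 syntax (the answer's code is Python 2):

```python
from bs4 import BeautifulSoup

html = ('<span class="handset"><span class="delivery_amber">Up to 7 days delivery</span>'
        '<br/>"FREE on all tariffs"</span>')
price = BeautifulSoup(html, "html.parser").find("span", {"class": "handset"})

# .string is None because the tag has several children,
# so fall back to joining stripped_strings and dropping the quotes.
print(price.string)  # None
price_text = " ".join(price.stripped_strings).replace('"', '')
print(price_text)    # Up to 7 days delivery FREE on all tariffs
```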