Web Scraping 101 - HTML

I'm doing a simple web scrape that prints a company's stock price based on the entered ticker symbol. It works for many companies, such as SBUX, UPS, NKE, AAPL, BRK-A, and BRK-B. However, when I tried MOTR.L, I got the following error: AttributeError: 'NoneType' object has no attribute 'text'. In the code below, I simply printed the HTML from Beautiful Soup, and it is obviously not the same as the HTML on the website. When I print the source variable, I get a 404 response, which I assume means an invalid URL. When I print it for the companies that worked, I get a 200 response. Could the '.' in the ticker symbol be causing the issue?
from bs4 import BeautifulSoup
import requests
Ticker = "MOTR.L"
source = requests.get('https://finance.yahoo.com/quote/' + Ticker).text
soup = BeautifulSoup(source, 'lxml')
# This line raises the AttributeError: find() returns None because the
# price element is not present in the error page Yahoo sends back
price = soup.find("fin-streamer", {"class": "Fw(b) Fz(36px) Mb(-4px) D(ib)"}).text
print(soup.prettify())
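One quick way to isolate the cause (a minimal check; the header value below is just an example browser string) is to request the same URL with and without a browser-style User-Agent and compare the status codes:
import requests
url = 'https://finance.yahoo.com/quote/MOTR.L'
plain = requests.get(url)
with_ua = requests.get(url, headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)"})
print(plain.status_code)    # 404 would mean the bare request is being rejected
print(with_ua.status_code)  # 200 here would mean the '.' in the ticker is not the problem
If the header alone flips the 404 to a 200, the ticker symbol is fine.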

I added a User-Agent to the request headers, and it works!
from bs4 import BeautifulSoup
import requests
Ticker = "MOTR.L"
# A browser-like User-Agent stops Yahoo from returning a 404 error page
source = requests.get('https://finance.yahoo.com/quote/' + Ticker, headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)"}).text
soup = BeautifulSoup(source, 'lxml')
# The data-field attribute is more stable than the auto-generated class names
price = soup.find("fin-streamer", {"data-field": "regularMarketPrice"}).text
print(price)
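For anyone reusing this, a slightly more defensive sketch of the same approach (assuming the same data-field attribute Yahoo currently serves) avoids the AttributeError by checking both the response status and the lookup result before touching .text:
from bs4 import BeautifulSoup
import requests
def get_price(ticker):
    # A browser-like User-Agent so Yahoo does not serve an error page
    resp = requests.get('https://finance.yahoo.com/quote/' + ticker,
                        headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)"})
    resp.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    soup = BeautifulSoup(resp.text, 'lxml')
    tag = soup.find("fin-streamer", {"data-field": "regularMarketPrice"})
    return tag.text if tag else None  # None instead of AttributeError when the tag is missing
print(get_price("MOTR.L"))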

Related

How do I grab all paragraphs from an article instead of just one?

Beginner here.
I've just begun learning Python, and I'm learning to web scrape. I want to grab each paragraph of an article and write them to either a text file or a CSV. Each paragraph has the same tag name, so I figured a for loop would go through each tag of that name, grab the text from each one, and voilà!... Except it only displays the first paragraph 15+ times. I'm assuming it does this because it grabs the first tag, like I told it to, and prints that same tag once for every other tag with the same name. I tried to replace .find with .find_all, but I get an AttributeError. How do I grab all of the paragraphs and not just one?
Article: https://www.huffpost.com/entry/angry-squirrel-attacks-queens_n_5fee30b1c5b6ec8ae0b242d2
from bs4 import BeautifulSoup
import requests
import csv
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
}
url = "https://www.huffpost.com/entry/angry-squirrel-attacks-queens_n_5fee30b1c5b6ec8ae0b242d2"
response = requests.get(url, headers=headers).text
soup = BeautifulSoup(response, 'lxml')
article = soup.find('article')
headline = article.header.h1.text
print(headline)
headline_Sub = article.find('div', class_="headline__subtitle").text
print(headline_Sub)
print('')
for summaries in article.find('div', class_="entry__text js-entry-text yr-entry-text"):
    # Bug: .find() always returns the first match, so the same first
    # paragraph is printed once per iteration of the loop
    p = article.find('div', class_='content-list-component yr-content-list-text text').p.text
    print(p)
The for loop with find_all instead returns this error:
Traceback (most recent call last):
  File "C:\Users\Denze\MyPythonScripts\Webscraping learning\Webscrape article.py", line 27, in <module>
    p = article.find_all('div', class_='content-list-component yr-content-list-text text').p.text
  File "C:\Users\Denze\AppData\Local\Programs\Python\Python39\lib\site-packages\bs4\element.py", line 2173, in __getattr__
    raise AttributeError(
AttributeError: ResultSet object has no attribute 'p'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
I used the select method to get all of the paragraph elements. Take a look at this code:
from bs4 import BeautifulSoup
import requests
import csv
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
}
url = "https://www.huffpost.com/entry/angry-squirrel-attacks-queens_n_5fee30b1c5b6ec8ae0b242d2"
response = requests.get(url, headers=headers).text
soup = BeautifulSoup(response, 'lxml')
h1 = soup.find('h1')
article = soup.select('p')  # select() returns every matching element
print(h1.text + '\n')
for i in article:
    print(i.text)
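Since the original goal was to write the paragraphs to a text file, here is a minimal extension of that answer (a sketch, reusing the soup and h1 objects above; article.txt is just an example filename). Scoping the lookup to the article element skips stray p tags elsewhere on the page:
article_tag = soup.find('article')
paragraphs = article_tag.find_all('p') if article_tag else soup.select('p')
# Write the headline, then one paragraph per line, to a plain text file
with open('article.txt', 'w', encoding='utf-8') as f:
    f.write(h1.text + '\n\n')
    for p in paragraphs:
        f.write(p.text + '\n')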

Web scraping of hotel name and its price?

I am new to Python, and I am trying to scrape the name and price of a particular hotel from Goibibo. But every time it shows the output "None", and I am not able to figure out why.
Code:
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
import requests
from requests import get
driver = webdriver.Chrome("C:\\Users\\hp\\Desktop\\New folder\\chromedriver.exe")
driver.get("https://www.goibibo.com/hotels/fihalhohi-island-resort-hotel-in-maldives-3641257667928818414/?hquery={%22ci%22:%2220200903%22,%22co%22:%2220200904%22,%22r%22:%221-2-0%22,%22ibp%22:%22v15%22}&hmd=57c9d7df10ce2ccd7c8fa6e25f4961a9e2bfd39eec002b71e73c095ccec678b1755edc988f4cd5487a9cfba712fa5eb237dc2d5536cdeb75fecfaec0bfc0461f0c82c1fe7f95feb466a9a50609566a429b8bf31d8b3a4058c68a373f40919411fcb01f758fc7b1ff0846764e224629a9a423cd882cf963a63765c80233c253db9d1aeee5200a0a7bc860be97a52ef3df77f49aa906fbb53d10dd59707f1a01ced53756ceded90cbdd8ddec83bdaf5a7ce162f86cb0ed7e115362182c4b99d853f16c5f4e80622113ceadf4d80191000a9e84ded0531fde54fb8ab281943bc2bb7ad41d60a81ba59478e75ac61f6a58ace01e071429b0837292a94d8cfd4da1a5ef453856d6f7d46c6b1adb4abaa7a2ca8e955cb316afe5e220000346c85759a750fdee0887c402eb3ded5c1d054fb84df56afc7a64bc2b2f6c98222c948e80ff32bd88820398ec7b055f7bf27c60f31ebe7f2d1427302997b2b9da5db3aef2f81bac4c21729e84002fbe5afd065ea4c248aa115c405e&cc=PF&reviewType=gi")
content = driver.page_source
soup = BeautifulSoup(content, "html.parser")
soup.prettify()
soup_name = soup.find("h3", class_="dwebCommonstyles__SectionHeader-sc-112ty3f-5 HotelName__HotelNameText-sc-1bfbuq5-0 hMoKY")
# .txt is not a Tag attribute, so BeautifulSoup looks for a child <txt> tag and returns None; .text was intended
print(soup_name.txt)
Output:
> None
Here is code to get the hotel name. The price depends on which room the user chooses, so there is no single price for the hotel.
import requests
from bs4 import BeautifulSoup
url = "https://www.goibibo.com/hotels/fihalhohi-island-resort-hotel-in-maldives-3641257667928818414/?hquery={%22ci%22:%2220200903%22,%22co%22:%2220200904%22,%22r%22:%221-2-0%22,%22ibp%22:%22v15%22}&hmd=57c9d7df10ce2ccd7c8fa6e25f4961a9e2bfd39eec002b71e73c095ccec678b1755edc988f4cd5487a9cfba712fa5eb237dc2d5536cdeb75fecfaec0bfc0461f0c82c1fe7f95feb466a9a50609566a429b8bf31d8b3a4058c68a373f40919411fcb01f758fc7b1ff0846764e224629a9a423cd882cf963a63765c80233c253db9d1aeee5200a0a7bc860be97a52ef3df77f49aa906fbb53d10dd59707f1a01ced53756ceded90cbdd8ddec83bdaf5a7ce162f86cb0ed7e115362182c4b99d853f16c5f4e80622113ceadf4d80191000a9e84ded0531fde54fb8ab281943bc2bb7ad41d60a81ba59478e75ac61f6a58ace01e071429b0837292a94d8cfd4da1a5ef453856d6f7d46c6b1adb4abaa7a2ca8e955cb316afe5e220000346c85759a750fdee0887c402eb3ded5c1d054fb84df56afc7a64bc2b2f6c98222c948e80ff32bd88820398ec7b055f7bf27c60f31ebe7f2d1427302997b2b9da5db3aef2f81bac4c21729e84002fbe5afd065ea4c248aa115c405e&cc=PF&reviewType=gi"
r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")
htag = soup.find_all('section', attrs={"class": "HotelDetailsMain__HotelDetailsContainer-sc-2p7gdu-0 kuBApH"})
for x in htag:
    print(x.find('h3').text)
Note: there is no need to use the Selenium webdriver if you just want to fetch the static contents of a web page; requests together with BeautifulSoup handles that on its own.
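One more thought: class names like HotelName__HotelNameText-sc-1bfbuq5-0 hMoKY are generated by the site's build tooling and change between deployments, so hard-coding them is fragile. A sketch of a looser match, assuming the stable "HotelName" prefix survives redeploys (and reusing the same url as in the snippet above):
import requests
from bs4 import BeautifulSoup
r = requests.get(url)  # same Goibibo URL as in the snippet above
soup = BeautifulSoup(r.content, "lxml")
# [class*="HotelName"] matches any h3 whose class attribute contains that substring,
# so the auto-generated suffixes (e.g. "-sc-1bfbuq5-0 hMoKY") no longer matter
name_tag = soup.select_one('h3[class*="HotelName"]')
print(name_tag.text if name_tag else "hotel name not found")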

parse a json object in a div while scraping using beautifulsoup python

I am learning scraping, and I need to access a JSON string that I encounter within a div. I am using BeautifulSoup.
The JSON string I get from the div is shown further below; the JSON object is in json_d. I need the value (51.65) of the "lastPrice" key. Please help.
import pip
import requests
import json
from bs4 import BeautifulSoup
print('hi')
page = requests.get('https://www.nseindia.com/live_market/dynaContent/live_watch/get_quote/GetQuote.jsp?symbol=NBCC&illiquid=0&smeFlag=0&itpFlag=0')
soup = BeautifulSoup(page.text, 'html.parser')
json_d = soup.find(id='responseDiv')
print('bye')
import bs4
import json
r = '''
<div id="responseDiv" style="display:none">{"tradedDate":"07DEC2018","data":[{"pricebandupper":"58.35","symbol":"NBCC","applicableMargin":"15.35","bcEndDate":"14-SEP-18","totalSellQuantity":"40,722","adhocMargin":"-","companyName":"NBCC (India) Limited","marketType":"N","exDate":"06-SEP-18","bcStartDate":"10-SEP-18","css_status_desc":"Listed","dayHigh":"53.55","basePrice":"53.05","securityVar":"10.35","pricebandlower":"47.75","sellQuantity5":"-","sellQuantity4":"-","sellQuantity3":"-","cm_adj_high_dt":"08-DEC-17","sellQuantity2":"-","dayLow":"51.55","sellQuantity1":"40,722","quantityTraded":"71,35,742","pChange":"-2.64","totalTradedValue":"3,714.15","deliveryToTradedQuantity":"40.23","totalBuyQuantity":"-","averagePrice":"52.05","indexVar":"-","cm_ffm":"2,424.24","purpose":"ANNUAL GENERAL MEETING\/DIVIDEND RE 0.56 PER SHARE","buyPrice2":"-","secDate":"7DEC2018","buyPrice1":"-","high52":"266.00","previousClose":"53.05","ndEndDate":"-","low52":"50.80","buyPrice4":"-","buyPrice3":"-","recordDate":"-","deliveryQuantity":"28,70,753","buyPrice5":"-","priceBand":"No Band","extremeLossMargin":"5.00","cm_adj_low_dt":"26-OCT-18","varMargin":"10.35","sellPrice1":"51.80","sellPrice2":"-","totalTradedVolume":"71,35,742","sellPrice3":"-","sellPrice4":"-","sellPrice5":"-","change":"-1.40","surv_indicator":"-","ndStartDate":"-","buyQuantity4":"-","isExDateFlag":false,"buyQuantity3":"-","buyQuantity2":"-","buyQuantity1":"-","series":"EQ","faceValue":"1.00","buyQuantity5":"-","closePrice":"51.80","open":"53.15","isinCode":"INE095N01031","lastPrice":"51.65"}],"optLink":"\/marketinfo\/sym_map\/symbolMapping.jsp?symbol=NBCC&instrument=-&date=-&segmentLink=17&symbolCount=2","otherSeries":["EQ"],"futLink":"\/live_market\/dynaContent\/live_watch\/get_quote\/GetQuoteFO.jsp?underlying=NBCC&instrument=FUTSTK&expiry=27DEC2018&type=-&strike=-","lastUpdateTime":"07-DEC-2018 15:59:59"}</div>'''
html = bs4.BeautifulSoup(r, 'html.parser')  # specify a parser explicitly so bs4 does not warn
soup = html.find('div', {'id': 'responseDiv'}).text
data = json.loads(soup)
last_price = data['data'][0]['lastPrice']
EDIT:
json_d = soup.find(id='responseDiv')
Try changing it to
json_d = soup.find('div', {'id': 'responseDiv'})
Then you should be able to do
data = json.loads(json_d.text)
last_price = data['data'][0]['lastPrice']
See if that helps. I'm currently away from my computer until Tuesday, so I'm typing this up on my iPhone and can't test/play with it.
The other thing is that the site might need to be read after it has loaded its data. In that case, I think you'd need to look into the selenium or requests-html packages.
Again, I can't look until Tuesday when I get back home to my laptop.
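Putting the pieces together, a minimal end-to-end sketch (assuming the page still serves the quote JSON inside responseDiv, and adding a browser-like User-Agent in case NSE rejects bare requests):
import json
import requests
from bs4 import BeautifulSoup
url = ('https://www.nseindia.com/live_market/dynaContent/live_watch/'
       'get_quote/GetQuote.jsp?symbol=NBCC&illiquid=0&smeFlag=0&itpFlag=0')
page = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(page.text, 'html.parser')
json_d = soup.find('div', {'id': 'responseDiv'})
if json_d:
    data = json.loads(json_d.text)  # .text gives the string inside the div
    print(data['data'][0]['lastPrice'])
else:
    print('responseDiv not found - the page may be rendered by JavaScript')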

beautiful soup findall not returning results

I am trying to web scrape and get the insurance dollars listed on the page below.
I used the below code, but it is not fetching anything. Can someone help? I am fairly new to Python...
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.kbb.com/ford/escape/2017/s/?vehicleid=415933&intent=buy-new')
html_soup = BeautifulSoup(r.content, 'lxml')
test2 = html_soup.find_all('div', attrs={"class": "col-base-6"})
print(test2)
Not all the data you see on the page is actually in the response to the GET request for this URL. The browser makes a lot of other requests in the background, initiated by JavaScript code.
Specifically, the request for the insurance data is made to this URL:
https://www.kbb.com/vehicles/hub/_costtoown/?vehicleid=415933
Here is working code for what you need:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.kbb.com/vehicles/hub/_costtoown/?vehicleid=415933')
html_soup = BeautifulSoup(r.text, 'html.parser')
Insurance = html_soup.find('div', string="Insurance").find_next().text
print(Insurance)
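One caveat: find() returns None if the "Insurance" label is missing (for example, if the endpoint changes), and calling .find_next() on None raises the same AttributeError seen elsewhere in this thread. A guarded variant of the same lookup, reusing html_soup from above:
label = html_soup.find('div', string="Insurance")
if label:
    print(label.find_next().text)
else:
    print("Insurance label not found - check the endpoint in the network monitor")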

Entering Value into Search Bar and Downloading Output from Webpage

I'm trying to search a webpage (http://www.phillyhistory.org/historicstreets/). I think the relevant source HTML is this:
<input name="txtStreetName" type="text" id="txtStreetName">
You can see the rest of the source HTML at the website. I want to go into that text box, enter a street name, and download the output (i.e., enter 'Jefferson' in the search box of the page and see the historic street names containing Jefferson). I have tried using requests.post, and I tried appending ?get=Jefferson to the URL to test if that works, with no luck. Anyone have any ideas how to get this page? Thanks,
Cameron
Code I have tried so far (some imports are unused because I plan to parse the results later):
import requests
from bs4 import BeautifulSoup
import csv
from string import ascii_lowercase
import codecs
import os.path
import time
arrayofstreets = []
arrayofstreets = ['Jefferson']
for each in arrayofstreets:
    url = 'http://www.phillyhistory.org/historicstreets/default.aspx'
    payload = {'txtStreetName': each}
    r = requests.post(url, data=payload).content
    outfile = "raw/" + each + ".html"
    with open(outfile, "wb") as code:  # .content is bytes, so open in binary mode
        code.write(r)
    time.sleep(2)
This did not work and only gave me the default webpage (i.e., Jefferson was not entered in the search bar and no results were retrieved).
I'm guessing your reference to 'requests.post' relates to the requests module for Python.
As you have not specified what you want to scrape from the search results, I will simply give you a snippet to get the HTML for a given search query:
import requests
query = 'Jefferson'
url = 'http://www.phillyhistory.org/historicstreets/default.aspx'
post_data = {'txtStreetName': query}
html_result = requests.post(url, data=post_data).content
print(html_result)
If you need to further process the html file to extract some data, I suggest you use the Beautiful Soup module to do so.
UPDATED VERSION:
#!/usr/bin/python
import requests
from bs4 import BeautifulSoup
import csv
from string import ascii_lowercase
import codecs
import os.path
import time

def get_post_data(html_soup, query):
    # The ASP.NET form rejects posts that lack these hidden state fields
    view_state = html_soup.find('input', {'name': '__VIEWSTATE'})['value']
    event_validation = html_soup.find('input', {'name': '__EVENTVALIDATION'})['value']
    btn_search = 'Find'
    return {'__VIEWSTATE': view_state,
            '__EVENTVALIDATION': event_validation,
            'Textbox1': '',
            'txtStreetName': query,
            'btnSearch': btn_search
            }

arrayofstreets = ['Jefferson']
url = 'http://www.phillyhistory.org/historicstreets/default.aspx'
html = requests.get(url).content
for each in arrayofstreets:
    payload = get_post_data(BeautifulSoup(html, 'lxml'), each)
    r = requests.post(url, data=payload).content
    outfile = "raw/" + each + ".html"
    with open(outfile, "wb") as code:  # .content is bytes, so open in binary mode
        code.write(r)
    time.sleep(2)
The problem in my/your first version was that we weren't posting all the required parameters. To find out what you need to send, open the network monitor in your browser (Ctrl+Shift+Q in Firefox) and perform the search as you normally would. If you select the POST request in the network log, on the right you should see a 'parameters' tab showing the POST parameters your browser sent.
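As a further safeguard, rather than hard-coding __VIEWSTATE and __EVENTVALIDATION, you can collect every hidden input from the page automatically, so the code keeps working if the site adds fields such as __VIEWSTATEGENERATOR. A sketch of that variant of get_post_data, under the same assumptions as the updated version above:
def get_post_data(html_soup, query):
    # Copy every hidden input (the ASP.NET postback state) into the payload,
    # so new hidden fields are picked up without code changes
    payload = {tag['name']: tag.get('value', '')
               for tag in html_soup.find_all('input', {'type': 'hidden'})
               if tag.get('name')}
    payload['txtStreetName'] = query
    payload['btnSearch'] = 'Find'
    return payload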