BeautifulSoup IndexError: list index out of range - html

My code below:
import requests
from bs4 import BeautifulSoup
def investopedia():
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0'}
    ticker = 'TSLA'
    url = f'https://www.investopedia.com/markets/quote?tvwidgetsymbol={ticker.lower()}'
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'lxml')
    ip_price = soup.find_all('div', {'class':'tv-symbol-price-quote__value js-symbol-last'})[0].find('span').text
    print(ip_price)
investopedia()
The class I used while inspecting element (in html):
<div class="tv-symbol-price-quote__value js-symbol-last"><span>736.27</span></div>
The 736.27 inside the span is the number I need.
Please help out a web scraping beginner here. Thanks in advance!

You get the IndexError because find_all returns an empty list: the element you are looking for is not in the HTML that requests downloads.
The price is kept inside an iframe. To retrieve it, you have to switch to that iframe first. One way to do that is with Selenium:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
def investopedia():
    ticker = 'TSLA'
    url = f'https://www.investopedia.com/markets/quote?tvwidgetsymbol={ticker.lower()}'
    driver = webdriver.Chrome()
    driver.get(url)
    time.sleep(5)  # it takes time to download the webpage
    # the price widget lives inside an iframe, so switch into it first
    iframe = driver.find_elements(By.CSS_SELECTOR, '.tradingview-widget-container > iframe')[0]
    driver.switch_to.frame(iframe)
    time.sleep(1)
    ip_price = driver.find_elements(By.XPATH, './/div[@class="tv-symbol-price-quote__value js-symbol-last"]')[0].get_attribute('innerText').strip()
    print(ip_price)
    driver.quit()
investopedia()
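To see why the original BeautifulSoup code fails, here is a minimal sketch that uses two made-up HTML snippets. OUTER_PAGE and WIDGET_PAGE are assumptions standing in for the real page and the iframe's document, and example.com is only a placeholder. It also shows one way to avoid the IndexError: check for a missing element instead of indexing into an empty list.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the HTML that requests downloads: the iframe is empty.
OUTER_PAGE = """
<html><body>
  <div class="tradingview-widget-container">
    <iframe src="https://example.com/widget"></iframe>
  </div>
</body></html>
"""

# Hypothetical stand-in for the separate document loaded inside the iframe.
WIDGET_PAGE = """
<html><body>
  <div class="tv-symbol-price-quote__value js-symbol-last"><span>736.27</span></div>
</body></html>
"""

def extract_price(html):
    """Return the price text, or None when the element is not in this document."""
    soup = BeautifulSoup(html, "html.parser")
    span = soup.select_one("div.tv-symbol-price-quote__value.js-symbol-last span")
    return span.get_text(strip=True) if span else None

outer_price = extract_price(OUTER_PAGE)    # element is absent here
widget_price = extract_price(WIDGET_PAGE)  # element is present here
print(outer_price)
print(widget_price)
```

The outer page gives None because the iframe's content is a separate resource that a single requests.get call never downloads.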

Related

Extracting Text from Span Tag using BeautifulSoup

I am trying to extract the estimated monthly cost of "$1,773" from this url:
https://www.zillow.com/homedetails/4651-Genoa-St-Denver-CO-80249/13274183_zpid/
Upon inspecting that part of the page, I see this data:
<div class="sc-qWfCM cdZDcW">
<span class="Text-c11n-8-48-0__sc-aiai24-0 dQezUG">Estimated monthly cost</span>
<span class="Text-c11n-8-48-0__sc-aiai24-0 jLucLe">$1,773</span></div>
To extract $1,773, I have tried this:
from bs4 import BeautifulSoup
import requests
url = 'https://www.zillow.com/homedetails/4651-Genoa-St-Denver-CO-80249/13274183_zpid/'
headers = {"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0"}
soup = BeautifulSoup(requests.get(url, headers=headers).content, "html")
print(soup.findAll('span', {'class': 'Text-c11n-8-48-0__sc-aiai24-0 jLucLe'}))
This returns a list of three elements, with no mention of $1,773.
[<span class="Text-c11n-8-48-0__sc-aiai24-0 jLucLe">$463,300</span>,
<span class="Text-c11n-8-48-0__sc-aiai24-0 jLucLe">$1,438</span>,
<span class="Text-c11n-8-48-0__sc-aiai24-0 jLucLe">$2,300<!-- -->/mo</span>]
Can someone please explain how to return $1,773?
I think you have to find the parent element first, for example:
parent_div = soup.find('div', {'class': 'sc-fzqBZW bzsmsC'})
result = parent_div.findAll('span', {'class': 'Text-c11n-8-48-0__sc-aiai24-0 jLucLe'})
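Since those generated class names are shared by several spans, another option is to anchor on the visible label instead. This is only a sketch against the snippet from the question, and it assumes the row is actually present in the HTML you fetched, which may not be the case if the page fills it in later. The helper name value_after_label is made up for illustration.

```python
from bs4 import BeautifulSoup

# The fragment quoted in the question, used here as static test input.
HTML = """
<div class="sc-qWfCM cdZDcW">
<span class="Text-c11n-8-48-0__sc-aiai24-0 dQezUG">Estimated monthly cost</span>
<span class="Text-c11n-8-48-0__sc-aiai24-0 jLucLe">$1,773</span></div>
"""

def value_after_label(html, label):
    """Find the span whose text equals label and return the next span's text."""
    soup = BeautifulSoup(html, "html.parser")
    label_span = soup.find("span", string=label)
    if label_span is None:
        return None  # the label is not in this document
    value_span = label_span.find_next_sibling("span")
    return value_span.get_text(strip=True) if value_span else None

cost = value_after_label(HTML, "Estimated monthly cost")
print(cost)
```

Matching on the label avoids depending on hashed class names, which tend to change between site builds.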
When parsing a web page, you need to distinguish between content that is rendered statically and content that is rendered dynamically. Dynamic content takes some time to load, because the page fetches it from a backend API of some sort after the initial HTML arrives.
I tried parsing your page using Selenium ChromeDriver:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get("https://www.zillow.com/homedetails/4651-Genoa-St-Denver-CO-80249/13274183_zpid/")
time.sleep(6)  # wait for the dynamically rendered content to load
el = driver.find_elements(By.XPATH, "//span[@class='Text-c11n-8-48-0__sc-aiai24-0 jLucLe']")
for e in el:
    print(e.text)
driver.quit()
#OUTPUT
$463,300
$1,773
$2,300/mo
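Fixed sleeps either waste time or turn out to be too short. A small polling helper is one alternative; the sketch below uses only the standard library, and wait_until is a made-up name. Selenium ships its own WebDriverWait for the same purpose, so treat this as an illustration of the idea rather than a replacement. The probe here is a fake that returns nothing twice and then the three prices, in place of a real driver.find_elements call.

```python
import time

def wait_until(probe, timeout=10.0, interval=0.5):
    """Call probe() repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = probe()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)

# Fake probe standing in for a lookup on a page that is still loading.
attempts = iter([[], [], ["$463,300", "$1,773", "$2,300/mo"]])
prices = wait_until(lambda: next(attempts), timeout=2.0, interval=0.01)
print(prices)
```

With a real driver, the probe would be something like a lambda that calls driver.find_elements with the XPath above, so the script continues as soon as the spans appear.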

requests doesn't get the full body content

I know this question has been asked many times. I tried some of the suggested solutions, and they worked for my other projects.
But this site seems to be different.
I tried this first:
html = requests.get(url = "http://loawa.com")
soup = BeautifulSoup(html.content.decode('utf-8','replace'), 'html.parser')
print(soup)
It returns the head and only a sliver of the body:
<body class="p-0 bg-theme-6" style="overflow-x:hidden"><script>window.location.reload(true);</script></body>
So I tried prerender:
html = requests.get(url = "http://service.prerender.io/http://loawa.com")
soup = BeautifulSoup(html.content.decode('utf-8','replace'), 'html.parser')
print(soup)
It gives me the same result.
So I tried it with headers.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36','Content-Type': 'text/html',}
response = requests.get("http://loawa.com",headers=headers)
html = response.text  # already a decoded str, so no .content.decode() is needed
soup = BeautifulSoup(html, 'html.parser')
print(soup)
The HTML comes out empty. I'm not sure I set the headers correctly.
What else can I try? I don't want to use Selenium for this.
I hope someone can enlighten me. Thanks!

Beautiful Soup can't extract links

I am trying to extract the links of this webpage: https://search.cisco.com/search?query=iot
Using this code I am not getting anything returned:
# Get Html Data from webpage
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html5lib')
# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    print(tag.get('href'))
I have tried the find_all() method but had the same problem.
It seems the page is rendered with JavaScript. You can use Selenium together with Beautiful Soup to fetch the links.
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
driver.get("https://search.cisco.com/search?query=iot&locale=enUS")
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()
for a in soup.find_all('a', href=True):
    print(a['href'])
Output:
https://onesearch.cloudapps.cisco.com/searchpage?queryFilter=iot
/login?query=iot&locale=enUS
/login?query=iot&locale=enUS
https://secure.opinionlab.com/ccc01/o.asp?id=pGuoWfLm&static=1&custom_var=undefined%7CS%7CenUS%7Ciot%7Cundefined%7CNA
https://www.cisco.com/c/en/us/support/index.html
//www.cisco.com/en/US/support/tsd_most_requested_tools.html
https://apps.cisco.com/WOC/WOConfigUI/pages/configset/configset.jsp?flow=nextgen&createNewConfigSet=Y
http://www.cisco-servicefinder.com/ServiceFinder.aspx
http://www.cisco-servicefinder.com/WarrantyFinder.aspx
//www.cisco.com/web/siteassets/sitemap/index.html
https://www.cisco.com/c/dam/en/us/products/collateral/se/internet-of-things/at-a-glance-c45-731471.pdf?dtid=osscdc000283
https://www.cisco.com/c/en/us/solutions/internet-of-things/overview.html?dtid=osscdc000283
https://www.cisco.com/c/en/us/solutions/internet-of-things/iot-kinetic.html?dtid=osscdc000283
https://www.cisco.com/c/m/en_us/solutions/internet-of-things/iot-system.html?dtid=osscdc000283
https://learningnetworkstore.cisco.com/internet-of-things?dtid=osscdc000283
https://connectedfutures.cisco.com/tag/internet-of-things/?dtid=osscdc000283
https://blogs.cisco.com/internet-of-things?dtid=osscdc000283
https://learningnetwork.cisco.com/community/internet_of_things?dtid=osscdc000283
https://learningnetwork.cisco.com/community/learning_center/training-catalog/internet-of-things?dtid=osscdc000283
https://blogs.cisco.com/digital/internet-of-things-at-mwc?dtid=osscdc000283
https://cwr.cisco.com/
https://engage2demand.cisco.com/LP=4213?dtid=osscdc000283
https://engage2demand.cisco.com/LP=15823?dtid=osscdc000283
https://video.cisco.com/detail/video/4121788948001/internet-of-things:-empowering-the-enterprise?dtid=osscdc000283
https://video.cisco.com/detail/video/4121788948001/internet-of-things:-empowering-the-enterprise?dtid=osscdc000283
https://video.cisco.com/detail/video/3740968721001/protecting-the-internet-of-things?dtid=osscdc000283
https://video.cisco.com/detail/video/3740968721001/protecting-the-internet-of-things?dtid=osscdc000283
https://video.cisco.com/detail/video/4657296333001/the-internet-of-things:-the-vision-and-new-directions-ahead?dtid=osscdc000283
https://video.cisco.com/detail/video/4657296333001/the-internet-of-things:-the-vision-and-new-directions-ahead?dtid=osscdc000283
/search/videos?locale=enUS&query=iot
/search/videos?locale=enUS&query=iot
https://secure.opinionlab.com/ccc01/o.asp?id=pGuoWfLm&static=1&custom_var=undefined%7CS%7CenUS%7Ciot%7Cundefined%7CNA
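Note that some of the hrefs in that output are relative (/login?...) or protocol-relative (//www.cisco.com/...), and several are repeated. A small sketch for turning them into absolute, de-duplicated URLs with the standard library follows. The helper name absolutize is made up, and the sample list is just a few lines copied from the output above.

```python
from urllib.parse import urljoin

# The page the links were scraped from; relative hrefs are resolved against it.
PAGE_URL = "https://search.cisco.com/search?query=iot&locale=enUS"

def absolutize(hrefs, base=PAGE_URL):
    """Resolve each href against base and drop repeats, keeping first-seen order."""
    seen = set()
    result = []
    for href in hrefs:
        absolute = urljoin(base, href)
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result

sample = [
    "/login?query=iot&locale=enUS",
    "/login?query=iot&locale=enUS",
    "//www.cisco.com/web/siteassets/sitemap/index.html",
    "https://cwr.cisco.com/",
]
links = absolutize(sample)
for link in links:
    print(link)
```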
You don't need Selenium; requests is enough. The page loads its results from an API, so you can query that API directly:
import requests
body = {"query":"iot","startIndex":0,"count":10,"searchType":"CISCO","tabName":"Cisco","debugScoreExplain":"false","facets":[],"localeStr":"enUS","advSearchFields":{"allwords":"","phrase":"","words":"","noOfWords":"","occurAt":""},"sortType":"RELEVANCY","isAdvanced":"false","dynamicRelevancyId":"","accessLevel":"","breakpoint":"XS","searchProfile":"","ui":"one","searchCat":"","searchMode":"text","callId":"j5JwndwQZZ","requestId":1558540148392,"bizCtxt":"","qnaTopic":[],"appName":"CDCSearhFE","social":"false"}
r = requests.post('https://search.cisco.com/api/search', json = body).json()
for item in r['items']:
    print(item['url'])
Alter the parameters in body to get more results, and so on.
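As a rough sketch of what altering the parameters could look like, the snippet below generates one request body per page. It assumes, going only by the field names, that startIndex is a zero-based offset and count is the page size; I have not verified that against the API. BASE_BODY is trimmed to three keys for readability, and paged_bodies is a made-up helper name.

```python
import copy

# Trimmed stand-in for the full body above; the real one has many more keys.
BASE_BODY = {"query": "iot", "startIndex": 0, "count": 10}

def paged_bodies(base_body, pages, count=10):
    """Yield one copy of base_body per page, advancing startIndex by count."""
    for page in range(pages):
        body = copy.deepcopy(base_body)  # leave the caller's dict untouched
        body["count"] = count
        body["startIndex"] = page * count  # assumed to be an item offset
        yield body

bodies = list(paged_bodies(BASE_BODY, pages=3, count=10))
print([b["startIndex"] for b in bodies])
```

Each generated body would then be sent with requests.post(..., json=body) exactly as in the snippet above.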
Try following the template given in the documentation:
for link in soup.find_all('a'):
    print(link.get('href'))

Get specific information from html code

The idea is to collect the IDs (not the display names) of all SoundCloud users who posted tracks whose titles start with a given letter, e.g. "f", within a given period, in our case the past year.
I applied the filters on SoundCloud and got the results at the following URL: https://soundcloud.com/search/sounds?q=f&filter.created_at=last_year&filter.genre_or_tag=hip-hop%20%26%20rap
I found the first user's ID ("wavey-hefner") in the following line of HTML:
<a class="sound__coverArt" href="/wavey-hefner/foreign" draggable="true">
I want to get every user's id from the whole html.
My code is:
import requests
import re
from bs4 import BeautifulSoup
html = requests.get("https://soundcloud.com/search/sounds?q=f&filter.created_at=last_year&filter.genre_or_tag=hip-hop%20%26%20rap")
soup = BeautifulSoup(html.text, 'html.parser')
for id in soup.findAll("a", {"class" : "sound_coverArt"}):
    print(id.get('href'))
It returns nothing :(
The page is rendered with JavaScript. You can use Selenium to render it. First install Selenium:
pip3 install selenium
Then get a driver, e.g. from https://sites.google.com/a/chromium.org/chromedriver/downloads, and put it on your PATH. (If you are on Windows or Mac, you can use a headless version of Chrome, Canary, if you like.)
from bs4 import BeautifulSoup
from selenium import webdriver
import time
browser = webdriver.Chrome()
url = ('https://soundcloud.com/search/sounds?q=f&filter.created_at=last_year&filter.genre_or_tag=hip-hop%20%26%20rap')
browser.get(url)
time.sleep(5)
# To make it load more scroll to the bottom of the page (repeat if you want to)
browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(5)
html_source = browser.page_source
browser.quit()
soup = BeautifulSoup(html_source, 'html.parser')
for id in soup.findAll("a", {"class" : "sound__coverArt"}):
    print(id.get('href'))
Outputs:
/tee-grizzley/from-the-d-to-the-a-feat-lil-yachty
/empire/fat-joe-remy-ma-all-the-way-up-ft-french-montana
/tee-grizzley/first-day-out
/21savage/feel-it
/pluggedsoundz/famous-dex-geek-1
/rodshootinbirds/fairytale-x-rod-da-god
/chancetherapper/finish-line-drown-feat-t-pain-kirk-franklin-eryn-allen-kane-noname
/alkermith/future-low-life-ft-the-weeknd-evol
/javon-woodbridge/fabolous-slim-thick
/hamburgerhelper/feed-the-streets-prod-dequexatron-1000
/rob-neal-139819089/french-montana-lockjaw-remix-ft-gucci-mane-kodak-black
/pluggedsoundz/famous-dex-energy
/ovosoundradiohits/future-ft-drake-used-to-this
/pluggedsoundz/famous
/a-boogie-wit-da-hoodie/fucking-kissing-feat-chris-brown
/wavey-hefner/foreign
/jalensantoy/foreplay
/yvng_swag/fall-in-luv
/rich-the-kid/intro-prod-by-lab-cook
/empire/fat-joe-remy-ma-money-showers-feat-ty-dolla-ign
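The question asked for the user IDs rather than the full track paths. Assuming each href has the /user/track shape seen in the output above, the ID is the first path segment. A minimal sketch follows; user_ids is a made-up helper, and the sample is copied from that output.

```python
def user_ids(hrefs):
    """Return the unique first path segments of /user/track hrefs, in order."""
    ids = []
    for href in hrefs:
        if not href:
            continue  # .get('href') can return None for anchors without href
        parts = href.strip("/").split("/")
        if len(parts) == 2 and parts[0] not in ids:
            ids.append(parts[0])
    return ids

sample = [
    "/tee-grizzley/from-the-d-to-the-a-feat-lil-yachty",
    "/empire/fat-joe-remy-ma-all-the-way-up-ft-french-montana",
    "/tee-grizzley/first-day-out",
    "/wavey-hefner/foreign",
]
ids = user_ids(sample)
print(ids)
```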

How to get div with multiple classes BS4

What is the most efficient way to get divs with BeautifulSoup4 if they have multiple classes?
I have an html structure like this:
<div class='class1 class2 class3 class4'>
    <div class='class5 class6 class7'>
        <div class='comment class14 class15'>
            <div class='date class20 showdate'> 1/10/2017</div>
            <p>comment2</p>
        </div>
        <div class='comment class25 class9'>
            <div class='date class20 showdate'> 7/10/2017</div>
            <p>comment1</p>
        </div>
    </div>
</div>
I want to get the divs with the comment class. Usually nested classes are no problem, but for some reason the command:
html = BeautifulSoup(content, "html.parser")
comments = html.find_all("div", {"class":"comment"})
doesn't work. It returns an empty list.
I guess this happens because there are many classes, so it looks for a div whose only class is comment, and no such div exists. How can I find all the comments?
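For what it's worth, Beautiful Soup matches a class if it is any one of the element's classes, so the extra classes are not the problem. Here is a quick check against the structure from the question, used as a static string. If this works and the real page still gives an empty list, the elements are most likely missing from the HTML that was fetched.

```python
from bs4 import BeautifulSoup

# The structure from the question, used as static input.
HTML = """
<div class='class1 class2 class3 class4'>
  <div class='class5 class6 class7'>
    <div class='comment class14 class15'>
      <div class='date class20 showdate'> 1/10/2017</div>
      <p>comment2</p>
    </div>
    <div class='comment class25 class9'>
      <div class='date class20 showdate'> 7/10/2017</div>
      <p>comment1</p>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Matches every div that has "comment" among its classes.
comments = soup.find_all("div", class_="comment")
texts = [c.p.get_text(strip=True) for c in comments]
dates = [c.find("div", class_="date").get_text(strip=True) for c in comments]

# A CSS selector can require several classes at once when that is needed.
both = soup.select("div.comment.class14")

print(texts, dates, len(both))
```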
Apparently, the URL that fetches the comments section is different from the original URL that retrieves the main contents.
This is the original URL you gave:
http://community.sparknotes.com/2017/10/06/find-out-your-colleges-secret-mantra-we-hack-college-life-at-the-100-of-the-best
Behind the scenes, if you record the network log in the Network tab of Chrome's developer tools, you'll see a list of all the URLs the browser requests. Most of them fetch images and scripts. A few relate to other sites such as Facebook or Google (for analytics, etc.). The browser also sends another request to this particular site (sparknotes), which returns the comments section. This is the URL:
http://community.sparknotes.com/commentlist?post_id=1375724&page=1&comment_type=&_=1507467541548
The value for post_id can be found in the web page returned when we request the first URL. It is contained in an input tag which has a hidden attribute.
<input type="hidden" id="postid" name="postid" value="1375724">
You can extract this info from the first web page using a simple soup.find('input', {'id': 'postid'})['value']. Of course, since this identifies the post uniquely, you need not worry about its changing dynamically on each request.
I couldn't find the '1507467541548' value passed to '_' parameter (last parameter of the URL) anywhere in the main page or anywhere in the cookies set by response headers of any of the pages.
However, I went out on a limb and tried to fetch the URL by passing it without the '_' parameter, and it worked.
So, here's the entire script that worked for me:
from bs4 import BeautifulSoup
import requests
req_headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive',
    'Host': 'community.sparknotes.com',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
}
with requests.Session() as s:
    url = 'http://community.sparknotes.com/2017/10/06/find-out-your-colleges-secret-mantra-we-hack-college-life-at-the-100-of-the-best'
    r = s.get(url, headers=req_headers)
    soup = BeautifulSoup(r.content, 'lxml')
    post_id = soup.find('input', {'id': 'postid'})['value']
    # url = 'http://community.sparknotes.com/commentlist?post_id=1375724&page=1&comment_type=&_=1507467541548' # the original URL found in network tab
    url = 'http://community.sparknotes.com/commentlist?post_id={}&page=1&comment_type='.format(post_id) # modified by removing the '_' parameter
    r = s.get(url)
    soup = BeautifulSoup(r.content, 'lxml')
    comments = soup.findAll('div', {'class': 'commentCite'})
    for comment in comments:
        c_name = comment.div.a.text.strip()
        c_date_text = comment.find('div', {'class': 'commentBodyInner'}).text.strip()
        print(c_name, c_date_text)
As you can see, I haven't passed headers to the second request (the second s.get), so I'm not sure they are required at all. You can experiment with omitting them in the first request as well. But make sure you use requests, as I haven't tried this with urllib. Cookies might play a vital role here.
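If you want more than the first page of comments, the comment-list URL can be built from its parts instead of being formatted by hand. This sketch assumes that increasing the page parameter returns later pages, which I have not checked; comment_list_url is a made-up helper name.

```python
from urllib.parse import urlencode

# Endpoint taken from the network log described above.
COMMENT_ENDPOINT = "http://community.sparknotes.com/commentlist"

def comment_list_url(post_id, page=1):
    """Build the comment-list URL without the '_' parameter."""
    query = urlencode({"post_id": post_id, "page": page, "comment_type": ""})
    return f"{COMMENT_ENDPOINT}?{query}"

first_page = comment_list_url("1375724")
second_page = comment_list_url("1375724", page=2)
print(first_page)
print(second_page)
```

Each URL would then be fetched with s.get inside the same session as in the script above.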