Beautiful Soup can't extract links


I am trying to extract the links of this webpage: https://search.cisco.com/search?query=iot
Using this code I am not getting anything returned:
# Get Html Data from webpage
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html5lib')
# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    print(tag.get('href'))
I have tried the find_all() method but had the same problem.

It seems the page is rendered with JavaScript. You can use Selenium together with Beautiful Soup to fetch the links.
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
driver.get("https://search.cisco.com/search?query=iot&locale=enUS")
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()
for a in soup.find_all('a', href=True):
    print(a['href'])
Output:
https://onesearch.cloudapps.cisco.com/searchpage?queryFilter=iot
/login?query=iot&locale=enUS
/login?query=iot&locale=enUS
https://secure.opinionlab.com/ccc01/o.asp?id=pGuoWfLm&static=1&custom_var=undefined%7CS%7CenUS%7Ciot%7Cundefined%7CNA
https://www.cisco.com/c/en/us/support/index.html
//www.cisco.com/en/US/support/tsd_most_requested_tools.html
https://apps.cisco.com/WOC/WOConfigUI/pages/configset/configset.jsp?flow=nextgen&createNewConfigSet=Y
http://www.cisco-servicefinder.com/ServiceFinder.aspx
http://www.cisco-servicefinder.com/WarrantyFinder.aspx
//www.cisco.com/web/siteassets/sitemap/index.html
https://www.cisco.com/c/dam/en/us/products/collateral/se/internet-of-things/at-a-glance-c45-731471.pdf?dtid=osscdc000283
https://www.cisco.com/c/en/us/solutions/internet-of-things/overview.html?dtid=osscdc000283
https://www.cisco.com/c/en/us/solutions/internet-of-things/iot-kinetic.html?dtid=osscdc000283
https://www.cisco.com/c/m/en_us/solutions/internet-of-things/iot-system.html?dtid=osscdc000283
https://learningnetworkstore.cisco.com/internet-of-things?dtid=osscdc000283
https://connectedfutures.cisco.com/tag/internet-of-things/?dtid=osscdc000283
https://blogs.cisco.com/internet-of-things?dtid=osscdc000283
https://learningnetwork.cisco.com/community/internet_of_things?dtid=osscdc000283
https://learningnetwork.cisco.com/community/learning_center/training-catalog/internet-of-things?dtid=osscdc000283
https://blogs.cisco.com/digital/internet-of-things-at-mwc?dtid=osscdc000283
https://cwr.cisco.com/
https://engage2demand.cisco.com/LP=4213?dtid=osscdc000283
https://engage2demand.cisco.com/LP=15823?dtid=osscdc000283
https://video.cisco.com/detail/video/4121788948001/internet-of-things:-empowering-the-enterprise?dtid=osscdc000283
https://video.cisco.com/detail/video/4121788948001/internet-of-things:-empowering-the-enterprise?dtid=osscdc000283
https://video.cisco.com/detail/video/3740968721001/protecting-the-internet-of-things?dtid=osscdc000283
https://video.cisco.com/detail/video/3740968721001/protecting-the-internet-of-things?dtid=osscdc000283
https://video.cisco.com/detail/video/4657296333001/the-internet-of-things:-the-vision-and-new-directions-ahead?dtid=osscdc000283
https://video.cisco.com/detail/video/4657296333001/the-internet-of-things:-the-vision-and-new-directions-ahead?dtid=osscdc000283
/search/videos?locale=enUS&query=iot
/search/videos?locale=enUS&query=iot
https://secure.opinionlab.com/ccc01/o.asp?id=pGuoWfLm&static=1&custom_var=undefined%7CS%7CenUS%7Ciot%7Cundefined%7CNA
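Several of the hrefs printed above are relative (/login?...) or protocol-relative (//www.cisco.com/...). The following is a minimal sketch of resolving them with the standard library's urljoin, assuming the page URL from the answer is the right base. The helper name absolutize is hypothetical.

```python
from urllib.parse import urljoin

# Base URL taken from the driver.get() call in the answer above.
BASE_URL = "https://search.cisco.com/search?query=iot&locale=enUS"

def absolutize(hrefs, base=BASE_URL):
    """Resolve relative and protocol-relative hrefs against the page URL."""
    return [urljoin(base, href) for href in hrefs]

sample = [
    "/login?query=iot&locale=enUS",
    "//www.cisco.com/web/siteassets/sitemap/index.html",
    "https://cwr.cisco.com/",
]
for url in absolutize(sample):
    print(url)
```

Absolute URLs pass through unchanged, so the helper can be applied to the whole list.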

You don't need Selenium; it is better to use requests. The page gets its results from an API, so you can request them from that API directly:
import requests
body = {"query":"iot","startIndex":0,"count":10,"searchType":"CISCO","tabName":"Cisco","debugScoreExplain":"false","facets":[],"localeStr":"enUS","advSearchFields":{"allwords":"","phrase":"","words":"","noOfWords":"","occurAt":""},"sortType":"RELEVANCY","isAdvanced":"false","dynamicRelevancyId":"","accessLevel":"","breakpoint":"XS","searchProfile":"","ui":"one","searchCat":"","searchMode":"text","callId":"j5JwndwQZZ","requestId":1558540148392,"bizCtxt":"","qnaTopic":[],"appName":"CDCSearhFE","social":"false"}
r = requests.post('https://search.cisco.com/api/search', json = body).json()
for item in r['items']:
    print(item['url'])
Alter parameters to get more results etc.
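Altering the parameters could, for instance, mean requesting successive pages. The sketch below only builds the request bodies and sends nothing. It assumes that startIndex is a zero-based result offset and count a page size, which the field names suggest but the answer does not confirm. BASE_BODY is trimmed to a few of the fields shown above, and page_bodies is a hypothetical helper.

```python
import copy

# A few fields from the request body above; the full body from the answer
# would be used as the base when actually calling the API.
BASE_BODY = {"query": "iot", "startIndex": 0, "count": 10, "localeStr": "enUS"}

def page_bodies(base, pages, page_size=10):
    """Yield one request body per page, advancing startIndex each time.

    Assumes startIndex is a zero-based result offset (not confirmed).
    """
    for page in range(pages):
        body = copy.deepcopy(base)  # leave the base body untouched
        body["startIndex"] = page * page_size
        body["count"] = page_size
        yield body

for body in page_bodies(BASE_BODY, pages=3):
    print(body["startIndex"], body["count"])
```

Each generated body could then be passed as the json argument of the requests.post call shown above.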

Try following the template given in the documentation:
for link in soup.find_all('a'):
    print(link.get('href'))

Related

Extracting Text from Span Tag using BeautifulSoup

I am trying to extract the estimated monthly cost of "$1,773" from this url:
https://www.zillow.com/homedetails/4651-Genoa-St-Denver-CO-80249/13274183_zpid/
Upon inspecting that part of the page, I see this data:
<div class="sc-qWfCM cdZDcW">
<span class="Text-c11n-8-48-0__sc-aiai24-0 dQezUG">Estimated monthly cost</span>
<span class="Text-c11n-8-48-0__sc-aiai24-0 jLucLe">$1,773</span></div>
To extract $1,773, I have tried this:
from bs4 import BeautifulSoup
import requests
url = 'https://www.zillow.com/homedetails/4651-Genoa-St-Denver-CO-80249/13274183_zpid/'
headers = {"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0"}
soup = BeautifulSoup(requests.get(url, headers=headers).content, "html")
print(soup.findAll('span', {'class': 'Text-c11n-8-48-0__sc-aiai24-0 jLucLe'}))
This returns a list of three elements, with no mention of $1,773.
[<span class="Text-c11n-8-48-0__sc-aiai24-0 jLucLe">$463,300</span>,
<span class="Text-c11n-8-48-0__sc-aiai24-0 jLucLe">$1,438</span>,
<span class="Text-c11n-8-48-0__sc-aiai24-0 jLucLe">$2,300<!-- -->/mo</span>]
Can someone please explain how to return $1,773?

I think you have to find the parent element first.
For example:
parent_div = soup.find('div', {'class': 'sc-fzqBZW bzsmsC'})
result = parent_div.findAll('span', {'class': 'Text-c11n-8-48-0__sc-aiai24-0 jLucLe'})

When parsing a web page, it helps to distinguish its components by how they are rendered: some are static and already present in the initial HTML, while others are rendered dynamically. Dynamic content also takes some time to load, because the page fetches it from a backend API of some sort.
I tried parsing your page using Selenium ChromeDriver
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get("https://www.zillow.com/homedetails/4651-Genoa-St-Denver-CO-80249/13274183_zpid/")
time.sleep(3)
time.sleep(3)
el = driver.find_elements(By.XPATH, "//span[@class='Text-c11n-8-48-0__sc-aiai24-0 jLucLe']")
for e in el:
    print(e.text)
time.sleep(3)
driver.quit()
#OUTPUT
$463,300
$1,773
$2,300/mo
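The fixed time.sleep(3) calls above guess how long the page needs. One alternative is to poll until the content appears or a timeout expires. The helper below is a generic standard-library sketch, not part of Selenium (which ships its own WebDriverWait for this purpose). The names wait_until and fake_lookup are hypothetical, and fake_lookup merely simulates content that arrives late.

```python
import time

def wait_until(predicate, timeout=10.0, interval=0.5):
    """Call predicate repeatedly until it returns a truthy value.

    Returns that value, or None if timeout seconds pass first.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)

# Stand-in for a page whose content shows up on the third check.
attempts = {"count": 0}

def fake_lookup():
    attempts["count"] += 1
    return ["$1,773"] if attempts["count"] >= 3 else []

print(wait_until(fake_lookup, timeout=2.0, interval=0.01))
```

With Selenium, the predicate could be a lambda wrapping the find_elements call from the answer.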

BeautifulSoup IndexError: list index out of range

My code below:
import requests
from bs4 import BeautifulSoup
def investopedia():
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0'}
    ticker = 'TSLA'
    url = f'https://www.investopedia.com/markets/quote?tvwidgetsymbol={ticker.lower()}'
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'lxml')
    ip_price = soup.find_all('div', {'class':'tv-symbol-price-quote__value js-symbol-last'})[0].find('span').text
    print(ip_price)
investopedia()
The class I used while inspecting element (in html):
<div class="tv-symbol-price-quote__value js-symbol-last"><span>736.27</span></div>
The 736.27 inside the span is the number I need.
Please help out a web scraping beginner here. Thanks in advance!

You get the index out of range error because your code currently can't find any of the HTML elements you are looking for, so find_all returns an empty list.
The information you are looking for is kept within an iframe. To retrieve the data, we have to switch to that iframe. One way to do that is with Selenium.
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
def investopedia():
    ticker = 'TSLA'
    url = f'https://www.investopedia.com/markets/quote?tvwidgetsymbol={ticker.lower()}'
    driver = webdriver.Chrome()
    driver.get(url)
    time.sleep(5)  # give the page time to load
    # the price widget lives inside an iframe, so switch into it first
    iframe = driver.find_elements(By.CSS_SELECTOR, '.tradingview-widget-container > iframe')[0]
    driver.switch_to.frame(iframe)
    time.sleep(1)
    ip_price = driver.find_elements(By.XPATH, './/div[@class="tv-symbol-price-quote__value js-symbol-last"]')[0].get_attribute('innerText').strip()
    print(ip_price)
    driver.quit()
investopedia()

beautifulsoup - filter text of anchor tag is not working when there is img tag in front of text

I have the following html content :
from bs4 import BeautifulSoup
import re
html = """<a href="http://app_url1" >install app xyz</a>
<a href="http://app_url2" ><img src="/path.jpg"/>install app xyz</a>
<a href="http://app_url3" >install app aaa</a>
install app aaa"""
soup = BeautifulSoup(html, "html.parser")
print(soup.findAll("a", text=re.compile("xyz$")))
I want to filter anchor tags whose text ends with a given regex pattern (such as xyz here). I would like to pass the regex pattern to findAll rather than iterate over all anchor tags myself. But the output contains only one anchor tag:
[<a href="http://app_url1">install app xyz</a>]
The other anchor tag, which has an img in front of the text, is ignored.
Expected output:
<a href="http://app_url1" >install app xyz</a>
<a href="http://app_url2" ><img src="/path.jpg"/>install app xyz</a>

You can use a CSS selector with select instead of iterating over all anchor tags yourself. Note that :contains matches the text anywhere in the tag, not only at the end.
Example:
from bs4 import BeautifulSoup
import re
html = """<a href="http://app_url1" >install app xyz</a>
<a href="http://app_url2" ><img src="/path.jpg"/>install app xyz</a>
<a href="http://app_url3" >install app aaa</a>
install app aaa"""
soup = BeautifulSoup(html, "html.parser")
print(soup.select('a:contains("xyz")'))
Output will be:
[<a href="http://app_url1">install app xyz</a>, <a href="http://app_url2"><img src="/path.jpg"/>install app xyz</a>]
For getting href content from the list of the above output:
anchors = soup.select('a:contains("xyz")')
href = [i['href'] for i in anchors]
print(href)
Output will be:
['http://app_url1', 'http://app_url2']

Filter on text=re.compile("xyz$") alone, without the tag name, then use .parent to get the enclosing tag.
Example:
from bs4 import BeautifulSoup
import re
html = """<a href="http://app_url1" >install app xyz</a>
<a href="http://app_url2" ><img src="/path.jpg"/>install app xyz</a>
<a href="http://app_url3" >install app aaa</a>
install app aaa"""
soup = BeautifulSoup(html, "html.parser")
result = [el.parent for el in soup.findAll(text=re.compile("xyz$"))]
print(result)
Output:
[<a href="http://app_url1">install app xyz</a>, <a href="http://app_url2"><img src="/path.jpg"/>install app xyz</a>]
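A likely explanation for the original behavior: when a tag name and text are given together, the pattern is compared against the tag's .string, and .string is None for a tag with more than one child (here an img plus text). The sketch below shows an alternative that matches on get_text() with a function filter. It assumes the second anchor wraps the img seen in the outputs above and drops the fourth line of the original html for brevity. The name ends_with_xyz is hypothetical.

```python
import re
from bs4 import BeautifulSoup

html = """<a href="http://app_url1" >install app xyz</a>
<a href="http://app_url2" ><img src="/path.jpg"/>install app xyz</a>
<a href="http://app_url3" >install app aaa</a>"""
soup = BeautifulSoup(html, "html.parser")

# .string is None for a tag with several children, so text= cannot match it.
second = soup.find_all("a")[1]
print(second.string)

def ends_with_xyz(tag):
    """Match anchors whose combined text ends with 'xyz'."""
    return tag.name == "a" and re.search(r"xyz$", tag.get_text()) is not None

matches = soup.find_all(ends_with_xyz)
print([a["href"] for a in matches])
```

Unlike the :contains selector, this keeps the end-of-text anchoring from the original regex.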

Sending a plotly graph over flask

Right now I have a code that uses plotly to create a figure
def show_city_frequency(number_of_city = 10):
    plot_1 = go.Histogram(
        x=dataset[dataset.city.isin(city_count[:number_of_city].index.values)]['city'],
        showlegend=False)
    ## Creating the grid for all the above plots
    fig = tls.make_subplots(rows=1, cols=1)
    fig.append_trace(plot_1,1,1)
    fig['layout'].update(showlegend=True, title="Frequency of cities in the dataset ")
    return plot(fig)
I want to incorporate this into a Flask function and send it to an HTML template as a BytesIO object using send_file. I was able to do this for matplotlib using:
img = io.BytesIO()
plt.plot(x,y, label='Fees Paid')
plt.savefig(img, format='png')
img.seek(0)
return send_file(img, mimetype='image/png')
I've read that I can do basically the same thing except using:
img = plotly.io.to_image(fig, format='png')
img.seek(0)
return send_file(img, mimetype='image/png')
but I can't seem to find where to download plotly.io. I've read that plotly offline doesn't work for Ubuntu so I am wondering if that is what my issue is as well. I am also open to new suggestions of how to send this image dynamically to my html code.
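One detail in the snippet above: plotly.io.to_image returns a bytes object rather than a file-like object, so img.seek(0) would fail on it. The following is a minimal sketch of wrapping such bytes in io.BytesIO before handing them to send_file. Here png_bytes is placeholder data standing in for real to_image output, and as_file_object is a hypothetical helper.

```python
import io

# Placeholder bytes standing in for the result of
# plotly.io.to_image(fig, format='png'); not a real image.
png_bytes = b"\x89PNG\r\n\x1a\n" + b"placeholder image data"

def as_file_object(data):
    """Wrap raw bytes in a seekable in-memory buffer."""
    buffer = io.BytesIO(data)
    buffer.seek(0)
    return buffer

img = as_file_object(png_bytes)
# bytes has no seek method, but the BytesIO wrapper does.
print(hasattr(png_bytes, "seek"), hasattr(img, "seek"))
```

In the Flask view, the buffer would then be passed to send_file(img, mimetype='image/png'), as in the matplotlib version.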

Get specific information from html code

The idea is to collect the ids (not names) of all SoundCloud users who posted tracks whose first letter is, e.g., "f" within a given period, in our case the past year.
I used filters on SoundCloud and got the results at this URL: https://soundcloud.com/search/sounds?q=f&filter.created_at=last_year&filter.genre_or_tag=hip-hop%20%26%20rap
I found the first user's id ("wavey-hefner") in the following line of HTML code:
<a class="sound__coverArt" href="/wavey-hefner/foreign" draggable="true">
I want to get every user's id from the whole html.
My code is:
import requests
import re
from bs4 import BeautifulSoup
html = requests.get("https://soundcloud.com/search/sounds?q=f&filter.created_at=last_year&filter.genre_or_tag=hip-hop%20%26%20rap")
soup = BeautifulSoup(html.text, 'html.parser')
for id in soup.findAll("a", {"class" : "sound_coverArt"}):
    print(id.get('href'))
It returns nothing :(

The page is rendered with JavaScript. You can use Selenium to render it. First install Selenium:
pip3 install selenium
Then get a driver, e.g. from https://sites.google.com/a/chromium.org/chromedriver/downloads (if you are on Windows or Mac you can get a headless version of Chrome, Canary, if you like), and put the driver in your path.
from bs4 import BeautifulSoup
from selenium import webdriver
import time
browser = webdriver.Chrome()
url = ('https://soundcloud.com/search/sounds?q=f&filter.created_at=last_year&filter.genre_or_tag=hip-hop%20%26%20rap')
browser.get(url)
time.sleep(5)
# To make it load more scroll to the bottom of the page (repeat if you want to)
browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(5)
html_source = browser.page_source
browser.quit()
soup = BeautifulSoup(html_source, 'html.parser')
for id in soup.findAll("a", {"class" : "sound__coverArt"}):
    print(id.get('href'))
Outputs:
/tee-grizzley/from-the-d-to-the-a-feat-lil-yachty
/empire/fat-joe-remy-ma-all-the-way-up-ft-french-montana
/tee-grizzley/first-day-out
/21savage/feel-it
/pluggedsoundz/famous-dex-geek-1
/rodshootinbirds/fairytale-x-rod-da-god
/chancetherapper/finish-line-drown-feat-t-pain-kirk-franklin-eryn-allen-kane-noname
/alkermith/future-low-life-ft-the-weeknd-evol
/javon-woodbridge/fabolous-slim-thick
/hamburgerhelper/feed-the-streets-prod-dequexatron-1000
/rob-neal-139819089/french-montana-lockjaw-remix-ft-gucci-mane-kodak-black
/pluggedsoundz/famous-dex-energy
/ovosoundradiohits/future-ft-drake-used-to-this
/pluggedsoundz/famous
/a-boogie-wit-da-hoodie/fucking-kissing-feat-chris-brown
/wavey-hefner/foreign
/jalensantoy/foreplay
/yvng_swag/fall-in-luv
/rich-the-kid/intro-prod-by-lab-cook
/empire/fat-joe-remy-ma-money-showers-feat-ty-dolla-ign
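The hrefs above have the form /user-id/track-slug, as in the /wavey-hefner/foreign example from the question. The sketch below reduces them to unique user ids in first-seen order, assuming every href follows that form. The helper name user_ids is hypothetical.

```python
def user_ids(hrefs):
    """Return unique first path segments, preserving first-seen order."""
    seen = {}
    for href in hrefs:
        parts = href.strip("/").split("/")
        if parts and parts[0]:
            seen.setdefault(parts[0], None)  # dict keeps insertion order
    return list(seen)

sample = [
    "/tee-grizzley/from-the-d-to-the-a-feat-lil-yachty",
    "/tee-grizzley/first-day-out",
    "/wavey-hefner/foreign",
]
print(user_ids(sample))
```

In the answer's loop, the hrefs could be collected into a list and passed to this helper instead of being printed.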
