I am trying to extract class information from the website https://www.programmableweb.com/category/all/apis. My code works fine for all pages except https://www.programmableweb.com/category/all/apis?page=2092.
from bs4 import BeautifulSoup
import requests
url = 'https://www.programmableweb.com/category/all/apis?page=2092'
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'html.parser')
apis = soup.find_all('tr',{'class':['odd views-row-first', 'odd','even','even views-row-last']})
print(apis)
On page 2092 I get info about only one row, as below:
[<tr class="odd views-row-first views-row-last"><td class="views-field views-field-pw-version-title"> Inkling API<br/></td><td class="views-field views-field-search-api-excerpt views-field-field-api-description hidden-xs visible-md visible-sm col-md-8"> Our REST API allows you to replicate much of the functionality in our hosted marketplace solution to build custom widgets and stock tickers for your Intranet, create custom reports, add trading...</td><td class="views-field views-field-field-article-primary-category"> Financial</td><td class="views-field views-field-pw-version-links"> REST v0.0</td></tr>]
For any other page (like https://www.programmableweb.com/category/all/apis?page=2091), I get info about all the matching rows. The HTML structure seems similar on all pages.
This website is constantly adding new APIs to its database, so there are three scenarios that might have caused this:
1. The selectors you are using are not accurate.
2. The website has some kind of security measure against you sending too many requests.
3. At the time of your scrape, this page truly had only one item on it.
Scenario 3 is the most likely.
from bs4 import BeautifulSoup
import requests
from time import sleep
for page in range(1, 2094):  # starting with 1; the last page will be 2093
    url = f'https://www.programmableweb.com/category/all/apis?page={page}'
    response = requests.get(url)
    data = response.text
    soup = BeautifulSoup(data, 'html.parser')
    apis = soup.select('table[class="views-table cols-4 table"] tbody tr')  # better selector
    print(apis)  # page 2093 currently has 6 items on it
    sleep(5)  # pause 5 seconds between requests
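The select() call above targets every row inside the results table instead of enumerating class names row by row, so it also survives pages where the classes collapse into a single "odd views-row-first views-row-last" row. The selector can be sanity-checked against a static snippet (the table markup here is a trimmed imitation of the site's structure, not its real content):

```python
from bs4 import BeautifulSoup

# Trimmed imitation of the site's results table.
html = """
<table class="views-table cols-4 table">
  <tbody>
    <tr class="odd views-row-first"><td>API one</td></tr>
    <tr class="even"><td>API two</td></tr>
    <tr class="odd views-row-last"><td>API three</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
# Same selector as the answer: every row of the table, whatever its classes.
rows = soup.select('table[class="views-table cols-4 table"] tbody tr')
print(len(rows))
```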
I'm trying to get more familiar with web scraping. I came across this website, https://www.freecodecamp.org/news/how-to-scrape-websites-with-python-and-beautifulsoup-5946935d93fe/, which gives an intro to web scraping with Beautiful Soup. Following the demonstration, I tried to scrape the value and name of the S&P stock index with the code they provided, but that wasn't working. I think some things have changed; for example, the price is no longer under an h1 tag as the author wrote. When I inspect the web page, I can see all the tags used, but I found that some of the HTML isn't being scraped from the Bloomberg website. I printed what the web scraper collects to the console.
The code:
import urllib2
from bs4 import BeautifulSoup
quote_page = "http://www.bloomberg.com/quote/SPX:IND"
page = urllib2.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
print (soup)
name_box = soup.find("h1", attrs={"class": "price"})
name = name_box.text.strip()  # raises "'NoneType' object has no attribute 'text'" here
print(name)
I was having trouble displaying what the code prints here, but basically some of the tags are not there. I'm wondering why this is and how to actually scrape the website. When I inspect the website, I can find the tag I am looking for, which is:
<span class="priceText__1853e8a5">2,912.43</span>
But using the code I have, I can't seem to get this tag.
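That price span is injected by JavaScript after the initial page load, so it never appears in the HTML that urllib2/requests receives; that is also why find() returns None and .text raises AttributeError. A minimal defensive pattern, sketched against a static snippet (the hashed class prefix priceText__ is taken from the question; Bloomberg's real markup changes over time):

```python
from bs4 import BeautifulSoup

# Simulated server response: the static HTML has no price span,
# because the live site injects it with JavaScript after page load.
static_html = "<html><body><h1>S&P 500</h1></body></html>"
soup = BeautifulSoup(static_html, "html.parser")

# Guard against a missing element instead of calling .text on None.
price_span = soup.find("span", class_=lambda c: c and c.startswith("priceText__"))
if price_span is None:
    print("price not in static HTML; it is rendered client-side")
else:
    print(price_span.text.strip())
```

To get the rendered value you would need a browser-based tool such as Selenium, or the JSON endpoint the page itself calls.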
I am trying to pull the full HTML from ratemyprofessors.com however at the bottom of the page, there is a "Load More Ratings" button that allows you to see more comments.
I am using requests.get(url) and beautifulsoup, but that only gives the first 20 comments. Is there a way to have the page load all the comments before it returns?
Here is what I am currently doing that gives the top 20 comments, but not all of them.
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
comments = []
for j in soup.findAll('div', attrs={'class': 'Comments__StyledComments-dzzyvm-0 dEfjGB'}):
    comments.append(j.text)
BeautifulSoup is an HTML parser for static pages, not a renderer for dynamic web apps.
You could achieve what you want with a headless browser via Selenium, by rendering the full page and repeatedly clicking the "Load More Ratings" link until there is nothing left to load.
Example: Clicking on a link via selenium
Since you're already using Requests, another option that might work is Requests-HTML, which also supports dynamic rendering by calling .html.render() on the response object.
Example: https://requests-html.kennethreitz.org/index.html#requests_html.HTML.render
Reference: Clicking link using beautifulsoup in python
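Whichever tool does the clicking, the accumulation logic is the same: fetch a batch, append it, stop when a batch comes back empty. A sketch of that loop with a hypothetical fetch_batch callable standing in for the Selenium click (or a paginated API call), so it can run without a browser:

```python
def collect_all(fetch_batch):
    """Accumulate comment batches until the source is exhausted."""
    comments = []
    page = 0
    while True:
        batch = fetch_batch(page)
        if not batch:          # no "Load More Ratings" left
            break
        comments.extend(batch)
        page += 1
    return comments

# Simulated source: three "pages" of 20, 20 and 5 comments.
pages = [[f"comment {i}" for i in range(20)],
         [f"comment {i}" for i in range(20, 40)],
         [f"comment {i}" for i in range(40, 45)]]

def fake_fetch(page):
    return pages[page] if page < len(pages) else []

all_comments = collect_all(fake_fetch)
print(len(all_comments))  # 45
```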
I am trying to make a bot that can play Cookie Clicker. I have successfully opened the website using the webbrowser module. When I use the developer tools to see the HTML, I can see the information I want to obtain, such as how much money I have, how expensive items are, etc. But when I try to get that information using requests and BeautifulSoup, it instead gets the HTML of a new window. How can I make it so that I get the HTML of the already-opened tab?
import webbrowser
webbrowser.open('https://orteil.dashnet.org/cookieclicker/')

from bs4 import BeautifulSoup
import requests

def scrape():
    html = requests.get('https://orteil.dashnet.org/cookieclicker/')
    print(html)

scrape()
You can try this, but note that it requires Selenium: requests.get always fetches a fresh copy of the page and cannot see a tab that is already open, so here html would be a Selenium WebDriver instance (which both opens and controls the tab), not the requests response:
body_element = html.find_element_by_xpath("//body")
body_content = body_element.get_attribute("innerHTML")
print(body_content)
I am trying to scrape surfline.com using Python 3 with Beautiful Soup and requests, with the code below. I am using Spyder with Python 3.7, and I am fairly new to web scraping.
import requests
from bs4 import BeautifulSoup
url = 'https://www.surfline.com/surf-report/salt-creek/5842041f4e65fad6a770882e'
r = requests.get(url)
html_soup = BeautifulSoup(r.text,'html.parser')
print(html_soup.prettify())
The goal is to scrape the surf height for each day. Using inspect I found the HTML section that contains the wave height (screenshot of the Surfline website & HTML). When I run the code it prints out the HTML of the website, but when I Ctrl+F to look for the section I want to scrape, it is not there. My question is why it is not being printed out and how I can fix it. I am aware that some websites use JavaScript to load data onto the page; is that the case here?
Thank you for any help you can provide.
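Yes, that is very likely the case here: the forecast is loaded with JavaScript, so it is absent from the HTML that requests receives. Many such sites embed their initial data as JSON inside a script tag, which you can extract without a browser. A sketch of that pattern against a made-up snippet (the window.__DATA__ variable and the key names are assumptions for illustration; you would need to check Surfline's actual page source for the real ones):

```python
import json
import re
from bs4 import BeautifulSoup

# Made-up page imitating a site that embeds its data as JSON in a script tag.
html = """
<html><body>
<script>window.__DATA__ = {"forecast": {"waveHeight": {"min": 2, "max": 4}}};</script>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
# Find the script tag that carries the embedded state.
script = soup.find("script", string=re.compile(r"window\.__DATA__"))
# Cut the JSON object out of the JavaScript assignment and parse it.
payload = re.search(r"window\.__DATA__ = (\{.*\});", script.string).group(1)
data = json.loads(payload)
print(data["forecast"]["waveHeight"])  # {'min': 2, 'max': 4}
```

If the data is not embedded this way, the fallback is a rendering tool such as Selenium, or the XHR endpoint visible in the browser's Network tab.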
I was trying to get the best search result position of a few products at the link below:
https://www.purplle.com/search?q=hair%20fall%20shamboo
I used below tools to get the html details from the page
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://www.purplle.com/search?q=hair%20fall%20shamboo")
from bs4 import BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()
Now I am confused about how to get the product names and positions (to find the best rank in the search) from this HTML.
I used the method below to get the details of the products, but the output has a lot of unwanted things too.
details = soup.find('div', attrs={'class': 'pr'})
Any idea how to solve this?
I don't know exactly what you mean by position. However, the script below can fetch the title of each product and its position (allegedly) from that page:
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("https://www.purplle.com/search?q=hair%20fall%20shamboo")
soup = BeautifulSoup(driver.page_source, 'html.parser')

for item in soup.find_all(class_="prd-lstng pr"):
    name = item.find_all(class_="pro-name el2")[0].text
    position = item.find_all(class_="mrl5 tx-std30")[0].text
    print(name, position)

driver.quit()
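The class-based lookups can be checked without launching a browser by running them against a static snippet shaped like the search results (the class names prd-lstng pr, pro-name el2, and mrl5 tx-std30 are taken from the answer above; the live markup on purplle.com may have changed, and the product names here are invented):

```python
from bs4 import BeautifulSoup

# Static snippet imitating the structure the answer's selectors expect.
html = """
<div class="prd-lstng pr">
  <span class="mrl5 tx-std30">1</span>
  <div class="pro-name el2">Anti Hairfall Shampoo</div>
</div>
<div class="prd-lstng pr">
  <span class="mrl5 tx-std30">2</span>
  <div class="pro-name el2">Hair Growth Shampoo</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
results = []
for item in soup.find_all(class_="prd-lstng pr"):
    name = item.find_all(class_="pro-name el2")[0].text
    position = item.find_all(class_="mrl5 tx-std30")[0].text
    results.append((name, position))
print(results)
```

Note that passing a multi-word string to class_ matches the class attribute as an exact whole string, so it breaks if the site reorders or adds classes; selecting on a single stable class is usually more robust.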