Get full HTML for page with dynamic expanded containers with python - html

I am trying to pull the full HTML from ratemyprofessors.com. However, at the bottom of the page there is a "Load More Ratings" button that reveals more comments.
I am using requests.get(url) and BeautifulSoup, but that only returns the first 20 comments. Is there a way to have the page load all the comments before it returns?
Here is what I am currently doing; it gives the top 20 comments, but not all of them.
import requests
from bs4 import BeautifulSoup
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
comments = []
for j in soup.find_all('div', attrs={'class': 'Comments__StyledComments-dzzyvm-0 dEfjGB'}):
    comments.append(j.text)

BeautifulSoup is an HTML parser for static pages rather than a renderer for dynamic web apps.
You could achieve what you want with a headless browser driven by Selenium: render the full page and repeatedly click the "Load More Ratings" button until there is nothing more to load.
Example: Clicking on a link via selenium
Since you're already using Requests, another option that might work is Requests-HTML, which also supports dynamic rendering by calling .html.render() on the response object.
Example: https://requests-html.kennethreitz.org/index.html#requests_html.HTML.render
Reference: Clicking link using beautifulsoup in python
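The Selenium approach described above could look roughly like the sketch below. It is not tested against ratemyprofessors.com: the XPath for the button, the fixed pause, and the helper names load_all_ratings and fetch_full_html are assumptions made for illustration, and a cookie banner or overlay on the real page may need to be dismissed before the click succeeds.

```python
import time


def load_all_ratings(driver, button_xpath, pause=1.0, max_clicks=200):
    """Click the 'load more' button until it no longer appears.

    `driver` is a Selenium WebDriver (or any object with a compatible
    find_elements method). Returns the number of clicks performed.
    """
    clicks = 0
    while clicks < max_clicks:
        # find_elements returns an empty list instead of raising when nothing matches
        buttons = driver.find_elements("xpath", button_xpath)
        if not buttons:
            break
        buttons[0].click()
        clicks += 1
        time.sleep(pause)  # crude wait for the next batch of comments to load
    return clicks


def fetch_full_html(url):
    """Open the page in headless Chrome, expand all comments, return the HTML."""
    # Imported lazily so load_all_ratings can be used without Selenium installed
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Assumed locator: a button whose text contains "Load More Ratings"
        load_all_ratings(driver, "//button[contains(., 'Load More Ratings')]")
        return driver.page_source
    finally:
        driver.quit()
```

The string returned by fetch_full_html(url) can then be passed to BeautifulSoup in place of response.text. The max_clicks cap is only a guard against looping forever if the button never disappears.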

Related

Scraping prices with BeautifulSoup4 in Python3

I am new to scraping with Python and BeautifulSoup4. Also, I have no knowledge of HTML. To practice, I am trying to use it on the Carrefour website to extract the price and the price per kilogram of a product that I search for by EAN code.
My code:
import requests
from bs4 import BeautifulSoup
barcodes = ['5449000000996']
for barcode in barcodes:
    url = 'https://www.carrefour.es/?q=' + barcode
    html = requests.get(url).content
    bs = BeautifulSoup(html, 'lxml')
    searchingprice = bs.find_all('strong', {'class':'ebx-result-price__value'})
    print(searchingprice)
    searchingpricerperkg = bs.find_all('span', {'class':'ebx-result__quantity ebx-result-quantity'})
    print(searchingpricerperkg)
But I do not get any results at all.
Here is a screenshot of the HTML code:
What am I doing wrong? I tried another website and it seems to work.
The problem here is that you're scraping a page with JavaScript-generated content. The page that you fetch with requests doesn't contain the elements you're looking for; it contains a bunch of JavaScript. When your browser opens the page, it runs that JavaScript, which generates the content, so the rendered page you see in your browser is not the same as the document the server returns. The page contains instructions for your browser to build what you see.
If you're just practicing, you might want to simply try a different source to scrape from, but to scrape from this page, you'll need to look into other solutions that can handle JavaScript-generated content:
Web-scraping JavaScript page with Python
Alternatively, since the JavaScript generates the content by requesting data from other sources, you can go after those sources directly. I don't speak Spanish, so I'm not much help in figuring this part out, but you might be able to.
As an exercise, have BS4 prettify and print the page that it receives. You'll see that the page contains requests to other locations that supply the info you're after. You might be able to change your request so that it goes not to the page where you view the info, but to the location that page gets its data from.
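A minimal sketch of that exercise follows. To stay self-contained it uses the standard library's html.parser rather than BS4, and it runs on made-up markup instead of the real Carrefour response; the StaticPageProbe class, the probe helper, and the sample HTML are all hypothetical. It checks whether the class you are searching for exists in the raw HTML at all and lists the scripts the page loads, which is where to start looking for the real data source.

```python
from html.parser import HTMLParser


class StaticPageProbe(HTMLParser):
    """Record whether a CSS class occurs in raw HTML and which scripts it loads."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.class_found = False
        self.script_sources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # An element may carry several space-separated classes
        classes = (attrs.get("class") or "").split()
        if self.wanted_class in classes:
            self.class_found = True
        if tag == "script" and attrs.get("src"):
            self.script_sources.append(attrs["src"])


def probe(html_text, wanted_class):
    """Return (class_found, script_sources) for the given HTML text."""
    parser = StaticPageProbe(wanted_class)
    parser.feed(html_text)
    return parser.class_found, parser.script_sources


# Made-up markup standing in for what requests.get(url).text might return
sample = """
<html><body>
  <div id="app"></div>
  <script src="/static/search-widget.js"></script>
</body></html>
"""
found, scripts = probe(sample, "ebx-result-price__value")
print(found)    # False
print(scripts)  # ['/static/search-widget.js']
```

If found is False for the real response, no selector will ever match it, and the price has to come from a rendered page or from whatever endpoint those scripts call.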

Trying to get the html on an open page

I am trying to make a bot that can play Cookie Clicker. I have successfully opened the website using the webbrowser module. When I use the developer tools to inspect the HTML, I can see the information I want to obtain, such as how much money I have, how expensive items are, etc. But when I try to get that information using requests and BeautifulSoup, I instead get the HTML of a new window. How can I get the HTML of the already opened tab?
import webbrowser
webbrowser.open('https://orteil.dashnet.org/cookieclicker/')
from bs4 import BeautifulSoup
import requests
def scrape():
    html = requests.get('https://orteil.dashnet.org/cookieclicker/')
    print(html)
scrape()
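You can try to do this with Selenium, reading the page from a browser that your script controls instead of from a requests response:
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://orteil.dashnet.org/cookieclicker/')
body_element = driver.find_element("xpath", "//body")
body_content = body_element.get_attribute("innerHTML")
print(body_content)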

missing HTML information when using requests.get

I am trying to scrape surfline.com using Python 3 with Beautiful Soup and requests. I am using this bit of code. Additionally, I am using Spyder 3.7, and I am fairly new to web scraping.
import requests
from bs4 import BeautifulSoup
url = 'https://www.surfline.com/surf-report/salt-creek/5842041f4e65fad6a770882e'
r = requests.get(url)
html_soup = BeautifulSoup(r.text,'html.parser')
print(html_soup.prettify())
The goal is to scrape the surf height for each day. Using inspect, I found the HTML section that contains the wave height (screenshot of the Surfline website and HTML). I run the code and it prints out the HTML of the website. When I use Ctrl+F to look for the section I want to scrape, it is not there. My question is: why is it not being printed out, and how do I fix it? I am aware that some websites use JavaScript to load data onto the page; is that the case here?
Thank you for any help you can provide.

Different HTTP response from inspector HTML

I am trying to get data for the following website using requests and Scrapy Selector.
import requests
from scrapy import Selector
url="https://seekingalpha.com/article/4312816-exxon-mobil-dividend-problems"
headers = {'user-agent': 'AppleWebKit/537.36'}
req = requests.get(url, headers=headers)
sel = Selector(text=req.text)
I could extract the text body, but when I tried to write the XPath for the comments,
I noticed that the HTML returned from requests is different from what the inspector shows. Therefore, selecting class='b-b' like this,
sel.xpath("//div[@class='b-b']")
returns an empty list in Python. It seems that I'm missing something, or the HTML is partially hidden from bots.
After running view(response) I found that the following is rendered:
My questions:
Why can't the same HTML be seen in the HTTP response?
How can I get the comments data using XPath expressions for this page?
Open your URL in the Scrapy shell and view the page with this command:
view(response)
This opens the response that Scrapy received in your browser, where you can see the source code. If the item is present there, you can select it with XPath: inspect the element and copy its XPath. I don't have my system at hand, so I can't send you exact code, but try the steps above and that should solve your problem.

Extracting tags from a HTML with data hidden using python

I'm trying to learn scraping from different webpages. I tried to scrape data from a page containing tabs as follows:
import requests
from bs4 import BeautifulSoup
from lxml import html
url = "https://www.bc.edu/bc-web/schools/mcas/departments/art/people/#par-bc_tabbed_content-tab-0"
page = requests.get(url)
content = page.content
tree = html.fromstring(page.content)
soup = BeautifulSoup(content,"html.parser")
p = soup.find_all('div',{"id":'e6bde0e9_358d_4966_8fde_be96e9dcad0b'})
print(p)
This returns an empty result.
Inspecting the element displays the content, but the page source doesn't contain this data. Any pointers on how to extract the content?
This is because of JavaScript rendering, which means that the data you want doesn't come with the original response but with further requests made by that response's JavaScript.
To check all the requests that were triggered by the original request, you'll have to use something like the developer tools in Chrome.
For this particular case the actual request you need is to this site, which will give you the information you need.
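Once the Network tab has revealed the request that carries the data, calling it directly could look like the sketch below. It assumes the endpoint returns JSON. The payload shape (a "people" list with "name" fields) and the helper names fetch_json and extract_names are hypothetical, so the real response has to be inspected before any of the keys can be trusted. The fetch uses urllib from the standard library to stay self-contained; requests.get(endpoint).json() does the same job.

```python
import json
from urllib.request import Request, urlopen


def fetch_json(endpoint, user_agent="Mozilla/5.0"):
    """GET a JSON endpoint found in the browser's Network tab and decode it."""
    request = Request(endpoint, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_names(payload):
    """Pull the 'name' field out of each record in a hypothetical payload."""
    return [person["name"] for person in payload.get("people", [])]


# Hypothetical payload shape; the real response has to be inspected first
sample_payload = json.loads(
    '{"people": [{"name": "A. Example"}, {"name": "B. Sample"}]}'
)
print(extract_names(sample_payload))  # ['A. Example', 'B. Sample']
```

With a real endpoint copied from the developer tools, extract_names(fetch_json(endpoint)) would replace the sample payload. If the endpoint returns an HTML fragment instead of JSON, the fragment can be handed to BeautifulSoup as before.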