Trying to get the HTML of an open page

I am trying to make a bot that can play Cookie Clicker. I have successfully opened the website using the webbrowser module. When I use the developer tools to see the HTML, I can see the information I want to obtain, such as how much money I have, how expensive items are, etc. But when I try to get that information using requests and BeautifulSoup, it instead gets the HTML of a new window. How can I make it so that I get the HTML of the already opened tab?
import webbrowser

import requests
from bs4 import BeautifulSoup

webbrowser.open('https://orteil.dashnet.org/cookieclicker/')

def scrape():
    # this fetches a fresh copy of the page, not the tab opened above
    response = requests.get('https://orteil.dashnet.org/cookieclicker/')
    print(response.text)

scrape()

You can try this with Selenium, which drives a real browser and can read back the DOM of the page it controls (the webbrowser module can only open a tab, it cannot read from it):
body_element = driver.find_element_by_xpath("//body")
body_content = body_element.get_attribute("innerHTML")
print(body_content)
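A minimal runnable version of that idea, assuming chromedriver is installed and on your PATH:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://orteil.dashnet.org/cookieclicker/')

body_element = driver.find_element_by_xpath("//body")
body_content = body_element.get_attribute("innerHTML")
print(body_content)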

Related

Can't find HTML tag when using Beautiful Soup

I'm trying to get more familiar with web scraping. I came across this website, https://www.freecodecamp.org/news/how-to-scrape-websites-with-python-and-beautifulsoup-5946935d93fe/, which gives an intro to web scraping using Beautiful Soup. Following the demonstration, I tried to scrape the value and name of the S&P stock index with the code they provided, but that wasn't working. I think some things have changed; for example, the price tag is no longer under h1 as the author wrote on the website. When I inspect the web page to view the HTML, I can see all the tags used, but I found that some of the HTML isn't being scraped from the Bloomberg website. I printed what the web scraper collects to the console.
The code:
import urllib2  # Python 2; use urllib.request on Python 3
from bs4 import BeautifulSoup

quote_page = "http://www.bloomberg.com/quote/SPX:IND"
page = urllib2.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
print(soup)

name_box = soup.find("h1", attrs={"class": "price"})
name = name_box.text.strip()  # raises "'NoneType' object has no attribute 'text'" here
print(name)
I had trouble displaying what the code prints here, but basically some of the tags are not there. I'm wondering why this is and how to actually scrape the website. When I inspect the website, I can find the tag I am looking for, which is:
<span class="priceText__1853e8a5">2,912.43</span>
But using the code I have, I can't seem to get this tag.
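The span you found is most likely injected by JavaScript after the initial response, which is why urllib2 never sees it. A minimal sketch with Selenium that renders the page first; matching the class by its priceText prefix is an assumption, since the hashed suffix (1853e8a5) tends to change between site builds:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://www.bloomberg.com/quote/SPX:IND")

# prefix match, because the hashed part of the class name is not stable
price = driver.find_element_by_css_selector('span[class^="priceText"]')
print(price.text)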

Get full HTML for page with dynamic expanded containers with python

I am trying to pull the full HTML from ratemyprofessors.com; however, at the bottom of the page there is a "Load More Ratings" button that allows you to see more comments.
I am using requests.get(url) and beautifulsoup, but that only gives the first 20 comments. Is there a way to have the page load all the comments before it returns?
Here is what I am currently doing that gives the top 20 comments, but not all of them.
import requests
from bs4 import BeautifulSoup

response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

comments = []
# class names like this are generated by the site's build and may change
for j in soup.find_all('div', attrs={'class': 'Comments__StyledComments-dzzyvm-0 dEfjGB'}):
    comments.append(j.text)
BeautifulSoup is an HTML parser for static pages, not a renderer for dynamic web apps.
You could achieve what you want by using a headless browser via Selenium: render the full page, then repeatedly click the "Load More Ratings" link until there is nothing more to load.
Example: Clicking on a link via selenium
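A minimal sketch of that loop; the button text comes from the question, and the selectors may have changed on the live site:

import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException, WebDriverException

driver = webdriver.Chrome()
driver.get(url)  # the professor page from the question

while True:
    try:
        more = driver.find_element_by_xpath('//button[contains(., "Load More Ratings")]')
        more.click()
        time.sleep(1)  # give the next batch of comments time to load
    except (NoSuchElementException, WebDriverException):
        break  # no button left to click, so everything is loaded

soup = BeautifulSoup(driver.page_source, "html.parser")
comments = [j.text for j in soup.find_all('div', attrs={'class': 'Comments__StyledComments-dzzyvm-0 dEfjGB'})]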
Since you're already using Requests, another option that might work is Requests-HTML, which also supports dynamic rendering by calling .html.render() on the response object.
Example: https://requests-html.kennethreitz.org/index.html#requests_html.HTML.render
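A minimal sketch; note that render() executes the page's JavaScript but does not click the button for you, so it may still only surface the first batch of comments:

from requests_html import HTMLSession

session = HTMLSession()
r = session.get(url)
r.html.render()     # downloads a headless Chromium on first use, then runs the page's JavaScript
print(r.html.html)  # the rendered HTML, which can be fed to BeautifulSoup as usual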
Reference: Clicking link using beautifulsoup in python

missing HTML information when using requests.get

I am trying to scrape surfline.com using Python 3 with Beautiful Soup and requests. I am using this bit of code. Additionally, I am using Spyder 3.7. Also, I am fairly new to web scraping.
import requests
from bs4 import BeautifulSoup
url = 'https://www.surfline.com/surf-report/salt-creek/5842041f4e65fad6a770882e'
r = requests.get(url)
html_soup = BeautifulSoup(r.text,'html.parser')
print(html_soup.prettify())
The goal is to scrape the surf height for each day. Using inspect I found the HTML section that contains the wave height (screenshot of the Surfline website and HTML). When I run the code it prints out the HTML of the website, but when I use Ctrl+F to look for the section I want to scrape, it is not there. My question is why it is not being printed out and how do I fix it. I am aware that some websites use JavaScript to load data onto the page; is that the case here?
Thank you for any help you can provide.
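If the data is indeed loaded by JavaScript, as the missing section suggests, one option is to let a real browser render the page first and hand the result to BeautifulSoup. A minimal sketch, assuming chromedriver is available; the fixed sleep is a crude stand-in for a proper wait:

import time

from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://www.surfline.com/surf-report/salt-creek/5842041f4e65fad6a770882e'
driver = webdriver.Chrome()
driver.get(url)
time.sleep(5)  # crude wait for the page's JavaScript to populate the report

html_soup = BeautifulSoup(driver.page_source, 'html.parser')
print(html_soup.prettify())  # the wave-height section should now be present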

Website hiding page footer from parser

I am trying to find the donation button on the website of
The University of British Columbia.
The donation button is located in the page footer, within the div classed "span7".
However, when scraped, the HTML yielded the div with nothing inside it.
My program works perfectly when the div is passed in directly as the source:
from bs4 import BeautifulSoup as bs
import re
site = '''<div class="span7" id="ubc7-footer-menu"><div class="row-fluid"><div class="span6"><h3>About UBC</h3><div>Contact UBC</div><div>About the University</div><div>News</div><div>Events</div><div>Careers</div><div>Make a Gift</div><div>Search UBC.ca</div></div><div class="span6"><h3>UBC Campuses</h3><div>Vancouver Campus</div><div>Okanagan Campus</div><h4>UBC Sites</h4><div>Robson Square</div><div>Centre for Digital Media</div><div>Faculty of Medicine Across BC</div><div>Asia Pacific Regional Office</div></div></div></'''
html = bs(site, 'html.parser')
link = html.find('a', string=re.compile('(?i)(donate|donation|gift)'))
# returns the proper donation URL
However, using the live site does not work:
from bs4 import BeautifulSoup as bs
import requests
import re
site = requests.get('https://www.ubc.ca/')
html = bs(site.content, 'html.parser')
link = html.find('a', string=re.compile('(?i)(donate|donation|gift)'))
# returns None
Is there something wrong with my parser? Is it some sort of anti-scrape maneuver? Am I doomed?
I cannot seem to find the 'Donate' button on the URL that you provided, but there is nothing inherently wrong with your parser; it's just that the GET request you send only gives you the HTML initially returned in the response, rather than waiting for the page to fully render.
It appears that parts of the page are filled in by JavaScript. You can use Splash, which is designed to render JavaScript-based pages. You can run Splash in Docker quite easily and just make HTTP requests to the Splash container, which will return HTML that looks just like the page as rendered in a web browser.
Although this sounds overly complicated, it is actually quite simple to set up: you don't need to modify the Docker image at all, and you need no previous knowledge of Docker to get it to work. Starting a local Splash server takes a single command:
docker run -p 8050:8050 -p 5023:5023 scrapinghub/splash
You then just modify any existing requests in your Python code to route through Splash instead, i.e. http://example.com/ becomes
http://localhost:8050/render.html?url=http://example.com/
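For example, the failing request from the question could become the following (a sketch, assuming Splash is running locally on port 8050; the wait parameter gives the page's JavaScript time to run):

from bs4 import BeautifulSoup as bs
import requests
import re

rendered = requests.get('http://localhost:8050/render.html',
                        params={'url': 'https://www.ubc.ca/', 'wait': 2})
html = bs(rendered.content, 'html.parser')
link = html.find('a', string=re.compile('(?i)(donate|donation|gift)'))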

How to fill out a web form and return the data without knowing the web form id/name in Python

I am currently trying to automatically submit information into the web forms on this website: https://coinomi.com/recovery-phrase-tool.html. Unfortunately, I do not know the names of the forms and can't seem to find them in its source code. I have tried to fill out the forms using the requests Python module, and also just by passing the parameters through the URL before scraping it, but since I can't find the names of the forms, neither approach works.
If possible I wanted to do this with the offline version of the website at https://github.com/Coinomi/bip39/blob/master/bip39-standalone.html so that it is more secure, but I barely know how to use regular web forms with the tools I have, let alone one stored locally on my computer.
I am not sure exactly what you are looking for. However, here is some code that uses Selenium to fill in parts of the form you mention:
from selenium import webdriver
from selenium.webdriver.support.select import Select

browser = webdriver.Chrome('C:\\Users...\\chromedriver.exe')  # path to your chromedriver
browser.get('https://coinomi.com/recovery-phrase-tool.html')

# Example: fill a text box
recoveryPhrase = browser.find_element_by_id('phrase')
recoveryPhrase.send_keys('your answer')

# Example: select an option from a dropdown
numberOfWords = Select(browser.find_element_by_id('strength'))
numberOfWords.select_by_visible_text('24')

# Example: click a button
generateRandomMnemonic = browser.find_element_by_xpath('/html/body/div[1]/div[1]/div/form/div[4]/div/div/span/button')
generateRandomMnemonic.click()
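If you then want the generated phrase back in Python, you can read it from the same element after the click (assuming the page writes the result into the phrase box, as the standalone tool appears to do):

# read back whatever the page wrote into the phrase box
print(recoveryPhrase.get_attribute('value'))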