Website hiding page footer from parser - html

I am trying to find the donation button on the website of
The University of British Columbia.
The donation button is located in the page footer, within the div with class "span7".
However, when scraped, the HTML yielded that div with nothing inside it.
My program works perfectly with direct div as source:
from bs4 import BeautifulSoup as bs
import re
site = '''<div class="span7" id="ubc7-footer-menu"><div class="row-fluid"><div class="span6"><h3>About UBC</h3><div>Contact UBC</div><div>About the University</div><div>News</div><div>Events</div><div>Careers</div><div>Make a Gift</div><div>Search UBC.ca</div></div><div class="span6"><h3>UBC Campuses</h3><div>Vancouver Campus</div><div>Okanagan Campus</div><h4>UBC Sites</h4><div>Robson Square</div><div>Centre for Digital Media</div><div>Faculty of Medicine Across BC</div><div>Asia Pacific Regional Office</div></div></div></'''
html = bs(site, 'html.parser')
link = html.find('a', string=re.compile('(?i)(donate|donation|gift)'))
#returns proper donation URL
However, using the live site does not work:
from bs4 import BeautifulSoup as bs
import requests
import re
site = requests.get('https://www.ubc.ca/')
html = bs(site.content, 'html.parser')
link = html.find('a', string=re.compile('(?i)(donate|donation|gift)'))
#returns none
Is there something wrong with my parser? Is it some sort of anti-scraping maneuver? Am I doomed?

I cannot seem to find the 'Donate' button on the URL that you provided, but there is nothing inherently wrong with your parser. The GET request you send only gives you the HTML initially returned by the server; it does not wait for the page to fully render.
It appears that parts of the page are filled in by JavaScript. You can use Splash, which renders JavaScript-based pages. Splash runs in Docker quite easily: you make HTTP requests to the Splash container, and it returns HTML that looks just like the page as rendered in a web browser.
Although this sounds overly complicated, it is actually quite simple to set up since you don't need to modify the Docker image at all, and you need no previous knowledge of Docker to get it to work. It requires just a single line from the command line to start a local Splash server:
docker run -p 8050:8050 -p 5023:5023 scrapinghub/splash
You then just modify any existing requests in your Python code to route through Splash instead:
i.e. http://example.com/ becomes
http://localhost:8050/render.html?url=http://example.com/
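As a rough sketch of that rewrite in Python, the helper below builds the Splash URL for a target page. The name `splash_url` and the default `wait` value are illustrative assumptions, not part of the original answer; the host and port are taken from the docker command above, and `wait` is a `render.html` parameter that gives scripts time to run before the HTML is returned.

```python
from urllib.parse import urlencode

# Assumes the local Splash container started with the docker command above
SPLASH_ENDPOINT = "http://localhost:8050/render.html"


def splash_url(target, wait=2.0):
    """Return the Splash URL that renders `target` (hypothetical helper)."""
    # urlencode percent-escapes the target so its own query string survives intact
    return SPLASH_ENDPOINT + "?" + urlencode({"url": target, "wait": wait})


if __name__ == "__main__":
    print(splash_url("https://www.ubc.ca/"))
```

With the container running, `requests.get(splash_url('https://www.ubc.ca/')).content` can be passed to BeautifulSoup in place of the original response.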

Related

Scraping prices with BeautifulSoup4 in Python3

I am new to scraping with Python and BeautifulSoup4, and I have no knowledge of HTML. To practice, I am trying to use it on the Carrefour website to extract the price and the price per kilogram of a product that I search for by EAN code.
My code:
import requests
from bs4 import BeautifulSoup
barcodes = ['5449000000996']
for barcode in barcodes:
    # Search the site by EAN code
    url = 'https://www.carrefour.es/?q=' + barcode
    html = requests.get(url).content
    bs = BeautifulSoup(html, 'lxml')
    searchingprice = bs.find_all('strong', {'class':'ebx-result-price__value'})
    print(searchingprice)
    searchingpricerperkg = bs.find_all('span', {'class':'ebx-result__quantity ebx-result-quantity'})
    print(searchingpricerperkg)
But I do not get any results at all.
Here is a screenshot of the HTML code:
What am I doing wrong? I tried another website and it seems to work.
The problem here is that you're scraping a page with Javascript-generated content. Basically, the page that you're grabbing with requests actually doesn't have the thing you're grabbing from it - it has a bunch of javascript. When your browser goes to the page, it runs the javascript, which generates the content - so the page you see in the rendered version in your browser is not the same thing returned from the actual page itself. The page contains instructions for your browser to write the page that you see.
If you're just practicing, you might want to simply try a different source to scrape from, but to scrape from this page, you'll need to look into other solutions that can handle javascript generated content:
Web-scraping JavaScript page with Python
Alternatively, the JavaScript generates content by requesting data from other sources. I don't speak Spanish, so I'm not much help in figuring this part out, but you might be able to.
As an exercise, go ahead and have BS4 prettify and print out the page that it receives. You'll see that within that page there are requests to other locations to get the info you're asking for. You might be able to change your request so that it goes not to the page where you view the info, but to the location that page gets its data from.
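A minimal sketch of that exercise, using only the standard library's `html.parser` so it runs without extra packages (BeautifulSoup's `find_all('script')` would do the same job). The class name `ScriptFinder` and the sample markup are made up for illustration; they are not taken from the Carrefour page.

```python
from html.parser import HTMLParser


class ScriptFinder(HTMLParser):
    """Collect the src of every external <script> tag (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


# Made-up sample standing in for the HTML returned by requests.get(url).text
SAMPLE = """
<html><body>
<div id="results"></div>
<script src="https://cdn.example.com/search-widget.js"></script>
<script>window.config = {"lang": "es"};</script>
</body></html>
"""

finder = ScriptFinder()
finder.feed(SAMPLE)
print(finder.sources)
```

An empty results container sitting next to script tags like these is a hint that the data arrives later. The browser's network tab usually shows which request actually carries it.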

LXML xpath does not detect in a "dirty" html file, however after indenting and cleaning it, it succeeds

Any sort of help will be greatly appreciated. I am building a parser for a website. I am trying to detect an element using the lxml package; the element has a pretty simple relative xpath: '//div[@id="productDescription"]'. When I manually go to the web page, choose 'view page source', and copy the HTML string to a local HTML file, everything works perfectly. However, if I download the file automatically:
import requests
from lxml import html
headers = {"user-Agent": "MY SCRAPER USER-AGENT", "Accept-Encoding":"gzip, deflate", "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "DNT": "1","Connection": "close", "Upgrade-Insecure-Requests": "1"}
product_HTML_bytes = requests.get(product_link, headers=headers, proxies={'http': "***:***"}).content
product_HTML_str = product_HTML_bytes.decode()
main_data = html.fromstring(product_HTML_str)
product_description_tags = main_data.xpath('//div[@id="productDescription"]')
...
I get nothing (and the data does exist in the file). I also tried first scraping a sample of pages using the same requests.get call with the same headers and so on, saving the files locally, and then cleaning the extra spaces and indenting the document manually using this HTML formatter: https://www.freeformatter.com/html-formatter.html and then boom, it works again. I couldn't put my finger on what exactly changes in the files, but I was pretty sure extra spaces and indentation should not make a difference.
What am I missing here?
Thanks in Advance
Edit:
URL: https://www.amazon.com/Samsung-MicroSDXC-Adapter-MB-ME128GA-AM/dp/B06XWZWYVP
Because the files exceed the length limit, pasting them here is impossible, so I uploaded them to the web.
The not working HTML: https://easyupload.io/231pdd
The indented, clean, and formatted HTML page: https://easyupload.io/a9oiyh
For some strange reason, it seems that the lxml library mangles the text output of requests.get() when the output is passed through the lxml.html.fromstring() method. I have no idea why.
The target data is still there, no doubt:
from bs4 import BeautifulSoup as bs
soup = bs(product_HTML_str,'lxml') #note that the lxml parser is used here!
for elem in soup.select_one('#productDescription p'):
    print(elem.strip())
Output:
Simply the right card. With stunning speed and reliability, the...
etc.
I personally much prefer using xpath in lxml to find() and css selectors methods used by BeautifulSoup, but this time BeautifulSoup wins...
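For reference, an XPath attribute test is written with `@`, as in `//div[@id="productDescription"]`. The sketch below checks that predicate offline with the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath and needs well-formed markup, so it is a stand-in for `lxml.html` rather than a replacement. The snippet is made up for illustration and is not the Amazon page.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed snippet; real pages generally need a lenient HTML parser
SNIPPET = """
<html><body>
  <div id="productDescription"><p>Simply the right card.</p></div>
  <div id="other"><p>Unrelated text.</p></div>
</body></html>
"""

root = ET.fromstring(SNIPPET)
# ElementTree paths are relative to the root element, hence the leading "."
matches = root.findall(".//div[@id='productDescription']")
texts = [p.text for div in matches for p in div.findall("p")]
print(texts)
```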

Trying to get the html on an open page

I am trying to make a bot that can play Cookie Clicker. I have successfully opened the website using the webbrowser module. When I use the developer tools to view the HTML, I can see the information I want to obtain, such as how much money I have, how expensive items are, etc. But when I try to get that information using requests and BeautifulSoup, I instead get the HTML of a new window. How can I get the HTML of the already opened tab?
import webbrowser
webbrowser.open('https://orteil.dashnet.org/cookieclicker/')
from bs4 import BeautifulSoup
import requests
def scrape():
    html = requests.get('https://orteil.dashnet.org/cookieclicker/')
    print(html)
scrape()
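You can try this. Note that the calls below are Selenium WebDriver methods, so html has to be a Selenium driver that has loaded the page, not the requests response: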
body_element = html.find_element_by_xpath("//body")
body_content = body_element.get_attribute("innerHTML")
print(body_content)

missing HTML information when using requests.get

I am trying to scrape surfline.com using Python 3 with Beautiful Soup and requests, with the code below. I am using Spyder 3.7, and I am fairly new to web scraping.
import requests
from bs4 import BeautifulSoup
url = 'https://www.surfline.com/surf-report/salt-creek/5842041f4e65fad6a770882e'
r = requests.get(url)
html_soup = BeautifulSoup(r.text,'html.parser')
print(html_soup.prettify())
The goal is to scrape the surf height for each day. Using inspect, I found the HTML section that contains the wave height (screenshot of the Surfline website and HTML). I run the code and it prints out the HTML of the website, but when I use Ctrl+F to look for the section I want to scrape, it is not there. Why is it not being printed out, and how do I fix it? I am aware that some websites use JavaScript to load data onto the page; is that the case here?
Thank you for any help you can provide.

How to fill out a web form and return the data without knowing the web form id/name in Python

I am currently trying to automatically submit information into the web forms on this website: https://coinomi.com/recovery-phrase-tool.html. I have tried to fill out the forms using the requests Python module, and by passing the parameters through the URL before scraping it. Unfortunately, I do not know the names of the forms and can't seem to find them in the source code, so I can't do this.
If possible, I want to do this with the offline version of the website at https://github.com/Coinomi/bip39/blob/master/bip39-standalone.html so that it is more secure, but I barely know how to use regular web forms with the tools I have, let alone locally from my computer.
I am not sure exactly what you are looking for. However, here is some code that uses Selenium to fill in parts of the form you mention.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.select import Select
browser = webdriver.Chrome('C:\\Users...\\chromedriver.exe')
browser.get('https://coinomi.com/recovery-phrase-tool.html')
# find_element(By...) is used here because newer Selenium releases removed the find_element_by_* helpers
# Example to fill a text box
recoveryPhrase = browser.find_element(By.ID, 'phrase')
recoveryPhrase.send_keys('your answer')
# Example to select an element
numberOfWords = Select(browser.find_element(By.ID, 'strength'))
numberOfWords.select_by_visible_text('24')
# Example to click a button
generateRandomMnemonic = browser.find_element(By.XPATH, '/html/body/div[1]/div[1]/div/form/div[4]/div/div/span/button')
generateRandomMnemonic.click()