Recover a link from a site automatically in Python - HTML

In order to lighten my program and make it easier to use, I created an exe that centralizes all of its functions, including the download and installation of various applications. The problem I face is that the download link is dynamic (the link to the download page is fixed, but the download link itself is not). So how can I get that second link on the page starting from the fixed link?
For example, this link "https://anonfiles.com/D031ebu3uf/untitled.95_png" is fixed, and I want to automate the recovery of the non-fixed link stored in
<a target="_blank" type="button" id="download-url" class="btn btn-primary btn-block" href="https://cdn-31.anonfiles.com/D031ebu3uf/91f535ad-1619920351/untitled.95.png">Download (365 KB)</a>
Code:
import requests

url = 'https://anonfiles.com/D031ebu3uf/untitled.95_png'
r = requests.get(url, allow_redirects=True)
with open('page.html', 'wb') as f:
    f.write(r.content)

Use BeautifulSoup (bs4) and requests. Download and parse the HTML, use the id of the element to target it, then extract the href attribute:
import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://anonfiles.com/D031ebu3uf/untitled.95_png')
soup = bs(r.content, 'lxml')
# The download button carries id="download-url"; grab its href
link = soup.select_one('#download-url')['href']
print(link)
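If you then want to fetch the file itself, a minimal follow-up sketch (assuming the extracted href points directly at the file, and naming the local file after the last URL segment):
# Download the file behind the extracted link and save it locally
file_name = link.split('/')[-1]
file_response = requests.get(link)
with open(file_name, 'wb') as f:
    f.write(file_response.content)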

Related

Beautifulsoup scraping "lazy faded" images

I am looking for a way to parse the images on a web page. Many posts already exist on the subject, and I was inspired by many of them, in particular:
How Can I Download An Image From A Website In Python
The script presented in this post works very well, but I have encountered a type of image whose saving I have not managed to automate. On the website, inspecting the web page gives me:
<img class="lazy faded" data-src="Uploads/Media/20220315/1582689.jpg" src="Uploads/Media/20220315/1582689.jpg">
And when I parse the page with BeautifulSoup 4, I get this (fonts.gstatic.com Source section content):
<a class="ohidden" data-size="838x1047" href="Uploads/Media/20220315/1582689.jpg" itemprop="contentUrl">
<img class="lazy" data-src="Uploads/Media/20220315/1582689.jpg" />
</a>
The given URL is not a full web URL that can be used to download the image from anywhere, but a link to the "Sources" section of the web page (Ctrl + Shift + I on the webpage), where the image is.
When I hover over the src link in the source code of the website, I can see the true full URL under "Current source". This information is located in the Elements/Properties panel of the DevTools (Ctrl + Shift + I on the webpage), but I don't know how to automate saving the images, either by directly using the link to access the web page sources, or by accessing the full address to download the images. Do you have any ideas?
PS: I found this article about lazy-fading images, but my HTML knowledge isn't enough to find a solution to my problem (https://davidwalsh.name/lazyload-image-fade)
I'm not too familiar with web scraping or its benefits. However, I found this article here that you can reference, and I hope it helps!
Reference
However, here is the code and everything you need in one place.
First you have to find the webpage you want to download the images from, which is your decision.
Now we have to get the URLs of the images: create an empty list, open the page, select the image links, loop through them, and append each one to the list.
import re
import urllib.request
from pathlib import Path
import requests
from bs4 import BeautifulSoup

url = ""  # page to scrape (fill in)
output_folder = Path(".")  # folder the images will be saved to (placeholder)
link_list = []
response = urllib.request.urlopen(url)
soup = BeautifulSoup(response, "html.parser")
image_list = soup.select('div.boxmeta.clearfix > h2 > a')
for image_link in image_list:
    link_url = image_link.attrs['href']
    link_list.append(link_url)
This should look for every a tag whose href links to an image page and append each URL to that list.
Now we have to get the tags of the image file.
for page_url in link_list:
    page_html = urllib.request.urlopen(page_url)
    page_soup = BeautifulSoup(page_html, "html.parser")
    img_list = page_soup.select('div.seperator > a > img')
This should find all of the div tags that are separate from the primary main div class, then look for an a tag and the img tag inside it.
for img in img_list:
    img_url = img.attrs['src']
    file_name = re.search(r".*/(.*png|.*jpg)$", img_url)
    save_path = output_folder.joinpath(file_name.group(1))
Now we are going to try to download that data using a try/except block.
try:
    image = requests.get(img_url)
    open(save_path, 'wb').write(image.content)
    print(save_path)
except ValueError:
    print("ValueError!")
I think you are talking about relative paths versus absolute paths.
Something like Uploads/Media/20220315/1582689.jpg is a relative path.
The main difference between absolute and relative paths is that absolute URLs always include the domain name of the site with http://www. Relative links show the path to the file or refer to the file itself. A relative URL is useful within a site to transfer a user from point to point within the same domain. --- ref.
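For instance, a quick sketch of how requests.compat.urljoin (used below) resolves such a relative path against the page URL; the article URL here is just a placeholder:
import requests
# Resolve the relative image path against the page it was found on
print(requests.compat.urljoin('https://example.com/article/1', 'Uploads/Media/20220315/1582689.jpg'))
# -> https://example.com/article/Uploads/Media/20220315/1582689.jpg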
So in your case try this:
import requests
from bs4 import BeautifulSoup
from PIL import Image

URL = 'YOUR_URL_HERE'
r = requests.get(URL)
soup = BeautifulSoup(r.text, 'html.parser')
for img in soup.find_all("img"):
    # Get the image absolute path url
    absolute_path = requests.compat.urljoin(URL, img.get('data-src'))
    # Download the image
    image = Image.open(requests.get(absolute_path, stream=True).raw)
    image.save(absolute_path.split('/')[-1].split('?')[0])

Get full HTML for page with dynamic expanded containers with python

I am trying to pull the full HTML from ratemyprofessors.com; however, at the bottom of the page there is a "Load More Ratings" button that allows you to see more comments.
I am using requests.get(url) and beautifulsoup, but that only gives the first 20 comments. Is there a way to have the page load all the comments before it returns?
Here is what I am currently doing that gives the top 20 comments, but not all of them.
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
comments = []
for j in soup.findAll('div', attrs={'class': 'Comments__StyledComments-dzzyvm-0 dEfjGB'}):
    comments.append(j.text)
BeautifulSoup is more of an HTML parser for static pages than a renderer for more dynamic web apps.
You could achieve what you want using a headless browser via Selenium, by rendering the full page and repeatedly clicking the "Load More Ratings" button until there is nothing more to load.
Example: Clicking on a link via selenium
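A rough sketch of that approach (the button text comes from the question; Chrome, the waits, and the comment selector are assumptions, not verified against the live site):
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

driver = webdriver.Chrome()
driver.get(url)
while True:
    try:
        # Keep clicking "Load More Ratings" until the button can no longer be found
        driver.find_element(By.XPATH, "//button[contains(., 'Load More Ratings')]").click()
        time.sleep(2)  # give the newly loaded comments time to render
    except Exception:
        break
comments = [el.text for el in driver.find_elements(By.CSS_SELECTOR, "div[class*='StyledComments']")]
driver.quit()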
Since you're already using Requests, another option that might work is Requests-HTML, which also supports dynamic rendering by calling .html.render() on the response object.
Example: https://requests-html.kennethreitz.org/index.html#requests_html.HTML.render
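A minimal sketch with Requests-HTML, assuming the same url variable and comment selector as above (render() downloads Chromium the first time it runs):
from requests_html import HTMLSession

session = HTMLSession()
r = session.get(url)
r.html.render(sleep=2)  # execute the page's JavaScript before parsing
comments = [el.text for el in r.html.find("div[class*='StyledComments']")]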
Reference: Clicking link using beautifulsoup in python

Trying to get the html on an open page

I am trying to make a bot that can play Cookie Clicker. I have successfully opened the website using the webbrowser module. When I use the developer tools to see the HTML I can see the information I want to obtain, such as how much money I have, how expensive items are, etc. But when I try to get that information using requests and BeautifulSoup, it instead gets the HTML of a new window. How can I make it so that I get the HTML of the already opened tab?
import webbrowser
webbrowser.open('https://orteil.dashnet.org/cookieclicker/')

from bs4 import BeautifulSoup
import requests

def scrape():
    html = requests.get('https://orteil.dashnet.org/cookieclicker/')
    print(html)

scrape()
You can try to do this, where html is a Selenium WebDriver instance rather than a requests response:
body_element = html.find_element_by_xpath("//body")
body_content = body_element.get_attribute("innerHTML")
print(body_content)
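For that to work you need Selenium to both open the page and read it back, since requests only fetches a fresh copy of the raw HTML. A fuller sketch, assuming Chrome and the Selenium 4 API:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://orteil.dashnet.org/cookieclicker/')
# Read the rendered body of the tab Selenium itself opened
body_element = driver.find_element(By.XPATH, "//body")
body_content = body_element.get_attribute("innerHTML")
print(body_content)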

missing HTML information when using requests.get

I am trying to scrape surfline.com using Python 3 with Beautiful Soup and requests. I am using this bit of code. Additionally I am using Spyder 3.7. Also, I am fairly new to web scraping.
import requests
from bs4 import BeautifulSoup
url = 'https://www.surfline.com/surf-report/salt-creek/5842041f4e65fad6a770882e'
r = requests.get(url)
html_soup = BeautifulSoup(r.text,'html.parser')
print(html_soup.prettify())
The goal is to scrape the surf height for each day. Using inspect I found the HTML section that contains the wave height (screenshot of the surfline website & HTML). I run the code and it prints out the HTML of the website. When I use Ctrl+F to look for the section I want to scrape, it is not there. My question is why is it not being printed out and how do I fix it. I am aware that some websites use JavaScript to load data onto the page; is that the case here?
Thank you for any help you can provide.

Website hiding page footer from parser

I am trying to find the donation button on the website of
The University of British Columbia.
The donation button is located in the page footer, within the div classed as "span7".
However, when scraped, the HTML yielded the div with nothing inside it.
My program works perfectly with the div passed in directly as the source:
from bs4 import BeautifulSoup as bs
import re
site = '''<div class="span7" id="ubc7-footer-menu"><div class="row-fluid"><div class="span6"><h3>About UBC</h3><div>Contact UBC</div><div>About the University</div><div>News</div><div>Events</div><div>Careers</div><div>Make a Gift</div><div>Search UBC.ca</div></div><div class="span6"><h3>UBC Campuses</h3><div>Vancouver Campus</div><div>Okanagan Campus</div><h4>UBC Sites</h4><div>Robson Square</div><div>Centre for Digital Media</div><div>Faculty of Medicine Across BC</div><div>Asia Pacific Regional Office</div></div></div></div>'''
html = bs(site, 'html.parser')
link = html.find('a', string=re.compile('(?i)(donate|donation|gift)'))
#returns proper donation URL
However, using the site does not work
from bs4 import BeautifulSoup as bs
import requests
import re
site = requests.get('https://www.ubc.ca/')
html = bs(site.content, 'html.parser')
link = html.find('a', string=re.compile('(?i)(donate|donation|gift)'))
#returns none
Is there something wrong with my parser? Is it some sort of anti-scrape maneuver? Am I doomed?
I cannot seem to find the 'Donate' button on the URL that you provided, but there is nothing inherently wrong with your parser; it's just that the GET request you send only gives you the HTML initially returned in the response, rather than waiting for the page to fully render.
It appears that parts of the page are filled in by JavaScript. You can use Splash, which is used to render JavaScript-based pages. You can run Splash in Docker quite easily, and just make HTTP requests to the Splash container, which will return HTML that looks just like the webpage as rendered in a web browser.
Although this sounds overly complicated, it is actually quite simple to set up since you don't need to modify the Docker image at all, and you need no previous knowledge of Docker to get it to work. It requires just a single line from the command line to start a local Splash server:
docker run -p 8050:8050 -p 5023:5023 scrapinghub/splash
You then just modify any existing requests you have in your Python code to route through Splash instead:
i.e. http://example.com/ becomes
http://localhost:8050/render.html?url=http://example.com/
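For example, a sketch of the earlier UBC request routed through a local Splash instance (the wait value is an assumption, giving the footer time to render):
import requests
from bs4 import BeautifulSoup as bs
import re

# Ask the local Splash server to render the page, then parse the returned HTML
r = requests.get('http://localhost:8050/render.html',
                 params={'url': 'https://www.ubc.ca/', 'wait': 2})
html = bs(r.content, 'html.parser')
link = html.find('a', string=re.compile('(?i)(donate|donation|gift)'))
print(link)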