BeautifulSoup scraping "lazy faded" images - html

I am looking for a way to parse the images on a web page. Many posts already exist on the subject, and I was inspired by many of them, in particular:
How Can I Download An Image From A Website In Python
The script presented in that post works very well, but I have encountered a type of image whose saving I have not managed to automate. On the website, inspecting the web page gives me:
<img class="lazy faded" data-src="Uploads/Media/20220315/1582689.jpg" src="Uploads/Media/20220315/1582689.jpg">
And when I parse the page with BeautifulSoup 4, I get this (the content of the fonts.gstatic.com "Sources" section):
<a class="ohidden" data-size="838x1047" href="Uploads/Media/20220315/1582689.jpg" itemprop="contentUrl">
<img class="lazy" data-src="Uploads/Media/20220315/1582689.jpg" />
</a>
The given URL is not a full web URL that could be used to download the image from anywhere, but a link to the "Sources" section of the web page (Ctrl + Shift + I on the page), where the image is.
When I hover over the src link in the source code of the website, I can see the true full URL under "Current source". This information is located under Elements/Properties in the DevTools (Ctrl + Shift + I on the page), but I don't know how to automate saving the images, either by directly using the link to access the web page sources, or by resolving the full address to download the images. Do you have any ideas?
PS: I found this article about lazy-loaded fading images, but my HTML knowledge isn't enough to find a solution for my problem: https://davidwalsh.name/lazyload-image-fade

I'm not too familiar with web scraping or its benefits, but I found an article that you can use as a reference, and I hope it helps!
Reference
Here is the code and everything you need in one place.
First, find the web page you want to download the images from.
Now we need to collect the URLs of the image pages: create an empty list, open the page, select the link elements, loop through them, and append each URL to the list.
url = ""
link_list[]
response = urllib.request.urlopen(url)
soup = BeautifulSoup(response, "html.parser")
image_list = soup.select('div.boxmeta.clearfix > h2 > a')
for image_link in image_list:
link_url = image_link.attrs['href']
link_list.append(link_url)
In theory, this looks for every a tag whose href links to an image page and appends each URL to the list.
Now we have to get the img tags from each of those pages.
for page_url in link_list:
    page_html = urllib.request.urlopen(page_url)
    page_soup = BeautifulSoup(page_html, "html.parser")
    img_list = page_soup.select('div.seperator > a > img')
This should find all of the div tags with the seperator class, then look for an a tag and then the img tag inside each one.
output_folder = Path("images")  # where the images will be saved; the folder must exist
for img in img_list:
    img_url = img.attrs['src']
    file_name = re.search(r".*/(.*png|.*jpg)$", img_url)
    save_path = output_folder.joinpath(file_name.group(1))
Now, still inside that loop, we try to download the data using a try/except block.
    try:
        image = requests.get(img_url)
        open(save_path, 'wb').write(image.content)
        print(save_path)
    except ValueError:
        print("ValueError!")

I think you are talking about relative paths versus absolute paths.
Something like Uploads/Media/20220315/1582689.jpg is a relative path.
The main difference between absolute and relative paths is that absolute URLs always include the domain name of the site with http://www. Relative links show the path to the file or refer to the file itself. A relative URL is useful within a site to transfer a user from point to point within the same domain. --- ref.
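As a quick illustration, urljoin from the Requests package (it re-exports urllib.parse.urljoin) resolves a relative path against the page it was found on; the page URL here is a hypothetical stand-in:
import requests

# Hypothetical page URL, for illustration only
page_url = 'https://example.com/photos/gallery.html'
print(requests.compat.urljoin(page_url, 'Uploads/Media/20220315/1582689.jpg'))
# -> https://example.com/photos/Uploads/Media/20220315/1582689.jpg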
So in your case try this:
import requests
from bs4 import BeautifulSoup
from PIL import Image

URL = 'YOUR_URL_HERE'
r = requests.get(URL)
soup = BeautifulSoup(r.text, 'html.parser')

for img in soup.find_all("img"):
    # Get the image's absolute URL from the page URL and its relative data-src
    absolute_path = requests.compat.urljoin(URL, img.get('data-src'))
    # Download the image and save it under its original file name
    image = Image.open(requests.get(absolute_path, stream=True).raw)
    image.save(absolute_path.split('/')[-1].split('?')[0])

Related

Get full HTML for page with dynamic expanded containers with python

I am trying to pull the full HTML from ratemyprofessors.com; however, at the bottom of the page there is a "Load More Ratings" button that allows you to see more comments.
I am using requests.get(url) and BeautifulSoup, but that only gives the first 20 comments. Is there a way to have the page load all the comments before it returns?
Here is what I am currently doing that gives the top 20 comments, but not all of them.
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

comments = []
for j in soup.findAll('div', attrs={'class': 'Comments__StyledComments-dzzyvm-0 dEfjGB'}):
    comments.append(j.text)
BeautifulSoup is an HTML parser for static pages, not a renderer for dynamic web apps.
You could achieve what you want with a headless browser via Selenium: render the full page and repeatedly click the "Load More Ratings" button until there is nothing more to load, as sketched below.
Example: Clicking on a link via selenium
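A minimal sketch of that approach (the button text and the fixed wait are assumptions; adjust them to the actual page):
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
import time

driver = webdriver.Chrome()
driver.get(url)  # the same url you passed to requests.get

# Keep clicking "Load More Ratings" until the button is gone
while True:
    try:
        button = driver.find_element(By.XPATH, "//button[contains(., 'Load More Ratings')]")
    except NoSuchElementException:
        break
    button.click()
    time.sleep(2)  # crude wait for the new comments to render

soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()
From there, your existing findAll loop works on soup unchanged.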
Since you're already using Requests, another option that might work is Requests-HTML, which also supports dynamic rendering by calling .html.render() on the response object; see the sketch after the references below.
Example: https://requests-html.kennethreitz.org/index.html#requests_html.HTML.render
Reference: Clicking link using beautifulsoup in python
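A minimal sketch of the Requests-HTML route (the selector comes from your snippet; note that the first render() call downloads Chromium):
from requests_html import HTMLSession

session = HTMLSession()
r = session.get(url)
r.html.render()  # executes the page's JavaScript in a headless browser
comments = [el.text for el in r.html.find('.Comments__StyledComments-dzzyvm-0.dEfjGB')]
Note that render() on its own only executes the page's initial JavaScript; to click the "Load More Ratings" button you would still need to pass a small script via its script parameter.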

Downloading files to google drive using beautifulsoup

I need to download files using BeautifulSoup to my Google Drive from a Colaboratory notebook.
I'm using the code below:
import urllib.request
from bs4 import BeautifulSoup

u = urllib.request.urlopen("https://s3.amazonaws.com/video.udacity-data.com/topher/2018/November/5bf32290_turnstile/turnstile.html")
html = u.read()
soup = BeautifulSoup(html, "html.parser")
links = soup.find_all('a')
I only need the links whose name contains '1706'. So I'm trying:
for link in links:
    files = link.get('href')
    if '1706' in files:
        urllib.request.urlretrieve(filelink, filename)
and it didn't work: "TypeError: argument of type 'NoneType' is not iterable". OK, I know why I get this error, but I don't know how to fix it or what is missing.
Using this
urllib.request.urlretrieve("https://s3.amazonaws.com/video.udacity-data.com/topher/2018/November/5bf32142_turnstile-170624/turnstile-170624.txt", 'turnstile-170624.txt')
I can get the individual files. But I want a way to download all of the files that contain '1706' and save them to my Google Drive.
How can I do this?
You can use an [attribute*=value] CSS selector, where * is the contains operator, to require that the href attribute value contains 1706:
links = [item['href'] for item in soup.select("[href*='1706']")]
Change from
soup.find_all('a')
to this instead:
soup.select('a[href]')
It will select only the a tags that have an href attribute, so files can never be None.
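Putting both together, a sketch of the whole download loop might look like this (it assumes the hrefs on that page are absolute URLs, as the example in the question suggests; if they are relative, resolve them with urllib.parse.urljoin first):
import urllib.request
from bs4 import BeautifulSoup

u = urllib.request.urlopen("https://s3.amazonaws.com/video.udacity-data.com/topher/2018/November/5bf32290_turnstile/turnstile.html")
soup = BeautifulSoup(u.read(), "html.parser")

# In Colab, mount your Drive first and save into it:
# from google.colab import drive; drive.mount('/content/drive')
for link in [item['href'] for item in soup.select("a[href*='1706']")]:
    file_name = link.split('/')[-1]
    urllib.request.urlretrieve(link, file_name)
    print('Saved', file_name)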

Website hiding page footer from parser

I am trying to find the donation button on the website of
The University of British Columbia.
The donation button is located at the page footer, within the div classed as "span7"
However, when scraped, the HTML yielded the div with nothing inside it.
My program works perfectly when the div is passed in directly as the source:
from bs4 import BeautifulSoup as bs
import re
site = '''<div class="span7" id="ubc7-footer-menu"><div class="row-fluid"><div class="span6"><h3>About UBC</h3><div>Contact UBC</div><div>About the University</div><div>News</div><div>Events</div><div>Careers</div><div>Make a Gift</div><div>Search UBC.ca</div></div><div class="span6"><h3>UBC Campuses</h3><div>Vancouver Campus</div><div>Okanagan Campus</div><h4>UBC Sites</h4><div>Robson Square</div><div>Centre for Digital Media</div><div>Faculty of Medicine Across BC</div><div>Asia Pacific Regional Office</div></div></div></'''
html = bs(site, 'html.parser')
link = html.find('a', string=re.compile('(?i)(donate|donation|gift)'))
#returns proper donation URL
However, using the live site does not work:
from bs4 import BeautifulSoup as bs
import requests
import re
site = requests.get('https://www.ubc.ca/')
html = bs(site.content, 'html.parser')
link = html.find('a', string=re.compile('(?i)(donate|donation|gift)'))
#returns none
Is there something wrong with my parser? Is it some sort of anti-scrape maneuver? Am I doomed?
I cannot seem to find the 'Donate' button on the URL that you provided, but there is nothing inherently wrong with your parser. It's just that the GET request that you send only gives you the HTML initially returned in the response, without waiting for the page to fully render.
It appears that parts of the page are filled in by JavaScript. You can use Splash, a service for rendering JavaScript-based pages. You can run Splash in Docker quite easily and just make HTTP requests to the Splash container, which will return HTML that looks just like the web page as rendered in a web browser.
Although this sounds overly complicated, it is actually quite simple to set up since you don't need to modify the Docker image at all, and you need no previous knowledge of Docker to get it to work. It requires just a single line from the command line to start a local Splash server:
docker run -p 8050:8050 -p 5023:5023 scrapinghub/splash
You then just modify any existing requests in your Python code so that they route through Splash instead:
i.e. http://example.com/ becomes
http://localhost:8050/render.html?url=http://example.com/
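In Python that just means sending your GET request to the Splash endpoint and parsing the returned HTML as before, for example:
import re

import requests
from bs4 import BeautifulSoup

# Ask the local Splash server to render the page, JavaScript included;
# 'wait' gives the page a couple of seconds to finish rendering
r = requests.get('http://localhost:8050/render.html',
                 params={'url': 'https://www.ubc.ca/', 'wait': 2})
html = BeautifulSoup(r.content, 'html.parser')
link = html.find('a', string=re.compile('(?i)(donate|donation|gift)'))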

Embed image in HTML r markdown document that can be shared

I have an R markdown document which is created using a shiny app, saved as a HTML. I have inserted a logo in the top right hand corner of the output, which has been done using the following code:
<script>
$(document).ready(function() {
  $head = $('#header');
  $head.prepend('<img src=\"FILEPATH/logo.png\" style=\"float: right;padding-right:10px;height:125px;width:250px\"/>');
});
</script>
However, when I save the HTML output and share it, the user of course cannot see the logo, since the code points to a file path that does not exist on their computer.
So, my question is: is there a way to include the logo in the output without using file paths? Ideally I don't want to upload the image to the web and change the source to a web address.
You can encode an image file to a data URI with knitr::image_uri. If you want to add it to your document, put the HTML code produced by the following command in your header instead of your script:
htmltools::img(src = knitr::image_uri("FILEPATH/logo.png"),
               alt = 'logo',
               style = 'float: right;padding-right:10px;height:125px;width:250px')

How to convert en-media to img when convert enml to html

I'm working with the Evernote API on iOS and want to translate ENML to HTML. How do I translate en-media to img? For example:
en-media:
<en-media type="image/jpeg" width="1200" hash="317ba2d234cd395150f2789cd574c722" height="1600" />
img:
<img src="imagePath"/>
I use Core Data to store information on iOS, so I can't give a local img file path to src. How do I deal with this problem?
The simplest way is to embed the image data using a Data URI:
1. Find the Evernote Resource associated with this hash code.
2. Build the following Data URI (sorry for the Java syntax, I'm not very familiar with Objective-C):
String imgUrl = "data:" + resource.getMime() + ";base64," + java.util.prefs.Base64.byteArrayToBase64(resource.getData().getBody());
3. Create an HTML img tag using the imgUrl from step 2.
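The same construction in Python, just to illustrate the shape of the URI (resource_mime and resource_body are hypothetical stand-ins for the resource's MIME type and raw bytes):
import base64

resource_mime = 'image/jpeg'  # hypothetical: what resource.getMime() returns
resource_body = b'...'        # hypothetical: the bytes from resource.getData().getBody()
img_url = 'data:' + resource_mime + ';base64,' + base64.b64encode(resource_body).decode('ascii')
img_tag = '<img src="' + img_url + '"/>'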
Note: the following solution will allow you to display the image outside of the note's content.
On this page, you'll find the following url template:
https://host.evernote.com/shard/shardId/res/GUID
First build the URL from a few variables, then point the HTML image src at that URL.
In Ruby, you might build the URL with a method similar to this one:
def resource_url
  "https://#{EVERNOTE_HOST}/shard/#{self.note.notebook.user.evernote_shard_id}/res/#{self.evernote_id}"
end
...where self references the resource, EVERNOTE_HOST is the host URL (e.g. sandbox.evernote.com), evernote_shard_id is the user's shardId, and evernote_id is the resource's guid.