Downloading files to google drive using beautifulsoup - html

I need to download files found with BeautifulSoup to my Google Drive from a Colaboratory notebook.
I'm using the code below:
u = urllib.request.urlopen("https://s3.amazonaws.com/video.udacity-data.com/topher/2018/November/5bf32290_turnstile/turnstile.html")
html = u.read()
soup = BeautifulSoup(html, "html.parser")
links = soup.find_all('a')
I only need the links whose name contains '1706'. So I'm trying:
for link in links:
    files = link.get('href')
    if '1706' in files:
        urllib.request.urlretrieve(filelink, filename)
But it didn't work: "TypeError: argument of type 'NoneType' is not iterable". I understand why this error occurs, but I don't know how to fix it or what is missing.
Using this
urllib.request.urlretrieve("https://s3.amazonaws.com/video.udacity-data.com/topher/2018/November/5bf32142_turnstile-170624/turnstile-170624.txt", 'turnstile-170624.txt')
I can get the individual files. But I want a way to download all the files whose names contain '1706' and save them to my Google Drive.
How can I do this?

You can use an attribute = value CSS selector with the * (contains) operator to require that the href attribute value contains 1706:
links = [item['href'] for item in soup.select("[href*='1706']")]

Change from
soup.find_all('a')
To this instead
soup.select('a[href]')
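It will select only a tags that have an href attribute. For a tag without one, link.get('href') returns None, and '1706' in None is what raises the TypeError.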
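As a minimal sketch of the filtering step, the helper below skips missing hrefs and pairs each matching URL with a local path. The function name plan_downloads, the example.com URL, and the target folder are assumptions made for illustration; only the S3 URL comes from the question.

```python
import posixpath
from urllib.parse import urlparse


def plan_downloads(hrefs, needle, target_dir):
    """Return (url, local_path) pairs for every href that contains `needle`."""
    plan = []
    for href in hrefs:
        # link.get('href') returns None for <a> tags without an href;
        # skipping those avoids "argument of type 'NoneType' is not iterable".
        if not href or needle not in href:
            continue
        # Use the last path segment of the URL as the local file name.
        filename = posixpath.basename(urlparse(href).path)
        if not filename:
            continue
        plan.append((href, posixpath.join(target_dir, filename)))
    return plan


# hrefs as they might come from [a.get('href') for a in soup.find_all('a')]
hrefs = [
    None,
    "https://s3.amazonaws.com/video.udacity-data.com/topher/2018/November/5bf32142_turnstile-170624/turnstile-170624.txt",
    "https://example.com/other/turnstile-180101.txt",  # hypothetical non-matching link
]
# The target folder is an assumed location inside a mounted Drive.
for url, path in plan_downloads(hrefs, "1706", "/content/drive/MyDrive/turnstile"):
    print(url, "->", path)
    # urllib.request.urlretrieve(url, path)  # uncomment to actually download
```

In Colab, Drive is usually mounted first with from google.colab import drive followed by drive.mount('/content/drive'). The target folder has to exist before urlretrieve writes into it.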

Related

Beautifulsoup scraping "lazy faded" images

I am looking for a way to parse the images on a web page. Many posts already exist on the subject, and I was inspired by many of them, in particular:
How Can I Download An Image From A Website In Python
The script presented in that post works very well, but I have encountered a type of image whose saving I can't automate. On the website, inspecting the web page gives me:
<img class="lazy faded" data-src="Uploads/Media/20220315/1582689.jpg" src="Uploads/Media/20220315/1582689.jpg">
And when I parse the page with Beautifulsoup4, I get this (fonts.gstatic.com Source section content) :
<a class="ohidden" data-size="838x1047" href="Uploads/Media/20220315/1582689.jpg" itemprop="contentUrl">
<img class="lazy" data-src="Uploads/Media/20220315/1582689.jpg" />
</a>
The given URL is not a bulk web URL that can be used to download the image from anywhere, but a link to the "Sources" section of the web page (Ctrl + Shift + I on the web page), where the image is.
When I hover over the src link in the source code of the website, I can get the true bulk URL under "Current source". This information is located in the Elements/Properties panel of the DevTools (Ctrl + Shift + I on the web page), but I don't know how to automate saving the images, either by using the link to the web page sources directly or by getting the bulk address to download the images. Do you have any ideas?
PS: I found this article about lazy fading images, but my HTML knowledge isn't enough to find a solution to my problem (https://davidwalsh.name/lazyload-image-fade)
I'm not too familiar with web scraping or the benefits. However, I found this article here that you can reference and I hope it helps!
Reference
However, here is the code and everything you need in one place.
First, choose the web page you want to download the images from.
Next, collect the links: create an empty list, open the page, select the a tags, loop through them, and append each href to the list.
import re
import urllib.request
from pathlib import Path
import requests
from bs4 import BeautifulSoup
url = ""  # address of the page you chose
output_folder = Path("images")  # assumed download folder; it must already exist
link_list = []
response = urllib.request.urlopen(url)
soup = BeautifulSoup(response, "html.parser")
image_list = soup.select('div.boxmeta.clearfix > h2 > a')
for image_link in image_list:
    link_url = image_link.attrs['href']
    link_list.append(link_url)
In theory, this finds every a tag matched by the selector and appends its href to the list.
Now we have to get the img tags from each linked page.
for page_url in link_list:
    page_html = urllib.request.urlopen(page_url)
    page_soup = BeautifulSoup(page_html, "html.parser")
    img_list = page_soup.select('div.seperator > a > img')
This should find every div with the class seperator, then an a tag directly inside it, and then the img tag directly inside that.
    # still inside the "for page_url" loop
    for img in img_list:
        img_url = img.attrs['src']
        file_name = re.search(".*/(.*png|.*jpg)$", img_url)
        save_path = output_folder.joinpath(file_name.group(1))
Now we try to download the image data inside a try/except block.
        # still inside the "for img" loop
        try:
            image = requests.get(img_url)
            with open(save_path, 'wb') as f:
                f.write(image.content)
            print(save_path)
        except ValueError:
            print("ValueError!")
I think you are talking about relative and absolute paths.
Something like Uploads/Media/20220315/1582689.jpg is a relative path.
The main difference between absolute and relative paths is that absolute URLs always include the domain name of the site with http://www. Relative links show the path to the file or refer to the file itself. A relative URL is useful within a site to transfer a user from point to point within the same domain. --- ref.
So in your case try this:
import requests
from bs4 import BeautifulSoup
from PIL import Image
URL = 'YOUR_URL_HERE'
r = requests.get(URL)
soup = BeautifulSoup(r.text, 'html.parser')
for img in soup.find_all("img"):
    # Lazy-loaded images keep their path in data-src; fall back to src
    relative_path = img.get('data-src') or img.get('src')
    if not relative_path:
        continue
    # Get the image absolute path url
    absolute_path = requests.compat.urljoin(URL, relative_path)
    # Download the image
    image = Image.open(requests.get(absolute_path, stream=True).raw)
    image.save(absolute_path.split('/')[-1].split('?')[0])
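The relative-to-absolute step can be checked on its own, without any network access. This is only a sketch: the helper name resolve_image_url and the example.com page address are assumptions, while the data-src value is the one from the question.

```python
from urllib.parse import urljoin


def resolve_image_url(page_url, img_attrs):
    """Return the absolute URL for an <img>, or None if it has no usable path."""
    # Lazy-loading scripts usually keep the real path in data-src and may
    # leave src unset until the image is shown, so check data-src first.
    relative = img_attrs.get("data-src") or img_attrs.get("src")
    if not relative:
        return None
    # urljoin resolves the relative path against the page's own address.
    return urljoin(page_url, relative)


# Hypothetical page address; the attribute value is taken from the question.
page_url = "https://www.example.com/gallery/index.html"
attrs = {"class": "lazy", "data-src": "Uploads/Media/20220315/1582689.jpg"}
print(resolve_image_url(page_url, attrs))
```

A path without a leading slash is resolved relative to the directory of the page, so the page address passed in should be the one the HTML was actually fetched from.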

How to save a text as .txt file via a downloadable link?

HTML newbie here
I am working on an app using Streamlit. Based on the user inputs to the fields available, I am generating some data which I want to download in the form of a .txt file.
The data that I want to download is generated when I do
to_save = abc.serialize().encode("ascii", "ignore")
and when I do print(to_save), I get (this is just a small part of a very huge text data)
b"UNA:+.?
'UNB+UNOC:3+9978715000006:14+9978715000006:14+200529:1139+50582307060_WP?+_200101_200201++TL'UNH+1+MSCONS:D:04B:UN:2.3'BGM+7+50582307060_WP?+_200101_200201-1+9'DTM+137:202005291139:203'RFF+Z13:13008'NAD+MS+9978715000006::9'CTA+IC+:Michael
Jordan'COM+m.jordan#energycortex.com:EM'NAD+MR+9978715000006::9'"
Now, I want to save this information as a .txt file via an HTML link. I am following:
How to download a file in Streamlit
How to force fully download txt file on link?
and I have
reference = "50582307060_WP+_200101_200201"
to_save = abc.serialize().encode("ascii", "ignore")
href = f'<a href="data:text/plain;charset=UTF-8,{to_save}" download={reference}.txt>Download File</a> (right-click and save as {reference}.txt)'
st.markdown(href, unsafe_allow_html=True)
But this doesn't work; it rendered as shown in two screenshots, "The start" and "The end", which are not reproduced here.
and when I do:
to_save = abc.serialize().encode("ascii", "ignore")
href = f'<a href="data:text/plain;charset=UTF-8" download={reference}.txt>Download File</a> (right-click and save as {reference}.txt)'
st.markdown(href, unsafe_allow_html=True)
I get the result shown in a screenshot that is not reproduced here.
The problem is that the information that has to be saved as a .txt file (to_save = abc.serialize().encode("ascii", "ignore")) isn't being saved, and I get a Failed-Network error.
What mistake am I making, and how can I offer the information stored in to_save (to_save = abc.serialize().encode("ascii", "ignore")) as a downloadable HTML link? Also, the file should be saved as 'reference.txt', with reference being the variable defined above.
I reckon I have found a solution to your problem, though I can't be entirely sure. It has two causes. The first one lies in the href attribute of the download link: the problem is the " (double quotes) in the to_save variable's data. With the data you provided, the HTML renders as follows:
<a href="data:text/plain;charset=UTF-8,b"UNA:+.? 'UNB+UNOC:3+9978715000006:14+9978715000006:14+200529:1139+50582307060_WP?+_200101_200201++TL'UNH+1+MSCONS:D:04B:UN:2.3'BGM+7+50582307060_WP?+_200101_200201-1+9'DTM+137:202005291139:203'RFF+Z13:13008'NAD+MS+9978715000006::9'CTA+IC+:Michael Jordan'COM+m.jordan#energycortex.com:EM'NAD+MR+9978715000006::9'"" download=filename.txt>Download File</a>
As you can see, the value of the href attribute isn't all in blue (here in the Stack Overflow code container above). That is because of the " characters that interrupt the string: the first one closes the " opened after href=. To prevent this behaviour, you should replace each " in to_save with &quot;. To the user this will look the same as ", but the browser will treat it as part of the string.
You should add the following line of code to your Python script to make that happen:
# encode() returns bytes, and an f-string would embed its b"..." form, so decode back to str
to_save = abc.serialize().encode("ascii", "ignore").decode("ascii")
# add this line:
to_save = to_save.replace('"', '&quot;')
Next, the download attribute doesn't have any double quotes around its value. Formally, it should look like this: download="filename.txt". Then again, for safety replace any possible " with &quot;.
The full Python code should now look like this:
reference = "50582307060_WP+_200101_200201"
reference = reference.replace('"', '&quot;')
to_save = abc.serialize().encode("ascii", "ignore").decode("ascii")
to_save = to_save.replace('"', '&quot;')
href = f'<a href="data:text/plain;charset=UTF-8,{to_save}" download="{reference}.txt">Download File</a> (right-click and save as {reference}.txt)'
st.markdown(href, unsafe_allow_html=True)
Hope this helps! If not, please comment.
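One way to avoid quoting problems in the href altogether is to base64-encode the payload, which is a different technique from the &quot; replacement described above. The sketch below assumes the serialized data is available as a str; the helper name make_download_link and the shortened sample text are made up for illustration.

```python
import base64
import html


def make_download_link(text, filename, label="Download File"):
    """Build an <a> tag whose href is a base64 data URI containing `text`."""
    # Base64 output uses only A-Z, a-z, 0-9, +, / and =, so it cannot
    # close the surrounding href="..." attribute early.
    payload = base64.b64encode(text.encode("utf-8")).decode("ascii")
    # Escape the file name because it is placed inside a quoted attribute.
    safe_name = html.escape(filename, quote=True)
    return (
        f'<a href="data:text/plain;charset=utf-8;base64,{payload}" '
        f'download="{safe_name}">{html.escape(label)}</a>'
    )


reference = "50582307060_WP+_200101_200201"
# Shortened stand-in for the real serialized message.
link = make_download_link("UNA:+.? 'UNB+UNOC:3'", f"{reference}.txt")
print(link)
# In the Streamlit app: st.markdown(link, unsafe_allow_html=True)
```

Recent Streamlit versions also provide st.download_button, which may make the hand-built link unnecessary.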

Opening html file from hard drive and doing xpath search on it

I have an html file on my HD that I want to do an xpath search on like you do when scraping a website.
I have used the following code to scrape from websites:
from lxml import html
import requests
response = requests.get('http://www.website.com/')
if (response.status_code == 200):
    pagehtml = html.fromstring(response.text)
    for elt in pagehtml.xpath('//div[@class="content"]/ul/li/a'):
        print("**",'"',elt.text_content(),'"',"****", elt.attrib['href'])
Now this works well when getting something from a website, but how do I go about it when the HTML file is on my HD? I have tried about 10 things, and at the moment my code looks like this:
with open(r'website.html', 'rb') as infile:
    data = infile.read()
    for elt in data.xpath('//h3/a'):
        print("**",'"',elt.text_content(),'"',"****", elt.attrib['href'])
I keep getting different errors and sometimes '_io.BufferedReader' errors, but I just don't get the code right.
Any suggestions? Regards
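You could use the following code:
from lxml import html
pagehtml = html.parse('index.html')
for elt in pagehtml.xpath('//a'):
    print("**",'"',elt.text_content(),'"',"****", elt.attrib['href'])
In your code, data is a plain bytes object, which has no xpath method. html.parse reads the file and returns a parsed tree that you can query, and it makes sure that the decoding of the file data is handled automatically.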

LOCAL HTML file to generate a text file

I am trying to generate a TEXT/XML file from a LOCAL HTML file. I know there are a lot of answers to generating a file locally, usually suggesting using ActiveX object or HTML 5.
I'm guessing there is a way to make it work on all browsers (in the end HTML extension is opened by a browser even if it is a LOCAL file) and easily since this is a LOCAL file put in by user himself.
My HTML file will be on client's local machine not accessed via HTTP.
It is basically just a form written in HTML that upon "SAVE" command should be generating an XML file in the local disk (anywhere user decides) and saving form's content in.
Any good way?
One way that I can think of is, the html form elements can be set into class variables and then using the jaxb context you can create an XML file out of it.
Useful Link: http://www.vogella.com/tutorials/JAXB/article.html
What you can do is use base64 data-urls (no support for IE9-) to download the file:
First you need to create a temporary iframe element for your file to download in:
var ifrm = document.createElement('iframe');
ifrm.style.display = 'none';
document.body.appendChild(ifrm);
Then you need to define what you want the contents of the file to download to be, and convert it to a base64 data-url:
var html = '<!DOCTYPE html><html><head><title>Foo</title></head><body>Hello World</body></html>';
var htmlurl = btoa(html);
and set it as source for the iframe
ifrm.src = 'data:text/x-html;base64,'+htmlurl;

How to convert en-media to img when convert enml to html

I'm working with the Evernote API on iOS and want to convert ENML to HTML. How do I convert en-media to img? For example:
en-media:
<en-media type="image/jpeg" width="1200" hash="317ba2d234cd395150f2789cd574c722" height="1600" />
img:
<img src="imagePath"/>
I use core data to save information on iOS. So I can't give the local path of img file to "src = ". How to deal with this problem?
The simplest way is embedding image data using Data URI:
1. Find the Evernote Resource associated with this hash code.
2. Build the following Data URI (sorry for the Java syntax, I'm not very familiar with Objective-C):
String imgUrl = "data:" + resource.getMime() + ";base64," + java.util.Base64.getEncoder().encodeToString(resource.getData().getBody());
3. Create the HTML img tag using imgUrl from (2).
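The three steps above can be sketched in Python as follows. This is only an illustration, not Evernote SDK code: the function name en_media_to_img and the resources mapping are assumptions, and it relies on my understanding that the hash attribute is the hex MD5 digest of the resource body.

```python
import base64
import hashlib
import re

# Matches <en-media .../> or <en-media ...></en-media> and captures the attributes.
EN_MEDIA = re.compile(r"<en-media\b([^>]*?)/?>(?:\s*</en-media>)?", re.IGNORECASE)
ATTR = re.compile(r'([\w-]+)="([^"]*)"')


def en_media_to_img(enml, resources):
    """Replace each <en-media> tag with an <img> whose src is a data URI.

    `resources` maps a hex MD5 hash to a (mime_type, body_bytes) tuple.
    """
    def replace(match):
        attrs = dict(ATTR.findall(match.group(1)))
        resource = resources.get(attrs.get("hash", ""))
        if resource is None:
            return match.group(0)  # leave unknown media untouched
        mime, body = resource
        data = base64.b64encode(body).decode("ascii")
        return f'<img src="data:{mime};base64,{data}"/>'

    return EN_MEDIA.sub(replace, enml)


body = b"fake image bytes"  # stand-in for the real JPEG data
key = hashlib.md5(body).hexdigest()
enml = f'<div><en-media type="image/jpeg" hash="{key}" /></div>'
print(en_media_to_img(enml, {key: ("image/jpeg", body)}))
```

Because the image bytes end up inside the HTML string itself, no local file path is needed, which fits the case where the data lives in Core Data.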
Note: the following solution will allow you to display the image outside of the note's content.
On this page, you'll find the following url template:
https://host.evernote.com/shard/shardId/res/GUID
First build the URL from a few variables, then set the HTML img src to that URL.
In Ruby, you might build the URL with a method similar to this one:
def resource_url
  "https://#{EVERNOTE_HOST}/shard/#{self.note.notebook.user.evernote_shard_id}/res/#{self.evernote_id}"
end
...where self references the resource, EVERNOTE_HOST is the host name (e.g. sandbox.evernote.com), evernote_shard_id is the user's shardId, and evernote_id is the resource's GUID.