Download a file from a webpage using Python

I need to download a file from a webpage every two weeks. Each time it is a new file, so the name changes too, but only the last three characters change; the first part, "Vermeldung %%%", stays the same. After that I need to send it to someone via email. Could someone help me accomplish that?
This is the code I have right now:
import requests
from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector

url = 'https://worbis-kirche.de/downloads?view=document&id=339:vermeldungen-kw-9&catid=61'

parser = 'html.parser'  # or 'lxml' (preferred) or 'html5lib', if installed
resp = requests.get(url)
http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
encoding = html_encoding or http_encoding
soup = BeautifulSoup(resp.content, parser, from_encoding=encoding)

for link in soup.find_all('a', href=True):
    print(link['href'])
It gives me all the links I need, but how do I tell the program which link to download? The link that needs to be downloaded is /downloads?view=document&id=339&format=raw

I think you need to get this link:
https://worbis-kirche.de/downloads?view=document&id=339&format=raw
So, you could just do this:
import os
import shutil
...
filename = 'vermeldungen.pdf'      # the name you want for the PDF
directory = '/path/to/downloads/'  # your preferred downloads folder, ending with a slash
for link in soup.find_all('a', href=True):
    myLink = link['href']
    if not myLink.endswith('format=raw'):  # skip everything but the raw-download link
        continue
    # myLink is now /downloads?view=document&id=339&format=raw
    myLink = "https://worbis-kirche.de" + myLink
    r = requests.get(myLink, stream=True)  # download it
    r.raw.decode_content = True
    with open(filename, "wb") as f:
        shutil.copyfileobj(r.raw, f)
    try:
        shutil.move(os.getcwd() + "/" + filename, directory + filename)
    except Exception as e:
        print(e, ": file couldn't be transferred")
I hope I answered your question...
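Since the file is new every two weeks and only the end of the name changes, a more robust approach (a sketch of my own, not part of the answer above; the SMTP host, login, and addresses are placeholders) is to pick the link whose text starts with "Vermeldung" and then email the downloaded file:

import smtplib
from email.message import EmailMessage

import requests
from bs4 import BeautifulSoup

base = 'https://worbis-kirche.de'
resp = requests.get(base + '/downloads?view=document&id=339:vermeldungen-kw-9&catid=61')
soup = BeautifulSoup(resp.content, 'html.parser')

# Pick the first link whose visible text starts with "Vermeldung",
# so the changing suffix of the name does not matter.
target = next((a for a in soup.find_all('a', href=True)
               if a.get_text(strip=True).startswith('Vermeldung')), None)

if target is not None:
    data = requests.get(base + target['href']).content
    filename = target.get_text(strip=True) + '.pdf'  # assumes the link text works as a file name
    with open(filename, 'wb') as f:
        f.write(data)

    # Send the file as an attachment (placeholder addresses and credentials).
    msg = EmailMessage()
    msg['Subject'] = filename
    msg['From'] = 'me@example.com'
    msg['To'] = 'recipient@example.com'
    msg.add_attachment(data, maintype='application', subtype='pdf', filename=filename)
    with smtplib.SMTP_SSL('smtp.example.com') as smtp:
        smtp.login('me@example.com', 'app-password')
        smtp.send_message(msg)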

Related

Creating an exe file for data exchange with a server in Tcl

I am completely lost and I do not know how to approach the following problem, which my boss assigned to me.
I have to create an exe file containing code which works as follows when I run it: it sends a certain file, say file_A, to a server. When the server receives this file, it sends back a JSON file, say file_B, which contains a URL. More precisely, an attribute of the JSON file contains the URL. The code should then open the URL in a browser.
And here are the details:
The above code (one version in Tcl) must accept three parameters and a fourth optional parameter (so it is not necessary to pass a fourth parameter). The three parameters are: server, type and file.
server: the path to the server, for example https://localhost:0008.
type: the type of the file (file_A) to be sent to the server: xdt / png
file: the path to the file (file_A) to be sent to the server.
The fourth optional parameter is:
wksName: if this parameter is given, the URL should be opened with it in the browser.
I got an example of the above procedure written in Python; it should serve as an orientation. I do not know anything about this language, but to a large extent I understand the code. It looks as follows:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import platform
import sys
import webbrowser

import requests

args_dict = {}
for arg in sys.argv[1:]:
    if '=' in arg:
        sep = arg.find('=')
        key, value = arg[:sep], arg[sep + 1:]
        args_dict[key] = value

server = args_dict.get('server', 'http://localhost:0008')
request_url = server + '/NAME_List_number'
type = args_dict.get('type', 'xdt')
file = args_dict.get('file', 'xdtExamples/xdtExample.gdt')
wksName = args_dict.get('wksName', platform.node())

try:
    with open(file, 'rb') as f:
        try:
            r = requests.post(request_url, data={'type': type}, files={'file': f})
            request_url = r.json()['url'] + '?wksName=' + wksName
            webbrowser.open(url=request_url, new=2)
        except Exception as e:
            print('Error:', e)
            print('Request url:', request_url)
except:
    print('File \'' + file + '\' not found')
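For reference, my reading of the argument loop above (the script name is a placeholder): it is invoked with key=value pairs, the fourth being optional:

python script.py server=https://localhost:0008 type=xdt file=xdtExamples/xdtExample.gdt wksName=myWorkstation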
As you can see, the crucial part of the above code is this:
try:
    with open(file, 'rb') as f:
        try:
            r = requests.post(request_url, data={'type': type}, files={'file': f})
            request_url = r.json()['url'] + '?wksName=' + wksName
            webbrowser.open(url=request_url, new=2)
        except Exception as e:
            print('Error:', e)
            print('Request url:', request_url)
except:
    print('File \'' + file + '\' not found')
Everything above it is just definitions. If possible, I would like to translate the above code into Tcl. Could you please help me with this issue?
It doesn't have to be a 1:1 "Tcl translation", as long as it works as described above, and hopefully it is as simple as the Python version.
I am not familiar with concepts such as sending/receiving data to/from servers, reading JSON files, etc.
Any help is welcome.

Exiftool export JSON with Python

I'm trying to extract some metadata and store it in a JSON file using ExifTool via Python.
If I run the following command (per the documentation) in CMD, it works fine and generates a temp.json file:
exiftool -filename -createdate -json C:/Users/XXX/Desktop/test_folder > C:/Users/XXX/Desktop/test_folder/temp.json
When running ExifTool from Python, the data is extracted correctly, but no JSON file is generated.
import os
import subprocess

root_path = 'C:/Users/XXX/Desktop/test_folder'

for path, dirs, files in os.walk(root_path):
    for file in files:
        file_path = path + os.sep + file
        exiftool_exe = 'C:/Users/XXX/Desktop/exiftool.exe'
        json_path = path + os.sep + 'temp.json'
        export = os.path.join(path + ' > ' + json_path)
        exiftool_command = [exiftool_exe, '-filename', '-createdate', '-json', export]
        process = subprocess.run(exiftool_command)
        print(process.stdout)
When I run the code it shows the error:
Error: File not found - C:/Users/XXX/Desktop/test_folder > C:/Users/XXX/Desktop/test_folder/temp.json
What am I missing? Any ideas on how to get it to work? Thanks!
Edit with the solution:
I'll leave the fixed code here in case it helps someone else:
import os
import subprocess

root_path = 'C:/Users/XXX/Desktop/test_folder'

for path, dirs, files in os.walk(root_path):
    for file in files:
        file_path = path + os.sep + file
        exiftool_exe = 'C:/Users/XXX/Desktop/exiftool.exe'
        export = root_path + os.sep + 'temp.json'
        # -W+! appends all output to a single file instead of relying on shell redirection
        exiftool_command = [exiftool_exe, file_path, '-filename', '-createdate', '-json', '-W+!', export]
        process = subprocess.run(exiftool_command)
        print(process.stdout)
Thanks to StarGeek!
I believe the problem is that file redirection (>) is a shell feature and isn't available with subprocess.run. See this StackOverflow question.
For an exiftool solution, you would use the -W (-tagOut) option, specifically -W+! C:/Users/XXX/Desktop/test_folder/temp.json. See Note #3 under that link.
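Alternatively, a sketch of my own (assuming Python 3.7+ for capture_output, and the same paths as in the question): let Python do the redirection by capturing ExifTool's stdout and writing it to the JSON file yourself:

import subprocess

exiftool_exe = 'C:/Users/XXX/Desktop/exiftool.exe'
folder = 'C:/Users/XXX/Desktop/test_folder'
json_path = folder + '/temp.json'

# capture_output=True collects stdout in the result instead of relying on a shell's '>'
process = subprocess.run(
    [exiftool_exe, '-filename', '-createdate', '-json', folder],
    capture_output=True, text=True)

with open(json_path, 'w') as f:
    f.write(process.stdout)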

How can I use web scraping with Python to download multiple images? So far I've only managed to make it work with one image

I'm reading the book Automate the Boring Stuff with Python, and this is the practice project in chapter 12 (web scraping). I tried everything but only managed to make the code work with one image, not all the images from the page. Any ideas? Here is my code so far:
# Goes to an image site (in this case a blog), searches for a category of photos
# (in this case the comics) and downloads all the resulting images
import requests, os, bs4

# Downloads the URL.
url = 'http://cersibonforever.blogspot.com/'
res = requests.get(url)
res.raise_for_status()

# Create the folder in path.
os.makedirs('cersibon', exist_ok=True)

# Find the images.
soup = bs4.BeautifulSoup(res.text, 'html.parser')
post = soup.select('img[src]')
post = post[0].get('src')
if post == []:
    print('Could not find image.')
# Download the images.
else:
    res = requests.get(post)
    res.raise_for_status()
    # Saves the images to cersibon folder.
    print('Downloading %s to folder...' % (post))
    imageFile = open(os.path.join('cersibon', 'tirinha.jpg'), 'wb')
    for chunk in res.iter_content(100000):
        imageFile.write(chunk)
    imageFile.close()
print('Done.')
You've done everything right, but instead of selecting just the first post with post = post[0].get('src'), you should iterate over them all, downloading every image from its own source:
import requests, os, bs4

# Downloads the URL.
url = 'http://cersibonforever.blogspot.com/'
res = requests.get(url)
res.raise_for_status()

# Create the folder in path.
os.makedirs('cersibon', exist_ok=True)

# Find the images.
soup = bs4.BeautifulSoup(res.text, 'html.parser')
all_post = soup.select('img[src]')

for post in all_post:
    src = post.get('src')
    if not src:
        print('Could not find image.')
    else:
        # Download the image.
        res = requests.get(src)
        res.raise_for_status()
        # Save each image under its own name in the cersibon folder;
        # a fixed name like 'tirinha.jpg' would be overwritten on every pass.
        print('Downloading %s to folder...' % src)
        with open(os.path.join('cersibon', os.path.basename(src)), 'wb') as imageFile:
            for chunk in res.iter_content(100000):
                imageFile.write(chunk)
print('Done.')
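One caveat not covered in the answer: img src values can be relative or protocol-relative (starting with //). A small addition you could drop into the loop above, using urljoin to resolve them against the page URL before downloading:

from urllib.parse import urljoin

# inside the loop, before requests.get(src):
src = urljoin(url, src)  # turns 'relative.jpg' or '//host/img.jpg' into a full URL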

Download files from FTP or HTTP using filenames in CSV - Python 3

I have a CSV file of products for an e-commerce site I'm working on, as well as FTP access to the corresponding images for each product (~15K products).
I would like to use Python to pull only the images listed in the CSV from either FTP or HTTP and save them locally.
import urllib.request
import urllib.parse
import re

url = 'http://www.fakesite.com/pimages/filename.jpg'
split = urllib.parse.urlsplit(url)
filename = split.path.split("/")[-1]
urllib.request.urlretrieve(url, filename)
print(filename)

saveFile = open(filename, 'r')
saveFile.close()
import csv

with open('test.csv') as csvfile:
    readCSV = csv.reader(csvfile, delimiter=",")
    images = []
    for row in readCSV:
        image = row[14]
        print(image)
The code I currently have can pull the filename from the URL and save the file under that filename. It can also pull the filename of the image from the CSV file (filename and image are exactly the same). What I need it to do is append the filename from the CSV to the end of the URL, and then save that file under the filename.
I have graduated to this:
import urllib.request
import urllib.parse
import re
import os
import csv

with open('test.csv') as csvfile:
    readCSV = csv.reader(csvfile, delimiter=",")
    images = []
    for row in readCSV:
        image = row[14]
        images.append(image)

x = 'http://www.fakesite.com/pimages/'
url = os.path.join(x, image)
split = urllib.parse.urlsplit(url)
filename = split.path.split("/")[-1]
urllib.request.urlretrieve(url, filename)

saveFile = open(filename, 'r')
saveFile.close()
Now this is great. It works perfectly. It pulls the correct filename out of the CSV file, appends it to the end of the URL, downloads the file, and saves it under that filename.
However, I can't seem to figure out how to make this work for more than one line of the CSV file. As of now, it only takes the last line and pulls the relevant information. Ideally, it would go through the CSV file with all of the products on it and download every single image, not just the last one.
You're doing strange things ...
import urllib.request
import csv

# the images list should be outside the with block
images = []
IMAGE_COLUMN = 14

with open('test.csv') as csvfile:
    # read csv
    readCSV = csv.reader(csvfile, delimiter=",")
    for row in readCSV:
        # I guess 14 is the column index of the image name like 'image.jpg';
        # I've put it in a constant
        # now append all the image names to the list
        images.append(row[IMAGE_COLUMN])
        # no need for the following
        # image = row[14]
        # images.append(image)

# make sure root_url ends with a slash
# x was a strange name for a URL
root_url = 'http://www.fakesite.com/pimages/'

# iterate through the list
for image in images:
    # you don't need os.path.join, because that's operating-system dependent
    # you don't need urlsplit, because you built the URL yourself
    # you don't need to split off the filename, as it is the image name
    # with the following line, root_url must end with a slash
    url = root_url + image
    # urlretrieve saves the file into the current directory under the image name
    urllib.request.urlretrieve(url, image)
or in short, that's all you need:
import urllib.request
import csv

IMAGE_COLUMN = 14
ROOT_URL = 'http://www.fakesite.com/pimages/'

images = []
with open('test.csv') as csvfile:
    readCSV = csv.reader(csvfile, delimiter=",")
    for row in readCSV:
        images.append(row[IMAGE_COLUMN])

for image in images:
    url = ROOT_URL + image
    urllib.request.urlretrieve(url, image)
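With roughly 15K products, a few broken links are likely. A small sketch of my own (not part of the answer above) that skips failures instead of aborting the whole run:

import urllib.request
import urllib.error

for image in images:
    url = ROOT_URL + image
    try:
        urllib.request.urlretrieve(url, image)
    except urllib.error.URLError as e:
        # log and continue instead of stopping at the first missing file
        print('Skipping %s: %s' % (url, e))

urlretrieve also accepts ftp:// URLs, so the same loop works if ROOT_URL points at the FTP server instead.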

ipython notebook .png figures after nbconvert not loaded by latest chrome/firefox

Running ipython3 notebook --pylab=inline locally, I saved a simple notebook with a small PNG figure using pylab and Python 3.3.
Contents of notebook cell:
from pylab import *
x = linspace(0, 5, 10)
y = x ** 2
figure()
plot(x, y, 'r')
xlabel('x')
ylabel('y')
title('title')
show()
Running the cell resulted in an inline PNG figure being displayed.
The saved file (my_notebook.ipynb) has a .png saved as a data uri:
{ ..., "png":"iVBO...ZUmwK\n...", ... }
After executing the command:
ipython3 nbconvert --to html my_notebook.html
my_notebook.html is generated with the figure as a data URI like this:
<img src="data:image/png;base64,b'iVBO...ZUmwk\n..." >
In the latest Chrome or Firefox, the image data URI does not load/display when opening file:///.../my_notebook.html locally, and the Chrome console reports 'failed to load resource' for the img tag.
I have had the same results with images loaded and then displayed with imshow().
The figures appear fine in the notebook. It is only after nbconvert to HTML that they do not display at all.
(Notice the escaped newline in the image data URI; I tried replacing all escaped newlines in the data string with actual newlines, with no change in results.)
How can I get PNG figures to display in an nbconverted HTML version of an IPython notebook opened locally (file:///.../my_notebook.html) in a browser?
(I would rather not have to save each figure and hand-modify the converted HTML to reference the saved figure on disk.)
EDIT:
versions:
python 3.3.1
ipython==1.0.0
matplotlib==1.2.1
Pillow==2.1.0 (PIL)
Install BeautifulSoup4 first:
pip install BeautifulSoup4
Then use the following function to freeze your generated HTML file. The images will be placed in an images folder in the same directory as the HTML file.
import os
import re
import base64
from uuid import uuid4

from bs4 import BeautifulSoup as BS

def dump(path, data):
    root = os.path.dirname(path)
    if not os.path.exists(root):
        os.makedirs(root)
    with open(path, 'wb') as f:
        f.write(data)
    # for windows
    return path.replace('\\', '/')

def freeze_html(path):
    '''pass in absolute path of your html'''
    root = os.path.dirname(path)
    with open(path, 'rb') as f:
        soup = BS(f.read())
    for img in soup.find_all('img'):
        m = re.search(r"data:image/png;base64,b'(.*)'", img['src'])
        if m:
            iname = uuid4()
            ipath = os.path.join(root, 'images', '%s.png' % iname)
            # remove '\n'
            s = m.group(1).replace(r'\n', '')
            img['src'] = os.path.relpath(
                dump(ipath, base64.b64decode(s.encode('ascii'))),
                root
            )
    with open(path, 'wb') as f:
        f.write(soup.encode('utf-8'))
If you do not need to further convert it to TeX or PDF, you can just write the string (\n removed) back to img['src'] (with the data:image/png;base64, prefix):
import re
from bs4 import BeautifulSoup as BS

def freeze_html(path):
    '''pass in absolute path of your html'''
    with open(path, 'rb') as f:
        soup = BS(f.read())
    for img in soup.find_all('img'):
        m = re.search(r"data:image/png;base64,b'(.*)'", img['src'])
        if m:
            # remove '\n'
            s = m.group(1).replace(r'\n', '')
            img['src'] = 'data:image/png;base64,' + s
    with open(path, 'wb') as f:
        f.write(soup.encode('utf-8'))
I prefer to save the PNGs to separate files because that is friendlier to xelatex.
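A quick usage sketch (the path is a placeholder; either variant of the function expects the absolute path of the converted file):

freeze_html('/absolute/path/to/my_notebook.html')  # rewrites the HTML in place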