ipython notebook .png figures after nbconvert not loaded by latest chrome/firefox - html

Running ipython3 notebook --pylab=inline locally, I saved a simple notebook with a small PNG figure using pylab and Python 3.3.
Contents of notebook cell:
from pylab import *
x = linspace(0, 5, 10)
y = x ** 2
figure()
plot(x, y, 'r')
xlabel('x')
ylabel('y')
title('title')
show()
Running the cell resulted in an inline PNG figure being displayed.
The saved file (my_notebook.ipynb) has the .png saved as a data URI:
{ ..., "png":"iVBO...ZUmwK\n...", ... }
After executing the command:
ipython3 nbconvert --to html my_notebook.ipynb
my_notebook.html is generated with the figure as a data URI like this:
<img src="data:image/png;base64,b'iVBO...ZUmwk\n..." >
In the latest Chrome or Firefox, the image data URI does not load/display when opening file:///.../my_notebook.html locally, and the Chrome console reports 'Failed to load resource' for the img tag.
I have had the same results with images loaded and then displayed with imshow().
The figures appear fine in the notebook. It is only after nbconvert to HTML that they do not display (at all).
(Notice the escaped newline in the image data URI. I tried replacing all escaped newlines in the data string with actual newlines, with no change in results.)
How can I get PNG figures to display in the nbconvert-generated HTML version of an IPython notebook opened locally ("file:///.../my_notebook.html") in a browser?
(I would rather not have to save each figure and hand-modify the converted HTML to reference the saved figures on disk.)
EDIT:
versions:
python 3.3.1
ipython==1.0.0
matplotlib==1.2.1
Pillow==2.1.0 (PIL)
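The stray b'...' wrapper in the src attribute above is a Python 3 bytes repr that nbconvert wrote into the data URI; browsers reject the payload because of it, not only because of the escaped newlines, so both have to be stripped. A minimal dependency-free sketch of that fix (the answer below does the same more robustly with BeautifulSoup):
import re

with open('my_notebook.html', encoding='utf-8') as f:
    html = f.read()

# strip the b'...' bytes repr and the literal \n escapes from every PNG data URI
html = re.sub(
    r"data:image/png;base64,b'(.*?)'",
    lambda m: 'data:image/png;base64,' + m.group(1).replace(r'\n', ''),
    html,
)

with open('my_notebook.html', 'w', encoding='utf-8') as f:
    f.write(html)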

Install BeautifulSoup4 first:
pip install BeautifulSoup4
Then use the following function to freeze your generated HTML file. The images will be placed in an images folder under the same directory as the HTML file.
import os
import re
import base64
from bs4 import BeautifulSoup as BS
from uuid import uuid4

def dump(path, data):
    root = os.path.dirname(path)
    if not os.path.exists(root):
        os.makedirs(root)
    with open(path, 'wb') as f:
        f.write(data)
    # for windows
    return path.replace('\\', '/')

def freeze_html(path):
    '''pass in absolute path of your html'''
    root = os.path.dirname(path)
    with open(path, 'rb') as f:
        soup = BS(f.read())
    for img in soup.find_all('img'):
        m = re.search(r"data:image/png;base64,b'(.*)'", img['src'])
        if m:
            iname = uuid4()
            ipath = os.path.join(root, 'images', '%s.png' % iname)
            # remove '\n'
            s = m.group(1).replace(r'\n', '')
            img['src'] = os.path.relpath(
                dump(ipath, base64.b64decode(s.encode('ascii'))),
                root
            )
    with open(path, 'wb') as f:
        f.write(soup.encode('utf-8'))
If you do not need to further convert it to TeX or PDF, you can just write the string (with \n removed) back to img['src'] (keeping the data:image/png;base64, prefix):
import re
from bs4 import BeautifulSoup as BS

def freeze_html(path):
    '''pass in absolute path of your html'''
    with open(path, 'rb') as f:
        soup = BS(f.read())
    for img in soup.find_all('img'):
        m = re.search(r"data:image/png;base64,b'(.*)'", img['src'])
        if m:
            # remove '\n'
            s = m.group(1).replace(r'\n', '')
            img['src'] = 'data:image/png;base64,' + s
    with open(path, 'wb') as f:
        f.write(soup.encode('utf-8'))
I prefer to save the PNGs to separate files because that is more friendly to xelatex.
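A usage sketch for the first variant (assuming my_notebook.html is in the current directory; the docstring above asks for an absolute path):
import os

freeze_html(os.path.abspath('my_notebook.html'))
# my_notebook.html now references images/<uuid>.png files next to it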

Related

How can I convert a json file from the labelme interface to png or another image format?

When I label images with the labelme interface, the output I get is a JSON file, but I need it in an image format like PNG, BMP, or JPEG after labeling. Can anyone suggest any code?
import json
from PIL import Image

with open('your.json') as f:
    data = json.load(f)

# Load the file path from the json
imgpath = data['yourkey']

# Place the image path into the open method
img = Image.open(imgpath)
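Note that labelme JSON files normally also embed the image itself, base64-encoded, under the imageData key; if the original image path is unavailable, a sketch along these lines recovers the image directly (only imageData is the labelme convention here, the filenames are illustrative):
import base64
import io
import json

from PIL import Image

with open('your.json') as f:
    data = json.load(f)

# labelme stores the raw image bytes base64-encoded under 'imageData'
img = Image.open(io.BytesIO(base64.b64decode(data['imageData'])))
img.save('recovered.png')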
Based on the tutorial of the original repository, you can use labelme_json_to_dataset <<JSON_PATH>> -o <<OUTPUT_FOLDER_PATH>>.
To run it on python / jupyter, you can use:
import os

def labelme_json_to_dataset(json_path):
    os.system("labelme_json_to_dataset " + json_path + " -o " + json_path.replace(".", "_"))
If you need to do it for multiple images, just loop the function, as sketched below.
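A minimal loop over every labelme JSON in a folder might look like this (the annotations/ pattern is an assumption about your layout):
import glob

# convert every labelme JSON found in the folder (path pattern is illustrative)
for json_path in glob.glob("annotations/*.json"):
    labelme_json_to_dataset(json_path)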
Based on the issue, labelme_json_to_dataset's behavior can be reimplemented by using either labelme2voc.py or labelme2coco.py.
You could also use another implementation like labelme2Datasets.
You can also implement your own variation of labelme_json_to_dataset using the labelme library. Basically, you use label_file = labelme.LabelFile(filename=filename) followed by img = labelme.utils.img_data_to_arr(label_file.imageData). An example of such a process would look like this:
import glob
import os
import os.path as osp

import labelme
from PIL import Image

def _makedirs(path, force=False):
    # minimal stand-in for the helper assumed by the original snippet:
    # create the folder, tolerating an existing one when force=True
    os.makedirs(path, exist_ok=force)

def labelme2images(input_dir, output_dir, force=False, save_img=False, new_size=False):
    """
    new_size_width, new_size_height = new_size
    """
    if save_img:
        _makedirs(path=osp.join(output_dir, "images"), force=force)
    if new_size:
        new_size_width, new_size_height = new_size
    print("Generating dataset")
    filenames = glob.glob(osp.join(input_dir, "*.json"))
    for filename in filenames:
        # base name
        base = osp.splitext(osp.basename(filename))[0]
        label_file = labelme.LabelFile(filename=filename)
        img = labelme.utils.img_data_to_arr(label_file.imageData)
        h, w = img.shape[0], img.shape[1]
        if save_img:
            if new_size:
                img_pil = Image.fromarray(img).resize((new_size_height, new_size_width))
            else:
                img_pil = Image.fromarray(img)
            img_pil.save(osp.join(output_dir, "images", base + ".jpg"))
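A usage sketch for the function above (directory names are assumptions):
# extract every embedded image under annotations/ as a JPEG into out/images/
labelme2images("annotations", "out", save_img=True)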

Download a file from a webpage using python

I need to download a file from a webpage every 2 weeks, but the file is a new one each time, so the name changes too; only the last 3 characters change, and the leading "Vermeldung %%%" part stays the same. After that I need to send it to someone via email. Could someone help me accomplish that?
This is the code I have right now:
from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector
import requests

url = 'https://worbis-kirche.de/downloads?view=document&id=339:vermeldungen-kw-9&catid=61'

parser = 'html.parser'  # or 'lxml' (preferred) or 'html5lib', if installed
resp = requests.get(url)
http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
encoding = html_encoding or http_encoding
soup = BeautifulSoup(resp.content, parser, from_encoding=encoding)

for link in soup.find_all('a', href=True):
    print(link['href'])
It gives me all the links I need, but how do I tell the program which link to download? The link that needs to be downloaded is /downloads?view=document&id=339&format=raw
I think you need to get this link:
https://worbis-kirche.de/downloads?view=document&id=339&format=raw
So, you could just do this:
import os
import shutil
...
for link in soup.find_all('a', href=True):
    myLink = link['href']  # Assuming myLink is /downloads?view=document&id=339&format=raw
    myLink = "https://worbis-kirche.de" + myLink
    r = requests.get(myLink, stream=True)  # To download it
    r.raw.decode_content = True
    with open(filename, "wb") as f:  # filename is the name of the pdf
        shutil.copyfileobj(r.raw, f)
    try:
        shutil.move(os.getcwd() + "/" + filename, directory + filename)  # directory is your preferred downloads folder
    except Exception as e:
        print(e, ": File couldn't be transferred")
I hope I answered your question...
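One addition: since the loop above visits every link on the page, it helps to filter for the single document link. A sketch, keying off the format=raw marker quoted in the question (the output filename is a hypothetical choice):
for link in soup.find_all('a', href=True):
    href = link['href']
    if 'format=raw' in href:  # only the raw-download link for the document
        r = requests.get("https://worbis-kirche.de" + href, stream=True)
        r.raw.decode_content = True
        with open("vermeldungen.pdf", "wb") as f:  # hypothetical filename
            shutil.copyfileobj(r.raw, f)
        break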

PDFMiner does not detect all pages

I am trying to extract text from PDFs, but I am running into an error: my script sometimes detects every page of a PDF, and sometimes only detects the first page. I even included this line from a previous post on Stack Overflow:
print(len(list(extract_pages(pdf_file))))
Any time my script extracted just the first page, that line also reported only 1 page.
I've even tried another library (PyPDF2) to extract text, but had even worse results.
If I look up the properties of the PDFs that my script mishandles, Adobe clearly shows the correct number of pages in the PDF's properties.
Below is the code I am using. Any recommendations on how I might change my script to detect all pages of a pdf would be appreciated.
import os
from os.path import isfile, join
from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer

pdf_dir = "/dir/pdfs/"
txt_dir = "/dir/txt/"
corpus = (f for f in os.listdir(pdf_dir) if not f.startswith('.') and isfile(join(pdf_dir, f)))

for filename in corpus:
    print(filename)
    output_string = StringIO()
    with open(join(pdf_dir, filename), 'rb') as in_file:
        parser = PDFParser(in_file)
        doc = PDFDocument(parser)
        rsrcmgr = PDFResourceManager()
        device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.create_pages(doc):
            interpreter.process_page(page)
    txt_name = "{}.txt".format(filename[:-4])
    with open(join(txt_dir, txt_name), mode="w", encoding='utf-8') as o:
        o.write(output_string.getvalue())
Here is a solution. After trying different libraries in R (pdftools) and Python (pdfplumber), PyMuPDF works best.
from io import StringIO
import os
from os.path import isfile, join

import fitz  # PyMuPDF

pdf_dir = "pdf path"
txt_dir = "txt path"
corpus = (f for f in os.listdir(pdf_dir) if not f.startswith('.') and isfile(join(pdf_dir, f)))

for filename in corpus:
    print(filename)
    output_string = StringIO()
    doc = fitz.open(join(pdf_dir, filename))
    for page in doc:
        # "text" returns a plain string; the "rawdict" option returns a dict,
        # which StringIO.write() would reject
        output_string.write(page.getText("text"))
    txt_name = "{}.txt".format(filename[:-4])
    with open(join(txt_dir, txt_name), mode="w", encoding='utf-8') as o:
        o.write(output_string.getvalue())
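As a quick diagnostic for files that misbehave, it can help to compare the page counts both libraries report against what Adobe shows (a sketch; the file path is illustrative):
import fitz
from pdfminer.high_level import extract_pages

path = "problem.pdf"  # a hypothetical problem file
print("PyMuPDF page count:", len(fitz.open(path)))
print("pdfminer page count:", len(list(extract_pages(path))))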

Open JSON files in different directory - Python3, Windows, pathlib

I am trying to open JSON files located in a directory other than the current working directory (cwd). My setup: Python 3.5 on Windows (using Anaconda).
from pathlib import *
import json

path = Path("C:/foo/bar")
filelist = []
for f in path.iterdir():
    filelist.append(f)

for file in filelist:
    with open(file.name) as data_file:
        data = json.load(data_file)
In this setting I have these values:
file >> C:\foo\bar\0001.json
file.name >> 0001.json
However, I get the following error message:
---> 13 with open(file.name) as data_file:
14 data = json.load(data_file)
FileNotFoundError: [Errno 2] No such file or directory: '0001.json'
Here is what I tried so far:
Use .joinpath() to add the directory to the file name in the open command:
with open(path.joinpath(file.name)) as data_file:
    data = json.load(data_file)
TypeError: invalid file: WindowsPath('C:/foo/bar/0001.json')
Used .resolve() as that works for me to load CSV files into Pandas. Did not work here.
for file in filelist:
    j = Path(path, file.name).resolve()
    with open(j) as data_file:
        data = json.load(data_file)
Since I'm on Windows, I wrote the path as (and yes, the file is in that directory):
path = Path("C:\\foo\\bar")  # resulted in the same FileNotFoundError as above
Instantiate path like this:
path = WindowsPath("C:/foo/bar")
#Same TypeError as above for both '\\' and '/'
The accepted answer has a lot of redundancy: it re-collects the generator into a list and mixes the plain open()/with statement with pathlib.Path.
pathlib.Path is an awesome solution for handling paths, especially if we want to create scripts that work on both Linux and Windows.
# modules
from pathlib import Path
import json

# static values
JSON_SUFFIXES = [".json", ".js", ".other_suffix"]

folder_path = Path("C:/users/user/documents")

for file_path in folder_path.iterdir():
    if file_path.suffix in JSON_SUFFIXES:
        data = json.loads(file_path.read_bytes())
Just adding this modification for new users; pathlib.Path works with Python 3.
Complete solution; thanks @eryksun:
from pathlib import *
import json

path = Path("C:/foo/bar")
filelist = []
for f in path.iterdir():
    filelist.append(f)

for file in filelist:
    with open(str(file)) as data_file:
        data = json.load(data_file)
This line works as well:
with file.open() as data_file:
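For completeness, a minimal sketch of the whole loop written that way (same C:/foo/bar directory as above):
from pathlib import Path
import json

path = Path("C:/foo/bar")
for file in path.iterdir():
    with file.open() as data_file:
        data = json.load(data_file)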

Download files from FTP or HTTP using filenames in CSV - Python 3

I have a csv file of products for an ecommerce site I'm working on, as well as FTP access to the corresponding images for each product (~15K products).
I would like to use Python to pull only the images listed in the csv from either the FTP or HTTP and save them locally.
import urllib.request
import urllib.parse
import re

url = 'http://www.fakesite.com/pimages/filename.jpg'
split = urllib.parse.urlsplit(url)
filename = split.path.split("/")[-1]
urllib.request.urlretrieve(url, filename)
print(filename)
saveFile = open(filename, 'r')
saveFile.close()
import csv

with open('test.csv') as csvfile:
    readCSV = csv.reader(csvfile, delimiter=",")
    images = []
    for row in readCSV:
        image = row[14]
        print(image)
The code I have currently can pull the filename from the URL and save the file under that filename. It can also pull the filename of the image from the CSV file (filename and image are exactly the same). What I need it to do is append the filename from the CSV to the end of the URL, and then save that file under the filename.
I have graduated to this:
import urllib.request
import urllib.parse
import re
import os
import csv

with open('test.csv') as csvfile:
    readCSV = csv.reader(csvfile, delimiter=",")
    images = []
    for row in readCSV:
        image = row[14]
        images.append(image)

x = 'http://www.fakesite.com/pimages/'
url = os.path.join(x, image)
split = urllib.parse.urlsplit(url)
filename = split.path.split("/")[-1]
urllib.request.urlretrieve(url, filename)
saveFile = open(filename, 'r')
saveFile.close()
Now this is great. It works perfectly. It is pulling the correct filename out of the CSV file, adding it on to the end of the URL, downloading the file, and saving it as the filename.
However, I can't seem to figure out how to make this work for more than one line of the CSV file. As of now, it takes only the last line and pulls the relevant information. Ideally, I would use the CSV file with all of the products on it, and it would go through and download every single one, not just the last image.
You're doing strange things ...
import urllib.request
import csv

# the images list should be outside the with block
images = []
IMAGE_COLUMN = 14

with open('test.csv') as csvfile:
    # read csv
    readCSV = csv.reader(csvfile, delimiter=",")
    for row in readCSV:
        # I guess 14 is the column-index of the image-name like 'image.jpg'
        # I've put it in some constant
        # now append all the image-names into the list
        images.append(row[IMAGE_COLUMN])
        # no need for the following
        # image = row[14]
        # images.append(image)

# make sure root_url ends with a slash
# x was some strange name for a url
root_url = 'http://www.fakesite.com/pimages/'

# iterate through the list
for image in images:
    # you don't need os.path.join, because that's operating-system dependent.
    # you don't need urlsplit, because you created the url yourself.
    # you don't need to split out the filename, as it is the image name
    # with the following line, the root_url must end with a slash
    url = root_url + image
    # urlretrieve saves the file under the image name in the current directory
    urllib.request.urlretrieve(url, image)
or in short, that's all you need:
import urllib.request
import csv

IMAGE_COLUMN = 14
ROOT_URL = 'http://www.fakesite.com/pimages/'

images = []
with open('test.csv') as csvfile:
    readCSV = csv.reader(csvfile, delimiter=",")
    for row in readCSV:
        images.append(row[IMAGE_COLUMN])

for image in images:
    url = ROOT_URL + image
    urllib.request.urlretrieve(url, image)
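Since the question mentions FTP access as well, the same loop can run over FTP with the standard-library ftplib; a sketch, where the host, credentials, and remote directory are all assumptions:
import csv
from ftplib import FTP

IMAGE_COLUMN = 14

images = []
with open('test.csv') as csvfile:
    for row in csv.reader(csvfile, delimiter=","):
        images.append(row[IMAGE_COLUMN])

ftp = FTP('ftp.fakesite.com')  # hypothetical host
ftp.login('user', 'password')  # hypothetical credentials
ftp.cwd('pimages')             # hypothetical remote image directory
for image in images:
    with open(image, 'wb') as f:
        ftp.retrbinary('RETR ' + image, f.write)
ftp.quit()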