PDFMiner does not detect all pages - ocr

I am trying to extract text from PDFs, but I am running into an error because my script sometimes detects every page of a PDF and sometimes detects only the first page. I even included this line from a previous Stack Overflow post:
print(len(list(extract_pages(pdf_file))))
Whenever my script extracted only the first page, that check also reported just 1 page.
I also tried another library (PyPDF2) to extract the text, but got even worse results.
If I look up the properties of the PDFs that my script mishandles, Adobe clearly shows the correct number of pages.
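For reference, a self-contained version of that page-count check over a whole folder might look like this (the directory path is just a placeholder):
from os.path import join
import os
from pdfminer.high_level import extract_pages

pdf_dir = "/dir/pdfs/"  # placeholder path
for filename in os.listdir(pdf_dir):
    if filename.lower().endswith(".pdf"):
        with open(join(pdf_dir, filename), "rb") as pdf_file:
            # print the number of pages pdfminer sees in each file
            print(filename, len(list(extract_pages(pdf_file))))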
Below is the code I am using. Any recommendations on how I might change my script to detect all pages of a pdf would be appreciated.
import os
from os.path import isfile, join
from io import StringIO
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer
pdf_dir = "/dir/pdfs/"
txt_dir = "/dir/txt/"
corpus = (f for f in os.listdir(pdf_dir) if not f.startswith('.') and isfile(join(pdf_dir, f)))
for filename in corpus:
    print(filename)
    output_string = StringIO()
    with open(join(pdf_dir, filename), 'rb') as in_file:
        parser = PDFParser(in_file)
        doc = PDFDocument(parser)
        rsrcmgr = PDFResourceManager()
        device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.create_pages(doc):
            interpreter.process_page(page)
    txt_name = "{}.txt".format(filename[:-4])
    with open(join(txt_dir, txt_name), mode="w", encoding='utf-8') as o:
        o.write(output_string.getvalue())

Here is a solution. After trying different libraries in R (pdftools) and Python (pdfplumber), PyMuPDF worked best.
from io import StringIO
import os
from os.path import isfile, join
import fitz  # PyMuPDF
pdf_dir = "pdf path"
txt_dir = "txt path"
output_string = StringIO()
corpus = (f for f in os.listdir(pdf_dir) if not f.startswith('.') and isfile(join(pdf_dir, f)))
for filename in corpus:
    print(filename)
    output_string = StringIO()
    doc = fitz.open(join(pdf_dir, filename))
    for page in doc:
        output_string.write(page.getText("rawdict"))
    txt_name = "{}.txt".format(filename[:-4])
    with open(join(txt_dir, txt_name), mode="w", encoding='utf-8') as o:
        o.write(output_string.getvalue())
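Note that newer PyMuPDF releases spell the method get_text rather than getText, and the "rawdict" option returns a dict rather than plain text, so for current versions a sketch along these lines (same placeholder paths as above) may be closer to what is needed:
import os
from os.path import isfile, join
import fitz  # PyMuPDF

pdf_dir = "pdf path"
txt_dir = "txt path"

for filename in os.listdir(pdf_dir):
    if filename.startswith('.') or not isfile(join(pdf_dir, filename)):
        continue
    doc = fitz.open(join(pdf_dir, filename))
    print(filename, doc.page_count)  # quick sanity check on the page count
    text = "".join(page.get_text("text") for page in doc)  # plain-text extraction
    with open(join(txt_dir, "{}.txt".format(filename[:-4])), mode="w", encoding="utf-8") as o:
        o.write(text)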

Related

Pandas parallel URL downloads with pd.read_html

I know I can download a csv file from a web page by doing:
import pandas as pd
import numpy as np
from io import StringIO
URL = "http://www.something.com"
data = pd.read_html(URL)[0].to_csv(index=False, header=True)
file = pd.read_csv(StringIO(data), sep=',')
Now I would like to do the above for several URLs at the same time, like opening different tabs in a browser. In other words, I want a way to parallelize this across different URLs instead of looping through them one at a time. So I thought of putting the URLs in a dataframe and then creating a new column that contains the 'data' string for each URL.
list_URL = ["http://www.something.com", "http://www.something2.com",
"http://www.something3.com"]
df = pd.DataFrame(list_URL, columns =['URL'])
df['data'] = pd.read_html(df['URL'])[0].to_csv(index=False, header=True)
But it gives me the error: cannot parse from 'Series'.
Is there a better syntax, or does this mean I cannot do this in parallel for more than one URL?
You could try it like this:
import pandas as pd
URLS = [
    "https://en.wikipedia.org/wiki/Periodic_table#Presentation_forms",
    "https://en.wikipedia.org/wiki/Planet#Planetary_attributes",
]
df = pd.DataFrame(URLS, columns=["URL"])
df["data"] = df["URL"].map(
    lambda x: pd.read_html(x)[0].to_csv(index=False, header=True)
)
print(df)
# Output
URL data
0 https://en.wikipedia.org/wiki/Periodic_t... 0\r\nPart of a series on the\r\nPeriodic...
1 https://en.wikipedia.org/wiki/Planet#Pla... 0\r\n"The eight known planets of the Sol...
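Note that .map still fetches the URLs one at a time. If the goal is to download several pages concurrently, one option is to run the same call in a thread pool; a minimal sketch using the standard-library concurrent.futures might look like this:
import pandas as pd
from concurrent.futures import ThreadPoolExecutor

URLS = [
    "https://en.wikipedia.org/wiki/Periodic_table#Presentation_forms",
    "https://en.wikipedia.org/wiki/Planet#Planetary_attributes",
]

def fetch(url):
    # read_html downloads and parses the page; keep the first table as CSV text
    return pd.read_html(url)[0].to_csv(index=False, header=True)

# threads are a reasonable fit here because the work is network-bound
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fetch, URLS))  # results come back in input order

df = pd.DataFrame({"URL": URLS, "data": results})
print(df)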

Python: How to save *.dat-files as *.csv-files to new folder

I have a folder with lots of *.dat files (which were created with the program IDL). I am able to take one single file, convert it to a *.csv file and save it in a different (already existing) folder:
import idlsave
import csv
input_file = idlsave.read("C:/Users/RAW/06211714.dat")
n = input_file["raw"]
with open("C:/Users/CSV/06211714.csv", "w", newline='') as f:
writer = csv.writer(f)
writer.writerows(n)
The line input_file = idlsave.read("C:/Users/RAW/06211714.dat") shows the following output:
Available variables: raw class ['numpy.recarray']
So, this works fine for just taking one file, but I am looking for a way to take all *.dat files at once and convert each of them to a *.csv file with their original name.
I was thinking of something like this, but it didn't work:
import glob
for filename in glob.glob("C:/Users/RAW/*.dat"):
    for element in filename:
        i = idlsave.read(element)
        n = i["raw"]
        with open("C:/Users/CSV/*.csv", "w", newline='') as f:
            writer = csv.writer(f)
            writer.writerows(n)
Can someone please give me some advice?
Thanks.
import csv
import idlsave
from os import listdir
from os.path import isfile, join, splitext
dat_folder = "/folder/to/dat/files/"
csv_folder = "/folder/to/save/new/csv/files/"
onlyfilenames = [f for f in listdir(dat_folder) if isfile(join(dat_folder,f))]
for fullfilename in onlyfilenames:
    file_name, file_extension = splitext(fullfilename)
    if file_extension == ".dat":
        input_file = idlsave.read(dat_folder + fullfilename)
        n = input_file["raw"]
        with open(join(csv_folder, file_name + ".csv"), "w", newline='') as f:
            writer = csv.writer(f)
            writer.writerows(n)
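For comparison, the glob approach from the question also works once the extra inner loop is removed and the output name is built from the input name; a sketch under the same assumptions (each file exposes a "raw" record array):
import csv
import glob
import os
import idlsave

for dat_path in glob.glob("C:/Users/RAW/*.dat"):
    n = idlsave.read(dat_path)["raw"]
    # reuse the original file name, swapping the extension
    base = os.path.splitext(os.path.basename(dat_path))[0]
    with open(os.path.join("C:/Users/CSV", base + ".csv"), "w", newline='') as f:
        csv.writer(f).writerows(n)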

Python folder contents CSV writer

I'm trying to make a simple command-line script in Python that generates a CSV when it scans the contents of a directory, but I'm not sure if I'm doing it correctly, because I keep getting errors. Can someone tell me what the heck I'm doing wrong?
import sys
import argparse
import os
import string
import fnmatch
import csv
from string import Template
from os import path
from os.path import basename
header = ["Title","VersionData","PathOnClient","OwnerId","FirstPublishLocationId","RecordTypeId","TagsCsv"]
if not sys.argv.len < 2:
    with open(sys.argv[1], 'w') as f:
        writer = csv.DictWriter(f, fieldnames = header, delimiter=',')
        writer.writeheader()
        if os.path.isdir(sys.argv[2]):
            for d in os.scandir(sys.argv[2]):
                row = Template('"$title","$path","$path"')  # some default values in the template were omitted here
                writer.writerow(row.substitute(title=basename(d.path)), path=path.abspath(d.path))
Right off the bat, csvwriter.writerow(row) takes only one argument, so you need to wrap your values in a single sequence (brackets), joined by commas, rather than passing them separately.
Moreover, you cannot tack extra arguments onto the row object, which is what the call built around row.substitute(args) etc. ends up doing.
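A minimal, self-contained illustration of that point (the file name is hypothetical and only two of the header fields above are used):
import csv
from os.path import abspath, basename

header = ["Title", "PathOnClient"]  # subset of the fields above
with open("listing.csv", "w", newline='') as f:
    writer = csv.DictWriter(f, fieldnames=header)
    writer.writeheader()
    # one dict per call -- writerow takes a single argument
    writer.writerow({"Title": basename("example.txt"), "PathOnClient": abspath("example.txt")})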
Figured it out. For anyone else needing a quick CSV listing of folders, here's the code I got to work:
#!/usr/bin/env python3
import sys, os, csv
from string import Template
from pathlib import PurePath, PureWindowsPath
from os.path import basename
header = ["Title","Path","","","","",""] # insert what header you need, if any
if not len(sys.argv) < 2:
    with open(sys.argv[1], 'w') as f:
        writer = csv.DictWriter(f, fieldnames=header, dialect='excel', delimiter=',', quoting=csv.QUOTE_ALL)
        writer.writeheader()
        initPath = os.path.abspath(sys.argv[2])
        if sys.platform.startswith('linux') or sys.platform.startswith('cygwin') or sys.platform.startswith('darwin'):
            p = PurePath(initPath)
        else:
            if sys.platform.startswith('win32'):
                p = PureWindowsPath(initPath)
        if os.path.isdir(str(p)) and not str(p).startswith('.'):
            for d in os.scandir(str(p)):
                srow = Template('"$title","$path", "","","",""')
                #s = srow.substitute({'title': basename(d.path), 'path': os.path.abspath(d.path)) #
                #print(s) # this is for testing if the content produces what's expected
                row = {'Title': basename(d.path), 'Path': os.path.abspath(d.path)}  # the dictionary must have the same number of entries as the number of header fields your CSV is going to contain
                writer.writerow(row)

Python, UnicodeEncodeError

Hello I've got this piece of code
import urllib.request
import string
import time
import gzip
from io import BytesIO
from io import StringIO
from zipfile import ZipFile
import csv
import datetime
from datetime import date
import concurrent.futures
den = date.today().replace(day=1) - datetime.timedelta(days=1)
url = '' + den.strftime("%Y%m%d") + '_OB_ADR_csv.zip'
data = urllib.request.urlopen(url).read()
zipdata = BytesIO()
zipdata.write(data)
csvfile = open('./test.csv', 'w', newline='')
csvwrite = csv.writer(csvfile, delimiter=';')
with ZipFile(zipdata) as zip:
    for i, nazev in enumerate(zip.namelist()):
        if i == 0:
            continue
        csvstring = StringIO(str(zip.read(nazev), encoding='windows-1250'))
        csvreader = csv.reader(csvstring, delimiter=';')
        for j, row in enumerate(csvreader):
            if j == 0 and i != 1:
                continue
            csvwrite.writerow(row)
csvfile.close()
When I run it, it sometimes throws "UnicodeEncodeError: 'ascii' codec can't encode character '\xf3' in position 1: ordinal not in range(128)" at csvwrite.writerow(row).
How can I solve this issue? Thank you.
EDIT:
I run it under Python 3.3
You didn't tell csv.writer which encoding to use when you opened the output file. Take a look at the Python docs for the csv module:
To decode a file using a different encoding, use the encoding argument
of open...[t]he same applies to writing in something other than the
system default encoding: specify the encoding argument when opening
the output file.
You can see from the UnicodeEncodeError that Python thinks you want the file written in ascii. Just specify the encoding parameter and choose your desired encoding (my suggestion is encoding='utf-8').
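Applied to the code in the question, that just means passing encoding when the output file is opened, for example:
# open the output file with an explicit encoding instead of the system default
csvfile = open('./test.csv', 'w', newline='', encoding='utf-8')
csvwrite = csv.writer(csvfile, delimiter=';')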

ipython notebook .png figures after nbconvert not loaded by latest chrome/firefox

Running $ipython3 notebook --pylab=inline locally, I saved a simple notebook with a small png figure using pylab and python 3.3.
Contents of notebook cell:
from pylab import *
x = linspace(0, 5, 10)
y = x ** 2
figure()
plot(x, y, 'r')
xlabel('x')
ylabel('y')
title('title')
show()
Running the cell resulted in an inline png figure being displayed.
The saved file (my_notebook.ipynb) has a .png saved as a data uri:
{ ..., "png":"iVBO...ZUmwK\n...", ... }
After executing the command:
ipython3 nbconvert --to html my_notebook.ipynb
my_notebook.html is generated with the figure as a data uri like this:
<img src="data:image/png;base64,b'iVBO...ZUmwk\n..." >
In latest chrome or firefox the image data uri does not load/display when opening file:///.../my_notebook.html locally and chrome console reports 'failed to load resource' for the img tag.
I have had the same results with images loaded and then displayed with imshow().
The figures appear fine in the notebook. It is after nbconvert to html that they do not display (at all).
(notice the escaped newline in the image data uri - I tried replacing all escaped newlines in the data string with actual newlines with no change in results)
How can I get png figures to display in an nbconverted-html-version of an ipython notebook opened locally ("file:///.../my_notebook.html") in browser?
(I would rather not have to save each figure and hand modify the converted html to reference the saved figure on disk.)
EDIT:
versions:
python 3.3.1
ipython==1.0.0
matplotlib==1.2.1
Pillow==2.1.0 (PIL)
Install BeautifulSoup4 first:
pip install BeautifulSoup4
Then use the following function to freeze your generated html file. The images will be placed in an images folder under the same directory as the html file.
import os
import re
import base64
from bs4 import BeautifulSoup as BS
from uuid import uuid4
def dump(path, data):
    root = os.path.dirname(path)
    if not os.path.exists(root):
        os.makedirs(root)
    with open(path, 'wb') as f:
        f.write(data)
    # for windows
    return path.replace('\\', '/')

def freeze_html(path):
    '''pass in absolute path of your html'''
    root = os.path.dirname(path)
    with open(path, 'rb') as f:
        soup = BS(f.read())
    for img in soup.find_all('img'):
        m = re.search(r"data:image/png;base64,b'(.*)'", img['src'])
        if m:
            iname = uuid4()
            ipath = os.path.join(root, 'images', '%s.png' % iname)
            # remove '\n'
            s = m.group(1).replace(r'\n', '')
            img['src'] = os.path.relpath(
                dump(ipath, base64.b64decode(s.encode('ascii'))),
                root
            )
    with open(path, 'wb') as f:
        f.write(soup.encode('utf-8'))
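Usage is then just a matter of calling it on the converted file (the path shown is only an example):
freeze_html('/absolute/path/to/my_notebook.html')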
If you do not need to further convert it to tex or pdf, you can just write the string (with the \n removed) back to img['src'] (with the data:image/png;base64, prefix):
import re
from bs4 import BeautifulSoup as BS
def freeze_html(path):
    '''pass in absolute path of your html'''
    with open(path, 'rb') as f:
        soup = BS(f.read())
    for img in soup.find_all('img'):
        m = re.search(r"data:image/png;base64,b'(.*)'", img['src'])
        if m:
            # remove '\n'
            s = m.group(1).replace(r'\n', '')
            img['src'] = 'data:image/png;base64,' + s
    with open(path, 'wb') as f:
        f.write(soup.encode('utf-8'))
I prefer to save the png to a separate file because it is friendlier to xelatex.