Why is pytesseract not identifying this image?

I am trying to identify single digits in Python with Tesseract.
My code is this:
import numpy as np
from PIL import Image
from PIL import ImageOps
import pytesseract
import cv2

def predict(imageArray):
    pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
    newImageArray = Image.open(imageArray)
    number = pytesseract.image_to_string(newImageArray, lang='eng', config='--psm 10 --oem 1 -c tessedit_char_whitelist=0123456789')
    return number
It has no problem identifying one of my images as an 8, but it does not recognise another as a 4.
My images are just digits 0-9.
This is just one example; there are other instances where it struggles to identify "obvious/clear" digits.
Currently the only preprocessing I apply to my starting image is a colour conversion, using the following:
cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
Is there a way I can improve the accuracy? All of my images are clear, computer-typed digits, so I feel the accuracy should be much higher than it is.

You did not provide any information about your Tesseract version or the language model you used. The best (tessdata_best) model identifies the '4' in your image without any preprocessing.
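If you are on the default tessdata_fast model, switching to tessdata_best (slower but more accurate) and adding some light preprocessing often helps with single digits. A minimal sketch, assuming you have downloaded eng.traineddata from the tessdata_best repository into a local tessdata_best folder (the folder path and digit.png are placeholders):

import cv2
import pytesseract

print(pytesseract.get_tesseract_version())  # first, check which Tesseract you are actually running

img = cv2.imread('digit.png')  # placeholder file name
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Upscale and binarise: tiny glyphs and anti-aliased edges are common causes of misreads
gray = cv2.resize(gray, None, fx=3, fy=3, interpolation=cv2.INTER_CUBIC)
_, thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# --tessdata-dir points Tesseract at the tessdata_best model (the path is an assumption)
config = '--tessdata-dir "./tessdata_best" --psm 10 --oem 1 -c tessedit_char_whitelist=0123456789'
print(pytesseract.image_to_string(thresh, lang='eng', config=config))

--psm 10 (treat the image as a single character) is kept from your config; the whitelist already restricts output to digits, so the remaining gains usually come from the model and the image quality.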

Related

Python: Creating PDF from PNG images and CSV tables using reportlab

I am trying to create a PDF document from a series of PNG images and a series of CSV tables using the Python package reportlab. The tables are giving me a little bit of grief.
This is my code so far:
import os
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate
from reportlab.pdfgen.canvas import Canvas
from reportlab.platypus import *
from reportlab.platypus.tables import Table
from PIL import Image
from matplotlib.backends.backend_pdf import PdfPages

# Set the path to the folder containing the images and tables
folder_path = 'Files'

# Create a new PDF document
pdf_filename = 'testassessment.pdf'
canvas = Canvas(pdf_filename)

# Iterate through the files in the folder
for file in os.listdir(folder_path):
    file_path = os.path.join(folder_path, file)
    # If the file is an image, draw it on the PDF
    if file.endswith('.png'):
        canvas.drawImage(file_path, 105, 148.5, width=450, height=400)
        canvas.showPage()  # ends page
    # If the file is a table, draw it on the PDF
    elif file.endswith('.csv'):
        df = pd.read_csv(file_path)
        table = df.to_html()
        canvas.drawString(10, 10, table)
        canvas.showPage()

# Save the PDF
canvas.save()
The tables are not working: when I use .drawString, the table's HTML ends up drawn as a single line of raw text.
Does anyone know how I can get the table to be properly inserted into the PDF?
According to the reportlab docs (page 14), "The draw string methods draw single lines of text on the canvas." You might want to have a look at "The text object methods" on the same page.
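Alternatively, reportlab's platypus Table flowable is built for exactly this. A minimal sketch (page size, styling, and the image dimensions are assumptions) that renders each CSV as a real table instead of an HTML string:

import os
import pandas as pd
from reportlab.lib import colors
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Table, TableStyle, Image

folder_path = 'Files'  # same folder layout as in the question
story = []             # platypus lays these flowables out in order

for file in sorted(os.listdir(folder_path)):
    file_path = os.path.join(folder_path, file)
    if file.endswith('.png'):
        story.append(Image(file_path, width=450, height=400))
    elif file.endswith('.csv'):
        df = pd.read_csv(file_path)
        # Table expects a list of rows; prepend the header row
        data = [df.columns.tolist()] + df.values.tolist()
        table = Table(data)
        table.setStyle(TableStyle([
            ('GRID', (0, 0), (-1, -1), 0.5, colors.grey),
            ('BACKGROUND', (0, 0), (-1, 0), colors.lightgrey),
        ]))
        story.append(table)

SimpleDocTemplate('testassessment.pdf', pagesize=letter).build(story)

SimpleDocTemplate handles page breaks for you, which is why the canvas-level showPage() calls disappear.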
You might want to consider using PyMuPDF with Stories; it allows for more layout flexibility from a data input. For an example of something very similar to what you are trying to achieve, see: https://pymupdf.readthedocs.io/en/latest/recipes-stories.html#how-to-display-a-list-from-json-data

When I use tesseract to recognize the table, I can't get any output, how can I recognize the table

Recently I have been trying to build a tool for table recognition. I tried Tesseract OCR, but I can't get any output. Can anyone point me in the right direction?
I highly recommend PaddleOCR for table recognition! It can output a text file and an Excel file using just a few lines of code.
import os
import cv2
from paddleocr import PPStructure, save_structure_res

table_engine = PPStructure(layout=False, show_log=True, use_gpu=False)

save_folder = './output'
img_path = 'PaddleOCR_pub/ppstructure/docs/table/table.jpg'
img = cv2.imread(img_path)
result = table_engine(img)
save_structure_res(result, save_folder, os.path.basename(img_path).split('.')[0])

for line in result:
    line.pop('img')  # drop the raw image data before printing
    print(line)
The output files include a text file and an Excel file for the recognized table, which should give you what you need.
You can try it out here: https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.6/ppstructure/docs/quickstart_en.md#214-table-recognition

Detect language/script from pdf with python

I am trying to create a Python script that detects the language(s)/script(s) inside a not-yet-OCRed PDF with the help of pytesseract, before doing the 'real' OCR by passing the correct detected language(s).
I have around 10,000 PDFs, not always in standard English and sometimes 1,000 pages long. In order to do the real OCR I need to autodetect the language first.
So, a sort of two-step OCR, if you will, both of which Tesseract can perform:
Detecting the language/script on some centered pages
Performing the real OCR with the found language/script over all pages
Any tips to fix/improve this script? All I want is the language(s) detected on the given pages, returned.
#!/usr/bin/python3
import sys
import pytesseract
from wand.image import Image
import fitz

pdffilename = sys.argv[1]
doc = fitz.open(pdffilename)
center_page = round(doc.pageCount / 2)
surround = 2

with Image(filename=pdffilename + '[' + str(center_page - surround) + '-' + str(center_page + surround) + ']') as im:
    print(pytesseract.image_to_osd(im, lang='osd', config='psm=0 pandas_config=None', nice=0, timeout=0))
I run the script as follows:
script_detect.py myunknown.pdf
I am getting the following error atm:
TypeError: Unsupported image object
Assuming that you have converted your PDF file using some tool (OCR or other), you can use langdetect. Sample your text and feed it to detect():
from langdetect import detect
lang = detect("je suis un petit chat")
print(lang)
Output: fr
or
from langdetect import detect
lang = detect("我是法国人")
print(lang)
Output: zh-cn
There are other libraries, such as polyglot, which are useful if you have mixed languages.
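As for the TypeError itself: pytesseract does not accept a wand Image object; it wants a PIL image, a numpy array, or a file path. A minimal sketch of the OSD step, assuming a recent PyMuPDF (the page_count / get_pixmap spellings) and the standard pixmap-to-PIL recipe; OSD also requires the osd traineddata to be installed:

#!/usr/bin/python3
import sys
import fitz  # PyMuPDF
import pytesseract
from PIL import Image

pdffilename = sys.argv[1]
doc = fitz.open(pdffilename)
center = doc.page_count // 2
surround = 2

for page_number in range(max(0, center - surround), min(doc.page_count, center + surround + 1)):
    pix = doc[page_number].get_pixmap(dpi=150)  # render the page to a raster image
    img = Image.frombytes('RGB', (pix.width, pix.height), pix.samples)
    print(pytesseract.image_to_osd(img))  # reports script (Latin, Cyrillic, Han, ...) and orientation
doc.close()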

NLTK Word Tokenize doesn't return anything

I am trying to tokenize a sentence, and I believe that the code is correct but there is no output. What could be the problem? Here is the code.
import nltk
from nltk.tokenize import word_tokenize
text = word_tokenize("And now for something completely different")
nltk.pos_tag(text)
text = word_tokenize("They refuse to permit us to obtain the refuse permit")
nltk.pos_tag(text)
It seems the following packages are missing.
punkt
averaged_perceptron_tagger
Note: you only need to download them the first time.
Try this:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.tokenize import word_tokenize
text = word_tokenize("And now for something completely different")
print(nltk.pos_tag(text))
text = word_tokenize("They refuse to permit us to obtain the refuse permit")
print(nltk.pos_tag(text))
print("----End of execution----")
Try this in your IDE. Note the added print() calls: pos_tag's return value is only echoed automatically in an interactive shell, which is why running the original code as a script produced no output.

Python 2 to 3: telling 2to3 "I got this"

With either linters or coverage.py, you can tell the tool to ignore certain parts of your code.
For example, #pragma: no cover tells coverage not to count an exception branch as missing:
except (Exception,) as e:  # pragma: no cover
    if cpdb(): pdb.set_trace()
    raise
Now, I know I can exclude specific fixers from 2to3. For example, to avoid fixing imports below, I can use 2to3 test_import_stringio.py -x imports.
But can I use code annotations/directives to keep the fixer active, except at certain locations? For example, this bit of code is already adjusted to work for 2 and 3.
# this import should work fine in 2 AND 3.
try:
    from io import StringIO
except ImportError:
    # pragma-for-2to3: skip
    from StringIO import StringIO
But 2to3 "helpfully" converts it anyway, because no such directive/pragma exists.
And now this won't work in 2:
# this test should work fine in 2 and 3.
try:
    from io import StringIO
except ImportError:
    # pragma-for-2to3: skip
    from io import StringIO
The reason I am asking is that I want to avoid a big-bang approach. I intend to refactor the code bit by bit, starting with the unit tests, to run under 2 and 3.
I am guessing this is not possible and am just looking at my options. What I'll probably end up doing is running the converter only on imports (with -f imports, for example), checking what it did, applying those changes to the code manually, and then excluding imports from future runs with -x imports.