OCR detecting E as £ - ocr

I am using pytesseract (version 5 of tesseract) to scan an image. I have changed image to black and white to remove the noise but still E is being detected as £196893 .
Also tried setting the language, dpi and psm values which has been suggested by most of people. Below are the settings I am using now. Please suggest.
pytesseract.image_to_string(Image.open(impath), config=" --dpi 120 --psm 6 -l eng")
Once of sample picture is shown below. For some samples it is working fine but for some samples it is giving such strange characters.

A solution to overcome this issue is to limit the characters that Tesseract looks for. To do so you must:
Create a file with arbitrary name (i.e. "whitelist") in tesseract config directory. In linux that directory is usually placed in /usr/share/tesseract/tessdata/configs.
Adding a line in that file containing only the characters that are you want to search in text:
tessedit_char_whitelist *list_of_characters*
Then call your script using the whitelist vocabulary:
tesseract input.tif output nobatch whitelist
In this case the parameters must be setted in your Python script as:
pytesseract.image_to_string(Image.open(impath), config=" --dpi 120 --psm 6 -l nobatch whitelist")

Related

pytesseract - Invalid resolution 0 dpi

I am using pytesseract v5.0 and I am rotating the image with OpenCV and then passing it to pytesseract.image_to_osd(). There are some images that work with the image_to_osd, but other images do not and the program gives me the following error:
TesseractError: (1, 'Tesseract Open Source OCR Engine v5.0.0-alpha.20201127 with Leptonica Warning: Invalid resolution 0 dpi. Using 70 instead. Estimating resolution as 179 Warning. Invalid resolution 0 dpi. Using 70 instead. Too few characters. Skipping this page Error during processing.')
I am using python 3.9.5.
Please share the solution / sample code to fix this issue.
I've been facing this error for a quite long time but finally realized the reason.
Tesseract OSD seems to get the correct image rotation if it's only 0,90,180, or 270.
If you're using OpenCV or Pillow to read your image, it's likely to get the error above.
If you view Tesseract parameters, you will notice something called "Min_characters_to_try" which is the minimum number of characters to run OSD. It's set to 50 by default, which might be too much for you. So, we have to reduce it.
What you can do is cropping your background to make your object have one of the angles stated above. Then, pass your image file directly to Tesseract and reduce the min_characters_to_try like the following:
osd = pytesseract.image_to_osd(r'D:\image.jpg',config='--psm 0 -c min_characters_to_try=5')

caffe could not open or find file

I'm new to caffe and after successfully running an example I'm trying to use my own data. However, when trying to either write my data into the lmdb data format or directly trying to use the solver, in both cases I get the error:
E0201 14:26:00.450629 13235 io.cpp:80] Could not open or find file ~/Documents/ChessgameCNN/input/train/731_1.bmp 731
The path is right, but it's weird that the label 731 is part of this error message. That implies that it's reading it as part of the path instead of as a label. The text file looks like this:
~/Documents/ChessgameCNN/input/train/731_1.bmp 731
Is it because the labels are too high? Or maybe because the labels don't start with 0? I've searched for this error and all I found were examples with relatively few labels, about ~1-5, but I have about 4096 classes of which I don't always actually have examples in the training data. Maybe this is a problem, too (certainly for learning, at least, but I didn't expect for it to give me an actual error message). Usually, the label does not seem to be part of this error message.
For the creation of the lmdb file, I use the create_imagenet.sh from the caffe examples. For solving, I use:
~/caffe/build/tools/caffe train --solver ~/Documents/ChessgameCNN/caffe_models/caffe_model_1/solver_1.prototxt 2>&1 | tee ~/Documents/ChessgameCNN/caffe_models/caffe_model_1/model_1_train.log
I tried different image data types, too: PNG, JPEG and BMP. So this isn't the culprit, either.
If it is really because of my choice of labels, what would be a viable workaround for this problem?
Thanks a lot for your help!
I had the same issue. Check that lines in your text file don't have spaces in the end.
I was facing a similar problem with convert_imageset. I have solved just removing the trailing spaces in the text file which contains the labels.

Tesseract OCR - recognize checkboxes as word

for a customer I want to teach Tesseract to recognize checkboxes as a word. It worked fine when Tesseract should recognize a empty checkbox.
This command in combination with this tutorial worked like a charm and Tesseract was able to find empty checkboxes and interpret them to "[_]":
tesseract -psm 10 deu2.unchecked1.exp0.JPG deu2.unchecked1.exp0.box nobatch box.train
Here is my command to successful analyze a document:
tesseract test.png test -l deu1+deu2
Then I tried to train a checked checkbox, but got this error:
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
FAIL!
APPLY_BOXES: boxfile line 1/[X] ((60,30),(314,293)): FAILURE! Couldn't find a matching blob
APPLY_BOXES:
Boxes read from boxfile: 1
Boxes failed resegmentation: 1
Found 0 good blobs.
Generated training data for 0 words
Does anyone have an idea how to teach Tesseract recognize checked checkboxes as well?
Thank you in advance!
After much more tries I figured out that it is of course possible to teach Tesseract different kind of letters. But as I know today, there is no possibility to teach Tesseract a sign which is not conform to some "visual rules" of a letter. For example: A letter is always one connected line of ink, at most a combination of ink and "something outside it" (for example: i,ä,ö,ü) Problem here ist that there is nothing what is similiat to checkbox (one object in antother object) This leads for Tesseract to irritations and crashes.

How do I segment a document using Tesseract then output the resulting bounding boxes and labels

I'm trying to get Tesseract to output a file with labelled bounding boxes that result from page segmentation (pre OCR). I know it must be capable of doing this 'out of the box' because of the results shown at the ICDAR competitions where contestants had to segment and various documents (academic paper here). Here's an example from that paper illustrating what I want to create:
I have built the latest version of tesseract using brew, brew install tesseract --HEAD, and have been trying to edit config files located in /usr/local/Cellar/tesseract/HEAD/share/tessdata/configs/ to output labelled boxes. The output received using hocr as the config, i.e.
tesseract infile.tiff outfile_stem -l eng -psm 1 hocr
gives a bounding box for everything and has some labelling in class tags e.g.
<p class='ocr_par' dir='ltr' id='par_5_82' title="bbox 2194 4490 3842 4589">
<span class='ocr_line' id='line_5_142' ...
but I can't visualise this. Is there a standard tool to visualize hOCR files, or is the facility to create an output file with bounding boxes built into Tesseract?
The current head version details:
tesseract 3.04.00
leptonica-1.71
libjpeg 8d : libpng 1.6.16 : libtiff 4.0.3 : zlib 1.2.5
Edit
I'm really looking to achieve this using the command line tool (as in examples above). #nguyenq has pointed me to the API reference, unfortunately I have no c++ experience. If the only solution is to use the API, please can you provide a quick python example?
Success. Many thanks to the people at the Pattern Recognition and Image Analysis Research Lab (PRImA) for producing tools to handle this. You can obtain them freely on their website or github.
Below I give the full solution for a Mac running 10.10 and using the homebrew package manager. I use wine to run windows executables.
Overview
Download tools: Tesseract OCR to Page (TPT) and Page Viewer (PVT)
Use the TPT to run tesseract on your document and convert the HOCR xml to a PAGE xml
Use the PVT to view the original image with the PAGE xml information overlaid
Code
brew install wine # takes a little while >10m
brew install gs # only for generating a tif example. Not required, you can use Preview
brew install wget # only for downloading example paper. Not required, you can do so manually!
cd ~/Downloads
wget -O paper.pdf "http://www.prima.cse.salford.ac.uk/www/assets/papers/ICDAR2013_Antonacopoulos_HNLA2013.pdf"
# This command can be ommitted and you can do the conversion to tiff with Preview
gs \
-o paper-%d.tif \
-sDEVICE=tiff24nc \
-r300x300 \
paper.pdf
cd ~/Downloads
# ttptool is the location you downloaded the Tesseract to PAGE tool to
ttptool="/Users/Me/Project/tools/TesseractToPAGE 1.3"
# sudo chmod 777 "$ttptool/bin/PRImA_Tesseract-1-3-78.exe"
touch "$ttptool/log.txt"
wine "$ttptool/bin/PRImA_Tesseract-1-3-78.exe" \
-inp-img "$dl/Downloads/paper-3.tif" \
-out-xml "$dl/Downloads/paper-3-tool.xml" \
-rec-mode layout>>log.txt
# pvtool is the location you downloaded the PAGE Viewer tool to
pvtool="/Users/Me/Project/tools/PAGEViewerMacOS_1.1/JPageViewer 1.1 (Mac OS, 64 bit)"
cd "$pvtool"
dl=~
java -XstartOnFirstThread -jar JPageViewer.jar "$dl/Downloads/paper-3-tool.xml" "$dl/Downloads/paper-3.tif"
Results
Document with overlays (rollover to see text and type)
Overlays alone (use GUI buttons to toggle)
Appendix
You can run tesseract yourself and use another tool to convert its output to PAGE format. I was unable to get this to work but I'm sure you'll be fine!
# Note that the pvtool does take as input HOCR xml but it ignores the region type
brew install tesseract --devel # installs v 3.03 at time of writing
tesseract ~/Downloads/paper-3.tif ~/Downloads/paper-3 hocr
mv paper-3.hocr paper-3.xml # The page viewer will only open XML files
java -XstartOnFirstThread -jar JPageViewer.jar "$dl/Downloads/paper-3.xml"
At this point you need to use the PAGE Converter Java Tool to convert the HOCR xml into a PAGE xml. It should go a little something like this:
pctool="/Users/Me/Project/tools/JPageConverter 1.0"
java -jar "$pctool/PageConverter.jar" -source-xml paper-3.xml -target-xml paper-3-hocrconvert.xml -convert-to LATEST
Unfortunately, I kept getting null pointers.
Could not convert to target XML schema format.
java.lang.NullPointerException
at org.primaresearch.dla.page.converter.PageConverter.run(PageConverter.java:126)
at org.primaresearch.dla.page.converter.PageConverter.main(PageConverter.java:65)
Could not save target PAGE XML file: paper-3-hocrconvert.xml
java.lang.NullPointerException
at org.primaresearch.dla.page.io.xml.XmlInputOutput.writePage(XmlInputOutput.java:144)
at org.primaresearch.dla.page.converter.PageConverter.run(PageConverter.java:135)
at org.primaresearch.dla.page.converter.PageConverter.main(PageConverter.java:65)
You can use its API to obtain the bounding boxes at various levels (character/word/line/para) -- see API Example. You have to draw the labels yourself.
If you are python familiar, you can directly use tesserocr library which is a nice python wrapper around the C++ API. Here is a code snippet to draw polygons at block level using PIL:
from PIL import Image, ImageDraw
from tesserocr import PyTessBaseAPI, RIL, iterate_level, PSM
img = Image.open(filename)
results = []
with PyTessBaseAPI() as api:
api.SetImage(img)
api.SetPageSegMode(PSM.AUTO_ONLY)
iterator = api.AnalyseLayout()
for w in iterate_level(iterator, RIL.BLOCK):
if w is not None:
results.append((w.BlockType(), w.BlockPolygon()))
print('Found {} block elements.'.format(len(results)))
draw = ImageDraw.Draw(img)
for block_type, poly in results:
# you can define a color per block type (see tesserocr.PT for block types list)
draw.line(poly + [poly[0]], fill=(0, 255, 0), width=2)
With Tesseract 4.0.0, a command like tesseract source/dir/myimage.tiff target/directory/basefilename hocr will create a basefilename.hocr file with block-, paragraph-, line-, and word-level bounding boxes for the OCR'ed text. Even the command without the hocr config creates a text file with newlines between block-level text, but the hocr format is more explicit.
More config options here: https://github.com/tesseract-ocr/tesseract/tree/master/tessdata/configs
Shortcut
It is also possible to open HOCR files directly with the PageViewer tool. The file extension has to be .xml, however.
The HOCR individual character step is now available in Tesseract since 4.1.
Once the installation check, use :
tesseract {image file} {output name} -c tessedit_create_hocr=1 -c hocr_char_boxes=1

Train tesseract stopped working

I'm using Serak Tesseract Trainer for Tesseract 3.0x. I added a Train Image, which then came from jTessBoxEditor (a Box Generator). When I pressed Train Tesseract, a DOS command prompts me, it's like training the image, then suddenly this appeared:
Reading dos.bookmanoldstyle.exp0.tr ... Font id = -1/0, class id = 1/42 on sample 0 font_id >= 0 && font_id < font_id_map_.SparseSize():Error:Assert failed:in file ....\classify\trainingsampleset.cpp, line 622
then a dialog box appeared that tells something about Shape clustering that has stopped working.
I don't know what went wrong, the first time I used this, it worked fine though. Anyone who can help me resolve this?
Save your font_properties file in Unix format (.sh) (which is available in notepad++) instead of normal text. Then use font_properties file as shown below wherever it is needed.
-F font_properties.sh
Do you have a correct entry in font_properties file or correct input filename? Assert failed - Training Tesseract