Tesseract OCR - recognize checkboxes as word

For a customer, I want to teach Tesseract to recognize checkboxes as a word. This worked fine when Tesseract only had to recognize an empty checkbox.
The following command, in combination with this tutorial, worked like a charm: Tesseract was able to find empty checkboxes and interpret them as "[_]":
tesseract -psm 10 deu2.unchecked1.exp0.JPG deu2.unchecked1.exp0.box nobatch box.train
Here is the command I use to analyze a document successfully:
tesseract test.png test -l deu1+deu2
Then I tried to train a checked checkbox, but got this error:
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
FAIL!
APPLY_BOXES: boxfile line 1/[X] ((60,30),(314,293)): FAILURE! Couldn't find a matching blob
APPLY_BOXES:
Boxes read from boxfile: 1
Boxes failed resegmentation: 1
Found 0 good blobs.
Generated training data for 0 words
Does anyone have an idea how to teach Tesseract to recognize checked checkboxes as well?
Thank you in advance!

After many more attempts, I figured out that it is of course possible to teach Tesseract different kinds of letters. But as far as I know today, there is no way to teach Tesseract a symbol that does not conform to certain "visual rules" of a letter. For example, a letter is always one connected stroke of ink, or at most a combination of ink and "something outside it" (for example: i, ä, ö, ü). The problem here is that no letter resembles a checked checkbox (one object inside another object). This confuses Tesseract and leads to crashes.

Related

OCR detecting E as £

I am using pytesseract (with Tesseract version 5) to scan an image. I have converted the image to black and white to remove the noise, but E is still being detected as £ (for example, £196893).
I have also tried setting the language, DPI, and PSM values, as most people suggest. Below are the settings I am using now. Please suggest a fix.
pytesseract.image_to_string(Image.open(impath), config=" --dpi 120 --psm 6 -l eng")
One of the sample pictures is shown below. For some samples it works fine, but for others it produces such strange characters.

One way to overcome this issue is to limit the characters that Tesseract looks for. To do so:
Create a file with an arbitrary name (e.g. "whitelist") in the Tesseract config directory. On Linux that directory is usually /usr/share/tesseract/tessdata/configs.
Add a line to that file containing only the characters that you want to search for in the text:
tessedit_char_whitelist *list_of_characters*
Then call Tesseract with the whitelist config file:
tesseract input.tif output nobatch whitelist
In your Python script, the parameters must then be set as follows (note that -l takes the language, so the config file names go after it):
pytesseract.image_to_string(Image.open(impath), config=" --dpi 120 --psm 6 -l eng nobatch whitelist")
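If you would rather not create a config file, the same variable can, as far as I know, be set inline with Tesseract's -c option. Here is a minimal sketch of that idea. The helper names build_whitelist_config and ocr_with_whitelist are my own (hypothetical), and the defaults simply mirror the settings from the question:

```python
def build_whitelist_config(whitelist, psm=6, dpi=120):
    """Build a pytesseract `config` string that limits recognition to `whitelist`.

    Hypothetical helper: it only assembles command-line flags as a string.
    """
    if not whitelist:
        raise ValueError("whitelist must contain at least one character")
    # The config string is split on whitespace before it reaches Tesseract,
    # so a whitelist containing whitespace would be cut into separate arguments.
    if any(ch.isspace() for ch in whitelist):
        raise ValueError("whitelist must not contain whitespace")
    return f"--dpi {dpi} --psm {psm} -c tessedit_char_whitelist={whitelist}"


def ocr_with_whitelist(image_path, whitelist, lang="eng"):
    """Run OCR on one image, restricted to the given characters."""
    # Imported lazily so the config builder works without pytesseract installed.
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(
        Image.open(image_path),
        lang=lang,
        config=build_whitelist_config(whitelist),
    )
```

For the sample in the question you would call something like ocr_with_whitelist(impath, "E0123456789"), assuming those are the only characters that can occur.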

caffe could not open or find file

I'm new to caffe and after successfully running an example I'm trying to use my own data. However, when trying to either write my data into the lmdb data format or directly trying to use the solver, in both cases I get the error:
E0201 14:26:00.450629 13235 io.cpp:80] Could not open or find file ~/Documents/ChessgameCNN/input/train/731_1.bmp 731
The path is right, but it's weird that the label 731 is part of this error message. That implies that it's reading it as part of the path instead of as a label. The text file looks like this:
~/Documents/ChessgameCNN/input/train/731_1.bmp 731
Is it because the labels are too high? Or maybe because the labels don't start with 0? I've searched for this error and all I found were examples with relatively few labels, about 1-5, but I have about 4096 classes, and I don't always have examples of every class in the training data. Maybe this is a problem, too (certainly for learning, at least, but I didn't expect it to give me an actual error message). Usually, the label does not seem to be part of this error message.
For the creation of the lmdb file, I use the create_imagenet.sh from the caffe examples. For solving, I use:
~/caffe/build/tools/caffe train --solver ~/Documents/ChessgameCNN/caffe_models/caffe_model_1/solver_1.prototxt 2>&1 | tee ~/Documents/ChessgameCNN/caffe_models/caffe_model_1/model_1_train.log
I tried different image data types, too: PNG, JPEG and BMP. So this isn't the culprit, either.
If it is really because of my choice of labels, what would be a viable workaround for this problem?
Thanks a lot for your help!

I had the same issue. Check that the lines in your text file don't have trailing spaces.

I was facing a similar problem with convert_imageset. I solved it by simply removing the trailing spaces in the text file that contains the labels.
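If the listing file is large, the trailing spaces can be removed with a few lines of Python rather than by hand. This is only a sketch, assuming each line has the form "<path> <integer label>" as in the question. The function names clean_listing and rewrite_listing are made up for this example:

```python
def clean_listing(lines):
    """Return the 'path label' lines with surrounding whitespace removed.

    Blank lines are dropped. A line that does not end in an integer label
    raises ValueError so that malformed entries are noticed early.
    """
    cleaned = []
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()  # removes spaces, tabs, \r and \n at both ends
        if not line:
            continue
        parts = line.rsplit(None, 1)  # split off the last whitespace-separated field
        if len(parts) != 2 or not parts[1].isdigit():
            raise ValueError(f"line {number}: expected '<path> <label>', got {raw!r}")
        path, label = parts
        cleaned.append(f"{path} {label}")
    return cleaned


def rewrite_listing(src, dst):
    """Read a listing file, clean it, and write the result to dst."""
    with open(src, encoding="utf-8") as handle:
        cleaned = clean_listing(handle)
    with open(dst, "w", encoding="utf-8", newline="\n") as handle:
        handle.write("\n".join(cleaned) + "\n")
```

Writing to a separate dst file keeps the original listing around in case something goes wrong.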

Train tesseract stopped working

I'm using Serak Tesseract Trainer for Tesseract 3.0x. I added a training image that came from jTessBoxEditor (a box generator). When I pressed Train Tesseract, a DOS command prompt opened and seemed to be training on the image, and then suddenly this appeared:
Reading dos.bookmanoldstyle.exp0.tr ... Font id = -1/0, class id = 1/42 on sample 0 font_id >= 0 && font_id < font_id_map_.SparseSize():Error:Assert failed:in file ....\classify\trainingsampleset.cpp, line 622
Then a dialog box appeared saying that Shape clustering has stopped working.
I don't know what went wrong; the first time I used this, it worked fine. Can anyone help me resolve this?

Save your font_properties file in Unix format (.sh), an option available in Notepad++, instead of as normal text. Then pass the font_properties file as shown below wherever it is needed.
-F font_properties.sh
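If "Unix format" here means Unix (LF) line endings, which is my assumption, the conversion can also be done without Notepad++. A minimal sketch follows. The function names are hypothetical, and the font names in the usage note are only illustrative:

```python
from pathlib import Path


def to_unix_newlines(data: bytes) -> bytes:
    """Convert Windows (CRLF) and old Mac (CR) line endings to Unix (LF)."""
    # Replace CRLF first so that a CRLF pair does not become two newlines.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def convert_file(src, dst):
    """Write a copy of src with Unix line endings to dst."""
    Path(dst).write_bytes(to_unix_newlines(Path(src).read_bytes()))
```

For example, to_unix_newlines(b"arial 0 0 0 0 0\r\ntimes 0 0 0 1 0\r\n") gives the same two entries separated by plain "\n".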

Do you have a correct entry in the font_properties file, and is the input filename correct? See "Assert failed - Training Tesseract".

Tesseract: Specifying regions of text

I'm using tesseract-ocr-3.01 to scan many forms. The forms all follow a template, so I already know where the regions/rectangles of text are.
Is there a way to pass those regions to tesseract when using the command-line tool?

I found the answer, thanks to this thread.
It seems that tesseract supports the uzn format (used in the UNLV tests).
From the thread:
Calling tesseract with the parameter "-psm 4" and giving the uzn file the same name as the image seems to work.
Example: If we have C:\input.tif and C:\input.uzn, we do this:
tesseract -psm 4 C:\input.tif C:\output
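Since the regions are known from the template, the uzn file can be generated by a script. The sketch below assumes that each uzn line has the form "left top width height label", which is my understanding of the format and is not confirmed by the thread. The function names and the "Text" label are hypothetical:

```python
from pathlib import Path


def format_uzn(regions):
    """Return uzn file content for (left, top, width, height, label) tuples.

    Assumed line format: 'left top width height label'.
    """
    lines = []
    for left, top, width, height, label in regions:
        if width <= 0 or height <= 0:
            raise ValueError(f"region {label!r} must have positive width and height")
        lines.append(f"{left} {top} {width} {height} {label}")
    return "\n".join(lines) + "\n"


def write_uzn_for(image_path, regions):
    """Write the zones next to the image, using the image's base name."""
    uzn_path = Path(image_path).with_suffix(".uzn")
    uzn_path.write_text(format_uzn(regions), encoding="ascii")
    return uzn_path
```

Calling write_uzn_for("input.tif", [(100, 50, 400, 30, "Text")]) would create input.uzn beside the image, ready for the tesseract call shown above.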

This may not be an optimal answer, but here goes:
I'm not sure whether the command-line tool has options to specify text regions.
What you can do is use a Tesseract wrapper on another platform (EmguCV has Tesseract built in). You take the scanned image, crop out the text regions, and give them to Tesseract one at a time. This way you'll also avoid any inaccuracies in Tesseract's page-layout analysis.
E.g.:
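Image<Gray,Byte> scannedImage = new Image<Gray,Byte>(path_to_scanned_image);
// assuming you know a text region (x, y, width, height)
Image<Gray,Byte> textRegion = new Image<Gray,Byte>(100,20);
// restrict the scanned image to that region and copy it out
scannedImage.ROI = new Rectangle(0,0,100,20);
scannedImage.CopyTo(textRegion);
// ocr is assumed to be an already-initialized Tesseract instance
ocr.Recognize(textRegion);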
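The same crop-then-recognize idea can be sketched in Python with Pillow and pytesseract instead of EmguCV. This is a swapped-in toolchain, not what the answer above uses. The function names are hypothetical, and regions are assumed to be (x, y, width, height) tuples taken from the form template:

```python
def region_to_box(x, y, width, height):
    """Convert (x, y, width, height) to Pillow's (left, upper, right, lower) box."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    return (x, y, x + width, y + height)


def ocr_regions(image_path, regions):
    """Crop each known region and run OCR on it, one region at a time."""
    # Imported lazily so region_to_box can be used without these packages.
    import pytesseract
    from PIL import Image

    texts = []
    with Image.open(image_path) as page:
        for x, y, width, height in regions:
            crop = page.crop(region_to_box(x, y, width, height))
            texts.append(pytesseract.image_to_string(crop))
    return texts
```

Because every crop is passed to Tesseract separately, the results come back in the same order as the regions in the template.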

igraph for python

I'm thoroughly confused about how to read/write into igraph's Python module. What I'm trying right now is:
g = igraph.read("football.gml")
g.write_svg("football.svg", g.layout_circle() )
I have a football.gml file, and this code runs and writes a file called football.svg. But when I try to open it using InkScape, I get an error message saying the file cannot be loaded. Is this the correct way to write the code? What could be going wrong?

The write_svg function is sort of deprecated; it was meant only as a quick hack to allow SVG exports from igraph even if you don't have the Cairo module for Python. It has not been maintained for a while so it could be the case that you hit a bug.
If you have the Cairo module for Python (on most Linux systems, you can simply install it from an appropriate package), you can simply do this:
igraph.plot(g, "football.svg", layout="circle")
This would use Cairo's SVG renderer, which is likely to generate the correct result. If you cannot install the Cairo module for Python for some reason, please file a bug report on https://bugs.launchpad.net/igraph so we can look into this.
(Even better, please file a bug report even if you managed to make it work using igraph.plot).

Couple years late, but maybe this will be helpful to somebody.
The write_svg function seems not to escape ampersands correctly. Texas A&M has an ampersand in its label -- Inkscape is probably confused because it sees & rather than &amp;. Just open football.svg in a text editor to fix that, and you should be golden!
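If there are many labels, the manual fix can be scripted. The sketch below replaces every & that does not already start an entity-like sequence with &amp;. It is a heuristic of my own, not part of igraph, and the function names are hypothetical:

```python
import re

# An ampersand that is NOT followed by something shaped like an entity:
# a name (amp;), a decimal reference (#38;) or a hex reference (#x26;).
_BARE_AMPERSAND = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)


def escape_bare_ampersands(svg_text):
    """Escape unescaped ampersands, leaving existing entities untouched."""
    return _BARE_AMPERSAND.sub("&amp;", svg_text)


def fix_svg_file(src, dst):
    """Write a copy of the SVG at src with bare ampersands escaped."""
    with open(src, encoding="utf-8") as handle:
        fixed = escape_bare_ampersands(handle.read())
    with open(dst, "w", encoding="utf-8") as handle:
        handle.write(fixed)
```

Text that merely looks like an entity (an ampersand, a word, then a semicolon) is left alone, so a label such as "A&M;" would still need a manual check.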
