Can you train tesseract with images instead of text and a font? - ocr

The Tesseract documentation explains a method of training with sample text and a font.
I used jTessBoxEditor, but it works pretty much like the Tesseract training tools.
I got somewhat acceptable results with this, but I guess the optimal solution would be to train Tesseract with the actual kind of images it will have to recognize.
As I only need to recognize digits, I could cut each of them out by hand, perhaps in many versions of each digit, and train Tesseract with those images, even setting the boxes by hand.
Is there a way to do this?

If you are trying to train Tesseract 4, you can use ocrd-train.
You basically prepare images corresponding to each line of text together with their ground truth, and it does all the remaining work for you.
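As a rough illustration of the "image plus ground truth" pairing described above, here is a minimal Python sketch. It assumes the convention, as I recall it from the ocrd-train README, that each line image sits next to a text file with the same base name and a `.gt.txt` extension; check the README of the version you use. The helper name `write_ground_truth`, the directory name and the image names are hypothetical.

```python
from pathlib import Path


def write_ground_truth(labels, gt_dir):
    """Write one <image-stem>.gt.txt transcription file per line image.

    labels maps an image file name (e.g. "digit_0001.png") to the exact
    text shown in that image. The images themselves are assumed to be
    copied into gt_dir separately.
    """
    gt_dir = Path(gt_dir)
    gt_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for image_name, text in sorted(labels.items()):
        gt_path = gt_dir / (Path(image_name).stem + ".gt.txt")
        # One line of text per file, matching the single-line image.
        gt_path.write_text(text.strip() + "\n", encoding="utf-8")
        written.append(gt_path)
    return written
```

For hand-cut digit images, `labels` could simply map each crop to the digit it shows, for example `{"digit_0001.png": "7"}`.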

Related

Tesseract on a very specific, large amount of similar images

I have a huge dataset of images from which I want to read the text. The data always has the same form in these images: there are two temperature values and two velocity values. Here are some examples:
The biggest problem, I think, is that the text is slightly transparent.
I tried to do it with Tesseract (pytesseract and tesseract.js), but the results are not really good. Sometimes the temperature values are interpreted correctly, but the velocity values are rarely correct. In particular, the decimal point isn't found.
Is there any possibility of optimizing Tesseract's predictions by telling it the pattern of my text, since it is always the same in every image?
What I already did is configure the whitelist to
tessedit_char_whitelist =
Do you have any other ideas on how best to preprocess these images to get better results? I already tried increasing the contrast. This resulted in a small improvement, but the results are still not particularly good.
Of course, I'm also open to any other OCR libraries and programming languages if you think they would work better.
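One way to exploit the fixed layout without touching Tesseract itself is to validate its output afterwards and flag images whose text does not match the expected pattern. The sketch below is only an illustration: it assumes each image should yield exactly four decimal numbers (two temperatures, two velocities), and the function name `parse_readings` and the sample string are hypothetical.

```python
import re

# A number with an optional decimal part; a comma is accepted because
# OCR engines sometimes misread the decimal point.
NUMBER = re.compile(r"-?\d+(?:[.,]\d+)?")


def parse_readings(ocr_text, expected_count=4):
    """Extract the numeric readings from raw OCR text.

    Returns a list of floats, or None when the text does not contain
    exactly expected_count numbers, so the image can be flagged for a
    retry with different preprocessing.
    """
    values = [float(m.replace(",", ".")) for m in NUMBER.findall(ocr_text)]
    return values if len(values) == expected_count else None
```

A `None` result does not say which value was wrong, only that the image needs another attempt or a manual check.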

Achieve better recognition results via training Tesseract

I have a question about achieving better recognition results with Tesseract. I am using Tesseract to recognize serial numbers. The serial numbers use a single font, consist of the characters A-Z and 0-9, and occur in different sizes and lengths.
At the moment I am able to recognize about 40% of the serial number images correctly. The images are taken with a mobile phone camera, so the image quality isn't the best.
Particularly problematic character pairs are 8/B and 5/6. Since I am recognizing only serial numbers, I am not using any dictionary improvements, and every character is recognized independently.
My question is: does anyone already have experience achieving better recognition results by training Tesseract? How many images would be needed to get good results?
For training Tesseract, should I use printed and then photographed serial numbers, or the original digital serial numbers, without printing and photographing?
Maybe somebody already has experience in this area.
Regarding training Tesseract: I have already trained Tesseract with some images. For this I printed all characters in different sizes, photographed them, and labeled them correctly. Example training photo of the character 5
Is this a good or bad training example? Since I only want to recognize single characters without any dependency, I thought I didn't have to use words for training.
So far I have only trained with 3 of these images for the characters B, 8, 6 and 5, which doesn't give better recognition than the original English (eng) Tesseract data.
best regards,
Christoph
I am currently working on a Sikuli application that uses Tesseract to read text (strings and numbers) from screenshots. I found that the best way to achieve accuracy was to process the screenshot before performing OCR on it. Most of the text I am reading is green text on a black background, which makes preprocessing my preferred solution. I used Scalr's resize method on a BufferedImage to increase the size of the image:
BufferedImage bufImg = Scalr.resize(...)
which instantly yielded more accurate results with black text on a gray background. I then used the BufferedImage types BufferedImage.TYPE_BYTE_GRAY and BufferedImage.TYPE_BYTE_BINARY when creating a new BufferedImage to convert the image to grayscale and black/white, respectively.
Following these steps brought Tesseract's accuracy from about 30% to around 85% when dealing with green text on a black background, and to nearly 100% when dealing with normal black text on a white background. (Sometimes letters within a word are mistaken for numbers, e.g. hel10.)
I hope this helps!
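The three steps described above (enlarge, convert to grayscale, reduce to black/white) can be sketched in plain Python on nested lists of pixels. This is not the Scalr/BufferedImage code from the answer: nearest-neighbour scaling stands in for Scalr's resize, the luminance weights and the threshold of 128 are assumptions, and the optional `invert` flag (to get dark text on a light background) is an addition of mine. In practice an image library would do this faster.

```python
def to_gray(pixels):
    """Convert rows of (r, g, b) tuples to rows of 0-255 luminance values."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
            for row in pixels]


def upscale(gray, factor):
    """Nearest-neighbour enlargement by an integer factor."""
    out = []
    for row in gray:
        # Repeat every pixel horizontally, then repeat the row vertically.
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def binarize(gray, threshold=128, invert=False):
    """Map each pixel to 0 (black) or 255 (white).

    invert=True turns light text on a dark background into dark text
    on a light background.
    """
    light, dark = (0, 255) if invert else (255, 0)
    return [[light if v >= threshold else dark for v in row] for row in gray]
```

For green text on black, a call chain such as `binarize(upscale(to_gray(pixels), 2), invert=True)` would produce black text on a white background.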

Tesseract OCR finds too few boxes / ignores small characters

I have a problem with the training/text recognition process with Tesseract. Here is my training data: http://s11.postimg.org/867aq10ur/dot_dotmatrixfont_exp0.png While training, Tesseract ignores the dashes (I've marked them with red boxes, just to make clear which ones I mean), and when I use the trained data for text recognition, it ignores them as well. Today I played around with the Tesseract parameters (SetVariable(name, value)), but unfortunately I had no success.
What can I do to teach Tesseract those dashes? Thank you in advance!
Tesseract training is pretty tricky.
Your best chance might be to handle the dashes as a single char.
If your box editor, or whatever tool you are using, does not see the dashes at all, try running some image processing first, especially thresholding or inversion. Try taking a look at OpenCV; it has some excellent tools for this kind of image processing.
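One reading of "handle the dashes as a single char" is to give each dash exactly one entry in the box file, even when a dot-matrix dash consists of several separate dots. The sketch below merges fragment rectangles into one box and formats it as a box-file line (`symbol left bottom right top page`, with coordinates measured from the bottom-left corner of the image). The helper names and the sample coordinates are hypothetical.

```python
def merge_boxes(boxes):
    """Return the smallest (left, bottom, right, top) box covering all boxes."""
    lefts, bottoms, rights, tops = zip(*boxes)
    return (min(lefts), min(bottoms), max(rights), max(tops))


def box_line(symbol, box, page=0):
    """Format one box-file line: symbol left bottom right top page."""
    left, bottom, right, top = box
    return f"{symbol} {left} {bottom} {right} {top} {page}"
```

A line produced this way can be added to the box file by hand when the box editor proposes no box for the dash.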

Creating a training image for Tesseract OCR

I'm writing a generator for training images for Tesseract OCR.
When generating a training image for a new font for Tesseract OCR, what are the best values for:
The DPI
The font size in points
Should the font be anti-aliased or not
Should the bounding boxes fit snugly or not
The 2nd question is partly answered here: http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3#Generate_Training_Images
There is no need to train with multiple sizes. 10 point will do. (An exception to this is very small text. If you want to recognize text with an x-height smaller than about 15 pixels, you should either train it specifically or scale your images before trying to recognize them.)
Questions 1 and 3: in my experience, 300 dpi images with non-anti-aliased fonts work well. More specifically, I have used the following convert parameters on a training PDF, which generated a satisfactory image:
convert -density 300 -depth 8 [input].pdf -background white -flatten +matte -compress none -monochrome [output].tif
But then I tried to add a dotted font to Tesseract, and it only detected characters properly when I used a 150 dpi image. So I don't think there's a general solution; it depends on the kind of fonts you're trying to add.
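The relation between point size, DPI and pixel size in the advice above follows from one point being 1/72 inch. The sketch below computes the pixel height of a font and the integer scale factor needed to reach the roughly 15 pixel x-height mentioned in the quoted documentation. The function names are hypothetical, and the x-height itself has to be measured from the image, since its ratio to the font size depends on the font.

```python
import math


def em_height_px(point_size, dpi):
    """Font size in pixels: one point is 1/72 inch."""
    return point_size * dpi / 72.0


def scale_for_min_x_height(x_height_px, minimum_px=15):
    """Integer upscale factor needed to reach the minimum x-height."""
    if x_height_px <= 0:
        raise ValueError("x_height_px must be positive")
    return max(1, math.ceil(minimum_px / x_height_px))
```

For example, a 10 point font rendered at 300 dpi is about 41.7 pixels tall, while text with a measured x-height of 6 pixels would need to be enlarged by a factor of 3 before recognition.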
I found the answer to the 4th question - "Should the bounding boxes fit snugly".
It seems that fitting the rectangles as tightly as possible gives much better results.
For the other questions, 12 pt and 300 dpi will be good enough, as @Yaroslav suggested. I think anti-aliasing is better turned off.
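A snug box is simply the smallest rectangle that contains all ink pixels of a glyph. The following sketch computes it for a binarized glyph given as rows of 0/1 values; the function name `tight_bbox` is hypothetical. It uses image coordinates with the origin at the top-left, so the y values would need to be flipped before writing them to a Tesseract box file, which counts from the bottom.

```python
def tight_bbox(bitmap):
    """Return (left, top, right, bottom) of the ink pixels, inclusive.

    bitmap is a list of rows; truthy values are ink. Returns None for a
    blank bitmap.
    """
    rows = [y for y, row in enumerate(bitmap) if any(row)]
    if not rows:
        return None
    cols = [x for row in bitmap for x, v in enumerate(row) if v]
    return (min(cols), rows[0], max(cols), rows[-1])
```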
A good tool for Tesseract training: http://vietocr.sourceforge.net/training.html
It is a good tool because it has a number of advantages:
Bounding boxes on letters can be edited through a GUI.
It automatically creates all required files.
It automatically combines all files, such as freq-dawg, word-dawg, user-words (can be an empty file), Inttemp, Normproto, Pffmtable, Unicharset, DangAmbigs (can be an empty file) and shapetable, into a single eng.traineddata file.
New training data can be used with the existing Tesseract file eng.traineddata.

How to make tesseract to give relevant results in the presence of noise?

I am using Tesseract 3.0.0 and I ran into the following problem:
When there is something too small for Tesseract to recognize, it seems to be merged with other fragments. As a result, nothing relevant is returned.
The image below shows 3 cases. Only the rectangle with the dashed line is passed to Tesseract. Above the rectangle is the result (V over T means a new line).
The last case is the problematic one. Is there some way to improve Tesseract's results in situations like this?
As far as I know, Tesseract does not have proper image segmentation yet (or Document Analysis, as it is called in commercial OCR applications). Typically, before OCR is done, the image gets split into separate areas that contain text, pictures, barcodes, lines and so on. Then you apply OCR only to the text areas and don't face the problems you have just described.
Earlier versions of Tesseract did not have that functionality at all, and Tesseract was supposed to be used only as a line recognizer, or so-called field-level recognizer, where you use it on small snippets of text cut from a bigger image.
I did not follow thoroughly what was introduced in 3.0; it is probably partially there already, but obviously it does not work as expected, as you have just found out.
There is another open-source project, OCRopus, that approached this problem exactly as I described: first Document Analysis (aka Segmentation) and only then OCR. Their earlier versions actually used Tesseract for OCR after the analysis step finished. But later they introduced their own OCR (which is still not very good) and moved Tesseract plugin support down the priority list.
Here's what you can actually do to address your problem:
If your images have a very typical structure, you can try to do some dumb segmentation and cut the text from the image yourself before passing it to Tesseract. However, if you expect to support a wide variety of images, just forget it.
You can check OCRopus and see if its segmentation works for your images. If so, you can spend some time making OCRopus + Tesseract work together.
Well, if what you do is not just for fun and you value your time, I would recommend thinking about a real OCR engine like ABBYY. You will get much higher accuracy of both segmentation and OCR out of the box, and professional customer support, of course.
Disclaimer: I work for ABBYY
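For the first option above, the "dumb segmentation", one simple approach is a column projection: find the horizontal spans that contain ink, discard spans too narrow to be a character, and pass only the remaining crops to Tesseract. The sketch below works on a binarized image given as rows of 0/1 values. It is an assumption about what such a segmentation could look like, not something from the answer, and the `min_width` default is arbitrary.

```python
def column_runs(bitmap, min_width=2):
    """Find horizontal spans of consecutive columns that contain ink.

    Returns a list of (start, end) column indices, end exclusive. Spans
    narrower than min_width are treated as noise and dropped.
    """
    width = len(bitmap[0]) if bitmap else 0
    has_ink = [any(row[x] for row in bitmap) for x in range(width)]
    runs, start = [], None
    for x, ink in enumerate(has_ink + [False]):  # sentinel closes the last run
        if ink and start is None:
            start = x
        elif not ink and start is not None:
            if x - start >= min_width:
                runs.append((start, x))
            start = None
    return runs


def crop_columns(bitmap, start, end):
    """Cut the columns start..end-1 out of every row."""
    return [row[start:end] for row in bitmap]
```

This only helps when noise and text are separated by blank columns; fragments that touch a character would need a different approach.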