What thresholding (binarization) algorithm is used in Tesseract OCR?

I am working on a project that needs accurate OCR results for images with complex backgrounds. So I am comparing the results of two OCR engines (one of them is Tesseract) to make my choice. The point is that the results are strongly affected by the pre-processing step, especially image binarization. I extracted the binarized image from the other OCR engine and passed it to Tesseract, which improved Tesseract's results by 30-40%.
I have two questions and your answers would be of much help to me:
What binarization algorithm does Tesseract use, and is it configurable?
Is there a way to extract the binarized image of Tesseract OCR so I can test the other OCR with it?
Thanks in advance :)

I think I have found the answers to my questions:
1- The binarization algorithm used is Otsu thresholding. You can see it here in line 179.
2- To get the binarized image, you can call a method on the Tesseract API:
PIX* thresholded = api->GetThresholdedImage(); // thresholded must be freed with pixDestroy()
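Conceptually, Otsu's method picks the single global threshold that maximizes the between-class variance of the grayscale histogram. Here is a minimal self-contained sketch of the algorithm for illustration only; it is not Tesseract's actual implementation, and the function and variable names are my own:

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Otsu's method: choose the histogram cut that maximizes the
// between-class variance of the background/foreground split.
int otsuThreshold(const std::vector<uint8_t>& pixels) {
    std::array<int, 256> hist{};
    for (uint8_t p : pixels) ++hist[p];

    const double total = static_cast<double>(pixels.size());
    double sumAll = 0.0;
    for (int i = 0; i < 256; ++i) sumAll += i * static_cast<double>(hist[i]);

    double sumBg = 0.0, weightBg = 0.0, bestVar = -1.0;
    int bestT = 0;
    for (int t = 0; t < 256; ++t) {
        weightBg += hist[t];
        if (weightBg == 0) continue;          // no background class yet
        const double weightFg = total - weightBg;
        if (weightFg == 0) break;             // no foreground class left
        sumBg += t * static_cast<double>(hist[t]);
        const double meanBg = sumBg / weightBg;
        const double meanFg = (sumAll - sumBg) / weightFg;
        const double betweenVar =
            weightBg * weightFg * (meanBg - meanFg) * (meanBg - meanFg);
        if (betweenVar > bestVar) { bestVar = betweenVar; bestT = t; }
    }
    return bestT;  // pixels <= bestT fall in one class, > bestT in the other
}
```

Because the threshold comes from the whole histogram, a shadow or background gradient in one region can pull it off everywhere, which is exactly why images with rich backgrounds suffer under a global method.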

Otsu thresholding is a global filter. You can use a local (adaptive) filter to get better results.
You can look at Sauvola's binarization (see here) or Nick's (here). Both algorithms are improvements on Niblack's method.
I used one of them to binarize my image for an OCR project and got better results.
Good luck
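To make the difference concrete, Sauvola's rule computes a per-pixel threshold T = m · (1 + k · (s/R − 1)) from the local mean m and standard deviation s in a window around each pixel. Below is a naive self-contained sketch under stated assumptions: unoptimized windowing, conventional (not tuned) defaults for k and R, and names of my own choosing:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Sauvola local thresholding: each pixel is compared against a threshold
// derived from the mean and standard deviation of its surrounding window.
// T = m * (1 + k * (s / R - 1)); k ~ 0.2..0.5 and R = 128 are customary
// starting points for 8-bit images.
std::vector<uint8_t> sauvolaBinarize(const std::vector<uint8_t>& img,
                                     int width, int height,
                                     int window = 15, double k = 0.3,
                                     double R = 128.0) {
    std::vector<uint8_t> out(img.size());
    const int half = window / 2;
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            double sum = 0.0, sumSq = 0.0;
            int n = 0;
            for (int dy = -half; dy <= half; ++dy) {
                for (int dx = -half; dx <= half; ++dx) {
                    int yy = y + dy, xx = x + dx;
                    if (yy < 0 || yy >= height || xx < 0 || xx >= width)
                        continue;  // clip the window at the image border
                    double v = img[yy * width + xx];
                    sum += v; sumSq += v * v; ++n;
                }
            }
            double mean = sum / n;
            double stddev = std::sqrt(std::max(0.0, sumSq / n - mean * mean));
            double T = mean * (1.0 + k * (stddev / R - 1.0));
            out[y * width + x] = (img[y * width + x] > T) ? 255 : 0;
        }
    }
    return out;
}
```

Because the threshold adapts to each neighborhood, uneven illumination in one corner no longer drags down the decision elsewhere; production code would use integral images to make the windowing O(1) per pixel.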

Related

Can I train tesseract with single words?

Can I train tesseract using image/text pairs where the images and texts are just single words each? Most training examples I've seen on Github use lines, each line of text being an image with the correct text for that line. However I have a system which is already going to be producing word image/text pairs and I'd like to feed that back into training. Any reason why not? I know that there are page segmentation modes and that word segmentation and line segmentation are not the same thing. But I understand that psm only applies to inference and not training?
Update: I've posted this to the Tesseract GitHub issues and the Google group with no response there either. I'm not sure whether the question is badly formulated, or if it's just the case that no one knows the answer? I'm hoping that a bounty might encourage some input.

How to improve accuracy of tesseract engine on my images?

I use the Tesseract engine to OCR my images, as below.
image1 to OCR
image2 to OCR
I used the eng language and configured the engine with a whitelist of characters: "0123456789abcdefghijklmnopqrstuvwxyz"
pOCREngine->SetVariable("tessedit_char_whitelist", "0123456789abcdefghijklmnopqrstuvwxyz");
The accuracy is not good, around 10% or so. I have tried to train the engine on ~200 such images and combine the trained data as eng+mytrainedfont. The accuracy did not improve.
Does anyone have any idea how to improve OCR of such images? Thanks in advance.
The images you provided are difficult to get 100% accuracy on; I tried it myself. To improve Tesseract's results you will need to apply some image-processing methods.
I used a Gaussian filter on both, then a maximum filter to reduce the noise, and after that I binarized the images.
I'm using Tesseract in C++ with the OpenCV libraries for the image processing. I tested the following images with the following results:
image1
result: yfsxf
image2
result: 26ww(
Hope this gives you an idea on how to improve tesseract results. Unfortunately the images you provided are a bit tough to read with tesseract.
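For illustration, the maximum filter mentioned above replaces each pixel with the brightest value in its neighborhood, which wipes out thin dark speckles. The answer used OpenCV; its grayscale dilate is the equivalent operation. This plain sketch (names my own) just shows what the filter does:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Maximum filter: each output pixel is the brightest value within the
// (2*radius+1)^2 neighborhood. On dark-text/light-background images this
// removes thin dark noise; it also thins the strokes, so keep radius small.
std::vector<uint8_t> maxFilter(const std::vector<uint8_t>& img,
                               int width, int height, int radius = 1) {
    std::vector<uint8_t> out(img.size());
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            uint8_t best = 0;
            for (int dy = -radius; dy <= radius; ++dy) {
                for (int dx = -radius; dx <= radius; ++dx) {
                    int yy = y + dy, xx = x + dx;
                    if (yy < 0 || yy >= height || xx < 0 || xx >= width)
                        continue;  // ignore samples outside the image
                    best = std::max(best, img[yy * width + xx]);
                }
            }
            out[y * width + x] = best;
        }
    }
    return out;
}
```

Running a Gaussian blur first (as the answer describes) softens single-pixel outliers so the maximum filter does not have to be aggressive.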

How to improve OCR results

I have tried to improve the results of open-source OCR software. I'm using Tesseract, because I find it still produces better results than GOCR, but with bad-quality input it has huge problems. So I tried to preprocess the image with various tools I found on the internet:
unpaper
Fred's ImageMagick Scripts: TEXTCLEANER
manually using GIMP
But I was not able to get good results with this bad test document (really just for testing; I don't need the content of this file):
http://9gag.com/gag/aBrG8w2/employee-handbook
This online service works surprisingly well with this test document:
http://www.onlineocr.net/
I'm wondering if it is possible to get similar results with Tesseract using smart preprocessing. Are the open-source OCR engines really so bad compared to the commercial ones? Even Google uses Tesseract to scan documents, so I was expecting more...
Tesseract's recognition precision is a little lower than that of the best commercial engine (ABBYY FineReader), but it's more flexible because of its nature.
This flexibility sometimes requires preprocessing, because Tesseract cannot handle every situation on its own.
Actually, it is used by Google because Google is its main sponsor!
The first thing you could do is enlarge the text so that the characters are at least 20 pixels wide. Since Tesseract uses the main segments of the characters' outlines as features, it needs larger characters than other algorithms do.
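As a sketch of that enlargement step, even simple nearest-neighbor replication grows the glyphs; smooth interpolation such as bilinear or bicubic usually OCRs better, so this only shows the mechanics, and the function name is my own:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Nearest-neighbor upscaling by an integer factor, e.g. to grow small
// glyphs toward the ~20 px size Tesseract prefers. Each output pixel
// simply copies the source pixel it maps back onto.
std::vector<uint8_t> upscale(const std::vector<uint8_t>& img,
                             int width, int height, int factor) {
    std::vector<uint8_t> out(static_cast<std::size_t>(width) * factor *
                             height * factor);
    const int outW = width * factor;
    for (int y = 0; y < height * factor; ++y)
        for (int x = 0; x < outW; ++x)
            out[y * outW + x] = img[(y / factor) * width + (x / factor)];
    return out;
}
```

In practice you would let an image library (e.g. OpenCV's resize or Leptonica's scaling functions) do this with proper interpolation, and scale by 2-3x rather than more, since blocky edges from over-enlargement can hurt recognition.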
Another thing you could try, still referring to the test document you mentioned, is to binarize your image with an adaptive thresholding method (you can find some information about that at https://dsp.stackexchange.com/a/2504), because the illumination varies across the page. Tesseract binarizes the image internally, but this may be a case where that step fails (it's similar to the example in Improving the quality of the output with Tesseract, where you can also find some other useful information).

OCR tool for handwritten mathematical notes

I have a PDF of 100+ handwritten pages that I need to convert to machine-readable text. So far I have tried Tesseract and a free online tool without success; the output seems to be gibberish.
tesseract myscan.png out -l eng
I've attached one example page. It contains both text, mathematical symbols (eg. integral sign) and occasionally pictures.
Maybe I'm using tesseract wrong? Could anyone try and get a decent output off this?
I use http://www.techsupportalert.com/best-free-ocr-software.htm
Watch out for the installer trying to load you up with other stuff
When it works, it just gives you bits to copy and paste.
But don't rush to download this one; try yours again first.
The problem likely isn't with the software, it's probably your input.
Scan at 600 dpi.
Try to increase the contrast and sharpen the image. The more clearly the letters stand out from the background, and the clearer the interspacing of the loops, the better your chance of OCR capture.
These adjustments are best made in your original scanning software. An 8 MP or better camera can also produce the scan.
Use GIMP to tweak after the scan.

Tesseract OCR - Handwritten font

I'm trying to use Tesseract-OCR to detect the text in images that contain pure text, but the text uses a handwritten font called Journal.
Example:
The result is not the best:
Maxima! size` W (35)
Is there any possibility to improve the result or rather to get the exact result?
I am surprised Tesseract is doing so well. With a little bit of training you should be able to train the lower case 'l' to be recognised correctly.
The main problem you have is the top of the large T character. The horizontal line extends across 2 (possibly 3) other character cells and this would cause a problem for any OCR engine when it tries to segment the characters for recognition. Training may be able to help in this case.
The next problem is the . and : which are very light/thin and are possibly being removed with image pre-processing before the OCR even starts.
Overall the only chance to improve the results with Tesseract would be to investigate training. Here are some links which may help.
Alternative to Tesseract OCR Training?
Tesseract OCR Library learning font
Tesseract confuses two numbers
Like Andrew Cash mentioned, it'll be very hard to perform OCR on that T letter because it intersects several of the following characters.
To improve the results you may want to try a more accurate SDK. Have a look at ABBYY Cloud OCR SDK, a cloud-based OCR SDK recently launched by ABBYY. It's in beta, so for now it's totally free to use. I work at ABBYY and can provide you additional info on our products if necessary. I sent the image you attached to our SDK and got this response:
Maximal size: lall (35)