How to improve OCR results - open-source

I tried to improve the results of open-source OCR software. I'm using Tesseract because I find it still produces better results than GOCR, but it has huge problems with poor-quality input. So I tried to preprocess the image with various tools I found on the internet:
unpaper
Fred's ImageMagick Scripts: TEXTCLEANER
manually, using GIMP
But I was not able to get good results with this bad test document (really just for testing; I don't need the content of this file):
http://9gag.com/gag/aBrG8w2/employee-handbook
This online service works surprisingly well with this test document:
http://www.onlineocr.net/
I'm wondering whether smart preprocessing could get similar results out of Tesseract. Are the open-source OCR engines really so bad compared to commercial ones? Even Google uses Tesseract to scan documents, so I was expecting more...

Tesseract's recognition accuracy is a little lower than that of the best commercial engine (ABBYY FineReader), but it is more flexible because of its open-source nature.
This flexibility sometimes entails some preprocessing, because Tesseract cannot handle every situation on its own.
Actually, Google uses it because Google is its main sponsor!
The first thing you could do is enlarge the text so that the characters are at least 20 pixels wide. Tesseract uses the main segments of the characters' outlines as features, so it needs larger characters than some other algorithms do.
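As a rough sketch of the enlargement step: once you have measured the typical character width in your scan, the scale factor needed to reach 20 pixels follows directly. The function names and the 8-pixel measurement below are assumptions for illustration, not part of Tesseract; the resize itself can then be done in GIMP or ImageMagick.

```python
import math


def upscale_factor(char_width_px, target_px=20):
    """Smallest scale factor that makes characters at least target_px wide."""
    if char_width_px <= 0:
        raise ValueError("char_width_px must be positive")
    # Never shrink: text that is already large enough keeps a factor of 1.0.
    return max(1.0, target_px / char_width_px)


def scaled_size(width, height, factor):
    """Image dimensions after scaling, rounded up to whole pixels."""
    return math.ceil(width * factor), math.ceil(height * factor)


# Hypothetical scan: 640x480 pixels, characters measured at about 8 px wide.
factor = upscale_factor(8)
print(factor, scaled_size(640, 480, factor))
```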
Another thing you could try, again with the test document you mentioned, is to binarize the image with an adaptive thresholding method (there is some information about that at https://dsp.stackexchange.com/a/2504), because the illumination varies across the page. Tesseract binarizes the image internally, but this could be a case where that step fails (it's similar to the example in "Improving the quality of the output with Tesseract", where you can also find other useful information).
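To make the idea concrete, here is a minimal pure-Python sketch of one adaptive method, local-mean thresholding. It is an assumption on my part that this variant suits your document; the linked answer may describe others. The image is a plain list of rows of grayscale values, and `window` and `offset` are illustrative parameters.

```python
def adaptive_threshold(gray, window=15, offset=10):
    """Binarize a grayscale image (list of rows, values 0-255) by local mean.

    A pixel becomes black (0) when it is darker than the mean of its
    window x window neighbourhood minus offset, otherwise white (255).
    """
    h, w = len(gray), len(gray[0])
    # Integral image (padded with a zero row and column) for fast window sums.
    integral = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += gray[y][x]
            integral[y + 1][x + 1] = integral[y][x + 1] + row_sum
    r = window // 2
    out = []
    for y in range(h):
        y0, y1 = max(0, y - r), min(h, y + r + 1)
        row = []
        for x in range(w):
            x0, x1 = max(0, x - r), min(w, x + r + 1)
            total = (integral[y1][x1] - integral[y0][x1]
                     - integral[y1][x0] + integral[y0][x0])
            mean = total / ((y1 - y0) * (x1 - x0))
            row.append(0 if gray[y][x] < mean - offset else 255)
        out.append(row)
    return out


# Synthetic page: the background brightens from left to right and the "ink"
# pixels are 60 levels darker than the background around them.
ink = {(4, 5), (4, 20), (4, 35)}
page = [[100 + 3 * x - (60 if (y, x) in ink else 0) for x in range(40)]
        for y in range(9)]
binary = adaptive_threshold(page, window=9, offset=10)
found = {(y, x) for y in range(9) for x in range(40) if binary[y][x] == 0}
print(found == ink)
```

No single global threshold separates ink from background on this synthetic page, because the ink on the bright side is as light as the background on the dark side. On real images you would normally use a library routine instead, for example OpenCV's `cv2.adaptiveThreshold`.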

Related

OCR tool for handwritten mathematical notes

I have a PDF of 100+ handwritten pages that I need to convert to machine-readable text. So far I have tried Tesseract and a free online tool with no success. The output seems to be gibberish.
tesseract myscan.png out -l eng
I've attached one example page. It contains text, mathematical symbols (e.g. the integral sign) and occasionally pictures.
Maybe I'm using Tesseract wrong? Could anyone try to get decent output from this?
I use http://www.techsupportalert.com/best-free-ocr-software.htm
Watch out for the installer trying to load you up with other stuff.
When it works, it just gives you bits to copy and paste.
But don't rush to download this one; try yours again first.
The problem likely isn't the software; it's probably your input.
Scan at 600 dpi.
Try to increase the contrast and sharpen the image. The thinner and more distinct from the background the letters are, and the clearer the spaces inside the loops, the better your chance of a successful OCR capture.
These adjustments are best made in your original scanning software. An 8 MP or better camera can also make the scan.
Use GIMP to tweak after the scan.
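If you would rather script the contrast step than do it by hand, a linear contrast stretch is one simple option. This is only a sketch under my own assumptions: the image is a list of rows of grayscale values, and the percentile cut-offs are illustrative defaults, not recommended settings.

```python
def stretch_contrast(gray, low_pct=2, high_pct=98):
    """Linearly rescale pixels so the low/high percentiles map to 0 and 255."""
    values = sorted(v for row in gray for v in row)
    lo = values[(len(values) - 1) * low_pct // 100]
    hi = values[(len(values) - 1) * high_pct // 100]
    if hi == lo:
        # Flat image: there is nothing to stretch.
        return [row[:] for row in gray]
    scale = 255 / (hi - lo)
    # Clamp so pixels outside the percentile range stay within 0-255.
    return [[min(255, max(0, round((v - lo) * scale))) for v in row]
            for row in gray]


# Hypothetical faded scan: pencil strokes around 90-100 on grey paper at 160.
faded = [[90, 100, 160],
         [160, 95, 160]]
print(stretch_contrast(faded))
```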

Tesseract OCR - Handwritten font

I'm trying to use Tesseract-OCR to read text from images that contain nothing but text, but the text is set in a handwriting-style font called Journal.
Example:
The result is not the best:
Maxima! size` W (35)
Is there any way to improve the result or, better, to get the exact result?
I am surprised Tesseract is doing so well. With a little bit of training you should be able to train the lower case 'l' to be recognised correctly.
The main problem you have is the top of the large T character. The horizontal line extends across 2 (possibly 3) other character cells and this would cause a problem for any OCR engine when it tries to segment the characters for recognition. Training may be able to help in this case.
The next problem is the . and : which are very light/thin and are possibly being removed with image pre-processing before the OCR even starts.
Overall the only chance to improve the results with Tesseract would be to investigate training. Here are some links which may help.
Alternative to Tesseract OCR Training?
Tesseract OCR Library learning font
Tesseract confuses two numbers
As Andrew Cash mentioned, it will be very hard to recognize that letter T because it overlaps several of the characters that follow it.
To improve the results you may want to try a more accurate SDK. Have a look at ABBYY Cloud OCR SDK, a cloud-based OCR SDK recently launched by ABBYY. It's in beta, so for now it's totally free to use. I work at ABBYY and can provide additional info on our products if necessary. I sent the image you attached to our SDK and got this response:
Maximal size: lall (35)

handwriting recognition with simple training

I've been reading (and trying) OCR programs suggested in previous answers but I'm still without a clear answer to my problem.
I need to recognize handwritten English text. The text would span multiple lines, but each line is only one or two words long. The text comes from a different person each time. I could ask that person to provide a training file (e.g. with the alphabet and the digits 0-9), but I cannot really ask for much more complicated training than this.
I need to integrate the recognition as part of another (Java) application but the solution doesn't need to be Java. I can just execute it from Java and get the results from a text file.
Any recommendations?
I've already tested Tesseract (bad results without training, and training looks quite complex). Java OCR looked like the perfect solution (simple training, open source and Java), but it doesn't work well even with its own examples (has anybody had a better experience?). GOCR does not seem very active.
Of course I prefer free solutions but this is not a MUST (though the problem I see with a commercial option is that I must be able to integrate it in my own app which will be offered as SaaS)
From my experience ABBYY is one of the best for handwriting recognition, even without training. (It's possibly one of the most expensive too, though...) They have an SDK for Java.
http://www.abbyy.com
With a free trial, it's definitely worth a look!
I am on the lookout for handwritten text recognition software. So far the only one that has given better results than even ABBYY 11, using the same text for both, is SimpleOCR, which is freeware for OCR but only a 14-day trial for HCR!
I know I am answering after nearly 6 years, but if anyone's still looking, try using TensorFlow. Their website has a simple example for handwritten digit recognition (MNIST). You can take this example and adapt it for handwritten alphabet recognition (you need training data for this; I used NIST Special Database 19 to get it).

Alternative to Tesseract OCR Training?

For the past 3 months I've been trying to train Tesseract to identify a collection of images I have. Due to a real lack of proper documentation and a very high level of complexity, I'm starting to give up on Tesseract as a solution.
I'm looking for an alternative that would be relatively pain-free to train; I'm not looking to reinvent the wheel here.
If there isn't anything free, I guess paid solutions would have to do (nothing above $200).
Based on your comment, all you need is to scan a relatively small number of documents with almost 100% accuracy, and your budget is about $200.
Well, the answer is simple then. You don't need any programming solution. Just buy a quality commercial OCR product, e.g. ABBYY FineReader (disclaimer: I work for ABBYY). Prices differ by region, but I guess it is somewhere around your budget.
A commercial desktop OCR product will give you almost 100% accuracy out of the box on typical languages. These products also have convenient manual verification tools to fix any remaining errors. They typically support a wide variety of modern fonts, and if your font is unusual, they have a font training utility for that.
I do think that is the optimal solution for you.
UPDATE: Linux platform.
Unfortunately, there is almost no choice of high-quality OCR products for Linux, sorry. The only one I know of is from ABBYY: http://ocr4linux.com/en:start but it has no UI, verification or font training. At least you can give it a try to see whether its accuracy is good enough as it is, which may turn out to be the case.
You can use jTessBoxEditor to edit the box files you generate. Bundled with it is a PowerShell script to automate box file and final .traineddata file generation.
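If you want to inspect box files outside the editor, they are plain text and easy to parse. The sketch below assumes the usual Tesseract box-file layout of one symbol per line followed by left, bottom, right, top and page number; the sample lines and the `parse_box_line` helper are made up for illustration.

```python
from typing import NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line):
    """Parse one box-file line: 'symbol left bottom right top page'."""
    # Split from the right so the five numeric fields are always separated.
    symbol, *coords = line.rstrip("\n").rsplit(" ", 5)
    left, bottom, right, top, page = map(int, coords)
    return Box(symbol, left, bottom, right, top, page)


# Hypothetical box-file content for the first two glyphs of a line of text.
sample = ["M 12 40 41 70 0", "a 44 40 60 58 0"]
boxes = [parse_box_line(line) for line in sample]
for box in boxes:
    print(box.symbol, box.right - box.left, box.top - box.bottom)
```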

Reliably extracting identity fields from scanned documents / images?

I have to pull two pre-printed (not hand-written) fields out of a paper form, such that it can be automatically routed after being scanned. The fields contain batch and item identifiers, like "GG-9192" or "EPN/245G".
I've tried the following software:
Tesseract-OCR
Cuneiform
Canon ImageRunner built-in OCR
Asprise OCR Java API (demo)
I've tried the following settings:
Scanning at resolutions of 300dpi and 600dpi
Tried different fonts, including OCR-A and OCR-B.
In all cases output was pretty much all over the place. I can kick back documents for which I can't properly extract the necessary information, but I'm thinking it's going to be at least half of them. I considered some sort of fuzzy logic based on known values in a database, but sometimes these identifiers can differ by a single character, like "123G" and "123C".
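For reference, the fuzzy-matching idea I considered looks roughly like the sketch below: accept an OCR result only when exactly one known identifier is within a small edit distance, and kick the document back otherwise. The function names and the `max_distance` default are illustrative assumptions, not an existing API.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def resolve_identifier(ocr_text, known_ids, max_distance=1):
    """Return the unique closest known identifier, or None if none or ambiguous."""
    scored = sorted((edit_distance(ocr_text, k), k) for k in known_ids)
    if not scored or scored[0][0] > max_distance:
        return None
    if len(scored) > 1 and scored[1][0] == scored[0][0]:
        return None  # two candidates are equally close: reject the document
    return scored[0][1]


known = ["GG-9192", "EPN/245G", "123G", "123C"]
print(resolve_identifier("GG-9l92", known))  # one misread character
print(resolve_identifier("1236", known))     # equally close to 123G and 123C
```

The second call shows the weakness: a misread of "123G" that lands one character away from both "123G" and "123C" cannot be resolved, so the document still has to be rejected.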
Is this a lost cause? Perhaps OCR just isn't mature enough to handle a requirement of this nature? What other techniques might you recommend? Barcodes?
Edit: the containing application is in Java, so recommendations that come with free or cheap Java-based APIs would help.
Edit 2: if anyone is interested... without any special tuning, Cuneiform for Linux and the Canon ImageRunner worked best, with Tesseract-OCR and the Asprise Java API producing the worst results. None of the four was acceptable for anything but standard document-search-grade OCR. I'm beginning to think that this isn't going to work out.
If you have control over the fields, why use a human-readable format in the first place? For scanning, it seems like a QR Code, or something similar would be best. It is marked for orientation, and has some built-in error correction.
http://en.wikipedia.org/wiki/QR_Code
I started digging for products starting with Tomato's suggestion. I tried ABBYY and CVISION. Both have products that can automate OCR:
CVISION Maestro Recognition Server 4.0
ABBYY Recognition Server 2.0
In addition, ABBYY has SDKs for various platforms, and CVISION has an SDK that appears to work with at least VB/VC++.
I haven't tried either SDK yet, and am not sure it's necessary for my project. All I need is to extract the text from incoming PDFs. I did however try CVISION's server product, and with the OCR on its most accurate settings it worked really well. I haven't tried ABBYY's server product yet because I have to go through a reseller to get a trial. I'm in the process of doing so, but if it starts getting annoying I'm probably going to go with CVISION. I did try ABBYY's FineReader standalone product, and it worked very well, so I assume that their server product would too.