I am working on an OCR-related Android app, and I need to use multivariate logistic regression to classify alphabet characters. My question: can I use the Stanford Classifier (http://nlp.stanford.edu/software/classifier.shtml) for character recognition? Can it train on a dataset of images? And if it can't, please suggest a Java library for the purpose.
Great minds think alike; I was wondering the same thing, specifically for OCR, even though it's almost a year after you asked your question.
It sounds simple enough: all you would need to do is normalize each character into a 5x7 array (or maybe 64x128), and then classify it into the 26 upper-case and 26 lower-case letters, plus 10 digits and 31 punctuation glyphs on a keyboard... Seems doable. Maybe when I get a round tuit...
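For what it's worth, the normalization step is mechanical enough to sketch. Here's a minimal Java example that collapses a cropped character image into a 5x7 on/off grid a classifier could consume; the 5x7 size and the 25% fill threshold are arbitrary choices of mine, not something any particular library requires:

import java.awt.image.BufferedImage;

public class GridFeatures {
    /**
     * Downsample a cropped character image to a cols x rows grid of 0/1 cells.
     * A cell is "on" if at least a quarter of its pixels are dark.
     */
    static int[] toGrid(BufferedImage ch, int cols, int rows) {
        int[] grid = new int[cols * rows];
        int w = ch.getWidth(), h = ch.getHeight();
        for (int gy = 0; gy < rows; gy++)
            for (int gx = 0; gx < cols; gx++) {
                int x0 = gx * w / cols, x1 = (gx + 1) * w / cols;
                int y0 = gy * h / rows, y1 = (gy + 1) * h / rows;
                int dark = 0, total = 0;
                for (int y = y0; y < y1; y++)
                    for (int x = x0; x < x1; x++, total++) {
                        int rgb = ch.getRGB(x, y);
                        int lum = ((rgb >> 16 & 0xFF) + (rgb >> 8 & 0xFF) + (rgb & 0xFF)) / 3;
                        if (lum < 128) dark++;
                    }
                grid[gy * cols + gx] = (total > 0 && dark * 4 >= total) ? 1 : 0;
            }
        return grid; // feed this vector to whatever classifier you pick
    }
}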
It turns out that there is a Java library for OCR, https://sourceforge.net/projects/javaocr/, and it's called Java OCR (surprise! :-) ). The only problems are that:
1. It doesn't work out of the box. It needs to be trained.
2. The documentation isn't very good.
3. People have had trouble getting it to work.
Good luck.
I have tried to improve the results of open-source OCR software. I'm using Tesseract, because I find it still produces better results than GOCR, but it has huge problems with bad-quality input. So I tried to preprocess the image with various tools I found on the internet:
unpaper
Fred's ImageMagick Scripts: TEXTCLEANER
manually, using GIMP
But I was not able to get good results with this bad test document (it really is just for testing; I don't need the content of this file):
http://9gag.com/gag/aBrG8w2/employee-handbook
This online service works surprisingly well with this test document:
http://www.onlineocr.net/
I'm wondering if it is possible to get similar results with Tesseract by using smart preprocessing. Are the open-source OCR engines really so bad compared to commercial ones? Even Google uses Tesseract to scan documents, so I was expecting more...
Tesseract's recognition precision is a little lower than that of the best commercial engine (ABBYY FineReader), but it's more flexible because of its nature.
This flexibility sometimes entails some preprocessing, because Tesseract cannot handle every situation on its own.
It's actually used by Google because Google is its main sponsor!
The first thing you could do is enlarge the text so that characters are at least 20 pixels wide. Since Tesseract uses the main segments of the characters' borders as features, it needs larger characters than other algorithms do.
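As a rough illustration, here's one way to upscale in plain Java before handing the image to Tesseract; the factor is just a starting guess you would tune per document:

import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

public class Upscale {
    /** Scale an image by an integer factor so characters reach ~20+ px. */
    static BufferedImage scale(BufferedImage src, int factor) {
        BufferedImage dst = new BufferedImage(
                src.getWidth() * factor, src.getHeight() * factor,
                BufferedImage.TYPE_INT_RGB);
        Graphics2D g = dst.createGraphics();
        // Bicubic interpolation keeps character borders smooth, which matters
        // because Tesseract's features come from those borders.
        g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                RenderingHints.VALUE_INTERPOLATION_BICUBIC);
        g.drawImage(src, 0, 0, dst.getWidth(), dst.getHeight(), null);
        g.dispose();
        return dst;
    }
}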
Another thing you could try, again referring to the test document you mentioned, is to binarize your image with an adaptive thresholding method (you can find some information about that at https://dsp.stackexchange.com/a/2504), because some changes in illumination are present. Tesseract binarizes the image internally, but this could be a case where it fails to do so (it's similar to the example in Improving the quality of the output with Tesseract, where you can also find some other useful information).
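A minimal sketch of mean adaptive thresholding in plain Java, in case you want to experiment before reaching for a full image-processing library; the window radius and offset are tuning knobs, and a serious implementation would use an integral image instead of this brute-force window sum:

import java.awt.image.BufferedImage;

public class AdaptiveThreshold {
    /**
     * Mean adaptive threshold: a pixel goes black if it is darker than the
     * average of its (2r+1)x(2r+1) neighbourhood minus a constant c.
     * r = 15, c = 10 is just a starting guess.
     */
    static BufferedImage binarize(BufferedImage src, int r, int c) {
        int w = src.getWidth(), h = src.getHeight();
        int[] lum = new int[w * h];
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++) {
                int rgb = src.getRGB(x, y);
                lum[y * w + x] = ((rgb >> 16 & 0xFF) + (rgb >> 8 & 0xFF) + (rgb & 0xFF)) / 3;
            }
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++) {
                long sum = 0; int n = 0;
                for (int dy = -r; dy <= r; dy++)
                    for (int dx = -r; dx <= r; dx++) {
                        int xx = x + dx, yy = y + dy;
                        if (xx >= 0 && xx < w && yy >= 0 && yy < h) {
                            sum += lum[yy * w + xx]; n++;
                        }
                    }
                boolean black = lum[y * w + x] < sum / n - c;
                out.setRGB(x, y, black ? 0x000000 : 0xFFFFFF);
            }
        return out;
    }
}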
I'm trying to use Tesseract-OCR to detect the text in images containing pure text, but the text uses a handwritten font called Journal.
Example:
The result is not the best:
Maxima! size` W (35)
Is there any way to improve the result, or even to get the exact text?
I am surprised Tesseract is doing so well. With a little bit of training you should be able to get the lower-case 'l' recognised correctly.
The main problem you have is the top of the large T character. The horizontal line extends across 2 (possibly 3) other character cells and this would cause a problem for any OCR engine when it tries to segment the characters for recognition. Training may be able to help in this case.
The next problem is the . and : which are very light/thin and are possibly being removed with image pre-processing before the OCR even starts.
Overall the only chance to improve the results with Tesseract would be to investigate training. Here are some links which may help.
Alternative to Tesseract OCR Training?
Tesseract OCR Library learning font
Tesseract confuses two numbers
Like Andrew Cash mentioned, it'll be very hard to perform OCR on that T letter because it intersects several of the following characters.
For results improvement you may want to try a more accurate SDK. Have a look at ABBYY Cloud OCR SDK, it's a cloud-based OCR SDK recently launched by ABBYY. It's in beta, so for now it's totally free to use. I work @ ABBYY and can provide you additional info on our products if necessary. I've sent the image you've attached to our SDK and got this response:
Maximal size: lall (35)
I need to write an OCR program for digits only. I will use the MNIST datasets. The problem is that I do not know where to start. There are a lot of papers which don't really explain the algorithm, and I don't have much knowledge about pattern recognition. So I have a few questions.
Q1 : Where can I find the algorithm (or a tutorial)
Q2 : How do I classify digits? I don't need very advanced things. The first thing that comes to my mind is finding the ratio of the upper half to the lower half, and the left side to the right side. Are there more useful and easy classification methods?
Q3 : What are back propagation and the layers shown in most of the papers? Do I need them for my simple OCR?
Note: I know my OCR program won't be accurate. It isn't very important for now.
If the closest engineering library to you has a section on image processing, computer vision, or machine vision, then with luck that library will have a copy of a book I recommend for OCR:
Character Recognition Systems by Cheriet, Kharma, Liu, and Suen
This book provides a fairly comprehensive overview of OCR techniques and recent research. It does not go into great depth on any particular subject, but it does provide references to academic papers.
Make sure you have access to a good introductory textbook on image processing. The book by Gonzalez and Woods is a standard in many universities:
Digital Image Processing by Gonzalez and Woods
Even "simple" OCR gets tricky very quickly. It could be overwhelming if you jump into a class about neural networks, Bayes theorem, etc., before you have a firm grasp of basic image processing principles.
If you can, try writing one or more OCR algorithms for machine-printed characters before you attempt to write an algorithm for handwritten characters.
Q1 : Where can I find the algorithm (or a tutorial)
There are numerous algorithms for OCR. The Cheriet book will give you a good start.
Q2 : How do I classify digits? I don't need very advanced things. The first thing that comes to my mind is finding the ratio of the upper half to the lower half, and the left side to the right side. Are there more useful and easy classification methods?
Try implementing that technique and see how well it works. Even if the implementation doesn't work as well as you'd like, lessons learned while implementing it could help you later.
You can also subdivide a character into a 2 x 2 grid or 3 x 3 grid and check the relative densities of pixels in each cell, as sketched below. Unlike machine-printed characters, handwritten characters won't line up nicely in rectilinear grids.
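A rough sketch of the zoning idea, assuming a cropped grayscale character image; the upper/lower and left/right ratios from your question fall out of the 2x2 case:

import java.awt.image.BufferedImage;

public class ZoneFeatures {
    /**
     * Relative ink density in each cell of a grid x grid zoning of a cropped
     * character image (assumed grayscale; the blue channel stands in for
     * luminance). With grid = 2, the upper/lower ratio is
     * (f[0] + f[1]) / (f[2] + f[3]) and left/right is (f[0] + f[2]) / (f[1] + f[3]).
     */
    static double[] zoneDensities(BufferedImage ch, int grid) {
        int w = ch.getWidth(), h = ch.getHeight();
        double[] f = new double[grid * grid];
        for (int gy = 0; gy < grid; gy++)
            for (int gx = 0; gx < grid; gx++) {
                int x0 = gx * w / grid, x1 = (gx + 1) * w / grid;
                int y0 = gy * h / grid, y1 = (gy + 1) * h / grid;
                int dark = 0, total = 0;
                for (int y = y0; y < y1; y++)
                    for (int x = x0; x < x1; x++, total++)
                        if ((ch.getRGB(x, y) & 0xFF) < 128) dark++;
                f[gy * grid + gx] = total == 0 ? 0 : (double) dark / total;
            }
        return f;
    }
}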
Template matching using normalized correlation is simple, and it can work reasonably well for machine printed characters for a single, known font. It's relatively simple to implement and worth learning:
http://en.wikipedia.org/wiki/Cross-correlation#Normalized_cross-correlation
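A minimal sketch of the scoring function, assuming both images have already been scaled to the same size, converted to grayscale, and flattened into arrays (the surrounding segmentation and template machinery is up to you):

public class TemplateMatch {
    /**
     * Normalized correlation between a candidate character and a template,
     * both given as same-length grayscale arrays. Returns a score in [-1, 1];
     * classify by picking the template with the highest score.
     */
    static double ncc(double[] a, double[] b) {
        double ma = 0, mb = 0;
        for (int i = 0; i < a.length; i++) { ma += a[i]; mb += b[i]; }
        ma /= a.length; mb /= b.length;
        double num = 0, da = 0, db = 0;
        for (int i = 0; i < a.length; i++) {
            num += (a[i] - ma) * (b[i] - mb);
            da  += (a[i] - ma) * (a[i] - ma);
            db  += (b[i] - mb) * (b[i] - mb);
        }
        return num / Math.sqrt(da * db); // NaN if either image is flat
    }
}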
For OCR it's common to thin the characters in your sample as an initial step. Thinning is a technique that reduces a character (or any other shape) to a representation that is 1 pixel wide. Once you have a thinned character it can be easier to identify lines and intersections. If you can identify lines (or curves) and intersections, then one technique is to look at the relative position and angle of each line with respect to the others.
Common thinning algorithms include Stentiford and Zhang-Suen. There's a freeware version of WinTopo that demonstrates both of these algorithms:
http://wintopo.com/
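To make the idea concrete, here's a rough sketch of Zhang-Suen in Java, operating in place on a 0/1 array where 1 is foreground ink:

import java.util.ArrayList;
import java.util.List;

public class ZhangSuenThinning {
    /** Thin a 0/1 image (foreground = 1) to 1-pixel-wide strokes, in place. */
    static void thin(int[][] img) {
        // The single '|' is deliberate: both sub-iterations must run each round.
        while (pass(img, 0) | pass(img, 1)) { }
    }

    /** One Zhang-Suen sub-iteration; returns true if any pixel was removed. */
    static boolean pass(int[][] img, int step) {
        int h = img.length, w = img[0].length;
        List<int[]> toRemove = new ArrayList<>();
        for (int y = 1; y < h - 1; y++)
            for (int x = 1; x < w - 1; x++) {
                if (img[y][x] != 1) continue;
                // Neighbours P2..P9, clockwise starting from the pixel above.
                int[] p = { img[y-1][x], img[y-1][x+1], img[y][x+1], img[y+1][x+1],
                            img[y+1][x], img[y+1][x-1], img[y][x-1], img[y-1][x-1] };
                int b = 0;                 // number of foreground neighbours
                for (int v : p) b += v;
                int a = 0;                 // 0 -> 1 transitions around the ring
                for (int i = 0; i < 8; i++)
                    if (p[i] == 0 && p[(i + 1) % 8] == 1) a++;
                boolean cond = (step == 0)
                        ? p[0] * p[2] * p[4] == 0 && p[2] * p[4] * p[6] == 0
                        : p[0] * p[2] * p[6] == 0 && p[0] * p[4] * p[6] == 0;
                if (b >= 2 && b <= 6 && a == 1 && cond)
                    toRemove.add(new int[]{ y, x });
            }
        for (int[] q : toRemove) img[q[0]][q[1]] = 0;
        return !toRemove.isEmpty();
    }
}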
You can look into academic papers about "stroke extraction", but those techniques tend to be more difficult to implement.
Q3 : What are back propagation and the layers shown in most of the papers? Do I need them for my simple OCR?
These terms refer to artificial neural networks. For a simple OCR algorithm you'll hard-code the recognition logic OR use simple training methods. Artificial neural networks can be trained to recognize characters that aren't hard-coded in your software.
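If you want to see what "layers" and "back propagation" actually mean before committing to the literature, here's a toy sketch: a network with one hidden layer trained by backpropagation on XOR. Every choice here (seed, learning rate, 2 hidden units) is arbitrary; a digit recognizer would just scale this up to one input per pixel and one output per digit:

import java.util.Arrays;
import java.util.Random;

public class TinyBackprop {
    // 2 inputs -> 2 hidden units -> 1 output
    static double[][] w1 = new double[2][2]; static double[] b1 = new double[2];
    static double[] w2 = new double[2];      static double b2 = 0;

    static double sig(double x) { return 1 / (1 + Math.exp(-x)); }

    static double[] hidden(double[] x) {
        double[] h = new double[2];
        for (int i = 0; i < 2; i++) h[i] = sig(w1[i][0]*x[0] + w1[i][1]*x[1] + b1[i]);
        return h;
    }

    static double predict(double[] x) {
        double[] h = hidden(x);
        return sig(w2[0]*h[0] + w2[1]*h[1] + b2);
    }

    public static void main(String[] args) {
        double[][] in = {{0,0},{0,1},{1,0},{1,1}};
        double[] target = {0, 1, 1, 0};  // XOR truth table
        Random rnd = new Random(42);     // try another seed if training stalls
        for (int i = 0; i < 2; i++) {    // in a local minimum
            w2[i] = rnd.nextGaussian();
            for (int j = 0; j < 2; j++) w1[i][j] = rnd.nextGaussian();
        }
        double lr = 0.5;
        for (int epoch = 0; epoch < 20000; epoch++)
            for (int s = 0; s < in.length; s++) {
                double[] h = hidden(in[s]);                 // forward pass
                double y = predict(in[s]);
                double dy = (y - target[s]) * y * (1 - y);  // output-layer error
                for (int i = 0; i < 2; i++) {
                    // back-propagate the error through w2 to the hidden layer
                    double dh = dy * w2[i] * h[i] * (1 - h[i]);
                    w2[i] -= lr * dy * h[i];
                    for (int j = 0; j < 2; j++) w1[i][j] -= lr * dh * in[s][j];
                    b1[i] -= lr * dh;
                }
                b2 -= lr * dy;
            }
        for (double[] x : in)
            System.out.printf("%s -> %.2f%n", Arrays.toString(x), predict(x));
    }
}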
http://en.wikipedia.org/wiki/Neural_network
Although you don't need to learn about artificial neural networks to write a simple OCR algorithm, a simple algorithm will have only limited success with handwritten characters.
Above all, keep in mind that OCR for handwritten characters is an extremely difficult problem. If you could achieve a handwritten character read rate of 20% with a simple technique, then consider that a success.
I've been reading (and trying) OCR programs suggested in previous answers but I'm still without a clear answer to my problem.
I need to recognize handwritten English text. The text would be multiple lines, but each line is only one or two words long. The text is from a different person each time. I could ask that person to provide a training file (e.g. with the alphabet and the digits 0-9), but I cannot really ask for much more complicated training than that.
I need to integrate the recognition as part of another (Java) application but the solution doesn't need to be Java. I can just execute it from Java and get the results from a text file.
Any recommendations?
I've already tested Tesseract (bad results without training, and training looks quite complex). Java OCR looked like the perfect solution (simple training, open source, and Java) but it doesn't work well even with their own examples (has anybody had a better experience?). GOCR does not seem very active.
Of course I prefer free solutions, but this is not a must (though the problem I see with a commercial option is that I must be able to integrate it in my own app, which will be offered as SaaS).
From my experience ABBYY is one of the best for handwriting recognition, even without training. (It's possibly one of the most expensive too, though...) They have an SDK for Java.
http://www.abbyy.com
With a free trial, it's definitely worth a look!
I am on the lookout for handwritten text recognition software. So far the only one giving better results than even ABBYY 11 has been SimpleOCR, using the same text for both; it is freeware for OCR but only a 14-day trial for HCR!
I know I am answering after nearly 6 years. But if anyone's still looking, try using TensorFlow. Their website has a simple example for handwritten digit recognition (MNIST). You can take this example and implement it for handwritten alphabet recognition (you need training data for this; I used NIST Special Database 19 to get it).
I need an open OCR library which is able to scan complex printed math formulas (for example some formulas which were generated via LaTeX). I want to get some LaTeX-like output (or just some AST-like data).
Is there something like this already? Or are current OCR techniques only able to parse line-oriented text?
(Note that I also posted this question on Metaoptimize because some people there might have additional knowledge.)
The problem was also described by OpenAI as im2latex.
SESHAT is an open-source system written in C++ for recognizing handwritten mathematical expressions. SESHAT was developed as part of a PhD thesis at the PRHLT research center at Universitat Politècnica de València.
An online demo: http://cat.prhlt.upv.es/mer/
The source: https://github.com/falvaro/seshat
Seshat is an open-source system for recognizing handwritten mathematical expressions. Given a sample represented as a sequence of strokes, the parser is able to convert it to LaTeX or other formats like InkML or MathML.
According to the answers on Metaoptimize and the discussion on the Tesseract mailing list, there doesn't seem to be an open/free solution yet which can do that.
The only solution which seems to be able to do it (but I cannot verify as it is Windows-only and non-free) is, like a few other people have mentioned, the InftyProject.
InftyReader is the only one I'm aware of. It is NOT free software (it seems the money goes to a non-profit org, IIRC).
http://www.sciaccess.net/en/InftyReader/
I don't know why PDF can't have LaTeX metadata, as in: put the LaTeX equation in it! Is this so hard? (I don't know anything about PDF syntax, but I imagine it can be done.)
LaTeX syntax is THE ONE TRIED AND TRUE STANDARD for mathematics notation. It seems amazingly stupid that the folks who produced MathML and other stuff didn't take this into consideration. InftyReader generates MathML or LaTeX syntax.
If I want HTML (pure) I then use TTH to read the LaTeX syntax. Just works.
ABBYY FineReader (a great OCR program) claims you can train the software for Math, but this is immensely braindead (who has the time?)
And Unicode has lots of math symbols. That today's OCR readers can't grok them shows the sorry state of software and the brain deficit in this activity.
As to "one symbol at a time", TeX obviously has rules as to where it will place symbols. They can't write software that know those rules?! TeX is even public domain! They can just "use it" in their comercial products.
Check out "Web Equation." It can convert handwritten equations to LaTeX, MathML, or SymbolTree. I'm not sure if the engine is open source.
Considering that current technologies read one symbol at a time (see http://detexify.kirelabs.org/classify.html), I doubt there is an OCR for full mathematical equations.
Infty works fairly well. My former company integrated it into an application that reads equations out loud for blind people and is getting good feedback from users.
http://www.inftyproject.org/en/download.html
Since the output from math OCR for complex formulas will likely have bugs -- even humans have trouble with it -- you will have to proofread the results, at least if they matter. The (human) proofreader will then have to correct the results, meaning you need a math formula editor. Given the effort needed by humans and the probably limited corpus of complex formulas, you might find it easier to assign the whole task to humans.
As a research problem, reading math via OCR is fun -- you need a formalism for 2-D grammars plus a symbol recognizer.
In addition to references already mentioned here, why not google for this? There is work that was done at Caltech, Rochester, U. Waterloo, and UC Berkeley. How much of it is ready to use out of the box? Dunno.
As of August 2019, there are a few options, depending on what you need:
For converting printed math equations/formulas to LaTeX, Mathpix is absolutely the best choice. It's free.
For converting handwritten math to LaTeX or printed math, MyScript is the best option, although its app costs a few dollars.
You know, there's an application in Win7 just for that: Math Input Panel. It even handles handwritten input (it's actually made for this). Give it a shot if you have Win7, it's free!
There is this great short video, http://www.youtube.com/watch?v=LAJm3J36tLQ, explaining how you can train your FineReader to recognize math formulas. If you use FineReader already, it's better to stick with one tool. Of course it is not freeware :(