Which version of Tesseract to use for training a new language?

Which version of Tesseract to use for training a new language? - ocr

I'm seeking advice on which version of Tesseract should I use to train for an ancient language that has unique letters. The language is very similar to Arabic in terms of characteristics. It also goes from right-to-left and some letter can connect in the word. In other words, a letter might have three shapes depending if it comes in the beginning, middle or end. It also has harakat (short vowel marks) that come above or below letters.
The reason I'm asking is because I want to take advantage of the tools available for version 3.X but this warning about Arabic threw me off since this language is very similar to it.
For anyone who's familiar with Tesseract, which version do you recommend to train for such a language? Also, if you are aware of a better tool, kindly share it please.

If you have a large amount of documents need to OCR, would recommend to use Tesseract 4.0 as it's faster in general. You may refer to below for more information in case you haven't read that before.
Tesseract 4.0 Accuracy and Performance
Tesseract 4.0 with LSTM
Training Tesseract 4.0
Language Data File for 4.0, you may have a test to see if the Arbic OCR works fine in OCR Engine Mode 1 (i.e --oem 1) which is Neural nets LSTM only.
Tesseract 4.0.0 alpha has been released since last Nov/Dec.
Hope this help.

Related

Does anyone know how Tesseract - OCR postprocessing / spellchecking works?

I was using tesseract-ocr (pytesseract) for spanish and it achieves very high accuracy when you set the language to spanish and of course, the text is in spanish. If you do not set language to spanish this does not perform that good. So, I'm assuming that tesseract is using many postprocessing models for spellchecking and improving the performance, I was wondering if anybody knows some of those models (ie edit distance, noisy channel modeling) that tesseract is applying.
Thanks in advance!

Your assumption is wrong: If you do not specify language, tesseract uses English model as default for OCR. That is why you got wrong result for Spanish input text. There is no spellchecking post processing.

Using Stanford classifier for character recognition

I am working on an OCR related android app and I need to use multivariate logistic regressions for the classification of alphabets. My question is that that can I use Stanford classifier(http://nlp.stanford.edu/software/classifier.shtml) for character recognition? If it can train on a dataset of images? And if I can't then please suggest me a JAVA library for the purpose.

Great minds think alike. I was wondering the same thing. Specifically for OCR.
Even though it's almost a year after you asked your question.
It sounds simple enough; all you would need to do is normalize each character into a 5x7 array (or maybe 64x128), and then classify into the 26 upper and 26 lower case characters; plus 10 digits and 31 punctuation glyphs on a keyboard... Seems doable. Maybe when I get a round tuit...
It turns out that there is a Java library for OCR https://sourceforge.net/projects/javaocr/ and it's called Java OCR (surprise! :-) ). The only problem is that:
1. It doesn't work out of the box. It needs to be trained.
2. The documentation isn't very good.
3. People have had trouble getting it to work.
Good luck.

Tesseract OCR - Handwritten font

I'm trying to use Tesseract-OCR to detect the text of images with pure text in it but these text has a handwritten font called Journal.
Example:
The result is not the best:
Maxima! size` W (35)
Is there any possibility to improve the result or rather to get the exact result?

I am surprised Tesseract is doing so well. With a little bit of training you should be able to train the lower case 'l' to be recognised correctly.
The main problem you have is the top of the large T character. The horizontal line extends across 2 (possibly 3) other character cells and this would cause a problem for any OCR engine when it tries to segment the characters for recognition. Training may be able to help in this case.
The next problem is the . and : which are very light/thin and are possibly being removed with image pre-processing before the OCR even starts.
Overall the only chance to improve the results with Tesseract would be to investigate training. Here are some links which may help.
Alternative to Tesseract OCR Training?
Tesseract OCR Library learning font
Tesseract confuses two numbers

Like Andrew Cash mentioned, it'll be very hard to perform OCR for that T letter because of its intersection with a number of next characters.
For results improvement you may want to try a more accurate SDK. Have a look at ABBYY Cloud OCR SDK, it's a cloud-based OCR SDK recently launched by ABBYY. It's in beta, so for now it's totally free to use. I work # ABBYY and can provide you additional info on our products if necessary. I've sent the image you've attached to our SDK and got this response:
Maximal size: lall (35)

handwriting recognition with simple training

I've been reading (and trying) OCR programs suggested in previous answers but I'm still without a clear answer to my problem.
I need to recognize handwritten English text. The text would be multiple lines but each line is only one or two words length. The text is from a different person at time. I could ask that person to provide a training file (e.g. with the alphabet and 0-9 numbers) but I cannot really ask for a much more complicated training than this.
I need to integrate the recognition as part of another (Java) application but the solution doesn't need to be Java. I can just execute it from Java and get the results from a text file.
Any recommendations?
I've already tested Tesseract (bad results without training and training looks quite complex). Java OCR looked like the perfect solution (simple training, open source and Java) but it doesn't work well even with their own examples (anybody has had a better experiencie?). GOCR does not seem very active.
Of course I prefer free solutions but this is not a MUST (though the problem I see with a commercial option is that I must be able to integrate it in my own app which will be offered as SaaS)

From my experience ABBYY is one of the best for handwriting recognition, even without training. (It's possibly one of the most expensive too, though...) They have an SDK for Java.
http://www.abbyy.com
With a free trial, it's definately worth a look!

I am on the lookout for a handwritten text recognition software. So far the only one giving better results than even abby 11 has been SimpleOCR using the same text for both, which is a freeware for ocr but a 14 day trial for HCR!

I know I am answering after nearly 6 years. But if anyone's still looking, try using tensorflow. Their website has a simple example for handwritten digit recognition(MNIST). You can use this example and implement it for handwritten alphabet recognition (you need training data for this, I used NIST special Database 19 to get this data).

OCR lib for math formulas

I need an open OCR library which is able to scan complex printed math formulas (for example some formulas which were generated via LaTeX). I want to get some LaTeX-like output (or just some AST-like data).
Is there something like this already? Or are current OCR technics just able to parse line-oriented text?
(Note that I also posted this question on Metaoptimize because some people there might have additional knowledge.)
The problem was also described by OpenAI as im2latex.

SESHAT is a open source system written in C++ for recognizing handwritten mathematical expressions. SESHAT was developed as part of a PhD thesis at the PRHLT research center at Universitat Politècnica de València.
An online demo:http://cat.prhlt.upv.es/mer/
The source: https://github.com/falvaro/seshat
Seshat is an open-source system for recognizing handwritten mathematical expressions. Given a sample represented as a sequence of strokes, the parser is able to convert it to LaTeX or other formats like InkML or MathML.

According to the answers on Metaoptimize and the discussion on the Tesseract mailinglist, there doesn't seem to be an open/free solution yet which can do that.
The only solution which seems to be able to do it (but I cannot verify as it is Windows-only and non-free) is, like a few other people have mentioned, the InftyProject.

InftyReader is the only one I'm aware of. It is NOT free software (it seems the money goes to a non-profit org, IIRC).
http://www.sciaccess.net/en/InftyReader/
I don't know why PDF can't have metadata in LaTeX? As in: put the LaTeX equation in it! Is this so hard? (I dunno anything about PDF syntax, but I imagine it can be done).
LaTeX syntax is THE ONE TRIED AND TRUE STANDARD for mathematics notation. It seems amazingly stupid that folks that produced MathML and other stuff don't take this in consideration. InftyReader generates MathML or LaTeX syntax.
If I want HTML (pure) I then use TTH to read the LaTeX syntax. Just works.
ABBYY FineReader (a great OCR program) claims you can train the software for Math, but this is immensely braindead (who has the time?)
And Unicode has lots of math symbols. That today's OCR readers can't grok them shows the sorry state of software and the brain deficit in this activity.
As to "one symbol at a time", TeX obviously has rules as to where it will place symbols. They can't write software that know those rules?! TeX is even public domain! They can just "use it" in their comercial products.

Check out "Web Equation." It can convert handwritten equations to LaTeX, MathML, or SymbolTree. I'm not sure if the engine is open source.

Considering that current technologies read one symbol at a time (see http://detexify.kirelabs.org/classify.html), I doubt there is an OCR for full mathematical equations.

Infty works fairly well. My former company integrated it into an application that reads equations out loud for blind people and is getting good feedback from users.
http://www.inftyproject.org/en/download.html

Since the output from math OCR for complex formulas will likely have bugs -- even humans have trouble with it -- you will have to proofread th results, at least if they matter. The (human) proofreader will then have to correct the results, meaning you need to have a math formula editor. Given the effort needed by humans, the probably limited corpus of complex formulas, you might find it easier to assign the task to humans.
As a research problem, reading math via OCR is fun -- you need a formalism for 2-D grammars plus a symbol recognizer.
In addition to references already mentioned here, why not google for this? There is work that was done at Caltech, Rochester, U. Waterloo, and UC Berkeley. How much of it is ready to use out of the box? Dunno.

As of August 2019, there are a few options, depending on what you need:
For converting printed math equations/formulas to LaTex, Mathpix is absolutely the best choice. It's free.
For converting handwritten math to LaTex or printed math, MyScript is the best option, although its app costs a few dollars.

You know, there's an application in Win7 just for that: Math Input Panel. It even handles handwritten input (it's actually made for this). Give it a shot if you have Win7, it's free!

there is this great short video: http://www.youtube.com/watch?v=LAJm3J36tLQ
explaining how you can train your Fine Reader to recognize math formulas. If you use Fine Reader already, better to stick with one tool. Of course it is not free ware :(

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008

Which version of Tesseract to use for training a new language? - ocr

Related

Does anyone know how Tesseract - OCR postprocessing / spellchecking works?

Using Stanford classifier for character recognition

Tesseract OCR - Handwritten font

handwriting recognition with simple training

OCR lib for math formulas

Categories

Resources