Introduction to OCR

Someone gave me a trove of amazing information: 200 MB of .tiff images of scanned announcements going back to the 1940s. I want to digitize it, but I have no knowledge whatsoever of OCR. Some of the early material is barely readable by a human, let alone a machine. It is also in Hebrew.
I'm looking for advice on how to approach this: suggestions for books, articles, code libraries, or software (all freely available on the web). I'm proficient in C++ and Python and can pick up another language if needed.
Thank you.

This sounds like a great task for Python, using an OCR library. A quick Google search turned up pytesser:
PyTesser is an Optical Character Recognition module for Python. It takes as input an image or image file and outputs a string.
PyTesser uses the Tesseract OCR engine, converting images to an accepted format and calling the Tesseract executable as an external script. A Windows executable is provided along with the Python scripts. The scripts should work in other operating systems as well.
...
Usage Example
>>> from pytesser import *
>>> image = Image.open('fnord.tif') # Open image object using PIL
>>> print image_to_string(image) # Run tesseract.exe on image
fnord
>>> print image_file_to_string('fnord.tif')
fnord
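Note that PyTesser dates from the Python 2 era; a more current route to the same Tesseract engine is the pytesseract package. A minimal sketch for the Hebrew scans (assuming Tesseract's heb language data is installed; the file name is a placeholder):

from PIL import Image      # Pillow, the maintained fork of PIL
import pytesseract         # pip install pytesseract; needs the tesseract binary on PATH

# Run Tesseract's Hebrew model on one scanned page.
image = Image.open('announcement.tif')   # placeholder file name
text = pytesseract.image_to_string(image, lang='heb')
print(text)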

Related

Image file not found

Through Homebrew, I have installed the Tesseract OCR engine on my Mac. All the dependencies (jpeg, leptonica, libpng, libtiff, openssl, tesseract) are now installed under /usr/local/Cellar.
Before putting an image in the Cellar directory, if I try the following at the command line, it obviously fails:
$ tesseract image.png outcome
So, because there is no such image, I get the following error messages:
Error in fopenReadStream: file not found
Error in findFileFormat: image file not found
Error during processing.
Where are the programs/scripts that generate these messages? I can only find include files in the installed Tesseract directory. Which files contain these error-message strings for cases like a missing image?
Also, where are the scripts/programs that perform image pre-processing (such as segmentation, binarization, and noise removal) before Tesseract actually does the OCR on the image?
Context/Background
We are planning to improve (rather, customize) Tesseract for our needs (for example, recognizing product serial numbers and vehicle number plates), but obviously we first need to know what kind of filtering and thresholding Tesseract does by default.
I understand that Tesseract performs various image-processing operations internally (using the Leptonica library) before doing the actual OCR: for example, binarization, segmentation, and noise removal, along with a default segmentation method. Is this right? Which source files contain these methods, so that I can see in what order these internal operations are carried out?
The GitHub download has a lot of directories and code, so I would really appreciate someone pointing us in the right direction: where should we look to see the standard parameters and image-processing operations Tesseract applies before the actual OCR? We can only find .h header files...
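For comparison, here is the kind of external preprocessing we have been experimenting with while trying to understand the defaults (to my understanding, Tesseract's default binarization is Otsu's method, done via Leptonica). A rough sketch with OpenCV and pytesseract; the file name is a placeholder:

import cv2
import pytesseract

# Load the scan as grayscale and binarize with Otsu's method,
# roughly comparable to Tesseract's default internal thresholding.
gray = cv2.imread('plate.png', cv2.IMREAD_GRAYSCALE)  # placeholder image
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Optional light noise removal before handing the image to Tesseract.
binary = cv2.medianBlur(binary, 3)

print(pytesseract.image_to_string(binary))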
Thanks,

Tesseract training for a new font

I'm still new to Tesseract OCR, and after using it in my script I noticed it had a relatively high error rate for the images I was trying to extract text from. I came across Tesseract training, which supposedly can decrease the error rate for a specific font you use. I found a website (http://ocr7.com/), a tool powered by Anyline that does all the training for a font you specify. So I received a .traineddata file and I am not quite sure what to do with it. Could anybody explain what I have to do with this file for it to work? Or should I just learn how to do Tesseract training the manual way, which according to the Anyline website may take a day's work? Thanks in advance.
For anyone still reading this: you can use this tool to get a traineddata file for whichever font you want. After that, move the traineddata file into your tessdata folder. To use Tesseract with the new font in Python (or, I think, any other language), pass lang="Font" as the second parameter of the image_to_string function, as sketched below. It improves accuracy significantly, but it can of course still make mistakes. Alternatively, you can learn how to train Tesseract for a new font manually with this guide: http://pretius.com/how-to-prepare-training-files-for-tesseract-ocr-and-improve-characters-recognition/.
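For example, a minimal sketch in Python (MyFont and the file names are placeholders for your own .traineddata and image):

from PIL import Image
import pytesseract

# Assumes tessdata/MyFont.traineddata exists; the --tessdata-dir option
# points Tesseract at a custom location (adjust or drop as needed).
image = Image.open('sample.png')   # placeholder image
text = pytesseract.image_to_string(
    image,
    lang='MyFont',
    config='--tessdata-dir /usr/local/share/tessdata',
)
print(text)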
Edit:
Tesseract 5 training tutorial: https://www.youtube.com/watch?v=KE4xEzFGSU8
I made a video tutorial explaining the process for the latest version of Tesseract (the LSTM model); I hope it helps: https://www.youtube.com/watch?v=TpD76k2HYms
If you want to train Tesseract on a new font, generate a .traineddata file for your desired font. To generate the .traineddata file you first need a .tiff file and a .box file, which you can create using jTessBoxEditor (a tutorial for jTessBoxEditor is here). While making the .tiff file you can set the font on which you want to train Tesseract. To generate the .traineddata file you can use either jTessBoxEditor or serak-tesseract-trainer. I have used both, and I would say that jTessBoxEditor is great for generating the tiff and box files, while Serak is better for the actual training.
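For reference, the manual (legacy, Tesseract 3.x) training route from the guide linked above boils down to a handful of commands; a rough sketch, with eng.myfont.exp0 as a placeholder training image name:

$ tesseract eng.myfont.exp0.tif eng.myfont.exp0 batch.nochop makebox   # make the box file, then correct it in jTessBoxEditor
$ tesseract eng.myfont.exp0.tif eng.myfont.exp0 box.train              # train on the corrected tiff/box pair
$ unicharset_extractor eng.myfont.exp0.box
$ mftraining -F font_properties -U unicharset -O eng.unicharset eng.myfont.exp0.tr
$ cntraining eng.myfont.exp0.tr
$ combine_tessdata eng.   # after renaming inttemp, pffmtable, shapetable, normproto with the "eng." prefix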

How to read OCR file data

I am building a tool that can read data from a file using OCR. I am using IDOL OnDemand (idolondemand.com), but I have found it not very promising: it does not read files properly (e.g. spelling mistakes, special characters).
I can move to any other language; this problem has basically become language-independent for me, so I can go with any language.
I need help building one.

What should we use instead of nltk.Text.generate()?

It seems that nltk.Text.generate() is not available in NLTK 3.0 (see this answer). How should we be generating sentences instead? Thanks.
Unfortunately the generate() function relied on a buggy implementation of ngram models. It has been removed from NLTK 3.0 until someone can get around to fixing it, as you can see here (search for the words "removed ngram model package"). No replacement for this functionality has been provided.
The package nltk.model is still present in the NLTK 3.0 source tree, but it is not part of the distribution. So in principle you could download the source and get it to work, but given the bugs that led to its removal, it's probably a better idea to do without it, or to roll your own. Random text generation is not very interesting unless you control the generation algorithm, anyway.
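If you do roll your own, the core of an ngram generator is small. A minimal sketch (plain trigram sampling; this is generic, not a reconstruction of the removed nltk.model package):

import random
from collections import defaultdict

def build_trigram_model(words):
    # Map each consecutive word pair to the words observed to follow it.
    model = defaultdict(list)
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        model[(w1, w2)].append(w3)
    return model

def generate_text(model, seed, length=25):
    # Sample a continuation of the two-word seed, one word at a time.
    w1, w2 = seed
    output = [w1, w2]
    for _ in range(length):
        followers = model.get((w1, w2))
        if not followers:
            break  # the current pair never continues in the corpus
        w1, w2 = w2, random.choice(followers)
        output.append(w2)
    return ' '.join(output)

# e.g. with an NLTK corpus (requires the gutenberg corpus download):
# from nltk.corpus import gutenberg
# model = build_trigram_model(list(gutenberg.words('austen-emma.txt')))
# print(generate_text(model, ('Emma', 'Woodhouse')))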

Weka: Limitations on what one can output as source?

I have been consulting several references to figure out how to output trained Weka models as Java source code, so that I can use the classifiers I am training in actual code for research applications I have been developing.
While playing with Weka 3.7, I noticed that although it outputs Java code to its main text buffer for simpler (in my case supervised) classification methods such as the J48 decision tree, it disables that option (the checkbox is greyed out and the text faded) for RandomTree and RandomForest, which are the ones that give me the best performance in my situation.
Note: I am clicking on the "More Options" button and checking "Output source code:".
Does Weka not allow you to output RandomTree or RandomForest as Java code? If so, why? Or, if it does and simply doesn't put the code in the output buffer (a random forest is multiple decision trees, so I imagine Weka doesn't want to waste buffer space), where in the file system does Weka write the Java code by default?
Are there any tricks to get Weka to give me my trained RandomForest as Java code? Or is serialization of the output *.model files my only hope when it comes to RandomForest and RandomTree?
Thanks in advance to those who provide help.
NOTE (as an addendum to the answer provided below): if you run across a similar situation (needing to use your trained classifier/ML model in your code), I recommend following the links posted in the answer to my question. If you do not specifically need the Java code for the RandomForest, de-serializing the model works quite nicely and fits into Java application code, fulfilling its task as a trained model meant to predict future unlabelled instances.
RandomTree and RandomForest can't be output as Java code. I'm not sure of the reason why, but they don't implement the "Sourceable" interface.
This explains a little about outputting a classifier as Java code: Link 1
This shows which classifiers can be output as Java code: Link 2
Unfortunately, I think the easiest route will be serialization, although you could try implementing "Sourceable" for other classifiers on your own.
Another, perhaps inconvenient, solution would be to use Weka to rebuild the classifier every time you use it: you wouldn't need to load the ".model" file, but you would need to load your training data and relearn the model, as sketched below. Here is a starter's guide to building classifiers in your own Java code: http://weka.wikispaces.com/Use+WEKA+in+your+Java+code.
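If you happen to drive Weka from Python rather than Java, the same relearn-at-startup idea looks roughly like the following sketch, using the third-party python-weka-wrapper3 package (my assumption here; it is not part of Weka itself, and training.arff is a placeholder):

import weka.core.jvm as jvm
from weka.core.converters import Loader
from weka.classifiers import Classifier

jvm.start()  # python-weka-wrapper3 runs Weka inside a JVM

# Relearn the model from the training data at startup
# instead of loading a serialized .model file.
loader = Loader(classname="weka.core.converters.ArffLoader")
data = loader.load_file("training.arff")  # placeholder training file
data.class_is_last()  # the last attribute is the class label

rf = Classifier(classname="weka.classifiers.trees.RandomForest")
rf.build_classifier(data)

# Smoke test: classify the training instances themselves.
for inst in data:
    print(rf.classify_instance(inst))

jvm.stop()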
I solved the problem for myself by turning the output of Weka's -printTrees option for the RandomForest classifier into Java source code.
http://pielot.org/2015/06/exporting-randomforest-models-to-java-source-code/
Since I am using classifiers with Android, all of the existing options had disadvantages:
shipping Android apps with serialized models didn't reliably work across devices
computing the model on the phone took too much resources
The final code consists of only three classes: the class with the generated model, plus two classes that make the classification work.