How do I train Tesseract 4 with image data instead of a font file?

I'm trying to train Tesseract 4 with images instead of fonts.
The docs only explain the approach with fonts, not with images.
I know how this works in prior versions of Tesseract, but I don't understand how to use the box/tiff files to train the LSTM model in Tesseract 4.
I looked into tesstrain.sh, which is used to generate LSTM training data, but couldn't find anything helpful. Any ideas?

Clone the tesstrain repo at https://github.com/tesseract-ocr/tesstrain.
You'll also need to clone the tessdata_best repo, https://github.com/tesseract-ocr/tessdata_best. This acts as the starting point for your training. It takes hundreds of thousands of samples to train an accurate model from scratch, so using a good starting point lets you fine-tune with much less data (tens to hundreds of samples can be enough).
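For example (a sketch; the clone locations are up to you, but the tessdata_best path should match the TESSDATA value you pass to make below):

git clone https://github.com/tesseract-ocr/tesstrain.git
git clone https://github.com/tesseract-ocr/tessdata_best.git ~/src/tessdata_best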
Add your training samples to the directory ./tesstrain/data/my-custom-model-ground-truth in the tesstrain repo.
Your training samples should be image/text file pairs that share the same name but have different extensions. For example, you should have an image file named 001.png that is a picture of the text foobar, and a text file named 001.gt.txt that contains the text foobar.
Each sample must be a single line of text.
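The ground-truth directory would then look something like this (file names are arbitrary placeholders; only the matching base names and the .gt.txt extension matter):

tesstrain/data/my-custom-model-ground-truth/
    001.png       (image of the single line of text "foobar")
    001.gt.txt    (contains exactly: foobar)
    002.png
    002.gt.txt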
In the tesstrain repo, run this command:
make training MODEL_NAME=my-custom-model START_MODEL=eng TESSDATA=~/src/tessdata_best
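If you want to limit how long training runs, the tesstrain Makefile also accepts a MAX_ITERATIONS variable (check the repo's README for the full list of supported variables); for example:

make training MODEL_NAME=my-custom-model START_MODEL=eng TESSDATA=~/src/tessdata_best MAX_ITERATIONS=2000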
Once the training is complete, there will be a new file tesstrain/data/my-custom-model.traineddata (named after MODEL_NAME). Copy that file to the directory Tesseract searches for models. On my machine, it was /usr/local/share/tessdata/.
Then, you can run tesseract and use that model as a language.
tesseract -l my-custom-model foo.png -

Related

Include file in caffe .prototxt

I am building a siamese network from the example on BVLC's site.
There they use a simple convolutional net to generate the features for the contrastive loss function. This is done by copying and pasting the .prototxt of each of the networks into the .prototxt of the final siamese network; the problem is that I am using a much larger network, whose .prototxt has about 5700 lines.
Is there a directive that lets me just "include" that file at runtime? Something along the lines of "\input" in LaTeX, so I don't end up with a 12k+ line file.

Image file not found

Through Homebrew, I have installed the Tesseract OCR engine on my Mac. All the directories (jpeg, leptonica, libpng, libtiff, openssl, tesseract) are now installed in /usr/local/Cellar.
Since I haven't yet put an image in the Cellar directory, when I try the following at the command line it obviously fails:
$ tesseract image.png outcome
So, because there is no such image, I get the following error messages:
Error in fopenReadStream: file not found
Error in findFileFormat: image file not found
Error during processing.
Where are the programs/scripts that generate these messages? I can only find include files in the installed Tesseract directory... Where are the files that contain these error message strings for when an image is not found, etc.?
Also, where are the scripts/programs that perform image pre-processing (such as segmentation, binarization, noise removal, etc...) before Tesseract actually does the OCR on the image?
Context/Background
We are planning to improve (rather customize) Tesseract for our needs (for example recognizing products' serial numbers and vehicle number plates) but obviously first we need to know what kind of filtering and thresholding Tesseract does by default.
I understand Tesseract performs various image-processing operations internally (using the Leptonica library) before doing the actual OCR. For example, I understand Tesseract does Binarization and Segmentation and Noise Removal internally, as well as having a default segmentation method. Is this right? Which script(s) contain these methods, so that I can see in what order these internal image-processing operations are carried out before doing the actual OCR?
The GitHub download has a lot of directories and code, so I would really appreciate someone pointing us in the right direction -- where we should look to see the standard parameters and image-processing operations that Tesseract performs before doing the actual OCR. We can only find .h header files...
Thanks,

Tesseract training for a new font

I'm still new to Tesseract OCR and, after using it in my script, noticed it had a relatively high error rate for the images I was trying to extract text from. I came across Tesseract training, which supposedly can decrease the error rate for a specific font you use. I found a website (http://ocr7.com/), a tool powered by Anyline, that does all the training for a font you specify. So I received a .traineddata file and I'm not quite sure what to do with it. Could anybody explain what I have to do with this file for it to work? Or should I just learn how to do Tesseract training the manual way, which according to the Anyline website may take a day's work? Thanks in advance.
For anyone who is still going to read this: you can use this tool to get a traineddata file for whichever font you want. After that, move the traineddata file into your tessdata folder. To use Tesseract with the new font in Python (or any other language, I think?), pass lang="Font" as the second parameter to the image_to_string function. It improves accuracy significantly but can still make mistakes, of course. Or you can just learn how to train Tesseract for a new font manually with this guide: http://pretius.com/how-to-prepare-training-files-for-tesseract-ocr-and-improve-characters-recognition/.
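For example, with pytesseract in Python (a minimal sketch; it assumes the file Font.traineddata is already in your tessdata folder and that pytesseract and Pillow are installed):

from PIL import Image
import pytesseract

# "Font" must match the base name of the .traineddata file in the tessdata folder
text = pytesseract.image_to_string(Image.open("example.png"), lang="Font")
print(text)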
Edit:
Tesseract 5 training tutorial: https://www.youtube.com/watch?v=KE4xEzFGSU8
I made a video tutorial explaining the process for the latest version of Tesseract (The LSTM model), hope it helps. https://www.youtube.com/watch?v=TpD76k2HYms
If you want to train Tesseract on a new font, generate a .traineddata file with your desired font. To generate the .traineddata, you will first need a .tiff file and a .box file. You can create these files using jTessBoxEditor; a tutorial for jTessBoxEditor is here. While making the .tiff file you can set the font on which you want to train Tesseract. You can use either jTessBoxEditor or the Serak Tesseract Trainer to generate the .traineddata. I have used both, and I would say that jTessBoxEditor is great for generating the tiff and box files, while Serak is better for the actual training.
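If you prefer the command line over jTessBoxEditor for the box file, Tesseract itself can generate one (a sketch following the legacy training naming convention; the .tif name is a placeholder):

tesseract eng.myfont.exp0.tif eng.myfont.exp0 batch.nochop makebox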

Recognizing punctuation in Tesseract OCR

I am running tesseract to extract text from PDF files in a context where it is important to distinguish between semicolons and commas.
I find that semi-colons often show up as commas after OCR. The accuracy is otherwise pretty good.
I am looking for suggestions on how to improve accuracy in semicolon versus comma detection. Following this suggestion, my procedure is to first convert a multipage PDF file to a ppm file using pdftoppm from Xpdf, then convert that to tif using imagemagick, then run tesseract on the .tif file.
I have set the resolution of the ppm file to 1000 DPI and used the -sharpen option in imagemagick in an effort to improve resolution, but neither seems to improve the semi-colon recognition.
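For concreteness, the conversion pipeline described above would look roughly like this (file names are placeholders):

pdftoppm -r 1000 input.pdf page
convert page*.ppm -sharpen 0x1 output.tif
tesseract output.tif output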
Any suggestions for pre-processing the image files, or is this just a tough hill to climb?
Here are links to the original PDF, the .ppm and .tif files, and the .txt output.
Note that this is copyrighted material which I do not own.
You can always custom-train Tesseract on your own dataset; you can check this article, "How to custom train tesseract". Granted, training a new model will be a long process, starting with collecting the dataset, but it is a way to improve the OCR.

Tesseract finished training, but poor output

I have 5 PDFs, which I converted to TIFF, merged with jTessBoxEditor, created a box file for, and then went through the process of picking out each and every letter. After building the language, I tried running Tesseract on the same big TIFF and the converted PDFs, but I'm getting worse accuracy than with the default dictionary. Is there anything I could be doing wrong?