Tesseract training for a new font - ocr

I'm still new to Tesseract OCR, and after using it in my script I noticed it had a relatively high error rate for the images I was trying to extract text from. I came across Tesseract training, which supposedly can decrease the error rate for a specific font you use. I also came across a website (http://ocr7.com/), a tool powered by Anyline, that does all the training for a font you specify. I received a .traineddata file from it, but I am not quite sure what to do with it. Could anybody explain what I have to do with this file for it to work? Or should I just learn how to do Tesseract training the manual way, which according to the Anyline website may take a day's work? Thanks in advance.

For anyone that is still going to read this: you can use that tool (http://ocr7.com/) to get a traineddata file for whichever font you want. After that, move the traineddata file into your tessdata folder. To use Tesseract with the new font from Python (or, I think, any other language), pass lang="Font" as the second parameter of the image_to_string function. It improves accuracy significantly, but it can of course still make mistakes. Or you can just learn how to train Tesseract for a new font manually with this guide: http://pretius.com/how-to-prepare-training-files-for-tesseract-ocr-and-improve-characters-recognition/.
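With pytesseract this looks roughly like the sketch below; the model name MyFont is a placeholder and must match the name of the .traineddata file you dropped into tessdata (MyFont.traineddata):

from PIL import Image
import pytesseract

# lang must match the .traineddata file name in your tessdata folder,
# e.g. MyFont.traineddata -> lang="MyFont" (assumed name)
text = pytesseract.image_to_string(Image.open("sample.png"), lang="MyFont")
print(text)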

Edit:
Tesseract 5 training tutorial: https://www.youtube.com/watch?v=KE4xEzFGSU8
I made a video tutorial explaining the process for the latest version of Tesseract (the LSTM model); hope it helps. https://www.youtube.com/watch?v=TpD76k2HYms

If you want to train Tesseract on a new font, generate a .traineddata file for your desired font. To generate the .traineddata file, you first need a .tiff file and a .box file. You can create these files using jTessBoxEditor; a tutorial for it is available online. While making the .tiff file you can set the font on which you want to train Tesseract. For generating the .traineddata file itself you can use either jTessBoxEditor or Serak Tesseract Trainer. I have used both, and I would say that jTessBoxEditor is great for generating the tiff and box files, while Serak is better for the training step.
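Under the hood these GUI tools drive Tesseract's legacy (3.x) training commands. Purely as a hedged sketch (the language code eng and font nickname myfont are placeholders, and exact flags vary by Tesseract version), the pipeline scripted from Python looks roughly like:

import os
import subprocess

base = "eng.myfont.exp0"  # expects eng.myfont.exp0.tif and its matching .box file

# 1. Generate the .tr feature file from the tiff/box pair
subprocess.run(["tesseract", base + ".tif", base, "box.train"], check=True)
# 2. Extract the set of characters seen in the box file
subprocess.run(["unicharset_extractor", base + ".box"], check=True)
# 3. Feature clustering (the font_properties file must list your font)
subprocess.run(["mftraining", "-F", "font_properties", "-U", "unicharset",
                "-O", "eng.unicharset", base + ".tr"], check=True)
subprocess.run(["cntraining", base + ".tr"], check=True)
# 4. combine_tessdata expects its inputs prefixed with the language code
for name in ("inttemp", "pffmtable", "shapetable", "normproto"):
    os.rename(name, "eng." + name)
subprocess.run(["combine_tessdata", "eng."], check=True)  # produces eng.traineddata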

How do I train tesseract 4 with image data instead of a font file?

I'm trying to train Tesseract 4 with images instead of fonts.
In the docs they are explaining only the approach with fonts, not with images.
I know how this works with prior versions of Tesseract, but I haven't figured out how to use the box/tiff files to train the LSTM model in Tesseract 4.
I looked into tesstrain.sh, which is used to generate LSTM training data but couldn't find anything helpful. Any ideas?
Clone the tesstrain repo at https://github.com/tesseract-ocr/tesstrain.
You'll also need to clone the tessdata_best repo, https://github.com/tesseract-ocr/tessdata_best. This acts as the starting point for your training. Training from scratch takes hundreds of thousands of samples to reach good accuracy, so starting from a good base model lets you fine-tune with much less data (tens to hundreds of samples can be enough).
Add your training samples to the directory in the tesstrain repo named ./tesstrain/data/my-custom-model-ground-truth
Your training samples should be image/text file pairs that share the same name but have different extensions. For example, you should have an image file named 001.png that is a picture of the text foobar, and a text file named 001.gt.txt that contains the text foobar.
These files need to be single lines of text.
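If you need to synthesize such pairs rather than crop them from scans, a minimal Pillow sketch follows; the font path, the rough image-size heuristic, and the output directory are all assumptions:

from PIL import Image, ImageDraw, ImageFont

def write_pair(name, text, font_path="/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf"):
    # Render one line of black text on white; the width estimate is deliberately rough
    font = ImageFont.truetype(font_path, 32)
    img = Image.new("L", (20 + 20 * len(text), 60), color=255)
    ImageDraw.Draw(img).text((10, 10), text, font=font, fill=0)
    img.save(name + ".png")
    # The transcript lives next to the image as <name>.gt.txt
    with open(name + ".gt.txt", "w", encoding="utf-8") as f:
        f.write(text + "\n")

write_pair("data/my-custom-model-ground-truth/001", "foobar")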
In the tesstrain repo, run this command:
make training MODEL_NAME=my-custom-model START_MODEL=eng TESSDATA=~/src/tessdata_best
Once the training is complete, there will be a new file tesstrain/data/my-custom-model.traineddata (named after MODEL_NAME). Copy that file to the directory where Tesseract searches for models. On my machine, it was /usr/local/share/tessdata/.
Then, you can run tesseract with that model as the language; the trailing - in the command below sends the recognized text to stdout.
tesseract -l my-custom-model foo.png -

recognizing punctuation in Tesseract OCR

I am running tesseract to extract text from PDF files in a context where it is important to distinguish between semicolons and commas.
I find that semicolons often show up as commas after OCR. The accuracy is otherwise pretty good.
I am looking for suggestions on how to improve accuracy in semicolon versus comma detection. Following this suggestion, my procedure is to first convert the multipage PDF to a .ppm file using pdftoppm from Xpdf, then convert that to .tif using ImageMagick, and then run tesseract on the .tif file.
I have set the resolution of the ppm file to 1000 DPI and used ImageMagick's -sharpen option in an effort to improve image quality, but neither seems to improve the semicolon recognition.
Any suggestions for pre-processing the image files, or is this just a tough hill to climb?
Here are links to the original PDF, the .ppm and .tif files, and the .txt output.
Note that this is copyrighted material which I do not own.
You can always custom-train Tesseract on your own dataset; you can check this article: How to custom train tesseract. It will certainly be a long process, since you first have to collect a dataset, but it is a real way to improve the OCR.
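Before committing to training, it may also be worth tightening the preprocessing. A rough, hedged sketch of the same pdftoppm-then-OCR pipeline with an explicit binarization step (file names, the 600 DPI setting, and the threshold of 180 are assumptions to tune, not a definitive recipe):

import subprocess
from PIL import Image
import pytesseract

# Rasterize the first page in grayscale; around 600 DPI is usually plenty
subprocess.run(["pdftoppm", "-r", "600", "-f", "1", "-l", "1", "-gray",
                "input.pdf", "page"], check=True)

# pdftoppm zero-pads page numbers for multi-page PDFs, so check the exact name
img = Image.open("page-1.pgm")
# Hard-threshold to pure black/white; tune 180 for your scans
img = img.point(lambda p: 255 if p > 180 else 0)
print(pytesseract.image_to_string(img))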

Tesseract OCR cube files for Turkish

Where can I find the Tesseract OCR Turkish language extension for cube mode? I need these files:
tr.cube.fold
tr.cube.lm
tr.cube.nn
tr.cube.params
tr.cube.size
tr.cube.word-freq
It includes all of those files; this single file is enough: tur.traineddata
https://github.com/tesseract-ocr/tessdata/blob/master/tur.traineddata
and
https://github.com/tesseract-ocr/langdata/tree/master/tur
You could also use the trained data from tessdata_fast if you really need performance and are willing to lose some accuracy.
Grab the Turkish version at https://github.com/tesseract-ocr/tessdata_fast/blob/master/tur.traineddata
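After downloading, point Tesseract at the folder that holds the file. A small pytesseract sketch (the tessdata path and image name are assumptions):

from PIL import Image
import pytesseract

# --tessdata-dir must point at the folder containing tur.traineddata
config = '--tessdata-dir "/home/user/tessdata"'
print(pytesseract.image_to_string(Image.open("turkish.png"), lang="tur", config=config))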
Nowhere. Cube is a dead end and will be eliminated from Tesseract; see e.g. https://github.com/tesseract-ocr/tesseract/issues/40

Tesseract ocr: How can I train my own tessdata library on batch with lots of single character image?

1. I have lots of images that each contain only a single character. How can I use them to train my own tessdata library in batch? Any tips?
2. Besides that, I'm confused about the feature-extraction part shared between library training and character recognition. Could anyone explain the flow?
Thanks very much!
If they are all of the same font, put them in a multi-page TIFF and conduct training on it. jTessBoxEditor can help you with the TIFF merging and box editing.
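If you prefer to script the merge instead of using the GUI, Pillow can write a multi-page TIFF directly; a small sketch (the chars/*.png glob is an assumption):

import glob
from PIL import Image

# Collect the single-character images in a stable order
pages = [Image.open(p).convert("L") for p in sorted(glob.glob("chars/*.png"))]

# save_all + append_images emits one TIFF page per input image
pages[0].save("training.tif", save_all=True, append_images=pages[1:])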

Introduction to OCR

Someone gave me a trove of amazing information: 200 MB of .tiff images of scanned announcements going back to the 1940s. I want to digitize this, but I have no knowledge whatsoever about OCR. Some of the early material is barely readable by a human, let alone a machine. It is also in Hebrew.
I'm looking for advice on how to approach this: good suggestions for books, articles, code libraries, or software (all freely available on the web). I'm proficient in C++ and Python and can pick up another language if needed.
Thank you.
This sounds like a great task for Python, using an OCR library. A quick Google search turned up pytesser:
PyTesser is an Optical Character Recognition module for Python. It takes as input an image or image file and outputs a string.
PyTesser uses the Tesseract OCR engine, converting images to an accepted format and calling the Tesseract executable as an external script. A Windows executable is provided along with the Python scripts. The scripts should work in other operating systems as well.
...
Usage Example
>>> from pytesser import *
>>> image = Image.open('fnord.tif') # Open image object using PIL
>>> print image_to_string(image) # Run tesseract.exe on image
fnord
>>> print image_file_to_string('fnord.tif')
fnord
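Note that PyTesser is a Python 2-era wrapper and is no longer maintained; the same flow with the maintained pytesseract package looks roughly like this (a sketch, not PyTesser's own API; Hebrew requires heb.traineddata in your tessdata folder):

from PIL import Image
import pytesseract

image = Image.open("fnord.tif")
print(pytesseract.image_to_string(image, lang="heb"))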