How to train ABBYY OCR (command line version) - ocr

We are scanning PDF files, and dates are not being recognized as accurately as we need.
Is there any way to train ABBYY OCR the way we can train Tesseract?

Related

Image file not found

Through Homebrew, I have installed the Tesseract OCR engine on my Mac. All the directories (jpeg, leptonica, libpng, libtiff, openssl, tesseract) are now installed in /usr/local/Cellar.
If I run the following at the command line before putting an image in the Cellar directory, it obviously fails:
$ tesseract image.png outcome
So, because there is no such image, I get the following error messages:
Error in fopenReadStream: file not found
Error in findFileFormat: image file not found
Error during processing.
Where are the programs/scripts that generate these messages? I can only find include files in the installed Tesseract directory. Where are the files that contain the error message strings that are printed when the image is not found?
Also, where are the scripts/programs that perform image pre-processing (such as segmentation, binarization, and noise removal) before Tesseract actually does the OCR on the image?
Context/Background
We are planning to improve (rather customize) Tesseract for our needs (for example recognizing products' serial numbers and vehicle number plates) but obviously first we need to know what kind of filtering and thresholding Tesseract does by default.
I understand Tesseract performs various image-processing operations internally (using the Leptonica library) before doing the actual OCR. For example, I understand Tesseract does binarization, segmentation, and noise removal internally, and has a default segmentation method. Is this right? Which source files contain these methods, so that I can see in what order these internal image-processing operations are carried out before the actual OCR?
The GitHub download has a lot of directories and code, so I would really appreciate someone pointing us in the right direction: where should we look to see the standard parameters and image-processing operations that Tesseract applies before doing the actual OCR? We can only find .h header files.
Thanks,
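A note on where to look: the function names in those messages (fopenReadStream, findFileFormat) belong to Leptonica rather than Tesseract, and Homebrew normally installs compiled libraries plus headers, which is probably why only .h files turn up under /usr/local/Cellar. A minimal sketch for locating message strings, assuming you have cloned the Leptonica and Tesseract sources somewhere locally (the helper name find_message_source and the path in the usage line are hypothetical):

```shell
# Hypothetical helper: list source lines that contain a given message string.
# Usage: find_message_source "message text" /path/to/source/checkout
find_message_source() {
    fms_message=$1
    fms_src_dir=$2
    # -r: recurse, -n: show line numbers, -F: match the message as a fixed string
    grep -rnF --include='*.c' --include='*.cpp' --include='*.h' \
        -- "$fms_message" "$fms_src_dir"
}
```

For example, `find_message_source "image file not found" ~/src/leptonica/src` would print each matching file name and line number, which gives a starting point for reading the code that runs before recognition.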

How do I train tesseract 4 with image data instead of a font file?

I'm trying to train Tesseract 4 with images instead of fonts.
The docs only explain the font-based approach, not training from images.
I know how this works with earlier versions of Tesseract, but I don't understand how to use box/tiff files to train the LSTM engine in Tesseract 4.
I looked into tesstrain.sh, which is used to generate LSTM training data, but couldn't find anything helpful. Any ideas?
Clone the tesstrain repo at https://github.com/tesseract-ocr/tesstrain.
You’ll also need to clone the tessdata_best repo, https://github.com/tesseract-ocr/tessdata_best. This acts as the starting point for your training. Training from scratch takes hundreds of thousands of samples to reach good accuracy, so starting from an existing model lets you fine-tune with much less data (tens to hundreds of samples can be enough).
Add your training samples to the directory in the tesstrain repo named ./tesstrain/data/my-custom-model-ground-truth
Your training samples should be image/text file pairs that share the same name but have different extensions. For example, you should have an image file named 001.png that is a picture of the text foobar, and a text file named 001.gt.txt that contains the text foobar.
Each image and its text file must contain a single line of text.
In the tesstrain repo, run this command:
make training MODEL_NAME=my-custom-model START_MODEL=eng TESSDATA=~/src/tessdata_best
Once the training is complete, there will be a new file tesstrain/data/.traineddata. Copy that file to the directory Tesseract searches for models. On my machine, it was /usr/local/share/tessdata/.
Then, you can run tesseract and use that model as a language.
tesseract -l my-custom-model foo.png -
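Since training depends on every image having a matching single-line transcription, it can help to check the ground-truth directory before running make. This is only a rough sketch: check_ground_truth is a hypothetical helper, and it assumes plain NAME.png or NAME.tif images paired with NAME.gt.txt, as in the 001.png / 001.gt.txt example above.

```shell
# Hypothetical helper: report problems in a tesstrain ground-truth directory.
# Prints one line per problem and returns non-zero if any were found.
check_ground_truth() {
    gt_dir=$1
    gt_status=0
    for gt_image in "$gt_dir"/*.png "$gt_dir"/*.tif; do
        [ -e "$gt_image" ] || continue          # glob matched nothing
        gt_text="${gt_image%.*}.gt.txt"         # 001.png -> 001.gt.txt
        if [ ! -f "$gt_text" ]; then
            echo "missing transcription: $gt_text"
            gt_status=1
        elif [ "$(grep -c '' "$gt_text")" -ne 1 ]; then
            # grep -c '' counts lines, so empty and multi-line files are flagged
            echo "not a single line: $gt_text"
            gt_status=1
        fi
    done
    return "$gt_status"
}
```

Run it as `check_ground_truth data/my-custom-model-ground-truth` from inside the tesstrain checkout; no output and a zero exit status mean every image has a one-line transcription.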

recognizing punctuation in Tesseract OCR

I am running tesseract to extract text from PDF files in a context where it is important to distinguish between semicolons and commas.
I find that semicolons often show up as commas after OCR. The accuracy is otherwise pretty good.
I am looking for suggestions on how to improve accuracy in semicolon versus comma detection. Following this suggestion, my procedure is to first convert a multipage PDF file to a ppm file using pdftoppm from Xpdf, then convert that to tif using ImageMagick, then run tesseract on the .tif file.
I have set the resolution of the ppm file to 1000 DPI and used the -sharpen option in ImageMagick in an effort to improve resolution, but neither seems to improve the semicolon recognition.
Any suggestions for pre-processing the image files, or is this just a tough hill to climb?
Here are links to the original PDF, the .ppm and .tif files, and the .txt output.
Note that this is copyrighted material which I do not own.
You can always custom-train Tesseract on your own dataset. You can check the article "How to custom train tesseract".
Collecting a dataset and training a new model is certainly a long process, but it is one way to improve the OCR.
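For reference, the pipeline described in the question (pdftoppm, then ImageMagick, then tesseract) can be sketched as below so that resolution and other settings are easy to vary between runs. The names run and ocr_pdf are hypothetical, and the 300 DPI default is an assumption (it is the figure commonly recommended for Tesseract input, not something taken from the question). The sketch does not in itself fix semicolon recognition.

```shell
# Hypothetical helper: run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical wrapper: OCR every page of a PDF via pdftoppm, convert, tesseract.
# Usage: ocr_pdf input.pdf [dpi]
ocr_pdf() {
    ocr_pdf_file=$1
    ocr_dpi=${2:-300}                       # assumed default, not from the question
    ocr_base=${ocr_pdf_file%.pdf}
    # Render each page to <base>-<page>.ppm at the requested resolution.
    run pdftoppm -r "$ocr_dpi" "$ocr_pdf_file" "$ocr_base" || return 1
    for ocr_ppm in "$ocr_base"-*.ppm; do
        [ -e "$ocr_ppm" ] || continue       # no pages rendered
        ocr_page=${ocr_ppm%.ppm}
        # Convert to TIFF (ImageMagick 7 also offers this command as "magick").
        run convert "$ocr_ppm" "$ocr_page.tif" || return 1
        # Tesseract writes the recognized text to <page>.txt.
        run tesseract "$ocr_page.tif" "$ocr_page" || return 1
    done
}
```

Calling `DRY_RUN=1 ocr_pdf paper.pdf 400` prints the commands for any pages already rendered without running them, which makes it easy to compare the output of several resolutions side by side.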

Tesseract OCR cube files for Turkish

Where can I find the Tesseract OCR Turkish language files for cube mode?
Files:
tr.cube.fold
tr.cube.lm
tr.cube.nn
tr.cube.params
tr.cube.size
tr.cube.word-freq
The single file "tur.traineddata" includes everything, so it is all you need:
https://github.com/tesseract-ocr/tessdata/blob/master/tur.traineddata
and
https://github.com/tesseract-ocr/langdata/tree/master/tur
--
You could also use the trained data from tessdata_fast if you really need performance and are willing to lose some accuracy.
Grab the Turkish version at https://github.com/tesseract-ocr/tessdata_fast/blob/master/tur.traineddata
Nowhere. Cube is a dead end and will be eliminated from Tesseract; see, e.g., https://github.com/tesseract-ocr/tesseract/issues/40

Tesseract finished training, but poor output

I have 5 PDFs, which I converted to TIFF, merged with jtessbox, created a box file for, and then went through the process of picking up each and every letter. After building the language, I tried running tesseract on the same big TIFF and on the converted PDFs, but I'm getting worse accuracy than with the default dictionary. Is there anything I could be doing wrong?
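One thing that can be ruled out cheaply is a malformed box file. The sketch below assumes the classic Tesseract box format, where each line holds a symbol followed by left, bottom, right, top, and page number; check_box_file is a hypothetical helper. It only catches structural mistakes and cannot tell whether a box was drawn around the right glyph.

```shell
# Hypothetical helper: flag box-file lines that are structurally invalid.
# Expected line format: <symbol> <left> <bottom> <right> <top> <page>
check_box_file() {
    awk '
        BEGIN { bad = 0 }
        NF != 6 {
            printf "line %d: expected 6 fields, found %d\n", NR, NF
            bad = 1
            next
        }
        # A valid box has left < right and bottom < top.
        $2 + 0 >= $4 + 0 || $3 + 0 >= $5 + 0 {
            printf "line %d: empty or inverted box\n", NR
            bad = 1
        }
        END { exit bad }
    ' "$1"
}
```

Running `check_box_file merged.box` (merged.box being a placeholder name) prints nothing and exits with status 0 when every line is well formed.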