Recognizing punctuation in Tesseract OCR

I am running tesseract to extract text from PDF files in a context where it is important to distinguish between semicolons and commas.
I find that semicolons often show up as commas after OCR. The accuracy is otherwise pretty good.
I am looking for suggestions on how to improve accuracy in distinguishing semicolons from commas. Following this suggestion, my procedure is to first convert a multipage PDF file to a .ppm file using pdftoppm from Xpdf, then convert that to a .tif using ImageMagick, and then run tesseract on the .tif file.
I have set the resolution of the .ppm file to 1000 DPI and used ImageMagick's -sharpen option in an effort to improve image quality, but neither seems to improve the semicolon recognition.
Any suggestions for pre-processing the image files, or is this just a tough hill to climb?
Here are links to the original PDF, the .ppm and .tif files, and the .txt output.
Note that this is copyrighted material which I do not own.
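For reference, here is a minimal sketch of the conversion pipeline described above, assuming pdftoppm, ImageMagick's convert, and tesseract are all on the PATH; the file names, DPI value, and sharpen radius are placeholders rather than values taken from the question:

import subprocess

# PDF -> PPM -> TIFF -> text for one page; pdftoppm numbers its output
# pages itself (e.g. page-1.ppm), possibly zero-padded.
subprocess.run(["pdftoppm", "-r", "1000", "input.pdf", "page"], check=True)
subprocess.run(["convert", "page-1.ppm", "-sharpen", "0x1", "page-1.tif"], check=True)
subprocess.run(["tesseract", "page-1.tif", "page-1"], check=True)  # writes page-1.txt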

You can always custom-train Tesseract on your own dataset. You can check this article: How to custom train Tesseract.
Training a new model will certainly be a long process, since you first have to collect a dataset and so on, but it is a way to improve the OCR.

Related

Tesseract multiple output format

My context
I'm using tesseract to extract text from an image.
I'm generating a .tsv to retrieve the extracted text and perform some regex on it and a .pdf to have a searchable pdf.
The way I do it is by calling tesseract twice:
One asking for the .tsv
One asking for the .pdf
But I feel like this is not very efficient (the same computations must be performed twice).
What I wish
I wish to make my processing faster, and my idea is to call tesseract only once while specifying two output formats.
Is it possible? If so how?
You can try the command:
tesseract yourimage.tif out pdf tsv
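The trailing arguments are Tesseract config names, so a single invocation like this should produce both out.pdf and out.tsv from one recognition pass. For what it's worth, a minimal sketch of the same call from Python via subprocess, with the image and output base names as placeholders:

import subprocess

# One tesseract invocation, two output renderers: writes out.pdf and out.tsv.
subprocess.run(["tesseract", "yourimage.tif", "out", "pdf", "tsv"], check=True)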

Tesseract training for a new font

I'm still new to Tesseract OCR, and after using it in my script I noticed it had a relatively high error rate for the images I was trying to extract text from. I came across Tesseract training, which supposedly can decrease the error rate for a specific font you use. I also came across a website (http://ocr7.com/), a tool powered by Anyline that does all the training for a font you specify. So I received a .traineddata file and I am not quite sure what to do with it. Could anybody explain what I have to do with this file for it to work? Or should I just learn how to do Tesseract training the manual way, which according to the Anyline website may take a day's work? Thanks in advance.
For anyone who is still going to read this: you can use this tool to get a traineddata file for whichever font you want. After that, move the traineddata file into your tessdata folder. To use tesseract with the new font in Python (or any other language, I think), pass lang="Font" as the second parameter to the image_to_string function. It improves accuracy significantly but can still make mistakes, of course. Or you can just learn how to train tesseract for a new font manually with this guide: http://pretius.com/how-to-prepare-training-files-for-tesseract-ocr-and-improve-characters-recognition/.
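As an illustration, here is a minimal pytesseract sketch of that call, assuming the downloaded file is named Font.traineddata, it sits in the tessdata folder, and sample.png is a placeholder test image:

import pytesseract
from PIL import Image

# "Font" must match the traineddata file name (Font.traineddata) in tessdata.
text = pytesseract.image_to_string(Image.open("sample.png"), lang="Font")
print(text)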
Edit:
Tesseract 5 training tutorial: https://www.youtube.com/watch?v=KE4xEzFGSU8
I made a video tutorial explaining the process for the latest version of Tesseract (the LSTM model); hope it helps: https://www.youtube.com/watch?v=TpD76k2HYms
If you want to train tesseract on a new font, generate a .traineddata file for your desired font. To generate the .traineddata you first need a .tiff file and a .box file, which you can create using jTessBoxEditor (a tutorial for jTessBoxEditor is here). While making the .tiff file you can set the font on which you want to train tesseract. You can then use either jTessBoxEditor or serak-tesseract-trainer to generate the .traineddata. I have used both, and I would say that jTessBoxEditor is great for generating the tiff and box files, while serak is the one to use for training tesseract.

Tesseract OCR cube files for Turkish

Where can I find the Tesseract OCR Turkish language extension for cube mode?
The files I need:
tr.cube.fold
tr.cube.lm
tr.cube.nn
tr.cube.params
tr.cube.size
tr.cube.word-freq
The single file "tur.traineddata" includes all of those files and is enough on its own:
https://github.com/tesseract-ocr/tessdata/blob/master/tur.traineddata
and
https://github.com/tesseract-ocr/langdata/tree/master/tur
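A minimal sketch of using that file once downloaded, assuming it has been placed in a tessdata directory (the path and image name below are placeholders); --tessdata-dir can be omitted if the file is in the default tessdata location:

import subprocess

# Recognize Turkish text with tur.traineddata; writes out.txt.
subprocess.run(
    ["tesseract", "scan.tif", "out", "--tessdata-dir", "/path/to/tessdata", "-l", "tur"],
    check=True,
)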
--
You could also use the trained data from tessdata_fast if you really need performance and are willing to lose some accuracy.
Grab the Turkish version at https://github.com/tesseract-ocr/tessdata_fast/blob/master/tur.traineddata
Nowhere. Cube is a dead end and will be eliminated from tesseract; see e.g. https://github.com/tesseract-ocr/tesseract/issues/40

Tesseract finished training, but poor output

I have 5 PDFs, which I converted to TIFF, merged with jTessBoxEditor, created a box file for, and then went through the process of picking up each and every letter. After building the language, I tried running tesseract on the same big TIFF and on the converted PDFs, but I'm getting worse accuracy than with just the default dictionary. Is there anything I could be doing wrong?

Tesseract OCR: How can I train my own tessdata library in batch with lots of single-character images?

I have lots of images that each contain only a single character. How can I use them to train my own tessdata library in batch? Are there any tips?
Also, I'm confused about the feature extraction step that sits between library training and character recognition. Could anyone explain the flow?
Thanks very much!
If they are of the same font, put them into a multi-page TIFF and conduct training on that. jTessBoxEditor can help you with the TIFF merging and box editing.
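As a sketch of the merging step with Pillow (the glob pattern and output file name are placeholders; the output name simply follows the usual [lang].[fontname].exp[num].tif training convention, which is an assumption here):

import glob
from PIL import Image

# Collect the single-character images and save them as one multi-page TIFF
# that can then be opened in jTessBoxEditor for box editing and training.
paths = sorted(glob.glob("chars/*.png"))
pages = [Image.open(p) for p in paths]
pages[0].save("eng.myfont.exp0.tif", save_all=True, append_images=pages[1:])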