Tesseract OCR cube files for Turkish - ocr

Where can I find the Tesseract OCR Turkish language files for cube mode?
I am looking for these files:
tr.cube.fold
tr.cube.lm
tr.cube.nn
tr.cube.params
tr.cube.size
tr.cube.word-freq

The single file "tur.traineddata" is enough; it includes all of these files:
https://github.com/tesseract-ocr/tessdata/blob/master/tur.traineddata
and
https://github.com/tesseract-ocr/langdata/tree/master/tur
--
You could also use the trained data from tessdata_fast if you really need performance and are willing to lose some accuracy.
Grab the Turkish version at https://github.com/tesseract-ocr/tessdata_fast/blob/master/tur.traineddata
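
Once tur.traineddata has been downloaded, a minimal Python sketch like the one below could run Tesseract with it. It assumes the `tesseract` executable is on the PATH; the function names, the image path and the tessdata folder are hypothetical and only serve as illustration. `-l` and `--tessdata-dir` are standard options of the tesseract command line.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, lang="tur", tessdata_dir=None):
    """Return the argv list for a tesseract run that prints the text to stdout."""
    cmd = ["tesseract", str(image_path), "stdout", "-l", lang]
    if tessdata_dir is not None:
        # Point tesseract at the folder that holds <lang>.traineddata.
        cmd += ["--tessdata-dir", str(tessdata_dir)]
    return cmd


def ocr_image(image_path, lang="tur", tessdata_dir=None):
    """Run tesseract on one image and return the recognized text."""
    if tessdata_dir is not None:
        model = Path(tessdata_dir) / f"{lang}.traineddata"
        if not model.is_file():
            raise FileNotFoundError(f"missing language file: {model}")
    result = subprocess.run(
        build_tesseract_command(image_path, lang, tessdata_dir),
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise if tesseract exits with an error
    )
    return result.stdout
```

For example, `ocr_image("page.png", tessdata_dir="./tessdata")` would look for `./tessdata/tur.traineddata` before calling Tesseract.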

Nowhere. Cube is a dead end and will be removed from Tesseract; see e.g. https://github.com/tesseract-ocr/tesseract/issues/40

Related

Tesseract training for a new font

I'm still new to Tesseract OCR, and after using it in my script I noticed it had a relatively high error rate for the images I was trying to extract text from. I came across Tesseract training, which supposedly can decrease the error rate for a specific font. I also found a website (http://ocr7.com/), a tool powered by Anyline that does all the training for a font you specify. So I received a .traineddata file, and I am not quite sure what to do with it. Could anybody explain what I have to do with this file for it to work? Or should I just learn how to do Tesseract training the manual way, which according to the Anyline website may take a day's work? Thanks in advance.
For anyone who is still going to read this: you can use this tool to get a traineddata file for whichever font you want. After that, move the traineddata file into your tessdata folder. To use Tesseract with the new font in Python (or any other language, I think), pass lang="Font" as the second parameter of the image_to_string function. It improves accuracy significantly but can still make mistakes, of course. Or you can learn how to train Tesseract for a new font manually with this guide: http://pretius.com/how-to-prepare-training-files-for-tesseract-ocr-and-improve-characters-recognition/.
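
The "move the traineddata file into your tessdata folder" step can be sketched in Python as follows. This is only an illustration: `install_traineddata` is a hypothetical helper, and the location of the tessdata folder depends on how Tesseract was installed, so it is passed in explicitly. The value it returns is the name to use for the lang argument, since Tesseract derives the language name from the file name.

```python
import shutil
from pathlib import Path


def install_traineddata(traineddata_file, tessdata_dir):
    """Copy a .traineddata file into the tessdata folder and return its language name."""
    src = Path(traineddata_file)
    if src.suffix != ".traineddata":
        raise ValueError(f"not a .traineddata file: {src}")
    dest_dir = Path(tessdata_dir)
    if not dest_dir.is_dir():
        raise NotADirectoryError(f"tessdata folder not found: {dest_dir}")
    shutil.copy2(src, dest_dir / src.name)
    # "Font.traineddata" is addressed as lang="Font".
    return src.stem
```

With a file named Font.traineddata the helper returns "Font", which matches the lang = "Font" value mentioned above.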
Edit:
Tesseract 5 training tutorial: https://www.youtube.com/watch?v=KE4xEzFGSU8
I made a video tutorial explaining the process for the latest version of Tesseract (The LSTM model), hope it helps. https://www.youtube.com/watch?v=TpD76k2HYms
If you want to train Tesseract with a new font, generate a .traineddata file for your desired font. To generate the .traineddata file, you first need a .tiff file and a .box file. You can create these files using jTessBoxEditor. A tutorial for jTessBoxEditor is here. While making the .tiff file you can set the font you want to train Tesseract on. You can either use jTessBoxEditor to generate the .traineddata file, or use serak-tesseract-trainer. I have used both, and I would say jTessBoxEditor is great for generating the tiff and box files, while serak is the better choice for training Tesseract.

recognizing punctuation in Tesseract OCR

I am running tesseract to extract text from PDF files in a context where it is important to distinguish between semicolons and commas.
I find that semicolons often show up as commas after OCR. The accuracy is otherwise pretty good.
I am looking for suggestions on how to improve accuracy in semicolon versus comma detection. Following this suggestion, my procedure is to first convert a multipage PDF file to a ppm file using pdftoppm from Xpdf, then convert that to tif using imagemagick, then run tesseract on the .tif file.
I have set the resolution of the ppm file to 1000 DPI and used the -sharpen option in imagemagick in an effort to improve resolution, but neither seems to improve the semicolon recognition.
Any suggestions for pre-processing the image files, or is this just a tough hill to climb?
Here are links to the original PDF, the .ppm and .tif files, and the .txt output.
Note that this is copyrighted material which I do not own.
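
For reference, the procedure described in the question (pdftoppm, then ImageMagick, then tesseract) can be sketched as a small Python script. This is an assumption-laden illustration, not the asker's actual commands: the file names are hypothetical, the `0x1` argument to `-sharpen` and the default of 300 DPI are placeholder values, and the name of the .ppm page that pdftoppm writes depends on the pdftoppm build, so it is passed in explicitly. On ImageMagick 7 the `convert` command is called `magick`.

```python
import subprocess


def build_pipeline(pdf, ppm_prefix, ppm_page, tif, out_base, dpi=300):
    """Return the three commands of the PDF -> PPM -> TIFF -> text pipeline."""
    return [
        # 1. Render the PDF pages to PPM images at the given resolution.
        ["pdftoppm", "-r", str(dpi), pdf, ppm_prefix],
        # 2. Sharpen one rendered page and convert it to TIFF.
        ["convert", ppm_page, "-sharpen", "0x1", tif],
        # 3. OCR the TIFF; tesseract writes <out_base>.txt.
        ["tesseract", tif, out_base],
    ]


def run_pipeline(commands):
    """Run the commands in order and stop at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Keeping the commands in one list makes it easy to change a single stage, such as the resolution or the sharpening argument, and rerun the whole chain to compare the semicolon output.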
You can always custom-train Tesseract on your own dataset. You can check this article: How to custom train tesseract.
Training a new model is certainly a long process, since you first have to collect a dataset, but it is one way to improve the OCR.

How to read data from an OCR file

I am building a tool that can read an OCR file. I am using idolondemand (idolondemand.com), but I have not found it very promising: it does not read the file properly (e.g. spelling mistakes, wrong special characters).
I can move to any other language; this problem has basically become language-independent for me.
I need help in building one.

Tesseract OCR: How can I train my own tessdata library in batch with lots of single-character images?

1. I have lots of images that each contain only a single character. How can I use them to train my own tessdata library in batch? Are there any tips?
2. Besides, I'm confused about how the feature extraction part relates to library training and character recognition. Could anyone explain the flow?
Thanks very much!
If they are all in the same font, put them in a multi-page TIFF and train on it. jTessBoxEditor can help you with the TIFF merging and box editing.
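
For single-character images, the box file that goes with the multi-page TIFF can also be generated rather than edited by hand. The sketch below is a minimal illustration under two assumptions that are not from the answer: each image becomes one TIFF page in the given order, and the character fills the whole image, so its box spans the full page. The helper name `box_lines` is hypothetical. Each line uses Tesseract's box file layout: symbol, left, bottom, right, top, page (zero-based).

```python
def box_lines(glyphs):
    """Build box file text for pages that each hold one character.

    glyphs: iterable of (char, width, height) tuples in page order,
    where width and height are the page size in pixels.
    """
    lines = []
    for page, (char, width, height) in enumerate(glyphs):
        # The box covers the full page: left=0, bottom=0, right=width, top=height.
        lines.append(f"{char} 0 0 {width} {height} {page}\n")
    return "".join(lines)
```

The result can be written next to the TIFF under the same base name with a .box extension and then refined in jTessBoxEditor if the characters do not actually fill their images.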

Introduction to OCR

Someone gave me a trove full of amazing information. It is 200 MB of .tiff images of scanned announcements that go back to the '40s. I want to digitize this, but I have no knowledge whatsoever about OCR. Some of the early material is barely readable by a human, let alone a machine. It is also in Hebrew.
I'm looking for advice on how to approach this: good suggestions for books, articles, code libraries or software (all of them should be freely available on the web). I'm proficient in C++ and Python and can pick up another language if needed.
Thank you.
This sounds like a great task for Python, using an OCR library. A quick Google search turned up pytesser:
PyTesser is an Optical Character Recognition module for Python. It takes as input an image or image file and outputs a string.
PyTesser uses the Tesseract OCR engine, converting images to an accepted format and calling the Tesseract executable as an external script. A Windows executable is provided along with the Python scripts. The scripts should work in other operating systems as well.
...
Usage Example
>>> from pytesser import *
>>> image = Image.open('fnord.tif') # Open image object using PIL
>>> print image_to_string(image) # Run tesseract.exe on image
fnord
>>> print image_file_to_string('fnord.tif')
fnord