Correct spacing when training tesseract-ocr

I use tesseract 3.0.1 on Windows 7 64-bit, and I am training the library with a new language.
My sample data is very well spaced. When I define the coordinates of the box for each character, how important is it for the box to fit tightly around the character? I use one of the add-ins, and it is much faster to define coarse-grained boxes over each character that include some (or a lot of) white space. Of course, a box never overlaps other characters.

In practice, it's recommended that you make the spacing as similar as possible to the real-world cases you will test on. Then, using tesseract-box-editor or jTessBoxEditor, you can correct the bounding boxes around the letters.

Related

Can I denote a glyph as being two chars (NA) in a box file in Tesseract 3.05

I am using tesseract 3.05 for reasons beyond my control, and I am using source files to train the engine to detect this unique font. As I have a vast number of samples, I am simply using the samples themselves as the training images rather than segmenting them into a font training image, since this should give the engine more variation and expose it to the specific spacing issues this font has.
When generating the box files, some letters touch at the corners (i.e. there is no clear break between glyphs), so Tesseract detects them as one glyph instead of two separate glyphs. An example it sometimes struggles with is NA, where the front serif of the A has bled into the serif of the N. The image pre-processing I have applied has improved things by leaps and bounds, but there are still some cases I cannot correct on the image side.
My question is this: can I simply denote the glyph as being NA in the box file?
If I cannot, what would be the simplest solution? Introducing another glyph box seems like it wouldn't be a good idea, but the only other solution I can see is to manually edit the images to make the separation of glyphs more obvious. That, however, is antithetical to my goal, as this is exactly the kind of problem the font I am trying to OCR will keep having.
Thank you in advance; the documentation isn't specific on whether I can mark a boxed glyph as being two characters instead of just one (or I just haven't found the relevant section where this is explained).
After scouring the documentation, I managed to find a lone paragraph that wasn't appearing in my website scraping:
"If you didn't successfully space out the characters on the training image, some may have been joined into a single box. In this case, you can either remake the images with better spacing and start again, or if the pair is common, put both characters at the start of the line, leaving the bounding box to represent them both. (As of 3.00, there is a limit of 24 bytes for the description of a "character". This will allow you between 6 and 24 unicodes to describe the character, depending on where your codes sit in the unicode set. If anyone hits this limit, please file an issue describing your situation.)"
So yes, you can do what I asked: represent a glyph with two or more characters in a box file for Tesseract.
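For illustration, each line of a box file has the form <symbol> <left> <bottom> <right> <top> <page>, so a joined pair is written by putting both characters in the symbol field and keeping the merged bounding box. A made-up example (the coordinates are placeholders, not from my data):
NA 108 30 172 75 0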

Thicken the accent marks for OCR

I am extracting text from an image using Tesseract.
The accent marks in some words are so thin and broken (e.g. the left side of the '^' in the word 'Bội' appears very dim) that they cause inaccuracies in the text output ('Bội' -> 'Bủi'). Is there any library that can improve this, or any algorithm that iterates through every pixel of the image and sets them to the same color value?
Such a thing is easily accomplished, yes, but it will likely cause issues elsewhere. For example, eroding the image with a 3x3 kernel thickens the strokes, and the result can then be thresholded at 252 to produce a clean binary image. Note, though, how the 9 and 6 in the phone number then merge into a single blob.
As for a specific library to accomplish such things, look at OpenCV or any other computer-vision library.
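If it helps, here is a rough sketch of that erode-then-threshold step using the OpenCV 3+ Java bindings; the file names are placeholders, and it assumes dark text on a light background:

import org.opencv.core.Core;
import org.opencv.core.Mat;
import org.opencv.core.Size;
import org.opencv.imgcodecs.Imgcodecs;
import org.opencv.imgproc.Imgproc;

public class ThickenStrokes {
    public static void main(String[] args) {
        System.loadLibrary(Core.NATIVE_LIBRARY_NAME);
        // Load as grayscale. Erosion expands the dark regions (the text),
        // which thickens thin strokes such as accent marks.
        Mat src = Imgcodecs.imread("input.png", Imgcodecs.IMREAD_GRAYSCALE);
        Mat kernel = Imgproc.getStructuringElement(Imgproc.MORPH_RECT, new Size(3, 3));
        Mat eroded = new Mat();
        Imgproc.erode(src, eroded, kernel);
        // Threshold at 252, as described above: anything darker than 252 turns black.
        Mat binary = new Mat();
        Imgproc.threshold(eroded, binary, 252, 255, Imgproc.THRESH_BINARY);
        Imgcodecs.imwrite("output.png", binary);
    }
}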

achieve better recognition results via training tesseract

I have a question about achieving better recognition results with Tesseract. I am using Tesseract to recognize serial numbers. The serial numbers use only one font, consist of the characters A-Z and 0-9, and occur in different sizes and lengths.
At the moment I am able to recognize about 40% of the serial number images correctly. The images are taken with a mobile phone camera, so the image quality isn't the best.
Particular problem characters are 8/B and 5/6. Since I am recognizing only serial numbers, I am not using any dictionary improvements, and every character is recognized independently.
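For reference, this is typically done with a small config file along the following lines; the variable names are standard Tesseract parameters, while the file and image names below are placeholders:

load_system_dawg        F
load_freq_dawg          F
tessedit_char_whitelist ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789

Saved as, for example, tessdata/configs/serial, it is passed as the last argument: tesseract serial.png out serial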
My question is: does anyone have experience with achieving better recognition results by training Tesseract? How many images would be needed to get good results?
For training Tesseract, should I use printed and then photographed serial numbers, or should I use the original digital serial numbers, without printing and photographing?
Maybe somebody already has experience in this kind of area.
Regarding training Tesseract: I have already trained it with some images. For that, I printed all characters in different sizes, photographed them, and labeled them correctly. Example training photo of the character 5.
Is this a good or bad training example? Since I only want to recognize single characters without any dependencies, I thought I wouldn't have to use words for training.
So far I have trained with only 3 of these images for the characters B, 8, 6 and 5, which doesn't give better recognition than the original English (eng) Tesseract database.
best regards,
Christoph
I am currently working on a Sikuli application that uses Tesseract to read text (strings and numbers) from screenshots. I found that the best way to achieve accuracy was to process the screenshot before performing OCR on it. However, most of the text I am reading is green text on a black background, which made preprocessing my preferred solution. I used the Scalr library's resize method to increase the size of the BufferedImage:
BufferedImage bufImg = Scalr.resize(...)
which instantly yielded more accurate results with black text on a gray background. I then used the BufferedImage types BufferedImage.TYPE_BYTE_GRAY and BufferedImage.TYPE_BYTE_BINARY when creating new BufferedImages, to convert the image to grayscale and to black and white, respectively.
Following these steps brought Tesseract's accuracy from about 30% to around 85% when dealing with green text on a black background, and really close to 100% when dealing with normal black text on a white background (sometimes letters within a word are mistaken for numbers, e.g. hel10).
I hope this helps!
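For completeness, here is a condensed sketch of those steps; the 3x scale factor and the QUALITY method are illustrative assumptions, not tuned values:

import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import org.imgscalr.Scalr;

public class Preprocess {
    // Upscale, then convert to grayscale, then to black/white -- the same
    // three steps described above. The 3x factor is an illustrative guess.
    static BufferedImage prepare(BufferedImage screenshot) {
        BufferedImage scaled = Scalr.resize(screenshot, Scalr.Method.QUALITY,
                screenshot.getWidth() * 3, screenshot.getHeight() * 3);

        // Repainting into a TYPE_BYTE_GRAY buffer converts to grayscale.
        BufferedImage gray = new BufferedImage(scaled.getWidth(), scaled.getHeight(),
                BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = gray.createGraphics();
        g.drawImage(scaled, 0, 0, null);
        g.dispose();

        // Repainting into a TYPE_BYTE_BINARY buffer converts to black/white.
        BufferedImage bw = new BufferedImage(gray.getWidth(), gray.getHeight(),
                BufferedImage.TYPE_BYTE_BINARY);
        Graphics2D g2 = bw.createGraphics();
        g2.drawImage(gray, 0, 0, null);
        g2.dispose();
        return bw;
    }
}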

Tesseract OCR finds too few boxes / ignores small characters

I have a problem with the training/text-recognition process in Tesseract. Here is my training data: http://s11.postimg.org/867aq10ur/dot_dotmatrixfont_exp0.png During training, Tesseract ignores the dashes (I've marked them with red boxes, just to make clear which ones I mean), and if I use the trained data for text recognition it ignores them as well. Today I played around with the Tesseract parameters (SetVariable(name, value)), but unfortunately I had no success.
What can I do to teach Tesseract those dashes? Thank you in advance!
Tesseract training is pretty tricky.
Your best chance might be to handle the dashes as a single character.
If your box editor or whatever tool you are using does not see the dashes at all, try running some image processing first, especially thresholding or inversion. Take a look at OpenCV; it has some excellent tools for this kind of image processing.
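A minimal sketch of that preprocessing, again with the OpenCV 3+ Java bindings; Otsu thresholding stands in for a hand-picked threshold, and the file names are placeholders:

import org.opencv.core.Core;
import org.opencv.core.Mat;
import org.opencv.imgcodecs.Imgcodecs;
import org.opencv.imgproc.Imgproc;

public class DashCleanup {
    public static void main(String[] args) {
        System.loadLibrary(Core.NATIVE_LIBRARY_NAME);
        Mat src = Imgcodecs.imread("scan.png", Imgcodecs.IMREAD_GRAYSCALE);
        Mat bw = new Mat();
        // Otsu picks the threshold automatically, which can keep faint marks
        // (like the dashes) that a badly chosen fixed threshold would drop.
        Imgproc.threshold(src, bw, 0, 255, Imgproc.THRESH_BINARY | Imgproc.THRESH_OTSU);
        // For light-on-dark sources, invert so Tesseract sees dark text on a
        // light background; drop this line if the scan is already dark-on-light.
        Core.bitwise_not(bw, bw);
        Imgcodecs.imwrite("scan-clean.png", bw);
    }
}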

Creating a training image for Tesseract OCR

I'm writing a generator for training images for Tesseract OCR.
When generating a training image for a new font for Tesseract OCR, what are the best values for:
The DPI
The font size in points
Should the font be anti-aliased or not
Should the bounding boxes fit snugly, or not?
The 2nd question is partly answered here: http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3#Generate_Training_Images
There is no need to train with multiple sizes. 10 point will do. (An exception to this is very small text. If you want to recognize text with an x-height smaller than about 15 pixels, you should either train it specifically or scale your images before trying to recognize them.)
Questions 1 and 3: in my experience, 300 dpi images and non-anti-aliased fonts work well. More specifically, I used the following convert parameters on a training PDF, which generated a satisfactory image:
convert -density 300 -depth 8 [input].pdf -background white -flatten +matte -compress none -monochrome [output].tif
But then I tried to add a dotted font to Tesseract, and it only detected characters properly when I used a 150 dpi image. So I don't think there is a general solution; it depends on the kind of fonts you're trying to add.
I found the answer to the 4th question ("Should the bounding boxes fit snugly?").
It seems that fitting the rectangles as tightly as possible gives much better results.
For the others, 12 pt and 300 dpi will be good enough, as @Yaroslav suggested. I think anti-aliasing is better turned off.
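If your Tesseract version ships the text2image tool (3.04 and later), a training image at those settings can be generated roughly like this; the font name and paths are placeholders:
text2image --text=training_text.txt --outputbase=eng.myfont.exp0 --font='My Font' --fonts_dir=/usr/share/fonts --ptsize=12 --resolution=300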
A good tool for Tesseract training: http://vietocr.sourceforge.net/training.html
It is a good tool because it has a number of advantages:
bounding boxes on letters can be edited through a GUI-based interface
it automatically creates all the required files
it automatically combines all the files (freq-dawg, word-dawg, user-words (can be an empty file), inttemp, normproto, pffmtable, unicharset, DangAmbigs (can be an empty file), shapetable) into a single eng.traineddata file; see the command sketch below
the new training data can then be used in place of the existing Tesseract file eng.traineddata
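For reference, the manual equivalent of that combining step: rename the intermediate files with the language prefix (eng.unicharset, eng.inttemp, eng.normproto, eng.pffmtable, and so on) and run
combine_tessdata eng.
which produces eng.traineddata.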