Creating a training image for Tesseract OCR

Creating a training image for Tesseract OCR - ocr

I'm writing a generator for training images for Tesseract OCR.
When generating a training image for a new font for Tesseract OCR, what are the best values for:
The DPI
The font size in points
Should the font be anti-aliased or not
Should the bounding boxes fit snugly: , or not:

The 2th question is somehow answered here: http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3#Generate_Training_Images
There is no need to train with multiple sizes. 10 point will do. (An exception to this is very small text. If you want to recognize text with an x-height smaller than about 15 pixels, you should either train it specifically or scale your images before trying to recognize them.)
Questions 1 and 3: by experience, I've successfully used 300 dpi images/non anti-aliased fonts. More specifically, I have used the following convert parameters on a training pdf, which generated a satisfactory image:
convert -density 300 -depth 8 [input].pdf -background white -flatten +matte -compress none -monochrome [output].tif
But then I tried to add a dotted font to Tesseract and it only detected characters properly when I used a 150 dpi image. So, I don't think there's a general solution, it depends on the kind of fonts you're trying to add.

I found the answer to the 4th question - "Should the bounding boxes fit snugly".
It seems that fitting the rectangles as much as possible gives much better results.
For the other 12 pts and 300 dpi will be good enough, as #Yaroslav suggested. I think anti-aliasing is better turned off.

Good tool for tesseract training http://vietocr.sourceforge.net/training.html
It is good tool because having number of advantages
bounding box on letter can be editable by GUI based interface
automatically create all require file
automatically combined all files like freq-dawg, word-dawg, user-words (can be empty file), Inttemp, Normproto, Pffmtable, Unicharset, DangAmbigs (can be empty file), shapetable into single eng.traineddata file.
New training data can be used with existing tesseract file end.traineddata

Related

Can I denote a glyph as being two chars (NA) in a box file in Tesseract 3.05

I am using tesseract 3.05 for reasons beyond my control. I am using source files to train the engine to detect this unique font. As I have a vast amount of samples, I am simply using the samples themselves as the training images rather than segment them into a font training image as this should give it more variation and training with the specific spacing issues this font has.
My question when generating the box files, as some letters are touching at corners (i.e . no clear break between glyphs), it will detect them as one glyph instead of two separate glyphs. An example it sometimes struggles with NA as the front serif of the A has bled into serif of the N. The image pre-processing I have applied has improved it by leaps and bounds but there are still some that I cannot correct on the image enough.
My question is this: can I simply denote the glyph as being NA in the box file?
If I cannot what would be the simplest solution? Introducing another glyph box seems like it wouldn't be a good idea but the only other solution I can see is to manually edit the image to make the separation of glyphs more obvious. This is itself anthi-thetical however as this is the kind of problem the font will have in the future that I am trying to OCR.
Thank you in advance but the documentation isn't specific on if I can correct a box glyph to being two characters instead of just one (or I just haven't found a relevant section where they explain this).

After scouring the documentation, I managed to find a lone paragraph that wasn't appearing in my website scraping:
"If you didn't successfully space out the characters on the training image, some may have been joined into a single box. In this case, you can either remake the images with better spacing and start again, or if the pair is common, put both characters at the start of the line, leaving the bounding box to represent them both. (As of 3.00, there is a limit of 24 bytes for the description of a "character". This will allow you between 6 and 24 unicodes to describe the character, depending on where your codes sit in the unicode set. If anyone hits this limit, please file an issue describing your situation.)"
Thus you can do what I ask: represent a glyph with two or more characters in a box file for Tesseract.

Can you train tesseract with images instead of text and a font?

In the tesseract documentation a method of training with sample text and a font is explained.
I used jTessBoxEditor but works pretty much like the tesseract training tools.
I got somewhat acceptable results with this, but I guess the optimal solution would be training tesseract with the actual kind of images it will have to recognize anyway.
As I only need to recognize digits, I can cut by hand each of them, maybe many versions of each digit, and train tesseract with those images, even setting the boxes by hand.
Is there a way to do this?

If you are trying to train tesseract4, you can use ocrd-train
you basically prepare images corresponding to each line of text with their ground truth and it will do all the remaining work for you.

achieve better recognition results via training tesseract

I have a question regarding achieving better recognition results with tesseract. I am using tesseract to recognize serial numbers. The serial numbes consist of only one font-type, characters A-Z, 0-9 and occur in different sizes and lengths.
At the moment I am able to recognize about 40% of the serial number images correct. Images are taken via mobile phone camera. Therefore the image quality isn't the best.
Special problem characters are 8/B, 5/6. Since I am recognizing only serial numbers, I am not using any dictionary improvements and every character is recognized independently.
My question is: Does someone has already experience in achieving better recognition results with training tesseract? How many images would be needed to be able to get good results.
For training tesseract should I use printed and afterwards photographed serial numbers, or should I use original digital serial numbers, without printing and photographing?
Maybe somebody has already experience in that kind of area.
Regarding training tesseract: I have already trained tesseract with some images. Therefore I have printed all characters in different sizes, photographed and labeled them correctly. Example training photo of the character 5
Is this a good/bad training example? Since I only want to recognize single characters without any dependency, I though I don't have to use words for training.
Actual I only have trained with 3 of these images for the characters B 8 6 5 which doesn't result in a better recognition in comparison with the original english (eng) tesseract database.
best regards,
Christoph

I am currently working on a Sikuli application using Tesseract to read text (Strings and numbers) from screenshots. I found that the best way to achieve accuracy was to process the screenshot before performing the OCR on it. However, most of the text I am reading is green text-on black background, making this my preferred solution. I used Scalr's method within BufferedImage to increase the size of the image:
BufferedImage bufImg = Scalr.resize(...)
which instantly yielded more accurate results with black text on gray background. I then used BufferedImage's options BufferedImage.TYPE_BYTE_GRAY and BufferedImage.TYPE_BYTE_BINARY when creating a new BufferedImage to process the Image to grayscale and black/white, respectively.
Following these steps brought Tesseract's accuracy from a 30% to around an 85% when dealing with green text on black background, and a really-close-to-100% accuracy when dealing with normal black text on white background. (sometimes letters within a word are mistaken by numbers i.e. hel10)
I hope this helps!

How to make tesseract to give relevant results in the presence of noise?

I am using tesseract 3.0.0 and I bumped into the following problem:
When there is something too small for tesseract to recognize it seems it's merged with
other fragments. As a result nothing relevant is returned.
The image below shows 3 cases. Only the rectangle with the dashed line is passed to tesseract. Over the rectangle is the result (V over T means new line).
The last case is the problem one. Is there someway to improve tesseract in situations like this?

As far as I know, Tesseract does not have proper image segmentation yet (or Document Analysis, as it is called in commertial OCR applications.) Typically, before OCR is done, image is get's split on separate areas that contain text, pictures, barcodes, lines and so on. Then you apply OCR only on text ares and don't face problems you have just described.
Earlier versions of Tesseract did not have that functionality at all, and Tesseract was supposed to be used as line recognizer only, or so called field-level recognizer, when you use it on small snippets of text cut from bigger image.
I did not followed throughly what was introduced in 3.0, probably it is already there partially, but obviously it does not work as expected, as you have just found out.
There is another opensource project - OCRopus, that aproached this problem exactly as I described - first Document Analisys (aka Segmentation) and only then OCR. Their earlier versions were actually using Tesseract for OCR after analisys step finished. But later they introduced their own OCR (which is still not very good) and moved Tesseract plugin support down in priorities list.
Here's what you actually can do to address your problem:
If your images have very typical structure, you can try to do some dumb segmentation and cut text from the image yourself before passing it to Tesseract. However, if you expect to have wide variety of images to be supported, just forget it.
You can ckeck OCRopus and see if their segmentation work for your images. If yes, then you can spend some time to make OCRopus + Tesseract work together.
Well, if what you do is not just for fun and you value your time, I would recommend thinking about real OCR engine like ABBYY. You will get much higher accuracy of both segmentaiton and OCR out of the box, and professional customer support of course.
Disclaimer: I work for ABBYY

downscale large 1 bit tiff to 8 bit grayscale / 24 bit

Let's say i have a 100000x100000 1 bit (K channel) tiff with a dpi of 2000 and i want to downscale this to a dpi of 200. My resulting image would be 10000x10000 image. Does this mean that every 10 bits in the 1 bit image correspond to 1 pixel in the new image? By the way, i am using libtiff and reading the 1 bit tiff with tiffreadscanline. Thanks!

That means every 100 bits in the 1 bit image correspond to 1 pixel in the new image. You'd need to average the value over 10x10 1bit pixel area. For smoother greyscales, you'd better average over n bits where n is the bit depth of your target pixel, overlying the calculated area partially with neighbor areas (16x16px squares 10x10px apart, so their borders overlay, for a smooth 8-bit grayscale.)

It is important to understand why you want to downscale (because of output medium or because of file size?). As SF pointed out, colors/grayscale are somewhat interchangeable with resolution. If it is only about file size losless/lossy compression is also worth to look at..
The other thing is to understand a bit of the characteristics of your source image. For instance, if the source image is rasterized (as for newspaper images) you may get akward patterns because the dot-matrix is messed up. I have once tried to restore an old news-paper image, and I found it a lot of work. I ended up converting it to gray scale first before enhancing the image.
I suggest to experiment a bit with VIPS or Irfanview to find the best results (i.e. what is the effect of a certain resampling algorithm on your image quality). The reason for these programs (over i.e. Photoshop) is that you can experiment with GUI/command line while being aware of name/parameters of the algorithms behind it. With VIPS you can control most if not all parameters.
[Edit]
TiffDump (supplied with LibTiff binaries) is a valuable source of information. It will tell you about byte ordering etc. What I did was to start with a known image. For instance, LibTIFF.NET comes with many test images, including b&w (some with 0=black, some with 1=black). [/Edit]

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008