Why is Tesseract OCR using Otsu binarization?

Why does the Tesseract OCR engine use a global thresholding technique such as Otsu binarization? Aren't local thresholding techniques (e.g. Sauvola, Niblack, etc.) more effective at separating text from the background in images?

Tesseract was used in the Google Books project, and AFAIK they ran tests for the best binarization; Otsu was the most universal. If Otsu is not best for your case, you can apply another binarization algorithm before sending the image to Tesseract.

Basically, which thresholding algorithm to use depends on the input image. Tesseract uses the Otsu method for thresholding because the images it is typically given for text extraction have fairly homogeneous backgrounds, and the Otsu method is efficient as well as good enough for such images.
A global thresholding method is good enough when the background shows no local variation relative to the foreground (target) intensity. Local thresholding becomes necessary when the intensity difference between background and target varies across the image.
So, while Tesseract does use the Otsu method (global thresholding) for binarization, you can pre-process the image with local thresholding methods to get better output from Tesseract.
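For example, here is a minimal sketch of doing the local thresholding yourself before handing the image to Tesseract, using scikit-image's Sauvola filter and pytesseract; the filename and window size are placeholders:

    # A hedged sketch: local (Sauvola) binarization before OCR.
    # Assumes pytesseract and scikit-image are installed; "page.png" is hypothetical.
    import numpy as np
    import pytesseract
    from PIL import Image
    from skimage.filters import threshold_sauvola

    gray = np.array(Image.open("page.png").convert("L"))

    # Sauvola computes a per-pixel threshold from a local window,
    # so uneven lighting does not wash out the text.
    thresh = threshold_sauvola(gray, window_size=25)
    binary = (gray > thresh).astype(np.uint8) * 255  # white background, black text

    text = pytesseract.image_to_string(Image.fromarray(binary))
    print(text)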

Related

How can you bypass Tesseract OCR's internal image pre-processing?

In many cases, Tesseract OCR's internal image pre-processing degrades the quality of the image. I want to control the image processing before feeding the image to Tesseract OCR and disable the OCR's internal pre-processing steps.
I am not sure what you mean by "degrading the quality of the image", but Tesseract runs OCR on a binarized image, so if you provide an already-binarized image as input, you can skip Tesseract's internal function for this. AFAIK there is no other preprocessing in Tesseract (the user has to do preprocessing).
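A minimal sketch of that idea: binarize the image yourself, then hand Tesseract an image that is already black-and-white, so its internal thresholding has nothing left to change. OpenCV, pytesseract, and the filename are assumptions here:

    import cv2
    import pytesseract
    from PIL import Image

    gray = cv2.imread("scan.png", cv2.IMREAD_GRAYSCALE)

    # Otsu picks the global threshold automatically; swap in any method you prefer.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    text = pytesseract.image_to_string(Image.fromarray(binary))
    print(text)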

Mask R-CNN annotation tool

I’m new to deep learning, and while reading some state-of-the-art papers I found that Mask R-CNN is widely used for segmentation and classification of images. I would like to apply it to my MSc project, but I have some questions that you may be able to answer. I apologize if this isn’t the right place to ask.
First, I would like to know what the best strategies are for getting the annotations. It seems labor-intensive and I’m not sure whether there is any easy way. Following that, I want to know whether you know of any annotation tool for Mask R-CNN that generates the binary masks drawn manually by the user.
I hope this can turn into a productive and informative thread, so any suggestion or experience would be highly appreciated.
Regards
You can use Mask R-CNN; I recommend it. It is a two-stage framework: the first stage scans the image and generates regions likely to contain an object, and the second stage classifies the proposals and draws bounding boxes (and masks).
But there are two big questions: how do you train a model from scratch? And what happens when you want to train on your own dataset?
You can use annotations downloaded from the internet, or you can start creating your own annotations; this takes a lot of time!
You have tools like:
VIA (VGG Image Annotator)
http://www.robots.ox.ac.uk/~vgg/software/via/via_demo.html
It's online and you don't have to download any program. It is the one I recommend: it saves the annotations in a .json file, so you can reuse the balloons class that comes by default in the SAMPLES of the Mask R-CNN framework; you would only have to supply your JSON file and your images to train on your dataset (see the sketch after this list).
But there are always more options: you have LabelImg, which is also used for annotation and is very well known, but it saves the files as XML, so you will have to make a few changes to your dataset class in Python. You also have Labelme, Labelbox, etc.
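Here is a hedged sketch of turning VIA polygon annotations into binary masks, modeled on the balloon sample in the Mask R-CNN repository. The JSON layout varies between VIA versions, so check your own export; the filename, image size, and keys used here reflect the format the balloon sample expects:

    import json
    import numpy as np
    from skimage.draw import polygon

    annotations = json.load(open("via_annotations.json"))  # hypothetical filename

    for ann in annotations.values():
        regions = ann["regions"]
        if isinstance(regions, dict):      # older VIA exports use a dict
            regions = list(regions.values())
        height, width = 768, 1024          # replace with the real image size
        for region in regions:
            shape = region["shape_attributes"]
            # One binary mask per annotated polygon, as Mask R-CNN expects.
            mask = np.zeros((height, width), dtype=np.uint8)
            rr, cc = polygon(shape["all_points_y"], shape["all_points_x"], mask.shape)
            mask[rr, cc] = 1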

Tesseract training for a new font

I'm still new to Tesseract OCR, and after using it in my script I noticed it had a relatively high error rate for the images I was trying to extract text from. I came across Tesseract training, which supposedly can decrease the error rate for a specific font you use. I found a website (http://ocr7.com/), a tool powered by Anyline, that does all the training for a font you specify. So I received a .traineddata file and I am not quite sure what to do with it. Could anybody explain what I have to do with this file for it to work? Or should I just learn how to do Tesseract training the manual way, which according to the Anyline website may take a day's work? Thanks in advance.
For anyone still reading this: you can use that tool to get a traineddata file for whichever font you want. After that, move the traineddata file into your tessdata folder. To use Tesseract with the new font in Python (or, I think, any other language), pass lang="Font" as the second parameter of the image_to_string function, as in the sketch below. It improves accuracy significantly but can still make mistakes, of course. Or you can just learn how to train Tesseract for a new font manually with this guide: http://pretius.com/how-to-prepare-training-files-for-tesseract-ocr-and-improve-characters-recognition/.
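A minimal sketch in Python with pytesseract, assuming the file was saved as myfont.traineddata inside your tessdata folder; the language name and image path are placeholders:

    import pytesseract
    from PIL import Image

    # "myfont" must match the traineddata filename (without extension) in tessdata.
    text = pytesseract.image_to_string(Image.open("sample.png"), lang="myfont")
    print(text)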
Edit:
Tesseract 5 training tutorial: https://www.youtube.com/watch?v=KE4xEzFGSU8
I made a video tutorial explaining the process for the latest version of Tesseract (the LSTM model); I hope it helps. https://www.youtube.com/watch?v=TpD76k2HYms
If you want to train Tesseract with a new font, generate a .traineddata file for your desired font. For generating the .traineddata, you will first need a .tiff file and a .box file. You can create these files using jTessBoxEditor. A tutorial for jTessBoxEditor is here. While making the .tiff file you can set the font in which you want to train Tesseract. For generating the .traineddata you can use either jTessBoxEditor or the Serak Tesseract Trainer. I have used both, and I would say that for generating the tiff and box files jTessBoxEditor is great, and for training Tesseract use Serak.

In Keras (deep learning library), is such a custom embedding layer possible?

I recently moved from Theano/Lasagne to Keras.
In Theano, I used a custom embedding layer like the one described in:
How to keep the weight value to zero in a particular location using theano or lasagne?
It was useful when dealing with variable-length input by adding padding.
Is such a custom embedding layer possible in Keras? If so, how can I make it? And might such an embedding layer be the wrong approach?
This may not be exactly what you want, but the solution I personally use, as in the Keras examples (e.g. this one), is to pad the data to a constant length before feeding it to the network.
Keras itself provides a pre-processing tool for this in keras.preprocessing.sequence.pad_sequences(sequences, maxlen).
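A minimal sketch of that approach, with made-up vocabulary size and dimensions; mask_zero=True makes downstream layers skip the padded positions, which is roughly the effect the zero-pinned embedding row achieved in Theano:

    from keras.preprocessing.sequence import pad_sequences
    from keras.models import Sequential
    from keras.layers import Embedding, LSTM, Dense

    seqs = [[5, 12, 7], [9, 3], [4, 8, 2, 6]]   # variable-length token ids (1-based)
    x = pad_sequences(seqs, maxlen=4)           # pads with 0 at the front by default

    model = Sequential([
        Embedding(input_dim=1000, output_dim=32, mask_zero=True),  # 0 = padding
        LSTM(16),
        Dense(1, activation="sigmoid"),
    ])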

Can I write a program in binary directly? How can I get the computer to execute it?

I know this may seem weird, like looking for trouble, but I think experiencing what the early programmers experienced would be interesting. So how can I execute a program written only in binary? (Suppose that I know what I am doing and am not using assembly, of course.)
I just want to write a series of bits like 111010111010101010101 and execute that. So how can I do that?
Use a hex editor. You'll need to find out the relevant executable format for your operating system, of course - assuming you want to use an operating system... I suppose you could always write your own bootloader and just run the code directly that way, if you want to get all hardcore.
I don't think you'll really be experiencing what programmers experienced back then though - for one thing, you won't be using punch cards, paper tape etc. For another, your context is completely different - you know what computers are like now, so it'll feel painfully primitive to you... whereas back then, it would have been bleeding edge and exciting just on those grounds.
Use a hex editor, write your bits and save it as an executable file (either with the file extension .exe on Windows, or with chmod a+x filename on Linux).
The problem is: you'd also have to write all the OS-specific stuff (such as the executable file header) in binary as well, and you'd need a table that translates from assembler mnemonics to their binary encodings.
Why not, if you want to experience low-level programming, give D. E. Knuth's assembler MMIX a try?
It really depends on the platform you are using. But that's sort of irrelevant based on your proposed purpose. The earliest programmers of modern computers as you think of them did not program in binary -- they programmed in assembly.
You will learn nothing trying to program in binary for a specific Operating System and specific CPU type using a hex editor.
If you want to find out how pre-assembly programmers worked (with plain binary data), look up Punch Cards.
Use a hex editor to create your file, be sure to use a format that the loader of your respective OS understands and then double click it.
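If you just want to feel the bytes run without building a full executable, here is a hedged sketch for Linux x86-64 in Python: it maps a page as executable, copies hand-written machine code into it, and calls it. The six bytes encode mov eax, 42 followed by ret; hardened systems may refuse writable+executable mappings.

    import ctypes
    import mmap

    # Hand-assembled x86-64: B8 2A 00 00 00 = mov eax, 42 ; C3 = ret
    code = bytes([0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3])

    buf = mmap.mmap(-1, len(code),
                    prot=mmap.PROT_READ | mmap.PROT_WRITE | mmap.PROT_EXEC)
    buf.write(code)

    # Treat the mapped page as a C function returning int, and call it.
    func_type = ctypes.CFUNCTYPE(ctypes.c_int)
    addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))
    print(func_type(addr)())   # prints 42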
Most assemblers (the MMIX assembler, for instance; see www.mmix.cs.hm.edu) don't care whether you write instructions or data. So instead of writing

    Main ADD $0,$0,3
         SUB $1,$0,4
    ...

you can write

    Main TETRA #21000003
         TETRA #25010004
    ...

This way you can assemble your program by hand and then have the assembler transform it into a form the loader needs. Then you execute it. Normally you use hex notation, not binary, because keeping track of so many digits is difficult. You can also use decimal, but the charts that tell you which instructions have which codes are typically in hex notation.
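To make the hand-assembly step concrete, here is a hedged Python sketch of how such a TETRA word is packed: an MMIX instruction is four bytes, the opcode and three operand fields, laid out big-endian. The opcode value 0x21 is taken from the ADD example above.

    import struct

    def tetra(opcode, x, y, z):
        # One byte each: opcode, X, Y, Z, big-endian.
        return struct.pack(">BBBB", opcode, x, y, z)

    word = tetra(0x21, 0, 0, 3)        # ADD $0,$0,3
    print("#" + word.hex().upper())    # -> #21000003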
Good luck! I had to do things like this when I started programming computers. Everybody was glad to have an assembler or even a compiler then.
Martin
Or he is just writing some malicious code.
I've seen some funny methods that use an AVR as a keyboard emulator: open some simple text editor, type out the code stored in the AVR's EEPROM memory, pipe it to "debug" (on Windows systems), and run it. It's a good way to escape some restrictions too ;)
I imagine that by interacting directly with hardware you could write in binary. To flip the proper binary bits, you could use a magnetized needle on your disk drive. Or butterflies.