How to process my images to help Tesseract? - ocr

I have some images containing only digits, and a semicolon.
Example:
You can see more here: https://imgur.com/a/54dsl6h
They seem pretty clean and straightforward to me, but Tesseract considers them as empty "pages" (Empty page!!).
I tried both with oem 1 and oem 0 with a character list:
tesseract processed/35.0.png stdout -c tessedit_char_whitelist=0123456789: --oem 0
tesseract processed/35.0.png stdout
What can I do to get Tesseract to recognize the characters better?

Tesseract still gives me pretty bad results overall, but making the text bolder with a simple dilatation algorithm helped a bit.
In the end, since the font is really square, I used a trick, where I defined a bunch of segments for each digits, and depending on which segments intersect, or dont intersect with the digit, I can determine with 99% accuracy which digit it is.

Related

Best method to train Tesseract 3.02

i'm wondering what is the best method to train Tesseract (kind of text/TIFF and so on) for a particular kind of documents, with these particularities:
the structure and main text of the documents is always the same
the only things that change are 5 alphanumeric codes (THIS ARE THE REAL IMPORTANT THING TO DETECT!)
Some of thes codes are bold
At the moment I used standard trained datas, I detect the entire text and I extrapolate the codes with some regular expressions.
It's okay, but I've got errors sometimes, for example:
0 / O
L / I / 1
Please someone knowns some "tricks" to improve precision?
Thanks!
during training part of Tesseract, you have to make a file manually to give to the engine in order to specify ambiguous characters.
For more information look at the "unicharambigs" part of the Tesseract documentation.
Best Regards.

Tesseract with limited words

Is it possible to recognize a limited set of words in Tesseract?
I need to recognize a set of words (around 200) and want tesseract correct some words to the closest matching ones. In order to do that, I've updated the language models with my words (eng.word-dawg and eng.freq-dawg) and increased the sensitivity by setting language_model_penalty_non_freq_dict_word and language_model_penalty_non_dict_word to large numbers (tried 0.9 and 1.0). However, this does not have any affect on the output.
I have a word (BENZOATE) which tesseract always recognize as UENZOATE. This is weird as I have BENZOATE in my dictionary.

Huffman Coding: handling negative ambiguity with zero

I've written a simple text file compressor that uses Huffman coding. I encode the text and write the binary resulting from Huffman to a file. To decode, I read in the binary and step through the Huffman tree.
That part is straightforward. The problem arises with 0 and negative numbers. For practice/fun/learning, I decided to do my own binary conversion methods (from a Java byte to a string and vice-versa) and I decided to represent negative numbers by flipping the last bit to a 1.
E.g, -2 = 00000101;; 2 = 00000100 (the extra 0's for padding since even the unnecessary 0's are important in Huffman... it's irrelevant, though)
However, 0 = 00000000 = 00000001
This may not seem like a problem, but those two binary strings map to two different characters in the huffman tree.
Is there a better way handle negatives in binary that will get around this?
I'm not sure this will help you, but i will try:
First of all, there is different kind of binary, pure or the others. Binary pure DON'T allow negatives, it goes from 0.......
You can use magnitude and sign, another kind of binnary, it allows negative numbers, and the - or + sign is represented with the most important bit of the number, for example:
A number with 4 bits:
0100=2
1100=-2
(1 bit for the sign, the most important, the first left one, and the other 3 for the number)
You can use too the Two's complement, but it's harder and you need to get the number in binary and then translate it to the other type.
I hope i could help you, and sorry for the lot of mistakes in english!

Tools, approaches for analysing proprietary data format?

I need to analyse a binary data file containing raw data from a scientific instrument. A quick look in a hex viewer indicates that's probably no encryption or anything fancy: integers will probably be written as integers (but I don't know what byte order), and who knows about floating point.
I have access to a (closed source) program that can view the contents of the file. So I can see that a certain value is 74078. Actually searching for that value I'm not sure about - do I search for 00 01 21 5E, some other byte order, etc? (Hex Fiend doesn't support searching for decimal values) And how would I find a floating point number?
The software that produces these files runs on XP. I'd prefer tools that run on OSX if possible.
(Hmm, I wrote up this question, forgot to post it, then solved the problem. I guess I will write my own answer.)
In the end, Hex Fiend turned out to be just enough. What I was expecting to do:
Convert a known value into hex
Search for it
What I actually did:
Pick a random chunk of hex that looked like it might be a useful value
Tell Hex Fiend to display it as integer, or as float, in either little endian or big endian, until it gave a plausible looking result (ie, 45.000 is a lot more plausible than some huge integer)
Search for that result in the results I had from the closed source program.
Document it, go back to step 1. (Except that normally the next chunk wouldn't be 'random', but would follow sequentially.)
In this case there were really only three (binary) variables for how to interpret data:
float or integer
2 bytes or 4 bytes
little or big endian
With more variables the task would be a lot harder. It would have been nice if Hex Fiend could search for integers/floats directly, perhaps trying out the different combinations. Perhaps other hex viewers do.
And to answer one of my original questions, 74078 turned out to be stored as 5E2101. A bit more trial and error and I would have got there. :)
UPDATE
If I was doing this over, I'd use "Synalyze It!", a tool designed for exactly this purpose.

Tesseract confuses two numbers

I'm writing an application to scan numbers from an image.
The numbers are using the OCR-B font and may also contain + and > characters.
This is my source image:
The scans using Tesseract weren't very good, even when limiting the character set to the mentioned characters. As I didn't find any OCRB training files for Tesseract, I decided to train it myself.
I created this training image and made a box file from it. The box file is correct, all letters are matched correctly.
Then I did all steps described here to create the other necessary files.
Using this newly trained OCR-B tessdata-set, I get pretty good results on the source image, with one little bug: All 1s are mistaken for 8s and vice-versa. The command used to process the image was
$ tesseract esr2c.tif ocrb-esr2c -l ocrb
and the output for the source image was
0800000001456>8 00000195731208 8 01050008 023+ 08 0301226>20
If you swap all 1s and 8s and compare it to the source image, the output would be correct (except for the last two letters which I can ignore).
How could this happen? Did I do some mistake in the training process? How can I fix it?
It's likely that somewhere in your box file has incorrect values (characters) for 1 and 8. You can verify using jTessBoxEditor program. If so, correct, regenerate the language data file, and try again.
I have trained tesseract 2.04 after 1 month efforts for OCR A extended font. Its working very well and showing above 90 Accuracy with font size 14.
Training image should be high Contrast image.
Use "GIMP" image editor and do following
Menu Colors->Info->Histgram- Read Std Deviation value
colors-> Threshould -> Write "Std Deviation value" as Threshould value
Save image
Use it for training.
Check and edit your box file using "qt-box-editor-1.06.exe".It is very easy to use.
Check All boxes and characters in it.
It is very important. Somewhere in your box file has incorrect characters for 1 and 8.
Run other cmds.