How to OCR a combination of numbers and letters? - ocr

I know that the command tesseract image.tif outputbase nobatch digits recognizes digits, and that tesseract image.tif outputbase returns letters. But if I have an image containing a combination of letters and digits, how do I recognize it?
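One possible approach is to restrict recognition to letters and digits together with the tessedit_char_whitelist config variable. The following is a minimal Python sketch, assuming the tesseract binary is installed and on the PATH. build_tesseract_command and recognize are hypothetical helper names, and whether the whitelist is honored can depend on the Tesseract version and engine mode.

```python
import string
import subprocess

# Characters allowed in the output (assumption: ASCII letters and digits only).
ALPHANUMERIC = string.ascii_letters + string.digits


def build_tesseract_command(image_path, whitelist=ALPHANUMERIC):
    """Build the argument list for a tesseract run limited to `whitelist`."""
    return [
        "tesseract",
        image_path,
        "stdout",
        "-c",
        f"tessedit_char_whitelist={whitelist}",
    ]


def recognize(image_path):
    """Run tesseract (must be installed) and return the recognized text."""
    result = subprocess.run(
        build_tesseract_command(image_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

Calling recognize("image.tif") would then return whatever text Tesseract finds, limited to the whitelisted characters.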

Related

How to process my images to help Tesseract?

I have some images containing only digits, and a semicolon.
Example:
You can see more here: https://imgur.com/a/54dsl6h
They seem pretty clean and straightforward to me, but Tesseract considers them as empty "pages" (Empty page!!).
I tried both with oem 1 and oem 0 with a character list:
tesseract processed/35.0.png stdout -c tessedit_char_whitelist=0123456789: --oem 0
tesseract processed/35.0.png stdout
What can I do to get Tesseract to recognize the characters better?
Tesseract still gives me pretty bad results overall, but making the text bolder with a simple dilation algorithm helped a bit.
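For illustration, a simple dilation on a binary image can be sketched in plain Python as below. This is not the exact algorithm used here; it assumes the image is a list of rows of 0/1 values (1 = ink) and grows every ink pixel by one pixel in all eight directions.

```python
def dilate(image):
    """Return a copy of a binary image with each ink pixel grown by one
    pixel in all eight directions (a 3x3 structuring element)."""
    height = len(image)
    width = len(image[0]) if height else 0
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if not image[y][x]:
                continue
            # Mark the pixel itself and every in-bounds neighbour.
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < height and 0 <= nx < width:
                        out[ny][nx] = 1
    return out


# A one-pixel-wide vertical stroke becomes three pixels wide.
thin_stroke = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
for row in dilate(thin_stroke):
    print("".join("#" if pixel else "." for pixel in row))
```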
In the end, since the font is really square, I used a trick: I defined a set of segments for each digit, and depending on which segments intersect or don't intersect the digit, I can determine which digit it is with 99% accuracy.
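The actual segments are not shown, so the following is only a rough sketch of the idea under assumed conditions: each glyph is already cropped and scaled to a 7x5 grid of "#" (ink) and "." (background), and the digits follow a seven-segment layout. The probe positions and lookup table are hypothetical.

```python
# One probe position (row, col) per segment on a 7-row by 5-column glyph.
PROBES = {
    "a": (0, 2),  # top bar
    "b": (1, 4),  # upper right
    "c": (5, 4),  # lower right
    "d": (6, 2),  # bottom bar
    "e": (5, 0),  # lower left
    "f": (1, 0),  # upper left
    "g": (3, 2),  # middle bar
}

# Which set of intersected segments corresponds to which digit.
SEGMENTS_TO_DIGIT = {
    frozenset("abcdef"): "0",
    frozenset("bc"): "1",
    frozenset("abdeg"): "2",
    frozenset("abcdg"): "3",
    frozenset("bcfg"): "4",
    frozenset("acdfg"): "5",
    frozenset("acdefg"): "6",
    frozenset("abc"): "7",
    frozenset("abcdefg"): "8",
    frozenset("abcdfg"): "9",
}


def classify(glyph):
    """Return the digit whose segments intersect the glyph, or None."""
    lit = frozenset(
        name for name, (row, col) in PROBES.items() if glyph[row][col] == "#"
    )
    return SEGMENTS_TO_DIGIT.get(lit)


TWO = [
    "#####",
    "....#",
    "....#",
    "#####",
    "#....",
    "#....",
    "#####",
]
print(classify(TWO))  # 2
```

A real font would likely need probes placed by hand for that font, and possibly short line segments rather than single points to tolerate noise.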

How to configure Octave terminal to output numbers without scientific notation?

I am trying to print a long table of numbers in the Octave terminal.
disp(vec);
What I get
7.0931e-01
6.2041e-05
9.7740e-01
9.9989e-01
8.8428e-01
9.0524e-01
...
Such numerical notation is a pain to read. How can I set the Octave terminal to output numbers normally, as 0.7, 0.014, 0.95?
You can use format short g to display each number in a more readable format
format short g
disp(vec)
% 0.70931
% 6.2041e-05
% 0.9774
% 0.99989
% 0.88428
% 0.90524
Using 'fprintf' could help in such cases
a=0.0001234;
fprintf('%.3f\n',a)
The limitation here is that the number of decimal places is fixed, so some numbers will be padded with trailing zeros while others will be cut off.

How to convert alphabet to binary?

How to convert alphabet to binary? I searched on Google and it says to first convert the letter to its ASCII numeric value and then convert the numeric value to binary. Is there any other way to convert?
And if that's the only way, then are the binary values of "A" and 65 the same?
Because the ASCII value of 'A' = 65, and when converted to binary it is 01000001,
and 65 = 01000001.
That is indeed the way in which text is converted to binary.
And to answer your second question, yes, it is true that the binary values of A and 65 are the same. If you are wondering how the CPU distinguishes between "A" and "65" in that case, you should know that it doesn't. It is up to your operating system and program to distinguish how to treat the data at hand. For instance, say your memory looked like the following, starting at 0 on the left and incrementing right:
00000001 00001111 00000001 01100110
This binary data could mean anything, and only has a meaning in the context of whatever program it is in. In a given program, you could have it be read as:
1. An integer, in which case you'll get one number.
2. Character data, in which case you'll output 4 ASCII characters.
In short, binary is read by CPUs, which do not understand the context of anything and simply execute whatever they are given. It is up to your program/OS to specify instructions in order for data to be handled properly.
Thus, converting the alphabet to binary is dependent on the program in which you are doing so, and outside the context of a program/OS converting the alphabet to binary is really the exact same thing as converting a sequence of numbers to binary, as far as a CPU is concerned.
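To make the two readings concrete, here is a small Python sketch that takes the four bytes from the example as 00000001 00001111 00000001 01100110 and interprets them both ways. The big-endian byte order is an assumption; a program could just as well choose little-endian.

```python
# The same four bytes of memory, written out bit by bit.
data = bytes([0b00000001, 0b00001111, 0b00000001, 0b01100110])

# Interpretation 1: one 32-bit unsigned integer (big-endian order assumed).
as_integer = int.from_bytes(data, byteorder="big")

# Interpretation 2: four separate ASCII character codes.
as_characters = [chr(byte) for byte in data]

print(as_integer)     # 17760614
print(as_characters)  # ['\x01', '\x0f', '\x01', 'f']
```

The first three bytes are non-printable control characters in ASCII, and only the last one (102) is a visible letter, f.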
The number 65 in decimal is 0100 0001 in binary, and it refers to the letter A in the binary alphabet table (ASCII): https://www.bin-dec-hex.com/binary-alphabet-the-alphabet-letters-in-binary. The easiest way to convert the alphabet to binary is to use an online converter, or you can do it manually with a binary alphabet table.
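The conversion can also be done in a few lines of code. A minimal Python sketch, assuming plain ASCII text and 8 bits per character (text_to_binary and binary_to_text are illustrative names):

```python
def text_to_binary(text):
    """Convert each character to its code point, then to 8 binary digits."""
    return " ".join(format(ord(char), "08b") for char in text)


def binary_to_text(bits):
    """Reverse the conversion: parse each group of bits as a character code."""
    return "".join(chr(int(group, 2)) for group in bits.split())


print(text_to_binary("A"))                  # 01000001
print(format(65, "08b"))                    # 01000001
print(binary_to_text("01001000 01101001"))  # Hi
```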

Tesseract with limited words

Is it possible to recognize a limited set of words in Tesseract?
I need to recognize a set of words (around 200) and want Tesseract to correct some words to the closest matching ones. To do that, I've updated the language models with my words (eng.word-dawg and eng.freq-dawg) and increased the sensitivity by setting language_model_penalty_non_freq_dict_word and language_model_penalty_non_dict_word to large numbers (tried 0.9 and 1.0). However, this does not have any effect on the output.
I have a word (BENZOATE) which Tesseract always recognizes as UENZOATE. This is weird, as I have BENZOATE in my dictionary.
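If the dictionary settings have no effect, one workaround outside Tesseract is to snap each recognized word to the closest entry of the word list after OCR. The sketch below uses Python's difflib; the three-word VOCABULARY and the 0.8 cutoff are placeholders, and snap_to_vocabulary is a hypothetical helper name.

```python
import difflib

# Placeholder vocabulary; the real list would hold the ~200 allowed words.
VOCABULARY = ["BENZOATE", "SODIUM", "CITRATE"]


def snap_to_vocabulary(word, vocabulary=VOCABULARY, cutoff=0.8):
    """Return the closest vocabulary word, or the input if nothing is close."""
    matches = difflib.get_close_matches(
        word.upper(), vocabulary, n=1, cutoff=cutoff
    )
    return matches[0] if matches else word


print(snap_to_vocabulary("UENZOATE"))  # BENZOATE
```

The cutoff controls how aggressive the correction is; a lower value corrects more words but risks replacing words that were recognized correctly and are simply not in the list.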

How to define digits or alphabet for each character with Tesseract recognition?

For example: A1234567
I want to define the first character to be alphabet only, and the rest are digits.
How to do this with tesseract?
It looks like what you want would be equivalent to using a whitelist of words from 'A-Z0000000' to 'A-Z9999999'. Unfortunately, it appears that Tesseract doesn't support a whitelist of words, at least according to this question.
This is what I would do if I were you: run Tesseract with letters and digits, and discard the words that do not begin with a letter or that contain any non-digit characters after the first letter.
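That filtering step could look like the following Python sketch. It assumes the OCR output is whitespace-separated and that valid codes are exactly one ASCII letter followed by seven digits, as in A1234567; keep_valid_codes is an illustrative name.

```python
import re

# One letter followed by exactly seven digits, e.g. A1234567.
CODE_PATTERN = re.compile(r"[A-Za-z][0-9]{7}")


def keep_valid_codes(ocr_text):
    """Return only the words that match the letter-plus-digits format."""
    return [word for word in ocr_text.split() if CODE_PATTERN.fullmatch(word)]


print(keep_valid_codes("A1234567 12345678 B12C4567 Z7654321"))
# ['A1234567', 'Z7654321']
```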
Try pattern matching with the bazaar config, using a user pattern such as:
\c\d\d\d\d\d\d\d