OCR tesseract: trained data creation issue for special type of fonts (using Jtessboxeditor) - ocr

Unable to create proper trained data for windows non-native fonts, i.e.,for catia drafting fonts
Even if some of the alpha-numerals are recognized, letters with broken characters like " i , j " etc., special symbols like Ø (Phi), ° (degree), ± (plus-minus) are not recognized properly. Its box file values are improper.
JTessboxeditor is the tool we used to train and create trained data for tesseract
Request your assistance on the same. Thanks

I also need these 3 characters - though it might be too late to answer this.
May not be of much help in all situations, but the Norwegian .traineddata file does include the Ø (Phi) character, this trained data file has helped me with this character.
The ° (degree) character may be a bit trickier, as it normally isn't recognized because it's too small, if you can see the inside of the character is clear, Tesseract might be able to decipher.
Now the most difficult, the ± (plus-minus). I haven't cracked this one yet, and this may be a very wooly approach; but I was thinking, the plus-minus is always recognized as + plus only.
I can use this to my advantage.
I could use Tesseract's engine which exposes PageSegMode.SingleChar to detect each individual character and use Tesseract's GetSegmentedRegions() to get the area of the bitmap/image where each character is - you can later reassemble all characters into a string.
Then I could run an ImageMagick to calculate/compare how similar the plus character found is to an image of either plus or plus-minus. The one with most similarity will tell you which character.
With my approach, I still have to parse the text recognised and transform it into something usable.
The Ø (Phi) character for example may be detected as lower-case, but I will want it upper-case.
Or the degree is detected as an apostrophe, but the expected result is the degree.
Another transformation is when I detect a dimension, a decimal may be incorrectly recognized with a comma, but I will want the decimal separator to be a dot (1,99 - 1.99)

Related

TCL command with spaces parsing as Â

I'm trying to send command which includes spaces,
I got the following error -
"system1 -loadChange "+10" -portAddress "10.10.X.X/1/15â€"
ecah space replaced by Â.
The full command is:
av::perform ChangeLoadForPort system1 -loadChange "+10" -portAddress "10.10.X.X/1/15”
X.X is the IP.
The sequence   (a  followed by a space) is the ISO 8859-1 interpretation of the sequence of bytes that make a non-breaking space in UTF-8. (It's also the same with ISO 8859-15 and a few other encodings.) †is another tell-tale. That you're seeing this points to two problems:
Various bits and pieces of the systems you're dealing with are disagreeing on what encodings are in use. That's Bad News, but would be just theoretically a problem if your code stuck to ASCII. (Almost all encodings have the majority of ASCII as a subset.)
Your code has unexpected non-ASCII characters in it, and that's just not going to work. You're probably being sabotaged by whatever program you're using to edit text; a programmer's editor is strongly recommended, and not a word processor.

Tesseract SetVariable tessedit_char_whitelist in another language

Tesseract setVariable whitelist works ok for english language for example i use this to recognize only digits and letters from image (excluding special characters &*^%! etc)
_ocr.SetVariable("tessedit_char_whitelist",
"0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ");
But i can't do the same thing for Thai language
_ocr.SetVariable("tessedit_char_whitelist","0123456789กขคงจฉ");
Is there a different principle? Because this does not work. Instead of all determined characters I receive only digits in output, tesseract ignores all Thai letters which I put into the whitelist.
How can I pass this variable correctly?
You might need to use the language package for Thai first... please refer the download list here https://code.google.com/p/tesseract-ocr/downloads/list
Then you need to replace "eng" with "tha" in your code to use the new language data to OCR

Pinyin tone marked symbols and MySQL

I’m creating a MySQL database storing Chinese characters with associated pīnyīn pronunciations. I’ve set up everything to work in UTF-8 charset, so I’m having no troubles with most of the symbols I’m using. Except, strangely, some of certain latin characters with tone marks, and only when I write them into the database from $_POST, using PHP.
Those are: all characters with an acute accent (á, é, í, ó, ú), except ǘ (?!); and all characters with a grave accent (à, è ì ò ù), again, except ǜ. When they are typed into a form, and that form is submitted to the db, those characters are just cut off, like they never existed. E.g., cháng submits like chng. Any other characters (with a caron, like ǎ, or a macron, like ā) are written in fine, and so are actual Chinese characters.
Again, I’m using UTF-8 everywhere possible, and this sort of problem so far has been only experienced upon submitting data from a form. Before, I ran a script to manually insert an array, containing those characters, to the database, and everything went fine.
Any ideas?
I think you may post pinyin in a numbered format.
e.g. cháng as cha2ng
And dealing with the post information in php script by some mapping methods.
Here's a method to deal with it.
Convert numbered to accentuated Pinyin?
Hopefully, it helps you.
I got a solution!
Before:
SELECT 'liàng' = 'liǎng';
Change to:
SELECT CONVERT('liàng' USING BINARY)= CONVERT('liǎng' USING BINARY) as equal;

What is the difference between plaintext and binary data?

Many languages have functions which only process "plaintext", not binary. Does this mean that only characters within the ASCII range will be allowed?
Binary is just a series of bytes, isn't it similar to plaintext which is just a series of bytes interpreted as characters? So, can plaintext store the same data formats / protocols as binary?
a plain text is human readable, a binary file is usually unreadable by a human, since it's composed of printable and non-printable characters.
Try to open a jpeg file with a text editor (e.g. notepad or vim) and you'll understand what I mean.
A binary file is usually constructed in a way that optimizes speed, since no parsing is needed.
A plain text file is editable by hand, a binary file not.
"Plaintext" can have several meanings.
The one most useful in this context is that it is merely a binary files which is organized in byte sequences that a particular computers system can translate into a finite set of what it considers "text" characters.
A second meaning, somewhat connected, is a restriction that said system should display these "text characters" as symbols readable by a human as members of a recognizable alphabet. Often, the unwritten implication is that the translation mechanism is ASCII.
A third, even more restrictive meaning, is that this system must be a "simple" text editor/viewer. Usually implying ASCII encoding. But, really, there is VERY little difference between you, the human, reading text encoded in some funky format and displayed by a proprietary program, vs. VI text editor reading ASCII encoded file.
Within programming context, your programming environment (comprized by OS + system APIs + your language capabilities) defines both a set of "text" characters, and a set of encodings it is able to read to convert to these "text" characters. Please note that this may not necessarily imply ASCII, English, or 8 bits - as an example, Perl can natively read and use the full Unicode set of "characters".
To answer your specific question, you can definitely use "character" strings to transmit arbitrary byte sequences, with the caveat that string termination conventions must apply.
The problem is that the functions that already exist to "process character data" would probably not have any useful functionality to deal with your binary data.
One thing it often means is that the language might feel free to interpret certian control characters, such as the values 10 or 13, as logical line terminators. In other words, an output operation might automagicly append these characters at the end, and an input operation might strip them from the input (and/or terminate reading there).
In contrast, language I/O operations that advertise working on "binary" data will usually include an input parameter for the length of data to operate on, since there is no other way (short of reading past end of file) to know when it is done.
Generally, it depends on the language/environment/functionality.
Binary data is always that: binary. It is transferred without modification.
"Plain text" mode may mean one or more of the following things:
the stream of bytes is split into lines. The line delimiters are \r, \n, or \r\n, or \n\r. Sometimes it is OS-dependent (like *nix likes \n, while windows likes \r\n). The line ending may be adjusted for the reading application
character encoding may be adjusted. The environment might detect and/or convert the source encoding into the encoding the application expects
probably some other conversions should be added to this list, but I can't think of any more at this moment
Technically nothing. Plain text is a form of binary data. However a major difference is how values are stored. Think of how an integer might be stored. In binary data it would use a two's complement format, probably taking 32 bits of space. In text format a number would be stored instead as a series of unicode digits. So the number 50 would be stored as 0x32 (padded to take up 32 bits) in binary but would be stored as '5' '0' in plain text.

Implementing run-length encoding

I've written a program to perform run length encoding.
In typical scenario if the text is
AAAAAABBCDEEEEGGHJ
run length encoding will make it
A6B2C1D1E4G2H1J1
but it was adding extra 1 for each non repeating character. Since i'm compressing BMP files with it, i went with an idea of placing a marker "$" to signify the occurance of a repeating character, (assuming that image files have huge amount of repeating text).
So it'd look like
$A6$B2CD$E4$G2HJ
For the current example it's length is the same, but there's a noticable difference for BMP files. Now my problem is in decoding. It so happens some BMP Files have the pattern $<char><num> i.e. $I9 in the original file, so in the compressed file also i'd contain the same text. $I9, however upon decoding it'd treat it as a repeating I which repeats 9 times! So it produces wrong output. What i want to know is which symbol can i use to mark the start of a repeating character (run) so that it doesn't conflict with the original source.
Why don't you encode each $ in the original file as $$ in the compressed file?
And/or use some other character instead of $ - one that is not used much in bmp files.
Also note that the BMP format has RLE compression 'built-in' - look here, near the bottom of the page - under "Image Data and Compression".
I don't know what you're using your program for, or if it's just for learning, but if you used the "official" bmp method, your compressed images wouldn't need decompression before viewing.
AAAAAABBCDEEEEGGHJ$IIIIIIIII ==> $A6$B2CD$E4$G2HJ$$I9
If the repeat character occurs in the data, try inserting an extra repeat character in the encoded data. Then if the decoder sees a double repeat character it can insert the actual repeat character
$A6$B2CD$E4$G2HJ$$I9 ==> AAAAAABBCDEEEEGGHJ$IIIIIIIII
What most programs do to signify that some character needs to be treated literally is that they have a defined escape sequence.
For example, in regular expressions, the following are specially defined characters that usually have a meaning:
^[].*+{}()$
Yes, your fun dollar sign character is in there, and it usually means end of line.
So what a programmer using regular expressions has to do to have these characters interpreted literally is that they need to express those characters as an escape sequence. For example, to interpret $ as $, and not end of line, the programmer uses \$, which is the escape sequence.(1)
In your case, you can store literal dollar signs into your compressed file as \$.(2)
NB: grep inverts this logic.
The above solutions to store $ as $$ becomes confusing when you have runs of $ in the BMP file.
If you have the luxury of being able to scan the entire input before starting to compress it, you could choose the least frequent value in the input as your escape value.
For example, given this input:
AAAABBCCCCDDEEEEEEEFFG
You could choose "G" as your escape value (or even "H" if it's part of your symbol set) and adopt a convention whereby the first character of the encoded stream is the escape value. So the string above might encode to:
GGA4BBGC4DDGE7FFGG
or even better:
HHA4BBHC4DDHE7FFG
Please note that there's no point in encoding a "run" of two identical values because the "compressed" version (e.g. HD2) is longer than the uncompressed version (DD).
Hope that helps!
If I understand correctly, the problem is that $ is both a symbol for marking a repeat, and also can be a 'BMP' value as well?
If so, what you could do is to mark a double $ ('$$') character to denote that the '$' character should be treated not as a repeat, but as a single '$'. This would of course mean that the '$' is expensive to encode (takes two symbols instead of 1), but would solve your problem.
If you wanted to have a run of the '$' character, you would need to encode it as:
$$$5 - meaning '$' run of '$$'=$, '5' - 5 times.
I'm honestly not sure what would possessed someone to use a text-based RLE if they want to compress binary data with it. A BMP is not text.
Right now, since only a single byte is read after the $, and it is interpreted as ascii number from 0 to 9, this process has a run length range of 0 to 9, meaning you can only compress values up to 9 repetitions before a new run-length flag needs to be written. After all, you can't make the difference between $I34 for a run-length of 34, and $I3 + 4 for a literal 4 behind the repeat of 3.
If this same byte is instead interpreted as binary value, it can contain values from 0 to 255, giving a massive difference in efficiency.
As for the escaping of $ signs themselves, I'd advice either always treating it as repeat of at least 1 ($$1), or, better yet, encoding the entire thing differently, with the order of the run length values and the data swapped, so a code becomes $<length><data>; then you can use $0 as special symbol to mean 'just $'. When decompressing and encountering the 0 after a $, simply don't read on for a third byte. A run length of 0 should never appear in the compressed data anyway, so it can be given a special meaning, but this is useless if the data byte is put first, since then it'd still be the same length as a normal repeat.