Is it possible for Tesseract to show a recognition percent for a character? - ocr

I am using Tesseract for recognizing custom symbols (more like pictographs, not numbers or letters). I need this for implementing a "spell-casting" mechanic in an Android game where you have to draw a symbol to cast a spell. I trained Tesseract on my symbol sheet and it recognizes the symbols just fine, but it also recognizes gibberish images as symbols. Obviously, I don't want this to happen, as it defeats the purpose of drawing a specific symbol. Does Tesseract have an option to display something like a recognition percent for a symbol?

The "recognition percent" is called the "confidence level" in Tesseract, and it can be read from the TSV output option. More detail is in this answer: https://stackoverflow.com/a/66899977/15523359
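As a rough illustration of using that confidence value to reject gibberish, the sketch below parses TSV text and keeps only entries at or above a threshold. The sample data, the function name `confident_words`, and the threshold of 60 are assumptions for illustration; the column names follow the header row that Tesseract's TSV output normally starts with, where structural rows carry a confidence of -1.

```python
import csv
import io

# Hypothetical TSV text, laid out like Tesseract's tsv output.
SAMPLE_TSV = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\t"
    "left\ttop\twidth\theight\tconf\ttext\n"
    "1\t1\t0\t0\t0\t0\t0\t0\t200\t200\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t20\t30\t150\t140\t91.5\tfire\n"
    "5\t1\t1\t1\t1\t2\t20\t30\t150\t140\t34.2\twater\n"
)


def confident_words(tsv_text, min_conf=60.0):
    """Return (text, confidence) pairs whose confidence is >= min_conf."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    results = []
    for row in reader:
        conf = float(row["conf"])
        text = (row["text"] or "").strip()
        # Rows for pages, blocks and lines have conf -1 and no text.
        if conf < 0 or not text:
            continue
        if conf >= min_conf:
            results.append((text, conf))
    return results


if __name__ == "__main__":
    for text, conf in confident_words(SAMPLE_TSV):
        print(f"{text}: {conf}")
```

In a game, a drawing whose best result falls below the threshold could simply be treated as "no spell"; the right threshold would have to be found by experiment.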

Related

Recognize Micr font using OCR Engine?

I am using Microsoft OCR Library for reading text.
The Microsoft OCR library works perfectly. However, I want to read the list of characters shown at http://www.ict4u.net/databases/database-images/micr.jpg . Is there a way to train the OCR library to read these characters, or is there a language that allows it to read them?
[Microsoft OCR crew here] We don't yet support training OCR to customize it for your use-cases. However, we do actively keep an eye on stackoverflow to see what developers need, so we can keep improving the OCR engine.
I have been working with Microsoft OCR for a while now.
Compared with Tesseract it has very basic functionality.
For example Microsoft OCR returns the words and lines.
But the lines are nonsense. Randomly 2 or 3 words are grouped together as a "line" but they are not a real line. And the "lines" are completely unordered. In this aspect it is worse than Tesseract. You have to take the coordinates of each word and order them on your own.
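A minimal sketch of that manual ordering, assuming each recognized word comes with a bounding rectangle: words whose vertical centers lie close together are put on the same line, then each line is sorted left to right. The `Word` type, `group_into_lines`, and the `tolerance` value are hypothetical names and settings for illustration, not part of the Microsoft OCR API.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical container for one recognized word and its rectangle."""
    text: str
    x: float
    y: float
    width: float
    height: float


def _center(word):
    return word.y + word.height / 2


def group_into_lines(words, tolerance=0.5):
    """Group words into text lines by vertical center, top to bottom."""
    lines = []
    for word in sorted(words, key=_center):
        if lines:
            last = lines[-1]
            ref_center = sum(_center(w) for w in last) / len(last)
            ref_height = max(w.height for w in last)
            # Same line if the centers differ by less than a fraction
            # of the line height (the fraction is an assumed setting).
            if abs(_center(word) - ref_center) <= tolerance * ref_height:
                last.append(word)
                continue
        lines.append([word])
    # Within each line, order the words left to right.
    return [
        " ".join(w.text for w in sorted(line, key=lambda w: w.x))
        for line in lines
    ]


if __name__ == "__main__":
    sample = [
        Word("world", 60, 11, 40, 10),
        Word("Second", 10, 30, 50, 10),
        Word("Hello", 10, 10, 40, 10),
        Word("line", 70, 29, 30, 10),
    ]
    for line in group_into_lines(sample):
        print(line)
```

This simple approach assumes roughly horizontal text; skewed or multi-column pages would need more than a single vertical tolerance.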
Microsoft does not return the rectangles of characters and there is absolutely no way to configure or train Microsoft OCR in any way. You can add languages with Windows Update for "Basic Typing" = OCR (see http://www.thewindowsclub.com/install-uninstall-languages-windows-10), but you cannot train your own language data.
MSDN says that the following 25 languages are supported with different accuracy:
Excellent: Czech, Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Polish, Portuguese, Romanian, Serbian Cyrillic, Serbian Latin, Slovak, Spanish and Swedish.
Very good: Chinese Simplified, Greek, Japanese, Russian and Turkish.
Good: Chinese Traditional and Korean.
The recognition quality is very similar to Tesseract. It even has exactly the same problems as Tesseract. Some single characters are not recognized (separate symbols like a single '$') and it has the same huge problem with asterisks as Tesseract. It also inserts spaces in the wrong places, as Tesseract does. So I ask myself whether Microsoft is using Tesseract under the hood.
However Microsoft OCR has an advantage over Tesseract: The image preprocessing is much better. It does not matter if you have red text on yellow background or white text on black. This is a catch for Tesseract which needs a black and white image of good quality as input.
The following applies to both OCR libraries: if you have recognition problems, try to amplify the image. Even blurring the image may be very helpful, because this removes noise from the image.

Map MS Word 2010 Office symbols to Html unicode or MathML Symbol equivalent programmatically

I want to convert MS Word 2010 Office symbols to their equivalent HTML Unicode characters or MathML symbols.
I am currently using DrawString() to get an image of the symbol, but the result is a bit blurry and looks bold.
I want to display it either as HTML Unicode or as a MathML symbol, whichever is better and possible.
Any ideas?
As per the comments, the question is about mapping the Symbol font (commonly available in Windows) to Unicode or MathML. This font uses its own encoding, and it originally lacked exact definition (e.g., some glyphs are ambiguous and could be interpreted as different characters), but now we can regard the Adobe Symbol Encoding to Unicode mapping as official. This mapping file is a plain text file in a specific format, so it can be read and parsed programmatically. It gives you the Unicode code number equivalent for each of the 256 code positions in the Symbol font. In MathML, you can use either the Unicode characters as such or as numeric references such as &#x3A9; for U+03A9 GREEK CAPITAL LETTER OMEGA (Ω).
Note that the Symbol font is legacy software with some oddities. In particular, the mapping is from Unicode to Symbol font encoding rather than vice versa. When mapping in the other direction, e.g. converting legacy data in Symbol font format to Unicode, you need to make decisions. For example, both U+03A9 and U+2126 OHM SIGN map to 0x57 in Symbol font. This is a simple case: when mapping from Symbol font data to Unicode, 0x57 should map to U+03A9, according to principles set in the Unicode standard. There are more difficult cases where two different characters are possible and the choice should depend on context, such as GREEK CAPITAL LETTER DELTA versus INCREMENT.
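The reverse mapping described above might be sketched as follows. The excerpt in `SAMPLE_MAPPING` is hypothetical and assumes a layout of Unicode hex value, tab, Symbol hex code, tab, `#` comment; check the real mapping file before relying on this. The names `parse_mapping`, `build_reverse`, and `PREFERRED` are illustrative. The only preference encoded is the one stated above (0x57 to U+03A9); codes that remain ambiguous are reported rather than guessed.

```python
from collections import defaultdict

# Hypothetical excerpt in an assumed "Unicode<TAB>Symbol<TAB># name" layout.
SAMPLE_MAPPING = """\
# Unicode\tSymbol\t# name
0394\t44\t# GREEK CAPITAL LETTER DELTA
2206\t44\t# INCREMENT
03A9\t57\t# GREEK CAPITAL LETTER OMEGA
2126\t57\t# OHM SIGN
03B1\t61\t# GREEK SMALL LETTER ALPHA
"""

# Explicit choices for ambiguous Symbol codes (0x57 -> U+03A9, per the text).
PREFERRED = {0x57: 0x03A9}


def parse_mapping(text):
    """Return a list of (unicode_code_point, symbol_code) pairs."""
    pairs = []
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop comments
        if not data:
            continue
        unicode_hex, symbol_hex = data.split()[:2]
        pairs.append((int(unicode_hex, 16), int(symbol_hex, 16)))
    return pairs


def build_reverse(pairs, preferred=PREFERRED):
    """Build Symbol -> Unicode; return (reverse_map, still_ambiguous)."""
    candidates = defaultdict(list)
    for unicode_cp, symbol_code in pairs:
        candidates[symbol_code].append(unicode_cp)
    reverse, ambiguous = {}, {}
    for symbol_code, code_points in candidates.items():
        if len(code_points) == 1:
            reverse[symbol_code] = code_points[0]
        elif preferred.get(symbol_code) in code_points:
            reverse[symbol_code] = preferred[symbol_code]
        else:
            ambiguous[symbol_code] = code_points  # needs a human decision
    return reverse, ambiguous


if __name__ == "__main__":
    reverse, ambiguous = build_reverse(parse_mapping(SAMPLE_MAPPING))
    for code, cp in sorted(reverse.items()):
        print(f"0x{code:02X} -> U+{cp:04X} (&#x{cp:X};)")
    for code, cps in sorted(ambiguous.items()):
        print(f"0x{code:02X} is ambiguous:", [f"U+{cp:04X}" for cp in cps])
```

Context-dependent cases such as DELTA versus INCREMENT are left in `ambiguous` on purpose, since the right choice depends on the surrounding data.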

FlashDevelop supporting 3 different languages (Eng, Kor, Chn)?

I developed an app (originally in Korean and English), but I want to add Chinese support.
When I move the Chinese translations from Word to FlashDevelop, though, some characters show up as boxes. When I run the app, they don't show up at all.
Does anyone have experience developing in multiple languages using the same IDE, or preferably, FlashDevelop?
Thanks!
You need to check the file encoding and check whether the font you are using supports these characters. If you use transformations such as rotation and alpha, you need to embed your font. For French, I needed to convert my file to UTF-8 and embed the font with accented characters.

'font-family: Symbol' and Windows-1252

I have a bunch of HTML documents that contain some simple text in Windows-1252 encoding, but throughout the text there are numerous appearances of span elements with font-family: Symbol.
For example:
<span style='font-family:Symbol'>Ñ</span>
Which appears as the Greek delta (Δ) in the browser.
Google told me that using the Symbol font might show different results on different systems, as it's not actually a well defined font.
Is this really true? Is it "unsafe" to use the Symbol font?
If so, is there any way to reliably convert (on my own system) such symbols in the Symbol font to their Windows-1252 counterparts?
It has always been unsafe to rely on having a certain font installed on all the computers/smartphones/gadgets that visit your site. There are some font embedding techniques that work reasonably well in some modern browsers, but you'd need to repack the Symbol font and I doubt the copyright owner allows you to do that.
Of course, most characters in the Symbol font are not in the Windows-1252 encoding but that should not be an issue. You can use the following map to obtain the appropriate HTML entities. However, you'll have to write a script or program using a programming language (HTML is just a markup language).
When you use font-family and none of the listed font faces is found on the client (that is, without webfont embeds), the browser falls back to the client's default font, so your users see a different font from the one you intended.
You may want to use UTF-8 encoding and put the delta (Δ) sign in your HTML content, or use webfont embeds to provide an option, "use the font I want from this".
The problem is that the Greek letter you see is just the appearance; the actual letter is something completely different.
I can think of two ways to convert it:
1. Write a script (in your language of choice) that converts each letter to its Greek counterpart. (Ñ => Δ)
2. Take a screenshot of the document/page and use an OCR-program to convert it to Greek text.
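The first option could look roughly like the sketch below, which replaces Symbol-font spans with the Unicode characters they are meant to display. The lookup table is deliberately partial and illustrative (a handful of Latin letters and the Greek letters they show in the Symbol font); the remaining entries, including those for characters outside ASCII, would have to come from a published Symbol-to-Unicode mapping. The function name `convert_symbol_spans` and the regular expression are assumptions, and a regex is only adequate for very regular markup like the example in the question.

```python
import re

# Partial, illustrative table: Symbol font byte -> Unicode character.
SYMBOL_TO_UNICODE = {
    0x44: "\u0394",  # 'D' is shown as GREEK CAPITAL LETTER DELTA
    0x57: "\u03A9",  # 'W' is shown as GREEK CAPITAL LETTER OMEGA
    0x61: "\u03B1",  # 'a' is shown as GREEK SMALL LETTER ALPHA
    0x62: "\u03B2",  # 'b' is shown as GREEK SMALL LETTER BETA
    0x70: "\u03C0",  # 'p' is shown as GREEK SMALL LETTER PI
}

SPAN_RE = re.compile(
    r"<span\s+style=(['\"])font-family:\s*Symbol\1\s*>(.*?)</span>",
    re.IGNORECASE | re.DOTALL,
)


def convert_symbol_spans(html):
    """Replace fully mappable Symbol spans with plain Unicode text."""

    def replace(match):
        converted = []
        for ch in match.group(2):
            try:
                byte = ch.encode("cp1252")[0]  # byte value in the source file
            except UnicodeEncodeError:
                return match.group(0)  # not a Windows-1252 character
            if byte not in SYMBOL_TO_UNICODE:
                return match.group(0)  # unknown glyph: leave span untouched
            converted.append(SYMBOL_TO_UNICODE[byte])
        return "".join(converted)

    return SPAN_RE.sub(replace, html)


if __name__ == "__main__":
    page = "Angle <span style='font-family:Symbol'>a</span> and 2<span style='font-family:Symbol'>p</span>"
    print(convert_symbol_spans(page))
```

The result contains characters outside Windows-1252, so the converted document should be saved as UTF-8 (or the characters written as numeric references). Spans with any unmapped character are left as they were, so nothing is silently lost.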

Recognizing superscript characters using OCR

I've started a simple project that takes an image containing text with superscripts and uses OCR (currently Tesseract) to recognize both the superscript characters and the normal ones.
For example, we have a chemical equation such as Cl², but when I use Tesseract to recognize it, it gives me Cl2 (all on one line).
So, what is the solution for this problem? Is there any other OCR API that has the ability to read superscripts?
Very good question that touches more advanced features of any OCR system.
First of all, make sure you are not overlooking functionality that the OCR system may already have. Look at your result text not in plain TXT format but in a viewer capable of rich text. TXT viewers, such as Notepad on Windows, often do not support superscript/subscript characters, so even if the OCR gave you the correct characters, your viewer could have converted them for display. If you are accessing the text result programmatically, that is less of an issue, because you should get a proper superscript/subscript character value when accessing it directly. Just note that a viewer must support these characters for you to actually see them. If you have eliminated this possible post-processing conversion and confirmed that no superscript/subscript is returned from the OCR, then it probably does not support it.
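One way to check programmatically, rather than by eye, is to inspect the returned text for Unicode superscript and subscript characters. The sketch below uses the compatibility decomposition from Python's standard `unicodedata` module; the function name `find_scripts` is an assumption. It only detects dedicated Unicode characters such as ² or ₂, not superscripts expressed as rich-text formatting.

```python
import unicodedata


def find_scripts(text):
    """Return (index, character, kind) for each superscript/subscript char."""
    found = []
    for index, ch in enumerate(text):
        # Compatibility decompositions look like "<super> 0032" or "<sub> 0032".
        decomposition = unicodedata.decomposition(ch)
        if decomposition.startswith("<super>"):
            found.append((index, ch, "super"))
        elif decomposition.startswith("<sub>"):
            found.append((index, ch, "sub"))
    return found


if __name__ == "__main__":
    for sample in ("Cl\u00b2", "Cl2", "H\u2082O"):
        print(repr(sample), find_scripts(sample))
```

If the raw OCR string already contains only plain digits, as in "Cl2", the information was lost before any viewer was involved.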
Just like in this text box, in your original question you tried to give us a superscript character example, but this text box did not accept it even though you could copy/paste it from elsewhere.
Many OCR engines will see a superscript/subscript as any other normal character, if they can see it at all. The OCR you use needs to have the technical capability to actually produce superscripts/subscripts. Many engines do, but, not surprisingly, they tend to be commercial OCR systems.
I made a small test case before answering this question. I generated an image with a few superscript/subscript examples for my testing (of course EMC2 was the first example that came to mind :) ).
You can find my test image here:
www.ocr-it.com/documents/superscript_subscript_test_page.tif
I processed this image through the OCR-IT OCR Cloud 2.0 API using all default settings, but exporting to a rich text format, such as MS Word .DOC.
You can find my result here:
www.ocr-it.com/documents/superscript_subscript_test_page_result.doc
Also note: when you want to extract superscript/subscript characters, pay special attention to your image quality, more than you would with typical text. Those characters are tiny and you need sufficient detail and resolution to achieve decent OCR quality. Even images scanned at 300 dpi sometimes have issues with tiny characters because there are too few pixels. If you are considering mobile and digital cameras, that becomes even more important.
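To get a feel for how few pixels are involved, here is a small calculation based on the fact that one typographic point is 1/72 inch. The 12 pt body size and 7 pt superscript size are assumed example values, and `em_height_px` is an illustrative name. The result is the height of the font's em box; the visible glyph is smaller still.

```python
def em_height_px(point_size, dpi):
    """Height in pixels of a font's em box (1 point = 1/72 inch)."""
    return point_size * dpi / 72


if __name__ == "__main__":
    for dpi in (150, 300, 600):
        body = em_height_px(12, dpi)   # assumed body text size
        script = em_height_px(7, dpi)  # assumed superscript size
        print(f"{dpi} dpi: body {body:.1f} px, superscript {script:.1f} px")
```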
DISCLOSURE: My specialty is implementing internal OCR solutions for companies of different sizes. My company is WiseTREND. Contact me directly if I can assist with anything further.