Recognize MICR font using OCR Engine? - windows-runtime

I am using the Microsoft OCR Library for reading text.
The Microsoft OCR library works perfectly. However, I want to read the characters shown at http://www.ict4u.net/databases/database-images/micr.jpg . Is there a way to train the OCR library to read these characters, or is there a supported language that can read them?

[Microsoft OCR crew here] We don't yet support training OCR to customize it for your use-cases. However, we do actively keep an eye on stackoverflow to see what developers need, so we can keep improving the OCR engine.

I have been working with Microsoft OCR for a while now.
Compared with Tesseract it has very basic functionality.
For example Microsoft OCR returns the words and lines.
But the lines are nonsense. Randomly 2 or 3 words are grouped together as a "line" but they are not a real line. And the "lines" are completely unordered. In this aspect it is worse than Tesseract. You have to take the coordinates of each word and order them on your own.
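A minimal sketch of that reordering in Python, assuming each recognized word comes with a bounding box (left, top, width, height). The `Word` class, `group_into_lines`, and the `tolerance` parameter are hypothetical names used for illustration; they are not part of the Microsoft OCR API:

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # left edge of the bounding box
    y: float  # top edge of the bounding box
    w: float  # width
    h: float  # height


def group_into_lines(words, tolerance=0.5):
    """Group word boxes into lines by vertical center, then sort each line left to right."""
    lines = []  # each entry is a list of Word objects on the same line
    for word in sorted(words, key=lambda wd: wd.y + wd.h / 2):
        center = word.y + word.h / 2
        if lines:
            last = lines[-1]
            last_center = sum(wd.y + wd.h / 2 for wd in last) / len(last)
            # Same line if the centers differ by less than a fraction of the word height.
            if abs(center - last_center) <= tolerance * word.h:
                last.append(word)
                continue
        lines.append([word])
    return [
        " ".join(wd.text for wd in sorted(line, key=lambda wd: wd.x))
        for line in lines
    ]


words = [
    Word("world", 60, 10, 40, 12),
    Word("Hello", 10, 11, 40, 12),
    Word("line", 50, 40, 30, 12),
    Word("Second", 5, 41, 40, 12),
]
ordered = group_into_lines(words)
```

The tolerance is a guess and will likely need tuning for skewed scans or mixed font sizes.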
Microsoft does not return the rectangles of characters and there is absolutely no way to configure or train Microsoft OCR in any way. You can add languages with Windows Update for "Basic Typing" = OCR (see http://www.thewindowsclub.com/install-uninstall-languages-windows-10), but you cannot train your own language data.
MSDN says that the following 25 languages are supported with different accuracy:
Excellent: Czech, Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Polish, Portuguese, Romanian, Serbian Cyrillic, Serbian Latin, Slovak, Spanish and Swedish.
Very good: Chinese Simplified, Greek, Japanese, Russian and Turkish.
Good: Chinese Traditional and Korean.
The recognition quality is very similar to Tesseract. It even has exactly the same problems as Tesseract. Some single characters are not recognized (separate symbols like a single '$') and it has the same huge problem with asterisks as Tesseract. It also inserts spaces in the wrong places, as Tesseract does. So I ask myself whether Microsoft is using Tesseract under the hood.
However Microsoft OCR has an advantage over Tesseract: The image preprocessing is much better. It does not matter if you have red text on yellow background or white text on black. This is a catch for Tesseract which needs a black and white image of good quality as input.
The same applies to both OCR libraries: if you have recognition problems, try enlarging the image. Even blurring the image may be very helpful because this removes noise from the image.

Related

Is it possible for Tesseract to show a recognition percent for a character?

I am using Tesseract for recognizing custom symbols (more like pictographs, not numbers or letters). I need this for implementing a "spell-casting" mechanic in an Android game where you have to draw a symbol to cast a spell. I trained Tesseract on my symbol sheet and it recognizes the symbols just fine, but it also recognizes gibberish images as symbols. Obviously, I don't want this to happen, as it defeats the purpose of drawing a specific symbol. Does Tesseract have an option to display something like a recognition percent for a symbol?
The "recognition percent" is called the "confidence level" in Tesseract, and it can be accessed through the tsv output option. More detail in this answer: https://stackoverflow.com/a/66899977/15523359
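As a rough illustration, the TSV output can be filtered by confidence with a few lines of Python. This is a sketch that assumes the TSV has the usual header row with `conf` and `text` columns; `confident_words`, `min_conf`, and the sample data are made up for the example:

```python
import csv
import io


def confident_words(tsv_text, min_conf=60.0):
    """Return (text, conf) pairs for rows whose confidence is at least min_conf."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    result = []
    for row in reader:
        text = (row.get("text") or "").strip()
        conf = float(row["conf"])
        # Structural rows (page, block, line) carry conf -1 and no text.
        if text and conf >= min_conf:
            result.append((text, conf))
    return result


# Hypothetical sample in the TSV layout, not real recognition output.
SAMPLE_TSV = (
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\t"
    "left\ttop\twidth\theight\tconf\ttext\n"
    "1\t1\t0\t0\t0\t0\t0\t0\t100\t50\t-1\t\n"
    "5\t1\t1\t1\t1\t1\t5\t5\t40\t20\t91.5\tfire\n"
    "5\t1\t1\t1\t1\t2\t50\t5\t40\t20\t23.0\tx~\n"
)
accepted = confident_words(SAMPLE_TSV)
```

A drawing that yields no symbol above the threshold can then be rejected as gibberish; the threshold itself has to be found by experiment.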

How does each platform display their own versions of emojis?

If you take a bit of text that contains emojis from whatsapp web, and paste it into facebook messenger, you'll get different versions on each platform. How does each platform use their own images as placeholders for emojis? Please note that the emoji code is preserved when copy-pasted. So a melon on one platform will still be a melon on the other platform.
I'm not even sure this is a programming question, if not I'd be very grateful if you could point me to the right direction :)
Emoji are represented as unicode characters that individual platforms and apps can interpret as they see fit. While most modern platforms will automatically translate the unicode character into the appropriate image, some apps will override this behavior and replace the platform-standard unicode character with their own image.
text.replace("{unicodeEmojiString}", "{eitherAMarkerOrImageSpecificToMyApp}")
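A slightly fuller sketch of the same idea in Python. The image path, the `EMOJI_IMAGES` table, and `render_emoji` are hypothetical; this only illustrates how an app could swap a Unicode emoji for its own image while keeping the original character in the markup, and it is not a description of how any particular app does it:

```python
import re

# Hypothetical mapping from emoji characters to app-specific markup.
# U+1F348 is the MELON emoji; keeping it in `alt` preserves the character.
EMOJI_IMAGES = {
    "\U0001F348": '<img src="/emoji/melon.png" alt="\U0001F348">',
}

# One pattern that matches any emoji in the table.
EMOJI_PATTERN = re.compile("|".join(re.escape(e) for e in EMOJI_IMAGES))


def render_emoji(text):
    """Replace each known emoji with the app's own image markup in a single pass."""
    return EMOJI_PATTERN.sub(lambda m: EMOJI_IMAGES[m.group(0)], text)


rendered = render_emoji("I like \U0001F348")
```

Because the stored text still contains the Unicode character, another platform receiving the same text is free to draw it with its own image.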

Using BERT in order to detect language of a given word

I have words written in Hebrew. Some of them are originally Hebrew, and some of them are 'Hebrew English', meaning words that originally come from English but are written with Hebrew characters.
For example: 'insulin' in Hebrew is "אינסולין" (Same phonetic sound).
I have a simple binary dataset.
X: words (Written with Hebrew characters)
y: label 1 if the word is originally in English and is written with Hebrew characters, else 0
I've tried using the classifier, but the input for it is full text, and my input is just words.
I don't want any MASKING to happen, I just want simple classification.
Is it possible to use BERT for this mission? Thanks
BERT is intended to work with words in context. Without context, a BERT-like model is equivalent to simple word2vec lookup (there is fancy tokenization, but I don't know how it works with Hebrew - probably, not very efficiently). So if you really really want to use distributional features in your classifier, you can take a pretrained word2vec model instead - it's simpler than BERT, and no less powerful.
But I'm not sure it will work anyway. Word2vec and its equivalents (like BERT without context) don't know much about inner structure of a word - only about contexts it is used in. In your problem, however, word structure is more important than possible contexts. For example, words בלוטת (gland) or דם (blood) or סוכר (sugar) often occur in the same context as insulin, but בלוטת and דם are Hebrew, whereas סוכר is English (okay, originally Arabic, but we are probably not interested in too ancient origins). You just cannot predict it from context only.
So why not start with some simple model (e.g. logistic regression or even naive bayes) over simple features (e.g. character n-grams)? Distributional features (I mean w2v) may be added as well, because they tell about topic, and topics may be informative (e.g. in medicine, and technology in general, there are probably relatively more English words than in other domains).
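To make the suggestion concrete, here is a minimal naive Bayes classifier over character n-grams using only the Python standard library. It is a sketch under simple assumptions (add-one smoothing, n-grams of length 1 to 3, `^` and `$` as word-boundary markers); `char_ngrams` and `NgramNaiveBayes` are names invented for this example:

```python
import math
from collections import Counter, defaultdict


def char_ngrams(word, n_min=1, n_max=3):
    """Return all character n-grams of the word, with ^ and $ marking its boundaries."""
    padded = f"^{word}$"
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


class NgramNaiveBayes:
    """Multinomial naive Bayes over character n-grams with add-one smoothing."""

    def fit(self, words, labels):
        self.counts = defaultdict(Counter)   # label -> n-gram counts
        self.class_totals = Counter(labels)  # label -> number of training words
        self.vocab = set()
        for word, label in zip(words, labels):
            grams = char_ngrams(word)
            self.counts[label].update(grams)
            self.vocab.update(grams)
        return self

    def predict(self, word):
        n_words = sum(self.class_totals.values())
        best_label, best_score = None, -math.inf
        for label, word_count in self.class_totals.items():
            total = sum(self.counts[label].values())
            score = math.log(word_count / n_words)  # log prior
            for gram in char_ngrams(word):
                # Smoothed log likelihood of each n-gram given the class.
                score += math.log(
                    (self.counts[label][gram] + 1) / (total + len(self.vocab))
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Usage would look like `NgramNaiveBayes().fit(X, y).predict("אינסולין")`, with `X` the list of words and `y` the 0/1 labels from the question. With a real dataset, a library implementation and proper cross-validation would be the more sensible choice.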

Recognizing superscript characters using OCR

I've started a simple project that takes an image containing text with superscripts and uses OCR (currently Tesseract) to recognize both the superscript characters and the normal ones.
For example, we have a chemical equation such as Cl², but when I use Tesseract to recognize it, it gives me Cl2 (all on one line).
So, what is the solution for this problem? Is there any other OCR API that has the ability to read superscripts?
Very good question that touches more advanced features of any OCR system.
First, make sure you are not overlooking functionality that the OCR system may already have. Look at your result text not in plain TXT format but in a viewer capable of rich text. TXT viewers, such as Notepad on Windows, often do not support superscript/subscript characters, so even if the OCR gave you the correct characters, your viewer could have converted them for display. If you are accessing the text result programmatically, that is less of an issue, because you should get the proper superscript/subscript character value when accessing it directly. Just note that a viewer must support these characters for you to actually see them. If you have eliminated this possible post-processing conversion and confirmed that no superscript/subscript is returned from the OCR, then it probably does not support it.
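One way to do that programmatic check is to inspect the code points of the returned string. The sketch below uses Python's standard `unicodedata` module; `script_marks` is a hypothetical helper name, and it only detects characters that Unicode itself defines as superscript or subscript forms:

```python
import unicodedata


def script_marks(text):
    """List (char, kind, plain_equivalent) for superscript/subscript characters in text."""
    marks = []
    for ch in text:
        decomposition = unicodedata.decomposition(ch)
        if decomposition.startswith("<super>"):
            kind = "superscript"
        elif decomposition.startswith("<sub>"):
            kind = "subscript"
        else:
            continue
        # NFKC maps the character to its plain equivalent, e.g. a superscript two to "2".
        marks.append((ch, kind, unicodedata.normalize("NFKC", ch)))
    return marks


found = script_marks("Cl\u00b2")
```

If the OCR result for an image with superscripts yields an empty list here, the engine returned plain characters, regardless of what any viewer shows.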
Just like in this text box, in your original question you tried to give us a superscript character example, but this text box did not accept it even though you could copy/paste it from elsewhere.
Many OCR systems will treat a superscript/subscript as any other normal character, if they can see it at all. The OCR you use needs the technical capability to actually produce superscripts/subscripts; many have it, but, not surprisingly, they tend to be commercial OCR systems.
I made a small test case before answering this question. I generated an image with a few superscript/subscript examples for my testing (of course EMC2 was the first example that came to mind :) ).
You can find my test image here:
www.ocr-it.com/documents/superscript_subscript_test_page.tif
I processed this image through OCR-IT OCR Cloud 2.0 API using all default settings, but exporting to a rich text format, such as MS Word .DOC.
You can find the result here:
www.ocr-it.com/documents/superscript_subscript_test_page_result.doc
Also note: when you want to extract superscript/subscript characters, pay more attention to image quality than you would with typical text. Those characters are tiny, and you need sufficient detail and resolution to achieve decent OCR quality. Even images scanned at 300 dpi sometimes have issues with tiny characters because they contain too few pixels. If you are considering mobile and digital cameras, that becomes even more important.
DISCLOSURE: My specialty is implementing internal OCR solutions for companies of different sizes. My company is WiseTREND. Contact me directly if I can assist with anything further.

OCR and Distinguishing Between 2 or 3 Fonts

Let's say that I have a black and white image of a document with only 2 or 3 fonts being used. One of the 3 is used for the title and another is a small font (or at least, very plain). For example, one of the little bits of text might be:
Fancy/Bolded/Italicized/Script font: The Best Soup In The World
Plain/small: Made with tap water, salt, and sugar.
Fancy/Bolded/Italicized/Script font: The Best Soup and 1/2 Sandwich In The World
Plain/small: Made with flour, tap water, salt, and sugar.
I don't need a big fancy OCR system that can tell me that "Best Soup" uses a particular fancy font with italics/etc. I just need a system that can tell me "Best Soup" is formatted rather differently from "tap water", that "Best Soup" and "Sandwich" are probably using the same formatting, and "Sandwich" is bigger/fancier than "tap water."
I'll be using Tesseract to do the actual OCR and bounding box detection (http://www.mail-archive.com/tesseract-ocr@googlegroups.com/msg02157.html), if that's relevant.
Is there anything out there that I can use to do this simple formatting classification?
Edit:
Is there anything out there that will do this without costing me an arm and a leg?
I'm not sure whether Tesseract can solve the task you describe, but I believe a good OCR engine should detect font styles. For example, ABBYY OCR SDK can not only identify bold/italic font styles, but it can also pick the proper font face to use in the output.
Based on what you describe, I guess you are trying to determine the document's style hierarchy, such as header levels. ABBYY FineReader Engine provides this functionality, so you don't have to write your own routine that infers the purpose of text from its font size and style. Besides, it provides the best OCR quality and it's free to try. Consider trying it out if you plan commercial software. I work at ABBYY and can provide more info about our OCR SDK if necessary.
Best regards.