How do I measure the confidence of the transliteration output? - microsoft-translator

I am using the Python code at https://learn.microsoft.com/en-us/azure/cognitive-services/translator/quickstart-python-transliterate to transliterate my Hindi text to English. How do I determine the level of confidence of this transliteration output?

Transliteration does not change the language, only the script. When you transliterate Hindi text to Latin, the text is still in Hindi, just written in Latin characters. The API maps words, phrases and individual characters.
If you are targeting an English translation, you will want to use the Translate function of the Translator API.
The API does not return a confidence rating in either case.
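For reference, here is a minimal sketch of calling both endpoints directly with the requests library; the key, region and printed responses are placeholders and illustrative only, and as noted above, neither JSON response includes a confidence field.
import requests

subscription_key = "YOUR_KEY"  # placeholder
endpoint = "https://api.cognitive.microsofttranslator.com"
headers = {
    "Ocp-Apim-Subscription-Key": subscription_key,
    # "Ocp-Apim-Subscription-Region": "YOUR_REGION",  # needed for regional (non-global) resources
    "Content-Type": "application/json",
}
body = [{"Text": "नमस्ते"}]  # Hindi text in Devanagari script

# Transliterate: Devanagari -> Latin script; the text is still Hindi
translit = requests.post(
    endpoint + "/transliterate",
    params={"api-version": "3.0", "language": "hi", "fromScript": "Deva", "toScript": "Latn"},
    headers=headers,
    json=body,
)
print(translit.json())  # e.g. [{"text": "namaste", "script": "Latn"}] -- no confidence field

# Translate: Hindi -> English, if an English translation is what you actually need
translate = requests.post(
    endpoint + "/translate",
    params={"api-version": "3.0", "from": "hi", "to": "en"},
    headers=headers,
    json=body,
)
print(translate.json())  # [{"translations": [{"text": "...", "to": "en"}]}] -- also no confidence score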

Related

Does anyone know how Tesseract - OCR postprocessing / spellchecking works?

I was using tesseract-ocr (pytesseract) for Spanish, and it achieves very high accuracy when you set the language to Spanish and, of course, the text is in Spanish. If you do not set the language to Spanish, it does not perform nearly as well. So I am assuming that Tesseract uses post-processing models for spellchecking to improve performance, and I was wondering if anybody knows some of the models (e.g. edit distance, noisy channel modelling) that Tesseract applies.
Thanks in advance!
Your assumption is wrong: if you do not specify a language, Tesseract uses the English model as the default for OCR. That is why you got poor results for Spanish input text. There is no spellchecking post-processing.
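For reference, a minimal sketch of selecting the language model through pytesseract; the image file name is just a placeholder, and the 'spa' traineddata has to be installed separately.
from PIL import Image
import pytesseract

img = Image.open("page.png")  # hypothetical input image

# With no lang argument, Tesseract falls back to the English model ('eng')
text_default = pytesseract.image_to_string(img)

# For Spanish text, load the Spanish model instead (requires the 'spa' traineddata)
text_spanish = pytesseract.image_to_string(img, lang="spa")

print(text_spanish)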

Machine Learning with phonics ASR

There is a lot of research on Automatic Speech Recognition (ASR) that converts speech to text. These tools use deep learning to do that.
I have found that the way they work is based on the English language. Given audio of the word "Phonics", the raw output would be something like "Foniks", but the closest English word to that is "Phonics", so that is what gets returned.
Google APIs can provide us with ASR that gives us that end result. Is there any tool or open-source project that can give us the phonetic sounds instead? Something like "ˈfəʊnɪks" instead of "Phonics".
Thanks.
There are several open-source tools for ASR. Kaldi, CMU Sphinx and HTK are the most popular and well documented. Kaldi is probably the best choice if you want to use DNNs for ASR.
However, the form of the recognition result depends on your vocabulary. If you wish to get the word ˈfəʊnɪks instead of Phonics, you have to define it in the vocabulary. For instance:
!SIL sil
<UNK> spn
eight ey t
five f ay v
...
f_ey_ow_n_i_k_s f ey ow n i k s
....
Using Unicode symbols for word representation is not possible (as far as I remember), so I replaced them with X-SAMPA notation.
Follow this tutorial for an in-depth explanation.
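As a rough illustration of how such a lexicon maps words to phone sequences (plain Python, not Kaldi-specific; the file name and hypothesis words are made up):
# Build a word -> phone-string lookup from a lexicon file in the format shown above.
lexicon = {}
with open("lexicon.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.split()
        if len(parts) >= 2:
            lexicon[parts[0]] = " ".join(parts[1:])

hypothesis = ["five", "f_ey_ow_n_i_k_s", "eight"]  # words as the decoder would emit them
print([lexicon.get(word, "<unk>") for word in hypothesis])
# e.g. ['f ay v', 'f ey ow n i k s', 'ey t']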

Does Google Maps Support European Languages?

I tried to display a map by using various languages in the URL in the script tag:
http://maps.googleapis.com/maps/api/js?client=this-is-me&language=fr
http://maps.googleapis.com/maps/api/js?client=this-is-me&language=de
http://maps.googleapis.com/maps/api/js?client=this-is-me&language=es
The place names on the map are shown in both English and the local language, with some exceptions; e.g. Ivory Coast is shown only in the local language: Côte d'Ivoire.
One would expect the map to be entirely in the requested language. This was never the case in any of the languages I tested. Am I doing something wrong, or is the actual translation just not working?
There is no guarantee that anything will be translated.
At the very least, the interface elements (e.g. the map-type controls) should be translated.
Basically, all three languages are supported (see the list of supported languages).

What are the common OCR errors with capitals?

What are the common errors in OCR (optical character recognition) with capital letters?
E.g. FOR -> FOB
To get the most accurate answers, it is probably best to test this yourself with a sample of data specific to your problem. Error rates for different character/word combinations can vary greatly, depending on the input.
However, there are also a number of articles that can be found with Google Scholar that deal with OCR error correction, such as A statistical approach to automatic OCR error correction in context. Although that particular article is not specific to capital letters, it discusses a few common cases of misclassification.
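To follow up on the suggestion to test this yourself: one rough way, assuming you have ground-truth text for a sample of your OCR output (the strings below are made up), is to align the two with Python's difflib and tally the character substitutions.
import difflib
from collections import Counter

# Made-up example pair: ground truth vs. OCR output of the same text
truth = "FOR EXAMPLE THE QUICK BROWN FOX"
ocr = "FOB EXAMPLE THE OUICK BROWN F0X"

substitutions = Counter()
for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(None, truth, ocr).get_opcodes():
    if tag == "replace" and (i2 - i1) == (j2 - j1):
        # Tally per-character substitutions for equal-length replaced spans
        for a, b in zip(truth[i1:i2], ocr[j1:j2]):
            substitutions[(a, b)] += 1

print(substitutions.most_common())
# e.g. [(('R', 'B'), 1), (('Q', 'O'), 1), (('O', '0'), 1)]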

Can not recognize pdf scanned page with greek words by using PB , EZTWAIN and TOCR 3.0

I am using PB 10.5.2 and EZTwain 3.30.0.28 with XDefs 1.36b1 by Dosadi for scanning.
I am also using TOCR 3.0 for OCR management.
In a function we use the following, among other things:
...
Long ll_acquire
// as_path_filename is a function argument
...
...
TWAIN_SetAutoOCR(1) // turn on EZTwain's automatic OCR of acquired pages
ll_acquire = TWAIN_AcquireMultipageFile(0, as_path_filename) // acquire the scanned pages into a multipage file at as_path_filename
The problem is that the scanned PDF page contains both Latin (English) and Greek words.
The English words can be searched quite precisely, but the Greek ones cannot be found at all.
Do you think this has to do with the TOCR software?
I want to be able to search for the Greek words as well.
Thanks in advance
The OCR software is most likely where the conversion of the Greek words into searchable text is failing. It looks like you are using EZTwain for the OCR portion, which uses TOCR as its actual OCR engine. You may want to look at the docs for that software and see if they mention any settings that can be modified for multilingual usage.
According to its website, TOCR recognizes English, French, Italian, German, Dutch, Swedish, Finnish, Norwegian, Danish, Spanish and Portuguese; Greek is not on that list. You'll need software that can handle mixed Greek and English text. ABBYY FineReader Professional lists support for English and Greek, along with dozens of other languages.
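As an aside, if switching OCR engines is an option, the open-source Tesseract (see the related question above) can combine language packs; a rough pytesseract sketch, assuming the 'eng' and 'ell' traineddata files are installed:
from PIL import Image
import pytesseract

# 'eng+ell' loads the English and Greek models together for mixed-language pages
page = Image.open("scanned_page.png")  # hypothetical scan; a PDF page would need converting to an image first
text = pytesseract.image_to_string(page, lang="eng+ell")
print(text)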