Determine the alphabet of a string by looking for language-specific characters - language-agnostic

(this is NOT a duplicate of How to detect the language of a string?)
I need to be able to determine the alphabet of a given string (a single word) by its language/alphabet-specific characters.
For example, if the string contains:
'Ü' it should be recognized as German,
'ش' as Arabic,
'Φ' as Greek, and so on.
I'm looking for a list of alphabet-specific characters grouped by language/alphabet. Since the input is a single, possibly non-dictionary word, the Google Translate API and other dictionary-based solutions won't work.
(Although the question isn't specific to a programming language, the actual code is written in C#.)

You could start with the Unicode name of each character. For example (in Python):
>>> import unicodedata
>>> unicodedata.name(u'Φ')
'GREEK CAPITAL LETTER PHI'
>>> unicodedata.name(u'ش')
'ARABIC LETTER SHEEN'
>>> unicodedata.name(u'Ü')
'LATIN CAPITAL LETTER U WITH DIAERESIS'
You might have to special-case the Latin characters, since Unicode doesn't assign them to particular language-specific alphabets. Most of them appear in several languages that use Latin-based alphabets, but if you're somehow confident that your data will contain Ü only if it is German, then you can identify that character as German for your purposes. There are only a few dozen Latin characters to worry about.
Similarly, loads of languages use the Unicode CYRILLIC letters, so in most cases their presence doesn't tell you the language. Some are described by Unicode as belonging to particular languages: CYRILLIC SMALL LETTER YI has the note "Ukrainian" in http://www.unicode.org/charts/PDF/U0400.pdf. I don't know whether those notes are exhaustive, i.e. whether Ukrainian is the only language that uses that character. And I'm certain that there are plenty of Ukrainian words that don't contain that character at all. Fundamentally, you cannot distinguish Ukrainian words from Russian words solely by the presence or absence of Ukrainian-specific letters.
I expect the same is true of other alphabets in Unicode. If you're really lucky you might find a Unicode database that includes any such notes on each character, so you can mine it for mention of particular languages.
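As a rough sketch of that idea in Python: the first word of the Unicode name is taken as the script. Mapping a script to a single language, and the sample words below, are my own assumptions, and the Latin/Cyrillic caveats above still apply:
import unicodedata

def guess_scripts(word):
    # Collect the leading script keyword from each character's Unicode name,
    # e.g. 'GREEK CAPITAL LETTER PHI' -> 'GREEK'.
    scripts = set()
    for ch in word:
        try:
            name = unicodedata.name(ch)
        except ValueError:
            continue  # character has no name (e.g. some control characters)
        scripts.add(name.split()[0])
    return scripts

print(guess_scripts(u'Φιλοσοφία'))  # {'GREEK'}
print(guess_scripts(u'Übung'))      # {'LATIN'} - ambiguous, as noted above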

Related

Do ampersands still need to be encoded in URLs in HTML5?

I learned recently (from these questions) that at some point it was advisable to encode ampersands in href parameters. That is to say, instead of writing:
...
One should write:
...
Apparently, the former example shouldn't work, but browser error recovery means it does.
Is this still the case in HTML5?
We're now past the era of draconian XHTML requirements. Was this a requirement of XHTML's strict handling, or is it really still something that I should be aware of as a web developer?
It is true that one of the differences between HTML5 and HTML4, quoted from the W3C Differences Page, is:
The ampersand (&) may be left unescaped in more cases compared to HTML4.
In fact, the HTML5 spec goes to great lengths describing actual algorithms that determine what it means to consume (and interpret) characters.
In particular, in the section on tokenizing character references from Chapter 8 in the HTML5 spec, we see that when you are inside an attribute, and you see an ampersand character that is followed by:
a tab, LF, FF, space, <, &, EOF, or the additional allowed character (a " or ' if the attribute value is quoted or a > if not) ===> then the ampersand is just an ampersand, no worries;
a number sign ===> then the HTML5 tokenizer will go through the many steps to determine if it has a numeric character entity reference or not, but note in this case one is subject to parse errors (do read the spec)
any other character ===> the parser will try to find a named character reference, e.g., something like &notin;.
The last case is the one of interest to you since your example has:
...
You have the character sequence
AMPERSAND
LATIN SMALL LETTER Y
EQUAL SIGN
Now here is the part from the HTML5 spec that is relevant in your case, because y is not a named entity reference:
If no match can be made, then no characters are consumed, and nothing is returned. In this case, if the characters after the U+0026 AMPERSAND character (&) consist of a sequence of one or more alphanumeric ASCII characters followed by a U+003B SEMICOLON character (;), then this is a parse error.
You don't have a semicolon there, so you don't have a parse error.
Now suppose you had, instead,
...
which is different because &eacute; is a named entity reference in HTML. In this case, the following rule kicks in:
If the character reference is being consumed as part of an attribute, and the last character matched is not a ";" (U+003B) character, and the next character is either a "=" (U+003D) character or an alphanumeric ASCII character, then, for historical reasons, all the characters that were matched after the U+0026 AMPERSAND character (&) must be unconsumed, and nothing is returned. However, if this next character is in fact a "=" (U+003D) character, then this is a parse error, because some legacy user agents will misinterpret the markup in those cases.
So there the = makes it an error, because legacy browsers might get confused.
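To make the two quoted rules concrete, here is a minimal Python sketch (my own paraphrase of the spec using the named-reference table from the standard library, not a real tokenizer; the query strings are made up but have the shape discussed above):
import re
from html.entities import html5  # named references, with and without trailing ';'

def classify(attr_value, amp_index):
    # attr_value[amp_index] is '&' inside an attribute value;
    # report how the tokenizer would treat it under the rules quoted above.
    after = attr_value[amp_index + 1:]
    # The spec takes the longest named-reference match, so try shrinking candidates.
    for end in range(len(after), 0, -1):
        name = after[:end]
        if name in html5:
            nxt = after[end:end + 1]
            # (isalnum is broader than the spec's ASCII alphanumerics; close enough here)
            if not name.endswith(';') and (nxt == '=' or nxt.isalnum()):
                # the 'for historical reasons' rule: everything is unconsumed
                return 'left alone' + (' (parse error)' if nxt == '=' else '')
            return 'interpreted as ' + html5[name]
    # No match: nothing is consumed; it is a parse error only when an
    # alphanumeric run is followed by a semicolon.
    if re.match(r'[0-9A-Za-z]+;', after):
        return 'left alone (parse error)'
    return 'left alone'

print(classify('?x=1&y=2', 4))        # left alone
print(classify('?x=1&eacute=2', 4))   # left alone (parse error)
print(classify('?x=1&eacute;=2', 4))  # interpreted as é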
Even though the HTML5 spec seems to go to great lengths to say "well, this ampersand is not beginning a character entity reference, so there's no reference here", the fact that you might run into URLs containing parameter names that are named references (e.g., isin, part, sum, sub) and would then trigger parse errors means that, IMHO, you're better off escaping your ampersands. But of course, you only asked whether the restrictions were relaxed in attributes, not what you should do, and it does appear that they have been.
It would be interesting to see what validators can do.

How to detect the Thai language in a SQL query

I have a column in a table which is a string, and some of those strings contain Thai text. An example of such a Thai string is:
อักษรไทย
Is there a way to query/find strings like this in the column?
You could search for strings that start with a character in the Thai Unicode block (i.e. between U+0E01 and U+0E5B):
WHERE string BETWEEN 'ก' AND '๛'
Of course this won't include strings that start with some other character and go on to include Thai language, such as those that start with a number. For that, you would have to use a much less performant regular expression:
WHERE string RLIKE '[ก-๛]'
Note however the warning in the manual:
Warning
The REGEXP and RLIKE operators work in byte-wise fashion, so they are not multi-byte safe and may produce unexpected results with multi-byte character sets. In addition, these operators compare characters by their byte values and accented characters may not compare as equal even if a given collation treats them as equal.
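If you can filter on the application side instead, a simple check over code points avoids the byte-wise pitfall entirely. A Python sketch, assuming rows are fetched as Unicode strings:
def contains_thai(s):
    # Assigned Thai characters run from U+0E01 to U+0E5B, as above.
    return any(u'\u0e01' <= ch <= u'\u0e5b' for ch in s)

print(contains_thai(u'abc อักษรไทย'))  # True
print(contains_thai(u'abc 123'))       # False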
You can do some back and forth conversion between character sets.
where convert(string, 'AL32UTF8') =
convert(convert(string, 'TH8TISASCII'), 'AL32UTF8', 'TH8TISASCII' )
will be true if the string is made only of Thai and ASCII, so if you add
AND convert(string, 'AL32UTF8') != convert(string, 'US7ASCII')
you filter out the strings made only of ASCII, and you are left with the strings containing Thai.
Unfortunately, this will not work if your strings contain something outside of ASCII and Thai.
Note: Some of the convert may be superfluous depending on your database default encoding.
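The same round-trip idea is easy to sketch on the application side: TIS-620 covers exactly ASCII plus the Thai block, so in Python (an illustration of the technique, not of the convert() calls above):
def is_thai_with_optional_ascii(s):
    try:
        s.encode('tis-620')  # succeeds only for ASCII + Thai characters
    except UnicodeEncodeError:
        return False         # something outside ASCII and Thai
    return any(ord(ch) > 127 for ch in s)  # drop strings that are pure ASCII

print(is_thai_with_optional_ascii(u'อักษรไทย 123'))  # True
print(is_thai_with_optional_ascii(u'hello'))         # False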

Tesseract SetVariable tessedit_char_whitelist in another language

Tesseract's SetVariable whitelist works fine for the English language. For example, I use this to recognize only digits and letters from an image (excluding special characters like &*^%! etc.):
_ocr.SetVariable("tessedit_char_whitelist",
"0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ");
But I can't do the same thing for the Thai language:
_ocr.SetVariable("tessedit_char_whitelist","0123456789กขคงจฉ");
Is there a different principle? This does not work: instead of only the whitelisted characters, I receive only digits in the output; Tesseract ignores all the Thai letters I put into the whitelist.
How can I pass this variable correctly?
You might need to install the language package for Thai first; please refer to the download list here: https://code.google.com/p/tesseract-ocr/downloads/list
Then you need to replace "eng" with "tha" in your code to use the new language data for OCR.
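For reference, a rough Python equivalent using pytesseract (the image filename is hypothetical; this assumes tha.traineddata is installed in your tessdata directory):
import pytesseract
from PIL import Image

# Select the Thai language pack and pass the whitelist as a config variable.
# 'thai_sample.png' is a hypothetical input image.
text = pytesseract.image_to_string(
    Image.open('thai_sample.png'),
    lang='tha',
    config='-c tessedit_char_whitelist=0123456789กขคงจฉ')
print(text)
Note that whitelist support has varied across Tesseract versions (it was notably not honored by the early 4.x LSTM engine), so also check your Tesseract version if characters are still ignored.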

Is UTF-8 an Encoding or a Document Character Set?

According to the W3C Recommendation, every application requires a document character set (not to be confused with a character encoding).
A document character set consists of:
A Repertoire: A set of abstract characters, such as the Latin letter "A", the Cyrillic letter "I", the Chinese character meaning "water", etc.
Code positions: A set of integer references to characters in the repertoire.
Each document is a sequence of characters from the repertoire.
Character Encoding is:
How those characters may be represented
When I save a file in Windows Notepad, I'm guessing that these are the "document character sets":
ANSI
UNICODE
UNICODE BIG ENDIAN
UTF-8
Three simple questions:
I want to know whether those are the "document character sets". And if they are:
Why is UTF-8 on the list? Isn't UTF-8 supposed to be an encoding?
And, if I'm not wrong about all this stuff:
Are there other document character sets that Windows does not let you choose?
How can you define another document character set?
In my understanding:
ANSI is both a character set and an encoding of that character set (in Notepad it means the system code page, typically Windows-1252).
Unicode is a character set; the encoding in question is probably little-endian UTF-16. An alternative encoding of the same character set is big-endian UTF-16, which is probably what the third option is referring to.
UTF-8 is an encoding of Unicode.
The purpose of that dropdown in the Save dialog is really to select both a character set and an encoding for it, but they've been a little careless with the naming of the options.
(Technically, though, an encoding just maps integers to byte sequences, so any encoding could be used with any character set that is small enough to "fit" the encoding. However, the UTF-* encodings are designed with Unicode in mind.)
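A quick Python illustration of the distinction: one character from the Unicode character set, three different byte sequences depending on the encoding chosen:
>>> u'Φ'.encode('utf-8')
b'\xce\xa6'
>>> u'Φ'.encode('utf-16-le')
b'\xa6\x03'
>>> u'Φ'.encode('utf-16-be')
b'\x03\xa6'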
Also, see Joel on Software's mandatory article on the subject.
UTF-8 is a character encoding that is also used to specify a character set for HTML and other textual documents. It is one of several Unicode encodings (UTF-16 is another).
To answer your questions:
It is on the list because Microsoft decided to implement it in Notepad.
There are many other character sets, though defining your own would not be useful, so it is not really possible.
You can't choose other character sets to save with Notepad. Try using a programmer's editor such as Notepad++, which will give you more character sets to use.

MySQL collation to store multilingual data of unknown language

I am new to multilingual data and, I confess, I have never tried it before.
Currently I am working on a multilingual site, but I do not know which languages will be used.
Which collation/character set of MySQL should I use to achieve this?
Should I use some Unicode type of character set?
And of course these languages are nothing exotic; they will be among the ones we mostly use.
You should use a Unicode collation. You can set it as the default for your server, or on each field of your tables. There are the following Unicode collation names, and these are their differences:
utf8_general_ci is a very simple collation. It just removes all accents, then converts to upper case, and compares using the code of the resulting "base letter".
utf8_unicode_ci uses the default Unicode collation element table.
The main differences are:
utf8_unicode_ci supports so-called expansions and ligatures; for example, the German letter ß (U+00DF LATIN SMALL LETTER SHARP S) is sorted near "ss", and the letter Œ (U+0152 LATIN CAPITAL LIGATURE OE) is sorted near "OE".
utf8_general_ci does not support expansions/ligatures, it sorts all these letters as single characters, and sometimes in the wrong order.
utf8_unicode_ci is generally more accurate for all scripts. For example, in the Cyrillic block, utf8_unicode_ci is fine for all of these languages: Russian, Bulgarian, Belarusian, Macedonian, Serbian, and Ukrainian, while utf8_general_ci is fine only for the Russian and Bulgarian subset of Cyrillic. The extra letters used in Belarusian, Macedonian, Serbian, and Ukrainian do not sort well.
The disadvantage of utf8_unicode_ci is that it is a little bit slower than utf8_general_ci.
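The expansion behavior can be illustrated with Python's full case folding, which expands ß the same way (an analogy only; MySQL does not use this function):
>>> u'ß'.casefold()
'ss'
>>> u'straße'.casefold() == u'strasse'.casefold()
True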
So, unless you know exactly which specific languages/characters you are going to use, I recommend utf8_unicode_ci, which has the more ample coverage.
Extracted from MySQL forums.
UTF-8 encompasses most languages, so that's your safest bet. However, there are exceptions, and you need to make sure all the languages you want to cover work in UTF-8. My experience with storing text in character sets MySQL doesn't understand is that it will not be able to sort properly, but the data has remained intact as long as I read it out in the same character encoding I wrote it in.
UTF-8 is the character encoding, a way of storing a number as bytes. Which character is represented by which number is Unicode - an important distinction. Unicode covers a large number of languages and UTF-8 can encode them all (code points 0 to 10FFFF, sort of), but Java can't handle every one as a single char, since the VM's internal representation is a 16-bit character (not that you care about Java :).
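For instance, a character outside the 16-bit range is one code point but needs a surrogate pair (four bytes) in UTF-16, which is exactly what trips up a 16-bit char type; a quick Python check:
>>> len(u'𝄞')                      # U+1D11E MUSICAL SYMBOL G CLEF: one code point
1
>>> len(u'𝄞'.encode('utf-16-be'))  # but four bytes (a surrogate pair) in UTF-16
4
>>> u'𝄞'.encode('utf-8')
b'\xf0\x9d\x84\x9e'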
You can insert text in any language into a MySQL table by changing the collation of the table field to utf8_general_ci. It is case-insensitive.