How to detect Thai language in a SQL query - MySQL

I have a string column in a table, and some of those strings contain Thai text. An example of such a Thai string is:
อักษรไทย
Is there a way to query/find a string like this in a column?

You could search for strings that start with a character in the Thai Unicode block (i.e. between U+0E01 and U+0E5B):
WHERE string BETWEEN 'ก' AND '๛'
Of course this won't include strings that start with some other character and go on to include Thai, such as those that start with a number. For that, you would have to use a much less performant regular expression (it cannot use an index on the column):
WHERE string RLIKE '[ก-๛]'
Note however the warning in the manual:
Warning
The REGEXP and RLIKE operators work in byte-wise fashion, so they are not multi-byte safe and may produce unexpected results with multi-byte character sets. In addition, these operators compare characters by their byte values and accented characters may not compare as equal even if a given collation treats them as equal.
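Putting the two predicates together, here is a minimal, self-contained sketch; the table name samples and column s are invented for illustration, and on servers older than MySQL 8.0 the RLIKE variant is subject to the byte-wise warning quoted above:
CREATE TABLE samples (s VARCHAR(100) CHARACTER SET utf8mb4);
INSERT INTO samples (s) VALUES ('อักษรไทย'), ('hello'), ('123 อักษร');

-- Strings that START with a Thai character; sargable, so an index on s can be used:
SELECT s FROM samples WHERE s BETWEEN 'ก' AND '๛';  -- 'อักษรไทย'

-- Strings containing a Thai character ANYWHERE; forces a full scan:
SELECT s FROM samples WHERE s RLIKE '[ก-๛]';        -- 'อักษรไทย' and '123 อักษร'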

You can do some back-and-forth conversion between character sets. (Note: the CONVERT(value, dest_charset, src_charset) syntax and the character-set names AL32UTF8, TH8TISASCII, and US7ASCII below are Oracle's, not MySQL's; a hedged MySQL analog follows this answer.)
WHERE CONVERT(string, 'AL32UTF8') =
      CONVERT(CONVERT(string, 'TH8TISASCII'), 'AL32UTF8', 'TH8TISASCII')
will be true if string is made only of Thai and ASCII characters. So if you add
AND CONVERT(string, 'AL32UTF8') != CONVERT(string, 'US7ASCII')
you filter out the strings made only of ASCII, and you are left with the strings containing Thai.
Unfortunately, this will not work if your strings contain anything outside of ASCII and Thai.
Note: some of the conversions may be superfluous, depending on your database's default encoding.
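For MySQL specifically, a hedged analog of the same round-trip trick can use CONVERT(... USING ...) with tis620, MySQL's Thai character set: characters that do not fit the target set are replaced with '?', so the round trip preserves a string only if it consists entirely of Thai and ASCII. The samples table and column s from the sketch above are assumed:
SELECT s FROM samples
WHERE CONVERT(CONVERT(s USING tis620) USING utf8mb4) = s   -- survives a Thai+ASCII round trip
  AND CONVERT(CONVERT(s USING ascii) USING utf8mb4) <> s;  -- but is not pure ASCII

As with the Oracle version, this misclassifies strings that mix Thai with other non-ASCII characters.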

Related

MySQL REGEXP word boundary detection with german umlauts when using BINARY Operator

I made a strange discovery. If I execute the following SQL command:
SELECT 'Konzessionäre' REGEXP '[[:<:]]Konzession[[:>:]]'
it gives me the result 0, as expected.
But if I do the same together with the BINARY operator:
SELECT BINARY 'Konzessionäre' REGEXP '[[:<:]]Konzession[[:>:]]'
the result is 1. So I think there is a MySQL problem with regexp word-boundary detection and German umlauts (here the "ä") in conjunction with the BINARY operator. As another example, I can run this query:
SELECT BINARY 'Konzessionsäre' REGEXP '[[:<:]]Konzession[[:>:]]'
Here the result is 0, as I would expect. So how can I solve this? Is this perhaps a bug in MySQL?
Thanks
By casting your string as BINARY you have stripped its associated character set property. So it's unclear how the word-boundary pattern should match. I'd guess it matches only ASCII values A-Z, a-z, 0-9, and also _.
When casting the string as BINARY, MySQL knows nothing about any other higher character values that also should be considered alphanumeric, because which characters should be alphanumeric depends on the character set.
I guess you are using BINARY to make this a case-sensitive regular expression search. Apparently, this has the unintended consequence of spoiling the word-boundary pattern-match.
You should not use BINARY in this comparison. You could do a secondary comparison to check for case-sensitive matching, but not with word boundaries.
SELECT (BINARY 'Konzessionäre' REGEXP 'Konzession') AND ('Konzessionäre' REGEXP '[[:<:]]Konzession[[:>:]]')
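A hedged illustration of how the combined check behaves; the second test string is invented, and the expected results follow the behaviour described in the question rather than being verified on every server version:
SELECT (BINARY 'Konzessionäre' REGEXP 'Konzession')
   AND ('Konzessionäre' REGEXP '[[:<:]]Konzession[[:>:]]');
-- 0: in the non-binary check 'ä' is a word character, so there is no word boundary

SELECT (BINARY 'Die Konzession endet' REGEXP 'Konzession')
   AND ('Die Konzession endet' REGEXP '[[:<:]]Konzession[[:>:]]');
-- 1: the case-sensitive substring match and the whole-word match both succeed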
MySQL's REGEXP works with bytes, not characters. So in CHARACTER SET utf8, ä is 2 bytes, and it is unclear what the definition of "word boundary" is in such a situation.
Recent versions of MariaDB have a better regexp engine (PCRE), and MySQL 8.0 replaced the old byte-based library with ICU, which works on characters.

How to select data with REGEXP without containing Extended ASCII

I just want to display data:
1. containing a-z
2. containing A-Z
3. containing numbers 0-9
4. containing printable ASCII characters, per the following link: http://www.theasciicode.com.ar/extended-ascii-code/latin-diphthong-ae-uppercase-ascii-code-146.html
I am stuck with code like this; please help with point 4 above:
select * from Delin where alamat REGEXP '^ [A-Za-z0-9]'
(Sample raw data and the desired output were shown as images in the original post; the desired output contains only letters, digits, and printable symbols.)
Your items 1–3 (a–z, A–Z, and 0–9) are all subsets of item 4 (printable ASCII characters), so you need only concern yourself with the latter. The following query satisfies that criterion:
SELECT * FROM Delin
WHERE alamat REGEXP '^[ -~]+$';
The character class [ -~] indicates the ASCII characters from space to tilde inclusive, which happens to be all of the printable ASCII characters and no others.
You can see it in an SQL Fiddle here: http://sqlfiddle.com/#!9/6c7b8/1
Terminology note: There is no such thing as "Extended ASCII." The ASCII character set corresponds to the numbers 0–127 inclusive. Any character corresponding to a number greater than 127 is not ASCII. The term "Extended ASCII" is often mistakenly applied to various non-ASCII encodings, none of which is an "extension" of ASCII in any official sense.
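For a self-contained local test without the fiddle, a hypothetical reproduction (the data rows are invented):
CREATE TABLE Delin (alamat VARCHAR(100) CHARACTER SET utf8mb4);
INSERT INTO Delin (alamat) VALUES
  ('Jalan Merdeka 12'),  -- printable ASCII only
  ('Æblegade 4'),        -- contains the non-ASCII character Æ
  ('Tab\there');         -- contains a tab: ASCII, but not printable

SELECT * FROM Delin WHERE alamat REGEXP '^[ -~]+$';
-- returns only 'Jalan Merdeka 12'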

MySQL commands and encoding

I will use PHP notation for variables to make life easy.
Suppose the database is UTF-8, and the client is set to UTF-8.
There are two sides to the question. Knowing that the ASCII code for ' (quote) is 39 decimal:
Client Side
When the query variable $title is escaped using a function such as real_escape_string(), will the function escape every byte with value 39 separately? Or will it check whether a byte with value 39 is part of a UTF-8 symbol?
Server Side
SELECT * from STORIES WHERE title = 'Hello'
What does MySQL assume the query encoding to be? This includes the part:
SELECT * from STORIES WHERE title = '
Then, if $filteredTitle happens to contain a byte with value 39 that is part of a UTF-8 symbol, how does MySQL know that it is not a quote?
Let's look at two issues.
When providing a SELECT statement, strings must be "escaped"; otherwise there would be syntax problems with quotes inside quotes. In particular, ', ", and \ must be preceded by a backslash to avoid confusion. mysqli_real_escape_string() provides that functionality. (Don't use mysql_real_escape_string(); it belongs to the deprecated mysql_* API.) No "un-escaping" is needed when you SELECT the string. Note that UTF-8 makes this unambiguous: every byte of a multi-byte UTF-8 sequence has a value of 128 or above, so a byte with value 39 can only ever be the apostrophe itself, never part of another character's encoding; that is how both the escaping function and the server can treat byte 39 safely.
The ASCII apostrophe (decimal 39, hex 27, sometimes called a "single quote") is commonly used in many programming languages for quoting strings. A long list of utf8 "quotes" can be found here.
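A hedged illustration of both points in plain SQL, assuming a utf8mb4 connection:
SET NAMES utf8mb4;
SELECT HEX('€');      -- E282AC: every byte of a multi-byte character is 0x80 or above
SELECT HEX('it''s');  -- 69742773: a 0x27 byte can only ever be the apostrophe itself
SELECT 'O\'Brien';    -- an escaped literal, as mysqli_real_escape_string() would produce it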

How can I query for text containing Asian-language characters in MySQL?

I have a MySQL table using the UTF-8 character set with a single column called WORDS of type longtext. Values in this column are typed in by users and are a few thousand characters long.
There are two types of rows in this table:
In some rows, the WORDS value has been composed by English speakers and contains only characters used in ordinary English writing. (Not all are necessarily ASCII, e.g. the euro symbol may appear in some cases.)
Other rows have WORDS values written by speakers of Asian languages (Korean, Chinese, Japanese, and possibly others), which include a mix of English words and words in the Asian languages using their native logographic characters (and not, for example, Japanese romaji).
How can I write a query that will return all the rows of type 2 and no rows of type 1? Alternatively, if that's hard, is there a way to query most such rows (here it's OK if I miss a few rows of type 2, or include a few false positives of type 1)?
Update: Comments below suggest I might do better to avoid the MySQL query engine altogether, as its regex support for unicode doesn't sound too good. If that's true, I could extract the data into a file (using mysql -B -e "some SQL here" > extract.txt) and then use perl or similar on the file. An answer using this method would be OK (but not as good as a native MySQL one!)
In theory you could do this:
Find the unicode ranges that you want to test for.
Manually encode the start and end into UTF-8.
Use the first byte of each of the encoded start and end as a range for a REGEXP.
I believe that the CJK range is far enough removed from things like the euro symbol that the false positives and false negatives would be few or none.
Edit: We've now put theory into practice!
Step 1: Choose the character range. I suggest \u3000-\u9fff; easy to test for, and should give us near-perfect results.
Step 2: Encode into bytes. (Wikipedia utf-8 page)
For our chosen range, utf-8 encoded values will always be 3 bytes, the first of which is 1110xxxx, where xxxx is the most significant four bits of the unicode value.
Thus, we want to match bytes in the range 11100011 to 11101001, i.e. 0xe3 to 0xe9.
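To sanity-check the byte math, hex-dump a character from the range; the CJK character below is an arbitrary example:
SELECT HEX(CONVERT('中' USING utf8mb4));
-- E4B8AD: three bytes, with the leading byte 0xE4 inside the [0xE3, 0xE9] range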
Step 3: Make our regexp using the very handy (and just now discovered by me) UNHEX function.
SELECT * FROM `mydata`
WHERE `words` REGEXP CONCAT('[',UNHEX('e3'),'-',UNHEX('e9'),']')
Just tried it out. Works like a charm. :)
You can also use the HEX value of the character: dump the column with SELECT HEX(column) FROM table and compare against the character's hex code, e.g. SELECT * FROM table WHERE HEX(column) LIKE '%<hex code>%'. (Beware that a %...% match on a hex string can straddle byte boundaries, so treat it only as a rough filter.)
This might help as well: http://dev.mysql.com/doc/refman/5.0/en/faqs-cjk.html

MySQL collation to store multilingual data of unknown language

I am new to multilingual data; I confess I have never tried it before.
Currently I am working on a multilingual site, but I do not know which languages will be used.
Which collation/character set of MySQL should I use to achieve this?
Should I use some Unicode type of character set?
And of course these languages are nothing exotic; they will be among the ones most commonly used.
You should use a Unicode collation. You can set it by default on your system, or on each field of your tables. There are the following Unicode collation names, and these are their differences:
utf8_general_ci is a very simple collation. It just:
- removes all accents
- then converts to upper case
and compares using the code of the resulting "base letter".
utf8_unicode_ci uses the default Unicode collation element table.
The main differences are:
utf8_unicode_ci supports so-called expansions and ligatures. For example, the German letter ß (U+00DF LATIN SMALL LETTER SHARP S) is sorted near "ss", and the letter Œ (U+0152 LATIN CAPITAL LIGATURE OE) is sorted near "OE".
utf8_general_ci does not support expansions/ligatures; it sorts all these letters as single characters, sometimes in the wrong order.
utf8_unicode_ci is generally more accurate for all scripts. For example, in the Cyrillic block, utf8_unicode_ci is fine for all of these languages: Russian, Bulgarian, Belarusian, Macedonian, Serbian, and Ukrainian, while utf8_general_ci is fine only for the Russian and Bulgarian subset of Cyrillic; the extra letters used in Belarusian, Macedonian, Serbian, and Ukrainian are not sorted well.
The disadvantage of utf8_unicode_ci is that it is a little slower than utf8_general_ci.
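The expansion difference is easy to see; this follows the classic ß example from the MySQL manual (assuming a utf8 connection):
SET NAMES utf8;
SELECT 'ß' = 's'  COLLATE utf8_general_ci;  -- 1: general_ci compares the stripped "base letter"
SELECT 'ß' = 'ss' COLLATE utf8_unicode_ci;  -- 1: unicode_ci expands ß to "ss"
SELECT 'ß' = 'ss' COLLATE utf8_general_ci;  -- 0: general_ci has no expansion support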
So, since you do not know which specific languages/characters you are going to use, I recommend utf8_unicode_ci, which has broader coverage.
Extracted from MySQL forums.
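As a hedged sketch of where the collation can be set (database, table, or column level; all names are hypothetical):
CREATE DATABASE site CHARACTER SET utf8 COLLATE utf8_unicode_ci;

CREATE TABLE site.pages (
  title VARCHAR(200),
  body  TEXT
) CHARACTER SET utf8 COLLATE utf8_unicode_ci;

ALTER TABLE site.pages
  MODIFY body TEXT CHARACTER SET utf8 COLLATE utf8_unicode_ci;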
UTF-8 encompasses most languages, so that's your safest bet. However, there are exceptions, and you need to make sure all languages you want to cover work in UTF-8. My experience with storing character sets MySQL doesn't understand is that it will not be able to sort properly, but the data has remained intact as long as I read it out in the same character encoding I wrote it in.
UTF-8 is the character encoding: a way of storing a number. Which character is represented by which number is defined by Unicode, an important distinction. Unicode covers a large number of languages, and UTF-8 can encode them all (code points 0 to 10FFFF, more or less), but Java can't handle all of them directly, since the VM's internal representation is a 16-bit character (not that you care about Java :).
You can insert text in any language into a MySQL table by changing the collation of the table field to utf8_general_ci. It is case-insensitive.