MySQL string contains only certain unicode characters - mysql

I need to query the database for entries that contain only a certain set of Unicode Japanese characters and nothing else.
I've tried using WHERE word RLIKE '^([あいうえお])+$' but that doesn't work with Japanese because of lack of Unicode support in MySQL's regex.
Is there any other way to accomplish this?

MySQL is looking at each character as a byte sequence, so あ is 0xE3, 0x81, 0x82 and your [あいうえお] is actually looking for any sequence of bytes 0xE3, 0x81, 0x82, 0x84, 0x86, 0x88 and 0x8A. That will match あ fine, but it will also match other sequences that don't correspond to a single character in the list, for example 0xE3, 0x82, 0x81 which is め.
An alternative way of saying [あいうえお] that would still work when each character is considered by the regex engine as being more than one symbol would be (あ|い|う|え|お).
SELECT 'あ' RLIKE '^([あいうえお])+$'; -- 1
SELECT 'め' RLIKE '^([あいうえお])+$'; -- 1
SELECT 'あ' RLIKE '^(あ|い|う|え|お)+$'; -- 1
SELECT 'め' RLIKE '^(あ|い|う|え|お)+$'; -- 0

Related

SQL how to SELECT all utf8mb4 characters?

I have this query:
SELECT count(*) from TABLE WHERE LENGTH(COLUMN) !=CHAR_LENGTH(COLUMN);
If count returns a value more than zero it tell me that I have non-ASCII characters in some row.
How I can know if I have utf8mb4 characters in the TABLE?
Is there a way to query all utf8mb4 characters?
It depends on what you mean by "utf8mb4 characters". This sentence is entirely composed of "utf8mb4 characters". This sentence is entirely composed of "ascii" characters.
Assuming you meant "non-ASCII" and the column is CHARACTER SET utf8mb4, then your query should work fine.
This technique works for any of the multi-byte character sets, such as utf8, big5, etc. It does not work for single-byte character sets such as latin1, latin5, etc.
If you want to extract the non-ascii bytes from the column, that would be better done in some application language. It might have a straightforward way of doing that, or you could fetch the HEX and look for some pair of hex with regexp [CDEF].
If you meant "utf8mb4" but not "utf8", then the hex would be F. And the row can be discovered via
HEX(col) RLIKE "^(..)*F."

Regex returning inexplicable results (to me)

I want to return entries from a table that match the format:
prefix + optional spaces + Thai digit
Testing using ยก as the prefix I use the following SQL
SELECT term
FROM entries
WHERE term REGEXP "^ยก[\s]*[๐-๙]+$"
This returns 9 entries, 4 of which don't have the correct prefix, and none of them ends in a digit.
ยกนะ
ยกบัตร
ยกมือ
ยกยอ
ยกยอด
ยกหยิบ
ยมทูต
ยมนา
ยมบาล
ยมล
It doesn't return
ยก ๑
ยก ๒
which I know are in the database and are the entries I want.
I'm very new to all this. What am I doing wrong?
FWIW, this is against a MySQL database and everything is in Unicode.
Thanks
As quoted from the MySQL docs:
The REGEXP and RLIKE operators work in byte-wise fashion, so they are not multi-byte safe and may produce unexpected results with multi-byte character sets. In addition, these operators compare characters by their byte values and accented characters may not compare as equal even if a given collation treats them as equal.
Doesn't seem like MySQL's REGEXP can handle the [๐-๙] range correctly due to the above.
I use utf8_general_ci and try.I matched
ยกนะ
with "^ยก[\s]*[๐-๙]+$" but did't matched ยก ๑.So I change the regexp to
"^ยก[ ]*[๐-๙]+$"
,and it can match
ยกนะ
ยก ๑
Maybe the problem is character encoding.

How to detect rows with chinese characters in MySQL?

How can I detect and delete rows with Chinese characters in MySQL?
Here is the Table "Chinese_Test" Contains the Chinese Character on my PhpMyAdmin
Data:
Structure
notice my type of Collation is utf8, thus let's take a look at the Chinese Characters in utf8 table.
http://www.ansell-uebersetzungen.com/gbuni.html
Notice the Chinese Character is from E4 to E9, hence we use the code
select number
from Chinese_Test
where HEX(contents) REGEXP '^(..)*(E[4-9])';
and here is the result:
If all the other rows have alphanumeric values try the following:
DELETE FROM tableName WHERE NOT columnToCheck REGEXP '[A-Za-z0-9.,-]';
Do check the results before deletion, using the following:
SELECT * FROM tableName WHERE NOT columnToCheck REGEXP '[A-Za-z0-9.,-]';
I don't have an answer, but to provide you with a starting point: Chinese characters will occupy certain blocks in the UTF-8 character set. Example
You would have to query for rows that contain characters between the first and the last point of that block. I can't think of a way to automate this though (i.e. to query for characters inside a certain range without naming each character explicitly).
Another untested idea that comes to mind is using iconv() to convert the string to a specifically Chinese encoding, using //IGNORE, and seeing whether any data is left. If anything is left, the string may contain chinese characters.... although this would probably be disrupted by any numbers inside the string,
It's an interesting problem.

Match beginning of words in Mysql for UTF8 strings

I m trying to match beignning of words in a mysql column that stores strings as varchar. Unfortunately, REGEXP does not seem to work for UTF-8 strings as mentioned here
So,
select * from names where name REGEXP '[[:<:]]Aandre';
does not work if I have name like Foobar Aándreas
However,
select * from names where name like '%andre%'
matches the row I need but does not guarantee beginning of words matches.
Is it better to do the like and filter it out on the application side ? Any other solutions?
A citation from tha page you mentioned:
Warning
The REGEXP and RLIKE operators work in byte-wise fashion, so they are not multi-byte safe and may produce unexpected results with multi-byte character sets. In addition, these operators compare characters by their byte values and accented characters may not compare as equal even if a given collation treats them as equal.
select * from names where name like 'andre%'
select * from names where name like 'andre%' is not solution for eg:
name = 'richard andrew', because the string begining with richa... and not with andre...
for the moment, the temporaly solution, for search words (words != string) starting with a string
select * from names where name REGEXP '[[:<:]]andre';
But it no matching with accented words, eg: ándrew.
Any other solution, with regular expressions (mysql) to search in accented words?

How can I find non-ASCII characters in MySQL?

I'm working with a MySQL database that has some data imported from Excel. The data contains non-ASCII characters (em dashes, etc.) as well as hidden carriage returns or line feeds. Is there a way to find these records using MySQL?
MySQL provides comprehensive character set management that can help with this kind of problem.
SELECT whatever
FROM tableName
WHERE columnToCheck <> CONVERT(columnToCheck USING ASCII)
The CONVERT(col USING charset) function turns the unconvertable characters into replacement characters. Then, the converted and unconverted text will be unequal.
See this for more discussion. https://dev.mysql.com/doc/refman/8.0/en/charset-repertoire.html
You can use any character set name you wish in place of ASCII. For example, if you want to find out which characters won't render correctly in code page 1257 (Lithuanian, Latvian, Estonian) use CONVERT(columnToCheck USING cp1257)
You can define ASCII as all characters that have a decimal value of 0 - 127 (0x00 - 0x7F) and find columns with non-ASCII characters using the following query
SELECT * FROM TABLE WHERE NOT HEX(COLUMN) REGEXP '^([0-7][0-9A-F])*$';
This was the most comprehensive query I could come up with.
It depends exactly what you're defining as "ASCII", but I would suggest trying a variant of a query like this:
SELECT * FROM tableName WHERE columnToCheck NOT REGEXP '[A-Za-z0-9]';
That query will return all rows where columnToCheck contains any non-alphanumeric characters. If you have other characters that are acceptable, add them to the character class in the regular expression. For example, if periods, commas, and hyphens are OK, change the query to:
SELECT * FROM tableName WHERE columnToCheck NOT REGEXP '[A-Za-z0-9.,-]';
The most relevant page of the MySQL documentation is probably 12.5.2 Regular Expressions.
This is probably what you're looking for:
select * from TABLE where COLUMN regexp '[^ -~]';
It should return all rows where COLUMN contains non-ASCII characters (or non-printable ASCII characters such as newline).
One missing character from everyone's examples above is the termination character (\0). This is invisible to the MySQL console output and is not discoverable by any of the queries heretofore mentioned. The query to find it is simply:
select * from TABLE where COLUMN like '%\0%';
Based on the correct answer, but taking into account ASCII control characters as well, the solution that worked for me is this:
SELECT * FROM `table` WHERE NOT `field` REGEXP "[\\x00-\\xFF]|^$";
It does the same thing: searches for violations of the ASCII range in a column, but lets you search for control characters too, since it uses hexadecimal notation for code points. Since there is no comparison or conversion (unlike #Ollie's answer), this should be significantly faster, too. (Especially if MySQL does early-termination on the regex query, which it definitely should.)
It also avoids returning fields that are zero-length. If you want a slightly-longer version that might perform better, you can use this instead:
SELECT * FROM `table` WHERE `field` <> "" AND NOT `field` REGEXP "[\\x00-\\xFF]";
It does a separate check for length to avoid zero-length results, without considering them for a regex pass. Depending on the number of zero-length entries you have, this could be significantly faster.
Note that if your default character set is something bizarre where 0x00-0xFF don't map to the same values as ASCII (is there such a character set in existence anywhere?), this would return a false positive. Otherwise, enjoy!
Try Using this query for searching special character records
SELECT *
FROM tableName
WHERE fieldName REGEXP '[^a-zA-Z0-9#:. \'\-`,\&]'
#zende's answer was the only one that covered columns with a mix of ascii and non ascii characters, but it also had that problematic hex thing. I used this:
SELECT * FROM `table` WHERE NOT `column` REGEXP '^[ -~]+$' AND `column` !=''
In Oracle we can use below.
SELECT * FROM TABLE_A WHERE ASCIISTR(COLUMN_A) <> COLUMN_A;
for this question we can also use this method :
Question from sql zoo:
Find all details of the prize won by PETER GRÜNBERG
Non-ASCII characters
ans: select*from nobel where winner like'P% GR%_%berg';