How to detect rows with Chinese characters in MySQL?

How can I detect and delete rows with Chinese characters in MySQL?

Here is the table "Chinese_Test", which contains Chinese characters, as shown in phpMyAdmin.
Data:
Structure:
Notice that the collation is utf8, so let's look at how Chinese characters are encoded in UTF-8:
http://www.ansell-uebersetzungen.com/gbuni.html
Notice that the lead bytes of the Chinese characters run from E4 to E9, hence we use this query:
select number
from Chinese_Test
where HEX(contents) REGEXP '^(..)*(E[4-9])';
and it returns the rows that contain Chinese characters.
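As a rough cross-check outside MySQL, the hex-and-regex test can be mimicked in Python. This is only a sketch: `looks_chinese` is a made-up helper name, and it assumes the column bytes are UTF-8, as in the table above. Python's `bytes.hex()` is lowercase, so the value is upper-cased to resemble what HEX() returns.

```python
import re

# Same pattern as the query: an E4..E9 byte at an even hex offset,
# i.e. aligned on a byte boundary.
LEAD_BYTE = re.compile(r'^(..)*(E[4-9])')

def looks_chinese(text):
    """Mimic HEX(contents) REGEXP '^(..)*(E[4-9])' for UTF-8 data (hypothetical helper)."""
    hex_string = text.encode('utf-8').hex().upper()
    return LEAD_BYTE.search(hex_string) is not None

print(looks_chinese('abc 中文'))  # lead bytes E4 and E6 are present
print(looks_chinese('abc 123'))   # plain ASCII, no such lead byte
print(looks_chinese('nA'))        # hex is 6E41: "E4" is not byte-aligned
```

Lead bytes E4 to E9 correspond to U+4000 through U+9FFF, so the test is somewhat broader than the CJK Unified Ideographs block (U+4E00 to U+9FFF). It also only makes sense for UTF-8 data; in a single-byte encoding such as latin1 the same byte values are ordinary accented letters.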

If all the other rows contain alphanumeric values, try the following. It deletes every row that contains none of the listed characters:
DELETE FROM tableName WHERE NOT columnToCheck REGEXP '[A-Za-z0-9.,-]';
Check the results before deleting, using the following:
SELECT * FROM tableName WHERE NOT columnToCheck REGEXP '[A-Za-z0-9.,-]';
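To see which rows that condition would pick, here is a small Python sketch of the same test. The helper name `would_be_deleted` and the sample rows are invented for illustration; the character class is copied from the query.

```python
import re

# Character class copied from the query above.
ALLOWED = re.compile(r'[A-Za-z0-9.,-]')

def would_be_deleted(value):
    """True when the value contains none of the listed characters (hypothetical helper)."""
    return ALLOWED.search(value) is None

rows = ['hello', '中文', '中文 v2', '']
print([row for row in rows if would_be_deleted(row)])
```

A row that mixes Chinese with ASCII letters or digits (such as '中文 v2') is not selected, and an empty string is, which is one more reason to run the SELECT first.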

I don't have an answer, but here is a starting point: Chinese characters occupy certain blocks in Unicode.
You would have to query for rows that contain characters between the first and the last point of that block. I can't think of a way to automate this though (i.e. to query for characters inside a certain range without naming each character explicitly).
Another untested idea that comes to mind is using iconv() to convert the string to a specifically Chinese encoding, using //IGNORE, and seeing whether any data is left. If anything is left, the string may contain Chinese characters, although this would probably be disrupted by any numbers inside the string.
It's an interesting problem.
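If the check can be done in application code rather than in SQL, the "between the first and the last point of that block" idea is easy to sketch. The Python below assumes the block of interest is CJK Unified Ideographs (U+4E00 to U+9FFF); `contains_cjk` is a hypothetical name, and other CJK blocks are deliberately left out.

```python
# CJK Unified Ideographs block; extension blocks are not covered in this sketch.
CJK_FIRST, CJK_LAST = 0x4E00, 0x9FFF

def contains_cjk(text):
    """True if any code point of text falls inside the block (hypothetical helper)."""
    return any(CJK_FIRST <= ord(ch) <= CJK_LAST for ch in text)

print(contains_cjk('abc 中'))
print(contains_cjk('abc 123'))
```

Unlike the iconv() idea, digits and ASCII letters in the string do not interfere, because only the code point range is tested.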

Related

Why can no collation other than 'utf8mb4_0900_bin' properly compare strings that contain ASCII control characters?

This question is an extension of the following question - How to make mysql consider the control characters when doing string comparison?
Here is my query -
SELECT 'abc' < 'abcSOH' COLLATE utf8mb4_0900_bin;
Here SOH is Start of Heading, an ASCII control character with code 1. I expect this query to return 1, because the second string is longer (4 characters). I have even tried with a space (ASCII code 32), with the same result!
If you check this fiddle, you can see only the 'utf8mb4_0900_bin' collation gives the expected result. All other collations that I have tested give the opposite result.
https://dbfiddle.uk/mDLVWOZG
I have gone through the documentation and could not find the reason behind this. Can anyone please explain why is this?
I am interested to know this because I would like to use a 1-byte character set (and corresponding collation) instead of a 4-byte character set because I have some legacy tables (converting to MySQL) that have a lot of columns and if I use a 4-byte character set, it gives an error that the row is too big.
Each column can have its own CHARACTER SET and COLLATION. But different rows must agree.
CREATE TABLE provides only "defaults" for those settings -- these defaults are used if you don't override them when declaring the individual columns.
So, legacy columns may as well be declared with whatever antique charset was used. (Sorry, EBCDIC is not available.)
All the "printable" characters of ASCII are available in UTF-8 (MySQL's utf8/utf8mb3/utf8mb4). In fact, the binary encoding is identical.
The "control characters" -- well, stick with ascii or latin1 (perhaps with latin1_bin).
Any _bin collation says to simply look at the bits.
I do not know if control characters are turned into space (hex 20) when INSERTing into a UTF-8 column.
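One possible explanation, which the answer above does not mention, is the collation's pad attribute: MySQL documents most collations as PAD SPACE and the UCA 9.0.0-based utf8mb4_0900_* collations as NO PAD. The Python sketch below is only a model of that idea on raw bytes, with made-up function names. It assumes a PAD SPACE comparison behaves as if the shorter string were extended with spaces, and it ignores collation weights entirely.

```python
def less_than_pad_space(left, right):
    """Model of a PAD SPACE comparison: pad the shorter value with spaces first."""
    width = max(len(left), len(right))
    return left.ljust(width, b' ') < right.ljust(width, b' ')

def less_than_no_pad(left, right):
    """Model of a NO PAD comparison: compare the bytes as they are."""
    return left < right

soh = b'abc\x01'  # 'abc' followed by SOH (code 1)
print(less_than_pad_space(b'abc', soh))  # the padding space (0x20) sorts after 0x01
print(less_than_no_pad(b'abc', soh))     # a proper prefix sorts first
```

Under this model 'abc' is not less than 'abc' plus SOH with PAD SPACE, and it is also not less than 'abc' plus a trailing space (the two compare equal), which would be consistent with what the fiddle shows.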

Regex returning inexplicable results (to me)

I want to return entries from a table that match the format:
prefix + optional spaces + Thai digit
Testing using ยก as the prefix I use the following SQL
SELECT term
FROM entries
WHERE term REGEXP "^ยก[\s]*[๐-๙]+$"
This returns 9 entries, 4 of which don't have the correct prefix, and none of them ends in a digit.
ยกนะ
ยกบัตร
ยกมือ
ยกยอ
ยกยอด
ยกหยิบ
ยมทูต
ยมนา
ยมบาล
ยมล
It doesn't return
ยก ๑
ยก ๒
which I know are in the database and are the entries I want.
I'm very new to all this. What am I doing wrong?
FWIW, this is against a MySQL database and everything is in Unicode.
Thanks
As quoted from the MySQL docs:
The REGEXP and RLIKE operators work in byte-wise fashion, so they are not multi-byte safe and may produce unexpected results with multi-byte character sets. In addition, these operators compare characters by their byte values and accented characters may not compare as equal even if a given collation treats them as equal.
Doesn't seem like MySQL's REGEXP can handle the [๐-๙] range correctly due to the above.
I tried this with utf8_general_ci. The pattern matched
ยกนะ
with "^ยก[\s]*[๐-๙]+$" but didn't match ยก ๑. So I changed the regexp to
"^ยก[ ]*[๐-๙]+$"
and now it matches
ยกนะ
ยก ๑
Maybe the problem is character encoding.
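The byte-wise behaviour described in the quoted documentation can be imitated in Python by compiling the pattern as UTF-8 bytes. This is a sketch, not MySQL's regex engine: `bytewise` is an invented helper, and `[\s]` is written as `[s]` on the assumption that MySQL, in its default SQL mode, drops the backslash of an unrecognised escape in a string literal.

```python
import re

def bytewise(pattern):
    """Compile a pattern so it matches on UTF-8 bytes, not characters (hypothetical helper)."""
    return re.compile(pattern.encode('utf-8'))

# Thai digits are E0 B9 90 .. E0 B9 99 in UTF-8, so as bytes the class
# [๐-๙] becomes: E0, B9, the byte range 90-E0, B9, 99.
original = bytewise('^ยก[s]*[๐-๙]+$')   # "\s" assumed to arrive as plain "s"
with_space = bytewise('^ยก[ ]*[๐-๙]+$')

for term in ['ยกนะ', 'ยก ๑']:
    data = term.encode('utf-8')
    print(term, bool(original.search(data)), bool(with_space.search(data)))
```

Because every byte of a Thai letter such as น (E0 B8 99) falls inside the 90-E0 range, the "digit" class accepts ordinary Thai letters, and the space never matches `[s]`. That reproduces the two observations in the last answer; it does not account for the rows starting with ยม.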

MySQL Query to Identify bad characters?

We have some tables that were set up with the Latin character set instead of UTF-8, which allowed bad characters to be entered into the tables. The usual culprit is people copying and pasting from Word or Outlook, which copies those nasty hidden characters...
Is there any query we can use to identify these characters to clean them?
Thanks,
I assume that your connection character set was set to UTF8 when you filled the data in.
MySQL replaces unconvertible characters with ? (question marks):
SELECT CONVERT('тест' USING latin1);
----
????
The problem is distinguishing legitimate question marks from illegitimate ones.
Usually, the question marks in the beginning of a word are a bad sign, so this:
SELECT *
FROM mytable
WHERE myfield RLIKE '\\?[[:alnum:]]'
should give a good start.
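The same effect is easy to reproduce outside the database. In the Python sketch below, `to_latin1` stands in for CONVERT(... USING latin1) and `SUSPICIOUS` approximates the RLIKE pattern with ASCII letters and digits only; both names are made up for illustration.

```python
import re

def to_latin1(text):
    """Replace anything latin1 cannot represent with '?' (hypothetical helper)."""
    return text.encode('latin-1', errors='replace').decode('latin-1')

# A question mark followed directly by a letter or digit.
SUSPICIOUS = re.compile(r'\?[A-Za-z0-9]')

print(to_latin1('тест'))
print(bool(SUSPICIOUS.search('Is it? Yes')))   # legitimate question mark
print(bool(SUSPICIOUS.search('na?ve caf?s')))  # probably a lost character
```

A value made only of replaced characters, such as the converted 'тест', has no letter after any question mark, so this heuristic would miss it.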
You're probably noticing something like this 'bug'. The 'bad characters' are most likely UTF-8 control characters (eg \x80). You might be able to identify them using a query like
SELECT bar FROM foo WHERE LOCATE(UNHEX('80'), bar) != 0;
From that linked bug, they recommend using type BLOB to store text from windows files:
Use BLOB (with additional encoding field) instead of TEXT if you need to store windows files (even text files). Better than 3-byte UTF-8 and multi-tier encoding overhead.
Take a look at this Q/A (it's all about your client encoding aka SET NAMES )

MySQL won't replace words with empty space

Basically, I have a problem with replace() function in MySQL (via phpMyAdmin). One table got messed and some special characters (+ empty space after it) appeared inside a word. So all I wanted to do was:
UPDATE myTable SET columnName =
(replace(columnName, 'Å house',
'house'))
But MySQL returns
0 row(s) affected. ( Query took 0.0107 sec )
The same is when I try to replace foreign towns with special characters in the name of a town (Swedish town, German town, etc.)
Am I doing something wrong???
Å house
Is likely to actually be:
Å house
That is, with a U+00A0 Non Break Space character and not a normal space. Of course normally you cannot see the difference, but a string replace can and won't touch it.
This was probably originally just a single non-breaking-space character, that has been mangled through a classic UTF-8-read-as-ISO-8859-1 encoding screw-up. Other non-ASCII characters in your database are likely to have been similarly messed up.
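A short Python sketch shows why the replacement finds nothing. It assumes, as suggested above, that the stored value has U+00A0 between the two parts; the variable names are illustrative only.

```python
NBSP = '\u00a0'  # non-breaking space, looks like a normal space when printed

stored = 'Å' + NBSP + 'house'

# Searching with an ordinary space leaves the value untouched.
unchanged = stored.replace('Å house', 'house')
print(unchanged == stored)

# Searching with the real character works.
fixed = stored.replace('Å' + NBSP + 'house', 'house')
print(fixed)
```

In MySQL, comparing HEX(columnName) for an affected row against the hex of the search string is one way to confirm which bytes are actually stored.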

How can I find non-ASCII characters in MySQL?

I'm working with a MySQL database that has some data imported from Excel. The data contains non-ASCII characters (em dashes, etc.) as well as hidden carriage returns or line feeds. Is there a way to find these records using MySQL?
MySQL provides comprehensive character set management that can help with this kind of problem.
SELECT whatever
FROM tableName
WHERE columnToCheck <> CONVERT(columnToCheck USING ASCII)
The CONVERT(col USING charset) function turns the unconvertible characters into replacement characters. Then, the converted and unconverted text will be unequal.
See this for more discussion. https://dev.mysql.com/doc/refman/8.0/en/charset-repertoire.html
You can use any character set name you wish in place of ASCII. For example, if you want to find out which characters won't render correctly in code page 1257 (Lithuanian, Latvian, Estonian) use CONVERT(columnToCheck USING cp1257)
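The compare-after-conversion idea can be sketched in Python as follows. `differs_after_conversion` is a hypothetical helper, and Python's 'replace' error handler stands in for the replacement characters that CONVERT produces.

```python
def differs_after_conversion(text, charset='ascii'):
    """True if converting to charset and back changes the text (hypothetical helper)."""
    converted = text.encode(charset, errors='replace').decode(charset)
    return text != converted

print(differs_after_conversion('plain text'))
print(differs_after_conversion('range 1\u20142'))  # contains an em dash
print(differs_after_conversion('line one\r\n'))    # CR and LF are ASCII
```

As the last call suggests, carriage returns and line feeds are themselves ASCII, so this kind of check finds the em dashes but not the hidden line breaks mentioned in the question.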
You can define ASCII as all characters that have a decimal value of 0 - 127 (0x00 - 0x7F) and find columns with non-ASCII characters using the following query
SELECT * FROM TABLE WHERE NOT HEX(COLUMN) REGEXP '^([0-7][0-9A-F])*$';
This was the most comprehensive query I could come up with.
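For readers who want to test the pattern without a database, here is a Python sketch of the same hex check. `has_non_ascii_bytes` is an invented name; the hex string is upper-cased to resemble HEX() output.

```python
import re

# Every byte must start with a hex digit 0-7, i.e. be in 0x00..0x7F.
ASCII_HEX = re.compile(r'^([0-7][0-9A-F])*$')

def has_non_ascii_bytes(data):
    """Mimic NOT HEX(col) REGEXP '^([0-7][0-9A-F])*$' (hypothetical helper)."""
    return ASCII_HEX.match(data.hex().upper()) is None

print(has_non_ascii_bytes('abc\r\n'.encode('utf-8')))
print(has_non_ascii_bytes('café'.encode('utf-8')))
```

Control characters such as CR and LF pass this test because they are within 0x00 to 0x7F.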
It depends on exactly what you define as "ASCII", but I would suggest trying a variant of a query like this:
SELECT * FROM tableName WHERE columnToCheck REGEXP '[^A-Za-z0-9]';
That query will return all rows where columnToCheck contains any non-alphanumeric characters. If you have other characters that are acceptable, add them to the character class in the regular expression. For example, if periods, commas, and hyphens are OK, change the query to:
SELECT * FROM tableName WHERE columnToCheck REGEXP '[^A-Za-z0-9.,-]';
The most relevant page of the MySQL documentation is probably 12.5.2 Regular Expressions.
This is probably what you're looking for:
select * from TABLE where COLUMN regexp '[^ -~]';
It should return all rows where COLUMN contains non-ASCII characters (or non-printable ASCII characters such as newline).
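The character class `[^ -~]` means "anything outside the range from space (0x20) to tilde (0x7E)". A minimal Python sketch of the same test, with an illustrative constant name:

```python
import re

# Anything that is not a printable ASCII character (space through tilde).
NOT_PRINTABLE_ASCII = re.compile(r'[^ -~]')

for value in ['plain text', 'two\nlines', 'naïve']:
    print(repr(value), bool(NOT_PRINTABLE_ASCII.search(value)))
```

Both the embedded newline and the accented letter are reported, while the plain string is not.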
One missing character from everyone's examples above is the termination character (\0). This is invisible to the MySQL console output and is not discoverable by any of the queries heretofore mentioned. The query to find it is simply:
select * from TABLE where COLUMN like '%\0%';
Based on the correct answer, but taking into account ASCII control characters as well, the solution that worked for me is this:
SELECT * FROM `table` WHERE NOT `field` REGEXP "[\\x00-\\xFF]|^$";
It does the same thing: searches for violations of the ASCII range in a column, but lets you search for control characters too, since it uses hexadecimal notation for code points. (Strictly speaking, ASCII ends at 0x7F; the range above runs to 0xFF, so it covers every single-byte value.) Since there is no comparison or conversion (unlike @Ollie's answer), this should be significantly faster, too. (Especially if MySQL does early-termination on the regex query, which it definitely should.)
It also avoids returning fields that are zero-length. If you want a slightly-longer version that might perform better, you can use this instead:
SELECT * FROM `table` WHERE `field` <> "" AND NOT `field` REGEXP "[\\x00-\\xFF]";
It does a separate check for length to avoid zero-length results, without considering them for a regex pass. Depending on the number of zero-length entries you have, this could be significantly faster.
Note that if your default character set is something bizarre where 0x00-0xFF don't map to the same values as ASCII (is there such a character set in existence anywhere?), this would return a false positive. Otherwise, enjoy!
Try using this query to find records that contain special characters:
SELECT *
FROM tableName
WHERE fieldName REGEXP '[^a-zA-Z0-9#:. \'\-`,\&]'
@zende's answer was the only one that covered columns with a mix of ASCII and non-ASCII characters, but it also had that problematic hex thing. I used this:
SELECT * FROM `table` WHERE NOT `column` REGEXP '^[ -~]+$' AND `column` !=''
In Oracle we can use below.
SELECT * FROM TABLE_A WHERE ASCIISTR(COLUMN_A) <> COLUMN_A;
For this question we can also use the following method.
Question from SQLZoo:
Find all details of the prize won by PETER GRÜNBERG
Non-ASCII characters
Answer: SELECT * FROM nobel WHERE winner LIKE 'P% GR%_%berg';
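To illustrate how the wildcards step around the non-ASCII letter, the sketch below translates a LIKE pattern into a Python regular expression. `like_to_regex` is a hypothetical helper; it ignores LIKE escape characters and assumes a case-insensitive collation.

```python
import re

def like_to_regex(pattern):
    """Translate % and _ into regex equivalents (hypothetical helper, no ESCAPE support)."""
    parts = []
    for ch in pattern:
        if ch == '%':
            parts.append('.*')   # any run of characters, possibly empty
        elif ch == '_':
            parts.append('.')    # exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile(''.join(parts), re.IGNORECASE | re.DOTALL)

winner = like_to_regex('P% GR%_%berg')
print(bool(winner.fullmatch('PETER GRÜNBERG')))
print(bool(winner.fullmatch('PETER GRBERG')))  # nothing for "_" to consume
```

The `_` forces at least one character between "GR" and "berg", so the Ü never has to be typed in the query.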