MySQL: select all rows with Asian characters - mysql

On a database with customer information, in a table where names and addresses are a mix of Latin and Asian characters, I'd like to select all rows that do (or don't) contain any Asian characters. The data is UTF-8 encoded. Is that possible with MySQL itself, or do I need to write a custom script using PHP / Perl?

You might be able to do this with regular expressions. The idea is to list all the simple characters that might be in the string and negate that set with ^. So, to find unexpected (i.e. "Asian") characters:
where col regexp '.*[^0-9a-zA-Z.,:()& ].*'
The .* at the beginning and end are not necessary, but I like to have them so the patterns are similar to LIKE patterns.
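As a rough sketch against the table from the question (customers and name are placeholder names, not from the original post), the rows that do and don't contain characters outside that simple set could be selected like this:
-- rows whose name contains at least one character outside the allowed set
SELECT * FROM customers WHERE name REGEXP '[^0-9a-zA-Z.,:()& ]';
-- rows whose name contains only characters from the allowed set
SELECT * FROM customers WHERE name NOT REGEXP '[^0-9a-zA-Z.,:()& ]';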

The page linked to in amdixon's comment had a working answer. Here it is so that we have it on SO:
To select all rows with non-Latin characters in col:
SELECT *
FROM table
WHERE col != CONVERT(col USING latin1)
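For the complementary set (rows that don't contain any non-Latin characters), flipping the comparison should work; a sketch with the same placeholder names:
SELECT *
FROM table
WHERE col = CONVERT(col USING latin1)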

Related

How to perform a multi-byte safe SQL REGEXP query?

I have the following SQL query to find the dictionary words that contain specific letters.
It's working fine in the English dictionary:
SELECT word
FROM english_dictionary
WHERE word REGEXP '[abcdef]'
But running the same query on the Slovak dictionary, which includes UTF-8 accented letters, doesn't work:
SELECT word
FROM slocak_dictionary
WHERE word REGEXP '[áäčďéóú]'
I've searched everywhere and can't find the answer to this issue. If I use LIKE, it works, but the query gets very ugly:
SELECT word
FROM slocak_dictionary
WHERE
word LIKE '%á%'
AND word LIKE '%ä%'
AND word LIKE '%č%'
AND word LIKE '%ď%'
AND word LIKE '%é%'
AND word LIKE '%ó%'
AND word LIKE '%ú%'
Because I deal with many letters that need to be excluded from or included in the query, breaking it down like this is not very elegant.
Is there any way to perform a multi-byte safe SQL REGEXP query on MySQL?
MariaDB has better support of REGEXP.
In MySQL, this will test for word having any of those accented characters:
HEX(word) REGEXP '^(..)*(C3A1|C3A4|C48D|C48F|C3A9|C3B3|C3BA)'
The ^(..)* is to make sure the subsequent test is byte (2 hex chars) aligned.
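Put together, a query using that test might look like this (a sketch reusing the dictionary table from the question):
SELECT word
FROM slocak_dictionary
-- the hex pairs are the UTF-8 encodings of á, ä, č, ď, é, ó, ú
WHERE HEX(word) REGEXP '^(..)*(C3A1|C3A4|C48D|C48F|C3A9|C3B3|C3BA)'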
You can see those utf8 encodings by doing something like
SELECT HEX('áäčďéóú');
(Your attempt with LIKE should have said OR instead of AND.)

Mysql replace all special unicode characters with their ascii counterpart

I have a field with collation utf8_general_ci in which many values contain non-ASCII characters. I want to
1. Search for all fields with any non-ASCII characters
2. Replace all non-ASCII characters with their corresponding ASCII version.
For example: côte-d'ivoire should be replaced with cote-d'ivoire, são-tomé should be replaced with sao-tome, etc.
How do I achieve this? If I just change the field type to ascii, non-ascii characters get replaced by '?'. I am not even able to search for all such fields using
RLIKE '%[^a-z]%'
For example
SELECT columname
FROM tablename
WHERE NOT columname REGEXP '[a-z]';
returns an empty set.
Thanks
An SQL Fiddle example is at
http://www.sqlfiddle.com/#!2/c1d90/1/0
The query to select such rows is:
select * from test where maintext rlike '[^\x00-\x7F]'
Hope this helps
I'm presuming from your previous questions that you're using PHP.
https://github.com/silverstripe-labs/silverstripe-unidecode
You could then use skv's answer to return the rows you wish to fix, and then use unidecode to attempt to convert each value to its ASCII equivalent.
In Perl, you can use Text::Unidecode.
In MySQL, there isn't an easy function to convert from utf8 (or utf8mb4) into ascii without it spitting out some ugly '?' characters as replacements. It's best to replace these prior to inserting them in the DB, or run something in Perl (or whatever) to extract the data and re-update it one row at a time.
There are many different ports of Text::Unidecode in different languages: Python, PHP, Java, Ruby, JavaScript, Haskell, C#, Clojure, Go.

How to detect rows with chinese characters in MySQL?

How can I detect and delete rows with Chinese characters in MySQL?
Here is the table "Chinese_Test", which contains Chinese characters, as seen in phpMyAdmin (data and structure screenshots omitted).
Notice that the collation is utf8, so let's take a look at the Chinese characters in a UTF-8 table:
http://www.ansell-uebersetzungen.com/gbuni.html
Notice that the leading byte of Chinese characters runs from E4 to E9, hence we use:
select number
from Chinese_Test
where HEX(contents) REGEXP '^(..)*(E[4-9])';
and the result lists the numbers of the rows containing Chinese characters (screenshot omitted).
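Once the SELECT returns exactly the rows you expect, the DELETE can reuse the same condition; a sketch against the same Chinese_Test table:
-- remove rows whose contents include Chinese characters (leading byte E4..E9)
DELETE FROM Chinese_Test
WHERE HEX(contents) REGEXP '^(..)*(E[4-9])';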
If all the other rows have alphanumeric values try the following:
DELETE FROM tableName WHERE NOT columnToCheck REGEXP '[A-Za-z0-9.,-]';
Do check the results before deletion, using the following:
SELECT * FROM tableName WHERE NOT columnToCheck REGEXP '[A-Za-z0-9.,-]';
I don't have an answer, but to provide you with a starting point: Chinese characters occupy certain blocks of the Unicode character set (the CJK Unified Ideographs block, for example).
You would have to query for rows that contain characters between the first and the last point of that block. I can't think of a way to automate this though (i.e. to query for characters inside a certain range without naming each character explicitly).
Another untested idea that comes to mind is using iconv() to convert the string to a specifically Chinese encoding, using //IGNORE, and seeing whether any data is left. If anything is left, the string may contain Chinese characters... although this would probably be disrupted by any numbers inside the string.
It's an interesting problem.

How to make MySQL aware of multi-byte characters in LIKE and REGEXP?

I have a MySQL table with two columns, both collated as utf8_unicode_ci. Besides ASCII, the second field also contains Unicode code points like U+02C8 (MODIFIER LETTER VERTICAL LINE) and U+02D0 (MODIFIER LETTER TRIANGULAR COLON).
word | ipa
--------+----------
Hallo | haˈloː
IPA | ˌiːpeːˈʔaː
I need to search the second field with LIKE and REGEXP, but MySQL (5.0.77) seems to interpret these fields as bytes, not as characters.
SELECT * FROM pronunciation WHERE ipa LIKE '%ha_lo%'; -- 0 rows
SELECT * FROM pronunciation WHERE ipa LIKE '%ha__lo%'; -- 1 row
SELECT * FROM pronunciation WHERE ipa REGEXP 'ha.lo'; -- 0 rows
SELECT * FROM pronunciation WHERE ipa REGEXP 'ha..lo'; -- 1 row
I'm quite sure that the data is stored correctly, as it seems good when I retrieve it and shows up fine in phpMyAdmin. I'm on a shared host, so I can't really install programs.
How can I solve this problem? If it's not possible: is there a plausible work-around that does not involve processing the entire database with PHP every time? There are 40 000 lines, and I'm not dead-set on using MySQL (or UTF8, for that matter). I only have access to PHP and MySQL on the host.
Edit: There is an open 4-year-old MySQL bug report, Bug #30241 Regular expression problems, which notes that the regexp engine works byte-wise. Thus, I'm looking for a work-around.
EDITED to incorporate a fix for valid criticism
Use the HEX() function to render your bytes to hexadecimal and then use RLIKE on that, for example:
select * from mytable
where hex(ipa) rlike concat('^(..)*', hex('needle')); -- looking for 'needle' in the haystack while maintaining hex-pair alignment
The odd unicode chars render consistently to their hex values, so you're searching over standard 0-9A-F chars.
This works for "normal" columns too, you just don't need it.
P.S. @Kieren's (valid) point is addressed by using the '^(..)*' prefix to enforce hex-pair alignment.
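Applied to the pronunciation table from the question, a sketch might look like this (the search string 'ˈl' is just an example):
-- find rows whose ipa value contains the UTF-8 bytes of 'ˈl', keeping hex-pair alignment
SELECT *
FROM pronunciation
WHERE HEX(ipa) RLIKE CONCAT('^(..)*', HEX('ˈl'));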
I'm not dead-set on using MySQL
Postgres seems to handle it quite fine:
test=# select 'ˌˈʔ' like '___';
?column?
----------
t
(1 row)
test=# select 'ˌˈʔ' ~ '^.{3}$';
?column?
----------
t
(1 row)
If you go down that road, note that Postgres' ilike operator matches the behavior of MySQL's like. (In Postgres, like is case-sensitive.)
For a MySQL-specific solution, you might be able to work around this by binding some user-defined function (maybe bind the ICU library?) into MySQL.
You have problems with UTF8? Eliminate them.
How many special characters do you use? You are using only lowercase letters, am I right? So, my tip is: write a function which converts special characters to regular characters, e.g. "æ" -> "A" and so on, and add a column to the table which stores that converted value (you have to convert all existing values first, and then again upon each insert/update). When searching, you just have to convert the search string with the same function and use it against that field with REGEXP.
If there are too many kinds of special characters, you should convert each to a multi-character token: 1. to avoid finding "aa" in the "ba ab" sequence, use a prefix, like "#ba#ab"; 2. to avoid finding "#a" in "#ab", use fixed-length tokens, say, 2.
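A minimal sketch of that idea, reusing the pronunciation table from the question; the ipa_plain column is made up, and the nested REPLACE() calls stand in for a real conversion function covering every special character you use:
ALTER TABLE pronunciation ADD COLUMN ipa_plain VARCHAR(255);
-- normalize the special characters into plain ASCII once, and again on every insert/update
UPDATE pronunciation
SET ipa_plain = REPLACE(REPLACE(REPLACE(ipa, 'ˈ', ''), 'ˌ', ''), 'ː', ':');
-- search against the normalized column with a normalized pattern
SELECT word, ipa FROM pronunciation WHERE ipa_plain REGEXP 'ha:?lo';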

How can I find non-ASCII characters in MySQL?

I'm working with a MySQL database that has some data imported from Excel. The data contains non-ASCII characters (em dashes, etc.) as well as hidden carriage returns or line feeds. Is there a way to find these records using MySQL?
MySQL provides comprehensive character set management that can help with this kind of problem.
SELECT whatever
FROM tableName
WHERE columnToCheck <> CONVERT(columnToCheck USING ASCII)
The CONVERT(col USING charset) function turns the unconvertible characters into replacement characters. Then, the converted and unconverted text will be unequal.
See this for more discussion. https://dev.mysql.com/doc/refman/8.0/en/charset-repertoire.html
You can use any character set name you wish in place of ASCII. For example, if you want to find out which characters won't render correctly in code page 1257 (Lithuanian, Latvian, Estonian) use CONVERT(columnToCheck USING cp1257)
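To see which values are affected and what they collapse to, you can select the converted form alongside the original; a sketch using the same placeholder names:
SELECT columnToCheck,
       CONVERT(columnToCheck USING ASCII) AS ascii_version  -- unconvertible characters appear as ?
FROM tableName
WHERE columnToCheck <> CONVERT(columnToCheck USING ASCII)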
You can define ASCII as all characters that have a decimal value of 0 - 127 (0x00 - 0x7F) and find columns with non-ASCII characters using the following query
SELECT * FROM TABLE WHERE NOT HEX(COLUMN) REGEXP '^([0-7][0-9A-F])*$';
This was the most comprehensive query I could come up with.
It depends exactly what you're defining as "ASCII", but I would suggest trying a variant of a query like this:
SELECT * FROM tableName WHERE columnToCheck REGEXP '[^A-Za-z0-9]';
That query will return all rows where columnToCheck contains any non-alphanumeric characters. If you have other characters that are acceptable, add them to the character class in the regular expression. For example, if periods, commas, and hyphens are OK, change the query to:
SELECT * FROM tableName WHERE columnToCheck REGEXP '[^A-Za-z0-9.,-]';
The most relevant page of the MySQL documentation is probably 12.5.2 Regular Expressions.
This is probably what you're looking for:
select * from TABLE where COLUMN regexp '[^ -~]';
It should return all rows where COLUMN contains non-ASCII characters (or non-printable ASCII characters such as newline).
One missing character from everyone's examples above is the termination character (\0). This is invisible to the MySQL console output and is not discoverable by any of the queries heretofore mentioned. The query to find it is simply:
select * from TABLE where COLUMN like '%\0%';
Based on the correct answer, but taking into account ASCII control characters as well, the solution that worked for me is this:
SELECT * FROM `table` WHERE NOT `field` REGEXP "[\\x00-\\x7F]|^$";
It does the same thing: searches for violations of the ASCII range in a column, but lets you search for control characters too, since it uses hexadecimal notation for code points. Since there is no comparison or conversion (unlike @Ollie's answer), this should be significantly faster, too. (Especially if MySQL does early termination on the regex query, which it definitely should.)
It also avoids returning fields that are zero-length. If you want a slightly-longer version that might perform better, you can use this instead:
SELECT * FROM `table` WHERE `field` <> "" AND NOT `field` REGEXP "[\\x00-\\x7F]";
It does a separate check for length to avoid zero-length results, without considering them for a regex pass. Depending on the number of zero-length entries you have, this could be significantly faster.
Note that if your default character set is something bizarre where 0x00-0x7F don't map to the same values as ASCII (is there such a character set in existence anywhere?), this would return false positives. Otherwise, enjoy!
Try using this query to search for records containing special characters:
SELECT *
FROM tableName
WHERE fieldName REGEXP '[^a-zA-Z0-9#:. \'\-`,\&]'
@zende's answer was the only one that covered columns with a mix of ASCII and non-ASCII characters, but it also had that problematic hex thing. I used this:
SELECT * FROM `table` WHERE NOT `column` REGEXP '^[ -~]+$' AND `column` !=''
In Oracle we can use the following:
SELECT * FROM TABLE_A WHERE ASCIISTR(COLUMN_A) <> COLUMN_A;
For this question we can also use this method.
Question from SQL Zoo: find all details of the prize won by PETER GRÜNBERG (the name contains non-ASCII characters).
Answer:
select * from nobel where winner like 'P% GR%_%berg';