MySQL: replace all special Unicode characters with their ASCII counterparts

I have a field with collation utf8_general_ci in which many values contain non-ASCII characters. I want to:
Search for all fields with any non-ASCII characters
Replace all non-ASCII characters with their corresponding ASCII version.
For example: côte-d'ivoire should be replaced with cote-d'ivoire, são-tomé should be replaced with sao-tome, etc.
How do I achieve this? If I just change the field type to ascii, non-ascii characters get replaced by '?'. I am not even able to search for all such fields using
RLIKE '%[^a-z]%'
For example
SELECT columname
FROM tablename
WHERE NOT columname REGEXP '[a-z]';
returns an empty set.
Thanks

An sql fiddle example is at
http://www.sqlfiddle.com/#!2/c1d90/1/0
the query to select is
select * from test where maintext rlike '[^\x00-\x7F]'
Hope this helps

I'm presuming from your previous questions that you're using PHP.
https://github.com/silverstripe-labs/silverstripe-unidecode
You could then use skv's answer to return the objects you wish to use, and then use unidecode to attempt to convert them to their ASCII equivalents.

In Perl, you can use Text::Unidecode.
In MySQL, there isn't an easy function to convert from utf8 (or utf8mb4) into ascii without it spitting out some ugly '?' characters as replacements. It's best to replace these prior to inserting them in the DB, or run something in Perl (or whatever) to extract the data and re-update it one row at a time.
There are many different ports of Text::Unidecode in different languages: Python, PHP, Java, Ruby, JavaScript, Haskell, C#, Clojure, Go.
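As a rough illustration of the "extract, transliterate, re-update" approach, here is a minimal Python sketch that uses only the standard library. It is not Text::Unidecode or one of its ports: it simply decomposes accented letters and drops the combining marks, and the function name to_ascii is made up for this example. Characters that have no decomposition (for example ø or ß) are dropped rather than transliterated, which is where a real Unidecode port does better.

```python
import unicodedata


def to_ascii(text: str) -> str:
    """Approximate ASCII transliteration (hypothetical helper, stdlib only)."""
    # NFKD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks that the decomposition produced.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Anything still outside ASCII has no simple base letter, so it is removed.
    return stripped.encode("ascii", errors="ignore").decode("ascii")


print(to_ascii("côte-d'ivoire"))
print(to_ascii("são-tomé"))
```

The result could then be written back to the table row by row with an ordinary parameterized UPDATE.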

Related

MYSQL select all rows with asian characters

On a database with customer information and in a table where names and addresses are mixed in latin and asian characters I'd like to select all that do (or don't) contain any asian characters. The data is UTF-8 encoded. Is that possible with MYSQL itself or do I need to write a custom script using PHP / Perl?
You might be able to do this with regular expressions. The idea is to list all the simple characters that might appear in the string and negate that set with ^. So, to find unexpected (i.e. "Asian") characters:
where col regexp '.*[^0-9a-zA-Z.,:()& ].*'
The .* at the beginning and end are not necessary, but I like to have them so the patterns are similar to LIKE patterns.
The page linked to in amdixon's comment had a working answer. Here it is so that we have it on SO:
To select all rows with non latin characters on col:
SELECT *
FROM table
WHERE col != CONVERT(col USING latin1)
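If the check has to run in a script instead, the same round-trip idea can be sketched in Python. This is only an approximation of what the server does: MySQL's latin1 is close to Windows-1252, so cp1252 is used here as a stand-in, and the comparison is a plain string comparison rather than a collation-aware one. The function name is hypothetical.

```python
def differs_after_latin1(value: str) -> bool:
    """Return True if the value changes when squeezed through a latin1-like charset."""
    # Characters that cannot be encoded become '?', similar to CONVERT(... USING latin1).
    round_tripped = value.encode("cp1252", errors="replace").decode("cp1252")
    return value != round_tripped


print(differs_after_latin1("José"))
print(differs_after_latin1("東京"))
```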

MySQL query with non-printing characters (left-to-right mark)

I just found myself lost in the interesting situation that I need to query MySQL for fields containing a so called Left-to-right mark.
As the nature of this character is to be non-printing, thus invisible, I'm unable to simply copy/paste it into a query.
As mentioned in the linked Wikipedia article, the Left-to-right mark is Unicode character U+200E, which is a fact that I'm sure is the key to success in my current adventure.
My question is: How do I use raw Unicode in a MySQL query? Something along the lines of:
SELECT * FROM users WHERE username LIKE '%\U+200E%'
or
SELECT * FROM users WHERE username REGEXP '\U+200E'
or whatever the correct syntax for Unicode in MySQL is and depending on whether this is supported with LIKE and/or REGEXP.
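To get a Unicode character, something like this should work, where <number> is the character's UTF-8 byte sequence written as a number (for example 0xE2808E for U+200E, the left-to-right mark):
SELECT CHAR(<number> USING utf8);
Also, avoid REGEXP for this on MySQL versions before 8.0, because the regexp library they use is very old and doesn't support multi-byte charsets.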
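The number passed to CHAR(... USING utf8) has to be the UTF-8 byte sequence, not the code point itself. A small Python sketch for deriving it is below; the helper name utf8_hex_literal is invented for this example, and it assumes the connection and column really use UTF-8.

```python
def utf8_hex_literal(code_point: int) -> str:
    """Return a MySQL-style hex literal holding the UTF-8 bytes of a code point."""
    # chr() gives the character; encoding it yields the bytes MySQL expects.
    return "0x" + chr(code_point).encode("utf-8").hex().upper()


# U+200E is the left-to-right mark.
print(utf8_hex_literal(0x200E))
```

The printed value can be pasted into the query in place of <number>, which avoids having to type the invisible character.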

Unicode (hexadecimal) character literals in MySQL

Is there a way to specify Unicode character literals in MySQL?
I want to replace a Unicode character with an Ascii character, something like the following:
Update MyTbl Set MyFld = Replace(MyFld, "ẏ", "y")
But I'm using even more obscure characters which are not available in most fonts, so I want to be able to use Unicode character literals, something like
Update MyTbl Set MyFld = Replace(MyFld, "\u1e8f", "y")
This SQL statement is being invoked from a PHP script - the first form is not only unreadable, but it doesn't actually work!
You can specify hexadecimal literals (or even binary literals) using 0x, x'', or X'':
select 0xC2A2;
select x'C2A2';
select X'C2A2';
But be aware that the return type is a binary string, so each and every byte is considered a character. You can verify this with char_length:
select char_length(0xC2A2)
2
If you want UTF-8 strings instead, you need to use convert:
select convert(0xC2A2 using utf8mb4)
And we can see that C2 A2 is considered 1 character in UTF-8:
select char_length(convert(0xC2A2 using utf8mb4))
1
Also, you don't have to worry about invalid bytes because convert will remove them automatically:
select char_length(convert(0xC1A2 using utf8mb4))
0
As can be seen, the output is 0 because C1 A2 is an invalid UTF-8 byte sequence.
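The byte-versus-character distinction can be reproduced outside the database. The following Python sketch mirrors the examples above; it assumes that ignoring undecodable bytes is a fair stand-in for the removal behaviour described for convert, and the function name is made up.

```python
def char_length_utf8(hex_string: str) -> int:
    """Count characters after decoding the given hex bytes as UTF-8."""
    raw = bytes.fromhex(hex_string)
    # errors="ignore" silently drops bytes that are not valid UTF-8.
    return len(raw.decode("utf-8", errors="ignore"))


print(len(bytes.fromhex("C2A2")))   # length in bytes
print(char_length_utf8("C2A2"))     # length in characters
print(char_length_utf8("C1A2"))     # invalid sequence
```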
Thanks for your suggestions, but I think the problem was further back in the system.
There's a lot of levels to unpick, but as far as I can tell, (on this server at least) the command
set names utf8
makes the utf-8 handling work correctly, whereas
set character set utf8
doesn't.
In my environment, these are being called from PHP using PDO, for what difference that may make.
Thanks anyway!
You can use the hex and unhex functions, e.g.:
update mytable set myfield = unhex(replace(hex(myfield),'C383','C3'))
The MySQL string literal syntax is specified in the reference manual; as you can see there, there is no provision for numeric escape sequences.
However, as you are embedding the SQL in PHP, you can compute the right bytes in PHP. Make sure the bytes you put into the SQL actually match your client character set.
There is also the CHAR function, which does what you wanted: you provide byte values and a charset name, and you get a character back.

Escape characters in MySQL, in Ruby

I have a couple escaped characters in user-entered fields that I can't figure out.
I know they are the "smart" single and double quotes, but I don't know how to search for them in mysql.
When output from Ruby, the characters look like \222, \223, \224, etc.
irb> "\222".length => 1
So - do you know how to search for these in mysql? When I look in mysql, they look like '?'.
I'd like to find all records that have this character in the text field. I tried
mysql> select id from table where field LIKE '%\222%'
but that did not work.
Some more information - after doing a mysqldump, this is how one of the characters is represented - '\\xE2\\x80\\x99'. It's the smart single quote.
Ultimately, I'm building an RTF file and the characters are coming out completely wrong, so I'm trying to replace them with 'dumb' quotes for now. I was able to do a gsub(/\222/, "'").
Thanks.
I don't quite understand your problem, but here is some info for you:
First, there are no escaped characters in the database, because every character is stored as is, with no escaping.
They don't "look like ?". It's just wrong terminal settings. A SET NAMES query should always be executed first, to match the client encoding.
You have to determine the character set and use it at every stage: in the database, in the mysql client, in Ruby.
You should distinguish Ruby's string representation from the character itself.
To enter a character in a MySQL query, you can use the CHAR function, but only in the terminal. In Ruby, just use the character itself.
Smart quotes look like they are 2-byte encoded in Unicode. You have to determine your encoding first.
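To make the encoding point concrete, here is a small Python sketch (Python is used only for illustration; the names are invented). It assumes the \222 byte seen in Ruby is Windows-1252, where 0x92 is the right single quotation mark, and shows that the same character in UTF-8 is the E2 80 99 sequence from the mysqldump output. It also shows one way to swap smart quotes for plain ones.

```python
# Map the four common smart quotes to their plain ASCII counterparts.
SMART_TO_PLAIN = str.maketrans({
    "\u2018": "'",   # left single quote
    "\u2019": "'",   # right single quote
    "\u201c": '"',   # left double quote
    "\u201d": '"',   # right double quote
})


def straighten_quotes(text: str) -> str:
    """Replace smart quotes with plain quotes (hypothetical helper)."""
    return text.translate(SMART_TO_PLAIN)


# Octal \222 is byte 0x92; in Windows-1252 that is U+2019.
smart = b"\x92".decode("cp1252")
print(smart == "\u2019")
print(smart.encode("utf-8"))
print(straighten_quotes("\u201cit\u2019s\u201d"))
```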

How can I find non-ASCII characters in MySQL?

I'm working with a MySQL database that has some data imported from Excel. The data contains non-ASCII characters (em dashes, etc.) as well as hidden carriage returns or line feeds. Is there a way to find these records using MySQL?
MySQL provides comprehensive character set management that can help with this kind of problem.
SELECT whatever
FROM tableName
WHERE columnToCheck <> CONVERT(columnToCheck USING ASCII)
The CONVERT(col USING charset) function turns the unconvertible characters into replacement characters. The converted and unconverted text will then be unequal.
See this for more discussion. https://dev.mysql.com/doc/refman/8.0/en/charset-repertoire.html
You can use any character set name you wish in place of ASCII. For example, if you want to find out which characters won't render correctly in code page 1257 (Lithuanian, Latvian, Estonian) use CONVERT(columnToCheck USING cp1257)
You can define ASCII as all characters that have a decimal value of 0 - 127 (0x00 - 0x7F) and find columns with non-ASCII characters using the following query
SELECT * FROM TABLE WHERE NOT HEX(COLUMN) REGEXP '^([0-7][0-9A-F])*$';
This was the most comprehensive query I could come up with.
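The logic of the HEX-based query can be checked in isolation. The Python sketch below applies the same pattern to the hex form of a string's UTF-8 bytes; it assumes the column is stored as UTF-8, and the function name is hypothetical.

```python
import re

# Every byte must have a high nibble of 0-7, i.e. a value of 0x00-0x7F.
ASCII_HEX = re.compile(r"([0-7][0-9A-F])*")


def is_ascii_by_hex(value: str) -> bool:
    """Return True if every byte of the UTF-8 encoding is in 0x00-0x7F."""
    hex_form = value.encode("utf-8").hex().upper()
    return ASCII_HEX.fullmatch(hex_form) is not None


print(is_ascii_by_hex("plain text"))
print(is_ascii_by_hex("em\u2014dash"))
```

Note that carriage returns and line feeds are ASCII, so this test does not flag them on its own.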
It depends exactly what you're defining as "ASCII", but I would suggest trying a variant of a query like this:
SELECT * FROM tableName WHERE columnToCheck REGEXP '[^A-Za-z0-9]';
That query will return all rows where columnToCheck contains any non-alphanumeric characters. If you have other characters that are acceptable, add them to the character class in the regular expression. For example, if periods, commas, and hyphens are OK, change the query to:
SELECT * FROM tableName WHERE columnToCheck REGEXP '[^A-Za-z0-9.,-]';
The most relevant page of the MySQL documentation is probably 12.5.2 Regular Expressions.
This is probably what you're looking for:
select * from TABLE where COLUMN regexp '[^ -~]';
It should return all rows where COLUMN contains non-ASCII characters (or non-printable ASCII characters such as newline).
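The character class [^ -~] means "anything outside the range from space to tilde", which is exactly the printable ASCII range. A quick Python sketch of the same pattern is below, as a way to try it on sample values before running it against a table; Python's regex engine is not MySQL's, so treat it as an illustration of the pattern only. The function name is made up.

```python
import re

# Matches any character that is not printable ASCII (space through tilde).
NOT_PRINTABLE_ASCII = re.compile(r"[^ -~]")


def has_unusual_char(value: str) -> bool:
    """Return True if the value contains a non-ASCII or control character."""
    return NOT_PRINTABLE_ASCII.search(value) is not None


print(has_unusual_char("abc 123~"))
print(has_unusual_char("line one\nline two"))
print(has_unusual_char("caf\u00e9"))
```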
One missing character from everyone's examples above is the termination character (\0). This is invisible to the MySQL console output and is not discoverable by any of the queries heretofore mentioned. The query to find it is simply:
select * from TABLE where COLUMN like '%\0%';
Based on the correct answer, but taking into account ASCII control characters as well, the solution that worked for me is this:
SELECT * FROM `table` WHERE NOT `field` REGEXP "[\\x00-\\xFF]|^$";
It does the same thing: searches for violations of the ASCII range in a column, but lets you search for control characters too, since it uses hexadecimal notation for code points. Since there is no comparison or conversion (unlike @Ollie's answer), this should be significantly faster, too. (Especially if MySQL does early-termination on the regex query, which it definitely should.)
It also avoids returning fields that are zero-length. If you want a slightly-longer version that might perform better, you can use this instead:
SELECT * FROM `table` WHERE `field` <> "" AND NOT `field` REGEXP "[\\x00-\\xFF]";
It does a separate check for length to avoid zero-length results, without considering them for a regex pass. Depending on the number of zero-length entries you have, this could be significantly faster.
Note that if your default character set is something bizarre where 0x00-0xFF don't map to the same values as ASCII (is there such a character set in existence anywhere?), this would return a false positive. Otherwise, enjoy!
Try using this query to search for records with special characters:
SELECT *
FROM tableName
WHERE fieldName REGEXP '[^a-zA-Z0-9#:. \'\-`,\&]'
@zende's answer was the only one that covered columns with a mix of ASCII and non-ASCII characters, but it also had that problematic hex thing. I used this:
SELECT * FROM `table` WHERE NOT `column` REGEXP '^[ -~]+$' AND `column` !=''
In Oracle, we can use the following:
SELECT * FROM TABLE_A WHERE ASCIISTR(COLUMN_A) <> COLUMN_A;
For this question we can also use this method.
Question from SQLZoo:
Find all details of the prize won by PETER GRÜNBERG
Non-ASCII characters
ans: SELECT * FROM nobel WHERE winner LIKE 'P% GR%_%berg';