Unicode (hexadecimal) character literals in MySQL - mysql

Is there a way to specify Unicode character literals in MySQL?
I want to replace a Unicode character with an Ascii character, something like the following:
Update MyTbl Set MyFld = Replace(MyFld, "ẏ", "y")
But I'm using even more obscure characters which are not available in most fonts, so I want to be able to use Unicode character literals, something like
Update MyTbl Set MyFld = Replace(MyFld, "\u1e8f", "y")
This SQL statement is being invoked from a PHP script - the first form is not only unreadable, but it doesn't actually work!

You can specify hexadecimal literals (or even binary literals) using 0x, x'', or X'':
select 0xC2A2;
select x'C2A2';
select X'C2A2';
But be aware that the return type is a binary string, so each and every byte is considered a character. You can verify this with char_length:
select char_length(0xC2A2)
2
If you want UTF-8 strings instead, you need to use convert:
select convert(0xC2A2 using utf8mb4)
And we can see that C2 A2 is considered 1 character in UTF-8:
select char_length(convert(0xC2A2 using utf8mb4))
1
Also, you don't have to worry about invalid bytes because convert will remove them automatically:
select char_length(convert(0xC1A2 using utf8mb4))
0
As can be seen, the output is 0 because C1 A2 is an invalid UTF-8 byte sequence.

Thanks for your suggestions, but I think the problem was further back in the system.
There's a lot of levels to unpick, but as far as I can tell, (on this server at least) the command
set names utf8
makes the utf-8 handling work correctly, whereas
set character set utf8
doesn't.
In my environment, these are being called from PHP using PDO, for what difference that may make.
Thanks anyway!

You can use the hex and unhex functions, e.g.:
update mytable set myfield = unhex(replace(hex(myfield),'C383','C3'))

The MySQL string syntax is specified here, as you can see, there is no provision for numeric escape sequences.
However, as you are embedding the SQL in PHP, you can compute the right bytes in PHP. Make sure the bytes you put into the SQL actually match your client character set.

There is also the char function that will allow what you wanted (providing byte numbers and a charset name) and getting a char.

Related

How can I insert arbitrary binary data into a VARCHAR column?

I have a MySQL table with a VARCHAR(100) column, using the utf8_general_ci collation.
I can see rows where this column contains arbitrary byte sequences (i.e. data that contains invalid UTF8 character sequences), but I can't figure out how to write an UPDATE or INSERT statement that allows this type of data to be entered.
For example, I've tried the following:
UPDATE DataTable SET Data = CAST(BINARY(X'16d7a4fca7442dda3ad93c9a726597e4') AS CHAR(100)) WHERE Id = 1;
But I get the error:
Incorrect string value: '\xFC\xA7D-\xDA:...' for column 'Data' at row 1
How can I write an INSERT or UPDATE statement that bypasses the destination column's collation, allowing me to insert arbitrary byte sequences?
Have you considered using one of the Blob data types instead of varchar? I believe that this'd take a lot of the pain away from your use-case.
EDIT: Alternatively, there is the HEX and UNHEX functions, which MySQL supports. Hex takes either a str or a numeric argument and returns the hexadecimal representation of your argument as a string. Unhex does the inverse; taking a hexadecimal string and returning a binary string.
The short answer is that it shouldn't be possible to insert values with invalid UTF8 characters into VARCHAR column declared to use UTF8 characterset.
That's the design goal of MySQL, to disallow invalid values. When there's an attempt to do that, MySQL will return either an error or a warning, or (more leniently?) silently truncate the supplied value at the first invalid character encountered.
The more usual variety of characterset issues are with MySQL performing a characterset conversion when a characterset conversion isn't required.
But the issue you are reporting is that invalid characters were inserted into a UTF8 column. It's as if a latin1 (ISO-8859) encoding was supplied, and a characterset conversion was required, but was not performed.
As far as working around that... I believe it was possible in earlier versions of MySQL. I believe it was possible to cast a value to BINARY, and then warp that in CONVERT( ... USING UTF8), and MySQL wouldn't perform a validation of the characterset. I don't know if that's still possible with the current MySQL Connectors.
If it is possible, then that's (IMO) a bug in the Connector.
The only way I can think of getting around that characterset check/validation would be to get the MySQL sever to trust the client, and determine that no check of the characterset is required. (That would also mean the MySQL server wouldn't be doing a characterset conversion, the client lying to the server, the client telling the server that it's supplying valid UTF8 characters.
Basically, the client would be telling the server "Hey server, I'm going to be sending UTF8 character encodings".
And the server says "Okay. I'll not do any characterset conversion then, since we match. And I'll just trust that what you send is valid UTF8".
And then the client mischievously chuckles to itself, "Heh, heh, I lied. I'm actually sending character encodings that aren't valid UTF8".
And I think it's much more likely to be able to achieve such mischief using prepared statements with the old school MySQL C API (mysql_stmt_prepare, mysql_stmt_execute), supplying nvalid UTF8 encodings as values for string bind parameters. (The onus is really on the client to supply valid values for bind parameters.)
You should base64 encode your value beforehand so you can generate a valid SQL with it:
UPDATE DataTable SET Data = from_base64('mybase64-encoded-representation-of-my-value') WHERE Id = 1;

SQL query with Unicode codepoints in SQLite and MySQL

What is the right way to use unicode codepoints in a query? (SQLite and MySQL):
sqlite> select name from city where name like '%rich';
Zürich
Zurich
I tried using the codepoint but nothing worked so far:
sqlite> select * from city where name like 'Z\u00fc%';
(empty)
Anybody? Thanks
EDIT: I created this sql fiddle which also doesn't work http://sqlfiddle.com/#!9/0faee/2
(Referring to MySQL; I do not know about SQLite.)
Do you really have backslash u 0 0 f c? If you are going with Unicode, you need to go all the way, with the unhex of 005A00FC. That is Z would take two bytes and ü would take two bytes. My point is that MySQL does not see Z\u00fc as anything other than 7 ascii character. (Or maybe 6 if the \ is treated as an escape.)
005A00FC is the hex for Zü when using CHARACTER SET ucs2.
If you are using PHP 5.4+ and JSON, you need an extra parameter:
$t = json_encode($s, JSON_UNESCAPED_UNICODE);

Mysql replace all special unicode characters with their ascii counterpart

I have a field with encoding utf8-general-ci in which many values contain non-ascii characters. I want to
Search for all fields with any non-ascii characters
Replace all non-ascii characters with their corresponding ascii version.
For example: côte-d'ivoire should be replaced with cote-d-i'voire, são-tomé should be replaced with sao-tome, etc.
How do I achieve this? If I just change the field type to ascii, non-ascii characters get replaced by '?'. I am not even able to search for all such fields using
RLIKE '%[^a-z]%'
For example
SELECT columname
FROM tablename
WHERE NOT columname REGEXP '[a-z]';
returns an empty set.
Thanks
An sql fiddle example is at
http://www.sqlfiddle.com/#!2/c1d90/1/0
the query to select is
select * from test where maintext rlike '[^\x00-\x7F]'
Hope this helps
I'm presuming from your previous questions that you're using PHP.
https://github.com/silverstripe-labs/silverstripe-unidecode
You could then use skv's answer to return the object's you wish to use and then use unidecode to attempt to convert the object to it's ascii equivalents.
In Perl, you can use Text::Unidecode.
In MySQL, there isn't an easy function to convert from utf8 (or utf8mb4) into ascii without it spitting out some ugly '?' characters as replacements. It's best to replace these prior to inserting them in the DB, or run something in Perl (or whatever) to extract the data and re-update it one row at a time.
There are many different ports of Text::Unidecode in different languages: Python, PHP, Java, Ruby, JavaScript, Haskell, C#, Clojure, Go.

MySQL Replace with Char Function

I try to create a MySQL function to parse a given string into a version I can use in an URL. So I have to eliminate some special characters and I try to do it with a loop to keep the code simple and to not have to specify any character.
My current code is:
DECLARE parsedString VARCHAR(255);
# convert to low chars
SET parsedString = CONVERT(paramString USING latin1);
SET parsedString = LOWER(parsedString);
# replace chars with A
SET #x = 223;
charReplaceA: WHILE #x <= 229 DO
SET #x = #x + 1;
SET parsedString = REPLACE (parsedString, CHAR(#x USING ASCII), 'a');
END WHILE charReplaceA;
# convert to utf8 and return
RETURN CONVERT(parsedString USING utf8);
If I try to use my code it doesn't work. Somehow it doesn't recognize the CHAR(#x USING ASCII) part.
SELECT urlParser('aäeucn');
returns
aäeucn
If I change my code to
SET parsedString = REPLACE (parsedString, 'ä', 'a');
it somehow works and returns
aaeucn
Does anyone have any idea how to use REPLACE() with CHAR()? I don't want to specify any possible character.
You could try
REPLACE(CONVERT('aäeucn' USING ascii), '?', 'a')
When you convert international characters to ascii, all non-ascii characters are represented by a literal '?' character. Then you don't have to do the loop (so it will probably run a lot faster).
Also consider other methods of encoding international characters in URLs.
See http://www.w3.org/International/articles/idn-and-iri/
Re your comment:
If you need a character-by-character replacement, I wouldn't do it in an SQL function. MySQL functions are not efficient, and coding them and debugging them is awkward. I'd recommend fetching the utf8 strings back into your application and do the character translation there.
You didn't specify which application programming language you're using, but for what it's worth, PHP supports a function strtr() that can be used for exactly this scenario. There's even an example of mapping i18n characters to ascii characters in the manual page:
http://php.net/strtr
$addr = strtr($addr, "äåö", "aao"); // That was easy!
That solution will be far faster and easier to code than a MySQL stored function.

How can I find non-ASCII characters in MySQL?

I'm working with a MySQL database that has some data imported from Excel. The data contains non-ASCII characters (em dashes, etc.) as well as hidden carriage returns or line feeds. Is there a way to find these records using MySQL?
MySQL provides comprehensive character set management that can help with this kind of problem.
SELECT whatever
FROM tableName
WHERE columnToCheck <> CONVERT(columnToCheck USING ASCII)
The CONVERT(col USING charset) function turns the unconvertable characters into replacement characters. Then, the converted and unconverted text will be unequal.
See this for more discussion. https://dev.mysql.com/doc/refman/8.0/en/charset-repertoire.html
You can use any character set name you wish in place of ASCII. For example, if you want to find out which characters won't render correctly in code page 1257 (Lithuanian, Latvian, Estonian) use CONVERT(columnToCheck USING cp1257)
You can define ASCII as all characters that have a decimal value of 0 - 127 (0x00 - 0x7F) and find columns with non-ASCII characters using the following query
SELECT * FROM TABLE WHERE NOT HEX(COLUMN) REGEXP '^([0-7][0-9A-F])*$';
This was the most comprehensive query I could come up with.
It depends exactly what you're defining as "ASCII", but I would suggest trying a variant of a query like this:
SELECT * FROM tableName WHERE columnToCheck NOT REGEXP '[A-Za-z0-9]';
That query will return all rows where columnToCheck contains any non-alphanumeric characters. If you have other characters that are acceptable, add them to the character class in the regular expression. For example, if periods, commas, and hyphens are OK, change the query to:
SELECT * FROM tableName WHERE columnToCheck NOT REGEXP '[A-Za-z0-9.,-]';
The most relevant page of the MySQL documentation is probably 12.5.2 Regular Expressions.
This is probably what you're looking for:
select * from TABLE where COLUMN regexp '[^ -~]';
It should return all rows where COLUMN contains non-ASCII characters (or non-printable ASCII characters such as newline).
One missing character from everyone's examples above is the termination character (\0). This is invisible to the MySQL console output and is not discoverable by any of the queries heretofore mentioned. The query to find it is simply:
select * from TABLE where COLUMN like '%\0%';
Based on the correct answer, but taking into account ASCII control characters as well, the solution that worked for me is this:
SELECT * FROM `table` WHERE NOT `field` REGEXP "[\\x00-\\xFF]|^$";
It does the same thing: searches for violations of the ASCII range in a column, but lets you search for control characters too, since it uses hexadecimal notation for code points. Since there is no comparison or conversion (unlike #Ollie's answer), this should be significantly faster, too. (Especially if MySQL does early-termination on the regex query, which it definitely should.)
It also avoids returning fields that are zero-length. If you want a slightly-longer version that might perform better, you can use this instead:
SELECT * FROM `table` WHERE `field` <> "" AND NOT `field` REGEXP "[\\x00-\\xFF]";
It does a separate check for length to avoid zero-length results, without considering them for a regex pass. Depending on the number of zero-length entries you have, this could be significantly faster.
Note that if your default character set is something bizarre where 0x00-0xFF don't map to the same values as ASCII (is there such a character set in existence anywhere?), this would return a false positive. Otherwise, enjoy!
Try Using this query for searching special character records
SELECT *
FROM tableName
WHERE fieldName REGEXP '[^a-zA-Z0-9#:. \'\-`,\&]'
#zende's answer was the only one that covered columns with a mix of ascii and non ascii characters, but it also had that problematic hex thing. I used this:
SELECT * FROM `table` WHERE NOT `column` REGEXP '^[ -~]+$' AND `column` !=''
In Oracle we can use below.
SELECT * FROM TABLE_A WHERE ASCIISTR(COLUMN_A) <> COLUMN_A;
for this question we can also use this method :
Question from sql zoo:
Find all details of the prize won by PETER GRÜNBERG
Non-ASCII characters
ans: select*from nobel where winner like'P% GR%_%berg';