MySQL Character Encodings for Connector/C - mysql

MySQL's character encoding mechanism is legendary in both its complexity and its opaqueness, and I have a question about how to correctly interpret string data returned from a MySQL Connector/C query.
If my Connector/C code is set to UTF-8 (using mysql_set_character_set()), will the MySQL library (and/or server) transcode data in latin1 that's stored in the server to UTF-8 or am I still required to use mysql_fetch_field on a per-field basis to determine the character set of any string data?

Give this page a read: http://dev.mysql.com/doc/refman/5.7/en/charset-connection.html
Since mysql_set_character_set() works like the SET NAMES statement, it also changes the character set that the server uses when sending results back to the client.
SET NAMES indicates what character set the client will use to send SQL statements to the server. Thus, SET NAMES 'cp1251' tells the server, “future incoming messages from this client are in character set cp1251.” It also specifies the character set that the server should use for sending results back to the client. (For example, it indicates what character set to use for column values if you use a SELECT statement.)

It is extremely improbable to have collisions between properly encoded strings in Latin1 and UTF-8. You can check proper UTF-8 encoding and assume Latin1 for badly encoded strings and convert them yourself on a case by case basis. Assuming correct configuration and encoding everywhere is risky.
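The "check UTF-8 first, otherwise assume Latin1" approach can be sketched as follows. This is a minimal illustration in Python rather than Connector/C, and the function name decode_utf8_or_latin1 is hypothetical. It uses Python's strict ISO-8859-1 codec for the fallback, which accepts every byte value, so the fallback itself never fails.

```python
def decode_utf8_or_latin1(raw: bytes) -> str:
    """Decode raw bytes as UTF-8 if they are well formed, else as Latin1."""
    try:
        # Strict decoding raises on any malformed UTF-8 sequence.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is Latin1.
        # The latin-1 codec maps each byte to the same code point.
        return raw.decode("latin-1")


print(decode_utf8_or_latin1(b"caf\xc3\xa9"))  # UTF-8 input
print(decode_utf8_or_latin1(b"caf\xe9"))      # Latin1 input
```

Pure ASCII input decodes identically either way, so only strings containing bytes above 0x7F are affected by the fallback.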

Related

detect with Python if the string will lead to "Incorrect string value" error in MySQL

I have a table in MySQL (5.7) database, which has collation utf8_unicode_ci,
and where I'm inserting some data with Python (3.6).
With some of the strings (for example, '\xCE\xA6') I get "Incorrect string value" error. On the DB side, I can mute this error by turning off the strict mode in MySQL, or changing the field's collation to utf8mb4.
However, such strings are "anomalies", and it is not desirable to change a collation or the sql_mode.
How can I detect in Python 3 that a given string will lead to an "Incorrect string value" error in MySQL, before inserting it into a table?
Where do you get the error message? What operation is being performed?
C3A6 is the UTF-8 (cf MySQL's utf8 or utf8mb4) hex for æ; does it seem likely that that was the desired character?
To handle utf8 (or utf8mb4), you need to determine what the client's encoding is. Sounds like UTF-8. So, when connecting to MySQL, tell it that -- use these in the connect call:
charset="utf8", use_unicode=True
If the character is in the python source, you need
# -*- coding: utf-8 -*-
at the beginning of the source.
Also the column you are inserting into needs to be CHARACTER SET utf8 (or utf8mb4).
utf8mb4 is needed for Emoji and some Chinese characters; otherwise it is 'equivalent' to utf8.
Do not use decode() or any other conversion functions; that will just make things harder to fix. In this arena, two wrongs do not make a right; they make a worse mess.
If you have other symptoms of garbled characters, see Trouble with UTF-8 characters; what I see is not what I stored
To discuss further, please provide the connection call, the SQL statement involved, SHOW CREATE TABLE, and anything else involved.
C3A6 is a valid utf8/utf8mb4 character æ, and could be interpreted as valid, though unlikely, latin1 Ã¦. But it is invalid for CHARACTER SET ascii. (I don't know how the error message occurred unless the connection said ascii or some obscure charset.)
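As for detecting problem strings up front, here is a minimal sketch that covers one common cause only: code points above U+FFFF, which MySQL's 3-byte utf8 cannot store. The function name fits_mysql_utf8 is hypothetical. It does not detect errors caused by a mismatched connection character set, and the example string from the question passes this check, which points toward the connection settings discussed above.

```python
def fits_mysql_utf8(text: str) -> bool:
    """Return True if text contains only code points storable in 3-byte utf8."""
    try:
        # Lone surrogates cannot be encoded as UTF-8 at all.
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    # MySQL's utf8 (utf8mb3) covers only the Basic Multilingual Plane.
    return all(ord(ch) <= 0xFFFF for ch in text)


print(fits_mysql_utf8("Bing\u00ae Blang\u2122"))  # BMP characters only
print(fits_mysql_utf8("\U0001F600"))              # emoji, needs utf8mb4
print(fits_mysql_utf8("\xCE\xA6"))                # the string from the question
```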

mysql Incorrect string value for a column [duplicate]

This is my environment: Client -> iOS App, Server ->PHP and MySQL.
The data from client to server is done via HTTP POST.
The data from server to client is done with json.
I would like to add support for emojis or any utf8mb4 character in general. I'm looking for the right way for dealing with this under my scenario.
My questions are the following:
Does POST allow utf8mb4, or should I convert the data in the client to plain utf8?
If my DB has collation and character set utf8mb4, does it mean I should be able to store 'raw' emojis?
Should I try to work in the DB with utf8mb4 or is it safer/better/more supported to work in utf8 and encode symbols? If so, which encoding method should I use so that it works flawlessly in Objective-C and PHP (and java for the future android version)?
Right now I have the DB with utf8mb4 but I get errors when trying to store a raw emoji. On the other hand, I can store non-utf8 symbols such as ¿ or á.
When I retrieve these symbols in PHP I first need to execute SET CHARACTER SET utf8 (if I get them in utf8mb4 the json_decode function doesn't work), then such symbols are encoded (e.g., ¿ is encoded to \u00bf).
MySQL's utf8 charset is not actually UTF-8, it's a subset of UTF-8 only supporting the basic plane (characters up to U+FFFF). Most emoji use code points higher than U+FFFF. MySQL's utf8mb4 is actual UTF-8 which can encode all those code points. Outside of MySQL there's no such thing as "utf8mb4", there's just UTF-8. So:
Does POST allow utf8mb4, or should I convert the data in the client to plain utf8?
Again, no such thing as "utf8mb4". HTTP POST requests support any raw bytes, if your client sends UTF-8 encoded data you're fine.
If my DB has collation and character set utf8mb4, does it mean I should be able to store 'raw' emojis?
Yes.
Should I try to work in the DB with utf8mb4 or is it safer/better/more supported to work in utf8 and encode symbols?
God no, use raw UTF-8 (utf8mb4) for all that is holy.
When I retrieve this symbols in PHP I first need to execute SET CHARACTER SET utf8
Well, there's your problem; channeling your data through MySQL's utf8 charset will discard any characters above U+FFFF. Use utf8mb4 all the way through MySQL.
if I get them in utf8mb4 the json_decode function doesn't work
You'll have to specify what that means exactly. PHP's JSON functions should be able to handle any Unicode code point just fine, as long as it's valid UTF-8:
echo json_encode('😀');
"\ud83d\ude00"
echo json_decode('"\ud83d\ude00"');
😀
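The same round trip can be checked outside PHP. The sketch below uses Python's standard json module, not the asker's PHP code, to show that a code point above U+FFFF is escaped as a surrogate pair and decodes back to the original character.

```python
import json

emoji = "\U0001F600"  # the grinning face used in the PHP example

# By default non-ASCII characters are escaped; code points above U+FFFF
# are written as a UTF-16 surrogate pair.
encoded = json.dumps(emoji)
decoded = json.loads(encoded)

# With ensure_ascii=False the character is emitted as-is instead.
raw_encoded = json.dumps(emoji, ensure_ascii=False)

print(encoded)
print(decoded == emoji)
```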
Use utf8mb4 throughout MySQL:
SET NAMES utf8mb4
Declare the table/columns CHARACTER SET utf8mb4
Emoji and certain Chinese characters will work in utf8mb4, but not in MySQL's utf8.
Use UTF-8 throughout other things:
HTML:
¿ or á are (or at least can be) encoded in utf8 (utf8mb4)
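To make the utf8 versus utf8mb4 distinction concrete, this small sketch prints the UTF-8 byte sequence for a few of the characters mentioned in this thread. The helper name describe is hypothetical. Characters that need four bytes are the ones MySQL's utf8 rejects and utf8mb4 accepts.

```python
def describe(ch: str):
    """Return (code point label, UTF-8 bytes as hex, needs utf8mb4?) for one character."""
    encoded = ch.encode("utf-8")
    # Four-byte sequences correspond to code points above U+FFFF.
    return (f"U+{ord(ch):04X}", encoded.hex(), len(encoded) == 4)


for ch in ["\u00bf", "\u00e1", "\u2122", "\U0001F600"]:
    label, hex_bytes, needs_mb4 = describe(ch)
    print(label, hex_bytes, "needs utf8mb4" if needs_mb4 else "fits in utf8")
```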

How can I insert arbitrary binary data into a VARCHAR column?

I have a MySQL table with a VARCHAR(100) column, using the utf8_general_ci collation.
I can see rows where this column contains arbitrary byte sequences (i.e. data that contains invalid UTF8 character sequences), but I can't figure out how to write an UPDATE or INSERT statement that allows this type of data to be entered.
For example, I've tried the following:
UPDATE DataTable SET Data = CAST(BINARY(X'16d7a4fca7442dda3ad93c9a726597e4') AS CHAR(100)) WHERE Id = 1;
But I get the error:
Incorrect string value: '\xFC\xA7D-\xDA:...' for column 'Data' at row 1
How can I write an INSERT or UPDATE statement that bypasses the destination column's collation, allowing me to insert arbitrary byte sequences?
Have you considered using one of the Blob data types instead of varchar? I believe that this'd take a lot of the pain away from your use-case.
EDIT: Alternatively, there are the HEX and UNHEX functions, which MySQL supports. HEX takes either a string or a numeric argument and returns the hexadecimal representation of the argument as a string. UNHEX does the inverse, taking a hexadecimal string and returning a binary string.
The short answer is that it shouldn't be possible to insert values with invalid UTF8 characters into a VARCHAR column declared to use the UTF8 characterset.
That's the design goal of MySQL, to disallow invalid values. When there's an attempt to do that, MySQL will return either an error or a warning, or (more leniently?) silently truncate the supplied value at the first invalid character encountered.
The more usual variety of characterset issues are with MySQL performing a characterset conversion when a characterset conversion isn't required.
But the issue you are reporting is that invalid characters were inserted into a UTF8 column. It's as if a latin1 (ISO-8859) encoding was supplied, and a characterset conversion was required, but was not performed.
As far as working around that... I believe it was possible in earlier versions of MySQL. I believe it was possible to cast a value to BINARY, and then wrap that in CONVERT( ... USING UTF8), and MySQL wouldn't perform a validation of the characterset. I don't know if that's still possible with the current MySQL Connectors.
If it is possible, then that's (IMO) a bug in the Connector.
The only way I can think of getting around that characterset check/validation would be to get the MySQL server to trust the client, and determine that no check of the characterset is required. (That would also mean the MySQL server wouldn't be doing a characterset conversion: the client would be lying to the server, telling it that it's supplying valid UTF8 characters.)
Basically, the client would be telling the server "Hey server, I'm going to be sending UTF8 character encodings".
And the server says "Okay. I'll not do any characterset conversion then, since we match. And I'll just trust that what you send is valid UTF8".
And then the client mischievously chuckles to itself, "Heh, heh, I lied. I'm actually sending character encodings that aren't valid UTF8".
And I think it's much more likely to be able to achieve such mischief using prepared statements with the old school MySQL C API (mysql_stmt_prepare, mysql_stmt_execute), supplying invalid UTF8 encodings as values for string bind parameters. (The onus is really on the client to supply valid values for bind parameters.)
You should base64 encode your value beforehand so you can generate a valid SQL with it:
UPDATE DataTable SET Data = from_base64('mybase64-encoded-representation-of-my-value') WHERE Id = 1;
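For reference, here is a small Python sketch that takes the byte string from the question, confirms it is not valid UTF-8, and produces the base64 and hex text forms that could be passed to FROM_BASE64 or UNHEX. It only prepares the literals. Whether the server then accepts the decoded bytes in a utf8 column depends on the server version and SQL mode, and is not demonstrated here.

```python
import base64

# The 16 bytes from the UPDATE statement in the question.
raw = bytes.fromhex("16d7a4fca7442dda3ad93c9a726597e4")

try:
    raw.decode("utf-8")
    is_valid_utf8 = True
    first_bad_offset = None
except UnicodeDecodeError as exc:
    is_valid_utf8 = False
    first_bad_offset = exc.start  # offset of the first offending byte

# ASCII-only representations that are safe to embed in SQL text.
b64_literal = base64.b64encode(raw).decode("ascii")
hex_literal = raw.hex()

print(is_valid_utf8, first_bad_offset)
print(b64_literal)
print(hex_literal)
```

The first offending byte is 0xFC, which is where the quoted error message starts as well.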

For which languages should I use SET NAMES in mysql?

In order to get Hebrew text into the DB, I have to use SET NAMES 'hebrew'. For which other languages should I do the same?
No; you should not SET NAMES 'hebrew'. This will lock you in to using a Hebrew-specific character set, making it impossible to store text in other non-Roman scripts.
Use SET NAMES 'utf8' to set MySQL to store text as Unicode. Always.
Best is to do it always, that's the only way to be sure your encoding is 100% correctly transmitted between Database and Client.
Always use SET NAMES with the character set your client is running on.
For example, if your client code produces and expects strings in UTF-8, set it to utf8; if you're running on latin1, set it to latin1.
Effectively it tells the Database what characterset you want to communicate in, what character set your data uses and what character set you want for the result.

Where did I go wrong with this unicode field in MySQL?

I have a table with a field which contains strings in my MySQL database.
The MySQL version is 5.0.51a. The default character set for the table is 'utf8'.
Many of the strings have unicode characters such as \xae and \u2122 (registered symbol and trademark symbol respectively).
For example, suppose I have a row with a field with this value:
"Bing® Blang™ Blaow"
The default character set of my mysql command line client is "latin1".
If I issue a SELECT statement in the mysql client program from the command line without specifying a character set, the output of the title shows up like so:
"Bing® Blang Blaow"
The (R) symbol is correct but the (TM) symbol is missing. If I cut and paste this string from the console into TextMate, the (TM) symbol appears, but is half-way behind the g in the word "Blang".
I am assuming that the half-way-behind-the-g thing is a just a display error in TextMate (though if anyone can provide further detail that'd be great, but that's not really the important part).
The main thing I am inferring from the its-there-after-you-cut-and-paste behavior is that the data is in the database but there's something wrong with some sort of character set setting somewhere.
If I override the default encoding of the mysql client on the command line like so:
mysql --default-character-set=utf8
Then do the same select, the string comes out as:
"BingÂ® BlangÂ™ Blaow"
which is to say that both the (R) and (TM) symbols appear and are in the right place but both are preceded by the unicode character \xc2 which is an A with a circumflex on top.
(Incidentally this is also how the data is displayed when I pull it out using python and display it on a web page, which is what my real problem is).
Anyway, what is going on here? Everything we have done recently has used UTF8 everywhere possible, but it's possible that some of these rows were inserted prior to that change which means they would've been using the latin1 default... however neither encoding seems to produce the right result?
If the rows were inserted when the default encoding on the table was latin1 before it was switched to utf8, then the encoding was switched (via alter table..) then would the encoding have actually been updated? Should one of the encodings work now? Will unicode ever stop kicking my ass?
There are quite a number of issues here:
About the characters
You indicate that the text has characters U+AE and U+2122 (® and ™ respectively). However, the results imply that the text has U+99 as the character after "Blang": When you set MySQL to output UTF8, then you see this "Â™" -- which is the UTF8 sequence for U+99 displayed on a terminal that is interpreting this byte stream as Windows-1252.
U+99 probably isn't what you wanted: In Unicode, that is an extended control character with no graphic representation. It just so happens that in Windows-1252, that 0x99 is the encoding of the trademark symbol (U+2122).
(Please note that both MySQL and most web browsers have a common, "broken" behavior of using Windows-1252 when you choose Latin1. Sigh.)
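The chain of misinterpretations described here can be reproduced with Python's codecs. This is an illustrative sketch, not the asker's actual pipeline: a trademark sign entered as Windows-1252 is read as strict Latin1, stored as UTF-8, and finally displayed as Windows-1252 again.

```python
trademark = "\u2122"

# 1. The user's system encodes the trademark sign as Windows-1252.
cp1252_bytes = trademark.encode("cp1252")

# 2. The byte is wrongly treated as ISO-8859-1, giving control character U+0099.
misread = cp1252_bytes.decode("latin-1")

# 3. That control character is stored in the UTF-8 column.
stored = misread.encode("utf-8")

# 4. A Windows-1252 terminal then renders the stored bytes.
shown = stored.decode("cp1252")

print(cp1252_bytes, repr(misread), stored, shown)
```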
What's probably wrong
Your terminal isn't operating in the right character set. It is clearly operating in Windows-1252.
Programs should be connecting to the database in UTF-8. You can do that in the command line, as you've found, or by executing the statement SET NAMES utf8; in your database handle before doing anything else. (SET NAMES takes a character set name, not a collation name such as utf8_general_ci.) Some other database APIs may have other ways of doing this, but there is no generic way for all SQL engines. SET NAMES ... is specific to MySQL, but sets all the required character set variables (there are three!) at once.
The process that is inserting data into the database is taking user input and not correctly converting it from Windows-1252 into UTF-8 before inserting. This is how you got a U+99 into your database. Since I don't know how you are getting that data, I'm not sure what to fix, but here are several possibilities:
If the data comes from a web page form, be sure the page with the form is served in UTF-8, is properly marked as such (via the MIME Type, and the <meta> tag.) Be sure also, that the <form> tag is not specifying a different character set.
When converting the data, be sure that you use iconv or similar libraries to convert from the input character set to UTF-8. Even if you think the input is Latin1, do not try to do this by hand (for example, by zero expanding every byte to 16-bits then claiming this is UTF-16 - that won't work for Windows-1252!). Make absolutely certain that you know the character set of the source data. In particular, be sure to know if it is Latin1 or Windows-1252.
Instead of converting the user input, you could connect to the database in the character set of the user input, and then just insert the raw byte data you get from the user. However, you must be sure to only do insertions this way: reading back data from the database with the user's character set in effect will lose information if other rows have data that can't be represented in that character set. It is possible to set up a MySQL connection so that you issue statements in one character set and read results back in another... But it isn't for the faint of heart, and future programmers will likely go nuts trying to understand why the code does this.
If, when you pull the data out with Python and display it in a web page, you see the string "Â™", then that is an indication that you are pulling the data out of the database correctly as UTF-8, but then putting it into a web page that is not correctly identified as UTF-8. Probably it is just defaulting to Latin1, which as noted above will really be Windows-1252.
Nonetheless, even if you fix the display, note that the database has bad data in it, since U+99 isn't really the trademark symbol in a UTF-8 column. You'll need to clean up your data, by reading all the data, and replacing any characters in the range of U+80 through U+9F with what they were likely to have been, assuming the data was really Windows-1252. If you're not certain what character set the data was in originally -- then this data is, alas, just junk.
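A minimal sketch of that cleanup step, assuming the bad rows really did originate as Windows-1252: every character in the range U+80 through U+9F is mapped to the character that Windows-1252 assigns to the same byte value. The function name repair_c1_controls is hypothetical, and writing the repaired values back to the database is left out.

```python
def repair_c1_controls(text: str) -> str:
    """Replace U+0080..U+009F with their Windows-1252 interpretation."""
    repaired = []
    for ch in text:
        if 0x80 <= ord(ch) <= 0x9F:
            try:
                ch = bytes([ord(ch)]).decode("cp1252")
            except UnicodeDecodeError:
                # A few byte values are unassigned in Windows-1252;
                # leave those characters untouched.
                pass
        repaired.append(ch)
    return "".join(repaired)


print(repair_c1_controls("Bing\u00ae Blang\u0099 Blaow"))
```

Characters outside that range, including the correctly stored ®, pass through unchanged.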
About changing character sets of tables
Converting the character set and collation of the table after inserting data will convert the columns, but, of course, any data already inserted will have already lost whatever characters the original character set couldn't represent.
Be careful to note the difference between ALTER TABLE foo CONVERT TO CHARACTER SET ... and ALTER TABLE foo CHARACTER SET ... The latter only changes the default character set for the table, and will not change any columns, even if they were set to the default at creation. (MySQL only uses the defaults at column creation time; it doesn't remember that a given column is "defaulted", nor does it keep it in sync with the table's default.)
I think it has to do with the settings of the mysql connection in your Python code.
Try setting conn.character_set_name or something like that; it depends on the MySQL connection library you are using.
In the case of MySQLdb it should be something like this (conn is your existing connection object):
import types
def character_set_name(*args, **kwargs): return 'utf-8'
# bind the function to this connection instance (the old 'new' module was removed in Python 3)
conn.character_set_name = types.MethodType(character_set_name, conn)
Could it be that some of the columns have an explicitly different character set than the table default?
something like this...?
ALTER TABLE tbl_name CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci