UTF-8 data in Latin1 database: can it be saved? - mysql

I have a rails app that receives data from an Android device. I noticed that some of the data, when in Japanese, is not saved correctly. It shows up as literal question marks (not the diamond ones) in the MySQL client and in the rails website.
It turns out that the database that I have connected to the rails app is set to Latin1. Rails is set to UTF-8.
I read a lot about character encodings, but the cases they describe all involve data that is still somewhat readable. Mine, however, is only literal question marks. Trying to convert the data to UTF-8 using several methods found on the web doesn't change a thing either. I suspect that the data was converted to question marks when it was written to the database.
Sample output from the MySQL console:
select * from foo where bar = "foobar";
+-------+------+------------------------+---------------------+---------------------+
| id    | name | bar                    | created_at          | updated_at          |
+-------+------+------------------------+---------------------+---------------------+
| 24300 | ???? | foobar                 | 2012-01-23 05:04:22 | 2012-01-23 05:04:22 |
+-------+------+------------------------+---------------------+---------------------+
1 row in set (0.00 sec)
The input data, that my rails app got from the Android client was:
name = 爆笑笑話
This input data has been verified to exist in the rails app before saving to the database. So it's not mangled in the Android client or during transfer to the server. Is there any chance I can get this data back? Or is it completely lost?

It's actually very easy to think that data is encoded one way when it is actually encoded in some other way: any attempt to retrieve the data directly will convert it first to the character set of your database connection and then to the character set of your output medium. You should therefore begin by verifying the actual encoding of your stored data with either of:
SELECT BINARY name FROM foo WHERE bar = 'foobar';
SELECT HEX(name) FROM foo WHERE bar = 'foobar';
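For reference: if the column really holds UTF-8 bytes, the HEX() query should return E78886E7AC91E7AC91E8A9B1 for the name 爆笑笑話 (three bytes per character); if it returns 3F3F3F3F, the column holds four literal question marks and the original data is gone.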
Where the character 爆 is expected, you will likely find either of the following byte sequences:
0xe78886, indicating that your column actually contains UTF-8 encoded data: this usually happens when the character set of the database connection over which the text was originally inserted was set to latin1 but actually UTF-8 encoded data was sent.
You are presumably seeing ? characters when fetching the data because something between the data storage and the display was unable to transcode those bytes. However, given that MySQL thinks they represent 爆 and that such characters are available in most of the relevant character sets, it's unlikely that the failure occurs within MySQL itself, unless you are explicitly adjusting the encoding information during retrieval.
Anyway, if this is the case, you need to drop the encoding information from the column and then tell MySQL that the data is actually encoded as UTF-8. As documented under ALTER TABLE Syntax:
Warning 
The CONVERT TO operation converts column values between the character sets. This is not what you want if you have a column in one character set (like latin1) but the stored values actually use some other, incompatible character set (like utf8). In this case, you have to do the following for each such column:
ALTER TABLE t1 CHANGE c1 c1 BLOB;
ALTER TABLE t1 CHANGE c1 c1 TEXT CHARACTER SET utf8;
The reason this works is that there is no conversion when you convert to or from BLOB columns.
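Applied to the table from your example, and assuming name is a TEXT column (use the VARBINARY/VARCHAR equivalents if it is a VARCHAR), the recipe would be:
ALTER TABLE foo CHANGE name name BLOB;
ALTER TABLE foo CHANGE name name TEXT CHARACTER SET utf8;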
0x3f, indicating that the database does actually contain the literal character ? and your original data has been lost: this doesn't happen easily, since MySQL usually throws error 1366 if implicit transcoding results in loss of data. Perhaps there was some explicit transcoding in your insert statement?
In this case, you need to convert the storage encoding to a suitable format, then update or re-insert the data:
ALTER TABLE foo CONVERT TO CHARACTER SET utf8;
UPDATE foo SET name = _utf8 '爆笑笑話' WHERE bar = 'foobar';

Related

MySQL AES_DECRYPT of mediumblob column with charset latin1

I have a database with 3 tables, each having a column that is AES_ENCRYPTed.
One table has a column which is a VARCHAR(255) with charset latin1. It contains some text encrypted.
Applying the AES_DECRYPT with an UNHEX of the column value shows the output of the real value properly.
Another table has a column which is a mediumblob with charset utf8. It contains some large text encrypted.
Applying just the AES_DECRYPT and then casting it to char displays the original value properly too.
But the 3rd table has a column which is also a mediumblob but charset latin1. It contains a large JSON data stringified.
Now when I apply just the AES_DECRYPT('<column_name>', '') it outputs null. Applying unhex or casting the column to other types before the decryption did not do anything.
For the first 2 tables, just applying AES_DECRYPT without the conversion also outputs something.
But the 3rd table does not output anything; just shows NULL.
Any idea what is happening here? It would be very helpful if someone with DB expertise could point me in the right direction as to why the output is NULL and what needs to be done in the query to get the real output.
EDIT:
The DB columns are populated by a microservice which uses Java Hibernate's ColumnTransformer for doing the write and read.
write = AES_ENCRYPT(, ), read = AES_DECRYPT(, )
The values posted via this are also returned properly in the GET response. But the same query does not output the 3rd column's value and prints NULL as described.
With a table encryption that has json, encrypted and decrypted columns, these commands work well for me:
UPDATE encryption SET `encrypted` = AES_ENCRYPT(CONVERT(`json` USING latin1), "key");
UPDATE encryption SET `decrypted` = AES_DECRYPT(encrypted, "key");
Note, however, that BLOBs don't have any character set...
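To read such a value back directly in a query, a cast along these lines should work (a sketch, assuming the same 'key' and the latin1 conversion used at write time):
SELECT CAST(AES_DECRYPT(`encrypted`, 'key') AS CHAR CHARACTER SET latin1) FROM encryption;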

MySQL strange characters replace with <BR

I inherited a MySQL table (MyISAM, utf8_general_ci collation) that has a strange character that looks like this in phpMyAdmin: •
I assume this a bullet point of some type?
When rendered on a HTML page it looks like this: �
How do I replace this value with a <BR><LI> so I can turn it into a line break with a properly formatted list item?
I've tried a standard UPDATE query, but it does not replace these values. I assume I need to escape them somehow?
Query attempted:
UPDATE `FL_Regs` SET `Remarks` = "<BR><LI>" WHERE `Remarks` = "•"
You didn't show your query, so I'm only guessing.
If you're having a hard time with your client encoding characters for you (I imagine you may be using phpMyAdmin, which involves a lot of steps between your browser and the actual server), you may try giving the string to search for as a sequence of bytes.
It happens that • is U+2022, a character named "BULLET" in Unicode, which is encoded as e2 80 a2 in UTF-8. So you can use X'E280A2' instead of '•' in your query.
Typically:
> select X'E280A2';
+-----------+
| X'E280A2' |
+-----------+
| •         |
+-----------+
You can, if you want to better understand what's happening, try using the HEX() function: first, maybe, to check what MySQL is receiving when you're sending a bullet:
SELECT HEX('•');
Typically I get E280A2, which, as seen previously, is the UTF-8 encoding of the BULLET character.
And then see what's actually stored in your table:
SELECT HEX(your_column) FROM your_table;
Try to limit the search to a single row to keep the output almost readable.
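Building on that, here is a sketch of an UPDATE that rewrites the character in place rather than overwriting the whole field (assuming Remarks really does contain the UTF-8 bullet bytes found with HEX()):
UPDATE `FL_Regs` SET `Remarks` = REPLACE(`Remarks`, X'E280A2', '<BR><LI>');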

How can I insert arbitrary binary data into a VARCHAR column?

I have a MySQL table with a VARCHAR(100) column, using the utf8_general_ci collation.
I can see rows where this column contains arbitrary byte sequences (i.e. data that contains invalid UTF8 character sequences), but I can't figure out how to write an UPDATE or INSERT statement that allows this type of data to be entered.
For example, I've tried the following:
UPDATE DataTable SET Data = CAST(BINARY(X'16d7a4fca7442dda3ad93c9a726597e4') AS CHAR(100)) WHERE Id = 1;
But I get the error:
Incorrect string value: '\xFC\xA7D-\xDA:...' for column 'Data' at row 1
How can I write an INSERT or UPDATE statement that bypasses the destination column's collation, allowing me to insert arbitrary byte sequences?
Have you considered using one of the BLOB data types instead of VARCHAR? I believe this would take a lot of the pain away from your use case.
EDIT: Alternatively, there are the HEX() and UNHEX() functions, which MySQL supports. HEX() takes either a string or a numeric argument and returns the hexadecimal representation of the argument as a string. UNHEX() does the inverse, taking a hexadecimal string and returning a binary string.
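A minimal sketch of that route, assuming it is acceptable to change the column to a binary type (binary columns have no character set, so no validation applies):
ALTER TABLE DataTable MODIFY Data VARBINARY(100);
UPDATE DataTable SET Data = UNHEX('16d7a4fca7442dda3ad93c9a726597e4') WHERE Id = 1;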
The short answer is that it shouldn't be possible to insert values with invalid UTF8 characters into a VARCHAR column declared with the UTF8 character set.
That's a design goal of MySQL: to disallow invalid values. When there's an attempt to do so, MySQL returns either an error or a warning, or (more leniently?) silently truncates the supplied value at the first invalid character encountered.
The more usual variety of characterset issues are with MySQL performing a characterset conversion when a characterset conversion isn't required.
But the issue you are reporting is that invalid characters were inserted into a UTF8 column. It's as if a latin1 (ISO-8859) encoding was supplied, and a characterset conversion was required, but was not performed.
As far as working around that... I believe it was possible in earlier versions of MySQL: you could cast a value to BINARY and then wrap that in CONVERT( ... USING UTF8), and MySQL wouldn't perform a validation of the characterset. I don't know if that's still possible with the current MySQL Connectors.
If it is possible, then that's (IMO) a bug in the Connector.
The only way I can think of getting around that characterset check/validation would be to get the MySQL server to trust the client and determine that no check of the characterset is required. (That would also mean the MySQL server wouldn't be doing a characterset conversion: the client would be lying to the server, telling the server that it's supplying valid UTF8 characters.)
Basically, the client would be telling the server "Hey server, I'm going to be sending UTF8 character encodings".
And the server says "Okay. I'll not do any characterset conversion then, since we match. And I'll just trust that what you send is valid UTF8".
And then the client mischievously chuckles to itself, "Heh, heh, I lied. I'm actually sending character encodings that aren't valid UTF8".
And I think it's much more likely that such mischief could be achieved using prepared statements with the old-school MySQL C API (mysql_stmt_prepare, mysql_stmt_execute), supplying invalid UTF8 encodings as values for string bind parameters. (The onus is really on the client to supply valid values for bind parameters.)
You should base64-encode your value beforehand so you can generate valid SQL with it:
UPDATE DataTable SET Data = from_base64('mybase64-encoded-representation-of-my-value') WHERE Id = 1;
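If you want to produce that base64 text on the server side, MySQL 5.6+ can do it too; for the example bytes above:
SELECT TO_BASE64(X'16d7a4fca7442dda3ad93c9a726597e4');
Note that FROM_BASE64() returns a binary string, so the destination column must still be able to accept those bytes.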

mysql charsets, can I perform the conversion in python?

I have a MySQL database which contains some bad data.
I start with this Unicode string:
u'TECNOLOGÍA Y EDUCACIÓN'
Encoding to UTF-8 for the database yields:
'TECNOLOG\xc3\x8dA Y EDUCACI\xc3\x93N'
When I send these bytes to the database, using connection charset latin1 and database charset utf8 (yes, I know this is wrong, but this has already happened, many, many times, and the goal now is to figure out the exact process of corruption so it can be reversed), the data is converted to this (checked using BINARY()):
'TECNOLOG\xc3\x83\xc2\x8dA Y EDUCACI\xc3\x83\xe2\x80\x9cN'
Double-encoding aside, the result I'd expect here is:
'TECNOLOG\xc3\x83\xc2\x8dA Y EDUCACI\xc3\x83\xc2\x93N'
Most of this makes sense, as it is interpreting the multi-byte UTF-8 chars as latin1, and encoding each byte as an individual char, but the conversion of \x93 -> \xe2\x80\x9c makes no sense. latin1's \x93 does not convert to UTF-8 \xe2\x80\x9c, although \xe2\x80\x9c can be converted to Unicode, yielding u'\u201c', which is codepoint \x93 in the CP-1252 charset.
Is mysql combining latin1 and CP-1252 when it handles conversions? How can I replicate the conversion process entirely in python? I've iterated through every encoding on the system and none of them work for the entire string. How, in python, can I get from 'TECNOLOG\xc3\x83\xc2\x8dA Y EDUCACI\xc3\x83\xe2\x80\x9cN' back to 'TECNOLOG\xc3\x8dA Y EDUCACI\xc3\x93N'? Decoding as UTF-8 will handle the first 3/4ths correctly, but that last one is just wrong, and nothing I've tried will return the correct results.
the goal now is to figure out the exact process of corruption so it can be reversed
As documented under ALTER TABLE Syntax:
Warning
The CONVERT TO operation converts column values between the character sets. This is not what you want if you have a column in one character set (like latin1) but the stored values actually use some other, incompatible character set (like utf8). In this case, you have to do the following for each such column:
ALTER TABLE t1 CHANGE c1 c1 BLOB;
ALTER TABLE t1 CHANGE c1 c1 TEXT CHARACTER SET utf8;
The reason this works is that there is no conversion when you convert to or from BLOB columns.
In your case:
change the column's encoding to the connection character set that was used on insertion (i.e. latin1), so that the stored bytes become the same as those that were originally received:
ALTER TABLE my_table MODIFY my_column TEXT CHARACTER SET latin1;
then drop the encoding information (by modifying the column so that it becomes a binary string):
ALTER TABLE my_table MODIFY my_column BLOB;
then apply the correct encoding information (by modifying the column so that it becomes a character string in the utf8 character set):
ALTER TABLE my_table MODIFY my_column TEXT CHARACTER SET utf8;
Be careful to use datatypes of sufficient length to avoid data truncation. Also be careful to ensure that application code thenceforth uses the correct connection character set (or else you may end up with a table where some records are encoded in one manner and others in another, which can be a nightmare to resolve).
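As a sanity check after those three statements, the stored bytes should now be genuine UTF-8:
SELECT HEX(my_column) FROM my_table;
For the example value this should begin 5445434E4F4C4F47C38D ('TECNOLOGÍ'), rather than the double-encoded 5445434E4F4C4F47C383C28D.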
If you cannot modify the database just yet, simply fetching data whilst the connection character set is latin1 (but with your application expecting UTF-8) will yield correct data. Or else, use CONVERT():
SELECT CONVERT(BINARY CONVERT(my_column USING latin1) USING utf8)
FROM my_table
Is mysql combining latin1 and cp1252 when it handles conversions?
As documented under West European Character Sets:
MySQL's latin1 is the same as the Windows cp1252 character set. This means it is the same as the official ISO 8859-1 or IANA (Internet Assigned Numbers Authority) latin1, except that IANA latin1 treats the code points between 0x80 and 0x9f as “undefined,” whereas cp1252, and therefore MySQL's latin1, assign characters for those positions. For example, 0x80 is the Euro sign. For the “undefined” entries in cp1252, MySQL translates 0x81 to Unicode 0x0081, 0x8d to 0x008d, 0x8f to 0x008f, 0x90 to 0x0090, and 0x9d to 0x009d.
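You can verify that mapping from SQL itself: relabel the single byte 0x93 as latin1 and convert it to utf8 (a standalone check that needs no table):
SELECT HEX(CONVERT(CONVERT(X'93' USING latin1) USING utf8));
This returns E2809C, the UTF-8 encoding of U+201C. To replicate the process entirely in Python, then: decode the original bytes as cp1252 and re-encode as UTF-8 to reproduce the corruption, or decode the stored bytes as UTF-8 and re-encode as cp1252 to reverse it, in both directions falling back to latin1 for the five byte values (0x81, 0x8D, 0x8F, 0x90, 0x9D) that cp1252 leaves undefined, since MySQL maps those straight through.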

Where did I go wrong with this unicode field in MySQL?

I have a table with a field which contains strings in my MySQL database.
The MySQL version is 5.0.51a. The default character set for the table is 'utf8'.
Many of the strings have unicode characters such as \xae and \u2122 (registered symbol and trademark symbol respectively).
For example, suppose I have a row with a field containing this value:
"Bing® Blang™ Blaow"
The default character set of my mysql command line client is "latin1".
If I issue a SELECT statement in the mysql client program from the command line without specifying a character set, the output of the title shows up like so:
"Bing® Blang Blaow"
The (R) symbol is correct but the (TM) symbol is missing. If I cut and paste this string from the console into TextMate, the (TM) symbol appears, but is half-way behind the g in the word "Blang".
I am assuming that the half-way-behind-the-g thing is a just a display error in TextMate (though if anyone can provide further detail that'd be great, but that's not really the important part).
The main thing I am inferring from the it's-there-after-you-cut-and-paste behavior is that the data is in the database, but there's something wrong with some sort of character set setting somewhere.
If I override the default encoding of the mysql client on the command line like so:
mysql --default-character-set=utf8
Then do the same select, the string comes out as:
"Bing® Blang™ Blaow"
which is to say that both the (R) and (TM) symbols appear and are in the right places, but both are preceded by the character \xc2, which is an A with a circumflex on top (Â).
(Incidentally this is also how the data is displayed when I pull it out using python and display it on a web page, which is what my real problem is).
Anyway, what is going on here? Everything we have done recently has used UTF8 everywhere possible, but it's possible that some of these rows were inserted prior to that change which means they would've been using the latin1 default... however neither encoding seems to produce the right result?
If the rows were inserted when the default encoding on the table was latin1 before it was switched to utf8, then the encoding was switched (via alter table..) then would the encoding have actually been updated? Should one of the encodings work now? Will unicode ever stop kicking my ass?
There are quite a number of issues here:
About the characters
You indicate that the text has characters U+AE and U+2122 (® and ™ respectively). However, the results imply that the text actually has U+99 as the character after "Blang": when you set MySQL to output UTF8, you see "Â™", which is the UTF8 sequence for U+99 displayed on a terminal that is interpreting this byte stream as Windows-1252.
U+99 probably isn't what you wanted: in Unicode, that is an extended control character with no graphic representation. It just so happens that in Windows-1252, 0x99 is the encoding of the trademark symbol (U+2122).
(Please note that both MySQL and most web browsers have a common, "broken" behavior of using Windows-1252 when you choose Latin1. Sigh.)
What's probably wrong
Your terminal isn't operating in the right character set. It is clearly operating in Windows-1252.
Programs should be connecting to the database in UTF-8. You can do that on the command line, as you've found, or by executing the statement SET NAMES utf8; on your database handle before doing anything else. Some other database APIs may have other ways of doing this, but there is no generic way for all SQL engines. SET NAMES ... is specific to MySQL, but it sets all the required character set variables (there are three!) at once.
The process that is inserting data into the database is taking user input and not correctly converting it from Windows-1252 into UTF-8 before inserting. This is how you got a U+99 into your database. Since I don't know how you are getting that data, I'm not sure what to fix, but here are several possibilities:
If the data comes from a web page form, be sure the page with the form is served in UTF-8 and is properly marked as such (via the MIME type and the <meta> tag). Be sure also that the <form> tag is not specifying a different character set.
When converting the data, be sure that you use iconv or similar libraries to convert from the input character set to UTF-8. Even if you think the input is Latin1, do not try to do this by hand (for example, by zero-expanding every byte to 16 bits and then claiming this is UTF-16; that won't work for Windows-1252!). Make absolutely certain that you know the character set of the source data. In particular, be sure to know whether it is Latin1 or Windows-1252.
Instead of converting the user input, you could connect to the database in the character set of the user input, and then just insert the raw byte data you get from the user. However, you must be sure to do only insertions this way: reading data back with the user's character set in effect will lose information if other rows have data that can't be represented in that character set. It is possible to set up a MySQL connection so that you issue statements in one character set and read results back in another, but it isn't for the faint of heart, and future programmers will likely go nuts trying to understand why the code does this.
If, when you pull the data out with Python and display it in a web page, you see the string "Â™", then that is an indication that you are pulling the data out of the database correctly as UTF-8, but then putting it into a web page that is not correctly identified as UTF-8. Probably it is just defaulting to Latin1, which, as noted above, will really be Windows-1252.
Nonetheless, even if you fix the display, note that the database has bad data in it, since U+99 isn't really the trademark symbol in a UTF-8 column. You'll need to clean up your data by reading all of it and replacing any characters in the range U+80 through U+9F with what they were likely to have been, assuming the data was really Windows-1252. If you're not certain what character set the data was in originally, then this data is, alas, just junk.
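As a sketch of that cleanup for the trademark character specifically (hypothetical table and column names; each code point in the U+80 through U+9F range needs its own replacement, X'C299' being the UTF-8 bytes of U+99 and X'E284A2' those of ™):
UPDATE my_table SET my_column = REPLACE(my_column, CONVERT(X'C299' USING utf8), CONVERT(X'E284A2' USING utf8));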
About changing character sets of tables
Converting the character set and collation of the table after inserting data will convert the columns, but, of course, any data already inserted will have already lost whatever characters the original character set couldn't represent.
Be careful to note the difference between ALTER TABLE foo CONVERT TO CHARACTER SET ... and ALTER TABLE foo CHARACTER SET ... The latter only changes the default character set for the table, and will not change any columns, even if they were set to the default at creation. (MySQL only uses the defaults at column creation time; it doesn't remember that a given column is "defaulted", nor does it keep it in sync with the table's default.)
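For illustration, with a hypothetical table foo:
ALTER TABLE foo CONVERT TO CHARACTER SET utf8;
ALTER TABLE foo CHARACTER SET utf8;
The first statement converts the existing columns and their data; the second only changes the table's default for columns added later.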
I think it has to do with the settings of the MySQL connection in your Python code.
Try setting conn.character_set_name or something like that; it depends on the MySQL connection lib you are using.
In the case of MySQLdb it should be something like this:
import new  # Python 2 stdlib module for binding the replacement method
def character_set_name(*args, **kwargs): return 'utf-8'  # make the driver report UTF-8
conn.character_set_name = new.instancemethod(character_set_name, conn, conn.__class__)
Could it be that some of the columns have an explicitly different character set than the table default?
something like this...?
ALTER TABLE tbl_name CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci