Here's my scenario. I save a bunch of Strings containing asian characters in MySQL using Hibernate. These strings are written in varbinary columns. Everything works fine during the saving operation. The DB contains the correct values (sequence of bytes). If I query (again using Hibernate) for the Strings that I saved I get the correct results. But when Hibernate fills the entity to which the Strings belong with the values from the DB I get different values then the ones I used in the query that retrieved them. Instead of receiving the correct values I receive a bunch of FFFD replacement characters.
For example: if I store "하늘" in the DB and then I query for it, the resulting String will be \uFFFD\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD.
the DB connection has the following parameters set useUnicode=true&characterEncoding=UTF-8,
I've tried using the following configurations for Hibernate but that didn't solve the problem:
- connection.useUnicode = true
- connection.characterEncoding = UTF-8
By the way, this all works fine if the MySQL columns are of type varchar.
What am I missing? Any suggestions?
Thanks
Set the connection character set to be binary too:
SET NAMES 'binary';
Related
I have a database where there are 3 tables with 3 columns which are AES_ENCRYPTed.
One table has a column which is a VARCHAR(255) with charset latin1. It contains some text encrypted.
Applying the AES_DECRYPT with an UNHEX of the column value shows the output of the real value properly.
Another table has a column which is a mediumblob with charset utf8. It contains some large text encrypted.
Applying just the AES_DECRYPT and then casting it to char displays the original value properly too.
But the 3rd table has a column which is also a mediumblob but charset latin1. It contains a large JSON data stringified.
Now when I apply just the AES_DECRYPT('<column_name>', '') it outputs null. Applying unhex or casting the column to other types before the decryption did not do anything.
For the first 2 tables, just applying the AES_DECRYPT without the conversion also output something.
But the 3rd table does not output anything; just shows NULL.
Any idea what is happening here? It would be very helpful if someone with DB expertise can point me to the right direction why the output is NULL and what needs to be done in the query to get the real output.
EDIT:
The DB Columns are populated by a microservice which uses JAVA Hibernate ColumnTransformer for doing the write and read.
write = AES_ENCRYPT(, ), read = AES_DECRYPT(, )
The values posted via this is also returned properly in the GET response. But the same query does not output the 3rd column value and print NULL as described.
With this structure :
And this command :
UPDATE encryption SET `encrypted` = AES_encrypt(CONVERT(`json` USING latin1), "key");
UPDATE encryption SET `decrypted` = AES_decrypt(encrypted, "key");
This works well for me.
However blobs doesn't any character sets...
I'm encrypting some user data (no passwords) using Crypto.Cipher AES.
The return is a AESCipher of the form:
b'o\xab\xdd\x19\xaat\xfcIAN\xd2\x00\xe9'
sometimes it produces spaces and non hex representations.
b"N%?\x91\xe8'J\xc0\x10 p"
b'QV8>K\xd8\xfa\x9a\x05%\xe8LJp\xd0gf'
When I try to insert these values into the SQL table created as:
data_encrypted VARBINARY(40)
I get the following warning:
Warning: (1300, "Invalid utf8 character string: 'ABDD19'")
Seems like it's trimming the binary array, when I Query the inserted rows in the table, the row was effectively inserted but only the first bytes of the array, from b'o\xab\xdd\x19\xaat\xfcIAN\xd2\x00\xe9' it inserts only the 'o'.
Do I have to specify something else in the format?
Thanks
Adding a _binary solved the warning
self.cur.execute("INSERT IGNORE INTO members VALUES(%s,_binary %s...
And I'm also formatting the array to be hex represented, it looks better in the MySQL table and the decryption still works ok binascii.b2a_hex
I have a MySQL table with a VARCHAR(100) column, using the utf8_general_ci collation.
I can see rows where this column contains arbitrary byte sequences (i.e. data that contains invalid UTF8 character sequences), but I can't figure out how to write an UPDATE or INSERT statement that allows this type of data to be entered.
For example, I've tried the following:
UPDATE DataTable SET Data = CAST(BINARY(X'16d7a4fca7442dda3ad93c9a726597e4') AS CHAR(100)) WHERE Id = 1;
But I get the error:
Incorrect string value: '\xFC\xA7D-\xDA:...' for column 'Data' at row 1
How can I write an INSERT or UPDATE statement that bypasses the destination column's collation, allowing me to insert arbitrary byte sequences?
Have you considered using one of the Blob data types instead of varchar? I believe that this'd take a lot of the pain away from your use-case.
EDIT: Alternatively, there is the HEX and UNHEX functions, which MySQL supports. Hex takes either a str or a numeric argument and returns the hexadecimal representation of your argument as a string. Unhex does the inverse; taking a hexadecimal string and returning a binary string.
The short answer is that it shouldn't be possible to insert values with invalid UTF8 characters into VARCHAR column declared to use UTF8 characterset.
That's the design goal of MySQL, to disallow invalid values. When there's an attempt to do that, MySQL will return either an error or a warning, or (more leniently?) silently truncate the supplied value at the first invalid character encountered.
The more usual variety of characterset issues are with MySQL performing a characterset conversion when a characterset conversion isn't required.
But the issue you are reporting is that invalid characters were inserted into a UTF8 column. It's as if a latin1 (ISO-8859) encoding was supplied, and a characterset conversion was required, but was not performed.
As far as working around that... I believe it was possible in earlier versions of MySQL. I believe it was possible to cast a value to BINARY, and then warp that in CONVERT( ... USING UTF8), and MySQL wouldn't perform a validation of the characterset. I don't know if that's still possible with the current MySQL Connectors.
If it is possible, then that's (IMO) a bug in the Connector.
The only way I can think of getting around that characterset check/validation would be to get the MySQL sever to trust the client, and determine that no check of the characterset is required. (That would also mean the MySQL server wouldn't be doing a characterset conversion, the client lying to the server, the client telling the server that it's supplying valid UTF8 characters.
Basically, the client would be telling the server "Hey server, I'm going to be sending UTF8 character encodings".
And the server says "Okay. I'll not do any characterset conversion then, since we match. And I'll just trust that what you send is valid UTF8".
And then the client mischievously chuckles to itself, "Heh, heh, I lied. I'm actually sending character encodings that aren't valid UTF8".
And I think it's much more likely to be able to achieve such mischief using prepared statements with the old school MySQL C API (mysql_stmt_prepare, mysql_stmt_execute), supplying nvalid UTF8 encodings as values for string bind parameters. (The onus is really on the client to supply valid values for bind parameters.)
You should base64 encode your value beforehand so you can generate a valid SQL with it:
UPDATE DataTable SET Data = from_base64('mybase64-encoded-representation-of-my-value') WHERE Id = 1;
I have been tasked with migrating a Microsoft SQL Server 2005 database to MySQL 5.6 (these are both database servers runnig locally) and would really appreciate some help.
-MSSQL source database has latin1 collation (so has ISO 8859-1 character set right?) but doesn't have any char/varchar fields (any string field is nvarchar/nchar) so all this data should be using the UCS-2 character set.
-MySQL target database wants the character set UTF-8
I decided to use the database migration toolkit in the latest version of the MySQL workbench. at first it worked fine and migrated everything as expected. But I have been totally tripped up upon encountering UCS-2 surrogate pair characters in the MSSQL database.
The migration toolkit copytable program did not provide a very useful error message: "Error during charset conversion of wstring: No error". It also did not provide any field/row information on the problem-causing data and would fail within chunks of 100 rows. So after searching through the 100 rows after the last successful insert I found that the issue seemed to be caused by two UCS-2 characters in one of the nvarchar fields. They are listed as surrogate pairs in the UCS-2 character set. They were specifically the characters DBC0 and DC83 (I got this by looking at the binary data for the field and comparing byte pairs (little endian) with data that was being migrated successfully).
When this surrogate pair was removed from the MSSQL database the row was migrated successfully to MySQL.
Here is the problem:
I have tried to search for these characters in a test MSSQL table (this chartest table is just various test strings an nvarchar field) to prepare a replacement script and keep getting strange results... I must be doing something incorrectly.
Searching for
SELECT * FROM chartest WHERE text LIKE NCHAR(0xdc83)
Will return any surrogate pair character (whether or not it uses DC83), but obviously, only if it is the only character (or part of the pair) in that field. This isn't a big deal since I would like to remove any instance of these anyway (I dont like to remove data like this but I think we can afford it).
Searching for
SELECT * FROM chartest WHERE text LIKE '%' + (NCHAR(0xdc83))+ '%'
Will return every row! Regardless of whether it even has a unicode character present in the field let alone the DC83 character. Is there a better way to find and replace these characters? Or something else I should try?
I have also tried setting the target databse, table, and field character set to UCS-2 but it seems as though it does not make a difference.
I should also mention that this migration is using live data (~50GB database!) while one of the sites that feeds it is taken offline so any solutions to this need to have a quick running time...
I would appreciate any suggestions very much! Please let me know if there is any information I have left out.
I had this error, and now I have discovered the source of the problem. I had a hard time finding out, so maybe this will be useful to someone, even though I realize, my problem and workaround may not be spot on matching op's original trouble.
I am migrating data from MSSQL to MySQL, and the content being migrated is html-content from Sitecore CMS (target CMS is Drupal, btw).
I've found, that I get this error when converting the database and hitting records, that contain Instagram-embeds. Instagram-embeds work in the way, that the embedded post data is copied to the embed code (instead of being loaded async., et.c. - even the image is included as base64-css...), and the young people nowadays tend to put a lot of emoji's in their image-descriptions (using their iPhones with Emoji keyboard). Emoji's are represented by 4-byte encoded characters, but MySQL utf8 only allows for 3-byte encoded unicode characters.
My initial error from running wbcopytables.exe (which is the non-GUI way of doing Migration Wizard in MySQL Workbench) was the
Error during charset conversion of wstring: No error
but upgrading MySQL Workbench to recent version (from 5.something to 6.x) makes the error a bit more descriptive, hinting table and column (alas, not row):
ERROR: Could not successfully convert UCS-2 string to UTF-8 in table
[MyDatabase].[dbo].[MyTable] (column MyColumn).
Original string: ...
Anyway - a solution *could* be to use utf8mb4 which would allow for the emoji's. Read more here.
But it looks like, it's a bad idea to do this in e.g. my case with Drupal.
So - the solution I ended up with was simply to strip these characters in my migrate-script. There is no point in keeping these for users of the site in question, since they are being displayed as rectangles on the webpage anyway. Since you can't search-and-replace with regex in SQL Server, I processed the data using a DAL and c# .NET, and I found the help here (thanks a ton, Jon Skeet) - turns out there is a regex-pattern for matching one half of a surrogate pair in UTF-16. See below (and use the pattern in another language if needed).
var noUcs2SurrogatePairsString = Regex.Replace(stringWithUcs2SurrogatePairs, #"\p{Cs}", string.Empty);
I had a very similar problem today, and I found that it was caused by empty strings, replaced them with NULLs or a value representing no data and the migration worked fine.
I solved just editing the "import data script.cmd" where it reads columns "As NVARCHAR" by replacing those with "VARCHAR" only.
Note: My table columns was VARCHAR type already, so... for some stupid reason the migration script improperly cast it to UNICODE (NVARCHAR) type.
This issue has now been resolved. I used user Remus Rusanu's suggestion here for finding the rows with these surrogate pair characters using CHARINDEX and have decided to use SUBSTRING to exclude the troublesome characters like so:
UPDATE test
SET a = SUBSTRING(a, 1, (CHARINDEX(0x83dc, CAST(a AS VARBINARY(8000)))+1)/2 - 1) -- string before the unwanted character
+ SUBSTRING(a, (CHARINDEX(0x83dc, CAST(a AS VARBINARY(8000)))+1)/2 +1, LEN(a) ) -- string after the unwanted character
WHERE CHARINDEX(0x83dc, CAST(a AS VARBINARY(8000))) % 2 = 1 -- only odd numbered charindexes (to signify match at beginning of byte pair character)
I have a php script that inserts values into mySQL table
INSERT INTO stories (title) VALUES('$_REQUEST[title]);
I checked the values of my request variables before going into the table and it's fine.
But when I add title=john to the table for example,
I get something like this:
title = "[][][][]john"
and when I extract the value, it's a newline then john.
I have my columns set to utf-8, I tried swedish character set as well.
Note: I don't get this error when inserting values from the phpMyAdmin commandline
You need {} around any array notation when used inside "".
$q="INSERT INTO stories(title) VALUES('{$_REQUEST['title']}')";
BTW, it would be better, when checking your $_REQUEST vars to store the sanitized versions in new variables, and to be sure to escape them with real_escape_string()
SET NAMES <encoding> query must be executed every time you connect to your database.
very simple rule.
where <encoding> is your HTML page encoding in mysql dialect (utf8 for the utf-8)
You need to check the character set of the database, the server, and the client.
Note that it's not a swedish character set, it's a swedish collation.