Mixing UTF8 and latin1_bin fields in MySQL with PHP - mysql

On a database I'll have to store names and such in UTF8, and hashes in latin1_bin. I called SET NAMES utf8, but I noticed that it corrupted the latin1 fields when I tried to read them (I was able to write them just fine). Which is odd, since if I understood correctly that query is only about sending data to the server, not receiving it.
phpMyAdmin displays broken data too.
Any clue about what I might be doing wrong?
(using MAMP 1.9.6)
edit: this answer specifies this is also the charset used to send data back to the client. I'm getting confused: what's the point of specifying the charset of a column if that will be ignored anyway?
edit:
excerpt from the column definition:
`tok` char(64) CHARACTER SET latin1 COLLATE latin1_bin NOT NULL,
`sal` char(16) CHARACTER SET latin1 COLLATE latin1_bin NOT NULL,
...
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_roman_ci ;
excerpt from queries:
SELECT tok,sal FROM user WHERE id=4 LIMIT 1
.
INSERT INTO user (tok, sal) VALUES (x'1387ea0c22277d3000bd23241c357e3a9ba45a2e28f50581d63a73bf785a7458a95cca4de27d0a86588f5bdfa94415d6a255c2c0379ebc2f00dacba03ae6b866', x'8fca28a592c29f245ff0a3ba5f97420c')

You should change your table definition to use MySql's BINARY type, which is the perfect fit for this kind of data:
The BINARY and VARBINARY types are similar to CHAR and VARCHAR, except
that they contain binary strings rather than nonbinary strings. That
is, they contain byte strings rather than character strings. This
means that they have no character set, and sorting and comparison are
based on the numeric values of the bytes in the values.
The column definitions would become:
`tok` binary(64) NOT NULL,
`sal` binary(16) NOT NULL,

set names ... only sets the character set of the connection (of the client side only!) for both sending and receiving the data. It is not related to the encoding of the fields in the database! It is only related to the encoding which you use as a client to the database, e.g. in PHP. The database engine will always convert the encoding between the encoding in tables to the encoding specified in set names ... query.
So in table definition you just specify encoding for each field as you need, and you don't change anything in the way how you used the set names ... command - it just remains the same as before.

Leaving SET NAMES on, and the INSERT query intact, I changed the SELECT query to:
SELECT BINARY(tok) AS tok, BINARY(sal) AS sal FROM user WHERE id=4 LIMIT 1
i.e. I casted the hash fields to binary.
While this worked, I'll leave this open in case someone provides a (perhaps more correct?) alternative.

Related

Converting non-utf8 database to utf-8

I've been using for a long time a database/connection with the wrong encoding, resulting the hebrew language characters in the database to display as unknown-language characters, as the example shows below:
I want to re-import/change the database with the inserted-wrong-encoded characters to the right encoded characters, so the hebrew characters will be displayed as hebrew characters and not as unknown parse like *"× ×תה מסכי×,×× ×©×™× ×ž×¦×™×¢×™× ×œ×™ כמה ×”× "*
For the record, when I display this unknown characters sql data with php - it shows as hebrew. when I'm trying to access it from the phpMyAdmin Panel - it shows as jibrish (these unknown characters).
Is there any way to fix it although there is some data already inserted in the database?
That feels like "double-encoded" Hebrew strings.
This partially recovers the text:
UNHEX(HEX(CONVERT('× ×תה מסכי×,××' USING latin1)))
--> '� �תה מסכי�,��
I do not know what leads to the � symbols.
Please do SELECT col, HEX(col) FROM ... WHERE ...; for some cell. I would expect שלום to give hex D7A9D79CD795D79D if it were correctly stored. For "double encoding", I would expect C397C2A9C397C593C397E280A2C397C29D.
Please provide the output from that SELECT, then I will work on how to recover the data.
Edit
Here's what I think happened.
The client had characters encoded as utf8; and
SET NAMES latin1 lied by claiming that the client had latin1 encoding; and
The column in the table declared CHARACTER SET utf8.
Yod did not jump out as a letter, so it took a while to see it. CONVERT(BINARY(CONVERT('×™×™123' USING latin1)) USING utf8) -->יי123
So, I am thinking that that expression will clean up the text. But be cautious; try it on a few rows before 'fixing' the entire table.
UPDATE table SET col = CONVERT(BINARY(CONVERT(col USING latin1)) USING utf8) WHERE ...;
If that does not work, here are 4 fixes for double-encoding that may or may not be equivalent. (Note: BINARY(xx) is probably the same as CONVERT(xx USING binary).)
I am not sure that you can do anything about the data that has already been stored in the database. However, you can import hebrew data properly by making sure you have the correct character set and collation.
the db collation has to be utf8_general_ci
the collation of the table with hebrew has to be utf8_general_ci
for example:
CREATE DATABASE col CHARACTER SET utf8 COLLATE utf8_general_ci;
CREATE TABLE `col`.`hebrew` (
`id` INT NOT NULL AUTO_INCREMENT,
`heb` VARCHAR(45) NOT NULL,
PRIMARY KEY (`id`)
) CHARACTER SET utf8
COLLATE utf8_general_ci;
INSERT INTO hebrew(heb) values ('שלום');

mysql charsets, can I perform the conversion in python?

I have a MySQL database which contains some bad data.
I start with this Unicode string:
u'TECNOLOGÍA Y EDUCACIÓN'
Encoding to UTF-8 for the database yields:
'TECNOLOG\xc3\x8dA Y EDUCACI\xc3\x93N'
When I send these bytes to the database, using connection charset latin1 and database charset utf8 (yes, I know this is wrong, but this has already happened, many, many times, and the goal now is to figure out the exact process of corruption so it can be reversed), the data is converted to this (checked using BINARY()):
'TECNOLOG\xc3\x83\xc2\x8dA Y EDUCACI\xc3\x83\xe2\x80\x9cN'
Double-encoding aside, the result I'd expect here is:
'TECNOLOG\xc3\x83\xc2\x8dA Y EDUCACI\xc3\x83\xc2\x93N'
Most of this makes sense, as it is interpreting the multi-byte UTF-8 chars as latin1, and encoding each byte as an individual char, but the conversion of \x93 -> \xe2\x80\x9c makes no sense. latin1's \x93 does not convert to UTF-8 \xe2\x80\x9c, although \xe2\x80\x9c can be converted to Unicode, yielding u'\u201c', which is codepoint \x93 in the CP-1252 charset.
Is mysql combining latin1 and CP-1252 when it handles conversions? How can I replicate the conversion process entirely in python? I've iterated through every encoding on the system and none of them work for the entire string. How, in python, can I get from 'TECNOLOG\xc3\x83\xc2\x8dA Y EDUCACI\xc3\x83\xe2\x80\x9cN' back to 'TECNOLOG\xc3\x8dA Y EDUCACI\xc3\x93N'? Decoding as UTF-8 will handle the first 3/4ths correctly, but that last one is just wrong, and nothing I've tried will return the correct results.
the goal now is to figure out the exact process of corruption so it can be reversed
As documented under ALTER TABLE Syntax:
Warning
The CONVERT TO operation converts column values between the character sets. This is not what you want if you have a column in one character set (like latin1) but the stored values actually use some other, incompatible character set (like utf8). In this case, you have to do the following for each such column:
ALTER TABLE t1 CHANGE c1 c1 BLOB;
ALTER TABLE t1 CHANGE c1 c1 TEXT CHARACTER SET utf8;
The reason this works is that there is no conversion when you convert to or from BLOB columns.
In your case:
change the column's encoding to the connection character set that was used on insertion (i.e. latin1), so that the stored bytes become the same as those that were originally received:
ALTER TABLE my_table MODIFY my_column TEXT CHARACTER SET latin1;
then drop the encoding information (by modifying the column so that it becomes a binary string):
ALTER TABLE my_table MODIFY my_column BLOB;
then apply the correct encoding information (by modifying the column so that it becomes a character string in the utf8 character set):
ALTER TABLE my_table MODIFY my_column TEXT CHARACTER SET utf8;
Be careful to use datatypes of sufficient length to avoid data truncation. Also be careful to ensure that application code thenceforth uses the correct connection character set (or else you may end up with a table where some records are encoded in one manner and others in another, which can be a nightmare to resolve).
If you cannot modify the database just yet, simply fetching data whilst the connection character is set to latin1 (but with your application expecting UTF-8) will yield correct data. Or else, use CONVERT():
SELECT CONVERT(BINARY CONVERT(my_column USING latin1) USING utf8)
FROM my_table
Is mysql combining latin1 and cp1252 when it handles conversions?
As documented under West European Character Sets:
MySQL's latin1 is the same as the Windows cp1252 character set. This means it is the same as the official ISO 8859-1 or IANA (Internet Assigned Numbers Authority) latin1, except that IANA latin1 treats the code points between 0x80 and 0x9f as “undefined,” whereas cp1252, and therefore MySQL's latin1, assign characters for those positions. For example, 0x80 is the Euro sign. For the “undefined” entries in cp1252, MySQL translates 0x81 to Unicode 0x0081, 0x8d to 0x008d, 0x8f to 0x008f, 0x90 to 0x0090, and 0x9d to 0x009d.

MySQL can't restore table from backup - #1366 - Incorrect string value

A site I'm working on recently had an issue with the database, apparently it got corrupted when they restored the tables any text field with strange symbols (eg half symbol and degree symbol) the text field stopped at the character before that symbol). I've got a copy of the table and distilled it down to the code below:
CREATE TABLE `products2` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`description` text CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
PRIMARY KEY (`id`)
) DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
insert into products2 values
(25, 0x5468652044504D203931322069732061206C617267652033BD204469676974204C434420566F6C746D657465722E20546865207369676E616C206265696E67206D6561737572656420697320616C736F207573656420746F20706F77657220746865206D657465722C20696E636C7564696E6720746865206261636B6C696768742E20546865206D657465722066656174757265732061203320746F20363056206D6561737572656D656E742072616E67652C20776974682061207265736F6C7574696F6E206F662031306D56206265747765656E20332E303020616E642031392E39395620616E64203130306D56206265747765656E2032302E3020616E642036302E30562E205768656E2074686520766F6C746167652064726F70732062656C6F772033562C204C4F20697320646973706C617965642028646F776E20746F20322E38562C207768656E2074686520646973706C61792077696C6C207475726E206F6666292E209148499220697320646973706C61796564207768656E2074686520766F6C7461676520676F65732061626F7665203630562E0D0A0D0A5363726577207465726D696E616C7320616C6C6F7720666F7220717569636B20616E64206561737920636F6E6E656374696F6E2E20546865206D6574657220697320686F7573656420696E206120726F6275737420636172726965722077686963682063616E20626520626F6C74656420696E20706C616365206F722070616E656C206D6F756E746564207573696E6720746865206C6F772070726F6669206C652062657A656C20616E6420636C6970732070726F76696465642E20416E2049503637202F204E454D412034582062657A656C20697320616C736F20617661696C61626C6520666F722070726F74656374696F6E20616761696E7374206475737420616E64206D6F6973747572652E0D0A0D0A417320746869732069732061206E65772064657369676E2077652073756767657374207468617420796F7520636F6E74616374204C617363617220666F7220757020746F2064617465206C6561642D74696D6520696E666F726D6174696F6E206265666F7265206F72646572696E67206F6E6C696E652E0D0A)
This throws an error:
#1366 - Incorrect string value: '\xBD Digi...' for column 'description' at row 1
Looking into this problem on stackoverflow and around the web it seems to be an issue with the encoding, I've tried changing the collation to utf_unicode_ci on the description field and the collation of the table to utf_bin (and all combinations of those) all to no avail.
I can't redo the dump as it's a backup. I don't understand how the system can output the dump but not accept it back - presumably the backup is via the command line (not certain) and I am using PHPMyAdmin to restore it I don't know if that makes a difference.
If it's not possible to import the data I'd be grateful if someone could tell me how to read the encoded data into text that I can then manually cut and paste.
Decoding the first 32 bytes as ASCII, we have (where ? is the 0xBD byte about which MySQL is complaining):
The DPM 912 is a large 3? Digit
A little bit of Googling for "DPM 912" suggests to me that character should be the vulgar one-half fraction, ½.
A number of character sets encode that character with the byte 0xBD, but one in particular jumps out: windows-1252—which was not only the default codepage in the (pre-Unicode) Windows world, but is also MySQL's default encoding. It'd be a good guess that your data is encoded in windows-1252.
As explained in the MySQL manual, you can specify the encoding of a string literal by prefixing it with the encoding name:
A character string literal may have an optional character set introducer and COLLATE clause:
[_charset_name]'string' [COLLATE collation_name]
It goes on to say:
An introducer is also legal before standard hex literal and numeric hex literal notation (x'literal' and 0xnnnn), or before bit-field literal notation (b'literal' and 0bnnnn).
Therefore (and because MySQL refers to windows-1252 as latin1), you could change your INSERT command to:
INSERT INTO products2 VALUES (25, _latin1 0x5468652044504D203931322069...);
The documentation also states:
For the simple statement SELECT 'string', the string has the character set and collation defined by the character_set_connection and collation_connection system variables.
That is, if such an introducer is omitted (as it was in your original INSERT statement), the character set is assumed to be that defined by the character_set_connection system variable.
As mentioned here, there are number of ways of setting that variable (including by specifying it when your client connects which, in phpMyAdmin, is set with the [DefaultCharset] configuration option, of which the default was latin1 prior to v3.4, but has been utf8 since - perhaps this change is the origin of your problems; one can also specify the character set of import files with [Import][charset]). If one doesn't specify the desired character set upon connecting, issuing any of these commands after connecting but before your INSERT command will fix it (you could, for example, add one of them to the top of your dump file):
SET NAMES 'latin1';
SET CHARACTER SET latin1;
SET character_set_connection = latin1;
My recommendation, which makes the dumpfile as portable as possible, would be to add SET NAMES 'latin1' to the top of it.

How to interpret a column as having a different character set per query?

I need to interface with a database for which I cannot change the collation and charset.
However, I would like to pick some binary data from it, interpret it as if it were UTF8 and then do an UPPER on it (since just doing UPPER() on binary returns the raw value).
I would assume that this works:
SELECT UPPER(Filename.Name) COLLATE utf8_general_ci FROM Filename;
but it doesn't and complains that
COLLATION 'utf8_general_ci' is not valid for CHARACTER SET 'binary'
which is fair enough, I need some incantation to cast the binary field as being utf-8. How do I do a select which gives me a computed column with the right character set assigned to it?
Ok figured it out. For modern MySQL versions you can use CAST, and for older ones CONVERT (which is actually standard SQL).
SELECT UPPER(CONVERT(BINARY(Filename.Name) USING utf8)) FROM Filename;
I think you're looking for:
SELECT UPPER(Filename.Name COLLATE utf8_general_ci) FROM Filename;
But I'm not sure because I don't have a broken database to test with.

How can I process data to avoid MySQL "incorrect string value" error?

I am trying to use a Rake task to migrate some legacy data from MS Access to MySQL. I'm working on Windows XP, using Ruby 1.8.6.
I have the encoding for Rails set as "utf8" in database.yml.
Also, the default character set for MySQL is utf8.
99% of the data is coming in fine, but every now and then I'll get a column value that gives me a error something like this:
Mysql::Error: Incorrect string value: '\x92 Comm...' for column 'name'
at row 1:
INSERT INTO `organizations` ( [...] )
VALUES('Lawyers’ Committee', [...] )
It looks as though the thing that's giving MySQL trouble is the apostrophe immediately after the "s" in the word "Lawyers".
Here's another one...
Mysql::Error: Incorrect string value: '\x99 aoc' for column 'department'
at row 1:
INSERT INTO `addresses`
[...]
'TRInfo™ aoc'
[....]
Looks like it's choking on the "TM" after "TRInfo".
Is there any Ruby or Rails method that I can run the data through to cleanse from it any characters that MySQL will choke on?
Ideally, it would be great to replace them with more palatable characters -- replace the apostrophe with a single quote and the TM symbol with the string "(TM)".
Or, if I could somehow configure MySQL to store those characters as-is without errors that would be great too.
It looks like your input data is not in utf-8.
I did a little investigating and the styled quote used in Lawyer's is encoded as \x92 in the Windows-1252 encoding, but would be nonsense for utf-8 (when I decoded it and encoded it into utf8, I got \xe2\x80\x99).
Thus you will need to convert the input strings from windows-1252 to utf-8 (or to unicode).
I had the same problem when putting contents of UTF-16 encoded files - which usually store one character per 16bit block - into mysql tables with java. The problem was that the UTF-16 encoded string contained so called surrogate pairs. It means two consecutive 16bit UTF-16 blocks encode one special character but cannot be translated into a corresponding UTF-8 encoding individually. See wikipedia for further explanation.
The solution was to simply replace these characters with spaces. This is the character range you might want to strip out of your string: U+D800–U+DFFF
In general, this happens when you insert strings to columns with incompatible encoding/collation.
I got this error when I had TRIGGERs, which inherit server's collation for some reason.
And mysql's default is (at least on Ubuntu) latin-1 with swedish collation.
Even though I had database and all tables set to UTF-8, I had yet to set my.cnf:
/etc/mysql/my.cnf :
[mysqld]
character-set-server=utf8
default-character-set=utf8
And this must list all triggers with utf8-*:
select TRIGGER_SCHEMA, TRIGGER_NAME, CHARACTER_SET_CLIENT, COLLATION_CONNECTION, DATABASE_COLLATION from information_schema.TRIGGERS
And some of variables listed by this should also have utf-8-* (no latin-1 or other encoding):
show variables like 'char%';
It looks like your old database is in one string format (utf8?) and your rails is expecting something else. If you input is in utf8, have you tried configuring your rails to support it?
I encountered the same problem today.
After tried many times, I found out the reason and fix it at last.
For applications that store data using the default MySQL character set and collation (latin1, latin1_swedish_ci), so you need to specify the character set and collation to utf8/utf8_general_ci when your create your database or table.
e.g.:
$sql = "CREATE TABLE " . $table_name . " (
id mediumint(9) NOT NULL AUTO_INCREMENT,
bookname varchar(128) NOT NULL,
author varchar(64) NOT NULL,
PRIMARY KEY (id),
KEY (bookname)
)CHARACTER SET utf8 COLLATE utf8_general_ci;";
Reference:
《mysql create table problem? SOLVED!!!!!!!!!!!》
http://forums.mysql.com/read.php?121,193883,193883
《10.1.5. Configuring the Character Set and Collation for Applications》
http://dev.mysql.com/doc/refman/5.0/en/charset-applications.html
Hoping this can help you.
Adding binary before the weirdcolumn solves the problem.
In my case, I have an update trigger on tableA to insert data into other table.
There are some special characters in column weirdcolumn, and the update failed with message: "ERROR 1366 (HY000): Incorrect string value: '\xE7....'"
After I dig in a lot, I found the solution by adding binary before the string column name, or using cast(weirdcolumn as binary);
Hope this can help.
I had the same issue importing data from SQL Server to MySql using Php.
My solution was utf8_encode() when inserting into MySql and use utf8_decode() when retrieving from MySql to display into the browser.
Here you have my FULL code, that works good.
//For string values
$Gro2=(is_null($row["GrpNm"]))?"NULL":"\"".mysql_escape_string(utf8_encode($row["GrpNm"]))."\"";
$sqlMy ="INSERT INTO `tbl_name` VALUES ($Gro2)";
Please note: For new projects use
mysqli_escape_string()
link