After viewing my prod logs, I have some errors mentioning:
[2012-08-31 15:56:43] request.CRITICAL: Doctrine\DBAL\DBALException:
An exception occurred while executing 'SELECT t0.username ....... FROM fos_user t0 WHERE t0.username = ?'
with params {"1":"Nrv\u29e7Kasi"}:
SQLSTATE[HY000]: General error: 1267 Illegal mix of collations (latin1_swedish_ci,IMPLICIT)
and (utf8_general_ci,COERCIBLE) for operation '='
Although I have UTF-8 set as the default under the Doctrine config:
doctrine:
    dbal:
        charset: UTF8
It seems that all my MySQL tables are in latin1_swedish_ci, so my question is:
Can I manually change the collation to utf8_general_ci for all my tables without any complications/precautions?
It is helpful to understand the following definitions:
A character encoding details how each symbol is represented in binary (and therefore stored in the computer). For example, the symbol é (U+00E9, latin small letter E with acute) is encoded as 0xc3a9 in UTF-8 (which MySQL calls utf8) and 0xe9 in Windows-1252 (which MySQL calls latin1).
A character set is the alphabet of symbols that can be represented using a given character encoding. Confusingly, the term is also used to mean the same as character encoding.
A collation is an ordering on a character set, so that strings can be compared. For example: MySQL's latin1_swedish_ci collation treats most accented variations of a character as equivalent to the base character, whereas its latin1_general_ci collation orders them after the base character but before the next one, without treating them as equivalent (there are other, more significant, differences too: such as the order of characters like å, ä, ö and ß).
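These definitions can be observed directly from SQL. A quick sketch, assuming a latin1 connection so that the literals below arrive as latin1 bytes:
SET NAMES 'latin1'; -- ensure the literals are sent/interpreted as latin1
SELECT HEX(CONVERT('é' USING utf8))   AS utf8_bytes,   -- C3A9
       HEX(CONVERT('é' USING latin1)) AS latin1_bytes; -- E9
SELECT 'é' = 'e' COLLATE latin1_swedish_ci AS swedish_ci, -- 1: treated as equivalent
       'é' = 'e' COLLATE latin1_general_ci AS general_ci; -- 0: ordered apart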
MySQL will decide which collation should be applied to a given expression as documented under Collation of Expressions: in particular, the collation of a column takes precedence over that of a string literal.
The WHERE clause of your query compares the following strings:
a value in fos_user.username, encoded in the column's character set (Windows-1252) and expressing a preference for its collation latin1_swedish_ci (with a coercibility value of 2); with
the string literal 'Nrv⧧Kasi', encoded in the connection's character set (UTF-8, as configured by Doctrine) and expressing a preference for the connection's collation utf8_general_ci (with a coercibility value of 4).
Since the first of these strings has a lower coercibility value than the second, MySQL attempts to perform the comparison using that string's collation: latin1_swedish_ci. To do so, MySQL attempts to convert the second string to latin1—but since the ⧧ character does not exist in that character set, the comparison fails.
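These coercibility values can be inspected with MySQL's COERCIBILITY() function. A sketch against the question's table, assuming a utf8 connection (e.g. after SET NAMES 'utf8'):
SELECT COERCIBILITY(t0.username)                        AS column_value,   -- 2: implicit, from the column
       COERCIBILITY('Nrv⧧Kasi')                         AS string_literal, -- 4: coercible
       COERCIBILITY('Nrv⧧Kasi' COLLATE utf8_general_ci) AS explicit_clause -- 0: explicit COLLATE
FROM fos_user t0 LIMIT 1;
The lowest value wins, which is why the column's latin1_swedish_ci collation was chosen for the comparison.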
Warning
One should pause for a moment to consider how the column is currently encoded: you are attempting to filter for records where fos_user.username is equal to a string that contains a character which cannot exist in that column!
If you believe that the column does contain such characters, then you probably wrote to the column whilst the connection character encoding was set to something (e.g. latin1) that caused MySQL to interpret the received byte sequence as characters which are all in the Windows-1252 character set.
If this is the case, before continuing any further you should fix your data!
convert such columns to the character encoding that was used on data insertion, if different to the incumbent encoding:
ALTER TABLE fos_user MODIFY username VARCHAR(123) CHARACTER SET foo;
drop the encoding information associated with such columns by converting them to the binary character set:
ALTER TABLE fos_user MODIFY username VARCHAR(123) CHARACTER SET binary;
associate with such columns the encoding in which data was actually transmitted by converting them to the relevant character set:
ALTER TABLE fos_user MODIFY username VARCHAR(123) CHARACTER SET bar;
Note that, if converting from a multi-byte encoding, you may need to increase the size of the column (or even change its type) in order to accommodate the maximum possible length of the converted string.
Once one is certain that the columns are correctly encoded, one could force the comparison to be conducted using a Unicode collation by either—
explicitly converting the value fos_user.username to a Unicode character set:
WHERE CONVERT(fos_user.username USING utf8) = ?
forcing the string literal to have a lower coercibility value than the column (will cause an implicit conversion of the column's value to UTF-8):
WHERE fos_user.username = ? COLLATE utf8_general_ci
Or one could, as you say, permanently convert the column(s) to a Unicode encoding and set its collation appropriately.
Can I manually change the collation to utf8_general_ci for all my tables without any complications/precautions?
The principal consideration is that Unicode encodings take up more space than single-byte character sets, so:
more storage may be required;
comparisons may be slower; and
index prefix lengths may need to be adjusted (note that the maximum is in bytes, so may represent fewer characters than previously).
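To illustrate the last point: under InnoDB's traditional 767-byte index key limit, a latin1 prefix of 767 characters fits, but utf8 allows at most 255 characters (767 ÷ 3 bytes per character). A hypothetical rebuild (the index name and prefix length are invented for illustration):
ALTER TABLE fos_user
  DROP INDEX idx_username,
  ADD INDEX idx_username (username(255));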
Also, be aware that, as documented under ALTER TABLE Syntax:
To change the table default character set and all character columns (CHAR, VARCHAR, TEXT) to a new character set, use a statement like this:
ALTER TABLE tbl_name CONVERT TO CHARACTER SET charset_name;
For a column that has a data type of VARCHAR or one of the TEXT types, CONVERT TO CHARACTER SET will change the data type as necessary to ensure that the new column is long enough to store as many characters as the original column. For example, a TEXT column has two length bytes, which store the byte-length of values in the column, up to a maximum of 65,535. For a latin1 TEXT column, each character requires a single byte, so the column can store up to 65,535 characters. If the column is converted to utf8, each character might require up to three bytes, for a maximum possible length of 3 × 65,535 = 196,605 bytes. That length will not fit in a TEXT column's length bytes, so MySQL will convert the data type to MEDIUMTEXT, which is the smallest string type for which the length bytes can record a value of 196,605. Similarly, a VARCHAR column might be converted to MEDIUMTEXT.
To avoid data type changes of the type just described, do not use CONVERT TO CHARACTER SET. Instead, use MODIFY to change individual columns.
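Following that advice, one might run something like this per column, keeping the column's existing type and length (VARCHAR(123) is the same placeholder used above):
ALTER TABLE fos_user
  MODIFY username VARCHAR(123)
  CHARACTER SET utf8 COLLATE utf8_general_ci;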
That's right. I ran into this problem, and the quickest solution is
CONVERT(fos_user.username USING utf8)
Simply convert the table's character set with a command like the following:
ALTER TABLE tbl_name CONVERT TO CHARACTER SET utf8;
Related
This question is an extension of the following question - How to make mysql consider the control characters when doing string comparison?
Here is my query -
SELECT 'abc' < 'abcSOH' COLLATE utf8mb4_0900_bin;
Here SOH is the Start of Heading, an ASCII control character with code 1. My expectation is that this query will return 1, as the second string's length is 4. I have even tried with a space (ASCII code 32), with the same result!
If you check this fiddle, you can see only the 'utf8mb4_0900_bin' collation gives the expected result. All other collations that I have tested give the opposite result.
https://dbfiddle.uk/mDLVWOZG
I have gone through the documentation and could not find the reason behind this. Can anyone please explain why this is?
I am interested in this because I would like to use a 1-byte character set (and corresponding collation) instead of a 4-byte character set: I have some legacy tables (being converted to MySQL) that have a lot of columns, and with a 4-byte character set MySQL gives an error that the row is too big.
Each column can have its own CHARACTER SET and COLLATION, but all values stored in a given column share that column's settings.
CREATE TABLE provides only "defaults" for those settings -- these defaults are used if you don't override them when declaring the individual columns.
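For instance (the table and column names here are invented for illustration):
CREATE TABLE legacy_demo (
  code CHAR(4) CHARACTER SET latin1 COLLATE latin1_bin, -- explicit column override
  note TEXT                                             -- inherits the table default, utf8mb4
) DEFAULT CHARSET = utf8mb4;
Only the column-level setting governs how that column's values are stored and compared; the table default is merely the fallback.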
So, legacy columns may as well be declared with whatever antique charset was used. (Sorry, EBCDIC is not available.)
All the "printable" characters of ASCII are available in UTF-8 (MySQL's utf8/utf8mb3/utf8mb4). In fact, the binary encoding is identical.
The "control characters" -- well, stick with ascii or latin1 (perhaps with latin1_bin).
Any _bin collation says to simply look at the bits.
I do not know if control characters are turned into space (hex 20) when INSERTing into a UTF-8 column.
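One way to see what each collation is doing is to inspect the sort weights it assigns, via MySQL's WEIGHT_STRING() function (a sketch; note WEIGHT_STRING() is documented as being for internal/debugging use, and the utf8mb4_0900_* collations require MySQL 8.0):
-- CHAR(1 USING utf8mb4) is the SOH control character.
SELECT HEX(WEIGHT_STRING(CHAR(1 USING utf8mb4) COLLATE utf8mb4_0900_ai_ci)) AS ai_ci_weight, -- expected empty: ignorable
       HEX(WEIGHT_STRING(CHAR(1 USING utf8mb4) COLLATE utf8mb4_0900_bin))   AS bin_weight;   -- non-empty: code point
SELECT 'abc' = CONCAT('abc', CHAR(1 USING utf8mb4)) COLLATE utf8mb4_0900_ai_ci AS ai_ci; -- 1: the SOH adds no weight
An "ignorable" character contributes nothing to the comparison, which is why only the _bin collation, comparing raw code points, returns your expected result.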
I have a MySQL database which contains some bad data.
I start with this Unicode string:
u'TECNOLOGÍA Y EDUCACIÓN'
Encoding to UTF-8 for the database yields:
'TECNOLOG\xc3\x8dA Y EDUCACI\xc3\x93N'
When I send these bytes to the database, using connection charset latin1 and database charset utf8 (yes, I know this is wrong, but this has already happened, many, many times, and the goal now is to figure out the exact process of corruption so it can be reversed), the data is converted to this (checked using BINARY()):
'TECNOLOG\xc3\x83\xc2\x8dA Y EDUCACI\xc3\x83\xe2\x80\x9cN'
Double-encoding aside, the result I'd expect here is:
'TECNOLOG\xc3\x83\xc2\x8dA Y EDUCACI\xc3\x83\xc2\x93N'
Most of this makes sense, as it is interpreting the multi-byte UTF-8 chars as latin1, and encoding each byte as an individual char, but the conversion of \x93 -> \xe2\x80\x9c makes no sense. latin1's \x93 does not convert to UTF-8 \xe2\x80\x9c, although \xe2\x80\x9c can be converted to Unicode, yielding u'\u201c', which is codepoint \x93 in the CP-1252 charset.
Is mysql combining latin1 and CP-1252 when it handles conversions? How can I replicate the conversion process entirely in python? I've iterated through every encoding on the system and none of them work for the entire string. How, in python, can I get from 'TECNOLOG\xc3\x83\xc2\x8dA Y EDUCACI\xc3\x83\xe2\x80\x9cN' back to 'TECNOLOG\xc3\x8dA Y EDUCACI\xc3\x93N'? Decoding as UTF-8 will handle the first 3/4ths correctly, but that last one is just wrong, and nothing I've tried will return the correct results.
the goal now is to figure out the exact process of corruption so it can be reversed
As documented under ALTER TABLE Syntax:
Warning
The CONVERT TO operation converts column values between the character sets. This is not what you want if you have a column in one character set (like latin1) but the stored values actually use some other, incompatible character set (like utf8). In this case, you have to do the following for each such column:
ALTER TABLE t1 CHANGE c1 c1 BLOB;
ALTER TABLE t1 CHANGE c1 c1 TEXT CHARACTER SET utf8;
The reason this works is that there is no conversion when you convert to or from BLOB columns.
In your case:
change the column's encoding to the connection character set that was used on insertion (i.e. latin1), so that the stored bytes become the same as those that were originally received:
ALTER TABLE my_table MODIFY my_column TEXT CHARACTER SET latin1;
then drop the encoding information (by modifying the column so that it becomes a binary string):
ALTER TABLE my_table MODIFY my_column BLOB;
then apply the correct encoding information (by modifying the column so that it becomes a character string in the utf8 character set):
ALTER TABLE my_table MODIFY my_column TEXT CHARACTER SET utf8;
Be careful to use datatypes of sufficient length to avoid data truncation. Also be careful to ensure that application code thenceforth uses the correct connection character set (or else you may end up with a table where some records are encoded in one manner and others in another, which can be a nightmare to resolve).
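A quick sanity check between steps is to eyeball the raw bytes (using the question's own data: a correctly repaired Í should be stored as C38D, not the doubly-encoded C383C28D):
SELECT my_column, HEX(my_column) FROM my_table LIMIT 10;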
If you cannot modify the database just yet, simply fetching data whilst the connection character set is latin1 (but with your application expecting UTF-8) will yield correct data. Or else, use CONVERT():
SELECT CONVERT(BINARY CONVERT(my_column USING latin1) USING utf8)
FROM my_table
Is mysql combining latin1 and cp1252 when it handles conversions?
As documented under West European Character Sets:
MySQL's latin1 is the same as the Windows cp1252 character set. This means it is the same as the official ISO 8859-1 or IANA (Internet Assigned Numbers Authority) latin1, except that IANA latin1 treats the code points between 0x80 and 0x9f as “undefined,” whereas cp1252, and therefore MySQL's latin1, assign characters for those positions. For example, 0x80 is the Euro sign. For the “undefined” entries in cp1252, MySQL translates 0x81 to Unicode 0x0081, 0x8d to 0x008d, 0x8f to 0x008f, 0x90 to 0x0090, and 0x9d to 0x009d.
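This can be confirmed from SQL (0x93 is cp1252's left double quotation mark; 0x8D is one of the five positions that cp1252 leaves undefined):
SELECT HEX(CONVERT(_latin1 X'93' USING utf8)) AS quote_mark,    -- E2809C (U+201C)
       HEX(CONVERT(_latin1 X'8D' USING utf8)) AS undefined_pos; -- C28D (U+008D)
So yes: MySQL's latin1-to-Unicode conversion is cp1252's mapping, falling back to the raw code point for cp1252's undefined positions; replicating it elsewhere requires a decoder that combines both rules.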
I'm coming from a SQL Server background. What would the equivalent data types be for the following in MySQL:
NVARCHAR - provides support for international, multi-byte characters for all languages
NVARCHAR(max) - allows for very long text documents
Going by http://msdn.microsoft.com/en-us/library/ms186939.aspx, I would say that
VARCHAR(n) CHARSET ucs2
is the closest equivalent. But I don't see many people using this, more often people use:
VARCHAR(n) CHARSET utf8
As far as I am aware, both character sets utf8 and ucs2 allow for the same characters; only the encoding is different. So either one should work, and I would probably go for the latter since it is used more often.
There are some limitations in MySQL's unicode support that may or may not apply to your use case. Please refer to http://dev.mysql.com/doc/refman/5.0/en/charset-unicode.html for more info on MySQL's unicode support.
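The encoding difference (but identical repertoire) is easy to demonstrate:
SELECT HEX(CONVERT('a' USING ucs2)) AS ucs2_bytes, -- 0061: always 2 bytes per character
       HEX(CONVERT('a' USING utf8)) AS utf8_bytes; -- 61: 1 to 3 bytes per character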
The equivalent is VARCHAR or TEXT.
MySQL supports NCHAR and NVARCHAR, but in a different way: they are implemented using a character set and a collation (a set of rules used for comparing strings in a character set, such as whether the comparison should be case-sensitive or not).
MySQL defines the character set at four levels:
Server
Database
Table
Column
The character set is inherited from the level immediately above if you do not specify one. What you need to do is ensure that the column you want to behave like NVARCHAR has its character set set to a Unicode character set (utf8 is the best choice, I believe), either explicitly or by inheritance.
At the column level you can set the character set as follows:
create table nvarchar_test
(
_id varchar(10) character set utf8,
_name varchar(200) character set ascii
)
In the above example, _id will be able to hold 10 Unicode characters. _name, on the other hand, is declared with the single-byte ascii character set, so it can hold 200 ASCII characters but no other Unicode characters; give it a Unicode character set too if it must store international text.
Please go through the MySQL reference for further explanation.
In MySQL 5, NVARCHAR (shorthand for National Varchar) is supported and uses utf8 as the predefined character set.
http://dev.mysql.com/doc/refman/5.0/en/charset-national.html
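To see the rewriting for yourself (the table name is invented; output shown is from MySQL 5.x):
CREATE TABLE nvarchar_demo (name NVARCHAR(100));
SHOW CREATE TABLE nvarchar_demo;
-- shows: `name` varchar(100) CHARACTER SET utf8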
I have a problem inserting rows into my DB.
When a row contains characters like 'è', 'ò', '€', '²', '³', etc., it returns an error like this (charset set to utf8):
Incorrect string value: '\xE8 pass...' for column 'descrizione' at row 1 - INSERT INTO materiali.listino (codice,costruttore,descrizione,famiglia) VALUES ('E 251-230','Abb','Relè passo passo','Relè');
But if I set the charset to latin1 or utf8_general_ci, it works fine and no errors are found.
Can somebody explain to me why this happens? I always thought that utf8 was "larger" than latin1.
EDIT: I also tried to use mysql_real_escape_string, but the error was always the same!
mysql_real_escape_string() is not relevant, as it merely escapes string termination quotes that would otherwise enable an attacker to inject SQL.
utf8 is indeed "larger" than latin1 insofar as it is capable of representing a superset of the latter's characters. However, not every byte sequence represents valid utf8 characters, whereas every possible byte sequence does represent valid latin1 characters.
Therefore, if MySQL receives a byte sequence it expects to be utf8 (but which isn't), some characters could well trigger this "incorrect string value" error; whereas if it expects the bytes to be latin1 (even if they're not), they will be accepted - but incorrect data may be stored in the table.
Your problem is almost certainly that your connection character set does not match the encoding in which your application is sending its strings. Use the SET NAMES statement to change the current connection's character set, e.g. SET NAMES 'utf8' if your application is sending strings encoded as UTF-8.
Read about connection character sets for more information.
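A minimal sketch of the fix, plus a way to verify it took effect:
SET NAMES 'utf8'; -- tell the server the client sends and expects UTF-8
SHOW VARIABLES LIKE 'character_set_%'; -- character_set_client/connection/results should now read utf8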
As an aside, utf8_general_ci is not a character set: it's a collation for the utf8 character set. The manual explains:
A character set is a set of symbols and encodings. A collation is a set of rules for comparing characters in a character set.
According to the doc for UTF-8, the default collation is utf8_general_ci.
If you want a specific order in your alphabet that is not the general_ci one, you should pick one of the utf8_* collations provided for the utf8 charset, whichever matches your requirements in terms of ordering.
Both your table and your connection to the DB should be encoded in utf8, preferably with the same collation; read more about setting the connection collation.
To be completely safe, you should check your table's collation and make sure it's utf8_*, and that your connection's is too, using the complete syntax of SET NAMES:
SET NAMES 'utf8' COLLATE 'utf8_general_ci'
You can find information about the different collations here
mysql_query("SET NAMES 'utf8' COLLATE 'utf8_general_ci'");
Eureka, the above did it :-)