What's the difference between "VARCHAR BINARY" and "VARBINARY" in MySQL?

I've created the following test table:
CREATE TABLE t (
a VARCHAR(32) BINARY,
b VARBINARY(32)
);
INSERT INTO t (a, b) VALUES ( 'test    ', 'test    ');
INSERT INTO t (a, b) VALUES ( 'test    \0', 'test    \0');
But this query indicated no difference between the two types:
SELECT a, LENGTH(a), HEX(a), b, LENGTH(b), HEX(b) FROM t;
a          LENGTH(a)  HEX(a)              b          LENGTH(b)  HEX(b)
---------  ---------  ------------------  ---------  ---------  ------------------
test               8  7465737420202020    test               8  7465737420202020
test               9  746573742020202000  test               9  746573742020202000

Here are the differences I was able to find in the documentation:
VARCHAR BINARY
The BINARY attribute causes the binary collation for the column character set to be used, and the column itself contains nonbinary character strings rather than binary byte strings.
When BINARY values are stored, they are right-padded with the pad value to the specified length.
You should consider the preceding padding and stripping characteristics carefully if you plan to use the BINARY data type for storing binary data and you require that the value retrieved be exactly the same as the value stored.
VARBINARY
If strict SQL mode is not enabled and you try to assign a value that exceeds the column's maximum length, the value is truncated to fit and a warning is generated.
There is no padding on insert, and no bytes are stripped on select. All bytes are significant in comparisons.
VARBINARY is preferable when the value retrieved must be the same as the value specified for storage, with no padding.

As the MySQL manual page on String Data Type Syntax explains, VARBINARY is equivalent to VARCHAR CHARACTER SET binary, while VARCHAR BINARY is equivalent to VARCHAR CHARACTER SET latin1 COLLATE latin1_bin (or some other non-binary character set with the corresponding binary collation; it depends on table settings):
Specifying the CHARACTER SET binary attribute for a character string data type causes the column to be created as the corresponding binary string data type: CHAR becomes BINARY, VARCHAR becomes VARBINARY, and TEXT becomes BLOB.
The BINARY attribute is a nonstandard MySQL extension that is shorthand for specifying the binary (_bin) collation of the column character set (or of the table default character set if no column character set is specified).
So, VARBINARY stores bytes; VARCHAR BINARY stores character codes but sorts them like bytes (almost - see below).
What this means in practice is explained on the manual page The binary Collation Compared to _bin Collations:
VARBINARY sorts by comparing byte by byte; VARCHAR BINARY compares the byte groups that correspond to characters (not much of a difference for most encodings).
VARCHAR BINARY performs a character set conversion when assigning a value from another column with a different encoding, or when the value is inserted/updated by a client with a different encoding; VARBINARY just takes the value as a raw byte string.
Case conversion in SQL (i.e. the LOWER / UPPER functions) has no effect on VARBINARY (bytes have no case).
Trailing spaces will usually be ignored in VARCHAR BINARY comparisons (that is, 'x ' = 'x' will be true).
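The conversion point can be illustrated outside MySQL. The following Python sketch is only an analogy: it assumes `bytes` stands in for VARBINARY and `str` for VARCHAR BINARY, and the sample value is made up. It shows why the same text, sent over connections with different encodings, ends up as the same characters but different raw bytes.

```python
# Analogy only: bytes stands in for VARBINARY, str for VARCHAR BINARY.
name = "Zo\u00eb"  # three characters; the last one is U+00EB

# A character column converts incoming text to its own character set,
# so the stored bytes depend on that character set.
stored_latin1 = name.encode("latin-1")  # 3 bytes
stored_utf8 = name.encode("utf-8")      # 4 bytes, U+00EB takes two

# Compared as characters, the two stored values are the same string.
same_characters = stored_latin1.decode("latin-1") == stored_utf8.decode("utf-8")

# Compared as raw bytes (what a byte column does), they differ.
same_bytes = stored_latin1 == stored_utf8

print(len(stored_latin1), len(stored_utf8), same_characters, same_bytes)
```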

Related

Wrong max length for char datatype in MySQL

When the character encoding is latin1 (a single-byte character set), the SQL statement below executes without error.
create table chartype (chardata char(255));
But when the character encoding is UTF-8 (up to 3 bytes per character),
create table chartype (chardata char(255));
I expected this statement to throw an error, but it executes without one.
The maximum length of the char data type is 255 bytes, so for UTF-8 encoding it should allow only the statement below:
create table chartype (chardata char(85));
85 * 3 = 255 bytes, so 85 should be the maximum length for the UTF-8 character set.
Please clarify.
When you say CHAR(255) you're creating a fixed-length field that can accommodate 255 characters, which is distinct from 255 bytes. UTF-8 characters vary in length from 1 to 4 bytes depending on the character, but MySQL's utf8 character set (utf8mb3) accommodates only characters of up to 3 bytes.
To handle the full range of Unicode characters you need to use utf8mb4 encoding.
Be sure to use VARCHAR in preference to CHAR as CHAR is fixed length and creates a lot of wasted space.
http://dev.mysql.com/doc/refman/5.7/en/storage-requirements.html says in part:
For a VARCHAR column that stores multibyte characters, the effective maximum number of characters is less. For example, utf8mb3 characters can require up to three bytes per character, so a VARCHAR column that uses the utf8mb3 character set can be declared to be a maximum of 21,844 characters.
The same applies to CHAR or TEXT, or any other data type that supports character sets.
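The character/byte distinction and the 21,844 figure can be checked with a little arithmetic. This Python sketch is illustrative only; the constant names are made up, and the 2-byte length prefix subtracted from the 65,535-byte budget is an assumption about how the quoted limit is reached.

```python
# Sketch: CHAR(255) is a limit in characters; the byte count depends on the encoding.
MAX_CHARS = 255

value = "\u20ac" * MAX_CHARS  # 255 euro signs; each takes 3 bytes in UTF-8
char_count = len(value)
utf8_byte_count = len(value.encode("utf-8"))

# The VARCHAR figure quoted above: a 65,535-byte budget, minus the
# 2 length bytes assumed here, divided by 3 bytes per utf8mb3 character.
ROW_LIMIT_BYTES = 65_535
LENGTH_PREFIX_BYTES = 2  # assumption
max_utf8mb3_chars = (ROW_LIMIT_BYTES - LENGTH_PREFIX_BYTES) // 3

print(char_count, utf8_byte_count, max_utf8mb3_chars)
```

So a CHAR(255) column in a 3-byte character set holds 255 characters that may occupy up to 765 bytes, rather than being capped at 85 characters.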

Incorrect string value for column in MySQL with utf8

Running the query below in MySQL Workbench fails with an "Incorrect string value" error.
insert into mytable (key) values (0x8080808080)
gives me the error below:
Error Code: 1366. Incorrect string value: '\x80\x80\x80\x80\x80' for column 'key' at row 1
The column data type is defined as char(5) and it uses the table's default charset/collation, i.e. "utf8 - default collation". This query fails to insert any character value above 0x7F.
I want to understand why it fails to insert values above 0x7F. If I change the charset/collation type to latin1__, it works fine for characters up to 0xFF.
This query fails to insert any character value above 0x7F.
It's failing to insert a byte value above 0x7F. If you wanted to insert character U+0080 you would have to encode it as UTF-8 sequence 0xC280. These bytes are above 0x7F but will insert OK because it's a valid UTF-8 sequence.
This is true for any encoding; 0x8080 is an invalid sequence of bytes in Shift-JIS too, so if you created a character-string column stored in sjis, that value would fail too. latin1, on the other hand, has no invalid byte sequences, so all bytes would happen to work there.
But if you want to store arbitrary bytes and don't care about characters and encodings, you should use a binary string type (e.g. a VARBINARY column) instead.
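The same validity check can be reproduced with Python's codecs, as a rough stand-in for what the server does when it receives the bytes. This is a sketch, not MySQL's implementation: `is_valid` is a hypothetical helper, and Python's `latin-1` (ISO-8859-1) only approximates MySQL's latin1.

```python
raw = bytes.fromhex("8080808080")  # the value from the failing INSERT

def is_valid(data: bytes, encoding: str) -> bool:
    """Return True if data decodes cleanly in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

# 0x80 is a UTF-8 continuation byte, so it cannot start a character.
utf8_ok = is_valid(raw, "utf-8")

# In a single-byte encoding every byte maps to some character.
latin1_ok = is_valid(raw, "latin-1")

# U+0080 itself is representable: its UTF-8 encoding is the two bytes C2 80.
encoded = "\u0080".encode("utf-8")

print(utf8_ok, latin1_ok, encoded.hex())
```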

MySQL Illegal mix of collations

After reviewing my prod logs, I found an error mentioning:
[2012-08-31 15:56:43] request.CRITICAL: Doctrine\DBAL\DBALException:
An exception occurred while executing 'SELECT t0.username ....... FROM fos_user t0 WHERE t0.username = ?'
with params {"1":"Nrv\u29e7Kasi"}:
SQLSTATE[HY000]: General error: 1267 Illegal mix of collations (latin1_swedish_ci,IMPLICIT)
and (utf8_general_ci,COERCIBLE) for operation '='
Although I have UTF-8 set as the default in the Doctrine config:
doctrine:
    dbal:
        charset: UTF8
It seems that all my MySQL tables are in latin1_swedish_ci, so my question is:
Can I manually change the collation to utf8_general_ci for all my tables without any complications/precautions?
It is helpful to understand the following definitions:
A character encoding details how each symbol is represented in binary (and therefore stored in the computer). For example, the symbol é (U+00E9, latin small letter E with acute) is encoded as 0xc3a9 in UTF-8 (which MySQL calls utf8) and 0xe9 in Windows-1252 (which MySQL calls latin1).
A character set is the alphabet of symbols that can be represented using a given character encoding. Confusingly, the term is also used to mean the same as character encoding.
A collation is an ordering on a character set, so that strings can be compared. For example: MySQL's latin1_swedish_ci collation treats most accented variations of a character as equivalent to the base character, whereas its latin1_general_ci collation will order them before the next base character but not equivalent (there are other, more significant, differences too: such as the order of characters like å, ä, ö and ß).
MySQL will decide which collation should be applied to a given expression as documented under Collation of Expressions: in particular, the collation of a column takes precedence over that of a string literal.
The WHERE clause of your query compares the following strings:
a value in fos_user.username, encoded in the column's character set (Windows-1252) and expressing a preference for its collation latin1_swedish_ci (with a coercibility value of 2); with
the string literal 'Nrv⧧Kasi', encoded in the connection's character set (UTF-8, as configured by Doctrine) and expressing a preference for the connection's collation utf8_general_ci (with a coercibility value of 4).
Since the first of these strings has a lower coercibility value than the second, MySQL attempts to perform the comparison using that string's collation: latin1_swedish_ci. To do so, MySQL attempts to convert the second string to latin1—but since the ⧧ character does not exist in that character set, the comparison fails.
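The failing conversion can be reproduced with Python's codecs. This is only an approximation of the server's behaviour: `convertible` is a hypothetical helper, and Python's `cp1252` codec is assumed to be close enough to MySQL's latin1 for this character.

```python
literal = "Nrv\u29e7Kasi"  # the parameter from the log; U+29E7 is the problem character

def convertible(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in the target encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

to_latin1 = convertible(literal, "cp1252")     # U+29E7 has no Windows-1252 code
to_utf8 = convertible(literal, "utf-8")        # any character fits in UTF-8
plain_ok = convertible("NrvKasi", "cp1252")    # without U+29E7 the conversion works

print(to_latin1, to_utf8, plain_ok)
```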
Warning
One should pause for a moment to consider how the column is currently encoded: you are attempting to filter for records where fos_user.username is equal to a string that contains a character which cannot exist in that column!
If you believe that the column does contain such characters, then you probably wrote to the column whilst the connection character encoding was set to something (e.g. latin1) that caused MySQL to interpret the received byte sequence as characters which are all in the Windows-1252 character set.
If this is the case, before continuing any further you should fix your data!
convert such columns to the character encoding that was used on data insertion, if different to the incumbent encoding:
ALTER TABLE fos_user MODIFY username VARCHAR(123) CHARACTER SET foo;
drop the encoding information associated with such columns by converting them to the binary character set:
ALTER TABLE fos_user MODIFY username VARCHAR(123) CHARACTER SET binary;
associate with such columns the encoding in which data was actually transmitted by converting them to the relevant character set:
ALTER TABLE fos_user MODIFY username VARCHAR(123) CHARACTER SET bar;
Note that, if converting from a multi-byte encoding, you may need to increase the size of the column (or even change its type) in order to accommodate the maximum possible length of the converted string.
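The effect of going through the binary character set can be modelled in Python. The sketch below assumes a hypothetical value `café` that a client sent as UTF-8 over a connection declared as latin1; it mirrors the "drop the encoding, then reinterpret the bytes" steps and is not a substitute for testing on a copy of the real table.

```python
original = "caf\u00e9"           # what the application meant to store
sent = original.encode("utf-8")  # bytes the client actually transmitted

# The connection was declared as latin1, so each byte was read
# as one Windows-1252 character and stored that way.
misread = sent.decode("cp1252")

# Repair: go back to the raw bytes (like CHARACTER SET binary),
# then reinterpret them in the encoding that was really used.
raw = misread.encode("cp1252")
repaired = raw.decode("utf-8")

print(repr(misread), repr(repaired))
```

A direct conversion from latin1 to utf8 would instead keep the misread characters and simply re-encode them, which is why the intermediate binary step matters.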
Once one is certain that the columns are correctly encoded, one could force the comparison to be conducted using a Unicode collation by either—
explicitly converting the value fos_user.username to a Unicode character set:
WHERE CONVERT(fos_user.username USING utf8) = ?
forcing the string literal to have a lower coercibility value than the column (will cause an implicit conversion of the column's value to UTF-8):
WHERE fos_user.username = ? COLLATE utf8_general_ci
Or one could, as you say, permanently convert the column(s) to a Unicode encoding and set its collation appropriately.
Can I manually change the collation to utf8_general_ci for all my tables without any complications/precautions ?
The principal consideration is that Unicode encodings take up more space than single-byte character sets, so:
more storage may be required;
comparisons may be slower; and
index prefix lengths may need to be adjusted (note that the maximum is in bytes, so may represent fewer characters than previously).
Also, be aware that, as documented under ALTER TABLE Syntax:
To change the table default character set and all character columns (CHAR, VARCHAR, TEXT) to a new character set, use a statement like this:
ALTER TABLE tbl_name CONVERT TO CHARACTER SET charset_name;
For a column that has a data type of VARCHAR or one of the TEXT types, CONVERT TO CHARACTER SET will change the data type as necessary to ensure that the new column is long enough to store as many characters as the original column. For example, a TEXT column has two length bytes, which store the byte-length of values in the column, up to a maximum of 65,535. For a latin1 TEXT column, each character requires a single byte, so the column can store up to 65,535 characters. If the column is converted to utf8, each character might require up to three bytes, for a maximum possible length of 3 × 65,535 = 196,605 bytes. That length will not fit in a TEXT column's length bytes, so MySQL will convert the data type to MEDIUMTEXT, which is the smallest string type for which the length bytes can record a value of 196,605. Similarly, a VARCHAR column might be converted to MEDIUMTEXT.
To avoid data type changes of the type just described, do not use CONVERT TO CHARACTER SET. Instead, use MODIFY to change individual columns.
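The arithmetic in the quoted passage can be sketched as follows. `CAPACITY` and `smallest_text_type` are hypothetical names for illustration; the capacities follow from length prefixes of 1, 2, 3 and 4 bytes.

```python
# Capacity in bytes of each TEXT type, derived from the size of its length prefix.
CAPACITY = {
    "TINYTEXT": 2**8 - 1,
    "TEXT": 2**16 - 1,
    "MEDIUMTEXT": 2**24 - 1,
    "LONGTEXT": 2**32 - 1,
}

def smallest_text_type(max_chars: int, bytes_per_char: int) -> str:
    """Pick the smallest TEXT type whose length prefix can record the worst case."""
    needed = max_chars * bytes_per_char
    for name, capacity in CAPACITY.items():  # dicts keep insertion order
        if needed <= capacity:
            return name
    raise ValueError("value too long for any TEXT type")

needed_bytes = 65_535 * 3                   # worst case after converting to utf8
promoted = smallest_text_type(65_535, 3)    # the converted column
unchanged = smallest_text_type(65_535, 1)   # the original latin1 column

print(needed_bytes, promoted, unchanged)
```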
That's right. I ran into this problem, and the quickest solution is
CONVERT(fos_user.username USING utf8)
Simply convert the table's character set with the following command:
ALTER TABLE tbl_name CONVERT TO CHARACTER SET utf8;

MySQL treatment of ' '

The MySQL statement
mysql> select * from field where dflt=' '
appears to match empty values, and behaves differently from the statement
mysql> select * from field where concat('_',dflt,'_') = '_ _';
I couldn't find a description of this behavior in the MySQL reference. How can I make MySQL interpret input literally?
EDITED: This indeed won't match NULL values, but it does match empty values.
As mentioned in The CHAR and VARCHAR Types:
All MySQL collations are of type PADSPACE. This means that all CHAR and VARCHAR values in MySQL are compared without regard to any trailing spaces.
The definition of the LIKE operator states:
In particular, trailing spaces are significant, which is not true for CHAR or VARCHAR comparisons performed with the = operator:
As mentioned in this answer:
This behavior is specified in SQL-92 and SQL:2008. For the purposes of comparison, the shorter string is padded to the length of the longer string.
From the draft (8.2 <comparison predicate>):
If the length in characters of X is not equal to the length in characters of Y, then the shorter string is effectively replaced, for the purposes of comparison, with a copy of itself that has been extended to the length of the longer string by concatenation on the right of one or more pad characters, where the pad character is chosen based on CS. If CS has the NO PAD characteristic, then the pad character is an implementation-dependent character different from any character in the character set of X and Y that collates less than any string under CS. Otherwise, the pad character is a <space>.
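The padding rule quoted above can be modelled in a few lines of Python. This is a sketch of the PAD SPACE case only; `pad_space_equal` is a hypothetical helper, not something MySQL exposes.

```python
def pad_space_equal(x: str, y: str) -> bool:
    """Compare as the quoted rule describes: extend the shorter string with spaces."""
    width = max(len(x), len(y))
    return x.ljust(width, " ") == y.ljust(width, " ")

# An empty value matches a single space once the shorter side is padded.
empty_vs_space = pad_space_equal("", " ")

# Only trailing spaces are affected; a leading space still counts.
leading_space = pad_space_equal(" a", "a")

# Wrapping the value in other characters, as the concat() query in the
# question does, puts the space in the middle, so padding no longer hides it.
wrapped = pad_space_equal("_" + "" + "_", "_ _")

print(empty_vs_space, leading_space, wrapped)
```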
In addition to the other excellent solutions:
select binary 'a' = 'a '
I couldn't find any documentation, but it is widely known that trailing spaces are ignored when doing a text comparison.
To force a literal match, try this:
select *
from field
where dflt = ' '
and length(dflt) = 1; -- length() does not ignore trailing spaces

mysql BLOB and TEXT data type difference

What's the difference between the BLOB and TEXT data types in MySQL (other than sorting)?
BLOB is used for storing binary data, while TEXT is used to store large strings.
As stated in the MySQL 5.1 Reference Manual:
BLOB values are treated as binary strings (byte strings). They have no
character set, and sorting and comparison are based on the numeric
values of the bytes in column values. TEXT values are treated as
nonbinary strings (character strings). They have a character set, and
values are sorted and compared based on the collation of the character
set.
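The difference in ordering can be illustrated with a small Python sketch. It is an analogy only: `bytes` stands in for BLOB values, and `str.casefold` is used here as a rough approximation of a case-insensitive collation, which is an assumption rather than how MySQL collations are implemented.

```python
words = ["apple", "Banana", "cherry"]

# BLOB-like: order by the numeric values of the bytes ('B' is 0x42, 'a' is 0x61).
byte_order = sorted(w.encode("ascii") for w in words)

# TEXT-like with a case-insensitive collation, approximated by casefold().
collated_order = sorted(words, key=str.casefold)

# Equality follows the same split.
bytes_equal = b"A" == b"a"
collated_equal = "A".casefold() == "a".casefold()

print(byte_order, collated_order, bytes_equal, collated_equal)
```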
Mmm google is your friend I guess:
TEXT and CHAR values are converted to/from the character set associated with the column. BLOB and BINARY simply store bytes.
The main difference between BLOB and TEXT: BLOB behaves like a case-sensitive TEXT.