I have done a fair amount of research and read the MySQL documentation, but I cannot find a good explanation of the BINARY (BIN) column property in a MySQL table.
Could someone explain when this should be checked and what it is used for?
The BIN column means the column uses a binary collation.
I tested this by creating a table with a VARCHAR datatype and I checked the BIN column in MySQL Workbench.
Then I viewed the DDL for the table in the command-line client:
mysql> show create table mytable\G
*************************** 1. row ***************************
Table: mytable
Create Table: CREATE TABLE `mytable` (
`title` varchar(45) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL,
PRIMARY KEY (`title`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
You can see that the collation is utf8mb4_bin, which is the binary collation for that character set.
String comparisons against that column use byte-by-byte comparison instead of the character equivalences defined by a Unicode-aware collation.
So it's case-sensitive, and characters will compare as different even if they differ only in diacritics. For example 'e' = 'é' is false in binary comparisons.
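To make the difference concrete, here is a minimal Python sketch (not MySQL code). `binary_equal` compares the UTF-8 bytes, which is roughly what a _bin collation does for equality. `loose_equal` is only a rough stand-in for an accent- and case-insensitive collation such as utf8mb4_0900_ai_ci; it is not MySQL's actual algorithm, and both function names are made up for this example.

```python
import unicodedata


def binary_equal(a: str, b: str) -> bool:
    """Equal only if the UTF-8 encodings are byte-for-byte identical."""
    return a.encode("utf-8") == b.encode("utf-8")


def loose_equal(a: str, b: str) -> bool:
    """Approximate an accent- and case-insensitive comparison (illustrative only)."""

    def fold(s: str) -> str:
        # Decompose accented letters, drop the combining marks, then fold case.
        decomposed = unicodedata.normalize("NFD", s)
        stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
        return stripped.casefold()

    return fold(a) == fold(b)


print(binary_equal("e", "é"))      # False
print(loose_equal("e", "é"))       # True
print(binary_equal("one", "One"))  # False
print(loose_equal("one", "One"))   # True
```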
Related
I've been researching this question for several hours now, on SO, in MySQL docs, and elsewhere, but still can't find a satisfactory solution. The problem is:
What is the simplest way to make MySQL treat strings just like SQLite does, without any extra "smart" conversions?
For example, the following works perfectly in SQLite:
CREATE TABLE `dummy` (`key` VARCHAR(255) NOT NULL UNIQUE);
INSERT INTO `dummy` (`key`) VALUES ('one');
INSERT INTO `dummy` (`key`) VALUES ('one ');
INSERT INTO `dummy` (`key`) VALUES ('One');
INSERT INTO `dummy` (`key`) VALUES ('öne');
SELECT * FROM `dummy`;
However, in MySQL, with the following settings:
[client]
default-character-set = utf8mb4
[mysql]
default-character-set = utf8mb4
[mysqld]
character-set-client-handshake = FALSE
character-set-server = utf8mb4
collation-server = utf8mb4_bin
and the following CREATE DATABASE statement:
CREATE DATABASE `dummydb` DEFAULT CHARACTER SET utf8mb4 DEFAULT COLLATE utf8mb4_bin;
it still fails on the second INSERT.
I'd rather keep string column declarations as simple as possible, SQLite's TEXT being the ideal. Looks like VARBINARY is the way to go, but I would still like to hear your opinions on any other, potentially better options.
Addendum: The SHOW CREATE TABLE dummy output is
mysql> SHOW CREATE TABLE dummy;
+-------+-----------------------------------------------------
| Table | Create Table
+-------+-----------------------------------------------------
| dummy | CREATE TABLE `dummy` (
`key` varchar(255) COLLATE utf8mb4_bin NOT NULL,
UNIQUE KEY `key` (`key`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin |
+-------+-----------------------------------------------------
1 row in set (0.00 sec)
MySQL wants to convert strings when doing INSERT and SELECT. The conversion is between what you declare the client to have and what the column is declared to be storing.
The only way to avoid that is with VARBINARY and BLOB instead of VARCHAR and TEXT.
The use of COLLATION utf8mb4_bin does not avoid conversion to/from CHARACTER SET utf8mb4; it merely says that WHERE and ORDER BY should compare the bits instead of dealing with accents and case folding.
Keep in mind that CHARACTER SET utf8mb4 is a way to encode text, while a COLLATION utf8mb4_* is a set of rules for comparing text in that encoding. The _bin collation is the simplest: it just compares the encoded values.
UNIQUE involves comparing for equality, hence COLLATION. In most utf8mb4 collations, the 3 values (ignoring the one with the trailing space) will compare equal. utf8mb4_bin will treat the 3 as different. utf8mb4_hungarian_ci treats one=One>öne.
How trailing spaces are treated depends on the datatype of the column (VARCHAR or other) and, since MySQL 8.0, on the pad attribute of the collation: the utf8mb4_0900_* collations are NO PAD, so trailing spaces are significant in comparisons, whereas older collations such as utf8mb4_bin are PAD SPACE.
The approach shown in the question should (mostly) work just fine in MySQL for the following reasons:
Collation (not to be confused with encoding) is the set of rules that define how to sort and compare characters, typically used to replicate at database level the user expectations from a cultural perspective (if I search for cafe I expect to find café as well).
Collation plays an important role in unique constraints because it establishes the definition of unique.
Binary collations are specifically meant to ignore cultural rules and work at byte level, thus utf8mb4_bin is the right choice here.
MySQL lets you set a combination of encoding and collation with column-level granularity.
If a column definition is missing collation, it'll use the table level one.
If a table definition is missing collation, it'll use the database level one.
If a database definition is missing collation, it'll use the server level one.
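The fallback chain above can be sketched as a small Python function. This is only a model of the rule as described: the function name and its parameters are hypothetical, and in practice MySQL resolves the default when the object is created rather than at query time.

```python
from typing import Optional


def effective_collation(
    column: Optional[str],
    table: Optional[str],
    database: Optional[str],
    server: str,
) -> str:
    """Return the first collation that is explicitly set, most specific level first."""
    for candidate in (column, table, database):
        if candidate is not None:
            return candidate
    return server  # the server level always has a value


# Column declared without COLLATE, table declared with utf8mb4_bin:
print(effective_collation(None, "utf8mb4_bin", "utf8mb4_general_ci", "latin1_swedish_ci"))
# utf8mb4_bin
```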
It's also worth noting that MySQL will convert between encodings transparently as long as:
Connection encoding is properly set
Conversion is physically possible (e.g. all source characters also belong to target encoding)
For this last reason, VARBINARY is possibly not the best choice for a column that's still text because it opens the door to getting café stored from a connection configured to use ISO-8859-1 and not being able to retrieve it correctly from a connection configured to use UTF-8.
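A small Python sketch of that failure mode, assuming the column stores whatever bytes the client sends (as VARBINARY does) and that a second client reads them back as UTF-8. The variable names are illustrative only.

```python
text = "café"

# Bytes an ISO-8859-1 client would send; a binary column stores them unchanged.
stored = text.encode("iso-8859-1")
print(stored)  # b'caf\xe9'

# A client that expects UTF-8 cannot decode those bytes.
try:
    stored.decode("utf-8")
    readable_as_utf8 = True
except UnicodeDecodeError:
    readable_as_utf8 = False
print(readable_as_utf8)  # False

# A text column with a declared character set lets the server convert instead.
converted = stored.decode("iso-8859-1").encode("utf-8")
print(converted)  # b'caf\xc3\xa9'
```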
Side note: the table definition shown may trigger the following error:
ERROR 1071 (42000): Specified key was too long; max key length is 767 bytes
Indexes may have a relatively small maximum size. From docs:
If innodb_large_prefix is enabled (the default), the index key prefix
limit is 3072 bytes for InnoDB tables that use DYNAMIC or COMPRESSED
row format. If innodb_large_prefix is disabled, the index key prefix
limit is 767 bytes for tables of any row format.
innodb_large_prefix is deprecated and will be removed in a future
release. innodb_large_prefix was introduced in MySQL 5.5 to disable
large index key prefixes for compatibility with earlier versions of
InnoDB that do not support large index key prefixes.
The index key prefix length limit is 767 bytes for InnoDB tables that
use the REDUNDANT or COMPACT row format. For example, you might hit
this limit with a column prefix index of more than 255 characters on a
TEXT or VARCHAR column, assuming a utf8mb3 character set and the
maximum of 3 bytes for each character.
Attempting to use an index key prefix length that exceeds the limit
returns an error. To avoid such errors in replication configurations,
avoid enabling innodb_large_prefix on the master if it cannot also be
enabled on slaves.
Since utf8mb4 allocates up to 4 bytes per character, a 767-byte limit is exceeded at 192 characters (192 × 4 = 768 bytes), so 191 characters is the maximum.
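The arithmetic behind these limits, as a short Python sketch. The helper name is made up; the byte limits are the ones quoted from the docs above, and the bytes-per-character values are the maxima for utf8mb4 (4) and utf8mb3 (3).

```python
def max_indexable_chars(limit_bytes: int, bytes_per_char: int) -> int:
    """Largest character count whose worst-case encoding fits in the key limit."""
    return limit_bytes // bytes_per_char


print(max_indexable_chars(767, 4))   # 191 (utf8mb4, 767-byte limit)
print(max_indexable_chars(767, 3))   # 255 (utf8mb3, 767-byte limit)
print(max_indexable_chars(3072, 4))  # 768 (utf8mb4, 3072-byte limit)
```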
We have one more problem:
mysql> CREATE TABLE `dummy` (
-> `key` varchar(191) COLLATE utf8mb4_bin NOT NULL,
-> UNIQUE KEY `key` (`key`)
-> )
-> ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin;
Query OK, 0 rows affected (0.01 sec)
mysql> INSERT INTO `dummy` (`key`) VALUES ('one');
Query OK, 1 row affected (0.00 sec)
mysql> INSERT INTO `dummy` (`key`) VALUES ('one ');
ERROR 1062 (23000): Duplicate entry 'one ' for key 'key'
Pardon?
mysql> INSERT INTO `dummy` (`key`) VALUES ('One');
Query OK, 1 row affected (0.00 sec)
mysql> INSERT INTO `dummy` (`key`) VALUES ('öne');
Query OK, 1 row affected (0.00 sec)
mysql> SELECT * FROM `dummy`;
+-----+
| key |
+-----+
| One |
| one |
| öne |
+-----+
3 rows in set (0.00 sec)
This last issue is an interesting subtlety of MySQL collations. From docs:
All MySQL collations are of type PADSPACE. This means that all CHAR,
VARCHAR, and TEXT values in MySQL are compared without regard to any
trailing spaces. “Comparison” in this context does not include the
LIKE pattern-matching operator, for which trailing spaces are
significant
[...]
For those cases where trailing pad characters are stripped or
comparisons ignore them, if a column has an index that requires unique
values, inserting into the column values that differ only in number of
trailing pad characters will result in a duplicate-key error.
I'd dare say then that VARBINARY type is the only way to overcome this...
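To see why the second INSERT collides, here is a toy Python model of a unique index under PAD SPACE versus byte-exact comparison. It is a sketch under stated assumptions, not MySQL's implementation: `try_inserts` and its `pad_space` flag are hypothetical names, and the byte-exact mode stands in for what a binary type like VARBINARY would do.

```python
def try_inserts(values, pad_space):
    """Return the values a UNIQUE index would accept, in insertion order."""
    seen = set()
    accepted = []
    for value in values:
        # PAD SPACE comparison ignores trailing spaces; otherwise they count.
        key = value.rstrip(" ") if pad_space else value
        key_bytes = key.encode("utf-8")  # binary comparison of the encoded text
        if key_bytes in seen:
            continue  # MySQL would raise ERROR 1062 (duplicate entry) here
        seen.add(key_bytes)
        accepted.append(value)
    return accepted


rows = ["one", "one ", "One", "öne"]
print(try_inserts(rows, pad_space=True))   # ['one', 'One', 'öne']
print(try_inserts(rows, pad_space=False))  # ['one', 'one ', 'One', 'öne']
```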
After reading articles about how to properly store your data inside a mysql database, I sort of have a good understanding of how character sets work. Yet, there are still some instances that I find hard to understand. Let me give a quick example: Say I have a table with a varchar column using the latin1 charset (for example purposes)
CREATE TABLE `foobar` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`foo` varchar(100) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=latin1;
and I force insert a latin1 character using its hex value (á for instance, which would be E1)
SET NAMES 'latin1';
INSERT INTO foobar SET foo = UNHEX('E1');
Why does a SELECT query against this table under a latin1 connection fail to give me the value I am expecting? I get the replacement character instead. I would have expected it to just give back the value as it is, since the column is latin1, the connection itself is latin1, and the value stored is a valid latin1 character. Is MySQL trying to interpret E1 as a UTF8 value? Or am I missing something here?
Using a UTF8 connection (SET NAMES 'utf8') before selecting gives me the expected value of á, which I guess is correct since, from what I understand, if the connection charset is different from the column charset being selected, then MySQL reads the data as whatever the column charset is (so E1 gets parsed as a latin1 character, or á).
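One possible explanation can be sketched in Python, assuming the program displaying the result (terminal or client library) decodes what it receives as UTF-8. That assumption is not confirmed by the question. Under a latin1 connection the server would hand back the raw byte E1; under a utf8 connection it would first convert á to the two-byte UTF-8 sequence.

```python
raw = bytes.fromhex("E1")  # the byte stored in the latin1 column

# Interpreted as latin1, the byte is the expected character.
as_latin1 = raw.decode("latin-1")
print(as_latin1)  # á

# latin1 connection: the byte arrives unchanged, and a UTF-8 display cannot decode it.
shown_by_utf8_display = raw.decode("utf-8", errors="replace")
print(shown_by_utf8_display == "\ufffd")  # True (the replacement character)

# utf8 connection: the server converts latin1 to UTF-8 before sending.
sent_over_utf8_connection = as_latin1.encode("utf-8")
print(sent_over_utf8_connection)  # b'\xc3\xa1'
```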
In creating a simple (temporary) MySQL table, taking data from the same column of the same source table, the two resulting columns wind up with different CHARACTER SET and resulting default COLLATION settings:
mysql> CREATE TABLE tempDates
SELECT SUBDATE(MAX(EventDate), INTERVAL 90 DAY) AS StartDate,
MAX(EventDate) AS EndDate FROM james_bond_007
WHERE EventCategory = 'Successful_Kills';
Here is the output showing the resulting table structures:
mysql> SHOW CREATE TABLE tempDates;
CREATE TABLE `tempDates` (
`StartDate` varchar(29) CHARACTER SET utf8 DEFAULT NULL,
`EndDate` varchar(255) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1
I ran an alter table command, but NOTHING changed:
ALTER TABLE tempdates CHARACTER SET latin1 COLLATE latin1_swedish_ci;
From a curiosity standpoint, I want to know why this happens, and from a practical standpoint, how do I make this not happen?
The result I want is for all columns to have the server defaults: CHARACTER SET latin1 COLLATE latin1_swedish_ci
Even better would be a way to impose the server defaults on all columns so I don't have to type more than I want to in future queries of this type.
@Rick James
This solved my problem so I want to mark it answered.
If you've a moment, perhaps an explanation as to why? (gives me another excuse to upvote you and accept your answer)
I have a table with a field a using encoding utf8 and collation utf8_unicode_ci:
CREATE TABLE dictionary (
a varchar(128) NOT NULL
) DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
The collation utf8_unicode_ci is required for an efficient case-insensitive search with extensions and ligatures. For this purpose I have the index:
CREATE INDEX a_idx on dictionary(a);
Problem: Additionally I must ensure that all stored values of the field a are unique, but in a case-sensitive way.
German example: "blühen" and "Blühen" must both be stored in the table. But adding "Blühen" a second time should not be possible.
Is there built-in functionality in MySQL to have both?
Unfortunately it seems not to be possible to set the collation for the index in MySQL 5.1.
Solutions to this problem include a uniqueness check before insert or a trigger. Both are far less elegant than using a unique index.
Well, there are 2 ways to accomplish this:
using _bin collation
change your datatype to VARBINARY
Case 1: using _bin collation
Create your table as follows:
CREATE TABLE `dictionary` (
`a` VARCHAR(128) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
UNIQUE KEY `idx_un_a` (`a`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
Please note:
the definition of column a, which now carries the utf8_bin collation
the UNIQUE index on column a
Case 2: using VARBINARY datatype
Create your table as follows:
CREATE TABLE `dictionary` (
`a` VARBINARY(128) NOT NULL,
UNIQUE KEY `idx_uniq_a` (`a`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
Please note:
the new datatype VARBINARY
the UNIQUE index on column a
So, both the above will solve your purpose. That is, they both will allow values like 'abc', 'Abc', 'ABC', 'aBc' etc but not allow the same value again if the case matches.
Please note that giving an "_bin" collation is different than using the binary datatype. So please feel free to refer to the following links:
The BINARY and VARBINARY datatypes
The _bin and binary Collations
I hope the above helps!
You can achieve this by adding an additional column 'a_lower'.
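CREATE TABLE `dictionary` (
`a` VARCHAR(128) NOT NULL,
`a_lower` VARCHAR(128) NOT NULL,
-- case-sensitive uniqueness: `a` uses the table's utf8_bin collation
UNIQUE KEY `idx_un_a` (`a`),
-- non-unique index on the lowercased copy, used for case-insensitive search
KEY `idx_a_lower` (`a_lower`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;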
Insert that goes like this:
insert into dictionary set a = x, a_lower = lower(x);
Select can now be case-insensitive:
select * from dictionary where a_lower like lower('search_term%');
Note that an indexed column can hold at most 191 characters when the index key limit is 767 bytes and the column uses the utf8mb4 character set: each character can take up to 4 bytes, and 767 / 4 = 191.75, which rounds down to 191. With utf8, which takes at most 3 bytes per character, the limit is 767 / 3 = 255 characters.
SELECT * FROM dictionary WHERE a COLLATE utf8_general_ci = 'abc';
Try this; it worked for me.
Recently I changed a bunch of columns to utf8_general_ci (the default UTF-8 collation) but when attempting to change a particular column, I received the MySQL error:
Column 'node_content' cannot be part of FULLTEXT index
In looking through docs, it appears that MySQL has a problem with FULLTEXT indexes on some multi-byte charsets such as UCS-2, but that it should work on UTF-8.
I'm on the latest stable MySQL 5.0.x release (5.0.77 I believe).
Oops, so I have found the answer to my problem:
All columns of a FULLTEXT index must have not only the same character set but also the same collation.
My FULLTEXT index had utf8_unicode_ci on one of its columns, and utf8_general_ci on its other columns.
Just to add to Thomas's good advice: to sort things out in phpMyAdmin you have to change the character set for all columns AT THE SAME TIME.
Just wasted half a day trying again and again to change the columns one at a time and continually getting the error message about the FULLTEXT index.
For DBeaver/database tool users.
When you use the interface to modify more than one column, the tool generates commands like this:
ALTER TABLE databaseName.tableName MODIFY COLUMN columnName1 text CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL;
ALTER TABLE databaseName.tableName MODIFY COLUMN columnName2 varchar(128) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL;
This does not work because you must modify the charsets at the same time.
So you have to change them manually, in one command:
ALTER TABLE databaseName.tableName
MODIFY COLUMN columnName1 text CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL,
MODIFY COLUMN columnName2 varchar(128) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL;
utf8 or utf8mb4 ? See here.