MySQL unique field ignores trailing spaces - mysql

My project needs to store user input exactly as typed, including any spaces to the left or right of a word such as 'apple'. If the user types ' apple' or 'apple ', with one space or several on either side, I need to store it that way.
The field has the UNIQUE attribute. Inserting the word with leading spaces works fine, but when I insert the word with trailing spaces, all the spaces on the right appear to be trimmed off.
I am thinking of appending a special character after the trailing spaces, but I am hoping there is a better solution.
CREATE TABLE strings
( id bigint(20) unsigned NOT NULL AUTO_INCREMENT,
string varchar(255) COLLATE utf8_bin NOT NULL,
created_ts timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (id), UNIQUE KEY string (string) )
ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COLLATE=utf8_bin

The problem is that MySQL ignores trailing whitespace when doing string comparison. See
http://dev.mysql.com/doc/refman/5.7/en/char.html
All MySQL collations are of type PADSPACE. This means that all CHAR, VARCHAR, and TEXT values in MySQL are compared without regard to any trailing spaces.
...
For those cases where trailing pad characters are stripped or comparisons ignore them, if a column has an index that requires unique values, inserting into the column values that differ only in number of trailing pad characters will result in a duplicate-key error. For example, if a table contains 'a', an attempt to store 'a ' causes a duplicate-key error.
(This information is for 5.7; for 8.0 this changed, see below)
The section on the LIKE operator gives an example of this behavior (and shows that LIKE does respect trailing whitespace):
mysql> SELECT 'a' = 'a ', 'a' LIKE 'a ';
+------------+---------------+
| 'a' = 'a ' | 'a' LIKE 'a ' |
+------------+---------------+
| 1 | 0 |
+------------+---------------+
1 row in set (0.00 sec)
Unfortunately the UNIQUE index seems to use the standard string comparison to check if there is already such a value, and thus ignores trailing whitespace.
This is independent of whether you use VARCHAR or CHAR: in both cases the insert is rejected because the unique check fails. If there is a way to use LIKE semantics for the UNIQUE check, I do not know it.
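To make the comparison rule concrete, here is a minimal Python sketch of the two behaviors described above. It is only a model, not MySQL code: `pad_space_equal`, `exact_equal`, and `PadSpaceUniqueIndex` are hypothetical names, and the toy index simply assumes that the unique check uses the PAD SPACE comparison.

```python
def pad_space_equal(a, b):
    """Model of '=' under a PAD SPACE collation: trailing spaces are ignored."""
    return a.rstrip(" ") == b.rstrip(" ")


def exact_equal(a, b):
    """Model of a comparison in which trailing spaces are significant."""
    return a == b


class PadSpaceUniqueIndex:
    """Toy unique index whose duplicate check uses PAD SPACE semantics."""

    def __init__(self):
        self._keys = set()   # values with trailing spaces stripped
        self.rows = []       # values exactly as they were inserted

    def insert(self, value):
        key = value.rstrip(" ")
        if key in self._keys:
            raise ValueError("Duplicate entry %r" % value)
        self._keys.add(key)
        self.rows.append(value)


index = PadSpaceUniqueIndex()
index.insert("a")
index.insert(" a")          # leading space: a different key, accepted
try:
    index.insert("a ")      # trailing space: same key as 'a', rejected
    trailing_rejected = False
except ValueError:
    trailing_rejected = True
```

Leading spaces change the key, trailing spaces do not, which matches what the question reports.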
What you could do is store the value as VARBINARY:
mysql> create table test_ws ( `value` varbinary(255) UNIQUE );
Query OK, 0 rows affected (0.13 sec)
mysql> insert into test_ws (`value`) VALUES ('a');
Query OK, 1 row affected (0.08 sec)
mysql> insert into test_ws (`value`) VALUES ('a ');
Query OK, 1 row affected (0.06 sec)
mysql> SELECT CONCAT( '(', value, ')' ) FROM test_ws;
+---------------------------+
| CONCAT( '(', value, ')' ) |
+---------------------------+
| (a) |
| (a ) |
+---------------------------+
2 rows in set (0.00 sec)
You probably do not want to sort alphabetically on this column, because sorting will use the byte values instead, and that is not what most users expect.
The alternative is to patch MySQL and write your own collation which is of type NO PAD. Not sure if someone wants to do that, but if you do, let me know ;)
Edit: meanwhile MySQL has collations which are of type NO PAD, according to https://dev.mysql.com/doc/refman/8.0/en/char.html :
Most MySQL collations have a pad attribute of PAD SPACE. The exceptions are Unicode collations based on UCA 9.0.0 and higher, which have a pad attribute of NO PAD.
and https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-sets.html
Unicode collations based on UCA versions later than 4.0.0 include the version in the collation name. Thus, utf8mb4_unicode_520_ci is based on UCA 5.2.0 weight keys, whereas utf8mb4_0900_ai_ci is based on UCA 9.0.0 weight keys.
So if you try:
create table test_ws ( `value` varchar(255) UNIQUE )
character set utf8mb4 collate utf8mb4_0900_ai_ci;
you can insert values with and without trailing whitespace.
You can find all available NO PAD collations with:
show collation where Pad_attribute='NO PAD';

This is not about CHAR vs VARCHAR. SQL Server does not consider trailing spaces when comparing strings, and the same comparison is applied when checking a unique key constraint. So it is not that you cannot insert a value with trailing spaces; rather, once you have inserted it, you cannot insert another value that differs only in the number of trailing spaces.
As a solution to your problem, you can add a column that holds the length of the string and make the string value and its length a composite unique key.
In SQL Server 2012, you can even make the length column as a computed column so that you don't have to worry about the value at all. See http://sqlfiddle.com/#!6/32e94 for an example with SQL Server 2012. (I bet something similar is possible in MySQL.)
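The composite-key idea can be tried without a MySQL server. The sketch below is an emulation under a stated assumption: it uses Python's `sqlite3` module and SQLite's built-in `RTRIM` collation as a stand-in for a comparison that ignores trailing spaces. The table and column names (`plain`, `with_len`, `s`, `s_len`) are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Unique column alone: under RTRIM, 'apple' and 'apple ' compare equal.
conn.execute("CREATE TABLE plain (s TEXT COLLATE RTRIM NOT NULL UNIQUE)")
conn.execute("INSERT INTO plain (s) VALUES ('apple')")
try:
    conn.execute("INSERT INTO plain (s) VALUES ('apple ')")
    plain_rejected = False
except sqlite3.IntegrityError:
    plain_rejected = True

# Composite unique key on (string, length): the length tells the rows apart.
conn.execute(
    "CREATE TABLE with_len ("
    " s TEXT COLLATE RTRIM NOT NULL,"
    " s_len INTEGER NOT NULL,"
    " UNIQUE (s, s_len))"
)
for value in ("apple", "apple ", " apple"):
    conn.execute(
        "INSERT INTO with_len (s, s_len) VALUES (?, ?)", (value, len(value))
    )

# An exact duplicate (same string, same length) is still rejected.
try:
    conn.execute("INSERT INTO with_len (s, s_len) VALUES ('apple', 5)")
    exact_duplicate_rejected = False
except sqlite3.IntegrityError:
    exact_duplicate_rejected = True

stored = [row[0] for row in conn.execute("SELECT s FROM with_len ORDER BY s_len, s")]
```

The stored values keep their trailing spaces; only the comparison ignores them, so the extra length column is what makes the key distinct.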

You probably need to read about the differences between VARCHAR and CHAR types.
The CHAR and VARCHAR Types
When CHAR values are stored, they are right-padded with spaces to the specified length. When CHAR values are retrieved, trailing spaces are removed unless the PAD_CHAR_TO_FULL_LENGTH SQL mode is enabled.
For VARCHAR columns, trailing spaces in excess of the column length are truncated prior to insertion and a warning is generated, regardless of the SQL mode in use. For CHAR columns, truncation of excess trailing spaces from inserted values is performed silently regardless of the SQL mode.
VARCHAR values are not padded when they are stored. Trailing spaces are retained when values are stored and retrieved, in conformance with standard SQL.
Conclusion: according to the passages quoted above, if you want to retain whitespace on the right side of a text string, use the VARCHAR type (and not CHAR).

Thanks to @kennethc. His answer works for me.
Add a string length field to the table and to the unique key.
CREATE TABLE strings
( id bigint(20) unsigned NOT NULL AUTO_INCREMENT,
string varchar(255) COLLATE utf8_bin NOT NULL,
created_ts timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
string_length int(3),
PRIMARY KEY (id), UNIQUE KEY string (string,string_length) )
ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COLLATE=utf8_bin
In MySQL, the string length field can be maintained with a couple of triggers like this:
CREATE TRIGGER `string_length_insert` BEFORE INSERT ON `strings` FOR EACH ROW SET NEW.string_length = char_length(NEW.string);
CREATE TRIGGER `string_length_update` BEFORE UPDATE ON `strings` FOR EACH ROW SET NEW.string_length = char_length(NEW.string);

Related

Why mysql query is slow without quotation mark?

The table DDL is as follows:
CREATE TABLE `video` (
`short_id` varchar(50) NOT NULL,
`prob` float DEFAULT NULL,
`star_id` varchar(50) NOT NULL,
`qipu_id` int(11) NOT NULL,
`cloud_url` varchar(100) DEFAULT NULL,
`is_identical` tinyint(1) DEFAULT NULL,
`quality` varchar(1) DEFAULT NULL,
PRIMARY KEY (`short_id`),
KEY `ix_video_short_id` (`short_id`),
KEY `sid` (`star_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
The video table has 4.5 million rows.
I execute the same query in the mysql shell client in two ways, as follows. In one, the WHERE clause compares star_id to a value in quotation marks; in the other, the value is unquoted.
select * from video where star_id="215343405";
12914 rows in set (0.22 sec)
select * from video where star_id=215343405;
12914 rows in set (3.17 sec)
The one with quotation marks is 10x faster than the other (I have created an index on star_id). I noticed that the slow one does not use the index. How does MySQL process these queries?
mysql> explain select * from video where star_id=215343405;
Thanks in advance!
This is answered in the manual:
For comparisons of a string column with a number, MySQL cannot use an
index on the column to look up the value quickly. If str_col is an
indexed string column, the index cannot be used when performing the
lookup in the following statement:
SELECT * FROM tbl_name WHERE str_col=1;
The reason for this is that there are many different strings that may convert to the value 1, such as '1', ' 1', or '1a'.
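A rough Python model shows why so many strings match the number 1. This is a simplified sketch, not MySQL's actual conversion routine: `to_number` is a hypothetical helper that skips leading whitespace, reads the longest numeric prefix, and falls back to 0 when there is none.

```python
import re

# Optional whitespace, then a numeric prefix (sign, digits, fraction, exponent).
_NUMERIC_PREFIX = re.compile(r"\s*([+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?)")


def to_number(text):
    """Simplified model of converting a string to a number for comparison."""
    match = _NUMERIC_PREFIX.match(text)
    return float(match.group(1)) if match else 0.0


candidates = ["1", " 1", "1a", "01", "1.0", "2", "abc"]

# Every stored string has to be converted before it can be compared with 1.
matches = [s for s in candidates if to_number(s) == 1]
```

The matching strings sort into different places in a string-ordered index, so a single index lookup cannot find them all; each row has to be converted and checked.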
If you do not use quotation marks, MySQL treats the value as a number and must convert the column value of every record. Therefore the database needs a lot of time.
The quotes define the expression as a string, whereas without the quotes it is evaluated as a number. This means that MySQL is forced to perform a type conversion for every row to do a proper comparison.
As the doc above says,
For comparisons of a string column with a number, MySQL cannot use an
index on the column to look up the value quickly. If str_col is an
indexed string column, the index cannot be used when performing the
lookup...
However, the inverse of that is not true and while the index can be used, using a string as a value causes a poor execution plan (as illustrated by jkavalik's sqlfiddle) where using where is used instead of the faster using index condition. The main difference between the two is that the former requires a row lookup and the latter can get the data directly from the index.
You should definitely modify the column data type (assuming it truly is only meant to contain numbers) to the appropriate data type ASAP, but make sure that no queries are actually using single quotes, otherwise you'll be back where you started.

How to make MySQL handle strings like SQLite does, with regard to Unicode and collation?

I've been researching this question for several hours now, on SO, in MySQL docs, and elsewhere, but still can't find a satisfactory solution. The problem is:
What is the simplest way to make MySQL treat strings just like SQLite does, without any extra "smart" conversions?
For example, the following works perfectly in SQLite:
CREATE TABLE `dummy` (`key` VARCHAR(255) NOT NULL UNIQUE);
INSERT INTO `dummy` (`key`) VALUES ('one');
INSERT INTO `dummy` (`key`) VALUES ('one ');
INSERT INTO `dummy` (`key`) VALUES ('One');
INSERT INTO `dummy` (`key`) VALUES ('öne');
SELECT * FROM `dummy`;
However, in MySQL, with the following settings:
[client]
default-character-set = utf8mb4
[mysql]
default-character-set = utf8mb4
[mysqld]
character-set-client-handshake = FALSE
character-set-server = utf8mb4
collation-server = utf8mb4_bin
and the following CREATE DATABASE statement:
CREATE DATABASE `dummydb` DEFAULT CHARACTER SET utf8mb4 DEFAULT COLLATE utf8mb4_bin;
it still fails on the second INSERT.
I'd rather keep string column declarations as simple as possible, SQLite's TEXT being the ideal. Looks like VARBINARY is the way to go, but I would still like to hear your opinions on any other, potentially better options.
Addendum: The SHOW CREATE TABLE dummy output is
mysql> SHOW CREATE TABLE dummy;
+-------+-----------------------------------------------------
| Table | Create Table
+-------+-----------------------------------------------------
| dummy | CREATE TABLE `dummy` (
`key` varchar(255) COLLATE utf8mb4_bin NOT NULL,
UNIQUE KEY `key` (`key`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin |
+-------+-----------------------------------------------------
1 row in set (0.00 sec)
MySQL wants to convert strings when doing INSERT and SELECT. The conversion is between what you declare the client to have and what the column is declared to be storing.
The only way to avoid that is with VARBINARY and BLOB instead of VARCHAR and TEXT.
The use of COLLATION utf8mb4_bin does not avoid conversion to/from CHARACTER SET utf8mb4; it merely says that WHERE and ORDER BY should compare the bits instead of dealing with accents and case folding.
Keep in mind that CHARACTER SET utf8mb4 is a way to encode text; COLLATION utf8mb4_* is rules for comparing texts in that encoding. _bin is simpleminded.
UNIQUE involves comparing for equality, hence COLLATION. In most utf8mb4 collations, the 3 (without spaces) will compare equal. utf8mb4_bin will treat the 3 as different. utf8mb4_hungarian_ci treats one=One>öne.
The trailing spaces are controlled by the datatype of the column (VARCHAR or other). The latest version even has a setting relating to whether to consider trailing spaces.
The approach shown in the question should (mostly) work just fine in MySQL for the following reasons:
Collation (not to be confused with encoding) is the set of rules that define how to sort and compare characters, typically used to replicate at database level the user expectations from a cultural perspective (if I search for cafe I expect to find café as well).
Collation plays an important role in unique constraints because it establishes the definition of unique.
Binary collations are specifically meant to ignore cultural rules and work at byte level, thus utf8mb4_bin is the right choice here.
MySQL allows setting a combination of encoding and collation with column-level granularity.
If a column definition is missing collation, it'll use the table level one.
If a table definition is missing collation, it'll use the database level one.
If a database definition is missing collation, it'll use the server level one.
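The fallback chain in the last three points can be sketched as a small lookup. This is only an illustration of the order of precedence: `effective_collation` is a hypothetical function, the default server collation is an assumed value, and the sketch ignores how character set and collation settings interact.

```python
def effective_collation(column=None, table=None, database=None,
                        server="utf8mb4_bin"):
    """Return (level, collation) for the most specific level that defines one.

    The default server collation is an assumption made for this example.
    """
    levels = (
        ("column", column),
        ("table", table),
        ("database", database),
        ("server", server),
    )
    for level, collation in levels:
        if collation is not None:
            return level, collation
    raise ValueError("no collation defined at any level")


# Column has no collation of its own, so the table-level one applies.
from_table = effective_collation(table="utf8mb4_bin",
                                 database="latin1_swedish_ci")

# Nothing is set below the server level, so the server default applies.
from_server = effective_collation()
```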
It's also worth noting that MySQL will convert between encodings transparently as long as:
Connection encoding is properly set
Conversion is physically possible (e.g. all source characters also belong to target encoding)
For this last reason, VARBINARY is possibly not the best choice for a column that's still text because it opens the door to getting café stored from a connection configured to use ISO-8859-1 and not being able to retrieve it correctly from a connection configured to use UTF-8.
Side note: the table definition shown may trigger the following error:
ERROR 1071 (42000): Specified key was too long; max key length is 767 bytes
Indexes may have a relatively small maximum size. From docs:
If innodb_large_prefix is enabled (the default), the index key prefix
limit is 3072 bytes for InnoDB tables that use DYNAMIC or COMPRESSED
row format. If innodb_large_prefix is disabled, the index key prefix
limit is 767 bytes for tables of any row format.
innodb_large_prefix is deprecated and will be removed in a future
release. innodb_large_prefix was introduced in MySQL 5.5 to disable
large index key prefixes for compatibility with earlier versions of
InnoDB that do not support large index key prefixes.
The index key prefix length limit is 767 bytes for InnoDB tables that
use the REDUNDANT or COMPACT row format. For example, you might hit
this limit with a column prefix index of more than 255 characters on a
TEXT or VARCHAR column, assuming a utf8mb3 character set and the
maximum of 3 bytes for each character.
Attempting to use an index key prefix length that exceeds the limit
returns an error. To avoid such errors in replication configurations,
avoid enabling innodb_large_prefix on the master if it cannot also be
enabled on slaves.
Since utf8mb4 needs up to 4 bytes per character, the 767-byte limit is exceeded with only 192 characters.
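The arithmetic behind these limits is a plain integer division of the byte limit by the maximum bytes per character. The helper name `max_indexable_chars` is made up for this sketch; the limits and byte widths are the ones quoted above.

```python
def max_indexable_chars(limit_bytes, max_bytes_per_char):
    """Largest column length (in characters) that fits in the index key limit."""
    return limit_bytes // max_bytes_per_char


utf8mb4_at_767 = max_indexable_chars(767, 4)    # 4 bytes per character
utf8mb3_at_767 = max_indexable_chars(767, 3)    # 3 bytes per character
utf8mb4_at_3072 = max_indexable_chars(3072, 4)  # large prefix limit
```

This is why the table definition in the next example uses varchar(191) rather than varchar(255).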
We have one more problem:
mysql> CREATE TABLE `dummy` (
-> `key` varchar(191) COLLATE utf8mb4_bin NOT NULL,
-> UNIQUE KEY `key` (`key`)
-> )
-> ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin;
Query OK, 0 rows affected (0.01 sec)
mysql> INSERT INTO `dummy` (`key`) VALUES ('one');
Query OK, 1 row affected (0.00 sec)
mysql> INSERT INTO `dummy` (`key`) VALUES ('one ');
ERROR 1062 (23000): Duplicate entry 'one ' for key 'key'
Pardon?
mysql> INSERT INTO `dummy` (`key`) VALUES ('One');
Query OK, 1 row affected (0.00 sec)
mysql> INSERT INTO `dummy` (`key`) VALUES ('öne');
Query OK, 1 row affected (0.00 sec)
mysql> SELECT * FROM `dummy`;
+-----+
| key |
+-----+
| One |
| one |
| öne |
+-----+
3 rows in set (0.00 sec)
This last issue is an interesting subtlety of MySQL collations. From docs:
All MySQL collations are of type PADSPACE. This means that all CHAR,
VARCHAR, and TEXT values in MySQL are compared without regard to any
trailing spaces. “Comparison” in this context does not include the
LIKE pattern-matching operator, for which trailing spaces are
significant
[...]
For those cases where trailing pad characters are stripped or
comparisons ignore them, if a column has an index that requires unique
values, inserting into the column values that differ only in number of
trailing pad characters will result in a duplicate-key error.
I'd dare say then that VARBINARY type is the only way to overcome this...

MySQL ENUM is not case sensitive: it matches the first value regardless of case

I have an ENUM field that contains both the lowercase and the uppercase form of the same letter.
When I try to update a row and change the value, it doesn't work.
This is how to reproduce the problem:
CREATE TABLE `mytable` (
`id` bigint(20) NOT NULL,
`name` varchar(100) NOT NULL,
`strategy` enum('g','G','r','R') NOT NULL DEFAULT 'g'
) ENGINE=InnoDB;
INSERT INTO `mytable` VALUES(1,'test','g');
Now when I try to change strategy from g to G, it doesn't work:
UPDATE `mytable` SET `strategy`='G' WHERE id=1;
It returns:
Query OK, 0 rows affected (0.00 sec)
Rows matched: 1 Changed: 0 Warnings: 0
I use MySQL 5.5, please help me
EDIT:
As @farshad mentioned in his comment,
it uses the first match: if I change the order of the enum and use 'G','g',... it will always use G, and you cannot change it back to g.
My solution is to change the column to the ASCII character set with a binary collation:
ALTER TABLE `your_table` CHANGE `strategy` ENUM('g', 'G', 'r', 'R')
CHARACTER SET ASCII COLLATE ascii_bin NOT NULL DEFAULT 'g';
From the doc:
When retrieved, values stored into an ENUM column are displayed using the lettercase that was used in the column definition. Note that ENUM columns can be assigned a character set and collation. For binary or case-sensitive collations, lettercase is taken into account when assigning values to the column.
So you have to change the column collation.
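The first-match behavior can be modeled in a few lines of Python. This is a toy model of what the question and the quoted documentation describe, not MySQL's implementation; `assign_enum` and its `case_sensitive` flag are hypothetical and stand in for the column's collation.

```python
def assign_enum(members, value, case_sensitive):
    """Return the ENUM member that a value is stored as.

    case_sensitive=False models a case-insensitive collation,
    case_sensitive=True models a binary or case-sensitive one.
    """
    for member in members:
        if case_sensitive:
            matched = member == value
        else:
            matched = member.lower() == value.lower()
        if matched:
            return member  # the first matching member wins
    raise ValueError("value %r is not a member of the ENUM" % value)


members = ["g", "G", "r", "R"]

stored_insensitive = assign_enum(members, "G", case_sensitive=False)
stored_binary = assign_enum(members, "G", case_sensitive=True)
stored_reordered = assign_enum(["G", "g", "r", "R"], "g", case_sensitive=False)
```

With the case-insensitive comparison, 'G' resolves to the earlier member 'g', so the UPDATE changes nothing; with the binary comparison it resolves to 'G'.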

MYSQL not accepting whitespaces

I have created a table in mysql as:
CREATE TABLE `test1` (
`age` int(12) NOT NULL DEFAULT '0',
`name` varchar(20) NOT NULL DEFAULT '',
`gender` varchar(10) DEFAULT NULL,
PRIMARY KEY (`age`,`name`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
I am inserting 2 rows in this table as:
insert into test1 values(1,'user1','m');
insert into test1 values(1,'user1 ','m');
In the second row insertion, I want my 'name' field to have a trailing white space.
But when I run the second query it gives error of primary key violation.
Is there a way I can insert white spaces in the table having primary key also?
Values in VARCHAR columns are variable-length strings. You can declare
a VARCHAR column to be any length between 1 and 255, just as for CHAR
columns. However, in contrast to CHAR, VARCHAR values are stored using
only as many characters as are needed, plus one byte to record the
length. Values are not padded; instead, trailing spaces are removed
when values are stored. (This space removal differs from the SQL-99
specification.)
You probably want lpad, rpad, or space
If you are developing for HTML, you can replace the white space with a different character and, once you query the data, replace that character with white space again. You can even use "&nbsp;", which will insert a non-breaking space into your HTML page.
If you need to insert the values with white spaces you can use name nvarchar(20) instead of varchar(20)
Note :
The exact problem is that, under the SQL standard, when you compare two strings of different lengths, the first thing SQL does is make them the same length by adding trailing spaces.
So, if your query compares string1 'a' and string2 'a ', string1 is first converted to 'a ' and then compared to string2, and now the two strings are the same.
Finally and fortunately, if the field is a UNIQUE INDEX or Primary key it is not possible to have 'a' and 'a ' in two different rows. If it is not a UNIQUE INDEX or primary key field, then you will have to use RTRIM, LTRIM and LEN function with an extra character like LEN('a'+'#')=2 and LEN('a '+'#')=3.
Len('a') and len('a ') give ...1
What you must keep in mind is :
Remove trailing spaces before insertion! That is the better option.
I have seen character or string primary key systems speeded up radically when converted to Integer.

Can someone explain why MySQL returns both values when name = 'test': "test" and "test "

I have the following table and data:
CREATE TABLE `test` (
`id` int(11) NOT NULL auto_increment,
`name` varchar(8) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=3 DEFAULT CHARSET=latin1;
INSERT INTO `test` (`id`, `name`) VALUES (1, 'test');
INSERT INTO `test` (`id`, `name`) VALUES (2, 'test ');
When I do either of the following queries, it returns 2 rows (both rows):
SELECT * FROM test WHERE name = 'test';
SELECT * FROM test WHERE name IN ('test');
Can anyone explain this to me and/or how to fix it?
I'm running MySQL 5.0.27.
From the mysql manual:
Note that all MySQL collations are of type PADSPACE. This means that all CHAR and VARCHAR values in MySQL are compared without regard to any trailing spaces.
Note that MySQL does not remove the trailing spaces in version 5.0.3 or higher; they are stored, but not used during comparisons:
VARCHAR values are not padded when they are stored. Handling of trailing spaces is version-dependent. As of MySQL 5.0.3, trailing spaces are retained when values are stored and retrieved, in conformance with standard SQL. Before MySQL 5.0.3, trailing spaces are removed from values when they are stored into a VARCHAR column; this means that the spaces also are absent from retrieved values.
Both of these quotes come from this page of the manual: 10.4.1. The CHAR and VARCHAR Types
MySQL removes whitespace from the end of varchar columns. I'm not exactly sure why it is implemented this way in MySQL; it is clearly not ANSI standard.
Your options are to go with char or text fields if you want to preserve trailing whitespace.
EDIT: I believe that this was changed as of version 5.0.3
Use the BINARY keyword:
SELECT * FROM test WHERE BINARY name = 'test';