MySQL COLLATE Performance

I have a table called users with a column firstname and with collation utf8_bin.
I want to know what happens under the hood when I execute a query like
SELECT * FROM `users` `a` WHERE `a`.`firstname` = 'sander' COLLATE utf8_general_ci
The column firstname isn't indexed; what happens to performance when this command is executed?
And what if the default collation were utf8_general_ci and the query were executed without COLLATE?
I want to know the impact this has on a big table (8 million+ records).

In this case, since the forced collation is defined over the same character set as the column's encoding, there won't be any performance impact.
However, if you force a collation that is defined over a different character set, MySQL may have to transcode the column's values, which would have a performance impact. I believe MySQL does this automatically when the forced collation is over a Unicode character set; any other situation raises an "illegal mix of collations" error.
Note that the collation recorded in a column's definition is merely a hint to MySQL about which collation is preferred; it may or may not be used in a given expression, depending on the rules detailed under Collation of Expressions.
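To see this empirically on the table from the question, you can compare the execution plans with and without the forced collation; since firstname has no index, both should show a full table scan:

```sql
-- Sketch: with no index on `firstname`, both variants do a full scan,
-- so the forced collation adds no measurable cost on its own.
EXPLAIN SELECT * FROM `users` `a`
WHERE `a`.`firstname` = 'sander' COLLATE utf8_general_ci;

EXPLAIN SELECT * FROM `users` `a`
WHERE `a`.`firstname` = 'sander';
```

If firstname were indexed, though, forcing a collation different from the column's own would generally prevent MySQL from using that index for the comparison, which is where the real performance difference would show up.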

Related

Database conversion from latin1 to utf8mb4, what about indexes?

I noticed that my MODX database still uses the latin1 character set, both at the database level and in its tables. I would like to convert them to utf8mb4 and update the collations accordingly.
I'm not totally sure how I should do this. Is this correct?
1. I alter every table to use utf8mb4 and utf8_unicode_ci.
2. I update the default character set and collation of the database.
Are indexes updated automatically? Is there anything else I should be aware of?
A bonus question: what would be the most suitable recent utf8 Unicode collation? Western languages should work.
Changing the default character set of a table or a schema does not change the data in existing columns; it only changes the default that is applied the next time you add a table or add a column to a table.
To convert existing data, alter one table at a time:
ALTER TABLE <name> CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;
The collation utf8mb4_0900_ai_ci is faster than earlier collations (at least according to the documentation), and it is the most current and accurate; it requires MySQL 8.0.
The most current equivalent in MySQL 5.7 is utf8mb4_unicode_520_ci.
A table-conversion like this rebuilds the indexes, so there's nothing else you need to do.
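To confirm the conversion took effect, you can inspect each table's recorded collation (the schema name 'mydb' is a placeholder):

```sql
-- Lists every table in the schema with its current default collation.
SELECT TABLE_NAME, TABLE_COLLATION
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'mydb';
```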

Why does changing the table collation in MySQL break with a unique index violation?

Problem:
Let's say that table customers (id, name, email, .. ) is encoded using utf-8 (utf8_general_ci collation).
This table also has a unique key constraint on column email.
When trying to change the collation to utf8mb4 using
ALTER TABLE customers CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;,
some entries cause the unique index constraint to flare up, because they are considered duplicates.
Example:
row1: (1, "strauss", "strauss#email.com")
row2: (10, "Strauss", "strauß#email.com")
The same happens if two emails differ only by a zero-width-space character.
Tested with MySQL version 5.7.20.
In German, ß is considered essentially equal to ss.
The xxx_unicode_ci collations respect that expansion (ß = ss), while xxx_general_ci instead treats ß as equal to a single s. So by switching collations you changed which strings compare equal: under utf8_general_ci, "straus#email.com" and "strauß#email.com" are duplicates, whereas under utf8mb4_unicode_ci it is "strauss#email.com" and "strauß#email.com" that collide, hence the new unique-key violations.
See the documentation:
_general_ci Versus _unicode_ci Collations
For any Unicode character set, operations performed using the xxx_general_ci collation are faster than those for the xxx_unicode_ci collation. For example, comparisons for the utf8_general_ci collation are faster, but slightly less correct, than comparisons for utf8_unicode_ci. The reason is that utf8_unicode_ci supports mappings such as expansions; that is, when one character compares as equal to combinations of other characters. For example, ß is equal to ss in German and some other languages. utf8_unicode_ci also supports contractions and ignorable characters. utf8_general_ci is a legacy collation that does not support expansions, contractions, or ignorable characters. It can make only one-to-one comparisons between characters.
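As a rough, non-MySQL illustration of expansions: Python's str.casefold() applies Unicode full case folding, which, like the xxx_unicode_ci collations, expands ß to ss, while a plain lowercase comparison (closer in spirit to a one-to-one legacy collation) does not:

```python
# Rough analogue of the collation difference. This only approximates
# MySQL's behavior; it is not the same algorithm.
print("strauß".casefold() == "strauss".casefold())  # True: 'ß' folds to 'ss'
print("strauß".lower() == "strauss".lower())        # False: .lower() keeps 'ß'
```

This is exactly the kind of many-to-one equality that makes previously distinct rows become duplicates after a collation change.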

Illegal mix of collations (utf8mb4_unicode_ci,IMPLICIT) and (utf8mb4_general_ci,IMPLICIT) for operation '='

I got this error:
Illegal mix of collations (utf8mb4_unicode_ci,IMPLICIT) and (utf8mb4_general_ci,IMPLICIT) for operation '='
I changed the collation to utf8mb4_unicode_ci, then truncated the tables and re-imported the rows, but I am still getting the same error.
I am guessing you have different collations on the tables you are joining; the error says you are using an illegal mix of collations in an = operation.
So you need to set the collation explicitly.
For example:
WHERE tableA.field COLLATE utf8mb4_general_ci = tableB.field
This forces the same collation on both sides of the = operation.
Since you have not provided more info about the tables, this is the best pseudo-code I can provide.
For a join query, I used this to resolve the error:
select * from contacts.employees INNER JOIN contacts.sme_info
ON employees.login COLLATE utf8mb4_unicode_ci = sme_info.login
Earlier, with the following query, I was getting the same error:
select * from contacts.employees LEFT OUTER JOIN contacts.sme_info
ON employees.login = sme_info.login
Error: Illegal mix of collations (utf8mb4_unicode_ci,IMPLICIT) and (utf8mb4_general_ci,IMPLICIT) for operation '='
I don't know much about collations, but it seems the two tables follow different character-set rules, so the equality operator could not be applied. In the first query I therefore specified an explicit collation for the comparison.
After many hours I finally found a solution that worked for me (using phpMyAdmin).
Remember to back up your database before performing these operations.
1. Log into phpMyAdmin.
2. Select your database from the list on the left.
3. Click on "Operations" in the top set of tabs.
4. In the Collation box (near the bottom of the page), choose your new collation from the dropdown menu.
I also checked:
* Change all tables collations
* Change all tables columns collations
I don't think it's 100% necessary, but it's also a good idea to restart your MySQL/MariaDB service and to disconnect and reconnect to the database.
Additional note:
I had to use utf8mb4_general_ci because the issue persisted with utf8mb4_unicode_ci (which I originally wanted to use).
For additional information, command-line queries, and illustrated examples I recommend this article: https://mediatemple.net/community/products/dv/204403914/default-mysql-character-set-and-collation
Check that the connection uses charset=utf8mb4:
'dsn' => 'mysql:dbname=DatabaseName;host=localhost;charset=utf8mb4';
-- This worked for me
SET collation_connection = 'utf8mb4_general_ci';
ALTER DATABASE your_db CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;
ALTER TABLE your_table CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;
Had the same problem and fixed it by updating the fields' collation.
Even if you change the table's collation, the individual columns can keep the old collation. Alter the table and update those varchar columns as well.
First, check the collations of the table and the database, and make them match if they differ.
If you are getting the error in a stored procedure, first check the collations of your database and of the column you are comparing; if they differ, change the column's collation to match the database's, and then re-create the stored procedure (drop it and create it again).
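To find where the collations actually differ, a query against information_schema helps (the schema name is a placeholder):

```sql
-- Lists every text column's charset and collation so mismatches
-- between joined columns become obvious.
SELECT TABLE_NAME, COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'your_db'
  AND COLLATION_NAME IS NOT NULL
ORDER BY TABLE_NAME, COLUMN_NAME;
```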
The issue is that both tables have the same column but with different collations.
To overcome this, ALTER both tables and the database:
ALTER DATABASE databasename CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE table1 CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE table2 CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

Is it possible to enable emojis for specific columns of specific tables?

First, I would like to assure you that I have done my "homework" and have read this, this, this and this. Also, one of my former questions is closely related to this one, but there I was dealing with flourishlib's compatibility issues with utf8mb4; this question goes a level deeper.
Suppose I have several tables and I want to modify just a few columns to use the utf8mb4 encoding, to preserve storage space and performance after the change. If I converted the whole database to utf8mb4, its size would increase by 33%, which would hurt its performance as well. So we have selected four columns from three different tables to support emojis. These are:
users.bio (tinytext, utf8_general_ci)
questions.question (longtext, utf8_general_ci)
questions.answer (longtext, utf8_general_ci)
comments.comment (tinytext, utf8_general_ci)
As a result, my action plan is as follows:
Create a backup of the database
Run these commands:
alter table comments change comment comment tinytext character set utf8mb4 collate utf8mb4_unicode_ci;
alter table users change bio bio tinytext character set utf8mb4 collate utf8mb4_unicode_ci;
alter table questions change question question longtext character set utf8mb4 collate utf8mb4_unicode_ci;
alter table questions change answer answer longtext character set utf8mb4 collate utf8mb4_unicode_ci;
Expectations:
this should make the specified columns use utf8mb4 instead of utf8
existent data will be correctly converted to utf8mb4, that is, previous texts will be preserved and the users will be able to correctly read their content
other columns will not be changed
queries involving the affected tables will be slower
Are my expectations accurate? Do I need to change the connection? Thanks
You need utf8mb4 in any column that stores emoji (or other characters outside the Basic Multilingual Plane, such as some Chinese characters).
In VARCHAR(...) utf8mb4, each "character" takes 1-4 bytes. No 33% increase.
On the other hand, CHAR(10) utf8mb4 is always allocated 40 bytes.
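The utf8-versus-utf8mb4 distinction is purely about the maximum bytes per character; a quick Python check shows why emoji don't fit in MySQL's legacy 3-byte utf8:

```python
# MySQL's legacy 'utf8' stores at most 3 bytes per character.
# Emoji live outside the BMP and encode to 4 bytes in UTF-8.
print(len("😀".encode("utf-8")))  # 4 bytes: needs utf8mb4
print(len("ä".encode("utf-8")))   # 2 bytes: fits in legacy utf8
print(len("€".encode("utf-8")))   # 3 bytes: still fits in legacy utf8
```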
You do need to establish that your client is talking utf8mb4, not just utf8. That comes in some parameter in the connection or with SET NAMES utf8mb4.
If you need to automate the ALTERs, it is fairly easy to generate them with a SELECT against information_schema.
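A sketch of that approach (the schema name and target collation are placeholders); the SELECT emits one ALTER per legacy-utf8 text column, for you to review and run:

```sql
-- Generates ALTER statements for every column still using legacy utf8.
-- Caveat: this does not preserve NULL/NOT NULL or DEFAULT clauses,
-- so inspect the output before executing it.
SELECT CONCAT('ALTER TABLE `', TABLE_NAME, '` MODIFY `', COLUMN_NAME, '` ',
              COLUMN_TYPE,
              ' CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;')
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'mydb'
  AND CHARACTER_SET_NAME = 'utf8';
```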
Addenda
Expectations 1-3: Yes.
Expectation 4 (queries involving the affected tables will be slower) -- processing will be essentially the same speed.

Purpose of the COLLATE clause in CREATE TABLE SQL

I understand the function of COLLATE (a little). Admittedly, I have not tested whether it is possible to have tables with different collations (or even different charsets) inside one database.
But I found that (at least in phpMyAdmin) when I create a database I set its charset and collation, and if I omit COLLATE in CREATE TABLE ..., the collation chosen at database creation is applied automatically.
So my question is: what is the point of the COLLATE clause in CREATE TABLE ... if it can be omitted? Is it recommended to include it, or is it irrelevant?
In SQL Server, if you don't specify COLLATE it defaults to whatever the database is set to, so there is no danger in omitting it.
In MySQL the behavior is the same:
"The table character set and collation are used as default values for column definitions if the column character set and collation are not specified in individual column definitions." (MySQL Reference)
COLLATE is only needed when you want a non-default value. If all you store is English text, you have nothing to worry about. If you store data in multiple languages, you should specify an appropriate collation explicitly to ensure values are compared and sorted correctly.
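As a sketch of where an explicit COLLATE is useful (the table and column names here are made up): a table can keep a sensible default while one column overrides it, e.g. for case-sensitive lookups:

```sql
-- Illustrative only: table-level default plus a per-column override.
CREATE TABLE articles (
    id    INT PRIMARY KEY,
    title VARCHAR(200),            -- inherits utf8mb4_unicode_ci from the table
    slug  VARCHAR(200)
          CHARACTER SET utf8mb4
          COLLATE utf8mb4_bin      -- case-sensitive, byte-wise comparisons
) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
```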