Efficient way to index MySQL table column with utf8 charset - mysql

CREATE TABLE profile_category (
id mediumint UNSIGNED NOT NULL AUTO_INCREMENT,
pc_name char(255) NOT NULL,
PRIMARY KEY (id),
UNIQUE KEY idx_name (name)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
This is one of the tables in database that is entirely in utf8 charset. The problem is here (and I didn't new about it until now) that index for pc_name column will triple times bigger, because MySQL reserves 3 bites for every char. In this case indexes will take much more space.
I cannot make shorter index, because I need this value to be unique. One of the solutions could be set pc_name char(255) CHARSET latin1 NOT NULL, but I dont't know if this is a problem or not.
Is this is a good Idea, or are there any solutions that I don't know ?
Update: the pc_name column is validated in application to be valid utf8. And it allows non western characters. But in this case I can just make a trade of and allow only /[_A-Za-z]/ if the case is worth it.
Update 2: I tried to set pc_name to latin1 charset, but now I get exceptions like: Zend_Db_Statement_Exception: SQLSTATE[HY000]: General error: 1267 Illegal mix of collations (latin1_swedish_ci,IMPLICIT) and (utf8_general_ci,COERCIBLE) for operation '='

If pc_name is going to contain non-Western text then latin1 isn't going to be an option here - otherwise, go for it.
Not being a hardcore MySQL'er, I don't know if mixing InnoDB and MySQL tables is fraught with problems - if not, perhaps you could make this table a standard MySQL table and leave it as utf8?

Related

Does joining ASCII and UTF-8 tables add overhead?

Many tables will do fine using CHARACTER SET ascii COLLATE ascii_bin which will be slightly faster. Here's an example:
CREATE TABLE `session` (
`id` CHAR(64) NOT NULL,
`created_at` INTEGER NOT NULL,
`modified_at` INTEGER NOT NULL,
PRIMARY KEY (`id`),
CONSTRAINT FOREIGN KEY (`user_id`) REFERENCES `user`(`id`)
) CHARACTER SET ascii COLLATE ascii_bin;
But if I were to join it with:
CREATE TABLE `session_value` (
`session_id` CHAR(64) NOT NULL,
`key` VARCHAR(64) NOT NULL,
`value` TEXT,
PRIMARY KEY (`session_id`, `key`),
CONSTRAINT FOREIGN KEY (`session_id`) REFERENCES `session`(`id`) ON DELETE CASCADE
) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin;
what's gonna happen? Logic tells me it should be seamless, because ASCII is a subset of UTF-8. Human nature tells me I can expect anything from a core dump to a message Follow the white rabbit. appearing on my screen. ¯\_(ツ)_/¯
Does joining ASCII and UTF-8 tables add overhead?
Yes.
If you do
SELECT whatever
FROM session s
JOIN session_value v
ON s.id = v.session_id
the query engine must compare many values of id and session_id to satisfy your query.
If id and session_id have exactly the same datatype, the query planner will be able to exploit indexes and fast comparisons.
But if they have different character sets, the query planner must interpret your query as follows.
... JOIN session_value v
ON CONVERT(s.id USING utf8mb4) = v.session_id
When a WHERE or ON condition has the form f(column) it makes the query non-sargable: it prevents efficient index use. That can hammer query performance.
In your case, similar performance problems will occur when you insert rows to session_value: the server must do the conversion to check your foreign key constraint.
If these tables are going to production, you'd be very wise to use the same character set for these columns. It's much easier to fix this when you have thousands of rows than when you have millions. Seriously.
What makes a SQL statement sargable?
Why not UTF-8 all the way through? Having ASCII tables is usually a mistake, a sign you forgot to set the encoding on something. Using a singular encoding vastly simplifies your internal architecture.
Encoding is only relevant if and when you have CHAR, VARCHAR or TEXT columns.
If you have a column of that type then it's worth setting it as UTF8MB4 by default.

MySQL mixing Charset & Collations

I read different articles and topics on this forum to help me setting up the charset & collation for my database. Not sure about the choices I made. I would appreciate any comments or advice.
I'm using MySQL 5.5.
The database (used with PHP) will have some datas from different languages (chinese, french, dutch, Us, spanish, arabic etc..)
I will mainly insert datas and get information from table ID'S. I won't need to full search and compare text.
So here is what I've done to create my database, I decided to use CHARSET utf8mb4 and COLLATION utf8mb4_unicode_ci
ALTER DATABASE testDB CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
When I create the table:
CREATE TABLE IF NOT EXISTS sector (
idSector INT(5) NOT NULL AUTO_INCREMENT,
sectoreName VARCHAR(45) NOT NULL DEFAULT '',
PRIMARY KEY (idSector)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 AUTO_INCREMENT=0;
For some tables, I thought it was better to use utf8_bin
Ex: timezone (contain 168 047 rows)
CREATE TABLE timezone (
zone_id int(10) NOT NULL,
abbreviation varchar(6) COLLATE utf8_bin NOT NULL,
time_start decimal(11,0) NOT NULL,
gmt_offset int(11) NOT NULL,
dst char(1) COLLATE utf8_bin NOT NULL,
KEY idx_zone_id (zone_id),
KEY idx_time_start (time_start)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=0;
So basically I would like to know if I'm on the right or if I'm doing something that could lead to problems.
Different columns can have different character sets and/or collations, but...
If you compare columns of different charset or collation (WHERE a.x = b.y), indexes cannot be used.
utf8 does not handle all of Chinese, nor does it handle some Emoji. For those, you need utf8mb4.
On other issues...
In INT(5), the (5) means nothing. Check out SMALLINT UNSIGNED with a range of 0..65535.
time_start decimal(11,0) is strange for a time. If it is a unix timestamp, either TIMESTAMP or INT UNSIGNED should work ok. See also TIME.
dst char(1) COLLATE utf8_bin -- this takes 3 bytes, because of utf8. Perhaps you want CHARACTER SET ascii so it will be only 1 byte?
InnoDB tables really should be given an explicit PRIMARY KEY. (Probably zone_id?)
You are making good a good choice for your sectoreName column. Notice one thing: utf8mb4_unicode_ci is a good collation for most language. But, for Spanish, it gets the alphabet wrong: in that language N and Ñ are considered different letters. Ñ appears immediately after N in the collating sequence. But in other European language they are considered the same letter. So, your Spanish-language users will, when they ask for Niña, get back Niña and Nina. That may appear to them as a mistake. (But, they're probably used to getting this sort of thing from pan-European software applications.)
You should use utf8mb4 as your character set throughout any new application. So, use that instead of utf8 in your timezone table. Using the _bin collation for your abbreviation column is fine.

Understanding character sets in column and data and how they affect select queries

After reading articles about how to properly store your data inside a mysql database, I sort of have a good understanding of how character sets work. Yet, there are still some instances that I find hard to understand. Let me give a quick example: Say I have a table with a varchar column using the latin1 charset (for example purposes)
CREATE TABLE `foobar` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`foo` varchar(100) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=latin1;
and I force insert a latin1 character using its hex value (á for instance, which would be E1)
SET NAMES 'latin1';
INSERT INTO foobar SET foo = UNHEX('E1')
why does a SELECT query against this table under a latin1 connection fail to give me the value I am expecting? I get the replacement character instead. I would have expected it to just give back the value as it is since the column was in latin1, the connection itself was in latin1 and the value stored is a valid latin1 character. Is mysql trying to interpret E1 as a UTF8 value? Or am I missing something here?
Using a UTF8 connection (SET NAMES 'utf8') before selecting gives me the expected value of á, which I guess is correct since from what I understand, if the connection charset is different from the column charset being selected, then mysql would read data as whatever the column charset was (so E1 gets parsed as a latin1 character, or á)

mysql query about create table ddl format

I am a mysql newbie. I have a question about the right thing to do for create table ddl. Up until now I have just been writing create table ddl like this...
CREATE TABLE file (
file_id mediumint(10) unsigned NOT NULL AUTO_INCREMENT,
filename varchar(100) NOT NULL,
file_notes varchar(100) DEFAULT NULL,
file_size mediumint(10) DEFAULT NULL,
file_type varchar(40) DEFAULT NULL,
file longblob DEFAULT NULL,
CONSTRAINT pk_file PRIMARY KEY (file_id)
);
But I often see people doing their create table ddl like this...
CREATE TABLE IF NOT EXISTS `etags` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`item_code` varchar(100) NOT NULL,
`item_description` varchar(500) NOT NULL,
`btn_type` enum('primary','important','success','default','warning') NOT NULL DEFAULT 'default',
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 AUTO_INCREMENT=3 ;
A few questions...
What difference do the quotes around the table name and column names make?
Is it good practice to explicitly declare the engine and character set? What engine and character sets are used by default?
thanks
There's no difference. Identifiers (table names, column names, et al.) must be enclosed in the backticks if they contain special characters or are reserved words. Otherwise, the backticks are optional.
Yes, it's good practice, for portability to other systems. If you re-create the table, having the storage engine and character set specified explicitly in the CREATE TABLE statement means that your statement won't be dependent on the settings of the default_character_set and default-storage-engine variables (these may get changed, or be set differently on another database.)
You can get your table DDL definition in that same format using the SHOW CREATE TABLE statement, e.g.
SHOW CREATE TABLE `file`
The CREATE TABLE DDL syntax you are seeing posted by other users is typically in the format produced as output of this statement. Note that MySQL doesn't bother with checking whether an identifier contains special characters or reserved words (to see if backticks are required or not), it just goes ahead and wraps all of the identifiers in backticks.
With backticks, reserved words and some special characters can be used in names.
It's simply a safety measure and many tools automatically add these.
The default engine and charset can be set in the servers configuration.
They are often (but not always) set to MyISAM and latin1.
Personally, I would consider it good practice to define engine and charset, just so you can be certain what you end up with.

Create database, what's the right charset for my purpose

I have just a little information about MySql. I just need to create a database to store some score of a videogame, taken from all over the world. (The game will be in every available store, also Chinese etc.)
I'm worried about the charset. Db schema's will be similar to (pseudocode):
leaderboard("PhoneId" int primary key, name varchar(50), score smallint);
What will happen if a chinese guy will put his score with a name with that characters? Should I specify something into db creation script?
create database if not exists "test_db";
create table if not exists "leaderboard" (
"phoneid" integer unsigned NOT NULL,
"name" varchar(20) NOT NULL, -- Gestione errori per questo
"score" smallint unsigned NOT NULL default 0,
"timestamp" timestamp NOT NULL default CURRENT_TIMESTAMP,
PRIMARY KEY ("phoneid")
);
UTF8 is your obvious choice.
For details on UTF8 and MySQL integration, you can go through the Tutorial pages:
http://dev.mysql.com/doc/refman/5.0/en/charset-unicode-utf8.html
http://dev.mysql.com/doc/refman/5.0/en/charset-unicode.html
There are certain things that needs to be kept in mind while using the UTF8 charset in any database. For example, To save space with UTF-8, use VARCHAR instead of CHAR. Otherwise, MySQL must reserve three bytes for each character in a CHAR CHARACTER SET utf8 column because that is the maximum possible length.
Similarly you should analyze other performance constraints and design your database.