Create database, what's the right charset for my purpose - mysql

I know only a little about MySQL. I need to create a database to store scores for a video game, submitted from all over the world. (The game will be in every available store, including Chinese ones.)
I'm worried about the charset. The schema will be similar to this (pseudocode):
leaderboard("PhoneId" int primary key, name varchar(50), score smallint);
What will happen if a Chinese player submits a score under a name written in Chinese characters? Should I specify something in the database creation script?
create database if not exists `test_db`;
create table if not exists `leaderboard` (
`phoneid` integer unsigned NOT NULL,
`name` varchar(20) NOT NULL, -- error handling for this
`score` smallint unsigned NOT NULL default 0,
`timestamp` timestamp NOT NULL default CURRENT_TIMESTAMP,
PRIMARY KEY (`phoneid`)
);

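UTF-8 is the obvious choice. In MySQL 5.5.3 and later, use the utf8mb4 character set to get it: the older utf8 character set stores at most three bytes per character, so it cannot hold supplementary characters such as emoji and some rarer Chinese characters.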
For details on Unicode support in MySQL, see these reference manual pages:
http://dev.mysql.com/doc/refman/5.0/en/charset-unicode-utf8.html
http://dev.mysql.com/doc/refman/5.0/en/charset-unicode.html
There are a few things to keep in mind when using a UTF-8 character set in any database. For example, to save space, use VARCHAR instead of CHAR. Otherwise, MySQL must reserve three bytes for each character in a CHAR CHARACTER SET utf8 column, because that is the maximum possible length.
Similarly, analyze the other storage and performance constraints when you design your database.
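As a rough sketch, the creation script from the question could declare its character set as follows. The names are taken from the question; the choice of utf8mb4 with the utf8mb4_unicode_ci collation, the InnoDB engine, and the sample row are assumptions, and utf8mb4 needs MySQL 5.5.3 or later.

```sql
-- Sketch only: utf8mb4 / utf8mb4_unicode_ci are assumed choices.
CREATE DATABASE IF NOT EXISTS `test_db`
  CHARACTER SET utf8mb4
  COLLATE utf8mb4_unicode_ci;

USE `test_db`;

CREATE TABLE IF NOT EXISTS `leaderboard` (
  `phoneid`   INT UNSIGNED      NOT NULL,
  `name`      VARCHAR(20)       NOT NULL,  -- length is counted in characters
  `score`     SMALLINT UNSIGNED NOT NULL DEFAULT 0,
  `timestamp` TIMESTAMP         NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (`phoneid`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

-- The client connection has to use a matching character set as well,
-- otherwise the text is converted (and possibly damaged) on the way in.
SET NAMES utf8mb4;

-- Hypothetical sample row with a Chinese player name.
INSERT INTO `leaderboard` (`phoneid`, `name`, `score`)
VALUES (1, '玩家一', 1200);
```

Setting the character set on the database, on the table, and on the connection keeps all three in agreement, which is where many encoding problems originate.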

Related

MySQL mixing Charset & Collations

I have read various articles and topics on this forum to help me set up the charset and collation for my database, but I'm not sure about the choices I made. I would appreciate any comments or advice.
I'm using MySQL 5.5.
The database (used with PHP) will hold data in different languages (Chinese, French, Dutch, US English, Spanish, Arabic, etc.).
I will mainly insert data and look up information by table IDs. I won't need full-text search or text comparison.
Here is what I've done to create my database. I decided to use CHARSET utf8mb4 and COLLATION utf8mb4_unicode_ci:
ALTER DATABASE testDB CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
When I create the table:
CREATE TABLE IF NOT EXISTS sector (
idSector INT(5) NOT NULL AUTO_INCREMENT,
sectoreName VARCHAR(45) NOT NULL DEFAULT '',
PRIMARY KEY (idSector)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 AUTO_INCREMENT=0;
For some tables, I thought it was better to use utf8_bin.
Example: timezone (contains 168,047 rows)
CREATE TABLE timezone (
zone_id int(10) NOT NULL,
abbreviation varchar(6) COLLATE utf8_bin NOT NULL,
time_start decimal(11,0) NOT NULL,
gmt_offset int(11) NOT NULL,
dst char(1) COLLATE utf8_bin NOT NULL,
KEY idx_zone_id (zone_id),
KEY idx_time_start (time_start)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=0;
So basically I would like to know if I'm on the right track or if I'm doing something that could lead to problems.
Different columns can have different character sets and/or collations, but...
If you compare columns of different charset or collation (WHERE a.x = b.y), indexes cannot be used.
utf8 does not handle all of Chinese, nor does it handle some Emoji. For those, you need utf8mb4.
On other issues...
In INT(5), the (5) is only a display width; it changes neither the range nor the storage size. Check out SMALLINT UNSIGNED, which has a range of 0..65535.
time_start decimal(11,0) is strange for a time. If it is a unix timestamp, either TIMESTAMP or INT UNSIGNED should work ok. See also TIME.
dst char(1) COLLATE utf8_bin -- this takes 3 bytes, because of utf8. Perhaps you want CHARACTER SET ascii so it will be only 1 byte?
InnoDB tables really should be given an explicit PRIMARY KEY. (Probably zone_id?)
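Applied to the timezone table, the suggestions above might look roughly like the sketch below. The composite primary key and the BIGINT type for time_start are assumptions (the question does not say whether zone_id is unique on its own, or what exactly time_start holds), so adjust them to the real data.

```sql
CREATE TABLE timezone (
  zone_id      INT UNSIGNED NOT NULL,
  -- utf8mb4 with a binary collation, instead of the 3-byte utf8
  abbreviation VARCHAR(6) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL,
  -- assumption: a Unix timestamp in seconds
  time_start   BIGINT NOT NULL,
  gmt_offset   INT NOT NULL,
  -- a single ASCII flag needs only 1 byte per row
  dst          CHAR(1) CHARACTER SET ascii NOT NULL,
  -- assumption: the pair (zone_id, time_start) identifies a row
  PRIMARY KEY (zone_id, time_start),
  KEY idx_time_start (time_start)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
```

The separate index on zone_id is dropped here because the primary key already starts with that column.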
You are making a good choice for your sectoreName column. Notice one thing: utf8mb4_unicode_ci is a good collation for most languages. But for Spanish, it gets the alphabet wrong: in that language N and Ñ are considered different letters, and Ñ appears immediately after N in the collating sequence. In other European languages they are considered the same letter. So when your Spanish-language users ask for Niña, they will get back both Niña and Nina. That may look like a mistake to them. (But they're probably used to getting this sort of thing from pan-European software applications.)
You should use utf8mb4 as your character set throughout any new application. So, use that instead of utf8 in your timezone table. Using the _bin collation for your abbreviation column is fine.
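If the Spanish ordering matters, MySQL also ships language-specific collations. A minimal sketch, assuming the sector table from the question and a connection that sends UTF-8; utf8mb4_spanish_ci is one such collation, and whether it suits a mixed-language table is a judgment call.

```sql
SET NAMES utf8mb4;

-- Under utf8mb4_general_ci or utf8mb4_unicode_ci, N and Ñ compare as equal,
-- so this can return both 'Niña' and 'Nina'.
SELECT sectoreName
FROM sector
WHERE sectoreName = 'Niña';

-- Overriding the collation for one comparison treats Ñ as its own letter.
-- A COLLATE override on the column may prevent the use of an index.
SELECT sectoreName
FROM sector
WHERE sectoreName COLLATE utf8mb4_spanish_ci = 'Niña';
```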

Is it ok to use VARCHAR (5000) field in MYSQL for given scenario?

I am developing a classifieds website using ASP.NET, and the DB is MySQL.
I have a header table that stores the common details of ads.
Here is my header table's schema:
CREATE TABLE `test`.`header` (
`header_id` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
`title` VARCHAR (500) NOT NULL,
`description` VARCHAR (5000) NOT NULL,
`is_published` TINYINT (1) NOT NULL DEFAULT TRUE,
-- etc.
PRIMARY KEY (`header_id`)
) ENGINE = INNODB CHARSET = latin1 COLLATE = latin1_swedish_ci ;
So I am using VARCHAR(500) for the title and VARCHAR(5000) for the description. Is it OK to use VARCHAR(5000)? I am asking because some people say that long VARCHAR fields are converted to TEXT fields inside MySQL (I don't know about this). How long is too long? Some people also say there is a limit on row size. Will a VARCHAR(5000) field lead to any performance issues?
Yes, I could use a TEXT field, but remember that I want a limit on the description; otherwise users will copy and paste a novel into the description field. :)
What is your suggestion? Another data type, or anything else?
Thank you very much.
Assuming that 5000 characters is your limitation, then VARCHAR(5000) is perfectly reasonable.
Take a look at this question if you are curious about the differences between VARCHAR and TEXT: MySQL: Large VARCHAR vs. TEXT?.
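To illustrate the length limit, here is a minimal sketch using a hypothetical table ad_demo (not the header table from the question). It assumes strict SQL mode, under which MySQL rejects over-long values instead of truncating them with a warning.

```sql
-- Hypothetical demo table; only the description column matters here.
CREATE TABLE ad_demo (
  id          INT UNSIGNED NOT NULL AUTO_INCREMENT,
  description VARCHAR(5000) NOT NULL,
  PRIMARY KEY (id)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

-- Assumption: strict mode, so invalid data raises an error.
SET SESSION sql_mode = 'STRICT_ALL_TABLES';

-- Exactly 5000 characters: accepted.
INSERT INTO ad_demo (description) VALUES (REPEAT('x', 5000));

-- 5001 characters: rejected in strict mode with error 1406 (data too long).
INSERT INTO ad_demo (description) VALUES (REPEAT('x', 5001));
```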

mysql query about create table ddl format

I am a MySQL newbie. I have a question about the right way to write CREATE TABLE DDL. Up until now I have been writing it like this...
CREATE TABLE file (
file_id mediumint(10) unsigned NOT NULL AUTO_INCREMENT,
filename varchar(100) NOT NULL,
file_notes varchar(100) DEFAULT NULL,
file_size mediumint(10) DEFAULT NULL,
file_type varchar(40) DEFAULT NULL,
file longblob DEFAULT NULL,
CONSTRAINT pk_file PRIMARY KEY (file_id)
);
But I often see people writing their CREATE TABLE DDL like this...
CREATE TABLE IF NOT EXISTS `etags` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`item_code` varchar(100) NOT NULL,
`item_description` varchar(500) NOT NULL,
`btn_type` enum('primary','important','success','default','warning') NOT NULL DEFAULT 'default',
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 AUTO_INCREMENT=3 ;
A few questions...
What difference do the quotes around the table name and column names make?
Is it good practice to explicitly declare the engine and character set? What engine and character sets are used by default?
thanks
There's no difference. Identifiers (table names, column names, et al.) must be enclosed in backticks if they contain special characters or are reserved words. Otherwise, the backticks are optional.
Yes, it's good practice, for portability to other systems. If you re-create the table, having the storage engine and character set specified explicitly in the CREATE TABLE statement means that your statement won't depend on the default character set and default storage engine configured for the database or server (these may get changed, or be set differently on another database).
You can get your table DDL definition in that same format using the SHOW CREATE TABLE statement, e.g.
SHOW CREATE TABLE `file`
The CREATE TABLE DDL syntax you are seeing posted by other users is typically in the format produced as output of this statement. Note that MySQL doesn't bother with checking whether an identifier contains special characters or reserved words (to see if backticks are required or not), it just goes ahead and wraps all of the identifiers in backticks.
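A short sketch of these points, using hypothetical table and column names (`order`, `select`, plain_name) chosen only to show where backticks are needed. The SHOW VARIABLES patterns are one way to inspect the current defaults.

```sql
-- Reserved words work as identifiers only when quoted with backticks.
CREATE TABLE `order` (
  `select`   INT NOT NULL,
  plain_name VARCHAR(20)   -- ordinary identifier, backticks optional
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

-- Inspect the server defaults that apply when ENGINE / CHARSET are omitted.
SHOW VARIABLES LIKE 'character_set_server';
SHOW VARIABLES LIKE '%storage_engine';

-- Print the full definition, with every identifier wrapped in backticks.
SHOW CREATE TABLE `order`;
```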
With backticks, reserved words and some special characters can be used in names.
It's simply a safety measure and many tools automatically add these.
The default engine and charset can be set in the server's configuration.
On older servers they are often (but not always) MyISAM and latin1; since MySQL 5.5 the default engine is InnoDB.
Personally, I would consider it good practice to define engine and charset, just so you can be certain what you end up with.

mysql should I use char for unicode?

I'm a newbie in MySQL.
I want to declare my table something like this:
CREATE TABLE `test` (
`NO` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`ID` char(50) CHARACTER SET utf8 NOT NULL,
`FIRST_NAME` char(50) CHARACTER SET utf8 NOT NULL,
`LAST_NAME` char(50) CHARACTER SET utf8 NOT NULL,
PRIMARY KEY (`NO`)
);
I read in the MySQL documentation about CHAR and VARCHAR that:
The length of a CHAR column is fixed to the length that you declare when you create the table.
My question is: if I use a Unicode CHAR(50) for Chinese, Japanese, Korean or other Unicode characters, will these columns use too much storage in the database and affect performance?
For English characters, will it accept only up to 50 characters if I declare my table like this?
Is there a better way, or is it good to use CHAR(100) for Unicode?
Correct me if I'm wrong.
According to the MySQL manual, the number of bytes required by a CHAR(M) column is M x w, where w is "the number of bytes required for the maximum-length character in the character set". This clearly suggests that a CHAR(50) column will store 50 characters, not 50 bytes. If your characters happen to be Chinese, etc., it will still store up to 50 of them. Note that a VARCHAR(50) will also store up to 50 characters, but will (if one ignores the 1-2 byte overhead that comes with a VARCHAR column) take up less storage than a CHAR(50) column when there are fewer than 50 characters stored.
EDIT: This applies to MySQL 4.1 and later. Earlier versions interpreted the length of character and text fields as bytes, not characters.
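The difference between characters and bytes can be checked directly. A minimal sketch, assuming a connection that sends UTF-8 and a hypothetical table char_demo; utf8mb4 is used here, though the same holds for utf8 with these particular characters.

```sql
SET NAMES utf8mb4;

-- CHAR_LENGTH counts characters, LENGTH counts bytes:
-- '日本語' is 3 characters but 9 bytes in UTF-8.
SELECT CHAR_LENGTH('日本語') AS chars, LENGTH('日本語') AS bytes;

-- Hypothetical demo table with one fixed and one variable-length column.
CREATE TABLE char_demo (
  fixed_col    CHAR(5)    NOT NULL,
  variable_col VARCHAR(5) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

-- Five CJK characters fit in both columns, even though they occupy 15 bytes.
INSERT INTO char_demo VALUES ('日本語日本', '日本語日本');

SELECT CHAR_LENGTH(variable_col) AS chars, LENGTH(variable_col) AS bytes
FROM char_demo;
```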

Efficient way to index MySQL table column with utf8 charset

CREATE TABLE profile_category (
id mediumint UNSIGNED NOT NULL AUTO_INCREMENT,
pc_name char(255) NOT NULL,
PRIMARY KEY (id),
UNIQUE KEY idx_name (pc_name)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
This is one of the tables in a database that is entirely in the utf8 charset. The problem (which I didn't know about until now) is that the index on the pc_name column will be three times bigger, because MySQL reserves 3 bytes for every character. So the indexes will take much more space.
I cannot use a shorter prefix index, because I need this value to be unique. One solution could be to declare pc_name char(255) CHARSET latin1 NOT NULL, but I don't know whether that would cause problems.
Is this a good idea, or are there other solutions that I don't know about?
Update: the pc_name column is validated in the application to be valid UTF-8, and it allows non-Western characters. But I could make a trade-off and allow only /[_A-Za-z]/ if it is worth it.
Update 2: I tried to set pc_name to latin1 charset, but now I get exceptions like: Zend_Db_Statement_Exception: SQLSTATE[HY000]: General error: 1267 Illegal mix of collations (latin1_swedish_ci,IMPLICIT) and (utf8_general_ci,COERCIBLE) for operation '='
If pc_name is going to contain non-Western text then latin1 isn't going to be an option here - otherwise, go for it.
Not being a hardcore MySQL'er, I don't know if mixing InnoDB and MyISAM tables is fraught with problems. If not, perhaps you could make this table a standard MyISAM table and leave it as utf8?
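For the latin1 option, a rough sketch might look like the following. It assumes names are restricted to /[_A-Za-z]/ as described in the question's update. The literal 'some_name' is a hypothetical value, and converting the compared value to latin1 is one possible way to keep both sides of the comparison in the same character set, not a confirmed fix for the reported error.

```sql
CREATE TABLE profile_category (
  id      MEDIUMINT UNSIGNED NOT NULL AUTO_INCREMENT,
  -- latin1 needs 1 byte per character, so an index entry is at most
  -- 255 bytes instead of 255 x 3 = 765 bytes with utf8.
  pc_name CHAR(255) CHARACTER SET latin1 NOT NULL,
  PRIMARY KEY (id),
  UNIQUE KEY idx_name (pc_name)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

-- Compare in the column's own character set rather than mixing
-- a latin1 column with a utf8 value.
SELECT id
FROM profile_category
WHERE pc_name = CONVERT('some_name' USING latin1);
```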