Should I use CHAR for Unicode? - mysql

I'm just a newbie in MySQL.
I want to declare my table something like this:
CREATE TABLE `test` (
`NO` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`ID` char(50) CHARACTER SET utf8 NOT NULL,
`FIRST_NAME` char(50) CHARACTER SET utf8 NOT NULL,
`LAST_NAME` char(50) CHARACTER SET utf8 NOT NULL,
PRIMARY KEY (`NO`)
);
I read in the MySQL documentation about CHAR and VARCHAR that:
The length of a CHAR column is fixed to the length that you declare when you create the table.
My question is: if I use CHAR(50) with a Unicode character set for Chinese, Japanese, Korean, or other Unicode characters, will these columns use too much storage in the database, and can that affect performance?
For English characters, will it still only accept up to 50 characters if I declare my table like this?
Is there a better way, or is it good to use CHAR(100) for Unicode?
Correct me if I'm wrong.

According to the MySQL manual, the number of bytes required by a CHAR(M) column is M x w, where w is "the number of bytes required for the maximum-length character in the character set". This means a CHAR(50) column stores 50 characters, not 50 bytes. If your characters happen to be Chinese, etc., it will still store up to 50 of them. Note that a VARCHAR(50) will also store up to 50 characters, but (ignoring the 1-2 byte overhead that comes with a VARCHAR column) it will take up less storage than a CHAR(50) column when fewer than 50 characters are stored.
EDIT This applies only if you are using MySQL 4.1 or later. Earlier versions interpreted the lengths of character columns as bytes, not characters.
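A quick way to see the character-versus-byte distinction from the client, as a minimal sketch (this assumes the connection character set is utf8; the literal is three CJK characters):

-- CHAR_LENGTH counts characters; LENGTH counts bytes.
SELECT CHAR_LENGTH('中文字') AS chars,  -- 3
       LENGTH('中文字') AS bytes;       -- 9 (3 bytes per character in utf8)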

Related

Proper configuration to mix Collations in SQL?

I'm a little confused by collations. I'm not sure whether the DB converts a column's collation to the table's collation on a SELECT, or whether a collation is just a rule set used when comparing.
So what should I put as CHARSET and COLLATE? (10.4.11-MariaDB)
Here are some examples of what I have:
Case #1: I only SELECT the utf8_bin column and never compare it, but on the ascii column I do WHERE bot=?
CREATE TABLE `bots_trace` (
`id` int(10) UNSIGNED NOT NULL,
`bot` varchar(20) CHARACTER SET ascii COLLATE ascii_bin NOT NULL,
`info` varchar(2000) COLLATE utf8_bin DEFAULT NULL,
`seen` enum('yes','no') CHARACTER SET ascii COLLATE ascii_bin NOT NULL DEFAULT 'no'
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
I almost never ask the DB to do a utf8mb4_bin comparison or similar, just SELECT.
So what collations should I use in those cases? What should I use as DEFAULT and as COLLATE?
Case #2: The only time I ask the DB to do something with a utf8mb4 column is to check the email.
CREATE TABLE `changed_email` (
`id` int(10) UNSIGNED NOT NULL,
`old_mail` varchar(256) COLLATE utf8mb4_bin NOT NULL,
`ctime` int(10) UNSIGNED NOT NULL,
`ip` varchar(94) CHARACTER SET ascii COLLATE ascii_bin NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin;
SELECT id FROM changed_email WHERE old_mail = ? LIMIT 1
What should I do in this case? Because the only comparison I do is a utf8mb4_bin one, I'm assuming that would be the correct CHARSET & COLLATE.
Also, I use PHP and I set mysqli_set_charset($link, 'utf8mb4'), which I needed in order to retrieve the data correctly. If I change some table's COLLATION to ascii, could I have trouble retrieving utf8mb4 data columns?
The ascii encoding is a subset of utf8, which is a subset of utf8mb4. But that is probably irrelevant.
mysqli_set_charset() announces the CHARACTER SET of the data in the client.
During an INSERT, MySQL will convert the bytes from the encoding indicated by mysqli_set_charset() to the encoding specified for the column in the table. Similarly, a SELECT will convert in the other direction.
If you are only dealing with ascii characters, then there is effectively no conversion, and no possibility of problems. If, on the other hand, you have accented letters or Emoji, there will be problems: those characters cannot be represented in an ascii column.
The above talks about CHARACTER SET, which is the "encoding" of letters. The COLLATION is a different matter; this term refers to the ordering, including case folding and accent stripping. For example, should 'a' = 'A' or not? For COLLATION ascii_general_ci or utf8mb4...ci, those are "equal". For any collation ...bin they are "not equal", and one of them will consistently be sorted (think ORDER BY) before the other.
In some, but not all, situations, MySQL will allow mixing character sets or collations and will "do the right thing". For example, when storing a character from one CHARACTER SET into another, either it can be converted, or it will be mangled. 'A' is available in perhaps all character sets but, for example, an accented A is not available in ascii.
In the case of COLLATION, when there is a conflict of collations, there may be a rule that says which collation to use, but often MySQL gives up and complains about an "Illegal mix of collations".
Keep in mind that all of this comes from multiple places:
The column definition
The connection parameters (between client and MySQL server)
The bytes in the client.
A common example: latin1 accented letters cannot be interpreted as utf8 bytes, but they can be converted to utf8. This rears its ugly head when the connection specification disagrees with the actual bytes in the client.
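As a minimal sketch of the collation side (this assumes the connection character set is utf8mb4; the COLLATE clauses are purely illustrative):

-- 'a' and 'A' are equal under a case-insensitive collation...
SELECT 'a' = 'A' COLLATE utf8mb4_general_ci;  -- 1
-- ...but not under a binary collation:
SELECT 'a' = 'A' COLLATE utf8mb4_bin;         -- 0
-- An explicit COLLATE clause like this is also the usual fix when MySQL
-- complains about an "Illegal mix of collations" in a comparison.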

MySQL mixing Charset & Collations

I read different articles and topics on this forum to help me set up the charset & collation for my database. I'm not sure about the choices I made, so I would appreciate any comments or advice.
I'm using MySQL 5.5.
The database (used with PHP) will hold data in different languages (Chinese, French, Dutch, US English, Spanish, Arabic, etc.).
I will mainly insert data and fetch rows by their IDs. I won't need to do full-text search or compare text.
So here is what I've done to create my database; I decided to use CHARSET utf8mb4 and COLLATION utf8mb4_unicode_ci:
ALTER DATABASE testDB CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
When I create the table:
CREATE TABLE IF NOT EXISTS sector (
idSector INT(5) NOT NULL AUTO_INCREMENT,
sectoreName VARCHAR(45) NOT NULL DEFAULT '',
PRIMARY KEY (idSector)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 AUTO_INCREMENT=0;
For some tables, I thought it was better to use utf8_bin.
Ex: timezone (contains 168,047 rows)
CREATE TABLE timezone (
zone_id int(10) NOT NULL,
abbreviation varchar(6) COLLATE utf8_bin NOT NULL,
time_start decimal(11,0) NOT NULL,
gmt_offset int(11) NOT NULL,
dst char(1) COLLATE utf8_bin NOT NULL,
KEY idx_zone_id (zone_id),
KEY idx_time_start (time_start)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=0;
So basically I would like to know if I'm on the right track, or if I'm doing something that could lead to problems.
Different columns can have different character sets and/or collations, but...
If you compare columns of different charset or collation (WHERE a.x = b.y), indexes cannot be used.
utf8 does not handle all of Chinese, nor does it handle some Emoji. For those, you need utf8mb4.
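A minimal sketch of that index pitfall, using hypothetical tables a and b:

CREATE TABLE a (x VARCHAR(50) CHARACTER SET utf8, KEY (x));
CREATE TABLE b (y VARCHAR(50) CHARACTER SET utf8mb4, KEY (y));
-- The utf8 values must be converted to utf8mb4 before they can be
-- compared, so the index on a.x cannot be used for this join:
EXPLAIN SELECT * FROM a JOIN b ON a.x = b.y;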
On other issues...
In INT(5), the (5) means nothing. Check out SMALLINT UNSIGNED with a range of 0..65535.
time_start decimal(11,0) is strange for a time. If it is a unix timestamp, either TIMESTAMP or INT UNSIGNED should work ok. See also TIME.
dst char(1) COLLATE utf8_bin -- this takes 3 bytes, because of utf8. Perhaps you want CHARACTER SET ascii so it will be only 1 byte?
InnoDB tables really should be given an explicit PRIMARY KEY. (Probably zone_id?)
You are making a good choice for your sectoreName column. Notice one thing: utf8mb4_unicode_ci is a good collation for most languages. But, for Spanish, it gets the alphabet wrong: in that language N and Ñ are considered different letters, and Ñ appears immediately after N in the collating sequence, whereas in other European languages they are considered the same letter. So your Spanish-language users, when they ask for Niña, will get back both Niña and Nina. That may appear to them to be a mistake. (But they're probably used to getting this sort of thing from pan-European software applications.)
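For what it's worth, MySQL does ship a Spanish-specific collation that treats them as distinct letters; a small sketch (assuming a utf8mb4 connection character set):

SELECT 'Nina' = 'Niña' COLLATE utf8mb4_unicode_ci;  -- 1: N and Ñ compare equal
SELECT 'Nina' = 'Niña' COLLATE utf8mb4_spanish_ci;  -- 0: treated as different letters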
You should use utf8mb4 as your character set throughout any new application. So, use that instead of utf8 in your timezone table. Using the _bin collation for your abbreviation column is fine.

Is it OK to use a VARCHAR(5000) field in MySQL for the given scenario?

I am developing a classifieds website using ASP.NET, and the DB is MySQL.
I have a header table that stores the common details of ads.
Here is my header table's schema:
CREATE TABLE `test`.`header` (
`header_id` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
`title` VARCHAR(500) NOT NULL,
`description` VARCHAR(5000) NOT NULL,
`is_published` TINYINT(1) NOT NULL DEFAULT TRUE,
-- etc.
PRIMARY KEY (`header_id`)
) ENGINE = INNODB CHARSET = latin1 COLLATE = latin1_swedish_ci;
So I am using varchar(500) for the title and varchar(5000) for the description. Is it OK to use varchar(5000)? I ask because some people say long varchar fields are converted to TEXT fields inside MySQL (I don't know about this). How long is "long"? Also, some people say there is a limitation on row size. Will a varchar(5000) field lead to any performance issue?
Yes, I could use a TEXT field, but remember that I want a limit on the description; otherwise users will copy-paste a novel into the description field. :)
What is your suggestion? Another data type, or anything else?
Thank you very much.
Assuming that 5000 characters is your limit, VARCHAR(5000) is perfectly reasonable. With CHARSET = latin1 (one byte per character), the column needs at most 5,000 bytes plus a 2-byte length prefix, well within the 65,535-byte row-size limit you mention.
Take a look at this question if you are curious about the differences between VARCHAR and TEXT: MySQL: Large VARCHAR vs. TEXT?.
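If the point of the limit is to reject over-long input rather than silently truncate it, note that this depends on the server's sql_mode. A small sketch, assuming STRICT_TRANS_TABLES is enabled and using the header table from the question:

INSERT INTO header (title, description)
VALUES ('t', REPEAT('x', 5001));
-- ERROR 1406 (22001): Data too long for column 'description' at row 1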

Create database, what's the right charset for my purpose

I have only a little knowledge of MySQL. I just need to create a database to store some scores for a video game, collected from all over the world. (The game will be in every available store, including the Chinese ones, etc.)
I'm worried about the charset. The DB schema will be similar to (pseudocode):
leaderboard("PhoneId" int primary key, name varchar(50), score smallint);
What will happen if a Chinese player submits a score under a name written in Chinese characters? Should I specify something in the DB creation script?
create database if not exists `test_db`;
create table if not exists `leaderboard` (
`phoneid` integer unsigned NOT NULL,
`name` varchar(20) NOT NULL, -- error handling for this
`score` smallint unsigned NOT NULL default 0,
`timestamp` timestamp NOT NULL default CURRENT_TIMESTAMP,
PRIMARY KEY (`phoneid`)
);
UTF8 is your obvious choice.
For details on UTF8 and MySQL integration, you can go through the Tutorial pages:
http://dev.mysql.com/doc/refman/5.0/en/charset-unicode-utf8.html
http://dev.mysql.com/doc/refman/5.0/en/charset-unicode.html
There are certain things that need to be kept in mind while using the UTF8 charset in any database. For example, to save space with UTF-8, use VARCHAR instead of CHAR. Otherwise, MySQL must reserve three bytes for each character in a CHAR CHARACTER SET utf8 column, because that is the maximum possible length.
Similarly, you should analyze other performance constraints when designing your database.
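Applied to the leaderboard table from the question, a minimal sketch might look like this (it assumes utf8mb4 is available, i.e. MySQL 5.5.3 or later; as noted in the answers above, plain utf8 cannot store every Chinese character):

create table if not exists `leaderboard` (
`phoneid` INT UNSIGNED NOT NULL,
`name` VARCHAR(20) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
`score` SMALLINT UNSIGNED NOT NULL DEFAULT 0,
`timestamp` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (`phoneid`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;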

varchar(255) vs tinytext/tinyblob and varchar(65535) vs blob/text

By definition:
VARCHAR: The range of Length is 1 to 255 characters. VARCHAR values are sorted and compared in case-insensitive fashion unless the BINARY keyword is given. Storage: x+1 bytes for a value of length x.
TINYBLOB, TINYTEXT: A BLOB or TEXT column with a maximum length of 255 (2^8 - 1) characters. Storage: x+1 bytes.
So based on this, I create the following table:
CREATE TABLE `user` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(255),
`lastname` tinytext,
PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1
Is it better to use a varchar or a tinytext column, and why?
And is it the same for the larger sizes:
VARCHAR: The range of Length is > 255 characters. VARCHAR values are sorted and compared in case-insensitive fashion unless the BINARY keyword is given. Storage: x+2 bytes.
BLOB, TEXT: A BLOB or TEXT column with a maximum length of 65535 (2^16 - 1) characters. Storage: x+2 bytes.
In this case varchar is better.
Note that varchar can be from 1 to 65535 chars.
Values in VARCHAR columns are variable-length strings. The length can be specified as a value from 0 to 255 before MySQL 5.0.3, and 0 to 65,535 in 5.0.3 and later versions. The effective maximum length of a VARCHAR in MySQL 5.0.3 and later is subject to the maximum row size (65,535 bytes, which is shared among all columns) and the character set used. See Section E.7.4, “Table Column-Count and Row-Size Limits”.
Blobs are saved in a separate section of the file.
They require an extra file read to retrieve.
For this reason, a varchar is fetched much faster.
If you have a large value that you access infrequently, then a blob makes more sense.
Storing the blob data in a separate (part of the) file allows your core data file to be smaller and thus be fetched quicker.
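One common design that follows from this reasoning, sketched here with hypothetical table names: keep the large, rarely-read column in its own table so the frequently-scanned rows stay small.

CREATE TABLE `user_core` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(255),
PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

CREATE TABLE `user_bio` (
`user_id` int(10) unsigned NOT NULL,
`bio` TEXT,
PRIMARY KEY (`user_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;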