Many tables will do fine using CHARACTER SET ascii COLLATE ascii_bin which will be slightly faster. Here's an example:
CREATE TABLE `session` (
`id` CHAR(64) NOT NULL,
`user_id` INTEGER NOT NULL, -- added so the foreign key below has a column to reference; the type must match `user`.`id`
`created_at` INTEGER NOT NULL,
`modified_at` INTEGER NOT NULL,
PRIMARY KEY (`id`),
CONSTRAINT FOREIGN KEY (`user_id`) REFERENCES `user`(`id`)
) CHARACTER SET ascii COLLATE ascii_bin;
But if I were to join it with:
CREATE TABLE `session_value` (
`session_id` CHAR(64) NOT NULL,
`key` VARCHAR(64) NOT NULL,
`value` TEXT,
PRIMARY KEY (`session_id`, `key`),
CONSTRAINT FOREIGN KEY (`session_id`) REFERENCES `session`(`id`) ON DELETE CASCADE
) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin;
what's gonna happen? Logic tells me it should be seamless, because ASCII is a subset of UTF-8. Human nature tells me I can expect anything from a core dump to a "Follow the white rabbit." message appearing on my screen. ¯\_(ツ)_/¯
Does joining ASCII and UTF-8 tables add overhead?
Yes.
If you do
SELECT whatever
FROM session s
JOIN session_value v
ON s.id = v.session_id
the query engine must compare many values of id and session_id to satisfy your query.
If id and session_id have exactly the same datatype, the query planner will be able to exploit indexes and fast comparisons.
But if they have different character sets, the query planner must interpret your query as follows.
... JOIN session_value v
ON CONVERT(s.id USING utf8mb4) = v.session_id
When a WHERE or ON condition has the form f(column) it makes the query non-sargable: it prevents efficient index use. That can hammer query performance.
In your case, a similar performance problem will occur when you insert rows into session_value: the server must do the conversion to check your foreign key constraint.
If these tables are going to production, you'd be very wise to use the same character set for these columns. It's much easier to fix this when you have thousands of rows than when you have millions. Seriously.
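A hedged sketch of that fix, aligning session_value.session_id with session.id (the constraint has no explicit name in your DDL, so check SHOW CREATE TABLE for the generated name; `session_value_ibfk_1` below is an assumption):
ALTER TABLE `session_value` DROP FOREIGN KEY `session_value_ibfk_1`; -- generated name, assumed
ALTER TABLE `session_value`
MODIFY `session_id` CHAR(64) CHARACTER SET ascii COLLATE ascii_bin NOT NULL; -- now matches session.id
ALTER TABLE `session_value`
ADD FOREIGN KEY (`session_id`) REFERENCES `session`(`id`) ON DELETE CASCADE; -- re-create the constraint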
See also: What makes a SQL statement sargable?
Why not UTF-8 all the way through? Having ASCII tables is usually a mistake, a sign you forgot to set the encoding on something. Using a single encoding throughout vastly simplifies your internal architecture.
Encoding is only relevant if and when you have CHAR, VARCHAR or TEXT columns.
If you have a column of that type then it's worth setting it as UTF8MB4 by default.
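For example (a minimal sketch; the database name is made up), making utf8mb4 the default means new tables and string columns inherit it automatically:
CREATE DATABASE myapp CHARACTER SET utf8mb4 COLLATE utf8mb4_bin;
CREATE TABLE myapp.note (
id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
body TEXT -- picks up utf8mb4 from the database default
);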
I am joining with a table and noticed that if the field I join on has a varchar size that's too high, MySQL doesn't use the index for that field in the join, resulting in a significantly longer query time. I've put the EXPLAINs and table definition below. This is MySQL 5.7. Any ideas why this is happening?
Table definition:
CREATE TABLE `LotRecordsRaw` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`lotNumber` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`scrapingJobId` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `lotNumber_UNIQUE` (`lotNumber`),
KEY `idx_Lot_lotNumber` (`lotNumber`)
) ENGINE=InnoDB AUTO_INCREMENT=14551 DEFAULT CHARSET=latin1;
Explains:
explain
(
select lotRecord.*
from LotRecordsRaw lotRecord
left join (
select lotNumber, max(scrapingJobId) as id
from LotRecordsRaw
group by lotNumber
) latestJob on latestJob.lotNumber = lotRecord.lotNumber
)
produces an EXPLAIN plan showing that the derived table is not using the index on "lotNumber". In that example, the "lotNumber" field was a varchar(255). If I change it to a smaller size, e.g. varchar(45), the EXPLAIN plan uses the index and the query runs orders of magnitude faster (2 seconds instead of 100 seconds). What's going on here?
Hooray! You found an optimization reason for not blindly using 255 in VARCHAR.
Please try 191 and 192 -- I want to know if that is the cutoff.
Meanwhile, I have some other comments:
A UNIQUE is a KEY. That is, idx_Lot_lotNumber is redundant and may as well be removed.
The Optimizer can (and probably would) use INDEX(lotNumber, scrapingJobId) as a much faster way to find those MAXes.
Unfortunately, there is no way to specify "make a unique index on lotNumber, but also have that other column in the index."
Wait! With lotNumber being unique, there is only one row per lotNumber. That means MAX and GROUP BY are totally unnecessary!
It seems like lotNumber could be promoted to PRIMARY KEY (and completely get rid of id).
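A hedged sketch of those suggestions, using the names from the posted DDL:
ALTER TABLE LotRecordsRaw DROP INDEX idx_Lot_lotNumber; -- redundant; the UNIQUE key already indexes lotNumber

-- Because lotNumber is UNIQUE, MAX(scrapingJobId) per lotNumber is just that row's own value,
-- so the derived table and the LEFT JOIN add nothing:
SELECT lotRecord.*
FROM LotRecordsRaw lotRecord;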
The normal examples I see for creating a table go like this:
CREATE TABLE supportContacts
(
id int auto_increment primary key,
type varchar(20),
details varchar(30)
);
However an example I'm looking at does it like this:
CREATE TABLE IF NOT EXISTS `main`.`user` (
`user_id` int(11) NOT NULL AUTO_INCREMENT,
`user_name` varchar(64) COLLATE utf8_unicode_ci NOT NULL,
`user_password_hash` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
`user_email` varchar(64) COLLATE utf8_unicode_ci NOT NULL,
PRIMARY KEY (`user_id`),
UNIQUE KEY `user_name` (`user_name`),
UNIQUE KEY `user_email` (`user_email`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
Specifically, the CREATE TABLE specifies both the database and the new table name and surrounds them in `'s. What is the reasoning for this, and does one way have an advantage over the other?
The backticks are escape characters needed when identifiers contain special characters (such as spaces) or are reserved words (such as group or order).
Otherwise, they are not needed, and I do not think they are needed for any of the identifiers in this create table statement.
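For example (a made-up illustration), backticks are genuinely required for identifiers like these:
CREATE TABLE `order` ( -- reserved word used as a table name
`from` DATE, -- reserved word used as a column name
`unit price` DECIMAL(10,2) -- identifier containing a space (not recommended)
);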
My personal preference is that over-use of escape characters is a bad thing:
They make the query harder to read, because there are unnecessary characters everywhere.
They make it harder to write the query. I imagine the backtick key on people who do this a lot starts to break.
They encourage (or at least do not discourage) the use of "difficult" identifiers.
They make it more difficult to move code between databases. (MySQL is one of the few databases that use backticks as an escape character.)
Of course, some people have different opinions on some of these points (although I think the second and fourth points are more truth than opinion).
Backticks are used to escape table and column names.
You can do this to use keywords. If you want to name a column `from`, for instance, then you need the backticks; otherwise the DB interprets it as a keyword.
Or if you want spaces in your table name, like `my table`, which BTW I recommend not to do.
In SQL Server you would use [] to escape the names.
I am a mysql newbie. I have a question about the right thing to do for create table ddl. Up until now I have just been writing create table ddl like this...
CREATE TABLE file (
file_id mediumint(10) unsigned NOT NULL AUTO_INCREMENT,
filename varchar(100) NOT NULL,
file_notes varchar(100) DEFAULT NULL,
file_size mediumint(10) DEFAULT NULL,
file_type varchar(40) DEFAULT NULL,
file longblob DEFAULT NULL,
CONSTRAINT pk_file PRIMARY KEY (file_id)
);
But I often see people doing their create table ddl like this...
CREATE TABLE IF NOT EXISTS `etags` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`item_code` varchar(100) NOT NULL,
`item_description` varchar(500) NOT NULL,
`btn_type` enum('primary','important','success','default','warning') NOT NULL DEFAULT 'default',
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 AUTO_INCREMENT=3 ;
A few questions...
What difference do the quotes around the table name and column names make?
Is it good practice to explicitly declare the engine and character set? What engine and character sets are used by default?
thanks
There's no difference. Identifiers (table names, column names, et al.) must be enclosed in the backticks if they contain special characters or are reserved words. Otherwise, the backticks are optional.
Yes, it's good practice, for portability to other systems. If you re-create the table, having the storage engine and character set specified explicitly in the CREATE TABLE statement means that your statement won't depend on the server's default_storage_engine and character_set_server settings (these may get changed, or be set differently on another database).
You can get your table DDL definition in that same format using the SHOW CREATE TABLE statement, e.g.
SHOW CREATE TABLE `file`
The CREATE TABLE DDL syntax you are seeing posted by other users is typically in the format produced as output of this statement. Note that MySQL doesn't bother with checking whether an identifier contains special characters or reserved words (to see if backticks are required or not), it just goes ahead and wraps all of the identifiers in backticks.
With backticks, reserved words and some special characters can be used in names.
It's simply a safety measure and many tools automatically add these.
The default engine and charset can be set in the servers configuration.
They are often (but not always) set to MyISAM and latin1.
Personally, I would consider it good practice to define engine and charset, just so you can be certain what you end up with.
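A quick way to check what you would end up with if you omit them (variable names as in current MySQL versions):
SHOW VARIABLES LIKE 'default_storage_engine';
SHOW VARIABLES LIKE 'character_set_server';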
I have a Mysql 5.6 table with 70 million rows in it, but it will grow to 100+ million rows or more in a few weeks.
I have a dedicated machine with a humble 500GB disk and 4GB RAM and the innodb_buffer_pool_size is set to 2GB.
The workload is 99% selects and 1% inserts (once a month).
The most important column is descripcion_detallada_producto varchar(300), and it is what the selects target 90% of the time.
My table is:
CREATE TABLE `t1` (
`N_orden` bigint(20) NOT NULL DEFAULT '0',
`Fecha` varchar(15) COLLATE latin1_spanish_ci DEFAULT NULL,
`Ncm` int(11) NOT NULL,
`Origen` int(11) NOT NULL,
`Adquisicion` int(11) NOT NULL,
`Medida_Estadistica` int(11) NOT NULL,
`Unidad_Comercializacion` varchar(30) COLLATE latin1_spanish_ci DEFAULT NULL,
`Descripcion_Detallada_Producto` varchar(300) COLLATE latin1_spanish_ci DEFAULT NULL,
`Cantidad_Estadistica` double DEFAULT NULL,
`Peso_Liquido_Kg` double DEFAULT NULL,
`Valor_Fob` double DEFAULT NULL,
`Valor_Frete` double DEFAULT NULL,
`Valor_Seguro` double DEFAULT NULL,
`Valor_Unidad` double DEFAULT NULL,
`Cantidad` double DEFAULT NULL,
`Valor_Total` double DEFAULT NULL,
PRIMARY KEY (`N_orden`),
KEY `Ncm` (`Ncm`),
KEY `Origen` (`Origen`),
KEY `Adquisicion` (`Adquisicion`),
KEY `Medida_Estadistica` (`Medida_Estadistica`),
KEY `Descripcion_Detallada_Producto` (`Descripcion_Detallada_Producto`),
CONSTRAINT `t1_ibfk_1` FOREIGN KEY (`Ncm`) REFERENCES `ncm` (`Ncm`),
CONSTRAINT `t1_ibfk_2` FOREIGN KEY (`Origen`) REFERENCES `paises` (`Codigo_Pais`),
CONSTRAINT `t1_ibfk_3` FOREIGN KEY (`Adquisicion`) REFERENCES `paises` (`Codigo_Pais`),
CONSTRAINT `t1_ibfk_4` FOREIGN KEY (`Medida_Estadistica`) REFERENCES `medida_estadistica` (`Codigo_Medida_Estadistica`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 COLLATE=latin1_spanish_ci;
My question: Today a SELECT query using LIKE '%whatever%' normally takes 5 to 7 minutes, sometimes more. From what I understand, the varchar index is only used for patterns like 'whatever%', but I NEED to be able to search for strings with wildcards on both sides without waiting ~7 minutes per search. How can I do it?
The right way to fix the problem is to look at all the queries being run against the table, and their relative frequency. You've only given us part of one. You didn't even say which field it relates to. Since you do say "The most important column is descripcion_detallada_producto varchar(300) and it is where the selects are aimed at in 90% of the times" I'll assume that you only need to optimize
WHERE descripcion_detallada_producto LIKE '%wathever%'
As Vatev has already said, you probably should be using fulltext searches - which are semantically (and syntactically) different from LIKE predicates. Further, you should be splitting the descripcion_detallada_producto attribute into its own relation to reduce the buffer-flushing effect of reading huge rows into memory from disk.
If you are searching for entire words that may be anywhere in a text column, you should consider using fulltext indexes, which are obviously used differently than wildcard searches. If you're unsure how to search your fulltext indexes, you can always get help with that.
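A hedged sketch of what that could look like here (InnoDB supports FULLTEXT indexes as of MySQL 5.6; building one on 70+ million rows will take a while, and the index name is made up):
ALTER TABLE t1 ADD FULLTEXT INDEX ft_descripcion (Descripcion_Detallada_Producto);

SELECT N_orden
FROM t1
WHERE MATCH(Descripcion_Detallada_Producto) AGAINST ('whatever' IN NATURAL LANGUAGE MODE);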
Doing a search like the following will not use any of your indexes. Instead, it will scan through all rows of your table data, and you're subjected to disk reads (and any correlated disk fragmentation, which isn't usually a problem because we don't usually scan through tables):
SELECT * FROM t1
WHERE Descripcion_Detallada_Producto LIKE '%whatever%'
The following query would just scan through your index on Descripcion_Detallada_Producto which would act as a "covering" index (notice that the columns in the select make the difference):
SELECT N_orden FROM t1
WHERE Descripcion_Detallada_Producto LIKE '%whatever%'
The advantage in scanning an index instead of the actual table data is that the amount of data that is read as it scans is minimized, and ideally with a large innodb_buffer_pool_size, that index would be in memory, which would avoid disk seeks.
Once you get the N_orden values, then you could retrieve the individual records from the table data.
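A hedged sketch of that two-step ("deferred join") pattern written as a single statement:
SELECT t1.*
FROM t1
JOIN (
SELECT N_orden
FROM t1
WHERE Descripcion_Detallada_Producto LIKE '%whatever%'
) hits USING (N_orden);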
Additional Info
Consider reducing the size of the columns (bigint to unsigned int for N_orden) and reduce size of Descripcion_Detallada_Producto. Even though VARCHAR only uses up actual bytes (plus length) in the table data, each index entry actually uses the max, so reducing even a VARCHAR column size in an index will improve index scan speed.
In addition, if you have categories, restrict searches to selected categories and create a multi-column index on category+description. The following will only have to scan through a portion of a multi-column index on both category and description by restricting the search to a particular category:
SELECT N_orden FROM t1
WHERE Category = 1
AND Descripcion_Detallada_Producto LIKE '%whatever%'
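The corresponding index might be created like this (a sketch; Category is hypothetical and does not exist in the posted DDL):
ALTER TABLE t1 ADD INDEX idx_categoria_descripcion (Category, Descripcion_Detallada_Producto);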
Finally, consider removing wildcard prefixes. Make the user at least type the beginning of the model number.
CREATE TABLE profile_category (
id mediumint UNSIGNED NOT NULL AUTO_INCREMENT,
pc_name char(255) NOT NULL,
PRIMARY KEY (id),
UNIQUE KEY idx_name (pc_name)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
This is one of the tables in a database that is entirely in the utf8 charset. The problem (and I didn't know about it until now) is that the index for the pc_name column will be three times bigger, because MySQL reserves 3 bytes for every character. In this case the indexes will take up much more space.
I cannot make the index shorter, because I need this value to be unique. One of the solutions could be to set pc_name char(255) CHARSET latin1 NOT NULL, but I don't know if this is a problem or not.
Is this a good idea, or are there any solutions that I don't know about?
Update: the pc_name column is validated in the application to be valid utf8, and it allows non-Western characters. But in this case I can make a trade-off and allow only /[_A-Za-z]/ if it's worth it.
Update 2: I tried to set pc_name to latin1 charset, but now I get exceptions like: Zend_Db_Statement_Exception: SQLSTATE[HY000]: General error: 1267 Illegal mix of collations (latin1_swedish_ci,IMPLICIT) and (utf8_general_ci,COERCIBLE) for operation '='
If pc_name is going to contain non-Western text then latin1 isn't going to be an option here - otherwise, go for it.
Not being a hardcore MySQL'er, I don't know if mixing InnoDB and MyISAM tables is fraught with problems - if not, perhaps you could make this table a plain MyISAM table and leave it as utf8?
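A minimal sketch of the latin1 route, assuming the /[_A-Za-z]/ restriction from the first update is acceptable:
ALTER TABLE profile_category
MODIFY pc_name CHAR(255) CHARACTER SET latin1 NOT NULL;

-- The "Illegal mix of collations" error from Update 2 means a utf8 value is still being compared
-- against the latin1 column somewhere, so the query or connection side may also need an explicit
-- conversion (or a matching character set) after this change.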