Why is InnoDB table size much larger than expected? - mysql

I'm trying to figure out storage requirements for different storage engines. I have this table:
CREATE TABLE `mytest` (
`num1` int(10) unsigned NOT NULL,
KEY `key1` (`num1`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
When I insert some values and then run show table status; I get the following:
+----------------+--------+---------+------------+---------+----------------+-------------+------------------+--------------+-----------+----------------+---------------------+---------------------+------------+-------------------+----------+----------------+---------+
| Name | Engine | Version | Row_format | Rows | Avg_row_length | Data_length | Max_data_length | Index_length | Data_free | Auto_increment | Create_time | Update_time | Check_time | Collation | Checksum | Create_options | Comment |
+----------------+--------+---------+------------+---------+----------------+-------------+------------------+--------------+-----------+----------------+---------------------+---------------------+------------+-------------------+----------+----------------+---------+
| mytest | InnoDB | 10 | Compact | 1932473 | 35 | 67715072 | 0 | 48840704 | 4194304 | NULL | 2010-05-26 11:30:40 | NULL | NULL | latin1_swedish_ci | NULL | | |
Notice avg_row_length is 35. I am baffled that InnoDB would not make better use of space when I'm just storing a non-nullable integer.
I have run this same test on MyISAM, and by default MyISAM uses 7 bytes per row for this table. Running
ALTER TABLE mytest MAX_ROWS=50000000, AVG_ROW_LENGTH = 4;
causes MyISAM to finally use the expected 5-byte rows.
When I run the same ALTER TABLE statement for InnoDB the avg_row_length does not change.
Why would such a large avg_row_length be necessary when only storing a 4-byte unsigned int?

InnoDB tables are clustered: all data are contained in a B-Tree, with the PRIMARY KEY as the key and all other columns as the payload.
Since you don't define an explicit PRIMARY KEY, InnoDB uses a hidden 6-byte column to sort the records on.
This and overhead of the B-Tree organization (with extra non-leaf-level blocks) requires more space than sizeof(int) * num_rows.
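As a rough back-of-the-envelope accounting (field sizes from the InnoDB documentation; the exact per-page overhead varies), each clustered-index record here is roughly 5 bytes of record header + 6 bytes hidden row id + 6 bytes transaction id + 7 bytes roll pointer + 4 bytes for num1 = 28 bytes, and each key1 entry is roughly 5 bytes header + 4 bytes num1 + 6 bytes hidden row id = 15 bytes. Add B-Tree fill factor and page overhead on top and you land near the reported Data_length / Rows = 67715072 / 1932473 ≈ 35 bytes and Index_length / Rows ≈ 25 bytes.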

Here is some more info you might find useful.
InnoDB allocates data in terms of 16KB pages, so 'SHOW TABLE STATUS' will give inflated numbers for row size if you only have a few rows and the table is < 16K total. (For example, with 4 rows the average row size comes back as 4096.)
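A minimal way to see that page-granularity effect for yourself (the table name here is just an example):
CREATE TABLE tiny (num1 INT UNSIGNED NOT NULL) ENGINE=InnoDB;
INSERT INTO tiny VALUES (1), (2), (3), (4);
ANALYZE TABLE tiny;              -- refresh the row estimate
SHOW TABLE STATUS LIKE 'tiny';   -- Data_length is one 16KB page, so Avg_row_length should come back around 16384 / 4 = 4096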
The extra 6 bytes per row for the "invisible" primary key is a crucial point when space is a big consideration. If your table is only one column, that's the ideal column to make the primary key, assuming the values in it are unique:
CREATE TABLE `mytest2`
(`num1` int(10) unsigned NOT NULL primary key)
ENGINE=InnoDB DEFAULT CHARSET=latin1;
By using a PRIMARY KEY like this:
No INDEX or KEY clause is needed, because you don't have a secondary index. The index-organized format of InnoDB tables gives you fast lookup based on the primary key value for free.
You don't wind up with another copy of the NUM1 column data, which is what happens when that column is indexed explicitly.
You don't wind up with another copy of the 6-byte invisible primary key values. The primary key values are duplicated in each secondary index. (That's also the reason why you probably don't want 10 indexes on a table with 10 columns, and you probably don't want a primary key that combines several different columns or is a long string column.)
So overall, sticking with just a primary key means less data associated with the table + indexes. To get a sense of overall data size, I like to run with
SET GLOBAL innodb_file_per_table = 1;
and examine the size of the data/database/*table*.ibd files. Each .ibd file contains the data for an InnoDB table and all its associated indexes.
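If you prefer not to poke around the filesystem, the same (estimated) numbers are available from information_schema; replace 'test' with your schema name:
SELECT table_name, data_length, index_length
FROM information_schema.tables
WHERE table_schema = 'test' AND table_name LIKE 'mytest%';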
To quickly build up a big table for testing, I usually run a statement like so:
insert into mytest
select * from mytest;
This doubles the amount of data each time. In the case of the single-column table with a primary key, since the values had to be unique, I used a variation to keep the new values from colliding with the existing ones:
insert into mytest2
select num1 + (select count(*) from mytest2) from mytest2;
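(Starting from a single seed row, each run doubles the row count, so about 21 runs takes you to roughly 2 million rows, the same order of magnitude as the table in the question.)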
This way, I was able to get average row size down to 25. The space overhead is based on the underlying assumption that you want to have fast lookup for individual rows using a pointer-style mechanism, and most tables will have a column whose values serve as pointers (i.e. the primary key) in addition to the columns with real data that gets summed, averaged, and displayed.

In addition to Quassnoi's very fine answer, you should probably try it out using a significant data set.
What I'd do is load 1M rows of simulated production data, then measure the table size and use that as a guide.
That's what I've done in the past, anyway.

MyISAM
MyISAM, except in really old versions, uses a 7-byte "pointer" for locating a row, and a 6-byte pointer inside indexes. These defaults lead to a huge max table size. More details: http://mysql.rjweb.org/doc.php/limits#myisam_specific_limits . The kludgy way to change those involves the ALTER .. MAX_ROWS=50000000, AVG_ROW_LENGTH = 4 that you discovered. The server multiplies those values together to compute how many bytes the data pointer needs to be. Hence, you stumbled on how to shrink the avg_row_length.
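Rough arithmetic for the ALTER in the question (my reading of the rule above, so treat the exact byte counts as approximate): MAX_ROWS * AVG_ROW_LENGTH = 50,000,000 * 4 = 200,000,000 bytes, which fits in a 4-byte pointer, so the per-row cost drops from 1 (delete flag) + MAX(4-byte INT, 6-byte free-space link) = 7 bytes to 1 + MAX(4, 4) = 5 bytes, i.e. the 5-byte rows the question ended up with.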
But you only notice it when you declare a table whose rows are shorter than 7 bytes! The pointer size shows up in multiple places:
Free-space links in the .MYD default to 7 bytes. When you delete a row, a link is written pointing to the next free spot. That link needs to be 7 bytes (by default), hence the row size was artificially extended from the 4-byte INT to make room for it! (There are more details having to do with whether the column is NULLable, etc.)
FIXED vs DYNAMIC row -- When the table is FIXED size, the "pointer" is a row number. For DYNAMIC, it is a byte offset into the .MYD.
Index entries must also point to data rows with a pointer. So your ALTER should have shrunk the .MYI file as well!
There are more details, but MyISAM is likely to go away, so this ancient history is not likely to be of concern to anyone.
InnoDB
https://stackoverflow.com/a/64417275/1766831

Related

How do these table sizes make sense?

I have a MyISAM table that contains just one field, a SMALLINT. That field has an index on it, and there are 5.6 million records.
So in theory 5.6 million * 2 bytes (SMALLINT) = roughly 11MB, but the data file of the table is 40MB. Why the difference?
The index file takes up 46MB. Why would it be bigger than the data file?
Here is the create table:
CREATE TABLE `key_test` (
`key2` smallint(5) unsigned NOT NULL DEFAULT '0',
KEY `key2` (`key2`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1
There is some overhead.
First, some controllable 'variables':
myisam_data_pointer_size defaults to 6 (bytes).
myisam_index_pointer_size defaults to 1 less than that.
For the data (.MYD):
N bytes for up to 8*N NULLable columns. (N=0 for your table.)
1 byte for "deleted". You do have this.
DELETEd rows leave gaps.
When a record is deleted, the gap is filled in with a data_pointer to the next record. This implies that the smallest a row can be is 6 bytes.
So: 1 + MAX(row_length, 6) = 7 bytes per row.
If you had 3 SMALLINTs, the table would be the same size.
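For the 5.6 million rows in the question, that works out to roughly 5,600,000 × 7 ≈ 39 MB, which is in the right neighbourhood of the 40 MB .MYD you observed.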
For the index (.MYI):
BTree organization has some overhead; if randomly built, it settles in on about 69% full.
A 6-byte pointer (byte offset into .MYD, DYNAMIC) is needed in each leaf row.
Links within the BTree are a 5-byte row that is controlled by an unlisted setting (myisam_index_pointer_size).
So: row_length + 6 per record, plus some overhead. 46M sounds like the data was sorted so that the index was built "in order".
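As rough arithmetic again: 5,600,000 × (2 + 6) ≈ 45 MB of leaf entries, plus the non-leaf blocks, which lines up with the 46 MB .MYI.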
Beyond that, my memory of MyISAM details is fading.

Why does it contain "Using where"?

Here is my table schema.
CREATE TABLE `usr_block_phone` (
`usr_block_phone_uid` BIGINT (20) UNSIGNED NOT NULL AUTO_INCREMENT,
`time` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
`usr_uid` INT (10) UNSIGNED NOT NULL,
`block_phone` VARCHAR (20) NOT NULL,
`status` INT (4) NOT NULL,
PRIMARY KEY (`usr_block_phone_uid`),
KEY `block_phone` (`block_phone`),
KEY `usr_uid_block_phone` (`usr_uid`, `block_phone`) USING BTREE,
KEY `usr_uid` (`usr_uid`) USING BTREE
) ENGINE = INNODB DEFAULT CHARSET = utf8
And this is my SQL:
SELECT
ubp.usr_block_phone_uid
FROM
usr_block_phone ubp
WHERE
ubp.usr_uid = 19
AND ubp.block_phone = '80000000001'
By the way, when I ran EXPLAIN, I got the following result:
+------+-------------+-------+------+-----------------------------------------+---------------------+---------+-------------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+------+-----------------------------------------+---------------------+---------+-------------+------+--------------------------+
| 1 | SIMPLE | ubp | ref | block_phone,usr_uid_block_phone,usr_uid | usr_uid_block_phone | 66 | const,const | 1 | Using where; Using index |
+------+-------------+-------+------+-----------------------------------------+---------------------+---------+-------------+------+--------------------------+
Why is the index usr_uid_block_phone not working?
I want the query to be resolved using the index only.
This table has 20000 rows now.
Your index is actually used; see the key column. At the moment the query looks good, and the execution plan is good as well.
Fill it with at least a hundred rows for it to be used (and make sure you still use a predicate that filters just one row).
And a piece of general advice: it's nearly impossible to predict how the optimiser will behave in a particular situation unless you're a MySQL DBMS developer yourself. So it's always better to try on a dataset that is as close (in terms of size and quality of data) to your production data as possible.
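For example, you can grow this particular table quickly by borrowing the doubling trick shown earlier on this page; it simply re-inserts the existing rows and lets AUTO_INCREMENT assign fresh ids (duplicated values are not the same as realistic production data, but it is a quick way to watch the plan change with size):
INSERT INTO usr_block_phone (`time`, usr_uid, block_phone, status)
SELECT `time`, usr_uid, block_phone, status FROM usr_block_phone;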
Both columns used in the WHERE clause (usr_uid and block_phone) are present in the usr_uid_block_phone index, and this makes it a possible key for processing the query. Even more, it is the index selected; but because of the small number of rows in the table, MySQL may decide that it is faster not to use an index.
The reason is in the expressions present in the SELECT clause:
SELECT
ubp.usr_block_phone_uid
Because the column usr_block_phone_uid is not present in the selected index, in order to process the query MySQL needs to read both the index (to determine what rows match the WHERE conditions) and the table data (to get the value of column usr_block_phone_uid of those rows).
It is faster to read only the table data, apply the WHERE conditions to find the matching rows, and take their usr_block_phone_uid values from there: the server reads data from a single place, whereas using the index would mean reading both the index data and the same table data.
The situation (and the report of EXPLAIN) changes when the table grows. At some point, reading information from the index (and using it to filter out rows) is compensated by the large number of rows that are filtered out (i.e. their data is not read from the storage).
The exact point when this happens is not fixed. It depends a lot on the structure of your table and on how the values in the table are spread out. Even when the table is large, MySQL can decide to ignore the index in order to read less information from the storage medium. For example, if a large percentage (let's say 90%) of the table rows match the WHERE condition, it is more efficient to read all the table data (and ignore the index) than to read 90% of the table data and 90% of the index.
90% in the previous paragraph is a figure I made up for explanation purposes. I don't know how MySQL decides that it's better to ignore the index.
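If you want to compare plans, you can pin the optimizer to the index with a standard index hint and re-run EXPLAIN:
EXPLAIN SELECT ubp.usr_block_phone_uid
FROM usr_block_phone ubp FORCE INDEX (usr_uid_block_phone)
WHERE ubp.usr_uid = 19
AND ubp.block_phone = '80000000001';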

Performance penalization in (VAR)CHAR vs INT for PKs [duplicate]

Is there a measurable performance difference between using INT vs. VARCHAR as a primary key in MySQL? I'd like to use VARCHAR as the primary key for reference lists (think US States, Country Codes) and a coworker won't budge on the INT AUTO_INCREMENT as a primary key for all tables.
My argument, as detailed here, is that the performance difference between INT and VARCHAR is negligible, since every INT foreign key reference will require a JOIN to make sense of the reference, whereas a VARCHAR key will directly present the information.
So, does anyone have experience with this particular use-case and the performance concerns associated with it?
I was a bit annoyed by the lack of benchmarks for this online, so I ran a test myself.
Note though that I don't do this on a regular basis, so please check my setup and steps for any factors that could have influenced the results unintentionally, and post your concerns in comments.
The setup was as follows:
Intel® Core™ i7-7500U CPU @ 2.70GHz × 4
15.6 GiB RAM, of which I ensured around 8 GB was free during the test.
148.6 GB SSD drive, with plenty of free space.
Ubuntu 16.04 64-bit
MySQL Ver 14.14 Distrib 5.7.20, for Linux (x86_64)
The tables:
create table jan_int (data1 varchar(255), data2 int(10), myindex tinyint(4)) ENGINE=InnoDB;
create table jan_int_index (data1 varchar(255), data2 int(10), myindex tinyint(4), INDEX (myindex)) ENGINE=InnoDB;
create table jan_char (data1 varchar(255), data2 int(10), myindex char(6)) ENGINE=InnoDB;
create table jan_char_index (data1 varchar(255), data2 int(10), myindex char(6), INDEX (myindex)) ENGINE=InnoDB;
create table jan_varchar (data1 varchar(255), data2 int(10), myindex varchar(63)) ENGINE=InnoDB;
create table jan_varchar_index (data1 varchar(255), data2 int(10), myindex varchar(63), INDEX (myindex)) ENGINE=InnoDB;
Then, I filled 10 million rows in each table with a PHP script whose essence is like this:
$pdo = get_pdo();
$keys = [ 'alabam', 'massac', 'newyor', 'newham', 'delawa', 'califo', 'nevada', 'texas_', 'florid', 'ohio__' ];
for ($k = 0; $k < 10; $k++) {
for ($j = 0; $j < 1000; $j++) {
$val = '';
for ($i = 0; $i < 1000; $i++) {
$val .= '("' . generate_random_string() . '", ' . rand (0, 10000) . ', "' . ($keys[rand(0, 9)]) . '"),';
}
$val = rtrim($val, ',');
$pdo->query('INSERT INTO jan_char VALUES ' . $val);
}
echo "\n" . ($k + 1) . ' million(s) rows inserted.';
}
For int tables, the bit ($keys[rand(0, 9)]) was replaced with just rand(0, 9), and for varchar tables, I used full US state names, without cutting or extending them to 6 characters. generate_random_string() generates a 10-character random string.
Then I ran in MySQL:
SET SESSION query_cache_type=0;
For jan_int table:
SELECT count(*) FROM jan_int WHERE myindex = 5;
SELECT BENCHMARK(1000000000, (SELECT count(*) FROM jan_int WHERE myindex = 5));
For other tables, same as above, with myindex = 'califo' for char tables and myindex = 'california' for varchar tables.
Times of the BENCHMARK query on each table:
jan_int: 21.30 sec
jan_int_index: 18.79 sec
jan_char: 21.70 sec
jan_char_index: 18.85 sec
jan_varchar: 21.76 sec
jan_varchar_index: 18.86 sec
Regarding table & index sizes, here's the output of show table status from janperformancetest; (w/ a few columns not shown):
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Name | Engine | Version | Row_format | Rows | Avg_row_length | Data_length | Max_data_length | Index_length | Data_free | Auto_increment | Collation |
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| jan_int | InnoDB | 10 | Dynamic | 9739094 | 43 | 422510592 | 0 | 0 | 4194304 | NULL | utf8mb4_unicode_520_ci |
| jan_int_index | InnoDB | 10 | Dynamic | 9740329 | 43 | 420413440 | 0 | 132857856 | 7340032 | NULL | utf8mb4_unicode_520_ci |
| jan_char | InnoDB | 10 | Dynamic | 9726613 | 51 | 500170752 | 0 | 0 | 5242880 | NULL | utf8mb4_unicode_520_ci |
| jan_char_index | InnoDB | 10 | Dynamic | 9719059 | 52 | 513802240 | 0 | 202342400 | 5242880 | NULL | utf8mb4_unicode_520_ci |
| jan_varchar | InnoDB | 10 | Dynamic | 9722049 | 53 | 521142272 | 0 | 0 | 7340032 | NULL | utf8mb4_unicode_520_ci |
| jan_varchar_index | InnoDB | 10 | Dynamic | 9738381 | 49 | 486539264 | 0 | 202375168 | 7340032 | NULL | utf8mb4_unicode_520_ci |
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
My conclusion is that there's no performance difference for this particular use case.
You make a good point that you can avoid some number of joined queries by using what's called a natural key instead of a surrogate key. Only you can assess if the benefit of this is significant in your application.
That is, you can measure the queries in your application that are the most important to be speedy, because they work with large volumes of data or they are executed very frequently. If these queries benefit from eliminating a join, and do not suffer by using a varchar primary key, then do it.
Don't use either strategy for all tables in your database. It's likely that in some cases, a natural key is better, but in other cases a surrogate key is better.
Other folks make a good point that it's rare in practice for a natural key to never change or have duplicates, so surrogate keys are usually worthwhile.
It's not about performance. It's about what makes a good primary key: unique and unchanging over time. You may think an entity such as a country code never changes over time and would be a good candidate for a primary key. But bitter experience shows that is seldom so.
INT AUTO_INCREMENT meets the "unique and unchanging over time" condition. Hence the preference.
Depends on the length.. If the varchar will be 20 characters, and the int is 4, then if you use an int, your index will have FIVE times as many nodes per page of index space on disk... That means that traversing the index will require one fifth as many physical and/or logical reads..
So, if performance is an issue, given the opportunity, always use an integral non-meaningful key (called a surrogate) for your tables, and for Foreign Keys that reference the rows in these tables...
At the same time, to guarantee data consistency, every table where it matters should also have a meaningful non-numeric alternate key, (or unique Index) to ensure that duplicate rows cannot be inserted (duplicate based on meaningful table attributes) .
For the specific use you are talking about (like state lookups ) it really doesn't matter because the size of the table is so small.. In general there is no impact on performance from indices on tables with less than a few thousand rows...
Absolutely not.
I have done several... several... performance checks between INT, VARCHAR, and CHAR.
10 million record table with a PRIMARY KEY (unique and clustered) had the exact same speed and performance (and subtree cost) no matter which of the three I used.
That being said... use whatever is best for your application. Don't worry about the performance.
For short codes, there's probably no difference. This is especially true as the table holding these codes is likely to be very small (a couple thousand rows at most) and to change rarely (when was the last time we added a new US state?).
For larger tables with a wider variation among the keys, this can be dangerous. Think about using the e-mail address/user name from a User table, for example. What happens when you have a few million users and some of those users have long names or e-mail addresses? Now any time you need to join this table using that key it becomes much more expensive.
As for Primary Key, whatever physically makes a row unique should be determined as the primary key.
For a reference as a foreign key, using an auto incrementing integer as a surrogate is a nice idea for two main reasons.
- First, there's less overhead incurred in the join usually.
- Second, if you need to update the table that contains the unique varchar then the update has to cascade down to all the child tables and update all of them as well as the indexes, whereas with the int surrogate, it only has to update the master table and its indexes.
The drawback to using the surrogate is that you could possibly allow the meaning of the surrogate to change:
ex.
id value
1 A
2 B
3 C
Update 3 to D
id value
1 A
2 B
3 D
Update 2 to C
id value
1 A
2 C
3 D
Update 3 to B
id value
1 A
2 C
3 B
It all depends on what you really need to worry about in your structure and what means most.
Common cases where a surrogate AUTO_INCREMENT hurts:
A common schema pattern is a many-to-many mapping:
CREATE TABLE map (
id ... AUTO_INCREMENT,
foo_id ...,
bar_id ...,
PRIMARY KEY(id),
UNIQUE(foo_id, bar_id),
INDEX(bar_id) );
Performance of this pattern is much better, especially when using InnoDB:
CREATE TABLE map (
# No surrogate
foo_id ...,
bar_id ...,
PRIMARY KEY(foo_id, bar_id),
INDEX (bar_id, foo_id) );
Why?
InnoDB secondary keys need an extra lookup; by moving the pair into the PK, that is avoided for one direction.
The secondary index is "covering", so it does not need the extra lookup.
This table is smaller because of getting rid of id and one index.
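For example (column names as in the sketch above; the literal ids are placeholders):
SELECT bar_id FROM map WHERE foo_id = 123;   -- resolved entirely inside the clustered PRIMARY KEY(foo_id, bar_id)
SELECT foo_id FROM map WHERE bar_id = 456;   -- covered by INDEX(bar_id, foo_id), so no extra lookup back to the PK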
Another case (country):
country_id INT ...
-- versus
country_code CHAR(2) CHARACTER SET ascii
All too often the novice normalizes country_code into a 4-byte INT instead of using a 'natural', nearly-unchanging 2-byte string. Faster, smaller, fewer JOINs, more readable.
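A minimal sketch of that layout (table and column names beyond country_code are illustrative):
CREATE TABLE country (
country_code CHAR(2) CHARACTER SET ascii NOT NULL PRIMARY KEY,
country_name VARCHAR(100) NOT NULL
) ENGINE=InnoDB;
CREATE TABLE address (
address_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
country_code CHAR(2) CHARACTER SET ascii NOT NULL,
FOREIGN KEY (country_code) REFERENCES country (country_code)
) ENGINE=InnoDB;
-- address rows now carry a readable country code directly, with no JOIN needed just to display it.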
At HauteLook, we changed many of our tables to use natural keys. We did experience a real-world increase in performance. As you mention, many of our queries now use fewer joins, which makes them more performant. We will even use a composite primary key if it makes sense. That being said, some tables are just easier to work with if they have a surrogate key.
Also, if you are letting people write interfaces to your database, a surrogate key can be helpful. The 3rd party can rely on the fact that the surrogate key will change only in very rare circumstances.
I faced the same dilemma. I made a DW (constellation schema) with 3 fact tables (Road Accidents, Vehicles in Accidents and Casualties in Accidents) and 60 dimension tables. The data includes all accidents recorded in the UK from 1979 to 2012. All together, about 20 million records.
Fact tables relationships:
+----------+ +---------+
| Accident |>--------<| Vehicle |
+-----v----+ 1 * +----v----+
1| |1
| +----------+ |
+---<| Casualty |>---+
* +----------+ *
RDBMS: MySQL 5.6
Natively, the Accident key is a varchar (numbers and letters) with 15 digits. I tried not to use surrogate keys, since the accident indexes would never change.
On an i7 (8 cores) machine, the DW became too slow to query after loading 12 million records, depending on the dimensions.
After a lot of rework and adding bigint surrogate keys, I got an average 20% performance boost.
Still a low performance gain, but a valid try. I'm working on MySQL tuning and clustering.
The question is about MySQL, so I say there is a significant difference. If it were about Oracle (which stores numbers as strings - yes, I couldn't believe it at first), then not much difference.
Storage in the table is not the issue but updating and referring to the index is. Queries involving looking up a record based on its primary key are frequent - you want them to occur as fast as possible because they happen so often.
The thing is a CPU deals with 4 byte and 8 byte integers naturally, in silicon. It's REALLY fast for it to compare two integers - it happens in one or two clock cycles.
Now look at a string - it's made up of lots of characters (more than one byte per character these days). Comparing two strings for precedence can't be done in one or two cycles. Instead the strings' characters must be iterated until a difference is found. I'm sure there are tricks to make it faster in some databases but that's irrelevant here because an int comparison is done naturally and lightning fast in silicon by the CPU.
My general rule - every primary key should be an autoincrementing INT, especially in OO apps using an ORM (Hibernate, Datanucleus, whatever) where there are lots of relationships between objects - they'll usually be implemented as a simple FK, and the ability of the DB to resolve those fast is important to your app's responsiveness.
Allow me to say yes there is definitely a difference, taking into consideration the scope of performance (Out of the box definition):
1- Using a surrogate int is faster in the application because you do not need to use ToUpper(), ToLower(), ToUpperInvariant(), or ToLowerInvariant() in your code or in your query, and these 4 functions have different performance benchmarks. See Microsoft's performance rules on this. (performance of application)
2- Using a surrogate int guarantees the key does not change over time. Even country codes may change; see on Wikipedia how ISO codes have changed over time. That would take lots of time to change the primary key for subtrees. (performance of data maintenance)
3- It seems there are issues with ORM solutions, such as NHibernate, when the PK/FK is not an int. (developer performance)
Not sure about the performance implications, but it seems a possible compromise, at least during development, would be to include both the auto-incremented, integer "surrogate" key, as well as your intended, unique, "natural" key. This would give you the opportunity to evaluate performance, as well as other possible issues, including the changeability of natural keys.
As usual, there are no blanket answers. 'It depends!' and I am not being facetious. My understanding of the original question was for keys on small tables - like Country (integer id or char/varchar code) being a foreign key to a potentially huge table like address/contact table.
There are two scenarios here when you want data back from the DB. First is a list/search kind of query where you want to list all the contacts with state and country codes or names (ids will not help and hence will need a lookup). The other is a get scenario on primary key which shows a single contact record where the name of the state, country needs to be shown.
For the latter get, it probably does not matter what the FK is based on since we are bringing together tables for a single record or a few records and on key reads. The former (search or list) scenario may be impacted by our choice. Since it is required to show country (at least a recognizable code and perhaps even the search itself includes a country code), not having to join another table through a surrogate key can potentially (I am just being cautious here because I have not actually tested this, but seems highly probable) improve performance; notwithstanding the fact that it certainly helps with the search.
As the codes are small in size (usually not more than 3 chars for country and state), it may be okay to use the natural keys as foreign keys in this scenario.
In the other scenario, where keys depend on longer varchar values and perhaps on larger tables, the surrogate key probably has the advantage.

MySQL indexing char(1) columns

I have a table with a complex query that I am trying to optimize.
I read most of the documentation on MySQL indexing, but in this case I'm not sure what to do.
Data structure:
-- please, don't comment on the field types and names, it is outsourced project.
CREATE TABLE items(
record_id INT(10) UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
solid CHAR(1) NOT NULL, -- only 'Y','N' values
optional CHAR(1) NULL, -- only 'Y','N', NULL values
data TEXT
);
Query:
SELECT * FROM items
WHERE record_id != 88
AND solid = 'Y'
AND optional !='N' -- 'Y' OR NULL
Of course there are extra joins and related data, but these are the biggest filters.
In the scenario of:
- 200 000+ records,
- 10% (from all) with solid = 'Y',
- 10% (from all) with optional !='N',
What would be good index for this query ?
or more precisely:
does the first check record_id != 88 slow the query in any way?
(it only eliminates one result...?)
which is faster: (optional != 'N') or (optional = 'Y' OR optional IS NULL)?
as mentioned above, optional = 'N' is 10% of the total count.
is there anything special about indexing a CHAR(1) column with only 2 possible values?
can I use an index like (record_id, solid, optional)?
can I create an index for specific values (solid = 'Y', optional != 'N')?
As @Jack requested, here is the current EXPLAIN result (out of 30 000 total rows with 20 results):
+-------------+-------+--------------+---------+---------+------+-------+-------------+
| select_type | type | possible_key | key | key_len | ref | rows | Extra |
+-------------+-------+--------------+---------+---------+------+-------+-------------+
| PRIMARY | range | PRIMARY | PRIMARY | 4 | NULL | 16228 | Using where |
+-------------+-------+--------------+---------+---------+------+-------+-------------+
This is an interesting question. Overall, your query has an estimated selectivity of about 1%. So, if 100 records fit on a page, you would expect that each page still has to be read, even with the index. Because each record is so small (depending on the data, that is), this is quite likely. From that perspective, an index is not worth it.
An index would be worth it under the following circumstances. The first is when the index is a covering index, meaning that you can satisfy the query with all the columns in the index. For example:
select count(*)
FROM items
WHERE record_id != 88 AND solid = 'Y' AND optional !='N' -- 'Y' OR NULL
Where the index is on solid, optional, record_id. The query doesn't need to go back to the original data pages.
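A sketch of the covering index described here (the index name is my own):
ALTER TABLE items ADD INDEX idx_solid_optional_record (solid, optional, record_id);
-- EXPLAIN on the count(*) query above should then show "Using index".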
Another case would be when the index is a primary (or clustered) index. The data is stored in that order, so fetching a limited number of results would reduce the read overhead of the query. The downside to this is that updates and inserts are more expensive, because data actually has to move.
My best guess in your case is that an index would not be useful, unless data is quite large (in the kilobyte range).
You should try to put indexes on the columns that will do the most discrimination. Usually indexing a binary column is not very helpful, if the database is about evenly split between the values. But if the value you often search for only appears 10% of the time, it can be a useful index.
If any of the columns are indexed, they will usually be checked before doing any other WHERE processing. The order that you put the conditions in the WHERE clause is not generally relevant. You can use EXPLAIN to find out which indexes a query uses.
