MySQL Clustered vs Non Clustered Index Performance - mysql

I'm running a couple of tests on MySQL clustered vs non-clustered indexes, where I have a table 100gb_table which contains ~60 million rows:
100gb_table schema:
CREATE TABLE 100gb_table (
  id int PRIMARY KEY NOT NULL AUTO_INCREMENT,
  c1 int,
  c2 text,
  c3 text,
  c4 blob NOT NULL,
  c5 text,
  c6 text,
  ts timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP
);
and I'm executing a query that only reads the clustered index:
SELECT id FROM 100gb_table ORDER BY id;
I'm seeing that it takes ~55 minutes for this query to complete, which is strangely slow. I modified the table by adding another index on top of the Primary Key column and ran the following query, which forces the non-clustered index to be used:
SELECT id FROM 100gb_table USE INDEX (non_clustered_key) ORDER BY id;
This finished in <10 minutes, much faster than reading with the clustered index. Why is there such a large discrepancy between these two? My understanding is that both indexes store the index column's values in a tree structure, except the clustered index contains table data in the leaf nodes so I would expect both queries to be similarly performant. Could the BLOB column possibly be distorting the clustered index structure?
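For reference, that extra index could be created roughly like this (a sketch; the index name non_clustered_key is taken from the USE INDEX hint above):
ALTER TABLE 100gb_table ADD INDEX non_clustered_key (id);  -- secondary index containing only the id column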

The answer comes in how the data is laid out.
The PRIMARY KEY is "clustered" with the data; that is, the data is ordered by the PK in a B+Tree structure. To read all of the ids, the entire B+Tree must be read.
Any secondary index is also a B+Tree structure, but it contains (1) the columns of the index, and (2) the PRIMARY KEY column(s) not already in the index, so the row can be located.
In your example (with lots of [presumably] bulky columns), the data BTree is a lot bigger than the secondary index (on just id). Either test probably required reading all the relevant blocks from the disk.
A side note... This is not as bad as it could be. There is a limit of about 8KB on how big a row can be. TEXT and BLOB columns, when short enough, are included in that 8KB. But when one is bulky, it is put in another place, leaving behind a 'pointer' to the text/blob. Hence, the main part of the data BTree is smaller than it might be if all the text/blob data were included directly.
Since SELECT id FROM tbl is a mostly unnecessary query, the design of InnoDB does not worry about the inefficiency you discovered.
Tack on ORDER BY or WHERE, etc., and there are many different optimizations that could come into play. You might even find that INDEX(c1) will let your query run in not much more than 10 minutes. (I think I have given you all the clues for 'why'.)
Also, if you had done SELECT * FROM tbl, it might have taken much longer than 55 minutes. This is because of having extra [random] fetches to get the texts/blobs from the "off-record" storage. And from the network time to shovel far more data.
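To see the difference without waiting for the full scans, EXPLAIN is useful. A sketch (the exact plan depends on your version and table statistics):
EXPLAIN SELECT id FROM 100gb_table ORDER BY id;
-- key: PRIMARY -> walks the entire clustered B+Tree, data pages included
EXPLAIN SELECT id FROM 100gb_table USE INDEX (non_clustered_key) ORDER BY id;
-- key: non_clustered_key, Extra: Using index -> a "covering" scan of the much smaller secondary B+Tree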

Related

Clustered index on integer column surprisingly slow

I have an InnoDB table with 750,000 records. Its primary key is a BIGINT.
When I do:
SELECT COUNT(*) FROM table;
it takes 900ms. explain shows that the index is not used.
When I do:
SELECT COUNT(*) FROM table WHERE pk >= 3000000;
it takes 400ms. explain shows that the index, in this case, is used.
I am looking to do fast counts where x >= pk >= y.
It is my understanding that since I use the primary key of the table, I am using a clustered index, and that therefore the rows are (physically?) ordered by this index. Should it then not be very, very fast to do this count? I was expecting the result to be available in a dozen milliseconds or so.
I have read that faster results can be expected if I select only a small part of the table. I am however interested in doing these counts of ranges. Perhaps I should organize my data in a different way?
In a different case, I have a table with spatial data and use an RTREE index, and then I use MBRContains to count matching rows (and on a secondary index). Surprisingly, this is faster than the simple case above.
In InnoDB, the PRIMARY KEY is "clustered" with the data. This means that the data is sorted by the PK and where pk BETWEEN x AND y must read all the rows from x through y.
So, how does it do a scan by PK? It must read the data blocks. They are bulky in that they have other columns.
But what about COUNT(*) without a WHERE? In this case, the Optimizer looks for the least-bulky index and counts the rows in it. So...
If you have a secondary index, it will use that.
If you only have the PK, then it will read the entire table to do the count.
That is, the artificial addition of a secondary index on the narrowest column is likely to speed up SELECT COUNT(*) FROM tbl.
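A minimal sketch of that trick, assuming a table named tbl whose only index so far is the BIGINT primary key pk:
ALTER TABLE tbl ADD INDEX idx_narrow (pk);  -- narrow secondary index; any small NOT NULL column would do
SELECT COUNT(*) FROM tbl;                   -- the optimizer can now count entries in the smaller index instead of scanning the clustered data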
But wait... Be sure to run each timing test twice. The first time (after a restart) must read the needed blocks from disk. Slow.
The second time all the blocks are likely to be sitting in RAM. Much faster.
SPATIAL and FULLTEXT indexing complicate this discussion, especially if the WHERE has two parts: one with a SPATIAL or FULLTEXT test and one with a regular test.
COUNT(1) and COUNT(*) are identical. COUNT(x) checks x for being NOT NULL before including the row in the tally.
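A tiny illustration of that last point (hypothetical table and data):
CREATE TABLE t (x INT NULL);
INSERT INTO t VALUES (1), (NULL), (3);
SELECT COUNT(*), COUNT(1), COUNT(x) FROM t;  -- 3, 3, 2: COUNT(x) skips the row where x is NULL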

Is the primary key stored implicitly in other keys in mysql myisam engine?

My problem: imagine a table with millions of rows, like
CREATE TABLE a (
  id INT PRIMARY KEY,
  column2 ..,
  column3 ..,
  -- many other columns ..
  INDEX (column2)
);
and a query like this:
SELECT id FROM a WHERE column2 > 10000 LIMIT 1000 OFFSET 5000;
My question: does MySQL only use the index on column2 (that is, is the primary key id implicitly stored as a reference in other indexes), or does it have to fetch all rows to also get the id, which is selected for output? In that case the query should be much faster with a key declared as:
INDEX column2(column2, id)
Short answer: No.
Long answer:
MyISAM, unlike InnoDB, has a "pointer" to the data in the leaf node of each index, including that PRIMARY KEY.
So, INDEX(col2) is essentially INDEX(col2, ptr). Ditto for INDEX(id) being INDEX(id, ptr).
The "pointer" is either a byte offset into the .MYD file (for DYNAMIC) or record number (for FIXED). In either case, the pointer leads to a "seek" into the .MYD file.
The pointer defaults to a 6-byte number, allowing for a huge number of rows. It can be changed by a setting, either for saving space or allowing an even bigger number of rows.
For your particular query, INDEX(col2, id) is optimal and "covering". It is better than INDEX(col2) for MyISAM, but they are equivalent for InnoDB, since InnoDB implicitly has the PK in each secondary index.
The query will have to scan at least 5000+1000 rows, at least in the index's BTree.
Note that InnoDB's PRIMARY KEY is clustered with the data, but MyISAM's PRIMARY KEY is a separate BTree, just like other secondary indexes.
You really should consider moving to InnoDB; there is virtually no reason to use MyISAM today.
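A sketch of the "covering" index for the query in the question (the index name is made up; the syntax is the same for MyISAM and InnoDB):
ALTER TABLE a ADD INDEX idx_col2_id (column2, id);  -- both the WHERE column and the SELECTed column live in the index
EXPLAIN SELECT id FROM a WHERE column2 > 10000 LIMIT 1000 OFFSET 5000;
-- Extra: Using index -> the data file (.MYD for MyISAM) / clustered index (InnoDB) is never touched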
An index on column2 is required. Your suggestion with id in the index will prevent table scans and should be very efficient.
Furthermore, it is faster to do this, assuming that column2 is a continuous sequence:
SELECT id FROM a WHERE column2 > 15000 LIMIT 1000;
This is because, with the OFFSET, MySQL still has to scan and discard the first 5000 matching records; it does not realize that it could start directly at the corresponding column2 value.
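Spelled out as a sketch, assuming you keep track of the last column2 value returned by the previous page:
-- first page
SELECT id, column2 FROM a WHERE column2 > 10000 ORDER BY column2 LIMIT 1000;
-- next page: continue from the last column2 value seen (15000 here), instead of using OFFSET
SELECT id, column2 FROM a WHERE column2 > 15000 ORDER BY column2 LIMIT 1000;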

Optimizing key_len on mysql index

What are the general parameters for determining whether the key_len for a mysql index is 'too long'?
For example, for longer VARCHAR columns, the key will be longer. In one case I have an index with key_len=767. In which case should I do something like:
ADD INDEX (title(50));
If your table holds many rows and/or if you execute lots of SELECT ... ORDER BY queries, having index(es) on the field(s) involved will speed up the SELECT queries.
However, it will slow down inserts, updates, and deletes on these fields, because the indexes need to be updated as well.
Every index on a table uses disk space: if you're really short on disk space, use a smaller key_len. But your queries will execute slower.
Imagine a table with only one column, and only the first 3 characters are indexed. The table contains:
Timo
Timothée
Tim
Timothy
Timmo
SELECT * FROM `my_table` ORDER BY `name` ASC
This query will use the index, but since the column is only partially indexed, extra processing is needed to complete the sort, and the execution will take longer.
MySQL indexes can be up to 1000 bytes long (767 bytes for InnoDB tables).
The Innodb documentation says:
In InnoDB, having a long PRIMARY KEY (either a single column with a lengthy value, or several columns that form a long composite value) wastes a lot of disk space. The primary key value for a row is duplicated in all the secondary index records that point to the same row. (See Section 14.3.11, “InnoDB Table and Index Structures”.) Create an AUTO_INCREMENT column as the primary key if your primary key is long, or index a prefix of a long VARCHAR column instead of the entire column.
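A sketch combining both of the quoted suggestions, assuming a hypothetical table with a long VARCHAR title column:
CREATE TABLE articles (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,  -- short surrogate PK, copied into every secondary index entry
  title VARCHAR(255) NOT NULL,
  body TEXT,
  INDEX idx_title_prefix (title(50))                    -- prefix index keeps key_len small
) ENGINE=InnoDB;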

Creating an Index that is a subset of another Index in MySQL

I'm using MySQL, although I suspect this is a generic database question.
I have a table consisting of 6 numeric columns. The first 5 of these make up the primary key.
It is a large table (20 million rows and growing), so some queries take time - about 10secs, which in itself is not too long, but I need to run a lot of them.
I understand that the primary key is automatically indexed - is there any advantage in me separately indexing some groups of columns within the primary key that I usually query on?
That is, if I regularly query on the first 3 of the 5 primary key columns, should I create an additional index for these 3, or is that redundant because it's already part of the primary key index?
Ten seconds is quite a long time for a query that returns one or a tiny handful of rows. If the query is returning 3% of the table's contents, though, ten seconds is not too long.
Your primary unique key is backed up by a composite index, let's say an index on
(I1,I2,I3,I4,I5)
You are correct that a query like
WHERE I1 = val AND I2 = val AND I3 = val
and
WHERE I3 = val AND I2 = val AND I1 = val
should use the index created for the primary key. The important thing is that the columns used in the query form a leftmost prefix of the composite index. A query like
WHERE I3 = val AND I4 = val AND I5 = val
won't use the primary key's composite index very well, if at all. Neither will a query that does some kind of computation on the column values mentioned in the key, like
WHERE I1+I2+I3=sumvalue
Keep in mind that "should work" is not the same as "does work." Try using the EXPLAIN command in MySQL to figure out whether the DBMS is doing what you expect it to for your query.
http://dev.mysql.com/doc/refman/5.1/en/explain.html
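For example, a sketch of that check, assuming a table t with the composite primary key (I1, I2, I3, I4, I5):
EXPLAIN SELECT * FROM t WHERE I1 = 1 AND I2 = 2 AND I3 = 3;  -- key: PRIMARY, a leftmost prefix is usable
EXPLAIN SELECT * FROM t WHERE I3 = 3 AND I4 = 4 AND I5 = 5;  -- no leftmost prefix, so expect a full scan
EXPLAIN SELECT * FROM t WHERE I1 + I2 + I3 = 6;              -- computing on the key columns also defeats the index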
Why not just create a few test queries, create the index on a copy of the table and see how it performs?
When it comes to performance, measuring is always better than trusting an opinion.
The "best" solution in a database largely depends on the specific details of the table(s) involved. What range of values in the columns, what distribution of values, what type of queries, relative frequency of select/delete/insert/update queries, etc.
That being said, my guess is that an index on a subset will help if that subset contains all the columns used in a query. You might get even better performance if you also include the selected (result set) columns in the index.
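As a hedged sketch of that last idea, assuming the table is called t, the filtered columns are I1, I2, I3 and the selected non-key column is called I6:
ALTER TABLE t ADD INDEX idx_i123_covering (I1, I2, I3, I6);  -- covers both the WHERE columns and the SELECTed column
-- EXPLAIN should then show "Using index": the query can be answered from the index alone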

MySql Query very slow

I run the following query on my database :
SELECT e.id_dernier_fichier
FROM Enfants e JOIN FichiersEnfants f
ON e.id_dernier_fichier = f.id_fichier_enfant
And the query runs fine. If I modify the query like this:
SELECT e.codega
FROM Enfants e JOIN FichiersEnfants f
ON e.id_dernier_fichier = f.id_fichier_enfant
The query becomes very slow! The problem is that I want to select many columns from tables e and f, and the query can take up to 1 minute! I tried different modifications but nothing works. I have indexes on the id_* columns and also on e.codega. Enfants has 9,000 rows and FichiersEnfants has 20,000 rows. Any suggestions?
The difference in performance is possibly due to e.id_dernier_fichier being in the index used for the JOIN, but e.codega not being in that index.
Without a full definition of both tables, and all of their indexes, it's not possible to tell for certain. Also, including the two EXPLAIN PLANs for the two queries would help.
For now, however, I can elaborate on a couple of things...
If an INDEX is CLUSTERED (this also applies to PRIMARY KEYs), the data is actually physically stored in the order of the INDEX. This means that knowing you want position x in the INDEX also implicitly means you want position x in the TABLE.
If the INDEX is not clustered, however, the INDEX is just providing a lookup for you. Effectively saying position x in the INDEX corresponds to position y in the TABLE.
The importance here is when accessing fields not specified in the INDEX. Doing so means you have to actually go to the TABLE to get the data. In the case of a CLUSTERED INDEX, you're already there, so the overhead of finding that field is pretty low. If the INDEX isn't clustered, however, you effectively have to JOIN the TABLE to the INDEX, then find the field you're interested in.
Note: Having a composite index on (id_dernier_fichier, codega) is very different from having one index on just (id_dernier_fichier) and a separate index on just (codega).
In the case of your query, I don't think you need to change the code at all. But you may benefit from changing the indexes.
You mention that you want to access many fields. Putting all those fields in a composite index is probably not the best solution. Instead you may want to create a CLUSTERED INDEX on (id_dernier_fichier). This will mean that once the id_dernier_fichier has been located, you're already in the right place to get all the other fields as well.
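A sketch of the two alternatives mentioned above (the index names are made up):
ALTER TABLE Enfants ADD INDEX idx_fichier_codega (id_dernier_fichier, codega);  -- one composite index: the JOIN column plus the SELECTed column, so the second query can be covered
-- which is NOT the same as two separate single-column indexes:
-- ALTER TABLE Enfants ADD INDEX idx_fichier (id_dernier_fichier), ADD INDEX idx_codega (codega);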
EDIT Note About MySQL and CLUSTERED INDEXes
13.2.10.1. Clustered and Secondary Indexes
Every InnoDB table has a special index called the clustered index where the data for the rows is stored:
If you define a PRIMARY KEY on your table, InnoDB uses it as the clustered index.
If you do not define a PRIMARY KEY for your table, MySQL picks the first UNIQUE index that has only NOT NULL columns as the primary key and InnoDB uses it as the clustered index.
If the table has no PRIMARY KEY or suitable UNIQUE index, InnoDB internally generates a hidden clustered index on a synthetic column containing row ID values. The rows are ordered by the ID that InnoDB assigns to the rows in such a table. The row ID is a 6-byte field that increases monotonically as new rows are inserted. Thus, the rows ordered by the row ID are physically in insertion order.
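A sketch of those three cases (hypothetical tables):
-- 1. explicit PRIMARY KEY -> it becomes the clustered index
CREATE TABLE t1 (id INT NOT NULL PRIMARY KEY, v INT) ENGINE=InnoDB;
-- 2. no PRIMARY KEY, but a UNIQUE index on NOT NULL columns -> that index is used as the clustered index
CREATE TABLE t2 (code INT NOT NULL, v INT, UNIQUE KEY uk_code (code)) ENGINE=InnoDB;
-- 3. neither -> InnoDB clusters on a hidden, monotonically increasing 6-byte row ID
CREATE TABLE t3 (v INT) ENGINE=InnoDB;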