What are the general parameters for determining whether the key_len for a mysql index is 'too long'?
For example, longer VARCHAR columns produce longer keys. In one case I have an index with key_len=767. In that case, should I do something like:
ADD INDEX (title(50));
If your table holds many rows and/or if you execute lots of SELECT ... ORDER BY queries, having index(es) on the field(s) involved will speed up the SELECT queries.
However, it will slow down inserts, writes and updates on these fields, because the indexes need to be updated as well.
Every index on a table uses disk space: if you're really short on disk space, index a shorter prefix to get a smaller key_len. The trade-off is that some queries will run more slowly.
Imagine a table with only one column, and only the first 3 characters are indexed. The table contains:
Timo
Timothée
Tim
Timothy
Timmo
SELECT * FROM `my_table` ORDER BY `name` ASC
This query can only get partial help from the index: because only a prefix of the column is indexed, the index cannot fully resolve the sort order, so MySQL needs extra processing (a filesort) to complete the sort, and the execution will take longer.
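A minimal sketch of the example above, assuming a single VARCHAR(50) column and a hypothetical index name idx_name_prefix (neither is given in the original). In the EXPLAIN output, the Extra column would be expected to show "Using filesort", since the 3-character prefix cannot order these rows on its own.

```sql
-- Assumed definition: the original only says the table has one column.
CREATE TABLE my_table (
  name VARCHAR(50) NOT NULL,
  INDEX idx_name_prefix (name(3))  -- only the first 3 characters are indexed
);

INSERT INTO my_table (name) VALUES
  ('Timo'), ('Timothée'), ('Tim'), ('Timothy'), ('Timmo');

-- All five values share the prefix 'Tim', so the index alone cannot order them.
EXPLAIN SELECT * FROM `my_table` ORDER BY `name` ASC;
```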
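MySQL index keys can be up to 1000 bytes long for MyISAM tables. For InnoDB tables, an index key prefix is limited to 767 bytes with the REDUNDANT or COMPACT row format, and to 3072 bytes with the DYNAMIC or COMPRESSED row format.
The InnoDB documentation says: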
In InnoDB, having a long PRIMARY KEY (either a single column with a lengthy value, or several columns that form a long composite value) wastes a lot of disk space. The primary key value for a row is duplicated in all the secondary index records that point to the same row. (See Section 14.3.11, “InnoDB Table and Index Structures”.) Create an AUTO_INCREMENT column as the primary key if your primary key is long, or index a prefix of a long VARCHAR column instead of the entire column.
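One common way to pick a prefix length is to compare the selectivity of a few candidate prefixes against the full column. The sketch below assumes a hypothetical table `articles` with a long VARCHAR column `title`; the candidate lengths 20 and 50 and the index name are illustrative only.

```sql
-- Hypothetical table `articles` with a long VARCHAR column `title`.
-- Compare how many distinct values each candidate prefix keeps
-- against the full column (1.0 means every row is distinct).
SELECT
  COUNT(DISTINCT title) / COUNT(*)           AS full_selectivity,
  COUNT(DISTINCT LEFT(title, 20)) / COUNT(*) AS prefix_20,
  COUNT(DISTINCT LEFT(title, 50)) / COUNT(*) AS prefix_50
FROM articles;

-- If prefix_50 is close to full_selectivity, a 50-character prefix is likely enough.
ALTER TABLE articles ADD INDEX idx_title_prefix (title(50));
```

A prefix that keeps nearly the same selectivity as the full column gives most of the lookup benefit at a fraction of the key_len.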
I'm running a couple tests on MySQL Clustered vs Non Clustered indexes where I have a table 100gb_table which contains ~60 million rows:
100gb_table schema:
CREATE TABLE 100gb_table (
id int PRIMARY KEY NOT NULL AUTO_INCREMENT,
c1 int,
c2 text,
c3 text,
c4 blob NOT NULL,
c5 text,
c6 text,
ts timestamp NOT NULL default(CURRENT_TIMESTAMP)
);
and I'm executing a query that only reads the clustered index:
SELECT id FROM 100gb_table ORDER BY id;
I'm seeing that it takes almost 55 minutes for this query to complete, which is strangely slow. I modified the table by adding another index on top of the Primary Key column and ran the following query, which forces the non-clustered index to be used:
SELECT id FROM 100gb_table USE INDEX (non_clustered_key) ORDER BY id;
This finished in <10 minutes, much faster than reading with the clustered index. Why is there such a large discrepancy between these two? My understanding is that both indexes store the index column's values in a tree structure, except the clustered index contains table data in the leaf nodes so I would expect both queries to be similarly performant. Could the BLOB column possibly be distorting the clustered index structure?
The answer comes in how the data is laid out.
The PRIMARY KEY is "clustered" with the data; that is, the data is ordered by the PK in a B+Tree structure. To read all of the ids, the entire BTree must be read.
Any secondary index is also in a B+Tree structure, but it contains (1) the columns of the index, and (2) any other columns in the PK.
In your example (with lots of [presumably] bulky columns), the data BTree is a lot bigger than the secondary index (on just id). Either test probably required reading all the relevant blocks from the disk.
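The size difference between the two BTrees can be checked directly. This is a sketch that assumes InnoDB persistent statistics are enabled (the default) and that the session's current database contains the table; `mysql.innodb_index_stats` reports each index's size in pages.

```sql
-- Secondary index on the primary key column, as described in the question.
ALTER TABLE 100gb_table ADD INDEX non_clustered_key (id);

-- Refresh persistent statistics, then compare index sizes.
ANALYZE TABLE 100gb_table;

SELECT index_name,
       ROUND(stat_value * @@innodb_page_size / 1024 / 1024) AS size_mb
FROM mysql.innodb_index_stats
WHERE database_name = DATABASE()
  AND table_name = '100gb_table'
  AND stat_name = 'size';   -- 'size' is the number of pages in the index
```

The PRIMARY row covers the whole clustered data BTree, while non_clustered_key should be far smaller because it stores only the id values.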
A side note... This is not as bad as it could be. There is a limit of about 8KB on how big a row can be. TEXT and BLOB columns, when short enough, are included in that 8KB. But when one is bulky, it is put in another place, leaving behind a 'pointer' to the text/blob. Hence, the main part of the data BTree is smaller than it might be if all the text/blob data were included directly.
Since SELECT id FROM tbl is a mostly unnecessary query, the design of InnoDB does not worry about the inefficiency you discovered.
Tack on ORDER BY or WHERE, etc., and there are many different optimizations that could come into play. You might even find that INDEX(c1) will let your query run in not much more than 10 minutes. (I think I have given you all the clues for 'why'.)
Also, if you had done SELECT * FROM tbl, it might have taken much longer than 55 minutes. This is because of the extra [random] fetches needed to get the texts/blobs from the "off-record" storage, plus the network time to shovel far more data.
I have a table with an index on an int column.
CREATE TABLE sample (
col1 varchar(255), -- length assumed; MySQL requires one for VARCHAR
col2 int
);
CREATE INDEX idx1 ON sample (col2);
When I explain the following query
Select * from sample where col2>2;
It does a full table scan.
Why doesn't the indexing work here?
How can I optimize such queries when the table has around 20 million records?
Just because you create an index does not mean MySQL will always use it. According to the docs, here are several reasons why it may choose a full table scan over the index:
The table is so small that it is faster to perform a table scan than to bother with a key lookup. This is common for tables with fewer than 10 rows and a short row length.
There are no usable restrictions in the ON or WHERE clause for indexed columns.
You are comparing indexed columns with constant values and MySQL has calculated (based on the index tree) that the constants cover too large a part of the table and that a table scan would be faster. See Section 8.2.1.1, “WHERE Clause Optimization”.
You are using a key with low cardinality (many rows match the key value) through another column. In this case, MySQL assumes that by using the key it probably will do many key lookups and that a table scan would be faster.
You can use FORCE INDEX to ensure your query uses the index instead of allowing the optimizer to determine the appropriate path, although usually MySQL will take the most efficient approach.
SELECT * FROM t1, t2 FORCE INDEX (index_for_column) WHERE t1.col_name=t2.col_name;
Reference: https://dev.mysql.com/doc/refman/8.0/en/table-scan-avoidance.html
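Applied to the `sample` table from the question, the reasons above can be checked roughly as follows. This is a sketch rather than a definitive diagnosis: it measures how much of the table the predicate matches, then compares the optimizer's plan with a forced-index plan.

```sql
-- How much of the table does the predicate match?
-- If most rows have col2 > 2, a full scan is usually the cheaper plan.
SELECT COUNT(*) AS total_rows,
       SUM(col2 > 2) AS matching_rows
FROM sample;

-- Compare the optimizer's choice with the forced-index plan.
EXPLAIN SELECT * FROM sample WHERE col2 > 2;
EXPLAIN SELECT * FROM sample FORCE INDEX (idx1) WHERE col2 > 2;

-- Selecting only the indexed column lets the index cover the query.
EXPLAIN SELECT col2 FROM sample WHERE col2 > 2;
```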
My problem: imagine a table with millions of rows, like
CREATE TABLE a (
id INT PRIMARY KEY,
column2..,
column3..,
many other columns..
..
INDEX (column2)
);
and a query like this:
SELECT id FROM a WHERE column2 > 10000 LIMIT 1000 OFFSET 5000;
My question: does mysql only use the index "column2" (so the primary key id is implicitly stored as a reference in other indexes), or does it have to fetch all rows to get also the id, which is selected for output? In that case the query should be much faster with a key declared as:
INDEX column2(column2, id)
Short answer: No.
Long answer:
MyISAM, unlike InnoDB, has a "pointer" to the data in the leaf node of each index, including that PRIMARY KEY.
So, INDEX(col2) is essentially INDEX(col2, ptr). Ditto for INDEX(id) being INDEX(id, ptr).
The "pointer" is either a byte offset into the .MYD file (for DYNAMIC) or record number (for FIXED). In either case, the pointer leads to a "seek" into the .MYD file.
The pointer defaults to a 6-byte number, allowing for a huge number of rows. It can be changed by a setting, either for saving space or allowing an even bigger number of rows.
For your particular query, INDEX(col2, id) is optimal and "covering". It is better than INDEX(col2) for MyISAM, but they are equivalent for InnoDB, since InnoDB implicitly has the PK in each secondary index.
The query will have to scan at least 5000+1000 rows, at least in the index's BTree.
Note that InnoDB's PRIMARY KEY is clustered with the data, but MyISAM's PRIMARY KEY is a separate BTree, just like other secondary indexes.
You really should consider moving to InnoDB; there is virtually no reason to use MyISAM today.
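To see whether the query is served from the index alone, EXPLAIN can be compared before and after adding the composite index. The sketch assumes table `a` from the question exists; the index name idx_column2_id is hypothetical. On InnoDB the extra index should be redundant with INDEX(column2), so this mostly matters for MyISAM.

```sql
-- Assumes table `a` from the question exists; the index name is hypothetical.
ALTER TABLE a ADD INDEX idx_column2_id (column2, id);

-- With a covering index, the Extra column should include "Using index",
-- meaning the rows are read from the index BTree only.
EXPLAIN SELECT id FROM a WHERE column2 > 10000 LIMIT 1000 OFFSET 5000;
```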
An index on column2 is required. Your suggestion with id in the index will prevent table scans and should be very efficient.
Furthermore, assuming that column2 is a continuous sequence, it is faster to do this:
SELECT id FROM a WHERE column2 > 15000 LIMIT 1000;
This is because, with OFFSET, MySQL still has to scan and discard the first 5000 matching records before returning anything (MySQL does not realize that you are actually offsetting column2).
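When column2 is not a continuous sequence, a related technique (keyset pagination, not something the answers above propose) avoids the OFFSET scan by remembering where the previous page ended. The sketch assumes table `a` with an index on (column2, id); the user variables and their values are hypothetical placeholders for what the application would store.

```sql
-- First page: order explicitly so "the last row" is well defined.
SELECT id, column2
FROM a
WHERE column2 > 10000
ORDER BY column2, id
LIMIT 1000;

-- Hypothetical placeholders: the application stores column2 and id
-- from the last row of the previous page.
SET @last_column2 = 10450, @last_id = 98231;

-- Next page: seek past the last row instead of using OFFSET.
SELECT id, column2
FROM a
WHERE column2 > @last_column2
   OR (column2 = @last_column2 AND id > @last_id)
ORDER BY column2, id
LIMIT 1000;
```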
I have many tables where I have indexes on foreign keys, and clustered indexes which include those foreign keys. For example, I have a table like the following:
TABLE: Item
------------------------
id PRIMARY KEY
owner FOREIGN KEY
status
... many more columns
MySQL generates indexes for primary and foreign keys, but sometimes I want to improve query performance, so I'll create clustered or covering indexes. This leads to indexes with overlapping columns.
INDEXES ON: Item
------------------------
idx_owner (owner)
idx_owner_status (owner, status)
If I dropped idx_owner, future queries that would normally use idx_owner would just use idx_owner_status since it has owner as the first column in the index.
Is it worth keeping idx_owner around? Is there an additional I/O overhead to use idx_owner_status even though MySQL only uses part of the index?
Edit: I am really only interested in the way InnoDB behaves regarding indexes.
Short Answer
Drop the shorter index.
Long Answer
Things to consider:
Drop it:
Each INDEX is a separate BTree that resides on disk, so it takes space.
Each INDEX is updated (sooner or later) when you INSERT a new row or an UPDATE modifies an indexed column. This takes some CPU and I/O and buffer_pool space for the 'change buffer'.
Any functional use (as opposed to performance) for the shorter index can be performed by the longer one.
Don't drop it:
The longer index is bulkier than the shorter one. So it is less cacheable. So (in extreme situations) using the bulkier one in place of the shorter one could cause more I/O. A case that aggravates this: INDEX(int, varchar255).
It is very rare that the last item really overrides the other items.
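Before dropping the shorter index, it can be tried out safely on MySQL 8.0 or later by making it invisible first. This is a sketch assuming the `Item` table and index names from the question; the value 123 is only an example, and it assumes MySQL accepts idx_owner_status as the supporting index for the foreign key on owner.

```sql
-- MySQL 8.0+: hide the index from the optimizer without dropping it.
ALTER TABLE Item ALTER INDEX idx_owner INVISIBLE;

-- Check that queries filtering on owner now use idx_owner_status.
EXPLAIN SELECT * FROM Item WHERE owner = 123;

-- Once satisfied, drop the redundant index.
ALTER TABLE Item DROP INDEX idx_owner;
```

If performance regresses, ALTER TABLE Item ALTER INDEX idx_owner VISIBLE restores the original behavior without rebuilding the index.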
Bonus
A "covering" index is one that contains all the columns mentioned in a SELECT. For example:
SELECT status FROM tbl WHERE owner = 123;
This will touch only the BTree for INDEX(owner, status), thereby being noticeably faster than
SELECT status, foo FROM tbl WHERE owner = 123;
If you really need that query to be faster, then replace both of your indexes with INDEX(owner, status, foo).
PK in Secondary key
One more tidbit... In InnoDB, the columns of the PRIMARY KEY are implicitly appended to every secondary key. So, the three examples are really
INDEX(owner, id)
INDEX(owner, status, id)
INDEX(owner, status, foo, id)
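The effect of the implicit PK can be observed with EXPLAIN. The table below is a minimal stand-in with assumed column types and a hypothetical index name; in the Extra column, "Using index" would indicate that the query was answered from the secondary index alone.

```sql
-- Minimal stand-in for the table discussed above (column types are assumed).
CREATE TABLE tbl (
  id INT PRIMARY KEY,
  owner INT NOT NULL,
  status INT NOT NULL,
  foo INT,
  INDEX idx_owner_status (owner, status)
) ENGINE=InnoDB;

-- Covered: owner and status are in the index, id is appended implicitly.
EXPLAIN SELECT id, status FROM tbl WHERE owner = 123;

-- Not covered: foo is missing from the index, so the row must be fetched.
EXPLAIN SELECT status, foo FROM tbl WHERE owner = 123;
```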
More discussion in my blogs on composite indexes and index cookbook.
I run the following query on my database :
SELECT e.id_dernier_fichier
FROM Enfants e JOIN FichiersEnfants f
ON e.id_dernier_fichier = f.id_fichier_enfant
And the query runs fine. If I modify the query like this:
SELECT e.codega
FROM Enfants e JOIN FichiersEnfants f
ON e.id_dernier_fichier = f.id_fichier_enfant
The query becomes very slow! The problem is I want to select many columns from tables e and f, and the query can take up to 1 minute! I tried different modifications but nothing works. I have indexes on id_* and also on e.codega. Enfants has 9000 rows and FichiersEnfants has 20000 rows. Any suggestions?
Here is the requested info (sorry for not having shown it from the beginning):
The difference in performance is possibly due to e.id_dernier_fichier being in the index used for the JOIN, but e.codega not being in that index.
Without a full definition of both tables, and all of their indexes, it's not possible to tell for certain. Also, including the two EXPLAIN PLANs for the two queries would help.
For now, however, I can elaborate on a couple of things...
If an INDEX is CLUSTERED (this also applies to PRIMARY KEYs), the data is actually physically stored in the order of the INDEX. This means that knowing you want position x in the INDEX also implicitly means you want position x in the TABLE.
If the INDEX is not clustered, however, the INDEX is just providing a lookup for you. Effectively saying position x in the INDEX corresponds to position y in the TABLE.
This matters when accessing fields not specified in the INDEX. Doing so means you have to actually go to the TABLE to get the data. In the case of a CLUSTERED INDEX, you're already there, so the overhead of finding that field is pretty low. If the INDEX isn't clustered, however, you effectively have to JOIN the TABLE to the INDEX, then find the field you're interested in.
Note: Having a composite index on (id_dernier_fichier, codega) is very different from having one index on just (id_dernier_fichier) and a separate index on just (codega).
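A sketch of how the composite index could be tried, assuming the Enfants and FichiersEnfants tables from the question; the index name is hypothetical. Comparing the two EXPLAIN outputs should show whether the second query can now be answered from the index as well.

```sql
-- Hypothetical index name; assumes the Enfants table from the question.
ALTER TABLE Enfants
  ADD INDEX idx_dernier_fichier_codega (id_dernier_fichier, codega);

-- Compare the plans for the fast and the slow query.
EXPLAIN
SELECT e.id_dernier_fichier
FROM Enfants e JOIN FichiersEnfants f
  ON e.id_dernier_fichier = f.id_fichier_enfant;

EXPLAIN
SELECT e.codega
FROM Enfants e JOIN FichiersEnfants f
  ON e.id_dernier_fichier = f.id_fichier_enfant;
```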
In the case of your query, I don't think you need to change the code at all. But you may benefit from changing the indexes.
You mention that you want to access many fields. Putting all those fields in a composite index is probably not the best solution. Instead you may want to create a CLUSTERED INDEX on (id_dernier_fichier). This will mean that once the id_dernier_fichier has been located, you're already in the right place to get all the other fields as well.
EDIT Note About MySQL and CLUSTERED INDEXes
13.2.10.1. Clustered and Secondary Indexes
Every InnoDB table has a special index called the clustered index where the data for the rows is stored:
If you define a PRIMARY KEY on your table, InnoDB uses it as the clustered index.
If you do not define a PRIMARY KEY for your table, MySQL picks the first UNIQUE index that has only NOT NULL columns as the primary key and InnoDB uses it as the clustered index.
If the table has no PRIMARY KEY or suitable UNIQUE index, InnoDB internally generates a hidden clustered index on a synthetic column containing row ID values. The rows are ordered by the ID that InnoDB assigns to the rows in such a table. The row ID is a 6-byte field that increases monotonically as new rows are inserted. Thus, the rows ordered by the row ID are physically in insertion order.