I run the following query on my database:
SELECT e.id_dernier_fichier
FROM Enfants e JOIN FichiersEnfants f
ON e.id_dernier_fichier = f.id_fichier_enfant
And the query runs fine. If I modify the query like this:
SELECT e.codega
FROM Enfants e JOIN FichiersEnfants f
ON e.id_dernier_fichier = f.id_fichier_enfant
The query becomes very slow! The problem is that I want to select many columns from tables e and f, and the query can take up to 1 minute! I tried different modifications but nothing works. I have indexes on the id_* columns and also on e.codega. Enfants has 9000 rows and FichiersEnfants has 20000 rows. Any suggestions?
Here is the information that was asked for (sorry for not showing it from the beginning):
The difference in performance is possibly due to e.id_dernier_fichier being in the index used for the JOIN, but e.codega not being in that index.
Without a full definition of both tables, and all of their indexes, it's not possible to tell for certain. Also, including the two EXPLAIN PLANs for the two queries would help.
For now, however, I can elaborate on a couple of things...
If an INDEX is CLUSTERED (this also applies to PRIMARY KEYs), the data is actually physically stored in the order of the INDEX. This means that knowing you want position x in the INDEX also implicitly means you want position x in the TABLE.
If the INDEX is not clustered, however, the INDEX is just providing a lookup for you. Effectively saying position x in the INDEX corresponds to position y in the TABLE.
This matters when accessing fields not specified in the INDEX. Doing so means you have to actually go to the TABLE to get the data. In the case of a CLUSTERED INDEX, you're already there, so the overhead of finding that field is pretty low. If the INDEX isn't clustered, however, you effectively have to JOIN the TABLE to the INDEX, then find the field you're interested in.
Note: Having a composite index on (id_dernier_fichier, codega) is very different from having one index on just (id_dernier_fichier) and a separate index on just (codega).
In the case of your query, I don't think you need to change the code at all. But you may benefit from changing the indexes.
You mention that you want to access many fields. Putting all those fields in a composite index is probably not the best solution. Instead you may want to create a CLUSTERED INDEX on (id_dernier_fichier). This will mean that once the *id_dernier_fichier* has been located, you're already in the right place to get all the other fields as well.
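For the two-column case from the question, the composite-index idea can be sketched as follows. This is only an illustration: the index name idx_dernier_fichier_codega is made up, the tables are assumed to exist as described in the question, and whether the optimizer actually picks the index depends on the table definitions, which were not shown.

```sql
-- Hypothetical index name; assumes the Enfants table from the question.
-- The composite index holds both the join column and the selected column,
-- so the join may be answered from the index without visiting the table rows.
ALTER TABLE Enfants
  ADD INDEX idx_dernier_fichier_codega (id_dernier_fichier, codega);

-- Compare the plan with the one obtained before adding the index.
-- "Using index" in the Extra column would indicate that the index covers the query.
EXPLAIN
SELECT e.codega
FROM Enfants e
JOIN FichiersEnfants f
  ON e.id_dernier_fichier = f.id_fichier_enfant;
```

This only helps while the list of selected columns stays short; for many columns the clustered-index approach described above is likely the more practical one.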
EDIT Note About MySQL and CLUSTERED INDEXes
13.2.10.1. Clustered and Secondary Indexes
Every InnoDB table has a special index called the clustered index where the data for the rows is stored:
If you define a PRIMARY KEY on your table, InnoDB uses it as the clustered index.
If you do not define a PRIMARY KEY for your table, MySQL picks the first UNIQUE index that has only NOT NULL columns as the primary key and InnoDB uses it as the clustered index.
If the table has no PRIMARY KEY or suitable UNIQUE index, InnoDB internally generates a hidden clustered index on a synthetic column containing row ID values. The rows are ordered by the ID that InnoDB assigns to the rows in such a table. The row ID is a 6-byte field that increases monotonically as new rows are inserted. Thus, the rows ordered by the row ID are physically in insertion order.
Related
So I am just learning about clustered/nonclustered indexes.
Now I read that clustered indexes order the data physically by i.e. the primary key.
But why would this even be necessary? Isn't the table ordered by the ID (Primary Key) by default? Because you start with record A (ID 1) then record B (ID 2) and so on. They are always sorted. Why is there a need for clustered indexes?
Tables are not sorted. While an auto incremented ID is issued in ascending order, the DBMS is free to store the record wherever there is place on the disk. And if you query table data without an ORDER BY clause, you may get the rows in any old order.
An index on the ID can be used to find these rows quickly. It is very fast to find an ID in the index and the index tells you which row to read from the table.
If your table is all about finding a row by ID quickly, which is typical for mere lookup tables, say a table with all country names, you can instead make this a clustered index.
"Clustered index" simply means that the whole table data is inside the index structure, so instead of searching the index and then going to the table row, you get the row straight away. Oracle has come up with a better name for this in my opinion; they call this an "index organized table".
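A minimal sketch of such a lookup table, assuming MySQL with InnoDB (where the PRIMARY KEY is the clustered index). The table and column names are invented for the example.

```sql
-- Hypothetical lookup table. In InnoDB the PRIMARY KEY is the clustered index,
-- so the row data (here: name) is stored inside the primary key's B-tree.
CREATE TABLE countries (
  country_id INT UNSIGNED NOT NULL,
  name       VARCHAR(100) NOT NULL,
  PRIMARY KEY (country_id)
) ENGINE=InnoDB;

INSERT INTO countries (country_id, name) VALUES
  (1, 'France'),
  (2, 'Germany'),
  (3, 'Japan');

-- A lookup by primary key reaches the full row in a single B-tree descent;
-- there is no separate step from the index to the table.
SELECT name FROM countries WHERE country_id = 2;
```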
I was wondering how MySQL would act if I partition a table by date and then run some SELECT or UPDATE queries by primary key.
Is it going to search all partitions, or does the query optimizer know in which partition the row is saved?
What about other unique and non-unique indexed columns?
Background
Think of a PARTITIONed table as a collection of virtually independent tables, each with its own data BTree and index BTree(s).
All UNIQUE keys, including the PRIMARY KEY must include the "partition key".
If the partition key is available in the query, the query will first try to do "partition pruning" to limit the number of partitions to actually look at. Without that info, it must look at all partitions.
After the "pruning", the processing goes to each of the possible partitions, and performs the query.
Select, Update
A SELECT logically does a UNION ALL of whatever was found in the non-pruned partitions.
An UPDATE applies its action to each non-pruned partition. No harm is done (except to performance) by the updates that did nothing.
Opinion
In my experience, PARTITIONing often slows things down due to things such as the above. There are a small number of use cases for partitioning: http://mysql.rjweb.org/doc.php/partitionmaint
Your specific questions
partition a table by date and then have some select or update queries by primary key ?
All partitions will be touched. The SELECT combines the one result with N-1 empty results. The UPDATE will do one update, plus N-1 useless attempts to update.
An AUTO_INCREMENT column must be the first column in some index (not necessarily the PK, not necessarily alone). So, using the id is quite efficient in each partition. But that means that it is N times as much effort as in a non-partitioned table. (This is a performance drag for partitioning.)
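The pruning behaviour described above can be checked with EXPLAIN. The following is a sketch under assumptions: the table events and its columns are invented, and the partitions column of EXPLAIN is shown by default from MySQL 5.7 on (older versions need EXPLAIN PARTITIONS).

```sql
-- Hypothetical table partitioned by date. The partition key (created)
-- has to be part of the PRIMARY KEY, and the AUTO_INCREMENT column is
-- the first column of that key.
CREATE TABLE events (
  id      BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  created DATE NOT NULL,
  payload VARCHAR(255),
  PRIMARY KEY (id, created)
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(created)) (
  PARTITION p2023 VALUES LESS THAN (TO_DAYS('2024-01-01')),
  PARTITION p2024 VALUES LESS THAN (TO_DAYS('2025-01-01')),
  PARTITION pmax  VALUES LESS THAN MAXVALUE
);

-- No partition key in the WHERE clause: every partition has to be probed.
EXPLAIN SELECT * FROM events WHERE id = 12345;

-- Partition key supplied: pruning can restrict the lookup to one partition.
EXPLAIN SELECT * FROM events WHERE id = 12345 AND created = '2024-06-15';
```

Comparing the partitions column of the two plans shows how much work the missing partition key costs.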
I am working on a database with large number of rows (6 Mil+).
This table has a composite primary key on two columns.
It also has separate index on each of those fields as there are queries that require this. Obviously, one of those indexes (indices?) is redundant and slowing down performance for write operations.
How do I find out which one is redundant? I understand the first column of a primary key is already indexed and need not be indexed separately. Is that correct? If so, is there a query I can run to find out which is the first one in the list?
SHOW INDEXES FROM tablename will include a Seq_in_index column, which tells you which is first (aka, left most) column, second column, etc.
Therefore, whichever column of the primary key is listed with a value of 1 for Seq_in_index is the column that does not need its own single-column index.
You can also use SHOW CREATE TABLE tablename to see the index listed from left to right, and that order displayed correctly represents the order of columns in the index.
SHOW CREATE TABLE tablename gives you all the indexes, in their established order.
You don't need INDEX(a) because its column is the leading column of INDEX(a,b).
That applies whether (a,b) is defined as an INDEX, a UNIQUE key, or the PRIMARY KEY.
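As a sketch of the whole procedure, here is a made-up table that reproduces the situation from the question (composite primary key plus one single-column index per key column). All names (t, idx_a, idx_b) are hypothetical.

```sql
-- Hypothetical table mirroring the question's setup.
CREATE TABLE t (
  a INT NOT NULL,
  b INT NOT NULL,
  PRIMARY KEY (a, b),
  INDEX idx_a (a),
  INDEX idx_b (b)
) ENGINE=InnoDB;

-- For Key_name = PRIMARY, the row with Seq_in_index = 1 names the
-- leading column of the primary key (here: a).
SHOW INDEXES FROM t;

-- idx_a duplicates the leading column of the primary key, so it is the
-- redundant one. idx_b is still needed for queries that filter on b alone.
ALTER TABLE t DROP INDEX idx_a;
```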
I understand the first column of a primary key is already indexed
Erm, no. All the columns in the primary key are indexed.
An explanation of how indexes work is stretching the scope of a post here, and the question of which indexes to put on your table is way too broad.
Suppose you have a primary key defined on attributes a,b,c. This index can be used for queries with predicates
a
a and b
a and b and c
But (at least, the last time I checked) it would not be used for a query with predicates
b
b and c
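One way to check this on your own server is to compare EXPLAIN output for each combination of predicates. The table demo_abc below is invented for the purpose; the extra column d is there so that the primary key alone does not contain every selected column.

```sql
-- Hypothetical table with a three-column primary key.
CREATE TABLE demo_abc (
  a INT NOT NULL,
  b INT NOT NULL,
  c INT NOT NULL,
  d VARCHAR(50),
  PRIMARY KEY (a, b, c)
) ENGINE=InnoDB;

-- Predicates that start with the leading column: the primary key
-- can be used to locate the rows.
EXPLAIN SELECT d FROM demo_abc WHERE a = 1;
EXPLAIN SELECT d FROM demo_abc WHERE a = 1 AND b = 2;
EXPLAIN SELECT d FROM demo_abc WHERE a = 1 AND b = 2 AND c = 3;

-- Predicates that skip the leading column: the key cannot be used
-- to seek directly, so expect a scan instead.
EXPLAIN SELECT d FROM demo_abc WHERE b = 2;
EXPLAIN SELECT d FROM demo_abc WHERE b = 2 AND c = 3;
```

The type and key columns of each plan show whether the optimizer does an index lookup or falls back to scanning.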
The optimizer will usually use only one index for each table in a query (MySQL's index merge optimization is an exception).
The right indexes depend on the volume of data, the cardinality of the data, and the frequency and combination of predicates in your queries. There are execution and storage overheads when you start adding indexes. Even for select operations alone, badly designed indexes can make your query slower than it would run without indexes.
I have a table with two partitions. The partitions are pactive = 1 and pinactive = 0. I understand that two partitions do not give much of a gain, but I use them to truncate and load into one partition and do plain inserts into the other.
The problem comes when I create indexes.
The query goes like this:
select partitionflag,companyid,activityname
from customformattributes
where companyid=47
and activityname = 'Activity 1'
and partitionflag=0
The index I created:
create index idx_try on customformattributes(partitionflag,companyid,activityname,completiondate,attributename,isclosed)
Around 200000 records will be retrieved by the above query. But the query, even with the mentioned index, takes 30+ seconds. What is the reason for such a long time? Also, if I remove partitionflag from the mentioned index, the index is not even used.
Also, is my understanding correct that, even with partitions in place, the partition column has to be part of the index definition so that the optimizer only hits the required partition?
Any help in understanding this would be much appreciated.
You can optimize your index by reordering the columns in it. Usually the columns in an index are ordered by their cardinality (starting from the highest and going down to the lowest). Cardinality is a measure of how unique the data in a given column is. So in your case I suppose there are many different values of companyid in the customformattributes table, while partitionflag will have a cardinality of 2 (if the only options for this column are 1 and 0).
With your current index, the query will first filter all the rows with partitionflag=0, then filter by companyid, and so on.
When you removed partitionflag from the index, the query did not use the index, perhaps because the optimizer decided that a full table scan would be faster than using the index (in most cases the optimizer is right).
For the given query:
select partitionflag,companyid,activityname
from customformattributes
where companyid=47
and activityname = 'Activity 1'
and partitionflag=0
the following index may be better:
create index idx_try on customformattributes(companyid,activityname, completiondate,attributename, partitionflag, isclosed)
For the query to use an index, the following rule must be met: the leftmost column in the index should be present in the WHERE clause. Depending on the MySQL version you are using, additional query requirements may apply. For example, with an old version of MySQL you may need to order the columns in the WHERE clause in the same order as they are listed in the index. In recent versions of MySQL the query optimizer takes care of ordering the columns in the WHERE clause correctly.
Your SELECT query took 30+ seconds because it returns 200k rows and because the index might not be optimal for the given query.
For the second question, about partitioning: the common rule is that the column you are partitioning by must be part of all the UNIQUE keys in a table (the primary key is also a unique key by definition, so the column has to be added to the PK as well). If the table structure and logic allow you to add the partitioning column to all the UNIQUE indexes in the table, then you add it and partition the table.
When the partitioning is done correctly you can take advantage of partition pruning: this is when a SELECT query searches only the partitions where the given data is stored (otherwise it looks in all partitions).
You can read more about partitioning here:
https://dev.mysql.com/doc/refman/5.6/en/partitioning-overview.html
The query is slow simply because disks are slow.
Cardinality is not important when designing an index.
The optimal index for that query is
INDEX(companyid, activityname, partitionflag) -- in any order
It is "covering" since it includes all the columns mentioned anywhere in the SELECT. This is indicated by "Using index" in the EXPLAIN.
Leaving off the other 3 columns makes the query faster because it will have to read less off the disk.
If you make any changes to the query (add columns, change from '=' to '>', add ORDER BY, etc), then the index may no longer be optimal.
"Also, if remove the partitionflag from the mentioned index, the index is not even used." -- That is because it was no longer "covering".
Keep in mind that there are two ways an index may be used -- "covering" versus being a way to look up the data. When you don't have a "covering" index, the optimizer chooses between using the index and bouncing between the index and the data versus simply ignoring the index and scanning the table.
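A sketch of how to try this out, assuming the customformattributes table from the question. The index name idx_covering is made up, and the old idx_try index is left in place so the two can be compared.

```sql
-- Hypothetical index name; contains exactly the three columns the query touches.
CREATE INDEX idx_covering
  ON customformattributes (companyid, activityname, partitionflag);

-- "Using index" in the Extra column means the query is answered from the
-- index alone (covering), without reading the table rows.
EXPLAIN
SELECT partitionflag, companyid, activityname
FROM customformattributes
WHERE companyid = 47
  AND activityname = 'Activity 1'
  AND partitionflag = 0;
```

If a later version of the query selects a column that is not in the index, the plan should be checked again, since the index would then no longer be covering.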
Let's use lastName as an example.
Assuming that there are no duplicate last names in your DB (by chance, not because of a unique), would there be any benefit to indexing this lastName column?
The query that would be used to search would be something like SELECT * FROM t WHERE lastName='Smith'.
If every entry in the column is unique, then how can an index have an effect? Wouldn't it have to search every entry regardless?
Sorry, I am just learning about indexing and I would really like to understand it better.
Thanks.
Yes, there is a benefit in indexing even if the column values are unique. In the index the values are not only unique but they are also organised in a tree structure that lets you search for a row with O(log N) complexity.
There is a great article in Wikipedia about it: Database Index
...
The data is present in arbitrary order, but the logical ordering is specified
by the index. The data rows may be spread throughout the table
regardless of the value of the indexed column or expression. The
non-clustered index tree contains the index keys in sorted order, with
the leaf level of the index containing the pointer to the record (page
and the row number in the data page in page-organized engines; row
offset in file-organized engines).
In a non-clustered index
The physical order of the rows is not the same as the index order. The
indexed columns are typically non-primary key columns used in JOIN,
WHERE, and ORDER BY clauses. There can be more than one non-clustered
index on a database table.
...
Consider the following SQL statement:
SELECT first_name FROM people WHERE last_name = 'Smith';
To process this statement without an index
the database software must look at the last_name column on every row
in the table (this is known as a full table scan). With an index the
database simply follows the B-tree data structure until the Smith
entry has been found; this is much less computationally expensive than
a full table scan.
Generally speaking, the more unique values there are in a column, or the higher its cardinality (see "What is cardinality in MySQL?"), the more useful an index will be on that column.
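To see the difference yourself, the plan can be compared before and after adding the index. The people table below follows the Wikipedia example; its exact definition and the index name idx_last_name are assumptions made for the sketch.

```sql
-- Hypothetical table matching the example query.
CREATE TABLE people (
  person_id  INT UNSIGNED NOT NULL AUTO_INCREMENT,
  first_name VARCHAR(50) NOT NULL,
  last_name  VARCHAR(50) NOT NULL,
  PRIMARY KEY (person_id)
) ENGINE=InnoDB;

-- Without an index on last_name the only option is a full table scan.
EXPLAIN SELECT first_name FROM people WHERE last_name = 'Smith';

-- With the index, the B-tree is descended to the 'Smith' entry, even if
-- every last name in the table happens to be unique.
CREATE INDEX idx_last_name ON people (last_name);
EXPLAIN SELECT first_name FROM people WHERE last_name = 'Smith';
```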