One composite index or many indexes for foreign keys? - mysql

What is the difference between creating one composite index over all the foreign keys of a relation table and creating one index for each column (foreign key) of the relation table?
For instance, I have the table sales(p_id, e_id, c_id, amount) where p_id is a foreign key (products table), e_id is a foreign key (employee table) and c_id is a foreign key (customer_table). The primary key of the table is {p_id, e_id, c_id}.
Which one is better?
CREATE INDEX cmpindex ON sales(p_id, e_id, c_id);
OR
CREATE INDEX pindex ON sales(p_id);
CREATE INDEX eindex ON sales(e_id);
CREATE INDEX cindex ON sales(c_id);
I mostly run queries with joins on the relation table and the parent tables.

Which one is better depends on your actual queries.
One thing to understand is that when you join the table sales once in your query, the engine will generally use at most one index for it. So you need to make sure that the index most appropriate for the query is available.
If you join the sales table always to all three other tables (customer, product and employee) then a composite index is to be preferred, assuming that the engine will actually use it and not perform a table scan.
The order of the fields in the composite index is important when it comes to the order of the results. For instance, if your query is going to group the results by product (first), and then order the details per customer, you could benefit from an index that has the product id first, and the customer id as second.
But it may also be that the engine decides that it is better to start scanning the table sales first and then join in the other three tables using their respective primary key indexes. In that case no index is used that exists on the sales table.
The only way to find out is to get the execution plan of your query and see which indexes will be used when they are defined.
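As a rough sketch of checking the plan: the parent table and column names below (products.p_id, employee.e_id, customer.c_id) are assumptions, since the question does not spell them out.

```sql
-- Hypothetical parent tables and key columns; adjust to the real schema.
EXPLAIN
SELECT s.p_id, s.e_id, s.c_id
FROM sales AS s
JOIN products AS p ON p.p_id = s.p_id
JOIN employee AS e ON e.e_id = s.e_id
JOIN customer AS c ON c.c_id = s.c_id
WHERE s.c_id = 42;
-- In the output, `possible_keys` lists the candidate indexes for each table,
-- `key` names the one actually chosen (NULL means none), and `rows` is the
-- optimizer's estimate of how many rows it will examine.
```

Running this once with only the composite index defined and once with the three single-column indexes shows which definition the optimizer actually picks for this query.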
If you only have one query on the sales table, there is no need to have several indexes. But more likely you have several queries which output completely different results, with different field selections, filters, groupings, etc.
In that case you may need several indexes, some of which will serve for one type of query, and others for others. Note that what you propose is not mutually exclusive. You could maybe benefit from several composite indexes, which just have a different order of fields.
Obviously, a multitude of indexes will slow down data changes in those tables, so you need to consider that trade-off as well.

Note that an index on a compound key will only be used if you query on the first portion, the first and second portions, or the first, second and third portions, and so on. So querying on p_id, or on p_id and e_id, or even on e_id and p_id (the order in the WHERE clause does not matter) can utilize the index. Indeed, any query that filters on p_id can use this index.
However, if you query your sales table on e_id or c_id or any combination of these two, cmpindex will not be used and a full table scan will be performed.
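The leftmost-prefix rule described above can be tried out with a few EXPLAIN statements. This sketch assumes cmpindex from the question is the index under test; the literal values are arbitrary.

```sql
-- Assumes: CREATE INDEX cmpindex ON sales (p_id, e_id, c_id);

-- These filter on a leftmost prefix of the index, so cmpindex is a candidate.
EXPLAIN SELECT * FROM sales WHERE p_id = 1;
EXPLAIN SELECT * FROM sales WHERE e_id = 2 AND p_id = 1;  -- WHERE order is irrelevant
EXPLAIN SELECT * FROM sales WHERE p_id = 1 AND e_id = 2 AND c_id = 3;

-- These omit the leading column p_id, so the index cannot be used to seek.
EXPLAIN SELECT * FROM sales WHERE e_id = 2;
EXPLAIN SELECT * FROM sales WHERE e_id = 2 AND c_id = 3;
```

Comparing the `key` and `type` columns across the two groups shows the difference between an index lookup and a scan.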
One benefit of having an index on each foreign key (a non-unique index, as there could be multiple sales of the same product, or by the same employee, or to the same customer, leading to duplicate entries in the index) is that the query optimizer has the option of using the index to reduce the number of rows returned, and then doing a sequential search through the result set.
E.g. if the query is a search on sales of a particular product to a particular customer (regardless of employee) and you have a million sales, the foreign key index cindex could be used to return 20 sales items to that particular customer, and that result set could be very efficiently searched sequentially to find which of these sales were for a particular product.
If the search was performed on Product and pindex was used, the result set may be 10,000 rows (all sales of that product), which would have to be sequentially searched to find the sales of that product to a particular customer, leading to a very inefficient query.
I believe that the statistics kept for a table (used by the optimizer) keep track of the average number of rows that will be returned for a query using each index, so the optimizer will be able to work out that cindex should be used rather than pindex in the examples above. Alternatively, you can give hints on your queries to specify that a particular index be used.
It is, obviously, important to refresh these statistics on a regular basis (in MySQL with ANALYZE TABLE; UPDATE STATISTICS is the SQL Server equivalent), as the execution plan would use pindex in the example above if there were, on average, only 10 sales of each product.
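A minimal sketch of refreshing and inspecting the statistics in MySQL, using the sales table from the question:

```sql
-- Recompute the key distribution statistics the optimizer relies on.
ANALYZE TABLE sales;

-- Inspect the result: Cardinality is the estimated number of distinct values
-- in each index. An index with higher cardinality returns fewer rows per
-- lookup on average, so it is generally the more selective choice.
SHOW INDEX FROM sales;
```

If cindex shows a much higher cardinality than pindex, that matches the example above, where a customer lookup returns far fewer rows than a product lookup.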

If your queries search sales through each of the parent tables independently, then you should create a separate index for each foreign key.
If that's not necessary, then you can go for the composite index.

As HoneyBadger commented, you already have a composite index, since your primary key is itself an index.
In general, you should use a single index for each column whenever you think you will have queries involving each field by itself.
When you have a composite index, it can work with queries involving all fields, or with queries involving the first field (in order), the first and the second, or the first, second and third together. It won't be used in queries involving only the second and third fields.

The other answers are missing an important point. When you declare a foreign key in MySQL, it creates an index on the column. This is not (necessarily) true in other databases, but it is true in MySQL.
So, the declaration automatically creates these indexes:
CREATE INDEX pindex on sales(p_id);
CREATE INDEX eindex on sales(e_id);
CREATE INDEX cindex on sales(c_id);
(These indexes are very handy for dealing with cascading constraints and maintaining the data integrity based on the foreign key.)
If you happen to have also declared an index on sales(p_id, e_id, c_id, amount), then the first of the indexes is not needed -- it is a subset of this index. However, the other two are needed.
Is this index needed? As mentioned in the other answers, that depends on the queries you want to use the index for. I recommend starting with the documentation on this subject to understand how the indexes get used.
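To see which indexes actually exist on the table after the foreign keys are declared, something like the following can be used. It is only a sketch and assumes the sales table lives in the currently selected database.

```sql
-- Full table definition, including index and constraint names.
SHOW CREATE TABLE sales;

-- The same information in tabular form: one row per indexed column,
-- ordered by its position within each index.
SELECT index_name, seq_in_index, column_name, non_unique
FROM information_schema.statistics
WHERE table_schema = DATABASE()
  AND table_name = 'sales'
ORDER BY index_name, seq_in_index;
```

An index whose columns are a leftmost prefix of another index on the same table is a candidate for removal.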

Related

Mysql select by auto increment primary key while partitioned by date

I was wondering how MySQL would act if I partition a table by date and then run some SELECT or UPDATE queries by primary key.
Is it going to search all partitions, or does the query optimizer know in which partition the row is saved?
What about other unique and non-unique indexed columns?
Background
Think of a PARTITIONed table as a collection of virtually independent tables, each with its own data BTree and index BTree(s).
All UNIQUE keys, including the PRIMARY KEY must include the "partition key".
If the partition key is available in the query, the query will first try to do "partition pruning" to limit the number of partitions to actually look at. Without that info, it must look at all partitions.
After the "pruning", the processing goes to each of the possible partitions, and performs the query.
Select, Update
A SELECT logically does a UNION ALL of whatever was found in the non-pruned partitions.
An UPDATE applies its action to each non-pruned partition. No harm is done (except performance) by the updates that did nothing.
Opinion
In my experience, PARTITIONing often slows things down due to things such as the above. There are a small number of use cases for partitioning: http://mysql.rjweb.org/doc.php/partitionmaint
Your specific questions
partition a table by date and then have some select or update queries by primary key ?
All partitions will be touched. The SELECT combines the one result with N-1 empty results. The UPDATE will do one update, plus N-1 useless attempts to update.
An AUTO_INCREMENT column must be the first column in some index (not necessarily the PK, not necessarily alone). So, using the id is quite efficient in each partition. But that means that it is N times as much effort as in a non-partitioned table. (This is a performance drag for partitioning.)
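The pruning behaviour can be observed with a small test table. Everything in this sketch (table name, columns, partition boundaries) is made up for illustration.

```sql
-- Hypothetical table; names and boundaries are illustrative only.
CREATE TABLE events (
  id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  created_on DATE NOT NULL,
  payload    VARCHAR(255),
  -- Every UNIQUE key, including the PK, must contain the partition column.
  PRIMARY KEY (id, created_on)
)
PARTITION BY RANGE (TO_DAYS(created_on)) (
  PARTITION p2023 VALUES LESS THAN (TO_DAYS('2024-01-01')),
  PARTITION p2024 VALUES LESS THAN (TO_DAYS('2025-01-01')),
  PARTITION pmax  VALUES LESS THAN MAXVALUE
);

-- No partition key in the WHERE clause: every partition has to be probed.
EXPLAIN SELECT * FROM events WHERE id = 1000;

-- Partition key supplied: pruning can narrow the search.
EXPLAIN SELECT * FROM events WHERE id = 1000 AND created_on = '2024-06-15';
```

The `partitions` column of the EXPLAIN output lists the partitions that will be touched (on versions before MySQL 5.7, use EXPLAIN PARTITIONS to get that column).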

MYSQL index optimization for table that stores relationship between 2 other tables

My question is regarding database structuring for a table that links 2 other tables for storing the relationship.
For example, I have 3 tables: users, locations, and users_locations.
The users and locations tables both have an id column.
The users_locations table has the user_id and location_id from the other 2 tables.
How do you define your indexes/constraints on these tables to efficiently answer questions such as what locations does this user have, or what users belong to this location?
E.g.:
select user_id from users_locations where location_id = 5;
or
select location_id from users_locations where user_id = 5;
Currently, I do not have a foreign key constraint set, which I assume I should add, but does that automatically speed up the queries or create an index?
I don't think I can create an index on each column since there will be duplicates, e.g. multiple user_id entries for each location, and vice versa.
Will adding a composite key like PRIMARY KEY (user_id, location_id) speed up queries when most queries only have half of the key?
Is there any reason to just set an AUTO_INCREMENT PRIMARY KEY field on this table when you will never query by that id?
Do I really even need to set a PRIMARY KEY?
Basically, for any table, the decision to create an index or not depends entirely on the use cases you need to support. Indexes should always be created per use case, not as a nice-to-have.
For the particular queries that you have mentioned, separate indexes on each of the two columns are good enough; that is, the query doesn't need to go to your rows to fetch the information.
Creating a foreign key on a table column automatically creates an index, so you need not create indexes yourself if you decide to set up foreign keys.
If you keep an auto-increment key as the primary key, you will still have to make the user_id and location_id combination unique; otherwise you will bloat your table with duplicates. So keeping a separate auto-increment key doesn't make sense in your use case. However, if you want to keep track of each visit to a location and save the user experience each time, then an auto-increment primary key will be required.
However, I would like to point out that creating indexes does not guarantee that your queries will use them unless you specify them explicitly. For a single query there can be many execution plans, and the most efficient one may not use an index.
The optimal indexes for a many-to-many mapping table:
PRIMARY KEY (aid, bid),
INDEX(bid, aid)
More discussion and more tips: http://mysql.rjweb.org/doc.php/index_cookbook_mysql#many_to_many_mapping_table
(Comments on specific points in the Question)
FOREIGN KEYs implicitly create indexes, unless an explicit index has already been provided.
Composite indexes are better for many-to-many tables.
A FOREIGN KEY involves an integrity check, so it is inherently slower than simply having the index. (And the integrity check for this kind of table is of dubious value.)
There is no need for an AUTO_INCREMENT on a many:many table. However, ...
It is important to have a PRIMARY KEY on every table. The pair of columns is fine as a "natural" PRIMARY KEY.
A WHERE clause would like to use the first column(s) of some index; don't worry that it is not using all the columns.
In EXPLAIN you sometimes see "Using index". This means that a "covering index" was used. That means that all the columns used in the SELECT were found in that one index -- without having to reach into the data to get more columns. This is a performance boost. And it necessitates two two-column indexes (one is the PK, one is a plain INDEX).
With InnoDB, any 'secondary' index (INDEX or UNIQUE) implicitly includes the columns of the PK. So, given PRIMARY KEY(a,b), INDEX(b), that secondary index is effectively INDEX(b,a). I prefer to spell out the two columns to point out to the reader that I deliberately wanted those two columns in that order.
Hopefully, the above link will answer any further questions.
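Applied to the tables in the question, the recommended pair of indexes might look like the sketch below. The column types and the index name are assumptions; the question only gives the column names.

```sql
-- Hypothetical DDL for the users_locations table from the question.
CREATE TABLE users_locations (
  user_id     INT UNSIGNED NOT NULL,
  location_id INT UNSIGNED NOT NULL,
  -- Serves "which locations does this user have?" and forbids duplicate pairs.
  PRIMARY KEY (user_id, location_id),
  -- Serves "which users belong to this location?"
  INDEX location_user (location_id, user_id)
) ENGINE=InnoDB;

-- Each query filters on the leading column of one of the two indexes
-- and selects only columns contained in that index.
SELECT location_id FROM users_locations WHERE user_id = 5;
SELECT user_id FROM users_locations WHERE location_id = 5;
```

Running EXPLAIN on either SELECT should let you confirm whether "Using index" appears in the Extra column.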

Combined Index performance with optional where clause

I have a table with the following columns:
id-> PK
customer_id-> index
store_id-> index
order_date-> index
last_modified-> index
other_columns...
other_columns...
I have three single-column indexes. I also have a customer_id_store_id index, which is a foreign key constraint referencing other tables.
id, customer_id and store_id are char(36), which holds a UUID. order_date is a datetime and last_modified is a UNIX timestamp.
I want to gain some performance by removing all the indexes and adding one on (customer_id, store_id, order_date). Most queries will have these fields in the where clause. But sometimes the store_id will not be needed.
What is the best approach: adding "store_id IS NOT NULL" to the where clause, or creating the index this way: (customer_id, order_date, store_id)?
I also frequently need to query the table by the last_modified field (the where clause includes customer_id=, store_id=, last_modified>).
As I only have a single-column index on it and there are hundreds of customers inserting into and updating the table, the index often scans more rows than necessary. Is it better to create another index (customer_id, store_id, last_modified) or leave it as it is? Or should I add this column to the previous index, making it a four-column composite index? But then again, order_date is irrelevant here, and omitting it might result in the index not being used as intended.
The query works fast for customers that don't have many rows, possibly using the customer_id index there. But for customers with a large amount of data, this isn't optimal. More often I need only a few days of data.
Can anyone please advise what the best index is in this scenario?
It is true that lots of single column indexes on a MySQL table are generally considered harmful.
A query with
WHERE customer_id=constant AND store_id=constant AND last_modified>=constant
will be accelerated by an index on (customer_id, store_id, last_modified). Why? The MySQL query planner can random-access the index to the first item it needs to retrieve, then scan the index sequentially. That same index works for
WHERE customer_id=constant AND store_id=constant
AND last_modified>=constant
AND last_modified< constant + INTERVAL 1 DAY
BUT, that index will not be useful for a query with just
WHERE store_id=constant AND last_modified>constant
or
WHERE customer_id=constant AND store_id IS NOT NULL AND last_modified>=constant
For the first of those query patterns you need (store_id, last_modified) to achieve the ability to sequentially scan the index.
The second of those query patterns requires two different range searches. One is something IS NOT NULL. That's a range search because it has to romp through all the non-null values in the column. The second range search is last_modified>=constant. That's a range search, because it starts with the first value of last_modified that meets the given criterion, and scans to the end of the index.
MySQL indexes are B-trees. That means, essentially, that they're sorted into a particular single order. So, an index is best for accelerating queries that require just one range search. So, the second query pattern is inherently hard to satisfy with an index.
A table can have multiple compound indexes designed to satisfy multiple different query patterns. That's usually the strategy for making large tables work well in practical applications. Each index imposes a little bit of performance penalty on updates and inserts. Indexes also take storage space. But storage is very cheap these days.
If you want to use a compound index to search on multiple criteria, these things must be true:
all but one of the criteria must be equality criteria like store_id = constant.
one criterion can be a range-scan criterion like last_modified >= constant or something IS NOT NULL.
the columns in the index must be ordered so that the columns involved in equality criteria all appear first, then the column involved in the range-scan criterion.
you may mention other columns after the range scan criterion. But they make up part of a covering index strategy (beyond the scope of this post).
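The rules above could translate into something like the following sketch. The table name orders, the index names and the UUID literals are hypothetical; the columns are the ones listed in the question.

```sql
-- Equality columns first (customer_id, store_id), range column last.
ALTER TABLE orders
  ADD INDEX cust_store_modified (customer_id, store_id, last_modified);

-- Two equality criteria plus one range criterion: matches the index order.
SELECT id, order_date
FROM orders
WHERE customer_id = '11111111-1111-1111-1111-111111111111'
  AND store_id    = '22222222-2222-2222-2222-222222222222'
  AND last_modified >= UNIX_TIMESTAMP(NOW() - INTERVAL 3 DAY);

-- A separate index for the pattern that filters on store_id alone.
ALTER TABLE orders
  ADD INDEX store_modified (store_id, last_modified);
```

Whether the single-column indexes can then be dropped depends on the remaining query patterns; EXPLAIN on each frequent query is the way to check.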
http://use-the-index-luke.com/ is a good basic intro to the black art of indexing.

Should I use multiple index method if indexed fields are also foreign keys?

After I added foreign keys, MySQL forced indexes onto the key columns, which had previously been indexed with a multiple-column index. I use InnoDB.
Here is a structure of my table:
id, company_id, student_id ...
company_id and student_id had been indexed using:
ALTER TABLE `table` ADD INDEX `multiple_index` (`company_id`,`student_id`)
Why do I use a multiple-column index? Because most of the time my query is:
SELECT * FROM `table` WHERE company_id = 1 AND student_id = 3
Sometimes I just fetch rows by student_id:
SELECT * FROM `table` WHERE student_id = 3
After adding foreign keys for company_id and student_id, MySQL indexed both of them separately. So now I have both the multiple-column index and separately indexed fields.
My question is: should I drop the multiple-column index?
It depends. If the same student belongs to many companies, no, don't drop it. When querying for company_id = 1 AND student_id = 3, the optimizer has to pick one index, and after that, it will either have to check multiple students or multiple companies.
My gut tells me this won't be the case, though; students won't be associated with more than ~10 companies, so scanning over the index a bit won't be a big deal. That said, this is a lot more brittle than having the index on both columns. In that case, the optimizer knows what the right thing to do is, here. When it has two indices to pick from, it might not, and it might not in the future, so you should FORCE INDEX to make sure it uses the student_id index.
The other thing to consider is how this table is used. If it's rarely written to but read frequently, there's not much of a penalty beyond space for the extra index.
TL;DR: the index is not redundant. Whether or not you should keep it is complicated.
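If the composite index were dropped, the index hint suggested above might look like this sketch. The index name student_id is a guess; MySQL chooses the name of an implicitly created foreign-key index, so look it up with SHOW INDEX first.

```sql
-- Find the real name of the index on student_id.
SHOW INDEX FROM `table`;

-- `student_id` below is a hypothetical index name taken from that output.
SELECT *
FROM `table` FORCE INDEX (student_id)
WHERE company_id = 1
  AND student_id = 3;
```

The hint ties the query to an index name, so it has to be updated if the index is ever renamed or dropped.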

Best way to index a table with a unique multi-column?

I am creating a table which will store around 100 million rows in MySQL 5.6 using the InnoDB storage engine. This table will have a foreign key that will link to another table with around 5 million rows.
Current Table Structure:
`pid`: [Foreign key from another table]
`price`: [decimal(9,2)]
`date`: [date field]
Every pid should have only one record per date.
What is the best way to create indexes on this table?
Option #1: Create Primary index on two fields pid and date
Option #2: Add another column id with AUTO_INCREMENT and primary index and create a unique index on column pid and date
Or any other option?
The only select query I will be using on this table is:
SELECT pid,price,date FROM table WHERE pid = 123
Based on what you said (100M; the only query is...; InnoDB; etc):
PRIMARY KEY(pid, date);
and no other indexes
Some notes:
Since it is InnoDB, all the rest of the fields are "clustered" with the PK, so a lookup by pid acts as if price were part of the PK. Also WHERE pid=123 ORDER BY date would be very efficient.
No need for INDEX(pid, date, price)
Adding an AUTO_INCREMENT gains nothing (except a hint of ordering). If you needed ordering, then an index starting with date might be best.
Extra indexes slow down inserts. Especially UNIQUE ones.
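A sketch of that layout follows. The table name price_history and the type of pid are assumptions; price and date use the types given in the question.

```sql
-- Hypothetical table name; pid type is assumed.
CREATE TABLE price_history (
  pid    INT UNSIGNED NOT NULL,
  price  DECIMAL(9,2) NOT NULL,
  `date` DATE NOT NULL,
  -- One row per pid per date; InnoDB stores the rows clustered in this order.
  PRIMARY KEY (pid, `date`)
) ENGINE=InnoDB;

-- All rows for one pid sit next to each other in the clustered index,
-- already ordered by date.
SELECT pid, price, `date`
FROM price_history
WHERE pid = 123
ORDER BY `date`;
```

The foreign key on pid is left out here to keep the sketch self-contained; adding it should not require another index, because the primary key already starts with pid.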
Either method is fine. I prefer having synthetic primary keys (that is, the auto-incremented version with the additional unique index). I find that this is useful for several reasons:
You can have a foreign key relationship to the table.
You have an indicator of the order of insertion.
You can change requirements, so if some pids allow two values per day or only one per week, then the table can support them.
That said, there is additional overhead for such a column. This overhead adds space and a small amount of time when you are accessing the data. You have a pretty large table, so you might want to avoid this additional effort.
I would try an index that attempts to cover the query, in the hope that MySQL only has to access the index in order to get the result set.
ALTER TABLE `table` ADD INDEX `pid_date_price` (`pid` , `date`, `price`);
or
ALTER TABLE `table` ADD INDEX `pid_price_date` (`pid` , `price`, `date`);
Choose the first one if you think you may need to select applying conditions over pid and date in the future, or the second one if you think the conditions will be most probable over pid and price.
This way, the index has all the data the query needs (pid, price and date) and it is indexed on the right column (pid).
By the way, always use EXPLAIN to see if the query planner will really use the whole index (take a look at the key and key_len outputs).