MYSQL: Index keyword in create table and when to use it

What does the INDEX keyword mean, and what function does it serve? I understand that it is meant to speed up querying, but I am not sure how this is done.
And how do I choose which column should be indexed?
A sample of INDEX keyword usage is shown below in a CREATE TABLE query:
CREATE TABLE `blog_comment`
(
`id` INTEGER NOT NULL AUTO_INCREMENT,
`blog_post_id` INTEGER,
`author` VARCHAR(255),
`email` VARCHAR(255),
`body` TEXT,
`created_at` DATETIME,
PRIMARY KEY (`id`),
INDEX `blog_comment_FI_1` (`blog_post_id`),
CONSTRAINT `blog_comment_FK_1`
FOREIGN KEY (`blog_post_id`)
REFERENCES `blog_post` (`id`)
)Type=MyISAM
;

I'd recommend reading How MySQL Uses Indexes from the MySQL Reference Manual. It states that indexes are used...
To find the rows matching a WHERE clause quickly.
To eliminate rows from consideration.
To retrieve rows from other tables when performing joins.
To find the MIN() or MAX() value for a specific indexed column.
To sort or group a table (under certain conditions).
To optimize queries using only indexes without consulting the data rows.
Indexes in a database work like an index in a book. You can find what you're looking for in a book more quickly because the index is listed alphabetically. Instead of an alphabetical list, MySQL uses B-trees to organize its indexes, which is quicker for its purposes (but would take a lot longer for a human).
Using more indexes means using up more space (as well as the overhead of maintaining the index), so it's only really worth using indexes on columns that fulfil the above usage criteria.
In your example, the id and blog_post_id columns both use indexes (PRIMARY KEY is an index too) so that the application can find rows by them more quickly. In the case of id, it is likely that this allows users to modify or delete a comment quickly, and in the case of blog_post_id, it lets the application quickly find all comments for a given post.
You'll notice that there is no index for the email column. This means that searching for all comments by a particular e-mail address would probably take quite a long time. If searching for all comments by a particular e-mail address is something you'd want to add, it might make sense to add an index to that column too.
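As a hedged sketch, such an index could be added later like this (the index name here is made up):
ALTER TABLE `blog_comment` ADD INDEX `blog_comment_email_idx` (`email`);
-- After this, queries filtering on email (e.g. WHERE email = 'a@example.com')
-- can use the index instead of scanning every comment.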

This keyword means that you are creating an index on column blog_post_id along with the table.
Queries like this:
SELECT *
FROM blog_comment
WHERE blog_post_id = #id
will use this index to search on this field and run faster.
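You can confirm this with EXPLAIN; a minimal sketch (the literal 42 is just a placeholder id):
EXPLAIN SELECT * FROM blog_comment WHERE blog_post_id = 42;
-- The key column of the output should show blog_comment_FI_1, meaning the
-- lookup goes through the index rather than a full table scan.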
Also, there is a foreign key on this column.
When you decide to delete a blog post, the database will need to check against this table to make sure there are no orphan comments. The index also speeds up this check, so queries like
DELETE
FROM blog_post
WHERE ...
will also run faster.

Related

MySQL: Slow SELECT because of Index / FKEY?

Dear StackOverflow Members
It's my first post, so please be nice :-)
I have a strange SQL behavior which I can't explain, and I can't find any resources that explain it.
I have built a web honeypot which records all accesses and attacks and displays them on a statistics page.
However, as the data has grown, the generation of the statistics page is getting slower and slower.
I narrowed it down to some SELECT statements which take quite a long time.
The "issue" seems to be an index on a specific column.
*For sure the real issue is my lack of knowledge :-)
Database: mysql
DB schema
Event Table (removed unrelated columns):
Event table size: 30MB
Event table records: 335k
CREATE TABLE `event` (
`EventID` int(11) NOT NULL,
`EventTime` datetime NOT NULL DEFAULT current_timestamp(),
`WEBURL` varchar(50) COLLATE utf8_bin DEFAULT NULL,
`IP` varchar(15) COLLATE utf8_bin NOT NULL,
`AttackID` int(11) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
ALTER TABLE `event`
ADD PRIMARY KEY (`EventID`),
ADD KEY `AttackID` (`AttackID`);
ALTER TABLE `event`
ADD CONSTRAINT `event_ibfk_1` FOREIGN KEY (`AttackID`) REFERENCES `attack` (`AttackID`);
Attack Table
attack table size: 32KB
attack Table records: 11
CREATE TABLE attack (
`AttackID` int(4) NOT NULL,
`AttackName` varchar(30) COLLATE utf8_bin NOT NULL,
`AttackDescription` varchar(70) COLLATE utf8_bin NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
ALTER TABLE `attack`
ADD PRIMARY KEY (`AttackID`);
SLOW Query:
SELECT Count(EventID), IP
FROM event
WHERE AttackID > 0
GROUP BY IP
ORDER BY Count(EventID) DESC
LIMIT 5;
RESULT: 5 rows in set (1.220 sec)
(This seems quite long to me for a simple query.)
(screenshot: QuerySlow)
Now the Strange thing:
If I remove the foreign key relationship the performance of the query is the same.
But if I remove the index on event.AttackID, the same SELECT statement is much faster:
(ALTER TABLE `event` DROP INDEX `AttackID`;)
The result of the SQL SELECT query:
5 rows in set (0.242 sec)
(screenshot: QueryFast)
From my understanding, indexes on columns used in a WHERE clause should improve performance.
Why does removing the index have such an impact on the query?
What can I do to keep the relations between the tables and still have faster
SELECT execution?
Cheers
Why does removing the index improve performance?
The query optimizer has multiple ways to resolve a query. For instance, two methods for filtering data are:
Look up the rows that match the where clause in the index and then fetch related data from the data pages.
Scan the table.
This doesn't get into the use of indexes for joins or aggregations or alternative algorithms.
Which is better? Under some circumstances, the first method is horribly slower than the second. This occurs when the data for the table does not fit into memory. Under such circumstances, following the index can mean reading a record from page 124, then from page 1068, then from page 124 again -- all sorts of random, intertwined reads of pages. Reading data pages in order is usually faster. And when the data doesn't fit into memory, thrashing occurs, which means that a page in memory is aged (overwritten) -- and then needed again.
I'm not saying that is occurring in your case. I am simply saying that what optimizers do is not always obvious. The optimizer has to make judgements based on the nature of the data -- and those judgements are not right 100% of the time. They are usually correct. But there are borderline cases. Sometimes, the issue is out-of-date statistics. Sometimes the issue is that what looks best to the optimizer is not best in practice.
Let me emphasize that optimizers usually do a very good job, and a better job than a person would do. Even if they occasionally come up with suboptimal plans, they are still quite useful.
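One low-risk way to compare the two plans here without actually dropping the index is an index hint; a sketch using the query from the question (IGNORE INDEX is standard MySQL syntax):
EXPLAIN SELECT COUNT(*), IP FROM event
WHERE AttackID > 0 GROUP BY IP ORDER BY COUNT(*) DESC LIMIT 5;

-- Same query, but telling the optimizer to pretend the AttackID index is absent:
EXPLAIN SELECT COUNT(*), IP FROM event IGNORE INDEX (AttackID)
WHERE AttackID > 0 GROUP BY IP ORDER BY COUNT(*) DESC LIMIT 5;
Comparing the two plans shows whether the optimizer's index choice is what makes the query slow.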
Get rid of your redundant UNIQUE KEYs. A primary key is a unique key.
Use COUNT(*) rather than COUNT(EventID) in your query. They mean the same thing, because you declared EventID to be NOT NULL.
Your query can be much faster if you stop saying WHERE AttackID > 0. Because that column is a FK to the PK of your other table, those values should be nonzero anyway. But to get that speedup you'll need an index on event(IP), something like this:
CREATE INDEX IpDex ON event (IP)
But you're still summarizing a large table, and that will always take time.
It looks like you want to display some kind of leaderboard. You could add a top_ips table, and use an EVENT to populate it, using your query, every few minutes. Then you could display it to your users without incurring the cost of the query every time. This of course would display slightly stale data; only you know whether that's acceptable in your app.
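A minimal sketch of that idea, assuming the event scheduler is enabled; the table name, event name, and the 5-minute interval are all illustrative:
SET GLOBAL event_scheduler = ON;     -- events only run while the scheduler is enabled (needs privileges)

CREATE TABLE top_ips (
  IP   VARCHAR(15) NOT NULL,
  hits INT NOT NULL
);

DELIMITER //
CREATE EVENT refresh_top_ips
ON SCHEDULE EVERY 5 MINUTE
DO
BEGIN
  DELETE FROM top_ips;               -- rebuild the leaderboard from scratch
  INSERT INTO top_ips (IP, hits)
    SELECT IP, COUNT(*)
    FROM event
    GROUP BY IP
    ORDER BY COUNT(*) DESC
    LIMIT 5;
END//
DELIMITER ;
The statistics page then reads top_ips instead of running the expensive GROUP BY on every view.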
Pro Tip. Read https://use-the-index-luke.com by Markus Winand.
Essentially every part of your query, except for the FKey, conspires to make the query slow.
Your query is equivalent to
SELECT Count(*), IP
FROM event
WHERE AttackID >0
GROUP BY IP
ORDER BY Count(*) DESC
LIMIT 5;
Please use COUNT(*) unless you need to avoid NULL.
If AttackID is rarely >0, the optimal index is probably
ADD INDEX(AttackID, -- for filtering
IP) -- for covering
Else, the optimal index is probably
ADD INDEX(IP, -- to avoid sorting
AttackID) -- for covering
You could simply add both indexes and let the Optimizer decide. Meanwhile, get rid of these, if they exist:
DROP INDEX(AttackID)
DROP INDEX(IP)
because any uses of them are handled by the new indexes.
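Spelled out as complete statements against the event table, the suggestion might look like this (a sketch; the new index names are made up, and the single-column IP index only exists if you created one earlier):
ALTER TABLE event
  ADD INDEX attack_ip (AttackID, IP),   -- filter on AttackID, covering IP
  ADD INDEX ip_attack (IP, AttackID);   -- avoid the sort on IP, covering AttackID

-- The FK on AttackID still needs an index with AttackID as its leftmost column;
-- attack_ip satisfies that, so the old single-column index can go.
ALTER TABLE event DROP INDEX `AttackID`;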
Furthermore, leaving the 1-column indexes around can confuse the Optimizer into using them instead of the covering index. (This seems to be a design flaw in at least some versions of MySQL/MariaDB.)
"Covering" means that the query can be performed entirely in the index's BTree. EXPLAIN will indicate it with "Using index". A "covering" index speeds up a query by 2x -- but there is a very wide variation on this prediction. ("Using index condition" is something different.)
More on index creation: http://mysql.rjweb.org/doc.php/index_cookbook_mysql

Cannot achieve a cover index with this table (2 equalities and one selection)?

CREATE TABLE `discount_base` (
`id` varchar(12) COLLATE utf8_unicode_ci NOT NULL,
`amount` decimal(13,4) NOT NULL,
`description` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`family` varchar(4) COLLATE utf8_unicode_ci NOT NULL,
`customer_id` varchar(8) COLLATE utf8_unicode_ci NOT NULL,
PRIMARY KEY (`id`),
KEY `IDX_CUSTOMER` (`customer_id`),
KEY `IDX_FAMILY_CUSTOMER_AMOUNT` (`family`,`customer_id`,`amount`),
CONSTRAINT `FK_CUSTOMER` FOREIGN KEY (`customer_id`)
REFERENCES `customer` (`id`) ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
I've added a cover index IDX_FAMILY_CUSTOMER_AMOUNT on family, customer_id and amount because most of the time I use the following query:
SELECT amount FROM discount_base WHERE family = :family AND customer_id = :customer_id
However, using EXPLAIN with a bunch of records (~250,000), it says:
id: 1, select_type: SIMPLE, table: discount_base, type: ref, possible_keys: IDX_CUSTOMER,IDX_FAMILY_CUSTOMER_AMOUNT, key: IDX_FAMILY_CUSTOMER_AMOUNT, key_len: 40, ref: const,const, rows: 1, Extra: Using where; Using index
Why am I getting Using where; Using index instead of just Using index?
EDIT: Fiddle with a small amount of data (Using where; Using index):
EXPLAIN SELECT amount
FROM discount_base
WHERE family = '0603' and customer_id = '20000275';
Another fiddle where id is family + customer_id (const):
EXPLAIN SELECT amount
FROM discount_base
WHERE `id` = '060320000275';
Interesting problem. It would seem "obvious" that the IDX_FAMILY_CUSTOMER_AMOUNT index would be used for this query:
SELECT amount
FROM discount_base
WHERE family = :family AND customer_id = :customer_id;
"Obvious" to us people, but clearly not to the optimizer. What is happening?
This aspect of index usage is poorly documented. I (intelligently) speculate that when doing comparisons on strings using case-insensitive collations (and perhaps others), the = operation is really more like an IN. Something sort of like this, conceptually:
WHERE family IN (lower(:family), upper(:family), . . .) and . . .
This is conceptual. But it means that an index scan is required for the = rather than an index lookup. A minor change typographically, but very important semantically: it prevents the use of the second key column. Yup, that is an unfortunate consequence of inequalities, even when they look like =.
So, the optimizer compares the two possible indexes, and it decides that customer_id is more selective than family, and chooses the former.
Alas, both of your keys are case-insensitive strings. My suggestion would be to replace at least one of them with an auto-incrementing integer id. In fact, my suggestion is that basically all tables have an auto-incrementing integer id, which is then used for all foreign key references.
Another solution would be to use a trigger to create a single column CustomerFamily with the values concatenated together. Then this index:
KEY IDX_CUSTOMERFAMILY_AMOUNT (CustomerFamily, amount)
should do what you want. It is also possible that a case-sensitive encoding would also solve the problem.
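A rough sketch of the trigger approach just described (the column, trigger, and index names are made up; on MySQL 5.7+ a generated column would be an alternative):
ALTER TABLE discount_base
  ADD COLUMN CustomerFamily VARCHAR(12) COLLATE utf8_unicode_ci DEFAULT NULL;

CREATE TRIGGER discount_base_cf_bi
BEFORE INSERT ON discount_base
FOR EACH ROW
SET NEW.CustomerFamily = CONCAT(NEW.customer_id, NEW.family);
-- A matching BEFORE UPDATE trigger would keep the column in sync on updates.

ALTER TABLE discount_base
  ADD INDEX IDX_CUSTOMERFAMILY_AMOUNT (CustomerFamily, amount);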
Are family and customer_id strings? You could be passing customer_id as an integer, which would cause a type conversion to take place, so the index would not be used for that particular column.
Ensure you pass customer_id as a string, or consider changing your table to store customer_id as INT.
If you are using alphanumeric IDs then this doesn't apply.
I'm pretty sure Using index is the important part, and it means "using a covering index".
Two things to further check:
EXPLAIN FORMAT=JSON SELECT ...
may give further clues.
FLUSH STATUS;
SELECT ...;
SHOW SESSION STATUS LIKE 'Handler%';
will show you how many rows were read/written/etc. in various ways. If some number is around 250,000 (in your case), it indicates a table scan. If all the numbers are small (approximately the number of rows returned by the query), then you can be assured that the query was performed efficiently.
The numbers there do not distinguish between reads of an index versus the data, but they are unaffected by caching. Timings (for two identical runs) can differ significantly due to caching; Handler% values won't change.
The answer to your question depends on what the engine is actually using your index for.
In the given query, you ask the engine to:
Look up values (WHERE/JOIN)
Retrieve information (SELECT) based on this lookup result
For the first part, as soon as you filter the results (the lookup), there's an entry in Extra indicating USING WHERE, which is why you see it in your explain plan.
For the second part, the engine does not need to go anywhere outside one given index because it is a covering index. The explain plan indicates this by showing USING INDEX. This USING INDEX note, combined with USING WHERE, means your index is also used in the lookup portion of the query, as explained in the MySQL documentation:
https://dev.mysql.com/doc/refman/5.0/en/explain-output.html
Using index
The column information is retrieved from the table using only
information in the index tree without having to do an additional seek
to read the actual row. This strategy can be used when the query uses
only columns that are part of a single index.
If the Extra column also says Using where, it means the index is being
used to perform lookups of key values. Without Using where, the
optimizer may be reading the index to avoid reading data rows but not
using it for lookups. For example, if the index is a covering index
for the query, the optimizer may scan it without using it for lookups.
Check this fiddle:
http://sqlfiddle.com/#!9/8cdf2/10
I removed the where clause and the query now displays USING INDEX only. This is because no lookup is necessary in your table.
The MySQL documentation on EXPLAIN has this to say:
Using index
The column information is retrieved from the table using
only information in the index tree without having to do an additional
seek to read the actual row. This strategy can be used when the query
uses only columns that are part of a single index.
If the Extra column
also says Using where, it means the index is being used to perform
lookups of key values. Without Using where, the optimizer may be
reading the index to avoid reading data rows but not using it for
lookups. For example, if the index is a covering index for the query,
the optimizer may scan it without using it for lookups.
My best guess, based on the information you have provided, is that the optimizer first uses your IDX_CUSTOMER index and then performs a key lookup to retrieve non-key data (amount and family) from the actual data page based on the key (customer_id).
This is most likely caused by the cardinality (i.e. uniqueness) of the columns in your indexes. You should check the cardinality of the columns used in your WHERE clause and put the one with the highest cardinality first in your index. Guessing from the column names and your current results, customer_id has the highest cardinality.
So change this:
KEY `IDX_FAMILY_CUSTOMER_AMOUNT` (`family`,`customer_id`,`amount`)
to this:
KEY `IDX_FAMILY_CUSTOMER_AMOUNT` (`customer_id`,`family`,`amount`)
After making the change, you should run ANALYZE TABLE to update the table statistics, which can affect the choices the optimizer makes regarding your indexes.
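As a sketch, the change could be applied like this (keeping the existing index name):
ALTER TABLE discount_base
  DROP INDEX IDX_FAMILY_CUSTOMER_AMOUNT,
  ADD INDEX IDX_FAMILY_CUSTOMER_AMOUNT (customer_id, family, amount);

ANALYZE TABLE discount_base;   -- refresh the statistics the optimizer relies on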
This sounds fine. According to the MySQL documentation:
If the Extra column also says Using where, it means the index is being
used to perform lookups of key values. Without Using where, the
optimizer may be reading the index to avoid reading data rows but not
using it for lookups. For example, if the index is a covering index
for the query, the optimizer may scan it without using it for lookups.
That means Using index alone would read the entire index to retrieve the results, without using the index structure to find specific values. You could probably get this with SELECT family, customer_id, amount FROM discount_base. Using where; Using index means the optimizer exploits the index both to find and to retrieve the rows matching the query parameters (family, customer_id).
This indeed could be a problem.
Note that there might be millions of strings matching a single index key when using the utf8_unicode_ci collation. For example, all these letters are matched by the same index key:
A, a, À, Á, Â, Ã, Ä, Å, à, á, â, ã, ä, å, Ā, ā, Ă, ă, Ą, ą, Ǎ, ǎ, Ǟ, ǟ, Ǡ, ǡ, Ǻ, ǻ, Ȁ, ȁ, Ȃ, ȃ, Ȧ, ȧ, Ḁ, ḁ, Ạ, ạ, Ả, ả, Ấ, ấ, Ầ, ầ, Ẩ, ẩ, Ẫ, ẫ, Ậ, ậ, Ắ, ắ, Ằ, ằ, Ẳ, ẳ, Ẵ, ẵ, Ặ, ặ.
And there are substantial grounds for believing that, when processing a query using a CHAR/VARCHAR index, MySQL, in addition to the regular index lookup, performs a full linear scan of all the values matched by the index to make sure each one really matches the original query parameter. This may be genuinely needed when the index collation and the WHERE collation do not match, but I don't know why it does so all the time, even when this is clearly not needed (in your case, for example).
See this question for an evidence and additional details: Why performance of MySQL queries are so bad when using a CHAR/VARCHAR index?
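A quick way to see the collation folding described above (a small sketch; assumes a utf8 connection character set):
SET NAMES utf8;
SELECT 'a' = 'A' COLLATE utf8_unicode_ci AS fold_case,    -- returns 1: case folds together
       'a' = 'Á' COLLATE utf8_unicode_ci AS fold_accent;  -- returns 1: accents fold together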
I would only recommend this solution:
Remove amount from the index, like:
KEY `IDX_FAMILY_CUSTOMER_AMOUNT` (`family`,`customer_id`)
When you run the query, force the index with USE INDEX:
USE INDEX (`IDX_FAMILY_CUSTOMER_AMOUNT`)
That trick allows you to avoid Using where. Hopefully performance will also be at an acceptable level:
http://sqlfiddle.com/#!9/86f46/2
SELECT amount FROM discount_base
USE INDEX (`IDX_FAMILY_CUSTOMER_AMOUNT`)
WHERE family = '1' AND customer_id = '1'
Based on the fiddle provided, it appears that only numeric values are being used for family and customer id. If this assumption is correct, changing these columns to numeric and using just a single key on customer and family appears to have resolved the issue.
Please check this fiddle

Use an index on a join without "where"

I'm trying to understand if it's possible to use an index on a join if there is no limiting where on the first table.
Note: this is not real-world usage line by line, just something I drafted for understanding purposes. Don't point out the obvious "what are you trying to obtain with this schema?", "you should use UNSIGNED" or the like, because that's not the question.
Note 2: MySQL JOINS without where clause is somewhat related, but not the same.
Schema:
CREATE TABLE posts (
id_post INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
text VARCHAR(100)
);
CREATE TABLE related (
id_relation INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
id_post1 INT NOT NULL,
id_post2 INT NOT NULL
);
CREATE INDEX related_join_index ON related(id_post1) using BTREE;
Query:
EXPLAIN SELECT * FROM posts FORCE INDEX FOR JOIN(PRIMARY) INNER JOIN related ON id_post=id_post1 LIMIT 0,10;
SQL Fiddle: http://sqlfiddle.com/#!2/84597/3
As you can see, the index is being used on the second table, but the engine is doing a full table scan on the first one (the FORCE INDEX is there just to highlight the general question).
I'd like to understand if it's possible to get a "ref" on the left side too.
Thanks!
Update: if the first table has significantly more records than the second, the situation is reversed: the engine uses an index for the first one and a full table scan for the second (http://sqlfiddle.com/#!2/3a3bb/1). Still, there is no way to get indexes used on both.
The DBMS has an optimizer to figure out the best plan to execute a query. It's up to the optimizer to decide whether to use an index or simply read the table directly.
An index makes sense when the DBMS expects to read only a few records from a table (say, 1% of all rows). But once it expects to read many records (say, 99% of all rows) it will not use the index. The threshold may lie as low as 5% (i.e. <= 5% -> index; > 5% -> table scan).
There are exceptions. One is when an index holds all the columns needed; then the table itself doesn't have to be read at all. Another is when the optimizer thinks an index access may turn out faster in spite of having to read many rows. It's also always possible the optimizer simply guesses wrong.
There is a page in the MySQL documentation about this subject.
Regarding the possibility of getting a ref on the first table for this query, the short answer is NO.
The reason is obvious: because there is no WHERE clause, ALL the rows from table posts are analyzed, because they could be included in the result set. There is no reason to use an index for that; a full table scan is better because it gets all the rows, and because the order doesn't matter, the access is (more or less) sequential. Using an index would require reading more information from the storage (index and data).
MySQL will use the join type index if all the columns that appear in the SELECT clause are present in an index. In this case MySQL will perform a full index scan (join type index) instead of a full table scan (join type ALL) because it requires reading less information from the storage (an index is usually smaller than the entire table data).
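As a hedged sketch of that last point (the index name is made up, and whether the optimizer actually reports join type index instead of ALL depends on the data and the MySQL version), adding a secondary index that contains every posts column the query selects would allow a full index scan in place of the full table scan:
CREATE INDEX posts_cover ON posts (text);   -- InnoDB secondary indexes also carry the PK (id_post)

EXPLAIN SELECT p.id_post, p.text, r.id_post2
FROM posts AS p
INNER JOIN related AS r ON r.id_post1 = p.id_post
LIMIT 0, 10;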

MySQL indexes and when to group them

I'm still trying to get my head around the best way to use INDEXES in MySQL. How do you know when to merge them together and when to have them separate?
Below are the indexes from the Wordpress posts table. See how post_name, post_parent and post_author are separate entries? And then they have type_status_date which is a mixture of 4 fields?
http://img215.imageshack.us/img215/5976/screenshot20120426at431.png
I don't understand the logic behind this? Can anyone enlighten me?
Going to be a bit of a long answer, but here we go. Please note I am not going to deal with the differences in database engines here (MyISAM and InnoDB have distinct ways of implementing what I am trying to describe).
The first thing you have to understand about an index is that it is a separate data structure stored on disk. Normally this is a B-tree data structure containing the column(s) that you have indexed and also a pointer to the row in the table (this pointer is normally the primary key).
The only index that is stored with the data is the primary key index. Thus a primary key index IS the table.
Let's assume you have the following table definition:
CREATE TABLE `Student` (
`StudentNumber` INT NOT NULL ,
`Name` VARCHAR(32) NULL ,
`Surname` VARCHAR(32) NULL ,
`StudentEmail` VARCHAR(32) NULL ,
PRIMARY KEY (`StudentNumber`) );
Since we have a primary key on StudentNumber, there will be an index containing the primary key and the other columns in the index. If you had to look at the data in the index, you would probably see something like this:
1, John, Doe, jdoe@gmail.com
As you can see, this is the table data once again, showing you that the primary key index IS the table.
The StudentNumber column is indexed, which allows you to search on it efficiently; the rest of the data is stored with the key. Thus, if you ran the following query:
SELECT * FROM Student WHERE StudentNumber=1
MySQL would use the primary index to quickly find the row and then read the data stored with the indexed column. Since there is an index, MySQL can use it to do an efficient seek operation on the B-tree.
Also, when it comes to retrieving the data after doing the search, MySQL can read the data from the index, so we are using one operation in the index to retrieve the data. Now if I ran the following query:
SELECT * FROM Student WHERE Name ='Joe'
MySQL would check if there is an index that it could use to speed the query up. However, in my case there is no index on Name, so MySQL would do a sequential read from the table, one row at a time, from the first row to the last.
At each row it would evaluate the row against the WHERE clause and return matching rows. So basically it reads the primary key index from top to bottom. Remember, the primary key index is the table.
If I ran the following statement:
ALTER TABLE `TimLog`.`student`
ADD INDEX `ix_name` (`Name` ASC) ;
ALTER TABLE `TimLog`.`student`
ADD INDEX `ix_surname` (`Surname` ASC) ;
MySQL would create new indexes on the Student table. These will be stored away from the table on disk, and the data inside would look something like this:
Data in ix_Name
John, 1 <--PRIMARY KEY VALUE
Data in ix_Surname
Doe, 1 <--PRIMARY KEY VALUE
Notice the data in the ix_Name index is the name and the primary key value. So if I ran the previous SELECT statement, MySQL would read the ix_Name index, get the primary key values for matching items, and then use the primary key index to get the rest of the data.
So the number of operations to get the data from the index is 2. The matching rows are found in the index and then a lookup happens on the primary key to get the row data out.
You now have the following query:
SELECT * FROM Student WHERE Name='John' AND surname ='Doe'
Here MySQL can't use both indexes, as it would be a waste of operations. If MySQL had to use both indexes in this query, the following would happen (this should not happen):
1. Find in ix_Name the rows with the value John
2. Read the primary key that matches to get the row data
3. Store the matching results
4. Find in ix_Surname the rows with the value Doe
5. Read the primary key that matches to get the row data
6. Store the matching results
7. Take the Name results and Surname results and merge them
8. Return the query results
This is really a waste of IO, as MySQL would then read the table twice. Basically, using one index would be better than trying to use two (I will explain in a moment why). MySQL will choose one index to use in this simple query.
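You can see which index it picks with EXPLAIN; a small sketch against the Student table above:
EXPLAIN SELECT * FROM Student WHERE Name = 'John' AND Surname = 'Doe';
-- The key column of the output names the index the optimizer chose (e.g. ix_surname),
-- and possible_keys lists the candidates it considered.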
So how does MySQL decide on which index to use?
MySQL keeps statistics about indexes internally. These statistics tell MySQL basically how unique an index is. So, for the sake of argument, let's say the surname index (ix_surname) was more unique than the name index (ix_name); MySQL would then use the surname index (ix_surname).
Thus query retrieval would be like this:
1. Use ix_surname and find the rows that match the value Doe
2. Read the primary key and apply the filter for the value John on the actual column data in the row
3. Return the matched row
As you can see, the number of operations in this search is much smaller. I have oversimplified a lot of the technical detail. Indexing is an interesting thing to master, but you have to look at it from the perspective of: how do I get the data with the minimal amount of IO?
Hope it is as clear as mud now!
MySQL cannot normally use more than one index at a time. That means, for instance, that when you have a query that filters or sorts on two fields you put them both into the same index.
WordPress likely has a common query that filters and/or sorts on post_type, post_status and post_date. Making an educated guess as to what they stand for, this would likely be the core query for WordPress's Post listing pages. So the three fields are put into the same index.
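For illustration, that kind of composite index could be declared like this (a hedged sketch based on the answer's guess; the real WordPress index definition may differ, and the fourth column the question mentions is most likely the post ID):
ALTER TABLE wp_posts
  ADD INDEX type_status_date (post_type, post_status, post_date, ID);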

Need a little clarification on MySQL Indexes

I've been thinking about my database indexes lately. In the past I just kind of nonchalantly threw them in as an afterthought and never really put much thought into whether they are correct or even helping. I've read conflicting information: some say that more indexes are better and others that too many indexes are bad, so I'm hoping to get some clarification and learn a bit here.
Let's say I have this hypothetical table:
CREATE TABLE widgets (
widget_id INT UNSIGNED NOT NULL PRIMARY KEY AUTO_INCREMENT,
widget_name VARCHAR(50) NOT NULL,
widget_part_number VARCHAR(20) NOT NULL,
widget_price FLOAT NOT NULL,
widget_description TEXT NOT NULL
);
I would typically add an index for fields that will be joined and fields that will be sorted on most often:
ALTER TABLE widgets ADD INDEX widget_name_index(widget_name);
So now, in a query such as:
SELECT w.* FROM widgets AS w ORDER BY w.widget_name ASC
The widget_name_index is used to sort the resultset.
Now if I add a search parameter:
SELECT w.* FROM widgets AS w
WHERE w.widget_price > 100.00
ORDER BY w.widget_name ASC
I guess I need a new index.
ALTER TABLE widgets ADD INDEX widget_price_index(widget_price);
But will it use both indexes? As I understand it, it won't...
ALTER TABLE widgets ADD INDEX widget_price_name_index(widget_price, widget_name);
Now widget_price_name_index will be used to both select and order the records. But what if I want to turn it around and do this:
SELECT w.* FROM widgets AS w
WHERE w.widget_name LIKE '%foobar%'
ORDER BY w.widget_price ASC
Will widget_price_name_index be used for this? Or do I need a widget_name_price_index also?
ALTER TABLE widgets ADD INDEX widget_name_price_index(widget_name, widget_price);
Now what if I have a search box that searches widget_name, widget_part_number and widget_description?
ALTER TABLE widgets
ADD INDEX widget_search(widget_name, widget_part_number, widget_description);
And what if end users can sort by any column? It's easy to see how I could end up with more than a dozen indexes for a mere 5 columns.
If we add another table:
CREATE TABLE specials (
special_id INT UNSIGNED NOT NULL PRIMARY KEY AUTO_INCREMENT,
widget_id INT UNSIGNED NOT NULL,
special_title VARCHAR(100) NOT NULL,
special_discount FLOAT NOT NULL,
special_date DATE NOT NULL
);
ALTER TABLE specials ADD INDEX specials_widget_id_index(widget_id);
ALTER TABLE specials ADD INDEX special_title_index(special_title);
SELECT w.widget_name, s.special_title
FROM widgets AS w
INNER JOIN specials AS s ON w.widget_id=s.widget_id
ORDER BY w.widget_name ASC, s.special_title ASC
I am assuming this will use specials_widget_id_index and the widgets.widget_id primary key index for the join, but what about the sorting? Will it use both widget_name_index and special_title_index?
I don't want to ramble on too long; there is an endless number of scenarios I could conjure up. Obviously this can get much more complex with real-world scenarios rather than a couple of simple tables. Any clarification would be appreciated.
By best practices, you do not have to create every index while defining the table schema. It is usually better to create indexes as you create the queries in your application. In most cases, you will be starting with a single-column index to satisfy a query. If you want to use many columns in a query, you can create a covering index.
A covering index is a multi-column index that satisfies all the column requirements of a query; the storage engine can then obtain all the results from the index alone instead of kicking off disk I/O against the data rows. So, when creating a query that uses more columns, you can either create a new index covering all the required columns, or you can extend an existing index to include more columns.
You have to take some things into consideration while doing either of the above. MySQL considers an index only when the left-most column of the index can be used in the query. Otherwise, it simply scans the whole table to fetch results. So if you can extend an existing index without affecting all the queries that already use that index, then it would be a wise choice. Otherwise, you can go ahead and create a new index for the new query. Sometimes, the queries can be adjusted to adapt to the index structure.
An index speeds up selects, but slows down inserts and updates. You don't need to create an index for every possible combination of columns you can imagine. I usually just create the obvious indexes that I know I will be using often, and only add more if I can see that they are needed after taking performance measurements. The database can still use an index even if it doesn't cover all the columns in the query.
Only one index is ever used in a query. Fortunately, you can create an index covering multiple columns:
ALTER TABLE widgets ADD INDEX name_and_price_index(widget_name, widget_price);
The above index will be used if you SELECT by widget_name or widget_name + widget_price (but not just widget_price).
As MitMaro points out, use EXPLAIN on a query to see what indexes MySQL has to choose from, as well as what index it ends up using. See here for even more details.
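For example, a quick sketch against the widgets table above (the literal value is just a placeholder):
EXPLAIN SELECT w.* FROM widgets AS w
WHERE w.widget_name = 'sprocket'
ORDER BY w.widget_price ASC;
-- possible_keys lists the candidate indexes, key shows the one actually chosen,
-- and Extra will mention "Using filesort" if no index can satisfy the ORDER BY.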