The role of indexes in MySQL query performance

Why does defining an index on MySQL tables increase performance for queries having a join?

If you are interested in a specific topic in a book, you go to the back of the book and find it alphabetically in the index. The index tells you the page number(s) where the topic is discussed. Then you jump straight to the pages that you are interested in. Much, much faster than reading the whole book.
It's the same in a database. The index means that you can jump to the joining rows instead of scanning every row in the table looking for a match.
Have a look at how a clustered index works (http://msdn.microsoft.com/en-us/library/ms177443.aspx). You can have one of those per table.
This article explains how a non-clustered index works (http://msdn.microsoft.com/en-us/library/ms177484.aspx). You can have as many of them as you want.
Both of these articles are about Microsoft Sql Server, but the theory behind indexes is the same across all relational database management systems.
Indexes do have an associated cost. Every time an insert/update is performed on the table, the affected index(es) may have to be updated as well. And of course indexes take up space - but that is not really an issue for most of us. So you need to balance the performance benefits of faster joins or filtering against the costs of inserts and updates.
As a guide, you will generally want an index that matches each of the columns included in a join or where clause:
SELECT
*
FROM
Customer
WHERE
RegistrationDate > #registrationDate
AND RegistrationCountry = #registrationCountry;
So an index on the Customer table that includes the RegistrationDate and RegistrationCountry columns would speed up this query. Since we are using a ">" in our query, this would be a good candidate for a clustered index (the first article shows that a clustered index physically arranges the data in index order so range queries can very quickly isolate a range of the index).
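As a rough sketch of that effect, here is the same shape of query run through SQLite (used here only because it ships with Python; the CREATE INDEX syntax is the same in MySQL, though in MySQL/InnoDB the primary key is the clustered index). The table and index names are made up, and the index puts the equality column first so the range condition on the date is applied last:

```python
import sqlite3

# Hypothetical Customer table mirroring the query above.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE Customer (
    CustomerID INTEGER PRIMARY KEY,
    RegistrationDate TEXT,
    RegistrationCountry TEXT)""")

def plan(sql):
    # Join the 'detail' column of EXPLAIN QUERY PLAN into one string.
    return " ".join(row[3] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

query = ("SELECT * FROM Customer WHERE RegistrationCountry = 'DE' "
         "AND RegistrationDate > '2020-01-01'")

before = plan(query)  # no index yet: a full table scan
con.execute("CREATE INDEX idx_reg ON Customer "
            "(RegistrationCountry, RegistrationDate)")
after = plan(query)   # now: an index seek on country, range on date
print(before)
print(after)
```

Before the index the plan reports a scan; afterwards it reports a search using idx_reg, which is the "jump straight to the pages" behaviour described above.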
SELECT
*
FROM
Customer c
INNER JOIN Order o
ON o.CustomerID = c.CustomerID
AND o.OrderType = #orderType
Here, we would want an index on the Customer table that contains the CustomerID column. And we'd want an index on the Order table that contains the CustomerID and the OrderType columns. Then neither side of the join will need a table scan.
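A minimal sketch of that join, again using SQLite via Python as a stand-in for MySQL (the table is named Orders because ORDER is a reserved word; all names here are invented). EXPLAIN QUERY PLAN shows one side of the join being probed through an index rather than scanned row by row:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Customer (CustomerID INTEGER PRIMARY KEY, Name TEXT)")
# "Order" is a reserved word, so the table is called Orders here.
con.execute("""CREATE TABLE Orders (
    OrderID INTEGER PRIMARY KEY,
    CustomerID INTEGER,
    OrderType TEXT)""")
# Composite index so the join plus the OrderType filter can be
# answered by a single index probe per outer row.
con.execute("CREATE INDEX idx_orders_cust_type ON Orders (CustomerID, OrderType)")

sql = """SELECT * FROM Customer c
         JOIN Orders o ON o.CustomerID = c.CustomerID
         AND o.OrderType = 'web'"""
detail = " ".join(r[3] for r in con.execute("EXPLAIN QUERY PLAN " + sql))
print(detail)
```

Whichever join order the planner picks, the inner table is reached with a SEARCH through an index (the composite index or the primary key) instead of a second scan.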
Typically there will only be a small number of ways that data is queried from a table, so you won't get index overload. Lots of indexes is sometimes a sign that your tables have mixed concerns and could be normalized.

You might want to read up on the basics of database indexes. Indexes are basically used to organize data.

I have found that sometimes it can be significantly faster to replace the JOIN query with two smaller queries, then join them in PHP or whatever language is calling MySQL. So try both and time them to see which is better for the particular situation, but bear in mind that the "fastest" solution may change as database size increases.

Related

Indexing mysql table for selects

I'm looking to add some MySQL indexes to a database table. Most of the queries are selects. Would it be best to create a separate index for each column, or add an index for province, for province and number, and for name? Does it even make sense to index province, since there are only about a dozen options?
select * from employees where province = 'ab' and number = 'v45g';
select * from employees where province = 'ab';
If the usage changed to more inserts should I remove all the indexes except for the number?
An index is a data structure that maps the values of a column into a fast searchable tree. The tree contains pointers to rows, which the DB can use to find rows fast. One thing to know: some DB engines read plus or minus a bunch of rows to take advantage of disk read-ahead, so you may actually read 50 or 100 rows per index read, and not just one. Hence, if you access 30% of a table through an index, you may wind up reading all table data multiple times.
Rule of thumb:
- index columns with many unique values; a tree with 2 branches and half of your table on either side is not much use for narrowing down a search
- use as few indexes as possible
- use real-world numbers as much as possible. Performance can change dynamically based on data or the whim of the DB engine, so it's very important to track how fast your queries are running consistently (i.e. log this in case a query ever gets slow). With that data you can add indexes without flying blind
Okay, so there are multiple kinds of index: single-column and multi-column. You want several single-column indexes when it makes sense for each to be usable on its own: think joins, or "or" conditions. A multi-column index is better when you have "and" conditions and successively filter rows in a where clause.
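The "or" case can be sketched like this, with SQLite via Python standing in for MySQL (MySQL's Index Merge optimization performs the equivalent union; all names here are invented). Two single-column indexes let the engine search each index separately and combine the results instead of scanning the table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (a INTEGER, b INTEGER, c TEXT)")
con.execute("CREATE INDEX idx_a ON t (a)")
con.execute("CREATE INDEX idx_b ON t (b)")

# An OR across two indexed columns: the plan unions two index
# searches (SQLite calls this MULTI-INDEX OR).
plan = " ".join(r[3] for r in con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM t WHERE a = 1 OR b = 2"))
print(plan)
```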
In your case, name does not make sense, since a LIKE with a leading wildcard cannot use an index. city and number do make sense, probably as a multi-column index. province could help as the last column of that index.
So an index with these columns would likely help:
(number,city,province)
Or try as well just:
(number,city)
You should index fields that are searched upon and have high selectivity / cardinality. Indexes make writes slower.
Another thing is that indexes can be added and dropped at any time, so maybe you should leave this for a later review of the database and optimization of queries.
That being said one index that you can be sure to add is in the column that holds the name of a person. That's almost always used in searching.
According to the MySQL documentation:
You can create multiple-column indexes; a search on the first column mentioned in the index declaration can use the index on its own, but searches on the other columns alone cannot.
The documentation also says that if you store a hash of the columns in another column and index the hashed column, the search can be faster than with multiple indexes:
SELECT * FROM tbl_name
WHERE hash_col=MD5(CONCAT(val1,val2))
AND col1=val1 AND col2=val2;
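A sketch of that hash-column technique in Python with SQLite, with hashlib.md5 standing in for MySQL's MD5(CONCAT(...)). The names are taken from the snippet above; note that the hash must be recomputed on every write, which is the maintenance cost of this approach:

```python
import hashlib
import sqlite3

def md5_concat(*vals):
    # Equivalent of MySQL's MD5(CONCAT(val1, val2)).
    return hashlib.md5("".join(vals).encode()).hexdigest()

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tbl_name (col1 TEXT, col2 TEXT, hash_col TEXT)")
con.execute("CREATE INDEX idx_hash ON tbl_name (hash_col)")

# The hash column must be maintained on every insert/update.
rows = [("alpha", "beta"), ("gamma", "delta")]
con.executemany(
    "INSERT INTO tbl_name VALUES (?, ?, ?)",
    [(a, b, md5_concat(a, b)) for a, b in rows])

# One short indexed comparison narrows the search; the plain column
# comparisons guard against hash collisions.
found = con.execute(
    "SELECT col1, col2 FROM tbl_name "
    "WHERE hash_col = ? AND col1 = ? AND col2 = ?",
    (md5_concat("alpha", "beta"), "alpha", "beta")).fetchone()
print(found)
```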
You could use a unique index on province.

Explain query - MySQL not using index from table

I'm trying to learn the explain statement in MySQL but ran into a wall.
For my experiment, I created two tables (each having 10 rows) and ran explain over a simple join. Naturally, no indexes were used and 10*10 = 100 rows were scanned (I've added the output in images because the very long output of EXPLAIN was being wrapped on itself. The code is also in this pastebin):
I then added primary keys and indexes and reissued the explain command:
But as you can see, the users table is still being fully scanned by MySQL, as if there was no primary key. What is going wrong?
This is a bit long for a comment.
Basically, your tables are too small. You cannot get reasonable performance indications on such small data -- the query only needs to load two data pages into memory for the query. A nested loop join requires 100 comparisons. By comparison, loading indexes and doing the binary search is probably about the same amount of effort, if not more.
If you want to get a feel for explain, then use tables with a few tens of thousands of rows.
You seem to be asking about EXPLAIN, INDEXing, and optimizing particular SELECTs.
For this:
select u.name
from users as u
join accounts as a on u.id = a.user_id
where a.amount > 1000;
the optimizer will pick between users and accounts for which table to look at first. Then it will repeatedly reach into the other table.
Since you say a.amount > ... but nothing about u, the optimizer is very likely to pick a first.
If a.amount > 1000 is selective enough (less than, say, 20% of the rows) and there is INDEX(amount), it will use that index. Else it will do a table scan of a.
To reach into u, it needs some index starting with id. Keep in mind that a PRIMARY KEY is an index.
This, and many more basics, are covered in my index cookbook.
See also myxlpain for a discussion of EXPLAIN.
Please use SHOW CREATE TABLE; it is more descriptive than DESCRIBE.
EXPLAIN FORMAT=JSON SELECT... is also somewhat cryptic, but it does have more details than a regular EXPLAIN.
Well, as your main filter uses the '>' comparison operator, it does a full table scan, because the predicate may or may not match all rows.
As you join the accounts table on the user_id column, it shows the user_id index in possible keys, but it doesn't use it, because of the full table scan.

Demonstration of performance benefit of indexing a SQL table

I've always heard that "proper" indexing of one's SQL tables is key for performance. I've never seen a real-world example of this and would like to make one using SQLFiddle but not sure on the SQL syntax to do so.
Let's say I have 3 tables: 1) Users 2) Comments 3) Items.
Let's also say that each item can be commented on by any user. So to get item=3's comments here's what the SQL SELECT would look like:
SELECT * from comments join users on comments.commenter_id=users.user_id
WHERE comments.item_id=3
I've heard that generally speaking if the number of rows gets large, i.e., many thousands/millions, one should put indices on the WHERE and the JOINed column. So in this case, comments.item_id, comments.commenter_id, and users.user_id.
I'd like to make a SQLFiddle to compare having these tables indexed vs. not using many thousands, millions rows for each table. Might someone help with generating this SQLFiddle?
I'm the owner of SQL Fiddle. It definitely is not the place for generating huge databases for performance testing. There are too many other variables that you don't (but should, in real life) have control over, such as memory, hdd configuration, etc.... Also, as a shared environment, there are other people using it which could also impact your tests. That being said, you can still build a small db in sqlfiddle and then view the execution plans for queries with and without indexes. These will be consistent regardless of other environmental factors, and will be a good source for learning optimization.
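If you want a local experiment instead, a small self-contained demonstration is possible with SQLite via Python (no server needed; the row counts and names are invented, and absolute timings will vary by machine, so only the relative difference matters):

```python
import sqlite3
import time

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE comments (comment_id INTEGER PRIMARY KEY, "
            "item_id INTEGER, commenter_id INTEGER)")
# 200,000 synthetic comments spread over 1,000 items.
con.executemany("INSERT INTO comments VALUES (?, ?, ?)",
                ((i, i % 1000, i % 50) for i in range(200_000)))

def timed(n=20):
    start = time.perf_counter()
    for _ in range(n):
        con.execute("SELECT * FROM comments WHERE item_id = 3").fetchall()
    return time.perf_counter() - start

t_scan = timed()      # no index: a full table scan per query
con.execute("CREATE INDEX idx_item ON comments (item_id)")
t_indexed = timed()   # index seek per query

matched = con.execute(
    "SELECT COUNT(*) FROM comments WHERE item_id = 3").fetchone()[0]
print(f"rows: {matched}  scan: {t_scan:.4f}s  indexed: {t_indexed:.4f}s")
```

The indexed run is typically orders of magnitude faster, which is the effect the question is asking to see; the execution-plan approach described above shows the same thing without the timing noise.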
There's quite a few different ways to index a table and you might choose to index multiple tables differently depending on what your most used SELECT statements are. The 2 fundamental types of indexes are called clustered and non-clustered.
Clustered indexes store all of the information on the index itself rather than storing a list of references that the database can pull from and then use to find the actual data. The easiest way to visualize this is to think of the index and the table itself as separate objects. In a clustered index, if the column you indexed is used as a criterion (in the WHERE clause) then the information the query pulls will be pulled directly from the index and not the table.
On the other hand, a non-clustered index is more like a reference table. It tells the query where the actual information it is requesting is stored on the table object itself. So in essence, there is an extra step involved in actually retrieving the data from the table itself when you use non-clustered indexes.
Clustered indexes store data physically on the hard disk in a sequential order, and as a result of that, you can only have one clustered index on a table (since we can only store a table in one 'physical' way on a disk drive). Clustered indexes also need to be unique (although this may not be the case to the naked eye, it is always the case to the database itself). Because of this, most clustered indexes are put on the primary key (since most primary keys are unique).
Unlike clustered indexes, you can have as many non-clustered indexes as you want on a table since, after all, they are just reference tables for the actual table itself. Since we have an essentially unlimited number of options for non-clustered indexes, users like to put as many of these as needed on columns that are commonly used in the WHERE clause of a SELECT statement.
But like all things, excess is not always good. The more indexes you put on a table, the more 'overhead' there is on that table. Indexes might speed up your query runs, but excessive overhead will also slow them down. The key is to find a balance between too many indexes and not enough indexes for your particular situation.
As far as a good place to test the performance of your queries with or without indexes, I would recommend using SQL Server. There's a feature in SQL Server Management Studio called 'Execution Plan' which shows you the cost and running time of a query.

Is it necessary to have an index on every combination of queryable fields in a SQL table to optimize performance?

If my User table has several fields that are queryable (say DepartmentId, GroupId, RoleId) will it make any speed difference if I create an index for each combination of those fields?
By "queryable", I'm referring to a query screen where the end user can select records based on Department, Group or Role by selecting from a drop-down.
At the moment, I have a index on DepartmentId, GroupId and RoleId. That's a single non-unique index per field.
If an end user selects "anyone in Group B", the SQL looks like:
select * from User where GroupId = 2
Having an index on GroupId should speed that up.
But if the end user select "anyone in Group B and in Role C", the SQL would look like this:
select * from User where GroupId = 2 and RoleId = 3
Having indexes on GroupId and RoleId individually may not make any difference, right?
A better index for that search would be if I had one index spanning both GroupId and RoleId.
But if that's the case, then that would mean that I would need to have an index for every combination of queryable fields. So I would need all these indexes:
DepartmentId
GroupId
RoleId
DepartmentId and GroupId
DepartmentId and RoleId
GroupId and RoleId
Department Id, GroupId and RoleId
Can anyone shed some light on this? I'm using MySQL if that makes a difference.
A multi-column index can be used for any left prefix of that index. So, an index on (A, B, C) can be used for queries on (A), (A, B) and (A, B, C), but it cannot, for example, be used for queries on (B) or (B, C).
If the columns are all indexed individually, MySQL (5.0 or later) may also use Index Merge Optimization.
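The left-prefix rule can be demonstrated with SQLite via Python (a stand-in for MySQL here; in the plan output, SEARCH means an index seek and SCAN means every row is visited):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (A INTEGER, B INTEGER, C INTEGER)")
con.execute("CREATE INDEX idx_abc ON t (A, B, C)")

def plan(where):
    # Join the 'detail' column of EXPLAIN QUERY PLAN into one string.
    rows = con.execute("EXPLAIN QUERY PLAN SELECT * FROM t WHERE " + where)
    return " ".join(r[3] for r in rows)

print(plan("A = 1"))            # left prefix (A): index seek
print(plan("A = 1 AND B = 2"))  # left prefix (A, B): index seek
print(plan("B = 2"))            # no left prefix: falls back to a scan
```

Only the queries constraining a left prefix of (A, B, C) get a SEARCH; the query on B alone has to visit every row.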
Generally speaking, indexes will increase query speed, but decrease insert/update speed, and increase disk space/overhead. So asking if you should index each combination of columns is like asking if you should optimize every function in your code. It may make some things faster, or it may barely help, and it might just hurt more than it helps.
The effectiveness of indexes depends on:
Percentage of SELECTs vs. INSERTs and UPDATEs
The specifics of the SELECT queries, and whether they use JOINs
Size of table being indexed
RAM and processor speed
MySQL settings for how much RAM to use, etc
So, it's hard to give a general answer. The basic sound advice would be: Add indexes if queries are too slow. And remember to use EXPLAIN to see which indexes to add. Note that this is kind of like the database version of the general advice: Profile your app before spending time on optimization.
My experience is with SQL Server rather than MySQL, and it is possible that this makes a difference. However, in general, the engine can use multiple indexes on a single query. While there are certainly benefits to a more comprehensive single index (it provides a greater boost, especially if it forms a covering index), you will still benefit from an index on each field of the query.
Furthermore, keep in mind that each index must be maintained separately, so you will suffer a performance reduction on write operations as your number of indexes grow.
Create indexes carefully!
I would suggest collecting query statistics and deciding which column is used most often in searches, so you can create a clustered index on that particular column (even when you create an index on multiple columns, the data can be physically ordered by only a single column).
Also please be aware that Clustered index could significantly decrease performance of UPDATE/INSERT/DELETE queries because it causes physical data reordering.
What I have found is that it's best to index anything the user will search on. I have actually found better performance by creating indexes with multiple columns if a search for those columns will be executed.
For instance, if someone can search on both roleid and groupid at the same time, having an index with both of those columns will actually be a little faster than having just one index on each. However, having an index on each queryable column can still be good, since you may miss a combination of columns.
A key consideration is to see how much space the indexes will take up. Since these columns are integer fields, it shouldn't be a big deal. A little time creating indexes could reap significant benefits.
The best thing to do will be to experiment. Do a search on multiple columns and time it, then add a combined index and rerun it.
Remove all indexes and run CRUD statements against the table using a free tool called "SQL Sentry Plan Explorer".
It will show you which indexes are necessary.
Indexes are created based on CRUD and not on the table by itself.

Composite Primary and Cardinality

I have some questions on Composite Primary Keys and the cardinality of the columns. I searched the web, but did not find any definitive answer, so I am trying again. The questions are:
Context: Large (50M - 500M rows) OLAP Prep tables, not NOSQL, not Columnar. MySQL and DB2
1) Does the order of keys in a PK matter?
2) If the cardinality of the columns varies heavily, which should be used first. For example, if I have CLIENT/CAMPAIGN/PROGRAM where CLIENT is highly cardinal, CAMPAIGN is moderate, PROGRAM is almost like a bitmap index, what order is the best?
3) What order is the best for Join, if there is a Where clause and when there is no Where Clause (for views)
Thanks in advance.
You've got "MySQL and DB2". This answer is for DB2; MySQL has none of this.
Yes, of course that is logical, but the optimiser takes much more than just that into account.
Generally, the order of the columns in the WHERE clause (join) do not (and should not) matter.
However, there are two items related to the order of predicates which may be the reason for your question.
What does matter, is the order of the columns in the index, against which the WHERE clause is processed. Yes, there it is best to specify the columns in the order of highest cardinality to lowest. That allows the optimiser to target a smaller range of rows.
And along those lines, do not bother implementing indices for single-column, low-cardinality columns (they are useless). If the index is correct, then it will be used more often.
The order of tables being joined (not columns in the join) matters very much, it is probably the most important consideration. In fact Join Transitive Closure is automatic, and the optimiser evaluates all possible join orders, and chooses what it thinks is the best, based on Statistics (which is why UPDATE STATS is so important).
Regardless of the number of rows in the tables, if you are joining 100 rows from table_A on a bad index with 1,000,000 rows in table_B on a good index, you want the order A:B, not B:A. If you are getting less than the max IOPS, you may want to do something about it.
The correct sequence of steps is, no surprise:
check that the index is correct as per (1). Do not just add another index, correct the ones you have.
check that update stats is being executed regularly
always try the default operation of the optimiser first. Set stats on and measure I/Os. Use representative sets of values (that the user will use in production).
check the showplan, to ensure that the code is correct. Of course that will also identify the join order chosen.
if the performance is not good enough, and you believe that the join order chosen by the optimiser for those sets of values is sub-optimal, SET JTC OFF (syntax depends on your version of DB2), then specify the order that you want in the WHERE clause. Measure I/Os. Use representative sets
form an opinion. Choose whichever is better performance overall. Never tune for single queries.
1) Does the order of keys in a PK matter?
Yes, it changes the order of the record for the index that is used to police the PRIMARY KEY.
2) If the cardinality of the columns varies heavily, which should be used first. For example, if I have CLIENT/CAMPAIGN/PROGRAM where CLIENT is highly cardinal, CAMPAIGN is moderate, PROGRAM is almost like a bitmap index, what order is the best?
For select queries, this totally depends on the queries you are going to use. If you are searching for all three columns at once, the order is not important; if you are searching for two or one columns, they should be leading in the index.
For inserts, it is better to make the leading column match the order in which the records are inserted.
3) What order is the best for Join, if there is a Where clause and when there is no Where Clause (for views)
Again, this depends on the WHERE clause.