Demonstration of performance benefit of indexing a SQL table - mysql

I've always heard that "proper" indexing of one's SQL tables is key for performance. I've never seen a real-world example of this and would like to make one using SQLFiddle, but I'm not sure of the SQL syntax to do so.
Let's say I have 3 tables: 1) Users 2) Comments 3) Items.
Let's also say that each item can be commented on by any user. So to get item=3's comments here's what the SQL SELECT would look like:
SELECT * FROM comments JOIN users ON comments.commenter_id = users.user_id
WHERE comments.item_id = 3
I've heard that generally speaking if the number of rows gets large, i.e., many thousands/millions, one should put indices on the WHERE and the JOINed column. So in this case, comments.item_id, comments.commenter_id, and users.user_id.
I'd like to make a SQLFiddle to compare having these tables indexed vs. not, using many thousands or millions of rows for each table. Might someone help with generating this SQLFiddle?

I'm the owner of SQL Fiddle. It definitely is not the place for generating huge databases for performance testing. There are too many other variables that you don't (but should, in real life) have control over, such as memory, HDD configuration, etc. Also, as a shared environment, there are other people using it, which could also impact your tests. That being said, you can still build a small db in SQL Fiddle and then view the execution plans for queries with and without indexes. These will be consistent regardless of other environmental factors, and will be a good source for learning optimization.
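To make the "compare execution plans on a small db" suggestion concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module rather than MySQL purely so that it runs anywhere; the table layout, the row counts, and the index name idx_comments_item are assumptions made for illustration, and MySQL's EXPLAIN output is formatted differently from SQLite's EXPLAIN QUERY PLAN.

```python
import sqlite3

# SQLite stands in for MySQL here only because it ships with Python.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users    (user_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE comments (comment_id INTEGER PRIMARY KEY,
                           item_id INTEGER, commenter_id INTEGER, body TEXT);
""")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(i, f"user{i}") for i in range(1, 101)])
conn.executemany(
    "INSERT INTO comments (item_id, commenter_id, body) VALUES (?, ?, ?)",
    [(i % 50, i % 100 + 1, "text") for i in range(5000)])

QUERY = """
    SELECT * FROM comments
    JOIN users ON comments.commenter_id = users.user_id
    WHERE comments.item_id = 3
"""

def plan(sql):
    """Return the plan steps (last column of EXPLAIN QUERY PLAN) as strings."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

plan_before = plan(QUERY)   # no index on comments.item_id yet
conn.execute("CREATE INDEX idx_comments_item ON comments (item_id)")
plan_after = plan(QUERY)    # the planner can now search the new index

print("before:", plan_before)
print("after: ", plan_after)
```

The point is the comparison, not the exact wording of the plan: the plan taken after CREATE INDEX mentions idx_comments_item, while the earlier one cannot. In MySQL the equivalent check would be running EXPLAIN on the same SELECT before and after adding the index.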

There's quite a few different ways to index a table and you might choose to index multiple tables differently depending on what your most used SELECT statements are. The 2 fundamental types of indexes are called clustered and non-clustered.
Clustered indexes store all of the information on the index itself rather than storing a list of references that the database can pull from and then use to find the actual data. The easiest way to visualize this is to think of the index and the table itself as separate objects. In a clustered index, if the column you indexed is used as a criterion (in the WHERE clause) then the information the query pulls will be pulled directly from the index and not the table.
On the other hand, non-clustered indexes are more like a reference table. They tell the query where the actual information it is requesting is stored in the table object itself. So in essence, there is an extra step of retrieving the data from the table itself when you use non-clustered indexes.
Clustered indexes store data physically on the hard disk in a sequential order, and as a result of that, you can only have one clustered index on a table (since we can only store a table in one 'physical' way on a disk drive). Clustered indexes also need to be unique (although this may not be the case to the naked eye, it is always the case to the database itself). Because of this, most clustered indexes are put on the primary key (since most primary keys are unique).
Unlike clustered indexes, you can have as many non-clustered indexes as you want on a table since, after all, they are just reference tables for the actual table itself. Since we have an essentially unlimited number of options for non-clustered indexes, users like to put as many of these as needed on columns that are commonly used in the WHERE clause of a SELECT statement.
But like all things, excess is not always good. The more indexes you put on a table, the more 'overhead' there is on that table. Indexes might speed up your query runs, but excessive overhead will also slow them down. The key is to find a balance between too many indexes and not enough indexes for your particular situation.
As far as a good place to test the performance of your queries with or without indexes, I would recommend using SQL Server. There's a feature in SQL Server Management Studio called 'Execution Plan' which shows you the estimated cost and run time of a query.

Related

What's the minimum number of rows where indexing becomes valuable in MySQL?

I've read that indexing on some databases (SQL Server is the one I read about) doesn't have much effect until you cross a certain threshold of rows because the database will hold the entire table X in memory.
Ordinarily, I'd plan to index on my WHEREs and unique columns/lesser-changed tables. After hearing about the suggested minimum (which was about 10k), I wanted to learn more about that idea. If there are tables that I know will never pass a certain point, this might change the way I index some of them.
For something like MySQL MyISAM/INNODB, is there a point where indexing has little value and what are some ways of determining that?
Note: Very respectfully, I'm not looking for suggestions about structuring my database like "You should index anyway," I'm looking to understand this concept, if it's true or not, how to determine the thresholds, and similar information.
One of the major uses of indexes is to reduce the number of pages being read. The index itself is usually smaller than the table. So, just in terms of page read/writes, you generally need at least three data pages to see a benefit, because using an index requires at least two data pages (one for the index and one for the original data).
(Actually, if the index covers the query, then the breakeven is two.)
The number of data pages needed for a table depends on the size of the records and the number of rows. So, it is really not possible to specify a threshold on the number of rows.
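As a rough illustration of why a row-count threshold cannot be stated on its own, the sketch below estimates data pages from row count and average row size. The 16 KiB page size is InnoDB's default; the two row sizes are made-up examples, and the arithmetic deliberately ignores page headers, fill factor, and row overhead.

```python
import math

PAGE_SIZE = 16 * 1024  # InnoDB's default page size, in bytes

def data_pages(row_count, avg_row_bytes, page_size=PAGE_SIZE):
    """Rough page count: rows that fit per page, then pages needed."""
    rows_per_page = max(1, page_size // avg_row_bytes)
    return math.ceil(row_count / rows_per_page)

# Same 10,000 rows, very different page counts (row sizes are hypothetical).
narrow = data_pages(10_000, 50)      # small rows: 327 fit on one page
wide = data_pages(10_000, 4_000)     # large rows: only 4 fit on one page

print(narrow, wide)  # 31 2500
```

With the same number of rows, the narrow table needs a few dozen pages and the wide one a few thousand, so the point at which an index starts saving page reads arrives at very different row counts.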
The above very rudimentary explanation leaves out a few things:
The cost of scanning the data pages to do comparisons for each row.
The cost of loading and using index pages.
Other uses of indexing.
But it gives you an idea, and you can see benefits on tables much smaller than 10k rows. That said you can easily do tests on your data to see how queries work on the tables in question.
Also, I strongly, strongly recommend having primary keys on all tables and using those keys for foreign key relationships. The primary key itself is an index.
Indexes serve a lot of purposes. InnoDB tables are always organized as an index, on the cluster key. Indexes can be used to enforce unique constraints, as well as support foreign key constraints. The topic of "indexes" spans way more than query performance.
In terms of query performance, it really depends on what the query is doing. If we are selecting a small subset of rows, out of large set, then effective use of an index can speed that up by eliminating vast swaths of rows from being checked. That's where the biggest bang comes from.
If we are pulling all of the rows, or nearly all the rows, from a set, then an index typically doesn't help narrow down which rows to check; even when an index is available, the optimizer may choose to do a full scan of all of the rows.
But even when pulling large subsets, appropriate indexes can improve performance for join operations, and can significantly improve performance of queries with GROUP BY or ORDER BY clauses, by making use of an index to retrieve rows in order, rather than requiring a "Using filesort" operation.
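The ORDER BY point can be seen in a query plan. This is a sketch in Python's bundled SQLite, not MySQL: SQLite reports "USE TEMP B-TREE FOR ORDER BY" where MySQL's EXPLAIN would show "Using filesort", and whether MySQL actually picks the index for a given query depends on its own cost estimates. The events table and index name are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, created INTEGER, payload TEXT)")
conn.executemany("INSERT INTO events (created, payload) VALUES (?, ?)",
                 [((i * 37) % 1000, "x") for i in range(1000)])

SQL = "SELECT created, payload FROM events ORDER BY created"

def plan(sql):
    """Return the plan steps (last column of EXPLAIN QUERY PLAN) as strings."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

sort_plan = plan(SQL)       # no index: rows are scanned, then sorted
conn.execute("CREATE INDEX idx_events_created ON events (created)")
index_plan = plan(SQL)      # index: rows are read in index order, no sort step

print("without index:", sort_plan)
print("with index:   ", index_plan)
```

The separate sort step disappears from the plan once an index on the ORDER BY column exists, which is the effect described above.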
If we are looking for a simple rule of thumb... for a large set, if we need to pull (or look at) less than 10% of the total rows, then an access plan using a suitable index will typically outperform a full scan. If we are looking for a specific row, based on a unique identifier, an index is going to be faster than a full scan. If we are pulling all columns for every row in the table in no particular order, then a full scan is going to be faster.
Again, it really comes down to what operations are being performed. What queries are being executed, and the performance profile that we need from those queries. That is going to be the key to determining the indexing strategy.
In terms of gaining understanding, use EXPLAIN to see the execution plan. And learn the operations available to the MySQL optimizer.
(The topic of indexing strategy in terms of database performance is much too large for a StackOverflow question.)
Each situation is different. If you profile your code, then you'll understand better each anti-pattern. To demonstrate the extreme unexpectedness, consider Oracle:
If this were Oracle, I would say zero because if an empty table's high water mark is very high, then a query that motivates a full table scan that returns zero rows would be much more expensive than the same query that were to induce even a full index scan.
The same process that I went through to understand Oracle you can do with MySQL: profile your code.

Indexing all field of two columns table

I have a table/schema with two columns: day, a DateTime, and user_id, an Integer. Right now both columns are indexed.
Are the performance improvements gained from indexing worth it, considering the large fraction of additional space used by the indexes when there are only two columns? How do you justify them?
How does this differ if I use MongoDB or MySQL?
If there are few rows, you might not see great improvements with indexes. If there are many rows, you probably will see great improvements.
The good thing is that you don't have to guess, and you don't have to agonize over what few and many mean in practice. Every modern SQL dbms includes some way to measure SELECT statement performance. That includes MySQL.
MySQL EXPLAIN
How to use it
Is performance improvements gained from indexing worth it
Depends on the queries you intend to run.
If you have something like: WHERE day = ..., then you'll need an index whose leading edge contains day. If properly used, indexes can speed up querying by many orders of magnitude, especially on large data sets.
OTOH, every additional index costs space/cache and INSERT/UPDATE/DELETE performance.
At the end of the day, I recommend you measure on realistic amounts of data and come to your own conclusions.
BTW, if you are using InnoDB, then your table is clustered (see also: Understanding InnoDB clustered indexes) and the whole table is effectively stored in the primary index. The secondary indexes in clustered tables contain a copy of the PK fields, which (I'm assuming) is user_id in this case. And since we only have two fields in the table, the secondary index on { day } will cover user_id as well, avoiding a double lookup that could otherwise happen in a clustered table. Effectively, you'll end up with two separate (but synchronized) B-Trees and an index-only scan no matter which one of them you access (which is good). Of course, you could explicitly make a composite index on { day, user_id } instead of just { day }, for a very similar effect.

Anyone has experience in index covering

What is the technique of Index Covering (a.k.a. Covering Index)?
When considering overall performance, what are the advantages/disadvantages to their use?
The reasoning behind creating a covering index is that all the columns that are either output by your query or referenced in its WHERE clause are present "within" the index data structure (either as part of the index key or as an included column).
This in turn means that the database engine does not need to retrieve any additional database data pages in order to satisfy the needs of your query. In a nutshell, this means that in the vast majority of cases the query will be faster.
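A minimal sketch of the difference, using Python's bundled SQLite because its plan output labels the covered case explicitly. The orders table, its columns, and the index name are hypothetical; in MySQL the corresponding sign is "Using index" in the Extra column of EXPLAIN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE orders (
                    id INTEGER PRIMARY KEY,
                    customer_id INTEGER, status TEXT, notes TEXT)""")
conn.execute(
    "CREATE INDEX idx_customer_status ON orders (customer_id, status)")

def plan(sql):
    """Return the plan steps (last column of EXPLAIN QUERY PLAN) as strings."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

# Every referenced column (customer_id, status) lives in the index.
covered = plan("SELECT status FROM orders WHERE customer_id = 42")

# notes is not in the index, so each match needs a lookup in the table.
not_covered = plan("SELECT status, notes FROM orders WHERE customer_id = 42")

print(covered)
print(not_covered)
```

Both queries search the same index, but only the first can be answered from the index alone; the second still has to fetch each matching row from the table.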
There is an excellent reference, SQL Server Optimization that provides an explanation with example of a covering index in SQL Server.
Here is a nice discussion on MySQL: How to exploit MySQL index optimizations
Now, when considering disadvantages, that's an interesting question. Suppose we had a very wide table and, in order to create a covering index for your query, you had to incorporate say 20 large data type columns: your index could quickly become quite large. You would then need to weigh up the performance gain against the index maintenance and table insert/update costs. It would be one of those "it depends" cases (dependent on workload patterns, data used, etc.).
In addition to John's answer:
Advantage: Faster access speed if the query can be answered from the covered fields as the access to the row is not needed.
Disadvantage: Slower update speed as more data in indices needs to be updated.

MySQL indexes - what are the best practices?

I've been using indexes on my MySQL databases for a while now but never properly learnt about them. Generally I put an index on any fields that I will be searching or selecting using a WHERE clause but sometimes it doesn't seem so black and white.
What are the best practices for MySQL indexes?
Example situations/dilemmas:
If a table has six columns and all of them are searchable, should I index all of them or none of them?
What are the negative performance impacts of indexing?
If I have a VARCHAR 2500 column which is searchable from parts of my site, should I index it?
You should definitely spend some time reading up on indexing, there's a lot written about it, and it's important to understand what's going on.
Broadly speaking, an index imposes an ordering on the rows of a table.
For simplicity's sake, imagine a table is just a big CSV file. Whenever a row is inserted, it's inserted at the end. So the "natural" ordering of the table is just the order in which rows were inserted.
Imagine you've got that CSV file loaded up in a very rudimentary spreadsheet application. All this spreadsheet does is display the data, and numbers the rows in sequential order.
Now imagine that you need to find all the rows that have some value "M" in the third column. Given what you have available, you have only one option. You scan the table checking the value of the third column for each row. If you've got a lot of rows, this method (a "table scan") can take a long time!
Now imagine that in addition to this table, you've got an index. This particular index is the index of values in the third column. The index lists all of the values from the third column, in some meaningful order (say, alphabetically) and for each of them, provides a list of row numbers where that value appears.
Now you have a good strategy for finding all the rows where the value of the third column is "M". For instance, you can perform a binary search! Whereas the table scan requires you to look at N rows (where N is the number of rows), the binary search only requires that you look at about log N index entries, in the very worst case. Wow, that's sure a lot easier!
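The spreadsheet analogy can be sketched directly in Python. This is a toy model, not how a database engine stores data: the "table" is a list of tuples, and the "index" is a sorted list of (value, row number) pairs searched with the standard bisect module.

```python
from bisect import bisect_left, bisect_right

LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

# The "CSV file": rows in insertion order; the third column holds a letter.
rows = [(i, f"row{i}", LETTERS[i % 26]) for i in range(1000)]

def table_scan(rows, value):
    """Check the third column of every row: N comparisons."""
    return [n for n, row in enumerate(rows) if row[2] == value]

# The "index": third-column values in sorted order, each with its row number.
index = sorted((row[2], n) for n, row in enumerate(rows))
index_keys = [key for key, _ in index]

def index_lookup(value):
    """Binary-search the sorted keys, then read off the row numbers."""
    lo = bisect_left(index_keys, value)
    hi = bisect_right(index_keys, value)
    return [n for _, n in index[lo:hi]]

scan_result = table_scan(rows, "M")
index_result = index_lookup("M")
print(len(scan_result), scan_result == index_result)  # 38 True
```

Both approaches return the same 38 row numbers, but the scan inspects all 1000 rows while the two binary searches inspect on the order of 10 index entries each.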
Of course, if you have this index, and you're adding rows to the table (at the end, since that's how our conceptual table works), you need to update the index each and every time. So you do a little more work while you're writing new rows, but you save a ton of time when you're searching for something.
So, in general, indexing creates a tradeoff between read efficiency and write efficiency. With no indexes, inserts can be very fast -- the database engine just adds a row to the table. As you add indexes, the engine must update each index while performing the insert.
On the other hand, reads become a lot faster.
Hopefully that covers your first two questions (as others have answered -- you need to find the right balance).
Your third scenario is a little more complicated. If you're using LIKE, indexing engines will typically help with your read speed up to the first "%". In other words, if you're SELECTing WHERE column LIKE 'foo%bar%', the database will use the index to find all the rows where column starts with "foo", and then need to scan that intermediate rowset to find the subset that contains "bar". SELECT ... WHERE column LIKE '%bar%' can't use the index. I hope you can see why.
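Continuing the toy model of an index as a sorted list, here is a sketch of why a leading literal prefix helps and a leading "%" does not. The sample values are invented, and the comparison is case-sensitive for simplicity, whereas MySQL's LIKE is case-insensitive under its usual default collations.

```python
from bisect import bisect_left

# A sorted "index" over a text column (values are made up).
titles = sorted(["bar", "foobar", "food", "foo-bar-baz",
                 "football", "fop", "foxbar", "zebra"])

def like_prefix_then_contains(sorted_values, prefix, must_contain):
    """Emulate LIKE 'prefix%must_contain%' against a sorted index."""
    # Jump straight to the first entry that could start with the prefix.
    start = bisect_left(sorted_values, prefix)
    matches = []
    for value in sorted_values[start:]:
        if not value.startswith(prefix):
            break  # sorted order: no later entry can start with the prefix
        # The remainder of the pattern is checked entry by entry.
        if must_contain in value[len(prefix):]:
            matches.append(value)
    return matches

def like_contains(values, needle):
    """Emulate LIKE '%needle%': nothing to seek on, so check every entry."""
    return [value for value in values if needle in value]

prefix_matches = like_prefix_then_contains(titles, "foo", "bar")
contains_matches = like_contains(titles, "bar")
print(prefix_matches)    # ['foo-bar-baz', 'foobar']
print(contains_matches)  # ['bar', 'foo-bar-baz', 'foobar', 'foxbar']
```

With a prefix, the sorted order gives a starting point and a stopping point. Without one, matching entries can be anywhere in the order, so every entry has to be examined.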
Finally, you need to start thinking about indexes on more than one column. The concept is the same, and behaves similarly to the LIKE stuff -- essentially, if you have an index on (a,b,c), the engine will continue using the index from left to right as best it can. So a search on column a might use the (a,b,c) index, as would one on (a,b). However, the engine would need to do a full table scan if you were searching WHERE b=5 AND c=1.
Hopefully this helps shed a little light, but I must reiterate that you're best off spending a few hours digging around for good articles that explain these things in depth. It's also a good idea to read your particular database server's documentation. The way indices are implemented and used by query planners can vary pretty widely.
Check out presentations like More Mastering the Art of Indexing.
Update 12/2012: I have posted a new presentation of mine: How to Design Indexes, Really. I presented this in October 2012 at ZendCon in Santa Clara, and in December 2012 at Percona Live London.
Designing the best indexes is a process that has to match the queries you run in your app.
It's hard to recommend any general-purpose rules about which columns are best to index, or whether you should index all columns, no columns, which indexes should span multiple columns, etc. It depends on the queries you need to run.
Yes, there is some overhead so you shouldn't create indexes needlessly. But you should create the indexes that give benefit to the queries you need to run quickly. The overhead of an index is usually far outweighed by its benefit.
For a column that is VARCHAR(2500), you probably want to use a FULLTEXT index or a prefix index:
CREATE INDEX i ON SomeTable(longVarchar(100));
Note that a conventional index can't help if you're searching for words that may be in the middle of that long varchar. For that, use a fulltext index.
I won't repeat some of the good advice in other answers, but will add:
Compound Indices
You can create compound indices - an index that includes multiple columns. MySQL can use these from left to right. So if you have:
Table A
Id
Name
Category
Age
Description
if you have a compound index that includes Name/Category/Age in that order, these WHERE clauses would use the index:
WHERE Name='Eric' and Category='A'
WHERE Name='Eric' and Category='A' and Age > 18
but
WHERE Category='A' and Age > 18
would not use that index because everything has to be used from left to right.
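The left-to-right rule can be checked in a query plan. The sketch below uses Python's bundled SQLite as a stand-in for MySQL, with a table shaped like Table A above; the table name table_a and the index name are assumptions. Some optimizers have a skip-scan strategy that can relax this rule in certain situations, so treat the second result as typical rather than guaranteed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE table_a (
                    id INTEGER PRIMARY KEY,
                    name TEXT, category TEXT, age INTEGER, description TEXT)""")
conn.execute(
    "CREATE INDEX idx_name_cat_age ON table_a (name, category, age)")

def plan(sql):
    """Return the plan steps (last column of EXPLAIN QUERY PLAN) as strings."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

# Starts at the leftmost indexed column: the compound index is usable.
uses_prefix = plan(
    "SELECT * FROM table_a "
    "WHERE name = 'Eric' AND category = 'A' AND age > 18")

# Skips the leftmost column (name): the index order gives no starting point.
skips_prefix = plan(
    "SELECT * FROM table_a WHERE category = 'A' AND age > 18")

print(uses_prefix)
print(skips_prefix)
```

The first plan searches idx_name_cat_age; the second falls back to scanning the table. If queries on Category and Age alone are common, they would need their own index starting with one of those columns.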
Explain
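Use EXPLAIN (or EXPLAIN EXTENDED on older versions; newer MySQL versions no longer need the EXTENDED keyword) to understand what indices are available to MySQL and which one it actually selects. MySQL will generally use only ONE index per table in a query; index merge is the main exception.
EXPLAIN SELECT * FROM SomeTable WHERE Something='ABC'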
Slow Query Log
Turn on the slow query log to see which queries are running slow.
Wide Columns
If you have a wide column where MOST of the distinction happens in the first several characters, you can use only the first N characters in your index. Example: We have a ReferenceNumber column defined as varchar(255) but 97% of the cases, the reference number is 10 characters or less. I changed the index to only look at the first 10 characters and improved performance quite a bit.
If a table has six columns and all of them are searchable, should i index all of them or none of them
Are you searching on a field by field basis or are some searches using multiple fields?
Which fields are most being searched on?
What are the field types? (Index works better on INTs than on VARCHARs for example)
Have you tried using EXPLAIN on the queries that are being run?
What are the negative performance impacts of indexing
UPDATEs and INSERTs will be slower. There's also the extra storage space requirement, but that's usually unimportant these days.
If i have a VARCHAR 2500 column which is searchable from parts of my site, should i index it
No, unless it's UNIQUE (which means it's already indexed) or you only search for exact matches on that field (not using LIKE or MySQL's fulltext search).
Generally I put an index on any fields that i will be searching or selecting using a WHERE clause
I'd normally index the fields that are the most queried, and then INTs/BOOLEANs/ENUMs rather than fields that are VARCHARs. Don't forget, often you need to create an index on combined fields, rather than an index on an individual field. Use EXPLAIN, and check the slow log.
Load Data Efficiently: Indexes speed up retrievals but slow down inserts and deletes, as well as updates of values in indexed columns. That is, indexes slow down most operations that involve writing. This occurs because writing a row requires writing not only the data row, it requires changes to any indexes as well. The more indexes a table has, the more changes need to be made, and the greater the average performance degradation. Most tables receive many reads and few writes, but for a table with a high percentage of writes, the cost of index updating might be significant.
Avoid Indexes: If you don’t need a particular index to help queries perform better, don’t create it.
Disk Space: An index takes up disk space, and multiple indexes take up correspondingly more space. This might cause you to reach a table size limit more quickly than if there are no indexes. Avoid indexes wherever possible.
Takeaway: Don't over index
In general, indices help speed up database searches, with the disadvantage of using extra disk space and slowing INSERT / UPDATE / DELETE queries. Use EXPLAIN and read the results to find out when MySQL uses your indices.
If a table has six columns and all of them are searchable, should i index all of them or none of them?
Indexing all six columns isn't always the best practice.
(a) Are you going to use any of those columns when searching for specific information?
(b) What is the selectivity of those columns (how many distinct values are there stored, in comparison to the total amount of records on the table)?
MySQL uses a cost-based optimizer, which tries to find the "cheapest" path when performing a query. And fields with low selectivity aren't good candidates.
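Selectivity as defined in (b) is easy to measure directly. A minimal sketch, using Python's bundled SQLite and an invented people table; the same COUNT(DISTINCT ...) / COUNT(*) query works unchanged in MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (id INTEGER PRIMARY KEY, email TEXT, gender TEXT)")
conn.executemany(
    "INSERT INTO people (email, gender) VALUES (?, ?)",
    [(f"user{i}@example.com", "MF"[i % 2]) for i in range(1000)])

def selectivity(column):
    """Distinct values divided by total rows, for a trusted column name."""
    # The column name is interpolated into the SQL text, so only pass
    # identifiers you control, never user input.
    distinct, total = conn.execute(
        f"SELECT COUNT(DISTINCT {column}), COUNT(*) FROM people").fetchone()
    return distinct / total

email_selectivity = selectivity("email")    # every value is distinct
gender_selectivity = selectivity("gender")  # only two distinct values

print(email_selectivity, gender_selectivity)  # 1.0 0.002
```

A value near 1 means an equality lookup narrows the search to very few rows, which is where an index pays off. A value near 0, as with the two-valued column here, means each lookup still matches a large share of the table.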
What are the negative performance impacts of indexing?
Already answered: extra disk space, lower performance during insert - update - delete.
If i have a VARCHAR 2500 column which is searchable from parts of my site, should i index it?
Try the FULLTEXT Index.
1/2) Indexes speed up certain select operations but they slow down other operations like insert, update and deletes. It can be a fine balance.
3) use a full text index or perhaps sphinx

MySQL indexes - how many are enough?

I'm trying to fine-tune my MySQL server, so I'm checking my settings, analyzing the slow-query log, and simplifying my queries where possible.
Sometimes correct indexing is enough, sometimes not. I've read somewhere (please correct me if this is stupidity) that having more indexes than I need has the same effect as not having any indexes at all.
How many indexes are enough? You can say it depends on hundreds of factors, but I'm curious about how can I clean up my mysql-slow.log enough to reduce server load.
Furthermore, I saw some "interesting" log entries like this:
# Query_time: 0 Lock_time: 0 Rows_sent: 22 Rows_examined: 44
SELECT * FROM `categories` ORDER BY `orderid` ASC;
The table in question contains exactly 22 rows, index set in orderid. Why is this query showing up in the log after all? Why examine 44 rows if it only contains 22?
The right amount of indexing, and the line where it becomes too much, will depend on a lot of factors. On small tables like your "categories" table you usually don't want or need an index, and it can actually hurt performance. The reason is that it takes I/O (i.e. time) to read an index and then more I/O and time to retrieve the records associated with the matched rows. An exception is when you only query the columns contained within the index.
In your example you are retrieving all the columns, and with only 22 rows it may be faster to just do a table scan and sort those instead of using the index. The optimizer may/should be doing this and ignoring the index. If that is the case, then the index is just taking up space with no benefit. If your "categories" table is accessed often, you may want to consider pinning it into memory so the db server keeps it accessible without having to go to the disk all the time.
When adding indexes you need to balance out disk space, query performance, and the performance of updating and inserting into the tables. You can get away with more indexes on tables that are static and don't change much, as opposed to tables with millions of updates a day. You'll start feeling the effects of index maintenance at that point. What is acceptable in your environment, though, can only be determined by you and your organization.
When doing your analysis, be sure to generate/update your table and index statistics so that you can be assured of accurate calculations.
As a general rule, you should have indexes on all primary keys (you don't have a choice in that), all foreign keys, and any other fields you commonly use to fetch rows.
For example, if I commonly look up users by username, I would have that indexed, even if user ID was the primary key.
How many indexes depends entirely on the queries you're running, what kinds of joins are being done (if any), the kind of data stored in the table and how big the tables are (as well as many other factors). There's really no exact science to it. The greatest tool in your arsenal for figuring out how to optimize a query is EXPLAIN. Using EXPLAIN you can find out what kind of joins are being done, what possible keys could be used and which key (if any) was used, as well as how many rows were examined for each table in the join.
Using this information you can decide how to key your tables and/or modify your queries to make them more efficient. The syntax for explain is very simple.
EXPLAIN SELECT * FROM `categories` ORDER BY `orderid` ASC;
Note, explain does not actually run the query. So if you're using this to debug a query that takes 5 minutes to run, explain will still be very fast.
You do need to be careful when adding indexes though as they do cause inserts and updates to go slower and on very large tables this performance hit can become noticeable. Especially if that same table is used for a lot of reads. While adding a lot of indexes generally won't kill the performance of a query, you should still only add them as yo
Also keep in mind that MySQL will typically use at most one index per table in a select statement (so a join can use one for each joined table; index merge optimizations are the exception). So indexing just because is a waste of disk space and will slow the database down on writes. If you commonly filter on two columns in a WHERE clause, create one index containing both of those columns; it will usually be significantly faster than indexing just one alone.
An index can speed up a SELECT query, but it will slow down INSERT/UPDATE/DELETE queries because they need to update the index as well, not just the row.
This is just personal opinion (I've got no facts to back it up), but I think that if there is a query that is taking a long time and an index would speed it up - go for it! "Too many" indexes would be if you added indexes that didn't do any good (e.g. there were no queries it would speed up). For example, a silly thing to do would be to place an index on every column "just because".
There's no magic number for the "best" number of indexes. The basic rule is this: add indexes for queries that are used often and/or need to run quickly.
Having "too many" indexes shouldn't slow down queries, but each index added adds a small amount of time to add/update items in the db (since it modifies the indices as well), and a small amount of space. However, if you're just adding indexes as required, this is probably not a big concern.