Optimizing usage of covering indexes - MySQL

I had never heard of covering indexes before and just came across them. I was reading this page on them, and it says:
"A covering index can dramatically speed up data retrieval but may itself be large due to the additional keys, which slow down data insertion & update. To reduce such index size, some systems allow non-key fields to be included in the index. Non-key fields are not themselves part of the index ordering but only included at the leaf level, allowing for a covering index with less overall index size."
So my question is how do you know if your system allows non-key fields to be included in the index?

MySQL does not (currently) support non-key (included) columns in indexes. For other DBMSs you will need to check the reference manual.
A similar question has been asked and answered here. However, since the performance improvement gained by using covering indexes is generally greater for non-selective queries returning a large number of rows, I can't envisage the workaround of just including the extra columns within the index key itself ever offering a performance improvement. That said, there may be scenarios I am not thinking of, and yours may be one of them, so as always when looking for a performance improvement, testing, execution plans and IO statistics will tell you far more than my conjecture!
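For illustration only (the orders table and its columns are hypothetical), here is a hedged sketch of the difference: SQL Server exposes an INCLUDE clause for non-key columns, while in MySQL the only way to cover the same query is to append the extra column to the index key itself.

    -- SQL Server (supports non-key columns): last_name lives only at the leaf level
    -- CREATE INDEX ix_orders_customer ON orders (customer_id) INCLUDE (last_name);

    -- MySQL has no INCLUDE clause, so the extra column has to become part of the key,
    -- which makes the index larger but still lets it cover the query
    CREATE INDEX ix_orders_customer ON orders (customer_id, last_name);

    -- Either index can satisfy this query without touching the base table
    SELECT customer_id, last_name FROM orders WHERE customer_id = 42;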

Related

What's the minimum number of rows where indexing becomes valuable in MySQL?

I've read that indexing on some databases (SQL Server is the one I read about) doesn't have much effect until you cross a certain threshold of rows because the database will hold the entire table X in memory.
Ordinarily, I'd plan to index on my WHEREs and unique columns/lesser-changed tables. After hearing about the suggested minimum (which was about 10k), I wanted to learn more about that idea. If there are tables that I know will never pass a certain point, this might change the way I index some of them.
For something like MySQL MyISAM/InnoDB, is there a point where indexing has little value, and what are some ways of determining that?
Note: Very respectfully, I'm not looking for suggestions about structuring my database like "You should index anyway," I'm looking to understand this concept, if it's true or not, how to determine the thresholds, and similar information.
One of the major uses of indexes is to reduce the number of pages being read. The index itself is usually smaller than the table. So, just in terms of page read/writes, you generally need at least three data pages to see a benefit, because using an index requires at least two data pages (one for the index and one for the original data).
(Actually, if the index covers the query, then the breakeven is two.)
The number of data pages needed for a table depends on the size of the records and the number of rows. So, it is really not possible to specify a threshold on the number of rows.
The above very rudimentary explanation leaves out a few things:
The cost of scanning the data pages to do comparisons for each row.
The cost of loading and using index pages.
Other uses of indexing.
But it gives you an idea, and you can see benefits on tables much smaller than 10k rows. That said, you can easily run tests on your own data to see how queries behave on the tables in question.
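As a rough sketch of such a test (table and column names invented here), create the index and ask the optimizer what it would do; on a handful of rows it may well ignore the index:

    CREATE TABLE t_small (
        id      INT PRIMARY KEY,
        status  VARCHAR(20),
        payload VARCHAR(200),
        KEY ix_status (status)
    );

    -- On a table of a few dozen rows the plan may still show type: ALL (a full
    -- scan), because reading a single data page is cheaper than using the index.
    EXPLAIN SELECT * FROM t_small WHERE status = 'active';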
Also, I strongly, strongly recommend having primary keys on all tables and using those keys for foreign key relationships. The primary key itself is an index.
Indexes serve a lot of purposes. InnoDB tables are always organized as an index, on the cluster key. Indexes can be used to enforce unique constraints, as well as support foreign key constraints. The topic of "indexes" spans way more than query performance.
In terms of query performance, it really depends on what the query is doing. If we are selecting a small subset of rows out of a large set, then effective use of an index can speed that up by eliminating vast swaths of rows from being checked. That's where the biggest bang comes from.
If we are pulling all of the rows, or nearly all the rows, from a set, then an index typically doesn't help narrow down which rows to check; even when an index is available, the optimizer may choose to do a full scan of all of the rows.
But even when pulling large subsets, appropriate indexes can improve performance for join operations, and can significantly improve performance of queries with GROUP BY or ORDER BY clauses, by making use of an index to retrieve rows in order, rather than requiring a "Using filesort" operation.
If we are looking for a simple rule of thumb... for a large set, if we need to pull (or look at) less than 10% of the total rows, then an access plan using a suitable index will typically outperform a full scan. If we are looking for a specific row based on a unique identifier, an index is going to be faster than a full scan. If we are pulling all columns for every row in the table in no particular order, then a full scan is going to be faster.
Again, it really comes down to what operations are being performed. What queries are being executed, and the performance profile that we need from those queries. That is going to be the key to determining the indexing strategy.
In terms of gaining understanding, use EXPLAIN to see the execution plan, and learn the operations available to the MySQL optimizer.
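For example (hypothetical table and index), the output tells you whether the optimizer chose an index lookup or a full scan, and whether a filesort was avoided:

    -- Assumes an index such as KEY ix_cust_date (customer_id, order_date).
    -- type: ref / key: ix_cust_date  => index lookup; type: ALL => full table scan.
    -- Absence of "Using filesort" in Extra means the ORDER BY used the index order.
    EXPLAIN
    SELECT order_id, total
    FROM   orders
    WHERE  customer_id = 42
    ORDER  BY order_date;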
(The topic of indexing strategy in terms of database performance is much too large for a StackOverflow question.)
Each situation is different. If you profile your code, then you'll understand each anti-pattern better. To demonstrate how unexpected the answer can be, consider Oracle:
If this were Oracle, I would say zero, because if an empty table's high-water mark is very high, then a query that triggers a full table scan and returns zero rows would be much more expensive than the same query using even a full index scan.
The same process that I went through to understand Oracle you can do with MySQL: profile your code.

MySQL partitioning vs indexing performance

In MySQL InnoDB, is there a performance advantage to partitioning the table compared to simply using an index?
Common considerations:
Is an Index the Best Solution?
An index isn't always the right tool. At a high level, keep in mind that indexes are most effective when they help the storage engine find rows without adding more work than they avoid. For very small tables, it is often more effective to simply read all the rows in the table. For medium to large tables, indexes can be very effective. For enormous tables, the overhead of indexing, as well as the work required to actually use the indexes, can start to add up. In such cases you might need to choose a technique that identifies groups of rows that are interesting to the query, instead of individual rows. You can use partitioning for this purpose.
If you have lots of tables, it can also make sense to create a metadata table to store some characteristics of interest for your queries. For example, if you execute queries that perform aggregations over rows in a multitenant application whose data is partitioned into many tables, you can record which users of the system are actually stored in each table, thus letting you simply ignore tables that don't have information about those users. These tactics are usually useful only at extremely large scales. In fact, this is a crude approximation of what Infobright does. At the scale of terabytes, locating individual rows doesn't make sense; indexes are replaced by per-block metadata.
One thing is sure: you can't scan the whole table every time you want to query it, because it's too big. And you don't want to use an index because of the maintenance cost and space consumption. Depending on the index, you could get a lot of fragmentation and poorly clustered data, which would cause death by a thousand cuts through random I/O. You can sometimes work around this for one or two indexes, but rarely for more. Only two workable options remain: your query must be a sequential scan over a portion of the table, or the desired portion of the table and index must fit entirely in memory.
It's worth restating this: at very large sizes, B-Tree indexes don't work. Unless the index covers the query completely, the server needs to look up the full rows in the table, and that causes random I/O a row at a time over a very large space, which will just kill query response times. The cost of maintaining the index (disk space, I/O operations) is also very high. Systems such as Infobright acknowledge this and throw B-Tree indexes out entirely, opting for something coarser-grained but less costly at scale, such as per-block metadata over large blocks of data.
This is what partitioning can accomplish, too. The key is to think about partitioning as a crude form of indexing that has very low overhead and gets you in the neighborhood of the data you want. From there, you can either scan the neighborhood sequentially, or fit the neighborhood in memory and index it. Partitioning has low overhead because there is no data structure that points to rows and must be updated: partitioning doesn't identify data at the precision of rows, and has no data structure to speak of. Instead, it has an equation that says which partitions can contain which categories of rows.
(Many thanks to the excellent book High Performance MySQL.)
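As a concrete sketch of that idea (schema invented for illustration), a range-partitioned table lets MySQL prune whole partitions before it looks at any rows or index entries:

    -- A query constrained on order_date only has to touch the partitions whose
    -- ranges can possibly match (partition pruning).
    CREATE TABLE orders_archive (
        order_id   BIGINT NOT NULL,
        order_date DATE   NOT NULL,
        amount     DECIMAL(10,2),
        PRIMARY KEY (order_id, order_date)
    )
    PARTITION BY RANGE (YEAR(order_date)) (
        PARTITION p2021 VALUES LESS THAN (2022),
        PARTITION p2022 VALUES LESS THAN (2023),
        PARTITION pmax  VALUES LESS THAN MAXVALUE
    );

    -- The "partitions" column of EXPLAIN should list only p2022 here.
    EXPLAIN SELECT SUM(amount)
    FROM   orders_archive
    WHERE  order_date >= '2022-01-01' AND order_date < '2023-01-01';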
99% of cases I have looked at do not benefit from PARTITIONing as much as from INDEXing.
My Rules of Thumb for using Partitioning are in http://mysql.rjweb.org/doc.php/partitionmaint . Also, that lists the only 4 use cases where partitioning improves performance.
OK, I can't say "exactly" 99%, but it is very close to that. I do believe strongly in the "4" -- I have been searching since partitioning was added to MySQL many years ago.
For Data Warehousing, the usual performance solution is to create and maintain "Summary tables". This works nicely for 'most' DW applications.
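A minimal sketch of the summary-table approach, with invented table and column names: the detail rows are rolled up once, and reports hit the small summary table instead of scanning the fact table.

    CREATE TABLE sales_daily_summary (
        sale_date  DATE NOT NULL,
        product_id INT  NOT NULL,
        total_qty  INT  NOT NULL,
        total_amt  DECIMAL(12,2) NOT NULL,
        PRIMARY KEY (sale_date, product_id)
    );

    -- Run once per day (e.g. from cron) to fold yesterday's detail rows in.
    INSERT INTO sales_daily_summary (sale_date, product_id, total_qty, total_amt)
    SELECT sale_date, product_id, SUM(qty), SUM(amount)
    FROM   sales
    WHERE  sale_date = CURRENT_DATE - INTERVAL 1 DAY
    GROUP  BY sale_date, product_id;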
"Very large BTrees don't work"? Bull. A million-row index will have a BTree depth of about 3. A trillion rows -- about 6. Where's the "won't work"? A "point query" on a trillion row table will touch twice as many nodes in the BTree, and more of them are unlikely to be cached. But it "will work".
Infobright, with its "columnar storage", has its niche. TokuDB, with its "fractal indexing", has its niche. Neither one can say "we are better than BTrees most of the time". (Both those engines get part of their speed by compression.)
Bottom Line: Use an index. Probably a "composite" index. (More indexing tips: http://mysql.rjweb.org/doc.php/index_cookbook_mysql )

Indexes: why not just index everything, and when to use indexes?

Indexes are used to find rows with specific column values quickly. Without an index, MySQL must begin with the first row and then read through the entire table to find the relevant rows.
Says our beloved MySQL manual.
In that case, why not just index every column?
And since I have this feeling that it would be a bigger hit to performance, when should we use indexes/best practices for indexes?
Thanks in advance!
Creating an index always comes at a cost: The more indices you have on a table, the more expensive it is to modify that table (i.e. inserts, updates and deletes take longer).
In turn, queries that can use the indices will be faster. It's a classical tradeoff. On most tables a small number of commonly used indices is worth the cost, because queries happen often enough (or their performance is much more important than the modification performance).
On the other hand, if you have some kind of log table that is updated very often, but queried only very rarely (for example in case of a catastrophic failure), then adding an index would add a big cost and provide very little advantage.
Also: whether or not an index is useful depends a lot on the exact query to be executed. It's possible to have indices spanning each column and yet find the query unable to use them, because the indices are in the wrong order, hold the wrong information, or are in the wrong format. So not all indices help all queries.
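A small illustration of that last point (hypothetical schema): MySQL uses composite indexes left to right, so an index that starts with the "wrong" column cannot help a query that only filters on a later column.

    CREATE INDEX ix_last_first ON people (last_name, first_name);

    -- Can use ix_last_first: the leftmost indexed column is constrained.
    SELECT * FROM people WHERE last_name = 'Smith' AND first_name = 'Anna';

    -- Cannot seek on ix_last_first: last_name is not constrained, so the
    -- index order does not match the predicate.
    SELECT * FROM people WHERE first_name = 'Anna';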
By your logic, you wouldn't index just every column, but every permutation of every column. The overhead involved in storing this information, and in keeping it up to date, would be utterly vast.
Generally, an index is helpful if it has good selectivity, i.e. when the query selects a small portion of the data based on the value (or range) of the indexed attribute.
Indexes are also good for merge joins, where sorting rows by the join attribute in both joined tables allows matching rows and retrieving data in one pass.
As already mentioned, indexes slow down updates and take up some memory (which, by itself, slows down performance as well).

Anyone have experience with index covering?

What is the technique of Index Covering (a.k.a. Covering Index)?
When considering overall performance, what are the advantages/disadvantages to their use?
The reasoning behind creating a covering index is that all the columns required by your query, whether they appear in the output or are referenced in the WHERE clause, are present "within" the index data structure (either as part of the index key or as an included column).
This in turn means that the database engine does not need to retrieve any additional database data pages in order to satisfy the needs of your query. In a nutshell, this means that in the vast majority of cases the query will be faster.
There is an excellent reference, SQL Server Optimization, that provides an explanation with an example of a covering index in SQL Server.
Here is a nice discussion on MySQL: How to exploit MySQL index optimizations
Now, when considering disadvantages, that's an interesting question. Suppose we had a very wide table, and in order to create a covering index for your query you had to incorporate, say, 20 large data type columns; your index could quickly become quite large. You would then need to weigh the performance gain against the index maintenance and table insert/update costs. It would be one of those "it depends" cases (dependent on workload patterns, the data used, etc.).
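To make that concrete (assumed table, column and index names), the tell-tale sign in MySQL is "Using index" in the Extra column of EXPLAIN, which means the query was answered from the index alone:

    -- The index holds every column the query touches, so no row lookup is needed.
    CREATE INDEX ix_cover ON orders (customer_id, order_date, total);

    -- Expect Extra: "Using index" in the EXPLAIN output, i.e. a covering index.
    EXPLAIN
    SELECT order_date, total
    FROM   orders
    WHERE  customer_id = 42;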
In addition to John's answer:
Advantage: Faster access speed if the query can be answered from the covered fields as the access to the row is not needed.
Disadvantage: Slower update speed as more data in indices needs to be updated.

Most efficient way to find index'd records in mysql?

Hey all, I have tables with millions of rows in them, and some of the SELECT queries key off of 3 fields:
company, user, articleid
Would it be faster to create a composite index of those three fields as a key, or to MD5 (company, user, articleid) together and then index the hash that's created?
thanks
You would have to benchmark to be sure, but I believe that you will find that there isn't going to be a significant performance difference between a composite index of three fields and a single index of a hash of those fields.
In my opinion, creating data that wouldn't otherwise exist and is only going to be used for indexing is a bad idea (except in the case of de-normalization for performance reasons, but you'd need a conclusive case to do it here). For a 32-byte field of MD5 data (minus any field overhead), consider that for every one million rows, you have created approximately an extra 30 MB of data. Even if the index were a teensy tiny bit faster, you've just upped the disk and memory requirements for that table. Your index seek time might be offset by disk seek time. Add in the fact that you have to have application logic to support this field, and I would opine that it's not worth it.
Again, the only true way to know would be to benchmark it, but I don't think you'll find much of a difference.
For performance, you might see advantages with the composite index. If you are selecting only the fields in the index, this is a "covering index" situation. That means the data engine will not have to read the actual data page from disk; just reading the index is enough to return the data requested by your application. This can be a big performance boost. If you store a hash, you eliminate the possibility of taking advantage of a covering index (unless you are selecting only the hash in your SQL).
best regards,
don
One more consideration in favor of a composite key: having a composite key on (company, user, articleid) means that it can be used when you search for a record by company, by company+user, or by company+user+articleid. So you virtually have 3 indexes.
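In other words (sketch only, assuming a table named articles with those columns), one composite index serves all three leftmost-prefix query shapes:

    CREATE INDEX ix_company_user_article ON articles (company, user, articleid);

    -- All three of these can use ix_company_user_article via its leftmost prefixes:
    SELECT * FROM articles WHERE company = 7;
    SELECT * FROM articles WHERE company = 7 AND user = 1234;
    SELECT * FROM articles WHERE company = 7 AND user = 1234 AND articleid = 99;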
A composite index seems to be the way to go, in particular since some of the individual keys appear to be fairly selective. The only situation which may cause you to possibly avoid the composite index approach is if the length of the composite key is very long (say in excess of 64 characters, on average).
While an MD5-based index would be smaller and hence possibly slightly faster, it would leave you to deal with filtering the false positives out of the list of records sharing a given MD5 value.
When building a composite index, the question arises of the order in which the keys should be listed in the index. While this speaks, somewhat, to the potential efficiency of the index, the ordering has a more significant impact on the potential usability of the index in cases when only two (or even one...) of the keys are used in the query. One typically tries to put the most selective column(s) first, unless this (these) selective column(s) is (are) the ones most likely to be missing when a complete set of these columns is not found in the query.