MySQL: Unique Index = Performance Characteristics for large datasets?

What are the performance characteristics of unique indexes in MySQL, and of MySQL indexes in general (such as the primary key index)?
Given that I insert or update a record in my database: will the speed of updating the record (i.e. building/updating the indexes) differ if the table has 10 thousand records compared to 100 million records? Put differently, does the time to update the index after changing one row depend on the total index size?
Does this also apply to other MySQL indexes, such as the primary key index?
Thank you very much
Tom

Most indexes in MySQL are really the same internally -- they're B-tree data structures. As such, updating a B-tree index is an O(log n) operation. So it does cost more as the number of entries in the index increases, but not badly.
In general, the benefit you get from an index far outweighs the cost of updating it.
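A minimal sketch of such a unique index (the table and column names here are hypothetical); the duplicate check InnoDB performs on every insert or update is a single descent of this B-tree, so its cost grows only logarithmically with table size:
CREATE TABLE users (
    id    INT AUTO_INCREMENT PRIMARY KEY,
    email VARCHAR(255) NOT NULL,
    UNIQUE KEY uq_email (email)
) ENGINE=InnoDB;

-- This insert descends the uq_email B-tree once to verify uniqueness,
-- whether the table holds ten thousand rows or a hundred million.
INSERT INTO users (email) VALUES ('tom@example.com');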

A typical MySQL storage engine implements an index as a set of sorted values (I'm not sure whether any storage engine uses a different strategy, but I believe this holds for the popular ones), so updating the index inevitably takes longer as it grows. However, the slow-down need not be all that bad: locating a key in a sorted index of N keys is O(log N), and it's possible (though not trivial) to make the update itself O(1), at least in the amortized sense, once the position is found. So if you square the number of records, as in your example (10 thousand to 100 million), and you pick a storage engine with a highly optimized implementation, you could reasonably hope for the index update to take only about twice as long on the big table as on the small table, since log(N^2) = 2 log(N).

Note that if new primary key values are always larger than the previous ones (e.g. an auto-increment integer column), new entries are simply appended at the end of the index, so it will not need to be reorganized.

Related

Index on Boolean field to delete records in a partitioned table

I have a large MySQL table which may contain 100 million records. The schema of the table is something like this:
Id varchar(36),    -- guid, primary key
IsDirty bit(1),
CreatedOn date,
Info varchar(500)
I have partitioned the table on the CreatedOn field, with one partition per month of data. Some of the rows in the table are updated and IsDirty is set to 1. At most, only 10% of the rows would have IsDirty = 1. There is a process that runs every night and deletes data which is 6 months old and has IsDirty = 0.
Is there any performance gain if I create an index on the IsDirty field as well? From what I've read, creating an index on a bit field may not add much to performance, and maintaining the index after deleting the records may degrade performance.
Is my understanding correct? Is there a better way to achieve the desired functionality?
There is a rule of thumb which says that it's best to index columns with a high cardinality. Cardinality is the estimated number of distinct values in the column. When you run show indexes from your_table; you will see that your IsDirty column has a cardinality of 2. Very bad.
However, this does not consider the distribution of the data. When only 10% of rows have IsDirty = 1, queries like select * from your_table where IsDirty = 1 would benefit from the index. Your delete job, on the other hand, which checks for IsDirty = 0, would not benefit: it's cheaper to simply do a full table scan, because using a secondary index means reading the primary key from the index (every secondary index stores the primary key, which is why it's always good to keep the primary key as small as possible) and then fetching each matching row by that key, and with roughly 90% of rows matching, that per-row lookup costs more than scanning the table. Both statements are sketched below.
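For concreteness, a minimal sketch of the index in question and the nightly purge (the index name is hypothetical; the date arithmetic assumes the standard MySQL DATE_SUB function):
CREATE INDEX idx_isdirty ON your_table (IsDirty);

-- Benefits from the index: only ~10% of rows match.
SELECT * FROM your_table WHERE IsDirty = 1;

-- Nightly purge: ~90% of old rows match IsDirty = 0, so the optimizer
-- will usually prefer a table (or partition) scan over the index here.
DELETE FROM your_table
WHERE IsDirty = 0
  AND CreatedOn < DATE_SUB(CURDATE(), INTERVAL 6 MONTH);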
The manual states the following about when a full table scan is preferred:
Each table index is queried, and the best index is used unless the optimizer believes that it is more efficient to use a table scan. At one time, a scan was used based on whether the best index spanned more than 30% of the table, but a fixed percentage no longer determines the choice between using an index or a scan. The optimizer now is more complex and bases its estimate on additional factors such as table size, number of rows, and I/O block size.
Also note that the bit datatype is not ideal for storing the values 0 or 1. There is a bool datatype (which is internally realised as tinyint(1); I think I've read a reason for this somewhere, but I've forgotten it).
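A quick way to see the bool/tinyint equivalence for yourself (throwaway table, name is arbitrary):
CREATE TABLE bool_demo (flag BOOL);
SHOW CREATE TABLE bool_demo;   -- the column is reported as `flag` tinyint(1)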
Don't bother with partitioning; it is unlikely to help performance. Anyway, you would need a growing number of partitions and PARTITION BY RANGE(TO_DAYS(..)). And you would not be able to use DROP PARTITION (which would otherwise make the deletion very fast), because the rows to delete depend on IsDirty, not just on age.
I'll tentatively take that back. This may work, and may allow for DROP PARTITION, but I am baffled as to the syntax.
PARTITION BY RANGE(TO_DAYS(CreatedOn))
SUBPARTITION BY LINEAR KEY(IsDirty)
SUBPARTITIONS 2
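For reference, here is a hedged sketch of the plain date-range partitioning without the subpartitioning above (partition names are illustrative; note that MySQL requires every column in the partitioning expression to be part of every unique key, so the primary key would have to include CreatedOn):
ALTER TABLE your_table
PARTITION BY RANGE (TO_DAYS(CreatedOn)) (
    PARTITION p202401 VALUES LESS THAN (TO_DAYS('2024-02-01')),
    PARTITION p202402 VALUES LESS THAN (TO_DAYS('2024-03-01')),
    PARTITION pmax    VALUES LESS THAN MAXVALUE
);

-- If a whole month could be purged regardless of IsDirty, this would replace
-- the big DELETE and is nearly instantaneous:
ALTER TABLE your_table DROP PARTITION p202401;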
If you do end up with a big DELETE every night, then either
Do it hourly (or continually) so that each delete is not too big
Chunk it as discussed here
Also, have
INDEX(IsDirty, CreatedOn) -- in this order.
(Note: if the subpartitioning can be made to work, this index is not needed.)
Other tips:
Use InnoDB.
Set innodb_buffer_pool_size to about 70% of RAM size.
UUIDs are horrible for large tables due to the randomness of accessing -- hence high I/O.
Id varchar(36), --guid, primary key -- Pack it into BINARY(16); see the sketch after this list. (Let me know if you need help.) Saving space --> shrinks table --> cuts back on I/O.
Because of the awfulness of UUIDs, the partitioning may help avoid a lot of the I/O -- all of this month's inserts will go into one partition. That is, the "working set", and hence the needed buffer_pool size, can be smaller.
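A hedged sketch of the BINARY(16) packing (this assumes MySQL 8.0, which provides UUID_TO_BIN and BIN_TO_UUID; on older versions you would use UNHEX(REPLACE(uuid, '-', '')) instead):
-- Assumed new column definition: Id BINARY(16) PRIMARY KEY
INSERT INTO your_table (Id, IsDirty, CreatedOn, Info)
VALUES (UUID_TO_BIN('3f06af63-a93c-11e4-9797-00505690773f', 1), 0, CURDATE(), 'example');
-- The second argument (1) swaps the timestamp halves so that version-1 UUIDs
-- become roughly insert-ordered, which reduces the random I/O mentioned above.

SELECT BIN_TO_UUID(Id, 1) AS Id FROM your_table LIMIT 1;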

Innodb, clustered indexes, and slow_query_log - hurt by the primary key?

In the last few months we've migrated a few tables from MyISAM to InnoDB. We did this, in theory, for the row-locking advantage, since we are updating individual rows through multiple web-scraping instances. I now have tens of thousands of slow queries building up in my slow_query_log (10s), and a lot of full table scans. I am seeing even very simple updates to one row (updating 4 or 5 columns) take 28 seconds. (Our I/O and efficiency figures are very good, and failed/aborted attempts are very low, < 0.5%.)
The two tables we update the most have ID (int 11) as the primary key. In InnoDB the primary key is a clustered key, so rows are stored on disk in ID order. BUT our two most important record-identifying columns are BillsofLading and Container (both varchar(22)). Most of our DML queries look up records based on these two columns. We also have indexes on BillsofLading and Container.
The way I understand it, InnoDB also uses the primary key when creating these two secondary indexes.
So, I could have a record with ID=1 and BillsofLading='z', and another record with ID=9 and BillsofLading='a'. With InnoDB indexes, when updating a record based on a SELECT where BillsofLading='a', would I not still have to do a full scan to find 'a', since the index is based on the ID?
Thanks in advance for helping with the logic here!
No, your example should not require a full scan, assuming that MySQL is choosing to use your BillsofLading index. You say
The way I understand it, InnoDB also uses the primary key when creating these two secondary indexes.
This is correct, but it does not mean what you think it means.
A primary key (PK) to InnoDB is like a line number to a human being. It's InnoDB's only way to uniquely identify a row. So what a secondary index does is map each value of the target column(s) to all the PKs (i.e. line numbers) that match that value. Thus, MySQL can quickly jump to just the BillsofLading you care about, and then scan just those rows (PKs) for the one you want.
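A hedged illustration (the table name is hypothetical; the columns are from the question): the secondary index maps each BillsofLading value to the matching primary keys, so the lookup is an index seek followed by primary-key lookups, not a full scan:
CREATE INDEX idx_bol ON shipments (BillsofLading);

-- EXPLAIN should show idx_bol being used (type: ref),
-- not a full table scan (type: ALL):
EXPLAIN SELECT * FROM shipments WHERE BillsofLading = 'a';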
Did you severely decrease key_buffer_size and set innodb_buffer_pool_size to about 70% of available RAM?
There are a number of subtle differences between MyISAM and InnoDB. There are too many to list in this answer, and nothing you have said brings to mind any particular issue. I suggest you review Converting MyISAM to InnoDB to see what might be causing trouble.

Removing a Primary Key (Clustered Index) to increase Insert performance

We've been experiencing SQL timeouts and have identified the bottleneck to be an audit table: all tables in our system have insert, update and delete triggers which create a new audit record.
This means that the audit table is the largest and busiest table in the system. Yet data only goes in, and never comes out (under this system) so no select performance is required.
Running a select top 10 returns recently inserted records rather than the 'first' records. ORDER BY works, of course, but I would expect a select top to return rows based on their order on disk, which I'd expect to be the lowest PK values.
It's been suggested that we drop the clustered index, and in fact the primary key (unique constraint) as well. As I mentioned earlier there's no need to select from this table within this system.
What sort of performance hit does a clustered index create on a table? What are the (non-select) ramifications of having an unindexed, unclustered, key-less table? Any other suggestions?
edit
our auditing involves CLR functions, and I am now benchmarking with & without the PK, indexes, FKs etc. to determine the relative cost of the CLR functions and the constraints.
After investigation, the poor performance was not related to the insert statements but instead the CLR function which orchestrated the auditing. After removing the CLR and instead using a straight TSQL proc, performance improved 20-fold.
During the testing I've also determined that the clustered index and identity columns make little or no difference to the insert time, at least relative to any other processing that takes place.
// updating 10k rows in a table with trigger
// using CLR function
PK (identity, clustered) - ~78000ms
No PK, no index - ~81000ms
// using straight TSQL
PK (identity, clustered) - 2174ms
No PK, no index - 2102ms
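For context, a hedged sketch of the kind of plain T-SQL audit trigger those "straight TSQL" numbers correspond to (table and column names are hypothetical; this is not the original CLR-based implementation):
CREATE TRIGGER trg_Orders_Audit ON dbo.Orders
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;
    -- One audit row per statement; a per-row design would select from
    -- the inserted/deleted pseudo-tables instead.
    INSERT INTO dbo.Audit (TableName, Action, ChangedAt)
    SELECT 'Orders',
           CASE
               WHEN EXISTS (SELECT 1 FROM inserted) AND EXISTS (SELECT 1 FROM deleted) THEN 'UPDATE'
               WHEN EXISTS (SELECT 1 FROM inserted) THEN 'INSERT'
               ELSE 'DELETE'
           END,
           SYSUTCDATETIME();
END;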
According to Kimberly Tripp - the Queen of Indexing - having a clustered index on a table actually helps INSERT performance:
The Clustered Index Debate Continued
Inserts are faster in a clustered table (but only in the "right"
clustered table) than compared to a heap. The primary problem here is
that lookups in the IAM/PFS to determine the insert location in a heap
are slower than in a clustered table (where insert location is known,
defined by the clustered key). Inserts are faster when inserted into a
table where order is defined (CL) and where that order is
ever-increasing.
Source: blog post called The Clustered Index Debate Continues....
A great test script and description of this scenario is available on Tibor Karaszi's blog at SQLblog.com
My numbers don't entirely match his - I see more difference on a batch statement than I do with per-row statements.
With the row count around one million I fairly consistently get a single-row insert loop on clustered index to perform slightly faster than on a non-indexed (clustered taking approximately 97% as long as non-indexed).
Conversely, the batch insert (10000 rows) is faster into a non-indexed table than into a clustered index (anything from 75%-85% of the clustered insert time).
clustered - loop - 1689
heap - loop - 1713
clustered - one statement - 85
heap - one statement - 62
He describes what's happening on each insert:
Heap: SQL Server needs to find where the row should go. For this it
uses one or more IAM pages for the heap, and it cross references these
to one or more PFS pages for the database file(s). IMO, there should
be potential for a noticeable overhead here. And even more, with many
users hammering the same table I can imagine blocking (waits) against
the PFS and possibly also IAM pages.
Clustered table: Now, this is dead simple. SQL Server navigates the
clustered index tree and finds where the row should go. Since this is
an ever increasing index key, each row will go to the end of the table
(linked list).
A table without a key? Not even an auto-incrementing surrogate key? :(
As long as the key is monotonically increasing the index maintenance upon insert should be good -- it's just "added at the end". The "clustered" just means the physical layout of the table follows the index (as the data is part of the index). As long as the index isn't fragmented (see monotonically increasing bit) then the cluster itself/data won't be logically fragmented and this shouldn't be a performance issue. (If there are updates then the clustering is a slightly different story: the record updated may "grow" and cause fragmentation.)
My suggestion is, if that is the chosen route, then ... benchmark it with realistic data/load and then decide if such suggestions are warranted. It would be nice to know if this change was decided upon, and why.
Happy coding.
Also, any reliance upon ordering other than an explicit ORDER BY is flawed by design. It may work now, but it is an implementation detail and may change in subtle ways (as simple as a different query plan). With the auto-increment key, an ORDER BY DESC would always produce the correct result (bear in mind that auto-increment IDs can be skipped, but unless "reset" they will always be increasing in insert order).
My primitive understanding is that even INSERT operations are usually faster with a clustered index, than with a heap. Additionally, disk-space requirements are lower with clustered indexes.
Some interesting tests / scenarios that might shed some light for your particular circumstance: http://technet.microsoft.com/en-us/library/cc917672.aspx.

What are the benefits of a sequential index key, and how much wiggle room do I have?

Sequential keys allow one to use a clustered index. How material is that benefit? How much is lost if, say, 1% of the keys are out of sequential order by one or two positions?
Thanks,
JDelage
Short:
A clustered index, in general, can be used on anything that is sortable. Sequentiality (no gaps) is not required: your records will be maintained in order with the usual index maintenance principles (the only difference is that with a clustered index the leaves are big, because they hold the data too).
Long:
Good clustering can give you orders of magnitude improvements.
Basically with good clustering you will be reading data very efficiently on any spinning media.
Whether the clustering is good should be evaluated by examining the most common queries (those that actually read data and cannot be answered by indexes alone).
So, for example if you have composite natural keys as primary key on which the table is clustered AND if you always access the data according to the subset of the key then with simple sequential disk reads you will get answers to your query in the most efficient way.
However, if the most common way to access this data is not according to the natural key (for example the application spends 95% of time looking for last 5 records within the group AND the date of update is not part of the clustered index), then you will not be doing sequential reads and your choice of the clustered index might not be the best.
So, all this is at the level of physical implementation - this is where things depend on the usage.
Note:
Not so relevant today, but tomorrow I would expect most DBs to run off SSDs, where access times keep getting better and random-access reads are similar in speed to sequential reads; with that, the importance of clustered indexes will diminish.
You need to understand the purpose of the clustered-index.
It may be helpful in some cases to speed up inserts, but mostly we use clustered indexes to make queries faster.
Consider the case where you want to read a range of keys from a table - this is very common - it's called a range scan.
Range scans on a clustered index are massively better than a range scan on a secondary index (one that is not a covering index). This is the main use case for clustered indexes. It mostly saves one IO operation per row in your result. That can be the difference between a query needing, say, 10 IO operations and 1000.
It really is amazing, particularly if you have no blobs and lots of records per page.
If you have no SPECIFIC performance problem that you need to fix, don't worry about it.
But do also remember that it is possible to make a composite primary key, and that your "unique ID" need not be the whole primary key. A common (very good) technique is to add something which you want to range-scan as the FIRST part of the PK, and a unique ID (meaningless) afterwards.
So consider the case where you want to scan your table by time - you can make the time the first part of the PK (it is not going to be unique, so it's not enough on its own), and a unique ID the second.
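A minimal MySQL/InnoDB sketch of that technique (table and column names are hypothetical):
CREATE TABLE events (
    event_time DATETIME NOT NULL,
    id         BIGINT NOT NULL AUTO_INCREMENT,
    payload    VARCHAR(255),
    PRIMARY KEY (event_time, id),   -- time first, so the table is clustered by time
    KEY (id)                        -- InnoDB needs the AUTO_INCREMENT column to lead some index
) ENGINE=InnoDB;

-- A time-range scan now reads one contiguous slice of the clustered index:
SELECT * FROM events
WHERE event_time >= '2024-01-01' AND event_time < '2024-02-01';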
Do not, however, do premature optimisation. If your database fits in memory (say, 32 GB), you don't care about IO operations; it's never going to do any reads anyway.

Most efficient way to find indexed records in MySQL?

Hey all, I have tables with millions of rows in them and some of the select queries key off of 3 fields:
company, user, articleid
Would it be faster to create a composite index of those three fields as a key, or to MD5 the (company, user, articleid) values together and then index the resulting hash?
thanks
You would have to benchmark to be sure, but I believe that you will find that there isn't going to be a significant performance difference between a composite index of three fields and a single index of a hash of those fields.
In my opinion, creating data that wouldn't otherwise exist and is only going to be used for indexing is a bad idea (except in the case of de-normalization for performance reasons, but you'd need a conclusive case to do it here). For a 32-byte field of hex-encoded MD5 data (not counting field overhead), consider that for every one million rows you have created approximately an extra 30 MB of data. Even if the index were a teensy tiny bit faster, you've just upped the disk and memory requirements for that table. Your index seek time might be offset by disk seek time. Add in the fact that you have to have application logic to support this field, and I would opine that it's not worth it.
Again, the only true way to know would be to benchmark it, but I don't think you'll find much of a difference.
For performance, you might see advantages with the composite index. If you are selecting only the fields in the index, this is a "covering index" situation: the data engine will not have to read the actual data page from disk, because reading the index alone is enough to return the data requested by your application. This can be a big performance boost. If you store a hash, you eliminate the possibility of taking advantage of a covering index (unless you are selecting only the hash in your SQL).
best regards,
don
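To make the covering-index point concrete, a hedged sketch (hypothetical table and index names):
CREATE INDEX idx_cua ON article_views (company, user, articleid);

-- Only indexed columns are selected, so the engine can answer this from the
-- index alone; EXPLAIN shows "Using index" in the Extra column.
EXPLAIN SELECT company, user, articleid
FROM article_views
WHERE company = 42 AND user = 7;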
One more consideration in favor of a composite key: having a composite index on (company, user, articleid) means that it can be used when you search for a record by company, by company + user, or by company + user + articleid. So you virtually have 3 indexes.
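In other words, using the same hypothetical index as above, all of these can use its leftmost prefix:
SELECT * FROM article_views WHERE company = 42;
SELECT * FROM article_views WHERE company = 42 AND user = 7;
SELECT * FROM article_views WHERE company = 42 AND user = 7 AND articleid = 1001;
-- A query on user or articleid alone, however, cannot use this index.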
A composite index seems to be the way to go, in particular since some of the individual keys appear to be fairly selective. The only situation which may cause you to possibly avoid the composite index approach is if the length of the composite key is very long (say in excess of 64 characters, on average).
While an MD5-based index would be smaller and hence possibly slightly faster, it would leave you with the task of filtering the false positives out of the list of records with a given MD5 value.
When building a composite index, the question arises of the order in which the columns should be listed. While this speaks, somewhat, to the potential efficiency of the index, the ordering has a more significant impact on the potential usability of the index when only two (or even one) of the columns appear in the query. One typically tries to put the most selective column(s) first, unless those selective columns are the ones most likely to be missing when the query does not include the complete set of columns.