Does a database have to rebuild its indexes every time a new row is inserted?
And by that token, wouldn't it mean that if I was inserting a lot, the index would be rebuilt constantly and therefore be less effective, or even useless, for querying?
I'm trying to understand some of this database theory for better database design.
Updates definitely don't require rebuilding the entire index every time you change a row (and likewise for inserts and deletes).
There's a little bit of overhead to updating entries in an index, but it's reasonably low cost. Most indexes are stored internally as a B+Tree data structure. This data structure was chosen because it allows easy modification.
MySQL also has a further optimization called the Change Buffer. This buffer helps reduce the performance cost of updating indexes by caching changes. That is, you do an INSERT/UPDATE/DELETE that affects an index, and the type of change is recorded in the Change Buffer. The next time you read that index with a query, MySQL reads the Change Buffer as a kind of supplement to the full index.
A good analogy might be a published document whose publisher periodically issues "errata", so you need to read both the document and the errata together to understand the document's current state.
Eventually, the entries in the Change Buffer are gradually merged into the index. This is analogous to the errata being edited into the document for the next time the document is reprinted.
The Change Buffer is used only for secondary indexes. It doesn't do anything for primary key or unique key indexes. Updates to unique indexes can't be deferred, but they still use the B+Tree so they're not so costly.
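If you're curious, you can check how the Change Buffer is configured on your server. The variable names below are the standard InnoDB ones; the comments show the usual defaults.

SHOW VARIABLES LIKE 'innodb_change_buffer%';
-- innodb_change_buffering        all   (which kinds of changes get buffered)
-- innodb_change_buffer_max_size  25    (max % of the buffer pool it may use)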
If you do OPTIMIZE TABLE or some types of ALTER TABLE changes that can't be done in-place, MySQL does rebuild the indexes from scratch. This can be useful to defragment an index after you delete a lot of the table, for example.
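For example (the table name is just a placeholder):

OPTIMIZE TABLE my_table;
-- For InnoDB, OPTIMIZE is mapped to a full table rebuild, roughly equivalent to:
ALTER TABLE my_table ENGINE=InnoDB;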
Yes, inserting affects them, but it's not as bad as you seem to think. Like most entities in relational databases, indexes are usually created and maintained with some extra space to accommodate growth, and are usually set up to increase that extra space automatically when it is nearly exhausted.
Rebuilding the index starts from scratch, and is different from adding entries to the index. Inserting a new row does not result in the rebuild of an index. The new entry gets added in the extra space mentioned above, except for clustered indexes which operate a little differently.
Most DB administrators also do a task called "updating statistics," which updates an internal set of statistics used by the query planner to come up with good query strategies. That task, performed as part of maintenance, also helps keep the query optimizer "in tune" with the current state of indexes.
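In MySQL, that maintenance step is just ANALYZE TABLE (the table name is a placeholder):

ANALYZE TABLE my_table;   -- refreshes the index statistics the query optimizer relies on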
There are enormous numbers of high-quality references on how databases work, both independent sites and those of the publishers of major databases. You literally can make a career out of becoming a database expert. But don't worry too much about your inserts causing troubles. ;) If in doubt, speak to your DBA if you have one.
Does that help address your concerns?
Related
I'm trying to figure out how multiple indexes are actually affecting insertion performance for MySQL InnoDB tables.
Is it possible to get information about index update times using performance_schema?
It seems like there are no instruments for stages that may reflect such information.
Even if there is something in performance_schema, it would be incomplete.
Non-UNIQUE secondary indexes are handled thus:
An INSERT starts.
Any UNIQUE indexes (including the PRIMARY KEY) are immediately checked for "dup key".
Other index changes are put into the "change buffer".
The INSERT returns to the client.
The Change Buffer is a portion of the buffer_pool (default: 25%) where such index modifications are held. Eventually, they will be batched up for updating the actual blocks of the index's BTree.
In a good situation, many index updates will be combined into very few read-modify-write steps to update a block. In a poor case, each index update requires a separate read and write.
The I/O for the change buffer is done 'in the background' as is the eventual write of any changes to data blocks. These cannot be realistically monitored in any way -- especially if there are different clients with different queries contributing to the same index or data blocks being updated.
Oh, and meanwhile, any index lookup needs to look both in the on-disk blocks (or those cached in the buffer_pool) and in the change buffer. This makes an index lookup faster or slower, depending on various things unrelated to the operation at hand.
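The closest you can get to a rough, server-wide view is the "INSERT BUFFER AND ADAPTIVE HASH INDEX" section of the InnoDB status output; it reports merge activity only in aggregate, not per index or per query.

SHOW ENGINE INNODB STATUS\G
-- look for the "INSERT BUFFER AND ADAPTIVE HASH INDEX" section,
-- which shows buffered and merged operations as server-wide totals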
I have a table which already contains an index in MySQL. I added some rows to the table; do I need to re-index the table somehow, or does MySQL do this for me automatically?
This is done automatically. It is also one reason why we sometimes choose not to create indexes: updating parts of an index on every insert carries a small but non-zero performance overhead.
If you define an index in MySQL then it will always reflect the current state of the database unless you have deliberately disabled indexing. As soon as indexing is re-enabled, the index will be brought up to date. Usually indexing is only disabled during large insertions for performance reasons.
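Disabling indexing during a bulk load typically looks like this (the table name is a placeholder; note that DISABLE KEYS affects only non-unique indexes, and only on MyISAM tables):

ALTER TABLE my_table DISABLE KEYS;
-- ... bulk INSERTs here ...
ALTER TABLE my_table ENABLE KEYS;   -- the indexes are brought up to date in one pass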
There is a cost associated with each index on your table. While a good index can speed up retrieval times immensely, every index you define slows insertion by a small amount. The insertion costs grow slowly with the size of the database. This is why you should only define indexes you absolutely need if you are going to be working on large sets of data.
If you want to see what indexes are defined, you can use SHOW CREATE TABLE to have a look at a particular table.
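For example (the table name is just a placeholder):

SHOW CREATE TABLE my_table\G
SHOW INDEX FROM my_table;   -- lists each index and the columns it covers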
No, you don't need to rebuild the index.
Inserting a record automatically updates the existing index.
I am trying to grasp some performance differences between Cassandra and relational databases.
From what I have read, Cassandra's write performance remains constant regardless of data volume. By write performance, I am assuming this implies both new rows being added as well as existing rows being replaced on a key match (like an update in the relational world). Is that assumption correct?
Also, from what I understand about relational databases, updates get slower as tables/partitions become larger. This is because a full table scan must be performed to locate the row, or an index lookup needs to be performed, and both of these things take longer as the table or partition grows. So updates take progressively longer as the data volume of the table/partition grows?
When new data is inserted into a relational database, I know any indexes need to have the new data added, but there is no lookup involved, correct? So will inserts also become progressively slower as data volume increases, or stay constant, with relational databases?
Thanks for any tips
They will become slower if the table has indexes. Not only must the data be written, but each index must be updated too. Inserting into a table that has no indexes and no constraints is lightning fast, because no checks need to be done. The record can just be written at the end of the table space.
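As a rough illustration (the table and column names are made up), every secondary index below is one more B-tree that each INSERT has to maintain, on top of the clustered primary key:

CREATE TABLE orders (
  id          INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  customer_id INT UNSIGNED NOT NULL,
  created_at  DATETIME     NOT NULL,
  status      VARCHAR(20)  NOT NULL,
  KEY idx_customer (customer_id),   -- maintained on every INSERT
  KEY idx_created  (created_at),    -- maintained on every INSERT
  KEY idx_status   (status)         -- maintained on every INSERT
) ENGINE=InnoDB;

INSERT INTO orders (customer_id, created_at, status)
VALUES (42, NOW(), 'new');          -- one row written, but four B-trees touched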
On the relational DB side, I've been doing load testing on our RDBMS where I can see that the performance drops exponentially as data is added to the DB.
I'm still working on a Cassandra setup to be able to realize a comparable test. In the meantime, this Cassandra presentation gives some info on Cassandra compared to MySQL:
http://www.slideshare.net/Eweaver/cassandra-presentation-at-nosql
I have a table with about 200,000 records. I want to add a field to it:
ALTER TABLE `table` ADD `param_21` BOOL NOT NULL COMMENT 'about the field' AFTER `param_20`
But it seems to be a very heavy query, and it takes a very long time, even on my quad-core AMD PC with 4 GB of RAM.
I am running under Windows/XAMPP and phpMyAdmin.
Does MySQL have to touch every record when adding a field?
Or can I change the query so it makes the change more quickly?
MySQL will, in almost all cases, rebuild the table during an ALTER**. This is because the row-based engines (i.e. all of them) HAVE to do this to retain the data in the right format for querying. It's also because there are many other changes you could make which would also require rebuilding the table (such as changing indexes, primary keys, etc.).
I don't know what engine you're using, but I will assume MyISAM. MyISAM copies the data file, making any necessary format changes - this is relatively quick and is not likely to take much longer than the IO hardware needs to get the old datafile in and the new one out to disc.
Rebuilding the indexes is really the killer. Depending on how you have it configured, MySQL will do one of two things. In the good case, for each index it puts the indexed columns into a filesort buffer (which may be in memory but is typically on disc), sorts that using its filesort() function (which does a quicksort by recursively copying the data between two files if it's too big for memory), and then builds the entire index from the sorted data.
If it can't do the filesort trick, it will just behave as if you did an INSERT on every row, and populate the index blocks with each row's data in turn. This is painfully slow and results in far from optimal indexes.
You can tell which it's doing by using SHOW PROCESSLIST during the process. "Repairing by filesort" is good, "Repairing with keycache" is bad.
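For example, while the ALTER is running:

SHOW PROCESSLIST;
-- or, to watch only the ALTER's state (assuming the standard information_schema.PROCESSLIST table):
SELECT ID, STATE, TIME FROM information_schema.PROCESSLIST WHERE INFO LIKE 'ALTER TABLE%';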
All of this will use AT MOST one core, but will sometimes be IO bound as well (especially copying the data file).
** There are some exceptions, such as dropping secondary indexes on innodb plugin tables.
You are adding a NOT NULL column, so the tuples need to be populated. It will be slow...
This touches each of the 200,000 records, as each record needs to be updated with a new BOOL value which is not allowed to be NULL.
So, yes, it's an expensive query... There is nothing you can do to make it faster.
Consider this scenario: my database table has 300,000 rows and has a fulltext index. Whenever a search is done, it locks the database and doesn't allow anyone else to log in to the portal.
Any advice on how to get things sorted out here will be really appreciated.
Does logging on perform a write to the table? E.g. a 'last visit' time?
If so, you may expect behaviour something like this, because MyISAM writes take a lock over the entire table. Usually this is avoided by not using noddy MyISAM and going to InnoDB instead, which has row-level locking (amongst other desirable database features).
The problem, of course, is that you only get fulltext search with MyISAM.
So you'll need to split your tables up. If you can keep the read-heavy and fulltext stuff in a different table to the stuff that needs writing (but linked using the same primary key), you can probably make it so that the two operations don't affect each other.
Better, migrate the bulk of the table to InnoDB, leaving only a fulltext field in MyISAM. Everything except fulltext searches can then steer clear of the MyISAM table, and use only the InnoDB table, which exhibits much better locking performance.

Personally, I now tend to store everything in the InnoDB table, including the text, and store a second copy of the text in the MyISAM table purely for fulltext searchbait purposes; this simplifies queries and code and brings the advantages of InnoDB's consistency to the text content, and I also use it to process the searchbait to get stemming and other features MySQL's fulltext doesn't normally support. But it does mean you have to spend a lot more space on storage.
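A minimal sketch of that split might look like this (table and column names are made up; the two tables share the same primary key value):

CREATE TABLE article (
  id    INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  title VARCHAR(255) NOT NULL,
  body  MEDIUMTEXT   NOT NULL
) ENGINE=InnoDB;

CREATE TABLE article_searchbait (
  article_id INT UNSIGNED NOT NULL PRIMARY KEY,
  searchbait MEDIUMTEXT   NOT NULL,
  FULLTEXT KEY ft_searchbait (searchbait)
) ENGINE=MyISAM;

-- Fulltext queries hit only the MyISAM copy; everything else stays on InnoDB:
SELECT a.*
FROM article_searchbait s
JOIN article a ON a.id = s.article_id
WHERE MATCH(s.searchbait) AGAINST ('some words');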
You can also improve matters by cutting down number of writes. For example if it is a 'last visit' timestamp you're writing, you can avoid writing that unless, say, a minute has passed between the previous time and now, on the basis that no-one needs to know the exact second someone last accessed the site.
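For example, a throttled 'last visit' write might look like this (table and column names are made up):

UPDATE users
SET last_visit = NOW()
WHERE id = 123
  AND (last_visit IS NULL OR last_visit < NOW() - INTERVAL 1 MINUTE);
-- only performs (and therefore only locks for) a write when at least a minute has passed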
If you use an external search engine such as Lucene or Sphinx (the latter also available as a MySQL plug-in), it should be able to read and index without locking the table. These engines store a local copy of the indexed records, so they don't have to read the table very often, and never need to write to it.