I got an Oracle 9i book from the Oracle publisher.
In it, it is written:
An index is bad on a table that is frequently updated or into which new rows are frequently inserted.
Is this true? Or is it just about Oracle [and not about other RDBMS packages]?
Edit
I have a table in MySQL like this:
ID [pk / AI]
User [integer]
Text [TinyText]
Time [Timestamp]
Only writes and reads are performed on this table.
Since the PK creates an index, is the table design broken?
If yes, how do I solve this type of problem [where an AI column is the primary key]?
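Roughly, the table looks like this (the table name and exact column types here are just placeholders for illustration):

```sql
CREATE TABLE messages (
    `ID`   INT UNSIGNED NOT NULL AUTO_INCREMENT,          -- PK / AI
    `User` INT UNSIGNED NOT NULL,
    `Text` TINYTEXT,
    `Time` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (`ID`)
) ENGINE=InnoDB;
```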
This is true with any database, to an extent. Whenever indexed columns are updated, the index must also be updated, and each additional index adds extra overhead. Whether this matters for your specific situation depends on the indexes you create and the workload the server is running. Performance implications are best discovered via benchmarking.
Indexes are only for retrieving data. Because they are pointers to data location(s), INSERT/UPDATE/DELETE statements are slower, since the indexes must be maintained as well. On top of that, indexes can become fragmented, because deletions and updates change the underlying data -- which is why there are tools to maintain indexes and table statistics (both of which are used by the optimizer to determine the EXPLAIN plan).
Keep in mind that indexes are not ANSI -- it's a miracle the syntax & terminology are so similar. But the functionality is near identical between databases that provide them. For example, Oracle only has "indexes", while both MySQL and SQL Server differentiate between clustered (one per table) and non-clustered indexes.
To address your update about the primary key: the primary key is unique and considered immutable (though it can in fact be updated, as long as the new value remains unique within the column). Deletion from the table will fragment the index, which requires monitoring with database-vendor-specific tools if performance becomes an issue.
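In MySQL, for instance, that maintenance looks roughly like this (using the hypothetical table name from the question):

```sql
ANALYZE TABLE messages;      -- refresh the statistics the optimizer uses
OPTIMIZE TABLE messages;     -- rebuild the table and its indexes to reduce fragmentation
SHOW INDEX FROM messages;    -- inspect index cardinality
```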
It is not that indexes on highly volatile tables are "bad". It's simply that there is a performance impact on the DML operations. I think what the author was trying to say is that you should carefully consider the need for indexes on such active tables.
As with everything in computing, it's all about tradeoffs. As @Michael essentially states, "it depends". You might have a high query rate on the table as well, in which case the indexes avoid a lot of full table scans. In that case, the index-maintenance overhead may well be worth the benefit the indexes provide to queries.
Also, I probably wouldn't buy a 9i book anyway, unless it was a real bargain. I'd recommend you read almost anything by Tom Kyte you can get your hands on, especially "Expert Oracle Database Architecture" or "Effective Oracle by Design".
Related
I am trying to improve the performance of some large tables (millions of records) in a MySQL 8.0.20 DB on RDS.
Scaling up the DB instance and its IOPS is not the way to go, as it is very expensive (the DB is live 24/7).
Proper indexes (including composite ones) do already exist to improve the query performance.
The DB is mostly read-heavy, with occasional massive writes - when these writes happen, reads can be just as massive at the same time.
I thought about partitioning. Since MySQL doesn't support vertical partitioning, I considered horizontal partitioning, which should work very well for these large tables: they contain activity records from dozens/hundreds of accounts, and storing each account's records in a separate partition makes a lot of sense to me.
But these tables do have some foreign key constraints, which rules out using MySQL's horizontal partitioning: Restrictions and Limitations on Partitioning
Foreign keys not supported for partitioned InnoDB tables. Partitioned tables using the InnoDB storage engine do not support foreign keys. More specifically, this means that the following two statements are true:
No definition of an InnoDB table employing user-defined partitioning may contain foreign key references; no InnoDB table whose definition contains foreign key references may be partitioned.
No InnoDB table definition may contain a foreign key reference to a user-partitioned table; no InnoDB table with user-defined partitioning may contain columns referenced by foreign keys.
What are my options, other than "sharding" by using separate tables to store activity records on a per-account basis? That would require a big code change to accommodate such tables. Hopefully there is a better way, one that only requires changes in MySQL and not in the application code. If the code does need to be changed - the less, the better :)
storing each account's records in a separate partition makes a lot of sense to me
Instead, have the PRIMARY KEY start with acct_id. This provides performance at least as good as PARTITION BY acct_id, saves disk space, and "clusters" an account's data together for "locality of reference".
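A minimal sketch, assuming a table named activity with an AUTO_INCREMENT id (the extra KEY(id) keeps AUTO_INCREMENT valid once id is no longer first in the PK; the other column is a placeholder):

```sql
CREATE TABLE activity (
    acct_id INT UNSIGNED NOT NULL,
    id      BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    action  VARCHAR(64) NOT NULL,        -- placeholder for the real columns
    PRIMARY KEY (acct_id, id),           -- clusters each account's rows together
    KEY (id)                             -- required so the AUTO_INCREMENT column leads some index
) ENGINE=InnoDB;
```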
The DB is mostly read-heavy
Replicas allow 'infinite' scaling of reads. But if you are not overloading the single machine now, there may be no need for this.
with occasional massive writes
Let's discuss techniques to help with that. Please explain what those writes entail -- hourly/daily/sporadic? replace random rows / whole table / etc? keyed off what? Etc.
Proper indexes (including composite ones) do already exist to improve the query performance.
Use the slowlog (with long_query_time = 1 or lower) to verify. Use pt-query-digest to find the top one or two queries. Show them to us -- we can help you "think out of the box".
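For reference, the slow log can be enabled at runtime with standard system variables, and the resulting file summarized with pt-query-digest:

```sql
SET GLOBAL slow_query_log = ON;
SET GLOBAL long_query_time = 1;   -- seconds; lower it to catch more queries
-- Then, on the shell:  pt-query-digest /path/to/slow.log
```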
read-heavy
Is the working set size less than innodb_buffer_pool_size? That is, are you CPU-bound and not I/O-bound?
More on PARTITION
PRIMARY KEY(acct_id, ..some other columns..) orders the data primarily on acct_id and makes this efficient: WHERE acct_id=123 AND ....
PARTITION BY .. (acct_id) -- A PARTITION is implemented as a separate "table". "Partition pruning" is the act of deciding which partition(s) are needed for the query. So WHERE acct_id=123 AND ... will first do that pruning, then look for the row(s) in that "table" to handle the AND .... Hopefully, there is a good index (perhaps the PRIMARY KEY) to handle that part of the filtering.
The pruning sort of takes the place of one level of the BTree; it is hard to predict which will be slower or faster.
Note that when partitioning by, say, acct_id, it is usually not efficient to start the index with that column. (However, it would still need to appear later in the PK.)
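To make the comparison concrete, here is a rough sketch of the partitioned variant (foreign keys left aside, since they are what blocks this for you; the names are the hypothetical ones used above). Note that the partitioning column must be part of every unique key, including the PK:

```sql
CREATE TABLE activity_part (
    acct_id INT UNSIGNED NOT NULL,
    id      BIGINT UNSIGNED NOT NULL,
    action  VARCHAR(64) NOT NULL,        -- placeholder column
    PRIMARY KEY (id, acct_id)            -- acct_id later in the PK, per the note above
) ENGINE=InnoDB
PARTITION BY HASH (acct_id) PARTITIONS 16;

-- Pruning: only the partition that can hold acct_id = 123 is searched.
EXPLAIN SELECT * FROM activity_part WHERE acct_id = 123 AND action = 'login';
```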
Big Deletes
There are several ways to do a "big delete" while minimizing the impact on the system. Partitioning by date is optimal but does not sound viable for your type of data. Check out the others listed here: http://mysql.rjweb.org/doc.php/deletebig
Since you say that the deletion is usually less than 15%, the "copy over what needs to be kept" technique is not applicable either.
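One of the techniques from that page, sketched against the hypothetical table above: delete in modest chunks so each transaction stays small and locks are held only briefly.

```sql
-- Repeat (from application code or a scheduled event) until no rows are affected.
DELETE FROM activity
 WHERE acct_id = 123
 ORDER BY id
 LIMIT 1000;
```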
Before sharding or partitioning, first analyze your queries to make sure they are as optimized as you can make them. This usually means designing indexes specifically to support the queries you run. You might like my presentation How to Design Indexes, Really (video).
Partitioning isn't as much a solution as people think. It has many restrictions, including the foreign key issue you found. Besides that, it only improves queries that can take advantage of partition pruning.
Also, I've done a lot of benchmarking of Amazon RDS for my current job and also a previous job. RDS is slow. It's really slow. It uses remote EBS storage, so it's bound to incur overhead for every read from storage or write to storage. RDS is just not suitable for any application that needs high performance.
Amazon Aurora is significantly better on latency and throughput. But it's also very expensive. The more you use it, the more I/O requests you consume, and they charge extra for those. For a busy app, you end up spending as much as you did for RDS with high provisioned IOPS.
The only way I found to get high performance in the cloud is to forget about managed databases like RDS and Aurora, and instead install and run your own instance of MySQL on an ec2 instance with locally-attached NVMe storage. This means the i3 family of ec2 instances. But local storage is ephemeral instance storage, so if the instance restarts, you lose your data. So you must add one or more replicas and have a failover plan.
If you need an OLTP database in the cloud, and you also need top-tier performance, you either have to spend $$$ for a managed database, or else you need to hire full-time DevOps and DBA staff to run it.
Sorry to give you the bad news, but the TANSTAAFL adage remains true.
Does anybody know whether an FK reduces insert/update operations in MySQL?
I use the InnoDB engine.
Having a FK on a table implicitly creates (and maintains) an index.
When doing certain write operations, the FK's implicit INDEX is checked to verify the existence of the appropriate row in the other table. This is a minor performance burden during writes.
When doing SELECT ... JOIN for which you failed to explicitly provide the appropriate index, the implicit index produced by some FK may come into play. This is a big benefit to some JOINs, but does not require an FK, since you could have added the INDEX manually.
If the FK definition includes ON DELETE or UPDATE, then even more work may be done, especially for CASCADE. The effect of CASCADE can be achieved with a SELECT plus more code -- but not as efficiently as letting CASCADE do the work.
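A minimal sketch of the points above (table and column names are illustrative): the FK on orders.customer_id gets an index to maintain, and ON DELETE CASCADE moves the cleanup work into the engine.

```sql
CREATE TABLE customers (
    id INT UNSIGNED NOT NULL PRIMARY KEY
) ENGINE=InnoDB;

CREATE TABLE orders (
    id          INT UNSIGNED NOT NULL PRIMARY KEY,
    customer_id INT UNSIGNED NOT NULL,
    FOREIGN KEY (customer_id) REFERENCES customers(id) ON DELETE CASCADE
) ENGINE=InnoDB;

DELETE FROM customers WHERE id = 42;   -- also deletes that customer's orders via CASCADE
```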
FKs are limited in what they can do. Stack Overflow is littered with questions like "How can I get an FK to do X?"
Does any of this sound like "reducing insert/update operations"?
does FK reduce insert/update operations in MySQL?
It's not specific to MySQL, but yes, it does. Creating an FK on a column creates a secondary index, so on every DML operation that index needs to be updated as well (and the table statistics kept current), so that the DB optimizer can generate a correct and efficient query plan.
Does a database have to rebuild its indexes every time a new row is inserted?
And by that token, wouldn't it mean that if I was inserting a lot, the index would be rebuilt constantly and therefore be less effective/useless for querying?
I'm trying to understand some of this database theory for better database design.
Updates definitely don't require rebuilding the entire index every time you update it (and likewise for inserts and deletes).
There's a little bit of overhead to updating entries in an index, but it's reasonably low cost. Most indexes are stored internally as a B+Tree data structure. This data structure was chosen because it allows easy modification.
MySQL also has a further optimization called the Change Buffer. This buffer helps reduce the performance cost of updating indexes by caching changes. That is, you do an INSERT/UPDATE/DELETE that affects an index, and the type of change is recorded in the Change Buffer. The next time you read that index with a query, MySQL reads the Change Buffer as a kind of supplement to the full index.
A good analogy for this might be a published document that periodically publishes "errata" so you need to read both the document and the errata together to understand the current state of the document.
Eventually, the entries in the Change Buffer are gradually merged into the index. This is analogous to the errata being edited into the document for the next time the document is reprinted.
The Change Buffer is used only for secondary indexes. It doesn't do anything for primary key or unique key indexes. Updates to unique indexes can't be deferred, but they still use the B+Tree so they're not so costly.
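If you want to see this in action, the change-buffer behavior can be inspected and tuned with existing server settings:

```sql
SHOW VARIABLES LIKE 'innodb_change_buffering';  -- which operations are buffered (e.g. 'all', 'none')
SHOW ENGINE INNODB STATUS\G                     -- see the "INSERT BUFFER AND ADAPTIVE HASH INDEX" section
```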
If you do OPTIMIZE TABLE or some types of ALTER TABLE changes that can't be done in-place, MySQL does rebuild the indexes from scratch. This can be useful to defragment an index after you delete a lot of the table, for example.
Yes, inserting affects them, but it's not as bad as you seem to think. Like most entities in relational databases, indexes are usually created and maintained with an extra amount of space to accommodate growth, and are usually set up to increase that extra amount automatically when index space is nearly exhausted.
Rebuilding the index starts from scratch, and is different from adding entries to the index. Inserting a new row does not result in the rebuild of an index. The new entry gets added in the extra space mentioned above, except for clustered indexes which operate a little differently.
Most DB administrators also do a task called "updating statistics," which updates an internal set of statistics used by the query planner to come up with good query strategies. That task, performed as part of maintenance, also helps keep the query optimizer "in tune" with the current state of indexes.
There are enormous numbers of high-quality references on how databases work, both independent sites and those of the publishers of major databases. You literally can make a career out of becoming a database expert. But don't worry too much about your inserts causing troubles. ;) If in doubt, speak to your DBA if you have one.
Does that help address your concerns?
This post says:
If you’re running Innodb Plugin on Percona Server with XtraDB you get benefit of a great new feature – ability to build indexes by sort instead of via insertion
However, I could not find any info on this. I'd like to have the ability to reorganize how a table is laid out physically, similar to PostgreSQL's CLUSTER command or MyISAM's "ALTER TABLE ... ORDER BY". For example, the table "posts" has millions of rows in random insertion order; most queries use "where userid = ", and I want rows belonging to one user to be physically close together on disk, so that common queries require little IO. Is this possible with XtraDB?
Clarification concerning the blog post
The feature you are basically looking at is fast index creation. This feature speeds up the creation of secondary indexes on InnoDB tables, but it is only used in very specific cases. For example, the feature is not used during OPTIMIZE TABLE, which can therefore be dramatically sped up by dropping the indexes first, then running OPTIMIZE TABLE, and then recreating the indexes with fast index creation (that is what the post you linked was about).
Some automation for the cases that can be improved by using this feature manually, as above, was added to Percona Server as a system variable named expand_fast_index_creation. If activated, the server should use fast index creation not only in the very specific cases, but in all cases where it might help, such as OPTIMIZE TABLE — the problem mentioned in the linked blog article.
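On Percona Server that automation is just a variable toggle (this variable exists only in Percona Server, not in stock MySQL):

```sql
SET GLOBAL expand_fast_index_creation = ON;
-- or for the current connection only:
SET SESSION expand_fast_index_creation = ON;
```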
Concerning your question
Your question is actually whether it is possible to store InnoDB tables in a custom order to speed up specific kinds of queries by exploiting locality on the disk.
This is not possible, at least not directly. InnoDB rows are saved in pages, ordered by the clustered index (which is essentially the primary key). The rows/pages might end up in chaotic physical order, in which case one can OPTIMIZE TABLE the InnoDB table; with this command the table is actually recreated in primary key order, gathering rows with nearby primary key values onto the same or neighboring pages.
That is all you can force InnoDB to do. You can read the manual about the clustered index, another page in the manual that gives a definitive answer that this is not possible ("ORDER BY does not make sense for InnoDB tables because InnoDB always orders table rows according to the clustered index."), and the same question on dba.stackexchange, whose answers might interest you.
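Not XtraDB-specific, but a consequence of the clustered-index rule above: the only physical ordering you can influence is the primary key's. So one sketch (assuming posts has an AUTO_INCREMENT id and a userid column, and that changing the PK is acceptable for your schema) is to make the clustered index start with userid:

```sql
ALTER TABLE posts
    DROP PRIMARY KEY,
    ADD PRIMARY KEY (userid, id),   -- one user's rows now cluster together on disk
    ADD KEY (id);                   -- keeps the AUTO_INCREMENT column indexed
```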
I'm creating a data mart in SQL Server 2008 using SSIS for load, and SSAS for an OLAP cube. So far, everything is working great. However, I haven't created any indexes on the source database other than the default clustering on primary key.
I'm pretty comfortable with designing indexes on application databases, but since this database is intended primarily to be the source for a cube, I'm not sure what sort of indexing, if any, will be beneficial.
Is there any sort of indexing I should be doing to improve the processing of the dimensions and cube? I'm using regular MOLAP storage.
Generally, the best practice is to keep indexes and constraints off of marts, unless they'll be used directly for reporting. Indexes and constraints can seriously hose your ETL time (especially with the amounts of data that usually go into warehouses).
What I've found works best is to have a single, solitary PK on all of your tables (including fact tables; where I have composite keys, I'll just hash the composite to get myself a PK if I have to). Having PKs that are identity columns provides you with an auto-generated index, quick joining when the cubes are built, and very quick inserts.
If you're going to be doing reporting, then build out the indexes as you would, but make sure to disable and then rebuild the indexes as part of your ETL process. Otherwise, bulk inserts take some time to do (hours upon hours to commit, in some cases).
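A rough T-SQL sketch of that disable/rebuild pattern (table and index names are hypothetical; don't disable the clustered index, or the table becomes inaccessible until it is rebuilt):

```sql
ALTER INDEX IX_FactSales_DateKey ON dbo.FactSales DISABLE;

-- ... bulk load dbo.FactSales here ...

ALTER INDEX ALL ON dbo.FactSales REBUILD;   -- re-enables and rebuilds the disabled index too
```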