I have an important table which can be queried by end users of my app. These queries are very important to me, so I want to ensure they will be answered as quickly as possible. In order to do that, I need to make sure relevant indexes used with these queries "always" remain in the innodb_buffer_pool (even if there is a background job running a different query with a different index). Is this possible with MySQL?
Not possible, except by having a buffer_pool that is so big that it never completely fills up. How much data do you have? How much RAM?
To "lock" anything into the buffer_pool would probably slow down the entire system. This is because other activity would then be slowed down.
Anyway, a highly active index will tend to stay in the buffer_pool. That's how an LRU cache works. (It is not exactly LRU, but close enough for this discussion.)
If you are having slow queries, let's see them, plus the SHOW CREATE TABLE. A common problem is "but I indexed every column", when you should instead look at the queries to decide which indexes to create -- and the best ones are often "composite" and/or "covering".
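As a rough illustration of those two terms (the orders table, its columns, and the index names below are made up for this sketch):

    -- Hypothetical query to be sped up:
    --   SELECT order_date, total FROM orders WHERE customer_id = ? AND status = 'shipped';

    -- Composite index: the columns used for filtering, leftmost first.
    ALTER TABLE orders ADD INDEX idx_cust_status (customer_id, status);

    -- Covering index: also includes the selected columns, so the query can be
    -- answered from the index alone (EXPLAIN shows "Using index").
    ALTER TABLE orders ADD INDEX idx_cust_status_cover (customer_id, status, order_date, total);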
Related
I'm writing a new feature using new tables on a MySQL database. Is there a performance hit I get by indexing columns (that I'll use for WHERE in SELECT queries) from the beginning or should I wait until my table reaches a considerable size before I start indexing?
If you are going to need the indexes eventually, you might as well create them with the tables. This does somewhat slow down inserts, but why wait for slow queries if you already know the right answer?
One argument against putting them in right away is if the actual queries will inform the indexing strategy. You seem to have a pretty good idea of what the usage will be.
Do recognize that indexes make some operations fast (notably selects), but they make other operations slower (notably inserts, updates, and deletes). For this reason you should be thoughtful about which indexes you add.
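For instance, a minimal sketch of declaring a known-useful index at CREATE TABLE time (the table and column names here are hypothetical):

    CREATE TABLE user_events (
        id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
        user_id    INT UNSIGNED    NOT NULL,
        event_type VARCHAR(32)     NOT NULL,
        created_at DATETIME        NOT NULL,
        PRIMARY KEY (id),
        KEY idx_user_created (user_id, created_at)  -- supports WHERE user_id = ? ORDER BY created_at
    ) ENGINE=InnoDB;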
From this related post:
One more index than you need is too many. One less is too little.
I've tried searching for a case where having too many indexes was a problem and couldn't really find anything
You KNOW you have too many if your inserts are too slow and the indexes used for reading are not speeding things up enough to make up for it.
When you insert into, update, or delete from your table, the indexes need to be updated too.
See this Article about indexing
Yes, inserts and updates can be slower because of an index. However, in my experience this has not been a common problem. Add only the indexes you know you are going to need, and add others later as you address new problems.

One thing to consider: "what if" indexes are often forgotten, never actually needed, and can cause performance issues that are hard to track down. After finding the problem index, the developer then has to spend additional time determining whether the index is actually needed by some other part of the application.

As for waiting to add indexes until a table reaches a certain size: I seriously doubt that would buy you any performance, and if it did I would question the system design.
I have a few tables with more than 100 million rows.
I get about 20-40 million new rows each month.
At this moment everything seems fine:
- all inserts are fast
- all selects are fast (they use indexes and don't involve complex aggregations)
However, I am worried about two things, what I've read somewhere:
- When a table has a few hundred million rows, inserts might become slow, because it might take a while to re-balance the indexes (binary trees)
- If an index doesn't fit into memory, it might take a while to read it from different parts of the disk.
Any comments would be highly appreciated.
Any suggestions how can I avoid it or how can I fix/mitigate the problem if/when it happens would be highly appreciated.
(I know we should start sharding some day)
Thank you in advance.
Today is the day you should think about sharding or partitioning because if you have 100MM rows today and you're gaining them at ~30MM per month then you're going to double the size of that in three months, and possibly double it again before the year is out.
At some point you'll hit an event horizon where your database is too big to migrate. Either you don't have enough working space left on your disk to switch to an alternate schema, or you don't have enough down-time to perform the migration before it needs to be operational again. Then you're stuck with it forever as it gets slower and slower.
The performance of write activity on a table is largely a function of how difficult the indices are to maintain. The more data you index, the more punishing writes become. The type of index is also relevant; some are more compact than others. If your data is lightly indexed you can usually get away with more records before things start to get cripplingly slow, but that degradation factor is highly dependent on your system configuration, your hardware, and your IO capacity.
Remember, InnoDB, the engine you should be using, has a lot of tuning parameters and many people leave it set to the really terrible defaults. Have a look at the memory allocated to it and be sure you're doing that properly.
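As a rough sketch of what that check might look like (the 8 GB figure is only a placeholder; size it to your own server and persist the change in my.cnf):

    -- How much memory is InnoDB currently allowed to use?
    SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

    -- MySQL 5.7.5+ can resize the buffer pool online; the value below is just an example.
    SET GLOBAL innodb_buffer_pool_size = 8 * 1024 * 1024 * 1024;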
If you have any way of partitioning this data -- by month, by customer, or some other factor that is not going to change based on business logic (that is, the data is intrinsically unrelated across partitions) -- you will have many simple options. If not, you'll have to make some hard decisions.
The one thing you want to be doing now is simulating what your table's performance is like with 1G rows in it. Create a sufficiently large, suitably varied amount of test data, then see how well it performs under load. You may find it's not an issue, in which case there is nothing to worry about for another few years. If it is, start working towards a solution today, before your data becomes too big to split.
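One crude way to bulk up such a test table is to keep re-inserting slightly altered copies of its own rows until you reach the target size (the table and column names below are hypothetical):

    -- Each run roughly doubles the row count; repeat until the table is near the target size,
    -- then replay the application's real queries against it under load.
    INSERT INTO orders_test (customer_id, status, order_date, total)
    SELECT customer_id, status, order_date + INTERVAL FLOOR(RAND() * 365) DAY, total
    FROM   orders_test;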
Database performance generally degrades in a fairly linear fashion, and then at some point it falls off a cliff. You need to know where this cliff is so you know how much time you have before you hit it. The sharp degradation in performance usually comes when your indexes can't fit in memory and when your disk buffers are stretched too thin to be useful.
I will attempt to address the points being made by the OP and the other responders. The Question only touches the surface; this Answer follows suit. We can dig deeper in more focused Questions.
A trillion rows gets dicey. 100M is not necessarily problematic.
PARTITIONing is not a performance panacea. The main case where it can be useful is when you need to purge "old" data. (DROP PARTITION is a lot faster than DELETEing a zillion rows; see the sketch at the end of this answer.)
INSERTs with an AUTO_INCREMENT PRIMARY KEY will 'never' slow down. The same applies to any temporal key and/or a small set of "hot spots". Example: PRIMARY KEY(stock_id, date) is limited to as many hot spots as you have stocks.
INSERTs with a UUID PRIMARY KEY will get slower and slower. But this applies to any "random" key.
Secondary indexes suffer the same issues as the PK, only later, because the onset depends on the size of the BTree. (The data's BTree, ordered by the PK, is usually bigger than each secondary index.)
Whether an index (including the PK) "fits in memory" matters only if the inserts are 'random' (as with a UUID).
For Data Warehouse applications, it is usually advisable to provide Summary Tables instead of extra indexes on the 'Fact' table. This yields "report" queries that may be as much as 10 times as fast.
Blindly using AUTO_INCREMENT may be less than optimal.
The BTree for the data or index of a million-row table will be about 3 levels deep. For a trillion rows, 6 levels. This "number of levels" has some impact on performance.
Binary trees are not used; instead BTrees (actually B+Trees) are used by InnoDB.
InnoDB mostly keeps its BTrees balanced without much effort. Don't worry about it. (And don't use OPTIMIZE TABLE.)
All activity is done on 16KB blocks (of data or index) and done in RAM (in the buffer_pool). Neither a table nor an index is "loaded into RAM", at least not explicitly as a whole unit.
Replication is useful for read scaling. (And readily available in MySQL.)
Sharding is useful for write scaling. (This is a DIY task.)
As a Rule of Thumb, keep half of your disk free for various admin purposes on huge tables.
Before a table gets into the multi-GB size range, it is wise to re-think the datatypes and normalization.
The main tunable in InnoDB (these days) is innodb_buffer_pool_size, which should (for starters) be about 70% of available RAM.
Row_format=compressed is often not worth using.
YouTube, Facebook, Google, etc, are 'on beyond' anything discussed in this Q&A. They use thousands of servers, custom software, etc.
If you want to discuss your specific application, let's see some details. Different apps need different techniques.
My blogs, which provide more details on many of the above topics: http://mysql.rjweb.org
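To illustrate the purge-by-partition point above, here is a minimal sketch (the access_log table and its columns are made up; adjust the partition scheme to your own data):

    -- Hypothetical log table partitioned by month so old data can be purged cheaply.
    CREATE TABLE access_log (
        id        BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
        logged_at DATETIME        NOT NULL,
        message   VARCHAR(255)    NOT NULL,
        PRIMARY KEY (id, logged_at)   -- the partition column must be in every unique key
    ) ENGINE=InnoDB
    PARTITION BY RANGE (TO_DAYS(logged_at)) (
        PARTITION p2024_01 VALUES LESS THAN (TO_DAYS('2024-02-01')),
        PARTITION p2024_02 VALUES LESS THAN (TO_DAYS('2024-03-01')),
        PARTITION pmax     VALUES LESS THAN MAXVALUE
    );

    -- Dropping a month is a quick metadata operation, not a row-by-row DELETE.
    ALTER TABLE access_log DROP PARTITION p2024_01;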
So I am trying to figure out a little bit of optimization with regards to MySQL and the row-sorting functionality. As I understand it, you can set a maximum for row comparisons, and it is a good idea to set this fairly high, if your machine's memory can take it, to reduce I/O. My question is: does the memory get allocated dynamically as you load in more things to sort, or statically as a massive block? Basically, if I know 100% for sure I will never have more than say 1000 rows to sort, would it be more efficient to set a maximum of say 1200 (to give a small buffer just in case) versus 1 million? Thanks for your answers, and sorry if I'm not explicit enough; I'm still very new to SQL and MySQL.
When MySQL needs to sort a resultset, such as to satisfy a SELECT ... ORDER BY, it will act in one of several ways:
If the ORDER BY can be handled by an INDEX, no sort is needed (see the EXPLAIN sketch at the end of this answer).
If the number of rows is 'small' (and some other criteria), it will create a MEMORY table in RAM and use an in-memory sort to do the work. This is very fast. Such temp-MEMORY tables are limited by both tmp_table_size and max_heap_table_size. Since multiple connections may be doing such simultaneously, it is a good idea to not set those higher than, say, 1% of RAM.
If that overflows or otherwise fails, a MyISAM table is built instead. This can have essentially unlimited size. Still, because of caching, it may or may not spill to disk, thereby incurring I/O.
There are other cases where MySQL will sort things. For example, creating an index may gather all the info, sort it, then spew the info into the BTree for the index. This probably involves an Operating System sort, which, again, may or may not involve I/O.
1200 rows is likely to be done in RAM; 1M rows is likely to involve I/O.
But, the bottom line is: Don't worry about it. If you need ORDER BY, use it. Let MySQL do it the best way it can.
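As a quick sketch of how to tell the cases apart (the orders table and its index are hypothetical; look at the Extra column of the plan):

    -- (a) ORDER BY satisfied by an index: no filesort appears in the plan.
    EXPLAIN SELECT * FROM orders WHERE customer_id = 42 ORDER BY order_date
            /* assumes INDEX(customer_id, order_date) exists */;

    -- (b) No suitable index: Extra shows "Using filesort", which means one of the
    --     in-memory or on-disk sorts described above.
    EXPLAIN SELECT * FROM orders ORDER BY total;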
What is the difference between REBUILD ONLINE and REORGANIZE index in SQL Server?
The leaf level data in an index can easily get fragmented depending on the nature of the inserts and where SQL Server is able to place the data on disk during inserts, updates (or remove it during deletes). It will not always be able to place a specific value in the exact physical slot where it is supposed to be, and this fragmentation can have a serious impact on seek/scan operations.
Reorganizing tries to put the leaf level of the index back in logical order within the pages that are already allocated to the index.
Rebuilding basically creates an entirely new copy of the index, and is much more effective at reducing fragmentation - but this comes at a cost, both in terms of time and disk space. You'll likely need free space in the database, anywhere from 1.2x to 1.5x the existing index size, in order to perform a rebuild. This is similar to saying CREATE INDEX ... WITH DROP_EXISTING.
Rebuilding online means the old index is still available for querying by other users while the new index is being created. This feature is not available in all editions (Enterprise+ only).
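For reference, a minimal sketch of the two commands (the index and table names are placeholders):

    -- Reorganize: lightweight and always online.
    ALTER INDEX IX_Orders_CustomerId ON dbo.Orders REORGANIZE;

    -- Rebuild online: creates a new copy of the index (Enterprise-class editions only).
    ALTER INDEX IX_Orders_CustomerId ON dbo.Orders REBUILD WITH (ONLINE = ON);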
The choice of which method to use can depend on the size of the table, the level of fragmentation, the potential benefit of reducing fragmentation, and the available space on disk (with the additional decision to use online if you are on a suitable edition). Ola Hallengren and Michelle Ufford have pretty robust solutions that help make these decisions for you:
http://ola.hallengren.com/
http://sqlfool.com/2011/06/index-defrag-script-v4-1/
The one nice thing about reorganizing is that if it's taking too long you can cancel it and you won't lose any of the work it's already done. If you cancel a rebuild it will roll back everything it's done.
Consider an indexed MySQL table with 7 columns, being constantly queried and written to. What is the advisable number of rows that this table should be allowed to contain before the performance would be improved by splitting the data off into other tables?
Whether or not you would get a performance gain by partitioning the data depends on the data and the queries you will run on it. You can store many millions of rows in a table, and with good indexes and well-designed queries it will still be super-fast. Only consider partitioning if you are already confident that your indexes and queries are as good as they can be, as it can be more trouble than it's worth.
There's no magic number, but there's a few things that affect performance in particular:
Index Cardinality: don't bother indexing a column that has only 2 or 3 distinct values (like an ENUM). On a large table, the query optimizer will ignore such indexes. (See the sketch at the end of this answer for a quick way to check cardinality.)
There's a trade off between writes and indexes. The more indexes you have, the longer writes take. Don't just index every column. Analyze your queries and see which columns need to be indexed for your app.
Disk IO and memory play an important role. If you can fit your whole table into memory, you take disk IO out of the equation (once the table is cached, anyway). My guess is that you'll see a big performance change when your table is too big to buffer in memory.
Consider partitioning your servers based on use. If your transactional system is reading/writing single rows, you can probably buy yourself some time by replicating the data to a read only server for aggregate reporting.
As you probably know, table performance changes based on the data size. Keep an eye on your table/queries. You'll know when it's time for a change.
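As mentioned under index cardinality above, here is a rough sketch of how you might check it (the orders table and status column are made up):

    -- How selective is the column? Few distinct values = probably not worth indexing.
    SELECT COUNT(DISTINCT status) AS distinct_values, COUNT(*) AS total_rows
    FROM   orders;

    -- Cardinality as estimated by the optimizer for the indexes that already exist.
    SHOW INDEX FROM orders;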
MySQL 5 has partitioning built in, and it's very nice. What's nice is you can define how your table should be split up. For instance, if you query mostly based on a userid you can partition your tables based on userid, or if you're querying by dates, do it by date. What's nice about this is that MySQL will know exactly which partition to search through to find your values. The downside is that if you search on a field that isn't defining your partitions, it's going to scan through every partition, which could decrease performance.
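A rough sketch of the userid case (all names are hypothetical; on older versions use EXPLAIN PARTITIONS to see which partitions a query touches):

    -- Queries that filter on user_id only need to touch one of the 16 partitions.
    CREATE TABLE user_actions (
        user_id   INT UNSIGNED NOT NULL,
        action    VARCHAR(64)  NOT NULL,
        action_at DATETIME     NOT NULL,
        PRIMARY KEY (user_id, action_at)
    ) ENGINE=InnoDB
    PARTITION BY HASH(user_id) PARTITIONS 16;

    -- The partitions column of the plan shows how many partitions are actually scanned.
    EXPLAIN SELECT * FROM user_actions WHERE user_id = 123;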
While after the fact you could point to the table size at which performance became a problem, I don't think you can predict it, and certainly not from the information given on a web site such as this!
Some questions you might usefully ask yourself:
Is performance currently acceptable?
How is performance measured - is there a metric?
How do we recognise unacceptable performance?
Do we measure performance in any way that might allow us to forecast a problem?
Are all our queries using an efficient index?
Have we simulated extreme loads and volumes on the system?
Using the MyISAM engine, you'll run into a 2GB hard limit on table size unless you change the default.
Don't ever apply an optimisation if you don't think it's needed. Ideally this should be determined by testing (as others have alluded).
Horizontal or vertical partitioning can improve performance but also complicate your application. Don't do it unless you're sure that you need it AND it will definitely help.
The 2G MyISAM data file size is only a default and can be changed at table creation time (or later by an ALTER, but that needs to rebuild the table). It doesn't apply to other engines (e.g. InnoDB).
Actually this is a good performance question. Have you read Jay Pipes? There isn't a specific number of rows, but there is a specific page size for reads, and there can be good reasons for vertical partitioning.
Check out his kung fu presentation and have a look through his posts. I'm sure you'll find that he's written some useful advice on this.
Are you using MyISAM? Are you planning to store more than a couple of gigabytes? Watch out for MAX_ROWS and AVG_ROW_LENGTH.
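A minimal sketch of those table options (the table, its columns, and the numbers are placeholders; the values just tell MyISAM to use a pointer size large enough for the expected data volume):

    CREATE TABLE big_log (
        id      BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        payload VARCHAR(255) NOT NULL
    ) ENGINE=MyISAM
      MAX_ROWS = 1000000000
      AVG_ROW_LENGTH = 100;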
Jeremy Zawodny has an excellent write-up on how to solve this problem.