I am trying to figure out a bit of optimization with regard to MySQL and its row-sorting functionality. As I understand it, you can set a maximum row comparison, and it is a good idea to set this fairly high, if your machine's memory can take it, to reduce I/O. My question is: does the memory get allocated dynamically as more rows are loaded for sorting, or statically as one massive block? Basically, if I know for sure that I will never have more than, say, 1000 rows to sort, would it be more efficient to set a maximum of, say, 1200 rows (to give a small buffer just in case) rather than 1 million? Thanks for your answers, and sorry if I'm not explicit enough; I'm still very new to SQL and MySQL.
When MySQL needs to sort a result set, such as to satisfy a SELECT ... ORDER BY, it will act in one of several ways:
If the ORDER BY can be handled by an INDEX, no sort is needed.
If the number of rows is 'small' (and some other criteria), it will create a MEMORY table in RAM and use an in-memory sort to do the work. This is very fast. Such temp-MEMORY tables are limited by both tmp_table_size and max_heap_table_size. Since multiple connections may be doing such simultaneously, it is a good idea to not set those higher than, say, 1% of RAM.
If that overflows or otherwise fails, a MyISAM table is built instead. This can have essentially unlimited size. Still, because of caching, it may or may not spill to disk, thereby incurring I/O.
There are other cases where MySQL will sort things. For example, creating an index may gather all the info, sort it, then spew the info into the BTree for the index. This probably involves an Operating System sort, which, again, may or may not involve I/O.
1200 rows is likely to be done in RAM; 1M rows is likely to involve I/O.
But, the bottom line is: Don't worry about it. If you need ORDER BY, use it. Let MySQL do it the best way it can.
Related
I have an important table which can be queried by end users of my app. These queries are very important to me, so I want to ensure they will be answered as quickly as possible. In order to do that, I need to make sure relevant indexes used with these queries "always" remain in the innodb_buffer_pool (even if there is a background job running a different query with a different index). Is this possible with MySQL?
Not possible, except by having a buffer_pool that is so big that it never completely fills up. How much data do you have? How much RAM?
To "lock" anything into the buffer_pool would probably slow down the entire system, because all other activity would be left with less cache to work with.
Anyway, a highly active index will tend to stay in the buffer_pool. That's how an LRU cache works. (It is not exactly LRU, but close enough for this discussion.)
If you are having slow queries, let's see them, plus the SHOW CREATE TABLE. A common problem is "but I indexed every column", when you should look at the query to decide what indexes to make -- and the best is often "composite" and/or "covering".
I have a few tables with more than 100 million rows.
I get about 20-40 million rows each month.
At this moment everything seems fine:
- all inserts are fast
- all selects are fast ( they are using indexes and don't use complex aggregations )
However, I am worried about two things that I've read somewhere:
- When a table has a few hundred million rows, inserts might become slow, because it might take a while to re-balance the indexes ( binary trees )
- If an index doesn't fit into memory, it might take a while to read it from different parts of the disk.
Any comments would be highly appreciated.
Any suggestions on how I can avoid it, or how I can fix/mitigate the problem if/when it happens, would be highly appreciated.
( I know we should start sharding some day )
Thank you in advance.
Today is the day you should think about sharding or partitioning because if you have 100MM rows today and you're gaining them at ~30MM per month then you're going to double the size of that in three months, and possibly double it again before the year is out.
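The arithmetic behind that estimate can be sketched in a few lines of Python. This is only a rough illustration: it assumes a constant 30M rows per month (the midpoint of the 20-40M mentioned in the question), and the helper name `months_until` is made up for this example.

```python
def months_until(target_rows, current_rows, rows_per_month):
    """Whole months of constant growth needed to reach target_rows."""
    if rows_per_month <= 0:
        raise ValueError("rows_per_month must be positive")
    remaining = max(0, target_rows - current_rows)
    # Ceiling division: a partial month still counts as a month.
    return -(-remaining // rows_per_month)


current = 100_000_000   # rows today (figure from the question)
growth = 30_000_000     # assumed midpoint of 20-40M rows per month

first_doubling = months_until(2 * current, current, growth)
second_doubling = months_until(4 * current, current, growth)
rows_after_year = current + 12 * growth

print(first_doubling, second_doubling, rows_after_year)
```

Under these assumptions the table doubles a little over three months from now (the fourth monthly batch pushes it past 200M), doubles again around month ten, and holds about 460M rows after a year.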
At some point you'll hit an event horizon where your database is too big to migrate. Either you don't have enough working space left on your disk to switch to an alternate schema, or you don't have enough down-time to perform the migration before it needs to be operational again. Then you're stuck with it forever as it gets slower and slower.
The performance of write activity on a table is largely a function of how difficult the indices are to maintain. The more data you index, the more punishing writes can be. The type of index is also relevant; some are more compact than others. If your data is lightly indexed you can usually get away with having more records before things start to get cripplingly slow, but that degradation factor is highly dependent on your system configuration, your hardware, and your IO capacity.
Remember, InnoDB, the engine you should be using, has a lot of tuning parameters and many people leave it set to the really terrible defaults. Have a look at the memory allocated to it and be sure you're doing that properly.
If you have any way of partitioning this data, like by month, by customer, or by some other factor that is not going to change based on business logic (that is, the data is intrinsically not related), you will have many simple options. If not, you'll have to make some hard decisions.
The one thing you want to be doing now is simulating what your table's performance is like with 1G rows in it. Create a sufficiently large, suitably varied amount of test data, then see how well it performs under load. You may find it's not an issue, in which case, no worries for another few years. If not, start panicking today and working towards a solution before your data becomes too big to split.
Database performance generally degrades in a fairly linear fashion, and then at some point it falls off a cliff. You need to know where this cliff is so you know how much time you have before you hit it. The sharp degradation in performance usually comes when your indexes can't fit in memory and when your disk buffers are stretched too thin to be useful.
I will attempt to address the points being made by the OP and the other responders. The Question only touches the surface; this Answer follows suit. We can dig deeper in more focused Questions.
A trillion rows gets dicey. 100M is not necessarily problematic.
PARTITIONing is not a performance panacea. The main case where it can be useful is when you need to purge "old" data. (DROP PARTITION is a lot faster than DELETEing a zillion rows.)
INSERTs with an AUTO_INCREMENT PRIMARY KEY will 'never' slow down. This applies to any temporal key and/or small set of "hot spots". Example: PRIMARY KEY(stock_id, date) is limited to as many hot spots as you have stocks.
INSERTs with a UUID PRIMARY KEY will get slower and slower. But this applies to any "random" key.
Secondary indexes suffer the same issues as the PK, but later, because the effect depends on the size of the BTree. (The data's BTree, ordered by the PK, is usually bigger than that of each secondary key.)
Whether an index (including the PK) "fits in memory" matters only if the inserts are 'random' (as with a UUID).
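The "hot spot" idea can be illustrated with a toy Python model. It is a sketch under loud assumptions: keys are plain integers, a leaf block is simply `key // 100` (an assumed rule-of-thumb of about 100 rows per block), and page splits, secondary indexes and the real buffer_pool are all ignored.

```python
import random

ROWS_PER_BLOCK = 100   # assumed rule-of-thumb fan-out, not a measured value


def blocks_touched(keys, rows_per_block=ROWS_PER_BLOCK):
    """Count the distinct leaf blocks a batch of inserted keys lands in."""
    return len({key // rows_per_block for key in keys})


existing_rows = 100_000_000
batch = 1_000

# AUTO_INCREMENT-style keys: every new row goes to the end of the BTree.
sequential_keys = range(existing_rows, existing_rows + batch)

# UUID-style keys: each new row lands at an effectively random position.
rng = random.Random(42)
random_keys = [rng.randrange(existing_rows) for _ in range(batch)]

sequential_blocks = blocks_touched(sequential_keys)
random_blocks = blocks_touched(random_keys)
print(sequential_blocks, random_blocks)
```

In this model the sequential batch stays within a handful of blocks that are easy to keep cached, while the random batch touches close to one block per row, so whether the index fits in memory starts to matter.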
For Data Warehouse applications, it is usually advisable to provide Summary Tables instead of extra indexes on the 'Fact' table. This yields "report" queries that may be as much as 10 times as fast.
Blindly using AUTO_INCREMENT may be less than optimal.
The BTree for the data or index of a million-row table will be about 3 levels deep. For a trillion rows, 6 levels. This "number of levels" has some impact on performance.
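Those depths follow from a simple estimate, sketched here in Python. The fan-out of 100 entries per block is an assumption (a common rule of thumb; the real figure depends on row and key sizes), and `btree_levels` is a hypothetical helper, not anything MySQL exposes.

```python
def btree_levels(rows, fanout=100):
    """Levels needed so that fanout**levels >= rows (integer math only)."""
    if rows < 1 or fanout < 2:
        raise ValueError("rows must be >= 1 and fanout >= 2")
    levels, capacity = 1, fanout
    while capacity < rows:
        capacity *= fanout
        levels += 1
    return levels


for rows in (10**6, 10**9, 10**12):
    print(rows, btree_levels(rows))
```

With that fan-out, a million rows need 3 levels and a trillion rows need 6, so going up by six orders of magnitude adds only three levels.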
Binary trees are not used; instead BTrees (actually B+Trees) are used by InnoDB.
InnoDB mostly keeps its BTrees balanced without much effort. Don't worry about it. (And don't use OPTIMIZE TABLE.)
All activity is done on 16KB blocks (of data or index) and done in RAM (in the buffer_pool). Neither a table nor an index is "loaded into RAM", at least not explicitly as a whole unit.
Replication is useful for read scaling. (And readily available in MySQL.)
Sharding is useful for write scaling. (This is a DIY task.)
As a Rule of Thumb, keep half of your disk free for various admin purposes on huge tables.
Before a table gets into the multi-GB size range, it is wise to re-think the datatypes and normalization.
The main tunable in InnoDB (these days) is innodb_buffer_pool_size, which should (for starters) be about 70% of available RAM.
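As a starting point only, the 70% figure can be turned into a configuration line with a small Python helper. The function name and the 16 GiB server are hypothetical; the percentage is the rule of thumb above, not a tuned value.

```python
def buffer_pool_setting(ram_mib, percent=70):
    """Return a my.cnf line giving InnoDB `percent` of RAM, in whole MiB."""
    if not 0 < percent < 100:
        raise ValueError("percent must be between 0 and 100")
    pool_mib = ram_mib * percent // 100   # integer math, rounds down
    return f"innodb_buffer_pool_size = {pool_mib}M"


# Hypothetical dedicated database server with 16 GiB of RAM.
print(buffer_pool_setting(16 * 1024))
```

On a machine that also runs other services, a lower percentage would be the safer assumption.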
Row_format=compressed is often not worth using.
YouTube, Facebook, Google, etc, are 'on beyond' anything discussed in this Q&A. They use thousands of servers, custom software, etc.
If you want to discuss your specific application, let's see some details. Different apps need different techniques.
My blogs, which provide more details on many of the above topics: http://mysql.rjweb.org
Although I currently do not have one, I'm interested in learning how someone would scale an individual table in MySQL that might have, say, 20 million users. Is this something you would use sharding for? What are some strategies one might use to make an individual table of this magnitude "scalable"?
20M records is generally considered "small". Depending on the size of records and the kind of queries performed, you are likely to get very good performance on the lowliest of servers.
Almost all servers can keep such a database in memory. Let's consider that a record takes 1024 bytes, including indexes. This is quite a large record, yet 20M rows is still only about 20 GB, which fits comfortably within the RAM of a modest server.
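The sizing estimate can be checked with a short Python sketch. The 1024 bytes per row is the deliberately generous figure from the paragraph above; the server RAM sizes are hypothetical examples.

```python
def dataset_size_gib(rows, bytes_per_row):
    """Approximate in-memory footprint in GiB (2**30 bytes)."""
    return rows * bytes_per_row / 2**30


rows = 20_000_000
bytes_per_row = 1024   # deliberately generous, indexes included

size = dataset_size_gib(rows, bytes_per_row)
print(f"{size:.1f} GiB")

# Hypothetical server sizes, to see which could hold the whole data set.
for ram_gib in (16, 32, 64):
    print(ram_gib, "GiB RAM:", "fits" if size < ram_gib else "does not fit")
```

The result is roughly 19 GiB, so under these assumptions a 32 GiB machine could hold the whole data set with room to spare.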
While your database fits in RAM, queries are likely to be very fast.
But in any case, you need to consider what the access patterns are.
Do you have
Very high write rates - more than 100 transactions per second?
Lots of hard queries / reports?
If the answer to both of these is "no", you probably need no special equipment at all.
Certainly you don't want to shard. It massively complicates your application and will require a LOT of developer time, which is better spent on features (which you can actually sell to customers).
In order to improve performance with big data, in approximate order of preference, you want to:
Buy better hardware (within reason)
Reduce the amount of data you need to store
Use horizontal partitioning
Use vertical partitioning / functional partitioning
Get a better database engine which can use existing hardware more efficiently (possible examples: Infobright, Tokutek)
Shard (you really don't want to do this!)
My User table has more than 1 million records. How can I manage this with MySQL and Symfony 1.4, and improve performance so that queries return quickly?
To significantly improve the performance of a well-designed system, all you can do is increase the resources. Typically, these days, the cheapest way to do this is to distribute the task.
For example, a slow part of an RDBMS is reading from and writing to storage (RDBMSs typically start out I/O bound; that is, they mostly wait for data to be read from or written to storage).
So, to offset this, an RDBMS will very commonly let you split a table across multiple HDDs, effectively multiplying the I/O performance (an approach similar to RAID 0).
Adding more hard disks increases performance. This continues up to the maximum I/O that your system can support (either because the system simply cannot push more data through its circuits, or because it needs to crunch the numbers a bit when it fetches them and so becomes CPU bound; optimally you would be utilising both).
After that, you have to start multiplying systems and distributing the data across database nodes. For this to work, either the RDBMS must support it or there must be an application layer that coordinates distributing the tasks and merging the results, but normally things would still scale.
I would say that with 512 systems you could have all one trillion (10^12) records effectively cached and achieve relatively nice performance. But really you should specify what kind of performance you are looking for: there is a difference between full-text searches on tera-records and running mostly simple fetches and updates. Also, for certain work 500ms (or even more) is considered good performance, while for other work it would be horrible.
First, there's a big difference between 1 trillion and 1 million.
As for your performance problems: show us the query that's running slow; without seeing it, it's hard to tell what's wrong with it. Things you could try:
Use EXPLAIN to get more information about your slow queries and see whether they're using your indexes (and if not, why not).
Use correct and reasonable indexes.
Consider an indexed MySQL table with 7 columns, being constantly queried and written to. What is the advisable number of rows that this table should be allowed to contain before the performance would be improved by splitting the data off into other tables?
Whether or not you would get a performance gain by partitioning the data depends on the data and the queries you will run on it. You can store many millions of rows in a table and with good indexes and well-designed queries it will still be super-fast. Only consider partitioning if you are already confident that your indexes and queries are as good as they can be, as it can be more trouble than it's worth.
There's no magic number, but there's a few things that affect performance in particular:
Index cardinality: don't bother indexing a column that has only 2 or 3 distinct values (like an ENUM). On a large table, the query optimizer will generally ignore such an index.
There's a trade off between writes and indexes. The more indexes you have, the longer writes take. Don't just index every column. Analyze your queries and see which columns need to be indexed for your app.
Disk I/O and memory play an important role. If you can fit your whole table into memory, you take disk I/O out of the equation (once the table is cached, anyway). My guess is that you'll see a big performance change when your table is too big to buffer in memory.
Consider partitioning your servers based on use. If your transactional system is reading/writing single rows, you can probably buy yourself some time by replicating the data to a read only server for aggregate reporting.
As you probably know, table performance changes based on the data size. Keep an eye on your table/queries. You'll know when it's time for a change.
MySQL has had partitioning built in since version 5.1, and it is very nice. You can define how your table should be split up. For instance, if you query mostly by userid you can partition the table on userid, or if you query by date you can partition by date. The advantage is that MySQL will know exactly which partition to search to find your values. The downside is that if you search on a field that isn't part of the partitioning definition, it has to scan every partition, which could decrease performance.
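The difference between searching one partition and scanning all of them can be shown with a toy Python model. This is not MySQL code: the modulo rule, the four partitions and the `country` column are all assumptions made for the illustration.

```python
from collections import defaultdict

NUM_PARTITIONS = 4   # arbitrary choice for the illustration


def partition_for(user_id):
    """Toy stand-in for partitioning on user_id: simple modulo."""
    return user_id % NUM_PARTITIONS


# Build a tiny "table" of (user_id, country) rows split across partitions.
partitions = defaultdict(list)
for user_id in range(1, 21):
    country = "NZ" if user_id % 5 == 0 else "US"
    partitions[partition_for(user_id)].append((user_id, country))


def find_by_user(user_id):
    """Filter on the partitioning column: only one partition is searched."""
    searched = [partition_for(user_id)]
    rows = [r for r in partitions[searched[0]] if r[0] == user_id]
    return rows, searched


def find_by_country(country):
    """Filter on another column: every partition has to be searched."""
    searched = sorted(partitions)
    rows = [r for p in searched for r in partitions[p] if r[1] == country]
    return rows, searched


print(find_by_user(7))
print(len(find_by_country("NZ")[1]))
```

A lookup by user id only has to open the one partition that can contain the row, whereas the lookup by country has to visit all four.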
While after the fact you could point to the table size at which performance became a problem, I don't think you can predict it, and certainly not from the information given on a web site such as this!
Some questions you might usefully ask yourself:
Is performance currently acceptable?
How is performance measured - is there a metric?
How do we recognise unacceptable performance?
Do we measure performance in any way that might allow us to forecast a problem?
Are all our queries using an efficient index?
Have we simulated extreme loads and volumes on the system?
Using the MyISAM engine, you'll run into a 2GB hard limit on table size unless you change the default.
Don't ever apply an optimisation if you don't think it's needed. Ideally this should be determined by testing (as others have alluded).
Horizontal or vertical partitioning can improve performance but also complicate your application. Don't do it unless you're sure that you need it AND it will definitely help.
The 2 GB limit on the MyISAM data file size is only a default and can be changed at table creation time (or later with an ALTER, although that requires rebuilding the table). It doesn't apply to other engines (e.g. InnoDB).
Actually this is a good question for performance. Have you read Jay Pipes? There isn't a specific number of rows but there is a specific page size for reads and there can be good reasons for vertical partitioning.
Check out his kung fu presentation and have a look through his posts. I'm sure you'll find that he's written some useful advice on this.
Are you using MyISAM? Are you planning to store more than a couple of gigabytes? Watch out for MAX_ROWS and AVG_ROW_LENGTH.
Jeremy Zawodny has an excellent write-up on how to solve this problem.