I have a table named transactions, something like this:
id | user_id | business_id | amount | tracking_code | status | created_at | updated_at
As you can see, this is a table which keeps all transactions. Currently it has over 50M rows, and about 4k new rows get added to it every day. I'm worried that over the next year or two, as the business scales up, I will end up with a really huge table.
Currently we have two indexes on this table for better search performance. The engine is InnoDB.
Any idea how it should be handled generally?
On the hardware side, I'm completely OK with upgrading the server when needed. But I guess the issue will be managing the data in the future, not the resources.
I would not worry about application performance with 1 billion rows on a machine that can keep the indexes in memory.
However, I would suggest making the id column a BIGINT if you know that the table is growing at a fast pace, as a signed 32-bit integer is limited to 2^31-1 = 2,147,483,647 rows.
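To put that limit in perspective, here is a rough sketch in Python using the numbers from the question (about 50M rows now, about 4k new rows per day). It assumes ids are handed out one per row with no gaps, which AUTO_INCREMENT does not guarantee, so treat the result as an upper bound rather than a prediction.

```python
# Rough headroom estimate for a signed 32-bit id column.
# Assumption: the current max id is about equal to the row count (no gaps).
INT_MAX = 2**31 - 1          # largest value of a signed 32-bit INT
current_rows = 50_000_000    # from the question
rows_per_day = 4_000         # from the question


def days_until_exhausted(max_id, current, per_day):
    """Whole days of inserts left before the next id would exceed max_id."""
    return (max_id - current) // per_day


days = days_until_exhausted(INT_MAX, current_rows, rows_per_day)
print(days, "days, roughly", days // 365, "years")
```

At the stated growth rate the limit is a long way off; it only becomes urgent if the insert rate rises by a few orders of magnitude.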
The performance of your table and search engine depends on:
How many joins do the queries perform on this particular table?
How well are your indexes set up? (Apparently well.)
How much RAM is in the machine hosting the DB?
The speed and number of processors.
Size of the row/amount of data returned in the queries.
Disk space is cheap. You have, what, 3.2 GB in data and some factor on top of that in indexes. If the application doesn't need all the data to be online, you have the option of archiving old data (dump, then delete from the table). You can also look into compression, possibly in combination with one of the alternative storage engines.
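The 3.2 GB figure can be reproduced with a back-of-the-envelope calculation. The sketch below assumes an average of 64 bytes per row, which is my own back-calculation from that figure and not something the question states; the real number depends on the column types and InnoDB overhead.

```python
# Back-of-the-envelope data size, excluding indexes.
ROW_BYTES = 64               # assumed average row size (not from the question)
rows = 50_000_000            # from the question
rows_per_day = 4_000         # from the question

data_gb = rows * ROW_BYTES / 1e9
yearly_growth_mb = rows_per_day * 365 * ROW_BYTES / 1e6

print(f"current data: {data_gb:.1f} GB")
print(f"growth per year: {yearly_growth_mb:.2f} MB")
```

Under that assumption the table grows by well under 100 MB of row data per year, which is why disk space is unlikely to be the problem here.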
Don't worry about how many CPUs and their speed. That is rarely the limiting factor in MySQL.
Do you need either created_at or updated_at?
Don't bother with compression; it is unlikely to be worth it.
Do use smaller datatypes where practical (and conservative).
Stick with InnoDB. There are many reasons; I won't repeat them here.
Please show us SHOW CREATE TABLE and some of the critical queries. We may have more tips.
Related
I have a database currently at 6.5Gb but growing fast...
Currently on a R4L Aurora server, 15.25G Ram, 2 core CPU
I am looking at buying a Reserved Instance to cut costs, but worried that if the database grows fast, e.g. reaches over 15G within a year, I'll need to get a bigger server.
99% of the data is transactional history, this table is the biggest by far. It is written very frequently, but once a row has been written it doesn't change often (although it does on occasion).
So, a few questions...
1) Should I disable the cache?
2) Will I be ok with 15G ram, even if the database itself goes to (say) 30G, or will I see massive speed issues
3) The database is well indexed, but could this be improved? E.g. if (say) 1 million records belong to 1 user, is there a way to partition the data to prevent that slowing down access for other users?
Thanks
"Should I disable the cache?" -- Which "cache"?
"will I see massive speed issues" -- We need to see the queries, etc.
"The database is well indexed" -- If that means you indexed every column, then it is not well indexed. Please show us SHOW CREATE TABLE and a few of the important queries.
"partition" -- With few exceptions, partitioning does not speed up MySQL tables. Again, we need details.
"15.25G Ram" & "database...15G" -- It is quite common for the dataset size to be bigger, even much bigger, than RAM. So, this pair of numbers are not necessarily good to compare to each other.
"1 million records belong to 1 user" -- Again, details, please.
You should quantify the data growth first. You can do this by running a COUNT(*) query grouped by the year of the created-date column. Once you have a count of records per year, you can see what is going on.
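A minimal sketch of that per-year count follows. It uses Python's built-in sqlite3 module as a stand-in so it runs anywhere; the table name history, the column created_at, and the sample dates are all hypothetical. On MySQL the same grouping would normally be written with YEAR(created_at) instead of SQLite's strftime.

```python
import sqlite3

# In-memory stand-in for the real table; names and rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE history (id INTEGER PRIMARY KEY, created_at TEXT)")
sample = [
    ("2021-03-01",),
    ("2022-01-15",), ("2022-07-30",),
    ("2023-02-11",), ("2023-05-05",), ("2023-12-24",),
]
conn.executemany("INSERT INTO history (created_at) VALUES (?)", sample)

# Count rows per calendar year of creation.
rows_per_year = conn.execute(
    "SELECT strftime('%Y', created_at), COUNT(*) "
    "FROM history "
    "GROUP BY strftime('%Y', created_at) "
    "ORDER BY 1"
).fetchall()

for year, count in rows_per_year:
    print(year, count)
```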
Now you can think of possible solutions:
You can remove data which is no longer relevant from a historical standpoint and keep the storage limited.
If individual columns hold a large amount of data (e.g. BLOBs), consider storing that data in S3 and keeping only a reference in the database table.
Delete any unwanted tables. Sometimes DBAs create temporary backup tables and leave them behind after the work is done. You can clean up such tables.
The memory of the instance just comes into play when the engine fetches pages into the buffer pool for page misses. It does not depend on your actual data size (except in extreme cases, for example, your records are really really huge). The rule of thumb is to make sure you always keep your working set warm in the buffer pool, and avoid pages getting flushed.
If your app does need to touch a large amount of data, then the ideal way to do that would be to have dedicated replicas for specific kinds of queries. That way, you avoid swapping out valid pages in favor of newer queries. Aurora has custom endpoints support now, and that makes this even easier to manage.
If you need more specific guidelines, you may need to share details about your data, indices, queries etc.
Brief: Is there any way to improve the performance of table scans on InnoDB tables?
Please, do not suggest adding indexes to avoid table scans. (see below)
innodb_buffer_pool_size sits at 75% of server memory (48 GB/64GB)
I'm using the latest version of Percona (5.7.19) if that changes anything
Longer: We have 600Gb of recent time series data (we aggregate and delete older data) spread over 50-60 tables. So most of it is "active" data that is regularly queried. These tables are somewhat large (400+ numeric columns) and many queries run against a number of those columns (alarming) which is why it is impractical to add indexes (as we would have to add a few dozen). The largest tables are partitioned per day.
I am fully aware that this is an application/table design problem and not a "server tuning" problem. We are currently working to significantly change the way these tables are designed and queried, but have to maintain the existing system until this happens so I'm looking for a way to improve things a bit to buy us a little time.
We recently split this system and have moved a part of it to a new server. It previously used MyISAM, and we tried moving to TokuDB which seemed appropriate but ran into some weird problems. We switched to InnoDB but performance is really bad. I get the impression that MyISAM is better with table scans which is why, barring any better option, we'll go back to it until the new system is in place.
Update
All tables have pretty much the same structure:
-timestamp
-primary key (varchar(20) field)
-about 15 fields of various types representing other secondary attributes that can be filtered upon (along with an appropriately indexed criterion first)
-And then about a few hundred measures (floats), between 200-400.
I already trimmed the row length as much as I could without changing the structure itself. The primary key used to be a varchar(100), all measures used to be doubles, many of the secondary attributes had their data types changed.
Upgrading hardware is not really an option.
Creating small tables with just the set of columns I need would help some processes perform faster. But at the cost of creating that table with a table scan first and duplicating data. Maybe if I created it as a memory table. By my estimate, it would take a couple of GB away from the buffer pool. Also there are aggregation processes that read about as much data from the main tables on a regular basis, and they need all columns.
There is unfortunately a lot of duplication of effort in those queries, which I plan to address in the next version. The alarming and aggregation processes basically reprocess the entire day's worth of data every time some rows are inserted (every half hour) instead of just dealing with new/changed data.
Like I said, the bigger tables are partitioned, so it's usually a scan over a daily partition rather than the entire table, which is a small consolation.
Implementing a system to hold this in memory outside of the DB could work, but that would entail a lot of changes on the legacy system and development work. Might as well spend that time on the better design.
The fact that InnoDB tables are so much bigger than MyISAM tables for the same data (2-3x as big in my case) really hinders the performance.
MyISAM is a little bit better at table-scans, because it stores data more compactly than InnoDB. If your queries are I/O-bound, scanning through less data on disk is faster. But this is a pretty weak solution.
You might try using InnoDB compression to reduce the size of data. That might get you closer to MyISAM size, but you're still I/O-bound so it's going to suck.
Ultimately, it sounds like you need a database that is designed for an OLAP workload, like a data warehouse. InnoDB and TokuDB are both designed for OLTP workload.
It smells like a Data Warehouse with "Reports". By judiciously picking what to aggregate (a selection of your floats) over what time period (hour or day is typical), you can build and maintain Summary Tables that work much more efficiently for the Reports. This has the effect of scanning the data only once (to build the Summaries), not repeatedly. The Summary tables are much smaller, so the reports are much faster -- 10x is perhaps typical.
It may also be possible to augment the Summary tables as the raw data is being Inserted. (See INSERT .. ON DUPLICATE KEY UPDATE ..)
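The idea of augmenting a summary as raw rows arrive can be sketched in a few lines of Python. This is only an in-memory stand-in for the INSERT .. ON DUPLICATE KEY UPDATE pattern, not MySQL code; the device name, the hourly bucket, and the single measure are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# summary[(device, hour)] -> [row_count, running_sum]
# Plays the role of a summary table keyed on (device, hour).
summary = defaultdict(lambda: [0, 0.0])


def record(device, ts, value):
    """Fold one raw measurement into its hourly summary row.

    Creating the bucket on first sight and updating it afterwards is the
    same "insert, or update if the key exists" behaviour as an upsert.
    """
    hour = ts.replace(minute=0, second=0, microsecond=0)
    bucket = summary[(device, hour)]
    bucket[0] += 1
    bucket[1] += value


record("a", datetime(2024, 1, 1, 10, 5), 2.0)
record("a", datetime(2024, 1, 1, 10, 35), 4.0)
record("a", datetime(2024, 1, 1, 11, 5), 6.0)

for (device, hour), (n, total) in sorted(summary.items()):
    print(device, hour, n, total / n)
```

Keeping a count and a sum (rather than a stored average) is a deliberate choice: averages over longer periods can then be derived from the summary rows without going back to the raw data.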
And use Partitioning by date to allow for efficient DROP PARTITION instead of DELETE. Don't have more than about 50 partitions.
Summary Tables
Time series Partitioning
If you would like to discuss in more detail, let's start with one of the queries that is scanning so much now.
In the various projects I have worked on, there were between 2 and 7 Summary tables.
With 600GB of data, you may be pushing the limits on 'ingestion'. If so, we can discuss that, too.
I have a few tables with more than 100 million rows.
I get about 20-40 million new rows each month.
At this moment everything seems fine:
- all inserts are fast
- all selects are fast ( they are using indexes and don't use complex aggregations )
However, I am worried about two things that I've read somewhere:
- When a table has a few hundred million rows, inserts might become slow, because it might take a while to re-balance the indexes ( binary trees )
- If an index doesn't fit into memory, it might take a while to read it from different parts of the disk.
Any comments would be highly appreciated.
Any suggestions on how I can avoid it, or how I can fix/mitigate the problem if/when it happens, would be highly appreciated.
( I know we should start sharding some day )
Thank you in advance.
Today is the day you should think about sharding or partitioning because if you have 100MM rows today and you're gaining them at ~30MM per month then you're going to double the size of that in three months, and possibly double it again before the year is out.
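A quick way to check that kind of estimate is to project the row count forward. The sketch below assumes plain linear growth at the 20-40 million rows per month mentioned in the question; real growth is rarely that steady, so the output is a rough guide only.

```python
def months_to_reach(target, current, per_month):
    """Number of whole months of linear growth until current >= target."""
    months = 0
    while current < target:
        current += per_month
        months += 1
    return months


start = 100_000_000  # rows today, from the question

for rate in (20_000_000, 30_000_000, 40_000_000):
    to_double = months_to_reach(2 * start, start, rate)
    to_quadruple = months_to_reach(4 * start, start, rate)
    print(f"{rate // 1_000_000}M/month: 2x in {to_double} months, "
          f"4x in {to_quadruple} months")
```

At 30 million rows per month this gives a doubling in about four months and a second doubling in under a year, which is in the same ballpark as the estimate above.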
At some point you'll hit an event horizon where your database is too big to migrate. Either you don't have enough working space left on your disk to switch to an alternate schema, or you don't have enough down-time to perform the migration before it needs to be operational again. Then you're stuck with it forever as it gets slower and slower.
The performance of write activity on a table is largely a function of how difficult the indices are to maintain. The more data you index, the more punishing writes can be. The type of index is also relevant; some are more compact than others. If your data is lightly indexed you can usually get away with having more records before things start to get cripplingly slow, but that degradation factor is highly dependent on your system configuration, your hardware, and your IO capacity.
Remember, InnoDB, the engine you should be using, has a lot of tuning parameters and many people leave it set to the really terrible defaults. Have a look at the memory allocated to it and be sure you're doing that properly.
If you have any way of partitioning this data, like by month, by customer, or by some other factor that is not going to change based on business logic (that is, the data is intrinsically not related), you will have many simple options. If not, you'll have to make some hard decisions.
The one thing you want to be doing now is simulating what your table's performance is like with 1G rows in it. Create a sufficiently large, suitably varied amount of test data, then see how well it performs under load. You may find it's not an issue, in which case, no worries for another few years. If not, start panicking today and working towards a solution before your data becomes too big to split.
Database performance generally degrades in a fairly linear fashion, and then at some point it falls off a cliff. You need to know where this cliff is so you know how much time you have before you hit it. The sharp degradation in performance usually comes when your indexes can't fit in memory and when your disk buffers are stretched too thin to be useful.
I will attempt to address the points being made by the OP and the other responders. The Question only touches the surface; this Answer follows suit. We can dig deeper in more focused Questions.
A trillion rows gets dicey. 100M is not necessarily problematic.
PARTITIONing is not a performance panacea. The main case where it can be useful is when you need to purge "old" data. (DROP PARTITION is a lot faster than DELETEing a zillion rows.)
INSERTs with an AUTO_INCREMENT PRIMARY KEY will 'never' slow down. This applies to any temporal key and/or small set of "hot spots". Example: PRIMARY KEY(stock_id, date) is limited to as many hot spots as you have stocks.
INSERTs with a UUID PRIMARY KEY will get slower and slower. But this applies to any "random" key.
Secondary indexes suffer the same issues as the PK, but later. This is because the effect depends on the size of the BTree. (The data's BTree, ordered by the PK, is usually bigger than each secondary key's BTree.)
Whether an index (including the PK) "fits in memory" matters only if the inserts are 'random' (as with a UUID).
For Data Warehouse applications, it is usually advisable to provide Summary Tables instead of extra indexes on the 'Fact' table. This yields "report" queries that may be as much as 10 times as fast.
Blindly using AUTO_INCREMENT may be less than optimal.
The BTree for the data or index of a million-row table will be about 3 levels deep. For a trillion rows, 6 levels. This "number of levels" has some impact on performance.
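Those depth figures are consistent with a fan-out of roughly 100 entries per block. That fan-out is an assumption made here to reproduce the numbers; the real value depends on key and row sizes. A minimal Python sketch:

```python
def btree_levels(rows, fanout=100):
    """Levels needed so that fanout ** levels >= rows.

    fanout=100 is an assumed average number of entries per block.
    Integer arithmetic is used to avoid floating-point log rounding.
    """
    levels, capacity = 0, 1
    while capacity < rows:
        capacity *= fanout
        levels += 1
    return levels


for rows in (10**6, 10**8, 10**9, 10**12):
    print(f"{rows:>16,} rows -> about {btree_levels(rows)} levels")
```

Going from a million rows to a trillion only doubles the depth under this assumption, which is why the number of levels has some impact on performance but not a dramatic one.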
Binary trees are not used; instead BTrees (actually B+Trees) are used by InnoDB.
InnoDB mostly keeps its BTrees balanced without much effort. Don't worry about it. (And don't use OPTIMIZE TABLE.)
All activity is done on 16KB blocks (of data or index) and done in RAM (in the buffer_pool). Neither a table nor an index is "loaded into RAM", at least not explicitly as a whole unit.
Replication is useful for read scaling. (And readily available in MySQL.)
Sharding is useful for write scaling. (This is a DIY task.)
As a Rule of Thumb, keep half of your disk free for various admin purposes on huge tables.
Before a table gets into the multi-GB size range, it is wise to re-think the datatypes and normalization.
The main tunable in InnoDB (these days) is innodb_buffer_pool_size, which should (for starters) be about 70% of available RAM.
Row_format=compressed is often not worth using.
YouTube, Facebook, Google, etc, are 'on beyond' anything discussed in this Q&A. They use thousands of servers, custom software, etc.
If you want to discuss your specific application, let's see some details. Different apps need different techniques.
My blogs, which provide more details on many of the above topics: http://mysql.rjweb.org
We have a MySQL database table for products. We are utilizing a cache layer to reduce database load, but we think that it's a good idea to minimize the actual data that needs to be stored in the cache layer to speed up the application further.
All the products in the database that are visible to visitors have a price attached to them.
The prices are stored in a different table, called prices . There are multiple price categories depending on which discount level each visitor (customer) applies to. From time to time, there are campaigns which means that a special price for each product is available. The special prices are stored in a table called specials.
Is it a bad idea to make a temp table that binds the tables together?
It would only have the necessary information and would of course be cached.
productId | hasPrice | hasSpecial
----------|----------|-----------
1         | 1        | 0
2         | 1        | 1
By doing so, it would be super easy to know whether a specific product really has a price, without having to iterate through the complete prices or specials table each time a product is listed or presented.
Are temp tables a common thing for web applications or is it just bad design?
If you're going to cache this data anyways, does it really need to be in a temp table? You would only incur the overhead of the query when you needed to rebuild the cache, so the temp table might not even be necessary.
You should approach it like any other performance problem: Decide how much performance is necessary, then iterate doing testing on production-grade hardware in your lab. Do not do needless optimisations.
You should profile your app and discover whether it's doing too many queries or the queries themselves are slow; most cases of web-app slowness are caused by doing too many queries (in my experience), even when each query is very cheap.
Normally the best engineering solution is to restructure the database, in some cases denormalising, to make the common read use-cases require fewer queries. Caching may be helpful as well, but refactoring so you need fewer queries is often the best.
Essentially you can increase the amount of work on the write-path to reduce the amount on the read-path, if you are planning to do a lot more reading than writing.
Consider an indexed MySQL table with 7 columns, being constantly queried and written to. What is the advisable number of rows that this table should be allowed to contain before the performance would be improved by splitting the data off into other tables?
Whether or not you would get a performance gain by partitioning the data depends on the data and the queries you will run on it. You can store many millions of rows in a table, and with good indexes and well-designed queries it will still be super-fast. Only consider partitioning if you are already confident that your indexes and queries are as good as they can be, as it can be more trouble than it's worth.
There's no magic number, but there's a few things that affect performance in particular:
Index Cardinality: don't bother indexing a column that has only 2 or 3 distinct values (like an ENUM). On a large table, the query optimizer will ignore these.
There's a trade off between writes and indexes. The more indexes you have, the longer writes take. Don't just index every column. Analyze your queries and see which columns need to be indexed for your app.
Disk IO and memory play an important role. If you can fit your whole table into memory, you take disk IO out of the equation (once the table is cached, anyway). My guess is that you'll see a big performance change when your table is too big to buffer in memory.
Consider partitioning your servers based on use. If your transactional system is reading/writing single rows, you can probably buy yourself some time by replicating the data to a read only server for aggregate reporting.
As you probably know, table performance changes based on the data size. Keep an eye on your table/queries. You'll know when it's time for a change.
MySQL 5 has partitioning built in, and it is very nice. You can define how your table should be split up. For instance, if you query mostly by userid you can partition the table on userid, or if you query by date you can partition by date. MySQL will then know exactly which partition to search to find your values. The downside is that if you search on a field that isn't part of the partition definition, it's going to scan every partition, which could decrease performance.
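The trade-off can be illustrated with a toy model in Python. This is not how MySQL implements partitioning; it only mimics the idea that a lookup on the partition key touches one partition while a lookup on any other field touches all of them. The modulo-4 scheme and the row fields are hypothetical.

```python
N_PARTITIONS = 4  # hypothetical number of partitions

# Each partition holds the rows whose user_id hashes to it.
partitions = {i: [] for i in range(N_PARTITIONS)}


def insert(row):
    partitions[row["user_id"] % N_PARTITIONS].append(row)


def find_by_user(user_id):
    """Filter on the partition key: only one partition is scanned."""
    part = partitions[user_id % N_PARTITIONS]
    return [r for r in part if r["user_id"] == user_id], 1


def find_by_amount(amount):
    """Filter on another field: every partition has to be scanned."""
    hits = [r for p in partitions.values() for r in p if r["amount"] == amount]
    return hits, N_PARTITIONS


for uid in range(8):
    insert({"user_id": uid, "amount": uid * 10})

print(find_by_user(5))     # one partition scanned
print(find_by_amount(50))  # all partitions scanned
```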
While after the fact you could point to the table size at which performance became a problem, I don't think you can predict it, and certainly not from the information given on a web site such as this!
Some questions you might usefully ask yourself:
Is performance currently acceptable?
How is performance measured - is there a metric?
How do we recognise unacceptable performance?
Do we measure performance in any way that might allow us to forecast a problem?
Are all our queries using an efficient index?
Have we simulated extreme loads and volumes on the system?
Using the MyISAM engine, you'll run into a 2GB hard limit on table size unless you change the default.
Don't ever apply an optimisation if you don't think it's needed. Ideally this should be determined by testing (as others have alluded).
Horizontal or vertical partitioning can improve performance but also complicate your application. Don't do it unless you're sure that you need it AND it will definitely help.
The 2 GB MyISAM data file size limit is only a default and can be changed at table creation time (or later with an ALTER, but that needs to rebuild the table). It doesn't apply to other engines (e.g. InnoDB).
Actually this is a good question for performance. Have you read Jay Pipes? There isn't a specific number of rows but there is a specific page size for reads and there can be good reasons for vertical partitioning.
Check out his kung fu presentation and have a look through his posts. I'm sure you'll find that he's written some useful advice on this.
Are you using MyISAM? Are you planning to store more than a couple of gigabytes? Watch out for MAX_ROWS and AVG_ROW_LENGTH.
Jeremy Zawodny has an excellent write-up on how to solve this problem.