MySQL Partitioning / Sharding / Splitting - which way to go?

We have an InnoDB database that is about 70 GB and we expect it to grow to several hundred GB in the next 2 to 3 years. About 60% of the data belongs to a single table. Currently the database is working quite well as we have a server with 64 GB of RAM, so almost the whole database fits into memory, but we're concerned about the future when the amount of data will be considerably larger. Right now we're considering some way of splitting up the tables (especially the one that accounts for the biggest part of the data) and I'm wondering what the best way to do that would be.
The options I'm currently aware of are:
- Using the MySQL partitioning that comes with version 5.1
- Using some kind of third-party library that encapsulates the partitioning of the data (like Hibernate Shards)
- Implementing it ourselves inside our application
Our application is built on J2EE and EJB 2.1 (hopefully we’re switching to EJB 3 some day).
What would you suggest?
EDIT (2011-02-11):
Just an update: Currently the size of the database is 380 GB, the data size of our "big" table is 220 GB and the size of its index is 36 GB. So while the whole table does not fit in memory any more, the index does.
The system is still performing fine (still on the same hardware) and we're still thinking about partitioning the data.
EDIT (2014-06-04):
One more update: The size of the whole database is 1.5 TB, the size of our "big" table is 1.1 TB. We upgraded our server to a 4 processor machine (Intel Xeon E7450) with 128 GB RAM.
The system is still performing fine.
What we're planning to do next is putting our big table on a separate database server (we've already done the necessary changes in our software) while simultaneously upgrading to new hardware with 256 GB RAM.
This setup is supposed to last for two years. Then we will either have to finally start implementing a sharding solution or just buy servers with 1 TB of RAM which should keep us going for some time.
EDIT (2016-01-18):
We have since put our big table in its own database on a separate server. Currently the size of this database is about 1.9 TB, and the size of the other database (with all tables except for the "big" one) is 1.1 TB.
Current Hardware setup:
HP ProLiant DL 580
4 x Intel(R) Xeon(R) CPU E7-4830
256 GB RAM
Performance is fine with this setup.

You will definitely start to run into issues on that 42 GB table once it no longer fits in memory. In fact, as soon as it does not fit in memory anymore, performance will degrade extremely quickly. One way to test this is to put that table on another machine with less RAM and see how poorly it performs.
"First of all, it doesn't matter as much splitting out tables unless you also move some of the tables to a separate physical volume."
This is incorrect. Partitioning (either through the feature in MySQL 5.1, or the same thing using MERGE tables) can provide significant performance benefits even if the tables are on the same drive.
As an example, let's say that you are running SELECT queries on your big table using a date range. If the table is unpartitioned, the query will be forced to scan through the entire table (and at that size, even using indexes can be slow). The advantage of partitioning is that your queries will only run on the partitions that are absolutely necessary. If each partition is 1 GB in size and your query only needs to access 5 partitions, the combined 5 GB is a lot easier for MySQL to deal with than a monster 42 GB table.
One thing you need to ask yourself is how you are querying the data. If there is a chance that your queries will only need to access certain chunks of data (i.e. a date range or ID range), partitioning of some kind will prove beneficial.
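For instance, a range-partitioned version of a large table might look like the sketch below (table and column names are made up for illustration); with a date predicate, MySQL only opens the partitions that can contain matching rows, which EXPLAIN PARTITIONS in 5.1 will confirm:

    -- Hypothetical names, just to illustrate RANGE partitioning in MySQL 5.1.
    -- Note: the partitioning column must be part of every unique key, hence the composite PK.
    CREATE TABLE big_table (
        id      BIGINT NOT NULL,
        created DATE   NOT NULL,
        payload VARCHAR(255),
        PRIMARY KEY (id, created)
    ) ENGINE=InnoDB
    PARTITION BY RANGE (YEAR(created)) (
        PARTITION p2007 VALUES LESS THAN (2008),
        PARTITION p2008 VALUES LESS THAN (2009),
        PARTITION p2009 VALUES LESS THAN (2010),
        PARTITION pmax  VALUES LESS THAN MAXVALUE
    );

    -- The "partitions" column of the plan should show that only p2008 is touched:
    EXPLAIN PARTITIONS
    SELECT * FROM big_table WHERE created BETWEEN '2008-03-01' AND '2008-06-30';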
I've heard that there is still some bugginess with MySQL 5.1 partitioning, particularly related to MySQL choosing the correct key. MERGE tables can provide the same functionality, although they require slightly more overhead.
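For completeness, the MERGE-table version of the same idea looks roughly like this (again with made-up names); keep in mind that MERGE only works over identical MyISAM tables, so it is not a drop-in option for InnoDB data:

    CREATE TABLE log_2009 (id INT NOT NULL, created DATE, msg VARCHAR(255), KEY (created)) ENGINE=MyISAM;
    CREATE TABLE log_2010 LIKE log_2009;

    -- The MERGE table presents the underlying tables as one; new rows go to the last one listed.
    CREATE TABLE log_all (id INT NOT NULL, created DATE, msg VARCHAR(255), KEY (created))
        ENGINE=MERGE UNION=(log_2009, log_2010) INSERT_METHOD=LAST;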
Hope that helps...good luck!

If you think you're going to be IO/memory bound, I don't think partitioning is going to be helpful. As usual, benchmarking first will help you figure out the best direction. If you don't have spare servers with 64GB of memory kicking around, you can always ask your vendor for a 'demo unit'.
I would lean towards sharding if you don't expect to need aggregate reporting in a single query. I'm assuming you'd shard the whole database and not just your big table: it's best to keep entire entities together. Well, if your model splits nicely, anyway.

This is a great example of what MySQL partitioning can do in a real-life case of huge data flows:
http://web.archive.org/web/20101125025320/http://www.tritux.com/blog/2010/11/19/partitioning-mysql-database-with-high-load-solutions/11/1
Hoping it will be helpful for your case.

A while back at a Microsoft ArcReady event, I saw a presentation on scaling patterns that might be useful to you. You can view the slides for it online.

I would go for MariaDB InnoDB + Partitions (either by key or by date, depending on your queries).
I did this and now I don't have any database problems anymore.
MySQL can be replaced with MariaDB in seconds...all the database files stay the same.

First of all, it doesn't matter as much splitting out tables unless you also move some of the tables to a separate physical volume.
Secondly, it's not necessarily the table with the largest physical size that you want to move. You may have a much smaller table that gets more activity, while your big table remains fairly constant or only appends data.
Whatever you do, don't implement it yourselves. Let the database system handle it.

What does the big table do?
If you're going to split it, you've got a few options:
- Split it using the database system (don't know much about that)
- Split it by row.
- Split it by column.
Splitting it by row would only be possible if your data can be separated easily into chunks. e.g. Something like Basecamp has multiple accounts which are completely separate. You could keep 50% of the accounts in one table and 50% in a different table on a different machine.
Splitting by column is good for situations where the row contains large text fields or BLOBs. If you've got a table with (for example) a user image and a huge block of text, you could farm the image out into a completely different table (on a different machine).
You break normalisation here, but I don't think it would cause too many problems.
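A rough sketch of that column split, with hypothetical names; the blob moves into its own table keyed by the same id, and you only join it in when you actually need it:

    CREATE TABLE user_profile (
        user_id INT NOT NULL PRIMARY KEY,
        name    VARCHAR(100),
        bio     TEXT
    ) ENGINE=InnoDB;

    CREATE TABLE user_image (
        user_id INT NOT NULL PRIMARY KEY,   -- 1:1 with user_profile, possibly on another server
        image   MEDIUMBLOB
    ) ENGINE=InnoDB;

    SELECT p.name, i.image
    FROM user_profile p
    JOIN user_image i ON i.user_id = p.user_id
    WHERE p.user_id = 42;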

You would probably want to split that large table eventually. You'll probably want to put it on a separate hard disk, before thinking of a second server. Doing it with MySQL is the most convenient option. If it is capable, then go for it.
BUT
Everything depends on how your database is being used, really. Statistics.

Related

How to fine tune AWS R4 Aurora MySQL database

I have a database currently at 6.5 GB but growing fast...
Currently on an R4L Aurora server: 15.25 GB RAM, 2-core CPU
I am looking at buying a Reserved Instance to cut costs, but worried that if the database grows fast, e.g. reaches over 15G within a year, I'll need to get a bigger server.
99% of the data is transactional history, this table is the biggest by far. It is written very frequently, but once a row has been written it doesn't change often (although it does on occasion).
So few questions...
1) Should I disable the cache?
2) Will I be OK with 15 GB of RAM even if the database itself grows to (say) 30 GB, or will I see massive speed issues?
3) The database is well indexed, but could this be improved? E.g. if (say) 1 million records belong to 1 user, is there a way to partition the data to prevent that slowing down access for other users?
Thanks
"Should I disable the cache?" -- Which "cache"?
"will I see massive speed issues" -- We need to see the queries, etc.
"The database is well indexed" -- If that means you indexed every column, then it is not well indexed. Please show us SHOW CREATE TABLE and a few of the important queries.
"partition" -- With few exceptions, partitioning does not speed up MySQL tables. Again, we need details.
"15.25G Ram" & "database...15G" -- It is quite common for the dataset size to be bigger, even much bigger, than RAM. So, this pair of numbers are not necessarily good to compare to each other.
"1 million records belong to 1 user" -- Again, details, please.
You should analyze the data growth statistically. This can be done by running a COUNT(*) query grouped by the created-date (year) column. Once you have a count of records per year you can understand what's going on.
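A minimal version of that check, assuming the big table has a created_at column (names are hypothetical):

    SELECT YEAR(created_at) AS yr, COUNT(*) AS rows_added
    FROM transaction_history
    GROUP BY YEAR(created_at)
    ORDER BY yr;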
Now you can think of possible solutions:
- You can remove data which is no longer relevant from a history standpoint and keep the storage limited.
- If there's a large amount of data, e.g. BLOBs, you could consider storing it in S3 and keeping only a reference in the database table.
- Delete any unwanted tables. Sometimes DBAs create temporary backup tables and leave them there afterwards. You can clean up such tables.
The memory of the instance just comes into play when the engine fetches pages into the buffer pool for page misses. It does not depend on your actual data size (except in extreme cases, for example, your records are really really huge). The rule of thumb is to make sure you always keep your working set warm in the buffer pool, and avoid pages getting flushed.
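One quick way to see whether the working set still fits: compare the InnoDB buffer pool counters (these are standard status variables; what ratio counts as healthy is a judgment call for your workload):

    SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';
    -- Innodb_buffer_pool_reads         = logical reads that had to hit disk (page misses)
    -- Innodb_buffer_pool_read_requests = total logical read requests
    -- A rising reads/read_requests ratio means the working set no longer fits in the buffer pool.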
If your app does need to touch a large amount of data, then the ideal way to do that would be to have dedicated replicas for specific kinds of queries. That way, you avoid swapping out valid pages in favor of newer queries. Aurora has custom endpoints support now, and that makes this even easier to manage.
If you need more specific guidelines, you may need to share details about your data, indices, queries etc.

MyISAM sharding vs using InnoDB

I have a table with very high insert rate and update rate as well as read rate. On average there are about 100 rows being inserted and updated per second. And there are about 1000 selects per second.
The table has about 100 million tuples. It is a relationship table, so it only has about 5 fields. Three fields contain keys, so they are indexed. All the fields are integers.
I am thinking of sharding the data; however, it adds a lot of complexity, but does offer speed. The other alternative is to use InnoDB.
The database runs on a RAID 1 of 256 GB SSDs with 32 GB of 1600 MHz RAM and an i7 3770K overclocked to 4 GHz.
The database freezes constantly at peak times, when the load can be as high as 200 rows being inserted or updated and 2500 selects per second.
Could you guys please point me towards what I should do?
Sharding is usually a good idea to distribute table size. Load problems should generally be addressed with a replicated data environment. In your case your problem is a) a huge table, b) table-level locking and c) crappy hardware.
InnoDB
If you can use one of the keys on your table as a primary key, InnoDB might be a good way to go since he'll let you do row-level locking, which may reduce your queries from waiting on each other. A good test might be to replicate your table to a test server and try all your queries against him and see what the performance benefit is. InnoDB has a higher resource consumption rate than MyISAM, so keep that in mind.
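A quick way to run that test on a copy (table name is hypothetical): clone the table, switch the clone to InnoDB, and then replay a production-style insert/update/select mix against it:

    CREATE TABLE relation_innodb LIKE relation;   -- copies structure and indexes
    ALTER TABLE relation_innodb ENGINE=InnoDB;    -- switch the copy to row-level locking
    INSERT INTO relation_innodb SELECT * FROM relation;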
Hardware
I'm sorry bud, but your hardware is crap for the performance you need. Twitter does 34 writes per second at 2.6k QPS. You can't be doing Twitter's volume and think a beefed up gaming desktop is going to cut it. Buy a $15k Dell with some SSD drives and you'll be able to burst 100k QPS. You're in the big times now. It's time to ditch the start-up gear and get yourself a nice server. You do not want to shard. It will be cheaper to upgrade your hardware, and frankly, you need to.
Sharding
Sharding is awesome for splitting up large tables. And that's it.
Let me be clear about the bad. Developing a sharded architecture sucks. You want to do everything possible to not shard. Upgrade hardware, buy multiple servers and set up replication, optimize your code, but for the love of God, do not shard. You are way below the performance line for sharding. When you're pushing sustained 30k+ QPS, then we can talk sharding. Until that day, NO.
You can buy a medium-range server ($30k Dell PowerEdge) with 5TB of Fusion IO on 16 cores and 256 GB of RAM and he'll take you all the way to 200k QPS.
But if you refuse to listen to me and are going to shard anyway, then here's what you need to do.
Rule 1: Stay on the Same Shard (i.e. Picking a Partition Rule)
Once you shard, you do not want to be accessing data from across multiple shards. You need to pick a partition rule that keeps your query on the same shard as much as possible. Distributing a query (Rule 5 below) is incredibly painful in distributed data environments.
Rule 2: Build a Shard Map and Replicate it
Your code will need to be able to get to all shards. Create a shard map based on your partition rule that lets your code know where to go to get the data he wants.
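One simple shape for such a shard map (all names hypothetical): a small table mapping key ranges to hosts, replicated to every node so the wrapper in the next rule can look up the right shard cheaply:

    CREATE TABLE shard_map (
        range_start INT NOT NULL,
        range_end   INT NOT NULL,
        shard_host  VARCHAR(64) NOT NULL,
        PRIMARY KEY (range_start)
    ) ENGINE=InnoDB;

    INSERT INTO shard_map VALUES
        (1,       5000000,  'db-shard-01'),
        (5000001, 10000000, 'db-shard-02');

    -- The wrapper picks a shard for, say, key 7310221 with:
    SELECT shard_host FROM shard_map
    WHERE 7310221 BETWEEN range_start AND range_end;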
Rule 3: Write a Query Wrapper for your Shards
You do not want to manually decide which shard to go to. Write a wrapper that does it for you. You will thank yourself down the road when you're writing code.
Rule 4: Auto-balance
You'll eventually need to balance your shards to keep performance optimal. Plan for this beforehand and write your code with the intention that you'll have some cron job which balances your shards for you.
Rule 5: Support Distributed Queries
You inevitably will need to break Rule 1. When that happens, you'll need a query wrapper that can pull data from multiple shards and aggregate (bring) it into one place. The more shards you have, the more likely this will need to be multi-threaded. In my shop, we call this a distributed query (ie. a query which runs on multiple shards).
Bad News: There is no code out there for doing distributed queries and aggregating results. Apache Hadoop tries, but he's terrible. So is HiveDB. A good query distributor is hard to architect, hard to write, hard to optimize. This is a problem billion-dollar a year companies deal with. I shit you not, but if you come up with a good wrapper for distributing queries across shards that supports sorting+limit clauses and scales well, you could be a millionaire over night. Selling it for $300,000? You would have a line outside your door a mile long.
My point here is sharding is hard and it is expensive. It takes a lot of work and you want to do everything humanly possible to not shard. If you must, follow the rules.

MySQL InnoDB big table: to shard or to add more RAM?

Folks, I'm a developer of a social game. There are already 700k players in the game, about 7k new players are registered every day, and about 5k players are constantly online.
The DB server is running on pretty powerful hardware: 16-core CPU, 24 GB RAM, RAID-10 with BBU built on 4 SAS disks. I'm using Percona Server (patched MySQL 5.1) and currently the InnoDB buffer pool is 18 GB (although according to innotop only a few free buffers are available). The DB server is performing pretty well (2k QPS, iostat %util is 10-15%, almost always 0 processes in "b" state in vmstat, loadavg is 5-6). However, from time to time (every few minutes) I'm getting about 10-100 slow queries (where each may last about 5-6 seconds).
There is one big InnoDB table in the MySQL database which occupies the most space. It has about 300 million rows; its size is about 20 GB. Of course, this table is gradually growing... I'm starting to worry it's affecting the overall performance of the database in a negative way. In the nearest future I'll have to do something about it, but I'm not sure what exactly.
Basically the question boils down to whether to shard or simply add more RAM. The latter is simpler, of course. It looks like I can add up to 256 GB of RAM. But the question is whether I should invest more time in implementing sharding instead, since it's more scalable?
Sharding seems reasonable if you need to have all 300M+ rows. It may be a pain to change now, but when your table grows and grows there will be a point when no amount of RAM will solve your problem. With such massive amounts of data it may be worth using something like CouchDB, as you could store documents of data rather than rows, i.e. one document could contain all records for an individual user.
Sounds to me like your main database table could use some normalization. Does all your information belong in that one table, or can you split it out to smaller tables? Normalization may invoke a small performance hit now, but as your table grows, that will be overwhelmed by the extra processing involved in accessing a huge, monolithic table.
"I'm getting about 10-100 slow queries (where each may last about 5-6 seconds)."
Quote of a comment: "Database is properly normalized. The database has many tables, one of them is really huge and has nothing to do with normalization."
When I'm reading this I would say it has to do with your queries, not with your hardware. Average companies would dream about the kind of server you have!
If you write bad queries, it doesn't matter how well your tables are normalized; it will be slow.
Maybe you'll get something out of this; it's almost a similar question with an answer (database is slow and stuff like that).
Have you also thought about archiving some stuff? For example, of those 300 million rows, it started with ID 1, so is that ID still being used? If not, why not archive it to another database or table (I would recommend a database). I also don't believe that all 700k users are logged in every day (respect if they are, but I doubt it).
You also said 'This table contains player specific items'. What kind of specific items?
Another question: can you post some of your 'slow' queries?
Have you also considered a caching system for some data that maybe changes only once a month, like gear or other game stuff?

Can MySQL Cluster handle a terabyte database

I have to look into solutions for providing a MySQL database that can handle data volumes in the terabyte range and be highly available (five nines). Each database row is likely to have a timestamp and up to 30 float values. The expected workload is up to 2500 inserts/sec. Queries are likely to be less frequent but could be large (maybe involving 100 GB of data), though probably only involving single tables.
I have been looking at MySQL Cluster given that is their HA offering. Due to the volume of data I would need to make use of disk based storage. Realistically I think only the timestamps could be held in memory and all other data would need to be stored on disk.
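For reference, MySQL Cluster's disk data tables look roughly like the sketch below (names and sizes are placeholders); indexed columns always stay in memory, which is why only the timestamp index would be RAM-resident in the layout you describe:

    CREATE LOGFILE GROUP lg1
        ADD UNDOFILE 'undo_1.log' INITIAL_SIZE 1G ENGINE NDBCLUSTER;

    CREATE TABLESPACE ts1
        ADD DATAFILE 'data_1.dat' USE LOGFILE GROUP lg1 INITIAL_SIZE 8G ENGINE NDBCLUSTER;

    CREATE TABLE samples (
        sample_id BIGINT NOT NULL AUTO_INCREMENT,
        ts        TIMESTAMP NOT NULL,
        v1 FLOAT, v2 FLOAT, v3 FLOAT,   -- ... up to 30 float columns, held on disk
        PRIMARY KEY (sample_id),
        KEY (ts)                        -- indexed columns are kept in memory
    ) TABLESPACE ts1 STORAGE DISK ENGINE NDBCLUSTER;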
Does anyone have experience of using MySQL Cluster on a database of this scale? Is it even viable? How does disk based storage affect performance?
I am also open to other suggestions for how to achieve the desired availability for this volume of data. For example, would it be better to use a third-party library like Sequoia to handle the clustering of standard MySQL instances? Or a more straightforward solution based on MySQL replication?
The only condition is that it must be a MySQL based solution. I don't think that MySQL is the best way to go for the data we are dealing with but it is a hard requirement.
Speed wise, it can be handled. Size wise, the question is not the size of your data, but rather the size of your index as the indices must fit fully within memory.
I'd be happy to offer a better answer, but high-end database work is very task-dependent. I'd need to know a lot more about what's going on with the data to be of further help.
Okay, I did read the part about MySQL being a hard requirement.
So with that said, let me first point out that the workload you're talking about -- 2500 inserts/sec, rare queries, queries likely to have result sets of up to 10 percent of the whole data set -- is just about pessimal for any relational database system.
(This rather reminds me of a project, long ago, where I had a hard requirement to load 100 megabytes of program data over a 9600 baud RS-422 line (also a hard requirement) in less than 300 seconds (also a hard requirement.) The fact that 1kbyte/sec × 300 seconds = 300kbytes didn't seem to communicate.)
Then there's the part about "contain up to 30 floats." The phrasing at least suggests that the number of samples per insert is variable, which suggests in turn some normalization issues -- or else needing to make each row 30 entries wide and use NULLs.
But with all that said, okay, you're talking about 300Kbytes/sec and 2500 TPS (assuming this really is a sequence of unrelated samples). This set of benchmarks, at least, suggests it's not out of the realm of possibility.
This article is really helpful in identifying what can slow down a large MySQL database.
Possibly try out Hibernate Shards and run MySQL on 10 nodes with half a terabyte each, so you can handle 5 terabytes ;) Well over your limit, I think?

How big can a MySQL database get before performance starts to degrade

At what point does a MySQL database start to lose performance?
Does physical database size matter?
Does the number of records matter?
Is any performance degradation linear or exponential?
I have what I believe to be a large database, with roughly 15M records which take up almost 2GB. Based on these numbers, is there any incentive for me to clean the data out, or am I safe to allow it to continue scaling for a few more years?
The physical database size doesn't matter. The number of records doesn't matter.
In my experience the biggest problem that you are going to run in to is not size, but the number of queries you can handle at a time. Most likely you are going to have to move to a master/slave configuration so that the read queries can run against the slaves and the write queries run against the master. However if you are not ready for this yet, you can always tweak your indexes for the queries you are running to speed up the response times. Also there is a lot of tweaking you can do to the network stack and kernel in Linux that will help.
I have had mine get up to 10GB, with only a moderate number of connections and it handled the requests just fine.
I would focus first on your indexes, then have a server admin look at your OS, and if all that doesn't help it might be time to implement a master/slave configuration.
In general this is a very subtle issue and not trivial whatsoever. I encourage you to read mysqlperformanceblog.com and High Performance MySQL. I really think there is no general answer for this.
I'm working on a project which has a MySQL database with almost 1 TB of data. The most important scalability factor is RAM. If the indexes of your tables fit into memory and your queries are highly optimized, you can serve a reasonable number of requests with an average machine.
The number of records does matter, depending on what your tables look like. It makes a difference whether you have a lot of varchar fields or only a couple of ints or longs.
The physical size of the database matters as well: think of backups, for instance. Depending on your engine, your physical DB files only grow and don't shrink, for instance with InnoDB. So deleting a lot of rows doesn't help to shrink your physical files.
There's a lot to this issue, and as in a lot of cases, the devil is in the details.
The database size does matter. If you have more than one table with more than a million records, then performance does indeed start to degrade. The number of records of course affects the performance: MySQL can be slow with large tables. If you hit one million records you will get performance problems if the indices are not set right (for example, no indices for fields in WHERE clauses or ON conditions in joins). If you hit 10 million records, you will start to get performance problems even if you have all your indices right. Hardware upgrades - adding more memory and more processor power, especially memory - often help to reduce the most severe problems by increasing performance again, at least to a certain degree. For example, 37signals went from 32 GB to 128 GB of RAM for the Basecamp database server.
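To make the "indices for WHERE and ON fields" point concrete, a sketch with hypothetical table and column names: a query that filters and joins on unindexed columns benefits from indexes covering exactly those columns:

    SELECT o.*
    FROM orders o
    JOIN customers c ON c.id = o.customer_id        -- o.customer_id should be indexed for the join
    WHERE o.status = 'open' AND o.created_at >= '2014-01-01';

    ALTER TABLE orders
        ADD INDEX idx_customer (customer_id),
        ADD INDEX idx_status_created (status, created_at);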
I'm currently managing a MySQL database on Amazon's cloud infrastructure that has grown to 160 GB. Query performance is fine. What has become a nightmare is backups, restores, adding slaves, or anything else that deals with the whole dataset, or even DDL on large tables. Getting a clean import of a dump file has become problematic. In order to make the process stable enough to automate, various choices needed to be made to prioritize stability over performance. If we ever had to recover from a disaster using a SQL backup, we'd be down for days.
Horizontally scaling SQL is also pretty painful, and in most cases leads to using it in ways you probably did not intend when you chose to put your data in SQL in the first place. Shards, read slaves, multi-master, et al, they are all really shitty solutions that add complexity to everything you ever do with the DB, and not one of them solves the problem; only mitigates it in some ways. I would strongly suggest looking at moving some of your data out of MySQL (or really any SQL) when you start approaching a dataset of a size where these types of things become an issue.
Update: a few years later, and our dataset has grown to about 800 GiB. In addition, we have a single table which is 200+ GiB and a few others in the 50-100 GiB range. Everything I said before holds. It still performs just fine, but the problems of running full dataset operations have become worse.
"I would focus first on your indexes, then have a server admin look at your OS, and if all that doesn't help it might be time for a master/slave configuration."
That's true. Another thing that usually works is to just reduce the quantity of data that's repeatedly worked with. If you have "old data" and "new data" and 99% of your queries work with new data, just move all the old data to another table - and don't look at it ;)
-> Have a look at partitioning.
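Two common ways to implement that, with hypothetical names: either copy-and-delete old rows into an archive table, or partition by date so that old data can be dropped (or simply never scanned) cheaply. Keep in mind that MySQL requires the partitioning column to be part of every unique key on the table:

    -- a) move old rows to an archive table
    INSERT INTO events_archive SELECT * FROM events WHERE created < '2015-01-01';
    DELETE FROM events WHERE created < '2015-01-01';

    -- b) or partition by date, so removing a year of data is a metadata operation
    ALTER TABLE events PARTITION BY RANGE (TO_DAYS(created)) (
        PARTITION p2014 VALUES LESS THAN (TO_DAYS('2015-01-01')),
        PARTITION pnew  VALUES LESS THAN MAXVALUE
    );
    ALTER TABLE events DROP PARTITION p2014;   -- instant compared with a huge DELETE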
2 GB and about 15M records is a very small database - I've run much bigger ones on a Pentium III (!) and everything has still run pretty fast. If yours is slow, it is a database/application design problem, not a MySQL one.
It's kind of pointless to talk about "database performance", "query performance" is a better term here. And the answer is: it depends on the query, data that it operates on, indexes, hardware, etc. You can get an idea of how many rows are going to be scanned and what indexes are going to be used with EXPLAIN syntax.
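For example (table and column names are hypothetical):

    EXPLAIN SELECT * FROM orders WHERE customer_id = 42 AND created_at >= '2020-01-01';
    -- "key"  shows which index (if any) the optimizer picked
    -- "rows" is its estimate of how many rows will be examined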
2GB does not really count as a "large" database - it's more of a medium size.
I was once called upon to look at a MySQL database that had "stopped working". I discovered that the DB files were residing on a Network Appliance filer mounted with NFS2, which has a maximum file size of 2 GB. And sure enough, the table that had stopped accepting transactions was exactly 2 GB on disk. But with regards to the performance curve, I'm told that it was working like a champ right up until it didn't work at all! This experience always serves as a nice reminder for me that there are always dimensions above and below the one you naturally suspect.
Also watch out for complex joins. Transaction complexity can be a big factor in addition to transaction volume.
Refactoring heavy queries sometimes offers a big performance boost.
Another point to consider is the purpose of the system and how the data is used day to day.
For example, in a system with GPS monitoring of cars, it is usually not relevant to query the positions of a car from previous months.
Therefore that data can be moved to historical tables for occasional consultation, reducing the execution times of the day-to-day queries.
Performance can degrade within a matter of a few thousand rows if the database is not designed properly.
If you have proper indexes, use proper engines (don't use MyISAM where multiple DMLs are expected), use partitioning, allocate the right amount of memory for the workload and of course have a good server configuration, MySQL can handle data even in terabytes!
There are always ways to improve the database performance.
It depends on your queries and validation.
For example, I worked with a table of 100,000 drugs which has a generic-name column of more than 15 characters for each drug in that table. I wrote a query to compare the generic names of drugs between two tables; that query takes minutes to run. If instead you compare the drugs using the drug index, an id column (as said above), it takes only a few seconds.
Database size DOES matter, in terms of bytes as well as a table's row count. You will notice a huge performance difference between a lean database and a blob-filled one. Once my application got stuck because I put binary images inside fields instead of keeping the images in files on disk and putting only the file names in the database. Iterating over a large number of rows, on the other hand, is not free either.
No, it doesn't really matter. MySQL can handle about 7 million rows per second, so you can scale it quite a bit.
Query performance mainly depends on the number of records it needs to scan; indexes play a big role in it, and index data size is proportional to the number of rows and the number of indexes.
Queries with conditions on indexed fields that match a full value generally return in about 1 ms, but starts-with, IN, BETWEEN and obviously "contains" conditions may take more time, with more records to scan.
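To illustrate with a hypothetical drugs table (an index on generic_name assumed):

    SELECT * FROM drugs WHERE generic_name = 'aspirin';       -- exact match on an indexed column: ~1 ms
    SELECT * FROM drugs WHERE generic_name LIKE 'aspi%';      -- starts-with: can still use the index
    SELECT * FROM drugs WHERE generic_name LIKE '%aspirin%';  -- contains: full scan, cost grows with row count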
You will also face a lot of maintenance issues with DDL: ALTER and DROP will be slow and difficult with more live traffic, even for adding an index or new columns.
Generally it's advisable to split the database into as many clusters as required (500 GB is a rough benchmark; as others have said, it depends on many factors and can vary based on use cases). That way you get better isolation and the independence to scale specific clusters (more suited in B2B cases).