Can MySQL Cluster handle a terabyte database?

I have to look into solutions for providing a MySQL database that can handle data volumes in the terabyte range and be highly available (five nines). Each database row is likely to have a timestamp and up to 30 float values. The expected workload is up to 2500 inserts/sec. Queries are likely to be less frequent but could be large (maybe involving 100Gb of data) though probably only involving single tables.
I have been looking at MySQL Cluster given that is their HA offering. Due to the volume of data I would need to make use of disk based storage. Realistically I think only the timestamps could be held in memory and all other data would need to be stored on disk.
Does anyone have experience of using MySQL Cluster on a database of this scale? Is it even viable? How does disk based storage affect performance?
I am also open to other suggestions for how to achieve the desired availability for this volume of data. For example, would it be better to use a third-party library like Sequoia to handle the clustering of standard MySQL instances? Or a more straightforward solution based on MySQL replication?
The only condition is that it must be a MySQL based solution. I don't think that MySQL is the best way to go for the data we are dealing with but it is a hard requirement.

Speed wise, it can be handled. Size wise, the question is not the size of your data, but rather the size of your index as the indices must fit fully within memory.
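As a rough way to gauge that on an existing MySQL instance (not specific to NDB), you can compare per-table index sizes against the RAM you plan to give the data nodes. A minimal sketch against information_schema; the schema name 'mydb' is a placeholder:

```sql
-- Rough per-table data/index sizes; 'mydb' is a hypothetical schema name.
-- If the summed index_length does not fit comfortably in memory, expect trouble.
SELECT table_name,
       ROUND(data_length  / 1024 / 1024 / 1024, 2) AS data_gb,
       ROUND(index_length / 1024 / 1024 / 1024, 2) AS index_gb
FROM   information_schema.tables
WHERE  table_schema = 'mydb'
ORDER  BY index_length DESC;
```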
I'd be happy to offer a better answer, but high-end database work is very task-dependent. I'd need to know a lot more about what's going on with the data to be of further help.

Okay, I did read the part about MySQL being a hard requirement.
So with that said, let me first point out that the workload you're talking about -- 2500 inserts/sec, rare queries, queries likely to have result sets of up to 10 percent of the whole data set -- is just about pessimal for any relational database system.
(This rather reminds me of a project, long ago, where I had a hard requirement to load 100 megabytes of program data over a 9600 baud RS-422 line (also a hard requirement) in less than 300 seconds (also a hard requirement.) The fact that 1kbyte/sec × 300 seconds = 300kbytes didn't seem to communicate.)
Then there's the part about "contain up to 30 floats." The phrasing at least suggests that the number of samples per insert is variable, which suggests in turn some normalization issues -- or else needing to make each row 30 entries wide and use NULLs.
But with all that said, okay, you're talking about 300 KB/sec and 2500 TPS (assuming this really is a sequence of unrelated samples). This set of benchmarks, at least, suggests it's not out of the realm of possibility.

This article is really helpful in identifying what can slow down a large MySQL database.

You could possibly try out Hibernate Shards and run MySQL on 10 nodes with half a terabyte each, so you could handle 5 terabytes ;) well over your limit, I think?

Related

How to increase database performance if there is 0.1M traffic

I am developing a site and I'm concerned about the performance.
In the current system there are transactions like adding 10,000 rows to a single table. It doesn't matter that it takes around 0.6 seconds to insert.
But I am worrying about what happens if there are 100,000 concurrent users and 1000 of the users want to add 10,000 rows to a single table at once.
How could this impact the performance compared to a single user? How can I improve these transactions if there is a large amount of traffic like in this situation?
When write speed is mandatory, the way we tackle it is getting quicker hard drives.
You mentioned transactions; that means you need your data durable (the D of ACID). This requirement rules out the MyISAM storage engine or any type of NoSQL, so I'll focus the answer on what goes on with relational databases.
The way it works is this: you get a set number of Input/Output Operations per Second, or IOPS, per hard drive. Hard drives also have a metric called bandwidth. The metric you are interested in is write speed.
A crude calculation here would be: number of MB per second divided by number of IOPS = how much data you can squeeze into each I/O operation.
For mechanical drives, this magic IOPS number is anywhere between 150 and 300 - quite low. Given their bandwidth of about 100 MB/sec, you get a very small number of writes and little bandwidth per write. This is where Solid State Drives kick in - their IOPS number starts at about 5,000 (some even go to 80,000), which is awesome for databases.
Connecting these drives in RAID gives you a super quick storage solution. If you are able to squeeze 10,000 inserts into one transaction, the disk can push all 10k inserts through a single I/O operation.
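A minimal sketch of that batching idea; the samples table and its columns are hypothetical, and the point is simply that many rows share one commit/flush:

```sql
-- Sketch: group many inserts into one transaction so they share a single flush.
-- The 'samples' table and its columns are placeholders.
START TRANSACTION;
INSERT INTO samples (ts, v1, v2) VALUES
  ('2017-01-01 00:00:00', 1.0, 2.0),
  ('2017-01-01 00:00:01', 1.1, 2.1),
  ('2017-01-01 00:00:02', 1.2, 2.2);
-- ... repeat multi-row INSERTs until the batch is full ...
COMMIT;
```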
Another strategy is partitioning your table and having multiple drives where MySQL stores the data.
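A hedged sketch of that multi-drive idea, assuming an engine/version that honors per-partition DATA DIRECTORY (MyISAM, or InnoDB with innodb_file_per_table); the table, partition names and paths are placeholders:

```sql
-- Spread a partitioned table across two physical drives.
-- 'samples', the partition names and the directories are hypothetical.
CREATE TABLE samples (
  ts DATETIME NOT NULL,
  v1 FLOAT
)
PARTITION BY RANGE (YEAR(ts)) (
  PARTITION p2016 VALUES LESS THAN (2017) DATA DIRECTORY = '/mnt/disk1/mysql',
  PARTITION p2017 VALUES LESS THAN (2018) DATA DIRECTORY = '/mnt/disk2/mysql'
);
```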
This is as far as you can go with a single MySQL installation. There are strategies for distributing data to multiple MySQL nodes etc. but I assume that's out of scope of your question.
TL;DR: you need quicker disks.
If you are trying to scale for inserting millions of rows per second, you have bigger problems. That could add up to trillions of rows per month. That's hundreds of terabytes before the end of the month. Do you have a big enough disk farm for that? Can you afford enough SSDs for that?
Another thing. With a trillion rows, it is quite challenging to have any indexes other than a simple auto_increment. Without any indexes, how do you plan on accessing the data? A table scan of a trillion rows will take day(s).
Also, you said 100,000 users; you implied that they are connected simultaneously? That, too, is a challenge.
What are the users doing to generate 10K rows all at once? What about the network bandwidth?
Etc. Etc.
If you really have a task like this, Sharding is probably the only solution. And that is in addition to SSDs, RAID, IOPs, etc, etc.
A few things you must consider, from both the software and hardware points of view:
Go for SSD drives to get better I/O.
It is good to have a 10 Gb network if you handle that much traffic.
Use MySQL 5.6 or above; it brings good performance improvements over previous versions.
Use bulk inserts instead of sequential ones, and even better, store all the data in a file and use LOAD DATA INFILE (see the sketch after this list). This can be around 20 times faster than regular inserts.
MySQL provides multiple ways to scale out. Which way you go depends on your product requirements.
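A minimal sketch of the LOAD DATA INFILE route mentioned above; the file path, table and column names are placeholders:

```sql
-- Bulk-load a CSV file in one statement instead of thousands of single-row INSERTs.
-- '/tmp/events.csv' and the 'events' table are hypothetical.
LOAD DATA INFILE '/tmp/events.csv'
INTO TABLE events
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(ts, v1, v2, v3);
```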

How to increase performance of a MySQL query if we have more than 1 million records?

In the User table I have more than 1 million records, so how can I manage this using MySQL and Symfony 1.4 to make performance better, so that it gives quick output?
To significantly improve the performance of a well designed system, all you can do is increase the resources. Typically, these days, the cheapest way to do this is to distribute the task.
For example, a slow thing in an RDBMS is reading from and writing to storage (typically RDBMS systems start out I/O bound, that is, they mostly wait for data to be read from or written to storage).
So, to offset this, the RDBMS will very commonly allow you to split a table across multiple HDDs, effectively multiplying the I/O performance (an approach similar to RAID 0).
Adding more hard disks increases the performance. This goes on up to the maximum I/O that your system can support (either simply because the system cannot push more data through its circuits, or because it needs to crunch the numbers a bit when it fetches them and so becomes CPU bound; optimally you would be utilising both).
After that you have to start multiplying the systems, distributing the data across database nodes. For this to work either the RDBMS must support it or there must be an application layer that coordinates distributing the tasks and merging the results, but normally things will still scale.
I would say that with 512 systems you could have a full trillion records (10^12) effectively cached and achieve relatively nice performance. But really you should specify what kind of performance you are looking for - there is a difference between full-text searches on tera-records and running mostly simple fetches and updates. Also, for certain work 500 ms (or even more) is considered good performance, while for other work it would be horrible.
First off: there's a big difference between 1 trillion and 1 million.
As for your performance problems: show us the query that's running slow; without seeing it, it's hard to tell what's wrong with it. What you could try:
use EXPLAIN to get more information about your slow queries and see whether they're using your indexes or not (and if not, why not?) - see the sketch below
use correct and reasonable indexes
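A quick sketch of both points, assuming a hypothetical user table with a slow lookup on an email column:

```sql
-- Check how the slow query is executed; a full table scan (type: ALL, no key used)
-- usually means a missing index. Table and column names are placeholders.
EXPLAIN SELECT * FROM user WHERE email = 'someone@example.com';

-- Add an index on the filtered column, then re-run EXPLAIN to confirm it is used.
CREATE INDEX idx_user_email ON user (email);
```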

Best free way to store 20 million rows a day?

Daily 20-25 million rows that will be removed at midnight to make room for the next day's data. Can MySQL handle 25 million indexed rows? What would be another good solution?
You give very little information on the context, but sometimes not using a database and instead using a binary/plain text file is just fine and can -- depending on your requirements -- be much more efficient and maintainable. E.g. if it's sensor data, storing it in a binary file with each record at a known offset could be a good solution. Your saying that the data would be deleted every 24h seems to indicate that you might not need some of the properties of a relational database solution such as ACID, replication, integrated backup and so on, so perhaps a flat file approach is just fine?
Our MySQL database has over 300 million rows indexed and we only ever experience problems with complex joins running a little slow - most can be optimized though.
Handling the rows was no problem - the key to our performance was good indexes.
Considering you are dropping the information at midnight, I would also look at MySQL partitioning, which would allow you to drop that part of the table whilst letting the next day's inserts continue if need be.
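A hedged sketch of that approach with daily RANGE partitions; the table, column and partition names are placeholders, and note that the partitioning column must be part of any primary or unique key on the table:

```sql
-- Daily partitions so yesterday's rows can be dropped as a cheap metadata operation.
-- 'readings', 'ts' and the partition names are hypothetical.
ALTER TABLE readings PARTITION BY RANGE (TO_DAYS(ts)) (
  PARTITION p20160117 VALUES LESS THAN (TO_DAYS('2016-01-18')),
  PARTITION p20160118 VALUES LESS THAN (TO_DAYS('2016-01-19')),
  PARTITION pmax      VALUES LESS THAN MAXVALUE
);

-- At midnight, drop yesterday's partition instead of running a huge DELETE:
ALTER TABLE readings DROP PARTITION p20160117;
```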
The issue is not the number of rows itself -- it's what you do with the database. Are you doing only inserts during the day followed by some batch report? Or are you doing thousands of queries per second on the data? Inserts/Updates/Deletes? If you slam enough load at any database platform, you can max it out with a single table and a single row (taking it to the most extreme).
I used MySQL 4.1 with MyISAM (hardly the most modern of anything) on a site with a 40M row user table. It did < 5 ms queries, I think. We were rendering pages in less than 200 ms. However, we had lots and lots of caching set up, so the number of queries wasn't too high. And we were doing simple statements like SELECT * FROM USER WHERE USER_NAME = 'SMITH'.
Can you comment more on your use case?
If you are using Windows, you could do worse than use SqlExpress 2008, which should easily handle that load, depending on how many indexes you are creating on it. So long as you keep < 4GB total db size, it shouldn't be a problem.
From my experience, MySQL tends not to scale well at all. If you must have a free solution for this, I would highly recommend PostgreSQL.
Also (this may or may not be an issue for you), keep in mind that if you're dealing with that much data, the maximum size of a MySQL database is 4 terabytes, if I remember correctly.
I don't think there is a practical limit on the maximum number of rows in MySQL, so if you MUST use MySQL, I think it would work for what you want to do, but personally for a production system I wouldn't recommend it.
As a general solution I'd recommend PostgreSQL too, but depending on your specific needs, other solutions might be better/faster. For example, if you do not need to query your data while it is being written, TokyoCabinet (the table based API / TDB) might be faster and more lightweight/robust.
I haven't looked into them in MySQL, but this sounds like a perfect application for table partitions.
Using the database only as an index and storing the data itself in files would be more effective, because you will be removing it within 24 hours; the process will be faster and it will not burden your server.

MySQL Partitioning / Sharding / Splitting - which way to go?

We have an InnoDB database that is about 70 GB and we expect it to grow to several hundred GB in the next 2 to 3 years. About 60% of the data belongs to a single table. Currently the database is working quite well as we have a server with 64 GB of RAM, so almost the whole database fits into memory, but we're concerned about the future when the amount of data will be considerably larger. Right now we're considering some way of splitting up the tables (especially the one that accounts for the biggest part of the data) and I'm now wondering what would be the best way to do it.
The options I’m currently aware of are
Using MySQL Partitioning that comes with version 5.1
Using some kind of third party library that encapsulates the partitioning of the data (like hibernate shards)
Implementing it ourselves inside our application
Our application is built on J2EE and EJB 2.1 (hopefully we’re switching to EJB 3 some day).
What would you suggest?
EDIT (2011-02-11):
Just an update: Currently the size of the database is 380 GB, the data size of our "big" table is 220 GB and the size of its index is 36 GB. So while the whole table does not fit in memory any more, the index does.
The system is still performing fine (still on the same hardware) and we're still thinking about partitioning the data.
EDIT (2014-06-04):
One more update: The size of the whole database is 1.5 TB, the size of our "big" table is 1.1 TB. We upgraded our server to a 4 processor machine (Intel Xeon E7450) with 128 GB RAM.
The system is still performing fine.
What we're planning to do next is putting our big table on a separate database server (we've already done the necessary changes in our software) while simultaneously upgrading to new hardware with 256 GB RAM.
This setup is supposed to last for two years. Then we will either have to finally start implementing a sharding solution or just buy servers with 1 TB of RAM which should keep us going for some time.
EDIT (2016-01-18):
We have since put our big table in its own database on a separate server. Currently the size of this database is about 1.9 TB; the size of the other database (with all tables except for the "big" one) is 1.1 TB.
Current Hardware setup:
HP ProLiant DL 580
4 x Intel(R) Xeon(R) CPU E7- 4830
256 GB RAM
Performance is fine with this setup.
You will definitely start to run into issues on that 42 GB table once it no longer fits in memory. In fact, as soon as it does not fit in memory anymore, performance will degrade extremely quickly. One way to test is to put that table on another machine with less RAM and see how poor it performs.
First of all, it doesn't matter as much splitting out tables unless you also move some of the tables to a separate physical volume.
This is incorrect. Partitioning (either through the feature in MySQL 5.1, or the same thing using MERGE tables) can provide significant performance benefits even if the tables are on the same drive.
As an example, let's say that you are running SELECT queries on your big table using a date range. If the table is whole, the query will be forced to scan through the entire table (and at that size, even using indexes can be slow). The advantage of partitioning is that your queries will only run on the partitions where it is absolutely necessary. If each partition is 1 GB in size and your query only needs to access 5 partitions to fulfill itself, the combined 5 GB is a lot easier for MySQL to deal with than a monster 42 GB version.
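A minimal sketch of that pruning idea; the table, columns and partition boundaries are hypothetical:

```sql
-- RANGE partitions on a date column; the partition key must be part of the primary key.
CREATE TABLE big_table (
  id      BIGINT NOT NULL,
  created DATE NOT NULL,
  payload VARCHAR(255),
  PRIMARY KEY (id, created)
)
PARTITION BY RANGE (TO_DAYS(created)) (
  PARTITION p2009 VALUES LESS THAN (TO_DAYS('2010-01-01')),
  PARTITION p2010 VALUES LESS THAN (TO_DAYS('2011-01-01')),
  PARTITION pmax  VALUES LESS THAN MAXVALUE
);

-- EXPLAIN PARTITIONS (available since 5.1) shows which partitions a range query actually touches:
EXPLAIN PARTITIONS
SELECT * FROM big_table WHERE created BETWEEN '2010-03-01' AND '2010-03-31';
```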
One thing you need to ask yourself is how you are querying the data. If there is a chance that your queries will only need to access certain chunks of data (i.e. a date range or ID range), partitioning of some kind will prove beneficial.
I've heard that there is still some bugginess with MySQL 5.1 partitioning, particularly related to MySQL choosing the correct key. MERGE tables can provide the same functionality, although they require slightly more overhead.
Hope that helps...good luck!
If you think you're going to be IO/memory bound, I don't think partitioning is going to be helpful. As usual, benchmarking first will help you figure out the best direction. If you don't have spare servers with 64GB of memory kicking around, you can always ask your vendor for a 'demo unit'.
I would lean towards sharding if you don't expect to need aggregate reporting across everything in a single query. I'm assuming you'd shard the whole database and not just your big table: it's best to keep entire entities together. Well, if your model splits nicely, anyway.
This is a great example of what MySQL partitioning can do in a real-life case of huge data flows:
http://web.archive.org/web/20101125025320/http://www.tritux.com/blog/2010/11/19/partitioning-mysql-database-with-high-load-solutions/11/1
Hoping it will be helpful for your case.
A while back at a Microsoft ArcReady event, I saw a presentation on scaling patterns that might be useful to you. You can view the slides for it online.
I would go for MariaDB InnoDB + Partitions (either by key or by date, depending on your queries).
I did this and now I don't have any Database problems anymore.
MySQL can be replaced with MariaDB in seconds...all the database files stay the same.
First of all, it doesn't matter as much splitting out tables unless you also move some of the tables to a separate physical volume.
Secondly, it's not necessarily the table with the largest physical size that you want to move. You may have a much smaller table that gets more activity, while your big table remains fairly constant or only appends data.
Whatever you do, don't implement it yourselves. Let the database system handle it.
What does the big table do?
If you're going to split it, you've got a few options:
- Split it using the database system (don't know much about that)
- Split it by row.
- Split it by column.
Splitting it by row would only be possible if your data can be separated easily into chunks. e.g. Something like Basecamp has multiple accounts which are completely separate. You could keep 50% of the accounts in one table and 50% in a different table on a different machine.
Splitting by column is good for situations where the rows contain large text fields or BLOBs. If you've got a table with (for example) a user image and a huge block of text, you could farm the image out into a completely different table (on a different machine).
You break normalisation here, but I don't think it would cause too many problems.
You would probably want to split that large table eventually. You'll probably want to put it on a separate hard disk, before thinking of a second server. Doing it with MySQL is the most convenient option. If it is capable, then go for it.
BUT
Everything depends on how your database is being used, really. Statistics.

How big can a MySQL database get before performance starts to degrade

At what point does a MySQL database start to lose performance?
Does physical database size matter?
Do number of records matter?
Is any performance degradation linear or exponential?
I have what I believe to be a large database, with roughly 15M records which take up almost 2GB. Based on these numbers, is there any incentive for me to clean the data out, or am I safe to allow it to continue scaling for a few more years?
The physical database size doesn't matter. The number of records doesn't matter.
In my experience the biggest problem that you are going to run in to is not size, but the number of queries you can handle at a time. Most likely you are going to have to move to a master/slave configuration so that the read queries can run against the slaves and the write queries run against the master. However if you are not ready for this yet, you can always tweak your indexes for the queries you are running to speed up the response times. Also there is a lot of tweaking you can do to the network stack and kernel in Linux that will help.
I have had mine get up to 10GB, with only a moderate number of connections and it handled the requests just fine.
I would focus first on your indexes, then have a server admin look at your OS, and if all that doesn't help it might be time to implement a master/slave configuration.
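For reference, a minimal sketch of pointing a replica at a master for such a read/write split (standard MySQL replication); the host, credentials and binlog coordinates are all placeholders:

```sql
-- Run on the slave after restoring a consistent snapshot of the master.
-- Every value below is a placeholder.
CHANGE MASTER TO
  MASTER_HOST     = 'master.example.com',
  MASTER_USER     = 'repl',
  MASTER_PASSWORD = 'secret',
  MASTER_LOG_FILE = 'mysql-bin.000001',
  MASTER_LOG_POS  = 107;
START SLAVE;
```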
In general this is a very subtle issue and not trivial whatsoever. I encourage you to read mysqlperformanceblog.com and High Performance MySQL. I really think there is no general answer for this.
I'm working on a project which has a MySQL database with almost 1 TB of data. The most important scalability factor is RAM. If the indexes of your tables fit into memory and your queries are highly optimized, you can serve a reasonable number of requests with an average machine.
The number of records does matter, depending on what your tables look like. It makes a difference whether you have a lot of varchar fields or only a couple of ints or longs.
The physical size of the database matters as well: think of backups, for instance. Depending on your engine, your physical db files grow but don't shrink, for instance with InnoDB. So deleting a lot of rows doesn't help to shrink your physical files.
There's a lot to this issue and, as in a lot of cases, the devil is in the details.
The database size does matter. If you have more than one table with more than a million records, then performance does indeed start to degrade. The number of records of course affects the performance: MySQL can be slow with large tables. If you hit one million records you will get performance problems if the indices are not set right (for example no indices for fields in "WHERE statements" or "ON conditions" in joins). If you hit 10 million records, you will start to get performance problems even if you have all your indices right. Hardware upgrades - adding more memory and more processor power, especially memory - often help to reduce the most severe problems by increasing the performance again, at least to a certain degree. For example, 37signals went from 32 GB of RAM to 128 GB of RAM for the Basecamp database server.
I'm currently managing a MySQL database on Amazon's cloud infrastructure that has grown to 160 GB. Query performance is fine. What has become a nightmare is backups, restores, adding slaves, or anything else that deals with the whole dataset, or even DDL on large tables. Getting a clean import of a dump file has become problematic. In order to make the process stable enough to automate, various choices needed to be made to prioritize stability over performance. If we ever had to recover from a disaster using a SQL backup, we'd be down for days.
Horizontally scaling SQL is also pretty painful, and in most cases leads to using it in ways you probably did not intend when you chose to put your data in SQL in the first place. Shards, read slaves, multi-master, et al, they are all really shitty solutions that add complexity to everything you ever do with the DB, and not one of them solves the problem; only mitigates it in some ways. I would strongly suggest looking at moving some of your data out of MySQL (or really any SQL) when you start approaching a dataset of a size where these types of things become an issue.
Update: a few years later, and our dataset has grown to about 800 GiB. In addition, we have a single table which is 200+ GiB and a few others in the 50-100 GiB range. Everything I said before holds. It still performs just fine, but the problems of running full dataset operations have become worse.
I would focus first on your indexes, then have a server admin look at your OS, and if all that doesn't help it might be time for a master/slave configuration.
That's true. Another thing that usually works is to just reduce the quantity of data that's repeatedly worked with. If you have "old data" and "new data" and 99% of your queries work with new data, just move all the old data to another table - and don't look at it ;)
-> Have a look at partitioning.
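A minimal sketch of that "old data" split; the orders tables, the created column and the cutoff date are hypothetical:

```sql
-- Move rows older than a cutoff into an archive table, then delete them
-- from the hot table so the day-to-day queries touch far fewer rows.
INSERT INTO orders_archive
SELECT * FROM orders WHERE created < '2015-01-01';

DELETE FROM orders WHERE created < '2015-01-01';
```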
2 GB and about 15M records is a very small database - I've run much bigger ones on a Pentium III (!) and everything has still run pretty fast. If yours is slow it is a database/application design problem, not a MySQL one.
It's kind of pointless to talk about "database performance", "query performance" is a better term here. And the answer is: it depends on the query, data that it operates on, indexes, hardware, etc. You can get an idea of how many rows are going to be scanned and what indexes are going to be used with EXPLAIN syntax.
2GB does not really count as a "large" database - it's more of a medium size.
I was once called upon to look at a MySQL installation that had "stopped working". I discovered that the DB files were residing on a Network Appliance filer mounted with NFS2 and with a maximum file size of 2 GB. And sure enough, the table that had stopped accepting transactions was exactly 2 GB on disk. But with regard to the performance curve, I'm told that it was working like a champ right up until it didn't work at all! This experience always serves for me as a nice reminder that there are always dimensions above and below the one you naturally suspect.
Also watch out for complex joins. Transaction complexity can be a big factor in addition to transaction volume.
Refactoring heavy queries sometimes offers a big performance boost.
Another point to consider is the purpose of the system and how the data is used day to day.
For example, for a system with GPS monitoring of cars, it is not relevant to query the positions of a car from previous months.
Therefore that data can be moved to historical tables for occasional consultation, reducing the execution times of the day-to-day queries.
Performance can degrade in a matter of few thousand rows if database is not designed properly.
If you have proper indexes, use proper engines (don't use MyISAM where multiple DMLs are expected), use partitioning, allocate the right amount of memory for the workload and of course have a good server configuration, MySQL can handle data even in terabytes!
There are always ways to improve the database performance.
It depends on your queries and validation.
For example, I worked with a table of 100,000 drugs which has a generic-name column with more than 15 characters for each drug in that table. I ran a query to compare the generic names of drugs between two tables; the query took several minutes to run. In contrast, if you compare the drugs using the drug index, an id column (as said above), it takes only a few seconds.
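A hedged illustration of the difference; the two drug tables and their columns are hypothetical:

```sql
-- Comparing on a long, unindexed text column forces large scans:
SELECT a.id
FROM   drugs_a a
JOIN   drugs_b b ON a.generic_name = b.generic_name;

-- Joining on an indexed id column does the same comparison far faster:
SELECT a.id
FROM   drugs_a a
JOIN   drugs_b b ON a.drug_id = b.drug_id;
```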
Database size DOES matter in terms of bytes and the table's number of rows. You will notice a huge performance difference between a light database and a blob-filled one. Once my application got stuck because I put binary images inside fields instead of keeping the images in files on disk and putting only the file names in the database. Iterating over a large number of rows, on the other hand, is not free either.
No, it doesn't really matter. MySQL's speed is about 7 million rows per second, so you can scale it quite a bit.
Query performance mainly depends on the number of records it needs to scan; indexes play a big role in that, and index data size is proportional to the number of rows and the number of indexes.
Queries with conditions on indexed fields matching a full value are generally returned within 1 ms, but starts-with, IN, BETWEEN and obviously contains conditions may take more time as there are more records to scan.
You will also face a lot of maintenance issues with DDL: ALTER and DROP will be slow and difficult with more live traffic, even for adding an index or new columns.
Generally it's advisable to split the database into as many clusters as required (500 GB would be a general benchmark; as said by others, it depends on many factors and can vary based on use cases). That way you get better isolation and the independence to scale specific clusters (more suited to B2B cases).