I have this machine: Core 2 CPU 6600, 4GB, 64 bit system, Windows VISTA.
I am designing a system with a table of 10 billion rows. This table has a foreign key to another table, which should contain 10x10 billion rows. Normally, I just insert into the two tables. I don't usually do joins.
I don't need user-facing real time performance. I wonder if mysql can handle this size with stability and reasonable performance.
Thanks a lot
It depends on which engine you are using. You can find additional information in this post:
Maximum number of records in a MySQL database table
In general, I would suggest using an OS other than Vista if you can; MySQL is best tuned for Linux boxes.
I would also suggest running some benchmarks before inserting all the rows.
Look here for more references:
http://dev.mysql.com/doc/refman/5.0/en/information-functions.html#function%5Fbenchmark
The deciding factor here will be what data types you are using in your fields. 10 billion x 10 columns of text fields and image blobs would be orders of magnitude larger than 10 columns of int(2).
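To put rough numbers on that, here is a back-of-the-envelope sketch in Python. The 2 KB average per text/blob value is purely an assumption for illustration; the 4 bytes per INT is MySQL's storage size for that type regardless of display width. Row and index overhead are ignored.

```python
# Rough raw-data estimate: rows x columns x bytes per value.
# Ignores row headers, indexes, and page fill factor.
ROWS = 10_000_000_000
COLUMNS = 10

def table_size_bytes(rows, columns, bytes_per_value):
    return rows * columns * bytes_per_value

def to_terabytes(size_bytes):
    return size_bytes / 10**12

int_table = table_size_bytes(ROWS, COLUMNS, 4)      # INT: 4 bytes each
text_table = table_size_bytes(ROWS, COLUMNS, 2000)  # assumed 2 KB average per value

print("10 INT columns:       %.1f TB" % to_terabytes(int_table))
print("10 text/blob columns: %.1f TB" % to_terabytes(text_table))
```

Under these assumptions the integer-only table is well under a terabyte of raw data, while the text-heavy one is in the hundreds of terabytes.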
I also agree that Vista is asking for trouble with billions of rows. It might work in theory but if you have a large number of clients it will probably crash and burn under load.
I am developing a site and I'm concerned about the performance.
In the current system there are transactions like adding 10,000 rows to a single table. That is not a concern by itself; it takes around 0.6 seconds to insert.
But I am worried about what happens if there are 100,000 concurrent users and 1,000 of them want to add 10,000 rows to a single table at once.
How could this impact the performance compared to a single user? How can I improve these transactions if there is a large amount of traffic like in this situation?
When write speed is mandatory, the way we tackle it is getting quicker hard drives.
You mentioned transactions, that means you need your data durable (D of ACID). This requirement rules out MyISAM storage engine or any type of NoSQL so I'll focus the answer towards what goes on with relational databases.
The way it works is this: each hard drive gives you a fixed number of input/output operations per second (IOPS). Hard drives also have a metric called bandwidth. The metric you are interested in is write speed.
A crude calculation: bandwidth in MB per second divided by IOPS = how much data you can squeeze into one I/O operation.
For mechanical drives, this magic IOPS number is anywhere between 150 and 300, which is quite low. Given their bandwidth of about 100 MB/sec, you get a really small number of writes per second, each carrying well under 1 MB. This is where solid-state drives kick in: their IOPS number starts at about 5,000 (some even go to 80,000), which is awesome for databases.
Connecting these drives in RAID gives you a super quick storage solution. If you are able to squeeze 10,000 inserts into one transaction, the disk will try to squeeze all 10k inserts through a single I/O operation.
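The crude calculation above can be sketched in Python. This follows the simplified model from this answer (one committed transaction costs roughly one I/O operation); the 100-byte row size is an assumption, and real engines add log writes, caching, and group commit on top.

```python
MB = 1_000_000  # bytes

def megabytes_per_io(bandwidth_mb_s, iops):
    # How much data fits into one I/O operation at full bandwidth.
    return bandwidth_mb_s / iops

def rows_per_second(iops, rows_per_txn, row_bytes, bandwidth_mb_s):
    # Simplified model: one transaction = one I/O operation,
    # capped by what the drive's bandwidth can carry.
    by_iops = iops * rows_per_txn
    by_bandwidth = bandwidth_mb_s * MB // row_bytes
    return min(by_iops, by_bandwidth)

# Mechanical drive: 150 IOPS, about 100 MB/s, assumed 100-byte rows.
print(round(megabytes_per_io(100, 150), 2))      # MB per operation
print(rows_per_second(150, 1, 100, 100))         # one row per transaction
print(rows_per_second(150, 10_000, 100, 100))    # 10k rows per transaction
```

With one row per transaction the drive is limited by IOPS; with 10,000 rows per transaction the limit moves to bandwidth, which is the point of batching.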
Another strategy is partitioning your table and having multiple drives where MySQL stores the data.
This is as far as you can go with a single MySQL installation. There are strategies for distributing data to multiple MySQL nodes etc. but I assume that's out of scope of your question.
TL;DR: you need quicker disks.
If you are trying to scale for inserting millions of rows per second, you have bigger problems. That could add up to trillions of rows per month. That's hundreds of terabytes before the end of the month. Do you have a big enough disk farm for that? Can you afford enough SSDs for that?
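A quick sanity check of those figures in Python, assuming one million rows per second sustained for 30 days and an illustrative 100 bytes per row (both numbers are assumptions, not measurements):

```python
SECONDS_PER_DAY = 86_400

def rows_per_month(rows_per_second, days=30):
    return rows_per_second * SECONDS_PER_DAY * days

def terabytes(rows, bytes_per_row):
    return rows * bytes_per_row / 10**12

monthly_rows = rows_per_month(1_000_000)
print(monthly_rows)                  # rows written in 30 days
print(terabytes(monthly_rows, 100))  # TB at an assumed 100 bytes per row
```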
Another thing. With a trillion rows, it is quite challenging to have any indexes other than a simple auto_increment. Without any indexes, how do you plan on accessing the data? A table scan of a trillion rows will take day(s).
Also, you said 100,000 users; you implied that they are connected simultaneously? That, too, is a challenge.
What are the users doing to generate 10K rows all at once? What about the network bandwidth?
Etc. Etc.
If you really have a task like this, Sharding is probably the only solution. And that is in addition to SSDs, RAID, IOPs, etc, etc.
A few things you must consider, from both the software and the hardware side:
- Go for SSD drives to get better I/O.
- A 10 Gb network is good to have if you really have that much traffic.
- Use MySQL 5.6 or above; it brought good performance improvements over previous versions.
- Use bulk inserts instead of one-row-at-a-time inserts, or even better, write all the data to a file and use LOAD DATA INFILE. This can be about 20 times faster than regular inserts.
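As a minimal sketch of the bulk-insert idea, the Python helpers below split rows into batches and build one multi-row INSERT per batch. The table and column names are hypothetical, and the `%s` placeholders follow the parameter style used by common Python MySQL drivers; the actual `cursor.execute(sql, params)` call is left out so the snippet runs without a database.

```python
from itertools import islice

def chunked(rows, size):
    # Yield lists of at most `size` rows.
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

def multi_row_insert(table, columns, batch):
    # Build "INSERT ... VALUES (...), (...), ..." plus a flat parameter list.
    # `table` and `columns` are interpolated directly, so they must be trusted names.
    row_placeholder = "(" + ", ".join(["%s"] * len(columns)) + ")"
    sql = "INSERT INTO {} ({}) VALUES {}".format(
        table, ", ".join(columns), ", ".join([row_placeholder] * len(batch))
    )
    params = [value for row in batch for value in row]
    return sql, params

rows = [(user_id, user_id * 10) for user_id in range(1, 6)]
for batch in chunked(rows, 2):
    sql, params = multi_row_insert("events", ["user_id", "score"], batch)
    print(sql, params)
```

Each batch becomes a single statement, so the server parses and commits once per batch instead of once per row.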
MySQL provides multiple ways to scale out. Which way you go depends on your product requirements.
Folks, I'm a developer of a social game and there are already 700k players in the game, and about 7k new players are registered every day, about 5k players are constantly online.
The DB server is running on a pretty powerful hardware: 16 cores CPU, 24 Gb RAM, RAID-10 with BBU built on 4 SAS disks. I'm using Percona server(patched MySQL-5.1) and currently InnoDB buffer pool is 18Gb(although according to innotop only a few free buffers available). The DB server is performing pretty well(2k QPS, iostat %util is 10-15%, almost always 0 processes in "b" state in vmstat, loadavg is 5-6). However from time to time(every few minutes) I'm getting about 10-100 slow queries(where each may last about 5-6 seconds).
There is one big InnoDB table in the MySQL database which occupies the most space. It has about 300 million rows, and its size is about 20 GB. Of course, this table is gradually growing... I'm starting to worry it's affecting the overall performance of the database in a negative way. In the near future I'll have to do something about it, but I'm not sure what exactly.
Basically question boils down to whether to shard or simply add more RAM. The latter is simpler, of course. Looks like I can add up to 256 Gb RAM. But the question is whether I should invest more time implementing sharding instead since it's more scalable?
Sharding seems reasonable if you need to have all 300m+ rows. It may be a pain to change now but when your table grows and grows there will be a point when no amount of ram will solve your problem. With such massive amounts of data it may be worth using something like couch db as you could store documents of data rather than rows ie 1 document could contain all records for an individual user.
Sounds to me like your main database table could use some normalization. Does all your information belong in that one table, or can you split it out to smaller tables? Normalization may invoke a small performance hit now, but as your table grows, that will be overwhelmed by the extra processing involved in accessing a huge, monolithic table.
I'm getting about 10-100 slow queries(where each may last about 5-6 seconds).
Quote of a comment: Database is properly normalized. The database has many tables, one of them is really huge and has nothing to do with normalization.
Reading this, I would say the problem is in your queries and has nothing to do with your hardware. The average company would dream of the kind of server you have!
If you write bad queries, it doesn't matter how well your tables are normalized; they will be slow.
Maybe this is of some use to you: it's an almost identical question with an answer (database is slow, and so on).
Have you also thought about archiving some data? For example, that table of 300 million rows started at ID 1; is that row still used? If not, why not archive it to another database or table (I would recommend a database)? I also doubt that all 700k users log in every day (respect if they do, but I don't believe it).
You also said 'This table contains player specific items'. What kind of specific items?
Another question: can you post some of your 'slow' queries?
Have you also considered a caching system for data that changes maybe once a month, like gear and other game stuff?
I'm planning to generate a huge amount of data, which I'd like to store in a MySQL database. My current estimations point to four thousand million billion rows in the main table (only two columns, one of them indexed).
Two questions here:
1) is this possible?
and more specifically:
2) Will such table be efficiently usable?
thanks!,
Jaime
Sure, it's possible. Whether or not it's usable will depend on how you use it and how much hardware/memory you have. With a table that large, it would probably make sense to use partitioning as well if that makes sense for the kind of data you are storing.
ETA:
Based on the fact that you only have two columns with one of them being indexed, I'm going to take a wild guess here that this is some kind of key-value store. If that is the case, you might want to look into a specialized key-value store database as well.
It may be possible, MySQL has several table storage engines with differing capabilities. I think the MyISAM storage engine, for instance, has a theoretical data size limit of 256TB, but it's further constrained by the maximum size of a file on your operating system. I doubt it would be usable. I'm almost certain it wouldn't be optimal.
I would definitely look at partitioning this data across multiple tables (probably even multiple DBs on multiple machines) in a way that makes sense for your keys, then federating any search results/totals/etc. you need to. Amongst other things, this allows you to do searches where each partition is searched in parallel (in the multiple servers approach).
I'd also look for a solution that's already done the heavy lifting of partitioning and federating queries. I wonder if Google's AppEngine data store (BigTable) or the Amazon SimpleDB would be useful. They'd both limit what you could do with the data (they are not RDBMS's), but then, the sheer size is going to do that anyway.
You should consider partitioning your data...for example if one of the two columns is a name, separate the rows into 26 tables based on the first letter.
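A minimal Python sketch of that routing rule follows. The `names_` table prefix and the catch-all `other` table are hypothetical; note that first letters are not evenly distributed, so the 26 tables will differ a lot in size.

```python
import string

def table_for_name(name, prefix="names"):
    # Route a row to one of 26 per-letter tables, with a catch-all
    # table for names that do not start with an ASCII letter.
    first = name.strip()[:1].lower()
    if first and first in string.ascii_lowercase:
        return "{}_{}".format(prefix, first)
    return "{}_other".format(prefix)

for example in ["Alice", "bob", "42nd Street", ""]:
    print(repr(example), "->", table_for_name(example))
```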
I created a mysql database with one table that contained well over 2 million rows (imported U.S. census county line data for overlay on a Google map). Another table had slightly under 1 million rows (USGS Tiger location data). This was about 5 years ago.
I didn't really have an issue (once I remembered to create indexes! :) )
4 gigarows is not that big; actually, it is a pretty average load for any database engine today. Even partitioning could be overkill. It should simply work.
Your performance will depend on your HW though.
We have an InnoDB database that is about 70 GB and we expect it to grow to several hundred GB in the next 2 to 3 years. About 60 % of the data belong to a single table. Currently the database is working quite well as we have a server with 64 GB of RAM, so almost the whole database fits into memory, but we’re concerned about the future when the amount of data will be considerably larger. Right now we’re considering some way of splitting up the tables (especially the one that accounts for the biggest part of the data) and I’m now wondering, what would be the best way to do it.
The options I’m currently aware of are
Using MySQL Partitioning that comes with version 5.1
Using some kind of third party library that encapsulates the partitioning of the data (like hibernate shards)
Implementing it ourselves inside our application
Our application is built on J2EE and EJB 2.1 (hopefully we’re switching to EJB 3 some day).
What would you suggest?
EDIT (2011-02-11):
Just an update: Currently the size of the database is 380 GB, the data size of our "big" table is 220 GB and the size of its index is 36 GB. So while the whole table does not fit in memory any more, the index does.
The system is still performing fine (still on the same hardware) and we're still thinking about partitioning the data.
EDIT (2014-06-04):
One more update: The size of the whole database is 1.5 TB, the size of our "big" table is 1.1 TB. We upgraded our server to a 4 processor machine (Intel Xeon E7450) with 128 GB RAM.
The system is still performing fine.
What we're planning to do next is putting our big table on a separate database server (we've already done the necessary changes in our software) while simultaneously upgrading to new hardware with 256 GB RAM.
This setup is supposed to last for two years. Then we will either have to finally start implementing a sharding solution or just buy servers with 1 TB of RAM which should keep us going for some time.
EDIT (2016-01-18):
We have since put our big table in its own database on a separate server. Currently the size of this database is about 1.9 TB, the size of the other database (with all tables except for the "big" one) is 1.1 TB.
Current Hardware setup:
HP ProLiant DL 580
4 x Intel(R) Xeon(R) CPU E7- 4830
256 GB RAM
Performance is fine with this setup.
You will definitely start to run into issues on that 42 GB table once it no longer fits in memory. In fact, as soon as it does not fit in memory anymore, performance will degrade extremely quickly. One way to test is to put that table on another machine with less RAM and see how poor it performs.
First of all, it doesn't matter as much splitting out tables unless you also move some of the tables to a separate physical volume.
This is incorrect. Partitioning (either through the feature in MySQL 5.1, or the same thing using MERGE tables) can provide significant performance benefits even if the tables are on the same drive.
As an example, let's say that you are running SELECT queries on your big table using a date range. If the table is whole, the query will be forced to scan through the entire table (and at that size, even using indexes can be slow). The advantage of partitioning is that your queries will only run on the partitions where it is absolutely necessary. If each partition is 1 GB in size and your query only needs to access 5 partitions in order to fulfill itself, the combined 5 GB table is a lot easier for MySQL to deal with than a monster 42 GB version.
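To illustrate the idea, here is a small Python sketch that works out which partitions a date-range query would have to touch, assuming (hypothetically) one partition per calendar month. It only models the bookkeeping; in MySQL the server does this pruning itself when the query filters on the partitioning column.

```python
from datetime import date

def months_touched(start, end):
    # Return the (year, month) keys of the monthly partitions
    # that overlap the inclusive date range [start, end].
    keys = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        keys.append((year, month))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return keys

touched = months_touched(date(2010, 11, 15), date(2011, 1, 3))
total_partitions = 42  # e.g. 42 partitions of about 1 GB each
print(touched)
print("scans %d of %d partitions" % (len(touched), total_partitions))
```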
One thing you need to ask yourself is how you are querying the data. If there is a chance that your queries will only need to access certain chunks of data (i.e. a date range or ID range), partitioning of some kind will prove beneficial.
I've heard that there is still some bugginess with MySQL 5.1 partitioning, particularly related to MySQL choosing the correct key. MERGE tables can provide the same functionality, although they require slightly more overhead.
Hope that helps...good luck!
If you think you're going to be IO/memory bound, I don't think partitioning is going to be helpful. As usual, benchmarking first will help you figure out the best direction. If you don't have spare servers with 64GB of memory kicking around, you can always ask your vendor for a 'demo unit'.
I would lean towards sharding if you don't expect 1 query aggregate reporting. I'm assuming you'd shard the whole database and not just your big table: it's best to keep entire entities together. Well, if your model splits nicely, anyway.
This is a great real-life example of what MySQL partitioning can do with huge data flows:
http://web.archive.org/web/20101125025320/http://www.tritux.com/blog/2010/11/19/partitioning-mysql-database-with-high-load-solutions/11/1
Hoping it will be helpful for your case.
A while back at a Microsoft ArcReady event, I saw a presentation on scaling patterns that might be useful to you. You can view the slides for it online.
I would go for MariaDB InnoDB + Partitions (either by key or by date, depending on your queries).
I did this and now I don't have any Database problems anymore.
MySQL can be replaced with MariaDB in seconds...all the database files stay the same.
First of all, it doesn't matter as much splitting out tables unless you also move some of the tables to a separate physical volume.
Secondly, it's not necessarily the table with the largest physical size that you want to move. You may have a much smaller table that gets more activity, while your big table remains fairly constant or only appends data.
Whatever you do, don't implement it yourselves. Let the database system handle it.
What does the big table do?
If you're going to split it, you've got a few options:
- Split it using the database system (don't know much about that)
- Split it by row.
- split it by column.
Splitting it by row would only be possible if your data can be separated easily into chunks. e.g. Something like Basecamp has multiple accounts which are completely separate. You could keep 50% of the accounts in one table and 50% in a different table on a different machine.
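A minimal sketch of such a row split in Python, assuming integer account IDs and a simple modulo rule (the rule and the two-shard setup are assumptions for illustration, not how Basecamp does it):

```python
from collections import Counter

def shard_for_account(account_id, shard_count=2):
    # All rows of one account land on the same shard,
    # so queries scoped to an account touch a single machine.
    return account_id % shard_count

distribution = Counter(shard_for_account(account_id) for account_id in range(1000))
print(distribution)
```

With sequential IDs the modulo rule gives the 50/50 split described above. Changing `shard_count` later moves most accounts to a different shard, which is why real deployments often keep an explicit account-to-shard lookup table instead.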
Splitting by Column is good for situations where the row size contains large text fields or BLOBS. If you've got a table with (for example) a user image and a huge block of text, you could farm the image into a completely different table. (on a different machine)
You break normalisation here, but I don't think it would cause too many problems.
You would probably want to split that large table eventually. You'll probably want to put it on a separate hard disk, before thinking of a second server. Doing it with MySQL is the most convenient option. If it is capable, then go for it.
BUT
Everything depends on how your database is being used, really. Statistics.
At what point does a MySQL database start to lose performance?
Does physical database size matter?
Do number of records matter?
Is any performance degradation linear or exponential?
I have what I believe to be a large database, with roughly 15M records which take up almost 2GB. Based on these numbers, is there any incentive for me to clean the data out, or am I safe to allow it to continue scaling for a few more years?
The physical database size doesn't matter. The number of records don't matter.
In my experience the biggest problem that you are going to run in to is not size, but the number of queries you can handle at a time. Most likely you are going to have to move to a master/slave configuration so that the read queries can run against the slaves and the write queries run against the master. However if you are not ready for this yet, you can always tweak your indexes for the queries you are running to speed up the response times. Also there is a lot of tweaking you can do to the network stack and kernel in Linux that will help.
I have had mine get up to 10GB, with only a moderate number of connections and it handled the requests just fine.
I would focus first on your indexes, then have a server admin look at your OS, and if all that doesn't help it might be time to implement a master/slave configuration.
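The master/slave split can be sketched at the application level like this. It is a deliberately naive Python example with hypothetical host names: anything that starts with SELECT goes to a replica in round-robin order, everything else goes to the master. Replication lag and reads inside transactions (for example SELECT ... FOR UPDATE) are not handled here and would need to go to the master.

```python
from itertools import cycle

class ReadWriteRouter:
    def __init__(self, master, slaves):
        self.master = master
        self._slaves = cycle(slaves)  # simple round-robin over read replicas

    def route(self, sql):
        # Naive classification by the first keyword of the statement.
        first_word = sql.lstrip().split(None, 1)[0].upper()
        if first_word == "SELECT":
            return next(self._slaves)
        return self.master

router = ReadWriteRouter("db-master", ["db-slave-1", "db-slave-2"])
print(router.route("SELECT * FROM players WHERE id = 1"))
print(router.route("UPDATE players SET score = 10 WHERE id = 1"))
```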
In general this is a very subtle issue and not trivial whatsoever. I encourage you to read mysqlperformanceblog.com and High Performance MySQL. I really think there is no general answer for this.
I'm working on a project which has a MySQL database with almost 1TB of data. The most important scalability factor is RAM. If the indexes of your tables fit into memory and your queries are highly optimized, you can serve a reasonable amount of requests with a average machine.
The number of records does matter, depending on what your tables look like. It makes a difference whether you have a lot of varchar fields or only a couple of ints or longs.
The physical size of the database matters as well: think of backups, for instance. Depending on your engine, your physical db files only grow and never shrink, for instance with InnoDB. So deleting a lot of rows doesn't help to shrink your physical files.
There's a lot to this issue, and as in a lot of cases, the devil is in the details.
The database size does matter. If you have more than one table with more than a million records, then performance starts indeed to degrade. The number of records does of course affect the performance: MySQL can be slow with large tables. If you hit one million records you will get performance problems if the indices are not set right (for example no indices for fields in "WHERE statements" or "ON conditions" in joins). If you hit 10 million records, you will start to get performance problems even if you have all your indices right. Hardware upgrades - adding more memory and more processor power, especially memory - often help to reduce the most severe problems by increasing the performance again, at least to a certain degree. For example 37 signals went from 32 GB RAM to 128GB of RAM for the Basecamp database server.
I'm currently managing a MySQL database on Amazon's cloud infrastructure that has grown to 160 GB. Query performance is fine. What has become a nightmare is backups, restores, adding slaves, or anything else that deals with the whole dataset, or even DDL on large tables. Getting a clean import of a dump file has become problematic. In order to make the process stable enough to automate, various choices needed to be made to prioritize stability over performance. If we ever had to recover from a disaster using a SQL backup, we'd be down for days.
Horizontally scaling SQL is also pretty painful, and in most cases leads to using it in ways you probably did not intend when you chose to put your data in SQL in the first place. Shards, read slaves, multi-master, et al, they are all really shitty solutions that add complexity to everything you ever do with the DB, and not one of them solves the problem; only mitigates it in some ways. I would strongly suggest looking at moving some of your data out of MySQL (or really any SQL) when you start approaching a dataset of a size where these types of things become an issue.
Update: a few years later, and our dataset has grown to about 800 GiB. In addition, we have a single table which is 200+ GiB and a few others in the 50-100 GiB range. Everything I said before holds. It still performs just fine, but the problems of running full dataset operations have become worse.
I would focus first on your indexes, then have a server admin look at your OS, and if all that doesn't help it might be time for a master/slave configuration.
That's true. Another thing that usually works is to just reduce the quantity of data that's repeatedly worked with. If you have "old data" and "new data" and 99% of your queries work with new data, just move all the old data to another table - and don't look at it ;)
-> Have a look at partitioning.
2GB and about 15M records is a very small database - I've run much bigger ones on a pentium III(!) and everything has still run pretty fast.. If yours is slow it is a database/application design problem, not a mysql one.
It's kind of pointless to talk about "database performance", "query performance" is a better term here. And the answer is: it depends on the query, data that it operates on, indexes, hardware, etc. You can get an idea of how many rows are going to be scanned and what indexes are going to be used with EXPLAIN syntax.
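To show what inspecting a plan looks like without needing a MySQL server, the sketch below uses SQLite's `EXPLAIN QUERY PLAN` from Python's standard library as a stand-in. This is not MySQL's EXPLAIN and the output format differs, but the workflow is the same: check the plan, add an index, check again. The table and index names are hypothetical.

```python
import sqlite3

def plan(conn, sql, params=()):
    # The fourth column of each EXPLAIN QUERY PLAN row is the readable detail.
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players (id INTEGER PRIMARY KEY, name TEXT, score INTEGER)")

query = "SELECT score FROM players WHERE name = ?"
before = plan(conn, query, ("bob",))   # no index on name: full table scan

conn.execute("CREATE INDEX idx_players_name ON players (name)")
after = plan(conn, query, ("bob",))    # now the index can be used

print("before:", before)
print("after: ", after)
```

In MySQL the equivalent step is prefixing the query with EXPLAIN and reading the `key` and `rows` columns of the result.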
2GB does not really count as a "large" database - it's more of a medium size.
I once was called upon to look at a mysql that had "stopped working". I discovered that the DB files were residing on a Network Appliance filer mounted with NFS2 and with a maximum file size of 2GB. And sure enough, the table that had stopped accepting transactions was exactly 2GB on disk. But with regards to the performance curve I'm told that it was working like a champ right up until it didn't work at all! This experience always serves for me as a nice reminder that there're always dimensions above and below the one you naturally suspect.
Also watch out for complex joins. Transaction complexity can be a big factor in addition to transaction volume.
Refactoring heavy queries sometimes offers a big performance boost.
A point to consider is also the purpose of the system and the data in the day to day.
For example, in a system that monitors cars by GPS, the positions of a car from previous months are not relevant to day-to-day queries.
That data can therefore be moved to separate historical tables, where it remains available for occasional lookups, which reduces the execution time of the day-to-day queries.
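A minimal sketch of such an archiving step, using Python's built-in sqlite3 module as a stand-in for MySQL so that it runs anywhere. The table and column names are hypothetical; the same INSERT ... SELECT followed by DELETE pattern applies in MySQL, ideally in batches on a large table.

```python
import sqlite3

def archive_old_rows(conn, cutoff):
    # Copy rows older than `cutoff` to the history table, then delete them.
    # `with conn` commits both statements together or rolls both back.
    with conn:
        conn.execute(
            "INSERT INTO positions_history SELECT * FROM positions WHERE recorded_at < ?",
            (cutoff,),
        )
        moved = conn.execute(
            "DELETE FROM positions WHERE recorded_at < ?", (cutoff,)
        ).rowcount
    return moved

conn = sqlite3.connect(":memory:")
schema = "(car_id INTEGER, recorded_at TEXT, lat REAL, lon REAL)"
conn.execute("CREATE TABLE positions " + schema)
conn.execute("CREATE TABLE positions_history " + schema)
conn.executemany(
    "INSERT INTO positions VALUES (?, ?, ?, ?)",
    [
        (1, "2023-01-10", 40.0, -3.0),
        (1, "2023-06-01", 40.1, -3.1),
        (2, "2023-06-02", 41.0, -2.0),
    ],
)
conn.commit()

moved = archive_old_rows(conn, "2023-03-01")  # ISO dates compare correctly as text
live = conn.execute("SELECT COUNT(*) FROM positions").fetchone()[0]
archived = conn.execute("SELECT COUNT(*) FROM positions_history").fetchone()[0]
print(moved, live, archived)
```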
Performance can degrade with as few as a few thousand rows if the database is not designed properly.
If you have proper indexes, use proper engines (don't use MyISAM where multiple DMLs are expected), use partitioning, allocate correct memory depending on the use and of course have good server configuration, MySQL can handle data even in terabytes!
There are always ways to improve the database performance.
It depends on your query and validation.
For example, I worked with a table of 100,000 drugs that has a generic-name column with more than 15 characters for each drug. I wrote a query to compare the generic names of drugs between two tables, and it took several minutes to run. If you compare the same drugs using an indexed id column (as said above), it takes only a few seconds.
Database size DOES matter in terms of bytes and table's rows number. You will notice a huge performance difference between a light database and a blob filled one. Once my application got stuck because I put binary images inside fields instead of keeping images in files on the disk and putting only file names in database. Iterating a large number of rows on the other hand is not for free.
No, it doesn't really matter. MySQL's speed is about 7 million rows per second, so you can scale it quite a bit.
Query performance mainly depends on the number of records the query needs to scan. Indexes play a big role in that, and index size is proportional to the number of rows and the number of indexes.
Queries with an exact-match condition on an indexed field generally return in about 1 ms, but starts-with, IN, BETWEEN, and especially contains conditions may take longer as there are more records to scan.
You will also face a lot of maintenance issues with DDL: ALTER and DROP will be slow and difficult under heavy live traffic, even for adding an index or a new column.
It is generally advisable to split the database into as many clusters as required (500 GB per cluster would be a general benchmark; as others have said, it depends on many factors and varies by use case). That gives better isolation and the freedom to scale specific clusters independently (this suits B2B best).