How many rows are 'too many' for a MySQL table? [duplicate] - mysql

This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
How many rows in a database are TOO MANY?
I am building the database scheme for an application that will have users, and each user will have many rows in relation tables such as 'favorites'.
Each user could have thousands of favorites, and there could be thousands of registered users (over time).
Given that users are never deleted, because that would either leave other entities orphaned, or have them deleted too (which isn't desired), and therefore these tables will keep growing forever, I was wondering if the resulting tables could be too big (eg: 1kk rows), and I should worry about this and do something like mark old and inactive users as deleted and remove the relations that only affect them (such as the favorites and other preferences).
Is this the way to go? Or can mysql easily handle 1kk rows in a table? Is there a known limit? Or is it fully hardware-dependant?

I agree with klennepette and Brian - with a couple of caveats.
If your data is inherently relational, and subject to queries that work well with SQL, you should be able to scale to hundreds of millions of records without exotic hardware requirements.
You will need to invest in indexing, query tuning, and making the occasional sacrifice to the relational model in the interests of speed. You should at least nod to performance when designing tables – preferring integers to strings for keys, for instance.
If, however, you have document-centric requirements, need free text search, or have lots of hierarchical relationships, you may need to look again.
If you need ACID transactions, you may run into scalability issues earlier than if you don't care about transactions (though this is still unlikely to affect you in practice); if you have long-running or complex transactions, your scalability decreases quite rapidly.
I'd recommend building the project from the ground up with scalability requirements in mind. What I've done in the past is set up a test environment populated with millions of records (I used DBMonster, but not sure if that's still around), and regularly test work-in-progress code against this database using load testing tools like Jmeter.

Millions of rows is fine, tens of millions of rows is fine - provided you've got an even remotely decent server, i.e. a few Gbs of RAM, plenty disk space. You will need to learn about indexes for fast retrieval, but in terms of MySQL being able to handle it, no problem.

Here's an example that demonstrates what can be achived using a well designed/normalised innodb schema which takes advantage of innodb's clustered primary key indexes (not available with myisam). The example is based on a forum with threads and has 500 million rows and query runtimes of 0.02 seconds while under load.
MySQL and NoSQL: Help me to choose the right one

It is mostly hardware dependant, but having that said MySQL scales pretty well.
I wouldn't worry too much about table size, if it does become an issue later on you can always use partitioning to ease the stress.

Related

Limit before sharding or partitioning a table

I am new to the database system design. After reading many articles, I am really getting confused on what is the limit till which we should have 1 table and not go for sharding or partitioning. I know that it is really hard to provide generic answer and things depend on factors like
size of row
kind of data (strings, blobs, etc)
active queries number
what kind of queries
indexes
read heavy/write heavy
the latency expected
But when someone ask that
what will you do if you have 1 billion data and million rows getting added everyday. The latency needs to be less than 5 ms for 4 read, 1 write and 2 update queries over such a big database, etc.
what will your choice if you have only 10 million rows but the updates and reads are high. The number of new rows added are not significant. High consistency and low latency are the requirement.
If the rows are less that a million and the row size is increasing by thousands then the choice is simple. But it gets trickier when the choice involves for million or billion of rows.
Note: I have not mentioned the latency number in my question. Please
answer according to the latency number which is acceptable to you. Also, we are talking about structured data.
I am not sure but I can add 3 specific questions:
Lets say that you choose sql database for amazon or any ecommerce order management system. The order numbers are increasing everyday by million. There are already 1 billion record. Now, assuming that there is no archival of data. There are high read queries more than thousand queries per second. And there are writes as well. The read:write ratio is 100:1
Let's take an example which smaller number now. Lets say that you choose a sql database for abc or any ecommerce order management system. The order numbers are increasing everyday by thousands. There are already 10 million record. Now, assuming that there is no archival of data. There are high read queries more than ten thousand queries per second. And there are writes as well. The read:write ratio is 10:1
3rd example: Free goodies distribution. We have 10 million goodies to be distributed. 1 goodies per user. High consistency and low latency is the aim. Lets assume that 20 million users already waiting for this free distribution and once the time starts, all of them will try to get the free goodies.
Note: In the whole question, the assumption is that we will go with
SQL solutions. Also, please neglect if the provided usecase doesn't make sense logically. The aim is to get the knowledge in terms of numbers.
Can someone please help with what are the benchmarks. Any practical numbers from the project you are currently working in which can tell that for such a big database with these many queries, this is the latency observed,. Anything which can help me justify the choice for the number of tables for the certain number of queries for particular latency.
Some answers for MySQL. Since all databases are limited by disk space, network latency, etc., other engines may be similar.
A "point query" (fetching one row using a suitable index) takes milliseconds regardless of the number of rows.
It is possible to write a SELECT that will take hours, maybe even days, to run. So you need to understand whether the queries are pathological like this. (I assume this is an example of high "latency".)
"Sharding" is needed when you cannot sustain the number of writes needed on a single server.
Heavy reads can be scaled 'infinitely' by using replication and sending the reads to Replicas.
PARTITIONing (especially in MySQL) has very few uses. More details: Partition
INDEXes are very important for performance.
For Data Warehouse apps, building and maintaining "Summary tables" is vital for performance at scale. (Some other engines have some built-in tools for such.)
INSERTing one million rows per day is not a problem. (Of course, there are schema designs that could make this a problem.) Rules of Thumb: 100/second is probably not a problem; 1000/sec is probably possible; it gets harder after that. More on high speed ingestion
Network latency is mostly determined by how close the client and server are. It takes over 200ms to reach the other side of the earth. On the other hand, if the client and server are in the same building, latency is under 1ms. On another hand, if you are referring to how long it takes too run a query, then here are a couple of Rules of Thumb: 10ms for a simple query that needs to hit an HDD disk; 1ms for SSD.
UUIDs and hashes are very bad for performance if the data is too big to be cached in RAM.
I have not said anything about read:write ratio because I prefer to judge reads and writes independently.
"Ten thousand reads per second" is hard to achieve; I suggest that very few apps really need such. Or they can find better ways to achieve the same goals. How fast can one user issue a query? Maybe one per second? How many users can be connected and active at the same time? Hundreds.
(my opinion) Most benchmarks are useless. Some benchmarks can show that one system is twice as fast as another. So what? Some benchmarks say that when you have more than a few hundred active connections, throughput stagnates and latency heads toward infinity. So what. After you have an app running for some time, capturing the actual queries is perhaps the best benchmark. But it still has limited uses.
Almost always a single table is better than splitting up the table (multiple tables; PARTITIONing; sharding). If you have a concrete example, we can discuss the pros and cons of the table design.
Size of row and kinds of data -- Large columns (TEXT/BLOB/JSON) are stored "off-record", thereby leading to [potentially] an extra disk hit. Disk hits are the most costly part of any query.
Active queries -- After a few dozen, the queries stumble over each other. (Think about a grocery store with lots of shoppers pushing carts -- with "too many" shoppers, each takes a long time to finish.)
When you get into large databases, they fall into a few different types; each with somewhat different characteristics.
Data Warehouse (sensors, logs, etc) -- appending to 'end' of the table; Summary Tables for efficient 'reports'; huge "Fact" table (optionally archived in chunks); certain "dimension tables".
Search (products, web pages, etc) -- EAV is problematical; FULLTEXT is often useful.
Banking, order processing -- This gets heavy into the ACID features and the need for crafting transactions.
Media (images and videos) -- How to store the bulky objects while making searching (etc) reasonably fast.
'Find nearest' -- Need a 2D index, either SPATIAL or some of the techniques here

Separate single database into active and archive

I have a single database, most of the tables are connected in some way.
It consists of over 500000 records.
I need to implement live search, but number of records bothers me.
Database will grow and live search in millions of records will sure cause problems. So i need to move old records (let's assume date field is present) to another database and only keep fresh ones available for search.
Old records won't be used anymore, that's for sure, but i still need to keep them.
Any ideas how that could be implemented in MySQL?
500,000 records really is not very many records.
Before you start taking drastic actions (such as limiting the ability of users to seamlessly see all the data at once), you should consider basics for improving performance:
Indexes to improve standard query performance.
Partitioning to limit the portions of tables that need to be accessed.
Full text indexing to improve match() queries.
Optimization of SQL queries.
In general, these are sufficient for databases that are orders of magnitude larger than the volume you are dealing with.
These may not apply to your particular situation; but you should exhaust the lower-hanging fruit for performance optimization before changing your physical data model for a problem that might never occur.

Max number of rows per MySQL (Billions) NDBCLUSTER?

I have a database that needs to be able to scale up to billions of entries or rows.
Can this many rows be supported per single table? Is it advisable?
Would a single table be split over several clusters if used in a NDBCLUSTER.
Other load balancing techniques?
What are some advisable methods of deploying such a database?
What are best practices for a database with this many rows to gain more performance?
Would MySQL do, or should I look elsewhere.
We have tables with 22 million rows, and there's no bottleneck in sight. At least none that enough RAM can't fix. Generally there is no easy yes or no. It depends on the nature of your data, table engine, etc..
If you disclosed more info what kind of data it is that you're saving, then a response could be more detailed.
My only general advice for large databases is, that I'd exceed the hardware options before going into replication and/or sharding (for performance reasons -- keeping a slave for backup is a different story). You also need to know your index-fu and the obvious switches/options in order to tune the database server.
More info, if you can tell me what kind of data you're working with.

What techniques are most effective for dealing with millions of records?

I once had a MySQL database table containing 25 million records, which made even a simple COUNT(*) query takes minute to execute. I ended up making partitions, separating them into a couple tables. What i'm asking is, is there any pattern or design techniques to handle this kind of problem (huge number of records)? Is MSSQL or Oracle better in handling lots of records?
P.S
the COUNT(*) problem stated above is just an example case, in reality the app does crud functionality and some aggregate query (for reporting), but nothing really complicated. It's just that it takes quite a while (minutes) to execute some these queries because of the table volume
See Why MySQL could be slow with large tables and COUNT(*) vs COUNT(col)
Make sure you have an index on the column you're counting. If your server has plenty of RAM, consider increasing MySQL's buffer size. Make sure your disks are configured correctly -- DMA enabled, not sharing a drive or cable with the swap partition, etc.
What you're asking with "SELECT COUNT(*)" is not easy.
In MySQL, the MyISAM non-transactional engine optimises this by keeping a record count, so SELECT COUNT(*) will be very quick.
However, if you're using a transactional engine, SELECT COUNT(*) is basically saying:
Exactly how many records exist in this table in my transaction ?
To do this, the engine needs to scan the entire table; it probably knows roughly how many records exist in the table already, but to get an exact answer for a particular transaction, it needs a scan. This isn't going to be fast using MySQL innodb, it's not going to be fast in Oracle, or anything else. The whole table MUST be read (excluding things stored separately by the engine, such as BLOBs)
Having the whole table in ram will make it a bit faster, but it's still not going to be fast.
If your application relies on frequent, accurate counts, you may want to make a summary table which is updated by a trigger or some other means.
If your application relies on frequent, less accurate counts, you could maintain summary data with a scheduled task (which may impact performance of other operations less).
Many performance issues around large tables relate to indexing problems, or lack of indexing all together. I'd definitely make sure you are familiar with indexing techniques and the specifics of the database you plan to use.
With regards to your slow count(*) on the huge table, i would assume you were using the InnoDB table type in MySQL. I have some tables with over 100 million records using MyISAM under MySQL and the count(*) is very quick.
With regards to MySQL in particular, there are even slight indexing differences between InnoDB and MyISAM tables which are the two most commonly used table types. It's worth understanding the pros and cons of each and how to use them.
What kind of access to the data do you need? I've used HBase (based on Google's BigTable) loaded with a vast amount of data (~30 million rows) as the backend for an application which could return results within a matter of seconds. However, it's not really appropriate if you need "real time" access - i.e. to power a website. Its column-oriented nature is also a fairly radical change if you're used to row-oriented DBMS.
Is count(*) on the whole table actually something you do a lot?
InnoDB will have to do a full table scan to count the rows, which is obviously a major performance issue if counting all of them is something you actually want to do. But that doesn't mean that other operations on the table will be slow.
With the right indexes, MySQL will be very fast at retrieving data from tables much bigger than that. The problem with indexes is that they can hurt insert speeds, particularly for large tables as insert performance drops dramatically once the space required for the index reaches a certain threshold - presumably the size it will keep in memory. But if you only need modest insert speeds, MySQL should do everything you need.
Any other database will have similar tradeoffs between retrieve speed and insert speed; they may or may not be better for your application. But I would look first at getting the indexes right, and maybe rewriting your queries, before you try other databases. For what it's worth, we picked MySQL originally because we found it performed best.
Note that MyISAM tables in MySQL store the total size of the table. They maintain this because it's useful to the optimiser in some cases, but a side effect is that count(*) on the whole table is really fast. That doesn't necessarily mean they're faster than InnoDB at anything else.
I answered a similar question in This Stackoverflow Posting in some detail, describing the merits of the architectures of both systems. To some extent it was done from a data warehousing point of view but many of the differences also matter on transactional systems.
However, 25 million rows is not a VLDB and if you are having performance problems you should look to indexing and tuning. You don't need to go to Oracle to support a 25 million row database - you've got about 3 orders of magnitude to go before you're truly in VLDB territory.
You are asking for a books worth of answer and I therefore propose you get a good book on databases. There are many.
To get you started, here are some database basics:
First, you need a great data model based not just on what data you need to store but on usage patterns. Good database performance starts with good schema design.
Second, place indicies on columns based upon expected lookup AND update needs as update performance is often overlooked.
Third, don't put functions in where clauses if at all possible.
Fourth, use an -ahem- RDBMS engine that is of quality design. I would respectfully submit that while it has improved greatly in the recent past, mysql does not qualify. (Apologies to those who wish to argue it has finally made the grade in recent times.) There is no longer any need to choose between high-price and quality; Postgres (aka PostgreSql) is available open-source and is truly fantastic - and has all the plug-ins available to meet your needs.
Finally, learn what you are asking a database engine to do - gain some insight into internals - so you can better judge what kinds of things are expensive and why.
I'm going to second #Mark Baker, and say that you need to build indices on your tables.
For other queries than the one you selected, you should also be aware that using constructs such as IN() is faster than a series of OR statements in the query. There are lots of little steps you can take to speed-up individual queries.
Indexing is key to performance with this number of records, but how you write the queries can make a big difference as well. Specific performance tuning methods vary by database, but in general, avoid returning more records or fields than you actually need, make sure all join fields are indexed (as well as common where clause fields), avoid cursors (although I think this is less true in Oracle than SQL Server I don't know about mySQL).
Hardware can also be a bottleneck especially if you are running things besides the database server on the same machine.
Performance tuning is a very technical subject and can't really be answered well in a format like this. I suggest you get a performance tuning book and read it. Here is a link to one for mySQL
http://www.amazon.com/High-Performance-MySQL-Optimization-Replication/dp/0596101716

How big can a MySQL database get before performance starts to degrade

At what point does a MySQL database start to lose performance?
Does physical database size matter?
Do number of records matter?
Is any performance degradation linear or exponential?
I have what I believe to be a large database, with roughly 15M records which take up almost 2GB. Based on these numbers, is there any incentive for me to clean the data out, or am I safe to allow it to continue scaling for a few more years?
The physical database size doesn't matter. The number of records don't matter.
In my experience the biggest problem that you are going to run in to is not size, but the number of queries you can handle at a time. Most likely you are going to have to move to a master/slave configuration so that the read queries can run against the slaves and the write queries run against the master. However if you are not ready for this yet, you can always tweak your indexes for the queries you are running to speed up the response times. Also there is a lot of tweaking you can do to the network stack and kernel in Linux that will help.
I have had mine get up to 10GB, with only a moderate number of connections and it handled the requests just fine.
I would focus first on your indexes, then have a server admin look at your OS, and if all that doesn't help it might be time to implement a master/slave configuration.
In general this is a very subtle issue and not trivial whatsoever. I encourage you to read mysqlperformanceblog.com and High Performance MySQL. I really think there is no general answer for this.
I'm working on a project which has a MySQL database with almost 1TB of data. The most important scalability factor is RAM. If the indexes of your tables fit into memory and your queries are highly optimized, you can serve a reasonable amount of requests with a average machine.
The number of records do matter, depending of how your tables look like. It's a difference to have a lot of varchar fields or only a couple of ints or longs.
The physical size of the database matters as well: think of backups, for instance. Depending on your engine, your physical db files on grow, but don't shrink, for instance with innodb. So deleting a lot of rows, doesn't help to shrink your physical files.
There's a lot to this issues and as in a lot of cases the devil is in the details.
The database size does matter. If you have more than one table with more than a million records, then performance starts indeed to degrade. The number of records does of course affect the performance: MySQL can be slow with large tables. If you hit one million records you will get performance problems if the indices are not set right (for example no indices for fields in "WHERE statements" or "ON conditions" in joins). If you hit 10 million records, you will start to get performance problems even if you have all your indices right. Hardware upgrades - adding more memory and more processor power, especially memory - often help to reduce the most severe problems by increasing the performance again, at least to a certain degree. For example 37 signals went from 32 GB RAM to 128GB of RAM for the Basecamp database server.
I'm currently managing a MySQL database on Amazon's cloud infrastructure that has grown to 160 GB. Query performance is fine. What has become a nightmare is backups, restores, adding slaves, or anything else that deals with the whole dataset, or even DDL on large tables. Getting a clean import of a dump file has become problematic. In order to make the process stable enough to automate, various choices needed to be made to prioritize stability over performance. If we ever had to recover from a disaster using a SQL backup, we'd be down for days.
Horizontally scaling SQL is also pretty painful, and in most cases leads to using it in ways you probably did not intend when you chose to put your data in SQL in the first place. Shards, read slaves, multi-master, et al, they are all really shitty solutions that add complexity to everything you ever do with the DB, and not one of them solves the problem; only mitigates it in some ways. I would strongly suggest looking at moving some of your data out of MySQL (or really any SQL) when you start approaching a dataset of a size where these types of things become an issue.
Update: a few years later, and our dataset has grown to about 800 GiB. In addition, we have a single table which is 200+ GiB and a few others in the 50-100 GiB range. Everything I said before holds. It still performs just fine, but the problems of running full dataset operations have become worse.
I would focus first on your indexes, than have a server admin look at your OS, and if all that doesn't help it might be time for a master/slave configuration.
That's true. Another thing that usually works is to just reduce the quantity of data that's repeatedly worked with. If you have "old data" and "new data" and 99% of your queries work with new data, just move all the old data to another table - and don't look at it ;)
-> Have a look at partitioning.
2GB and about 15M records is a very small database - I've run much bigger ones on a pentium III(!) and everything has still run pretty fast.. If yours is slow it is a database/application design problem, not a mysql one.
It's kind of pointless to talk about "database performance", "query performance" is a better term here. And the answer is: it depends on the query, data that it operates on, indexes, hardware, etc. You can get an idea of how many rows are going to be scanned and what indexes are going to be used with EXPLAIN syntax.
2GB does not really count as a "large" database - it's more of a medium size.
I once was called upon to look at a mysql that had "stopped working". I discovered that the DB files were residing on a Network Appliance filer mounted with NFS2 and with a maximum file size of 2GB. And sure enough, the table that had stopped accepting transactions was exactly 2GB on disk. But with regards to the performance curve I'm told that it was working like a champ right up until it didn't work at all! This experience always serves for me as a nice reminder that there're always dimensions above and below the one you naturally suspect.
Also watch out for complex joins. Transaction complexity can be a big factor in addition to transaction volume.
Refactoring heavy queries sometimes offers a big performance boost.
A point to consider is also the purpose of the system and the data in the day to day.
For example, for a system with GPS monitoring of cars is not relevant query data from the positions of the car in previous months.
Therefore the data can be passed to other historical tables for possible consultation and reduce the execution times of the day to day queries.
Performance can degrade in a matter of few thousand rows if database is not designed properly.
If you have proper indexes, use proper engines (don't use MyISAM where multiple DMLs are expected), use partitioning, allocate correct memory depending on the use and of course have good server configuration, MySQL can handle data even in terabytes!
There are always ways to improve the database performance.
It depends on your query and validation.
For example, i worked with a table of 100 000 drugs which has a column generic name where it has more than 15 characters for each drug in that table .I put a query to compare the generic name of drugs between two tables.The query takes more minutes to run.The Same,if you compare the drugs using the drug index,using an id column (as said above), it takes only few seconds.
Database size DOES matter in terms of bytes and table's rows number. You will notice a huge performance difference between a light database and a blob filled one. Once my application got stuck because I put binary images inside fields instead of keeping images in files on the disk and putting only file names in database. Iterating a large number of rows on the other hand is not for free.
No it doesnt really matter. The MySQL speed is about 7 Million rows per second. So you can scale it quite a bit
Query performance mainly depends on the number of records it needs to scan, indexes plays a high role in it and index data size is proportional to number of rows and number of indexes.
Queries with indexed field conditions along with full value would be returned in 1ms generally, but starts_with, IN, Between, obviously contains conditions might take more time with more records to scan.
Also you will face lot of maintenance issues with DDL, like ALTER, DROP will be slow and difficult with more live traffic even for adding a index or new columns.
Generally its advisable to cluster the Database into as many clusters as required (500GB would be a general benchmark, as said by others it depends on many factors and can vary based on use cases) that way it gives better isolation and gives independence to scale specific clusters (more suited in case of B2B)