Although I currently do not have such a table, I'm interested in learning how someone would scale an individual table in MySQL that might have, say, 20 million users. Is this something you would use sharding for? What are some strategies one might use to make an individual table of this magnitude "scalable"?
20M records is generally considered "small". Depending on the size of records and the kind of queries performed, you are likely to get very good performance on the lowliest of servers.
Almost all servers can keep such a database in memory. Let's say a record takes 1024 bytes, including indexes. That is quite a large record, yet 20M rows is still only about 20 GB, which fits comfortably within the RAM of a modest server.
While your database fits in RAM, queries are likely to be very fast.
But in any case, you need to consider what the access patterns are.
Do you have
Very high write rates - more than 100 transactions per second?
Lots of hard queries / reports?
If the answer to both of these is "no", you probably need no special equipment at all.
Certainly you don't want to shard. It massively complicates your application and will require a LOT of developer time which is better spent on features (which you can actually sell to customers).
In order to improve performance with big data, in approximate order of preference, you want to:
Buy better hardware (within reason)
Reduce the amount of data you need to store
Use horizontal partitioning (see the sketch after this list)
Use vertical partitioning / functional partitioning
Get a better database engine which can use existing hardware more efficiently (possible examples: Infobright, Tokutek)
Shard (you really don't want to do this!)
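As an illustration of the horizontal partitioning option above, here is a minimal MySQL sketch; the orders table, its columns, and the yearly ranges are assumptions for illustration, not something taken from the question:

    -- Hypothetical table split by year; queries that filter on order_date only
    -- read the relevant partitions ("partition pruning").
    CREATE TABLE orders (
        id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
        order_date DATE NOT NULL,
        user_id    INT UNSIGNED NOT NULL,
        total      DECIMAL(10,2) NOT NULL,
        PRIMARY KEY (id, order_date)   -- the partitioning column must be part of the PK
    ) ENGINE=InnoDB
    PARTITION BY RANGE (YEAR(order_date)) (
        PARTITION p2022 VALUES LESS THAN (2023),
        PARTITION p2023 VALUES LESS THAN (2024),
        PARTITION pmax  VALUES LESS THAN MAXVALUE
    );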
I am new to database system design. After reading many articles, I am really getting confused about the limit up to which we should have one table and not go for sharding or partitioning. I know that it is really hard to provide a generic answer and that things depend on factors like
size of row
kind of data (strings, blobs, etc)
number of active queries
kinds of queries
indexes
read heavy/write heavy
the expected latency
But what do you do when someone asks:
what will you do if you have 1 billion rows and a million rows being added every day? The latency needs to be less than 5 ms for 4 read, 1 write and 2 update queries over such a big database, etc.
what will your choice be if you have only 10 million rows, but the update and read rates are high? The number of new rows added is not significant. High consistency and low latency are the requirements.
If there are fewer than a million rows and the row count is growing by thousands, the choice is simple. But it gets trickier when the choice involves millions or billions of rows.
Note: I have not mentioned a latency number in my question. Please answer according to whatever latency number is acceptable to you. Also, we are talking about structured data.
I am not sure, but I can add 3 specific questions:
Let's say that you choose a SQL database for Amazon or any e-commerce order management system. The number of orders is increasing every day by a million, and there are already 1 billion records. Assume that there is no archival of data. There are heavy reads, more than a thousand queries per second, and there are writes as well. The read:write ratio is 100:1.
Let's take an example with smaller numbers now. Say you choose a SQL database for abc or any e-commerce order management system. The number of orders is increasing every day by thousands, and there are already 10 million records. Assume that there is no archival of data. There are heavy reads, more than ten thousand queries per second, and there are writes as well. The read:write ratio is 10:1.
3rd example: free goodies distribution. We have 10 million goodies to be distributed, 1 goody per user. High consistency and low latency are the aim. Let's assume that 20 million users are already waiting for this free distribution, and once the time starts, all of them will try to get the free goodies.
Note: In the whole question, the assumption is that we will go with SQL solutions. Also, please ignore it if the provided use case doesn't make sense logically. The aim is to get knowledge in terms of numbers.
Can someone please help with benchmarks? Any practical numbers from a project you are currently working on, along the lines of "for such a big database with this many queries, this is the latency observed", would help. Anything that can help me justify the choice of a number of tables for a certain number of queries at a particular latency.
Some answers for MySQL. Since all databases are limited by disk space, network latency, etc., other engines may be similar.
A "point query" (fetching one row using a suitable index) takes milliseconds regardless of the number of rows.
It is possible to write a SELECT that will take hours, maybe even days, to run. So you need to understand whether the queries are pathological like this. (I assume this is an example of high "latency".)
"Sharding" is needed when you cannot sustain the number of writes needed on a single server.
Heavy reads can be scaled 'infinitely' by using replication and sending the reads to Replicas.
PARTITIONing (especially in MySQL) has very few uses. More details: Partition
INDEXes are very important for performance.
For Data Warehouse apps, building and maintaining "Summary tables" is vital for performance at scale; see the sketch after these points. (Some other engines have some built-in tools for such.)
INSERTing one million rows per day is not a problem. (Of course, there are schema designs that could make this a problem.) Rules of Thumb: 100/second is probably not a problem; 1000/sec is probably possible; it gets harder after that. More on high speed ingestion
Network latency is mostly determined by how close the client and server are. It takes over 200ms to reach the other side of the earth. On the other hand, if the client and server are in the same building, latency is under 1ms. On another hand, if you are referring to how long it takes to run a query, then here are a couple of Rules of Thumb: 10ms for a simple query that needs to hit an HDD disk; 1ms for SSD.
UUIDs and hashes are very bad for performance if the data is too big to be cached in RAM.
I have not said anything about read:write ratio because I prefer to judge reads and writes independently.
"Ten thousand reads per second" is hard to achieve; I suggest that very few apps really need such. Or they can find better ways to achieve the same goals. How fast can one user issue a query? Maybe one per second? How many users can be connected and active at the same time? Hundreds.
(my opinion) Most benchmarks are useless. Some benchmarks can show that one system is twice as fast as another. So what? Some benchmarks say that when you have more than a few hundred active connections, throughput stagnates and latency heads toward infinity. So what. After you have an app running for some time, capturing the actual queries is perhaps the best benchmark. But it still has limited uses.
Almost always a single table is better than splitting up the table (multiple tables; PARTITIONing; sharding). If you have a concrete example, we can discuss the pros and cons of the table design.
Size of row and kinds of data -- Large columns (TEXT/BLOB/JSON) are stored "off-record", thereby leading to [potentially] an extra disk hit. Disk hits are the most costly part of any query.
Active queries -- After a few dozen, the queries stumble over each other. (Think about a grocery store with lots of shoppers pushing carts -- with "too many" shoppers, each takes a long time to finish.)
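To make the Summary Table point concrete, here is a minimal sketch, assuming a hypothetical page_hits fact table and a hits_daily summary table (none of these names come from the questions above):

    -- Raw "Fact" table: rows are only appended.
    CREATE TABLE page_hits (
        hit_time DATETIME NOT NULL,
        page_id  INT UNSIGNED NOT NULL,
        INDEX (hit_time)
    ) ENGINE=InnoDB;

    -- Small summary table that reports read instead of scanning the fact table.
    CREATE TABLE hits_daily (
        day     DATE NOT NULL,
        page_id INT UNSIGNED NOT NULL,
        hits    BIGINT UNSIGNED NOT NULL,
        PRIMARY KEY (day, page_id)
    ) ENGINE=InnoDB;

    -- Periodically fold yesterday's raw rows into the summary.
    INSERT INTO hits_daily (day, page_id, hits)
    SELECT DATE(hit_time), page_id, COUNT(*)
    FROM   page_hits
    WHERE  hit_time >= CURDATE() - INTERVAL 1 DAY
      AND  hit_time <  CURDATE()
    GROUP  BY DATE(hit_time), page_id
    ON DUPLICATE KEY UPDATE hits = VALUES(hits);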
When you get into large databases, they fall into a few different types; each with somewhat different characteristics.
Data Warehouse (sensors, logs, etc) -- appending to 'end' of the table; Summary Tables for efficient 'reports'; huge "Fact" table (optionally archived in chunks); certain "dimension tables".
Search (products, web pages, etc) -- EAV is problematical; FULLTEXT is often useful.
Banking, order processing -- This gets heavy into the ACID features and the need for crafting transactions.
Media (images and videos) -- How to store the bulky objects while making searching (etc) reasonably fast.
'Find nearest' -- Need a 2D index, either SPATIAL or some of the techniques here
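For the "find nearest" case, a rough sketch of the SPATIAL approach might look like the following; the places table is hypothetical, and the coordinates use the default planar SRID to keep the example simple:

    -- InnoDB supports SPATIAL indexes as of MySQL 5.7; the column must be NOT NULL.
    CREATE TABLE places (
        id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        pt POINT NOT NULL,
        SPATIAL INDEX (pt)
    ) ENGINE=InnoDB;

    -- Find candidates inside a bounding box; the SPATIAL index prunes the search,
    -- and the application can then rank the few survivors by exact distance.
    SELECT id
    FROM   places
    WHERE  MBRContains(
               ST_GeomFromText('POLYGON((10 10, 10 20, 20 20, 20 10, 10 10))'),
               pt);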
My company has a MySQL server used by a team of analysts (usually 3-4 at a time). Lately the queries have slowed down, with some of them taking days, for a database with tables of up to 1 billion rows (10^9 records).
Server main features: Linux OS, 64 GB of memory, 3 terabytes of hard drive.
We know nothing of fine tuning, so any tool/rule of thumb to find out what is causing the trouble or at least to narrow it down, would be welcome.
Going to Workbench Studio > Table Inspector, I found these key values for the DB that we use the most:
DB size: ~500 GB
Largest table size: ~80 GB
Index length (for largest table): ~230 GB. This index relies on 6 fields.
Almost no MyISAM tables, all InnoDB
Ideally I would like to fine tune the server (better), the DB (worse), or both (in the future), in the simplest possible way, to speed it up.
My questions:
Are these values (500, 80, 230 GB) normal and manageable for a medium size server?
Is it normal to have indexes of this size (230 GB), way larger than the table itself?
What parameters/strategy can be tweaked to fix this? I'm thinking memory logs, or buying server RAM, but happy to investigate any sensible answers.
Many thanks.
If you're managing a MySQL instance of this scale, it would be worth your time to read High Performance MySQL which is the best book on MySQL tuning. I strongly recommend you get this book and read it.
Your InnoDB buffer pool is probably still at its default size, not taking advantage of the RAM on your Linux system. It doesn't matter how much RAM you have if you haven't configured MySQL to use it!
There are other important tuning parameters too. MySQL 5.7 Performance Tuning Immediately After Installation is a great introduction to the most important tuning options.
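As a sketch of the buffer pool point above (the 45 GB figure is only an assumption for a dedicated 64 GB box; tune it to your own workload):

    -- Check the current buffer pool size; the default is often only 128 MB.
    SELECT @@innodb_buffer_pool_size / 1024 / 1024 AS buffer_pool_mb;

    -- MySQL 5.7+ can resize the buffer pool online; a common rule of thumb on a
    -- dedicated database server is roughly 70% of RAM.
    SET GLOBAL innodb_buffer_pool_size = 45 * 1024 * 1024 * 1024;

    -- Persist the setting in my.cnf (innodb_buffer_pool_size = 45G) so it
    -- survives a restart.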
Indexes can be larger than the table itself. The factor of nearly 3 to 1 is unusual, but not necessarily bad. It depends on what indexes you need, and there's no way to know that unless you consider the queries you need to run against this data.
I did a presentation How to Design Indexes, Really a few years ago (it's just as relevant to current versions of MySQL). Here's the video: https://www.youtube.com/watch?v=ELR7-RdU9XU
Here's the order you want to check things:
1) Tune your indexes. Pick a commonly-used slow query and analyze it. Learn about EXPLAIN ANALYZE so that you can tell if your query is using indexes properly. It is entirely possible that your tables are not indexed correctly, and your days-long queries might run in minutes. Literally. Without proper indexes, your queries will be doing full table scans in order to do joins, and with billions of rows, that's going to be very, very slow.
A good introduction to indexes is at http://use-the-index-luke.com/ but there are zillions of books and articles on the topic.
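A hedged sketch of that workflow, with made-up table and column names (EXPLAIN ANALYZE requires MySQL 8.0.18+; plain EXPLAIN shows the estimated plan on older versions):

    -- Inspect how the slow query is actually executed.
    EXPLAIN ANALYZE
    SELECT o.customer_id, SUM(o.total)
    FROM   orders o
    JOIN   customers c ON c.id = o.customer_id
    WHERE  o.created_at >= '2023-01-01'
    GROUP  BY o.customer_id;

    -- If the plan shows a full table scan on orders, a composite index that
    -- matches the WHERE and GROUP BY columns may turn hours into minutes.
    ALTER TABLE orders
        ADD INDEX idx_created_customer_total (created_at, customer_id, total);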
1a) Repeat #1 with other slow queries. See if you can improve them. If you've worked on a number of slow queries and you're not able to speed them up, then proceed to server tuning.
2) Tune your server. Bill Karwin's links will be helpful there.
3) Look at increasing hardware/RAM. This should only be a last resort.
Spend time with #1. It is likely to return the best bang for the buck. There is much you can do to improve things without spending a dime. You'll also learn how to write better queries and create better indexes and prevent these problems in the future.
Also: Listen to Bill Karwin and his knowledge. He is an Expert with a capital E.
In a survey of 600 rather random tables (a few were much bigger than yours), your 230GB:80GB ratio would be at about the 99th percentile. Please provide SHOW CREATE TABLE so we can discuss whether you are "doing something wrong", or it is simply an extreme situation. (Rarely is a 6-column index advisable. And if it is a single index adding up to 230GB, something is 'wrong'.)
I've seen bigger tables run fine in smaller machines. If you are doing mostly "point queries", there is virtually no size limitation. If you are using UUIDs, you are screwed. That is, it really depends on the data, the queries, the schema, the phase of the moon, your karma, etc.
A cross-join can easily get to a trillion things to do. A join with eq_ref is often not much slower than a query with no joins.
"You can't tune your way out of a performance problem." "Throwing hardware at a performance problem either wastes money, or delays the inevitable." Instead, let's see the "queries that are slowing down", together with EXPLAIN SELECT ... and SHOW CREATE TABLE.
Is this a Data Warehouse application? Do you have Summary Tables?
Here is my Cookbook on creating indexes. But it might be faster if you show us your code.
And I can provide another Tuning Analysis.
EXPLAIN SELECT ..... is a critical part of information needed to investigate your request for assistance.
SHOW CREATE TABLE for each table involved would also be helpful.
At this point in time, neither are visible in the data available from user......
I will try to answer your question but keep in mind that I am no MySQL expert.
1) It is quite a large DB with a large table, but nothing a fairly sized server couldn't handle. It really depends on the workload you have.
2) An index size greater than the table itself is interesting, but it is probably the combined size of all indexes on that table. In that case it is completely normal.
3) 64 GB of RAM in your server means that there will probably be a lot of disk operations going on, and that will definitely slow you down. So adding some memory will surely help. Maybe check how the server behaves while the query is running with iotop, and compare it with information from top to see if the server is waiting on disks.
In a User table I have more than 1 million records, so how can I manage this using MySQL and Symfony 1.4 to make performance better, so that it gives quick output?
To significantly improve the performance of a well-designed system, all you can do is increase the resources. Typically, these days, the cheapest way to do this is to distribute the task.
For example, a slow thing in an RDBMS is reading from and writing to storage (typically RDBMS systems start out I/O bound; that is, they mostly wait for data to be read from or written to storage).
So, to offset this, very commonly the RDBMS will allow you to split a table across multiple HDDs, effectively multiplying the I/O performance (an approach similar to RAID 0).
Adding more hard disks increases the performance. This goes on up to the maximum I/O that your system can support (either because the system cannot push more data through its circuits, or because it needs to crunch the numbers a bit as it fetches them and so becomes CPU bound; optimally you would be utilising both).
After that you have to start multiplying the systems, distributing the data across database nodes. For this to work, either the RDBMS must support it or there should be an application layer that coordinates distributing the tasks and merging the results, but normally things will still scale.
I would say that with 512 systems you could have all trillion (10^12) records effectively cached and achieve relatively nice performance. But really you should specify what kind of performance you are looking for - there is a difference between full text searches on tera-records and running mostly simple fetches and updates. Also, for certain work 500ms (or even more) is considered good performance, while for other work it would be horrible.
At first: there's a big difference between 1 trillion and 1 million.
To your performance problems: show us the query that's running slow; without seeing it, it's hard to tell what's wrong with it. What you could try:
use EXPLAIN to get more information about your slow queries and see whether they're using your indexes (and if not, why not?) - see the sketch after this list
use correct and reasonable indexes
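A minimal sketch of that check, using made-up table and column names:

    -- Does this query use an index? Look at the `type` and `key` columns of
    -- the EXPLAIN output: `type: ALL, key: NULL` means a full table scan.
    EXPLAIN SELECT * FROM users WHERE email = 'someone@example.com';

    -- Adding an index on the filtered column turns the scan into a fast lookup.
    ALTER TABLE users ADD INDEX idx_email (email);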
Consider an indexed MySQL table with 7 columns, being constantly queried and written to. What is the advisable number of rows that this table should be allowed to contain before the performance would be improved by splitting the data off into other tables?
Whether or not you would get a performance gain by partitioning the data depends on the data and the queries you will run on it. You can store many millions of rows in a table, and with good indexes and well-designed queries it will still be super-fast. Only consider partitioning if you are already confident that your indexes and queries are as good as they can be, as it can be more trouble than it's worth.
There's no magic number, but there's a few things that affect performance in particular:
Index Cardinality: don't bother indexing a column that has only 2 or 3 distinct values (like an ENUM). On a large table, the query optimizer will ignore these indexes.
There's a trade off between writes and indexes. The more indexes you have, the longer writes take. Don't just index every column. Analyze your queries and see which columns need to be indexed for your app.
Disk I/O and memory play an important role. If you can fit your whole table into memory, you take disk I/O out of the equation (once the table is cached, anyway). My guess is that you'll see a big performance change when your table is too big to buffer in memory.
Consider partitioning your servers based on use. If your transactional system is reading/writing single rows, you can probably buy yourself some time by replicating the data to a read only server for aggregate reporting.
As you probably know, table performance changes based on the data size. Keep an eye on your table/queries. You'll know when it's time for a change.
MySQL 5 has partitioning built in, and it is very nice. What's nice is that you can define how your table should be split up. For instance, if you query mostly by a userid you can partition your table by userid, or if you're querying by dates, do it by date. What's nice about this is that MySQL will know exactly which partition to search through to find your values. The downside is that if you're searching on a field that doesn't define your partition, it's going to scan through every partition, which could possibly decrease performance.
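A minimal sketch of the userid flavour, with a hypothetical table (note that in MySQL the partitioning column must be part of every unique key, including the primary key):

    CREATE TABLE user_events (
        user_id    INT UNSIGNED NOT NULL,
        event_id   BIGINT UNSIGNED NOT NULL,
        created_at DATETIME NOT NULL,
        PRIMARY KEY (user_id, event_id)
    ) ENGINE=InnoDB
    PARTITION BY HASH (user_id) PARTITIONS 16;

    -- A query that filters on user_id only touches one partition:
    SELECT * FROM user_events WHERE user_id = 12345;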
While after the fact you could point to the table size at which performance became a problem, I don't think you can predict it, and certainly not from the information given on a web site such as this!
Some questions you might usefully ask yourself:
Is performance currently acceptable?
How is performance measured - is there a metric?
How do we recognise unacceptable performance?
Do we measure performance in any way that might allow us to forecast a problem?
Are all our queries using an efficient index?
Have we simulated extreme loads and volumes on the system?
Using the MyISAM engine, you'll run into a 2GB hard limit on table size unless you change the default.
Don't ever apply an optimisation if you don't think it's needed. Ideally this should be determined by testing (as others have alluded).
Horizontal or vertical partitioning can improve performance but also complicate your application. Don't do it unless you're sure that you need it AND it will definitely help.
The 2G data MyISAM file size is only a default and can be changed at table creation time (or later by an ALTER, but it needs to rebuild the table). It doesn't apply to other engines (e.g. InnoDB).
Actually this is a good question for performance. Have you read Jay Pipes? There isn't a specific number of rows but there is a specific page size for reads and there can be good reasons for vertical partitioning.
Check out his kung fu presentation and have a look through his posts. I'm sure you'll find that he's written some useful advice on this.
Are you using MyISAM? Are you planning to store more than a couple of gigabytes? Watch out for MAX_ROWS and AVG_ROW_LENGTH.
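For MyISAM only, those options look roughly like this (the table is hypothetical; the numbers just raise the internal row-pointer size so the data file can grow past the old defaults):

    CREATE TABLE big_log (
        id  BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        msg VARCHAR(255)
    ) ENGINE=MyISAM
      MAX_ROWS = 1000000000
      AVG_ROW_LENGTH = 200;

    -- Or later, at the cost of rebuilding the table:
    ALTER TABLE big_log MAX_ROWS = 1000000000 AVG_ROW_LENGTH = 200;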
Jeremy Zawodny has an excellent write-up on how to solve this problem.
At what point does a MySQL database start to lose performance?
Does physical database size matter?
Do number of records matter?
Is any performance degradation linear or exponential?
I have what I believe to be a large database, with roughly 15M records which take up almost 2GB. Based on these numbers, is there any incentive for me to clean the data out, or am I safe to allow it to continue scaling for a few more years?
The physical database size doesn't matter. The number of records doesn't matter.
In my experience the biggest problem that you are going to run in to is not size, but the number of queries you can handle at a time. Most likely you are going to have to move to a master/slave configuration so that the read queries can run against the slaves and the write queries run against the master. However if you are not ready for this yet, you can always tweak your indexes for the queries you are running to speed up the response times. Also there is a lot of tweaking you can do to the network stack and kernel in Linux that will help.
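For reference, pointing a replica at the master uses the classic binlog-position syntax below (hostnames and credentials are placeholders; recent MySQL versions spell it CHANGE REPLICATION SOURCE TO / START REPLICA):

    -- Run on the replica:
    CHANGE MASTER TO
        MASTER_HOST     = 'db-primary.example.com',
        MASTER_USER     = 'repl',
        MASTER_PASSWORD = 'secret',
        MASTER_LOG_FILE = 'mysql-bin.000001',
        MASTER_LOG_POS  = 4;
    START SLAVE;

    -- The application then sends its read queries to replicas and its writes
    -- to the master.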
I have had mine get up to 10GB, with only a moderate number of connections and it handled the requests just fine.
I would focus first on your indexes, then have a server admin look at your OS, and if all that doesn't help it might be time to implement a master/slave configuration.
In general this is a very subtle issue and not trivial whatsoever. I encourage you to read mysqlperformanceblog.com and High Performance MySQL. I really think there is no general answer for this.
I'm working on a project which has a MySQL database with almost 1TB of data. The most important scalability factor is RAM. If the indexes of your tables fit into memory and your queries are highly optimized, you can serve a reasonable amount of requests with an average machine.
The number of records does matter, depending on what your tables look like. It makes a difference whether you have a lot of varchar fields or only a couple of ints or longs.
The physical size of the database matters as well: think of backups, for instance. Depending on your engine, your physical db files grow but don't shrink, for instance with InnoDB. So deleting a lot of rows doesn't help to shrink your physical files.
There's a lot to these issues, and as in a lot of cases, the devil is in the details.
The database size does matter. If you have more than one table with more than a million records, then performance starts indeed to degrade. The number of records does of course affect the performance: MySQL can be slow with large tables. If you hit one million records you will get performance problems if the indices are not set right (for example no indices for fields in "WHERE statements" or "ON conditions" in joins). If you hit 10 million records, you will start to get performance problems even if you have all your indices right. Hardware upgrades - adding more memory and more processor power, especially memory - often help to reduce the most severe problems by increasing the performance again, at least to a certain degree. For example 37 signals went from 32 GB RAM to 128GB of RAM for the Basecamp database server.
I'm currently managing a MySQL database on Amazon's cloud infrastructure that has grown to 160 GB. Query performance is fine. What has become a nightmare is backups, restores, adding slaves, or anything else that deals with the whole dataset, or even DDL on large tables. Getting a clean import of a dump file has become problematic. In order to make the process stable enough to automate, various choices needed to be made to prioritize stability over performance. If we ever had to recover from a disaster using a SQL backup, we'd be down for days.
Horizontally scaling SQL is also pretty painful, and in most cases leads to using it in ways you probably did not intend when you chose to put your data in SQL in the first place. Shards, read slaves, multi-master, et al, they are all really shitty solutions that add complexity to everything you ever do with the DB, and not one of them solves the problem; only mitigates it in some ways. I would strongly suggest looking at moving some of your data out of MySQL (or really any SQL) when you start approaching a dataset of a size where these types of things become an issue.
Update: a few years later, and our dataset has grown to about 800 GiB. In addition, we have a single table which is 200+ GiB and a few others in the 50-100 GiB range. Everything I said before holds. It still performs just fine, but the problems of running full dataset operations have become worse.
I would focus first on your indexes, then have a server admin look at your OS, and if all that doesn't help it might be time for a master/slave configuration.
That's true. Another thing that usually works is to just reduce the quantity of data that's repeatedly worked with. If you have "old data" and "new data" and 99% of your queries work with new data, just move all the old data to another table - and don't look at it ;)
-> Have a look at partitioning.
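A minimal archiving sketch along those lines, with an assumed orders table, a hypothetical orders_archive copy of its structure, and an arbitrary cutoff date:

    -- Copy the old rows into the archive table...
    INSERT INTO orders_archive
    SELECT * FROM orders
    WHERE  created_at < '2015-01-01';

    -- ...then delete them from the hot table in modest batches to avoid
    -- holding locks for too long (repeat until no rows are affected).
    DELETE FROM orders
    WHERE  created_at < '2015-01-01'
    LIMIT  10000;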
2GB and about 15M records is a very small database - I've run much bigger ones on a Pentium III(!) and everything has still run pretty fast. If yours is slow, it is a database/application design problem, not a MySQL one.
It's kind of pointless to talk about "database performance", "query performance" is a better term here. And the answer is: it depends on the query, data that it operates on, indexes, hardware, etc. You can get an idea of how many rows are going to be scanned and what indexes are going to be used with EXPLAIN syntax.
2GB does not really count as a "large" database - it's more of a medium size.
I was once called upon to look at a MySQL server that had "stopped working". I discovered that the DB files were residing on a Network Appliance filer mounted with NFS2, which has a maximum file size of 2GB. And sure enough, the table that had stopped accepting transactions was exactly 2GB on disk. But with regards to the performance curve, I'm told that it was working like a champ right up until it didn't work at all! This experience always serves for me as a nice reminder that there are always dimensions above and below the one you naturally suspect.
Also watch out for complex joins. Transaction complexity can be a big factor in addition to transaction volume.
Refactoring heavy queries sometimes offers a big performance boost.
Another point to consider is the purpose of the system and the data in day-to-day use.
For example, in a system with GPS monitoring of cars, it is not relevant to query the positions of a car from previous months.
Therefore that data can be moved to historical tables for occasional consultation, reducing the execution times of the day-to-day queries.
Performance can degrade within a matter of a few thousand rows if the database is not designed properly.
If you have proper indexes, use the proper engines (don't use MyISAM where multiple DMLs are expected), use partitioning, allocate the right amount of memory depending on the use, and of course have a good server configuration, MySQL can handle data even in terabytes!
There are always ways to improve the database performance.
It depends on your query and validation.
For example, I worked with a table of 100,000 drugs which has a generic name column with more than 15 characters for each drug in that table. I wrote a query to compare the generic names of drugs between two tables; the query took many minutes to run. If instead you compare the drugs using the drug index, an id column (as said above), it takes only a few seconds.
Database size DOES matter, in terms of bytes and a table's row count. You will notice a huge performance difference between a light database and a blob-filled one. My application once got stuck because I put binary images inside fields instead of keeping the images in files on disk and putting only the file names in the database. Iterating over a large number of rows, on the other hand, is not free either.
No, it doesn't really matter. MySQL can scan on the order of 7 million rows per second, so you can scale it quite a bit.
Query performance mainly depends on the number of records it needs to scan; indexes play a big role in that, and index data size is proportional to the number of rows and the number of indexes.
Queries with equality conditions on an indexed field (matching the full value) generally return in about 1 ms, but starts-with, IN, BETWEEN, and especially contains conditions may take more time as there are more records to scan.
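To illustrate the difference, with a hypothetical products table indexed on name: a left-anchored prefix can use the B-tree index, while a leading wildcard cannot.

    -- Can use INDEX(name): the prefix narrows the index range scanned.
    SELECT * FROM products WHERE name LIKE 'wid%';

    -- Cannot use the index for filtering: a leading wildcard forces a scan.
    SELECT * FROM products WHERE name LIKE '%widget%';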
Also, you will face a lot of maintenance issues with DDL: ALTER and DROP will be slow and difficult with more live traffic, even for adding an index or new columns.
Generally it's advisable to split the database into as many clusters as required (500 GB would be a general benchmark; as said by others, it depends on many factors and can vary based on use cases). That way you get better isolation and the independence to scale specific clusters (more suited to B2B cases).