How to scale MySQL with multiple machines? - mysql

I have a web app running LAMP. We recently have an increase in load and is now looking at solutions to scale. Scaling apache is pretty easy we are just going to have multiple multiple machines hosting it and round robin the incoming traffic.
However, each instance of apache will talk with MySQL and eventually MySQL will be overloaded. How to scale MySQL across multiple machines in this setup? I have already looked at this but specifically we need the updates from the DB available immediately so I don't think replication is a good strategy here? Also hopefully this can be done with minimal code change.
PS. We have around a 1:1 read-write ratio.

There're only two strategies: replication and sharding. Replication comes often in place when you have less write and much read traffic, so you can redirect the reads to many slaves, with the pitfall of lots of replication traffic with the time and a probability for inconsitency.
With sharding you shard your database tables across multiple machines (called functional sharding), which makes especially joins much harder. If this doenst fit anymore you also need to shard you rows across multiple machines, but this is no fun and depends a sharding layer implemented between you application and the database.
Document oriented databases or column stores do this work for you, but they are currently optimized for OLAP not for OLTP.

Depends on the application backend (i.e. how the PKs, transactions and insert IDs are handled), you might consider MASTER-MASTER replication with different auto_increment setups. This can be tricky and needs to be thoroughly tested but it can work.
Also, in new MySQL 5.6 there is a GTID (Global Transaction Identifier) that generally helps a lot in keeping the replication in sync, especially in this scenario.

You should take a look at MySQL Performance Blog. Maybe you'll find something useful.

Well... good luck scaling all those writes to a real large scale. The database engine becomes the bottleneck, too many locks and buffers mgmt and stuff...
The only way I found that really works is scale out, sharding, unfortunately sharding is not provided for MySQL "out of the box" (like in some NoSQLs such as Mongo). ScaleBase (disclaimer: I work there) is a maker of a complete scale-out solution an "automatic sharding machine" if you like. ScaleBae analyzes your data and SQL stream, splits the data across DB nodes, route commands and aggregates results in runtime – so you won’t have to!

Related

Sharding on MySQL vs PostgreSQL

It seems like most large companies that have to shard their databases choose MySQL over PostgreSQL. What are the major advantages that MySQL has over PostgreSQL when it comes to distributed database? I don't see any major downside to Postgres that will prevent a successful implementation of sharding at the application level, but the sheer number of companies that choose MySQL over Postgres is giving me pause and making me wonder if I'm missing something.
PARTITIONing involves a single server; Sharding involves many servers. They solve (or fail to solve) different problems. Partitioning provides very few use cases to justify its existence; sharding provides write scaling at the cost of complexity.
MySQL's has no built-in sharding capability. There are 3rd party packages that assist with such, but there is still a large burden on the DBA. (See Spider and various Proxy servers.)
So, I see no reason why Postgres (or any other RDBMS) could not be sharded. After all, you do most of the work; the RDBMS sits on multiple machines not realizing that there are siblings with other chunks of the data.
(Disclaimer: I am very familiar with MySQL, and not familiar with Postgres.)

what's the difference between mysql master/slave mode and master/master mode? [duplicate]

I've heard about two kind of database architectures.
master-master
master-slave
Isn't the master-master more suitable for today's web cause it's like Git, every unit has the whole set of data and if one goes down, it doesn't quite matter.
Master-slave reminds me of SVN (which I don't like) where you have one central unit that handles thing.
Questions:
What are the pros and cons of each?
If you want to have a local database in your mobile phone like iPhone, which one is more appropriate?
Is the choice of one of these a critical factor to consider thoroughly?
While researching the various database architectures as well. I have compiled a good bit of information that might be relevant to someone else researching in the future. I came across
Master-Slave Replication
Master-Master Replication
MySQL Cluster
I have decided to settle for using MySQL Cluster for my use case. However please see below for the various pros and cons that I have compiled
1. Master-Slave Replication
Pros
Analytic applications can read from the slave(s) without impacting the master
Backups of the entire database of relatively no impact on the master
Slaves can be taken offline and sync back to the master without any downtime
Cons
In the instance of a failure, a slave has to be promoted to master to take over its place. No automatic failover
Downtime and possibly loss of data when a master fails
All writes also have to be made to the master in a master-slave design
Each additional slave add some load to the master since the binary log have to be read and data copied to each slave
Application might have to be restarted
2. Master-Master Replication
Pros
Applications can read from both masters
Distributes write load across both master nodes
Simple, automatic and quick failover
Cons
Loosely consistent
Not as simple as master-slave to configure and deploy
3. MySQL Cluster
The new kid in town based on MySQL cluster design. MySQL cluster was developed with high availability and scalability in mind and is the ideal solution to be used for environments that require no downtime, high avalability and horizontal scalability.
See MySQL Cluster 101 for more information
Pros
(High Avalability) No single point of failure
Very high throughput
99.99% uptime
Auto-Sharding
Real-Time Responsiveness
On-Line Operations (Schema changes etc)
Distributed writes
Cons
See known limitations
You can visit for my Blog full breakdown including architecture diagrams that goes into further details about the 3 mentioned architectures.
We're trading off availability, consistency and complexity. To address the last question first: Does this matter? Yes very much! The choices concerning how your data is to be managed is absolutely fundamental, and there's no "Best Practice" dodging the decisions. You need to understand your particular requirements.
There's a fundamental tension:
One copy: consistency is easy, but if it happens to be down everybody is out of the water, and if people are remote then may pay horrid communication costs. Bring portable devices, which may need to operate disconnected, into the picture and one copy won't cut it.
Master Slave: consistency is not too difficult because each piece of data has exactly one owning master. But then what do you do if you can't see that master, some kind of postponed work is needed.
Master-Master: well if you can make it work then it seems to offer everything, no single point of failure, everyone can work all the time. The trouble with this is that it is very hard to preserve absolute consistency. See the wikipedia article for more.
Wikipedia seems to have a nice summary of the advantages and disadvantages
Advantages
If one master fails, other masters will continue to update the
database.
Masters can be located in several physical sites i.e.
distributed across the network.
Disadvantages
Most multi-master replication systems are only loosely consistent,
i.e. lazy and asynchronous, violating ACID properties.
Eager replication systems are complex and introduce some
communication latency.
Issues such as conflict resolution can become intractable as
the number of nodes involved rises and the required latency decreases.

MySQL - using multiple DB's with identical schema instead of one large DB

I am helping a customer migrate a PHP/MySQL application to AWS.
One issue we have encountered is that they have architected this app to use a huge number of databases. They create a new DB (with identical schema) for each user. They expect to have tens of thousands of users.
I don't know MySQL very well, but this setup does not seem at all good to me. My only guess is that the developers did this so they could avoid having tables with huge amounts of data. However I can only think of drawbacks (maintaining this system will be a nightmare, very difficult to extend, difficult to scale, etc..).
Anyhow, is this type of pattern commonly used within the MySQL community? What are the benefits, if any?
I am trying to convince them that they should re-architect the DB schema.
* [EDIT] *
In the meantime we know another drawback of this approach. We had originally intended to use Amazon RDS for data storage. However, RDS currently supports up to 30 databases per instance. So unfortunately RDS is now ruled out. The fact that RDS has this limit in place is already very telling, my interpretation is that having such a huge number of databases is not a common practice with MySQL.
Thanks!
This is one of most horrible ideas I've ever read (and I've read many). For once the amounts of databases do not scale as well as tables in databases and on the other it would be impossible to connect users to each other or at least share common attributes and options. It essentially defeats the purpose of the database itself.
My advise here is rather outside of original scope: Your intuition knows more than you think, listen to it more!
This idea seems quite strange to me also! Databases are designed to handle large data sets after all! If there is genuine concern about the volume of data it is usually better practice to separate tables onto different databases - hosted on different physical servers as this allows you to spread the database level processes across hardware to boost performance
Also I don't know how they plan to host this application but many hosting providers are going to charge you per database instance!
Another problem this will give you is that it will make reporting more difficult - I wouldn't like to try including tables from 10,000 databases in a query!!

Clustering, Sharding or simple Partition / Replication

We have created a Facebook application and it got a lot of virality. The problem is that our database started getting REALLY FULL (some tables have more than 25 million rows now). It got to the point that the app just stopped working because there was a queue of thousands and thousands of writes to be made.
I need to implement a solution for scaling this app QUICKLY but I'm not sure if I should pursue Sharding or Clustering since I'm not sure what are the pro's and con's of each of them and I was thinking of doing a Partition / Replication approach but I think that doesn't help if the load is on the writes?
25 million rows is a completely reasonable size for a well-constructed relational database. Something you should bear in mind, however, is that the more indexes you have (and the more comprehensive they are), the slower your writes will be. Indexes are designed to improve query performance at the expense of write speed. Be sure that you're not over-indexed.
What sort of hardware is powering this database? Do you have enough RAM? It's far easier to change these attributes than it is to try to implement complex RDBMS load balancing techniques, especially if you're under a time crunch.
Clustering/Sharding/Partitioning comes when single node has reached to the point where its hardware cannot bear the load. But your hardware has still room to expand.
This is the first lesson I learnt when I started being hit by such issues
Well, to understand that, you need to understand how MySQL handles clustering. There are 2 main ways to do it. You can either do Master-Master replication, or NDB (Network Database) clustering.
Master-Master replication won't help with write loads, since both masters need to replay every single write issued (so you're not gaining anything).
NDB clustering will work very well for you if and only if you are doing mostly primary key lookups (since only with PK lookups can NDB operate more efficient than a regular master-master setup). All data is automatically partitioned among many servers. Like I said, I would only consider this if the vast majority of your queries are nothing more than PK lookups.
So that leaves two more options. Sharding and moving away from MySQL.
Sharding is a good option for handling a situation like this. However, to take full advantage of sharding, the application needs to be fully aware of it. So you would need to go back and rewrite all the database accessing code to pick the right server to talk to for each query. And depending on how your system is currently setup, it may not be possible to effectively shard...
But another option which I think may suit your needs best is switching away from MySQL. Since you're going to need to rewrite your DB access code anyway, it shouldn't be too hard to switch to a NoSQL database (again, depending on your current setup). There are tons of NoSQL servers out there, but I like MongoDB. It should be able to withstand your write load without worry. Just beware that you really need a 64 bit server to use it properly (with your data volume).
Replication is for data backup not for performance so its out of question.
Well, 8GB RAM is still not that much you can have many hundred GB RAM with quite big hard disk space and MySQL would still work for you.
Clustering/Sharding/Partitioning comes when single node has reached to the point where its hardware cannot bear the load. But your hardware has still room to expand.
If you don't want to upgrade your hardware then you need to give more information about database design and if there are lot of joins or not so that above named options can be considered deeply.

Maximum capabilities of MySQL

How do I know when a project is just to big for MySQL and I should use something with a better reputation for scalability?
Is there a max database size for MySQL before degradation of performance occurs? What factors contribute to MySQL not being a viable option compared to a commercial DBMS like Oracle or SQL Server?
Google uses MySQL. Is your project bigger than Google?
Smart-alec comments aside, MySQL is a professional level database application. If your application puts a strain on MySQL, I bet it'll do the same to just about any other database.
If you are looking for a couple of examples:
Facebook moved to Cassandra only after it was storing over 7 Terabytes of inbox data. (Source: Lakshman, Malik: Cassandra - A Decentralized Structured Storage System.) (... Even though they were having quite a few issues at that stage.)
Wikipedia also handles hundreds of Gigabytes of text data in MySQL.
I work for a very large Internet company. MySQL can scale very, very large with very good performance, with a couple of caveats.
One problem you might run into is that an index greater than 4 gigabytes can't go into memory. I spent a lot of time once trying to improve the MySQL's full-text performance by fiddling with some index parameters, but you can't get around the fundamental problem that if your query hits disk for an index, it gets slow.
You might find some helper applications that can help solve your problem. For the full-text problem, there is Sphinx: http://www.sphinxsearch.com/
Jeremy Zawodny, who now works at Craig's List, has a blog on which he occasionally discusses the performance of large databases: http://blog.zawodny.com/
In summary, your project probably isn't too big for MySQL. It may be too big for some of the ways that you've used MySQL before, and you may need to adapt them.
Mostly it is table size.
I am assuming here that you will use the Oracle innoDB plugin for mysql as your engine. If you do not, that probably means you're using a commercial engine such as infiniDB, InfoBright for Tokutek, in which case your questions should be sent to them.
InnoDB gets a bit nasty with very large tables. You are advised to partition your tables if at all possible with very large instances. Essentially, if your (frequently used) indexes don't all fit into ram, inserts will be very slow as they need to touch a lot of pages not in ram. This cannot be worked around.
You can use the MySQL 5.1 partitioning feature if it does what you want, or partition your tables at the application level if it does not. If you can get your tables' indexes to fit in ram, and only load one table at a time, then you're on a winner.
You can use the plugin's compression to make your ram go a bit further (as the pages are compressed in ram as well as on disc) but it cannot beat the fundamental limtation.
If your table's indexes don't all (or at least MOSTLY - if you have a few indexes which are NULL in 99.99% of cases you might get away without those ones) fit in ram, insert speed will suck.
Database size is not a major issue, provided your tables individually fit in ram while you're doing bulk loading (and of course, you only load one at once).
These limitations really happen with most row-based databases. If you need more, consider a column database.
Infobright and Infinidb both use a mysql-based core and are column based engines which can handle very large tables.
Tokutek is quite interesting too - you may want to contact them for an evaluation.
When you evaluate the engine's suitability, be sure to load it with very large data on production-grade hardware. There's no point in testing it with a (e.g.) 10G database, that won't prove anything.
MySQL is a commercial DBMS, you just have the option to get the support/monitoring that is offered by Oracle or Microsoft. Or you can use community support or community provided monitoring software.
Things you should look at are not only size at operations. Critical are also:
Scenaros for backup and restore?
Maintenance. Example: SQL Server Enterprise can rebuild an index WHILE THE OLD ONE IS AVAILABLE - transparently. This means no downtime for an index rebuild.
Availability (basically you do not want to have to restoer a 5000gb database if a server dies) - mirroring preferred, replication "sucks" (technically).
Whatever you go for, be carefull with Oracle RAC (their cluster) - it is known to be "problematic" (to say it finely). SQL Server is known to be a lot cheaper, scale a lot worse (no "RAC" option) but basically work without making admins want to commit suicide every hour (the "RAC" option seems to do that). Scalability "a lot worse" still is good enough for the Terra Server (http://msdn.microsoft.com/en-us/library/aa226316(SQL.70).aspx)
THere wer some questions here recently of people having problems rebuilding indices on a 10gb database or something.
So much for my 2 cents. I am sure some MySQL specialists will jump in on issues there.