I wanted to know if there are sharding solutions for MySQL that can be applied to large databases already running in the cloud. For example, I have a 500GB database on Amazon RDS, but now I want to use a sharding solution (you will tell me which one, I hope) that can scale my database using sharding.
You cannot directly divide it into shards, because sharding requires the data to be physically separated. You will have to plan downtime after testing the solution that works best for you.
I recommend ScaleBase. Refer to http://www.scalebase.com/tag/mysql-sharding/
Disclaimer: I work at Clustrix
ClustrixDB was designed exactly for the use case you describe: scale your database, live, as it grows. ClustrixDB was built from the ground up to scale (it is not a MySQL bolt-on solution), is MySQL compatible, and is available on AWS. As your data set grows, ClustrixDB automatically distributes data in the background and distributes queries across multiple servers, all the while providing a simple SQL interface.
Related
It seems like most large companies that have to shard their databases choose MySQL over PostgreSQL. What are the major advantages MySQL has over PostgreSQL when it comes to distributed databases? I don't see any major downside to Postgres that would prevent a successful implementation of sharding at the application level, but the sheer number of companies that choose MySQL over Postgres gives me pause and makes me wonder if I'm missing something.
PARTITIONing involves a single server; sharding involves many servers. They solve (or fail to solve) different problems. Partitioning has very few use cases that justify its existence; sharding provides write scaling at the cost of complexity.
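To make the distinction concrete, here is a minimal sketch of single-server partitioning (the table and partition names are made up for illustration); sharding, by contrast, would put different ranges of rows on different servers entirely and leave the routing to something outside MySQL:

    -- One table, one server; rows are split into partitions by year.
    CREATE TABLE orders (
      id         BIGINT UNSIGNED NOT NULL,
      created_at DATE NOT NULL,
      amount     DECIMAL(10,2) NOT NULL,
      PRIMARY KEY (id, created_at)   -- the partitioning column must be part of every unique key
    )
    PARTITION BY RANGE (YEAR(created_at)) (
      PARTITION p2015 VALUES LESS THAN (2016),
      PARTITION p2016 VALUES LESS THAN (2017),
      PARTITION pmax  VALUES LESS THAN MAXVALUE
    );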
MySQL has no built-in sharding capability. There are 3rd-party packages that assist with it, but a large burden still falls on the DBA. (See Spider and various proxy servers.)
So, I see no reason why Postgres (or any other RDBMS) could not be sharded. After all, you do most of the work; the RDBMS sits on multiple machines not realizing that there are siblings with other chunks of the data.
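To illustrate the "you do most of the work" part, here is a minimal hand-rolled sharding sketch (the users table and the two-shard layout are hypothetical): every shard server gets the identical schema, and the application alone decides which server a row lives on.

    -- Run the same schema on every shard server (shard0, shard1, ...):
    CREATE TABLE users (
      id    BIGINT UNSIGNED NOT NULL PRIMARY KEY,
      email VARCHAR(255) NOT NULL
    );

    -- The application picks the server, e.g. shard = id % 2, connects to it,
    -- and runs ordinary single-server SQL there:
    SELECT email FROM users WHERE id = 42;   -- 42 % 2 = 0, so this goes to shard0 only

Cross-shard joins and global aggregates then become the application's problem, which is exactly the burden on the DBA and developers mentioned above.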
(Disclaimer: I am very familiar with MySQL, and not familiar with Postgres.)
Let's say I have several different containers and each one of them uses its own database. What is the best practice in this case regarding performance? Run one container, say a MySQL server, with all the databases in there, or run one database server container per database?
Any other comment besides the performance would be welcome.
Since Docker container overhead is negligible here, the question is more about architecture in a microservices paradigm.
Performance is indeed a complex question and there is no general advice, but maybe the following will help you:
Personally, I doubt that at the beginning of a project one should try to solve all possible performance problems in advance (#MVP, #agile).
However (correct me if I'm wrong), it looks like you don't have many resources (one host?) and want to be thrifty with them from the start.
Ok, what is your biggest concern now?
RAM is a concern
Then running two concurrent MySQL instances on the same host is maybe not such a good idea (though it is not a problem in other setups).
For one host I would propose starting with one database container and creating different schemas.
It might involve additional work with the standard container (https://forums.docker.com/t/multiple-databases-in-official-mysql-container/8324); a minimal init-script sketch follows at the end of this answer.
Other concerns
I would not care too much now and start with separate databases from the beginning.
Being able to separate your services horizontally down to the databases is hugely valuable! I would not want to weaken this design decision because of very theoretical future performance issues.
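As promised above, here is a minimal sketch of an init script for the one-container / many-schemas approach (the schema, user, and password names are made up); the official mysql image runs any .sql file mounted into /docker-entrypoint-initdb.d/ on first start.

    -- init.sql, mounted into /docker-entrypoint-initdb.d/ of the official mysql image:
    CREATE DATABASE IF NOT EXISTS orders_service;
    CREATE DATABASE IF NOT EXISTS billing_service;

    CREATE USER 'orders'@'%'  IDENTIFIED BY 'orders_password';
    CREATE USER 'billing'@'%' IDENTIFIED BY 'billing_password';

    -- Each service only sees its own schema, so the logical separation
    -- survives even though everything runs in a single container:
    GRANT ALL PRIVILEGES ON orders_service.*  TO 'orders'@'%';
    GRANT ALL PRIVILEGES ON billing_service.* TO 'billing'@'%';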
You'd want to use a single database server, preferably running a shell you can attach to for administration, sharing either a Unix socket, a port, or both with linked containers. This means you'll have an easier time managing the database container as a service, tuning performance, monitoring usage, backing up volumes, etc.
Granted, there are non-standard situations where you might want independent servers, for instance running servers with isolated host resources, users, and databases, though I'm certain this shouldn't apply to developer environments.
I am using Node.js and Couchbase to develop a web app.
I just wonder if Couchbase maintenance is easy and convenient compared to MySQL?
Your comments are welcome.
In general, the administration and scaling of Couchbase is very simple. I'd say setup of MySQL is also relatively simple, but there is more work in optimizing indexes, and the development process is more involved: you have to run CREATE/ALTER TABLE and migrations for every schema change, which can block the affected table each time.
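For example, a typical MySQL schema change is an explicit, planned DDL statement (the table and column names here are made up), whereas in Couchbase you can simply start writing documents with the new field:

    -- Hypothetical migration: adding a column and an index is a DDL change
    -- that has to be rolled out explicitly (older MySQL versions lock the
    -- table while it runs; InnoDB online DDL has improved this).
    ALTER TABLE users
      ADD COLUMN last_login DATETIME NULL,
      ADD INDEX idx_last_login (last_login);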
When it comes to scaling horizontally, Couchbase beats MySQL hands down for ease of scaling. Deciding on your MySQL scaling strategy is challenging: replication is slow and still suffers from not being able to scale the master, and sharding is just as complex.
If you have a small application and don't need to optimize queries or anything and a single server is sufficient for your database, then you are only really changing your development process. The differences in this scenario are smaller.
I am helping a customer migrate a PHP/MySQL application to AWS.
One issue we have encountered is that they have architected this app to use a huge number of databases. They create a new DB (with identical schema) for each user. They expect to have tens of thousands of users.
I don't know MySQL very well, but this setup does not seem at all good to me. My only guess is that the developers did this so they could avoid having tables with huge amounts of data. However, I can only think of drawbacks (maintaining this system will be a nightmare, very difficult to extend, difficult to scale, etc.).
Anyhow, is this type of pattern commonly used within the MySQL community? What are the benefits, if any?
I am trying to convince them that they should re-architect the DB schema.
* [EDIT] *
In the meantime we have found another drawback of this approach. We had originally intended to use Amazon RDS for data storage, but RDS currently supports up to 30 databases per instance, so unfortunately RDS is now ruled out. The fact that RDS has this limit in place is already very telling; my interpretation is that having such a huge number of databases is not common practice with MySQL.
Thanks!
This is one of the most horrible ideas I've ever read (and I've read many). For one, the number of databases does not scale as well as the number of tables within a database; for another, it would be impossible to connect users to each other or even to share common attributes and options. It essentially defeats the purpose of the database itself.
My advice here is rather outside the original scope: your intuition knows more than you think, so listen to it more!
This idea seems quite strange to me too! Databases are designed to handle large data sets, after all! If there is genuine concern about the volume of data, it is usually better practice to separate tables onto different databases hosted on different physical servers, as this allows you to spread the database-level processes across hardware to boost performance.
Also, I don't know how they plan to host this application, but many hosting providers are going to charge you per database instance!
Another problem this will give you is that it will make reporting more difficult; I wouldn't like to try including tables from 10,000 databases in a single query!
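For comparison, here is a minimal sketch of what the re-architected, shared-schema approach could look like (the table and column names are hypothetical): one set of tables, with every row tagged by the owning user/tenant.

    CREATE TABLE tenants (
      tenant_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
      name      VARCHAR(255) NOT NULL
    );

    CREATE TABLE invoices (
      invoice_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
      tenant_id  BIGINT UNSIGNED NOT NULL,
      total      DECIMAL(10,2) NOT NULL,
      KEY idx_tenant (tenant_id),
      CONSTRAINT fk_invoices_tenant FOREIGN KEY (tenant_id) REFERENCES tenants (tenant_id)
    );

    -- Per-user access is just a WHERE clause ...
    SELECT * FROM invoices WHERE tenant_id = 42;
    -- ... and cross-user reporting is one ordinary query instead of 10,000 databases:
    SELECT tenant_id, SUM(total) FROM invoices GROUP BY tenant_id;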
I have a web app running on LAMP. We recently had an increase in load and are now looking at solutions to scale. Scaling Apache is pretty easy: we are just going to have multiple machines hosting it and round-robin the incoming traffic.
However, each instance of Apache will talk to MySQL, and eventually MySQL will be overloaded. How do I scale MySQL across multiple machines in this setup? I have already looked at this, but we specifically need updates from the DB to be available immediately, so I don't think replication is a good strategy here. Also, hopefully this can be done with minimal code change.
PS. We have around a 1:1 read-write ratio.
There are only two strategies: replication and sharding. Replication usually comes into play when you have little write but much read traffic, so you can redirect the reads to many slaves, with the pitfalls of growing replication traffic over time and a chance of inconsistency.
With sharding you distribute your database tables across multiple machines (called functional sharding), which makes joins in particular much harder. If that no longer suffices, you also need to shard your rows across multiple machines, but this is no fun and requires a sharding layer implemented between your application and the database.
Document-oriented databases or column stores do this work for you, but they are currently optimized for OLAP, not OLTP.
Depending on the application backend (i.e. how the PKs, transactions, and insert IDs are handled), you might consider MASTER-MASTER replication with different auto_increment setups. This can be tricky and needs to be thoroughly tested, but it can work.
Also, the new MySQL 5.6 has GTIDs (Global Transaction Identifiers), which generally help a lot in keeping replication in sync, especially in this scenario.
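A minimal sketch of the auto_increment part of such a setup (a hypothetical two-master pair; these are usually set in my.cnf, shown here as the equivalent runtime statements): each master hands out a disjoint ID sequence so concurrent inserts on both sides never collide.

    -- On master A:
    SET GLOBAL auto_increment_increment = 2;  -- step size = number of masters
    SET GLOBAL auto_increment_offset    = 1;  -- A generates 1, 3, 5, ...

    -- On master B:
    SET GLOBAL auto_increment_increment = 2;
    SET GLOBAL auto_increment_offset    = 2;  -- B generates 2, 4, 6, ...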
You should take a look at MySQL Performance Blog. Maybe you'll find something useful.
Well... good luck scaling all those writes to a really large scale. The database engine becomes the bottleneck: too many locks, buffer management overhead, and so on...
The only way I found that really works is scale-out, i.e. sharding; unfortunately, sharding is not provided for MySQL "out of the box" (as it is in some NoSQL databases such as Mongo). ScaleBase (disclaimer: I work there) makes a complete scale-out solution, an "automatic sharding machine" if you like. ScaleBase analyzes your data and SQL stream, splits the data across DB nodes, routes commands, and aggregates results at runtime, so you won't have to!