How to apply sharding + replication to a MySQL relational database

I wanted to know, what are the different techniques for sharding and replication that can be applied to MySQL or any other relational database?
Are there any guidelines/rules that I should be aware of?
Basically, I want to create a custom MySQL (or other relational DB) setup that has support for sharding and replication. Most of the sites I see explain some technology or service that takes care of sharding/replication -- I want to understand the concepts and apply them myself to a regular MySQL database.

Sharding is not provided for MySQL "out of the box".
ScaleBase (disclaimer: I work there) is a maker of a complete scale-out solution, an "automatic sharding machine" if you like. The application or any other client tool (mysql, mysqldump, phpMyAdmin...) connects to the ScaleBase controller. It looks and feels like MySQL, but acts as a proxy to a grid of "shards", automating command routing, parallelizing cross-DB queries, and merging results -- you wouldn't feel any difference; it's just like a result that would have come from one DB. ORDER, GROUP, LIMIT and aggregate functions are supported! The routing and parallelizing are done inside the "controller" according to the command and its parameters. From experience with our customers, not only have we seen great performance improvements with parallel queries, we have also improved maintenance: think about creating an index or adding a column to a table -- these are also parallelized and run much faster. All with no, or pretty minimal, changes to the code.
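To make the routing idea concrete, independent of any particular product, here is a minimal sketch of the two classic shard-routing techniques; the shard count, table and column names are all hypothetical:

    -- Hash-based routing: assuming 4 shards and a numeric user_id shard
    -- key, the application computes the target shard before connecting.
    SELECT MOD(12345, 4) AS target_shard;  -- user 12345 lives on shard 1

    -- Directory-based routing: a central lookup table maps each shard
    -- key to its shard, which makes later re-balancing much easier.
    CREATE TABLE shard_directory (
      user_id    BIGINT      NOT NULL PRIMARY KEY,
      shard_host VARCHAR(64) NOT NULL  -- e.g. 'shard2.example.com'
    );
    SELECT shard_host FROM shard_directory WHERE user_id = 12345;

A proxy layer like the one described above essentially automates this lookup and fans queries out to whichever shards hold the relevant rows.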
Also - I invite you to look at my blog about the subject: http://database-scalability.blogspot.com/.
PS - I forgot to mention that ScaleBase also completes the solution for replication, with a frontend that manages automatic failover and read/write splitting over replicated databases.
Hope I helped

Should I use SQLite instead of MySQL?

I need to improve a PHP-MySQL web application, which only uses MySQL for REPL operations (and some search functions). 99% of the applications I have worked with never used advanced MySQL features such as replication, cross-table constraints, locking, etc.
To my understanding I should instead use SQLite.
Are there any practical benefits if I do this?
Will I see a significant (>100ms) speed boost?
Should I expect problems with tables with more than 1,000,000 rows?
There is no catch-all answer to that, but there is one main point to consider: a very good rule of thumb is that the higher your degree of concurrency, the more you'll profit from MySQL, and vice versa.
This means that in a scenario where database requests are never concurrent, you might see a speedup by using SQLite, though I doubt it would be on the order of 100ms.
The reason behind this is (very roughly):
In a database server environment, such as MySQL, PostgreSQL, MS SQL, Oracle and friends, a dedicated process (or a group of processes) exclusively touch the database files - the important part being dedicated. This means, that concurrency issues can be resolved in-process.
In a file-based database, such as SQLite, MS Access (Jet Engine) and friends, multiple processes will touch the DB files without knowing of each other. This implies that concurrency issues have to be resolved through locking information written to the DB or helper file(s), which is typically much slower and less robust. In exchange, the overhead of communication between the database client (the web app) and the database engine is nonexistent, because the engine runs in-process.
Edit
After a comment I want to make it clearer that I am talking about concurrent writes, not concurrent reads. Concurrent reads of an unchanging dataset are not a hard problem; they don't need any locking at all.
The principal advantage of SQLite is that it is a file-based relational database that uses SQL as its query language. Being file-based tremendously simplifies deployment, making it very good for the case where an application needs a little database but must be run in an environment where having a database server would be problematic. (For example, many browsers use SQLite to manage their cookie stores; using a database server for that problem would be verging on the insane in many ways.)
The principal advantage of MySQL (with a sane table type) is that it is a database server that uses SQL as its query language. Being server-based allows for many features that a file-based system can't handle simply (such as replication) but does make things quite a bit more complex to deploy.
Whether the benefits of the additional complexity of a database server (e.g., MySQL) outweigh the costs (relative to a file-based database engine like SQLite) depends on a great many factors, notably including how many installations are expected and who is expected to perform those installations.

MySQL database sync between two databases

We are running a Java PoS (Point of Sale) application at various shops, with a MySql backend. I want to keep the databases in the shops synchronised with a database on a host server.
When some changes happen in a shop, they should get updated on the host server. How do I achieve this?
Replication is not very hard to create.
Here are some good tutorials:
http://www.ghacks.net/2009/04/09/set-up-mysql-database-replication/
http://dev.mysql.com/doc/refman/5.5/en/replication-howto.html
http://www.lassosoft.com/Beginners-Guide-to-MySQL-Replication
Here are some simple rules you will have to keep in mind (there's more, of course, but this is the main concept):
Set up one server (master) for writing data.
Set up one or more servers (slaves) for reading data.
This way, you will avoid errors.
For example:
If your script inserts into the same tables on both the master and the slave, you will get duplicate primary key conflicts.
You can view the "slave" as a "backup" server which holds the same information as the master but cannot add data directly; it only follows the master server's instructions.
NOTE: Of course you can read from the master and you can write to the slave, but make sure you don't write to the same tables on both sides (master and slave).
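For reference, the core of such a master/slave setup boils down to a handful of statements. This is a minimal sketch, where the host, credentials and binlog coordinates are placeholders you would take from your own environment:

    -- On the master (needs log-bin and a unique server-id in my.cnf):
    -- create an account the slave will replicate through.
    CREATE USER 'repl'@'%' IDENTIFIED BY 'secret';
    GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%';

    -- On the slave (with its own unique server-id): point it at the
    -- master using the coordinates reported by SHOW MASTER STATUS.
    CHANGE MASTER TO
      MASTER_HOST     = 'master.example.com',
      MASTER_USER     = 'repl',
      MASTER_PASSWORD = 'secret',
      MASTER_LOG_FILE = 'mysql-bin.000001',
      MASTER_LOG_POS  = 4;
    START SLAVE;

    -- Verify that both replication threads report 'Yes':
    SHOW SLAVE STATUS\G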
I would recommend monitoring your servers to make sure everything is fine.
Let me know if you need additional help.
Three different approaches:
Classic client/server approach: don't put any database in the shops; simply have the applications access your server. Of course it's better if you set up a VPN, but simply wrapping the connection in SSL or ssh is reasonable. Pro: it's the way databases were originally conceived. Con: if you have high latency, complex operations could get slow; you might have to use stored procedures to reduce the number of round trips.
Replicated master/master: as #Book Of Zeus suggested (see the sketch after this list). Cons: somewhat more complex to set up (especially if you have several shops), and a break-in at any shop machine could potentially compromise the whole system. Pros: better responsiveness, as read operations are totally local and write operations are propagated asynchronously.
Offline operations + sync step: do all work locally and, from time to time (might be once an hour, daily, weekly, whatever), write a summary with all new/modified records since the last sync operation and send it to the server. Pros: can work without a network, fast, easy to check (if the summary is readable). Cons: you don't have real-time information.
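For the master/master option, the classic guard against the duplicate primary key problem mentioned in the previous answer is to interleave auto-increment values so the two servers can never hand out the same key. A minimal sketch, assuming exactly two masters:

    -- On master A: generate keys 1, 3, 5, ...
    SET GLOBAL auto_increment_increment = 2;
    SET GLOBAL auto_increment_offset    = 1;

    -- On master B: generate keys 2, 4, 6, ...
    SET GLOBAL auto_increment_increment = 2;
    SET GLOBAL auto_increment_offset    = 2;

To survive restarts, the same two settings also belong in each server's my.cnf.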
SymmetricDS is the answer. It supports multiple subscribers with one-directional or bi-directional asynchronous data replication. It uses web and database technologies to replicate tables between relational databases, in near real time if desired.
It has a comprehensive and robust Java API to suit your needs.
Have a look at the Schema and Data Comparison tools in dbForge Studio for MySQL. These tools will help you compare two databases, see the differences, generate a synchronization script, and synchronize them.

How does database tiering work?

The only good reference that I can find on the internet is this whitepaper, which explains what database tiering is, but not how it works:
The concept behind database tiering is the seamless co-existence of multiple (legacy and new) database technologies to best solve a business problem.
But how is it implemented? How does it work?
Any links regarding this would also be helpful. Thanks.
I think the idea of that document is for you to put "cheap" databases in front of the "expensive" databases to reduce costs.
For example, let's assume you have an "expensive" DB, something like Oracle, DB2 or even MSSQL (more realistically, it's probably a legacy DB system that is not supported much or that needs specialized resources to maintain). A database engine like that costs a lot to purchase and maintain (arguably these are not expensive when you take all factors into consideration, but let's use them for the example).
Now if you suddenly get famous and your server starts to get overloaded what do you do? Do you buy a bigger server and migrate all your data to that new server? That could be incredibly expensive.
With the tiering solution you put several "cheap" databases in front of your "expensive" database to take the brunt of the work. So your web servers (or app servers) talk to a bunch of MySQL servers, for example, instead of directly to your expensive server. These MySQL servers handle the majority of the calls; for instance, they could handle all read-only calls completely on their own and only need to pass write calls back to the main database server. The MySQL servers are then kept in sync via standard replication practices.
Using methods like this you could in theory scale out your expensive server to dozens, if not hundreds, of "cheap" database servers and handle a much higher load.
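One concrete detail of such a setup: the cheap front-tier replicas are typically flagged read-only, so application mistakes cannot write to them while replication itself keeps flowing. A minimal sketch:

    -- On each front-tier MySQL replica: reject writes from ordinary
    -- application accounts. The replication SQL thread (and SUPER
    -- accounts) are exempt, so changes from the main DB still apply.
    SET GLOBAL read_only = 1;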
Database tiering is just a specific style of tiering. There are also application tiering and service tiering. It's a form of scalability.
What exactly are you asking? This question is rather vague.
This is a PDF from a course at Ohio State. What it discusses is a bit over my head, but hopefully you might understand it better.

How to replicate two different database systems?

I'm not sure if it fits Stack Overflow exactly; however, as I'm seeking some code rather than a tool, I think it does.
I'm looking for a way to replicate/synchronize different database systems -- in this case MySQL and MongoDB. We run both for different purposes. We started with a MySQL database and added MongoDB later on for special applications. There's data we would like to have in both databases, where we want constraints in MySQL and DBRefs in MongoDB, respectively. For example: we need a user record in MySQL, but also in MongoDB, for references between tables and objects, respectively. At the moment we have a cronjob which dumps the MySQL data and imports it into MongoDB. Though it works quite well, that's not the solution we would like to have.
I think for the moment a one-way replication (MySQL -> MongoDB) would be enough; the important part is that the replication works in "real time", much like MySQL master -> slave replication does.
Are there already any solutions for this problem or ideas anyone of how to achieve this?
Thanks!
SymmetricDS is open source, Java-based, web-enabled, database independent, data synchronization/replication software that might do the trick with a few tweaks. It has an extension point called IDataLoaderFilter which you could use to implement a MongodbDataLoader.
This would help with one-way database replication. It might be a little more difficult to synchronize from MongoDB -> relational database, but the SymmetricDS team would be very helpful in trying to find a solution.
What you're looking for is called EAI (Enterprise Application Integration). There are a lot of commercial tools around, but under the provided link you'll also find a couple of OSS solutions. The basis of EAI is that you have data sources and data sinks; the EAI framework offers tools to build custom pumps between the two.
I suggest either using a DB trigger to start the synchronization or sending a trigger signal from your applications. Note that there is no turnkey solution, since synchronization can become arbitrarily complex (for example, how do you make sure that all rows are copied?).
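To illustrate the trigger approach on the MySQL side, here is a minimal sketch in which changes are queued in a changelog table that a sync daemon then drains into MongoDB; the users table, its id column and the changelog layout are all hypothetical:

    -- Queue of pending changes for the sync daemon to pick up.
    CREATE TABLE user_changelog (
      id         BIGINT AUTO_INCREMENT PRIMARY KEY,
      user_id    BIGINT      NOT NULL,
      action     VARCHAR(10) NOT NULL,  -- 'INSERT' / 'UPDATE' / 'DELETE'
      changed_at TIMESTAMP   NOT NULL DEFAULT CURRENT_TIMESTAMP
    );

    DELIMITER //
    CREATE TRIGGER users_after_insert
    AFTER INSERT ON users FOR EACH ROW
    BEGIN
      INSERT INTO user_changelog (user_id, action)
      VALUES (NEW.id, 'INSERT');
    END//
    DELIMITER ;

Matching AFTER UPDATE and AFTER DELETE triggers complete the picture; the daemon deletes changelog rows once it has applied them to MongoDB.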
As far as I can see, you need to develop some sort of "control program" that has the drivers for each DBMS and runs as a daemon. The daemon should use a trigger or a very small recheck interval to keep the DBs synchronized.
Technically, you could set up a process which parses the binary log of the MySQL server and replays the relevant SQL queries. I've never done such a thing with a different database as the slave, but maybe it is worth a shot?
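To get a feel for what such a binlog-parsing process would have to consume, MySQL's built-in inspection commands are a good start; the binlog file name below is a placeholder, and statement-based logging is assumed so the events carry replayable SQL text:

    -- List the binary logs available on the master:
    SHOW BINARY LOGS;

    -- Peek at the events a MySQL -> MongoDB replicator would have to
    -- parse and replay:
    SHOW BINLOG EVENTS IN 'mysql-bin.000001' FROM 4 LIMIT 10;

The standalone mysqlbinlog utility exposes the same stream on the command line.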

How to scale MySQL with multiple machines?

I have a web app running on LAMP. We recently had an increase in load and are now looking at solutions to scale. Scaling Apache is pretty easy: we are just going to have multiple machines hosting it and round-robin the incoming traffic.
However, each instance of Apache will talk to MySQL, and eventually MySQL will be overloaded. How do we scale MySQL across multiple machines in this setup? I have already looked at this, but we specifically need the updates from the DB available immediately, so I don't think replication is a good strategy here. Also, hopefully this can be done with minimal code change.
PS. We have around a 1:1 read-write ratio.
There are only two strategies: replication and sharding. Replication often comes into play when you have little write traffic and much read traffic, so you can redirect the reads to many slaves, with the pitfalls of a lot of replication traffic over time and a chance of inconsistency.
With sharding you split your database tables across multiple machines (called functional sharding), which makes joins in particular much harder. If this doesn't suffice anymore, you also need to shard your rows across multiple machines, but this is no fun and requires a sharding layer implemented between your application and the database.
Document-oriented databases and column stores do this work for you, but they are currently optimized for OLAP, not OLTP.
Depending on the application backend (i.e. how the PKs, transactions and insert IDs are handled), you might consider MASTER-MASTER replication with different auto_increment setups. This can be tricky and needs to be thoroughly tested, but it can work.
Also, new MySQL 5.6 has GTIDs (Global Transaction Identifiers), which generally help a lot in keeping replication in sync, especially in this scenario.
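With GTIDs the slave no longer needs explicit binlog coordinates. A minimal sketch, assuming MySQL 5.6 servers started with gtid_mode=ON, log-bin, log-slave-updates and enforce-gtid-consistency in my.cnf (host and credentials are placeholders):

    -- On the slave: let the servers negotiate the starting position
    -- automatically instead of naming a binlog file and offset.
    CHANGE MASTER TO
      MASTER_HOST          = 'master.example.com',
      MASTER_USER          = 'repl',
      MASTER_PASSWORD      = 'secret',
      MASTER_AUTO_POSITION = 1;
    START SLAVE;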
You should take a look at MySQL Performance Blog. Maybe you'll find something useful.
Well... good luck scaling all those writes to a really large scale. The database engine becomes the bottleneck: too many locks, buffer management overhead, and so on.
The only way I found that really works is scale-out sharding; unfortunately, sharding is not provided for MySQL "out of the box" (as it is in some NoSQL databases such as Mongo). ScaleBase (disclaimer: I work there) is a maker of a complete scale-out solution, an "automatic sharding machine" if you like. ScaleBase analyzes your data and SQL stream, splits the data across DB nodes, and routes commands and aggregates results at runtime, so you won't have to!