What's the best way to synchronize geographically diverse MySQL servers?

Imagine that I have a number of servers which all run a MySQL, MongoDB, or Redis database, and the servers are in different places. I want to keep the data on all the servers the same.
For example:
servers A, B, C, D, E
1> insert items into A, auto-update B, C, D, E
2> insert items into B, auto-update A, C, D, E
3> delete ...

Your question is rather generic, but in all cases the starting point for syncing the same data to multiple servers is similar:
MySQL - use replication
MongoDB - use replication
redis - use replication
Depending on the database, you may have limitations, such as whether replication is single-master (all of the above generally are, out of the box) or whether you truly need multi-master updates (e.g. MySQL Cluster, CouchDB, or another database with MVCC).
There are pros and cons of different approaches, and it really depends on your use case and where the servers are in relation to each other (same data centre, geographically diverse, etc). Generally you would want to scale up to the appropriate scenario rather than trying to start out with something overly complex to set up and support.
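For MySQL, for example, the starting point might look like this (a minimal sketch using the classic binlog-coordinate syntax; the host, user, password, and log coordinates are placeholders you would take from SHOW MASTER STATUS on your own master):

-- On master A: create an account the replicas connect as
CREATE USER 'repl'@'%' IDENTIFIED BY 'repl_password';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%';

-- On each replica (B, C, D, E): point at A and start replicating
CHANGE MASTER TO
    MASTER_HOST = 'server-a.example.com',
    MASTER_USER = 'repl',
    MASTER_PASSWORD = 'repl_password',
    MASTER_LOG_FILE = 'mysql-bin.000001',  -- from SHOW MASTER STATUS on A
    MASTER_LOG_POS = 4;
START SLAVE;

With this layout all writes must go to A; MongoDB replica sets and Redis replication follow the same single-master pattern out of the box.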

Related

Why do we have Redis when we have MySQL temporary tables?

MySQL temporary tables are stored in memory as long as the computer has enough RAM (and MySQL is set up accordingly). One can create indexes on any fields.
Redis stores data in memory indexed by one key at a time, and in my understanding MySQL can do this job too.
Is there anything that makes Redis better for storing a big amount (100-200k rows) of volatile data? The only explanation I can find for Redis is that not every project has MySQL in it, and some other databases probably don't support temporary tables.
If I already have MySQL in my project, does it make sense to add Redis as well?
Redis is like working with indexes directly. There's no ACID, SQL parser and many other things between you and the data.
It provides some basic data structures and they're specifically optimized to be held in memory, and they also have specific operations to read and modify them.
On the other hand, Redis isn't designed for querying data (though you can implement very powerful and high-performance filters with SORT, SCAN, intersections, and other operations) but for storing data in the shape in which it is going to be consumed later. If you want to get, for example, customers sorted by three different criteria, you'll need to fill three different sorted sets. There are a lot of use cases with the other data structures, but I would end up writing a book in an answer...
Also, one of the most powerful features of Redis is how easily it can be replicated, and since version 3.0 it supports data sharding out of the box.
Whether to use Redis instead of temporary tables in MySQL (and other engines which have them too) is up to you. You need to study your case and check whether caching or storing data in a NoSQL store like Redis both outperforms your current approach and provides a more elegant data architecture.
By using Redis alongside the other database, you're effectively reducing the load on that database. Also, when Redis is running on a different server, scaling can be performed independently on each tier.
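For comparison, the temporary-table approach from the question might look like this (a minimal sketch; the table and column names are hypothetical):

CREATE TEMPORARY TABLE hot_items (
    item_id INT NOT NULL,
    score   INT NOT NULL,
    PRIMARY KEY (item_id),
    INDEX idx_score (score)
) ENGINE = MEMORY;

INSERT INTO hot_items VALUES (1, 42), (2, 17);
SELECT * FROM hot_items ORDER BY score DESC LIMIT 10;

The rough Redis counterpart of that last query would be a single sorted set read back in score order, which avoids the SQL layer entirely.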

MySQL Replication: Combination of Master & Slave on 2 servers

I have a situation here that will have 2 MySQL servers linked over a WAN. Each location (let's call them Location X and Location Y) will need access to three databases (let's call them A, B, and C).
Location X needs to use Database A as their main database, and database B as a 'lookup' read only database.
Location Y needs to use Database B as their main database, and database A as a 'lookup' read only database.
BOTH locations need read/write access to a replicated database C. (This database contains no auto incrementing columns etc. - fairly simple logging tables).
In essence, I need a combination of master-slave AND master-master replication happening:
X         Y
-----------
A   -->   A
B   <--   B
C  <-->   C
Is this feasible to set up on MySQL?
mysqld_multi is a tool that can help you accomplish this. An individual MySQL instance can only have one master, so if you set up the A databases mastered in one direction and the B databases mastered in the opposite direction, they would need to run in separate instances on their own IPs or port numbers. Likewise for C.
A multi-master setup becomes arduous if you get data transmission errors, lose table space, or get hit by thundering herds and dropped connections. You will likely end up in the unenviable position of using a log-based recovery tool to re-sync them. Please consider that your application will want to use its "write"-privileged connection very cautiously and rely heavily on read connections.
Splitting up MySQL instances in this manner allows latency between applications to be isolated better. In a multi-master situation, you commonly have to watch the replication latency and begin failing servers out of your pool if they differ by even 1 second. If you truly need multi-master for your C database, seriously consider chaining read-only slaves on each end for C to minimize latency on the write masters.
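As a sketch of how the per-database directions could be expressed on the separate instances (MySQL 5.7+ replication-filter syntax; the database names follow the question's A/B/C example, and the statements must be run on a stopped replica):

-- On the Location Y instance that replicates from X:
STOP SLAVE;
CHANGE REPLICATION FILTER REPLICATE_DO_DB = (A);
START SLAVE;

-- On the Location X instance that replicates from Y:
STOP SLAVE;
CHANGE REPLICATION FILTER REPLICATE_DO_DB = (B);
START SLAVE;

-- Database C would need its own pair of instances replicating in both
-- directions, each filtered with REPLICATE_DO_DB = (C).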
I'd suggest using Percona XtraDB Cluster.
PXC is a MySQL derivative server (so everything about using it will seem familiar except for the replication part). It supports synchronous bidirectional replication over a WAN, so both site X and Y can write to its main database, which will then be replicated to the other site. PXC even supports writes to a database on both sites, which satisfies your requirement for database C.
Percona XtraDB Cluster is free and open source software, like all Percona products.
Disclaimer: I worked for Percona 2010-2014.

replication, uuid field as primary key, row based replication

I am looking for the best way to implement multi-site replication with MySQL. As I am mainly an MS-SQL user, I feel quite uncomfortable with MySQL replication terminology and the multiple MySQL versions that do not do exactly the same thing the same way.
My objective is, in SQL Server terminology, to have a publisher and many subscribers. Subscribers are open to user updates. Changes are replicated to the publisher, which then distributes these changes to the other subscribers.
So my objective here is to determine the correct primary key rule to be used for our tables. As we want exclusively surrogate keys, we have the possibility to use either integer/auto-increment or uuid/uuid_short fields. In order to avoid replication conflicts, integer/auto-increment does not fit our needs, as it can create replication conflicts when, between two synchronisations, both servers insert a record into the same table. So, as I see it, the correct solution to avoid replication/primary-key conflicts would be to:
use uuid or uuid_short fields as primary keys
have the corresponding uuid values set by the server at INSERT time
set the replication to RBR (Row-Based Replication, which sounds equivalent to MS-SQL merge replication) mode, and not SBR (Statement-Based Replication, which sounds like transactional replication). As I understand it, RBR will insert the calculated uuid value 'as is' on the other servers at replication time, while SBR will re-run the uuid() function and generate a new value on each server... thus causing a major problem!
Will it work?
I think there are two unrelated issues here:
1.
Choosing a primary key in a way which is probably unique - UUID is an acceptable approach. You can use it with SBR or RBR, provided the client app generates the UUID. If you happen to be calling the MySQL function to generate it, then you need to use row-based replication. Row-based replication is generally much better, so the only reasons you'd want to use statement-based replication are backwards compatibility with a MySQL <5.1 slave or a legacy application which relies on some of its behaviour. (A sketch of this combination follows after point 2 below.)
2.
Secondly, you appear to want to do multi-master replication. IT WILL NOT WORK.
MySQL replication is extremely simplistic - it writes changes to a log, pushes them from one server to another, reading the log as needed.
You can't do multi-master replication, because it will fail. Not only does MySQL not actually support it (for arbitrary topologies), it also doesn't work.
MySQL replication has no "conflict resolution" algorithm; it will simply stop and break if a conflict is discovered.
Even assuming that your database contains NO UNIQUE INDEXES except primary keys which are allocated by UUIDs, your application can still have rules which make multi-master fail.
Any constraints, invariants, or assumptions within your application probably only work if the database has immediate consistency. Trying to use multi-master replication breaks this assumption and will cause your application to enter unexpected (i.e. normally impossible) states.
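Coming back to point 1, a minimal sketch of the UUID-plus-row-based-replication combination (the expression default requires MySQL 8.0.13+; the table and column names are hypothetical):

-- Replicate the generated value itself, not the uuid() call
SET GLOBAL binlog_format = 'ROW';

-- Let the server generate the key at INSERT time
CREATE TABLE customer (
    id   BINARY(16)   NOT NULL DEFAULT (UUID_TO_BIN(UUID())),
    name VARCHAR(100) NOT NULL,
    PRIMARY KEY (id)
);

INSERT INTO customer (name) VALUES ('Alice');  -- id filled in by the server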

Transfer MySQL data from machineX to machineY

I want to collect MySQL data from 10 different machines and aggregate it into one big MySQL DB on a different machine. All machines are Linux based.
What is the "mysqldump" syntax if I want to do this periodically, collecting only the "delta" data?
Are there any other ways to achieve this?
This isn't natively supported in MySQL. You could use replication, but a replica can have only a single master, not 10 masters. I know of two workable options:
1) is to script something up that switches the replica between masters in a round-robin fashion. You might wish to refer to http://code.google.com/p/mysql-mmre/ or http://thenoyes.com/littlenoise/?p=117.
2) is to use an ETL tool.
If you get stuck, we (Percona) can help you. This is a common request, but not an easy one, because each case is different.
mysqldump can't generate incremental backups, as it doesn't have any way of determining which rows (or what parts of the schema!) have changed since the last backup, or indeed even when the last backup was. For that you'd need something which could read the MySQL binlog and convert it into a bunch of INSERT/UPDATE/DELETE statements; I'm not aware of anything that exists quite like that.
The current "state of the art" in MySQL backups is generally considered to be Percona XtraBackup.
Multiple master-slave? Make each of the 10 machines a master, and the aggregate machine a slave to all 10. This assumes that the data you are aggregating is different on each of the 10. If the data is the same (or similar) on all 10 and you want to interleave it as well as integrate it, then this won't work.
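On MySQL 5.7 and later, this fan-in is supported natively as multi-source replication, one channel per source machine (a minimal sketch; the host names and the use of GTID auto-positioning are assumptions):

-- On the aggregate machine, repeated for machine1 .. machine10:
CHANGE MASTER TO
    MASTER_HOST = 'machine1.example.com',
    MASTER_USER = 'repl',
    MASTER_PASSWORD = 'repl_password',
    MASTER_AUTO_POSITION = 1   -- requires GTIDs enabled on every machine
    FOR CHANNEL 'machine1';
START SLAVE FOR CHANNEL 'machine1';

Note that all 10 sources must still have non-colliding data (distinct schemas, or at least distinct primary keys), as described above.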

MySQL sharding approaches?

What is the best approach for sharding MySQL tables?
The approaches I can think of are:
Application Level sharding?
Sharding at MySQL proxy layer?
Central lookup server for sharding?
Do you know of any interesting projects or tools in this area?
The best approach for sharding MySQL tables is to not do it unless it is totally unavoidable.
When you are writing an application, you usually want to do so in a way that maximizes velocity, i.e. developer speed. You optimize for latency (time until the answer is ready) or throughput (number of answers per time unit) only when necessary.
You partition and then assign partitions to different hosts (= shard) only when the sum of all these partitions does no longer fit onto a single database server instance - the reason for that being either writes or reads.
The write case is either a) the frequency of writes permanently overloading this server's disks, or b) so many writes going on that replication permanently lags in this replication hierarchy.
The read case for sharding is when the size of the data is so large that the working set of it no longer fits into memory and data reads start hitting the disk instead of being served from memory most of the time.
Only when you have to shard do you do it.
The moment you shard, you are paying for that in multiple ways:
Much of your SQL is no longer declarative.
Normally, in SQL you are telling the database what data you want and leave it to the optimizer to turn that specification into a data access program. That is a good thing, because it is flexible, and because writing these data access programs is boring work that harms velocity.
With a sharded environment you are probably joining a table on node A against data on node B, or you have a table larger than a node, spread across nodes A and B, and are joining data from it against data that is on nodes B and C. You end up writing application-side hash-based join resolutions manually in order to resolve that (or you are reinventing MySQL Cluster), meaning you end up with a lot of SQL that is no longer declarative but expresses SQL functionality in a procedural way (e.g. you are using SELECT statements in loops).
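For example, a report that is one declarative statement on a single node decomposes into per-shard queries plus glue code (hypothetical tables, with orders and customers living on different shards):

-- Single node: one declarative join
SELECT c.name, o.total
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
WHERE o.order_date >= '2024-01-01';

-- Sharded: first collect the join keys on shard A ...
SELECT customer_id, total FROM orders WHERE order_date >= '2024-01-01';

-- ... then resolve the join on shard B with the keys gathered above,
-- batching and looping in application code:
SELECT customer_id, name FROM customers WHERE customer_id IN (17, 42, 99);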
You are incurring a lot of network latency.
Normally, an SQL query can be resolved locally and the optimizer knows about the costs associated with local disk accesses and resolves the query in a way that minimizes the costs for that.
In a sharded environment, queries are resolved by either running key-value accesses across a network to multiple nodes (hopefully with batched key accesses and not individual key lookups per round trip) or by pushing parts of the WHERE clause onward to the nodes where they can be applied (that is called 'condition pushdown'), or both.
But even in the best of cases this involves many more network round trips than a local situation, and it is more complicated. Especially since the MySQL optimizer knows nothing about network latency at all (OK, MySQL Cluster is slowly getting better at that, but for vanilla MySQL outside of Cluster that is still true).
You are losing a lot of expressive power of SQL.
Ok, that is probably less important, but foreign key constraints and other SQL mechanisms for data integrity are incapable of spanning multiple shards.
MySQL has no asynchronous-query API that is in working order.
When data of the same type resides on multiple nodes (e.g. user data on nodes A, B and C), horizontal queries often need to be resolved against all of these nodes ("Find all user accounts that have not been logged in for 90 days or more"). Data access time grows linearly with the number of nodes, unless multiple nodes can be asked in parallel and the results aggregated as they come in ("Map-Reduce").
The precondition for that is an asynchronous communication API, which does not exist for MySQL in a good working shape. The alternative is a lot of forking and connections in the child processes, which is visiting the world of suck on a season pass.
Once you start sharding, data structure and network topology become visible as performance points to your application. In order to perform reasonably well, your application needs to be aware of these things, and that means that really only application level sharding makes sense.
The question is more whether you want to auto-shard (for example, determining which row goes onto which node by hashing primary keys) or whether you want to split functionally in a manual way ("the tables related to the xyz user story go to this master, while abc- and def-related tables go to that master").
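The auto-sharding rule is typically just a deterministic function of the primary key, for example (a sketch; the hash function and shard count are arbitrary choices):

-- shard_no = hash(primary_key) mod number_of_shards,
-- evaluated in the application before choosing a connection:
SELECT CRC32(12345) % 4 AS shard_no;  -- the same key always maps to the same shard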
Functional sharding has the advantage that, if done right, it is invisible to most developers most of the time, because all tables related to their user story will be available locally. That allows them to still benefit from declarative SQL as long as possible, and will also incur less network latency because the number of cross-network transfers is kept minimal.
Functional sharding has the disadvantage that it does not allow for any single table to be larger than one instance, and it requires manual attention of a designer.
Functional sharding has the advantage that it is relatively easy to apply to an existing codebase, with a number of changes that is not overly large. Booking.com has done it multiple times in the past years and it worked well for them.
Having said all that, looking at your question, I do believe that you are asking the wrong questions, or I am completely misunderstanding your problem statement.
Application-level sharding: dbShards is the only product that I know of that does "application-aware sharding". There are a few good articles on the website. Just by definition, application-aware sharding is going to be more efficient. If an application knows exactly where to go with a transaction without having to look it up or get redirected by a proxy, that in itself will be faster. And speed is often one of the primary concerns, if not the only concern, when someone is looking into sharding.
Some people "shard" with a proxy, but in my eyes that defeats the purpose of sharding. You are just using another server to tell your transactions where to find the data or where to store it. With application aware sharding, your application knows where to go on its own. Much more efficient.
This is the same as #2 really.
Do you know of any interesting projects or tools in this area?
Several new projects in this space:
citusdata.com
spockproxy.sourceforge.net
github.com/twitter/gizzard/
Application level of course.
The best approach I've ever read, I found in this book:
High Performance MySQL
http://www.amazon.com/High-Performance-MySQL-Jeremy-Zawodny/dp/0596003064
Short description: you can split your data into many parts and store ~50 parts on each server. This will help you avoid the second biggest problem of sharding: rebalancing. Just move some of the parts to the new server and everything will be fine :)
I strongly recommend you buy it and read the "MySQL scaling" part.
As of 2018, there seems to be a MySQL-native solution for this. There are actually at least 2: InnoDB Cluster and NDB Cluster (there is a commercial and a community version of the latter).
Since most people who use MySQL Community Edition are more familiar with the InnoDB engine, this is what should be explored as a first priority. It supports replication and partitioning/sharding out of the box and is based on MySQL Router for different routing/load-balancing options.
The syntax for your table creation would need to change, for example:
CREATE TABLE t1 (col1 INT, col2 CHAR(5), col3 DATETIME) PARTITION BY HASH ( YEAR(col3) );
(this is only one of four partitioning types)
One very important limitation:
InnoDB foreign keys and MySQL partitioning are not compatible. Partitioned InnoDB tables cannot have foreign key references, nor can they have columns referenced by foreign keys. InnoDB tables which have or which are referenced by foreign keys cannot be partitioned.
Shard-Query is an OLAP based sharding solution for MySQL. It allows you to define a combination of sharded tables and unsharded tables. The unsharded tables (like lookup tables) are freely joinable to sharded tables, and sharded tables may be joined to each other as long as the tables are joined by the shard key (no cross shard or self joins that cross shard boundaries). Being an OLAP solution, Shard-Query usually has minimum response times of 100ms or less, even for simple queries so it will not work for OLTP. Shard-Query is designed for analyzing big data sets in parallel.
OLTP sharding solutions exist for MySQL as well. Closed-source solutions include ScaleDB and DBShards. Open-source OLTP solutions include JetPants, Cubrid, and Flock/Gizzard (Twitter infrastructure).
Do you know of any interesting projects or tools in this area?
As of 2022, here are 2 tools:
Vitess (website: https://vitess.io & repo: https://github.com/vitessio/vitess)
PlanetScale (https://planetscale.com)
You can also consider this middleware:
shardingsphere