Can I set up a filtered, star-pattern database replication? - mysql

We have a client that needs to set up N local databases, each one containing one site's data, and then have a master corporate database containing the union of all N databases. Changes in an individual site database need to be propagated to the master database, and changes in the master database need to be propagated to the appropriate individual site database.
We've been using MySQL replication for a client that needs two databases that are kept simultaneously up to date. That's a bidirectional replication. If we tried exactly the same approach here we would wind up with all N local databases equivalent to the master database, and that's not what we want. Not only should each individual site not be able to see data from the other sites, sending that data N times from the master instead of just once is probably a huge waste.
What are my options for accomplishing this new star pattern with MySQL? I know we can replicate only certain tables, but is there a way to filter the replication by records?
Are there any tools that would help or competing RDBMSes that would be better to look at?

SymmetricDS would work for this. It is web-enabled, database independent, data synchronization/replication software. It uses web and database technologies to replicate tables between relational databases in near real time. The software was designed to scale for a large number of databases, work across low-bandwidth connections, and withstand periods of network outage.
We have used it to synchronize 1000+ MySQL retail store databases to an Oracle corporate database.

I've done this before, and AFAIK this is the easiest way. You should look into using Microsoft SQL Server Merge Replication with row filtering. Your row filtering would be set up around a column that states which individual site each row should go to.
For example, your tables might look like this:
ID_column | column2 | destination
The data in the column might look like this:
12345 | 'data' | 'site1'
You would then set your merge replication "subscriber" site1 to filter on column 'destination' and value 'site1'.
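To make the row-filter idea concrete, here is a minimal sketch in plain Python with sqlite3. It is not SQL Server's merge replication API; it only shows what "filter on column 'destination'" selects. The table name `site_data` and the extra sample rows are assumptions made up for the illustration.

```python
import sqlite3


def rows_for_site(conn, site):
    """Return only the rows whose 'destination' column matches the given site."""
    cur = conn.execute(
        "SELECT id_column, column2, destination FROM site_data "
        "WHERE destination = ? ORDER BY id_column",
        (site,),
    )
    return cur.fetchall()


def demo_database():
    """Build a small in-memory table (hypothetical sample data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE site_data ("
        "id_column INTEGER PRIMARY KEY, column2 TEXT, destination TEXT)"
    )
    conn.executemany(
        "INSERT INTO site_data VALUES (?, ?, ?)",
        [
            (12345, "data", "site1"),
            (12346, "other data", "site2"),
            (12347, "more data", "site1"),
        ],
    )
    return conn
```

Calling `rows_for_site(demo_database(), "site1")` returns only the two rows tagged for site1, which is the subset a subscriber filtered on 'site1' would be expected to receive.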
This article will probably help:
Filtering Published Data for Merge Replication
There is also an article on MSDN called "Enhancing Merge Replication Performance" which may help. You will also need to learn the basics of setting up publishers and subscribers in SQL Server merge replication.
Good luck!

Might be worth a look at mysql-table-sync from maatkit which lets you sync tables with an optional --where clause.

If you only need unidirectional replication, keep a replicated copy of each site database at the center of the star, and use a custom "bridge" application to move the data on from those copies into the final database.

Just a random pointer: Oracle Lite supports this. I evaluated it once for a similar task; however, it needs something installed on all clients, which was not an option.
A rough architecture overview can be found here

Short answer: no, you should redesign.
Long answer: yes, but it's pretty crazy and will be a real pain to set up and manage.
One way would be to round-robin the main database's replication among the sites. Use a script to replicate from a site for, say, 30 seconds, record how far it got, and then go on to the next site. You may wish to look at replicate-do-db and friends to limit what is replicated.
Another option, which I'm not sure would work, is to have N MySQL instances in the main office, each replicating from one of the site offices, and then use the federated storage engine to provide a common view from the main database into the per-site slaves. The site slaves can replicate from the main database and pick up whichever changes they need.

Sounds like you need some specialist assistance - and I'm probably not it.
How 'real-time' does this replication need to be?
Some sort of ETL process (or processes) is possibly an option. We use MS SSIS and Oracle in-house; SSIS seems to be fairly good for ETL-type work (but I don't work at that specific coal face, so I can't really say).
How volatile is the data? Would you say the data is mostly operational / transactional?
What sort of data volumes are you talking about?
Is the central master also used as a local DB for the office where it is located? If it is, you might want to change that and have head office work just like a remote office. That way you can treat all offices the same; you'll often run into problems and anomalies if different sites are treated differently.

It sounds like you would be better served by stepping outside of a direct database structure for this.
I don't have a detailed answer for you, but this is the high level of what I would do:
I would select from each database a list of changes during the past (reasonable time frame), construct the insert and delete statements that would unify all of the data on the 'big' database, and then separate smaller sets of insert and delete statements for each of the specific databases.
I would then run these.
There is a potential for 'merge' issues with this setup if there is any overlap with data coming in and out.
There is also the issue of data being lost or duplicated if your time frames are not constructed properly.
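A rough sketch of that kind of batch sync, in Python with sqlite3 standing in for the real databases. Everything named here is an assumption for illustration: a single `records` table, a `site` column that tags each row's owner, an integer `updated_at` column, and ids that are unique across all sites. Deletes are not handled. The window is half-open, (last_sync, now], so consecutive rounds neither skip nor repeat a row.

```python
import sqlite3

SCHEMA = (
    "CREATE TABLE records ("
    "id INTEGER PRIMARY KEY, site TEXT, payload TEXT, updated_at INTEGER)"
)


def changes_between(conn, start, end):
    """Rows modified in the half-open window (start, end]."""
    return conn.execute(
        "SELECT id, site, payload, updated_at FROM records "
        "WHERE updated_at > ? AND updated_at <= ? ORDER BY id",
        (start, end),
    ).fetchall()


def apply_changes(conn, rows):
    """Insert the rows, replacing any existing row with the same id."""
    conn.executemany(
        "INSERT OR REPLACE INTO records (id, site, payload, updated_at) "
        "VALUES (?, ?, ?, ?)",
        rows,
    )
    conn.commit()


def sync_once(central, sites, last_sync, now):
    """Run one sync round and return the watermark to use next time."""
    # Read every change set before writing anything, so rows copied during
    # this round are not picked up again as fresh changes.
    outgoing = changes_between(central, last_sync, now)
    incoming = {
        name: changes_between(conn, last_sync, now)
        for name, conn in sites.items()
    }

    # Site -> central: the central database receives the union of all changes.
    for rows in incoming.values():
        apply_changes(central, rows)

    # Central -> site: each site only receives rows tagged with its own name.
    for name, conn in sites.items():
        apply_changes(conn, [row for row in outgoing if row[1] == name])

    return now
```

If the same row is changed at a site and at the center within one window, this sketch simply swaps the two versions, which is exactly the 'merge' issue mentioned above; a real setup needs an explicit conflict rule.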

Related

Best database model for saas application (1 db per account VS 1 db for everyone)

Little question, I'm developing a saas software (erp).
I designed it with 1 database per account for these reasons :
I make a lot of personalisation, and need to add specific table columns for each account.
Easier to manage db backup (and reload data !)
Less risky : sometimes I need to run SQL queries on a table, in case of an error with bad query (update / delete...), only one customer is affected instead of all of them.
Bad point: I'm ending up with hundreds of databases...
I'm hiring a company to manage my servers, and they said that it's better to have only one database, with a few tables, and put all data in the same tables with column as id_account. I'm very very surprised by these words, so I'm wondering... what are your ideas ?
Thanks !
Frederic
In the environment I currently work in, we handle millions of records from numerous clients. Our solution is to use schemas to segregate each individual client. A schema allows you to partition your clients into separate virtual databases inside a single db. Each schema has an exact copy of the tables from your application.
The upside:
Segregated client data
data from a single client can be easily backed up, exported or deleted
Programming is still the same, but you have to select the schema before db calls
Moving clients to another db or standalone server is a lot easier
adding specific tables per client is easier (see below)
single instance of the database running
tuning the db affects all tenants
The downside:
Unless you manage your shared schema properly, you may duplicate data
Migrations are repeated for every schema
You have to remember to select the schema before db calls
hard pressed to add many negatives... I guess I may be biased.
Adding Specific Tables: Why would you add client specific tables if this is SAAS and not custom software? Better to use a Postgres DB with a Hstore field and store as much searchable data as you like.
Schemas are ideal for multi-tenant databases.
A lot of what I am telling you depends on your software stack, the capabilities of your developers and the backend db you selected (all of which you neglected to mention)
Your hardware guys should not decide your software architecture. If they do, you are likely shooting yourself in the leg before you even get out of the gate. Get a good senior software architect; the grief they will save you will likely save your business.
I hope this helps...
Bonne Chance

Can relational database scale horizontally

After some googling I have found:
Note from mysql docs:
MySQL Cluster automatically shards (partitions) tables across nodes,
enabling databases to scale horizontally on low cost, commodity
hardware to serve read and write-intensive workloads, accessed both
from SQL and directly via NoSQL APIs.
Can a relational database scale horizontally? Would it somehow be based on a NoSQL database?
Does anyone have a real-world example?
How can I manage SQL requests, transactions, and so on in such a database?
It is possible, but it takes a lot of maintenance effort. Explanation -
Vertical Scaling of data (comparable to Normalisation in SQL databases) refers to splitting data column-wise into multiple tables in order to reduce space redundancy. Example of user table -
Horizontal Scaling of data (synonymous with sharding) refers to splitting data row-wise into multiple tables in order to reduce the time taken to fetch data. Example of user table -
The key point to note here is that tables in SQL databases are normalised into multiple tables of related data. In order to shard such a table across multiple machines, you would need to shard the related normalised data accordingly, which in turn increases the maintenance effort. In the SQL database example presented above, the Customer table has a one-to-many relation with the Order table.
If you move some rows of customer data onto another machine (which is what sharding does), you also need to move the related order data onto the same machine, which becomes a troublesome task when there are multiple related tables.
It's convenient for NoSQL databases to shard, as they follow a flat table structure (data is stored in aggregated form rather than normalised form).
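A minimal sketch of the co-location point, assuming the Customer/Order example above. The shards are just in-memory dicts, and `ShardedStore` and its method names are made up for the illustration. The only idea that matters is that orders are routed by customer id rather than by order id, so a customer and their orders always land on the same shard.

```python
import hashlib


def shard_for(customer_id, num_shards):
    """Map a customer id to a shard index using a stable hash."""
    digest = hashlib.sha256(str(customer_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedStore:
    """Toy store in which each shard holds customers plus their orders."""

    def __init__(self, num_shards):
        self.num_shards = num_shards
        self.shards = [
            {"customers": {}, "orders": []} for _ in range(num_shards)
        ]

    def _shard(self, customer_id):
        return self.shards[shard_for(customer_id, self.num_shards)]

    def add_customer(self, customer_id, name):
        self._shard(customer_id)["customers"][customer_id] = name

    def add_order(self, order_id, customer_id, total):
        # Route by customer_id, not order_id, so the order is stored on the
        # same shard as the customer row it refers to.
        self._shard(customer_id)["orders"].append((order_id, customer_id, total))

    def orders_for(self, customer_id):
        # Only one shard needs to be consulted for this customer.
        orders = self._shard(customer_id)["orders"]
        return [order for order in orders if order[1] == customer_id]
```

With more related tables (order lines, payments, addresses), every one of them has to carry or derive the same shard key, which is where the maintenance effort comes from.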
I think the answer is, unequivocally, yes. You have to keep in mind that SQL is simply a data access language. There is absolutely no reason why it can't be extended across multiple computers and network partitions. Is it a challenging problem? Most certainly, and that's why software that does it is in its infancy.
Now, I think what you are trying to ask is "Can all the features that I am familiar with, and that come with a standard SQL-type relational database management system, be developed to work with multiple servers in this manner?" While I admit I haven't studied the problem in depth, there are theorems out there that say "No, it cannot." The CAP (Consistency, Availability, Partition tolerance) theorem posits that we cannot have all three qualities at the same time.
Now, for all practical purposes, "sharding" or "partitioning" or whatever you want to call it is not going away; to the contrary. This means that, given the degree to which CAP theorem holds, we are going to have to shift the way we think about databases, and how we interact with them (at least, to an extent). Many developers have already made the shift necessary to be successful on a No-SQL platform, but many more have not. Ultimately, sufficient maturity of the model and effective enough workarounds will be developed that traditional SQL databases, in the sense you refer, will be more or less practical across multiple machines. This is already starting to pan out, and I would say give it a few more years and we'll be to that point. Or we'll have collectively shifted thinking to the point where it is no longer necessary, and the world will be a better place. :)
Thanks for the question and answer. I was trying to explain this to someone like this:
In terms of the CAP theorem, you can't have all three. So when a partition (network or server failure) occurs:
A relational database on a single server gives you C (consistency). So when a P (partition: server/network failure) occurs, you can't have A (availability: the db goes down).
With a NoSQL datastore, if you want A when a P occurs, you can't have C (one or more of your replicated partitions will be out of sync until the network comes back and they all sync up). So it will only be eventually consistent.
EDITED #2: to provide more perspective based on the comment below by Manish. My intention is to explain by example why you can't have all 3. As noted below in the comments, there are other dbs where you can have C when P occurs, at the expense of A.
Google Spanner is an example of a relational database that can scale horizontally. Sharding and replication are done automatically so no need to worry about that. For more information please check out this paper.
Yes it can. It is called NewSQL.
NewSQL is a new approach to relational databases that wants to combine transactional ACID (atomicity, consistency, isolation, durability) guarantees of good ol’ RDBMSs and the horizontal scalability of NoSQL. Source
Examples for Databases:
User-Shared MySQL Cluster
Citus (PostgreSQL extension)
CockroachDB
Azure Cosmos DB
Google Spanner
NuoDB
Vitess
Splice Machine (part of Hadoop ecosystem)
MemSQL (in memory store)
VoltDB (in memory store)
Examples for Data Warehouses:
IBM Netezza
Oracle
Teradata
Hive Engine (part of Hadoop ecosystem)
Spark SQL (part of Hadoop ecosystem)
Yes, but the data needs to be migrated as storage grows.
Some open source tools can support the feature, for example: Vitess or Apache ShardingSphere.

Mysql database sync between two databases

We are running a Java PoS (Point of Sale) application at various shops, with a MySql backend. I want to keep the databases in the shops synchronised with a database on a host server.
When some changes happen in a shop, they should get updated on the host server. How do I achieve this?
Replication is not very hard to set up.
Here are some good tutorials:
http://www.ghacks.net/2009/04/09/set-up-mysql-database-replication/
http://dev.mysql.com/doc/refman/5.5/en/replication-howto.html
http://www.lassosoft.com/Beginners-Guide-to-MySQL-Replication
Here are some simple rules to keep in mind (there are more, of course, but this is the main concept):
Set up 1 server (master) for writing data.
Set up 1 or more servers (slaves) for reading data.
This way, you will avoid errors.
For example:
If your script inserts into the same tables on both master and slave, you will get duplicate primary key conflicts.
You can view the "slave" as a "backup" server which holds the same information as the master but cannot add data directly; it only follows the master server's instructions.
NOTE: Of course you can read from the master, and you can write to the slave, but make sure you don't write to the same tables in both directions (master to slave and slave to master).
I would recommend monitoring your servers to make sure everything is fine.
Let me know if you need additional help
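On the duplicate primary key conflict mentioned above: if two servers really do have to accept inserts into the same table, one common approach (not something this answer describes) is to give each server its own interleaved id sequence. MySQL has the `auto_increment_increment` and `auto_increment_offset` system variables for this. The arithmetic can be sketched in Python; the function names are made up for the illustration.

```python
import itertools


def id_sequence(offset, increment):
    """Yield offset, offset + increment, offset + 2 * increment, ..."""
    if not 1 <= offset <= increment:
        raise ValueError("offset must be between 1 and increment")
    value = offset
    while True:
        yield value
        value += increment


def first_ids(offset, increment, count):
    """Return the first `count` ids a server with this offset would hand out."""
    return list(itertools.islice(id_sequence(offset, increment), count))
```

With an increment of 2, the server at offset 1 hands out odd ids and the server at offset 2 hands out even ids, so inserts on the two servers never pick the same key.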
Three different approaches:
Classic client/server approach: don't put any database in the shops; simply have the applications access your server. Of course it's better if you set up a VPN, but simply wrapping the connection in SSL or ssh is reasonable. Pro: it's the way databases were originally meant to be used. Con: if you have high latency, complex operations could get slow, and you might have to use stored procedures to reduce the number of round trips.
Replicated master/master: as @Book Of Zeus suggested. Cons: somewhat more complex to set up (especially if you have several shops), and a break-in on any shop machine could potentially compromise the whole system. Pros: better responsiveness, as read operations are totally local and write operations are propagated asynchronously.
offline operations + sync step: do all work locally and from time to time (might be once an hour, daily, weekly, whatever) write a summary with all new/modified records from the last sync operation and send to the server. Pros: can work without network, fast, easy to check (if the summary is readable). Cons: you don't have real-time information.
SymmetricDS is the answer. It supports multiple subscribers with one direction or bi-directional asynchronous data replication. It uses web and database technologies to replicate tables between relational databases, in near real time if desired.
Comprehensive and robust Java API to suit your needs.
Have a look at the Schema and Data Comparison tools in dbForge Studio for MySQL. These tools will help you compare two databases, see the differences, generate a synchronization script, and synchronize them.

Setting up multiple MySQL databases with scalability options

I need to set up a MySQL environment that will support adding many unique databases over time (thousands, actually).
I assume that at some point I will need to start adding MySQL servers, and would like my environment to be prepared for the case beforehand, to make the transition to a 2nd, 3rd, 100th server easy.
And just to make it interesting, It would be very convenient if the solution was modeled so the application that queries the databases sends all the queries to a single address and receives a result. It should be unaware of the number and location of the servers. The database name is unique and can be used to figure out which server holds the database.
I've done some research, and MySQL Proxy pops out as the main candidate, but I haven't been able to find anything specific about making it perform as described above.
Anyone?
Great question. I know of several companies that have done this (Facebook jumps out as the biggest). None are happy, but alternatives kind of suck, too.
More things for you to consider -- what happens when some of these databases or servers fail? What happens when you need to do a cross-database query (and you will, even if you don't think so right now).
Here's the FriendFeed solution: http://bret.appspot.com/entry/how-friendfeed-uses-mysql
It's a bit "back-asswards" since they are basically using MySQL as a glorified key-value store. I am not sure why they don't just cut out the middleman and use something like BerkeleyDB for storing their objects. Connection management, maybe? Seems like the MySQL overhead would be too high a price to pay for something that could be added pretty easily (famous last words).
What you are really looking for (I think) is a distributed shared-nothing database. Several have been built on top of open-source technologies like MySQL and PostgreSQL, but none are available for free. If you are in the buying mood, check out these companies: Greenplum, AsterData, Netezza, Vertica.
There is also a large number of various distributed key-value storage solutions out there. For lack of a better reference, here's a starting point: http://www.metabrew.com/article/anti-rdbms-a-list-of-distributed-key-value-stores/ .
Your problem sounds similar to one we faced - that you are acting as a white-label, and that each client needs to have their own separate database. Assuming this concept parallels yours, what we did was leverage a "master" database that stored the hostname and database name for the client (which could be cached in the application tier). The server the client was accessing could then dynamically shift its datasource to the required database. This allowed us to scale up to thousands of client databases, scattered across servers.
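A minimal sketch of that directory lookup, with an application-tier cache in front of it. `TenantDirectory` and the `lookup` callable are hypothetical names; in a real setup `lookup` would run a query against the "master" database, and the application would open its connection using the returned host and database name.

```python
class TenantDirectory:
    """Resolve a client database name to (host, database), with caching."""

    def __init__(self, lookup):
        # lookup: callable taking a database name and returning a
        # (host, database) tuple, or None if the name is unknown.
        self._lookup = lookup
        self._cache = {}

    def locate(self, db_name):
        """Return (host, database) for db_name, consulting the cache first."""
        if db_name not in self._cache:
            location = self._lookup(db_name)
            if location is None:
                raise KeyError("unknown client database: %s" % db_name)
            self._cache[db_name] = location
        return self._cache[db_name]

    def forget(self, db_name):
        """Drop a cached entry, e.g. after moving a client to another server."""
        self._cache.pop(db_name, None)
```

The application only ever asks the directory where a database lives, so adding a 2nd or 100th server is a matter of adding rows to the master database and calling `forget` for any clients that were moved.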

Is there a reason to use two databases?

Is it because of size?
By the way, what is the limit of the size?
There are many reasons to use two databases, some that come to mind:
Size (the limit of which is controlled by the operating system, filesystem, and database server)
Separation of types of data. Think of a database like a book -- you wouldn't write a book that spans multiple subjects, and you shouldn't (necessarily) have a database with multiple subjects. As long as all of the data is somehow related, you can keep it together (i.e. all the tables have something to do with one website or application).
Import / Export - it might be easier to import data into your application if you can drop and restore a whole database, rather than import individual rows into a database table.
Separate applications, or services. I can't see any reason to use separate databases for a single app/service.
(note: replication, even multimaster, isn't a separate database. Neither is Sharding.)
I believe some on here are confusing Database with a Database Instance.
Example:
A phone book is a prime example of a Database.
Replication:
having 2 copies of the same phone book does not mean you have 2 databases. It means you have 2 copies of 1 database, and that you can hand 1 to someone else so you can both look up different things at the same time thereby accomplishing more work at once.
Sharding:
You could tear those phone books apart at the end of the white pages and the beginning of the yellow pages and hand them to 2 more people. You could further tear them at each letter, and when you need Susan Summers, ask the person with that section of the book to look for her.
Suppose you wanted to publish or reuse some external database, and keep it separate from your primary database. This would be a good reason to use 2 databases... You can drop and reimport the external database at any time without affecting your database, and vice versa...
You can use two databases the same reason most banks have two ATMs, for reliability. You can swap one in if the other fails, but to do it quickly requires setup, such as a cname and controlling your own DNS server.
You can also do writes on one database, if the writes have complex triggers on them, and use some syncing between the databases to keep the second one updated; the second one is then used for selects.
You can use two databases for load sharing, for example, use round-robin to split up the load so one isn't overloaded.
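A minimal sketch of round-robin selection between database servers, assuming the application picks a server per query. The class and the server names are made up; real setups usually do this in a proxy or connection pool and also skip servers that are down, which this sketch does not attempt.

```python
import itertools


class RoundRobinPool:
    """Hand out database servers in strict rotation."""

    def __init__(self, servers):
        servers = list(servers)
        if not servers:
            raise ValueError("at least one server is required")
        self._cycle = itertools.cycle(servers)

    def next_server(self):
        """Return the server that should take the next query."""
        return next(self._cycle)
```

For example, a pool built from `["db1", "db2"]` returns db1, db2, db1, db2, and so on, so neither server takes two consecutive queries.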
I sometimes have separate database because they handle different concerns. I.E. a Reporting database or an Authentication Database.
Replication
Making your system scalable by dividing your database system across different physical locations
Providing redundancy/replication as backup and for seamless uptime.
As Ben mentioned, Replication is one reason. Another is load balancing.
For example, Hotmail uses many database servers and customer data is broken up across the databases.
To have all of their customers' data on one server would not only require massive storage requirements, but the response times would be horrible.
In other cases, the data may be separated by function. You may well have two sets of data which are either not connected, or at least very loosely so, and in such cases, it may make sense to separate that data from the rest.
Also consider IO needs. Writing to one, reading from the other. One with immediate transactional needs, others where "transactions" can be queued; one instance at high priority, the other at "idle" priority, etc. Obviously, however, with the correct hardware and tablespace/filesystem layouts, most of these situations can be handled in a single DB.
I think SQLite databases on the iPhone are limited to a size of 50 megabytes, but you can open several databases.