How to synchronize multiple databases to one central database on remote servers? - mysql

I have a project with multiple sites (1,500 client sites), each containing an Oracle database, and one data center with a MySQL database where all the data from the client sites should reside and be kept in sync. The project scope is to achieve synchronization between the data center and the client sites.
Accordingly, I have a couple of questions:
Is there a utility/tool to implement this integration & synchronization between Oracle and MySQL databases?
What is the best way to protect the data when transferring it over the internet?

Yes, there are a number of tools that can be used to achieve something like this, but the best solution will depend on your organization's technical needs and its budget (don't underestimate the importance of this one). I am an Oracle DBA by trade, and while I do some work in MySQL, these will be Oracle-centric recommendations. Just FYI.
For #1, I would recommend checking out Oracle GoldenGate and Oracle Data Guard.
GoldenGate is a data integration and replication package that allows you to sync data across multiple databases in near real time. Using GoldenGate, you can replicate data, transactions, and DDL changes across many kinds of environments. It can be difficult to set up, but once it's in place, it will pay for itself many times over if you are able to leverage it in a way that makes sense for your needs. This is probably going to be the most effective solution in terms of level of effort, but licensing is costly and it will probably require training if you and your staff are not familiar with it.
Data Guard is also an Oracle technology, but it's more for primary/standby database setups. It essentially allows you to create two databases - one primary database where your data transactions are occurring, and one standby database where those transactions are replicated, also in near real time. I like Data Guard because it can allow you to seamlessly transition between primary and standby without it necessarily being known to the end user. For example, let's say a user connects to your primary database, but the file system where the primary lives becomes corrupt for some reason. You can set it up so that the database will automatically move that user's connection to the standby and allow them to complete their transaction, all without an interruption on their end.
I bring up Data Guard because rather than having your client sites replicate directly to your data center, you might want to consider setting up Data Guard so that the primary databases reside on the client side while the standby databases reside at the data center. This does two things: 1. it creates a copy of each client database so that in a catastrophe you can easily recover, and 2. it decreases the likelihood of things like network latency causing issues. If your replication strategy relies on those client-side databases always being up and accessible, you will run into issues when folks at those client sites run into problems. With Data Guard, you can protect yourself from some of that risk by syncing from the standby databases - meaning instead of changes going from the client directly to the data center, they would go from the client primary database to the data center standby database, and from there into the data center MySQL database. Your security and networking teams will also likely prefer a solution like this, but that may vary.
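Whichever tool you pick for #1, the consolidated MySQL side will need a way to tell rows from different client sites apart. Below is a minimal sketch of what a landing table at the data center might look like; every table and column name is hypothetical and not tied to any particular GoldenGate or Data Guard configuration:

    -- Hypothetical consolidated table on the data center's MySQL database.
    -- The composite key (site_id, source_txn_id) keeps rows from the ~1500
    -- client sites from colliding with one another.
    CREATE TABLE central_transactions (
        site_id        INT UNSIGNED    NOT NULL,  -- which client site the row came from
        source_txn_id  BIGINT UNSIGNED NOT NULL,  -- primary key in the site's Oracle database
        txn_date       DATETIME        NOT NULL,
        amount         DECIMAL(12,2)   NOT NULL,
        replicated_at  TIMESTAMP       NOT NULL DEFAULT CURRENT_TIMESTAMP,
        PRIMARY KEY (site_id, source_txn_id),
        KEY idx_txn_date (txn_date)
    ) ENGINE=InnoDB;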
For #2, whenever you're moving data across the internet, you want to use the basics: encryption, TLS/SSL, certificate authentication, firewalls, VPNs, etc. Beyond that, you also want to be sure you are following the guidelines and regulations that have been set in your region. For example, if you are working with banking or financial data, there may be laws and regulations in place that stipulate the minimum requirements for moving that data across the open internet. Likewise, if you are working with medical or healthcare data, there is probably a different set of minimum requirements for that industry as well. Your organization should be able to connect you with resources to find those requirements, but at the end of the day, the responsibility to adhere to regulations (if there are any) will fall on whoever is setting this up - most likely you. Make sure you are aware of the type of data being moved, as well as any legal implications revolving around that data, and set your technical requirements based on that.
If there aren't any rules or regulations for the data you work with (say you're just moving publicly available data that has no need for protection), you can just do what works - just be sure you are doing it in a secure way so as not to compromise the client sites or your data center.
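On the MySQL side specifically, you can also require encrypted connections at the account level, so a sync process that tries to fall back to plaintext is rejected outright. A minimal sketch using MySQL 5.7+ syntax; the user and database names are hypothetical:

    -- Only accept this account over TLS.
    CREATE USER 'site_sync'@'%' IDENTIFIED BY 'change-me' REQUIRE SSL;
    GRANT INSERT, UPDATE, DELETE ON central_db.* TO 'site_sync'@'%';

    -- Stricter variant: also require a client certificate issued by your CA.
    ALTER USER 'site_sync'@'%' REQUIRE X509;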

Related

Virtual Segregation of Data in Multi-tenant MySQL Database

This is more of a conceptual question so variations on the stack are welcome should they be capable of accomplishing the same concept. We're currently on MySQL and expanding some services out into MongoDB.
The idea is that we would like to manage a single physical database schema/structure so that adjustments, expansions, etc. don't become overly cumbersome as the number of clients using the structure grows into the thousands, tens of thousands, and beyond. However, we would like to segregate their data at this level rather than simply at the application layer, to provide a more rigid separation. Is it possible to create virtual bins for each client using the same structure, but have their data structurally separated from one another?
The normal way would obviously be adding client keys to every row of data, either directly or via foreign relationships, but given that we can't foresee exactly how attacks on our system might allow "cross-client" data retrieval, I wanted to go a little further and embed the separation at a virtually structural level.
I've also read another post here: MySQL: how to do row-level security (like Oracle's Virtual Private Database)?, which uses views as a method, but this seems to become more work as the list of clients grows.
Thanks!
---- EDIT ----
Based on some of the literature suggested below, here's a little more info on our intent:
The closest situation of the three outlined in the MSDN article provided by @Stennie would be single database, multiple schemas. The difference is that we're not interested in customizing client schemas after their creation; we would actually prefer they remain locked to the parent/master schema.
Ideally the solution would keep each schema linked to the parent table-set structure rather than simply duplicating it, so that any change to the parent or master schema would be cascaded across all client/tenant schemas.
Taking it a step further, in a cluster we could have a single master with the master schema, and each slave replicating from it but holding a sharded set of tenants. Changes to the master could then be filtered down through the cluster without interruption and would maintain consistency across all instances, also allowing us to update the application layer faster, knowing that all DBs are compatible with the updated schemas.
Hope that makes sense, I'm still a little fresh at this level.
There are a few common infrastructure approaches ranging from "share nothing" (aka multi-instance) to "share everything" (aka multi-tenant).
For example, a straightforward approach to your "virtual bins" would be to allocate a database per client using shared database servers. This is somewhere in between the two sharing extremes, as your customers would be sharing database server infrastructure but keeping their data and schema separate.
A database-per-client approach would allow you to:
manage authentication and access per client using the database's authentication & access controls
support different database software (you mention using both MySQL which supports views, and MongoDB which does not)
more easily backup and restore data per client
avoid potential cross-client leakage at a database level
avoid excessive table growth and related management issues for a single massive database
Some potential downsides would include:
having more databases to manage
in the case of a database where you want to enforce a certain schema (e.g. MySQL) you will need to apply schema changes across all your databases or support some form of versioning
in the case of a database which preallocates storage (e.g. MongoDB) you may use more storage per client (particularly if your actual data size is small)
you may run into limits on namespaces or open files
you still have to worry about application and data security :)
If you do some research on multi-tenancy you will find some other solutions ranging from this example (isolated DB per client on shared database server architecture) through to more complex partitioned data schemes.
This Microsoft article includes a useful overview of approaches and considerations: Multi-tenant SaaS database tenancy patterns.
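To make the database-per-client layout concrete, here is a minimal MySQL sketch. All names are hypothetical; the point is only that each tenant's credentials are scoped to its own database, so there is no client-key column to get wrong:

    -- One database and one application account per tenant.
    CREATE DATABASE tenant_0001;
    CREATE USER 'tenant_0001_app'@'%' IDENTIFIED BY 'change-me';

    -- The grant is the structural separation: this account cannot see
    -- any other tenant's database.
    GRANT SELECT, INSERT, UPDATE, DELETE ON tenant_0001.* TO 'tenant_0001_app'@'%';

    -- Schema changes are then applied by running the same migration
    -- against every tenant database (or via a migration/versioning tool).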

one big database, or one per client?

I've been asked to develop an application that will be rolled out to a number of business units. The application will be basically the same for each unit, but will have minor procedural differences, which won't change the structure of the underlying database. Should I use one database per business unit, or one big database for all the units? The business units are totally separate.
My preference is for one database per client. The advantages:
if a client gets too big, they're easy to move - backup, restore, change the connection string, boom. Try doing that when their data is mixed in with others in a massive database. Even if you use schemas and filegroups to segregate, moving them is not a cakewalk.
ditto for deleting a client's data when they move on.
by definition you're keeping each client's data separate. This is often going to be a want, and sometimes a need. Sometimes it will even be legally binding.
all of your code within a database is simpler - it doesn't have to include the client's schema (which can't be parameterized) and your tables don't have to be littered with an extra column indicating the client.
A lot of people will claim that managing 200 or 500 databases is a lot harder than managing 10 databases. It's not really any different, in my experience. You build scripts that automate things, you stagger index maintenance and backup jobs, etc.
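To make the "scripts that automate things" point concrete, here is a minimal T-SQL sketch that runs the same maintenance task in every client database. The Client_ naming convention is an assumption for illustration:

    DECLARE @db  SYSNAME,
            @sql NVARCHAR(MAX);

    DECLARE client_dbs CURSOR FOR
        SELECT name FROM sys.databases WHERE name LIKE N'Client_%';

    OPEN client_dbs;
    FETCH NEXT FROM client_dbs INTO @db;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        -- Example task: refresh statistics in each client database.
        SET @sql = N'USE ' + QUOTENAME(@db) + N'; EXEC sys.sp_updatestats;';
        EXEC sys.sp_executesql @sql;
        FETCH NEXT FROM client_dbs INTO @db;
    END
    CLOSE client_dbs;
    DEALLOCATE client_dbs;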
The potential disadvantages are when you get up into the realm of 4-digit and higher databases per instance, where you want to start thinking about having multiple servers (the threshold really depends on the workload and the hardware, so I'm just picking a number). If you build the system right, adding a second server and putting new databases there should be quite simple. Again, the app should be aware of each client's connection string, and all you're doing by using different servers is changing the instance the connection string points to.
Some questions over on dba.SE you should look at. They're not all about SQL Server, but many of the concepts and challenges are universal:
https://dba.stackexchange.com/questions/16745/handling-growing-number-of-tenants-in-multi-tenant-database-architecture
https://dba.stackexchange.com/questions/5071/what-are-the-performance-implications-of-running-multiple-smaller-dbs-instead-of
https://dba.stackexchange.com/questions/7924/one-big-database-vs-several-smaller-ones
Your question is a design question. In order to answer it, you need to understand the requirements of the system that you want to build. From a technical perspective, SQL Server -- or really any database -- can handle either scenario.
Here are some things to think about.
The first question is how separate your clients need the data to be. Mixing data together from different business units may not be legal in some cases (say, the investment side of a bank and the market analysis side). In such situations, separate databases are the solution.
The next question is security. In some situations, clients might be very uncomfortable knowing that their data is intermixed with other clients' data. A small slip-up, and confidential information is inadvertently shared. This is probably not an issue for different business units in the same company.
Do you have to deal with different uptime requirements, upload requirements, customizations, and perhaps interaction with other tools? If one business unit will need customizations ASAP that other business units are not interested in, then that suggests different databases.
Another consideration is performance. Does this application use a lot of expensive resources? If so, being able to partition the application on different databases -- and potentially different servers -- may be highly desirable.
On the other hand, if much of the data is shared, and the repository is really a central repository with the same underlying functionality, then one database is a good choice.

Splitting a mysql database for security

I have used sql (mostly mysql) for years but not to a professional standard, so I'm looking for a shove in the right direction.
I am currently designing a web app that will collect users' names/addresses/emails etc. in one set of tables, as well as other personal information in another set of tables. These would most naturally reside in one database, but I've been considering splitting them: the user contact information in a database on one server, and all the other information in another database on a separate server, the theory being that a hacker would have to break both systems to get anything very useful.
I've done searches off and on for a few weeks and haven't found this type of design discussed much so far. Is this generally done? Is it overkill? Is there a design method to approach it, or will I have to roll it all on my own?
I did find Is splitting databases a legitimate security measure? which I guess is saying that this approach is likely overkill.
I tend to think this is overkill.
Please check my answer on this question: Sharing users between 2 databases
Keep in mind to address database design and data access security as separate issues. Data access security should not lead you to illogical choices in database design.
IMHO that approach seems wrong. By splitting data across two DBs you will only increase complexity without any reasonable security benefit.
I think this is where data encryption can be used. Generate an encryption key based on the user's credentials and encrypt/decrypt sensitive data per user request. Since the private data must be shown only to that user, everything should be OK.
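A minimal sketch of that idea in MySQL, using AES_ENCRYPT with a key derived per user. The table, columns, and secret are hypothetical; in a real application the key material would be derived from the user's credentials on the application side rather than appearing in SQL text:

    -- Hypothetical table holding only encrypted private data.
    CREATE TABLE user_private (
        user_id INT UNSIGNED NOT NULL PRIMARY KEY,
        ssn_enc VARBINARY(64) NOT NULL   -- AES_ENCRYPT output is binary
    );

    -- Write path: encrypt with a key derived from a per-user secret.
    INSERT INTO user_private (user_id, ssn_enc)
    VALUES (42, AES_ENCRYPT('123-45-6789',
                            SHA2(CONCAT('per-user-secret-', 42), 256)));

    -- Read path: only a request that can reproduce the key gets plaintext back.
    SELECT CAST(AES_DECRYPT(ssn_enc,
                            SHA2(CONCAT('per-user-secret-', 42), 256)) AS CHAR) AS ssn
    FROM user_private
    WHERE user_id = 42;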
Here's an approach I used before:
Server1: DB
Server2: SC
DB is in a network domain that is accessible by the public, but cannot access SC
SC is in a network domain that is not accessible by the public, but can access DB
DB is where you store all pertinent information, including the 'really important stuff'.
At a specified interval (I used 5 seconds), SC checks DB for any new records in any table it wants to monitor (via a job or scheduled task) and encrypts the important information.
I was using SQL Server 2005 and was able to work with two domains (a private/internal one and a public one for client access), and what I just shared is a stripped-down, simplified version with as many MSSQL-exclusive parts removed as possible. Still, with some effort I think it would be possible to recreate something similar in MySQL, especially if you can host your two databases on separate physical machines.
While many will also think this is overkill, this idea has been implemented. It costs more, and requires more work when it's time for data reporting, but the clients were pleased.
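For what it's worth, the encryption step of that polling job could be as simple as a single statement that SC runs against DB on its schedule. Everything below (table, columns, key) is a hypothetical sketch, not the original implementation:

    -- Run by the SC job every few seconds (the answer above used a 5-second
    -- interval): encrypt any newly arrived plaintext and blank it out.
    UPDATE customer_data
    SET    card_number_enc   = AES_ENCRYPT(card_number_plain, 'key-held-only-on-SC'),
           card_number_plain = NULL
    WHERE  card_number_plain IS NOT NULL;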

Distributed db's or not?

INFORMIX-SQL 7.32 (SE) Linux Pawnshop App.
I have some users who own several pawnshops within a 100-mile radius. Each pawnshop app runs with SE. The only functionality these owners need is the ability to remotely log in to any store in order to view transactions and running totals, and to consolidate daily totals at the end of the business day. This can be accomplished with dial-up modems, as the app doesn't have any need to display BLOBs. At end of day, each store's totals are unloaded to a flat file and transferred to the owner's system.
What would my owners gain by converting to distributed DBs? The ability to find out if a store's customer has conducted business in another store, or if another store has a desired inventory item for sale? (Not important; it seldom happens.) Most customers will usually do business with the same store, and if that store doesn't have a desired item for sale, they will visit the closest competitor's pawnshop. What gains would distributed DBs offer to accomplish the same functionality as described in the first paragraph? Pawnshop owners absolutely refuse to connect their production systems via the internet! They don't trust its security, even using VPNs, Cisco gear, etc., or its reliability! In this part of the world, ISPs have a bad track record for uptime. I know of several apps which have converted from web to dial-up because of comm problems!
Distributed DBs, more precisely Informix XPS and IDS, don't have just one advantage. If you only care about getting data from different places, you can accomplish that with a design strategy alone: add a "branch_id" column, or something like that, and you're done (see the sketch at the end of this answer).
Distributed DBs have a lot of advantages, from availability to scalability. You must review all these things first.
Sorry for this kind of answer, but it is really difficult to give you a straight answer on this topic.
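To illustrate the "branch_id" point alongside the existing end-of-day flat-file transfer, a consolidated table at the owner's system might look like the sketch below. This is generic SQL with made-up names, not Informix-specific code:

    -- Loaded from each store's end-of-day flat file.
    CREATE TABLE daily_totals (
        branch_id   SMALLINT      NOT NULL,   -- which pawnshop the row came from
        biz_date    DATE          NOT NULL,
        total_sales DECIMAL(12,2) NOT NULL,
        total_loans DECIMAL(12,2) NOT NULL,
        PRIMARY KEY (branch_id, biz_date)
    );

    -- Consolidated totals across all stores at end of business day.
    SELECT biz_date, SUM(total_sales) AS sales, SUM(total_loans) AS loans
    FROM   daily_totals
    GROUP  BY biz_date;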
CouchDB is a peer based distributed database system. Any number of CouchDB hosts (servers and offline-clients) can have independent “replica copies” of the same database, where applications have full database interactivity (query, add, edit, delete). When back online or on a schedule, database changes are replicated bi-directionally.
CouchDB has built-in conflict detection and management and the replication process is incremental and fast, copying only documents and individual fields changed since the previous replication. Most applications require no special planning to take advantage of distributed updates and replication.
Unlike cumbersome attempts to bolt distributed features on top of the same legacy models and databases, it is the result of careful ground-up design, engineering and integration. The document, view, security and replication models, the special purpose query language, the efficient and robust disk layout are all carefully integrated for a reliable and efficient system.
If you are not going to have general 90%+ uptime connection between the databases, then there isn't any benefit to distributed databases.
One main benefit is to give large businesses a 'failover' when one machine goes down or is unavailable. If they have the database distributed over three or four machines, then the loss of one doesn't impact their ability to do business.
A second major benefit is when a database is simply too big for one server to cope with. 'Internet scale' databases (Amazon, Twitter, etc) have that level of traffic. Walmart would have that level of traffic. A couple of storefront operations wouldn't.
I think that this is a context where there is little to gain from distributed database operation.
If you were to go towards distributed operation, I'd probably look towards using a simple ER (Enterprise Replication) topology, with the 'head office' store being the primary (root) node and the other shops being leaf nodes. You would then have changes to the individual store databases replicated to the HQ node; you might or might not also propagate the data back to the other stores. Especially with just two stores, you might in fact simply replicate all the information to both stores; this gives you an automatic off-site backup of the database. (You'd probably configure all nodes as root nodes in this case - at least, until a chain grew to, say, five or six nodes.)
This would give you some resiliency for disaster recovery. It would also allow the HQ (in particular) to see what is going on at each store.
My impression is that you are probably not discussing 'transactions per second' on average; the rate of transactions at a single store is probably a few transactions per minute, with 'few' possibly being less than one TPM. Consequently, the network bandwidth is unlikely to be a bottleneck at any point, even with dial-up speeds (though that might be borderline).

Pattern for updating slave SQL Server 2008 databases from a master whilst minimising disruption

We have an ASP.NET web application hosted by a web farm of many instances using SQL Server 2008 in which we do aggregation and pre-processing of data from multiple sources into a format optimised for fast end user query performance (producing 5-10 million rows in some tables). The aggregation and optimisation is done by a service on a back end server which we then want to distribute to multiple read only front end copies used by the web application instances to facilitate maximum scalability.
My question is about the best way to get this data from a back end database out to the read only front end copies in such a way that does not kill their performance during the process. The front end web application instances will be under constant high load and need to have good responsiveness at all times.
The backend database is constantly being updated so I suspect that transactional replication will not be the best approach, as the constant stream of updates to the copies will hurt their performance.
Staleness of data is not a huge issue so snapshot replication might be the way to go, but this will result in poor performance during the periods of replication.
Doing a drop and bulk insert will result in periods with no data for user queries.
I don't really want to get into writing a complex cluster approach where we drop copies out of the cluster during updating - is there something along these lines that we can do without too much effort, or is there a better alternative?
There is actually a technology built into SQL Server 2005 (and 2008) that is designed to address this kind of issue: Service Broker (which I'll refer to as SSB). The problem is that it has a very steep learning curve.
I know MySpace went public about how it uses SSB to manage its park of SQL Servers: MySpace Uses SQL Server Service Broker to Protect Integrity of 1 Petabyte of Data. I know of several more (major) sites that use similar patterns, but unfortunately they have not gone public so I cannot give names. I was personally involved with some projects around this technology (I am a former member of the SQL Server team).
Now bear in mind that SSB is not a dedicated data transfer technology like Replication. As such, you will not find anything similar to the publishing wizards and simple deployment options of Replication (check a table and it gets transferred). SSB is a reliable messaging technology, and its primitives stop at the level of message exchange: you would have to write the code that captures data changes, packs them as messages, and unpacks the messages into relational tables at the destination.
The reason some companies still prefer SSB over Replication for a task like you describe is that SSB has a far better story when it comes to reliability and scalability. I know of projects that exchange data between 1500+ sites, far beyond the capabilities of Replication. SSB is also abstracted from the physical topology: you can move databases, rename machines, and rebuild servers, all without changing the application. Because data flow occurs over logical routes, the application can adapt on the fly to new topologies. SSB is also resilient to long periods of disconnect and downtime, being capable of resuming the data flow after hours, days, and even months of disconnection. High throughput achieved through engine integration (SSB is part of the SQL engine itself, not a collection of satellite applications and processes like Replication) means that the backlog of changes can be processed in reasonable time (I know of sites that go through half a million transactions per minute). SSB applications typically rely on internal Activation to process the incoming data. SSB also has some unique features like built-in load balancing (via routes) with sticky session semantics, support for deadlock-free application-specific correlated processing, priority data delivery, specific support for database mirroring, certificate-based authentication for cross-domain operations, built-in persisted timers, and many more.
This is not a specific answer to "how to move data from table T on server A to server B". It is more a generic technology for "exchanging data between server A and server B".
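To give a feel for the primitives you end up writing code against, here is a minimal, hedged T-SQL sketch. All object names are invented, and a real deployment also needs routes, activation procedures, and the receive/unpack side:

    -- Objects on both ends of the conversation (names are hypothetical).
    CREATE MESSAGE TYPE [//example/DataChange] VALIDATION = WELL_FORMED_XML;

    CREATE CONTRACT [//example/DataChangeContract]
        ([//example/DataChange] SENT BY INITIATOR);

    CREATE QUEUE dbo.ChangeQueue;

    CREATE SERVICE [//example/ChangeService]
        ON QUEUE dbo.ChangeQueue ([//example/DataChangeContract]);

    -- Sending side: open a conversation and send one change message.
    DECLARE @dialog UNIQUEIDENTIFIER;
    BEGIN DIALOG CONVERSATION @dialog
        FROM SERVICE [//example/ChangeService]
        TO SERVICE N'//example/ChangeService'
        ON CONTRACT [//example/DataChangeContract]
        WITH ENCRYPTION = OFF;

    SEND ON CONVERSATION @dialog
        MESSAGE TYPE [//example/DataChange]
        (N'<row table="dbo.Orders" op="insert" id="12345"/>');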
I've never had to deal with this scenario before but did come up with a possible solution for this. Basically, it would require a change in your main database structure. Instead of storing the data, you would keep records of modifications of this data. Thus, if a record is added, you store "Table X, inserted new record with these values: ..." With modifications, just store the table, field and changed value. With deletions, just store which record is deleted. Every modification will be stored with a timestamp.
Your client systems would keep their local copies of the database and would regularly ask for all database modifications after a certain date/time. You then execute those modifications on the local database and it is up to date again.
And the back-end? Well, it would just keep a list of modifications and perhaps a table with the base data. Keeping just the modifications also means you're keeping track of history, allowing you to ask the system what it looked like a year ago.
How well this would perform depends on the number of modifications on the back-end database. But if you request the changes every 15 minutes, it shouldn't be that much data every time.
But again, I never had the chance to work this out in a real application, so it's still a theoretical principle for me. It seems fast, but a lot of work will be required.
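A minimal sketch of what that modification log and the clients' "give me everything since my last sync" query could look like in T-SQL; every name here is hypothetical:

    -- One row per change on the back end; clients replay these in order.
    CREATE TABLE change_log (
        change_id   BIGINT IDENTITY(1,1) PRIMARY KEY,
        table_name  SYSNAME       NOT NULL,
        row_key     NVARCHAR(64)  NOT NULL,   -- key of the affected row
        operation   CHAR(1)       NOT NULL,   -- 'I', 'U' or 'D'
        column_name SYSNAME       NULL,       -- only for updates
        new_value   NVARCHAR(MAX) NULL,
        changed_at  DATETIME2     NOT NULL DEFAULT SYSUTCDATETIME()
    );

    -- What a client copy asks for on its 15-minute schedule, then applies
    -- locally in change_id order.
    DECLARE @last_sync_time DATETIME2 = '2019-01-01T00:00:00';
    SELECT change_id, table_name, row_key, operation, column_name, new_value
    FROM   change_log
    WHERE  changed_at > @last_sync_time
    ORDER  BY change_id;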
Option 1: Write an app to transfer the data using row-level transactions. It might take longer, but it would result in no interruption of the site using the data, because the rows are there before and after the read occurs, just with new data. This processing would happen on a separate server to minimize load.
In SQL Server 2008 you can set READ_COMMITTED_SNAPSHOT to ON so that rows being updated do not block readers (a sketch follows the options list below).
But basically all this app does is read the new data as it is available out from one database and into the other.
Option 2: Move the data (tables or entire database) from the aggregation server to the front-end server. Automate this if possible. Then switch your web application to point to the new database or tables for future requests. This works but requires control over the web app, which you may not have.
Option 3: If you were talking about a single table (or this could work with many), what you can do is a view swap. You write your code against a SQL view which points to table A. You do your work on table B, and when it's ready, you update the view to point to table B. You can even write a function that determines the active table and automate the whole swap (see the sketch after the options list).
Option 4: You might be able to use something like byte-level replication of the server, which is basically copying the server from point A to point B exactly, down to the very bytes. That sounds scary, though. It's mostly used in DR situations, and this sounds like it could be a kinda/sorta DR situation, but not really.
Option 5: Give up and learn how to sell insurance. :)
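For the READ_COMMITTED_SNAPSHOT setting mentioned under Option 1 and the view swap in Option 3, here is a minimal T-SQL sketch; the object and database names are hypothetical:

    -- Option 1 helper: let readers see the last committed version of a row
    -- instead of blocking while it is being refreshed.
    ALTER DATABASE FrontEndDb SET READ_COMMITTED_SNAPSHOT ON;
    GO

    -- Option 3: the view swap. The web app always queries dbo.ActiveSales;
    -- you bulk load the inactive table, then repoint the view.
    CREATE VIEW dbo.ActiveSales AS SELECT * FROM dbo.Sales_A;
    GO

    -- ...refresh dbo.Sales_B in the background, then switch:
    ALTER VIEW dbo.ActiveSales AS SELECT * FROM dbo.Sales_B;
    GO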