Distributed DBs or not? - mysql

INFORMIX-SQL 7.32 (SE) Linux Pawnshop App.
I have some users who own several pawnshops within a 100-mile radius. Each pawnshop app runs with SE. The only functionality these owners need is the ability to remotely log in to any store in order to view transactions and running totals, and to consolidate daily totals at the end of the business day. This can be accomplished with dialup modems, as the app doesn't have any need for displaying BLOBs. At end-of-day, each store's totals are unloaded to a flat file and transferred to the owner's system.
What would my owners gain by converting to distributed databases? The ability to find out if a store's customer has conducted business in another store, or if another store has a desired inventory item for sale? (Not important; it seldom happens.) Most customers will usually do business with the same store, and if it doesn't have a desired item for sale, they will visit the closest competitor's pawnshop. What gains would distributed databases offer in accomplishing the same functionality described in the first paragraph? Pawnshop owners absolutely refuse to connect their production systems via the internet! They don't trust its security, even using VPNs, Cisco gear, etc., or its reliability! In this part of the world, ISPs have a bad track record for uptime. I know of several apps which have converted from web to dialup because of comm problems!

Distributed DBs, more precisely Informix XPS and IDS, don't have just one advantage. If all you care about is getting data from different places, you can accomplish that with a design strategy alone: add a "branch_id" column, or something like that, and you're done.
Distributed DBs have a lot of advantages, from availability to scalability, and you should review all of those factors first.
Sorry for this kind of answer, but it's really difficult to give you a straight answer on this topic.

CouchDB is a peer-based distributed database system. Any number of CouchDB hosts (servers and offline clients) can have independent “replica copies” of the same database, where applications have full database interactivity (query, add, edit, delete). When back online or on a schedule, database changes are replicated bi-directionally.
CouchDB has built-in conflict detection and management and the replication process is incremental and fast, copying only documents and individual fields changed since the previous replication. Most applications require no special planning to take advantage of distributed updates and replication.
Unlike cumbersome attempts to bolt distributed features on top of the same legacy models and databases, it is the result of careful ground-up design, engineering and integration. The document, view, security and replication models, the special purpose query language, the efficient and robust disk layout are all carefully integrated for a reliable and efficient system.

If you are not going to have a generally reliable (90%+ uptime) connection between the databases, then there isn't any benefit to distributed databases.
One main benefit is to give large businesses a 'failover' when one machine goes down or is unavailable. If they have the database distributed over three or four machines, then the loss of one doesn't impact their ability to do business.
A second major benefit is when a database is simply too big for one server to cope with. 'Internet scale' databases (Amazon, Twitter, etc) have that level of traffic. Walmart would have that level of traffic. A couple of storefront operations wouldn't.

I think that this is a context where there is little to gain from distributed database operation.
If you were to go distributed, I'd probably look at a simple ER (Enterprise Replication) topology, with the 'head office' store being the primary (root) node and the other shops being leaf nodes. You would then have changes to the individual store databases replicated to the HQ node; you might or might not also propagate the data back to the other stores. Especially with just two stores, you might in fact simply replicate all the information to both stores; this gives you an automatic off-site backup of the database. (You'd probably configure all nodes as root nodes in this case - at least until a chain grew to, say, five or six nodes.)
This would give you some resiliency for disaster recovery. It would also allow the HQ (in particular) to see what is going on at each store.
My impression is that you are probably not discussing 'transactions per second' on average; the rate of transactions at a single store is probably a few transactions per minute, with 'few' possibly being less than one TPM. Consequently, the network bandwidth is unlikely to be a bottleneck at any point, even with dial-up speeds (though that might be borderline).

Related

How to synchronize multiple databases to one central database across remote servers?

I have a project where there are multiple sites (1,500 client sites), each containing an Oracle database, and one datacenter with a MySQL database where all the data from all client sites should reside and be kept in sync. The project scope is to achieve synchronization between the datacenter and the client sites.
Accordingly, I have a couple of questions:
Is there a utility/tool to implement this integration & synchronization between Oracle and MySQL databases?
What is the best way to protect the data when transferring it over the internet?
Yes, there are a number of tools that can be used to achieve something like this, but the best solution will depend on your organization's technical needs and its budget (don't underestimate the importance of this one). I am an Oracle DBA by trade, and while I do some work in MySQL, these will be Oracle-centric recommendations. Just FYI.
For #1, I would recommend checking out Oracle GoldenGate and Oracle Data Guard.
GoldenGate is a data integration and replication package that allows you to sync data across multiple databases in near real-time. Using GoldenGate, you can replicate data, transactions, and DDL changes across multiple kinds of environments. It can be difficult to set up, but once it's in place, it will pay for itself many times over if you are able to leverage it in a way that makes sense for your needs. This is probably going to be the most effective solution in terms of level of effort, but licensing will be costly and it will probably require training if you and your staff are not familiar with it.
Data Guard is also an Oracle technology, but it's more for primary/standby database setups. It essentially allows you to create two databases - one as a primary database where your data transactions are occurring, and one as a standby database where those transactions are replicated, also in near real-time. I like Data Guard because it can allow you to seamlessly transition between primary and standby without it necessarily being known to the end user. For example, let's say a user connects to your primary database, but the file system where the primary lives becomes corrupt for some reason. You can set it up so that the database will automatically move that user's connection to the standby and allow them to complete their transaction, all without an interruption on their end.
I bring up Data Guard because rather than having your client sites replicate directly to your data center, you might want to consider setting up Data Guard so that the primary databases reside on the client side, but your standby databases reside on the side of the data center. This will do two things: 1) create a local copy of your client databases so that in a catastrophe you can easily recover, and 2) decrease the likelihood of things like network latency causing issues. If your replication strategy relies on those client-side databases always being up and accessible, you will run into issues when folks at those client sites run into problems. With Data Guard, you can protect yourself from some of that risk by syncing with the standby databases - meaning instead of changes going from the client directly to the data center to sync, they would go from client primary database -> data center standby database -> data center sync. Your security and networking teams will also likely prefer a solution like this, but that may vary.
For #2: whenever you're moving data across the internet, you want to use basic things like encryption, SSL/TLS, certificate authentication, firewalls, VPNs, etc. Beyond that, you also want to be sure you are following the guidelines and regulations that have been set in your region. For example, if you are working with banking or financial data, there may be laws and regulations in place that stipulate the minimum requirements for moving that data across the open internet. Likewise, if you are working with medical or healthcare data, there is probably a different set of minimum requirements for that industry as well. Your organization should be able to connect you with resources to find those requirements, but at the end of the day, the responsibility to adhere to regulations (if there are any) will fall on whoever is setting this up - most likely you. Make sure you are aware of the type of data being moved, as well as any legal implications revolving around that data, and set your technical requirements based on that.
If there aren't any rules/regulations for the data you work with (let's say you're just moving publicly available data that has no need for protection), you can just do what works - just be sure you are doing it in a secure way so as not to compromise the client sites or your data center.

Microservice Database shared with other services

Something I have searched for but cannot find a straight answer to is this:
For a given service, if there are two instances of that service deployed to two machines, do they share the same persistent store or do they have separate stores with some syncing mechanism (master/slave, clustering)?
E.g., I have an OrderService backed by MySQL. We're getting many orders in, so I need to scale this service up and we deploy a second OrderService. Where does its data come from?
It may sound silly, but to me every discussion makes it seem like the service and database are a packaged unit that is deployed together. Yet few discussions mention what happens when you deploy a second service.
Posting this as an answer because it's too long for a comment.
Microservices are self-contained components and as such are responsible for their own data. If you want to get to the data, you have to talk to the service API. This applies mainly to different kinds of services (i.e. you don't share a database among services that offer different kinds of business functionality). That sharing is bad practice because you couple services at the hip through the database, and it then becomes easy to couple more things that would normally be done at the API level, simply because it's more convenient to do them through the database => you risk losing componentization.
But if you have the same kind of service, then there are, as you mentioned, two obvious choices: share a database, or have each service contain its own database.
Now you have to ask yourself which solution to choose:
Are these OrderServices of yours truly capable of working on their own, or do you need to have all the orders in the same database for reporting or access by other applications?
Determine what your actual bottleneck is. Is it the database? If not, then share the database. Is it the services? If not, then distribute your data.
If you need to distribute the data: what are your choices, what are your needs? Do you need to be consistent all the time, or is eventual consistency good enough? Do you need to have separate databases and synchronize them manually, or does your database installation handle replication and partitioning out of the box?
etc.
What I'm trying to say is that in this kind of situation the answer is: it depends. And something that we tech geeks often forget to do before embarking on such distributed/scalability/architecture journeys is to talk to the business. Often the business can handle a certain degree of inconsistency, suboptimal processes, or looking up data in more places instead of one (i.e. what you think is important might not necessarily be important to the business). So talk to them and see what they can tolerate. It might be cheaper to resolve something in an operational way than to invest a lot in trying to build a highly distributable system.

Dedicated database for each user vs single database for every user

I'll soon be developing a big CMS where users can configure their website, managing news, products, services, and much more about their company.
Think of Shopify without the e-commerce part (at least for now).
The RDBMS is MySQL, and the user base will be about 150 (maybe bigger).
I'm trying to figure out which one of these two approaches would fit better.
DEDICATED DATABASE FOR EACH USER
PROS:
performance (and possible future sharding?): is querying a smaller database containing just your data better than querying a giant database containing every user's data?
easy "export my data" for users: I can simply dump their own DB without fetching everything and putting it into some big encoded logical data structure
SINGLE DATABASE FOR EVERY USER
PROS:
less general overhead
statistics: just one DB to query to get and aggregate whatever I need
backup: one dump (not sure about this one, because I have no experience dumping clusters)
Which way would you go? I don't think Shopify created a dedicated database for every registered user... or maybe they did?
I'd like people more experienced than me to help me figure out the best way, and all the variables I cannot guess right now because of my inexperience.
It sounds like you're developing a software-as-a-service hosted system, rather than a software package to distribute to customers for them to run on their own servers. In that case, in general, you will have an easier time developing and administering your service if you design it for a single database handling multiple users.
You'll be able to add new users to your system with data manipulation language (DML) rather than data definition language (DDL). That is, you'll insert rows for new users rather than create tables. That will make your life a LOT easier when you go live.
You are absolutely right that stuff like backups and aggregate reporting will be far easier if you have a single shared database.
Don't worry too much about the user data export functions. You'll have to develop software for those functions anyway; it won't be that hard to filter by user when you do the export.
But there's a downside you should consider to the single-database approach: if part of your requirement is to conceal various users' existence or data from each other, you'll have to be very careful to do this in your development. Will your users be competitors with each other? That could be tricky. You'll need to trust your in-house admin and support teams to refrain from disclosing one user's data to another by mistake (or deliberately). With a separate database per user, you'll have a smaller risk in that area.
150 users aren't many. Don't worry about scalability until you have a workload of paying customers. When that happens you can add MySQL server RAM, partitions, solid-state disks, replication, memcached, sharding, and all that other expensive and high-workload stuff. If you add those things before you go live, you'll just take longer and blow more money before you go live. Not good.

Virtual Segregation of Data in Multi-tenant MySQL Database

This is more of a conceptual question so variations on the stack are welcome should they be capable of accomplishing the same concept. We're currently on MySQL and expanding some services out into MongoDB.
The idea is that we would like to be able to manage a single physical database schema/structure, so that adjustments, expansions, etc. don't become overly cumbersome as the number of clients using the structure grows into the thousands, tens of thousands, hundreds of thousands, etc. However, we would like to segregate their data at this level, rather than simply at the application layer, to provide a more rigid separation. Is it possible to create virtual bins for each client using the same structure, but have their data structurally separated from one another?
The normal way would obviously be adding client keys to every row of data, either directly or via foreign relationships, but given that we can't foresee with 20/20 vision how our system might be hacked to allow "cross-client" data retrieval, I wanted to go a little further and embed the separation at a virtually structural level.
I've also read another post here: MySQL: how to do row-level security (like Oracle's Virtual Private Database)?, which uses views as a method, but this seems to become more work as the list of clients grows.
Thanks!
---- EDIT ----
Based on some of the literature suggested below, here's a little more info on our intent:
The closest situation of the three outlined in the MSDN article provided by @Stennie would be single database, multiple schemas; however, the difference is that we're not interested in customizing client schemas after their creation - we would actually prefer they remain locked to the parent/master schema.
Ideally the solution would keep each schema linked to the parent table-set structure, rather than simply duplicating it, with the hope that any change to the parent/master schema would be cascaded across all client/tenant schemas.
Taking it a step further: in a cluster we could have a single master with the master schema, and each slave replicating from it, but with a sharded set of tenants. Changes to the master could then be filtered down through the cluster without interruption, maintaining consistency across all instances and also allowing us to update the application layer faster, knowing that all DBs are compatible with the updated schemas.
Hope that makes sense, I'm still a little fresh at this level.
There are a few common infrastructure approaches ranging from "share nothing" (aka multi-instance) to "share everything" (aka multi-tenant).
For example, a straightforward approach to your "virtual bins" would be to allocate a database per client using shared database servers. This is somewhere in between the two sharing extremes, as your customers would be sharing database server infrastructure but keeping their data and schema separate (see the sketch after the list below).
A database-per-client approach would allow you to:
manage authentication and access per client using the database's authentication & access controls
support different database software (you mention using both MySQL, which supports views, and MongoDB, which does not)
more easily backup and restore data per client
avoid potential cross-client leakage at a database level
avoid excessive table growth and related management issues for a single massive database
Some potential downsides would include:
having more databases to manage
in the case of a database where you want to enforce a certain schema (e.g. MySQL), you will need to apply the schema changes across all your databases or support some form of versioning
in the case of a database which preallocates storage (i.e. MongoDB) you may use more storage per client (particularly if your actual data size is small)
you may run into limits on namespaces or open files
you still have to worry about application and data security :)
If you do some research on multi-tenancy you will find some other solutions ranging from this example (isolated DB per client on shared database server architecture) through to more complex partitioned data schemes.
This Microsoft article includes a useful overview of approaches and considerations: Multi-tenant SaaS database tenancy patterns.

One big database, or one per client?

I've been asked to develop an application that will be rolled out to a number of business units. The application will be basically the same for each unit, but will have minor procedural differences which won't change the structure of the underlying database. Should I use one database per business unit, or one big database for all the units? The business units are totally separate.
My preference is for one database per client. The advantages:
if a client gets too big, they're easy to move - backup, restore, change the connection string, boom. Try doing that when their data is mixed in with others' in a massive database. Even if you use schemas and filegroups to segregate, moving them is no cakewalk.
ditto for deleting a client's data when they move on.
by definition you're keeping each client's data separate. This is often going to be a want, and sometimes a need. Sometimes it will even be legally binding.
all of your code within a database is simpler - it doesn't have to include the client's schema (which can't be parameterized), and your tables don't have to be littered with an extra column indicating the client (see the sketch after this list).
A lot of people will claim that managing 200 or 500 databases is a lot harder than managing 10 databases. It's not really any different, in my experience. You build scripts that automate things, you stagger index maintenance and backup jobs, etc.
The potential disadvantages appear when you get up into the realm of four-digit and higher database counts per instance, where you want to start thinking about having multiple servers (the threshold really depends on the workload and the hardware, so I'm just picking a number). If you build the system right, adding a second server and putting new databases there should be quite simple. Again, the app should be aware of each client's connection string, and all you're doing by using different servers is changing the instance the connection string points to.
Some questions over on dba.SE you should look at. They're not all about SQL Server, but many of the concepts and challenges are universal:
https://dba.stackexchange.com/questions/16745/handling-growing-number-of-tenants-in-multi-tenant-database-architecture
https://dba.stackexchange.com/questions/5071/what-are-the-performance-implications-of-running-multiple-smaller-dbs-instead-of
https://dba.stackexchange.com/questions/7924/one-big-database-vs-several-smaller-ones
Your question is a design question. In order to answer it, you need to understand the requirements of the system that you want to build. From a technical perspective, SQL Server -- or really any database -- can handle either scenario.
Here are some things to think about.
The first question is how separate your clients need the data to be. Mixing data together from different business units may not be legal in some cases (say, the investment side of a bank and the market analysis side). In such situations, separate databases are the solution.
The next question is security. In some situations, clients might be very uncomfortable knowing that their data is intermixed with other clients' data. A small slip-up, and confidential information is inadvertently shared. This is probably not an issue for different business units in the same company.
Do you have to deal with different uptime requirements, upload requirements, customizations, and perhaps interaction with other tools? If one business unit needs customizations ASAP that other business units are not interested in, that suggests different databases.
Another consideration is performance. Does this application use a lot of expensive resources? If so, being able to partition the application on different databases -- and potentially different servers -- may be highly desirable.
On the other hand, if much of the data is shared, and the repository is really a central repository with the same underlying functionality, then one database is a good choice.