Virtual Segregation of Data in Multi-tenant MySQL Database

This is more of a conceptual question so variations on the stack are welcome should they be capable of accomplishing the same concept. We're currently on MySQL and expanding some services out into MongoDB.
The idea is that we would like to manage a single physical database schema/structure so that adjustments, expansions, etc. don't become overly cumbersome as the number of clients using the structure grows into the thousands, tens of thousands, hundreds of thousands, and beyond. However, we would like to segregate their data at this level, rather than only at the application layer, to provide a more rigid separation. Is it possible to create virtual bins for each client using the same structure, but have their data structurally separated from one another?
The normal way would obviously be adding Client Keys to every row of data either directly or via foreign relationships, but given that we can't foresee with 20/20 how hacks on our system might occur allowing "cross client" data retrieval, I wanted to go a little further to embed the separation at a virtually structural level.
I've also read another post here: MySQL: how to do row-level security (like Oracle's Virtual Private Database)? which uses "views" as a method but this seems to become more work the larger the list of clients.
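To make the "client key plus views" idea concrete, here is a minimal sketch in Python. It uses SQLite as a stand-in for MySQL, and the `orders` table, the `client_id` column, and the `orders_tenant_N` view names are all assumptions made up for illustration. It generates one filtered view per client in a loop, which is one way to keep the per-client work from growing by hand.

```python
import sqlite3

def create_tenant_views(conn, tenant_ids):
    """Create one filtered view per tenant over the shared `orders` table."""
    for tenant_id in tenant_ids:
        # int() guards both the view name and the literal, because view
        # definitions cannot take bound parameters.
        tid = int(tenant_id)
        conn.execute(
            f"CREATE VIEW IF NOT EXISTS orders_tenant_{tid} AS "
            f"SELECT id, item FROM orders WHERE client_id = {tid}"
        )

# Hypothetical shared table: every row carries its client key.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, client_id INTEGER NOT NULL, item TEXT)"
)
conn.executemany(
    "INSERT INTO orders (client_id, item) VALUES (?, ?)",
    [(1, "widget"), (2, "gadget"), (1, "gizmo")],
)

create_tenant_views(conn, [1, 2])

# Code acting for tenant 1 reads only through that tenant's view.
rows_for_tenant_1 = conn.execute(
    "SELECT item FROM orders_tenant_1 ORDER BY id"
).fetchall()
```

In a real MySQL deployment the views would only add separation if each client's database user were granted access to its own view and not to the base table; that part is not shown here.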
Thanks!
---- EDIT ----
Based on some of the literature suggested below, here's a little more info on our intent:
The closest of the three situations outlined in the MSDN article provided by @Stennie would be a single database with multiple schemas. The difference is that we're not interested in customizing client schemas after their creation; we would actually prefer they remain locked to the parent/master schema.
Ideally the solution would keep each schema linked to the parent table-set structure, rather than simply duplicating it, in the hope that any change to the parent or master schema would cascade across all client/tenant schemas.
Taking it a step further, in a cluster we could have a single master with the master schema, and each slave replicating from it but with a sharded set of tenants. Changes to the master could then be filtered down through the cluster without interruption and would maintain consistency across all instances, also allowing us to update the application layer faster knowing that all DBs are compatible with the updated schemas.
Hope that makes sense, I'm still a little fresh at this level.

There are a few common infrastructure approaches ranging from "share nothing" (aka multi-instance) to "share everything" (aka multi-tenant).
For example, a straightforward approach to your "virtual bins" would be to allocate a database per client using shared database servers. This is somewhere in between the two sharing extremes, as your customers would be sharing database server infrastructure but keeping their data and schema separate.
A database-per-client approach would allow you to:
manage authentication and access per client using the database's authentication & access controls
support different database software (you mention using both MySQL which supports views, and MongoDB which does not)
more easily backup and restore data per client
avoid potential cross-client leakage at a database level
avoid excessive table growth and related management issues for a single massive database
Some potential downsides would include:
having more databases to manage
in the case of a database where you want to enforce a certain schema (e.g. MySQL) you will need to apply the schema changes across all your databases or support some form of versioning
in the case of a database which preallocates storage (e.g. MongoDB) you may use more storage per client (particularly if your actual data size is small)
you may run into limits on namespaces or open files
you still have to worry about application and data security :)
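As a rough illustration of the database-per-client approach, the sketch below routes each client to its own database. It is written in Python with SQLite files standing in for per-client MySQL databases; `DATA_DIR`, `connect_for_client`, and the `notes` table are hypothetical names, not part of any real setup.

```python
import os
import re
import sqlite3
import tempfile

DATA_DIR = tempfile.mkdtemp()  # hypothetical storage location
_CLIENT_NAME = re.compile(r"[a-z0-9_]+")

def connect_for_client(client):
    """Open the database that belongs to exactly one client."""
    # Validate the name so it cannot escape DATA_DIR or name another file.
    if not _CLIENT_NAME.fullmatch(client):
        raise ValueError(f"invalid client name: {client!r}")
    conn = sqlite3.connect(os.path.join(DATA_DIR, f"{client}.db"))
    conn.execute(
        "CREATE TABLE IF NOT EXISTS notes (id INTEGER PRIMARY KEY, body TEXT)"
    )
    return conn

acme = connect_for_client("acme")
with acme:  # commit the insert
    acme.execute("INSERT INTO notes (body) VALUES (?)", ("acme only",))

globex = connect_for_client("globex")

# Same schema, physically separate data.
acme_count = acme.execute("SELECT COUNT(*) FROM notes").fetchone()[0]
globex_count = globex.execute("SELECT COUNT(*) FROM notes").fetchone()[0]
```

With a server database the same routing step would pick a database name and credentials per client instead of a file path, so that the server's own access controls enforce the separation.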
If you do some research on multi-tenancy you will find some other solutions ranging from this example (isolated DB per client on shared database server architecture) through to more complex partitioned data schemes.
This Microsoft article includes a useful overview of approaches and considerations: Multi-tenant SaaS database tenancy patterns.

Related

mySQL performance one huge database vs small many

I am developing a site that has many subdomains in it.
It has blogging module, management system, and many more. I have shared this question in various sites but couldn't get a proper reply.
The question is: should I use one database for all the modules, which means my database would have nearly 100 tables? Is this approach appropriate, or should I create a separate database for every module?
Well, it does not really matter.
If you use innodb with single data file (innodb_file_per_table setting is not enabled), then all data will be stored in a single file anyway.
With innodb separate file per table mode or with myisam table engine, the only difference between one or multiple databases is really the directory where the database files are stored. Unless the directories (databases) are located in different storage devices with different speeds, their performance will be the same.
There can be 2 reasons to keep some tables in a different database:
Security: MySQL did not support role-based access control before version 8.0. Therefore, if there is a group of tables that should be accessible by a certain group of users only, then the access control is more manageable if those tables are in a different database.
If some of the modules you mentioned happen to use the same table name, then you will have to move them to a separate database or you need to modify the code and table names to avoid errors.
There is no right or wrong way to design a system. Just advantages and disadvantages to the various techniques. I normally work in Oracle and SQL Server so I had to look up some terms for MySQL. According to my research, in MySQL a database is synonymous with a schema which changes things. I'd consider these things when planning the physical design for any vendor:
Security - Do all subdomains need read/write to each other? How are the users secured? Choosing one or many schemas can impact how easy schema and user security is to manage and control.
Growth - Do some subdomains grow at a faster rate than others? If yes, I'd consider separating them to allow for the different growth rates.
Organization - Is it easier to identify the different subdomains in practice if they're separated? If you don't separate them, use a strong naming convention so you can easily identify objects within one subdomain.
Linking - How easy is it to access one schema/database from another?
Hope this helps.

SQLite3 database per customer

Scenario:
Building a commercial app consisting of a RESTful backend with symfony2 and a frontend in AngularJS
This app will never be used by many customers (if I get to sell 100 that would be fantastic. Hopefully much more, but in any case it won't be massive)
I want to have a multi tenant structure for the database with one schema per customer (they store sensitive information for their customers)
I'm aware of problem when updating schemas but I will have to live with it.
Today I have a MySQL demo database that I will clone each time a new customer purchases the app.
There is no relationship between my customers, so I don't need to communicate with multiple shards for any query
For one customer, they can be using the app from several devices at the time, but there won't be massive write operations in the db
My question
Trying to set some functional tests for the backend API I read about having a dedicated sqlite database for loading testing data, which seems to be good idea.
However I wonder if it's also a good idea to switch from MySQL to SQLite3 database as my main database support for the application, and if it's a common practice to have one dedicated SQLite3 database PER CLIENT. I've never used SQLite and I have no idea if the process of updating a schema and replicate the changes in all the databases is done in the same way as for other RDBMS
Is this a correct scenario for SQLite?
Any suggestion (aka tutorial) in how to achieve this?
[I wonder] if it's a common practice to have one dedicated SQLite3 database PER CLIENT
Only if the database is deployed along with the application, like on a phone. Otherwise I've never heard of such a thing.
I've never used SQLite and I have no idea if the process of updating a schema and replicate the changes in all the databases is done in the same way as for other RDBMS
SQLite is a SQL database and responds to ALTER TABLE and the like. As for updating all the schemas, you'll have to re-run the update for all schemas.
Schema synching is usually handled by an outside utility, usually your ORM will have something. Some are server agnostic, some only support specific servers. There are also dedicated database change management tools such as Sqitch.
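For a sense of what "re-run the update for all schemas" can look like without an ORM, here is a minimal sketch in Python. It uses SQLite's `PRAGMA user_version` to remember how far each database has been migrated; the `MIGRATIONS` list, the `migrate` function, and the tenant names are assumptions for illustration, not how any particular tool works.

```python
import sqlite3

# Ordered schema changes; position in the list + 1 is the schema version.
MIGRATIONS = [
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)",
    "ALTER TABLE customers ADD COLUMN email TEXT",
]

def migrate(conn):
    """Apply the migrations this database has not seen yet; return its version."""
    current = conn.execute("PRAGMA user_version").fetchone()[0]
    for version, statement in enumerate(MIGRATIONS[current:], start=current + 1):
        conn.execute(statement)
        # Record progress so a re-run skips what is already applied.
        conn.execute(f"PRAGMA user_version = {version}")
        conn.commit()
    return conn.execute("PRAGMA user_version").fetchone()[0]

# One (in-memory) database per hypothetical customer.
tenant_dbs = {name: sqlite3.connect(":memory:") for name in ("alpha", "beta")}

# The same loop is run over every customer database.
versions = {name: migrate(conn) for name, conn in tenant_dbs.items()}
```

Running `migrate` a second time on the same database is a no-op, which is the property that makes it safe to loop over every customer database on each deploy.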
However I wonder if it's also a good idea to switch from MySQL to SQLite3 database as my main database support for the application, and
SQLite's main advantage is not requiring you to install and run a server. That makes sense for quick projects or where you have to deploy the database, like a phone app. For server based application there's no problem having a database server. SQLite's very restricted set of SQL features becomes a disadvantage. It will also likely run slower than a server database for anything but the simplest queries.
Trying to set some functional tests for the backend API I read about having a dedicated sqlite database for loading testing data, which seems to be good idea.
Under no circumstances should you test with a different database than the production database. Databases do not all implement SQL the same, MySQL is particularly bad about this, and your tests will not reflect reality. Running a MySQL instance for testing is not much work.
This separate schema thing claims three advantages...
Extensibility (you can add fields whenever you like)
Security (a query cannot accidentally show data for the wrong tenant)
Parallel Scaling (you can potentially split each schema onto a different server)
What they're proposing is equivalent to having a separate, customized copy of the code for every tenant. You wouldn't do that, it's obviously a maintenance nightmare. Code at least has the advantage of version control systems with branching and merging. I know only of one database management tool that supports branching, Sqitch.
Let's imagine you've made a custom change to tenant 5's schema. Now you have a general schema change you'd like to apply to all of them. What if the change to 5 conflicts with this? What if the change to 5 requires special data migration different from everybody else? Now let's imagine you've made custom changes to ten schemas. A hundred. A thousand? Nightmare.
Different schemas will require different queries. The application will have to know which schema each tenant is using, there will have to be some sort of schema version map you'll need to maintain. And every different possible query for every different possible schema will have to be maintained in the application code. Nightmare.
Yes, putting each tenant in a separate schema is more secure, but that only protects against writing bad queries or including a query builder (which is a bad idea anyway). There are better ways to mitigate the problem, such as the view filter suggested in the docs. There are many other ways an attacker can access tenant data that this doesn't address: gain a database connection, gain access to the filesystem, sniff network traffic. I don't see the small security gain being worth the maintenance nightmare.
As for scaling, the article is ten years out of date. There are far, far better ways to achieve parallel scaling than to coarsely put schemas on different servers. There are entire databases dedicated to this idea. Fortunately, you don't need any of this! Scaling won't be a problem for you until you have tens of thousands to millions of tenants. The idea of front loading your design with a schema maintenance nightmare for a hypothetical big parallel scaling problem is putting the cart so far before the horse, it's already at the pub having a pint.
If you want to use a relational database I would recommend PostgreSQL. It has a very rich SQL implementation, it's fast and scales well, and it has something that renders this whole idea of separate schemas moot: a built-in JSON type. This can be used to implement the "extensibility" mentioned in the article. Each table can have a meta column using the JSON type that you can throw any extra data you like into. The application does not need special queries, the meta column is always there. PostgreSQL's JSON operators make working with the meta data very easy and efficient.
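The meta column idea can be sketched as follows. This is only an approximation in Python: it uses SQLite with the JSON stored as text and decoded in application code, whereas PostgreSQL would use its JSON type and operators in the query itself. The `contacts` table and the helper names are hypothetical.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# One shared schema for every tenant; per-tenant extras live in `meta`.
conn.execute(
    "CREATE TABLE contacts ("
    "id INTEGER PRIMARY KEY, tenant_id INTEGER NOT NULL, "
    "name TEXT NOT NULL, meta TEXT NOT NULL DEFAULT '{}')"
)

def add_contact(tenant_id, name, **extra):
    """Store the fixed columns plus any tenant-specific extras as JSON."""
    conn.execute(
        "INSERT INTO contacts (tenant_id, name, meta) VALUES (?, ?, ?)",
        (tenant_id, name, json.dumps(extra)),
    )

def get_contacts(tenant_id):
    """Same query for every tenant; the extras are merged in after decoding."""
    rows = conn.execute(
        "SELECT name, meta FROM contacts WHERE tenant_id = ? ORDER BY id",
        (tenant_id,),
    )
    return [{"name": name, **json.loads(meta)} for name, meta in rows]

add_contact(1, "Ada", loyalty_tier="gold")   # tenant 1 tracks loyalty tiers
add_contact(2, "Bob", vat_number="X123")     # tenant 2 tracks VAT numbers
```

The point of the design is that neither tenant needed a schema change or a special query to get its own extra field.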
You could also look into a NoSQL database. There are plenty to choose from and many support custom schemas and parallel scaling. However, it's likely you will have to change your choice of framework to use one that supports NoSQL.

Best database model for saas application (1 db per account VS 1 db for everyone)

Little question, I'm developing a saas software (erp).
I designed it with 1 database per account for these reasons :
I make a lot of personalisation, and need to add specific table columns for each account.
Easier to manage db backup (and reload data !)
Less risky : sometimes I need to run SQL queries on a table, in case of an error with bad query (update / delete...), only one customer is affected instead of all of them.
Bad point: I'm ending up with hundreds of databases...
I'm hiring a company to manage my servers, and they said that it's better to have only one database, with a few tables, and put all data in the same tables with a column such as id_account. I'm very surprised by this, so I'm wondering: what are your thoughts?
Thanks !
Frederic
In the environment I currently work in, we handle millions of records from numerous clients. Our solution is to use schemas to segregate each individual client. A schema allows you to partition your clients into separate virtual databases while inside a single db. Each schema will have an exact copy of the tables from your application.
The upside:
Segregated client data
data from a single client can be easily backed up, exported or deleted
Programming is still the same, but you have to select the schema before db calls
Moving clients to another db or standalone server is a lot easier
adding specific tables per client is easier (see below)
single instance of the database running
tuning the db affects all tenants
The downside:
Unless you manage your shared schema properly, you may duplicate data
Migrations are repeated for every schema
You have to remember to select the schema before db calls
hard pressed to add many negatives... I guess I may be biased.
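The "select the schema before db calls" step might look something like the sketch below. It is written in Python and uses SQLite's `ATTACH DATABASE` as a stand-in for real schemas; in PostgreSQL this step would more typically be a `SET search_path` per connection. The schema names, the `invoices` table, and the `tenant_table` helper are assumptions for illustration.

```python
import sqlite3

TENANT_SCHEMAS = {"tenant_a", "tenant_b"}  # hypothetical allow-list

conn = sqlite3.connect(":memory:")
for schema in sorted(TENANT_SCHEMAS):
    # Each attached in-memory database plays the role of one tenant schema,
    # holding an exact copy of the application's tables.
    conn.execute(f"ATTACH DATABASE ':memory:' AS {schema}")
    conn.execute(
        f"CREATE TABLE {schema}.invoices (id INTEGER PRIMARY KEY, total REAL)"
    )

def tenant_table(schema, table):
    """Return a schema-qualified table name, refusing unknown schemas."""
    # Identifiers cannot be bound as parameters, so check against the list.
    if schema not in TENANT_SCHEMAS:
        raise ValueError(f"unknown tenant schema: {schema!r}")
    return f"{schema}.{table}"

conn.execute(
    f"INSERT INTO {tenant_table('tenant_a', 'invoices')} (total) VALUES (?)",
    (42.0,),
)

count_a = conn.execute(
    f"SELECT COUNT(*) FROM {tenant_table('tenant_a', 'invoices')}"
).fetchone()[0]
count_b = conn.execute(
    f"SELECT COUNT(*) FROM {tenant_table('tenant_b', 'invoices')}"
).fetchone()[0]
```

Putting the schema choice behind one helper is one way to reduce the "you have to remember" risk listed among the downsides.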
Adding Specific Tables: Why would you add client specific tables if this is SAAS and not custom software? Better to use a Postgres DB with a Hstore field and store as much searchable data as you like.
Schemas are ideal for multi-tenant databases.
A lot of what I am telling you depends on your software stack, the capabilities of your developers and the backend db you selected (all of which you neglected to mention)
Your hardware guys should not decide your software architecture. If they do, you are likely shooting yourself in the leg before you even get out of the gate. Get a good senior software architect, the grief they will save you, will likely save your business.
I hope this helps...
Bonne Chance

Can relational database scale horizontally

After some googling I have found:
Note from mysql docs:
MySQL Cluster automatically shards (partitions) tables across nodes,
enabling databases to scale horizontally on low cost, commodity
hardware to serve read and write-intensive workloads, accessed both
from SQL and directly via NoSQL APIs.
Can relational database be horizontal scaling? Will it be somehow based on NoSQL database?
Do someone have any real world example?
How can I manage sql requests, transactions, and so on in such database?
It is possible, but it takes a lot of maintenance effort. Explanation:
Vertical scaling of data (synonymous with normalisation in SQL databases) refers to splitting data column-wise into multiple tables in order to reduce redundant storage. [Figure: example of a user table split column-wise]
Horizontal scaling of data (synonymous with sharding) refers to splitting data row-wise into multiple tables in order to reduce the time taken to fetch data. [Figure: example of a user table split row-wise]
The key point to note is that tables in SQL databases are normalised into multiple tables of related data. To shard such a table across multiple machines, you need to shard the related normalised data accordingly, which increases the maintenance effort. For example, in an SQL database, a Customer table has a one-to-many relation with an Order table.
If you move some rows of customer data onto another machine (sharding), you also need to move the related order data onto the same machine, which becomes troublesome when there are multiple related tables.
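The customer/order point can be shown with a toy model. The sketch below is plain Python with in-memory lists standing in for machines; `NUM_SHARDS`, `shard_for`, and the record layout are made up for illustration. Orders are routed by their customer's id, so a customer and its orders always sit on the same shard.

```python
import zlib

NUM_SHARDS = 2
# Each dict stands in for one machine holding a slice of both tables.
shards = [{"customers": [], "orders": []} for _ in range(NUM_SHARDS)]

def shard_for(customer_id):
    """Map a customer id to a shard index with a stable hash."""
    return zlib.crc32(str(customer_id).encode()) % NUM_SHARDS

def add_customer(customer_id, name):
    shards[shard_for(customer_id)]["customers"].append(
        {"id": customer_id, "name": name}
    )

def add_order(order_id, customer_id, item):
    # Routed by customer_id (not order_id) so the order lands with its customer.
    shards[shard_for(customer_id)]["orders"].append(
        {"id": order_id, "customer_id": customer_id, "item": item}
    )

def orders_with_customer(customer_id):
    """A customer/order 'join' that only needs to visit one shard."""
    shard = shards[shard_for(customer_id)]
    names = {c["id"]: c["name"] for c in shard["customers"]}
    return [
        (names[o["customer_id"]], o["item"])
        for o in shard["orders"]
        if o["customer_id"] == customer_id
    ]

add_customer(1, "Ann")
add_customer(2, "Ben")
add_order(10, 1, "book")
add_order(11, 2, "lamp")
add_order(12, 1, "pen")
```

Every additional table related to Customer would need the same routing rule, which is the maintenance cost the answer describes.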
It's convenient for NoSQL databases to shard, as they follow a flat table structure (data is stored in aggregated form rather than normalised form).
I think the answer is, unequivocally, yes. You have to keep in mind that SQL is simply a data access language. There is absolutely no reason why it can't be extended across multiple computers and network partitions. Is it a challenging problem? Most certainly, and that's why software that does it is in its infancy.
Now, I think what you are trying to ask is "Can all features that I am familiar with and that arrive in a standard SQL-type relational database management system be developed to work with multiple servers in this manner?" While I admit I haven't studied the problem in depth, there are theorems out there that say "No, it cannot." The CAP (Consistency, Availability, Partition tolerance) theorem posits that we cannot have all three qualities at once.
Now, for all practical purposes, "sharding" or "partitioning" or whatever you want to call it is not going away; to the contrary. This means that, given the degree to which CAP theorem holds, we are going to have to shift the way we think about databases, and how we interact with them (at least, to an extent). Many developers have already made the shift necessary to be successful on a No-SQL platform, but many more have not. Ultimately, sufficient maturity of the model and effective enough workarounds will be developed that traditional SQL databases, in the sense you refer, will be more or less practical across multiple machines. This is already starting to pan out, and I would say give it a few more years and we'll be to that point. Or we'll have collectively shifted thinking to the point where it is no longer necessary, and the world will be a better place. :)
Thanks for the question and answer. I was trying to explain this to someone like this:
In terms of the CAP theorem, you can't have all three. So when a partition (network or server failure) occurs:
A relational database on a single server gives you C (consistency). So when a P (partition: server/network failure) occurs, you can't have A (availability: the db goes down).
With a NoSQL datastore, if you want A when a P occurs, you can't have C (one or more of your replicated partitions will be out of sync until the network comes back and they all sync up). So it will only be eventually consistent.
EDITED #2: to provide more perspective based on the comment below by Manish. My intention is to explain by example why you can't have all 3. As noted below in the comments, there are other dbs where you can have C when P occurs at the expense of A.
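The trade-off can be played out in a toy model. The sketch below is plain Python and does not describe any real database: `Cluster`, `Replica`, and the `"CP"`/`"AP"` modes are names invented for the example, and healing a partition by copying replica 0's value is a deliberate simplification.

```python
class Replica:
    def __init__(self):
        self.value = None

class Cluster:
    """Two replicas of one value; `mode` picks the behaviour during a partition."""

    def __init__(self, mode):
        assert mode in ("CP", "AP")
        self.mode = mode
        self.replicas = [Replica(), Replica()]
        self.partitioned = False

    def write(self, replica_index, value):
        """Return True if the write was accepted."""
        if not self.partitioned:
            for replica in self.replicas:  # normal case: update both copies
                replica.value = value
            return True
        if self.mode == "CP":
            return False  # refuse the write: stay consistent, lose availability
        # AP: accept on the reachable replica only, so the copies diverge.
        self.replicas[replica_index].value = value
        return True

    def values(self):
        return [replica.value for replica in self.replicas]

    def heal(self):
        """End the partition; simplification: everyone adopts replica 0's value."""
        self.partitioned = False
        for replica in self.replicas:
            replica.value = self.replicas[0].value

cp, ap = Cluster("CP"), Cluster("AP")
for cluster in (cp, ap):
    cluster.write(0, "a")
    cluster.partitioned = True
```

Writing to `cp` during the partition is rejected, while writing to `ap` succeeds but leaves the two replicas disagreeing until `heal()` is called, which is the "eventually consistent" case described above.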
Google Spanner is an example of a relational database that can scale horizontally. Sharding and replication are done automatically so no need to worry about that. For more information please check out this paper.
Yes it can. It is called NewSQL.
NewSQL is a new approach to relational databases that wants to combine transactional ACID (atomicity, consistency, isolation, durability) guarantees of good ol’ RDBMSs and the horizontal scalability of NoSQL. Source
Examples for Databases:
User-Shared MySQL Cluster
Citus (PostgreSQL extension)
CockroachDB
Azure Cosmos DB
Google Spanner
NuoDB
Vitess
Splice Machine (part of Hadoop ecosystem)
MemSQL (in memory store)
VoltDB (in memory store)
Examples for Data Warehouses:
IBM Netezza
Oracle
Teradata
Hive Engine (part of Hadoop ecosystem)
Spark SQL (part of Hadoop ecosystem)
Yes, but the data needs to be migrated as storage grows.
Some open source tools can support the feature, for example: Vitess or Apache ShardingSphere.

How to Think About a Relational Database on the Web

I've been doing some simple web programming using python, and I have a basic understanding of most of the parts involved in generating and serving web pages. However, I have only a tenuous grasp on the use of Relational Databases as a way to store and retrieve data. I do understand the basics of SQL queries and database design, but am having trouble understanding what I should be doing to allow for concurrent access (among other things).
With that in mind I have a couple fairly specific questions. However, for each question, I'm only partially interested in the answer to the question itself. I'm mostly interested in whether or not I'm asking these questions in the right way. So here it goes:
When using a relational database, how do you ensure that multiple threads don't interfere with each other while writing to the database?
Could having multiple threads accessing a database create a situation in which the data they are reading are out of sync?
How should I manage permissions to read/write from a database?
Are there things that don't belong in a database (images, large chunks of text)?
I'd love any commentary on these specific questions, or a pointer to any resource that describes the correct way of thinking about using a relational database on the web.
A lot of your concerns are abstracted away by a DBMS. You don't generally need to stress about the thread/concurrency-related stuff. What you can do is group inserts/updates/queries into transactions to make them atomic and ensure that all or nothing happens. Such transactions can be rolled back if, for example, they are interfered with partway through.
You don't mention what DB you use, but here is a small DB-agnostic intro to transaction. Of course you should also check out official documentation for your database.
http://www.sqlteam.com/article/introduction-to-transactions
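Since the question mentions Python, here is a minimal sketch of the all-or-nothing behaviour using the standard library's `sqlite3` module. The `accounts` table, the `transfer` function, and the CHECK constraint are assumptions chosen to force a failure halfway through; other databases and drivers have their own transaction APIs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()

def transfer(amount, src, dst):
    """Move money atomically; return False if the transfer was rolled back."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # If this second step violates the CHECK constraint,
            # the credit above is undone as well.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False

def balances():
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

A call such as `transfer(500, "alice", "bob")` fails on the debit, and the earlier credit to bob is rolled back with it, so the balances are left exactly as they were.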
As far as 'what things don't belong in a database', images and large chunks of text are fine. You can store binary blobs, you can store code if it makes sense for what you're doing. One thing I'd suggest is that you consider whether it is in your interest to directly store images in the DB or to store paths/filenames for files sitting on your server instead.
what I should be doing to allow for concurrent access
You let the database handle that, it's what it is designed for.
When using a relational database, how do you ensure that multiple threads don't interfere with each other while writing to the database?
The database will handle this. Sometimes this will mean that one of the queries will abort in order to avoid a deadlock. You need to detect this in your code.
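Detecting an aborted query usually means catching the driver's error and trying again. The sketch below shows that pattern in Python under stated assumptions: `DeadlockError` is a hypothetical stand-in for whatever exception your database driver actually raises, and `flaky_transaction` only simulates a transaction that is chosen as the deadlock victim twice.

```python
import time

class DeadlockError(Exception):
    """Hypothetical stand-in for a driver's deadlock/serialization error."""

def run_with_retry(transaction, attempts=3, base_delay=0.01):
    """Run `transaction()`; on DeadlockError back off and try again."""
    for attempt in range(1, attempts + 1):
        try:
            return transaction()
        except DeadlockError:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the failure
            # Exponential backoff gives the competing transaction time to finish.
            time.sleep(base_delay * 2 ** (attempt - 1))

calls = {"n": 0}

def flaky_transaction():
    """Simulated unit of work that is aborted on its first two runs."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise DeadlockError("chosen as deadlock victim")
    return "committed"

result = run_with_retry(flaky_transaction)
```

The function passed in should contain the whole transaction, since the database has already rolled back everything the aborted attempt did.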
Could having multiple threads accessing a database create a situation in which the data they are reading are out of sync?
Yes, this is possible. Not much you can do about it - it is a consequence of multiple threads reading/writing the same data. There are synchronization commands that you can use, but these can have an effect on performance.
How should I manage permissions to read/write from a database?
Through the database's security mechanisms, whatever they are.
Are there things that don't belong in a database (images, large chunks of text)?
Large files, though even that depends on the application. Store application data in your database.
I would not expose a database directly to the web; I'd have a middle tier between clients and database to handle things like authentication and authorization, validation and binding, synchronization and isolation for database access, etc.
This would have the added benefit of letting me scale by adding more middle tier hardware.