Scenario:
Building a commercial app consisting of a RESTful backend in Symfony2 and a frontend in AngularJS.
This app will never be used by many customers (if I manage to sell 100 that would be fantastic; hopefully more, but in no case will it be massive).
I want a multi-tenant structure for the database, with one schema per customer (they store sensitive information about their own customers).
I'm aware of the problems this causes when updating schemas, but I will have to live with that.
Today I have a MySQL demo database that I clone each time a new customer purchases the app.
There are no relationships between my customers, so I never need to query across multiple shards.
A single customer may use the app from several devices at a time, but there won't be heavy write traffic on the db.
My question
While setting up functional tests for the backend API, I read about using a dedicated SQLite database to load test data, which seems like a good idea.
However, I wonder whether it's also a good idea to switch from MySQL to SQLite3 as the main database for the application, and whether it's common practice to have one dedicated SQLite3 database PER CLIENT. I've never used SQLite, and I have no idea whether updating a schema and replicating the change across all the databases works the same way as in other RDBMSs.
Is SQLite a good fit for this scenario?
Any suggestions (e.g. a tutorial) on how to achieve this?
[I wonder] whether it's common practice to have one dedicated SQLite3 database PER CLIENT
Only if the database is deployed along with the application, like on a phone. Otherwise I've never heard of such a thing.
I've never used SQLite, and I have no idea whether updating a schema and replicating the change across all the databases works the same way as in other RDBMSs
SQLite is a SQL database and supports ALTER TABLE and the like (though its ALTER TABLE is more limited than most). As for updating all the schemas: yes, you'll have to re-run the update against every one of them.
Schema syncing is usually handled by an outside utility; typically your ORM will have something. Some are server-agnostic, some only support specific servers. There are also dedicated database change management tools such as Sqitch.
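To make that concrete, here is a hedged sketch against the asker's current MySQL setup (the tenant schema and column names are invented):

    -- Apply the same change to every tenant schema, one by one
    ALTER TABLE tenant_001.orders ADD COLUMN shipped_at DATETIME NULL;
    ALTER TABLE tenant_002.orders ADD COLUMN shipped_at DATETIME NULL;
    -- ...and so on. In practice you would generate the statements,
    -- e.g. from information_schema:
    SELECT CONCAT('ALTER TABLE `', schema_name,
                  '`.orders ADD COLUMN shipped_at DATETIME NULL;')
    FROM information_schema.schemata
    WHERE schema_name LIKE 'tenant\_%';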
However, I wonder whether it's also a good idea to switch from MySQL to SQLite3 as the main database for the application
SQLite's main advantage is that it doesn't require you to install and run a server. That makes sense for quick projects, or where you have to deploy the database along with the application, as in a phone app. For a server-based application there's no problem running a database server, and SQLite's very restricted set of SQL features becomes a disadvantage. It will also likely be slower than a server database for anything but the simplest queries.
While setting up functional tests for the backend API, I read about using a dedicated SQLite database to load test data, which seems like a good idea.
Under no circumstances should you test against a different database than the one you run in production. Databases do not all implement SQL the same way (MySQL is particularly bad about this), so your tests would not reflect reality. Running a MySQL instance for testing is not much work.
This separate schema thing claims three advantages...
Extensibility (you can add fields whenever you like)
Security (a query cannot accidentally show data for the wrong tenant)
Parallel Scaling (you can potentially split each schema onto a different server)
What they're proposing is equivalent to keeping a separate, customized copy of the code for every tenant. You wouldn't do that; it's obviously a maintenance nightmare. Code at least has the advantage of version control systems with branching and merging. I know of only one database change management tool that supports branching: Sqitch.
Let's imagine you've made a custom change to tenant 5's schema. Now you have a general schema change you'd like to apply to all of them. What if the change to 5 conflicts with this? What if the change to 5 requires special data migration different from everybody else? Now let's imagine you've made custom changes to ten schemas. A hundred. A thousand? Nightmare.
Different schemas will require different queries. The application will have to know which schema each tenant is using, so you'll need to maintain some sort of schema version map. And every possible query for every possible schema will have to be maintained in the application code. Nightmare.
Yes, putting each tenant in a separate schema is more secure, but that only protects against badly written queries or against shipping a query builder (which is a bad idea anyway). There are better ways to mitigate the problem, such as the view filter suggested in the docs. And there are many other ways an attacker can reach tenant data that separate schemas don't address: gaining a database connection, gaining access to the filesystem, sniffing network traffic. I don't see the small security gain being worth the maintenance nightmare.
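For illustration, one common form of such a view filter in MySQL (a sketch only; the table, column, and account names are hypothetical):

    -- One shared table; each tenant connects with its own MySQL account
    CREATE VIEW my_orders AS
    SELECT order_id, placed_at, total
    FROM orders
    WHERE tenant_login = SUBSTRING_INDEX(CURRENT_USER(), '@', 1);

    -- Tenants get access to the view only, never the base table
    GRANT SELECT ON appdb.my_orders TO 'tenant42'@'%';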
As for scaling, the article is ten years out of date. There are far, far better ways to achieve parallel scaling than coarsely putting schemas on different servers; there are entire databases dedicated to this idea. Fortunately, you don't need any of it! Scaling won't be a problem for you until you have tens of thousands to millions of tenants. The idea of front-loading your design with a schema maintenance nightmare for a hypothetical big parallel-scaling problem is putting the cart so far before the horse, it's already at the pub having a pint.
If you want to use a relational database, I would recommend PostgreSQL. It has a very rich SQL implementation, it's fast and scales well, and it has something that renders this whole idea of separate schemas moot: a built-in JSON type. This can be used to implement the "extensibility" mentioned in the article. Each table can have a meta column of the JSON type into which you can throw any extra data you like. The application does not need special queries; the meta column is always there. PostgreSQL's JSON operators make working with the metadata easy and efficient.
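A minimal sketch of that pattern, assuming PostgreSQL 9.4+ for jsonb (the table and key names are made up):

    -- Fixed schema plus a free-form meta column
    CREATE TABLE customers (
        id   serial PRIMARY KEY,
        name text NOT NULL,
        meta jsonb NOT NULL DEFAULT '{}'
    );

    -- A tenant-specific extra field needs no ALTER TABLE
    INSERT INTO customers (name, meta)
    VALUES ('Acme', '{"loyalty_tier": "gold"}');

    -- Query the extra field with the JSON operators
    SELECT name FROM customers WHERE meta ->> 'loyalty_tier' = 'gold';

    -- An optional GIN index keeps such lookups fast
    CREATE INDEX customers_meta_idx ON customers USING gin (meta);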
You could also look into a NoSQL database. There are plenty to choose from and many support custom schemas and parallel scaling. However, it's likely you will have to change your choice of framework to use one that supports NoSQL.
Related
Quick question: I'm developing SaaS software (an ERP).
I designed it with one database per account for these reasons:
I do a lot of customisation and need to add specific table columns for each account.
It's easier to manage db backups (and reload data!).
It's less risky: I sometimes need to run SQL queries directly on a table, and if a bad query causes an error (update/delete...), only one customer is affected instead of all of them.
Bad point: I'm ending up with hundreds of databases...
I'm hiring a company to manage my servers, and they said it's better to have only one database with a few tables, putting all the data in the same tables with an id_account column. I'm very, very surprised by this, so I'm wondering... what are your thoughts?
Thanks !
Frederic
In the environment I currently work in, we handle millions of records from numerous clients. Our solution is to use schemas to segregate each individual client. A schema lets you partition your clients into separate virtual databases inside a single db. Each schema holds an exact copy of your application's tables.
The upside:
Segregated client data
Data from a single client can easily be backed up, exported or deleted
Programming stays the same, but you have to select the schema before db calls
Moving clients to another db or standalone server is a lot easier
Adding specific tables per client is easier (see below)
A single instance of the database to run
Tuning the db benefits all tenants
The downside:
Unless you manage your shared schema properly, you may duplicate data
Migrations are repeated for every schema
You have to remember to select the schema before db calls (see the sketch after this list)
I'm hard pressed to add many negatives... I guess I may be biased.
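To make "selecting the schema" concrete, here is a minimal PostgreSQL sketch (Postgres is assumed from the hstore suggestion below; the tenant and table names are invented):

    -- One schema per client, identical tables in each
    CREATE SCHEMA tenant_acme;
    CREATE TABLE tenant_acme.invoices (
        id    serial PRIMARY KEY,
        total numeric(10,2) NOT NULL
    );

    -- Before running the application's queries for this client:
    SET search_path TO tenant_acme;
    SELECT id, total FROM invoices;  -- resolves to tenant_acme.invoices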
Adding specific tables: why would you add client-specific tables if this is SaaS and not custom software? Better to use a Postgres DB with an hstore field and store as much searchable data as you like.
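A hedged sketch of that hstore suggestion (the table and attribute names are hypothetical):

    -- hstore ships with PostgreSQL as a contrib extension
    CREATE EXTENSION IF NOT EXISTS hstore;

    CREATE TABLE items (
        id    serial PRIMARY KEY,
        name  text NOT NULL,
        attrs hstore NOT NULL DEFAULT ''
    );

    INSERT INTO items (name, attrs)
    VALUES ('widget', 'color => "red", size => "XL"');

    -- Searchable per-client attributes without per-client tables
    SELECT name FROM items WHERE attrs -> 'color' = 'red';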
Schemas are ideal for multi-tenant databases.
A lot of what I'm telling you depends on your software stack, the capabilities of your developers, and the backend db you selected (none of which you mentioned).
Your hardware guys should not decide your software architecture. If they do, you are likely shooting yourself in the foot before you even get out of the gate. Get a good senior software architect; the grief they save you will likely save your business.
I hope this helps...
Bonne Chance
This is more of a conceptual question, so variations on the stack are welcome if they can accomplish the same concept. We're currently on MySQL and expanding some services into MongoDB.
The idea is that we would like to manage a single physical database schema/structure, so that adjustments, expansions, etc. don't become overly cumbersome as the number of clients grows into the thousands, tens of thousands, hundreds of thousands, etc. However, we would like to segregate their data at this level, rather than only at the application layer, to provide a more rigid separation. Is it possible to create virtual bins for each client that use the same structure, but keep their data structurally separated from one another?
The normal way would obviously be adding client keys to every row of data, either directly or via foreign relationships, but since we can't foresee every way our system might be hacked to allow "cross-client" data retrieval, I wanted to go further and embed the separation at a virtually structural level.
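For reference, a sketch of that "normal way" (all names here are mine, not the poster's; a clients table is assumed to exist):

    -- Every row carries the client key, enforced by constraints
    CREATE TABLE documents (
        client_id INT NOT NULL,
        doc_id    INT NOT NULL,
        body      TEXT,
        PRIMARY KEY (client_id, doc_id),
        FOREIGN KEY (client_id) REFERENCES clients (id)
    );
    -- Every query must then filter on client_id, which is exactly
    -- the application-layer discipline the question hopes to harden.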
I've also read another post here, MySQL: how to do row-level security (like Oracle's Virtual Private Database)?, which uses views as a method, but that seems to become more work as the list of clients grows.
Thanks!
---- EDIT ----
Based on some of the literature suggested below, here's a little more info on our intent:
The closest of the three situations outlined in the MSDN article provided by @Stennie would be single database, multiple schemas. The difference is that we're not interested in customizing client schemas after their creation; we would actually prefer they remain locked to the parent/master schema.
Ideally the solution would keep each schema linked to the parent table-set structure rather than simply duplicating it, so that any change to the parent/master schema cascades across all client/tenant schemas.
Taking it a step further, in a cluster we could have a single master with the master schema, and each slave replicating from it but holding a sharded set of tenants. Changes to the master could then filter down through the cluster without interruption, maintaining consistency across all instances and letting us update the application layer faster, knowing that all DBs are compatible with the updated schemas.
Hope that makes sense, I'm still a little fresh at this level.
There are a few common infrastructure approaches ranging from "share nothing" (aka multi-instance) to "share everything" (aka multi-tenant).
For example, a straightforward approach to your "virtual bins" would be to allocate a database per client using shared database servers. This is somewhere in between the two sharing extremes, as your customers would be sharing database server infrastructure but keeping their data and schema separate.
A database-per-client approach would allow you to:
manage authentication and access per client using the database's own authentication & access controls (see the sketch after this list)
support different database software (you mention using both MySQL, which supports views, and MongoDB, which does not)
more easily backup and restore data per client
avoid potential cross-client leakage at a database level
avoid excessive table growth and related management issues for a single massive database
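For instance, provisioning a new client in MySQL could look like this (a sketch; the client and account names are invented):

    -- One database and one database account per client
    CREATE DATABASE client_acme;
    CREATE USER 'acme_app'@'%' IDENTIFIED BY 'change-me';
    GRANT SELECT, INSERT, UPDATE, DELETE ON client_acme.* TO 'acme_app'@'%';
    -- These credentials can only ever see client_acme's data,
    -- which is the database-level isolation described above.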
Some potential downsides would include:
having more databases to manage
in the case of a database where you want to enforce a certain schema (e.g. MySQL), you will need to apply the schema changes across all your databases or support some form of versioning
in the case of a database which preallocates storage (e.g. MongoDB), you may use more storage per client (particularly if your actual data size is small)
you may run into limits on namespaces or open files
you still have to worry about application and data security :)
If you do some research on multi-tenancy you will find some other solutions ranging from this example (isolated DB per client on shared database server architecture) through to more complex partitioned data schemes.
This Microsoft article includes a useful overview of approaches and considerations: Multi-tenant SaaS database tenancy patterns.
We have large amounts of data in multiple MySQL databases, constantly updated from external sources.
Is there some software (preferably PHP-based) with which we can define rules to test against the database (for example, regular expressions on the data, frequency of updates, missing data, etc.) and run checks regularly, reporting when something has gone wrong or a trend in the data has changed?
How about STFW? Googling for "MySQL data quality" brought up (among others) a link to
http://www.talend.com
Otherwise, I'd have a look at data warehousing tools - Oracle Warehouse Builder for example has some mechanisms for data auditing.
Kind regards, Frank
If you have multiple db tables that are not joined with foreign keys, you should add the foreign keys and use them for data integrity.
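For example, retrofitting a missing foreign key (a sketch with invented table names):

    -- Enforce that every order points at a real customer
    ALTER TABLE orders
        ADD CONSTRAINT fk_orders_customer
        FOREIGN KEY (customer_id) REFERENCES customers (id);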
If you have lots of PL/SQL code, then you need unit tests for it (yes, the DB needs tests too). So in the end you'll end up with "continuous integration" that runs your tests periodically. And yes, you have to write it yourself.
See http://www.slideshare.net/antonkeks/database-refactoring for more info.
If you have to sync databases, then I'd recommend SQLyog.
If you have properly designed the database, you won't have many data integrity problems. This means doing the work of setting up PK/FK relationships, data constraints, correct datatypes, triggers, etc. It especially means never assuming the application will handle all of that. It might mean setting up jobs to check on certain types of data entry and notify someone of possible problems. It might mean revising all your data imports to use a standard set of cleaning routines. It might mean creating a way to identify and merge duplicate records (all complex databases should have a deduping application written so that users can make choices about which data to keep and which to discard when duplicates are found).
If you didn't design the database correctly, you need to set these things up in the database one at a time, according to your business rules, fixing the bad data as you go. There is no easy solution for the developers' failure to design properly.
Since the needs of each database are very different, no one I know of has automated a way to enforce all integrity rules; this is a large part of what the database designer does when designing the database. I certainly wouldn't trust any COTS program to do it either, based on how badly designed every COTS database I have ever had the displeasure to support has been.
Some parts of my web app would work very well with an RDBMS, such as user and URL handling: I want to normalize users, emails, hosts (e.g. stackoverflow.com), and URLs (e.g. https://stackoverflow.com/questions/ask) so that updating a value in one place updates it everywhere, and redundancy is minimized.
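A hedged sketch of the normalization described above (the table layout is my guess, not the poster's):

    -- Hosts and URLs stored once, referenced everywhere
    CREATE TABLE hosts (
        id   INT AUTO_INCREMENT PRIMARY KEY,
        name VARCHAR(255) NOT NULL UNIQUE        -- e.g. stackoverflow.com
    );

    CREATE TABLE urls (
        id      INT AUTO_INCREMENT PRIMARY KEY,
        host_id INT NOT NULL,
        path    VARCHAR(2048) NOT NULL,          -- e.g. /questions/ask
        FOREIGN KEY (host_id) REFERENCES hosts (id)
    );
    -- Renaming a host now touches a single row in hosts.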
But some parts of my web app would work very well with a document-based database like Mongo, because they have a lot of components that would be more efficient as embedded objects.
Would it make sense to use MySQL for the relational objects and Mongo for the document objects, or is it not worth the hassle of managing two types of databases? I know that Mongo has references, but I get the impression that it is not really designed or optimized for them.
Thanks!
PS: I read this: Using combination of MySQL and MongoDB and it scratches the edge of what I am asking, but it is really a completely different question.
We use Mongo and MySQL in unison. Yes, there is additional maintenance involved, but it's about using the right tool for the job. We use Mongo for the more real-time scenarios, where we need fast reads and writes and can do without persisting data for long periods of time, and MySQL for everything else.
That being said, your needs may be unique and you need to figure out the right tool for the job.
I recently built a system using MySQL as the RDBMS managing users and blogging, and MongoDB for searchable attributes. It works well; however, keeping data in sync, especially user IDs etc., requires a bit of work. It is basically a case of choosing the right tool for the job.
I'm working on an application that's similar to Wufoo in that it allows our users to create their own databases and collect/present records with auto generated forms and views.
Since every user creates a different schema (one user might have a database of their baseball card collection, another a database of their recipes), our current approach is to use MySQL to create a separate database, with its own tables, for every user. In other words, the databases our MySQL server contains look like:
main-web-app-db (our web app containing tables for users account info, billing, etc)
user_1_db (baseball_cards_table)
user_2_db (recipes_table)
....
And so on. If a user wants to set up a new database to keep track of their DVD collection, we do a "CREATE DATABASE ..." followed by "CREATE TABLE ...". If they enter some data and then decide they want to change a column, we do an "ALTER TABLE ...".
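Concretely, the flow described above is roughly this (a sketch; the names are invented):

    CREATE DATABASE user_3_db;
    CREATE TABLE user_3_db.dvds (
        id    INT AUTO_INCREMENT PRIMARY KEY,
        title VARCHAR(255) NOT NULL
    );
    -- Later, when the user redefines a field:
    ALTER TABLE user_3_db.dvds ADD COLUMN release_year SMALLINT;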
Now, the further along I get with building this out the more it seems like MySQL is poorly suited to handling this.
1) My first concern is that switching databases on every request (first to our main app's database for authentication etc., then to the user's personal database) is going to be inefficient.
2) My second concern is that there will be a limit to the number of databases a single MySQL server can host. Pretending for a moment this application had 500,000 user databases: is MySQL designed to operate this way? What if it were a million, or more?
3) Lastly, is this approach going to be a nightmare to support and scale? I've never heard of MySQL being used this way, so I worry about how it affects things like replication and other scaling methods.
To me it seems like MySQL wasn't built to be used this way, but what do I know. I've been looking at alternatives like MongoDB, CouchDB, and Redis, because a schema-less approach to this particular problem seems to make a lot of sense.
Can anyone offer some advice on this?
Since you are leaving the schema up to your users, it doesn't make sense to use a relational database that forces you to define a schema up front.
Use a NoSQL database. Do some more reading on Stack Overflow:
What is NoSQL, how does it work, and what benefits does it provide?
Pros/Cons of document based database vs relational database
What is the best Document-oriented database?
Creating tables on the fly like you describe is a very bad idea, and supporting schema changes would be a nightmare. Each time someone added or removed a field you would have to run an ALTER TABLE ... command, and if there's data in the table, that's not a quick operation, since it basically creates a new table with the new schema and moves all the data over. Don't go down that route.
You could implement some kind of key/value store on top of MySQL without too much work, or use something like Friendly, but going with a proper document database is probably much simpler.
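A minimal sketch of such a key/value layer, as an EAV-style table (all names are hypothetical):

    -- One row per (record, field) instead of one column per field
    CREATE TABLE user_records (
        user_id   INT NOT NULL,
        record_id INT NOT NULL,
        field     VARCHAR(64) NOT NULL,
        value     TEXT,
        PRIMARY KEY (user_id, record_id, field)
    );
    -- "Adding a column" is just inserting rows with a new field name;
    -- no ALTER TABLE, but typing and querying become your problem.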
MongoDB would be my choice, but there is a lot to consider, and others may say Cassandra would be better. It's very easy to get going with MongoDB, and using it feels quite familiar if you've used a SQL database. It handles indexing more or less identically, and querying is not too different either. The best thing, though, is probably that you don't need an ORM: your objects are stored more or less as-is in the database. Reading and writing can be done very close to the metal without a lot of mapping to and from objects.