I'm looking for working code and ideas from others who have tried to build a multi-tenant Django application using database-level isolation.
Update/Solution: I ended up solving this in a new open-source project: see django-db-multitenant
Goal
My goal is to multiplex requests as they come in to a single app server (WSGI frontend like gunicorn), based on the request hostname or request path (for instance, foo.example.com/ sets the Django connection to use database foo, and bar.example.com/ uses database bar).
Precedent
I'm aware of a few existing solutions for multi-tenancy in Django:
django-tenant-schemas: This is very close to what I want: you install its middleware at highest precedence, and it sends a SET search_path command to the db. Unfortunately, it is Postgres-specific and I am stuck with MySQL.
django-simple-multitenant: The strategy here is to add a "tenant" foreign key to all models, and adjust all application business logic to key off of that. Basically, each row becomes indexed by (id, tenant_id) rather than (id). I've tried, and don't like, this approach for a number of reasons: it makes the application more complex, it can lead to hard-to-find bugs, and it provides no database-level isolation.
One {app server, Django settings file with appropriate db} per tenant. Aka poor man's multi-tenancy (actually rich man's, given the resources it involves). I do not want to spin up a new app server per tenant, and for scalability I want any app server to be able to dispatch requests for any client.
Ideas
My best idea so far is to do something like django-tenant-schemas: in the first middleware, grab django.db.connection and fiddle with the database selection rather than the schema. I haven't quite thought through what this means in terms of pooled/persistent connections.
Another dead end I pursued was tenant-specific table prefixes: Setting aside that I'd need them to be dynamic, even a global table prefix is not easily achieved in Django (see rejected ticket 5000, among others).
Finally, Django multiple database support lets you define multiple named databases, and mux among them based on the instance type and read/write mode. Not helpful since there is no facility to select the db on a per-request basis.
Question
Has anyone managed something similar? If so, how did you implement it?
I've done something similar that is closest to point 1, but instead of using middleware to set a default connection, Django database routers are used. This allows application logic to use a number of databases, if required, for each request. It's up to the application logic to choose a suitable database for every query, and that is the big downside of this approach.
With this setup, all databases are listed in settings.DATABASES, including databases which may be shared among customers. Each model that is customer specific is placed in a Django app that has a specific app label.
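For illustration, the settings for this layout might look like the following sketch (database names, hosts, and credentials are hypothetical):

    # settings.py (illustrative): every database a router may choose,
    # including per-customer ones, must be declared here.
    DATABASES = {
        'default': {
            'ENGINE': 'django.db.backends.mysql',
            'NAME': 'control',
            'USER': 'app',
            'PASSWORD': 'secret',
            'HOST': 'db1.internal',
        },
        'shared_db': {
            'ENGINE': 'django.db.backends.mysql',
            'NAME': 'shared',
            'USER': 'app',
            'PASSWORD': 'secret',
            'HOST': 'db1.internal',
        },
        # One entry per customer; these can live on different hosts.
        'customer_foo': {
            'ENGINE': 'django.db.backends.mysql',
            'NAME': 'foo',
            'USER': 'app',
            'PASSWORD': 'secret',
            'HOST': 'db2.internal',
        },
    }

    DATABASE_ROUTERS = ['myproject.routers.AppLabelRouter']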
E.g., the following class defines a model which exists in all customer databases:
    from django.db import models

    class MyModel(models.Model):
        # ... model fields here ...

        class Meta:
            app_label = 'customer_records'
            managed = False
A database router is placed in the settings.DATABASE_ROUTERS chain to route database requests by app_label, something like this (not a full example):
    import thread_local_data  # module holding the per-thread database selection

    class AppLabelRouter(object):
        def get_customer_db(self, model, **hints):
            # Route models belonging to 'myapp' to the 'shared_db' database,
            # irrespective of customer.
            if model._meta.app_label == 'myapp':
                return 'shared_db'
            if model._meta.app_label == 'customer_records':
                customer_db = thread_local_data.current_customer_db()
                if customer_db is not None:
                    return customer_db
                raise Exception("No customer database selected")
            return None

        def db_for_read(self, model, **hints):
            return self.get_customer_db(model, **hints)

        def db_for_write(self, model, **hints):
            return self.get_customer_db(model, **hints)
The special part about this router is the thread_local_data.current_customer_db() call. Before the router is exercised, the caller/application must have set up the current customer db in thread_local_data. A Python context manager can be used for this purpose to push/pop a current customer database.
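The answer doesn't show thread_local_data itself, but a minimal sketch of such a module might look like this (the stack-based push/pop is an assumption, not the answerer's code):

    # thread_local_data.py -- minimal sketch: a stack of database names
    # kept in thread-local storage.
    import threading

    _local = threading.local()

    def current_customer_db():
        # The database name most recently pushed on this thread, or None.
        stack = getattr(_local, 'db_stack', None)
        return stack[-1] if stack else None

    def push_customer_db(db_name):
        if not hasattr(_local, 'db_stack'):
            _local.db_stack = []
        _local.db_stack.append(db_name)

    def pop_customer_db():
        _local.db_stack.pop()

    class UseCustomerDatabase(object):
        """Context manager that selects a customer database for a block."""
        def __init__(self, db_name):
            self.db_name = db_name

        def __enter__(self):
            push_customer_db(self.db_name)
            return self

        def __exit__(self, exc_type, exc_value, traceback):
            pop_customer_db()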
With all of this configured, the application code then looks something like the following. UseCustomerDatabase is the context manager that pushes/pops a current customer database name into thread_local_data, so that thread_local_data.current_customer_db() returns the correct database name when the router is eventually hit:
    from django.views.generic import DetailView

    class MyView(DetailView):
        def get_object(self):
            db_name = determine_customer_db_to_use(self.request)
            with UseCustomerDatabase(db_name):
                return MyModel.objects.get(pk=1)
This is quite a complex setup already. It works, but I'll try to summarize what I see as advantages and disadvantages:
Advantages
Database selection is flexible. It allows multiple databases to be used in a single request: both customer-specific and shared databases can be queried.
Database selection is explicit (not sure if this is an advantage or disadvantage). If you try to run a query that hits a customer database but the application hasn't selected one, an exception will occur indicating a programming error.
Using a database router allows different databases to exist on different hosts, rather than relying on a USE db; statement, which assumes that all databases are accessible through a single connection.
Disadvantages
It's complex to set up, and there are quite a few layers involved to get it functioning.
The need for, and use of, thread-local data is obscure.
Views are littered with database-selection code. This could be abstracted using class-based views to automatically choose a database based on request parameters, in the same manner as middleware would choose a default database.
The context manager to choose a database must be wrapped around a queryset in such a way that the context manager is still active when the queryset is evaluated (querysets are lazy, so evaluation can happen later than expected).
Suggestions
If you want flexible database access, I'd suggest using Django's database routers. Use middleware or a view mixin that automatically sets up a default database for the connection based on request parameters. You might have to resort to thread-local data to store the default database so that, when the router is hit, it knows which database to route to. This allows Django to use its existing persistent connections to a database (which may reside on different hosts if wanted), and chooses the database to use based on routing set up in the request. A sketch of such a middleware appears below.
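A rough sketch of that middleware, building on the thread_local_data module sketched above (determine_customer_db_to_use() is a hypothetical helper that maps, say, the request host to a database alias):

    import thread_local_data

    class CustomerDatabaseMiddleware(object):
        """Selects a default customer database for the whole request."""

        def process_request(self, request):
            # Hypothetical helper: derive the database alias from the request.
            thread_local_data.push_customer_db(determine_customer_db_to_use(request))

        def process_response(self, request, response):
            thread_local_data.pop_customer_db()
            return response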
This approach also has the advantage that the database for a query can be overridden if needed by using the QuerySet's using() function to select a database other than the default.
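For example (the 'customer_foo' alias is illustrative):

    # Force one query to a specific alias from settings.DATABASES,
    # regardless of what the routers would choose:
    record = MyModel.objects.using('customer_foo').get(pk=1)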
For the record, I chose to implement a variation of my first idea: issue a USE <dbname> in an early request middleware. I also set the CACHE prefix the same way.
I'm using it on a small production site, looking up the tenant name from a Redis database based on the request host. So far, I'm quite happy with the results.
I've turned it into a (hopefully reusable) GitHub project here: https://github.com/mik3y/django-db-multitenant
You could create a simple middleware of your own that determines the database name from your sub-domain or whatever, and then executes a USE statement on the database cursor for each request. Looking at the django-tenant-schemas code, that is essentially what it is doing: it sub-classes the psycopg2 database backend and issues the Postgres equivalent of USE, SET search_path. You could create a model to manage and create your tenants too, but then you would be re-writing much of django-tenant-schemas.
There should be no performance or resource penalty in MySQL for switching the schema (db name); it just sets a session parameter for the connection.
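A minimal sketch of that middleware idea, assuming a hypothetical lookup_tenant_db() helper (this is not the actual django-db-multitenant code):

    from django.db import connection

    class TenantMiddleware(object):
        """Old-style middleware: switch the MySQL default database with
        USE before any view code runs. Install it first in the chain."""

        def process_request(self, request):
            hostname = request.get_host().split(':')[0]
            db_name = lookup_tenant_db(hostname)  # hypothetical, e.g. a Redis lookup
            cursor = connection.cursor()
            # Identifiers cannot be parameterized; db_name must come from a
            # trusted mapping, never from raw user input.
            cursor.execute('USE `%s`' % db_name)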
Related
I am relatively new to Laravel and I want to know more about the behavior of the following code inside an API controller method (i.e. defined in routes/api.php).
    DB::connection($dbName)->beginTransaction();
Assuming two people simultaneously access the server, two controller instances are created. Now both instances call DB::connection($dbName)->beginTransaction(); and execute some code before committing.
To describe the situation more clearly: two people, A and B, both access the same Laravel server, with two instances of the same controller class. The two controller instances both access the same DB from the same IP address and execute the same code segment, which involves beginTransaction.
As far as I can tell, there are two possibilities:
(1) They are treated as two different transactions, and things such as lastInsertId are maintained separately. This is the desired behavior, but I have trouble understanding how the two different instances of the controller are identified internally. From a rough look at the Laravel source code, it seems to only read the config file for the specific DB connection, which is identical for both instances of the controller.
(2) They are treated as the same transaction, which is not the desired behavior. I would have to think of a way to make it work as described in (1).
Currently I don't have a sufficient environment for accurately testing the behavior of simultaneous controller instances, so I have to ask the question here.
I have read https://laravel.com/docs/8.x/database#database-transactions but it does not tell us which of (1) or (2) it is.
It's the first case.
Transactions aren't even part of Laravel; they are part of SQL. Laravel only calls SQL.
Also, in PHP each request is handled separately by spawning a new instance of the script. Therefore there is no variable sharing between two requests.
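The per-connection isolation can be demonstrated with any SQL client. A quick illustration using Python's built-in sqlite3, standing in for MySQL (the principle is the same):

    import os
    import sqlite3
    import tempfile

    path = os.path.join(tempfile.mkdtemp(), 'demo.db')

    # Two connections, like two simultaneous requests.
    conn_a = sqlite3.connect(path)
    conn_b = sqlite3.connect(path)
    conn_a.execute('CREATE TABLE t (id INTEGER PRIMARY KEY, who TEXT)')
    conn_a.commit()

    # A inserts inside an open, uncommitted transaction...
    conn_a.execute("INSERT INTO t (who) VALUES ('A')")
    # ...and B cannot see the row: the transactions are independent.
    print(conn_b.execute('SELECT COUNT(*) FROM t').fetchall()[0][0])  # 0
    conn_a.commit()
    print(conn_b.execute('SELECT COUNT(*) FROM t').fetchall()[0][0])  # 1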
Application 1: Suppose I have a Twitter-like application. Hence I need multiple databases/schemas (say, one to store user info, one for user logging, etc.).
Application 2: Suppose I have a blog that likewise needs logically separated DBs (again, one to store user info, one for user logging, etc.).
How can I use the same MySQL instance as the datastore for both? Since each has multiple similar DBs, there is a risk of confusing database or table names unless I keep long names like twitter_users and blog_users.
Any effective solution within MySQL?
Another way is to use MaxScale as a DB proxy. It has a rewrite engine, where you can configure a schema-name rewrite for one application. The benefit is that you can use a single MySQL/MariaDB instance and dedicate the whole memory to it.
First, some background:
We are trying to create a multi-tenant application. We first thought of using the MEAN stack and creating multiple collections per tenant (e.g. order_tenant1, order_tenant2, etc.), but then we went through some blogs that suggested against this approach. Second, we felt the need for transactions as a core requirement of our DB, which opened us up to RDBMSes like MySQL and MariaDB. We stumbled upon a blog which explains the approach in a lot of detail: create views to get, update, and insert tenant-specific data, with the view's parameter defined through the connection string. As we are using Node.js, I found an ORM for MySQL, Sequelize, which is quite good.
The actual problem:
In my experience with the MEAN stack, we define the Mongo connection in the server.js file; the application establishes those connections at startup and keeps them alive.
How can I have multiple Sequelize (or, for that matter, any database connection) objects that connect to the database according to the tenant a user belongs to, and provide the right object to the application to carry on with the business logic?
1) Should I create a new connection object on every request the application gets, and then close it after the request is processed?
2) Or is there a better way to handle this in Node, Express, or Sequelize?
Edited:
We have decided to use the row-based approach with a tenant_id column, as described in the blog above. But I am struggling with how to maintain different connection objects to the database through Sequelize. That is, if a user belonging to tenant 1 sends a request to the application, they need to be served with an object, say "db", which is a Sequelize object created with tenant 1's details in its connection string. Likewise, a user belonging to tenant 2 needs to be served the same object "db", but created with tenant 2's details in its connection string, because I want to maintain a different connection string (database connection object) for every tenant I serve.
Multi-tenancy can be implemented as row-based, schema-based, or database-based. Other than that 2010 article, I think you will find few if any other recommendations for doing database-based multi-tenancy. Systems were not designed to talk to tens or thousands of databases, and so things will keep failing on you. The basic thing you're trying to avoid is SQL injection attacks that reveal other users' data, but the proper way to avoid those is by sanitizing user inputs, which you need to do no matter what.
I highly recommend going with a normal row-based multi-tenancy approach as opposed to schema-based, as described in https://devcenter.heroku.com/articles/heroku-postgresql#multiple-schemas and in this original article: http://railscraft.tumblr.com/post/21403448184/multi-tenanting-ruby-on-rails-applications-on-heroku
Updated:
Your updated question still isn't clear on the difference between database-based and row-based multi-tenancy. You want to do row-based, which means you can set up a single Sequelize connection string exactly like the examples, since you'll only have a single database.
Then, your queries to the database will look like:
    // Old-style Sequelize (1.x/2.x) query: the tenant is selected by a
    // column filter rather than by switching databases.
    User.find({ where: { userid: 538 } }).complete(function(err, user) {
        console.log(user.values);
    });
The multi-tenancy is provided by the userid attribute. I would urge you to do a lot more reading about databases, ORMs, and typical patterns before getting started. I think you will find the additional up-front investment pays dividends versus starting development before you fully understand how ORMs typically work.
I have several Rails apps running on a single MySQL server. All of them run the same app, and all of the databases have the same schema, but each database belongs to a different customer.
Conceptually, here's what I want to do:
    Customer.all.each do |customer|
      connection.execute("use #{customer.database}")
      customer.do_some_complex_stuff_with_multiple_models
    end
This approach does not work because, when this is run in a web request, the underlying model classes cache different database connections from the A/R connection pool. So the connection on which I execute the "use" statement may not be the connection the model uses, in which case it queries the wrong database.
I read through the Rails A/R code (version 3.0.3), and came up with this code to execute in the loop, instead of the "use" statement:
    ActiveRecord::Base.clear_active_connections!
    ActiveRecord::Base.establish_connection(each_customer_database_config)
I believe that the connection pool is per-thread, so it seems like this would clobber the connection pool and re-establish it only for the one thread the web request is on. But if the connections are shared in some way I'm not seeing, I would not want that code to wreak havoc with other active web requests in the same app.
Is this safe to do in a running web app? Is there any other way to do this?
IMO, switching to a new database connection for different requests is a very expensive operation. AR maintains a limited pool of connections.
I guess you should move to PostgreSQL, where you have the concept of schemas.
In an ideal SQL world, this is the structure of a database:
    database --> schemas --> tables
In MySQL, database and schema are the same thing. Postgres has separate schemas, which can hold tables for different customers. You can switch the schema on the fly, without changing the AR connection, by setting:
    ActiveRecord::Base.connection.schema_search_path = "customer_schema"
Developing it requires a bit of hacking, though.
Switching database by connecting/disconnecting is really slow, and is not going to work due to AR connection pools and internal caches. Try using ActiveRecord::Base.table_name_prefix = "customer_" and keep the database constant.
Right now, connections in ActiveRecord are per class. They only look per-thread because, before Ruby 1.9, threads performed poorly, so implementations used processes instead of threads; that may not stay true for long.
Since AR keeps connections per model class, you can create a different mock model class for each database you have, using the answer given in this question.
The code will look something like this (I have not tested it):
    Customer.all.each do |customer|
      c_class = Class.new(ActiveRecord::Base)
      c_class.establish_connection(each_customer_database_config)
      c_class.table_name = customer.table_name
      c_class.do_something_on_diff_models_using_customer_from_diff_conn(customer.id)
      c_class.clear_active_connections!
    end
Why not keep the same db and tables, and just have each of your models belong_to a customer? Then you can find all the models for that customer with:
    Customer.all.each do |customer|
      customer.widgets
      customer.wodgets
      # etc
    end
I'm writing a simpler version of phpMyAdmin in Rails; this web app will run on a web server (where users will be able to indicate the database name, hostname, username, password, and port number of one of the database servers running on the same network). The user will then be connected to that machine and will be able to use the UI to administer that database (add or remove columns, drop tables, etc).
I have two related questions (your help will greatly aid me in understanding how to best approach this):
In a traditional Rails application I would store the database info in database.yml, however here I need to do it dynamically. Is there a good way to leave the database.yml file empty and tell Rails to use the connection data provided by the user at run time instead?
Different users may connect to different databases (or even hosts). I assume that I need to keep track of the association between an established database connection and a user session. What's the best way to achieve this?
Thank you in advance.
To prevent Rails from initializing ActiveRecord using database.yml, you can simply remove :active_record from config.frameworks in config/environment.rb. Then, to manually establish connections, you use ActiveRecord::Base.establish_connection. (And maybe ActiveRecord::Base.configurations)
ActiveRecord stores everything connection-related in class variables. So if you want to dynamically create multiple connections, you also have to dynamically subclass ActiveRecord::Base and call establish_connection on that.
This will be your abstract base class for any subclass you'll use to actually manage tables. To make ActiveRecord aware of this, you should do self.abstract_class = true within the base class definition.
Then, each table you want to manage will in turn dynamically subclass this new abstract base class.
This is more difficult, because you can't really persist connections, of course. The immediate solution I can think of is storing a unique token in the session, and using that in a before_filter to get back to the dynamic ActiveRecord::Base subclass, which you'll probably be storing in a hash somewhere.
This gets more interesting once you start running multiple Rails worker processes:
You will have to store all of the database connection information in the session, so other workers can use it.
You probably want a consistent unique token across workers, so use a hash function on a combination of database connection parameters.
Because a worker may be called with a token it doesn't yet know about, your subclassing and establish_connection logic will probably happen in the before_filter. (Rather than the moment of login, for example.)
You will have to figure out some clever way of garbage collecting connections and classes, for when a user doesn't properly log out and the session expires. (Sorry, I don't know this one.)