Safe way to iterate over multiple databases in Rails - mysql

I have several Rails apps running on a single MySQL server. All of them run the same app, and all of the databases have the same schema, but each database belongs to a different customer.
Conceptually, here's what I want to do:
Customer.all.each do |customer|
  connection.execute("use #{customer.database}")
  customer.do_some_complex_stuff_with_multiple_models
end
This approach does not work because, when this is run in a web request, the underlying model classes cache different database connections from the A/R connection pool. So the connection on which I execute the "use" statement may not be the connection the model uses, in which case it queries the wrong database.
I read through the Rails A/R code (version 3.0.3), and came up with this code to execute in the loop, instead of the "use" statement:
ActiveRecord::Base.clear_active_connections!
ActiveRecord::Base.establish_connection(each_customer_database_config)
I believe that the connection pool is per-thread, so it seems like this would clobber the connection pool and re-establish it only for the one thread the web request is on. But if the connections are shared in some way I'm not seeing, I would not want that code to wreak havoc with other active web requests in the same app.
Is this safe to do in a running web app? Is there any other way to do this?

IMO switching to a new database connection for different requests is a very expensive operation. AR maintains a limited pool of connections.
I guess you should move to PostgreSQL, where you have the concept of schemas.
In an ideal SQL world, this is the structure of a database:
database --> schemas --> tables
In MySQL, database and schema are the same thing. Postgres has separate schemas, which can hold the tables for different customers. You can switch schemas on the fly, without changing the AR connection, by setting:
ActiveRecord::Base.connection.schema_search_path = "customer_schema"
Developing it requires a bit of hacking, though.
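For illustration, a minimal sketch of the original loop using schema switching instead of reconnecting (this assumes one Postgres database with one schema per customer, and a hypothetical customer.schema_name attribute):
Customer.all.each do |customer|
  # Point the existing connection at this customer's schema;
  # no reconnect happens, so the AR connection pool is untouched.
  ActiveRecord::Base.connection.schema_search_path = customer.schema_name
  customer.do_some_complex_stuff_with_multiple_models
end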

Switching databases by connecting/disconnecting is really slow and is not going to work, due to AR connection pools and internal caches. Try using ActiveRecord::Base.table_name_prefix = "customer_" and keep the database constant.
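A hedged sketch of that idea, assuming all customers' tables live in one database and differ only by prefix (Widget/Wodget are the example models from elsewhere in this thread; customer.slug is hypothetical). Note that Rails caches computed table names, so each model in play needs a reset:
Customer.all.each do |customer|
  ActiveRecord::Base.table_name_prefix = "#{customer.slug}_"  # e.g. "acme_"
  # Clear the cached table names so the new prefix takes effect.
  [Widget, Wodget].each(&:reset_table_name)
  customer.do_some_complex_stuff_with_multiple_models
end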

Connections in ActiveRecord are managed at the class level: each model class can have its own connection. It only looks per-thread because, before Ruby 1.9, threads performed so poorly that implementations used processes instead of threads, but that may not stay true for long.
Since AR can keep one connection per model class, you can create a different mock model class for each database you have, using the answer given in this question.
The code will look something like this (I have not tested it):
Customer.all.each do |customer|
  c_class = Class.new(ActiveRecord::Base)
  c_class.establish_connection(each_customer_database_config)
  c_class.table_name = customer.table_name
  c_class.do_something_on_diff_models_using_customer_from_diff_conn(customer.id)
  c_class.clear_active_connections!
end

Why not keep the same db and tables and just have each of your models belong_to a customer? Then you can find all the models for that customer with:
Customer.all.each do |customer|
  customer.widgets
  customer.wodgets
  # etc
end

Best technique to make node mysql run fastest?

I am using this
var mysql = require('mysql');
in my node.js app. I want my app to perform as fast as possible. I have many functions that connect to SQL. There are two approaches I am familiar with:
1. For every request, make a new connection, execute the query, and then close the connection.
2. Open the connection and make it a global variable, and then never close it. Every request that comes in then uses the globally saved open connection.
Which is generally better to use? Also, for number 2: if the server closes unexpectedly, the SQL connection doesn't close. Is that bad?
Thanks
Approach 2 is faster, but to avoid the potential problem of connections dropping unexpectedly, you'll have to implement a testing mechanism for every segment that queries the database (e.g., count the number of returned rows).
To take this approach further, you can define a connection bank or pool, where you handle connection testing and distribution. The basic idea is to keep many connections to the database open and hand out only healthy connections to consumers (functions or objects that query the database). As Andrew mentions in the comments, you can check this question: node.js + mysql connection pooling
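A minimal sketch of the pooled approach with the same mysql module (credentials and pool size are placeholders):
var mysql = require('mysql');

// The pool keeps up to connectionLimit connections open and hands a
// healthy one to each query, replacing dropped connections for you.
var pool = mysql.createPool({
  connectionLimit: 10,
  host: 'localhost',
  user: 'app_user',      // hypothetical credentials
  password: 'secret',
  database: 'app_db'
});

// pool.query() checks out a connection, runs the query, and releases it.
pool.query('SELECT 1 + 1 AS two', function (err, rows) {
  if (err) throw err;
  console.log(rows[0].two); // 2
});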
Since the database is an essential asset to a project, if this is not a homework or learning project, it might not be a bad idea to explore 3rd-party libraries, where a lot of the connection and security details are covered and automated.

What is the difference between MYSQL and SQLite multi-user functionality?

I am new to server-side programming and am trying to understand relational databases a little better. Whenever I read about MySQL vs SQLite, people always talk about SQLite not being able to have multiple users. However, when I program with the Django framework, I am able to create multiple users on the SQLite db. Can someone explain what people mean by multi-user? Thanks!
When people talk about multiple users in this context, they are talking about simultaneous connections to the database. The users in this case are threads in the web server that are accessing the database.
Different databases have different solutions for handling multiple connections working with the database at once. Generally, reading is not a problem, as multiple reading operations can overlap without disturbing each other, but only one connection can write data to a specific unit at a time.
The concurrency difference between databases is basically how large a unit they lock when someone is writing. MySQL has an advanced system where records, blocks, or tables can be locked depending on the need, while SQLite has a simpler system where it only locks the entire database.
The impact of this difference is seen when you have multiple threads in the webserver, where some threads want to read data and others want to write data. MySQL can read from one table and write into another at the same time without problem. SQLite has to suspend all incoming read requests whenever someone wants to write something, wait for all current reads to finish, do the write, and then open up for reading operations again.
As you can read here, SQLite supports multiple users, but locks the whole database.
SQLite is usually used for development, but MySQL is a better choice for production, because it has better support for concurrent access and writes, which SQLite does not.
Hope this helps.
SQLite concurrency is explained in detail here.
In a nutshell, SQLite doesn't have the fine-grained concurrency mechanisms that MySQL does. When someone tries to write to a MySQL database, the MySQL database will only lock what it needs to lock, usually a single record, sometimes a table.
When a user writes to a SQLite database, the entire database file is momentarily locked. As you might imagine, this limits SQLite's ability to handle many concurrent users.
Multi-user means that many tasks (possibly on many separate computers) can have open connections to the database at the same time.
A multi-user database provides things like locks to allow these tasks to update the database safely.
Look at ScimoreDB. It's an embedded database that supports multi-process (or user) read and write access. It also can work as a client-server database.

Sqlalchemy sessions and autobahn

I'm using the autobahn server in twisted to provide an RPC API. Some calls require queries to the database and multiple clients may be connected via websocket to the server.
I am using the SqlAlchemy ORM to access the database.
What are the pros and cons of the two following approaches for dealing with SqlAlchemy sessions.
1. Create and destroy a session for every RPC call
2. Create a single session when the server starts and use it in every RPC call
Which would you recommend and why? (I'm leaning towards 2)
The recommended way of doing SQL-based database access from Twisted (and Autobahn) with databases like PostgreSQL, Oracle or SQLite would be twisted.enterprise.adbapi.
twisted.enterprise.adbapi will run queries on a background thread pool, which is required, since most database drivers are blocking.
Side note: for PostgreSQL, there is also a natively asynchronous, non-blocking driver: txpostgres.
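For reference, a minimal sketch of the twisted.enterprise.adbapi approach (table name and credentials are placeholders): adbapi.ConnectionPool runs each query on a thread pool and hands you a Deferred instead of blocking the reactor.
from twisted.enterprise import adbapi
from twisted.internet import reactor

# The extra keyword arguments are passed straight to psycopg2.connect().
dbpool = adbapi.ConnectionPool("psycopg2", host="localhost",
                               database="app_db", user="app_user",
                               password="secret")

def print_rows(rows):
    for row in rows:
        print(row)

# runQuery returns a Deferred that fires with all result rows.
d = dbpool.runQuery("SELECT id, name FROM customers")
d.addCallback(print_rows)
d.addBoth(lambda _: reactor.stop())

reactor.run()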
Now, if you put an ORM like SQLAlchemy on top of the native SQL driver, I'm not sure how this will work together (if at all) with twisted.enterprise.adbapi.
So, from the options you mention:
1. Is a no-go, since most drivers are blocking (and Autobahn's RPCs run on the main thread, i.e. the Twisted reactor thread, which you must not block).
2. With this, you need to put the database session(s) in background threads (again, so as not to block).
Also see here.
If you're using SQLAlchemy and Twisted together, consider using Alchimia rather than the built-in adbapi.

Multi-tenant Django applications: altering database connection per request?

I'm looking for working code and ideas from others who have tried to build a multi-tenant Django application using database-level isolation.
Update/Solution: I ended solving this in a new opensource project: see django-db-multitenant
Goal
My goal is to multiplex requests as they come in to a single app server (WSGI frontend like gunicorn), based on the request hostname or request path (for instance, foo.example.com/ sets the Django connection to use database foo, and bar.example.com/ uses database bar).
Precedent
I'm aware of a few existing solutions for multi tenancy in Django:
django-tenant-schemas: This is very close to what I want: you install its middleware at highest precedence, and it sends a SET search_path command to the db. Unfortunately, it is Postgres specific and I am stuck with MySQL.
django-simple-multitenant: The strategy here is to add a "tenant" foreign key to all models, and adjust all application business logic to key off of that. Basically, each row becomes indexed by (id, tenant_id) rather than (id). I've tried, and don't like, this approach for a number of reasons: it makes the application more complex, it can lead to hard-to-find bugs, and it provides no database-level isolation.
One {app server, django settings file with appropriate db} per tenant. Aka poor man's multi tenancy (actually rich man's, given the resources it involves). I do not want to spin up a new app server per tenant, and for scalability I want any app server to be able to dispatch requests for any client.
Ideas
My best idea so far is to do something like django-tenant-schemas: in the first middleware, grab django.db.connection and fiddle with the database selection rather than the schema. I haven't quite thought through what this means in terms of pooled/persistent connections.
Another dead end I pursued was tenant-specific table prefixes: Setting aside that I'd need them to be dynamic, even a global table prefix is not easily achieved in Django (see rejected ticket 5000, among others).
Finally, Django multiple database support lets you define multiple named databases, and mux among them based on the instance type and read/write mode. Not helpful since there is no facility to select the db on a per-request basis.
Question
Has anyone managed something similar? If so, how did you implement it?
I've done something similar that is closest to your first idea, but instead of using middleware to set a default connection, Django database routers are used. This allows the application logic to use a number of databases in each request, if required. It's up to the application logic to choose a suitable database for every query, and this is the big downside of this approach.
With this setup, all databases are listed in settings.DATABASES, including databases which may be shared among customers. Each model that is customer specific is placed in a Django app that has a specific app label.
e.g., the following class defines a model which exists in all customer databases:
class MyModel(Model):
    ....

    class Meta:
        app_label = 'customer_records'
        managed = False
A database router is placed in the settings.DATABASE_ROUTERS chain to route database requests by app_label, something like this (not a full example):
class AppLabelRouter(object):
    def get_customer_db(self, model, **hints):
        # Route models belonging to 'myapp' to the 'shared_db' database,
        # irrespective of customer.
        if model._meta.app_label == 'myapp':
            return 'shared_db'

        if model._meta.app_label == 'customer_records':
            customer_db = thread_local_data.current_customer_db()
            if customer_db is not None:
                return customer_db
            raise Exception("No customer database selected")

        return None

    def db_for_read(self, model, **hints):
        return self.get_customer_db(model, **hints)

    def db_for_write(self, model, **hints):
        return self.get_customer_db(model, **hints)
The special part about this router is the thread_local_data.current_customer_db() call. Before the router is exercised, the caller/application must have set up the current customer db in thread_local_data. A Python context manager can be used for this purpose to push/pop a current customer database.
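For concreteness, a minimal sketch of what thread_local_data and the context manager could look like (all names here are illustrative, not taken from the original setup):
import threading

_local = threading.local()

def current_customer_db():
    # The database alias pushed by the innermost context manager,
    # or None if no customer database is selected on this thread.
    return getattr(_local, 'customer_db', None)

class UseCustomerDatabase(object):
    # Push/pop a customer database alias in thread-local storage so
    # the router can find it while queries run inside the block.
    def __init__(self, db_name):
        self.db_name = db_name

    def __enter__(self):
        self.previous = current_customer_db()
        _local.customer_db = self.db_name
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        _local.customer_db = self.previous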
With all of this configured, the application code then looks something like this, where UseCustomerDatabase is a context manager to push/pop a current customer database name into thread_local_data so that thread_local_data.current_customer_db() will return the correct database name when the router is eventually hit:
class MyView(DetailView):
    def get_object(self):
        db_name = determine_customer_db_to_use(self.request)
        with UseCustomerDatabase(db_name):
            return MyModel.objects.get(pk=1)
This is quite a complex setup already. It works, but I'll try to summarize what I see as the advantages and disadvantages:
Advantages
Database selection is flexible. It allows multiple databases to be used in a single request; both customer-specific and shared databases can be used.
Database selection is explicit (not sure if this is an advantage or disadvantage). If you try to run a query that hits a customer database but the application hasn't selected one, an exception will occur indicating a programming error.
Using a database router allows different databases to exist on different hosts, rather than relying on a USE db; statement that assumes all databases are accessible through a single connection.
Disadvantages
It's complex to set up, and there are quite a few layers involved to get it functioning.
The need and use of thread local data is obscure.
Views are littered with database selection code. This could be abstracted using class based views to automatically choose a database based on request parameters in the same manner as middleware would choose a default database.
The context manager to choose a database must be wrapped around a queryset in such a manner that the context manager is still active when the query is evaluated.
Suggestions
If you want flexible database access, I'd suggest using Django's database routers. Use middleware or a view mixin which automatically sets up a default database to use for the connection, based on request parameters. You might have to resort to thread-local data to store the default database, so that when the router is hit, it knows which database to route to. This allows Django to use its existing persistent connections to a database (which may reside on different hosts if wanted), and choose the database to use based on routing set up in the request.
This approach also has the advantage that the database for a query can be overridden if needed, by using the QuerySet's using() function to select a database other than the default.
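For example ('customer_42' being a hypothetical alias defined in settings.DATABASES):
# Route this one query to an explicit database, bypassing the router:
MyModel.objects.using('customer_42').get(pk=1)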
For the record, I chose to implement a variation of my first idea: issue a USE <dbname> in an early request middleware. I also set the CACHE prefix the same way.
I'm using it on a small production site, looking up the tenant name from a Redis database based on the request host. So far, I'm quite happy with the results.
I've turned it into a (hopefully resuable) github project here: https://github.com/mik3y/django-db-multitenant
You could create a simple middleware of your own that determines the database name from your sub-domain or whatever, and then executes a USE statement on the database cursor for each request. Looking at the django-tenant-schemas code, that is essentially what it is doing: it sub-classes the psycopg2 backend and issues the Postgres equivalent of USE, SET search_path TO xxx. You could create a model to manage and create your tenants too, but then you would be re-writing much of django-tenant-schemas.
There should be no performance or resource penalty in MySQL for switching the schema (db name); it is just setting a session parameter for the connection.
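A minimal sketch of that middleware, assuming the tenant database name can be derived from the subdomain (old-style Django middleware; all names are illustrative):
from django.db import connection

class TenantMiddleware(object):
    def process_request(self, request):
        # e.g. 'foo' for foo.example.com; the mapping scheme is up to you.
        tenant = request.get_host().split('.')[0]
        db_name = 'tenant_%s' % tenant
        # USE cannot take bind parameters, so validate db_name against a
        # whitelist of known tenants before interpolating it.
        cursor = connection.cursor()
        cursor.execute("USE `%s`" % db_name)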

Apache -> MySQL multiple connections vs one connection

I've been thinking, why does Apache start a new connection to the MySQL server for each page request? Why doesn't it just keep ONE connection open at all times and send all sql queries through that one connection (obviously with client id attached to each req)?
It cuts down on the handshake time overhead, and a couple of other advantages that I see.
It's like plugging in a computer every time you want to use it. Why go to the outlet each time when you can just leave it plugged in?
MySQL does not support multiple sessions over a single connection.
Oracle, for instance, allows this, and you can set up Apache to multiplex several logical sessions over a single TCP connection.
This is a limitation of MySQL, not of Apache or scripting languages.
There are modules that can do session pooling:
Precreate a number of connections
Pick a free connection on demand
Create additional connections if no free connection is available.
The reason is: it's simpler.
To re-use connections, you have to invent and implement connection pooling. This adds another almost-layer of code that has to be developed, maintained, etc.
Plus, pooled connections invite a whole other class of bugs that you have to watch out for while developing your application. For example, if you define a user variable, but the next user of that connection goes down a code path that branches based on the existence of that variable, then that user runs the wrong code. Other problems include temporary tables, transaction deadlocks, session variables, etc. All of these become very hard to reproduce, because they depend on the subsequent actions of two different users that appear to have no ties to each other.
Besides, the connection overhead on a MySQL connection is tiny. In my experience, connection pooling doesn't increase the number of users a server can support by very much.
Because that's the purpose of the mod_dbd module.