SQLAlchemy Caching Objects Beyond Transaction

For performance reasons, our application loads certain SQLAlchemy model instances into memory at startup time. These instances are not expected to change through the life of the application, so this is a reasonable solution.
For the most part this has been fine, but I have observed isolated incidents where a DetachedInstanceError: Instance <ZZZ at 0x3a23510> is not bound to a Session; attribute refresh operation cannot proceed appears and causes ongoing problems. This is the same error I would (expectedly) receive when attempting to access a lazy-loaded relationship on a similarly cached object.
The error appears to be caused by access to the .id attribute, which I would not expect to require any kind of DB refresh. What really bothers me is that I cannot reproduce this exception consistently.
My question is: what might cause this exception to occur, and what techniques have others used for storing SQLAlchemy instances beyond the transaction that created them?

.id would be missing if the object was expired before it was placed into the cache. After a Session commits or rolls back, by default it expires all attributes, which are then refreshed from the database the next time they are accessed. If the object has been detached from its Session by that point, the refresh cannot proceed, which is exactly the DetachedInstanceError you're seeing.
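To make that concrete, here is a minimal sketch of the failure mode (the Thing model, in-memory engine, and Session factory are all placeholders, not your actual setup):
from sqlalchemy import Column, Integer, create_engine
from sqlalchemy.orm import sessionmaker
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Thing(Base):                      # hypothetical stand-in for a cached model
    __tablename__ = 'things'
    id = Column(Integer, primary_key=True)

engine = create_engine('sqlite://')     # throwaway in-memory database
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)     # expire_on_commit defaults to True

session = Session()
thing = Thing()
session.add(thing)
session.commit()                        # the commit expires thing's loaded attributes
session.close()                         # closing detaches thing from the session
thing.id                                # DetachedInstanceError: attribute refresh cannot proceed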
Expiration exists because it's not a given that .id still holds the same value: the object may have been deleted from the database or its primary key modified (both conditions would raise an ObjectDeletedError on refresh).
The solution is to cache your objects before they get expired, expunge() them from the Session before it expires everything, or disable expire_on_commit.
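Continuing the sketch above, either of these keeps the cached instances usable after the session is gone:
# Option 1: disable expiration-on-commit for the session that builds the cache
Session = sessionmaker(bind=engine, expire_on_commit=False)
session = Session()
cache = {obj.id: obj for obj in session.query(Thing).all()}
session.commit()        # nothing gets expired this time
session.close()         # the instances are detached but keep their loaded attributes

# Option 2: keep the default expire_on_commit=True, but detach the objects
# yourself before anything expires them:
#     session.expunge(obj)      # or session.expunge_all()
Either way, keep in mind that whatever you cache this way is a snapshot: it will not notice later changes to those rows.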

Related

ActiveRecord::StaleObject error on opening each result on a new tab

Recently we added functionality to our RoR application which allows users to open a particular record, let's say, in their own individual tabs. Since doing so, we've started seeing frequent ActiveRecord::StaleObjectError errors. On investigating the issue I found that Rails tries to update the session store whenever a resource is opened in a tab, and that is where the exception is raised.
We have lock_version in our Active Record session store, so Rails applies optimistic locking to it by default. Is there any way we could solve this issue without introducing much complexity, since the application is already live on the client's machine, and without affecting any session data we've stored in our session store DB?
Any suggestions would be much appreciated. Thanks
It sounds like you're using optimistic locking on a db session record and updating the session record when you process an update to other records. Not sure what you'd need to update in the session, but if you're worried about possibly conflicting updates to the session object (and need the locking) then these errors might be desired.
If you don't - you can refresh the session object before saving the session (or disable its optimistic locking) to avoid this error for these session updates.
You also might look into what about the session is being updated and whether it's strictly necessary. If you're updating something like "last_active_on" then you might be better off sending off a background job to do this and/or using the update_column method which bypasses the rather heavyweight activerecord save callback chain.
--- UPDATE ---
Pattern: Putting side-effects in background jobs
There are several common Rails patterns that start to break down as your app usage grows. One of the most common I've run into is when a controller endpoint for a specific record also updates a common/shared record (for example, if creating a 'message' also updates the messages_count for a user via a counter cache, or updates a last_active_at on a session). These patterns create bottlenecks because many different types of requests across your application end up competing for write locks on the same database rows unnecessarily.
These tend to creep into your app over time and become hard to refactor later. I'd recommend always handling side-effects of a request in an asynchronous job (using something like Sidekiq). For example:
class Message < ActiveRecord::Base
  after_commit :enqueue_update_messages_count_job

  def enqueue_update_messages_count_job
    Jobs::UpdateUserMessageCountJob.enqueue(self.id)
  end
end
While this may seem like overkill at first, it creates an architecture that is significantly more scalable. If counting the messages becomes slow... that will make the job slower but not impact the usability of the product. In addition, if certain activities create lots of objects with the same side-effects (let's say you have a "signup" controller that creates a bunch of objects for a user that all trigger an update of user.updated_at), it becomes easy to throw out duplicate jobs and avoid updating the same field 20 times.
Pattern: Skipping the activerecord callback chain
Calling save on an ActiveRecord object runs validations and all the before and after callbacks. These can be slow and (at times) unnecessary. For example, updating a cached message_count value doesn't necessarily care about whether the user's email address is valid (or any other validations), and you may not care about other callbacks running. The same goes if you're just updating a user's updated_at value to clear a cache. You can bypass the ActiveRecord callback chain by calling user.update_column(:message_count, ...) to write that field directly to the database. In theory this shouldn't be necessary for a well designed application, but in practice some larger/legacy codebases may make significant use of the ActiveRecord callback chain to handle business logic that you may not want to invoke.
--- Update #2 ---
On Deadlocks
One reason to avoid updating (or generally locking) a common/shared object from a concurrent request is that it can introduce Deadlock errors.
Generally speaking, a "deadlock" in a database is when two processes each need a lock that the other one holds. Neither can continue, so one must error out instead. In practice, detecting this is hard, so some databases (like Postgres) only check for a deadlock after a thread has been waiting on a lock for a configurable amount of time (the deadlock timeout). While contention for locks is common (e.g. two updates that are both updating a 'session' object), a true deadlock is often rare (thread A holds a lock on the session that thread B needs, while thread B holds a lock on a different object that thread A needs), so you may be able to partially address the problem by looking at / extending your deadlock timeout. While this may reduce the errors, it doesn't fix the issue that threads may be waiting for up to the deadlock timeout. An alternative approach is to have a short deadlock timeout and rescue/retry a few times.
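The rescue/retry idea is sketched below in Python/SQLAlchemy (the stack of the main question); the Rails version has the same shape, rescuing the deadlock exception and retrying. The helper name and backoff values here are made up for illustration.
import time
from sqlalchemy.exc import OperationalError   # driver-level deadlock errors usually surface as this

def with_deadlock_retry(session, work, attempts=3, base_delay=0.05):
    """Run work(session) and commit, retrying a few times if the database
    reports a deadlock or lock wait timeout (a sketch, not production code)."""
    for attempt in range(attempts):
        try:
            result = work(session)
            session.commit()
            return result
        except OperationalError:
            session.rollback()
            if attempt == attempts - 1:
                raise                               # give up after the last attempt
            time.sleep(base_delay * (attempt + 1))  # brief backoff before retrying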

Read after write consistency with mysql and multiple concurrent connections

I'm trying to understand whether it is possible to achieve the following:
I have multiple instances of an application server running behind a round-robin load balancer. The client expects GET-after-POST/PUT semantics: in particular, the client will make a POST request, wait for the response, and immediately make a GET request expecting the response to reflect the change made by the POST request, e.g.:
> Request: POST /some/endpoint
< Response: 201 CREATED
< Location: /some/endpoint/123
> Request: GET /some/endpoint/123
< Response must not be 404 Not Found
It is not guaranteed that both requests are handled by the same application server. Each application server has a pool of connections to the DB. Each request will commit a transaction before responding to the client.
Thus the database will, on one connection, see an INSERT statement followed by a COMMIT. On another connection, it will see a SELECT statement. Temporally, the SELECT will come strictly after the COMMIT, though possibly only by a tiny margin on the order of milliseconds.
The application server I have in mind uses Java, Spring, and Hibernate. The database is MySQL 5.7.11 managed by Amazon RDS in a multiple availability zone setup.
I'm trying to understand whether this behavior can be achieved, and if so, how. There is a similar question, but the answer there suggests locking the table, which does not seem right for an application that must handle concurrent requests.
Under ordinary circumstances, you will not have any issue with this sequence of requests, since your MySQL will have committed the changes to the database by the time the 201 response has been sent back. Therefore, any subsequent statements will see the created / updated record.
What could be the extraordinary circumstances under which the subsequent select will not find the updated / inserted record?
Another process commits an update or delete statement that changes or removes the given record. There is not too much you can do about this, since it is part of normal operation. If you do not want such a thing to happen, you have to implement application-level locking of the data.
The subsequent GET request is routed not only to a different application server, but to one that uses (or is forced to use) a different database instance which does not yet have the most up-to-date state of that record. I would only envisage this happening if there is a severe failure at the application or database server level, or if routing of the request goes really wrong (e.g. routed to a data center in a different geographical location). These should not happen too frequently.
If you're using MyISAM tables, you might be seeing the effects of 'concurrent inserts' (see 8.11.3 in the mysql manual). You can avoid them by either setting the concurrent_insert system variable to 0, or by using the HIGH_PRIORITY keyword on the INSERT.

SQLAlchemy Session Object

I'm having some confusion about the session object in SQLAlchemy. Is it like a PHP session, where a session covers all the transactions of a user, or is a session an entity that scopes the lifetime of a single transaction?
For every transaction in SQLAlchemy, is the procedure as follows:
- create and open a session
- perform the transaction's work
- commit or roll back
- close the session
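In code, I picture that procedure looking roughly like this (Session being a sessionmaker() factory and User a placeholder model):
session = Session()                     # create and open a session
try:
    session.add(User(name='alice'))     # perform the transaction's work
    session.commit()                    # commit ...
except Exception:
    session.rollback()                  # ... or roll back on failure
    raise
finally:
    session.close()                     # close the session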
So, my question is: for a client, do we create a single session object, or is a session object created whenever we have a transaction to perform?
I would be hesitant to compare a SQLAlchemy session with a PHP session, since typically a PHP session refers to cookies, whereas SQLAlchemy has nothing to do with cookies or HTTP at all.
As explained by the documentation:
A Session is typically constructed at the beginning of a logical operation where database access is potentially anticipated.
The Session, whenever it is used to talk to the database, begins a database transaction as soon as it starts communicating. Assuming the autocommit flag is left at its recommended default of False, this transaction remains in progress until the Session is rolled back, committed, or closed. The Session will begin a new transaction if it is used again, subsequent to the previous transaction ending; from this it follows that the Session is capable of having a lifespan across many transactions, though only one at a time. We refer to these two concepts as transaction scope and session scope.
The implication here is that the SQLAlchemy ORM is encouraging the developer to establish these two scopes in his or her application, including not only when the scopes begin and end, but also the expanse of those scopes, for example should a single Session instance be local to the execution flow within a function or method, should it be a global object used by the entire application, or somewhere in between these two.
As you can see, it is completely up to the developer of the application to determine how to use the session. In a simple desktop application, it might make sense to create a single global session object and just keep using it, committing as the user hits "save". In a web application, a "session per request" strategy is often used. Sometimes you use both strategies in the same application (a session per request for web requests, but a single session with slightly different properties for background tasks).
There is no "one size fits all" solution for when to use a session. The documentation does give hints as to how you might go about determining this.

How to avoid 'Transaction managed block ended with pending COMMIT/ROLLBACK' error across methods

I have a situation where I have applied the @transaction.commit_manually decorator to a method in which I am importing information passed back in an HTTP response. I need to control committing and rolling back depending on whether business validation rules pass or fail.
Now, when I have some sort of validation failure, I have a separate method in which I log an error to the database. This action should always commit immediately, while leaving the primary transaction in its current state. However, if I apply the @transaction.commit_on_success decorator to the error-capturing routine, my primary transaction commits automatically as well. If I don't apply the @transaction.commit_on_success decorator, then I receive the 'Transaction managed block ended with pending COMMIT/ROLLBACK' error as soon as a call is made to the error-capturing routine.
I am using MySQL 5.1.49 with the InnoDB storage engine.
Is there a way to persist the open transaction in the calling routine while committing the transaction in the second routine?
Django's default transaction management doesn't support nested transactions. In general, transactions can't be nested: everything that's done in the midst of a transaction is either committed or rolled back. So when you commit the transaction, no matter where you commit it, the whole thing is atomic.
Looking around online, I found a snippet that might be a good starting point for you. It essentially overrides the commit_on_success decorator, adding a form of reference counting: in a sense, it forgoes committing if it's not the outermost block.
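I don't have the snippet handy, but the shape of it is roughly the following (a sketch against the old pre-Django-1.6 transaction API; the decorator name and thread-local are my own): the outermost decorated call gets real commit_on_success behaviour, while nested decorated calls simply run inside the caller's transaction instead of ending it.
import threading
from functools import wraps
from django.db import transaction

_nesting = threading.local()

def commit_on_success_unless_nested(func):
    """Apply commit_on_success only at the outermost call level; nested
    decorated calls join the caller's transaction instead of ending it."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        depth = getattr(_nesting, 'depth', 0)
        _nesting.depth = depth + 1
        try:
            if depth == 0:
                # outermost level: wrap the call in a real commit_on_success block
                return transaction.commit_on_success(func)(*args, **kwargs)
            # nested level: just run the function; the outer block commits or rolls back
            return func(*args, **kwargs)
        finally:
            _nesting.depth = depth
    return wrapper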

Rails debugging in production environment

I'm creating a Twitter application, and every time a user refreshes the page it reloads the newest messages from Twitter and saves them to the local database, unless they have already been created before. This works well in the development environment (database: sqlite3), but in the production environment (MySQL) it always creates the messages again, even though they have already been created.
Message creation is checked against the twitter_id that each message has:
msg = Message.find_by_twitter_id(message_hash['id'].to_i)
if msg.nil?
  # create a new message (and possibly a new user) from message_hash,
  # assigning it to msg
end
msg.save
Apparently, in the production environment it's unable to find the messages by twitter id for some reason (when I look at the database, all the attributes have been saved correctly).
With this long introduction, I guess my main question is how do I debug this? (unless you already have an answer to the main problem, of course :) When I look in the production.log, it only shows something like:
Processing MainPageController#feeds (for 91.154.7.200 at 2010-01-16 14:35:36) [GET]
Rendering template within layouts/application
Rendering main_page/feeds
Completed in 9774ms (View: 164, DB: 874) | 200 OK [http://www.tweets.vidious.net/]
...but not the database requests, logger.debug texts, or anything that could help me find the problem.
You can change the log level in production by setting the log level in config/environments/production.rb:
config.log_level = :debug
That will log the sql and everything else you are used to seeing in dev - it will slow down the app a bit, and your logs will be large, so use judiciously.
But as to the actual problem behind the question...
Could it be because of multiple connections accessing mysql?
If the twitter entries have not yet been committed, then a query for them from another connection will not return them, so if your query for them is called before the commit, then you won't find them, and will instead insert the same entries again. This is much more likely to happen in a production environment with many users than with you alone testing on sqlite.
Since you are using MySQL, you could use a unique key on the twitter id to prevent dupes, then catch the ActiveRecord exception if you try to insert a dupe. This means handling an error, which is not a pretty way to handle it (though I recommend doing it as a backup means of preventing dupes - MySQL is good for this, use it).
You should also prevent the attempt to insert the dupes. One way is to use a lock on a common record, say the User record which all the tweets are related to, so that another process cannot try to add tweets for the user until it can get that lock (which you will only free once the transaction is done), and so prevent simultaneous commits of the same info.
I ran into a similar issue while saving emails to a database. I agree with Andrew: set the log level to debug for more information on what exactly is happening.
As for the actual problem, you can try adding a unique index to the database that will prevent two items from being saved with the same parameters. This is like validates_uniqueness_of, but at the database level, and it is very effective: Mysql Constraign Database Entries in Rails.
For example, if you don't want any message objects in your database with both a duplicate twitter id and a duplicate body of text (which would mean the same person tweeted the same text), you can add this to your migration:
add_index :messages, [:twitter_id, :body], :unique => true
It can take a small amount of time after you tell an object in Rails to save before it is actually visible in the database (for example, until the transaction commits), which may be why the query for the id doesn't find anything yet.
For your production server, I would recommend setting up an error-tracking service such as Rollbar to report all of the unhandled errors and exceptions from your production servers.
It can also store a bunch of useful information, like the HTTP request, the requesting user, and the code that raised the error, and it can send email notifications each time an unhandled exception happens on your production server.
Here is a simple article about debugging in rails that could help you out.