The Alembic documentation states that "a standalone Operations instance can be made for use cases external to regular Alembic migrations by passing in a MigrationContext". An example is given:
from alembic.migration import MigrationContext
from alembic.operations import Operations
conn = myengine.connect()
ctx = MigrationContext.configure(conn)
op = Operations(ctx)
op.alter_column("t", "c", nullable=True)
How can this be done as a transaction? In other words, how can these operations be rolled back?
Our web app is built on SQLAlchemy and the Pyramid framework, and we want to use Alembic to manage database migrations. The application consists of several packages that operate on one database, so we have multiple models.py files that need to be migrated. I am confused about how to handle this. I got reasonably far with the following in my env.py:
from pkg_a.app.models import Base as pkg_a_base
from pkg_b.app.models import Base as pkg_b_base
from pkg_c.app.models import Base as pkg_c_base
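from sqlalchemy import MetaData

def combine_metadata(*args):
    # Copy every table from each MetaData object into one combined MetaData.
    m = MetaData()
    for metadata in args:
        for t in metadata.tables.values():
            t.tometadata(m)
    return m

target_metadata = combine_metadata(pkg_a_base.metadata, pkg_b_base.metadata, pkg_c_base.metadata)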
This works great the first time. However, if I add one more model later, just adding that to this list doesn't do much. I was expecting that running
alembic revision -m "added a new model pkg_d.models" --version-path=migrations/versions --autogenerate
would generate a new version file containing the code to add the tables from pkg_d.models. But it doesn't. What am I doing wrong here?
If your packages are completely independent and separate then each of them should have a separate migration history - either stored inside each package (pkg_a.migrations, pkg_b.migrations etc.) or at least stored in a separate top-level migrations directory, by giving each package its own section in alembic.ini and using the -n option of the alembic command to select the section:
[pkg_a]
# path to migration scripts
script_location = migrations_a
sqlalchemy.url = xxx
[pkg_b]
script_location = migrations_b
sqlalchemy.url = xxx
[pkg_c]
script_location = migrations_c
sqlalchemy.url = xxx
And then you'll be able to use alembic revision -n pkg_a -m "added a new model pkg_a.models"
If, however, your models are dependent in any way then they should use a common Base - you do realize you don't have to keep all your SQLAlchemy stuff in a single models.py file, don't you? I would create a separate "base" package which would contain a common MetaData, Base and other SQLAlchemy configuration stuff which would then be imported by your other packages.
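A minimal sketch of the shared-base idea, assuming SQLAlchemy 1.4 or newer for sqlalchemy.orm.declarative_base. The package layout (base/meta.py, pkg_a/models.py, pkg_b/models.py) and the two model classes are hypothetical; everything is collapsed into one file here so it can be run directly.

```python
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base

# --- hypothetical shared "base" package (e.g. base/meta.py) ---
Base = declarative_base()

# --- hypothetical pkg_a/models.py: would do `from base.meta import Base` ---
class Customer(Base):
    __tablename__ = "pkg_a_customer"
    id = Column(Integer, primary_key=True)
    name = Column(String(100))

# --- hypothetical pkg_b/models.py: would do `from base.meta import Base` ---
class Invoice(Base):
    __tablename__ = "pkg_b_invoice"
    id = Column(Integer, primary_key=True)
    number = Column(String(20))

# env.py would then only need the one shared MetaData object.
target_metadata = Base.metadata
print(sorted(target_metadata.tables))
```

Note that a table only appears in Base.metadata once the module defining its model has been imported, so env.py still has to import every models module (including a later pkg_d.models) before autogenerate runs.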
I have a pyramid view that is used for loading data from a large file into a database. For each line in the file it does a little processing then creates some model instances and adds them to the session. This works fine except when the files are big. For large files the view slowly eats up all my ram until everything effectively grinds to a halt.
So my idea is to process each line individually with a function that creates a session, creates the necessary model instances and adds them to the current session, then commits.
def commit_line(lTitles,lLine,oStartDate,oEndDate,iDS,dSettings):
    from sqlalchemy.orm import (
        scoped_session,
        sessionmaker,
    )
    from sqlalchemy import engine_from_config
    from pyramidapp.models import Base, DataEntry
    from zope.sqlalchemy import ZopeTransactionExtension
    import transaction
    oCurrentDBSession = scoped_session(sessionmaker(extension=ZopeTransactionExtension()))
    engine = engine_from_config(dSettings, 'sqlalchemy.')
    oCurrentDBSession.configure(bind=engine)
    Base.metadata.bind = engine
    oEntry = DataEntry()
    oCurrentDBSession.add(oEntry)
    ...
    transaction.commit()
My requirements for this function are as follows:
1. create a session (check)
2. make a bunch of model instances (check)
3. add those instances to the session (check)
4. commit those models to the database
5. get rid of the session (so that it and the objects created in 2 are garbage collected)
I've made sure that the newly created session is passed as an argument whenever necessary in order to stop errors to do with multiple sessions blah blah. But alas! I can't get database connections to go away and stuff isn't being committed.
I tried separating the function out into a celery task so the view executes to completion and does what it needs to but I'm getting an error in celery about having too many mysql connections no matter what I try in terms of committing and closing and disposing and I'm not sure why. And yes, I restart the celery server when I make changes.
Surely there is a simple way to do this? All I want to do is make a session commit then go away and leave me alone.
Creating a new session for each line of your large file is going to be quite slow I would imagine.
What I would try is to commit the session and expunge all objects from it every 1000 rows or so:
counter = 0
for line in mymegafile:
    entry = process_line(line)
    session.add(entry)
    if counter > 1000:
        counter = 0
        transaction.commit() # if you insist on using ZopeTransactionExtension, otherwise session.commit()
        session.expunge_all() # this may not be required actually, see https://groups.google.com/forum/#!topic/sqlalchemy/We4XGX2CYX8
    else:
        counter += 1
If there are no references to DataEntry instances from anywhere they should be garbage collected by Python interpreter at some point.
However, if all you're doing in that view is inserting new records to the database, it may be much more efficient to use SQLAlchemy Core constructs or literal SQL to bulk-insert data. This would also get rid of the problem with your ORM instances eating up your RAM. See "I'm inserting 400,000 rows with the ORM and it's really slow!" in the SQLAlchemy FAQ for details.
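As a rough illustration of the Core approach, the sketch below inserts plain dicts in chunks through Table.insert(), so no ORM instances are created at all. In-memory SQLite, the data_entry table and its columns are assumptions made for the example, not the asker's schema.

```python
import sqlalchemy as sa

# In-memory SQLite stands in for the real database; the table and its
# columns are hypothetical.
engine = sa.create_engine("sqlite://")
metadata = sa.MetaData()
data_entry = sa.Table(
    "data_entry",
    metadata,
    sa.Column("id", sa.Integer, primary_key=True),
    sa.Column("title", sa.String(50)),
    sa.Column("value", sa.Integer),
)
metadata.create_all(engine)


def bulk_insert(rows, chunk_size=1000):
    """Insert dicts in chunks; each chunk is one executemany in its own transaction."""
    chunk = []
    for row in rows:
        chunk.append(row)
        if len(chunk) >= chunk_size:
            with engine.begin() as conn:
                conn.execute(data_entry.insert(), chunk)
            chunk = []
    if chunk:  # flush the final partial chunk
        with engine.begin() as conn:
            conn.execute(data_entry.insert(), chunk)


# A generator keeps only one chunk of rows in memory at a time.
bulk_insert(({"title": "row %d" % i, "value": i} for i in range(2500)))

with engine.connect() as conn:
    total = conn.execute(sa.text("SELECT COUNT(*) FROM data_entry")).scalar()
print(total)
```

Because the rows are consumed from a generator, memory use is bounded by chunk_size rather than by the size of the file.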
So I tried a bunch of things and, although solving this with SQLAlchemy's built-in functionality was probably possible, I could not find a way to pull it off.
So here's an outline of what I did:
separate the lines to be processed into batches
for each batch of lines queue up a celery task to deal with those lines
in the celery task a separate process is launched that does the necessary stuff with the lines.
Reasoning:
The batch stuff is obvious
Celery was used because it took a heck of a long time to process an entire file so queuing just made sense
the task launched a separate process because if it didn't then I had the same problem that I had with the pyramid application
Some code:
Celery task:
def commit_lines(lLineData,dSettings,cwd):
    """
    writes the line data to a file then calls a process that reads the file and creates
    the necessary data entries. Then deletes the file
    """
    import lockfile
    sFileName = "/home/sheena/tmp/cid_line_buffer"
    lock = lockfile.FileLock("{0}_lock".format(sFileName))
    with lock:
        f = open(sFileName,'a') #in case the process was at any point interrupted...
        for d in lLineData:
            f.write('{0}\n'.format(d))
        f.close()
    #now call the external process
    import subprocess
    import os
    sConnectionString = dSettings.get('sqlalchemy.url')
    lArgs = [
        'python',os.path.join(cwd,'commit_line_file.py'),
        '-c',sConnectionString,
        '-f',sFileName
    ]
    #open the subprocess. wait for it to complete before continuing with stuff. if errors: raise
    subprocess.check_call(lArgs,shell=False)
    #and clear the file
    lock = lockfile.FileLock("{0}_lock".format(sFileName))
    with lock:
        f = open(sFileName,'w')
        f.close()
External process:
"""
this script goes through all lines in a file and creates data entries from the lines
"""
def main():
    from optparse import OptionParser
    from sqlalchemy import create_engine
    from pyramidapp.models import Base,DBSession
    import ast
    import transaction
    #get options
    oParser = OptionParser()
    oParser.add_option('-c','--connection_string',dest='connection_string')
    oParser.add_option('-f','--input_file',dest='input_file')
    (oOptions, lArgs) = oParser.parse_args()
    #set up connection
    #engine = engine_from_config(dSettings, 'sqlalchemy.')
    engine = create_engine(
        oOptions.connection_string,
        echo=False)
    DBSession.configure(bind=engine)
    Base.metadata.bind = engine
    #commit stuffs
    import lockfile
    lock = lockfile.FileLock("{0}_lock".format(oOptions.input_file))
    with lock:
        for sLine in open(oOptions.input_file,'r'):
            dLine = ast.literal_eval(sLine)
            create_entry(**dLine)
        transaction.commit()

def create_entry(iDS,oStartDate,oEndDate,lTitles,lValues):
    #import stuff
    oEntry = DataEntry()
    #do some other stuff, make more model instances...
    DBSession.add(oEntry)

if __name__ == "__main__":
    main()
in the view:
for line in big_giant_csv_file_handler:
    lLineData.append({'stuff':'lots'})
if lLineData:
    lLineSets = [lLineData[i:i+iBatchSize] for i in range(0,len(lLineData),iBatchSize)]
    for l in lLineSets:
        commit_lines.delay(l,dSettings,sCWD) #queue it for celery
You are just doing it wrong. Period.
Quoted from SQLAlchemy docs
The advanced developer will try to keep the details of session,
transaction and exception management as far as possible from the
details of the program doing its work.
Quoted from Pyramid docs
We made the decision to use SQLAlchemy to talk to our database. We also, though, installed pyramid_tm and zope.sqlalchemy.
Why?
Pyramid has a strong orientation towards support for transactions.
Specifically, you can install a transaction manager into your app
application, either as middleware or a Pyramid "tween". Then, just
before you return the response, all transaction-aware parts of your
application are executed. This means Pyramid view code usually doesn't
manage transactions.
My answer today is not code, but a recommendation to follow best practices recommended by the authors of the packages/frameworks you are working with.
References
Big picture - Using Thread-Local Scope with Web Applications
Typical error message when doing it wrong
Databases using SQLAlchemy
How to use scoped_session
Encapsulate CSV reading and the creation of SQLAlchemy model instances in something that supports the iterator protocol. I called it BatchingModelReader. It returns a collection of DataEntry instances; the collection size depends on the batch size. If the model changes over time, you do not need to change the celery task. The task only puts a batch of models into a session and commits the transaction. By controlling the batch size you control memory consumption. Neither BatchingModelReader nor the celery task holds huge amounts of intermediate data. This example also shows that using celery is optional. I added links to code samples from a Pyramid application I am actually refactoring in a GitHub fork.
BatchingModelReader - encapsulates csv.reader and uses existing models from your pyramid application
get inspired by source code of csv.DictReader
could be run as a celery task - use appropriate task decorator
from .models import DBSession
import transaction

def import_from_csv(path_to_csv, batchsize):
    """given a CSV file and batchsize iterate over batches of model instances and import them to database"""
    for batch in BatchingModelReader(path_to_csv, batchsize):
        with transaction.manager:
            DBSession.add_all(batch)
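The BatchingModelReader itself is not shown above, so here is one possible sketch using only the standard library. It is an assumption about how such a class could look, not the author's implementation: it takes an open file object rather than a path, and make_model is a hypothetical callable that turns one CSV row into a model instance (plain dicts are used as a stand-in for DataEntry).

```python
import csv
import io
from itertools import islice


class BatchingModelReader(object):
    """Iterate over a CSV source, yielding lists of at most `batchsize` objects.

    `make_model` is a hypothetical callable that turns one CSV row (a dict)
    into a model instance; in the application it would build a DataEntry.
    """

    def __init__(self, csvfile, batchsize, make_model=dict):
        self._reader = csv.DictReader(csvfile)
        self._batchsize = batchsize
        self._make_model = make_model

    def __iter__(self):
        return self

    def __next__(self):
        # Pull at most `batchsize` rows; only this batch is held in memory.
        rows = list(islice(self._reader, self._batchsize))
        if not rows:
            raise StopIteration
        return [self._make_model(row) for row in rows]

    next = __next__  # Python 2 spelling of the iterator protocol


# Usage with an in-memory file standing in for the uploaded CSV.
sample = io.StringIO("title,value\na,1\nb,2\nc,3\nd,4\ne,5\n")
batches = list(BatchingModelReader(sample, batchsize=2))
print([len(batch) for batch in batches])
```

Since each batch is built lazily from the underlying csv reader, the memory footprint is set by batchsize and not by the length of the file.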
pyramid view - just save big giant CSV file, start task, return response immediately
@view_config(...)
def view(request):
    """gets file from request, save it to filesystem and start celery task"""
    with open(path_to_csv, 'w') as f:
        f.write(big_giant_csv_file)
    #start task with parameters
    import_from_csv.delay(path_to_csv, 1000)
Code samples
ToDoPyramid - commit transaction from commandline
ToDoPyramid - commit transaction from request
Pyramid using SQLAlchemy
Databases using SQLAlchemy
SQLAlchemy internals
Big picture - Using Thread-Local Scope with Web Applications
How to use scoped_session
On the server I am using a combination of Tornado and SQLAlchemy (SQLAlchemy may not be the best choice for an async server, but it is temporary). I have split the project and its handlers into 10 files/modules.
In every module I call session = Session() and use session to query the database.
The common part of every module looks like this:
...
import tornado.ioloop
engine = create_engine(DB_URL, echo=False, pool_size=100, pool_recycle=3600)
Session = sessionmaker(bind=engine)
class BaseHandler(tornado.web.RequestHandler):
    ....
Do I need to make
engine = create_engine(DB_URL, echo=False, pool_size=100, pool_recycle=3600)
Session = sessionmaker(bind=engine)
a singleton instead of creating them in every module, or is this an acceptable way to create sessions?
You probably want to use scoped_session which essentially serves as a thread-local singleton, creating sessions on-demand using the provided factory function.
In one module imported by all others you write:
engine = create_engine(DB_URL, echo=False, pool_size=100, pool_recycle=3600)
session_factory = sessionmaker(bind=engine)
Session = scoped_session(session_factory)
# or make it a Tornado Application property
And then either use Session as an explicit factory:
session = Session()
session.query(...)
Or use implicit method delegation:
Session.query(...)
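To make the "thread-local singleton" behaviour concrete, here is a small sketch. In-memory SQLite replaces DB_URL so it runs without a server; no query is issued, so only the registry behaviour is shown.

```python
import threading

from sqlalchemy import create_engine
from sqlalchemy.orm import scoped_session, sessionmaker

# In-memory SQLite stands in for DB_URL from the question.
engine = create_engine("sqlite://")
Session = scoped_session(sessionmaker(bind=engine))

# Repeated calls in one thread return the same underlying session.
first = Session()
second = Session()
same_in_thread = first is second

# Another thread gets its own session from the same registry.
seen = {}

def worker():
    seen["other"] = Session()
    Session.remove()  # discard this thread's session when done

t = threading.Thread(target=worker)
t.start()
t.join()
different_across_threads = seen["other"] is not first

# remove() closes the current session; the next call creates a fresh one.
Session.remove()
fresh = Session()
replaced_after_remove = fresh is not first
print(same_in_thread, different_across_threads, replaced_after_remove)
```

Calling Session.remove() at the end of each request is the usual way to make sure the next request in the same thread starts with a clean session.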
I have a pyramid app called mainsite.
The site works in a pretty asynchronous manner mostly through threads being launched from the view to carry out the backend operations.
It connects to mysql with sqlalchemy and uses ZopeTransactionExtension for session management.
So far the application has been running great.
I need to run periodic jobs on it and it needs to use some of the same asynchronous functions that are being launched from the view.
I used apscheduler but ran into issues with that. So I thought of using celery beat as a separate process that treats mainsite as a library and imports the functions to be used.
My celery config looks like this:
from datetime import timedelta
from api.apiconst import RERUN_CHECK_INTERVAL, AUTOMATION_CHECK_INTERVAL, \
    AUTH_DELETE_TIME
BROKER_URL = 'sqla+mysql://em:em@localhost/edgem'
CELERY_RESULT_BACKEND = "database"
CELERY_RESULT_DBURI = 'mysql://em:em@localhost/edgem'
CELERYBEAT_SCHEDULE = {
    'rerun': {
        'task': 'tasks.rerun_scheduler',
        'schedule': timedelta(seconds=RERUN_CHECK_INTERVAL)
    },
    'automate': {
        'task': 'tasks.automation_scheduler',
        'schedule': timedelta(seconds=20)
    },
    'remove-tokens': {
        'task': 'tasks.token_remover_scheduler',
        'schedule': timedelta(seconds=2 * 24 * 3600 )
    },
}
CELERY_TIMEZONE = 'UTC'
The tasks.py is
from celery import Celery
celery = Celery('tasks')
celery.config_from_object('celeryconfig')

@celery.task
def rerun_scheduler():
    from mainsite.task import check_update_rerun_tasks
    check_update_rerun_tasks()

@celery.task
def automation_scheduler():
    from mainsite.task import automate
    automate()

@celery.task
def token_remover_scheduler():
    from mainsite.auth_service import delete_old_tokens
    delete_old_tokens()
Keep in mind that all of the above functions return immediately but launch threads if required.
The threads save objects into the db by doing transaction.commit() after session.add(object).
The problem is that the whole thing works like a gem only for about 30 minutes. After that, ResourceClosedError: The transaction is closed errors start happening wherever there is a transaction.commit(). I am not sure what the problem is and I need help troubleshooting.
The reason I import inside the tasks was to get rid of this error. I thought importing every time a task needed to run was a good idea and that I might get a new transaction each time, but it looks like that is not the case.
In my experience trying to reuse a session configured to be used with Pyramid (with ZopeTransactionExtension etc.) with a Celery worker results in a terrible hard-to-debug mess.
ZopeTransactionExtension binds the SQLAlchemy session to Pyramid's request-response cycle: a transaction is started and then committed or rolled back automatically. You're generally not supposed to call transaction.commit() within your code. If everything is OK, ZTE commits everything; if your code raises an exception, your transaction is rolled back.
With Celery you need to manage SQLAlchemy sessions manually, which ZTE prevents you from doing, so you need to configure your DBSession differently.
Something simple like this would work:
DBSession = None
def set_dbsession(session):
    global DBSession
    if DBSession is not None:
        raise AttributeError("DBSession has been already set to %s!" % DBSession)
    DBSession = session
And then from Pyramid startup code you do
def main(global_config, **settings):
    ...
    set_dbsession(scoped_session(sessionmaker(extension=ZopeTransactionExtension())))
With Celery it's a bit trickier - I ended up creating a custom start script for Celery, in which I configure the session.
In setup.py of the worker egg:
entry_points="""
# -*- Entry points: -*-
[console_scripts]
custom_celery = worker.celeryd:start_celery
custom_celerybeat = worker.celeryd:start_celerybeat
""",
)
in worker/celeryd.py:
def initialize_async_session(db_string, db_echo):
    import sqlalchemy as sa
    from db import Base, set_dbsession
    session = sa.orm.scoped_session(sa.orm.sessionmaker(autoflush=True, autocommit=True))
    engine = sa.create_engine(db_string, echo=db_echo)
    session.configure(bind=engine)
    set_dbsession(session)
    Base.metadata.bind = engine

def start_celery():
    initialize_async_session(DB_STRING, DB_ECHO)
    import celery.bin.celeryd
    celery.bin.celeryd.main()
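For what "managing the session manually" can look like inside a task, here is a hedged sketch. It deliberately uses a different technique from the autocommit=True session above: a plain sessionmaker with an explicit commit/rollback/close around each unit of work. The Token model, in-memory SQLite and the save_token function are hypothetical, and sqlalchemy.orm.declarative_base assumes SQLAlchemy 1.4 or newer.

```python
from contextlib import contextmanager

from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()


class Token(Base):  # hypothetical model, for illustration only
    __tablename__ = "token"
    id = Column(Integer, primary_key=True)
    value = Column(String(32))


# In-memory SQLite stands in for the MySQL database.
engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
SessionFactory = sessionmaker(bind=engine)  # plain factory, no ZopeTransactionExtension


@contextmanager
def session_scope():
    """Open a session for one unit of work; commit on success, roll back on error."""
    session = SessionFactory()
    try:
        yield session
        session.commit()
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()


def save_token(value):
    # Body of a hypothetical task: the session lives only for this call.
    with session_scope() as session:
        session.add(Token(value=value))


save_token("abc")
with session_scope() as session:
    count = session.query(Token).count()
print(count)
```

Each task invocation gets its own short-lived session, so a failure in one task cannot leave a closed transaction behind for the next one.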
The general approach you're using with "threads being launched from the view to carry out the backend operations" feels a bit dangerous to me if you ever plan to deploy the application to a production server - a web server often recycles, kills or creates new "workers" so generally there are no guarantees each particular process would survive beyond the current request-response cycle. I never tried doing this though, so maybe you'll be ok :)
I am using the sqlalchemy package in python. I have an operation that takes some time to execute after I perform an autoload on an existing table. This causes the following error when I attempt to use the connection:
sqlalchemy.exc.OperationalError: (OperationalError) (2006, 'MySQL server has gone away')
I have a simple utility function that performs an insert many:
def insert_data(data_2_insert, table_name):
    engine = create_engine('mysql://blah:blah123@localhost/dbname')
    # Metadata is a Table catalog.
    metadata = MetaData()
    mytable = Table(table_name, metadata, autoload=True, autoload_with=engine)
    for c in mytable.c:
        print(c)
    column_names = tuple(c.name for c in mytable.c)
    final_data = [dict(zip(column_names, x)) for x in data_2_insert]
    ins = mytable.insert()
    conn = engine.connect()
    conn.execute(ins, final_data)
    conn.close()
It is the following line that takes a long time to execute, since 'data_2_insert' has 677,161 rows.
final_data = [dict(zip(column_names, x)) for x in data_2_insert]
I came across this question which refers to a similar problem. However I am not sure how to implement the connection management suggested by the accepted answer because robots.jpg pointed this out in a comment:
Note for SQLAlchemy 0.7 - PoolListener is deprecated, but the same solution can be implemented using the new event system.
If someone can please show me a couple of pointers on how I could go about integrating the suggestions into the way I use sqlalchemy I would be very appreciative. Thank you.
I think you are looking for something like this:
from sqlalchemy import exc, event
from sqlalchemy.pool import Pool

@event.listens_for(Pool, "checkout")
def check_connection(dbapi_con, con_record, con_proxy):
    '''Listener for Pool checkout events that pings every connection before using.
    Implements pessimistic disconnect handling strategy. See also:
    http://docs.sqlalchemy.org/en/rel_0_8/core/pooling.html#disconnect-handling-pessimistic'''
    cursor = dbapi_con.cursor()
    try:
        cursor.execute("SELECT 1") # could also be dbapi_con.ping(),
                                   # not sure what is better
    except exc.OperationalError as ex:
        if ex.args[0] in (2006, # MySQL server has gone away
                          2013, # Lost connection to MySQL server during query
                          2055): # Lost connection to MySQL server at '%s', system error: %d
            # caught by pool, which will retry with a new connection
            raise exc.DisconnectionError()
        else:
            raise
If you wish to trigger this strategy conditionally, you should avoid use of decorator here and instead register listener using listen() function:
# somewhere during app initialization
if config.check_connection_on_checkout:
event.listen(Pool, "checkout", check_connection)
More info:
Connection Pool Events
Events API
There is a better way to handle it right now - pool_recycle
engine = create_engine('mysql://...', pool_recycle=3600)
MySQL has a default wait_timeout of 8 hours (28800 seconds).
As a result, MySQL closes idle connections, but the layer above it (such as SQLAlchemy) does not know about it.
There are 2 ways to solve it -
Optimistic - Using pool_recycle
Pessimistic - using pool_pre_ping=True
I prefer pool_recycle because it doesn't emit a SELECT 1 on every connection checkout, which puts less load on the db.
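Both options are plain keyword arguments to create_engine, and they can be combined. The sketch below uses in-memory SQLite instead of a MySQL URL so it runs anywhere; the value 3600 is only an example and should stay below the server's wait_timeout.

```python
import sqlalchemy as sa

# SQLite stands in for the MySQL URL so the sketch runs anywhere;
# the two pool options are the ones named above.
engine = sa.create_engine(
    "sqlite://",
    pool_recycle=3600,    # replace connections older than one hour on checkout
    pool_pre_ping=True,   # test each connection on checkout, reconnect if stale
)

with engine.connect() as conn:
    result = conn.execute(sa.text("SELECT 1")).scalar()
print(result)
```

With pool_recycle alone, a connection that dies before its recycle age (for example after a server restart) would still fail once; pool_pre_ping covers that case at the cost of one extra round trip per checkout.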