Handling multiple models.py with alembic - sqlalchemy

Our web app is based on SQLAlchemy in the Pyramid framework, and we are looking to use Alembic for managing database migrations. The web application consists of various packages that operate on one database. This consequently means we have multiple models.py files that need to be migrated. I am confused as to how to handle this. I got some of the way using the following in my env.py:
from sqlalchemy import MetaData

from pkg_a.app.models import Base as pkg_a_base
from pkg_b.app.models import Base as pkg_b_base
from pkg_c.app.models import Base as pkg_c_base

def combine_metadata(*args):
    m = MetaData()
    for metadata in args:
        for t in metadata.tables.values():
            t.tometadata(m)
    return m

target_metadata = combine_metadata(
    pkg_a_base.metadata, pkg_b_base.metadata, pkg_c_base.metadata)
This works great the first time. However, if I add one more model later, just adding it to this list doesn't do much. I was expecting that running
alembic revision -m "added a new model pkg_d.models" --version-path=migrations/versions --autogenerate
would generate a new version file containing the code for adding the tables from pkg_d.models, but it doesn't. What am I doing wrong here?

If your packages are completely independent and separate, then each of them should have a separate migration history - either stored inside each package (pkg_a.migrations, pkg_b.migrations etc.) or at least stored in a separate top-level migrations directory, by giving each package its own section in alembic.ini and using the -n parameter of the alembic command to specify which section to use:
[pkg_a]
# path to migration scripts
script_location = migrations_a
sqlalchemy.url = xxx
[pkg_b]
script_location = migrations_b
sqlalchemy.url = xxx
[pkg_c]
script_location = migrations_c
sqlalchemy.url = xxx
And then you'll be able to use alembic revision -n pkg_a -m "added a new model pkg_a.models"
If, however, your models are dependent on each other in any way, then they should use a common Base - you do realize you don't have to keep all your SQLAlchemy stuff in a single models.py file, don't you? I would create a separate "base" package containing a common MetaData, Base and other shared SQLAlchemy configuration, which would then be imported by your other packages.
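For illustration, a minimal sketch of such a shared package (the package and module names are placeholders, and the model shown is just an example):

# pkg_base/meta.py - shared SQLAlchemy configuration
from sqlalchemy import MetaData
from sqlalchemy.ext.declarative import declarative_base

metadata = MetaData()
Base = declarative_base(metadata=metadata)

# pkg_a/app/models.py - every package declares its tables against the same Base
from sqlalchemy import Column, Integer, String
from pkg_base.meta import Base

class Widget(Base):
    __tablename__ = 'widget'
    id = Column(Integer, primary_key=True)
    name = Column(String(50))

With that in place, env.py only needs target_metadata = Base.metadata, and autogenerate picks up tables from every package whose models module has been imported.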

How to use SQLAlchemy Utils in a SQLAlchemy model

I'm trying to create a user model that uses UUID as primary key:
from src.db import db  # SQLAlchemy instance
import sqlalchemy_utils
import uuid

class User(db.Model):
    __tablename__ = 'user'
    id = db.Column(sqlalchemy_utils.UUIDType(binary=True), primary_key=True, nullable=False)
But when I generate the migrations I receive:
File "/home/pc/Downloads/project/auth/venv/lib/python3.6/site-packages/alembic/runtime/environment.py", line 836, in run_migrations
    self.get_context().run_migrations(**kw)
File "/home/pc/Downloads/project/auth/venv/lib/python3.6/site-packages/alembic/runtime/migration.py", line 330, in run_migrations
    step.migration_fn(**kw)
File "/home/pc/Downloads/project/auth/migrations/versions/efae4166f832_.py", line 22, in upgrade
    sa.Column('id', sqlalchemy_utils.types.uuid.UUIDType(length=16), nullable=False),
NameError: name 'sqlalchemy_utils' is not defined
I have tried to explicitly reference the module I'm using, and I have also tried an 'internal' UUID implementation that SQLAlchemy offers.
Note: if I manually import sqlalchemy_utils in /migrations/versions/efae4166f832_.py and remove the automatically generated length in sa.Column('id', sqlalchemy_utils.types.uuid.UUIDType(length=16), nullable=False), it works fine.
I generate the migrations using a generate.py script:
from src import create_app
from src.db import db
from flask_migrate import Migrate

# Models
from src.user.models.user import User

app = create_app()
migrate = Migrate(app, db)
Note: MySQL engine
I expect that when I generate the migration, it generates a user model that uses the UUID type from SQLAlchemy Utils as the primary key.
You just have to add:
import sqlalchemy_utils
to your script.py.mako inside the migrations folder.
Thanks, Marco, but I have already fixed it. I have put the import sqlalchemy_utils line inside env.py and script.py.mako, and I have also added the following function:
def render_item(type_, obj, autogen_context):
    """Apply custom rendering for selected items"""
    if type_ == "type" and isinstance(obj, sqlalchemy_utils.types.uuid.UUIDType):
        # Add import for this type
        autogen_context.imports.add("import sqlalchemy_utils")
        autogen_context.imports.add("import uuid")
        return "sqlalchemy_utils.types.uuid.UUIDType(), default=uuid.uuid4"
    # Default rendering for other objects
    return False
Inside env.py. In the same file, I have set render_item=render_item in the context.configure() call in run_migrations_online:
context.configure(
    ...,
    render_item=render_item,
    ...
)
I looked for a way to do this automatically, but I couldn't find anything that could help me.
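For context, here is roughly how those pieces fit together in a plain Alembic env.py (a Flask-Migrate-generated env.py differs in the details, and config and target_metadata are assumed to be defined elsewhere in the file):

import sqlalchemy_utils
from alembic import context
from sqlalchemy import engine_from_config, pool

def run_migrations_online():
    connectable = engine_from_config(
        config.get_section(config.config_ini_section),
        prefix='sqlalchemy.',
        poolclass=pool.NullPool)
    with connectable.connect() as connection:
        context.configure(
            connection=connection,
            target_metadata=target_metadata,
            render_item=render_item,  # the custom renderer shown above
        )
        with context.begin_transaction():
            context.run_migrations()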
The order of the operations matters:
export FLASK_APP=manage.py
flask db init
Make the edits described above
flask db migrate
flask db upgrade
Background
It would be ideal if you didn't have to go and manually edit each migration file with an import sqlalchemy_utils statement.
Looking at the Alembic documentation, script.py.mako is "a Mako template file which is used to generate new migration scripts." Therefore, you'll need to re-generate your migration files, with Mako already importing sqlalchemy_utils as part of the migration file generation.
Fix
If possible, remove your old migrations (they're probably broken anyway), and add the import sqlalchemy_utils like so to your script.py.mako file:
from alembic import op
import sqlalchemy as sa
import sqlalchemy_utils #<-- line you add
${imports if imports else ""}
then just rerun your alembic migrations:
alembic revision --autogenerate -m "create initial tables"
when you go to look at your migration file you should see sqlalchemy_utils already imported via the mako script.
hope that helps.
Adding import sqlalchemy_utils in the script.py.mako file will automatically import this line on all the migration files that are generated and resolve the issue.
from alembic import op
import sqlalchemy as sa
import sqlalchemy_utils
${imports if imports else ""}
Add the import sqlalchemy_utils line to the newly-created migrations/versions/{hash}_my_comment.py file. However, this will only fix the problem for that specific step of the migration. If you expect that you'll be making lots of changes to columns which reference sqlalchemy_utils, you should probably do something more robust like Walter's suggestion. Even then, though, it looks like you may need to add code to properly deal with each column type you end up using.
NB: Despite seeing the suggestion in multiple places of just adding the import line to the script.py.mako file, that did not work for me.

How do I package all of my binaries in bazel?

There is a plethora of BUILD files scattered throughout the hierarchy of my mono repo.
Some of these files contain cc_binary rules.
I know they are all built into bazel-bin, but I'd like to get easy access to them all.
How can I package them all up, and put them all into ~/.bin/ for example?
I see the packaging rules, but it's not clear to me how to write a rule that captures every single program and packages them together.
It may not be the most elegant solution (plus I hope I got the question right), but this is how we do it, by packaging/"tarring" each binary in its own Bazel package / BUILD file:
load("@bazel_tools//tools/build_defs/pkg:pkg.bzl", "pkg_tar")

cc_binary(
    name = "hello",
    ...
)

pkg_tar(
    name = "hello_pkg",
    srcs = [":hello"],
    mode = "0755",
    package_dir = "/usr/bin",
)
And then we'd collect all those into a one overall tarball/package in project root:
pkg_tar(
    name = "mypkg",
    extension = "tar.gz",
    deps = [
        "//hello:hello_pkg",
        ...
    ],
)
Sometimes we actually have multiple such rules for hello, collecting for instance executables under bin and libraries under lib, with intermediary hello_bin and hello_lib targets. These would then, in the same fashion as mypkg above, first be aggregated into hello_pkg, and that in turn would be used in mypkg.
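As a rough sketch of that layered layout (the target names and the library target here are illustrative, not taken from our actual BUILD files):

# hello/BUILD - intermediary per-package tarballs
load("@bazel_tools//tools/build_defs/pkg:pkg.bzl", "pkg_tar")

pkg_tar(
    name = "hello_bin",
    srcs = [":hello"],  # the cc_binary from this package
    mode = "0755",
    package_dir = "/usr/bin",
)

pkg_tar(
    name = "hello_lib",
    srcs = [":hello_shared"],  # placeholder for whatever library artifacts this package builds
    mode = "0644",
    package_dir = "/usr/lib",
)

pkg_tar(
    name = "hello_pkg",
    deps = [
        ":hello_bin",
        ":hello_lib",
    ],
)

The root-level mypkg then lists //hello:hello_pkg in its deps exactly as before.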

SQLAlchemy automap: Best practices for performance

I'm building a python app around an existing (mysql) database and am using automap to infer tables and relationships:
base = automap_base()
self.engine = create_engine(
    'mysql://%s:%s@%s/%s?charset=utf8mb4' % (
        config.DB_USER, config.DB_PASSWD, config.DB_HOST, config.DB_NAME
    ), echo=False
)
# reflect the tables
base.prepare(self.engine, reflect=True)
self.TableName = base.classes.table_name
Using this I can do things like session.query(TableName) etc...
However, I'm worried about performance, because every time the app runs it will do the whole inference again.
Is this a legitimate concern?
If so, is there a possibility to 'cache' the output of Automap?
I think that "reflecting" the structure of your database is not the way to go. Unless your app tries to "infer" things from the structure, the way static code analysis does for source files, it is unnecessary. The other reason to reflect at run-time is the reduced time it takes to start using the database with SQLAlchemy - but you pay that reflection cost on every run.
Another option would be to use something like SQLACodegen (https://pypi.python.org/pypi/sqlacodegen):
It will "reflect" your database once and create a 99.5% accurate set of declarative SQLAlchemy models for you to work with. However, this does require that you subsequently keep the models in sync with the structure of the database. I would assume that this is not a big concern, seeing as the tables you're already working with are stable enough that run-time reflection of their structure does not impact your program much.
Generating the declarative models is essentially a "cache" of the reflection. It's just that SQLACodegen saves it as a very readable set of classes + fields instead of data in memory. Even with a changing structure, and my own changes to the generated declarative models, I still use SQLACodegen later on in a project whenever I make database changes. It means that my models are generally consistent with one another and that I don't have things such as typos and data mismatches due to copy-pasting.
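For reference, generating the models is a single command along these lines (connection URL and output path are placeholders):
sqlacodegen mysql://user:password@host/dbname --outfile models.py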
Performance can be a legitimate concern. If the database schema is not changing, it can be time-consuming to reflect the database every time a script is run. This is more of an issue during development than when starting up a long-running application. It's also a significant time saver if your database is on a remote server (again, particularly during development).
I use code that is similar to the answer here (as noted by @ACV). The general plan is to perform the reflection the first time, then pickle the metadata object. The next time the script is run, it will look for the pickle file and use that. The file can be anywhere, but I place mine in ~/.sqlalchemy_cache. This is an example based on your code.
import os
import pickle

from sqlalchemy import create_engine
from sqlalchemy.ext.automap import automap_base
from sqlalchemy.ext.declarative import declarative_base

self.engine = create_engine(
    'mysql://%s:%s@%s/%s?charset=utf8mb4' % (
        config.DB_USER, config.DB_PASSWD, config.DB_HOST, config.DB_NAME
    ), echo=False
)

metadata_pickle_filename = "mydb_metadata"
cache_path = os.path.join(os.path.expanduser("~"), ".sqlalchemy_cache")
cached_metadata = None
if os.path.exists(cache_path):
    try:
        with open(os.path.join(cache_path, metadata_pickle_filename), 'rb') as cache_file:
            cached_metadata = pickle.load(file=cache_file)
    except IOError:
        # cache file not found - no problem, reflect as usual
        pass

if cached_metadata:
    base = declarative_base(bind=self.engine, metadata=cached_metadata)
else:
    base = automap_base()
    base.prepare(self.engine, reflect=True)  # reflect the tables
    # save the metadata for future runs
    try:
        if not os.path.exists(cache_path):
            os.makedirs(cache_path)
        # make sure to open in binary mode - we're writing bytes, not str
        with open(os.path.join(cache_path, metadata_pickle_filename), 'wb') as cache_file:
            pickle.dump(base.metadata, cache_file)
    except:
        # couldn't write the file for some reason
        pass

self.TableName = base.classes.table_name
For anyone using declarative table class definitions, assuming a Base object defined as e.g.
Base = declarative_base(bind=engine)

metadata_pickle_filename = "ModelClasses_trilliandb_trillian.pickle"

# ------------------------------------------
# Load the cached metadata if it's available
# ------------------------------------------
# NOTE: delete the cached file if the database schema changes!!
cache_path = os.path.join(os.path.expanduser("~"), ".sqlalchemy_cache")
cached_metadata = None
if os.path.exists(cache_path):
    try:
        with open(os.path.join(cache_path, metadata_pickle_filename), 'rb') as cache_file:
            cached_metadata = pickle.load(file=cache_file)
    except IOError:
        # cache file not found - no problem
        pass

# ------------------------------------------
# define all tables
# ------------------------------------------
class MyTable(Base):
    if cached_metadata:
        __table__ = cached_metadata.tables['my_schema.my_table']
    else:
        __tablename__ = 'my_table'
        __table_args__ = {'autoload': True, 'schema': 'my_schema'}
    ...

# ----------------------------------------
# If no cached metadata was found, save it
# ----------------------------------------
if cached_metadata is None:
    # cache the metadata for future loading
    # - MUST DELETE IF THE DATABASE SCHEMA HAS CHANGED
    try:
        if not os.path.exists(cache_path):
            os.makedirs(cache_path)
        # make sure to open in binary mode - we're writing bytes, not str
        with open(os.path.join(cache_path, metadata_pickle_filename), 'wb') as cache_file:
            pickle.dump(Base.metadata, cache_file)
    except:
        # couldn't write the file for some reason
        pass
Important note!! If the database schema changes, you must delete the cached file to force the code to autoload and create a new cache. If you don't, the changes will not be reflected in the code. It's an easy thing to forget.
The answer to your first question is largely subjective. You are adding the database queries that fetch the reflection metadata to your application's load time. Whether or not that overhead is significant depends on your project requirements.
For reference, I have an internal tool at work that uses a reflection pattern because the load time is acceptable for our team. That might not be the case if it were an externally-facing product. My hunch is that for most applications the reflection overhead will not dominate the total application load time.
If you decide it is significant for your purposes, this question has an interesting answer where the user pickles the database metadata in order to locally cache it.
Adding to this, what @Demitri answered is close to correct, but (at least in SQLAlchemy 1.4.29) the example will fail on the last line, self.TableName = base.classes.table_name, when loading from the cached file. In that case declarative_base() has no attribute classes.
The fix is as simple as altering:
if cached_metadata:
    base = declarative_base(bind=self.engine, metadata=cached_metadata)
else:
    base = automap_base()
    base.prepare(self.engine, reflect=True)  # reflect the tables
to
if cached_metadata:
    base = automap_base(declarative_base(bind=self.engine, metadata=cached_metadata))
    base.prepare()
else:
    base = automap_base()
    base.prepare(self.engine, reflect=True)  # reflect the tables
This will create your automap object with the proper attributes.

How to commit model instances and remove them from working memory a few at a time

I have a pyramid view that is used for loading data from a large file into a database. For each line in the file it does a little processing, then creates some model instances and adds them to the session. This works fine except when the files are big. For large files the view slowly eats up all my RAM until everything effectively grinds to a halt.
So my idea is to process each line individually with a function that creates a session, creates the necessary model instances and adds them to the current session, then commits.
def commit_line(lTitles, lLine, oStartDate, oEndDate, iDS, dSettings):
    from sqlalchemy.orm import (
        scoped_session,
        sessionmaker,
    )
    from sqlalchemy import engine_from_config
    from pyramidapp.models import Base, DataEntry
    from zope.sqlalchemy import ZopeTransactionExtension
    import transaction

    oCurrentDBSession = scoped_session(sessionmaker(extension=ZopeTransactionExtension()))
    engine = engine_from_config(dSettings, 'sqlalchemy.')
    oCurrentDBSession.configure(bind=engine)
    Base.metadata.bind = engine

    oEntry = DataEntry()
    oCurrentDBSession.add(oEntry)
    ...
    transaction.commit()
My requirements for this function are as follows:
create a session (check)
make a bunch of model instances (check)
add those instances to the session (check)
commit those models to the database
get rid of the session (so that it and the objects created in 2 are garbage collected)
I've made sure that the newly created session is passed as an argument whenever necessary in order to stop errors to do with multiple sessions, blah blah. But alas! I can't get the database connections to go away and stuff isn't being committed.
I tried separating the function out into a celery task so the view executes to completion and does what it needs to, but I'm getting an error in celery about having too many MySQL connections no matter what I try in terms of committing and closing and disposing, and I'm not sure why. And yes, I restart the celery server when I make changes.
Surely there is a simple way to do this? All I want to do is make a session, commit it, and have it go away and leave me alone.
Creating a new session for each line of your large file is going to be quite slow I would imagine.
What I would try is to commit the session and expunge all objects from it every 1000 rows or so:
counter = 0
for line in mymegafile:
    entry = process_line(line)
    session.add(entry)
    if counter > 1000:
        counter = 0
        transaction.commit()  # if you insist on using ZopeTransactionExtension, otherwise session.commit()
        session.expunge_all()  # this may not be required actually, see https://groups.google.com/forum/#!topic/sqlalchemy/We4XGX2CYX8
    else:
        counter += 1
If there are no references to DataEntry instances from anywhere, they should be garbage collected by the Python interpreter at some point.
However, if all you're doing in that view is inserting new records into the database, it may be much more efficient to use SQLAlchemy Core constructs or literal SQL to bulk-insert data. This would also get rid of the problem of your ORM instances eating up your RAM. See "I'm inserting 400,000 rows with the ORM and it's really slow!" for details.
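As a rough illustration of the Core approach (the table and column names here are made up, and mymegafile and the connection URL are assumptions carried over from the code above):

from sqlalchemy import create_engine, MetaData, Table

engine = create_engine('mysql://user:password@host/dbname')
metadata = MetaData()
data_entry = Table('data_entry', metadata, autoload=True, autoload_with=engine)

batch = []
with engine.begin() as conn:  # one transaction for the whole load
    for line in mymegafile:
        batch.append({'title': line, 'value': len(line)})  # build plain dicts, not ORM objects
        if len(batch) >= 1000:
            conn.execute(data_entry.insert(), batch)  # executemany-style bulk insert
            batch = []
    if batch:
        conn.execute(data_entry.insert(), batch)

Because only a batch of plain dicts is held in memory at any time, RAM usage stays flat regardless of file size.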
So I tried a bunch of things, and although solving this with SQLAlchemy's built-in functionality was probably possible, I could not find any way of pulling it off.
So here's an outline of what I did:
separate the lines to be processed into batches
for each batch of lines, queue up a celery task to deal with those lines
in the celery task, a separate process is launched that does the necessary stuff with the lines.
Reasoning:
The batch stuff is obvious
Celery was used because it took a heck of a long time to process an entire file, so queuing just made sense
the task launched a separate process because if it didn't, then I had the same problem that I had with the pyramid application
Some code:
Celery task:
def commit_lines(lLineData, dSettings, cwd):
    """
    writes the line data to a file then calls a process that reads the file and creates
    the necessary data entries. Then deletes the file
    """
    import lockfile
    sFileName = "/home/sheena/tmp/cid_line_buffer"
    lock = lockfile.FileLock("{0}_lock".format(sFileName))
    with lock:
        f = open(sFileName, 'a')  # in case the process was at any point interrupted...
        for d in lLineData:
            f.write('{0}\n'.format(d))
        f.close()
    # now call the external process
    import subprocess
    import os
    sConnectionString = dSettings.get('sqlalchemy.url')
    lArgs = [
        'python', os.path.join(cwd, 'commit_line_file.py'),
        '-c', sConnectionString,
        '-f', sFileName
    ]
    # open the subprocess. wait for it to complete before continuing with stuff. if errors: raise
    subprocess.check_call(lArgs, shell=False)
    # and clear the file
    lock = lockfile.FileLock("{0}_lock".format(sFileName))
    with lock:
        f = open(sFileName, 'w')
        f.close()
External process:
"""
this script goes through all lines in a file and creates data entries from the lines
"""

def main():
    from optparse import OptionParser
    from sqlalchemy import create_engine
    from pyramidapp.models import Base, DBSession
    import ast
    import transaction

    # get options
    oParser = OptionParser()
    oParser.add_option('-c', '--connection_string', dest='connection_string')
    oParser.add_option('-f', '--input_file', dest='input_file')
    (oOptions, lArgs) = oParser.parse_args()

    # set up connection
    # engine = engine_from_config(dSettings, 'sqlalchemy.')
    engine = create_engine(
        oOptions.connection_string,
        echo=False)
    DBSession.configure(bind=engine)
    Base.metadata.bind = engine

    # commit stuffs
    import lockfile
    lock = lockfile.FileLock("{0}_lock".format(oOptions.input_file))
    with lock:
        for sLine in open(oOptions.input_file, 'r'):
            dLine = ast.literal_eval(sLine)
            create_entry(**dLine)
    transaction.commit()

def create_entry(iDS, oStartDate, oEndDate, lTitles, lValues):
    # import stuff
    oEntry = DataEntry()
    # do some other stuff, make more model instances...
    DBSession.add(oEntry)

if __name__ == "__main__":
    main()
in the view:
for line in big_giant_csv_file_handler:
    lLineData.append({'stuff': 'lots'})
if lLineData:
    lLineSets = [lLineData[i:i + iBatchSize] for i in range(0, len(lLineData), iBatchSize)]
    for l in lLineSets:
        commit_lines.delay(l, dSettings, sCWD)  # queue it for celery
You are just doing it wrong. Period.
Quoted from SQLAlchemy docs
The advanced developer will try to keep the details of session,
transaction and exception management as far as possible from the
details of the program doing its work.
Quoted from Pyramid docs
We made the decision to use SQLAlchemy to talk to our database. We also, though, installed pyramid_tm and zope.sqlalchemy.
Why?
Pyramid has a strong orientation towards support for transactions.
Specifically, you can install a transaction manager into your
application, either as middleware or a Pyramid "tween". Then, just
before you return the response, all transaction-aware parts of your
application are executed. This means Pyramid view code usually doesn't
manage transactions.
My answer today is not code, but a recommendation to follow best practices recommended by the authors of the packages/frameworks you are working with.
References
Big picture - Using Thread-Local Scope with Web Applications
Typical error message when doing it wrong
Databases using SQLAlchemy
How to use scoped_session
Encapsulate CSV reading and creating SQLAlchemy model instances into something that supports the iterator protocol. I called it BatchingModelReader. It returns a collection of DataEntry instances, with the collection size depending on the batch size. If the model changes over time, you do not need to change the celery task. The task only puts a batch of models into a session and commits the transaction. By controlling the batch size you control memory consumption. Neither BatchingModelReader nor the celery task holds huge amounts of intermediate data. This example shows as well that using celery is only an option. I added links to code samples of a pyramid application I am actually refactoring in a GitHub fork.
BatchingModelReader - encapsulates csv.reader and uses existing models from your pyramid application (a rough sketch follows below)
get inspired by source code of csv.DictReader
could be run as a celery task - use appropriate task decorator
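A minimal sketch of what such a reader could look like (how a DataEntry is built from a row is a placeholder; adapt it to your model):

import csv
from pyramidapp.models import DataEntry

class BatchingModelReader(object):
    """Iterate over a CSV file, yielding lists of model instances, batchsize at a time."""

    def __init__(self, path_to_csv, batchsize):
        self.path_to_csv = path_to_csv
        self.batchsize = batchsize

    def __iter__(self):
        batch = []
        with open(self.path_to_csv, 'r') as f:
            for row in csv.reader(f):
                batch.append(DataEntry(*row))  # placeholder: build a model instance from the row
                if len(batch) >= self.batchsize:
                    yield batch
                    batch = []
        if batch:
            yield batch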
from .models import DBSession
import transaction

def import_from_csv(path_to_csv, batchsize):
    """given a CSV file and a batchsize, iterate over batches of model instances and import them into the database"""
    for batch in BatchingModelReader(path_to_csv, batchsize):
        with transaction.manager:
            DBSession.add_all(batch)
pyramid view - just save big giant CSV file, start task, return response immediately
@view_config(...)
def view(request):
    """gets file from request, saves it to the filesystem and starts the celery task"""
    with open(path_to_csv, 'w') as f:
        f.write(big_giant_csv_file)
    # start task with parameters
    import_from_csv.delay(path_to_csv, 1000)
Code samples
ToDoPyramid - commit transaction from commandline
ToDoPyramid - commit transaction from request
Pyramid using SQLAlchemy
Databases using SQLAlchemy
SQLAlchemy internals
Big picture - Using Thread-Local Scope with Web Applications
How to use scoped_session

How to store web sessions in MySQL for CherryPy 3.2.2?

I found many examples for older versions of CherryPy, but they each referenced importing modules not found in CherryPy 3.2.2. Looking in the documentation, I found a reference to the fact that there is built-in functionality via storage_type (one of 'ram', 'file', 'postgresql').
For a start you could take a look at
https://github.com/3kwa/cherrys
to see how this guy writes his own session class and overrides some methods. He does it for Redis, not MySQL; you would write the methods for MySQL. A very similar class already exists in CherryPy in "cherrypy/lib/sessions.py":
class PostgresqlSession(Session)
which is very similar to what you want. I'd say, take the implementation approach from "3kwa", but instead of his RedisSession class copy the PostgresqlSession class from "cherrypy/lib/sessions.py" and alter it to match proper MySQL syntax.
A possible path could be:
Download "cherrys.py" from the above link and rename it to "mysqlsession.py". Replace the "RedisSession(Session)" class with the "PostgresqlSession(Session)" class from "cherrypy/lib/sessions.py" and rename it to "MySQLSession(Session)". Be sure to add
locks = {}

def acquire_lock(self):
    """Acquire an exclusive lock on the currently-loaded session data."""
    self.locked = True
    self.locks.setdefault(self.id, threading.RLock()).acquire()

def release_lock(self):
    """Release the lock on the currently-loaded session data."""
    self.locks[self.id].release()
    self.locked = False
to your new "MySQLSession" class (like it is done in RedisSession(Session)). Alter the PostgreSQL syntax to match MySQL syntax (that shouldn't be difficult). Put "mysqlsession.py" somewhere below your project directory and import it in the application with
import mysqlsession
and use
cherrypy.lib.sessions.MySQLSession = mysqlsession.MySQLSession
in the initialization of your app. In the config, set
tools.sessions.storage_type : 'mysql'
and the parameters (like host, port, etc.) like you would for the PostgresqlSession class.
I could be wrong all along, but this is how I would try to solve this.
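For orientation, a heavily trimmed skeleton of what such a class might end up looking like (the method names mirror the existing PostgresqlSession/RedisSession implementations; the SQL and connection handling are placeholders you would fill in for MySQL):

import threading
from cherrypy.lib.sessions import Session

class MySQLSession(Session):

    locks = {}

    @classmethod
    def setup(cls, **kwargs):
        """Called once on startup; open the MySQL connection here."""
        for k, v in kwargs.items():
            setattr(cls, k, v)
        # cls.db = MySQLdb.connect(...)  # placeholder

    def _exists(self):
        ...  # SELECT to check whether a row with this session id exists

    def _load(self):
        ...  # return the unpickled (data, expiration_time) for self.id, or None

    def _save(self, expiration_time):
        ...  # INSERT/UPDATE the pickled self._data plus expiration_time

    def _delete(self):
        ...  # DELETE the row for self.id

    def acquire_lock(self):
        """Acquire an exclusive lock on the currently-loaded session data."""
        self.locked = True
        self.locks.setdefault(self.id, threading.RLock()).acquire()

    def release_lock(self):
        """Release the lock on the currently-loaded session data."""
        self.locks[self.id].release()
        self.locked = False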