I've been fiddling with SQLAlchemy + Alembic for around 2 days and need some guidance. I know Alembic is for schema migrations and the autogenerate feature is awesome, but my question now is: how do data migrations work?
An example: I create a table and I want to insert data into that table in an automated way (upgrade & downgrade). Ideally when someone updates to the latest migration all this is done for the developer.
From my research I can either:
Write normal code that inserts, updates, and deletes data. But I lose the automation here, i.e. the developer would have to run a script that inserts the data, and there isn't really an option to do rollbacks etc. The ORM is nicer and cleaner than Alembic operations, though. So something like this:
def create_users():
    with session_maker() as session:
        for user in users:
            session.add(user)
        session.commit()
I can make use of Alembic operations like bulk_insert, batch_alter_table etc. With this approach the developer would just need to run the latest migrations, but the Alembic operations seem clunky; the functions themselves aren't as shiny as an ORM approach.
op.bulk_insert(accounts_table,
    [
        {'id': 1, 'name': 'John Smith',
         'create_date': date(2010, 10, 5)},
        {'id': 2, 'name': 'Ed Williams',
         'create_date': date(2007, 5, 27)},
        {'id': 3, 'name': 'Wendy Jones',
         'create_date': date(2008, 8, 15)},
    ]
)
Have a hybrid approach and use ORM stuff inside of a migration? So something like this:
session_maker = sessionmaker(bind=create_engine(db_url))
business_units = [
    BusinessUnits(name="Marketing"),
    BusinessUnits(name="Sales"),
    BusinessUnits(name="Software"),
]
def create_business_units():
    with session_maker() as session:
        for unit in business_units:
            session.add(unit)
        session.commit()
def upgrade() -> None:
    create_business_units()
    # ### commands auto generated by Alembic - please adjust! ###
    pass
    # ### end Alembic commands ###
def downgrade() -> None:
    # ### commands auto generated by Alembic - please adjust! ###
    pass
    # ### end Alembic commands ###
Questions:
Is there something wrong with the 3rd approach? If not, why would option 2 exist? Approach 3 seems to be the best of both worlds.
What is the best approach (1, 2 or 3), or is there no best approach and it varies from place to place? I understand this is subjective, but there must be some sort of measuring stick used to choose one approach over the other?
Please bear in mind I'm new to this and I know this is subjective but ya... just need some guidance and perspective. Not too sure where else to post / ask this question.
Related
I have a Pylons project and a SQLAlchemy model that implements schema qualified tables:
class Hockey(Base):
    __tablename__ = "hockey"
    __table_args__ = {'schema': 'winter'}
    hockey_id = sa.Column(sa.types.Integer, sa.Sequence('score_id_seq', optional=True), primary_key=True)
    baseball_id = sa.Column(sa.types.Integer, sa.ForeignKey('summer.baseball.baseball_id'))
This code works great with PostgreSQL, but with SQLite it fails on table and foreign key names (due to SQLite's lack of schema support):
sqlalchemy.exc.OperationalError: (OperationalError) unknown database "winter" 'PRAGMA "winter".table_info("hockey")' ()
I'd like to continue using SQLite for dev and testing.
Is there a way to have this fail gracefully on SQLite?
I'd like to continue using SQLite for dev and testing.
Is there a way to have this fail gracefully on SQLite?
It's hard to know where to start with that kind of question. So . . .
Stop it. Just stop it.
There are some developers who don't have the luxury of developing on their target platform. Their life is a hard one--moving code (and sometimes compilers) from one environment to the other, debugging twice (sometimes having to debug remotely on the target platform), gradually coming to an awareness that the gnawing in their gut is actually the start of an ulcer.
Install PostgreSQL.
When you can use the same database environment for development, testing, and deployment, you should.
Not to mention the QA team. Why on earth are they testing stuff they're not going to ship? If you're deploying on PostgreSQL, assure the quality of your work on PostgreSQL.
Seriously.
I'm not sure if this works with foreign keys, but someone could try SQLAlchemy's Multi-Tenancy Schema Translation for Table objects. It worked for me, but I have used custom primaryjoin and secondaryjoin expressions in combination with composite primary keys.
The schema translation map can be passed directly to the engine creator:
...
if dialect == "sqlite":
    url = lambda: "sqlite:///:memory:"
    execution_options = {"schema_translate_map": {"winter": None, "summer": None}}
else:
    # "pass" is a reserved word in Python, so the variable is named "password" here
    url = lambda: f"postgresql://{user}:{password}@{host}:{port}/{name}"
    execution_options = None
engine = create_engine(url(), execution_options=execution_options)
...
Here is the doc for create_engine. There is another question on SO which might be related in that regard.
But one might get colliding table names if all schema names are mapped to None.
I'm just a beginner myself, and I haven't used Pylons, but...
I notice that you are combining the table and the associated class together. How about if you separate them?
import sqlalchemy as sa
meta = sa.MetaData('sqlite:///tutorial.sqlite')
schema = None
hockey_table = sa.Table('hockey', meta,
    sa.Column('score_id', sa.types.Integer, sa.Sequence('score_id_seq', optional=True), primary_key=True),
    sa.Column('baseball_id', sa.types.Integer, sa.ForeignKey('summer.baseball.baseball_id')),
    schema=schema,
)
meta.create_all()
Then you could create a separate
class Hockey(object):
    ...
and
mapper(Hockey, hockey_table)
Then just set schema = None (as above) everywhere if you are using SQLite, and the value(s) you want otherwise.
You don't have a working example, so the example above isn't a working one either. However, as other people have pointed out, trying to maintain portability across databases is in the end a losing game. I'd add a +1 to the people suggesting you just use PostgreSQL everywhere.
HTH, Regards.
I know this is a 10+ year old question, but I ran into the same problem recently: Postgres in production and sqlite in development.
The solution was to register an event listener for when the engine calls the "connect" method.
@sqlalchemy.event.listens_for(engine, "connect")
def connect(dbapi_connection, connection_record):
    dbapi_connection.execute('ATTACH "your_data_base_name.db" AS "schema_name"')
Running the ATTACH statement only once will not work, because it affects only a single connection. This is why we need the event listener: it issues the ATTACH statement on every new connection.
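The per-connection behaviour can be checked with the standard-library sqlite3 module alone. A small sketch, assuming two hypothetical database files in a temporary directory; the schema name winter is borrowed from the question, and attached_names is just an illustrative helper.

```python
import os
import sqlite3
import tempfile

# Hypothetical file names, created in a temporary directory for the demo.
tmp_dir = tempfile.mkdtemp()
main_path = os.path.join(tmp_dir, "main.db")
winter_path = os.path.join(tmp_dir, "winter.db")


def attached_names(conn):
    """Return the names of all databases visible on this connection."""
    # PRAGMA database_list yields rows of (seq, name, file).
    return [row[1] for row in conn.execute("PRAGMA database_list")]


first = sqlite3.connect(main_path)
first.execute("ATTACH DATABASE ? AS winter", (winter_path,))
first.execute("CREATE TABLE winter.hockey (hockey_id INTEGER PRIMARY KEY)")
first.commit()

# A second connection to the same file does not inherit the ATTACH.
second = sqlite3.connect(main_path)

first_names = attached_names(first)
second_names = attached_names(second)

try:
    second.execute("SELECT * FROM winter.hockey")
    second_sees_schema = True
except sqlite3.OperationalError:
    # The "winter" schema is unknown on this connection.
    second_sees_schema = False

first.close()
second.close()
```

Since a connection pool may open new connections at any time, hooking the "connect" event is what keeps the attached schema available on each of them.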
I'm building a python app around an existing (mysql) database and am using automap to infer tables and relationships:
base = automap_base()
self.engine = create_engine(
    'mysql://%s:%s@%s/%s?charset=utf8mb4' % (
        config.DB_USER, config.DB_PASSWD, config.DB_HOST, config.DB_NAME
    ), echo=False
)
# reflect the tables
base.prepare(self.engine, reflect=True)
self.TableName = base.classes.table_name
Using this I can do things like session.query(TableName) etc...
However, I'm worried about performance, because every time the app runs it will do the whole inference again.
Is this a legitimate concern?
If so, is there a possibility to 'cache' the output of Automap?
I think that "reflecting" the structure of your database is not the way to go. Unless your app tries to "infer" things from the structure, like static code analysis would for source files, then it is unnecessary. The other reason for reflecting it at run-time would be the reduced time to begin "using" the database using SQLAlchemy. However:
Another option would be to use something like SQLACodegen (https://pypi.python.org/pypi/sqlacodegen):
It will "reflect" your database once and create a 99.5% accurate set of declarative SQLAlchemy models for you to work with. However, this does require that you keep the model subsequently in-sync with the structure of the database. I would assume that this is not a big concern seeing as the tables you're already-working with are stable-enough such that run-time reflection of their structure does not impact your program much.
Generating the declarative models is essentially a "cache" of the reflection. It's just that SQLACodegen saved it into a very readable set of classes + fields instead of data in-memory. Even with a changing structure, and my own "changes" to the generated declarative models, I still use SQLACodegen later-on in a project whenever I make database changes. It means that my models are generally consistent amongst one-another and that I don't have things such as typos and data-mismatches due to copy-pasting.
Performance can be a legitimate concern. If the database schema is not changing, it can be time consuming to reflect the database every time a script is run. This is more of an issue during development, not starting up a long running application. It's also a significant time saver if your database is on a remote server (again, particularly during development).
I use code that is similar to the answer here (as noted by @ACV). The general plan is to perform the reflection the first time, then pickle the metadata object. The next time the script is run, it will look for the pickle file and use that. The file can be anywhere, but I place mine in ~/.sqlalchemy_cache. This is an example based on your code.
import os
import pickle
from sqlalchemy import create_engine
from sqlalchemy.ext.automap import automap_base
from sqlalchemy.ext.declarative import declarative_base
self.engine = create_engine(
    'mysql://%s:%s@%s/%s?charset=utf8mb4' % (
        config.DB_USER, config.DB_PASSWD, config.DB_HOST, config.DB_NAME
    ), echo=False
)
metadata_pickle_filename = "mydb_metadata"
cache_path = os.path.join(os.path.expanduser("~"), ".sqlalchemy_cache")
cached_metadata = None
if os.path.exists(cache_path):
    try:
        with open(os.path.join(cache_path, metadata_pickle_filename), 'rb') as cache_file:
            cached_metadata = pickle.load(file=cache_file)
    except IOError:
        # cache file not found - no problem, reflect as usual
        pass
if cached_metadata:
    base = declarative_base(bind=self.engine, metadata=cached_metadata)
else:
    base = automap_base()
    base.prepare(self.engine, reflect=True)  # reflect the tables
    # save the metadata for future runs
    try:
        if not os.path.exists(cache_path):
            os.makedirs(cache_path)
        # make sure to open in binary mode - we're writing bytes, not str
        with open(os.path.join(cache_path, metadata_pickle_filename), 'wb') as cache_file:
            pickle.dump(base.metadata, cache_file)
    except Exception:
        # couldn't write the file for some reason
        pass
self.TableName = base.classes.table_name
For anyone using declarative table class definitions, assuming a Base object defined as e.g.
Base = declarative_base(bind=engine)
metadata_pickle_filename = "ModelClasses_trilliandb_trillian.pickle"
# ------------------------------------------
# Load the cached metadata if it's available
# ------------------------------------------
# NOTE: delete the cached file if the database schema changes!!
cache_path = os.path.join(os.path.expanduser("~"), ".sqlalchemy_cache")
cached_metadata = None
if os.path.exists(cache_path):
    try:
        with open(os.path.join(cache_path, metadata_pickle_filename), 'rb') as cache_file:
            cached_metadata = pickle.load(file=cache_file)
    except IOError:
        # cache file not found - no problem
        pass
# ------------------------------------------
# define all tables
#
class MyTable(Base):
    if cached_metadata:
        __table__ = cached_metadata.tables['my_schema.my_table']
    else:
        __tablename__ = 'my_table'
        __table_args__ = {'autoload': True, 'schema': 'my_schema'}
...
# ----------------------------------------
# If no cached metadata was found, save it
# ----------------------------------------
if cached_metadata is None:
    # cache the metadata for future loading
    # - MUST DELETE IF THE DATABASE SCHEMA HAS CHANGED
    try:
        if not os.path.exists(cache_path):
            os.makedirs(cache_path)
        # make sure to open in binary mode - we're writing bytes, not str
        with open(os.path.join(cache_path, metadata_pickle_filename), 'wb') as cache_file:
            pickle.dump(Base.metadata, cache_file)
    except:
        # couldn't write the file for some reason
        pass
Important Note!! If the database schema changes, you must delete the cached file to force the code to autoload and create a new cache. If you don't, the changes will not be reflected in the code. It's an easy thing to forget.
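The load-or-reflect logic in the examples above can be factored into a small helper. A minimal sketch using only the standard library: load_or_build, force_refresh and fake_reflect are hypothetical names, and a plain dict stands in for the reflected metadata object. The force_refresh flag is one possible way to rebuild the cache after a schema change instead of deleting the file by hand.

```python
import os
import pickle
import tempfile


def load_or_build(cache_file, build, force_refresh=False):
    """Return the cached object if present, else build it and cache it."""
    if not force_refresh:
        try:
            with open(cache_file, "rb") as handle:
                return pickle.load(handle)
        except (OSError, pickle.UnpicklingError, EOFError):
            # Missing or unreadable cache: fall through and rebuild.
            pass

    value = build()
    try:
        os.makedirs(os.path.dirname(cache_file), exist_ok=True)
        with open(cache_file, "wb") as handle:
            pickle.dump(value, handle)
    except OSError:
        # Caching is only an optimisation; ignore write failures.
        pass
    return value


# Stand-in for the expensive reflection step; records how often it runs.
calls = []


def fake_reflect():
    calls.append(1)
    return {"my_schema.my_table": ["id", "name"]}


cache_file = os.path.join(tempfile.mkdtemp(), ".sqlalchemy_cache", "mydb_metadata")
first = load_or_build(cache_file, fake_reflect)   # builds and writes the cache
second = load_or_build(cache_file, fake_reflect)  # served from the pickle
third = load_or_build(cache_file, fake_reflect, force_refresh=True)  # rebuilds
```

In real use, build would be a function that runs the reflection and returns the metadata object.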
The answer to your first question is largely subjective. You are adding database queries to fetch the reflection metadata to the application load time. Whether or not that overhead is significant depends on your project requirements.
For reference, I have an internal tool at work that uses a reflection pattern because the load time is acceptable for our team. That might not be the case if it were an externally-facing product. My hunch is that for most applications the reflection overhead will not dominate the total application load time.
If you decide it is significant for your purposes, this question has an interesting answer where the user pickles the database metadata in order to locally cache it.
Adding to this, what @Demitri answered is close to correct but (at least in SQLAlchemy 1.4.29) the example will fail on the last line, self.TableName = base.classes.table_name, when generating from the cached file. In this case declarative_base() has no attribute classes.
The fix is as simple as changing:
if cached_metadata:
    base = declarative_base(bind=self.engine, metadata=cached_metadata)
else:
    base = automap_base()
    base.prepare(self.engine, reflect=True)  # reflect the tables
to
if cached_metadata:
    base = automap_base(declarative_base(bind=self.engine, metadata=cached_metadata))
    base.prepare()
else:
    base = automap_base()
    base.prepare(self.engine, reflect=True)  # reflect the tables
This will create your automap object with the proper attributes.
I have been using SQLAlchemy for a year now, and I use Alembic for migrating changes.
I am also using Alembic to seed data. Take the authorization seeds as an example.
def upgrade():
    op.bulk_insert(Role.__table__, ROLE_LIST)
    op.bulk_insert(Permission.__table__, PERMISSION_LIST)
    op.bulk_insert(RolePermission.__table__, ROLE_PERMISSION_LIST)
def downgrade():
    for a_table in [RolePermission, Permission, Role]:
        op.execute(a_table.__table__.delete())
This works perfectly if we assume there are going to be no changes in my data set (permissions, roles, etc. in my case).
ROLE_LIST = [
    dict(id=generate_business_id(), name='customer', default=1),
    dict(id=generate_business_id(), name='admin', default=0),
]
PERMISSION_LIST = [
    dict(id=generate_business_id(), name='profile'),
    dict(id=generate_business_id(), name='delete_user'),
    dict(id=generate_business_id(), name='make_payment'),
]
But it's not the case. I am gonna have new roles and permissions and role_permission mappings. And alembic upgrade head will not migrate new data because my head has gone far ahead of the seed migration.
I could have the data in the respective revision only, but I am also using ROLE_LIST and PERMISSION_LIST in my business logic.
Is there any way I can do a replace query in migrations? Is there a better option?
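To make the idea concrete, one way to read "replace query" is a seed step that can safely be re-run whenever the list grows. The sketch below uses the standard-library sqlite3 module rather than Alembic, and rests on assumptions not in the post: roles are identified by a unique name (the seeds above generate a fresh id on each run, so the id cannot serve as the key), the column is called is_default, and seed_roles is a hypothetical helper. It uses INSERT OR IGNORE, so existing rows are left untouched rather than replaced.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE role ("
    " id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL UNIQUE,"
    " is_default INTEGER NOT NULL)"
)


def seed_roles(connection, roles):
    """Insert roles that are missing; leave existing rows untouched."""
    connection.executemany(
        "INSERT OR IGNORE INTO role (name, is_default)"
        " VALUES (:name, :is_default)",
        roles,
    )
    connection.commit()


seed_roles(conn, [
    {"name": "customer", "is_default": 1},
    {"name": "admin", "is_default": 0},
])
# Later the list grows; re-running the seed only adds the new entry.
seed_roles(conn, [
    {"name": "customer", "is_default": 1},
    {"name": "admin", "is_default": 0},
    {"name": "support", "is_default": 0},
])
names = [row[0] for row in conn.execute("SELECT name FROM role ORDER BY id")]
```

Whether such a step belongs in a new migration revision or in a separate seeding command is a design choice this sketch does not settle.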
I have just started looking at Alembic, coming from Django, where we have South to migrate our database schemas (which is soon to be included). South uses a friendly old fixed-width number like 0037_fix_my_schema.py to specify the order in which migrations are applied, so I am naturally intrigued by Alembic's revision ID. Is there a DAG backing Alembic, or can someone give a little overview of its internals in this respect?
I took a look myself. The source says:
def rev_id():
    val = int(uuid.uuid4()) % 100000000000000
    return hex(val)[2:-1]
Not so fascinating.
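The random ID only names a revision; it carries no ordering. In Alembic, each migration script also records a down_revision pointing at its parent, and the order comes from following those pointers. A rough sketch of that idea with plain dictionaries (not Alembic's actual implementation): history and the down_revision mapping are illustrative, and rev_id is adapted for Python 3, where hex() no longer appends a trailing L.

```python
import uuid


def rev_id():
    """Python 3 variant of the snippet above: a short random hex string."""
    val = int(uuid.uuid4()) % 100000000000000
    # Python 2 appended "L" to hex() of a long, hence the original [2:-1].
    return hex(val)[2:]


# Hypothetical revisions: each one records the id of its parent.
first, second, third = rev_id(), rev_id(), rev_id()
down_revision = {first: None, second: first, third: second}


def history(head):
    """Walk from a head revision back to the base, oldest first."""
    chain = []
    while head is not None:
        chain.append(head)
        head = down_revision[head]
    return list(reversed(chain))


ordered = history(third)
```

Because the order lives in the parent pointers rather than in the file names, two revisions may share the same parent, which is where the graph structure the question asks about comes from.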
I am currently trying to move my DB tables over to InnoDB from MyISAM. I am having timing issues with requests and cron jobs running on the server that are leading to some errors. I am quite sure that transaction support will help me with the problem. I am therefore transitioning to InnoDB.
I have a suite of tests which make calls to our web service's REST API and receive XML responses. The test suite is fairly thorough; it's written in Python and uses SQLAlchemy to query information from the database. When I change the tables in the system from MyISAM to InnoDB, however, the tests start failing. The tests aren't failing because the system isn't working; they are failing because the ORM is not correctly querying the rows from the database I am testing on. When I step through the code I see the correct results in the database, but the ORM is not returning the correct results at all.
Basic flow is:
class UnitTest(unittest.TestCase):
    def setUp(self):
        # Create a test object in DB that gets affected by the web server
        testObject = Obj(foo='one')
        self.testId = testObject.id
        session.add(testObject)
        session.commit()
    def tearDown(self):
        # Clean up after the test
        testObject = session.query(Obj).get(self.testId)
        session.delete(testObject)
        session.commit()
    def test_web_server(self):
        # Ensure the initial state of the object.
        objects = session.query(Obj).get(self.testId)
        assert objects.foo == 'one'
        # This will make a simple HTTP get call on an url that will modify the DB
        response = server.request.increment_foo(self.testId)
        # This one fails, the object still has a foo of 'one'
        # When I stop here in a debugger though, and look at the database,
        # The row in question actually has the correct value in the database.
        # ????
        objects = session.query(Obj).get(self.testId)
        assert objects.foo == 'two'
With MyISAM tables storing the object, this test passes. However, when I change to InnoDB tables, it does not. What is more interesting is that when I step through the code in the debugger, I can see that the database has what I expect, so it's not a problem in the web server code. I have tried nearly every combination of expire_all, autoflush, autocommit, etc., and still can't get this test to pass.
I can provide more info if necessary.
Thanks,
Conrad
The problem is that you put the line self.testId = testObject.id before the new object is added to the session and flushed, i.e. before SQLAlchemy has assigned an ID to it. Thus self.testId is always None. Move this line below session.commit().
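The effect is easy to reproduce in isolation. A minimal sketch, assuming SQLAlchemy 1.4 or later and an in-memory SQLite database; the Obj class here is a stand-in for the model in the question, whose real definition is not shown.

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Obj(Base):
    __tablename__ = "obj"
    id = Column(Integer, primary_key=True)
    foo = Column(String(10))


engine = create_engine("sqlite://")  # in-memory database for the demo
Base.metadata.create_all(engine)

session = Session(engine)
test_object = Obj(foo="one")
id_before = test_object.id  # None: nothing has been sent to the database yet

session.add(test_object)
session.commit()            # flushes the INSERT; the primary key is now set
id_after = test_object.id

session.close()
```

Reading the id only after the commit (or after an explicit session.flush()) gives setUp a usable value to store in self.testId.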