I have Task objects with several attributes. These tasks are bounced between several processes (using Celery) and I'd like to update the task status in a database.
Every update should update only non-NULL attributes of the object. So far I have something like:
def del_empty_attrs(task):
    # materialize the list first: calling delattr while iterating over
    # vars(task) would mutate the dict mid-iteration
    for name in [key for key, val in vars(task).iteritems() if val is None]:
        delattr(task, name)
def update_task(session, id, **kw):
    task = session.query(Task).get(id)
    if task is None:
        task = Task(id=id)
    for key, value in kw.iteritems():
        if not hasattr(task, key):
            raise AttributeError('Task does not have {} attribute'.format(key))
        setattr(task, key, value)
    del_empty_attrs(task)  # Don't update empty fields
    session.merge(task)
However, I get either an IntegrityError or a StaleDataError. What is the right way to do this?
I think the problem is that every process has its own session, but I'm not sure.
A lot more detail would be needed to say for sure, but there is a race condition in this code:
def update_task(session, id, **kw):
    # 1.
    task = session.query(Task).get(id)
    if task is None:
        # 2.
        task = Task(id=id)
    for key, value in kw.iteritems():
        if not hasattr(task, key):
            raise AttributeError('Task does not have {} attribute'.format(key))
        setattr(task, key, value)
    del_empty_attrs(task)  # Don't update empty fields
    # 3.
    session.merge(task)
If two processes both encounter #1, and find the object for the given id to be None, they both proceed to create a new Task() object with the given primary key (assuming id here is the primary key attribute). Both processes then race down to the Session.merge() which will attempt to emit an INSERT for the row. One process gets the INSERT, the other one gets an IntegrityError as it did not INSERT the row before the other one did.
There's no simple answer for how to "fix" this, it depends on what you're trying to do. One approach might be to ensure that no two processes work on the same pool of primary key identifiers. Another would be to ensure that all INSERTs of non-existent rows are handled by a single process.
Edit: another option is an "optimistic" approach, where a SAVEPOINT (e.g. Session.begin_nested()) is used to intercept the IntegrityError on the INSERT, then continue on after it occurs.
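A minimal sketch of that optimistic pattern, assuming the Task model from the question (get_or_create_task is a hypothetical helper name):

from sqlalchemy.exc import IntegrityError

def get_or_create_task(session, id):
    task = session.query(Task).get(id)
    if task is None:
        try:
            # begin_nested() emits a SAVEPOINT; a failed flush rolls back
            # to the savepoint only, leaving the session usable
            with session.begin_nested():
                task = Task(id=id)
                session.add(task)
        except IntegrityError:
            # another process inserted the row first, so fetch theirs
            task = session.query(Task).get(id)
    return task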
Related
We have CRUD functions in our API, which uses FastAPI and SQLAlchemy.
For update functions we have the below code:
def update_user(
    user_id: uuid.UUID,
    db: Session,
    update_model: UserUpdateModel,
) -> bool:
    query = (
        db.query(User)
        .filter(
            User.user_id == user_id,
        )
        .update(update_model, synchronize_session=False)
    )
    try:
        db.commit()
    except IntegrityError as e:
        if isinstance(e.orig, PG2UniqueViolation):
            raise UniqueViolation from e
    return bool(query)
What exactly does the 'synchronize_session=False' do here?
What is the best value for it: False or 'fetch'?
Is it critical if we don't use it?
By looking at the SQLAlchemy docs you can find what synchronize_session does and how to use it properly.
From the official doc:
With both the 1.x and 2.0 form of ORM-enabled updates and deletes, the following values for synchronize_session are supported:
False - don’t synchronize the session. This option is the most efficient and is reliable once the session is expired, which typically occurs after a commit(), or explicitly using expire_all(). Before the expiration, objects that were updated or deleted in the database may still remain in the session with stale values, which can lead to confusing results.
'fetch' - Retrieves the primary key identity of affected rows by either performing a SELECT before the UPDATE or DELETE, or by using RETURNING if the database supports it, so that in-memory objects which are affected by the operation can be refreshed with new values (updates) or expunged from the Session (deletes). Note that this synchronization strategy is not available if the given update() or delete() construct specifies columns for UpdateBase.returning() explicitly.
'evaluate' - Evaluate the WHERE criteria given in the UPDATE or DELETE statement in Python, to locate matching objects within the Session. This approach does not add any round trips and in the absence of RETURNING support is more efficient. For UPDATE or DELETE statements with complex criteria, the 'evaluate' strategy may not be able to evaluate the expression in Python and will raise an error. If this occurs, use the 'fetch' strategy for the operation instead.
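To make the False option concrete, here is a minimal sketch of the stale-value behavior it warns about, reusing the User model from the question (the name column is assumed for illustration):

user = db.query(User).filter(User.user_id == some_id).one()
db.query(User).filter(User.user_id == some_id).update(
    {"name": "new name"}, synchronize_session=False
)
print(user.name)  # may still show the stale, pre-update value
db.commit()       # commit() expires the session by default...
print(user.name)  # ...so the attribute is re-loaded fresh on next access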
JDBC allows us to fetch the value of a primary key that is automatically generated by the database (e.g. IDENTITY, AUTO_INCREMENT) using the following syntax:
PreparedStatement ps = connection.prepareStatement(
    "INSERT INTO post (title) VALUES (?)",
    Statement.RETURN_GENERATED_KEYS
);
ps.setString(1, "title");  // bind the title value
ps.executeUpdate();

ResultSet resultSet = ps.getGeneratedKeys();
while (resultSet.next()) {
    LOGGER.info("Generated identifier: {}", resultSet.getLong(1));
}
I'm interested in whether the Oracle, SQL Server, PostgreSQL, or MySQL driver uses a separate round trip to fetch the identifier, or whether there is a single round trip which executes the INSERT and fetches the ResultSet automatically.
It depends on the database and driver.
Although you didn't ask for it, I will answer for Firebird ;). In Firebird/Jaybird the retrieval itself doesn't require extra roundtrips, but using Statement.RETURN_GENERATED_KEYS or the integer array version will require three extra roundtrips (prepare, execute, fetch) to determine the columns to request (I still need to build a form of caching for it). Using the version with a String array will not require extra roundtrips (I would love to have RETURNING * like in PostgreSQL...).
In PostgreSQL with PgJDBC there is no extra round-trip to fetch generated keys.
It sends a Parse/Describe/Bind/Execute message series followed by a Sync, then reads the results including the returned result-set. There's only one client/server round-trip required because the protocol pipelines requests.
However, batches that could otherwise be streamed to the server may sometimes be broken up into smaller chunks, or run one by one, if generated keys are requested. To avoid this, use the String[] array form (the prepareStatement(sql, String[] columnNames) overload) where you name the columns you want returned, and name only columns of fixed-width data types like integer. This only matters for batches, and it's due to a design limitation in PgJDBC.
(I posted a patch to add batch pipelining support to libpq that doesn't have that limitation; it'll do one client/server round trip for arbitrarily sized batches with arbitrarily sized results, including returned keys.)
MySQL receives the generated key(s) automatically in the OK packet of the protocol in response to executing a statement. There is no communication overhead when requesting generated keys.
In my opinion, even for such a trivial thing, a single approach working across all database systems will fail. The only pragmatic solution is (in analogy to Hibernate) to find the best working solution for each target RDBMS and call it a dialect of your one-for-all solution :)
Here is the information for Oracle. I'm using a sequence to generate the key; the same behavior is observed for IDENTITY columns.
create table auto_pk (
    id  number,
    pad varchar2(100)
);
This works and uses only one round trip:
def stmt = con.prepareStatement(
    "insert into auto_pk values(auto_pk_seq.nextval, 'XXX')",
    Statement.RETURN_GENERATED_KEYS)
def rowCount = stmt.executeUpdate()
def generatedKeys = stmt.getGeneratedKeys()
if (null != generatedKeys && generatedKeys.next()) {
    def id = generatedKeys.getString(1);
}
But unfortunately you get a ROWID as the result - not the generated key.
How is it implemented internally? You can see it if you activate a 10046 trace (BTW, this is also the best way to see how many round trips were performed):
PARSING IN CURSOR
insert into auto_pk values(auto_pk_seq.nextval, 'XXX')
RETURNING ROWID INTO :1
END OF STMT
So you see that JDBC standard 3.0 is implemented, but you don't get the requested result. Under the covers, the RETURNING clause is used.
The right approach to get the generated key in Oracle is therefore:
def stmt = con.prepareStatement(
    "insert into auto_pk values(auto_pk_seq.nextval, 'XXX') returning id into ?")
stmt.registerReturnParameter(1, Types.INTEGER);
def rowCount = stmt.executeUpdate()
def generatedKeys = stmt.getReturnResultSet()
if (null != generatedKeys && generatedKeys.next()) {
    def id = generatedKeys.getLong(1);
}
Note:
Oracle Release 12.1.0.2.0
To activate the 10046 trace, use:
con.createStatement().execute "alter session set events '10046 trace name context forever, level 12'"
con.createStatement().execute "ALTER SESSION SET tracefile_identifier = my_identifier"
Depending on frameworks or libraries to do things that are perfectly possible in plain SQL is bad design IMHO, especially when working against a defined DBMS. (The Statement.RETURN_GENERATED_KEYS case is relatively innocuous, although it apparently does raise a question for you; but where frameworks are built on separate entities, do all sorts of joins and filters in code, or have custom-built transaction isolation logic, things get inefficient and messy very quickly.)
Why not simply:
PreparedStatement ps = connection.prepareStatement(
    "INSERT INTO post (title) VALUES (?) RETURNING id");
Single trip, defined result.
When an object is saved in Rails, its ID in the DB is assigned to it. But it is not actually saved in the DB yet.
On the console, I haven't seen any query being fired other than the INSERT query, which is performed after the after_save. So how does Rails assign the ID to the object before the INSERT query?
There are different ways for different DBs. For more details you have to look through the AR models, or those of whatever other ORM you are using.
For PostgreSQL, see: rails postgres insert
If you don't get an ID in your records, show more details from your schema.rb.
Typically, this is done by the database itself. Usually, the id column of a table is an auto-increment column, which means that the database keeps an auto-incrementing counter and assigns its value to the new record when it is saved. Then, Rails has to pull the newly assigned id back from the database after inserting the record.
This is what Rails does when inserting a new row into the DB (see the docs for the insert method):
# ActiveRecord::ConnectionAdapters::DatabaseStatements#insert
#
# Returns the last auto-generated ID from the affected table.
#
# +id_value+ will be returned unless the value is nil, in
# which case the database will attempt to calculate the last inserted
# id and return that value.
#
# If the next id was calculated in advance (as in Oracle), it should be
# passed in as +id_value+.
def insert(arel, name = nil, pk = nil, id_value = nil, sequence_name = nil, binds = [])
  sql, binds = sql_for_insert(to_sql(arel, binds), pk, id_value, sequence_name, binds)
  value = exec_insert(sql, name, binds, pk, sequence_name)
  id_value || last_inserted_id(value)
end
So, in practice, the ID is never passed from Rails in the INSERT statement. But after the insert, the created Rails object will have its id defined.
Suppose there's a table with a column in which I want to store the number of occurrences of that row's 'id' in another table's column.
So if there was a table 'player' listing every player, and a table 'goals' listing every goal scored, is there an easy way to auto-update a column in the player table every time a goal they scored is added to the goals table?
Another example would be 'team' and 'players' tables, where team.number_of_players is updated every time a player is added with player.team_name == team.name, or something like that.
Would using JSON as a way of holding {'username': True} or something like that for each user be worthwhile?
You have several ways to implement your idea.
The easiest way: you can update your columns with an update query, something like this:
try:
    player = Player(name='New_player_name', team_id=3)
    Session.add(player)
    Session.flush()
    Session.query(Team).filter(Team.id == player.team_id)\
        .update({Team.players_number: Team.players_number + 1})
    Session.commit()
except SQLAlchemyError:
    Session.rollback()
    # error processing
You can implement an SQL trigger. But the implementation differs between DBMSs, so read about it in the documentation of your DBMS.
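For illustration, a hedged sketch of what such a trigger might look like in PostgreSQL 11+ (the lowercase team/player table names are assumed from the models above; other DBMSs use different syntax):

from sqlalchemy import text

Session.execute(text("""
    CREATE FUNCTION bump_players_number() RETURNS trigger AS $$
    BEGIN
        UPDATE team SET players_number = players_number + 1
        WHERE id = NEW.team_id;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER player_insert AFTER INSERT ON player
    FOR EACH ROW EXECUTE FUNCTION bump_players_number();
"""))
Session.commit()

With a database-side trigger, the counter stays correct no matter which application inserts the row.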
You can implement SQLAlchemy trigger, like this:
from sqlalchemy import event
class Team(Base):
...
class Player(Base):
...
#staticmethod
def increment_players_number(mapper, connection, player):
try:
Session.query(Team).filter(Team.id == player.team_id)\
.update({Team.players_number: Team.players_number + 1})
except SQLAlchemyError:
Session.rollback()
# error processing
event.listen(Player, 'after_insert', Player.increment_players_number)
As you can see, there are always two queries, because you have to perform two operations: an insert and an update. I think (but I'm not sure) that some DBMSs can process queries like this:
UPDATE table1 SET column = column + 1 WHERE id = SOMEID AND (INSERT INTO table2 values (VALUES))
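PostgreSQL, for one, really can combine the two statements using a data-modifying CTE; here is a hedged sketch, reusing the (assumed) player/team table names from the examples above:

from sqlalchemy import text

Session.execute(text("""
    WITH ins AS (
        INSERT INTO player (name, team_id)
        VALUES (:name, :team_id)
        RETURNING team_id
    )
    UPDATE team SET players_number = players_number + 1
    WHERE id = (SELECT team_id FROM ins)
"""), {"name": "New_player_name", "team_id": 3})
Session.commit()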
I have a table with the following declarative definition:
class Type(Base):
    __tablename__ = 'Type'

    id = Column(Integer, primary_key=True)
    name = Column(String, unique=True)

    def __init__(self, name):
        self.name = name
The column "name" has a unique constraint, but I'm able to do
type1 = Type('name1')
session.add(type1)
type2 = Type(type1.name)
session.add(type2)
So, as can be seen, the unique constraint is not checked at all, since I have added two objects with the same name to the session.
When I do session.commit(), I get a MySQL error since the constraint also exists in the MySQL table.
Is it possible for SQLAlchemy to tell me in advance that I cannot do this, or to identify it and not insert two entries with the same "name" column?
If not, should I keep all existing names in memory, so I can check whether they exist or not before creating the object?
SQLAlchemy doesn't handle uniqueness, because it's not possible to do in a good way. Even if you keep track of created objects and/or check whether an object with such a name exists, there is a race condition: anybody in another process can insert a new object with the name you just checked. The only solution is to lock the whole table before the check and release the lock after the insertion (some databases support such locking).
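A minimal sketch of that lock-then-check approach, assuming PostgreSQL (the LOCK TABLE syntax and available lock modes differ between databases):

from sqlalchemy import text

with session.begin():
    # SHARE ROW EXCLUSIVE conflicts with itself, so two concurrent
    # check-then-insert transactions cannot interleave
    session.execute(text('LOCK TABLE "Type" IN SHARE ROW EXCLUSIVE MODE'))
    if session.query(Type).filter_by(name='name1').first() is None:
        session.add(Type('name1'))
# the lock is released when the transaction commits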
AFAIK, SQLAlchemy does not enforce uniqueness constraints in Python. Those unique=True declarations are only used to impose database-level table constraints, and even then only if you create the table using an SQLAlchemy command, i.e.
Type.__table__.create(engine)
or some such. If you create an SA model against an existing table that does not actually have this constraint present, it will be as if it does not exist.
Depending on your specific use case, you'll probably have to use a pattern like
try:
    existing = session.query(Type).filter_by(name='name1').one()
    # do something with existing
except:
    newobj = Type('name1')
    session.add(newobj)
or a variant, or you'll just have to catch the MySQL exception and recover from there.
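A minimal sketch of that catch-and-recover variant (SQLAlchemy wraps the driver's duplicate-key error in IntegrityError):

from sqlalchemy.exc import IntegrityError

session.add(Type('name1'))
try:
    session.commit()
except IntegrityError:
    session.rollback()  # the failed transaction must be rolled back first
    existing = session.query(Type).filter_by(name='name1').one()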
From the docs
class MyClass(Base):
    __tablename__ = 'sometable'
    __table_args__ = (
        ForeignKeyConstraint(['id'], ['remote_table.id']),
        UniqueConstraint('foo'),
        {'autoload': True},
    )
.one() throws two kinds of exceptions: sqlalchemy.orm.exc.NoResultFound and sqlalchemy.orm.exc.MultipleResultsFound.
You should create the object when the first exception occurs; if the second occurs, you're screwed anyway and shouldn't make it worse.
from sqlalchemy.orm.exc import NoResultFound

try:
    existing = session.query(Type).filter_by(name='name1').one()
    # do something with existing
except NoResultFound:
    newobj = Type('name1')
    session.add(newobj)