I'm going through the SQLAlchemy ORM tutorial (https://docs.sqlalchemy.org/en/latest/orm/tutorial.html) and I'm finding it super difficult to understand when/why a Python object will reflect the latest data in the database.
Here is a sequence of events that confuses me:
First we create a user ed_user and add it to the session. Its id is None because the row hasn't been written to the database.
Then we create a different user our_user, obtained by querying the database with a filter that matches ed_user. So our_user and ed_user are actually the same user. When we read either our_user.id or ed_user.id after this query has taken place, we see that the id has now been assigned, because ed_user was flushed to the database when the SELECT query was issued.
Then we edit ed_user and add some other unrelated rows, and issue a session commit.
Finally we read the value of ed_user.id again, and this causes SQLAlchemy to issue a SELECT query to get the latest value of id, since the previous commit ended the previous transaction.
I find this extremely confusing. In the first step, before ed_user was ever written to the database, SQLAlchemy was content to give us a None value for id, even though it could have obtained one by going ahead and flushing the write to the database. Yet once the row has been written to the database, SQLAlchemy apparently considers it important to keep it up-to-date (in the last step) by refreshing the data when it is read. Why is this happening, and what controls this behavior?
Bottom line, I have no idea what logic I can rely on regarding when/why/how my Python objects will be kept up-to-date with the database, and any extra clarity you can offer will be extremely appreciated.
I'll try and shed some light on the state management in SQLAlchemy by going through your bullet points.
First we create a user ed_user and add it to the session. Its id is None because the row hasn't been written to the database.
Before adding the newly created Ed-object to the session it is in transient state; it has not been added to a session and it does not have a database identity. When you add it to a session it moves to pending state. It has not been flushed to the database, but will be when the next flush occurs. If you have autoflush enabled (the default), all pending changes will be flushed before issuing the next query operation in order to make sure that the states of the session and the database are in sync when querying, which brings us to:
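If you want to see these states for yourself, sqlalchemy.inspect() exposes them as boolean flags on the instance's state (a small aside, replaying the tutorial's steps with its ed_user object):

>>> from sqlalchemy import inspect
>>> inspect(ed_user).transient    # before session.add()
True
>>> session.add(ed_user)
>>> inspect(ed_user).pending      # added, but not yet flushed
True
>>> inspect(ed_user).persistent   # becomes True once it is flushed and has an identity
False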
Then we create a different user our_user, obtained by querying the database with a filter that matches ed_user. So our_user and ed_user are actually the same user.
It is a bit misleading to say that you create our_user. Instead you perform the query and bind the result to the name our_user:
>>> our_user = session.query(User).filter_by(name='ed').first()
Here it is important to remember that all pending changes are flushed before this query takes place. That means that the changes held in the object bound to the name ed_user are sent to the database and SQLAlchemy fetches its database identity (id is not None anymore), moving it to persistent state and adding it to the identity map.
Since all of that took place before the query, its result is the row that was created when the Ed-object was flushed. Inspecting that row's identity against the identity map, SQLAlchemy notices that it in fact represents the existing object already held in the session, the one bound to the name ed_user before. That is why both ed_user.id and our_user.id give you the same value; in fact ed_user is our_user will also be True, because they are the same object.
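You can check this in the tutorial's own session:

>>> ed_user is our_user
True
>>> ed_user.id == our_user.id
True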
Finally we read the value of ed_user.id again, and this causes SQLAlchemy to issue a SELECT query to get the latest value of id, since the previous commit ended the previous transaction.
By default SQLAlchemy expires all database loaded state after a commit, in order to keep you from working on stale data. Some other thread or process might already have committed its changes in between. Like most things this behaviour can be controlled by passing expire_on_commit=False to sessionmaker or a Session directly, if you really need to.
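For example, a minimal sketch (engine stands for whatever Engine you have configured):

>>> from sqlalchemy.orm import sessionmaker
>>> Session = sessionmaker(bind=engine, expire_on_commit=False)
>>> session = Session()
>>> # objects loaded in this session keep their attribute values after commit()
>>> # instead of being refreshed from the database on next access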
I have a distributed serverless application, based on AWS Aurora Serverless MySQL 5.6 and multiple Lambda functions. Some of the Lambdas act as writing threads, others as reading threads. To focus on the important details, let's suppose there is only one table with the following structure:
id: bigint primary key autoincrement
key1: varchar(700)
key2: bigint
content: blob
unique(key1, key2)
Writing threads perform INSERTs in the following manner: every writing thread generates one entry with key1 + key2 + content, where the key1 + key2 pair is unique, and id is generated automatically by autoincrement. Some writing threads can fail with a DUPLICATE KEY ERROR if key1 + key2 repeats an existing value, but that does not matter and is okay.
There are also some reading threads, which poll the table and try to process newly inserted entries. The goal of a reading thread is to retrieve all new entries and process them in some way. The number of reading threads is uncontrolled; they do not communicate with each other and do not write anything to the table above, but they can write some state to a custom table.
At first it seems that polling is very simple: it's enough for the reading process to store the last id that has been processed and to continue polling from it, e.g. SELECT * FROM table WHERE id > ${lastId}. This approach works well under light load, but it does not work under high load for an obvious reason: there are entries still being inserted that have not yet appeared in the database, because the cluster had not been synchronized at that point.
Let's see what happens from the cluster's point of view, even if it consists of only two servers, A and B.
1) Server A accepts a write transaction with an entry insertion and acquires autoincrement number 100500
2) Server B accepts a write transaction with an entry insertion and acquires autoincrement number 100501
3) Server B commits its write transaction
4) Server B accepts a read transaction and returns the entries with id > 100499, which is only entry 100501
5) Server A commits its write transaction
6) The reading thread receives only entry 100501 and moves its lastId cursor to 100501. Entry 100500 is lost to the current reading thread forever.
QUESTION: Is there a way to solve the problem above WITHOUT hard-locking the table across the whole cluster, in some lock-free way or something similar?
The issue here is that the local state in each lambda (thread) does not reflect the global state of said table.
As a first step I would try to always consult the table for the latest ID before reading the entry with that ID.
Have a look at the built-in function LAST_INSERT_ID() in MySQL.
The caveat:
[...] the most recently generated ID is maintained in the server on a per-connection basis
Your Lambda could create its connections outside the handler function/method, which would make them longer-lived (it's a known trick, but it's not bombproof here), but I think each new simultaneously executing Lambda function would be given a new connection, in which case the above solution would fall apart.
Luckily what you have to do then is to wrap all WRITES and all READS in transactions so that additional coordination will take place when reading and writing simultaneously to the same table.
In your quest you might come across transaction isolation levels; SERIALIZABLE would be the safest and least performant, but apparently AWS Aurora does not support it (I have not verified that claim).
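A minimal sketch of what wrapping a polling read in its own transaction could look like, shown in Python with SQLAlchemy purely for illustration; the engine URL and the entries table/column names are placeholders, not anything from your actual schema:

from sqlalchemy import create_engine, text

engine = create_engine("mysql+pymysql://user:password@aurora-endpoint/mydb")

def poll_new_entries(last_id):
    # engine.begin() hands out a connection wrapped in a short transaction
    # that is committed when the block exits, so every poll runs in its own
    # transaction rather than reusing a long-lived one
    with engine.begin() as conn:
        rows = conn.execute(
            text("SELECT id, key1, key2, content FROM entries "
                 "WHERE id > :last_id ORDER BY id"),
            {"last_id": last_id},
        ).fetchall()
    return rows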
HTH
As part of the persistence process in one of my models, an md5 check_sum of the entire record is generated and stored with the record. The md5 check_sum is computed over a flattened representation of the entire record, including all EAV attributes etc. This makes preventing absolute duplicates very easy and efficient.
I am not using a unique index on this check_sum for a specific reason: I want this all to be silent, i.e. if a user submits a duplicate then the app just silently ignores it and returns the already existing record. This ensures backwards compatibility with legacy apps and APIs.
I am using Laravel's Eloquent. So once a record has been created, and before committing, the application does the following:
$taxonRecords = TaxonRecord::where('check_sum', $taxonRecord->check_sum)->get();

if ($taxonRecords->count() > 0) {
    DB::rollBack();
    return $taxonRecords->first();
}
However, recently I encountered a 60,000-to-1 incident (odds based on record counts at the time). A single duplicate ended up in the database with the same check_sum. When I reviewed the logs I noticed that the creation times were identical down to the second. Further investigation of the Apache logs showed a valid POST, but the POST was duplicated. I presume the user's browser malfunctioned or something, but both POSTs arrived simultaneously, resulting in two simultaneous transactions.
My question is: how can I ensure that a transaction and the SELECT it contains for an existing check_sum are atomic and isolated? Based on my reading, the answer lies in https://dev.mysql.com/doc/refman/8.0/en/innodb-locking-reads.html and isolation levels.
If transaction A and transaction B arrive at the server at the same time, they should not run side by side; the second should wait for the first to complete.
You have created a classic race condition. Both transactions calculate the checksum while they're both in progress, not yet committed. Neither can read the other's data, since it is uncommitted. So each concludes that it is the only one with that checksum, and they both go through and commit.
To solve this, you need to run such transactions serially, to be sure that there aren't other concurrent transactions submitting the same data.
You may have to use GET_LOCK() before starting your transaction to calculate the checksum, then RELEASE_LOCK() after you commit. That will make sure other concurrent requests wait for your data to be committed, so they will see it when they try to calculate their checksum.
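A rough sketch of that pattern; the question uses Laravel/PHP, but the SQL calls are the same from any client, so it is shown here from Python with PyMySQL. The lock name, table and column names are all made up, and a real implementation should also check GET_LOCK()'s return value for a timeout:

import pymysql

conn = pymysql.connect(host="localhost", user="user", password="secret",
                       database="mydb")

def store_unless_duplicate(check_sum, payload):
    with conn.cursor() as cur:
        # serialize all checksum checks behind one named lock (10 s timeout)
        cur.execute("SELECT GET_LOCK('taxon_check_sum', 10)")
        try:
            cur.execute("SELECT id FROM taxon_records WHERE check_sum = %s",
                        (check_sum,))
            row = cur.fetchone()
            if row is not None:
                conn.rollback()
                return row[0]      # silently return the existing record's id
            cur.execute("INSERT INTO taxon_records (check_sum, payload) "
                        "VALUES (%s, %s)", (check_sum, payload))
            conn.commit()          # commit while still holding the lock
            return cur.lastrowid
        finally:
            cur.execute("SELECT RELEASE_LOCK('taxon_check_sum')")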
I want to create a model with ID equal to the current greatest ID for that model plus one (like auto-increment). I'm considering doing this with select_for_update to ensure there is no race condition for the current greatest ID, like this:
with transaction.atomic():
    greatest_id = MyModel.objects.select_for_update().order_by('id').last().id
    MyModel.objects.create(id=greatest_id + 1)
But I'm wondering, if two processes try to run this simultaneously, once the second one unblocks, will it see the new greatest ID inserted by the first process, or will it still see the old greatest ID?
For example, say the current greatest ID is 10. Two processes go to create a new model. The first one locks ID 10. Then the second one blocks because 10 is locked. The first one inserts 11 and unlocks 10. Then, the second one unblocks, and now will it see the 11 inserted by the first as the greatest, or will it still see 10 because that's the row it blocked on?
In the select_for_update docs, it says:
Usually, if another transaction has already acquired a lock on one of the selected rows, the query will block until the lock is released.
So for my example, I'm thinking this means that the second process will rerun the query for the greatest ID once it unblocks and get 11. But I'm not certain I'm interpreting that right.
Note: I'm using MySQL for the db.
No, I don't think this will work.
First, let me note that you should absolutely check the documentation for the database you're using, as there are many subtle differences between the databases that are not captured in the Django documentation.
Using the PostgreSQL documentation as a guide, the problem is that, at the default READ COMMITTED isolation level, the blocked query will not be rerun. When the first transaction commits, the blocked transaction will be able to see changes to that row, but it will not be able to see that new rows have been added.
It is possible for an updating command to see an inconsistent snapshot: it can see the effects of concurrent updating commands on the same rows it is trying to update, but it does not see effects of those commands on other rows in the database.
So 10 is what will be returned.
Edit: My understanding in this answer is wrong, just leaving it for documentation's sake in case I ever want to come back to it.
After some investigation, I believe this will work as intended.
The reason is that for this call:
MyModel.objects.select_for_update().order_by('id').last().id
The SQL Django generates and runs against the db is actually:
SELECT ... FROM MyModel ORDER BY id ASC FOR UPDATE;
(the call to last() only happens after the queryset has already been evaluated.)
That means the query scans over all rows each time it runs, so the second time it runs it will pick up the new row and return it accordingly.
I learned that this phenomenon is called a "phantom read", and is possible because the isolation level of my db is REPEATABLE-READ.
@KevinChristopherHenry "The issue is that the query is not rerun after the lock is released; the rows have already been selected" Are you sure that's how it works? Why does READ COMMITTED imply the select doesn't run after the lock is released? I thought the isolation level defines which snapshot of data a query sees when it runs, not *when* the query is run. It seems to me that whether the select happens before or after the lock is released is orthogonal to the isolation level. And by definition, doesn't a blocked query not select the rows until after it is unblocked?
For what it's worth, I tried to test this by opening two separate connections to my db in a shell and issuing some queries. In the first, I began a transaction and acquired the locks with select * from MyModel order by id for update. Then, in the second, I did the same, causing the select to block. Back in the first, I inserted a new row and committed the transaction. Then in the second connection the query unblocked and returned the new row. This makes me think my hypothesis is correct.
P.S. I finally actually read the "undesirable results" documentation that you read and I see your point - in that example, it looks like it ignores rows that weren't preselected, so that would point to the conclusion that my second query wouldn't pick up the new row. But I tested in a shell and it did. Now I'm not sure what to make of this.
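For reference, the two-connection experiment described in the comments above can also be reproduced from Python, e.g. with SQLAlchemy Core and a thread standing in for the second shell. This is only a sketch; the engine URL, the table name and the inserted id are all made up:

import threading
import time
from sqlalchemy import create_engine, text

engine = create_engine("mysql+pymysql://user:password@localhost/mydb")
seen = []

def blocked_reader():
    with engine.connect() as conn2, conn2.begin():
        # this SELECT blocks on the row locks held by connection 1
        rows = conn2.execute(
            text("SELECT id FROM myapp_mymodel ORDER BY id FOR UPDATE")
        ).fetchall()
        seen.extend(r[0] for r in rows)

with engine.connect() as conn1:
    with conn1.begin():
        conn1.execute(text("SELECT id FROM myapp_mymodel ORDER BY id FOR UPDATE"))
        reader = threading.Thread(target=blocked_reader)
        reader.start()
        time.sleep(1)   # give the reader time to block on the locks
        conn1.execute(text("INSERT INTO myapp_mymodel (id) VALUES (11)"))
    # leaving the inner block commits connection 1; the reader unblocks here

reader.join()
print(seen)   # per the comment above, on MySQL the newly inserted id shows up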
I am using 2 separate processes via multiprocessing in my application. Both have access to a MySQL database via sqlalchemy core (not the ORM). One process reads data from various sources and writes them to the database. The other process just reads the data from the database.
I have a query which gets the latest record from a table and displays its id. However, it always displays the first id, which was created when I started the program, rather than the latest inserted id (new rows are created every few seconds).
If I use a separate MySQL tool and run the query manually I get correct results, but SQLAlchemy always gives me stale results.
Since you can see the changes your writer process is making when you use another MySQL tool, your writer process is indeed committing the data (at least if you are using InnoDB).
InnoDB shows you the state of the database as of when you started your transaction. Whatever other tools you are using probably have an autocommit feature turned on where a new transaction is implicitly started following each query.
To see the changes in SQLAlchemy do as zzzeek suggests and change your monitoring/reader process to begin a new transaction.
One technique I've used to do this myself is to add autocommit=True to the execution_options of my queries, e.g.:
result = conn.execute( select( [table] ).where( table.c.id == 123 ).execution_options( autocommit=True ) )
Assuming you're using InnoDB, the data on your connection will appear "stale" for as long as you keep the current transaction running, or until you commit the other transaction. In order for one process to see the data from the other process, two things need to happen: 1. the transaction that created the new data needs to be committed, and 2. the current transaction, assuming it has already read some of that data, needs to be rolled back or committed and started again. See The InnoDB Transaction Model and Locking.
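A rough sketch of what that could look like in the reader process, assuming SQLAlchemy 1.4+ and a placeholder table name; the point is simply that each poll gets its own connection and therefore its own transaction:

import time
from sqlalchemy import create_engine, MetaData, Table, select

engine = create_engine("mysql+pymysql://user:password@localhost/mydb")
table = Table("readings", MetaData(), autoload_with=engine)

while True:
    # a fresh connection (and transaction) per iteration, so each SELECT
    # sees whatever the writer process has committed so far
    with engine.connect() as conn:
        row = conn.execute(
            select(table).order_by(table.c.id.desc()).limit(1)
        ).first()
    print(row.id if row is not None else None)
    time.sleep(5)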
Consider the following situation: you want to update the number of page views of each profile in your system. This action is very frequent, as almost all visits to your website result in a page view increment.
The basic way is update Users set page_views=page_views+1. But this is far from optimal, because we don't really need an instant update (being an hour late is OK). Is there any other way in MySQL to postpone a sequence of updates and apply the cumulative updates at a later time?
I tried another method myself: storing a counter (number of increments) for each profile. But this results in handling a few thousand small files, and I think the disk I/O cost (even with a deep tree structure for the files) would probably exceed that of the database.
What is your suggestion for this problem (other than MySQL)?
To improve performance you could store your page view data in a MEMORY table. This is super fast but temporary: the table only persists while the server is running, so on restart it will be empty...
You could then create an EVENT to update a table that will persist the data on a timed basis. This would help improve performance a little with the risk that, should the server go down, only the number of visits since the last run of the event would be lost.
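A very rough sketch of that idea, with all table and column names invented; the flush step is shown simply as the SQL a scheduled EVENT (or a cron job) could run, issued here through SQLAlchemy only because that is what the other answers on this page use:

from sqlalchemy import create_engine, text

engine = create_engine("mysql+pymysql://user:password@localhost/mydb")

# one-time setup: an in-memory buffer for the counters
with engine.begin() as conn:
    conn.execute(text(
        "CREATE TABLE page_view_buffer ("
        " user_id BIGINT PRIMARY KEY,"
        " views INT NOT NULL"
        ") ENGINE=MEMORY"))

def record_page_view(user_id):
    # each page view touches only the MEMORY table
    with engine.begin() as conn:
        conn.execute(text(
            "INSERT INTO page_view_buffer (user_id, views) VALUES (:uid, 1) "
            "ON DUPLICATE KEY UPDATE views = views + 1"), {"uid": user_id})

def flush_page_views():
    # what the timed EVENT would do: fold the buffer back into Users;
    # note that MEMORY tables are not transactional, so increments landing
    # between the UPDATE and the DELETE could be lost; a real job would
    # need to account for that
    with engine.begin() as conn:
        conn.execute(text(
            "UPDATE Users u JOIN page_view_buffer b ON b.user_id = u.id "
            "SET u.page_views = u.page_views + b.views"))
        conn.execute(text("DELETE FROM page_view_buffer"))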
The link James posted in the comment on your question, which leads to an accepted answer with a further comment about memcached, was my first thought as well. Store the profileIds in memcached, then set up a cron job to run every 15 minutes, grab all the entries, and issue the updates to MySQL in a batch. But there are a few things to consider:
1. When you run the batch script to grab the ids out of memcached, you will have to ensure you remove all entries which have been processed, otherwise you run the risk of counting the same profile views multiple times.
2. Since memcache doesn't support wildcard searching by key, and since you will have to purge existing keys for the reason stated in #1, you will probably have to set up a separate memcache server pool dedicated to the sole purpose of tracking profile ids, so you don't end up purging cached values which have no relation to profile view tracking. However, you could avoid this by storing the profileId and a timestamp within the value payload; your batch script would then step through each entry and check the timestamp, add it to the update queue if it's within the time range you specified, and stop once it hits the upper limit of that range.
Another option may be to parse your access logs. If user profiles live at a known location like /myapp/profile/1234, you could parse for this pattern and add profile views this way. I ended up having to go this route for advertiser tracking, as it was the only repeatable way to generate billing numbers. If an advertiser had a billing dispute we would offer to send them the access logs so they could parse them themselves.