I have a django application that is backed by a MySQL database. I have recently moved a section of code out of the request flow and put it into a Process. The code uses select_for_update() to lock affected rows in the DB but now I am occasionally seeing the Process updating a record while it should be locked in the main Thread. If I switch my Executor from a ProcessPoolExecutor to a ThreadPoolExecutor the locking works as expected. I thought that select_for_update() operated at the database level so it shouldn't make any difference whether code is in Threads, Processes, or even on another machine - what am I missing?
I've boiled my code down to a sample that exhibits the same behaviour:
from concurrent import futures
import logging
from time import sleep
from django.db import transaction
from myapp.main.models import CompoundBase
logger = logging.getLogger()
executor = futures.ProcessPoolExecutor()
# executor = futures.ThreadPoolExecutor()
def test() -> None:
    pk = setup()
    f1 = executor.submit(select_and_sleep, pk)
    f2 = executor.submit(sleep_and_update, pk)
    futures.wait([f1, f2])

def setup() -> int:
    cb = CompoundBase.objects.first()
    cb.corporate_id = 'foo'
    cb.save()
    return cb.pk

def select_and_sleep(pk: int) -> None:
    try:
        with transaction.atomic():
            cb = CompoundBase.objects.select_for_update().get(pk=pk)
            print('Locking')
            sleep(5)
            cb.corporate_id = 'baz'
            cb.save()
            print('Updated after sleep')
    except Exception:
        logger.exception('select_and_sleep')

def sleep_and_update(pk: int) -> None:
    try:
        sleep(2)
        print('Updating')
        with transaction.atomic():
            cb = CompoundBase.objects.select_for_update().get(pk=pk)
            cb.corporate_id = 'bar'
            cb.save()
            print('Updated without sleep')
    except Exception:
        logger.exception('sleep_and_update')

test()
When run as shown I get:
Locking
Updating
Updated without sleep
Updated after sleep
But if I change to the ThreadPoolExecutor I get:
Locking
Updating
Updated after sleep
Updated without sleep
The good news is that it's mostly there. I did some reading around and, based on an answer I found here, I am assuming that you are running on Linux, as that seems to be the behaviour on that platform.
It looks like under Linux the default process start method is fork, which is usually what you want. In this exact circumstance, however, it means that resources (such as DB connections) are shared with the parent, so the DB operations end up being treated as part of the same transaction and thus are not blocked. To get the behaviour you want, each process needs its own resources and must not share them with its parent process (and, subsequently, with any other children of the parent).
It is possible to get the behaviour you want using the following code; be aware, however, that I had to split the code into two files.
fn.py
import logging
from time import sleep
from django.db import transaction
import django
django.setup()
from myapp.main.models import CompoundBase

logger = logging.getLogger()

def setup() -> int:
    cb = CompoundBase.objects.first()
    cb.corporate_id = 'foo'
    cb.save()
    return cb.pk

def select_and_sleep(pk: int) -> None:
    try:
        with transaction.atomic():
            cb = CompoundBase.objects.select_for_update().get(pk=pk)
            print('Locking')
            sleep(5)
            cb.corporate_id = 'baz'
            cb.save()
            print('Updated after sleep')
    except Exception:
        logger.exception('select_and_sleep')

def sleep_and_update(pk: int) -> None:
    try:
        sleep(2)
        print('Updating')
        with transaction.atomic():
            cb = CompoundBase.objects.select_for_update().get(pk=pk)
            cb.corporate_id = 'bar'
            cb.save()
            print('Updated without sleep')
    except Exception:
        logger.exception('sleep_and_update')
proc_test.py
from concurrent import futures
from multiprocessing import get_context
from time import sleep
import logging
import fn
logger = logging.getLogger()
executor = futures.ProcessPoolExecutor(mp_context=get_context("forkserver"))
# executor = futures.ThreadPoolExecutor()
def test() -> None:
    pk = fn.setup()
    f1 = executor.submit(fn.select_and_sleep, pk)
    f2 = executor.submit(fn.sleep_and_update, pk)
    futures.wait([f1, f2])

test()
There are three strategies for starting a process: fork, spawn, and forkserver. Using either spawn or forkserver appears to give you the behaviour that you are looking for.
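If you do need to stick with the default fork start method, a minimal alternative sketch (my own suggestion, not something from your code, and it assumes Python 3.7+ for the initializer argument) is to close the Django connections each worker inherits at fork time, so every process lazily opens its own connection and therefore runs its own transaction:
from concurrent import futures
from django.db import connections

def init_worker() -> None:
    # Connections inherited from the parent are closed; each worker then
    # opens a fresh connection of its own on its first query.
    connections.close_all()

executor = futures.ProcessPoolExecutor(initializer=init_worker)
With one MySQL session per worker, the SELECT ... FOR UPDATE lock behaves the same as it does with spawn or forkserver.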
References:
Multiprocessing Locks
Connections
I am using django and celery. I have a long running celery task and I would like it to report progress. I am doing this:
@shared_task
def do_the_job(tracker_id, *args, **kwargs):
    while condition:
        # Do a long operation
        tracker = ProgressTracker.objects.get(pk=tracker_id)
        tracker.task_progress = F('task_progress') + 1
        tracker.last_update = timezone.now()
        tracker.save(update_fields=['task_progress', 'last_update'])
The problem is that the view that is supposed to show the progress to the user cannot see the updates until the task finishes. Is there a way to get the django orm to ignore transactions for just this one table? Or just this one write?
You can use bound tasks to define custom states for your tasks and set/update the state during execution:
@celery.task(bind=True)
def show_progress(self, n):
    for i in range(n):
        self.update_state(state='PROGRESS', meta={'current': i, 'total': n})
You can dump the state of currently executing tasks to get the progress:
>>> from celery import Celery
>>> app = Celery('proj')
>>> i = app.control.inspect()
>>> i.active()
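To surface that progress in a Django view, a rough sketch (the view wiring and task id handling are assumptions on my part) is to look the task up by id and read back the custom state and its meta dict:
from celery.result import AsyncResult

def task_progress(task_id):
    result = AsyncResult(task_id)
    if result.state == 'PROGRESS':
        # the meta dict passed to update_state() is exposed as result.info
        return result.info  # e.g. {'current': 3, 'total': 10}
    return {'state': result.state}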
I've tried to implement this pipeline in my spider.
After installing the necessary dependencies I am able to run the spider without any errors, but for some reason it doesn't write to my database.
I'm pretty sure there is something going wrong with connecting to the database. When I supply a wrong password, I still don't get any error.
When the spider has scraped all the data, it takes a few minutes before it starts dumping the stats.
2017-08-31 13:17:12 [scrapy] INFO: Closing spider (finished)
2017-08-31 13:17:12 [scrapy] INFO: Stored csv feed (27 items) in: test.csv
2017-08-31 13:24:46 [scrapy] INFO: Dumping Scrapy stats:
Pipeline:
import MySQLdb.cursors
from twisted.enterprise import adbapi
from scrapy.xlib.pydispatch import dispatcher
from scrapy import signals
from scrapy.utils.project import get_project_settings
from scrapy import log
SETTINGS = {}
SETTINGS['DB_HOST'] = 'mysql.domain.com'
SETTINGS['DB_USER'] = 'username'
SETTINGS['DB_PASSWD'] = 'password'
SETTINGS['DB_PORT'] = 3306
SETTINGS['DB_DB'] = 'database_name'
class MySQLPipeline(object):

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.stats)

    def __init__(self, stats):
        print "init"
        # Instantiate DB connection pool
        self.dbpool = adbapi.ConnectionPool('MySQLdb',
            host=SETTINGS['DB_HOST'],
            user=SETTINGS['DB_USER'],
            passwd=SETTINGS['DB_PASSWD'],
            port=SETTINGS['DB_PORT'],
            db=SETTINGS['DB_DB'],
            charset='utf8',
            use_unicode=True,
            cursorclass=MySQLdb.cursors.DictCursor
        )
        self.stats = stats
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_closed(self, spider):
        print "close"
        """ Cleanup function, called after crawling has finished to close open
        objects.
        Close ConnectionPool. """
        self.dbpool.close()

    def process_item(self, item, spider):
        print "process"
        query = self.dbpool.runInteraction(self._insert_record, item)
        query.addErrback(self._handle_error)
        return item

    def _insert_record(self, tx, item):
        print "insert"
        result = tx.execute(
            " INSERT INTO matches(type,home,away,home_score,away_score) VALUES (soccer,"+item["home"]+","+item["away"]+","+item["score"].explode("-")[0]+","+item["score"].explode("-")[1]+")"
        )
        if result > 0:
            self.stats.inc_value('database/items_added')

    def _handle_error(self, e):
        print "error"
        log.err(e)
Spider:
import scrapy
import dateparser
from crawling.items import KNVBItem
class KNVBspider(scrapy.Spider):
    name = "knvb"
    start_urls = [
        'http://www.knvb.nl/competities/eredivisie/uitslagen',
    ]
    custom_settings = {
        'ITEM_PIPELINES': {
            'crawling.pipelines.MySQLPipeline': 301,
        }
    }

    def parse(self, response):
        # www.knvb.nl/competities/eredivisie/uitslagen
        for row in response.xpath('//div[@class="table"]'):
            for div in row.xpath('./div[@class="row"]'):
                match = KNVBItem()
                match['home'] = div.xpath('./div[@class="value home"]/div[@class="team"]/text()').extract_first()
                match['away'] = div.xpath('./div[@class="value away"]/div[@class="team"]/text()').extract_first()
                match['score'] = div.xpath('./div[@class="value center"]/text()').extract_first()
                match['date'] = dateparser.parse(div.xpath('./preceding-sibling::div[@class="header"]/span/span/text()').extract_first(), languages=['nl']).strftime("%d-%m-%Y")
                yield match
If there are better pipelines available to do what I'm trying to achieve that'd be welcome as well. Thanks!
Update:
With the link provided in the accepted answer I eventually got to this function that's working (and thus solved my problem):
def process_item(self, item, spider):
    print "process"
    query = self.dbpool.runInteraction(self._insert_record, item)
    query.addErrback(self._handle_error)
    query.addBoth(lambda _: item)
    return query
Take a look at this for how to use adbapi with MySQL for saving scraped items. Note the difference between your process_item and their process_item method implementation. While you return the item immediately, they return the Deferred object that is the result of the runInteraction method, and it returns the item upon its completion. I think this is the reason your _insert_record never gets called.
If you can see the insert in your output that's already a good sign.
I'd rewrite the insert function this way:
def _insert_record(self, tx, item):
    print "insert"
    raw_sql = "INSERT INTO matches(type,home,away,home_score,away_score) VALUES ('%s', '%s', '%s', '%s', '%s')"
    sql = raw_sql % ('soccer', item['home'], item['away'], item['score'].split('-')[0], item['score'].split('-')[1])
    print sql
    result = tx.execute(sql)
    if result > 0:
        self.stats.inc_value('database/items_added')
It allows you to debug the SQL you're using. In your version you're not wrapping the string values in quotes ('), which is a syntax error in MySQL.
I'm not sure about your last values (the score), so I treated them as strings.
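As a further hardening step, a sketch worth considering (my own suggestion, assuming tx behaves like a DB-API cursor, which it does inside runInteraction) is to pass the values as query parameters and let the driver do the quoting and escaping:
def _insert_record(self, tx, item):
    print "insert"
    # The driver quotes/escapes each value, so no manual '...' wrapping is needed
    result = tx.execute(
        "INSERT INTO matches(type,home,away,home_score,away_score) "
        "VALUES (%s, %s, %s, %s, %s)",
        ('soccer', item['home'], item['away'],
         item['score'].split('-')[0], item['score'].split('-')[1])
    )
    if result > 0:
        self.stats.inc_value('database/items_added')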
In Django I often assert the number of queries that should be made so that unit tests catch new N+1 query problems:
from django import db
from django.conf import settings
settings.DEBUG=True
class SendData(TestCase):
    def test_send(self):
        db.connection.queries = []
        event = Events.objects.all()[1:]
        s = str(event)  # QuerySet is lazy, force retrieval
        self.assertEquals(len(db.connection.queries), 2)
In SQLAlchemy, tracing to STDOUT is enabled by setting the echo flag on the engine:
engine.echo = True
What is the best way to write tests that count the number of queries made by SQLAlchemy?
class SendData(TestCase):
    def test_send(self):
        event = session.query(Events).first()
        s = str(event)
        self.assertEquals( ... , 2)
I've created a context manager class for this purpose:
class DBStatementCounter(object):
    """
    Use as a context manager to count the number of execute()'s performed
    against the given sqlalchemy connection.

    Usage:
        with DBStatementCounter(conn) as ctr:
            conn.execute("SELECT 1")
            conn.execute("SELECT 1")
        assert ctr.get_count() == 2
    """
    def __init__(self, conn):
        self.conn = conn
        self.count = 0
        # Will have to rely on this since sqlalchemy 0.8 does not support
        # removing event listeners
        self.do_count = False
        sqlalchemy.event.listen(conn, 'after_execute', self.callback)

    def __enter__(self):
        self.do_count = True
        return self

    def __exit__(self, *_):
        self.do_count = False

    def get_count(self):
        return self.count

    def callback(self, *_):
        if self.do_count:
            self.count += 1
Use SQLAlchemy Core events to log/track the queries executed (you can attach the listener from your unit tests so it doesn't impact the performance of the actual application):
event.listen(engine, "before_cursor_execute", catch_queries)
Now you write the function catch_queries; how exactly depends on how you test. For example, you could define this function inside your test method:
def test_something(self):
    stmts = []
    def catch_queries(conn, cursor, statement, *args):
        stmts.append(statement)
    # Now attach it as a listener and work with the collected statements after running your test
The above is just an inspiration. For extended cases you'd probably like to have a global cache of events that you empty after each test. The reason is that prior to 0.9 (the current dev version at the time) there is no API to remove event listeners. Thus, make one global listener that accesses a global list.
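A minimal sketch of that global-listener idea (the names statements, engine, and do_stuff are placeholders of mine, not from the answer above) could look like this:
from unittest import TestCase
from sqlalchemy import event

statements = []  # global cache of executed statements

def track_statement(conn, cursor, statement, parameters, context, executemany):
    statements.append(statement)

# Attach once, e.g. at import time or test-session setup;
# there is no event.remove() before SQLAlchemy 0.9.
event.listen(engine, "before_cursor_execute", track_statement)

class SomethingTest(TestCase):
    def setUp(self):
        del statements[:]  # empty the global cache before each test

    def test_something(self):
        do_stuff()
        self.assertEqual(len(statements), 2)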
What about the approach of using flask_sqlalchemy.get_debug_queries()? By the way, this is the methodology used internally by the Flask Debug Toolbar; check its source.
from flask_sqlalchemy import get_debug_queries
def test_list_with_assuring_queries_count(app, client):
    with app.app_context():
        # here generating some test data
        for _ in range(10):
            notebook = create_test_scheduled_notebook_based_on_notebook_file(
                db.session, owner='testing_user',
                schedule={"kind": SCHEDULE_FREQUENCY_DAILY}
            )
        for _ in range(100):
            create_test_scheduled_notebook_run(db.session, notebook_id=notebook.id)

    with app.app_context():
        # after resetting the context, call the actual view whose query count we want to assert
        client.get(url_for('notebooks.personal_notebooks'))
        assert len(get_debug_queries()) == 3
Keep in mind that to reset the context and the query count, you have to enter with app.app_context() right before the exact code you want to measure.
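One more caveat (taken from the Flask-SQLAlchemy documentation rather than from the snippet above): get_debug_queries() only records queries when the app is in debug or testing mode, or when recording is enabled explicitly in the config:
app.config['SQLALCHEMY_RECORD_QUERIES'] = True  # or set TESTING = True in the test config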
Slightly modified version of @omar-tarabai's solution that removes the event listener when exiting the context:
from sqlalchemy import event

class QueryCounter(object):
    """Context manager to count SQLAlchemy queries."""

    def __init__(self, connection):
        self.connection = connection.engine
        self.count = 0

    def __enter__(self):
        event.listen(self.connection, "before_cursor_execute", self.callback)
        return self

    def __exit__(self, *args, **kwargs):
        event.remove(self.connection, "before_cursor_execute", self.callback)

    def callback(self, *args, **kwargs):
        self.count += 1
Usage:
with QueryCounter(session.connection()) as counter:
    session.query(XXX).all()
    session.query(YYY).all()

print(counter.count)  # 2
Recently I got a SQLAlchemy InvalidRequestError.
The error log shows:
InvalidRequestError: Transaction <sqlalchemy.orm.session.SessionTransaction object at
0x106830dd0> is not on the active transaction list
In what circumstances will this error be raised?
-----Edit----
# the following two lines are actually in my decorator
s = Session()
s.add(model1)
# refer to <http://techspot.zzzeek.org/2012/01/11/django-style-database-routers-in-sqlalchemy/>
s2 = Session().using_bind('master')
model2 = s2.query(Model2).with_lockmode('update').get(1)
model2.somecolumn = 'new'
s2.commit()
This exception is raised
-----Edit2 -----
s = Session().using_bind('master')
model = Model(user_id=123456)
s.add(model)
s.flush()
# here, raise the exception.
# I added logging to get_bind() of RoutingSession: when doing the flush, _name is None and it returns engines['slave'].
# If I use commit() instead of flush(), it commits successfully.
I changed the using_bind method as follows and it works well.
def using_bind(self, name):
    self._name = name
    return self
The previous RoutingSession:
class RoutingSession(Session):
    _name = None

    def get_bind(self, mapper=None, clause=None):
        logger.info(self._name)
        if self._name:
            return engines[self._name]
        elif self._flushing:
            logger.info('master')
            return engines['master']
        else:
            logger.info('slave')
            return engines['slave']

    def using_bind(self, name):
        s = RoutingSession()
        vars(s).update(vars(self))
        s._name = name
        return s
That's an internal assertion which should never occur. There's no way to answer this question without at least a full stack trace; perhaps you are using the Session improperly in a concurrent fashion, or manipulating its internals. I can only reproduce that exception by manipulating private methods or state pertaining to the Session object.
Here's that:
from sqlalchemy.orm import Session

s = Session()
s2 = Session()

t = s.transaction
t2 = s2.transaction

s2.transaction = t  # nonsensical assignment of the SessionTransaction
                    # from one Session to also be referred to by another,
                    # corrupts the transaction chain by leaving out "t2".
                    # ".transaction" should never be assigned to on the outside

t2.rollback()  # triggers the assertion case
Basically, the above should never happen, since you're not supposed to assign to .transaction; that's a read-only attribute.
My WSGI application uses SQLAlchemy. I want to start a session when a request starts, commit it if it's dirty and request processing finished successfully, and roll it back otherwise. So, I need to implement the behaviour of Django's TransactionMiddleware.
So, I suppose that I should create a WSGI middleware that does the following:
Create a DB session and add it to environ during pre-processing.
Get the DB session from environ and call commit() during post-processing, if no errors occurred.
Get the DB session from environ and call rollback() during post-processing, if some errors occurred.
Step 1 is obvious to me:
class DbSessionMiddleware:
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        environ['db_session'] = create_session()
        return self.app(environ, start_response)
Steps 2 and 3 are not. I found this example of a post-processing task:
class Caseless:
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        for chunk in self.app(environ, start_response):
            yield chunk.lower()
It contains this comment:
Note that the __call__ function is a Python generator, which is typical for this sort of “post-processing” task.
Could you please clarify how this works and how I can solve my issue in a similar way?
Thanks,
Boris.
For step 1 I use SQLAlchemy scoped sessions:
engine = create_engine(settings.DB_URL, echo=settings.DEBUG, client_encoding='utf8')
Base = declarative_base()
sm = sessionmaker(bind=engine)
get_session = scoped_session(sm)
They return the same thread-local session for each get_session() call.
Steps 2 and 3 are, for now, the following:
class DbSessionMiddleware:
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        try:
            db.get_session().begin_nested()
            return self.app(environ, start_response)
        except BaseException:
            db.get_session().rollback()
            raise
        finally:
            db.get_session().commit()
As you can see, I start a nested transaction on the session to be able to roll back even queries that were already committed in the views.
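Since get_session() is a scoped_session, one extra step worth sketching (my own addition, not part of the code above) is to dispose of the thread-local session once the request is done, so the next request handled by the same thread starts with a fresh session:
class DbSessionMiddleware:
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        try:
            db.get_session().begin_nested()
            return self.app(environ, start_response)
        except BaseException:
            db.get_session().rollback()
            raise
        finally:
            db.get_session().commit()
            # scoped_session.remove() closes the current session and
            # removes it from the thread-local registry.
            db.get_session.remove()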