Properly handling Dask multiprocessing in SQLAlchemy - sqlalchemy

The setting in which I am working can be described as follows:
Database and what I want to extract from it
The data required to run the analysis is stored in a single de-normalized (more than 100 columns) Oracle table. Financial reporting data is published to the table every day, and the table is range-partitioned on the reporting date (one partition per day). Here's the structure of the query I intend to run:
SELECT col1,
       col2,
       col3
FROM table
WHERE date BETWEEN start_date AND end_date
Strategy to load data with Dask
I am using sqlalchemy with the cx_Oracle driver to access the database. The strategy I am following to load data in parallel with Dask is:
from dask import bag as db

def read_rows(from_date, to_date, engine):
    engine.dispose()
    query = """
    -- Query Text --
    """.format(from_date, to_date)
    with engine.connect() as conn:
        ret = conn.execute(query).fetchall()
    return ret

engine = create_engine(...)  # initialise sqlalchemy engine
add_engine_pidguard(engine)  # adding pidguard to engine
date_ranges = [...]  # list of (start_date, end_date)-tuples

data_db = (db.from_sequence(date_ranges)
           .map(lambda x: read_rows(from_date=x[0], to_date=x[1], engine=engine))
           .concat())
# ---- further process data ----
...
add_engine_pidguard is taken from the sqlalchemy documentation: How do I use engines / connections / sessions with Python multiprocessing, or os.fork()?
Questions
Is the current way of running blocked queries fine - or is there a cleaner way of achieving this in sqlalchemy?
Since the queries operate in a multiprocessing environment, is the approach of managing the engines fine the way it is implemented?
Currently I am executing a "raw query", would it be beneficial from a performance point of view to define the table in a declarative_base (with respective column types) and use session.query on the required columns from within read_rows?

I would be apt to try code along the lines of
import pandas as pd
import dask.dataframe as dd
from dask.delayed import delayed
...
con = engine.connect()
df = dd.from_delayed([
    delayed(pd.read_sql_query)(QUERY, con, params=params)
    for params in date_ranges
])
In this case, I make just one connection; cx_Oracle connections are, as I understand it, able to be used by multiple threads. The data is loaded through dask.dataframe, without doing anything yet to switch away from the default threaded scheduler. Database IO and many pandas operations release the GIL, so the threaded scheduler is a good candidate here.
This will let us jump right to having a dataframe, which is nice for many operations on structured data.
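Fleshed out, that pattern might look like the sketch below. The connection URL, query text, column names and date ranges are placeholders rather than the real ones from the question, and newer dask versions accept scheduler="threads" while older ones spell the argument differently:
import pandas as pd
import dask.dataframe as dd
from dask.delayed import delayed
from sqlalchemy import create_engine, text

# Placeholder connection URL and query; adapt to the real schema.
engine = create_engine("oracle+cx_oracle://user:password@host/?service_name=svc")
con = engine.connect()

QUERY = text("SELECT col1, col2, col3 FROM my_table "
             "WHERE report_date BETWEEN :start_date AND :end_date")

date_ranges = [
    {"start_date": "2017-01-01", "end_date": "2017-01-31"},
    {"start_date": "2017-02-01", "end_date": "2017-02-28"},
]

# One delayed pandas read per date range, stitched into a dask dataframe.
df = dd.from_delayed([
    delayed(pd.read_sql_query)(QUERY, con, params=params)
    for params in date_ranges
])

# dask.dataframe uses the threaded scheduler by default; making it explicit:
result = df.describe().compute(scheduler="threads")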
Currently I am executing a "raw query", would it be beneficial from a performance point of view to define the table in a declarative_base (with respective column types) and use session.query on the required columns from within read_rows?
This is not especially likely to improve performance, as I understand things.

Related

How many tasks are created when spark read or write from mysql?

As far as I know, Spark executors handle many tasks at the same time to guarantee that data is processed in parallel. Here comes the question: when connecting to external data storage, say MySQL, how many tasks are there to finish this job? In other words, are multiple tasks created at the same time, each reading all the data, or is the data read by only one task and distributed to the cluster in some other way? And how about writing data to MySQL, how many connections are there?
Here is some piece of code to read or write data from/to mysql:
def jdbc(sqlContext: SQLContext, url: String, driver: String, dbtable: String, user: String, password: String, numPartitions: Int): DataFrame = {
  sqlContext.read.format("jdbc").options(Map(
    "url" -> url,
    "driver" -> driver,
    "dbtable" -> s"(SELECT * FROM $dbtable) $dbtable",
    "user" -> user,
    "password" -> password,
    "numPartitions" -> numPartitions.toString
  )).load
}

def mysqlToDF(sparkSession: SparkSession, jdbc: JdbcInfo, table: String): DataFrame = {
  var dF1 = sparkSession.sqlContext.read.format("jdbc")
    .option("url", jdbc.jdbcUrl)
    .option("user", jdbc.user)
    .option("password", jdbc.passwd)
    .option("driver", jdbc.jdbcDriver)
    .option("dbtable", table)
    .load()
  // dF1.show(3)
  dF1.createOrReplaceTempView(s"${table}")
  dF1
}
Here is a good article which answers your question:
https://freecontent.manning.com/what-happens-behind-the-scenes-with-spark/
In simple words: the workers split the reading task into several parts, and each worker reads only a part of your input data. The number of tasks depends on your resources and your data volume. Writing follows the same principle: Spark writes the data to a distributed storage system, such as HDFS, where the data is stored in a distributed way: each worker writes its data to some storage node in HDFS.
By default, data from a JDBC source is loaded by one thread, so you will have one task processed by one executor; that is what you can expect in your second function, mysqlToDF.
In the first function, jdbc, you are closer to a parallel read, but some parameters are still needed: numPartitions alone is not enough. Spark also needs a numeric/date column and lower/upper bounds to be able to read in parallel (it will execute x queries for partial results).
Spark JDBC documentation
In this documentation you will find:
partitionColumn, lowerBound, upperBound (none): These options must all be specified if any of them is specified. In addition, numPartitions must be specified. They describe how to partition the table when reading in parallel from multiple workers. partitionColumn must be a numeric, date, or timestamp column from the table in question. Notice that lowerBound and upperBound are just used to decide the partition stride, not for filtering the rows in the table. So all rows in the table will be partitioned and returned. This option applies only to reading.
numPartitions (none): The maximum number of partitions that can be used for parallelism in table reading and writing. This also determines the maximum number of concurrent JDBC connections. If the number of partitions to write exceeds this limit, we decrease it to this limit by calling coalesce(numPartitions) before writing. Applies to read/write.
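Putting those options together, a parallel read might look like the following PySpark sketch; the connection details, table name, partition column and bounds are placeholders:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Spark issues numPartitions queries, each covering one stride of
# partitionColumn between lowerBound and upperBound.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:mysql://host:3306/mydb")
      .option("driver", "com.mysql.jdbc.Driver")
      .option("dbtable", "my_table")
      .option("user", "user")
      .option("password", "password")
      .option("partitionColumn", "id")   # numeric, date or timestamp column
      .option("lowerBound", "1")
      .option("upperBound", "1000000")
      .option("numPartitions", "8")      # at most 8 concurrent JDBC connections
      .load())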
Regarding write:
How about writing data to mysql,how many connections are there?
As stated in the documentation, it also depends on numPartitions: if the number of partitions when writing is higher than numPartitions, Spark will notice and call coalesce. Remember that coalesce may generate skew, so sometimes it may be better to repartition explicitly with repartition(numPartitions) to distribute the data evenly before the write.
If you don't set numPartitions, the number of parallel connections on write may be the same as the number of active tasks at a given moment, so be aware that with too high parallelism and no upper bound you may choke the source server.
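A sketch of that write side, continuing from the df of the read sketch above (connection details again placeholders); the explicit repartition keeps the number of connections predictable:
# Repartition before writing so at most 8 JDBC connections are opened
# against the target server.
(df.repartition(8)
   .write.format("jdbc")
   .option("url", "jdbc:mysql://host:3306/mydb")
   .option("driver", "com.mysql.jdbc.Driver")
   .option("dbtable", "my_table_copy")
   .option("user", "user")
   .option("password", "password")
   .option("numPartitions", "8")
   .mode("append")
   .save())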

SQLAlchemy automap: Best practices for performance

I'm building a python app around an existing (mysql) database and am using automap to infer tables and relationships:
base = automap_base()
self.engine = create_engine(
    'mysql://%s:%s@%s/%s?charset=utf8mb4' % (
        config.DB_USER, config.DB_PASSWD, config.DB_HOST, config.DB_NAME
    ), echo=False
)
# reflect the tables
base.prepare(self.engine, reflect=True)
self.TableName = base.classes.table_name
Using this I can do things like session.query(TableName) etc...
However, I'm worried about performance, because every time the app runs it will do the whole inference again.
Is this a legitimate concern?
If so, is there a possibility to 'cache' the output of Automap?
I think that "reflecting" the structure of your database is not the way to go. Unless your app tries to "infer" things from the structure, like static code analysis would for source files, then it is unnecessary. The other reason for reflecting it at run-time would be the reduced time to begin "using" the database using SQLAlchemy. However:
Another option would be to use something like SQLACodegen (https://pypi.python.org/pypi/sqlacodegen):
It will "reflect" your database once and create a 99.5% accurate set of declarative SQLAlchemy models for you to work with. However, this does require that you keep the model subsequently in-sync with the structure of the database. I would assume that this is not a big concern seeing as the tables you're already-working with are stable-enough such that run-time reflection of their structure does not impact your program much.
Generating the declarative models is essentially a "cache" of the reflection. It's just that SQLACodegen saved it into a very readable set of classes + fields instead of data in-memory. Even with a changing structure, and my own "changes" to the generated declarative models, I still use SQLACodegen later-on in a project whenever I make database changes. It means that my models are generally consistent amongst one-another and that I don't have things such as typos and data-mismatches due to copy-pasting.
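To make the "readable set of classes + fields" point concrete, the generated output looks roughly like a hand-written declarative model; this is an illustrative sketch with made-up table and column names, not actual sqlacodegen output for any real schema:
from sqlalchemy import Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

# Roughly what a generated model looks like for a simple table.
class Customer(Base):
    __tablename__ = 'customer'

    id = Column(Integer, primary_key=True)
    name = Column(String(255))
    email = Column(String(255))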
Performance can be a legitimate concern. If the database schema is not changing, it can be time consuming to reflect the database every time a script is run. This is more of an issue during development than when starting up a long-running application. It's also a significant time saver if your database is on a remote server (again, particularly during development).
I use code that is similar to the answer here (as noted by @ACV). The general plan is to perform the reflection the first time, then pickle the metadata object. The next time the script is run, it will look for the pickle file and use that. The file can be anywhere, but I place mine in ~/.sqlalchemy_cache. This is an example based on your code.
import os
import pickle

from sqlalchemy import create_engine
from sqlalchemy.ext.automap import automap_base
from sqlalchemy.ext.declarative import declarative_base

self.engine = create_engine(
    'mysql://%s:%s@%s/%s?charset=utf8mb4' % (
        config.DB_USER, config.DB_PASSWD, config.DB_HOST, config.DB_NAME
    ), echo=False
)

metadata_pickle_filename = "mydb_metadata"
cache_path = os.path.join(os.path.expanduser("~"), ".sqlalchemy_cache")
cached_metadata = None
if os.path.exists(cache_path):
    try:
        with open(os.path.join(cache_path, metadata_pickle_filename), 'rb') as cache_file:
            cached_metadata = pickle.load(file=cache_file)
    except IOError:
        # cache file not found - no problem, reflect as usual
        pass

if cached_metadata:
    base = declarative_base(bind=self.engine, metadata=cached_metadata)
else:
    base = automap_base()
    base.prepare(self.engine, reflect=True)  # reflect the tables

    # save the metadata for future runs
    try:
        if not os.path.exists(cache_path):
            os.makedirs(cache_path)
        # make sure to open in binary mode - we're writing bytes, not str
        with open(os.path.join(cache_path, metadata_pickle_filename), 'wb') as cache_file:
            pickle.dump(base.metadata, cache_file)
    except:
        # couldn't write the file for some reason
        pass

self.TableName = base.classes.table_name
For anyone using declarative table class definitions, assuming a Base object defined as e.g.
Base = declarative_base(bind=engine)
metadata_pickle_filename = "ModelClasses_trilliandb_trillian.pickle"
# ------------------------------------------
# Load the cached metadata if it's available
# ------------------------------------------
# NOTE: delete the cached file if the database schema changes!!
cache_path = os.path.join(os.path.expanduser("~"), ".sqlalchemy_cache")
cached_metadata = None
if os.path.exists(cache_path):
try:
with open(os.path.join(cache_path, metadata_pickle_filename), 'rb') as cache_file:
cached_metadata = pickle.load(file=cache_file)
except IOError:
# cache file not found - no problem
pass
# ------------------------------------------
# define all tables
#
class MyTable(Base):
if cached_metadata:
__table__ = cached_metadata.tables['my_schema.my_table']
else:
__tablename__ = 'my_table'
__table_args__ = {'autoload':True, 'schema':'my_schema'}
...
# ----------------------------------------
# If no cached metadata was found, save it
# ----------------------------------------
if cached_metadata is None:
# cache the metadata for future loading
# - MUST DELETE IF THE DATABASE SCHEMA HAS CHANGED
try:
if not os.path.exists(cache_path):
os.makedirs(cache_path)
# make sure to open in binary mode - we're writing bytes, not str
with open(os.path.join(cache_path, metadata_pickle_filename), 'wb') as cache_file:
pickle.dump(Base.metadata, cache_file)
except:
# couldn't write the file for some reason
pass
Important Note!! If the database schema changes, you must delete the cached file to force the code to autoload and create a new cache. If you don't, the changes will not be reflected in the code. It's an easy thing to forget.
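One way to make that harder to forget is a tiny helper that removes the cached file, to be run whenever the schema is migrated; this is just a convenience sketch, with paths following the example above:
import os

# Remove the pickled metadata so the next run reflects the changed schema.
cache_file = os.path.join(cache_path, metadata_pickle_filename)
if os.path.exists(cache_file):
    os.remove(cache_file)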
The answer to your first question is largely subjective. You are adding database queries to fetch the reflection metadata to the application load time. Whether or not that overhead is significant depends on your project requirements.
For reference, I have an internal tool at work that uses a reflection pattern because the load-time is acceptable for our team. That might not be the case if it were an externally-facing product. My hunch is that for most applications the reflection overhead will not dominate the total application load time.
If you decide it is significant for your purposes, this question has an interesting answer where the user pickles the database metadata in order to locally cache it.
Adding to this, what @Demitri answered is close to correct, but (at least in SQLAlchemy 1.4.29) the example will fail on the last line, self.TableName = base.classes.table_name, when loading from the cached file. In this case declarative_base() has no classes attribute.
The fix is as simple as altering:
if cached_metadata:
    base = declarative_base(bind=self.engine, metadata=cached_metadata)
else:
    base = automap_base()
    base.prepare(self.engine, reflect=True)  # reflect the tables
to
if cached_metadata:
    base = automap_base(declarative_base(bind=self.engine, metadata=cached_metadata))
    base.prepare()
else:
    base = automap_base()
    base.prepare(self.engine, reflect=True)  # reflect the tables
This will create your automap object with the proper attributes.

Loading a pandas Dataframe into a sql database with Django

I describe the outcome of a strategy by numerous rows. Each row contains a symbol (describing an asset), a timestamp (think of a backtest) and a price + weight.
Before a strategy runs I delete all previous results from this particular strategy (I have many strategies). I then loop over all symbols and all times.
# delete all previous data written by this strategy
StrategyRow.objects.filter(strategy=strategy).delete()

for symbol in symbols.keys():
    s = symbols[symbol]
    for t in portfolio.prices.index:
        p = prices[symbol][t]
        w = weights[symbol][t]
        row = StrategyRow.objects.create(strategy=strategy, symbol=s, time=t)
        if not math.isnan(p):
            row.price = p
        if not math.isnan(w):
            row.weight = w
        row.save()
This works but is very, very slow. Is there a chance to achieve the same with write_frame from pandas? Or maybe using faster raw SQL?
I don't think the first thing you should try is the raw SQL route (more on that in a bit)
But I think the slowness comes from calling row.save() on many objects; that operation is known to be slow.
I'd look into StrategyRow.objects.bulk_create() first, https://docs.djangoproject.com/en/1.7/ref/models/querysets/#django.db.models.query.QuerySet.bulk_create
The difference is that you pass it a list of StrategyRow instances instead of calling .save() on each one. It's pretty straightforward: bundle up a number of rows and create them in batches, trying e.g. 10, 20 or 100 at a time; your database configuration can also help you find the optimum batch size (e.g. http://dev.mysql.com/doc/refman/5.5/en/server-system-variables.html#sysvar_max_allowed_packet).
Back to your idea of raw SQL: it would make a difference if, for example, the Python code that creates the StrategyRow instances is slow (e.g. StrategyRow.objects.create()), but I still believe the key is to insert them in batches instead of running N queries.
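A sketch of what that could look like for the loop in the question; it assumes the price and weight fields are nullable, and note that bulk_create also accepts a batch_size argument:
import math

# Build all rows in memory, then insert them in batches instead of one
# .save() per row. NaN handling mirrors the original loop.
StrategyRow.objects.filter(strategy=strategy).delete()

rows = []
for symbol, s in symbols.items():
    for t in portfolio.prices.index:
        p = prices[symbol][t]
        w = weights[symbol][t]
        rows.append(StrategyRow(
            strategy=strategy,
            symbol=s,
            time=t,
            price=None if math.isnan(p) else p,
            weight=None if math.isnan(w) else w,
        ))

batch_size = 100  # tune against max_allowed_packet and friends
for i in range(0, len(rows), batch_size):
    StrategyRow.objects.bulk_create(rows[i:i + batch_size])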

How to commit model instances and remove them from working memory a few at a time

I have a pyramid view that is used for loading data from a large file into a database. For each line in the file it does a little processing then creates some model instances and adds them to the session. This works fine except when the files are big. For large files the view slowly eats up all my ram until everything effectively grinds to a halt.
So my idea is to process each line individually with a function that creates a session, creates the necessary model instances and adds them to the current session, then commits.
def commit_line(lTitles, lLine, oStartDate, oEndDate, iDS, dSettings):
    from sqlalchemy.orm import (
        scoped_session,
        sessionmaker,
    )
    from sqlalchemy import engine_from_config
    from pyramidapp.models import Base, DataEntry
    from zope.sqlalchemy import ZopeTransactionExtension
    import transaction

    oCurrentDBSession = scoped_session(sessionmaker(extension=ZopeTransactionExtension()))
    engine = engine_from_config(dSettings, 'sqlalchemy.')
    oCurrentDBSession.configure(bind=engine)
    Base.metadata.bind = engine

    oEntry = DataEntry()
    oCurrentDBSession.add(oEntry)
    ...
    transaction.commit()
My requirements for this function are as follows:
create a session (check)
make a bunch of model instances (check)
add those instances to the session (check)
commit those models to the database
get rid of the session (so that it and the objects created in 2 are garbage collected)
I've made sure that the newly created session is passed as an argument whenever necessary in order to stop errors to do with multiple sessions blah blah. But alas! I can't get database connections to go away and stuff isn't being committed.
I tried separating the function out into a celery task so the view executes to completion and does what it needs to, but I'm getting an error in celery about having too many MySQL connections, no matter what I try in terms of committing, closing and disposing, and I'm not sure why. And yes, I restart the celery server when I make changes.
Surely there is a simple way to do this? All I want to do is make a session commit then go away and leave me alone.
Creating a new session for each line of your large file is going to be quite slow I would imagine.
What I would try is to commit the session and expunge all objects from it every 1000 rows or so:
counter = 0
for line in mymegafile:
    entry = process_line(line)
    session.add(entry)
    if counter > 1000:
        counter = 0
        transaction.commit()  # if you insist on using ZopeTransactionExtension, otherwise session.commit()
        session.expunge_all()  # this may not be required actually, see https://groups.google.com/forum/#!topic/sqlalchemy/We4XGX2CYX8
    else:
        counter += 1
If there are no references to DataEntry instances from anywhere, they should be garbage collected by the Python interpreter at some point.
However, if all you're doing in that view is inserting new records into the database, it may be much more efficient to use SQLAlchemy Core constructs or literal SQL to bulk-insert data. This would also get rid of the problem with your ORM instances eating up your RAM. See "I'm inserting 400,000 rows with the ORM and it's really slow!" for details.
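As a rough sketch of that Core approach, assuming DataEntry is a declarative model and the connection URL and column names are placeholders:
from sqlalchemy import create_engine

engine = create_engine("mysql://user:password@host/dbname")  # placeholder URL

def insert_batch(batch):
    # One transaction per batch; Core does an executemany-style INSERT,
    # bypassing ORM instance bookkeeping entirely.
    with engine.begin() as conn:
        conn.execute(DataEntry.__table__.insert(), batch)

batch = []
for line in mymegafile:
    batch.append({"title": line.strip(), "value": 42})  # made-up columns
    if len(batch) >= 1000:
        insert_batch(batch)
        batch = []

if batch:
    insert_batch(batch)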
So I tried a bunch of things and, although it was probably possible to solve this using SQLAlchemy's built-in functionality, I could not find any way of pulling that off.
So here's an outline of what I did:
separate the lines to be processed into batches
for each batch of lines, queue up a celery task to deal with those lines
in the celery task, a separate process is launched that does the necessary stuff with the lines.
Reasoning:
The batch stuff is obvious
Celery was used because it took a heck of a long time to process an entire file, so queuing just made sense
the task launches a separate process because, if it didn't, I had the same problem that I had with the Pyramid application
Some code:
Celery task:
def commit_lines(lLineData, dSettings, cwd):
    """
    writes the line data to a file then calls a process that reads the file and creates
    the necessary data entries. Then deletes the file
    """
    import lockfile
    sFileName = "/home/sheena/tmp/cid_line_buffer"
    lock = lockfile.FileLock("{0}_lock".format(sFileName))
    with lock:
        f = open(sFileName, 'a')  # in case the process was at any point interrupted...
        for d in lLineData:
            f.write('{0}\n'.format(d))
        f.close()
    # now call the external process
    import subprocess
    import os
    sConnectionString = dSettings.get('sqlalchemy.url')
    lArgs = [
        'python', os.path.join(cwd, 'commit_line_file.py'),
        '-c', sConnectionString,
        '-f', sFileName
    ]
    # open the subprocess. wait for it to complete before continuing with stuff. if errors: raise
    subprocess.check_call(lArgs, shell=False)
    # and clear the file
    lock = lockfile.FileLock("{0}_lock".format(sFileName))
    with lock:
        f = open(sFileName, 'w')
        f.close()
External process:
"""
this script goes through all lines in a file and creates data entries from the lines
"""
def main():
from optparse import OptionParser
from sqlalchemy import create_engine
from pyramidapp.models import Base,DBSession
import ast
import transaction
#get options
oParser = OptionParser()
oParser.add_option('-c','--connection_string',dest='connection_string')
oParser.add_option('-f','--input_file',dest='input_file')
(oOptions, lArgs) = oParser.parse_args()
#set up connection
#engine = engine_from_config(dSettings, 'sqlalchemy.')
engine = create_engine(
oOptions.connection_string,
echo=False)
DBSession.configure(bind=engine)
Base.metadata.bind = engine
#commit stuffs
import lockfile
lock = lockfile.FileLock("{0}_lock".format(oOptions.input_file))
with lock:
for sLine in open(oOptions.input_file,'r'):
dLine = ast.literal_eval(sLine)
create_entry(**dLine)
transaction.commit()
def create_entry(iDS,oStartDate,oEndDate,lTitles,lValues):
#import stuff
oEntry = DataEntry()
#do some other stuff, make more model instances...
DBSession.add(oEntry)
if __name__ == "__main__":
main()
in the view:
for line in big_giant_csv_file_handler:
    lLineData.append({'stuff': 'lots'})

if lLineData:
    lLineSets = [lLineData[i:i + iBatchSize] for i in range(0, len(lLineData), iBatchSize)]
    for l in lLineSets:
        commit_lines.delay(l, dSettings, sCWD)  # queue it for celery
You are just doing it wrong. Period.
Quoted from SQLAlchemy docs
The advanced developer will try to keep the details of session, transaction and exception management as far as possible from the details of the program doing its work.
Quoted from Pyramid docs
We made the decision to use SQLAlchemy to talk to our database. We also, though, installed pyramid_tm and zope.sqlalchemy.
Why?
Pyramid has a strong orientation towards support for transactions. Specifically, you can install a transaction manager into your application, either as middleware or a Pyramid "tween". Then, just before you return the response, all transaction-aware parts of your application are executed. This means Pyramid view code usually doesn't manage transactions.
My answer today is not code, but a recommendation to follow best practices recommended by the authors of the packages/frameworks you are working with.
References
Big picture - Using Thread-Local Scope with Web Applications
Typical error message when doing it wrong
Databases using SQLAlchemy
How to use scoped_session
Encapsulate CSV reading and creating SQLAlchemy model instances into something that supports the iterator protocol. I called it BatchingModelReader. It returns a collection of DataEntry instances; the collection size depends on the batch size. If the model changes over time, you do not need to change the celery task. The task only puts a batch of models into a session and commits the transaction. By controlling the batch size you control memory consumption. Neither BatchingModelReader nor the celery task saves huge amounts of intermediate data. This example also shows that using celery is only an option. I added links to code samples of a Pyramid application I am actually refactoring in a GitHub fork.
BatchingModelReader - encapsulates csv.reader and uses existing models from your pyramid application
get inspired by source code of csv.DictReader
could be run as a celery task - use appropriate task decorator
from .models import DBSession
import transaction

def import_from_csv(path_to_csv, batchsize):
    """given a CSV file and batchsize iterate over batches of model instances and import them to database"""
    for batch in BatchingModelReader(path_to_csv, batchsize):
        with transaction.manager:
            DBSession.add_all(batch)
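A minimal sketch of what the BatchingModelReader used above could look like; the CSV-column-to-model-field mapping is an assumption:
import csv

from .models import DataEntry

class BatchingModelReader(object):
    """Wraps csv.DictReader and yields lists of at most `batchsize` DataEntry instances."""

    def __init__(self, path_to_csv, batchsize):
        self.path_to_csv = path_to_csv
        self.batchsize = batchsize

    def __iter__(self):
        batch = []
        with open(self.path_to_csv) as f:
            for row in csv.DictReader(f):
                batch.append(DataEntry(**row))  # assumes CSV columns match model fields
                if len(batch) >= self.batchsize:
                    yield batch
                    batch = []
        if batch:
            yield batch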
pyramid view - just save big giant CSV file, start task, return response immediately
@view_config(...)
def view(request):
    """gets file from request, save it to filesystem and start celery task"""
    with open(path_to_csv, 'w') as f:
        f.write(big_giant_csv_file)
    # start task with parameters
    import_from_csv.delay(path_to_csv, 1000)
Code samples
ToDoPyramid - commit transaction from commandline
ToDoPyramid - commit transaction from request
Pyramid using SQLAlchemy
Databases using SQLAlchemy
SQLAlchemy internals
Big picture - Using Thread-Local Scope with Web Applications
How to use scoped_session

Uncommitted transactions in Plone + SqlAlchemy + MySql

We have a hybrid web application integrating a MySql db with Plone (last upgrade was to Plone 4.0), using collective.tin, collective.lead and SqlAlchemy.
OK, I know that collective.tin was never released and collective.lead has been superseded; however, everything has worked (almost) perfectly for a few years.
Recently we experienced a very strange behaviour and are looking for help in order to understand it.
Among others, we have 2 Plone content types, say A and B, defined by subclassing collective.tin, and the corresponding innodb MySql tables; rows of B have a foreign key towards A.
In the time span of 15-20 minutes, 2 different users created 3 A objects and some 10-20 B objects that weren't committed to MySql but were indexed by Plone; queries I executed with a MySql client from the linux shell weren't able to find those A rows (didn't look for B rows); however, queries executed through the web application (the aforementioned components stack) by those 2 users, and also by other users, occasionally were still finding and correctly visualizing some of those 3 A objects.
Only after I restarted the Zope instance, it was possible to resume normal activity from the Plone web interface; 3 A rows and many B rows were still missing from the MySql db, but the autoincrement counter showed the expected increment; I had to remove 3 invalid brains for A objects from the Plone index (didn't worry for B objects).
Any suggestion on possible causes and on how to investigate the problem?
We had the exact same problem with SQLAlchemy 0.4; the session would get out of sync with the actual database contents. The problem was somewhat masked in our case because users were sent to specific backends in the cluster through session affinity. If the affinity was lost, suddenly messages had disappeared. The exact details are a little hazy, because I cannot locate the correct (ancient) revision history of the fix I put in place.
From what I can glean from context, the session identity map prevents the session from re-querying the database for objects it retrieved before. It thus won't see changes made to these objects in different sessions.
The fix is to call .expire_all() on the session after each and every commit or rollback; SQLAlchemy 0.5 and up does this automatically (autoexpire=True on the session, now called expire_on_commit I believe), but for 0.4 you'll need to register a SessionExtension to do this for you.
Lucky for you, we also use collective.lead for this project, so my fix is your fix:
# The identity map should be flushed on commit.
# SQLAlchemy 0.5 does this properly, but in 0.4 we need to do this via
# a SessionExtension.
from sqlalchemy import __version__

if __version__[:3] == '0.4':
    from sqlalchemy.orm.session import SessionExtension

    class ExpireAllSessionExtension(SessionExtension):
        def after_commit(self, session):
            """Expire the identity-map on commit"""
            session.expire_all()

        def after_rollback(self, session):
            """Expire the identity-map on rollback"""
            session.expire_all()

    def installExtension():
        # Patch collective.lead.database to let us install the extension
        # on the session created there.
        from collective.lead.database import Database
        old_session = Database.session.fget

        def session(self):
            session = old_session(self)
            if session.extension is None:
                session.extension = ExpireAllSessionExtension()
            return session

        Database.session = property(session)
else:
    def installExtension():
        pass
else:
def installExtension():
pass
When defining the mapper, you install this extension with:
from .sessionexpiration import installExtension
# Ensure that sessions get properly expired on commit and rollback.
installExtension()