We have a hybrid web application integrating a MySql db with Plone (last upgrade was to Plone 4.0), using collective.tin, collective.lead and SqlAlchemy.
Ok, I know that collective.tin never was released and collective.lead has been superseded; however all things work (almost) perfectly since a few years.
Recently we experienced a very strange behaviour and are looking for help in order to understand it.
Among others, we have 2 Plone content types, say A and B, defined by subclassing collective.tin, and the corresponding innodb MySql tables; rows of B have a foreign key towards A.
In the time span of 15-20 minutes, 2 different users created 3 A objects and some 10-20 B objects that weren't committed to MySql but were indexed by Plone; queries I executed with a MySql client from the linux shell weren't able to find those A rows (didn't look for B rows); however, queries executed through the web application (the aforementioned components stack) by those 2 users, and also by other users, occasionally were still finding and correctly visualizing some of those 3 A objects.
Only after I restarted the Zope instance, it was possible to resume normal activity from the Plone web interface; 3 A rows and many B rows were still missing from the MySql db, but the autoincrement counter showed the expected increment; I had to remove 3 invalid brains for A objects from the Plone index (didn't worry for B objects).
Any suggestion on possible causes and on how to investigate the problem?
We had the exact same problem with sqlalchemy 0.4; the session would get out of sync with the actual database contents. The problem was somewhat masked in our case because users were sent to specific backends in the cluster through session affinity. If the affinity was lost suddenly messages had disappeared. The exact details are a little hazy, because I cannot locate the correct (ancient) revision history of the fix I put in place.
From what I can glean from context is that the session identity map prevents the session from requiring the database for objects it retrieved before. It thus won't see changes made to these objects in different sessions.
The fix is to call .expire_all() on the session after each and every commit or rollback; SQLAlchemy 0.5 and up does this automatically (autoexpire=True on the session, now called expire_on_commit I believe), but for 0.4 you'll need to register a SessionExtension to do this for you.
Lucky for you, we also use collective.lead for this project, so my fix is your fix:
# The identity map should be flushed on commit.
# SQLAlchemy 0.5 does this properly, but in 0.4 we need to do this via
# a SesssionExtension.
from sqlalchemy import __version__
if __version__[:3] == '0.4':
from sqlalchemy.orm.session import SessionExtension
class ExpireAllSessionExtension(SessionExtension):
def after_commit(self, session):
"""Expire the identity-map on commit"""
session.expire_all()
def after_rollback(self, session):
"""Expire the identity-map on rollback"""
session.expire_all()
def installExtension():
# Patch collective.lead.database to let us install the extension
# on the session created there.
from collective.lead.database import Database
old_session = Database.session.fget
def session(self):
session = old_session(self)
if session.extension is None:
session.extension = ExpireAllSessionExtension()
return session
Database.session = property(session)
else:
def installExtension():
pass
When defining the mapper, you install this extension with:
from .sessionexpiration import installExtension
# Ensure that sessions get properly expired on commit and rollback.
installExtension()
Related
I've recently been introduced to R and trying the heatwaveR package. I get an error when loading erddap data ... Here's the code I have used so far:
library(rerddap)
library(ncdf4)
info(datasetid = "ncdc_oisst_v2_avhrr_by_time_zlev_lat_lon", url = "https://www.ncei.noaa.gov/erddap/")
And I get the following error:
Error in curl::curl_fetch_memory(x$url$url, handle = x$url$handle) :
schannel: next InitializeSecurityContext failed: SEC_E_INVALID_TOKEN (0x80090308) - The token supplied to the function is invalid
Would like some help in this. I'm new to this website too so I apologize if the above question is not as per standards (codes to be typed in a grey box, etc.)
Someone directed this post to my attention from the heatwaveR issues page on GitHub. Here is the answer I provided for them:
I do not manage the rerddap package so can't say exactly why it may be giving you this error. But I can say that I have noticed lately that the OISST data are often not available on the ERDDAP server in question. I (attempt to) download fresh data every day and am often denied with an error similar to the one you posted. It's gotten to the point where I had to insert some logic gates into my download script so it tells me that the data aren't currently being hosted before it tries to download them. I should also point out that one may download the "final" data from this server, which have roughly a two week delay from present day, as well as the "preliminary (prelim)" data, which are near-real-time but haven't gone through all of the QC steps yet. These two products are accounted for in the following code:
# First download the list of data products on the server
server_data <- rerddap::ed_datasets(which = "griddap", "https://www.ncei.noaa.gov/erddap/")$Dataset.ID
# Check if the "final" data are currently hosted
if(!"ncdc_oisst_v2_avhrr_by_time_zlev_lat_lon" %in% server_data)
stop("Final data are not currently up on the ERDDAP server")
# Check if the "prelim" data are currently hosted
if(!"ncdc_oisst_v2_avhrr_prelim_by_time_zlev_lat_lon" %in% server_data)
stop("Prelim data are not currently up on the ERDDAP server")
If the data are available I then check the times/dates available with these two lines:
# Download final OISST meta-data
final_info <- rerddap::info(datasetid = "ncdc_oisst_v2_avhrr_by_time_zlev_lat_lon", url = "https://www.ncei.noaa.gov/erddap/")
# Download prelim OISST meta-data
prelim_info <- rerddap::info(datasetid = "ncdc_oisst_v2_avhrr_prelim_by_time_zlev_lat_lon", url = "https://www.ncei.noaa.gov/erddap/")
I ran this now and it looks like the data are currently available. Is your error from today, or from a day or two ago? The availability seems to cycle over the week but I haven't quite made sense of any pattern yet. It is also important to note that about a day before the data go dark they are filled with all sorts of massive errors. So I've also had to add error trapping into my code that stops the data aggregation process once it detects temperatures in excess of some massive number. In this case it is something like1^90, but the number isn't consistent meaning it is not a missing value placeholder.
To manually see for yourself if the data are being hosted you can go to this link and scroll to the bottom:
https://www.ncei.noaa.gov/erddap/griddap/index.html
All the best,
-Robert
I was stuck in a situation that I have initialised a namesapce with
default-ttl to 30 days. There was about 5 million data with that (30-day calculated) ttl-value. Actually, my requirement is that ttl should be zero(0), but It(ttl-30d) was kept with unaware or un-recognise.
So, Now I want to update prev(old) 5 million data with new ttl-value (Zero).
I've checked/tried "set-disable-eviction true", but it is not working, it is removing data according to (old)ttl-value.
How do I overcome out this? (and I want to retrieve the removed data, How can I?).
Someone help me.
First, eviction and expiration are two different mechanisms. You can disable evictions in various ways, such as the set-disable-eviction config parameter you've used. You cannot disable the cleanup of expired records. There's a good knowledge base FAQ What are Expiration, Eviction and Stop-Writes?. Unfortunately, the expired records that have been cleaned up are gone if their void time is in the past. If those records were merely evicted (i.e. removed before their void time due to crossing the namespace high-water mark for memory or disk) you can cold restart your node, and those records with a future TTL will come back. They won't return if either they were durably deleted or if their TTL is in the past (such records gets skipped).
As for resetting TTLs, the easiest way would be to do this through a record UDF that is applied to all the records in your namespace using a scan.
The UDF for your situation would be very simple:
ttl.lua
function to_zero_ttl(rec)
local rec_ttl = record.ttl(rec)
if rec_ttl > 0 then
record.set_ttl(rec, -1)
aerospike:update(rec)
end
end
In AQL:
$ aql
Aerospike Query Client
Version 3.12.0
C Client Version 4.1.4
Copyright 2012-2017 Aerospike. All rights reserved.
aql> register module './ttl.lua'
OK, 1 module added.
aql> execute ttl.to_zero_ttl() on test.foo
Using a Python script would be easier if you have more complex logic, with filters etc.
zero_ttl_operation = [operations.touch(-1)]
query = client.query(namespace, set_name)
query.add_ops(zero_ttl_operation)
policy = {}
job = query.execute_background(policy)
print(f'executing job {job}')
while True:
response = client.job_info(job, aerospike.JOB_SCAN, policy={'timeout': 60000})
print(f'job status: {response}')
if response['status'] != aerospike.JOB_STATUS_INPROGRESS:
break
time.sleep(0.5)
Aerospike v6 and Python SDK v7.
I have a pyramid view that is used for loading data from a large file into a database. For each line in the file it does a little processing then creates some model instances and adds them to the session. This works fine except when the files are big. For large files the view slowly eats up all my ram until everything effectively grinds to a halt.
So my idea is to process each line individually with a function that creates a session, creates the necessary model instances and adds them to the current session, then commits.
def commit_line(lTitles,lLine,oStartDate,oEndDate,iDS,dSettings):
from sqlalchemy.orm import (
scoped_session,
sessionmaker,
)
from sqlalchemy import engine_from_config
from pyramidapp.models import Base, DataEntry
from zope.sqlalchemy import ZopeTransactionExtension
import transaction
oCurrentDBSession = scoped_session(sessionmaker(extension=ZopeTransactionExtension()))
engine = engine_from_config(dSettings, 'sqlalchemy.')
oCurrentDBSession.configure(bind=engine)
Base.metadata.bind = engine
oEntry = DataEntry()
oCurrentDBSession.add(oEntry)
...
transaction.commit()
My requirements for this function are as follows:
create a session (check)
make a bunch of model instances (check)
add those instances to the session (check)
commit those models to the database
get rid of the session (so that it and the objects created in 2 are garbage collected)
I've made sure that the newly created session is passed as an argument whenever necessary in order to stop errors to do with multiple sessions blah blah. But alas! I can't get database connections to go away and stuff isn't being committed.
I tried separating the function out into a celery task so the view executes to completion and does what it needs to but I'm getting an error in celery about having too many mysql connections no matter what I try in terms of committing and closing and disposing and I'm not sure why. And yes, I restart the celery server when I make changes.
Surely there is a simple way to do this? All I want to do is make a session commit then go away and leave me alone.
Creating a new session for each line of your large file is going to be quite slow I would imagine.
What I would try is to commit the session and expunge all objects from it every 1000 rows or so:
counter = 0
for line in mymegafile:
entry = process_line(line)
session.add(entry)
if counter > 1000:
counter = 0
transaction.commit() # if you insist on using ZopeTransactionExtension, otherwise session.commit()
session.expunge_all() # this may not be required actually, see https://groups.google.com/forum/#!topic/sqlalchemy/We4XGX2CYX8
else:
counter += 1
If there are no references to DataEntry instances from anywhere they should be garbage collected by Python interpreter at some point.
However, if all you're doing in that view is inserting new records to the database, it may be much more efficient to use SQLAlchemy Core constructs or literal SQL to bulk-insert data. This would also get rid of the problem with your ORM instances eating up your RAM. See I’m inserting 400,000 rows with the ORM and it’s really slow! for details.
So I tried a bunch of things and, although using SQLAlchemy's built in functionality to solve this was probably possible I could not find any way of pulling that off.
So here's an outline of what I did:
seperate the lines to be processed into batches
for each batch of lines queue up a celery task to deal with those lines
in the celery task a seperate process is launched that does the necessary stuff with the lines.
Reasoning:
The batch stuff is obvious
Celery was used because it took a heck of a long time to process an entire file so queuing just made sense
the task launched a separate process because if it didn't then I had the same problem that I had with the pyramid application
Some code:
Celery task:
def commit_lines(lLineData,dSettings,cwd):
"""
writes the line data to a file then calls a process that reads the file and creates
the necessary data entries. Then deletes the file
"""
import lockfile
sFileName = "/home/sheena/tmp/cid_line_buffer"
lock = lockfile.FileLock("{0}_lock".format(sFileName))
with lock:
f = open(sFileName,'a') #in case the process was at any point interrupted...
for d in lLineData:
f.write('{0}\n'.format(d))
f.close()
#now call the external process
import subprocess
import os
sConnectionString = dSettings.get('sqlalchemy.url')
lArgs = [
'python',os.path.join(cwd,'commit_line_file.py'),
'-c',sConnectionString,
'-f',sFileName
]
#open the subprocess. wait for it to complete before continuing with stuff. if errors: raise
subprocess.check_call(lArgs,shell=False)
#and clear the file
lock = lockfile.FileLock("{0}_lock".format(sFileName))
with lock:
f = open(sFileName,'w')
f.close()
External process:
"""
this script goes through all lines in a file and creates data entries from the lines
"""
def main():
from optparse import OptionParser
from sqlalchemy import create_engine
from pyramidapp.models import Base,DBSession
import ast
import transaction
#get options
oParser = OptionParser()
oParser.add_option('-c','--connection_string',dest='connection_string')
oParser.add_option('-f','--input_file',dest='input_file')
(oOptions, lArgs) = oParser.parse_args()
#set up connection
#engine = engine_from_config(dSettings, 'sqlalchemy.')
engine = create_engine(
oOptions.connection_string,
echo=False)
DBSession.configure(bind=engine)
Base.metadata.bind = engine
#commit stuffs
import lockfile
lock = lockfile.FileLock("{0}_lock".format(oOptions.input_file))
with lock:
for sLine in open(oOptions.input_file,'r'):
dLine = ast.literal_eval(sLine)
create_entry(**dLine)
transaction.commit()
def create_entry(iDS,oStartDate,oEndDate,lTitles,lValues):
#import stuff
oEntry = DataEntry()
#do some other stuff, make more model instances...
DBSession.add(oEntry)
if __name__ == "__main__":
main()
in the view:
for line in big_giant_csv_file_handler:
lLineData.append({'stuff':'lots'})
if lLineData:
lLineSets = [lLineData[i:i+iBatchSize] for i in range(0,len(lLineData),iBatchSize)]
for l in lLineSets:
commit_lines.delay(l,dSettings,sCWD) #queue it for celery
You are just doing it wrong. Period.
Quoted from SQLAlchemy docs
The advanced developer will try to keep the details of session,
transaction and exception management as far as possible from the
details of the program doing its work.
Quoted from Pyramid docs
We made the decision to use SQLAlchemy to talk to our database. We also, though, installed pyramid_tm and zope.sqlalchemy.
Why?
Pyramid has a strong orientation towards support for transactions.
Specifically, you can install a transaction manager into your app
application, either as middleware or a Pyramid "tween". Then, just
before you return the response, all transaction-aware parts of your
application are executed. This means Pyramid view code usually doesn't
manage transactions.
My answer today is not code, but a recommendation to follow best practices recommended by the authors of the packages/frameworks you are working with.
References
Big picture - Using Thread-Local Scope with Web Applications
Typical error message when doing it wrong
Databases using SQLAlchemy
How to use scoped_session
Encapsulate CSV reading and creating SQLAlchemy model instances into something that supports the iterator protocol. I called it BatchingModelReader. It returns a collection of DataEntry instances, collection size depends on batch size. If the model changes overtime, you do not need to change the celery task. The task only puts a batch of models into a session and commits the transaction. By controlling the batch size you control memory consumption. Neither BatchingModelReader nor the celery task save huge amounts of intermediate data. This example shows as well that using celery is only an option. I added links to code samples of an pyramid application I am actually refactoring in a Github fork.
BatchingModelReader - encapsulates csv.reader and uses existing models from your pyramid application
get inspired by source code of csv.DictReader
could be run as a celery task - use appropriate task decorator
from .models import DBSession
import transaction
def import_from_csv(path_to_csv, batchsize)
"""given a CSV file and batchsize iterate over batches of model instances and import them to database"""
for batch in BatchingModelReader(path_to_csv, batchsize):
with transaction.manager:
DBSession.add_all(batch)
pyramid view - just save big giant CSV file, start task, return response immediately
#view_config(...):
def view(request):
"""gets file from request, save it to filesystem and start celery task"""
with open(path_to_csv, 'w') as f:
f.write(big_giant_csv_file)
#start task with parameters
import_from_csv.delay(path_to_csv, 1000)
Code samples
ToDoPyramid - commit transaction from commandline
ToDoPyramid - commit transaction from request
Pyramid using SQLAlchemy
Databases using SQLAlchemy
SQLAlchemy internals
Big picture - Using Thread-Local Scope with Web Applications
How to use scoped_session
I am trying to read a 200 MB csv file using SQLalchemy. Each line has about 30 columns, of which, I use only 8 columns using the code below. However, the code runs really slow! Is there a way to improve this? I would like to use map/list comprehension or other techniques. As you an tell, I am a newbie. Thanks for your help.
for ddata in dread:
record = DailyData()
record.set_campaign_params(pdata) #Pdata is assigned in the previous step
record.set_daily_data(ddata) #data is sent to a class method where only 8 of 30 items in the list are used
session.add(record)
session.commit() #writing to the SQL database.
don't commit on every record. commit or just flush every 1000 or so:
for i, data in enumerate(csv_stuff):
rec = MyORMObject()
rec.set_stuff(data)
session.add(rec)
if i % 1000 == 0:
session.flush()
session.commit() # flushes everything remaining + commits
if that's still giving you problems then do some basic profiling, see my post at How can I profile a SQLAlchemy powered application?
I am currently trying to move my DB tables over to InnoDB from MyISAM. I am having timing issues with requests and cron jobs that are running on the server that is leading to some errors. I am quite sure that transaction support will help me with the problem. I am therefore transitioning to InnoDB.
I have a suite of tests which make calls to our webservices REST API and receive XML responses. The test suite is fairly thorough, and it's written in Python and uses SQLAlchemy to query information from the database. When I change the tables in the system from MyISAM to InnoDB however, the tests start failing. However, the tests aren't failing because the system isn't working, they are failing because the ORM is not correctly querying the rows from the database I am testing on. when I step through the code I see the correct results, but the ORM is not returning the correct results at all.
Basic flow is:
class UnitTest(unittest.TestCase):
def setUp(self):
# Create a test object in DB that gets affected by the web server
testObject = Obj(foo='one')
self.testId = testObject.id
session.add(testObject)
session.commit()
def tearDown(self):
# Clean up after the test
testObject = session.query(Obj).get(self.testId)
session.delete(testObject)
session.commit()
def test_web_server(self):
# Ensure the initial state of the object.
objects = session.query(Obj).get(self.testId)
assert objects.foo == 'one'
# This will make a simple HTTP get call on an url that will modify the DB
response = server.request.increment_foo(self.testId)
# This one fails, the object still has a foo of 'one'
# When I stop here in a debugger though, and look at the database,
# The row in question actually has the correct value in the database.
# ????
objects = session.query(Obj).get(self.testId)
assert objects.foo == 'two'
Using MyISAM tables to store the object and this test will pass. However, when I change to InnoDB tables, this test will not pass. What is more interesting is that when I step through the code in the debugger, I can see that the datbase has what I expect, so it's not a problem in the web server code. I have tried nearly every combination of expire_all, autoflush, autocommit, etc. etc, and still can't get this test to pass.
I can provide more info if necessary.
Thanks,
Conrad
The problem is that you put the line self.testId = testObject.id before new object is added to session, flushed, and SQLAlchemy assigned ID to it. Thus self.testId is always None. Move this line below session.commit().