SQLAlchemy bulk update strategies - mysql

I am currently writing a web app (Flask) using SQLAlchemy (on GAE, connecting to Google's cloud MySQL) and I need to do bulk updates of a table. In short, a number of calculations are done, resulting in a single value that needs to be updated on thousands of objects. At the moment I'm doing it all in one transaction, but the flush/commit at the end still takes ages.
The table has an index on id and this is all carried out in a single transaction. So I believe I've avoided the usual mistakes, but it is still very slow.
INFO 2017-01-26 00:45:46,412 log.py:109] UPDATE wallet SET balance=%(balance)s WHERE wallet.id = %(wallet_id)s
2017-01-26 00:45:46,418 INFO sqlalchemy.engine.base.Engine ({'wallet_id': u'3c291a05-e2ed-11e6-9b55-19626d8c7624', 'balance': 1.8711760000000002}, {'wallet_id': u'3c352035-e2ed-11e6-a64c-19626d8c7624', 'balance': 1.5875759999999999}, {'wallet_id': u'3c52c047-e2ed-11e6-a903-19626d8c7624', 'balance': 1.441656}
From my understanding there is no way to do this kind of bulk update in a single SQL statement, and the UPDATE above ends up being sent to the server as many individual statements.
I've tried using Session.bulk_update_mappings(), but that doesn't seem to actually do anything :( Not sure why, but the updates never happen. I can't find any examples of this method actually being used (including in the performance suite), so I'm not sure whether it is intended for this.
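For what it's worth, here is a minimal (untested) sketch of how I understand bulk_update_mappings() is meant to be called, assuming each mapping dict must include the primary key and use the mapped attribute names (new_balances below is a hypothetical {wallet_id: value} dict):
mappings = [{"id": wallet_id, "_balance": value}
            for wallet_id, value in new_balances.items()]
db_session.bulk_update_mappings(Wallet, mappings)
db_session.commit()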
One technique I've seen discussed is doing a bulk insert into another table and then doing an UPDATE JOIN. I've given it a test, like below, and it seems to be significantly faster.
wallets = db_session.query(Wallet).all()
ledgers = [ Ledger(id=w.id, amount=w._balance) for w in wallets ]
db_session.bulk_save_objects(ledgers)
db_session.execute('UPDATE wallet w JOIN ledger l on w.id = l.id SET w.balance = l.amount')
db_session.execute('TRUNCATE ledger')
But the problem now is how to structure my code. I'm using the ORM and I need to somehow not 'dirty' the original Wallet objects so that they don't get committed in the old way. I could just create these Ledger objects instead and keep a list of them about and then manually insert them at the end of my bulk operation. But that almost smells like I'm replicating some of the work of the ORM mechanism.
Is there a smarter way to do this? So far my brain is going down something like:
class Wallet(Base):
    ...
    _balance = Column(Float)
    ...

    @property
    def balance(self):
        # first check if we have a ledger of the same id
        # and return the amount in that, otherwise...
        return self._balance

    @balance.setter
    def balance(self, amount):
        l = Ledger(id=self.id, amount=amount)
        # add l to a list somewhere then process later
        # At the end of the transaction, do a bulk insert of Ledgers
        # and then do an UPDATE JOIN and TRUNCATE
As I said, this all seems to be fighting against the tools I (may) have. Is there a better way to be handling this? Can I tap into the ORM mechanism to be doing this? Or is there an even better way to do the bulk updates?
EDIT: Or is there maybe something clever with events and sessions? Maybe before_flush?
EDIT 2: So I have tried to tap into the event machinery and now have this:
@event.listens_for(SignallingSession, 'before_flush')
def before_flush(session, flush_context, instances):
    ledgers = []
    if session.dirty:
        for elem in session.dirty:
            if session.is_modified(elem, include_collections=False):
                if isinstance(elem, Wallet):
                    session.expunge(elem)
                    ledgers.append(Ledger(id=elem.id, amount=elem.balance))
    if ledgers:
        session.bulk_save_objects(ledgers)
        session.execute('UPDATE wallet w JOIN ledger l ON w.id = l.id SET w.balance = l.amount')
        session.execute('TRUNCATE ledger')
Which seems pretty hacky and evil to me, but appears to work OK. Any pitfalls, or better approaches?
-Matt

What you're essentially doing is bypassing the ORM in order to optimize the performance. Therefore, don't be surprised that you're "replicating the work the ORM is doing" because that's exactly what you need to do.
Unless you have a lot of places where you need to do bulk updates like this, I would recommend against the magical event approach; simply writing the explicit queries is much more straightforward.
What I recommend doing is using SQLAlchemy Core instead of the ORM to do the update:
ledger = Table("ledger", db.metadata,
    Column("wallet_id", Integer, primary_key=True),
    Column("new_balance", Float),
    prefixes=["TEMPORARY"],
)

wallets = db.session.query(Wallet).all()

# figure out new balances
balance_map = {}
for w in wallets:
    balance_map[w.id] = calculate_new_balance(w)

# create temp table with balances we need to update
ledger.create(bind=db.session.get_bind())

# insert update data
db.session.execute(ledger.insert().values([{"wallet_id": k, "new_balance": v}
                                           for k, v in balance_map.items()]))

# perform update
db.session.execute(Wallet.__table__
                   .update()
                   .values(balance=ledger.c.new_balance)
                   .where(Wallet.__table__.c.id == ledger.c.wallet_id))

# drop temp table
ledger.drop(bind=db.session.get_bind())

# commit changes
db.session.commit()
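One possible refinement, sketched below, is to wrap the create/drop of the temporary table in a small context manager so it gets dropped even if the UPDATE raises. temporary_table() here is a hypothetical helper, not a SQLAlchemy API:
from contextlib import contextmanager

@contextmanager
def temporary_table(table, session):
    # hypothetical helper: create `table` on entry, always drop it on exit
    bind = session.get_bind()
    table.create(bind=bind)
    try:
        yield table
    finally:
        table.drop(bind=bind)

with temporary_table(ledger, db.session) as t:
    db.session.execute(t.insert().values([{"wallet_id": k, "new_balance": v}
                                          for k, v in balance_map.items()]))
    db.session.execute(Wallet.__table__
                       .update()
                       .values(balance=t.c.new_balance)
                       .where(Wallet.__table__.c.id == t.c.wallet_id))
db.session.commit()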

Generally it is poor schema design to need to update thousands of rows frequently. That aside...
Plan A: Write ORM code that generates
START TRANSACTION;
UPDATE wallet SET balance = ... WHERE id = ...;
UPDATE wallet SET balance = ... WHERE id = ...;
UPDATE wallet SET balance = ... WHERE id = ...;
...
COMMIT;
Plan B: Write ORM code that generates
CREATE TEMPORARY TABLE ToDo (
id ...,
new_balance ...
);
INSERT INTO ToDo -- either one row at a time, or a bulk insert
UPDATE wallet
JOIN ToDo USING(id)
SET wallet.balance = ToDo.new_balance; -- bulk update
(Check the syntax; test; etc.)
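For illustration, here is a rough sketch of how Plan B might be driven from the asker's SQLAlchemy session using raw SQL; the column types are guesses (the wallet ids in the question are UUID strings), and new_balances is a hypothetical {wallet_id: balance} dict:
from sqlalchemy import text

db_session.execute(text(
    "CREATE TEMPORARY TABLE ToDo (id CHAR(36) PRIMARY KEY, new_balance DOUBLE)"))
db_session.execute(
    text("INSERT INTO ToDo (id, new_balance) VALUES (:id, :new_balance)"),
    [{"id": k, "new_balance": v} for k, v in new_balances.items()])  # bulk insert
db_session.execute(text(
    "UPDATE wallet JOIN ToDo USING(id) SET wallet.balance = ToDo.new_balance"))
db_session.commit()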

Related

Rails 3: What is the best way to update a column in a very large table

I want to update a column in a table with over 2.2 million rows, wherever that attribute is null. There is a Users table and a Posts table. Even though there is a num_posts column on User, only about 70,000 users have that number populated; otherwise I have to query the db like so:
@num_posts = @user.posts.count
I want to use a migration to update the attributes and I'm not sure whether or not it's the best way to do it. Here is my migration file:
class UpdateNilPostCountInUsers < ActiveRecord::Migration
  def up
    nil_count = User.select(:id).where("num_posts IS NULL")
    nil_count.each do |user|
      user.update_attribute :num_posts, user.posts.count
    end
  end

  def down
  end
end
In my console, I ran a query on the first 10 rows where num_posts was null, and then used puts for each user.posts.count. The total time was 85.3 ms for 10 rows, an average of 8.53 ms per row. 8.53 ms * 2.2 million rows is about 5.25 hours, and that's without updating any attributes. How do I know if my migration is running as expected? Is there a way to log % complete to the console? I really don't want to wait 5+ hours to find out it didn't do anything. Much appreciated.
EDIT:
Per Max's comment below, I abandoned the migration route and used find_each to solve the problem in batches. I solved the problem by writing the following code in the User model, which I successfully ran from the Rails console:
def self.update_post_count
  nil_count = User.select(:id).where("num_posts IS NULL")
  nil_count.find_each { |user|
    user.update_column(:num_posts, user.posts.count) if user.posts
  }
end
Thanks again for the help everyone!
desc 'Update User post cache counter'
task :update_cache_counter => :environment do
  users = User.joins('LEFT OUTER JOIN posts ON posts.user_id = users.id')
              .select('users.id, COUNT(posts.id) AS p_count')
              .where('users.num_posts IS NULL')
              .group('users.id')
  puts "Updating user post counts:"
  users.find_each do |user|
    print '.'
    user.update_attribute(:num_posts, user.p_count)
  end
end
First off, don't use a migration for what is essentially a maintenance task; migrations should mainly alter the schema of your database. That goes double when the job is long running like this one and may fail midway, leaving you with a botched migration and problems with the database state.
Then you need to address the fact that calling user.posts.count for every user causes an N+1 query; instead you should join the posts table and select a count.
And without batches you are likely to exhaust the server's memory quickly.
You can use update_all with a subquery to do this:
sub_query = 'SELECT count(*) FROM `posts` WHERE `posts`.`user_id` = `users`.`id`'
User.where('num_posts IS NULL').update_all("num_posts = (#{sub_query})")
It should take only seconds instead of hours, in which case you won't need to log progress at all.

Mysql duplicate row deletion with Perl DBI across two tables

This one is a pretty good one IMO, and I have not seen a close example on SO or Google, so here you go. I need to do the following within a Perl application I am building; unfortunately it can not be done directly in MySQL and will require DBI. In a nutshell, I need to take Database1.tableA and locate every record with the column 'status' matching 'started'. That part I can do, as it is fairly easy (I'm not very good with DBI yet, but I have read the docs), but where I am having issues is what I have to do next.
my $started_query = "SELECT primary_ip FROM queue WHERE status='started'";
my $started = $dbh->prepare($started_query);
$started->execute();
while ( my @started = $started->fetchrow_array() ) {
    # Where I am hoping to have the following occur so it can go by row
    # for only rows with the status 'started'
}
So for each record in the @started array (which really only contains one value per iteration of the while loop), I need to see if it exists in Database2.tableA. IF it does exist in the other database (Database2.tableA) I need to delete it from Database1.tableA, but if it DOES NOT exist in the other database (Database2.tableA) I need to update the record in the current database (Database1.tableA).
Basically replicating the below semi-valid MySQL syntax.
DELETE FROM tableA WHERE primary_ip IN (SELECT primary_ip FROM db2.tablea) OR UPDATE tableA SET status = 'error'
I am limited to DBI to connect to the two databases, and the logic is currently escaping me. I could run queries against both databases, store the results in arrays, and then compare them, but that seems redundant; I think it should be possible inside the while ( my @started = $started->fetchrow_array() ) loop, which would save on runtime and resources. I am also not familiar enough with passing values between DBI handles, and since @started will always contain the column value I need to query for and delete, I would like to take full advantage of having it already in hand when talking to the second DBI object.
I am going to be working on this thing all night and have already run through a couple of pots of coffee, so your help understanding this logic is greatly appreciated.
You'll be better off with fetchrow_hashref, which returns a hashref of key/value pairs, where the keys are the column names, rather than coding based on columns showing up at ordinal positions in the array.
You need an additional database handle to do the lookups and updates because you've got a live statement handle on the first one. Something like this:
my $dbh2 = DBI->connect(...same credentials...);
...
while (my $row = $started->fetchrow_hashref)
{
    if (my $found = $dbh2->selectrow_hashref("SELECT * FROM db2.t2 WHERE primary_ip = ?", undef, $row->{primary_ip}))
    {
        $dbh2->do("DELETE FROM db1.t1 WHERE primary_ip = ?", undef, $found->{primary_ip});
    }
    else
    {
        $dbh2->do("UPDATE db1.t1 SET status = 'error' WHERE primary_ip = ?", undef, $row->{primary_ip});
    }
}
Technically, you don't need to fetch the row from db2.t2 into $found since you're only testing for existence, and there are other ways to do that; but using it here is a bit of insurance against doing something other than you intended, since $found will be undef if the logic somehow goes wrong, and that should keep us from making unintended changes.
But approaching a relational database with loop iterations is rarely the best tactic.
This "could" be done directly in MySQL with just a couple of queries.
First, the updates, where t1.status = 'started' and t2.primary_ip has no matching value for t1.primary_ip:
UPDATE db1.t1 a LEFT JOIN db2.t2 b ON b.primary_ip = a.primary_ip
SET a.status = 'error'
WHERE b.primary_ip IS NULL AND a.status = 'started';
If you are thinking "but b.primary_ip is never null" ... well, it is null in a left join where there are no matching rows.
Then deleting the rows from t1 can also be accomplished with a join. Multi-table joins delete only the rows from the table aliases listed between DELETE and FROM. Again, we're calling "t1" by the alias "a" and t2 by the alias "b".
DELETE a
FROM db1.t1 a JOIN db2.t2 b ON a.primary_ip = b.primary_ip
WHERE a.status = 'started';
This removes every row from t1 ("a") where status = 'started' AND where a matching row exists in t2.

Why does Salesforce prevent me from creating a Push Topic with a query that contains relationships?

When I execute this code in the developer console
PushTopic pushTopic = new PushTopic();
pushTopic.ApiVersion = 23.0;
pushTopic.Name = 'Test';
pushTopic.Description = 'test';
pushTopic.Query = 'SELECT Id, Account.Name FROM Case';
insert pushTopic;
System.debug('Created new PushTopic: '+ pushTopic.Id);
I receive this message:
FATAL ERROR System.DmlException: Insert failed. First exception on row
0; first error: INVALID_FIELD, relationships are not supported:
[QUERY]
The same query runs fine on the Query Editor, but when I assign it to a Push Topic I get the INVALID_FIELD exception.
If the bottom line is what the exception message says, that relationships are just not supported by Push Topic objects, how do I create a Push Topic object that will return the data I'm looking for?
Why
Salesforce prevents this because it would require them to join tables, and joins in Salesforce's database are expensive due to multi-tenancy. Usually when they add a new feature they will not support joins, as that would require more optimization work on the feature.
Push Topics are still quite new to the platform and need to be real-time, so anything that would slow them down, I'd say, gets trimmed.
I'd suggest you look more closely at your requirement and see if there is something else that will work for you.
Workaround
A potential workaround is to add a formula field to the Case object with the data you need and include that in the query instead. This may not work, though, as the formula will still require a join to resolve.
A final option may be to use a workflow rule or trigger to copy the account name into a custom field on the Case object; that way the data is local to the Case and doesn't require a join...
PushTopics support only a very small subset of SOQL queries; see more here:
https://developer.salesforce.com/docs/atlas.en-us.api_streaming.meta/api_streaming/unsupported_soql_statements.htm
However this should work:
PushTopic casePushTopic = new PushTopic();
casePushTopic.ApiVersion = 23.0;
casePushTopic.Name = 'CaseTopic';
casePushTopic.Description = 'test';
casePushTopic.Query = 'SELECT Id, AccountId FROM Case';
insert casePushTopic;

PushTopic accountPushTopic = new PushTopic();
accountPushTopic.ApiVersion = 23.0;
accountPushTopic.Name = 'AccountTopic';
accountPushTopic.Description = 'test';
accountPushTopic.Query = 'SELECT Id, Name FROM Account';
insert accountPushTopic;
It really depends on your use case though; if it is for replicating into an RDBMS this should be enough, as you can join the replicated Case and Account data afterwards to get the full picture.

Rails best way to add huge amount of records

I've got to add around 25,000 records to the database at once in Rails.
I have to validate them, too.
Here is what I have for now:
# controller create action
def create
  emails = params[:emails][:list].split("\r\n")
  @created_count = 0
  @rejected_count = 0
  inserts = []
  emails.each do |email|
    @email = Email.new(:email => email)
    if @email.valid?
      @created_count += 1
      inserts.push "('#{email}', '#{Date.today}', '#{Date.today}')"
    else
      @rejected_count += 1
    end
  end
  return if emails.empty?
  sql = "INSERT INTO `emails` (`email`, `updated_at`, `created_at`) VALUES #{inserts.join(", ")}"
  Email.connection.execute(sql) unless inserts.empty?
  redirect_to new_email_path, :notice => "Successfully created #{@created_count} emails, rejected #{@rejected_count}"
end
It's VERY slow right now; there's no way to add that many records because the request times out.
Any ideas? I'm using MySQL.
Three things come to mind:
You can help yourself with proper tools like zdennis/activerecord-import or jsuchal/activerecord-fast-import. The problem with your example is that you will also instantiate 25,000 ActiveRecord objects. If you tell activerecord-import not to run validations, it will not create those objects (see activerecord-import/wiki/Benchmarks).
Importing tens of thousands of rows into a relational database will never be super fast; it should be done asynchronously via a background process. There are also tools for that, like DelayedJob and more: https://www.ruby-toolbox.com/
Move the code that belongs in the model out of the controller.
And after that, you need to rethink the flow of this part of the application. If you're doing background processing from a controller action like create, you cannot simply return HTTP 201 or HTTP 200. What you need to do is return a quick HTTP 202 Accepted and provide a link to another representation where the user can check the status of their request (is it finished? how many emails failed?), since it is now being processed in the background.
This can sound a bit complicated, and it is, which is a sign that maybe you shouldn't do it like that. Why do you have to add 25,000 records in one request? What's the background?
Why don't you create a rake task for the work? The following link explains it pretty well.
http://www.ultrasaurus.com/sarahblog/2009/12/creating-a-custom-rake-task/
In a nutshell, once you write your rake task, you can kick off the work by:
rake member:load_emails
If speed is your concern, I'd attack the problem from a different angle.
Create a table that copies the structure of your emails table; let it be emails_copy. Don't copy indexes and constraints.
Import the 25k records into it using your database's fast import tools. Consult your DB docs or see e.g. this answer for MySQL. You will have to prepare the input file, but it's way faster to do — I suppose you already have the data in some text or tabular form.
Create indexes and constraints for emails_copy to mimic emails table. Constraint violations, if any, will surface; fix them.
Validate the data inside the table. It may take a few raw SQL statements to check for severe errors. You don't have to validate emails for anything but very simple format anyway. Maybe all your validation could be done against the text you'll use for import.
insert into emails select * from emails_copy to put the emails into the production table. Well, you might play a bit with it to get autoincrement IDs right.
Once you're positive that the process succeeded, drop table emails_copy.

processing data with perl - selecting for update usage with mysql

I have a table that is storing data that needs to be processed. I have id, status, data in the table. I'm currently going through and selecting id, data where status = #. I'm then doing an update immediately after the select, changing the status # so that it won't be selected again.
My program is multithreaded, and sometimes two threads grab the same id because they query the table at nearly the same time. I looked into SELECT ... FOR UPDATE; however, either I wrote the query wrong or I'm not understanding what it is used for.
My goal is to find a way of grabbing the id and data that I need and setting the status so that no other thread tries to grab and process the same rows. Here is the code I tried. (I've written it all together here for demonstration purposes; in the real program the prepares are set up once at the beginning so they aren't re-prepared on every run, in case anyone was concerned about that.)
my $select = $db->prepare("SELECT id, data FROM `TestTable` WHERE _status=4 LIMIT ? FOR UPDATE") or die $DBI::errstr;
if ($select->execute($limit))
{
    while ($data = $select->fetchrow_hashref())
    {
        my $update_status = $db->prepare("UPDATE `TestTable` SET _status = ?, data = ? WHERE _id=?");
        $update_status->execute(10, "", $data->{_id});
        push(@array_hash, $data);
    }
}
When I run this with multiple threads, I get many duplicate inserts when I later try to insert my processed transaction data.
I'm not terribly familiar with MySQL, and in the research I've done I haven't found anything that really cleared this up for me.
Thanks
As a sanity check, are you using InnoDB? MyISAM has zero transactional support, aside from faking it with full table locking.
I don't see where you're starting a transaction. MySQL's autocommit option is on by default, so starting a transaction and later committing would be necessary unless you turned off autocommit.
It looks like you're simply relying on the database's locking mechanisms. I googled perl dbi locking and found this:
$dbh->do("LOCK TABLES foo WRITE, bar READ");
$sth->prepare("SELECT x,y,z FROM bar");
$sth2->prepare("INSERT INTO foo SET a = ?");
while (#ary = $sth->fetchrow_array()) {
$sth2->$execute($ary[0]);
}
$sth2->finish();
$sth->finish();
$dbh->do("UNLOCK TABLES");
Not really saying GIYF, as I am also fairly new to both MySQL and DBI myself, but perhaps you can find other answers that way.
Another option might be as follows, though it only works if you control all the code accessing the data. You can create a lock column in the table. When your code accesses a row it does something like this (pseudocode):
if row.lock != 1
    row.lock = 1
    read row
    update row
    row.lock = 0
    next
else
    sleep 1
    redo
Again though, this trusts that all users/scripts that access this data will agree to follow the policy. If you cannot ensure that, then this won't work.
Anyway, that's all the knowledge I have on the topic. Good luck!