I have a query that takes a very long time to complete. I am wondering if there is any way to optimize it so it runs quicker. Currently the table has around 15 million rows and has an index on expirationDate. This query has been running for 8000+ seconds.
UPDATE LOW_PRIORITY items
SET expire = 1
WHERE expirationdate < Curdate()
AND expirationdate != '0000-00-00'
AND expire = '0'
AND ( `submit_id` != '742457'
OR submit_id IS NULL )
Never mind your table structure (I requested it before this edit); I have a solution for you. I've done this before with my DBAs several times on our production MySQL databases, even during peak hours of the day. The approach you need to take is turning your one massive UPDATE query (which will lock the table while executing) into several individual updates. Those individual updates will not have a material impact on your database, especially if you break them down and run them in smaller batches.
To generate those individual queries, run this (assuming the items table's primary key is named "id"):
SELECT CONCAT(
'UPDATE items SET expire = 1 WHERE id = ',
items.id,
';') AS ''
FROM items
WHERE expirationdate < Curdate()
AND expirationdate != '0000-00-00'
AND expire = '0'
AND ( `submit_id` != '742457'
OR submit_id IS NULL )
This will generate SQL that looks like this:
UPDATE items SET expire = 1 WHERE id = 1234;
UPDATE items SET expire = 1 WHERE id = 2345;
.....
What I personally like to do is put the above query in a text file and run this from the command line:
cat theQuery.sql | mysql -u yourusername [+ command line args with db host, etc] -p > outputSQLCommands.sql
This will pass the query to MySQL and dump the results to an output file. If you notice, we named the output column '' (nothing), which keeps the output nicely formatted in the text file so it can be easily piped back into MySQL to run the commands. In fact, you could do it all in one fell swoop like this:
cat theQuery.sql | mysql -u yourusername [+ command line args with db host, etc] -p | mysql -u yourusername [+ command line args with db host, etc] -p
Then you are definitely ballin' SQL style.
After you have the output SQL file though, if it's too many queries to run one after another, you can use a command line tool like "split" to split the file into a bunch of smaller files.
I would recommend testing to find the optimum batch size based on your DB's hardware/performance/etc. For example, try running 1000 updates first; if that works with no material impact, try 5000, and so on. Then you can split the rest of your file into chunks you are comfortable with and run them through.
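If splitting files feels too manual, another option I'd sketch (untested against your schema, so treat it as an assumption) is to batch inside MySQL itself with a LIMIT-ed UPDATE built from the same WHERE clause, re-running it until it touches zero rows:

-- Run repeatedly (e.g. from a shell loop) until ROW_COUNT() reports 0.
UPDATE items
SET expire = 1
WHERE expirationdate < CURDATE()
  AND expirationdate != '0000-00-00'
  AND expire = '0'
  AND (`submit_id` != '742457' OR submit_id IS NULL)
LIMIT 1000;

SELECT ROW_COUNT();  -- rows affected by the UPDATE just above

The LIMIT value is the same knob as the file-chunk size: start small and raise it as long as the impact stays acceptable.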
Good luck!
Related
I need to run a MySQL script that, according to my benchmarking, should take over 14 hours to run. The script is updating every row in a 332715-row table:
UPDATE gene_set SET attribute_fk = (
SELECT id FROM attribute WHERE
gene_set.name_from_dataset <=> attribute.name_from_dataset AND
gene_set.id_from_dataset <=> attribute.id_from_dataset AND
gene_set.description_from_dataset <=> attribute.description_from_dataset AND
gene_set.url_from_dataset <=> attribute.url_from_dataset AND
gene_set.name_from_naming_authority <=> attribute.name_from_naming_authority AND
gene_set.id_from_naming_authority <=> attribute.id_from_naming_authority AND
gene_set.description_from_naming_authority <=> attribute.description_from_naming_authority AND
gene_set.url_from_naming_authority <=> attribute.url_from_naming_authority AND
gene_set.attribute_type_fk <=> attribute.attribute_type_fk AND
gene_set.naming_authority_fk <=> attribute.naming_authority_fk
);
(The script is unfortunate; I need to transfer all the data from gene_set to attribute, but first I must correctly set a foreign key to point to attribute).
I haven't been able to successfully run it using this command:
nohup mysql -h [host] -u [user] -p [database] < my_script.sql
For example, last night, it ran over 10 hours but then the ssh connection broke:
Write failed: Broken pipe
Is there any way to run this script in a way to better ensure that it finishes? I really don't care how long it takes (1 day vs 2 days doesn't really matter) so long as I know it will finish.
The quickest way might be to run it in a screen or tmux session.
Expanding on my comment: you're getting poor performance for a 350k-record update statement because you're setting the value from the result of a subquery rather than updating as a set, so the subquery effectively runs once for each row. Update as such:
UPDATE gene_set g JOIN attribute a ON <all join conditions> SET g.attribute_fk = a.id;
This doesn't answer your question per se, but I'll be interested to know how much faster it runs.
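For reference, here is a fuller sketch of that set-based form, reusing the NULL-safe comparisons and column names from the question; it is untested, so treat it as a starting point rather than a drop-in replacement:

UPDATE gene_set g
JOIN attribute a
  ON  g.name_from_dataset                  <=> a.name_from_dataset
  AND g.id_from_dataset                    <=> a.id_from_dataset
  AND g.description_from_dataset           <=> a.description_from_dataset
  AND g.url_from_dataset                   <=> a.url_from_dataset
  AND g.name_from_naming_authority         <=> a.name_from_naming_authority
  AND g.id_from_naming_authority           <=> a.id_from_naming_authority
  AND g.description_from_naming_authority  <=> a.description_from_naming_authority
  AND g.url_from_naming_authority          <=> a.url_from_naming_authority
  AND g.attribute_type_fk                  <=> a.attribute_type_fk
  AND g.naming_authority_fk                <=> a.naming_authority_fk
SET g.attribute_fk = a.id;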
Here is how I did it in the past, when I ran monolithic ALTER queries on a remote server that sometimes took ages:
mysql -h [host] -u [user] -p [database] < my_script.sql > result.log 2>&1 &
This way you don't need to wait for it, as it will finish in its own time. You can also add SELECT NOW() at the start and end of my_script.sql to find out how long it took, if you're interested (see the sketch after the list below).
Things also to consider, if applicable:
Why does this query take this long, and can we improve it (e.g. disable key checks, prepare the data offline and update via a temp table)?
Can we break the query into batches?
What is the impact on the rest of the DB?
etc.
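For the SELECT NOW() idea and the key-check point above, here is a minimal sketch of how my_script.sql could be wrapped (the SET lines are optional and assume you have weighed the risk of relaxing checks during the run):

-- my_script.sql (sketch): timing markers around the long-running statement
SELECT NOW() AS started_at;

-- Optionally relax checks for this session, re-enabling them afterwards:
-- SET SESSION unique_checks = 0;
-- SET SESSION foreign_key_checks = 0;

-- ... the original UPDATE gene_set statement goes here ...

-- SET SESSION unique_checks = 1;
-- SET SESSION foreign_key_checks = 1;

SELECT NOW() AS finished_at;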
If you have ssh access to the server you could copy it over and run it there with the following lines:
#copy over to tmp dir
scp my_script.sql user@remoteHost:/tmp/
#execute script on remote host
ssh -t user@remoteHost "nohup mysql \
-h localhost -u [user] -p [database] < /tmp/my_script.sql &"
Maybe you can try to do the 300k updates with frequent commits instead of one single huge update. That way, if anything fails, you will keep the changes already applied.
With some dynamic SQL you can generate all the lines in one go, then copy the file over to your server ...
I am very much a SQL developer and am new to Redis, but its performance is very interesting. I have a problem I think Redis could help me with a lot. I have a SQL table similar to this:
| CONTAINER <String><NoUnq> | PROCESS <String><NoUnq> | PROCESS_DATA <String><NoUnq> | TimeCreated <TimeStamp><NoUnq>|
This table, when populated to its max, has roughly ~450,000,000 rows. I am running this on AWS. From these rows I select all the processes within a container (~1,000,000 containers), so I would have something like this in SQL (container is indexed, of course):
SELECT * FROM table WHERE container = '[CONTAINER_NAME]';
I then have a cronjob script which runs every hour and removes old processes from containers with something like this:
DELETE FROM table WHERE TimeCreated <= [SOME_TIME];
So essentially I'd like to keep only processes that are no older than ~4-5 hours. Looking at Redis, I feel like I can engineer something similar to improve my performance, but I am having trouble converting this SQL-like design into Redis.
My first thought was to use HSET, but I found out HSET does not allow the EXPIRE command on individual fields, so I could not automatically remove old processes. I am most concerned about performance and efficiency.
Looks like you can (and probably should) use HSET. And it looks like you do not need to expire fields; you need to expire keys. Base the key name on the container name and call EXPIREAT on that key. Given the table structure you wrote above, the closest analogue is one table row per key:
MULTI
HMSET <container name:rowId> PROCESS <value> PROCESS_DATA <value>
EXPIREAT <container name:rowId> <TimeCreated + retention>
EXEC
Also you can use a ZSET to store a time-ordered list of rows:
ZADD <container name> <TimeCreated> <rowId>
So you may use ZRANGE as a SELECT equivalent. You can also use Lua scripting to get the contents of a container in one request. Something like this (I may have made a mistake somewhere in the Lua syntax):
local result = {}
-- rowIds for this container, ordered by score (TimeCreated)
local rows = redis.call('zrange', KEYS[1], ARGV[1], ARGV[2])
for _, rowId in ipairs(rows) do
  -- append the row id followed by its hash contents
  result[#result + 1] = rowId
  result[#result + 1] = redis.call('hgetall', KEYS[1] .. ':' .. rowId)
end
return result
Where KEYS[1] is the container name, ARGV[1] is the start index, and ARGV[2] is the end index.
P.S. You should also understand how Redis expires keys, so you know what happens with the memory on your instance.
I have a MySQL server and administration is done via phpMyAdmin. All has worked fine since "forever", but now I have realized that I have a problem:
Often I do SQL updates using the "SQL" link (Run SQL query/queries on server).
If I enter a lot of statements (like this):
UPDATE table SET column = 'new value A' WHERE id = 1;
UPDATE table SET column = 'new value B' WHERE id = 2;
UPDATE table SET column = 'new value C' WHERE id = 3;
....
UPDATE table SET column = 'new value Z' WHERE id = 100;
I have found that only about 40-50 statements are executed: no error messages, nothing seems broken, it's just that not all 100 or more short SQL statements are carried out ...
Has anyone encountered the same, or even better:
What can be done to make sure all lines/SQL Statements are processed?
Not much can be done. I've encountered the same thing and am comfortable submitting a few statements at a time via phpMyAdmin, but for anything more than, say, 20 simple statements I go to my MySQL host and import a file with all my statements, like: $> mysql -u me -pmypass mydb < file_with_many_statements.sql
There's clearly a limitation that could depend on a number of factors (native to phpMyAdmin, settings in your phpMyAdmin host's PHP/web server/MySQL configs, network issues, etc.).
I have enabled CDC on a few tables in my SQL Server 2008 database. I want to change the number of days I can keep the change history.
I have read that by default change logs are kept for 3 days before they are deleted by the sys.sp_cdc_cleanup_change_table stored proc.
Does anyone know how I can change this default value, so that I can keep the logs for longer.
Thanks
You need to update the cdc_jobs.retention field for your database. The record in the cdc_jobs table won't exist until at least one table has been enabled for CDC.
-- modify msdb.dbo.cdc_jobs.retention value (in minutes) to be the length of time to keep change-tracked data
update
j
set
[retention] = 3679200 -- 7 years
from
sys.databases d
inner join
msdb.dbo.cdc_jobs j
on j.database_id = d.database_id
and j.job_type = 'cleanup'
and d.name = '<Database Name, sysname, DatabaseName>';
Replace <Database Name, sysname, DatabaseName> with your database name.
Two alternative solutions:
Drop the cleanup job:
EXEC sys.sp_cdc_drop_job @job_type = N'cleanup';
Change the job via sp:
EXEC sys.sp_cdc_change_job
    @job_type = N'cleanup',
    @retention = 2880;
Retention time is in minutes, with a maximum of 52494800 (100 years). But if you drop the job, data is never cleaned up; the job doesn't even check whether there is data to clean up. If you want to keep the data indefinitely, I'd prefer dropping the job.
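To double-check which value is in effect, you can query the same msdb table the UPDATE above targets; a small sketch:

-- Inspect the cleanup job's current retention (minutes) and threshold
SELECT d.name AS database_name, j.job_type, j.retention, j.threshold
FROM msdb.dbo.cdc_jobs AS j
INNER JOIN sys.databases AS d
    ON d.database_id = j.database_id
WHERE j.job_type = 'cleanup';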
I am writing a test program with Ruby and ActiveRecord, and it reads a document which is about 6000 words long. Then I just tally up the words by
recordWord = Word.find_by_s(word);
if (recordWord.nil?)
recordWord = Word.new
recordWord.s = word
end
if recordWord.count.nil?
recordWord.count = 1
else
recordWord.count += 1
end
recordWord.save
and so this part loops 6000 times... and it takes at least a few minutes to
run using sqlite3. Is that normal? I was expecting it to run
within a couple of seconds... can MySQL speed it up a lot?
With 6000 calls to write to the database, you're going to see speed issues. I would save the various tallies in memory and save to the database once at the end, not 6000 times along the way.
Take a look at AR:Extensions as well to handle the bulk insertions.
http://rubypond.com/articles/2008/06/18/bulk-insertion-of-data-with-activerecord/
I wrote up some quick code in perl that simply does:
Create the database
Insert a record that only contains a single integer
Retrieve the most recent record and verify that it returns what it inserted
And it does steps #2 and #3 6000 times. This is obviously a considerably lighter workload than having an entire object/relational bridge. For this trivial case with SQLite it still took 17 seconds to execute, so your desire to have it take "a couple of seconds" is not realistic on "traditional hardware."
Using the monitor I verified that it was primarily disk activity that was slowing it down. Based on that, if for some reason you really do need the database to behave that quickly, I suggest one of two options:
Do what people have suggested and find a way around the requirement
Try buying some solid state disks.
I think #1 is a good way to start :)
Code:
#!/usr/bin/perl
use warnings;
use strict;
use DBI;
my $dbh = DBI->connect('dbi:SQLite:dbname=/tmp/dbfile', '', '');
create_database($dbh);
insert_data($dbh);
sub insert_data {
my ($dbh) = @_;
my $insert_sql = "INSERT INTO test_table (test_data) values (?)";
my $retrieve_sql = "SELECT test_data FROM test_table WHERE test_data = ?";
my $insert_sth = $dbh->prepare($insert_sql);
my $retrieve_sth = $dbh->prepare($retrieve_sql);
my $i = 0;
while (++$i < 6000) {
$insert_sth->execute(($i));
$retrieve_sth->execute(($i));
my $hash_ref = $retrieve_sth->fetchrow_hashref;
die "bad data!" unless $hash_ref->{'test_data'} == $i;
}
}
sub create_database {
my ($dbh) = @_;
my $status = $dbh->do("DROP TABLE test_table");
# warn (but do not die) if the DROP failed, e.g. because the table does not exist yet
if (!defined $status) {
print "DROP TABLE failed";
}
my $create_statement = "CREATE TABLE test_table (id INTEGER PRIMARY KEY AUTOINCREMENT, \n";
$create_statement .= "test_data varchar(255)\n";
$create_statement .= ");";
$status = $dbh->do($create_statement);
# die if CREATE resulted in error
if (!defined $status) {
die "CREATE failed";
}
}
What kind of database connection are you using? Some databases allow you to connect 'directly' rather than using a TCP network connection that goes through the network stack. In other words, if you're making a network connection and sending data that way, it can slow things down.
Another way to boost performance of a database connection is to group SQL statements together in a single command.
For example, if you make a single 6,000-line SQL command that looks like this
"update words set count = count + 1 where word = 'the';
update words set count = count + 1 where word = 'in';
...
update words set count = count + 1 where word = 'copacetic';"
and run it as a single command, performance will be a lot better. By default, MySQL has a 'packet size' limit of 1 megabyte, but you can change that in the my.ini file to make it larger if you want.
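If you go the grouped-statement route, you can also check and raise that limit at runtime rather than editing my.ini; a hedged sketch (the 16 MB value is only an example):

-- Current packet limit in bytes
SHOW VARIABLES LIKE 'max_allowed_packet';

-- Raise it for new connections (requires the SUPER privilege); the value in
-- my.ini still controls what you get after a server restart.
SET GLOBAL max_allowed_packet = 16 * 1024 * 1024;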
Since you're abstracting away your database calls through ActiveRecord, you don't have much control over how the commands are issued, so it can be difficult to optimize your code.
Another thing you could do is keep a count of words in memory and then only insert the final totals into the database, rather than doing an update every time you come across a word. That will probably cut down a lot on the number of statements, because if you do an update every time you encounter the word 'the', that's a huge, huge waste. Words have a 'long tail' distribution and the most common words are hugely more common than the more obscure ones. Then the underlying SQL would look more like this:
"update words set count = 300 where word = 'the';
update words set count = 250 where word = 'in';
...
update words set count = 1 where word = 'copacetic';"
If you're worried about taking up too much memory, you could count words and periodically 'flush' them: read a couple of megabytes of text, then spend a few seconds updating the totals, rather than updating each word every time you encounter it. If you want to improve performance even more, you should consider issuing the SQL commands in batches directly.
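If you do flush the in-memory tallies in batches, a single multi-row statement per flush keeps the round trips down. A sketch in plain MySQL, assuming a words table with a unique index on the word column (adjust names to your schema):

-- One round trip per flush instead of one statement per word
INSERT INTO words (word, `count`)
VALUES ('the', 300), ('in', 250), ('copacetic', 1)
ON DUPLICATE KEY UPDATE `count` = `count` + VALUES(`count`);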
Without knowing much about Ruby and SQLite, some general hints:
create a unique index on Word.s (you did not state whether you have one)
define a default for Word.count in the database ( DEFAULT 1 )
optimize assignment of count:
recordWord = Word.find_by_s(word);
if (recordWord.nil?)
recordWord = Word.new
recordWord.s = word
recordWord.count = 1
else
recordWord.count += 1
end
recordWord.save
Use BEGIN TRANSACTION before your updates, then COMMIT at the end.
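In SQL terms, the schema-side hints might look like the sketch below; the table and column names (words, s, count) are inferred from the Ruby model, so adjust them to your actual schema:

-- Unique index on the word text plus a database-side default for the count
ALTER TABLE words
  ADD UNIQUE INDEX idx_words_s (s),
  MODIFY `count` INT NOT NULL DEFAULT 1;

-- Wrap the per-word statements in one transaction
BEGIN;
-- ... the individual INSERT / UPDATE statements ...
COMMIT;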
OK, I found some general rules:
1) use a hash to keep the count first, not the DB
2) at the end, wrap all inserts or updates in one transaction, so that it doesn't hit the DB 6000 times.