Faster way to delete matching rows? - mysql

I'm a relative novice when it comes to databases. We are using MySQL and I'm currently trying to speed up a SQL statement that seems to take a while to run. I looked around on SO for a similar question but didn't find one.
The goal is to remove all the rows in table A that have a matching id in table B.
I'm currently doing the following:
DELETE FROM a WHERE EXISTS (SELECT b.id FROM b WHERE b.id = a.id);
There are approximately 100K rows in table a and about 22K rows in table b. The column 'id' is the PK for both tables.
This statement takes about 3 minutes to run on my test box - Pentium D, XP SP3, 2GB ram, MySQL 5.0.67. This seems slow to me. Maybe it isn't, but I was hoping to speed things up. Is there a better/faster way to accomplish this?
EDIT:
Some additional information that might be helpful. Tables A and B have the same structure as I've done the following to create table B:
CREATE TABLE b LIKE a;
Table a (and thus table b) has a few indexes to help speed up queries that are made against it. Again, I'm a relative novice at DB work and still learning. I don't know how much of an effect, if any, this has on things. I assume that it does have an effect as the indexes have to be cleaned up too, right? I was also wondering if there were any other DB settings that might affect the speed.
Also, I'm using InnoDB.
Here is some additional info that might be helpful to you.
Table A has a structure similar to this (I've sanitized this a bit):
DROP TABLE IF EXISTS `frobozz`.`a`;
CREATE TABLE `frobozz`.`a` (
`id` bigint(20) unsigned NOT NULL auto_increment,
`fk_g` varchar(30) NOT NULL,
`h` int(10) unsigned default NULL,
`i` longtext,
`j` bigint(20) NOT NULL,
`k` bigint(20) default NULL,
`l` varchar(45) NOT NULL,
`m` int(10) unsigned default NULL,
`n` varchar(20) default NULL,
`o` bigint(20) NOT NULL,
`p` tinyint(1) NOT NULL,
PRIMARY KEY USING BTREE (`id`),
KEY `idx_l` (`l`),
KEY `idx_h` USING BTREE (`h`),
KEY `idx_m` USING BTREE (`m`),
KEY `idx_fk_g` USING BTREE (`fk_g`),
KEY `fk_g_frobozz` (`id`,`fk_g`),
CONSTRAINT `fk_g_frobozz` FOREIGN KEY (`fk_g`) REFERENCES `frotz` (`g`)
) ENGINE=InnoDB AUTO_INCREMENT=179369 DEFAULT CHARSET=utf8 ROW_FORMAT=DYNAMIC;
I suspect that part of the issue is there are a number of indexes for this table.
Table B looks similar to table A, though it only contains the columns id and h.
Also, the profiling results are as follows:
starting 0.000018
checking query cache for query 0.000044
checking permissions 0.000005
Opening tables 0.000009
init 0.000019
optimizing 0.000004
executing 0.000043
end 0.000005
end 0.000002
query end 0.000003
freeing items 0.000007
logging slow query 0.000002
cleaning up 0.000002
SOLVED
Thanks to all the responses and comments. They certainly got me to think about the problem. Kudos to dotjoe for getting me to step away from the problem by asking the simple question "Do any other tables reference a.id?"
The problem was that there was a DELETE TRIGGER on table A which called a stored procedure to update two other tables, C and D. Table C had a FK back to a.id and after doing some stuff related to that id in the stored procedure, it had the statement,
DELETE FROM c WHERE c.id = theId;
I looked into the EXPLAIN statement and rewrote this as,
EXPLAIN SELECT * FROM c WHERE c.other_id = 12345;
So, I could see what this was doing and it gave me the following info:
id 1
select_type SIMPLE
table c
type ALL
possible_keys NULL
key NULL
key_len NULL
ref NULL
rows 2633
Extra using where
This told me that it was a painful operation to make and since it was going to get called 22500 times (for the given set of data being deleted), that was the problem. Once I created an INDEX on that other_id column and reran the EXPLAIN, I got:
id 1
select_type SIMPLE
table c
type ref
possible_keys Index_1
key Index_1
key_len 8
ref const
rows 1
Extra
Much better, in fact really great.
I added that Index_1 and my delete times are in line with the times reported by mattkemp. This was a really subtle error on my part due to shoe-horning some additional functionality at the last minute. It turned out that most of the suggested alternative DELETE/SELECT statements, as Daniel stated, ended up taking essentially the same amount of time and as soulmerge mentioned, the statement was pretty much the best I was going to be able to construct based on what I needed to do. Once I provided an index for this other table C, my DELETEs were fast.
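For anyone who wants to reproduce the before/after check without a MySQL box, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in. This is an assumption for illustration only: SQLite's EXPLAIN QUERY PLAN output looks different from MySQL's EXPLAIN, and the two-column table c is a simplified guess at the real one. The idea is the same, though: a full scan before the index exists, an index lookup afterwards.

```python
import sqlite3

def plan_for(conn, sql):
    """Return the readable 'detail' column of SQLite's EXPLAIN QUERY PLAN."""
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

conn = sqlite3.connect(":memory:")
# Simplified stand-in for table c: only the columns needed for the lookup.
conn.execute("CREATE TABLE c (id INTEGER PRIMARY KEY, other_id INTEGER NOT NULL)")
conn.executemany("INSERT INTO c (other_id) VALUES (?)", [(i,) for i in range(2633)])
conn.commit()

query = "SELECT * FROM c WHERE other_id = 12345"
before = plan_for(conn, query)   # no usable index yet: every row is scanned
conn.execute("CREATE INDEX Index_1 ON c (other_id)")
after = plan_for(conn, query)    # the lookup can now go through Index_1

print("before:", before)
print("after: ", after)
```

Since the trigger ran this lookup once per deleted row, the per-lookup difference was multiplied by roughly 22500.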
Postmortem:
Two lessons learned came out of this exercise. First, it is clear that I didn't leverage the power of the EXPLAIN statement to get a better idea of the impact of my SQL queries. That's a rookie mistake, so I'm not going to beat myself up about that one. I'll learn from that mistake. Second, the offending code was the result of a 'get it done quick' mentality and inadequate design/testing led to this problem not showing up sooner. Had I generated several sizable test data sets to use as test input for this new functionality, I'd have not wasted my time nor yours. My testing on the DB side was lacking the depth that my application side has in place. Now I've got the opportunity to improve that.
Reference: EXPLAIN Statement

Deleting data from InnoDB is the most expensive operation you can request of it. As you already discovered the query itself is not the problem - most of them will be optimized to the same execution plan anyway.
While it may be hard to understand why DELETEs of all cases are the slowest, there is a rather simple explanation. InnoDB is a transactional storage engine. That means that if your query was aborted halfway-through, all records would still be in place as if nothing happened. Once it is complete, all will be gone in the same instant. During the DELETE other clients connecting to the server will see the records until your DELETE completes.
To achieve this, InnoDB uses a technique called MVCC (Multi Version Concurrency Control). What it basically does is to give each connection a snapshot view of the whole database as it was when the first statement of the transaction started. To achieve this, every record in InnoDB internally can have multiple values - one for each snapshot. This is also why COUNTing on InnoDB takes some time - it depends on the snapshot state you see at that time.
For your DELETE transaction, each and every record that is identified according to your query conditions, gets marked for deletion. As other clients might be accessing the data at the same time, it cannot immediately remove them from the table, because they have to see their respective snapshot to guarantee the atomicity of the deletion.
Once all records have been marked for deletion, the transaction is successfully committed. And even then they cannot be immediately removed from the actual data pages, before all other transactions that worked with a snapshot value before your DELETE transaction, have ended as well.
So in fact your 3 minutes are not really that slow, considering the fact that all records have to be modified in order to prepare them for removal in a transaction safe way. Probably you will "hear" your hard disk working while the statement runs. This is caused by accessing all the rows.
To improve performance you can try to increase InnoDB buffer pool size for your server and try to limit other access to the database while you DELETE, thereby also reducing the number of historic versions InnoDB has to maintain per record.
With the additional memory InnoDB might be able to read your table (mostly) into memory and avoid some disk seeking time.
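The all-or-nothing behaviour described in this answer can be tried out with a small sketch. It uses Python's built-in sqlite3 module rather than InnoDB (an assumption made so the example runs anywhere), so it only illustrates transactional rollback, not InnoDB's MVCC internals.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO a (id) VALUES (?)", [(i,) for i in range(1, 101)])
conn.commit()

def count_rows(conn):
    return conn.execute("SELECT COUNT(*) FROM a").fetchone()[0]

# sqlite3 opens a transaction implicitly before the DELETE runs.
conn.execute("DELETE FROM a WHERE id <= 50")
inside = count_rows(conn)          # the deleting session already sees 50 rows

conn.rollback()                    # simulate the statement being aborted
after_rollback = count_rows(conn)  # all 100 rows are in place again
```

The engine has to keep enough information around to undo every deleted row until the transaction ends, which is part of why large deletes are costly.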

Try this:
DELETE a
FROM a
INNER JOIN b
on a.id = b.id
Subqueries tend to be slower than joins, as they may be run once for each record in the outer query.

This is what I always do, when I have to operate with super large data (here: a sample test table with 150000 rows):
drop table if exists employees_bak;
create table employees_bak like employees;
insert into employees_bak
select * from employees
where emp_no > 100000;
rename table employees to employees_todelete;
rename table employees_bak to employees;
drop table employees_todelete;
In this case the sql filters 50000 rows into the backup table.
The query cascade performs on my slow machine in 5 seconds.
You can replace the insert into select by your own filter query.
That is the trick to perform mass deletion on big databases!;=)
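The same copy-and-swap idea as a runnable sketch, using Python's built-in sqlite3 module as a stand-in for MySQL (an assumption: SQLite has no CREATE TABLE ... LIKE or multi-table RENAME TABLE, so the schema is repeated and the renames are done one at a time). The table is smaller than the one above and the name column is made up.

```python
import sqlite3

SCHEMA = "(emp_no INTEGER PRIMARY KEY, name TEXT NOT NULL)"  # hypothetical columns

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees " + SCHEMA)
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [(n, "emp%d" % n) for n in range(1, 1501)])
conn.commit()

def keep_only(conn, keep_condition):
    """Rebuild `employees` so that only rows matching keep_condition survive."""
    conn.execute("DROP TABLE IF EXISTS employees_bak")
    conn.execute("CREATE TABLE employees_bak " + SCHEMA)
    # Copy the rows to KEEP, which is cheaper than deleting the rest.
    conn.execute("INSERT INTO employees_bak SELECT * FROM employees WHERE "
                 + keep_condition)
    conn.execute("ALTER TABLE employees RENAME TO employees_todelete")
    conn.execute("ALTER TABLE employees_bak RENAME TO employees")
    conn.execute("DROP TABLE employees_todelete")
    conn.commit()

keep_only(conn, "emp_no > 1000")
remaining = conn.execute("SELECT COUNT(*), MIN(emp_no) FROM employees").fetchone()
```

One caveat with this pattern on a live system: rows written to the original table after the copy has started are not in the new table, so writes need to be paused or otherwise handled.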

Your time of three minutes seems really slow. My guess is that the id column is not being indexed properly. If you could provide the exact table definition you're using that would be helpful.
I created a simple python script to produce test data and ran multiple different versions of the delete query against the same data set. Here's my table definitions:
drop table if exists a;
create table a
(id bigint unsigned not null primary key,
data varchar(255) not null) engine=InnoDB;
drop table if exists b;
create table b like a;
I then inserted 100k rows into a and 25k rows into b (22.5k of which were also in a). Here's the results of the various delete commands. I dropped and repopulated the table between runs by the way.
mysql> DELETE FROM a WHERE EXISTS (SELECT b.id FROM b WHERE a.id=b.id);
Query OK, 22500 rows affected (1.14 sec)
mysql> DELETE FROM a USING a LEFT JOIN b ON a.id=b.id WHERE b.id IS NOT NULL;
Query OK, 22500 rows affected (0.81 sec)
mysql> DELETE a FROM a INNER JOIN b on a.id=b.id;
Query OK, 22500 rows affected (0.97 sec)
mysql> DELETE QUICK a.* FROM a,b WHERE a.id=b.id;
Query OK, 22500 rows affected (0.81 sec)
All the tests were run on an Intel Core2 quad-core 2.5GHz, 2GB RAM with Ubuntu 8.10 and MySQL 5.0. Note, that the execution of one sql statement is still single threaded.
Update:
I updated my tests to use itsmatt's schema. I slightly modified it by removing the auto increment (I'm generating synthetic data) and the character set encoding (wasn't working - didn't dig into it).
Here's my new table definitions:
drop table if exists a;
drop table if exists b;
drop table if exists c;
create table c (id varchar(30) not null primary key) engine=InnoDB;
create table a (
id bigint(20) unsigned not null primary key,
c_id varchar(30) not null,
h int(10) unsigned default null,
i longtext,
j bigint(20) not null,
k bigint(20) default null,
l varchar(45) not null,
m int(10) unsigned default null,
n varchar(20) default null,
o bigint(20) not null,
p tinyint(1) not null,
key l_idx (l),
key h_idx (h),
key m_idx (m),
key c_id_idx (id, c_id),
key c_id_fk (c_id),
constraint c_id_fk foreign key (c_id) references c(id)
) engine=InnoDB row_format=dynamic;
create table b like a;
I then reran the same tests with 100k rows in a and 25k rows in b (and repopulating between runs).
mysql> DELETE FROM a WHERE EXISTS (SELECT b.id FROM b WHERE a.id=b.id);
Query OK, 22500 rows affected (11.90 sec)
mysql> DELETE FROM a USING a LEFT JOIN b ON a.id=b.id WHERE b.id IS NOT NULL;
Query OK, 22500 rows affected (11.48 sec)
mysql> DELETE a FROM a INNER JOIN b on a.id=b.id;
Query OK, 22500 rows affected (12.21 sec)
mysql> DELETE QUICK a.* FROM a,b WHERE a.id=b.id;
Query OK, 22500 rows affected (12.33 sec)
As you can see this is quite a bit slower than before, probably due to the multiple indexes. However, it is nowhere near the three minute mark.
Something else that you might want to look at is moving the longtext field to the end of the schema. I seem to remember that mySQL performs better if all the size restricted fields are first and text, blob, etc are at the end.

You're doing your subquery on 'b' for every row in 'a'.
Try:
DELETE FROM a USING a LEFT JOIN b ON a.id = b.id WHERE b.id IS NOT NULL;

Try this out:
DELETE QUICK A.* FROM A,B WHERE A.ID=B.ID
It is much faster than normal queries.
Refer for Syntax : http://dev.mysql.com/doc/refman/5.0/en/delete.html

I know this question has been pretty much solved due to OP's indexing omissions but I would like to offer this additional advice, which is valid for a more generic case of this problem.
I have personally dealt with having to delete many rows from one table that exist in another and in my experience it's best to do the following, especially if you expect lots of rows to be deleted. This technique most importantly will improve replication slave lag, as the longer each single mutator query runs, the worse the lag would be (replication is single threaded).
So, here it is: do a SELECT first, as a separate query, remembering the IDs returned in your script/application, then continue on deleting in batches (say, 50,000 rows at a time).
This will achieve the following:
Each DELETE statement will not lock the table for too long, which keeps replication lag from getting out of control. This is especially important if you rely on replication to provide relatively up-to-date data. The benefit of using batches is that if you find that each DELETE query still takes too long, you can make the batches smaller without touching any DB structures.
Another benefit of using a separate SELECT is that the SELECT itself might take a long time to run, especially if it can't, for whatever reason, use the best DB indexes. If the SELECT is nested inside a DELETE, then when the whole statement migrates to the slaves, they have to run the long SELECT all over again, and slave lag suffers badly. If you use a separate SELECT query, this problem goes away, as all you're passing is a list of IDs.
Let me know if there's a fault in my logic somewhere.
For more discussion on replication lag and ways to fight it, similar to this one, see MySQL Slave Lag (Delay) Explained And 7 Ways To Battle It
P.S. One thing to be careful about is, of course, potential edits to the table between the times the SELECT finishes and DELETEs start. I will let you handle such details by using transactions and/or logic pertinent to your application.
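A minimal sketch of the select-first, delete-in-batches approach described in this answer. It uses Python's built-in sqlite3 module as a stand-in for a MySQL connection (an assumption so it runs anywhere), and the helper names and the batch size of 500 are made up for the example. The batch size is kept small because SQLite limits the number of bound parameters per statement; on MySQL you would tune it to your own lag measurements.

```python
import sqlite3

def chunks(items, size):
    """Yield consecutive slices of `items` holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def delete_in_batches(conn, ids, batch_size=500):
    """Delete rows of `a` by primary key, one short transaction per batch."""
    deleted = 0
    for batch in chunks(ids, batch_size):
        placeholders = ",".join("?" * len(batch))
        cur = conn.execute("DELETE FROM a WHERE id IN (%s)" % placeholders, batch)
        conn.commit()                # keep each transaction small
        deleted += cur.rowcount
    return deleted

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE b (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO a VALUES (?)", [(i,) for i in range(1, 10001)])
conn.executemany("INSERT INTO b VALUES (?)", [(i,) for i in range(1, 2251)])
conn.commit()

# Step 1: a separate SELECT collects the ids. Step 2: delete them in batches.
ids = [row[0] for row in conn.execute("SELECT a.id FROM a JOIN b ON a.id = b.id")]
deleted = delete_in_batches(conn, ids)
left = conn.execute("SELECT COUNT(*) FROM a").fetchone()[0]
```

Because only literal ids are sent in each DELETE, a replica replaying the statements never has to repeat the expensive SELECT.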

DELETE FROM a WHERE id IN (SELECT id FROM b)

Maybe you should rebuild the indices before running such a huge query. Well, you should rebuild them periodically.
REPAIR TABLE a QUICK;
REPAIR TABLE b QUICK;
and then run any of the above queries (i.e.)
DELETE FROM a WHERE id IN (SELECT id FROM b)

The query itself is already in an optimal form, updating the indexes causes the whole operation to take that long. You could disable the keys on that table before the operation, that should speed things up. You can turn them back on at a later time, if you don't need them immediately.
Another approach would be adding a deleted flag-column to your table and adjusting other queries so they take that value into account. The fastest boolean type in mysql is CHAR(0) NULL (true = '', false = NULL). That would be a fast operation, you can delete the values afterwards.
The same thoughts expressed in sql statements:
ALTER TABLE a ADD COLUMN deleted CHAR(0) NULL DEFAULT NULL;
-- The following query should be faster than the delete statement:
UPDATE a INNER JOIN b ON a.id = b.id SET a.deleted = '';
-- This is the catch, you need to alter the rest
-- of your queries to take the new column into account:
SELECT * FROM a WHERE deleted IS NULL;
-- You can then issue the following queries in a cronjob
-- to clean up the tables:
DELETE FROM a WHERE deleted IS NOT NULL;
If that, too, is not what you want, you can have a look at what the mysql docs have to say about the speed of delete statements.
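The flag-then-purge flow can be walked through end to end with a small sketch. It uses Python's built-in sqlite3 module as a stand-in (an assumption: SQLite has no CHAR(0) type and no UPDATE ... JOIN, so a nullable TEXT column and an IN subquery are used instead), and the row counts are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (id INTEGER PRIMARY KEY, deleted TEXT DEFAULT NULL)")
conn.execute("CREATE TABLE b (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO a (id) VALUES (?)", [(i,) for i in range(1, 101)])
conn.executemany("INSERT INTO b (id) VALUES (?)", [(i,) for i in range(1, 23)])
conn.commit()

# Flag the matching rows instead of deleting them ('' = deleted, NULL = live).
conn.execute("UPDATE a SET deleted = '' WHERE id IN (SELECT id FROM b)")
conn.commit()

# Every regular query now has to filter on the flag.
visible = conn.execute("SELECT COUNT(*) FROM a WHERE deleted IS NULL").fetchone()[0]

# Later (for example from a scheduled job) the flagged rows are removed for real.
purged = conn.execute("DELETE FROM a WHERE deleted IS NOT NULL").rowcount
conn.commit()
total = conn.execute("SELECT COUNT(*) FROM a").fetchone()[0]
```

Whether the UPDATE is actually cheaper than the DELETE depends on the engine and the indexes involved, so it is worth timing both on real data.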

BTW, after posting the above on my blog, Baron Schwartz from Percona brought to my attention that his maatkit already has a tool just for this purpose - mk-archiver. http://www.maatkit.org/doc/mk-archiver.html.
It is most likely your best tool for the job.

Obviously the SELECT query that builds the foundation of your DELETE operation is quite fast so I'd think that either the foreign key constraint or the indexes are the reasons for your extremely slow query.
Try
SET foreign_key_checks = 0;
/* ... your query ... */
SET foreign_key_checks = 1;
This would disable the checks on the foreign key. Unfortunately you cannot disable (at least I don't know how) the key-updates with an InnoDB table. With a MyISAM table you could do something like
ALTER TABLE a DISABLE KEYS
/* ... your query ... */
ALTER TABLE a ENABLE KEYS
I actually did not test if these settings would affect the query duration. But it's worth a try.

Connect to the database from a terminal and execute the commands below, noting the time each one takes. You'll find that the times for deleting 10, 100, 1000, 10000 and 100000 records do not scale proportionally.
DELETE FROM #{$table_name} WHERE id < 10;
DELETE FROM #{$table_name} WHERE id < 100;
DELETE FROM #{$table_name} WHERE id < 1000;
DELETE FROM #{$table_name} WHERE id < 10000;
DELETE FROM #{$table_name} WHERE id < 100000;
Deleting 100 thousand records does not take 10 times as long as deleting 10 thousand records.
So, besides finding a faster way to delete records, there are some indirect methods.
1. We can rename table_name to table_name_bak, and then select the records we want to keep from table_name_bak into table_name.
2. To delete 10000 records, we can delete 1000 records 10 times. Here is an example Ruby script that does it.
#!/usr/bin/env ruby
require 'mysql2'
$client = Mysql2::Client.new(
:as => :array,
:host => '10.0.0.250',
:username => 'mysql',
:password => '123456',
:database => 'test'
)
$ids = (1..1000000).to_a
$table_name = "test"
until $ids.empty?
ids = $ids.shift(1000).join(", ")
puts "delete =================="
$client.query("
DELETE FROM #{$table_name}
WHERE id IN ( #{ids} )
")
end

The basic technique for deleting multiple rows from a single MySQL table by the id field:
DELETE FROM tbl_name WHERE id >= 100 AND id <= 200;
This query deletes the rows whose id is between 100 and 200 (inclusive) from the given table.

Related

MySQL: Slow SELECT because of Index / FKEY?

Dear StackOverflow Members
It's my first post, so please be nice :-)
I have a strange SQL behavior which I can't explain, and I can't find any resources that explain it.
I have built a web honeypot which records all access and attacks and displays them on a statistics page.
However, as the data has grown, generating the statistics page has become slower and slower.
I narrowed it down to some SELECT statements which take quite a long time.
The "issue" seems to be an index on a specific column.
*For sure the real issue is my lack of knowledge :-)
Database: mysql
DB schema
Event Table (removed unrelated columns):
Event table size: 30MB
Event table records: 335k
CREATE TABLE `event` (
`EventID` int(11) NOT NULL,
`EventTime` datetime NOT NULL DEFAULT current_timestamp(),
`WEBURL` varchar(50) COLLATE utf8_bin DEFAULT NULL,
`IP` varchar(15) COLLATE utf8_bin NOT NULL,
`AttackID` int(11) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
ALTER TABLE `event`
ADD PRIMARY KEY (`EventID`),
ADD KEY `AttackID` (`AttackID`);
ALTER TABLE `event`
ADD CONSTRAINT `event_ibfk_1` FOREIGN KEY (`AttackID`) REFERENCES `attack` (`AttackID`);
Attack Table
attack table size: 32KB
attack Table records: 11
CREATE TABLE attack (
`AttackID` int(4) NOT NULL,
`AttackName` varchar(30) COLLATE utf8_bin NOT NULL,
`AttackDescription` varchar(70) COLLATE utf8_bin NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
ALTER TABLE `attack`
ADD PRIMARY KEY (`AttackID`);
SLOW Query:
SELECT Count(EventID), IP
FROM event
WHERE AttackID >0
GROUP BY IP
ORDER BY Count(EventID) DESC
LIMIT 5;
RESULT: 5 rows in set (1.220 sec)
(This seems quite long for me, for a simple query)
Now the Strange thing:
If I remove the foreign key relationship the performance of the query is the same.
But if I remove the index on event.AttackID, the same SELECT statement is much faster:
(ALTER TABLE `event` DROP INDEX `AttackID`;)
The result of the SQL SELECT query:
5 rows in set (0.242 sec)
From my understanding indexes on columns which are used in "WHERE" should improve the performance.
Why does removing the index have such an impact on the query?
What can I do to keep the relations between the table and have a faster
SELECT execution?
Cheers
Why does removing the index improve performance?
The query optimizer has multiple ways to resolve a query. For instance, two methods for filtering data are:
Look up the rows that match the where clause in the index and then fetch related data from the data pages.
Scan the index.
This doesn't get into the use of indexes for joins or aggregations or alternative algorithms.
Which is better? Under some circumstances, the first method is horribly slower than the second. This occurs when the data for the table does not fit into memory. Under such circumstances, the index can read a record from page 124 and then from 1068 and then from 124 again and -- well, all sorts of random intertwined reading of pages. Reading data pages in order is usually faster. And when the data doesn't fit into memory, thrashing occurs, which means that a page in memory is aged (overwritten) -- and then needed again.
I'm not saying that is occurring in your case. I am simply saying that what optimizers do is not always obvious. The optimizer has to make judgements based on the nature of the data -- and those judgements are not right 100% of the time. They are usually correct. But there are borderline cases. Sometimes, the issue is out-of-date statistics. Sometimes the issue is that what looks best to the optimizer is not best in practice.
Let me emphasize that optimizers usually do a very good job, and a better job than a person would do. Even if they occasionally come up with suboptimal plans, they are still quite useful.
Get rid of your redundant UNIQUE KEYs. A primary key is a unique key.
Use COUNT(*) rather than COUNT(IP) in your query. They mean the same thing because you declared IP to be NOT NULL.
Your query can be much faster if you stop saying WHERE AttackId>0. Because that column is a FK to the PK of your other table, those values should be nonzero anyway. But to get that speedup you'll need an index on event(IP) something like this.
CREATE INDEX IpDex ON event (IP)
But you're still summarizing a large table, and that will always take time.
It looks like you want to display some kind of leaderboard. You could add a top_ips table, and use an EVENT to populate it, using your query, every few minutes. Then you could display it to your users without incurring the cost of the query every time. This of course would display slightly stale data; only you know whether that's acceptable in your app.
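A rough sketch of the summary-table idea, using Python's built-in sqlite3 module as a stand-in (an assumption: MySQL's event scheduler is not modelled, so the refresh is just a function you would call every few minutes). The top_ips layout and the hits column are hypothetical, and the sample rows are invented.

```python
import sqlite3

def refresh_top_ips(conn, limit=5):
    """Recompute the leaderboard into the (hypothetical) top_ips table."""
    with conn:  # one transaction, so the old contents are replaced all-or-nothing
        conn.execute("DELETE FROM top_ips")
        conn.execute(
            "INSERT INTO top_ips (IP, hits) "
            "SELECT IP, COUNT(*) FROM event WHERE AttackID > 0 "
            "GROUP BY IP ORDER BY COUNT(*) DESC, IP LIMIT ?", (limit,))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE event (EventID INTEGER PRIMARY KEY, "
             "IP TEXT NOT NULL, AttackID INTEGER NOT NULL)")
conn.execute("CREATE TABLE top_ips (IP TEXT PRIMARY KEY, hits INTEGER NOT NULL)")
sample = [("10.0.0.1", 1)] * 3 + [("10.0.0.2", 2)] * 5 + [("10.0.0.3", 1)]
conn.executemany("INSERT INTO event (IP, AttackID) VALUES (?, ?)", sample)
conn.commit()

refresh_top_ips(conn)
# The statistics page reads the tiny summary table instead of scanning event.
leaderboard = conn.execute(
    "SELECT IP, hits FROM top_ips ORDER BY hits DESC").fetchall()
```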
Pro Tip. Read https://use-the-index-luke.com by Markus Winand.
Essentially every part of your query, except for the FKey, conspires to make the query slow.
Your query is equivalent to
SELECT Count(*), IP
FROM event
WHERE AttackID >0
GROUP BY IP
ORDER BY Count(*) DESC
LIMIT 5;
Please use COUNT(*) unless you need to avoid NULL.
If AttackID is rarely >0, the optimal index is probably
ADD INDEX(AttackID, -- for filtering
IP) -- for covering
Else, the optimal index is probably
ADD INDEX(IP, -- to avoid sorting
AttackID) -- for covering
You could simply add both indexes and let the Optimizer decide. Meanwhile, get rid of these, if they exist:
DROP INDEX(AttackID)
DROP INDEX(IP)
because any uses of them are handled by the new indexes.
Furthermore, leaving the 1-column indexes around can confuse the Optimizer into using them instead of the covering index. (This seems to be a design flaw in at least some versions of MySQL/MariaDB.)
"Covering" means that the query can be performed entirely in the index's BTree. EXPLAIN will indicate it with "Using index". A "covering" index speeds up a query by 2x -- but there is a very wide variation on this prediction. ("Using index condition" is something different.)
More on index creation: http://mysql.rjweb.org/doc.php/index_cookbook_mysql
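To see what "covering" looks like in a query plan without a MySQL server, here is a small sketch with Python's built-in sqlite3 module. This is a stand-in and an assumption: SQLite reports "USING COVERING INDEX" in EXPLAIN QUERY PLAN where MySQL's EXPLAIN would show "Using index", and its optimizer is a different one. The index name idx_attack_ip and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE event (EventID INTEGER PRIMARY KEY, WEBURL TEXT, "
             "IP TEXT NOT NULL, AttackID INTEGER NOT NULL)")
# Composite index: AttackID for filtering, IP so the query never touches the table.
conn.execute("CREATE INDEX idx_attack_ip ON event (AttackID, IP)")
conn.executemany(
    "INSERT INTO event (WEBURL, IP, AttackID) VALUES (?, ?, ?)",
    [("/login", "10.0.0.%d" % (i % 7), i % 3) for i in range(700)])
conn.commit()

query = ("SELECT COUNT(*), IP FROM event WHERE AttackID > 0 "
         "GROUP BY IP ORDER BY COUNT(*) DESC LIMIT 5")
plan = " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + query))
top = conn.execute(query).fetchall()
print(plan)
```

The query only references AttackID and IP, both of which live in the index, so the table rows themselves never need to be read.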

Fastest way to remove a HUGE set of row keys from a table via primary key? [duplicate]

I have two tables. Let's call them KEY and VALUE.
KEY is small, somewhere around 1.000.000 records.
VALUE is huge, say 1.000.000.000 records.
Between them there is a connection such that each KEY might have many VALUES. It's not a foreign key but basically the same meaning.
The DDL looks like this
create table KEY (
key_id int,
primary key (key_id)
);
create table VALUE (
key_id int,
value_id int,
primary key (key_id, value_id)
);
Now, my problem. About half of all key_ids in VALUE have been deleted from KEY and I need to delete them in an orderly fashion while both tables are still under high load.
It would be easy to do
delete v
from VALUE v
left join KEY k using (key_id)
where k.key_id is null;
However, as it's not allowed to have a limit on multi table delete I don't like this approach. Such a delete would take hours to run and that makes it impossible to throttle the deletes.
Another approach is to create cursor to find all missing key_ids and delete them one by one with a limit. That seems very slow and kind of backwards.
Are there any other options? Some nice tricks that could help?
Any solution that tries to delete so much data in one transaction is going to overwhelm the rollback segment and cause a lot of performance problems.
A good tool to help is pt-archiver. It performs incremental operations on moderate-sized batches of rows, as efficiently as possible. pt-archiver can copy, move, or delete rows depending on options.
The documentation includes an example of deleting orphaned rows, which is exactly your scenario:
pt-archiver --source h=host,D=db,t=VALUE --purge \
--where 'NOT EXISTS(SELECT * FROM `KEY` WHERE key_id=`VALUE`.key_id)' \
--limit 1000 --commit-each
Executing this will take significantly longer to delete the data, but it won't use too many resources and it won't interrupt service on your existing database. I have used it successfully to purge hundreds of millions of rows of outdated data.
pt-archiver is part of the Percona Toolkit for MySQL, a free (GPL) set of scripts that help with common tasks on MySQL and compatible databases.
Directly from MySQL documentation
If you are deleting many rows from a large table, you may exceed the
lock table size for an InnoDB table. To avoid this problem, or simply
to minimize the time that the table remains locked, the following
strategy (which does not use DELETE at all) might be helpful:
Select the rows not to be deleted into an empty table that has the same structure as the original table:
INSERT INTO t_copy SELECT * FROM t WHERE ... ;
Use RENAME TABLE to atomically move the original table out of the way and rename the copy to the original name:
RENAME TABLE t TO t_old, t_copy TO t;
Drop the original table:
DROP TABLE t_old;
No other sessions can access the tables involved while RENAME TABLE
executes, so the rename operation is not subject to concurrency
problems. See Section 12.1.9, “RENAME TABLE Syntax”.
So in your case you could do:
INSERT INTO value_copy SELECT * FROM `VALUE` WHERE key_id IN
(SELECT key_id FROM `KEY`);
RENAME TABLE `VALUE` TO value_old, value_copy TO `VALUE`;
DROP TABLE value_old;
And according to what they wrote here, the RENAME operation is quick and the number of records doesn't affect it.
What about this for having a limit?
delete x
from `VALUE` x
join (select key_id, value_id
from `VALUE` v
left join `KEY` k using (key_id)
where k.key_id is null
limit 1000) y
on x.key_id = y.key_id AND x.value_id = y.value_id;
First, examine your data. Find the keys which have too many values to be deleted "fast". Then find out at which times of day you have the smallest load on the system. Perform the deletion of the "bad" keys during that time. For the rest, start deleting them one by one with some downtime between deletes so that you don't put too much pressure on the database while you do it.
Maybe, instead of using LIMIT, divide the whole set of rows into small parts by key_id:
delete v
from VALUE v
left join KEY k using (key_id)
where k.key_id is null and v.key_id > 0 and v.key_id < 100000;
then delete rows with key_id in 100000..200000 and so on.
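The range-by-range idea can be driven from a script. Below is a minimal sketch using Python's built-in sqlite3 module as a stand-in for MySQL (an assumption, so NOT EXISTS replaces the multi-table DELETE syntax, which SQLite lacks). The function name, step size and pause are hypothetical, and key_id is assumed to be non-negative. It uses half-open ranges (lo <= key_id < lo + step) so that no boundary key is skipped between consecutive ranges.

```python
import sqlite3
import time

def delete_orphans_by_range(conn, step, pause=0.0):
    """Delete VALUE rows whose key_id is missing from KEY, one key range at a time."""
    max_key = conn.execute(
        'SELECT COALESCE(MAX(key_id), 0) FROM "VALUE"').fetchone()[0]
    deleted = 0
    for lo in range(0, max_key + 1, step):
        cur = conn.execute(
            'DELETE FROM "VALUE" WHERE key_id >= ? AND key_id < ? '
            'AND NOT EXISTS (SELECT 1 FROM "KEY" k WHERE k.key_id = "VALUE".key_id)',
            (lo, lo + step))
        conn.commit()        # one short transaction per range
        deleted += cur.rowcount
        time.sleep(pause)    # throttle so other traffic can get through
    return deleted

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "KEY" (key_id INTEGER PRIMARY KEY)')
conn.execute('CREATE TABLE "VALUE" (key_id INTEGER, value_id INTEGER, '
             'PRIMARY KEY (key_id, value_id))')
# Keys 1..20 each have three values, but only the even keys still exist in KEY.
conn.executemany('INSERT INTO "KEY" VALUES (?)', [(k,) for k in range(2, 21, 2)])
conn.executemany('INSERT INTO "VALUE" VALUES (?, ?)',
                 [(k, v) for k in range(1, 21) for v in range(3)])
conn.commit()

deleted = delete_orphans_by_range(conn, step=7)
remaining = conn.execute('SELECT COUNT(*) FROM "VALUE"').fetchone()[0]
```

Because the primary key of VALUE starts with key_id, each range maps onto a contiguous slice of the index, and the pause gives a simple knob for throttling.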
You can try to delete in separated transaction batches.
This is for MSSQL, but should be similar.
declare @i INT
declare @step INT
set @i = 0
set @step = 100000
while (@i< (select max(VALUE.key_id) from VALUE))
BEGIN
BEGIN TRANSACTION
delete from VALUE where
VALUE.key_id between @i and @i+@step and
not exists(select 1 from KEY where KEY.key_id = VALUE.key_id and KEY.key_id between @i and @i+@step)
set @i = (@i+@step)
COMMIT TRANSACTION
END
Create a temporary table!
drop table if exists batch_to_delete;
create temporary table batch_to_delete as
select v.* from `VALUE` v
left join `KEY` k on k.key_id = v.key_id
where k.key_id is null
limit 10000; -- tailor batch size to your taste
-- optional but may help for large batch size
create index batch_to_delete_ix_key on batch_to_delete(key_id);
create index batch_to_delete_ix_value on batch_to_delete(value_id);
-- do the actual delete
delete v from `VALUE` v
join batch_to_delete d on d.key_id = v.key_id and d.value_id = v.value_id;
To me, this is the kind of task whose progress I would want to see in a log file. I would avoid solving this in pure SQL and would instead use some scripting in Python or a similar language. Another thing that would bother me is that lots of LEFT JOINs with WHERE ... IS NOT NULL between the tables might cause unwanted locks, so I would avoid JOINs too.
Here is some pseudo code:
max_key = select_db('SELECT MAX(key) FROM VALUE')
while max_key > 0:
    cur_range = range(max_key, max_key-100, -1)
    good_keys = select_db('SELECT key FROM KEY WHERE key IN (%s)' % cur_range)
    keys_to_del = set(cur_range) - set(good_keys)
    while 1:
        deleted_count = update_db('DELETE FROM VALUE WHERE key IN (%s) LIMIT 1000' % keys_to_del)
        db_commit
        log_something
        if not deleted_count:
            break
    max_key -= 100
This should not bother the rest of the system very much, but may take long. Another issue is to optimize the table after you deleted all those rows, but this is another story.
If the target columns are properly indexed this should go fast,
DELETE FROM `VALUE`
WHERE NOT EXISTS(SELECT 1 FROM `key` k WHERE k.key_id = `VALUE`.key_id)
-- ORDER BY key_id, value_id -- order by PK is good idea, but check the performance first.
LIMIT 1000
Adjust the limit somewhere between 10 and 10000 to get acceptable performance, and rerun the statement several times.
Also keep in mind that such mass deletes take locks and keep a backup (for rollback) of each row,
which multiplies the execution time per row several times over.
There are some advanced methods to prevent this, but the easiest workaround
is just to put a transaction around this query.
Do you have SLAVE or Dev/Test environment with same data?
The first step is to find out your data distribution if you are worried about a particular key having 1 million value_ids
SELECT v.key_id, COUNT(IFNULL(k.key_id,1)) AS cnt
FROM `value` v LEFT JOIN `key` k USING (key_id)
WHERE k.key_id IS NULL
GROUP BY v.key_id ;
EXPLAIN PLAN for above query is much better than adding
ORDER BY COUNT(IFNULL(k.key_id,1)) DESC ;
Since you don't have partitioning on key_id (too many partitions in your case) and want to keep the database running during your delete process, the option is to delete in chunks with SLEEP() between different key_id deletes to avoid overwhelming the server. Don't forget to keep an eye on your binary logs so the disk doesn't fill up.
The quickest way is :
Stop application so data is not changed.
Dump key_id and value_id from VALUE table with only matching key_id in KEY table by using
mysqldump YOUR_DATABASE_NAME value --where="key_id in (select key_id from YOUR_DATABASE_NAME.key)" --lock-all --opt --quick --quote-names --skip-extended-insert > VALUE_DATA.txt
Truncate VALUE table
Load data exported in step 2
Start Application
As always, try this in Dev/Test environment with Prod data and same infrastructure so you can calculate downtime.
Hope this helps.
I am just curious what the effect would be of adding a non-unique index on key_id in table VALUE. Selectivity is not high at all (~0.001) but I am curious how that would affect the join performance.
Why don't you split your VALUE table into several ones according to some rule like key_id modulo some power of 2 (like 256 for example)?
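A tiny sketch of what such a routing rule could look like on the application side. The VALUE_nnn table naming and the shard count of 256 are assumptions made up for illustration, not something from the question.

```python
SHARDS = 256  # assumed shard count; a power of two lets us use a bit mask

def shard_table(key_id, shards=SHARDS):
    """Map a key_id to a hypothetical per-shard table name such as VALUE_017."""
    if shards <= 0 or shards & (shards - 1):
        raise ValueError("shards must be a positive power of two")
    # For a power-of-two shard count, masking is the same as key_id % shards.
    return "VALUE_%03d" % (key_id & (shards - 1))

print(shard_table(17))    # VALUE_017
print(shard_table(259))   # VALUE_003
```

With this layout, all values of a key live in one table, so removing a deleted key only touches a single, much smaller table.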

MySQL Select... for update with index has concurrency issue

This is a follow up on my previous question (you can skip it as I explain in this post the issue):
MySQL InnoDB SELECT...LIMIT 1 FOR UPDATE Vs UPDATE ... LIMIT 1
Environment:
JSF 2.1 on Glassfish
JPA 2.0 EclipseLink and JTA
MySQL 5.5 InnoDB engine
I have a table:
CREATE TABLE v_ext (
v_id INT NOT NULL AUTO_INCREMENT,
product_id INT NOT NULL,
code VARCHAR(20),
username VARCHAR(30),
PRIMARY KEY (v_id)
) ENGINE=InnoDB DEFAULT CHARSET=UTF8;
It is populated with 20,000 records like this one (product_id is 54 for all records, code is randomly generated and unique, username is set to NULL):
v_id product_id code username
-----------------------------------------------------
1 54 '20 alphanumerical' NULL
...
20,000 54 '20 alphanumerical' NULL
When a user purchases product 54, he gets a code from that table. If the user purchases multiple times, he gets a code each time (there is no unique constraint on username). Because I am preparing for high activity, I want to make sure that:
No concurrency/deadlock can occur
Performance is not impacted by the locking mechanism which will be needed
From the SO question (see link above) I found that doing such a query is faster:
START TRANSACTION;
SELECT v_id FROM v_ext WHERE username IS NULL LIMIT 1 FOR UPDATE;
-- Use result for next query
UPDATE v_ext SET username=xxx WHERE v_id=...;
COMMIT;
However, I found a deadlock issue ONLY when using an index on the username column. I thought adding an index would speed things up a little, but it creates a deadlock after about 19,970 records (quite consistently at this number of rows). Is there a reason for this? I don't understand. Thank you.
From a purely theoretical point of view, it looks like you are not locking the right rows (different condition in the first statement than in the update statement; besides you only lock one row because of LIMIT 1, whereas you possibly update more rows later on).
Try this:
START TRANSACTION;
SELECT v_id FROM v_ext WHERE username IS NULL AND v_id=yyy FOR UPDATE;
UPDATE v_ext SET username=xxx WHERE v_id=yyy;
COMMIT;
[edit]
As for the reason for your deadlock, this is the probable answer (from the manual):
If you have no indexes suitable for your statement and MySQL must scan
the entire table to process the statement, every row of the table
becomes locked (...)
Without an index, the SELECT ... FOR UPDATE statement is likely to lock the entire table, whereas with an index, it only locks some rows. Because you didn't lock the right rows in the first statement, an additional lock is acquired during the second statement.
Obviously, a deadlock cannot happen if the whole table is locked (i.e. without an index).
A deadlock can certainly occur in the second setup.
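To check this on your own setup rather than in theory, two standard statements may help. This is only a sketch against the `v_ext` table from the question:

```sql
-- Shows whether the locking read uses the username index
-- (key column) or scans the whole table (type = ALL).
EXPLAIN SELECT v_id FROM v_ext WHERE username IS NULL LIMIT 1 FOR UPDATE;

-- After a deadlock has happened, the "LATEST DETECTED DEADLOCK"
-- section of this output lists both transactions and the locks
-- each one held and was waiting for.
SHOW ENGINE INNODB STATUS;
```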
First of all, the definition of the table is wrong. You have no tid column in the table, so I suspect the primary key is v_id.
Second of all, if you select for update, you lock the row. Any other select coming until the first transaction is done will wait for the row to be cleared, because it will hit the exact same record. So you will have waits for this row.
However, I doubt this can be a really serious problem in your case, because first of all, you have the username there, and second of all you have the product id there. It is extremely unlikely that you will have a lot of hits on the exact same record you hit initially, and even if you do, the transaction should run very fast.
You have to understand that by using transactions, you usually give up pretty much on concurrency for consistent data. There is no way to support consistency of data and concurrency at the same time.

MySQL taking forever 'sending data'. Simple query, lots of data

I'm trying to run what I believe to be a simple query on a fairly large dataset, and it's taking a very long time to execute -- it stalls in the "Sending data" state for 3-4 hours or more.
The table looks like this:
CREATE TABLE `transaction` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`uuid` varchar(36) NOT NULL,
`userId` varchar(64) NOT NULL,
`protocol` int(11) NOT NULL,
... A few other fields: ints and small varchars
`created` datetime NOT NULL,
PRIMARY KEY (`id`),
KEY `uuid` (`uuid`),
KEY `userId` (`userId`),
KEY `protocol` (`protocol`),
KEY `created` (`created`)
) ENGINE=InnoDB AUTO_INCREMENT=61 DEFAULT CHARSET=utf8 ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=4 COMMENT='Transaction audit table'
And the query is here:
select protocol, count(distinct userId) as count from transaction
where created > '2012-01-15 23:59:59' and created <= '2012-02-14 23:59:59'
group by protocol;
The table has approximately 222 million rows, and the where clause in the query filters down to about 20 million rows. The distinct option will bring it down to about 700,000 distinct rows, and then after grouping, (and when the query finally finishes), 4 to 5 rows are actually returned.
I realize that it's a lot of data, but it seems that 4-5 hours is an awfully long time for this query.
Thanks.
Edit: For reference, this is running on AWS on a db.m2.4xlarge RDS database instance.
Why don't you profile a query and see what exactly is happening?
SET PROFILING = 1;
SET profiling_history_size = 0;
SET profiling_history_size = 15;
/* Your query should be here */
SHOW PROFILES;
SELECT state, ROUND(SUM(duration),5) AS `duration (summed) in sec` FROM information_schema.profiling WHERE query_id = 3 GROUP BY state ORDER BY `duration (summed) in sec` DESC;
SET PROFILING = 0;
EXPLAIN /* Your query again should appear here */;
I think this will help you in seeing where exactly query takes time and based on result you can perform optimization operations.
This is a really heavy query. To understand why it takes so long you should understand the details.
You have a range condition on an indexed field. That is, MySQL finds the smallest matching created value in the index, and for each entry in the range it gets the corresponding primary key from the index, retrieves the row from disk, fetches the required fields (protocol, userId) that are missing from the index record, and puts them in a "temporary table", then does the grouping on those 700000 rows. The index can be used, and is used here, only to speed up the range condition.
The only way to speed it up is to have an index that contains all the necessary data, so that MySQL does not need to do on-disk lookups for the rows. That is called a covering index. But you should understand that the index will reside in memory and will contain ~ sizeOf(created+protocol+userId+PK)*rowCount bytes, which may itself become a burden for the queries that update the table and for other indexes. It is easier to create a separate aggregates table and periodically update it using your query.
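As a rough sketch of both options: the index name `idx_created_protocol_user` and the table `transaction_daily_users` are made up for illustration, and building an index on 222 million rows will itself take a long time.

```sql
-- Option 1: covering index. The range column comes first; protocol
-- and userId are included so the index alone can answer the query.
ALTER TABLE `transaction`
  ADD INDEX idx_created_protocol_user (created, protocol, userId);

-- With the index in place, Extra should include "Using index".
EXPLAIN
SELECT protocol, COUNT(DISTINCT userId) AS `count`
FROM `transaction`
WHERE created > '2012-01-15 23:59:59' AND created <= '2012-02-14 23:59:59'
GROUP BY protocol;

-- Option 2: aggregates table with one row per (day, protocol, user).
CREATE TABLE transaction_daily_users (
  day DATE NOT NULL,
  protocol INT NOT NULL,
  userId VARCHAR(64) NOT NULL,
  PRIMARY KEY (day, protocol, userId)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

-- Run periodically, e.g. once per day for the previous day.
-- INSERT IGNORE skips combinations that are already stored.
INSERT IGNORE INTO transaction_daily_users (day, protocol, userId)
SELECT DATE(created), protocol, userId
FROM `transaction`
WHERE created >= '2012-02-14 00:00:00' AND created < '2012-02-15 00:00:00';

-- The report then reads the much smaller table.
SELECT protocol, COUNT(DISTINCT userId) AS `count`
FROM transaction_daily_users
WHERE day BETWEEN '2012-01-16' AND '2012-02-14'
GROUP BY protocol;
```

The aggregates table keeps per-day distinct users rather than per-day counts, because distinct counts for separate days cannot simply be added together to get the count for a range.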
Both distinct and group by will need to sort and store temporary data on the server. With that much data that might take a while.
Indexing different combinations of userId, created and protocol will help, but I can't say how much or what index will help the most.
Starting from a certain version of MariaDB (maybe since 10.5), I noticed that after importing a dump with
mysql dbname < dump.sql
the optimizer's picture of the tables no longer matches reality, so it makes the wrong decisions about indexes.
In general, even listing InnoDB tables in phpMyAdmin becomes very slow.
I noticed that running
ANALYZE TABLE myTable;
fixes it.
So after each import I run the following, which is equivalent to running ANALYZE on every table:
mysqlcheck -aA

MySQL Query Optimization for large table

I have a very large database of images and I need to run an update to increment the view count on the images. Every hour there are over one million unique rows to update. Right now it takes about an hour to run this query. Is there any way to make it run faster?
I'm creating a memory table:
CREATE TABLE IF NOT EXISTS tmp_views_table (
`key` VARCHAR(7) NOT NULL,
views INT NOT NULL,
primary key ( `key` )
) ENGINE = MEMORY
Then I insert 1000 views at a time using a loop that runs until all the views have been inserted into the memory table:
insert low_priority into tmp_views_table
values ('key', 'count'),('key', 'count'),('key', 'count'), etc...
Then I run an update on the actual table like this:
update images, tmp_views_table
set images.views = images.views+tmp_views_table.views
where images.key = tmp_views_table.key
This last update is the one that takes around an hour; the memory table steps run pretty quickly.
Is there a faster way that I can do this update?
You are using InnoDB, right? Try general tuning of MySQL and the InnoDB engine to allow for faster data changes.
I suppose you have an index on the key field of the images table. You can also try your update query without the index on the memory table; in that case the query optimizer should choose a full table scan of the memory table.
I have never used joins with UPDATE statements, so I don't know exactly how it is executed, but maybe the JOIN is taking too long. Maybe you can post an EXPLAIN result of that query.
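For what it's worth, MySQL versions before 5.6 only accept EXPLAIN on SELECT statements, so one way to get a plan is to explain the equivalent join written as a SELECT. A sketch using the table and column names from the question:

```sql
-- Same join as the UPDATE, written as a SELECT so that
-- EXPLAIN accepts it on older MySQL versions.
EXPLAIN
SELECT images.views, tmp_views_table.views
FROM images
JOIN tmp_views_table ON images.`key` = tmp_views_table.`key`;
```

If the row for images shows type ALL instead of eq_ref or ref, the join is not using an index on images.key, which would go a long way toward explaining the hour-long run.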
Here is what I have used in one project to do something similar: insert/update real-time data into a temp table and merge it into an aggregate table once a day. You can try it and see whether it executes faster.
INSERT INTO st_views_agg (pageid,pagetype,day,count)
SELECT pageid,pagetype,DATE(`when`) AS day, COUNT(*) AS count FROM st_views_pending WHERE (pagetype=4) GROUP BY pageid,pagetype,day
ON DUPLICATE KEY UPDATE count=count+VALUES(count);