Slow time updating primary key indexed row - MySQL

I have a query that updates a field in a table using the primary key to locate the row. The table can contain many rows where the date/time field is initially NULL, and then is updated with a date/time stamp using NOW().
When I run the update statement on the table, I am getting a slow query log entry (3.38 seconds). The log indicates that 200,000 rows were examined. Why would that many rows be examined if I am using the PK to identify the row being updated?
The primary key is (item_id, customer_id). I have verified that the PRIMARY key is correct in the MySQL table structure.
UPDATE cust_item
SET status = 'approved',
lstupd_dtm = NOW()
WHERE customer_id = '7301'
AND item_id = '12498';

I wonder if it's a hardware issue.
While the changes I've mentioned in comments might help slightly, in truth, I cannot replicate this issue...
I have a data set of roughly 1m rows...:
CREATE TABLE cust_item
(customer_id INT NOT NULL
,item_id INT NOT NULL
,status VARCHAR(12) NULL
,PRIMARY KEY(customer_id,item_id)
);
-- INSERT some random rows...
SELECT COUNT(*)
, SUM(customer_id = 358) dense
, SUM(item_id=12498) sparse
FROM cust_item;
+----------+-------+--------+
| COUNT(*) | dense | sparse |
+----------+-------+--------+
|  1047720 |   104 |      8 |
+----------+-------+--------+
UPDATE cust_item
SET status = 'approved'
WHERE item_id = '12498'
AND customer_id = '358';
Query OK, 1 row affected (0.00 sec)
Rows matched: 1 Changed: 1 Warnings: 0
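If you can reproduce the slowness, it's also worth checking which access path the optimizer actually chooses for the original statement (MySQL 5.6+ can EXPLAIN an UPDATE directly; on older versions, EXPLAIN the equivalent SELECT). You'd expect the PRIMARY key with a rows estimate of 1; anything else would explain the 200,000 rows examined:
EXPLAIN UPDATE cust_item
SET status = 'approved',
    lstupd_dtm = NOW()
WHERE customer_id = '7301'
AND item_id = '12498';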

How long does it take to select the record, without the update?
If the select is fast, then you need to look into things that can affect update/write speed:
too many indexes on the table, don't forget filtered indexes and indexed views
the index pages have 0 fill factor and need to split to accommodate the data change.
referential constraints with cascade
triggers
slow write speed at the storage level
If the select is slow:
old/bad statistics on the index
extreme fragmentation
columnstore index with too many open rowgroups
If the select speed improves significantly after the first time, you may be having some cold buffer performance issues. That could point to storage I/O problems as well.
You may also be having concurrency issues caused by another process locking the table momentarily.
Finally, any chance the tool executing the query is returning a false duration? For example, SQL Server Management Studio can occasionally be slow to return a large resultset, even if the server handled it very quickly.
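As a quick first check (a sketch using the identifiers from the question), time the lookup on its own:
SELECT status, lstupd_dtm
FROM cust_item
WHERE customer_id = '7301'
AND item_id = '12498';
If that returns in milliseconds while the UPDATE takes seconds, the time is going into the write path (indexes, triggers, constraints, storage); if the SELECT itself is slow, the lookup is the problem.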

Related

MariaDB - Indexing not improving performance on char(255) field

I'm trying to execute this SQL query on a table which has a little shy of 1 million records:
SELECT * FROM enty_score limit 100;
It gives me result in about 600 ms
As soon as I add a where clause on a field `dim_agg_strategy` char(255) DEFAULT NULL, it takes 40 seconds to execute:
SELECT * FROM enty_score WHERE dim_agg_strategy='COMPOSITE_AVERAGE_LAKE' limit 100;
I've tried to create an index, but there is no improvement; it still takes 40 seconds to execute the same query:
ALTER TABLE `enty_score` ADD INDEX `dim_agg_strategy_index` (`dim_agg_strategy`);
SELECT INDEX_NAME, COLUMN_NAME, CARDINALITY, NULLABLE, INDEX_TYPE
FROM information_schema.statistics where INDEX_NAME = 'dim_agg_strategy_index';
INDEX_NAME            |COLUMN_NAME     |CARDINALITY|NULLABLE|INDEX_TYPE|
----------------------+----------------+-----------+--------+----------+
dim_agg_strategy_index|dim_agg_strategy|        586|YES     |BTREE     |
A little more info: the column I have placed in the where clause contains just 6 distinct values:
select distinct dim_agg_strategy from enty_score;
dim_agg_strategy |
-------------------------+
COMPOSITE_AVERAGE |
COMPOSITE_AVERAGE_ALL |
COMPOSITE_AVERAGE_LAKE |
COMPOSITE_AVERAGE_NONLAKE|
NORMALISED_AVERAGE |
SIMPLE_AVERAGE |
The optimizer noticed that there were very few different values for that indexed column. So it realized that a lot of the rows would be needed. So it decided to simply plow through the table and not bother with the index. (Using the index would involve bouncing back and forth between the index's BTree and the data's BTree a lot.)
So, you counter by pointing out the LIMIT 100. That is a valid objection. Alas, it points out a deficiency in the Optimizer.
It is torn between
Ignore the index, which is likely to be optimal if it needed to scan the entire table. Note: That would happen if the 100 rows you needed happened to be at the end of the table.
Use the index, but pay the extra overhead. Here it is failing to realize that 100 is a lot less than 1M, hence improving the odds that the index would usually be the best approach.
Let's try to fool it... DROP that index and add another index. This time put 2 columns:
(dim_agg_strategy, xx)
where xx is some other column.
(Let me know if this trick works for you.)
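A concrete sketch of that idea (here `enty_id` stands in for whatever second column you pick; adjust to your schema):
ALTER TABLE `enty_score` DROP INDEX `dim_agg_strategy_index`;
ALTER TABLE `enty_score` ADD INDEX `dim_agg_strategy_2col` (`dim_agg_strategy`, `enty_id`);
The wider index changes the optimizer's cost estimates, and if a query only needs the indexed columns it can be answered from the index alone, with no bouncing to the data BTree at all.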

What is the default select order in PostgreSQL or MySQL?

I have read in the PostgreSQL docs that without an ORDER statement, SELECT will return records in an unspecified order.
Recently, in an interview, I was asked how to SELECT records in the order that they were inserted, without a PK or created_at or other field that could be used for ordering. The senior dev who interviewed me was insistent that without an ORDER statement the records will be returned in the order that they were inserted.
Is this true for PostgreSQL? Is it true for MySQL? Or any other RDBMS?
I can answer for MySQL. I don't know for PostgreSQL.
The default order is not the order of insertion, generally.
In the case of InnoDB, the default order depends on the order of the index read for the query. You can get this information from the EXPLAIN plan.
For MyISAM, it returns rows in the order they are read from the table. This might be the order of insertion, but MyISAM will reuse gaps left after you delete records, so newer rows may be stored earlier.
None of this is guaranteed; it's just a side effect of the current implementation. MySQL could change the implementation in the next version, making the default order of result sets different, without violating any documented behavior.
So if you need the results in a specific order, you should use ORDER BY on your queries.
Following BK's answer, and by way of example...
DROP TABLE IF EXISTS my_table;
CREATE TABLE my_table(id INT NOT NULL) ENGINE = MYISAM;
INSERT INTO my_table VALUES (1),(9),(5),(8),(7),(3),(2),(6);
DELETE FROM my_table WHERE id = 8;
INSERT INTO my_table VALUES (4),(8);
SELECT * FROM my_table;
+----+
| id |
+----+
|  1 |
|  9 |
|  5 |
|  4 | -- is this what
|  7 |
|  3 |
|  2 |
|  6 |
|  8 | -- we expect?
+----+
In the case of PostgreSQL, that is quite wrong.
If there are no deletes or updates, rows will be stored in the table in the order you insert them. And even though a sequential scan will usually return the rows in that order, that is not guaranteed: the synchronized sequential scan feature of PostgreSQL can have a sequential scan "piggy back" on an already executing one, so that rows are read starting somewhere in the middle of the table.
However, this ordering of the rows breaks down completely if you update or delete even a single row: the old version of the row will become obsolete, and (in the case of an UPDATE) the new version can end up somewhere entirely different in the table. The space for the old row version is eventually reclaimed by autovacuum and can be reused for a newly inserted row.
Without an ORDER BY clause, the database is free to return rows in any order. There is no guarantee that rows will be returned in the order they were inserted.
With MySQL (InnoDB), we observe that rows are typically returned in the order of an index used in the execution plan, or of the cluster key of the table.
It is not difficult to craft an example...
CREATE TABLE foo
( id INT NOT NULL
, val VARCHAR(10) NOT NULL DEFAULT ''
, UNIQUE KEY (id,val)
) ENGINE=InnoDB;
INSERT INTO foo (id, val) VALUES (7,'seven') ;
INSERT INTO foo (id, val) VALUES (4,'four') ;
SELECT id, val FROM foo ;
MySQL is free to return rows in any order, but in this case, we would typically observe that MySQL will access rows through the InnoDB cluster key.
id val
---- -----
4 four
7 seven
Not at all clear what point the interviewer was trying to make. If the interviewer is trying to sell the idea that, given a requirement to return rows from a table in the order the rows were inserted, a query without an ORDER BY clause is ever the right solution, I'm not buying it.
We can craft examples where rows are returned in the order they were inserted, but that is a byproduct of the implementation, ... not guaranteed behavior, and we should never rely on that behavior to satisfy a specification.
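By way of contrast, a minimal sketch of the reliable approach, assuming we are free to add a surrogate key that records insertion order:
CREATE TABLE foo2
( seq BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY
, id INT NOT NULL
, val VARCHAR(10) NOT NULL DEFAULT ''
) ENGINE=InnoDB;
-- insertion order is now recorded explicitly, so we can ask for it:
SELECT id, val FROM foo2 ORDER BY seq;
(With concurrent sessions, auto-increment assignment order can still differ from commit order, as the next question illustrates, but for a single session this captures insertion order.)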

MySQL High Concurrency Updates

I have a mysql table:
CREATE TABLE `coupons` (
  `id` INT NOT NULL AUTO_INCREMENT,
  `code` VARCHAR(255),
  `user_id` INT,
  PRIMARY KEY (`id`),
  UNIQUE KEY `code_idx` (`code`)
) ENGINE=InnoDB;
The table consists of thousands/millions of codes and initially user_id is NULL for everyone.
Now I have a web application which assigns a unique code to thousands of users visiting the application concurrently. I am not sure of the correct way to handle this, considering the very high traffic.
The query I have written is:
UPDATE coupons SET user_id = <some_id> where user_id is NULL limit 1;
And the application runs this query with say a concurrency of 1000 req/sec.
What I have observed is the entire table gets locked and this is not scaling well.
What should I do?
Thanks.
As I understand it, coupons is prepopulated and a NULL user_id is updated to one that is not NULL.
explain update coupons set user_id = 1 where user_id is null limit 1;
This likely requires an architectural solution, but you may wish to review the EXPLAIN after ensuring that the table has indexes on the columns involved, and that they facilitate rapid updates.
Adding an index to coupons.user_id, for example, alters MySQL's strategy:
create unique index user_id_idx on coupons(user_id);
explain update coupons set user_id = 1 where user_id is null limit 1;
+----+-------------+---------+------------+-------+---------------+-------------+---------+-------+------+----------+------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+---------+------------+-------+---------------+-------------+---------+-------+------+----------+------------------------------+
| 1 | UPDATE | coupons | NULL | range | user_id_idx | user_id_idx | 5 | const | 6 | 100.00 | Using where; Using temporary |
+----+-------------+---------+------------+-------+---------------+-------------+---------+-------+------+----------+------------------------------+
1 row in set (0.01 sec)
So you should work with a DBA to ensure that the database entity is optimized. Trade-offs need to be considered.
Also, since you have a client application, you have the opportunity to pre-fetch null coupons.user_id and do an update directly on coupons.id. Curious to hear of your solution.
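A minimal sketch of that pre-fetch pattern (ids and values here are illustrative); repeating the user_id IS NULL test in the UPDATE guards against two requests claiming the same row:
SELECT id FROM coupons WHERE user_id IS NULL LIMIT 1;
-- suppose the application reads id 42, then:
UPDATE coupons SET user_id = 7 WHERE id = 42 AND user_id IS NULL;
-- if this reports 0 rows affected, another request won the race: fetch another id and retry.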
This question might be more suitable for DBAs (and I'm not a DBA), but I'll try to give you some idea of what's going on.
InnoDB does not actually lock the whole table when you perform your update query. What it does is this: it sets a record lock which prevents any other transaction from inserting, updating, or deleting rows where the value of coupons.user_id is NULL.
With the query you have at the moment (which depends on user_id being NULL), you cannot have concurrency, because your transactions will run one after another, not in parallel.
Even an index on coupons.user_id won't help, because when setting the lock, InnoDB creates a shadow index for you if you don't have one. The outcome would be the same.
So, if you want to increase your throughput, there are two options I can think of:
Assign a user to a coupon in async mode. Put all assignment request in a queue then process the queue in background. Might not be suitable for your business rules.
Decrease the number of locked records. The idea here is to lock as few records as possible while performing the update. To achieve this, you can add one or more indexed columns to your table, then use the index in the WHERE clause of your UPDATE query.
An example of such a column is a product_id, or a category, or maybe a user location (country, zip).
Then your query will look something like this:
UPDATE coupons SET user_id = <some_id> WHERE product_id = <product_id> AND user_id IS NULL LIMIT 1;
And now InnoDB will lock only the records with product_id = <product_id>; this way you'll have concurrency.
Hope this helps!

Can a row with a bigger auto-increment value appear sooner than a row with a smaller one?

Is the following scenario possible?
MySQL version 5.6.15
CREATE TABLE my_table (
id int NOT NULL AUTO_INCREMENT,
data1 int,
data2 timestamp,
PRIMARY KEY (id)
) ENGINE=InnoDB;
innodb_autoinc_lock_mode = 1
AUTO_INCREMENT=101
0 ms: Query A is run: INSERT INTO my_table (data1, data2) VALUES (101, FROM_UNIXTIME(1418501101)), (102, FROM_UNIXTIME(1418501102)), .. [200 values in total] .., (300, FROM_UNIXTIME(1418501300));
500 ms: Query B is run: INSERT INTO my_table (data1, data2) VALUES (301, FROM_UNIXTIME(1418501301));
505 ms: Query B is completed. The row gets id=301.
1000 ms: SELECT id FROM my_table WHERE id >= 300; — would return one row (id=301).
1200 ms: Query A is completed. The rows get id=101 to id=300.
1500 ms: SELECT id FROM my_table WHERE id >= 300; — would return two rows (id=300, id=301).
In other words, is it possible that the row with id=301 can be selected earlier than it's possible to select the row with id=300?
And how to avoid it, if it is possible?
Why is query A taking more than a second to run? Yuck! And yes, what you're seeing is exactly how I'd expect it to behave.
Primary keys 101 through 300 are reserved immediately while inserting the new rows. This takes a couple of milliseconds. It then spends more than 1 second rebuilding the indexes and half way through doing all of that you ran another query which inserted a new row, using the next available auto_increment: 301.
As Alin Purcaru said, you can avoid this specific issue by changing the lock mode, but it will cause performance issues (instead of executing in 5 milliseconds, query B will take 700 milliseconds to execute).
Under high load situations the problem gets exponentially worse and you'll eventually see "too many connection" errors, effectively bringing your whole database offline.
Also there are other rare situations where auto_increment can give "out of order" increments, which will not be solved by locking.
Personally I think the best option is to use UUIDs instead of auto_increment for the primary key, and then have a timestamp column (as a double with microseconds) to determine what order rows were inserted into the database.
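A minimal sketch of that design (assuming the MySQL 5.6 from the question, which supports fractional-second timestamps):
CREATE TABLE my_table (
  id CHAR(36) NOT NULL,  -- filled with UUID()
  data1 int,
  inserted_at TIMESTAMP(6) NOT NULL DEFAULT CURRENT_TIMESTAMP(6),
  PRIMARY KEY (id)
) ENGINE=InnoDB;
INSERT INTO my_table (id, data1) VALUES (UUID(), 101);
SELECT id, data1 FROM my_table ORDER BY inserted_at;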
So... I read the docs for you.
According to them, what you have in your example is two simple inserts (a multi-row insert is still considered simple, because you know how many rows you will insert, so the end auto-increment value can be determined). With innodb_autoinc_lock_mode = 1, no lock will be set on the table, so your second query can finish before the first and have that row available earlier.
If you want to avoid this, you need to set innodb_autoinc_lock_mode = 0, but you may pay a toll in terms of scalability.
Disclaimer: I haven't tried it and the answer is based on my understanding of the MySQL docs.
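For reference, innodb_autoinc_lock_mode cannot be changed at runtime; it has to be set at server startup, e.g. in my.cnf:
[mysqld]
innodb_autoinc_lock_mode = 0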

Faster way to delete matching rows?

I'm a relative novice when it comes to databases. We are using MySQL and I'm currently trying to speed up a SQL statement that seems to take a while to run. I looked around on SO for a similar question but didn't find one.
The goal is to remove all the rows in table A that have a matching id in table B.
I'm currently doing the following:
DELETE FROM a WHERE EXISTS (SELECT b.id FROM b WHERE b.id = a.id);
There are approximately 100K rows in table a and about 22K rows in table b. The column 'id' is the PK for both tables.
This statement takes about 3 minutes to run on my test box - Pentium D, XP SP3, 2GB ram, MySQL 5.0.67. This seems slow to me. Maybe it isn't, but I was hoping to speed things up. Is there a better/faster way to accomplish this?
EDIT:
Some additional information that might be helpful. Tables A and B have the same structure as I've done the following to create table B:
CREATE TABLE b LIKE a;
Table a (and thus table b) has a few indexes to help speed up queries that are made against it. Again, I'm a relative novice at DB work and still learning. I don't know how much of an effect, if any, this has on things. I assume that it does have an effect as the indexes have to be cleaned up too, right? I was also wondering if there were any other DB settings that might affect the speed.
Also, I'm using InnoDB.
Here is some additional info that might be helpful to you.
Table A has a structure similar to this (I've sanitized this a bit):
DROP TABLE IF EXISTS `frobozz`.`a`;
CREATE TABLE `frobozz`.`a` (
`id` bigint(20) unsigned NOT NULL auto_increment,
`fk_g` varchar(30) NOT NULL,
`h` int(10) unsigned default NULL,
`i` longtext,
`j` bigint(20) NOT NULL,
`k` bigint(20) default NULL,
`l` varchar(45) NOT NULL,
`m` int(10) unsigned default NULL,
`n` varchar(20) default NULL,
`o` bigint(20) NOT NULL,
`p` tinyint(1) NOT NULL,
PRIMARY KEY USING BTREE (`id`),
KEY `idx_l` (`l`),
KEY `idx_h` USING BTREE (`h`),
KEY `idx_m` USING BTREE (`m`),
KEY `idx_fk_g` USING BTREE (`fk_g`),
KEY `fk_g_frobozz` (`id`,`fk_g`),
CONSTRAINT `fk_g_frobozz` FOREIGN KEY (`fk_g`) REFERENCES `frotz` (`g`)
) ENGINE=InnoDB AUTO_INCREMENT=179369 DEFAULT CHARSET=utf8 ROW_FORMAT=DYNAMIC;
I suspect that part of the issue is there are a number of indexes for this table.
Table B looks similar to table A, though it only contains the columns id and h.
Also, the profiling results are as follows:
starting 0.000018
checking query cache for query 0.000044
checking permissions 0.000005
Opening tables 0.000009
init 0.000019
optimizing 0.000004
executing 0.000043
end 0.000005
end 0.000002
query end 0.000003
freeing items 0.000007
logging slow query 0.000002
cleaning up 0.000002
SOLVED
Thanks to all the responses and comments. They certainly got me to think about the problem. Kudos to dotjoe for getting me to step away from the problem by asking the simple question "Do any other tables reference a.id?"
The problem was that there was a DELETE TRIGGER on table A which called a stored procedure to update two other tables, C and D. Table C had a FK back to a.id and after doing some stuff related to that id in the stored procedure, it had the statement,
DELETE FROM c WHERE c.other_id = theId;
I looked into the EXPLAIN statement and rewrote this as,
EXPLAIN SELECT * FROM c WHERE c.other_id = 12345;
So, I could see what this was doing and it gave me the following info:
id 1
select_type SIMPLE
table c
type ALL
possible_keys NULL
key NULL
key_len NULL
ref NULL
rows 2633
Extra using where
This told me that it was a painful operation to make and since it was going to get called 22500 times (for the given set of data being deleted), that was the problem. Once I created an INDEX on that other_id column and reran the EXPLAIN, I got:
id 1
select_type SIMPLE
table c
type ref
possible_keys Index_1
key Index_1
key_len 8
ref const
rows 1
Extra
Much better, in fact really great.
I added that Index_1 and my delete times are now in line with the times reported by mattkemp. This was a really subtle error on my part, due to shoe-horning in some additional functionality at the last minute. It turned out that most of the suggested alternative DELETE/SELECT statements, as Daniel stated, ended up taking essentially the same amount of time and, as soulmerge mentioned, the statement was pretty much the best I was going to be able to construct based on what I needed to do. Once I provided an index for that other table C, my DELETEs were fast.
Postmortem:
Two lessons learned came out of this exercise. First, it is clear that I didn't leverage the power of the EXPLAIN statement to get a better idea of the impact of my SQL queries. That's a rookie mistake, so I'm not going to beat myself up about that one. I'll learn from that mistake. Second, the offending code was the result of a 'get it done quick' mentality and inadequate design/testing led to this problem not showing up sooner. Had I generated several sizable test data sets to use as test input for this new functionality, I'd have not wasted my time nor yours. My testing on the DB side was lacking the depth that my application side has in place. Now I've got the opportunity to improve that.
Reference: EXPLAIN Statement
Deleting data from InnoDB is the most expensive operation you can request of it. As you already discovered the query itself is not the problem - most of them will be optimized to the same execution plan anyway.
While it may be hard to understand why DELETEs, of all operations, are the slowest, there is a rather simple explanation. InnoDB is a transactional storage engine. That means that if your query was aborted halfway through, all records would still be in place as if nothing had happened. Once it completes, all of them are gone in the same instant. During the DELETE, other clients connecting to the server will still see the records until your DELETE completes.
To achieve this, InnoDB uses a technique called MVCC (Multi Version Concurrency Control). What it basically does is to give each connection a snapshot view of the whole database as it was when the first statement of the transaction started. To achieve this, every record in InnoDB internally can have multiple values - one for each snapshot. This is also why COUNTing on InnoDB takes some time - it depends on the snapshot state you see at that time.
For your DELETE transaction, each and every record identified by your query conditions gets marked for deletion. As other clients might be accessing the data at the same time, InnoDB cannot immediately remove those records from the table, because the other clients have to see their respective snapshots to guarantee the atomicity of the deletion.
Once all records have been marked for deletion, the transaction is successfully committed. And even then, the records cannot be removed from the actual data pages until all other transactions that were working with a snapshot taken before your DELETE have ended as well.
So in fact your 3 minutes are not really that slow, considering the fact that all records have to be modified in order to prepare them for removal in a transaction safe way. Probably you will "hear" your hard disk working while the statement runs. This is caused by accessing all the rows.
To improve performance you can try to increase InnoDB buffer pool size for your server and try to limit other access to the database while you DELETE, thereby also reducing the number of historic versions InnoDB has to maintain per record.
With the additional memory InnoDB might be able to read your table (mostly) into memory and avoid some disk seeking time.
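A minimal my.cnf sketch of that suggestion (the value is illustrative; size it to your data set and available RAM, and note that on MySQL of this vintage changing it requires a server restart):
[mysqld]
innodb_buffer_pool_size = 1G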
Try this:
DELETE a
FROM a
INNER JOIN b
on a.id = b.id
Subqueries tend to be slower than joins, as they are run for each record in the outer query.
This is what I always do, when I have to operate with super large data (here: a sample test table with 150000 rows):
drop table if exists employees_bak;
create table employees_bak like employees;
insert into employees_bak
select * from employees
where emp_no > 100000;
rename table employees to employees_todelete;
rename table employees_bak to employees;
drop table employees_todelete;
In this case the SQL filters 50000 rows into the backup table.
The query cascade runs on my slow machine in 5 seconds.
You can replace the INSERT INTO ... SELECT with your own filter query.
That is the trick for performing mass deletions on big databases! ;=)
Your time of three minutes seems really slow. My guess is that the id column is not being indexed properly. If you could provide the exact table definition you're using that would be helpful.
I created a simple python script to produce test data and ran multiple different versions of the delete query against the same data set. Here's my table definitions:
drop table if exists a;
create table a
(id bigint unsigned not null primary key,
data varchar(255) not null) engine=InnoDB;
drop table if exists b;
create table b like a;
I then inserted 100k rows into a and 25k rows into b (22.5k of which were also in a). Here are the results of the various delete commands. I dropped and repopulated the tables between runs, by the way.
mysql> DELETE FROM a WHERE EXISTS (SELECT b.id FROM b WHERE a.id=b.id);
Query OK, 22500 rows affected (1.14 sec)
mysql> DELETE FROM a USING a LEFT JOIN b ON a.id=b.id WHERE b.id IS NOT NULL;
Query OK, 22500 rows affected (0.81 sec)
mysql> DELETE a FROM a INNER JOIN b on a.id=b.id;
Query OK, 22500 rows affected (0.97 sec)
mysql> DELETE QUICK a.* FROM a,b WHERE a.id=b.id;
Query OK, 22500 rows affected (0.81 sec)
All the tests were run on an Intel Core2 quad-core 2.5GHz, 2GB RAM with Ubuntu 8.10 and MySQL 5.0. Note that the execution of one SQL statement is still single-threaded.
Update:
I updated my tests to use itsmatt's schema. I slightly modified it by removing the auto increment (I'm generating synthetic data) and the character set encoding (it wasn't working; I didn't dig into it).
Here's my new table definitions:
drop table if exists a;
drop table if exists b;
drop table if exists c;
create table c (id varchar(30) not null primary key) engine=InnoDB;
create table a (
id bigint(20) unsigned not null primary key,
c_id varchar(30) not null,
h int(10) unsigned default null,
i longtext,
j bigint(20) not null,
k bigint(20) default null,
l varchar(45) not null,
m int(10) unsigned default null,
n varchar(20) default null,
o bigint(20) not null,
p tinyint(1) not null,
key l_idx (l),
key h_idx (h),
key m_idx (m),
key c_id_idx (id, c_id),
key c_id_fk (c_id),
constraint c_id_fk foreign key (c_id) references c(id)
) engine=InnoDB row_format=dynamic;
create table b like a;
I then reran the same tests with 100k rows in a and 25k rows in b (and repopulating between runs).
mysql> DELETE FROM a WHERE EXISTS (SELECT b.id FROM b WHERE a.id=b.id);
Query OK, 22500 rows affected (11.90 sec)
mysql> DELETE FROM a USING a LEFT JOIN b ON a.id=b.id WHERE b.id IS NOT NULL;
Query OK, 22500 rows affected (11.48 sec)
mysql> DELETE a FROM a INNER JOIN b on a.id=b.id;
Query OK, 22500 rows affected (12.21 sec)
mysql> DELETE QUICK a.* FROM a,b WHERE a.id=b.id;
Query OK, 22500 rows affected (12.33 sec)
As you can see this is quite a bit slower than before, probably due to the multiple indexes. However, it is nowhere near the three minute mark.
Something else that you might want to look at is moving the longtext field to the end of the schema. I seem to remember that MySQL performs better if all the size-restricted fields are first and text, blob, etc. are at the end.
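If you want to experiment with that, columns can be reordered in place, e.g. (using the sanitized column names from the question; note this rebuilds the table, so try it on a copy first):
ALTER TABLE a MODIFY COLUMN i LONGTEXT AFTER p;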
You're doing your subquery on 'b' for every row in 'a'.
Try:
DELETE FROM a USING a LEFT JOIN b ON a.id = b.id WHERE b.id IS NOT NULL;
Try this out:
DELETE QUICK A.* FROM A,B WHERE A.ID=B.ID
It is much faster than normal queries.
Refer to the syntax here: http://dev.mysql.com/doc/refman/5.0/en/delete.html
I know this question has been pretty much solved due to OP's indexing omissions but I would like to offer this additional advice, which is valid for a more generic case of this problem.
I have personally dealt with having to delete many rows from one table that exist in another, and in my experience it's best to do the following, especially if you expect lots of rows to be deleted. Most importantly, this technique will reduce replication slave lag: the longer each single mutator query runs, the worse the lag gets (replication is single-threaded).
So, here it is: do a SELECT first, as a separate query, remembering the IDs returned in your script/application, then continue on deleting in batches (say, 50,000 rows at a time).
This will achieve the following:
each one of the delete statements will not lock the table for too long, thus not letting replication lag get out of control. It is especially important if you rely on your replication to provide relatively up-to-date data. The benefit of using batches is that if you find that each DELETE query still takes too long, you can make the batch smaller without touching any DB structures.
another benefit of using a separate SELECT is that the SELECT itself might take a long time to run, especially if it can't for whatever reason use the best DB indexes. If the SELECT is inner to a DELETE, when the whole statement migrates to the slaves, it will have to do the SELECT all over again, potentially lagging the slaves because it has to do the long select all over again. Slave lag, again, suffers badly. If you use a separate SELECT query, this problem goes away, as all you're passing is a list of IDs.
Let me know if there's a fault in my logic somewhere.
For more discussion on replication lag and ways to fight it, similar to this one, see MySQL Slave Lag (Delay) Explained And 7 Ways To Battle It
P.S. One thing to be careful about is, of course, potential edits to the table between the times the SELECT finishes and DELETEs start. I will let you handle such details by using transactions and/or logic pertinent to your application.
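A minimal sketch of the pattern in SQL (batch size and ids are illustrative; the application collects the ids and builds each IN list):
SELECT a.id FROM a INNER JOIN b ON a.id = b.id;
-- application stores the returned ids, then loops:
DELETE FROM a WHERE id IN (1, 5, 9 /* ... up to ~50,000 ids per batch ... */);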
DELETE FROM a WHERE id IN (SELECT id FROM b)
Maybe you should rebuild the indices before running such a huge query. Well, you should rebuild them periodically anyway.
REPAIR TABLE a QUICK;
REPAIR TABLE b QUICK;
and then run any of the above queries (i.e.)
DELETE FROM a WHERE id IN (SELECT id FROM b)
The query itself is already in an optimal form, updating the indexes causes the whole operation to take that long. You could disable the keys on that table before the operation, that should speed things up. You can turn them back on at a later time, if you don't need them immediately.
Another approach would be adding a deleted flag column to your table and adjusting your other queries so they take that value into account. The fastest boolean type in MySQL is CHAR(0) NULL (true = '', false = NULL). That would be a fast operation, and you can delete the flagged rows afterwards.
The same thoughts expressed in sql statements:
ALTER TABLE a ADD COLUMN deleted CHAR(0) NULL DEFAULT NULL;
-- The following query should be faster than the delete statement:
UPDATE a INNER JOIN b ON a.id = b.id SET a.deleted = '';
-- This is the catch, you need to alter the rest
-- of your queries to take the new column into account:
SELECT * FROM a WHERE deleted IS NULL;
-- You can then issue the following queries in a cronjob
-- to clean up the tables:
DELETE FROM a WHERE deleted IS NOT NULL;
If that, too, is not what you want, you can have a look at what the mysql docs have to say about the speed of delete statements.
BTW, after posting the above on my blog, Baron Schwartz from Percona brought to my attention that his maatkit already has a tool just for this purpose - mk-archiver. http://www.maatkit.org/doc/mk-archiver.html.
It is most likely your best tool for the job.
Obviously the SELECT query that builds the foundation of your DELETE operation is quite fast so I'd think that either the foreign key constraint or the indexes are the reasons for your extremely slow query.
Try
SET foreign_key_checks = 0;
/* ... your query ... */
SET foreign_key_checks = 1;
This would disable the checks on the foreign key. Unfortunately, you cannot disable (at least I don't know how) the key updates on an InnoDB table. With a MyISAM table you could do something like
ALTER TABLE a DISABLE KEYS
/* ... your query ... */
ALTER TABLE a ENABLE KEYS
I actually did not test if these settings would affect the query duration. But it's worth a try.
Connect to the database using a terminal and execute the commands below, looking at the result time of each; you'll find that the time to delete 10, 100, 1000, 10000, and 100000 records does not grow in proportion to the record count.
DELETE FROM #{$table_name} WHERE id < 10;
DELETE FROM #{$table_name} WHERE id < 100;
DELETE FROM #{$table_name} WHERE id < 1000;
DELETE FROM #{$table_name} WHERE id < 10000;
DELETE FROM #{$table_name} WHERE id < 100000;
The time to delete 100 thousand records is not ten times that of deleting 10 thousand records.
So, besides finding a way to delete records faster, there are some indirect methods.
1. We can rename table_name to table_name_bak, and then select the records we want to keep from table_name_bak into a fresh table_name.
2. To delete 10000 records, we can delete 1000 records 10 times. Here is an example Ruby script to do it.
#!/usr/bin/env ruby
require 'mysql2'

# Connection settings are examples; adjust to your environment.
$client = Mysql2::Client.new(
  :as       => :array,
  :host     => '10.0.0.250',
  :username => 'mysql',
  :password => '123456',
  :database => 'test'
)

$ids = (1..1000000).to_a  # the ids to delete
$table_name = "test"

# Delete in batches of 1000 ids per statement.
until $ids.empty?
  ids = $ids.shift(1000).join(", ")
  puts "delete =================="
  $client.query("
    DELETE FROM #{$table_name}
    WHERE id IN ( #{ids} )
  ")
end
The basic technique for deleting multiple rows from a single MySQL table through the id field:
DELETE FROM tbl_name WHERE id >= 100 AND id <= 200;
This query deletes the rows whose id falls between 100 and 200 from the given table.