A performance question in MySQL

I’m seeing a performance behavior in mysqld that I don’t understand.
I have a table t with a primary key id and four data columns col1, … col4.
The data are in 4 TSV files 'col1.tsv', … 'col4.tsv'. The procedure I use to ingest them is:
CREATE TABLE t (
id INT NOT NULL,
col1 INT NOT NULL,
col2 INT NOT NULL,
col3 INT NOT NULL,
col4 CHAR(12) CHARACTER SET latin1 NOT NULL );
LOAD DATA LOCAL INFILE # POP 1
'col1.tsv' INTO TABLE t (id, col1);
ALTER TABLE t ADD PRIMARY KEY (id);
SET GLOBAL hot_keys.key_buffer_size= # something suitable
CACHE INDEX t IN hot_keys;
LOAD INDEX INTO CACHE t;
DROP TABLE IF EXISTS tmpt;
CREATE TABLE tmpt ( id INT NOT NULL, val INT NOT NULL );
LOAD DATA LOCAL INFILE 'col2.tsv' INTO TABLE tmpt;
INSERT INTO t (id, col2) # POP 2
SELECT tt.id, tt.val FROM tmpt tt
ON DUPLICATE KEY UPDATE col2=tt.val;
DROP TABLE IF EXISTS tmpt;
CREATE TABLE tmpt ( id INT NOT NULL, val INT NOT NULL );
LOAD DATA LOCAL INFILE 'col3.tsv' INTO TABLE tmpt;
INSERT INTO t (id, col3) # POP 3
SELECT tt.id, tt.val FROM tmpt tt
ON DUPLICATE KEY UPDATE col3=tt.val;
DROP TABLE IF EXISTS tmpt;
CREATE TABLE tmpt ( id INT NOT NULL,
val CHAR(12) CHARACTER SET latin1 NOT NULL );
LOAD DATA LOCAL INFILE 'col4.tsv' INTO TABLE tmpt;
INSERT INTO t (id, col4) # POP 4
SELECT tt.id, tt.val FROM tmpt tt
ON DUPLICATE KEY UPDATE col4=tt.val;
Now here’s the performance thing I don’t understand. Sometimes the POP 2
and 3 INSERT INTO … SELECT … ON DUPLICATE KEY UPDATE queries run very fast with mysqld
occupying 100% of a core and at other times mysqld bogs down at 1% CPU reading t.MYD, i.e. table t’s MyISAM data file, at random offsets.
I’ve had a very hard time isolating in which circumstances it is fast and
in which it is slow but I have found one repeatable case:
In the above sequence, POP 2 and 3 are very slow. But if I create t
without col4 then POP 2 and POP 3 are very fast. Why?
And if, after that, I add col4 with an ALTER TABLE query then POP 4 runs
very fast too.
Again, when the INSERTs run slow, mysqld is bogged down in file IO
reading from random offsets in table t’s MyISAM data file. I don’t even
understand why it is reading that file.
MySQL server version 5.0.87. OS X 10.6.4 on Core 2 Duo iMac.
UPDATE
I eventually found (what I think is) the answer to this question. The mysterious difference between some inserts being slow and some fast is dependent on the data.
The clue was: when the insert is slow, mysqld is seeking on average 0.5GB between reads on t.MYD. When it is fast, successive reads have tiny relative offsets.
The confusion arose because some of the 'col?.tsv' files happen to have their rows in roughly the same order w.r.t. the id column while others are randomly ordered relative to them.
I was able to drastically reduce overall processing time by using sort(1) on the tsv files before loading and inserting them.
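For readers who would rather do the pre-sort in a script than with sort(1), here is a minimal Python sketch of the same idea. It assumes the id is an integer in the first tab-separated column and that the file fits in memory (sort(1) does not need that); the function name is made up for illustration.

```python
import csv
import io

def sort_tsv_by_id(src, dst):
    """Copy tab-separated rows from src to dst, ordered by the integer in column 0."""
    rows = list(csv.reader(src, delimiter="\t"))
    rows.sort(key=lambda row: int(row[0]))
    csv.writer(dst, delimiter="\t", lineterminator="\n").writerows(rows)

# Demo with in-memory files; real use would open e.g. 'col2.tsv' instead.
src = io.StringIO("30\t7\n10\t9\n20\t8\n")
dst = io.StringIO()
sort_tsv_by_id(src, dst)
print(dst.getvalue(), end="")
```

With every file sorted on id, the rows arrive in the same order as they are laid out in t.MYD, so successive lookups land near each other instead of seeking across the file.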

It's a pretty open question... here's a speculative, open answer. :)
... when the INSERTs run slow, mysqld is bogged down in file IO reading from random offsets in table t’s MyISAM data file. I don’t even understand why it is reading that file.
I can think of two possible explanations:
Even after it knows there is a primary key collision, it has to see what used to be in the field that will be updated -- if it is coincidentally the destination value already, 0 in this case, it won't perform the update -- i.e. zero rows affected.
Moreover, when you update a field, I believe MySQL re-writes the whole row back to disk (if not multiple rows due to paging), and not just that single field as one might assume.
But if I create t without col4 then POP 2 and POP 3 are very fast. Why?
If it's a fixed-row size MyISAM table, which it looks like due to the datatypes in the table, then including the CHAR field, even if it's blank, will make the table 75% larger on disk (4 bytes per INT field = 16 bytes, whereas the CHAR(12) would add another 12 bytes). So, in theory, you'll need to read/write 75% more.
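The 75% figure can be checked with a few lines of arithmetic. This is only a sketch: it counts column bytes and ignores any per-row overhead MyISAM adds, which is an assumption on my part.

```python
INT_BYTES = 4              # storage size of a MySQL INT
CHAR12_LATIN1_BYTES = 12   # CHAR(12) in a single-byte character set

row_without_col4 = 4 * INT_BYTES                     # id, col1, col2, col3
row_with_col4 = row_without_col4 + CHAR12_LATIN1_BYTES
growth = row_with_col4 / row_without_col4 - 1

print(row_without_col4, row_with_col4, f"{growth:.0%}")
```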
Does your dataset fit in memory? Have you considered using InnoDB or Memory tables?
Addendum
If the usable/active/hot dataset goes from fitting in memory to not fitting in memory, an orders of magnitude decrease in performance isn't unheard of. A couple reads:
http://jayant7k.blogspot.com/2010/10/foursquare-outage-post-mortem.html
http://www.mysqlperformanceblog.com/2010/11/19/is-there-benefit-from-having-more-memory/

Related

split table performance in mysql

Hi everyone. Here is a problem with my MySQL server.
I have a table with about 40,000,000 rows and 10 columns.
Its size is about 4GB, and the engine is InnoDB.
It is a master database, and it only executes one kind of SQL statement, like this:
insert into mytable ... on duplicate key update ...
About 99% of the statements take the update path.
Now the server is becoming slower and slower.
I heard that splitting the table may improve performance. I tried this on my personal computer, splitting it into 10 tables, which did not help; I also tried 100, which did not help either. It actually became slower. So I wonder why splitting the table didn't improve performance?
Thanks in advance.
more details:
CREATE TABLE my_table (
id BIGINT AUTO_INCREMENT,
user_id BIGINT,
identifier VARCHAR(64),
account_id VARCHAR(64),
top_speed INT UNSIGNED NOT NULL,
total_chars INT UNSIGNED NOT NULL,
total_time INT UNSIGNED NOT NULL,
keystrokes INT UNSIGNED NOT NULL,
avg_speed INT UNSIGNED NOT NULL,
country_code VARCHAR(16),
update_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY(id), UNIQUE KEY(user_id)
);
PS:
I also tried different computers with a solid-state drive and a hard disk drive, but that didn't help either.
Splitting up a table is unlikely to help at all. Ditto for PARTITIONing.
Let's count the disk hits. I will skip counting non-leaf nodes in BTrees; they tend to be cached; I will count leaf nodes in the data and indexes; they tend not to be cached.
IODKU does:
Read the index block containing the entry for any UNIQUE keys. In your case, that is probably user_id. Please provide a sample SQL statement. 1 read.
If the user_id entry is found in the index, read the record from the data as indexed by the PK(id) and do the UPDATE, and leave this second block in the buffer_pool for eventual rewrite to disk. 1 read now, 1 write later.
If the record is not found, do INSERT. The index block that needs the new row was already read, so it is ready to have a new entry inserted. Meanwhile, the "last" block in the table (due to id being AUTO_INCREMENT) is probably already cached. Add the new row to it. 0 reads now, 1 write later (UNIQUE). (Rewriting the "last" block is amortized over, say, 100 rows, so I am ignoring it.)
Eventually do the write(s).
Total, assuming essentially all take the UPDATE path: 2 reads and 1 write. Assuming the user_id follows no simple pattern, I will assume that all 3 I/Os are "random".
Let's consider a variation... What if you got rid of id? Do you need id anywhere else? Since you have a UNIQUE key, it could be the PK. That is, replace your two indexes with just PRIMARY KEY(user_id). Now the counts are:
1 read
If UPDATE, 0 read, 1 write
If INSERT, 0 read, 0 write
Total: 1 read, 1 write. 2/3 as many as before. Better, but still not great.
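The two tallies above can be written down as a small model. This is just a sketch of the counting rules in this answer (leaf blocks only, non-leaf blocks assumed cached); the function and parameter names are made up for illustration, and it is not a measurement of what InnoDB actually does.

```python
def iodku_leaf_io(update_fraction, user_id_is_pk):
    """Return expected (reads, writes) of leaf blocks per IODKU statement.

    Mirrors the counting above: non-leaf blocks are assumed cached, and the
    amortized write of the "last" data block is ignored.
    """
    insert_fraction = 1.0 - update_fraction
    if user_id_is_pk:
        # One read finds the block holding the row (or where it would go).
        reads = 1.0
        # UPDATE rewrites that block; the answer counts INSERT as 0 writes.
        writes = update_fraction * 1.0 + insert_fraction * 0.0
    else:
        # One read of the UNIQUE(user_id) index leaf, plus, on the UPDATE
        # path only, one read of the data block through PRIMARY KEY(id).
        reads = 1.0 + update_fraction * 1.0
        # UPDATE writes the data block; INSERT writes the UNIQUE index block.
        writes = update_fraction * 1.0 + insert_fraction * 1.0
    return reads, writes

before = iodku_leaf_io(1.0, user_id_is_pk=False)
after = iodku_leaf_io(1.0, user_id_is_pk=True)
print(before, after, sum(after) / sum(before))
```

With essentially all statements taking the UPDATE path, this gives 3 I/Os per statement with the separate id key and 2 with user_id as the PK.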
Caching
How much RAM do you have?
What is the value of innodb_buffer_pool_size?
SHOW TABLE STATUS -- What are Data_length and Index_length?
I suspect that the buffer_pool is not big enough, and possibly could be raised. If you have more than 4GB of RAM, make it about 70% of RAM.
Others
SSDs should have helped significantly, since you appear to be I/O bound. Can you tell whether you are I/O-bound or CPU-bound?
How many rows are you updating at once? How long does it take? Is it batched, or one at a time? There may be a significant improvement possible here.
Do you really need BIGINT (8 bytes)? INT UNSIGNED is only 4 bytes.
Is a transaction involved?
Is the Master having a problem? The Slave? Both? I don't want to fix the Master in such a way that it messes up the Slave.
Try splitting your database across several MySQL instances behind a proxy such as mysql-proxy or HAProxy, instead of using a single MySQL instance. That may improve performance.

MySQL - Data Loading by Partitions, and Indexes

This is for MySQL 5.7 with InnoDB.
I have a partitioned table, and I'll be doing batch data loading (of a large amount of data) by partitions. i.e. I know that each batch of data I load will fall exclusively into one partition.
Now, the common way to handle indexes with data loading (as far as I know), would be to drop all indexes first, do the data loading, then re-create the indexes.
But I'm wondering, since I'm loading by partitions, is this still the most optimal way (dropping and then re-creating indexes) since it seems like I'm unnecessarily "touching" the non-updated partitions this way.
e.g.
Loading data into partition 1.
Drop all indexes - nothing happens, since no data yet.
Load data - all goes into partition 1.
Create indexes - only partition 1 is modified.
Loading data into partition 2.
Drop all indexes - all indexes in partition 1 dropped (unnecessary!)
Load data - all goes into partition 2.
Create indexes - partition 1 indexes re-created (unnecessary!) and partition 2 indexes created.
And hence, loading this second batch of data takes significantly longer than the first batch. And it will get worse for each batch!
In that case, should I just pre-create the indexes and leave them there when loading data?
(BTW, don't worry about queries. The database is "offline" when data loading takes place. The objective here is only to shorten the time for each batch of data loading.)
The table schema is as follows:
CREATE TABLE MYTABLE (
ID BIGINT UNSIGNED AUTO_INCREMENT NOT NULL,
YEAR SMALLINT UNSIGNED NOT NULL,
MONTH TINYINT UNSIGNED NOT NULL,
A CHAR(4),
B VARCHAR(127),
C VARCHAR(15),
D VARCHAR(511),
E TEXT,
F TEXT,
G VARCHAR(127),
H VARCHAR(127),
I VARCHAR(127),
J VARCHAR(511),
K VARCHAR(511),
L BIT(1),
CONSTRAINT PKEY PRIMARY KEY (ID, YEAR, MONTH)
)
PARTITION BY LIST COLUMNS(YEAR, MONTH) (
PARTITION PART1 VALUES IN ((2007, 1)),
PARTITION PART2 VALUES IN ((2007, 2)),
PARTITION PART3 VALUES IN ((2007, 3)),
...
);
And, of course, there are a bunch of indexes (14 in all), mostly involving 2 to 4 columns. Neither of the 2 TEXT columns is in any of the indexes.
If you are using InnoDB, do not drop the PRIMARY KEY.
All PARTITIONs always have the same indexes. So you cannot turn on/off indexes separately.
Please provide SHOW CREATE TABLE for further critique and advice. I may say that PARTITIONing is of no use; there are very few use cases where it is worth using PARTITION. More info, and use cases

Design of mysql database for large number of large matrix data

I am looking into storing a "large" amount of data and not sure what the best solution is, so any help would be most appreciated. The structure of the data is
450,000 rows
11,000 columns
My requirements are:
1) Need as fast access as possible to a small subset of the data e.g. rows (1,2,3) and columns (5,10,1000)
2) Needs to be scalable: I will be adding columns every month, but the number of rows is fixed.
My understanding is that often it's best to store as:
id| row_number| column_number| value
but this would create 4,950,000,000 entries? I have tried storing as just rows and columns as is in MySQL but it is very slow at subsetting the data.
Thanks!
Build the giant matrix table
As N.B. said in comments, there's no cleaner way than using one mysql row for each matrix value.
You can do it without the id column:
CREATE TABLE `stackoverflow`.`matrix` (
`rowNum` MEDIUMINT NOT NULL ,
`colNum` MEDIUMINT NOT NULL ,
`value` INT NOT NULL ,
PRIMARY KEY ( `rowNum`, `colNum` )
) ENGINE = MYISAM ;
You may add a UNIQUE INDEX on colNum, rowNum, or only a non-unique INDEX on colNum if you often access matrix by column (because PRIMARY INDEX is on ( `rowNum`, `colNum` ), note the order, so it will be inefficient when it comes to select a whole column).
You'll probably need more than 200 GB to store the 450,000 x 11,000 values, including indexes.
Inserting data may be slow (because there are two indexes to rebuild, and 450,000 entries [1 per row] to add when adding a column).
Edits should be very fast, as the index wouldn't change and value is of fixed size.
If you access same subsets (rows + cols) often, maybe you can use PARTITIONing of the table if you need something "faster" than what mysql provides by default.
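To get a feel for the sizes involved, here is a rough back-of-the-envelope calculation for the matrix table above. It counts only the raw column bytes (3 bytes per MEDIUMINT, 4 per INT); per-row overhead and the indexes come on top of this, so treat the result as a lower bound rather than an estimate of the files on disk.

```python
ROWS = 450_000
COLS = 11_000
MEDIUMINT_BYTES = 3   # rowNum and colNum
INT_BYTES = 4         # value

cells = ROWS * COLS
bytes_per_cell = 2 * MEDIUMINT_BYTES + INT_BYTES
raw_bytes = cells * bytes_per_cell

print(cells, raw_bytes / 1e9)  # cell count, raw column data in GB
```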
After years of experience (20201 edit)
Re-reading myself years later, I would say the "cache" ideas are totally dumb, as it's MySQL's role to handle this sort of cache (the data should actually already be in the InnoDB buffer pool).
A better approach, if the matrix is full of zeroes, is to not store the zero values and to treat 0 as the "default" in the client code. That way, you may lighten the storage (if needed: MySQL should actually respond pretty fast to queries even on such a 5-billion-row table).
Another option, if storage is an issue, is to use a single ID to identify both row and column: you say the number of rows is fixed (450000), so you may replace (row, col) with a single value (id = 450000*col+row), though it needs BIGINT, so it may be no better than 2 columns.
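A small sketch of that single-ID packing, assuming zero-based row and column numbers (the helper names are made up for illustration). It also shows why the id no longer fits in an unsigned 32-bit INT:

```python
N_ROWS = 450_000   # fixed, per the question
N_COLS = 11_000    # current column count; grows every month

def pack(row, col):
    """Combine zero-based (row, col) into one id = N_ROWS * col + row."""
    return N_ROWS * col + row

def unpack(cell_id):
    """Recover (row, col) from a packed id."""
    col, row = divmod(cell_id, N_ROWS)
    return row, col

max_id = pack(N_ROWS - 1, N_COLS - 1)
print(unpack(pack(3, 1000)), max_id, max_id > 2**32 - 1)
```

Because new columns only append larger ids, existing ids stay valid when the matrix grows.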
Don't do like below: don't reinvent MySQL cache
Add a cache (actually no)
Since you said you add values, and doesn't seem to edit matrix values, a cache can speed up frequently asked rows/columns.
If you often read the same rows/columns, you can cache their result in another table (same structure to make it easier):
CREATE TABLE `stackoverflow`.`cachedPartialMatrix` (
`rowNum` MEDIUMINT NOT NULL ,
`colNum` MEDIUMINT NOT NULL ,
`value` INT NOT NULL ,
PRIMARY KEY ( `rowNum`, `colNum` )
) ENGINE = MYISAM ;
That table will be empty at the beginning, and each SELECT on the matrix table will feed the cache. When you want to get a column / row:
SELECT the row/column from that caching table
If the SELECT returns an empty/partial result (no data returned or not enough data to match the expected row/column number) then do the SELECT on the matrix table
Save the SELECT from the matrix table to cachedPartialMatrix
If the caching matrix gets too big, clear it (the bigger cached matrix is, the slower it becomes)
Smarter cache (actually, no)
You can make it even smarter with a third table to count how many times a selection is done:
CREATE TABLE `stackoverflow`.`requestsCounter` (
`isRowSelect` BOOLEAN NOT NULL ,
`index` INT NOT NULL ,
`count` INT NOT NULL ,
`lastDate` DATETIME NOT NULL,
PRIMARY KEY ( `isRowSelect` , `index` )
) ENGINE = MYISAM ;
When you do a request on your matrix (one may use TRIGGERS) for the Nth-row or Kth-column, increment the counter. When the counter gets big enough, feed the cache.
lastDate can be used to remove some old values from the cache (take care: if you remove the Nth column from the cache because its `lastDate` is old enough, you may break other cached entries) or to regularly clear the cache and only leave the recently selected values.

Fast way to populate a relational database in MySQL using JDBC?

I am trying to implement simple program in Java that will be used to populate a MySQL database from a CSV source file. For each row in the CSV file, I need to execute following sequence of SQL statements (example in pseudo code):
execute("INSERT INTO table_1 VALUES(?, ?)");
String id1 = execute("SELECT LAST_INSERT_ID()");
execute("INSERT INTO table_2 VALUES(?, ?)");
String id2 = execute("SELECT LAST_INSERT_ID()");
execute("INSERT INTO table_3 VALUES('some value', id1, id2)");
execute("INSERT INTO table_3 VALUES('some value2', id1, id2)");
...
There are three basic problems:
1. Database is not on localhost so each single INSERT/SELECT has latency and this is the basic problem
2. CSV file contains millions of rows (like 15 000 000) so it takes too long.
3. I cannot modify the database structure (add extra tables, disable keys etc).
I was wondering how can I speed up the INSERT/SELECT process? Currently 80% of the execution time is consumed by communication.
I already tried to group the above statements and execute them as batch but because of LAST_INSERT_ID it does not work. In any other cases it takes too long (see point 1).
Fastest way is to let MySQL parse the CSV and load records into the table. For that, you can use "LOAD DATA INFILE":
http://dev.mysql.com/doc/refman/5.1/en/load-data.html
It works even better if you can transfer the file to server or keep it on a shared directory that is accessible to server.
Once that is done, you can have a column that indicates whether the record has been processed or not. Its value should be false by default.
Once data is loaded, you can pick up all records where processed=false.
For all such records you can populate table 2 and 3.
Since all these operation would happen on server, server <> client latency would not come into the picture.
Feed the data into a blackhole
CREATE TABLE `test`.`blackhole` (
`t1_f1` int(10) unsigned NOT NULL,
`t1_f2` int(10) unsigned NOT NULL,
`t2_f1` ... and so on for all the tables and all the fields.
) ENGINE=BLACKHOLE DEFAULT CHARSET=latin1;
Note that this is a blackhole table, so the data is going nowhere.
However you can create a trigger on the blackhole table, something like this.
And pass it on using a trigger
delimiter $$
create trigger ai_blackhole_each after insert on blackhole for each row
begin
declare lastid_t1 integer;
declare lastid_t2 integer;
insert into table1 values(new.t1_f1, new.t1_f2);
select last_insert_id() into lastid_t1;
insert into table2 values(new.t2_f1, new.t2_f1, lastid_t1);
etc....
end$$
delimiter ;
Now you can feed the blackhole table with a single insert statement at full speed and even insert multiple rows in one go.
insert into blackhole values(a,b,c,d,e,f,g,h),(....),(...)...
Disable index updates to speed things up
ALTER TABLE $tbl_name DISABLE KEYS;
....Lot of inserts
ALTER TABLE $tbl_name ENABLE KEYS;
This disables all non-unique key updates and speeds up the inserts. (An auto-increment key is unique, so it is not affected.)
If you have any unique keys and you don't want MySQL to check for them during the mass-insert, make sure you do an alter table to eliminate the unique key and enable it afterwards.
Note that the alter table to put the unique key back in will take a long time.

Faster way to delete matching rows?

I'm a relative novice when it comes to databases. We are using MySQL and I'm currently trying to speed up a SQL statement that seems to take a while to run. I looked around on SO for a similar question but didn't find one.
The goal is to remove all the rows in table A that have a matching id in table B.
I'm currently doing the following:
DELETE FROM a WHERE EXISTS (SELECT b.id FROM b WHERE b.id = a.id);
There are approximately 100K rows in table a and about 22K rows in table b. The column 'id' is the PK for both tables.
This statement takes about 3 minutes to run on my test box - Pentium D, XP SP3, 2GB ram, MySQL 5.0.67. This seems slow to me. Maybe it isn't, but I was hoping to speed things up. Is there a better/faster way to accomplish this?
EDIT:
Some additional information that might be helpful. Tables A and B have the same structure as I've done the following to create table B:
CREATE TABLE b LIKE a;
Table a (and thus table b) has a few indexes to help speed up queries that are made against it. Again, I'm a relative novice at DB work and still learning. I don't know how much of an effect, if any, this has on things. I assume that it does have an effect as the indexes have to be cleaned up too, right? I was also wondering if there were any other DB settings that might affect the speed.
Also, I'm using InnoDB.
Here is some additional info that might be helpful to you.
Table A has a structure similar to this (I've sanitized this a bit):
DROP TABLE IF EXISTS `frobozz`.`a`;
CREATE TABLE `frobozz`.`a` (
`id` bigint(20) unsigned NOT NULL auto_increment,
`fk_g` varchar(30) NOT NULL,
`h` int(10) unsigned default NULL,
`i` longtext,
`j` bigint(20) NOT NULL,
`k` bigint(20) default NULL,
`l` varchar(45) NOT NULL,
`m` int(10) unsigned default NULL,
`n` varchar(20) default NULL,
`o` bigint(20) NOT NULL,
`p` tinyint(1) NOT NULL,
PRIMARY KEY USING BTREE (`id`),
KEY `idx_l` (`l`),
KEY `idx_h` USING BTREE (`h`),
KEY `idx_m` USING BTREE (`m`),
KEY `idx_fk_g` USING BTREE (`fk_g`),
KEY `fk_g_frobozz` (`id`,`fk_g`),
CONSTRAINT `fk_g_frobozz` FOREIGN KEY (`fk_g`) REFERENCES `frotz` (`g`)
) ENGINE=InnoDB AUTO_INCREMENT=179369 DEFAULT CHARSET=utf8 ROW_FORMAT=DYNAMIC;
I suspect that part of the issue is there are a number of indexes for this table.
Table B looks similar to table A, though it only contains the columns id and h.
Also, the profiling results are as follows:
starting 0.000018
checking query cache for query 0.000044
checking permissions 0.000005
Opening tables 0.000009
init 0.000019
optimizing 0.000004
executing 0.000043
end 0.000005
end 0.000002
query end 0.000003
freeing items 0.000007
logging slow query 0.000002
cleaning up 0.000002
SOLVED
Thanks to all the responses and comments. They certainly got me to think about the problem. Kudos to dotjoe for getting me to step away from the problem by asking the simple question "Do any other tables reference a.id?"
The problem was that there was a DELETE TRIGGER on table A which called a stored procedure to update two other tables, C and D. Table C had a FK back to a.id and after doing some stuff related to that id in the stored procedure, it had the statement,
DELETE FROM c WHERE c.id = theId;
I looked into the EXPLAIN statement and rewrote this as,
EXPLAIN SELECT * FROM c WHERE c.other_id = 12345;
So, I could see what this was doing and it gave me the following info:
id 1
select_type SIMPLE
table c
type ALL
possible_keys NULL
key NULL
key_len NULL
ref NULL
rows 2633
Extra using where
This told me that it was a painful operation to make and since it was going to get called 22500 times (for the given set of data being deleted), that was the problem. Once I created an INDEX on that other_id column and reran the EXPLAIN, I got:
id 1
select_type SIMPLE
table c
type ref
possible_keys Index_1
key Index_1
key_len 8
ref const
rows 1
Extra
Much better, in fact really great.
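To put numbers on the difference, here is a quick sketch that multiplies the row estimates from the two EXPLAIN outputs by the number of trigger calls. The rows values are the optimizer's estimates, so this is an order-of-magnitude illustration, not an exact count.

```python
trigger_calls = 22_500          # one per deleted row in table A
rows_without_index = 2_633      # EXPLAIN "rows" with type ALL
rows_with_index = 1             # EXPLAIN "rows" with type ref on Index_1

examined_before = trigger_calls * rows_without_index
examined_after = trigger_calls * rows_with_index

print(examined_before, examined_after, examined_before // examined_after)
```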
I added that Index_1 and my delete times are in line with the times reported by mattkemp. This was a really subtle error on my part due to shoe-horning some additional functionality at the last minute. It turned out that most of the suggested alternative DELETE/SELECT statements, as Daniel stated, ended up taking essentially the same amount of time and as soulmerge mentioned, the statement was pretty much the best I was going to be able to construct based on what I needed to do. Once I provided an index for this other table C, my DELETEs were fast.
Postmortem:
Two lessons learned came out of this exercise. First, it is clear that I didn't leverage the power of the EXPLAIN statement to get a better idea of the impact of my SQL queries. That's a rookie mistake, so I'm not going to beat myself up about that one. I'll learn from that mistake. Second, the offending code was the result of a 'get it done quick' mentality and inadequate design/testing led to this problem not showing up sooner. Had I generated several sizable test data sets to use as test input for this new functionality, I'd have not wasted my time nor yours. My testing on the DB side was lacking the depth that my application side has in place. Now I've got the opportunity to improve that.
Reference: EXPLAIN Statement
Deleting data from InnoDB is the most expensive operation you can request of it. As you already discovered the query itself is not the problem - most of them will be optimized to the same execution plan anyway.
While it may be hard to understand why DELETEs of all cases are the slowest, there is a rather simple explanation. InnoDB is a transactional storage engine. That means that if your query was aborted halfway-through, all records would still be in place as if nothing happened. Once it is complete, all will be gone in the same instant. During the DELETE other clients connecting to the server will see the records until your DELETE completes.
To achieve this, InnoDB uses a technique called MVCC (Multi Version Concurrency Control). What it basically does is to give each connection a snapshot view of the whole database as it was when the first statement of the transaction started. To achieve this, every record in InnoDB internally can have multiple values - one for each snapshot. This is also why COUNTing on InnoDB takes some time - it depends on the snapshot state you see at that time.
For your DELETE transaction, each and every record that is identified according to your query conditions, gets marked for deletion. As other clients might be accessing the data at the same time, it cannot immediately remove them from the table, because they have to see their respective snapshot to guarantee the atomicity of the deletion.
Once all records have been marked for deletion, the transaction is successfully committed. And even then they cannot be immediately removed from the actual data pages, before all other transactions that worked with a snapshot value before your DELETE transaction, have ended as well.
So in fact your 3 minutes are not really that slow, considering the fact that all records have to be modified in order to prepare them for removal in a transaction safe way. Probably you will "hear" your hard disk working while the statement runs. This is caused by accessing all the rows.
To improve performance you can try to increase InnoDB buffer pool size for your server and try to limit other access to the database while you DELETE, thereby also reducing the number of historic versions InnoDB has to maintain per record.
With the additional memory InnoDB might be able to read your table (mostly) into memory and avoid some disk seeking time.
Try this:
DELETE a
FROM a
INNER JOIN b
on a.id = b.id
Subqueries tend to be slower than joins, as they are run for each record in the outer query.
This is what I always do, when I have to operate with super large data (here: a sample test table with 150000 rows):
drop table if exists employees_bak;
create table employees_bak like employees;
insert into employees_bak
select * from employees
where emp_no > 100000;
rename table employees to employees_todelete;
rename table employees_bak to employees;
drop table employees_todelete;
In this case the sql filters 50000 rows into the backup table.
The query cascade runs on my slow machine in 5 seconds.
You can replace the insert into select by your own filter query.
That is the trick to perform mass deletion on big databases!;=)
Your time of three minutes seems really slow. My guess is that the id column is not being indexed properly. If you could provide the exact table definition you're using that would be helpful.
I created a simple python script to produce test data and ran multiple different versions of the delete query against the same data set. Here's my table definitions:
drop table if exists a;
create table a
(id bigint unsigned not null primary key,
data varchar(255) not null) engine=InnoDB;
drop table if exists b;
create table b like a;
I then inserted 100k rows into a and 25k rows into b (22.5k of which were also in a). Here's the results of the various delete commands. I dropped and repopulated the table between runs by the way.
mysql> DELETE FROM a WHERE EXISTS (SELECT b.id FROM b WHERE a.id=b.id);
Query OK, 22500 rows affected (1.14 sec)
mysql> DELETE FROM a USING a LEFT JOIN b ON a.id=b.id WHERE b.id IS NOT NULL;
Query OK, 22500 rows affected (0.81 sec)
mysql> DELETE a FROM a INNER JOIN b on a.id=b.id;
Query OK, 22500 rows affected (0.97 sec)
mysql> DELETE QUICK a.* FROM a,b WHERE a.id=b.id;
Query OK, 22500 rows affected (0.81 sec)
All the tests were run on an Intel Core2 quad-core 2.5GHz, 2GB RAM with Ubuntu 8.10 and MySQL 5.0. Note, that the execution of one sql statement is still single threaded.
Update:
I updated my tests to use itsmatt's schema. I slightly modified it by removing auto increment (I'm generating synthetic data) and the character set encoding (wasn't working - didn't dig into it).
Here's my new table definitions:
drop table if exists a;
drop table if exists b;
drop table if exists c;
create table c (id varchar(30) not null primary key) engine=InnoDB;
create table a (
id bigint(20) unsigned not null primary key,
c_id varchar(30) not null,
h int(10) unsigned default null,
i longtext,
j bigint(20) not null,
k bigint(20) default null,
l varchar(45) not null,
m int(10) unsigned default null,
n varchar(20) default null,
o bigint(20) not null,
p tinyint(1) not null,
key l_idx (l),
key h_idx (h),
key m_idx (m),
key c_id_idx (id, c_id),
key c_id_fk (c_id),
constraint c_id_fk foreign key (c_id) references c(id)
) engine=InnoDB row_format=dynamic;
create table b like a;
I then reran the same tests with 100k rows in a and 25k rows in b (and repopulating between runs).
mysql> DELETE FROM a WHERE EXISTS (SELECT b.id FROM b WHERE a.id=b.id);
Query OK, 22500 rows affected (11.90 sec)
mysql> DELETE FROM a USING a LEFT JOIN b ON a.id=b.id WHERE b.id IS NOT NULL;
Query OK, 22500 rows affected (11.48 sec)
mysql> DELETE a FROM a INNER JOIN b on a.id=b.id;
Query OK, 22500 rows affected (12.21 sec)
mysql> DELETE QUICK a.* FROM a,b WHERE a.id=b.id;
Query OK, 22500 rows affected (12.33 sec)
As you can see this is quite a bit slower than before, probably due to the multiple indexes. However, it is nowhere near the three minute mark.
Something else that you might want to look at is moving the longtext field to the end of the schema. I seem to remember that mySQL performs better if all the size restricted fields are first and text, blob, etc are at the end.
You're doing your subquery on 'b' for every row in 'a'.
Try:
DELETE FROM a USING a LEFT JOIN b ON a.id = b.id WHERE b.id IS NOT NULL;
Try this out:
DELETE QUICK A.* FROM A,B WHERE A.ID=B.ID
It is much faster than normal queries.
Refer for Syntax : http://dev.mysql.com/doc/refman/5.0/en/delete.html
I know this question has been pretty much solved due to OP's indexing omissions but I would like to offer this additional advice, which is valid for a more generic case of this problem.
I have personally dealt with having to delete many rows from one table that exist in another and in my experience it's best to do the following, especially if you expect lots of rows to be deleted. This technique most importantly will improve replication slave lag, as the longer each single mutator query runs, the worse the lag would be (replication is single threaded).
So, here it is: do a SELECT first, as a separate query, remembering the IDs returned in your script/application, then continue on deleting in batches (say, 50,000 rows at a time).
This will achieve the following:
Each of the delete statements will not lock the table for too long, so replication lag does not get out of control. This is especially important if you rely on replication to provide relatively up-to-date data. The benefit of using batches is that if you find that each DELETE query still takes too long, you can make the batches smaller without touching any DB structures.
Another benefit of using a separate SELECT is that the SELECT itself might take a long time to run, especially if it can't, for whatever reason, use the best DB indexes. If the SELECT is nested inside a DELETE, then when the whole statement migrates to the slaves, each slave has to run the long SELECT all over again, and slave lag suffers badly. If you use a separate SELECT query, this problem goes away, as all you're passing is a list of IDs.
Let me know if there's a fault in my logic somewhere.
For more discussion on replication lag and ways to fight it, similar to this one, see MySQL Slave Lag (Delay) Explained And 7 Ways To Battle It
P.S. One thing to be careful about is, of course, potential edits to the table between the times the SELECT finishes and DELETEs start. I will let you handle such details by using transactions and/or logic pertinent to your application.
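A minimal sketch of the batching step in Python, assuming the IDs from the separate SELECT are already in a list. The helper names are made up for illustration, and the table name a and column id are taken from the question. It only builds the statements; a real script would run each one through its DB driver (ideally with parameters), and a very large IN list may need a smaller batch size to stay within the server's packet limit.

```python
def batches(ids, size):
    """Yield consecutive slices of ids, each at most `size` long."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]

def delete_statements(ids, size=50_000):
    """Yield one DELETE per batch of ids for table a."""
    for chunk in batches(ids, size):
        # int() guards against anything non-numeric sneaking into the SQL.
        id_list = ", ".join(str(int(i)) for i in chunk)
        yield f"DELETE FROM a WHERE id IN ({id_list});"

# Tiny demo with a batch size of 2 so the output stays readable.
for stmt in delete_statements([1, 2, 3, 4, 5], size=2):
    print(stmt)
```

If a batch still takes too long, lowering size is the only change needed.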
DELETE FROM a WHERE id IN (SELECT id FROM b)
Maybe you should rebuild the indexes before running such a huge query. In fact, you should rebuild them periodically.
REPAIR TABLE a QUICK;
REPAIR TABLE b QUICK;
and then run any of the above queries, e.g.:
DELETE FROM a WHERE id IN (SELECT id FROM b)
The query itself is already in an optimal form; it is updating the indexes that makes the whole operation take so long. You could disable the keys on that table before the operation, which should speed things up. You can turn them back on later if you don't need them immediately.
Another approach would be to add a deleted flag column to your table and adjust other queries so they take that value into account. The fastest boolean type in MySQL is CHAR(0) NULL (true = '', false = NULL). Setting the flag would be a fast operation, and you can delete the flagged rows afterwards.
The same thoughts expressed in SQL statements:
ALTER TABLE a ADD COLUMN deleted CHAR(0) NULL DEFAULT NULL;
-- The following query should be faster than the delete statement:
UPDATE a INNER JOIN b ON a.id = b.id SET a.deleted = '';
-- This is the catch, you need to alter the rest
-- of your queries to take the new column into account:
SELECT * FROM a WHERE deleted IS NULL;
-- You can then issue the following queries in a cronjob
-- to clean up the tables:
DELETE FROM a WHERE deleted IS NOT NULL;
If that, too, is not what you want, you can have a look at what the mysql docs have to say about the speed of delete statements.
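The cron-job cleanup mentioned above could itself be run in small chunks so that no single DELETE runs for long. This is a rough sketch, not something from the answer: the method name `purge_flagged_rows` and the chunk size are assumptions, and `client` is assumed to offer `query` and `affected_rows` the way Mysql2::Client does.

```ruby
# Sketch only: removes rows flagged via the `deleted` column in chunks.
# Relies on MySQL's single-table DELETE ... LIMIT syntax.
def purge_flagged_rows(client, chunk: 10_000)
  total = 0
  loop do
    client.query("DELETE FROM a WHERE deleted IS NOT NULL LIMIT #{Integer(chunk)}")
    removed = client.affected_rows
    break if removed.zero? # nothing left to purge
    total += removed
  end
  total # number of rows removed across all chunks
end
```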
BTW, after posting the above on my blog, Baron Schwartz from Percona brought to my attention that his maatkit already has a tool just for this purpose - mk-archiver. http://www.maatkit.org/doc/mk-archiver.html.
It is most likely your best tool for the job.
Obviously the SELECT query that forms the basis of your DELETE operation is quite fast, so I'd think that either the foreign key constraint or the indexes are the reason for your extremely slow query.
Try
SET foreign_key_checks = 0;
/* ... your query ... */
SET foreign_key_checks = 1;
This would disable the checks on the foreign key. Unfortunately you cannot disable (at least I don't know how) the key-updates with an InnoDB table. With a MyISAM table you could do something like
ALTER TABLE a DISABLE KEYS;
/* ... your query ... */
ALTER TABLE a ENABLE KEYS;
I actually did not test if these settings would affect the query duration. But it's worth a try.
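If you try the foreign_key_checks approach from a script, it may be worth making sure the checks are switched back on even when the query fails. A small Ruby sketch of that idea, where the helper name `without_foreign_key_checks` is hypothetical and `client` is assumed to expose a Mysql2::Client-style `query(sql)` method:

```ruby
# Sketch only: runs the given block with foreign key checks disabled
# and re-enables them afterwards, even if the block raises.
def without_foreign_key_checks(client)
  client.query("SET foreign_key_checks = 0")
  yield client
ensure
  client.query("SET foreign_key_checks = 1")
end
```

It would be called as `without_foreign_key_checks(client) { |c| c.query("DELETE FROM a WHERE id IN (SELECT id FROM b)") }`. Like the settings themselves, this is untested against a real workload.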
Connect to the database from a terminal and execute the commands below. Look at the time each one takes: you'll find that the times for deleting 10, 100, 1000, 10000 and 100000 records do not scale proportionally.
DELETE FROM #{$table_name} WHERE id < 10;
DELETE FROM #{$table_name} WHERE id < 100;
DELETE FROM #{$table_name} WHERE id < 1000;
DELETE FROM #{$table_name} WHERE id < 10000;
DELETE FROM #{$table_name} WHERE id < 100000;
Deleting 100 thousand records does not take 10 times as long as deleting 10 thousand records.
So, besides finding a way to delete records faster, there are some indirect methods.
1. We can rename table_name to table_name_bak, and then select the records we want to keep from table_name_bak into a new table_name.
2. To delete 10000 records, we can delete 1000 records 10 times. Here is an example Ruby script that does it.
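#!/usr/bin/env ruby
require 'mysql2'

# Connection settings are examples; adjust them for your server.
$client = Mysql2::Client.new(
  :as => :array,
  :host => '10.0.0.250',
  :username => 'mysql',
  :password => '123456',
  :database => 'test'
)

# The IDs to delete and the table to delete them from.
$ids = (1..1000000).to_a
$table_name = "test"

# Take 1000 IDs off the front of the list per iteration and delete
# them with one short statement, until no IDs are left.
until $ids.empty?
  ids = $ids.shift(1000).join(", ")
  puts "delete =================="
  $client.query("
    DELETE FROM #{$table_name}
    WHERE id IN ( #{ids} )
  ")
end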
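Method 1 (rename, then copy back) can be sketched in the same style. The helper below only builds the SQL statements, so they can be reviewed before being run; the name `rebuild_statements` and the `keep_condition` argument are assumptions for illustration, with `keep_condition` standing for a trusted WHERE clause that matches the rows to keep.

```ruby
# Sketch only: returns the statements for the rename-and-copy-back approach.
# `keep_condition` must be a trusted SQL expression selecting rows to KEEP.
def rebuild_statements(table, keep_condition)
  raise ArgumentError, "unexpected table name" unless table.match?(/\A\w+\z/)
  backup = "#{table}_bak"
  [
    "RENAME TABLE #{table} TO #{backup}",
    # CREATE TABLE ... LIKE copies the column and index definitions.
    "CREATE TABLE #{table} LIKE #{backup}",
    "INSERT INTO #{table} SELECT * FROM #{backup} WHERE #{keep_condition}"
  ]
end
```

Each returned string can then be passed to `$client.query` in order. The table is empty between the second and third statements, so this only suits cases where that is acceptable.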
The basic technique for deleting multiple rows from a single MySQL table by the id field:
DELETE FROM tbl_name WHERE id >= 100 AND id <= 200;
This query deletes the rows whose id lies between 100 and 200 from the given table.
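For a much wider id range, the same statement can be split into consecutive sub-ranges so that each DELETE stays short. A minimal Ruby sketch, where the helper name `range_delete_statements` and the step size are assumptions, not part of the answer:

```ruby
# Sketch only: splits [first_id, last_id] into sub-ranges of `step` ids
# and returns one DELETE statement per sub-range.
def range_delete_statements(table, first_id, last_id, step)
  first_id.step(last_id, step).map do |lo|
    hi = [lo + step - 1, last_id].min # clamp the last sub-range
    "DELETE FROM #{table} WHERE id BETWEEN #{lo} AND #{hi}"
  end
end
```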