We have this table:
CREATE TABLE `test_table` (
`id` INT NOT NULL AUTO_INCREMENT,
`time` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
`value` FLOAT NOT NULL,
`session` INT NOT NULL,
PRIMARY KEY (`id`),
UNIQUE INDEX `session_time_idx` (`session` ASC,`time` ASC)
) ENGINE = InnoDB;
It is used to store different "measurement sessions" each resulting in potentially hundreds of thousands of rows. Different measurement sessions may have the same or overlapping timestamp ranges. We then need to randomly access single measurements with queries like this:
SELECT * FROM `test_table` WHERE `session` = 2 AND `time` < '2003-12-02' ORDER BY `time` DESC LIMIT 1;
We need to query for times which are spread uniformly over the measurement session. The "less than" operator is necessary because we don't know exactly when each measurement was taken; we just need to find the last measurement which was performed before a given date and time.
Depending on the time specified in the query, we have 2 possible resulting plans:
mysql> EXPLAIN SELECT * FROM `test_table` WHERE `session` = 2 AND `time` < '2003-12-02' ORDER BY `time` DESC LIMIT 1;
+----+-------------+------------+-------+------------------+------------------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+-------+------------------+------------------+---------+------+------+-------------+
| 1 | SIMPLE | test_table | range | session_time_idx | session_time_idx | 8 | NULL | 6050 | Using where |
+----+-------------+------------+-------+------------------+------------------+---------+------+------+-------------+
mysql> EXPLAIN SELECT * FROM `test_table` WHERE `session` = 2 AND `time` < '2005-01-02' ORDER BY `time` DESC LIMIT 1;
+----+-------------+------------+------+------------------+------------------+---------+-------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+------+------------------+------------------+---------+-------+--------+-------------+
| 1 | SIMPLE | test_table | ref | session_time_idx | session_time_idx | 4 | const | 127758 | Using where |
+----+-------------+------------+------+------------------+------------------+---------+-------+--------+-------------+
The first plan uses the whole index (session and time), typically resulting in sub-ms execution times on development machines.
The second plan uses only the first part of the index and then scans all the rows of the session, sometimes hundreds of thousands of them. Needless to say, its performance is very poor: tens of ms on development machines, which can become seconds on the slower embedded production devices.
The only difference between the two queries is the number of rows that would match if no "LIMIT" clause were used. That trade-off makes sense when no "LIMIT" is specified, because scanning the data directly can be cheaper than scanning both the second part of the index and the data. But MySQL doesn't seem to take into account that we only need one row: using the full index always seems to be the best choice in this case.
I made some tests which resulted in the following observations:
if I select just "id", "time" and/or "session" (not "value"), the full index is used in all cases (because all the needed data is in the index); so, while slightly cumbersome, querying first for the "id" and then for the rest of the data would work (see the sketch below)
using "FORCE INDEX (session_time_idx)" does fix the bad plan and results in fast queries all the times
no issue is present when using a single column index on time
running OPTIMIZE TABLE does not make any difference
using MyISAM instead of InnoDB makes no difference
using a simple integer instead of a TIMESTAMP makes no difference (as expected: TIMESTAMP is an integer after all)
I played with various parameters, including "max_seeks_for_key", but I couldn't fix the bad plan
Since we are using this kind of access pattern in many places and we have a custom ORM system, I'd like to know if there is a way to "convince" MySQL to do the right thing without having to add "FORCE INDEX" support to the ORM.
Any other suggestion for working around this problem would also be appreciated.
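For what it's worth, here is a sketch of the two-step ("deferred join") workaround from the first observation above; whether the optimizer actually keeps the range plan for the inner query would still need to be confirmed with EXPLAIN:
SELECT t.*
FROM (
    SELECT `id` FROM `test_table`
    WHERE `session` = 2 AND `time` < '2005-01-02'
    ORDER BY `time` DESC
    LIMIT 1
) AS best
JOIN `test_table` AS t ON t.`id` = best.`id`;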
My setup: MySQL Server 5.5.47 on Ubuntu 14.04 64-bit.
Update: this also happens with MySQL Server 5.6 and 5.7.
For reference, this is the script I am using to create the test setup:
set @@time_zone = "+00:00";
drop schema if exists `index_test`;
create schema `index_test`;
use `index_test`;
CREATE TABLE `test_table` (
`id` INT NOT NULL AUTO_INCREMENT,
`time` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
`value` FLOAT NOT NULL,
`session` INT NOT NULL,
PRIMARY KEY (`id`),
UNIQUE INDEX `session_time_idx` (`session` ASC,`time` ASC)
) ENGINE = InnoDB;
delimiter $$
CREATE PROCEDURE fill(total int)
BEGIN
DECLARE count int;
DECLARE countPerSs int;
DECLARE tim int;
set count = 0;
set countPerSs = 100000;
set tim = unix_timestamp('2000-01-01');
myloop: LOOP
insert into `test_table` set `value` = rand(), `session` = count div countPerSs, `time` = from_unixtime(tim);
set tim = tim + 10 * 60;
SET count = count + 1;
IF count < total THEN
ITERATE myloop;
END IF;
LEAVE myloop;
END LOOP myloop;
END;
$$
delimiter ;
call fill(500000);
I'll be the first to admit that I'm not great at SQL (and I probably shouldn't be treating it like a rolling log file), but I was wondering if I could get some pointers for improving some slow queries...
I have a large MySQL table with 2M rows where I do two full table lookups based on a subset of the most recent data. When I load the page that contains these queries, I often find they take several seconds to complete, even though the subquery they share is quite quick.
PMA's (supposedly terrible) advisor pretty much throws the entire kitchen sink at me: temporary tables, too many sorts, joins without indexes (I don't even have any joins?), reading from fixed position, reading next position, temporary tables written to disk... That last one especially makes me wonder if it's a configuration issue, but I've played with all the knobs, and even paid for a managed service, which didn't seem to help.
CREATE TABLE `archive` (
`id` bigint UNSIGNED NOT NULL,
`ip` varchar(15) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
`service` enum('ssh','telnet','ftp','pop3','imap','rdp','vnc','sql','http','smb','smtp','dns','sip','ldap') CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
`hostid` bigint UNSIGNED NOT NULL,
`date` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
ALTER TABLE `archive`
ADD PRIMARY KEY (`id`),
ADD KEY `service` (`service`),
ADD KEY `date` (`date`),
ADD KEY `ip` (`ip`),
ADD KEY `date-ip` (`date`,`ip`),
ADD KEY `date-service` (`date`,`service`),
ADD KEY `ip-date` (`ip`,`date`),
ADD KEY `ip-service` (`ip`,`service`),
ADD KEY `service-date` (`service`,`date`),
ADD KEY `service-ip` (`service`,`ip`);
Adding indexes definitely helped (even though they're 4x the size of the actual data), but I'm kind of at a loss as to where I can optimize further. Initially I thought about caching the subquery results in PHP and using them twice for the main queries, but I don't think I have access to the result once I close the subquery. I looked into doing joins, but they seem to be meant for two or more separate tables, whereas my subquery is on the same table, so I'm not sure that would even work. The queries are supposed to find the most active IPs/services among the IPs I have data from in the past 24 hours...
SELECT service, COUNT(service) AS total FROM `archive`
WHERE ip IN
(SELECT DISTINCT ip FROM `archive` WHERE date > DATE_SUB(CURRENT_TIMESTAMP, INTERVAL 24 HOUR))
GROUP BY service HAVING total > 1
ORDER BY total DESC, service ASC LIMIT 10
+----+--------------+-----------------+------------+-------+----------------------------------------------------------------------------+------------+---------+------------------------+-------+----------+---------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------+-----------------+------------+-------+----------------------------------------------------------------------------+------------+---------+------------------------+-------+----------+---------------------------------+
| 1 | SIMPLE | <subquery2> | NULL | ALL | NULL | NULL | NULL | NULL | NULL | 100.00 | Using temporary; Using filesort |
| 1 | SIMPLE | archive | NULL | ref | service,ip,date-service,ip-date,ip-service,service-date,service-ip | ip-service | 47 | <subquery2>.ip | 5 | 100.00 | Using index |
| 2 | MATERIALIZED | archive | NULL | range | date,ip,date-ip,date-service,ip-date,ip-service | date-ip | 5 | NULL | 44246 | 100.00 | Using where; Using index |
+----+--------------+-----------------+------------+-------+----------------------------------------------------------------------------+------------+---------+------------------------+-------+----------+---------------------------------+
SELECT ip, COUNT(ip) AS total FROM `archive`
WHERE ip IN
(SELECT DISTINCT ip FROM `archive` WHERE date > DATE_SUB(CURRENT_TIMESTAMP, INTERVAL 24 HOUR))
GROUP BY ip HAVING total > 1
ORDER BY total DESC, INET_ATON(ip) ASC LIMIT 10
+----+--------------+-----------------+------------+-------+---------------------------------------------------------------+---------+---------+------------------------+-------+----------+---------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------+-----------------+------------+-------+---------------------------------------------------------------+---------+---------+------------------------+-------+----------+---------------------------------+
| 1 | SIMPLE | <subquery2> | NULL | ALL | NULL | NULL | NULL | NULL | NULL | 100.00 | Using temporary; Using filesort |
| 1 | SIMPLE | archive | NULL | ref | ip,date-ip,ip-date,ip-service,service-ip | ip-date | 47 | <subquery2>.ip | 5 | 100.00 | Using index |
| 2 | MATERIALIZED | archive | NULL | range | date,ip,date-ip,date-service,ip-date,ip-service | date-ip | 5 | NULL | 44168 | 100.00 | Using where; Using index |
+----+--------------+-----------------+------------+-------+---------------------------------------------------------------+---------+---------+------------------------+-------+----------+---------------------------------+
common subquery: 0.0351s
whole query 1: 1.4270s
whole query 2: 1.5601s
total page load: 3.050s (7 queries total)
Am I just doomed to terrible performance with this table?
Hopefully there's enough information here to get an idea of what's going on, but if anyone can help I would certainly appreciate it. I don't mind throwing more hardware at the issue, but when an 8c/16t server with 16 GB can't handle 150 MB of data I'm not sure what will. Thanks in advance for reading my long-winded question.
You have the right indexes (as well as many other indexes) and your query both meets your specs and runs close to optimally. It's unlikely that you can make this much faster: it needs to look all the way back to the beginning of your table.
If you can change your spec so you only have to look back a limited amount of time, like a year, you'll get a good speedup.
Some possible minor tweaks.
use the latin1_bin collation for your ip column. It uses 8-bit characters and collates them by raw byte value, with no linguistic rules involved. That's plenty for IPv4 dotted-quad addresses (and IPv6 addresses). You'll get rid of a bit of overhead in matching and grouping. Or, even better,
If you know you will have nothing but IPv4 addresses, rework your ip column to store their binary representations (that is, the INET_ATON()-generated value of each IPv4 address). Those fit in the UNSIGNED INT 32-bit integer data type, making lookups, grouping, and ordering even faster.
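A rough migration sketch for that idea (the ip_num column name is made up for illustration):
ALTER TABLE `archive` ADD COLUMN `ip_num` INT UNSIGNED NULL;
UPDATE `archive` SET `ip_num` = INET_ATON(`ip`);
-- display it again with INET_NTOA(ip_num); composite indexes would then be rebuilt on ip_num instead of ip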
It's possible to rework the way you gather these data. For example, you could arrange to gather at most one row per ip per service per day. That will reduce the time-series resolution of your data, but it will also make your queries much faster. Define your table something like this:
CREATE TABLE archive2 (
ip VARCHAR(15) CHARACTER SET latin1 COLLATE latin1_bin NOT NULL,
service ENUM ('ssh','telnet','ftp',
'pop3','imap','rdp',
'vnc','sql','http','smb',
'smtp','dns','sip','ldap') NOT NULL,
`date` DATE NOT NULL,
`count` INT NOT NULL,
hostid bigint UNSIGNED NOT NULL,
PRIMARY KEY (`date`, ip, service)
) ENGINE=InnoDB;
Then, when you insert a row, use this query:
INSERT INTO archive2 (`date`, ip, service, `count`, hostid)
VALUES (CURDATE(), ?ip, ?service, 1, ?hostid)
ON DUPLICATE KEY UPDATE
  `count` = `count` + 1;
This will automatically increment your count column if the row for the ip, service, and date already exists.
Then your second query will look like:
SELECT ip, SUM(`count`) AS total
FROM archive2
WHERE ip IN (
    SELECT ip FROM archive2
    WHERE `date` > CURDATE() - INTERVAL 1 DAY
)
GROUP BY ip
HAVING total > 1
ORDER BY total DESC, INET_ATON(ip) ASC LIMIT 10;
The index of the primary key will satisfy this query.
First query
(I'm not convinced that it can be made much faster.)
(currently)
SELECT service, COUNT(service) AS total
FROM `archive`
WHERE ip IN (
SELECT DISTINCT ip
FROM `archive`
WHERE date > DATE_SUB(CURRENT_TIMESTAMP, INTERVAL 24 HOUR)
)
GROUP BY service
HAVING total > 1
ORDER BY total DESC, service ASC
LIMIT 10
Notes:
COUNT(service) --> COUNT(*)
DISTINCT is not needed in IN (SELECT DISTINCT ...)
IN ( SELECT ... ) is often slow; rewrite using EXISTS ( SELECT 1 ... ) or JOIN (see below)
INDEX(date, IP) -- for subquery
INDEX(service, IP) -- for your outer query
INDEX(IP, service) -- for my outer query
Toss redundant indexes; they can get in the way. (See below)
It will have to gather all the possible results before getting to the ORDER BY and LIMIT. (That is, LIMIT has very little impact on performance for this query.)
CHARACTER SET utf8 COLLATE utf8_unicode_ci is gross overkill for IP addresses; switch to CHARACTER SET ascii COLLATE ascii_bin.
If you are running MySQL 8.0 (Or MariaDB 10.2), a WITH to calculate the subquery once, together with a UNION to compute the two outer queries, may provide some extra speed.
MariaDB has a "subquery cache" that might have the effect of skipping the second subquery evaluation.
By using DATETIME instead of TIMESTAMP, you will avoid two minor hiccups per year when daylight saving time kicks in/out.
I doubt if hostid needs to be a BIGINT (8-bytes).
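To illustrate the EXISTS form mentioned in the notes, here is a sketch only (untested; the JOIN form follows below):
SELECT a.service, COUNT(*) AS total
FROM archive AS a
WHERE EXISTS (
    SELECT 1 FROM archive AS b
    WHERE b.ip = a.ip
      AND b.`date` > NOW() - INTERVAL 24 HOUR
)
GROUP BY a.service
HAVING total > 1
ORDER BY total DESC, a.service ASC
LIMIT 10;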
To switch to a JOIN, think of fetching the candidate rows first:
SELECT service, COUNT(*) AS total
FROM ( SELECT DISTINCT IP
FROM archive
WHERE `date` > NOW() - INTERVAL 24 HOUR
) AS x
JOIN archive USING(IP)
GROUP BY service
HAVING total > 1
ORDER BY total DESC, service ASC
LIMIT 10
For any further discussion of any slow (but working) query, please provide both flavors of EXPLAIN:
EXPLAIN SELECT ...
EXPLAIN FORMAT=JSON SELECT ...
Drop these indexes:
ADD KEY `service` (`service`),
ADD KEY `date` (`date`),
ADD KEY `ip` (`ip`),
Recommend keeping only:
ADD PRIMARY KEY (`id`),
-- as discussed:
ADD KEY `date-ip` (`date`,`ip`),
ADD KEY `ip-service` (`ip`,`service`),
ADD KEY `service-ip` (`service`,`ip`),
-- maybe other queries need these:
ADD KEY `date-service` (`date`,`service`),
ADD KEY `ip-date` (`ip`,`date`),
ADD KEY `service-date` (`service`,`date`),
The general rule here is that you don't need INDEX(a) when you also have INDEX(a,b). In particular, they may be preventing the use of better indexes; see the EXPLAINs.
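The corresponding DDL for the drops, using the index names from the question's ALTER TABLE (a sketch; check the EXPLAINs first to confirm the composite indexes get picked up):
ALTER TABLE `archive`
  DROP KEY `service`,
  DROP KEY `date`,
  DROP KEY `ip`;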
Second query
Rewrite it
SELECT ip, COUNT(*) AS total
FROM `archive`
WHERE date > DATE_SUB(CURRENT_TIMESTAMP, INTERVAL 24 HOUR)
GROUP BY ip
HAVING total > 1
ORDER BY total DESC, INET_ATON(ip) ASC
LIMIT 10
It will use only INDEX(date, ip).
I need some help figuring out a performance issue. A database containing a single table with a growing number of METARs (aviation weather reports) is slowing down after about 8 million records are present. This despite indexes being in use. Performance can be recovered by rebuilding indexes, but that's really slow and takes the database offline, so I've resorted to just dropping the table and recreating it (losing the last few weeks of data).
The behaviour is the same whether a query is run trying to retrieve an actual metar, or whether a simple select count(*) is executed.
The table creation syntax is as follows:
CREATE TABLE `metars` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`tstamp` timestamp NULL DEFAULT NULL,
`metar` varchar(255) DEFAULT NULL,
`icao` char(7) DEFAULT NULL,
`qnh` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `timestamp` (`tstamp`),
KEY `icao` (`icao`),
KEY `qnh` (`qnh`),
KEY `metar` (`metar`)
) ENGINE=InnoDB AUTO_INCREMENT=812803050 DEFAULT CHARSET=latin1;
Up to about 8 million records, a select count(*) returns in about 500 ms. Beyond that it gradually slows down; currently, again at 14 million records, the count takes between 3 and 30 seconds. I was surprised to see that, when explaining the count query, it uses the timestamp index rather than the primary key. Using the primary key, returning the number of records should be a matter of just a few ms:
mysql> explain select count(*) from metars;
+----+-------------+--------+-------+---------------+-----------+---------+------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+-------+---------------+-----------+---------+------+----------+-------------+
| 1 | SIMPLE | metars | index | NULL | timestamp | 5 | NULL | 14693048 | Using index |
+----+-------------+--------+-------+---------------+-----------+---------+------+----------+-------------+
1 row in set (0.00 sec)
Forcing it to use the primary index is even slower:
mysql> select count(*) from metars use index(PRIMARY);
+----------+
| count(*) |
+----------+
| 14572329 |
+----------+
1 row in set (37.87 sec)
Oddly, the typical use-case query, fetching the weather report for an airport nearest to a specific point in time, continues to perform very well, despite being more complex than a simple count:
mysql> SELECT qnh, metar from metars WHERE icao like 'KLAX' ORDER BY ABS(TIMEDIFF(tstamp, STR_TO_DATE('2019-10-10 00:00:00', '%Y-%m-%d %H:%i:%s'))) LIMIT 0,1;
+------+-----------------------------------------------------------------------------------------+
| qnh | metar |
+------+-----------------------------------------------------------------------------------------+
| 2980 | KLAX 092353Z 25012KT 10SM FEW015 20/14 A2980 RMK AO2 SLP091 T02000139 10228 20200 56007 |
+------+-----------------------------------------------------------------------------------------+
1 row in set (0.01 sec)
What am I doing wrong here?
InnoDB performs a plain COUNT(*) by traversing some index. It prefers the smallest index because that requires touching the fewest blocks.
The PRIMARY KEY is clustered with the data, so that index is actually the biggest.
What version are you using? TIMESTAMP changed at some point. Perhaps that explains why tstamp is used instead of qnh.
If you are purging old data by using DELETE, see http://mysql.rjweb.org/doc.php/partitionmaint for a faster way.
I assume the data is static, that is, it is never UPDATEd? Consider building and maintaining a summary table, perhaps indexed by date. It could hold various counts for each day. A fetch from that table would then be much faster than hitting the raw data. More: http://mysql.rjweb.org/doc.php/summarytables
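A minimal sketch of such a summary table (table and column names here are illustrative, not prescribed by anything above):
CREATE TABLE metars_daily (
  `day` DATE NOT NULL,
  `icao` CHAR(7) NOT NULL,
  `row_count` INT NOT NULL,
  PRIMARY KEY (`day`, `icao`)
) ENGINE=InnoDB;

-- refresh yesterday's counts, e.g. from a nightly job
INSERT INTO metars_daily (`day`, `icao`, `row_count`)
SELECT DATE(tstamp), icao, COUNT(*)
FROM metars
WHERE tstamp >= CURDATE() - INTERVAL 1 DAY
  AND tstamp <  CURDATE()
  AND icao IS NOT NULL
GROUP BY DATE(tstamp), icao
ON DUPLICATE KEY UPDATE `row_count` = VALUES(`row_count`);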
How many rows are there for KLAX? That query must fetch all of them in order to compute the time difference before applying the LIMIT. If you had INDEX(icao, tstamp), you could find the nearest report before or after a given time even faster.
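As an illustration, here is a sketch of that index plus a "nearest report" query that probes it from both sides instead of computing a time difference for every KLAX row (the index name is arbitrary, and this is untested):
ALTER TABLE metars ADD INDEX icao_tstamp (icao, tstamp);

-- latest report at or before the target time
( SELECT qnh, metar, tstamp FROM metars
  WHERE icao = 'KLAX' AND tstamp <= '2019-10-10 00:00:00'
  ORDER BY tstamp DESC LIMIT 1 )
UNION ALL
-- earliest report after the target time
( SELECT qnh, metar, tstamp FROM metars
  WHERE icao = 'KLAX' AND tstamp > '2019-10-10 00:00:00'
  ORDER BY tstamp ASC LIMIT 1 )
ORDER BY ABS(TIMESTAMPDIFF(SECOND, tstamp, '2019-10-10 00:00:00'))
LIMIT 1;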
I know a solution for when you can sort the table by some unique index:
SELECT user_id, external_id, name, metadata, date_created
FROM users
WHERE user_id > 51234123
ORDER BY user_id ASC
LIMIT 10000;
but in my case I want to sort the table by an index whose column contains random data:
CREATE TABLE `t` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`sorter` bigint(20) NOT NULL,
`data1` varchar(200) NOT NULL,
`data2` varchar(200) NOT NULL,
`data3` varchar(200) NOT NULL,
`data4` int(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `sorter` (`sorter`),
KEY `id` (`id`,`sorter`),
KEY `sorter_2` (`sorter`,`id`)
) ENGINE=MyISAM AUTO_INCREMENT=1 DEFAULT CHARSET=utf8;
for ($i = 0; $i < 2e6; $i++)
$db->query("INSERT INTO `t` (`sorter`, `data1`, `data2`, `data3`, `data4`) VALUES (rand()*3e17, rand(), rand(), rand(), rand())");
for ($i = 0; $i < 1e6; $i++)
$db->query("INSERT INTO `t` (`sorter`, `data1`, `data2`, `data3`, `data4`) VALUES (0, rand(), rand(), rand(), rand())");
solution 1:
for ($i = 0; $i < $maxId; $i += $step)
select * from t
where id>=$i
order by sorter
limit $step
select * from t order by sorter limit 512123, 10000;
10000 rows in set (9.22 sec)
select * from t order by sorter limit 512123, 1000;
1000 rows in set (6.25 sec)
+------+-------------+-------+------+---------------+------+---------+------+---------+----------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+------+---------------+------+---------+------+---------+----------------+
| 1 | SIMPLE | t | ALL | NULL | NULL | NULL | NULL | 3000000 | Using filesort |
+------+-------------+-------+------+---------------+------+---------+------+---------+----------------+
solution 2:
select id from t order by sorter limit 1512123, 10000;
+------+-------------+-------+-------+---------------+----------+---------+------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+-------+---------------+----------+---------+------+---------+-------------+
| 1 | SIMPLE | t | index | NULL | sorter_2 | 16 | NULL | 1522123 | Using index |
+------+-------------+-------+-------+---------------+----------+---------+------+---------+-------------+
10000 rows in set (0.74 sec)
0.74 s sounds good, but for the whole table it takes 0.74 * 3000e3 / 10e3 / 60 = more than 3 minutes, and that's only for gathering the ids
Using OFFSET is not as efficient as you might think. With LIMIT 1512123, 10000, the first 1512123 rows must be stepped over. The bigger that number, the slower the query runs.
To explain the difference in the EXPLAINs...
'Solution 1' uses SELECT *; you don't have a covering index for it. So, there are two ways the query might be run:
(it did this): Scan 'ALL' the table, collecting all the columns (*); sort; skip over 512123 rows; and deliver 10000 or 1000 rows.
(a small OFFSET and LIMIT might lead to this): Inside the BTree for INDEX(sorter, id) skip over the OFFSET rows; grab the LIMIT rows; for each grabbed row in the index, reach over into the data file using the byte offset (note: You are using MyISAM, not InnoDB) to find the row; grab * and deliver it. No sort needed.
Unfortunately, the Optimizer does not have enough statistics, nor enough smarts, to always pick correctly between these two choices.
'Solution 2' uses a "covering" index INDEX(sorter, id). (The clue: "Using index".) This contains all the columns (only sorter and id) found anywhere in the query (select id from t order by sorter limit 1512123, 10000;), hence the index can (and usually will) be used in preference over scanning the table.
Another solution alluded to involved where id>=$i. This avoids the OFFSET. However, since you are using MyISAM, the index and the data cannot be "clustered" together. With InnoDB, the data is ordered according to the PRIMARY KEY. If that is id, then the query can start by jumping directly into the middle of the data (at $i). With MyISAM, what I just described is done in the BTree for INDEX(id); but it still has to bounce back and forth between that Btree and the .MYD file where the data is. (This is an example of where InnoDB's design is inherently more efficient than MyISAM's.)
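For completeness, here is a keyset-pagination sketch over INDEX(sorter, id); it is not taken from the answer above, and it assumes the application remembers the (sorter, id) pair of the last row it received:
-- @last_sorter / @last_id come from the last row of the previous batch
SELECT id, sorter
FROM t
WHERE sorter > @last_sorter
   OR (sorter = @last_sorter AND id > @last_id)
ORDER BY sorter, id
LIMIT 10000;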
If your goal is to get a bunch of random rows from a table, read my treatise. In summary, there are faster ways, but none is 'perfect', though usually "good enough".
Morning,
I've got multiple clients trying to get a unique primary key on a table.
A row identified by that PK is considered "valid" only if it matches a successful range scan. The range scan is SELECT id FROM lookup WHERE allowed='Y' AND updated<=NOW() LIMIT 1
+------------+---------------+------+-----+-------------------+----------------+
| Field | Type | Null | Key | Default | Extra |
+------------+---------------+------+-----+-------------------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| fullname | varchar(250) | NO | UNI | 0 | |
| allowed | enum('Y','N') | NO | MUL | N | |
| updated | timestamp | NO | | CURRENT_TIMESTAMP | |
| hits | smallint(6) | NO | MUL | 0 | |
| stop_allow | enum('Y','N') | NO | MUL | N | |
+------------+---------------+------+-----+-------------------+----------------+
Once that first SELECT is done, another SELECT is executed in order to retrieve the content.
The problem is that many clients are doing the same thing at the same time (or they randomly find a way to pick the same row, grrrr...).
So far, I've tried:
1)
start transaction;
*range scan* LIMIT 1 FOR UPDATE;
SELECT * from lookup WHERE id=(result of the range scan);
*perform stuff*
commit;
This is a performance killer. Stuff is locked forever and "Mysql server goes to heaven" after some time.
2)
start transaction;
*range scan*
SELECT * from lookup WHERE id=(result of the previous query) FOR UPDATE;
*perform stuff*
commit;
This fails miserably with autocommit=0, but it is quite fast
3) At this point, I'm starting to think that transactions are the problem
no transaction;
//get a row that is not being processed
*range scan* LEFT OUTER JOIN temp_mem_table WHERE **temp_mem_table.id IS NULL**
$rid = (result of the range scan)
//check if another client is doing the same thing, if so then stop here
select 1 from temp_mem_table WHERE id=$rid
//if there is a result => return null; this is not enough to block stuff going through
//signal to other client that this ID is being processed
insert into temp_mem_table(id) values($rid)
//get the content
SELECT * from lookup WHERE id=($rid);
*perform time intensive operations*
Edit: the temp_mem_table is in fact a memory table that is flushed once in a while. It looks like this:
CREATE TABLE temp_mem_table(id int(11), primary key(id)) engine=memory
Thought process is: if what's being processed is stored on a memory table accessible to all clients, then they should be able to know what their friends are doing. The check should stop any further processing. But somehow they find a way to go through :(
After a short period of time, it appears that almost 50% of those primary keys were processed at least twice.
I'm going to find a way of doing this, but maybe some of you encountered a similar situation and can help.
thanx
Ok, for those who encountered the famous "How do I select an unlocked row in MySQL?" problem, such as seen here http://bugs.mysql.com/bug.php?id=49763 and in a lot of other places, here is a dirty hack to solve it.
This is done in REPEATABLE READ mode, which should be ACID over 9000 or at least won't break anything (maybe).
The starting point is to have some kind of 'range' of rows that need to be locked for reading so other clients won't get them no matter what.
SELECT pk FROM tbl LIMIT 0,10
SELECT pk FROM tbl where *large range scan*
I create a memory table (because it should be faster) such as:
CREATE TABLE `jobs` (
`pid` smallint(6) DEFAULT NULL,
`tid` int(11) DEFAULT NULL,
UNIQUE KEY `pid` (`pid`),
UNIQUE KEY `tid` (`tid`)
) ENGINE=MEMORY DEFAULT CHARSET=utf8
Pid is a unique identifier of the client. In my case it's the actual process id.
Tid is the task id which matches the primary key of that huge table in which we perform some kind of range scan.
Then pseudo-code is like this:
SELECT pk from tbl WHERE (range scan) or limit 0,100
delete from jobs where pid=$my_pid
foreach of those pk do
if(insert IGNORE into jobs(pid,tid) values($my_pid, pk)) break;
done;
select pk from jobs where pid=$my_pid
select * from big_tbl where id=pk
Have tested this with 2,10,25,50 and 100 concurrent clients and got 100% unique distribution of tasks across each client.
Now this might not be super complicated, or might not look elegant but I don't give a damn as long as CPU stays cool.
Can you add another column to the table to indicate that the row is being processed, and by whom? Then you could do:
START TRANSACTION;
UPDATE lookup SET owner=<client id>
WHERE id=( SELECT id FROM *range scan* ...
AND owner IS NULL
AND completed = false
FOR UPDATE);
COMMIT;
*do stuff*
UPDATE lookup SET owner=NULL, completed=true,... WHERE owner=<client id>;
The final UPDATE will never cause a conflict as long as every client has its own unique ID, and the initial SELECT can be LIMITed, and with proper indexing ought to be quite fast.
It is important that the last UPDATE keeps the row unselectable by the other clients. That is, the initial SELECT gets those rows where owner is NULL and completed is false; the first UPDATE makes them unselectable in that they now have an owner; the final UPDATE keeps them unselectable in that they are now completed.
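Putting that together for the lookup table from the question, a hedged sketch; it assumes the new owner and completed columns have been added, and @client_id is whatever unique ID each client uses:
START TRANSACTION;
SET @claimed = NULL;                   -- so a previous run's value can't leak through
SELECT id INTO @claimed
FROM lookup
WHERE allowed = 'Y' AND updated <= NOW()
  AND owner IS NULL AND completed = false
LIMIT 1
FOR UPDATE;
UPDATE lookup SET owner = @client_id WHERE id = @claimed;
COMMIT;

-- ... perform the long-running work outside the transaction ...

UPDATE lookup SET owner = NULL, completed = true WHERE owner = @client_id;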
Note: I hadn't realized this solution had already been proposed in a comment by user Kickstart.
I have a large, fast-growing log table in an application running with MySQL 5.0.77. I'm trying to find the best way to optimize queries that count instances within the last X days according to message type:
CREATE TABLE `counters` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`kind` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`created_at` datetime DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `index_counters_on_kind` (`kind`),
KEY `index_counters_on_created_at` (`created_at`)
) ENGINE=InnoDB AUTO_INCREMENT=302 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
For this test set, there are 668521 rows in the table. The query I'm trying to optimize is:
SELECT kind, COUNT(id) FROM counters WHERE created_at >= ? GROUP BY kind;
Right now, that query takes between 3-5 seconds, and is being estimated as follows:
+----+-------------+----------+-------+----------------------------------+------------------------+---------+------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------+-------+----------------------------------+------------------------+---------+------+---------+-------------+
| 1 | SIMPLE | counters | index | index_counters_on_created_at_idx | index_counters_on_kind | 258 | NULL | 1185531 | Using where |
+----+-------------+----------+-------+----------------------------------+------------------------+---------+------+---------+-------------+
1 row in set (0.00 sec)
With the created_at index removed, it looks like this:
+----+-------------+----------+-------+---------------+------------------------+---------+------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------+-------+---------------+------------------------+---------+------+---------+-------------+
| 1 | SIMPLE | counters | index | NULL | index_counters_on_kind | 258 | NULL | 1185531 | Using where |
+----+-------------+----------+-------+---------------+------------------------+---------+------+---------+-------------+
1 row in set (0.00 sec)
(Yes, for some reason the row estimate is larger than the number of rows in the table.)
So, apparently, there's no point to that index.
Is there really no better way to do this? I tried the column as a timestamp, and it just ended up slower.
Edit: I discovered that changing the query to use an interval instead of a specific date ends up using the index, cutting down the row estimate to about 20% of the query above:
SELECT kind, COUNT(id) FROM counters WHERE created_at >=
(NOW() - INTERVAL 7 DAY) GROUP BY kind;
I'm not entirely sure why that happens, but I'm fairly confident that if I understood it then the problem in general would make a lot more sense.
Why not use a concatenated index?
CREATE INDEX idx_counters_created_kind ON counters(created_at, kind);
It should go for an index-only scan ("Using index" in Extra), because id is NOT NULL so COUNT(id) is effectively COUNT(*).
References:
Concatenated index vs. merging multiple indexes
Index-Only Scan
After reading the latest edit on the question, the problem seems to be that the parameter used in the WHERE clause was being interpreted by MySQL as a string rather than as a datetime value. This would explain why the index_counters_on_created_at index was not being selected by the optimizer; instead the comparison would require converting the created_at values, resulting in a scan. I think this can be prevented by an explicit cast to datetime in the WHERE clause:
where `created_at` >= convert({specific_date}, datetime)
My original comments still apply for the optimization part.
The real performance killer here is the kind column. When doing the GROUP BY, the database engine first needs to determine all the distinct values in the kind column, which results in a table or index scan. That's why the estimated rows figure is bigger than the total number of rows in the table: in one pass it determines the distinct values in the kind column, and in a second pass it determines which rows meet the created_at >= ? condition.
To make matters worse, the kind column is a varchar(255), which is too big to be efficient. Add to that that it uses the utf8 character set and utf8_unicode_ci collation, which increases the complexity of the comparisons needed to determine the unique values in that column.
This will perform a lot better if you change the type of the kind column to int. Because integer comparisons are more efficient and simpler than unicode character comparisons. It would also help to have a catalog table for the kind of messages in which you store the kind_id and description. And then do the grouping on a join of the kind catalog table and a subquery of the log table that first filters by date:
select k.kind_id, count(*)
from
kind_catalog k
inner join (
select kind_id
from counters
where created_at >= ?
) c on k.kind_id = c.kind_id
group by k.kind_id
This will first filter the counters table by create_at >= ? and can benefit from the index on that column. Then it will join that to the kind_catalog table and if the SQL optimizer is good it will scan the smaller kind_catalog table for doing the grouping, instead of the counters table.
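A rough sketch of that catalog table and a one-off backfill (only kind_catalog, kind_id and description come from the text above; the rest is illustrative):
CREATE TABLE kind_catalog (
  kind_id INT NOT NULL AUTO_INCREMENT,
  description VARCHAR(255) NOT NULL,
  PRIMARY KEY (kind_id),
  UNIQUE KEY (description)
) ENGINE=InnoDB;

-- seed it from the existing values
INSERT INTO kind_catalog (description)
SELECT DISTINCT kind FROM counters WHERE kind IS NOT NULL;

-- add the integer column, backfill it, and index it together with created_at
ALTER TABLE counters ADD COLUMN kind_id INT NULL,
                     ADD KEY idx_counters_created_kind_id (created_at, kind_id);
UPDATE counters c
JOIN kind_catalog k ON k.description = c.kind
SET c.kind_id = k.kind_id;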