I have a database with the following three tables:
matches table has 200,000 matches...
CREATE TABLE `matches` (
`match_id` bigint(20) unsigned NOT NULL,
`start_time` int(10) unsigned NOT NULL,
PRIMARY KEY (`match_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
heroes table has ~100 heroes...
CREATE TABLE `heroes` (
`hero_id` smallint(5) unsigned NOT NULL,
`name` char(40) NOT NULL,
PRIMARY KEY (`hero_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
matches_heroes table has 2,000,000 relationships (10 random heroes per match)...
CREATE TABLE `matches_heroes` (
`relation_id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`match_id` bigint(20) unsigned NOT NULL,
`hero_id` smallint(6) unsigned NOT NULL,
PRIMARY KEY (`relation_id`),
KEY `match_id` (`match_id`),
KEY `hero_id` (`hero_id`),
CONSTRAINT `matches_heroes_ibfk_2` FOREIGN KEY (`hero_id`)
REFERENCES `heroes` (`hero_id`),
CONSTRAINT `matches_heroes_ibfk_1` FOREIGN KEY (`match_id`)
REFERENCES `matches` (`match_id`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=3689891 DEFAULT CHARSET=utf8
The following query takes over 1 second, which seems pretty slow to me for something so simple:
SELECT SQL_NO_CACHE COUNT(*) AS match_count
FROM matches INNER JOIN matches_heroes ON matches.match_id = matches_heroes.match_id
WHERE hero_id = 5
Removing only the WHERE clause doesn't help, but if I take out the INNER JOIN also, like so:
SELECT SQL_NO_CACHE COUNT(*) AS match_count FROM matches
...it only takes 0.05 seconds. It seems that INNER JOIN is very costly. I don't have much experience with joins. Is this normal or am I doing something wrong?
UPDATE #1: Here's the EXPLAIN result.
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE matches_heroes ref match_id,hero_id,match_id_hero_id hero_id 2 const 34742
1 SIMPLE matches eq_ref PRIMARY PRIMARY 8 mydatabase.matches_heroes.match_id 1 Using index
UPDATE #2: After listening to you guys, I think it's working properly and this is simply as fast as it gets. Please let me know if you disagree. Thanks for all the help. I really appreciate it.
Use COUNT(matches.match_id) instead of COUNT(*): when using joins it's generally best to avoid *, as it can request extra work; counting a column from the join ensures you aren't asking for any other operations. (Not actually a problem for a MySQL inner join, my bad.)
Also, verify that your keys are defragmented and that there is enough free RAM for the indexes to be loaded into memory.
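For example, a quick way to check both (a sketch; on InnoDB, OPTIMIZE TABLE maps to a table rebuild plus ANALYZE, and the status output gives an index size to compare against your buffer settings):
OPTIMIZE TABLE matches_heroes;
SHOW TABLE STATUS LIKE 'matches_heroes';       -- Index_length = bytes used by indexes
SHOW VARIABLES LIKE 'innodb_buffer_pool_size'; -- should comfortably exceed Index_length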
Update 1:
Try to add a composed index for match_id,hero_id as it should give better performance.
ALTER TABLE `matches_heroes` ADD KEY `match_id_hero_id` (`match_id`,`hero_id`)
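After adding it, re-running EXPLAIN on the original query should list the new key among possible_keys (a sketch of the check):
EXPLAIN SELECT SQL_NO_CACHE COUNT(*) AS match_count
FROM matches INNER JOIN matches_heroes ON matches.match_id = matches_heroes.match_id
WHERE hero_id = 5;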
Update 2:
I wasn't satisfied with the accepted answer's suggestion that MySQL is simply that slow for just 2 million records, so I ran benchmarks on my Ubuntu PC (i7 processor, standard HDD).
-- pre-requirements
CREATE TABLE seq_numbers (
number INT NOT NULL
) ENGINE = MYISAM;
DELIMITER $$
CREATE PROCEDURE InsertSeq(IN MinVal INT, IN MaxVal INT)
BEGIN
DECLARE i INT;
SET i = MinVal;
START TRANSACTION;
WHILE i <= MaxVal DO
INSERT INTO seq_numbers VALUES (i);
SET i = i + 1;
END WHILE;
COMMIT;
END$$
DELIMITER ;
CALL InsertSeq(1,200000)
;
ALTER TABLE seq_numbers ADD PRIMARY KEY (number)
;
-- create tables
-- DROP TABLE IF EXISTS `matches`
CREATE TABLE `matches` (
`match_id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`start_time` int(10) unsigned NOT NULL,
PRIMARY KEY (`match_id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8
;
CREATE TABLE `heroes` (
`hero_id` smallint(5) unsigned NOT NULL AUTO_INCREMENT,
`name` char(40) NOT NULL,
PRIMARY KEY (`hero_id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8
;
CREATE TABLE `matches_heroes` (
`match_id` bigint(20) unsigned NOT NULL,
`hero_id` smallint(6) unsigned NOT NULL,
PRIMARY KEY (`match_id`,`hero_id`),
KEY (match_id),
KEY (hero_id),
CONSTRAINT `matches_heroes_ibfk_2` FOREIGN KEY (`hero_id`) REFERENCES `heroes` (`hero_id`),
CONSTRAINT `matches_heroes_ibfk_1` FOREIGN KEY (`match_id`) REFERENCES `matches` (`match_id`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=MyISAM DEFAULT CHARSET=utf8
;
-- insert DATA
-- 100
INSERT INTO heroes(name)
SELECT SUBSTR(CONCAT(char(RAND()*25+65), char(RAND()*25+97), char(RAND()*25+97),
                     char(RAND()*25+97), char(RAND()*25+97), char(RAND()*25+97),
                     char(RAND()*25+97), char(RAND()*25+97), char(RAND()*25+97),
                     char(RAND()*25+97), char(RAND()*25+97), char(RAND()*25+97)),
              1, RAND()*9+4) as RandomName
FROM seq_numbers WHERE number <= 100
-- 200000
INSERT INTO matches(start_time)
SELECT rand()*1000000
FROM seq_numbers WHERE number <= 200000
-- 2000000
INSERT INTO matches_heroes(hero_id,match_id)
SELECT a.hero_id, b.match_id
FROM heroes as a
INNER JOIN matches as b ON 1=1
LIMIT 2000000
-- warm-up database, load INDEXes in ram (optional, works only for MyISAM tables)
LOAD INDEX INTO CACHE matches_heroes,matches,heroes
-- get random hero_id
SET @randHeroId=(SELECT hero_id FROM matches_heroes ORDER BY rand() LIMIT 1);
-- test 1
SELECT SQL_NO_CACHE @randHeroId,COUNT(*) AS match_count
FROM matches as a
INNER JOIN matches_heroes as b ON a.match_id = b.match_id
WHERE b.hero_id = @randHeroId
; -- Time: 0.039s
-- test 2: adding some complexity
SET @randName = (SELECT `name` FROM heroes WHERE hero_id = @randHeroId LIMIT 1);
SELECT SQL_NO_CACHE @randName, COUNT(*) AS match_count
FROM matches as a
INNER JOIN matches_heroes as b ON a.match_id = b.match_id
INNER JOIN heroes as c ON b.hero_id = c.hero_id
WHERE c.name = @randName
; -- Time: 0.037s
Conclusion: my test results are about 20x faster than the times reported in the question, and my server load was about 80% before testing, as it's not a dedicated MySQL server and had other CPU-intensive tasks running. If you run the whole script above and get slower results, it may be because:
you have a shared host and the load is too big. In that case there isn't much you can do: complain to your current host, pay for a better host/VM, or try another host
your configured key_buffer_size (for MyISAM) or innodb_buffer_pool_size (for InnoDB) is too small; the optimum size here would be over 150 MB (see the sketch below)
your available RAM is not enough; you need roughly 100-150 MB of RAM for the indexes to be loaded into memory. Solution: free up some RAM or buy more of it
Note that because the test script generates fresh data, index fragmentation is ruled out as a cause.
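As for the buffer sizes mentioned above, a quick way to inspect and raise them (a sketch; innodb_buffer_pool_size is resizable at runtime only from MySQL 5.7.5 on, older versions need a config change and restart):
SHOW VARIABLES LIKE 'key_buffer_size';
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
SET GLOBAL key_buffer_size = 256 * 1024 * 1024;          -- MyISAM index cache
SET GLOBAL innodb_buffer_pool_size = 256 * 1024 * 1024;  -- InnoDB data + index cache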
Hope this helps, and ask if you have issues in testing this.
Observation:
SELECT SQL_NO_CACHE COUNT(*) AS match_count
FROM matches INNER JOIN matches_heroes ON matches.match_id = matches_heroes.match_id
WHERE hero_id = 5
is equivalent to:
SELECT SQL_NO_CACHE COUNT(*) AS match_count
FROM matches_heroes
WHERE hero_id = 5
So you wouldn't need the join at all if that's the only count you want, but I'm guessing that was just an example.
So you are saying that reading a table of 200,000 records is faster than reading a table of 2,000,000 records, finding the desired ones, and then taking them all over to the 200,000-record table to find the matching records?
And this surprises you? It's simply a lot more work for the DBMS. (It may even be, by the way, that the DBMS decides not to use the hero_id index when it considers a full table scan to be faster.)
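If you want to rule that last possibility out, you can pin the index explicitly and compare timings, using the join-free count noted in the other answer (a sketch; FORCE INDEX is standard MySQL syntax):
SELECT SQL_NO_CACHE COUNT(*) AS match_count
FROM matches_heroes FORCE INDEX (hero_id)
WHERE hero_id = 5;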
So in my opinion there is nothing wrong with what is happening here.
Related
I have a largish but narrow InnoDB table with ~9m records. Doing count(*) or count(id) on the table is extremely slow (6+ seconds):
DROP TABLE IF EXISTS `perf2`;
CREATE TABLE `perf2` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`channel_id` int(11) DEFAULT NULL,
`timestamp` bigint(20) NOT NULL,
`value` double NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `ts_uniq` (`channel_id`,`timestamp`),
KEY `IDX_CHANNEL_ID` (`channel_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
RESET QUERY CACHE;
SELECT COUNT(*) FROM perf2;
While the statement is not run too often it would be nice to optimize it. According to http://www.cloudspace.com/blog/2009/08/06/fast-mysql-innodb-count-really-fast/ this should be possible by forcing InnoDB to use an index:
SELECT COUNT(id) FROM perf2 USE INDEX (PRIMARY);
The explain plan seems fine:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE perf2 index NULL PRIMARY 4 NULL 8906459 Using index
Unfortunately the statement is as slow as before. Following "SELECT COUNT(*) is slow, even with where clause", I've also tried optimizing the table, without success.
Is there a way to optimize COUNT(*) performance on InnoDB?
As of MySQL 5.1.6 you can use the Event Scheduler and insert the count into a stats table regularly.
First create a table to hold the count:
CREATE TABLE stats (
`key` varchar(50) NOT NULL PRIMARY KEY,
`value` varchar(100) NOT NULL);
Then create an event to update the table:
CREATE EVENT update_stats
ON SCHEDULE
EVERY 5 MINUTE
DO
INSERT INTO stats (`key`, `value`)
VALUES ('data_count', (select count(id) from data))
ON DUPLICATE KEY UPDATE value=VALUES(value);
It's not perfect, but it offers a self-contained solution (no cron job or queue) that can be easily tailored to run as often as the count needs to be fresh.
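Note that on versions before MySQL 8.0 the Event Scheduler is off by default; a one-line sketch to switch it on (needs a privileged account, and does not survive a restart unless also set in the config file):
SET GLOBAL event_scheduler = ON;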
For the time being I've solved the problem by using this approximation:
EXPLAIN SELECT COUNT(id) FROM data USE INDEX (PRIMARY)
The approximate number of rows can be read from the rows column of the explain plan when using InnoDB, as shown above. When using MyISAM this column will remain EMPTY, as the table reference is optimized away; so if it is empty, fall back to a traditional SELECT COUNT instead.
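An alternative source for the same kind of estimate (a sketch; for InnoDB, TABLE_ROWS in information_schema is also an approximation, so the same fallback caveat applies):
SELECT TABLE_ROWS
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = DATABASE() AND TABLE_NAME = 'data';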
Based on @Che's code, you can also use triggers on INSERT and on DELETE on perf2 in order to keep the value in the stats table up to date in real time.
CREATE TABLE stats (
`key` varchar(50) NOT NULL PRIMARY KEY,
`value` varchar(100) NOT NULL
);
Then:
CREATE TRIGGER `count_up` AFTER INSERT ON `perf2` FOR EACH ROW UPDATE `stats`
SET `stats`.`value` = `stats`.`value` + 1
WHERE `stats`.`key` = 'perf2_count';
CREATE TRIGGER `count_down` AFTER DELETE ON `perf2` FOR EACH ROW UPDATE `stats`
SET `stats`.`value` = `stats`.`value` - 1
WHERE `stats`.`key` = 'perf2_count';
So the number of rows in the perf2 table can be read using this query, in realtime:
SELECT `value` FROM `stats` WHERE `key` = 'perf2_count';
This would have the advantage of eliminating the performance issue of performing a COUNT(*) and would only be executed when data changes in perf2.
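One caveat: the stats row must exist and hold the correct starting count before the triggers take over. A sketch of the one-time seed:
INSERT INTO stats (`key`, `value`)
SELECT 'perf2_count', COUNT(*) FROM perf2;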
I have to create a cron job, which is simple in itself, but because it will run every minute I'm worried about performance. I have two tables: one holds user names, the other holds details about their networks. Most of the time a user belongs to just one network, but it is theoretically possible to belong to more; even then it would be very few, maybe two or three. So, to reduce the number of JOINs, I saved the network ids separated by | in a field in the users table, e.g.
|1|3|9|
The (simplified for this question) user table structure is
CREATE TABLE `users` (
`u_id` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT UNIQUE,
`userid` VARCHAR(500) NOT NULL UNIQUE,
`net_ids` VARCHAR(500) NOT NULL DEFAULT '',
PRIMARY KEY (`u_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
The (also simplified) network table structure is
CREATE TABLE `network` (
`n_id` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT UNIQUE,
`netname` VARCHAR(500) NOT NULL UNIQUE,
`login_time` DATETIME DEFAULT NULL,
`timeout_mins` TINYINT UNSIGNED NOT NULL DEFAULT 10,
PRIMARY KEY (`n_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
I have to send a warning when a timeout occurs; my query is
SELECT N.netname, N.timeout_mins, N.n_id, U.userid FROM
(SELECT netname, timeout_mins, n_id FROM network
WHERE is_open = 1 AND notify = 1
AND TIMESTAMPDIFF(SECOND, TIMESTAMPADD(MINUTE, timeout_mins, login_time), NOW()) < 60) AS N
INNER JOIN users AS U ON U.net_ids LIKE CONCAT('%|', N.n_id, '|%');
I made N a subquery to reduce the number of rows joined, but I would like to know whether it would be faster to add a third table with u_id and n_id as columns, remove the net_ids column from users, and then join all three tables, because I have read that using LIKE slows things down.
Which is the most efficient query in this case: one JOIN with a LIKE, or two JOINs?
P.S. I did some experimentation, and the initial timings for two JOINs are higher than for a JOIN with a LIKE. However, repeated runs of the same query speed things up a lot; I suspect something is cached somewhere, either in my app or in the database, and the two become comparable, so I did not find this data satisfactory. It also contradicts what I was expecting based on my reading.
I used this table:
CREATE TABLE `user_net` (
`u_id` BIGINT UNSIGNED NOT NULL,
`n_id` BIGINT UNSIGNED NOT NULL,
INDEX `u_id` (`u_id`),
FOREIGN KEY (`u_id`) REFERENCES `users`(`u_id`),
INDEX `n_id` (`n_id`),
FOREIGN KEY (`n_id`) REFERENCES `network`(`n_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
and this query:
SELECT N.netname, N.timeout_mins, N.n_id, U.userid FROM
(SELECT netname, timeout_mins, n_id FROM network
WHERE is_open = 1 AND notify = 1
AND TIMESTAMPDIFF(SECOND, TIMESTAMPADD(MINUTE, timeout_mins, login_time), NOW()) < 60) AS N
INNER JOIN user_net AS UN ON N.n_id = UN.n_id
INNER JOIN users AS U ON UN.u_id = U.u_id;
You should define composite indexes for the user_net table. One of them can (and should) be the primary key.
CREATE TABLE `user_net` (
`u_id` BIGINT UNSIGNED NOT NULL,
`n_id` BIGINT UNSIGNED NOT NULL,
PRIMARY KEY (`u_id`, `n_id`),
INDEX `uid_nid` (`n_id`, `u_id`),
FOREIGN KEY (`u_id`) REFERENCES `users`(`u_id`),
FOREIGN KEY (`n_id`) REFERENCES `network`(`n_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
I would also rewrite your query to:
SELECT N.netname, N.timeout_mins, N.n_id, U.userid
FROM network N
INNER JOIN user_net AS UN ON N.n_id = UN.n_id
INNER JOIN users AS U ON UN.u_id = U.u_id
WHERE N.is_open = 1
AND N.notify = 1
AND TIMESTAMPDIFF(SECOND, TIMESTAMPADD(MINUTE, N.timeout_mins, N.login_time), NOW()) < 60
While your subquery will probably not hurt much, there is no need for it.
Note that the last condition cannot use an index, because you have to combine two columns. If your MySQL version is at least 5.7.6 you can define an indexed virtual (calculated) column.
CREATE TABLE `network` (
`n_id` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT UNIQUE,
`netname` VARCHAR(500) NOT NULL UNIQUE,
`login_time` DATETIME DEFAULT NULL,
`timeout_mins` TINYINT UNSIGNED NOT NULL DEFAULT 10,
`is_open` TINYINT UNSIGNED,
`notify` TINYINT UNSIGNED,
`timeout_dt` DATETIME AS (`login_time` + INTERVAL `timeout_mins` MINUTE),
PRIMARY KEY (`n_id`),
INDEX (`timeout_dt`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
Now change the query to:
SELECT N.netname, N.timeout_mins, N.n_id, U.userid
FROM network N
INNER JOIN user_net AS UN ON N.n_id = UN.n_id
INNER JOIN users AS U ON UN.u_id = U.u_id
WHERE N.is_open = 1
AND N.notify = 1
AND N.timeout_dt < NOW() + INTERVAL 60 SECOND
and it will be able to use the index.
You can also try to replace
INDEX (`timeout_dt`)
with
INDEX (`is_open`, `notify`, `timeout_dt`)
and see if it is of any help.
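Either way, EXPLAIN will show which index the optimizer actually picks (a sketch of the check, using the virtual-column variant):
EXPLAIN SELECT N.netname, N.timeout_mins, N.n_id, U.userid
FROM network N
INNER JOIN user_net AS UN ON N.n_id = UN.n_id
INNER JOIN users AS U ON UN.u_id = U.u_id
WHERE N.is_open = 1
AND N.notify = 1
AND N.timeout_dt < NOW() + INTERVAL 60 SECOND;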
Reformulate to avoid hiding columns inside functions. I can't grok your date expression, but note this:
login_time < NOW() - INTERVAL timeout_mins MINUTE
If you can achieve something like that, then this index should help:
INDEX(is_open, notify, login_time)
If that is not good enough, let's see the other formulation so we can compare them.
Having stuff separated by comma (or |) is likely to be a really bad idea.
Bottom line: assume that JOINs are not a performance problem; write the queries with as many JOINs as needed, then let's optimize that.
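For reference, a sketch of the whole query in that shape, using the user_net junction table from the other answer (caveat: because timeout_mins is a column, the login_time bound still varies per row, so the index helps fully only if the timeout can be fixed or bounded):
SELECT N.netname, N.timeout_mins, N.n_id, U.userid
FROM network AS N
INNER JOIN user_net AS UN ON UN.n_id = N.n_id
INNER JOIN users AS U ON U.u_id = UN.u_id
WHERE N.is_open = 1
AND N.notify = 1
AND N.login_time < NOW() - INTERVAL N.timeout_mins MINUTE;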
To start out, here is a simplified version of the tables involved.
tbl_map has approx 4,000,000 rows, tbl_1 has approx 120 rows, and tbl_2 contains approx 5,000,000 rows. I know the data shouldn't be considered that large, given that Google, Yahoo!, etc. use much larger datasets, so I'm just assuming that I'm missing something.
CREATE TABLE `tbl_map` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`tbl_1_id` bigint(20) DEFAULT '-1',
`tbl_2_id` bigint(20) DEFAULT '-1',
`rating` decimal(3,3) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `tbl_1_id` (`tbl_1_id`),
KEY `tbl_2_id` (`tbl_2_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
CREATE TABLE `tbl_1` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
CREATE TABLE `tbl_2` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`data` varchar(255) NOT NULL DEFAULT '',
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
The query of interest is below; note that instead of ORDER BY RAND() it uses ORDER BY t.id DESC. The query takes as much as 5-10 seconds and causes a considerable wait when users view this page.
EXPLAIN SELECT t.data, t.id , tm.rating
FROM tbl_2 AS t
JOIN tbl_map AS tm
ON t.id = tm.tbl_2_id
WHERE tm.tbl_1_id =94
AND tm.rating IS NOT NULL
ORDER BY t.id DESC
LIMIT 200
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE tm ref tbl_1_id,tbl_2_id tbl_1_id 9 const 703438 Using where; Using temporary; Using filesort
1 SIMPLE t eq_ref PRIMARY PRIMARY 8 tm.tbl_2_id 1
I would just like to speed up the query and ensure that I have the proper indexes, etc.
I appreciate any advice from DB Gurus out there! Thanks.
SUGGESTION : Index the table as follows:
ALTER TABLE tbl_map ADD INDEX (tbl_1_id,rating,tbl_2_id);
As per Rolando, yes, you definitely need an index on the map table, but I would expand it to ALSO include tbl_2_id, which serves your ORDER BY on Table 2's id (the same value lives in the map table, so just use it from that index). Also, since the index now holds all 3 fields, keyed by the tbl_1_id search and the null-or-not rating criterion, the 3rd element already has the rows in order for your ORDER BY clause.
INDEX (tbl_1_id,rating, tbl_2_id);
Then, I would just have the query as
SELECT STRAIGHT_JOIN
t.data,
t.id ,
tm.rating
FROM
tbl_map tm
join tbl_2 t
on tm.tbl_2_id = t.id
WHERE
tm.tbl_1_id = 94
AND tm.rating IS NOT NULL
ORDER BY
tm.tbl_2_id DESC
LIMIT 200
I have two mysql tables:
/* Table users */
CREATE TABLE IF NOT EXISTS `users` (
`Id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`DateRegistered` datetime NOT NULL,
PRIMARY KEY (`Id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
/* Table statistics_user */
CREATE TABLE IF NOT EXISTS `statistics_user` (
`UserId` int(10) unsigned NOT NULL AUTO_INCREMENT,
`Sent_Views` int(10) unsigned NOT NULL DEFAULT '0',
`Sent_Winks` int(10) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`UserId`),
CONSTRAINT `statistics_user_ibfk_1` FOREIGN KEY (`UserId`) REFERENCES `users` (`Id`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Both tables are populated with 10,000 random rows for testing by using the following procedure:
DELIMITER //
CREATE DEFINER=`root`@`localhost` PROCEDURE `FillUsersStatistics`(IN `cnt` INT)
BEGIN
DECLARE i INT DEFAULT 1;
DECLARE dt DATE;
DECLARE Winks INT DEFAULT 1;
DECLARE Views INT DEFAULT 1;
WHILE (i<=cnt) DO
SET dt = str_to_date(concat(floor(1 + rand() * (9-1)),'-',floor(1 + rand() * (28 -1)),'-','2011'),'%m-%d-%Y');
INSERT INTO users (Id, DateRegistered) VALUES(i, dt);
SET Winks = floor(1 + rand() * (30-1));
SET Views = floor(1 + rand() * (30-1));
INSERT INTO statistics_user (UserId, Sent_Winks, Sent_Views) VALUES (i, Winks, Views);
SET i=i+1;
END WHILE;
END//
DELIMITER ;
CALL `FillUsersStatistics`(10000);
The problem:
When I run the EXPLAIN for this query:
SELECT
t1.Id, (Sent_Views + Sent_Winks) / DATEDIFF(NOW(), t1.DateRegistered) as Score
FROM users t1
JOIN statistics_user t2 ON t2.UserId = t1.Id
ORDER BY Score DESC
.. I get this explain:
Id select_type table type possible_keys key key_len ref rows extra
1 SIMPLE t1 ALL PRIMARY (NULL) (NULL) (NULL) 10037 Using temporary; Using filesort
1 SIMPLE t2 eq_ref PRIMARY PRIMARY 4 test2.t2.UserId 1
The above query gets very slow when both tables have more than 500K rows. I guess it's because of the 'Using temporary; Using filesort' in the explain of the query.
How can the above query be optimized so that it runs faster?
I'm fairly sure that the ORDER BY is what's killing you, since it cannot be properly indexed. Here is a workable, if not particularly pretty, solution.
First, let's say you have a column named Score for storing a user's current score. Every time a user's Sent_Views or Sent_Winks changes, modify the Score column to match. This could probably be done with a trigger (my experience with triggers is limited), or definitely done in the same code that updates the Sent_Views and Sent_Winks fields. This change wouldn't need to know the DATEDIFF portion, because it could just divide by the old sum of Sent_Views + Sent_Winks and multiply by the new one.
Now you just need to change the Score column once per day (if you're not picky about the precise number of hours a user has been registered). This could be done with a script run by a cron job.
Then, just index the Score column and SELECT away!
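A minimal sketch of the pieces described above (the Score column, index name, and GREATEST guard are illustrative assumptions, not from the original schema):
ALTER TABLE statistics_user
ADD COLUMN Score DOUBLE NOT NULL DEFAULT 0,
ADD INDEX idx_score (Score);
-- daily refresh via cron job or scheduled event; GREATEST avoids dividing
-- by zero for users registered today
UPDATE statistics_user t2
JOIN users t1 ON t1.Id = t2.UserId
SET t2.Score = (t2.Sent_Views + t2.Sent_Winks)
             / GREATEST(DATEDIFF(NOW(), t1.DateRegistered), 1);
-- the hot query then becomes a simple index scan
SELECT UserId, Score FROM statistics_user ORDER BY Score DESC;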
Note: edited to remove incorrect first attempt.
I'm offering my comment as an answer:
Establish a future date, far enough out not to interfere with your application, say the year 5000. Replace the current date with this future date in your score calculation. The score computation is now for all intents and purposes absolute, and can be computed when updating winks and views (through a stored procedure or a trigger; does MySQL have triggers?).
Add a score column to your statistics_user table to store the computed score and define an index on it.
Your SQL can be rewritten as:
SELECT
UserId, score
FROM
statistics_user
ORDER BY score DESC
If you need the real score, it is easily computed with just a constant multiplication, which could be done afterwards if it interferes with MySQL index selection.
Shouldn't you have indexed DateRegistered in Users?
You should try an inner join rather than a cartesian product; the next thing you can do is partition according to DateRegistered.