SQL performance on multiple id matching and a Join statement - mysql

Consider this query:
SELECT DISTINCT (linkindex_tags.link_id)
, links_sorted.link_title
, links_sorted.link_url
FROM linkindex_tags
INNER JOIN links_sorted ON links_sorted.link_id = linkindex_tags.link_id
ORDER BY
(
IF (word_id = 400, 1,0)+
IF (word_id = 177, 1,0)+
IF (word_id = 114, 1,0)+
IF (word_id = 9, 1,0)+
IF (word_id = 270, 1,0)+
IF (word_id = 715, 1,0)+
IF (word_id = 279, 1,0)+
IF (word_id = 1, 1,0)+
IF (word_id = 1748, 1,0)
) DESC
LIMIT 0,15;
So it's looking for matches to a series of word_ids and ordering by the score of those matches (e.g. a link that matches 5 of the word_ids gets a score of 5).
The linkindex_tags table is currently 552,196 rows (33 MB) but will expand to many millions.
The links_sorted table is currently 823,600 rows (558 MB - obviously more data per row) but will also expand.
The linkindex_tags table is likely to be around 8-12 times larger than links_sorted.
Execution time: 7.069 sec on a local Core i3 Windows 7 machine.
My server is CentOS 64-bit with 8 GB RAM and an Intel Xeon 3470 (quad core) - so that will aid in the matter slightly I guess, as I can assign a decent RAM allocation.
It is running slowly and I was wondering if my approach is all wrong. Here are the slow bits from the profile breakdown:
Copying to tmp table - (time) 3.88124 - (%) 55.08438
Copying to tmp table on disk - (time) 2.683123 -(%) 8.08010
converting HEAP to MyISAM - (time) 0.37656 - (%) 5.34432
Here's the EXPLAIN:
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
1 | SIMPLE | linkindex_tags | index | link_id,link_id_2 | link_id | 8 | NULL | 552196 | Using index; Using temporary; Using filesort
1 | SIMPLE | links_sorted | eq_ref | link_id | link_id | 4 | flinksdb.linkindex_tags.link_id | 1 |
And finally the two table schemas:
CREATE TABLE IF NOT EXISTS `linkindex_tags` (
`linkindex_tag_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`link_id` int(10) unsigned NOT NULL,
`word_id` int(10) unsigned NOT NULL,
PRIMARY KEY (`linkindex_tag_id`),
UNIQUE KEY `link_id` (`link_id`,`word_id`),
KEY `link_id_2` (`link_id`),
KEY `word_id` (`word_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=0 ;
CREATE TABLE IF NOT EXISTS `links_sorted` (
`link_sorted_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`site_id` int(10) unsigned NOT NULL,
`link_id` int(10) unsigned NOT NULL,
`link_title` char(255) NOT NULL,
`link_duration` char(20) NOT NULL,
`link_url` char(255) NOT NULL,
`active` tinyint(4) NOT NULL,
PRIMARY KEY (`link_sorted_id`),
UNIQUE KEY `link_id` (`link_id`),
KEY `link_title` (`link_title`,`link_url`,`active`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=0 ;
I have to stick with INT as the values may exceed the MEDIUMINT range.
Without the join, just getting the ids, the query is fast now that I've upped some MySQL settings.
I don't know too much about MySQL settings and their effects, so if you need me to change a few settings and run some tests, by all means fire away!
Oh, and I played with the mysql.ini settings so they're like this - just guessing and toying really!
key_buffer = 512M
max_allowed_packet = 1M
table_cache = 512M
sort_buffer_size = 512M
net_buffer_length = 8K
read_buffer_size = 512M
read_rnd_buffer_size = 512K
How can I speed up this query?

A few comments:
DISTINCT
SELECT DISTINCT works on all the selected fields, no matter how many parentheses you use; use a GROUP BY clause instead if you only want one field to be distinct.
Note that this will make the results of your query indeterminate!
Keep the DISTINCT, or aggregate the other fields with GROUP_CONCAT, if you want to prevent that.
ORDER BY
A field can only have one value at a time; adding together different IFs when only one of them can match is a waste of time - use IN instead.
A boolean is 1 for true and 0 for false; you don't need an extra IF to assert that.
WHERE
If you have a lot of rows, consider adding a WHERE clause that reduces the number of rows under consideration without altering the outcome.
?
Is the series 400,177,114,9,270,715,279,1,1748 the same sort of magical construct as the 4-8-15-16-23-42 in Lost?
SELECT lt.link_id
, GROUP_CONCAT(ls.link_title) as link_titles
, GROUP_CONCAT(ls.link_url) as link_urls
FROM linkindex_tags lt
INNER JOIN links_sorted ls ON ls.link_id = lt.link_id
WHERE lt.word_id <= 1748
GROUP BY lt.link_id
ORDER BY
(
lt.word_id IN (400,177,114,9,270,715,279,1,1748)
) DESC
LIMIT 15 OFFSET 0;
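If the goal is literally the "score of matches" described in the question (how many of the listed word_ids each link has), the test on word_id has to be aggregated per link rather than evaluated on a single row. A sketch along those lines, using the same tables (not part of the answer above):
SELECT lt.link_id
     , ls.link_title
     , ls.link_url
     , COUNT(*) AS score
FROM linkindex_tags lt
INNER JOIN links_sorted ls ON ls.link_id = lt.link_id
WHERE lt.word_id IN (400,177,114,9,270,715,279,1,1748)
GROUP BY lt.link_id, ls.link_title, ls.link_url
ORDER BY score DESC
LIMIT 15 OFFSET 0;
The WHERE clause lets the word_id index discard non-matching rows up front, and the UNIQUE KEY (link_id, word_id) guarantees each matching word is counted once per link.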

Related

Use index for ORDER BY in "SELECT .. FROM .. WHERE column IN (...) ORDER BY"

Is there any way to make the following query use an index and not use filesort:
SELECT c1 FROM table WHERE c2 IN (val_1, val_2, ..., val_n) ORDER BY c3
I guess the chances are bad, so if it is not possible, is there any way to make the following problem use indexes (or at least be fast)?
The table contains comments from users:
CREATE TABLE `comments` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`user_id` int(10) unsigned NOT NULL,
`comment` varchar(180) CHARACTER SET utf8 NOT NULL,
`timestamp` int(11) unsigned NOT NULL,
PRIMARY KEY (`id`),
KEY `idx_comments_user_id_timestamp` (`user_id`,`timestamp`) -- the index referenced in the EXPLAIN below
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
I want to output the comments of specific users (for example the ones user_x is following) ordered by timestamp (compare the query above).
The only way I can imagine making this query fast is to add a new column that is set to 1 for, say, the last 15 entries of a single user. The first query would then fetch a maximum of 15 rows per user, so the maximum number of rows MySQL has to order is 15*n, where n is the number of users whose comments are selected.
Edit: This is what EXPLAIN outputs:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE comments range idx_comments_user_id_timestamp idx_comments_user_id_timestamp 4 NULL 1113 Using where; Using index; Using filesort
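One common workaround for the "IN (...) ORDER BY" limitation, in the spirit of the 15-rows-per-user idea above (this is only a sketch, not part of the original question, and the user ids are placeholders): run one index-friendly, LIMITed query per followed user and merge the results, so the final sort touches at most 15*n rows.
(SELECT `id`, `user_id`, `comment`, `timestamp`
 FROM comments WHERE user_id = 1 ORDER BY `timestamp` DESC LIMIT 15)
UNION ALL
(SELECT `id`, `user_id`, `comment`, `timestamp`
 FROM comments WHERE user_id = 42 ORDER BY `timestamp` DESC LIMIT 15)
ORDER BY `timestamp` DESC
LIMIT 15;
Each inner SELECT can be served by the (user_id, timestamp) index, so only the small merged set is filesorted.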

Please help me optimize this MySQL SELECT statement

I have a query that takes roughly four minutes to run on a high powered SSD server with no other notable processes running. I'd like to make it faster if possible.
The database stores a match history for a popular video game called Dota 2. In this game, ten players (five on each team) each select a "hero" and battle it out.
The intention of my query is to create a list of past matches along with how much "XP dependence" each team had, based on the heroes used. With 200,000 matches (and a 2,000,000-row matches-to-heroes relationship table) the query takes about four minutes. With 1,000,000 matches, it takes roughly 15.
I have full control of the server, so any configuration suggestions are also appreciated. Thanks for any help guys. Here are the details...
CREATE TABLE matches (
* match_id BIGINT UNSIGNED NOT NULL,
start_time INT UNSIGNED NOT NULL,
skill_level TINYINT NOT NULL DEFAULT -1,
* winning_team TINYINT UNSIGNED NOT NULL,
PRIMARY KEY (match_id),
KEY start_time (start_time),
KEY skill_level (skill_level),
KEY winning_team (winning_team));
CREATE TABLE heroes (
* hero_id SMALLINT UNSIGNED NOT NULL,
name CHAR(40) NOT NULL DEFAULT '',
faction TINYINT NOT NULL DEFAULT -1,
primary_attribute TINYINT NOT NULL DEFAULT -1,
group_index TINYINT NOT NULL DEFAULT -1,
match_count BIGINT UNSIGNED NOT NULL DEFAULT 0,
win_count BIGINT UNSIGNED NOT NULL DEFAULT 0,
* xp_from_wins BIGINT UNSIGNED NOT NULL DEFAULT 0,
* team_xp_from_wins BIGINT UNSIGNED NOT NULL DEFAULT 0,
xp_from_losses BIGINT UNSIGNED NOT NULL DEFAULT 0,
team_xp_from_losses BIGINT UNSIGNED NOT NULL DEFAULT 0,
gold_from_wins BIGINT UNSIGNED NOT NULL DEFAULT 0,
team_gold_from_wins BIGINT UNSIGNED NOT NULL DEFAULT 0,
gold_from_losses BIGINT UNSIGNED NOT NULL DEFAULT 0,
team_gold_from_losses BIGINT UNSIGNED NOT NULL DEFAULT 0,
included TINYINT UNSIGNED NOT NULL DEFAULT 0,
PRIMARY KEY (hero_id));
CREATE TABLE matches_heroes (
* match_id BIGINT UNSIGNED NOT NULL,
player_id INT UNSIGNED NOT NULL,
* hero_id SMALLINT UNSIGNED NOT NULL,
xp_per_min SMALLINT UNSIGNED NOT NULL,
gold_per_min SMALLINT UNSIGNED NOT NULL,
position TINYINT UNSIGNED NOT NULL,
PRIMARY KEY (match_id, hero_id),
KEY match_id (match_id),
KEY player_id (player_id),
KEY hero_id (hero_id),
KEY xp_per_min (xp_per_min),
KEY gold_per_min (gold_per_min),
KEY position (position));
Query
SELECT
matches.match_id,
SUM(CASE
WHEN position < 5 THEN xp_from_wins / team_xp_from_wins
ELSE 0
END) AS radiant_xp_dependence,
SUM(CASE
WHEN position >= 5 THEN xp_from_wins / team_xp_from_wins
ELSE 0
END) AS dire_xp_dependence,
winning_team
FROM
matches
INNER JOIN
matches_heroes
ON matches.match_id = matches_heroes.match_id
INNER JOIN
heroes
ON matches_heroes.hero_id = heroes.hero_id
GROUP BY
matches.match_id
Sample Results
match_id | radiant_xp_dependence | dire_xp_dependence | winning_team
2298874871 | 1.0164 | 0.9689 | 1
2298884079 | 0.9932 | 1.0390 | 0
2298885606 | 0.9877 | 1.0015 | 1
EXPLAIN
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
1 | SIMPLE | heroes | ALL | PRIMARY | NULL | NULL | NULL | 111 | Using temporary; Using filesort
1 | SIMPLE | matches_heroes | ref | PRIMARY,match_id,hero_id | hero_id | 2 | dota_2.heroes.hero_id | 3213 |
1 | SIMPLE | matches | eq_ref | PRIMARY | PRIMARY | 8 | dota_2.matches_heroes.match_id | 1 |
Machine Specs
Intel Xeon E5
E5-1630v3 4/8t
3.7 / 3.8 GHz
64 GB of RAM
DDR4 ECC 2133 MHz
2 x 480GB of SSD SOFT
Database
MariaDB 10.0
InnoDB
In all likelihood, the main performance driver is the GROUP BY. Sometimes, in MySQL, it can be faster to use correlated subqueries. So, try writing the query like this:
SELECT m.match_id,
(SELECT SUM(h.xp_from_wins / h.team_xp_from_wins)
FROM matches_heroes mh INNER JOIN
heroes h
ON mh.hero_id = h.hero_id
WHERE m.match_id = mh.match_id AND mh.position < 5
) AS radiant_xp_dependence,
(SELECT SUM(h.xp_from_wins / h.team_xp_from_wins)
FROM matches_heroes mh INNER JOIN
heroes h
ON mh.hero_id = h.hero_id
WHERE m.match_id = mh.match_id AND mh.position >= 5
) AS dire_xp_dependence,
m.winning_team
FROM matches m;
Then, you want indexes on:
matches_heroes(match_id, position)
heroes(hero_id, xp_from_wins, team_xp_from_wins)
For completeness, you might want this index as well:
matches(match_id, winning_team)
This would be more important if you added order by match_id to the query.
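In MySQL/MariaDB DDL, those suggestions translate to something like this (the index names are purely illustrative):
ALTER TABLE matches_heroes ADD INDEX idx_mh_match_position (match_id, position);
ALTER TABLE heroes ADD INDEX idx_heroes_xp (hero_id, xp_from_wins, team_xp_from_wins);
ALTER TABLE matches ADD INDEX idx_matches_team (match_id, winning_team);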
As has already been mentioned in a comment, there is little you can do, because you select all the data from the tables. The query looks perfect.
The one idea that comes to mind are covering indexes. With indexes containing all data needed for the query, the tables themselves don't have to be accessed anymore.
CREATE INDEX matches_quick ON matches(match_id, winning_team);
CREATE INDEX heroes_quick ON heroes(hero_id, xp_from_wins, team_xp_from_wins);
CREATE INDEX matches_heroes_quick ON matches_heroes (match_id, hero_id, position);
There is no guarantee that this will speed up your query, as you are still reading all the data, so running through the indexes may be just as much work as reading the tables. But there is a chance that the joins will be faster, and there would probably be fewer physical reads. Just give it a try.
Waiting for another idea? :-)
Well, there is always the data warehouse approach. If you must run this query again and again and always for all matches ever played, then why not store the query results and access them later?
I suppose that matches already played won't be altered, so you could reuse all the results you computed, say, last week and only retrieve the additional results for the games played since then from your real tables.
Create a table archived_results. Add a flag archived to your matches table. Then add query results to the archived_results table and set the flag to TRUE for those matches. When you have to perform your query, you'd either bring the archived_results table up to date and just show its contents, or combine the archive with the current data:
select match_id, radiant_xp_dependence, dire_xp_dependence, winning_team
from archived_results
union all
SELECT
matches.match_id,
SUM(CASE
WHEN position < 5 THEN xp_from_wins / team_xp_from_wins
ELSE 0
END) AS radiant_xp_dependence,
...
WHERE matches.archived = FALSE
GROUP BY matches.match_id;
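A rough sketch of the archive structures described above; the column types are assumptions based on the matches schema and the sample results, not taken from the answer:
CREATE TABLE archived_results (
  match_id BIGINT UNSIGNED NOT NULL,
  radiant_xp_dependence DECIMAL(10,4) NOT NULL,
  dire_xp_dependence DECIMAL(10,4) NOT NULL,
  winning_team TINYINT UNSIGNED NOT NULL,
  PRIMARY KEY (match_id)
);
ALTER TABLE matches ADD COLUMN archived TINYINT UNSIGNED NOT NULL DEFAULT 0;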
People's comments about loading whole tables into memory got me thinking. I searched for "MySQL memory allocation" and learned how to change the buffer pool size for InnoDB tables. The default is much smaller than my database, so I ramped it up to 8 GB using the innodb_buffer_pool_size directive in my.cnf. The query time dropped drastically, from 1308 seconds to only 114.
After researching more settings, my my.cnf file now looks like the following (no further speed improvements, but it should be better in other situations).
[mysqld]
bind-address=127.0.0.1
character-set-server=utf8
collation-server=utf8_general_ci
innodb_buffer_pool_size=8G
innodb_buffer_pool_dump_at_shutdown=1
innodb_buffer_pool_load_at_startup=1
innodb_flush_log_at_trx_commit=2
innodb_log_buffer_size=8M
innodb_log_file_size=64M
innodb_read_io_threads=64
innodb_write_io_threads=64
Thanks everyone for taking the time to help out. This will be a massive improvement to my website.

Is there any way to optimize this SELECT query any further?

I have a MySQL table that is filled with mails from a postfix mail log. The table is updated very often, some times multiple times per second. Here's the SHOW CREATE TABLE output:
CREATE TABLE `postfix_mails` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`mail_id` varchar(20) COLLATE utf8_danish_ci NOT NULL,
`host` varchar(30) COLLATE utf8_danish_ci NOT NULL,
`queued_at` datetime NOT NULL COMMENT 'When the message was received by the MTA',
`attempt_at` datetime NOT NULL COMMENT 'When the MTA last attempted to relay the message',
`attempts` smallint(5) unsigned NOT NULL,
`from` varchar(254) COLLATE utf8_danish_ci DEFAULT NULL,
`to` varchar(254) COLLATE utf8_danish_ci NOT NULL,
`source_relay` varchar(100) COLLATE utf8_danish_ci DEFAULT NULL,
`target_relay` varchar(100) COLLATE utf8_danish_ci DEFAULT NULL,
`target_relay_status` enum('sent','deferred','bounced','expired') COLLATE utf8_danish_ci NOT NULL,
`target_relay_comment` varchar(4098) COLLATE utf8_danish_ci NOT NULL,
`dsn` varchar(10) COLLATE utf8_danish_ci NOT NULL,
`size` int(11) unsigned NOT NULL,
`delay` float unsigned NOT NULL,
`delays` varchar(50) COLLATE utf8_danish_ci NOT NULL,
`nrcpt` smallint(5) unsigned NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `mail_signature` (`host`,`mail_id`,`to`),
KEY `from` (`from`),
KEY `to` (`to`),
KEY `source_relay` (`source_relay`),
KEY `target_relay` (`target_relay`),
KEY `target_relay_status` (`target_relay_status`),
KEY `mail_id` (`mail_id`),
KEY `last_attempt_at` (`attempt_at`),
KEY `queued_at` (`queued_at`)
) ENGINE=InnoDB AUTO_INCREMENT=111592 DEFAULT CHARSET=utf8 COLLATE=utf8_danish_ci
I want to know how many mails were relayed through a specific host on a specific date, so I'm using this query:
SELECT COUNT(*) as `count`
FROM `postfix_mails`
WHERE `queued_at` LIKE '2016-04-11%'
AND `host` = 'mta03'
The query takes between 100 and 110 ms.
Currently the table contains about 70 000 mails, and the query returns around 31 000. This is only a couple of days' worth of mails, and I plan to keep at least a month. The query cache doesn't help much because the table is getting updated constantly.
I have tried doing this instead:
SELECT SQL_NO_CACHE COUNT(*) as `count`
FROM `postfix_mails`
WHERE `queued_at` >= '2016-04-11'
AND `queued_at` < '2016-04-12'
AND `host` = 'mta03'
But the query takes the exact same time to run. I have made these changes to the MySQL configuration:
[mysqld]
query_cache_size = 128M
key_buffer_size = 256M
read_buffer_size = 128M
sort_buffer_size = 128M
innodb_buffer_pool_size = 4096M
And confirmed that they are all in effect (SHOW VARIABLES) but the query doesn't run any faster.
Am I doing something stupid that makes this query take this long? Can you spot any obvious or non-obvious ways to make it faster? Is there another database engine that works better than InnoDB in this scenario?
mysql> EXPLAIN SELECT SQL_NO_CACHE COUNT(*) as `count`
-> FROM `postfix_mails`
-> WHERE `queued_at` >= '2016-04-11'
-> AND `queued_at` < '2016-04-12'
-> AND `host` = 'mta03';
+----+-------------+---------------+------+--------------------------+----------------+---------+-------+-------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------+------+--------------------------+----------------+---------+-------+-------+-------------+
| 1 | SIMPLE | postfix_mails | ref | mail_signature,queued_at | mail_signature | 92 | const | 53244 | Using where |
+----+-------------+---------------+------+--------------------------+----------------+---------+-------+-------+-------------+
1 row in set (0.00 sec)
queued_at is a datetime value. Don't use LIKE. That converts it to a string, preventing the use of indexes and imposing a full-table scan. Instead, you want an appropriate index and to fix the query.
The query is:
SELECT COUNT(*) as `count`
FROM `postfix_mails`
WHERE `queued_at` >= '2016-04-11' AND `queued_at` < DATE_ADD('2016-04-11', interval 1 day) AND
`host` = 'mta03';
Then you want a composite index on postfix_mails(host, queued_at). The host column needs to be first.
Note: If your current version is counting 31,000 out of 70,000 emails, then an index will not be much help for that. However, this will make the code more scalable for the future.
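For reference, the composite index suggested above can be created like this (the index name is arbitrary):
ALTER TABLE postfix_mails ADD INDEX host_queued_at (host, queued_at);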
If you need your query to be really fast, you'll need to materialize it.
MySQL lacks a way to do that natively, so you'll have to create a table like this:
CREATE TABLE mails_host_day
(
host VARCHAR(30) NOT NULL,
day DATE NOT NULL,
mails BIGINT NOT NULL,
PRIMARY KEY (host, day)
)
and update it either in a trigger on postfix_mails or with a script once in a while:
INSERT
INTO mails_host_day (host, day, mails)
SELECT host, CAST(queued_at AS DATE), COUNT(*)
FROM postfix_mails
WHERE id > :last_sync_id
GROUP BY
host, CAST(queued_at AS DATE)
ON DUPLICATE KEY
UPDATE mails = mails + VALUES(mails)
This way, querying a host-day entry is a single primary key seek.
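The trigger-based variant mentioned above could look roughly like this (a sketch; the trigger name is arbitrary):
DELIMITER //
CREATE TRIGGER postfix_mails_count_ai
AFTER INSERT ON postfix_mails
FOR EACH ROW
BEGIN
  -- keep the per-host, per-day counter in sync on every insert
  INSERT INTO mails_host_day (host, day, mails)
  VALUES (NEW.host, CAST(NEW.queued_at AS DATE), 1)
  ON DUPLICATE KEY UPDATE mails = mails + 1;
END//
DELIMITER ;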
Note that the trigger-based solution will affect DML performance, while the script-based solution will give you slightly stale data.
However, you can improve the freshness of the script-based solution if you union the most recent live data with the stored results:
SELECT host, day, SUM(mails) AS mails
FROM (
SELECT host, day, mails
FROM mails_host_day
UNION ALL
SELECT host, CAST(queued_at AS DATE) AS day, COUNT(*) AS mails
FROM postfix_mails
WHERE id >= :last_sync_id
GROUP BY
host, CAST(queued_at AS DATE)
) q
It's not a single index seek anymore; however, if you run the update script often enough, there will be fewer live records to read.
You have a unique key on `host`, `mail_id`, and `to`; however, when the query engine tries to use that index you aren't filtering on `mail_id` and `to`, so it may not be as efficient. A solution could be to add another index just on `host`, or to add AND `mail_id` IS NOT NULL AND `to` IS NOT NULL to your query to make fuller use of the existing unique index.
You could use pagination to speed up queries in PHP, which is usually how I handle anything that involves a large amount of data - but this depends on your table hierarchy.
Integrate your LIMIT into the SQL query.
PHP:
// Assuming $db is a PDO connection: prepare, bind the limit, then iterate the statement.
$stmt = $db->prepare("SELECT COUNT(*) AS `count`
                      FROM `postfix_mails`
                      WHERE DATEDIFF(`queued_at`, '2016-04-11') = 0
                        AND `mail_id` < :limit");
$stmt->execute(array(':limit' => $_POST['limit']));
foreach ($stmt as $row)
{
    // normal output
}
jQuery:
$(document).ready( function() {
    var starting = 1;
    $('#next').click( function() {
        starting = starting + 10;
        $.post('phpfilehere.php', { limit: starting })
            .done( function(data) {
                $('#mail-output').html(data);
            });
    });
});
Here, each page shows 10 emails; of course you can change and modify this, and even add a search - I actually have an object I use for this in all my projects.
I just thought I'd share the idea - it also adds a real-time data flow to your site.
This was inspired by Facebook's "show more" scrolling - which really isn't hard to do, and is such a good way of querying a lot of data.

Real-time aggregation on a table with millions of records

I'm dealing with an ever-growing table which contains about 5 million records at the moment. About 100,000 new records are added daily.
The table contains information about ad campaigns, and is joined at query time with another table:
CREATE TABLE `statistics` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`ip_range_id` int(11) DEFAULT NULL,
`campaign_id` int(11) DEFAULT NULL,
`payout` decimal(5,2) DEFAULT NULL,
`is_converted` tinyint(1) unsigned NOT NULL DEFAULT '0',
`converted` datetime DEFAULT NULL,
`created` datetime DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `created` (`created`),
KEY `converted` (`converted`),
KEY `campaign_id` (`campaign_id`),
KEY `ip_range_id` (`ip_range_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
The other table contains IP ranges:
CREATE TABLE `ip_ranges` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`ip_range` varchar(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `ip_range` (`ip_range`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
The aggregation query is as follows:
SELECT
SUM(`payout`) AS `revenue`,
(SELECT COUNT(*) FROM `statistics` WHERE `ip_range_id` = `IpRange`.`id`) AS `clicks`,
(SELECT COUNT(*) FROM `statistics` WHERE `ip_range_id` = `IpRange`.`id` AND `is_converted` = 1) AS `conversions`
FROM `ip_ranges` AS `IpRange`
INNER JOIN `statistics` AS `Statistic` ON `IpRange`.`id` = `Statistic`.`ip_range_id`
GROUP BY `IpRange`.`id`
ORDER BY `clicks` DESC
LIMIT 20
The query takes about 20 seconds to complete.
This is what EXPLAIN returns:
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY ip_range index PRIMARY PRIMARY 4 NULL 306552 Using index; Using temporary; Using filesort
1 PRIMARY statistic ref ip_range_id ip_range_id 5 db.ip_range.id 8 Using where
3 DEPENDENT SUBQUERY statistics ref ip_range_id ip_range_id 5 func 8 Using where
2 DEPENDENT SUBQUERY statistics ref ip_range_id ip_range_id 5 func 8 Using where; Using index
Caching the clicks and conversions in the ip_ranges table as extra columns is not an option, because I need to be able to also filter on the campaign_id column (and possibly other columns in the future). So these aggregations need to be somewhat real-time.
What is the best strategy to do aggregation on large tables on multiple dimensions and near real-time?
Note that I'm not necessarily looking to just make the query better; I'm also interested in strategies that might involve other database systems (NoSQL) and/or distributing the data over different servers, etc.
Your query looks overly complicated. There is no need to query the same table again and again:
select
sum(payout) as revenue,
count(*) as clicks,
sum(s.is_converted = 1) as conversions
from ip_ranges r
inner join statistics s on r.id = s.ip_range_id
group by r.id
order by clicks desc
limit 20;
EDIT (after acceptance): As to your actual question on how to deal with a task like this:
You want to look at all the data in your table and you want your result to be up-to-date. Then there is no other option than to read all the data (full table scans). If the tables are wide (i.e. have many columns) you may want to create covering indexes (i.e. indexes that contain all the columns involved), so instead of reading the table, the index would be read. Well, what else? On full table scans it is advisable to use parallel access, which MySQL doesn't provide, as far as I know. So you might want to switch to another DBMS. Then see what else that DBMS offers; maybe the parallel querying would benefit from partitioning the tables. The last thing that comes to mind is hardware, i.e. more CPUs, faster drives, etc.
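For the rewritten query above, a covering index on statistics might look like this (the index name is arbitrary; the join/group column comes first):
ALTER TABLE statistics ADD INDEX idx_stats_covering (ip_range_id, is_converted, payout);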
Another option might be to remove old data from your tables. Say you need the details of the current year, but only the aggregated data for previous years. Then have another table old_statistics holding only the sums and counts needed, e.g.
CREATE TABLE old_statistics
(
ip_range_id int(11) NOT NULL, -- column types here are illustrative
revenue decimal(12,2) NOT NULL,
conversions int unsigned NOT NULL,
PRIMARY KEY (ip_range_id)
);
Then you'd aggregate the data from statistics, which would be much smaller because it would only hold data for the current year, and add in old_statistics to get the full results.
Try this
SELECT
SUM(`payout`) AS `revenue`,
SUM(case when `ip_range_id` = `IpRange`.`id` then 1 else 0 end) AS `clicks`,
SUM(case when `ip_range_id` = `IpRange`.`id` and `is_converted` = 1 then 1 else 0 end)
AS `conversions`
FROM `ip_ranges` AS `IpRange`
INNER JOIN `statistics` AS `Statistic` ON `IpRange`.`id` = `Statistic`.`ip_range_id`
GROUP BY `IpRange`.`id`
ORDER BY `clicks` DESC
LIMIT 20

Large MySQL table with very slow select

I have a large table in MySQL (running within MAMP); it has 28 million rows and is 3.1 GB in size. Here is its structure:
CREATE TABLE `termusage` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`termid` bigint(20) DEFAULT NULL,
`date` datetime DEFAULT NULL,
`dest` varchar(255) DEFAULT NULL,
`cost_type` tinyint(4) DEFAULT NULL,
`cost` decimal(10,3) DEFAULT NULL,
`gprsup` bigint(20) DEFAULT NULL,
`gprsdown` bigint(20) DEFAULT NULL,
`duration` time DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `termid_idx` (`termid`),
KEY `date_idx` (`date`),
KEY `cost_type_idx` (`cost_type`),
CONSTRAINT `termusage_cost_type_cost_type_cost_code` FOREIGN KEY (`cost_type`) REFERENCES `cost_type` (`cost_code`),
CONSTRAINT `termusage_termid_terminal_id` FOREIGN KEY (`termid`) REFERENCES `terminal` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=28680315 DEFAULT CHARSET=latin1
Here is the output from SHOW TABLE STATUS :
Name,Engine,Version,Row_format,Rows,Avg_row_length,Data_length,Max_data_length,Index_length,Data_free,Auto_increment,Create_time,Update_time,Check_time,Collation,Checksum,Create_options,Comment
'termusage', 'InnoDB', '10', 'Compact', '29656469', '87', '2605711360', '0', '2156920832', '545259520', '28680315', '2011-08-16 15:16:08', NULL, NULL, 'latin1_swedish_ci', NULL, '', ''
I'm trying to run the following select statement:
select u.id from termusage u
where u.date between '2010-11-01' and '2010-12-01'
It takes 35 minutes to return the result (approx. 14 million rows) - this is using MySQL Workbench.
I have the following MySQL config setup:
Variable_name Value
bulk_insert_buffer_size 8388608
innodb_buffer_pool_instances 1
innodb_buffer_pool_size 3221225472
innodb_change_buffering all
innodb_log_buffer_size 8388608
join_buffer_size 131072
key_buffer_size 8388608
myisam_sort_buffer_size 8388608
net_buffer_length 16384
preload_buffer_size 32768
read_buffer_size 131072
read_rnd_buffer_size 262144
sort_buffer_size 2097152
sql_buffer_result OFF
Eventually I'm trying to run a larger query that joins a couple of tables and groups some data, all based on one variable - the customer id:
select c.id,u.termid,u.cost_type,count(*) as count,sum(u.cost) as cost,(sum(u.gprsup) + sum(u.gprsdown)) as gprsuse,sum(time_to_sec(u.duration)) as duration
from customer c
inner join terminal t
on (c.id = t.customer)
inner join termusage u
on (t.id = u.termid)
where c.id = 1 and u.date between '2011-03-01' and '2011-04-01' group by c.id,u.termid,u.cost_type
This returns a maximum of 8 rows (as there are only 8 separate cost_types). The query runs OK when there are not many rows (fewer than 1 million) in the termusage table to calculate, but takes forever when the number of rows in termusage is large. How can I reduce the select time?
Data is added to the termusage table once a month from CSV files using the LOAD DATA method, so it doesn't need to be tuned so heavily for inserts.
EDIT: EXPLAIN output for the main query:
id,select_type,table,type,possible_keys,key,key_len,ref,rows,Extra
1,SIMPLE,c,const,PRIMARY,PRIMARY,8,const,1,"Using index; Using temporary; Using filesort"
1,SIMPLE,u,ALL,"termid_idx,date_idx",NULL,NULL,NULL,29656469,"Using where"
1,SIMPLE,t,eq_ref,"PRIMARY,customer_idx",PRIMARY,8,wlnew.u.termid,1,"Using where"
Looks like you're asking two questions - correct?
The most likely reason the first query is taking so long is that it's IO-bound. It takes a long time to transfer 14 million records from disk and down the wire to your MySQL Workbench.
Have you tried putting the second query through "explain"? Yes, you only get back 8 rows - but the SUM operation may be summing millions of records.
I'm assuming the "customer" and "terminal" tables are appropriately indexed? As you're joining on the primary key on termusage, that should be really quick...
You could try removing the WHERE clause restricting by date and instead putting an IF statement in the select, so that if the date is within these boundaries the value is returned, otherwise zero is returned. The SUM will then of course only sum values which lie in this range, as all others will be zero.
It sounds a bit nonsensical to fetch more rows than you need, but we recently observed on an Oracle DB that this made quite a big improvement. Of course it will depend on many other factors, but it might be worth a try.
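A sketch of that idea applied to the main query from the question (same tables and columns; the date boundaries are the ones used above):
select c.id, u.termid, u.cost_type,
       sum(if(u.date between '2011-03-01' and '2011-04-01', 1, 0)) as count,
       sum(if(u.date between '2011-03-01' and '2011-04-01', u.cost, 0)) as cost,
       sum(if(u.date between '2011-03-01' and '2011-04-01', u.gprsup + u.gprsdown, 0)) as gprsuse,
       sum(if(u.date between '2011-03-01' and '2011-04-01', time_to_sec(u.duration), 0)) as duration
from customer c
inner join terminal t on (c.id = t.customer)
inner join termusage u on (t.id = u.termid)
where c.id = 1
group by c.id, u.termid, u.cost_type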
You may also think about breaking the table down by year or month, so you'd have termusage_2010, termusage_2011, ... or something like that.
Not a very nice solution, but seeing that your table is quite large, it might be useful on a smaller server.
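A minimal sketch of that per-year split, based on the schema above (the table name follows the suggested termusage_<year> pattern; adjust the date range as needed):
CREATE TABLE termusage_2011 LIKE termusage;
INSERT INTO termusage_2011
SELECT * FROM termusage
WHERE `date` >= '2011-01-01' AND `date` < '2012-01-01';
Note that CREATE TABLE ... LIKE copies the indexes but not the foreign key constraints, so those would have to be re-added if they are needed on the per-year tables.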