Very slow MySQL query for 2.5 million row table - mysql

I'm really struggling to get a query time down; it currently has to query 2.5 million rows and it takes over 20 seconds.
Here is the query:
SELECT play_date AS date, COUNT(DISTINCT(email)) AS count
FROM log
WHERE play_date BETWEEN '2009-02-23' AND '2020-01-01'
AND type = 'play'
GROUP BY play_date
ORDER BY play_date desc;
Here is the table definition:
CREATE TABLE `log` (
`id` int(11) NOT NULL auto_increment,
`instance` varchar(255) NOT NULL,
`email` varchar(255) NOT NULL,
`type` enum('play','claim','friend','email') NOT NULL,
`result` enum('win','win-small','lose','none') NOT NULL,
`timestamp` timestamp NOT NULL default CURRENT_TIMESTAMP,
`play_date` date NOT NULL,
`email_refer` varchar(255) NOT NULL,
`remote_addr` varchar(15) NOT NULL,
PRIMARY KEY (`id`),
KEY `email` (`email`),
KEY `result` (`result`),
KEY `timestamp` (`timestamp`),
KEY `email_refer` (`email_refer`),
KEY `type_2` (`type`,`timestamp`),
KEY `type_4` (`type`,`play_date`),
KEY `type_result` (`type`,`play_date`,`result`)
);
Here is the EXPLAIN output:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE log ref type_2,type_4,type_result type_4 1 const 270404 Using where
The query is using the type_4 index.
Does anyone know how I could speed this query up?
Thanks
Tom

That's relatively good already. The performance sink is that the query has to compare 270,404 varchars for equality for the COUNT(DISTINCT(email)), which means 270,404 rows have to be read.
You should be able to make the count faster by creating a covering index. This means that the actual rows do not need to be read, because all the required information is present in the index itself.
To do this, change the index as follows:
KEY `type_4` (`type`,`play_date`, `email`)
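As a concrete sketch, that change could be made with a single ALTER TABLE (the table name log is taken from the query above; dropping and re-adding in one statement rebuilds the index once):
ALTER TABLE log
DROP INDEX `type_4`,
ADD INDEX `type_4` (`type`, `play_date`, `email`);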
I would be surprised if that wouldn't speed things up quite a bit.
(Thanks to MarkR for the proper term.)

Your indexing is probably as good as you can get it. You have a compound index on the two columns in your WHERE clause, and the EXPLAIN you posted indicates that it is being used. Unfortunately, 270,404 rows match the criteria in your WHERE clause and they all need to be considered. Also, you're not returning unnecessary columns in your select list.
My advice would be to aggregate the data daily (or hourly or whatever makes sense) and cache the results. That way you can access slightly stale data instantly. Hopefully this is acceptable for your purposes.

Try an index on (play_date, type) (same columns as type_4, just reversed) and see if that helps.
There are 4 possible types, and I assume hundreds of possible dates. If the query uses the (type, play_date) index, it basically does the following (not 100% accurate, but the general idea):
(A) Find all the Play records (about 25% of the file)
(B) Now within that subset, find all of the requested dates
By reversing the index, the approach is:
(A) Find all the dates within the range (maybe 1-2% of the file)
(B) Now find all PLAY types within that smaller portion of the file
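A minimal sketch of that reversed index (the index name is just illustrative):
ALTER TABLE log ADD INDEX `date_type` (`play_date`, `type`);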
Hope this helps

Extracting email to a separate table should be a good performance boost, since counting distinct varchar fields takes a while. Other than that, the correct index is used and the query itself is as optimized as it can be (except for the email, of course).
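A rough sketch of that normalization, assuming log would then store an integer email_id instead of the varchar (the table and column names here are illustrative):
CREATE TABLE email (
id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
address VARCHAR(255) NOT NULL,
UNIQUE KEY (address)
);
-- log.email varchar(255) would be replaced by email_id INT UNSIGNED,
-- and the query would use COUNT(DISTINCT email_id) instead of the varchar.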

The COUNT(DISTINCT(email)) part is the bit that's killing you. If you only truly need the first 2000 results of 270,404, perhaps it would help to do the email count only for the results instead of for the whole set.
SELECT date, COUNT(DISTINCT(email)) AS count
FROM log,
(
SELECT id, play_date AS date
FROM log
WHERE play_date BETWEEN '2009-02-23' AND '2020-01-01'
AND type = 'play'
ORDER BY play_date desc
LIMIT 2000
) AS shortlist
WHERE shortlist.id = log.id
GROUP BY date

Try creating an index only on play_date.

Long term, I would recommend building a summary table with a primary key of play_date and count of distinct emails.
Depending on how up to date you need it to be - either allow it to be updated daily (by play_date) or live via a trigger on the log table.
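A minimal sketch of the daily variant (names are illustrative; the refresh could run from cron or a MySQL event):
CREATE TABLE daily_play_email_counts (
play_date DATE NOT NULL PRIMARY KEY,
distinct_emails INT UNSIGNED NOT NULL
);

INSERT INTO daily_play_email_counts (play_date, distinct_emails)
SELECT play_date, COUNT(DISTINCT email)
FROM log
WHERE type = 'play'
AND play_date = CURRENT_DATE - INTERVAL 1 DAY
GROUP BY play_date
ON DUPLICATE KEY UPDATE distinct_emails = VALUES(distinct_emails);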

There is a good chance a table scan will be quicker than random access to over 200,000 rows:
SELECT ... FROM log IGNORE INDEX (type_2,type_4,type_result) ...
Also, for large grouped queries you may see better performance by forcing a filesort rather than a hashtable-based group (because if the temporary table turns out to need more than tmp_table_size or max_heap_table_size, performance collapses):
SELECT SQL_BIG_RESULT ...
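Applied to the original query, the two hints together would look something like this (a sketch worth benchmarking against the plain version):
SELECT SQL_BIG_RESULT play_date AS date, COUNT(DISTINCT email) AS count
FROM log IGNORE INDEX (type_2, type_4, type_result)
WHERE play_date BETWEEN '2009-02-23' AND '2020-01-01'
AND type = 'play'
GROUP BY play_date
ORDER BY play_date DESC;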

Related

Improving MySQL Query Speeds - 150,000+ Rows Returned Slows Query

Hi, I currently have a query which takes 11 seconds to run. I have a report displayed on a website which runs 4 similar queries, each taking 11 seconds. I don't really want the customer to have to wait a minute for all of these queries to run and display the data.
I am using 4 different AJAX requests to call an API to get the data I need, and these all start at once, but the queries run one after another. If there were a way to get these queries to all run at once (in parallel) so the total load time is only 11 seconds, that would also fix my issue, but I don't believe that is possible.
Here is the query I am running:
SELECT device_uuid,
day_epoch,
is_repeat
FROM tracking_daily_stats_zone_unique_device_uuids_per_hour
WHERE day_epoch >= 1552435200
AND day_epoch < 1553040000
AND venue_id = 46
AND zone_id IN (102,105,108,110,111,113,116,117,118,121,287)
I can't think of any way to speed this query up at all; the table definition, including its indexes, is below.
I think the above query is using the relevant indexes in the WHERE conditions.
If there is anything you can think of to speed this query up, please let me know; I have been working on it for 3 days and can't seem to figure out the problem. It would be great to get the query times down to 5 seconds maximum. If I am wrong about the AJAX issue, please let me know, as that would also fix my issue.
" EDIT "
I have came across something quite strange which might be causing the issue. When I change the day_epoch range to something smaller (5th - 9th) which returns 130,000 rows the query time is 0.7(sec) but then I add one more day onto that range (5th - 10th) and it returns over 150,000 rows the query time is 13(sec). I have ran loads of different ranges and have came to the conclusion if the amount of rows returned is over 150,000 that has a huge effect on the query times.
Table Definition -
CREATE TABLE `tracking_daily_stats_zone_unique_device_uuids_per_hour` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`day_epoch` int(10) NOT NULL,
`day_of_week` tinyint(1) NOT NULL COMMENT 'day of week, monday = 1',
`hour` int(2) NOT NULL,
`venue_id` int(5) NOT NULL,
`zone_id` int(5) NOT NULL,
`device_uuid` binary(16) NOT NULL COMMENT 'binary representation of the device_uuid, unique for a single day',
`device_vendor_id` int(5) unsigned NOT NULL DEFAULT '0' COMMENT 'id of the device vendor',
`first_seen` int(10) unsigned NOT NULL DEFAULT '0',
`last_seen` int(10) unsigned NOT NULL DEFAULT '0',
`is_repeat` tinyint(1) NOT NULL COMMENT 'is the device a repeat for this day?',
`prev_last_seen` int(10) NOT NULL DEFAULT '0' COMMENT 'previous last seen ts',
PRIMARY KEY (`id`,`venue_id`) USING BTREE,
KEY `venue_id` (`venue_id`),
KEY `zone_id` (`zone_id`),
KEY `day_of_week` (`day_of_week`),
KEY `day_epoch` (`day_epoch`),
KEY `hour` (`hour`),
KEY `device_uuid` (`device_uuid`),
KEY `is_repeat` (`is_repeat`),
KEY `device_vendor_id` (`device_vendor_id`)
) ENGINE=InnoDB AUTO_INCREMENT=450967720 DEFAULT CHARSET=utf8
/*!50100 PARTITION BY HASH (venue_id)
PARTITIONS 100 */
The straightforward solution is to add this query-specific index to the table:
ALTER TABLE tracking_daily_stats_zone_unique_device_uuids_per_hour
ADD INDEX complex_idx (`venue_id`, `day_epoch`, `zone_id`)
WARNING: adding this index can take a while on a large table.
And then force it when you call:
SELECT device_uuid,
day_epoch,
is_repeat
FROM tracking_daily_stats_zone_unique_device_uuids_per_hour
USE INDEX (complex_idx)
WHERE day_epoch >= 1552435200
AND day_epoch < 1553040000
AND venue_id = 46
AND zone_id IN (102,105,108,110,111,113,116,117,118,121,287)
It is definitely not universal but should work for this particular query.
UPDATE: When you have a partitioned table, you can benefit from forcing a particular PARTITION. In our case, since the partitioning is by venue_id, just force it:
SELECT device_uuid,
day_epoch,
is_repeat
FROM tracking_daily_stats_zone_unique_device_uuids_per_hour
PARTITION (`p46`)
WHERE day_epoch >= 1552435200
AND day_epoch < 1553040000
AND zone_id IN (102,105,108,110,111,113,116,117,118,121,287)
where p46 is the string p concatenated with venue_id = 46.
Another trick if you go this way: you can remove AND venue_id = 46 from the WHERE clause, because there is no other data in that partition.
What happens if you change the order of conditions? Put venue_id = ? first. The order matters.
Now it first checks all rows for:
- day_epoch >= 1552435200
- then, the remaining set for day_epoch < 1553040000
- then, the remaining set for venue_id = 46
- then, the remaining set for zone_id IN (102,105,108,110,111,113,116,117,118,121,287)
When working with heavy queries, you should always try to make the first "selector" the most effective. You can do that by using a proper index (on one column or a combination of columns) and by making sure that the first selector narrows the result down the most (at least for integers; for strings you need another tactic).
Sometimes a query simply is slow. When you have a lot of data (and/or not enough resources), you just can't really do anything about that. That's where you need another solution: make a summary table. I doubt you show 150,000 rows x 4 to your visitor. You can sum it up, e.g., hourly or every few minutes, and select from that much smaller table.
Offtopic: putting an index on everything only slows you down when inserting/updating/deleting. Index the fewest columns possible, just the ones you actually filter on (e.g. use in a WHERE or GROUP BY).
450M rows is rather large. So, I will discuss a variety of issues that can help.
Shrink data
A big table leads to more I/O, which is the main performance killer. ('Small' tables tend to stay cached, and not have an I/O burden.)
Any kind of INT, even INT(2), takes 4 bytes. An "hour" can easily fit in a 1-byte TINYINT. That saves over 1 GB in the data, plus a similar amount in INDEX(hour).
If hour and day_of_week can be derived, don't bother having them as separate columns. This will save more space.
Some reason to use a 4-byte day_epoch instead of a 3-byte DATE? Or perhaps you do need a 5-byte DATETIME or TIMESTAMP.
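A rough sketch of the first two shrinks, assuming nothing else depends on the current column types (the ALTER rewrites the whole 450M-row table, so run it during a quiet period):
ALTER TABLE tracking_daily_stats_zone_unique_device_uuids_per_hour
MODIFY `hour` TINYINT UNSIGNED NOT NULL,
MODIFY `day_of_week` TINYINT UNSIGNED NOT NULL COMMENT 'day of week, monday = 1';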
Optimal INDEX (take #1)
If it is always a single venue_id, then either this is a good first cut at the optimal index:
INDEX(venue_id, zone_id, day_epoch)
First is the constant, then the IN, then a range. The Optimizer does well with this in many cases. (It is unclear whether the number of items in an IN clause can lead to inefficiencies.)
Better Primary Key (better index)
With AUTO_INCREMENT, there is probably no good reason to include columns after the auto_inc column in the PK. That is, PRIMARY KEY(id, venue_id) is no better than PRIMARY KEY(id).
InnoDB orders the data's BTree according to the PRIMARY KEY. So, if you are fetching several rows and can arrange for them to be adjacent to each other based on the PK, you get extra performance. (cf "Clustered".) So:
PRIMARY KEY(venue_id, zone_id, day_epoch,  -- this order, as discussed above
            id),                           -- to make sure that the entire PK is unique
INDEX(id)                                  -- to keep AUTO_INCREMENT happy
And, I agree with DROPping any indexes that are not in use, including the one I recommended above. It is rarely useful to index flags (is_repeat).
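A sketch of that cleanup, assuming these single-column indexes are not needed by any other query in your workload (verify before dropping):
ALTER TABLE tracking_daily_stats_zone_unique_device_uuids_per_hour
DROP INDEX `is_repeat`,
DROP INDEX `hour`,
DROP INDEX `day_of_week`;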
UUID
Indexing a UUID can be deadly for performance once the table is really big. This is because of the randomness of UUIDs/GUIDs, leading to ever-increasing I/O burden to insert new entries in the index.
Multi-dimensional
Assuming day_epoch is sometimes multiple days, you seem to have 2 or 3 "dimensions":
A date range
A list of zones
A venue.
INDEXes are 1-dimensional. Therein lies the problem. However, PARTITIONing can sometimes help. I discuss this briefly as "case 2" in http://mysql.rjweb.org/doc.php/partitionmaint .
There is no good way to get 3 dimensions, so let's focus on 2.
You should partition on something that is a "range", such as day_epoch or zone_id.
After that, you should decide what to put in the PRIMARY KEY so that you can further take advantage of "clustering".
Plan A: This assumes you are searching for only one venue_id at a time:
PARTITION BY RANGE(day_epoch) -- see note below
PRIMARY KEY(venue_id, zone_id, id)
Plan B: This assumes you sometimes search for venue_id IN (.., .., ...), hence it does not make a good first column for the PK:
Well, I don't have good advice here; so let's go with Plan A.
The RANGE expression must be numeric. Your day_epoch works fine as is. Changing to a DATE, would necessitate BY RANGE(TO_DAYS(...)), which works fine.
You should limit the number of partitions to 50. (The 100 used above is not bad.) The problem is that "lots" of partitions introduces different inefficiencies; "too few" partitions leads to "why bother".
Note that almost always the optimal PK is different for a partitioned table than the equivalent non-partitioned table.
Note that I disagree with partitioning on venue_id since it is so easy to put that column at the start of the PK instead.
Analysis
Assuming you search for a single venue_id and use my suggested partitioning & PK, here's how the SELECT performs:
Filter on the date range. This is likely to limit the activity to a single partition.
Drill into the data's BTree for that one partition to find the one venue_id.
Hopscotch through the data from there, landing on the desired zone_ids.
For each, further filter based on the date.

MySQL Table Query Unusually Slow (Using Matlab's Connector/J)

EDIT: Thank you everyone for your comments. I have tried most of your suggestions but they did not help. I need to add that I am running this query through Matlab using Connector/J 5.1.26 (sorry for not mentioning this earlier). In the end, I think this is the source of the increase in execution time, since when I run the query "directly" it takes 0.2 seconds. However, I have never come across such a huge performance hit using Connector/J. Given this new information, do you have any suggestions? I apologize for not disclosing this earlier, but again, I've never experienced a performance impact like this with Connector/J.
I have the following table in MySQL (CREATE code taken from HeidiSQL):
CREATE TABLE `data` (
`PRIMARY` INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,
`ID` VARCHAR(5) NULL DEFAULT NULL,
`DATE` DATE NULL DEFAULT NULL,
`PRICE` DECIMAL(14,4) NULL DEFAULT NULL,
`QUANT` INT(10) NULL DEFAULT NULL,
`TIME` TIME NULL DEFAULT NULL,
INDEX `DATE` (`DATE`),
INDEX `ID` (`ID`),
INDEX `PRICE` (`PRICE`),
INDEX `QUANT` (`QUANT`),
INDEX `TIME` (`TIME`),
PRIMARY KEY (`PRIMARY`)
)
It is populated with approximately 360,000 rows of data.
The following query takes over 10 seconds to execute:
Select ID, DATE, PRICE, QUANT, TIME FROM database.data WHERE DATE
>= "2007-01-01" AND DATE <= "2010-12-31" ORDER BY ID, DATE, TIME ASC;
I have other tables with millions of rows in which a similar query would take a fraction of a second. I can't figure out what might be causing this one to be so slow. Any ideas/tips?
EXPLAIN:
id = 1
select_type = SIMPLE
table = data
type = ALL
possible_keys = DATE
key = (NULL)
key_len = (NULL)
ref = (NULL)
rows = 361161
Extra = Using where; Using filesort
You are asking for a wide range of data. The time is probably being spent sorting the results.
Is a query on a smaller date range faster? For instance,
WHERE DATE >= '2007-01-01' AND DATE < '2007-02-01'
One possibility is that the optimizer may be using the index on id for the sort and doing a full table scan to filter out the date range. Using indexes for sorts is often suboptimal. You might try the query as:
select t.*
from (Select ID, DATE, PRICE, QUANT, TIME
FROM database.data
WHERE DATE >= "2007-01-01" AND DATE <= "2010-12-31"
) t
ORDER BY ID, DATE, TIME ASC;
I think this will force the optimizer to use the date index for the selection and then sort using file sort -- but there is the cost of a derived table. If you do not have a large result set, this might significantly improve performance.
I assume you have already tried OPTIMIZE TABLE and got no improvement.
You can either try to use a covering index (at the expense of more disk space, and a slight slowing down on UPDATEs) by replacing the existing date index with
CREATE INDEX data_date_ndx ON data (DATE, TIME, PRICE, QUANT, ID);
and/or you can try and create an empty table data2 with the same schema. Then just SELECT all the contents of data table into data2 and run the same query against the new table. It could be that the data table needed to be compacted more than OPTIMIZE could - maybe at the filesystem level.
Also, check out the output of EXPLAIN SELECT... for that query.
I'm not familiar with MySQL, only MSSQL, so maybe this applies:
What about providing an index which fully covers all the fields in your SELECT query?
Yes, it duplicates data, but then we can move on to the next point of the discussion.

MySQL taking forever 'sending data'. Simple query, lots of data

I'm trying to run what I believe to be a simple query on a fairly large dataset, and it's taking a very long time to execute -- it stalls in the "Sending data" state for 3-4 hours or more.
The table looks like this:
CREATE TABLE `transaction` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`uuid` varchar(36) NOT NULL,
`userId` varchar(64) NOT NULL,
`protocol` int(11) NOT NULL,
... A few other fields: ints and small varchars
`created` datetime NOT NULL,
PRIMARY KEY (`id`),
KEY `uuid` (`uuid`),
KEY `userId` (`userId`),
KEY `protocol` (`protocol`),
KEY `created` (`created`)
) ENGINE=InnoDB AUTO_INCREMENT=61 DEFAULT CHARSET=utf8 ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=4 COMMENT='Transaction audit table'
And the query is here:
select protocol, count(distinct userId) as count from transaction
where created > '2012-01-15 23:59:59' and created <= '2012-02-14 23:59:59'
group by protocol;
The table has approximately 222 million rows, and the WHERE clause in the query filters down to about 20 million rows. The distinct option will bring it down to about 700,000 distinct rows, and then after grouping (when the query finally finishes), 4 to 5 rows are actually returned.
I realize that it's a lot of data, but it seems that 4-5 hours is an awfully long time for this query.
Thanks.
Edit: For reference, this is running on AWS on a db.m2.4xlarge RDS database instance.
Why don't you profile a query and see what exactly is happening?
SET PROFILING = 1;
SET profiling_history_size = 0;
SET profiling_history_size = 15;
/* Your query should be here */
SHOW PROFILES;
SELECT state, ROUND(SUM(duration),5) AS `duration (summed) in sec` FROM information_schema.profiling WHERE query_id = 3 GROUP BY state ORDER BY `duration (summed) in sec` DESC;
SET PROFILING = 0;
EXPLAIN /* Your query again should appear here */;
I think this will help you see exactly where the query spends its time, and based on the result you can perform optimization.
This is a really heavy query. To understand why it takes so long, you need to look at the details.
You have a range condition on an indexed field. That means MySQL finds the smallest created value in the index, and for each index entry in the range it gets the corresponding primary key, retrieves the row from disk, and fetches the required fields (protocol, userId) that are missing from the current index record, puts them in a "temporary table", and does the grouping on those 700,000 rows. The index can be used, and is used here, only for speeding up the range condition.
The only way to speed it up is to have an index that contains all the necessary data, so that MySQL does not need to do on-disk lookups for the rows. That is called a covering index. But you should understand that the index will reside in memory and will contain roughly sizeOf(created + protocol + userId + PK) * rowCount bytes, which may become a burden in itself for queries that update the table and for the other indexes. It is easier to create a separate aggregates table and periodically update it using your query.
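For reference, a minimal sketch of the covering index mentioned above (the index name is illustrative; on a 222-million-row table the ALTER itself will take a long time and considerable disk space):
ALTER TABLE `transaction`
ADD INDEX created_protocol_userId (created, protocol, userId);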
Both distinct and group by will need to sort and store temporary data on the server. With that much data that might take a while.
Indexing different combinations of userId, created and protocol will help, but I can't say how much or what index will help the most.
Starting from a certain version of MariaDB (maybe since 10.5), I noticed that after importing a dump with
mysql dbname < dump.sql
the optimizer thinks things are different from how they actually are, making the wrong decisions about indexes.
In general, even listing InnoDB tables with phpMyAdmin becomes very, very slow.
I noticed that running
ANALYZE TABLE myTable;
fixes it.
So after each import I run the following, which is equivalent to running ANALYZE on each table:
mysqlcheck -aA

MySQL: Optimizing COUNT(*) and GROUP BY

I have a simple MyISAM table resembling the following (trimmed for readability -- in reality, there are more columns, all of which are constant width and some of which are nullable):
CREATE TABLE IF NOT EXISTS `history` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`time` int(11) NOT NULL,
`event` int(11) NOT NULL,
`source` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `event` (`event`),
KEY `time` (`time`)
);
Presently the table contains only about 6,000,000 rows (of which currently about 160,000 match the query below), but this is expected to increase. Given a particular event ID and grouped by source, I want to know how many events with that ID were logged during a particular interval of time. The answer to the query might be something along the lines of "Today, event X happened 120 times for source A, 105 times for source B, and 900 times for source C."
The query I concocted does perform this task, but it performs monstrously badly, taking well over a minute to execute when the timespan is set to "all time" and in excess of 30 seconds for as little as a week back:
SELECT COUNT(*) AS count FROM history
WHERE event=2000 AND time >= 0 AND time < 1310563644
GROUP BY source
ORDER BY count DESC
This is not for real-time use, so even if the query takes a second or two that would be fine, but several minutes is not. Explaining the query gives the following, which troubles me for obvious reasons:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE history ref event,time event 4 const 160399 Using where; Using temporary; Using filesort
I've experimented with various multi-column indexes (such as (event, time)), but with no improvement. This seems like such a common use case that I can't imagine there not being a reasonable solution, but my Googling all boils down to versions of the query I already have, with no particular suggestions on how to avoid the temporary table (and even then, why performance is so abysmal).
Any suggestions?
You say you have tried multi-column indexes. Have you also tried single-column indexes, one per column?
UPDATE: Also, the COUNT(*) operation over a GROUP BY clause is probably a lot faster, if the grouped column also has an index on it... Of course, this depends on the number of NULL values that are actually in that column, which are not indexed.
For event, MySQL can execute a UNIQUE SCAN, which is quite fast, whereas for time, a RANGE SCAN will be applied, which is not so fast... If you separate indexes, I'd expect better performance than with multi-column ones.
Also, maybe you could gain something by partitioning your table by some expected values / value ranges:
http://dev.mysql.com/doc/refman/5.5/en/partitioning-overview.html
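As a rough, untested sketch of range partitioning on time (note the assumptions: MySQL requires every unique key, including the primary key, to contain the partitioning column, so the primary key would have to be widened first, and the cutoff values below are illustrative Unix timestamps):
ALTER TABLE `history` DROP PRIMARY KEY, ADD PRIMARY KEY (`id`, `time`);

ALTER TABLE `history`
PARTITION BY RANGE (`time`) (
PARTITION p2010 VALUES LESS THAN (1293840000),  -- up to 2011-01-01
PARTITION p2011 VALUES LESS THAN (1325376000),  -- up to 2012-01-01
PARTITION pmax  VALUES LESS THAN MAXVALUE
);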
I suggest you try this multi-column index:
ALTER TABLE `history` ADD INDEX `history_index` (`event` ASC, `time` ASC, `source` ASC);
Then, if it doesn't help, try forcing the index in the query:
SELECT COUNT(*) AS count FROM history USE INDEX (history_index)
WHERE event=2000 AND time >= 0 AND time < 1310563644
GROUP BY source
ORDER BY count DESC
If the sources are known, or you want to find the count for specific sources, then you can try something like this:
select count(source= 'A' or NULL) as A,count(source= 'B' or NULL) as B from history;
You can do the ordering in your application code. Also try indexing event and source together.
This will definitely be faster than the original one.

Simple query takes 15-30 seconds

The following query is pretty simple. It selects the last 20 records from a messages table for use in a paging scenario. The first time this query is run, it takes from 15 to 30 seconds. Subsequent runs take less than a second (I expect some caching is involved). I am trying to determine why the first time takes so long.
Here's the query:
SELECT DISTINCT ID,List,`From`,Subject, UNIX_TIMESTAMP(MsgDate) AS FmtDate
FROM messages
WHERE List='general'
ORDER BY MsgDate
LIMIT 17290,20;
MySQL version: 4.0.26-log
Here's the table:
CREATE TABLE `messages` (
`ID` int(10) unsigned NOT NULL auto_increment,
`List` varchar(10) NOT NULL default '',
`MessageId` varchar(128) NOT NULL default '',
`From` varchar(128) NOT NULL default '',
`Subject` varchar(128) NOT NULL default '',
`MsgDate` datetime NOT NULL default '0000-00-00 00:00:00',
`TextBody` longtext NOT NULL,
`HtmlBody` longtext NOT NULL,
`Headers` text NOT NULL,
`UserID` int(10) unsigned default NULL,
PRIMARY KEY (`ID`),
UNIQUE KEY `List` (`List`,`MsgDate`,`MessageId`),
KEY `From` (`From`),
KEY `UserID` (`UserID`,`List`,`MsgDate`),
KEY `MsgDate` (`MsgDate`),
KEY `ListOnly` (`List`)
) TYPE=MyISAM ROW_FORMAT=DYNAMIC
Here's the explain:
table type possible_keys key key_len ref rows Extra
------ ------ ------------- -------- ------- ------ ------ --------------------------------------------
m ref List,ListOnly ListOnly 10 const 18002 Using where; Using temporary; Using filesort
Why is it using a filesort when I have indexes on all the relevant columns? I added the ListOnly index just to see if it would help. I had originally thought that the List index would handle both the list selection and the sorting on MsgDate, but it didn't. Now that I added the ListOnly index, that's the one it uses, but it still does a filesort on MsgDate, which is what I suspect is taking so long.
I tried using FORCE INDEX as follows:
SELECT DISTINCT ID,List,`From`,Subject, UNIX_TIMESTAMP(MsgDate) AS FmtDate
FROM messages
FORCE INDEX (List)
WHERE List='general'
ORDER BY MsgDate
LIMIT 17290,20;
This does seem to force MySQL to use the index, but it doesn't speed up the query at all.
Here's the explain for this query:
table type possible_keys key key_len ref rows Extra
------ ------ ------------- ------ ------- ------ ------ ----------------------------
m ref List List 10 const 18002 Using where; Using temporary
UPDATES:
I removed DISTINCT from the query. It didn't help performance at all.
I removed the UNIX_TIMESTAMP call. It also didn't affect performance.
I made a special case in my PHP code so that if I detect the user is looking at the last page of results, I add a WHERE clause that returns only the last 7 days of results:
SELECT ID, List, `From`, Subject, MsgDate
FROM messages
WHERE MsgDate>='2009-11-15'
ORDER BY MsgDate DESC
LIMIT 20
This is a lot faster. However, as soon as I navigate to another page of results, it must use the old SQL and takes a very long time to execute. I can't think of a practical, realistic way to do this for all pages. Also, doing this special case makes my PHP code more complex.
Strangely, only the first run of the original query takes a long time. Subsequent runs of either the same query or a query showing a different page of results (i.e., only the LIMIT clause changes) are very fast. The query slows down again if it has not been run for about 5 minutes.
SOLUTION:
The best solution I came up with is based on Jason Orendorff and Juliet's idea.
First, I determine if the current page is closer to the beginning or end of the total number of pages. If it's closer to the end, I use ORDER BY MsgDate DESC, apply an appropriate limit, then reverse the order of the returned records.
This makes retrieving pages close to the beginning or end of the resultset much faster (first time now takes 4-5 seconds instead of 15-30). If the user wants to navigate to a page near the middle (currently around the 430th page), then the speed might drop back down. But that would be a rare case.
So while there seems to be no perfect solution, this is much better than it was for most cases.
Thank you, Jason and Juliet.
Instead of ORDER BY MsgDate LIMIT 17290,20, try ORDER BY MsgDate DESC LIMIT 20.
Of course the results will come out in the reverse order, but that should be easy to deal with.
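A minimal sketch of that for the last page (the server here is MySQL 4.0, which has no subqueries, so re-reversing the 20 rows would happen in the application code):
SELECT DISTINCT ID,List,`From`,Subject, UNIX_TIMESTAMP(MsgDate) AS FmtDate
FROM messages
WHERE List='general'
ORDER BY MsgDate DESC
LIMIT 20;
-- rows come back newest-first; reverse them in PHP before display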
EDIT: Do your MessageId values always increase with time? Are they unique?
If so, I would make an index:
UNIQUE KEY `ListMsgId` ( `List`, `MessageId` )
and query based on the message ids rather than the date when possible.
-- Most recent messages (in reverse order)
SELECT * FROM messages
WHERE List = 'general'
ORDER BY MessageId DESC
LIMIT 20
-- Previous page (in reverse order)
SELECT * FROM messages
WHERE List = 'general' AND MessageId < '15885830'
ORDER BY MessageId DESC
LIMIT 20
-- Next page
SELECT * FROM messages
WHERE List = 'general' AND MessageId > '15885829'
ORDER BY MessageId
LIMIT 20
I think you're also paying for having varchar columns where an int type would be a lot faster. For example, List could instead be a ListId that points to an entry in a separate table. You might want to try it out in a test database to see if that's really true; I'm not a MySQL expert.
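A rough sketch of that idea, with hypothetical table and column names (messages.List would become an integer ListID):
CREATE TABLE lists (
ListID int(10) unsigned NOT NULL auto_increment,
Name varchar(10) NOT NULL default '',
PRIMARY KEY (ListID),
UNIQUE KEY Name (Name)
) TYPE=MyISAM;
-- messages.List varchar(10) would be replaced by ListID int(10) unsigned,
-- and the WHERE clause would become WHERE ListID = <id of 'general'>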
You can drop the ListOnly key. The compound index List already contains all the information in it.
Your EXPLAIN for the List-indexed query looks much better, lacking the filesort. You may be able to get better real performance out of it by swapping the ORDER as suggested by Jason, and maybe losing the UNIX_TIMESTAMP call (you can do that in the application layer, or just use Unix timestamps stored as INTEGER in the schema).
What version of MySQL are you using? Some of the older versions used the LIMIT clause as a post-process filter (meaning they fetched all the requested records from the server, but only displayed the 20 you asked for).
You can see from your EXPLAIN that 18,002 rows are coming back, even though you are only showing 20 of them. Is there any way to adjust your selection criteria to identify the 20 rows you want to return, rather than getting 18,000 rows back and only showing 20 of them?