I have a table similar to the one below, which has a GUID as its key.
I am trying to display its content using paging, but since the key is a GUID I am running into the issue of how to do that.
CREATE TABLE `planetgeni`.`PostComment` (
`PostCommentId` CHAR(36) NOT NULL,
`UserId` INT NOT NULL,
`CreatedAt` DATETIME NULL DEFAULT NULL ,
.
.
.
PRIMARY KEY (`PostCommentId`)
)
ENGINE=InnoDB DEFAULT CHARSET=latin1;
If it were an INT key, my stored procedure would look something like this, giving me the next 10 rows ordered descending. But with a GUID I am not sure how to do that type of paging.
getPostComment( int lastPostID )
    WHERE PostCommentId < lastPostID ORDER BY PostCommentId DESC LIMIT 10;
You can still do this with GUIDs, but since GUIDs are pseudorandom, when you ORDER BY PostCommentId the order probably won't be what you want. You probably want something in approximately chronological order, and if you sort by the random GUID the order will be repeatable, but essentially random.
As @James comments, you could use another column for the sort order, but that column would need to be unique, or else you would either miss some duplicate rows (if you use >) or repeat values on the next page (if you use >=).
You'll just have to use LIMIT with OFFSET. MySQL optimizes LIMIT queries, so it quits examining rows once it finds the rows it needs for the page. But it also must examine all the preceding rows, so the query gets more expensive as you advance through higher-numbered pages.
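For illustration, a minimal sketch of what such offset paging looks like on the table above (the ordering columns are an assumption: CreatedAt descending, with the GUID as a tie-breaker); the cost grows with the OFFSET value:
-- page 3 of 10-row pages: OFFSET = (page - 1) * 10
SELECT PostCommentId, UserId, CreatedAt
FROM PostComment
ORDER BY CreatedAt DESC, PostCommentId DESC
LIMIT 10 OFFSET 20;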
A couple of ways to mitigate this:
Don't let your users view higher-numbered pages. Definitely don't give them a direct link to the "Last" page. Just give them a link to the "Next" page and hope they give up searching before they advance so far that the queries become very costly.
Fetch more than one page at a time, and cache it. For instance, instead of LIMIT 10, you could LIMIT 70 and then keep the results in memcached or something. Use application code to present 10 rows at a time, until the user advances through that set of rows. Then only if they go on to the 8th page, run another SQL query. Users typically don't search through more than a few pages, so the chance you'll have to run a second or third query becomes very small.
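A hedged sketch of that prefetch idea, using the same assumed ordering; the 70-row result would be cached and sliced into 10-row pages by the application:
SELECT PostCommentId, UserId, CreatedAt
FROM PostComment
ORDER BY CreatedAt DESC, PostCommentId DESC
LIMIT 70 OFFSET 0;   -- serves pages 1-7 from the cache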
Change the column you use in ORDER BY:
getPostComment( datetime lastCreatedAt )
    WHERE CreatedAt < lastCreatedAt ORDER BY CreatedAt DESC, UserId DESC LIMIT 10;
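Since CreatedAt alone may not be unique, a hedged sketch that passes the last row's CreatedAt and breaks ties on the GUID (the variable names are assumptions; the row-constructor comparison is standard MySQL):
SELECT PostCommentId, UserId, CreatedAt
FROM PostComment
WHERE (CreatedAt, PostCommentId) < (@lastCreatedAt, @lastPostCommentId)
ORDER BY CreatedAt DESC, PostCommentId DESC
LIMIT 10;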
Related
I am coding an app where users can read and write comments.
When the number of comments exceeds a certain limit, a "Load more comments" button is displayed and the offset of loaded comments is stored.
I update this offset whenever the user writes or deletes own comments so that no duplicates are loaded and no comments are left out.
But I forgot about the case when the database changes because other users added/deleted comments.
So the offset method seems unreliable. Is there any way to solve this problem, maybe by saving the id of the last comment and using that as some kind of "offset"?
The WHERE clause in my query is like:
WHERE x = ? ORDER BY y = ? (neither x nor y are the ID, y is not unique)
You can do this using a timestamp column or possibly even the primary key itself depending on how you've set that up. Here is an example of using the primary key if it is an AUTO_INCREMENT integer.
CREATE TABLE `comments` (
`comment_id` int NOT NULL AUTO_INCREMENT,
`thread_id` int NOT NULL,
`comment` text,
PRIMARY KEY (`comment_id`),
FOREIGN KEY (`thread_id`) REFERENCES `threads` (`thread_id`)
);
In that table definition, you have an AUTO_INCREMENT int primary key. You also have a thread_id that is a foreign key to a threads table. Finally, you have the comment itself in comment.
When you first load the page for some thread you'd do the following:
SELECT comment_id, comment
FROM comments
WHERE thread_id = 123
ORDER BY comment_id
LIMIT 10;
This means you'd select 10 comments ordered by their int PK for your given thread (123 in this case). Now, when you display this, you need to somehow save the largest comment_id. Say in this case it is 10. Then, have the "Load more comments" button pass this largest comment_id to the server when it is clicked. The server will now execute the following:
SELECT comment_id, comment
FROM comments
WHERE thread_id = 123 AND comment_id > 10 -- 10 is the value you passed in as your largest previously loaded comment_id
ORDER BY comment_id
LIMIT 10;
Now you have a set of ten more comments where you know that none of the comments can possibly be duplicates of your previously displayed comments, and you will never skip over any comments because they're always ordered by ascending int keys.
If you now look back to the query you used for loading the initial set of comments, you'll see that it's pretty much the same as the one for loading additional comments, so you can actually use the same query for both. When you load the comments initially just pass 0 as the largest comment_id.
You can do the same thing using a timestamp column as well if you don't have a primary key that works like this, and you don't want to change it to work like this either. You'd simply order the results by the timestamp column, and then pass the timestamp of the last loaded comment to your "Load more comments" function. In order to avoid skipping comments posted at the same time, you can use a timestamp with six digits of fractional second precision. Create the timestamp column as TIMESTAMP(6). Your timestamps will then be recorded as things like 2014-09-08 17:51:04.123456, where the last six digits after the second are the fraction of a second. With this much precision, it's extremely unlikely that you have comments recorded exactly at the same time.
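A hedged sketch of that timestamp variant, assuming a created_at TIMESTAMP(6) column (fractional-second defaults need MySQL 5.6.4 or later):
-- assumed column, not part of the table definition above
ALTER TABLE comments ADD COLUMN created_at TIMESTAMP(6) NOT NULL DEFAULT CURRENT_TIMESTAMP(6);

-- "Load more": pass the timestamp of the last loaded comment
SELECT comment_id, comment
FROM comments
WHERE thread_id = 123
  AND created_at > '2014-09-08 17:51:04.123456'
ORDER BY created_at
LIMIT 10;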
Of course you could still have two or more comments recorded at the same exact timestamp, but it's unlikely. This makes the AUTO_INCREMENT int a slightly better solution. One final option is to use a time-based UUID because they include a mechanism to ensure uniqueness by slightly adjusting the value when things occur at the same microsecond. They are also still ordered by time. The problem with this is that MySQL does not have very good support for UUIDs.
Performance problem updating a big MySQL MyISAM table, filling a column in ascending order based on an index on the same table
My problem is that the server has only 4 GB of memory.
I have to do an update query like the one in this previously asked question.
Mine is this:
set @orderid = 0;
update images im
set im.orderid = (select @orderid := @orderid + 1)
ORDER BY im.hotel_id, im.idImageType;
On im.hotel_id, im.idImageType I have an ascending index.
On im.orderid I have an ascending index too.
The table has 21 million records and is a MyISAM table.
The table is this:
CREATE TABLE `images` (
`photo_id` int(11) NOT NULL,
`idImageType` int(11) NOT NULL,
`hotel_id` int(11) NOT NULL,
`room_id` int(11) DEFAULT NULL,
`url_original` varchar(150) COLLATE utf8_unicode_ci NOT NULL,
`url_max300` varchar(150) COLLATE utf8_unicode_ci NOT NULL,
`url_square60` varchar(150) COLLATE utf8_unicode_ci NOT NULL,
`archive` int(11) NOT NULL DEFAULT '0',
`orderid` int(11) NOT NULL DEFAULT '0',
PRIMARY KEY (`photo_id`),
KEY `idImageType` (`idImageType`),
KEY `hotel_id` (`hotel_id`),
KEY `hotel_id_idImageType` (`hotel_id`,`idImageType`),
KEY `archive` (`archive`),
KEY `room_id` (`room_id`),
KEY `orderid` (`orderid`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
The problem is the performance: it hangs for several minutes!
The server disk gets busy too.
My question is: is there a better way to achieve the same result?
Do I have to partition the table or do something else to increase the performance?
I cannot modify the server hardware but I can tune the MySQL database server settings.
best regards
Thanks to everybody. Your answers helped me a lot. I think that now I have found a better solution.
This problem involves two critical issues:
efficient pagination of a large table
updating a large table.
To paginate the large table efficiently I had found a solution that required a prior update of the table, but that update took 51 minutes and consequently my Java infrastructure timed out (a spring-batch step).
Now, with your help, I have found two solutions to paginate the large table, and one solution to update it.
To reach this performance the server needs memory. I tried these solutions on a development server with 32 GB of memory.
Common solution step
To paginate following the tuple of fields I needed, I had created this index:
KEY `hotel_id_idImageType` (`hotel_id`,`idImageType`)
To implement the new solutions we have to change this index by appending the primary key to its tail, i.e. KEY hotel_id_idImageType (hotel_id, idImageType, primary key fields):
drop index hotel_id_idImageType on images;
create index hotelTypePhoto on images (hotel_id, idImageType, photo_id);
This is needed so the queries never touch the table file and work only on the index file.
Suppose we want the 10 records after the 19,000,000th record.
Solution 1
This solution is very practical: it does not need the extra orderid field and you do not have to run any update before paginating:
select * from images im inner join
(select photo_id from images
order by hotel_id, idImageType, photo_id
limit 19000000,10) k
on im.photo_id = k.photo_id;
Building the derived table k on my 21-million-record table takes only 1.5 sec, because it uses only the three fields in the hotelTypePhoto index, so it never has to access the table file and works only on the index file.
The order is the one originally required (hotel_id, idImageType), because it is a prefix of (hotel_id, idImageType, photo_id): same subset.
The join takes no time, so even the first time the pagination is executed on a given page it needs only 1.5 sec, and that is a good time if you have to execute it in a batch once every 3 months.
On the production server with 4 GB of memory the same query takes 3.5 sec.
Partitioning the table does not help to improve performance.
If the server has it in cache the time goes down, and I suppose the same happens if you use a parameterized JDBC statement.
If you have to use it often, it has the advantage that it does not care whether the data changes.
Solution 2
This solution needs the extra orderid field, and the orderid update has to be done once per batch import; the data must not change until the next batch import.
Then you can paginate the table in 0.000 sec.
set @orderid = 0;
update images im inner join (
    select photo_id, (@orderid := @orderid + 1) as newOrder
    from images order by hotel_id, idImageType, photo_id
) k
on im.photo_id = k.photo_id
set im.orderid = k.newOrder;
The derived table k is built almost as fast as in the first solution.
The whole update takes only 150.551 sec, much better than 51 minutes! (150 s vs 3060 s)
After this update in the batch you can paginate with:
select * from images im where orderid between 19000000 and 19000010;
or better
select * from images im where orderid >= 19000000 and orderid < 19000010;
This takes 0.000 sec to execute, the first time and every time after.
Edit after Rick's comment
Solution 3
This solution avoids both the extra field and the use of OFFSET, but it requires keeping track of the last page read, as in this solution.
This is a fast solution and can work on the online production server using only 4 GB of memory.
Suppose you need to read the ten records after the 20,000,000th.
There are two scenarios to take care of:
You read from the first record up to the 20,000,000th if, like me, you need all of them, updating some variables to keep track of the last page read.
You have to read only the 10 records after the 20,000,000th.
In this second scenario you have to run a preliminary query to find the starting point:
select hotel_id, idImageType, photo_id
from images im
order by hotel_id, idImageType, photo_id limit 20000000,1
It gives me:
+----------+-------------+----------+
| hotel_id | idImageType | photo_id |
+----------+-------------+----------+
| 1309878 | 4 | 43259857 |
+----------+-------------+----------+
This takes 6.73 sec.
So you can store these values in variables for later use.
Suppose we set @hot = 1309878, @type = 4, @photo = 43259857.
Then you can use them in a second query like this:
select * from images im
where
    hotel_id > @hot OR (
        hotel_id = @hot and (
            idImageType > @type OR (
                idImageType = @type and photo_id > @photo
            )
        )
    )
order by hotel_id, idImageType, photo_id limit 10;
The first condition hotel_id > @hot takes all records that come after the current value of the first index field, but it misses some records. To pick those up we need the OR branch, which, for the current value of the first field, takes all the remaining unread records.
This takes only 0.10 sec now.
But this query can be rewritten using the boolean distributive law:
select * from images im
where
    hotel_id > @hot OR (
        hotel_id = @hot and
        (idImageType > @type or idImageType = @type)
        and (idImageType > @type or photo_id > @photo)
    )
order by hotel_id, idImageType, photo_id limit 10;
which becomes:
select * from images im
where
    hotel_id > @hot OR (
        hotel_id = @hot and
        idImageType >= @type
        and (idImageType > @type or photo_id > @photo)
    )
order by hotel_id, idImageType, photo_id limit 10;
which becomes:
select * from images im
where
    (hotel_id > @hot OR hotel_id = @hot) and
    (hotel_id > @hot OR
        (idImageType >= @type and (idImageType > @type or photo_id > @photo))
    )
order by hotel_id, idImageType, photo_id limit 10;
which becomes:
select * from images im
where
    hotel_id >= @hot and
    (hotel_id > @hot OR
        (idImageType >= @type and (idImageType > @type or photo_id > @photo))
    )
order by hotel_id, idImageType, photo_id limit 10;
Do we get the same data we would get with the LIMIT/OFFSET query?
For a quick, non-exhaustive test, run:
select im.* from images im inner join (
select photo_id from images order by hotel_id, idImageType, photo_id limit 20000000,10
) k
on im.photo_id=k.photo_id
order by im.hotel_id, im.idImageType, im.photo_id;
This takes 6.56 sec and returns the same data as the query above.
So the test is positive.
With this solution you have to spend 6.73 sec only the first time, to seek to the first page you need to read (and not even that if you read everything from the start).
To read every other page you need only 0.10 sec, a very good result.
Thanks to Rick for his hint about a solution based on storing the last page read.
Conclusion
With solution 1 you have no extra field and every page takes 3.5 sec.
With solution 2 you have an extra field and need a server with a lot of memory (32 GB tested) for a 150 sec update, but then you read each page in 0.000 sec.
With solution 3 you have no extra field but have to store a pointer to the last page read, and if you do not start reading from the first page you have to spend 6.73 sec for the first page. After that you spend only 0.10 sec on all the other pages.
Best regards
Edit 3
Solution 3 is exactly the one suggested by Rick. I'm sorry, in my previous solution 3 I had made a mistake, and when I coded the right solution and then applied some boolean rules (the distributive property and so on), in the end I got the same solution as Rick!
regards
You can use some of these:
Switch the engine to InnoDB; on update it locks only the affected rows, not the whole table.
Create a temp table with photo_id and the correct orderid, and then update your table from this temp table:
update images im, temp tp
set im.orderid = tp.orderid
where im.photo_id = tp.photo_id
This will be the fastest way, and while you fill your temp table there are no locks on the primary table.
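A minimal sketch of how that temp table could be filled (table and column names assumed); the UPDATE above then joins on photo_id:
SET @orderid = 0;

CREATE TEMPORARY TABLE temp ENGINE=InnoDB AS
SELECT photo_id, (@orderid := @orderid + 1) AS orderid
FROM images
ORDER BY hotel_id, idImageType;

ALTER TABLE temp ADD PRIMARY KEY (photo_id);   -- speeds up the join in the UPDATE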
You can also drop the indexes before the mass update and rebuild them afterwards; the one-time rebuild takes a while, but it is cheaper than maintaining the indexes during the update.
KEY `hotel_id` (`hotel_id`),
KEY `hotel_id_idImageType` (`hotel_id`,`idImageType`),
DROP the former; the latter takes care of any need for it. (This won't speed up the original query.)
"The problem is the performance: hang for several minutes!" What is the problem?
Other queries are blocked for several minutes? (InnoDB should help.)
You run this update often and it is annoying? (Why in the world??)
Something else?
This one index is costly while doing the Update:
KEY `orderid` (`orderid`)
DROP it and re-create it. (Don't bother dropping the rest.) Another reason for going with InnoDB is that these operations can be done (in 5.6) without copying the table over. (21M rows == long time if it has to copy the table!)
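Concretely, using the index name from the table definition, that drop and re-create would be something like:
ALTER TABLE images DROP INDEX orderid;
-- ... run the big UPDATE ...
ALTER TABLE images ADD INDEX orderid (orderid);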
Why are you building a second Unique index (orderid) in addition to photo_id, which is already Unique? I ask this because there may be another way to solve the real problem that does not involve this time-consuming Update.
I have two more concrete suggestions, but I want to hear your answers first.
Edit Pagination, ordered by hotel_id, idImageType, photo_id:
It is possible to read the records in order by that triple. And even to "paginate" through them.
If you "left off" after ($hid, $type, $pid), here would be the 'next' 20 records:
WHERE hotel_id >= $hid
AND ( hotel_id > $hid
OR idImageType >= $type
AND ( idImageType > $type
OR photo_id > $pid
)
)
ORDER BY hotel_id, idImageType, photo_id
LIMIT 20
and have
INDEX(hotel_id, idImageType, photo_id)
This avoids the need for orderid and its time consuming Update.
It would be simpler to paginate one hotel_id at a time. Would that work?
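For illustration, a hedged sketch of paginating within one hotel_id, using the values from the earlier example as the last row read (the row-constructor comparison is valid MySQL, though older versions may not use the index for it as efficiently as the expanded AND/OR form above):
SELECT *
FROM images
WHERE hotel_id = 1309878
  AND (idImageType, photo_id) > (4, 43259857)
ORDER BY idImageType, photo_id
LIMIT 20;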
Edit 2 -- eliminate downtime
Since you are reloading the entire table periodically, do this when you reload:
CREATE TABLE New with the recommended index changes.
Load the data into New. (Be sure to avoid your 51-minute timeout; I don't know what is causing that.)
RENAME TABLE images TO old, New TO images;
DROP TABLE old;
That will avoid blocking the table for the load and for the schema changes. There will be a very short block for the atomic Step #3.
Plan on doing this procedure each time you reload your data.
Another benefit -- After step #2, you can test the New data to see if it looks OK.
I've found a few questions that deal with this problem, and it appears that MySQL doesn't allow it. That's fine, I don't have to have a subquery in the FROM clause. However, I don't know how to get around it. Here's my setup:
I have a metrics table that has 3 columns I want: ControllerID, TimeStamp, and State. Basically, a data gathering engine contacts each controller in the database every 5 minutes and sticks an entry in the metrics table. The table has those three columns, plus a MetricsID that I don't care about. Maybe there is a better way to store those metrics, but I don't know it.
Regardless, I want a view that takes the most recent TimeStamp for each of the different ControllerIDs and grabs the TimeStamp, ControllerID, and State. So if there are 4 controllers, the view should always have 4 rows, each with a different controller, along with its most recent state.
I've been able to create a query that gets what I want, but it relies on a subquery in the FROM clause, something that isn't allowed in a view. Here is what I have so far:
SELECT *
FROM
(SELECT
ControllerID, TimeStamp, State
FROM Metrics
ORDER BY TimeStamp DESC)
AS t
GROUP BY ControllerID;
Like I said, this works great, but I can't use it in a view. I've tried using the max() function, but as per SQL: Any straightforward way to order results FIRST, THEN group by another column?, if I want any additional columns besides the GROUP BY and ORDER BY columns, max() doesn't work. I've confirmed this limitation; it doesn't work.
I've also tried to alter the metrics table to order by TimeStamp. That doesn't work either; the wrong rows are kept.
Edit: Here is the SHOW CREATE TABLE of the Metrics table I am pulling from:
CREATE TABLE Metrics (
MetricsID int(11) NOT NULL AUTO_INCREMENT,
ControllerID int(11) NOT NULL,
TimeStamp timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
State tinyint(4) NOT NULL,
PRIMARY KEY (MetricsID),
KEY makeItFast (ControllerID,MetricsID),
KEY fast (ControllerID,TimeStamp),
KEY fast2 (MetricsID),
KEY MetricsID (MetricsID),
KEY TimeStamp (TimeStamp)
) ENGINE=InnoDB AUTO_INCREMENT=8958 DEFAULT CHARSET=latin1
If you want the most recent row for each controller, the following is view friendly:
SELECT ControllerID, TimeStamp, State
FROM Metrics m
WHERE NOT EXISTS (SELECT 1
FROM Metrics m2
WHERE m2.ControllerId = m.ControllerId and m2.Timestamp > m.TimeStamp
);
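Since the goal is a view, a sketch of wrapping it (the view name is an assumption):
CREATE VIEW LatestControllerState AS
SELECT ControllerID, TimeStamp, State
FROM Metrics m
WHERE NOT EXISTS (SELECT 1
                  FROM Metrics m2
                  WHERE m2.ControllerId = m.ControllerId AND m2.TimeStamp > m.TimeStamp
                 );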
Your query is not correct anyway, because it uses a MySQL extension that is not guaranteed to work. The value for State does not necessarily come from the row with the largest timestamp; it comes from an arbitrary row.
EDIT:
For best performance, you want an index on Metrics(ControllerId, Timestamp).
Edit Sorry, I misunderstood your question; I thought you were trying to overcome the nested-query limitation in a view.
You're trying to display the most recent row for each distinct ControllerID. Furthermore, you're trying to do it with a view.
First, let's do it. If your MetricsID column (which I know you don't care about) is an autoincrement column, this is really easy.
SELECT ControllerId, TimeStamp, State
FROM Metrics m
WHERE MetricsID IN (
SELECT MAX(MetricsID) MetricsID
FROM Metrics
GROUP BY ControllerID)
ORDER BY ControllerID
This query uses MAX ... GROUP BY to extract the highest-numbered (most recent) row for each controller. It can be made into a view.
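As a sketch, the view definition could look like this (the view name is an assumption); only FROM-clause subqueries are disallowed in views, not WHERE-clause subqueries:
CREATE VIEW LatestMetrics AS
SELECT ControllerId, TimeStamp, State
FROM Metrics m
WHERE MetricsID IN (
    SELECT MAX(MetricsID)
    FROM Metrics
    GROUP BY ControllerID);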
A compound index on (ControllerID, MetricsID) will be able to satisfy the subquery with a highly efficient loose index scan.
The root cause of my confusion: I didn't read your question carefully enough.
The root cause of your confusion: You're trying to take advantage of a pernicious MySQL extension to GROUP BY. Your idea of ordering the subquery may have worked. But your temporary success is an accidental side-effect of the present implementation. Read this: http://dev.mysql.com/doc/refman/5.6/en/group-by-handling.html
Is there a faster way to update the oldest row of a MySQL table that matches a certain condition than using ORDER BY id LIMIT 1 as in the following query?
UPDATE mytable SET field1 = '1' WHERE field1 = 0 ORDER BY id LIMIT 1;
Note:
Assume the primary key is id and there is also a index on field1.
We are updating a single row.
We are not updating strictly the oldest row, we are updating the oldest row that matches a condition.
We want to update the oldest matching row, i.e the lowest id, i.e. the head of the FIFO queue.
Questions:
Is the ORDER BY id necessary? How does MySQL order by default?
Real world example
We have a DB table being used for a email queue. Rows are added when we want to queue emails to send to our users. Rows are removed by a cron job, run each minute, processing as many as possible in that minute and sending 1 email per row.
We plan to ditch this approach and use something like Gearman or Resque to process our email queue. But in the meantime I have a question on how we can efficiently mark the oldest item of the queue for processing, a.k.a. The row with the lowest ID. This query does the job:
mysql_query("UPDATE email_queue SET processingID = '1' WHERE processingID = 0 ORDER BY id LIMIT 1");
However, it is appearing in the mysql slow log a lot due to scaling issues. The query can take more than 10s when the table has 500,000 rows. The problem is that this table has grown massively since it was first introduced and now sometimes has half a million rows and an overhead of 133.9 MiB. For example we INSERT 6000 new rows perhaps 180 times a day and DELETE roughly the same number.
To stop the query appearing in the slow log we removed the ORDER BY id to stop a massive sort of the whole table. i.e.
mysql_query("UPDATE email_queue SET processingID = '1' WHERE processingID = 0 LIMIT 1");
... but the new query no longer always gets the row with the lowest id (although it often does). Is there a more efficient way of getting the row with the lowest id other than using ORDER BY id ?
For reference, this is the structure of the email queue table:
CREATE TABLE IF NOT EXISTS `email_queue` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`time_queued` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'Time when item was queued',
`mem_id` int(10) NOT NULL,
`email` varchar(150) NOT NULL,
`processingID` int(2) NOT NULL COMMENT 'Indicate if row is being processed',
PRIMARY KEY (`id`),
KEY `processingID` (`processingID`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
Give this a read:
ORDER BY … LIMIT Performance Optimization
Sounds like you have other processes locking the table, preventing your update from completing in a timely manner. Have you considered using InnoDB?
I think the 'slow part' comes from
WHERE processingID = 0
It's slow because it's not indexed. But, indexing this column (IMHO) seems incorrect too.
The idea is to change above query to something like :
WHERE id = 0
Which theoretically will be faster since it uses index.
How about creating another table which contains the ids of rows that haven't been processed? Insertion then works twice: first insert into the real table, then insert the id into the "not yet processed" table. The processing part also does double duty: first retrieve an id from the "not yet processed" table and delete it, and second, of course, do the actual processing.
Of course, the id column in the "not yet processed" table needs to be indexed, just to ensure that selecting and deleting stay fast.
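A minimal sketch of that idea (table and column names assumed):
CREATE TABLE email_queue_pending (
    id INT NOT NULL,
    PRIMARY KEY (id)
) ENGINE=MyISAM;

-- on enqueue, right after the INSERT into email_queue:
INSERT INTO email_queue_pending (id) VALUES (LAST_INSERT_ID());

-- on processing, take the lowest pending id, then delete it:
SELECT id FROM email_queue_pending ORDER BY id LIMIT 1;
DELETE FROM email_queue_pending WHERE id = 42;   -- the id returned by the SELECT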
This question is old, but for reference for anyone ending up here:
You have a condition on processingID (WHERE processingID = 0), and within that constraint you want to order by ID.
What's happening with your current query is that it scans the table from the lowest ID to the greatest, stopping when it finds 1 record matching the condition. Presumably, it will first find a ton of old records, scanning almost the entire table until it finds an unprocessed one near the end.
How do we improve this?
Consider that you have an index on processingID. Technically, the primary key is always appended (which is how the index can "point" to anything in the first place). So you really have an index on processingID, id. That means ordering on that will be fast.
Change your ordering to: ORDER BY processingID, id
Since you have fixed processingID to a single value with you WHERE clause, this does not change the resulting order. However, it does make it easy for the database to apply both your condition and your ordering, without scanning any records that do not match.
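Applied to the original query, that suggestion reads:
UPDATE email_queue
SET processingID = '1'
WHERE processingID = 0
ORDER BY processingID, id
LIMIT 1;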
One funny thing is that MySQL, by default, tends to return rows ordered by ID, instead of in an arbitrary order as relational theory says (I am not sure whether this behaviour has changed in the latest versions). So the last row you get from a SELECT should be the last inserted row. I would not rely on this, of course.
As you said, the best solution is to use something like Resque, or RabbitMQ & co.
You could use an in-memory table, which is volatile but much faster, and store the latest ID there, or just use a MyISAM table to add persistence. It is simple, performs well, and takes very little time to implement.
The following query is pretty simple. It selects the last 20 records from a messages table for use in a paging scenario. The first time this query is run, it takes from 15 to 30 seconds. Subsequent runs take less than a second (I expect some caching is involved). I am trying to determine why the first time takes so long.
Here's the query:
SELECT DISTINCT ID,List,`From`,Subject, UNIX_TIMESTAMP(MsgDate) AS FmtDate
FROM messages
WHERE List='general'
ORDER BY MsgDate
LIMIT 17290,20;
MySQL version: 4.0.26-log
Here's the table:
messages CREATE TABLE `messages` (
`ID` int(10) unsigned NOT NULL auto_increment,
`List` varchar(10) NOT NULL default '',
`MessageId` varchar(128) NOT NULL default '',
`From` varchar(128) NOT NULL default '',
`Subject` varchar(128) NOT NULL default '',
`MsgDate` datetime NOT NULL default '0000-00-00 00:00:00',
`TextBody` longtext NOT NULL,
`HtmlBody` longtext NOT NULL,
`Headers` text NOT NULL,
`UserID` int(10) unsigned default NULL,
PRIMARY KEY (`ID`),
UNIQUE KEY `List` (`List`,`MsgDate`,`MessageId`),
KEY `From` (`From`),
KEY `UserID` (`UserID`,`List`,`MsgDate`),
KEY `MsgDate` (`MsgDate`),
KEY `ListOnly` (`List`)
) TYPE=MyISAM ROW_FORMAT=DYNAMIC
Here's the explain:
table type possible_keys key key_len ref rows Extra
------ ------ ------------- -------- ------- ------ ------ --------------------------------------------
m ref List,ListOnly ListOnly 10 const 18002 Using where; Using temporary; Using filesort
Why is it using a filesort when I have indexes on all the relevant columns? I added the ListOnly index just to see if it would help. I had originally thought that the List index would handle both the list selection and the sorting on MsgDate, but it didn't. Now that I added the ListOnly index, that's the one it uses, but it still does a filesort on MsgDate, which is what I suspect is taking so long.
I tried using FORCE INDEX as follows:
SELECT DISTINCT ID,List,`From`,Subject, UNIX_TIMESTAMP(MsgDate) AS FmtDate
FROM messages
FORCE INDEX (List)
WHERE List='general'
ORDER BY MsgDate
LIMIT 17290,20;
This does seem to force MySQL to use the index, but it doesn't speed up the query at all.
Here's the explain for this query:
table type possible_keys key key_len ref rows Extra
------ ------ ------------- ------ ------- ------ ------ ----------------------------
m ref List List 10 const 18002 Using where; Using temporary
UPDATES:
I removed DISTINCT from the query. It didn't help performance at all.
I removed the UNIX_TIMESTAMP call. It also didn't affect performance.
I made a special case in my PHP code so that if I detect the user is looking at the last page of results, I add a WHERE clause that returns only the last 7 days of results:
SELECT m.ID,List,From,Subject,MsgDate
FROM messages
WHERE MsgDate>='2009-11-15'
ORDER BY MsgDate DESC
LIMIT 20
This is a lot faster. However, as soon as I navigate to another page of results, it must use the old SQL and takes a very long time to execute. I can't think of a practical, realistic way to do this for all pages. Also, doing this special case makes my PHP code more complex.
Strangely, only the first time the original query is run takes a long time. Subsequent runs of either the same query or a query showing a different page of results (i.e., only the LIMIT clause changes) are very fast. The query slows down again if it has not been run for about 5 minutes.
SOLUTION:
The best solution I came up with is based on Jason Orendorff and Juliet's idea.
First, I determine if the current page is closer to the beginning or end of the total number of pages. If it's closer to the end, I use ORDER BY MsgDate DESC, apply an appropriate limit, then reverse the order of the returned records.
This makes retrieving pages close to the beginning or end of the resultset much faster (first time now takes 4-5 seconds instead of 15-30). If the user wants to navigate to a page near the middle (currently around the 430th page), then the speed might drop back down. But that would be a rare case.
So while there seems to be no perfect solution, this is much better than it was for most cases.
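For reference, a hedged sketch of the near-the-end case; the total match count (roughly 18000, per the EXPLAIN above) would really come from a COUNT(*) query, and the flipped offset is total minus offset minus page size:
-- original request was LIMIT 17290,20; 18000 - 17290 - 20 = 690
SELECT DISTINCT ID, List, `From`, Subject, UNIX_TIMESTAMP(MsgDate) AS FmtDate
FROM messages
WHERE List='general'
ORDER BY MsgDate DESC
LIMIT 690, 20;   -- then reverse the 20 rows in PHP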
Thank you, Jason and Juliet.
Instead of ORDER BY MsgDate LIMIT 17290,20, try ORDER BY MsgDate DESC LIMIT 20.
Of course the results will come out in the reverse order, but that should be easy to deal with.
EDIT: Do your MessageId values always increase with time? Are they unique?
If so, I would make an index:
UNIQUE KEY `ListMsgId` ( `List`, `MessageId` )
and query based on the message ids rather than the date when possible.
-- Most recent messages (in reverse order)
SELECT * FROM messages
WHERE List = 'general'
ORDER BY MessageId DESC
LIMIT 20
-- Previous page (in reverse order)
SELECT * FROM messages
WHERE List = 'general' AND MessageId < '15885830'
ORDER BY MessageId DESC
LIMIT 20
-- Next page
SELECT * FROM messages
WHERE List = 'general' AND MessageId > '15885829'
ORDER BY MessageId
LIMIT 20
I think you're also paying for having varchar columns where an int type would be a lot faster. For example, List could instead be a ListId that points to an entry in a separate table. You might want to try it out in a test database to see if that's really true; I'm not a MySQL expert.
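A hypothetical sketch of that change (table and column names assumed; TYPE=MyISAM to match the existing dump):
CREATE TABLE lists (
    ListId int(10) unsigned NOT NULL auto_increment,
    Name varchar(10) NOT NULL default '',
    PRIMARY KEY (ListId),
    UNIQUE KEY Name (Name)
) TYPE=MyISAM;
-- messages.List would then become an int ListId column joined to lists.ListId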
You can drop the ListOnly key. The compound index List already contains all the information in it.
Your EXPLAIN for the List-indexed query looks much better, lacking the filesort. You may be able to get better real performance out of it by swapping the ORDER as suggested by Jason, and maybe losing the UNIX_TIMESTAMP call (you can do that in the application layer, or just use Unix timestamps stored as INTEGER in the schema).
What version of MySQL are you using? Some of the older versions applied the LIMIT clause as a post-processing filter (meaning the server gathers all the matching records, but only sends back the 20 you requested).
You can see from your EXPLAIN that 18002 rows are being examined, even though you are only showing 20 of them. Is there any way to adjust your selection criteria to identify just the 20 rows you want to return, rather than getting 18000 rows back and only showing 20 of them?