The following query is pretty simple. It selects the last 20 records from a messages table for use in a paging scenario. The first time this query is run, it takes from 15 to 30 seconds. Subsequent runs take less than a second (I expect some caching is involved). I am trying to determine why the first time takes so long.
Here's the query:
SELECT DISTINCT ID,List,`From`,Subject, UNIX_TIMESTAMP(MsgDate) AS FmtDate
FROM messages
WHERE List='general'
ORDER BY MsgDate
LIMIT 17290,20;
MySQL version: 4.0.26-log
Here's the table:
messages CREATE TABLE `messages` (
`ID` int(10) unsigned NOT NULL auto_increment,
`List` varchar(10) NOT NULL default '',
`MessageId` varchar(128) NOT NULL default '',
`From` varchar(128) NOT NULL default '',
`Subject` varchar(128) NOT NULL default '',
`MsgDate` datetime NOT NULL default '0000-00-00 00:00:00',
`TextBody` longtext NOT NULL,
`HtmlBody` longtext NOT NULL,
`Headers` text NOT NULL,
`UserID` int(10) unsigned default NULL,
PRIMARY KEY (`ID`),
UNIQUE KEY `List` (`List`,`MsgDate`,`MessageId`),
KEY `From` (`From`),
KEY `UserID` (`UserID`,`List`,`MsgDate`),
KEY `MsgDate` (`MsgDate`),
KEY `ListOnly` (`List`)
) TYPE=MyISAM ROW_FORMAT=DYNAMIC
Here's the explain:
table type possible_keys key key_len ref rows Extra
------ ------ ------------- -------- ------- ------ ------ --------------------------------------------
m ref List,ListOnly ListOnly 10 const 18002 Using where; Using temporary; Using filesort
Why is it using a filesort when I have indexes on all the relevant columns? I added the ListOnly index just to see if it would help. I had originally thought that the List index would handle both the list selection and the sorting on MsgDate, but it didn't. Now that I added the ListOnly index, that's the one it uses, but it still does a filesort on MsgDate, which is what I suspect is taking so long.
I tried using FORCE INDEX as follows:
SELECT DISTINCT ID,List,`From`,Subject, UNIX_TIMESTAMP(MsgDate) AS FmtDate
FROM messages
FORCE INDEX (List)
WHERE List='general'
ORDER BY MsgDate
LIMIT 17290,20;
This does seem to force MySQL to use the index, but it doesn't speed up the query at all.
Here's the explain for this query:
table type possible_keys key key_len ref rows Extra
------ ------ ------------- ------ ------- ------ ------ ----------------------------
m ref List List 10 const 18002 Using where; Using temporary
UPDATES:
I removed DISTINCT from the query. It didn't help performance at all.
I removed the UNIX_TIMESTAMP call. It also didn't affect performance.
I made a special case in my PHP code so that if I detect the user is looking at the last page of results, I add a WHERE clause that returns only the last 7 days of results:
SELECT ID,List,`From`,Subject,MsgDate
FROM messages
WHERE MsgDate>='2009-11-15'
ORDER BY MsgDate DESC
LIMIT 20
This is a lot faster. However, as soon as I navigate to another page of results, it must use the old SQL and takes a very long time to execute. I can't think of a practical, realistic way to do this for all pages. Also, doing this special case makes my PHP code more complex.
Strangely, only the first run of the original query takes a long time. Subsequent runs, whether of the same query or of a query showing a different page of results (i.e., only the LIMIT clause changes), are very fast. The query slows down again if it has not been run for about 5 minutes.
SOLUTION:
The best solution I came up with is based on Jason Orendorff and Juliet's idea.
First, I determine if the current page is closer to the beginning or end of the total number of pages. If it's closer to the end, I use ORDER BY MsgDate DESC, apply an appropriate limit, then reverse the order of the returned records.
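A minimal sketch of the last-page case, assuming the reversed offset has already been computed in PHP (the literal 40 here is just an example):
SELECT DISTINCT ID,List,`From`,Subject, UNIX_TIMESTAMP(MsgDate) AS FmtDate
FROM messages
WHERE List='general'
ORDER BY MsgDate DESC
LIMIT 40,20;
-- then reverse the 20 rows in PHP before rendering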
This makes retrieving pages close to the beginning or end of the resultset much faster (first time now takes 4-5 seconds instead of 15-30). If the user wants to navigate to a page near the middle (currently around the 430th page), then the speed might drop back down. But that would be a rare case.
So while there seems to be no perfect solution, this is much better than it was for most cases.
Thank you, Jason and Juliet.
Instead of ORDER BY MsgDate LIMIT 17290,20, try ORDER BY MsgDate DESC LIMIT 20.
Of course the results will come out in the reverse order, but that should be easy to deal with.
EDIT: Do your MessageId values always increase with time? Are they unique?
If so, I would make an index:
UNIQUE KEY `ListMsgId` ( `List`, `MessageId` )
and query based on the message ids rather than the date when possible.
-- Most recent messages (in reverse order)
SELECT * FROM messages
WHERE List = 'general'
ORDER BY MessageId DESC
LIMIT 20
-- Previous page (in reverse order)
SELECT * FROM messages
WHERE List = 'general' AND MessageId < '15885830'
ORDER BY MessageId DESC
LIMIT 20
-- Next page
SELECT * FROM messages
WHERE List = 'general' AND MessageId > '15885829'
ORDER BY MessageId
LIMIT 20
I think you're also paying for having varchar columns where an int type would be a lot faster. For example, List could instead be a ListId that points to an entry in a separate table. You might want to try it out in a test database to see if that's really true; I'm not a MySQL expert.
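For illustration, a hypothetical lookup table along those lines (all names here are made up):
CREATE TABLE lists (
`ListId` tinyint unsigned NOT NULL auto_increment,
`Name` varchar(10) NOT NULL,
PRIMARY KEY (`ListId`),
UNIQUE KEY `Name` (`Name`)
) TYPE=MyISAM;
-- messages.List would then become an integer ListId referencing this table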
You can drop the ListOnly key. The compound index List already contains all the information in it.
Your EXPLAIN for the List-indexed query looks much better, lacking the filesort. You may be able to get better real performance out of it by swapping the ORDER as suggested by Jason, and maybe losing the UNIX_TIMESTAMP call (you can do that in the application layer, or just use Unix timestamps stored as INTEGER in the schema).
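For example, one possible migration to an integer timestamp column (the column name is illustrative):
ALTER TABLE messages ADD COLUMN MsgStamp int unsigned NOT NULL;
UPDATE messages SET MsgStamp = UNIX_TIMESTAMP(MsgDate);
-- then select MsgStamp directly and ORDER BY MsgStamp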
What version of MySQL are you using? Some of the older versions used the LIMIT clause as a post-process filter (meaning: fetch all the records matching the request from the server, but only display the 20 you asked for).
You can see from your EXPLAIN that 18002 rows are coming back, even though you only show 20 of them. Is there any way to adjust your selection criteria so it identifies the 20 rows you want, rather than fetching 18000 rows and discarding all but 20?
Related
Given this table in MySQL 5.6:
create table PlayerSession
(
id bigint auto_increment primary key,
lastActivity datetime not null,
player_id bigint null,
...
constraint FK4410E05525A98981
foreign key (player_id) references Player (id)
)
How can it possibly be that this query returns about 2000 rows instantly:
SELECT * FROM PlayerSession
WHERE player_id = ....
ORDER BY lastActivity DESC
but adding LIMIT 1 makes it take 4 seconds, even though all that should do is pick the first result?
Using EXPLAIN I found the only difference to be that without the limit, filesort is used. From what I gather, this should make it slower, not faster. The whole table contains about 2M rows.
Also, adding LIMIT 3 or anything higher than that, gives the same performance as no limit.
And yes, I have since created an index on (player_id, lastActivity), which, surprise surprise, makes it fast again. While that takes the immediate stress out of the situation (the server was rather overloaded), it doesn't really explain the mystery.
What specific version of 5.6? Please provide EXPLAIN FORMAT=JSON SELECT .... Please provide SHOW CREATE TABLE; we need to see the other indexes, plus datatypes.
INDEX(player_id, lastActivity) lets the query avoid "filesort".
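In DDL form, that index might look like this (assuming the column names from the schema above):
ALTER TABLE PlayerSession ADD INDEX idx_player_activity (player_id, lastActivity);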
A possible reason for the strange timings could be caching. Run each query twice to avoid that hiccup.
I have this table (500,000 rows):
CREATE TABLE IF NOT EXISTS `listings` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`type` tinyint(1) NOT NULL DEFAULT '1',
`hash` char(32) NOT NULL,
`source_id` int(10) unsigned NOT NULL,
`link` varchar(255) NOT NULL,
`short_link` varchar(255) NOT NULL,
`cat_id` mediumint(5) NOT NULL,
`title` mediumtext NOT NULL,
`description` mediumtext,
`content` mediumtext,
`images` mediumtext,
`videos` mediumtext,
`views` int(10) unsigned NOT NULL,
`comments` int(11) DEFAULT '0',
`comments_update` int(11) NOT NULL DEFAULT '0',
`editor_id` int(11) NOT NULL DEFAULT '0',
`auther_name` varchar(255) DEFAULT NULL,
`createdby_id` int(10) NOT NULL,
`createdon` int(20) NOT NULL,
`editedby_id` int(10) NOT NULL,
`editedon` int(20) NOT NULL,
`deleted` tinyint(1) NOT NULL,
`deletedon` int(20) NOT NULL,
`deletedby_id` int(10) NOT NULL,
`deletedfor` varchar(255) NOT NULL,
`published` tinyint(1) NOT NULL DEFAULT '1',
`publishedon` int(11) unsigned NOT NULL,
`publishedby_id` int(10) NOT NULL,
PRIMARY KEY (`id`),
KEY `hash` (`hash`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
I'm thinking of filtering every query by publishedon BETWEEN x AND y (the whole site only shows records from one month).
At the same time, I want the WHERE clause to include published, cat_id, and source_id along with publishedon,
something like this:
SELECT * FROM listings
WHERE (publishedon BETWEEN 1441105258 AND 1443614458)
AND (published = 1)
AND (cat_id in(1,2,3,4,5))
AND (source_id in(1,2,3,4,5))
That query is OK and fast so far without indexing, but when I tried to ORDER BY publishedon it became too slow, so I added this index:
CREATE INDEX `listings_pcs` ON listings(
`publishedon` DESC,
`published` ,
`cat_id` ,
`source_id`
)
That worked and ORDER BY publishedon became fast. Now I want to ORDER BY views, like this:
SELECT * FROM listings
WHERE (publishedon BETWEEN 1441105258 AND 1443614458)
AND (published = 1)
AND (cat_id in(1,2,3,4,5))
AND (source_id in(1,2,3,4,5))
ORDER BY views DESC
(The EXPLAIN output was attached as an image in the original post.)
This query is too slow because of ORDER BY views DESC.
Then I tried dropping the old index and adding this one:
CREATE INDEX `listings_pcs` ON listings(
`publishedon` DESC,
`published` ,
`cat_id` ,
`source_id`,
`views` DESC
)
It's also too slow.
What if I use just a single index on publishedon?
What about a single index on cat_id, source_id, views, publishedon?
I can change what the query depends on, e.g. keep publishedon to one month, if I find another indexing method that relies on other columns.
What about an index on (cat_id, source_id, publishedon, published)? But in some cases I will use source_id only.
What is the best indexing scheme for this table?
This query:
SELECT *
FROM listings
WHERE (publishedon BETWEEN 1441105258 AND 1443614458) AND
(published = 1) AND
(cat_id in (1,2,3,4,5)) AND
(source_id in (1,2,3,4,5));
is hard to optimize with indexes alone. The best index is one that starts with published and then has the other columns -- it is not clear what their order should be. The reason is that all the conditions except the one on published do not use =.
Because your performance problem is on a sort, that suggests that lots of rows are being returned. Typically, an index is used to satisfy the WHERE clause before the index can be used for the ORDER BY. That makes this hard to optimize.
Suggestions . . . None are that great:
If you are going to access the data by month, then you might consider partitioning the data by month. That will make the query without the ORDER BY faster, but won't help the ORDER BY.
Try various orders of the columns after published in the index. You might find the most selective column(s). But, once again, this only speeds up the query before the sorting.
Think about ways that you can structure the query to have more equality conditions in the WHERE clause or to return a smaller set of data.
(Not really recommended) Put an index on published and the ordering column. Then use a subquery to fetch the data. Put the inequality conditions (IN and so on) in the outer query. The subquery will use the index for sorting and then filter the results.
The reason the last is not recommended is because SQL (and MySQL) do not guarantee the ordering of results from a subquery. However, because MySQL materializes subqueries, the results really are in order. I don't like using undocumented side effects, which can change from version to version.
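For concreteness, a sketch of that (not recommended) pattern, relying on the materialization behavior described above:
-- index on the equality column plus the ordering column
CREATE INDEX listings_pub_views ON listings (published, views);
SELECT *
FROM (
SELECT * FROM listings
WHERE published = 1
ORDER BY views DESC
) sorted
WHERE publishedon BETWEEN 1441105258 AND 1443614458
AND cat_id IN (1,2,3,4,5)
AND source_id IN (1,2,3,4,5);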
One important general note as to why your query isn't getting any faster despite your attempts is that DESC on indexes is not currently supported on MySQL. See this SO thread, and the source from which it comes.
In this case, your largest problem is in the sheer size of your record. If the engine decides it wouldn't really be faster to use an index, then it won't.
You have a few options, and all are actually pretty decent and can probably help you see significant improvement.
A note on SQL
First, I want to make a quick note about indexing in SQL. While I don't think it's the solution for your woes, it was your main question, and can help.
It usually helps me to think about indexing in three buckets: the absolutely, the maybe, and the never. You certainly don't have anything in your indexing that falls into the never bucket, but there are some I would consider "maybe" indexes.
absolutely: This is your primary key and any foreign keys. It is also any key you will reference on a very regular basis to pull a small set of data from the massive data you have.
maybe: These are columns which, while you may reference them regularly, are not really referenced by themselves. In fact, through analysis and using EXPLAIN as @Machavity recommends in his answer, you may find that by the time these columns are used to filter rows, there aren't that many rows left anyway. An example of a column that would solidly be in this bucket for me is the published column. Keep in mind that every INDEX adds to the work your queries need to do.
Also: Composite keys are a good choice when you're regularly searching for data based on two different columns. More on that later.
Options, options, options...
There are a number of options to consider, and each one has some drawbacks. Ultimately I would consider each of these on a case-by-case basis as I don't see any of these to be a silver bullet. Ideally, you'd test a few different solutions against your current setting and see which one runs the fastest using a nice scientific test.
Split your SQL table into two or more separate tables.
This is one of the few times where, despite the number of columns in your table, I wouldn't rush to try to split your table into smaller chunks. If you decided to split it into smaller chunks, however, I'd argue that your [action]edon, [action]edby_id, and [action]ed could easily be put into another table, actions:
+-----------+-------------+------+-----+-------------------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-----------+-------------+------+-----+-------------------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| action_id | int(11) | NO | | NULL | |
| action | varchar(45) | NO | | NULL | |
| date | datetime | NO | | CURRENT_TIMESTAMP | |
| user_id | int(11) | NO | | NULL | |
+-----------+-------------+------+-----+-------------------+----------------+
The downside to this is that it does not allow you to ensure there is only one creation date without a TRIGGER. The upside is that you don't have to sort as many columns, with as many indexes, when you're sorting by date. It also allows you to sort not only by creation, but by all of your other actions.
Edit: As requested, here is a sample sorting query
SELECT * FROM listings
INNER JOIN actions ON actions.listing_id = listings.id
WHERE (actions.action = 'published')
AND (listings.published = 1)
AND (listings.cat_id in(1,2,3,4,5))
AND (listings.source_id in(1,2,3,4,5))
AND (actions.actiondate between 1441105258 AND 1443614458)
ORDER BY listings.views DESC
Theoretically, it should cut down on the number of rows you're sorting against because it's only pulling relevant data. I don't have a dataset like yours so I can't test it right now!
If you put a composite key on actiondate and listings.id, this should help to increase speed.
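That composite key might look like this (using the column names from the sample query above, which are illustrative):
ALTER TABLE actions ADD INDEX idx_actiondate_listing (actiondate, listing_id);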
As I said, I don't think this is the best solution for you right now because I'm not convinced it's going to give you the maximum optimization. This leads me to my next suggestion:
Create a month field
I used this nifty tool to confirm what I thought I understood of your question: you are filtering by month here. Your example specifically looks between September 1st and September 30th, inclusive.
So another option is to split your integer timestamp into month, day, and year fields. You can still keep the timestamp, but timestamps aren't all that great for searching. Run EXPLAIN on even a simple query and you'll see for yourself.
That way, you can just index the month and year fields and do a query like this:
SELECT * FROM listings
WHERE (publishedmonth = 9)
AND (publishedyear = 2015)
AND (published = 1)
AND (cat_id in(1,2,3,4,5))
AND (source_id in(1,2,3,4,5))
ORDER BY views DESC
Slap an EXPLAIN in front and you should see massive improvements.
Because you're planning on filtering by month and year, you may want to add a composite key on month and year, rather than a key on each separately, for added gains.
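A sketch of the denormalized columns and that composite key (column and index names are illustrative):
ALTER TABLE listings
ADD COLUMN publishedyear smallint unsigned NOT NULL,
ADD COLUMN publishedmonth tinyint unsigned NOT NULL,
ADD INDEX idx_pub_year_month (publishedyear, publishedmonth);
-- backfill from the existing integer timestamp
UPDATE listings
SET publishedyear = YEAR(FROM_UNIXTIME(publishedon)),
publishedmonth = MONTH(FROM_UNIXTIME(publishedon));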
Note: I want to be clear, this is not the "correct" way to do things. It is convenient, but denormalized. If you want the correct way to do things, you'd adapt something like this link, but I think that would require you to seriously reconsider your table, and I haven't tried anything like it, having lacked the need, and, frankly, the will, to brush up on my geometry. I think it's a little overkill for what you're trying to do.
Do your heavy sorting elsewhere
This was hard for me to come to terms with because I like to do things the "SQL" way wherever possible, but that is not always the best solution. Heavy computing, for example, is best done using your programming language, leaving SQL to handle relationships.
The former CTO of Digg sorted using PHP instead of MySQL and received a 4,000% performance increase. You're probably not scaling out to this level, of course, so the performance trade-offs won't be clearcut unless you test it out yourself. Still, the concept is sound: the database is the bottleneck, and computer memory is dirt cheap by comparison.
There are doubtless a lot more tweaks that can be done. Each of these has a drawback and requires some investment. The best answer is to test two or more of these and see which one helps you get the most improvement.
If I were you, I'd at least INDEX the fields in question individually. You're building multi-column indices but it's clear you're pulling a lot of disparate records as well. Having the columns indexed individually can't hurt.
Something you should do is use EXPLAIN which lets you look under the hood of how MySQL is pulling the data. It could further point to what is slowing your query down.
EXPLAIN SELECT * FROM listings
WHERE (publishedon BETWEEN 1441105258 AND 1443614458)
AND (published = 1)
AND (cat_id in(1,2,3,4,5))
AND (source_id in(1,2,3,4,5))
ORDER BY views DESC
The rows of your table are enormous (all those mediumtext columns), so sorting SELECT * is going to have a lot of overhead. That's a simple reality of your schema design. SELECT * is generally considered harmful to performance. If you can enumerate the columns you need, and you can leave out some of the big ones, you'll get better performance.
You showed us a query with the following filter criteria
single-value equality on published.
range matching on publishedon.
set matching on cat_id
set matching on source_id.
Ordering on views.
Due to the way MySQL indexing works on MyISAM, the following compound covering index will probably serve you well. It's hard to be sure unless you try it.
CREATE INDEX listings_x_pub_date_cover ON listings(
published, publishedon, cat_id, source_id, views, id )
To satisfy your query the MySQL engine will random-access the index at the appropriate value of published, and then at the beginning of the publishedon range. It will then scan through the index, filtering on the other two criteria. Finally, it sorts and uses the id value to look up each row that passes the filter. Give it a try.
If that performance isn't good enough try this so-called deferred join operation.
SELECT a.*
FROM listings a
JOIN ( SELECT id, views
FROM listings
WHERE published = 1
AND publishedon BETWEEN 1441105258
AND 1443614458
AND cat_id IN (1,2,3,4,5)
AND source_id IN (1,2,3,4,5)
ORDER BY views DESC
) b ON a.id = b.id
ORDER BY b.views DESC
This does the heavy lifting of ordering with just the id and views columns, without having to shuffle all those massive text columns. It may or may not help, because the ordering has to be repeated in the outer query. This kind of thing DEFINITELY helps when you have an ORDER BY ... LIMIT n pattern in your query, but you don't.
Finally, considering the size of these rows, you may get best performance by doing this inner query from your php program:
SELECT id
FROM listings
WHERE published = 1
AND publishedon BETWEEN 1441105258
AND 1443614458
AND cat_id IN (1,2,3,4,5)
AND source_id IN (1,2,3,4,5)
ORDER BY views DESC
and then fetching the full rows of the table one-by-one using these id values in an inner loop. (This query that fetches just id values should be quite fast with the help of the index I mentioned.) The inner loop solution would be ugly, but if your text columns are really big (each mediumtext column can hold up to 16MiB) it's probably your best bet.
tl;dr. Create the index mentioned. Get rid of SELECT * if you possibly can, giving a list of columns you need instead. Try the deferred join query. If it's still not good enough try the nested query.
Performance problem updating a big MySQL MyISAM table to put a column in ascending order based on an index on the same table
My problem is that the server has only 4 GB of memory.
I have to run an update query like the one in a previously asked question.
Mine is this:
set @orderid = 0;
update images im
set im.orderid = (select @orderid := @orderid + 1)
ORDER BY im.hotel_id, im.idImageType;
On im.hotel_id, im.idImageType I have an ascending index.
On im.orderid I have an ascending index too.
The table has 21 million records and is a MyISAM table.
The table is this:
CREATE TABLE `images` (
`photo_id` int(11) NOT NULL,
`idImageType` int(11) NOT NULL,
`hotel_id` int(11) NOT NULL,
`room_id` int(11) DEFAULT NULL,
`url_original` varchar(150) COLLATE utf8_unicode_ci NOT NULL,
`url_max300` varchar(150) COLLATE utf8_unicode_ci NOT NULL,
`url_square60` varchar(150) COLLATE utf8_unicode_ci NOT NULL,
`archive` int(11) NOT NULL DEFAULT '0',
`orderid` int(11) NOT NULL DEFAULT '0',
PRIMARY KEY (`photo_id`),
KEY `idImageType` (`idImageType`),
KEY `hotel_id` (`hotel_id`),
KEY `hotel_id_idImageType` (`hotel_id`,`idImageType`),
KEY `archive` (`archive`),
KEY `room_id` (`room_id`),
KEY `orderid` (`orderid`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
The problem is performance: the update hangs for several minutes!
The server disk gets busy too.
My question is: is there a better way to achieve the same result?
Should I partition the table, or do something else, to increase performance?
I cannot modify the server hardware, but I can tune the MySQL server settings.
Best regards
Thanks to everybody; your answers helped me a lot. I think I have now found a better solution.
This problem involves two critical issues:
efficient pagination of a large table
updating a large table.
For efficient pagination of a large table, I had found a solution based on a prior update of the table; but that update needed 51 minutes, and consequently my Java infrastructure (a spring-batch step) timed out.
Now, with your help, I have found two solutions for paginating a large table, and one solution for updating a large table.
To reach this performance the server needs memory. I tried these solutions on a development server with 32 GB of memory.
Common solution step
To paginate on the tuple of fields I needed, I had made this index:
KEY `hotel_id_idImageType` (`hotel_id`,`idImageType`)
To achieve the new solutions, we have to change this index by adding the primary key to its tail, i.e. KEY hotel_id_idImageType (hotel_id, idImageType, primary key fields):
drop index hotel_id_idImageType on images;
create index hotelTypePhoto on images (hotel_id, idImageType, photo_id);
This is needed so queries can avoid touching the table file and work only on the index file.
Suppose we want the 10 records after the 19000000th record.
Solution 1
This solution is very practical: it does not need the extra orderid field, and you don't have to run any update before paginating:
select * from images im inner join
(select photo_id from images
order by hotel_id, idImageType, photo_id
limit 19000000,10) k
on im.photo_id = k.photo_id;
Building the derived table k on my 21-million-record table takes only 1.5 sec, because it uses only the three fields in the hotelTypePhoto index; it never has to access the table file and works only on the index file.
The order is the one originally required, (hotel_id, idImageType), because it is a prefix of (hotel_id, idImageType, photo_id): same subset.
The join takes no time, so even the first time a page is fetched it needs only 1.5 sec, and that is a good time if you have to execute it in a batch once every 3 months.
On the production server with 4 GB of memory, the same query takes 3.5 sec.
Partitioning the table does not help to improve performance.
If the server has it in cache the time goes down, and if you use a parameterized JDBC statement the time probably goes down too.
If you have to use it often, it has the advantage that it does not care whether the data changes.
Solution 2
This solution needs the extra orderid field, and the orderid update must be done once per batch import; the data must not change until the next batch import.
Then you can paginate the table in 0.000 sec.
set @orderid = 0;
update images im inner join (
select photo_id, (@orderid := @orderid + 1) as newOrder
from images order by hotel_id, idImageType, photo_id
) k
on im.photo_id = k.photo_id
set im.orderid = k.newOrder;
The derived table k is built almost as fast as in the first solution.
The whole update takes only 150.551 sec, much better than 51 minutes! (150 s vs 3060 s)
After this update in the batch, you can paginate with:
select * from images im where orderid between 19000000 and 19000010;
or better
select * from images im where orderid >= 19000000 and orderid< 19000010;
This takes 0.000 sec to execute, the first time and every time after.
Edit after Rick's comment
Solution 3
This solution avoids extra fields and the use of OFFSET, but requires remembering the last page read, as in this solution.
This is a fast solution and can work on the online production server using only 4 GB of memory.
Suppose you need to read the ten records after record 20000000.
There are two scenarios to take care of:
You can read from the first record up to the 20000000th, if you need all of them like me, updating a variable to remember the last page read.
You only have to read the 10 records after 20000000.
In the second scenario, you have to run a pre-query to find the starting row:
select hotel_id, idImageType, photo_id
from images im
order by hotel_id, idImageType, photo_id limit 20000000,1
It gives me:
+----------+-------------+----------+
| hotel_id | idImageType | photo_id |
+----------+-------------+----------+
| 1309878 | 4 | 43259857 |
+----------+-------------+----------+
This takes 6.73 sec.
You can store these values in variables for later use.
Suppose we set @hot=1309878, @type=4, @photo=43259857.
Then you can use them in a second query like this:
select * from images im
where
hotel_id>@hot OR (
hotel_id=@hot and idImageType>@type OR (
idImageType=@type and photo_id>@photo
)
)
order by hotel_id, idImageType, photo_id limit 10;
The first clause, hotel_id>@hot, takes all records after the current value of the first index field, but misses some. To pick those up, we need the OR clause, which takes the remaining unread records that share the first index field's current value.
This takes only 0.10 sec now.
But this query can be optimized (boolean distributive law):
select * from images im
where
hotel_id>@hot OR (
hotel_id=@hot and
(idImageType>@type or idImageType=@type)
and (idImageType>@type or photo_id>@photo
)
)
order by hotel_id, idImageType, photo_id limit 10;
which becomes:
select * from images im
where
hotel_id>@hot OR (
hotel_id=@hot and
idImageType>=@type
and (idImageType>@type or photo_id>@photo
)
)
order by hotel_id, idImageType, photo_id limit 10;
which becomes:
select * from images im
where
(hotel_id>@hot OR hotel_id=@hot) and
(hotel_id>@hot OR
(idImageType>=@type and (idImageType>@type or photo_id>@photo))
)
order by hotel_id, idImageType, photo_id limit 10;
which becomes:
select * from images im
where
hotel_id>=@hot and
(hotel_id>@hot OR
(idImageType>=@type and (idImageType>@type or photo_id>@photo))
)
order by hotel_id, idImageType, photo_id limit 10;
Is this the same data we would get with LIMIT?
For a quick (not exhaustive) test, run:
select im.* from images im inner join (
select photo_id from images order by hotel_id, idImageType, photo_id limit 20000000,10
) k
on im.photo_id=k.photo_id
order by im.hotel_id, im.idImageType, im.photo_id;
This takes 6.56 sec, and the data is the same as the query above returns.
So the test is positive.
With this solution, you spend 6.73 sec only the first time, when you need to seek to the first page to read (but if you read everything from the start, you don't need it at all).
To read every other page you need only 0.10 sec, a very good result.
Thanks to Rick for his hint about a solution based on storing the last page read.
Conclusion
With solution 1, you have no extra field, and every page takes 3.5 sec.
With solution 2, you have an extra field and need a server with a lot of memory (32 GB tested) for the 150-sec update, but then you read each page in 0.000 sec.
With solution 3, you have no extra field, but you have to store a pointer to the last page read; if you do not start reading from the first page, you spend 6.73 sec on the first page. After that, you spend only 0.10 sec on all the other pages.
Best regards
Edit 3
Solution 3 is exactly the one suggested by Rick. I'm sorry: in my previous solution 3 I had made a mistake, and when I coded the correct solution I applied some boolean rules, like the distributive law, and in the end I arrived at the same solution as Rick!
Regards
You can use some of these:
Switch the engine to InnoDB; on update it locks only the affected row, not the whole table.
Create a temp table with photo_id and the desired orderid, then update your table from that temp table:
update images im, temp tp
set im.orderid = tp.orderid
where im.photo_id = tp.photo_id
This will be the fastest way, and while you fill your temp table there are no locks on the primary table.
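A sketch of building that temp table first (the table name and types are illustrative):
CREATE TABLE temp (
photo_id int NOT NULL PRIMARY KEY,
orderid int NOT NULL
) ENGINE=InnoDB;
SET @orderid = 0;
INSERT INTO temp (photo_id, orderid)
SELECT photo_id, (@orderid := @orderid + 1)
FROM images
ORDER BY hotel_id, idImageType;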
You can also drop the indexes before the mass update. After your single big update finishes, rebuild the indexes; the rebuild takes a while, but it is faster than maintaining them row by row.
KEY `hotel_id` (`hotel_id`),
KEY `hotel_id_idImageType` (`hotel_id`,`idImageType`),
DROP the former; the latter takes care of any need for it. (This won't speed up the original query.)
"The problem is the performance: hang for several minutes!" What is the problem?
Other queries are blocked for several minutes? (InnoDB should help.)
You run this update often and it is annoying? (Why in the world??)
Something else?
This one index is costly while doing the Update:
KEY `orderid` (`orderid`)
DROP it and re-create it. (Don't bother dropping the rest.) Another reason for going with InnoDB is that these operations can be done (in 5.6) without copying the table over. (21M rows == long time if it has to copy the table!)
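Concretely, the drop/re-create around the mass update would be something like:
ALTER TABLE images DROP INDEX orderid;
-- ... run the big UPDATE here ...
ALTER TABLE images ADD INDEX orderid (orderid);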
Why are you building a second Unique index (orderid) in addition to photo_id, which is already Unique? I ask this because there may be another way to solve the real problem that does not involve this time-consuming Update.
I have two more concrete suggestions, but I want to hear your answers first.
Edit Pagination, ordered by hotel_id, idImageType, photo_id:
It is possible to read the records in order by that triple. And even to "paginate" through them.
If you "left off" after ($hid, $type, $pid), here would be the 'next' 20 records:
WHERE hotel_id >= $hid
AND ( hotel_id > $hid
OR idImageType >= $type
AND ( idImageType > $type
OR photo_id > $pid
)
)
ORDER BY hotel_id, idImageType, photo_id
LIMIT 20
and have
INDEX(hotel_id, idImageType, photo_id)
This avoids the need for orderid and its time consuming Update.
It would be simpler to paginate one hotel_id at a time. Would that work?
Edit 2 -- eliminate downtime
Since you are reloading the entire table periodically, do this when you reload:
CREATE TABLE New with the recommended index changes.
Load the data into New. (Be sure to avoid your 51-minute timeout; I don't know what is causing that.)
RENAME TABLE images TO old, New TO images;
DROP TABLE old;
That will avoid blocking the table for the load and for the schema changes. There will be a very short block for the atomic Step #3.
Plan on doing this procedure each time you reload your data.
Another benefit -- After step #2, you can test the New data to see if it looks OK.
I have a simple MyISAM table resembling the following (trimmed for readability -- in reality, there are more columns, all of which are constant width and some of which are nullable):
CREATE TABLE IF NOT EXISTS `history` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`time` int(11) NOT NULL,
`event` int(11) NOT NULL,
`source` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `event` (`event`),
KEY `time` (`time`)
);
Presently the table contains only about 6,000,000 rows (of which currently about 160,000 match the query below), but this is expected to increase. Given a particular event ID and grouped by source, I want to know how many events with that ID were logged during a particular interval of time. The answer to the query might be something along the lines of "Today, event X happened 120 times for source A, 105 times for source B, and 900 times for source C."
The query I concocted does perform this task, but it performs monstrously badly, taking well over a minute to execute when the timespan is set to "all time" and in excess of 30 seconds for as little as a week back:
SELECT COUNT(*) AS count FROM history
WHERE event=2000 AND time >= 0 AND time < 1310563644
GROUP BY source
ORDER BY count DESC
This is not for real-time use, so even if the query takes a second or two that would be fine, but several minutes is not. Explaining the query gives the following, which troubles me for obvious reasons:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE history ref event,time event 4 const 160399 Using where; Using temporary; Using filesort
I've experimented with various multi-column indexes (such as (event, time)), but with no improvement. This seems like such a common use case that I can't imagine there not being a reasonable solution, but all my Googling boils down to versions of the query I already have, with no particular suggestions on how to avoid the temporary table (and even then, no explanation of why performance is so abysmal).
Any suggestions?
You say you have tried multi-column indexes. Have you also tried single-column indexes, one per column?
UPDATE: Also, the COUNT(*) operation over a GROUP BY clause is probably a lot faster, if the grouped column also has an index on it... Of course, this depends on the number of NULL values that are actually in that column, which are not indexed.
For event, MySQL can execute a UNIQUE SCAN, which is quite fast, whereas for time, a RANGE SCAN will be applied, which is not so fast... If you separate indexes, I'd expect better performance than with multi-column ones.
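If you want to try that, the single-column index currently missing on the grouped column would be:
ALTER TABLE `history` ADD INDEX `source` (`source`);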
Also, maybe you could gain something by partitioning your table by some expected values / value ranges:
http://dev.mysql.com/doc/refman/5.5/en/partitioning-overview.html
I suggest you try this multi-column index:
ALTER TABLE `history` ADD INDEX `history_index` (`event` ASC, `time` ASC, `source` ASC);
If that doesn't help, try forcing the index in this query:
SELECT COUNT(*) AS count FROM history USE INDEX (history_index)
WHERE event=2000 AND time >= 0 AND time < 1310563644
GROUP BY source
ORDER BY count DESC
If the sources are known, or you want to find the count for specific sources, then you can try something like this:
select count(source= 'A' or NULL) as A,count(source= 'B' or NULL) as B from history;
You can do the ordering in your application code. Also try indexing event and source together.
This should definitely be faster than the original query.
I'm really struggling to get a query's time down; it currently has to work through 2.5 million rows and takes over 20 seconds.
Here is the query:
SELECT play_date AS date, COUNT(DISTINCT(email)) AS count
FROM log
WHERE play_date BETWEEN '2009-02-23' AND '2020-01-01'
AND type = 'play'
GROUP BY play_date
ORDER BY play_date desc;
Here's the table:
CREATE TABLE `log` (
`id` int(11) NOT NULL auto_increment,
`instance` varchar(255) NOT NULL,
`email` varchar(255) NOT NULL,
`type` enum('play','claim','friend','email') NOT NULL,
`result` enum('win','win-small','lose','none') NOT NULL,
`timestamp` timestamp NOT NULL default CURRENT_TIMESTAMP,
`play_date` date NOT NULL,
`email_refer` varchar(255) NOT NULL,
`remote_addr` varchar(15) NOT NULL,
PRIMARY KEY (`id`),
KEY `email` (`email`),
KEY `result` (`result`),
KEY `timestamp` (`timestamp`),
KEY `email_refer` (`email_refer`),
KEY `type_2` (`type`,`timestamp`),
KEY `type_4` (`type`,`play_date`),
KEY `type_result` (`type`,`play_date`,`result`)
);
Here's the explain:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE log ref type_2,type_4,type_result type_4 1 const 270404 Using where
The query is using the type_4 index.
Does anyone know how I could speed this query up?
Thanks
Tom
That's relatively good already. The performance sink is that the query has to compare 270404 varchars for equality for the COUNT(DISTINCT(email)), meaning that 270404 rows have to be read.
You should be able to make the count faster by creating a covering index. This means that the actual rows do not need to be read, because all the required information is present in the index itself.
To do this, change the index as follows:
KEY `type_4` (`type`,`play_date`, `email`)
I would be surprised if that wouldn't speed things up quite a bit.
(Thanks to MarkR for the proper term.)
Your indexing is probably as good as you can get it. You have a compound index on the 2 columns in your where clause and the explain you posted indicates that it is being used. Unfortunately, there are 270,404 rows that match the criteria in your where clause and they all need to be considered. Also, you're not returning unnecessary rows in your select list.
My advice would be to aggregate the data daily (or hourly or whatever makes sense) and cache the results. That way you can access slightly stale data instantly. Hopefully this is acceptable for your purposes.
Try an index on play_date, type (same as type_4, just reversed fields) and see if that helps
There are 4 possible types, and I assume hundreds of possible dates. If the query uses the (type, play_date) index, it basically (not 100% accurate, but the general idea) does this:
(A) Find all the Play records (about 25% of the file)
(B) Now within that subset, find all of the requested dates
By reversing the index, the approach is:
(A) Find all the dates within range (maybe 1-2% of the file)
(B) Now find all PLAY types within that smaller portion of the file
Hope this helps
Extracting email to a separate table should be a good performance boost, since counting distinct varchar fields takes a while. Other than that, the correct index is used and the query itself is as optimized as it can be (except for the email, of course).
The COUNT(DISTINCT(email)) part is the bit that's killing you. If you only truly need the first 2000 results of 270,404, perhaps it would help to do the email count only for the results instead of for the whole set.
SELECT date, COUNT(DISTINCT(email)) AS count
FROM log,
(
SELECT id, play_date AS date
FROM log
WHERE play_date BETWEEN '2009-02-23' AND '2020-01-01'
AND type = 'play'
ORDER BY play_date desc
LIMIT 2000
) AS shortlist
WHERE shortlist.id = log.id
GROUP BY date
Try creating an index only on play_date.
Long term, I would recommend building a summary table with a primary key of play_date and count of distinct emails.
Depending on how up to date you need it to be - either allow it to be updated daily (by play_date) or live via a trigger on the log table.
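A minimal sketch of such a summary table (the name and the refresh strategy are up to you):
CREATE TABLE log_daily_summary (
play_date date NOT NULL,
distinct_emails int unsigned NOT NULL,
PRIMARY KEY (play_date)
);
-- refreshed daily, or maintained by a trigger on log
REPLACE INTO log_daily_summary
SELECT play_date, COUNT(DISTINCT email)
FROM log
WHERE type = 'play'
GROUP BY play_date;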
There is a good chance a table scan will be quicker than random access to over 200,000 rows:
SELECT ... FROM log IGNORE INDEX (type_2,type_4,type_result) ...
Also, for large grouped queries you may see better performance by forcing a file sort rather than a hashtable-based group (since if this turns out to need more than tmp_table_size or max_heap_table_size performance collapses):
SELECT SQL_BIG_RESULT ...
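Putting both hints into the original query would look like this (a sketch; whether either helps depends on your data distribution):
SELECT SQL_BIG_RESULT play_date AS date, COUNT(DISTINCT email) AS count
FROM log IGNORE INDEX (type_2,type_4,type_result)
WHERE play_date BETWEEN '2009-02-23' AND '2020-01-01'
AND type = 'play'
GROUP BY play_date
ORDER BY play_date DESC;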