ORDER BY uses all rows - mysql

I want to show the 10 latest forum topics, and I do it by ordering by the date ASC. I put an index on date; however, the query still reads all rows (I use EXPLAIN to see that).
What is the problem, or can't you tell without seeing my table?
Thank you.

Depending on the type of index, ordering by date may need a full scan. I don't think you can do much about that with MySQL.
Nevertheless, one solution is to "cut" the search using a WHERE clause, e.g.
WHERE date > NOW() - INTERVAL 10 DAY
The ordering will then not be done on the full scan but on what is left after the WHERE clause.
Weird as it may seem, and depending on your table, you may be able to optimize your query with ... 2 queries, e.g.:
SELECT MAX(id) FROM topics; -- store the result as $max (id stands for your primary key)
SELECT topic FROM topics WHERE id > $max - 10;
These 2 requests will be faster than a full scan if your table has many rows, and they will give the same result if your primary key is auto-increment and has no gaps (deleted rows would make the second query return fewer than 10 topics).
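A minimal sketch of the two-query idea in MySQL syntax. The table layout (a signed AUTO_INCREMENT id, a topic column) and the safety margin of 100 ids are assumptions for illustration, not taken from the question. It also assumes that newer topics always get larger ids. Scanning a slightly wider id window and applying ORDER BY ... LIMIT to it keeps the result at 10 rows even when a few ids are missing:

```sql
-- Hypothetical schema, for illustration only.
CREATE TABLE topics (
  id     INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  topic  VARCHAR(255) NOT NULL,
  `date` DATETIME NOT NULL
);

-- Step 1: find the newest id (answered from the primary key index).
SET @max_id = (SELECT MAX(id) FROM topics);

-- Step 2: read only a small window of recent ids, then sort that window.
-- The margin of 100 is an arbitrary assumption to tolerate gaps left by deleted rows.
SELECT topic
FROM topics
WHERE id > @max_id - 100
ORDER BY id DESC
LIMIT 10;
```

If more than 90 of the last 100 ids have been deleted, this still returns fewer than 10 rows, so the margin would have to be tuned to the table.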
I hope this will help you
Jerome Wagner

What you are describing should work, but it is really easy to break. http://dev.mysql.com/doc/refman/5.0/en/order-by-optimization.html gives a list of reasons that it can break. Without seeing your SQL I don't know which of these applies to you. However, here is an example of how to make this work:
SELECT topic
FROM forum_posts
ORDER BY some_date DESC
LIMIT 10
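To check whether the sort is really served by the index, something like the following may help. The table definition and the index name idx_some_date are assumptions for illustration; forum_posts and some_date are the placeholder names from the example above.

```sql
-- Hypothetical table, for illustration only.
CREATE TABLE forum_posts (
  id        INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  topic     VARCHAR(255) NOT NULL,
  some_date DATETIME NOT NULL
);

-- Assumed index name; the indexed column is the one used in ORDER BY.
CREATE INDEX idx_some_date ON forum_posts (some_date);

-- If the index is used for ordering, the Extra column of the plan
-- should not contain "Using filesort".
EXPLAIN
SELECT topic
FROM forum_posts
ORDER BY some_date DESC
LIMIT 10;
```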
And if you have a more complex query that breaks this, you can join to this query to filter results and save work. For instance:
SELECT f.topic
, o.some_information
, some_date
FROM (
SELECT id, topic, some_date
FROM forum_posts
ORDER BY some_date DESC
LIMIT 10
) f
JOIN other_table o
ON o.forum_id = f.id
ORDER BY some_date DESC

Related

Is there a way to avoid sorting for a query with WHERE and ORDER BY?

I am looking to understand how a query with both WHERE and ORDER BY can be indexed properly. Say I have a query like:
SELECT *
FROM users
WHERE id IN (1, 2, 3, 4)
ORDER BY date_created
LIMIT 3
With an index on date_created, it seems like the execution plan will prefer to use the PRIMARY key and then sort the results itself. This seems to be very slow when it needs to sort a large amount of results.
I was reading through this guide on indexing for ordered queries which mentions an almost identical example and it mentions:
If the database uses a sort operation even though you expected a pipelined execution, it can have two reasons: (1) the execution plan with the explicit sort operation has a better cost value; (2) the index order in the scanned index range does not correspond to the order by clause.
This makes sense to me but I am unsure of a solution. Is there a way to index my particular query and avoid an explicit sort or should I rethink how I am approaching my query?
The Optimizer is caught between a rock and a hard place.
Plan A: Use an index starting with id; collect however many rows that is; sort them; then deliver only 3. The downside: If the list is large and the ids are scattered, it could take a long time to find all the candidates.
Plan B: Use an index starting with date_created, filtering on id until it gets 3 items. The downside: What if it has to scan all the rows before it finds 3?
If you know that the query will always work better with one query plan than the other, you can use an "index hint". But, when you get it wrong, it will be a slow query.
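As a sketch of what such a hint could look like in MySQL: the index name idx_date_created is an assumption (an index on users(date_created) created under that name), while PRIMARY is the name MySQL gives the primary key index.

```sql
-- Push the optimizer toward Plan B: walk the date_created index in order
-- and filter on id until 3 rows are found.
-- Assumes an index named idx_date_created exists on users(date_created).
SELECT *
FROM users FORCE INDEX (idx_date_created)
WHERE id IN (1, 2, 3, 4)
ORDER BY date_created
LIMIT 3;

-- Push the optimizer toward Plan A: fetch the candidates by primary key,
-- then sort them and keep 3.
SELECT *
FROM users FORCE INDEX (PRIMARY)
WHERE id IN (1, 2, 3, 4)
ORDER BY date_created
LIMIT 3;
```

Running EXPLAIN on both variants with realistic id lists is one way to see which plan suits the data before hard-coding a hint.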
A partial answer... If * contains bulky columns, both approaches may be hauling around stuff that will eventually be tossed. So, let's minimize that:
SELECT u.*
FROM ( SELECT id
FROM users
WHERE id IN (1, 2, 3, 4)
ORDER BY date_created
LIMIT 3 -- not repeated
) AS x
JOIN users AS u USING(id)
ORDER BY date_created; -- repeated
Together with
INDEX(date_created, id),
INDEX(id, date_created)
Hopefully, the Optimizer will pick one of those "covering" indexes to perform the "derived table" (subquery). If so that will be somewhat efficiently performed. Then the JOIN will look up the rest of the columns for the 3 desired rows.
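A possible way to add both indexes and check whether the subquery is covered. The index names are assumptions chosen for illustration:

```sql
-- Index names are hypothetical; the column lists are the two suggested above.
ALTER TABLE users
  ADD INDEX idx_created_id (date_created, id),
  ADD INDEX idx_id_created (id, date_created);

-- Inspect the plan for the derived-table part on its own.
-- "Using index" in the Extra column means the query is answered
-- from the index alone, without reading the full rows.
EXPLAIN
SELECT id
FROM users
WHERE id IN (1, 2, 3, 4)
ORDER BY date_created
LIMIT 3;
```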
If you want to discuss further, please provide
SHOW CREATE TABLE.
How many ids you are likely to have.
Why you are not already JOINing to another table to get the ids.
Approximately how many rows in the table.
Your best bet might be to write this in a more complicated way:
SELECT u.*
FROM ((SELECT u.*
FROM users u
WHERE id = 1
ORDER BY date_created
LIMIT 3
) UNION ALL
(SELECT u.*
FROM users u
WHERE id = 2
ORDER BY date_created
LIMIT 3
) UNION ALL
(SELECT u.*
FROM users u
WHERE id = 3
ORDER BY date_created
LIMIT 3
) UNION ALL
(SELECT u.*
FROM users u
WHERE id = 4
ORDER BY date_created
LIMIT 3
)
) u
ORDER BY date_created
LIMIT 3;
Each of the subqueries will now use an index on users(id, date_created). The outer query is then sorting at most 12 rows, which should be trivial from a performance perspective.
You could create a composite index on (id, date_created) - that will give the engine the option of using an index for both steps - but the optimiser may still choose not to.
If there aren't many rows in your table or it thinks the resultset will be small it's quicker to sort after the fact than it is to traverse the index tree.
If you really think you know better than the optimiser (which you don't), you can use index hints to tell it what to do, but this is almost always a bad idea.

Mysql query - faster?

So I have this MySQL query, and as I have lots of records it gets very slow. The computers that use the software (cash registers) aren't that powerful either.
Is there a way to get the same result, but faster? Would really appreciate help!
SELECT d.sifra, COUNT(d.sifra) AS pogosti, c.*, s.Stevilka as Stev_sk FROM Cenik c, dnevna d, Podskupina s
WHERE d.sifra = c.Sifra AND d.datum >= DATE(DATE_SUB(NOW(),INTERVAL 3 DAY))
GROUP BY d.sifra ORDER BY pogosti DESC limit 27
Have you tried indexing?
You are using c.Sifra in the WHERE, so you probably want
CREATE INDEX Cenik_Sifra ON Cenik(Sifra);
Also, you use datum and sifra from dnevna, and datum is what you filter on, so
CREATE INDEX dnevna_ndx ON dnevna(datum, sifra);
Finally there's no JOIN condition on Podskupina, whence you draw Stevilka. Is this a constant table? As it is, you're just counting rows in Podskupina and/or getting an unspecified value out of it, unless it only has the one row.
On some versions of MySQL you might also find benefit in pre-calculating the datum:
SELECT @datum := DATE(DATE_SUB(NOW(), INTERVAL 3 DAY));
and then use @datum in your query. This might improve its chances of a good indexed performance.
Without knowing more about the structure and cardinality of the involved tables, though, there's little that can be done.
At the very least you should post the result of
EXPLAIN SELECT...(your select)
in the question.
You don't have a condition to join Podskupina s, so you get a cross join (all rows to all rows): the x rows from the join on "d.sifra = c.Sifra" are multiplied by the y rows of Podskupina s.
This looks like a very problematic query. Do you really need to return all of c.* ? And where's the join or filter on Podskupina? Once you tighten the query, make sure you've created good indexes on the tables. For example, presuming you've already got a clustered index on a unique ID as a primary key in dnevna, performance would typically benefit by putting a secondary index on the sifra and datum columns.
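One way the advice in this thread could be combined is sketched below. It is untested against the real schema and rests on assumptions: every dnevna.sifra has a matching row in Cenik, Cenik.Sifra is unique, and Podskupina is left out entirely because the question does not say how it relates to the other tables (so Stev_sk is not returned here). The index is the one suggested above.

```sql
-- Index suggested earlier in the thread, so the date filter and the grouping
-- column can both be read from the index.
CREATE INDEX dnevna_ndx ON dnevna (datum, sifra);

-- Count and rank inside a derived table that touches only dnevna,
-- then look up the Cenik columns for the 27 winners only.
SELECT t.sifra, t.pogosti, c.*
FROM (
  SELECT d.sifra, COUNT(*) AS pogosti
  FROM dnevna d
  WHERE d.datum >= DATE(DATE_SUB(NOW(), INTERVAL 3 DAY))
  GROUP BY d.sifra
  ORDER BY pogosti DESC
  LIMIT 27
) AS t
JOIN Cenik c ON c.Sifra = t.sifra
-- A join to Podskupina would go here once its join column is known.
ORDER BY t.pogosti DESC;
```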

Is this the best approach to this complex MySQL multi table query?

I'm building a complex multi-table MySQL query, and even though it works, I'm wondering could I make it more simple.
The idea behind it is this, using the Events table that logs all site interaction, select the ID, Title, and Slug of the 10 most popular blog posts, and order by the most hits descending.
SELECT content.id, content.title, content.slug, COUNT(events.id) AS hits
FROM content, events
WHERE events.created >= DATE_SUB(NOW(), INTERVAL 1 MONTH)
AND events.page_url REGEXP '^/posts/[0-9]'
AND content.id = events.content_id
GROUP BY content.id
ORDER BY hits DESC
LIMIT 10
Blog post URLs have the following format:
/posts/2013-05-16-hello-world
As I mentioned it seems to work, but I'm sure I could be doing this cleaner.
Thanks,
The condition on created and the condition on page_url are both range conditions. You can get index-assistance for only one range condition per table in a SQL query, so you have to pick one or the other to index.
I would create an index on the events table over two columns (content_id, created).
ALTER TABLE events ADD KEY (content_id, created);
I'm assuming that restricting by created date is more selective than restricting by page_url, because I assume "/posts/" is going to match a large majority of the events.
After narrowing down the matching rows by created date, the page-url condition will have to be handled by the SQL layer, but hopefully that won't be too inefficient.
There is no performance difference between SQL-89 ("comma-style") join syntax and SQL-92 JOIN syntax. I do recommend SQL-92 syntax because it's more clear and it supports outer joins, but performance is not a reason to use it. The SQL query optimizer supports both join styles.
Temporary tables and filesorts are often costly for performance. This query is bound to create a temporary table and use a filesort, because you're using GROUP BY and ORDER BY against different columns. You can only hope that the temp table will be small enough to fit within your tmp_table_size limit (or increase that value). But that won't help if content.title or content.slug are BLOB/TEXT columns; in that case the temp table will be forced to spool to disk anyway.
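A rough sketch of how the temporary-table behaviour could be inspected. The 64 MB figure is an arbitrary assumption, and the exact in-memory limits depend on the MySQL version and the temporary table engine in use:

```sql
-- "Using temporary; Using filesort" in the Extra column confirms the plan described above.
EXPLAIN
SELECT content.id, content.title, content.slug, COUNT(events.id) AS hits
FROM content
JOIN events ON content.id = events.content_id
WHERE events.created >= DATE_SUB(NOW(), INTERVAL 1 MONTH)
  AND events.page_url REGEXP '^/posts/[0-9]'
GROUP BY content.id
ORDER BY hits DESC
LIMIT 10;

-- Current limits for in-memory temporary tables.
SHOW VARIABLES LIKE 'tmp_table_size';
SHOW VARIABLES LIKE 'max_heap_table_size';

-- Counters for temporary tables created in memory and on disk;
-- compare them before and after running the query.
SHOW SESSION STATUS LIKE 'Created_tmp%tables';

-- Raise the limits for this session only (64 MB is an illustrative value).
SET SESSION tmp_table_size = 64 * 1024 * 1024;
SET SESSION max_heap_table_size = 64 * 1024 * 1024;
```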
Instead of a regular expression, you can use the LEFT function:
SELECT content.id, content.title, content.slug, COUNT(events.id) AS hits
FROM content JOIN events ON content.id = events.content_id
WHERE events.created >= DATE_SUB(NOW(), INTERVAL 1 MONTH)
AND LEFT(events.page_url, 7) = '/posts/'
GROUP BY content.id
ORDER BY hits DESC
LIMIT 10
But that's just off the top of my head, and without a fiddle, untested. The JOIN suggestion, made in the comment, is also good and has been reflected in my answer.
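A related variant, offered as an untested sketch: a LIKE pattern without a leading wildcard expresses the same prefix test, and unlike a function call wrapped around the column it can use an index on page_url if one exists. Like the LEFT version, it drops the regex's requirement that a digit follows /posts/. Whether the optimizer actually picks such an index depends on the other conditions and indexes.

```sql
SELECT content.id, content.title, content.slug, COUNT(events.id) AS hits
FROM content
JOIN events ON content.id = events.content_id
WHERE events.created >= DATE_SUB(NOW(), INTERVAL 1 MONTH)
  AND events.page_url LIKE '/posts/%'   -- prefix match, no function on the column
GROUP BY content.id
ORDER BY hits DESC
LIMIT 10;
```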

More Efficient Way To Write MySQL Query?

My site has suddenly started spitting out the following error:
"Incorrect key file for table '/tmp/#sql_645a_1.MYI'; try to repair it"
If I remove it, the site works fine.
My server tech support guys suggest I clean up the query and make it more efficient.
Here's the query:
SELECT *, FROM_UNIXTIME(post_time, '%Y-%c-%d %H:%i') as posttime
FROM communityposts, communitytopics, communityusers
WHERE communityposts.poster_id=communityusers.user_id
AND communityposts.topic_id=communitytopics.topic_id
ORDER BY post_time DESC LIMIT 5
Any help is greatly appreciated. Perhaps can be done with a JOIN?
Many thanks,
Scott
UPDATE: Here's the working query, I still feel it could be optimised though.
SELECT
communityposts.post_id, communityposts.topic_id, communityposts.post_time,
communityusers.user_id, communitytopics.topic_title, communityusers.username,
communityusers.user_avatar,
FROM_UNIXTIME(post_time, '%Y-%c-%d %H:%i') as posttime
FROM
communityposts,
communitytopics,
communityusers
WHERE
communityposts.poster_id=communityusers.user_id
AND communityposts.topic_id=communitytopics.topic_id
ORDER BY post_time DESC LIMIT 5
SELECT
A.*,B.*,C.*,FROM_UNIXTIME(post_time, '%Y-%c-%d %H:%i') as posttime
FROM
(
SELECT id,poster_id,topic_id
FROM communityposts
ORDER BY post_time DESC
LIMIT 5
) cpk
INNER JOIN communityposts A USING (id)
INNER JOIN communityusers B ON cpk.poster_id=B.user_id
INNER JOIN communitytopics C USING (topic_id);
If a community post does not have to have a user and a topic, then use LEFT JOINs for the last two joins.
You will need to create a supporting index for the cpk subquery:
ALTER TABLE communityposts ADD INDEX (post_time,id,poster_id,topic_id);
This query has to be the fastest because the cpk subquery only gets five keys ALL THE TIME.
UPDATE 2011-10-10 16:28 EDT
This query eliminates the ambiguous topic_id issue:
SELECT
A.post_id, cpk.topic_id, A.post_time,
B.user_id, C.topic_title, B.username,
B.user_avatar,
FROM_UNIXTIME(post_time, '%Y-%c-%d %H:%i') as posttime
FROM
(
SELECT id,poster_id,topic_id
FROM communityposts
ORDER BY post_time DESC
LIMIT 5
) cpk
INNER JOIN communityposts A USING (id)
INNER JOIN communityusers B ON cpk.poster_id=B.user_id
INNER JOIN communitytopics C ON cpk.topic_id=C.topic_id;
The temp table used for sorting the data probably gets too big. I have seen this happen when /tmp/ runs out of space. The LIMIT clause does not make it any quicker or easier, as the sorting of the full data set has to be done first.
Under some conditions, MySQL does not use a temp table to sort data. You can read about it here: http://dev.mysql.com/doc/refman/5.0/en/order-by-optimization.html
If you manage to meet the right conditions (mostly by using the correct indexes), it will also speed up your query.
If this doesn't help (in some cases you can't escape the heavy sorting), try to find out how much free space there is on /tmp/, and see if it can be expanded. Also, as sehe mentioned, selecting only the needed columns (instead of *) can make the temp table smaller and is considered best practice anyway (and so is using explicit JOINs instead of implicit ones).
You could reduce the number of fields selected.
The * operator will select all fields from all (3) tables. This may get big. That said,
I think mysql is smart enough to lay this plan out so that it doesn't need to access the data pages except for the 5 rows being selected.
Are you sure that all the involved (foreign) keys are indexed?
Here's my stab:
SELECT posts.*, FROM_UNIXTIME(post_time, '%Y-%c-%d %H:%i') as posttime
FROM communityposts posts
INNER JOIN communitytopics topics ON posts.topic_id = topics.topic_id
INNER JOIN communityusers users ON posts.poster_id = users.user_id
ORDER BY post_time DESC LIMIT 5
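To follow up on the question of whether the keys are indexed, a sketch along these lines may be useful. The index names are assumptions, and the commented-out statements only apply if topic_id and user_id are not already primary keys of their tables:

```sql
-- List the indexes that already exist on the three tables.
SHOW INDEX FROM communityposts;
SHOW INDEX FROM communitytopics;
SHOW INDEX FROM communityusers;

-- An index on the sort column gives MySQL the option of reading the
-- five newest posts in index order instead of sorting the whole table.
ALTER TABLE communityposts ADD INDEX idx_post_time (post_time);

-- Only needed if these columns are not already indexed:
-- ALTER TABLE communitytopics ADD INDEX idx_topic_id (topic_id);
-- ALTER TABLE communityusers  ADD INDEX idx_user_id (user_id);

-- "Using temporary" or "Using filesort" in the Extra column means
-- the sort is still being done the expensive way.
EXPLAIN
SELECT posts.*, FROM_UNIXTIME(post_time, '%Y-%c-%d %H:%i') AS posttime
FROM communityposts posts
INNER JOIN communitytopics topics ON posts.topic_id = topics.topic_id
INNER JOIN communityusers users ON posts.poster_id = users.user_id
ORDER BY post_time DESC
LIMIT 5;
```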

What SQL indexes should I add for this bloated query?

I'm wondering if indexes will speed this query up. It takes 9 seconds last time I checked. The traffic table has about 300k rows, listings and users 5k rows. I'm open to ridicule/humiliation too, if this is just a crappy query altogether. I wrote it long ago.
It's supposed to get the listings with the most page views (traffic). Let me know if the explanation is lacking.
SELECT traffic_listingid AS listing_id,
COUNT(traffic_listingid) AS genuine_hits,
COUNT(DISTINCT traffic_ipaddress) AS distinct_ips,
users.username,
listings.listing_address,
listings.datetime_created,
DATEDIFF(NOW(), listings.datetime_created) AS listing_age_days
FROM traffic
LEFT JOIN listings
ON traffic.traffic_listingid = listings.listing_id
LEFT JOIN users
ON users.id = listings.seller_id
WHERE traffic_genuine = 1
AND listing_id IS NOT NULL
AND username IS NOT NULL
AND DATEDIFF(NOW(), traffic_timestamp) < 24
GROUP BY traffic_listingid
ORDER BY distinct_ips DESC
LIMIT 10
P.S.
ENGINE=MyISAM /
MySQL Server 4.3
Sidenotes:
1. You have
LEFT JOIN listings
ON traffic.traffic_listingid = listings.listing_id
...
WHERE ...
AND listing_id IS NOT NULL
This condition cancels the LEFT JOIN. Change your query into:
INNER JOIN listings
ON traffic.traffic_listingid = listings.listing_id
and remove the listing_id IS NOT NULL from the WHERE conditions.
The same thing applies to LEFT JOIN users and username IS NOT NULL.
2. The check on traffic_timestamp:
DATEDIFF(NOW(), traffic_timestamp) < 24
makes it difficult for the index to be used. Change it into something like this that can use an index
(and check that my version is equivalent, I may have mistakes):
traffic_timestamp >= CURRENT_DATE() - INTERVAL 23 DAY
3. COUNT(non-nullable-column) is equivalent to COUNT(*). Change:
COUNT(traffic_listingid) AS genuine_hits,
to:
COUNT(*) AS genuine_hits,
because it's a bit faster in MySQL (although I'm not sure about that for version 4.3)
For the index question, you should have at least an index on every column that is used for joining. Adding one more for the traffic_timestamp will probably help, too.
If you tell us in which tables the traffic_ipaddress and traffic_timestamp are, and what the EXPLAIN EXTENDED shows, someone may have a better idea.
Reading the query again, it seems that it's actually a GROUP BY only on table traffic, and the other 2 tables are used to get reference data. So, the query is equivalent to a (traffic group by)-join-listings-join-users. Not sure if that helps in your old MySQL version, but it may be good to have both versions of the query and test whether one runs faster on your system.
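The (traffic group by)-join-listings-join-users form could look roughly like this. It is an untested sketch that assumes all traffic_* columns live in the traffic table, that listings.listing_id and users.id are unique, and that the server supports derived tables (MySQL 4.1 or later). It also applies the three side notes above.

```sql
SELECT t.listing_id,
       t.genuine_hits,
       t.distinct_ips,
       u.username,
       l.listing_address,
       l.datetime_created,
       DATEDIFF(NOW(), l.datetime_created) AS listing_age_days
FROM (
  -- Aggregate the traffic table on its own, before any join.
  SELECT traffic_listingid AS listing_id,
         COUNT(*) AS genuine_hits,
         COUNT(DISTINCT traffic_ipaddress) AS distinct_ips
  FROM traffic
  WHERE traffic_genuine = 1
    AND traffic_timestamp >= CURRENT_DATE() - INTERVAL 23 DAY
  GROUP BY traffic_listingid
) AS t
-- Inner joins replace LEFT JOIN + IS NOT NULL on the joined key.
INNER JOIN listings l ON l.listing_id = t.listing_id
INNER JOIN users u ON u.id = l.seller_id
WHERE u.username IS NOT NULL
ORDER BY t.distinct_ips DESC
LIMIT 10;
```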
Indexes should always be put on columns you use in the where clause.
In this case the listingid looks like a good option, as well as the users.id, seller_id and traffic_timestamp.
Put EXPLAIN EXTENDED in front of your query to see how MySQL plans to execute it (it shows how many rows are examined and which indexes are used).
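As an illustration of those suggestions, the statements below sketch one possible set of indexes. The index names are made up, the traffic_* columns are assumed to be in the traffic table, and users.id and listings.listing_id are assumed to be primary keys already. Plain EXPLAIN is used here because EXPLAIN EXTENDED is not available in recent MySQL versions.

```sql
-- Hypothetical index names; columns are the ones named in this thread.
ALTER TABLE traffic
  ADD INDEX idx_traffic_listing (traffic_listingid),
  ADD INDEX idx_traffic_genuine_ts (traffic_genuine, traffic_timestamp);

-- seller_id is the join column on the listings side.
ALTER TABLE listings
  ADD INDEX idx_listings_seller (seller_id);

-- Check which index the filter on traffic actually uses
-- (see the key and rows columns of the output).
EXPLAIN
SELECT traffic_listingid, COUNT(*) AS genuine_hits
FROM traffic
WHERE traffic_genuine = 1
  AND traffic_timestamp >= CURRENT_DATE() - INTERVAL 23 DAY
GROUP BY traffic_listingid;
```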