mysql order by count performance - mysql

I'm finding the following a little perplexing... if I perform the below queries, when sorting by the indexed value 'keyword' it takes 0.0008 seconds, but when sorting by 'count' it takes over 3 seconds.
The following takes approx 0.0008 seconds:
SELECT keyword, COUNT(DISTINCT pmid) as count
FROM keywords
WHERE (collection_id = 13262022107433)
GROUP BY keyword
order by keyword desc limit 1;
This takes over 3 seconds:
SELECT keyword, COUNT(DISTINCT pmid) as count
FROM keywords
WHERE (collection_id = 13262022107433)
GROUP BY keyword
order by count desc limit 1;
Is there a way of speeding up a sort on a result set when sorting by count? Should it really take that much longer? Are there any alternatives? The engine is InnoDB.
Many thanks for your input!

You may want to add an additional index to assist in the counting phase.
ALTER TABLE keywords ADD INDEX ckp_index (collection_id,keyword,pmid);
If you already have a compound index with collection_id and keyword only, the Query Optimizer will still include a lookup for the pmid field from the table.
With this new index, the query can be answered from the index alone (a covering index), so no table rows need to be read.
This will speed the count(distinct pmid) portion of the query.
Give it a Try !!!
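One way to check whether the new index really covers the query is to run EXPLAIN on the slow statement. This is only a sketch: the alias pmid_count is a rename of the original count alias, and the exact EXPLAIN output depends on your MySQL version and data.

```sql
EXPLAIN
SELECT keyword, COUNT(DISTINCT pmid) AS pmid_count
FROM keywords
WHERE collection_id = 13262022107433
GROUP BY keyword
ORDER BY pmid_count DESC
LIMIT 1;
-- If ckp_index is chosen, the "key" column should show ckp_index and
-- "Extra" should include "Using index" (no table rows are read).
-- "Using temporary; Using filesort" may still appear, because the counts
-- for every keyword have to be computed before they can be sorted.
```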

Not unexpected, not avoidable. When this query is ordered by keyword, MySQL can just look at what keyword comes last, pick out the rows with that keyword, and count them. When you order by count, though, it has to count the rows for every keyword to figure out which one is highest. That's a lot more work!

Related

Unexpected result by MySQL Optimizer for identical queries?

The following query works as expected and uses an index.
The query takes 0.0481 sec
SELECT
geodb_locations.name,
geodb_locations.name_url,
COUNT(user.uid) AS useranzahl
FROM
user
LEFT JOIN
geodb_locations ON geodb_locations.id=user.plz
WHERE
user.freigeben=1 AND
geodb_locations.adm0='AT'
GROUP BY user.plz
ORDER BY useranzahl DESC
LIMIT 25
[EXPLAIN output for the AT query, not preserved]
If only the country code in the query is changed from AT to DE,
the query takes about 2.5 sec and does not use the index
SELECT
geodb_locations.name,
geodb_locations.name_url,
COUNT(user.uid) AS useranzahl
FROM
user
LEFT JOIN
geodb_locations ON geodb_locations.id=user.plz
WHERE
user.freigeben=1 AND
geodb_locations.adm0='DE'
GROUP BY user.plz
ORDER BY useranzahl DESC
LIMIT 25
[EXPLAIN output for the DE query, not preserved]
Why is the index not used by the optimizer for the second query, and how can the query be improved?
2.5 sec is too long.
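If user.uid cannot be NULL, use COUNT(*) instead of COUNT(user.uid).
As already pointed out, remove LEFT. The WHERE condition on geodb_locations.adm0 discards the unmatched rows anyway, so the LEFT JOIN behaves like an inner join.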
Add these indexes:
user: (freigeben, plz)
geodb_locations: (adm0, name_url, name)
As for why the EXPLAIN changed, ... It is quite normal (but somewhat rare) for the distribution of the constants to determine what order the tables are touched (Austria is less common than Germany?) or which index to use.
Regardless of optimizations, this query will have to scan a lot more rows for DE than for AT; this has to happen before the sort (ORDER BY) and LIMIT.
Two things prevent much optimization:
The WHERE references both tables.
The ORDER BY depends on a computed value.
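Put together, the suggestions above might look like the following. This is a sketch, not a tested fix: the index names are made up for illustration, the aliases u and g are added for brevity, and COUNT(*) assumes that user.uid is never NULL.

```sql
-- Index names are hypothetical; the column lists are the ones suggested above.
ALTER TABLE user ADD INDEX idx_freigeben_plz (freigeben, plz);
ALTER TABLE geodb_locations ADD INDEX idx_adm0_url_name (adm0, name_url, name);

-- Same query with an inner join and COUNT(*).
SELECT
  g.name,
  g.name_url,
  COUNT(*) AS useranzahl
FROM user AS u
JOIN geodb_locations AS g ON g.id = u.plz
WHERE u.freigeben = 1
  AND g.adm0 = 'DE'
GROUP BY u.plz
ORDER BY useranzahl DESC
LIMIT 25;
```

Running EXPLAIN on both the AT and the DE variant afterwards shows whether the optimizer picks the new indexes in each case.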

Order by / limit execution in SQL

There are lots of threads on this already on the web; I'm just trying to understand some nuances that had me confused!
Quoting the doc reference
If you combine LIMIT row_count with ORDER BY, MySQL stops sorting as
soon as it has found the first row_count rows of the sorted result,
rather than sorting the entire result. If ordering is done by using an
index, this is very fast.
and a SO thread
It will order first, then get the first 20. A database will also
process anything in the WHERE clause before ORDER BY.
Taking the same query from the question :
SELECT article
FROM table1
ORDER BY publish_date
LIMIT 20
Let's say the table has 2000 rows, of which the query is expected to return 20. Looking at the MySQL reference, "...stops sorting as soon as it has found the first row_count rows..." confuses me, as I find it a little ambiguous!
Why does it say it stops sorting? Isn't the LIMIT clause applied to already sorted data returned by the ORDER BY clause (assuming it's a non-indexed column), or is my understanding wrong and SQL limits first and then sorts?
The optimization mentioned in the documentation generally only works if there's an index on the publish_date column. The values are stored in the index in order, so the engine simply iterates through the index of the column, fetching the associated rows, until it has fetched 20 rows.
If the column isn't indexed, the engine will generally need to fetch all the rows, sort them, and then return the first 20 of these.
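The difference between the two cases can usually be seen in the execution plan. A small sketch, assuming the table1 from the question and a hypothetical index name:

```sql
-- Before adding an index: EXPLAIN typically reports "Using filesort" in Extra.
EXPLAIN SELECT article FROM table1 ORDER BY publish_date LIMIT 20;

-- idx_publish_date is a made-up name.
CREATE INDEX idx_publish_date ON table1 (publish_date);

-- After adding the index: the plan can walk idx_publish_date in order
-- and stop after 20 rows, so "Using filesort" should no longer appear.
EXPLAIN SELECT article FROM table1 ORDER BY publish_date LIMIT 20;
```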
It's also useful to understand how this interacts with WHERE conditions. Suppose the query is:
SELECT article
FROM table1
WHERE last_read_date > '2018-11-01'
ORDER BY publish_date
LIMIT 20
If publish_date is indexed and last_read_date is not, it will scan the publish_date index in order, test the associated last_read_date against the condition, and add article to the result set if the test succeeds. When there are 20 rows in the result set it will stop and return it.
If last_read_date is indexed and publish_date is not, it will use the last_read_date index to find the subset of all the rows that meet the condition. It will then sort these rows using the publish_date column, and return the first 20 rows from that.
If neither column is indexed it will do a full table scan to test last_read_date, sort all the rows that match the condition, and return the first 20 rows of this.
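To compare the first two strategies on your own data, you could force each index in turn and look at the plans. This is only an illustration: the index names are hypothetical, and FORCE INDEX is used here to steer the optimizer for the comparison, not as a recommendation for production queries.

```sql
-- Hypothetical single-column indexes.
CREATE INDEX idx_t1_publish_date ON table1 (publish_date);
CREATE INDEX idx_t1_last_read_date ON table1 (last_read_date);

-- Strategy 1: walk the publish_date index in order, filter each row,
-- stop after 20 matches. No sort, but many rows may be examined
-- if few of them satisfy the WHERE condition.
EXPLAIN
SELECT article
FROM table1 FORCE INDEX (idx_t1_publish_date)
WHERE last_read_date > '2018-11-01'
ORDER BY publish_date
LIMIT 20;

-- Strategy 2: use the last_read_date index to find the matching rows,
-- then sort them by publish_date (expect "Using filesort").
EXPLAIN
SELECT article
FROM table1 FORCE INDEX (idx_t1_last_read_date)
WHERE last_read_date > '2018-11-01'
ORDER BY publish_date
LIMIT 20;
```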
MySQL stops sorting as soon as it has found the first row_count rows of the sorted result, rather than sorting the entire result
This is actually a very sensible optimisation within mysql. If you use limit to return 20 rows and mysql knows it already found them, then why would mysql (or you) care how exactly the rest of the records are sorted? It does not matter, therefore mysql stops sorting the rest of the rows.
If the order by is done on an indexed column, then mysql can tell pretty quickly, if it found the top n records.

mysql query with where and order by take long time

I have a MySQL database with 7 million records.
When I run a query like
select * from data where cat_id=12 order by id desc limit 0,30
the query takes a long time, about 0.4603 sec,
but the same query without (where cat_id=12) or without (order by id desc) is very fast:
it takes about 0.0002 sec.
I have indexes on cat_id and id.
Is there any way to make the query with both (where and order by) fast?
Thanks
Create a composite index that combines cat_id and id. See http://dev.mysql.com/doc/refman/5.0/en/multiple-column-indexes.html for syntax and examples.
If you state 'cat_id=12' only, you will get all matching rows, which is fast because of the index. But these rows won't be ordered, so MySQL has to read them all into a temporary table and sort that table, which is slow.
Similarly, 'order by id desc' will order the rows quickly, but mysql has to read all of them to find out which have 'cat_id=12', which is slow.
A composite index should solve these issues.
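As a sketch, the composite index could be created like this (the index name is made up):

```sql
-- cat_id first (equality condition), then id (the ORDER BY column).
ALTER TABLE data ADD INDEX idx_cat_id_id (cat_id, id);

-- Within cat_id = 12 the index entries are already ordered by id,
-- so the plan can read them backwards and stop after 30 rows.
EXPLAIN
SELECT * FROM data
WHERE cat_id = 12
ORDER BY id DESC
LIMIT 0, 30;
```

If the index is used as intended, "Using filesort" should not show up in the Extra column.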
It is running fast without order by since when you write order by DESC then it first iterates through all the rows and then it selects in descending order. Removing the condition makes it by default ASCENDING which makes it fast.
Also it may be that your index is sorted ascending so when you ask for descending it needs to do a lot more work to bring it back in that order

Does the ORDER BY optimization take effect in the following SELECT statement?

I have a SELECT statement which I would like to optimize. The mysql - order by optimization says that in some cases the index cannot be used to optimize the ORDER BY. Specifically the point:
You use ORDER BY on nonconsecutive parts of a key
SELECT * FROM t1 WHERE key2=constant ORDER BY key_part2;
makes me think that this could be the case. I'm using the following indexes:
UNIQUE KEY `met_value_index1` (`RTU_NB`,`DATETIME`,`MP_NB`),
KEY `met_value_index` (`DATETIME`,`RTU_NB`)
With following SQL-statement:
SELECT * FROM met_value
WHERE rtu_nb=constant
AND mp_nb=constant
AND datetime BETWEEN constant AND constant
ORDER BY mp_nb, datetime
Would it be enough to delete the index met_value_index1 and recreate it with the new ordering RTU_NB, MP_NB, DATETIME?
Do I have to include RTU_NB into the ORDER BY clause?
Outcome: I tried what @meriton suggested and added the index met_value_index2. The SELECT completed in 1.2 seconds; previously it took 5.06 seconds. As a side note unrelated to the question: after some further tries I switched the engine from MyISAM to InnoDB (with rtu_nb, mp_nb, datetime as the primary key) and the statement completed in 0.13 seconds!
I don't get your query. If a row must match mp_nb = constant to be returned, all rows returned will have the same mp_nb, so including mp_nb in the order by clause has no effect. I recommend you use the semantically equivalent statement:
SELECT * FROM met_value
WHERE rtu_nb=constant
AND mp_nb=constant
AND datetime BETWEEN constant AND constant
ORDER BY datetime
to avoid needlessly confusing the query optimizer.
Now, to your question: A database can implement an order by clause without sorting if it knows that the underlying access will return the rows in proper order. In the case of indexes, that means that an index can assist with sorting if the rows matched by the where clause appear in the index in the order requested by the order by clause.
That is the case here, so the database could actually do an index range scan over met_value_index1 for the rows where rtu_nb=constant AND datetime BETWEEN constant AND constant, and then check whether mp_nb=constant for each of these rows, but that would amount to checking far more rows than necessary if mp_nb=constant has high selectivity. Put differently, an index is most useful if the matching rows are contiguous in the index, because that means the index range scan will only touch rows that actually need to be returned.
The following index will therefore be more helpful for this query:
UNIQUE KEY `met_value_index2` (`RTU_NB`,`MP_NB`, `DATETIME`),
as all matching rows will be right next to each other in the index and the rows appear in the index in the order the order by clause requests. I can not say whether the query optimizer is smart enough to get that, so you should check the execution plan.
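A sketch of how that could be added and checked. The constants below are placeholders invented for the example, since the question only says "constant"; adjust them to your column types.

```sql
ALTER TABLE met_value
  ADD UNIQUE KEY met_value_index2 (`RTU_NB`, `MP_NB`, `DATETIME`);

-- Placeholder values stand in for the "constant"s in the question.
EXPLAIN
SELECT * FROM met_value
WHERE rtu_nb = 1
  AND mp_nb = 2
  AND `datetime` BETWEEN '2010-01-01 00:00:00' AND '2010-01-31 23:59:59'
ORDER BY `datetime`;
-- What to look for: key = met_value_index2, and no "Using filesort" in Extra.
```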
I do not think it will use any index for the ORDER BY. But you should look at the execution plan.
The order of the fields as they appear in the WHERE clause must match the order in the index. So with your current query you need one index with the fields in order of rtu_nb, mp_nb, datetime.

Optimizing query instead of using order by

I want to run a simple query to get the "n" oldest records in the table. (It has a creation_date column).
How can I get that without using ORDER BY? It is a very big table, and sorting the entire table just to get "n" records does not seem reasonable.
(Assume n << size of table)
When you are concerned about performance, you should probably not discard the use of order by too early.
Queries like that can be implemented as a Top-N query supported by an appropriate index. That runs very fast because it doesn't need to sort the entire table, or even the selected rows: the data is already sorted in the index.
example:
select *
from table
where A = ?
order by creation_date
limit 10;
Without an appropriate index it will be slow if you have lots of data. However, if you create an index like this:
create index test on table (A, creation_date );
The query will be able to start fetching the rows in the correct order, without sorting, and stop when the limit is reached.
Recipe: put the where columns in the index, followed by the order by columns.
If there is no where clause, just put the order by into the index. The order by must match the index definition, especially if there are mixed asc/desc orders.
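To illustrate the point about mixed asc/desc orders, here is a small sketch. The events table and the index names are hypothetical, and the last part assumes MySQL 8.0 or later, where descending key parts in an index definition are actually honored.

```sql
-- Hypothetical table, for illustration only.
CREATE TABLE events (
  id INT PRIMARY KEY,
  a INT NOT NULL,
  creation_date DATETIME NOT NULL
);

CREATE INDEX idx_a_created ON events (a, creation_date);

-- Matches the index order: rows can be read from the index without sorting.
SELECT * FROM events ORDER BY a, creation_date LIMIT 10;

-- Fully reversed order: the same index can be read backwards.
SELECT * FROM events ORDER BY a DESC, creation_date DESC LIMIT 10;

-- Mixed directions do not match idx_a_created, so a sort is needed
-- unless an index with the same mixed directions exists.
CREATE INDEX idx_a_created_mixed ON events (a ASC, creation_date DESC);
SELECT * FROM events ORDER BY a ASC, creation_date DESC LIMIT 10;
```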
The indexed Top-N query is the performance king--make sure to use them.
A few links for further reading (all mine):
How to use index efficienty in mysql query
http://blog.fatalmind.com/2010/07/30/analytic-top-n-queries/ (Oracle centric)
http://Use-The-Index-Luke.com/ (not yet covering Top-N queries, but that's to come in 2011).
I haven't tested this, but try creating an index on the creation_date column. The index keeps the values in ascending order. Your select query can then use order by creation_date desc with limit 20 to get the 20 newest records (or asc for the 20 oldest). The database engine should recognize that the index is already sorted and won't need to sort again. All it needs to do is read 20 entries from one end of the index.
Worth a try.
Create an index on creation_date and query by using order by creation_date asc|desc limit n and the response will be very fast (in fact it cannot be faster). For the "latest n" scenario you need to use desc.
If you want more constraints on this query (e.g where state='LIVE') then the query may become very slow and you'll need to reconsider the indexing strategy.
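A minimal sketch of both situations, assuming a hypothetical table named items with the creation_date and state columns mentioned above:

```sql
-- No WHERE clause: an index on the ORDER BY column is enough.
CREATE INDEX idx_creation_date ON items (creation_date);

SELECT * FROM items ORDER BY creation_date ASC LIMIT 10;   -- n oldest
SELECT * FROM items ORDER BY creation_date DESC LIMIT 10;  -- n latest

-- With an equality filter, put the filter column first and the
-- ORDER BY column second, so the matching rows are already in order.
CREATE INDEX idx_state_creation_date ON items (state, creation_date);

SELECT * FROM items
WHERE state = 'LIVE'
ORDER BY creation_date ASC
LIMIT 10;
```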
You can use GROUP BY if you're grouping some data, and then a HAVING clause to select specific records.