Cost of a count statement for leaderboard ranking? - mysql

I'm using MySQL for a game. I have a scores table of approximately 150,000 records. The table looks like:
fk_user_id | high_score
The high_score column is an int. It has an index on it. I want to figure out a user's rank by running the following:
SELECT COUNT(*) AS count FROM scores WHERE high_score >= [x]
So, supplying a user's current high_score to the above, I can get their rank. The idea is that every time a user looks at a profile page, I would run the above query to get the rank.
I'm wondering how expensive this is, and if I should even go down this path. Is mysql scanning the entire table every time the query is issued? Is this a crazy idea?
Update: Here's what EXPLAIN says about the query:
id: 1
select_type: SIMPLE
table: scores
type: range
possible_keys: high_score
key: high_score
key_len: 5
ref: null
rows: 1
extra: Using where; Using index
Thanks

MySQL is scanning the entire table for every record you ask it to return.
Why use COUNT(*)? Can't you use COUNT(DISTINCT user_id) or COUNT(user_id)?
You should already have that column indexed, and I'm sure it would return your results accurately.
SELECT COUNT(DISTINCT user_id) AS count FROM scores WHERE high_score >= [x]

If high_score is indexed, the cost is relatively small; if not, a full table scan is made.
"Relatively small" here means just reading row IDs from the key and counting them - a very small cost.
You can always prefix a query with EXPLAIN to check exactly what the database is doing to fetch your data.
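For example (a sketch - 1000 stands in for a user's actual high_score):
EXPLAIN SELECT COUNT(*) AS count FROM scores WHERE high_score >= 1000;
In the question's output, "Using index" confirms the count is answered from the high_score index alone, without touching the table rows themselves.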

Related

Simple query optimization (WHERE + ORDER + LIMIT)

I have this query that runs unbelievably slow (4 minutes):
SELECT * FROM `ad` WHERE `ad`.`user_id` = USER_ID ORDER BY `ad`.`id` desc LIMIT 20;
Ad table has approximately 10 million rows.
SELECT COUNT(*) FROM `ad` WHERE `ad`.`user_id` = USER_ID;
Returns 10k rows.
Table has following indexes:
PRIMARY KEY (`id`),
KEY `idx_user_id` (`user_id`,`status`,`sorttime`),
EXPLAIN gives this:
id: 1
select_type: SIMPLE
table: ad
type: index
possible_keys: idx_user_id
key: PRIMARY
key_len: 4
ref: NULL
rows: 4249
Extra: Using where
I am failing to understand why it takes so long. Also, this query is generated by an ORM (pagination), so it would be nice to optimize it from the outside (maybe by adding some extra index).
BTW this query works fast:
select aa.*
from (select id from ad where user_id=USER_ID order by id desc limit 20) as a
join ad as aa on a.id = aa.id ;
Edit: I tried another user with far fewer rows (dozens) than the original one. I am wondering why the original query doesn't use idx_user_id:
EXPLAIN SELECT * FROM `ad` WHERE `ad`.`user_id` = ANOTHER_ID ORDER BY `ad`.`id` desc LIMIT 20;
id: 1
select_type: SIMPLE
table: ad
type: ref
possible_keys: idx_user_id
key: idx_user_id
key_len: 3
ref: const
rows: 84
Extra: Using where; Using filesort
Edit 2: with Alexander's help I decided to try forcing MySQL to use the index I want, and the following query is much faster (1 sec instead of 4 mins):
SELECT *
FROM `ad` USE INDEX (idx_user_id)
WHERE `ad`.`user_id` = 1884774
ORDER BY `ad`.`id` desc LIMIT 20;
In the EXPLAIN output you can see that the key value is PRIMARY. This means that the MySQL optimizer decided it is faster to scan all table records (which are already sorted by id) and look for the first 20 records with the specific user_id value than to use the idx_user_id key, which was considered by the optimizer as a possible key and then rejected.
In your second query the optimizer sees that only id values are necessary in the subquery, and decides to use the idx_user_id index instead, as that index allows it to calculate the list of necessary ids without touching the table itself. Then only 20 records are retrieved by direct lookup on the primary key value, which is a very fast operation for that small number of records.
As your query with ANOTHER_ID shows, MySQL's wrong decision was based on the number of rows for the previous USER_ID value. This number was so big that the optimizer guessed it would find the first 20 records with this specific user_id faster just by scanning the table records themselves and skipping records with wrong user_id values.
If table rows are accessed by index, random access operations are required. For a typical HDD, random access is about 100 times slower than a sequential scan. So in order for an index to be useful, it must reduce the row count to less than about 1% of the total. If the rows for the specific USER_ID value account for more than 1% of the total number of rows, it may be more efficient to do a full table scan instead of using the index, if we want to retrieve all of these rows. But the MySQL optimizer doesn't take into account the fact that only 20 of these rows will be retrieved, so it mistakenly decided not to use the index and did a full table scan instead.
In order to make your query fast for any user_id value, you can add one more index which allows the query to be executed in the fastest way possible:
create index idx_user_id_2 on ad(user_id, id);
This index allows MySQL to do both the filtering and the sorting. For that to work, the columns used for filtering should be placed first, and the columns used for ordering should be placed second. MySQL should be smart enough to use this index, because it allows finding all the necessary records without skipping any.
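With that index in place, the original query should be able to use it without any hint (a sketch - the expected plan follows from the reasoning above, not from a measured run):
EXPLAIN SELECT * FROM `ad` WHERE `ad`.`user_id` = 1884774 ORDER BY `ad`.`id` DESC LIMIT 20;
-- expected: key: idx_user_id_2 and no filesort; MySQL can read the last
-- 20 index entries for this user_id directly.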

Why is this date range query so slow?

I have a database table with 5 million rows, I am running:
select *
from tbl
where datetime_created between '2014-10-01 00:00:00' and '2014-10-31 23:59:59'
It took 54 seconds to return 428k results.
The columns of tbl:
id (int pk auto inc)
actor (varchar)
action (enum)
target (varchar)
is_successful (tinyint)
datetime_created (datetime)
The index:
datetime_created (datetime_created, action, target, is_successful)
Any ideas on how I can improve this?
edit:
EXPLAIN results:
select_type: simple
type: range
possible_keys: datetime_created
key: datetime_created
key_len: 8
ref: null
rows: 359569
extra: using index condition
428k is a lot of rows to work with in one shot. Even though you have an index on the date, the engine still has to scan through the table between the high and low values. I would suggest multiple queries reading the data in smaller chunks, narrowing the result set if possible.
E.g., adding a filter on the action enum together with the date range should yield much faster results. Say there are 5 enum values; then you run 5 queries, one per action value. The more indexed criteria you add, the better the query will perform.
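For instance (a sketch - 'login' is a hypothetical enum value, since the actual action values aren't shown):
SELECT * FROM tbl
WHERE datetime_created BETWEEN '2014-10-01 00:00:00' AND '2014-10-31 23:59:59'
AND action = 'login';
-- repeat for each remaining action value and combine the results in the application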
Also consider, if this is going to be used in an app, that this is a massive recordset to deal with. Do you really need to work with 428k results at a time?

Why does this query cause lock wait timeouts?

Our team just spent the last week debugging and trying to find the source of many MySQL lock timeouts and many extremely long-running queries. In the end, it appears that this query is the culprit.
mysql> explain
SELECT categories.name AS cat_name,
COUNT(distinct items.id) AS category_count
FROM `items`
INNER JOIN `categories` ON `categories`.`id` = `items`.`category_id`
WHERE `items`.`state` IN ('listed', 'reserved')
AND (items.category_id IS NOT NULL)
GROUP BY categories.name
ORDER BY category_count DESC
LIMIT 10\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: items
type: range
possible_keys: index_items_on_category_id,index_items_on_state
key: index_items_on_category_id
key_len: 5
ref: NULL
rows: 119371
Extra: Using where; Using temporary; Using filesort
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: categories
type: eq_ref
possible_keys: PRIMARY
key: PRIMARY
key_len: 4
ref: production_db.items.category_id
rows: 1
Extra:
2 rows in set (0.00 sec)
I can see that it is doing a nasty table scan and creating a temporary table to run.
Why would this query cause database response times to go up by a factor of ten and some queries that usually take 40-50ms (updates on items table), to explode to 50,000 ms and higher at times?
It's hard to tell without more information, like:
Is that running inside a transaction?
If so, what's the isolation level?
How many categories are there?
How many items?
My guess would be that the query is too slow, it's running inside a transaction (which it probably is, since you have this problem), and it is probably issuing range locks on the items table which don't allow writes to proceed, hence slowing the updates until they can get a lock on the table.
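A quick way to check the current isolation level (note: the variable is @@tx_isolation on MySQL versions before 5.7.20):
SELECT @@transaction_isolation;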
And I have a couple of comments based on what I can see from your query and execution plan:
1) Your items.state would probably be better as a catalog (a lookup table), instead of having the string on every row in items; this is for space efficiency, and comparing IDs is much faster than comparing strings (regardless of whatever optimizations the engine may do).
2) I am guessing items.state is a column with low cardinality (few unique values), hence an index on that column is probably hurting you more than helping you. Every index adds overhead when inserting/deleting/updating rows since the indexes have to be maintained, and this particular index is probably not used enough to be worthwhile. Of course, I am just guessing; it depends on the rest of the queries.
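For the first point, a minimal sketch of the catalog idea (all names here are hypothetical):
CREATE TABLE item_states (
  id TINYINT UNSIGNED PRIMARY KEY,
  name VARCHAR(20) NOT NULL UNIQUE
);
-- items.state then becomes a small integer referencing item_states.id,
-- so filters compare integers instead of strings.
Annotating the original query: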
SELECT
  -- Grouping by name means comparing strings.
  categories.name AS cat_name,
  -- No need for DISTINCT; the same items.id cannot belong to different categories.
  COUNT(distinct items.id) AS category_count
FROM `items`
INNER JOIN `categories` ON `categories`.`id` = `items`.`category_id`
WHERE `items`.`state` IN ('listed', 'reserved')
  -- Not needed; the inner join gets rid of items with no category_id.
  AND (items.category_id IS NOT NULL)
GROUP BY categories.name
ORDER BY category_count DESC
LIMIT 10\G
The way this query is structured, it basically has to scan the entire items table (since it's using the category_id index), then filter by the WHERE clause, then join with the categories table, which means an index seek on the primary key (categories.id) per row in the items result set. Then it groups by name (using string comparisons) to count, and finally discards everything but 10 of the results.
I would write the query like:
SELECT categories.name, counts.n
FROM (SELECT category_id, COUNT(id) n
FROM items
WHERE state IN ('listed', 'reserved') AND category_id is not null
GROUP BY category_id ORDER BY COUNT(id) DESC LIMIT 10) counts
JOIN categories on counts.category_id = categories.id
ORDER BY counts.n desc
(I am sorry if the syntax isn't perfect; I am not running MySQL.)
With this query, what the engine will probably do is:
Use the items.state index to get the 'listed' and 'reserved' items, group by category_id (comparing numbers, not strings), keep only the 10 topmost counts, and then join with categories to get the names (using only 10 index seeks).

MySQL query optimization. Avoiding temporary & filesort

Currently I have a table with close to 1 million rows, which I need to query from. What I need to be able to do is stack-rank packages by the number of products they include from a given list of product IDs.
SELECT count(productID) AS commonProducts, packageID
FROM supply
WHERE productID IN (2,3,4,5,6,7,8,9,10)
GROUP BY packageID
ORDER BY commonProducts DESC
LIMIT 10
The query works fine, but I would like to improve upon it. I tried a multi-column index on productID and packageID, but it seemed to examine more rows than having a separate index on each of the columns.
MySQL Explain
select_type: SIMPLE
table: supply
type: range
possible_keys: supplyID
key: supplyID
key_len: 3
ref: null
rows: 996
extra: Using where; Using temporary; Using filesort
My main concern is that the query uses a temporary table and a filesort. How could I go about optimizing this query? I presume the biggest issues are COUNT() and the ORDER BY on the result of COUNT().
You can remove the temp table using a Dependent Subquery:
select * from
(
  SELECT count(s.productID) AS commonProducts, s.packageID
  FROM supply as s
  WHERE EXISTS
  (
    select 1 from supply as innerS
    where innerS.productID in (2,3,4,5,6,7,8,9,10)
    and s.productID = innerS.productID
  )
  GROUP BY s.packageID
) AS t
ORDER BY t.commonProducts DESC
LIMIT 10
The inner query links to the outer query and preserves the index. You'll find that any query that sorts on commonProducts, including the one above, will use a filesort, since count(*) is definitely not indexed. But fear not: filesort is just a fancy word for sort - MySQL can choose an effective in-memory sort - and whether you do it now or as a mergesort on the way to an indexed temporary table, you'll have to pay for that sorting somewhere. This case is pretty good, though, because filesort will stop sorting once it hits the LIMIT you've put in place. It will not sort the entire list of commonProducts.
Update
If this query is going to be run all the time, I would recommend (without getting too fancy) setting triggers on the supply table to keep a separate table of precomputed counters up to date.
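A heavily simplified sketch of that idea (all names are hypothetical, and it assumes the tracked product list is fixed up front):
CREATE TABLE package_counts (
  packageID INT PRIMARY KEY,
  commonProducts INT NOT NULL DEFAULT 0
);

DELIMITER //
CREATE TRIGGER supply_after_insert AFTER INSERT ON supply
FOR EACH ROW
BEGIN
  IF NEW.productID IN (2,3,4,5,6,7,8,9,10) THEN
    INSERT INTO package_counts (packageID, commonProducts)
    VALUES (NEW.packageID, 1)
    ON DUPLICATE KEY UPDATE commonProducts = commonProducts + 1;
  END IF;
END//
DELIMITER ;
A matching AFTER DELETE trigger would decrement the counter, and the top-10 query then becomes a cheap read against package_counts.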
Creating a temporary result set:
SELECT TMP.*
FROM ( SELECT count(productID) AS commonProducts, packageID
FROM supply
WHERE productID IN (2,3,4,5,6,7,8,9,10)
GROUP BY packageID
) AS TMP
ORDER BY commonProducts DESC
LIMIT 10
Perhaps it's not the most elegant way, and I cannot guarantee it will be faster, because everything depends on your particular data. But in some cases this gives much better results:
SELECT count(*) AS commonProducts, packageID
FROM (
SELECT packageID FROM supply WHERE productID = 2
UNION ALL
SELECT packageID FROM supply WHERE productID = 3
UNION ALL
.
.
.
SELECT packageID FROM supply WHERE productID = 10
) AS t
GROUP BY packageID
ORDER BY commonProducts DESC
LIMIT 10

MySQL: Not using index for ORDER BY?

I've been trying and googling everything and still can't figure out what's going on.
I have a big table (100M+ rows). Among others, it has 3 columns: user_id, date, type.
It has an index idx(user_id, type, date).
When I EXPLAIN this query:
SELECT *
FROM table
WHERE user_id = 12345
AND type = 'X'
ORDER BY date DESC
LIMIT 5
EXPLAIN shows that MySQL examined 110K rows, which is roughly how many rows this user_id has.
My question is:
Why isn't the same index used for the ORDER BY ... LIMIT 5? It knows which rows belong to the user_id, and date is part of the same index, so why not just take the last 5 rows in that index?
P.S. I tried an index on (user_id, date, type) - same results; I tried removing DESC - same results.
This is the EXPLAIN plan:
id: 1
select_type: SIMPLE
table: s
type: ref
possible_keys: dateIdx,userTypeDateIdx
key: userTypeDateIdx
key_len: 5
ref: const,const
rows: 110118
Extra: Using where
I also tried adding a FORCE INDEX FOR ORDER BY hint, but I still get rows: 110118.
Did you run ANALYZE TABLE after creating the index?
MySQL will not use the index until the table is analyzed. The best index to use is the one you created with (user_id, type, date).
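A sketch, using the table name from the question:
ANALYZE TABLE `table`;
This refreshes the key-distribution statistics the optimizer uses to pick a plan.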
The date in the index is in ascending order, and you are asking for the most recent five rows in descending order by date; it can't use the index for that. If you changed the index to (user_id, type, date DESC), it would be able to use the index to get the most recent five rows.
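A sketch of that suggested index (the name is made up; note that descending index columns only take effect on MySQL 8.0+ - earlier versions parse DESC but ignore it):
CREATE INDEX userTypeDateDescIdx ON `table` (user_id, type, date DESC);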