I got this query:
SELECT user_id
FROM basic_info
WHERE age BETWEEN 18 AND 22 AND gender = 0
ORDER BY rating
LIMIT 50
The table looks like this (it contains about 700k rows):
CREATE TABLE IF NOT EXISTS `basic_info` (
`user_id` mediumint(8) unsigned NOT NULL auto_increment,
`gender` tinyint(1) unsigned NOT NULL default '0',
`age` tinyint(2) unsigned NOT NULL default '0',
`rating` smallint(5) unsigned NOT NULL default '0',
PRIMARY KEY (`user_id`),
KEY `tmp` (`gender`,`rating`)
) ENGINE=MyISAM;
The query itself is optimized, but it has to walk about 200k rows to do its job.
Here's the explain output:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE basic_info ref tmp,age tmp 1 const 200451 Using where
Is it possible to optimize the query so it won't walk over 200k rows?
Thanks!
There are two useful indexes that can help this query (both sketched below):
KEY gender_age (gender, age) -- this index can satisfy both the gender=0 condition and age BETWEEN 18 AND 22. However, because you have a range condition on age, adding the rating column to the index will not give sorted results: MySQL will select all matching rows, ignoring your LIMIT clause, and do an additional filesort regardless.
KEY gender_rating (gender, rating) -- the index you already have; it can satisfy the gender=0 condition and retrieves rows already sorted by rating. However, the database has to scan all entries with gender=0 and eliminate those outside the range age BETWEEN 18 AND 22.
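For reference, the two candidate indexes as DDL (the second is equivalent to your existing tmp key):
ALTER TABLE basic_info ADD KEY gender_age (gender, age);
ALTER TABLE basic_info ADD KEY gender_rating (gender, rating);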
Changing schema
If the above does not help enough, changing your schema is always possible. One such approach is turning the age BETWEEN condition into an equality condition by defining an age-group column; for instance, ages 0-12 go in age group 1, ages 13-18 in age group 2, etc.
This way, with an index on (gender, agegroup, rating), a query with WHERE gender=0 AND agegroup=3 ORDER BY rating will retrieve the results straight from the index, already sorted. In this case, the LIMIT clause fetches only 50 entries from the table and no more.
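A minimal sketch of that schema change, with an assumed agegroup column and illustrative group boundaries:
-- hypothetical column and group numbering; adjust boundaries as needed
ALTER TABLE basic_info
  ADD COLUMN agegroup TINYINT UNSIGNED NOT NULL DEFAULT 0,
  ADD KEY gender_agegroup_rating (gender, agegroup, rating);

-- e.g. age group 3 covers ages 18-22
UPDATE basic_info SET agegroup = 3 WHERE age BETWEEN 18 AND 22;

SELECT user_id
FROM basic_info
WHERE gender = 0 AND agegroup = 3
ORDER BY rating
LIMIT 50;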
Extend your tmp key to include the age column:
KEY `tmp` (`age`,`gender`,`rating`)
You could also attempt to use InnoDB to improve performance (benchmarking here).
I encountered a very puzzling optimization case. I'm no SQL expert, but this case seems to defy my understanding of clustered-key principles.
I have the below table schema:
CREATE TABLE `orders` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`chargeQuote` tinyint(1) NOT NULL,
`features` int(11) NOT NULL,
`sequenceIndex` int(11) NOT NULL,
`createdAt` bigint(20) NOT NULL,
`previousSeqId` bigint(20) NOT NULL,
`refOrderId` bigint(20) NOT NULL,
`refSeqId` bigint(20) NOT NULL,
`seqId` bigint(20) NOT NULL,
`updatedAt` bigint(20) NOT NULL,
`userId` bigint(20) NOT NULL,
`version` bigint(20) NOT NULL,
`amount` decimal(36,18) NOT NULL,
`fee` decimal(36,18) NOT NULL,
`filledAmount` decimal(36,18) NOT NULL,
`makerFeeRate` decimal(36,18) NOT NULL,
`price` decimal(36,18) NOT NULL,
`takerFeeRate` decimal(36,18) NOT NULL,
`triggerOn` decimal(36,18) NOT NULL,
`source` varchar(32) NOT NULL,
`status` varchar(50) NOT NULL,
`symbol` varchar(32) NOT NULL,
`type` varchar(50) NOT NULL,
PRIMARY KEY (`id`),
KEY `IDX_STATUS` (`status`) USING BTREE,
KEY `IDX_USERID_SYMBOL_STATUS_TYPE` (`userId`,`symbol`,`status`,`type`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=7937243 DEFAULT CHARSET=utf8mb4;
This is a big table: 100 million rows. It's already sharded by createdAt, so 100 million rows is one month's worth of orders.
I have the slow query below. The query is pretty straightforward:
select id,chargeQuote,features,sequenceIndex,createdAt,previousSeqId,refOrderId,refSeqId,seqId,updatedAt,userId,version,amount,fee,filledAmount,makerFeeRate,price,takerFeeRate,triggerOn,source,`status`,symbol,type
from orders where 1=1
and userId=100000
and createdAt >= '1567775174000' and createdAt <= '1567947974000'
and symbol in ( 'BTC_USDT' )
and status in ( 'FULLY_FILLED' , 'PARTIAL_CANCELLED' , 'FULLY_CANCELLED' )
and type in ( 'BUY_LIMIT' , 'BUY_MARKET' , 'SELL_LIMIT' , 'SELL_MARKET' )
order by id desc limit 0,20;
This query takes 24 seconds. The number of rows that satisfy userId=100000 is very small, around 100, and the number of rows that satisfy the entire WHERE clause is 0.
But when I did a small tweak, that is, I changed the order by clause:
order by id desc limit 0,20; -- before
order by createdAt desc, id desc limit 0,20; -- after
It became very fast, 0.03 seconds.
I can see it made a big difference in the MySQL engine: explain shows that before the change it was using key: PRIMARY, and after the change it finally uses key: IDX_USERID_SYMBOL_STATUS_TYPE, as expected; I guess that's why it's so fast. Here are the explain plans (first row: before, second row: after):
select_type table partitions type possible_keys key key_len ref rows filtered Extra
SIMPLE orders index IDX_STATUS,IDX_USERID_SYMBOL_STATUS_TYPE PRIMARY 8 20360 0.02 Using where
SIMPLE orders range IDX_STATUS,IDX_USERID_SYMBOL_STATUS_TYPE IDX_USERID_SYMBOL_STATUS_TYPE 542 26220 11.11 Using index condition; Using where; Using filesort
So what gives? Actually, I was very surprised that the result was not naturally sorted by id (which is the PRIMARY KEY). Isn't that the clustered key in MySQL? And why did it choose not to use the index when sorting by id?
I'm very puzzled that the more demanding query (sorting by two columns) is super fast while the more lenient one is slow.
And no, I tried ANALYZE TABLE orders; and nothing happened.
MySQL has two alternative query plans for queries with ORDER BY ... LIMIT n:
Read all qualifying rows, sort them, and pick the n top rows.
Read the rows in sorted order and stop when n qualifying rows have been found.
In order to decide which is the better option, the optimizer needs to estimate the filtering effect of your WHERE condition. This is not straightforward, especially for columns that are not indexed, or for columns whose values are correlated. In your case, the MySQL optimizer evidently thinks the second strategy is best. In other words, it does not see that no rows will satisfy the WHERE clause; it thinks that 2% of the rows will satisfy it, and that it will be able to find 20 rows by scanning only part of the table backwards in PRIMARY key order.
How the filtering effect of a WHERE clause is estimated varies quite a bit between 5.6, 5.7, and 8.0. If you are using MySQL 8.0, you can try to create histograms for the columns involved to see if that improves the estimation. If not, I think your only option is to use a FORCE INDEX hint to make the optimizer choose the desired index. Both options are sketched below.
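A hedged sketch of both options (the bucket count, the histogram column, and the trimmed SELECT list are assumptions):
-- MySQL 8.0 only: give the optimizer a histogram on the unindexed createdAt
ANALYZE TABLE orders UPDATE HISTOGRAM ON createdAt WITH 100 BUCKETS;

-- Or force the composite index explicitly
SELECT id, createdAt, userId, symbol, status, type
FROM orders FORCE INDEX (IDX_USERID_SYMBOL_STATUS_TYPE)
WHERE userId = 100000
  AND createdAt >= '1567775174000' AND createdAt <= '1567947974000'
  AND symbol IN ('BTC_USDT')
  AND status IN ('FULLY_FILLED', 'PARTIAL_CANCELLED', 'FULLY_CANCELLED')
  AND type IN ('BUY_LIMIT', 'BUY_MARKET', 'SELL_LIMIT', 'SELL_MARKET')
ORDER BY id DESC
LIMIT 0, 20;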
For your fast query, the second strategy is not an option since there is no index on createdAt that can be used to avoid sorting.
Update:
Reading Rick's answer, I realized that an index on only userId should speed up your ORDER BY id query. In such an index, the entries for a given userId will be sorted on the primary key. Hence, using this index makes it possible both to access only the rows of the requested userId and to read them in the requested sort order (by id).
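A minimal sketch of that idea (the index name is an assumption; InnoDB implicitly appends the primary key to every secondary index, which is what keeps entries within one userId ordered by id):
ALTER TABLE orders ADD KEY idx_userid (userId);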
The main filters work well with the cardinality estimator. When ORDER BY is combined with LIMIT, that is effectively another filter, as the data needs to be filtered further. This can push the cardinality estimator toward inaccurate estimates, which eventually results in a poor plan being selected. To prove this, run the 24-second query without the LIMIT clause; it should respond in about 0.03 seconds, just like your tweak.
In order to solve this: if you get very good performance with just the main filters, run that part first, then apply the ORDER BY ... LIMIT in a second step, where the result set is significantly smaller than the whole table. Use something like:
SELECT * FROM (SELECT ... /* main select statement */) AS t
ORDER BY x LIMIT y;
...or...
INSERT INTO temp SELECT ... /* main select statement */;
SELECT * FROM temp ORDER BY x LIMIT y;
Given
and userId=100000
and createdAt >= '1567775174000' and createdAt <= '1567947974000'
and ... -- I am not making use of the other items
order by createdAt DESC, id desc -- I am assuming this change
limit 0,20;
I would try
INDEX(userId, createdAt, id) -- in this order
1. userId is tested by =, so it comes first, narrowing down the part of the index to look at.
2. Leave out the columns tested by IN; if an IN list has multiple values, we can't make use of step 4.
3. createdAt filters further, by range.
4. createdAt and id are compared in the same direction (DESC). (Yes, I know 8.0 has an improvement here, but I don't think you wanted (ASC, DESC).)
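As DDL, that would be something like (the index name is illustrative; note that InnoDB appends the primary key to secondary indexes anyway, so listing id explicitly mainly documents the intended ordering):
ALTER TABLE orders ADD KEY idx_user_created_id (userId, createdAt, id);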
Is there any way to make the following query use an index and not use filesort:
SELECT c1 FROM table WHERE c2 IN (val_1, val_2, ..., val_n) ORDER BY c3
I guess the chances are bad, so if it is not possible: is there any way to make the following problem use indexes (or otherwise be fast)?
The table contains comments from users:
CREATE TABLE `comments` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`user_id` int(10) unsigned NOT NULL,
`comment` varchar(180) CHARACTER SET utf8 NOT NULL,
`timestamp` int(11) unsigned NOT NULL)
ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
I want to output the comments of specific users (for example the ones user_x is following), ordered by timestamp (compare the query above).
The only way I can imagine making this query fast is to add a new column that is set to 1 for, say, the last 15 entries of each user (sketched below). The first query would then get at most 15 rows per user, so the maximum number of rows MySQL has to order is 15*n, where n is the number of users the comments are selected from.
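A hedged sketch of that flag-column idea (the column and index names are assumptions; the application would have to maintain the flag as comments are added):
ALTER TABLE comments
  ADD COLUMN is_recent TINYINT(1) UNSIGNED NOT NULL DEFAULT 0,
  ADD KEY idx_user_recent_ts (user_id, is_recent, `timestamp`);

-- only the flagged rows reach the sort
SELECT comment
FROM comments
WHERE user_id IN (1, 2, 3) AND is_recent = 1
ORDER BY `timestamp` DESC;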
Edit: This is what EXPLAIN outputs:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE comments range idx_comments_user_id_timestamp idx_comments_user_id_timestamp 4 NULL 1113 Using where; Using index; Using filesort
I got this table
CREATE TABLE `votes` (
`item_id` int(10) unsigned NOT NULL,
`user_id` int(10) unsigned NOT NULL,
`vote` tinyint(4) NOT NULL DEFAULT '0',
PRIMARY KEY (`item_id`,`user_id`),
KEY `FK_vote_user` (`user_id`),
KEY `vote` (`vote`),
KEY `item` (`item_id`),
CONSTRAINT `FK_vote_item` FOREIGN KEY (`item_id`) REFERENCES `items` (`id`) ON UPDATE CASCADE,
CONSTRAINT `FK_vote_user` FOREIGN KEY (`user_id`) REFERENCES `users` (`id`) ON UPDATE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
And I got this simple select
SELECT
`a`.`item_id`, `a`.`sum`
FROM
(SELECT
`item_id`, SUM(vote) AS `sum`
FROM
`votes`
GROUP BY `item_id`) AS a
ORDER BY `a`.`sum` DESC
LIMIT 10
Right now, with only 250 rows, there isn't a problem, but it's using filesort. The vote column holds either -1, 0, or 1. But will this be performant when this table has millions of rows?
If I make it a simpler query, without the subquery, then "Using temporary" appears in the plan.
Explain gives (the query completes in 0.00170s):
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY <derived2> ALL NULL NULL NULL NULL 33 Using filesort
2 DERIVED votes index NULL PRIMARY 8 NULL 250
No, this won't be efficient with millions of rows.
You'll have to create a supporting aggregate table which would store votes per item:
CREATE TABLE item_votes
(
item_id INT NOT NULL PRIMARY KEY,
votes INT NOT NULL, -- signed: the net score can go negative
upvotes INT UNSIGNED NOT NULL,
downvotes INT UNSIGNED NOT NULL,
KEY (votes),
KEY (upvotes),
KEY (downvotes)
)
and update it each time a vote is cast:
INSERT
INTO item_votes (item_id, votes, upvotes, downvotes)
VALUES (
$item_id,
CASE WHEN $upvote THEN 1 ELSE -1 END,
CASE WHEN $upvote THEN 1 ELSE 0 END,
CASE WHEN $upvote THEN 0 ELSE 1 END
)
ON DUPLICATE KEY
UPDATE
votes = votes + VALUES(upvotes) - VALUES(downvotes),
upvotes = upvotes + VALUES(upvotes),
downvotes = downvotes + VALUES(downvotes)
then select top 10 votes:
SELECT *
FROM item_votes
ORDER BY
votes DESC, item_id DESC
LIMIT 10
efficiently using an index.
But will this be performant when this table has millions of rows?
No, it won't.
If I make it a simpler query, without the subquery, then "Using temporary" appears in the plan.
Probably because the planner would turn it into the query you posted: it needs to calculate the sum to return the results in the correct order.
To quickly grab the top-voted items, you need to cache the result. Add a score field to your items table, maintain it (e.g. using triggers), and index it. You'll then be able to grab the top 10 scores using an index scan; a sketch follows.
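A hedged sketch of that approach (the column, index, and trigger names are assumptions, and only the insert path is shown; updates and deletes would need matching triggers):
ALTER TABLE items
  ADD COLUMN score INT NOT NULL DEFAULT 0,
  ADD KEY idx_score (score);

CREATE TRIGGER votes_after_insert AFTER INSERT ON votes
FOR EACH ROW
  UPDATE items SET score = score + NEW.vote WHERE id = NEW.item_id;

-- the top 10 becomes a simple backward index scan
SELECT id, score FROM items ORDER BY score DESC LIMIT 10;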
First, you don't need the subquery, so you can rewrite your query as:
SELECT `item_id`, SUM(vote) AS `sum`
FROM `votes`
GROUP BY `item_id`
ORDER BY `sum` DESC
LIMIT 10
Second, you can build an index on votes(item_id, vote), as sketched below. The GROUP BY will then be an index scan. This will take time as the table gets bigger, but it should be manageable for reasonable data sizes.
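A one-line sketch of that index (the name is an assumption):
CREATE INDEX idx_item_vote ON votes (item_id, vote);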
Finally, with this structure of a query, you need a filesort for the final ORDER BY. Whether this is efficient depends on the number of items you have. If each item has, on average, only one or two votes, this may take some time. If you have a fixed set of items and there are only a few hundred or thousand, then it should not be a performance bottleneck, even as the data size grows.
If this summary is really something you need quickly, then a trigger with a summary table (as explained in another answer) provides a faster retrieval method.
I have the following query:
SELECT * FROM `alltrackers`
WHERE `deviceid`='FT_99000083401624'
AND `locprovider`!='none'
ORDER BY `id` DESC
This is the show create table:
CREATE TABLE `alltrackers` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`deviceid` varchar(50) NOT NULL,
`gpsdatetime` int(11) NOT NULL,
`locprovider` varchar(30) NOT NULL,
PRIMARY KEY (`id`),
KEY `gpsdatetime` (`gpsdatetime`),
KEY `locprovider` (`locprovider`),
KEY `deviceid` (`deviceid`(18))
) ENGINE=MyISAM AUTO_INCREMENT=8665045 DEFAULT CHARSET=utf8;
I've removed the columns which I thought were unnecessary for this question.
This is the EXPLAIN output for this query:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE alltrackers ref locprovider,deviceid deviceid 56 const 156416 Using where; Using filesort
This particular query shows as taking several seconds in mytop (mtop). I'm a bit confused, though, as the same query with a different deviceid doesn't take as long. Although I only need the last row, I've already removed LIMIT 1, as that made it take even longer. This table currently contains 3 million rows.
It is used for storing the locations from different GPS devices. Each GPS device has a unique device ID. Locations come in and are added to the table. For statistics I'm running the above query to find the time of the last received location from a certain device.
I'm open to advice on ways to further optimize the query or even the tables.
Many thanks in advance.
If you only need the last row, add an index on (deviceid, id, locprovider). It would be even faster with an index on (deviceid, id, locprovider, gpsdatetime):
ALTER TABLE alltrackers
ADD INDEX special_covering_IDX
(deviceid, id, locprovider, gpsdatetime) ;
Then try this out:
SELECT id, locprovider, gpsdatetime
FROM alltrackers
WHERE deviceid = 'FT_99000083401624'
AND locprovider <> 'none'
ORDER BY id DESC
LIMIT 1 ;
Here's a question I should be able to answer myself, but I can't, and I also can't find an answer on Google:
I have a table that contains 5 million rows with this structure:
CREATE TABLE IF NOT EXISTS `files_history2` (
`FILES_ID` int(10) unsigned DEFAULT NULL,
`DATE_FROM` date DEFAULT NULL,
`DATE_TO` date DEFAULT NULL,
`CAMPAIGN_ID` int(10) unsigned DEFAULT NULL,
`CAMPAIGN_STATUS_ID` int(10) unsigned DEFAULT NULL,
`ON_HOLD` decimal(1,0) DEFAULT NULL,
`DIVISION_ID` int(11) DEFAULT NULL,
KEY `DATE_FROM` (`DATE_FROM`),
KEY `FILES_ID` (`FILES_ID`),
KEY `CAMPAIGN_ID` (`CAMPAIGN_ID`),
KEY `CAMP_DATE` (`CAMPAIGN_ID`,`DATE_FROM`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
When I execute
SELECT files_id, min( date_from )
FROM files_history2
WHERE campaign_id IS NOT NULL
GROUP BY files_id
the query sits with status "Sending data" for more than eight hours (at which point I killed the process).
Here the explain:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE files_history2 ALL CAMPAIGN_ID,CAMP_DATE NULL NULL NULL 5073254 Using where; Using temporary; Using filesort
I assume I created the necessary keys, but then the query shouldn't take that long, should it?
I would suggest a different index: one on (FILES_ID, DATE_FROM, CAMPAIGN_ID), sketched after this explanation.
Since your GROUP BY is on FILES_ID, you want THOSE grouped first. Then MIN(DATE_FROM), so that goes in second position. Then FINALLY CAMPAIGN_ID to qualify the NOT NULL test, and here's why...
If you put all your campaign IDs first, great, you get all the NULLs out of the way. But now you have 1,000 campaigns, and each FILES_ID spans MANY campaigns which in turn span many dates: you are going to choke.
With the index I'm proposing, by FILES_ID first, each files_id is already ordered to match your GROUP BY. Then, within that, all the earliest dates are at the top of the indexed list. Great, almost there: then, by campaign ID, skip over whatever NULLs may be there and you are done, on to the next FILES_ID.
Hope this makes sense -- unless you have TONs of entries with NULL-valued campaigns.
Also, since all 3 parts of the index match the criteria and output columns of your query, it never has to go back to the raw data file; it gets everything directly from the index.
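A sketch of that index as DDL (the name is illustrative):
ALTER TABLE files_history2
  ADD INDEX idx_files_date_campaign (FILES_ID, DATE_FROM, CAMPAIGN_ID);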
I'd create a covering index (CAMPAIGN_ID, files_id, date_from) and check the performance with that. I suspect your issue is due to the grouping and date_from not being able to use the same index.
CREATE INDEX your_index_name ON files_history2 (CAMPAIGN_ID, files_id, date_from);
If this works, you could drop the single-column index CAMPAIGN_ID, as it's included in the composite index.
Well, the query is slow due to the aggregation (the MIN function) combined with the grouping.
One solution is to alter your query by moving the aggregation into a subquery in the FROM clause, which will be a lot faster than the approach you are using.
Try the following:
SELECT f.files_id, f1.datefrom
FROM files_history2 AS f
JOIN (
    SELECT files_id, MIN(date_from) AS datefrom
    FROM files_history2
    WHERE campaign_id IS NOT NULL
    GROUP BY files_id
) AS f1 ON f.files_id = f1.files_id AND f.date_from = f1.datefrom;
This should perform a lot better; if it doesn't, a temporary table would be the only way to go.