How can LIMIT 1 possibly make a query slower - mysql

Given this table in MySQL 5.6:
create table PlayerSession
(
id bigint auto_increment primary key,
lastActivity datetime not null,
player_id bigint null,
...
constraint FK4410E05525A98981
foreign key (player_id) references Player (id)
)
How can it possibly be that this query returns about 2000 rows instantly:
SELECT * FROM PlayerSession
WHERE player_id = ....
ORDER BY lastActivity DESC
but adding LIMIT 1 makes it take 4 seconds, even though all that should do is pick the first result?
Using EXPLAIN I found the only difference to be that without the limit, filesort is used. From what I gather, this should make it slower, not faster. The whole table contains about 2M rows.
Also, adding LIMIT 3 or anything higher than that, gives the same performance as no limit.
And yes, I have since created an index on (player_id, lastActivity), which, surprise surprise, makes it fast again. While that takes the immediate stress out of the situation (the server was rather overloaded), it doesn't really explain the mystery.

What specific version of 5.6? Please provide EXPLAIN FORMAT=JSON SELECT .... Please provide SHOW CREATE TABLE; we need to see the other indexes, plus datatypes.
INDEX(player_id, lastActivity) lets the query avoid "filesort".
A possible reason for the strange timings could be caching. Run each query twice to avoid that hiccup.
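For reference, a minimal sketch of that index and of re-checking the plan (the index name is made up and the player id is a placeholder). With a composite index on (player_id, lastActivity), MySQL can read one player's rows already sorted by lastActivity, so no filesort is needed even with LIMIT 1:
ALTER TABLE PlayerSession
    ADD INDEX idx_player_activity (player_id, lastActivity);  -- hypothetical name

EXPLAIN
SELECT * FROM PlayerSession
WHERE player_id = 123              -- placeholder value
ORDER BY lastActivity DESC
LIMIT 1;
Extra should no longer show "Using filesort", and the new index should appear under "key".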

Related

MySQL query run time is better even though its execution plan is bad

I am trying to optimize this MySQL query, but with little experience reading execution plans I am having a hard time making sense of them.
My question is: can you please help me understand why the execution plan of the new query looks worse than that of the original query, even though the new query performs better in prod?
SQL needed to reproduce this case is here
I have also kept the relevant table definitions at the end (table bill_range references bill via the foreign key bill_id).
The original query takes 10 seconds to complete in PROD:
select *
from bill_range
where (4050 between low and high )
order by bill_id limit 1;
while the new query (where I force/suggest an index) takes 5 seconds to complete in PROD:
select *
from bill_range
use index ( bill_range_low_high_index)
where (4050 between low and high )
order by bill_id limit 1;
But the execution plan suggests the original query is better (this is the part where my understanding seems to be wrong):
Original query
New query
Column "type" for original query suggest index while new query
says ALL
Column "Key" is bill_id (perhaps index on FK) for
original queryand Null for new query
Column "rows" for original query is 1 while for new query says 9
So given all this information wouldn't it imply that new query is actually worse than original query .
And if that is true why is new query performing better? Or am I reading the execution plan wrong.
Table definitions
CREATE TABLE bill_range (
id int(11) NOT NULL AUTO_INCREMENT,
low varchar(255) NOT NULL,
high varchar(255) NOT NULL,
PRIMARY KEY (id),
bill_id int(11) NOT NULL,
FOREIGN KEY (bill_id) REFERENCES bill(id)
);
CREATE TABLE bill (
id int(11) NOT NULL AUTO_INCREMENT,
label varchar(10),
PRIMARY KEY (id)
);
create index bill_range_low_high_index on bill_range( low, high);
NOTE: The reason I am providing the definitions of both tables is that the original query decided to use the index based on the foreign key to the bill table.
Your index isn't quite optimal for your query. Let me explain if I may.
MySQL indexes use BTREE data structures. Those work well in indexed-sequential access mode (hence the MyISAM name of MySQL's first storage engine). It favors queries that jump to a particular place in an index and then run through the index element by element. The typical example is this, with an index on col.
SELECT whatever FROM tbl WHERE col >= constant AND col <= constant2
That is a rewrite of WHERE col BETWEEN constant AND constant2.
Let's recast your query so this pattern is obvious, and so the columns you want are explicit.
select id, low, high, bill_id
from bill_range
where low <= 4050
and high >= 4050
order by bill_id limit 1;
An index on the high column allows a range scan starting with the first eligible row with high >= 4050. Then, we can go on to make it a compound index, including the bill_id and low columns.
CREATE INDEX high_billid_low ON bill_range (high, bill_id, low);
Because we want the lowest matching bill_id, we put that into the index next, then finally the low value. So the query planner random-accesses the index to the first eligible row by high, then scans until it finds the first index entry that meets the low criterion. And then it's done: that's the desired result. It's already ordered by bill_id, so it can stop; the ORDER BY comes from the index. The query can be satisfied entirely from the index -- it is a so-called covering index.
As to why your two queries performed differently: In the first, the query planner decided to scan your data in bill_id order looking for the first matching low/high pair. Possibly it decided that actually sorting a result set would likely be more expensive than scanning bill_ids in order. It looks to me like your second query did a table scan. Why that was faster, who knows?
Notice that this index would also work for you (older MySQL versions parse DESC in index definitions but ignore it; it is only honored from 8.0 onward):
CREATE INDEX low_billid_high ON bill_range (low DESC, bill_id, high);
In InnoDB the table's PK id is implicitly part of every index, so there's no need to mention it in the compound index.
And, you can still write it the way you first wrote it; the query planner will figure out what you want.
Pro tip: Avoid SELECT * ... the * makes it harder to reason about the columns you need to retrieve.

mariadb (mysql) sub partition error (total sub partition count exceeds 64)

(screenshot of the partitioning CREATE TABLE statement omitted)
Hello,
I want to configure partitions (monthly) with subpartitions (day by day), as in the statement shown above.
If the total number of subpartitions exceeds 64, the table is not created and I get this error:
'(errno: 168 "Unknown (generic) error from engine")'
(Creating fewer than 64 succeeds.)
I know that the maximum number of partitions (including subpartitions) is 8,192, so is there anything I missed?
Below is the log table.
create table detection_log
(
id bigint auto_increment,
detected_time datetime default '1970-01-01' not null,
malware_title varchar(255) null,
malware_category varchar(30) null,
user_name varchar(30) null,
department_path varchar(255) null,
PRIMARY KEY (detected_time, id),
INDEX `detection_log_id_uindex` (id),
INDEX `detection_log_malware_title_index` (malware_title),
INDEX `detection_log_malware_category_index` (malware_category),
INDEX `detection_log_user_name_index` (user_name),
INDEX `detection_log_department_path_index` (department_path)
);
SUBPARTITIONs provide no benefit that I know of.
HASH partitioning either provides no benefit or hurts performance.
So... Explain what you hoped to gain by partitioning; then we can discuss whether any type of partitioning is worth doing. Also, provide the likely SELECTs so we can discuss the optimal INDEXes. If you need a "two-dimensional" index, that might indicate a need for partitioning (but still not subpartitioning).
More
I see PRIMARY KEY(detected_time,id). This provides a very fast way to do
SELECT ...
WHERE detected_time BETWEEN ... AND ...
ORDER BY detected_time, id
In fact, it will probably be faster than if you also partition the table. (As a general rule it is useless to partition on the first part of the PK.)
If you need to do
SELECT ...
WHERE user_name = 'param'
AND detected_time BETWEEN ... AND ...
ORDER BY detected_time, id
Then this is optimal:
INDEX(user_name, detected_time, id)
Again, probably faster than any form of partitioning on any column(s).
And
A "point query" (WHERE key = 123) takes a few milliseconds more in a 1-billion-row table compared to a 1000-row table. Rarely is the difference important. The depth of the BTree (perhaps 5 levels vs 2 levels) is the main difference. If you PARTITION the table, you are removing perhaps 1 or 2 levels of the BTree, but replacing them with code to "prune" down to the desired partition. I claim that this tradeoff does not provide a performance benefit.
A "range query" is very nearly the same speed regardless of the table size. This is because the structure is actually a B+Tree, so it is very efficient to fetch the 'next' row.
Hence, the main goal in optimizing queries on a huge table is to take advantage of the characteristics of the B+Tree.
Pagination
SELECT log.detected_time, log.user_name, log.department_path,
log.malware_category, log.malware_title
FROM detection_log as log
JOIN
(
SELECT id
FROM detection_log
WHERE user_name = 'param'
ORDER BY detected_time DESC
LIMIT 25 OFFSET 1000
) as temp ON temp.id = log.id;
The good part: Finding ids, then fetching the data.
The slow part: Using OFFSET.
Have this composite index: INDEX(user_name, detected_time, id) in that order. Make another index for when you use department_path.
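As DDL, those two suggestions might look like this (the index names are illustrative only):
ALTER TABLE detection_log
    ADD INDEX idx_user_time (user_name, detected_time, id),
    ADD INDEX idx_dept_time (department_path, detected_time, id);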
Instead of OFFSET, "remember where you left off". A blog specifically about that: http://mysql.rjweb.org/doc.php/pagination
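A rough sketch of the "remember where you left off" idea, assuming the application passes back the detected_time and id of the last row of the previous page (:last_time and :last_id are placeholders, not anything from the question):
SELECT detected_time, user_name, department_path,
       malware_category, malware_title
FROM detection_log
WHERE user_name = 'param'
  AND (detected_time < :last_time
       OR (detected_time = :last_time AND id < :last_id))
ORDER BY detected_time DESC, id DESC
LIMIT 25;
With INDEX(user_name, detected_time, id) this resumes from the saved position instead of counting past OFFSET rows.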
Purging
Deleting after a year is an excellent use of PARTITIONing. Use PARTITION BY RANGE(TO_DAYS(detected_time)) and have either ~55 weekly or 15 monthly partitions. See HTTP://mysql.rjweb.org/doc.php/partitionmaint for details. DROP PARTITION is immensely faster than DELETE. (This partitioning will not speed up SELECT.)
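A hedged sketch of the monthly variant (partition names and boundary dates are illustrative; the link above covers scripted maintenance). It works here because the partitioning column detected_time is part of the PRIMARY KEY:
ALTER TABLE detection_log
PARTITION BY RANGE (TO_DAYS(detected_time)) (
    PARTITION p2023_01 VALUES LESS THAN (TO_DAYS('2023-02-01')),
    PARTITION p2023_02 VALUES LESS THAN (TO_DAYS('2023-03-01')),
    -- ... one partition per month ...
    PARTITION pfuture  VALUES LESS THAN MAXVALUE
);
-- Purging the oldest month is then a near-instant metadata operation:
ALTER TABLE detection_log DROP PARTITION p2023_01;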

MySQL: Slow SELECT because of Index / FKEY?

Dear StackOverflow Members
It's my first post, so please be nice :-)
I have some strange SQL behavior which I can't explain, and I can't find any resources that explain it.
I have built a web honeypot which records all accesses and attacks and displays them on a statistics page.
However, since the data increased, the generation of the statistics page has been getting slower and slower.
I narrowed it down to some SELECT statements which take quite a long time.
The "issue" seems to be an index on a specific column.
*For sure the real issue is my lack of knowledge :-)
Database: mysql
DB schema
Event table (removed unrelated columns):
Event table size: 30MB
Event table records: 335k
CREATE TABLE `event` (
`EventID` int(11) NOT NULL,
`EventTime` datetime NOT NULL DEFAULT current_timestamp(),
`WEBURL` varchar(50) COLLATE utf8_bin DEFAULT NULL,
`IP` varchar(15) COLLATE utf8_bin NOT NULL,
`AttackID` int(11) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
ALTER TABLE `event`
ADD PRIMARY KEY (`EventID`),
ADD KEY `AttackID` (`AttackID`);
ALTER TABLE `event`
ADD CONSTRAINT `event_ibfk_1` FOREIGN KEY (`AttackID`) REFERENCES `attack` (`AttackID`);
Attack Table
attack table size: 32KB
attack Table records: 11
CREATE TABLE attack (
`AttackID` int(4) NOT NULL,
`AttackName` varchar(30) COLLATE utf8_bin NOT NULL,
`AttackDescription` varchar(70) COLLATE utf8_bin NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
ALTER TABLE `attack`
ADD PRIMARY KEY (`AttackID`);
SLOW Query:
SELECT Count(EventID), IP
FROM event
WHERE AttackID > 0
GROUP BY IP
ORDER BY Count(EventID) DESC
LIMIT 5;
RESULT: 5 rows in set (1.220 sec)
(This seems quite long for me, for a simple query)
(screenshot "QuerySlow" omitted)
Now the strange thing:
If I remove the foreign key relationship, the performance of the query is the same.
But if I remove the index on event.AttackID, the same SELECT statement is much faster:
(ALTER TABLE `event` DROP INDEX `AttackID`;)
The result of the SQL SELECT query:
5 rows in set (0.242 sec)
(screenshot "QueryFast" omitted)
From my understanding, indexes on columns used in WHERE clauses should improve performance.
Why does removing the index have such an impact on the query?
What can I do to keep the relations between the tables and still get faster SELECT execution?
Cheers
Why does removing the index improve performance?
The query optimizer has multiple ways to resolve a query. For instance, two methods for filtering data are:
Look up the rows that match the where clause in the index and then fetch related data from the data pages.
Scan the index.
This doesn't get into the use of indexes for joins or aggregations or alternative algorithms.
Which is better? Under some circumstances, the first method is horribly slower than the second. This occurs when the data for the table does not fit into memory. Under such circumstances, the index can read a record from page 124 and then from 1068 and then from 124 again and -- well, all sorts of random intertwined reading of pages. Reading data pages in order is usually faster. And when the data doesn't fit into memory, thrashing occurs, which means that a page in memory is aged (overwritten) -- and then needed again.
I'm not saying that is occurring in your case. I am simply saying that what optimizers do is not always obvious. The optimizer has to make judgements based on the nature of the data -- and those judgements are not right 100% of the time. They are usually correct. But there are borderline cases. Sometimes, the issue is out-of-date statistics. Sometimes the issue is that what looks best to the optimizer is not best in practice.
Let me emphasize that optimizers usually do a very good job, and a better job than a person would do. Even if they occasionally come up with suboptimal plans, they are still quite useful.
Get rid of your redundant UNIQUE KEYs. A primary key is a unique key.
Use COUNT(*) rather than COUNT(EventID) in your query. They mean the same thing here because EventID is declared NOT NULL.
Your query can be much faster if you stop saying WHERE AttackID > 0. Because that column is a FK to the PK of your other table, those values should be nonzero anyway. But to get that speedup you'll need an index on event(IP), something like this:
CREATE INDEX IpDex ON event (IP)
But you're still summarizing a large table, and that will always take time.
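For illustration, the query with that filter dropped; this assumes every event row really does have a nonzero AttackID, which the FK suggests but does not strictly guarantee:
SELECT COUNT(*) AS cnt, IP
FROM event
GROUP BY IP
ORDER BY cnt DESC
LIMIT 5;
With the IpDex index in place, this can be answered by scanning that index alone.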
It looks like you want to display some kind of leaderboard. You could add a top_ips table, and use an EVENT to populate it, using your query, every few minutes. Then you could display it to your users without incurring the cost of the query every time. This of course would display slightly stale data; only you know whether that's acceptable in your app.
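One possible shape of that, as a sketch (the table and event names are made up; the event scheduler must be enabled, and the mysql client needs a DELIMITER change around the multi-statement body):
CREATE TABLE top_ips (
    IP   varchar(15) NOT NULL,
    hits int NOT NULL,
    PRIMARY KEY (IP)
);

DELIMITER //
CREATE EVENT refresh_top_ips
ON SCHEDULE EVERY 5 MINUTE
DO
BEGIN
    DELETE FROM top_ips;               -- discard the stale leaderboard
    INSERT INTO top_ips (IP, hits)
        SELECT IP, COUNT(*)
        FROM event
        WHERE AttackID > 0
        GROUP BY IP
        ORDER BY COUNT(*) DESC
        LIMIT 5;
END//
DELIMITER ;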
Pro Tip. Read https://use-the-index-luke.com by Marcus Winand.
Essentially every part of your query, except for the FKey, conspires to make the query slow.
Your query is equivalent to
SELECT Count(*), IP
FROM event
WHERE AttackID >0
GROUP BY IP
ORDER BY Count(*) DESC
LIMIT 5;
Please use COUNT(*) unless you need to avoid NULL.
If AttackID is rarely >0, the optimal index is probably
ALTER TABLE event ADD INDEX (AttackID,   -- for filtering
                             IP);        -- for covering
Else, the optimal index is probably
ALTER TABLE event ADD INDEX (IP,         -- to avoid sorting
                             AttackID);  -- for covering
You could simply add both indexes and let the Optimizer decide. Meanwhile, get rid of these, if they exist:
ALTER TABLE event DROP INDEX `AttackID`;  -- safe once the new (AttackID, IP) index exists, since the FK can use it
ALTER TABLE event DROP INDEX `IpDex`;     -- or whatever the single-column IP index is named
because any uses of them are handled by the new indexes.
Furthermore, leaving the 1-column indexes around can confuse the Optimizer into using them instead of the covering index. (This seems to be a design flaw in at least some versions of MySQL/MariaDB.)
"Covering" means that the query can be performed entirely in the index's BTree. EXPLAIN will indicate it with "Using index". A "covering" index speeds up a query by 2x -- but there is a very wide variation on this prediction. ("Using index condition" is something different.)
More on index creation: http://mysql.rjweb.org/doc.php/index_cookbook_mysql

MySQL: Optimizing COUNT(*) and GROUP BY

I have a simple MyISAM table resembling the following (trimmed for readability -- in reality, there are more columns, all of which are constant width and some of which are nullable):
CREATE TABLE IF NOT EXISTS `history` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`time` int(11) NOT NULL,
`event` int(11) NOT NULL,
`source` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `event` (`event`),
KEY `time` (`time`)
);
Presently the table contains only about 6,000,000 rows (of which currently about 160,000 match the query below), but this is expected to increase. Given a particular event ID and grouped by source, I want to know how many events with that ID were logged during a particular interval of time. The answer to the query might be something along the lines of "Today, event X happened 120 times for source A, 105 times for source B, and 900 times for source C."
The query I concocted does perform this task, but it performs monstrously badly, taking well over a minute to execute when the timespan is set to "all time" and in excess of 30 seconds for as little as a week back:
SELECT COUNT(*) AS count FROM history
WHERE event=2000 AND time >= 0 AND time < 1310563644
GROUP BY source
ORDER BY count DESC
This is not for real-time use, so even if the query takes a second or two that would be fine, but several minutes is not. Explaining the query gives the following, which troubles me for obvious reasons:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE history ref event,time event 4 const 160399 Using where; Using temporary; Using filesort
I've experimented with various multi-column indexes (such as (event, time)), but with no improvement. This seems like such a common use case that I can't imagine there not being a reasonable solution, but all my Googling boils down to versions of the query I already have, with no particular suggestions on how to avoid the temporary table (and even then, no explanation of why performance is so abysmal).
Any suggestions?
You say you have tried multi-column indexes. Have you also tried single-column indexes, one per column?
UPDATE: Also, the COUNT(*) operation over a GROUP BY clause is probably a lot faster, if the grouped column also has an index on it... Of course, this depends on the number of NULL values that are actually in that column, which are not indexed.
For event, MySQL can execute a UNIQUE SCAN, which is quite fast, whereas for time, a RANGE SCAN will be applied, which is not so fast... If you separate indexes, I'd expect better performance than with multi-column ones.
Also, maybe you could gain something by partitioning your table by some expected values / value ranges:
http://dev.mysql.com/doc/refman/5.5/en/partitioning-overview.html
I suggest you try this multi-column index:
ALTER TABLE `history` ADD INDEX `history_index` (`event` ASC, `time` ASC, `source` ASC);
Then, if it doesn't help, try forcing the index in the query:
SELECT COUNT(*) AS count FROM history USE INDEX (history_index)
WHERE event=2000 AND time >= 0 AND time < 1310563644
GROUP BY source
ORDER BY count DESC
If the sources are known, or you want the count for specific sources, then you can try something like this:
SELECT COUNT(source = 'A' OR NULL) AS A, COUNT(source = 'B' OR NULL) AS B FROM history;
You can do the ordering in your application code. Also try indexing event and source together.
This should be faster than the original query.

Very slow MYSQL query for 2.5 million row table

I'm really struggling to get a query's time down; it currently has to query 2.5 million rows and it takes over 20 seconds.
here is the query
SELECT play_date AS date, COUNT(DISTINCT(email)) AS count
FROM log
WHERE play_date BETWEEN '2009-02-23' AND '2020-01-01'
AND type = 'play'
GROUP BY play_date
ORDER BY play_date desc;
CREATE TABLE `log` (
`id` int(11) NOT NULL auto_increment,
`instance` varchar(255) NOT NULL,
`email` varchar(255) NOT NULL,
`type` enum('play','claim','friend','email') NOT NULL,
`result` enum('win','win-small','lose','none') NOT NULL,
`timestamp` timestamp NOT NULL default CURRENT_TIMESTAMP,
`play_date` date NOT NULL,
`email_refer` varchar(255) NOT NULL,
`remote_addr` varchar(15) NOT NULL,
PRIMARY KEY (`id`),
KEY `email` (`email`),
KEY `result` (`result`),
KEY `timestamp` (`timestamp`),
KEY `email_refer` (`email_refer`),
KEY `type_2` (`type`,`timestamp`),
KEY `type_4` (`type`,`play_date`),
KEY `type_result` (`type`,`play_date`,`result`)
);
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE log ref type_2,type_4,type_result type_4 1 const 270404 Using where
The query is using the type_4 index.
Does anyone know how I could speed this query up?
Thanks
Tom
That's relatively good, already. The performance sink is that the query has to compare 270404 varchars for equality for the COUNT(DISTINCT(email)), meaning that 270404 rows have to be read.
You may be able to make the count faster by creating a covering index. This means that the actual rows do not need to be read, because all the required information is present in the index itself.
To do this, change the index as follows:
ALTER TABLE `log` DROP INDEX `type_4`, ADD KEY `type_4` (`type`, `play_date`, `email`);
I would be surprised if that wouldn't speed things up quite a bit.
(Thanks to MarkR for the proper term.)
Your indexing is probably as good as you can get it. You have a compound index on the 2 columns in your WHERE clause, and the EXPLAIN you posted indicates that it is being used. Unfortunately, there are 270,404 rows that match the criteria in your WHERE clause and they all need to be considered. Also, you're not returning unnecessary columns in your select list.
My advice would be to aggregate the data daily (or hourly or whatever makes sense) and cache the results. That way you can access slightly stale data instantly. Hopefully this is acceptable for your purposes.
Try an index on play_date, type (same as type_4, just reversed fields) and see if that helps
There are 4 possible types, and I assume hundreds of possible dates. If the query uses the (type, play_date) index, it basically (not 100% accurate, but the general idea) does this:
(A) Find all the play records (about 25% of the file).
(B) Within that subset, find all of the requested dates.
By reversing the index, the approach becomes:
(A) Find all the dates within range (maybe 1-2% of the file).
(B) Within that smaller portion of the file, find all the play types.
Hope this helps
Extracting email to a separate table should be a good performance boost, since counting distinct varchar fields takes a while. Other than that, the correct index is used and the query itself is as optimized as it could be (except for the email, of course).
The COUNT(DISTINCT(email)) part is the bit that's killing you. If you only truly need the first 2000 results of 270,404, perhaps it would help to do the email count only for the results instead of for the whole set.
SELECT date, COUNT(DISTINCT(email)) AS count
FROM log,
(
SELECT id, play_date AS date
FROM log
WHERE play_date BETWEEN '2009-02-23' AND '2020-01-01'
AND type = 'play'
ORDER BY play_date desc
LIMIT 2000
) AS shortlist
WHERE shortlist.id = log.id
GROUP BY date
Try creating an index only on play_date.
Long term, I would recommend building a summary table with a primary key of play_date and count of distinct emails.
Depending on how up to date you need it to be - either allow it to be updated daily (by play_date) or live via a trigger on the log table.
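A sketch of the daily variant (the summary table name is made up; run the REPLACE from cron or an EVENT shortly after midnight):
CREATE TABLE log_daily_emails (
    play_date   date NOT NULL,
    email_count int  NOT NULL,
    PRIMARY KEY (play_date)
);

REPLACE INTO log_daily_emails (play_date, email_count)
SELECT play_date, COUNT(DISTINCT email)
FROM log
WHERE type = 'play'
  AND play_date = CURRENT_DATE - INTERVAL 1 DAY
GROUP BY play_date;
Reporting then becomes a cheap scan of log_daily_emails instead of a multi-million-row aggregation.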
There is a good chance a table scan will be quicker than random access to over 200,000 rows:
SELECT ... FROM log IGNORE INDEX (type_2,type_4,type_result) ...
Also, for large grouped queries you may see better performance by forcing a filesort rather than a hashtable-based group (because if the hashtable needs more than tmp_table_size or max_heap_table_size, performance collapses):
SELECT SQL_BIG_RESULT ...
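Putting both hints into the question's query, as something to benchmark rather than adopt blindly (IGNORE INDEX forces a full table scan):
SELECT SQL_BIG_RESULT play_date AS date, COUNT(DISTINCT email) AS count
FROM log IGNORE INDEX (type_2, type_4, type_result)
WHERE play_date BETWEEN '2009-02-23' AND '2020-01-01'
  AND type = 'play'
GROUP BY play_date
ORDER BY play_date DESC;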