I have a simple query with no joins that is running very slowly (20s+). The queried table has about 400k rows, and every column used in the WHERE clause is indexed.
SELECT deals.id, deals.title,
deals.amount_sold * deals.sale_price AS total_sold_total
FROM deals
WHERE deals.country_id = 33
AND deals.target_approved = 1
AND deals.target_active = 1
AND deals.finished_at >= '2012-04-01'
AND deals.started_at <= '2012-04-30'
ORDER BY total_sold_total DESC
LIMIT 0, 10
I've been struggling with this since last week; I could really use some help :)
Update 1
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE deals index_merge NewIndex3,finished_at_index,index_deals_on_country_id,index_deals_on_target_active,index_deals_on_target_approved index_deals_on_target_active,index_deals_on_target_approved,index_deals_on_country_id 2,2,5 \N 32382 Using intersect(index_deals_on_target_active,index_deals_on_target_approved,index_deals_on_country_id); Using where; Using filesort
To improve the selection, create the following compound indexes, using the columns in the WHERE clause, in the order specified:
(country_id, target_approved, target_active, finished_at)
(country_id, target_approved, target_active, started_at)
You want columns with higher cardinality first and ranges last. MySQL cannot use any key parts past the first range, which is why there are two separate indexes that diverge at the range conditions in the WHERE clause (>= and <=).
If MySQL doesn't utilize both indexes through an index merge, then you might consider deleting one of them.
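A minimal sketch of adding those indexes, assuming the table and column names from the question (the index names are illustrative):
-- Equality columns first, the range column last.
ALTER TABLE deals
  ADD INDEX idx_country_appr_active_finished (country_id, target_approved, target_active, finished_at),
  ADD INDEX idx_country_appr_active_started (country_id, target_approved, target_active, started_at);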
Unfortunately, MySQL still has to fetch all of the matching rows and compute total_sold_total before it can order them. In other words, MySQL must manually sort the rows after retrieving the entire result set.
The time it takes to sort will be directly proportional to the size of the result set.
LIMIT optimizations cannot be used because LIMIT is applied after the sort.
Unfortunately, the ranges in your WHERE clause preclude appending a precalculated total_sold_total column to the end of the index to return the results already in order, which would have avoided the manual sort.
Long time lurker, first time questioner. ;-)
Using PHP 5.6 and MySQL Ver 14.14 Distrib 5.6.41 for Win64 (x86_64). Yeah, I know, a little behind the times, and we're working on updating. But that's where we are now. ;-)
Updates for questions asked:
The index is on CreateDate. I thought there might be an issue with that column being a DATETIME, so I created another column that was just a DATE, put an index on that, and retried, but it had no effect.
ulc has 8,965 rows total. With the index, the query examines 3,787 of them.
et has 9,530 rows. In the query that doesn't use the index, et examines just one row per lookup, since it's joined on the primary key from the first table.
The formatting of the comparison date doesn't seem to matter. I've tried all sorts of formats, including a plain '2018-01-01 00:00:00'. No change.
I've got what I consider a weird one, but I suspect for someone here it's going to be a "duh!" one. I've got a query that applies a date range to the primary table and then pulls other bits of data from other tables based on a set of unique IDs from the first table. Don't worry, I'll have examples below. When I search just the primary table, the range index works as expected and only the relevant rows are examined. However, when I add in the next table with the ON clause, MySQL ignores the index and scans all of the rows of the primary table. If I leave off the ON clause, it goes back to using the index correctly. I tried FORCE INDEX (USE INDEX is ignored), and while that makes it use the index, it slows the query way down. Anyway, here are the queries:
Works:
select CreateDate
from ulc
Inner Join et
WHERE ulc.CreateDate >= STR_TO_DATE("01/01/2018", "%m/%d/%Y")
AND ulc.CreateDate <= STR_TO_DATE("08/02/2018", "%m/%d/%Y")
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE ulc range index_CreateDate index_CreateDate 5 NULL 3787 Using where; Using index
1 SIMPLE et index NULL index_BankProcessorProfile 5 NULL 9530 Using index; Using join buffer (Block Nested Loop)
Doesn't work:
select CreateDate
from ulc
Inner Join et on et.TranID = ulc.TranID
WHERE ulc.CreateDate >= STR_TO_DATE("01/01/2018", "%m/%d/%Y")
AND ulc.CreateDate <= STR_TO_DATE("08/02/2018", "%m/%d/%Y")
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE ulc ALL TranID,index_CreateDate NULL NULL NULL 8965 Using where
1 SIMPLE et eq_ref PRIMARY PRIMARY 8 showpro.ulc.TranID 1 Using index
For the second query I just added the ON clause: et.TranID = ulc.TranID.
Additionally, if I change the range to a specific date, the index is used as well.
Just guessing here without more data, but adding a new table to the JOIN changes the data distribution.
So if in the first case the WHERE condition returns a relatively small percentage of the data, in your second case the optimizer decides you'll get faster results without using the index, since the same conditions might not be quite so selective for the new batch of data.
Add the table definitions and a COUNT for both queries, both total and filtered by your conditions, for a better answer.
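For example, something like this (a sketch using the date range from the question):
SHOW CREATE TABLE ulc;
SHOW CREATE TABLE et;

-- Total row counts
SELECT COUNT(*) FROM ulc;
SELECT COUNT(*) FROM et;

-- Rows matched by the date range on the primary table
SELECT COUNT(*)
FROM ulc
WHERE CreateDate >= STR_TO_DATE('01/01/2018', '%m/%d/%Y')
  AND CreateDate <= STR_TO_DATE('08/02/2018', '%m/%d/%Y');

-- Rows matched once the join is added
SELECT COUNT(*)
FROM ulc
INNER JOIN et ON et.TranID = ulc.TranID
WHERE ulc.CreateDate >= STR_TO_DATE('01/01/2018', '%m/%d/%Y')
  AND ulc.CreateDate <= STR_TO_DATE('08/02/2018', '%m/%d/%Y');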
If you are using DATETIME in your query, it's suggested to use the 'YYYY-MM-DD HH:MM:SS' format in the WHERE clause.
If you are using DATE, it's suggested to use the 'YYYY-MM-DD' format. You have used STR_TO_DATE("01/01/2018", "%m/%d/%Y"), which evaluates to '2018-01-01', so that seems fine.
Try examining the complexity of the query using EXPLAIN:
explain select CreateDate
from ulc
Inner Join et on et.TranID = ulc.TranID
WHERE ulc.CreateDate >= STR_TO_DATE("01/01/2018", "%m/%d/%Y")
AND ulc.CreateDate <= STR_TO_DATE("08/02/2018", "%m/%d/%Y")
You can also check whether et.TranID and ulc.TranID have proper indexes.
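For instance, a quick sketch:
SHOW INDEX FROM ulc;
SHOW INDEX FROM et;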
(I'm going to have to guess at some things, since you have not provided SHOW CREATE TABLE. As a 'long time lurker', you should have realized this.)
First guess is that TranID is not the PRIMARY KEY of ulc?
The solution is to add a "composite" INDEX(CreateDate, TranID) to ulc. (Actually, you should replace the existing INDEX(CreateDate); second guess is that you have that index now.)
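A minimal sketch of that change, assuming the existing index is the index_CreateDate shown in your EXPLAIN output:
-- Replace the single-column index with a covering composite index.
ALTER TABLE ulc
  DROP INDEX index_CreateDate,
  ADD INDEX index_CreateDate_TranID (CreateDate, TranID);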
Now I will try to explain why the first query was happy with INDEX(CreateDate) but the second was not.
In the first query, INDEX(CreateDate) is a "covering" index. That is, this index contains all the columns of ulc that are needed by the SELECT. So, it is almost guaranteed that using the index would be better than scanning the table. It will be a "range index scan" of that index.
The second query needs both CreateDate and TranID, so your index won't be "covering". There are two ways to perform the first part of the query. But first, note that (in InnoDB) a secondary index has all the columns of the PRIMARY KEY (third guess: it is (id)).
Range scan of the index. But in order to get TranID, it first gets id, then does a lookup in the PRIMARY KEY/data to get TranID. This process is more costly than simply staying in the index, so the Optimizer does not want to do it unless the estimated number of rows is 'small'.
Since 3787/8965 is not "small", the Optimizer decides that it is probably faster to scan ALL 8965 rows, filtering out the ones not needed.
My proposed index is 'covering', thereby avoiding the bouncing back and forth between index and data. So, a range index scan is efficient.
Your observation that switching to a single date made use of the index -- Well, 1 row out of 8965 is 'small', so the index (and the bouncing) is deemed to be the faster way.
As for formatting of the date -- True, it does not matter. This is because the parser notices that STR_TO_DATE("01/01/2018", "%m/%d/%Y") is a constant that can be evaluated once, and does so.
My cookbook should take you directly to the composite index without having to scratch your head over this Question.
Your first query is a "cross join", since it has no ON clause to relate the tables, and it will return about 36 million rows (9530 * 3787). The second query will return about 3,787 rows, maybe fewer (if some of the joins fail to find a match).
"how little changes between the two queries" -- Never think that! The Optimizer will latch on to seemingly insignificant differences. SELECT CreateDate versus SELECT * -- a huge difference. Most of what I said about the 'first query' would be thrown out. Even changing to SELECT ChangeDate, x would be enough to make a big wrinkle. If the datatypes of TranID in the two tables differed enough, the indexes become useless. Etc, etc.
I have a query that has been driving me crazy for quite some time. It involves 3 tables (originally a lot more, but I isolated the performance issue): 1 base table, 1 products table that adds more data, and 1 table with product types.
The product types table contains a "max age" column which indicates the maximum age of a row I want to fetch (anything older is considered "archived") and its value is different according to the product type.
My poorly performing query goes like this, and it takes 50 seconds on a 250,000-row base table:
(select d_baseservices.ID
from d_baseservices
inner join d_products on d_baseservices.ServiceID = d_products.ServiceID
inner join md_prodtypes on d_products.ProdType = md_prodtypes.ProdType
where
(d_baseservices.CreationDate > (curdate() - INTERVAL md_prodtypes.MaxAge DAY))
order by CreationDate desc
limit 750);
Here is the EXPLAIN of this query:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE md_prodtypes index PRIMARY,ProdType_UNIQUE,ID_MAX_AGE MAX_AGE 5 23 Using index; Using temporary; Using filesort
1 SIMPLE d_products ref PRIMARY,ServiceID_UNIQUE,fk_Products_BaseServices1,fk_d_products_md_prodtypes1 fk_d_products_md_prodtypes1 4 combina.md_prodtypes.ProdType 8625
1 SIMPLE d_baseservices eq_ref PRIMARY,CreationDateDesc_index,CreationDate_index PRIMARY 8 combina.d_products.ServiceID 1 Using where
I found a clue a few days back, when I was able to determine that limiting the query to 750 records would cause it to run fast, but 751 would bring poor performance.
I tried creating indexes of many kinds, with no success.
I tried removing the reference to MAX_AGE and the curdate function and just set a fixed value, with little success as the query now takes 20 seconds:
(select d_baseservices.ID
from d_baseservices
inner join d_products on d_baseservices.ServiceID = d_products.ServiceID
inner join md_prodtypes on d_products.ProdType = md_prodtypes.ProdType
where
(d_baseservices.CreationDate > '2015-09-21 19:02:25')
order by CreationDate desc
limit 750);
And the EXPLAIN command output:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE md_prodtypes index PRIMARY,ProdType_UNIQUE,ID_MAX_AGE ProdType_UNIQUE 4 23 Using index; Using temporary; Using filesort
1 SIMPLE d_products ref PRIMARY,ServiceID_UNIQUE,fk_Products_BaseServices1,fk_d_products_md_prodtypes1 fk_d_products_md_prodtypes1 4 combina.md_prodtypes.ProdType 8625
1 SIMPLE d_baseservices eq_ref PRIMARY,CreationDateDesc_index,CreationDate_index PRIMARY 8 combina.d_products.ServiceID 1 Using where
Can anyone please help? I've been stuck for almost a month.
It's hard to say exactly what to do without knowing more about the specific data you have (how many rows in each table, how many rows you expect the query to return, the distribution of the data values, etc), but I'll make some educated guesses and hopefully point you in the right direction.
First, an explanation of why taking md_prodtypes.MaxAge out of the query greatly reduced the run time: prior to that change, the database had no way to filter using indexes, because to see whether a row was a candidate for inclusion it had to join the three tables just to compare CreationDate from the first table to MaxAge in the third. There is simply no index you can add to correlate those two values. You're forcing the database engine to look at every single row.
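One way to see this: the effective cutoff date differs for every product type, so there is no single constant range an index could serve. A hypothetical sketch that makes those per-type cutoffs visible (table and column names from the question):
-- Each product type implies a different CreationDate cutoff.
SELECT ProdType, CURDATE() - INTERVAL MaxAge DAY AS cutoff
FROM md_prodtypes;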
As to the 750 magic number - I'm guessing that past 750 results the database has to page data or that it's hitting some other memory limit based on the values in your specific MySQL configuration file. I wouldn't read too much into that 750 number.
Lastly I'd like to point out that the EXPLAIN of your second query is a bit strange since it's showing md_prodtypes as the first table despite the fact that you took MaxAge out of the WHERE. That means the database is starting from md_prodtypes then moving up to d_products and finally to d_baseservices and only then filtering based on the date. I'm guessing that you're expecting it to first filter on the date then join only when it's decided what baseservices records to include. It's impossible to know why this is happening with the information you've provided. Perhaps you are missing an index.
Another possibility may have to do with variance in your CreationDate column. Let me explain by example: Say you had a table of users, and each user had a gender column that could be either f or m. Let's pretend that we have a 50%/50% split of females and males. Now, if you add an index on the column gender and do a query filtered by WHERE gender='f' expecting that the index will filter out half of the records, you'd be surprised to see that the database will totally ignore the index and just scan the table. The reason being is that it's cheaper to just read the whole table if you know the index isn't filtering out enough (the alternative being jumping constantly from the index to the main table data). In your case, if the WHERE on the CreationDate column doesn't filter out enough records, then even if you have an index on it, it won't be used.
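Here is a hypothetical sketch of that effect (the users table and index name are made up for illustration):
-- With a ~50/50 value split, the optimizer will usually ignore
-- this index and scan the table instead.
CREATE INDEX idx_users_gender ON users (gender);

EXPLAIN SELECT * FROM users WHERE gender = 'f';
-- Expect type=ALL (a full scan) rather than ref on idx_users_gender.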
With a constant date...
INDEX(CreationDate)
That will encourage the optimizer to start with the table that can be filtered. Also, since the ORDER BY is on the same field, the WHERE, ORDER BY and LIMIT can all be done at the same time.
Otherwise, it must read all the relevant records from all 3 tables, sort them, then deliver 750 (or 751) of them.
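A sketch for the constant-date case (the index name is illustrative, and your EXPLAIN suggests a CreationDate index may already exist):
ALTER TABLE d_baseservices ADD INDEX idx_creationdate (CreationDate);

-- One index can now serve the filter, the sort, and the LIMIT together.
SELECT b.ID
FROM d_baseservices b
INNER JOIN d_products p ON b.ServiceID = p.ServiceID
INNER JOIN md_prodtypes t ON p.ProdType = t.ProdType
WHERE b.CreationDate > '2015-09-21 19:02:25'
ORDER BY b.CreationDate DESC
LIMIT 750;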
Using MAX_AGE...
Now the optimizer won't know whether it is better to do as above or find all the rows, sort them, then deliver the LIMIT.
I am trying to understand the performance of an SQL query in MySQL.
With indexes only on the PKs, the query failed to complete in over 10 minutes.
I have added indexes on all the columns used in the WHERE clauses (timestamp, hostname, path, type), and the query now completes in approximately 50 seconds -- but this still seems like a long time for what does not appear to be an overly complex query.
So, I'd like to understand what it is about the query that is causing this. My assumption is that my inner subquery is in some way causing an explosion in the number of comparisons necessary.
There are two tables involved:
storage (~5,000 rows / 4.6 MB) and machines (12 rows, <4 KB)
The query is as follows:
SELECT T.hostname, T.path, T.used_pct,
T.used_gb, T.avail_gb, T.timestamp, machines.type AS type
FROM storage AS T
JOIN machines ON T.hostname = machines.hostname
WHERE timestamp = ( SELECT max(timestamp) FROM storage AS st
WHERE st.hostname = T.hostname AND
st.path = T.path)
AND (machines.type = 'nfs')
ORDER BY used_pct DESC
An EXPLAIN EXTENDED for the query returns the following:
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY machines ref hostname,type type 768 const 1 100.00 Using where; Using temporary; Using filesort
1 PRIMARY T ref fk_hostname fk_hostname 768 monitoring.machines.hostname 4535 100.00 Using where
2 DEPENDENT SUBQUERY st ref fk_hostname,path path 1002 monitoring.T.path 648 100.00 Using where
Noticing that the 'Extra' column for row 1 includes 'Using filesort', the question
MySQL explain Query understanding
states that "Using filesort is a sorting algorithm where MySQL isn't able to use an index for sorting and therefore can't do the complete sort in memory."
What is the nature of this query which is causing slow performance?
Why is it necessary for MySQL to use 'filesort' for this query?
Indexes don't get populated, they are there as soon as you create them. That's why inserts and updates become slower the more indexes you have on a table.
Your query runs fast after the first time because the whole result of the query is put into the query cache. To see how fast the query is without the cache, you can do
SELECT SQL_NO_CACHE T.hostname ...
MySQL usually uses filesort for ORDER BY, or in your case to determine the maximum value of timestamp. Instead of going through all possible values and remembering which is greatest, MySQL sorts the values descending and picks the first one.
So, why is your query slow? Two things jumped out at me.
1) Your subquery
WHERE timestamp = ( SELECT max(timestamp) FROM storage AS st
WHERE st.hostname = T.hostname AND
st.path = T.path)
gets evaluated for every (hostname, path) combination. Try an index on timestamp (by the way, I discourage naming columns after keywords / datatypes). If that alone doesn't help, try rewriting your query. There are two excellent examples in the MySQL manual: The Rows Holding the Group-wise Maximum of a Certain Column.
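A sketch of one such rewrite, using a derived table to find the latest timestamp per (hostname, path) first (column and table names taken from the question; the index name is illustrative):
-- An index on (hostname, path, timestamp) lets both the derived
-- table and the join back to storage stay within the index.
ALTER TABLE storage ADD INDEX idx_host_path_ts (hostname, path, timestamp);

SELECT T.hostname, T.path, T.used_pct,
       T.used_gb, T.avail_gb, T.timestamp, machines.type AS type
FROM storage AS T
JOIN (
    SELECT hostname, path, MAX(timestamp) AS max_ts
    FROM storage
    GROUP BY hostname, path
) latest ON latest.hostname = T.hostname
        AND latest.path = T.path
        AND latest.max_ts = T.timestamp
JOIN machines ON T.hostname = machines.hostname
WHERE machines.type = 'nfs'
ORDER BY T.used_pct DESC;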
2) This is a minor issue, but it seems you are joining on CHAR/VARCHAR fields. Joining on numbers / IDs is much faster.
Let's say we have a common join like the one below:
EXPLAIN SELECT *
FROM visited_links vl
JOIN device_tracker dt ON ( dt.Client_id = vl.Client_id
AND dt.Device_id = vl.Device_id )
GROUP BY dt.id
If we execute an EXPLAIN, it says:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE vl index NULL vl_id 273 NULL 1977 Using index; Using temporary; Using filesort
1 SIMPLE dt ref Device_id,Device_id_2 Device_id 257 datumprotect.vl.device_id 4 Using where
I know that sometimes it's difficult to choose the right indexes when you are using GROUP BY, but what indexes could I add to avoid 'Using temporary; Using filesort' in this query? Why is this happening? And especially, why does it still happen even though an index is being used?
One point to mention is that the fields returned by the SELECT (* in this case) should either be in the GROUP BY clause or be wrapped in aggregate functions such as SUM() or MAX(); otherwise unexpected results can occur. If the database is not told how to choose fields that are not in the GROUP BY clause, you may get any member of the group, pretty much at random.
The way I look at it is to break the query down into bits.
You have a join on (dt.Client_id = vl.Client_id AND dt.Device_id = vl.Device_id), so all of those fields should be indexed in their respective tables.
You are using GROUP BY dt.id, so you need an index that includes dt.id.
BUT...
an index on (dt.client_id,dt.device_id,dt.id) will not work for the GROUP BY
and
an index on (dt.id, dt.client_id, dt.device_id) will not work for the join.
Sometimes you end up with a query which just can't use an index.
See also:
http://ntsrikanth.blogspot.com/2007/11/sql-query-order-of-execution.html
You didn't post your indices, but first of all, you'll want to have an index for (client_id, device_id) on visited_links, and (client_id, device_id, id) on device_tracker to make sure that query is fully indexed.
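A sketch of those two indexes (the index names are illustrative):
ALTER TABLE visited_links
  ADD INDEX idx_vl_client_device (Client_id, Device_id);

ALTER TABLE device_tracker
  ADD INDEX idx_dt_client_device_id (Client_id, Device_id, id);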
From page 191 of the excellent High Performance MySQL, 2nd Ed.:
MySQL has two kinds of GROUP BY strategies when it can't use an index: it can use a temporary table or a filesort to perform the grouping. Either one can be more efficient depending on the query. You can force the optimizer to choose one method or the other with the SQL_BIG_RESULT and SQL_SMALL_RESULT optimizer hints.
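For example, applying one of those hints to the query from the question might look like this (a sketch; whether it helps depends on the size of the result set):
-- SQL_BIG_RESULT nudges the optimizer toward a sort-based (filesort)
-- grouping strategy instead of an in-memory temporary table.
SELECT SQL_BIG_RESULT *
FROM visited_links vl
JOIN device_tracker dt ON ( dt.Client_id = vl.Client_id
                        AND dt.Device_id = vl.Device_id )
GROUP BY dt.id;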
In your case, I think the issue stems from joining on multiple columns and using GROUP BY together, even after the suggested indices are in place. If you remove either (a) one of the join conditions or (b) the GROUP BY, this shouldn't need a filesort.
However, keep in mind that a filesort doesn't always use actual files; it can happen entirely within a memory buffer if the result set is small enough, so the performance penalty may be minimal. Consider the wall-clock time of the query too.
HTH!
How is
SELECT t.id
FROM table t
JOIN (SELECT(FLOOR(max(id) * rand())) AS maxid FROM table)
AS tt
ON t.id >= tt.maxid
LIMIT 1
faster than
SELECT * FROM `table` ORDER BY RAND() LIMIT 1
I am actually having trouble understanding the first. Maybe if I knew why one is faster than the other I would have a better understanding.
(Original post: Difficult MySQL self-join please explain)
You can use EXPLAIN on the queries, but basically:
In the first, you're generating a random number (which isn't slow) based on the maximum of an (I presume) indexed field. This is quite quick; I'd say maybe even near-constant time (depending on the index implementation).
Then you're joining on that number and returning only the first row whose id is greater than or equal to it, and because you're using the index again, this is lightning quick.
The second orders by a random function. To do that, it has to do a FULL TABLE scan (look at the EXPLAIN for it), sort every row, and then return the first. This is of course VERY expensive. No index can be used because of that RAND().
(The EXPLAIN will look like this, showing that no keys are used:)
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE table ALL NULL NULL NULL NULL 14 Using temporary; Using filesort