I am trying to understand the performance of an SQL query using MySQL.
With indexes only on the PKs, the query failed to complete in over 10 minutes.
I have added indexes on all the columns used in the WHERE clauses (timestamp, hostname, path, type) and the query now completes in approximately 50 seconds -- however, this still seems a long time for what does not appear to be an overly complex query.
So, I'd like to understand what it is about the query that is causing this. My assumption is that my inner subquery is in some way causing an explosion in the number of comparisons necessary.
There are two tables involved:
storage (~5,000 rows / 4.6 MB) and machines (12 rows, <4 KB)
The query is as follows:
SELECT T.hostname, T.path, T.used_pct,
T.used_gb, T.avail_gb, T.timestamp, machines.type AS type
FROM storage AS T
JOIN machines ON T.hostname = machines.hostname
WHERE timestamp = ( SELECT max(timestamp) FROM storage AS st
WHERE st.hostname = T.hostname AND
st.path = T.path)
AND (machines.type = 'nfs')
ORDER BY used_pct DESC
An EXPLAIN EXTENDED for the query returns the following:
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY machines ref hostname,type type 768 const 1 100.00 Using where; Using temporary; Using filesort
1 PRIMARY T ref fk_hostname fk_hostname 768 monitoring.machines.hostname 4535 100.00 Using where
2 DEPENDENT SUBQUERY st ref fk_hostname,path path 1002 monitoring.T.path 648 100.00 Using where
Noticing that the 'Extra' column for row 1 includes 'Using filesort', and that the question
MySQL explain Query understanding
states that "Using filesort is a sorting algorithm where MySQL isn't able to use an index for sorting and therefore can't do the complete sort in memory", I ask:
What is the nature of this query which is causing slow performance?
Why is it necessary for MySQL to use 'filesort' for this query?
Indexes don't get populated over time; they are complete as soon as you create them. That's why inserts and updates become slower the more indexes you have on a table.
Your query runs fast after the first time because the whole result of the query is put into the query cache. To see how fast the query is without the cache, you can do
SELECT SQL_NO_CACHE T.hostname ...
MySQL uses filesort usually for ORDER BY, or, in your case, to determine the maximum value for timestamp. Instead of going through all possible values and remembering the greatest one, MySQL sorts the values in descending order and picks the first.
So, why is your query slow? Two things jumped out at me.
1) Your subquery
WHERE timestamp = ( SELECT max(timestamp) FROM storage AS st
WHERE st.hostname = T.hostname AND
st.path = T.path)
gets evaluated for every (hostname, path). Have a try with an index on timestamp (by the way, I discourage naming columns after keywords / data types). If that alone doesn't help, try to rewrite your query. There are two excellent examples in the MySQL manual: The Rows Holding the Group-wise Maximum of a Certain Column.
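For illustration, here is a sketch of the manual's derived-table rewrite using the column names from your query (untested against your schema):

SELECT T.hostname, T.path, T.used_pct,
       T.used_gb, T.avail_gb, T.timestamp, machines.type AS type
FROM storage AS T
JOIN ( SELECT hostname, path, MAX(timestamp) AS max_ts
       FROM storage
       GROUP BY hostname, path ) AS latest
     ON latest.hostname = T.hostname
    AND latest.path = T.path
    AND latest.max_ts = T.timestamp
JOIN machines ON T.hostname = machines.hostname
WHERE machines.type = 'nfs'
ORDER BY used_pct DESC

This computes each (hostname, path) maximum once, instead of re-running the correlated subquery for every row.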
2) This is a minor issue, but it seems you are joining on char/varchar fields. Joins on numbers / IDs are much faster.
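A sketch of that change (the machine_id columns and index name are made up, and this assumes machines has no integer surrogate key yet):

ALTER TABLE machines ADD COLUMN machine_id INT UNSIGNED NOT NULL AUTO_INCREMENT UNIQUE;
ALTER TABLE storage  ADD COLUMN machine_id INT UNSIGNED;
UPDATE storage s JOIN machines m ON s.hostname = m.hostname
   SET s.machine_id = m.machine_id;
ALTER TABLE storage  ADD INDEX idx_storage_machine (machine_id);
-- the join then becomes: JOIN machines ON T.machine_id = machines.machine_id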
Related
The fact that I haven't been able to come up with (or research) a solution to this question means that I'm either too stupid to read the docs or it is in fact a complicated problem.
In a rather big database I often need a query like this:
SELECT ... WHERE condition GROUP BY something;
This takes a fraction of a second to complete. So I put this in a VIEW:
CREATE VIEW view_x AS SELECT ... GROUP BY something;
And when I then do
SELECT * FROM view_x WHERE condition;
it takes more than a minute to complete. Now it's easy to see why: in the plain SELECT, the DB engine first selects a few hundred results from millions of records and then does the aggregating and grouping only on the matching records. When using the view, it seems to first evaluate the entire dataset, aggregating and grouping everything, and only then returns the records meeting the condition, throwing away the expensively calculated rest.
Is there a more intelligent VIEW solution, or do I have to use the full SELECT each time?
Thanks.
EDIT: Here's the original SQL code for the view:
CREATE VIEW v_status1 AS SELECT
FROM_UNIXTIME(J.ts_start) AS job_start,
J.id AS job_id, J.carrier, J.n_wafers,
count(W.id) AS n
FROM job AS J
JOIN wafer AS W ON J.id=W.job_id
GROUP BY J.carrier, J.n_wafers, W.status_id;
table job: 100k records, table wafer: 2M records.
Comparison is between these queries:
SELECT * FROM v_status1 WHERE carrier LIKE 'W96L00%'; -- very slow
versus the identical SELECT in the VIEW definition with the WHERE clause before the GROUP BY clause.
Some additional information: The query yields 9 records. Using the view it takes 19 seconds to execute. Using the direct query, it takes 0.000 seconds according to MySQL Workbench.
When I replace the WHERE clause in the direct query with a HAVING clause containing the same condition at the end of the query, I end up at the same execution time as the query using the view.
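Spelled out against the view definition above (reconstructed from the description), the two forms being compared are:

-- Fast: filter first, then aggregate
SELECT FROM_UNIXTIME(J.ts_start) AS job_start,
       J.id AS job_id, J.carrier, J.n_wafers,
       COUNT(W.id) AS n
FROM job AS J
JOIN wafer AS W ON J.id = W.job_id
WHERE J.carrier LIKE 'W96L00%'
GROUP BY J.carrier, J.n_wafers, W.status_id;

-- Slow: aggregate everything, then filter (what the view effectively does)
SELECT FROM_UNIXTIME(J.ts_start) AS job_start,
       J.id AS job_id, J.carrier, J.n_wafers,
       COUNT(W.id) AS n
FROM job AS J
JOIN wafer AS W ON J.id = W.job_id
GROUP BY J.carrier, J.n_wafers, W.status_id
HAVING J.carrier LIKE 'W96L00%';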
Yes, I forgot some columns in the GROUP BY part. I put them in; it doesn't make much of a difference.
Minimal example (5 seconds execution time):
CREATE VIEW v_status2 AS SELECT
job_id,
status_id,
count(id) AS n
FROM wafer
GROUP BY job_id, status_id;
This yields 2 records for a given job_id.
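To make the comparison concrete against this minimal view (the job_id value 42 is hypothetical):

-- Direct query: the WHERE restricts wafer before aggregation (fast)
SELECT job_id, status_id, COUNT(id) AS n
FROM wafer
WHERE job_id = 42
GROUP BY job_id, status_id;

-- Through the view: all of wafer is aggregated first, then filtered (slow)
SELECT * FROM v_status2 WHERE job_id = 42;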
Well, I did the obvious and asked MySQL to EXPLAIN; the output is below. My interpretation is what I suspected all along: MySQL first builds a temporary table, doing all the hard work of aggregating and grouping, and then selects only the rows matching the selection criteria. In other words, MySQL is not intelligent enough to analyze the view first, find where it could efficiently cull the original dataset, and work only on the remaining records.
BTW, this has nothing to do with joins and indexes. You can see the effect with any sufficiently large two-column table.
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY <derived2> ALL NULL NULL NULL NULL 952929 Using where
2 DERIVED WS index PRIMARY ix_waferstatus_text 123 NULL 9 Using index; Using temporary; Using filesort
2 DERIVED W ref ix_wafer_job_id,wafer_ibfk_2 wafer_ibfk_2 5 jobwatch.WS.id 105881 Using where
2 DERIVED J eq_ref PRIMARY,job_ibkf_2 PRIMARY 4 jobwatch.W.job_id 1 Using where
2 DERIVED T eq_ref PRIMARY PRIMARY 4 jobwatch.J.tool_id 1
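For what it's worth, this can be confirmed on the minimal example above: a view containing GROUP BY cannot be merged into the outer query, so even explicitly requesting the MERGE algorithm silently falls back to a temporary table. A sketch (the view name is made up):

-- MySQL issues a warning and uses TEMPTABLE instead of MERGE here,
-- because GROUP BY makes the view unmergeable; this is exactly why
-- the outer WHERE cannot be pushed below the aggregation.
CREATE ALGORITHM=MERGE VIEW v_status2_merge AS
SELECT job_id, status_id, COUNT(id) AS n
FROM wafer
GROUP BY job_id, status_id;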
We're upgrading our DB systems to MySQL 5.7 coming from MySQL 5.6 and since the upgrade a few queries have been running really slow.
After some investigation we narrowed it down to a few JOIN queries which suddenly no longer seem to apply the WHERE clause when using a 'larger than' (>) or 'smaller than' (<) operator. When using an '=' operator they do work as expected. When querying a large table this caused constant 100% CPU usage.
The queries have been simplified to explain the issue at hand; when using explain we get the following outputs:
explain
select * from TableA as A
left join
(
select
DATE_FORMAT(created_at,'%H:%i:00') as `time`
FROM
TableB
WHERE
created_at < DATE_ADD(CURDATE(), INTERVAL -3 HOUR)
)
as V ON V.time = A.time
Output
id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 SIMPLE A NULL ALL NULL NULL NULL NULL 10080 100.00 NULL
1 SIMPLE TableB NULL index created_at created_at 4 NULL 488389 100.00 Using where; Using index; Using join buffer (Block Nested Loop)
As you can see, it's querying/matching 488,389 rows and not applying the WHERE clause, since this is the total number of records in that table.
And now running the same query but with a LIMIT 999999999 clause, or using the '=' operator:
explain
select * from TableA as A
left join
(
select
DATE_FORMAT(created_at,'%H:%i:00') as `time`
FROM
TableB
WHERE
created_at < DATE_ADD(CURDATE(), INTERVAL -3 HOUR) LIMIT 999999999
)
as V ON V.time = A.time
Output
id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 PRIMARY A NULL ALL NULL NULL NULL NULL 10080 100.00 NULL
1 PRIMARY <derived2> NULL ALL NULL NULL NULL NULL 244194 100.00 Using where; Using join buffer (Block Nested Loop)
2 DERIVED TableB NULL range created_at created_at 4 NULL 244194 100.00 Using where; Using index
You can see it's suddenly only matching 244,194 rows, which is a part of the table. Or, with the '=' operator:
id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 SIMPLE A NULL ALL NULL NULL NULL NULL 10080 100.00 NULL
1 SIMPLE TableB NULL ref created_at created_at 4 const 1 100.00 Using where; Using index
Just 1 row, as expected.
So the question now is: have we been querying in the wrong way and are only finding out now while upgrading, or have things changed since MySQL 5.6? It seems odd that the = operator works, but < and > are ignored for some reason, unless we use a LIMIT?
We've searched around and couldn't find the cause of this issue, and we'd rather not use the LIMIT 999999999 workaround in our code for obvious reasons.
Note When running just the query inside the join, it works as expected as well.
Note We've also run the same test on MariaDB 10.1; same issue.
The EXPLAIN row output is merely a guess at how many rows a step will hit. It is based on statistical data that was reset by your upgrade. And if I had to guess how many of all your existing rows are older than yesterday 9 pm, I too would guess it's closer to "all rows" than to "just some rows". The reason why 'LIMIT 999999999' displays another row count is the same: MySQL just guesses that the limit will have an effect; in this case it guesses it will be exactly half of the rows (which would be, if true, a strange coincidence), and of course it doesn't actually look at the limit value, since 999999999 will not limit anything when you only have 500k rows. Even the "1" in the case of "=" is just a guess (and might more often be 0 than 1, and maybe sometimes more).
This estimate helps choose the correct execution plan, and a wrong guess is only a problem if it leads to the wrong plan; your execution plan looks fine, though, and there are not many options to do it otherwise. It does exactly what's expected: scan the index for all dates using the index on created_at. Since you do a LEFT JOIN, you cannot skip values from TableA even if you were to start with the inner query, so there is really no alternative execution plan available. (The optimizer actually changed in 5.7, but here it doesn't have an effect.)
If that is your actual query, there is no real reason why it should be slower than before (at least regarding this query; there are of course a lot of general performance options that might have an indirect effect, like caching strategies, buffer sizes, ..., but with standard options they should not have an effect here).
If not, and you e.g. actually use additional columns from TableB in the subquery (it is often hard to guess which possibly important things have been "simplified away" in a question), and thus need access to the actual table, it might depend on how your data is structured (or better: in what order you added it). You might try OPTIMIZE TABLE TableB to make your table and indexes fresh and new; it can't hurt (but will lock your table for a little while).
With MySQL 5.7 you can now add generated columns, so it might be worth a try to generate a cleaned-up column time as DATE_FORMAT(created_at,'%H:%i:00') so you don't have to calculate it anymore, and maybe add it to your index so you don't have to sort anymore, improving the block nested loop join; but that may depend on your actual query and how often you use it (spamming indexes increases overhead and uses space).
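A sketch of that idea (the column and index names are made up; STORED is used so the column is indexable regardless of 5.7 point release):

ALTER TABLE TableB
  ADD COLUMN time_hhmm VARCHAR(8)
      AS (DATE_FORMAT(created_at, '%H:%i:00')) STORED,
  ADD INDEX idx_time_hhmm (time_hhmm);

-- the join can then use the precomputed, indexed column:
-- ... LEFT JOIN TableB ON TableB.time_hhmm = A.time ...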
In MySQL 5.7, derived tables (subqueries in the FROM clause) are merged into the outer query if possible. This is usually an advantage, since it avoids storing the result of the subquery in a temporary table. However, for your query, MySQL 5.6 would create an index on this temporary table that could be used for the join execution.
The problem with the merged query is that the index on TableB.created_at cannot be used when the column is a parameter to a function. If you can change the query so that the transformation is applied to the column on the left side of the join, an index can be used to access the table on the right side. Something like:
select * from TableA as A
left join
(
select created_at as time
FROM TableB
WHERE created_at < DATE_ADD(CURDATE(), INTERVAL -3 HOUR)
)
as V ON V.time = func(A.time)
Alternatively, if you can use inner join instead of left join, MySQL can reverse the join order, so that the index on tableA.time can be used for the join.
If the subquery uses LIMIT, it cannot be merged. Hence, by using LIMIT you will get the same query plan as was used in MySQL 5.6.
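If you'd rather not scatter LIMIT 999999999 through your code, MySQL 5.7 also lets you switch the merging off directly; a sketch:

-- disable derived-table merging for this session, restoring the
-- 5.6-style materialization (with its auto-generated index):
SET SESSION optimizer_switch = 'derived_merge=off';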
Use JOIN instead of LEFT JOIN unless you need the 'right' table to be optional.
Avoid JOIN ( SELECT ... ). Although 5.6 and 5.7 added some features to handle it, it is usually better to turn the subquery into a simpler JOIN.
Your time expression leads to 9pm yesterday; did you mean "3 hours ago" instead?
See if this gives the desired results and runs faster:
select A.*, DATE_FORMAT(B.created_at,'%H:%i:00') as `time`
from TableA as A
JOIN TableB as B ON B.time = A.time
WHERE B.created_at < NOW() - INTERVAL 3 HOUR -- (assuming "3 hours ago")
As for 5.6 vs 5.7: 5.7 has a new, 'better' optimizer based on a "cost model". However, your particular query makes it virtually impossible for the optimizer to come up with good costs. I guess that 5.6 happened upon the better plan, and 5.7 happened upon a worse one. By simplifying the query, I think both optimizers will have a better chance of performing it faster.
You do need these indexes:
B: INDEX(time, created_at) -- in that order
A: INDEX(time)
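In ALTER TABLE form (index names invented; this assumes TableB really has a time column, as the rewritten query above implies):

ALTER TABLE TableB ADD INDEX idx_b_time_created (`time`, created_at);
ALTER TABLE TableA ADD INDEX idx_a_time (`time`);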
I have this query:
SELECT
COUNT(*) AS `numrows`
FROM (`tbl_A`)
JOIN `tbl_B` ON `tbl_A`.`B_id` = `tbl_B`.`id`
WHERE
`tbl_B`.`boolean_value` <> 1;
I added three indexes, on tbl_A.B_id, tbl_B.id and tbl_B.boolean_value, but MySQL still says it doesn't use indexes (the query shows up in the queries-not-using-indexes log) and it examines the whole of both tables to retrieve the result.
I need to know what I should do to optimize this query.
EDIT:
Explain output:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE tbl_B ALL PRIMARY,boolean_value NULL NULL NULL 5049 Using where
1 SIMPLE tbl_A ref B_id B_id 9 tbl_B.id 9 Using where; Using index
The EXPLAIN shows us that an index is used to make the join to tbl_A, but no index is used to filter tbl_B on the boolean value.
An index was available, but the engine chose not to use it. Why might that happen?
Maybe 5,049 rows is not a big deal, and the engine saw that using the index to filter something like 10% of the rows would be about as fast as doing it without the index.
Booleans take only 3 values: 1, 0 or NULL. So the cardinality of the index will always be very low (3 at most). Low-cardinality indexes are usually discarded by the query analyser (which is usually right in thinking such an index won't help it much).
It would be interesting to see whether the query analyser behaves the same way when you have a 50/50 split of true and false values for this boolean, or when you have just a few false ones.
Usually, boolean fields are useful only in compound indexes containing multiple columns, so that if your queries use all the fields of the index (in WHERE or ORDER BY) the query analyser will trust that index to really be a good tool.
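For the query at hand, a sketch of such a compound index (the name is invented): combining the boolean with the id makes the index cover the whole access to tbl_B:

-- range-scan boolean_value <> 1 and deliver id for the join,
-- entirely from the index:
ALTER TABLE tbl_B ADD INDEX idx_bool_id (boolean_value, id);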
Note that indexes slow down your writes and take extra space, so do not add useless indexes. Using log-queries-not-using-indexes is a good thing, but you should weigh that log's information against the slow query log. If the query is fast, it's not a problem.
If boolean_value really is a boolean, indexing it on its own is not a good idea. The index wouldn't be effective.
I have a simple query with no joins that is running very slow (20s+). The table queried has about 400k rows, and all columns used in the WHERE clause are indexed.
SELECT deals.id, deals.title,
deals.amount_sold * deals.sale_price AS total_sold_total
FROM deals
WHERE deals.country_id = 33
AND deals.target_approved = 1
AND deals.target_active = 1
AND deals.finished_at >= '2012-04-01'
AND deals.started_at <= '2012-04-30'
ORDER BY total_sold_total DESC
LIMIT 0, 10
I've been struggling with this since last week; I could use some help :)
Update 1
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE deals index_merge NewIndex3,finished_at_index,index_deals_on_country_id,index_deals_on_target_active,index_deals_on_target_approved index_deals_on_target_active,index_deals_on_target_approved,index_deals_on_country_id 2,2,5 \N 32382 Using intersect(index_deals_on_target_active,index_deals_on_target_approved,index_deals_on_country_id); Using where; Using filesort
To improve the selection, create the following compound indexes, using the columns in the WHERE clause, in the order specified:
(country_id, target_approved, target_active, finished_at)
(country_id, target_approved, target_active, started_at)
You want columns with higher cardinality first and ranges last. MySQL cannot utilize any key parts past the first range, which is why we have two separate indexes that diverge at the range columns in the WHERE clause (>= and <=).
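In ALTER TABLE form (index names invented):

ALTER TABLE deals
  ADD INDEX idx_deals_finished (country_id, target_approved, target_active, finished_at),
  ADD INDEX idx_deals_started  (country_id, target_approved, target_active, started_at);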
If MySQL doesn't utilize both indexes through an index merge, then you might consider deleting one of them.
Unfortunately, MySQL still has to return all of the rows, computing the total_sold_total, before it can order the rows. In other words, MySQL has to manually sort the rows after it has retrieved the entire dataset.
The time it takes to sort will be directly proportional to the size of the result set.
LIMIT optimizations cannot be used because LIMIT is applied after the sort.
Unfortunately, having ranges in your WHERE clause precludes you from adding a precalculated total_sold_total column to the end of your index to return the results already in order, which would prevent the manual sort.
Let's say we have a common join like the one below:
EXPLAIN SELECT *
FROM visited_links vl
JOIN device_tracker dt ON ( dt.Client_id = vl.Client_id
AND dt.Device_id = vl.Device_id )
GROUP BY dt.id
if we execute an explain, it says:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE vl index NULL vl_id 273 NULL 1977 Using index; Using temporary; Using filesort
1 SIMPLE dt ref Device_id,Device_id_2 Device_id 257 datumprotect.vl.device_id 4 Using where
I know that sometimes it's difficult to choose the right indexes when you are using GROUP BY, but what indexes could I set to avoid 'Using temporary; Using filesort' in this query? Why is this happening? And especially, why does it happen even when an index is being used?
One point to mention is that the fields returned by the SELECT (* in this case) should either be in the GROUP BY clause or be wrapped in aggregate functions such as SUM() or MAX(). Otherwise unexpected results can occur: if the database is not told how to choose fields that are not in the GROUP BY clause, you may get any member of the group, pretty much at random.
The way I look at it is to break the query down into bits.
You have a join on (dt.Client_id = vl.Client_id AND dt.Device_id = vl.Device_id), so all of those fields should be indexed in their respective tables.
You are using GROUP BY dt.id so you need an index that includes dt.id
BUT...
an index on (dt.client_id,dt.device_id,dt.id) will not work for the GROUP BY
and
an index on (dt.id, dt.client_id, dt.device_id) will not work for the join.
Sometimes you end up with a query which just can't use an index.
See also:
http://ntsrikanth.blogspot.com/2007/11/sql-query-order-of-execution.html
You didn't post your indices, but first of all you'll want an index on (client_id, device_id) for visited_links, and one on (client_id, device_id, id) for device_tracker, to make sure the query is fully indexed.
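In ALTER TABLE form (index names invented):

ALTER TABLE visited_links  ADD INDEX idx_vl_client_device (Client_id, Device_id);
ALTER TABLE device_tracker ADD INDEX idx_dt_client_device_id (Client_id, Device_id, id);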
From page 191 of the excellent High Performance MySQL, 2nd Ed.:
MySQL has two kinds of GROUP BY strategies when it can't use an index: it can use a temporary table or a filesort to perform the grouping. Either one can be more efficient depending on the query. You can force the optimizer to choose one method or the other with the SQL_BIG_RESULT and SQL_SMALL_RESULT optimizer hints.
In your case, I think the issue stems from joining on multiple columns and using GROUP BY together, even after the suggested indices are in place. If you remove either (a) one of the join conditions or (b) the GROUP BY, this shouldn't need a filesort.
However, keep in mind that a filesort doesn't always use actual files; it can also happen entirely within a memory buffer if the result set is small enough, so the performance penalty may be minimal. Consider the wall-clock time for the query, too.
HTH!