How is using JOIN faster than using just RAND() in MySQL?

How is
SELECT t.id
FROM table t
JOIN (SELECT(FLOOR(max(id) * rand())) AS maxid FROM table)
AS tt
ON t.id >= tt.maxid
LIMIT 1
faster than
SELECT * FROM `table` ORDER BY RAND() LIMIT 1
I am actually having trouble understanding the first. Maybe if I knew why one is faster than the other I would have a better understanding.
(original post: Difficult MySQL self-join please explain)

You can use EXPLAIN on the queries, but basically:
In the first you're getting a random number (which isn't very slow), based on the maximum of an (I presume) indexed field. This is quite quick; I'd say maybe even near-constant time (depends on the index implementation?), since MAX(id) on an indexed column can be read straight off one end of the index.
Then you're joining on that number and returning only the first row whose id is greater than or equal to it, and because you're using an index again, this is lightning quick.
The second orders by a random function. To do that it has to do a FULL TABLE scan (look at the EXPLAIN for it) and then return the first row. This is of course VERY expensive. You're not using any indexes, because of that RAND().
(the explain will look like this, showing that you're not using keys)
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE table ALL NULL NULL NULL NULL 14 Using temporary; Using filesort
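To see the difference yourself, run EXPLAIN on both versions (same table and column names as above; the first should show a range scan on the id index, the second the full scan with "Using temporary; Using filesort" shown above):

EXPLAIN SELECT t.id
FROM `table` t
JOIN (SELECT FLOOR(MAX(id) * RAND()) AS maxid FROM `table`) AS tt
  ON t.id >= tt.maxid
LIMIT 1;

EXPLAIN SELECT * FROM `table` ORDER BY RAND() LIMIT 1;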

Related

MySQL JOIN not filtering on WHERE clause with < > operators, since moving from MySQL 5.6 -> 5.7

We're upgrading our DB systems to MySQL 5.7 coming from MySQL 5.6 and since the upgrade a few queries have been running really slow.
After some investigating we narrowed it down to a few JOIN queries which suddenly no longer seem to apply the WHERE clause when it uses a 'greater than' (>) or 'less than' (<) operator. With an '=' operator they work as expected. On a large table this caused constant 100% CPU usage.
The queries have been simplified to explain the issue at hand; when using explain we get the following outputs:
explain
select * from TableA as A
left join
(
select
DATE_FORMAT(created_at,'%H:%i:00') as `time`
FROM
TableB
WHERE
created_at < DATE_ADD(CURDATE(), INTERVAL -3 HOUR)
)
as V ON V.time = A.time
Output
id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 SIMPLE A NULL ALL NULL NULL NULL NULL 10080 100.00 NULL
1 SIMPLE TableB NULL index created_at created_at 4 NULL 488389 100.00 Using where; Using index; Using join buffer (Block Nested Loop)
As you can see, it's querying/matching 488389 rows, the total number of records in that table, and not applying the WHERE clause.
And now running the same query but with a LIMIT 99999999 clause, or using the '=' operator:
explain
select * from TableA as A
left join
(
select
DATE_FORMAT(created_at,'%H:%i:00') as `time`
FROM
TableB
WHERE
created_at < DATE_ADD(CURDATE(), INTERVAL -3 HOUR) LIMIT 999999999
)
as V ON V.time = A.time
Output
id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 PRIMARY A NULL ALL NULL NULL NULL NULL 10080 100.00 NULL
1 PRIMARY <derived2> NULL ALL NULL NULL NULL NULL 244194 100.00 Using where; Using join buffer (Block Nested Loop)
2 DERIVED TableB NULL range created_at created_at 4 NULL 244194 100.00 Using where; Using index
You can see it's suddenly matching only '244194' rows, which is just part of the table. Or with the '=' operator:
id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 SIMPLE A NULL ALL NULL NULL NULL NULL 10080 100.00 NULL
1 SIMPLE TableB NULL ref created_at created_at 4 const 1 100.00 Using where; Using index
Just 1 row, as expected.
So the question now is: have we been querying in a wrong way and are only now finding out while upgrading, or have things changed since MySQL 5.6? It seems odd that the = operator works, but the < and > operators are ignored for some reason, unless a LIMIT is used.
We've searched around and couldn't find the cause of this issue, and we'd rather not use the limit 9999999 solution in our code for obvious reasons.
Note: when running just the query inside the join, it works as expected as well.
Note: we've also run the same test on MariaDB 10.1; same issue.
The EXPLAIN rows output is merely a guess at how many rows the step will hit. It is based upon statistical data that has been reset with your upgrade. And if I had to guess how many of all your existing rows are older than yesterday 9pm, I too would guess it's closer to "all rows" than to "just some rows". The reason why 'limit 99999999' displays another row count is the same: MySQL just guesses the limit will have an effect; in this case it guesses it will be exactly half of the rows (which, if true, would be a strange coincidence), and of course it doesn't actually look at the limit value, since 999999999 will not limit anything when you only have 500k rows. Even the "1" in the "=" case is just a guess (and might more often be 0 than 1, and sometimes more).
This estimate helps choose the correct execution plan, and being wrong in this guess is only a problem if it leads to choosing the wrong one; your execution plan looks fine though, and there are not many options to do it otherwise. It does exactly as expected: scan the index on created_at for all dates. Since you do a LEFT JOIN, you cannot skip values from TableA even if you started with the inner query, so there is really no alternative execution plan available. (The optimizer actually has changed in 5.7, but here it doesn't have an effect.)
If that is your actual query, there is no real reason why it should be slower than before (regarding this query alone; there are of course a lot of general performance options that might have an indirect effect, like caching strategies, buffer sizes, ..., but with standard options they should not have an effect here).
If not, and you e.g. actually use additional columns from TableB in the subquery (it is often hard to guess which possibly important things have been "simplified away" in questions), and thus need access to the actual table, it might depend on how your data is structured (or rather: in what order you added it). You might also try OPTIMIZE TABLE TableB to make your table and indexes fresh and new; it can't hurt (but will lock your table for a little while).
With MySQL 5.7 you can now add generated columns, so it might be worth a try to generate a cleaned-up column time as DATE_FORMAT(created_at,'%H:%i:00'), so you don't have to calculate it anymore. And maybe add it to your index, so you don't have to sort anymore, to improve the block nested loop join; but that may depend on your actual query and how often you use it (spamming indexes increases overhead and uses space).
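A sketch of that idea (the column type and index name are my assumptions, not from the question):

ALTER TABLE TableB
  ADD COLUMN `time` VARCHAR(8)
    GENERATED ALWAYS AS (DATE_FORMAT(created_at, '%H:%i:00')) STORED,  -- precomputed once per row
  ADD INDEX idx_tableb_time (`time`);                                  -- so the join can use an index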
In MySQL 5.7, derived tables (sub-queries in FROM clause) will be merged into the outer query if possible. This is usually an advantage since one avoids that the result of the sub-query is stored in a temporary table. However, for your query, MySQL 5.6 will create an index on this temporary table that could be used for the join execution.
The problem with the merged query is that the index on TableB.created_at cannot be used when the column is a parameter to a function. If you can change the query so that the transformation is applied to the column on the left side of the join, an index can be used to access the table on the right side. Something like:
select * from TableA as A
left join
(
select created_at as time
FROM TableB
WHERE created_at < DATE_ADD(CURDATE(), INTERVAL -3 HOUR)
)
as V ON V.time = func(A.time)
Alternatively, if you can use inner join instead of left join, MySQL can reverse the join order, so that the index on tableA.time can be used for the join.
If the subquery uses LIMIT, it cannot be merged. Hence, by using LIMIT you will get the same query plan as was used in MySQL 5.6.
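If scattering LIMIT through the code is unappealing, another option (assuming MySQL 5.7's derived_merge optimizer flag, which MariaDB also supports) is to turn off derived-table merging for the session, which should give you the 5.6-style plan for all derived tables:

SET SESSION optimizer_switch = 'derived_merge=off';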
Use JOIN instead of LEFT JOIN unless you need the 'right' table to be optional.
Avoid JOIN ( SELECT ... ). Although 5.6 and 5.7 added some features to handle it, it is usually better to turn the subquery into a simpler JOIN.
Your time expression leads to 9pm yesterday; did you mean "3 hours ago" instead?
See if this gives the desired results and runs faster:
select A.*, DATE_FORMAT(B.created_at,'%H:%i:00') as `time`
from TableA as A
JOIN TableB as B ON B.time = A.time
WHERE B.created_at < NOW() - INTERVAL 3 HOUR -- (assuming "3 hours ago")
As for 5.6 vs 5.7... 5.7 has a new, 'better' optimizer, based on a "cost model". However, your particular query makes it virtually impossible for the optimizer to come up with good costs. I guess that 5.6 happened upon the better EXPLAIN, and 5.7 happened upon a worse one. By simplifying the query, I think both optimizers will have a better chance of performing the query faster.
You do need these indexes:
B: INDEX(time, created_at) -- in that order
A: INDEX(time)
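Spelled out as DDL (index names are illustrative, and this assumes TableB actually has a time column, e.g. the generated column suggested above):

ALTER TABLE TableB ADD INDEX idx_b_time_created (`time`, created_at);
ALTER TABLE TableA ADD INDEX idx_a_time (`time`);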

How to improve IF NOT NULL query?

I have the following query:
SELECT * FROM `title_mediaasset`
WHERE upload_id is not null
ORDER BY `upload_date` DESC
It takes almost a second and doesn't use an index:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE title_mediaasset ALL upload_id,upload_id_2 NULL NULL NULL 119216 Using where; Using filesort
How can I improve this query?
This table holds about 100k results, and will probably increase to 1M in the next year.
If you need all rows and all columns from the result, you can't re-write the query to make it better. It is probably running slow because you don't have an index on upload_date.
If you don't need all of the rows, use LIMIT and you'll see a decent speed increase on the ORDER BY.
If you don't need all of the columns, use SELECT [columns you need] instead of SELECT *. That way if you really need to optimize the query, you can put the columns you need in your index so that you can read everything directly from the index: index on (upload_id, upload_date, [other columns in select statement]).
If you need all of the columns, or a good number of them, just add index on (upload_id, upload_date).
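For instance (a sketch; the index name is made up):

ALTER TABLE title_mediaasset ADD INDEX idx_upload (upload_id, upload_date);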

understanding perf of mysql query using explain extended

I am trying to understand performance of an SQL query using MySQL.
With only indexes on the PK, the query failed to complete in over 10 minutes.
I have added indexes on all the columns used in the WHERE clauses (timestamp, hostname, path, type) and the query now completes in approx 50 seconds; however, this still seems a long time for what does not appear to be an overly complex query.
So, I'd like to understand what it is about the query that is causing this. My assumption is that my inner subquery is somehow causing an explosion in the number of comparisons necessary.
There are two tables involved:
storage (~5,000 rows / 4.6MB ) and machines (12 rows, <4k)
The query is as follows:
SELECT T.hostname, T.path, T.used_pct,
T.used_gb, T.avail_gb, T.timestamp, machines.type AS type
FROM storage AS T
JOIN machines ON T.hostname = machines.hostname
WHERE timestamp = ( SELECT max(timestamp) FROM storage AS st
WHERE st.hostname = T.hostname AND
st.path = T.path)
AND (machines.type = 'nfs')
ORDER BY used_pct DESC
An EXPLAIN EXTENDED for the query returns the following:
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY machines ref hostname,type type 768 const 1 100.00 Using where; Using temporary; Using filesort
1 PRIMARY T ref fk_hostname fk_hostname 768 monitoring.machines.hostname 4535 100.00 Using where
2 DEPENDENT SUBQUERY st ref fk_hostname,path path 1002 monitoring.T.path 648 100.00 Using where
I notice that the 'Extra' column for row 1 includes 'Using filesort', and the question:
MySQL explain Query understanding
states that "Using filesort is a sorting algorithm where MySQL isn't able to use an index for sorting and therefore can't do the complete sort in memory."
What is the nature of this query which is causing slow performance?
Why is it necessary for MySQL to use 'filesort' for this query?
Indexes don't get populated over time; they are complete as soon as you create them. That's why inserts and updates become slower the more indexes you have on a table.
Your query runs fast after the first time because the whole result of the query is put into cache. To see how fast the query is without using the cache you can do
SELECT SQL_NO_CACHE T.hostname ...
MySQL usually uses filesort for ORDER BY, or in your case to determine the maximum value for timestamp. Instead of going through all possible values and remembering which value is the greatest, MySQL sorts the values descending and picks the first one.
So, why is your query slow? Two things jumped out at me.
1) Your subquery
WHERE timestamp = ( SELECT max(timestamp) FROM storage AS st
WHERE st.hostname = T.hostname AND
st.path = T.path)
gets evaluated for every (hostname, path). Have a try with an index on timestamp (by the way, I discourage naming columns after keywords / datatypes). If that alone doesn't help, try to rewrite your query. There are two excellent examples in the MySQL manual: The Rows Holding the Group-wise Maximum of a Certain Column (a sketch of that rewrite follows after point 2).
2) This is a minor issue, but it seems you are joining on char/varchar fields. Numbers / IDs are much faster.
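Here is a sketch of the manual's group-wise-maximum rewrite mentioned in point 1 (untested against your schema; it replaces the correlated subquery with one uncorrelated pass over storage):

SELECT T.hostname, T.path, T.used_pct,
       T.used_gb, T.avail_gb, T.timestamp, machines.type AS type
FROM storage AS T
JOIN machines ON T.hostname = machines.hostname
JOIN ( SELECT hostname, path, MAX(timestamp) AS maxts   -- latest entry per (hostname, path)
       FROM storage
       GROUP BY hostname, path ) AS latest
  ON latest.hostname = T.hostname
 AND latest.path = T.path
 AND latest.maxts = T.timestamp
WHERE machines.type = 'nfs'
ORDER BY used_pct DESC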

how to set indexes for join and group by queries

Let's say we have a common join like the one below:
EXPLAIN SELECT *
FROM visited_links vl
JOIN device_tracker dt ON ( dt.Client_id = vl.Client_id
AND dt.Device_id = vl.Device_id )
GROUP BY dt.id
if we execute an explain, it says:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE vl index NULL vl_id 273 NULL 1977 Using index; Using temporary; Using filesort
1 SIMPLE dt ref Device_id,Device_id_2 Device_id 257 datumprotect.vl.device_id 4 Using where
I know that sometimes it's difficult to choose the right indexes when you are using GROUP BY, but what indexes could I set to avoid 'Using temporary; Using filesort' in this query? Why is this happening? And especially, why does this happen even though an index is being used?
One point to mention is that the fields returned by the select (* in this case) should either be in the GROUP BY clause or use aggregate functions such as SUM() or MAX(). Otherwise unexpected results can occur, because if the database is not told how to choose between fields that are not in the GROUP BY clause, you may get any member of the group, pretty much at random.
The way I look at it is to break the query down into bits.
you have a join on (dt.Client_id = vl.Client_id and dt.Device_id = vl.Device_id) so all of those fields should be indexed in their respective tables.
You are using GROUP BY dt.id so you need an index that includes dt.id
BUT...
an index on (dt.client_id,dt.device_id,dt.id) will not work for the GROUP BY
and
an index on (dt.id, dt.client_id, dt.device_id) will not work for the join.
Sometimes you end up with a query which just can't use an index.
See also:
http://ntsrikanth.blogspot.com/2007/11/sql-query-order-of-execution.html
You didn't post your indices, but first of all, you'll want to have an index for (client_id, device_id) on visited_links, and (client_id, device_id, id) on device_tracker to make sure that query is fully indexed.
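In DDL form, those suggestions would look something like this (index names are illustrative):

ALTER TABLE visited_links  ADD INDEX idx_vl_client_device    (Client_id, Device_id);
ALTER TABLE device_tracker ADD INDEX idx_dt_client_device_id (Client_id, Device_id, id);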
From page 191 of the excellent High Performance MySQL, 2nd Ed.:
MySQL has two kinds of GROUP BY strategies when it can't use an index: it can use a temporary table or a filesort to perform the grouping. Either one can be more efficient depending on the query. You can force the optimizer to choose one method or the other with the SQL_BIG_RESULT and SQL_SMALL_RESULT optimizer hints.
In your case, I think the issue stems from joining on multiple columns and using GROUP BY together, even after the suggested indices are in place. If you remove either (a) one of the join conditions or (b) the GROUP BY, this shouldn't need a filesort.
However, keep in mind that a filesort doesn't always use actual files; it can also happen entirely within a memory buffer if the result set is small enough, so the performance penalty may be minimal. Consider the wall-clock time for the query too.
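If you want to experiment with those hints, here is a sketch against the query from the question (which of the two hints helps, if either, depends on your data, so time both):

SELECT SQL_SMALL_RESULT *            -- or SQL_BIG_RESULT; the hint goes right after SELECT
FROM visited_links vl
JOIN device_tracker dt ON ( dt.Client_id = vl.Client_id
                        AND dt.Device_id = vl.Device_id )
GROUP BY dt.id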
HTH!

Strange Performance Issues with INNER JOIN vs. LEFT JOIN

I was using a query that looked similar to this one:
SELECT `episodes`.*, IFNULL(SUM(`views_sum`.`clicks`), 0) as `clicks`
FROM `episodes`, `views_sum`
WHERE `views_sum`.`index` = "episode" AND `views_sum`.`key` = `episodes`.`id`
GROUP BY `episodes`.`id`
... which takes ~0.1s to execute. But it's problematic, because some episodes don't have a corresponding views_sum row, so those episodes aren't included in the result.
What I want is NULL values when a corresponding views_sum row doesn't exist, so I tried using a LEFT JOIN instead:
SELECT `episodes`.*, IFNULL(SUM(`views_sum`.`clicks`), 0) as `clicks`
FROM `episodes`
LEFT JOIN `views_sum` ON (`views_sum`.`index` = "episode" AND `views_sum`.`key` = `episodes`.`id`)
GROUP BY `episodes`.`id`
This query produces the same columns, and it also includes the few rows missing from the 1st query.
BUT, the 2nd query takes 10 times as long! A full second.
Why is there such a huge discrepancy between the execution times when the result is so similar? There's nowhere near 10 times as many rows — it's like 60 from the 1st query, and 70 from the 2nd. That's not to mention that the 10 additional rows have no views to sum!
Any light shed would be greatly appreciated!
(There are indexes on episodes.id, views_sum.index, and views_sum.key.)
EDIT:
I copy-pasted the SQL from above, and here are the EXPLAINs, in order:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE views_sum ref index,key index 27 const 6532 Using where; Using temporary; Using filesort
1 SIMPLE episodes eq_ref PRIMARY PRIMARY 4 db102914_itw.views_sum.key 1 Using where
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE episodes ALL NULL NULL NULL NULL 70 Using temporary; Using filesort
1 SIMPLE views_sum ref index,key index 27 const 6532
Here's the query I ultimately came up with, after many, many iterations. (The SQL_NO_CACHE flag is there so I can test execution times.)
SELECT SQL_NO_CACHE e.*, IFNULL(SUM(vs.`clicks`), 0) as `clicks`
FROM `episodes` e
LEFT JOIN
(SELECT * FROM `views_sum` WHERE `index` = "episode") vs
ON vs.`key` = e.`id`
GROUP BY e.`id`
Because the ON condition views_sum.index = "episode" is static, i.e. isn't dependent on the row it's joined to, I was able to get a massive performance boost by first using a subquery to limit the views_sum table before joining.
My query now takes ~0.2s. And what's even better, the time doesn't grow as you increase the offset of the query (unlike my first LEFT JOIN attempt). It stays the same, even if you do a sort on the clicks column.
You should have a combined index on views_sum.index and views_sum.key. I suspect you will always use both fields together, judging by the names. Also, I would rewrite the first query to use a proper INNER JOIN clause instead of a filtered Cartesian product.
I suspect the performance of both queries will be much closer together if you do this. And, more importantly, much faster than they are now.
Edit: thinking about it, I would probably add a third column to that index: views_sum.clicks, which can probably be used for the SUM. But remember that multi-column indexes can only be used left to right.
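In DDL form (the index name is made up; note the backticks, since index and key are keywords):

ALTER TABLE views_sum ADD INDEX idx_index_key_clicks (`index`, `key`, `clicks`);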
It's all about the indexes. You'll have to play around with it a bit or post your database schema on here. Just as a rough guess, I'd say you should make sure you have an index on views_sum.key.
Normally, a LEFT JOIN will be slower than an INNER JOIN or a CROSS JOIN because it has to produce a row for every row of the left table, whether or not there is a match. Put another way, the difference in time isn't related to the size of the result, but to the full size of the left table.
I also wonder if you're asking MySQL to figure things out for you that you should be doing yourself. Specifically, that SUM() function would normally require a GROUP BY clause.