Does MySQL eliminate common subexpressions between the SELECT and HAVING/GROUP BY clauses?

I often see people answer MySQL questions with queries like this:
SELECT DAY(date), other columns
FROM table
GROUP BY DAY(date);
SELECT somecolumn, COUNT(*)
FROM table
GROUP BY somecolumn
HAVING COUNT(*) > 1;
I always like to give the column an alias and refer to that in the GROUP BY or HAVING clause, e.g.
SELECT DAY(date) AS day, other columns
FROM table
GROUP BY day;
SELECT somecolumn, COUNT(*) AS c
FROM table
GROUP BY somecolumn
HAVING c > 1;
Is MySQL smart enough to notice that the expressions in the later clauses are the same as those in the SELECT list, and evaluate them only once? I'm not sure how to test this -- EXPLAIN doesn't show any difference, but it doesn't seem to show how the grouping or filtering is done in the first place; it seems mainly useful for optimizing joins and WHERE clauses.
I tend to be pessimistic about MySQL optimization, so I like to give it all the help I can.

I think this can be tested using the SLEEP() function.
For example, take a look at this demo: http://sqlfiddle.com/#!2/0bc1b/1
SELECT * FROM t;
| X |
|---|
| 1 |
| 2 |
| 2 |
SELECT x+sleep(1)
FROM t
GROUP BY x+sleep(1);
SELECT x+sleep(1) AS name
FROM t
GROUP BY name;
The execution time of both queries is about 3000 ms (3 seconds).
There are 3 records in the table, and the query sleeps for 1 second per record,
which means the expression is evaluated only once for each record, not twice.

After consulting with one of the MySQL engineers, I proffer this lengthy answer.
Caching - no part of a query is 'remembered' for later use in that (or subsequent) query. (Contrast: the Query cache.)
Common subexpression elimination - no. This is a common compiler technique, but MySQL does not use it. Example: (a-b)*(a-b) will do the subtraction twice (see the SLEEP() sketch below).
Removal of a constant from a loop - yes, with limitations. This is another Compiler technique.
A variety of SQL-centric hacks - yes; see below.
Re-evaluation of a subquery - it depends. Also, the Optimizer is gradually getting better.
VIEWs - it depends. There are still cases where a VIEW is destined to perform worse than the equivalent SELECT. Example: no condition pushdown into a UNION in a VIEW. Actually, this is more a matter of delayed action.
I think that some newer versions of MariaDB have a "subquery cache".
(Caveat: I do not have 100% confidence in every part of this answer, but I believe most of it is correct as of MySQL 5.7, MariaDB 10.1, etc.)
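The missing common-subexpression elimination can be watched directly by reusing the SLEEP() trick from the answer above (a quick sketch against the same 3-row table t; the timing is an expectation, not a measurement):
SELECT (x + SLEEP(1)) * (x + SLEEP(1)) AS sq
FROM t;
-- Sleeps twice per row (about 6 seconds for 3 rows): the subexpression is evaluated twice.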
Think of a multi-row SELECT as a loop. Many, maybe all, "deterministic" expressions are evaluated once. Example: Constant date expressions, even involving function calls. But...
NOW() is specifically evaluated once at the beginning of a query. Furthermore, the value is passed to Slaves when replicating. That is, by the time the query is stored on a slave, NOW() could be out of date. (SYSDATE() is another animal.)
Especially with the advent of only_full_group_by, GROUP BY needs to know whether its expressions match the SELECT expressions, so this check looks for similar code.
HAVING and ORDER BY can use aliases from the SELECT list (unlike WHERE and GROUP BY). So SELECT expr AS x ... HAVING expr seems to reevaluate expr, but SELECT expr AS x ... HAVING x seems to reach for the already-evaluated expr.
The Windowing functions of MariaDB 10.2 have some pretty severe restrictions on where they can/cannot be reused; I don't have a complete picture of them yet.
Generally, none of this matters -- the re-evaluation of an expression (DATE(date) or even COUNT(*)) will get the same answer. Furthermore, the rummaging through the rows is usually much more costly than expression evaluation. So, unless you have a good stopwatch, you won't tell the difference.
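The HAVING point above can be probed with the same SLEEP() trick (a sketch against the 3-row table t from the earlier demo; the expected timings are assumptions based on the behavior described):
SELECT x + SLEEP(1) AS s
FROM t
GROUP BY x
HAVING x + SLEEP(1) > 0;  -- repeats the expression: expect roughly 6 seconds
SELECT x + SLEEP(1) AS s
FROM t
GROUP BY x
HAVING s > 0;             -- reuses the evaluated alias: expect roughly 3 seconds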

Does MySQL not use LIMIT to optimize query select functions?

I've got a complex query I have to run in an application that is giving me some performance trouble. I've simplified it here. The database is MySQL 5.6.35 on CentOS.
SELECT a.`po_num`,
       COUNT(*) AS item_count,
       SUM(b.`quantity`) AS total_quantity,
       GROUP_CONCAT(`web_sku` SEPARATOR ' ') AS web_skus
FROM `order` a
INNER JOIN `order_item` b
        ON a.`order_id` = b.`order_key`
WHERE `store` LIKE '%foobar%'
LIMIT 200 OFFSET 0;
The key part of this query is where I've placed "foobar" as a placeholder. If this value is something like big_store, the query takes much longer (roughly 0.4 seconds in the query provided here, and much longer in the query I'm actually using) than if the value is small_store (roughly 0.1 seconds in the query provided). big_store would return significantly more results if there were no limit.
But there is a limit, and that's what surprises me. Both datasets have more rows than the LIMIT, which is only 200. It appears to me that MySQL performs the select functions COUNT, SUM, and GROUP_CONCAT for all big_store/small_store rows and then applies the LIMIT retroactively. I would imagine it would be best to stop once you reach 200.
Could it not perform the COUNT, SUM, and GROUP_CONCAT actions after grabbing the 200 rows it will use, making my query much, much quicker? This seems feasible to me except in cases where there's an ORDER BY on one of those columns.
Does MySQL not use LIMIT to optimize a query's select functions? If not, is there a good reason for that? If so, did I make a mistake in my thinking above?
It can stop short due to the LIMIT, but that is not a reasonable query since there is no ORDER BY.
Without ORDER BY, it will pick whatever 200 rows it feels like and stop short.
With an ORDER BY, it will have to scan the entire table that contains store (please qualify columns with which table they come from!). This is because of the leading wildcard. Only then can it trim to 200 rows.
Another problem -- Without a GROUP BY, aggregates (SUM, etc.) are performed across the entire table (or at least the rows that remain after filtering). The LIMIT does not apply until after that.
Perhaps what you are asking about is MariaDB 5.5.21's "LIMIT ROWS EXAMINED".
Think of it this way ... All of the components of a SELECT are done in the order specified by the syntax. Since LIMIT is last, it does not apply until after the other stuff is performed.
(There are a couple of exceptions: (1) SELECT col... must be done after FROM ..., since it would not otherwise know which table(s) to use; (2) the optimizer readily reorders JOINed tables and the clauses in WHERE ... AND ....)
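As a rough sketch of that conceptual order (a simplification, since as just noted the optimizer reorders things under the covers; the query reuses the poster's `order` table):
SELECT store, COUNT(*) AS n   -- 5. compute select-list expressions and aggregates
FROM `order`                  -- 1. pick the rows from the table(s)
WHERE store LIKE '%foobar%'   -- 2. filter individual rows
GROUP BY store                -- 3. group the survivors
HAVING n > 1                  -- 4. filter the groups
ORDER BY n DESC               -- 6. sort the result
LIMIT 200;                    -- 7. finally, trim the output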
More details on that query.
The optimizer peeks ahead, and sees that the WHERE is filtering on order (that is where store is, yes?), so it decides to start with the table order.
It fetches all rows from order that match %foobar%.
For each such row, find the row(s) in order_item. Now it has some number of rows (possibly more than 200) with which to do the aggregates.
Perform the aggregates - COUNT, SUM, GROUP_CONCAT. (Actually this will probably be done as it gathers the rows -- another optimization.)
There is now 1 row (with an unpredictable value for a.po_num).
Skip 0 rows for the OFFSET part of the LIMIT. (OK, another out-of-order thingie.)
Deliver up to 200 rows. (There is only 1.)
Add ORDER BY (but no GROUP BY) -- big deal, sort the 1 row.
Add GROUP BY (but no ORDER BY), and now you may have more than 200 rows coming out, and it can stop short.
Add GROUP BY and ORDER BY and they are identical, then it may have to do a sort for the grouping, but not for the ordering, and it may stop at 200.
Add GROUP BY and ORDER BY and they are not identical, then it may have to do a sort for the grouping, and will have to re-sort for the ordering, and cannot stop at 200 until after the ORDER BY. That is, virtually all the work is performed on all the data.
Oh, and all of this gets worse if you don't have the optimal index. Oh, did I fail to insist on providing SHOW CREATE TABLE?
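Since the table definitions were never posted, here is only a guess at indexes that would help (hypothetical index names; adjust to the real schema):
ALTER TABLE `order_item` ADD INDEX idx_order_key (`order_key`);
-- Speeds up the per-row lookups in the join step above.
ALTER TABLE `order` ADD INDEX idx_store (`store`);
-- Of limited use here: a leading-wildcard LIKE ('%foobar%') cannot seek into this
-- index, though the optimizer may still prefer scanning it over scanning the table.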
I apologize for my tone. I have thrown quite a few tips in your direction; please learn from them.

Check if MySQL Table is empty: COUNT(*) is zero vs. LIMIT 0,1 has a result?

This is a simple question about efficiency specifically related to the MySQL implementation. I want to just check if a table is empty (and if it is empty, populate it with the default data). Would it be best to use a statement like SELECT COUNT(*) FROM `table` and then compare to 0, or would it be better to do a statement like SELECT `id` FROM `table` LIMIT 0,1 then check if any results were returned (the result set has next)?
Although I need this for a project I am working on, I am also interested in how MySQL works with those two statements and whether the reason people seem to suggest using COUNT(*) is because the result is cached or whether it actually goes through every row and adds to a count as it would intuitively seem to me.
You should definitely go with the second query rather than the first.
When using COUNT(*), MySQL scans at least an index and counts the records. Even if you wrap the call in LEAST() (SELECT LEAST(COUNT(*), 1) FROM table;) or IF(), MySQL will fully evaluate COUNT() before evaluating further. I don't believe MySQL caches the COUNT(*) result when InnoDB is being used.
Your second query results in only one row being read; furthermore, an index is used (assuming id is part of one). Look at the documentation of your driver to find out how to check whether any rows were returned.
By the way, the id field may be omitted from the query (MySQL will use an arbitrary index):
SELECT 1 FROM table LIMIT 1;
However, I think the simplest and most performant solution is the following (as indicated in Gordon's answer):
SELECT EXISTS (SELECT 1 FROM table);
EXISTS returns 1 if the subquery returns any rows, otherwise 0. Because of these semantics, MySQL can optimize the execution properly.
Any fields listed in the subquery are ignored, thus 1 or * is commonly written.
See the MySQL Manual for more info on the EXISTS keyword and its use.
It is better to use the second method, or just EXISTS. Specifically, something like:
if exists (select id from table)
should be the fastest way to do what you want. You don't need the limit; the SQL engine takes care of that for you.
By the way, never put identifiers (table and column names) in single quotes.
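For the original goal (populate the table with default data only when it is empty), the emptiness check and the insert can be combined. A sketch, with placeholder column names and values:
INSERT INTO `table` (`id`, `name`)
SELECT 1, 'default value' FROM DUAL
WHERE NOT EXISTS (SELECT 1 FROM `table`);
-- Inserts the default row only if the table currently has no rows.
-- Note: not fully race-proof under concurrent writers without locking.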

Math calculations in MySql WHERE

Can math calculations be done in the WHERE portion of a MySQL statement?
For example, lets say I have the following SQL statement:
SELECT
employee_id,
max_hours,
sum(hours) AS total_hours
FROM
some_table
WHERE
total_hours < (max_hours * 1.5)
I looked around and found that MySQL does have math functions, but all the examples are in the SELECT portion of the statement.
You can use any (supported) arithmetic you like in a WHERE or JOIN clause, as long as the final result is a boolean (TRUE, FALSE, or NULL, where NULL is treated as FALSE).
This will usually mean indexes cannot be used, as their structure only allows them to serve direct equality, inequality, or range lookups. In the example you gave there is no useful index you could define, so the query runner would be forced to perform a table scan. For simple filtering clauses referring to one table, an index will only get used if one side of the comparison is a constant (or a variable that is constant for the runtime of the query).
With joining clauses, an index might be used for one side of the match if that side is a direct column reference (i.e. no arithmetic), though if the join is likely to cover many rows, a scan may still be used, as an index (or even table) scan can be quicker than a great many index seeks.
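Where you can, move the arithmetic onto the constant side of a comparison so the bare column stays indexable (a generic sketch reusing the poster's table name; the constant 120 is made up):
-- An index on `hours` cannot be range-scanned: each row's value must be computed first.
SELECT * FROM some_table WHERE hours * 1.5 > 120;
-- Equivalent filter, but now an index on `hours` can be used.
SELECT * FROM some_table WHERE hours > 120 / 1.5;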
You might try something like this...
SELECT
    employee_id,
    max_hours,
    SUM(hours) AS total_hours
FROM
    some_table
GROUP BY
    employee_id, max_hours  -- max_hours is assumed constant per employee (required under only_full_group_by)
HAVING
    SUM(hours) < (max_hours * 1.5)

Should criteria be duplicated on subqueries

I have a query which actually runs two queries on a table. I query the whole table plus a DATEDIFF, and then a subquery which tells me the sum of hours each unit spent in certain operational steps. The main query limits the results to the REP depot, so technically I don't need to put that same criterion on the subquery, since repair_order is unique.
Would it be faster, slower or no difference to apply the depot filter on the subquery?
SELECT
    *,
    DATEDIFF(date_shipped, date_received) AS htg_days,
    (SELECT SUM(t3.total_days)
     FROM report_tables.cycle_time_days AS t3
     WHERE t1.repair_order = t3.repair_order
       AND t3.operation IN ('MFG', 'ENG', 'ENGH', 'HOLD')
     GROUP BY t3.repair_order) AS subt_days
FROM
    report_tables.cycle_time_days AS t1
WHERE
    YEAR(t1.date_shipped) = 2010
    AND t1.depot = 'REP'
GROUP BY
    repair_order
ORDER BY
    date_shipped;
I run into this in a lot of situations, but I never know whether it would be better to put the filter in the subquery, the main query, or both.
In this example, moving your WHERE clause's filter on REP into the subquery would actually alter the query, so it wouldn't be about performance at that point; it would be about getting the same result set. In general, though, if you will get exactly the same result set by moving a WHERE clause elsewhere in a complex query, it is better to do so at the most atomic level possible, i.e., in the subquery. Then the subquery returns a smaller result set to the main query before the main query has to process it.
The answer to your question will vary depending on your schema, the complexity of your queries, the reliability of your data, etc. A general rule of thumb is to try to process the least amount of data possible, which generally means filtering it at the lowest level possible as well.
When you want to optimize a query the absolute number one place to start is to use the EXPLAIN output to see what optimizations the query parser was able to figure out and check to see what the weakest link is in the query plan. Resolve that, rinse, repeat.
You can also use EXPLAIN's EXTENDED keyword to see the actual query it built to run, which will reveal more about how it uses your criteria. In some cases, it will optimize away duplicate conditions between parent and subqueries. In other cases, it may push the conditions down from the parent into the subquery. In some cases, for (too) complex queries, I've seen it repeat the condition when it was only specified once in the query. Thankfully, you don't have to guess; MySQL's EXPLAIN plan will reveal all, albeit sometimes in cryptic ways.
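For instance (EXPLAIN EXTENDED applies to the MySQL versions current at the time; on 5.7+ plain EXPLAIN behaves this way and EXTENDED is deprecated):
EXPLAIN EXTENDED
SELECT repair_order
FROM report_tables.cycle_time_days
WHERE YEAR(date_shipped) = 2010 AND depot = 'REP';
SHOW WARNINGS;  -- the Note contains the query as the optimizer rewrote it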
I usually use a derived table as a "driver or aggregating" query, then join that result back onto whatever table I want to pull data from:
select
t1.*,
datediff(t1.date_shipped, t1.date_received) as htg_days,
subt_days.total_days
from
cycle_time_days as t1
inner join
(
-- aggregating/driver query
select
repair_order,
sum(total_days) as total_days
from
cycle_time_days
where
year(date_shipped) = 2010 and depot = 'REP' and
operation in ('MFG','ENG','ENGH','HOLD') -- covering index on date, depot, op ???
group by
repair_order -- indexed ??
having
total_days > 14 -- added for demonstration purposes
order by
total_days desc limit 10
) as subt_days on t1.repair_order = subt_days.repair_order
order by
t1.date_shipped;

Should I COUNT(*) or not?

I know it's generally a bad idea to do queries like this:
SELECT * FROM `group_relations`
But when I just want the count, should I go for this query, since it allows the table to change but still yields the same results?
SELECT COUNT(*) FROM `group_relations`
Or the more specific
SELECT COUNT(`group_id`) FROM `group_relations`
I have a feeling the latter could potentially be faster, but are there any other things to consider?
Update: I am using InnoDB in this case, sorry for not being more specific.
If the column in question is NOT NULL, both of your queries are equivalent. When group_id contains null values,
select count(*)
will count all rows, whereas
select count(group_id)
will only count the rows where group_id is not null.
Also, some database systems, like MySQL, employ an optimization when you ask for COUNT(*), which makes such queries a bit faster than the column-specific one.
Personally, when just counting, I use COUNT(*) to be on the safe side with the NULLs.
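A tiny illustration of the difference (a sketch; assumes a throwaway table where group_id holds 1, NULL, and 2):
CREATE TEMPORARY TABLE demo (group_id INT NULL);
INSERT INTO demo (group_id) VALUES (1), (NULL), (2);
SELECT COUNT(*), COUNT(group_id) FROM demo;
-- Returns 3 and 2: COUNT(*) counts every row, COUNT(group_id) skips the NULL.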
If I remember it right, in MYSQL COUNT(*) counts all rows, whereas COUNT(column_name) counts only the rows that have a non-NULL value in the given column.
COUNT(*) counts all rows, while COUNT(column_name) will count only rows without NULL values in the specified column.
Important to note in MySQL:
COUNT() is very fast on MyISAM tables for * or NOT NULL columns, since the row count is cached. InnoDB has no row count caching, so there is no difference in performance between COUNT(*) and COUNT(column_name), regardless of whether the column can be NULL or not. You can read more on the differences in this post at the MySQL performance blog.
if you try SELECT COUNT(1) FROM `group_relations` it will be a bit faster because it will not try to retrieve information from your columns.
Edit: I just did some research and found out that this only happens in some DBs. In SQL Server it's the same whether you use 1 or *, but in Oracle it's faster to use 1.
http://social.msdn.microsoft.com/forums/en-US/transactsql/thread/9367c580-087a-4fc1-bf88-91a51a4ee018/
Apparently there is no difference between them in MySQL; like SQL Server, the parser appears to rewrite the query to COUNT(1). Sorry if I misled you in some way.
I was curious about this myself. It's all fine to read documentation and theoretical answers, but I like to balance those with empirical evidence.
I have a MySQL table (InnoDB) that has 5,607,997 records in it. The table is in my own private sandbox, so I know the contents are static and nobody else is using the server. I think this effectively removes all outside effects on performance. The table has an auto_increment primary key field (Id) that I know will never be null, which I will use for my WHERE clause test (WHERE Id IS NOT NULL).
The only other possible glitch I see in running tests is the cache. The first time a query is run will always be slower than subsequent queries that use the same indexes. I'll refer to that below as the cache-seeding call. Just to mix it up a little, I also ran it with a WHERE clause I know will always evaluate to true regardless of any data (TRUE = TRUE).
That said, here are my results:
QueryType | w/o WHERE          | WHERE Id IS NOT NULL | WHERE TRUE = TRUE
----------|--------------------|----------------------|-------------------
COUNT(*)  | 9 min 30.13 sec ++ | 6 min 16.68 sec ++   | 2 min 21.80 sec ++
COUNT(*)  | 6 min 13.34 sec    | 1 min 36.02 sec      | 2 min 0.11 sec
COUNT(*)  | 6 min 10.06 sec    | 1 min 33.47 sec      | 1 min 50.54 sec
COUNT(Id) | 5 min 59.87 sec    | 1 min 34.47 sec      | 2 min 3.96 sec
COUNT(Id) | 5 min 44.95 sec    | 1 min 13.09 sec      | 2 min 6.48 sec
COUNT(1)  | 6 min 49.64 sec    | 2 min 0.80 sec       | 2 min 11.64 sec
COUNT(1)  | 6 min 31.64 sec    | 1 min 41.19 sec      | 1 min 43.51 sec
++ This is the cache-seeding call. It is expected to be slower than the rest.
I'd say the results speak for themselves. COUNT(Id) usually edges out the others. Adding a WHERE clause dramatically decreases the access time, even if it's a clause you know will always evaluate to true. The sweet spot appears to be COUNT(Id) ... WHERE Id IS NOT NULL.
I would love to see other people's results, perhaps with smaller tables or with WHERE clauses against fields other than the field you're counting. I'm sure there are other variations I haven't taken into account.
Seek Alternatives
As you've seen, when tables grow large, COUNT queries get slow. I think the most important thing is to consider the nature of the problem you're trying to solve. For example, many developers use COUNT queries when generating pagination for large sets of records in order to determine the total number of pages in the result set.
Knowing that COUNT queries will grow slow, you could consider an alternative way to display pagination controls that simply allows you to side-step the slow query. Google's pagination is an excellent example.
Denormalize
If you absolutely must know the number of records matching a specific condition, consider the classic technique of data denormalization. Instead of counting the number of rows at lookup time, consider incrementing a counter on record insertion and decrementing it on record deletion.
If you decide to do this, consider using idempotent, transactional operations to keep those denormalized values in sync.
START TRANSACTION;
INSERT INTO `group_relations` (`group_id`) VALUES (1);
UPDATE `group_relations_count` SET `count` = `count` + 1;
COMMIT;
Alternatively, you could use database triggers if your RDBMS supports them.
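A sketch of what that could look like in MySQL, using the tables from the example above (the trigger names are made up):
DELIMITER //
CREATE TRIGGER group_relations_count_ins AFTER INSERT ON `group_relations`
FOR EACH ROW
  UPDATE `group_relations_count` SET `count` = `count` + 1;
//
CREATE TRIGGER group_relations_count_del AFTER DELETE ON `group_relations`
FOR EACH ROW
  UPDATE `group_relations_count` SET `count` = `count` - 1;
//
DELIMITER ;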
Depending on your architecture, it might make sense to use a caching layer like memcached to store, increment and decrement the denormalized value, and simply fall through to the slow COUNT query when the cache key is missing. This can reduce overall write-contention if you have very volatile data, though in cases like this, you'll want to consider solutions to the dog-pile effect.
MySQL MyISAM tables should have an optimisation for COUNT(*), skipping the full table scan.
The asterisk in COUNT has no bearing on the asterisk used to select all fields of a table. It's pure rubbish to say that COUNT(*) is slower than COUNT(field).
I intuit that SELECT COUNT(*) is faster than SELECT COUNT(field). If the RDBMS detects that you specified * in COUNT instead of a field, it doesn't need to evaluate anything before incrementing the count. Whereas if you specify a field in COUNT, the RDBMS will always evaluate whether that field is NULL before counting it.
But if your field is nullable, specify the field in COUNT.
COUNT(*) facts and myths:
MYTH: "InnoDB doesn't handle count(*) queries well":
Most count(*) queries are executed same way by all storage engines if you have a WHERE clause, otherwise you InnoDB will have to perform a full table scan.
FACT: InnoDB doesn't optimize count(*) queries without the where clause
It is best to count by an indexed column such as a primary key.
SELECT COUNT(`group_id`) FROM `group_relations`
It should depend on what you are actually trying to achieve, as Sebastian has already said -- i.e., make your intentions clear! If you are just counting the rows, go for COUNT(*); if you are counting a single column, go for COUNT(column).
It might be worth checking out your DB vendor too. Back when I used Informix, it had an optimisation for COUNT(*) which gave a query plan execution cost of 1, compared to counting single or multiple columns, which would result in a higher figure.
if you try SELECT COUNT(1) FROM group_relations it will be a bit faster because it will not try to retrieve information from your columns.
COUNT(1) used to be faster than COUNT(*), but that's not true anymore, since modern DBMSs are smart enough to know that you don't want to know about columns
The advice I got from MySQL about things like this is that, in general, trying to optimize a query based on tricks like this can be a curse in the long run. There are examples throughout MySQL's history where somebody's high-performance technique that relies on how the optimizer works ends up being the bottleneck in the next release.
Write the query that answers the question you're asking -- if you want a count of all rows, use COUNT(*); if you want a count of the non-NULL values in a column, use COUNT(col). Index appropriately, and leave the optimization to the optimizer. Trying to make your own query-level optimizations can sometimes make the built-in optimizer less effective.
That said, there are things you can do in a query to make it easier for the optimizer to speed it up, but I don't believe COUNT is one of them.
Edit: The statistics in the answer above are interesting, though. I'm not sure whether there is actually something at work in the optimizer in this case. I'm just talking about query-level optimizations in general.
I know it's generally a bad idea to do queries like this:
SELECT * FROM `group_relations`
But when I just want the count, should I go for this query, since that allows the table to change but still yields the same results?
SELECT COUNT(*) FROM `group_relations`
As your question implies, the reason SELECT * is ill-advised is that changes to the table could require changes in your code. That doesn't apply to COUNT(*). It's pretty rare to want the specialized behavior that SELECT COUNT(`group_id`) gives you; typically you want to know the number of records. That's what COUNT(*) is for, so use it.