MySQL performance: query vs view with 'explain' output

I'm trying to understand why a direct query takes ~0.5s to run but a view using the same query takes ~10s to run. This is on MySQL v5.6.27.
Direct Query:
select
a,b,
(select count(*) from TableA i3 where i3.b = i.a) as e,
func1(a) as f, func2(a) as g
from TableA i
where i.b = -1 and i.a > 1500;
Direct Query 'explain' Results:
id,select_type,table,type,possible_keys,key,key_len,ref,rows,Extra
1,PRIMARY,i,range,PRIMARY,PRIMARY,4,\N,3629,Using where
2,DEPENDENT SUBQUERY,i3,ALL,\N,\N,\N,\N,7259,Using where
The View's definition/query is the same without the 'where' clause...
select
a,b,
(select count(*) from TableA i3 where i3.b = i.a) as e,
func1(a) as f, func2(a) as g
from TableA i;
Query against the view:
select * from ViewA t where t.b = -1 and t.a > 1500;
Query of View 'explain' results:
id,select_type,table,type,possible_keys,key,key_len,ref,rows,Extra
1,PRIMARY,<derived2>,ALL,\N,\N,\N,\N,7259,Using where
2,DERIVED,i,ALL,\N,\N,\N,\N,7259,\N
3,DEPENDENT SUBQUERY,i3,ALL,\N,\N,\N,\N,7259,Using where
Why does the query against the view end up performing 3 full table scans whereas the direct query performs ~1.5?

The short answer is: the MySQL optimizer is not clever enough to do it.
When processing a view, MySQL can either merge the view or create a temporary table for it:
For MERGE, the text of a statement that refers to the view and the view definition are merged such that parts of the view definition replace corresponding parts of the statement.
For TEMPTABLE, the results from the view are retrieved into a temporary table, which then is used to execute the statement.
This applies in a very similar way to derived tables and subqueries too.
The behaviour you are seeking is the merge. This is the default value, and MySQL will try to use it whenever possible. If it is not possible (or rather: if MySQL thinks it is not possible), MySQL has to evaluate the complete view, no matter if you only need one row out of it. This obviously takes more time, and is what happens in your view.
There is a list of things that prevent MySQL from using the merge algorithm:
MERGE cannot be used if the view contains any of the following constructs:
Aggregate functions (SUM(), MIN(), MAX(), COUNT(), and so forth)
DISTINCT
GROUP BY
HAVING
LIMIT
UNION or UNION ALL
Subquery in the select list
Assignment to user variables
Refers only to literal values (in this case, there is no underlying table)
You can test whether MySQL will merge a view: try to create the view while specifying the merge algorithm explicitly:
create algorithm=merge view viewA as ...
If MySQL doesn't think it can merge the view, you get the warning
1 warning(s): 1354 View merge algorithm can't be used here for now (assumed undefined algorithm)
In your case, the subquery in the select list is preventing the merge. This is not because it would be impossible to do. You have already proven that it is possible to merge it, just by rewriting it.
But the MySQL optimizer didn't see that possibility. This is not specific to views: it will not merge it either if you use the unmerged view code directly as a derived table: explain select * from (select a, b, ... from TableA i) as ViewA where .... You would have to test this on MySQL 5.7, as MySQL 5.6 will not merge in this situation on principle (in a query, it assumes you want a temporary table here, even for very simple derived tables that could be merged). MySQL 5.7 will try to merge by default, although it won't work with your view.
As the optimizer is improved, it will in some situations merge even when there is a subquery in the select list, so there are some exceptions to that list. MariaDB, which is based on MySQL, is actually a lot better at the merge optimization and would merge your view just like you did, so it is possible for a machine to do it.
So to summarize: the MySQL optimizer is currently not clever enough to do it.
And you unfortunately cannot do much about it, except testing whether MySQL accepts algorithm=merge, then avoiding views that MySQL cannot merge and merging them yourself instead.
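The check described above can be put together roughly like this. It is only a sketch: ViewA_merge_test is a made-up scratch name, while TableA, func1 and func2 are taken from the question and are assumed to exist.

```sql
-- Hypothetical scratch view; the body is the view definition from the question.
CREATE OR REPLACE ALGORITHM = MERGE VIEW ViewA_merge_test AS
SELECT
    a, b,
    (SELECT COUNT(*) FROM TableA i3 WHERE i3.b = i.a) AS e,
    func1(a) AS f, func2(a) AS g
FROM TableA i;

-- If MySQL cannot merge the view, warning 1354 should be listed here.
SHOW WARNINGS;

-- A DERIVED row in the plan means the view was materialized, not merged.
EXPLAIN SELECT * FROM ViewA_merge_test t WHERE t.b = -1 AND t.a > 1500;

-- Clean up the scratch view.
DROP VIEW ViewA_merge_test;
```

If the warning shows up, the direct query with the where clause written inline (as in the question) is the hand-merged form to use instead.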

Related

why there is performance difference when retrieving data from view vs underlying select of that view

I am querying a view with a single predicate, which returns the records in 4-7 seconds, but when I retrieve the records with the same predicate directly from the view's underlying query, I get them in less than a second. I am using MySQL.
I have checked the execution plans of both queries, and they differ substantially when the tables hold hundreds of thousands of records.
So, any clue or idea why performance is better when using the query directly?
Following is my view definition
SELECT entity_info.source_entity_info_id AS event_sync_id,
entity_info.source_system_id AS source_system_id,
entity_info.target_system_id AS destination_system_id,
event_sync_info.integrationid AS integration_id,
event_sync_info.source_update_time AS last_updated,
entity_info.source_internal_id AS source_entity_internal_id,
entity_info.source_entity_project AS source_entity_project,
entity_info.target_internal_id AS destination_entity_internal_id,
entity_info.destination_entity_project AS destination_entity_project,
entity_info.source_entity_type AS source_entity_type,
entity_info.destination_entity_type AS destination_entity_type,
event_sync_info.opshub_update_time AS opshub_update_time,
event_sync_info.entity_info_id AS entity_info_id,
entity_info.global_id AS global_id,
entity_info.target_entity_info_id AS target_entity_info_id,
entity_info.source_entity_info_id AS source_entity_info_id,
(
SELECT Count(0) AS `count(*)`
FROM ohrv_failed_event_view_count failed_event_view
WHERE ((
failed_event_view.integration_id = event_sync_info.integrationid)
AND (
failed_event_view.entityinfo = entity_info.source_entity_info_id))) AS no_of_failures
FROM (ohrv_entity_info entity_info
LEFT JOIN ohmt_eai_event_sync_info event_sync_info
ON ((
entity_info.source_entity_info_id = event_sync_info.entity_info_id)))
WHERE (
entity_info.source_entity_info_id IS NOT NULL)
Query examples
select * from view where integration_id=10
The execution plan for this processes 142668 rows for the subquery in the view.
select QUERY_OF_VIEW and integration_id=10
The execution plan for this looks good: only the required rows are processed.
I think the issue is in the following query:
SELECT * FROM view WHERE integration_id = 10;
This forces MySQL to materialize an intermediate table, against which it then has to query again to apply the restriction in the WHERE clause. On the other hand, in the second version:
SELECT (QUERY_OF_VIEW with WHERE integration_id = 10)
MySQL does not have to materialize anything other than the query in the view itself. That is, in your second version MySQL just has to execute the query in the view, without any subsequent subquery.
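As a rough illustration of that second version, here is the view's query with the restriction written inline. This is an abbreviated sketch: only a few of the view's columns are listed, and the filter is placed on event_sync_info.integrationid, since the integration_id alias cannot be referenced in the WHERE clause of the same query.

```sql
-- Abbreviated form of the view's query; table and column names are from the question.
EXPLAIN
SELECT entity_info.source_entity_info_id AS event_sync_id,
       event_sync_info.integrationid     AS integration_id,
       (SELECT COUNT(*)
          FROM ohrv_failed_event_view_count failed_event_view
         WHERE failed_event_view.integration_id = event_sync_info.integrationid
           AND failed_event_view.entityinfo = entity_info.source_entity_info_id) AS no_of_failures
  FROM ohrv_entity_info entity_info
  LEFT JOIN ohmt_eai_event_sync_info event_sync_info
         ON entity_info.source_entity_info_id = event_sync_info.entity_info_id
 WHERE entity_info.source_entity_info_id IS NOT NULL
   -- The restriction is applied to the base tables, so the correlated
   -- subquery only has to run for the rows that survive it.
   AND event_sync_info.integrationid = 10;
```

Comparing this plan with the plan of the query against the view should show where the extra rows come from.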
According to the documentation linked here, it depends on whether the MERGE algorithm can be used: if it can, MySQL uses it; if it is not applicable, a new temporary table must be generated to resolve the data. See also this answer, which discusses optimization and when you should and should not use a view.
If the MERGE algorithm cannot be used, a temporary table must be used
instead. MERGE cannot be used if the view contains any of the
following constructs:
Aggregate functions (SUM(), MIN(), MAX(), COUNT(), and so forth)
DISTINCT
GROUP BY
HAVING
LIMIT
UNION or UNION ALL
Subquery in the select list
Refers only to literal values (in this case, there is no underlying
table)

EXISTS vs ALL, ANY, SOME

I'm trying to understand the difference between EXISTS and ALL in MySQL. Let me give you an example:
SELECT *
FROM table1
WHERE NOT EXISTS (
SELECT *
FROM table2
WHERE table2.val < table1.val
);
SELECT *
FROM table1
WHERE val <= ALL( SELECT val FROM table2 );
A quote from MySQL docs:
Traditionally, an EXISTS subquery starts with SELECT *, but it could
begin with SELECT 5 or SELECT column1 or anything at all. MySQL
ignores the SELECT list in such a subquery, so it makes no difference. [1]
Reading this, it seems to me that mysql should be able to translate both queries to the same relational algebra expression. Both queries are just a simple comparison between values from two tables. However, that doesn't seem to be the case. I tried both queries and the second one performs much better than the first one.
How exactly are these queries handled by the optimizer?
Why can't the optimizer make the first query perform like the second one?
Is it always more efficient to use an ALL/ANY/SOME condition?
The queries in your question are not equivalent, so they will have different execution plans regardless of how well they're optimized. If you used NOT val > ANY(...) then it would be equivalent.
You should always use EXPLAIN to see the execution plan of a query and realize that the execution plan can change as your data changes. Testing and understanding the execution plan will help you determine which methods perform better. There is no hard and fast rule for ALL/ANY/SOME and they're often optimized down to an EXISTS.
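One case where the two queries from the question return different rows is when table2.val contains a NULL. The sketch below assumes a scratch schema where table1 and table2 can be created with a single INT column each; the sample data is made up.

```sql
-- Hypothetical demo data; the table and column names follow the question.
CREATE TABLE table1 (val INT);
CREATE TABLE table2 (val INT);
INSERT INTO table1 (val) VALUES (1), (3);
INSERT INTO table2 (val) VALUES (2), (NULL);

-- NOT EXISTS: the NULL row never satisfies table2.val < table1.val,
-- so it is simply ignored. Returns the row with val = 1.
SELECT *
FROM table1
WHERE NOT EXISTS (
    SELECT *
    FROM table2
    WHERE table2.val < table1.val
);

-- ALL: 1 <= NULL is unknown, so 1 <= ALL (...) is unknown and the row is dropped;
-- 3 <= 2 is false. Returns no rows.
SELECT *
FROM table1
WHERE val <= ALL (SELECT val FROM table2);
```

If val is declared NOT NULL in both tables this difference goes away, but the optimizer still has to plan for whatever the column definitions allow.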

Check if MySQL Table is empty: COUNT(*) is zero vs. LIMIT(0,1) has a result?

This is a simple question about efficiency specifically related to the MySQL implementation. I want to just check if a table is empty (and if it is empty, populate it with the default data). Would it be best to use a statement like SELECT COUNT(*) FROM `table` and then compare to 0, or would it be better to do a statement like SELECT `id` FROM `table` LIMIT 0,1 then check if any results were returned (the result set has next)?
Although I need this for a project I am working on, I am also interested in how MySQL works with those two statements and whether the reason people seem to suggest using COUNT(*) is because the result is cached or whether it actually goes through every row and adds to a count as it would intuitively seem to me.
You should definitely go with the second query rather than the first.
When using COUNT(*), MySQL is scanning at least an index and counting the records. Even if you would wrap the call in a LEAST() (SELECT LEAST(COUNT(*), 1) FROM table;) or an IF(), MySQL will fully evaluate COUNT() before evaluating further. I don't believe MySQL caches the COUNT(*) result when InnoDB is being used.
Your second query results in only one row being read, furthermore an index is used (assuming id is part of one). Look at the documentation of your driver to find out how to check whether any rows have been returned.
By the way, the id field may be omitted from the query (MySQL will use an arbitrary index):
SELECT 1 FROM table LIMIT 1;
However, I think the simplest and most performant solution is the following (as indicated in Gordon's answer):
SELECT EXISTS (SELECT 1 FROM table);
EXISTS returns 1 if the subquery returns any rows, otherwise 0. Because of this semantic MySQL can optimize the execution properly.
Any fields listed in the subquery are ignored, thus 1 or * is commonly written.
See the MySQL Manual for more info on the EXISTS keyword and its use.
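A small sketch of the EXISTS check, together with one possible way to populate defaults only when the table is empty. The table demo_defaults and its default row are made up for illustration, and the conditional insert is not safe against two sessions running it at the same moment.

```sql
-- Hypothetical table; run this in a scratch schema.
CREATE TABLE demo_defaults (
    id   INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(50) NOT NULL
);

-- Returns 0: the table is empty.
SELECT EXISTS (SELECT 1 FROM demo_defaults) AS has_rows;

-- Insert the default row only if the table has no rows yet.
INSERT INTO demo_defaults (name)
SELECT 'default' FROM DUAL
WHERE NOT EXISTS (SELECT 1 FROM demo_defaults);

-- Returns 1: the default row is now there.
SELECT EXISTS (SELECT 1 FROM demo_defaults) AS has_rows;
```

Running the INSERT a second time adds nothing, because the NOT EXISTS condition is then false.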
It is better to do the second method or just exists. Specifically, something like:
if exists (select id from table)
should be the fastest way to do what you want. You don't need the limit; the SQL engine takes care of that for you.
By the way, never put identifiers (table and column names) in single quotes.

On the efficiency of using views of views in MySQL

I have recently been introduced to the concept of views and I am finding them a great help for splitting complex queries into parts.
My question is whether there are any efficiency disadvantages when I start making queries from views which are in turn queries on other views, etc...
So I would for example have:
view1 -> query from tables A, B & C
view2 -> query from tables D, E & F
view3 -> query joining view1 & view2
Will there be any speed disadvantage when querying view3 instead of designing a single query that joins tables A, B, C, D, E & F?
And in case I choose the views approach, does it matter whether I have ORDER BY clauses in the design of view1, view2 & view3, or is it better that I don't put any ORDER BY clause in any of the views and just use ORDER BY when I query view3?
Thank you very much for your help!
Boga.
For order by see CREATE VIEW Syntax
ORDER BY is permitted in a view definition, but it is ignored if you select from a view using a statement that has its own ORDER BY.
And here on View Processing Algorithms, you can see how MySQL processes selects on views. As always, it depends. ;-)
It seems that the MERGE algorithm is the most efficient, because algorithm temptable copies the view results to a temporary table first and does the query on that. But you cannot always use merge, see the last section
If the MERGE algorithm cannot be used, a temporary table must be used instead. MERGE cannot be used if the view contains any of the following constructs:
Aggregate functions (SUM(), MIN(), MAX(), COUNT(), and so forth)
DISTINCT
GROUP BY
HAVING
LIMIT
UNION or UNION ALL
Subquery in the select list
Refers only to literal values (in this case, there is no underlying table)
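A minimal sketch of the layout from the question, assuming made-up tables table_a and table_d in place of A..F. None of the views uses a construct from the list above, so they should be mergeable; the ORDER BY is kept out of the views and applied once in the final query.

```sql
-- Hypothetical schema standing in for the question's tables.
CREATE TABLE table_a (id INT PRIMARY KEY, label VARCHAR(20));
CREATE TABLE table_d (id INT PRIMARY KEY, a_id INT, amount INT);

-- No ORDER BY inside the views.
CREATE VIEW view1 AS SELECT id, label FROM table_a;
CREATE VIEW view2 AS SELECT id, a_id, amount FROM table_d;
CREATE VIEW view3 AS
SELECT v1.id, v1.label, v2.amount
FROM view1 v1
JOIN view2 v2 ON v2.a_id = v1.id;

-- Sort once, in the outermost query.
SELECT * FROM view3 ORDER BY amount DESC;

-- If no DERIVED rows show up in the plan, the nested views were merged.
EXPLAIN SELECT * FROM view3 WHERE id = 1;
```

Adding, say, a GROUP BY to view1 and running the EXPLAIN again is a quick way to see the temptable behaviour for comparison.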

sql group by versus distinct

Why would someone use a group by versus distinct when there are no aggregations done in the query?
Also, does someone know the group by versus distinct performance considerations in MySQL and SQL Server. I'm guessing that SQL Server has a better optimizer and they might be close to equivalent there, but in MySQL, I expect a significant performance advantage to distinct.
I'm interested in dba answers.
EDIT:
Bill's post is interesting, but not applicable. Let me be more specific...
select a, b, c
from table x
group by a, b,c
versus
select distinct a,b,c
from table x
GROUP BY maps groups of rows to one row, per distinct value in specific columns, which don't even necessarily have to be in the select-list.
SELECT b, c, d FROM table1 GROUP BY a;
This query is legal SQL (correction: only in MySQL; actually it's not standard SQL and not supported by other brands). MySQL accepts it, and it trusts that you know what you're doing, selecting b, c, and d in an unambiguous way because they're functional dependencies of a.
However, Microsoft SQL Server and other brands don't allow this query, because it can't determine the functional dependencies easily. edit: Instead, standard SQL requires you to follow the Single-Value Rule, i.e. every column in the select-list must either be named in the GROUP BY clause or else be an argument to a set function.
Whereas DISTINCT always looks at all columns in the select-list, and only those columns. It's a common misconception that DISTINCT allows you to specify the columns:
SELECT DISTINCT(a), b, c FROM table1;
Despite the parentheses making DISTINCT look like a function call, it is not. It's a query option, and a distinct value in any of the three fields of the select-list will lead to a distinct row in the query result. One of the expressions in this select-list has parentheses around it, but this won't affect the result.
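This is easy to check on a throwaway table. The table distinct_demo and its rows below are made up for illustration.

```sql
-- Hypothetical table: column a has a single value, column b has two.
CREATE TABLE distinct_demo (a INT, b INT);
INSERT INTO distinct_demo (a, b) VALUES (1, 1), (1, 2), (1, 2);

-- Returns two rows, (1, 1) and (1, 2): the parentheses around a change nothing.
SELECT DISTINCT(a), b FROM distinct_demo;

-- Same two rows.
SELECT DISTINCT a, b FROM distinct_demo;

-- Grouping by every selected column also yields the same two rows.
SELECT a, b FROM distinct_demo GROUP BY a, b;
```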
A little (VERY little) empirical data from MS SQL Server, on a couple of random tables from our DB.
For the pattern:
SELECT col1, col2 FROM table GROUP BY col1, col2
and
SELECT DISTINCT col1, col2 FROM table
When there's no covering index for the query, both ways produced the following query plan:
|--Sort(DISTINCT ORDER BY:([table].[col1] ASC, [table].[col2] ASC))
     |--Clustered Index Scan(OBJECT:([db].[dbo].[table].[IX_some_index]))
and when there was a covering index, both produced:
|--Stream Aggregate(GROUP BY:([table].[col1], [table].[col2]))
     |--Index Scan(OBJECT:([db].[dbo].[table].[IX_some_index]), ORDERED FORWARD)
so from that very small sample SQL Server certainly treats both the same.
In MySQL I've found using a GROUP BY is often better in performance than DISTINCT.
Running "EXPLAIN SELECT DISTINCT ..." shows "Using where; Using temporary", meaning MySQL will create a temporary table,
whereas "EXPLAIN SELECT a, b, c FROM T1, T2 WHERE T2.A = T1.A GROUP BY a" shows just "Using where".
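Whether this holds depends on the MySQL version, the indexes and the data, so it is worth comparing the plans on your own tables. A sketch of such a comparison, using a made-up table plan_demo and a made-up index name:

```sql
-- Hypothetical table for comparing the two plans.
CREATE TABLE plan_demo (a INT, b INT, c INT);

-- Without an index: compare the Extra column of the two plans.
EXPLAIN SELECT DISTINCT a, b, c FROM plan_demo;
EXPLAIN SELECT a, b, c FROM plan_demo GROUP BY a, b, c;

-- With a covering index the plans may change; run both again to compare.
ALTER TABLE plan_demo ADD INDEX idx_abc (a, b, c);
EXPLAIN SELECT DISTINCT a, b, c FROM plan_demo;
EXPLAIN SELECT a, b, c FROM plan_demo GROUP BY a, b, c;
```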
Both would generate the same query plan in MS SQL Server.... If you have MS SQL Server you could just enable the actual execution plan to see which one is better for your needs ...
Please have a look at those posts:
http://blog.sqlauthority.com/2007/03/29/sql-server-difference-between-distinct-and-group-by-distinct-vs-group-by/
http://www.sqlmag.com/Article/ArticleID/24282/sql_server_24282.html
If you really are looking for distinct values, DISTINCT makes the source code more readable (for example, if it's part of a stored procedure). If I'm writing ad-hoc queries, I'll usually start with GROUP BY even if I have no aggregations, because I'll often end up adding them.