I have recently been introduced to the concept of views and I am finding them a great help for splitting complex queries into parts.
My question is whether there are any efficiency disadvantages when I start querying views that are themselves built on queries against other views, and so on.
So I would for example have:
view1 -> query from tables A, B & C
view2 -> query from tables D, E & F
view3 -> query joining view1 & view2
Will there be any speed disadvantage when querying view3 instead of designing a single query that joins tables A, B, C, D, E & F?
And if I choose the views approach, does it matter whether I have ORDER BY clauses in the definitions of view1, view2 & view3, or is it better not to put an ORDER BY clause in any of the views and use ORDER BY only when I query view3?
Thank you very much for your help!
Boga.
For order by see CREATE VIEW Syntax
ORDER BY is permitted in a view definition, but it is ignored if you select from a view using a statement that has its own ORDER BY.
And here on View Processing Algorithms, you can see how MySQL processes selects on views. As always, it depends. ;-)
It seems that the MERGE algorithm is the most efficient, because the TEMPTABLE algorithm first copies the view results to a temporary table and then runs the query against that. But you cannot always use MERGE; see the last section:
If the MERGE algorithm cannot be used, a temporary table must be used instead. MERGE cannot be used if the view contains any of the following constructs:
Aggregate functions (SUM(), MIN(), MAX(), COUNT(), and so forth)
DISTINCT
GROUP BY
HAVING
LIMIT
UNION or UNION ALL
Subquery in the select list
Refers only to literal values (in this case, there is no underlying table)
I came up with this solution in my class by piecing together Internet knowledge. Please break this down for me; I would love to know how I made it work. Specifically, the t.s and the closing t.
SELECT
CourseType,
GPA,
NumberOfStudents * 100 / t.s AS `Percentage of Students`
FROM View1
CROSS JOIN
(
SELECT
SUM(NumberOfStudents) AS s
FROM View1) t;
Your query uses a subquery. A subquery is a query that runs inside another query. In your case, the subquery is:
(
SELECT
SUM(NumberOfStudents) AS s
FROM View1)
When you use a subquery in the FROM clause (a derived table), you need to give it an alias. An alias is just a name you give the subquery so you can refer to it in the main query.
In your example, you named your subquery "t".
Fields can also have aliases. In your subquery, you created a field SUM(NumberOfStudents), and you named it s.
Going back to your question, you use the aliases to address fields inside the subquery. In your case, when you write NumberOfStudents * 100 / t.s you are basically saying:
"I want to divide NumberOfStudents * 100 by the field s from my subquery t".
The other important concept in your query is the cross join. A cross join is the Cartesian product of two tables.
You can find a great and intuitive explanation of how a cross join works in the following link:
https://www.sqlshack.com/sql-cross-join-with-examples/#:~:text=The%20CROSS%20JOIN%20is%20used,also%20known%20as%20cartesian%20join.&text=The%20main%20idea%20of%20the,product%20of%20the%20joined%20tables.
In this case, the use is simpler than that. Your subquery returns only one row with one value, which is the sum of all students. Since a cross join pairs every row of one table with every row of the other, your cross join just provides a way to use the total number of students as a constant value when calculating the percentage of students in the main query.
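To make the "constant value" idea concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL, and View1 is a plain table with made-up sample rows (both are assumptions for illustration, not part of the original question). It multiplies by 100.0 because SQLite would otherwise do integer division, which MySQL's / operator does not.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL, and View1 is a plain table
# holding hypothetical sample data instead of a real view.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE View1 (CourseType TEXT, GPA REAL, NumberOfStudents INTEGER);
    INSERT INTO View1 VALUES
        ('Online', 3.1, 20),
        ('Campus', 3.4, 30),
        ('Hybrid', 3.0, 50);
""")

# The subquery on its own returns exactly one row with one column, s.
totals = conn.execute(
    "SELECT SUM(NumberOfStudents) AS s FROM View1"
).fetchall()

# Cross joining 3 rows with that 1 row gives 3 x 1 = 3 rows, each
# carrying the same total in t.s.
rows = conn.execute("""
    SELECT CourseType, NumberOfStudents, t.s,
           NumberOfStudents * 100.0 / t.s AS pct
    FROM View1
    CROSS JOIN (SELECT SUM(NumberOfStudents) AS s FROM View1) t
    ORDER BY CourseType
""").fetchall()

for row in rows:
    print(row)
```

Every output row carries the same t.s value next to its own NumberOfStudents, which is all the cross join contributes here.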
A better way to do this uses window functions:
SELECT v.CourseType, v.GPA,
v.NumberOfStudents * 100 / SUM(v.NumberOfStudents) OVER () AS Percentage_of_Students
FROM View1 v;
If you are learning SQL, you might as well learn the correct way to express logic.
Notes:
Use meaningful table aliases (abbreviations for the table/view names).
Qualify column references. This is less important in a query with only one table reference, but it is a good habit.
Window functions allow you to summarize data across multiple rows, without using an explicit JOIN.
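As a rough check that the two forms agree, the sketch below runs both against the same hypothetical data. It assumes Python's sqlite3 module with SQLite 3.25 or later as a stand-in (MySQL itself only supports window functions from version 8.0), and it uses 100.0 to avoid SQLite's integer division.

```python
import sqlite3

# Assumption: SQLite (3.25+ for window functions) stands in for MySQL,
# and View1 is a plain table with made-up sample rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE View1 (CourseType TEXT, GPA REAL, NumberOfStudents INTEGER);
    INSERT INTO View1 VALUES
        ('Online', 3.1, 20),
        ('Campus', 3.4, 30),
        ('Hybrid', 3.0, 50);
""")

# Cross join against a one-row derived table.
cross_join_rows = conn.execute("""
    SELECT CourseType, NumberOfStudents * 100.0 / t.s
    FROM View1
    CROSS JOIN (SELECT SUM(NumberOfStudents) AS s FROM View1) t
    ORDER BY CourseType
""").fetchall()

# Window function: SUM(...) OVER () is the total over all rows,
# made available on every row without a join.
window_rows = conn.execute("""
    SELECT v.CourseType,
           v.NumberOfStudents * 100.0 / SUM(v.NumberOfStudents) OVER ()
    FROM View1 v
    ORDER BY v.CourseType
""").fetchall()

print(cross_join_rows == window_rows)
```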
I am querying a view with a single predicate, and it returns the record in 4-7 seconds. When I retrieve the record with the same predicate by running the view's underlying query directly, it returns in less than a second. I am using MySQL.
I have checked the execution plans of both queries, and they differ substantially when the tables hold hundreds of thousands of records.
So, any clue or idea why performance is better when using the query directly?
Following is my view definition
SELECT entity_info.source_entity_info_id AS event_sync_id,
entity_info.source_system_id AS source_system_id,
entity_info.target_system_id AS destination_system_id,
event_sync_info.integrationid AS integration_id,
event_sync_info.source_update_time AS last_updated,
entity_info.source_internal_id AS source_entity_internal_id,
entity_info.source_entity_project AS source_entity_project,
entity_info.target_internal_id AS destination_entity_internal_id,
entity_info.destination_entity_project AS destination_entity_project,
entity_info.source_entity_type AS source_entity_type,
entity_info.destination_entity_type AS destination_entity_type,
event_sync_info.opshub_update_time AS opshub_update_time,
event_sync_info.entity_info_id AS entity_info_id,
entity_info.global_id AS global_id,
entity_info.target_entity_info_id AS target_entity_info_id,
entity_info.source_entity_info_id AS source_entity_info_id,
(
SELECT Count(0) AS `count(*)`
FROM ohrv_failed_event_view_count failed_event_view
WHERE ((
failed_event_view.integration_id = event_sync_info.integrationid)
AND (
failed_event_view.entityinfo = entity_info.source_entity_info_id))) AS no_of_failures
FROM (ohrv_entity_info entity_info
LEFT JOIN ohmt_eai_event_sync_info event_sync_info
ON ((
entity_info.source_entity_info_id = event_sync_info.entity_info_id)))
WHERE (
entity_info.source_entity_info_id IS NOT NULL)
Query examples
select * from view where integration_id=10
The execution plan for this processes 142668 rows for the subquery in the view.
select QUERY_OF_VIEW and integration_id=10
The execution plan for this looks good; only the required rows are processed.
I think the issue is in the following query:
SELECT * FROM view WHERE integration_id = 10;
This forces MySQL to materialize an intermediate table, against which it then has to query again to apply the restriction in the WHERE clause. On the other hand, in the second version:
SELECT (QUERY_OF_VIEW with WHERE integration_id = 10)
MySQL does not have to materialize anything other than the query in the view itself. That is, in your second version MySQL just has to execute the query in the view, without any subsequent subquery.
Referring to the documentation, you can see that it depends: if the MERGE algorithm can be used, it will be; if it is not applicable, a new temporary table must be generated to resolve the data. You can also see this answer, which discusses optimization and when you should and should not use a view.
If the MERGE algorithm cannot be used, a temporary table must be used instead. MERGE cannot be used if the view contains any of the following constructs:
Aggregate functions (SUM(), MIN(), MAX(), COUNT(), and so forth)
DISTINCT
GROUP BY
HAVING
LIMIT
UNION or UNION ALL
Subquery in the select list
Refers only to literal values (in this case, there is no underlying table)
I'm trying to understand why a direct query takes ~0.5s to run but a view using the same query takes ~10s to run. MySQL v5.6.27.
Direct Query:
select
a,b,
(select count(*) from TableA i3 where i3.b = i.a) as e,
func1(a) as f, func2(a) as g
from TableA i
where i.b = -1 and i.a > 1500;
Direct Query 'explain' Results:
id,select_type,table,type,possible_keys,key,key_len,ref,rows,Extra
1,PRIMARY,i,range,PRIMARY,PRIMARY,4,\N,3629,Using where
2,DEPENDENT SUBQUERY,i3,ALL,\N,\N,\N,\N,7259,Using where
The View's definition/query is the same without the 'where' clause...
select
a,b,
(select count(*) from TableA i3 where i3.b = i.a) as e,
func1(a) as f, func2(a) as g
from TableA i;
Query against the view:
select * from ViewA t where t.b = -1 and t.a > 1500;
Query of View 'explain' results:
id,select_type,table,type,possible_keys,key,key_len,ref,rows,Extra
1,PRIMARY,<derived2>,ALL,\N,\N,\N,\N,7259,Using where
2,DERIVED,i,ALL,\N,\N,\N,\N,7259,\N
3,DEPENDENT SUBQUERY,i3,ALL,\N,\N,\N,\N,7259,Using where
Why does the query against the view end up performing 3 full table scans whereas the direct query performs ~1.5?
The short answer is: the MySQL optimizer is not clever enough to do it.
When processing a view, MySQL can either merge the view or create a temporary table for it:
For MERGE, the text of a statement that refers to the view and the view definition are merged such that parts of the view definition replace corresponding parts of the statement.
For TEMPTABLE, the results from the view are retrieved into a temporary table, which then is used to execute the statement.
This applies in a very similar way to derived tables and subqueries too.
The behaviour you are seeking is the merge. This is the default value, and MySQL will try to use it whenever possible. If it is not possible (or rather: if MySQL thinks it is not possible), MySQL has to evaluate the complete view, no matter if you only need one row out of it. This obviously takes more time, and is what happens in your view.
There is a list of things that prevent MySQL from using the merge algorithm:
MERGE cannot be used if the view contains any of the following constructs:
Aggregate functions (SUM(), MIN(), MAX(), COUNT(), and so forth)
DISTINCT
GROUP BY
HAVING
LIMIT
UNION or UNION ALL
Subquery in the select list
Assignment to user variables
Refers only to literal values (in this case, there is no underlying table)
You can test whether MySQL will merge the view: try to create it while specifying the merge algorithm:
create algorithm=merge view viewA as ...
If MySQL doesn't think it can merge the view, you get the warning
1 warning(s): 1354 View merge algorithm can't be used here for now (assumed undefined algorithm)
In your case, the subquery in the select list is preventing the merge. This is not because it would be impossible to do. You have already proven that it is possible to merge it, simply by rewriting it.
But the MySQL optimizer didn't see that possibility. It is not specific to views: it will not merge it either if you use the unmerged view code directly as a derived table: explain select * from (select a, b, ... from TableA i) as ViewA where .... You would have to test this on MySQL 5.7, as MySQL 5.6 will not merge in this situation on principle (in a query, it assumes you want a temptable here, even for very simple derived tables that could be merged). MySQL 5.7 will try to do it by default, although it won't work with your view.
As the optimizer improves, in some situations it will merge even when there is a subquery in the select list, so there are some exceptions to that list. MariaDB, which is based on MySQL, is actually a lot better at the merge optimization and would merge your view just like you did, so it is possible even for a machine.
So to summarize: the MySQL optimizer is currently not clever enough to do it.
And unfortunately you cannot do much about it, except test whether MySQL accepts algorithm=merge, avoid views that MySQL cannot merge, and merge those yourself instead.
I'm maintaining a legacy app where the SQL queries look like the devil's handiwork. Before I give up and rewrite the whole damn thing, I'm hoping to use better tools to make a surgical change :)
I want to know which "WHERE" clause resulted in the inclusion of certain rows when there are a lot of ORs. Even better, why certain rows were excluded from the result set.
(Specifically I am using MySQL)
Move your where condition to a case statement. Break up each 'OR' into its own column and the result will show you which returned true and false.
I haven't heard of a tool that would do this for you.
This is a variation on Pirion's approach.
Instead of using a case statement, put each condition of the where clause in the select statement. So, a condition like:
select t.*
from t
where A or B or C or D
would become:
select t.*,
(Acomp) as A,
(Bcomp) as B,
(Ccomp) as C,
(Dcomp) as D
from t
where Acomp or Bcomp or Ccomp or Dcomp
MySQL has the nice feature that a boolean is returned as 0 (false) or 1 (true). This allows you to see all the conditions that a given row matches.
You might then be able to simplify the logic by removing or combining conditions.
If the conditions are computationally intensive or very long (such as using subqueries), you might want to do this using a subquery:
select t.*
from (select t.*,
(Acomp) as A,
(Bcomp) as B,
(Ccomp) as C,
(Dcomp) as D
from t
) t
where A or B or C or D;
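Here is a small runnable sketch of the flag-column idea. It assumes Python's sqlite3 module as a stand-in for MySQL (SQLite also returns 0 or 1 for comparisons), and the table t, its columns x and y, and the three conditions are all hypothetical. Dropping the WHERE clause shows the rows that were excluded, with every flag at 0.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; table t and the three
# conditions below are hypothetical placeholders for Acomp, Bcomp, Ccomp.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (id INTEGER, x INTEGER, y TEXT);
    INSERT INTO t VALUES
        (1, 15, 'red'),
        (2, 5, 'red'),
        (3, -1, 'blue'),
        (4, 5, 'blue');
""")

# Each OR branch is repeated in the select list as a 0/1 flag column.
included = conn.execute("""
    SELECT id,
           (x > 10)    AS A,
           (y = 'red') AS B,
           (x < 0)     AS C
    FROM t
    WHERE x > 10 OR y = 'red' OR x < 0
    ORDER BY id
""").fetchall()

# Without the WHERE clause, excluded rows appear too, with all flags 0.
all_rows = conn.execute("""
    SELECT id, (x > 10) AS A, (y = 'red') AS B, (x < 0) AS C
    FROM t
    ORDER BY id
""").fetchall()
excluded = [row for row in all_rows if not any(row[1:])]

print(included)
print(excluded)
```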
Why would someone use a group by versus distinct when there are no aggregations done in the query?
Also, does someone know the group by versus distinct performance considerations in MySQL and SQL Server? I'm guessing that SQL Server has a better optimizer and they might be close to equivalent there, but in MySQL, I expect a significant performance advantage to distinct.
I'm interested in dba answers.
EDIT:
Bill's post is interesting, but not applicable. Let me be more specific...
select a, b, c
from table x
group by a, b,c
versus
select distinct a,b,c
from table x
GROUP BY maps groups of rows to one row, per distinct value in specific columns, which don't even necessarily have to be in the select-list.
SELECT b, c, d FROM table1 GROUP BY a;
This query is legal SQL (correction: only in MySQL; actually it's not standard SQL and not supported by other brands). MySQL accepts it, and it trusts that you know what you're doing, selecting b, c, and d in an unambiguous way because they're functional dependencies of a.
However, Microsoft SQL Server and other brands don't allow this query, because it can't determine the functional dependencies easily. edit: Instead, standard SQL requires you to follow the Single-Value Rule, i.e. every column in the select-list must either be named in the GROUP BY clause or else be an argument to a set function.
Whereas DISTINCT always looks at all columns in the select-list, and only those columns. It's a common misconception that DISTINCT allows you to specify the columns:
SELECT DISTINCT(a), b, c FROM table1;
Despite the parentheses making DISTINCT look like a function call, it is not. It's a query option, and a distinct value in any of the three fields of the select-list will lead to a distinct row in the query result. One of the expressions in this select-list has parentheses around it, but this won't affect the result.
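A quick way to see this is to run the variants side by side. The sketch below assumes Python's sqlite3 module as a stand-in engine and uses made-up rows in table1; the point it illustrates is that the parentheses change nothing, and that column a is not de-duplicated on its own.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; the rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (a INTEGER, b TEXT, c TEXT);
    INSERT INTO table1 VALUES
        (1, 'x', 'p'),
        (1, 'x', 'p'),
        (1, 'y', 'p'),
        (2, 'x', 'p');
""")

# DISTINCT(a) is just DISTINCT followed by the parenthesized expression (a).
with_parens = conn.execute(
    "SELECT DISTINCT(a), b, c FROM table1 ORDER BY a, b, c"
).fetchall()
without_parens = conn.execute(
    "SELECT DISTINCT a, b, c FROM table1 ORDER BY a, b, c"
).fetchall()

# GROUP BY on the same three columns yields the same rows.
grouped = conn.execute(
    "SELECT a, b, c FROM table1 GROUP BY a, b, c ORDER BY a, b, c"
).fetchall()

# Only two distinct values of a exist, yet three rows come back above.
distinct_a = conn.execute(
    "SELECT DISTINCT a FROM table1 ORDER BY a"
).fetchall()

print(with_parens)
print(distinct_a)
```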
A little (VERY little) empirical data from MS SQL Server, on a couple of random tables from our DB.
For the pattern:
SELECT col1, col2 FROM table GROUP BY col1, col2
and
SELECT DISTINCT col1, col2 FROM table
When there's no covering index for the query, both ways produced the following query plan:
|--Sort(DISTINCT ORDER BY:([table].[col1] ASC, [table].[col2] ASC))
|--Clustered Index Scan(OBJECT:([db].[dbo].[table].[IX_some_index]))
and when there was a covering index, both produced:
|--Stream Aggregate(GROUP BY:([table].[col1], [table].[col2]))
|--Index Scan(OBJECT:([db].[dbo].[table].[IX_some_index]), ORDERED FORWARD)
so from that very small sample SQL Server certainly treats both the same.
In MySQL I've found that GROUP BY often performs better than DISTINCT.
Running "EXPLAIN SELECT DISTINCT" shows "Using where; Using temporary": MySQL will create a temporary table.
whereas "EXPLAIN SELECT a,b, c from T1, T2 where T2.A=T1.A GROUP BY a" just shows "Using where".
Both would generate the same query plan in MS SQL Server.... If you have MS SQL Server you could just enable the actual execution plan to see which one is better for your needs ...
Please have a look at those posts:
http://blog.sqlauthority.com/2007/03/29/sql-server-difference-between-distinct-and-group-by-distinct-vs-group-by/
http://www.sqlmag.com/Article/ArticleID/24282/sql_server_24282.html
If you really are looking for distinct values, the distinct makes the source code more readable (like if it's part of a stored procedure). If I'm writing ad-hoc queries, I'll usually start with the group by, even if I have no aggregations, because I'll often end up adding them.