sql group by versus distinct - mysql

Why would someone use a group by versus distinct when there are no aggregations done in the query?
Also, does someone know the group by versus distinct performance considerations in MySQL and SQL Server. I'm guessing that SQL Server has a better optimizer and they might be close to equivalent there, but in MySQL, I expect a significant performance advantage to distinct.
I'm interested in dba answers.
EDIT:
Bill's post is interesting, but not applicable. Let me be more specific...
select a, b, c
from table x
group by a, b,c
versus
select distinct a,b,c
from table x

GROUP BY maps groups of rows to one row, per distinct value in specific columns, which don't even necessarily have to be in the select-list.
SELECT b, c, d FROM table1 GROUP BY a;
This query is legal SQL (correction: only in MySQL; actually it's not standard SQL and not supported by other brands). MySQL accepts it, and it trusts that you know what you're doing, selecting b, c, and d in an unambiguous way because they're functional dependencies of a.
However, Microsoft SQL Server and other brands don't allow this query, because it can't determine the functional dependencies easily. edit: Instead, standard SQL requires you to follow the Single-Value Rule, i.e. every column in the select-list must either be named in the GROUP BY clause or else be an argument to a set function.
Whereas DISTINCT always looks at all columns in the select-list, and only those columns. It's a common misconception that DISTINCT allows you to specify the columns:
SELECT DISTINCT(a), b, c FROM table1;
Despite the parentheses making DISTINCT look like function call, it is not. It's a query option and a distinct value in any of the three fields of the select-list will lead to a distinct row in the query result. One of the expressions in this select-list has parentheses around it, but this won't affect the result.

A little (VERY little) empirical data from MS SQL Server, on a couple of random tables from our DB.
For the pattern:
SELECT col1, col2 FROM table GROUP BY col1, col2
and
SELECT DISTINCT col1, col2 FROM table
When there's no covering index for the query, both ways produced the following query plan:
|--Sort(DISTINCT ORDER BY:([table].[col1] ASC, [table].[col2] ASC))
|--Clustered Index Scan(OBJECT:([db].[dbo].[table].[IX_some_index]))
and when there was a covering index, both produced:
|--Stream Aggregate(GROUP BY:([table].[col1], [table].[col2]))
|--Index Scan(OBJECT:([db].[dbo].[table].[IX_some_index]), ORDERED FORWARD)
so from that very small sample SQL Server certainly treats both the same.

In MySQL I've found using a GROUP BY is often better in performance than DISTINCT.
Doing an "EXPLAIN SELECT DISTINCT" shows "Using where; Using temporary " MySQL will create a temporary table.
vs a "EXPLAIN SELECT a,b, c from T1, T2 where T2.A=T1.A GROUP BY a" just shows "Using where"

Both would generate the same query plan in MS SQL Server.... If you have MS SQL Server you could just enable the actual execution plan to see which one is better for your needs ...
Please have a look at those posts:
http://blog.sqlauthority.com/2007/03/29/sql-server-difference-between-distinct-and-group-by-distinct-vs-group-by/
http://www.sqlmag.com/Article/ArticleID/24282/sql_server_24282.html

If you really are looking for distinct values, the distinct makes the source code more readable (like if it's part of a stored procedure) If I'm writing ad-hoc queries I'll usually start with the group by, even if I have no aggregations because I'll often end up putting them on.

Related

Using "dynamic" variables in a MySQL query [duplicate]

I have a doubt and question regarding alias in sql. If i want to use the alias in same query can i use it. For eg:
Consider Table name xyz with column a and b
select (a/b) as temp , temp/5 from xyz
Is this possible in some way ?
You are talking about giving an identifier to an expression in a query and then reusing that identifier in other parts of the query?
That is not possible in Microsoft SQL Server which nearly all of my SQL experience is limited to. But you can however do the following.
SELECT temp, temp / 5
FROM (
SELECT (a/b) AS temp
FROM xyz
) AS T1
Obviously that example isn't particularly useful, but if you were using the expression in several places it may be more useful. It can come in handy when the expressions are long and you want to group on them too because the GROUP BY clause requires you to re-state the expression.
In MSSQL you also have the option of creating computed columns which are specified in the table schema and not in the query.
You can use Oracle with statement too. There are similar statements available in other DBs too. Here is the one we use for Oracle.
with t
as (select a/b as temp
from xyz)
select temp, temp/5
from t
/
This has a performance advantage, particularly if you have a complex queries involving several nested queries, because the WITH statement is evaluated only once and used in subsequent statements.
Not possible in the same SELECT clause, assuming your SQL product is compliant with entry level Standard SQL-92.
Expressions (and their correlation names) in the SELECT clause come into existence 'all at once'; there is no left-to-right evaluation that you seem to hope for.
As per #Josh Einstein's answer here, you can use a derived table as a workaround (hopefully using a more meaningful name than 'temp' and providing one for the temp/5 expression -- have in mind the person who will inherit your code).
Note that code you posted would work on the MS Access Database Engine (and would assign a meaningless correlation name such as Expr1 to your second expression) but then again it is not a real SQL product.
Its possible I guess:
SELECT (A/B) as temp, (temp/5)
FROM xyz,
(SELECT numerator_field as A, Denominator_field as B FROM xyz),
(SELECT (numerator_field/denominator_field) as temp FROM xyz);
This is now available in Amazon Redshift
E.g.
select clicks / impressions as probability, round(100 * probability, 1) as percentage from raw_data;
Ref:
https://aws.amazon.com/about-aws/whats-new/2018/08/amazon-redshift-announces-support-for-lateral-column-alias-reference/
You might find W3Schools "SQL Alias" to be of good help.
Here is an example from their tutorial:
SELECT po.OrderID, p.LastName, p.FirstName
FROM Persons AS p,
Product_Orders AS po
WHERE p.LastName='Hansen' AND p.FirstName='Ola'
Regarding using the Alias further in the query, depending on the database you are using it might be possible.

MySql performance query vs view with 'explain' output

I'm trying to understand why a direct query takes ~0.5s to run but a view using the same query takes ~10s to run. MySql v5.6.27.
Direct Query:
select
a,b,
(select count(*) from TableA i3 where i3.b = i.a) as e,
func1(a) as f, func2(a) as g
from TableA i
where i.b = -1 and i.a > 1500;
Direct Query 'explain' Results:
id,select_type,table,type,possible_keys,key,key_len,ref,rows,Extra
1,PRIMARY,i,range,PRIMARY,PRIMARY,4,\N,3629,Using where
2,DEPENDENT SUBQUERY,i3,ALL,\N,\N,\N,\N,7259,Using where
The View's definition/query is the same without the 'where' clause...
select
a,b,
(select count(*) from TableA i3 where i3.b = i.a) as e,
func1(a) as f, func2(a) as g
from TableA i;
Query against the view:
select * from ViewA t where t.b = -1 and t.a > 1500;
Query of View 'explain' results:
id,select_type,table,type,possible_keys,key,key_len,ref,rows,Extra
1,PRIMARY,<derived2>,ALL,\N,\N,\N,\N,7259,Using where
2,DERIVED,i,ALL,\N,\N,\N,\N,7259,\N
3,DEPENDENT SUBQUERY,i3,ALL,\N,\N,\N,\N,7259,Using where
Why does the query against the view end up performing 3 full table scans whereas the direct query performs ~1.5?
The short answer is: the MySQL optimizer is not clever enough to do it.
When processing a view, MySQL can either merge the view or create a temporary table for it:
For MERGE, the text of a statement that refers to the view and the view definition are merged such that parts of the view definition replace corresponding parts of the statement.
For TEMPTABLE, the results from the view are retrieved into a temporary table, which then is used to execute the statement.
This applies in a very similar way to derived tables and subqueries too.
The behaviour you are seeking is the merge. This is the default value, and MySQL will try to use it whenever possible. If it is not possible (or rather: if MySQL thinks it is not possible), MySQL has to evaluate the complete view, no matter if you only need one row out of it. This obviously takes more time, and is what happens in your view.
There is a list of things that prevent MySQL from using the merge algorithm:
MERGE cannot be used if the view contains any of the following constructs:
Aggregate functions (SUM(), MIN(), MAX(), COUNT(), and so forth)
DISTINCT
GROUP BY
HAVING
LIMIT
UNION or UNION ALL
Subquery in the select list
Assignment to user variables
Refers only to literal values (in this case, there is no underlying table)
You can test this if MySQL will merge or not: try to create the view specifying the merge-algorithm:
create algorithm=merge view viewA as ...
If MySQL doesn't think it can merge the view, you get the warning
1 warning(s): 1354 View merge algorithm can't be used here for now (assumed undefined algorithm)
In your case, the Subquery in the select list is preventing the merge. This is not because it would be impossible to do. You have already prooven that it is possible to merge it: by just rewriting it.
But the MySQL optimizer didn't see that possibility. It is not specific to views: it will actually not merge it either if you use the unmerged viewcode directly: explain select * from (select a, b, ... from TableA i) as ViewA where .... You would have to test this on MySQL 5.7, as MySQL 5.6 will not merge in this situation on principle (as, in a query, it assumes you want to have a temptable here, even for very simple derived tables that could be merged). MySQL 5.7 will by default try to do it, although it won't work with your view.
As the optimizer gets improved, in some situation, the optimizer will merge even in cases where there is an subquery in the select list, so there are some exceptions to that list. MariaDB, which is based on MySQL, is actually a lot better doing the merge optimization, and would merge your view just like you did it - so it is possible to do it even as a machine.
So to summarize: the MySQL optimizer is currently not clever enough to do it.
And you unfortunately cannot do much about it, except testing if MySQL accepts algorithm=merge and then not using views that MySQL cannot merge and instead merge them yourself.

Is using correlated subquery better than join? (indexing perspective)

I read somewhere something like this:
Indexes will be used per queries.
So as you know, this is two queries:
SELECT m1.*, (SELECT 1 FROM mytable2 m2 WHERE col2 = ?) AS sth
FROM mytable1 m1 WHERE col1 = ?
Well query above can use two indexes: mytable1(col1), mytable2(col2). Because of being two separated queries.
Now take a look at this one: (the same as previous query, just uses join instead of subquery)
SELECT m1.*, m2.1 AS sth
FROM mytable1 m1
JOIN mytable2 m2 ON m2.col2 = ?
WHERE m1.col1 = ?
But this ^ query, is just one query. So it can use just one index. Is my understanding right? So using subquery is better for indexing, right?
But this ^ query, is just one query. So it can use just one index. Is my understanding right? So using subquery is better for indexing, right?
You misunderstand. MySQL can use one index per table reference.
So in this case, it can use both indexes: mytable1(col1), mytable2(col2).
You can even use two different indexes from the same table, if you do a self-join or a UNION or a subquery. Each time you reference the table counts as a separate table reference.
SELECT m1.*, m2.1 AS sth
FROM mytable1 m1
JOIN mytable2 m2 ON m2.col2 = ?
WHERE m1.col1 = ?
Regardless of indexing, this is a strange query. You have no condition that relates mytable1 to mytable2. So you're doing a Cartesian product between the two tables. One or both table may be selecting a single row, depending on your conditions for col1 and col2. But it's still a Cartesian product, so if the conditions on both tables return multiple rows, you'll get result set with a lot of repetition.
This is too long for a comment.
The two queries are different, in multiple respects:
The first returns all rows in mytable1 that match the where condition, regardless of whether there is a match in the second table. The second only returns rows that match.
The first fails with an error if the subquery returns more than one row. The second returns multiple rows that match.
As a consequence, the first could return NULL for sth, the second cannot.
My advice is to first learn to write the query that meets your functional needs. Then worry about performance.
As for your question, both correlated subqueries and joins can make use of an index. The idea that correlated subqueries are always worse than joins is an old-wives' tale (no offense to old wives) that should be forgotten.
Generally speaking, it all depends. At the end of the day, SQL Server will create execution plans, and depending on how it interprets your query, one might be better than the other. Having that said, generally, join is better.

Can a query only use one index per table?

I have a query like this:
( SELECT * FROM mytable WHERE author_id = ? AND seen IS NULL )
UNION
( SELECT * FROM mytable WHERE author_id = ? AND date_time > ? )
Also I have these two indexes:
(author_id, seen)
(author_id, date_time)
I read somewhere:
A query can generally only use one index per table when process the WHERE clause
As you see in my query, there is two separated WHERE clause. So I want to know, "only one index per table" means my query can use just one of those two indexes or it can use one of those indexes for each subquery and both indexes are useful?
In other word, is this sentence true?
"always one of those index will be used, and the other one is useless"
That statement about only using one index is no longer true about MySQL. For instance, it implements the index merge optimization which can take advantage of two indexes for some where clauses that have or. Here is a description in the documentation.
You should try this form of your query and see if it uses index mer:
SELECT *
FROM mytable
WHERE author_id = ? AND (seen IS NULL OR date_time > ? );
This should be more efficient than the union version, because it does not incur the overhead of removing duplicates.
Also, depending on the distribution of your data, the above query with an index on mytable(author_id, date_time, seen) might work as well or better than your version.
UNION combines results of subqueries. Each subquery will be executed independent of others and then results will be merged. So, in this case WHERE limits are applied to each subquery and not to all united result.
In answer to your question: yes, each subquery can use some index.
There are cases when the database engine can use more indexes for one select statement, however when filtering one set of rows really it not possible. If you want to use indexing on two columns then build one index on both columns instead of two indexes.
Every single subquery or part of composite query is itself a query can be evaluated as single query for performance and index access .. you can also force the use of different index for eahc query .. In your case you are using union and these are two separated query .. united in a resulting query
. you can have a brief guide how mysql ue index .. acccessing at this guide
http://dev.mysql.com/doc/refman/5.7/en/mysql-indexes.html

Sql syntax: select without from clause as subquery in select (subselect)

While editing some queries to add alternatives for columns without values, I accidentally wrote something like this (here is the simplyfied version):
SELECT id, (SELECT name) FROM t
To my surprise, MySQL didn't throw any error, but completed the query giving my expected results (the name column values).
I tried to find any documentation about it, but with no success.
Is this SQL standard or a MySQL specialty?
Can I be sure that the result of this syntax is really the column value from the same (outer) table? The extended version would be like this:
SELECT id, (SELECT name FROM t AS t1 where t1.id=t2.id) FROM t AS t2
but the EXPLAIN reports No tables used in the Extra column for the former version, which I think is very nice.
Here's a simple fiddle on SqlFiddle (it keeps timing out for me, I hope you have better luck).
Clarification: I know about subqueries, but I always wrote subqueries (correlated or not) that implied a table to select from, hence causing an additional step in the execution plan; my question is about this syntax and the result it gives, that in MySQL seems to return the expected value without any.
What you within your first query is a correlated subquery which simply returns the name column from the table t. no actual subquery needs to run here (which is what your EXPLAIN is telling you).
In a SQL database query, a correlated subquery (also known as a
synchronized subquery) is a subquery (a query nested inside another
query) that uses values from the outer query.
https://en.wikipedia.org/wiki/Correlated_subquery
SELECT id, (SELECT name) FROM t
is the same as
SELECT id, (SELECT t.name) FROM t
Your 2nd query
SELECT id, (SELECT name FROM t AS t1 where t1.id=t2.id) FROM t AS t2
Also contains correlated subquery but this one is actually running a query on table t to find records where t1.id = t2.id.
This is the default behavior for the SQL language and it is defined on the SQL ANSI 2011 over ISO/IEC 9075-1:2011(en) documentation. Unfortunately it is not open. This behavior is described on the section 4.11 SQL-Statements.
This behavior happens because the databases process the select comand without the from clause, therefore if it encounters:
select id, (select name) from some
It will try to find that name field as a column of the outer queries to process.
Fortunately I remember that some while ago I've answered someone here and find a valid available link to an SQL ANSI document that is online in FULL but it is for the SQL ANSI 99 and the section may not be the same one as the new document. I think, did not check, that it is around the section 4.30. Take a look. And I really recommend the reading (I did that back in the day).
Database Language SQL - ISO/IEC 9075-2:1999 (E)
It's not standard. In oracle,
select 1, (select 2)
from dual
Throws error, ORA-00923: FROM keyword not found where expected
How can you be sure of your results? Get a better understanding of what the query is supposed to acheive before you write it. Even the exetended version in the question does not make any sense.