While editing some queries to add alternatives for columns without values, I accidentally wrote something like this (here is the simplified version):
SELECT id, (SELECT name) FROM t
To my surprise, MySQL didn't throw any error, but completed the query giving my expected results (the name column values).
I tried to find any documentation about it, but with no success.
Is this SQL standard or a MySQL specialty?
Can I be sure that the result of this syntax is really the column value from the same (outer) table? The extended version would be like this:
SELECT id, (SELECT name FROM t AS t1 where t1.id=t2.id) FROM t AS t2
but the EXPLAIN reports No tables used in the Extra column for the former version, which I think is very nice.
Here's a simple fiddle on SqlFiddle (it keeps timing out for me, I hope you have better luck).
Clarification: I know about subqueries, but I have always written subqueries (correlated or not) that implied a table to select from, hence causing an additional step in the execution plan. My question is about this syntax and the result it gives, which in MySQL seems to return the expected value without any such step.
What you have in your first query is a correlated subquery, which simply returns the name column from table t. No actual subquery needs to run here (which is what your EXPLAIN is telling you).
In a SQL database query, a correlated subquery (also known as a
synchronized subquery) is a subquery (a query nested inside another
query) that uses values from the outer query.
https://en.wikipedia.org/wiki/Correlated_subquery
SELECT id, (SELECT name) FROM t
is the same as
SELECT id, (SELECT t.name) FROM t
Your 2nd query
SELECT id, (SELECT name FROM t AS t1 where t1.id=t2.id) FROM t AS t2
also contains a correlated subquery, but this one actually runs a query on table t to find records where t1.id = t2.id.
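The equivalence described above can be checked with a small sketch. Note the assumptions: it uses Python's built-in sqlite3 module instead of MySQL (SQLite also accepts a SELECT without a FROM clause), and the sample rows and the helper name fetch_names are made up for illustration. It shows only that the three forms return the same rows there, not how MySQL plans them.

```python
import sqlite3


def fetch_names(query):
    """Run `query` against a throwaway in-memory table t(id, name)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, "ann"), (2, "bob")])
    rows = conn.execute(query).fetchall()
    conn.close()
    return rows


# Plain column reference, used as the baseline.
plain = fetch_names("SELECT id, name FROM t ORDER BY id")

# Subquery without FROM: `name` resolves to the column of the outer row.
short = fetch_names("SELECT id, (SELECT name) FROM t ORDER BY id")

# Extended form: the subquery really reads table t again.
full = fetch_names(
    "SELECT id, (SELECT name FROM t AS t1 WHERE t1.id = t2.id) "
    "FROM t AS t2 ORDER BY t2.id"
)

print(plain == short == full)
```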
This is the default behavior of the SQL language, and it is defined in the SQL ANSI 2011 standard, ISO/IEC 9075-1:2011(en). Unfortunately, that document is not freely available. The behavior is described in section 4.11, SQL-Statements.
This behavior happens because the database processes the select command without a from clause; therefore, if it encounters:
select id, (select name) from some
It will try to find that name field as a column of the outer queries to process.
Fortunately, I remember answering someone here a while ago and finding a working link to an SQL ANSI document that is online in full. It is for SQL ANSI 99, though, so the section may not be the same as in the newer document. I think (I did not check) that it is around section 4.30. Take a look. I really recommend reading it (I did, back in the day).
Database Language SQL - ISO/IEC 9075-2:1999 (E)
It's not standard. In Oracle,
select 1, (select 2)
from dual
throws the error ORA-00923: FROM keyword not found where expected.
How can you be sure of your results? Get a better understanding of what the query is supposed to achieve before you write it. Even the extended version in the question does not make any sense.
I have a question regarding aliases in SQL. If I want to use an alias in the same query, can I? For example:
Consider a table named xyz with columns a and b:
select (a/b) as temp , temp/5 from xyz
Is this possible in some way?
You are talking about giving an identifier to an expression in a query and then reusing that identifier in other parts of the query?
That is not possible in Microsoft SQL Server, which is what nearly all of my SQL experience is limited to. However, you can do the following:
SELECT temp, temp / 5
FROM (
SELECT (a/b) AS temp
FROM xyz
) AS T1
Obviously that example isn't particularly useful, but if you were using the expression in several places it may be more useful. It can come in handy when the expressions are long and you want to group on them too because the GROUP BY clause requires you to re-state the expression.
In MSSQL you also have the option of creating computed columns which are specified in the table schema and not in the query.
You can use the Oracle WITH statement too. Similar statements are available in other databases. Here is the one we use for Oracle:
with t
as (select a/b as temp
from xyz)
select temp, temp/5
from t
/
This has a performance advantage, particularly if you have complex queries involving several nested queries, because the WITH statement is evaluated only once and used in subsequent statements.
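Both workarounds (the derived table and the WITH clause) can be tried in a minimal sketch. The assumptions here: it runs on SQLite through Python's sqlite3 module, not on SQL Server or Oracle; the alias is named ratio instead of temp; and the single sample row is invented. It demonstrates only that the alias becomes usable one query level up, and says nothing about performance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE xyz (a REAL, b REAL)")
conn.execute("INSERT INTO xyz VALUES (10.0, 4.0)")  # hypothetical sample row

# Derived table: the alias is defined in the inner query and reused outside.
derived = conn.execute(
    "SELECT ratio, ratio / 5 FROM (SELECT a / b AS ratio FROM xyz) AS t1"
).fetchall()

# Common table expression: same idea, written with WITH.
cte = conn.execute(
    "WITH t AS (SELECT a / b AS ratio FROM xyz) "
    "SELECT ratio, ratio / 5 FROM t"
).fetchall()

print(derived, cte)
conn.close()
```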
Not possible in the same SELECT clause, assuming your SQL product is compliant with entry level Standard SQL-92.
Expressions (and their correlation names) in the SELECT clause come into existence 'all at once'; there is no left-to-right evaluation that you seem to hope for.
As per @Josh Einstein's answer here, you can use a derived table as a workaround (hopefully using a more meaningful name than 'temp' and providing one for the temp/5 expression -- keep in mind the person who will inherit your code).
Note that the code you posted would work on the MS Access Database Engine (and would assign a meaningless correlation name such as Expr1 to your second expression), but then again it is not a real SQL product.
It's possible, I guess:
SELECT (A/B) as temp, (temp/5)
FROM xyz,
(SELECT numerator_field as A, Denominator_field as B FROM xyz),
(SELECT (numerator_field/denominator_field) as temp FROM xyz);
This is now available in Amazon Redshift
E.g.
select clicks / impressions as probability, round(100 * probability, 1) as percentage from raw_data;
Ref:
https://aws.amazon.com/about-aws/whats-new/2018/08/amazon-redshift-announces-support-for-lateral-column-alias-reference/
You might find W3Schools "SQL Alias" to be of good help.
Here is an example from their tutorial:
SELECT po.OrderID, p.LastName, p.FirstName
FROM Persons AS p,
Product_Orders AS po
WHERE p.LastName='Hansen' AND p.FirstName='Ola'
As for using the alias further in the query: depending on the database you are using, it might be possible.
I'm using Knowage software for data analysis and I'm facing performance issues, so I'm watching the 'dataset audit' log to see what queries the system performs. I found this one, which makes no sense to me:
SELECT COUNT(*)
FROM
(select TOP(100) PERCENT "ATC_1" AS "ATC_1"
from
(SELECT [ID_AFo]
,[ATC]
,[ATC_1]
,[ATC_3]
,[ATC_4]
,[ATC_5]
FROM [AFO]
) T order by "ATC_1" ASC
) u
The inner T query is the dataset definition query I entered, which is basically a select * from [AFO] on my table; the outer wrappers are generated by Knowage (I never wrote them).
Wouldn't a select count(*) from T have performed the same calculation while avoiding an expensive order by?
EDIT:
The backend (data source) is MSSQL; the cache server is MySQL, so frequent queries run on MySQL.
This query is equivalent to:
SELECT COUNT(*)
FROM [AFO];
The only reason that I can think of for constructing such a query is if the "100" could be set to another value. I'm not sure if SQL Server's optimizer is good enough to eliminate the ORDER BY in the subquery.
First, let me note that I am aware of the other threads with a similar question, but they didn't help my understanding very much. On the contrary, I now sometimes run into the problem that assigning aliases ruins my code, as described below.
I got said error message very often, and in turn started to give aliases to those subqueries which I thought were 'derived tables'. Sometimes when doing so, however, I now get the message 'You have an error in your SQL syntax' instead, and after removing the 'AS ...' statement, everything runs fine.
So I am really trying to figure out when exactly something is a derived table and hence needs an alias, and when not.
I will give you an example: given some tables P, LTP, and T, the following query runs flawlessly:
SELECT DISTINCT pname FROM P WHERE P.pnr IN (SELECT pnr FROM LTP WHERE lnr='L1' AND tnr IN (SELECT tnr FROM T WHERE gewicht>10));
How are the statements in the brackets not derived tables though? I would have assumed that in this case I would have had to give them aliases like this:
SELECT DISTINCT pname FROM P WHERE P.pnr IN (SELECT pnr FROM LTP WHERE lnr='L1' AND tnr IN (SELECT tnr FROM T WHERE gewicht>10) AS TNEW) AS LTPNEW;
but both of these ruin the code.
I would really appreciate it if somebody could point out to me what exactly I am misunderstanding.
If the subquery is in the table_references portion of a query (the FROM clause and all accompanying JOINs), it needs to include an alias.
If the subquery appears elsewhere, like in the WHERE or SELECT section, it's just a regular subquery and no aliasing is required.
From the documentation:
Derived tables is the internal name for subqueries in the FROM clause.
As a rule of thumb, if you can reference a column from the subquery by name, then the subquery needs an alias to prevent ambiguity.
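The distinction can be illustrated with a rough sketch. One caveat is stated up front: it runs on SQLite through Python's sqlite3 module, and SQLite, unlike MySQL, does not insist on an alias for a derived table, so only the "alias after a WHERE subquery is a syntax error" half of the rule is actually reproduced. The table contents and the names l1 and ltpnew are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE P (pnr TEXT, pname TEXT);
    CREATE TABLE LTP (lnr TEXT, pnr TEXT);
    INSERT INTO P VALUES ('P1', 'bolt'), ('P2', 'nut');
    INSERT INTO LTP VALUES ('L1', 'P1');
""")

# Subquery in WHERE: an ordinary subquery, so it takes no alias.
in_where = conn.execute(
    "SELECT pname FROM P WHERE pnr IN (SELECT pnr FROM LTP WHERE lnr = 'L1')"
).fetchall()

# Subquery in FROM: a derived table; the alias l1 lets us reference l1.pnr.
in_from = conn.execute(
    "SELECT P.pname FROM P "
    "JOIN (SELECT pnr FROM LTP WHERE lnr = 'L1') AS l1 ON l1.pnr = P.pnr"
).fetchall()

# Adding an alias after a WHERE subquery is rejected by the parser.
try:
    conn.execute(
        "SELECT pname FROM P WHERE pnr IN (SELECT pnr FROM LTP) AS ltpnew"
    )
    alias_in_where_ok = True
except sqlite3.OperationalError:
    alias_in_where_ok = False

print(in_where, in_from, alias_in_where_ok)
conn.close()
```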
This is a simple question about efficiency specifically related to the MySQL implementation. I want to just check if a table is empty (and if it is empty, populate it with the default data). Would it be best to use a statement like SELECT COUNT(*) FROM `table` and then compare to 0, or would it be better to do a statement like SELECT `id` FROM `table` LIMIT 0,1 then check if any results were returned (the result set has next)?
Although I need this for a project I am working on, I am also interested in how MySQL works with those two statements and whether the reason people seem to suggest using COUNT(*) is because the result is cached or whether it actually goes through every row and adds to a count as it would intuitively seem to me.
You should definitely go with the second query rather than the first.
When using COUNT(*), MySQL is scanning at least an index and counting the records. Even if you would wrap the call in a LEAST() (SELECT LEAST(COUNT(*), 1) FROM table;) or an IF(), MySQL will fully evaluate COUNT() before evaluating further. I don't believe MySQL caches the COUNT(*) result when InnoDB is being used.
Your second query results in only one row being read, furthermore an index is used (assuming id is part of one). Look at the documentation of your driver to find out how to check whether any rows have been returned.
By the way, the id field may be omitted from the query (MySQL will use an arbitrary index):
SELECT 1 FROM table LIMIT 1;
However, I think the simplest and most performant solution is the following (as indicated in Gordon's answer):
SELECT EXISTS (SELECT 1 FROM table);
EXISTS returns 1 if the subquery returns any rows, otherwise 0. Because of these semantics, MySQL can optimize the execution properly.
Any fields listed in the subquery are ignored, thus 1 or * is commonly written.
See the MySQL Manual for more info on the EXISTS keyword and its use.
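The "populate if empty" check built on EXISTS might look like the sketch below. It is written against SQLite via Python's sqlite3 module as a stand-in for MySQL, and the table name settings, its columns, and the helper is_empty are hypothetical. With a MySQL driver the SQL would be the same, but the connection setup would differ.

```python
import sqlite3


def is_empty(conn, table):
    """Return True if `table` has no rows, using SELECT EXISTS.

    The table name is interpolated into the SQL text, so it must come
    from trusted code and never from user input.
    """
    (flag,) = conn.execute(f'SELECT EXISTS (SELECT 1 FROM "{table}")').fetchone()
    return flag == 0


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE settings (id INTEGER PRIMARY KEY, value TEXT)")

empty_before = is_empty(conn, "settings")
if empty_before:
    # Populate the default data only when the table has no rows yet.
    conn.execute("INSERT INTO settings (value) VALUES ('default')")
empty_after = is_empty(conn, "settings")

print(empty_before, empty_after)
conn.close()
```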
It is better to do the second method or just exists. Specifically, something like:
if exists (select id from table)
should be the fastest way to do what you want. You don't need the limit; the SQL engine takes care of that for you.
By the way, never put identifiers (table and column names) in single quotes.
Why would someone use a group by versus distinct when there are no aggregations done in the query?
Also, does someone know the group by versus distinct performance considerations in MySQL and SQL Server? I'm guessing that SQL Server has a better optimizer and they might be close to equivalent there, but in MySQL, I expect a significant performance advantage to distinct.
I'm interested in dba answers.
EDIT:
Bill's post is interesting, but not applicable. Let me be more specific...
select a, b, c
from table x
group by a, b,c
versus
select distinct a,b,c
from table x
GROUP BY maps groups of rows to one row, per distinct value in specific columns, which don't even necessarily have to be in the select-list.
SELECT b, c, d FROM table1 GROUP BY a;
This query is legal SQL (correction: only in MySQL; actually it's not standard SQL and not supported by other brands). MySQL accepts it, and it trusts that you know what you're doing, selecting b, c, and d in an unambiguous way because they're functional dependencies of a.
However, Microsoft SQL Server and other brands don't allow this query, because it can't determine the functional dependencies easily. edit: Instead, standard SQL requires you to follow the Single-Value Rule, i.e. every column in the select-list must either be named in the GROUP BY clause or else be an argument to a set function.
Whereas DISTINCT always looks at all columns in the select-list, and only those columns. It's a common misconception that DISTINCT allows you to specify the columns:
SELECT DISTINCT(a), b, c FROM table1;
Despite the parentheses making DISTINCT look like a function call, it is not. It's a query option, and a distinct value in any of the three fields of the select-list will lead to a distinct row in the query result. One of the expressions in this select-list has parentheses around it, but this won't affect the result.
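Both points (GROUP BY on all selected columns matching DISTINCT, and the parentheses being irrelevant) can be checked with a small sketch. It assumes SQLite through Python's sqlite3 module and an invented table x with a few duplicate rows. It compares result sets only and says nothing about query plans or speed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE x (a INTEGER, b INTEGER, c INTEGER)")
conn.executemany(
    "INSERT INTO x VALUES (?, ?, ?)",
    [(1, 1, 1), (1, 1, 1), (1, 2, 1), (2, 2, 2)],  # hypothetical rows
)

# Results are sorted in Python because neither query guarantees an order.
grouped = sorted(conn.execute("SELECT a, b, c FROM x GROUP BY a, b, c").fetchall())
distinct = sorted(conn.execute("SELECT DISTINCT a, b, c FROM x").fetchall())

# The parentheses only wrap the expression `a`; DISTINCT still covers a, b, c.
parenthesized = sorted(conn.execute("SELECT DISTINCT (a), b, c FROM x").fetchall())

print(grouped == distinct == parenthesized)
conn.close()
```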
A little (VERY little) empirical data from MS SQL Server, on a couple of random tables from our DB.
For the pattern:
SELECT col1, col2 FROM table GROUP BY col1, col2
and
SELECT DISTINCT col1, col2 FROM table
When there's no covering index for the query, both ways produced the following query plan:
|--Sort(DISTINCT ORDER BY:([table].[col1] ASC, [table].[col2] ASC))
|--Clustered Index Scan(OBJECT:([db].[dbo].[table].[IX_some_index]))
and when there was a covering index, both produced:
|--Stream Aggregate(GROUP BY:([table].[col1], [table].[col2]))
|--Index Scan(OBJECT:([db].[dbo].[table].[IX_some_index]), ORDERED FORWARD)
so from that very small sample SQL Server certainly treats both the same.
In MySQL, I've found that GROUP BY often performs better than DISTINCT.
Running "EXPLAIN SELECT DISTINCT" shows "Using where; Using temporary", meaning MySQL will create a temporary table,
whereas "EXPLAIN SELECT a, b, c FROM T1, T2 WHERE T2.A = T1.A GROUP BY a" just shows "Using where".
Both would generate the same query plan in MS SQL Server. If you have MS SQL Server, you can enable the actual execution plan to see which one is better for your needs.
Please have a look at those posts:
http://blog.sqlauthority.com/2007/03/29/sql-server-difference-between-distinct-and-group-by-distinct-vs-group-by/
http://www.sqlmag.com/Article/ArticleID/24282/sql_server_24282.html
If you really are looking for distinct values, DISTINCT makes the source code more readable (for example, if it's part of a stored procedure). If I'm writing ad-hoc queries, I'll usually start with GROUP BY even if I have no aggregations, because I'll often end up adding them.