Evaluating a correlated subquery in SQL

Evaluating a correlated subquery in SQL - mysql

I'm having trouble getting my head around evaluating correlated subqueries. An example is using a correlated subquery in SELECT so that GROUP BY isn't needed:
Consider the relations:
Movies : Title, Director Length
Schedule : Theatre, Title
I have the following query
SELECT S.Theater, MAX(M.Length)
FROM Movies M JOIN Schedule S ON M.Title=S.Title
GROUP BY S.Theater
Which gets the longest film that every theatre is playing. This is the same query without using GROUP BY:
SELECT DISTINCT S.theater,
(SELECT MAX(M.Length)
FROM Movies M
WHERE M.Title=S.Title)
FROM Schedule S
but I don't understand how it quite works.
I'd appreciate if anybody could give me an example of how correlated subqueries are evaluated.
Thanks :)

Conceptually...
To understand this, first ignore the bit about correlated subquery.
Consider the order of operations for a statement like this:
SELECT t.foo FROM mytable t
MySQL prepares an empty resultset. Rows in the resultset will consist of one column, because there is one expression in the SELECT list. A row is retrieved from mytable. MySQL puts a row into the resultset, using the value from the foo column from the mytable row, assigning it to the foo column in the resultset. Fetch the next row, repeat that same process, until there are no more rows to fetch from the table.
Pretty easy stuff. But bear with me.
Consider this statement:
SELECT t.foo AS fi, 'bar' AS fo FROM mytable t
MySQL process that the same way. Prepare an empty resultset. Rows in the resultset are going to have two columns this time. The first column is given the name fi (because we assigned the name fi with an alias). The second column in rows of the resultset will be named fo, because (again) we assigned an alias.
Now we etch a row from mytable, and insert a row into the resultset. The value of the foo column goes into the column name fi, and the literal string 'bar' goes into the column named fo. Continue fetching rows and inserting rows into the resultset, until no more rows to fetch.
Not too hard.
Next, consider this statement, which looks a little more tricky:
SELECT t.foo AS fi, (SELECT 'bar') AS fo FROM mytable t
Same thing happens again. Empty resultset. Rows have two columns, name fi and fo.
Fetch a row from mytable, and insert a row into the resultset. The value of foo goes into column fi (just like before.) This is where it gets tricky... for the second column in the resultset, MySQL executes the query inside the parens. In this case it's a pretty simple query, we can test that pretty easily to see what it returns. Take the result from that query and assign that to the fo column, and insert the row into the resultset.
Still with me?
SELECT t.foo AS fi, (SELECT q.tip FROM bartab q LIMIT 1) AS fo FROM mytable
This is starting to look more complicated. But it's not really that much different. The same things happen again. Prepare the empty resultset. Rows will have two columns, one name fi, the other named fo. Fetch a row from mytable. Get the value from foo column, and assign it to the fi column in the result row. For the fo column, execute the query, and assign the result from the query to the fo column. Insert the result row into the resultset. Fetch another row from mytable, a repeat the process.
Here we should stop and notice something. MySQL is picky about that query in the SELECT list. Really really picky. MySQL has restrictions on that. The query must return exactly one column. And it cannot return more than one row.
In that last example, for the row being inserted into the resultset, MySQL is looking for a single value to assign to the fo column. When we think about it that way, it makes sense that the query can't return more than one column... what would MySQL do with the value from the second column? And it makes sense that we don't want to return more than one row... what would MySQL do with multiple rows?
MySQL will allow the query to return zero rows. When that happens, MySQL assigns a NULL to the fo column.
If you have an understanding of that, your 95% of the way there to understanding the correlated subquery.
Let's look at another example. Our single line of SQL is getting a little unweildy, so we'll just add some line breaks and spaces to make it easier for us to work with. The extra spaces and linebreaks don't change the meaning of our statement.
SELECT t.foo AS fi
, ( SELECT q.tip
FROM bartab q
WHERE q.col = t.foo
ORDER BY q.tip DESC
LIMIT 1
) AS fo
FROM mytable t
Okay, that looks a lot more complicated. But is it really? It's the same thing again. Prepare an empty resultset. Rows will have two columns, fi and fo. Fetch a row from mytable, and get a row ready to insert into the resultset. Copy the value from the foo column, assign it to the fi column. And for the fo column, execute the query, take the single value returned by the query to the fo column, and push the row into the resultset. Fetch the next row from mytable, and repeat.
To explain (finall!) the part about "correlated".
That query we are going to run to get the result for the fo column. That contains a reference to a column from the outer table. t.foo. In this example that appears in the WHERE clause; it doesn't have to, it could appear anywhere in the statement.
What MySQL does with that, when it runs that subquery, it passes in the value of the foo column, into the query. If the row we just fetched from mytable has a value of 42 in the foo column... that subquery is equivalent to
SELECT q.tip
FROM bartab q
WHERE q.col = 42
ORDER BY q.tip DESC
LIMIT 1
But since we're not passing in the literal value of 42, what we're passing in is values from the row in the outer query, the result returned by our subquery is "related" to the row we're processing in the outer query.
We could be a lot more complicated in our subquery, as long as we remember the rule about the subquery in the SELECT list... it has to return exactly one column, and at most one row. It returns at most one value.
Correlated subqueries can appear in parts of the statement other than the SELECT list, such as the WHERE clause. The same general concept applies. For each row processed by the outer query, the values of the column(s) from that row are passed in to the subquery. The result returned from the subquery is related to the row being processed in the outer query.
The discussion omits all the steps before the actual execution... parsing the statament into tokens, performing the syntax check (keywords and identifiers in the right place). Then performing the semantics check (does mytable exist, does the user have select privilege on it, does the column foo exist in mytable). Then determining the access plan. And in the execution, obtaining the required locks, and so on. All that happens with every statement we execute.)
And we're going to not discuss the kinds of horrendous performance issues we can create with correlated subqueries. Though the previous discussion should give a clue. Since the subquery is executed for every row we're putting into the resultset (if it's in the SELECT list of our outer query), or is being executed for every row that is accessed by the outer query... if the outer query is returning 40,000 rows, that means our correlated subquery is going to be executed 40,000 times. So we better well make sure that subquery executes fast. Even when it executes fast, we're still going to execute it 40,000 times.

From a conceptual standpoint, imagine that the database is going through each row of the result without the subquery:
SELECT DISTINCT S.Theater, S.Title
FROM Schedule S
And then, for each one of those, running the subquery for you:
SELECT MAX(M.Length)
FROM Movies M
WHERE M.Title = (whatever S.Title was)
And placing that in as the value. Really, it's not (conceptually) that different from using a function:
SELECT DISTINCT S.Theater, SUBSTRING(S.Title, 1, 5)
FROM Schedule S
It's just that this function performs a query against another table, instead.
I do say conceptually, though. The database may be optimizing the correlated query into something more like a join. Whatever it does internally matters for performance, but doesn't matter as much for understanding the concept.
But, it may not return the results you're expecting. Consider the following data (sorry sqlfiddle seems to be erroring atm):
CREATE TABLE Movies (
Title varchar(255),
Length int(10) unsigned,
PRIMARY KEY (Title)
);
CREATE TABLE Schedule (
Title varchar(255),
Theater varchar(255),
PRIMARY KEY (Theater, Title)
);
INSERT INTO Movies
VALUES ('Star Wars', 121);
INSERT INTO Movies
VALUES ('Minions', 91);
INSERT INTO Movies
VALUES ('Up', 96);
INSERT INTO Schedule
VALUES ('Star Wars', 'Cinema 8');
INSERT INTO Schedule
VALUES ('Minions', 'Cinema 8');
INSERT INTO Schedule
VALUES ('Up', 'Cinema 8');
INSERT INTO Schedule
VALUES ('Star Wars', 'Cinema 6');
And then this query:
SELECT DISTINCT
S.Theater,
(
SELECT MAX(M.Length)
FROM Movies M
WHERE M.Title = S.Title
) AS MaxLength
FROM Schedule S;
You'll get this result:
+----------+-----------+
| Theater | MaxLength |
+----------+-----------+
| Cinema 6 | 121 |
| Cinema 8 | 91 |
| Cinema 8 | 121 |
| Cinema 8 | 96 |
+----------+-----------+
As you can see, it's not a replacement for GROUP BY (and you can still use GROUP BY), it's just running the subquery for each row. DISTINCT will only remove duplicates from the result. It's not giving the "greatest length" per theater anymore, it's just giving each unique movie length associated with the theater name.
PS: You might likely use an ID column of some sort to identify movies, rather than using the Title in the join. This way, if by chance the name of the movie has to be amended, it only needs to change in one place, not all over Schedule too. Plus, it's faster to join on an ID number than a string.

Related

Not selecting duplicates in join / where query

I've been trying to learn MySQL, and I'm having some trouble creating a join query to not select duplicates.
Basically, here's where I'm at :
SELECT atable.phonenumber, btable.date
FROM btable
LEFT JOIN atable ON btable.id = atable.id
WHERE btable.country_id = 4
However, in my database, there is the possibility of having duplicate rows in column atable.phonenumber.
For example (added asterisks for clarity)
phonenumber | date
-------------|-----------
*555-681-2105 | 2015-08-12
555-425-5161 | 2015-08-15
331-484-7784 | 2015-08-17
*555-681-2105 | 2015-08-25
.. and so on.
I tried using SELECT DISTINCT but that doesn't work. I also was looking through other solutions which recommended GROUP BY, but that threw an error, most likely because of my WHERE clause and condition. Not really sure how I can easily accomplish this.

DISTINCT applies to the whole row being returned, essentially saying "I want only unique rows" - any row value may participate in making the row unique
You are getting phone numbers duplicated because you're only looking at the column in isolation. The database is looking at phone number and also date. The rows you posted have different dates, and these hence cause the rows to be different
I suggest you do as the commenter recommended and decide what you want to do with the dates. If you want the latest date for a phone number, do this:
SELECT atable.phonenumber, max(btable.date)
FROM battle
LEFT JOIN atable ON btable.id = atable.id
WHERE btable.country_id = 4
GROUP BY atable.phonenumber
When you write a query that uses grouping, you will get a set of rows where there is only one set of value combinations for anything that is in the group by list. In this case, only unique phone numbers. But, because you want other values as well (I.e. Date) you MUST use what's called an aggregate function, to specify what you want to do with all the various values that aren't part of the unique set. Sometimes it will be MAX or MIN, sometimes it will be SUM, COUNT, AVG and so on.
if you're familiar with hash tables or dictionaries from elsewhere in programming, this is what a group by is: it maps a set of values (a key) to a list of rows that have those key values, and then the aggregating function is applied to any of the values in the list associated with the key
The simple rule when using group by (and one that MySQL will do implicitly for you) is to write queries thus:
SELECT
List,
of,
columns,
you,
want,
in,
unique,
combination,
FN(List),
FN(of),
FN(columns),
FN(you),
FN(want),
FN(aggregating)
FROM table
GROUP BY
List,
of,
columns,
you,
want,
in,
unique,
combination
i.e. You can copy paste from your select list to your group list. MySQL does this implicitly for you if you don't do it (i.e. If you use one or more aggregate functions like max in your select list, but forget or omit the group by clause- it will take everything that isn't in an agggregate function and run the grouping as if you'd written it). Whether group by is hence largely redundant is often debated, but there do exist other things you can do with a group by, such as rollup, cube and grouping sets. Also you can group on a column, if that column is used in a deterministic function, without having to group on the result of he deterministic function. Whether there is any point to doing so is a debate for another time :)

You should add GROUP BY, and an aggregate to the date field, something like this:
SELECT atable.phonenumber, MAX(btable.date)
FROM btable
LEFT JOIN atable ON btable.id = atable.id
WHERE btable.country_id = 4
GROUP BY atable.phonenumber
This will return the maximum date, hat is the latest date...

Union query to combine results of 3 tables

I am relatively new to coding so please have patience.
I am trying to combine data from 3 tables. I have managed to get some data back but it isn't what i need. Please see my example below.
select oid, rrnhs, idnam, idfnam, dte1, ta
as 'access type' from person
left join
(select fk_oid, min(dte), dte1, ta
from
((Select fk_oid,min(accessdate) as dte, accessdate1 as dte1, accesstype as ta
from vascularpdaccess
where isnull(accesstype)=false group by fk_oid)
union
(Select fk_oid, min(hpdate) as dte, hpdate as dte1, HPACCE as ta
from hdtreatment
where isnull(hptype)=false group by fk_oid)) as bla
group by fk_oid) as access
on person.oid=access.fk_oid
where person.rrnhs in (1000010000, 2000020000, 3000030000)
My understanding with a union is that the columns have to be of the same data type but i have two problems. The first is that accesstype and hpacce combine in to a the same column as expected, but i dont want to actually see the hpacce data (dont know if this is even possible).
Secondly, the idea of the query is to pull back a patients 'accesstype' date at the first date of hpdate.
I dont know if this even makes sens to you guys but hoping someone can help..y'all are usually pretty nifty!
Thanks in advance!
Mikey

All queries need to have the same number of columns in the SELECT statement. It looks like you first query has the max number of columns, so you will need to "pad" the other to have the same number of columns. You can use NULL as col to create the column with all null values.

To answer the question (I think) you were asking... for a UNION or UNION ALL set operation, you are correct: the number of columns and the datatypes of the columns returned must match.
But it is possible to return a literal as an expression in the SELECT list. For example, if you don't want to return the value of HPACCE column, you can replace that with a literal or a NULL. (If that column is character datatype (we can't tell from the information provided in the question), you could use (for example) a literal empty string '' AS ta in place of HPACCE AS ta.
SELECT fk_oid
, MIN(HPDATE) AS dte
, hpdate AS dte1
, NULL AS ta
-- -------------------- ^^^^
FROM hdtreatment
Some other notes:
The predicate ISNULL(foo)=FALSE can be more simply expressed as foo IS NOT NULL.
The UNION set operator will remove duplicate rows. If that's not necessary, you could use a UNION ALL set operator.
The subsequent GROUP BY fk_oid operation on the inline view bla is going to collapse rows; but it's indeterminate which row the values from dte1 and ta will be from. (i.e. there is no guarantee those values will be from the row that had the "minimum" value of dte.) Other databases will throw an exception/error with this statement, along the lines of "non-aggregate in SELECT list not in GROUP BY". But this is allowed (without error or warning) by a MySQL specific extension to GROUP BY behavior. (We can get MySQL to behave like other databases and throw an error of we specify a value for sql_mode that includes ONLY_FULL_GROUP_BY (?).)
The predicate on the outer query doesn't get pushed down into the inline view bla. The view bla is going to materialized for every fk_oid, and that could be a performance issue on large sets.
Also, qualifying all column references would make the statement easier to read. And, that will also insulate the statement from throwing an "ambiguous column" error in the future, when a column named (e.g.) ta or dte1 is added to the person table.

MySql Duplicated rows string comparation performance

I have a table with more then 2 million records,
I need to find duplication records in column with string type additionaly I have index for this field.
I have next query:
select m.* from member as m
where lower(m.username) in
(select lower(b.username) from member as b
where b.Username like 'a%'
group by b.username
having count(b.username) >= 2);
sub-query return only 4 records less then 0.2 seconds, but if I use them in where conditions section, this query working very long time and never return results....
I have tried to run next query, that theoretically the same logic:
select * from member as m where lower(Username) in (lower('a1'),
lower('a2'),lower('a3'),lower('a4'));
and it works fine and fast.
what is the issues ?
additionally I would like to run query with out where b.Username like 'a%' part?

In common case MySQL can not use index for IN subqueries
This is sad, but, actually, MySQL can not recognize "constant subqueries". What does it mean? It means that if you have a subquery that returns static list of values - and you use that in IN within another query, MySQL will not use index (by range).
Why it is so?
Actually, the most correct point is - because MySQL treats following queries:
.. WHERE `field` IN ('foo', 'bar', 'baz')
and
.. WHERE `field` IN (SELECT `col` FROM t)
-as different queries (I'm assuming that column col in table t in second query have same values, i.e. 'foo', 'bar', 'baz'). First query is equivalent for it's "expected" case, i.e. for range of values. But second query is equal for = ANY subquery - and so MySQL will not use index for that.
What to do
Actually, your case and cases similar to it - are cases when it's better to split your query into two parts. First part will be retrieve static list of values from your table. Second part will substitute result of your first part into IN clause and then you'll get index using.
Alternative - you can use JOIN syntax for table to itself. That may seems useful if you want to resolve an issue with one query (or if your list is too long)

MySQL Subquery - Returning More Than One Row

I have a table in which there are a listing of names, first and last in a column. So a column, called "manager" could have a value of "John Doe". I want to right a query that simply goes through each row in this table and displays the first letter and last name of the "manager" column. Everything I do comes up with "Subquery returns more than one row".
Starting small, I've just decided to pull the first letter:
SELECT id, LEFT((SELECT manager FROM my_table), 1) FROM my_table;
Or am I just completely off base on this

You're using a subquery to fetch into a field of a parent query. As such, the subquery can return only a single row. think of it this way: a result set is a 2-dimensional construct. a series of columns and rows. The data a subquery returns has to match the physical constraints of the thing it's returning into.
Since you're fetching into a field, that means one SINGLE value. If multiple values were allowed to be returned, you'd effectively be trying to turn your 2D result set into a 3d set (rows + columns plus a skyscraper growing out of one of those fields).
Your query does NOT need to be a subquery at all:
SELECT id, LEFT(manager, 1) AS first_letter FROM yourtable
Also, not that if you want separate first and last names, you would be better off storing those are separate fields. It is very easy to rebuilt a name from a set of first/last name fields, but very very difficult to reliably separate a monolithic name into individual first and last names, e.g.
simple:
John Doe (fn: john, ln: doe)
hard:
Billy Jo Todd (is that "Billy" and "Jo Todd", "Billy" and "Todd" with middle name Jo?
dead simple:
field firstname = John
field lastname = Doe

If you want to use a subquery, this query works as you intend it to, though I am not sure this is the best way to proceed in any case. We would need more information about your needs to assert that.
SELECT
m1.id,
m2.manager
FROM
my_table AS m1 INNER JOIN
(SELECT id, LEFT(manager, 1) AS manager FROM my_table) as m2
ON m1.id = m2.id
http://sqlfiddle.com/#!2/9c395/6

You must add a condition to your subquery to return the row you want to compare like
SELECT manager FROM my_table WHERE id = 1

What does it mean by select 1 from table?

I have seen many queries with something as follows.
Select 1
From table
What does this 1 mean, how will it be executed and, what will it return?
Also, in what type of scenarios, can this be used?

select 1 from table will return the constant 1 for every row of the table. It's useful when you want to cheaply determine if record matches your where clause and/or join.

SELECT 1 FROM TABLE_NAME means, "Return 1 from the table". It is pretty unremarkable on its own, so normally it will be used with WHERE and often EXISTS (as #gbn notes, this is not necessarily best practice, it is, however, common enough to be noted, even if it isn't really meaningful (that said, I will use it because others use it and it is "more obvious" immediately. Of course, that might be a viscous chicken vs. egg issue, but I don't generally dwell)).
SELECT * FROM TABLE1 T1 WHERE EXISTS (
SELECT 1 FROM TABLE2 T2 WHERE T1.ID= T2.ID
);
Basically, the above will return everything from table 1 which has a corresponding ID from table 2. (This is a contrived example, obviously, but I believe it conveys the idea. Personally, I would probably do the above as SELECT * FROM TABLE1 T1 WHERE ID IN (SELECT ID FROM TABLE2); as I view that as FAR more explicit to the reader unless there were a circumstantially compelling reason not to).
EDIT
There actually is one case which I forgot about until just now. In the case where you are trying to determine existence of a value in the database from an outside language, sometimes SELECT 1 FROM TABLE_NAME will be used. This does not offer significant benefit over selecting an individual column, but, depending on implementation, it may offer substantial gains over doing a SELECT *, simply because it is often the case that the more columns that the DB returns to a language, the larger the data structure, which in turn mean that more time will be taken.

If you mean something like
SELECT * FROM AnotherTable
WHERE EXISTS (SELECT 1 FROM table WHERE...)
then it's a myth that the 1 is better than
SELECT * FROM AnotherTable
WHERE EXISTS (SELECT * FROM table WHERE...)
The 1 or * in the EXISTS is ignored and you can write this as per Page 191 of the ANSI SQL 1992 Standard:
SELECT * FROM AnotherTable
WHERE EXISTS (SELECT 1/0 FROM table WHERE...)

it does what it says - it will always return the integer 1. It's used to check whether a record matching your where clause exists.

select 1 from table is used by some databases as a query to test a connection to see if it's alive, often used when retrieving or returning a connection to / from a connection pool.

The result is 1 for every record in the table.

To be slightly more specific, you would use this to do
SELECT 1 FROM MyUserTable WHERE user_id = 33487
instead of doing
SELECT * FROM MyUserTable WHERE user_id = 33487
because you don't care about looking at the results. Asking for the number 1 is very easy for the database (since it doesn't have to do any look-ups).

Although it is not widely known, a query can have a HAVING clause without a GROUP BY clause.
In such circumstances, the HAVING clause is applied to the entire set. Clearly, the SELECT clause cannot refer to any column, otherwise you would (correct) get the error, "Column is invalid in select because it is not contained in the GROUP BY" etc.
Therefore, a literal value must be used (because SQL doesn't allow a resultset with zero columns -- why?!) and the literal value 1 (INTEGER) is commonly used: if the HAVING clause evaluates TRUE then the resultset will be one row with one column showing the value 1, otherwise you get the empty set.
Example: to find whether a column has more than one distinct value:
SELECT 1
FROM tableA
HAVING MIN(colA) < MAX(colA);

If you don't know there exist any data in your table or not, you can use following query:
SELECT cons_value FROM table_name;
For an Example:
SELECT 1 FROM employee;
It will return a column which contains the total number of rows & all rows have the same constant value 1 (for this time it returns 1 for all rows);
If there is no row in your table it will return nothing.
So, we use this SQL query to know if there is any data in the table & the number of rows indicates how many rows exist in this table.

If you just want to check a true or false based on the WHERE clause, select 1 from table where condition is the cheapest way.

This means that You want a value "1" as output or Most of the time used as Inner Queries because for some reason you want to calculate the outer queries based on the result of inner queries.. not all the time you use 1 but you have some specific values...
This will statically gives you output as value 1.

I see it is always used in SQL injection,such as:
www.urlxxxxx.com/xxxx.asp?id=99 union select 1,2,3,4,5,6,7,8,9 from database;
These numbers can be used to guess where the database exists and guess the column name of the database you specified.And the values of the tables.

it simple means that you are retrieving the number first column from table ,,,,means
select Emply_num,Empl_no From Employees ;
here you are using select 1 from Employees;
that means you are retrieving the Emply_num column.
Thanks

The reason is another one, at least for MySQL. This is from the MySQL manual
InnoDB computes index cardinality values for a table the first time that table is accessed after startup, instead of storing such values in the table. This step can take significant time on systems that partition the data into many tables. Since this overhead only applies to the initial table open operation, to “warm up” a table for later use, access it immediately after startup by issuing a statement such as SELECT 1 FROM tbl_name LIMIT 1

This is just used for convenience with IF EXISTS(). Otherwise you can go with
select * from [table_name]
Image In the case of 'IF EXISTS', we just need know that any row with specified condition exists or not doesn't matter what is content of row.
select 1 from Users
above example code, returns no. of rows equals to no. of users with 1 in single column

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008

Evaluating a correlated subquery in SQL - mysql

Related

Not selecting duplicates in join / where query

Union query to combine results of 3 tables

MySql Duplicated rows string comparation performance

MySQL Subquery - Returning More Than One Row

What does it mean by select 1 from table?

Categories

Resources