I'm having trouble getting my head around evaluating correlated subqueries. An example is using a correlated subquery in SELECT so that GROUP BY isn't needed:
Consider the relations:
Movies : Title, Director, Length
Schedule : Theater, Title
I have the following query
SELECT S.Theater, MAX(M.Length)
FROM Movies M JOIN Schedule S ON M.Title=S.Title
GROUP BY S.Theater
This gets the longest film that each theatre is playing. Here is the same query without using GROUP BY:
SELECT DISTINCT S.Theater,
(SELECT MAX(M.Length)
FROM Movies M
WHERE M.Title=S.Title)
FROM Schedule S
but I don't quite understand how it works.
I'd appreciate if anybody could give me an example of how correlated subqueries are evaluated.
Thanks :)
Conceptually...
To understand this, first ignore the bit about correlated subquery.
Consider the order of operations for a statement like this:
SELECT t.foo FROM mytable t
MySQL prepares an empty resultset. Rows in the resultset will consist of one column, because there is one expression in the SELECT list. A row is retrieved from mytable, and MySQL puts a row into the resultset, assigning the value of the foo column from the mytable row to the foo column in the resultset. It fetches the next row and repeats that same process, until there are no more rows to fetch from the table.
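To make that concrete, here is a minimal sketch with a made-up mytable (the table and its contents are purely illustrative):
CREATE TABLE mytable (foo INT);
INSERT INTO mytable (foo) VALUES (10), (20), (30);

SELECT t.foo FROM mytable t;
-- resultset: one column (foo), one row per table row: 10, 20, 30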
Pretty easy stuff. But bear with me.
Consider this statement:
SELECT t.foo AS fi, 'bar' AS fo FROM mytable t
MySQL processes that the same way. It prepares an empty resultset. Rows in the resultset are going to have two columns this time. The first column is given the name fi (because we assigned the name fi with an alias). The second column in rows of the resultset will be named fo, because (again) we assigned an alias.
Now we fetch a row from mytable, and insert a row into the resultset. The value of the foo column goes into the column named fi, and the literal string 'bar' goes into the column named fo. Continue fetching rows and inserting rows into the resultset, until there are no more rows to fetch.
Not too hard.
Next, consider this statement, which looks a little more tricky:
SELECT t.foo AS fi, (SELECT 'bar') AS fo FROM mytable t
The same thing happens again. Empty resultset. Rows have two columns, named fi and fo.
Fetch a row from mytable, and insert a row into the resultset. The value of foo goes into column fi (just like before). This is where it gets tricky... for the second column in the resultset, MySQL executes the query inside the parens. In this case it's a pretty simple query, and we can run it on its own to see what it returns. MySQL takes the result from that query, assigns it to the fo column, and inserts the row into the resultset.
Still with me?
SELECT t.foo AS fi, (SELECT q.tip FROM bartab q LIMIT 1) AS fo FROM mytable t
This is starting to look more complicated. But it's not really that much different. The same things happen again. Prepare the empty resultset. Rows will have two columns, one named fi, the other named fo. Fetch a row from mytable. Get the value from the foo column, and assign it to the fi column in the result row. For the fo column, execute the query, and assign the result from the query to the fo column. Insert the result row into the resultset. Fetch another row from mytable, and repeat the process.
Here we should stop and notice something. MySQL is picky about that query in the SELECT list. Really really picky. MySQL has restrictions on that. The query must return exactly one column. And it cannot return more than one row.
In that last example, for the row being inserted into the resultset, MySQL is looking for a single value to assign to the fo column. When we think about it that way, it makes sense that the query can't return more than one column... what would MySQL do with the value from the second column? And it makes sense that we don't want to return more than one row... what would MySQL do with multiple rows?
MySQL will allow the query to return zero rows. When that happens, MySQL assigns a NULL to the fo column.
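For example, a small sketch using hypothetical mytable(foo) and bartab(tip) tables: if bartab happens to be empty, the subquery returns zero rows, and every result row simply gets NULL in the fo column.
SELECT t.foo AS fi,
       (SELECT q.tip FROM bartab q LIMIT 1) AS fo
FROM mytable t;
-- with an empty bartab: every row comes back as (foo value, NULL)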
If you have an understanding of that, you're 95% of the way there to understanding the correlated subquery.
Let's look at another example. Our single line of SQL is getting a little unwieldy, so we'll just add some line breaks and spaces to make it easier for us to work with. The extra spaces and line breaks don't change the meaning of our statement.
SELECT t.foo AS fi
, ( SELECT q.tip
FROM bartab q
WHERE q.col = t.foo
ORDER BY q.tip DESC
LIMIT 1
) AS fo
FROM mytable t
Okay, that looks a lot more complicated. But is it really? It's the same thing again. Prepare an empty resultset. Rows will have two columns, fi and fo. Fetch a row from mytable, and get a row ready to insert into the resultset. Copy the value from the foo column, and assign it to the fi column. For the fo column, execute the query, assign the single value returned by the query to the fo column, and push the row into the resultset. Fetch the next row from mytable, and repeat.
To explain (finally!) the part about "correlated".
The query we run to get the result for the fo column contains a reference to a column from the outer table: t.foo. In this example that reference appears in the WHERE clause; it doesn't have to, it could appear anywhere in the statement.
What MySQL does when it runs that subquery is pass in the value of the outer row's foo column. If the row we just fetched from mytable has a value of 42 in the foo column, that subquery is equivalent to
SELECT q.tip
FROM bartab q
WHERE q.col = 42
ORDER BY q.tip DESC
LIMIT 1
But since we're not passing in a fixed literal value of 42, and what we're actually passing in is the value from the row in the outer query, the result returned by our subquery is "related" to the row we're processing in the outer query.
We could make the subquery a lot more complicated, as long as we remember the rule about a subquery in the SELECT list... it has to return exactly one column, and at most one row. In other words, it returns at most one value.
Correlated subqueries can appear in parts of the statement other than the SELECT list, such as the WHERE clause. The same general concept applies. For each row processed by the outer query, the values of the column(s) from that row are passed in to the subquery. The result returned from the subquery is related to the row being processed in the outer query.
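For instance, here is a small sketch of a correlated subquery in a WHERE clause, using the same hypothetical mytable and bartab tables as above:
SELECT t.foo
FROM mytable t
WHERE EXISTS ( SELECT 1
               FROM bartab q
               WHERE q.col = t.foo );
-- for each candidate row of mytable, that row's foo value is passed into the subquery,
-- and the row is kept only if the subquery finds a match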
The discussion omits all the steps before the actual execution... parsing the statement into tokens, performing the syntax check (keywords and identifiers in the right place), then performing the semantics check (does mytable exist, does the user have SELECT privilege on it, does the column foo exist in mytable), then determining the access plan, and, during execution, obtaining the required locks, and so on. All of that happens with every statement we execute.
And we're not going to discuss the kinds of horrendous performance issues we can create with correlated subqueries, though the previous discussion should give a clue. The subquery is executed for every row we're putting into the resultset (if it's in the SELECT list of the outer query), or for every row that is accessed by the outer query... so if the outer query returns 40,000 rows, our correlated subquery is going to be executed 40,000 times. We had better make sure that subquery executes fast, and even when it does, we're still going to execute it 40,000 times.
From a conceptual standpoint, imagine that the database is going through each row of the result without the subquery:
SELECT DISTINCT S.Theater, S.Title
FROM Schedule S
And then, for each one of those, running the subquery for you:
SELECT MAX(M.Length)
FROM Movies M
WHERE M.Title = (whatever S.Title was)
And placing that in as the value. Really, it's not (conceptually) that different from using a function:
SELECT DISTINCT S.Theater, SUBSTRING(S.Title, 1, 5)
FROM Schedule S
It's just that this function performs a query against another table, instead.
I do say conceptually, though. The database may be optimizing the correlated query into something more like a join. Whatever it does internally matters for performance, but doesn't matter as much for understanding the concept.
But, it may not return the results you're expecting. Consider the following data (sorry sqlfiddle seems to be erroring atm):
CREATE TABLE Movies (
Title varchar(255),
Length int(10) unsigned,
PRIMARY KEY (Title)
);
CREATE TABLE Schedule (
Title varchar(255),
Theater varchar(255),
PRIMARY KEY (Theater, Title)
);
INSERT INTO Movies
VALUES ('Star Wars', 121);
INSERT INTO Movies
VALUES ('Minions', 91);
INSERT INTO Movies
VALUES ('Up', 96);
INSERT INTO Schedule
VALUES ('Star Wars', 'Cinema 8');
INSERT INTO Schedule
VALUES ('Minions', 'Cinema 8');
INSERT INTO Schedule
VALUES ('Up', 'Cinema 8');
INSERT INTO Schedule
VALUES ('Star Wars', 'Cinema 6');
And then this query:
SELECT DISTINCT
S.Theater,
(
SELECT MAX(M.Length)
FROM Movies M
WHERE M.Title = S.Title
) AS MaxLength
FROM Schedule S;
You'll get this result:
+----------+-----------+
| Theater | MaxLength |
+----------+-----------+
| Cinema 6 | 121 |
| Cinema 8 | 91 |
| Cinema 8 | 121 |
| Cinema 8 | 96 |
+----------+-----------+
As you can see, it's not a replacement for GROUP BY (and you can still use GROUP BY), it's just running the subquery for each row. DISTINCT will only remove duplicates from the result. It's not giving the "greatest length" per theater anymore, it's just giving each unique movie length associated with the theater name.
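If the goal really is one row per theater with the greatest length, you would still group. Against the sample data above, the original GROUP BY form gives one row per theater:
SELECT S.Theater, MAX(M.Length) AS MaxLength
FROM Schedule S
JOIN Movies M ON M.Title = S.Title
GROUP BY S.Theater;
-- Cinema 6 -> 121, Cinema 8 -> 121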
PS: You would likely use an ID column of some sort to identify movies, rather than using the Title in the join. That way, if the name of the movie ever has to be amended, it only needs to change in one place, not all over Schedule too. Plus, it's faster to join on an ID number than on a string.
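A sketch of what that could look like (the MovieID column, its type, and AUTO_INCREMENT are assumptions, not something from the question):
CREATE TABLE Movies (
  MovieID int unsigned NOT NULL AUTO_INCREMENT,
  Title varchar(255),
  Length int(10) unsigned,
  PRIMARY KEY (MovieID)
);

CREATE TABLE Schedule (
  Theater varchar(255),
  MovieID int unsigned NOT NULL,
  PRIMARY KEY (Theater, MovieID)
);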
I want to insert 5 new lines in a table if and only if none of the 5 lines are already there. If one of them is in the table, then I want to abort the insertion (without updating anything), and know which one (or which ones) were already there.
I can think of long ways to do this (such as checking whether SELECT col1 WHERE col1 IN (value1,value2,...) returns anything, and then inserting only if it doesn't).
I also guess transactions can do this, but I'm currently learning how they work. However, I don't know if a transaction can give me which entry(ies) is (are) a duplicate(s).
With or without transactions, is there any way to do this in one or two queries only?
Thanks
I doubt there is a better way than the solution you mentioned: first run a SELECT query, and if it doesn't return anything, INSERT. You asked for something in one or two queries; this is exactly two queries, so pretty efficient in my view. I can't think of an efficient way to use transactions for this. Transactions are good when you have multiple INSERT or UPDATE queries; you only have one.
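A minimal sketch of that two-query approach, assuming a hypothetical table FOO with a column col1 and candidate values val1..val5 (the same placeholders used in the answer below):
-- query 1: which of the candidates are already there?
SELECT col1 FROM FOO WHERE col1 IN (val1, val2, val3, val4, val5);

-- query 2: run only if query 1 returned no rows
INSERT INTO FOO (col1) VALUES (val1), (val2), (val3), (val4), (val5);
Query 1 also tells you which value(s) were already present, which is what the question asks for.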
The INSERT statement does not give you a lot of chances to do the job. If you add a UNIQUE constraint on the desired field and then insert all the rows in a single statement such as
INSERT INTO FOO(col1) VALUES
(val1),
(val2),
(val3),
(val4),
(val5);
It is going to raise an error due to the constraint violation and therefore abort the whole statement.
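For reference, a sketch of such a constraint on the FOO table from the example (the constraint name is made up):
ALTER TABLE FOO ADD CONSTRAINT uq_foo_col1 UNIQUE (col1);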
If you want to avoid the error, the job becomes a little more contrived:
INSERT INTO FOO (col1)
SELECT a.col1
FROM ( SELECT val1 AS col1
       UNION SELECT val2
       UNION SELECT val3
       UNION SELECT val4
       UNION SELECT val5 ) a
INNER JOIN
     ( SELECT g.b
       FROM ( SELECT FALSE AS b FROM FOO WHERE col1 IN (val1, val2, val3, val4, val5)
              UNION
              SELECT TRUE ) g
       ORDER BY g.b     -- FALSE sorts first, so LIMIT 1 picks FALSE whenever a duplicate exists
       LIMIT 1 ) b ON b.b;
What happens? The innermost query returns TRUE only if none of the values are already present, so the statement inserts all the values only when there are no existing matches.
I have the following table:
create table x (
id integer,
property integer
);
I need to efficiently run queries that test multiple property values for a given id. For instance, I may want a query that gets all ids with a property satisfying the condition: 1 and not (2 or 3 or 4 or 5).
If all my properties were in boolean columns ('t1' = true if property value 1 exists for id, false otherwise, etc...), then I can simply run this very fast query (assume y were such a table):
select * from y where t1 and not (t2 or t3 or t4 or t5);
Unfortunately, I have thousands of properties, so this won't do. Furthermore, there's no rhyme or reason as to the queries, so while I can bundle groups of properties into conceptual relations, the query boundaries don't respect that. Additionally, these queries are (indirectly) determined by the user, so creating views in anticipation of them won't help. Finally, we'll constantly be adding data sets with new properties whose relations may be new, vague or cross-cutting, meaning that trying to create tables of relations may become a maintenance nightmare.
Hence why I chose the original schema.
To try to accomplish my queries, I tried first creating a pivot on the fields involved in the query, then querying that:
create table pivot as (
select
  id,
  max(if(property=1,true,false)) as t1,
  max(if(property=2,true,false)) as t2,
  max(if(property=3,true,false)) as t3,
  max(if(property=4,true,false)) as t4,
  max(if(property=5,true,false)) as t5
from x
group by id);
select * from pivot where t1 and not (t2 or t3 or t4 or t5);
However, this is very slow. In fact, it's slower than an un-optimized home-brewed solution.
I know I can produce complex queries with sub-queries, but a limited test suggested that performance would be even worse (unless I structured the query incorrectly).
What can I do to speed up my queries?
I assume that id is not unique and an existing record (some_id, property_id) means that the property is on.
First, I would note that "I need to efficiently run queries that test multiple property values for a given id" and "I may want a query that gets all ids with a property satisfying the condition: 1 and not (2 or 3 or 4 or 5)" may lead to completely different queries.
But here is my idea. Some more assumptions:
there is always a positive condition that cuts away a significant part of ids (otherwise I would join several properties into an extra one that becomes such a condition)
any pair (id, property) is unique (and you even have the corresponding UNIQUE index)
Now, if you have an index on (property, id), then the following query will take all matching ids from the covering index (that is, quickly):
SELECT id
FROM t1
WHERE property = 150;
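For reference, a sketch of the two indexes assumed here (the index names are made up, and t1 stands for your properties table):
ALTER TABLE t1 ADD UNIQUE INDEX ux_id_property (id, property);  -- covering index for the correlated subqueries
CREATE INDEX ix_property_id ON t1 (property, id);               -- covering index for the positive condition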
If this query yields a significantly smaller result set than the whole table, you can afford to run another fast correlated subquery for another property that will decrease the result set further. That subquery requires a covering index on (id, property), and the corresponding UNIQUE index is exactly what it needs:
SELECT id
FROM t1
WHERE property = 150
AND NOT EXISTS (
SELECT 1
FROM t2
WHERE t2.id = t1.id AND t2.property = 130
)
AND NOT EXISTS (
SELECT 1
FROM t2
WHERE t2.id = t1.id AND t2.property = 90
);
If an earlier correlated subquery evaluates to false, the following subqueries will not be executed for that row. That is why the order is crucial.
You will need to play with the properties order, and probably hardcode that in the code that executes the query.
UPD: then, I'm afraid, you do not have much choice. The best you can do is walk through the index in a single pass and compute what you need. The speed of the query will then mainly depend on the number of rows in your table. So, again assuming you have the UNIQUE index on (id, property), you can write something like:
SELECT id
FROM t1
GROUP BY id
HAVING
COUNT(IF(property=150, 1, NULL))
AND NOT COUNT(IF(property=130, 1, NULL))
AND NOT COUNT(IF(property=90, 1, NULL));
If you have:
P(id,property) -- thing [id] has property #[property]
and you want:
x(id,p1,p2,p3,p4,p5) -- thing [id] has property p1 per [p1] and ...
Does this suit your needs?
CREATE VIEW x AS
SELECT id,
property=p1 AS p1,
property=p2 AS p2,
property=p3 AS p3,
property=p4 AS p4,
property=p5 AS p5
FROM P
Presumably you are doing something like this:
-- for A(id,...)
SELECT id,... FROM A JOIN x USING(id) WHERE p1 AND NOT (p2 OR p3 OR p4 OR p5)
where you want only the A rows with ids where P(id,p1) and not ... .
(In case you ever do make a table with multiple boolean columns whose storage you would like to minimize, you can use BIT(1) (BIT(m) takes (m+7)/8 bytes). You can use them like booleans.
Do not encode multiple bits into values yourself; it stops the DBMS from optimizing.)
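For illustration, a sketch of such a table, assuming MySQL and reusing the five-property y table from the question (column names t1..t5 as in the question):
CREATE TABLE y (
  id INT PRIMARY KEY,
  t1 BIT(1),
  t2 BIT(1),
  t3 BIT(1),
  t4 BIT(1),
  t5 BIT(1)
);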
Have you considered building a table consisting of (id, propertybin), that gets populated as follows:
INSERT INTO LookupTable (id, propertybin)
SELECT id, sum(1 << property)
FROM MyTable
GROUP BY id
Now, propertybin holds a binary value for each ID, where each bit represents whether a property from 0 to 63 exists in MyTable. I realise you have more than 64 properties, but I think this idea could be expanded upon to hold more properties - since you now have 1 column able to hold information about 64 properties, you would "only" need, say 100 such columns to hold information of 6400 properties.
Now, to find all id's satisfying a certain condition, you would use bitwise logic. For example, to find all id's having property 0, but not 1, 2, 3 or 4, use:
SELECT id
FROM LookupTable
WHERE ((propertybin ^ 0x1E) & 0x1F) = 0x1F
Here, we use the exclusive-OR bitwise operator ^ together with the value 0x1E = binary 11110, to filter the not-criteria. We use the AND bitwise operator & with the value 0x1F = binary 11111, to indicate that we are only interested in the first 5 bits.
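The same filter can also be written a little more directly (a sketch under the same assumptions): mask off everything but the low five bits and require that exactly bit 0 is set.
SELECT id
FROM LookupTable
WHERE (propertybin & 0x1F) = 0x01;  -- bit 0 (property 0) set, bits 1-4 (properties 1-4) clear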
Alternatively, a sufficiently large column of type VARBINARY could possibly be used, eliminating the limitation of 64 properties per column. However, I'm unsure of the possibilities offered by MySQL to filter certain bits in a VARBINARY value.
I'm trying to figure out how to efficiently run a set of queries that will provide a new table of all values that would return results for an arbitrary query.
Say my table has a schema like:
id
name
age
city
What is an efficient way to list all values that would return results for an arbitrary query (e.g., SELECT * FROM main WHERE NOT city=X AND age BETWEEN Y AND Z)?
My naive approach for this would be to use a script and recurse through all possible combinations of {city, age, age} and see which SELECTs return more than 0 results, but that seems incredibly inefficient. I've also tried building large joins on {city, age, age} as well and basically using that table as an argument list to the query, but that quickly becomes an impossibility for queries on many columns.
For simple conjunctive equality queries (e.g., SELECT * FROM main WHERE name=X AND age=Y), this is much simpler, as I can do something like:
SELECT name, age, count(*) AS count FROM main GROUP BY name, age HAVING count > 0
But I'm having difficulty coming up with a general approach for anything more complicated than that.
Any pointers in the right direction would be most helpful, thanks.
EDIT:
It appears I did a very poor job explaining this, sorry.
Imagine a user gives me a database and a template query and says, "Tell me all the values I can use in this query that will yield results from this database." For example, the user might want to know all age range queries that will return at least one row (e.g., the template query is SELECT * FROM main WHERE age BETWEEN X AND Y).
In that particular example, one could run a SELECT to find the min/max ages in the database, and just tell the user to query between those ages.
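For that simple case, the probe could be as small as the following sketch (using the main table and age column from the question):
SELECT MIN(age) AS min_age, MAX(age) AS max_age FROM main;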
Now imagine that the query template is more complicated, such as SELECT * FROM main WHERE NOT city=W AND age BETWEEN X AND Y AND name LIKE Z. How could one determine the range of W/X/Y/Z values that can be used with this query to return results? Does it require creating a join table with every single {city, age, age, name} combination and running the SELECT on each row? How can I do this efficiently so that the operation is time-bound on large databases?
Hopefully that clarifies it.
You could try to write an AFTER INSERT trigger on your table which inserts the values into another table if they don't exist there already. This way you'd have a table with the distinct values of your table. This table would look something like
columnNameFromYourTable | distinctValue
city                    | NY
city                    | LA
age                     | 1
age                     | 2
...
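A minimal sketch of such a trigger, assuming MySQL, the main(city, age, name) table from the question, and a distinctTable(columnNameFromYourTable, distinctValue) with a UNIQUE key over both columns (the trigger name and the UNIQUE key are assumptions):
DELIMITER //
CREATE TRIGGER main_after_insert
AFTER INSERT ON main
FOR EACH ROW
BEGIN
  -- INSERT IGNORE relies on the UNIQUE key to silently skip values already recorded
  INSERT IGNORE INTO distinctTable (columnNameFromYourTable, distinctValue)
  VALUES ('city', NEW.city),
         ('age',  NEW.age),
         ('name', NEW.name);
END//
DELIMITER ;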
Then when you want to know if a record exists in your table for query SELECT * FROM main WHERE NOT city=W AND age BETWEEN X AND Y AND name LIKE Z you'd query the distinctTable with
select 1 from dual where 1=1
and not exists (select 1 from distinctTable where columnNameFromYourTable = 'city' and distinctValue = 'W')
and exists (select 1 from distinctTable where columnNameFromYourTable = 'age' and distinctValue BETWEEN X AND Y)
and exists (select 1 from distinctTable where columnNameFromYourTable = 'name' and distinctValue LIKE '%Z%')
This would be pretty fast then. If it returns 1, there is a matching entry in your table; if it returns no row, there is not.
I have seen many queries with something as follows.
Select 1
From table
What does this 1 mean, how will it be executed and, what will it return?
Also, in what type of scenarios, can this be used?
select 1 from table will return the constant 1 for every row of the table. It's useful when you want to cheaply determine whether a record matches your where clause and/or join.
SELECT 1 FROM TABLE_NAME means, "Return 1 from the table". It is pretty unremarkable on its own, so normally it will be used with WHERE and often EXISTS (as @gbn notes, this is not necessarily best practice; it is, however, common enough to be noted, even if it isn't really meaningful (that said, I will use it because others use it and it is "more obvious" immediately; of course, that might be a vicious chicken vs. egg issue, but I don't generally dwell)).
SELECT * FROM TABLE1 T1 WHERE EXISTS (
SELECT 1 FROM TABLE2 T2 WHERE T1.ID= T2.ID
);
Basically, the above will return everything from table 1 which has a corresponding ID from table 2. (This is a contrived example, obviously, but I believe it conveys the idea. Personally, I would probably do the above as SELECT * FROM TABLE1 T1 WHERE ID IN (SELECT ID FROM TABLE2); as I view that as FAR more explicit to the reader unless there were a circumstantially compelling reason not to).
EDIT
There actually is one case which I forgot about until just now. In the case where you are trying to determine the existence of a value in the database from an outside language, sometimes SELECT 1 FROM TABLE_NAME will be used. This does not offer a significant benefit over selecting an individual column, but, depending on the implementation, it may offer substantial gains over doing a SELECT *, simply because the more columns the DB returns to a language, the larger the data structure, which in turn means that more time will be taken.
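For instance, a sketch of that kind of existence probe issued from application code (the users table and email column here are made up for illustration):
SELECT 1 FROM users WHERE email = 'someone@example.com' LIMIT 1;
-- the application only needs to know whether any row came back at all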
If you mean something like
SELECT * FROM AnotherTable
WHERE EXISTS (SELECT 1 FROM table WHERE...)
then it's a myth that the 1 is better than
SELECT * FROM AnotherTable
WHERE EXISTS (SELECT * FROM table WHERE...)
The 1 or * in the EXISTS is ignored and you can write this as per Page 191 of the ANSI SQL 1992 Standard:
SELECT * FROM AnotherTable
WHERE EXISTS (SELECT 1/0 FROM table WHERE...)
It does what it says: it will always return the integer 1. It's used to check whether a record matching your where clause exists.
select 1 from table is used by some databases as a query to test a connection to see if it's alive, often used when retrieving or returning a connection to / from a connection pool.
The result is 1 for every record in the table.
To be slightly more specific, you would use this to do
SELECT 1 FROM MyUserTable WHERE user_id = 33487
instead of doing
SELECT * FROM MyUserTable WHERE user_id = 33487
because you don't care about looking at the results. Asking for the number 1 is very easy for the database (since it doesn't have to look up any column values).
Although it is not widely known, a query can have a HAVING clause without a GROUP BY clause.
In such circumstances, the HAVING clause is applied to the entire set. Clearly, the SELECT clause cannot refer to any column, otherwise you would (correctly) get the error "Column is invalid in select because it is not contained in the GROUP BY", etc.
Therefore, a literal value must be used (because SQL doesn't allow a resultset with zero columns -- why?!) and the literal value 1 (INTEGER) is commonly used: if the HAVING clause evaluates to TRUE then the resultset will be one row with one column showing the value 1, otherwise you get the empty set.
Example: to find whether a column has more than one distinct value:
SELECT 1
FROM tableA
HAVING MIN(colA) < MAX(colA);
If you don't know whether any data exists in your table, you can use the following query:
SELECT cons_value FROM table_name;
For an Example:
SELECT 1 FROM employee;
It will return a column with one row for every row in the table, and every row holds the same constant value 1 (in this case it returns 1 for all rows);
If there is no row in your table it will return nothing.
So, we use this SQL query to know whether there is any data in the table, and the number of rows returned indicates how many rows exist in the table.
If you just want to check a true or false based on the WHERE clause, select 1 from table where condition is the cheapest way.
This means that you want the value "1" as output. Most of the time it is used in inner queries, because you want to compute the outer query based on the result of the inner query; you don't always use 1, sometimes you use some other specific value.
This statically gives you the value 1 as output.
I see it is always used in SQL injection, such as:
www.urlxxxxx.com/xxxx.asp?id=99 union select 1,2,3,4,5,6,7,8,9 from database;
These numbers can be used to guess where the database output appears, to guess the column names of the database you specified, and to extract the values of the tables.
It simply means that you are retrieving the first column from the table; that is, instead of
select Emply_num, Empl_no From Employees;
here you are using
select 1 from Employees;
which means you are retrieving the Emply_num column.
Thanks
There is another reason, at least for MySQL. This is from the MySQL manual:
InnoDB computes index cardinality values for a table the first time that table is accessed after startup, instead of storing such values in the table. This step can take significant time on systems that partition the data into many tables. Since this overhead only applies to the initial table open operation, to “warm up” a table for later use, access it immediately after startup by issuing a statement such as SELECT 1 FROM tbl_name LIMIT 1
This is just used for convenience with IF EXISTS(). Otherwise you can go with
select * from [table_name]
In the case of IF EXISTS, we just need to know whether any row with the specified condition exists or not; it doesn't matter what the content of the row is.
select 1 from Users
The example code above returns as many rows as there are users, with the value 1 in a single column for every row.