This issue came up when I got different record counts for what I thought were identical queries, one using a not in where constraint and the other a left join. The table in the not in constraint had one null value (bad data), which caused that query to return a count of 0 records. I sort of understand why, but I could use some help fully grasping the concept.
To state it simply, why does query A return a result but B doesn't?
A: select 'true' where 3 in (1, 2, 3, null)
B: select 'true' where 3 not in (1, 2, null)
This was on SQL Server 2005. I also found that calling set ansi_nulls off causes B to return a result.
Query A is the same as:
select 'true' where 3 = 1 or 3 = 2 or 3 = 3 or 3 = null
Since 3 = 3 is true, you get a result.
Query B is the same as:
select 'true' where 3 <> 1 and 3 <> 2 and 3 <> null
When ansi_nulls is on, 3 <> null is UNKNOWN, so the predicate evaluates to UNKNOWN, and you don't get any rows.
When ansi_nulls is off, 3 <> null is true, so the predicate evaluates to true, and you get a row.
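To see the two predicates side by side, here is a minimal sketch using Python's built-in sqlite3 module. Using SQLite is an assumption on my part (the question is about SQL Server 2005); it has no ansi_nulls setting, so it only shows the behaviour that corresponds to ansi_nulls being on.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for SQL Server here.
conn = sqlite3.connect(":memory:")

# Query A: 3 = 3 is TRUE, so the OR chain is TRUE despite the NULL.
rows_a = conn.execute("SELECT 'true' WHERE 3 IN (1, 2, 3, NULL)").fetchall()

# Query B: 3 <> NULL is UNKNOWN, so the AND chain is UNKNOWN
# and the WHERE clause filters the row out.
rows_b = conn.execute("SELECT 'true' WHERE 3 NOT IN (1, 2, NULL)").fetchall()

print(rows_a)  # one row
print(rows_b)  # no rows
conn.close()
```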
NOT IN returns 0 records when compared against an unknown value
Since NULL is an unknown, a NOT IN query containing a NULL or NULLs in the list of possible values will always return 0 records since there is no way to be sure that the NULL value is not the value being tested.
Whenever you use NULL you are really dealing with a Three-Valued logic.
Your first query returns results as the WHERE clause evaluates to:
3 = 1 or 3 = 2 or 3 = 3 or 3 = null
which is:
FALSE or FALSE or TRUE or UNKNOWN
which evaluates to
TRUE
The second one:
3 <> 1 and 3 <> 2 and 3 <> null
which evaluates to:
TRUE and TRUE and UNKNOWN
which evaluates to:
UNKNOWN
UNKNOWN is not the same as FALSE.
You can easily test this by calling:
select 'true' where 3 <> null
select 'true' where not (3 <> null)
Both queries will give you no results.
If UNKNOWN were the same as FALSE, then, assuming the first query evaluated to FALSE, the second would have to evaluate to TRUE, as it would be the same as NOT(FALSE).
That is not the case.
There is a very good article on this subject on SqlServerCentral.
The whole issue of NULLs and Three-Valued Logic can be a bit confusing at first, but it is essential to understand in order to write correct queries in T-SQL.
Another article I would recommend is SQL Aggregate Functions and NULL.
Comparing to NULL is undefined unless you use IS NULL.
So, when 3 is compared to NULL (as in query A), the comparison returns undefined.
I.e. SELECT 'true' where 3 in (1,2,null)
and
SELECT 'true' where 3 not in (1,2,null)
will produce the same result, as NOT (UNDEFINED) is still undefined, but not TRUE
If you want to filter with NOT IN against a subquery containing NULLs, just check for not null:
SELECT blah FROM t WHERE blah NOT IN
(SELECT someotherBlah FROM t2 WHERE someotherBlah IS NOT NULL )
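A minimal sketch of the difference the IS NOT NULL filter makes, run through Python's sqlite3 module. The table and column names come from the query above; the sample rows and the use of SQLite are my own assumptions for illustration.

```python
import sqlite3

# Assumption: SQLite stands in for the real engine; sample data is made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t  (blah INTEGER);
    CREATE TABLE t2 (someotherBlah INTEGER);
    INSERT INTO t  VALUES (1), (2), (3);
    INSERT INTO t2 VALUES (1), (NULL);
""")

# Without the filter, the NULL in t2 makes NOT IN UNKNOWN for 2 and 3.
unfiltered = conn.execute(
    "SELECT blah FROM t "
    "WHERE blah NOT IN (SELECT someotherBlah FROM t2) "
    "ORDER BY blah"
).fetchall()

# With the filter, NOT IN only ever compares against non-NULL values.
filtered = conn.execute(
    "SELECT blah FROM t "
    "WHERE blah NOT IN "
    "(SELECT someotherBlah FROM t2 WHERE someotherBlah IS NOT NULL) "
    "ORDER BY blah"
).fetchall()

print(unfiltered)  # no rows
print(filtered)    # rows for 2 and 3
conn.close()
```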
The title of this question at the time of writing is
SQL NOT IN constraint and NULL values
From the text of the question it appears that the problem was occurring in a SQL DML SELECT query, rather than a SQL DDL CONSTRAINT.
However, especially given the wording of the title, I want to point out that some statements made here are potentially misleading, namely those along the lines of (paraphrasing):
When the predicate evaluates to UNKNOWN you don't get any rows.
Although this is the case for SQL DML, when considering constraints the effect is different.
Consider this very simple table with two constraints taken directly from the predicates in the question (and addressed in an excellent answer by @Brannon):
DECLARE @T TABLE
(
true CHAR(4) DEFAULT 'true' NOT NULL,
CHECK ( 3 IN (1, 2, 3, NULL )),
CHECK ( 3 NOT IN (1, 2, NULL ))
);
INSERT INTO @T VALUES ('true');
SELECT COUNT(*) AS tally FROM @T;
As per @Brannon's answer, the first constraint (using IN) evaluates to TRUE and the second constraint (using NOT IN) evaluates to UNKNOWN. However, the insert succeeds! Therefore, in this case it is not strictly correct to say, "you don't get any rows" because we have indeed got a row inserted as a result.
The above effect is indeed the correct one as regards the SQL-92 Standard. Compare and contrast the following section from the SQL-92 spec
7.6 where clause
The result of the where clause is a table of those rows of T for
which the result of the search condition is true.
4.10 Integrity constraints
A table check constraint is satisfied if and only if the specified
search condition is not false for any row of a table.
In other words:
In SQL DML, rows are removed from the result when the WHERE evaluates to UNKNOWN because it does not satisfy the condition "is true".
In SQL DDL (i.e. constraints), rows are not removed from the result when they evaluate to UNKNOWN because it does satisfy the condition "is not false".
Although the effects in SQL DML and SQL DDL respectively may seem contradictory, there is a practical reason for giving UNKNOWN results the 'benefit of the doubt' by allowing them to satisfy a constraint (more correctly, allowing them to not fail to satisfy a constraint): without this behaviour, every constraint would have to explicitly handle nulls, and that would be very unsatisfactory from a language design perspective (not to mention a right pain for coders!)
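The "not false" rule can be tried out with a small sketch in Python's sqlite3 module. This is an assumption-laden stand-in for the T-SQL above: SQLite is used instead of SQL Server, the column is renamed to a hypothetical flag to sidestep any keyword handling of true, and the second table U is my own addition to show a constraint that is genuinely FALSE.

```python
import sqlite3

# Assumption: SQLite stands in for SQL Server; names flag and U are hypothetical.
conn = sqlite3.connect(":memory:")

# Same two CHECK predicates as in the question: one TRUE, one UNKNOWN.
conn.execute("""
    CREATE TABLE T (
        flag TEXT DEFAULT 'true' NOT NULL,
        CHECK (3 IN (1, 2, 3, NULL)),
        CHECK (3 NOT IN (1, 2, NULL))
    )
""")
conn.execute("INSERT INTO T VALUES ('true')")  # UNKNOWN does not violate a CHECK
tally = conn.execute("SELECT COUNT(*) FROM T").fetchone()[0]

# A CHECK that evaluates to FALSE does reject the row.
conn.execute("CREATE TABLE U (flag TEXT, CHECK (3 NOT IN (1, 2, 3)))")
try:
    conn.execute("INSERT INTO U VALUES ('true')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(tally, rejected)
conn.close()
```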
p.s. if you are finding it as challenging to follow such logic as "unknown does not fail to satisfy a constraint" as I am to write it, then consider you can dispense with all this simply by avoiding nullable columns in SQL DDL and anything in SQL DML that produces nulls (e.g. outer joins)!
In A, 3 is tested for equality against each member of the set, yielding (FALSE, FALSE, TRUE, UNKNOWN). Since one of the elements is TRUE, the condition is TRUE. (It's also possible that some short-circuiting takes place here, so it actually stops as soon as it hits the first TRUE and never evaluates 3=NULL.)
In B, I think it is evaluating the condition as NOT (3 in (1,2,null)). Testing 3 for equality against the set yields (FALSE, FALSE, UNKNOWN), which is aggregated to UNKNOWN. NOT ( UNKNOWN ) yields UNKNOWN. So overall the truth of the condition is unknown, which at the end is essentially treated as FALSE.
SQL uses three-valued logic for truth values. The IN query produces the expected result:
SELECT * FROM (VALUES (1), (2)) AS tbl(col) WHERE col IN (NULL, 1)
-- returns first row
But adding a NOT does not invert the results:
SELECT * FROM (VALUES (1), (2)) AS tbl(col) WHERE NOT col IN (NULL, 1)
-- returns zero rows
This is because the above query is equivalent to the following:
SELECT * FROM (VALUES (1), (2)) AS tbl(col) WHERE NOT (col = NULL OR col = 1)
Here is how the where clause is evaluated:
| col | col = NULL⁽¹⁾ | col = 1 | col = NULL OR col = 1 | NOT (col = NULL OR col = 1) |
|-----|----------------|---------|-----------------------|-----------------------------|
| 1 | UNKNOWN | TRUE | TRUE | FALSE |
| 2 | UNKNOWN | FALSE | UNKNOWN⁽²⁾ | UNKNOWN⁽³⁾ |
Notice that:
(1) The comparison involving NULL yields UNKNOWN.
(2) An OR expression where none of the operands is TRUE and at least one operand is UNKNOWN yields UNKNOWN.
(3) The NOT of UNKNOWN yields UNKNOWN.
You can extend the above example to more than two values (e.g. NULL, 1 and 2) but the result will be same: if one of the values is NULL then no row will match.
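The truth table above can be reproduced by selecting the predicates as columns instead of filtering on them. A minimal sketch with Python's sqlite3 module follows; SQLite is an assumption here, it reports TRUE/FALSE as 1/0 and UNKNOWN as NULL (None in Python), and the CTE form is used because I am not relying on SQLite accepting the AS tbl(col) alias syntax.

```python
import sqlite3

# Assumption: SQLite stands in for the original engine.
conn = sqlite3.connect(":memory:")

# Show the value of each predicate per row rather than filtering on it.
rows = conn.execute("""
    WITH tbl(col) AS (VALUES (1), (2))
    SELECT col,
           col IN (NULL, 1)       AS in_result,
           NOT (col IN (NULL, 1)) AS not_in_result
    FROM tbl
    ORDER BY col
""").fetchall()

# Filtering on the negated predicate keeps only rows where it is TRUE.
matched = conn.execute("""
    WITH tbl(col) AS (VALUES (1), (2))
    SELECT col FROM tbl WHERE NOT (col IN (NULL, 1))
""").fetchall()

for row in rows:
    print(row)
print(matched)
conn.close()
```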
Null signifies an absence of data; that is, it is unknown, not a data value of nothing. It's very easy for people from a programming background to confuse this because in C-type languages, when using pointers, null is indeed nothing.
Hence in the first case 3 is indeed in the set of (1,2,3,null) so true is returned
In the second however you can reduce it to
select 'true' where 3 not in (null)
So nothing is returned because the parser knows nothing about the set to which you are comparing it - it's not an empty set but an unknown set. Using (1, 2, null) doesn't help because the (1,2) set is obviously false, but then you're and'ing that against unknown, which is unknown.
It may be concluded from answers here that NOT IN (subquery) doesn't handle nulls correctly and should be avoided in favour of NOT EXISTS. However, such a conclusion may be premature. In the following scenario, credited to Chris Date (Database Programming and Design, Vol 2 No 9, September 1989), it is NOT IN that handles nulls correctly and returns the correct result, rather than NOT EXISTS.
Consider a table sp to represent suppliers (sno) who are known to supply parts (pno) in quantity (qty). The table currently holds the following values:
VALUES ('S1', 'P1', NULL),
('S2', 'P1', 200),
('S3', 'P1', 1000)
Note that quantity is nullable, i.e. it is possible to record the fact that a supplier is known to supply parts even if the quantity is not known.
The task is to find the suppliers who are known to supply part number 'P1' but not in quantities of 1000.
The following uses NOT IN to correctly identify supplier 'S2' only:
WITH sp AS
( SELECT *
FROM ( VALUES ( 'S1', 'P1', NULL ),
( 'S2', 'P1', 200 ),
( 'S3', 'P1', 1000 ) )
AS T ( sno, pno, qty )
)
SELECT DISTINCT spx.sno
FROM sp spx
WHERE spx.pno = 'P1'
AND 1000 NOT IN (
SELECT spy.qty
FROM sp spy
WHERE spy.sno = spx.sno
AND spy.pno = 'P1'
);
However, the query below uses the same general structure with NOT EXISTS, but incorrectly includes supplier 'S1' in the result (i.e. the one for which the quantity is null):
WITH sp AS
( SELECT *
FROM ( VALUES ( 'S1', 'P1', NULL ),
( 'S2', 'P1', 200 ),
( 'S3', 'P1', 1000 ) )
AS T ( sno, pno, qty )
)
SELECT DISTINCT spx.sno
FROM sp spx
WHERE spx.pno = 'P1'
AND NOT EXISTS (
SELECT *
FROM sp spy
WHERE spy.sno = spx.sno
AND spy.pno = 'P1'
AND spy.qty = 1000
);
So NOT EXISTS is not the silver bullet it may have appeared!
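Both queries can be run against the same data to compare the results. Here is a minimal sketch using Python's sqlite3 module; using SQLite and a real sp table instead of the CTE are my assumptions, the data and predicates are the ones given above.

```python
import sqlite3

# Assumption: SQLite stands in for the original engine; sp is a real table here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sp (sno TEXT, pno TEXT, qty INTEGER);
    INSERT INTO sp VALUES ('S1', 'P1', NULL),
                          ('S2', 'P1', 200),
                          ('S3', 'P1', 1000);
""")

# NOT IN: for S1 the subquery yields only NULL, so the predicate is UNKNOWN.
not_in = conn.execute("""
    SELECT DISTINCT spx.sno
    FROM sp spx
    WHERE spx.pno = 'P1'
      AND 1000 NOT IN (SELECT spy.qty
                       FROM sp spy
                       WHERE spy.sno = spx.sno
                         AND spy.pno = 'P1')
    ORDER BY spx.sno
""").fetchall()

# NOT EXISTS: for S1 the inner WHERE is UNKNOWN, so no row exists and S1 passes.
not_exists = conn.execute("""
    SELECT DISTINCT spx.sno
    FROM sp spx
    WHERE spx.pno = 'P1'
      AND NOT EXISTS (SELECT *
                      FROM sp spy
                      WHERE spy.sno = spx.sno
                        AND spy.pno = 'P1'
                        AND spy.qty = 1000)
    ORDER BY spx.sno
""").fetchall()

print(not_in)
print(not_exists)
conn.close()
```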
Of course, the source of the problem is the presence of nulls, therefore the 'real' solution is to eliminate those nulls.
This can be achieved (among other possible designs) using two tables:
sp suppliers known to supply parts
spq suppliers known to supply parts in known quantities
noting there should probably be a foreign key constraint where spq references sp.
The result can then be obtained using the 'minus' relational operator (being the EXCEPT keyword in Standard SQL) e.g.
WITH sp AS
( SELECT *
FROM ( VALUES ( 'S1', 'P1' ),
( 'S2', 'P1' ),
( 'S3', 'P1' ) )
AS T ( sno, pno )
),
spq AS
( SELECT *
FROM ( VALUES ( 'S2', 'P1', 200 ),
( 'S3', 'P1', 1000 ) )
AS T ( sno, pno, qty )
)
SELECT sno
FROM spq
WHERE pno = 'P1'
EXCEPT
SELECT sno
FROM spq
WHERE pno = 'P1'
AND qty = 1000;
this is for Boy:
select party_code
from abc as a
where party_code not in (select party_code
from xyz
where party_code = a.party_code);
this works regardless of ansi settings
also this might be of use to know the logical difference between join, exists and in
http://weblogs.sqlteam.com/mladenp/archive/2007/05/18/60210.aspx
Related
I have a Devices (unique elements) and a DeviceTests tables. For each device from Devices there's a max of 6 different DeviceTests. These tests can be true, false or null. The type of those tests goes from 1 to 6.
I'd like to extract all devices with no errors for tests 1, 2, 3 and 6. This is my current query:
SELECT
Devices.*
FROM
Devices LEFT JOIN DeviceTests ON Devices.Imei = DeviceTests.Imei
GROUP BY
Devices.Imei
HAVING
BIT_AND(Result IS NOT NULL OR (Result IS NULL AND TestType IN ('4','5'))) AND
!BIT_OR(Result IS NOT NULL AND !Result);
Assuming that DeviceTests.Result is NULL when the test has no errors, I don't really get why you would use HAVING with the BIT_AND and BIT_OR functions. The WHERE clause is there to let you set conditions on the dataset.
SELECT Devices.*
FROM Devices
LEFT JOIN DeviceTests ON Devices.Imei = DeviceTests.Imei
WHERE DeviceTests.TestType IN ('1','2', '3', '6')
AND DeviceTests.Result IS NULL
GROUP BY Devices.Imei
ORDER BY DeviceTests.Id ASC
I am also assuming that you have an Id column in your DeviceTests table and that you want to sort ascending by test ID; if you don't, you'll get an error.
I have got 2 tables, Security and SecurityTransactions.
Security:
create table security(SecurityId int, SecurityName varchar(50));
insert into security values(1,'apple');
insert into security values(2,'google');
insert into security values(3,'ibm');
SecurityTransactions:
create table SecurityTransactions(SecurityId int, Buy_sell boolean, Quantity int);
insert into securitytransactions values ( 1 , false, 100 );
insert into securitytransactions values ( 1 , true, 20 );
insert into securitytransactions values ( 1 , false, 50 );
insert into securitytransactions values ( 2 , false, 120 );
I want to find out the security name and its number of appearances in SecurityTransactions.
The answer is below:
SecurityName | Appearance
apple | 3
google | 1
I wrote the below sql query :
select S.SecurityName, count(t.securityID) as Appearance
from security S inner join securitytransactions t on S.SecurityId = t.SecurityId
group by t.SecurityId, S.SecurityName;
This query gave me the desired result, but it was still rejected by a person saying the group by should have been S.SecurityName. Why is that?
EDIT :
Which one do you think is correct, and why?
a. group by t.SecurityId, S.securityName
b. group by t.SecurityId
c. group by S.securityName
According to ANSI SQL, if you use a group by clause your select list may only contain items in the group by clause, single row transformations thereof, or aggregate expressions. MySQL is non-standard, and allows other columns too. In this case it happened to produce the right answer, as there's a 1:1 relationship between SecurityId and SecurityName, but generally speaking, this is a bad practice that will make your code hard to understand at best, and unpredictable at worst.
EDIT:
To address the edited question: grouping by both SecurityId and SecurityName isn't technically wrong, it's just redundant. Since there's a 1:1 relationship between the two columns, adding the SecurityId column to the group by clause won't change the result, and will just confuse people reading the query.
This has been racking my head. I've scoured the internet (including this place) and can't find a solution. So as a last resort I was hoping the good people of this forum might be able to help me out.
I have two tables:
TableA
Order_detailsID
OrderID
TitleID
Return_date
TableB
TitleID
Title_name
Quantity_in_stock
And would like to run a query that shows the remaining 'Quantity_in_stock'.
If the 'Return_date' is set to NULL then it means the item is currently out -- so I have been trying to use the count() function for the NULL values and subtract it from the 'Quantity_in_stock'.
This is the script I have so far:
DELIMITER //
CREATE PROCEDURE InStock()
BEGIN
Select TableB.TitleID,
TableB.Title_name,
TableB.Quantity_in_stock AS 'Total_Stock',
COUNT(TableA.return_date IS NULL) AS 'Rented_Out',
TableB.Quantity_in_stock - COUNT(TableA.return_date IS NULL) AS 'Remaining Stock'
From TableB
LEFT JOIN TableA
ON TableA.TitleID = TableB.TitleID
GROUP BY TableB.TitleID;
END//
This works if one or more of the rows for a TitleID have a NULL Return_date; however, if there are no NULL values, the Count() still returns a value of 1 when it should be 0.
What am I doing wrong?
Instead of:
COUNT(TableA.return_date IS NULL)
use this:
SUM(CASE
WHEN TableA.TitleID IS NULL THEN 0
WHEN TableA.return_date IS NOT NULL THEN 0
ELSE 1
END)
The problem with the TableA.return_date IS NULL predicate is that it's true in two completely different situations:
When there is no matching record in TableA
When there is a matching record but TableA.return_date value of this exact record is NULL.
Using the CASE expression you can differentiate between these two cases.
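The two expressions can be compared on a small data set. This is a sketch using Python's sqlite3 module, with SQLite as an assumed stand-in for MySQL and made-up sample rows: title 1 has one copy out and one returned, title 2 has only a returned copy, and title 3 has no orders at all. It also shows that COUNT(expr) counts every row where expr is non-NULL, and an IS NULL test yields 1 or 0, never NULL.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; all sample rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TableB (TitleID INTEGER, Title_name TEXT, Quantity_in_stock INTEGER);
    CREATE TABLE TableA (Order_detailsID INTEGER, OrderID INTEGER,
                         TitleID INTEGER, Return_date TEXT);
    INSERT INTO TableB VALUES (1, 'Alpha', 5), (2, 'Beta', 3), (3, 'Gamma', 2);
    INSERT INTO TableA VALUES (1, 10, 1, '2020-01-01'),
                              (2, 11, 1, NULL),
                              (3, 12, 2, '2020-01-02');
""")

rows = conn.execute("""
    SELECT TableB.TitleID,
           -- counts every joined row, because (x IS NULL) is never NULL
           COUNT(TableA.Return_date IS NULL) AS count_version,
           -- counts only matched rows whose Return_date is NULL
           SUM(CASE
                 WHEN TableA.TitleID IS NULL THEN 0
                 WHEN TableA.Return_date IS NOT NULL THEN 0
                 ELSE 1
               END) AS sum_version
    FROM TableB
    LEFT JOIN TableA ON TableA.TitleID = TableB.TitleID
    GROUP BY TableB.TitleID
    ORDER BY TableB.TitleID
""").fetchall()

for row in rows:
    print(row)  # (TitleID, count_version, sum_version)
conn.close()
```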
I would like to mention a simple concept here: just count the rows where that particular column is null.
select count(*) from table_name where column_name is null
The two queries below both use subqueries. They return the same results and both work fine for me. The problem is that the Method 1 query takes about 10 seconds to execute, while the Method 2 query takes under 1 second.
I was able to convert the Method 1 query to Method 2, but I don't understand what's happening in the query. I have been trying to figure it out myself. I would really like to learn what the difference between the two queries is and where the performance gain comes from. What's the logic behind it?
I'm new to these advanced techniques, and I hope someone can help me out here. I have read the docs, but they did not give me a clue.
Method 1 :
SELECT
*
FROM
tracker
WHERE
reservation_id IN (
SELECT
reservation_id
FROM
tracker
GROUP BY
reservation_id
HAVING
(
method = 1
AND type = 0
AND Count(*) > 1
)
OR (
method = 1
AND type = 1
AND Count(*) > 1
)
OR (
method = 2
AND type = 2
AND Count(*) > 0
)
OR (
method = 3
AND type = 0
AND Count(*) > 0
)
OR (
method = 3
AND type = 1
AND Count(*) > 1
)
OR (
method = 3
AND type = 3
AND Count(*) > 0
)
)
Method 2 :
SELECT
*
FROM
`tracker` t
WHERE
EXISTS (
SELECT
reservation_id
FROM
`tracker` t3
WHERE
t3.reservation_id = t.reservation_id
GROUP BY
reservation_id
HAVING
(
METHOD = 1
AND TYPE = 0
AND COUNT(*) > 1
)
OR
(
METHOD = 1
AND TYPE = 1
AND COUNT(*) > 1
)
OR
(
METHOD = 2
AND TYPE = 2
AND COUNT(*) > 0
)
OR
(
METHOD = 3
AND TYPE = 0
AND COUNT(*) > 0
)
OR
(
METHOD = 3
AND TYPE = 1
AND COUNT(*) > 1
)
OR
(
METHOD = 3
AND TYPE = 3
AND COUNT(*) > 0
)
)
An Explain Plan would have shown you exactly why you should use EXISTS. Usually the question is EXISTS vs COUNT(*). EXISTS is faster. Why?
With regard to the challenges presented by NULL: when the subquery returns a NULL, the IN predicate can evaluate to NULL rather than false, so you need to handle that as well. With EXISTS, the result is merely true or false, which is much easier to cope with. Simply put, IN can't compare anything with NULL, but EXISTS can.
E.g. EXISTS (SELECT * FROM yourtable WHERE bla = 'blabla'); you get true/false the moment one hit is found.
In this case IN sort of takes the position of COUNT(*): it selects ALL matching rows based on the WHERE clause because it compares all values.
But don't forget this either:
EXISTS is faster than IN when the subquery result is very large.
IN is faster than EXISTS when the subquery result is very small.
References for more details:
subquery using IN.
IN - subquery optimization
Join vs. sub-query.
Method 2 is fast because it uses the EXISTS operator, for which MySQL does not load any results.
As mentioned in your docs link, it ignores whatever is in the SELECT clause. It only looks for the first row that matches the criteria; once one is found, it sets the condition to TRUE and moves on to further processing.
On the other hand, Method 1 uses the IN operator, which loads all possible values and then matches against them. The condition is set to TRUE only when an exact match is found, which is a time-consuming process.
Hence your method 2 is fast.
Hope it helps...
The EXISTS operator is a Boolean operator that returns either true or false. The EXISTS operator is often used in a subquery to test for an “exist” condition.
SELECT
select_list
FROM
a_table
WHERE
[NOT] EXISTS(subquery);
If the subquery returns any row, the EXISTS operator returns true, otherwise, it returns false.
In addition, the EXISTS operator terminates further processing immediately once it finds a matching row. Because of this characteristic, you can use the EXISTS operator to improve the performance of the query in some cases.
The NOT operator negates the EXISTS operator. In other words, the NOT EXISTS returns true if the subquery returns no row, otherwise it returns false.
You can use SELECT *, SELECT column, SELECT a_constant, or anything in the subquery. The results are the same because MySQL ignores the select_list that appears in the SELECT clause.
The reason is that the EXISTS operator works based on the “at least found” principle. It returns true and stops scanning the table once at least one matching row is found.
On the other hand, when the IN operator is combined with a subquery, MySQL must process the subquery first and then use the result of the subquery to process the whole query.
The general rule of thumb is that if the subquery contains a large volume of data, the EXISTS operator provides a better performance.
However, the query that uses the IN operator will perform faster if the result set returned from the subquery is very small.
For detailed explanations and examples: MySQL EXISTS - mysqltutorial.org
The second method is faster because it has this line: "WHERE t3.reservation_id = t.reservation_id". In the first case, the subquery has to do a full scan of the table to verify the information. In the second method, however, the subquery knows exactly what it is looking for, and once a row is found, the HAVING condition is checked.
The official documentation: Subquery Optimization with EXISTS
For MySQL, I want a query that returns a SUM of an expression, EXCEPT that I want the returned value to be NULL if any of the expressions included in the SUM are NULL. (The normal operation of SUM is to ignore NULL values.)
Here's a simple test case that illustrates:
CREATE TABLE t4 (fee VARCHAR(3), fi INT);
INSERT INTO t4 VALUES ('fo',10),('fo',200),('fo',NULL),('fum',400);
SELECT fee, SUM(fi) AS sum_fi FROM t4 GROUP BY fee
This returns exactly the result set I expect:
fee sum_fi
--- ------
fo 210
fum 400
What I want is a query that returns a DIFFERENT result set:
fee sum_fi
--- ------
fo NULL
fum 400
What I want is for the value returned for sum_fi to be NULL whenever an included value is NULL. (This is different from the normal behavior of the SUM aggregate function, which ignores NULL and returns a total for the non-NULL expressions.)
Question: What query can I use to return the desired result set?
I don't find any built in aggregate function that exhibits this behavior.
How about using SUM(fi is NULL) to determine if you have a NULL value in your data set?
This should work:
SELECT fee, IF(SUM(fi is NULL), NULL, SUM(fi)) AS sum_fi FROM t4 GROUP BY fee
You can use this query:
SELECT fee,
IF(COUNT(fi) < (SELECT count(fee) FROM t4 temp2 WHERE temp1.fee = temp2.fee),
NULL, SUM(fi)) AS sum_fi
FROM t4 temp1 GROUP BY fee
There is also this solution:
SELECT fee, CASE WHEN COUNT(*) = COUNT(fi) THEN SUM(fi) ELSE NULL END AS sum_fi
FROM t4 GROUP BY fee
which I derived from the answer of the question Aaron W tagged as a duplicate of this. Admittedly, the latter form looks better than my original idea.
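For anyone who wants to try the COUNT(*) = COUNT(fi) form quickly, here is a minimal sketch using Python's sqlite3 module. SQLite is an assumption (the question is about MySQL), which is also why the CASE version is used rather than the MySQL IF() function; the table and data are the test case from the question.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t4 (fee VARCHAR(3), fi INT);
    INSERT INTO t4 VALUES ('fo',10),('fo',200),('fo',NULL),('fum',400);
""")

# Plain SUM ignores the NULL in the 'fo' group.
plain = conn.execute(
    "SELECT fee, SUM(fi) AS sum_fi FROM t4 GROUP BY fee ORDER BY fee"
).fetchall()

# COUNT(*) counts all rows, COUNT(fi) only non-NULL ones;
# a mismatch means the group contains at least one NULL.
strict = conn.execute("""
    SELECT fee,
           CASE WHEN COUNT(*) = COUNT(fi) THEN SUM(fi) ELSE NULL END AS sum_fi
    FROM t4
    GROUP BY fee
    ORDER BY fee
""").fetchall()

print(plain)
print(strict)
conn.close()
```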