I've always used the IN (val1, val2, ...) syntax quite easily when testing for a bunch of values. However, I'm wondering what type of data structure it actually evaluates to, is this a table function? For example:
-- to populate data
CREATE TABLE main_territory (
name varchar NOT NULL,
is_fake_territory integer NOT NULL,
code varchar NOT NULL
);
INSERT INTO main_territory (name, is_fake_territory, code) VALUES ('Afghanistan', 0, 'AF'), ('Albania', 0, 'AL'), ('Algeria', 0, 'DZ');
select '1' as "query#", * from main_territory where code in ('AF', 'AL') union all
select '2' as "query#", * from main_territory where code in (select 'AF' UNION ALL select 'AL') UNION ALL
select '3' as "query#", * from main_territory where code in (select code from main_territory where name ='Albania' or name = 'Afghanistan')
The second and third queries return a one-columned table (is this called a scalar-table?), and so I would imagine doing (expr1, expr2, ...) does the same thing -- it evaluates to a one-columed table. Is that accurate, or what actual data type is this?
I would not call the IN ( ) predicate tuple comparison. An example of tuple comparison (aka row constructor comparison) is:
WHERE (col1, col2) = ('abc', 123)
Or you can even do multivalued row constructor comparison:
WHERE (col1, col2) IN (('abc', 123), ('xyz', 456))
The examples you show are simply the IN ( ) predicate, which compares a single value to a list of values. If the value matches any of those in the list, the predicate is satisfied. The list can either be a fixed list of expressions or literals:
WHERE code IN ('AF', 'AL')
Or it can be the result of a subquery:
WHERE code IN (SELECT code FROM ...)
How this is implemented depends on the code of the respective RDBMS. It might have to materialize the result of the subquery and store it as a list internally. In some software, they may use a temporary table with one column as the data structure to store the result of the subquery. Then the IN ( ) predicate can be executed as a join against that temporary table. If there's one thing an SQL engine ought to be able to do efficiently, it's a join. :-)
But this might be expensive if the result of the subquery is millions of rows. In that case, a clever optimizer would "factor out" the IN ( ) predicate and just do a join. That is, it would read each value of code and do an index lookup into the second table's code column. This means there's no data structure per se, it's just the evaluation of a join.
The real answer would be implementation-dependent. Both MySQL and PostgreSQL are open-source, so you can try downloading and reading the code yourself if you want to know the implementation.
The confusion is understandable, since these are actually two different kinds of IN:
WHERE expr IN (2, 3, 4, ...)
WHERE expr IN (SELECT ...)
The first will be converted to an array like this:
QUERY PLAN
═══════════════════════════════════════════════════
Seq Scan on tab
Filter: (expr = ANY ('{2,3,4,...}'::integer[]))
or, if the list has only one element, to
QUERY PLAN
══════════════════════
Seq Scan on tab
Filter: (expr = 2)
The second will be executed as a join, for example:
QUERY PLAN
═══════════════════════════════════
Hash Join
Hash Cond: (tab.expr = sub.col)
-> Seq Scan on tab
-> Hash
-> Seq Scan on sub
So, to answer your question: A plain IN list will be converted to an array, and IN becomes = ANY.
Suppose I have a table A with a primary key
PRIMARY KEY(c1,c2)
and the cardinality of c1 is very low, whereas c2 is very high
When executing the following queries,
select *
from A
where (c1, c2) in (('001', 'aaa'))
select *
from A
where c1 = '001'and c2 = 'aaa'
the optimizer uses the index.
However, for the following cases,
Case 1:
select *
from A
where (c1, c2) in (('001', 'aaa'), ('002', 'bbb'), ('003', 'ccc'))
Case 2:
select *
from A
where (c1 = '001'and c2 = 'aaa') or
(c1 = '002'and c2 = 'bbb') or
(c1 = '003'and c2 = 'ccc')
the optimizer stops using the index for the Case 1 but still uses for the Case 2.
What makes the optimizer stop using the index for Case 1?
*MySQL Version: 5.6.10
Although those expressions are semantically equivalent, MySQL only added the Range Optimization of Row Constructor Expressions, which is actually able to execute them the same way, in MySQL 5.7.3 (emphasis mine):
The optimizer now is able to apply the range scan access method to queries of this form:
SELECT ... FROM t1 WHERE ( col_1, col_2 ) IN (( 'a', 'b' ), ( 'c', 'd' ));
Previously, for range scans to be used it was necessary for the query to be written as:
SELECT ... FROM t1 WHERE ( col_1 = 'a' AND col_2 = 'b' )
OR ( col_1 = 'c' AND col_2 = 'd' );
For the optimizer to use a range scan, queries must satisfy these conditions:
Only IN() predicates are used, not NOT IN().
On the left side of the IN() predicate, the row constructor contains only column references.
On the right side of the IN() predicate, row constructors contain only runtime constants, which are either literals or local column references that are bound to constants during execution.
On the right side of the IN() predicate, there is more than one row constructor.
I have used FIND_IN_SET multiple times before but this case is a bit different.
Earlier I was searching a single value in the table like
SELECT * FROM tbl_name where find_in_set('1212121212', sku)
But now I have the list of SKUs which I want to search in the table. E.g
'3698520147','088586004490','868332000057','081308003405','088394000028','089541300893','0732511000148','009191711092','752830528161'
I have two columns in the table SKU LIKE 081308003405 and SKU Variation
In SKU column I am saving single value but in variation column I am saving the value in the comma-separated format LIKE 081308003405,088394000028,089541300893
SELECT * FROM tbl_name
WHERE 1
AND upc IN ('3698520147','088586004490','868332000057','081308003405','088394000028',
'089541300893','0732511000148','009191711092','752830528161')
I am using IN function to search UPC value now I want to search variation as well in the variation column. This is my concern is how to search using SKU list in variation column
For now, I have to check in the loop for UPC variation which is taking too much time. Below is the query
SELECT id FROM products
WHERE 1 AND upcVariation AND FIND_IN_SET('88076164444',upc_variation) > 0
First of all consider to store the data in a normalized way. Here is a good read: Is storing a delimited list in a database column really that bad?
Now - Assumng the following schema and data:
create table products (
id int auto_increment,
upc varchar(50),
upc_variation text,
primary key (id),
index (upc)
);
insert into products (upc, upc_variation) values
('01234', '01234,12345,23456'),
('56789', '45678,34567'),
('056789', '045678,034567');
We want to find products with variations '12345' and '34567'. The expected result is the 1st and the 2nd rows.
Normalized schema - many-to-many relation
Instead of storing the values in a comma separated list, create a new table, which maps product IDs with variations:
create table products_upc_variations (
product_id int,
upc_variation varchar(50),
primary key (product_id, upc_variation),
index (upc_variation, product_id)
);
insert into products_upc_variations (product_id, upc_variation) values
(1, '01234'),
(1, '12345'),
(1, '23456'),
(2, '45678'),
(2, '34567'),
(3, '045678'),
(3, '034567');
The select query would be:
select distinct p.*
from products p
join products_upc_variations v on v.product_id = p.id
where v.upc_variation in ('12345', '34567');
As you see - With a normalized schema the problem can be solved with a quite basic query. And we can effectively use indices.
"Exploiting" a FULLTEXT INDEX
With a FULLTEXT INDEX on (upc_variation) you can use:
select p.*
from products p
where match (upc_variation) against ('12345 34567');
This looks quite "pretty" and is probably efficient. But though it works for this example, I wouldn't feel comfortable with this solution, because I can't say exactly, when it doesn't work.
Using JSON_OVERLAPS()
Since MySQL 8.0.17 you can use JSON_OVERLAPS(). You should either store the values as a JSON array, or convert the list to JSON "on the fly":
select p.*
from products p
where json_overlaps(
'["12345","34567"]',
concat('["', replace(upc_variation, ',', '","'), '"]')
);
No index can be used for this. But neither can for FIND_IN_SET().
Using JSON_TABLE()
Since MySQL 8.0.4 you can use JSON_TABLE() to generate a normalized representation of the data "on the fly". Here again you would either store the data in a JSON array, or convert the list to JSON in the query:
select distinct p.*
from products p
join json_table(
concat('["', replace(p.upc_variation, ',', '","'), '"]'),
'$[*]' columns (upcv text path '$')
) v
where v.upcv in ('12345', '34567');
No index can be used here. And this is probably the slowest solution of all presented in this answer.
RLIKE / REGEXP
You can also use a regular expression:
select p.*
from products p
where p.upc_variation rlike '(^|,)(12345|34567)(,|$)'
See demo of all queries on dbfiddle.uk
You can try with below example:
SELECT * FROM TABLENAME
WHERE 1 AND ( FIND_IN_SET('3698520147', SKU)
OR UPC IN ('3698520147') )
I have a solution for you, you can consider this solution:
1: Create a temporary table example here: Sql Fiddle
select
tablename.id,
SUBSTRING_INDEX(SUBSTRING_INDEX(tablename.name, ',', numbers.n), ',', -1) sku_variation
from
numbers inner join tablename
on CHAR_LENGTH(tablename.sku_split)
-CHAR_LENGTH(REPLACE(tablename.sku_split, ',', ''))>=numbers.n-1
order by id, n
2: Use the temporary table to filter. find in set with your data
Performance considerations. The main thing that matters for performance is whether some index can be used. The complexity of the expression has only a minuscule impact on overall performance.
Step 1 is to learn what can be optimized, and in what way:
Equal: WHERE x = 1 -- can use index
IN/1: WHERE x IN (1) -- Turned into the Equal case by Optimizer
IN/many: WHERE x IN (22,33,44) -- Usually worse than Equal and better than "range"
Easy OR: WHERE (x = 22 OR x = 33) -- Turned into IN if possible
General OR: WHERE (sku = 22 OR upc = 33) -- not sargable (cf UNION)
Easy LIKE: WHERE x LIKE 'abc' -- turned into Equal
Range LIKE: WHERE x LIKE 'abc%' -- equivalent to "range" test
Wild LIKE: WHERE x LIKE '%abc%' -- not sargable
REGEXP: WHERE x RLIKE 'aaa|bbb|ccc' -- not sargable
FIND_IN_SET: WHERE FIND_IN_SET(x, '22,33,44') -- not sargable, even for single item
JSON: -- not sargable
FULLTEXT: WHERE MATCH(x) AGAINST('aaa bbb ccc') -- fast, but not equivalent
NOT: WHERE NOT ((any of the above)) -- usually poor performance
"Sargable" -- able to use index. Phrased differently "Hiding the column in a function call" prevents using an index.
FULLTEXT: There are many restrictions: "word-oriented", min word size, stopwords, etc. But it is very fast when it applies. Note: When used with outer tests, MATCH comes first (if possible), then further filtering will be done without the benefit of indexes, but on a smaller set of rows.
Even when an expression "can" use an index, it "may not". Whether a WHERE clause makes good use of an index is a much longer discussion than can be put here.
Step 2 Learn how to build composite indexes when you have multiple tests (WHERE ... AND ...):
When constructing a composite (multi-column) index, include columns in this order:
'Equal' -- any number of such columns.
'IN/many' column(s)
One range test (BETWEEN, <, etc)
(A couple of side notes.) The Optimizer is smart enough to clean up WHERE 1 AND .... But there are not many things that the Optimizer will handle. In particular, this is not sargable: `AND DATE(x) = '2020-02-20', but this does optimize as a "range":
AND x >= '2020-02-20'
AND x < '2020-02-20' + INTERVAL 1 DAY
Reading
Building indexes: http://mysql.rjweb.org/doc.php/index_cookbook_mysql
Sargable: https://en.wikipedia.org/wiki/Sargable
Tips on Many-to-many: http://mysql.rjweb.org/doc.php/index_cookbook_mysql#many_to_many_mapping_table
This depends on how you use it. In MySQL I found that find_in_set is way faster than using JSON when tested on the following commands, so much faster it wasn't even a competition (to be clear, the speed test did not include the set command line):
Fastest
set #ids = (select group_concat(`ID`) from `table`);
select count(*) from `table` where find_in_set(`ID`, #ids);
10 x slower
set #ids = (select json_arrayagg(`ID`) from `table`);
select count(*) from `table` where `ID` member of( #ids );
34 x slower
set #ids = (select json_arrayagg(`ID`) from `table`);
select count(*) from `table` where JSON_CONTAINS(#ids, convert(`ID`, char));
34 x slower
set #ids = (select json_arrayagg(`ID`) from `table`);
select count(*) from `table` where json_overlaps(#ids, json_array(`ID`));
SELECT * FROM tbl_name t1,(select
group_concat('3698520147',',','088586004490',',','868332000057',',',
'081308003405',',','088394000028',',','089541300893',',','0732511000148',',','009191711092',
',','752830528161') as skuid)t
WHERE FIND_IN_SET(t1.sku,t.skuid)>0
I have complex SQL-Statements with group_concat and i need sometimes to convert the value. So i use something like this (Its an example):
SELECT IF(1=1,CAST(TEST AS CHAR),CAST(TEST AS UNSIGNED))
FROM(
SELECT '40' as TEST
UNION
SELECT '5' as TEST
UNION
SELECT '60' as TEST
) as t1
ORDER BY TEST ASC;
Should return CHAR, returns CHAR.
SELECT IF(1=0,CAST(TEST AS CHAR),CAST(TEST AS UNSIGNED))
FROM(
SELECT '40' as TEST
UNION
SELECT '5' as TEST
UNION
SELECT '60' as TEST
) as t1
ORDER BY TEST ASC;
Should return UNSIGNED, returns CHAR
So the result of IF condition is always a CHAR and need to be CAST.
How i can resolve the problem?
MySQL tries to figure out the column type just by parsing the SQL, not based on runtime conditions like the way an IF() expression goes. While your example allows the result of the IF() to be determined before actually processing any rows, this is not true in general and MySQL doesn't perform this optimization.
It just sees that the IF() can return either UNSIGNED or CHAR, and figures out a common type that both can be converted to safely. This is CHAR.
This is explained in the documentation:
The default return type of IF() (which may matter when it is stored into a temporary table) is calculated as follows:
If expr2 or expr3 produce a string, the result is a string.
If expr2 and expr3 are both strings, the result is case-sensitive if either string is case sensitive.
If expr2 or expr3 produce a floating-point value, the result is a floating-point value.
If expr2 or expr3 produce an integer, the result is an integer.
I have been little bit confused while reading MySQL documentation MySQL documentation
A row subquery is a subquery variant that returns a single row and can thus return more than one column value. Legal operators for row subquery comparisons are:
= > < >= <= <> != <=>
For "=" documentation provides good explanation:
SELECT * FROM t1 WHERE (column1,column2) = (1,1);
is same as:
SELECT * FROM t1 WHERE column1 = 1 AND column2 = 1;
But how do the ">" or "<" can compare two rows? What is "plain" variant for this operator?
create table example(a integer,b integer);
insert into example values (1,1);
insert into example values (1,2);
insert into example values (2,1);
select * from example where (a,b) > (1,1)
a | b
-----
1 | 2
2 | 1
Playground: http://www.sqlfiddle.com/#!9/88641/2
p.s.:
PostgreSQL have same behaviour.
Oracle fail with error "ORA-01796: this operator cannot be used with lists "
MySQL and PostgreSQL has same behaviour, let's read postgres documentation
The = and <> cases work slightly differently from the others. Two rows are considered equal if all their corresponding members are non-null and equal; the rows are unequal if any corresponding members are non-null and unequal; otherwise the result of the row comparison is unknown (null).
For the <, <=, > and >= cases, the row elements are compared left-to-right, stopping as soon as an unequal or null pair of elements is found. If either of this pair of elements is null, the result of the row comparison is unknown (null); otherwise comparison of this pair of elements determines the result. For example, ROW(1,2,NULL) < ROW(1,3,0) yields true, not null, because the third pair of elements are not considered.
So (a,b) = (c,d) evaluated as
a = c and b = d
But (a,b,) < (c,d) evaluated as
a < b or (a=c and b < c)
It's look very strange. And I have no guess how will be evaluated comparation of rows with 3 of more attributes;