Understanding tuple syntax in SQL - mysql

I've always used the IN (val1, val2, ...) syntax quite easily when testing for a bunch of values. However, I'm wondering what type of data structure it actually evaluates to. Is this a table function? For example:
-- to populate data
CREATE TABLE main_territory (
name varchar(100) NOT NULL,
is_fake_territory integer NOT NULL,
code varchar(10) NOT NULL
);
INSERT INTO main_territory (name, is_fake_territory, code) VALUES ('Afghanistan', 0, 'AF'), ('Albania', 0, 'AL'), ('Algeria', 0, 'DZ');
select '1' as "query#", * from main_territory where code in ('AF', 'AL') union all
select '2' as "query#", * from main_territory where code in (select 'AF' UNION ALL select 'AL') UNION ALL
select '3' as "query#", * from main_territory where code in (select code from main_territory where name ='Albania' or name = 'Afghanistan')
The second and third queries return a one-columned table (is this called a scalar table?), and so I would imagine doing (expr1, expr2, ...) does the same thing -- it evaluates to a one-columned table. Is that accurate, or what actual data type is this?

I would not call the IN ( ) predicate tuple comparison. An example of tuple comparison (aka row constructor comparison) is:
WHERE (col1, col2) = ('abc', 123)
Or you can even do multivalued row constructor comparison:
WHERE (col1, col2) IN (('abc', 123), ('xyz', 456))
The examples you show are simply the IN ( ) predicate, which compares a single value to a list of values. If the value matches any of those in the list, the predicate is satisfied. The list can either be a fixed list of expressions or literals:
WHERE code IN ('AF', 'AL')
Or it can be the result of a subquery:
WHERE code IN (SELECT code FROM ...)
How this is implemented depends on the code of the respective RDBMS. It might have to materialize the result of the subquery and store it as a list internally. Some implementations may use a temporary table with one column as the data structure to store the result of the subquery. Then the IN ( ) predicate can be executed as a join against that temporary table. If there's one thing an SQL engine ought to be able to do efficiently, it's a join. :-)
But this might be expensive if the result of the subquery is millions of rows. In that case, a clever optimizer would "factor out" the IN ( ) predicate and just do a join. That is, it would read each value of code and do an index lookup into the second table's code column. This means there's no data structure per se, it's just the evaluation of a join.
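As an illustrative sketch (other_table is a hypothetical table here, and real optimizers rewrite internal plan trees, not SQL text), a predicate like
WHERE code IN (SELECT code FROM other_table)
can be evaluated roughly as a semi-join:
SELECT m.*
FROM main_territory m
WHERE EXISTS (SELECT 1 FROM other_table o WHERE o.code = m.code);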
The real answer would be implementation-dependent. Both MySQL and PostgreSQL are open-source, so you can try downloading and reading the code yourself if you want to know the implementation.

The confusion is understandable, since these are actually two different kinds of IN:
WHERE expr IN (2, 3, 4, ...)
WHERE expr IN (SELECT ...)
The first will be converted to an array like this:
QUERY PLAN
═══════════════════════════════════════════════════
Seq Scan on tab
Filter: (expr = ANY ('{2,3,4,...}'::integer[]))
or, if the list has only one element, to
QUERY PLAN
══════════════════════
Seq Scan on tab
Filter: (expr = 2)
The second will be executed as a join, for example:
QUERY PLAN
═══════════════════════════════════
Hash Join
Hash Cond: (tab.expr = sub.col)
-> Seq Scan on tab
-> Hash
-> Seq Scan on sub
So, to answer your question: A plain IN list will be converted to an array, and IN becomes = ANY.
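You can also write that array form directly; a minimal sketch using the same tab and expr names as the plans above:
-- equivalent to WHERE expr IN (2, 3, 4)
SELECT * FROM tab WHERE expr = ANY (ARRAY[2, 3, 4]);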

Filter on a view is applied to rows excluded by the view

I have a table that can have JSON data in a column. I added a calculated column to the table with the ISJSON() function to mark any rows that do not contain valid JSON data:
CREATE TABLE tbl1 (Id INT IDENTITY(1,1) NOT NULL, Content NVARCHAR(MAX), IsJsonRecord AS ISJSON(Content))
GO
INSERT INTO tbl1 (Content) VALUES ('a'), ('{"name":"asd"}')
GO
Now I have a view that parses the JSON data out to more readable formats such as
CREATE VIEW vw1
AS
SELECT Id,
JSON_VALUE(Content, '$."name"') AS Name
FROM tbl1
WHERE IsJsonRecord > 0
The WHERE clause works as expected when I select from the view.
SELECT *
FROM vw1
When I query the view with an additional where clause I get an error due to malformed JSON data (as below).
SELECT *
FROM vw1
WHERE [Name] LIKE '%a%'
It seems like the query WHERE clause is applied to rows that do not conform to the WHERE clause already specified in the view.
Is this the expected behavior?
I understand that the view is "optimized away", but I was expecting the query optimizer to apply filters to distinct fields before it applied filters that require functions to operate on data. I would think that logic could have performance benefits in some scenarios.
I'm not too sure what to do to accommodate the WHERE clause on the view. My actual case is much more complex than the example, and I'm not sure that I can test each column in the view with a CASE statement over the JSON_VALUE statement.
Any suggestions?
As it stands, no, you cannot tell the compiler in which order to apply the filters, although there are tricks which usually force it to do so.
The problem is that your query is effectively compiled like this:
SELECT *
FROM (
SELECT Id,
JSON_VALUE(Content, '$."name"') AS Name
FROM tbl1
WHERE IsJsonRecord > 0
) vw1
WHERE [Name] LIKE '%a%'
Which is then trivially optimized to:
SELECT Id,
JSON_VALUE(Content, '$."name"') AS Name
FROM tbl1
WHERE IsJsonRecord > 0 AND [Name] LIKE '%a%'
At this point the compiler will make a decision as to which part to evaluate first. For whatever reason, it has chosen to do the LIKE first; this may be due to a good index.
You have a number of solutions:
Use CASE, which is guaranteed (except in certain situations) to be evaluated in order
SELECT Id,
JSON_VALUE(Content, '$."name"') AS Name
FROM tbl1
WHERE CASE WHEN IsJsonRecord > 0 THEN CASE WHEN [Name] LIKE '%a%' THEN 1 END END = 1
Add an index on the persisted column, either with the column as the leading key, or as a filtered index WHERE IsJsonRecord > 0.
This is not guaranteed to always work, but it usually does. Make sure to include all necessary columns.
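A hedged sketch of that first option, assuming the computed column is re-created as PERSISTED and that ISJSON qualifies for indexing here (the index name and INCLUDE list are illustrative):
ALTER TABLE tbl1 DROP COLUMN IsJsonRecord;
ALTER TABLE tbl1 ADD IsJsonRecord AS ISJSON(Content) PERSISTED;
-- computed column as the leading key, covering the columns the view reads
CREATE INDEX IX_tbl1_IsJsonRecord ON tbl1 (IsJsonRecord) INCLUDE (Id, Content);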
A variation on the above is to add a clustered index to the view. There are many restrictions on doing this, but it can work very well. Make sure to add the WITH (NOEXPAND) hint to the query.
A more reliable option is to add a TOP to the view.
This forces the compiler to ensure that IsJsonRecord > 0 is logically evaluated first, which nearly always means it will be physically executed first.
SELECT *
FROM (
SELECT TOP (9223372036854775807) *
FROM vw1
) vw1
WHERE [Name] LIKE '%a%'
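And a sketch of baking the TOP into the view itself, as the option above describes (the large constant is simply the maximum bigint value):
ALTER VIEW vw1
AS
SELECT TOP (9223372036854775807)
       Id,
       JSON_VALUE(Content, '$."name"') AS Name
FROM tbl1
WHERE IsJsonRecord > 0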

How to use FIND_IN_SET using list of data

I have used FIND_IN_SET multiple times before but this case is a bit different.
Earlier I was searching for a single value in the table, like:
SELECT * FROM tbl_name where find_in_set('1212121212', sku)
But now I have a list of SKUs that I want to search in the table, e.g.
'3698520147','088586004490','868332000057','081308003405','088394000028','089541300893','0732511000148','009191711092','752830528161'
I have two columns in the table: SKU (e.g. 081308003405) and SKU Variation.
In the SKU column I save a single value, but in the variation column I save values in comma-separated format, e.g. 081308003405,088394000028,089541300893
SELECT * FROM tbl_name
WHERE 1
AND upc IN ('3698520147','088586004490','868332000057','081308003405','088394000028',
'089541300893','0732511000148','009191711092','752830528161')
I am using IN to search by UPC value; now I want to search the variation column as well. My concern is how to search the variation column using the SKU list.
For now I have to check each UPC variation in a loop, which is taking too much time. Below is the query:
SELECT id FROM products
WHERE 1 AND upcVariation AND FIND_IN_SET('88076164444',upc_variation) > 0
First of all, consider storing the data in a normalized way. Here is a good read: Is storing a delimited list in a database column really that bad?
Now, assuming the following schema and data:
create table products (
id int auto_increment,
upc varchar(50),
upc_variation text,
primary key (id),
index (upc)
);
insert into products (upc, upc_variation) values
('01234', '01234,12345,23456'),
('56789', '45678,34567'),
('056789', '045678,034567');
We want to find products with variations '12345' and '34567'. The expected result is the 1st and the 2nd rows.
Normalized schema - many-to-many relation
Instead of storing the values in a comma separated list, create a new table, which maps product IDs with variations:
create table products_upc_variations (
product_id int,
upc_variation varchar(50),
primary key (product_id, upc_variation),
index (upc_variation, product_id)
);
insert into products_upc_variations (product_id, upc_variation) values
(1, '01234'),
(1, '12345'),
(1, '23456'),
(2, '45678'),
(2, '34567'),
(3, '045678'),
(3, '034567');
The select query would be:
select distinct p.*
from products p
join products_upc_variations v on v.product_id = p.id
where v.upc_variation in ('12345', '34567');
As you see, with a normalized schema the problem can be solved with a quite basic query, and we can effectively use indices.
"Exploiting" a FULLTEXT INDEX
With a FULLTEXT INDEX on (upc_variation) you can use:
select p.*
from products p
where match (upc_variation) against ('12345 34567');
This looks quite "pretty" and is probably efficient. But though it works for this example, I wouldn't feel comfortable with this solution, because I can't say exactly when it doesn't work.
Using JSON_OVERLAPS()
Since MySQL 8.0.17 you can use JSON_OVERLAPS(). You should either store the values as a JSON array, or convert the list to JSON "on the fly":
select p.*
from products p
where json_overlaps(
'["12345","34567"]',
concat('["', replace(upc_variation, ',', '","'), '"]')
);
No index can be used for this. But neither can one be used for FIND_IN_SET().
Using JSON_TABLE()
Since MySQL 8.0.4 you can use JSON_TABLE() to generate a normalized representation of the data "on the fly". Here again you would either store the data in a JSON array, or convert the list to JSON in the query:
select distinct p.*
from products p
join json_table(
concat('["', replace(p.upc_variation, ',', '","'), '"]'),
'$[*]' columns (upcv text path '$')
) v
where v.upcv in ('12345', '34567');
No index can be used here. And this is probably the slowest solution of all presented in this answer.
RLIKE / REGEXP
You can also use a regular expression:
select p.*
from products p
where p.upc_variation rlike '(^|,)(12345|34567)(,|$)'
See demo of all queries on dbfiddle.uk
You can try the below example:
SELECT * FROM TABLENAME
WHERE 1 AND ( FIND_IN_SET('3698520147', SKU)
OR UPC IN ('3698520147') )
I have a solution for you; you can consider this solution:
1: Create a temporary table (example here: Sql Fiddle):
select
    tablename.id,
    SUBSTRING_INDEX(SUBSTRING_INDEX(tablename.sku_split, ',', numbers.n), ',', -1) sku_variation
from
    numbers inner join tablename
    on CHAR_LENGTH(tablename.sku_split)
       - CHAR_LENGTH(REPLACE(tablename.sku_split, ',', '')) >= numbers.n - 1
order by id, n
2: Use the temporary table to filter against your SKU list, as sketched below.
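A hedged sketch of that second step, assuming the result of step 1 was saved into a temporary table with the hypothetical name sku_split_tmp(id, sku_variation):
-- one (id, sku_variation) row per comma-separated value from step 1
SELECT DISTINCT id
FROM sku_split_tmp
WHERE sku_variation IN ('3698520147', '088586004490', '868332000057',
                        '081308003405', '088394000028', '089541300893');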
Performance considerations. The main thing that matters for performance is whether some index can be used. The complexity of the expression has only a minuscule impact on overall performance.
Step 1 is to learn what can be optimized, and in what way:
Equal: WHERE x = 1 -- can use index
IN/1: WHERE x IN (1) -- Turned into the Equal case by Optimizer
IN/many: WHERE x IN (22,33,44) -- Usually worse than Equal and better than "range"
Easy OR: WHERE (x = 22 OR x = 33) -- Turned into IN if possible
General OR: WHERE (sku = 22 OR upc = 33) -- not sargable (cf UNION)
Easy LIKE: WHERE x LIKE 'abc' -- turned into Equal
Range LIKE: WHERE x LIKE 'abc%' -- equivalent to "range" test
Wild LIKE: WHERE x LIKE '%abc%' -- not sargable
REGEXP: WHERE x RLIKE 'aaa|bbb|ccc' -- not sargable
FIND_IN_SET: WHERE FIND_IN_SET(x, '22,33,44') -- not sargable, even for single item
JSON: -- not sargable
FULLTEXT: WHERE MATCH(x) AGAINST('aaa bbb ccc') -- fast, but not equivalent
NOT: WHERE NOT ((any of the above)) -- usually poor performance
"Sargable" -- able to use index. Phrased differently "Hiding the column in a function call" prevents using an index.
FULLTEXT: There are many restrictions: "word-oriented", min word size, stopwords, etc. But it is very fast when it applies. Note: When used with outer tests, MATCH comes first (if possible), then further filtering will be done without the benefit of indexes, but on a smaller set of rows.
Even when an expression "can" use an index, it "may not". Whether a WHERE clause makes good use of an index is a much longer discussion than can be put here.
Step 2 is to learn how to build composite indexes when you have multiple tests (WHERE ... AND ...):
When constructing a composite (multi-column) index, include columns in this order:
'Equal' -- any number of such columns.
'IN/many' column(s)
One range test (BETWEEN, <, etc)
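As an illustration (a sketch with invented table and column names), for WHERE a = 1 AND b IN (22,33,44) AND c < 10:
-- 'Equal' column first, then the IN/many column, then the single range column last
CREATE INDEX idx_t_a_b_c ON t (a, b, c);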
(A couple of side notes.) The Optimizer is smart enough to clean up WHERE 1 AND .... But there are not many things that the Optimizer will handle. In particular, this is not sargable: AND DATE(x) = '2020-02-20', but it does optimize as a "range":
AND x >= '2020-02-20'
AND x < '2020-02-20' + INTERVAL 1 DAY
Reading
Building indexes: http://mysql.rjweb.org/doc.php/index_cookbook_mysql
Sargable: https://en.wikipedia.org/wiki/Sargable
Tips on Many-to-many: http://mysql.rjweb.org/doc.php/index_cookbook_mysql#many_to_many_mapping_table
This depends on how you use it. In MySQL I found that find_in_set is way faster than using JSON when tested on the following commands; so much faster it wasn't even a competition (to be clear, the speed test did not include the SET statement on the first line of each pair):
Fastest
set @ids = (select group_concat(`ID`) from `table`);
select count(*) from `table` where find_in_set(`ID`, @ids);
10 x slower
set @ids = (select json_arrayagg(`ID`) from `table`);
select count(*) from `table` where `ID` member of( @ids );
34 x slower
set @ids = (select json_arrayagg(`ID`) from `table`);
select count(*) from `table` where JSON_CONTAINS(@ids, convert(`ID`, char));
34 x slower
set @ids = (select json_arrayagg(`ID`) from `table`);
select count(*) from `table` where json_overlaps(@ids, json_array(`ID`));
SELECT *
FROM tbl_name t1,
     (SELECT GROUP_CONCAT('3698520147',',','088586004490',',','868332000057',',',
             '081308003405',',','088394000028',',','089541300893',',','0732511000148',',',
             '009191711092',',','752830528161') AS skuid) t
WHERE FIND_IN_SET(t1.sku, t.skuid) > 0

Find out values which are not available in table

Let's assume the following simple table:
Col1
======
one
two
And the following simple query:
Select count(*) from TABLE_A where Col1 in ('one','two','three','four')
The above query will produce the following result:
2
Now I want to find out which values in the IN condition are not available in table_A.
How can I find those missing values?
Like the below result:
three
four
The above query is only an example. In my real query I have 1000 values in the IN condition.
Working Database : DB2
This is one of the workarounds to achieve your expectation.
Instead of hard-coding the values in the IN condition, you can move those values into a table. Then, with a simple LEFT JOIN and a NULL check, you can get the non-matching values.
SELECT MR.Col1
FROM MatchingRecords MR -- here MatchingRecords table contains the IN condition values
LEFT JOIN Table_A TA ON TA.Col1 = MR.Col1
WHERE TA.Col1 IS NULL;
Working DEMO
If the values are to be listed in the statement string rather than stored in a table, the list currently being composed for the IN predicate can instead be written as a VALUES row constructor list. The following revised syntax can be used both for the original aggregate query [the first of the two queries below] and for the query being asked about [the second]:
Select count(*)
from TABLE_A
where Col1 in ( values('one'),('two'),('three'),('four') )
; -- report from above query follows:
COUNT ( * )
2
[Begin edit 05-Aug-2016: adding this text and example just below] Apparently at least one DB2 variant balks at unnamed columns for the derived table, so the query just below names the column; I chose COL1 to match the name from the actual TABLE, but that should not be necessary. The (col1) is the only addition to the query from the original pre-edit version, which remains after this insertion and lacks the (col1) added here:
select *
from ( values('one'),('two'),('three'),('four') ) as x (col1)
except ( select * from table_a )
; -- report from above query follows:
COL1
three
four
The following is the original query given, for which a comment reported a failure for the unnamed column on an unstated DB2 variant; I should have noted that this SQL query functions without error on DB2 for i 7.1.
[End-Edit 05-Aug-2016]
select *
from ( values('one'),('two'),('three'),('four') ) as x
except ( select * from table_a )
; -- report from above query follows:
VALUES
three
four

SQL query to match a list of values with a list of fields in any order without repetition

I recently had to write a query to filter some specific data that looked like the following:
Let's suppose that I have 3 distinct values that I want to search in 3 different fields of one of the tables in my database; they must be searched in all possible orders without repetition.
Here is an example (to make it easy to understand, I will use named queries notation to show where the values must be placed):
val1 = "a", val2 = "b", val3 = "c"
This is the query I've generated:
SELECT * FROM table WHERE
(fieldA = :val1 AND fieldB = :val2 AND fieldC = :val3) OR
(fieldA = :val1 AND fieldB = :val3 AND fieldC = :val2) OR
(fieldA = :val2 AND fieldB = :val1 AND fieldC = :val3) OR
(fieldA = :val2 AND fieldB = :val3 AND fieldC = :val1) OR
(fieldA = :val3 AND fieldB = :val1 AND fieldC = :val2) OR
(fieldA = :val3 AND fieldB = :val2 AND fieldC = :val1)
What I had to do is generate a query that simulates a permutation without repetition. Is there a better way to do this type of query?
This is OK for 3x3, but if I need to do the same with something bigger, like 9x9, then generating the query will be a huge mess.
I'm using MariaDB, but I'm okay accepting answers that can run on PostgreSQL.
(I want to learn if there is a smart way of writing this type of queries without "brute force")
There isn't a much better way, but you can use IN:
SELECT *
FROM table
WHERE :val1 in (fieldA, fieldB, fieldC) and
:val2 in (fieldA, fieldB, fieldC) and
:val3 in (fieldA, fieldB, fieldC)
It is shorter at least. And, this is standard SQL, so it should work in any database.
... I'm okay accepting answers that can run on PostgreSQL. (I want to
learn if there is a smart way of writing this type of queries without "brute force")
There is a "smart way" in Postgres, with sorted arrays.
Integer
For integer values use sort_asc() of the additional module intarray.
SELECT * FROM tbl
WHERE sort_asc(ARRAY[id1, id2, id3]) = '{1,2,3}' -- compare sorted arrays
Works for any number of elements.
Other types
As clarified in a comment, we are dealing with strings.
Create a variant of sort_asc() that works for any type that can be sorted:
CREATE OR REPLACE FUNCTION sort_asc(anyarray)
RETURNS anyarray LANGUAGE sql IMMUTABLE AS
'SELECT array_agg(x ORDER BY x COLLATE "C") FROM unnest($1) AS x';
Not as fast as the sibling from intarray, but fast enough.
Make it IMMUTABLE to allow its use in indexes.
Use COLLATE "C" to ignore sorting rules of the current locale: faster, immutable.
To make the function work for any type that can be sorted, use a polymorphic parameter.
Query is the same:
SELECT * FROM tbl
WHERE sort_asc(ARRAY[val1, val2, val3]) = '{bar,baz,foo}';
Or, if you are not sure about the sort order in "C" locale ...
SELECT * FROM tbl
WHERE sort_asc(ARRAY[val1, val2, val3]) = sort_asc('{bar,baz,foo}'::text[]);
Index
For best read performance create a functional index (at some cost to write performance):
CREATE INDEX tbl_arr_idx ON tbl (sort_asc(ARRAY[val1, val2, val3]));
SQL Fiddle demonstrating all.
My answer assumes there is a Key column that we can single out. The output should be all the keys where each of the three fields matches one of the 3 values:
This "should" get you a list of Keys that meet the criteria:
SELECT F.KEY
FROM (
SELECT DISTINCT L.Key, L.POS
FROM (
SELECT Key, 'A' AS POS, FieldA AS FIELD FROM table AS A
UNION ALL
SELECT Key, 'B' AS POS, FieldB AS FIELD FROM table AS A
UNION ALL
SELECT Key, 'C' AS POS, FieldC AS FIELD FROM table AS A ) AS L
WHERE L.FIELD IN(:VAL1, :VAL2, :VAL3)
) AS F
GROUP BY F.KEY
HAVING COUNT(*) = 3
Although Gordon's answer is definitely shorter and almost certainly faster as well, I was toying with the idea of how to minimize the code change when the number of combinations increases.
What I came up with is something for Postgres which is by no means shorter, but more "change-friendly":
with recursive params (val) as (
values (1),(2),(3) -- these are the input values
), all_combinations as (
select array[val] as elements
from params
union all
select ac.elements||p.val
from params p
join all_combinations ac
on array_length(ac.elements,1) < (select count(*) from params)
)
select *
from the_table
where array[id1,id2,id3] = any (select elements from all_combinations);
What does it do?
First we create a CTE holding the values we are looking for; the recursive CTE then builds a list of all possible permutations of those values. This list will include too many elements, because it will also hold arrays with only one or two elements (those are harmless: they can never equal the three-element array built from the columns).
The final select puts the columns that should be compared into an array and compares that with the permutations generated by the CTE.
Here is a SQLFiddle example: http://sqlfiddle.com/#!15/43066/1
When the number of values (and columns) increases, you only need to add the new value to the VALUES row constructor and add the additional column to the array of columns in the WHERE condition, as sketched below.
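A hedged sketch of that extension for a fourth value and a hypothetical fourth column id4:
with recursive params (val) as (
values (1),(2),(3),(4) -- the new value is added here
), all_combinations as (
select array[val] as elements
from params
union all
select ac.elements||p.val
from params p
join all_combinations ac
on array_length(ac.elements,1) < (select count(*) from params)
)
select *
from the_table
where array[id1,id2,id3,id4] = any (select elements from all_combinations); -- and the new column here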
Using a naive approach, I would use the IN clause for this job and, since there should not be any repetition, exclude the rows where fields repeat.
There are also some optimisations you could do.
First, you can drop the checks for the last field, since:
A <> B, A <> C,
B <> C
already implies:
C <> A, C <> B
Also, a check doesn't need to compare against a previously compared field, since:
A <> B == B <> A
The query would be written as:
SELECT * FROM table
WHERE :val1 in (fieldA, fieldB, fieldC) and
:val2 in (fieldA, fieldB, fieldC) and
:val3 in (fieldA, fieldB, fieldC) and
fieldA not in (fieldB, fieldC) and
fieldB <> fieldC
This is a naive approach; there are probably others which use the MySQL API, but this one does the job.

MySQL: comparison of integer value and string field with index

Table a_table has an index on string_column.
I have a query:
SELECT * FROM a_table WHERE string_column = 10;
I used EXPLAIN and found that no indexes are used.
Why? Could you point me to the relevant MySQL documentation?
Updated: Sandbox (SQL Fiddle)
The essential point is that the index cannot be used if the database has to do a conversion on the table-side of the comparison.
Besides that, the DB always converts strings to numbers, because this is the deterministic direction (otherwise the number 1 could be converted to '01', '001', and so on, as mentioned in the comments).
So, if we compare the two cases that seem to confuse you:
-- index is used
EXPLAIN SELECT * FROM a_table WHERE int_column = '1';
The DB converts the string '1' to the number 1 and then executes the query. It finally has int on both sides so it can use the index.
-- index is NOT used. WTF?
EXPLAIN SELECT * FROM a_table WHERE str_column = 1;
Again, strings are converted to numbers. However, this time it is the data stored in the table that has to be converted. In fact, you are performing a search like cast(str_column as int) = 1. That means you are no longer searching on the indexed data, so the DB cannot use the index.
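A minimal sketch of the usual fix, reusing the question's table: compare against a literal of the column's own type, so no table-side conversion is needed:
-- index NOT used: every stored string must be converted to a number
SELECT * FROM a_table WHERE string_column = 10;
-- index used: the comparison stays on the string side
SELECT * FROM a_table WHERE string_column = '10';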
Please have a look at this for further details:
http://use-the-index-luke.com/sql/where-clause/obfuscation/numeric-strings
http://use-the-index-luke.com/sql/where-clause/functions/case-insensitive-search