I have used FIND_IN_SET multiple times before, but this case is a bit different.
Earlier I was searching for a single value in the table, like:
SELECT * FROM tbl_name where find_in_set('1212121212', sku)
But now I have a list of SKUs that I want to search for in the table, e.g.
'3698520147','088586004490','868332000057','081308003405','088394000028','089541300893','0732511000148','009191711092','752830528161'
I have two columns in the table: SKU (e.g. 081308003405) and SKU Variation.
In the SKU column I save a single value, but in the variation column I save values in comma-separated format, like 081308003405,088394000028,089541300893.
SELECT * FROM tbl_name
WHERE 1
AND upc IN ('3698520147','088586004490','868332000057','081308003405','088394000028',
'089541300893','0732511000148','009191711092','752830528161')
I am using the IN operator to search on the UPC value; now I want to search the variation column as well. My concern is how to search the variation column using the SKU list.
For now I have to check each UPC variation in a loop, which takes too much time. Below is the query:
SELECT id FROM products
WHERE FIND_IN_SET('88076164444', upc_variation) > 0
First of all, consider storing the data in a normalized way. Here is a good read: Is storing a delimited list in a database column really that bad?
Now, assuming the following schema and data:
create table products (
  id int auto_increment,
  upc varchar(50),
  upc_variation text,
  primary key (id),
  index (upc)
);
insert into products (upc, upc_variation) values
('01234', '01234,12345,23456'),
('56789', '45678,34567'),
('056789', '045678,034567');
We want to find products with variations '12345' and '34567'. The expected result is the 1st and the 2nd rows.
Normalized schema - many-to-many relation
Instead of storing the values in a comma separated list, create a new table, which maps product IDs with variations:
create table products_upc_variations (
  product_id int,
  upc_variation varchar(50),
  primary key (product_id, upc_variation),
  index (upc_variation, product_id)
);
insert into products_upc_variations (product_id, upc_variation) values
(1, '01234'),
(1, '12345'),
(1, '23456'),
(2, '45678'),
(2, '34567'),
(3, '045678'),
(3, '034567');
The select query would be:
select distinct p.*
from products p
join products_upc_variations v on v.product_id = p.id
where v.upc_variation in ('12345', '34567');
As you see, with a normalized schema the problem can be solved with quite a basic query, and we can make effective use of indexes.
"Exploiting" a FULLTEXT INDEX
With a FULLTEXT INDEX on (upc_variation) you can use:
select p.*
from products p
where match (upc_variation) against ('12345 34567');
This looks quite "pretty" and is probably efficient. But though it works for this example, I wouldn't feel comfortable with this solution, because I can't say exactly when it doesn't work.
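One concrete pitfall is token length: by default InnoDB indexes only tokens of at least innodb_ft_min_token_size (3) characters, so short codes would silently never match. For completeness, the index itself would be created with something like this (the index name ft_upcv is my choice, not from the question):
ALTER TABLE products ADD FULLTEXT INDEX ft_upcv (upc_variation);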
Using JSON_OVERLAPS()
Since MySQL 8.0.17 you can use JSON_OVERLAPS(). You should either store the values as a JSON array, or convert the list to JSON "on the fly":
select p.*
from products p
where json_overlaps(
'["12345","34567"]',
concat('["', replace(upc_variation, ',', '","'), '"]')
);
No index can be used for this. But neither can one be used for FIND_IN_SET().
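If you stored the variations as a JSON array in the first place, e.g. '["01234","12345","23456"]' (a schema change, not what the asker currently has), the on-the-fly conversion disappears:
select p.*
from products p
where json_overlaps('["12345","34567"]', p.upc_variation);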
Using JSON_TABLE()
Since MySQL 8.0.4 you can use JSON_TABLE() to generate a normalized representation of the data "on the fly". Here again you would either store the data in a JSON array, or convert the list to JSON in the query:
select distinct p.*
from products p
join json_table(
concat('["', replace(p.upc_variation, ',', '","'), '"]'),
'$[*]' columns (upcv text path '$')
) v
where v.upcv in ('12345', '34567');
No index can be used here. And this is probably the slowest solution of all presented in this answer.
RLIKE / REGEXP
You can also use a regular expression:
select p.*
from products p
where p.upc_variation rlike '(^|,)(12345|34567)(,|$)'
See demo of all queries on dbfiddle.uk
You can try the example below:
SELECT * FROM TABLENAME
WHERE 1 AND ( FIND_IN_SET('3698520147', SKU)
OR UPC IN ('3698520147') )
Here is a solution you can consider:
1: Create a temporary table (example here: Sql Fiddle):
select
  tablename.id,
  SUBSTRING_INDEX(SUBSTRING_INDEX(tablename.sku_split, ',', numbers.n), ',', -1) sku_variation
from
  numbers inner join tablename
    on CHAR_LENGTH(tablename.sku_split)
       - CHAR_LENGTH(REPLACE(tablename.sku_split, ',', '')) >= numbers.n - 1
order by id, n
2: Use the temporary table to filter: FIND_IN_SET with your data.
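Note that step 1 relies on a helper table numbers(n) holding consecutive integers, which the answer assumes but does not define. A minimal sketch, sized to the longest comma-separated list you expect:
CREATE TABLE numbers (n INT PRIMARY KEY);
INSERT INTO numbers (n) VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10);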
Performance considerations
The main thing that matters for performance is whether some index can be used. The complexity of the expression has only a minuscule impact on overall performance.
Step 1 is to learn what can be optimized, and in what way:
Equal: WHERE x = 1 -- can use index
IN/1: WHERE x IN (1) -- Turned into the Equal case by Optimizer
IN/many: WHERE x IN (22,33,44) -- Usually worse than Equal and better than "range"
Easy OR: WHERE (x = 22 OR x = 33) -- Turned into IN if possible
General OR: WHERE (sku = 22 OR upc = 33) -- not sargable (cf UNION)
Easy LIKE: WHERE x LIKE 'abc' -- turned into Equal
Range LIKE: WHERE x LIKE 'abc%' -- equivalent to "range" test
Wild LIKE: WHERE x LIKE '%abc%' -- not sargable
REGEXP: WHERE x RLIKE 'aaa|bbb|ccc' -- not sargable
FIND_IN_SET: WHERE FIND_IN_SET(x, '22,33,44') -- not sargable, even for single item
JSON: -- not sargable
FULLTEXT: WHERE MATCH(x) AGAINST('aaa bbb ccc') -- fast, but not equivalent
NOT: WHERE NOT ((any of the above)) -- usually poor performance
"Sargable" -- able to use index. Phrased differently "Hiding the column in a function call" prevents using an index.
FULLTEXT: There are many restrictions: "word-oriented", min word size, stopwords, etc. But it is very fast when it applies. Note: When used with outer tests, MATCH comes first (if possible), then further filtering will be done without the benefit of indexes, but on a smaller set of rows.
Even when an expression "can" use an index, it "may not". Whether a WHERE clause makes good use of an index is a much longer discussion than can be put here.
Step 2 is to learn how to build composite indexes when you have multiple tests (WHERE ... AND ...):
When constructing a composite (multi-column) index, include columns in this order:
'Equal' -- any number of such columns.
'IN/many' column(s)
One range test (BETWEEN, <, etc)
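For example, given a hypothetical table t queried with all three kinds of tests, the guideline above yields:
-- WHERE a = 22 AND b IN (33, 44) AND c > 55
ALTER TABLE t ADD INDEX idx_a_b_c (a, b, c);  -- Equal column first, then the IN column, then the one range column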
(A couple of side notes.) The Optimizer is smart enough to clean up WHERE 1 AND .... But there are not many things that the Optimizer will handle. In particular, this is not sargable: AND DATE(x) = '2020-02-20', but it does optimize as a "range":
AND x >= '2020-02-20'
AND x < '2020-02-20' + INTERVAL 1 DAY
Reading
Building indexes: http://mysql.rjweb.org/doc.php/index_cookbook_mysql
Sargable: https://en.wikipedia.org/wiki/Sargable
Tips on Many-to-many: http://mysql.rjweb.org/doc.php/index_cookbook_mysql#many_to_many_mapping_table
This depends on how you use it. In MySQL I found that FIND_IN_SET is far faster than using JSON when tested on the following commands; so much faster it wasn't even a competition (to be clear, the timing did not include the SET statement):
Fastest
set @ids = (select group_concat(`ID`) from `table`);
select count(*) from `table` where find_in_set(`ID`, @ids);
10 x slower
set @ids = (select json_arrayagg(`ID`) from `table`);
select count(*) from `table` where `ID` member of(@ids);
34 x slower
set @ids = (select json_arrayagg(`ID`) from `table`);
select count(*) from `table` where JSON_CONTAINS(@ids, convert(`ID`, char));
34 x slower
set @ids = (select json_arrayagg(`ID`) from `table`);
select count(*) from `table` where json_overlaps(@ids, json_array(`ID`));
SELECT *
FROM tbl_name t1,
     ( SELECT GROUP_CONCAT('3698520147', ',', '088586004490', ',', '868332000057', ',',
                           '081308003405', ',', '088394000028', ',', '089541300893', ',',
                           '0732511000148', ',', '009191711092', ',', '752830528161') AS skuid ) t
WHERE FIND_IN_SET(t1.sku, t.skuid) > 0
Related
I've always used the IN (val1, val2, ...) syntax quite easily when testing for a bunch of values. However, I'm wondering what type of data structure it actually evaluates to; is this a table function? For example:
-- to populate data
CREATE TABLE main_territory (
name varchar NOT NULL,
is_fake_territory integer NOT NULL,
code varchar NOT NULL
);
INSERT INTO main_territory (name, is_fake_territory, code) VALUES ('Afghanistan', 0, 'AF'), ('Albania', 0, 'AL'), ('Algeria', 0, 'DZ');
select '1' as "query#", * from main_territory where code in ('AF', 'AL') union all
select '2' as "query#", * from main_territory where code in (select 'AF' UNION ALL select 'AL') UNION ALL
select '3' as "query#", * from main_territory where code in (select code from main_territory where name ='Albania' or name = 'Afghanistan')
The second and third queries return a one-columned table (is this called a scalar table?), and so I would imagine doing (expr1, expr2, ...) does the same thing -- it evaluates to a one-columned table. Is that accurate, or what actual data type is this?
I would not call the IN ( ) predicate tuple comparison. An example of tuple comparison (aka row constructor comparison) is:
WHERE (col1, col2) = ('abc', 123)
Or you can even do multivalued row constructor comparison:
WHERE (col1, col2) IN (('abc', 123), ('xyz', 456))
The examples you show are simply the IN ( ) predicate, which compares a single value to a list of values. If the value matches any of those in the list, the predicate is satisfied. The list can either be a fixed list of expressions or literals:
WHERE code IN ('AF', 'AL')
Or it can be the result of a subquery:
WHERE code IN (SELECT code FROM ...)
How this is implemented depends on the code of the respective RDBMS. It might have to materialize the result of the subquery and store it as a list internally. In some software, they may use a temporary table with one column as the data structure to store the result of the subquery. Then the IN ( ) predicate can be executed as a join against that temporary table. If there's one thing an SQL engine ought to be able to do efficiently, it's a join. :-)
But this might be expensive if the result of the subquery is millions of rows. In that case, a clever optimizer would "factor out" the IN ( ) predicate and just do a join. That is, it would read each value of code and do an index lookup into the second table's code column. This means there's no data structure per se, it's just the evaluation of a join.
The real answer would be implementation-dependent. Both MySQL and PostgreSQL are open-source, so you can try downloading and reading the code yourself if you want to know the implementation.
The confusion is understandable, since these are actually two different kinds of IN:
WHERE expr IN (2, 3, 4, ...)
WHERE expr IN (SELECT ...)
The first will be converted to an array like this:
QUERY PLAN
═══════════════════════════════════════════════════
Seq Scan on tab
Filter: (expr = ANY ('{2,3,4,...}'::integer[]))
or, if the list has only one element, to
QUERY PLAN
══════════════════════
Seq Scan on tab
Filter: (expr = 2)
The second will be executed as a join, for example:
QUERY PLAN
═══════════════════════════════════
Hash Join
Hash Cond: (tab.expr = sub.col)
-> Seq Scan on tab
-> Hash
-> Seq Scan on sub
So, to answer your question: A plain IN list will be converted to an array, and IN becomes = ANY.
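In PostgreSQL you can also write that array form yourself; the two predicates below are interchangeable:
SELECT * FROM main_territory WHERE code IN ('AF', 'AL');
SELECT * FROM main_territory WHERE code = ANY (ARRAY['AF', 'AL']);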
I have code like this:
SELECT column1 = (SELECT MAX(column-name21) FROM table-name2 WHERE condition2 GROUP BY id2) as m,
column2 = (SELECT count(*) FROM table-name2 WHERE condition2 GROUP BY id2) as c,
column-names
FROM table-name
WHERE condition
ORDER BY ordercondition
LIMIT 25,50
Those inner selects are quite long and complicated.
My question is: are there constructs in the MySQL language that allow one to avoid duplicating code and computation in this case?
For example, something like this
SELECT (column1, column2) = (SELECT MAX(column-name1) as m, count(*) as c FROM table-name WHERE condition GROUP BY id),
column-names
FROM table-name
WHERE condition
ORDER BY ordercondition
LIMIT 25,50
which of course won't be accepted by MySQL.
I tried this:
SELECT (SELECT MAX(column-name1) as column1, count(*) as column2 FROM table-name WHERE condition GROUP BY id),
column-names
FROM table-name
WHERE condition
ORDER BY ordercondition
LIMIT 25,50
and it also doesn't work.
Such subqueries get cumbersome when you need more than one from the same source. Usually, the "fix" is to use a "derived table" and JOIN:
SELECT x2.col1, x2.col2, names
FROM ( SELECT MAX(c21) AS col1,
COUNT(*) AS col2,
?? -- may be needed for "cond2"
FROM t2
WHERE cond2a ) AS x2
JOIN t1
ON cond2b
WHERE cond1
ORDER BY ??? -- Limit is non-deterministic without ORDER BY
LIMIT 25, 50
If the "condition" in the subquery is "correlated", please specify it; it makes a big difference in how to transform the query.
The construct COUNT(col) is usually a mistake:
COUNT(*) -- the number of rows.
COUNT(DISTINCT col) -- the number of different values in column `col`.
COUNT(col) -- count the number of rows with non-NULL `col`.
Please provide your actual query and provide SHOW CREATE TABLE. I glossed over several issues; "the devil is in the details".
For Edit 1:
INDEX(tool, uuuuId) -- would help performance
Is uuuuId some form of "hash" or "UUID"? If so, that is relevant to seeing how the performance works. Also, how big (approximately) are the tables? What is the value of innodb_buffer_pool_size? (I am fishing for whether you are I/O-bound or CPU-bound.)
WZ needs INDEX(uuuuId, ppppppId, check1). But actually, that Select...=Yes can be turned into an EXISTS for some speedup.
Z might benefit from INDEX(check1, uuuuId, ppppppId, check2)
Since Z and WZ are the same table, this might take care of both:
INDEX(ppppppId, uuuuId, check1, check2)
(The order is important.)
I cannot create a virtual table for this. Basically, what I have is a list of values:
'Succinylcholine','Thiamine','Trandate','Tridol Drip'
I want to know which of those values are not present in table1 and display them. Is this possible? I have tried using left joins and creating a variable containing the list to compare against the table, but it returns the wrong results.
This is one of the things I have tried:
SET @list = "'Amiodarone','Ammonia Inhalents','Aspirin'";
SELECT @list FROM table1 WHERE @list NOT IN (
    SELECT Description
    FROM table1
);
With only narrow exceptions, you need to have data in table form to be able to obtain those data in your result set. This is the essential problem that all attempts at a solution to this problem run into, given that you cannot create a temporary table. If indeed you can provide the input in any form or format (per your comment), then you can provide it in the form of a subquery:
(
SELECT 'Amiodarone' AS description
UNION ALL
SELECT 'Ammonia Inhalents'
UNION ALL
SELECT 'Aspirin'
)
(Note that that exercises the biggest of the exceptions I noted: you can select scalars directly, without a base table. If you like, you can express that explicitly -- in MySQL and Oracle, at least -- by selecting FROM DUAL.)
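As an aside, if you are on MySQL 8.0.19 or later (an assumption about your version), the VALUES statement offers a more compact spelling of the same derived list; its default column name is column_0:
SELECT column_0 AS description
FROM (VALUES ROW('Amiodarone'), ROW('Ammonia Inhalents'), ROW('Aspirin')) AS a;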
In that case, this should work for you:
SELECT
a.description
FROM
(
SELECT 'Amiodarone' AS description
UNION ALL
SELECT 'Ammonia Inhalents'
UNION ALL
SELECT 'Aspirin'
) a
LEFT JOIN table1
ON a.description = table1.description
WHERE table1.description IS NULL
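The same anti-join can also be phrased with NOT EXISTS, if you find that more readable:
SELECT a.description
FROM (
    SELECT 'Amiodarone' AS description
    UNION ALL SELECT 'Ammonia Inhalents'
    UNION ALL SELECT 'Aspirin'
) a
WHERE NOT EXISTS (
    SELECT 1 FROM table1 t WHERE t.description = a.description
);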
That won't work. The variable's contents will be treated as a monolithic string - one solid block of letters, not 3 separate comma-separated values. The query will be parsed/executed as:
SELECT ... WHERE "'Amio.....rin'" IN (x,y,z,...)
^--------------^--- string
Plus, since you're just doing a sub-select on the very same table, there's no point in this kind of construct. You could try MySQL's FIND_IN_SET() function:
SELECT @list
FROM table1
WHERE FIND_IN_SET(Description, @list) > 0
I've got a column "code" which may have a string of multiple values e.g. "CODE1&CODE2"... I just need the first one for my JOIN ... kind of like code.split("&")[0]
SELECT myTable.*, otherTable.id AS theID
FROM myTable INNER JOIN otherTable
ON myTable.(+++ code before the & +++) = otherTable.code
The value in myTable may also just be CODE1
SUBSTRING_INDEX will do exactly what you want - return the substring of your column up to the specified character:
SELECT
myTable.*,
otherTable.id AS theID
FROM myTable
INNER JOIN otherTable
ON SUBSTRING_INDEX(myTable.code, '&', 1) = otherTable.code
More info at: http://dev.mysql.com/doc/refman/5.0/en/string-functions.html
And here's a fiddle demoing it: http://sqlfiddle.com/#!2/96a6e/2
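Conveniently, this also covers the case from the question where the value is just CODE1 with no '&': when the delimiter is absent, SUBSTRING_INDEX returns the whole string:
SELECT SUBSTRING_INDEX('CODE1&CODE2', '&', 1);  -- 'CODE1'
SELECT SUBSTRING_INDEX('CODE1', '&', 1);        -- 'CODE1' (no delimiter: whole string returned)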
Please note that this will be SLOW if you're joining over many rows. You're not only eliminating the possibility of using an index, but performing a very slow string operation on every comparison. I wouldn't suggest using this on very large tables. If your data set is huge, you may want to consider rearchitecting your DB.
I need to find duplicates in a table. In MySQL I simply write:
SELECT *,count(id) count FROM `MY_TABLE`
GROUP BY SOME_COLUMN ORDER BY count DESC
This query nicely:
Finds duplicates based on SOME_COLUMN, giving its repetition count.
Sorts in desc order of repetition, which is useful to quickly scan major dups.
Chooses a random value for all remaining columns, giving me an idea of values in those columns.
A similar query in Postgres greets me with an error:
column "MY_TABLE.SOME_COLUMN" must appear in the GROUP BY clause or be
used in an aggregate function
What is the Postgres equivalent of this query?
PS: I know that MySQL behaviour deviates from SQL standards.
Back-ticks are a non-standard MySQL thing. Use the canonical double quotes to quote identifiers (possible in MySQL, too). That is, if your table in fact is named "MY_TABLE" (all upper case). If you (more wisely) named it my_table (all lower case), then you can remove the double quotes or use lower case.
Also, I use ct instead of count as alias, because it is bad practice to use function names as identifiers.
Simple case
This would work with PostgreSQL 9.1:
SELECT *, count(id) ct
FROM my_table
GROUP BY primary_key_column(s)
ORDER BY ct DESC;
It requires primary key column(s) in the GROUP BY clause. The results are identical to a MySQL query, but ct would always be 1 (or 0 if id IS NULL) - useless to find duplicates.
Group by other than primary key columns
If you want to group by other column(s), things get more complicated. This query mimics the behavior of your MySQL query - and you can use *.
SELECT DISTINCT ON (1, some_column)
count(*) OVER (PARTITION BY some_column) AS ct
,*
FROM my_table
ORDER BY 1 DESC, some_column, id, col1;
This works because DISTINCT ON (PostgreSQL-specific), like DISTINCT (SQL-standard), is applied after the window function count(*) OVER (...). Window functions (with the OVER clause) require PostgreSQL 8.4 or later and are not available in MySQL.
Works with any table, regardless of primary or unique constraints.
The 1 in DISTINCT ON and ORDER BY is just shorthand to refer to the ordinal number of the item in the SELECT list.
SQL Fiddle to demonstrate both side by side.
More details in this closely related answer:
Select first row in each GROUP BY group?
count(*) vs. count(id)
If you are looking for duplicates, you are better off with count(*) than with count(id). There is a subtle difference if id can be NULL, because NULL values are not counted - while count(*) counts all rows. If id is defined NOT NULL, results are the same, but count(*) is generally more appropriate (and slightly faster, too).
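A quick way to see the difference (plain PostgreSQL, no table needed):
SELECT count(*) AS all_rows, count(id) AS non_null_ids
FROM (VALUES (1), (2), (NULL)) t(id);
-- all_rows = 3, non_null_ids = 2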
Here's another approach that uses DISTINCT ON:
select distinct on (ct, some_column)
    *,
    count(id) over (partition by some_column) as ct
from my_table x
order by ct desc, some_column, id
Data source:
CREATE TABLE my_table (some_column int, id int, col1 int);
INSERT INTO my_table VALUES
(1, 3, 4)
,(2, 4, 1)
,(2, 5, 1)
,(3, 6, 4)
,(3, 7, 3)
,(4, 8, 3)
,(4, 9, 4)
,(5, 10, 1)
,(5, 11, 2)
,(5, 11, 3);
Output:
SOME_COLUMN  ID  COL1  CT
          5  10     1   3
          2   4     1   2
          3   6     4   2
          4   8     3   2
          1   3     4   1
Live test: http://www.sqlfiddle.com/#!1/e2509/1
DISTINCT ON documentation: http://www.postgresonline.com/journal/archives/4-Using-Distinct-ON-to-return-newest-order-for-each-customer.html
MySQL allows GROUP BY to omit non-aggregated selected columns from the GROUP BY list; it executes this by returning the first row found for each unique combination of grouped-by columns. This is non-standard SQL behaviour.
PostgreSQL, on the other hand, is SQL-standard compliant.
There is no direct equivalent query in PostgreSQL.
Here is a CTE joined back to the base table, which allows you to use select *. key0 is the intended unique key; {key1,key2} are the additional key elements needed to address the currently non-unique rows. Use at your own risk, YMMV.
WITH zcte AS (
SELECT DISTINCT tt.key0
, MIN(tt.key1) AS key1
, MIN(tt.key2) AS key2
, COUNT(*) AS cnt
FROM ztable tt
GROUP BY tt.key0
HAVING COUNT(*) > 1
)
SELECT zt.*
, zc.cnt AS cnt
FROM ztable zt
JOIN zcte zc ON zc.key0 = zt.key0 AND zc.key1 = zt.key1 AND zc.key2 = zt.key2
ORDER BY zt.key0, zt.key1,zt.key2
;
BTW: to get the intended behaviour for the OP, the HAVING COUNT(*) > 1 clause should be omitted.