Select id from input list NOT present in database - mysql

With MySQL version 8.0:
CREATE TABLE cnacs(cid VARCHAR(20), PRIMARY KEY(cid));
Then,
INSERT INTO cnacs VALUES('1');
The first two statements execute successfully; the next one does not. My goal is to return a list of unused cids from the input list [1, 2]:
SELECT * FROM (VALUES ('1'),('2')) as T(cid) EXCEPT SELECT cid FROM cnacs;
In theory, I'd like the output to be '2', since it has not yet been added. The aforementioned query was inspired by Remus's answer on https://dba.stackexchange.com/questions/37627/identifying-which-values-do-not-match-a-table-row

This is at least the correct syntax for what you are trying to do.
If this query is anything more than a learning exercise, though, I'd rethink the approach and store these '1' and '2' values (or however many there end up being) in their own table:
SELECT Column_0
FROM (SELECT * FROM (VALUES ROW('1'), ROW('2')) TMP) VALS
LEFT
JOIN cnacs
ON VALS.Column_0 = cnacs.cid
WHERE cnacs.cid IS NULL
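If you want to poke at the anti-join pattern without a MySQL instance handy, here is a rough sqlite3 sketch of the same idea; the UNION ALL derived table stands in for MySQL 8's VALUES ROW(...) constructor:

```python
import sqlite3

# Rough sqlite3 sketch of the anti-join above; the UNION ALL derived
# table stands in for MySQL 8's VALUES ROW(...) constructor.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cnacs (cid VARCHAR(20) PRIMARY KEY)")
conn.execute("INSERT INTO cnacs VALUES ('1')")

unused = conn.execute("""
    SELECT vals.cid
    FROM (SELECT '1' AS cid UNION ALL SELECT '2') AS vals
    LEFT JOIN cnacs ON vals.cid = cnacs.cid
    WHERE cnacs.cid IS NULL
""").fetchall()
print(unused)  # [('2',)]
```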

Related

Case in list - Tableau

I'm trying to filter out a huge amount of data, so I decided to create a calculated field using CASE: if product_id is in the list then '1', else '0'.
But for some reason it throws a syntax error.
this is my calculated field:
CASE when product_id in (
'31049','31048','26166','27816','26031','28861','28864','28863','28203','28110','20641','38112','45174','20645','28404','20646','20648','26159','33287','31417','40551','41020','40550','40550','40553','40554','29804','29941','31430','33354','36730','26073','31432','31433','31431','38154','38166','26029','28341','45138','38069','42069','26060','26060','33886','33886','28392','29518','44879','20651','20655','42914','37535','28031','27588','29297','37688','37709','29551','29551','30183','29550','26187','29549','41348') THEN '1' ELSE '0'
END
Any idea how it should be written?
Thanks in advance :)
On a sample dataset this works:
SELECT RIDE_ID as ri,
CASE
WHEN ri in ('5EB0FAD625CFAEAB', '5A9314E3AF8DCC30') THEN '1'
ELSE '0'
END AS result
FROM CITIBIKE_TRIPS LIMIT 10;
yes it works in the database but not in Tableau :) I couldn't run it in a calculated field
Maybe using LATERAL would allow running it from Tableau:
CREATE OR REPLACE TABLE t(ID INT, product_id TEXT);
INSERT INTO t VALUES (1, '31049'),(2,'31048'), (3, '100');
SELECT *
FROM t
,LATERAL (SELECT CASE WHEN t.product_id IN ( '31049','31048','26166','27816'/*...*/)
THEN '1' ELSE '0' END) AS s(result);
One option: create a table with the keys you wish to filter on, then use a join to let the database do the work. It could be easier to maintain, and is likely more efficient.
Another option is to create a set in Tableau based on the product_id field. Define that set by checking the product ids you wish, and then place the set on the filter shelf to either include or exclude the product_ids in your set.

Is it possible to pass several rows of data into a MySQL subquery instead of using a temporary table?

I'd like to know if it's possible to pass rows of data directly into a select subquery, rather than setting up a temporary table and joining on that.
My actual use case is trying to prevent thousands of individual queries, and for architectural reasons adding a temporary table would be a pain (but not impossible, so it's where I may have to go.)
A simplified example of my issue:
I have a table giving the number plate of the cars in a car park, with each row containing section (a letter), space (a number), and reg_plate (a tinytext).
My boss gives me a list of 3 places and wants the reg number of the car in each (or null if empty).
Now I can do this by creating a temporary table containing the section/space sequence I'm interested in, and then join my carpark table against that to give me each of the rows.
I'm wondering whether there is a syntax where I could do this in a select, perhaps with a subselect something like this - obviously invalid, but hopefully it shows what I'm getting at:
SELECT targets.section, targets.space, cp.reg_plate
FROM carpark cp
LEFT JOIN (
SELECT field1 AS section, field2 AS space
FROM (
('a', 7), ('c', 14), ('c', 23)
)
) targets ON (cp.section = targets.section AND cp.space = targets.space)
Any ideas gratefully received!
You can use UNION:
SELECT targets.section, targets.space, cp.reg_plate
FROM carpark cp
LEFT JOIN (
SELECT 'a' as section, 7 AS space
UNION ALL
SELECT 'c', 14
UNION ALL
SELECT 'c', 23
) targets ON cp.section = targets.section AND cp.space = targets.space
A followup:
I realised I was hoping to find something analogous to the VALUES used in an insert statement.
It turns out this is available in MySQL 8+, and VALUES is available as a table constructor:
https://dev.mysql.com/doc/refman/8.0/en/values.html
SELECT targets.section, targets.space, cp.reg_plate
FROM carpark cp
LEFT JOIN (
VALUES ROW('a', 7), ROW('c', 14), ROW('c', 23)
) targets ON (cp.section = targets.column_0 AND cp.space = targets.column_1)
Not available in earlier MySQL (so the UNION method is OK) but great in 8+.
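The same shape can be tried out quickly against sqlite3, where the UNION ALL form also works. Note the targets derived table goes on the left of the LEFT JOIN so that the empty space still comes back as NULL; the plate values below are made up for the demo:

```python
import sqlite3

# Carpark lookup with an inline targets list, sketched in sqlite3.
# The targets derived table is on the LEFT of the join so that an
# empty space still yields a row with a NULL reg_plate.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE carpark (section TEXT, space INTEGER, reg_plate TEXT)")
conn.executemany("INSERT INTO carpark VALUES (?, ?, ?)",
                 [("a", 7, "AB12CDE"), ("c", 14, "ZY98XWV")])

rows = conn.execute("""
    SELECT targets.section, targets.space, cp.reg_plate
    FROM (SELECT 'a' AS section, 7 AS space
          UNION ALL SELECT 'c', 14
          UNION ALL SELECT 'c', 23) AS targets
    LEFT JOIN carpark cp
      ON cp.section = targets.section AND cp.space = targets.space
    ORDER BY targets.section, targets.space
""").fetchall()
print(rows)  # [('a', 7, 'AB12CDE'), ('c', 14, 'ZY98XWV'), ('c', 23, None)]
```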

convert mysql (on duplicate key ) query to oracle merge

INSERT INTO table1(id,dept_id,name,description,creation_time,modified_time)
VALUES('id','dept_id','name','description','creation_time','modified_time')
ON DUPLICATE KEY UPDATE dept_id=VALUES(dept_id),name=VALUES(name),
description=VALUES(description),creation_time=VALUES(creation_time),
modified_time=VALUES(modified_time)
I used the Oracle query below to convert the above MySQL query. The query fails. Can you please help me figure out what is wrong with the Oracle query?
Merge into table1 t1 using
(VALUES ('id','dept_id','name','description','creation_time','modified_time')) as temp
(id,dept_id,name,description,creation_time,modified_time) on t1. id = temp.id
WHEN MATCHED THEN UPDATE SET dept_id=t1.dept_id, description=t1.description, name=t1.name,
creation_time=t1.creation_time, modified_time=t1.modified_time
WHEN NOT MATCHED THEN INSERT (id,dept_id,name,description,creation_time,modified_time)
VALUES ('id','dept_id','name','description','creation_time','modified_time')
To do this, you need to use a table or subquery in the using clause (in your case, you need a subquery).
In Oracle, you can use the dual table if you need to select something without needing to select from an actual table; this is a table that contains only a single row and a single column.
Your merge statement should therefore look something like:
MERGE INTO table1 tgt
USING (SELECT 'id' id,
'dept_id' dept_id,
'name' NAME,
'description' description,
'creation_time' creation_time,
'modified_time' modified_time
FROM dual) src
ON tgt.id = src.id
WHEN MATCHED THEN
UPDATE
SET tgt.dept_id = src.dept_id,
tgt.description = src.description,
tgt.name = src.name,
tgt.creation_time = src.creation_time,
tgt.modified_time = src.modified_time
WHEN NOT MATCHED THEN
INSERT
(tgt.id,
tgt.dept_id,
tgt.name,
tgt.description,
tgt.creation_time,
tgt.modified_time)
VALUES
(src.id,
src.dept_id,
src.name,
src.description,
src.creation_time,
src.modified_time);
Note how the when not matched clause uses the columns from the source subquery, rather than using the literal values you supplied. (I assume that in your actual code, these literal values are actually variables; creation_time is a pretty odd value to store in a column labelled creation_time!).
I've also switched the aliases to make it clearer where you're merging to and from; I find this makes it easier to understand what the merge statement is doing. YMMV.
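For anyone wanting to see the upsert behaviour being ported in action without an Oracle or MySQL instance, it can be sketched in sqlite3, whose ON CONFLICT clause plays the same role as MySQL's ON DUPLICATE KEY UPDATE and Oracle's MERGE (the table and values here are toy stand-ins):

```python
import sqlite3

# Toy sketch of the upsert semantics: the second insert hits the
# duplicate key and takes the update path instead of failing.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id TEXT PRIMARY KEY, name TEXT)")

def upsert(id_, name):
    conn.execute("""
        INSERT INTO table1 (id, name) VALUES (?, ?)
        ON CONFLICT(id) DO UPDATE SET name = excluded.name
    """, (id_, name))

upsert("1", "first")
upsert("1", "second")  # duplicate key -> update path
print(conn.execute("SELECT id, name FROM table1").fetchall())  # [('1', 'second')]
```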

MySQL query all existent and non existent entries from list (inline table)

I have a MySQL database with a table of tag names. I have a list of tags I want to assign and need to check whether they are in the database or not. Therefore I want to write a query which gives me the ids of all tags in the list which are already present and the ones which are not present yet.
In SQLite I already managed to write this query, but as it contains a CTE it can't directly be converted to MySQL.
The SQLite query is:
WITH
check_tags(name) AS ( VALUES ("name1"), ("name2") )
SELECT check_tags.name, tags.id FROM check_tags
LEFT JOIN tags ON check_tags.name = tags.name
The result would be for example:
id | name
---------------
1 | name1
Null | name2
In MySQL it could be something with SELECT * FROM ( VALUES("name1"), ("name2") ) ... which I have seen for other database systems, but this also doesn't work with MySQL.
All these different SQL dialects make searching for help difficult.
The answer was to use an inline table as Aaron Kurtzhals pointed out.
My query now is:
CREATE TEMPORARY TABLE MyInlineTable (id LONG, content VARCHAR(255) );
INSERT INTO MyInlineTable VALUES
(1, 'name1'),
(2, 'name2');
SELECT * from MyInlineTable LEFT JOIN tags on MyInlineTable.content = tags.name

SQL to select set of 10 records that collectively best fulfill a criterion

My table:
CREATE TABLE `beer`.`matches` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`hashId` int(10) unsigned NOT NULL,
`ruleId` int(10) unsigned NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB;
If a hash has matched a rule, there's an entry in this table.
1) Count how many hashIds there are for each unique ruleId (AKA "how many hashes matched each rule")
SELECT COUNT(*), ruleId FROM `beer`.`matches` GROUP BY ruleId ORDER BY COUNT(*)
2) Select the 10 best rules (ruleIds), that is, the 10 rules that combined match the greatest number of unique hashes. This means that a rule matching a lot of hashes is not necessarily a good rule if another rule covers all the same hashes. Basically I want to select the 10 ruleIds that catch the most unique hashIds.
?
EDIT: Basically I have a sub-optimal solution in PHP/SQL here, but depending on the data it doesn't necessarily give me the best answer to question 2). I'd be interested in a better solution. Read the comments for more information.
I think your problem is a variation of the "knapsack problem".
I think you already understand that you can't just take whatever ruleIds match the most hashIds like the other answers are suggesting, because while each of those ruleIds match say 100 hashIds, they might all match the same 100 hashIds... but if you had selected 10 other ruleIds which only matched 25 hashIds, but with each of the hashIds matched by each ruleId being unique, you'd end up with more unique hashIds with that set.
To solve this, you could start by selecting whatever ruleId matches the most hashIds, and then next selecting whatever ruleId matches the most hashIds that aren't included in the hashIds matched by the previous ruleIds... continuing this process until you've selected 10 ruleIds.
There could still be anomalies in your data distribution that would cause this to not produce an optimal set of ruleIds... so if you wanted to go crazy, you could consider implementing a genetic algorithm to try to improve the "fitness" of your set of 10 ruleIds.
This isn't a task that SQL is particularly well suited to handle, but here's an example of the knapsack problem being solved with a genetic algorithm written in SQL(!)
EDIT
Here's an untested implementation of the solution where ruleIds are selected 1 at a time, with each iteration selecting whatever ruleId has the most unique hashIds that weren't previously covered by any other selected ruleIds:
--------------------------------------------------------------------------
-- Create Test Data
--------------------------------------------------------------------------
create table matches (
  id int(10) unsigned not null auto_increment,
  hashId int(10) unsigned not null,
  ruleId int(10) unsigned not null,
  primary key (id)
);
insert into matches (hashid, ruleid)
values
(1,1), (2,1), (3,1), (4,1), (5,1), (6,1), (7,1), (8,1), (9,1), (10,1),
(1,2), (2,2), (3,2), (4,2), (5,2), (6,2), (7,2), (8,2), (9,2), (10,2),
(1,3), (2,3), (3,3), (4,3), (5,3), (6,3), (7,3), (8,3), (9,3), (10,3),
(1,4), (2,4), (3,4), (4,4), (5,4), (6,4), (7,4), (8,4), (9,4), (10,4),
(1,5), (2,5), (3,5), (4,5), (5,5), (6,5), (7,5), (8,5), (9,5), (10,5),
(1,6), (2,6), (3,6), (4,6), (5,6), (6,6), (7,6), (8,6), (9,6), (10,6),
(1,7), (2,7), (3,7), (4,7), (5,7), (6,7), (7,7), (8,7), (9,7), (10,7),
(1,8), (2,8), (3,8), (4,8), (5,8), (6,8), (7,8), (8,8), (9,8), (10,8),
(1,9), (2,9), (3,9), (4,9), (5,9), (6,9), (7,9), (8,9), (9,9), (10,9),
(11,10), (12,10), (13,10), (14,10), (15,10),
(11,11), (12,11), (13,11), (14,11), (15,11),
(16,12), (17,12), (18,12), (19,12), (20,12),
(21,13), (22,13), (23,13), (24,13), (25,13),
(26,14), (27,14), (28,14), (29,14), (30,14),
(31,15), (32,15), (33,15), (34,15), (35,15),
(36,16), (37,16), (38,16), (39,16), (40,16),
(41,17), (42,17), (43,17), (44,17), (45,17),
(46,18), (47,18), (48,18), (49,18), (50,18),
(51,19), (52,19), (53,19), (54,19), (55,19),
(56,20), (57,20), (58,20), (59,20), (60,20);
--------------------------------------------------------------------------
-- End Create Test Data
--------------------------------------------------------------------------
create table selectedRules (
ruleId int(10) unsigned not null
);
set @rulesSelected = 0;
while (@rulesSelected < 10) do
insert into selectedRules (ruleId)
select m.ruleId
from
matches m left join (
select distinct m2.hashId
from
selectedRules sr join
matches m2 on m2.ruleId = sr.ruleId
) prev on prev.hashId = m.hashId
where prev.hashId is null
group by m.ruleId
order by count(distinct m.hashId) desc
limit 1;
set @rulesSelected = @rulesSelected + 1;
end while;
select ruleId from selectedRules;
If you really want to find the best (optimal) solution, the problem is that you have to check all possible combinations of 10 ruleIds and count how many hashIds are covered by each combination. The number of combinations is roughly (number of distinct ruleIds) ^ 10 (in fact the number is smaller, since you cannot repeat the same ruleId within a combination; it's a combination of m elements taken in groups of 10).
NOTE: To be exact, the number of possible combinations is
m!/(n!(m-n)!) => m!/(10!(m-10)!) where ! is factorial: m! = m * (m-1) * (m-2) * ... * 3 * 2 * 1
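To get a feel for how fast m!/(10!(m-10)!) grows, a quick check with Python's math.comb:

```python
from math import comb

# Number of distinct 10-rule combinations for m distinct ruleIds,
# i.e. m! / (10! * (m-10)!)
for m in (15, 20, 50):
    print(m, comb(m, 10))
# 15 3003
# 20 184756
# 50 10272278170
```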
To build these combinations you have to join your table with itself 10 times, excluding the previous combinations of ruleIds, somewhat like this:
select m1.ruleid r1, m2.ruleid r2, m3.ruleid r3 ...
from matches m1 inner join matches m2 on m2.ruleid <> m1.ruleid
inner join matches m3 on m3.ruleid <> m1.ruleid and m3.ruleid <> m2.ruleid
...
Then you have to find the highest count of
select r1, r2, r3..., count(distinct hashid)
from ("the combinations of 10 ruleIds defined above") G10
inner join M
on ruleid = r1 or ruleid = r2 or ruleid=r3...
group by r1, r2, r3...
This gigantic query would take a lot of time to run.
There can be much faster procedures that will give you sub-optimal results.
SOME OPTIMIZATION:
This could be somewhat optimized, depending on the data shape, by looking for groups which are equal to or included in other groups. This would require fewer than (m*(m+1))/2 operations, which compared to the other number is a big deal, especially if it's quite probable to find several groups which can be discarded, which will lower m. Even so, the main query still has a gigantic cost.
Although I come from the PostgreSQL world, I found this question really interesting and took my time to look into it.
I split the whole process into 2 subroutines:
first, a sub-query (or function) is required that, for a given ruleId combination (array), will return all possible (array)+ruleId entries with the number of unique hashIds (count) found for each entry;
then, one should query the max(count) from #1 and get the list of array+ruleId combinations from #1 that yield it. I used a recursive function for this. If the current recursion level matches the required number of ruleIds (10 in the question), then return the found array+ruleId combinations; otherwise recursively go into this same step (#2), giving the found combination as input.
As a result, the second function will return all combinations that give you the maximal number of unique hashIds for a given ruleId count.
Here's the code that will create a test setup, PostgreSQL 9.1 tested. As the original question is for MySQL, I will comment on what is going on there:
create table matches (
id int4 not null,
hashId int4 not null,
ruleId int4 not null,
primary key (id)
);
insert into matches
SELECT generate_series(1,200), (random()*59+1)::int4, (random()*19+1)::int4;
-- This query will generate a 200-rows table, with:
-- - first column having values in 1-200 range (id)
-- - second column will have random numbers in 1-60 range (hashId)
-- - third column will have random numbers in 1-20 range (ruleId)
Function for the phase 1 (quite simple):
CREATE OR REPLACE FUNCTION count_matches(i_array int4[],
OUT arr int4[], OUT cnt int4) RETURNS SETOF record
AS $$
DECLARE
rec_o record;
rec_i record;
BEGIN
-- in the outer loop, we're going over all the combinations of input array
-- with the ruleId appended
FOR rec_o IN SELECT DISTINCT i_array||ruleId AS rules
FROM matches ORDER BY 1
LOOP
-- in the inner loop we're counting the distinct hashId combinations
-- for the outer loop provided array
-- and returning the new array + count
FOR rec_i IN SELECT count(distinct hashId) AS cnt
FROM matches WHERE ruleId = ANY(rec_o.rules)
LOOP
arr := rec_o.rules;
cnt := rec_i.cnt;
RETURN NEXT ;
END LOOP;
END LOOP;
RETURN ;
END;
$$ LANGUAGE plpgsql;
If you give an empty array as input to this function, you get the same results as from case #1 of the initial question:
SELECT COUNT(*), ruleId FROM `beer`.`matches` GROUP BY ruleId ORDER BY COUNT(*);
-- both queries yields same results
SELECT cnt, arr FROM count_matches(ARRAY[]::int4[]);
Now the main working function:
-- function receives 3 parameters, 2 of them have default values
-- which makes it possible to query: max_matches(10)
-- to obtain results from the initial question
CREATE OR REPLACE FUNCTION max_matches(maxi int4,
arri int4[] DEFAULT array[]::int4[],
curi int4 DEFAULT 1, OUT arr int4[]) RETURNS SETOF int4[]
AS $$
DECLARE
maxcnt int4;
a int4[];
b int4[];
BEGIN
-- Fall out early for "easy" cases
IF maxi < 2 THEN
RAISE EXCEPTION 'Too easy, do a GROUP BY query instead';
END IF;
a = array[]::int4[];
-- first, we find out what is the maximal possible number of hashIds
-- on a given level
SELECT max(cnt) INTO maxcnt FROM count_matches(arri);
-- then we check each combination that yield the found number
-- of unique hashIds
FOR arr IN SELECT cm.arr FROM count_matches(arri) cm
WHERE cm.cnt = maxcnt
LOOP
-- if we're on the deepest level of recursion,
-- we just return back the found combination
IF curi = maxi THEN
RETURN NEXT ;
ELSE
-- otherwise we ask further down
FOR b IN SELECT * FROM max_matches(maxi, arr, curi+1) LOOP
-- this loop and IF clause are required to eliminate
-- equal arrays, so that if we get {6,14} and {14,6} returned
-- we will use only one of the two, as they're the same
IF NOT a #> b THEN
a = array_cat(a, b);
RETURN QUERY SELECT b;
END IF;
END LOOP;
END IF;
END LOOP;
RETURN ;
END;
$$ LANGUAGE plpgsql;
Unfortunately this approach is time consuming. For my test setup I get the following performance, and spending 8 seconds on a 200-row "big" table seems like overkill.
select * from max_matches(10);
arr
-----------------------------
{6,14,4,16,8,1,7,10,11,18}
{6,14,4,16,8,1,7,11,12,18}
{6,14,4,16,8,7,10,11,15,18}
{6,14,4,16,11,10,1,7,18,20}
(4 rows)
Time: 8034,700 ms
I hope you don't mind me jumping into this question. And I also hope you will find my answer useful for your purposes at least partially :)
And thanks for the question, I have had a very good time trying to solve it!
The approach that I think will work best for this is based on the same logic/method that the statistical technique of multivariate co-factor analysis uses.
That is, instead of trying to solve the inherently combinatorial problem of "What combination of 10 factors(or 'rules' for your problem) out of the existing rules, best fulfills some criterion?", it incrementally answers a much easier question "Given what I already have, what additional factor('rule'), best improves how well the criterion is fulfilled?"
Procedurally, it goes like this: first, find the rule that has the most (distinct) hashes matching it. Don't worry about overlap with other rules, just find the single best one. Add it to a list (or table) of already-selected rules.
Now, find the next-best rule, given the rule(s) that you already have. In other words, find the rule that matches the most hashes, excluding any hashes already matched by the already-selected rule(s). Add this new rule to your already-selected list of rules, and repeat until you have 10 rules.
So this approach basically avoids the inherently combinatorial problem of trying to find the absolute, globally best solution, by finding the incrementally relative/local best solution. Some points in this approach:
It is O(n*k), where 'k' is the number of rules you want to find. Combinatorial approaches tend to be non-polynomial, like O(2^n) or O(n!), which is highly undesirable performance-wise.
It is possible that this approach will not give the absolute *best* 10 rules for your criterion. However, in my experience, it tends to do very well in the real-world cases of problems like this one. It is usually at most one or two rules off the absolute 10 best.
The SQL code for the incremental search is very easy (you already have most of it). But the SQL code to actually do it N=10 times is inherently procedural and thus requires the less standard/more idiosyncratic parts of SQL (translation: I know how to do it in TSQL, but not in MySQL).
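Since the looping is the awkward part in SQL, the incremental selection described above can be sketched in a few lines of Python, over an assumed mapping from each ruleId to the set of hashIds it matched (which you would build from the matches table):

```python
# Greedy sketch of the incremental rule selection described above.
# `matches` maps each ruleId to the set of hashIds it matched.
def greedy_rules(matches, k=10):
    covered, selected = set(), []
    candidates = dict(matches)
    for _ in range(k):
        # pick the rule that adds the most not-yet-covered hashIds
        best = max(candidates, key=lambda r: len(candidates[r] - covered), default=None)
        if best is None or not (candidates[best] - covered):
            break  # no remaining rule adds anything new
        selected.append(best)
        covered |= candidates.pop(best)
    return selected, covered

# Rules 1 and 2 overlap completely; rule 3 adds more new hashes than rule 4.
demo = {1: {1, 2, 3, 4}, 2: {1, 2, 3, 4}, 3: {5, 6}, 4: {4, 5}}
rules, covered = greedy_rules(demo, k=2)
print(rules, sorted(covered))  # [1, 3] [1, 2, 3, 4, 5, 6]
```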
Here's a solution which might be good enough.
Indexes and/or manually created caching tables might help performance figures, although on a sparsely populated table, it works instantly.
The idea is brutally simple: create a view to explicitly show all possibilities, then combine all of them and find the best by ordering.
Same-rule combinations are allowed, as a certain rule by itself might be more efficient than a combination of others.
Based on a table similar to the one you described above, with columns named "id", "hash_id" and "rule_id", create a helper view (this way it's easier to test/debug) using the following select:
SELECT `t1`.`hash_id` AS `h1`,`t2`.`hash_id` AS `h2`,`t3`.`hash_id` AS `h3`,`t1`.`rule_id` AS `r1`,`t2`.`rule_id` AS `r2`,`t3`.`rule_id` AS `r3` from (`hashTable` `t1` join `hashTable` `t2` join `hashTable` `t3`)
The above view creates a triple-join table. You can add t4.hash_id AS h4, t4.rule_id AS r4 to the SELECT and join hashTable t4 in the FROM to add a fourth join, and so forth up to 10.
After creating the view, the following query gives the combination of 2 best rules with their hash coverage explicitly shown:
select group_concat(distinct h1),concat(r1, r2) from (select distinct h1,r1,r2 from hashView union distinct select distinct h2,r1,r2 from hashView) as uu group by concat(r1,r2)
If you don't need to see the hash coverage, this one may be better:
select count(distinct h1) as cc,concat(r1, r2) from (select distinct h1,r1,r2 from hashView union distinct select distinct h2,r1,r2 from hashView) as uu group by concat(r1,r2) order by cc
Adding 3rd rule match is simple by adding h3 and r3 to the union and grouping using it:
select count(distinct h1),concat(r1, r2, r3) from (select distinct h1,r1,r2,r3 from hashView union distinct select distinct h2,r1,r2,r3 from hashView union distinct select distinct h3,r1,r2,r3 from hashView) as uu group by concat(r1,r2,r3)
If you do not need the option to choose how many top rules to match, you can do the concat() in the View itself and save some time on the union queries.
A possible performance increase is eliminating permuted rule id's.
All the above were only tested using single-digit rule ids, so instead of concat() you should probably use concat_ws(), like this, for a pre-concatted view:
select `t1`.`hash_id` AS `h1`,`t2`.`hash_id` AS `h2`,`t3`.`hash_id` AS `h3`,concat_ws(",",`t1`.`rule_id`,`t2`.`rule_id`,`t3`.`rule_id`) AS `r` from (`hashTable` `t1` join `hashTable` `t2` join `hashTable` `t3`)
And then the union query:
select count(distinct h1) as cc,r from (select distinct h1,r from hashView union distinct select distinct h2,r from hashView union distinct select distinct h3,r from hashView) as uu group by r order by cc
Let me know if this solves the problem at hand or if there are additional constraints that weren't disclosed before.
Depending on the amount of rules and hashes, you can also always reverse the rule<->hash relation and instead create hash-based views.
The best idea is probably to combine this approach with real-life heuristics.