I'm trying to filter out a huge amount of data, so I decided to create a calculated field using CASE: if product_id is in the list, then '1', else '0'.
But for some reason it throws a syntax error.
this is my calculated field:
CASE when product_id in (
'31049','31048','26166','27816','26031','28861','28864','28863','28203','28110','20641','38112','45174','20645','28404','20646','20648','26159','33287','31417','40551','41020','40550','40550','40553','40554','29804','29941','31430','33354','36730','26073','31432','31433','31431','38154','38166','26029','28341','45138','38069','42069','26060','26060','33886','33886','28392','29518','44879','20651','20655','42914','37535','28031','27588','29297','37688','37709','29551','29551','30183','29550','26187','29549','41348') THEN '1' ELSE '0'
END
Any idea how it should be written?
Thanks in advance :)
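For reference: in Tableau's calculation language, CASE only supports the exact-match form `CASE <expr> WHEN <value> THEN ...`; a boolean condition such as IN has to go into an IF expression instead. A sketch of the calculated field (assuming Tableau 2020.3 or later, where the IN operator is available in calculations):

```
IF [product_id] IN ('31049', '31048', '26166', '27816' /* ...rest of the list... */)
THEN '1'
ELSE '0'
END
```

On older versions without IN, a Tableau set on [product_id] achieves the same filter without a calculation.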
On a sample dataset this works:
SELECT RIDE_ID as ri,
CASE
WHEN ri in ('5EB0FAD625CFAEAB', '5A9314E3AF8DCC30') THEN '1'
ELSE '0'
END AS result
FROM CITIBIKE_TRIPS LIMIT 10;
Yes, it works in the database but not in Tableau :) I couldn't run it in a calculated field.
Maybe using LATERAL would allow running it from Tableau:
CREATE OR REPLACE TABLE t(ID INT, product_id TEXT);
INSERT INTO t VALUES (1, '31049'),(2,'31048'), (3, '100');
SELECT *
FROM t
,LATERAL (SELECT CASE WHEN t.product_id IN ( '31049','31048','26166','27816'/*...*/)
THEN '1' ELSE '0' END) AS s(result);
One option: create a table with the keys you wish to filter on, and then use a join to let the database do the work. That could be easier to maintain, and is likely more efficient.
Another option is to create a set in Tableau based on the product_id field. Define the set by checking the product ids you want, and then place the set on the filter shelf to either include or exclude the product_ids in your set.
My table:
CREATE TABLE `beer`.`matches` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`hashId` int(10) unsigned NOT NULL,
`ruleId` int(10) unsigned NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB;
If a hash has matched a rule, there's an entry in this table.
1) Count how many hashIds there are for each unique ruleId (AKA "how many hashes matched each rule")
SELECT COUNT(*), ruleId FROM `beer`.`matches` GROUP BY ruleId ORDER BY COUNT(*)
2) Select the 10 best rules (ruleIds), that is, the 10 rules that combined match the greatest number of unique hashes. This means that a rule that matches a lot of hashes is not necessarily a good rule if another rule covers all the same hashes. Basically, I want to select the 10 ruleIds that catch the most unique hashIds.
EDIT: Basically I have a sub-optimal solution in PHP/SQL here, but depending on the data it doesn't necessarily give me the best answer to question 2). I'd be interested in a better solution. Read the comments for more information.
I think your problem is a variation of the "knapsack problem".
I think you already understand that you can't just take whichever ruleIds match the most hashIds, as the other answers suggest. While each of those ruleIds might match, say, 100 hashIds, they might all match the *same* 100 hashIds. If you had instead selected 10 other ruleIds that each matched only 25 hashIds, but with every hashId matched by each ruleId being unique, you'd end up with more unique hashIds from that set.
To solve this, you could start by selecting whichever ruleId matches the most hashIds, then select whichever ruleId matches the most hashIds not already covered by the previously selected ruleIds... continuing this process until you've selected 10 ruleIds.
There could still be anomalies in your data distribution that would cause this to not produce an optimal set of ruleIds... so if you wanted to go crazy, you could consider implementing a genetic algorithm to try to improve the "fitness" of your set of 10 ruleIds.
This isn't a task that SQL is particularly well suited to handle, but here's an example of the knapsack problem being solved with a genetic algorithm written in SQL(!)
EDIT
Here's an untested implementation of the solution where ruleIds are selected 1 at a time, with each iteration selecting whatever ruleId has the most unique hashIds that weren't previously covered by any other selected ruleIds:
--------------------------------------------------------------------------
-- Create Test Data
--------------------------------------------------------------------------
create table matches (
id int(10) unsigned not null auto_increment,
hashId int(10) unsigned not null,
ruleId int(10) unsigned not null,
primary key (id)
);
insert into matches (hashid, ruleid)
values
(1,1), (2,1), (3,1), (4,1), (5,1), (6,1), (7,1), (8,1), (9,1), (10,1),
(1,2), (2,2), (3,2), (4,2), (5,2), (6,2), (7,2), (8,2), (9,2), (10,2),
(1,3), (2,3), (3,3), (4,3), (5,3), (6,3), (7,3), (8,3), (9,3), (10,3),
(1,4), (2,4), (3,4), (4,4), (5,4), (6,4), (7,4), (8,4), (9,4), (10,4),
(1,5), (2,5), (3,5), (4,5), (5,5), (6,5), (7,5), (8,5), (9,5), (10,5),
(1,6), (2,6), (3,6), (4,6), (5,6), (6,6), (7,6), (8,6), (9,6), (10,6),
(1,7), (2,7), (3,7), (4,7), (5,7), (6,7), (7,7), (8,7), (9,7), (10,7),
(1,8), (2,8), (3,8), (4,8), (5,8), (6,8), (7,8), (8,8), (9,8), (10,8),
(1,9), (2,9), (3,9), (4,9), (5,9), (6,9), (7,9), (8,9), (9,9), (10,9),
(11,10), (12,10), (13,10), (14,10), (15,10),
(11,11), (12,11), (13,11), (14,11), (15,11),
(16,12), (17,12), (18,12), (19,12), (20,12),
(21,13), (22,13), (23,13), (24,13), (25,13),
(26,14), (27,14), (28,14), (29,14), (30,14),
(31,15), (32,15), (33,15), (34,15), (35,15),
(36,16), (37,16), (38,16), (39,16), (40,16),
(41,17), (42,17), (43,17), (44,17), (45,17),
(46,18), (47,18), (48,18), (49,18), (50,18),
(51,19), (52,19), (53,19), (54,19), (55,19),
(56,20), (57,20), (58,20), (59,20), (60,20);
--------------------------------------------------------------------------
-- End Create Test Data
--------------------------------------------------------------------------
create table selectedRules (
ruleId int(10) unsigned not null
);
-- note: in MySQL this WHILE loop must be run inside a stored procedure
set @rulesSelected = 0;
while (@rulesSelected < 10) do
insert into selectedRules (ruleId)
select m.ruleId
from
matches m left join (
select distinct m2.hashId
from
selectedRules sr join
matches m2 on m2.ruleId = sr.ruleId
) prev on prev.hashId = m.hashId
where prev.hashId is null
group by m.ruleId
order by count(distinct m.hashId) desc
limit 1;
set @rulesSelected = @rulesSelected + 1;
end while;
select ruleId from selectedRules;
If you really want to find the best (optimal) solution, the problem is that you have to check all the possible combinations of 10 ruleIds and find how many hashIds are returned by each possible combination. The number of combinations is roughly the number of distinct ruleIds ^ 10 (in fact, the number is smaller, since you cannot repeat the same ruleId within a combination... it's the number of combinations of m elements taken 10 at a time).
NOTE: To be exact, the number of possible combinations is
m!/(n!(m-n)!) => m!/(10!(m-10)!), where ! is factorial: m! = m * (m-1) * (m-2) * ... * 3 * 2 * 1
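As a sanity check, the count can be computed directly; a small sketch (m = 20 rules is a hypothetical value, not from the question):

```python
from math import comb, factorial

m, n = 20, 10  # hypothetical: 20 distinct ruleIds, choosing 10

# number of 10-rule combinations: m! / (n! * (m - n)!)
by_factorial = factorial(m) // (factorial(n) * factorial(m - n))
print(by_factorial)  # 184756
print(comb(m, n))    # same value via the stdlib helper
print(m ** n)        # the much larger naive m^10 estimate
```

Even for only 20 rules there are 184,756 candidate combinations, which is why the exhaustive query below explodes.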
To build these combinations you have to join your table with itself 10 times, excluding the previously used ruleIds, somewhat like this:
select m1.ruleid r1, m2.ruleid r2, m3.ruleid r3 ...
from matches m1 inner join matches m2 on m2.ruleid <> m1.ruleid
inner join matches m3 on m3.ruleid <> m1.ruleid and m3.ruleid <> m2.ruleid
...
Then you have to find the highest count of
select r1, r2, r3..., count(distinct hashid)
from ("here the combinations of 10 ruleIds defined above") G10
inner join M
on ruleid = r1 or ruleid = r2 or ruleid=r3...
group by r1, r2, r3...
This gigantic query would take a lot of time to run.
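For intuition, the brute-force search those two queries describe can be sketched compactly in Python (the matches data is hypothetical, and k = 2 instead of 10 to keep it small):

```python
from itertools import combinations

# hypothetical matches data: (hashId, ruleId) pairs
matches = [(1, 1), (2, 1), (1, 2), (2, 2), (3, 2), (4, 3), (5, 3)]

# group hashIds per rule
hashes_by_rule = {}
for hash_id, rule_id in matches:
    hashes_by_rule.setdefault(rule_id, set()).add(hash_id)

# try every k-rule combination and keep the one covering the most hashIds
k = 2
best = max(
    combinations(hashes_by_rule, k),
    key=lambda combo: len(set().union(*(hashes_by_rule[r] for r in combo))),
)
print(best)  # (2, 3): together they cover all five hashIds
```

The loop visits every one of the C(m, k) combinations, which is exactly what makes the SQL version so expensive at k = 10.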
There can be much faster procedures that will give you sub-optimal results.
SOME OPTIMIZATION:
This could be somewhat optimized, depending on the shape of the data, by looking for rule groups whose hash sets are equal to, or contained in, other groups' sets; any such group can be discarded. This requires fewer than (m*(m+1))/2 comparisons, which compared to the combinatorial count is a big deal, especially if it's likely that several groups can be discarded, which lowers m. Even so, the main query still has a gigantic cost.
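That subset check can be sketched in Python (the matches data is hypothetical): a rule whose hash set is contained in another rule's set can never improve a combination that already includes the larger rule, so it can be dropped up front:

```python
# hypothetical matches data: (hashId, ruleId) pairs
matches = [(1, 1), (2, 1), (3, 1), (1, 2), (2, 2), (5, 3)]

# group hashIds per rule
hashes_by_rule = {}
for hash_id, rule_id in matches:
    hashes_by_rule.setdefault(rule_id, set()).add(hash_id)

# discard any rule whose hash set is a subset of another rule's set
kept = {
    r: hs for r, hs in hashes_by_rule.items()
    if not any(other != r and hs <= hashes_by_rule[other]  # <= is the subset test
               for other in hashes_by_rule)
}
print(sorted(kept))  # [1, 3]: rule 2 ({1, 2}) is inside rule 1 ({1, 2, 3})
```

Note that two rules with *equal* hash sets would discard each other with this naive test, so a tie-break (e.g. keep the lower ruleId) is needed in practice.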
Although I come from the PostgreSQL world, I found this question really interesting and took my time to look into it.
I split the whole process into 2 subroutines:
first, a sub-query (or function) is required that, for a given ruleId combination (array), returns all possible (array)+ruleId entries with the number of unique hashIds (count) found for each entry;
then, one should query the max(count) from #1 and get the list of array+ruleId combinations from #1. I used a recursive function for this. If the current recursion level matches the required number of ruleIds (10 in the question), then return the found array+ruleId combinations; otherwise recursively go into this same step (#2), passing the found combination as input.
As a result, the second function will return all combinations that give the maximal number of unique hashIds for a given ruleId count.
Here's the code that will create a test setup, PostgreSQL 9.1 tested. As the original question is for MySQL, I will comment on what is going on there:
create table matches (
id int4 not null,
hashId int4 not null,
ruleId int4 not null,
primary key (id)
);
insert into matches
SELECT generate_series(1,200), (random()*59+1)::int4, (random()*19+1)::int4;
-- This query will generate a 200-rows table, with:
-- - first column having values in 1-200 range (id)
-- - second column will have random numbers in 1-60 range (hashId)
-- - third column will have random numbers in 1-20 range (ruleId)
Function for the phase 1 (quite simple):
CREATE OR REPLACE FUNCTION count_matches(i_array int4[],
OUT arr int4[], OUT cnt int4) RETURNS SETOF record
AS $$
DECLARE
rec_o record;
rec_i record;
BEGIN
-- in the outer loop, we're going over all the combinations of input array
-- with the ruleId appended
FOR rec_o IN SELECT DISTINCT i_array||ruleId AS rules
FROM matches ORDER BY 1
LOOP
-- in the inner loop we're counting the distinct hashId combinations
-- for the outer loop provided array
-- and returning the new array + count
FOR rec_i IN SELECT count(distinct hashId) AS cnt
FROM matches WHERE ruleId = ANY(rec_o.rules)
LOOP
arr := rec_o.rules;
cnt := rec_i.cnt;
RETURN NEXT ;
END LOOP;
END LOOP;
RETURN ;
END;
$$ LANGUAGE plpgsql;
If you give an empty array as input to this function, you get the same results as case #1 of the initial question:
SELECT COUNT(*), ruleId FROM `beer`.`matches` GROUP BY ruleId ORDER BY COUNT(*);
-- both queries yields same results
SELECT cnt, arr FROM count_matches(ARRAY[]::int4[]);
Now the main working function:
-- function receives 3 parameters, 2 of them have default values
-- which makes it possible to query: max_matches(10)
-- to obtain results from the initial question
CREATE OR REPLACE FUNCTION max_matches(maxi int4,
arri int4[] DEFAULT array[]::int4[],
curi int4 DEFAULT 1, OUT arr int4[]) RETURNS SETOF int4[]
AS $$
DECLARE
maxcnt int4;
a int4[];
b int4[];
BEGIN
-- Fall out early for "easy" cases
IF maxi < 2 THEN
RAISE EXCEPTION 'Too easy, do a GROUP BY query instead';
END IF;
a = array[]::int4[];
-- first, we find out what is the maximal possible number of hashIds
-- on a given level
SELECT max(cnt) INTO maxcnt FROM count_matches(arri);
-- then we check each combination that yield the found number
-- of unique hashIds
FOR arr IN SELECT cm.arr FROM count_matches(arri) cm
WHERE cm.cnt = maxcnt
LOOP
-- if we're on the deepest level of recursion,
-- we just return back the found combination
IF curi = maxi THEN
RETURN NEXT ;
ELSE
-- otherwise we ask further down
FOR b IN SELECT * FROM max_matches(maxi, arr, curi+1) LOOP
-- this loop and IF clause are required to eliminate
-- equal arrays, so that if we get {6,14} and {14,6} returned
-- we will use only one of the two, as they're the same
IF NOT a #> b THEN
a = array_cat(a, b);
RETURN QUERY SELECT b;
END IF;
END LOOP;
END IF;
END LOOP;
RETURN ;
END;
$$ LANGUAGE plpgsql;
Unfortunately this approach is time consuming. For my test setup I get the following performance, which seems excessive: 8 seconds for a 200-row "big" table.
select * from max_matches(10);
arr
-----------------------------
{6,14,4,16,8,1,7,10,11,18}
{6,14,4,16,8,1,7,11,12,18}
{6,14,4,16,8,7,10,11,15,18}
{6,14,4,16,11,10,1,7,18,20}
(4 rows)
Time: 8034,700 ms
I hope you don't mind me jumping into this question. And I also hope you will find my answer useful for your purposes at least partially :)
And thanks for the question, I have had a very good time trying to solve it!
The approach that I think will work best for this is based on the same logic/method used by the statistical technique of multivariate co-factor analysis.
That is, instead of trying to solve the inherently combinatorial problem of "What combination of 10 factors(or 'rules' for your problem) out of the existing rules, best fulfills some criterion?", it incrementally answers a much easier question "Given what I already have, what additional factor('rule'), best improves how well the criterion is fulfilled?"
Procedurally, it goes like this: first, find the rule that has the most (distinct) hashes matching it. Don't worry about overlap with other rules; just find the single best one. Add it to a list (or table) of already-selected rules.
Now, find the next-best rule, given the rule(s) that you already have. In other words, find the rule that matches the most hashes, excluding any hashes already matched by the already-selected rule(s). Add this new rule to your already-selected list of rules, and repeat until you have 10 rules.
So this approach basically avoids the inherently combinatorial problem of trying to find the absolute, globally best solution, by finding the incrementally relative/local best solution. Some points in this approach:
It is O(n*k), where 'k' is the number of rules you want to find. Combinatorial approaches tend to be non-polynomial, like O(2^n) or O(n!), which is highly undesirable performance-wise.
It is possible that this approach will not give the absolute *best* 10 rules for your criterion. However, in my experience it tends to do very well on real-world cases of problems like this one; it is usually at most one or two rules off the absolute best 10.
The SQL code for the incremental search is very easy (you already have most of it). But the SQL code to actually do it N=10 times is inherently procedural and thus requires the less standard/more idiosyncratic parts of SQL (translation: I know how to do it in T-SQL, but not in MySQL).
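The iteration itself is easy to express outside SQL; here is a sketch of the greedy loop in Python (the matches data is hypothetical), where each pass picks the rule covering the most not-yet-covered hashes:

```python
# hypothetical matches data: (hashId, ruleId) pairs
matches = [
    (1, 1), (2, 1), (3, 1),          # rule 1 covers hashes 1-3
    (1, 2), (2, 2), (3, 2), (4, 2),  # rule 2 covers hashes 1-4
    (5, 3), (6, 3),                  # rule 3 covers hashes 5-6
]

# group hashIds per rule
hashes_by_rule = {}
for hash_id, rule_id in matches:
    hashes_by_rule.setdefault(rule_id, set()).add(hash_id)

def greedy_rules(hashes_by_rule, k):
    """Pick up to k rules, each maximizing newly covered hashIds."""
    covered, selected = set(), []
    candidates = dict(hashes_by_rule)
    while candidates and len(selected) < k:
        # the rule adding the most not-yet-covered hashes
        best = max(candidates, key=lambda r: len(candidates[r] - covered))
        if not candidates[best] - covered:
            break  # nothing new left to cover
        covered |= candidates.pop(best)
        selected.append(best)
    return selected

print(greedy_rules(hashes_by_rule, 2))  # [2, 3]: rule 2 first (4 hashes), then rule 3
```

Note how rule 1 is never chosen: it matches three hashes, but all of them are already covered by rule 2, which is exactly the overlap problem described above.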
Here's a solution which might be good enough.
Indexes and/or manually created caching tables might help performance figures, although on a sparsely populated table, it works instantly.
The idea is brutally simple: create a view to explicitly show all possibilities, then combine all of them and find the best by ordering.
Same-rule combinations are allowed, as a single rule by itself might be more efficient than a combination of others.
Based on a table similar to the one you described above, with columns named "id", "hash_id" and "rule_id", create a helper view (this way it's easier to test/debug) using the following select:
SELECT `t1`.`hash_id` AS `h1`, `t2`.`hash_id` AS `h2`, `t3`.`hash_id` AS `h3`,
       `t1`.`rule_id` AS `r1`, `t2`.`rule_id` AS `r2`, `t3`.`rule_id` AS `r3`
FROM (`hashTable` `t1` JOIN `hashTable` `t2` JOIN `hashTable` `t3`)
The above view creates a triple-join table. You can add `t4.hash_id AS h4, t4.rule_id AS r4` to the SELECT and join `hashTable t4` in the FROM for a fourth join, and so forth up to 10.
After creating the view, the following query gives the combination of 2 best rules with their hash coverage explicitly shown:
select group_concat(distinct h1), concat(r1, r2)
from (select distinct h1, r1, r2 from hashView
      union distinct
      select distinct h2, r1, r2 from hashView) as uu
group by concat(r1, r2)
If you don't need to see the hash coverage, this one may be better:
select count(distinct h1) as cc, concat(r1, r2)
from (select distinct h1, r1, r2 from hashView
      union distinct
      select distinct h2, r1, r2 from hashView) as uu
group by concat(r1, r2)
order by cc
Adding 3rd rule match is simple by adding h3 and r3 to the union and grouping using it:
select count(distinct h1), concat(r1, r2, r3)
from (select distinct h1, r1, r2, r3 from hashView
      union distinct
      select distinct h2, r1, r2, r3 from hashView
      union distinct
      select distinct h3, r1, r2, r3 from hashView) as uu
group by concat(r1, r2, r3)
If you do not need the option to choose how many top rules to match, you can do the concat() in the View itself and save some time on the union queries.
A possible performance increase is eliminating permuted rule id's.
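That elimination amounts to enforcing an ordering on the rule ids in the join (in SQL, something like `on t2.rule_id > t1.rule_id`). A small Python sketch (hypothetical rule ids) of how much it saves:

```python
from itertools import combinations, product

rule_ids = [1, 2, 3, 4]  # hypothetical rule ids

# a full self-join enumerates every ordered pair, including permuted duplicates
ordered_pairs = [(a, b) for a, b in product(rule_ids, repeat=2) if a != b]

# keeping only pairs with a < b removes the permuted duplicates
unordered_pairs = list(combinations(rule_ids, 2))

print(len(ordered_pairs))    # 12
print(len(unordered_pairs))  # 6
```

The saving grows with the join depth: for 10-way joins only 1/10! of the permuted tuples are distinct combinations.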
All of the above was only tested using single-digit rule ids, so instead of concat() you should probably use concat_ws(), like this, for a pre-concatenated view:
select `t1`.`hash_id` AS `h1`, `t2`.`hash_id` AS `h2`, `t3`.`hash_id` AS `h3`,
       concat_ws(",", `t1`.`rule_id`, `t2`.`rule_id`, `t3`.`rule_id`) AS `r`
from (`hashTable` `t1` join `hashTable` `t2` join `hashTable` `t3`)
And then the union query:
select count(distinct h1) as cc, r
from (select distinct h1, r from hashView
      union distinct
      select distinct h2, r from hashView
      union distinct
      select distinct h3, r from hashView) as uu
group by r
order by cc
Let me know if this solves the problem at hand, or if there are additional constraints that weren't disclosed before.
Depending on the number of rules and hashes, you can also always reverse the rule<->hash relation and create hash-based views instead.
The best idea is probably to combine this approach with real-life heuristics.