mysql distribution of combinations/values - mysql

I have a mysql table which contains some random combination of numbers. For simplicity take the following table as example:
index|n1|n2|n3
1 1 2 3
2 4 10 32
3 3 10 4
4 35 1 2
5 27 1 3
etc
What I want to find out is the number of times a combination has occurred in the table. For instance, how many times has the combination 4 10, or 1 2, or 1 2 3, or 3 10 4, etc. occurred?
Do I have to create another table that contains all possible combinations and do comparison from there or is there another way to do this?

For a single combination, this is easy:
SELECT COUNT(*)
FROM my_table
WHERE n1 = 3 AND n2 = 10 AND n3 = 4
If you want to do this with multiple combinations, you could create a (temporary) table of them and join that table with your data, something like this:
CREATE TEMPORARY TABLE combinations (
id INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY,
n1 INTEGER, n2 INTEGER, n3 INTEGER
);
INSERT INTO combinations (n1, n2, n3) VALUES
(1, 2, NULL), (4, 10, NULL), (1, 2, 3), (3, 10, 4);
SELECT c.n1, c.n2, c.n3, COUNT(t.id) AS num
FROM combinations AS c
LEFT JOIN my_table AS t
ON (c.n1 = t.n1 OR c.n1 IS NULL)
AND (c.n2 = t.n2 OR c.n2 IS NULL)
AND (c.n3 = t.n3 OR c.n3 IS NULL)
GROUP BY c.id;
(demo on SQLize)
Note that this query as written is not very efficient due to the OR c.n? IS NULL clauses, which MySQL isn't smart enough to optimize. If all your combinations contain the same number of terms, you can leave those out, which will allow the query to make use of indexes on the data table.
Ps. With the query above, the combination (1, 2, NULL) won't match (35, 1, 2). However, (NULL, 1, 2) will, so, if you want both, a simple workaround would be to just include both patterns in your table of combinations.
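As a rough check of the wildcard join above, here is a minimal sketch that runs the same query from Python against an in-memory SQLite database instead of MySQL (that substitution is an assumption made only to keep the example self-contained). The data is the five sample rows from the question; the patterns (NULL, 1, 2) and (7, 8, 9) are added for illustration.

```python
import sqlite3

# In-memory SQLite stands in for MySQL here (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE my_table (id INTEGER PRIMARY KEY, n1 INTEGER, n2 INTEGER, n3 INTEGER);
    INSERT INTO my_table (id, n1, n2, n3) VALUES
        (1, 1, 2, 3), (2, 4, 10, 32), (3, 3, 10, 4), (4, 35, 1, 2), (5, 27, 1, 3);

    CREATE TABLE combinations (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        n1 INTEGER, n2 INTEGER, n3 INTEGER
    );
    -- NULL acts as a wildcard for that column.
    INSERT INTO combinations (n1, n2, n3) VALUES
        (1, 2, NULL), (4, 10, NULL), (1, 2, 3), (3, 10, 4), (NULL, 1, 2), (7, 8, 9);
""")

rows = conn.execute("""
    SELECT c.n1, c.n2, c.n3, COUNT(t.id) AS num
    FROM combinations AS c
    LEFT JOIN my_table AS t
      ON (c.n1 = t.n1 OR c.n1 IS NULL)
     AND (c.n2 = t.n2 OR c.n2 IS NULL)
     AND (c.n3 = t.n3 OR c.n3 IS NULL)
    GROUP BY c.id
    ORDER BY c.id
""").fetchall()

# Map each pattern to its number of matching rows.
counts = {(n1, n2, n3): num for n1, n2, n3, num in rows}
for pattern, num in counts.items():
    print(pattern, num)
```

Because of the LEFT JOIN, a pattern that matches nothing still appears in the result with a count of zero.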
If you actually have many more columns than shown in your example, and you want to match patterns that occur in any set of consecutive columns, then you really should pack your columns into a string and use a LIKE or REGEXP query. For example, if you concatenate all your data columns into a comma-separated string in a column named data, you could search it like this:
INSERT INTO combinations (pattern) VALUES
('1,2'), ('4,10'), ('1,2,3'), ('3,10,4'), ('7,8,9');
SELECT c.pattern, COUNT(t.id) AS num
FROM combinations AS c
LEFT JOIN my_table AS t
ON CONCAT(',', t.data, ',') LIKE CONCAT('%,', c.pattern, ',%')
GROUP BY c.id;
(demo on SQLize)
You could make this query somewhat faster by making the prefixes and suffixes added with CONCAT() part of the actual data in the tables, but this is still going to be a fairly inefficient query if you have a lot of data to search, because it cannot make use of indexes. If you need to do this kind of substring searching on large datasets efficiently, you may want to use something better suited for that specific purpose than MySQL.
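The delimiter trick can be tried out with a small sketch like the following. It uses Python with an in-memory SQLite database rather than MySQL, so the `||` operator replaces CONCAT(); the data column holds the question's sample rows packed into comma-separated strings, which is an assumption about how the packing would look.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Each row's values packed into one comma-separated string (assumed layout).
    CREATE TABLE my_table (id INTEGER PRIMARY KEY, data TEXT);
    INSERT INTO my_table (id, data) VALUES
        (1, '1,2,3'), (2, '4,10,32'), (3, '3,10,4'), (4, '35,1,2'), (5, '27,1,3');

    CREATE TABLE combinations (id INTEGER PRIMARY KEY, pattern TEXT);
    INSERT INTO combinations (pattern) VALUES
        ('1,2'), ('4,10'), ('1,2,3'), ('3,10,4'), ('7,8,9');
""")

# Wrapping both sides in commas makes '1,2' match only whole values,
# so it cannot match inside '11,2' or '1,25'.
rows = conn.execute("""
    SELECT c.pattern, COUNT(t.id) AS num
    FROM combinations AS c
    LEFT JOIN my_table AS t
      ON (',' || t.data || ',') LIKE ('%,' || c.pattern || ',%')
    GROUP BY c.id
    ORDER BY c.id
""").fetchall()

pattern_counts = dict(rows)
print(pattern_counts)
```

Here the pattern '1,2' matches both '1,2,3' and '35,1,2', since it may occur in any run of consecutive columns.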

You only have three columns in the table, so you are looking for combinations of 1, 2, and 3 elements.
For simplicity, I'll start with the following table:
select `index`, n1 as n from t union all
select `index`, n2 from t union all
select `index`, n3 from t union all
select distinct `index`, -1 from t union all
select distinct `index`, -2 from t
Let's call this "values". Now, we want to get all triples from this table for a given index. In this case, -1 and -2 represent NULL.
select (case when v1.n < 0 then NULL else v1.n end) as n1,
(case when v2.n < 0 then NULL else v2.n end) as n2,
(case when v3.n < 0 then NULL else v3.n end) as n3,
count(*) as NumOccurrences
from `values` v1 join
`values` v2
on v1.n < v2.n and v1.`index` = v2.`index` join
`values` v3
on v2.n < v3.n and v2.`index` = v3.`index`
where v1.n <> -1 -- skip (-1, x, y), which would count every pair a second time
group by n1, n2, n3
This is using the join mechanism to generate the combinations.
This method finds all combinations regardless of ordering (so 1, 2, 3 is the same as 2, 3, 1). Also, this ignores duplicates, so it cannot find (1, 2, 2) if 2 is repeated twice.
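The same order-insensitive counting can be sketched outside the database. The following is a minimal Python illustration, not the SQL above: it uses itertools.combinations on the question's sample rows and, like the join, ignores ordering and duplicate values within a row.

```python
from collections import Counter
from itertools import combinations

# Sample rows (n1, n2, n3) from the question.
rows = [(1, 2, 3), (4, 10, 32), (3, 10, 4), (35, 1, 2), (27, 1, 3)]

occurrences = Counter()
for row in rows:
    # Sort and de-duplicate so (1, 2, 3) and (2, 3, 1) count as the same set.
    values = sorted(set(row))
    # Count every subset of size 1, 2 and 3 once per row.
    for size in range(1, len(values) + 1):
        occurrences.update(combinations(values, size))

print(occurrences[(1, 2)])      # pairs are keyed in ascending order
print(occurrences[(3, 4, 10)])
```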

SELECT
CONCAT(CAST(n1 AS CHAR(10)),'|',CAST(n2 AS CHAR(10)),'|',CAST(n3 AS CHAR(10))) AS Combination,
COUNT(*) AS Occurrences
FROM
MyTable
GROUP BY
CONCAT(CAST(n1 AS CHAR(10)),'|',CAST(n2 AS CHAR(10)),'|',CAST(n3 AS CHAR(10)))
This creates a single column that represents the combination of the values within the 3 columns by concatenating the values. It will count the occurrences of each.

Related

SQL query statement Self Join?

new to SQL.
I have the following set of data
A X Y Z
1 Wind 1 1
2 Wind 2 1
3 Hail 1 1
4 Flood 1 1
4 Rain 1 1
4 Fire 1 1
I would like to select all distinct 'A' values for which the set of rows with that A contains both Flood and Rain.
So in this example, the query would return only the number 4 since for the set of all rows that contain A = 4 we have Flood and Rain.
I need the values of A where for a given value 'a' in A, there exists rows with 'a' that must contain all of the following fields provided (in the example Flood and Rain).
Please let me know if you need further clarification.
You can use aggregation, and filter with a having clause:
select a
from mytable t
where x in ('Flood', 'Rain') -- either one or the other
group by a
having count(*) = 2 -- both match
If the (a, x) tuples are not unique, then you want having count(distinct x) = 2 instead.
You should use count(distinct X) with group by A and having.
count(distinct ...) avoids the situation where the same value of X appears twice for one A.
select A
from my_table
WHERE x in ('Flood', 'Rain')
group by A
having count(distinct X) = 2
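To see why count(distinct X) matters, here is a small sketch run from Python against an in-memory SQLite database (an assumption for self-containment; the query itself is the one above). The rows with A = 5 are hypothetical additions: they contain Flood twice and no Rain, so a plain count(*) = 2 would wrongly accept them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE my_table (A INTEGER, X TEXT, Y INTEGER, Z INTEGER);
    INSERT INTO my_table VALUES
        (1, 'Wind', 1, 1), (2, 'Wind', 2, 1), (3, 'Hail', 1, 1),
        (4, 'Flood', 1, 1), (4, 'Rain', 1, 1), (4, 'Fire', 1, 1),
        -- hypothetical extra rows: Flood twice, Rain never
        (5, 'Flood', 1, 1), (5, 'Flood', 2, 1);
""")

required = ["Flood", "Rain"]
placeholders = ", ".join("?" for _ in required)

# Keep only the A values whose rows cover every required X value.
query = f"""
    SELECT A
    FROM my_table
    WHERE X IN ({placeholders})
    GROUP BY A
    HAVING COUNT(DISTINCT X) = ?
    ORDER BY A
"""
matching = [a for (a,) in conn.execute(query, (*required, len(required)))]
print(matching)
```

Passing the number of required values as a parameter keeps the HAVING clause in step with the IN list.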

SQL, build a query using data provided in the query itself

For experimental purposes only.
I would like to build a query that does not read data from any table but instead queries data provided in the query itself. Like:
select numbers.* from (1, 2, 3) as numbers;
or
select numbers.* from (field1 = 1, field2 = 2, field3 = 3) as numbers;
so I can do things like
select
numbers.*
from (field1 = 1, field2 = 2, field3 = 3) as numbers
where numbers.field1 > 1;
A solution that is specific to one database engine would be interesting too.
If you wanted the values to be on separate rows instead of three fields of the same row, the method is the same, just one row per value linked with a union all.
select *
from(
select 1 as FieldName union all
select 2 union all
select 3 union all
select 4 union all -- we could continue this for a long time
select 5 -- the end
) as x;
select numbers.*
from(
select 1 ,2, 3
union select 3, 4, 5
union select 6, 7, 8
union select 9, 10, 11 -- we could continue this for a long time
union select 12, 13, 14 -- the end
) as numbers;
This works with MySQL and Postgres (and most others as well).
[Edit] Use union all rather than just union as you do not need to remove duplicates from a list of constants. Give the field(s) in the first select a meaningful name. Otherwise, you can't specify a specific field later on: where x.FieldName = 3.
If you don't provide meaningful names for the fields (as in the second example), the system (at least MySQL where this was tested) will assign the name "1" for the first field, "2" as the second and so on. So, if you want to specify one of the fields, you have to write expressions like this:
where numbers.`1` = 3
Use the values row constructor:
select *
from (values (1),(2),(3)) as numbers(nr);
or using a CTE.
with numbers (nr) as (
values (1),(2),(3)
)
select *
from numbers
where nr > 2;
Edit: I just noticed that you also tagged your question with mysql: the above will not work with MySQL, only with Postgres (and a few other DBMS).
You can use a subquery without table like so:
SELECT
numbers.*
FROM (
SELECT
1 AS a,
2 AS b,
3 AS c
UNION
SELECT
4,
5,
6
) AS numbers
WHERE
numbers.a > 1
If you like queries to always have a table referenced, there is a pseudo-table called DUAL that always has 1 row and no columns. You can use it like so:
SELECT
numbers.*
FROM (
SELECT
1 AS a,
2 AS b,
3 AS c
FROM
DUAL
UNION
SELECT
4,
5,
6
FROM
DUAL
) AS numbers
WHERE
numbers.a > 1
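Both styles of inline data shown in this thread can be tried without any server. The sketch below runs them from Python on an in-memory SQLite database; that SQLite accepts the same syntax is the only thing it demonstrates, and behaviour on MySQL or Postgres should be checked separately.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # no tables are created at all

# Derived table built from SELECTs glued together with UNION ALL.
union_rows = conn.execute("""
    SELECT numbers.*
    FROM (
        SELECT 1 AS a, 2 AS b, 3 AS c
        UNION ALL
        SELECT 4, 5, 6
    ) AS numbers
    WHERE numbers.a > 1
""").fetchall()

# Common table expression fed by a VALUES row constructor.
cte_rows = conn.execute("""
    WITH numbers (nr) AS (
        VALUES (1), (2), (3)
    )
    SELECT nr FROM numbers WHERE nr > 2
""").fetchall()

print(union_rows, cte_rows)
```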

how to search for a given sequence of rows within a table in SQL Server 2008

The problem:
We have a number of entries within a table but we are only interested in the ones that appear in a given sequence. For example we are looking for three specific "GFTitle" entries ('Pearson Grafton','Woolworths (P and O)','QRX - Brisbane'), however they have to appear in a particular order to be considered a valid route. (See the table below.)
RowNum GFTitle
------------------------------
1 Pearson Grafton
2 Woolworths (P and O)
3 QRX - Brisbane
4 Pearson Grafton
5 Woolworths (P and O)
6 Pearson Grafton
7 QRX - Brisbane
8 Pearson Grafton
9 Pearson Grafton
So rows (1,2,3) satisfy this rule but rows (4,5,6) don't even though the first two entries (4,5) do.
I am sure there is a way to do this via CTE's but some help would be great.
Cheers
This is very simple even with good old tools :-) Try this quick-and-dirty solution, assuming your table name is GFTitles and the RowNum values are sequential:
SELECT a.[RowNum]
,a.[GFTitle]
,b.[GFTitle]
,c.[GFTitle]
FROM [dbo].[GFTitles] as a
join [dbo].[GFTitles] as b on b.RowNum = a.RowNum + 1
join [dbo].[GFTitles] as c on c.RowNum = a.RowNum + 2
WHERE a.[GFTitle] = 'Pearson Grafton' and
b.[GFTitle] = 'Woolworths (P and O)' and
c.[GFTitle] = 'QRX - Brisbane'
Assuming RowNum has neither duplicates nor gaps, you could try the following method.
Assign row numbers to the sought sequence's items and join the row set to your table on GFTitle.
For every match, calculate the difference between your table's row number and that of the sequence. If there's a matching sequence in your table, the corresponding rows' RowNum differences will be identical.
Count the rows per difference and return only those where the count matches the number of sequence items.
Here's a query that implements the above logic:
WITH SoughtSequence AS (
SELECT *
FROM (
VALUES
(1, 'Pearson Grafton'),
(2, 'Woolworths (P and O)'),
(3, 'QRX - Brisbane')
) x (RowNum, GFTitle)
)
, joined AS (
SELECT
t.*,
SequenceLength = COUNT(*) OVER (PARTITION BY t.RowNum - ss.RowNum)
FROM atable t
INNER JOIN SoughtSequence ss
ON t.GFTitle = ss.GFTitle
)
SELECT
RowNum,
GFTitle
FROM joined
WHERE SequenceLength = (SELECT COUNT(*) FROM SoughtSequence)
;
You can try it at SQL Fiddle too.
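The row-number-difference idea can also be sketched in plain Python, which may make the logic easier to follow. This is an illustration of the technique rather than the T-SQL above; the function name find_sequences is hypothetical, and it assumes RowNum has no gaps or duplicates.

```python
from collections import defaultdict

# Sample table from the question: RowNum -> GFTitle.
table = {
    1: "Pearson Grafton", 2: "Woolworths (P and O)", 3: "QRX - Brisbane",
    4: "Pearson Grafton", 5: "Woolworths (P and O)", 6: "Pearson Grafton",
    7: "QRX - Brisbane", 8: "Pearson Grafton", 9: "Pearson Grafton",
}
sought = ["Pearson Grafton", "Woolworths (P and O)", "QRX - Brisbane"]


def find_sequences(rows, sequence):
    """Return the starting RowNum of every complete run of `sequence`."""
    # Group matches by (table row number - position in the sequence).
    positions_by_offset = defaultdict(set)
    for row_num, title in rows.items():
        for pos, wanted in enumerate(sequence, start=1):
            if title == wanted:
                positions_by_offset[row_num - pos].add(pos)
    # A run is complete when every sequence position shares one offset.
    return sorted(
        offset + 1
        for offset, positions in positions_by_offset.items()
        if len(positions) == len(sequence)
    )


starts = find_sequences(table, sought)
print(starts)
```

Rows 4 and 5 share an offset but row 6 does not continue the run, so only the run starting at row 1 is reported.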

MySQL best practice: matching prefixes

I have a table with codes and another table with prefixes. I need to match the (longest) prefix for each code.
There is also a secondary scope in which I have to restrict prefixes (this involves bringing in other tables). I don't think this would matter in most cases, but here is a simplified (normalized) scheme (I have to set item.prefix_id):
group (id)
subgroup (id, group_id)
prefix (id, subgroup_id, prefix)
item (id, group_id, code, prefix_id)
It is all right to cache the length of the prefix in a new field and index it. It is all right to cache the group_id in the prefix table (although groups are fairly small tables, so in most cases I don't think any performance increase is gained). The item table contains a few hundred thousand records; prefix contains at most 500.
Edit:
Sorry if the question was not defined well enough. When I use the word "prefix" I mean it literally, so the codes have to start with the actual prefix.
subgroup
id group_id
-------------
1 1
2 1
3 1
4 2
prefix
id subgroup_id prefix
------------------------
1 1 a
2 2 abc
3 2 123
4 4 abcdef
item
id group_id code prefix_id
-----------------------------------
1 1 abc123 NULL
2 1 abcdef NULL
3 1 a123 NULL
4 2 abc123 NULL
The expected result for the prefix column is (item.id, item.prefix_id):
(1, 2) Because: subgroups 1, 2, 3 are under group 1, the code abc123 starts with both the prefix a and the prefix abc, and abc is the longest of the two, so we take the id of abc, which is 2, and put it into item.prefix_id.
(2, 2) Because: even though prefix {4} (which is abcdef) is the longest matching prefix, its subgroup (which is 4) is under group 2 but the item is under group 1, so we can choose only from subgroups 1, 2, 3, and abc is still the longest match out of the three possible prefixes.
(3, 1) Because: a is the longest match.
(4, NULL) Because: item 4 is under group 2 and the only prefix under group 2 is abcdef which is no match to abc123 (because abc123 does not start with abcdef).
But as I said, the whole grouping thing is not an essential part of the question. My main concern is to match a table with possible prefixes to a table of strings, and how to do it the best way. (Best meaning an optimal tradeoff between readability, maintainability and performance, hence the 'best practice' in the title.)
Currently I'm doing something like:
UPDATE item USE INDEX (code3)
LEFT JOIN prefix ON prefix.length=3 AND LEFT(item.code,3)=prefix.prefix
LEFT JOIN subgroup ON subgroup.id=prefix.subgroup_id
SET item.segment_id = prefix.id -- assumption: segment_id is the real name of prefix_id
WHERE subgroup.group_id = item.group_id AND
item.segment_id IS NULL
Where code3 is a KEY code3 (segment_id, group_id, code(3)). The same logic is repeated with 1, 2, 3 and 4 as the length. It seems pretty efficient, but I don't like the duplication in it (4 queries for a single operation). Of course, this is the case when the maximum length of prefixes is 4.
Thanks to everyone for sharing your ideas thus far.
It is all right to cache the group_id in the prefix table.
So let's create column group_id in table prefix and fill the column with the appropriate values. I assume that you know how to do this, so let's go to the next step.
The biggest performance benefit we will get from this composite index:
ALTER TABLE `prefix` ADD INDEX `c_index` (
`group_id` ASC,
`prefix` ASC
);
And the UPDATE statement:
UPDATE item i
SET
prefix_id = (
SELECT p.id
FROM prefix p USE INDEX (`c_index`)
WHERE
p.group_id = i.group_id AND
p.prefix IN (
LEFT(i.code, 4),
LEFT(i.code, 3),
LEFT(i.code, 2),
LEFT(i.code, 1)
)
ORDER BY LENGTH(p.prefix) DESC
LIMIT 1
)
In this example I assume that the prefix has a variable length of 1 to 4 characters. I also decided to use an IN clause instead of LIKE in order to get the full benefit of c_index.
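The lookup strategy behind that UPDATE (try the longest candidate prefix first, restricted to the item's group) can be sketched in Python with the sample data from the question. The dictionary keyed on (group_id, prefix) plays the role of the composite index; the function name longest_prefix_id is hypothetical.

```python
# Sample data from the question.
subgroups = {1: 1, 2: 1, 3: 1, 4: 2}            # subgroup id -> group id
prefixes = [(1, 1, "a"), (2, 2, "abc"),         # (id, subgroup_id, prefix)
            (3, 2, "123"), (4, 4, "abcdef")]
items = [(1, 1, "abc123"), (2, 1, "abcdef"),    # (id, group_id, code)
         (3, 1, "a123"), (4, 2, "abc123")]

# (group_id, prefix) -> prefix id, mirroring the cached group_id column
# and the composite index on (group_id, prefix).
lookup = {(subgroups[sg_id], p): p_id for p_id, sg_id, p in prefixes}
max_len = max(len(p) for _, _, p in prefixes)


def longest_prefix_id(group_id, code):
    """Return the id of the longest prefix of `code` within the group."""
    # Try the longest possible prefix first, then shorter ones.
    for length in range(min(max_len, len(code)), 0, -1):
        prefix_id = lookup.get((group_id, code[:length]))
        if prefix_id is not None:
            return prefix_id
    return None  # no prefix in this group matches


result = {item_id: longest_prefix_id(group_id, code)
          for item_id, group_id, code in items}
print(result)
```

Each item needs at most max_len exact lookups, which is why the exact-match IN list can use an index where a LIKE comparison could not.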
Unless I'm overly simplifying, should be as simple as... Start an inner pre-query to get the longest prefix (regardless of if multiple have the same length per code)
select
PreQuery.Code,
P2.ID,
P2.SubGroup_ID,
P2.Prefix
From
( select
i.code,
max( length( trim( p.Prefix ))) as LongestPrefix
from
item i
join prefix p
on i.prefix_id = p.id
group by
i.code ) PreQuery
Join item i2
on PreQuery.Code = i2.Code
Join Prefix P2
on i2.Prefix_ID = P2.ID
AND PreQuery.LongestPrefix = length( trim( P2.Prefix ))
Now, if you want to do something special about those where there are multiple with the same prefix length, it will need some adjusting, but this should get it for you.
To re-answer since you are trying to UPDATE elements, try the following update query. Now here's the catch around this... The "PreQuery" will actually return ALL matching prefixes for a given item... However, since the order is based on the Prefix Length, for those entries that have more than one matching "prefix", it will first be updated with the shortest prefix, then hit the record with the next longer prefix, and finally end with whichever has the longest for the match. So at the end, it SHOULD get you what you need.
That being said (and I can't specifically test now), if it is only updating based on the FIRST entry found for a given ID, then just make the order in DESCENDING order of the prefix length.
update Item,
( SELECT
I.ID,
P.ID Prefix_ID,
P.Prefix,
I.Code,
LENGTH( TRIM( P.Prefix )) as PrefixLen
FROM
Item I
JOIN SubGroup SG
ON I.Group_ID = SG.Group_ID
JOIN Prefix P
ON SG.ID = P.SubGroup_ID
AND LEFT( P.Prefix, LENGTH( TRIM( P.Prefix )))
= LEFT( I.Code, LENGTH( TRIM( P.Prefix )))
ORDER BY
I.ID,
LENGTH( TRIM( P.Prefix )) ) PreQuery
set
Prefix_ID = PreQuery.Prefix_ID
where
ID = PreQuery.ID

MySQL String Comparison with Percent Output (Position Very Important)

I am trying to compare two entries of 6 digits, each of which can be either 0 or 1 (e.g. 100001 or 011101). If 3 out of 6 match, I want the output to be .5. If 2 out of 6 match, I want the output to be .33, etc.
Note that position matters. A match only occurs when both entries have a 1 in the first position, both have a 0 in the second position etc.
Here are the SQL commands to create the table
CREATE TABLE sim
(sim_key int,
string int);
INSERT INTO sim (sim_key, string)
VALUES (1, 111000);
INSERT INTO sim (sim_key, string)
VALUES (2, 101101);
My desired result is to compare the two strings, which share 50% of their characters, and output 50%.
Is it possible to do this sort of comparison in SQL? Thanks in advance
Have a look at this example.
CREATE TABLE sim (sim_key int, string int);
INSERT INTO sim (sim_key, string) VALUES (1, 111000);
INSERT INTO sim (sim_key, string) VALUES (2, 101101);
select a.string A, b.string B,
sum(case when Substring(A.string,Pos,1) = Substring(B.string,Pos,1) then 1 else 0 end) Matches,
count(*) as RowCount,
(sum(case when Substring(A.string,Pos,1) = Substring(B.string,Pos,1) then 1 else 0 end) /
count(*) * 100.0) as PercentMatch
from sim A
cross join sim B
inner join (
select 1 Pos union all select 2 union all select 3
union all select 4 union all select 5 union all select 6) P
on P.Pos between 1 and length(A.string)
where A.sim_key= 1 and B.sim_key = 2
group by a.string, b.string
It is crude and probably includes more than required, but it shows how it can be done. It is better to create a numbers table with just the numbers from 1 to 1000 or so, which can be reused in many queries where a number sequence is required. Such a table would replace the select ... union all derived table used in the inner join.
Instead of storing a value like 10010101 as a decimal integer, convert its binary form to a true integer. To compare two values, use a bitwise AND, convert the result back to binary, and count the '1' digits to see how many positions match...
For conversion: http://dev.mysql.com/doc/refman/5.5/en/binary-varbinary.html
For comparison: http://dev.mysql.com/doc/refman/5.5/en/bit-functions.html (bitwise AND)
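For comparison, here is a minimal Python sketch of both approaches: a character-by-character comparison and a bit-based one. Note that the bit version uses XOR rather than AND, which is a deliberate substitution: AND only finds positions where both values are 1, while the question also counts positions where both are 0. The function names and the fixed width of 6 are assumptions.

```python
def similarity(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length strings agree."""
    if len(a) != len(b) or not a:
        raise ValueError("strings must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def similarity_bits(a: int, b: int, width: int = 6) -> float:
    """Same measure on integers whose low `width` bits hold the flags."""
    mask = (1 << width) - 1
    # XOR sets a bit wherever the two values differ.
    differing = bin((a ^ b) & mask).count("1")
    return (width - differing) / width


print(similarity("111000", "101101"))
print(similarity_bits(0b111000, 0b101101))
```

Keeping the values as strings (or as true binary integers) also avoids losing leading zeros, which happens when something like 011101 is stored in an int column.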
...