MySQL best practice: matching prefixes - mysql

I have a table with codes and another table with prefixes. I need to match the (longest) prefix for each code.
There is also a secondary scope in which I have to restrict prefixes (this involves bringing in other tables). I don't think this would matter in most cases, but here is a simplified (normalized) schema (I have to set item.prefix_id):
group (id)
subgroup (id, group_id)
prefix (id, subgroup_id, prefix)
item (id, group_id, code, prefix_id)
It is all right to cache the length of the prefix in a new field and index it. It is all right to cache the group_id in the prefix table (although groups are fairly small tables, so in most cases I don't think any performance increase is gained). The item table contains a few hundred thousand records; prefix contains at most 500.
Edit:
Sorry if the question was not defined well enough. When I use the word "prefix" I mean it literally: the codes have to start with the actual prefix.
subgroup
id group_id
-------------
1 1
2 1
3 1
4 2
prefix
id subgroup_id prefix
------------------------
1 1 a
2 2 abc
3 2 123
4 4 abcdef
item
id group_id code prefix_id
-----------------------------------
1 1 abc123 NULL
2 1 abcdef NULL
3 1 a123 NULL
4 2 abc123 NULL
The expected result for the prefix column is (item.id, item.prefix_id):
(1, 2) Because: subgroups 1, 2, 3 are under group 1, the code abc123 starts with the prefix a and the prefix abc, and abc is the longest of the two, so we take the id of abc, which is 2, and put it into item.prefix_id.
(2, 2) Because: even though prefix {4} (which is abcdef) is the longest matching prefix, its subgroup (which is 4) is under group 2 but the item is under group 1, so we can choose from subgroups 1, 2, 3, and abc is still the longest match out of the three possible prefixes.
(3, 1) Because: a is the longest match.
(4, NULL) Because: item 4 is under group 2 and the only prefix under group 2 is abcdef, which does not match abc123 (because abc123 does not start with abcdef).
But as I said, the whole grouping thing is not an essential part of the question. My main concern is how best to match a table of possible prefixes against a table of strings. (Best meaning an optimal tradeoff between readability, maintainability and performance - hence the 'best practice' in the title.)
Currently I'm doing something like:
UPDATE item USE INDEX (code3)
LEFT JOIN prefix ON prefix.length=3 AND LEFT(item.code,3)=prefix.prefix
LEFT JOIN subgroup ON subgroup.id=prefix.subgroup_id
SET item.prefix_id = prefix.id
WHERE subgroup.group_id = item.group_id AND
item.segment_id IS NULL
Where code3 is a KEY code3 (segment_id, group_id, code(3)). - And the same logic is repeated with 1, 2, 3 and 4 as length. It seems pretty efficient, but I don't like the duplication in it (4 queries for a single operation). - Of course this is in the case when the maximum length of prefixes is 4.
Thanks to everyone for sharing your ideas so far.

It is all right to cache the group_id in prefix table.
So let's create column group_id in table prefix and fill the column with the appropriate values. I assume that you know how to do this, so let's go to the next step.
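For completeness, one way to fill the cached column might look like the sketch below. The INT column type is an assumption (it should match subgroup.group_id in your schema); the table and column names are those from the question.

```sql
-- Assumed column type: adjust INT to match subgroup.group_id.
ALTER TABLE prefix ADD COLUMN group_id INT NULL;

-- Copy each prefix's group from the subgroup it belongs to.
UPDATE prefix p
JOIN subgroup sg ON sg.id = p.subgroup_id
SET p.group_id = sg.group_id;
```

The cached value has to be refreshed whenever a subgroup moves to another group or a prefix moves to another subgroup.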
The biggest performance benefit we will get from this composite index:
ALTER TABLE `prefix` ADD INDEX `c_index` (
`group_id` ASC,
`prefix` ASC
);
And the UPDATE statement:
UPDATE item i
SET
prefix_id = (
SELECT p.id
FROM prefix p USE INDEX (`c_index`)
WHERE
p.group_id = i.group_id AND
p.prefix IN (
LEFT(i.code, 4),
LEFT(i.code, 3),
LEFT(i.code, 2),
LEFT(i.code, 1)
)
ORDER BY LENGTH(p.prefix) DESC
LIMIT 1
)
In this example I assume that the prefix has a variable length of 1 to 4 characters. I also chose an IN clause rather than LIKE so that the query gets the full benefit of c_index.
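If the maximum prefix length is not known in advance, a variant that compares against each prefix's own length avoids listing every LEFT() call. This is only a sketch, under the same assumption that prefix.group_id has been cached. Because the left-hand side of the comparison depends on p.prefix, it probably cannot seek on the prefix part of c_index, but with at most 500 prefixes the per-group range should stay small.

```sql
UPDATE item i
SET i.prefix_id = (
    SELECT p.id
    FROM prefix p
    WHERE p.group_id = i.group_id
      -- the code must start with the whole prefix
      AND LEFT(i.code, CHAR_LENGTH(p.prefix)) = p.prefix
    ORDER BY CHAR_LENGTH(p.prefix) DESC
    LIMIT 1
);
```

Items with no matching prefix in their group are set to NULL, which is what the question expects for item 4.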

Unless I'm overly simplifying, should be as simple as... Start an inner pre-query to get the longest prefix (regardless of if multiple have the same length per code)
select
PreQuery.Code,
P2.ID,
P2.SubGroup_ID,
P2.Prefix
From
( select
i.code,
max( length( trim( p.Prefix ))) as LongestPrefix
from
item i
join prefix p
on i.prefix_id = p.id
group by
i.code ) PreQuery
Join item i2
on PreQuery.Code = i2.Code
Join Prefix P2
on i2.Prefix_ID = P2.ID
AND PreQuery.LongestPrefix = length( trim( P2.Prefix ))
Now, if you want to do something special about those where there are multiple with the same prefix length, it will need some adjusting, but this should get it for you.

To re-answer since you are trying to UPDATE elements, try the following update query. Now here's the catch around this... The "PreQuery" will actually return ALL matching prefixes for a given item... However, since the order is based on the Prefix Length, for those entries that have more than one matching "prefix", it will first be updated with the shortest prefix, then hit the record with the next longer prefix, and finally end with whichever has the longest for the match. So at the end, it SHOULD get you what you need.
That being said (and I can't specifically test now), if it is only updating based on the FIRST entry found for a given ID, then just make the order in DESCENDING order of the prefix length.
update Item,
( SELECT
I.ID,
P.ID Prefix_ID,
P.Prefix,
I.Code,
LENGTH( TRIM( P.Prefix )) as PrefixLen
FROM
Item I
JOIN SubGroup SG
ON I.Group_ID = SG.Group_ID
JOIN Prefix P
ON SG.ID = P.SubGroup_ID
AND LEFT( P.Prefix, LENGTH( TRIM( P.Prefix )))
= LEFT( I.Code, LENGTH( TRIM( P.Prefix )))
ORDER BY
I.ID,
LENGTH( TRIM( P.Prefix )) ) PreQuery
set
Item.Prefix_ID = PreQuery.Prefix_ID
where
Item.ID = PreQuery.ID
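One caveat with relying on row order: the MySQL manual states that in a multiple-table UPDATE each matching row is updated once, even if it matches the conditions several times, so the ORDER BY in the derived table may not decide which prefix wins. A sketch that does not depend on ordering computes the longest matching length per item first and then joins back to the prefix of exactly that length. The aliases best, item_id and max_len are made up for illustration.

```sql
UPDATE item i
JOIN (
    -- longest matching prefix length per item, within the item's group
    SELECT i2.id AS item_id,
           MAX(CHAR_LENGTH(p.prefix)) AS max_len
    FROM item i2
    JOIN subgroup sg ON sg.group_id = i2.group_id
    JOIN prefix p    ON p.subgroup_id = sg.id
                    AND LEFT(i2.code, CHAR_LENGTH(p.prefix)) = p.prefix
    GROUP BY i2.id
) best ON best.item_id = i.id
JOIN subgroup sg ON sg.group_id = i.group_id
JOIN prefix p    ON p.subgroup_id = sg.id
                AND p.prefix = LEFT(i.code, best.max_len)
SET i.prefix_id = p.id;
```

Items without any matching prefix are simply not touched. If the same prefix text exists in two subgroups of one group, either id may be written, so that case would need an extra tie-breaker.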

Group By ignores sorting in subquery

There is a TLDR version at the bottom.
Note: I have based my current solution on the proposed solution in this question here (proposed in the question text itself), however it does not work for me even if it works for that person. So I'm not sure how to handle this, because the question seems like a duplicate but the answer given there doesn't work for me. So I guess something must be different for me. If someone can tell me how to correctly handle this, I'm open to hearing.
I have a table like this one here:
scope_id key_id value
0 0 0_0
0 1 0_1
1 0 1_0
2 0 2_0
2 1 2_1
The scopes have a hierarchy where scope 0 is the parent of scope 2 and scope 2 is the parent of scope 1. (They are unsorted on purpose: the real IDs are UUIDs, and numbers are used here only for readability.)
My use case is that I want the value of multiple keys in a specific scope (scope 1). However if there is no value defined for scope 1, I would be fine with a value from its parent (scope 2) and lastly if there is also no value in scope 2 I would take a value from its parent, scope 0. So if possible, I want the value from scope 1, if it doesn't have a value then from scope 2 and lastly I try to get the value from scope 0. (The scopes are a tree structure, so each scope can have at most one parent, but a parent can have multiple children.)
So in the example above, if I want the value of key 0 in scope 1, I'd like to get 1_0 as the key is defined in the scope. If I want the value of key 1 in scope 1, I'd like to get 2_1 as there is no value defined in the scope 1 but in its parent scope 2 there is. And lastly if I want the value of keys 0 and 1 in scope 1, I want to get 1_0 and 2_1.
Currently it is solved by making 3 separate SQL requests and merging it in code. That works fine and fast enough, but I want to see if it would be faster with a single SQL query. I came up with the following query (based on the update in the question text here):
SELECT *
FROM (
SELECT *
FROM test
WHERE key_id IN (0, 1)
AND scope_id IN (1 , 2, 0)
ORDER BY FIELD(scope_id, 1 , 2, 0)
) t1
GROUP BY t1.key_id;
The inner subquery first finds all keys that I want to look at and makes sure they are in the scope that I want to look at or in one of its parent scopes. Then I order the scopes so that the child comes first, then the parent, then the grandparent. I expect GROUP BY to keep the value of the first row it finds, so hopefully the child (scope 1). However this doesn't work. Instead the first value based on the actual table is used.
TLDR
When grouping with GROUP BY in the query above, why is the order defined by the ORDER BY query ignored? Instead the first value based on the original table is taken when grouping.
Using this code you can try for yourself:
# this group by doesn't work with strict mode
SET sql_mode = '';
CREATE TABLE IF NOT EXISTS test(
scope_id int,
key_id int,
`value` varchar(20),
PRIMARY KEY (scope_id, key_id)
);
INSERT IGNORE INTO test values
(0, 0, "0_0"),
(1, 0, "1_0"),
(2, 0, "2_0"),
(2, 1, "2_1"),
(0, 1, "0_1");
SELECT *
FROM (
SELECT *
FROM test
WHERE key_id IN (0, 1)
AND scope_id IN (1 , 2, 0)
ORDER BY FIELD(scope_id, 1 , 2, 0)
) t1
GROUP BY t1.key_id;
# expected result are the rows that contain value 1_0 and 2_1
I understand your question as a greatest-n-per-group variant.
In this situation, you should not think aggregation, but filtering.
You could solve it with a correlated subquery that selects the first available scope_id per key_id:
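select t.*
from test t
where t.key_id in (0, 1)
and t.scope_id = (
select t1.scope_id
from test t1
where t1.key_id = t.key_id
-- restrict to the scope chain: field() returns 0 for any other scope_id, which would sort first
and t1.scope_id in (1, 2, 0)
order by field(t1.scope_id, 1, 2, 0)
limit 1
)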
For performance, you want an index on (key_id, scope_id).
Demo on DB Fiddle:
scope_id | key_id | value
-------: | -----: | :----
1 | 0 | 1_0
2 | 1 | 2_1
This will get what you want. Use a row number to effectively "save" your order for the next section of the query.
MySQL 8.0 or newer:
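SELECT scope_id, key_id, `value`
FROM (
-- number the rows per key, child scope first
SELECT *, ROW_NUMBER() OVER (
PARTITION BY key_id
ORDER BY FIELD(scope_id, 1, 2, 0)
) AS rn
FROM test
WHERE key_id IN (0, 1)
AND scope_id IN (1, 2, 0)
) t1
-- keep only the best-ranked row of each key instead of relying on GROUP BY
WHERE t1.rn = 1
ORDER BY t1.key_id;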
MySQL 5.7 or older:
SET @row_num = 0;
SELECT *
FROM (
SELECT *, @row_num := @row_num + 1 rank
FROM test
WHERE key_id IN (0, 1)
AND scope_id IN (1 , 2, 0)
ORDER BY FIELD(scope_id, 1 , 2, 0)
) t1
GROUP BY t1.key_id
ORDER BY rank;
Soap Box: MySQL results are, in general, horribly unreliable in any query that has 1 or more columns in a group by or aggregate but does not have all columns in a group by or aggregate.
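On the "why" in the question: as far as I understand the MySQL documentation on derived tables, the optimizer is allowed to ignore an ORDER BY inside a derived table when the outer query is grouped or aggregated, and with ONLY_FULL_GROUP_BY disabled the server may pick any row of each group for the non-grouped columns. A sketch that avoids both issues, and should also run in strict mode, picks the best position per key with MIN() and joins back. The aliases b and best_pos are made up for illustration.

```sql
SELECT t.*
FROM test t
JOIN (
    -- smallest FIELD() position = closest scope in the chain 1, 2, 0
    SELECT key_id,
           MIN(FIELD(scope_id, 1, 2, 0)) AS best_pos
    FROM test
    WHERE key_id IN (0, 1)
      AND scope_id IN (1, 2, 0)
    GROUP BY key_id
) b ON b.key_id = t.key_id
   AND FIELD(t.scope_id, 1, 2, 0) = b.best_pos;
```

With the sample data this returns the rows holding 1_0 and 2_1.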

MySQL updating duplicate IDs based on match and no match criteria all in one table

Hopefully I can explain this clearly. I have a table that has what need to be unique IDs for people within a group. The IDs are generated using first 3 letters of the first name and date of birth. Normally, with smaller groups (less than 500) this works fine. However in large groups we do hit upon some duplicates. We'd then just append a -1, -2, -3 etc. to any duplicate IDs. For example:
ID GROUP UID FIRST_NAME
1 123456 ALE19900123 ALEXIS
2 123456 ALE19900123 ALEXANDER
3 123456 ALE19900123 ALEJANDRO
4 789789 ALE19900123 ALEX
What I'd like to do is for ID 2 and 3 append a -1 and -2 respectively to their UID field so that 1,2 and 3 are now unique (GROUP + UID). ID 4 would be ignored because the GROUP is different
I've started with something like this:
UPDATE table A
JOIN table B
ON B.GROUP = A.GROUP
AND B.UID = A.UID
AND B.FIRST_NAME <> A.FIRST_NAME
AND B.ID < A.ID
SET A.duplicate_record = 1;
That should set the duplicate_record field = 1 for IDs 2 and 3. But then I still need to append a -1, -2, -3 etc. to those UIDs and I'm not sure how to do that. Maybe instead of just setting a flag = 1 for duplicate I should set the count of records that are duplicates?
If the (group, UID) tuple is unique (and it should be), why not INSERT IGNORE the first one (without any value appended), check how many rows were affected with SELECT ROW_COUNT();, and if that is zero, append -1? If you put it in a loop (pseudocode):
while i < 1000 do
insert ignore into people (group, uid, first_name) values (123456, concat(their_uid, "-", i), first name);
if ((select row_count();) == 1):
break;
i=i+1;
end while;
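For rows that already exist, the numbering asked for in the question can also be sketched as a single UPDATE: count, for each row, how many earlier rows (smaller ID) share the same group and UID, and append that count. This assumes the table is called people as in the pseudocode above, that no UID has been suffixed yet, and that the uid column is wide enough for the suffix. The aliases d and dup_no are made up for illustration.

```sql
UPDATE people a
JOIN (
    -- number of earlier rows with the same (group, uid)
    SELECT a2.id, COUNT(*) AS dup_no
    FROM people a2
    JOIN people b ON b.`group` = a2.`group`
                 AND b.uid = a2.uid
                 AND b.id < a2.id
    GROUP BY a2.id
) d ON d.id = a.id
SET a.uid = CONCAT(a.uid, '-', d.dup_no);
```

On the example rows, ID 1 has no earlier duplicate and stays unchanged, IDs 2 and 3 get -1 and -2, and ID 4 is ignored because its group differs.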

MySQL multi-step GROUP BY without subquery

I'm working on improving some queries I inherited, and was curious if it was possible to do the following - given a table the_table that looks like this:
id uri
---+-------------------------
1 /foo/bar/x
1 /foo/bar/y
1 /foo/boo
2 /alpha/beta/carotine
2 /alpha/delic/ipa
3 /plastik/man/spastik
3 /plastik/man/krakpot
3 /plastik/man/helikopter
As an implicit intermediate step I'd like to group these by the 1st + 2nd tuple of uri. The results of that step would look like:
id base
---+---------------
1 /foo/bar
1 /foo/boo
2 /alpha/beta
2 /alpha/delic
3 /plastik/man
And the final result would reflect the number of unique tuple1 + tuple2 values, per unique id:
id cnt
---+-----
1 2
2 2
3 1
I can achieve these results, but not without doing a subquery (to get the results of the implicit step mentioned above), and then select/grouping out of that. Something like:
SELECT
id,
count(base) cnt
FROM (
SELECT
id,
substring_index(uri, '/', 3) AS base
FROM the_table
GROUP BY id, base
) AS bases
GROUP BY id;
My reason for wanting to avoid the subquery is that I'm working with a fairly large (20M rows) data set, and the subquery gets very expensive. Gut tells me it's not doable, but figured I'd ask SO...
There's no need for a subquery -- you can use count with distinct to achieve the same result:
SELECT
id,
count(distinct substring_index(uri, '/', 3)) AS cnt
FROM the_table
GROUP BY id
SQL Fiddle Demo
BTW -- this returns count of 1 for id 3 -- I assume that was a typo in your posting.

how to search for a given sequence of rows within a table in SQL Server 2008

The problem:
We have a number of entries within a table but we are only interested in the ones that appear in a given sequence. For example we are looking for three specific "GFTitle" entries ('Pearson Grafton','Woolworths (P and O)','QRX - Brisbane'), however they have to appear in a particular order to be considered a valid route. (See image below)
RowNum GFTitle
------------------------------
1 Pearson Grafton
2 Woolworths (P and O)
3 QRX - Brisbane
4 Pearson Grafton
5 Woolworths (P and O)
6 Pearson Grafton
7 QRX - Brisbane
8 Pearson Grafton
9 Pearson Grafton
So rows (1,2,3) satisfy this rule but rows (4,5,6) don't even though the first two entries (4,5) do.
I am sure there is a way to do this via CTE's but some help would be great.
Cheers
This is very simple even with good old tools :-) Try this quick-and-dirty solution, assuming your table name is GFTitles and RowNum values are sequential:
SELECT a.[RowNum]
,a.[GFTitle]
,b.[GFTitle]
,c.[GFTitle]
FROM [dbo].[GFTitles] as a
join [dbo].[GFTitles] as b on b.RowNum = a.RowNum + 1
join [dbo].[GFTitles] as c on c.RowNum = a.RowNum + 2
WHERE a.[GFTitle] = 'Pearson Grafton' and
b.[GFTitle] = 'Woolworths (P and O)' and
c.[GFTitle] = 'QRX - Brisbane'
Assuming RowNum has neither duplicates nor gaps, you could try the following method.
Assign row numbers to the sought sequence's items and join the row set to your table on GFTitle.
For every match, calculate the difference between your table's row number and that of the sequence. If there's a matching sequence in your table, the corresponding rows' RowNum differences will be identical.
Count the rows per difference and return only those where the count matches the number of sequence items.
Here's a query that implements the above logic:
WITH SoughtSequence AS (
SELECT *
FROM (
VALUES
(1, 'Pearson Grafton'),
(2, 'Woolworths (P and O)'),
(3, 'QRX - Brisbane')
) x (RowNum, GFTitle)
)
, joined AS (
SELECT
t.*,
SequenceLength = COUNT(*) OVER (PARTITION BY t.RowNum - ss.RowNum)
FROM atable t
INNER JOIN SoughtSequence ss
ON t.GFTitle = ss.GFTitle
)
SELECT
RowNum,
GFTitle
FROM joined
WHERE SequenceLength = (SELECT COUNT(*) FROM SoughtSequence)
;
You can try it at SQL Fiddle too.

mysql distribution of combinations/values

I have a mysql table which contains some random combination of numbers. For simplicity take the following table as example:
index|n1|n2|n3
1 1 2 3
2 4 10 32
3 3 10 4
4 35 1 2
5 27 1 3
etc
What I want to find out is the number of times a combination has occurred in the table. For instance, how many times has the combination of 4 10 or 1 2 or 1 2 3 or 3 10 4 etc. occurred.
Do I have to create another table that contains all possible combinations and do comparison from there or is there another way to do this?
For a single combination, this is easy:
SELECT COUNT(*)
FROM my_table
WHERE n1 = 3 AND n2 = 10 AND n3 = 4
If you want to do this with multiple combinations, you could create a (temporary) table of them and join that table with your data, something like this:
CREATE TEMPORARY TABLE combinations (
id INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY,
n1 INTEGER, n2 INTEGER, n3 INTEGER
);
INSERT INTO combinations (n1, n2, n3) VALUES
(1, 2, NULL), (4, 10, NULL), (1, 2, 3), (3, 10, 4);
SELECT c.n1, c.n2, c.n3, COUNT(t.id) AS num
FROM combinations AS c
LEFT JOIN my_table AS t
ON (c.n1 = t.n1 OR c.n1 IS NULL)
AND (c.n2 = t.n2 OR c.n2 IS NULL)
AND (c.n3 = t.n3 OR c.n3 IS NULL)
GROUP BY c.id;
(demo on SQLize)
Note that this query as written is not very efficient due to the OR c.n? IS NULL clauses, which MySQL isn't smart enough to optimize. If all your combinations contain the same number of terms, you can leave those out, which will allow the query to make use of indexes on the data table.
Ps. With the query above, the combination (1, 2, NULL) won't match (35, 1, 2). However, (NULL, 1, 2) will, so, if you want both, a simple workaround would be to just include both patterns in your table of combinations.
If you actually have many more columns than shown in your example, and you want to match patterns that occur in any set of consecutive columns, then you really should pack your columns into a string and use a LIKE or REGEXP query. For example, if you concatenate all your data columns into a comma-separated string in a column named data, you could search it like this:
INSERT INTO combinations (pattern) VALUES
('1,2'), ('4,10'), ('1,2,3'), ('3,10,4'), ('7,8,9');
SELECT c.pattern, COUNT(t.id) AS num
FROM combinations AS c
LEFT JOIN my_table AS t
ON CONCAT(',', t.data, ',') LIKE CONCAT('%,', c.pattern, ',%')
GROUP BY c.id;
(demo on SQLize)
You could make this query somewhat faster by making the prefixes and suffixes added with CONCAT() part of the actual data in the tables, but this is still going to be a fairly inefficient query if you have a lot of data to search, because it cannot make use of indexes. If you need to do this kind of substring searching on large datasets efficiently, you may want to use something better suited for that specific purpose than MySQL.
You only have three columns in the table, so you are looking for combinations of 1, 2, and 3 elements.
For simplicity, I'll start with the following table:
select `index`, n1 as n from t union all
select `index`, n2 from t union all
select `index`, n3 from t union all
select distinct `index`, -1 from t union all
select distinct `index`, -2 from t
Let's call this "values". Now, we want to get all triples from this table for a given index. In this case, -1 and -2 represent NULL.
select (case when v1.n < 0 then NULL else v1.n end) as n1,
(case when v2.n < 0 then NULL else v2.n end) as n2,
(case when v3.n < 0 then NULL else v3.n end) as n3,
count(*) as NumOccurrences
from `values` v1 join
`values` v2
on v1.n < v2.n and v1.`index` = v2.`index` join
`values` v3
on v2.n < v3.n and v2.`index` = v3.`index`
group by n1, n2, n3
This is using the join mechanism to generate the combinations.
This method finds all combinations regardless of ordering (so 1, 2, 3 is the same as 2, 3, 1). Also, this ignores duplicates, so it cannot find (1, 2, 2) if 2 is repeated twice.
SELECT
CONCAT(CAST(n1 AS CHAR(10)),'|',CAST(n2 AS CHAR(10)),'|',CAST(n3 AS CHAR(10))) AS Combination,
COUNT(CONCAT(CAST(n1 AS CHAR(10)),'|',CAST(n2 AS CHAR(10)),'|',CAST(n3 AS CHAR(10)))) AS Occurrences
FROM
MyTable
GROUP BY
CONCAT(CAST(n1 AS CHAR(10)),'|',CAST(n2 AS CHAR(10)),'|',CAST(n3 AS CHAR(10)))
This creates a single column that represents the combination of the values within the 3 columns by concatenating the values. It will count the occurrences of each.
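If the concatenated string itself is not needed, grouping on the three columns directly should give the same counts without building a string, at least when none of the columns is NULL. A minimal sketch, using the MyTable name from the query above:

```sql
SELECT n1, n2, n3, COUNT(*) AS Occurrences
FROM MyTable
GROUP BY n1, n2, n3
ORDER BY Occurrences DESC;
```

Like the concatenation approach, this only counts full (n1, n2, n3) triples in column order, not partial combinations such as 1 2.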