how to search for a given sequence of rows within a table in SQL Server 2008 - sql-server-2008

The problem:
We have a number of entries within a table but we are only interested in the ones that appear in a given sequence. For example we are looking for three specific "GFTitle" entries ('Pearson Grafton','Woolworths (P and O)','QRX - Brisbane'), however they have to appear in a particular order to be considered a valid route. (See image below)
RowNum GFTitle
------------------------------
1 Pearson Grafton
2 Woolworths (P and O)
3 QRX - Brisbane
4 Pearson Grafton
5 Woolworths (P and O)
6 Pearson Grafton
7 QRX - Brisbane
8 Pearson Grafton
9 Pearson Grafton
So rows (1,2,3) satisfy this rule but rows (4,5,6) don't even though the first two entries (4,5) do.
I am sure there is a way to do this via CTE's but some help would be great.
Cheers

This is very simple using even good old tools :-) Try this quick-and-dirty solution, assuming your table name is GFTitles and RowNumber values are sequential:
SELECT a.[RowNum]
,a.[GFTitle]
,b.[GFTitle]
,c.[GFTitle]
FROM [dbo].[GFTitles] as a
join [dbo].[GFTitles] as b on b.RowNumber = a.RowNumber + 1
join [dbo].[GFTitles] as c on c.RowNumber = a.RowNumber + 2
WHERE a.[GFTitle] = 'Pearson Grafton' and
b.[GFTitle] = 'Woolworths (P and O)' and
c.[GFTitle] = 'QRX - Brisbane'

Assuming RowNum has neither duplicates nor gaps, you could try the following method.
Assign row numbers to the sought sequence's items and join the row set to your table on GFTitle.
For every match, calculate the difference between your table's row number and that of the sequence. If there's a matching sequence in your table, the corresponding rows' RowNum differences will be identical.
Count the rows per difference and return only those where the count matches the number of sequence items.
Here's a query that implements the above logic:
WITH SoughtSequence AS (
SELECT *
FROM (
VALUES
(1, 'Pearson Grafton'),
(2, 'Woolworths (P and O)'),
(3, 'QRX - Brisbane')
) x (RowNum, GFTitle)
)
, joined AS (
SELECT
t.*,
SequenceLength = COUNT(*) OVER (PARTITION BY t.RowNum - ss.RowNum)
FROM atable t
INNER JOIN SoughtSequence ss
ON t.GFTitle = ss.GFTitle
)
SELECT
RowNum,
GFTitle
FROM joined
WHERE SequenceLength = (SELECT COUNT(*) FROM SoughtSequence)
;
You can try it at SQL Fiddle too.

Related

SQL to club records in sequence

I have data in MySQL table, my data looks like
Key, value
A 1
A 2
A 3
A 6
A 7
A 8
A 9
B 1
B 2
and I want to group it based on the continuous sequence. Data is sorted in the table.
Key, min, max
A 1 3
A 6 9
B 1 2
I tried googling it but could find any solution to it. Can someone please help me with this.
This is way easier with a modern DBMS that support window functions, but you can find the upper bounds by checking that there is no successor. In the same way you can find the lower bounds via absence of a predecessor. By combining the lowest upper bound for each lower bound we get the intervals.
select low.keyx, low.valx, min(high.valx)
from (
select t1.keyx, t1.valx from t t1
where not exists (
select 1 from t t2
where t1.keyx = t2.keyx
and t1.valx = t2.valx + 1
)
) as low
join (
select t3.keyx, t3.valx from t t3
where not exists (
select 1 from t t4
where t3.keyx = t4.keyx
and t3.valx = t4.valx - 1
)
) as high
on low.keyx = high.keyx
and low.valx <= high.valx
group by low.keyx, low.valx;
I changed your identifiers since value is a reserved world.
Using a window function is way more compact and efficient. If at all possible, consider upgrading to MySQL 8+, it is superior to 5.7 in so many aspects.
We can create a group by looking at the difference between valx and an enumeration of the vals, if there is a gap the difference increases. Then, we simply pick min and max for each group:
select keyx, min(valx), max(valx)
from (
select keyx, valx
, valx - row_number() over (partition by keyx order by valx) as grp
from t
) as tt
group by keyx, grp;
Fiddle

Aggregating row values in MySQl or Snowflake

I would like to calculate the std dev. min and max of the mer_data array into 3 other fields called std_dev,min_mer and max_mer grouped by mac and timestamp.
This needs to be done without flattening the data as each mer_data row consists of 4000 float values and multiplying that with 700k rows gives a very high dimensional table.
The mer_data field is currently saved as varchar(30000) and maybe Json format might help, I'm not sure.
Input:
Output:
This can be done in Snowflake or MySQL.
Also, the query needs to be optimized so that it does not take much computation time.
While you don't want to split the data up, you will need to if you want to do it in pure SQL. Snowflake has no problems with such aggregations.
WITH fake_data(mac, mer_data) AS (
SELECT * FROM VALUES
('abc','43,44.25,44.5,42.75,44,44.25,42.75,43'),
('def','32.75,33.25,34.25,34.5,32.75,34,34.25,32.75,43')
)
SELECT f.mac,
avg(d.value::float) as avg_dev,
stddev(d.value::float) as std_dev,
MIN(d.value::float) as MIN_MER,
Max(d.value::float) as Max_MER
FROM fake_data f, table(split_to_table(f.mer_data,',')) d
GROUP BY 1
ORDER BY 1;
I would however discourage the use of strings in the grouping process, so would break it apart like so:
WITH fake_data(mac, mer_data, timestamp) AS (
SELECT * FROM VALUES
('abc','43,44.25,44.5,42.75,44,44.25,42.75,43', '01-01-22'),
('def','32.75,33.25,34.25,34.5,32.75,34,34.25,32.75,43', '02-01-22')
), boost_data AS (
SELECT seq8() as seq, *
FROM fake_data
), math_step AS (
SELECT f.seq,
avg(d.value::float) as avg_dev,
stddev(d.value::float) as std_dev,
MIN(d.value::float) as MIN_MER,
Max(d.value::float) as Max_MER
FROM boost_data f, table(split_to_table(f.mer_data,',')) d
GROUP BY 1
)
SELECT b.mac,
m.avg_dev,
m.std_dev,
m.MIN_MER,
m.Max_MER,
b.timestamp
FROM boost_data b
JOIN math_step m
ON b.seq = m.seq
ORDER BY 1;
MAC
AVG_DEV
STD_DEV
MIN_MER
MAX_MER
TIMESTAMP
abc
43.5625
0.7529703087
42.75
44.5
01-01-22
def
34.611111111
3.226141056
32.75
43
02-01-22
performance testing:
so using this SQL to make 70K rows of 4000 values each:
create table fake_data_tab AS
WITH cte_a AS (
SELECT SEQ8() as s
FROM TABLE(GENERATOR(ROWCOUNT =>70000))
), cte_b AS (
SELECT a.s, uniform(20::float, 50::float, random()) as v
FROM TABLE(GENERATOR(ROWCOUNT =>4000))
CROSS JOIN cte_a a
)
SELECT s::text as mac
,LISTAGG(v,',') AS mer_data
,dateadd(day,s,'2020-01-01')::date as timestamp
FROM cte_b
GROUP BY 1,3;
takes 79 seconds on a XTRA_SMALL,
now with that we can test the two solutions:
The second set of code (group by numbers, with a join):
WITH boost_data AS (
SELECT seq8() as seq, *
FROM fake_data_tab
), math_step AS (
SELECT f.seq,
avg(d.value::float) as avg_dev,
stddev(d.value::float) as std_dev,
MIN(d.value::float) as MIN_MER,
Max(d.value::float) as Max_MER
FROM boost_data f, table(split_to_table(f.mer_data,',')) d
GROUP BY 1
)
SELECT b.mac,
m.avg_dev,
m.std_dev,
m.MIN_MER,
m.Max_MER,
b.timestamp
FROM boost_data b
JOIN math_step m
ON b.seq = m.seq
ORDER BY 1;
takes 1m47s
the original group by strings/dates
SELECT f.mac,
avg(d.value::float) as avg_dev,
stddev(d.value::float) as std_dev,
MIN(d.value::float) as MIN_MER,
Max(d.value::float) as Max_MER,
f.timestamp
FROM fake_data_tab f, table(split_to_table(f.mer_data,',')) d
GROUP BY 1,6
ORDER BY 1;
takes 1m46s
Hmm, so leaving the "mac" as a number made the code very fast (~3s), and dealing with strings in ether way changed the data processed from 1.5GB for strings and 150MB for numbers.
If the numbers were in rows, not packed together like that, we can discuss how to do it in SQL.
In rows, GROUP_CONCAT(...) can construct a commalist like you show, and MIN(), STDDEV(), etc can do the other stuff.
If you continue to have the commalist, the do the rest of work in you app programming language. (It is very ugly to have SQL pick apart an array.)

SQL query statement Self Join?

new to SQL.
I have the following set of data
A X Y Z
1 Wind 1 1
2 Wind 2 1
3 Hail 1 1
4 Flood 1 1
4 Rain 1 1
4 Fire 1 1
I would like to select all distinct 'A' fields where for all rows that contain A have flood and rain.
So in this example, the query would return only the number 4 since for the set of all rows that contain A = 4 we have Flood and Rain.
I need the values of A where for a given value 'a' in A, there exists rows with 'a' that must contain all of the following fields provided (in the example Flood and Rain).
Please let me know if you need further clarification.
I need the values of A where for a given value 'a' in A, there exists rows with 'a' that must contain all of the following fields provided (in the example Flood and Rain).
You can use aggregation, and filter with a having clause:
select a
from mytable t
where x in ('Flood', 'Rain') -- either one or the other
having count(*) = 2 -- both match
If tuples (a, x) tuples are not unique, then you want having count(distinct x) = 2 instead.
You Shooud use count(distinct X) group by A and having
count(distinct...) avoid situation where you have two time the same value for X
select A
from my_table
WHERE x in ('Flood', 'Rain')
group A
having count(distinct X) = 2

mySQL count occurances of value on multiple fields. How?

I have a table with 5 fields. Each field can store a number from 1 - 59.
Similar to countif in Excel, how do I count the number of times a number from 1 - 59 shows up in all 5 fields?
Here's an example for the count of occurances for the number 1 in all five fields:
SELECT SUM(pick_1 = 1 OR pick_2 = 1 OR pick_3 = 1 OR pick_4 = 1 OR pick_5 = 1) AS total_count_1
FROM tbldraw
Hopefully I made sense.
There was an answer here that had a solution. I think this is just a variation.
Step1: Create a numbers table (1 field, called id, 59 records (values 1 -59))
Step2:
SELECT numbers_table.number as number
, COUNT(tbldraw.pk_record)
FROM numbers_table
LEFT JOIN tbldraw
ON numbers_table.number = tbldraw.pick_1
OR numbers_table.number = tbldraw.pick_2
OR numbers_table.number = tbldraw.pick_3
OR numbers_table.number = tbldraw.pick_4
OR numbers_table.number = tbldraw.pick_5
GROUP BY number
ORDER BY number
How about a two step process? Assuming a table called summary_table ( int id, int ttl), for each number you care about...
insert into summary_table values (1,
(select count(*)
from table
where field1 = 1 or field2 = 1 or field3 = 1 or field4 = 1 or field5 = 1))
do that 59 times, once for each value. You can use a loop in most cases. Then you can select from the summary_table
select *
from summary_table
order by id
That will do it. I leave the coversion of this SQL into a stored procedure for those that know what database is in use.
The ALL() function, which returns true if the preceding operator is true for all parameters, makes the query particularly elegant and succinct.
To find the count a particular number (eg 3):
select count(*)
from tbldraw
where 3 = all (pick_1, pick_2, pick_3, pick_4, pick_5)
To find the count of all such numbers:
select pick_1, count(*)
from tbldraw
where pick_1 = all (pick_2, pick_3, pick_4, pick_5)
group by pick_1

mysql distribution of combinations/values

I have a mysql table which contains some random combination of numbers. For simplicity take the following table as example:
index|n1|n2|n3
1 1 2 3
2 4 10 32
3 3 10 4
4 35 1 2
5 27 1 3
etc
What I want to find out is the number of times a combination has occured in the table. For instance, how many times has the combination of 4 10 or 1 2 or 1 2 3 or 3 10 4 etc occured.
Do I have to create another table that contains all possible combinations and do comparison from there or is there another way to do this?
For a single combination, this is easy:
SELECT COUNT(*)
FROM my_table
WHERE n1 = 3 AND n2 = 10 AND n3 = 4
If you want to do this with multiple combinations, you could create a (temporary) table of them and join that table with you data, something like this:
CREATE TEMPORARY TABLE combinations (
id INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY,
n1 INTEGER, n2 INTEGER, n3 INTEGER
);
INSERT INTO combinations (n1, n2, n3) VALUES
(1, 2, NULL), (4, 10, NULL), (1, 2, 3), (3, 10, 4);
SELECT c.n1, c.n2, c.n3, COUNT(t.id) AS num
FROM combinations AS c
LEFT JOIN my_table AS t
ON (c.n1 = t.n1 OR c.n1 IS NULL)
AND (c.n2 = t.n2 OR c.n2 IS NULL)
AND (c.n3 = t.n3 OR c.n3 IS NULL)
GROUP BY c.id;
(demo on SQLize)
Note that this query as written is not very efficient due to the OR c.n? IS NULL clauses, which MySQL isn't smart enough to optimize. If all your combinations contain the same number of terms, you can leave those out, which will allow the query to make use of indexes on the data table.
Ps. With the query above, the combination (1, 2, NULL) won't match (35, 1, 2). However, (NULL, 1, 2) will, so, if you want both, a simple workaround would be to just include both patterns in your table of combinations.
If you actually have many more columns than shown in your example, and you want to match patterns that occur in any set of consecutive columns, then your really should pack your columns into a string and use a LIKE or REGEXP query. For example, if you concatenate all your data columns into a comma-separated string in a column named data, you could search it like this:
INSERT INTO combinations (pattern) VALUES
('1,2'), ('4,10'), ('1,2,3'), ('3,10,4'), ('7,8,9');
SELECT c.pattern, COUNT(t.id) AS num
FROM combinations AS c
LEFT JOIN my_table AS t
ON CONCAT(',', t.data, ',') LIKE CONCAT('%,', c.pattern, ',%')
GROUP BY c.id;
(demo on SQLize)
You could make this query somewhat faster by making the prefixes and suffixes added with CONCAT() part of the actual data in the tables, but this is still going to be a fairly inefficient query if you have a lot of data to search, because it cannot make use of indexes. If you need to do this kind of substring searching on large datasets efficiently, you may want to use something better suited for than specific purpose than MySQL.
You only have three columns in the table, so you are looking for combinations of 1, 2, and 3 elements.
For simplicity, I'll start with the following table:
select index, n1 as n from t union all
select index, n2 from t union all
select index, n3 from t union all
select distinct index, -1 from t union all
select distinct index, -2 from t
Let's call this "values". Now, we want to get all triples from this table for a given index. In this case, -1 and -2 represent NULL.
select (case when v1.n < 0 then NULL else v1.n end) as n1,
(case when v2.n < 0 then NULL else v2.n end) as n2,
(case when v3.n < 0 then NULL else v3.n end) as n3,
count(*) as NumOccurrences
from values v1 join
values v2
on v1.n < v2.n and v1.index = v2.index join
values v3
on v2.n < v3.n and v2.index = v3.index
This is using the join mechanism to generate the combinations.
This method finds all combinations regardless of ordering (so 1, 2, 3 is the same as 2, 3, 1). Also, this ignores duplicates, so it cannot find (1, 2, 2) if 2 is repeated twice.
SELECT
CONCAT(CAST(n1 AS VARCHAR(10)),'|',CAST(n2 AS VARCHAR(10)),'|',CAST(n3 AS VARCHAR(10))) AS Combination,
COUNT(CONCAT(CAST(n1 AS VARCHAR(10)),'|',CAST(n2 AS VARCHAR(10)),'|',CAST(n3 AS VARCHAR(10)))) AS Occurrences
FROM
MyTable
GROUP BY
CONCAT(CAST(n1 AS VARCHAR(10)),'|',CAST(n2 AS VARCHAR(10)),'|',CAST(n3 AS VARCHAR(10)))
This creates a single column that represents the combination of the values within the 3 columns by concatenating the values. It will count the occurrences of each.