Aggregating row values in MySQL or Snowflake

I would like to calculate the standard deviation, min, and max of the mer_data array into three other fields called std_dev, min_mer and max_mer, grouped by mac and timestamp.
This needs to be done without flattening the data, as each mer_data row consists of 4000 float values, and multiplying that by 700k rows gives a very high-dimensional table.
The mer_data field is currently saved as varchar(30000); maybe JSON format might help, I'm not sure.
Input:
Output:
This can be done in Snowflake or MySQL.
Also, the query needs to be optimized so that it does not take much computation time.

While you don't want to split the data up, you will need to if you want to do it in pure SQL. Snowflake has no problems with such aggregations.
WITH fake_data(mac, mer_data) AS (
    SELECT * FROM VALUES
        ('abc','43,44.25,44.5,42.75,44,44.25,42.75,43'),
        ('def','32.75,33.25,34.25,34.5,32.75,34,34.25,32.75,43')
)
SELECT f.mac,
    avg(d.value::float) AS avg_dev,
    stddev(d.value::float) AS std_dev,
    min(d.value::float) AS min_mer,
    max(d.value::float) AS max_mer
FROM fake_data f,
    TABLE(split_to_table(f.mer_data, ',')) d
GROUP BY 1
ORDER BY 1;
I would, however, discourage the use of strings in the grouping process, so I would break it apart like so:
WITH fake_data(mac, mer_data, timestamp) AS (
    SELECT * FROM VALUES
        ('abc','43,44.25,44.5,42.75,44,44.25,42.75,43', '01-01-22'),
        ('def','32.75,33.25,34.25,34.5,32.75,34,34.25,32.75,43', '02-01-22')
), boost_data AS (
    SELECT seq8() AS seq, *
    FROM fake_data
), math_step AS (
    SELECT f.seq,
        avg(d.value::float) AS avg_dev,
        stddev(d.value::float) AS std_dev,
        min(d.value::float) AS min_mer,
        max(d.value::float) AS max_mer
    FROM boost_data f,
        TABLE(split_to_table(f.mer_data, ',')) d
    GROUP BY 1
)
SELECT b.mac,
    m.avg_dev,
    m.std_dev,
    m.min_mer,
    m.max_mer,
    b.timestamp
FROM boost_data b
JOIN math_step m
    ON b.seq = m.seq
ORDER BY 1;
MAC | AVG_DEV      | STD_DEV      | MIN_MER | MAX_MER | TIMESTAMP
----+--------------+--------------+---------+---------+----------
abc | 43.5625      | 0.7529703087 | 42.75   | 44.5    | 01-01-22
def | 34.611111111 | 3.226141056  | 32.75   | 43      | 02-01-22
Performance testing:
Using this SQL to make 70K rows of 4,000 values each:
create table fake_data_tab AS
WITH cte_a AS (
    SELECT seq8() AS s
    FROM TABLE(GENERATOR(ROWCOUNT => 70000))
), cte_b AS (
    SELECT a.s, uniform(20::float, 50::float, random()) AS v
    FROM TABLE(GENERATOR(ROWCOUNT => 4000))
    CROSS JOIN cte_a a
)
SELECT s::text AS mac,
    LISTAGG(v, ',') AS mer_data,
    dateadd(day, s, '2020-01-01')::date AS timestamp
FROM cte_b
GROUP BY 1, 3;
takes 79 seconds on an X-Small warehouse.
Now with that we can test the two solutions.
The second set of code (group by numbers, with a join):
WITH boost_data AS (
    SELECT seq8() AS seq, *
    FROM fake_data_tab
), math_step AS (
    SELECT f.seq,
        avg(d.value::float) AS avg_dev,
        stddev(d.value::float) AS std_dev,
        min(d.value::float) AS min_mer,
        max(d.value::float) AS max_mer
    FROM boost_data f,
        TABLE(split_to_table(f.mer_data, ',')) d
    GROUP BY 1
)
SELECT b.mac,
    m.avg_dev,
    m.std_dev,
    m.min_mer,
    m.max_mer,
    b.timestamp
FROM boost_data b
JOIN math_step m
    ON b.seq = m.seq
ORDER BY 1;
takes 1m47s.
The original, grouping by strings/dates:
SELECT f.mac,
    avg(d.value::float) AS avg_dev,
    stddev(d.value::float) AS std_dev,
    min(d.value::float) AS min_mer,
    max(d.value::float) AS max_mer,
    f.timestamp
FROM fake_data_tab f,
    TABLE(split_to_table(f.mer_data, ',')) d
GROUP BY 1, 6
ORDER BY 1;
takes 1m46s.
Hmm, so leaving the "mac" as a number makes the code very fast (~3s), while dealing with strings either way pushes the data processed up to ~1.5GB, versus ~150MB for numbers.
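For reference, a minimal sketch of that fast numeric variant (this assumes mac can simply be cast back to a number, as it can in the generated test data above):
SELECT f.mac::number AS mac,
    avg(d.value::float) AS avg_dev,
    stddev(d.value::float) AS std_dev,
    min(d.value::float) AS min_mer,
    max(d.value::float) AS max_mer,
    f.timestamp
FROM fake_data_tab f,
    TABLE(split_to_table(f.mer_data, ',')) d
GROUP BY 1, 6
ORDER BY 1;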

If the numbers were in rows, not packed together like that, we could discuss how to do it in SQL.
With the values in rows, GROUP_CONCAT(...) can construct a comma-list like the one you show, and MIN(), STDDEV(), etc. can do the rest; see the sketch below.
If you keep the comma-list, then do the rest of the work in your application programming language. (It is very ugly to have SQL pick apart an array.)
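A minimal sketch of the row-based layout this answer has in mind (the mer_values table and its columns are hypothetical, not from the question):
SELECT mac,
       ts,
       STDDEV(val) AS std_dev,
       MIN(val)    AS min_mer,
       MAX(val)    AS max_mer
FROM mer_values   -- hypothetical layout: one float per row (mac, ts, val)
GROUP BY mac, ts;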

Related

SQL to club records in sequence

I have data in a MySQL table; it looks like this:
Key  Value
A    1
A    2
A    3
A    6
A    7
A    8
A    9
B    1
B    2
and I want to group it based on continuous sequences. The data is sorted in the table. The desired result is:
Key  Min  Max
A    1    3
A    6    9
B    1    2
I tried googling it but couldn't find any solution. Can someone please help me with this?
This is way easier with a modern DBMS that supports window functions, but you can find the upper bounds by checking that there is no successor. In the same way, you can find the lower bounds via the absence of a predecessor. Combining each lower bound with its lowest upper bound gives the intervals.
select low.keyx, low.valx, min(high.valx)
from (
    select t1.keyx, t1.valx
    from t t1
    where not exists (
        select 1 from t t2
        where t1.keyx = t2.keyx
          and t1.valx = t2.valx + 1
    )
) as low
join (
    select t3.keyx, t3.valx
    from t t3
    where not exists (
        select 1 from t t4
        where t3.keyx = t4.keyx
          and t3.valx = t4.valx - 1
    )
) as high
  on low.keyx = high.keyx
 and low.valx <= high.valx
group by low.keyx, low.valx;
I changed your identifiers since value is a reserved word.
Using a window function is way more compact and efficient. If at all possible, consider upgrading to MySQL 8+; it is superior to 5.7 in so many aspects.
We can create a group by looking at the difference between valx and an enumeration of the vals: if there is a gap, the difference increases. Then we simply pick min and max for each group:
select keyx, min(valx), max(valx)
from (
    select keyx, valx,
        valx - row_number() over (partition by keyx order by valx) as grp
    from t
) as tt
group by keyx, grp;
Fiddle

Query optimization for MySQL

I have the following query, which takes about 28 seconds on my machine. I would like to optimize it, and to know whether there is any way to make it faster by creating some indexes.
select rr1.person_id as person_id, rr1.t1_value, rr2.t0_value
from (
    select r1.person_id, avg(r1.avg_normalized_value1) as t1_value
    from (
        select ma1.person_id, mn1.store_name, avg(mn1.normalized_value) as avg_normalized_value1
        from matrix_report1 ma1, matrix_normalized_notes mn1
        where ma1.final_value = 1
          and (mn1.normalized_value != 0.2 and mn1.normalized_value != 0.0)
          and ma1.user_id = mn1.user_id
          and ma1.request_id = mn1.request_id
          and ma1.request_id = 4
        group by ma1.person_id, mn1.store_name
    ) r1
    group by r1.person_id
) rr1,
(
    select r2.person_id, avg(r2.avg_normalized_value) as t0_value
    from (
        select ma.person_id, mn.store_name, avg(mn.normalized_value) as avg_normalized_value
        from matrix_report1 ma, matrix_normalized_notes mn
        where ma.final_value = 0
          and (mn.normalized_value != 0.2 and mn.normalized_value != 0.0)
          and ma.user_id = mn.user_id
          and ma.request_id = mn.request_id
          and ma.request_id = 4
        group by ma.person_id, mn.store_name
    ) r2
    group by r2.person_id
) rr2
where rr1.person_id = rr2.person_id
Basically, it aggregates data depending on request_id and final_value (0 or 1). Is there a way to simplify it for optimization? It would also be nice to know which columns should be indexed. I created an index on user_id and request_id, but it doesn't help much.
There are about 4,907,424 rows in matrix_report1 and 335,740 rows in the matrix_normalized_notes table. These tables will grow as we get more requests.
First, the others are right about formatting your samples better. Explaining in plain language what you are trying to do also helps, and sample data with expected results is better still.
That said, I think the query can be significantly simplified. Your two halves are almost completely identical except for the one field final_value = 1 or 0 respectively. Since each half produces one record per person_id, you can do the average based on a CASE/WHEN and remove the rest.
To help optimize the query, your matrix_report1 table should have an index on (request_id, final_value, user_id), and your matrix_normalized_notes table should have an index on (request_id, user_id, store_name, normalized_value).
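In DDL form, those recommendations would look something like this (the index names are just illustrative):
CREATE INDEX ix_report1_request_final_user
    ON matrix_report1 (request_id, final_value, user_id);

CREATE INDEX ix_notes_request_user_store_value
    ON matrix_normalized_notes (request_id, user_id, store_name, normalized_value);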
Since your outer query averages the per-store averages, you do need to keep it nested. The following should help.
SELECT
    r1.person_id,
    avg(r1.ANV1) as t1_value,
    avg(r1.ANV0) as t0_value
from (
    select
        ma1.person_id,
        mn1.store_name,
        avg( case when ma1.final_value = 1
                  then mn1.normalized_value end ) as ANV1,
        avg( case when ma1.final_value = 0
                  then mn1.normalized_value end ) as ANV0
    from matrix_report1 ma1
    JOIN matrix_normalized_notes mn1
        ON ma1.request_id = mn1.request_id
        AND ma1.user_id = mn1.user_id
        AND NOT mn1.normalized_value in ( 0.0, 0.2 )
    where ma1.request_id = 4
      AND ma1.final_value in ( 0, 1 )
    group by
        ma1.person_id,
        mn1.store_name
) r1
group by r1.person_id
Notice the inner query pulls all transactions where final_value is either zero or one, but each AVG is computed over a CASE/WHEN of the respective value. When the condition is not the 1 or 0 respectively, the result is NULL and is thus not considered when the average is computed.
So at this point the data is already grouped on a per-person, per-store basis with ANV1 and ANV0 set. Now roll these values up per person regardless of the store. Again, NULL values are not considered part of the average computation, so if store "A" doesn't have a value in ANV1 it does not skew the results, and similarly if store "B" doesn't have a value in ANV0.
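If the NULL behavior of AVG() is unfamiliar, this self-contained example (not from the original post) demonstrates it:
-- AVG() skips NULLs: this returns 2.0000, not 1.3333
SELECT AVG(v)
FROM (SELECT 1 AS v UNION ALL SELECT NULL UNION ALL SELECT 3) AS x;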

How to search for a given sequence of rows within a table in SQL Server 2008

The problem:
We have a number of entries within a table, but we are only interested in the ones that appear in a given sequence. For example, we are looking for three specific GFTitle entries ('Pearson Grafton', 'Woolworths (P and O)', 'QRX - Brisbane'); however, they have to appear in that particular order to be considered a valid route (see the rows below).
RowNum  GFTitle
------  --------------------
1       Pearson Grafton
2       Woolworths (P and O)
3       QRX - Brisbane
4       Pearson Grafton
5       Woolworths (P and O)
6       Pearson Grafton
7       QRX - Brisbane
8       Pearson Grafton
9       Pearson Grafton
So rows (1,2,3) satisfy this rule, but rows (4,5,6) don't, even though the first two entries (4,5) do.
I am sure there is a way to do this via CTEs, but some help would be great.
Cheers
This is very simple using even good old tools :-) Try this quick-and-dirty solution, assuming your table name is GFTitles and the RowNum values are sequential:
SELECT a.[RowNum],
       a.[GFTitle],
       b.[GFTitle],
       c.[GFTitle]
FROM [dbo].[GFTitles] AS a
JOIN [dbo].[GFTitles] AS b ON b.RowNum = a.RowNum + 1
JOIN [dbo].[GFTitles] AS c ON c.RowNum = a.RowNum + 2
WHERE a.[GFTitle] = 'Pearson Grafton'
  AND b.[GFTitle] = 'Woolworths (P and O)'
  AND c.[GFTitle] = 'QRX - Brisbane'
Assuming RowNum has neither duplicates nor gaps, you could try the following method.
1. Assign row numbers to the sought sequence's items and join the row set to your table on GFTitle.
2. For every match, calculate the difference between your table's row number and that of the sequence. If there's a matching sequence in your table, the corresponding rows' RowNum differences will be identical.
3. Count the rows per difference and return only those where the count matches the number of sequence items.
Here's a query that implements the above logic:
WITH SoughtSequence AS (
    SELECT *
    FROM (
        VALUES
            (1, 'Pearson Grafton'),
            (2, 'Woolworths (P and O)'),
            (3, 'QRX - Brisbane')
    ) x (RowNum, GFTitle)
),
joined AS (
    SELECT
        t.*,
        SequenceLength = COUNT(*) OVER (PARTITION BY t.RowNum - ss.RowNum)
    FROM atable t
    INNER JOIN SoughtSequence ss
        ON t.GFTitle = ss.GFTitle
)
SELECT
    RowNum,
    GFTitle
FROM joined
WHERE SequenceLength = (SELECT COUNT(*) FROM SoughtSequence);
You can try it at SQL Fiddle too.

Generating a series of numbers [duplicate]

Possible Duplicate:
Generating a series of dates
What is the best way in MySQL to generate a series of numbers in a given range?
The application I have in mind is to write a report query that returns a row for every number, regardless of whether there is any data to report. An example in its simplest form might be:
SELECT numbers.num, COUNT(answers.id)
FROM <series of numbers between X and Y> numbers
LEFT JOIN answers ON answers.selection_number = numbers.num
GROUP BY 1
I have tried creating a table with lots of numbers, but that seems like a poor workaround.
First, create a table called ints which will contain one record for each digit from 0 to 9.
CREATE TABLE ints ( i tinyint );
Then populate that table with data.
INSERT INTO ints VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
Now you can use a query such as the following to generate a sequence of numbers.
SELECT generator.num, COUNT(answers.id)
FROM (
    SELECT a.i * 10 + b.i AS num
    FROM ints a, ints b
) generator
LEFT JOIN answers ON answers.selection_number = generator.num
WHERE generator.num BETWEEN 18 AND 43
GROUP BY generator.num
ORDER BY generator.num
To add another place value to the generated numbers, just add more joins of the ints table and adjust the calculations accordingly. The following will generate three-digit numbers:
SELECT generator.num, COUNT(answers.id)
FROM (
    SELECT a.i * 100 + b.i * 10 + c.i AS num
    FROM ints a, ints b, ints c
) generator
LEFT JOIN answers ON answers.selection_number = generator.num
WHERE generator.num BETWEEN 328 AND 643
GROUP BY generator.num
ORDER BY generator.num
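On MySQL 8+ you could skip the ints table entirely and generate the series with a recursive CTE (a sketch, not part of the original answer):
WITH RECURSIVE numbers (num) AS (
    SELECT 18
    UNION ALL
    SELECT num + 1 FROM numbers WHERE num < 43
)
SELECT numbers.num, COUNT(answers.id)
FROM numbers
LEFT JOIN answers ON answers.selection_number = numbers.num
GROUP BY numbers.num;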
You can also try grouping by numbers.num instead of GROUP BY 1:
SELECT numbers.num, COUNT(answers.id)
FROM numbers
LEFT JOIN answers ON answers.selection_number = numbers.num
WHERE numbers.num between X and Y
GROUP BY numbers.num

Break Numbers List Into Min and Max Ranges

Brain is not working today and my google skills are failing me.
I have a column of numbers ranging from 1 to 1000. I want to dump the min and max values for ranges of 100 (or whatever I choose) records into a temp table. The plan is to use this temp table to process ranges of records (in this example, 100 at a time) in a larger table.
I swear I have done this before with a CTE, but then I had something to group on. Here I just want to break up a single list of numbers into ranges of X.
The output from the temp table should look like:
Min  Max
0    99
100  199
200  299
300  399
etc.
Thanks!
You can use this trick from Stuart Ainsworth:
http://codegumbo.com/index.php/2009/01/25/building-ranges-using-a-dynamically-generated-numbers-table/
Numbers tables are awesome, but he uses a dynamically generated numbers table, which is even awesome...r. A rough sketch of the idea is below.
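Something along these lines (the CTE names are illustrative, and the bucket math is simplified compared to the linked post):
WITH digits AS (
    SELECT d FROM (VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) AS x(d)
), nums AS (
    -- 0..999 generated on the fly, no physical numbers table needed
    SELECT a.d * 100 + b.d * 10 + c.d AS n
    FROM digits a CROSS JOIN digits b CROSS JOIN digits c
)
SELECT n * 100 AS MinVal,
       n * 100 + 99 AS MaxVal
FROM nums
WHERE n < 10;  -- ten buckets of 100: 0-99, 100-199, ..., 900-999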
If you know all numbers are present in the source table, you can use a recursive CTE to generate the number ranges:
; with numbers as (
    select 0 as a, 99 as b
    union all
    select a + 100, b + 100
    from numbers
    where a < 900
)
select *
from numbers
If the source table is sparsely populated, you can limit it to numbers that are actually present, like:
... insert CTE from above here ...
select min(ot.NumberColumn),
       max(ot.NumberColumn)
from numbers
left join OtherTable ot
    on ot.NumberColumn between numbers.a and numbers.b
group by numbers.a
I have been having a play with a CTE since you posted this and came up with the following; I would be interested to hear if it works for you at all.
DECLARE @segment int = 100;

WITH _CTE (rowNum, value) AS
(
    SELECT ROW_NUMBER() OVER (ORDER BY col01) - 1, col01
    FROM dbo.testTable
)
SELECT rowNum / @segment AS Bucket,
       MIN(value) AS MinVal,
       MAX(value) AS MaxVal
FROM _CTE
GROUP BY rowNum / @segment
ORDER BY Bucket;
col01 in this case is the column that you want the min/max range values from, and dbo.testTable is the table that holds it.