MySQL: How to combine partitions of ranges into largest possible contiguous ranges - mysql

I've been trying to pull off a fairly complex SQL query (maybe simple?) to compress a table with repetitive information. I'm using MySQL 5.7.14 in SequelPro. I'm a novice SQL user with a basic understanding of joins, unions etc. I'm thinking a subquery with some group bys is needed for this one, but I don't know how to do it best.
A simple example of what I'm trying to do is illustrated by the table below:
table
For every repeated col_1 entry, I want to compress the rows into a single entry when the ranges set by col_2 and col_3 (start and end of a range, respectively) overlap. For col_4 and col_5, the max value among the entries falling in this range should be reported. With the example above, there are three ranges for 'a' in col_1 that overlap, and I want to compress these into one row with the min of col_2 and the max of col_3, along with the max of col_4 and of col_5. For 'b' in col_1, there are two ranges (31-50, 12-15) that do not overlap, so both rows would be returned as is. For 'c', it would return one row with range 100-300 and values 3, 2 for col_4 and col_5, respectively. The full result desired from this example is shown below:
query output
I should add that there are 'null' values in some places that should be treated as zeros.
Does anybody know the best and simplest way to do this?
Thank you in advance!!
Update: I've tried using the range setting query suggested but I get an error. The query is as follows:
WITH a AS (SELECT range
, lower(col_2) AS startdate
, max(upper(col_3)) OVER (ORDER BY range) AS `end`
FROM `combine`
)
, b AS (
SELECT *, lag(`end`) OVER (ORDER BY range) < `start` OR NULL AS step
FROM a
)
, c AS (
SELECT *, count(step) OVER (ORDER BY range) AS grp
FROM b
)
SELECT daterange(min(`start`), max(`end`)) AS range
FROM c
GROUP BY grp
ORDER BY 1;
The error I receive is:
You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'a AS (SELECT range
, lower(col_2) AS startdate
, max(upper(col_3)) OVE' at line 1

This is not trivial, but it can be done in one single query.
The hard part is combining a set of intervals into the largest possible contiguous intervals. Solutions are detailed in this post.
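For intuition, the merging step on its own can be sketched outside SQL. This is a minimal Python sketch, not the linked post's method verbatim: the function name merge_intervals is made up for illustration, and ranges that merely touch (one ends where the next starts) are assumed to count as contiguous.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) pairs into maximal ranges."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the current range: extend its end if needed.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            # There is a gap before this interval: open a new range.
            merged.append([start, end])
    return [tuple(r) for r in merged]


print(merge_intervals([(45, 50), (50, 60), (20, 45)]))
print(merge_intervals([(31, 50), (12, 15)]))
```

Sorting by start and tracking the running maximum end is the same idea the window-function queries express with MAX(...) OVER and lag.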
To get the result you are after, you now need to:
Calculate the largest possible contiguous intervals for each value in col1, using the query given in the link.
The result, based on your example values, would be:
col_1 lower_bound upper_bound
a 20 60
b 12 15
b 31 50
c 100 300
Associate one of those large intervals to each row in your_table. There can only be one such interval per row, so let's INNER JOIN:
SELECT your_table.*, large_intervals.lower_bound, large_intervals.upper_bound
FROM your_table
INNER JOIN (
-- the largest-contiguous-intervals query from step 1 goes here
) large_intervals
ON large_intervals.col1 = your_table.col1
AND large_intervals.lower_bound <= your_table.col2
AND large_intervals.upper_bound >= your_table.col3
You would get:
col1 col2 col3 col4 col5 lower_bound upper_bound
a 45 50 1 0 20 60
a 50 61 6 0 20 60
a 20 45 0 5 20 60
b 31 50 0 1 31 50
b 12 15 5 0 12 15
c 100 200 3 2 100 300
c 150 300 1 2 100 300
Then it's easy: just group by col1, lower_bound, upper_bound:
SELECT col1, lower_bound AS col2, upper_bound AS col3, MAX(col4) AS col4, MAX(col5) AS col5
FROM (query above) decorated_table
GROUP BY col1, lower_bound, upper_bound
And you get exactly the result you're after.
To get back to the hard part: the post mentioned above presents solutions for PostgreSQL. MySQL doesn't have range types, but the solution can be adapted. For instance, instead of lower(range), use the lower bound directly, i.e. col2. The solution also makes use of common table expressions (WITH) and window functions such as lag. MySQL supports both with the same syntax, but only from version 8.0 onward, so the queries below will not run on 5.7 (which is why the WITH query in the question is rejected with a syntax error). Also note that they use COALESCE(upper(range), 'infinity') to guard against unbounded ranges. Since your ranges are finite, you don't need to care about this; you can simply use the upper bound directly, i.e. col3. Here's the adaptation:
WITH a AS (
SELECT
col2,
col3,
col2 AS lower_bound,
MAX(col3) OVER (ORDER BY col2, col3) AS upper_bound
FROM combine
)
, b AS (
SELECT *, lag(upper_bound) OVER (ORDER BY col2, col3) < lower_bound OR NULL AS step
FROM a
)
, c AS (
SELECT *, count(step) OVER (ORDER BY col2, col3) AS grp
FROM b
)
SELECT
MIN(lower_bound) AS lower_bound,
MAX(upper_bound) AS upper_bound
FROM c
GROUP BY grp
ORDER BY 1;
This works for a single group. If you want to get the ranges by col1, you can tweak it like this:
WITH a AS (
SELECT
col1,
col2,
col3,
col2 AS lower_bound,
MAX(col3) OVER (PARTITION BY col1 ORDER BY col2, col3) AS upper_bound
FROM combine
)
, b AS (
SELECT *, lag(upper_bound) OVER (PARTITION BY col1 ORDER BY col2, col3) < lower_bound OR NULL AS step
FROM a
)
, c AS (
SELECT *, count(step) OVER (PARTITION BY col1 ORDER BY col2, col3) AS grp
FROM b
)
SELECT
col1,
MIN(lower_bound) AS lower_bound,
MAX(upper_bound) AS upper_bound
FROM c
GROUP BY col1, grp
ORDER BY col1, lower_bound;
Combining everything, we get the following, which (tested on the example you provided), returns exactly the output you expected:
WITH a AS (
SELECT
col1,
col2,
col3,
col2 AS lower_bound,
MAX(col3) OVER (PARTITION BY col1 ORDER BY col2, col3) AS upper_bound
FROM combine
)
, b AS (
SELECT *, lag(upper_bound) OVER (PARTITION BY col1 ORDER BY col2, col3) < lower_bound OR NULL AS step
FROM a
)
, c AS (
SELECT *, count(step) OVER (PARTITION BY col1 ORDER BY col2, col3) AS grp
FROM b
)
, large_intervals AS (
SELECT
col1,
MIN(lower_bound) AS lower_bound,
MAX(upper_bound) AS upper_bound
FROM c
GROUP BY col1, grp
ORDER BY 1
)
, combine_with_large_interval AS (
SELECT
combine.*,
large_intervals.lower_bound,
large_intervals.upper_bound
FROM combine
INNER JOIN large_intervals
ON large_intervals.col1 = combine.col1
AND large_intervals.lower_bound <= combine.col2
AND large_intervals.upper_bound >= combine.col3
)
SELECT
col1,
lower_bound AS col2,
upper_bound AS col3,
MAX(col4) AS col4,
MAX(col5) AS col5
FROM combine_with_large_interval
GROUP BY col1, lower_bound, upper_bound
ORDER BY col1, col2, col3;
Voilà!
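To try the combined query without a MySQL 8.0 server at hand, here is a hedged sketch that runs an equivalent query through Python's built-in sqlite3 module (SQLite 3.25 or later is assumed, for window functions). The sample rows are assumptions reconstructed from the tables above: the second 'a' range is taken to end at 60 so that it agrees with the 20 to 60 interval shown, and one col5 value is stored as NULL to exercise the "treat null as zero" requirement from the question. The step flag is written with CASE rather than the `... OR NULL` idiom, and COALESCE wraps the MAX calls; both are choices made for this sketch, and the helper name merged_ranges is made up.

```python
import sqlite3

# Assumed sample data (see the caveats in the text above).
ROWS = [
    ("a", 45, 50, 1, 0),
    ("a", 50, 60, 6, 0),
    ("a", 20, 45, 0, 5),
    ("b", 31, 50, 0, 1),
    ("b", 12, 15, 5, None),  # NULL that should be reported as 0
    ("c", 100, 200, 3, 2),
    ("c", 150, 300, 1, 2),
]

QUERY = """
WITH a AS (
    SELECT col1, col2, col3,
           col2 AS lower_bound,
           MAX(col3) OVER (PARTITION BY col1 ORDER BY col2, col3) AS upper_bound
    FROM combine
), b AS (
    -- step is 1 when a gap precedes this row, NULL otherwise
    SELECT *, CASE WHEN LAG(upper_bound) OVER (PARTITION BY col1 ORDER BY col2, col3)
                        < lower_bound THEN 1 END AS step
    FROM a
), c AS (
    -- the running count of gaps numbers the contiguous groups
    SELECT *, COUNT(step) OVER (PARTITION BY col1 ORDER BY col2, col3) AS grp
    FROM b
), large_intervals AS (
    SELECT col1, MIN(lower_bound) AS lower_bound, MAX(upper_bound) AS upper_bound
    FROM c
    GROUP BY col1, grp
)
SELECT t.col1, li.lower_bound, li.upper_bound,
       COALESCE(MAX(t.col4), 0), COALESCE(MAX(t.col5), 0)
FROM combine AS t
JOIN large_intervals AS li
  ON li.col1 = t.col1
 AND li.lower_bound <= t.col2
 AND li.upper_bound >= t.col3
GROUP BY t.col1, li.lower_bound, li.upper_bound
ORDER BY t.col1, li.lower_bound
"""


def merged_ranges(rows):
    """Load rows into an in-memory table and return the compressed ranges."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE combine (col1 TEXT, col2 INT, col3 INT, col4 INT, col5 INT)"
    )
    conn.executemany("INSERT INTO combine VALUES (?, ?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


for row in merged_ranges(ROWS):
    print(row)
```

MAX already skips NULLs, so COALESCE only matters when every value in a merged range is NULL, as for the 12 to 15 range of 'b' here.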

Related

Checking for a one-to-one correlation of data with MySQL

Suppose I have data containing two columns I am interested in. Ideally, the data in these is always in matching sets like this:
A 1
A 1
B 2
B 2
C 3
C 3
C 3
However, there might be bad data where the same value in one column has different values in the other column, like this:
D 4
D 5
or:
E 6
F 6
How do I isolate these bad rows, or at least show that some of them exist?
You can use exists:
select t.*
from t
where exists (select 1 from t t2 where t2.col1 = t.col1 and t2.col2 <> t.col2);
If you just want the col1 values that have non-matches, you can use aggregation:
select col1, min(col2), max(col2)
from t
group by col1
having min(col2) <> max(col2);
Using MIN and MAX as analytic functions we can try:
WITH cte AS (
SELECT t.*, MIN(col2) OVER (PARTITION BY col1) AS min_col2,
MAX(col2) OVER (PARTITION BY col1) AS max_col2
FROM yourTable t
)
SELECT col1, col2
FROM cte
WHERE min_col2 <> max_col2;
The above approach, while seemingly verbose, would return all offending rows.
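The queries above group by col1, so they flag the D 4 / D 5 case; the E 6 / F 6 case needs the same check in the other direction. As a rough sketch of checking both directions at once, run here through Python's sqlite3 module rather than MySQL (the table name t follows the first answer; the helper name find_bad_rows is made up):

```python
import sqlite3

PAIRS = [
    ("A", 1), ("A", 1), ("B", 2), ("B", 2), ("C", 3), ("C", 3), ("C", 3),
    ("D", 4), ("D", 5),   # one col1 value with two col2 values
    ("E", 6), ("F", 6),   # one col2 value with two col1 values
]

QUERY = """
SELECT DISTINCT col1, col2
FROM t
WHERE col1 IN (SELECT col1 FROM t GROUP BY col1 HAVING COUNT(DISTINCT col2) > 1)
   OR col2 IN (SELECT col2 FROM t GROUP BY col2 HAVING COUNT(DISTINCT col1) > 1)
ORDER BY col1, col2
"""


def find_bad_rows(pairs):
    """Return the distinct (col1, col2) pairs that break the one-to-one mapping."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (col1 TEXT, col2 INT)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", pairs)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


print(find_bad_rows(PAIRS))
```

The query uses only GROUP BY, HAVING, and COUNT(DISTINCT ...), so it should carry over to MySQL without window-function support.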

Averages by multiple columns separately

I have a collection table1 with the following columns:
id (INT)
col1 (VARCHAR)
col2 (VARCHAR)
value (INT)
I want to calculate the average separately by col1 and by col2 to have a response like this:
{
averageByCol1: {col1Value1: 23, col1Value2: 44},
averageByCol2: {col2Value1: 33, col2Value2: 91}
}
Tried to use multiple columns in GROUP BY, but this combines the columns:
SELECT
CONCAT(col1, col2, AVG(value))
FROM table1
GROUP BY col1, col2
Also tried with subquery but it gives me Subquery returns more than 1 row error:
SELECT
(SELECT
CONCAT(col1, AVG(value))
FROM table1
GROUP BY col1) AS col1Averages,
(SELECT
CONCAT(col2, AVG(value))
FROM table1
GROUP BY col2) AS col2Averages;
Using Mysql v5.5.
edit with sample data:
id col1 col2 value
1 v1 b1 34
2 v2 b1 65
3 v1 b1 87
4 v1 b2 78
5 v2 b2 78
6 v1 b2 12
Want average of value by v1, v2, b1, and b2 independently.
Use a UNION for each column you want to calculate an average for:
SELECT col1 as col_key, avg(value) as average
FROM table1
GROUP BY col1
UNION
SELECT col2, avg(value)
FROM table1
GROUP BY col2
This will work:
select avg(value),col1 from Table1 group by col1
union all
select avg(value),col2 from Table1 group by col2
SQL Fiddle: http://sqlfiddle.com/#!9/c1f111/5/0
If you want 2 queries for separate results:
SELECT col1, AVG(value) AS average1
from table1
GROUP BY col1
ORDER BY col1
and
SELECT col2, AVG(value) AS average2
from table1
GROUP BY col2
ORDER BY col2
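To get from the UNION result to the nested shape asked for in the question, one option is to tag each branch with the column it came from and regroup in application code. A small sketch under assumptions: it uses Python's sqlite3 module in place of MySQL, the literal tags 'col1' and 'col2' and the helper name averages_by_column are invented here, and the rows are the sample data from the question.

```python
import sqlite3

SAMPLE = [
    (1, "v1", "b1", 34), (2, "v2", "b1", 65), (3, "v1", "b1", 87),
    (4, "v1", "b2", 78), (5, "v2", "b2", 78), (6, "v1", "b2", 12),
]

QUERY = """
SELECT 'col1' AS source, col1 AS col_key, AVG(value) AS average
FROM table1 GROUP BY col1
UNION ALL
SELECT 'col2', col2, AVG(value)
FROM table1 GROUP BY col2
"""


def averages_by_column(rows):
    """Return {'averageByCol1': {...}, 'averageByCol2': {...}} for the rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE table1 (id INT, col1 TEXT, col2 TEXT, value INT)")
    conn.executemany("INSERT INTO table1 VALUES (?, ?, ?, ?)", rows)
    out = {"averageByCol1": {}, "averageByCol2": {}}
    for source, key, average in conn.execute(QUERY):
        # The tag tells us which GROUP BY branch produced the row.
        target = "averageByCol1" if source == "col1" else "averageByCol2"
        out[target][key] = average
    conn.close()
    return out


print(averages_by_column(SAMPLE))
```

UNION ALL is used because the two branches cannot produce duplicates of each other once tagged, so there is nothing for a plain UNION to deduplicate.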

Checking multiple columns for multiple value mySql

I have table with columns say:
col1, col2, col3
I want to find whether any of several values, say 22, 33, or 3, is in any of these columns.
For single value assume 22, I could have done:
SELECT * from table_name
WHERE 22 IN (col1, col2, col3)
How can I search for several values, e.g. 22 and 33, in columns col1, col2, col3?
Any help is highly appreciated.
Thanks!
SELECT * from table_name
WHERE col1 IN (1,2,3) or
col2 IN (1,2,3) or
col3 IN (1,2,3)
Maybe also like this
with search as
(
select 1 v
union all
select 2 v
union all
select 3 v
)
select distinct t.*
from table_name t
join search s on t.col1 = s.v or t.col2 = s.v or t.col3 = s.v
dbfiddle demo
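For the values from the question (22, 33, 3), the IN-per-column form can be tried like this. It is only a sketch: it runs on Python's sqlite3 module instead of MySQL, the sample rows are invented for illustration, and the helper name rows_containing is hypothetical.

```python
import sqlite3

# Invented sample rows (col1, col2, col3).
SAMPLE = [(22, 5, 7), (1, 2, 4), (9, 33, 8), (3, 3, 3), (10, 11, 12)]


def rows_containing(values, rows=SAMPLE):
    """Return rows where any of col1, col2, col3 equals any of the values."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE table_name (col1 INT, col2 INT, col3 INT)")
    conn.executemany("INSERT INTO table_name VALUES (?, ?, ?)", rows)
    marks = ", ".join("?" * len(values))  # one placeholder per search value
    query = (
        "SELECT col1, col2, col3 FROM table_name "
        f"WHERE col1 IN ({marks}) OR col2 IN ({marks}) OR col3 IN ({marks}) "
        "ORDER BY rowid"
    )
    # The value list is bound once per column.
    result = conn.execute(query, list(values) * 3).fetchall()
    conn.close()
    return result


print(rows_containing([22, 33, 3]))
```

Binding the values as parameters keeps the search list out of the SQL text, which matters if the values come from user input.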

MySQL Counting the number of occurrences of a value from a column in another column and storing in new column

How do I structure my query so I can count how many times each value in column 1 appears in column 2, and then store that result in a new column in the same table? (If a value is duplicated in the first column, I still want to store the same count in the new column.) For example, if I had a table like this:
COL1 COL2
1 2
1 4
2 1
3 1
4 1
4 2
The resulting table will look like this:
COL1 COL2 COL3
1 2 3
1 4 3
2 1 2
3 1 0
4 1 1
4 2 1
Any help is appreciated; I am new to SQL! Thanks in advance!
Select
col1,
col2,
COALESCE(col3,0) as col3
FROM
mytable
LEFT JOIN
( Select count(*) as col3, col2
from mytable
GROUP BY col2) as temp ON temp.col2 = mytable.col1
And if you want the update (thanks Thorsten Kettner):
UPDATE mytable
LEFT JOIN ( Select count(*) as col3, col2
from mytable
GROUP BY col2) as temp ON temp.col2 = mytable.col1
SET mytable.col3 = COALESCE(temp.col3,0)
You can easily count on-the-fly. Don't store this redundantly. This would only cause problems later.
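select
col1,
col2,
(
select count(*)
from mytable m -- "match" is a reserved word in MySQL, so use another alias
where m.col2 = mytable.col1
) as col3
from mytable;
If you think you must store it, here is the corresponding UPDATE statement. MySQL does not allow a subquery in SET to read from the table being updated (error 1093), so the counts are joined in as a derived table:
update mytable
left join
(
select col2, count(*) as cnt
from mytable
group by col2
) m on m.col2 = mytable.col1
set mytable.col3 = coalesce(m.cnt, 0);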
To do that, you can try:
SELECT COL1, COL2, (SELECT COUNT(*) FROM `tablename` AS t2
WHERE t2.COL2 = t1.COL1) AS COL3 FROM `tablename` AS t1
Enjoy :)
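As a quick check against the expected table in the question, the LEFT JOIN form can be run on the sample data. This is a sketch using Python's sqlite3 module as a stand-in for MySQL; the helper name count_matches and the alias cnt are made up.

```python
import sqlite3

SAMPLE = [(1, 2), (1, 4), (2, 1), (3, 1), (4, 1), (4, 2)]

QUERY = """
SELECT t.col1, t.col2, COALESCE(c.cnt, 0) AS col3
FROM mytable AS t
LEFT JOIN (
    -- how often each value occurs in col2
    SELECT col2, COUNT(*) AS cnt
    FROM mytable
    GROUP BY col2
) AS c ON c.col2 = t.col1
ORDER BY t.col1, t.col2
"""


def count_matches(rows):
    """For each row, count how often its col1 value occurs in col2."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mytable (col1 INT, col2 INT)")
    conn.executemany("INSERT INTO mytable VALUES (?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


for row in count_matches(SAMPLE):
    print(row)
```

The COALESCE is what turns the missing match for col1 = 3 into a 0 instead of NULL.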

How to get SUM of certain column without losing all rows?

CREATE TABLE tmp ( col1 int, col2 int );
INSERT INTO tmp VALUES (1,3), (2,5), (3,7);
SELECT col1, col2, SUM(col2) AS Total FROM tmp; -- ???
The SELECT statement leaves me with this data set:
col1 col2 Total
1 3 15
Is there a way to allow all the rows to appear without introducing a subquery, so that the result is this:
col1 col2 Total
1 3 15
2 5 15
3 7 15
You can use a cross join to avoid a subquery:
SELECT t1.col1, t1.col2, sum(t2.col2) sum_col2
from tmp t1
cross join tmp t2
group by 1, 2
See SQL fiddle
Note that this only works if combinations of col1 and col2 are unique.
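Where window functions are available (MySQL 8.0 and later, to my knowledge), an empty OVER () gives the grand total on every row with neither a subquery nor a join, and it does not depend on the rows being unique. A hedged sketch, run on Python's sqlite3 module (SQLite 3.25 or later assumed) rather than MySQL; the helper name rows_with_total is invented:

```python
import sqlite3


def rows_with_total(rows):
    """Return each (col1, col2) row with the overall SUM(col2) appended."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tmp (col1 INT, col2 INT)")
    conn.executemany("INSERT INTO tmp VALUES (?, ?)", rows)
    result = conn.execute(
        # OVER () makes the whole result set a single window.
        "SELECT col1, col2, SUM(col2) OVER () AS Total "
        "FROM tmp ORDER BY col1, col2"
    ).fetchall()
    conn.close()
    return result


print(rows_with_total([(1, 3), (2, 5), (3, 7)]))
print(rows_with_total([(1, 3), (1, 3)]))  # duplicate rows are kept as they are
```

Unlike the cross join with GROUP BY, duplicate (col1, col2) rows are neither collapsed nor double counted here.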