How to group values from a table if they're close? - mysql

Let's say I define 10 as being a close enough difference between two values, what I want is the average of all the values that are close enough to each other (or in other words, grouped by their closeness). So, if I have a table with the following values:
+-------+
| value |
+-------+
| 1 |
| 1 |
| 2 |
| 4 |
| 2 |
| 1 |
| 4 |
| 3 |
| 22 |
| 23 |
| 24 |
| 22 |
| 20 |
| 19 |
| 89 |
| 88 |
| 86 |
+-------+
I want a query that would output the following result:
+---------+
| 2.2500 |
| 21.6667 |
| 87.6667 |
+---------+
Here 2.2500 is the average of all the values ranging from 1 to 4, since they are 10 or less away from each other. In the same way, 21.6667 is the average of all the values ranging from 19 to 24, and 87.6667 is the average of all the values ranging from 86 to 89.
The difference threshold, currently 10, needs to be variable.

This isn't so bad. You want to emulate the lag() function in MySQL to determine whether a value is the start of a new set of rows. Then you want a cumulative sum of that flag to identify each group.
The code looks painful because in MySQL (before 8.0, which added window functions) you need to do this with correlated subqueries and join/aggregation rather than with ANSI standard functions, but this is what it looks like:
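select min(t.value) as value_min, max(t.value) as value_max, avg(t.value) as value_avg
from `table` t join
     -- one row per distinct value, numbered by how many group starts are <= it
     (select v.value, count(*) as GroupId
      from (select distinct value from `table`) v join
           (select value
            from (select distinct t1.value,
                         (select max(t2.value)
                          from `table` t2
                          where t2.value < t1.value
                         ) as prevValue
                  from `table` t1
                 ) s
            -- a group starts at the smallest value and after every gap of 10 or more
            where prevValue is null or value - prevValue >= 10
           ) GroupStarts
           on v.value >= GroupStarts.value
      group by v.value
     ) g
     -- join back so duplicate values are all counted in the average
     on t.value = g.value
group by g.GroupId;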
The subquery GroupStarts finds the break points, that is, the set of values that differ by 10 or more from the previous value. The next level uses join/aggregation to count the number of such break points at or before any given value. The outermost query then aggregates using this GroupId.
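On a database with window functions (MySQL 8.0 or later, for instance), the same lag-then-cumulative-sum idea can be written more directly. The following is only a sketch: it runs the query against an in-memory SQLite database through Python's sqlite3 module so that it is self-contained, and it assumes a SQLite build with window-function support (3.25 or later). The table name samples, the parameter name threshold, and the helper group_close_values are made up for the illustration.

```python
import sqlite3

VALUES = [1, 1, 2, 4, 2, 1, 4, 3, 22, 23, 24, 22, 20, 19, 89, 88, 86]

QUERY = """
SELECT MIN(value), MAX(value), AVG(value)
FROM (
    SELECT value,
           -- running count of group starts; equal values are peers and share it
           SUM(is_start) OVER (ORDER BY value) AS group_id
    FROM (
        SELECT value,
               -- flag a row when the gap to the previous value reaches the threshold
               CASE WHEN value - LAG(value) OVER (ORDER BY value) >= :threshold
                    THEN 1 ELSE 0 END AS is_start
        FROM samples
    ) AS flagged
) AS grouped
GROUP BY group_id
ORDER BY group_id
"""


def group_close_values(values, threshold):
    """Return (min, max, avg) per group of values separated by gaps < threshold."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE samples (value INTEGER)")  # hypothetical table name
    conn.executemany("INSERT INTO samples VALUES (?)", [(v,) for v in values])
    rows = conn.execute(QUERY, {"threshold": threshold}).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for low, high, avg in group_close_values(VALUES, 10):
        print(low, high, round(avg, 4))
```

Because the threshold is a bound parameter, it can be changed per call, which is what the question asks for.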

Create another column with a hash value for the field. This column will be used to test for equality. For example, with strings you may store a soundex; for numbers you may store the closest multiple of ten.
Otherwise, doing the calculation at query time will be much slower. You could also cross join the table to itself and group where the difference of the two fields is < 10.

I like the other user's suggestion to create a hash column. Joining the table to itself makes the number of compared rows grow quadratically, and should be avoided.
One other possibility is integer division. For example, select avg(val), val DIV 10 from myTable group by val DIV 10 produces a group value that is 0 for 0-9, 1 for 10-19, etc.
In SQL Server, val/10 does this for integer columns. In MySQL, / returns a decimal result, so use DIV (or FLOOR(val/10)) instead.
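To see what fixed buckets do with the sample data from the question, here is a small sketch in Python rather than SQL. The function name bucket_averages and the width parameter are invented for the illustration. Note that buckets are anchored at multiples of the width, so 19 and 20 land in different buckets even though they are only 1 apart.

```python
from collections import defaultdict

VALUES = [1, 1, 2, 4, 2, 1, 4, 3, 22, 23, 24, 22, 20, 19, 89, 88, 86]


def bucket_averages(values, width=10):
    """Average values per fixed-width bucket (bucket number = value // width)."""
    buckets = defaultdict(list)
    for v in values:
        buckets[v // width].append(v)  # same idea as GROUP BY val DIV 10
    return {b: sum(vs) / len(vs) for b, vs in sorted(buckets.items())}


if __name__ == "__main__":
    for bucket, avg in bucket_averages(VALUES).items():
        print(bucket, round(avg, 4))
```

This yields four groups for the sample data instead of the three the question expects, which is the trade-off for the simpler and faster query.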

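First, I would export the whole result to an array.
Afterwards, use a function like this one, which averages each run of elements_to_agroup consecutive elements:
function show(array, elements_to_agroup = 4)
{
    const averages = [];
    let sum = 0;
    for (let i = 0; i < array.length; i++)
    {
        sum += array[i];
        // Close a chunk after every elements_to_agroup items, or at the end of the array.
        if ((i + 1) % elements_to_agroup === 0 || i === array.length - 1)
        {
            // i % elements_to_agroup + 1 is the number of items in the current chunk.
            averages.push(sum / (i % elements_to_agroup + 1));
            sum = 0;
        }
    }
    return averages;
}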

Related

Calculating average based on distinct ID while preserving all the data in a table?

If I have data like so:
+------+----+-------+-------+
| year | id | value | group |
+------+----+-------+-------+
| 2019 | 1 | 10 | A |
| 2019 | 1 | 10 | B |
| 2019 | 2 | 20 | A |
| 2019 | 3 | 30 | A |
| 2019 | 2 | 20 | B |
| 2020 | 1 | 5 | A |
| 2020 | 1 | 5 | B |
| 2020 | 2 | 10 | A |
| 2020 | 3 | 15 | A |
| 2020 | 2 | 10 | B |
+------+----+-------+-------+
Is there a way to calculate the average value based on the distinct id while preserving all the data?
I need to do this because I will also have WHERE clause(s) to filter other columns in the table, but I also need to get an overall view of the data in the case the WHERE clause(s) are not added (these WHERE filters will be added by an automated software in the OUTERMOST query which I can't control).
The group column is an example.
For the above example, the results should be:
Overall --> 20 for 2019 and 10 for 2020
WHERE group = 'A' --> 20 for 2019 and 10 for 2020
WHERE group = 'B' --> 15 for 2019 and 7.5 for 2020
I tried to do the following:
SELECT
year,
AVG(IF(id = LAG(id) OVER (ORDER BY id), NULL, value)) AS avg
FROM table
WHERE group = 'A' -- this clause may or may not exist
GROUP BY year
Basically I was thinking that if I order by id and check the previous row to see if it has the same id, the value should be NULL and thus it would not be counted into the calculation, but unfortunately I can't put analytical functions inside aggregate functions.
While the data model is inappropriate and not normalized (you are storing values redundantly), the real problem is the late automated SQL injection (the optionally added where clause).
When a where clause gets added to your query, everything is fine, because the where clause properly restricts the rows to take into consideration (group A or B). When no where clause gets added, however, you would have to work on an aggregated data set (distinct year/id rows). The latter means an aggregation on an aggregation, which can be done with a subquery as was shown by DineshDB in an earlier answer. But here you have the problem that the where clause must work on the intermediate result (the subquery) and you say that your software adds the where clause to the main query instead.
The surprising solution to this is to use three aggregations. In the query below I am mixing MAX (first aggregation), AVG OVER (second aggregation), and DISTINCT (third aggregation), and the three can happily co-exist in one query. No subquery is needed.
SELECT DISTINCT
year,
AVG(MAX(value)) OVER (PARTITION BY year)
FROM yourtable
WHERE `group` = ... -- optional where clause
GROUP BY year, id
ORDER BY year;
Demo: https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=973ae4f260597392c55f260d3c260084
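The order of the aggregations can be traced by hand on the sample data. The sketch below mirrors the steps in plain Python rather than SQL: filter first (standing in for the optional where clause), collapse to one value per year/id, then average per year. The function name average_per_year and the tuple layout are assumptions made for the illustration.

```python
from collections import defaultdict

# (year, id, value, group) rows from the question
ROWS = [
    (2019, 1, 10, "A"), (2019, 1, 10, "B"), (2019, 2, 20, "A"),
    (2019, 3, 30, "A"), (2019, 2, 20, "B"),
    (2020, 1, 5, "A"), (2020, 1, 5, "B"), (2020, 2, 10, "A"),
    (2020, 3, 15, "A"), (2020, 2, 10, "B"),
]


def average_per_year(rows, group=None):
    """Average value per year, counting each id once."""
    # Optional filter, standing in for the WHERE clause.
    if group is not None:
        rows = [r for r in rows if r[3] == group]
    # First aggregation: one value per (year, id), like MAX(value) ... GROUP BY year, id.
    per_id = {}
    for year, id_, value, _ in rows:
        key = (year, id_)
        per_id[key] = max(value, per_id.get(key, value))
    # Second aggregation: average those per-id values within each year.
    by_year = defaultdict(list)
    for (year, _), value in per_id.items():
        by_year[year].append(value)
    return {year: sum(vs) / len(vs) for year, vs in sorted(by_year.items())}


if __name__ == "__main__":
    print(average_per_year(ROWS))
    print(average_per_year(ROWS, "A"))
    print(average_per_year(ROWS, "B"))
```

Deduplicating on the (year, id) key rather than on the value itself keeps the result correct even if two different ids happen to share a value.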
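The following query will give you the expected output, provided that no two ids share the same value within a year (AVG(DISTINCT ...) removes duplicate values, not duplicate ids).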
SELECT
`Year`,
AVG(DISTINCT `value`*1.0) `value`
FROM table
WHERE `group` = 'B' -- this clause is optional
GROUP BY `Year`;
Without the WHERE clause, the query returns the results below.
Year | Value
2019 | 20
2020 | 10

MySQL rollup when some columns include NULL values

I have a dataset where it is somewhat common for fields to have NULL as a valid value. This causes an issue when I want to use the ROLLUP operator in MySQL, as I can't distinguish between the NULL values it generates as part of its subtotals/totals and the actual NULL values in the data.
My current query is as follows:
SELECT
COALESCE(car_score, "Total") AS car_score,
COUNT(DISTINCT id) AS volume
FROM cars_table
GROUP BY
car_score ASC WITH ROLLUP;
This provides me with the following table:
cars_score | volume
---------------------------
Total | 500
1 | 100
2 | 200
3 | 300
4 | 400
5 | 500
Total | 2000
when I'd like it to be:
cars_score | volume
---------------------------
NULL | 500
1 | 100
2 | 200
3 | 300
4 | 400
5 | 500
Total | 2000
This is a simple example, and it becomes more frustrating once I have multiple dimensions for the ROLLUP. The reason I can't just change the NULL value before to something else is that I also need to be able to aggregate the data in other parts of the application, so having a proper NULL is important to me.
One option would be to wrap with a subquery which first replaces the actual NULL values which indicate missing data. Then, use COALESCE() as you were to replace the NULL from the rollup with the string "Total":
SELECT
COALESCE(t.car_score, 'Total') AS car_score,
COUNT(DISTINCT t.id) AS volume
FROM
(
SELECT COALESCE(car_score, 99) AS car_score, id
FROM cars_table
) t
GROUP BY t.car_score WITH ROLLUP
Here I have used 99 as a placeholder to indicate car scores which were missing. You can use any placeholder you want, other than NULL.

Get highest value of 2 columns and the full row information of the row that has the highest number

id | name | num1 | num2
0 | Johnny | 0 | 7
1 | Jason | 50 | 3
2 | John | 60 | 1
3 | Tom | 5 | 70
If I run the following query, I get the following result, as I should: SELECT MAX(GREATEST(num1, num2)) FROM data
What I need to get, however, is the full information from the row.
So since I got 70, I want to be able to access num1, name and id of that row.
Is this possible at all?
I did the following, SELECT * FROM data WHERE num1 = MAX(GREATEST(num1, num2)) OR num2 = MAX(GREATEST(num1, num2)); and got an error saying, "Invalid use of group function."
Am I missing something? Why wouldn't you just do this:
select * from data order by GREATEST(num1,num2) desc limit 0,1;
If there is a tie you will not have guaranteed repeatable behavior if you only order on greatest() of the two numbers. What if Smith's num1 is 70 and Jones' num2 is 70? Either one could come up as first each time the query is executed. If you want to have a repeatable selection, add another sort column that will guarantee a predictable ordering (for example, sorting on the primary key).
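The effect of the extra sort column can be illustrated outside the database. This sketch uses Python on the question's sample rows plus a hypothetical tied row for Smith; the function name top_row is invented. It corresponds in spirit to ORDER BY GREATEST(num1, num2) DESC, id ASC LIMIT 1.

```python
ROWS = [
    {"id": 0, "name": "Johnny", "num1": 0, "num2": 7},
    {"id": 1, "name": "Jason", "num1": 50, "num2": 3},
    {"id": 2, "name": "John", "num1": 60, "num2": 1},
    {"id": 3, "name": "Tom", "num1": 5, "num2": 70},
]


def top_row(rows):
    """Return the row with the largest of num1/num2; ties go to the lowest id."""
    # Negating the greatest value sorts it descending; id ascending breaks ties.
    return min(rows, key=lambda r: (-max(r["num1"], r["num2"]), r["id"]))


if __name__ == "__main__":
    # Hypothetical extra row that ties with Tom at 70.
    tied = ROWS + [{"id": 4, "name": "Smith", "num1": 70, "num2": 0}]
    print(top_row(tied)["name"])
    print(top_row(list(reversed(tied)))["name"])
```

Both calls pick the same row regardless of input order, which is the repeatability the tie-breaker is meant to provide.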

Selecting all fields with values greater than a current field value

I have a table that looks like this.
| path_id | step | point_id | delay_time | stand_time | access |
| 202 | 1 | 111 | 0 | 0 | 7 |
Which lists point_id's in step order.
E.g.: 111 - step 1, 181 - step 2, etc.
I need to write a query that would take point_id, select ALL values which have higher step within ALL path_id's that have a given value and return a grouped set of point_id's.
I am currently using this query
SELECT DISTINCT `pdb`.`point_id` AS `id`
FROM `path_detail` AS `pda` INNER JOIN
`path_detail` AS `pdb` ON pda.path_id = pdb.path_id
AND pda.step < pdb.step
WHERE
(pda.point_id = 111)
GROUP BY `pdb`.`path_id`
Which doesn't seem to work too reliably.
Any suggestions?
Try:
SELECT Distinct `pdb`.`point_id` AS `id`
FROM `path_detail` AS `pda`, `path_detail` AS `pdb`
WHERE
pda.point_id = 111
AND pda.path_id = pdb.path_id
AND pda.step < pdb.step
Order by `pdb`.`point_id` ASC

Totaling a column in SQL

I am trying to run a SQL Query in phpmyadmin that will total multiple rows and insert that total into the cell.
The file is sorted by date and then a few other specifics.
I am just having a hard time finding the syntax for it.
Basically I have one column called 'Points' and another called 'Total_Points'
It needs to look like this:
+--------+--------------+
| Points | Total Points |
+--------+--------------+
| 10 | 10 |
| 10 | 20 |
| 10 | 30 |
| 10 | 40 |
+--------+--------------+
And so on and so on.
It seems like there has to be something out there that would do this and I am just missing it
For a running sum you can use "window functions" like this:
create table tbl(ord int, points int);
insert into tbl values(1, 10),(2, 10), (3, 10), (4, 10);
select
*,
sum(points) over w total_points
from tbl
window w as (order by ord);
ord | points | total_points
-----+--------+--------------
1 | 10 | 10
2 | 10 | 20
3 | 10 | 30
4 | 10 | 40
I condensed your ORDER BY criteria into a single column, ord. You may of course replace it with your own.
Also, you did not specify a database vendor. My example runs on PostgreSQL; other SQL dialects may differ slightly.
There is a lot of discussion about cumulative sums in SQL.
You may want to look at this article. If speed isn't a big issue, you can also simply use:
select t1.id, t1.SomeNumt, SUM(t2.SomeNumt) as sum
from #t t1
inner join #t t2
on t1.id >= t2.id
group by t1.id, t1.SomeNumt
order by t1.id
The key issue is that SQL doesn't actually store rows in any order. The idea of a 'running total' assumes that the rows have an order, and by default, this isn't true. In SQL Server 2012, there is a function for this, but until people are using it, this is the best we can do.
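The dependence on an explicit order is easy to see in a sketch outside SQL. The Python below assumes rows arrive as hypothetical (ord, points) pairs in arbitrary order; the function name running_totals is invented. It sorts by the ordering key first and only then accumulates.

```python
from itertools import accumulate


def running_totals(rows):
    """Return (ord, points, total_points) tuples, accumulated in ord order."""
    # Without this sort, the totals would depend on whatever order the rows arrived in.
    ordered = sorted(rows, key=lambda r: r[0])
    totals = accumulate(points for _, points in ordered)
    return [(ord_, points, total) for (ord_, points), total in zip(ordered, totals)]


if __name__ == "__main__":
    for row in running_totals([(2, 10), (1, 10), (4, 10), (3, 10)]):
        print(row)
```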