Grouping by to find average differences for specific indexes in SQL - mysql

I have the following table:
person_index score year
3 76 2003
3 86 2004
3 86 2005
3 87 2006
4 55 2005
4 91 2006
I want to group by person_index, getting the average score difference between consecutive years, such that I end up with one row per person, indicating the average increase/decrease:
person_index avg(score_diff)
3 3.67
4 36
So for person with index 3 - there were changes over 3 years, one was 10pt, one was 0, and one was 1pt. Therefore, their average score_diff is 3.67.
EDIT: to clarify, scores can also decrease. And years aren't necessarily consecutive (one person might not get a score at a certain year, so could be 2013 followed by 2015).

The simplest way is to use LAG (MySQL 8.0+):
WITH cte AS (
SELECT *, score - LAG(score) OVER(PARTITION BY person_index ORDER BY year) AS diff
FROM tab
)
SELECT person_index, AVG(diff) AS avg_diff
FROM cte
GROUP BY person_index;
Output:
+---------------+----------+
| person_index | avg_diff |
+---------------+----------+
| 3 | 3.6667 |
| 4 | 36.0000 |
+---------------+----------+
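As a cross-check on these numbers, here is a minimal Python sketch (not part of the original answer) that mirrors what LAG does: sort each person's rows by year, subtract the previous score, and average the differences. The function name avg_consecutive_diff is an illustrative assumption. Because consecutive differences telescope, the result equals (last score - first score) / (number of rows - 1), whether or not the years are contiguous.

```python
from collections import defaultdict

# Rows from the question: (person_index, score, year)
rows = [
    (3, 76, 2003), (3, 86, 2004), (3, 86, 2005), (3, 87, 2006),
    (4, 55, 2005), (4, 91, 2006),
]

def avg_consecutive_diff(rows):
    """Average score change between successive rows per person, ordered by year."""
    by_person = defaultdict(list)
    for person, score, year in rows:
        by_person[person].append((year, score))
    result = {}
    for person, entries in by_person.items():
        scores = [score for _, score in sorted(entries)]
        diffs = [b - a for a, b in zip(scores, scores[1:])]
        # A person with a single row has no differences (SQL AVG would give NULL).
        result[person] = sum(diffs) / len(diffs) if diffs else None
    return result

averages = avg_consecutive_diff(rows)
print(averages)
```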

If the scores only increase -- as in your example -- you can simply do:
select person_index,
( max(score) - min(score) ) / nullif(count(*) - 1, 0) as avg_diff
from t
group by person_index;
If they do not only increase, it is a bit trickier because you have to calculate the first and last scores:
select p.person_index,
(tmax.score - tmin.score) / nullif(p.cnt - 1, 0) as avg_diff
from (select person_index, min(year) as miny, max(year) as maxy, count(*) as cnt
from t
group by person_index
) p join
t tmin
on tmin.person_index = p.person_index and tmin.year = p.miny join
t tmax
on tmax.person_index = p.person_index and tmax.year = p.maxy;

Related

Using mysql 8 window functions

My salary table looks like this,
employeeId Salary salaryEffectiveFrom
19966 10000.00 2022-07-01
19966 20000.00 2022-07-15
My role/grades table looks like this,
employeeId grade roleEffectiveFrom
19966 grade 3 2022-07-01
19966 grade 2 2022-07-10
I am trying to get the salary a grade is paid for by taking into account the effective date in both tables.
grade 3 is effective from 1-July-2022. grade 2 is effective from the 10th of July, implying grade 3 is effective till the 9th of July i.e. 9 days.
grade 2 is effective from 10-July-2022 onwards.
A salary of 10000 is effective from 1-July-2022 till 14-July-2022 as the salary of 20000 is effective from the 15th. Therefore grade 3 had a salary of 10000 for 9 days, grade 2 had a salary of 10000 for 5 days (10 July to 14 July), and grade 2 has a salary of 20000 from the 15th onwards. The role effectiveFrom date takes precedence over the salary effectiveFrom date.
This query,
SELECT er.employeeId,
es.salary,
`grade`,
date(er.effectiveFrom) roleEffectiveFrom,
date(es.effectiveFrom) salaryEffectiveFrom,
DATEDIFF(LEAST(COALESCE(LEAD(er.effectiveFrom)
OVER (PARTITION BY er.employeeId ORDER By er.effectiveFrom),
DATE_ADD(LAST_DAY(er.effectiveFrom),INTERVAL 1 DAY)),
DATE_ADD(LAST_DAY(er.effectiveFrom),INTERVAL 1 DAY)),
er.effectiveFrom) as '#Days' ,
ROUND((salary * 12) / 365, 2) dailyRate
FROM EmployeeRole er
join EmployeeSalary es ON (es.employeeId = er.employeeId)
and er.employeeId = 19966
;
gives me the result set shown below,
employeeId Salary grade roleEffectiveFrom salaryEffectiveFrom Days dailyRate
19966 10000.00 grade 3 2022-07-01 2022-07-01 0 328.77
19966 20000.00 grade 3 2022-07-01 2022-07-15 9 657.53
19966 10000.00 grade 2 2022-07-10 2022-07-01 0 328.77
19966 20000.00 grade 2 2022-07-10 2022-07-15 22 657.53
grade 3 is effective for 9 days in July, so I want to get the total salary for those 9 days using the daily rate column, 328.77 * 9 = 2958.93, as a separate column. I am unable to do so because I am getting the days on the wrong row, i.e. 9 should be the result for the first row.
Merge the dates from the two tables, apply LEAD to them, then use correlated subqueries:
with cte as
(
SELECT employeeid,effectivefrom from EMPLOYEEROLE
union
select employeeid,effectivefrom from employeesalary
)
,cte1 as
(select employeeid,effectivefrom,
coalesce(
date_sub(lead(effectivefrom) over (partition by employeeid order by effectivefrom),interval 1 day) ,
now()) nexteff
from cte
)
select *,
datediff(nexteff,effectivefrom) + 1 diff,
(select grade from employeerole e where e.effectivefrom <= cte1.effectivefrom order by e.effectivefrom desc limit 1) grade,
(select salary from employeesalary e where e.effectivefrom <= cte1.nexteff order by e.effectivefrom desc limit 1) salary
from cte1;
+------------+---------------------+---------------------+------+---------+--------+
| employeeid | effectivefrom | nexteff | diff | grade | salary |
+------------+---------------------+---------------------+------+---------+--------+
| 19966 | 2022-07-01 00:00:00 | 2022-07-09 00:00:00 | 9 | grade 3 | 10000 |
| 19966 | 2022-07-10 00:00:00 | 2022-07-14 00:00:00 | 5 | grade 2 | 10000 |
| 19966 | 2022-07-15 00:00:00 | 2022-10-08 08:51:49 | 86 | grade 2 | 20000 |
+------------+---------------------+---------------------+------+---------+--------+
3 rows in set (0.003 sec)
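To end the last period at the end of its month rather than at now(), replace now() with last_day() of the latest effective date:
with cte as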
(
SELECT employeeid,effectivefrom from EMPLOYEEROLE
union
select employeeid,effectivefrom from employeesalary
)
,cte1 as
(select cte.employeeid,effectivefrom,
coalesce(
date_sub(lead(effectivefrom) over (partition by employeeid order by effectivefrom),interval 1 day) ,
last_day(maxdt)) nexteff
from cte
JOIN (select cte.employeeid,max(effectivefrom) maxdt from cte group by employeeid) c1
on c1.employeeid = cte.employeeid
)
select *,
datediff(nexteff,effectivefrom) + 1 diff,
(select grade from employeerole e where e.effectivefrom <= cte1.effectivefrom order by e.effectivefrom desc limit 1) grade,
(select salary from employeesalary e where e.effectivefrom <= cte1.nexteff order by e.effectivefrom desc limit 1) salary
from cte1;
+------------+---------------------+---------------------+------+---------+--------+
| employeeid | effectivefrom | nexteff | diff | grade | salary |
+------------+---------------------+---------------------+------+---------+--------+
| 19966 | 2022-07-01 00:00:00 | 2022-07-09 00:00:00 | 9 | grade 3 | 10000 |
| 19966 | 2022-07-10 00:00:00 | 2022-07-14 00:00:00 | 5 | grade 2 | 10000 |
| 19966 | 2022-07-15 00:00:00 | 2022-07-31 00:00:00 | 17 | grade 2 | 20000 |
+------------+---------------------+---------------------+------+---------+--------+
3 rows in set (0.004 sec)
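The same period-splitting logic can be sketched outside SQL. The following Python is an illustration only, not the answerer's code: the names effective and periods are hypothetical, both lists are assumed to be sorted by date, and the end date of 31 July is taken from the second result set above.

```python
from bisect import bisect_right
from datetime import date, timedelta

# Data from the question, each list sorted by its effective-from date.
roles = [(date(2022, 7, 1), "grade 3"), (date(2022, 7, 10), "grade 2")]
salaries = [(date(2022, 7, 1), 10000.0), (date(2022, 7, 15), 20000.0)]

def effective(rows, day):
    """Value of the latest row whose start date is <= day, or None if there is none."""
    starts = [start for start, _ in rows]
    i = bisect_right(starts, day)
    return rows[i - 1][1] if i else None

def periods(roles, salaries, end):
    """Split the timeline at every change date (the UNION in the SQL)."""
    starts = sorted({d for d, _ in roles} | {d for d, _ in salaries})
    out = []
    for i, start in enumerate(starts):
        # Each period ends the day before the next change (the LEAD in the SQL).
        last = starts[i + 1] - timedelta(days=1) if i + 1 < len(starts) else end
        days = (last - start).days + 1
        out.append((start, last, days, effective(roles, start), effective(salaries, start)))
    return out

result = periods(roles, salaries, date(2022, 7, 31))
for row in result:
    print(row)
```

Each tuple holds the period start, period end, number of days, grade, and salary.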
I think if it were me, I'd generate a list containing an entry for each day with the effective grade and salary, and then just aggregate at the end. Take a look at this fiddle:
https://dbfiddle.uk/4t2RW2M2
I've started with the aggregate query, just so we can see the output, then I break out pieces of the query to show intermediate outputs. Here is the query generating the final output:
SELECT grade, gradeEffective, salary, salaryEffective,
min(dt) as startsOn, max(dt) as endsOn, count(*) as days,
dailyRate,
sum(dailyRate) as pay
FROM (
SELECT DISTINCT dt, grade, gradeEffective, salary, salaryEffective,
ROUND((salary * 12) / 365, 2) as dailyRate
FROM (
SELECT dts.dt,
first_value(r.grade) OVER w as grade,
first_value(r.effectiveFrom) OVER w as gradeEffective,
first_value(s.salary) OVER w as salary,
first_value(s.effectiveFrom) OVER w as salaryEffective
FROM (
WITH RECURSIVE dates(n) AS (SELECT 0 UNION SELECT n + 1 FROM dates WHERE n + 1 <= 30)
SELECT '2022-07-01' + INTERVAL n DAY as dt FROM dates
) dts
LEFT JOIN EmployeeSalary s ON dts.dt >= s.effectiveFrom
LEFT JOIN EmployeeRole r on dts.dt >= r.effectiveFrom
WINDOW w AS (
PARTITION BY dts.dt
ORDER BY r.effectiveFrom DESC, s.effectiveFrom DESC
ROWS UNBOUNDED PRECEDING
)
) z
) a GROUP BY grade, gradeEffective, salary, salaryEffective, dailyRate
ORDER BY min(dt);
Now, the first thing I've done is create a list of dates using a recursive CTE:
WITH RECURSIVE dates(n) AS (SELECT 0 UNION SELECT n + 1 FROM dates WHERE n + 1 <= 30)
SELECT '2022-07-01' + INTERVAL n DAY as dt FROM dates
which produces a list of dates from July 1st to July 31st.
Take that list of dates and left join both of your tables to it, like so:
SELECT *
FROM (
WITH RECURSIVE dates(n) AS (SELECT 0 UNION SELECT n + 1 FROM dates WHERE n + 1 <= 30)
SELECT '2022-07-01' + INTERVAL n DAY as dt FROM dates
) dts
LEFT JOIN EmployeeSalary s ON dts.dt >= s.effectiveFrom
LEFT JOIN EmployeeRole r on dts.dt >= r.effectiveFrom
with the dt greater than or equal to the effective dates. Notice that after the 9th you start to get duplicate rows for each date.
We'll create a window to get the first values for grade and salary for each date, and we'll order first by role effectiveFrom and then salary effectiveFrom, to fulfil your priority condition.
SELECT dts.dt,
first_value(r.grade) OVER w as grade,
first_value(r.effectiveFrom) OVER w as gradeEffective,
first_value(s.salary) OVER w as salary,
first_value(s.effectiveFrom) OVER w as salaryEffective
FROM (
WITH RECURSIVE dates(n) AS (SELECT 0 UNION SELECT n + 1 FROM dates WHERE n + 1 <= 30)
SELECT '2022-07-01' + INTERVAL n DAY as dt FROM dates
) dts
LEFT JOIN EmployeeSalary s ON dts.dt >= s.effectiveFrom
LEFT JOIN EmployeeRole r on dts.dt >= r.effectiveFrom
WINDOW w AS (
PARTITION BY dts.dt
ORDER BY r.effectiveFrom DESC, s.effectiveFrom DESC
ROWS UNBOUNDED PRECEDING
);
This is still going to leave us multiple entries for some dates, although they are duplicates, so let's use that output in a new query, using DISTINCT to leave us only one copy of each row and using the opportunity to add the daily rate field:
SELECT DISTINCT dt, grade, gradeEffective, salary, salaryEffective,
ROUND((salary * 12) / 365, 2) as dailyRate
FROM (
SELECT dts.dt,
first_value(r.grade) OVER w as grade,
first_value(r.effectiveFrom) OVER w as gradeEffective,
first_value(s.salary) OVER w as salary,
first_value(s.effectiveFrom) OVER w as salaryEffective
FROM (
WITH RECURSIVE dates(n) AS (SELECT 0 UNION SELECT n + 1 FROM dates WHERE n + 1 <= 30)
SELECT '2022-07-01' + INTERVAL n DAY as dt FROM dates
) dts
LEFT JOIN EmployeeSalary s ON dts.dt >= s.effectiveFrom
LEFT JOIN EmployeeRole r on dts.dt >= r.effectiveFrom
WINDOW w AS (
PARTITION BY dts.dt
ORDER BY r.effectiveFrom DESC, s.effectiveFrom DESC
ROWS UNBOUNDED PRECEDING
)
) z;
This produces the deduplicated daily data, and now all we have to do is use aggregation to pull out the sums for each combination of grade and salary, which is the query that I started off with.
Let me know if this is what you were looking for, or if anything is unclear.
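For readers who want to check the expected totals, the day-by-day idea can be sketched in Python. This is an illustration under assumptions, not the fiddle's code: the helper names latest and pay_by_period are hypothetical, the date range is fixed to July 2022 as in the recursive CTE, and every day is assumed to have both a grade and a salary in effect.

```python
from datetime import date, timedelta

# Data from the question, each list sorted by its effective-from date.
roles = [(date(2022, 7, 1), "grade 3"), (date(2022, 7, 10), "grade 2")]
salaries = [(date(2022, 7, 1), 10000.0), (date(2022, 7, 15), 20000.0)]

def latest(rows, day):
    """Value of the most recent row effective on or before `day` (None if none)."""
    current = None
    for start, value in rows:
        if start <= day:
            current = value
    return current

def pay_by_period(first_day, last_day):
    """Walk the calendar one day at a time and total days and pay per (grade, salary)."""
    totals = {}  # (grade, salary) -> [days, pay]; dicts preserve insertion order
    day = first_day
    while day <= last_day:
        grade, salary = latest(roles, day), latest(salaries, day)
        daily_rate = round(salary * 12 / 365, 2)  # same formula as the SQL
        entry = totals.setdefault((grade, salary), [0, 0.0])
        entry[0] += 1
        entry[1] += daily_rate
        day += timedelta(days=1)
    return {key: (days, round(pay, 2)) for key, (days, pay) in totals.items()}

summary = pay_by_period(date(2022, 7, 1), date(2022, 7, 31))
for key, value in summary.items():
    print(key, value)
```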
Since the start and end conditions weren't fleshed out in the question, I just created the date list arbitrarily. It's not difficult to generate the list based on the first effectiveFrom in both tables, and here is an example that runs from that start date until current:
WITH RECURSIVE dates(n) AS (
SELECT min(effectiveFrom) FROM (
select effectiveFrom from EmployeeRole UNION
select effectiveFrom from EmployeeSalary
) z
UNION SELECT n + INTERVAL 1 DAY FROM dates WHERE n <= now()
)
SELECT n as dt FROM dates
I also didn't handle multiple employees, since there was only one given and I would just be guessing at the shape of the actual data.
You can start by adding two new columns (i.e. tmpFrom and tmpTo), which should give the correct dates needed to calculate the 9 days.
SELECT
er.employeeId,
es.salary,
`grade`,
date(er.effectiveFrom) roleEffectiveFrom,
date(es.effectiveFrom) salaryEffectiveFrom,
DATEDIFF(LEAST(COALESCE(LEAD(er.effectiveFrom)
OVER (PARTITION BY er.employeeId ORDER By er.effectiveFrom),
DATE_ADD(LAST_DAY(er.effectiveFrom),INTERVAL 1 DAY)),
DATE_ADD(LAST_DAY(er.effectiveFrom),INTERVAL 1 DAY)),
er.effectiveFrom) as '#Days' ,
ROUND((salary * 12) / 365, 2) dailyRate,
date(er.effectiveFrom) tmpFrom,
(select e2.effectiveFrom
from EmployeeRole e2
where e2.employeeId = er.employeeId and e2.effectiveFrom > er.effectiveFrom
order by e2.effectiveFrom
limit 1) as tmpTo
FROM EmployeeRole er
join EmployeeSalary es ON (es.employeeId = er.employeeId)
and er.employeeId = 19966
order by er.effectiveFrom
;
In the above query I used a sub-select, which might hurt performance. You can study window functions and check whether one of them suits your needs better than this subquery.
It's up to you to calculate the number of days between those two columns, but you should also handle the NULL value, which should be the end of the month (but I am not sure if I remember your problem correctly...)

Get original RANK() value based on row create date

Using MariaDB and trying to see if I can pull the original rankings for each row of a table based on the create date.
For example, imagine a scores table that has different scores for different users and categories (lower score is better in this case)
| id | leaderboardId | userId | score | submittedAt         | rankAtSubmit |
| 9  | 15            | 555    | 50.5  | 2022-01-20 01:00:00 | 2            |
| 8  | 15            | 999    | 58.0  | 2022-01-19 01:00:00 | 3            |
| 7  | 15            | 999    | 59.1  | 2022-01-15 01:00:00 | 3            |
| 6  | 15            | 123    | 49.0  | 2022-01-12 01:00:00 | 1            |
| 5  | 15            | 222    | 51.0  | 2022-01-10 01:00:00 | 1            |
| 4  | 14            | 222    | 87.0  | 2022-01-09 01:00:00 | 1            |
| 5  | 15            | 555    | 51.0  | 2022-01-04 01:00:00 | 1            |
The "rankAtSubmit" column is what I'm trying to generate here if possible.
I want to take the best/smallest score of each user+leaderboard and determine what the rank of that score was when it was submitted.
My attempt at this failed because in MySQL you cannot reference outer level columns more than 1 level deep in a subquery resulting in an error trying to reference t.submittedAt in the following query:
SELECT *, (
SELECT ranking FROM (
SELECT id, RANK() OVER (PARTITION BY leaderboardId ORDER BY score ASC) ranking
FROM scores x
WHERE x.submittedAt <= t.submittedAt
GROUP BY userId, leaderboardId
) ranks
WHERE ranks.id = t.id
) rankAtSubmit
FROM scores t
Instead of using RANK(), I was able to accomplish this with a single subquery that counts the number of users who have a score that is lower than, and was submitted before, the given score.
SELECT id, userId, score, leaderboardId, submittedAt,
(
SELECT COUNT(DISTINCT userId) + 1
FROM scores t2
WHERE t2.userId <> t.userId AND
t2.leaderboardId = t.leaderboardId AND
t2.score < t.score AND
t2.submittedAt <= t.submittedAt
) AS rankAtSubmit
FROM scores t
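The counting idea can be illustrated in Python on a few of the question's rows. This is a sketch, not the poster's code: rank_at_submit is a hypothetical helper, and it assumes the count is over other users on the same leaderboard (a userId different from the row's own) whose lower score was submitted no later than the row being ranked.

```python
from datetime import datetime

# A few rows from the question: (id, leaderboardId, userId, score, submittedAt)
scores = [
    (9, 15, 555, 50.5, datetime(2022, 1, 20, 1)),
    (6, 15, 123, 49.0, datetime(2022, 1, 12, 1)),
    (5, 15, 222, 51.0, datetime(2022, 1, 10, 1)),
    (4, 14, 222, 87.0, datetime(2022, 1, 9, 1)),
    (5, 15, 555, 51.0, datetime(2022, 1, 4, 1)),
]

def rank_at_submit(row, all_rows):
    """1 + number of distinct other users who already had a lower score on the same board."""
    _, board, user, score, submitted = row
    better_users = {
        other_user
        for _, other_board, other_user, other_score, other_time in all_rows
        if other_board == board
        and other_user != user        # assumption: only other users are counted
        and other_score < score       # lower score is better
        and other_time <= submitted   # only scores that existed at submit time
    }
    return len(better_users) + 1

ranks = [rank_at_submit(row, scores) for row in scores]
print(ranks)
```

Using a set plays the role of COUNT(DISTINCT userId): a user with several earlier, lower scores is still counted once.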
What I understand from your question is you want to know the minimum and maximum rank of each user.
Here is the code
SELECT userId, leaderboardId, score, min(rankAtSubmit),max(rankAtSubmit)
FROM scores
group BY userId,
leaderboardId,
score;

Get Percentage of Last X entries in MySQL

I have 2 tables in MySQL (InnoDB). The first is an employee table. The other table is the expense table. For simplicity, the employee table contains just id and first_name. The expense table contains id, employee_id (foreign key), amount_spent, budget, and created_time. What I would like is a query that returns the percentage of their budget spent for the most recent X expenses they've registered.
So given the employee table:
| id | first_name
-------------------
1 alice
2 bob
3 mike
4 sally
and the expense table:
| id | employee_id | amount_spent | budget | created_time
----------------------------------------------------------
1 1 10 100 10/18
2 1 50 100 10/19
3 1 0 40 10/20
4 2 5 20 10/22
5 2 10 70 10/23
6 2 75 100 10/24
7 3 50 50 10/25
The query for the last 3 trips would return
|employee_id| first_name | percentage_spent |
--------------------------------------------
1 alice .2500 <----------(60/240)
2 bob .4736 <----------(90/190)
3 mike 1.000 <----------(50/50)
The query for the last 2 trips would return
|employee_id| first_name | percentage_spent |
--------------------------------------------
1 alice .3571 <----------(50/140)
2 bob .5000 <----------(85/170)
3 mike 1.000 <----------(50/50)
It would be nice if the query, as noted above, did not return any employees who have not registered any expenses (sally). Thanks in advance!!
I'd advise converting the datatype of created_time to DATETIME in order to get accurate results.
For now, I've assumed that the most recent id indicates the most recent expense, as the sample data suggests.
The following query should work (not tested, though):
select t2.employee_id, t1.first_name,
sum(t2.amount_spent) / sum(t2.budget) as percentage_spent
from employee t1
inner join
(select temp.* from
(select e.*, @num := if(@type = e.employee_id, @num + 1, 1) as row_num,
@type := e.employee_id as dummy
from expense e
cross join (select @num := 0, @type := null) vars -- initialise the user variables
order by e.employee_id, e.id desc) temp
where temp.row_num <= 3 -- write the value of n here
) t2
on t1.id = t2.employee_id
group by t2.employee_id, t1.first_name
;
Feel free to ask doubt(s), if you've any.
Hope it helps!
If you are using MySQL 8.0.2 or higher, you can use window functions for this.
SELECT employee_id, first_name, sliding_sum_spent / sliding_sum_budget AS percentage_spent
FROM
(
SELECT employee_id, first_name,
SUM(amount_spent) OVER (PARTITION BY employee_id
ORDER BY created_time
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS sliding_sum_spent, -- last 3 rows; use X - 1 for the last X
SUM(budget) OVER (PARTITION BY employee_id
ORDER BY created_time
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS sliding_sum_budget,
ROW_NUMBER() OVER (PARTITION BY employee_id
ORDER BY created_time DESC) rn
FROM expense
JOIN employee ON expense.employee_id = employee.id
) t
WHERE t.rn = 1;
As mentioned by Harshil, ordering rows by created_time may be a problem; therefore, it would be better to store it as a DATE data type.
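To check the expected figures from the question, here is a small Python sketch of the "last X per employee" calculation. It is illustrative only: percentage_spent is a hypothetical helper, and, like the first answer, it assumes a higher id means a more recent expense.

```python
from collections import defaultdict

employees = {1: "alice", 2: "bob", 3: "mike", 4: "sally"}
# Rows from the question: (id, employee_id, amount_spent, budget)
expenses = [
    (1, 1, 10, 100), (2, 1, 50, 100), (3, 1, 0, 40),
    (4, 2, 5, 20), (5, 2, 10, 70), (6, 2, 75, 100),
    (7, 3, 50, 50),
]

def percentage_spent(last_n):
    """Sum of amount_spent over sum of budget for each employee's last_n expenses."""
    by_employee = defaultdict(list)
    # Newest first (assumption: id order matches time order), keep at most last_n rows.
    for _, employee_id, spent, budget in sorted(expenses, reverse=True):
        if len(by_employee[employee_id]) < last_n:
            by_employee[employee_id].append((spent, budget))
    # Employees without expenses never enter by_employee, so they are left out.
    return {
        employees[emp]: sum(s for s, _ in rows) / sum(b for _, b in rows)
        for emp, rows in sorted(by_employee.items())
    }

last_three = percentage_spent(3)
last_two = percentage_spent(2)
print(last_three)
print(last_two)
```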

Is there any fast way to query average with itself excluded in mysql

For below table name: team_score
----------------------------
Team score date
----------------------------
A 1 2017-07-01
B 2 2017-07-02
A 3 2017-07-02
B 4 2017-07-01
C 5 2017-07-02
C 6 2017-07-01
to get this table
-------------------------------------
team avg avg_excluding_itself
-------------------------------------
A 2.0 4.25
B 3.0 3.75
C 5.5 2.50
what will be the most efficient way?
The query below will not work as it is too resource-intensive. Imagine the table is 100GB in size.
select a.team, avg(a.score) as avg, avg(b.score) as avg_excluding_itself
from team_score a join team_score b on a.team <> b.team group by a.team
Two principles:
An average is a sum divided by a count.
"Excluding" can be computed by taking the entire total, minus what is to be excluded.
Yield:
SELECT
team,
ROUND(sum_me / count_me, 1) AS "Team's avg",
ROUND((sum_all - sum_me) / (count_all - count_me), 2) AS "Avg of others"
FROM ( SELECT team,
SUM(score) AS sum_me,
COUNT(*) AS count_me
FROM team_score
GROUP BY team ) AS me
JOIN ( SELECT SUM(score) AS sum_all,
COUNT(*) AS count_all
FROM team_score ) AS x -- only one row
GROUP BY team;
There will be two table scans, one for each subquery; this should not be too inefficient.
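The two principles can be checked against the question's sample data with a short Python sketch. The function name averages_excluding_self is an illustrative assumption, and the sketch assumes at least two teams (otherwise the "others" count would be zero).

```python
from collections import defaultdict

# Rows from the question: (team, score)
rows = [("A", 1), ("B", 2), ("A", 3), ("B", 4), ("C", 5), ("C", 6)]

def averages_excluding_self(rows):
    """Return {team: (own average, average of all other teams' rows)}."""
    sum_me = defaultdict(int)
    count_me = defaultdict(int)
    for team, score in rows:          # one pass for the per-team totals
        sum_me[team] += score
        count_me[team] += 1
    sum_all = sum(sum_me.values())    # grand totals
    count_all = sum(count_me.values())
    return {
        team: (
            sum_me[team] / count_me[team],
            (sum_all - sum_me[team]) / (count_all - count_me[team]),
        )
        for team in sorted(sum_me)
    }

result = averages_excluding_self(rows)
print(result)
```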

SQL query with different averages for different columns for the same data

The SQL challenge I'm working on is to build a query to display the name of an individual and their average performance over 10 iterations of data in one column and their average over 50 iterations of data in the next column. Grouped by name of course. The iterations progress in order therefore the average of the past 10 for an individual would be an average score of the 10 highest ID numbers for that individual. The raw dataset looks like this:
ID, Name, Score
1, Joe, 10
2, Bob, 13
3, Joe, 9
4, Kim, 6
5, Rob, 8
6, Han, 9
7, Kim, 12
There are about 1000 rows like this with about 50 names. The end goal is to run a query that returns something like this:
Name, AvgPast10, AvgPast50
Bob, 8, 10
Joe, 7, 9
Kim, 6, 10
Han, 9, 6
Rob, 7, 5
When I tried to do this I realized that there might be different ways of doing it: maybe a self join, perhaps nested select statements. I tried and realized that I was getting in over my head. Also, my boss is a real stickler for query optimization. For some reason he despises nested select statements. If I need them then I had better have a compelling reason, or at least have some idea about how optimization was built into the query.
Admittedly this one uses a nested select (or a subquery):
SELECT Name, AVG(CASE WHEN rn <= 10 THEN Score END) AvgPast10,
AVG(CASE WHEN rn <= 50 THEN Score END) AvgPast50
FROM (
SELECT Name,
@rank := IF(@Name = Name, @rank + 1, 1) AS rn, -- RANK is a reserved word in MySQL 8.0.2+
@Name := Name AS dummy, Score
FROM tbl
ORDER BY Name, ID DESC
) A
GROUP BY Name
See my Demo that uses Past 3 and Past 5 for simplicity.
Your sample is very small so I have used 2 and 50 below, but hopefully the process is clear regardless of numbers or how many averages
| NAME | AVG_2 | AVG_50 |
|------|-------|--------|
| Bob | 13 | 13 |
| Han | 9 | 9 |
| Joe | 9.5 | 9.5 |
| Kim | 9 | 9 |
| Rob | 8 | 8 |
SELECT
name
, sum_2 / (count_2 * 1.0) AS avg_2
, sum_50 / (count_50 * 1.0) AS avg_50
FROM (
SELECT
name
, COUNT(CASE
WHEN rn <= 2 THEN score END) count_2
, SUM(CASE
WHEN rn <= 2 THEN score END) sum_2
, COUNT(CASE
WHEN rn <= 50 THEN score END) count_50
, SUM(CASE
WHEN rn <= 50 THEN score END) sum_50
FROM (
SELECT
*
, ROW_NUMBER() OVER (PARTITION BY name ORDER BY ID DESC) AS rn
FROM Scores
) x
GROUP BY
name
) y
ORDER BY
name
I wasn't sure what you wanted to do if the number of observations is less than the quantity required (e.g. a count of 20 when the average is over 50); I have used the actual count in this example.
see: http://sqlfiddle.com/#!3/84cf6/2