calculating median for data mysql - mysql

I am trying to calculate median of time spent by people on a specific category. The whole dataset I have is around 500k rows but I have tried to summarize a snippet of it below
person  category      time spent (in mins)
roger   dota          20
jim     dota          50
joe     call of duty  5
jim     fallout       25
kathy   GTA           40
alicia  fallout       100
I have tried to use the query below, but I am getting nowhere.
SELECT x1.person, x1.time spent
from data x1, data x2
GROUP BY x1.val
HAVING SUM(SIGN(1-SIGN(x2.val-x1.val))) = (COUNT(*)+1)/2

A self-join on 500,000 rows is likely to be expensive. Why not just enumerate the rows and grab the one in the middle?
select d.*
from (select d.*, (@rn := @rn + 1) as rn
      from data d cross join
           (select @rn := 0) params
      order by d.val
     ) d
where 2*rn in (@rn, @rn + 1);
The unusual where clause picks the row in the middle. With an even number of rows there are two middle rows, so the result is only an approximation: it returns the lower of the two. Because you want the actual row values, you need this approximation. The usual calculation of the median itself would be:
select avg(d.val)
from (select d.*, (@rn := @rn + 1) as rn
      from data d cross join
           (select @rn := 0) params
      order by d.val
     ) d
where 2*rn in (@rn, @rn + 1, @rn + 2);
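The row-number filters can be checked outside the database. The following is a minimal Python sketch, not part of the original answer: the function names are made up for illustration, and it assumes the averaging filter is `2*rn in (n, n + 1, n + 2)`, where `n` is the total row count. It uses the six "time spent" values from the question.

```python
import statistics

def middle_row(values):
    """Value kept by the filter 2*rn in (n, n + 1): the (lower) middle row."""
    ordered = sorted(values)
    n = len(ordered)
    # rn is the 1-based row number after sorting, as in the SQL subquery
    kept = [v for rn, v in enumerate(ordered, start=1) if 2 * rn in (n, n + 1)]
    return kept[0]

def median_by_row_number(values):
    """Average of the rows kept by the filter 2*rn in (n, n + 1, n + 2)."""
    ordered = sorted(values)
    n = len(ordered)
    kept = [v for rn, v in enumerate(ordered, start=1)
            if 2 * rn in (n, n + 1, n + 2)]
    return sum(kept) / len(kept)

times = [20, 50, 5, 25, 40, 100]  # "time spent" column from the question
print(middle_row(times))                                        # 25
print(median_by_row_number(times))                              # 32.5
print(median_by_row_number(times) == statistics.median(times))  # True
```

With an odd row count both filters keep the same single row; with an even count the first keeps the lower middle row and the second averages the two middle rows.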
EDIT:
The same logic works per person as well, but with a bit more logic to get the overall counts:
select d.person, avg(d.val) as median
from (select d.*,
             (@rn := if(@p = person, @rn + 1,
                        if(@p := person, 1, 1)
                       )
             ) as rn
      from data d cross join
           (select @rn := 0, @p := '') params
      order by person, d.val
     ) d join
     (select person, count(*) as cnt
      from data
      group by person
     ) p
     on d.person = p.person
where 2*rn in (p.cnt, p.cnt + 1, p.cnt + 2)
group by d.person;
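The per-group version follows the same idea: number the rows within each group, then keep the middle row or rows using the group's own count. Below is a hedged Python sketch of that logic. The function name is hypothetical, and it groups by category (as the question asks) rather than by person, using the sample rows from the question.

```python
from collections import defaultdict

def median_per_group(rows):
    """rows: iterable of (group, value) pairs; returns {group: median}."""
    grouped = defaultdict(list)
    for group, value in rows:
        grouped[group].append(value)
    medians = {}
    for group, values in grouped.items():
        values.sort()
        cnt = len(values)  # plays the role of the per-group count subquery
        # rn restarts at 1 for every group, like the @rn reset in the SQL
        kept = [v for rn, v in enumerate(values, start=1)
                if 2 * rn in (cnt, cnt + 1, cnt + 2)]
        medians[group] = sum(kept) / len(kept)
    return medians

rows = [("dota", 20), ("dota", 50), ("call of duty", 5),
        ("fallout", 25), ("GTA", 40), ("fallout", 100)]
print(median_per_group(rows))
# {'dota': 35.0, 'call of duty': 5.0, 'fallout': 62.5, 'GTA': 40.0}
```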

Related

mysql / sql: how to delete all rows except the Nth last per user?

I have a message (id, userid, message) table that grows rapidly.
I would like to delete all messages per user except his last 30
ex:
if user1 has 100 messages, we will delete the first 70,
if user2 has 40 messages, we will delete the first 10,
if userN has 10 messages, no action is taken
Is there a way to do it with a single SQL statement?
My idea for now is to loop in PHP and run N SQL queries, which is very slow for N users.
MySQL (pre 8.0) doesn't have a really convenient way to do this. One method uses variables to enumerate the values:
select m.*,
       (@rn := if(@u = userid, @rn + 1,
                  if(@u := userid, 1, 1)
                 )
       ) as seqnum
from (select m.*
      from messages m
      order by userid, id desc
     ) m cross join
     (select @u := -1, @rn := 0) params;
You can turn this into a delete using join:
delete m
from messages m join
     (select m.*,
             (@rn := if(@u = userid, @rn + 1,
                        if(@u := userid, 1, 1)
                       )
             ) as seqnum
      from (select m.*
            from messages m
            order by userid, id desc
           ) m cross join
           (select @u := -1, @rn := 0) params
     ) mm
     on m.id = mm.id
where seqnum > 30;
As I say in a comment, I don't think this is a good solution for a real-world problem. The history of messages is useful and there are probably other ways to achieve the performance you want. The difference between 30 messages for a user and 70 messages for a user should not have that much of an effect on performance, in a tuned system.
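To see which rows the `seqnum > 30` filter targets, here is a small Python sketch of the same selection. It is illustrative only: the function name is made up, and it assumes (as the SQL ordering by `id desc` does) that a larger id means a newer message.

```python
from collections import defaultdict

def ids_to_delete(messages, keep=30):
    """messages: iterable of (id, userid); returns ids outside each user's newest `keep`."""
    by_user = defaultdict(list)
    for msg_id, userid in messages:
        by_user[userid].append(msg_id)
    doomed = []
    for ids in by_user.values():
        ids.sort(reverse=True)     # newest first, like ORDER BY id DESC
        doomed.extend(ids[keep:])  # rows whose sequence number exceeds `keep`
    return sorted(doomed)

# user 1 has 100 messages, user 2 has 40, user 3 has 10
messages = ([(i, 1) for i in range(1, 101)]
            + [(i, 2) for i in range(101, 141)]
            + [(i, 3) for i in range(141, 151)])
doomed = ids_to_delete(messages)
print(len(doomed))  # 80: the oldest 70 of user 1 and the oldest 10 of user 2
```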
SET @row_number = 0, @userid = NULL;
DELETE FROM MESSAGE
WHERE ID IN
      (SELECT ID
       FROM (SELECT ID,
                    @row_number := CASE
                                       WHEN @userid = userid THEN @row_number + 1
                                       ELSE 1
                                   END AS num,
                    @userid := userid AS userid
             FROM MESSAGE
             ORDER BY userid, ID DESC) A -- newest message of each user gets num = 1
       WHERE num > 30);

Top 20 percent by id - MySQL

I am using a modified version of a query similar to the one in another question here: Convert SQL Server query to MySQL
Select *
from
(
SELECT tbl.*, @counter := @counter +1 counter
FROM (select @counter:=0) initvar, tbl
Where client_id = 55
ORDER BY ordcolumn
) X
where counter >= (80/100 * @counter)
ORDER BY ordcolumn;
tbl.* contains the field 'client_id' and I am attempting to get the top 20% of the records for each client_id in a single statement. Right now if I feed it a single client_id in the where statement it gives me the correct results, however if I feed it multiple client_id's it simply takes the top 20% of the combined recordset instead of doing each client_id individually.
I'm aware of how to do this in most databases, but the logic in MySQL is eluding me. I get the feeling it involves some ranking and partitioning.
Sample data is pretty straightforward.
Client_id rate
1 1
1 2
1 3
(etc to rate = 100)
2 1
2 2
2 3
(etc to rate = 100)
Actual values aren't that clean, but it works.
As an added bonus...there is also a date field associated to these records and 1 to 100 exists for this client for multiple dates. I need to grab the top 20% of records for each client_id, year(date),month(date)
You need to do the enumeration for each client:
SELECT t.*
FROM (SELECT tbl.*,
             (@rn := if(@c = client_id, @rn + 1,
                        if(@c := client_id, 1, 1)
                       )
             ) as rn
      FROM (select @c := -1, @rn := 0) initvar CROSS JOIN tbl
      ORDER BY client_id, ordcolumn
     ) t join
     (SELECT client_id, COUNT(*) as cnt
      FROM tbl
      GROUP BY client_id
     ) tt
     on t.client_id = tt.client_id
where t.rn >= (80/100 * tt.cnt)
ORDER BY t.client_id, t.ordcolumn;
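The per-client cut-off can be sketched in Python to check which rows survive. This is only an illustration with a hypothetical function name; it assumes rows are ordered by `rate` within each client and mirrors the comparison `rn >= 80/100 * cnt` using integer arithmetic.

```python
from collections import defaultdict

def top_percent_per_client(rows, percent=20):
    """rows: iterable of (client_id, rate); returns {client_id: kept rates}."""
    by_client = defaultdict(list)
    for client_id, rate in rows:
        by_client[client_id].append(rate)
    result = {}
    for client_id, rates in by_client.items():
        rates.sort()
        cnt = len(rates)  # per-client count, like the grouped subquery
        # rn restarts at 1 for each client; integer form of rn >= (100 - percent)/100 * cnt
        result[client_id] = [r for rn, r in enumerate(rates, start=1)
                             if rn * 100 >= (100 - percent) * cnt]
    return result

rows = [(c, r) for c in (1, 2) for r in range(1, 101)]  # the sample data
top = top_percent_per_client(rows)
print(top[1][0], top[1][-1], len(top[1]))  # 80 100 21
```

Note that with 100 rows per client the inclusive comparison keeps rows 80 through 100, which is 21 rows; a strict `>` would keep exactly 20.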
Using Gordon's answer as a starting point, I think this might be closer to what you need.
SELECT t.*
, (@counter := @counter+1) AS overallRow
, (@clientRow := if(@prevClient = t.client_id, @clientRow + 1,
if(@prevClient := t.client_id, 1, 1) -- This just updates @prevClient without creating an extra field, though it makes it a little harder to read
)
) AS clientRow
-- Alternatively (replacing everything done in clientRow above):
-- , @clientRow := if(@prevClient = t.client_id, @clientRow + 1, 1) AS clientRow
-- , @prevClient := t.client_id AS extraField
-- This may be more reliable as well; I am not sure if the order
-- of evaluation of IF(,,) is reliable enough to guarantee
-- no side effects in the non-"alternatively" clientRow calculation.
FROM tbl AS t
INNER JOIN (
SELECT client_id, COUNT(*) AS c
FROM tbl
GROUP BY client_id
) AS cc ON t.client_id = cc.client_id
INNER JOIN (select @prevClient := -1, @clientRow := 0, @counter := 0) AS initvar ON 1 = 1
WHERE t.client_id = 55
HAVING clientRow * 5 < cc.c -- You can use a HAVING without a GROUP BY in MySQL
-- (note that clientRow is derived, so you cannot use it in the `WHERE`)
ORDER BY t.client_id, t.ordcolumn
;

MySQL get rank from particular row ID

I have a list of hospitals whose average ratings are already calculated. Now I want to calculate a rank for each hospital according to its average rating, using the following query:
SELECT name,
hospitalID,
currentAvgRating,
@curRank := @curRank + 1 AS rank
FROM hospitals h, (SELECT @curRank := 0) r
ORDER BY currentAvgRating DESC
The above query works when I want to see all hospitals in the table, but when I apply a WHERE clause like the one below, the result is wrong, because the rank then reflects the row's position within the filtered result.
SELECT name,
hospitalID,
currentAvgRating,
@curRank := @curRank + 1 AS rank
FROM hospitals h, (SELECT @curRank := 0) r
WHERE hospitalID = '453085'
ORDER BY currentAvgRating DESC
Is there any way to get correct result when we apply where clause?
If you follow what you just found out to its logical conclusion ("when there is only one list item, it cannot be ordered"), you will see that you NEED to select ALL rows. There is nothing wrong with that: you can pack them into a subselect (which isn't even an expensive one) and apply the WHERE to that:
SELECT * FROM (
SELECT name,
hospitalID,
currentAvgRating,
@curRank := @curRank + 1 AS rank
FROM hospitals h, (SELECT @curRank := 0) r
ORDER BY currentAvgRating DESC
) toplist
WHERE toplist.hospitalID = 453085
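The difference between ranking before filtering and filtering before ranking can be shown with a few lines of Python. This is a sketch with invented sample ratings and a hypothetical function name, not data from the question.

```python
def rank_of(hospitals, hospital_id):
    """hospitals: list of (hospitalID, currentAvgRating); returns the 1-based rank or None."""
    ordered = sorted(hospitals, key=lambda h: h[1], reverse=True)
    for rank, (hid, _rating) in enumerate(ordered, start=1):
        if hid == hospital_id:
            return rank
    return None

# Hypothetical ratings, for illustration only
hospitals = [(101, 4.5), (453085, 3.9), (202, 4.8), (303, 2.1)]

# Rank over all rows, then pick the hospital (the subquery approach)
print(rank_of(hospitals, 453085))  # 3

# Filter first, then rank: the only remaining row is always rank 1
only_one = [h for h in hospitals if h[0] == 453085]
print(rank_of(only_one, 453085))   # 1
```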
Wrap in a subquery.
SELECT * FROM (
SELECT name,
hospitalID,
currentAvgRating,
@curRank := @curRank + 1 AS rank
FROM hospitals h, (SELECT @curRank := 0) r
ORDER BY currentAvgRating DESC
) ranked -- MySQL requires an alias on every derived table
WHERE hospitalID = '453085'

MySQL Query get the last N rows per Group

Suppose that I have a database which contains the following columns:
VehicleID|timestamp|lat|lon|
The same VehicleId may appear multiple times, each with a different timestamp, so (VehicleId, Timestamp) is the primary key.
Now I would like to get the last N measurements per VehicleId, or the first N measurements per VehicleId.
How can I list the last N tuples per VehicleId according to an ordering column (in our case, timestamp)?
Example:
|VehicleId|Timestamp|
1|1
1|2
1|3
2|1
2|2
2|3
5|5
5|6
5|7
In MySQL, this is most easily done using variables:
select t.*
from (select t.*,
             (@rn := if(@v = VehicleId, @rn + 1,
                        if(@v := VehicleId, 1, 1)
                       )
             ) as rn
      from table t cross join
           (select @v := -1, @rn := 0) params
      order by VehicleId, timestamp desc
     ) t
where rn <= 3;

SQL - MySQL - average group by and limit problems

I am collecting data from various remote sensors that send their data every so many seconds. I record the name of the remote sensor and the time difference since the last time I received data from that instrument. The data for each instrument comes in a random order and not at set intervals.
The table looks like:
id instname timediff
1 inst01 1000
2 inst02 1100
3 inst01 1210
4 inst03 900
etc.
The id column is auto incrementing.
What I am trying to do is get the average timediff for each instrument for the last 10 values of each instrument.
The closest I've got is:
SELECT
inst AS Instrument,
AVG(diff / 1000) AS Average
FROM
(SELECT
instname AS inst, timediff AS diff
FROM
log
WHERE
instname = 'Inst01'
ORDER BY id DESC
LIMIT 0 , 10) AS two
Obviously this only works for 1 instrument and I'm not convinced the limit is working properly either. I don't know the names of the instruments nor how many I'll be collecting data from.
How do I get the average timediff of the last 10 values for each instrument using SQL?
Somewhat painfully. I think the easiest way is to use variables. The following query enumerates the readings for each instrument:
select l.*,
       (@rn := if(@i = instname, @rn + 1,
                  if(@i := instname, 1, 1)
                 )
       ) as rn
from log l cross join
     (select @i := '', @rn := 0) params
order by instname, id desc;
You can then use this as a subquery to do your calculation:
select instname, avg(timediff)
from (select l.*,
             (@rn := if(@i = instname, @rn + 1,
                        if(@i := instname, 1, 1)
                       )
             ) as rn
      from log l cross join
           (select @i := '', @rn := 0) params
      order by instname, id desc
     ) l
where rn <= 10
group by instname;
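The same two steps (number each instrument's readings from newest to oldest, then average the first ten) can be sketched in Python. The function name is made up for illustration, and the sketch assumes a higher `id` means a more recent reading, as the auto-incrementing column suggests.

```python
from collections import defaultdict

def average_of_last(readings, n=10):
    """readings: iterable of (id, instname, timediff); averages each instrument's newest n."""
    by_inst = defaultdict(list)
    for row_id, instname, timediff in readings:
        by_inst[instname].append((row_id, timediff))
    averages = {}
    for instname, rows in by_inst.items():
        rows.sort(reverse=True)  # highest id first, like ORDER BY id DESC
        newest = [timediff for _id, timediff in rows[:n]]  # rows with rn <= n
        averages[instname] = sum(newest) / len(newest)
    return averages

readings = [(1, "inst01", 1000), (2, "inst02", 1100),
            (3, "inst01", 1210), (4, "inst03", 900)]
print(average_of_last(readings))
# {'inst01': 1105.0, 'inst02': 1100.0, 'inst03': 900.0}
```

Instruments with fewer than `n` readings are averaged over whatever readings they have, which matches the `rn <= 10` filter.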
Try this. It was tested on a smaller data set but should work.
SELECT
inst AS Instrument,
diff AS Average
FROM
(SELECT
t1.instname AS inst,AVG(t1.timediff / 1000) AS diff
FROM
inst t1,inst t2
WHERE
t1.instname = t2.instname group by t1.instname ORDER BY t2.id DESC
LIMIT 0,10
) AS two