Identify and mark record times using SQL - mysql

Given a database table that contains a list of race times, I need to identify which performances are faster than all earlier finish times for that athlete at a specific distance, i.e. which were their best time at the time of the performance.
Also, would it be better to store this as a boolean in an additional column at rest, rather than calculating it in every SELECT? The database isn't populated in chronological order, so I'm not sure a TRIGGER would help. I was thinking of a query that runs on the whole table after any inserts/updates. I appreciate this may have a performance impact, so it could be run periodically rather than on each row update.
This is on a MySQL 5.6.47 server.
Example table
athleteId date distance finishTime
1 2020-01-04 5K 30:00
1 2020-01-11 5K 30:09
1 2020-01-18 5K 29:45
1 2020-01-25 5K 29:32
1 2020-02-01 5K 31:18
1 2020-02-02 10K 1:06:07
1 2020-02-08 5K 28:25
1 2020-02-23 10K 1:06:02
1 2020-02-23 10K 1:07:30
Expected output
athleteId date distance finishTime isPersonalBest
1 2020-01-04 5K 30:00 Y
1 2020-01-11 5K 30:09 N
1 2020-01-18 5K 29:45 Y
1 2020-01-25 5K 29:32 Y
1 2020-02-01 5K 31:18 N
1 2020-02-02 10K 1:06:07 Y
1 2020-02-08 5K 28:25 Y
1 2020-02-23 10K 1:06:02 Y
1 2020-02-23 10K 1:07:30 N
The data is just an example. The actual finish times are stored in seconds. There will be many more athletes and different event distances. If a performance is the first for that athlete at that distance, it would be classed as a personal best.

If you are running MySQL 8.0, you can use window functions:
select
t.*,
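case when finishTime >= min(finishTime) over(
partition by athleteId, distance
order by date
rows between unbounded preceding and 1 preceding
)
then 'N' -- an earlier performance was at least as fast
else 'Y' -- also covers the first performance, where the window minimum is null
end isPersonalBest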
from mytable t
In earlier versions, one option is a correlated subquery:
select
t.*,
case when exists(
select 1
from mytable t1
where
t1.athleteId = t.athleteId
and t1.distance = t.distance
and t1.date < t.date
and t1.finishTime <= t.finishTime
)
then 'N'
else 'Y'
end isPersonalBest
from mytable t
I wouldn't recommend actually storing this derived information. Instead, you can use the above query to create a view.

You can use a cumulative min in MySQL 8+:
select t.*,
(case when finishTime >=
min(finishTime) over (partition by athleteid, distance
order by date
rows between unbounded preceding and 1 preceding
)
then 'N' else 'Y'
end) as isPersonalBest
from t;
Here is a db<>fiddle.
In earlier versions, you could use not exists:
select t.*,
(case when not exists (select 1
from t t2
where t2.athleteid = t.athleteid and
t2.distance = t.distance and
t2.date < t.date and
t2.finishTime <= t.finishTime
)
then 'Y' else 'N'
end) as isPersonalBest
from t;
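Since the question targets MySQL 5.6, the not exists form is the one that applies there. Below is a minimal sketch that checks it against the sample data. It uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption made only so the example is runnable), and the table name race_times and the conversion of the finish times to seconds are illustrative.

```python
import sqlite3

# Sample rows from the question, with finish times converted to seconds.
rows = [
    (1, "2020-01-04", "5K", 1800),
    (1, "2020-01-11", "5K", 1809),
    (1, "2020-01-18", "5K", 1785),
    (1, "2020-01-25", "5K", 1772),
    (1, "2020-02-01", "5K", 1878),
    (1, "2020-02-02", "10K", 3967),
    (1, "2020-02-08", "5K", 1705),
    (1, "2020-02-23", "10K", 3962),
    (1, "2020-02-23", "10K", 4050),
]

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL here (assumption)
conn.execute(
    "CREATE TABLE race_times "
    "(athleteId INT, date TEXT, distance TEXT, finishTime INT)"
)
conn.executemany("INSERT INTO race_times VALUES (?, ?, ?, ?)", rows)

# A row is a personal best when no earlier row for the same athlete and
# distance has a finish time that is equal or faster.
query = """
SELECT t.date, t.distance, t.finishTime,
       CASE WHEN NOT EXISTS (
                SELECT 1
                FROM race_times t2
                WHERE t2.athleteId = t.athleteId
                  AND t2.distance = t.distance
                  AND t2.date < t.date
                  AND t2.finishTime <= t.finishTime)
            THEN 'Y' ELSE 'N'
       END AS isPersonalBest
FROM race_times t
ORDER BY t.date, t.distance, t.finishTime
"""
flags = [row[3] for row in conn.execute(query)]
print(flags)
```

Note that two performances on the same date are never compared with each other, because the subquery only looks at strictly earlier dates.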

Related

Finding MySQL records that have at least 2 consecutive timestamp or ID records

I need to run a report that finds the sum of records that have a value in a certain field ( >50 ), but only when there are at least 2 consecutive timestamps. Once the timestamps stop being consecutive, I then need to ignore them until we find the next 2 consecutive.
1 2021-01-26 09:45:58 50
2 2021-01-26 09:47:23 20
3 2021-01-26 09:47:29 50
4 2021-01-26 09:48:23 50
In the example above:
The first record (ID 1) would fail (only 1 hit in the required timescale).
ID 2 would fail (value too low).
Records 3 and 4 would qualify for inclusion in the sum.
This is a type of gaps-and-islands problem. You can identify the groups by counting the number of non-50+ rows up to each row. Then, aggregate the groups with the conditions you want:
select grp, sum(value)
from (select t.*,
sum(value < 50) over (order by timestamp) as grp
from t
) t
where value >= 50
group by grp
having count(*) >= 2;
This produces a separate sum for each group of adjacent rows. If you want the total sum, then you can use a subquery or CTE based on this query.
If you actually just want the overall sum, you can use lead() and lag():
select sum(value)
from (select t.*,
lag(value) over (order by timestamp) as prev_value,
lead(value) over (order by timestamp) as next_value
from t
) t
where value >= 50 and
(prev_value >= 50 or next_value >= 50)
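To make the grouping idea concrete, here is a minimal sketch in plain Python rather than SQL. It assumes the rows are already sorted by timestamp and that a value of 50 or more qualifies, as in the queries above; the function name sum_of_runs and its parameters are hypothetical.

```python
from itertools import groupby

# (id, timestamp, value) rows from the question, already in timestamp order.
readings = [
    (1, "2021-01-26 09:45:58", 50),
    (2, "2021-01-26 09:47:23", 20),
    (3, "2021-01-26 09:47:29", 50),
    (4, "2021-01-26 09:48:23", 50),
]


def sum_of_runs(rows, threshold=50, min_run=2):
    """Sum values in runs of at least min_run adjacent rows with value >= threshold."""
    total = 0
    # groupby splits the ordered rows into islands of qualifying and
    # non-qualifying rows, which is the grouping the SQL builds with grp.
    for qualifies, run in groupby(rows, key=lambda r: r[2] >= threshold):
        values = [r[2] for r in run]
        if qualifies and len(values) >= min_run:
            total += sum(values)
    return total


print(sum_of_runs(readings))
```

Record 1 forms a run of length one and is skipped, record 2 is below the threshold, and records 3 and 4 form the only run that counts.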

Sql time points to time range

I have the following sql table:
id time
1 2018-12-30
1 2018-12-31
1 2018-01-03
2 2018-12-15
2 2018-12-30
I want to make a query which will result in following data:
id start_time end_time
1 2018-12-30 2018-12-31
1 2018-12-31 2018-01-03
2 2018-12-15 2018-12-30
Is this even possible to do with SQL in a reasonable amount of time, or is it better to do this by other means?
Failed approach (it takes too much time):
SELECT id, time as start_time, (
SELECT MIN(time)
FROM table as T2
WHERE T2.id = T1.id
AND T2.time > T1.time
) as end_time
FROM table as T1
I have dates in my db, each of which has a non-unique id. I want to calculate the time range between the closest dates for each id, so the transformation should be performed on each id separately and should not affect other ids. We can even forget about ids and imagine that my DB has only one column, which is dates. I want to sort my dates and apply a sliding window with step 1 and capacity 2. So if I have 10 dates, I want 9 time ranges in the result, in increasing order. Assume we have four dates: D1 < D2 < D3 < D4. The result should be (D1,D2), (D2,D3), (D3,D4).
In MySQL 8.x you can use the LEAD() function to peek at the next row:
with x as (
select
id,
time as start_time,
lead(time) over(partition by id order by time) as end_time
from my_table
)
select * from x where end_time is not null
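For readers without window functions, the same pairing of each date with the next one can be sketched outside the database. This is a minimal Python illustration, not part of the original answer; the function name time_ranges and the sample dates are assumptions chosen so that the dates sort in the order D1 < D2 < D3 described in the question.

```python
from collections import defaultdict
from datetime import date


def time_ranges(points):
    """Pair each date with the next later date for the same id."""
    by_id = defaultdict(list)
    for row_id, day in points:
        by_id[row_id].append(day)

    ranges = []
    for row_id in sorted(by_id):
        days = sorted(by_id[row_id])
        # Sliding window of capacity 2 and step 1: (D1, D2), (D2, D3), ...
        for start, end in zip(days, days[1:]):
            ranges.append((row_id, start, end))
    return ranges


# Illustrative input, deliberately unsorted.
points = [
    (1, date(2018, 12, 30)),
    (1, date(2019, 1, 3)),
    (1, date(2018, 12, 31)),
    (2, date(2018, 12, 30)),
    (2, date(2018, 12, 15)),
]

for row in time_ranges(points):
    print(row)
```

As with the lead() query, an id with n dates yields n - 1 ranges, and the last date of each id has no range of its own.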

Dealing with sequential/ordered data in SQL

I apologize if this question has already been asked, attempted a search but could not find a relevant thread.
I have been given a semi-large data source (~15m records) which I need to perform some analysis on to determine user behavior. The data source includes fields for the User ID, date of the transaction, and a flag to indicate if the transaction had a certain characteristic. Obviously I'm simplifying here to get to the core of the question. The number of transactions by user will vary quite a bit (from 1 to 200+), the date distribution will vary, and the distribution of flags will vary.
Consider the following table:
ID User ID Date Flag
1 1 2015-01-03 Y
2 1 2015-03-15 N
3 1 2015-07-20 N
4 1 2015-11-18 N
5 1 2015-11-29 N
6 2 2015-02-16 Y
7 2 2015-03-03 N
8 2 2015-06-10 Y
9 2 2015-08-10 Y
How would one go about querying this data to isolate records based upon the characteristics of other records for the same user before or after?
For example:
How would one identify records with a 'Y' flag which are followed by three other records (ordered by date) for the same User ID with an 'N' flag? [Would return 1 in the above table]
How would one identify User IDs where 50% or more of their transactions with 'Y' flags occur in the first 20% of their transactions? [Would return User ID 1 in the above table]
I hope the question is clear enough.
*Edit: The answer below is correct, however he did not know that I am using MySQL as the database (I added in the tag after he answered). MySQL DOES NOT support these functions, either Oracle or SQL Server would be able to implement these functions.
This question assumes a reasonable database that supports window/analytic functions.
The first question can be handled using lead():
select t.*
from (select t.*,
lead(flag, 1) over (partition by user_id order by date) as flag_1,
lead(flag, 2) over (partition by user_id order by date) as flag_2,
lead(flag, 3) over (partition by user_id order by date) as flag_3
from t
) t
where flag = 'Y' and flag_1 = 'N' and flag_2 = 'N' and flag_3 = 'N';
The second also uses window functions:
select user_id
from (select t.*,
row_number() over (partition by user_id order by date) as seqnum,
count(*) over (partition by user_id) as cnt
from t
) t
group by user_id
having sum(case when flag = 'Y' and seqnum/0.2 <= cnt then 1 else 0 end) >=
0.5 * sum(case when flag = 'Y' then 1 else 0 end);
So, the answer to your question is basically: Learn about window (analytic) functions.
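Because the asker's MySQL version lacked these functions, it may help to see the same two checks spelled out procedurally. The following is a rough Python sketch over the sample table, not a replacement for the queries above; the tuple layout (id, user_id, date, flag) and the function names are assumptions.

```python
from collections import defaultdict

# (id, user_id, date, flag) rows from the question.
transactions = [
    (1, 1, "2015-01-03", "Y"),
    (2, 1, "2015-03-15", "N"),
    (3, 1, "2015-07-20", "N"),
    (4, 1, "2015-11-18", "N"),
    (5, 1, "2015-11-29", "N"),
    (6, 2, "2015-02-16", "Y"),
    (7, 2, "2015-03-03", "N"),
    (8, 2, "2015-06-10", "Y"),
    (9, 2, "2015-08-10", "Y"),
]


def group_by_user(rows):
    """Return {user_id: rows ordered by date}, like partition by ... order by."""
    users = defaultdict(list)
    for row in sorted(rows, key=lambda r: (r[1], r[2])):
        users[row[1]].append(row)
    return users


def y_followed_by_three_n(rows):
    """Ids of 'Y' rows whose next three rows for the same user are all 'N'."""
    hits = []
    for history in group_by_user(rows).values():
        flags = [r[3] for r in history]
        for i, row in enumerate(history):
            if flags[i:i + 4] == ["Y", "N", "N", "N"]:
                hits.append(row[0])
    return hits


def early_y_users(rows):
    """Users with at least 50% of their 'Y' rows in their first 20% of rows."""
    users = []
    for user_id, history in group_by_user(rows).items():
        cnt = len(history)
        total_y = sum(1 for r in history if r[3] == "Y")
        # Same test as the SQL: seqnum / 0.2 <= cnt.
        early_y = sum(
            1
            for seqnum, r in enumerate(history, start=1)
            if r[3] == "Y" and seqnum / 0.2 <= cnt
        )
        if early_y >= 0.5 * total_y:
            users.append(user_id)
    return users


print(y_followed_by_three_n(transactions))
print(early_y_users(transactions))
```

Like the second query, this sketch also returns users who have no 'Y' rows at all, since zero is at least half of zero; add an explicit check if that is not wanted.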

mysql moving average of N rows

I have a simple MySQL table like below, used to compute MPG for a car.
+-------------+-------+---------+
| DATE | MILES | GALLONS |
+-------------+-------+---------+
| JAN 25 1993 | 20.0 | 3.00 |
| FEB 07 1993 | 55.2 | 7.22 |
| MAR 11 1993 | 44.1 | 6.28 |
+-------------+-------+---------+
I can easily compute the Miles Per Gallon (MPG) for the car using a select statement, but because the MPG varies widely from fillup to fillup (i.e. you don't fill the exact same amount of gas each time), I would like to compute a 'MOVING AVERAGE' as well. So for any row the MPG is MILES/GALLONS for that row, and the MOVINGMPG is the SUM(MILES)/SUM(GALLONS) for the last N rows. If fewer than N rows exist by that point, just SUM(MILES)/SUM(GALLONS) up to that point.
Is there a single SELECT statement that will fetch the rows with MPG and MOVINGMPG by substituting N into the select statement?
Yes, it's possible to return the specified resultset with a single SQL statement.
Unfortunately, MySQL versions before 8.0 do not support analytic (window) functions, which would make for a fairly simple statement. Even without that syntax, it is possible to emulate some analytic functions using MySQL user variables.
One way to achieve the specified result set (with a single SQL statement) is a JOIN operation on a unique ascending integer value (rownum, derived and assigned within the query) given to each row.
For example:
SELECT q.rownum AS rownum
, q.date AS latest_date
, q.miles/q.gallons AS latest_mpg
, COUNT(1) AS cnt_rows
, MIN(r.date) AS earliest_date
, SUM(r.miles) AS rtot_miles
, SUM(r.gallons) AS rtot_gallons
, SUM(r.miles)/SUM(r.gallons) AS rtot_mpg
FROM ( SELECT @s_rownum := @s_rownum + 1 AS rownum
, s.date
, s.miles
, s.gallons
FROM mytable s
JOIN (SELECT @s_rownum := 0) c
ORDER BY s.date
) q
JOIN ( SELECT @t_rownum := @t_rownum + 1 AS rownum
, t.date
, t.miles
, t.gallons
FROM mytable t
JOIN (SELECT @t_rownum := 0) d
ORDER BY t.date
) r
ON r.rownum <= q.rownum
AND r.rownum > q.rownum - 2
GROUP BY q.rownum
Your desired value of "n" to specify how many rows to include in each rollup row is specified in the predicate just before the GROUP BY clause. In this example, up to "2" rows in each running total row.
If you specify a value of 1, you will get (basically) the original table returned.
To eliminate any "incomplete" running total rows (consisting of fewer than "n" rows), that value of "n" would need to be specified again, adding:
HAVING COUNT(1) >= 2
sqlfiddle demo: http://sqlfiddle.com/#!2/52420/2
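To make the intended arithmetic explicit, here is a small Python sketch of the moving MPG over the last n rows, using the three sample rows from the question. It only illustrates the calculation and is not a substitute for the SQL; the function name moving_mpg and the ISO date strings are assumptions.

```python
# (date, miles, gallons) rows from the question, in date order.
fillups = [
    ("1993-01-25", 20.0, 3.00),
    ("1993-02-07", 55.2, 7.22),
    ("1993-03-11", 44.1, 6.28),
]


def moving_mpg(rows, n):
    """Return (date, mpg, moving_mpg) per row, where moving_mpg covers the last n rows."""
    result = []
    for i, (day, miles, gallons) in enumerate(rows):
        # Up to n rows ending at the current row; fewer near the start.
        window = rows[max(0, i - n + 1): i + 1]
        total_miles = sum(r[1] for r in window)
        total_gallons = sum(r[2] for r in window)
        result.append((day, miles / gallons, total_miles / total_gallons))
    return result


for row in moving_mpg(fillups, n=2):
    print(row)
```

With n=1 the moving value equals the per-row MPG, which matches the remark above that a value of 1 basically returns the original table.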
Followup:
Q: I'm trying to understand your SQL statement. Does your solution do a select of twenty rows for each row in the db? In other words, if I have 1000 rows will your statement perform 20000 selects? (I'm worried about performance)...
A: You are right to be concerned with performance.
To answer your question, no, this does not perform 20,000 selects for 1,000 rows.
The performance hit comes from the two (essentially identical) inline views (aliased as q and r). What MySQL does with these (basically) is create temporary MyISAM tables (MySQL calls them "derived tables"), which are basically copies of mytable, with an extra column, each row assigned a unique integer value from 1 to the number of rows.
Once the two "derived" tables are created and populated, MySQL runs the outer query, using those two "derived" tables as a row source. Each row from q, is matched with up to n rows from r, to calculate the "running total" miles and gallons.
For better performance, you could use a column already in the table, rather than having the query assign unique integer values. For example, if the date column is unique, then you could calculate "running total" over a certain period of days.
SELECT q.date AS latest_date
, SUM(q.miles)/SUM(q.gallons) AS latest_mpg
, COUNT(1) AS cnt_rows
, MIN(r.date) AS earliest_date
, SUM(r.miles) AS rtot_miles
, SUM(r.gallons) AS rtot_gallons
, SUM(r.miles)/SUM(r.gallons) AS rtot_mpg
FROM mytable q
JOIN mytable r
ON r.date <= q.date
AND r.date > q.date + INTERVAL -30 DAY
GROUP BY q.date
(For performance, you would want an appropriate index defined with date as a leading column in the index.)
For the first query, any predicates included (in the inline view definition queries) to reduce the number of rows returned (for example, return only date values in the past year) would reduce the number of rows to be processed, and would also likely improve performance.
Again, to your question about running 20,000 selects for 1,000 rows... a nested loops operation is another way to get the same result set. For a large number of rows, this can exhibit slower performance. (On the other hand, this approach can be fairly efficient when only a few rows are being returned.) For example:
SELECT q.date AS latest_date
, q.miles/q.gallons AS latest_mpg
, ( SELECT SUM(r.miles)/SUM(r.gallons)
FROM mytable r
WHERE r.date <= q.date
AND r.date >= q.date + INTERVAL -90 DAY
) AS rtot_mpg
FROM mytable q
ORDER BY q.date
Something like this should work:
SELECT Date, Miles, Gallons, Miles/Gallons as MilesPerGallon,
@Miles:=@Miles+Miles overallMiles,
@Gallons:=@Gallons+Gallons overallGallons,
@RunningTotal:=@Miles/@Gallons runningTotal
FROM YourTable
JOIN (SELECT @Miles:= 0) t
JOIN (SELECT @Gallons:= 0) s
SQL Fiddle Demo
Which produces the following:
DATE MILES GALLONS MILESPERGALLON RUNNINGTOTAL
January, 25 1993 20 3 6.666667 6.666666666667
February, 07 1993 55.2 7.22 7.645429 7.358121330724
March, 11 1993 44.1 6.28 7.022293 7.230303030303
--EDIT--
In response to the comment, you can add another Row Number to limit your results to the last N rows:
SELECT *
FROM (
SELECT Date, Miles, Gallons, Miles/Gallons as MilesPerGallon,
@Miles:=@Miles+Miles overallMiles,
@Gallons:=@Gallons+Gallons overallGallons,
@RunningTotal:=@Miles/@Gallons runningTotal,
@RowNumber:=@RowNumber+1 rowNumber
FROM (SELECT * FROM YourTable ORDER BY Date DESC) u
JOIN (SELECT @Miles:= 0) t
JOIN (SELECT @Gallons:= 0) s
JOIN (SELECT @RowNumber:= 0) r
) t
WHERE rowNumber <= 3
Just change your ORDER BY clause accordingly. And here is the updated fiddle.

Mysql query to calculate total cost

Hi,
I have a table listing_prices (id,listing_id,day_from,day_to,price)
I need to calculate the total cost of a holiday in MySQL because I need to sort the results by total cost.
EX:
VALUES IN TABLE
1 6 2011-04-27 2011-04-30 55,00
2 6 2011-05-01 2011-05-02 60,00
3 6 2011-05-03 2011-05-15 65,00
holiday from 2011-04-28 to 2011-05-05 total cost = 480
Without creating an actual table to represent every day from start date to end date, you could use MySQL user variables. The inner query can join to any table, as long as it has at least as many records as the number of days in the holiday period; in this case, 8 days from April 28 to May 5. Doing a Cartesian join and limiting it to 8 rows will, in essence, create a temporary result set with one record per day, starting with 2011/04/28 (your starting date).
Then, this is joined back to your pricing table that matches the date period and sums the matching price for total costs...
select
sum( pt.price ) as TotalCosts
from
( SELECT
@r:= date_add(@r, interval 1 day ) CalendarDate
FROM
-- start one day before the first date, because @r is incremented before it is returned
(select @r := STR_TO_DATE('2011/04/27', '%Y/%m/%d')) vars,
AnyTableWithAtLeast8Days limit 8 ) JustDates,
listing_prices pt
where
JustDates.CalendarDate between pt.day_from and pt.day_to
select count(price) from listing_prices where day_from >= '2011-04-28' and day_to <= '2011-05-05'
-- This will provide a list of ids along with how many days of each price range fall within the holiday
SELECT a.id, DATEDIFF(LEAST(a.day_to, '2011-05-05'), GREATEST(a.day_from, '2011-04-28')) + 1 AS DayCount
FROM listing_prices a
WHERE a.day_from <= '2011-05-05'
AND a.day_to >= '2011-04-28'
-- Based on the previous query, multiply each price by its number of days and sum the results
SELECT SUM( a.price * b.DayCount ) AS Total
FROM listing_prices a
JOIN ( SELECT a.id, DATEDIFF(LEAST(a.day_to, '2011-05-05'), GREATEST(a.day_from, '2011-04-28')) + 1 AS DayCount
FROM listing_prices a
WHERE a.day_from <= '2011-05-05'
AND a.day_to >= '2011-04-28'
) b ON a.id = b.id
Please note that this is untested. I believe the first query should work; if it doesn't, it can be modified until it returns the number of days within each range and then copied into the subquery of the second query. The second query is the one that you will actually use.
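The idea of counting the days of each price range that fall inside the holiday and multiplying by the price can be checked with a short Python sketch. It assumes that both the first and the last day of the holiday are charged, which is what the expected total of 480 in the question implies; the function name total_cost is hypothetical.

```python
from datetime import date

# (day_from, day_to, price) rows for listing 6 from the question.
prices = [
    (date(2011, 4, 27), date(2011, 4, 30), 55.0),
    (date(2011, 5, 1), date(2011, 5, 2), 60.0),
    (date(2011, 5, 3), date(2011, 5, 15), 65.0),
]


def total_cost(price_rows, start, end):
    """Sum price * overlapping days, treating start and end as inclusive."""
    total = 0.0
    for day_from, day_to, price in price_rows:
        # Clip the price range to the holiday period.
        first = max(day_from, start)
        last = min(day_to, end)
        days = (last - first).days + 1
        if days > 0:  # ranges outside the holiday contribute nothing
            total += price * days
    return total


print(total_cost(prices, date(2011, 4, 28), date(2011, 5, 5)))
```

For the sample holiday this charges 3 days at 55, 2 days at 60 and 3 days at 65.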