Efficiently selecting every nth row without ROW_NUMBER - mysql

I have a table consisting of about 20 million rows, totalling approximately 2 GB. I need to select every nth row, leaving me with only a few hundred rows. But I cannot for the life of me figure out how to do it without getting a timeout.
ROW_NUMBER is not available, and keeping track of the current row number with a variable (e.g. @row) causes a timeout. I presume this is because it is still iterating over every row, but I'm not too sure. There's no integer index for me to use either. A DATETIME field is used instead. This is an example query using @row:
SET @row = 0;
SELECT `field` FROM `table` WHERE (@row := @row + 1) % 1555200 = 0;
Is there anything else I haven't tried?
Thanks in advance!

It's a tricky one for sure. You could work out the minimum date and then use a DATEDIFF to get sequential values, but this probably isn't sargable (see the query below). For me, it took 18 seconds on a table with 16 million rows, but your mileage may vary.
** EDIT ** I should also add that this was with a nonclustered index scan against an index which included the date column (pretty sure this is forced by the function around the date but perhaps someone with more knowledge can expand on this). After creating an index against that column, I got 12 seconds.
Try it out and let me know how it goes :)
DECLARE @n INT = 5;
SELECT
    DATEDIFF(DAY, first_date.min_date, DATE_COLUMN) AS ROWNUM
FROM
    ss.YOUR_TABLE
OUTER APPLY
    ( SELECT
          MIN(a.DATE_COLUMN) min_date
      FROM ss.YOUR_TABLE a
    ) first_date
WHERE DATEDIFF(DAY, first_date.min_date, DATE_COLUMN) % @n = 0
Edit again:
Just noticed this has been accepted as an answer... In case anyone else comes across this, it probably shouldn't be. On review, this only works if your datetime field has one entry per day and the datetime is sequential (in that rows are added in the same order as the datetime, or if the datetime is the primary key).
Again, this only works per day with the above caveats. You can change the DATEDIFF to use any unit (month, year, minute, etc.) if you have exactly one row added per unit of time.
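The query above uses SQL Server syntax (DECLARE, OUTER APPLY, three-argument DATEDIFF), while the question is about MySQL. A rough MySQL equivalent might look like the sketch below. The names your_table and date_column are the same placeholders as in the answer, and the same caveat applies: it only picks "every nth row" if there is exactly one row per day.

```sql
-- Sketch only: assumes one row per calendar day in date_column.
SET @n = 5;

SELECT t.date_column,
       -- MySQL's DATEDIFF(a, b) returns a - b in whole days (date parts only)
       DATEDIFF(t.date_column, first_date.min_date) AS rownum
FROM your_table AS t
CROSS JOIN (
    SELECT MIN(date_column) AS min_date
    FROM your_table
) AS first_date
WHERE DATEDIFF(t.date_column, first_date.min_date) % @n = 0;
```

Because date_column is still wrapped in a function inside the WHERE clause, this is unlikely to be any more sargable than the original.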

Related

Efficient SQL Query to calculate portion of a row in half hourly time series that has occurred

I have a table that looks like this:
id  slot              total
1   2022-12-01T12:00  100
2   2022-12-01T12:30  150
3   2022-12-01T13:00  200
There's an index on slot already. The table has ~100mil rows (and a bunch more columns not shown here)
I want to sum the total up to the current moment in time (EDIT: WASN'T CLEAR INITIALLY, I WILL PROVIDE A LOWER SLOT BOUND, SO THE SUM WILL BE OVER SOME NUMBER OF DAYS/WEEKS, NOT OVER FULL TABLE). Let's say the time is currently 2022-12-01T12:45. If I run select * from my_table where slot < CURRENT_TIMESTAMP(),
then I get back records 1 and 2.
However, in my data, the records represent forecasted sales within a time slot. I want to find the forecasts as of 2022-12-01T12:45, and so I want to find the proportion of the half hour slot of record 2 that has elapsed, and return that proportion of the total.
As of 2022-12-01T12:45 (assuming minute granularity), 50% of row 2 has elapsed, so I would expect the total to return as 150 / 2 = 75.
My current query works, but is slow. What are some ways I can optimise this, or other approaches I can take?
Also, how can we extend this solution to be generalised to any interval frequency? Maybe tomorrow we change our forecasting model and the data comes in sporadically. The hardcoded 30 would not work in that case.
select sum(fraction * total) as t from (
    select total,
           LEAST(
               timestampdiff(
                   minute,
                   slot,
                   current_timestamp()
               ),
               30
           ) / 30 as fraction
    from my_table
    where slot <= current_timestamp()
) as fractions
Consider computing the full sum first, then subtracting the part of the last slot's total that has not yet elapsed. To keep the last slot's own total available, I'd use a window function instead of an aggregate, and limit the output to the last row.
SET @current_time = CURRENT_TIMESTAMP();
WITH cte AS (
    SELECT slot,
           SUM(total) OVER(ORDER BY slot) AS total,
           total AS rowtotal
    FROM my_table
    WHERE slot < @current_time
    ORDER BY slot DESC
    LIMIT 1
)
SELECT slot,
       total - (30 - TIMESTAMPDIFF(MINUTE,
                                   slot,
                                   @current_time))
               /30 * rowtotal AS total
FROM cte
Note1: Adding an index on the slot field is likely to boost this query performance.
Note2: If your query runs over millions of rows, the current timestamp may change while the work is in progress. You could store it in a variable before the query is run (or in another cte).
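On the question of generalising beyond a hardcoded 30 minutes, one possible approach is to derive each slot's length from the start of the next slot with LEAD(). This is only a sketch under stated assumptions: each slot is assumed to end exactly where the next one begins, the row for the following slot is assumed to exist already, and @lower_bound is a hypothetical stand-in for the lower slot bound mentioned in the question. It needs MySQL 8.0 or later for window functions.

```sql
SET @now = CURRENT_TIMESTAMP();
SET @lower_bound = @now - INTERVAL 7 DAY;  -- hypothetical lower bound

WITH bounded AS (
    SELECT slot,
           total,
           -- start of the following slot, treated as the end of this one
           LEAD(slot) OVER (ORDER BY slot) AS slot_end
    FROM my_table
    WHERE slot >= @lower_bound
)
SELECT SUM(
           total * LEAST(1,
                         TIMESTAMPDIFF(SECOND, slot, @now)
                         / TIMESTAMPDIFF(SECOND, slot, slot_end))
       ) AS t
FROM bounded
WHERE slot <= @now;
```

The filter on @now is applied outside the CTE so that LEAD() can still see the first future slot. A row with no following slot gets a NULL slot_end and drops out of the SUM, so the very last row in the table would need separate handling.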
Create a B-tree index on the slot column, since it has high selectivity.

MySQL Comparing Times of Different Formats

I am working with a database full of songs, with titles and durations.
I need to return all songs with a duration greater than 29:59 (MM:SS).
The data is formatted in two different ways.
Format 1
Most of the data in the table is formatted as MM:SS, with some songs being greater than 60 minutes formatted for example as 72:15.
Format 2
Other songs in the table are formatted as HH:MM:SS, where the example given for Format 1 would instead be 01:12:15.
I have tried two different types of queries to solve this problem.
Query 1
The following query returns all of the values that I seek to return for Format 1, but I could not find a way to get values included for Format 2.
select title, duration from songs where
time(cast(duration as time)) >
time(cast('29:59' as time))
Query 2
With the next query, I hoped to use the format specifiers in str_to_date to locate those results with the format HH:MM:SS, but instead I received results such as 3:50. The interpreter is assuming that all of the data is of the form HH:MM, and I do not know how to tell it otherwise without ruining the results.
select title, duration from songs where
time(cast(str_to_date(duration, '%H:%i:%s') as time)) >
time(cast(str_to_date('00:29:59', '%H:%i:%s') as time))
I've tried changing the specifiers in the first call to str_to_date to %i:%s, which gives me all values greater than 29:59, but none greater than 59:59. This is worse than the original query. I've also tried 00:%i:%s and '00:' || duration, '%H:%i:%s'. These two in particular would ruin the results anyway, but I'm just fiddling at this point.
I'm thoroughly stumped, but I'm sure the solution is an easy one. Any help is appreciated.
EDIT: Here is some data requested from the comments below.
Results from show create table:
CREATE TABLE `songs` (
`song_id` int(11) NOT NULL,
`title` varchar(100) NOT NULL,
`duration` varchar(20) DEFAULT NULL,
PRIMARY KEY (`song_id`),
UNIQUE KEY `songs_uq` (`title`,`duration`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
Keep in mind, there are more columns than I described above, but I left some out for the sake of simplicity. I will also leave them out in the sample data.
Sample Data
title duration
(Allegro Moderato) 3:50
Agatha 1:56
Antecessor Machine 06:16
Very Long Song 01:24:16
Also Very Long 2:35:22
You are storing unstructured data in a relational database. And that is making you unhappy. So structure it.
Either add a TIME column, or copy song_id into a parallel time table on the side that you can JOIN against. Select all the two-colon durations and trivially update TIME. Repeat, prepending '00:' to all the one-colon durations. Now you have parsed all rows, and can safely ignore the duration column.
Ok, fine, I suppose you could construct a VIEW that offers UNION ALL of those two queries, but that is slow and ugly, much better to fix the on-disk data.
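The one-off conversion described above could be sketched roughly as follows. The column name duration_time is an assumption made for illustration. Note one deviation from the description: simply prepending '00:' to a value such as 72:15 produces 00:72:15, and MySQL treats a minutes part above 59 as an invalid TIME, so this sketch converts the one-colon values through seconds instead.

```sql
-- duration_time is a hypothetical new column holding the parsed value
ALTER TABLE songs ADD COLUMN duration_time TIME NULL;

-- Two-colon values (H:MM:SS) can be cast directly
UPDATE songs
SET duration_time = CAST(duration AS TIME)
WHERE duration LIKE '%:%:%';

-- One-colon values (MM:SS, where MM may exceed 59) go through seconds
UPDATE songs
SET duration_time = SEC_TO_TIME(
        CAST(SUBSTRING_INDEX(duration, ':', 1) AS UNSIGNED) * 60
      + CAST(SUBSTRING_INDEX(duration, ':', -1) AS UNSIGNED))
WHERE duration LIKE '%:%'
  AND duration NOT LIKE '%:%:%';

-- The original question then becomes a plain comparison
SELECT title, duration
FROM songs
WHERE duration_time > '00:29:59';
```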
Forget times. Convert to seconds. Here is one way:
select s.*
from (select s.*,
( substring_index(duration, ':', -1) + 0 +
substring_index(substring_index(duration, ':', -2), ':', 1) * 60 +
(case when duration like '%:%:%' then substring_index(duration, ':', 1) * 60*60
else 0
end)
) as duration_seconds
from songs s
) s
where duration_seconds > 29*60 + 59;
After some research I have come up with an answer of my own that I am happy with.
select title, duration from songs where
case
when length(duration) - length(replace(duration, ':', '')) = 1
then time_to_sec(duration) > time_to_sec('29:59')
else time_to_sec(duration) > time_to_sec('00:29:59')
end
Thank you to Gordon Linoff for suggesting that I convert the times to seconds. This made things much easier. I just found his solution a bit overcomplicated, and it reinvents the wheel by not using time_to_sec.
Output Data
title duration
21 Album Mix Tape 45:40
Act 1 1:20:25
Act 2 1:12:05
Agog Opus I 30:00
Among The Vultures 2:11:00
Anabasis 1:12:00
Avalanches Mixtape 60:00
Beautiful And Timeless 73:46
Beggars Banquet Tracks 76:07
Bonus Tracks 68:55
Chindogu 66:23
Spun 101:08
Note: Gordon mentioned his reason for not using time_to_sec was to account for songs greater than 23 hours long. After testing, I found that time_to_sec does support hours larger than 23, just as it supports minutes greater than 59.
It is also perfectly fine with other non-conforming formats such as 1:4:32 (i.e. 01:04:32).

SQL: Reuse function result in query without using sub-query

In a MySQL DB table that stores sale orders, I have a LastReviewed column that holds the last date and time when the sale order was modified (type timestamp, default value CURRENT_TIMESTAMP). I'd like to plot the number of sales that were modified each day, for the last 90 days, for a particular user.
I'm trying to craft a SELECT that returns the number of days since LastReviewed date, and how many records fall within that range. Below is my query, which works just fine:
SELECT DATEDIFF(CURDATE(), LastReviewed) AS days, COUNT(*) AS number FROM sales
WHERE UserID=123 AND DATEDIFF(CURDATE(),LastReviewed)<=90
GROUP BY days
ORDER BY days ASC
Notice that I am computing the DATEDIFF() as well as CURDATE() multiple times for each record. This seems really inefficient, so I'd like to know how I can reuse the results of the previous computation. The first thing I tried was:
SELECT DATEDIFF(CURDATE(), LastReviewed) AS days, COUNT(*) AS number FROM sales
WHERE UserID=123 AND days<=90
GROUP BY days
ORDER BY days ASC
Error: Unknown column 'days' in 'where clause'. So I started to look around the net. Based on another discussion (Can I reuse a calculated field in a SELECT query?), I next tried the following:
SELECT DATEDIFF(CURDATE(), LastReviewed) AS days, COUNT(*) AS number FROM sales
WHERE UserID=123 AND (SELECT days)<=90
GROUP BY days
ORDER BY days ASC
Error: Unknown column 'days' in 'field list'. I also tried the following:
SELECT @days := DATEDIFF(CURDATE(), LastReviewed) AS days,
COUNT(*) AS number FROM sales
WHERE UserID=123 AND @days <=90
GROUP BY days
ORDER BY days ASC
The query returns zero results, so @days<=90 seems to evaluate to false, even though if I put it in the SELECT clause and remove the WHERE clause, I can see some results with @days values below 90.
I've gotten things to work by using a sub-query:
SELECT * FROM (
    SELECT DATEDIFF(CURDATE(),LastReviewed) AS days,
           COUNT(*) AS number FROM sales
    WHERE UserID=123
    GROUP BY days
) AS t
WHERE days<=90
ORDER BY days ASC
However, I don't know whether it's the most efficient way. Not to mention that even this solution computes CURDATE() once per record even though its value will be the same from the start to the end of the query. Isn't that wasteful? Am I overthinking this? Help would be welcome.
Note: Mods, should this be on CodeReview? I posted here because the code I'm trying to use doesn't actually work
There are actually two problems with your question.
First, you're overlooking the fact that WHERE precedes SELECT. When the server evaluates WHERE <expression>, it then already knows the value of the calculations done to evaluate <expression> and can use those for SELECT.
Worse than that, though, you should almost never write a query that uses a column as an argument to a function, since that usually requires the server to evaluate the expression for each row.
Instead, you should use this:
WHERE LastReviewed >= DATE_SUB(CURDATE(), INTERVAL 90 DAY)
The optimizer will see this and get all excited, because DATE_SUB(CURDATE(), INTERVAL 90 DAY) can be resolved to a constant, which can be used on one side of a >= comparison, which means that if an index exists with LastReviewed as the leftmost relevant column, then the server can immediately eliminate all of the rows with LastReviewed < that constant value, using the index.
Then DATEDIFF(CURDATE(), LastReviewed) AS days (still needed for SELECT) will only be evaluated against the rows we already know we want.
Add a single index on (UserID, LastReviewed) and the server will be able to pinpoint exactly the relevant rows extremely quickly.
Builtin functions are much less costly than, say, fetching rows.
You could get a lot more performance improvement with the following 'composite' index:
INDEX(UserID, LastReviewed)
and change to
WHERE UserID=123
AND LastReviewed >= CURRENT_DATE() - INTERVAL 90 DAY
Your formulation is 'hiding' LastReviewed in a function call, making it unusable in an index.
If you are still not satisfied with that improvement, then consider a nightly query that computes yesterday's statistics and puts them in a "Summary table". From there, the SELECT you mentioned can run even faster.
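Putting the composite index and the rewritten WHERE clause from the answers above together, the full query might look like this sketch. The index name idx_user_reviewed is an arbitrary choice for illustration.

```sql
-- Composite index so the server can seek on UserID, then range-scan LastReviewed
ALTER TABLE sales ADD INDEX idx_user_reviewed (UserID, LastReviewed);

SELECT DATEDIFF(CURDATE(), LastReviewed) AS days,
       COUNT(*) AS number
FROM sales
WHERE UserID = 123
  -- bare column on one side, constant expression on the other
  AND LastReviewed >= CURDATE() - INTERVAL 90 DAY
GROUP BY days
ORDER BY days ASC;
```

DATEDIFF() still appears in the SELECT list, but it is only evaluated for the rows that survive the indexed range condition.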

mysql table design for daily rotation of top rating

I store top-views and 'likes' in a table called 'counts'. Once a night I run this query
UPDATE `counts` SET rank=d7+d6+d5+d4+d3+d2+d1,d7=d6,d6=d5,d5=d4,d4=d3,d3=d2,d2=d1,d1=0
Each day of the week has a d1-d7 variable, and we move it 'down' one each night and re-calculate the sum.
As my site has grown, this query now takes ~20 minutes.
I'm looking for suggestions on how to organize this more efficiently, as it seems like it might be a common pattern.
As the comments say, we need to see the schema. But I'll make a suggestion anyway. Don't have 7 different fields d1-d7. What if later you decide to keep the score over a year? Ouch.
I'm going to assume that counts has view_id as its PK. Then have another table ranks with columns view_id (set as FK into counts), rank (generalizes d1-d7, whatever datatype they are) and rank_date, which is a date. Now every night you have
UPDATE counts SET rank = (SELECT SUM(rank) FROM ranks r WHERE r.view_id=counts.view_id
AND r.rank_date>=DATE_SUB(CURDATE(), INTERVAL 1 WEEK) );
[Some RDBMSs allow a JOIN-type syntax in UPDATE queries. I believe MySQL understands something similar to the following, but it isn't my usual RDBMS
UPDATE counts, (SELECT view_id, SUM(rank) AS srank FROM ranks r
WHERE r.rank_date>=DATE_SUB(CURDATE(), INTERVAL 1 WEEK)
GROUP BY r.view_id) AS q1
SET rank = srank
WHERE counts.view_id=q1.view_id;
]
If so, that will probably run faster than the first version.
Meanwhile, optionally to clean up, you can delete rows from ranks that are more than 1 week old, but in a more flexible schema, you don't have to.
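A possible shape for the ranks table and its maintenance, as a sketch: the INT column types, the composite primary key, and the example view_id of 42 are assumptions, since the real schema was not posted. The backticks around rank are there because RANK is a reserved word in MySQL 8.0.

```sql
-- Assumes counts.view_id is an INT primary key
CREATE TABLE ranks (
    view_id   INT  NOT NULL,
    `rank`    INT  NOT NULL DEFAULT 0,
    rank_date DATE NOT NULL,
    PRIMARY KEY (view_id, rank_date),
    FOREIGN KEY (view_id) REFERENCES counts (view_id)
);

-- Record one view or like for today (replaces incrementing d1)
INSERT INTO ranks (view_id, `rank`, rank_date)
VALUES (42, 1, CURDATE())
ON DUPLICATE KEY UPDATE `rank` = `rank` + 1;

-- Optional nightly cleanup of rows outside the one-week window
DELETE FROM ranks
WHERE rank_date < CURDATE() - INTERVAL 1 WEEK;
```

With this layout nothing has to be shifted each night; the nightly UPDATE only re-sums the rows that fall inside the window.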

mysql - set value of cell equal to value of cell in another row

I have a MySQL query that generates a table for my vehicle tracking 'in' and 'out' times.
The problem is that the 'in' time is not the same as the 'out' time so seconds or minutes are lost in between.
Is there a way to set the 'in' time equal to the 'out time' from the previous row, even if I need to embed my current select inside a new select?
You will see in the image below that the first row's out time is 15:45:14 and the in time for the next row is 15:46:14, so in this case a minute is lost.
In reality, once a vehicle has left one point, it is immediately on the road to the next point, so I can set the in time equal to the out time of the previous row. This way, time is never lost.
The SQL for my query is:
select vehicle,InTime,OutTime from (select
PreQuery.callingname as vehicle,
PreQuery.geofence,
PreQuery.GroupSeq,
MIN( PreQuery.`updatetime` ) as InTime,
UNIX_TIMESTAMP(MIN( PreQuery.`updatetime`))as InSeconds,
MAX( PreQuery.`updatetime` ) as OutTime,
UNIX_TIMESTAMP(MAX( PreQuery.`updatetime`))as OutSeconds,
TIME_FORMAT(SEC_TO_TIME((UNIX_TIMESTAMP(MAX( PreQuery.`updatetime` )) - UNIX_TIMESTAMP(MIN( PreQuery.`updatetime`)))),'%H:%i:%s') as Duration,
(UNIX_TIMESTAMP(MAX( PreQuery.`updatetime` )) - UNIX_TIMESTAMP(MIN( PreQuery.`updatetime`))) as DurationSeconds
from
( select
v_starting.callingname,
v_starting.geofence,
v_starting.`updatetime`,
@lastGroup := @lastGroup + if( @lastAddress = v_starting.geofence
AND @lastVehicle = v_starting.callingname, 0, 1 ) as GroupSeq,
@lastVehicle := v_starting.callingname as justVarVehicleChange,
@lastAddress := v_starting.geofence as justVarAddressChange
from
v_starting,
( select @lastVehicle := '',
@lastAddress := '',
@lastGroup := 0 ) SQLVars
order by
v_starting.`updatetime` ) PreQuery
Group By
PreQuery.callingname,
PreQuery.geofence,
PreQuery.GroupSeq) parent
where (InTime> DATE_SUB('2013-03-23 15:00', INTERVAL 24 HOUR) or OutTime> '2013-03-23 15:00' ) and vehicle='TT08' order by InTime asc
The MySQL query is detailed and therefore quite large, but the same thing could be shown with a much simpler query, like:
select vehicle, intime,outtime from vehicletimes
My desired result is something like:
select vehicle, intime(outtime of row above),outtime from vehicletimes
The first row's in time can stay as is, and the last row's out time can stay as is. I just need to account for every second between the smallest in time and the largest out time.
Any help appreciated as always.
Thanks in advance
I think this will give you the latest out-time at or before each row's in-time, for your existing records:
select
vt.vehicle, max(qGetMaxOut.outtime) as intime , vt.outtime
from
vehicle_times vt
inner join
(
select vehicle, outtime
from vehicle_times
) qGetMaxOut
on qGetMaxOut.vehicle = vt.vehicle
and qGetMaxOut.outtime <= vt.intime
group by
vt.vehicle, vt.outtime
The above query will also help you if you want to insert a new record, but need to find the previous in-time for a particular time (i.e. if you need to insert a new record whose in/out times are prior to the latest time, e.g. inserting a record that was somehow previously missed and where newer time entries have been added since). If you need this scenario, let me know and I'll elaborate if you can't work it out from the above.
The join basically joins the table "back on itself" to provide another "copy", but limits the results in the "copy" to only those rows for the current vehicle in the main table, and excludes those rows from the copy where the vehicle's out-time is more recent than the current in-time from the main table. This way you can do a MAX() over the copy, to find what the previous out time was.
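On MySQL 8.0 or later, which adds window functions, the same "previous row's out time" lookup can be sketched with LAG() instead of a self-join. This assumes the same vehicle_times table with vehicle, intime and outtime columns used in the query above.

```sql
SELECT vehicle,
       -- previous row's out time for the same vehicle;
       -- the first row has no predecessor, so it keeps its own in time
       COALESCE(
           LAG(outtime) OVER (PARTITION BY vehicle ORDER BY intime),
           intime
       ) AS intime,
       outtime
FROM vehicle_times
ORDER BY vehicle, intime;
```

Unlike the inner join, this keeps the first row for each vehicle, which matches the requirement that the first row's in time can stay as is.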
I don't know your specific requirements, but I would recommend storing the most accurate information you can. So if "synthesising" a value is just for cosmetic purposes on a few reports, I would leave the data alone and tidy up the report, rather than losing data that might come in handy down the track. For example, what happens if in the future you suddenly have a requirement to tell your boss "how long are our vehicles 'in' and sitting idle for?"
But if you do just want to insert a new record with the actual out-time ignored, and replaced by the in-time from the most recent record, then this following query will find that value for you:
select
vt.vehicle, max(vt.outtime) as intime
from
vehicle_times vt
group by
vt.vehicle
Have I missed your requirement?