I would like to ask about how can I take average of rows in a column that are between 5 minutes.
In order to be more accurate I have a table like this
id-----link_id---------date---------------------speed
0---------123------(24/4/2014 12:03:34)----------45
1---------123------(24/4/2014 12:04:34)----------43
2---------127------(24/4/2014 12:04:37)----------50
3---------123------(28/4/2014 12:03:34)----------60
i would like to create a new table that will have the average of speed for rows that have the same link_id and are between 5 minutes
In the case that I mentioned above only the two first rows comply the requirements
and i want a new table like this
id-----link_id---------date---------------------speed
0---------123------(24/4/2014 12:00:00)----------44
2---------127------(24/4/2014 12:00:00)----------50
3---------123------(28/4/2014 12:00:00)----------60
which is the query that i have to use to create a new table with those requirments?
thank you in advance
It is not clear what you mean by 'average of speed for rows that ... are between five minutes.' So I will guess.
I guess you want to compute the averages for each distinct five minute interval. For example, you want averages of all items with timestamps from 2014-04-24 12:00:00 to 2014-04-24:12:04:59, then another average for items with timestamps from 2014-04-24 12:05:00 to 2014-04-24:12:09:59, and so forth.
To do this, you need to start with an expression that will take any DATETIME value and round it down to the beginning of its five-minute interval. How do you get that?
First, this expression will round down a timestamp to the beginning of the minute in which it occurs:
DATE_FORMAT(`date`,'%Y-%m-%d %H:%i:00')
This expression gives the number of minutes past the hour, modulo 5.
MINUTE(`date`)%5
So, this expression gives you the rounded-down DATETIME you need:
DATE_FORMAT(`date`,'%Y-%m-%d %H:%i:00') - INTERVAL (MINUTE(`date`)%5) MINUTE
Great. Now we need to use that in an aggregate query to compute the average speeds.
SELECT link_id,
DATE_FORMAT(`date`,'%Y-%m-%d %H:%i:00') - INTERVAL (MINUTE(`date`)%5) MINUTE AS five_min
AVG(speed) AS avg_speed
FROM mytable
GROUP BY link_id,
DATE_FORMAT(`date`,'%Y-%m-%d %H:%i:00') - INTERVAL (MINUTE(`date`)%5) MINUTE
ORDER BY link_id,
DATE_FORMAT(`date`,'%Y-%m-%d %H:%i:00') - INTERVAL (MINUTE(`date`)%5) MINUTE
This will do the trick you need done. There will be one row for each distinct link_id and five-minute interval of time. The time interval will be named by giving the time at which it begins. Each row will contain the average speed for observations in that time interval.
It's helpful when creating your specification for this kind of query to think very carefully about what you want each row of your result set to contain. If you do that, you will probably find that your query flows naturally from your specification.
Here's a more extensive writeup on how to do this sort of thing.
http://www.plumislandmedia.net/mysql/sql-reporting-time-intervals/
Related
I have a table that looks like this:
id
slot
total
1
2022-12-01T12:00
100
2
2022-12-01T12:30
150
3
2022-12-01T13:00
200
There's an index on slot already. The table has ~100mil rows (and a bunch more columns not shown here)
I want to sum the total up to the current moment in time (EDIT: WASN'T CLEAR INITIALLY, I WILL PROVIDE A LOWER SLOT BOUND, SO THE SUM WILL BE OVER SOME NUMBER OF DAYS/WEEKS, NOT OVER FULL TABLE). Let's say the time is currently 2022-12-01T12:45. If I run select * from my_table where slot < CURRENT_TIMESTAMP(),
then I get back records 1 and 2.
However, in my data, the records represent forecasted sales within a time slot. I want to find the forecasts as of 2022-12-01T12:45, and so I want to find the proportion of the half hour slot of record 2 that has elapsed, and return that proportion of the total.
As of 2022-12-01T12:45 (assuming minute granularity), 50% of row 2 has elapsed, so I would expect the total to return as 150 / 2 = 75.
My current query works, but is slow. What are some ways I can optimise this, or other approaches I can take?
Also, how can we extend this solution to be generalised to any interval frequency? Maybe tomorrow we change our forecasting model and the data comes in sporadically. The hardcoded 30 would not work in that case.
select sum(fraction * total) as t from
select total,
LEAST(
timestampdiff(
minute,
datetime,
current_timestamp()
),
30
) / 30 as fraction
from my_table
where slot <= current_timestamp()
Consider computing your sum first, then remove the last element partial total. In order to keep the last element total, I'd prefer applying window functions instead of aggregations, and limit the output to the last row.
SET #current_time = CURRENT_TIMESTAMP();
WITH cte AS (
SELECT slot,
SUM(total) OVER(ORDER BY slot) AS total,
total AS rowtotal
FROM my_table
WHERE slot < #current_time
ORDER BY slot DESC
LIMIT 1
)
SELECT slot,
total - (30 - TIMESTAMPDIFF(MINUTE,
slot,
#current_time))
/30 * rowtotal AS total
FROM cte
Check the demo here.
Note1: Adding an index on the slot field is likely to boost this query performance.
Note2: If your query is running on millions of data, your timestamp may be likely to change during the query. You could store it into a variable before the query is run (or into another cte).
create an ondex in slot column btree as it is having high selectivity;
In a MySQL DB table that stores sale orders, I have a LastReviewed column that holds the last date and time when the sale order was modified (type timestamp, default value CURRENT_TIMESTAMP). I'd like to plot the number of sales that were modified each day, for the last 90 days, for a particular user.
I'm trying to craft a SELECT that returns the number of days since LastReviewed date, and how many records fall within that range. Below is my query, which works just fine:
SELECT DATEDIFF(CURDATE(), LastReviewed) AS days, COUNT(*) AS number FROM sales
WHERE UserID=123 AND DATEDIFF(CURDATE(),LastReviewed)<=90
GROUP BY days
ORDER BY days ASC
Notice that I am computing the DATEDIFF() as well as CURDATE() multiple times for each record. This seems really ineffective, so I'd like to know how I can reuse the results of the previous computation. The first thing I tried was:
SELECT DATEDIFF(CURDATE(), LastReviewed) AS days, COUNT(*) AS number FROM sales
WHERE UserID=123 AND days<=90
GROUP BY days
ORDER BY days ASC
Error: Unknown column 'days' in 'where clause'. So I started to look around the net. Based on another discussion (Can I reuse a calculated field in a SELECT query?), I next tried the following:
SELECT DATEDIFF(CURDATE(), LastReviewed) AS days, COUNT(*) AS number FROM sales
WHERE UserID=123 AND (SELECT days)<=90
GROUP BY days
ORDER BY days ASC
Error: Unknown column 'days' in 'field list'. I'm also tried the following:
SELECT #days := DATEDIFF(CURDATE(), LastReviewed) AS days,
COUNT(*) AS number FROM sales
WHERE UserID=123 AND #days <=90
GROUP BY days
ORDER BY days ASC
The query returns zero result, so #days<=90 seems to return false even though if I put it in the SELECT clause and remove the WHERE clause, I can see some results with #days values below 90.
I've gotten things to work by using a sub-query:
SELECT * FROM (
SELECT DATEDIFF(CURDATE(),LastReviewed) AS sales ,
COUNT(*) AS number FROM sales
WHERE UserID=123
GROUP BY days
) AS t
WHERE days<=90
ORDER BY days ASC
However I odn't know whether it's the most efficient way. Not to mention that even this solution computes CURDATE() once per record even though its value will be the same from the start to the end of the query. Isn't that wasteful? Am I overthinking this? Help would be welcome.
Note: Mods, should this be on CodeReview? I posted here because the code I'm trying to use doesn't actually work
There are actually two problems with your question.
First, you're overlooking the fact that WHERE precedes SELECT. When the server evaluates WHERE <expression>, it then already knows the value of the calculations done to evaluate <expression> and can use those for SELECT.
Worse than that, though, you should almost never write a query that uses a column as an argument to a function, since that usually requires the server to evaluate the expression for each row.
Instead, you should use this:
WHERE LastReviewed < DATE_SUB(CURDATE(), INTERVAL 90 DAY)
The optimizer will see this and get all excited, because DATE_SUB(CURDATE(), INTERVAL 90 DAY) can be resolved to a constant, which can be used on one side of a < comparison, which means that if an index exists with LastReviewed as the leftmost relevant column, then the server can immediately eliminate all of the rows with LastReviewed >= that constant value, using the index.
Then DATEDIFF(CURDATE(), LastReviewed) AS days (still needed for SELECT) will only be evaluated against the rows we already know we want.
Add a single index on (UserID, LastReviewed) and the server will be able to pinpoint exactly the relevant rows extremely quickly.
Builtin functions are much less costly than, say, fetching rows.
You could get a lot more performance improvement with the following 'composite' index:
INDEX(UserID, LastReviewed)
and change to
WHERE UserID=123
AND LastReviewed >= CURRENT_DATE() - INTERVAL 90 DAY
Your formulation is 'hiding' LastRevieded in a function call, making it unusable in an index.
If you are still not satisfied with that improvement, then consider a nightly query that computes yesterday's statistics and puts them in a "Summary table". From there, the SELECT you mentioned can run even faster.
SELECT
name,
start_time,
TIME(cancelled_date) AS cancelled_time,
TIMEDIFF(start_time, TIME(cancelled_date)) AS difference
FROM
bookings
I'm trying to get from the database a list of bookings which were cancelled with less than an hour's notice. The start time and the cancellation times are both in TIME format, I know a timestamp would have made this easier. So above I've calculated the time difference between the two values and now need to add a WHERE clause to restrict it to only those records that have a difference of under 1:00:00. Obviously this isn't a number, it's a time, so a simple bit of maths won't do it.
start_time is a TIME
cancelled_date is a DATETIME but I'm converting it to TIME in the query to then calculate cancelled_time and difference.
I would be inclined to do this by adding and hour to the notice, something like this:
WHERE start_time > date_add(cancelled_date, interval 1 hour)
I can't quite tell what the right logic is from the question, because your column names don't match the description.
In this case, so a subtraction or doing the comparison are similar performance wise. But, if you had a constant instead of cancelled_date, then there is a difference. The following:
WHERE start_time < date_add(now(), interval -1 hour)
Allows the engine to use an index on start_time.
you can use having difference<time('1:00')
I have some problem with MYSQL,I need to subtract the data between two particular times,for every 5 minutes and then average it the 5 minutes data.
What I am doing now is:
select (avg(columnname)),convert((min(datetime) div 500)*500, datetime) + INTERVAL 5 minute as endOfInterval
from Databasename.Tablename
where datetime BETWEEN '2012-09-12 10:50:00' AND '2012-09-12 14:50:00'
group by datetime div 500;
It is the cumulative average.
Suppose i get 500 at 11 o' clock and 700 at 11.05 ,the average i need is (700-500)/5 = 40.
But now i am getting (500+700)/5 = 240.
I dont need the cumulative average .
Kindly help me.
For the kind of average you're talking about, you don't want to aggregate multiple rows using a GROUP BY clause. INstead, you want to compute your result using exactly two diffrent rows from the same table. This calls for a self-join:
SELECT (b.columnname - a.columnname)/5, a.datetime, b.datetime
FROM Database.Tablename a, Database.Tablename b
WHERE b.datetime = a.datetime + INTERVAL 5 MINUTE
AND a.datetime BETWEEN '2012-09-12 10:50:00' AND '2012-09-12 14:45:00'
a and b refer to two different rows of the same table. The WHERE clause ensures that they are exactly 5 minutes apart.
If there is no second column matching that temporal distance, no resulting row will be included in the query result. If your table doesn't have data points exactly every five minutes, but you have to search for the suitable partner instead, then things become much more difficult. This answer might perhaps be adjusted for that use case. Or you might implement this at the application level, instead of on the database server.
I have a table that is getting hundreds of requests per minute. The issue that I'm having is that I need a way to select only the rows that have been inserted in the past 5 minutes. I am trying this:
SELECT count(id) as count, field1, field2
FROM table
WHERE timestamp > DATE_SUB(NOW(), INTERVAL 5 MINUTE)
ORDER BY timestamp DESC
My issue is that it returns 70k+ results and counting. I am not sure what it is that I am doing wrong, but I would love to get some help on this. In addition, if there were a way to group them by minute to have it look like:
| count | field1 | field2 |
----------------------------
I'd love the help and direction on this, so please let me know your thoughts.
You don't really need DATE_ADD/DATE_SUB, date arithmetic is much simpler:
SELECT COUNT(id), DATE_FORMAT(`timestamp`, '%Y-%m-%d %H:%i')
FROM `table`
WHERE `timestamp` >= CURRENT_TIMESTAMP - INTERVAL 5 MINUTE
GROUP BY 2
ORDER BY 2
The following seems like it would work which is mighty close to what you had:
SELECT
MINUTE(date_field) as `minute`,
count(id) as count
FROM table
WHERE date_field > date_sub(now(), interval 5 minute)
GROUP BY MINUTE(date_field)
ORDER BY MINUTE(date_field);
Note the added column to show the minute and the GROUP BY clause that gathers up the results into the corresponding minute. Imagine that you had 5 little buckets labeled with the last 5 minutes. Now imagine you tossed each row that was 4 minutes old into it's own bucket. count() will then count the number of entries found in each bucket. That's a quick visualization on how GROUP BY works. http://www.tizag.com/mysqlTutorial/mysqlgroupby.php seems to be a decent writeup on GROUP BY if you need more info.
If you run that and the number of entries in each minute seems too high, you'll want to do some troubleshooting. Try replacing COUNT(id) with MAX(date_field) and MIN(date_field) so you can get an idea what kind of dates it is capturing. If MIN() and MAX() are inside the range, you may have more data written to your database than you realize.
You might also double check that you don't have dates in the future as they would all be > now(). The MIN()/MAX() checks mentioned above should identify that too if it's a problem.