Given a table "events_log" in this form:
| id | started_at          | duration |
| 1  | 2017-06-01 09:00:00 | 80       |
| 2  | 2017-06-01 09:01:00 | 40       |
| 3  | 2017-06-01 09:01:23 | 20       |
I want to know when the most events were occurring (with minute precision):
| period              | count |
| 2017-06-01 09:00:00 | 1     |
| 2017-06-01 09:01:00 | 3     |
In reality, there are millions of events to handle.
My solution is to:
1. Create a temporary table with event starts grouped by minute
2. LEFT JOIN it with the events that overlap each period
See http://sqlfiddle.com/#!9/8546a/1
But the performance is terrible...
Is there a better way to do it?
I would think group by, something like this:
select date_format(started_at, '%Y-%m-%d %H:%i') as yyyymmddhhmi, count(*)
from t
group by yyyymmddhhmi
order by count(*) desc
limit 10;
Performance will not be great.
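If the table is large, one hedged mitigation (assuming MySQL 5.7+; the column and index names are made up) is to materialize the truncated minute in a stored generated column so the GROUP BY can work from an index:
-- sketch: persist the minute each event starts in, then group on it
ALTER TABLE t
  ADD COLUMN started_minute DATETIME
    AS (started_at - INTERVAL SECOND(started_at) SECOND) STORED,
  ADD INDEX idx_started_minute (started_minute);

SELECT started_minute, COUNT(*)
FROM t
GROUP BY started_minute
ORDER BY COUNT(*) DESC
LIMIT 10;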
Here is a modified version of your code. It will scan through the events_log table twice: once when building the event_starts helper table, and a second time when selecting all events that are happening in the specified interval. Also note the added index, which will significantly speed up execution; its absence might also be the reason why your original query was so slow.
CREATE TABLE events_log (id INT NOT NULL AUTO_INCREMENT PRIMARY KEY, started_at DATETIME, duration INT(11));
INSERT INTO events_log (started_at, duration) VALUES ('2017-06-01 09:00:00', 80);
INSERT INTO events_log (started_at, duration) VALUES ('2017-06-01 09:01:00', 40);
INSERT INTO events_log (started_at, duration) VALUES ('2017-06-01 09:01:23', 20);
CREATE /* TEMPORARY */ TABLE tmp_event_starts AS (
select DISTINCT DATE_ADD(started_at, INTERVAL -SECOND(started_at) SECOND) AS period_start
from events_log
);
create index idx_tmp_event_starts
on tmp_event_starts (period_start);
select period_start, count(*), group_concat(id) from events_log as log
join tmp_event_starts as per
on per.period_start >= DATE_ADD(started_at, INTERVAL -SECOND(started_at) SECOND)
and per.period_start <= DATE_ADD(started_at, INTERVAL duration SECOND) -- compare to the real event end, so events spilling into a later minute are counted there
group by period_start
;
If you have a lot of events happening in the same minute and there are no minutes without events, then you might consider generating the helper table as a sequence of minutes, independent of the data. In MySQL that is quite a tricky task, but some hints can be found in the blog post Calendar Tables: An Invaluable Database Tool.
It would also allow you to generate the helper table in advance, significantly speeding up execution of the query itself.
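On MySQL 8.0+ a recursive CTE makes this much less painful; here is a minimal sketch (the helper table name, date range, and session setting are illustrative assumptions, not part of the original answer):
-- pre-generate one row per minute; assumes MySQL 8.0+ recursive CTEs
SET SESSION cte_max_recursion_depth = 100000; -- the default of 1000 is too low for a full day of minutes
CREATE TABLE minute_periods (period_start DATETIME NOT NULL PRIMARY KEY);
INSERT INTO minute_periods (period_start)
WITH RECURSIVE m (period_start) AS (
  SELECT TIMESTAMP('2017-06-01 00:00:00')
  UNION ALL
  SELECT period_start + INTERVAL 1 MINUTE FROM m
  WHERE period_start < '2017-06-02 00:00:00'
)
SELECT period_start FROM m;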
You might also consider adding an ended_at column to your events_log table, which would eliminate the need for the conversion during query execution.
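For example, on MySQL 5.7+ a stored generated column could maintain ended_at automatically (a sketch; the column and index names are made up):
ALTER TABLE events_log
  ADD COLUMN ended_at DATETIME
    AS (started_at + INTERVAL duration SECOND) STORED,
  ADD INDEX idx_ended_at (ended_at);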
Related
I have a table that holds scan datetime values. I want to find the start and stop scan times of the users' main portion of scanning. The issue is that a user may perform some checks before or after the bulk of the scanning and generate a few more scans. The data might look as below.
....
| 2020-04-01 19:48:05 |
| 2020-04-01 19:48:22 |
| 2020-04-01 19:48:23 |
| 2020-04-01 19:48:48 |
| 2020-04-01 19:48:49 |
| 2020-04-01 20:45:33 |
+---------------------+
If I group by the date and grab the min/max of these values, my elapsed time will be much larger than the actual. In the case above the max would add almost 1 hour of extra time, which was not really spent scanning.
SELECT date, MIN(datetime), MAX(datetime) FROM table GROUP BY date
There might be 1 extra scan or there might be several scans at the beginning or the end of the data, so throwing out the first and last data points is not really an option.
Hmmm . . . I think this is a gap and islands problem. You need some definition of when an outlier occurs. Say it is 5 minutes:
select min(datetime), max(datetime), count(*) as num_scans
from (select t.*,
sum(case when prev_datetime > datetime - interval 5 minute then 0 else 1 end) over (order by datetime) as grp
from (select t.*,
lag(datetime) over (order by datetime) as prev_datetime
from t
) t
) t
group by grp;
I'm not sure how you distinguish actual scans from the outliers. Perhaps a group only counts as real scanning if it has more than one row. If that is the case, you can remove the outlier groups with logic such as HAVING COUNT(*) > 1.
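For example, a hedged rewrite of the query above with that filter added (still assuming MySQL 8.0+ for the window functions):
select min(datetime) as start_scan, max(datetime) as stop_scan, count(*) as num_scans
from (select t.*,
             sum(case when prev_datetime > datetime - interval 5 minute then 0 else 1 end)
                 over (order by datetime) as grp
      from (select t.*,
                   lag(datetime) over (order by datetime) as prev_datetime
            from t
           ) t
     ) t
group by grp
having count(*) > 1; -- drop islands with a single scan, treating them as outliers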
I have a table which contains a task list for persons. The following are its columns:
+---------+-----------+-------------------+------------+---------------------+
| task_id | person_id | task_name | status | due_date_time |
+---------+-----------+-------------------+------------+---------------------+
| 1 | 111 | walk 20 min daily | INCOMPLETE | 2017-04-13 17:20:23 |
| 2 | 111 | brisk walk 30 min | COMPLETE | 2017-03-14 20:20:54 |
| 3 | 111 | take medication | COMPLETE | 2017-04-20 15:15:23 |
| 4 | 222 | sport | COMPLETE | 2017-03-18 14:45:10 |
+---------+-----------+-------------------+------------+---------------------+
I want to find out the monthly compliance percentage (completed tasks / total tasks * 100) of each person, like:
+---------------+-----------+------------+------------+
| compliance_id | person_id | compliance | month |
+---------------+-----------+------------+------------+
| 1 | 111 | 100 | 2017-03-01 |
| 2 | 111 | 50 | 2017-04-01 |
| 3 | 222 | 100 | 2017-03-01 |
+---------------+-----------+------------+------------+
Here person_id 111 has 1 task in March 2017, whose status is COMPLETE; as 1 out of 1 tasks was completed in March, compliance is 100%.
Currently, I am using a separate table which stores this compliance, but I have to recalculate the compliance and update that table every time a task status changes.
I have also tried creating a view, but it's taking too much time to execute: almost 0.5 seconds for 1 million records.
CREATE VIEW `person_compliance_view` AS
SELECT
`t`.`person_id`,
CAST((`t`.`due_date_time` - INTERVAL (DAYOFMONTH(`t`.`due_date_time`) - 1) DAY)
AS DATE) AS `month`,
COUNT(`t`.`status`) AS `total_count`,
COUNT((CASE
WHEN (`t`.`status` = 'COMPLETE') THEN 1
END)) AS `completed_count`,
CAST(((COUNT((CASE
WHEN (`t`.`status` = 'COMPLETE') THEN 1
END)) / COUNT(`t`.`status`)) * 100)
AS DECIMAL (10 , 2 )) AS `compliance`
FROM
`task` `t`
WHERE
    ((`t`.`isDeleted` = 0)
        AND (`t`.`due_date_time` < NOW()))
GROUP BY `t`.`person_id` , EXTRACT(YEAR_MONTH FROM `t`.`due_date_time`)
Is there any optimized way to do it?
The first question to consider is whether the view can be optimized to give the required performance. This may mean making some changes to the underlying tables and data structure. For example, you might want indexes and you should check query plans to see where they would be most effective.
Other possible changes which would improve efficiency include adding an extra column "year_month" to the base table, which you could populate via a trigger. Another possibility would be to move all the deleted tasks to an 'archive' table to give the view less data to search through.
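As an illustration of the extra-column idea, a hedged sketch using a MySQL 5.7+ stored generated column instead of a trigger (the column and index names are assumptions):
ALTER TABLE task
  ADD COLUMN `year_month` CHAR(7) -- backticks needed: YEAR_MONTH is a reserved word
    AS (DATE_FORMAT(due_date_time, '%Y-%m')) STORED,
  ADD INDEX idx_person_ym (person_id, `year_month`);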
Whatever you do, a view will always perform worse than a table (assuming the table has relevant indexes). So depending on your needs you may find you need to use a table. That doesn't mean you should junk your view entirely. For example, if a daily refresh of your table is sufficient, you could use your view to help:
truncate table compliance;
insert into compliance select * from compliance_view;
Truncate is more efficient than delete, but you can't roll it back, so you might prefer to use delete and top-and-tail with START TRANSACTION; ... COMMIT;. I've never created scheduled jobs in MySQL, but if you need help, this looks like a good starting point: here
If daily isn't often enough, you could schedule this to run more often than daily, but better options will be triggers and/or "partial refreshes" (my term, I've no idea if there is a technical term for the idea).
A perfectly written trigger would spot any relevant insert/update/delete and then insert/update/delete the related records in the compliance table. The logic is a little daunting, and I won't attempt it here. An easier option would be a "partial refresh" called within a trigger. The trigger would spot the user targeted by the change, delete only the records from compliance which are related to that user, and then insert from your compliance_view the records relating to that user. You should be able to put that into a stored procedure which is called by the trigger.
Update expanding on the options (if a view just won't do):
Option 1: Daily full (or more frequent) refresh via a schedule
You'd want code like this executed (at least) daily.
truncate table compliance;
insert into compliance select * from compliance_view;
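A hedged sketch of scheduling that with the MySQL Event Scheduler (the event name is made up, and the scheduler must be enabled with SET GLOBAL event_scheduler = ON):
DELIMITER $$
CREATE EVENT refresh_compliance_daily
ON SCHEDULE EVERY 1 DAY
DO
BEGIN
  TRUNCATE TABLE compliance; -- implicit commit; swap for DELETE inside a transaction if you need rollback
  INSERT INTO compliance SELECT * FROM person_compliance_view;
END$$
DELIMITER ;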
Option 2: Partial refresh via trigger
I don't work with triggers often, so I can't recall the syntax, but the logic should be as follows (not actual code, just pseudo-code; a concrete sketch follows below):
AFTER INSERT -- you may need one for each of INSERT / UPDATE / DELETE
FOR EACH ROW -- or if there are multiple rows and you can trigger only on the last one to be changed, that would be better
DELETE FROM compliance
WHERE person_id = INSERTED.person_id
INSERT INTO compliance select * from compliance_view where person_id = INSERTED.person_id
END
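In real MySQL syntax that might look roughly like the sketch below; MySQL uses NEW.person_id rather than INSERTED, you would need matching UPDATE and DELETE triggers (the latter using OLD.person_id), and all names are illustrative:
DELIMITER $$
CREATE TRIGGER task_after_insert
AFTER INSERT ON task
FOR EACH ROW
BEGIN
  -- rebuild only the changed person's compliance rows
  DELETE FROM compliance WHERE person_id = NEW.person_id;
  INSERT INTO compliance -- adjust the column list if compliance has extra columns such as compliance_id
  SELECT * FROM person_compliance_view WHERE person_id = NEW.person_id;
END$$
DELIMITER ;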
Option 3: Smart update via trigger
This would be similar to option 2, but instead of deleting all the rows from compliance that relate to the relevant person_id and creating them from scratch, you'd work out which ones to update, update them, and decide whether any should be added / deleted. The logic is a little involved, and I'm not going to attempt it here.
Personally, I'd be most tempted by Option 2, but you'd need to combine it with option 1, since the data goes stale due to the use of now().
Here's a similar way of writing the same thing...
Views are of very limited benefit in MySQL, and I think should generally be avoided.
DROP TABLE IF EXISTS my_table;
CREATE TABLE my_table
(task_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY
,person_id INT NOT NULL
,task_name VARCHAR(30) NOT NULL
,status ENUM('INCOMPLETE','COMPLETE') NOT NULL
,due_date_time DATETIME NOT NULL
);
INSERT INTO my_table VALUES
(1,111,'walk 20 min daily','INCOMPLETE','2017-04-13 17:20:23'),
(2,111,'brisk walk 30 min','COMPLETE','2017-03-14 20:20:54'),
(3,111,'take medication','COMPLETE','2017-04-20 15:15:23'),
(4,222,'sport','COMPLETE','2017-03-18 14:45:10');
SELECT person_id
, DATE_FORMAT(due_date_time,'%Y-%m') yearmonth
, SUM(status = 'complete')/COUNT(*) x
FROM my_table
GROUP
BY person_id
, yearmonth;
person_id yearmonth x
111 2017-03 1.0
111 2017-04 0.5
222 2017-03 1.0
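If you want the 0-100 figure from the question rather than a fraction, the same expression scales directly; a hedged variant:
SELECT person_id
     , DATE_FORMAT(due_date_time,'%Y-%m') yearmonth
     , ROUND(100 * SUM(status = 'COMPLETE') / COUNT(*), 2) compliance
  FROM my_table
 GROUP
    BY person_id
     , yearmonth;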
I would like to create an event that executes an INSERT query once exactly 15 days have passed since a row's lend_date.
It would get the ID and userid of that row and insert them into another table.
For example:
id | userid | lend_date
---+----------+----------------------
1 | 1 | 2015-09-24 15:58:48
2 | 1 | 2015-09-21 08:22:48
And right now, it is exactly 2015-10-06 08:23:41. So the event should get the ID of the second row, which is 2, and insert it into another table.
What should my event query look like?
The event type is RECURRING, but I'm also not sure whether I should execute it every hour or every day. What would be the best recommendation for it?
And is this a better way than using Task Scheduler?
The other table that I want to insert the fetched ID into is notification_table, where it will notify the user that he/she has an overdue date.
notification_table looks like this:
id | userid | notificationid | notificationdate |
---+----------+------------------+----------------------+
1 | 1 | 1 | 2015-09-24 15:58:48 |
2 | 1 | 1 | 2015-09-21 08:22:48 |
I'm looking at this query:
INSERT INTO notification_table (id, userid, notificationid, notificationdate)
SELECT id, userid, 1, NOW()
FROM first_table
WHERE lend_date + INTERVAL 15 DAY = NOW();
Seeing the words exactly, event, and datetime in the same sentence makes me cringe. Why? For one thing, it's hard to get one datetime value to exactly match another. For another thing, events often run slightly after the scheduled time, especially on a busy database server. It takes them a little time to start up.
If you need the id values from a table where the records are more than 15 days old, the most time-precise way to get them is with a query or view.
CREATE OR REPLACE VIEW fifteen
AS SELECT id
FROM `table` -- TABLE is a reserved word, hence the backticks
WHERE `datetime` < NOW() - INTERVAL 15 DAY
You can, of course, write an event to copy the ids to a new table. You'll have to go to some trouble to make sure you don't hit the same id values more than once, by using this sort of query in the event.
INSERT INTO newtable (id)
SELECT id
FROM `table`
WHERE `datetime` < NOW() - INTERVAL 15 DAY
AND id NOT IN (SELECT id FROM newtable)
How often should you run the repeating event? That depends entirely on how quickly the id values need to make it into the new table after they turn fifteen days old. If your application requires it to be less than a minute, you really should go with the view rather than the event. Anything more than a minute of allowable delay will let you use a repeating event at that frequency.
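If you do settle on an event, a hedged sketch of what it might look like at an hourly frequency (the event name is made up; the scheduler must be enabled with SET GLOBAL event_scheduler = ON):
CREATE EVENT copy_overdue_ids
ON SCHEDULE EVERY 1 HOUR
DO
  INSERT INTO newtable (id)
  SELECT id
  FROM `table`
  WHERE `datetime` < NOW() - INTERVAL 15 DAY
    AND id NOT IN (SELECT id FROM newtable);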
I have a program which scans the users that are online on a server and for each user found inserts a new row in a table. This scan occurs once every 5 minutes and the data is used to draw a user-activity graph on a website.
Here is the structure of my table:
-------------------------------------------------------
| stats_table |
-------------------------------------------------------
| id, bigint(20) unsigned not null PRI auto_increment |
| scan_id, bigint(20) unsigned not null |
| username, varchar(32) null |
| time_scanned, timestamp not null def=curr_timestamp |
-------------------------------------------------------
I want to get the aggregate number of users found since midnight, for each scan.
I have managed to get this, but the query takes over 15 seconds to finish:
SELECT s.scan_id, COUNT(*) FROM (SELECT DISTINCT t.scan_id, t1.username FROM
stats_table t INNER JOIN stats_table t1 ON
t.scan_id >= t1.scan_id WHERE
t1.time_scanned > CONCAT(DATE(t.time_scanned), ' 00:00:00') AND
t1.time_scanned > DATE_SUB(NOW(), INTERVAL 24 HOUR) AND
t1.time_scanned <= NOW()
) s GROUP BY s.scan_id
so I'm wondering if there is a faster way to get this result?
Here is a visual representation of my graph: blue represents currently online users, and red the aggregate number of users seen so far today.
To clarify, at 17:00 hours 2 users disconnected and then 15 minutes later 2 new users connected to the server for the first time since midnight. You can see how the red line goes from 7 up to 9 to represent this. Similarly, a new user also connected for the first time today at 23:00.
Since I do not see an index definition, I'll assume there is none.
What you need to do is add an index that supports the query you are running:
http://use-the-index-luke.com/
http://dev.mysql.com/doc/innodb/1.1/en/innodb-create-index-examples.html
http://en.wikipedia.org/wiki/Database_index
Note this will most likely cause a slower insert/update.
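As a hedged starting point, based only on the columns used in the question's WHERE and JOIN clauses (verify with EXPLAIN):
ALTER TABLE stats_table
  ADD INDEX idx_time_scan (time_scanned, scan_id, username); -- covering index for the self-join; the column order is an assumption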
I have a database with a created_at column containing the datetime in Y-m-d H:i:s format.
The latest datetime entry is 2011-09-28 00:10:02.
I need the query to be relative to the latest datetime entry.
The first value in the query should be the latest datetime entry.
The second value in the query should be the entry closest to 7 days from the first value.
The third value should be the entry closest to 7 days from the second value.
REPEAT #3.
What I mean by "closest to 7 days from":
The following are dates; the interval I desire is a week, which is 604800 seconds.
7 days from the first value is equal to 1316578202 (1317183002-604800)
the value closest to 1316578202 (7 days) is... 1316571974
unix timestamp | Y-m-d H:i:s
1317183002 | 2011-09-28 00:10:02 -> appear in query (first value)
1317101233 | 2011-09-27 01:27:13
1317009182 | 2011-09-25 23:53:02
1316916554 | 2011-09-24 22:09:14
1316836656 | 2011-09-23 23:57:36
1316745220 | 2011-09-22 22:33:40
1316659915 | 2011-09-21 22:51:55
1316571974 | 2011-09-20 22:26:14 -> closest to 7 days from 1317183002 (first value)
1316499187 | 2011-09-20 02:13:07
1316064243 | 2011-09-15 01:24:03
1315967707 | 2011-09-13 22:35:07 -> closest to 7 days from 1316571974 (second value)
1315881414 | 2011-09-12 22:36:54
1315794048 | 2011-09-11 22:20:48
1315715786 | 2011-09-11 00:36:26
1315622142 | 2011-09-09 22:35:42
I would really appreciate any help; I have not been able to do this in MySQL, and no online resources seem to deal with relative date manipulation such as this. I would like the query to be modular enough to change the interval to weekly, monthly, or yearly. Thanks in advance!
Answer #1 Reply:
SELECT
UNIX_TIMESTAMP(created_at)
AS unix_timestamp,
(
SELECT MIN(UNIX_TIMESTAMP(created_at))
FROM my_table
WHERE created_at >=
(
SELECT max(created_at) - 7
FROM my_table
)
)
AS `random_1`,
(
SELECT MIN(UNIX_TIMESTAMP(created_at))
FROM my_table
WHERE created_at >=
(
SELECT MAX(created_at) - 14
FROM my_table
)
)
AS `random_2`
FROM my_table
WHERE created_at =
(
SELECT MAX(created_at)
FROM my_table
)
Returns:
unix_timestamp | random_1 | random_2
1317183002 | 1317183002 | 1317183002
Answer #2 Reply:
RESULT SET:
This is the result set for a yearly interval:
id | created_at | period_index | period_timestamp
267 | 2010-09-27 22:57:05 | 0 | 1317183002
1 | 2009-12-10 15:08:00 | 1 | 1285554786
I desire this result:
id | created_at | period_index | period_timestamp
626 | 2011-09-28 00:10:02 | 0 | 0
267 | 2010-09-27 22:57:05 | 1 | 1317183002
I hope this makes more sense.
It's not exactly what you asked for, but the following example is pretty close....
Example 1:
select
floor(timestampdiff(SECOND, tbl.time, most_recent.time)/604800) as period_index,
unix_timestamp(max(tbl.time)) as period_timestamp
from
tbl
, (select max(time) as time from tbl) most_recent
group by period_index
gives results:
+--------------+------------------+
| period_index | period_timestamp |
+--------------+------------------+
| 0 | 1317183002 |
| 1 | 1316571974 |
| 2 | 1315967707 |
+--------------+------------------+
This breaks the dataset into groups based on "periods", where (in this example) each period is 7-days (604800 seconds) long. The period_timestamp that is returned for each period is the 'latest' (most recent) timestamp that falls within that period.
The period boundaries are all computed based on the most recent timestamp in the database, rather than computing each period's start and end time individually based on the timestamp of the period before it. The difference is subtle - your question requests the latter (iterative approach), but I'm hoping that the former (approach I've described here) will suffice for your needs, since SQL doesn't lend itself well to implementing iterative algorithms.
If you really do need to determine each period based on the timestamp in the previous period, then your best bet is going to be an iterative approach -- either using a programming language of your choice (like php), or by building a stored procedure that uses a cursor.
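For completeness, here is a rough sketch of that iterative approach as a stored procedure; it assumes the `tbl` schema shown in Edit #1 below, and all names are illustrative:
DELIMITER $$
CREATE PROCEDURE walk_periods(IN period_seconds INT)
BEGIN
  DECLARE cur_time DATETIME;
  DECLARE done INT DEFAULT 0;
  DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;

  DROP TEMPORARY TABLE IF EXISTS period_results;
  CREATE TEMPORARY TABLE period_results (period_timestamp DATETIME);

  SELECT MAX(`time`) INTO cur_time FROM tbl;
  WHILE done = 0 AND cur_time IS NOT NULL DO
    INSERT INTO period_results VALUES (cur_time);
    -- find the entry closest to (previous timestamp - interval), looking at earlier rows only
    SELECT `time` INTO cur_time
      FROM tbl
     WHERE `time` < cur_time
     ORDER BY ABS(TIMESTAMPDIFF(SECOND, `time`,
                  cur_time - INTERVAL period_seconds SECOND))
     LIMIT 1;
  END WHILE;

  SELECT * FROM period_results;
END$$
DELIMITER ;

-- usage: CALL walk_periods(604800); -- weekly; pass a different interval for monthly or yearly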
Edit #1
Here's the table structure for the above example.
CREATE TABLE `tbl` (
`id` int(10) unsigned NOT NULL auto_increment PRIMARY KEY,
`time` datetime NOT NULL
)
Edit #2
Ok, first: I've improved the original example query (see revised "Example 1" above). It still works the same way, and gives the same results, but it's cleaner, more efficient, and easier to understand.
Now... the query above is a group-by query, meaning it shows aggregate results for the "period" groups as I described above - not row-by-row results like a "normal" query. With a group-by query, you're limited to using aggregate columns only. Aggregate columns are those columns that are named in the group by clause, or that are computed by an aggregate function like MAX(time). It is not possible to extract meaningful values for non-aggregate columns (like id) from within the projection of a group-by query.
Unfortunately, mysql doesn't generate an error when you try to do this. Instead, it just picks a value at random from within the grouped rows, and shows that value for the non-aggregate column in the grouped result. This is what's causing the odd behavior the OP reported when trying to use the code from Example #1.
Fortunately, this problem is fairly easy to solve. Just wrap another query around the group query, to select the row-by-row information you're interested in...
Example 2:
SELECT
entries.id,
entries.time,
periods.idx as period_index,
unix_timestamp(periods.time) as period_timestamp
FROM
tbl entries
JOIN
(select
floor(timestampdiff( SECOND, tbl.time, most_recent.time)/31536000) as idx,
max(tbl.time) as time
from
tbl
, (select max(time) as time from tbl) most_recent
group by idx
) periods
ON entries.time = periods.time
Result:
+-----+---------------------+--------------+------------------+
| id | time | period_index | period_timestamp |
+-----+---------------------+--------------+------------------+
| 598 | 2011-09-28 04:10:02 | 0 | 1317183002 |
| 996 | 2010-09-27 22:57:05 | 1 | 1285628225 |
+-----+---------------------+--------------+------------------+
Notes:
Example 2 uses a period length of 31536000 seconds (365-days). While Example 1 (above) uses a period of 604800 seconds (7-days). Other than that, the inner query in Example 2 is the same as the primary query shown in Example 1.
If a matching period_time belongs to more than one entry (i.e. two or more entries have the exact same time, and that time matches one of the selected period_time values), then the above query (Example 2) will include multiple rows for the given period timestamp (one for each match). Whatever code consumes this result set should be prepared to handle such an edge case.
It's also worth noting that these queries will perform much, much better if you define an index on your datetime column. For my example schema, that would look like this:
ALTER TABLE tbl ADD INDEX idx_time ( time )
If you're willing to go for the closest entry that falls after the week is out, then this'll work. You can extend it to find the truly closest entry, but it'll look so disgusting it's probably not worth it.
select unix_timestamp
, ( select min(unix_tstamp)
from my_table
where sql_tstamp >= ( select max(sql_tstamp) - interval 7 day
from my_table )
)
, ( select min(unix_tstamp)
from my_table
where sql_tstamp >= ( select max(sql_tstamp) - interval 14 day
from my_table )
)
from my_table
where sql_tstamp = ( select max(sql_tstamp)
from my_table )