Need help Speeding up this Query - mysql

I have a program which scans the users that are online on a server and for each user found inserts a new row in a table. This scan occurs once every 5 minutes and the data is used to draw a user-activity graph on a website.
Here is the structure of my table:
-------------------------------------------------------
| stats_table |
-------------------------------------------------------
| id, bigint(20) unsigned not null PRI auto_increment |
| scan_id, bigint(20) unsigned not null |
| username, varchar(32) null |
| time_scanned, timestamp not null def=curr_timestamp |
-------------------------------------------------------
I want to get the aggregate number of users found since midnight, for each scan.
I have managed to get this, but the query takes over 15 seconds to finish:
SELECT COUNT(*) FROM (SELECT DISTINCT t.scan_id, t1.username FROM
stats_table t INNER JOIN stats_table t1 ON
t.scan_id >= t1.scan_id WHERE
t1.time_scanned > CONCAT(DATE(t.time_scanned), ' 00:00:00') AND
t1.time_scanned > DATE_SUB(NOW(), INTERVAL 24 HOUR) AND
t1.time_scanned <= NOW()
) s GROUP BY s.scan_id
Is there a faster way to get this result?
Here is a visual representation on my graph. Blue represents currently online users, and red the aggregate number of users seen so far today:
To clarify, at 17:00 hours 2 users disconnected and then 15 minutes later 2 new users connected to the server for the first time since midnight. You can see how the red line goes from 7 up to 9 to represent this. Similarly, a new user also connected for the first time today at 23:00.

Since I do not see an index definition, I'll assume there is none.
What you need to do is add an index on the columns your query joins and filters on:
http://use-the-index-luke.com/
http://dev.mysql.com/doc/innodb/1.1/en/innodb-create-index-examples.html
http://en.wikipedia.org/wiki/Database_index
Note this will most likely cause a slower insert/update.
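As a concrete starting point, a composite index along the following lines may help. This is only a sketch: the index name is made up, and the column order assumes the time_scanned range is the most selective filter, so compare the EXPLAIN output before and after.

```sql
-- Hypothetical index name; the columns come from the question's stats_table.
-- time_scanned is first because the query filters on a time range; scan_id
-- and username are included so the index can cover the join and DISTINCT.
ALTER TABLE stats_table
    ADD INDEX idx_time_scan_user (time_scanned, scan_id, username);

-- Check whether the optimizer picks the new index for a time-range lookup.
EXPLAIN
SELECT scan_id, username
FROM stats_table
WHERE time_scanned > NOW() - INTERVAL 24 HOUR;
```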


DISTINCT COUNT with GROUP BY query is too slow despite indexes

I have the following query that counts the number of vessels in each zone for each week:
SELECT zone,
DATE_FORMAT(creation_date, '%Y%u') AS date,
COUNT(DISTINCT vessel_imo) AS vessel_count
FROM vessel_position
WHERE zone IS NOT NULL
AND creation_date >= DATE_SUB(CURDATE(), INTERVAL 12 MONTH)
GROUP BY zone, date;
The table has about 40 million rows. The execution plan for this is:
+----+-------------+-----------------+------------+-------+--------------------+------+---------+------+----------+----------+------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-----------------+------------+-------+--------------------+------+---------+------+----------+----------+------------------------------------------+
| 1 | SIMPLE | vessel_position | NULL | range | creation_date,zone | zone | 5 | NULL | 21190904 | 50.00 | Using where; Using index; Using filesort |
+----+-------------+-----------------+------------+-------+--------------------+------+---------+------+----------+----------+------------------------------------------+
The columns vessel_imo, zone and creation_date are each indexed. The primary key is the composite key (vessel_imo, creation_date).
When I look at the query profile, I can see that a large amount of time is spent in the Creating sort index stage.
Is there anything I can do to improve this query further?
Assuming the data, once inserted, does not change, then build and maintain a Summary Table.
The table would have three columns: the zone, the week, and the count-distinct for that week. At the start of each week, build only the rows for the previous week (one per zone; skip NULL). Then build a query to work against that table -- it will be extremely fast since it will be fetching far fewer rows.
Meanwhile, adding INDEX(creation_date, zone, vessel_imo) as a secondary index will make the weekly task reasonably efficient (roughly 52 times as fast as your current query, because it reads one week of rows instead of twelve months).
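A minimal sketch of that summary table and its weekly job might look like the following. The table name vessel_zone_weekly, the column name year_week and the INT type for zone are assumptions, not something given in the question; the week label reuses the '%Y%u' format from the original query so the results stay comparable.

```sql
-- Hypothetical summary table: one row per zone and week label.
CREATE TABLE vessel_zone_weekly (
    zone         INT NOT NULL,           -- assumed type; match vessel_position.zone
    year_week    CHAR(6) NOT NULL,       -- same '%Y%u' label as the original query
    vessel_count INT UNSIGNED NOT NULL,
    PRIMARY KEY (zone, year_week)
);

-- Weekly job: summarise the previous Monday-to-Sunday week only.
-- WEEKDAY() returns 0 for Monday, so this is the Monday of last week.
SET @week_start = CURDATE() - INTERVAL (WEEKDAY(CURDATE()) + 7) DAY;
SET @week_end   = @week_start + INTERVAL 7 DAY;

INSERT INTO vessel_zone_weekly (zone, year_week, vessel_count)
SELECT zone,
       DATE_FORMAT(creation_date, '%Y%u') AS year_week,
       COUNT(DISTINCT vessel_imo)
FROM vessel_position
WHERE zone IS NOT NULL
  AND creation_date >= @week_start
  AND creation_date <  @week_end
GROUP BY zone, year_week;

-- The report then reads the small table instead of 40 million rows.
SELECT zone, year_week, vessel_count
FROM vessel_zone_weekly
WHERE year_week >= DATE_FORMAT(CURDATE() - INTERVAL 12 MONTH, '%Y%u')
ORDER BY zone, year_week;
```

The job could be run from cron or from a MySQL event; either way it should run once per week, after the week it summarises has ended.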
It depends on how selective your filtering condition is, and on your table structure. Does the filtering condition select 20% of the rows, 5%, 1%, 0.1%?
If your answer is less than 5% then the following index could help:
create index ix1_date_zone on vessel_position (creation_date, zone);
If your table has many and/or heavy columns, then this option could still be slow, depending on how selective your filtering condition is.
Otherwise, you could try a more expensive covering index, so that the query is answered from the index without reading the table rows:
create index ix2_date_zone_imo on vessel_position
(creation_date, zone, vessel_imo);
This index is more expensive to maintain (that is, on every insert, update, and delete), but it would be faster for your select.
Try both options and pick the best for your needs.
SET @mystartdate = DATE_SUB(CURDATE(), INTERVAL 12 MONTH);
SELECT zone, DATE_FORMAT(creation_date, '%Y%u') AS date,
COUNT(DISTINCT vessel_imo) AS vessel_count
FROM vessel_position
WHERE creation_date >= @mystartdate
AND zone > 0
GROUP BY zone, date;
This may return results in less time. Please post comparative timings for the second run of each query (old and suggested).
Please also post the new EXPLAIN SELECT … output to confirm that the creation_date index is now used.
Unless old data is allowed to change, why gather 12 months of history every time? The numbers from more than a month ago are not going to change.

MySQL select row where one field is less than the previous record and one field is greater

I have two tables: a table called laps, which holds a record of all laps completed round a track by a car, and a table called best_times, which holds the fastest times for certain distances on each lap. For example, it will contain the fastest 1k in that lap, or the fastest half mile.
I want to select from these tables the fastest time progression for each distance. So it will show your personal progression for that distance on that lap over time. E.g your fastest 1k might have been set in January and then you broke it in June and again in August. Below is what the best_times table structure looks like
best_time_id int(10)
lap_id int(11)
start_time int(10)
end_time int(10)
total_distance decimal(7,2)
total_elapsed_time decimal(11,2)
I need to select records where the total_elapsed_time is less than the previous record and the start_time is greater. Here is my query so far
SELECT `bt`.`total_distance`, `bt`.`total_elapsed_time`, `bt`.`start_time`
FROM `best_times` AS `bt`, `laps` AS `l`
WHERE (
SELECT COUNT(*) FROM `best_times` AS `bt2`
WHERE `bt2`.`total_distance` = `bt`.`total_distance`
AND `bt2`.`total_elapsed_time` <= `bt`.`total_elapsed_time`
AND `bt2`.`start_time` > `bt`.`start_time`
) <= 10 AND `l`.`lap_id` = `bt`.`lap_id` AND `l`.`car_id` = 1 ORDER BY `bt`.`total_distance` ASC, `bt`.`total_elapsed_time` desc
This kind of works, but it selects records that it shouldn't. An example of a result set I'm getting back is this:
| total_distance | total_elapsed_time | start_time |
|----------------|--------------------|------------|
| 1000.00 | 99.15 | 1431344798 |
| 1000.00 | 98.25 | 1431604966 |
| 1000.00 | 91.40 | 1431433535 |
The 98.25 record shouldn't be here: although it is quicker than 99.15, it was set after the 91.40 time.
I'm close, but can anyone see where I'm going wrong? Please let me know if I need to provide more information.
This will show all the rows where elapsed time is less than all previous elapsed times, by total distance and lap id for car=1:
SELECT `bt`.`total_distance`, `bt`.`total_elapsed_time`, `bt`.`start_time`
FROM `best_times` AS `bt`, `laps` AS `l`
WHERE `bt`.`total_elapsed_time` <= (Select min(`bt2`.`total_elapsed_time`) from `best_times` AS `bt2` where `bt2`.`start_time` <= `bt`.`start_time`
AND `bt2`.`total_distance` = `bt`.`total_distance` )
AND `l`.`lap_id` = `bt`.`lap_id`
AND `l`.`car_id` = 1
ORDER BY `bt`.`total_distance` ASC, `bt`.`total_elapsed_time` desc
I'm not sure what role lap_id plays here; if it is not needed, the join to laps can perhaps be removed.
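On MySQL 8.0 or later, the same "only keep rows that beat every earlier time" idea can also be sketched with a window function. This is an alternative formulation, not the query above: it restricts the comparison to car 1's own rows and uses a strict comparison, so a time that merely equals the previous best is not reported as a new record. The names previous_best and progression are made up for the example.

```sql
-- Requires MySQL 8.0+ (window functions).
SELECT total_distance, total_elapsed_time, start_time
FROM (
    SELECT bt.total_distance,
           bt.total_elapsed_time,
           bt.start_time,
           -- Best time over all strictly earlier rows for the same distance;
           -- NULL for the first row of each distance.
           MIN(bt.total_elapsed_time) OVER (
               PARTITION BY bt.total_distance
               ORDER BY bt.start_time
               ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
           ) AS previous_best
    FROM best_times AS bt
    JOIN laps AS l ON l.lap_id = bt.lap_id
    WHERE l.car_id = 1
) AS progression
WHERE previous_best IS NULL
   OR total_elapsed_time < previous_best
ORDER BY total_distance ASC, total_elapsed_time DESC;
```

If two rows for the same distance share a start_time, their relative order inside the window is not deterministic, so a tie-breaker such as best_time_id may be worth adding to the ORDER BY.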

MySQL Event Insert ID from another table

I would like to create an event that executes an INSERT query once exactly 15 days have passed since a row's lend_date.
It would get the id and userid of that row and insert them into another table.
For example:
id | userid | lend_date
---+----------+----------------------
1 | 1 | 2015-09-24 15:58:48
2 | 1 | 2015-09-21 08:22:48
And right now, it is exactly 2015-10-06 08:23:41. So the event should get the ID of the second row, which is 2, and insert it to another table.
What should my event query look like?
The event type is RECURRING. But I'm also not sure if I should execute it every hour or everyday. What would be the best recommendation for it?
And is this a better way than using Task Scheduler?
The other table that I wanted to insert the fetched ID is notification_table, where it will notify the user that he/she has an overdue date.
notification_table looks like this:
id | userid | notificationid | notificationdate |
---+----------+------------------+----------------------+
1 | 1 | 1 | 2015-09-24 15:58:48 |
2 | 1 | 1 | 2015-09-21 08:22:48 |
I'm looking at this query:
INSERT INTO notification_table (id, userid, notificationid, notificationdate)
SELECT id, userid, 1, NOW()
FROM first_table
WHERE lend_date + INTERVAL 15 DAY = NOW();
Seeing the words exactly, event, and datetime in the same sentence makes me cringe. Why? For one thing, it's hard to get one datetime value to exactly match another. For another thing, events often run slightly after the scheduled time, especially on a busy database server. It takes them a little time to start up.
If you need the id values from a table where the records are more than 15 days old, the most time-precise way to get them is with a query or view.
CREATE OR REPLACE VIEW fifteen
AS SELECT id
FROM first_table
WHERE lend_date < NOW() - INTERVAL 15 DAY
You can, of course, write an event to copy the ids to a new table. You'll have to go to some trouble to make sure you don't hit the same id values more than once, by using this sort of query in the event.
INSERT INTO newtable (id)
SELECT id
FROM first_table
WHERE lend_date < NOW() - INTERVAL 15 DAY
AND id NOT IN (SELECT id FROM newtable)
How often should you run the repeating event? That depends entirely on how quickly the id values need to make it into the new table after they turn fifteen days old. If your application requires it to be less than a minute, you really should go with the view rather than the event. Anything more than a minute of allowable delay will let you use a repeating event at that frequency.
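Put together, a recurring event along these lines is one way to do it. Treat it as a sketch under assumptions: the event name is invented, the hourly schedule is just an example, and it follows the question's design of copying first_table.id into notification_table.id, which is also how it avoids inserting the same row twice.

```sql
-- The event scheduler has to be running for any event to fire:
-- SET GLOBAL event_scheduler = ON;

-- Hypothetical event name; table and column names are from the question.
CREATE EVENT notify_overdue_lends
ON SCHEDULE EVERY 1 HOUR
DO
  INSERT INTO notification_table (id, userid, notificationid, notificationdate)
  SELECT f.id, f.userid, 1, NOW()
  FROM first_table AS f
  LEFT JOIN notification_table AS n ON n.id = f.id
  WHERE f.lend_date < NOW() - INTERVAL 15 DAY   -- "at least 15 days", not "exactly"
    AND n.id IS NULL;                           -- skip rows already notified
```

Because the condition is "older than 15 days" rather than an exact match, a run that starts a few seconds late still picks up every overdue row on its next pass.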

Graph per-day from ranges in MySQL

I am trying to make a graph that has a point for each day showing the number of horses present per-day.
This is example of data I have (MySQL)
horse_id | start_date | end_date |
1 | 2011-04-02 | 2011-04-03 |
2 | 2011-04-02 | NULL |
3 | 2011-04-04 | 2014-07-20 |
4 | 2012-05-11 | NULL |
So a graph on that data should output one row per day starting on 2011-04-02 and ending on CURDATE, for each day it should return how many horses are registered.
I can't quite wrap my head around how I would do this, since I only have a start date and an end date for each item, and I want to know how many were present on each day.
Right now, I do a loop and a SQL query per day, but that is, as you might have guessed, thousands of queries, and I was hoping it could be done smarter.
If a day between 2011-04-02 and now contains nothing, I still want it out but with a 0.
If possible I would like to avoid having a table with a row for each day containing a count.
I hope it makes sense, I am very stuck here.
What you should have is a table containing just dates, from at least the earliest date in your current table up to the current date.
Then you can left join your table to it, something like this:
SELECT
dt.date,
COUNT(yt.horse_id)
FROM
dates_table dt
LEFT JOIN your_table yt ON dt.date BETWEEN yt.start_date AND COALESCE(yt.end_date, CURDATE())
GROUP BY dt.date
Be sure to pass a column of your_table to the COUNT() function. COUNT(*) would also count the all-NULL row that the LEFT JOIN produces for a day with no horses, and report 1 instead of 0.
The COALESCE() function returns the first of its arguments that isn't NULL, so if no end_date is specified, the current date is used instead.
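One way to fill such a dates table, assuming MySQL 8.0 or later, is a recursive CTE. The table and column names match the query above; the start date is taken from the sample data, and the recursion limit of 10000 is just a value comfortably above the number of days involved.

```sql
-- Calendar table used by the LEFT JOIN above.
CREATE TABLE dates_table (
    `date` DATE NOT NULL PRIMARY KEY
);

-- The default recursion limit is 1000 rows, which is fewer than the
-- number of days since 2011, so raise it for this session.
SET SESSION cte_max_recursion_depth = 10000;

-- Generate one row per day from the earliest start_date up to today.
INSERT INTO dates_table (`date`)
WITH RECURSIVE days (d) AS (
    SELECT DATE('2011-04-02')
    UNION ALL
    SELECT d + INTERVAL 1 DAY FROM days WHERE d < CURDATE()
)
SELECT d FROM days;
```

If a permanent table is unwanted, the same CTE can be placed in front of the SELECT and joined in place of dates_table, at the cost of regenerating the dates on every run.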

Query database in weekly interval

I have a database with a created_at column containing the datetime in Y-m-d H:i:s format.
The latest datetime entry is 2011-09-28 00:10:02.
I need the query to be relative to the latest datetime entry.
1. The first value in the query should be the latest datetime entry.
2. The second value in the query should be the entry closest to 7 days from the first value.
3. The third value should be the entry closest to 7 days from the second value.
4. Repeat #3.
What I mean by "closest to 7 days from":
The following are dates, the interval I desire is a week, in seconds a week is 604800 seconds.
7 days from the first value is equal to 1316578202 (1317183002-604800)
the value closest to 1316578202 (7 days) is... 1316571974
unix timestamp | Y-m-d H:i:s
1317183002 | 2011-09-28 00:10:02 -> appear in query (first value)
1317101233 | 2011-09-27 01:27:13
1317009182 | 2011-09-25 23:53:02
1316916554 | 2011-09-24 22:09:14
1316836656 | 2011-09-23 23:57:36
1316745220 | 2011-09-22 22:33:40
1316659915 | 2011-09-21 22:51:55
1316571974 | 2011-09-20 22:26:14 -> closest to 7 days from 1317183002 (first value)
1316499187 | 2011-09-20 02:13:07
1316064243 | 2011-09-15 01:24:03
1315967707 | 2011-09-13 22:35:07 -> closest to 7 days from 1316571974 (second value)
1315881414 | 2011-09-12 22:36:54
1315794048 | 2011-09-11 22:20:48
1315715786 | 2011-09-11 00:36:26
1315622142 | 2011-09-09 22:35:42
I would really appreciate any help, I have not been able to do this via mysql and no online resources seem to deal with relative date manipulation such as this. I would like the query to be modular enough to be able to change the interval weekly, monthly, or yearly. Thanks in advance!
Answer #1 Reply:
SELECT
    UNIX_TIMESTAMP(created_at) AS unix_timestamp,
    (
        SELECT MIN(UNIX_TIMESTAMP(created_at))
        FROM my_table
        WHERE created_at >=
            (
                SELECT MAX(created_at) - 7
                FROM my_table
            )
    ) AS `random_1`,
    (
        SELECT MIN(UNIX_TIMESTAMP(created_at))
        FROM my_table
        WHERE created_at >=
            (
                SELECT MAX(created_at) - 14
                FROM my_table
            )
    ) AS `random_2`
FROM my_table
WHERE created_at =
    (
        SELECT MAX(created_at)
        FROM my_table
    )
Returns:
unix_timestamp | random_1 | random_2
1317183002 | 1317183002 | 1317183002
Answer #2 Reply:
RESULT SET:
This is the result set for a yearly interval:
id | created_at | period_index | period_timestamp
267 | 2010-09-27 22:57:05 | 0 | 1317183002
1 | 2009-12-10 15:08:00 | 1 | 1285554786
I desire this result:
id | created_at | period_index | period_timestamp
626 | 2011-09-28 00:10:02 | 0 | 0
267 | 2010-09-27 22:57:05 | 1 | 1317183002
I hope this makes more sense.
It's not exactly what you asked for, but the following example is pretty close....
Example 1:
select
floor(timestampdiff(SECOND, tbl.time, most_recent.time)/604800) as period_index,
unix_timestamp(max(tbl.time)) as period_timestamp
from
tbl
, (select max(time) as time from tbl) most_recent
group by period_index
gives results:
+--------------+------------------+
| period_index | period_timestamp |
+--------------+------------------+
| 0 | 1317183002 |
| 1 | 1316571974 |
| 2 | 1315967707 |
+--------------+------------------+
This breaks the dataset into groups based on "periods", where (in this example) each period is 7-days (604800 seconds) long. The period_timestamp that is returned for each period is the 'latest' (most recent) timestamp that falls within that period.
The period boundaries are all computed based on the most recent timestamp in the database, rather than computing each period's start and end time individually based on the timestamp of the period before it. The difference is subtle - your question requests the latter (iterative approach), but I'm hoping that the former (approach I've described here) will suffice for your needs, since SQL doesn't lend itself well to implementing iterative algorithms.
If you really do need to determine each period based on the timestamp in the previous period, then your best bet is going to be an iterative approach -- either using a programming language of your choice (like php), or by building a stored procedure that uses a cursor.
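For the iterative variant, a stored procedure can walk backwards one pick at a time. The sketch below uses a plain WHILE loop rather than a cursor, and everything except tbl and its time column is invented for the example (procedure name, scratch table, variable names). Each step looks only at rows older than the current pick and takes the one whose time is closest to 7 days before it.

```sql
DELIMITER //

CREATE PROCEDURE closest_entries_by_week()
BEGIN
    DECLARE v_current DATETIME;
    DECLARE v_index   INT DEFAULT 0;

    -- Scratch table that collects one picked row per step.
    CREATE TEMPORARY TABLE IF NOT EXISTS picked_times (
        period_index INT NOT NULL,
        picked_time  DATETIME NOT NULL
    );
    DELETE FROM picked_times;

    -- Start from the most recent entry.
    SELECT MAX(`time`) INTO v_current FROM tbl;

    WHILE v_current IS NOT NULL DO
        INSERT INTO picked_times VALUES (v_index, v_current);

        -- Next pick: the older row closest to "current pick minus 7 days".
        -- Yields NULL, ending the loop, once no older rows remain.
        SET v_current = (
            SELECT `time`
            FROM tbl
            WHERE `time` < v_current
            ORDER BY ABS(TIMESTAMPDIFF(SECOND, v_current - INTERVAL 7 DAY, `time`))
            LIMIT 1
        );
        SET v_index = v_index + 1;
    END WHILE;

    SELECT period_index,
           picked_time,
           UNIX_TIMESTAMP(picked_time) AS period_timestamp
    FROM picked_times
    ORDER BY period_index;
END //

DELIMITER ;

CALL closest_entries_by_week();
```

DELIMITER is a command of the mysql command-line client, so other tools may need the procedure body submitted differently. Changing INTERVAL 7 DAY to 1 MONTH or 1 YEAR would give the monthly or yearly variants.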
Edit #1
Here's the table structure for the above example.
CREATE TABLE `tbl` (
`id` int(10) unsigned NOT NULL auto_increment PRIMARY KEY,
`time` datetime NOT NULL
)
Edit #2
Ok, first: I've improved the original example query (see revised "Example 1" above). It still works the same way, and gives the same results, but it's cleaner, more efficient, and easier to understand.
Now... the query above is a group-by query, meaning it shows aggregate results for the "period" groups as I described above - not row-by-row results like a "normal" query. With a group-by query, you're limited to using aggregate columns only. Aggregate columns are those columns that are named in the group by clause, or that are computed by an aggregate function like MAX(time). It is not possible to extract meaningful values for non-aggregate columns (like id) from within the projection of a group-by query.
Unfortunately, unless the ONLY_FULL_GROUP_BY SQL mode is enabled (it is part of the default sql_mode from MySQL 5.7.5 onwards), mysql doesn't generate an error when you try to do this. Instead, it picks an arbitrary value from within the grouped rows, and shows that value for the non-aggregate column in the grouped result. This is what's causing the odd behavior the OP reported when trying to use the code from Example #1.
Fortunately, this problem is fairly easy to solve. Just wrap another query around the group query, to select the row-by-row information you're interested in...
Example 2:
SELECT
entries.id,
entries.time,
periods.idx as period_index,
unix_timestamp(periods.time) as period_timestamp
FROM
tbl entries
JOIN
(select
floor(timestampdiff( SECOND, tbl.time, most_recent.time)/31536000) as idx,
max(tbl.time) as time
from
tbl
, (select max(time) as time from tbl) most_recent
group by idx
) periods
ON entries.time = periods.time
Result:
+-----+---------------------+--------------+------------------+
| id | time | period_index | period_timestamp |
+-----+---------------------+--------------+------------------+
| 598 | 2011-09-28 04:10:02 | 0 | 1317183002 |
| 996 | 2010-09-27 22:57:05 | 1 | 1285628225 |
+-----+---------------------+--------------+------------------+
Notes:
Example 2 uses a period length of 31536000 seconds (365 days), while Example 1 (above) uses a period of 604800 seconds (7 days). Other than that, the inner query in Example 2 is the same as the primary query shown in Example 1.
If a matching period_time belongs to more than one entry (i.e. two or more entries have the exact same time, and that time matches one of the selected period_time values), then the above query (Example 2) will include multiple rows for the given period timestamp (one for each match). Whatever code consumes this result set should be prepared to handle such an edge case.
It's also worth noting that these queries will perform much, much better if you define an index on your datetime column. For my example schema, that would look like this:
ALTER TABLE tbl ADD INDEX idx_time ( time )
If you're willing to go for the closest that is after the week is out then this'll work. You can extend it to work out the closest but it'll look so disgusting it's probably not worth it.
select unix_tstamp
     , ( select min(unix_tstamp)
           from my_table
          -- In MySQL, subtracting days from a datetime needs INTERVAL arithmetic.
          where sql_tstamp >= ( select max(sql_tstamp) - interval 7 day
                                  from my_table )
       )
     , ( select min(unix_tstamp)
           from my_table
          where sql_tstamp >= ( select max(sql_tstamp) - interval 14 day
                                  from my_table )
       )
  from my_table
 where sql_tstamp = ( select max(sql_tstamp)
                        from my_table )