I have a table in my DB something like this:
+----------+------------+------------+----------+----------+
| event_id | date       | start_time | end_time | duration |
+----------+------------+------------+----------+----------+
| 1        | 2011-05-13 | 01:00:00   | 04:00:00 |    10800 |
| 2        | 2011-05-12 | 17:00:00   | 01:00:00 |    28800 |
| 3        | 2011-05-11 | 11:00:00   | 14:00:00 |    10800 |
+----------+------------+------------+----------+----------+
This sample data doesn't give a totally accurate picture; there are typically events covering every hour of every day.
The date always refers to the start_time, as the end_time can sometimes be the following day.
The duration is in seconds.
SELECT *
FROM event_schedules
WHERE (
    date = CURDATE()                            -- today
    OR
    date = DATE_SUB(CURDATE(), INTERVAL 1 DAY)  -- yesterday
)
-- and ended before NOW()
AND DATE_ADD(CONCAT(date, ' ', start_time), INTERVAL duration SECOND) < NOW()
ORDER BY CONCAT(date, ' ', start_time) DESC
LIMIT 1
I have a clause in there, the OR'ed clause in brackets, that is unnecessary. I hoped that it might improve the query time by first filtering out any "events" that do not start today or yesterday. The only way to find the most recent "event" is by ordering the records and taking the first. By adding this extra, unnecessary clause, am I actually reducing the list of records that need to be ordered? If so, I can't imagine the optimizer being able to make this optimization; most other questions similar to this talk about the optimizer.
Be careful when adding filters to your WHERE clause for performance. While a filter can reduce the overall number of rows that need to be searched, the filter itself can cost more if it touches a ton of records without using an index. In your case, if the column date is indexed, you'll probably get better performance, because the index can be used in the OR part, whereas it can't in the other parts, where the column is wrapped in a function. Also, can you have future dates? If not, why don't you change the OR to
date >= DATE_SUB(CURDATE(), INTERVAL 1 DAY)
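If the date column isn't indexed yet, a minimal sketch of adding one (the index name is illustrative):

CREATE INDEX idx_event_date ON event_schedules (date);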
The order of the WHERE clause does affect the way the SQL engine gets the results.
Many engines have a way to view what they do with a query. If you're using SQL Server, look for "show estimated execution plan" in your client tool. Others have a keyword like EXPLAIN that can be used to show how the engine treats a query.
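In MySQL, for example, you can prefix the statement with EXPLAIN and compare the plans with and without the extra clause:

EXPLAIN SELECT *
FROM event_schedules
WHERE date >= DATE_SUB(CURDATE(), INTERVAL 1 DAY)
AND DATE_ADD(CONCAT(date, ' ', start_time), INTERVAL duration SECOND) < NOW()
ORDER BY CONCAT(date, ' ', start_time) DESC
LIMIT 1;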
Well, the optimizer in the query engine is a big part of any query's performance, or the relative performance of two equivalent statements.
You didn't tell us whether you ran the query with and without the extra WHERE. There may be a performance difference; there may not.
My guess is that the LIMIT has a lot to do with it. The engine knows this is a "one and done" operation. Without the WHERE, sorting is an N log N operation, which in this special case can be made linear with a simple scan of the dates to find the most recent.
With the WHERE, you're actually increasing the number of steps it has to perform: either it fully orders the table (N log N) and then scans that list for the first record matching the WHERE clause (linear worst case, constant best case), or it filters by the WHERE (linear) and then scans those records again to find the max date (linear again). Whichever turns out faster, both are slower than one linear scan of the list for the most recent date.
Related: Check overlap of date ranges in MySQL
I know this can be accomplished on the backend, but I'm wondering if there's any native or efficient MySQL function that can be used to check whether a given time variable falls within a range covered by two time variables (a start_time and an end_time).
I have a database currently set up which looks like the following:
+----+--------+------------+----------+
| id | job_id | start_time | end_time |
+----+--------+------------+----------+
| 1 | 40 | 13:00:00 | 14:00:00 |
| 2 | 44 | 14:45:00 | 15:00:00 |
| 3 | 45 | 15:10:00 | 15:30:00 |
+----+--------+------------+----------+
The backend accepts a start_time and an end_time with a job_id, and then it looks to see if the job can be fit in anywhere. So, for example, given a start_time of 13:30:00 and an end_time of 13:45:00, the backend should reject this job request, as there is no time available for it (it would overlap with the entry at id 1).
However, if a job is submitted with a start_time of 14:10:00, and an end time of 14:20:00, it should be accepted - as this does not overlap with any existing tasks.
The following query is great for telling whether a job can be submitted for, say, 13:00:00 until 14:00:00 (an exact duplicate of id 1):
SELECT * FROM timetableTemp WHERE start_time >= '13:00:00' AND end_time <= '14:00:00';
But if the start_time becomes 13:01:00, the query falls down: the existing start_time, at 13:00:00, is less than 13:01:00, so the query above returns no overlapping results and the job gets approved for insertion.
If we change the query to an OR clause on end_time, then literally any job that doesn't end before 14:00:00 would be rejected.
Can anyone suggest a simple way of taking an input variable of a time type and checking whether it falls within the range of any of the start_time and end_time values noted in the db above?
I would suggest you check it in one of the following ways: hours/minutes/seconds
SELECT * FROM timetableTemp
WHERE HOUR(TIMEDIFF(start_time, end_time)) >= 1
AND start_time >= '13:00:00' -- your input time; the original compared start_time to itself
Or use the BETWEEN operator:
SELECT * FROM timetableTemp
WHERE '13:00:00' BETWEEN start_time AND end_time
For reference:
SELECT SECOND(TIMEDIFF("13:10:11", "13:10:10")) AS Seconds; -- 1
SELECT MINUTE(TIMEDIFF("13:10:11", "13:10:10")) AS Minutes; -- 0
SELECT HOUR(TIMEDIFF("13:00:00", "14:00:00")) AS Hours; -- 1
Let's suppose you have a new job with a start of new_start and an end of new_end. Assume your existing list is in order of the time of each job entry, and the entries are all non-overlapping.
What you'd need to check to see whether the new job fits in the list without overlap would be one of the following (a query sketch follows the list):
- new_end is less than the start_time of the first list entry, in which case the new entry fits at the beginning, OR
- new_start is greater than the end_time of the last list entry, in which case the new entry fits at the end, OR
- for some in-between list entry, new_start is greater than the end_time and new_end is less than the start_time of the following list entry, in which case the new entry fits in between this entry and the next.
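All three cases reduce to the standard interval-overlap test: the new job can be accepted exactly when no existing row overlaps it. A minimal sketch of that test as a single query, assuming the timetableTemp table from the other answer, with the accepted 14:10:00-14:20:00 job from the question standing in for new_start and new_end:

SELECT COUNT(*) AS conflicts
FROM timetableTemp
WHERE start_time < '14:20:00'  -- new_end
  AND end_time > '14:10:00';   -- new_start
-- conflicts = 0 means the new job fits without overlapping anything.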
Let's say I have a table:
+------------+-----------+------+-----+-------------------+-----------------------------+
| Field | Type | Null | Key | Default | Extra |
+------------+-----------+------+-----+-------------------+-----------------------------+
| id | int(10) | NO | PRI | | AUTOINCREMENT |
| id_action | int(10) | NO | IDX | | |
| a_date | date | NO | IDX | | |
| a_datetime | datetime | NO | IDX | | |
+------------+-----------+------+-----+-------------------+-----------------------------+
Each row has some id_action, plus the a_date and a_datetime when it was executed on the website.
My question is, when I want to return the COUNT() of each id_action grouped by a_date, is it the same when I use these two selects, or do they differ in speed? Thanks for any explanation.
SELECT COUNT(id_action), id_action, a_date
FROM my_table
GROUP BY a_date
ORDER BY a_date DESC
and
SELECT COUNT(id_action), id_action, DATE_FORMAT(a_datetime, '%Y-%m-%d') AS `a_date`
FROM my_table
GROUP BY DATE_FORMAT(a_datetime, '%Y-%m-%d')
ORDER BY a_date DESC
In other words, my question is: each action has its datetime, so do I really need the column a_date, or is it the same using the DATE_FORMAT function on the column a_datetime, so that I don't need the column a_date?
I ran both queries on a similar table on MySQL 5.5.
The table has 10,634,079 rows.
The first query took 10.66 secs initially and always takes approx 10 secs on further attempts.
The second query takes 1.25 mins to execute the first time; on the second, third, ... attempts it takes 22.091 secs.
So in my view, if you are looking for performance, you should have the column a_date, as the query takes half the time when executed without DATE_FORMAT.
If performance is not the primary concern (and data redundancy is, for example), then the a_datetime column will serve all other date/datetime-related purposes.
DATE: The DATE type is used for values with a date part but no time part.
DATETIME: The DATETIME type is used for values that contain both date and time parts.
So if you have a DATETIME you can always derive the DATE from it, but from a DATE you cannot get the DATETIME.
And as for your SQL, there will not be a major difference.
It would be better not to have a_date, because you already have a_datetime.
But in general, if you can use TIMESTAMP you should, because it is more space-efficient than DATETIME.
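For illustration, a rough sketch of the same table using TIMESTAMP; in MySQL 5.5 a TIMESTAMP takes 4 bytes versus 8 for DATETIME (the table and column names mirror the question, the index names are illustrative):

CREATE TABLE my_table (
  id         INT(10) NOT NULL AUTO_INCREMENT PRIMARY KEY,
  id_action  INT(10) NOT NULL,
  a_datetime TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  KEY idx_id_action (id_action),
  KEY idx_a_datetime (a_datetime)
);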
Using a_date to group by day will be more efficient than a_datetime because of the conversion. In T-SQL I use a combination of DATEADD() and DATEDIFF() to get the date only from a DATETIME, since math is more efficient than data conversion. For example (again using T-SQL, though I'm sure there's something similar for MySQL):
SELECT COUNT(id_action), id_action,
       DATEADD(DD, DATEDIFF(DD, 0, a_datetime), 0) AS [a_date]
FROM my_table
GROUP BY DATEADD(DD, DATEDIFF(DD, 0, a_datetime), 0)
ORDER BY a_date DESC
This will find the number of days between day 0 and a_datetime, then add that number of days to day 0 again. (Day 0 is just an arbitrary date chosen for its simplicity.)
Perhaps the MySQL version of that would be:
DATE_ADD('2014-01-01', INTERVAL DATEDIFF(a_datetime, '2014-01-01') DAY)
Sorry, I don't have MySQL installed or I would try that myself. I'd expect it to be more efficient than casting/formatting, but less efficient than using a_date.
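Putting that expression into the full query would look something like this untested sketch ('2014-01-01' is just an arbitrary anchor date):

SELECT COUNT(id_action), id_action,
       DATE_ADD('2014-01-01', INTERVAL DATEDIFF(a_datetime, '2014-01-01') DAY) AS a_date
FROM my_table
GROUP BY a_date
ORDER BY a_date DESC;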
If you use a function in your GROUP BY clause, as in GROUP BY DATE_FORMAT(a_datetime, '%Y-%m-%d'), you will not be leveraging your index on a_datetime.
As for speed, I believe there will be no noticeable difference between indexing on datetime vs date (but it's always easy to test with 'explain')
Lastly, you can always read a datetime as a date (using cast functions if need be). Your schema is not normalized if you have both an a_date and an a_datetime; you should consider removing one of them. If date provides enough granularity for your application, then get rid of datetime. Otherwise, get rid of a_date and cast as required.
As already mentioned, the performance of any function(a_datetime) will be worse than just a_date. The choice depends on your needs: if there is no need for DATETIME, take a DATE and that is that.
If you still need a function to convert, then I advise you to use DATE().
See also How to cast DATETIME as a DATE in mysql?
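For example, either of these reads the DATETIME column as a plain date:

SELECT DATE(a_datetime) FROM my_table;
SELECT CAST(a_datetime AS DATE) FROM my_table;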
Put the two statements in the SQL editor and display the estimated execution plan (Ctrl+L) and statistics:
https://technet.microsoft.com/en-us/library/ms178071%28v=sql.105%29.aspx
https://msdn.microsoft.com/pt-br/library/ms190287.aspx?f=255&MSPPError=-2147217396
UPDATE: It seems the problem (as noted by various people) is converting a datetime field into a date field in the query.
Using DATE(all_griefs_tbl.actioned_date) is too slow; is there a quicker method without either changing actioned_date into a date field or splitting it into a date and a time field?
I have 2 tables: one with a load of records that have a status and a datetime field, and the other a calendar table with dates from 2008 to 2015.
What I want to get out is every date in a time period and the number of records that have been "accepted" each day, even if that count is zero, which would look like this:
+------------+-----------------+
| Date       | number_accepted |
+------------+-----------------+
| 2012-03-01 |             723 |
| 2012-03-02 |             723 |
| 2012-03-03 |            1055 |
| 2012-03-04 |            1069 |
| 2012-03-05 |               0 |
| 2012-03-06 |             615 |
| 2012-03-07 |               0 |
| 2012-03-08 |            1072 |
| 2012-03-09 |             664 |
| 2012-03-10 |             859 |
| 2012-03-11 |               0 |
| 2012-03-12 |             778 |
| 2012-03-13 |             987 |
+------------+-----------------+
I've tried the following, but it is only fast enough on a small sample of data (~1000 rows). I need something that works well on at least 600k rows:
SELECT calendar.datefield AS Date,
COUNT( all_griefs_tbl.actioned_status ) AS total_griefs
FROM all_griefs_tbl
RIGHT JOIN calendar
ON ( DATE( all_griefs_tbl.actioned_date ) = calendar.datefield )
AND all_griefs_tbl.actioned_status = 'accepted'
WHERE calendar.datefield < CURDATE( )
GROUP BY calendar.datefield
Thanks
EDIT: Execution plan as requested:
id | select_type | table          | type  | possible_keys   | key             | key_len | ref   | rows   | Extra
1  | SIMPLE      | calendar       | range | PRIMARY         | PRIMARY         | 3       | NULL  | 1576   | Using where; Using index
1  | SIMPLE      | all_griefs_tbl | ref   | actioned_status | actioned_status | 153     | const | 294975 |
A few thoughts...
First, although you state that you want days without any values returned in the db query, I would actually do this check on the result set wherever that is being handled. Whenever you do a join, you make your queries much more complicated and require more memory to handle them. In this case, I wouldn't regard your use of the calendar table as a particularly good use of a relational database.
EDIT: To clarify, how is the query being called? i.e. is there some program (that you're developing) accessing the database, running the query and presenting the results? If so, I'd suggest getting this program to process the results before presentation.
Second, if you're committed to the 'join', you really should have an index on all_griefs_tbl.actioned_date since this is the column on which you're doing the join. Alternatively, you could specify a foreign key on calendar.datefield.
Third, do you need to use the function DATE(all_griefs_tbl.actioned_date)? Isn't this already a date? (Not sure of your data types, but if this and calendar.datefield are not the same data type, this looks like bad database design.)
EDIT: In light of what you say, you may want to split all_griefs_tbl.actioned_date into two columns: a date column all_griefs_tbl.actioned_date and a timestamp column all_griefs_tbl.actioned_time. At the moment, you're running the DATE() function on every row in all_griefs_tbl in order to do the join, which will very quickly make the query sluggish. The split would also allow you to add an index on both the date and time columns, which would further improve the performance of the join. (Given your current db design, I'm not surprised the index on actioned_date didn't help; because of the DATE() function, if you rerun EXPLAIN with an index on the actioned_date column as it currently stands, I'd expect it to show that index going unused.)
Fourth, you may want to consider what types of information are stored in all_griefs_tbl.actioned_status. Could this be replaced by a boolean? This would be more efficient in both storing and processing the data. (Although again, this depends on your database design.)
EDIT: You could consider changing all_griefs_tbl.actioned_status to a smaller datatype. I expect it's currently a varchar, but you could easily change it to a single (or small) char datatype, or even to a number of booleans. However, I don't expect this to be the main performance overhead, and it is really a more involved database design decision depending on the needs of your project.
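A rough sketch of that change, assuming 'accepted' is one of a handful of known status strings (the one-character codes are illustrative):

UPDATE all_griefs_tbl SET actioned_status = 'A' WHERE actioned_status = 'accepted';
-- ...repeat for each remaining status value...
ALTER TABLE all_griefs_tbl MODIFY actioned_status CHAR(1) NOT NULL;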
I suggest splitting your actioned_date from a datetime into 2 separate date and time columns, let's say actioned_date and actioned_time, so you could change your first join condition from
ON ( DATE( all_griefs_tbl.actioned_date ) = calendar.datefield )
to
ON ( all_griefs_tbl.actioned_date = calendar.datefield )
and adding an index
ALTER TABLE all_griefs_tbl ADD INDEX g_status_date( actioned_status, actioned_date, actioned_time );
It would probably make your query instant for a table with 600k rows.
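A rough sketch of that migration, assuming actioned_date is currently a DATETIME (untested; back up the table first):

ALTER TABLE all_griefs_tbl ADD COLUMN actioned_time TIME;
UPDATE all_griefs_tbl SET actioned_time = TIME(actioned_date);
ALTER TABLE all_griefs_tbl MODIFY actioned_date DATE;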
I have a MySQL table containing a column to store time and another to store a value associated with that time.
time | value
-----+------
   1 |  0.5
   3 |  1.0
   4 |  1.5
 ... |  ...
The events are not periodic, i.e., the time values do not increment by a fixed interval.
As there are a large number of rows (> 100000), for the purpose of showing the values in a graph I would like to be able to aggregate (mean) the values over intervals of fixed size across the entire length of time for which data is available. So basically the output should consist of pairs of intervals and mean values.
Currently, I am splitting the total time range into fixed chunks, executing an individual aggregate query for each chunk, and collecting the results in application code (Java). Is there a way to do all of these steps in SQL? Also, I am currently using MySQL but am open to other databases that might support an efficient solution.
SELECT FLOOR(time / x) AS Inter, AVG(value) AS Mean
FROM `table`
GROUP BY Inter;
Where x is your interval of fixed size.
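For example, with a fixed interval of 10 time units, multiplying the bucket number back out labels each interval with its starting time:

SELECT FLOOR(time / 10) * 10 AS interval_start, AVG(value) AS mean_value
FROM `table`
GROUP BY interval_start;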
I've usually solved this with a "period" table, with all the valid times in it and an association with the period on which I report.
For instance:
time  day  week  month  year
1     1    1     1      2001
2     1    1     1      2001
...
999   7    52    12     2010
You can then join your time to the "period" table's time and use AVG.
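A minimal sketch of that join, assuming the table above is named period and aggregating the mean per week:

SELECT p.year, p.week, AVG(t.value) AS mean_value
FROM `table` t
JOIN period p ON p.time = t.time
GROUP BY p.year, p.week;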
What's the best way to store a date value for which in many cases only the year may be known?
MySQL allows zeros in date parts unless the NO_ZERO_IN_DATE SQL mode is enabled, which it isn't by default. Is there any reason not to use a date field where the month and day may be zero, rather than splitting it up into 3 different fields for year, month and day (year(4), tinyint, tinyint)?
A better way is to split the date into 3 fields. Year, Month, Day. This gives you full flexibility for storing, sorting, and searching.
Also, it's pretty trivial to put the fields back together into a real date field when necessary.
Finally, it's portable across DBMS's. I don't think anyone else supports a 0 as a valid part of a date value.
Unless portability across DBMS is important, I would definitely be inclined to use a single date field. If you require even moderately complex date related queries, having your day, month and year values in separate fields will become a chore.
MySQL has a wealth of date related functions - http://dev.mysql.com/doc/refman/5.1/en/date-and-time-functions.html. Use YEAR(yourdatefield) if you want to return just the year value, or the same if you want to include it in your query's WHERE clause.
You can use a single date field in MySQL to do this. In the example below, the column field has the DATE data type.
mysql> select * from test;
+------------+------+
| field | id |
+------------+------+
| 2007-00-00 | 1 |
+------------+------+
1 row in set (0.00 sec)
mysql> select * from test where YEAR(field) = 2007;
+------------+------+
| field | id |
+------------+------+
| 2007-00-00 | 1 |
+------------+------+
I would use one field; it will make the queries easier.
Yes, using the date and time functions would be better. Thanks BrynJ
You could try the LIKE operator, such as:
SELECT * FROM table WHERE date_field LIKE '2009%';
It depends on how you use the resulting data. A simple answer would be to store those dates where only the year is known as January 1. This approach is really simple and lets you aggregate by year using all the standard built-in date functions.
The problem arises if the month or day is significant, for example if you are trying to determine the age of a record in days, weeks, or months, or if you want to show distribution across this smaller level of granularity. That problem exists anyway, though: if you have some full dates and some with only a year, how do you want to represent the latter in such cases?