MySQL: RIGHT JOIN query speed issues involving calendar table

UPDATE: It seems the problem (as noted by various people) is converting a datetime field into a date field in the query.
Using DATE( all_griefs_tbl.actioned_date ) is too slow. Is there a quicker method that avoids both changing actioned_date into a date field and splitting it into separate date and time fields?
I have 2 tables, one with a load of records that have a status and a datetime field and the other is a calendar table with dates from 2008 to 2015.
What I want to get out is every date in a time period and the number of records that have been "accepted" each day - even if that count is zero - which would look like this:
| Date | number_accepted |
----------------------------
2012-03-01 723
2012-03-02 723
2012-03-03 1055
2012-03-04 1069
2012-03-05 0
2012-03-06 615
2012-03-07 0
2012-03-08 1072
2012-03-09 664
2012-03-10 859
2012-03-11 0
2012-03-12 778
2012-03-13 987
I've tried the following, but it is only fast enough on a small sample of data (~1000 rows). I need something that works well on at least 600k rows.
SELECT calendar.datefield AS Date,
COUNT( all_griefs_tbl.actioned_status ) AS total_griefs
FROM all_griefs_tbl
RIGHT JOIN calendar
ON ( DATE( all_griefs_tbl.actioned_date ) = calendar.datefield )
AND all_griefs_tbl.actioned_status = 'accepted'
WHERE calendar.datefield < CURDATE( )
GROUP BY calendar.datefield
Thanks
EDIT: Execution plan as requested
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE calendar range PRIMARY PRIMARY 3 NULL 1576 Using where; Using index
1 SIMPLE all_griefs_tbl ref actioned_status actioned_status 153 const 294975

A few thoughts...
First, although you state that you want days without any values returned by the db query, I would actually do this check on the result set wherever that is being handled. Whenever you do a join, you make your queries more complicated and they require more memory to handle. In this case, I wouldn't regard your use of the calendar table as a particularly good use of a relational database.
EDIT: To clarify, how is the query being called? i.e. is there some program (that you're developing) accessing the database, running the query and presenting the results? If so, I'd suggest getting this program to process the results before presentation.
Second, if you're committed to the 'join', you really should have an index on all_griefs_tbl.actioned_date since this is the column on which you're doing the join. Alternatively, you could specify a foreign key on calendar.datefield.
Third, do you need to use the function DATE(all_griefs_tbl.actioned_date)? Isn't this already a date? (Not sure of your data types, but if this and calendar.datefield are not the same data type, this looks like bad database design.)
EDIT: In light of what you say, you may want to split all_griefs_tbl.actioned_date into two columns: a date column all_griefs_tbl.actioned_date and a time column all_griefs_tbl.actioned_time. At the moment, you're running the DATE() function on every row in all_griefs_tbl in order to do the join, which will very quickly make the query sluggish. Splitting the column would also allow you to add an index on both the date and time columns, which should improve the performance of the join. (Given your current db design, I'm not surprised the index on actioned_date didn't help. Because the join condition wraps the column in DATE(), I'd expect that if you rerun EXPLAIN with an index on the actioned_date column as it currently stands, it wouldn't show that index being used on all_griefs_tbl.)
Fourth, you may want to consider what types of information are stored in all_griefs_tbl.actioned_status. Could this be replaced by a boolean? This would be more efficient in both storing and processing the data. (Although again, this depends on your database design.)
EDIT: You could consider changing all_griefs_tbl.actioned_status to a smaller datatype - I expect it's currently a varchar, but you could easily change this to a single (or small) char datatype, or even to a number of booleans. However, I don't expect this to be the main performance overhead, and it is really a more involved database design decision that depends on the needs of your project.
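The first suggestion in this answer, filling in the zero-count days in the calling program rather than in SQL, could look roughly like the sketch below. It assumes the program is written in Python and that the query result has already been read into a dict keyed by date; `fill_missing_days` and `query_result` are hypothetical names, not part of the original post.

```python
from datetime import date, timedelta

def fill_missing_days(counts, start, end):
    """Return (day, count) for every day from start to end inclusive,
    using 0 for days the query returned no row for."""
    days = (end - start).days + 1
    result = []
    for offset in range(days):
        day = start + timedelta(days=offset)
        result.append((day, counts.get(day, 0)))
    return result

# Hypothetical output of a plain GROUP BY query (no calendar join):
# only days with at least one accepted record are present.
query_result = {date(2012, 3, 4): 1069, date(2012, 3, 6): 615}

filled = fill_missing_days(query_result, date(2012, 3, 4), date(2012, 3, 7))
for day, count in filled:
    print(day.isoformat(), count)
```

This keeps the SQL to a single-table aggregate; whether that is actually faster than the calendar join depends on the data and is worth measuring.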

I suggest splitting your actioned_date datetime column into 2 separate date and time columns, let's say actioned_date and actioned_time, so you could change your first join condition from
ON ( DATE( all_griefs_tbl.actioned_date ) = calendar.datefield )
to
ON ( all_griefs_tbl.actioned_date = calendar.datefield )
and adding an index
ALTER TABLE all_griefs_tbl ADD INDEX g_status_date( actioned_status, actioned_date, actioned_time );
It would probably make your query instant for a table with 600k rows.
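If splitting the column is not an option (which is what the question's update asks about), one alternative is to compare the raw datetime against a half-open range for each calendar day instead of wrapping it in DATE(). In MySQL the condition would be written as actioned_date >= calendar.datefield AND actioned_date < calendar.datefield + INTERVAL 1 DAY. The sketch below only checks that this form gives the same per-day counts; it uses SQLite through Python's sqlite3 module as a stand-in for MySQL (an assumption made so the example is self-contained), so the date arithmetic syntax differs and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE calendar (datefield TEXT PRIMARY KEY);
    CREATE TABLE all_griefs_tbl (actioned_status TEXT, actioned_date TEXT);
    CREATE INDEX g_status_date ON all_griefs_tbl (actioned_status, actioned_date);
    INSERT INTO calendar VALUES ('2012-03-01'), ('2012-03-02'), ('2012-03-03');
    -- Invented sample rows, not data from the original post
    INSERT INTO all_griefs_tbl VALUES
        ('accepted', '2012-03-01 09:15:00'),
        ('accepted', '2012-03-01 23:59:59'),
        ('rejected', '2012-03-02 12:00:00'),
        ('accepted', '2012-03-03 00:00:00');
""")

# The datetime column is compared directly, with no function applied to it.
# date(x, '+1 day') is SQLite syntax; MySQL would use x + INTERVAL 1 DAY.
rows = conn.execute("""
    SELECT c.datefield, COUNT(g.actioned_status) AS total_griefs
    FROM calendar AS c
    LEFT JOIN all_griefs_tbl AS g
           ON g.actioned_status = 'accepted'
          AND g.actioned_date >= c.datefield
          AND g.actioned_date <  date(c.datefield, '+1 day')
    GROUP BY c.datefield
    ORDER BY c.datefield
""").fetchall()

for datefield, total in rows:
    print(datefield, total)
```

Whether MySQL actually picks an index for this condition depends on the version and the data, so it is worth rerunning EXPLAIN on the real table before relying on it.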

Related

How to generate faster mysql query with 1.6M rows

I have a table that has 1.6M rows. Whenever I use the query below, I get an average of 7.5 seconds.
select * from table
where pid = 170
and cdate between '2017-01-01 0:00:00' and '2017-12-31 23:59:59';
I tried adding a LIMIT of 1000 or 10000, and changing the date filter to cover 1 month, but it still takes an average of 7.5s. I tried adding a composite index on pid and cdate, but that made the query about 1 second slower.
Here is the INDEX list
https://gist.github.com/primerg/3e2470fcd9b21a748af84746554309bc
Can I still make it faster? Is this an acceptable performance considering the amount of data?
It looks like the index is missing. Create this index and see if it helps.
CREATE INDEX cid_date_index ON table_name (pid, cdate);
Also modify your query to the one below.
select * from table
where pid = 170
and cdate between CAST('2017-01-01 0:00:00' AS DATETIME) and CAST('2017-12-31 23:59:59' AS DATETIME);
Please provide SHOW CREATE TABLE clicks.
How many rows are returned? If it is 100K rows, the effort to shovel that many rows is significant. And what will you do with that many rows? If you then summarize them, consider summarizing in SQL!
Do have cdate as DATETIME.
Do you use id for anything? Perhaps this would be better:
PRIMARY KEY (pid, cdate, id) -- to get benefit from clustering
INDEX(id) -- if still needed (and to keep AUTO_INCREMENT happy)
This smells like Data Warehousing. DW benefits significantly from building and maintaining Summary table(s), such as one that has the daily click count (etc), from which you could very rapidly sum up 365 counts to get the answer.
CAST is unnecessary. Furthermore 0:00:00 is optional -- it can be included or excluded for either DATE or DATETIME. I prefer
cdate >= '2017-01-01'
AND cdate < '2017-01-01' + INTERVAL 1 YEAR
to avoid leap year, midnight, date arithmetic, etc.

Storing date periods in database

I would like to discuss the "best" way to store date periods in a database. Let's talk about SQL/MySQL, but this question may apply to any database. I have the feeling I have been doing something wrong for years...
In English, the information I have is:
-In year 2014, value is 1000
-In year 2015, value is 2000
-In year 2016, there is no value
-In year 2017 (and onwards), value is 3000
Someone may store as:
BeginDate EndDate Value
2014-01-01 2014-12-31 1000
2015-01-01 2015-12-31 2000
2017-01-01 NULL 3000
Others may store as:
Date Value
2014-01-01 1000
2015-01-01 2000
2016-01-01 NULL
2017-01-01 3000
With the first method, the validation rules needed to avoid holes and overlaps look like mayhem to develop.
With the second method, the problem seems to be finding the period that contains one specific date.
What do my colleagues prefer? Any other suggestions?
EDIT: I used full year only for example, my data usually change with day granularity.
EDIT 2: I thought about using stored "Date" as "BeginDate", order rows by Date, then select the "EndDate" in next (or previous) row. Storing "BeginDate" and "Interval" would lead to hole/overlap problem as method one, that I need a complex validation rule to avoid.
It mostly depends on the way you will be using this information - I'm assuming you do more than just store values for a year in your database.
Lots of guesses here, but I guess you have other tables with time-bounded data, and that you need to compare the dates to find matches.
For instance, in your current schema:
select *
from other_table ot
inner join year_table yt on ot.transaction_date between yt.year_start and yt.year_end
That should be an easy query to optimize - it's a straight data comparison, and if the table is big enough, you can add indexes to speed it up.
In your second schema suggestion, it's not as easy:
select *
from other_table ot
inner join year_table yt
on ot.transaction_date >= yt.year_start
and ot.transaction_date < yt.year_start + INTERVAL 1 YEAR
Crucially - this is harder to optimize, as every comparison needs to execute a scalar function. It might not matter - but with a large table, or a more complex query, it could be a bottleneck.
You can also store the year as an integer (as some of the commenters recommend).
select *
from other_table ot
inner join year_table yt on year(ot.transaction_date) = yt.year
Again - this is likely to have a performance impact, as every comparison requires a function to execute.
The purist in me doesn't like to store this as an integer - so you could also use MySQL's YEAR datatype.
So, assuming data size isn't an issue you're optimizing for, the solution really would lie in the way your data in this table relates to the rest of your schema.
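The question's second layout (one start date per row, with each period ending where the next begins) can be looked up without storing an end date. The sketch below shows the idea in application code, assuming the rows are loaded sorted by date; `periods` and `value_on` are hypothetical names, and the data is the example from the question.

```python
from bisect import bisect_right
from datetime import date

# Rows from the question's second layout, sorted by start date.
# None marks a period with no value.
periods = [
    (date(2014, 1, 1), 1000),
    (date(2015, 1, 1), 2000),
    (date(2016, 1, 1), None),
    (date(2017, 1, 1), 3000),
]
starts = [start for start, _ in periods]

def value_on(day):
    """Return the value in effect on `day`, or None if there is none."""
    idx = bisect_right(starts, day) - 1  # last period starting on or before `day`
    if idx < 0:
        return None  # `day` is before the first stored period
    return periods[idx][1]

print(value_on(date(2015, 6, 15)))  # falls in the 2015 period
print(value_on(date(2016, 3, 1)))   # falls in the period with no value
```

Because each row's end is implied by the next row's start, holes and overlaps cannot occur in this layout, which is the property the question was after.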

Datetime vs Date and Time Mysql

I generally use a datetime field to store the created_time and updated_time of data within an application.
But now I have come across a database table where the date and time are kept in separate fields.
So in which schemas should each of these be used, and why?
What are the pros and cons of each approach?
There is a huge difference in performance when using a DATE field instead of a DATETIME field. I have a table with more than 4,000,000 records, and for testing purposes I added 2 fields, each with its own index: one using DATETIME and the other using DATE.
I disabled the MySQL query cache to be able to test properly and ran the same query in a loop 1000 times:
SELECT * FROM `logs` WHERE `dt` BETWEEN '2015-04-01' AND '2015-05-01' LIMIT 10000,10;
DATETIME INDEX:
197.564 seconds.
SELECT * FROM `logs` WHERE `d` BETWEEN '2015-04-01' AND '2015-05-01' LIMIT 10000,10;
DATE INDEX:
107.577 seconds.
Using a date indexed field has a performance improvement of: 45.55%!!
So I would say that if you are expecting a lot of data in your table, please consider separating the date from the time, each with its own index.
I tend to think there are basically no advantages to storing the date and time in separate fields. MySQL offers very convenient functions for extracting the date and time parts of a datetime value.
Okay. There can be some efficiency reasons. In MySQL, you can put separate indexes on the fields. So, if you want to search for particular times, for instance, then a query that counts by hours of the day (for instance) can use an index on the time field. An index on a datetime field would not be used in this case. A separate date field might make it easier to write a query that will use the date index, but, strictly speaking, a datetime should also work.
The one time where I've seen dates and times stored separately is in a trading system. In this case, the trade has a valuation date. The valuation time is something like "NY Open" or "London Close" -- this is not a real time value. It is a description of the time of day used for valuation.
The tricky part is when you have to do date arithmetic on a time value and you do not want a date portion coming into the mix. Ex:
myapptdate = 2014-01-02 09:00:00
Select such and such where myapptdate between 2014-01-02 07:00:00 and 2014-01-02 13:00:00
1900-01-02 07:00:00
2014-01-02 07:00:00
One difference I found is using BETWEEN for dates with non-zero time.
Imagine a search with a "between dates" filter. Users typically expect it to return records from the end day as well, so with DATETIME you always have to add an extra day for BETWEEN to work as expected, while with DATE you only pass what the user entered, with no extra logic needed.
So query
SELECT * FROM mytable WHERE mydate BETWEEN '2020-06-24' AND '2020-06-25'
will return a record for 2020-06-25 16:30:00, while query:
SELECT * FROM mytable WHERE mydatetime BETWEEN '2020-06-24' AND '2020-06-25'
won't - you'd have to add an extra day:
SELECT * FROM mytable WHERE mydatetime BETWEEN '2020-06-24' AND '2020-06-26'
But as victor diaz mentioned, doing datetime calculations with separate date and time fields would be a super inefficient nightmare, and far worse than just adding a day to the second datetime. Therefore I'd only use DATE if the time is irrelevant, or as a "cache" for speeding up date lookups (see Elwin's answer).
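The BETWEEN behaviour described in this answer comes from a date-only bound being read as midnight when it is compared with a datetime. A small sketch in plain Python (not MySQL; `between_inclusive` and `within_days` are hypothetical helpers) shows why the 16:30 record is missed and how a half-open range that ends at the start of the following day picks it up:

```python
from datetime import date, datetime, time, timedelta

def between_inclusive(ts, first_day, last_day):
    """Mimic BETWEEN on a datetime column: date-only bounds are read as midnight."""
    return datetime.combine(first_day, time.min) <= ts <= datetime.combine(last_day, time.min)

def within_days(ts, first_day, last_day):
    """Half-open range: from midnight of first_day up to, but excluding,
    midnight of the day after last_day."""
    start = datetime.combine(first_day, time.min)
    end = datetime.combine(last_day + timedelta(days=1), time.min)
    return start <= ts < end

record = datetime(2020, 6, 25, 16, 30)
print(between_inclusive(record, date(2020, 6, 24), date(2020, 6, 25)))  # False
print(within_days(record, date(2020, 6, 24), date(2020, 6, 25)))        # True
```

Unlike BETWEEN with an extra day added, the half-open form also excludes a record stamped exactly at midnight of the following day.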

Correct MySQL Structure for a Time Range for Query Optimization?

I have a scenario where I want to be able to SELECT rows from a MySQL table, but exclude rows where the current time-of-day is inside a time-range.
Example:
The "quiet" period for one row is 10pm - 8:30am.
My SQL SELECT statement should not return that row if the current server time is after 10pm or before 8:30am.
Example 2: The "quiet period" is NULL and ignored.
Example 3: A new row is created with a quiet period from 9:53am to 9:55am. If the current server time is in that 2-minute window, the row is not returned by the SELECT.
My question:
What data format would you use in the database, and how would you write the query?
I have thought about a few different approaches (defining start_time as one column and duration as another, defining both in seconds... or using Date stamps... or whatever). None of them seem ideal and require a lot of calculation.
Thanks!
I would store the start and end times as MySQL native TIME fields.
You would need to treat ranges that span midnight as two separate ranges, but you would then be able to query the table like this. To find all current quiet periods:
SELECT DISTINCT name FROM `quiet_periods`
WHERE start_time<=CURTIME() AND CURTIME()<=end_time
Or to find all non-active quiet periods
SELECT name FROM quiet_periods WHERE name NOT IN (
SELECT name FROM `quiet_periods`
WHERE start_time<=CURTIME() AND CURTIME()<=end_time
)
So with sample data
id -- name -- start_time -- end_time
1 -- late_night -- 00:00:00 -- 08:30:00
2 -- late_night -- 22:00:00 -- 23:59:59
3 -- null_period -- NULL -- NULL
4 -- nearly_10am -- 09:53:00 -- 09:55:00
At 11pm this would return
null_period
nearly_10am
from the second query.
Depending on performance and how many rows you had you might want to refactor the second query into a JOIN and probably add the relevant INDEXes too.
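Splitting a range that spans midnight into two rows, as in the late_night sample rows above, could be done in application code before inserting. The following is a sketch under that assumption; `split_at_midnight` and `is_quiet` are hypothetical helpers, and the 23:59:59 upper bound simply mirrors the sample data.

```python
from datetime import time

def split_at_midnight(start, end):
    """Return one (start, end) pair, or two pairs if the range wraps past midnight."""
    if start <= end:
        return [(start, end)]
    # Wrapping range such as 22:00 -> 08:30: store it as two same-day ranges
    return [(time(0, 0, 0), end), (start, time(23, 59, 59))]

def is_quiet(now, ranges):
    """Same test as the query: start_time <= now AND now <= end_time."""
    return any(start <= now <= end for start, end in ranges)

late_night = split_at_midnight(time(22, 0), time(8, 30))
print(late_night)
print(is_quiet(time(23, 0), late_night))  # 11pm is inside the quiet period
print(is_quiet(time(9, 0), late_night))   # 9am is outside it
```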

Select day of week from date

I have the following table in MySQL that records event counts of stuff happening each day
event_date event_count
2011-05-03 21
2011-05-04 12
2011-05-05 12
I want to be able to query this efficiently by date range AND by day of week. For example - "What is the event_count on Tuesdays in May?"
Currently the event_date field is a date type. Are there any functions in MySQL that let me query this column by day of week, or should I add another column to the table to store the day of week?
The table will hold hundreds of thousands of rows, so given a choice I'll choose the most efficient solution (as opposed to most simple).
Use DAYOFWEEK in your query, something like:
SELECT * FROM mytable WHERE MONTH(event_date) = 5 AND DAYOFWEEK(event_date) = 7;
This will find all info for Saturdays in May.
To get the fastest reads store a denormalized field that is the day of the week (and whatever else you need). That way you can index columns and avoid full table scans.
Just try the above first to see if it suits your needs and if it doesn't, add some extra columns and store the data on write. Just watch out for update anomalies (make sure you update the day_of_week column if you change event_date).
Note that the denormalized fields will increase the time taken to do writes, increase calculations on write, and take up more space. Make sure you really need the benefit and can measure that it helps you.
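If the day of week is stored on write, the application has to compute it with the same numbering the queries will use. MySQL's DAYOFWEEK() returns 1 for Sunday through 7 for Saturday; the sketch below reproduces that numbering in Python, assuming the application is written in Python. `mysql_dayofweek` is a hypothetical helper, and the rows are the sample data from the question.

```python
from datetime import date

def mysql_dayofweek(day):
    """Match MySQL's DAYOFWEEK() numbering: 1 = Sunday ... 7 = Saturday."""
    # isoweekday() gives 1 = Monday ... 7 = Sunday
    return day.isoweekday() % 7 + 1

# Sample rows from the question: (event_date, event_count)
events = [(date(2011, 5, 3), 21), (date(2011, 5, 4), 12), (date(2011, 5, 5), 12)]

# Value to store in the denormalized day_of_week column when writing each row
rows = [(d, mysql_dayofweek(d), count) for d, count in events]

# "What is the event_count on Tuesdays in May?" (Tuesday is 3 in this numbering)
tuesdays_in_may = sum(count for d, dow, count in rows if d.month == 5 and dow == 3)
print(tuesdays_in_may)
```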
Check the DAYOFWEEK() function.
If you want a textual representation of the day of week, use the DAYNAME() function.