Very similar MySQL queries yield significantly varying durations (WHERE on timespans)

I have a MySQL table with around 600 K rows in it (Engine: InnoDB).
MySQL is running in a VirtualBox machine with Ubuntu 16.04 LTS. The MySQL server version is 5.7.23, if that's relevant.
The columns in the WHERE clauses (open_time and close_time) are both indexed and they are both DATETIME columns.
The column that I'm taking the sum of (volume) is a double.
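For reference, a rough sketch of the table as described (the real schema may differ; the surrogate key is assumed):
CREATE TABLE klines (
  id BIGINT AUTO_INCREMENT PRIMARY KEY,  -- assumed; not mentioned in the question
  open_time DATETIME NOT NULL,
  close_time DATETIME NOT NULL,
  volume DOUBLE,
  INDEX (open_time),
  INDEX (close_time)
) ENGINE=InnoDB;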
This query returns instantly (0.000 seconds):
SELECT *
FROM klines
WHERE (open_time between '2018-01-01 00:00:00' AND '2018-01-01 12:00:00')
;
EXPLAIN output: (screenshot not reproduced)
Whereas this one takes almost a second to fetch (varying between 0.640 and 0.703 seconds across 10 tries):
SELECT SUM(volume)
FROM klines
WHERE open_time >= '2018-01-01 00:00:00' AND close_time <= '2018-01-01 12:00:00'
;
EXPLAIN output: (screenshot not reproduced)
Mind that both queries return about the same rows (720 for the first, 721 for the second; the second query returns the same 720 rows as the first, plus one more).
So, if I want to get just the rows, it does not matter whether the WHERE clause filters on two columns or one. But if I want the SUM of a column, the query gets drastically slower when the WHERE clause filters on two columns. If I use a single column, it again returns instantly.
While I'm perfectly OK with using the query that filters on two open_time criteria, I'm really curious about what's going on.
So, what would be the reason behind this?

open_time between '2018-01-01 00:00:00'
AND '2018-01-01 12:00:00'
can easily use INDEX(open_time) to touch only the interesting rows. But it is not possible to have an index that stops abruptly for this:
open_time >= '2018-01-01 00:00:00'
AND close_time <= '2018-01-01 12:00:00'
INDEX(open_time) could be used, but the last half of the table would be scanned. INDEX(close_time), similarly, would scan the first half of the table. And there is no way to do both.
You probably have an additional constraint that is nowhere visible:
[open..close] time ranges don't overlap?
open is always < close?
These cannot be specified in standard SQL, nor is there any index formulation that would take advantage of either constraint.
Here are three rows that will mess up any optimization attempt:
INSERT INTO klines (open_time, close_time)
VALUES ('2018-01-01 06:00:00', '2037-12-31'),
       ('1971-01-01', '2018-01-01 06:00:00'),
       ('2037-01-01', '1971-01-01');
There are fixes, but they require either assuming non-overlapping ranges and then reworking the queries in severe ways, or playing with buckets.
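As a sketch of one such rewrite: if we may assume open_time < close_time for every row (one of the hidden constraints above), then any row with close_time <= '2018-01-01 12:00:00' also has open_time < '2018-01-01 12:00:00', so an upper bound on open_time can be added without changing the result, letting INDEX(open_time) scan a bounded range:
SELECT SUM(volume)
FROM klines
WHERE open_time >= '2018-01-01 00:00:00'
  AND open_time <  '2018-01-01 12:00:00'   -- sargable; bounds the index range scan
  AND close_time <= '2018-01-01 12:00:00'; -- residual filter on the remaining rows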

Related

My query is really slow even though it has an index

We have to check 7 million rows to compute campagne statistics. It takes around 30 seconds to run the query, and it doesn't improve with indexes; they didn't change the speed at all.
I tried adding indexes on the where fields, the where fields + group by and the where fields + sum.
Server type is MYSQL and the server version is 5.5.31.
SELECT NOW(),
       `banner_campagne`.name,
       `banner_view`.banner_uid,
       SUM(`banner_view`.fetched) AS fetched,
       SUM(`banner_view`.loaded) AS loaded,
       SUM(`banner_view`.seen) AS seen
FROM `banner_view`
INNER JOIN `banner_campagne`
        ON `banner_campagne`.uid = `banner_view`.banner_uid
       AND `banner_campagne`.deleted = 0
       AND `banner_campagne`.weergeven = 1
WHERE `banner_view`.campagne_uid = 6
  AND `banner_view`.datetime >= '2019-07-31 00:00:00'
  AND `banner_view`.datetime < '2019-08-30 00:00:00'
GROUP BY `banner_view`.banner_uid
I expect the query to run in around 5 seconds.
The indexes that you want for this query are probably:
banner_view(campagne_uid, datetime)
banner_campagne(uid, weergeven, deleted)
Note that the order of the columns in the index does matter.
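As DDL, that might look like this (index names are illustrative):
CREATE INDEX idx_bv_campagne_datetime ON banner_view (campagne_uid, datetime);
CREATE INDEX idx_bc_uid_weergeven_deleted ON banner_campagne (uid, weergeven, deleted);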

How to generate faster mysql query with 1.6M rows

I have a table that has 1.6M rows. Whenever I use the query below, I get an average of 7.5 seconds.
select * from table
where pid = 170
and cdate between '2017-01-01 0:00:00' and '2017-12-31 23:59:59';
I tried adding LIMIT 1000 or 10000, and changing the date filter to a single month, but it still averages 7.5 seconds. I tried adding a composite index on pid and cdate, but that came out about 1 second slower.
Here is the INDEX list
https://gist.github.com/primerg/3e2470fcd9b21a748af84746554309bc
Can I still make it faster? Is this an acceptable performance considering the amount of data?
Looks like the index is missing. Create this index and see if it helps:
CREATE INDEX cid_date_index ON table_name (pid, cdate);
Also modify your query as below:
select * from table
where pid = 170
and cdate between CAST('2017-01-01 0:00:00' AS DATETIME) and CAST('2017-12-31 23:59:59' AS DATETIME);
Please provide SHOW CREATE TABLE clicks.
How many rows are returned? If it is 100K rows, the effort to shovel that many rows is significant. And what will you do with that many rows? If you then summarize them, consider summarizing in SQL!
Do have cdate as DATETIME.
Do you use id for anything? Perhaps this would be better:
PRIMARY KEY (pid, cdate, id) -- to get benefit from clustering
INDEX(id) -- if still needed (and to keep AUTO_INCREMENT happy)
This smells like Data Warehousing. DW benefits significantly from building and maintaining Summary table(s), such as one that has the daily click count (etc), from which you could very rapidly sum up 365 counts to get the answer.
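As a sketch of that idea (all names are illustrative, assuming the clicks table mentioned above):
CREATE TABLE clicks_daily (
  pid        INT NOT NULL,
  click_date DATE NOT NULL,
  clicks     INT UNSIGNED NOT NULL,
  PRIMARY KEY (pid, click_date)
);
-- Nightly job: roll up yesterday's rows into the summary table.
INSERT INTO clicks_daily (pid, click_date, clicks)
SELECT pid, DATE(cdate), COUNT(*)
FROM clicks
WHERE cdate >= CURDATE() - INTERVAL 1 DAY
  AND cdate <  CURDATE()
GROUP BY pid, DATE(cdate);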
CAST is unnecessary. Furthermore 0:00:00 is optional -- it can be included or excluded for either DATE or DATETIME. I prefer
cdate >= '2017-01-01'
AND cdate < '2017-01-01' + INTERVAL 1 YEAR
to avoid leap year, midnight, date arithmetic, etc.
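Applied to the original query, that looks like this (a sketch, borrowing the clicks table name from above):
SELECT *
FROM clicks
WHERE pid = 170
  AND cdate >= '2017-01-01'
  AND cdate <  '2017-01-01' + INTERVAL 1 YEAR;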

SQL: Reuse function result in query without using sub-query

In a MySQL DB table that stores sale orders, I have a LastReviewed column that holds the last date and time when the sale order was modified (type timestamp, default value CURRENT_TIMESTAMP). I'd like to plot the number of sales that were modified each day, for the last 90 days, for a particular user.
I'm trying to craft a SELECT that returns the number of days since LastReviewed date, and how many records fall within that range. Below is my query, which works just fine:
SELECT DATEDIFF(CURDATE(), LastReviewed) AS days, COUNT(*) AS number FROM sales
WHERE UserID=123 AND DATEDIFF(CURDATE(),LastReviewed)<=90
GROUP BY days
ORDER BY days ASC
Notice that I am computing DATEDIFF() as well as CURDATE() multiple times for each record. This seems really inefficient, so I'd like to know how I can reuse the results of the previous computation. The first thing I tried was:
SELECT DATEDIFF(CURDATE(), LastReviewed) AS days, COUNT(*) AS number FROM sales
WHERE UserID=123 AND days<=90
GROUP BY days
ORDER BY days ASC
Error: Unknown column 'days' in 'where clause'. So I started to look around the net. Based on another discussion (Can I reuse a calculated field in a SELECT query?), I next tried the following:
SELECT DATEDIFF(CURDATE(), LastReviewed) AS days, COUNT(*) AS number FROM sales
WHERE UserID=123 AND (SELECT days)<=90
GROUP BY days
ORDER BY days ASC
Error: Unknown column 'days' in 'field list'. I also tried the following:
SELECT @days := DATEDIFF(CURDATE(), LastReviewed) AS days,
COUNT(*) AS number FROM sales
WHERE UserID=123 AND @days <= 90
GROUP BY days
ORDER BY days ASC
The query returns zero results, so @days <= 90 seems to evaluate as false, even though if I put it in the SELECT clause and remove the WHERE clause, I can see some results with @days values below 90.
I've gotten things to work by using a sub-query:
SELECT * FROM (
SELECT DATEDIFF(CURDATE(),LastReviewed) AS days,
COUNT(*) AS number FROM sales
WHERE UserID=123
GROUP BY days
) AS t
WHERE days<=90
ORDER BY days ASC
However, I don't know whether it's the most efficient way. Not to mention that even this solution computes CURDATE() once per record, even though its value will be the same from the start to the end of the query. Isn't that wasteful? Am I overthinking this? Help would be welcome.
Note: Mods, should this be on CodeReview? I posted here because the code I'm trying to use doesn't actually work
There are actually two problems with your question.
First, you're overlooking the fact that WHERE precedes SELECT, which is why an alias from the SELECT list isn't visible in the WHERE clause. When the server evaluates WHERE <expression>, it then already knows the value of the calculations done to evaluate <expression> and can reuse those for SELECT.
Worse than that, though, you should almost never write a query that uses a column as an argument to a function, since that usually requires the server to evaluate the expression for each row.
Instead, you should use this:
WHERE LastReviewed >= DATE_SUB(CURDATE(), INTERVAL 90 DAY)
The optimizer will see this and get all excited, because DATE_SUB(CURDATE(), INTERVAL 90 DAY) can be resolved to a constant, which can be used on one side of a >= comparison, which means that if an index exists with LastReviewed as the leftmost relevant column, then the server can immediately eliminate all of the rows with LastReviewed before that constant value, using the index.
Then DATEDIFF(CURDATE(), LastReviewed) AS days (still needed for SELECT) will only be evaluated against the rows we already know we want.
Add a single index on (UserID, LastReviewed) and the server will be able to pinpoint exactly the relevant rows extremely quickly.
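Putting the pieces together, a sketch of the full rewrite (equivalent to the original as long as no LastReviewed value lies in the future):
SELECT DATEDIFF(CURDATE(), LastReviewed) AS days,
       COUNT(*) AS number
FROM sales
WHERE UserID = 123
  AND LastReviewed >= DATE_SUB(CURDATE(), INTERVAL 90 DAY)
GROUP BY days
ORDER BY days ASC;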
Builtin functions are much less costly than, say, fetching rows.
You could get a lot more performance improvement with the following 'composite' index:
INDEX(UserID, LastReviewed)
and change to
WHERE UserID=123
AND LastReviewed >= CURRENT_DATE() - INTERVAL 90 DAY
Your formulation 'hides' LastReviewed in a function call, making it unusable in an index.
If you are still not satisfied with that improvement, then consider a nightly query that computes yesterday's statistics and puts them in a "Summary table". From there, the SELECT you mentioned can run even faster.

MySQL DATE_ADD running too slow with dynamic interval

I have the following query that's running pretty slow when executing it on thousands of records.
SELECT
name,
id
FROM
meetings
WHERE
meeting_date < '2014-09-20 11:00:00' AND (
meeting_date >= '2014-09-20 09:00:00' OR
DATE_ADD(meeting_date, INTERVAL meeting_length SECOND) > '2014-09-20 09:00:00'
)
The query checks whether a meeting overlaps in any way with the span between 2014-09-20 09:00:00 and 2014-09-20 11:00:00. The above query covers all the possible overlapping cases. However, DATE_ADD adds a lot of overhead.
Any way to optimize DATE_ADD? Removing DATE_ADD greatly boosts the performance, but then not all overlapping cases are covered.
I recommend you eliminate the OR.
MySQL won't (can't) perform a range scan operation on an index on column meeting_date when that column is wrapped in a function.
When the comparison is against the bare column, MySQL can do a range scan. But with the comparison against an expression, MySQL has to evaluate that expression for every row in the table, and then compare.
For a large table, we'd get optimal performance with an index with leading column of meeting_date.
I think the "trick" to getting better performance is to rewrite the query to introduce some additional domain knowledge. Specifically, what are the MINIMUM and MAXIMUM values for meeting_length?
I think it's pretty safe to assume it won't be negative. And we probably don't expect it to be zero. But even if the minimum length is greater than zero, we can use zero as our "known" minimum. (It's going to turn out to be more convenient than some other non-zero value.)
What we really need to know is the MAXIMUM value for meeting_length. If that's a known constant value, that would be great, because we're going to include that value in the query. Let's assume the maximum value of meeting_length is the number of seconds in 7 days.
As a demonstration of what I'm thinking:
SELECT m.name
, m.id
FROM meetings m
WHERE m.meeting_date < '2014-09-20 11:00:00'
AND m.meeting_date > '2014-09-20 09:00:00' + INTERVAL -7 DAY
HAVING m.meeting_date + INTERVAL meeting_length SECOND
> '2014-09-20 09:00:00'
Let's unwrap that a bit.
The first predicate is the same as in your original query... the "start" time of the meeting is before the "end" of the specified period.
The third predicate is the same as in your query too... the "end" of the meeting is after the beginning of the specified period. (My personal preference is to use the + INTERVAL form to add a duration to datetime.)
So, just like the original query we're looking for overlap.
I'm suggesting that we include another sargable predicate. The addition of this predicate doesn't really change the check for the overlap, given that we have a known minimum of 0 for meeting_length. What it does do is add a fixed lower bound that we can check against.
To explain that a little bit... if a meeting row satisfies the condition "meeting end is after the period start", then we also know, for that row, that "meeting start is after (period start MINUS meeting length)". And we also know that "meeting start is after (period start MINUS the MAXIMUM possible value of meeting length)".
And for most rows, that's going to be a bigger range... but the "trick" is that this predicate can compare a "bare" column against a constant.
And that means MySQL will be able to use an index range scan operation to satisfy that. The query is of the form:
WHERE meeting_date > const
AND meeting_date < const
And that's perfect for an index range scan. That should benefit performance... assuming there's a suitable index and that significantly limits the number of rows that need to be checked.
But by itself, that returns more rows than we need; we're going to get some meetings that start and end before the start of the period.
So we still need the additional check to further filter down the rows. But it won't have to be evaluated for every row, only for the rows that pass the first two predicates.
AND meeting_date + length > const
We just need MySQL to recognize that meeting_length won't ever be negative; that is, to recognize that this is actually a "stricter" range, not a broader one. It might work with the AND, but we can force MySQL to evaluate that condition later by including it in the HAVING clause.
HAVING meeting_date + length > const
But, all of that is really just a guess.
We'd really need to take a look at the EXPLAIN output.
If that index with the leading column of meeting_date also includes the id and name columns, then MySQL could satisfy the query entirely from the index, without any need to reference pages in the underlying table. (If that happens, we'll see "Using index" in the EXPLAIN output.)
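Such a covering index might look like this, as a sketch (the name is illustrative; meeting_length is included because the HAVING clause references it):
CREATE INDEX idx_meetings_date_covering
    ON meetings (meeting_date, meeting_length, id, name);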
Earlier, I said it would be convenient if we had a known constant for maximum meeting_length.
We could also use a query to determine that from the data:
SELECT MAX(meeting_length) FROM meetings
(An index with meeting_length as the leading column will avoid an expensive full scan of the table.)
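For example (index name illustrative):
CREATE INDEX idx_meeting_length ON meetings (meeting_length);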
We use that value to derive the "constant" value in the predicate.
We could include that query (as an inline view or a subquery), but that might impact performance. (We'd need to test how "smart" the MySQL optimizer is...)
We could try it as a subquery:
SELECT m.name
, m.id
FROM meetings m
WHERE m.meeting_date < '2014-09-20 11:00:00'
AND m.meeting_date > '2014-09-20 09:00:00'
- INTERVAL (SELECT MAX(l.meeting_length) FROM meetings l) SECOND
HAVING m.meeting_date + INTERVAL meeting_length SECOND
> '2014-09-20 09:00:00'
Or try it as an inline view:
SELECT m.name
, m.id
FROM ( SELECT MAX(l.meeting_length) AS max_seconds
FROM meetings l
) d
CROSS
JOIN meetings m
WHERE m.meeting_date < '2014-09-20 11:00:00'
AND m.meeting_date > '2014-09-20 09:00:00'
- INTERVAL d.max_seconds SECOND
HAVING m.meeting_date + INTERVAL meeting_length SECOND
> '2014-09-20 09:00:00'

What is better: select date with trunc date or between

I need to create a query to select some data from my MySQL DB based on date, but in my WHERE clause I have two options:
1 - truncate the date:
select count(*) from mailing_user where date_format(create_date, '%Y-%m-%d')='2013-11-05';
2 - use between
select count(*) from mailing_user where create_date between '2013-11-05 00:00:00' and '2013-11-05 23:59:59';
Both queries will work, but which is better? Or, what's recommended, and why?
Here is an article to read.
http://willem.stuursma.name/2009/01/09/mysql-performance-with-date-functions/
If your create_date column is indexed, the 2nd query will be faster.
But if the column is not indexed and if this is your defined date format, you can use the following query.
select count(*) from mailing_user where DATE(create_date) = '2013-11-05';
I use DATE instead of DATE_FORMAT, as the native function already returns dates in this format ('2013-11-05').
From your question it seems you want to select records from one day. According to the documentation, a DATETIME or TIMESTAMP value can include a trailing fractional seconds part with up to microsecond (6 digit) precision.
So your second query might actually get unlucky and miss records inserted during the very last second of that day, which is why the first one is more precise and guaranteed to always give you the correct result.
The downside is that you cannot index the column through the date_format function, because MySQL isn't cool with that.
If you don't want to use date_format but still want to get around the precision issue, you would change
where create_date between '2013-11-05 00:00:00' and '2013-11-05 23:59:59'
into
where create_date >= '2013-11-05 00:00:00' and create_date < '2013-11-06 00:00:00'
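You can verify that the range form actually uses the index with EXPLAIN (assuming an index on create_date exists):
EXPLAIN SELECT COUNT(*)
FROM mailing_user
WHERE create_date >= '2013-11-05 00:00:00'
  AND create_date <  '2013-11-06 00:00:00';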
Number 2 will be faster if you have an index on create_date, because number 1 won't be able to use the index to quickly scan the results.
However, this requires there to be an index on create_date.
Otherwise I imagine they would be of similar speed; possibly the second would still be faster because of the smaller processing time to compare (a datetime comparison rather than converting to a string and comparing strings), but I doubt it'd be significant.