MySQL indexes on query with WHERE, GROUP BY and ORDER BY clauses

How can I improve the performance of the query below? What indexes might help?
SELECT platform, country, Source, window,
Round(SUM(ProjectedARPI*PlayerCount) / SUM(PlayerCount), 2) AS ProjectedARPI,
Round(SUM(ProjectedARPIOrganicLow*PlayerCount) / SUM(PlayerCount), 2) AS ProjectedARPIOrganicLow,
Round(SUM(ProjectedARPIOrganicMed*PlayerCount) / SUM(PlayerCount), 2) AS ProjectedARPIOrganicMed,
Round(SUM(ProjectedARPIOrganicHigh*PlayerCount) / SUM(PlayerCount), 2) AS ProjectedARPIOrganicHigh,
SUM(PlayerCount) AS PlayerCount, SUM(PayerCount) AS PayerCount,
CASE WHEN(SUM(PlayerCount) > 500 AND SUM(PayerCount) > 10) THEN TRUE ELSE FALSE END AS isSignificant,
ProjectionDate,
min(CohortRangeLow) as CohortRangeLow,
max(CohortRangeHigh) as CohortRangeHigh
FROM web_synch.UI_data
WHERE PlayerCount > 0 AND ProjectionDate BETWEEN '2015-07-25' AND '2016-10-25' AND window = 365
GROUP BY Platform, country, source, ProjectionDate
ORDER BY Platform, source, ProjectionDate;

For this query, basically your only hope of using indexes is either UI_data(window, ProjectionDate, PlayerCount) or UI_data(window, PlayerCount, ProjectionDate). Which is better depends on which selects fewer records... I would guess the first is better.

I suggest that this is the best index:
INDEX(window,        -- first, because "="
      ProjectionDate -- range
)                    -- nothing after the range will be looked at
This has a slight advantage over the 3-column index previously suggested, in that the index will be slightly smaller.
More discussion: Index cookbook.
I expect there will be two sorts -- one for GROUP BY, then one for ORDER BY. It would run a little faster if you made the ORDER BY and the GROUP BY list identical.
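For example, assuming the extra country sort key is acceptable in the output order, a single sort can then satisfy both clauses:
GROUP BY Platform, country, source, ProjectionDate
ORDER BY Platform, country, source, ProjectionDate;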
Possible bug: if ProjectionDate is a DATE datatype, then BETWEEN makes the range fifteen months plus one day. Recommend this pattern:
ProjectionDate >= '2015-07-25'
AND ProjectionDate < '2015-07-25' + INTERVAL 15 MONTH

Related

How to reduce redundant MySQL function calls in a single query?

SELECT hour(datetime), COUNT(animal_id)
FROM animal_outs
WHERE hour(datetime) > 8 AND hour(datetime) < 20
GROUP BY hour(datetime)
I am learning SQL. I am calling hour(datetime) four times in my query. I am curious 1) if this redundancy affects performance, and 2) how I can simplify this redundant code.
Does this affect performance?
Probably not in any meaningful way. The performance of queries is usually dominated by the work done to retrieve and process data, which is typically much more expensive than the overhead of built-in functions (although there are some exceptions, such as regular expressions, which can be rather expensive).
MySQL allows column aliases in the GROUP BY. So a valid "simplification" is:
SELECT hour(datetime) as hh, COUNT(animal_id)
FROM animal_outs
WHERE hour(datetime) > 8 AND hour(datetime) < 20
GROUP BY hh;
Two versions might look simpler to you, but are likely to make things worse. The first is to use HAVING:
SELECT hour(datetime) as hh, COUNT(animal_id)
FROM animal_outs
GROUP BY hh
HAVING hh > 8 AND hh < 20
Technically, this does what you want. But because it filters after the aggregation, it is doing extra work in the GROUP BY. That likely outweighs any savings from not calling hour().
Another method is a subquery:
SELECT hh, COUNT(animal_id)
FROM (SELECT hour(datetime) as hh, animal_id
FROM animal_outs
) ao
WHERE hh > 8 AND hh < 20
GROUP BY hh;
In most databases, this would do what you want, and it might in the most recent versions of MySQL as well. However, MySQL has an irritating tendency to materialize (i.e. write to disk) subqueries in the FROM clause. That adds extra overhead -- once again, probably more than the cost of the extra calls to hour().
Note: It is possible that hour() is a perniciously expensive function, and you might find that either of the last two solutions is faster. Also, you will probably only see an effect on performance if your data has at least a few thousand rows. Trivially small tables (a few dozen or a few hundred rows) are usually processed quickly regardless of such concerns.
If the hour column already holds an integer value, then the function calls disappear entirely, and BETWEEN collapses the two WHERE comparisons into one. (Note that BETWEEN is inclusive, so the equivalent of > 8 AND < 20 is BETWEEN 9 AND 19.)
SELECT hour, COUNT(animal_id)
FROM animal_outs
WHERE hour BETWEEN 9 AND 19
GROUP BY hour;
If hour is a date/datetime column, a function call is still needed. (Note: DATEPART(HH, ...) is SQL Server syntax; the MySQL equivalent is HOUR().)
SELECT HOUR(`hour`), COUNT(animal_id)
FROM animal_outs
WHERE HOUR(`hour`) BETWEEN 9 AND 19
GROUP BY HOUR(`hour`);

How can I make this SQL query faster?

I have a table user_notifications with 1,100,000 records. I have to run the query below, but it takes more than 3 minutes to complete. What can I do to improve the fetch time?
SELECT `user_notifications`.`user_id`
FROM `user_notifications`
WHERE `user_notifications`.`notification_template_id` = 175
AND (DATE(sent_at) >= DATE_SUB(CURDATE(), INTERVAL 4 day))
AND `user_notifications`.`user_id` IN (
1203, 1282, 1499, 2244, 2575, 2697, 2828, 2900, 3085, 3989,
5264, 5314, 5368, 5452, 5603, 6133, 6498..
)
The user ids in the IN block sometimes number up to 1k.
For optimization, I have indexed the user_id and notification_template_id columns in the user_notifications table.
Big IN() lists are inherently slow. Create a temporary table with an index, put the values from the IN() list into that temporary table, and then you'll get the power of an indexed join instead of a giant IN() list.
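A sketch of that approach (names are illustrative; the date filter uses the sargable form discussed below):
CREATE TEMPORARY TABLE tmp_user_ids (
    user_id INT NOT NULL,
    PRIMARY KEY (user_id)
);
INSERT INTO tmp_user_ids (user_id)
VALUES (1203), (1282), (1499); -- ... and the rest of the former IN() list

SELECT un.user_id
FROM user_notifications un
JOIN tmp_user_ids t ON t.user_id = un.user_id
WHERE un.notification_template_id = 175
  AND un.sent_at >= CURDATE() - INTERVAL 4 DAY;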
You seem to be querying for a small date range. How about an index based on the sent_at column? Do you know which index the current query is using?
(1) Don't hide columns in functions if you might need to use an index:
AND (DATE(sent_at) >= DATE_SUB(CURDATE(), INTERVAL 4 day))
-->
AND sent_at >= CURDATE() - INTERVAL 4 day
(2) Use a "composite" index for
WHERE `notification_template_id` = 175
AND sent_at >= ...
AND `user_id` IN (...)
The first column should be the one with '='. It is unclear what to put next, so I suggest adding both of these indexes:
INDEX(notification_template_id, user_id, sent_at)
INDEX(notification_template_id, sent_at)
The Optimizer will probably pick between them correctly.
Composite indexes are not the same as indexes on the individual columns.
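In MySQL syntax, those two suggestions would be created like this (a sketch; the index names are arbitrary):
ALTER TABLE user_notifications
    ADD INDEX idx_tpl_user_sent (notification_template_id, user_id, sent_at),
    ADD INDEX idx_tpl_sent (notification_template_id, sent_at);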
(3) Yes, you could try putting the IN list in a tmp table, but the cost of doing such might outweigh the benefit. I don't think of 1K values in IN() as being "too many".
(4) My cookbook on building indexes.

Speed up SQL SELECT with arithmetic and geometric calculations

This is a follow-up to my previous post How to improve wind data SQL query performance.
I have expanded the SQL statement to also perform the first part in the calculation of the average wind direction using circular statistics. This means that I want to calculate the average of the cosines and sines of the wind direction. In my PHP script, I will then perform the second part and calculate the inverse tangent and add 180 or 360 degrees if necessary.
The wind direction is stored in my table as voltages read from the sensor in the field 'dirvolt', so I first need to convert it to radians.
The user can look at historical wind data by stepping backwards using a pagination function, hence the use of LIMIT, whose values are set dynamically in my PHP script.
My SQL statement currently looks like this:
SELECT ROUND(AVG(speed),1) AS speed_mean, MAX(speed) as speed_max,
MIN(speed) AS speed_min, MAX(dt) AS last_dt,
AVG(SIN(2.04*dirvolt-0.12)) as dir_sin_mean,
AVG(COS(2.04*dirvolt-0.12)) as dir_cos_mean
FROM table
GROUP BY FLOOR(UNIX_TIMESTAMP(dt) / 300)
ORDER BY FLOOR(UNIX_TIMESTAMP(dt) / 300) DESC
LIMIT 0, 72
The query takes about 3-8 seconds to run depending on what value I use to group the data (300 in the code above).
In order for me to learn, is there anything I can do to optimize or improve the SQL statement otherwise?
SHOW CREATE TABLE table;
From that I can see if you already have INDEX(dt) (or equivalent). With that, we can modify the SELECT to be significantly faster.
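If it turns out to be missing, adding it would look like this (a sketch; assumes the table really is named table, as in the query, and dt is the datetime column):
ALTER TABLE `table` ADD INDEX idx_dt (dt);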
But first, change the focus from 72*300 seconds' worth of readings to explicit datetime ranges -- 72*300 seconds is 6 hours.
Let's look at this query:
SELECT * FROM table
WHERE dt >= '...' - INTERVAL 6 HOUR
AND dt < '...';
The '...' would be the same datetime in both places. Does that run fast enough with the index?
If yes, then let's build the final query using that as a subquery:
SELECT ROUND(AVG(speed), 1) AS speed_mean,
MAX(speed) as speed_max,
MIN(speed) AS speed_min,
MAX(dt) AS last_dt,
AVG(SIN(2.04*dirvolt-0.12)) as dir_sin_mean,
AVG(COS(2.04*dirvolt-0.12)) as dir_cos_mean
FROM
( SELECT * FROM table
WHERE dt >= '...' - INTERVAL 6 HOUR
AND dt < '...'
) AS x
GROUP BY FLOOR(UNIX_TIMESTAMP(dt) / 300)
ORDER BY FLOOR(UNIX_TIMESTAMP(dt) / 300) DESC;
Explanation: What you had could not use an index, hence had to scan the entire table (which is getting bigger and bigger). My subquery could use an index, hence was much faster. The effort for my outer query was not "too bad" since it worked with only N rows.

MySQL DATE_ADD running too slow with dynamic interval

I have the following query, which runs pretty slowly when executed over thousands of records.
SELECT name, id
FROM meetings
WHERE meeting_date < '2014-09-20 11:00:00'
  AND ( meeting_date >= '2014-09-20 09:00:00'
        OR DATE_ADD(meeting_date, INTERVAL meeting_length SECOND) > '2014-09-20 09:00:00' )
The query checks whether a meeting overlaps in any way with the window between 2014-09-20 09:00:00 and 2014-09-20 11:00:00. The above query covers all the possible overlapping cases. However, DATE_ADD adds a lot of overhead.
Any way to optimize DATE_ADD? Removing DATE_ADD greatly boosts the performance, but it won't cover all overlapping cases.
I recommend you eliminate the OR.
MySQL won't (can't) perform a range scan operation on an index on column meeting_date when that column is wrapped in a function.
When the comparison is against the bare column, MySQL can do a range scan. But with the comparison to an expression, MySQL has to evaluate that expression for every row in the table, and then compare.
For a large table, we'd get optimal performance with an index with leading column of meeting_date.
I think the "trick" to getting better performance is to rewrite the query to introduce some additional domain knowledge. Specifically, what are the MINIMUM and MAXIMUM values for meeting_length?
I think it's pretty safe to assume it won't be negative. And we probably don't expect it to be zero. But even if the minimum length is greater than zero, we can use zero as our "known" minimum. (It's going to turn out to be more convenient than some other non-zero value.)
What we really need to know is the MAXIMUM value for meeting_length. If that's a known constant value, that would be great, because we're going to include that value in the query. Let's assume the maximum value of meeting_length is the number of seconds in 7 days.
As a demonstration of what I'm thinking:
SELECT m.name
     , m.id
  FROM meetings m
 WHERE m.meeting_date < '2014-09-20 11:00:00'
   AND m.meeting_date > '2014-09-20 09:00:00' + INTERVAL -7 DAY
HAVING m.meeting_date + INTERVAL meeting_length SECOND
     > '2014-09-20 09:00:00'
Let's unwrap that a bit.
The first predicate is the same as in your original query... the "start" time of the meeting is before the "end" of the specified period.
The third predicate is the same as in your query too... the "end" of the meeting is after the beginning of the specified period. (My personal preference is to use the + INTERVAL form to add a duration to a datetime.)
So, just like the original query we're looking for overlap.
I'm suggesting that we include another sargable predicate. The addition of this predicate doesn't really change the check for the overlap, given that we have a known minimum of 0 for meeting_length. What it does do is add a fixed lower bound that we can check against.
To explain that a little bit... if a meeting row satisfies the condition "meeting end is after the period start", then we also know, for that row, that "meeting start is after (period start MINUS meeting length)". And we also know that "meeting start is after (period start MINUS the MAXIMUM possible value of meeting length)".
And for most rows, that's going to be a bigger range... but the "trick" is that this new predicate compares a "bare" column against a constant.
And that means MySQL will be able to use an index range scan operation to satisfy that. The query is of the form:
WHERE meeting_date > const
AND meeting_date < const
And that's perfect for an index range scan. That should benefit performance... assuming there's a suitable index and that it significantly limits the number of rows that need to be checked.
But by itself, that returns more rows than we need; we're going to get some meetings that start and end before the start of the period.
So we still need the additional check to further filter down the rows. But that won't have to be evaluated for every row, only for the rows that pass the first two predicates.
AND meeting_date + length > const
We just need MySQL to recognize that length won't ever be negative, and to recognize that this is actually a "stricter" range, not a broader one. It might work with the AND, but we can force MySQL to evaluate that condition later by including it in the HAVING clause.
HAVING meeting_date + length > const
But, all of that is really just a guess.
We'd really need to take a look at the EXPLAIN output.
If that index with the leading column of meeting_date also includes the id and name columns, then MySQL could satisfy the query entirely from the index, without any need to reference pages in the underlying table. (If that happens, we'll see "Using index" in the EXPLAIN output.)
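Such a covering index might look like this (a sketch; the name is arbitrary, and meeting_length is included because the HAVING clause references it):
ALTER TABLE meetings
    ADD INDEX idx_meeting_date_covering (meeting_date, meeting_length, id, name);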
Earlier, I said it would be convenient if we had a known constant for maximum meeting_length.
We could also use a query to determine that from the data:
SELECT MAX(meeting_length) FROM meetings
(An index with meeting_length as the leading column will avoid an expensive full scan of the table.)
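A sketch of that supporting index (name arbitrary):
ALTER TABLE meetings ADD INDEX idx_meeting_length (meeting_length);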
We use that value to derive the "constant" value in the predicate.
We could include that query (as an inline view or a subquery), but that might impact performance. (We'd need to test how "smart" the MySQL optimizer is.)
We could try it as a subquery:
SELECT m.name
     , m.id
  FROM meetings m
 WHERE m.meeting_date < '2014-09-20 11:00:00'
   AND m.meeting_date > '2014-09-20 09:00:00'
                      - INTERVAL (SELECT MAX(l.meeting_length) FROM meetings l) SECOND
HAVING m.meeting_date + INTERVAL meeting_length SECOND
     > '2014-09-20 09:00:00'
Or try it as an inline view:
SELECT m.name
     , m.id
  FROM ( SELECT MAX(l.meeting_length) AS max_seconds
           FROM meetings l
       ) d
 CROSS
  JOIN meetings m
 WHERE m.meeting_date < '2014-09-20 11:00:00'
   AND m.meeting_date > '2014-09-20 09:00:00'
                      - INTERVAL d.max_seconds SECOND
HAVING m.meeting_date + INTERVAL meeting_length SECOND
     > '2014-09-20 09:00:00'

How to optimize query with date calculation

My table has about 1 million records.
I need to select a few indices at certain dates, but only Year and Month are relevant:
SELECT `index_name`,`results` FROM `mst_ind` WHERE
((`index_name`='MSCI EAFE Mid NR USD' AND MONTH(`date`) = 3 AND YEAR(`date`) = 2003) OR
(`index_name`='MSCI Morocco PR USD' AND MONTH(`date`) = 3 AND YEAR(`date`) = 2003))
AND `time_period`='M1'
It works fine, but the performance is horrible. I ran the query through the profiler, but it could not suggest any possible keys.
The primary key contains index_id, date and time_period.
How can I optimize/improve this query?
Thanks!
You are probably preventing the use of an index by applying transformations, via functions such as MONTH and YEAR, to fields that would otherwise be indexable.
You could write the WHERE clause differently so that it doesn't use the MONTH and YEAR functions, for example:
date >= '2003-03-01' and date < '2003-04-01'
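Applied to the whole query, that might look like this (a sketch; the OR also collapses into an IN, since both branches share the same date condition):
SELECT `index_name`, `results`
FROM `mst_ind`
WHERE `index_name` IN ('MSCI EAFE Mid NR USD', 'MSCI Morocco PR USD')
  AND `date` >= '2003-03-01' AND `date` < '2003-04-01'
  AND `time_period` = 'M1';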
Edit: I just realized you probably don't have any indexes on this table. Consider adding indexes to the index_name, date and time_period fields.
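Following the "'=' columns first, range last" rule from the earlier answers, a composite index along these lines is a reasonable starting point (a sketch; the index name is arbitrary):
ALTER TABLE `mst_ind`
    ADD INDEX idx_name_period_date (`index_name`, `time_period`, `date`);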