MySQL Query - Large Date Range Optimisation

I have the following MySQL query and am looking for opportunities to improve its performance. Over smaller date ranges (1-2 months) it uses the available indexes, but when querying 12+ months it scans the entire table and takes 10+ seconds (720k rows).
I have a BTREE index on "ordered_at, sales_channel_id" columns, which is used for the smaller queries. Would there be a more optimal index to use?
I hope the query is obvious, but what I'm trying to achieve is a list of all orders for a specific "model_id", broken down by Year, Month Name and Week of the Month, with a sum of item quantity.
The query is as follows:
SELECT
`inventory`.`model_id`,
YEAR(`order_items`.ordered_at) AS Year,
MONTHNAME(`order_items`.ordered_at) AS Month,
CONCAT("Week ", FLOOR(((DAY(`order_items`.ordered_at) - 1) / 7) + 1)) AS Week,
SUM(`order_items`.quantity) AS UnitsSold
FROM
`order_items`
JOIN `inventory` ON `inventory`.`sku` = `order_items`.`sku`
AND `inventory`.`id` = (
SELECT
min(id)
FROM
inventory
WHERE
`inventory`.sku = `order_items`.sku)
WHERE
`order_items`.`ordered_at` BETWEEN '2022-01-01 00:00:00' AND '2023-01-01 23:59:59'
AND `order_items`.`sales_channel_id` in(1, 2, 3, 4)
GROUP BY
`order_items`.`model_id`, `Year`, month(`order_items`.ordered_at), `Month`, `Week`
ORDER BY
`order_items`.`model_id` ASC,
`Year` ASC,
month(`order_items`.ordered_at) ASC,
`Month` ASC,
`Week` ASC;
Any help would be greatly appreciated
I've tried re-ordering the WHERE clause and adding/removing possible indexes, but the query still takes 10+ seconds over longer date ranges. I also need to run this for a YoY comparison, so it essentially takes 20+ seconds to execute (plus data processing and report rendering time).

Each of these prevents the use of sku in an index:
AND `order_items`.`sku` IS NOT NULL
AND `order_items`.`sku` != ''
Since the following are different, there will be an extra temp table and sort. (EXPLAIN FORMAT=JSON SELECT ... should show that.)
GROUP BY `model_id`, `Year`, month(ordered_at), `Month`, `Week`
ORDER BY `model_id` ASC, `Year` ASC, month(ordered_at) ASC, `Week` ASC
If I read it correctly, Month is effectively the same as MONTH(ordered_at), hence it could be removed from the GROUP BY. If not, then add it in the same position to the ORDER BY.
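For illustration, a sketch of what aligned clauses might look like (keeping every expression in both lists, in the same order, so the grouped result can feed the ORDER BY without an extra sort):
GROUP BY `model_id`, `Year`, month(ordered_at), `Month`, `Week`
ORDER BY `model_id`, `Year`, month(ordered_at), `Month`, `Week`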
(Please qualify all columns with which table they are in; I cannot suggest better indexes without knowing what is where.) At least the following is needed:
order_items: INDEX(SKU)
It looks like the LEFT JOIN is really JOIN, so change that.
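As a sketch, the index called for above could be added as follows; the second, covering index is only an assumption (one possible shape to help the sales_channel_id/ordered_at filter), not something stated in this answer:
ALTER TABLE order_items
  ADD INDEX idx_oi_sku (sku),  -- the INDEX(sku) suggested above
  ADD INDEX idx_oi_channel_date (sales_channel_id, ordered_at, sku, quantity);  -- hypothetical covering index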


Rewrite SQL query to Fix Functional Dependency Issue Caused By MySQL 5.7 Strict Mode

I recently upgraded my MySQL server to version 5.7 and the following example query does not work:
SELECT *
FROM (SELECT *
FROM exam_results
WHERE exam_body_id = 6674
AND exam_date >= DATE_SUB(CURDATE(), INTERVAL 1 WEEK)
AND subject_ids LIKE '%4674%'
ORDER BY score DESC
) AS top_scores
GROUP BY user_id
ORDER BY percent_score DESC, time_advantage DESC
LIMIT 10
The query is supposed to select exam results from the specified table matching the top scorers who wrote a particular exam within some time interval. The reason why I had to include a GROUP BY clause when I first wrote the query was to eliminate duplicate users, i.e. users who have more than one top score from writing the exam within the same time period. Without eliminating duplicate user IDs, a query for the top 10 high scorers could return exam results from the same person.
My question is: how do I rewrite this query to remove the error associated with MySQL 5.7 strict mode enforced on GROUP BY clauses while still retaining the functionality I want?
That is because you never really wanted aggregation to begin with. So, you used a MySQL extension that allowed your syntax -- even though it is wrong by the definition of SQL: The GROUP BY and SELECT clauses are incompatible.
You appear to want the row with the maximum score for each user meeting the filtering conditions. A much better approach is to use window functions:
SELECT er.*
FROM (SELECT er.*,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY score DESC) as seqnum
FROM exam_results er
WHERE exam_body_id = 6674 AND
exam_date >= DATE_SUB(CURDATE(), INTERVAL 1 WEEK) AND
subject_ids LIKE '%4674%'
) er
WHERE seqnum = 1
ORDER BY percent_score DESC, time_advantage DESC
LIMIT 10;
You can do something similar in older versions of MySQL. Probably the closest method uses variables:
SELECT er.*,
(@rn := if(@u = user_id, @rn + 1,
if(@u := user_id, 1, 1)
)
) as rn
FROM (SELECT er.*
FROM exam_results er
WHERE exam_body_id = 6674 AND
exam_date >= DATE_SUB(CURDATE(), INTERVAL 1 WEEK) AND
subject_ids LIKE '%4674%'
ORDER BY user_id, score DESC
) er CROSS JOIN
(SELECT @u := -1, @rn := 0) params
HAVING rn = 1
ORDER BY percent_score DESC, time_advantage DESC
LIMIT 10
When you aggregate (GROUP BY) a result set by a subset of the columns (user_id), then all the other columns need to be aggregated.
Note: according to the SQL Standard if you are grouping by the primary key this is not necessary, since all the other columns are dependent on the PK. Nevertheless, this is not the case in your question.
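For example (a hypothetical illustration, assuming id is the primary key of exam_results), the following is accepted even under ONLY_FULL_GROUP_BY, because every other column is functionally dependent on the key:
SELECT id, user_id, score
FROM exam_results
GROUP BY id;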
Now, you can use any aggregation function like MAX(), MIN(), SUM(), etc. I chose to use MAX(), but you can change it for any of them.
The query can run as:
SELECT
user_id,
max(exam_body_id),
max(exam_date),
max(subject_ids),
max(percent_score),
max(time_advantage)
FROM exam_results
WHERE exam_body_id = 6674
AND exam_date >= DATE_SUB(CURDATE(), INTERVAL 1 WEEK)
AND subject_ids LIKE '%4674%'
GROUP BY user_id
ORDER BY max(percent_score) DESC, max(time_advantage) DESC
LIMIT 10
See running example at DB Fiddle.
Now, why do you need to aggregate the other columns, you ask? Since you are grouping rows, the engine needs to produce a single row per group. Therefore, you need to tell the engine which value to pick when there are many values to pick from: the biggest one, the smallest one, the average of them, etc.
In MySQL 5.7.4 or older, the engine didn't require you to aggregate the other columns. The engine silently and randomly decided for you. You may have got the result you wanted today, but tomorrow the engine could choose the MIN() instead of the MAX() without you knowing, therefore leading to unpredictable results every time you run the query.
An alternative to Gordon's answer using user-defined variables and a CASE conditional statement for older versions of MySQL is as follows:
SELECT *
FROM (
SELECT *,
@row_number := CASE WHEN @user_id <> er.user_id
THEN 1
ELSE @row_number + 1 END
AS row_number,
@user_id := er.user_id
FROM exam_results er
CROSS JOIN (SELECT @row_number := 0, @user_id := null) params
WHERE exam_body_id = 6674 AND
exam_date >= DATE_SUB(CURDATE(), INTERVAL 1 WEEK) AND
subject_ids LIKE '%4674%'
ORDER BY er.user_id, score DESC
) inner_er
HAVING inner_er.row_number = 1
ORDER BY score DESC, percent_score DESC, time_advantage DESC
LIMIT 10
This achieved the filtering behavior I wanted without having to rely on the unpredictable behavior of a GROUP BY clause and aggregate functions.

SQL query Compare two WHERE clauses using same table

I am looking to compare two sets of data that are stored in the same table. I am sorry if this is a duplicate SO post; I have read some other posts but have not been able to apply them to solve my problem.
I am running a query to show all Athletes and times for the most recent date (2017-05-20):
SELECT `eventID`,
`location`,
`date`,
`barcode`,
`runner`,
`Gender`,
`time`
FROM `TableName`
WHERE `date` = '2017-05-20'
I would like to compare the time achieved on the 20th May with the previous time for each athlete.
SELECT `time` FROM `TableName` WHERE `date`='2017-05-13'
How can I structure my query to show all of the ATHLETES, TIME on the 13th, and TIME on the 20th?
I have tried some methods, such as UNION ALL.
You can get the previous time using a correlated subquery:
SELECT t.*,
(SELECT t2.time
FROM TableName t2
WHERE t2.runner = t.runner AND t2.eventId = t.eventId AND
t2.date < t.date
ORDER BY t2.date DESC
LIMIT 1
) prev_time
FROM `TableName` t
WHERE t.date = '2017-05-20';
For performance, you want an index on (runner, eventid, date, time).
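For instance (a sketch, assuming the table and column names shown in the question):
ALTER TABLE `TableName`
  ADD INDEX idx_runner_event_date (`runner`, `eventID`, `date`, `time`);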

MySQL combine 2 different counts in one query

I have a table, that pretty much looks like this:
users (id INT, masterId INT, date DATETIME)
Every user has exactly one master. But masters can have n users.
Now I want to find out how many users each master has. I'm doing that this way:
SELECT `masterId`, COUNT(`id`) AS `total` FROM `users` GROUP BY `masterId` ORDER BY `total` DESC
But now I also want to know how many new users a master has since the last 14 days. I could do it with this query:
SELECT `masterId`, COUNT(`id`) AS `last14days` FROM `users` WHERE `date` > DATE_SUB(NOW(), INTERVAL 14 DAY) GROUP BY `masterId` ORDER BY `last14days` DESC
Now the question: Could I somehow get this information with one query, instead of using 2 queries?
You can use conditional aggregation to do this, counting only the rows for which the condition is true. In standard SQL this is done using a CASE expression inside the aggregate function:
SELECT
masterId,
COUNT(id) AS total,
SUM(CASE WHEN date > DATE_SUB(NOW(), INTERVAL 14 DAY) THEN 1 ELSE 0 END) AS last14days
FROM users
GROUP BY masterId
ORDER BY total DESC
Sample SQL Fiddle
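Because the CASE expression only ever returns 0 or 1, MySQL's boolean handling allows a shorter, MySQL-specific variation (a minor rewrite, not part of the answer above):
SELECT
  masterId,
  COUNT(id) AS total,
  SUM(date > DATE_SUB(NOW(), INTERVAL 14 DAY)) AS last14days
FROM users
GROUP BY masterId
ORDER BY total DESC;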

Fastest way to get closest data from multiple tables based on time

I have three tables, with the following setup:
TEMPERATURE_1
time
zone (FK)
temperature
TEMPERATURE_2
time
zone (FK)
temperature
TEMPERATURE_3
time
zone (FK)
temperature
The data in each table is updated periodically, but not necessarily concurrently (ie, the time entries are not identical).
I want to be able to access the closest reading from each table for each time, ie:
TEMPERATURES
time
zone (FK)
temperature_1
temperature_2
temperature_3
In other words, for every unique time across my three tables, I want a row in the TEMPERATURES table, where the temperature_n values are the temperature reading closest in time from each original table.
At the moment, I've set this up using two views:
create view temptimes
as select time, zone
from temperature_1
union
select time, zone
from temperature_2
union
select time, zone
from temperature_3;
create view temperatures
as select tt.time,
tt.zone,
(select temperature
from temperature_1
order by abs(timediff(time, tt.time))
limit 1) as temperature_1,
(select temperature
from temperature_2
order by abs(timediff(time, tt.time))
limit 1) as temperature_2,
(select temperature
from temperature_3
order by abs(timediff(time, tt.time))
limit 1) as temperature_3
from temptimes as tt
order by tt.time;
This approach works, but is too slow to use in production (it takes minutes+ for small data sets of ~1000 records for each temperature).
I'm not great with SQL, so I'm sure I'm missing the correct way to do this. How should I approach the problem?
The expensive part is where the correlated subqueries have to compute the time difference for every single row of each temperature_* table to find just one closest row for one column of one row in the main query.
It would be dramatically faster if you could just pick one row after and one row before the current time according to an index and only compute the time difference for these two candidates. All you need for that to be fast is an index on the column time in your tables.
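Such indexes could be created like this (a sketch, assuming the table names from the question):
CREATE INDEX idx_t1_time ON temperature_1 (time);
CREATE INDEX idx_t2_time ON temperature_2 (time);
CREATE INDEX idx_t3_time ON temperature_3 (time);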
I am ignoring the column zone, since its role remains unclear in the question, and it just adds more noise to the core problem. It should be easy to add to the query.
Without an additional view, this query does all at once:
SELECT time
,COALESCE(temp1
,CASE WHEN timediff(time, time1a) > timediff(time1b, time) THEN
(SELECT t.temperature
FROM temperature_1 t
WHERE t.time = y.time1b)
ELSE
(SELECT t.temperature
FROM temperature_1 t
WHERE t.time = y.time1a)
END) AS temp1
,COALESCE(temp2
,CASE WHEN timediff(time, time2a) > timediff(time2b, time) THEN
(SELECT t.temperature
FROM temperature_2 t
WHERE t.time = y.time2b)
ELSE
(SELECT t.temperature
FROM temperature_2 t
WHERE t.time = y.time2a)
END) AS temp2
,COALESCE(temp3
,CASE WHEN timediff(time, time3a) > timediff(time3b, time) THEN
(SELECT t.temperature
FROM temperature_3 t
WHERE t.time = y.time3b)
ELSE
(SELECT t.temperature
FROM temperature_3 t
WHERE t.time = y.time3a)
END) AS temp3
FROM (
SELECT time
,max(t1) AS temp1
,max(t2) AS temp2
,max(t3) AS temp3
,CASE WHEN max(t1) IS NULL THEN
(SELECT t.time FROM temperature_1 t
WHERE t.time < x.time
ORDER BY t.time DESC LIMIT 1) ELSE NULL END AS time1a
,CASE WHEN max(t1) IS NULL THEN
(SELECT t.time FROM temperature_1 t
WHERE t.time > x.time
ORDER BY t.time LIMIT 1) ELSE NULL END AS time1b
,CASE WHEN max(t2) IS NULL THEN
(SELECT t.time FROM temperature_2 t
WHERE t.time < x.time
ORDER BY t.time DESC LIMIT 1) ELSE NULL END AS time2a
,CASE WHEN max(t2) IS NULL THEN
(SELECT t.time FROM temperature_2 t
WHERE t.time > x.time
ORDER BY t.time LIMIT 1) ELSE NULL END AS time2b
,CASE WHEN max(t3) IS NULL THEN
(SELECT t.time FROM temperature_3 t
WHERE t.time < x.time
ORDER BY t.time DESC LIMIT 1) ELSE NULL END AS time3a
,CASE WHEN max(t3) IS NULL THEN
(SELECT t.time FROM temperature_3 t
WHERE t.time > x.time
ORDER BY t.time LIMIT 1) ELSE NULL END AS time3b
FROM (
SELECT time, temperature AS t1, NULL AS t2, NULL AS t3 FROM temperature_1
UNION ALL
SELECT time, NULL AS t1, temperature AS t2, NULL AS t3 FROM temperature_2
UNION ALL
SELECT time, NULL AS t1, NULL AS t2, temperature AS t3 FROM temperature_3
) AS x
GROUP BY time
) y
ORDER BY time;
Explain
Subquery x replaces your view temptimes and brings the temperature into the result. If all three tables are in sync and have temperatures for all the same points in time, the rest is not even needed and the query is extremely fast.
For every point in time where one of the three tables has no row, the temperature is being fetched as instructed: take the "closest" one from each table.
Subquery y aggregates the rows from x and fetches the previous time (time1a) and the next time (time1b) relative to the current time from each table where the temperature is missing. These lookups should be fast using the index.
The final query fetches the temperature from the row with the closest time for each temperature that's actually missing.
This query could be simpler if MySQL allowed referencing columns from more than one level above the current subquery, but it cannot. It works just fine in PostgreSQL.
It also would be simpler if one could return more than one column from a correlated subquery, but I don't know how to do that in MySQL.
And it would be much simpler with CTEs and window functions, but MySQL doesn't know these modern SQL features (unlike other relevant RDBMS).
The reason that this is slow is that it requires 3 table scans to calculate and order the differences.
I assume that you already have indexes on the time and zone columns - at the moment they won't help because of the table scan problem.
There are a number of options to avoid this depending on what you need and what the data collection rates are.
You have already said that the data is collected periodically but not concurrently. This suggests a few options.
To what level of significance do you need the temperature data - the day, the hour, the minute, etc.? Store the time info to that level of significance only (or have another column that does) and do your queries on that.
If you know that the 3 closest times will be within a certain time frame (hour, day, etc.), put in a WHERE clause to limit the calculation to those times that are potential candidates. You are effectively constructing histogram-type buckets - you will need a calendar table to do this efficiently.
Make the comparison unidirectional i.e. limit consideration to only those times after the time you are looking for, so if you are looking for 12:00:00 then 13:45:32 is a candidate but 11:59:59 isn't.
I understand what you are trying to accomplish - ask yourself why, and whether a simpler solution will meet your needs.
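A sketch of the "level of significance" option above, assuming MySQL 5.7+ generated columns and a DATETIME time column (on older versions, use a plain column maintained by the application or a trigger):
ALTER TABLE temperature_1
  ADD COLUMN time_minute DATETIME
    GENERATED ALWAYS AS (`time` - INTERVAL SECOND(`time`) SECOND) STORED,
  ADD INDEX idx_t1_time_minute (time_minute);
-- repeat for temperature_2 and temperature_3, then join the three tables on time_minute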
My suggestion is that you don't take the closest time, but you take the first time on or before a given time. The reason for this is simple: generally the data for a given time is what is known at that time. Incorporating future information is generally not a good idea for most purposes.
With this change, you can modify your query to take advantage of an index on time. The problem with your current query is that the function in the ORDER BY precludes the use of the index.
So, if you want the most recent temperature, use this instead for each variable:
(select temperature
from temperature_1 t2
where t2.time <= tt.time
order by t2.time desc
limit 1
) as temperature_1,
Actually, you can also construct it like this:
(select time
from temperature_1 t2
where t2.time <= tt.time
order by t2.time desc
limit 1
) as time_1,
And then join the information for the temperature back in. This will be efficient, with the use of an index.
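A sketch of that join-back, using the temptimes view from the question (one column shown; temperature_2 and temperature_3 follow the same pattern):
SELECT x.time,
       t1.temperature AS temperature_1
FROM (
  SELECT tt.time,
         (SELECT t2.time
          FROM temperature_1 t2
          WHERE t2.time <= tt.time
          ORDER BY t2.time DESC
          LIMIT 1) AS time_1
  FROM temptimes tt
) AS x
LEFT JOIN temperature_1 t1 ON t1.time = x.time_1;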
With that in mind, you could actually have two variables time_1_before and time_1_after, for the best time on or before and the best time on or after. You can use logic in the select to choose the nearest value. The joins back to the temperature should be efficient using an index.
But, I will reiterate, I think the last temperature on or before may be the best choice.

MySQL Query not selecting correct date range

I'm currently trying to run a SQL query to export data between certain dates. It runs the query fine, just not the date selection, and I can't figure out what's wrong.
SELECT
title AS Order_No,
FROM_UNIXTIME(entry_date, '%d-%m-%Y') AS Date,
status AS Status,
field_id_59 AS Transaction_ID,
field_id_32 AS Customer_Name,
field_id_26 AS Sub_Total,
field_id_28 AS VAT,
field_id_31 AS Discount,
field_id_27 AS Shipping_Cost,
(field_id_26+field_id_28+field_id_27-field_id_31) AS Total
FROM
exp_channel_data AS d NATURAL JOIN
exp_channel_titles AS t
WHERE
t.channel_id = 5 AND FROM_UNIXTIME(entry_date, '%d-%m-%Y') BETWEEN '01-05-2012' AND '31-05-2012' AND status = 'Shipped'
ORDER BY
entry_date DESC
As explained in the manual, date literals should be in YYYY-MM-DD format. Also, bearing in mind the point made by @ypercube in his answer, you want:
WHERE t.channel_id = 5
AND entry_date >= UNIX_TIMESTAMP('2012-05-01')
AND entry_date < UNIX_TIMESTAMP('2012-06-01')
AND status = 'Shipped'
Besides the date format there is another issue. To effectively use any index on entry_date, you should not apply functions to that column when you use it in conditions in WHERE, GROUP BY or HAVING clauses (you can use the formatting in the SELECT list if you need a format other than the default to be shown). An effective way to write that part of the query would be:
( entry_date >= '2012-05-01'
AND entry_date < '2012-06-01'
)
It works with DATE, DATETIME and TIMESTAMP columns.
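Note that the question's entry_date is a Unix timestamp stored as an integer, so for that particular column the same index-friendly pattern uses UNIX_TIMESTAMP() on the literals, as shown in the previous answer:
WHERE entry_date >= UNIX_TIMESTAMP('2012-05-01')
  AND entry_date <  UNIX_TIMESTAMP('2012-06-01')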