select *
from `attendance_marks`
where exists (select *
from `attendables`
where `attendance_marks`.`attendable_id` = `attendables`.`id`
and `attendable_type` = 'student'
and `attendable_id` = 258672
and `attendables`.`deleted_at` is null
)
and (`marked_at` between '2022-09-01 00:00:00' and '2022-09-30 23:59:59')
this query is taking too much time approx 7-10 seconds.
I am trying to optimize it but stuck at here.
Attendance_marks indexes
Attendables Indexes
Please help me optimize it a little bit.
For reference
number of rows in attendable = 80966
number of rows in attendance_marks = 1853696
Explain select
I think if we use JOINS instead of Sub-Query, then it will be more performant. Unfortunately, I don't have the exact data to be able to benchmark the performance.
select *
from attendance_marks
inner join attendables on attendables.id = attendance_marks.attendable_id
where attendable_type = 'student'
and attendable_id = 258672
and attendables.deleted_at is null
and (marked_at between '2022-09-01 00:00:00' and '2022-09-30 23:59:59')
I'm not sure if your business requirement allows changing the PK, and adding index. Incase it does then:
Add index to attendable_id.
I assume that attendables.id is PK. Incase not, add an index to it. Or preferably make it the PK.
In case attendable_type have a lot of different values. Then consider adding an index there too.
If possible don't have granularity till the seconds' field in marked_at, instead round to the nearest minute. In our case, we can round off 2022-09-30 23:59:59 to 2022-10-01 00:00:00.
select b.*
from `attendance_marks` AS am
JOIN `attendables` AS b ON am.`attendable_id` = b.`id`
WHERE b.`attendable_type` = 'student'
and b.`attendable_id` = 258672
and b.`deleted_at` is null
AND am.`marked_at` >= '2022-09-01'
AND am.`marked_at` < '2022-09-01 + INTERVAL 1 MONTH
and have these
am: INDEX(marked_at, attendable_id)
am: INDEX(attendable_id, marked_at)
b: INDEX(attendable_type, attendable_id, attendables)
Note that the datetime range works for any granularity.
(Be sure to check that I got the aliases for the correct tables.)
This formulation, with these indexes should allow the Optimizer to pick which table is more efficient to start with.
I have the following two tables:
movie_sales (provided daily)
movie_id
date
revenue
movie_rank (provided every few days or weeks)
movie_id
date
rank
The tricky thing is that every day I have data for sales, but only data for ranks once every few days. Here is an example of sample data:
`movie_sales`
- titanic (ID), 2014-06-01 (date), 4.99 (revenue)
- titanic (ID), 2014-06-02 (date), 5.99 (revenue)
`movie_rank`
- titanic (ID), 2014-05-14 (date), 905 (rank)
- titanic (ID), 2014-07-01 (date), 927 (rank)
And, because the movie_rate.date of 2014-05-14 is closer to the two sales dates, the output should be:
id date revenue closest_rank
titanic 2014-06-01 4.99 905
titanic 2014-06-02 5.99 905
The following query works to get the results by getting the min date difference in the sub-select:
SELECT
id,
date,
revenue,
(SELECT rank from movie_rank where id=s.id ORDER BY ABS(DATEDIFF(date, s.date)) ASC LIMIT 1)
FROM
movie_sales s
But I'm afraid that this would have terrible performance as it will literally be doing millions of subselects...on millions of rows. What would be a better way to do this, or is there really no proper way to do this since an index can not be properly done with a DATEDIFF ?
Unfortunately, you are right. The movie rank table must be searched for each movie sale and of all matching movie rows the closest be picked.
With an index on movie_rank(id) the DBMS finds the movie rows quickly, but an index on movie_rank(id, date) would be better, because the date could be read from the index and only the one best match would be read from the table.
But you also say that there are new ranks every few dates. If it is guaranteed to find a rank in a certain range, e.g. for each date there will be at least one rank in the twenty days before and at least one rank in the twenty days after, you can limit the search accordingly. (The index on movie_rank(id, date) would be essential for this, though.)
SELECT
id,
date,
revenue,
(
select r.rank
from movie_rank r
where r.id = s.id
and r.date between s.date - interval 20 days
and s.date + interval 20 days
order by abs(datediff(date, s.date)) asc
limit 1
)
FROM movie_sales s;
This is difficult to get quick with SQL. In a programming language I would choose this algorithm:
Sort the two tables by date and point to the first rows.
Move the rank pointer forward until we match the sales date or are beyond it. (If we aren't there already.)
Compare the sales date with the rank date we are pointing at and with the rank date of the previous row. Take the closer one.
Move the sales pointer one row forward.
Go to 2.
With this algorithm we would already be in about the position we want to be. Let's see, if we can do the same with SQL. Iterations are done with recursive queries in SQL. These are available in MySQL as of version 8.0.
We start with sorting the rows, i.e. giving them numbers. Then we iterate through both data sets.
with recursive
sales as
(
select *, row_number() over (partition by movie_id order by date) as rn
from movie_sales
),
ranks as
(
select *, row_number() over (partition by movie_id order by date) as rn
from movie_rank
),
cte (movie_id, revenue, srn, rrn, sdate, rdate, rrank, closest_rank) as
(
select
movie_id, s.revenue, s.rn, r.rn, s.date, r.date, r.ranking,
case when s.date <= r.date then r.ranking end
from (select * from sales where rn = 1) s
join (select * from ranks where rn = 1) r using (movie_id)
union all
select
cte.movie_id,
cte.revenue,
coalesce(s.rn, cte.srn),
coalesce(r.rn, cte.rrn),
coalesce(s.date, cte.sdate),
coalesce(r.date, cte.rdate),
coalesce(r.ranking, cte.rrank),
case when coalesce(r.date, cte.rdate) >= coalesce(s.date, cte.sdate) then
case when abs(datediff(coalesce(r.date, cte.rdate), coalesce(s.date, cte.sdate))) <
abs(datediff(cte.rdate, coalesce(s.date, cte.sdate)))
then coalesce(r.ranking, cte.rrank)
else cte.rrank
end
end
from cte
left join sales s on s.movie_id = cte.movie_id and s.rn = cte.srn + 1 and cte.closest_rank is not null
left join ranks r on r.movie_id = cte.movie_id and r.rn = cte.rrn + 1 and cte.rdate < cte.sdate
where s.movie_id is not null or r.movie_id is not null
-- where cte.closest_rank is null
)
select
movie_id,
sdate,
revenue,
closest_rank
from cte
where closest_rank is not null;
(BTW: I named the column ranking, because rank is a reserved word in SQL.)
Demo: https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=e994cb56798efabc8f7249fd8320e1cf
This is probably still slow. The reason for this is: there are no pointers to a row in SQL. If we want to go from row #1 to row #2, we must search that row, while in a programming language we would really just move the pointer one step forward. If the tables had an ID, we could build a chain (next_row_id) instead of using row numbers. That could speed this process up. But well, I guess you already notice: this is not an algorithm made for SQL.
Another approach... Avoid the problem by cleansing the data.
Make sure the rank is available for every day. When a new date comes in, find the previous rank, then fill in all the rows for the intervening days.
(This will take some initial effort to 'fix' all the previous missing dates. After that, it is a small effort when a new list of ranks comes in.)
The "report" would be a simple JOIN on the date. You would probably need a 2-column INDEX(movie_id, date) or something like that.
Ultimate solution would be not to calculate all the ranks every time, but store them (in a new column, or even in a new table if you don't want to change existing tables).
Each time you update you could look for sales data without rank and calculate only for those.
With above approach you get rank always from last available rank BEFORE sales data (e.g. if you've data 14 days before and 1 days after, still the one before would be used)
If you strictly need to use ranking closest in time, then you need to run UPDATE also for newly arrived ranking info. I believe it would still be more efficient in the long run.
How would the following three queries compare in terms of performance? I'm trying to get all records with year=2017:
Using EXTRACT:
SELECT count(*), completed_by_id FROM table
WHERE EXTRACT(YEAR FROM completed_on)=2017
GROUP BY completed_by_id
# Took 11.8s
Using YEAR:
SELECT count(*), completed_by_id FROM table
WHERE YEAR(completed_on)=2017
GROUP BY completed_by_id
# Took 5.15s
Using LIKE 'YEAR%'
SELECT count(*), completed_by_id FROM table
WHERE completed_on LIKE '2017%'
GROUP BY completed_by_id
# Took 6.61s
Note: In my own testing I found YEAR() to be the fastest, LIKE to be the second fastest, and EXTRACT() to be the slowest.
There are about 5M rows in the table and completed_on is DATETIME field that has been indexed.
You haven't described your table or indexes so all advice about query performance is guesswork.
If your completed_on column is a DATETIME, DATE, or TIMESTAMP type and it is indexed, this query will radically outperform all the ones you have shown, and maintain its performance as your table grows.
SELECT count(*), completed_by_id
FROM table
WHERE completed_on >= '2017-01-01'
AND completed_on < '2017-01-01' + INTERVAL 1 YEAR
GROUP BY completed_by_id
Why? It can do a range scan on the index rather than a nonsargable function call on each row's value.
Notice the use of >= at the beginning of the date range and < at the end. We want to include all rows from the first moment of new years day 2017, up until but not including the first moment of new years day 2018. BETWEEN can't do this, because it uses <= rather than < at the end of its range.
If an index is in place, both BETWEEN and the syntax I have shown use a range scan, and perform about the same.
For best results speeding up this query use a compound index on (completed_on, completed_by_id).
If you are storing completed_on as DATE or DATETIME you can use:
SELECT count(*) as cnt, LEFT(completed_on, 4) AS year
FROM table
GROUP BY year
HAVING year=2017
This is my table structure (about 1 millions records):
I need to select a few indices at certain dates, but only Year and Month are relevant:
SELECT `index_name`,`results` FROM `mst_ind` WHERE
((`index_name`='MSCI EAFE Mid NR USD' AND MONTH(`date`) = 3 AND YEAR(`date`) = 2003) OR
(`index_name`='MSCI Morocco PR USD' AND MONTH(`date`) = 3 AND YEAR(`date`) = 2003))
AND `time_period`='M1'
It works fine, but the performance is horrible. I run the query through profiler, but it could not suggest any possible keys.
The primary key contains index_id, date and time_period.
How can I optimize/improve this query?
Thanks!
Update: the explain report:
You are probably invalidating the use of an index as you are applying a transformation to fields that would be indexed by using functions such as MONTH and YEAR.
You could:
write the WHERE clause differently such that it doesn't use the MONTH and YEAR functions, such as:
date >= '2003-03-01' and date < '2003-04-01'
Edit: just realized you probably don't have any indexes on this table. Consider adding indexes to the index_name, date and time_period field.
I have a MySQL table like this one:
day int(11)
hour int(11)
amount int(11)
Day is an integer with a value that spans from 0 to 365, assume hour is a timestamp and amount is just a simple integer. What I want to do is to select the value of the amount field for a certain group of days (for example from 0 to 10) but I only need the last value of amount available for that day, which pratically is where the hour field has its max value (inside that day). This doesn't sound too hard but the solution I came up with is completely inefficient.
Here it is:
SELECT q.day, q.amount
FROM amt_table q
WHERE q.day >= 0 AND q.day <= 4 AND q.hour = (
SELECT MAX(p.hour) FROM amt_table p WHERE p.day = q.day
) GROUP BY day
It takes 5 seconds to execute that query on a 11k rows table, and it just takes a span of 5 days; I may need to select a span of en entire month or year so this is not a valid solution.
Anybody who can help me find another solution or optimize this one is really appreciated
EDIT
No indexes are set, but (day, hour, amount) could be a PRIMARY KEY if needed
Use:
SELECT a.day,
a.amount
FROM AMT_TABLE a
JOIN (SELECT t.day,
MAX(t.hour) AS max_hour
FROM AMT_TABLE t
GROUP BY t.day) b ON b.day = a.day
AND b.max_hour = a.hour
WHERE a.day BETWEEN 0 AND 4
I think you're using the GROUP BY a.day just to get a single amount value per day, but it's not reliable because in MySQL, columns not in the GROUP BY are arbitrary -- the value could change. Sadly, MySQL doesn't yet support analytics (ROW_NUMBER, etc) which is what you'd typically use for cases like these.
Look at indexes on the primary keys first, then add indexes on the columns used to join tables together. Composite indexes (more than one column to an index) are an option too.
I think the problem is the subquery in the where clause. MySQl will at first calculate this "SELECT MAX(p.hour) FROM amt_table p WHERE p.day = q.day" for the whole table and afterwards select the days. Not quite efficient :-)