I have a database table with 5 million rows, and I am running:
SELECT *
FROM tbl
WHERE datetime_created BETWEEN '2014-10-01 00:00:00' AND '2014-10-31 23:59:59'
It took 54 seconds to return 428k results.
The columns on tbl:
id (int pk auto inc)
actor (varchar)
action (enum)
target (varchar)
is_successful (tinyint)
datetime_created (datetime)
The index:
datetime_created (datetime_created, action, target, is_successful)
Any ideas on how I can improve this?
Edit:
EXPLAIN results:
select_type: simple
type: range
possible_keys: datetime_created
key: datetime_created
key_len: 8
ref: null
rows: 359569
extra: using index condition
428k is a lot of rows to work with in one shot. Even though you have an index on the date, the engine still has to scan through every entry between the low and high values. I would suggest multiple queries reading the data in smaller chunks, narrowing the result set if possible.
E.g. adding a filter on the action enum together with the date range should yield much faster results. Say there are 5 enum values; then you run 5 queries, one per action value. The more indexed criteria you add, the better the query will perform.
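A minimal sketch of that chunking (the action values here are hypothetical, not from the original post):
-- Run once per action value; each query returns a smaller chunk.
SELECT *
FROM tbl
WHERE datetime_created BETWEEN '2014-10-01 00:00:00' AND '2014-10-31 23:59:59'
  AND action = 'login';
-- ...repeat with action = 'logout', 'purchase', etc.
Note that with the existing index starting at datetime_created, the action filter is applied as an index condition rather than narrowing the scanned range itself, so the win here is mostly the smaller per-query result set.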
Also consider: if this is going to be used in an app, that is a massive recordset to deal with. Do you really need to work with 428k results at a time?
Related
I have a MySQL table with around 2m rows in it. I'm trying to run the query below, and each time it takes over 5 seconds to return results. I have an index on the created_at column. Below is the EXPLAIN output.
Is this expected?
Thanks in advance.
SELECT
DATE(created_at) AS grouped_date,
HOUR(created_at) AS grouped_hour,
count(*) AS requests
FROM
`advert_requests`
WHERE
DATE(created_at) BETWEEN '2022-09-09' AND '2022-09-12'
GROUP BY
grouped_date,
grouped_hour
The EXPLAIN shows type: index, which is an index scan. That is, it is using the index, but it's iterating over every entry in the index, just as a table scan does for rows in the table. This is supported by rows: 2861816, which is the optimizer's (rough) estimate of the number of index entries it will examine. This is much more expensive than examining only the rows matching the condition, which is the benefit we look for from an index.
So why is this?
When you use any function on an indexed column in your search, like this:
WHERE
DATE(created_at) BETWEEN '2022-09-09' AND '2022-09-12'
it spoils the benefit of the index for reducing the number of rows examined.
MySQL's optimizer doesn't have any intelligence about the results of functions, so it can't infer that the order of the return values matches the order of the index. Therefore it can't use the fact that the index is sorted to narrow down the search. You and I know that DATE(created_at) is naturally in the same order as created_at, but the query optimizer doesn't know this. There are other functions, like MONTH(created_at), whose results are definitely not in sorted order (rows in created_at order can yield month values 12, 1, 2, ...), and MySQL's optimizer doesn't attempt to track which functions' results are reliably sorted.
To fix your query, you can try one of two things:
Use an expression index (a "functional index"). This is a feature introduced in MySQL 8.0 (8.0.13):
ALTER TABLE `advert_requests` ADD INDEX ((DATE(created_at)))
Notice the extra pair of parentheses; they are required when defining an expression index. The index entries are the results of that function or expression, not the original values of the column.
If you then use the same expression in your query, the optimizer recognizes that and uses the index.
mysql> explain SELECT DATE(created_at) AS grouped_date, HOUR(created_at) AS grouped_hour, count(*) AS requests FROM `advert_requests` WHERE DATE(created_at) BETWEEN '2022-09-09' AND '2022-09-12' GROUP BY grouped_date, grouped_hour\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: advert_requests
partitions: NULL
type: range <-- much better than 'index'
possible_keys: functional_index
key: functional_index
key_len: 4
ref: NULL
rows: 1
filtered: 100.00
Extra: Using where; Using temporary
If you use MySQL 5.7, you can't use expression indexes directly, but you can use a virtual column and define an index on the virtual column:
ALTER TABLE advert_requests
ADD COLUMN created_at_date DATE AS (DATE(created_at)),
ADD INDEX (created_at_date);
The trick of the optimizer recognizing the expression still works: a query that filters on DATE(created_at) can use the index on the virtual column.
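A sketch of the same report written against the virtual column directly (equivalent results, assuming created_at_date is defined as above):
SELECT DATE(created_at) AS grouped_date,
       HOUR(created_at) AS grouped_hour,
       COUNT(*) AS requests
FROM advert_requests
WHERE created_at_date BETWEEN '2022-09-09' AND '2022-09-12'
GROUP BY grouped_date, grouped_hour;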
If you use a version of MySQL older than 5.7, you should upgrade regardless. MySQL 5.6 and older versions are past their end of life by now, and they are security risks.
The second thing you could do is refactor your query so the created_at column is not inside a function.
WHERE
created_at >= '2022-09-09' AND created_at < '2022-09-13'
When comparing a datetime to a date value, the date value is implicitly at 00:00:00.000 time. To include every fraction of a second up to 2022-09-12 23:59:59.999, it's simpler to just use < '2022-09-13'.
The EXPLAIN of this shows that it uses the existing index on created_at.
mysql> explain SELECT DATE(created_at) AS grouped_date, HOUR(created_at) AS grouped_hour, count(*) AS requests FROM `advert_requests` WHERE created_at >= '2022-09-09' AND created_at < '2022-09-13' GROUP BY grouped_date, grouped_hour\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: advert_requests
partitions: NULL
type: range <-- not 'index'
possible_keys: created_at
key: created_at
key_len: 6
ref: NULL
rows: 1
filtered: 100.00
Extra: Using index condition; Using temporary
This solution works on older versions of MySQL as well as 5.7 and 8.0.
Use EXPLAIN and check whether the query does an index range scan or not. If not, follow this link:
https://dev.mysql.com/doc/refman/8.0/en/range-optimization.html
(Note that sometimes a full table scan can be better, if most of the timestamps in the table fall within the selected date range. As far as I know, optimization is not trivial in that case.)
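For example, MySQL 8.0.18+ can also time each step of the plan with EXPLAIN ANALYZE (plain EXPLAIN works on older versions):
EXPLAIN ANALYZE
SELECT DATE(created_at) AS grouped_date, HOUR(created_at) AS grouped_hour, COUNT(*) AS requests
FROM advert_requests
WHERE created_at >= '2022-09-09' AND created_at < '2022-09-13'
GROUP BY grouped_date, grouped_hour;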
If I understand the EXPLAIN correctly, it's able to use the index to implement the WHERE filtering. But this returns 2.8 million rows, which then have to be grouped by date and hour, and that is a slow process.
You may be able to improve it by creating virtual columns for the date and hour, and indexing those as well.
ALTER TABLE advert_requests
  ADD COLUMN created_date DATE AS (DATE(created_at)),
  ADD COLUMN created_hour INT AS (HOUR(created_at)),
  ADD INDEX (created_date, created_hour);
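With those columns in place, a sketch of the grouped query that could use the new index for its filtering (column names as defined just above):
SELECT created_date AS grouped_date,
       created_hour AS grouped_hour,
       COUNT(*) AS requests
FROM advert_requests
WHERE created_date BETWEEN '2022-09-09' AND '2022-09-12'
GROUP BY created_date, created_hour;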
I have an InnoDB table in MySQL where I have to select and sum a lot of data in date ranges. It seems I can't get to a point where it runs fast enough for the use case.
The table is as follows:
user_id: int
request_date: date
amount: int
The table has several hundred million rows.
A date range can return up to 10 million rows.
Amount is 1-10
I have a composite index on all three columns in the order: user_id, date, amount.
The query I use for selecting is:
SELECT
SUM(amount)
FROM table
WHERE user_id = ?
AND request_date <= ?
AND request_date >= ?
I hardcode the dates into the query.
Anything else I can do to speed up this query? I should be able to do the query about 20 times a second.
It's running on DI with 8 GB RAM and 4 CPUs (not dedicated).
Update
The output of EXPLAIN is:
select_type: SIMPLE
type: range
possible_keys: composite
key: composite
key_len: 7
ref: null
rows: 14994440
Extra: Using where; Using index
I've used various techniques in the past to do similar stuff.
You should consider partitioning your table. That involves creating a column that contains a partition identifier, which could be a date or a year-month.
I've had some performance increase by splitting the date and time portions into separate columns. The advantage is that you can then quickly grab all data for a date by looking at the date field alone, without even considering the time portion.
If you know what kind of data you'll be requesting, and you can tolerate some delay, you can pre-calculate. It looks like you're working with log data, so I assume that query results for anything older than today will never change. You should exploit that, for example by keeping a separate table of aggregated data. If you only need to calculate "today", things will be much faster. Or, if you can accept slightly stale numbers, you can simply pre-calculate periodically.
The table that I'm talking about could be something like:
CREATE TABLE aggregated_requests AS
SELECT user_id, request_date, SUM(amount) AS amount
FROM table
GROUP BY user_id, request_date;  -- one row per user per day
After that, rewrite your query above like this, and it'll be extremely fast:
SELECT SUM(amount)
FROM aggregated_requests
WHERE user_id = ?
AND request_date <= ?
AND request_date >= ?
Plan A: INDEX(user_id, request_date, amount) -- optimal for the WHERE, also "covering". OK, you have that; so, on to plan B:
Plan B (even better): Build and maintain a Summary table of, say, daily subtotals. Then query that table instead. More: http://mysql.rjweb.org/doc.php/summarytables
Partitioning is unlikely to help more than a good index (as in Plan A).
More on B
If you need up-to-the-minute totals, there are multiple approaches to achieve it using summary tables without waiting until the next day.
IODKU (INSERT ... ON DUPLICATE KEY UPDATE) against the summary table at the same time (possibly in a trigger) that you insert the row data. This keeps the summary table up to date, but with non-trivial overhead. See the sketch after this list.
Hybrid. Reach into the summary table for whole days, then total up 'today' from the raw data and add it on.
Summarize by hour instead of by day. This either gives you only hourly resolution, or you can combine with the "hybrid" to make that run faster.
(My blog gives those 3, plus 3 more.)
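A minimal sketch of the IODKU approach (the summary table name and layout here are assumptions, not from the post):
CREATE TABLE daily_totals (
  user_id INT NOT NULL,
  request_date DATE NOT NULL,
  amount BIGINT NOT NULL,
  PRIMARY KEY (user_id, request_date)  -- the key IODKU collides on
);
-- Run alongside each insert into the raw table (or from a trigger):
INSERT INTO daily_totals (user_id, request_date, amount)
VALUES (?, ?, ?)
ON DUPLICATE KEY UPDATE amount = amount + VALUES(amount);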
Other
"Amount is 1-10" -- I hope you are using a 1-byte TINYINT, not a 4-byte INT. That's 300MB of difference. Perhaps user_id could be smaller than INT.
I have this query that runs unbelievably slow (4 minutes):
SELECT * FROM `ad` WHERE `ad`.`user_id` = USER_ID ORDER BY `ad`.`id` desc LIMIT 20;
Ad table has approximately 10 million rows.
SELECT COUNT(*) FROM `ad` WHERE `ad`.`user_id` = USER_ID;
Returns 10k rows.
Table has following indexes:
PRIMARY KEY (`id`),
KEY `idx_user_id` (`user_id`,`status`,`sorttime`),
EXPLAIN gives this:
id: 1
select_type: SIMPLE
table: ad
type: index
possible_keys: idx_user_id
key: PRIMARY
key_len: 4
ref: NULL
rows: 4249
Extra: Using where
I am failing to understand why it takes so long. Also, this query is generated by an ORM (pagination), so it would be nice to optimize it from the outside (maybe add some extra index).
BTW this query works fast:
select aa.*
from (select id from ad where user_id=USER_ID order by id desc limit 20) as a
join ad as aa on a.id = aa.id ;
Edit: I tried another user with far fewer rows (dozens) than the original one. I am wondering why the original query doesn't use idx_user_id:
EXPLAIN SELECT * FROM `ad` WHERE `ad`.`user_id` = ANOTHER_ID ORDER BY `ad`.`id` desc LIMIT 20;
id: 1
select_type: SIMPLE
table: ad
type: ref
possible_keys: idx_user_id
key: idx_user_id
key_len: 3
ref: const
rows: 84
Extra: Using where; Using filesort
Edit 2: with Alexander's help I decided to try forcing MySQL to use the index I want, and the following query is much faster (1 sec instead of 4 min):
SELECT *
FROM `ad` USE INDEX (idx_user_id)
WHERE `ad`.`user_id` = 1884774
ORDER BY `ad`.`id` desc LIMIT 20;
In the EXPLAIN output you can see that the key value is PRIMARY. This means that the MySQL optimizer decided it would be faster to scan all table records (which are already sorted by id) and pick out the first 20 with the specific user_id value than to use the idx_user_id key, which it considered as a possible key and then rejected.
In your second query the optimizer sees that only id values are needed in the subquery, so it decides to use the idx_user_id index instead, as that index allows it to compute the list of necessary ids without touching the table itself. Then only 20 records are retrieved by direct primary-key lookup, which is a very fast operation for that small number of records.
As your query with ANOTHER_ID shows, MySQL's wrong decision was based on the number of rows for the previous USER_ID value. That number was so big that the optimizer guessed it would find the first 20 records with this specific user_id faster by scanning the table records directly and skipping those with the wrong user_id values.
If table rows are accessed by index, random access operations are required. On a typical HDD, random access is about 100 times slower than a sequential scan, so for an index to be useful it must reduce the row count to less than about 1% of the total. If the rows for the specific USER_ID value account for more than 1% of the total number of rows, it may be more efficient to do a full table scan than to use the index, if we want to retrieve all of those rows. But the MySQL optimizer doesn't take into account the fact that only 20 of these rows will be retrieved, so it mistakenly decided to skip the index and do a full table scan instead.
In order to make your query fast for any user_id value you can add one more index which will allow the query execution in the fastest way possible:
create index idx_user_id_2 on ad(user_id, id);
This index allows MySQL to do both the filtering and the sorting: the columns used for filtering come first, and the columns used for ordering come second. MySQL should be smart enough to use this index, because it can read exactly the necessary records in order, without skipping any.
I'm using MySQL for a game. I have a scores table of approximately 150,000 records. The table looks like:
fk_user_id | high_score
The high_score column is an int. It has an index on it. I want to figure out a user's rank by running the following:
SELECT COUNT(*) AS count FROM scores WHERE high_score >= [x]
So, supplying a user's current high_score to the above, I can get their rank. The idea is that every time the user looks at a profile page, I would run the above query to get their rank.
I'm wondering how expensive this is, and whether I should even go down this path. Is MySQL scanning the entire table every time the query is issued? Is this a crazy idea?
Update: Here's what 'explain' says about the query:
id: 1
select_type: SIMPLE
table: scores
type: range
possible_keys: high_score
key: high_score
key_len: 5
ref: null
rows: 1
extra: Using where; Using index
Thanks
MySQL is scanning the entire table for every record you ask it to return.
Why use COUNT(*)? Can't you use COUNT(DISTINCT user_id) or COUNT(user_id)?
You should already have that column indexed, and I'm sure it would return your results accurately.
SELECT COUNT(DISTINCT user_id) AS count FROM scores WHERE high_score >= [x]
If high_score is indexed, the cost is relatively small; if not, a full table scan is made.
Relatively small means it just reads the row ids from the key and counts them, a very small cost.
You can always write EXPLAIN followed by the query to check exactly what the database is doing to fetch your data.
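For example (the 1000 here is a placeholder for the user's high_score):
EXPLAIN SELECT COUNT(*) AS count FROM scores WHERE high_score >= 1000;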
I've been trying and googling everything and still can't figure out what's going on.
I have a big table (100M+ rows). Among others, it has 3 columns: user_id, date, type.
It has an index idx(user_id, type, date).
When I EXPLAIN this query:
SELECT *
FROM table
WHERE user_id = 12345
AND type = 'X'
ORDER BY date DESC
LIMIT 5
EXPLAIN shows that MySQL examined 110K rows, which is roughly how many rows this user_id has.
My question is:
Why isn't the same index used for the ORDER BY ... LIMIT 5? It knows which rows belong to the user_id, date is part of the same index, so why not just take the last 5 rows in that index?
P.S. I tried an index on (user_id, date, type) - same results; I tried removing DESC - same results.
This is the EXPLAIN plan:
id: 1
select_type: SIMPLE
table: s
type: ref
possible_keys: dateIdx,userTypeDateIdx
key: userTypeDateIdx
key_len: 5
ref: const,const
rows: 110118
Extra: Using where
I also tried adding a FORCE INDEX FOR ORDER BY hint, but I still get rows: 110118.
Did you run ANALYZE TABLE after creating the index?
MySQL will not use the index until the table is analyzed. The best index to use is the one you created, on (user_id, type, date).
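For example (s is the table name shown in the EXPLAIN output; it may be an alias, so substitute your actual table name):
ANALYZE TABLE s;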
The date values in the index are in ascending order, and you are asking for the most recent five rows in descending order by date; it can't use the index for that. If you changed the index to (user_id, type, date DESC), it would be able to use the index to get the most recent five rows.
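A sketch of that change (descending key parts only take effect in MySQL 8.0+; the table name s and the old index name are taken from the EXPLAIN output, and the new index name is an assumption):
ALTER TABLE s
  DROP INDEX userTypeDateIdx,
  ADD INDEX userTypeDateDescIdx (user_id, type, date DESC);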