WHERE clause in nested query is not improving efficiency - mysql

I have a table where I am running a fairly intensive RANK window function query in MySQL.
The query looks like this:
SELECT
t1.*,
RANK () OVER (
PARTITION BY
t1.`country`,
t1.`product_id`,
t1.`retrieveDate`,
t1.`retrieveHour`
ORDER BY
t1.`retrieveDatetime` DESC) AS `ranking`
FROM (
SELECT * FROM product_data
WHERE retrieveDate > (CURRENT_DATE() - INTERVAL 1 WEEK)
) t1
WHERE t1.ranking = 1
I originally set up a WHERE filter in the nested query to limit the amount of data the outer query has to work on, but after running EXPLAIN I noticed that no matter what I set the retrieveDate filter to (1 WEEK, 1 MONTH, 2 MONTH), performance does not improve at all.
This is the EXPLAIN output:
{
"query_block": {
"select_id": 1,
"cost_info": {
"query_cost": "355813.66"
},
"windowing": {
"windows": [
{
"name": "<unnamed window>",
"using_filesort": true,
"filesort_key": [
"`product_id`",
"`country`",
"`retrieveDate`",
"`retrieveHour`",
"`retrieveDatetime` desc"
],
"functions": [
"rank"
]
}
],
"cost_info": {
"sort_cost": "271814.81"
},
"table": {
"table_name": "product_data",
"access_type": "ALL",
"rows_examined_per_scan": 815526,
"rows_produced_per_join": 271814,
"filtered": "33.33",
"cost_info": {
"read_cost": "2446.25",
"eval_cost": "27181.48",
"prefix_cost": "83998.85",
"data_read_per_join": "553M"
},
"used_columns": [
"product_id",
"country",
"category",
"rank",
"primaryCategory",
"primaryCategoryRank",
"retrieveDatetime",
"createdAt",
"updatedAt",
"retrieveDate",
"retrieveHour"
],
"attached_condition": "(`product_data`.`retrieveDate` > <cache>((curdate() - interval 1 week)))"
}
}
}
}
Any thoughts on how to actually improve this query?
I currently have the following indexes set up:
BTree index
product_id, type=varchar, pos=1
country, type=int, pos=2
retrieveDate, type=date, pos=13
retrieveHour, type=int, pos=14
retrieveDatetime, type=datetime, pos=10
as well as one standalone retrieveDatetime index

As far as I can see, the last line of your SQL, t1.ranking = 1, won't run, because ranking isn't a column in the table product_data (unless it coincidentally is, at which point you might've found your problem, lol). Assuming that's just a typo. I'm also going to assume that if you run:
SELECT * FROM product_data WHERE retrieveDate > (CURRENT_DATE() - INTERVAL 1 WEEK)
That it is actually faster than WHERE retrieveDate > (CURRENT_DATE() - INTERVAL 1 MONTH), yeah? If not, then it has nothing to do with your outer query. Can you try it without the subquery at all? I'm wondering whether MySQL is just choosing a bad optimisation strategy because of it.
SELECT
*,
RANK () OVER (
PARTITION BY
`country`,
`product_id`,
`retrieveDate`,
`retrieveHour`
ORDER BY
`retrieveDatetime` DESC) AS `ranking`
FROM product_data
WHERE retrieveDate > (CURRENT_DATE() - INTERVAL 1 WEEK) AND ranking = 1;
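One note regardless of the subquery: MySQL will not accept a window-function alias such as ranking in the WHERE clause of the same query level, so that filter ultimately has to live in an outer query. What may matter more for the range filter is that retrieveDate is not the leading column of the existing composite index, so the date condition cannot use it, which would explain the access_type "ALL" in the EXPLAIN. A minimal sketch, assuming adding an index is acceptable (the index name is illustrative):
ALTER TABLE product_data
  ADD INDEX idx_product_data_retrieveDate (retrieveDate);
With a usable index on retrieveDate, shrinking the date window should actually reduce the number of rows fed into the window sort, which is where most of the cost (sort_cost) is going.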

Related

How to properly compare BigInt Column in MySQL 5.7

I took a look at Mysql 5.0.91 BIGINT column value comparison with '1' to check if there are any best practices for BigInt, but found nothing.
I have a column that is BigInt(20), and my query has a WHERE clause that compares that column with IN (). This has a major impact on performance: although I have an index on that column, when I don't use the IN condition the query gets much faster.
So what is the best way to do the comparison I need? Is IN () good practice, or is there a better approach?
Table Description (Obfuscated for security reasons)
Field,Type,Null,Key,Default,Extra
id,bigint(20),NO,PRI,NULL,auto_increment
status,varchar(64),NO,,NULL,
dono_id,bigint(20),YES,,NULL,
dono_tipo,varchar(64),YES,MUL,NULL,
MyBigIntField,bigint(20),YES,MUL,NULL
QUERY
explain select
DISTINCT(MyBigIntField)
FROM
MyTable
WHERE
MyBigIntField IN ('16', '49', '58', '155', '226')
AND NOT (status = 'Failure')
AND dono_id <> 1106
and dono_tipo = 'Purchase';
Execution Plan
EXPLAIN
"{
"query_block": {
"select_id": 1,
"cost_info": {
"query_cost": "23.21"
},
"duplicates_removal": {
"using_filesort": false,
"table": {
"table_name": "MyTable",
"access_type": "range",
"possible_keys": [
"idx_MyTable_owner",
"IDX_MyTable_PI"
],
"key": "IDX_MyTable_PI",
"used_key_parts": [
"MyBigIntField"
],
"key_length": "9",
"rows_examined_per_scan": 13,
"rows_produced_per_join": 5,
"filtered": "45.00",
"index_condition": "(`MyDatabase`.`MyTable`.`MyBigIntField` in (16,49,58,155,226))",
"cost_info": {
"read_cost": "22.04",
"eval_cost": "1.17",
"prefix_cost": "23.21",
"data_read_per_join": "231K"
},
"used_columns": [
"id",
"status",
"dono_id",
"dono_tipo",
"MyBigIntField"
],
"attached_condition": "((`MyDatabase`.`MyTable`.`status` <> 'Failure') and (`MyDatabase`.`MyTable`.`dono_id` <> 1106) and (`MyDatabase`.`MyTable`.`dono_tipo` = 'Purchase'))"
}
}
}
}"
myBigInt = 1 -- fine
myBigInt = "1" -- also good
myBigInt IN (1) -- optimizer treats this as the same
myBigInt IN (1, 2) -- now the query may run a lot slower
myBigInt IN ('1', '2') -- same as without quotes
That is, IN with more than one value may or may not be handled in an efficient way. We need to see the rest of the query in order to discuss what might happen.
Note that many APIs quote numbers while 'binding' values. As pointed out above, that does not hurt performance.
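As a small illustration of that last point (the statement and value here are hypothetical, not from the question), a server-side prepared statement that binds a quoted value still compares it numerically against the BIGINT column, so the index stays usable:
-- Hypothetical example: the string '16' is coerced to a number for the
-- comparison against the BIGINT column, so this behaves like = 16.
PREPARE stmt FROM 'SELECT id FROM MyTable WHERE MyBigIntField = ?';
SET @val = '16';
EXECUTE stmt USING @val;
DEALLOCATE PREPARE stmt;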
WHERE
MyBigIntField IN ('16', '49', '58', '155', '226')
AND NOT (status = 'Failure')
AND dono_id <> 1106
and dono_tipo = 'Purchase';
Add a 'composite' index. Start with any columns tested with =, then the IN column and/or one of the other columns. So, for that WHERE:
INDEX(dono_tipo, MyBigIntField)
NOT, <>, >, BETWEEN are harder to optimize.
If there are only two status values, then
... AND status = 'success' ...
INDEX(dono_tipo, status, MyBigIntField)
(There may be other optimizations that would show up if the example were not so obfuscated.)
Here's a similar case, with a different analysis:
WHERE
MyBigIntField IN ('16', '49', '58', '155', '226')
AND foo > 5
and dono_tipo = 'Purchase';
In this case, I recommend
INDEX(dono_tipo, -- first, because of "="
MyBigIntField, -- next because of IN
foo) -- range, which will be useful if IN has only one value
You seem to have MyBigIntField NULLable; perhaps it should be NOT NULL? (This does not impact what I have said above.)
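As DDL, the suggested indexes would look roughly like this (index names are illustrative; pick whichever matches the WHERE clause you actually run):
-- For WHERE dono_tipo = '...' AND MyBigIntField IN (...)
ALTER TABLE MyTable ADD INDEX idx_tipo_bigint (dono_tipo, MyBigIntField);
-- If status really has only two values and can be tested with =
ALTER TABLE MyTable ADD INDEX idx_tipo_status_bigint (dono_tipo, status, MyBigIntField);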

MySQL multi-column index only uses first column

I have a MySQL table with 40M+ rows.
The table has many columns, and I have the SQL below:
select
*
from
`conversation`
where
(
`id` > 40000000
AND `last_msg_timestamp` >= '2022-07-17 08:25:00.011'
AND `status` = 8
)
order by
`id`
limit
100
There are about 5M rows with status=8. So I created an index idx_status_id_last_msg_timestamp with columns (status, id, last_msg_timestamp) to improve the query speed.
Now, I found that:
1: if the id > ? condition has a small value (less than 40M), it works well (about 3 ms) and EXPLAIN shows that index idx_status_id_last_msg_timestamp is used with key length = 12
2: if the id > ? condition has a big value near max(id), the query is slow (about 1 s) and EXPLAIN shows that index idx_status_id_last_msg_timestamp is used with key length = 4
I want to know why it only uses the status column in the index in case 2.
Thanks
Explain Info: Format=Json, the query takes 0.877697 sec
explain format=json select
*
from
`conversation`
where
(
`id` > 40939363
and `last_msg_timestamp` >= '2022-07-19 08:25:00.011'
and `assign_status` = 8
)
order by
`id`
limit
100
{
"query_block": {
"select_id": 1,
"cost_info": {
"query_cost": "141199.15"
},
"ordering_operation": {
"using_filesort": false,
"table": {
"table_name": "conversation",
"access_type": "ref",
"possible_keys": [
"PRIMARY",
"idx_status_id_lastmsgtimestamp"
],
"key": "idx_status_id_lastmsgtimestamp",
"used_key_parts": [
"status"
],
"key_length": "4",
"ref": [
"const"
],
"rows_examined_per_scan": 117665,
"rows_produced_per_join": 389,
"filtered": "0.33",
"index_condition": "((`conversation`.`status` <=> 8) and ((`conversation`.`id` > 40939363) and (`conversation`.`last_msg_timestamp` >= '2022-07-19 08:25:00.011')))",
"cost_info": {
"read_cost": "117665.96",
"eval_cost": "77.86",
"prefix_cost": "141199.15",
"data_read_per_join": "325K"
},
"used_columns": [
"id",
"******",
"******",
"******",
"....",
"last_msg_timestamp",
"status"
]
}
}
}
}
The real problem is the missing AND. This may be "valid" syntax, but it does not do what you want:
`id` > 40000000 `last_msg_timestamp` >= '2022-07-17 08:25:00.011'
EXPLAIN fails to show (in key_len) whether it is using a column for a 'range' test (id > 40000000) or for ORDER BY (order by id).
EXPLAIN FORMAT=JSON SELECT ... does a better job. (Please provide this.)
I believe that it did use at least 2 of the columns of
INDEX (status, -- for filtering
id, -- at least for range filtering, possibly for ORDER BY
last_msg_timestamp) -- if used, it was not very useful
Another technique for getting insight:
FLUSH STATUS;
SELECT ...;
SHOW SESSION STATUS LIKE 'Handler%';
That will show how many rows were actually touched -- probably more than 100 and less than the entire table.
For this type of query, I would consider a slight alteration in the "order by":
select *
from `conversation`
where `last_msg_timestamp` >= '2022-07-17 08:25:00.011'
and `status` = 8
order by last_msg_timestamp, id
limit 100
together with
INDEX(status, last_msg_timestamp, id)
With this change, it will be able to stop after no more than 100 rows in the index.
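A sketch of that index as DDL (the index name is illustrative):
ALTER TABLE `conversation`
  ADD INDEX idx_status_lastmsg_id (status, last_msg_timestamp, id);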
More: give this a try:
select *
from ( SELECT id
FROM `conversation`
where `id` > 40939363
AND `last_msg_timestamp` >= '2022-07-19 08:25:00.011'
AND `assign_status` = 8
order by `id`
limit 100 ) AS x
ORDER BY id
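Note that the derived table above only returns id, so to get the full rows back you would typically join it back to the table (a sketch, not part of the original suggestion):
SELECT c.*
FROM (
    SELECT id
    FROM `conversation`
    WHERE `id` > 40939363
      AND `last_msg_timestamp` >= '2022-07-19 08:25:00.011'
      AND `assign_status` = 8
    ORDER BY `id`
    LIMIT 100
) AS x
JOIN `conversation` AS c USING (id)
ORDER BY c.id;
The inner query can typically be satisfied from the existing composite index alone, and only the 100 surviving ids require lookups into the full rows.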

Different speed of theoretically equal queries on MySQL

I have found a strange speed issue with one of my MySQL queries when run on two different columns, date_from vs date_to.
The table structure is the following:
create table if not exists table1 (
id int unsigned,
field2 int,
field3 varchar(32),
date_from date not null,
date_to date not null,
field6 text
);
create unique index idx_uniq_table1 on table1 (id, field2, field3, date_from);
create index idx_table1_id on table1 (id);
create index idx_table1_field2 on table1 (field2);
create index idx_table1_field3 on table1 (field3);
create index idx_table1_date_from on table1 (date_from);
create index idx_table1_date_to on table1 (date_to);
When I run this query using date_from, execution time is 1.487 seconds:
select field3, min(date_from) from table1 group by field3;
When I run this other query using date_to, execution time is 13.804 seconds, almost 10 times slower:
select field3, max(date_to) from table1 group by field3;
Both columns are NOT NULL, so there are no empty values.
The table has ~7M rows.
The only difference that I see between these two columns is that date_from appears in the unique index, but, as far as I know, that shouldn't make a difference if I'm not filtering by all four columns in the index.
Am I missing anything?
This is the explain of the date_from column:
{
"query_block": {
"select_id": 1,
"cost_info": {
"query_cost": "889148.90"
},
"grouping_operation": {
"using_filesort": false,
"table": {
"table_name": "table1",
"access_type": "index",
"possible_keys": [
"idx_uniq_table1",
"idx_table1_id",
"idx_table1_field2",
"idx_table1_field3",
"idx_table1_date_from",
"idx_table1_date_to"
],
"key": "idx_table1_field3",
"used_key_parts": [
"field3"
],
"key_length": "130",
"rows_examined_per_scan": 5952609,
"rows_produced_per_join": 5952609,
"filtered": "100.00",
"using_index": true,
"cost_info": {
"read_cost": "293888.00",
"eval_cost": "595260.90",
"prefix_cost": "889148.90",
"data_read_per_join": "908M"
},
"used_columns": [
"id",
"field2",
"field3",
"date_from"
]
}
}
}
}
This is the explain of the date_to column:
{
"query_block": {
"select_id": 1,
"cost_info": {
"query_cost": "889148.90"
},
"grouping_operation": {
"using_filesort": false,
"table": {
"table_name": "table1",
"access_type": "index",
"possible_keys": [
"idx_uniq_table1",
"idx_table1_id",
"idx_table1_field2",
"idx_table1_field3",
"idx_table1_date_from",
"idx_table1_date_to"
],
"key": "idx_table1_field3",
"used_key_parts": [
"field3"
],
"key_length": "130",
"rows_examined_per_scan": 5952609,
"rows_produced_per_join": 5952609,
"filtered": "100.00",
"cost_info": {
"read_cost": "293888.00",
"eval_cost": "595260.90",
"prefix_cost": "889148.90",
"data_read_per_join": "908M"
},
"used_columns": [
"id",
"field2",
"field3",
"date_from",
"date_to"
]
}
}
}
}
The only difference I see is in used_columns, at the end, where one contains date_to and the other doesn't.
Naughty. There is no PRIMARY KEY.
Since the "used columns" does not seem to agree with the queries, I don't want to try to explain the timing difference.
Replace the index on field3 by these two:
INDEX(field3, date_from)
INDEX(field3, date_to)
Those will speed up your two Selects.
In addition to Rick's answer about building the proper index for your criteria: the reason for the speed difference is that for date_from there is an index containing both field3 and date_from, so the engine could read everything it needed from the index itself instead of having to go to the raw data pages that contain the entire record. For date_to, no index contains both columns, so the engine still had to visit every raw data record to get the missing value, and that is where the time went.
That is why covering indexes are worth using: having every column the query needs inside the index is what optimizes it. I'm not saying you want an index with 20 columns, but for columns that are common filtering or grouping criteria, this is exactly why you build them.
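Concretely, the two indexes suggested above would look like this (the names just follow the existing naming pattern, adjust as you like); each one covers its respective GROUP BY query, so neither has to touch the table rows:
create index idx_table1_field3_date_from on table1 (field3, date_from);
create index idx_table1_field3_date_to on table1 (field3, date_to);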

SQL MAX on primary key, is filter condition unnecessary if it is already indexed?

select MAX(id) from studenthistory
where class_id = 1
and date(created_at) = '2021-11-05'
and time(created_at) > TIME('04:00:00')
group by student_id
composite indexes = ("class_id", "student_id", "created_at")
id is the primary key.
Is the date(created_at) = '2021-11-05' and time(created_at) > TIME('04:00:00') filter condition unnecessary for the MAX function, since studenthistory is already indexed on class_id and student_id?
The only reason I added that datetime filter is that this table will get huge over time (historical data), and I wanted to reduce the number of rows the query has to search.
But in the case of the MAX function, I believe MAX would simply fetch the last value without checking the whole row, if it is indexed.
So can I safely remove the datetime filter and turn it into
select MAX(id) from studenthistory
where class_id = 1
group by student_id
And have the same performance? (or better since it does not need to filter further?)
Checking the query plan, it seems like the performance is similar, but the table is rather small as of now.
First:
| -> Group aggregate: max(id) (cost=1466.30 rows=7254) (actual time=2.555..5.766 rows=3 loops=1)
-> Filter: ((cast(studenthistory.created_at as date) = '2021-11-05') and (cast(riderlocation.created_at as time(6)) > <cache>(cast('04:00:00' as time)))) (cost=740.90 rows=7254) (actual time=0.762..5.384 rows=5349 loops=1)
-> Index lookup on studenthistory using idx_studenthistory_class_id_931474 (class_id=1) (cost=740.90 rows=7254) (actual time=0.029..3.589 rows=14638 loops=1)
|
1 row in set (0.00 sec)
Second:
| -> Group aggregate: max(studenthistory.id) (cost=1475.40 rows=7299) (actual time=0.545..5.271 rows=10 loops=1)
-> Index lookup on studenthistory using idx_studenthistory_class_id_931474 (class_id=1) (cost=745.50 rows=7299) (actual time=0.026..4.164 rows=14729 loops=1)
|
1 row in set (0.01 sec)
Many thanks in advance
UPDATE: applying @Rick James's suggestion:
Changed index to (class_id, student_id, id).
FLUSH STATUS;
explain FORMAT=JSON SELECT MAX(`id`) `0` FROM `studenthistory`
WHERE `class_id`=1 AND `created_at`>='2021-11-05T18:25:50.544850+00:00'
GROUP BY `student_id`;
| {
"query_block": {
"select_id": 1,
"cost_info": {
"query_cost": "940.10"
},
"grouping_operation": {
"using_filesort": false,
"table": {
"table_name": "studenthistory",
"access_type": "ref",
"possible_keys": [
"fk_studenthist_student_e25b0310",
"idx_studenthistory_class_id_931474"
],
"key": "idx_studenthistory_class_id_931474",
"used_key_parts": [
"class_id"
],
"key_length": "4",
"ref": [
"const"
],
"rows_examined_per_scan": 8381,
"rows_produced_per_join": 2793,
"filtered": "33.33",
"cost_info": {
"read_cost": "102.00",
"eval_cost": "279.34",
"prefix_cost": "940.10",
"data_read_per_join": "130K"
},
"used_columns": [
"id",
"created_at",
"student_id",
"class_id"
],
"attached_condition": "(`test-table`.`studenthistory`.`created_at` >= TIMESTAMP'2021-11-05 18:25:50.54485')"
}
}
}
} |
i.e. only class_id is used from the index (as created_at is no longer in the index). rows_produced_per_join is lower due to the filter: 2793.
Without datetime filter:
FLUSH STATUS;
mysql> explain FORMAT=JSON SELECT MAX(`id`) `0` FROM `studenthistory`
WHERE `class_id`=1 GROUP BY `student_id`;
| {
"query_block": {
"select_id": 1,
"cost_info": {
"query_cost": "854.75"
},
"grouping_operation": {
"using_filesort": false,
"table": {
"table_name": "studenthistory",
"access_type": "ref",
"possible_keys": [
"fk_studenthistory_student_e25b0310",
"idx_studenthistory_class_id_931474"
],
"key": "idx_studenthistory_class_id_931474",
"used_key_parts": [
"class_id"
],
"key_length": "4",
"ref": [
"const"
],
"rows_examined_per_scan": 8381,
"rows_produced_per_join": 8381,
"filtered": "100.00",
"using_index": true,
"cost_info": {
"read_cost": "16.65",
"eval_cost": "838.10",
"prefix_cost": "854.75",
"data_read_per_join": "392K"
},
"used_columns": [
"id",
"student_id",
"class_id"
]
}
}
}
} |
Runs on all 3 columns of the index (class_id, student_id, id), the same 8381 rows, with a slightly lower query cost (940 -> 854).
Applying the first query with original index ("class_id", "student_id", "created_at") yields:
FLUSH STATUS;
explain FORMAT=JSON SELECT MAX(`id`) `0` FROM `studenthistory`
WHERE `class_id`=1 AND `created_at`>='2021-11-05T18:25:50.544850+00:00'
GROUP BY `student_id`;
| {
"query_block": {
"select_id": 1,
"cost_info": {
"query_cost": "858.94"
},
"grouping_operation": {
"using_filesort": false,
"table": {
"table_name": "studenthistory",
"access_type": "ref",
"possible_keys": [
"fk_studenthistory_student_e25b0310",
"idx_studenthistory_class_id_931474"
],
"key": "idx_studenthistory_class_id_931474",
"used_key_parts": [
"class_id"
],
"key_length": "4",
"ref": [
"const"
],
"rows_examined_per_scan": 8381,
"rows_produced_per_join": 2793,
"filtered": "33.33",
"using_index": true,
"cost_info": {
"read_cost": "20.84",
"eval_cost": "279.34",
"prefix_cost": "858.94",
"data_read_per_join": "130K"
},
"used_columns": [
"id",
"created_at",
"student_id",
"class_id"
],
"attached_condition": "(`test-table`.`studenthistory`.`created_at` >= TIMESTAMP'2021-11-05 18:25:50.54485')"
}
}
}
} |
The cost this time is 858, with "rows_examined_per_scan": 8381 and "rows_produced_per_join": 2793. Only class_id was used as a key part, however (why?), not the remaining student_id and created_at.
Query 1
select MAX(id) from studenthistory
where class_id = 1
and date(created_at) = '2021-11-05'
and time(created_at) > TIME('04:00:00')
group by student_id
Don't split up the date; change to
AND created_at > '2021-11-05 04:00:00'
If you want to check rows that were 'created' on that day, use something like
AND created_at >= '2021-11-05'
AND created_at < '2021-11-05' + INTERVAL 1 DAY
Or, if you want to check for "today":
AND created_at >= CURDATE()
After 4am this morning:
AND created_at >= CURDATE() + INTERVAL 4 HOUR
Using date(created_at) makes the created_at part of the INDEX unusable. (cf "sargable")
select MAX(id) ... group by student_id
Is likely to return multiple rows -- one per student. Perhaps you want to get rid of the group by? Or specify a particular student_id?
Query 2 may run faster:
select MAX(id) from studenthistory
where class_id = 1
group by student_id
But the optimal index is INDEX(class_id, student_id, id). (It is OK to have both composite indexes.)
It may return multiple rows, so perhaps you want
select student_id, MAX(id) from studenthistory
where class_id = 1
group by student_id
MAX
I believe MAX would simply fetch the last value without checking the whole row, if it is indexed.
Sometimes.
Your second query can do that. But the first query cannot -- because of the range test (on created_at) being in the way.
EXPLAIN
query plan seems ... similar
Alas, EXPLAIN leaves details out. You can get some more details with EXPLAIN FORMAT=JSON SELECT ..., but not necessarily enough details.
I think you will find that the second query will give a much smaller value for "Rows" after adding my suggested index.
A way to get an accurate measure of "rows (table or index) touched":
FLUSH STATUS;
SELECT ...;
SHOW SESSION STATUS LIKE 'Handler%';
Sensor data
For sensor data, consider multiple tables:
The raw data ("Fact" table, in Data Warehouse terminology). This has one row per reading per sensor.
The latest value for each sensor. This has one row for each of the 90K sensors. It will be a lot easier to maintain this table than to "find the latest" value for each sensor; that's a "groupwise-maximum" problem. (A sketch of maintaining such a table follows after this list.)
Summary data. An example is to have high/low/average/etc values for each sensor. This has one row per hour (or day or whatever is useful) per sensor.
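For the "latest value" table, a common pattern is to upsert on every reading. This is only a sketch; the table and column names are hypothetical, not from the question:
-- Hypothetical "latest reading per sensor" table, maintained alongside the raw data.
CREATE TABLE sensor_latest (
    sensor_id INT UNSIGNED NOT NULL PRIMARY KEY,
    reading   DOUBLE NOT NULL,
    read_at   DATETIME NOT NULL
);
-- On each new reading, insert or overwrite that sensor's single row.
INSERT INTO sensor_latest (sensor_id, reading, read_at)
VALUES (?, ?, ?)
ON DUPLICATE KEY UPDATE
    reading = VALUES(reading),
    read_at = VALUES(read_at);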
The query:
select MAX(id) from studenthistory
where class_id = 1
group by student_id
Can be fast if you create the index:
create index ix1 on studenthistory (class_id, student_id, id);
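If the optimizer can resolve that GROUP BY with a loose index scan, the JSON plan should say so; a quick way to check, using the same EXPLAIN FORMAT=JSON technique as above:
EXPLAIN FORMAT=JSON
SELECT student_id, MAX(id)
FROM studenthistory
WHERE class_id = 1
GROUP BY student_id;
-- Look for "using_index_for_group_by": true in the grouping_operation block;
-- that means only a handful of index entries per student_id are read instead of all of them.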

Avoid table scan and use index instead in query

I am designing a new database and have noticed my queries are not scaling as well as they should be. When my aggregations involve hundreds of records I am seeing significant increases in response times. I am wondering if my query is deeply flawed or if I am just not using the right index.
I have done a lot of tweaking to my query but have not come up with a way to eliminate doing a full table scan and instead use an index. When I use a tool similar to EXPLAIN on my query I see the following:
Full table scans are generally inefficient, avoid using them.
Your query uses MySQL's 'filesort' operation. This tends to slow down queries.
Your query uses MySQL's temporary tables. This can require extra I/O and tends to slow down queries.
Table:
CREATE TABLE `indexTable` (
`id` int(10) unsigned NOT NULL,
`userId` int(10) unsigned NOT NULL,
`col1` varbinary(320) NOT NULL,
`col2` tinyint(3) unsigned NOT NULL,
`col3` tinyint(3) unsigned NOT NULL,
`createdAt` bigint(20) unsigned NOT NULL,
`updatedAt` bigint(20) unsigned NOT NULL,
`metadata` json NOT NULL,
PRIMARY KEY (`id`,`userId`,`col1`,`col2`,`col3`),
KEY `createdAt` (`createdAt`),
KEY `id_userId_col1_col2_createdAt` (`id`,`userId`,`col1`,`col2`,`createdAt`),
KEY `col1_col2_createdAt` (`col1`,`col2`,`createdAt`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8
Query:
SELECT t1.id, t1.userId, t1.col1, t1.col2, t1.col3, t1.metadata
FROM indexTable as t1
INNER JOIN(
SELECT col1, col2, MAX(createdAt) AS maxCreatedAt
FROM indexTable
WHERE id = ? AND userId = ?
GROUP BY col1, col2
ORDER BY maxCreatedAt
LIMIT 10 OFFSET 0) AS sub
ON t1.col1 = sub.col1
AND t1.col2 = sub.col2
AND t1.createdAt = sub.maxCreatedAt
WHERE t1.id = ? AND t1.userId = ?
ORDER BY t1.createdAt;
PK: id, userId, col1, col2, col3
Index: createdAt
Explain:
{
"query_block": {
"select_id": 1,
"cost_info": {
"query_cost": "34.50"
},
"ordering_operation": {
"using_temporary_table": true,
"using_filesort": true,
"cost_info": {
"sort_cost": "10.00"
},
"nested_loop": [
{
"table": {
"table_name": "sub",
"access_type": "ALL",
"rows_examined_per_scan": 10,
"rows_produced_per_join": 10,
"filtered": "100.00",
"cost_info": {
"read_cost": "10.50",
"eval_cost": "2.00",
"prefix_cost": "12.50",
"data_read_per_join": "3K"
},
"used_columns": [
"col1",
"col2",
"maxCreatedAt"
],
"attached_condition": "(`sub`.`maxCreatedAt` is not null)",
"materialized_from_subquery": {
"using_temporary_table": true,
"dependent": false,
"cacheable": true,
"query_block": {
"select_id": 2,
"cost_info": {
"query_cost": "10.27"
},
"ordering_operation": {
"using_filesort": true,
"grouping_operation": {
"using_temporary_table": true,
"using_filesort": false,
"table": {
"table_name": "indexTable",
"access_type": "ref",
"possible_keys": [
"PRIMARY",
"createdAt",
"id_userId_col1_col2_createdAt",
"col1_col2_createdAt"
],
"key": "PRIMARY",
"used_key_parts": [
"id",
"userId"
],
"key_length": "8",
"ref": [
"const",
"const"
],
"rows_examined_per_scan": 46,
"rows_produced_per_join": 46,
"filtered": "100.00",
"cost_info": {
"read_cost": "1.07",
"eval_cost": "9.20",
"prefix_cost": "10.27",
"data_read_per_join": "16K"
},
"used_columns": [
"id",
"userId",
"createdAt",
"col1",
"col2",
"col3"
],
"attached_condition": "((`MyDB`.`indexTable`.`id` <=> 53) and (`MyDB`.`indexTable`.`userId` <=> 549814))"
}
}
}
}
}
}
},
{
"table": {
"table_name": "t1",
"access_type": "ref",
"possible_keys": [
"PRIMARY",
"createdAt",
"id_userId_col1_col2_createdAt",
"col1_col2_createdAt"
],
"key": "id_userId_col1_col2_createdAt",
"used_key_parts": [
"id",
"userId",
"col1",
"col2",
"createdAt"
],
"key_length": "339",
"ref": [
"const",
"const",
"sub.col1",
"sub.col2",
"sub.maxCreatedAt"
],
"rows_examined_per_scan": 1,
"rows_produced_per_join": 10,
"filtered": "100.00",
"cost_info": {
"read_cost": "10.00",
"eval_cost": "2.00",
"prefix_cost": "24.50",
"data_read_per_join": "3K"
},
"used_columns": [
"id",
"userId",
"createdAt",
"updatedAt",
"col1",
"col2",
"col3",
"metadata",
]
}
}
]
}
}
}
This query finds the most recent record in the grouping of col1 and col2, orders by createdAt, and limits the entries to 10.
The "derived" table (subquery) needs this composite index:
INDEX(id, userid, -- in either order
col1, col2, -- in this order
createdAt) -- to make it "covering"
With that index, it probably will not do a full table scan. However, it will involve a filesort. This is because the ORDER BY is not the same as the GROUP BY and it is an aggregate.
t1 needs
INDEX(col1, col2, -- in either order
createdAt)
sub,maxCreatedAt -- typo??
ORDER BY t1.createdAt -- another necessary filesort.
Do not be afraid of filesorts, especially when there are only 10 rows (as in the second case).
Without seeing SHOW CREATE TABLE, I cannot say whether the "filesort" and the "temporary table" touched the disk at all, or was done in RAM.
FORCE INDEX is almost always a bad idea -- even if it helps today, it may hurt tomorrow.
The Optimizer will deliberately (and rightly) use a table scan if too much of the table needs to be looked at -- it is faster than bouncing between the index and the data.
I was able to solve this issue by updating my query to include id and userId in the GROUP BY. I was then able to join on the two additional columns and for some reason that made MySQL use the right index.
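For anyone reading later, a sketch of what that revision might look like, based on the description above (not the exact final query):
SELECT t1.id, t1.userId, t1.col1, t1.col2, t1.col3, t1.metadata
FROM indexTable AS t1
INNER JOIN (
    SELECT id, userId, col1, col2, MAX(createdAt) AS maxCreatedAt
    FROM indexTable
    WHERE id = ? AND userId = ?
    GROUP BY id, userId, col1, col2
    ORDER BY maxCreatedAt
    LIMIT 10 OFFSET 0
) AS sub
  ON t1.id = sub.id
 AND t1.userId = sub.userId
 AND t1.col1 = sub.col1
 AND t1.col2 = sub.col2
 AND t1.createdAt = sub.maxCreatedAt
ORDER BY t1.createdAt;
With id and userId in the GROUP BY and in the join, the inner query lines up with the id_userId_col1_col2_createdAt index, which is presumably why the optimizer started picking it.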