Aurora MySQL 5.7 ignores primary key index [duplicate] - mysql

I am observing weird behaviour which I am trying to understand.
MySQL version: 5.7.33
I have the below query:
select * from a_table where time>='2022-05-10' and guid in (102,512,11,35,623,6,21,673);
a_table has a primary key on (time, guid) and an index on guid.
The query above performs very well, and according to the EXPLAIN plan it is Using index condition; Using where; Using MRR.
As I increase the number of values in my IN clause, performance degrades significantly.
After some dry runs I was able to get a rough number: for fewer than ~14,500 values the explain plan is the same as above. For more values than that, the explain plan shows only Using where, and the query takes forever to run.
In other words, if I put 14,000 values in my IN clause, the explain plan shows 14,000 rows, as expected. However, if I put 15,000 values in my IN clause, the explain shows 221,200,324 rows. I don't even have that many rows in my whole table.
I am trying to understand this behaviour and to know if there is any way to fix this.
Thank you

Read about Limiting Memory Use for Range Optimization.
When you have a large list of values in an IN() predicate, it uses more memory during the query optimization step. This was considered a problem in some cases, so recent versions of MySQL set a max memory limit (it's 8MB by default).
If the optimizer finds that it would need more memory than the limit, and there is no other condition in your query it can use to optimize, it gives up trying to optimize and resorts to a table scan. I infer that your table statistics actually show the table as having ~221 million rows (though table statistics are inexact estimates).
I can't say I know the exact formula for how much memory is needed for a given list of values, but given your observed behavior we could guess it's about 600 bytes per item on average, since ~14k items work and more than that does not.
You can set range_optimizer_max_mem_size = 0 to disable the memory limit. This creates a risk of excessive memory use, but it avoids the optimizer "giving up." We set this value on all MySQL instances at my last job, because we couldn't educate the developers to avoid creating huge lists of values in their queries.
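If disabling the limit globally isn't desirable, another way to sidestep it is to avoid building a huge IN() list at all: stage the values in a temporary table and JOIN against it. A sketch, with the temporary table name made up for illustration:

```sql
-- Option 1: disable the range-optimizer memory limit for this session only (0 = no limit):
SET SESSION range_optimizer_max_mem_size = 0;

-- Option 2: avoid the huge IN() list entirely by staging the values.
CREATE TEMPORARY TABLE wanted_guids (guid BIGINT NOT NULL PRIMARY KEY);
INSERT INTO wanted_guids VALUES (102), (512), (11), (35);  -- ...bulk-insert the rest

SELECT a.*
FROM a_table AS a
JOIN wanted_guids AS w ON w.guid = a.guid
WHERE a.time >= '2022-05-10';
```

The JOIN gives the optimizer one index lookup per staged value instead of a range tree built over thousands of IN() items, so the 8MB range-optimizer limit never comes into play.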

Related

How does MySQL determine whether to use an index based on the number of "in"?

For example, if there is a table named paper, I execute sql with
[ select paper.user_id, paper.name, paper.score from paper where user_id in (201,205,209……) ]
I observed that when this statement is executed, the index is only used when the number of values in the IN list is below a certain threshold, and that threshold is dynamic.
For example, when the total number of rows in the table is 4,000 and the cardinality is 3,939, the IN list must have fewer than 790 values for MySQL to use an index.
(Looking at the MySQL EXPLAIN output: if <790, type=range; if >790, type=ALL.)
When the total number of rows in the table is 1,300,000 and the cardinality is 1,199,166, the IN list must have fewer than 8,500 values for MySQL to use an index.
The result of this experiment is very strange to me.
I imagined that if I were implementing this IN query, I would first find the minimum and maximum values in the list, locate the pages they fall on, and then exclude the pages before the minimum and after the maximum. That would surely be faster than performing a full table scan.
Then, my test data can be summarized as follows:
Data in the table: 1 to 1,300,000
Values in the IN list: 900,000 to 920,000
My question is, in a table with 1300000 rows of data, why does MySQL think that when the number of "in" is more than 8500, it does not need to execute index queries?
mysql version 5.7.20
In fact, this magic number is 8452. When the total number of rows in my table is 600,000, it is 8452. When the total number of rows is 1,300,000, it is still 8452. Here are my test results:
When the IN list has 8,452 values, the query takes only 0.099s.
The execution plan shows a range query.
If I increase the number of values from 8,452 to 8,453, the query takes 5.066s, even if the value I add is a duplicate of one already in the list.
The execution plan then shows type=ALL.
This is really strange. It means that if I execute the query with 8,452 of the values first, and then run a second query with the remaining values, the total time is much less than directly executing a single query with all 8,453 values.
Can anyone debug the MySQL source code to see what happens in this process?
thanks very much.
Great question and nice find!
The query planner/optimizer has to decide whether it's going to seek the pages it needs to read, or start reading many more pages and scan for the ones it needs. The seek strategy is more memory- and especially CPU-intensive, while the scan is probably significantly more expensive in terms of I/O.
The bigger a table, the less attractive the seek strategy becomes. For a large table, a bigger part of the nonclustered index used for the seek has to come from disk, memory pressure rises, and the potential for sequential reads shrinks the longer the seek takes. Therefore, the rows-to-results ratio at which a seek is still considered drops as the table size rises.
If this is a problem, there are a few things you could try to tune. But when this is a problem for you in production, it might be the right time to consider a server upgrade, optimizing the queries and software involved, or simply adjusting expectations.
'Harden' or (re)enforce the query plans you prefer
Tweak the engine (when this is a problem that affects most tables server/database settings maybe can be optimized)
Optimize nonclustered indexes
Provide query hints
Alter tables and datatypes
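For instance, the "query hints" item above might look like the following against the paper table from the question. The index name idx_user_id and the listed values are assumptions for illustration:

```sql
-- Force the range scan that the optimizer abandons past the threshold:
SELECT paper.user_id, paper.name, paper.score
FROM paper FORCE INDEX (idx_user_id)
WHERE user_id IN (900001, 900005, 900009 /* , ... */);

-- Or give the optimizer a cheap bounding range alongside the IN() list,
-- since the values in the question fall between 900000 and 920000:
SELECT paper.user_id, paper.name, paper.score
FROM paper
WHERE user_id BETWEEN 900000 AND 920000
  AND user_id IN (900001, 900005, 900009 /* , ... */);
```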
It is usually folly to do a query in 2 steps. That framework seems to be fetching ids in one step, then fetching the real stuff in a second step.
If the two queries are combined into a single one (with a JOIN), the Optimizer is mostly forced to do the random lookups.
"Range" is perhaps always the "type" for IN lookups; don't read anything into it. Whether IN looks at min and max to try to minimize disk hits -- I would expect this to be a 'recent' optimization. (I have not seen it in the Changelogs.)
Are those UUIDs with the dashes removed? They do not scale well to huge tables.
"Cardinality" is just an estimate. ANALYZE TABLE forces the recomputation of such stats. See if that changes the boundary, etc.

MySQL 'IN' operator on large number of values


MySQL query time too long in sensor timestamped data table

I have a very simple table to log readings from sensors. There's a column for the sensor id number, one for the sensor reading, and one for the timestamp, which is of SQL type TIMESTAMP. There's a big amount of data in the table, a few million rows.
When I query for all rows before a certain timestamp with a certain sensor id number, sometimes it can take a very long time. If the timestamp is far in the past, the query is pretty fast but, if it's a recent timestamp, it can take up to 2 or 3 seconds.
It appears as if the SQL engine is iterating over the table until it finds the first timestamp that's larger than the queried timestamp. Or maybe the larger amount of queried data slows it down, I don't know.
In any case, I'm looking for design suggestions here, specifically addressing two points: why is it so slow, and how can I make it faster?
Is there any design technique that could be applied here? I don't know much about SQL; maybe there's a way to let the SQL engine know the data is ordered (right now it's not, but I could order it upon insertion, I guess) and speed up the query. Maybe I should change the way the query is done, or change the data type of the timestamp column.
Use EXPLAIN to see the execution plan, and verify that the query is using a suitable index. If not, verify that appropriate indexes are available.
An INDEX is stored "in order", and MySQL can make effective use of it with some query patterns. (An InnoDB table is also stored in order, by the cluster key, which is the PRIMARY KEY of the table (if it exists) or the first UNIQUE KEY on non-NULL columns.)
With some query patterns, using an index lets MySQL eliminate vast swaths of rows from being examined. When MySQL can't make use of an index (either because a suitable index doesn't exist, or because the query has constructs that prevent it), the execution plan will do a full scan, that is, examine every row in the table. And when that happens with very large tables, things tend to get slow.
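As a sketch of what a suitable index means here, assuming the columns are named sensor_id and reading_time (your actual names may differ), a composite index matching the query pattern lets MySQL jump straight to one sensor's rows in timestamp order:

```sql
-- Hypothetical table and column names, shown for illustration only.
ALTER TABLE sensor_log
  ADD INDEX idx_sensor_time (sensor_id, reading_time);

EXPLAIN
SELECT reading, reading_time
FROM sensor_log
WHERE sensor_id = 42
  AND reading_time < '2022-05-10 00:00:00';
```

With the composite index in place, the plan should show type=range on idx_sensor_time instead of a full scan.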
EDIT
Q: Why is it so slow?
A: Several factors affect the elapsed time. It could be contention, for example an exclusive table lock taken by another session; or time for I/O (disk reads); or a large "Using filesort" operation; or time spent returning the resultset over a slow network connection.
It's not possible to diagnose the issue with the limited information provided. We can only offer suggestions about common issues.
Q: How can I make it faster?
A: It's not possible to make a specific recommendation. We need to figure out where and what the bottleneck is, and then address that.
Take a look at the output from EXPLAIN to examine the execution plan. Is an appropriate index being used, or is it doing a full scan? How many rows are being examined? Is there a "Using filesort" operation? And so on.
Q: Is there any design technique that could be applied here?
A: In general, having an appropriate index available, and carefully crafting the SQL statement so the most efficient access plan is enabled.
Q: Maybe I should change the way the query is done
A: Changing the SQL statement may improve performance, that's a good place to start, after looking at the execution plan... can the query be modified to get a more efficient plan?
Q: or change the data type of the timestamp column.
A: I think it's very unlikely that changing the datatype of the TIMESTAMP column will improve performance; it's only 4 bytes. What would you change it to? DATETIME takes 8 bytes (5 bytes as of MySQL 5.6.4).
In general, we want the rows to be as short as possible, and to pack as many rows as possible into a block. It's also desirable to have the table physically organized so that queries can be satisfied from fewer blocks... the rows the query needs are found in fewer pages, rather than being scattered onesy-twosy over a large number of pages.
With InnoDB, increasing the size of the buffer pool may reduce I/O.
And I/O from solid state drives (SSD) will be faster than I/O from spinning hard disks (HDD), and this especially true if there is I/O contention on the HDD from other processes.
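To check the buffer pool point, the size can be inspected and, as of MySQL 5.7.5, resized online; the 4GB figure below is only a placeholder:

```sql
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

-- Dynamic resize (MySQL 5.7.5+); pick a value that fits your server's RAM.
SET GLOBAL innodb_buffer_pool_size = 4 * 1024 * 1024 * 1024;  -- 4GB, illustrative
```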

MySQL Innodb fail to use index due to extremely wrong rows estimation

I've an innodb table, the query on the table looks like below.
SELECT *
FROM x
WHERE now() BETWEEN a AND b
I've created a composite index on (a, b). The query returns around 4k rows, while the total number of rows in the table is around 700k.
However, when I look at the EXPLAIN output for the execution plan, I found that the query did not use the expected index, because the estimated row count is around 360k, vastly larger than the actual value.
I know, as many posts (such as Why the rows returns by "explain" is not equal to count()?) have explained, that EXPLAIN only gets a rough estimation. But the FORCE INDEX solution is tricky and may bring potential performance risks in the future.
Is there any way I can make MySQL produce a more accurate estimate (the current one is 90 times too large)? Thanks.
InnoDB only keeps approximate row counts for tables. This is explained in the documentation of SHOW TABLE STATUS:
Rows
The number of rows. Some storage engines, such as MyISAM, store the exact count. For other storage engines, such as InnoDB, this value is an approximation, and may vary from the actual value by as much as 40 to 50%.
I don't think there's any way to make InnoDB keep accurate row counts, it's just not how it works.
This particular construct is hard to optimize:
WHERE constant BETWEEN col1 AND col2
No MySQL index can be devised to make it run fast. The attempts include:
INDEX(col1) -- will scan last half of table
INDEX(col2) -- will scan first half of table
INDEX(col1, col2) -- will scan last half of table
(Whether it does more of the work in the index BTree depends on ICP, covering, etc. But, in any case, lots of rows must be touched.)
One reason it cannot be improved upon is that the 'last' row in the 'half' might actually match.
If the (col1, col2) pairs do not overlap, then there is a possibility of improving the performance by being able to stop after one row. But MySQL does not know if you have this case, so it cannot optimize. Here is an approach for non-overlapping IP-address lookups that is efficient.
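For the non-overlapping case, the "stop after one row" idea can be written explicitly: find the single latest interval that starts at or before now, then verify it also ends at or after now. A sketch against the table from the question, assuming an INDEX(a) exists:

```sql
SELECT *
FROM (
  SELECT *
  FROM x
  WHERE a <= NOW()
  ORDER BY a DESC
  LIMIT 1              -- with non-overlapping intervals, only this row can contain NOW()
) AS candidate
WHERE b >= NOW();
```

The inner query is a single index probe down INDEX(a), so it touches one row instead of scanning half the table.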

Does this field need an index?

I currently have a summary table to keep track of my users' post counts, and I run SELECTs on that table to sort them by counts, like WHERE count > 10, for example. Now I know having an index on columns used in WHERE clauses speeds things up, but since these fields will also be updated quite often, would indexing provide better or worse performance?
If you have a query like
SELECT count(*) as rowcount
FROM table1
GROUP BY name
Then you cannot put an index on count, you need to put an index on the group by field instead.
If you have a field named count, then putting an index on that field may speed up the following query, or it may make no difference at all:
SELECT id, `count`
FROM table1
WHERE `count` > 10
Whether an index on count will speed up the query really depends on what percentage of the rows satisfy the WHERE clause. If it's more than about 30%, MySQL (or any SQL database, for that matter) will refuse to use the index.
It will just stubbornly insist on doing a full table scan (i.e. read all rows).
This is because using an index requires reading 2 files (the index file and then the real table file with the actual data).
If you select a large percentage of rows, reading the extra index file is not worth it, and just reading all the rows in order will be faster.
If only a few rows pass the filter, using an index will speed up the query a lot.
Know your data
Using explain select will tell you what indexes MySQL has available and which one it picked and (kind of/sort of in a complicated kind of way) why.
See: http://dev.mysql.com/doc/refman/5.0/en/explain.html
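A minimal sketch of the second case, using the query above (count is quoted in backticks because it clashes with the COUNT() function name):

```sql
ALTER TABLE table1 ADD INDEX idx_count (`count`);

EXPLAIN
SELECT id, `count`
FROM table1
WHERE `count` > 10;  -- expect type=range if the filter is selective enough
```

If EXPLAIN still shows type=ALL, the optimizer has judged that too large a fraction of rows match the filter.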
Indexes in general provide better read performance at the cost of slightly worse insert, update and delete performance. Usually the tradeoff is worth it depending on the width of the index and the number of indexes that already exist on the table. In your case, I would bet that the overall performance (reading and writing) will still be substantially better with the index than without but you would need to run tests to know for sure.
It will improve read performance and worsen write performance. If the tables are MyISAM and you have a lot of people posting in a short amount of time you could run into issues where MySQL is waiting for locks, eventually causing a crash.
There's no way of really knowing that without trying it. A lot depends on the ratio of reads to writes, storage engine, disk throughput, various MySQL tuning parameters, etc. You'd have to setup a simulation that resembles production and run before and after.
I think it's unlikely that the write performance will be a serious issue after adding the index.
But note that the index won't be used anyway if it is not selective enough. If, for example, more than 10% of your users have count > 10, the fastest query plan might be to skip the index and just scan the entire table.