Why does MySQL choose a join index inconsistently?

I have a join between two tables on three columns. The join was taking hours to complete, so I added a composite index on all three columns on each table. Then, sometimes the join would be really fast and sometimes it would still be slow.
Using EXPLAIN, I noticed that it was fast when it chose to join using the composite index and slow when it just chose an index on only one of the columns. But each of these runs was using the same data.
Is there randomness involved in SQL selecting which index to use? Why would it be inconsistent?
If it helps: it is a MySQL database being queried from pandas in Python.

Q: Is there randomness involved in SQL selecting which index to use?
There's no randomness involved, per se. The optimizer makes use of table and index statistics (the number of rows and the cardinality), along with the predicates in the query, to develop estimates, e.g. the number of rows that will need to be retrieved.
MySQL also evaluates the cost for join operations, sort operations, etc. for each possible access plan (e.g. which index to use, which order to access the tables in) to come up with an estimated cost for each plan.
And then the optimizer compares the costs, and uses the plan that has the lowest cost. There are some parameters (MySQL system variables) that influence the cost estimates. (For example, tuning the expected cost for I/O operations.)
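(In MySQL 5.7, those cost constants live in the mysql.server_cost and mysql.engine_cost tables, so you can inspect what the optimizer is working with. A quick look, assuming you have read privileges on the mysql schema:)
SELECT cost_name, cost_value FROM mysql.server_cost;
SELECT engine_name, cost_name, cost_value FROM mysql.engine_cost;
(A NULL cost_value means the built-in default is in effect.)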
Q: Why would it be inconsistent?
For an InnoDB table, there is some randomness that comes into play with gathering statistics. InnoDB uses a sampling technique, doing a "deep dive" into a small set of "random" pages. The results from those sample pages are extrapolated into estimates for the whole table.
Some of the InnoDB tuning parameters (MySQL system variables) influence (increase/decrease) the number of pages that are sampled when gathering statistics. Sampling a smaller number of pages can be faster, but the smaller sample makes it more likely that the sample set is not entirely representative of the whole table. Using a larger number of sample pages alleviates that to a degree, but the sampling takes longer. It's a tradeoff.
Note that InnoDB automatically re-collects statistics when 10% of the rows in the table are changed by DML operations. (There are some cases where the automatic collection of statistics may not be triggered; for example, creating a new (empty) table and populating it with a LOAD DATA statement could result in no statistics being collected.)
So, the most likely explanation for the observed behavior is that at different times, there are different statistics available to the optimizer.
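If stale or unrepresentative statistics are suspected, you can force InnoDB to re-sample immediately. A minimal sketch (the table name is a stand-in):
SHOW VARIABLES LIKE 'innodb_stats_persistent_sample_pages';  -- how many pages InnoDB samples per index
ANALYZE TABLE my_table;  -- force an immediate re-collection of statistics for one table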
Note that it is possible to influence the optimizer to opt for a plan that makes use of particular indexes, by including hints in the SQL text. We typically don't need to do that, nor do we want to; but in some cases, where the optimizer is choosing an inefficient plan, a hint can help get a better plan.
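For example, a sketch of an index hint for the join described in the question, assuming the composite indexes are named idx_t1_abc and idx_t2_abc (all table, column, and index names here are stand-ins):
SELECT t1.*
FROM t1 FORCE INDEX (idx_t1_abc)
JOIN t2 FORCE INDEX (idx_t2_abc)
  ON t2.a = t1.a
 AND t2.b = t1.b
 AND t2.c = t1.c;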
A few references (from the MySQL 5.7 Reference Manual)
https://dev.mysql.com/doc/refman/5.7/en/optimizer-hints.html
https://dev.mysql.com/doc/refman/5.7/en/innodb-performance-optimizer-statistics.html

Related

How does MySQL determine whether to use an index based on the number of "in"?

For example, if there is a table named paper, and I execute SQL like this:
select paper.user_id, paper.name, paper.score from paper where user_id in (201, 205, 209, ...)
I observed that when this statement is executed, the index will only be used when the number of values in the IN list is less than a certain number, and that number is dynamic.
For example, when the total number of rows in the table is 4000 and the cardinality is 3939, the IN list must contain fewer than 790 values for MySQL to use the index.
(Viewing the MySQL EXPLAIN output: if fewer than 790, type=range; if more than 790, type=ALL.)
When the total number of rows in the table is 1300000 and the cardinality is 1199166, the IN list must contain fewer than 8500 values for MySQL to use the index.
The result of this experiment is very strange to me.
I imagined that if I implemented this IN query, I would first find the max and min values in the list, locate the pages where those values live, and then exclude the pages before the min and the pages after the max. That would definitely be faster than performing a full table scan.
Then, my test data can be summarized as follows:
Data in the table: 1 to 1300000
Values in the IN list: 900000 to 920000
My question is: in a table with 1300000 rows of data, why does MySQL think that when the IN list contains more than 8500 values, it does not need to use the index?
MySQL version 5.7.20
In fact, this magic number is 8452. When the total number of rows in my table is 600000, it is 8452. When the total number of rows is 1300000, it is still 8452. My test results follow.
When the number of values in the IN list is 8452, the query takes only 0.099s.
Viewing the execution plan: type=range.
If I increase the number of values from 8452 to 8453, the query takes 5.066s, even if I only add a duplicate element.
Viewing the execution plan: type=ALL.
This is really strange. It means that if I execute the query with 8452 values first, and then a second query for the remaining values, the total time is much less than directly executing the query with 8453 values.
Can someone debug the MySQL source code to see what happens in this process?
Thanks very much.
Great question and nice find!
The query planner/optimizer has to decide whether it's going to seek the pages it needs to read, or start reading many more pages and scan for the ones it needs. The seek strategy is more memory- and especially CPU-intensive, while the scan is probably significantly more expensive in terms of I/O.
The bigger a table, the less attractive the seek strategy becomes. For a large table, a bigger part of the nonclustered index used for the seek needs to come from disk, memory pressure rises, and the potential for sequential reads shrinks the longer the seek takes. Therefore the threshold rows/results ratio below which a seek is considered drops as the table size rises.
If this is a problem, there are a few things you could try to tune. But when this is a problem for you in production, it might be the right time to consider a server upgrade, optimizing the queries and software involved, or simply adjusting expectations.
'Harden' or (re)enforce the query plans you prefer
Tweak the engine (when this is a problem that affects most tables server/database settings maybe can be optimized)
Optimize nonclustered indexes
Provide query hints (see the sketch after this list)
Alter tables and datatypes
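As a sketch of the query-hint option, assuming an index named idx_user_id exists on paper.user_id (the index name is a stand-in):
select user_id, name, score
from paper force index (idx_user_id)
where user_id in (201, 205, 209);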
It is usually folly to do a query in 2 steps. That framework seems to be fetching ids in one step, then fetching the real stuff in a second step.
If the two queries are combined into a single one (with a JOIN), the Optimizer is mostly forced to do the random lookups.
"Range" is perhaps always the "type" for IN lookups. Don't read anything into it. Whether IN looks at min and max to try to minimize disk hits -- I would expect this to be a 'recent' optimization. (I have not it in the Changelogs.)
Are those UUIDs with the dashes removed? They do not scale well to huge tables.
"Cardinality" is just an estimate. ANALYZE TABLE forces the recomputation of such stats. See if that changes the boundary, etc.

What's the minimum number of rows where indexing becomes valuable in MySQL?

I've read that indexing on some databases (SQL Server is the one I read about) doesn't have much effect until you cross a certain threshold of rows because the database will hold the entire table X in memory.
Ordinarily, I'd plan to index on my WHEREs and unique columns/lesser-changed tables. After hearing about the suggested minimum (which was about 10k), I wanted to learn more about that idea. If there are tables that I know will never pass a certain point, this might change the way I index some of them.
For something like MySQL MyISAM/InnoDB, is there a point where indexing has little value, and what are some ways of determining that?
Note: Very respectfully, I'm not looking for suggestions about structuring my database like "You should index anyway," I'm looking to understand this concept, if it's true or not, how to determine the thresholds, and similar information.
One of the major uses of indexes is to reduce the number of pages being read. The index itself is usually smaller than the table. So, just in terms of page reads/writes, you generally need at least three data pages to see a benefit, because using an index requires reading at least two pages (one for the index and one for the original data).
(Actually, if the index covers the query, then the breakeven is two.)
The number of data pages needed for a table depends on the size of the records and the number of rows. So, it is really not possible to specify a threshold on the number of rows.
The above very rudimentary explanation leaves out a few things:
The cost of scanning the data pages to do comparisons for each row.
The cost of loading and using index pages.
Other uses of indexing.
But it gives you an idea, and you can see benefits on tables much smaller than 10k rows. That said, you can easily do tests on your data to see how queries perform on the tables in question.
Also, I strongly, strongly recommend having primary keys on all tables and using those keys for foreign key relationships. The primary key itself is an index.
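To relate the page arithmetic above to a concrete table, information_schema reports approximate data and index sizes. A sketch, assuming the default 16KB InnoDB page size (my_table is a stand-in):
SELECT table_rows,
       data_length  / 16384 AS approx_data_pages,
       index_length / 16384 AS approx_index_pages
FROM information_schema.TABLES
WHERE table_schema = DATABASE()
  AND table_name = 'my_table';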
Indexes serve a lot of purposes. InnoDB tables are always organized as an index, on the cluster key. Indexes can be used to enforce unique constraints, as well as support foreign key constraints. The topic of "indexes" spans way more than query performance.
In terms of query performance, it really depends on what the query is doing. If we are selecting a small subset of rows, out of large set, then effective use of an index can speed that up by eliminating vast swaths of rows from being checked. That's where the biggest bang comes from.
If we are pulling all of the rows, or nearly all the rows, from a set, then an index typically doesn't help narrow down which rows to check; even when an index is available, the optimizer may choose to do a full scan of all of the rows.
But even when pulling large subsets, appropriate indexes can improve performance for join operations, and can significantly improve performance of queries with GROUP BY or ORDER BY clauses, by making use of an index to retrieve rows in order, rather than requiring a "Using filesort" operation.
If we are looking for a simple rule of thumb... for a large set, if we need to pull (or look at) less than 10% of the total rows, then an access plan using a suitable index will typically outperform a full scan. If we are looking for a specific row, based on a unique identifier, an index is going to be faster than a full scan. If we are pulling all columns for every row in the table in no particular order, then a full scan is going to be faster.
Again, it really comes down to what operations are being performed. What queries are being executed, and the performance profile that we need from those queries. That is going to be the key to determining the indexing strategy.
In terms of gaining understanding, use EXPLAIN to see the execution plan, and learn the operations available to the MySQL optimizer.
(The topic of indexing strategy in terms of database performance is much too large for a StackOverflow question.)
Each situation is different. If you profile your code, then you'll better understand each anti-pattern. To demonstrate the extreme unexpectedness, consider Oracle:
If this were Oracle, I would say zero, because if an empty table's high-water mark is very high, then a query that motivates a full table scan and returns zero rows would be much more expensive than the same query induced to do even a full index scan.
The same process that I went through to understand Oracle you can do with MySQL: profile your code.

MySQL query time too long in sensor timestamped data table

I have a very simple table to log reading from sensors. There's a column for sensor id number, one for sensor reading and one for the timestamp. This column is of SQL type Timestamp. There's a big amount of data in the table, a few million rows.
When I query for all rows before a certain timestamp with a certain sensor id number, sometimes it can take a very long time. If the timestamp is far in the past, the query is pretty fast but, if it's a recent timestamp, it can take up to 2 or 3 seconds.
It appears as if the SQL engine is iterating over the table until it finds the first timestamp that's larger than the queried timestamp. Or maybe the larger amount of queried data slows it down, I don't know.
In any case, I'm looking for design suggestions here, specifically to address two points: why is it so slow, and how can I make it faster?
Is there any design technique that could be applied here? I don't know much about SQL, maybe there's a way to let the SQL engine know the data is ordered (right now it's not but I could order it upon insertion I guess) and speed up the query. Maybe I should change the way the query is done or change the data type of the timestamp column.
Use EXPLAIN to see the execution plan, and verify that the query is using a suitable index. If not, verify that appropriate indexes are available.
An INDEX is stored "in order", and MySQL can make effective use of it with some query patterns. (An InnoDB table is also stored in order, by the cluster key, which is the PRIMARY KEY of the table (if it exists) or the first UNIQUE KEY on non-NULL columns.)
With some query patterns, by using an index, MySQL can eliminate vast swaths of rows from being examined. When MySQL can't make use of an index (either because a suitable index doesn't exist, or because the query has constructs that prevent it), the execution plan is going to do a full scan, that is, examine every row in the table. And when that happens with very large tables, there's a tendency for things to get slow.
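For the pattern described in the question (equality on the sensor id, range on the timestamp), a minimal sketch -- table, column, and index names are stand-ins for the real schema:
-- composite index: the equality column first, the range column second
ALTER TABLE sensor_log ADD INDEX idx_sensor_ts (sensor_id, ts);
-- verify the plan uses it (look for type=range and key=idx_sensor_ts)
EXPLAIN
SELECT sensor_id, reading, ts
FROM sensor_log
WHERE sensor_id = 42
  AND ts < '2020-01-01 00:00:00';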
EDIT
Q: Why is it so slow?
A: There are several factors that affect the elapsed time. It could be contention, for example an exclusive table lock taken by another session; it could be time for I/O (disk reads), or a large "Using filesort" operation; or it could be the time to return the resultset over a slow network connection.
It's not possible to diagnose the issue with the limited information provided. We can only provide some suggestions about common issues.
Q: How can I make it faster?
A: It's not possible to make a specific recommendation. We need to figure out where and what the bottleneck is, and then address that.
Take a look at the output from EXPLAIN to examine the execution plan. Is an appropriate index being used, or is it doing a full scan? How many rows are being examined? Is there a "Using filesort" operation? et al.
Q: Is there any design technique that could be applied here?
A: In general, having an appropriate index available, and carefully crafting the SQL statement so the most efficient access plan is enabled.
Q: Maybe I should change the way the query is done
A: Changing the SQL statement may improve performance, that's a good place to start, after looking at the execution plan... can the query be modified to get a more efficient plan?
Q: or change the data type of the timestamp column.
A: I think it's very unlikely that changing the datatype of the TIMESTAMP column will improve performance. TIMESTAMP is only 4 bytes. What would you change it to? DATETIME takes 5 bytes (8 bytes before MySQL 5.6).
In general, we want the rows to be as short as possible, and to pack as many rows as possible into a block. It's also desirable to have the table physically organized so that queries can be satisfied from fewer blocks... the rows the query needs are found in fewer pages, rather than being scattered onesy-twosy over a large number of pages.
With InnoDB, increasing the size of the buffer pool may reduce I/O.
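A sketch of checking and (in MySQL 5.7+, where the variable is dynamic) resizing the buffer pool; the 4GB figure is only an illustration, tune it to your available RAM:
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
SET GLOBAL innodb_buffer_pool_size = 4 * 1024 * 1024 * 1024;  -- 4GB; online resize is supported in 5.7+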
And I/O from solid state drives (SSD) will be faster than I/O from spinning hard disks (HDD), and this is especially true if there is I/O contention on the HDD from other processes.

Is query cost the best metric for MySQL query optimization?

I am in the process of optimizing the queries in my MySQL database. While using Visual Explain and looking at various query costs, I'm repeatedly finding counter-intuitive values. Operations which use more efficient lookups (e.g. key lookup) seem to have a higher query cost than ostensibly less efficient operations (e.g. full table scan or full index scan).
Examples of this can even be seen in the MySQL manual, in the section regarding Visual Explain on this page:
The query cost for the full table scan is a fraction of the key-lookup-based query costs. I see exactly the same scenario in my own database.
All this seems perfectly backwards to me, and raises this question: should I use query cost as the standard when optimizing a query? Or have I fundamentally misunderstood query cost?
MySQL does not have very good metrics relating to Optimization. One of the better ones is EXPLAIN FORMAT=JSON SELECT ..., but it is somewhat cryptic.
Some 'serious' flaws:
Rarely does anything account for a LIMIT.
Statistics on indexes are crude and do not allow for uneven distribution. (Histograms are coming 'soon'.)
Very little is done about whether data/indexes are currently cached, and nothing about whether you have a spinning drive or SSD.
I like this because it lets me compare two formulations/indexes/etc even for small tables where timing is next to useless:
FLUSH STATUS;
perform the query
SHOW SESSION STATUS LIKE "Handler%";
It provides exact counts (unlike EXPLAIN) of reads, writes (to temp table), etc. Its main flaw is in not differentiating how long a read/write took (due to caching, index lookup, etc). However, it is often very good at pointing out whether a query did a table/index scan versus lookup versus multiple scans.
The regular EXPLAIN fails to point out multiple sorts, such as might happen with GROUP BY and ORDER BY. And "Using filesort" does not necessarily mean anything is written to disk.
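A sketch of the technique end to end, reusing the paper table from the earlier question; the counter names are real, the query is just an illustration:
FLUSH STATUS;
SELECT user_id, name, score FROM paper WHERE user_id IN (201, 205, 209);
SHOW SESSION STATUS LIKE 'Handler%';
-- Handler_read_key      : rows fetched via index lookups
-- Handler_read_next     : rows read in index order (range/index scans)
-- Handler_read_rnd_next : rows read sequentially (full table scans)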

MySQL performance with large number of records - partitioning?

I am trying to build a database that would contain a large number of records, each with a lot of columns (fields) - maybe around 200-300 fields total for all tables. Let's say that in a few years I would have about 40,000,000 to 60,000,000 records.
I plan to normalize the database, so I will have a lot of tables (about 30-40) -> and lots of joins for queries.
The database will be strictly related to the US, meaning that queries will be based on the 50 states alone (a query won't be allowed to search/insert/etc. across multiple states, just one).
What can I do to have better performance?
Someone came up with the idea of having all the states in different table structures, meaning I would have 50 tables * the 30-40 for the data (about 200 tables)! Should I even consider this type of approach?
The next idea was to use partitioning based on the US 50 states. How about this?
Any other way?
The best optimization is determined by the queries you run, not by your tables' structure.
If you want to use partitioning, this can be a great optimization, if the partitioning scheme supports the queries you need to optimize. For instance, you could partition by US state, and that would help queries against data for a specific state. MySQL supports "partition pruning" so that the query would only run against the specific partition -- but only if your query mentions a specific value for the column you used as the partition key.
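A sketch of what partitioning by state could look like (table and column names are stand-ins; only three states are listed for brevity, and MySQL requires the partition key to be part of every unique key, hence the composite primary key):
CREATE TABLE MyTable (
  id      BIGINT NOT NULL,
  state   CHAR(2) NOT NULL,
  created DATE NOT NULL,
  PRIMARY KEY (id, state)
)
PARTITION BY LIST COLUMNS (state) (
  PARTITION p_ny VALUES IN ('NY'),
  PARTITION p_ca VALUES IN ('CA'),
  PARTITION p_tx VALUES IN ('TX')
);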
You can always check whether partition pruning is effective by using EXPLAIN PARTITIONS:
EXPLAIN PARTITIONS
SELECT ... FROM MyTable WHERE state = 'NY';
That should report that the query uses a single partition.
Whereas if you need to run queries by date for example, then the partitioning wouldn't help; MySQL would have to repeat the query against all 50 partitions.
EXPLAIN PARTITIONS
SELECT ... FROM MyTable WHERE date > '2013-05-01';
That would list all partitions. There's a bit of overhead to query all partitions, so if this is your typical query, you should probably use range partitioning by date.
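A sketch of range partitioning by date for that case (again, names and boundaries are illustrative):
CREATE TABLE MyTable (
  id      BIGINT NOT NULL,
  state   CHAR(2) NOT NULL,
  created DATE NOT NULL,
  PRIMARY KEY (id, created)
)
PARTITION BY RANGE (TO_DAYS(created)) (
  PARTITION p2012 VALUES LESS THAN (TO_DAYS('2013-01-01')),
  PARTITION p2013 VALUES LESS THAN (TO_DAYS('2014-01-01')),
  PARTITION pmax  VALUES LESS THAN MAXVALUE
);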
So choose your partition key with the queries in mind.
Any other optimization technique follows a similar pattern -- it helps some queries, possibly to the disadvantage of other queries. So be sure you know which queries you need to optimize for, before you decide on the optimization method.
Re your comment:
Certainly there are many databases that have 40 million rows or more, but have good performance. They use different methods, including (in no particular order):
Indexing
Partitioning
Caching
Tuning MySQL configuration variables
Archiving
Increasing hardware capacity (e.g. more RAM, solid state drives, RAID)
My point above is that you can't choose the best optimization method until you know the queries you need to optimize. Furthermore, the best choice may be different for different queries, and may even change over time as data or traffic grows. Optimization is a continual process, because you won't know where your bottlenecks are until you see how your data grows and the query traffic your database receives.