Is query cost the best metric for MySQL query optimization? - mysql

I am in the process of optimizing the queries in my MySQL database. While using Visual Explain and looking at various query costs, I'm repeatedly finding counter-intuitive values. Operations that use more efficient lookups (e.g. key lookup) seem to have a higher query cost than ostensibly less efficient operations (e.g. a full table scan or full index scan).
Examples of this can even be seen in the MySQL manual, in the section regarding Visual Explain on this page:
The query cost for the full table scan is a fraction of the key-lookup-based query costs. I see exactly the same scenario in my own database.
All this seems perfectly backwards to me, and raises this question: should I use query cost as the standard when optimizing a query? Or have I fundamentally misunderstood query cost?

MySQL does not have very good metrics relating to Optimization. One of the better ones is EXPLAIN FORMAT=JSON SELECT ..., but it is somewhat cryptic.
Some 'serious' flaws:
Rarely does anything account for a LIMIT.
Statistics on indexes are crude and do not allow for uneven distribution. (Histograms are coming 'soon'.)
Very little is done about whether data/indexes are currently cached, and nothing about whether you have a spinning drive or SSD.
I like the following technique because it lets me compare two formulations/indexes/etc. even for small tables, where timing is next to useless:
FLUSH STATUS;
-- run the query you want to measure
SHOW SESSION STATUS LIKE 'Handler%';
It provides exact counts (unlike EXPLAIN) of reads, writes (to temp table), etc. Its main flaw is in not differentiating how long a read/write took (due to caching, index lookup, etc). However, it is often very good at pointing out whether a query did a table/index scan versus lookup versus multiple scans.
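As a concrete illustration (the table and column names here are made up), run the query under test between the two statements and read off the counters:
FLUSH STATUS;
SELECT * FROM t WHERE indexed_col = 123;   -- hypothetical table with an index on indexed_col
SHOW SESSION STATUS LIKE 'Handler%';
A small Handler_read_key (plus Handler_read_next for a range) points to an index lookup; a Handler_read_rnd_next close to the table's row count points to a full scan.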
The regular EXPLAIN fails to point out multiple sorts, such as might happen with GROUP BY and ORDER BY. And "Using filesort" does not necessarily mean anything is written to disk.

Related

How to judge the complexity of SQL queries

Any resource where it is explained how to judge the complexity of SQL queries would be much appreciated.
(By "complexity", I assume you mean "slowness"?) Some tips:
Subqueries may or may not slow down a query a lot.
GROUP BY and ORDER BY -- when both are present but different: Usually requires two sorts.
Usually only a single index is used per SELECT.
OR is almost always inefficient. Switching to UNION allows multiple indexes to be used efficiently (see the UNION sketch after this list).
UNION ALL, with a few restrictions, is more efficient than UNION DISTINCT (which needs a de-duplication pass).
Non-sargable expressions cannot use an index, and are therefore severely inefficient.
Only if the entire WHERE, GROUP BY, and ORDER BY are handled by a single index can a LIMIT be handled efficiently. (Otherwise the engine must collect all the matching rows, sort them, and only then peel off a few rows.)
Entity-Attribute-Value schema is inefficient.
UUIDs and GUIDs are inefficient on very large tables.
A composite index is often better than a single-column index.
A "covering" index is somewhat better.
Sometimes, especially when a LIMIT is involved, it is better to turn the query inside-out. That is, start with a subquery that finds the few ids you need, then reach back into the same table and into other tables to get the rest of the desired columns (see the deferred-join sketch after this list).
"Windowing functions" are poorly implemented in MySQL 8 and MariaDB 10.2. They are useful for "groupwise-max" and "hierarchical schemas". Until the Optimizer improves, I declare them to be "complex".
Recent versions have recognized "row constructors"; previously they were a performance hit.
Having an AUTO_INCREMENT id hurts performance in certain cases; helps in others.
EXPLAIN (or EXPLAIN FORMAT=JSON) tells you what is going on now; it fails to tell you how to rewrite the query or what better index to add.
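A sketch of the OR-to-UNION rewrite mentioned above (hypothetical table t, with separate indexes assumed on columns a and b):
SELECT * FROM t WHERE a = 1 OR b = 2;   -- usually limited to one index, or a full scan
SELECT * FROM t WHERE a = 1
UNION
SELECT * FROM t WHERE b = 2;            -- each SELECT can use its own index
And a sketch of the deferred-join ("inside-out") pattern for LIMIT, again with made-up names; the inner subquery can often be satisfied from a small index on created_at, so only the ten surviving ids reach back into the wide rows:
SELECT t.*
FROM ( SELECT id FROM t ORDER BY created_at DESC LIMIT 10 ) AS latest
JOIN t USING (id);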
More indexing tips: http://mysql.rjweb.org/doc.php/index_cookbook_mysql In that link, see "Handler counts" for a good way to measure the complexity of specific queries. I use it to compare query formulations, etc., even without populating a large table to get usable timings.
Give me a bunch of Queries; I'll point out the complexities, if any, in each.
Check out the official MySQL documentation on Query Execution Plan:
https://dev.mysql.com/doc/refman/5.7/en/execution-plan-information.html
You could use the EXPLAIN command to get more information about your query.
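For example (table and column names here are hypothetical):
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
EXPLAIN FORMAT=JSON SELECT * FROM orders WHERE customer_id = 42;   -- also shows the optimizer's cost estimates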

What's the minimum number of rows where indexing becomes valuable in MySQL?

I've read that indexing on some databases (SQL Server is the one I read about) doesn't have much effect until you cross a certain threshold of rows because the database will hold the entire table X in memory.
Ordinarily, I'd plan to index on my WHEREs and unique columns/lesser-changed tables. After hearing about the suggested minimum (which was about 10k), I wanted to learn more about that idea. If there are tables that I know will never pass a certain point, this might change the way I index some of them.
For something like MySQL MyISAM/INNODB, is there a point where indexing has little value and what are some ways of determining that?
Note: Very respectfully, I'm not looking for suggestions about structuring my database like "You should index anyway," I'm looking to understand this concept, if it's true or not, how to determine the thresholds, and similar information.
One of the major uses of indexes is to reduce the number of pages being read. The index itself is usually smaller than the table. So, just in terms of page read/writes, you generally need at least three data pages to see a benefit, because using an index requires at least two data pages (one for the index and one for the original data).
(Actually, if the index covers the query, then the breakeven is two.)
The number of data pages needed for a table depends on the size of the records and the number of rows. So, it is really not possible to specify a threshold on the number of rows.
The above very rudimentary explanation leaves out a few things:
The cost of scanning the data pages to do comparisons for each row.
The cost of loading and using index pages.
Other uses of indexing.
But it gives you an idea, and you can see benefits on tables much smaller than 10k rows. That said, you can easily run tests on your own data to see how queries perform on the tables in question.
Also, I strongly, strongly recommend having primary keys on all tables and using those keys for foreign key relationships. The primary key itself is an index.
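A minimal sketch of that advice (hypothetical tables):
CREATE TABLE parent (
  id INT NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (id)                                -- the primary key is itself an index
);
CREATE TABLE child (
  id INT NOT NULL AUTO_INCREMENT,
  parent_id INT NOT NULL,
  PRIMARY KEY (id),
  FOREIGN KEY (parent_id) REFERENCES parent (id)  -- InnoDB adds an index on parent_id if one does not already exist
);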
Indexes serve a lot of purposes. InnoDB tables are always organized as an index, on the cluster key. Indexes can be used to enforce unique constraints, as well as support foreign key constraints. The topic of "indexes" spans way more than query performance.
In terms of query performance, it really depends on what the query is doing. If we are selecting a small subset of rows, out of large set, then effective use of an index can speed that up by eliminating vast swaths of rows from being checked. That's where the biggest bang comes from.
If we are pulling all of the rows, or nearly all the rows, from a set, then an index typically doesn't help narrow down which rows to check; even when an index is available, the optimizer may choose to do a full scan of all of the rows.
But even when pulling large subsets, appropriate indexes can improve performance for join operations, and can significantly improve performance of queries with GROUP BY or ORDER BY clauses, by making use of an index to retrieve rows in order, rather than requiring a "Using filesort" operation.
If we are looking for a simple rule of thumb... for a large set, if we need to pull (or look at) less than 10% of the total rows, then an access plan using a suitable index will typically outperform a full scan. If we are looking for a specific row based on a unique identifier, an index is going to be faster than a full scan. If we are pulling all columns for every row in the table in no particular order, then a full scan is going to be faster.
Again, it really comes down to what operations are being performed. What queries are being executed, and the performance profile that we need from those queries. That is going to be the key to determining the indexing strategy.
In terms of gaining understanding, use EXPLAIN to see the execution plan, and learn the operations available to the MySQL optimizer.
(The topic of indexing strategy in terms of database performance is much too large for a StackOverflow question.)
Each situation is different. If you profile your code, you'll understand each anti-pattern better. To show how unexpected things can get, consider Oracle:
If this were Oracle, I would say zero, because if an empty table's high-water mark is very high, a query that triggers a full table scan and returns zero rows can be much more expensive than the same query induced to use even a full index scan.
The same process that I went through to understand Oracle you can do with MySQL: profile your code.

Why does SQL choose join index inconsistently?

I have a join between two tables on three columns. The join was taking hours to complete, so I added a composite index on all three columns on each table. Then, sometimes the join would be really fast and sometimes it would still be slow.
Using EXPLAIN, I noticed that it was fast when it chose to join using the composite index and slow when it just chose an index on only one of the columns. But each of these runs was using the same data.
Is there randomness involved in SQL selecting which index to use? Why would it be inconsistent?
If it helps: it is a MySQL database being queried from pandas in python.
Q: Is there randomness involved in SQL selecting which index to use?
There is no randomness involved, per se. The optimizer makes use of table and index statistics (the number of rows and cardinality), along with the predicates in the query, to develop estimates, e.g. the number of rows that will need to be retrieved.
MySQL also evaluates the cost for join operations, sort operations, etc. for each possible access plan (e.g. which index to use, which order to access the tables in) to come up with an estimated cost for each plan.
And then the optimizer compares the costs, and uses the plan that has the lowest cost. There are some parameters (MySQL system variables) that influence the cost estimates. (For example, tuning the expected cost for I/O operations.)
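In MySQL 5.7 and later, those cost constants can be inspected in the cost model tables (changing them is an advanced, last-resort knob):
SELECT cost_name, cost_value FROM mysql.server_cost;
SELECT engine_name, cost_name, cost_value FROM mysql.engine_cost;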
Q: Why would it be inconsistent?
For an InnoDB table, there is some randomness that comes into play when gathering statistics. InnoDB uses a sampling technique, doing a "deep dive" into a small set of "random" pages. The results from those sample pages are extrapolated into estimates for the whole table.
Some of the InnoDB tuning parameters (MySQL system variables) influence (increase/decrease) the number of pages that are sampled when gathering statistics. Sampling a smaller number of pages can be faster, but a smaller sample makes it more likely that the sample set is not entirely representative of the entire table. Using a larger number of sample pages alleviates that to a degree, but the sampling takes longer. It's a tradeoff.
Note that InnoDB automatically re-collects statistics when 10% of the rows in the table are changed by DML operations. (There are some cases where the automatic collection of statistics may not be triggered; for example, creating a new (empty) table and populating it with a LOAD DATA statement could result in no statistics being collected.)
So, the most likely explanation for the observed behavior is that at different times, there are different statistics available to the optimizer.
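If stale statistics are suspected, one low-risk step is to refresh them manually and re-check the plan (the real table names aren't given, so these are placeholders):
ANALYZE TABLE t1, t2;   -- re-samples InnoDB index statistics
-- then re-run EXPLAIN on the join to see whether the composite index is now chosen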
Note that it is possible to influence the optimizer to opt for a plan that makes use of particular indexes, by including hints in the SQL text. We typically don't need to do that, nor do we want to do that. But in some cases, where the optimizer is choosing an inefficient plan, we can help get a better plan.
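For example, using the long-standing index hint syntax (table, column, and index names here are made up):
SELECT t1.a, t1.b, t1.c
FROM t1 FORCE INDEX (idx_a_b_c)
JOIN t2 FORCE INDEX (idx_a_b_c)
  ON t2.a = t1.a AND t2.b = t1.b AND t2.c = t1.c;
USE INDEX and IGNORE INDEX are milder variants of the same hint syntax.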
A few references (from the MySQL 5.7 Reference Manual)
https://dev.mysql.com/doc/refman/5.7/en/optimizer-hints.html
https://dev.mysql.com/doc/refman/5.7/en/innodb-performance-optimizer-statistics.html

MySQL query time too long in sensor timestamped data table

I have a very simple table to log reading from sensors. There's a column for sensor id number, one for sensor reading and one for the timestamp. This column is of SQL type Timestamp. There's a big amount of data in the table, a few million rows.
When I query for all rows before a certain timestamp with a certain sensor id number, sometimes it can take a very long time. If the timestamp is far in the past, the query is pretty fast but, if it's a recent timestamp, it can take up to 2 or 3 seconds.
It appears as if the SQL engine is iterating over the table until it finds the first timestamp that's larger than the queried timestamp. Or maybe the larger amount of queried data slows it down, I don't know.
In any case, I'm looking for design suggestions here, specifically to address two points: why is it so slow, and how can I make it faster?
Is there any design technique that could be applied here? I don't know much about SQL, maybe there's a way to let the SQL engine know the data is ordered (right now it's not but I could order it upon insertion I guess) and speed up the query. Maybe I should change the way the query is done or change the data type of the timestamp column.
Use EXPLAIN to see the execution plan, and verify that the query is using a suitable index. If not, verify that appropriate indexes are available.
An INDEX is stored "in order", and MySQL can make effective use of that with some query patterns. (An InnoDB table is also stored in order, by the cluster key, which is the PRIMARY KEY of the table (if it exists) or else the first UNIQUE KEY on non-NULL columns.)
With some query patterns, by using an index, MySQL can eliminate vast swaths of rows from being examined. When MySQL can't make use of an index (either because a suitable index doesn't exist, or because the query has constructs that prevent it), the execution plan is going to do a full scan, that is, examine every row in the table. And when that happens with very large tables, there's a tendency for things to get slow.
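For the sensor table in the question, a hedged sketch (the real table and column names are assumptions): a composite index with the equality column first lets MySQL jump straight to the requested sensor and then range-scan its timestamps.
ALTER TABLE sensor_readings ADD INDEX idx_sensor_ts (sensor_id, ts);
SELECT ts, reading
FROM sensor_readings
WHERE sensor_id = 42
  AND ts < '2020-01-01 00:00:00';   -- both the equality and the range can be satisfied by idx_sensor_ts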
EDIT
Q: Why is it so slow?
A: There are several factors that affect the elapsed time. It could be contention, for example an exclusive table lock taken by another session; it could be time for I/O (disk reads) or a large "Using filesort" operation; or it could be the time to return the resultset over a slow network connection.
It's not possible to diagnose the issue with the limited information provided. We can only offer suggestions about some common issues.
Q: How can I make it faster?
A: It's not possible to make a specific recommendation. We need to figure out where and what the bottleneck is, and then address that.
Take a look at the output from EXPLAIN to examine the execution plan. Is an appropriate index being used, or is it doing a full scan? How many rows are being examined? Is there a "Using filesort" operation? And so on.
Q: Is there any design technique that could be applied here?
A: In general: have an appropriate index available, and carefully craft the SQL statement so that the most efficient access plan is enabled.
Q: Maybe I should change the way the query is done
A: Changing the SQL statement may improve performance; that's a good place to start after looking at the execution plan... can the query be modified to get a more efficient plan?
Q: or change the data type of the timestamp column.
A: I think it's very unlikely that changing the datatype of the TIMESTAMP column will improve performance. TIMESTAMP is only 4 bytes. What would you change it to? DATETIME takes more storage (5 bytes in MySQL 5.6.4 and later, 8 bytes before that).
In general, we want the rows to be as short as possible, and to pack as many rows as possible into a block. It's also desirable to have the table physically organized in a way that queries can be satisfied from fewer blocks... the rows the query needs are found in fewer pages, rather than being scattered onesy-twosy over a large number of pages.
With InnoDB, increasing the size of the buffer pool may reduce I/O.
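A quick way to check the current setting (the value is in bytes); resizing is a server-level decision, so the SET below is only a sketch:
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
-- dynamic in MySQL 5.7 and later, e.g.:
-- SET GLOBAL innodb_buffer_pool_size = 4294967296;   -- rounded up to a multiple of the chunk size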
And I/O from solid state drives (SSD) will be faster than I/O from spinning hard disks (HDD); this is especially true if there is I/O contention on the HDD from other processes.

Does this field need an index?

I currently have a summary table to keep track of my users' post counts, and I run SELECTs on that table to sort them by counts, like WHERE count > 10, for example. Now I know having an index on columns used in WHERE clauses speeds things up, but since these fields will also be updated quite often, would indexing provide better or worse performance?
If you have a query like
SELECT count(*) as rowcount
FROM table1
GROUP BY name
Then you cannot put an index on count; you need to put an index on the GROUP BY field (name) instead.
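In other words, sticking with the example's table1 and name, something like:
ALTER TABLE table1 ADD INDEX idx_name (name);
-- can let the GROUP BY walk the index in order instead of building a temporary table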
If you have a field named count, then putting an index on it may speed up this query, but it may also make no difference at all:
SELECT id, `count`
FROM table1
WHERE `count` > 10
Whether an index on count will speed up the query really depends on what percentage of the rows satisfy the WHERE clause. If it's more than roughly 30%, MySQL (or just about any SQL database) will refuse to use an index.
It will just stubbornly insist on doing a full table scan. (i.e. read all rows)
This is because using an index requires reading 2 files (1 index file and then the real table file with the actual data).
If you select a large percentage of rows, reading the extra index file is not worth it and just reading all the rows in order will be faster.
If only a few rows pass the test, using an index will speed up this query a lot.
Know your data
Using EXPLAIN SELECT will tell you which indexes MySQL has available, which one it picked, and (kind of/sort of, in a complicated way) why.
See: http://dev.mysql.com/doc/refman/5.0/en/explain.html
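If you decide to try the index, it is cheap to add and easy to drop again if EXPLAIN shows it being ignored (the index name is arbitrary):
ALTER TABLE table1 ADD INDEX idx_count (`count`);
EXPLAIN SELECT id, `count` FROM table1 WHERE `count` > 10;
-- if the key column in the EXPLAIN output shows idx_count, the index is being used; NULL means a full scan was chosen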
Indexes in general provide better read performance at the cost of slightly worse insert, update and delete performance. Usually the tradeoff is worth it depending on the width of the index and the number of indexes that already exist on the table. In your case, I would bet that the overall performance (reading and writing) will still be substantially better with the index than without but you would need to run tests to know for sure.
It will improve read performance and worsen write performance. If the tables are MyISAM and you have a lot of people posting in a short amount of time, you could run into issues where MySQL is waiting for locks, eventually causing a crash.
There's no way of really knowing that without trying it. A lot depends on the ratio of reads to writes, storage engine, disk throughput, various MySQL tuning parameters, etc. You'd have to setup a simulation that resembles production and run before and after.
I think it's unlikely that the write performance will be a serious issue after adding the index.
But note that the index won't be used anyway if it is not selective enough: if, for example, more than 10% of your users have count > 10, the fastest query plan might be to skip the index and just scan the entire table.