I have an InnoDB table, and the query on it looks like this:
SELECT *
FROM x
WHERE now() BETWEEN a AND b
I have created a composite index on (a, b). The query returns around 4k rows, while the total number of rows in the table is around 700k.
However, when I EXPLAIN the execution plan, I find that the query does not use the expected index, because the estimated row count is around 360k, far larger than the actual value.
I know that, as many posts (such as "Why the rows returns by "explain" is not equal to count()?") have explained, EXPLAIN only gives a rough estimate. But the FORCE INDEX solution is rather hacky and may bring potential performance risks in the future.
Is there any way I can make MySQL produce a more accurate estimate (the current one is 90 times too large)? Thanks.
InnoDB only keeps approximate row counts for tables. This is explained in the documentation of SHOW TABLE STATUS:
Rows
The number of rows. Some storage engines, such as MyISAM, store the exact count. For other storage engines, such as InnoDB, this value is an approximation, and may vary from the actual value by as much as 40 to 50%.
I don't think there's any way to make InnoDB keep accurate row counts; it's just not how it works.
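One thing you can try, though this is only a hedged suggestion and not a guarantee of a better plan, is to refresh and deepen the statistics InnoDB samples. A sketch, using the table name x from the query above (the sample-page value is just an illustrative figure):

-- Re-sample the persistent statistics for the table.
ANALYZE TABLE x;

-- Optionally make the sampling less coarse before re-analyzing.
-- innodb_stats_persistent_sample_pages defaults to 20; larger values make ANALYZE slower.
SET GLOBAL innodb_stats_persistent_sample_pages = 200;
ANALYZE TABLE x;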
This particular construct is hard to optimize:
WHERE constant BETWEEN col1 AND col2
No MySQL index can be devised to make it run fast. The attempts include:
INDEX(col1) -- will scan last half of table
INDEX(col2) -- will scan first half of table
INDEX(col1, col2) -- will scan last half of table
(Whether it does more of the work in the index BTree depends on ICP, covering, etc. But, in any case, lots of rows must be touched.)
One reason it cannot be improved upon is that the 'last' row in the 'half' might actually match.
If the (col1, col2) pairs do not overlap, then there is a possibility of improving the performance by being able to stop after one row. But MySQL does not know if you have this case, so it cannot optimize. Here is an approach for non-overlapping IP-address lookups that is efficient.
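If your ranges really are non-overlapping, a hedged sketch of the "stop after one row" idea looks like this; it assumes non-overlap, and with overlapping ranges it silently returns the wrong answer:

-- Sketch: find the one range that starts at or before NOW(), then confirm it also ends after NOW().
-- With INDEX(a) this touches a single index entry instead of half the table.
SELECT *
FROM ( SELECT * FROM x WHERE a <= NOW() ORDER BY a DESC LIMIT 1 ) AS candidate
WHERE candidate.b >= NOW();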
Related
I am observing weird behaviour which I am trying to understand.
MySQL version: 5.7.33
I have the below query:
select * from a_table where time>='2022-05-10' and guid in (102,512,11,35,623,6,21,673);
a_table has a primary key on (time, guid) and an index on guid.
The query above performs very well, and according to the EXPLAIN plan it is "Using index condition; Using where; Using MRR".
As I increase the number of values in my IN clause, performance degrades significantly.
After some dry runs, I was able to get a rough number: for fewer than ~14,500 values the EXPLAIN plan is the same as above; for more values than that, the plan only shows "Using where" and the query takes forever to run.
In other words, if I put 14,000 values in my IN clause, the EXPLAIN plan shows 14,000 rows, as expected. However, if I put 15,000 values in my IN clause, the EXPLAIN shows 221,200,324 rows. I don't even have that many rows in my whole table.
I am trying to understand this behaviour and to know if there is any way to fix this.
Thank you
Read about Limiting Memory Use for Range Optimization.
When you have a large list of values in an IN() predicate, it uses more memory during the query optimization step. This was considered a problem in some cases, so recent versions of MySQL set a max memory limit (it's 8MB by default).
If the optimizer finds that it would need more memory than the limit, and there is no other condition in your query it can use instead, it gives up trying to optimize the range and resorts to a table scan. I infer that your table statistics actually show the table having ~221 million rows (though table statistics are inexact estimates).
I can't say I know the exact formula for how much memory is needed for a given list of values, but given your observed behavior we could guess it's about 600 bytes per item on average, since ~14k items works and more than that does not.
You can set range_optimizer_max_mem_size = 0 to disable the memory limit. This creates a risk of excessive memory use, but it avoids the optimizer "giving up." We set this value on all MySQL instances at my last job, because we couldn't educate the developers to avoid creating huge lists of values in their queries.
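For reference, the limit can be raised or removed per session or globally; a quick sketch (the 64MB figure is only an example):

-- Remove the limit for the current session only (0 means "no limit").
SET SESSION range_optimizer_max_mem_size = 0;

-- Or raise the global default (8388608 bytes = 8MB) rather than removing it entirely.
SET GLOBAL range_optimizer_max_mem_size = 67108864;  -- 64MB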
I have a table with columns like word, A_, E_, U_, etc. The X_ columns are TINYINTs holding how many times the specific letter occurs in the word (to later help optimize a wildcard search query).
There are 252k rows in total. If I search with WHERE U_ > 0 I get 60k rows. But if I EXPLAIN that SELECT, it says there are 225k rows to go through and no possible key. Why? The column has an index on it. Why doesn't it say there are 60k rows to go through and that the possible key is U_?
Listing the indexes on the table (it's also strange that the others are grouped under the A_ index):
In comparison, if I run WHERE id > 250000 I get 2,983 results, and the EXPLAIN of that SELECT says there are 2,982 rows and that the key to be used is PRIMARY.
By the way, if I GROUP BY U_ I get this (but it probably doesn't matter much because, as said, the query returns 60k results):
EDIT:
If I create a column U (VARCHAR(1)) and run an UPDATE setting U = 'U' WHERE U_ > 0, then a SELECT with WHERE U = 'U' also returns 60k rows (obviously), but the EXPLAIN shows this:
Still not great (120k rows instead of 60k), but at least better than the 225k rows in the previous case. This solution is a bit more wasteful than the first one, though maybe a bit more efficient.
My experience is that MySQL chooses to do a tablescan, even if there is an index on the column you're searching, if your query would select more than approximately 25% of the rows in the table.
The reason for this is that using a secondary index in InnoDB is a bit more work than using a primary index.
Look up value in secondary index, like your index on u_.
Read index entry, and find corresponding primary key value(s) of rows where that value in u_ is stored.
Look up row(s) by primary key.
It's actually at least double the work to look up by secondary key. This isn't a problem if you ultimately match a small minority of rows of the table, and there are definitely cases where a secondary index is really important for your query. So don't be reluctant to use secondary indexes.
But if your query matches too many rows, and that becomes a big portion of the table, then it would be less work to just scan the table start-to-finish.
By analogy, why doesn't the index at the back of a book contain the word "the"? Because the entry would naturally list every single page in the book, and it would be a waste for you to refer to the index and then use it to guide you to each page in the main part of the book. You would have been better off just reading the book.
MySQL does not have any officially documented threshold for choosing a tablescan over an indexed search. The 25% figure is only my experience (actually sometimes it seems closer to 21%, but I don't know the code well enough to understand exactly how the threshold is calculated).
I've seen cases where the proportion of rows matched was very close to whatever threshold is in the implementation, and the behavior of the optimizer can actually flip-flop from one query to the next, resulting in highly variable performance.
If this case applies to you, you can use an index hint to make MySQL's optimizer pretend that a tablescan is prohibitively expensive, and it should prefer an index to a tablescan. This is done with the FORCE INDEX hint.
SELECT * FROM words FORCE INDEX(U_) WHERE U_ > 0
I still try to use index hints conservatively. They aren't necessary except in rare cases, and using an index hint means your query must include the index name. This makes it hard to change indexes without breaking your application code.
You're asking about the backend query optimizer. In particular you're asking: "how does it choose an access path? Why index here but tablescan there?"
Let's think about that optimizer. What is it optimizing? Elapsed time, in expectation. It has a model for how long sequential reads and random reads take, and for query selectivity, that is, expected number of rows returned by a query. From several alternative access paths it chooses the one that appears to require the least elapsed time.
Your id > 250000 query had a few things going for it:
good selectivity, so less than 1% of rows will appear in the result set
id is the Primary Key, so all columns are immediately available upon navigating to the right place in the btree
This caused the optimizer to compute an expected elapsed time for the indexed access path much smaller than expected time for tablescan.
On the other hand, your u_ > 0 query has very poor selectivity, dragging nearly a quarter of the rows into the result set. Additionally, the index is not a covering index for your SELECT *, which copies all column values into the result set. So the optimizer predicts it will have to read a quarter of the index blocks and then essentially all of the data-row blocks they point to. Compared to a tablescan, that means reading more blocks from disk, and as random reads instead of sequential reads; remember, too, that often multiple rows fit within a single disk block or a single read request. Both of those argue against using the index, so the tablescan was selected because it was cheapest. We would call it a pessimizer if it always chose the indexed access path, even in cases where indexed disk I/O would take longer.
summary advice
Use an index on a single column when your queries have good selectivity, returning much less than 1% of a relation's rows. Use a covering index when your queries have poor selectivity and you're willing to make a space vs. time tradeoff.
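For the u_ > 0 question above, a hedged sketch of that covering-index tradeoff might look like this; it assumes the query really only needs the word column (add more columns to the index if you select more):

-- Sketch: a covering index so the query never has to visit the table rows.
ALTER TABLE words ADD INDEX u_word (U_, word);

-- With every referenced column in the index, EXPLAIN should show "Using index".
SELECT word FROM words WHERE U_ > 0;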
Say I have a MySQL table with an index on the column 'name'.
I do this query:
select * from name_table where name = 'John';
Say there are 5 results that are returned from a table with 100 rows.
Say I now insert 1 million new rows, none of which have the name John, so there are still only 5 Johns in the table. Will the SELECT statement be as fast as before? That is, will inserting all these rows have an impact on the read speed of an indexed table?
Indexes have their own "tables", and when the MySQL engine determines that the lookup references an indexed column, the lookup happens on this table. It isn't really a table per-se, but the gist checks out.
That said, it will be nanoseconds slower, but not something you should concern yourself with.
More importantly, concern yourself with indexing pertinent data and with column order, as these have MUCH more of an impact on database performance.
To learn more about what is happening behind the scenes, run an EXPLAIN on the query:
EXPLAIN select * from name_table where name = 'John';
Note: In addition to the column orders listed in the link, it is a good (nay, great) idea to have variable-length columns (VARCHAR) after their fixed-length counterparts (CHAR), as, during the lookup, the engine has to either look at the row, read the column lengths, and then skip forward for the lookup (mind you, this is only for non-indexed columns), or read the table declaration and know it always has to look at the column at offset X. It is more complicated behind the scenes, but if you can shift all fixed-length columns to the front, you will thank yourself. Basically:
Indexed columns.
Everything Fixed-Length in order according to the link.
Everything Variable-Length in order according to the link.
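As a hedged illustration of that ordering, using the name_table example from the question above (the id and age columns are hypothetical, added only to show a primary key and a fixed-length column):

-- Sketch: indexed column first, fixed-length columns next, variable-length columns last.
CREATE TABLE name_table (
  id   INT UNSIGNED NOT NULL AUTO_INCREMENT,  -- indexed (primary key)
  age  TINYINT UNSIGNED NOT NULL,             -- fixed-length (hypothetical column)
  name VARCHAR(64) NOT NULL,                  -- variable-length
  PRIMARY KEY (id),
  KEY idx_name (name)
);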
Yes, it will be just as fast.
(In addition to the excellent points made in Mike's answer...) there's an important point we should make regarding indexes (B-tree indexes in particular):
The entries in the index are stored "in order".
The index is also organized in a way that allows the database to very quickly identify the blocks in the index that contain the entries it's looking for (or the block that would contain entries, if no matching entries are there.)
What this means is that the database doesn't need to look at every entry in the index. Given a predicate like the one in your question:
WHERE name = 'John'
with an index with a leading column of name, the database can eliminate vast swaths of blocks that don't need to be checked.
Blocks near the beginning of the index contain entries 'Adrian' thru 'Anna'; a little later in the index, a block contains entries for Caleb thru Carl; further along in the index, James thru Jane; etc.
Because of the way the index is organized, the database effectively "knows" that the entries we're looking for cannot be in any of those blocks (because the index is in order, there's no way the value John could appear in those blocks we mentioned). So none of those blocks needs to be checked. (The database figures out, in just a very small number of operations, that 98% of the blocks in the index can be eliminated from consideration.)
High cardinality = good performance
The take away from this is that indexes are most effective on columns that have high cardinality. That is, there are a large number of distinct values in the column, and those values are unique or nearly unique.
This should clear up the answer to the question you were asking. You can add bazillions of rows to the table. If only five of those rows have the value John in the name column, when you do a query
WHERE name = 'John'
it will be just as fast. The database will be able to locate the entries you're looking for nearly as fast as it could when there were a thousand rows in the table.
(As the index grows larger, it does add "levels" to the index, to traverse down to the leaf nodes... so it gets ever so slightly slower because of a few more operations. Where performance really starts to bog down is when the InnoDB buffer cache is too small, and we have to wait for the (glacially slow in comparison) disk I/O operations to fetch blocks into memory.)
Low cardinality = poor performance
Indexes on columns with low cardinality are much less effective. For example, a column that has two possible values, with an even distribution of values across the rows in the table (about half of the rows have one value, and the other half have the other value.) In this case, the database can't eliminate 98% of the blocks, or 90% of the blocks. The database has to slog through half the blocks in the index, and then (usually) perform a lookup to the pages in the underlying table to get the other values for the row.
But with gazillions of rows with a column gender, with two values 'M' and 'F', an index with gender as a leading column will not be effective in satisfying a query
WHERE gender = 'M'
... because we're effectively telling the database to retrieve half the rows in the table, and those rows are likely to be evenly distributed across the table. So nearly every page in the table is going to contain at least one row we need, and the database is going to opt to do a full table scan (looking at every row in every block of the table) to locate the rows, rather than using an index.
So, in terms of performance for looking up rows in the table using an index... the size of the table isn't really an issue. The real issue is the cardinality of the values in the index, and how many distinct values we're looking for, and how many rows need to be returned.
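If you want to see the cardinality figures the optimizer is working from, a quick sketch (SHOW INDEX reports an estimated Cardinality for each index):

-- Sketch: inspect the estimated cardinality of each index on the table.
SHOW INDEX FROM name_table;
-- A Cardinality close to the row count means nearly-unique values (good for point lookups);
-- a Cardinality of 2 on a gender-style column means the index will rarely be chosen.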
I have a field table_name in a table which can have only 20 different values. The total number of records in the table is a few tens of thousands of rows. If I do a query like this:
SELECT * FROM table WHERE table_name = 'adasd';
at most the returned records are 25% of the total rows, and mostly I get only 10% of the total records. Is there any point in indexing the field table_name here? I hear that for indexes to work well, the values in that field need to be unique or close to it. In my case, it's not at all close to unique. But I also heard that if the number of returned rows is small compared to the total number of rows, it makes a good case for indexing.
How should I go about this?
No, they don't have to be unique to get a benefit from an index. However, take some time to think about what the DBMS does when processing a query:
Full table scan - a sequential read through the data (i.e. very few seek operations)
Index lookup - a few seeks on the index to find the start of the selected data, then a sequential read (few seeks) to identify rows in the underlying table, then LOTS AND LOTS of seeks to fetch the rows from the table
Seeks are expensive.
(there is a secondary effect of full table scans in that they are more prone to flushing hot data out of the cache - but you should address the primary concern first).
In this case, it's unlikely that the DBMS would use the index if it were present, and even if it did, it would probably be slower than a full table scan. As a (very) rough rule of thumb, you're only going to get a benefit from an index if a predicate identifies less than around 5% of the rows (but it will vary depending on the relative size of the index and the data).
i.e. don't bother adding an index on this field alone.
I think you may benefit from spending some time thinking about why you need to run queries which return so many rows?
Revised Answer
I just learned that creating an index does not mean that MySQL will use it. Keeping that in mind, I will re-phrase my answer:
You should create an index on that column if general (or your own) practice suggests you do so. MySQL will use heuristics, which include looking at the available indexes and their respective cardinalities, to determine the best index to use, or whether to use an index at all.
Interesting reading about this topic here.
I need to add indexes to my table (columns) and stumbled across this post:
How many database indexes is too many?
Quote:
“Having said that, you can clearly add a lot of pointless indexes to a table that won't do anything. Adding B-Tree indexes to a column with 2 distinct values will be pointless since it doesn't add anything in terms of looking the data up. The more unique the values in a column, the more it will benefit from an index.”
Is an index really pointless if there are only two distinct values? Given a table as follows (MySQL database, InnoDB):
Id (BIGINT)
fullname (VARCHAR)
address (VARCHAR)
status (VARCHAR)
Further conditions:
The Database contains 300 Million records
Status can only be “enabled” and “disabled”
150 million records have status = 'enabled' and 150 million records have status = 'disabled'
My understanding is that, without an index on status, a SELECT with WHERE status = 'enabled' would result in a full table scan with 300 million records to process?
How efficient is the lookup when I use a BTREE index on status?
Should I index this column or not?
What alternatives (maybe other kinds of index) does MySQL/InnoDB provide to efficiently look up records by the WHERE status = 'enabled' clause in the given example, with such low cardinality/selectivity of the values?
The index that you describe is pretty much pointless. An index is best used when you need to select a small number of rows in comparison to the total rows.
The reason for this is related to how a database accesses a table. Tables can be accessed either by a full table scan, where each block is read and processed in turn, or by a rowid or key lookup, where the database has a key/rowid and reads the exact row it requires.
In the case where you use a where clause based on the primary key or another unique index, eg. where id = 1, the database can use the index to get an exact reference to where the row's data is stored. This is clearly more efficient than doing a full table scan and processing every block.
Now back to your example: you have a WHERE clause of where status = 'enabled', so the index will return 150M rows and the database will have to read each row in turn using separate small reads, whereas accessing the table with a full table scan lets the database make use of more efficient, larger reads.
There is a point at which it is better to just do a full table scan rather than use the index. With mysql you can use FORCE INDEX (idx_name) as part of your query to allow comparisons between each table access method.
Reference:
http://dev.mysql.com/doc/refman/5.5/en/how-to-avoid-table-scan.html
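As a sketch of the FORCE INDEX comparison mentioned above (both the table name big_table and the index name idx_status are placeholders; substitute your own):

-- Sketch: compare the optimizer's default plan with the forced-index plan.
EXPLAIN SELECT * FROM big_table WHERE status = 'enabled';
EXPLAIN SELECT * FROM big_table FORCE INDEX (idx_status) WHERE status = 'enabled';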
I'm sorry to say that I do not agree with Mike. Adding an index is meant to limit the number of full record scans MySQL has to do, thereby limiting I/O, which is usually the bottleneck.
This indexing is not free; you pay for it on inserts/updates when the index has to be updated, and in the search itself, as it now needs to load the index file (a full index over 300M records is probably not in memory). So it might well be that you get extra I/O instead of limiting it.
I do agree with the statement that a binary variable is best stored as one: a BOOL or TINYINT. That decreases the length of a row and can thereby limit disk I/O; comparisons on numbers are also faster.
If you need speed and you seldom use the disabled records, you may wish to have 2 tables, one for enabled and one for disabled records and move the records when the status changes. As it increases complexity and risk this would be my very last choice of course. Definitely do the move in 1 transaction if you happen to go for it.
It just popped into my head that you can check whether an index is actually used by using the EXPLAIN statement. That should show you how MySQL is optimizing the query. I don't really know how MySQL optimizes queries, but from PostgreSQL I do know that you should EXPLAIN a query on a database approximately the same (in size and data) as the real database. So if you have a copy of the database, create an index on the table and see whether it's actually used. As I said, I doubt it, but I most definitely don't know everything :)
If the data is distributed roughly 50:50, then a query like WHERE status = 'enabled' will only avoid scanning half of the table.
Whether an index on such a table helps depends entirely on the distribution of the data. For example, if 90% of the entries have status 'enabled' and the other 10% have status 'disabled', then for the query WHERE status = 'disabled' it scans only 10% of the table.
So having an index on such columns depends on the distribution of the data.
a'r's answer is correct; however, it needs to be pointed out that the usefulness of an index is determined not only by its cardinality but also by the distribution of the data and the queries run on the database.
In the OP's case, with 150M records having status='enabled' and 150M having status='disabled', the index is unnecessary and a waste of resources.
In case of 299M records having status='enabled' and 1M having status='disabled', the index is useful (and will be used) in queries of type SELECT ... where status='disabled'.
Queries of type SELECT ... where status='enabled' will still run with a full table scan.
You will hardly ever need all 150 million records at once, so I guess status will always be used in conjunction with other columns. Perhaps it'd make more sense to use a compound index like (status, fullname).
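A hedged sketch of that idea (the table name big_table is a placeholder; the query shape is only an example of filtering on status plus another column):

-- Sketch: compound index so queries that filter on status AND another column can use it.
ALTER TABLE big_table ADD INDEX idx_status_fullname (status, fullname);

-- Example query shape such an index could serve (and it also covers fullname):
SELECT fullname FROM big_table WHERE status = 'enabled' AND fullname LIKE 'Smith%';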
Jan, you should definitely index that column. I'm not sure of the context of the quote, but everything you said above is correct. Without an index on that column, you are most certainly doing a table scan on 300M rows, which is about the worst you can do for that data.
Jan, as asked, where your query involves simply WHERE status = 'enabled' without some other limiting factor, an index on that column apparently won't help (glad the SO community showed me what's up). If, however, there is a limiting factor such as LIMIT 10, an index may help. Also, remember that indexes are used in GROUP BY and ORDER BY optimizations. If you are doing SELECT COUNT(*), status FROM table GROUP BY status, an index would be helpful.
You should also consider converting status to a tinyint where 0 would represent disabled and 1 would be enabled. You're wasting tons of space storing that string vs. a tinyint which only requires 1 byte per row!
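A hedged sketch of that conversion (big_table is a placeholder name; on 300M rows this is a long operation, so plan a maintenance window):

-- Sketch: replace the VARCHAR status with a 1-byte flag.
ALTER TABLE big_table ADD COLUMN status_flag TINYINT NOT NULL DEFAULT 0;
UPDATE big_table SET status_flag = IF(status = 'enabled', 1, 0);
-- Once the application reads status_flag, the old VARCHAR column can be dropped:
ALTER TABLE big_table DROP COLUMN status;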
I have a similar column in my MySQL database. Approximately 4 million rows, with the distribution of 90% 1 and 10% 0.
I've just discovered today that my queries (where column = 1) actually run significantly faster WITHOUT the index.
Foolishly I deleted the index. I say foolishly, because I now suspect the queries (where column = 0) may have still benefited from it. So, instead I should explicitly tell MySQL to ignore the index when I'm searching for 1, and to use it when I'm searching for 0. Maybe.
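A hedged sketch of that per-value hinting (the table name my_table, column flag, and index name idx_flag are all placeholders):

-- Sketch: hint the optimizer differently depending on which value is being searched.
SELECT * FROM my_table IGNORE INDEX (idx_flag) WHERE flag = 1;  -- ~90% of rows: let it scan the table
SELECT * FROM my_table FORCE INDEX (idx_flag) WHERE flag = 0;   -- ~10% of rows: use the index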