Indexed column vs. non-indexed column research - MySQL

I generated separate MySQL InnoDB tables with 2000, 5000, 10000, 20000, 50000, 100 000, and 200 000 elements (with the help of a PHP loop and INSERT queries).
Each table has two columns: id (INT PRIMARY KEY AUTO_INCREMENT) and number (INT UNIQUE KEY). Then I did the same thing, but this time I generated similar tables where the number column doesn't have an index. I generated the tables in such a way that the value of column number equals the row index + 2: the first element is 3, the 1000th element is 1002, and so on. I wanted to test a query like this, because it will be used in my application:
SELECT count(number) FROM number_two_hundred_I WHERE number=200002;
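For reference, a minimal sketch of the schema described (the DDL is reconstructed from the description above, not taken from the original script):

CREATE TABLE number_two_thousand_I (
  id INT NOT NULL AUTO_INCREMENT,
  number INT,
  PRIMARY KEY (id),
  UNIQUE KEY (number)
) ENGINE=InnoDB;

-- the non-indexed variant simply omits the UNIQUE KEY:
CREATE TABLE number_two_thousand (
  id INT NOT NULL AUTO_INCREMENT,
  number INT,
  PRIMARY KEY (id)
) ENGINE=InnoDB;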
After generating the data for these tables, I wanted to time the worst-case queries. I used SHOW PROFILES for that. I assumed the worst-case query would be the one for the element with the largest value of column number (2002, 5002, and so on), so here are all the queries that I tested and their times (as reported by SHOW PROFILES):
SELECT count(number) FROM number_two_thousand_I WHERE number=2002;
// for tables with an indexed number column I used the suffix _I at the end
// of the table name. Time for this query: 0.00099250
SELECT count(number) FROM number_two_thousand WHERE number=2002;
// the number column is not indexed when there is no _I suffix.
// Time for this one is 0.00226275
SELECT count(number) FROM number_five_thousand_I WHERE number=5002;
// 0.00095600
SELECT count(number) FROM number_five_thousand WHERE number=5002;
// 0.00404125
So here are the results:
2000 el   - indexed 0.00099250, not indexed 0.00226275
5000 el   - indexed 0.00095600, not indexed 0.00404125
10000 el  - indexed 0.00156900, not indexed 0.00761750
20000 el  - indexed 0.00155850, not indexed 0.01452820
50000 el  - indexed 0.00051100, not indexed 0.04127450
100000 el - indexed 0.00121750, not indexed 0.07120075
200000 el - indexed 0.00095025, not indexed 0.11406950
Here is an infographic for that. It shows how the worst-case query time depends on the number of elements for the indexed/non-indexed column; indexed is shown in red.

When I tested the speed, I typed the same query into the MySQL console twice, because I noticed that on the first run the query against the non-indexed column can sometimes be even a bit faster than the one against the indexed column.

The question is: why does this type of query for 200000 elements sometimes take less time than the same query for 100000 elements when the number column is indexed? You can see there are other results that are unpredictable to me. I ask this because when the number column is not indexed, the results are quite predictable: the time for 200000 elements is always bigger than for 100000. Please tell me what I'm doing wrong when trying to research the UNIQUE indexed column.

In the non-indexed case it is always a full-table scan, so the time correlates well with the row count. If the column is indexed, you are measuring the index lookup time, which is effectively constant in your case (small numbers, with small deviation).

It is not the "worst" case.
Make the UNIQUE key random instead of being in lock step with the PK. An example of such a key is UUID().
Generate enough rows so that the table and index(es) cannot fit in the buffer_pool.
If you do both of those, you will eventually see the performance slow down significantly.
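A minimal sketch of that setup (the table and column names here are assumptions, not from the question):

CREATE TABLE bench (
  id INT NOT NULL AUTO_INCREMENT,
  number CHAR(36) NOT NULL,
  PRIMARY KEY (id),
  UNIQUE KEY (number)
) ENGINE=InnoDB;

INSERT INTO bench (number) VALUES (UUID());
-- each run of the next statement doubles the row count; repeat until the
-- table plus its indexes are larger than innodb_buffer_pool_size
INSERT INTO bench (number) SELECT UUID() FROM bench;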
UNIQUE keys have the following impact on INSERTs: the uniqueness constraint is checked before returning to the client. For a non-UNIQUE index, the work to insert into the index's BTree can be (and is) delayed (cf. the "Change Buffer"). With no index on the second column, there is even less work to do.
WHERE number=2002 --
With UNIQUE(number) -- Drill down the BTree. Very fast, very efficient.
With INDEX(number) -- Drill down the BTree. Very fast, very efficient. However it is slightly slower since it can't assume there is only one such row. That is, after finding the right spot in the BTree, it will scan forward (very efficient) until it finds a value other than 2002.
With no index on number -- Scan the entire table. So the cost depends on table size, not the value of number. It has no clue if 2002 exists anywhere in the table, or how many times. If you plot the times you got, you will see that it is rather linear.
I suggest you use log-log 'paper' for your graph. Anyway, note how linear the non-indexed case is. And the indexed case is essentially constant. Finding number=200002 is just as cheap as finding number=2002. This applies for UNIQUE and INDEX. (Actually, there is a very slight rise in the line because a BTree is really O(log n), not O(1). For 2K rows, there are probably 2 levels in the BTree; for 200K, 3 levels.)
The Query cache can trip you up in timings (if it is turned on). When timing, do SELECT SQL_NO_CACHE ... to avoid the QC. If the QC is on and applies, then the second and subsequent runs of the identical query will take very close to 0.000 seconds.
Those timings that varied between 0.5ms and 1.2ms -- chalk it up to the phase of the moon. Seriously, any timing below 10ms should not be trusted. This is because of all the other things that may be happening on the computer at the same time. You can temper it somewhat by averaging multiple runs -- being sure to avoid (1) the Query cache, and (2) I/O.
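A sketch of such a timing run, avoiding the Query cache (the table name is taken from the question; SHOW PROFILES is what the question already uses):

SET profiling = 1;
-- repeat several times; discard the first (cold) run and average the rest
SELECT SQL_NO_CACHE count(number) FROM number_two_hundred_I WHERE number=200002;
SELECT SQL_NO_CACHE count(number) FROM number_two_hundred_I WHERE number=200002;
SELECT SQL_NO_CACHE count(number) FROM number_two_hundred_I WHERE number=200002;
SHOW PROFILES;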
As for I/O... This gets back to my earlier comment about what may happen when the table (and/or index) is bigger than can be cached in RAM.
When smaller than RAM, the first run is likely to fetch stuff from disk. The second and subsequent runs are likely to be faster and consistent.
When bigger than RAM, all runs may need to hit the disk. Hence, all may be slow, and perhaps more flaky than the variations you found.
Your tags are, technically, incorrect. Most of MySQL's indexes are BTrees (actually B+Trees), not Binary Trees. (Sure, there is a lot of similarity, and many of the principles are shared.)
Back to your research goal.
Assume there is 'background noise' that is messing with your figures. Either:
Make your tests non-trivial (e.g. the non-indexed case) so that the real work overwhelms the noise, or
Repeat the timings to mask the issue, being sure to ignore the first run.
The main cost in performing any SELECT is how many rows it touches.
With your UNIQUE index, it is touching 1 row. So expect it to be fast and O(1) (plus noise).
Without an index, it is touching N rows for an N-row table. So expect O(N).

Related

mysql not using index?

I have a table with columns like word, A_, E_, U_, etc. These X_ columns are TINYINTs holding the number of times the specific letter occurs in the word (to later help optimize wildcard search queries).
There are 252k rows in total. If I search with WHERE U_ > 0 I get 60k rows. But if I run EXPLAIN on that SELECT, it says there are 225k rows to go through and no index is possible. Why? The column was added as an index. Why doesn't it say there are 60k rows to go through and that the possible key is U_?
Listing the indexes on the table (it is also strange that the others are grouped under the A_ index):
In comparison, if I run the query WHERE id > 250000 I get 2983 results, and the EXPLAIN of that SELECT says there are 2982 rows and the key to be used is PRIMARY.
By the way, if I GROUP BY U_ I get this (but it probably doesn't matter much because, as I said, the query returns 60k results):
EDIT:
If I create a column U (VARCHAR(1)) and do the update U = 'U' WHERE U_ > 0, then a SELECT with WHERE U = 'U' also returns 60k rows (obviously), but the EXPLAIN gives me this:
Still not so good (120k rows, not 60k), but at least better than the 225k rows in the previous case. This solution is a bit more piggy than the first one, but maybe a bit more efficient.
My experience is that MySQL chooses to do a tablescan, even if there is an index on the column you're searching, if your query would select more than approximately 25% of the rows in the table.
The reason for this is that using a secondary index in InnoDB is a bit more work than using a primary index.
1. Look up the value in the secondary index, like your index on U_.
2. Read the index entry, and find the corresponding primary key value(s) of the rows where that value of U_ is stored.
3. Look up the row(s) by primary key.
It's actually at least double the work to look up by secondary key. This isn't a problem if you ultimately match a small minority of rows of the table, and there are definitely cases where a secondary index is really important for your query. So don't be reluctant to use secondary indexes.
But if your query matches too many rows, and that becomes a big portion of the table, then it would be less work to just scan the table start-to-finish.
By analogy, why doesn't the index at the back of a book contain the word "the"? Because the entry would naturally list every single page in the book, and it would be a waste for you to refer to the index and then use it to guide you to each page in the main part of the book. You would have been better off just reading the book.
MySQL does not have any officially documented threshold for choosing a tablescan over an indexed search. The 25% figure is only my experience (actually sometimes it seems closer to 21%, but I don't know the code well enough to understand exactly how the threshold is calculated).
I've seen cases where the proportion of rows matched was very close to whatever threshold is in the implementation, and the behavior of the optimizer can actually flip-flop from one query to the next, resulting in highly variable performance.
If this case applies to you, you can use an index hint to make MySQL's optimizer pretend that a tablescan is prohibitively expensive, and it should prefer an index to a tablescan. This is done with the FORCE INDEX hint.
SELECT * FROM words FORCE INDEX(U_) WHERE U_ > 0
I still try to use index hints conservatively. They aren't necessary except in rare cases, and using an index hint means your query must include the index name. This makes it hard to change indexes without breaking your application code.
You're asking about the backend query optimizer. In particular you're asking: "how does it choose an access path? Why index here but tablescan there?"
Let's think about that optimizer. What is it optimizing? Elapsed time, in expectation. It has a model for how long sequential reads and random reads take, and for query selectivity, that is, expected number of rows returned by a query. From several alternative access paths it chooses the one that appears to require the least elapsed time.
Your id > 250000 query had a few things going for it:
good selectivity, so less than 1% of rows will appear in the result set
id is the Primary Key, so all columns are immediately available upon navigating to the right place in the btree
This caused the optimizer to compute an expected elapsed time for the indexed access path much smaller than expected time for tablescan.
On the other hand, your u_ > 0 query has very poor selectivity, dragging nearly a quarter of the rows into the result set. Additionally, the index is not a covering index for your * demand of copying all column values into the result set. So the optimizer predicts it will have to read a quarter of the index blocks, and then essentially all of the data row blocks that they point to. Compared to a tablescan, it would have to read more blocks from disk, and they would be random reads instead of sequential reads. Both of those argue against using the index, so the tablescan was selected because it was cheapest.

Also, remember that often multiple rows fit within a single disk block, or within a single read request. We would call it a pessimizer if it always chose the indexed access path, even in cases where indexed disk I/O would take longer.
summary advice
Use an index on a single column when your queries have good selectivity, returning much less than 1% of a relation's rows. Use a covering index when your queries have poor selectivity and you're willing to make a space vs. time tradeoff.
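To illustrate the covering-index suggestion with a sketch (the index name and column list here are hypothetical):

ALTER TABLE words ADD INDEX idx_u_word (U_, word);
-- a query that touches only indexed columns can be answered from the index
-- alone, without visiting the table rows:
EXPLAIN SELECT word FROM words WHERE U_ > 0;
-- the Extra column should show "Using index" when the index covers the query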

MySQL Innodb fail to use index due to extremely wrong rows estimation

I have an InnoDB table; the query on the table looks like this:
SELECT *
FROM x
WHERE now() BETWEEN a AND b
I've created a composite index on (a, b). The query returns around 4k rows, while the total number of rows in the table is around 700k.
However, when I look at the EXPLAIN execution plan, I find that the query does not use the expected index, because the estimated row count is around 360k, vastly larger than the actual value.
I know that, as many posts have explained (such as "Why the rows returned by 'explain' is not equal to count()?"), EXPLAIN only gives a rough estimation. But the FORCE INDEX solution is very tricky and may bring potential performance risks in the future.
Is there any way I can make MySQL produce a more accurate estimation (the current one is 90 times too large)? Thanks.
InnoDB only keeps approximate row counts for tables. This is explained in the documentation of SHOW TABLE STATUS:
Rows
The number of rows. Some storage engines, such as MyISAM, store the exact count. For other storage engines, such as InnoDB, this value is an approximation, and may vary from the actual value by as much as 40 to 50%.
I don't think there's any way to make InnoDB keep accurate row counts, it's just not how it works.
This particular construct is hard to optimize:
WHERE constant BETWEEN col1 AND col2
No MySQL index can be devised to make it run fast. The attempts include:
INDEX(col1) -- will scan last half of table
INDEX(col2) -- will scan first half of table
INDEX(col1, col2) -- will scan last half of table
(Whether it does more of the work in the index BTree depends on ICP, covering, etc. But, in any case, lots of rows must be touched.)
One reason it cannot be improved upon is that the 'last' row in the 'half' might actually match.
If the (col1, col2) pairs do not overlap, then there is a possibility of improving the performance by being able to stop after one row. But MySQL does not know if you have this case, so it cannot optimize. Here is an approach for non-overlapping IP-address lookups that is efficient.
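A sketch of that stop-after-one-row technique, assuming the ranges truly never overlap and that there is an INDEX(a) (both are assumptions about your data, not something MySQL can verify):

SELECT *
FROM (
    SELECT * FROM x WHERE a <= NOW() ORDER BY a DESC LIMIT 1
) AS candidate
WHERE candidate.b >= NOW();

The inner query stops after a single probe of the index on a; the outer check rejects the candidate row if NOW() falls in a gap between ranges.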

MySQL performance with index clairification

Say I have a MySQL table with an index on the column name.
I do this query:
select * from name_table where name = 'John';
Say there are 5 results that are returned from a table with 100 rows.
Say I now insert 1 million new rows, none of which have the name John, so there are still only 5 Johns in the table. Will the SELECT statement be as fast as before? That is, does inserting all these rows have an impact on the read speed of an indexed table?
Indexes have their own "tables", and when the MySQL engine determines that a lookup references an indexed column, the lookup happens on this table. It isn't really a table per se, but the gist checks out.
That said, it will be nanoseconds slower, but not something you should concern yourself with.
More importantly, concern yourself with indexing pertinent data and with column order, as these have MUCH more of an impact on database performance.
To learn more about what is happening behind the scenes, query the EXPLAIN:
EXPLAIN select * from name_table where name = 'John';
Note: In addition to the column orders listed in the link, it is a good (nay, great) idea to put variable-length columns (VARCHAR) after their fixed-length counterparts (CHAR), because during a lookup the engine either has to look at the row, read the column lengths, and then skip forward (mind you, this is only for non-indexed columns), or it can read the table definition and know it always has to look at the column at offset X. It is more complicated behind the scenes, but if you can shift all fixed-length columns to the front, you will thank yourself. Basically (a sketch follows the list below):
Indexed columns.
Everything Fixed-Length in order according to the link.
Everything Variable-Length in order according to the link.
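A hypothetical sketch of that ordering (the names and types are invented for illustration):

CREATE TABLE name_table (
  id INT NOT NULL AUTO_INCREMENT,   -- indexed column first
  age TINYINT NOT NULL,             -- fixed-length columns next
  country CHAR(2) NOT NULL,
  name VARCHAR(64) NOT NULL,        -- variable-length columns last
  bio VARCHAR(255),
  PRIMARY KEY (id),
  KEY idx_name (name)
) ENGINE=InnoDB;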
Yes, it will be just as fast.
(In addition to the excellent points made in Mike's answer...) there's an important point we should make regarding indexes (B-tree indexes in particular):
The entries in the index are stored "in order".
The index is also organized in a way that allows the database to very quickly identify the blocks in the index that contain the entries it's looking for (or the block that would contain entries, if no matching entries are there.)
What this means is that the database doesn't need to look at every entry in the index. Given a predicate like the one in your question:
WHERE name = 'John'
with an index with a leading column of name, the database can eliminate vast swaths of blocks that don't need to be checked.
Blocks near the beginning of the index contain entries 'Adrian' thru 'Anna'; a little later in the index, a block contains entries for 'Caleb' thru 'Carl'; further along in the index, 'James' thru 'Jane', etc.
Because of the way the index is organized, the database effectively "knows" that the entries we're looking for cannot be in any of those blocks (because the index is in order, there's no way the value John could appear in those blocks we mentioned). So none of those blocks needs to be checked. (The database figures out, in just a very small number of operations, that 98% of the blocks in the index can be eliminated from consideration.)
High cardinality = good performance
The take away from this is that indexes are most effective on columns that have high cardinality. That is, there are a large number of distinct values in the column, and those values are unique or nearly unique.
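A quick way to gauge a column's cardinality before indexing it (a sketch; the table and column names are assumed from the question):

SELECT COUNT(DISTINCT name) AS distinct_values,
       COUNT(*) AS total_rows,
       COUNT(DISTINCT name) / COUNT(*) AS selectivity
FROM name_table;
-- selectivity near 1.0 means high cardinality (index-friendly);
-- near 0.0 means low cardinality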
This should clear up the answer to the question you were asking. You can add brazilians of rows to the table. If only five of those rows have the value John in the name column, then the query
WHERE name = 'John'
will be just as fast. The database will be able to locate the entries you're looking for nearly as fast as it could when there were only a thousand rows in the table.
(As the index grows larger, it does add "levels" to the index to traverse down to the leaf nodes... so it gets ever so slightly slower because of a few more operations.) Where performance really starts to bog down is when the InnoDB buffer pool is too small and we have to wait for the (glacially slow in comparison) disk I/O operations to fetch blocks into memory.
Low cardinality = poor performance
Indexes on columns with low cardinality are much less effective. For example, a column that has two possible values, with an even distribution of values across the rows in the table (about half of the rows have one value, and the other half have the other value.) In this case, the database can't eliminate 98% of the blocks, or 90% of the blocks. The database has to slog through half the blocks in the index, and then (usually) perform a lookup to the pages in the underlying table to get the other values for the row.
But with gazillions of rows with a column gender, with two values 'M' and 'F', an index with gender as a leading column will not be effective in satisfying a query
WHERE gender = 'M'
... because we're effectively telling the database to retrieve half the rows in the table, and those rows are likely to be evenly distributed through the table. Since nearly every page in the table will contain at least one row we need, the database will opt to do a full table scan (looking at every row in every block of the table) to locate the rows, rather than using an index.
So, in terms of performance for looking up rows in the table using an index... the size of the table isn't really an issue. The real issue is the cardinality of the values in the index, and how many distinct values we're looking for, and how many rows need to be returned.

Does changing a column length from varchar 18 to varchar 8 improve search speed drastically?

I have a MySQL database with ~500k rows. I have a column that is basically a short URL used as an index for each row. I can make the index CHAR/VARCHAR(8) or VARCHAR(18). I want to know whether the extra 10 chars will really slow my searches down drastically. A pro of VARCHAR(18) is that I won't have to generate a short URL; whereas if I use CHAR/VARCHAR(8), I will. The index will be used for retrieving comments, updating the entry, etc. The index will be unique regardless.
Thanks
Not drastically. Search speed in B+-tree indexes varies with the number of index reads, which varies with log_x(N), where N is the number of index entries and x is the number of keys per index block, which in turn is inversely proportional to key length. You can see it is a sub-linear relationship.
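To make that concrete with a rough, hypothetical calculation (the page size and per-entry overhead are assumptions, not measurements): with 16 KiB index pages and roughly 12 bytes of pointer/overhead per entry, an 8-char key gives ~20-byte entries and a fanout x of roughly 800, while an 18-char key gives ~30-byte entries and a fanout of roughly 545. Then log_800(500000) ≈ 2.0 and log_545(500000) ≈ 2.1, so the tree is 2-3 levels deep either way and the number of index reads is effectively identical.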
The difference in index performance between 8-char and 18-char column values is not zero. But it is quite small. Index performance deteriorates more markedly when you're dealing with stuff like 255-character indexes.
You should be fine either way with a half-megarow table unless you have really extraordinary performance requirements.
When you're done with the table's initial load, do OPTIMIZE TABLE to make the indexes work optimally. If the table gets lots of DELETE or INSERT operations, try to find a time of day or week when you routinely can redo the OPTIMIZE TABLE operation. http://dev.mysql.com/doc/refman/5.6/en/optimize-table.html

MySQL: low cardinality/selectivity columns = how to index?

I need to add indexes to my table (columns) and stumbled across this post:
How many database indexes is too many?
Quote:
“Having said that, you can clearly add a lot of pointless indexes to a table that won't do anything. Adding B-Tree indexes to a column with 2 distinct values will be pointless since it doesn't add anything in terms of looking the data up. The more unique the values in a column, the more it will benefit from an index.”
Is an index really pointless if there are only two distinct values? Given a table as follows (MySQL database, InnoDB):
Id (BIGINT)
fullname (VARCHAR)
address (VARCHAR)
status (VARCHAR)
Further conditions:
The database contains 300 million records
Status can only be "enabled" or "disabled"
150 million records have status = enabled and 150 million records have status = disabled
My understanding is that, without an index on status, a SELECT with WHERE status='enabled' would result in a full tablescan with 300 million records to process?
How efficient is the lookup when I use a BTREE index on status?
Should I index this column or not?
What alternatives (maybe other kinds of indexes) does MySQL InnoDB provide to efficiently look up records by the WHERE status="enabled" clause in the given example, with a very low cardinality/selectivity of the values?
The index that you describe is pretty much pointless. An index is best used when you need to select a small number of rows in comparison to the total rows.
The reason for this is related to how a database accesses a table. Tables can be accessed either by a full table scan, where each block is read and processed in turn, or by a rowid or key lookup, where the database has a key/rowid and reads the exact row it requires.
In the case where you use a where clause based on the primary key or another unique index, eg. where id = 1, the database can use the index to get an exact reference to where the row's data is stored. This is clearly more efficient than doing a full table scan and processing every block.
Now, back to your example: you have a WHERE clause of WHERE status = 'enabled', so the index will return 150M rows and the database will have to read each row in turn using separate small reads, whereas accessing the table with a full table scan allows the database to make use of more efficient larger reads.
There is a point at which it is better to just do a full table scan rather than use the index. With mysql you can use FORCE INDEX (idx_name) as part of your query to allow comparisons between each table access method.
Reference:
http://dev.mysql.com/doc/refman/5.5/en/how-to-avoid-table-scan.html
I'm sorry to say that I do not agree with Mike. Adding an index is meant to limit the number of full-record searches MySQL performs, thereby limiting IO, which is usually the bottleneck.
This indexing is not free; you pay for it on inserts/updates when the index has to be updated, and in the search itself, as it now needs to load the index file (a full index for 300M records is probably not in memory). So it might well be that you get extra IO instead of limiting it.
I do agree with the statement that a binary variable is best stored as one, a BOOL or TINYINT, as that decreases the row length and can thereby limit disk IO; comparisons on numbers are also faster.
If you need speed and you seldom use the disabled records, you may wish to have two tables, one for enabled and one for disabled records, and move the records when the status changes. As this increases complexity and risk, it would of course be my very last choice. Definitely do the move in one transaction if you go for it.
It just popped into my head that you can check whether an index is actually used with the EXPLAIN statement. That should show you how MySQL is optimizing the query. I don't really know how MySQL optimizes queries, but from PostgreSQL I do know that you should EXPLAIN a query on a database approximately the same (in size and data) as the real one. So if you have a copy of the database, create an index on the table and see whether it's actually used. As I said, I doubt it, but I most definitely don't know everything :)
If the data is distributed 50:50, then a query like WHERE status="enabled" only saves you from scanning half the table.
Whether an index on such a table helps depends entirely on the distribution of the data. For instance, if 90% of entries have status enabled and the other 10% have status disabled, then a query with WHERE status="disabled" scans only 10% of the table.
So having an index on such columns depends on the distribution of the data.
@a'r's answer is correct; however, it needs to be pointed out that the usefulness of an index is determined not only by its cardinality but also by the distribution of the data and the queries run on the database.
In the OP's case, with 150M records having status='enabled' and 150M having status='disabled', the index is unnecessary and a waste of resources.
In case of 299M records having status='enabled' and 1M having status='disabled', the index is useful (and will be used) in queries of type SELECT ... where status='disabled'.
Queries of type SELECT ... where status='enabled' will still run with a full table scan.
You will hardly need all 150 million records at once, so I guess "status" will always be used in conjunction with other columns. Perhaps it would make more sense to use a compound index like (status, fullname), as in the sketch below.
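For example (the table name and the second column are illustrative only):

ALTER TABLE records ADD INDEX idx_status_fullname (status, fullname);
-- a query filtering on both columns can now use the whole index:
SELECT id FROM records WHERE status = 'enabled' AND fullname = 'Jane Doe';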
Jan, you should definitely index that column. I'm not sure of the context of the quote, but everything you said above is correct. Without an index on that column, you are most certainly doing a table scan on 300M rows, which is about the worst you can do for that data.
Jan, as asked, where your query involves simply WHERE status=enabled without some other limiting factor, an index on that column apparently won't help (glad the SO community showed me what's up). If, however, there is a limiting factor such as LIMIT 10, an index may help. Also, remember that indexes are used in GROUP BY and ORDER BY optimizations. If you are doing SELECT COUNT(*), status FROM table GROUP BY status, an index would be helpful.
You should also consider converting status to a tinyint where 0 would represent disabled and 1 would be enabled. You're wasting tons of space storing that string vs. a tinyint which only requires 1 byte per row!
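A sketch of such a conversion (the table name is assumed; on a 300M-row table you would batch the UPDATE rather than run it as one statement):

ALTER TABLE records ADD COLUMN status_flag TINYINT NOT NULL DEFAULT 0;
UPDATE records SET status_flag = (status = 'enabled');  -- 1 = enabled, 0 = disabled
ALTER TABLE records DROP COLUMN status;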
I have a similar column in my MySQL database. Approximately 4 million rows, with the distribution of 90% 1 and 10% 0.
I've just discovered today that my queries (where column = 1) actually run significantly faster WITHOUT the index.
Foolishly, I deleted the index. I say foolishly because I now suspect the queries (WHERE column = 0) may have still benefited from it. So instead I should explicitly tell MySQL to ignore the index when I'm searching for 1, and to use it when I'm searching for 0. Maybe.
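That per-value choice can be expressed with index hints, roughly like this (the table, column, and index names are hypothetical):

SELECT COUNT(*) FROM mytable IGNORE INDEX (idx_flag) WHERE flag = 1;  -- common value: let it scan
SELECT COUNT(*) FROM mytable FORCE INDEX (idx_flag) WHERE flag = 0;   -- rare value: use the index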