MySQL indexes: how do they work? - mysql

I'm a complete newbie with MySQL indexes. I have several MyISAM tables on MySQL 5.0x having utf8 charsets and collations with 100k+ records each. The primary keys are generally integer. Many columns on each table may have duplicate values.
I need to quickly count, sum, average, or otherwise perform custom calculations on any number of fields in each table or joined on any number of others.
I found this page giving an overview of MySQL index usage: http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html, but I'm still not sure I'm using indexes right. Just when I think I've made the perfect index out of a collection of fields I want to calculate against, I get the "index must be under 1000 bytes" error.
Can anyone explain how to most efficiently create and use indexes to speed up queries?
Caveat: upgrading MySQL is not possible in this case. I'm using Navicat Light for db administration, but this app isn't required.

When you create an index on a column or columns in a MySQL table, the database creates a data structure called a B-tree (assuming you use the default index setting), for which the key of each record is a concatenation of the values in the indexed columns.
For example, let's say you have a table that is defined like:
CREATE TABLE mytable (
id int unsigned auto_increment,
column_a char(32) not null default '',
column_b int unsigned not null default 0,
column_c varchar(512),
column_d varchar(512),
PRIMARY KEY (id)
) ENGINE=MyISAM;
Then let's give it some data:
INSERT INTO mytable VALUES (1, 'hello', 2, null, null);
INSERT INTO mytable VALUES (2, 'hello', 3, 'hi', 'there');
INSERT INTO mytable VALUES (3, 'how', 4, 'are', 'you?');
INSERT INTO mytable VALUES (4, 'foo', 5, '', 'bar');
Now suppose you decide to add a key to column_a and column_b like:
ALTER TABLE mytable ADD KEY (column_a, column_b);
The database is going to create the aforementioned B-tree, which will have four keys in it, one for each row:
hello-2
hello-3
how-4
foo-5
When you perform a search that references the column_a column, or that references the column_a AND column_b columns, the database will be able to use this index to narrow the record set it has to examine. Let's say you have a query like:
SELECT ... FROM mytable WHERE column_a = 'hello';
Even though the above query does not specify a value for the column_b column, it can still take advantage of our index by looking for all keys that begin with "hello". By contrast, suppose you had a query like:
SELECT ... FROM mytable WHERE column_b = '2';
This query would NOT be able to use our index, because it would have to parse the index keys themselves to try to determine which keys' second value matches '2', which is terribly inefficient.
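One way to confirm the leftmost-prefix behaviour described above is to run EXPLAIN on each form of the query. This is a sketch that assumes the mytable definition and the (column_a, column_b) key from this answer; the actual plan depends on data volume and MySQL version, so the comments describe what one would typically expect, not guaranteed output.

```sql
-- Filters on the leading column: the (column_a, column_b) key is a candidate.
EXPLAIN SELECT * FROM mytable WHERE column_a = 'hello';

-- Filters on both columns: the full key can be used for the lookup.
EXPLAIN SELECT * FROM mytable WHERE column_a = 'hello' AND column_b = 2;

-- Filters only on the second column: no index starts with column_b,
-- so possible_keys is expected to be NULL.
EXPLAIN SELECT * FROM mytable WHERE column_b = 2;
```

If queries on column_b alone are common, a separate single-column key on column_b is the usual remedy.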
Now, let's address your original question of the maximum length. Suppose we try to create an index spanning all four non-PK columns in this table:
ALTER TABLE mytable ADD KEY (column_a, column_b, column_c, column_d);
You will get an error:
ERROR 1071 (42000): Specified key was too long; max key length is 1000 bytes
In this case our column lengths are 32, 4 (an INT UNSIGNED is stored in 4 bytes), 512, and 512, which in a single-byte-per-character situation is 1060 bytes before any per-column overhead, which is above the limit of 1000. Suppose that it DID work; you would be creating the following keys:
hello-2-
hello-3-hi-there
how-4-are-you?
foo-5--bar
Now, suppose that you had values in column_c and column_d that were very long -- 512 characters each. Even in a basic single-byte character set, your keys would now be over 1000 bytes in length, which is what MySQL is complaining about. It gets even worse with multibyte character sets, where seemingly "small" columns can still push the keys over the limit.
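As a rough worked example of the multibyte case: MySQL's utf8 character set reserves up to 3 bytes per character, so the same four columns would need roughly the number of bytes computed below. This sketch assumes INT takes 4 bytes and ignores the small per-column overhead for VARCHAR lengths and NULL flags.

```sql
SELECT 32 * 3      -- column_a CHAR(32), up to 3 bytes per utf8 character
     + 4           -- column_b INT UNSIGNED
     + 512 * 3     -- column_c VARCHAR(512)
     + 512 * 3     -- column_d VARCHAR(512)
       AS approx_key_bytes;  -- 3172, more than three times the 1000-byte limit
```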
If you MUST use a large compound key, one solution is to use InnoDB tables, which support a larger total key length than the default MyISAM tables (InnoDB's internal limit is 3500 bytes, although MySQL itself caps the total at 3072 bytes in most builds, and each individual column is still limited to 767 bytes) -- you can do this by specifying ENGINE=InnoDB instead of ENGINE=MyISAM in the declaration above. However, generally speaking, if you are using long keys there is probably something wrong with your table design.
Remember that single-column indexes often provide more utility than multi-column indexes. You want to use a multi-column index when you are going to often/always take advantage of it by specifying all of the necessary criteria in your queries. Also, as others have mentioned, do NOT index every column of a table, since each index adds storage overhead to your database. You want to limit your indexes to the columns that are frequently used by queries, and if it seems like you need too many, you should probably think about breaking your tables up into more logical components.

Indexes generally aren't well suited for custom calculations where the user is able to construct their own queries. Typically you choose the indexes to match the specific queries you intend to run, using EXPLAIN to see if the index is being used.
In the case that you have absolutely no idea what queries might be performed it is generally best to create one index per column - and not one index covering all columns.
If you have a good idea of what queries might be run often you could create an extra index for those specific queries. You can also add indexes later if your users complain that certain types of queries run too slow.
Also, indexes generally aren't that useful for calculating counts, sums and averages since these types of calculations require looking at every row.
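A minimal sketch of the one-index-per-column approach, reusing the mytable example from the first answer. The index names are made up for illustration, and whether the optimizer actually picks an index depends on the data.

```sql
-- One single-column index per column that users are likely to filter on.
ALTER TABLE mytable ADD KEY idx_column_a (column_a);
ALTER TABLE mytable ADD KEY idx_column_b (column_b);

-- Check which index (if any) is chosen for a typical filter plus aggregate.
EXPLAIN SELECT COUNT(*), AVG(column_b) FROM mytable WHERE column_a = 'hello';
```

The key column of the EXPLAIN output names the index that was chosen, or is NULL when the table is scanned.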

It sounds like you are trying to put too many fields into your index. The limit is probably on the number of bytes it takes to encode all the fields.
The index is used in looking up the records, so you want to choose the fields which you are "WHERE"ing on. In choosing between those fields, you want to choose the ones that will narrow the results the quickest.
As an example, a filter on Male/Female will usually not help much because you are only going to save about 50% of the time. However, a filter on State may be useful because you'll break down into many more categories. However, if almost everybody in the database is in a single state then that won't work.
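To put a number on "narrows the results the quickest", one rough measure is the ratio of distinct values to total rows: values near 1 mean the column is highly selective, values near 0 mean it is not. The sketch below uses the mytable example from the first answer as a stand-in for your own table and columns.

```sql
SELECT COUNT(*)                            AS total_rows,
       COUNT(DISTINCT column_a) / COUNT(*) AS selectivity_a,
       COUNT(DISTINCT column_b) / COUNT(*) AS selectivity_b
FROM mytable;
```

This ratio says nothing about skew: a column where one value dominates (the "almost everybody is in a single state" case) can still be a poor filter for that value.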

Remember that indexes are for sorting and finding rows.
The error message you got sounds like it is talking about the 1000 byte Prefix Limit for MyISAM table indexes. From http://dev.mysql.com/doc/refman/5.0/en/create-index.html:
The statement shown here creates an index using the first 10
characters of the name column:
CREATE INDEX part_of_name ON customer (name(10));
If names in the column usually differ in the first 10 characters, this
index should not be much slower than an index created from the entire
name column. Also, using column prefixes for indexes can make the index
file much smaller, which could save a lot of disk space and might also
speed up INSERT operations.
Prefix support and lengths of prefixes (where supported) are storage
engine dependent. For example, a prefix can be up to 1000 bytes long
for MyISAM tables, and 767 bytes for InnoDB tables.
Maybe you can try a FULLTEXT index for problematic columns.
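Applied to the mytable example from the first answer, the prefix approach quoted above might look like the sketch below. The index name and the prefix length of 100 characters are arbitrary choices for illustration; pick a length at which your values are usually already distinct.

```sql
-- Index only the first 100 characters of each long VARCHAR column.
-- Even with 3-byte utf8 characters this stays well under 1000 bytes.
ALTER TABLE mytable
  ADD KEY idx_abcd_prefix (column_a, column_b, column_c(100), column_d(100));
```

A prefixed column can still narrow lookups, but MySQL has to read the row itself to compare the full value.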

Related

MySQL: index a field that contains only distinct values?

Is it useful for SELECT performance to set an index on a field that contains only distinct values?
eg:
order_id
--------
98317490
10928343
82931376
93438473
...
Is it useful for SELECT performance to set an index on a field that contains only distinct values?
That depends. An index is useful if you often search on this column:
WHERE column=value
WHERE column BETWEEN a AND b
The usefulness of an index is determined by its selectivity. For example, if your column contains a boolean, which is:
false in 99.9% of rows
true in 0.1% of rows
Then you can easily guess that using an index to find "true" values will be a huge boost relative to reading the entire table to search for them.
On the other hand, searching for "false" using an index will be slower than not using an index, since you're gonna read the whole table anyway, you might as well not bother to also process the index.
If values are all distinct, then selectivity is maximum, and index will be very useful. That is, assuming you actually search on that column!
An index that is never used only slows down updates.
Of course it is useful, as with all indexes - it is useful if you have select statements where you have this field on the WHERE clause.
Whether this field has distinct values or not doesn't really matter.
Note that if your field is marked as UNIQUE or PRIMARY KEY in the database, the database will technically already have an index for this field, so adding another index for it will not change anything.
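To see the index that a PRIMARY KEY (or UNIQUE) constraint creates implicitly, SHOW INDEX can be used. The orders table here is a hypothetical stand-in for the table in the question.

```sql
CREATE TABLE orders (
  order_id INT UNSIGNED NOT NULL,
  PRIMARY KEY (order_id)
);

-- Lists an index named PRIMARY on order_id, even though no
-- separate CREATE INDEX statement was ever issued.
SHOW INDEX FROM orders;
```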

Will MySQL use Multiple-column index if I use columns in different order?

Reading the MySQL docs we see this example table with multiple-column index name:
CREATE TABLE test (
id INT NOT NULL,
last_name CHAR(30) NOT NULL,
first_name CHAR(30) NOT NULL,
PRIMARY KEY (id),
INDEX name (last_name,first_name)
);
It is explained with examples in which cases the index will or will not be utilized. For example, it will be used for such query:
SELECT * FROM test
WHERE last_name='Widenius' AND first_name='Michael';
My question is, would it work for this query (which is effectively the same):
SELECT * FROM test
WHERE first_name='Michael' AND last_name='Widenius';
I couldn't find any word about that in the documentation - does MySQL try to swap columns to find appropriate index or is it all up to the query?
It should be the same, because (from the MySQL docs) the query optimizer works as follows:
Each table index is queried, and the best index is used unless the
optimizer believes that it is more efficient to use a table scan. At
one time, a scan was used based on whether the best index spanned more
than 30% of the table, but a fixed percentage no longer determines the
choice between using an index or a scan. The optimizer now is more
complex and bases its estimate on additional factors such as table
size, number of rows, and I/O block size.
http://dev.mysql.com/doc/refman/5.7/en/where-optimizations.html
In some cases, MySQL can read rows from the index without even
consulting the data file.
and this should be your case:
Without ICP, the storage engine traverses the index to locate rows in
the base table and returns them to the MySQL server which evaluates
the WHERE condition for the rows. With ICP enabled, and if parts of
the WHERE condition can be evaluated by using only fields from the
index, the MySQL server pushes this part of the WHERE condition down
to the storage engine. The storage engine then evaluates the pushed
index condition by using the index entry and only if this is satisfied
is the row read from the table. ICP can reduce the number of times the
storage engine must access the base table and the number of times the
MySQL server must access the storage engine.
http://dev.mysql.com/doc/refman/5.7/en/index-condition-pushdown-optimization.html
For the two queries you stated, it will work the same.
However, for queries which have only one of the columns, the order of the index matters.
For example, this will use the index:
SELECT * FROM test WHERE last_name='Widenius';
But this won't:
SELECT * FROM test WHERE first_name='Michael';
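To check this on a concrete server, the three cases can be compared with EXPLAIN. This is a sketch against the test table from the question; the notes in the comments are typical expectations and can vary with the storage engine, version, and data.

```sql
-- Both statements constrain the same two columns, so the order in the
-- WHERE clause does not matter and the `name` index can be used for lookup.
EXPLAIN SELECT * FROM test WHERE last_name='Widenius' AND first_name='Michael';
EXPLAIN SELECT * FROM test WHERE first_name='Michael' AND last_name='Widenius';

-- Only the second index column is constrained, so a keyed lookup is not
-- possible. Depending on the engine, MySQL may still scan the whole index
-- (type "index") instead of the table, but it cannot seek into it.
EXPLAIN SELECT * FROM test WHERE first_name='Michael';
```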

Cannot achieve a cover index with this table (2 equalities and one selection)?

CREATE TABLE `discount_base` (
`id` varchar(12) COLLATE utf8_unicode_ci NOT NULL,
`amount` decimal(13,4) NOT NULL,
`description` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`family` varchar(4) COLLATE utf8_unicode_ci NOT NULL,
`customer_id` varchar(8) COLLATE utf8_unicode_ci NOT NULL,
PRIMARY KEY (`id`),
KEY `IDX_CUSTOMER` (`customer_id`),
KEY `IDX_FAMILY_CUSTOMER_AMOUNT` (`family`,`customer_id`,`amount`),
CONSTRAINT `FK_CUSTOMER` FOREIGN KEY (`customer_id`)
REFERENCES `customer` (`id`) ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
I've added a cover index IDX_FAMILY_CUSTOMER_AMOUNT on family, customer_id and amount because most of the time I use the following query:
SELECT amount FROM discount_base WHERE family = :family AND customer_id = :customer_id
However, using EXPLAIN with a bunch of records (~ 250000), it says:
'1', 'SIMPLE', 'discount_base', 'ref', 'IDX_CUSTOMER,IDX_FAMILY_CUSTOMER_AMOUNT', 'IDX_FAMILY_CUSTOMER_AMOUNT', '40', 'const,const', '1', 'Using where; Using index'
Why am I getting Using where; Using index instead of just Using index?
EDIT: Fiddle with a small amount of data (Using where; Using index):
EXPLAIN SELECT amount
FROM discount_base
WHERE family = '0603' and customer_id = '20000275';
Another fiddle where id is family + customer_id (const):
EXPLAIN SELECT amount
FROM discount_base
WHERE `id` = '060320000275';
Interesting problem. It would seem "obvious" that the IDX_FAMILY_CUSTOMER_AMOUNT index would be used for this query:
SELECT amount
FROM discount_base
WHERE family = :family AND customer_id = :customer_id;
"Obvious" to us people, but clearly not to the optimizer. What is happening?
This aspect of index usage is poorly documented. I (intelligently) speculate that when doing comparisons on strings using case-insensitive collations (and perhaps others), then the = operation is really more like an in. Something sort of like this, conceptually:
WHERE family in (lower(:family), upper(:family), . . .) and . . .
This is conceptual. But it means that an index scan is required for the = rather than an index lookup. Minor change typographically. Very important semantically. It prevents the use of the second key. Yup, that is an unfortunate consequence of inequalities, even when they look like =.
So, the optimizer compares the two possible indexes, and it decides that customer_id is more selective than family, and chooses the former.
Alas, both of your keys are case-insensitive strings. My suggestion would be to replace at least one of them with an auto-incrementing integer id. In fact, my suggestion is that basically all tables have an auto-incrementing integer id, which is then used for all foreign key references.
Another solution would be to use a trigger to create a single column CustomerFamily with the values concatenated together. Then this index:
KEY IDX_CUSTOMERFAMILY_AMOUNT (CustomerFamily, amount)
should do what you want. It is also possible that a case-sensitive encoding would also solve the problem.
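A rough sketch of the trigger idea, assuming the discount_base table from the question. The trigger names are invented for illustration, and whether this actually changes the EXPLAIN output is something to verify on your own server. Plain concatenation is only unambiguous if customer_id and family have fixed lengths.

```sql
-- Add the combined column and the index suggested above.
ALTER TABLE discount_base
  ADD COLUMN CustomerFamily VARCHAR(12) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',
  ADD KEY IDX_CUSTOMERFAMILY_AMOUNT (CustomerFamily, amount);

-- Keep the combined column in sync on every insert and update.
CREATE TRIGGER discount_base_bi BEFORE INSERT ON discount_base
FOR EACH ROW SET NEW.CustomerFamily = CONCAT(NEW.customer_id, NEW.family);

CREATE TRIGGER discount_base_bu BEFORE UPDATE ON discount_base
FOR EACH ROW SET NEW.CustomerFamily = CONCAT(NEW.customer_id, NEW.family);

-- Backfill existing rows, then query on the single combined column.
UPDATE discount_base SET CustomerFamily = CONCAT(customer_id, family);

EXPLAIN SELECT amount
FROM discount_base
WHERE CustomerFamily = CONCAT('20000275', '0603');
```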
Are family and customer_id strings? I guess you could be passing customer_id as an integer, which could cause a type conversion to take place, so the index would not be used for that particular column.
Ensure you pass customer_id as a string, or consider changing your table to store customer_id as INT.
If you are using alphanumeric ids then this doesn't apply.
I'm pretty sure Using index is the important part, and it means "using a covering index".
Two things to further check:
EXPLAIN FORMAT=JSON SELECT ...
may give further clues.
FLUSH STATUS;
SELECT ...;
SHOW SESSION STATUS LIKE 'Handler%';
will show you how many rows were read/written/etc. in various ways. If some number says about 250000 (in your case), it indicates a table scan. If all the numbers are small (approximately the number of rows returned by the query), then you can be assured that the query ran efficiently.
The numbers there do not distinguish between reads of an index and reads of data. But they ignore caching. Timings (for two identical runs) can differ significantly due to caching; Handler% values won't change.
The answer to your question relies on what the engine is actually using your index for.
In given query, you ask the engine to:
Lookup for values (WHERE/JOIN)
Retrieve information (SELECT) based on this lookup result
For the first part, as soon as you filter the results (lookup), there's an entry in Extra indicating USING WHERE, so this is the reason you see it in your explain plan.
For the second part, the engine does not need to go anywhere out of one given index because it is a covering index. The explain plan notifies it by showing USING INDEX. This USING INDEX hint, combined with USING WHERE, means your index is also used in the lookup portion of the query, as explained in mysql documentation:
https://dev.mysql.com/doc/refman/5.0/en/explain-output.html
Using index
The column information is retrieved from the table using only
information in the index tree without having to do an additional seek
to read the actual row. This strategy can be used when the query uses
only columns that are part of a single index.
If the Extra column also says Using where, it means the index is being
used to perform lookups of key values. Without Using where, the
optimizer may be reading the index to avoid reading data rows but not
using it for lookups. For example, if the index is a covering index
for the query, the optimizer may scan it without using it for lookups.
Check this fiddle:
http://sqlfiddle.com/#!9/8cdf2/10
I removed the where clause and the query now displays USING INDEX only. This is because no lookup is necessary in your table.
The MySQL documentation on EXPLAIN has this to say:
Using index
The column information is retrieved from the table using only
information in the index tree without having to do an additional seek
to read the actual row. This strategy can be used when the query uses
only columns that are part of a single index.
If the Extra column also says Using where, it means the index is being
used to perform lookups of key values. Without Using where, the
optimizer may be reading the index to avoid reading data rows but not
using it for lookups. For example, if the index is a covering index
for the query, the optimizer may scan it without using it for lookups.
My best guess, based on the information you have provided, is that the optimizer first uses your IDX_CUSTOMER index and then performs a key lookup to retrieve non-key data (amount and family) from the actual data page based on the key (customer_id).
This is most likely caused by the cardinality (i.e., uniqueness) of the columns in your indexes. You should check the cardinality of the columns used in your where clause and put the one with the highest cardinality first in your index. Guessing from the column names and your current results, customer_id has the highest cardinality.
So change this:
KEY `IDX_FAMILY_CUSTOMER_AMOUNT` (`family`,`customer_id`,`amount`)
to this:
KEY `IDX_FAMILY_CUSTOMER_AMOUNT` (`customer_id`,`family`,`amount`)
After making the change, you should run ANALYZE TABLE to update the table statistics, which can affect the choices the optimizer makes regarding your indexes.
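Put together, the change suggested in this answer could be sketched as follows. The new index name is an assumption, and the literal values are the ones from the question's fiddle; compare the EXPLAIN output before and after rather than taking the result for granted.

```sql
-- Swap the column order so the presumably more selective column leads.
ALTER TABLE discount_base
  DROP INDEX IDX_FAMILY_CUSTOMER_AMOUNT,
  ADD INDEX IDX_CUSTOMER_FAMILY_AMOUNT (customer_id, family, amount);

-- Refresh the statistics the optimizer relies on.
ANALYZE TABLE discount_base;

-- Re-check the plan for the original query.
EXPLAIN SELECT amount
FROM discount_base
WHERE family = '0603' AND customer_id = '20000275';
```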
This sounds fine. According to the MySQL documentation:
If the Extra column also says Using where, it means the index is being
used to perform lookups of key values. Without Using where, the
optimizer may be reading the index to avoid reading data rows but not
using it for lookups. For example, if the index is a covering index
for the query, the optimizer may scan it without using it for lookups.
That means Using index alone would read the entire index to retrieve the results, without using the index structure to find specific values. You could probably get this with SELECT family, customer_id, amount FROM discount_base. Using where; Using index means the optimizer exploits the index to find and retrieve rows matching the query parameters (family, customer_id).
This indeed could be a problem.
Note that there might be millions of strings matched by a single index key when using the utf8_unicode_ci collation. For example, all these letters are matched by the same index key:
A, a, À, Á, Â, Ã, Ä, Å, à, á, â, ã, ä, å, Ā, ā, Ă, ă, Ą, ą, Ǎ, ǎ, Ǟ, ǟ, Ǡ, ǡ, Ǻ, ǻ, Ȁ, ȁ, Ȃ, ȃ, Ȧ, ȧ, Ḁ, ḁ, Ạ, ạ, Ả, ả, Ấ, ấ, Ầ, ầ, Ẩ, ẩ, Ẫ, ẫ, Ậ, ậ, Ắ, ắ, Ằ, ằ, Ẳ, ẳ, Ẵ, ẵ, Ặ, ặ.
And there are substantial grounds for believing that when processing a query using a CHAR/VARCHAR index, MySQL, in addition to the regular index lookup, performs a full linear scan of all the values matched by the index to make sure each one is indeed matched by the original query parameter. This may really be needed when the index collation and the WHERE collation do not match, but I don't know why it does so all the time, even when this is clearly not needed (in your case, for example).
See this question for evidence and additional details: Why performance of MySQL queries are so bad when using a CHAR/VARCHAR index?
I would only recommend this solution:
Remove amount from index like:
KEY `IDX_FAMILY_CUSTOMER_AMOUNT` (`family`,`customer_id`)
When you run the query, force the index with USE INDEX:
USE INDEX (`IDX_FAMILY_CUSTOMER_AMOUNT`)
That trick avoids Using where. Hopefully performance will also be at an acceptable level:
http://sqlfiddle.com/#!9/86f46/2
SELECT amount FROM discount_base
USE INDEX (`IDX_FAMILY_CUSTOMER_AMOUNT`)
WHERE family = '1' AND customer_id = '1'
Based on the fiddle provided, it appears that only numeric values are being used for family and customer id. If this assumption is correct, changing these columns to numeric and using just a single key on customer and family appears to have resolved the issue.
Please check this fiddle

Query huge MySQL DB

I have a MySQL DB with two columns, 'Key' and 'Used'. Key is a string, Used is an integer. Is there a very fast way to search for a specific Key and then return its Used value in a huge MySQL DB with 6000000 rows of data?
You can make it fast by creating an index on key field:
CREATE INDEX mytable_key_idx ON mytable (`key`);
You can actually make it even faster for reading by creating covering index on both (key, used) fields:
CREATE INDEX mytable_key_used_idx ON mytable (`key`, `used`);
In this case, when reading, MySQL could retrieve used value from the index itself, without reading the table (index-only scan). However, if you have a lot of write activity, covering index may work slower because now it has to update both an index and actual table.
The normative SQL for that would be:
SELECT t.key, t.used FROM mytable t WHERE t.key = 'particularvalue' ;
The output from
EXPLAIN
SELECT t.key, t.used FROM mytable t WHERE t.key = 'particularvalue' ;
Would give details about the access plan, what indexes are being considered, etc.
The output from a
SHOW CREATE TABLE mytable ;
would give information about the table, the engine being used and the available indexes, as well as the datatypes.
Slow performance on a query like this is usually indicative of a suboptimal access plan, either because suitable indexes are not available, or because they are not being used. Sometimes, a character set mismatch between the column datatype and the literal datatype in the predicate can make an index "unusable" by a particular query.
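To look for the kind of character set mismatch mentioned above, the column collations can be compared with the connection settings. This sketch assumes the table is named mytable, as in the statements above.

```sql
-- The Collation column shows the collation of each character column.
SHOW FULL COLUMNS FROM mytable;

-- Character set and collation applied to string literals sent by this connection.
SHOW VARIABLES LIKE 'character_set_connection';
SHOW VARIABLES LIKE 'collation_connection';
```

If the two differ, MySQL may have to convert values before comparing them, which is one way an otherwise suitable index can end up unused.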

Best solution for saving boolean values and saving cpu and memory on searches

What is the best solution for storing boolean values in a database if you want better query performance and minimal memory use in SELECT statements?
For example:
I have a table with 36 fields, 30 of which hold boolean values (zero or one), and I need to search for records using only the boolean fields that have true values.
SELECT * FROM `myTable`
WHERE
`field_5th` = 1
AND `field_12th` = 1
AND `field_20` = 1
AND `field_8` = 1
Is there any solution?
If you want to store boolean values or flags there are basically three options:
Individual columns
This is reflected in your example above. The advantage is that you will be able to put indexes on the flags you intend to use most often for lookups. The disadvantage is that this will take up more space (since the minimum column size that can be allocated is 1 byte.)
However, if your column names are really going to be field_20, field_21, etc., then this is absolutely NOT the way to go. Numbered columns are a sign you should use either of the other two methods.
Bitmasks
As was suggested above you can store multiple values in a single integer column. A BIGINT column would give you up to 64 possible flags.
Values would be something like:
UPDATE `table` SET flags = flags | b'100';
UPDATE `table` SET flags = flags | b'10000';
Then the field would look something like: 10100
That would represent having two flag values set. To query for any particular flag value set, you would do
SELECT flags FROM `table` WHERE flags & b'100';
The advantage of this is that your flags are very compact space-wise. The disadvantage is that you can't place indexes on the field which would help improve the performance of searching for specific flags.
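A slightly fuller sketch of the bitmask approach, using a hypothetical flagged_rows table and plain integer masks (4 is binary 100, 16 is binary 10000). It shows how to set a flag without overwriting the others, test for several flags at once, and clear a flag.

```sql
CREATE TABLE flagged_rows (
  id    INT UNSIGNED NOT NULL PRIMARY KEY,
  flags BIGINT UNSIGNED NOT NULL DEFAULT 0
);

INSERT INTO flagged_rows (id, flags) VALUES (1, 0), (2, 4), (3, 16), (4, 20);

-- Set flag 4 on row 1 while leaving any other bits untouched.
UPDATE flagged_rows SET flags = flags | 4 WHERE id = 1;

-- Rows that have BOTH flag 4 and flag 16 set (only id 4 at this point).
SELECT id FROM flagged_rows WHERE (flags & 20) = 20;

-- Clear flag 16 on row 4 while leaving the other bits untouched.
UPDATE flagged_rows SET flags = flags & ~16 WHERE id = 4;
```

Every one of these WHERE clauses on flags still has to examine each row, which is the indexing drawback described above.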
One-to-many relationship
This is where you create another table, and each row there would have the id of the row it's linked to, and the flag:
CREATE TABLE main (
main_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY
);
CREATE TABLE flag (
main_id INT UNSIGNED NOT NULL,
name VARCHAR(16)
);
Then you would insert multiple rows into the flag table.
The advantage is that you can use indexes for lookups, and you can have any number of flags per row without changing your schema. This works best for sparse values, where most rows do not have a value set. If every row needs all flags defined, then this isn't very efficient.
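As a sketch of how lookups against this layout might work, assuming the main and flag tables defined above: the index name and the flag names 'active' and 'verified' are invented for illustration.

```sql
-- Index the flag name first so "which rows have flag X" is a keyed lookup.
ALTER TABLE flag ADD KEY idx_flag_name_main (name, main_id);

INSERT INTO main (main_id) VALUES (1), (2);
INSERT INTO flag (main_id, name) VALUES
  (1, 'active'), (1, 'verified'), (2, 'active');

-- Rows that have ALL of the requested flags: count the matching flag names
-- per main_id and keep those that matched both. Here that is main_id 1.
SELECT main_id
FROM flag
WHERE name IN ('active', 'verified')
GROUP BY main_id
HAVING COUNT(DISTINCT name) = 2;
```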
For a performance comparison you can read a blog post I wrote on the topic:
Set Performance Compare
Also when you ask which is "Best" that's a very subjective question. Best at what? It all really depends on what your data looks like and what your requirements are and how you want to query it.
Keep in mind that if you want to do a query like:
SELECT * FROM table WHERE some_flag=true
Indexes will only help you if few rows have that value set. If most of the rows in the table have some_flag=true, then MySQL will ignore indexes and do a full table scan instead.
How many rows of data are you querying over? You can store the boolean values in an integer value and use bit operations to test for them. It's not indexable, but storage is very well packed. Using TINYINT fields with indexes would pick one index to use and scan from there.