Is it useful for SELECT performance to set an index on a field that contains only distinct values?
e.g.:
order_id
--------
98317490
10928343
82931376
93438473
...
That depends. An index is useful if you often search on this column:
WHERE column=value
WHERE column BETWEEN a AND b
The usefulness of an index is determined by its selectivity. For example, if your column contains a boolean, which is:
false in 99.9% of rows
true in 0.1% of rows
Then you can easily guess that using an index to find "true" values will be a huge boost relative to reading the entire table to search for them.
On the other hand, searching for "false" using an index will be slower than not using one: since you are going to read almost the whole table anyway, you might as well not bother to process the index too.
If the values are all distinct, then selectivity is at its maximum and an index will be very useful. That is, assuming you actually search on that column!
An index that is never used only slows down updates.
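As a rough illustration of the point above, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for MySQL (an assumption: the two planners differ, and MySQL's EXPLAIN output looks different). The table name orders, the note column and the index name idx_orders_order_id are made up for the example. It shows the same equality search being planned as a full scan before the index exists and as an index search afterwards.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER NOT NULL, note TEXT)")
conn.executemany(
    "INSERT INTO orders (order_id, note) VALUES (?, ?)",
    [(98317490, "a"), (10928343, "b"), (82931376, "c"), (93438473, "d")],
)


def plan(sql):
    """Return the 'detail' column of SQLite's EXPLAIN QUERY PLAN as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)


query = "SELECT note FROM orders WHERE order_id = 82931376"

# Without an index the only option is to read every row.
plan_without_index = plan(query)

# With an index on the all-distinct column, the planner can seek directly.
conn.execute("CREATE INDEX idx_orders_order_id ON orders (order_id)")
plan_with_index = plan(query)

print(plan_without_index)
print(plan_with_index)
```

In MySQL the equivalent check would be to run EXPLAIN on the query before and after creating the index and compare the type and key columns.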
Of course it is useful, as with all indexes: it is useful if you have SELECT statements that use this field in the WHERE clause.
Whether this field has distinct values or not doesn't really matter.
Note that if your field is marked as UNIQUE or PRIMARY KEY in the database, the database will technically already have an index for this field, so adding another index for it will not change anything.
Reading the MySQL docs we see this example table with multiple-column index name:
CREATE TABLE test (
id INT NOT NULL,
last_name CHAR(30) NOT NULL,
first_name CHAR(30) NOT NULL,
PRIMARY KEY (id),
INDEX name (last_name,first_name)
);
The documentation explains, with examples, in which cases the index will or will not be used. For example, it will be used for a query like this:
SELECT * FROM test
WHERE last_name='Widenius' AND first_name='Michael';
My question is, would it work for this query (which is effectively the same):
SELECT * FROM test
WHERE first_name='Michael' AND last_name='Widenius';
I couldn't find any word about that in the documentation - does MySQL try to swap columns to find appropriate index or is it all up to the query?
It should be the same because, according to the MySQL documentation, the query optimizer works as follows:
Each table index is queried, and the best index is used unless the
optimizer believes that it is more efficient to use a table scan. At
one time, a scan was used based on whether the best index spanned more
than 30% of the table, but a fixed percentage no longer determines the
choice between using an index or a scan. The optimizer now is more
complex and bases its estimate on additional factors such as table
size, number of rows, and I/O block size.
http://dev.mysql.com/doc/refman/5.7/en/where-optimizations.html
In some cases, MySQL can read rows from the index without even
consulting the data file.
and this should be your case:
Without ICP, the storage engine traverses the index to locate rows in
the base table and returns them to the MySQL server which evaluates
the WHERE condition for the rows. With ICP enabled, and if parts of
the WHERE condition can be evaluated by using only fields from the
index, the MySQL server pushes this part of the WHERE condition down
to the storage engine. The storage engine then evaluates the pushed
index condition by using the index entry and only if this is satisfied
is the row read from the table. ICP can reduce the number of times the
storage engine must access the base table and the number of times the
MySQL server must access the storage engine.
http://dev.mysql.com/doc/refman/5.7/en/index-condition-pushdown-optimization.html
For the two queries you stated, it will work the same.
However, for queries which have only one of the columns, the order of the index matters.
For example, this will use the index:
SELECT * FROM test WHERE last_name='Widenius';
But this won't:
SELECT * FROM test WHERE first_name='Michael';
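The leftmost-prefix behaviour described above can be checked with a small sketch. It uses Python's sqlite3 module as a stand-in for MySQL (an assumption: SQLite's planner is not MySQL's, but it treats this case the same way), and the plan() helper is a hypothetical convenience function, not part of any library.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE test (
        id INT NOT NULL,
        last_name CHAR(30) NOT NULL,
        first_name CHAR(30) NOT NULL,
        PRIMARY KEY (id)
    )
    """
)
# SQLite has no inline INDEX clause, so the index is created separately.
conn.execute("CREATE INDEX name ON test (last_name, first_name)")


def plan(sql):
    """Return the 'detail' column of SQLite's EXPLAIN QUERY PLAN as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)


# Same two predicates, written in both orders.
both_in_order = plan(
    "SELECT * FROM test WHERE last_name='Widenius' AND first_name='Michael'"
)
both_swapped = plan(
    "SELECT * FROM test WHERE first_name='Michael' AND last_name='Widenius'"
)
# Only the leading column of the index, then only the trailing one.
last_only = plan("SELECT * FROM test WHERE last_name='Widenius'")
first_only = plan("SELECT * FROM test WHERE first_name='Michael'")

for label, detail in [
    ("both, in index order", both_in_order),
    ("both, swapped", both_swapped),
    ("last_name only", last_only),
    ("first_name only", first_only),
]:
    print(f"{label}: {detail}")
```

The order of predicates in the WHERE clause does not matter to the planner; what matters is whether the predicates cover a leading prefix of the index columns.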
CREATE TABLE `discount_base` (
`id` varchar(12) COLLATE utf8_unicode_ci NOT NULL,
`amount` decimal(13,4) NOT NULL,
`description` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`family` varchar(4) COLLATE utf8_unicode_ci NOT NULL,
`customer_id` varchar(8) COLLATE utf8_unicode_ci NOT NULL,
PRIMARY KEY (`id`),
KEY `IDX_CUSTOMER` (`customer_id`),
KEY `IDX_FAMILY_CUSTOMER_AMOUNT` (`family`,`customer_id`,`amount`),
CONSTRAINT `FK_CUSTOMER` FOREIGN KEY (`customer_id`)
REFERENCES `customer` (`id`) ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
I've added a covering index IDX_FAMILY_CUSTOMER_AMOUNT on family, customer_id and amount because most of the time I use the following query:
SELECT amount FROM discount_base WHERE family = :family AND customer_id = :customer_id
However, using EXPLAIN with a bunch of records (~250000), it says:
'1', 'SIMPLE', 'discount_base', 'ref', 'IDX_CUSTOMER,IDX_FAMILY_CUSTOMER_AMOUNT', 'IDX_FAMILY_CUSTOMER_AMOUNT', '40', 'const,const', '1', 'Using where; Using index'
Why am I getting Using where; Using index instead of just Using index?
EDIT: Fiddle with a small amount of data (Using where; Using index):
EXPLAIN SELECT amount
FROM discount_base
WHERE family = '0603' and customer_id = '20000275';
Another fiddle where id is family + customer_id (const):
EXPLAIN SELECT amount
FROM discount_base
WHERE `id` = '060320000275';
Interesting problem. It would seem "obvious" that the IDX_FAMILY_CUSTOMER_AMOUNT index would be used for this query:
SELECT amount
FROM discount_base
WHERE family = :family AND customer_id = :customer_id;
"Obvious" to us people, but clearly not to the optimizer. What is happening?
This aspect of index usage is poorly documented. I (intelligently) speculate that when doing comparisons on strings using case-insensitive collations (and perhaps others), the = operation is really more like an in. Something sort of like this, conceptually:
WHERE family in (lower(:family), upper(:family), . . .) and . . .
This is conceptual. But it means that an index scan is required for the = rather than an index lookup. A minor change typographically, but a very important one semantically: it prevents the use of the second key. Yup, that is an unfortunate consequence of inequalities, even when they look like =.
So, the optimizer compares the two possible indexes, and it decides that customer_id is more selective than family, and chooses the former.
Alas, both of your keys are case-insensitive strings. My suggestion would be to replace at least one of them with an auto-incrementing integer id. In fact, my suggestion is that basically all tables have an auto-incrementing integer id, which is then used for all foreign key references.
Another solution would be to use a trigger to create a single column CustomerFamily with the values concatenated together. Then this index:
KEY IDX_CUSTOMERFAMILY_AMOUNT (CustomerFamily, amount)
should do what you want. It is also possible that a case-sensitive collation would solve the problem.
Are family and customer_id strings? I guess you could be passing customer_id as an integer, which could cause a type conversion to take place, so the index would not be used for that particular column.
Ensure you pass customer_id as a string, or consider changing your table to store customer_id as INT.
If you are using alphanumeric IDs, then this doesn't apply.
I'm pretty sure Using index is the important part, and it means "using a covering index".
Two things to further check:
EXPLAIN FORMAT=JSON SELECT ...
may give further clues.
FLUSH STATUS;
SELECT ...;
SHOW SESSION STATUS LIKE 'Handler%';
will show you how many rows were read/written/etc. in various ways. If some number is about 250000 (in your case), it indicates a table scan. If all the numbers are small (approximately the number of rows returned by the query), then you can be assured that it executed the query efficiently.
The numbers there do not distinguish between reads of an index and reads of data. But they ignore caching. Timings (for two identical runs) can differ significantly due to caching; Handler% values won't change.
The answer to your question relies on what the engine is actually using your index for.
In the given query, you ask the engine to:
Look up values (WHERE/JOIN)
Retrieve information (SELECT) based on this lookup result
For the first part, as soon as you filter the results (lookup), there's an entry in Extra indicating USING WHERE, so this is the reason you see it in your explain plan.
For the second part, the engine does not need to go anywhere out of one given index because it is a covering index. The explain plan notifies it by showing USING INDEX. This USING INDEX hint, combined with USING WHERE, means your index is also used in the lookup portion of the query, as explained in mysql documentation:
https://dev.mysql.com/doc/refman/5.0/en/explain-output.html
Using index
The column information is retrieved from the table using only
information in the index tree without having to do an additional seek
to read the actual row. This strategy can be used when the query uses
only columns that are part of a single index.
If the Extra column also says Using where, it means the index is being
used to perform lookups of key values. Without Using where, the
optimizer may be reading the index to avoid reading data rows but not
using it for lookups. For example, if the index is a covering index
for the query, the optimizer may scan it without using it for lookups.
Check this fiddle:
http://sqlfiddle.com/#!9/8cdf2/10
I removed the where clause and the query now displays USING INDEX only. This is because no lookup is necessary in your table.
The MySQL documentation on EXPLAIN has this to say:
Using index
The column information is retrieved from the table using
only information in the index tree without having to do an additional
seek to read the actual row. This strategy can be used when the query
uses only columns that are part of a single index.
If the Extra column
also says Using where, it means the index is being used to perform
lookups of key values. Without Using where, the optimizer may be
reading the index to avoid reading data rows but not using it for
lookups. For example, if the index is a covering index for the query,
the optimizer may scan it without using it for lookups.
My best guess, based on the information you have provided, is that the optimizer first uses your IDX_CUSTOMER index and then performs a key lookup to retrieve non-key data (amount and family) from the actual data page based on the key (customer_id).
This is most likely caused by the cardinality (i.e. uniqueness) of the columns in your indexes. You should check the cardinality of the columns used in your WHERE clause and put the one with the highest cardinality first in your index. Guessing from the column names and your current results, customer_id has the highest cardinality.
So change this:
KEY `IDX_FAMILY_CUSTOMER_AMOUNT` (`family`,`customer_id`,`amount`)
to this:
KEY `IDX_FAMILY_CUSTOMER_AMOUNT` (`customer_id`,`family`,`amount`)
After making the change, you should run ANALYZE TABLE. This will update the table statistics, which can affect the choices the optimizer makes regarding your indexes.
This sounds fine. According to the MySQL documentation:
If the Extra column also says Using where, it means the index is being
used to perform lookups of key values. Without Using where, the
optimizer may be reading the index to avoid reading data rows but not
using it for lookups. For example, if the index is a covering index
for the query, the optimizer may scan it without using it for lookups.
That means that Using index alone would read the entire index to retrieve the results, without using the index structure to find specific values. You could probably get this with SELECT family, customer_id, amount FROM discount_base. Using where; Using index means the optimizer exploits the index to find and retrieve rows matching the query parameters (family, customer_id).
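The distinction between looking values up in a covering index and merely scanning it can be sketched with Python's sqlite3 module. This is an analogy under stated assumptions, not MySQL itself: SQLite reports SEARCH ... USING COVERING INDEX for a lookup and SCAN ... USING COVERING INDEX for a full index read, which roughly corresponds to Using where; Using index versus Using index alone. The table is a simplified copy of discount_base without IDX_CUSTOMER, the foreign key and the collation settings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified version of the question's table (assumption: IDX_CUSTOMER,
# the foreign key and the collation are left out to keep the plan unambiguous).
conn.execute(
    """
    CREATE TABLE discount_base (
        id VARCHAR(12) NOT NULL PRIMARY KEY,
        amount DECIMAL(13,4) NOT NULL,
        description VARCHAR(255),
        family VARCHAR(4) NOT NULL,
        customer_id VARCHAR(8) NOT NULL
    )
    """
)
conn.execute(
    "CREATE INDEX IDX_FAMILY_CUSTOMER_AMOUNT "
    "ON discount_base (family, customer_id, amount)"
)


def plan(sql):
    """Return the 'detail' column of SQLite's EXPLAIN QUERY PLAN as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)


# Filtering on the leading index columns: the index is used for the lookup
# and also supplies the selected column.
lookup_plan = plan(
    "SELECT amount FROM discount_base "
    "WHERE family = '0603' AND customer_id = '20000275'"
)

# No filter: the index is read from start to end only to avoid the table.
scan_plan = plan("SELECT family, customer_id, amount FROM discount_base")

print(lookup_plan)
print(scan_plan)
```

Both plans are index-only; the difference is whether the index tree is used to find specific keys or is simply read in full.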
This indeed could be a problem.
Note that there might be millions of strings matched by a single index key when using the utf8_unicode_ci collation. For example, all these letters are matched by the same index key:
A, a, À, Á, Â, Ã, Ä, Å, à, á, â, ã, ä, å, Ā, ā, Ă, ă, Ą, ą, Ǎ, ǎ, Ǟ, ǟ, Ǡ, ǡ, Ǻ, ǻ, Ȁ, ȁ, Ȃ, ȃ, Ȧ, ȧ, Ḁ, ḁ, Ạ, ạ, Ả, ả, Ấ, ấ, Ầ, ầ, Ẩ, ẩ, Ẫ, ẫ, Ậ, ậ, Ắ, ắ, Ằ, ằ, Ẳ, ẳ, Ẵ, ẵ, Ặ, ặ.
And there are substantial grounds for believing that when processing a query using a CHAR/VARCHAR index, MySQL, in addition to the regular index lookup, performs a full linear scan of all the values matched by the index to make sure each one is indeed matched by the original query parameter. This may really be needed when the index collation and the WHERE collation do not match, but I don't know why it does so all the time, even when this is clearly not needed (in your case, for example).
See this question for evidence and additional details: Why performance of MySQL queries are so bad when using a CHAR/VARCHAR index?
I would only recommend this solution:
Remove amount from index like:
KEY `IDX_FAMILY_CUSTOMER_AMOUNT` (`family`,`customer_id`)
When you run the query, hint the index with USE INDEX:
USE INDEX (`IDX_FAMILY_CUSTOMER_AMOUNT`)
That trick allows you to avoid Using where. Hopefully performance will also be at an acceptable level:
http://sqlfiddle.com/#!9/86f46/2
SELECT amount FROM discount_base
USE INDEX (`IDX_FAMILY_CUSTOMER_AMOUNT`)
WHERE family = '1' AND customer_id = '1'
Based on the fiddle provided, it appears that only numeric values are being used for family and customer id. If this assumption is correct, changing these columns to numeric and using just a single key on customer and family appears to have resolved the issue.
Please check this fiddle
I have a MySQL DB with two columns, 'Key' and 'Used'. Key is a string, Used is an integer. Is there a very fast way to search for a specific Key and then return its Used value in a huge MySQL DB with 6000000 rows of data?
You can make it fast by creating an index on the key field:
CREATE INDEX mytable_key_idx ON mytable (`key`);
You can actually make it even faster for reading by creating a covering index on both the (key, used) fields:
CREATE INDEX mytable_key_used_idx ON mytable (`key`, `used`);
In this case, when reading, MySQL could retrieve the used value from the index itself, without reading the table (index-only scan). However, if you have a lot of write activity, a covering index may be slower because MySQL now has to update both the index and the actual table.
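A minimal sketch of the difference between the two indexes, again using Python's sqlite3 module as a stand-in for MySQL (an assumption; the sample rows are made up). With the single-column index the planner still has to visit the table to fetch used; with the two-column index it reports a covering index, meaning the value comes from the index alone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (`key` TEXT NOT NULL, used INTEGER NOT NULL)")
# Made-up sample rows for illustration only.
conn.executemany(
    "INSERT INTO mytable (`key`, used) VALUES (?, ?)",
    [("alpha", 0), ("beta", 1), ("particularvalue", 7)],
)


def plan(sql):
    """Return the 'detail' column of SQLite's EXPLAIN QUERY PLAN as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)


query = "SELECT used FROM mytable WHERE `key` = 'particularvalue'"

# Index on the search column only: lookup via index, then a table read.
conn.execute("CREATE INDEX mytable_key_idx ON mytable (`key`)")
plan_single = plan(query)

# Replace it with an index that also contains the selected column.
conn.execute("DROP INDEX mytable_key_idx")
conn.execute("CREATE INDEX mytable_key_used_idx ON mytable (`key`, used)")
plan_covering = plan(query)

value = conn.execute(query).fetchone()[0]

print(plan_single)
print(plan_covering)
print(value)
```

The first index is dropped before creating the second only to keep the plan unambiguous; in MySQL the (key, used) index also makes the single-column one redundant.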
The normative SQL for that would be:
SELECT t.key, t.used FROM mytable t WHERE t.key = 'particularvalue' ;
The output from
EXPLAIN
SELECT t.key, t.used FROM mytable t WHERE t.key = 'particularvalue' ;
Would give details about the access plan, what indexes are being considered, etc.
The output from a
SHOW CREATE TABLE mytable ;
would give information about the table, the engine being used and the available indexes, as well as the datatypes.
Slow performance on a query like this is usually indicative of a suboptimal access plan, either because suitable indexes are not available or because they are not being used. Sometimes a character set mismatch between the column datatype and the literal datatype in the predicate can make an index "unusable" by a particular query.
What is the best solution for storing boolean values in a database if you want better query performance and minimal memory use in SELECT statements?
For example:
I have a table with 36 fields, 30 of which hold boolean values (zero or one), and I need to search for records using the boolean fields that have true values.
SELECT * FROM `myTable`
WHERE
`field_5th` = 1
AND `field_12th` = 1
AND `field_20` = 1
AND `field_8` = 1
Is there any solution?
If you want to store boolean values or flags there are basically three options:
Individual columns
This is reflected in your example above. The advantage is that you will be able to put indexes on the flags you intend to use most often for lookups. The disadvantage is that this will take up more space (since the minimum column size that can be allocated is 1 byte.)
However, if your column names are really going to be field_20, field_21, etc., then this is absolutely NOT the way to go. Numbered columns are a sign you should use either of the other two methods.
Bitmasks
As was suggested above you can store multiple values in a single integer column. A BIGINT column would give you up to 64 possible flags.
Values would be something like:
UPDATE table SET flags = flags | b'100';
UPDATE table SET flags = flags | b'10000';
Then the field would look something like: 10100
That would represent having two flag values set. To query for any particular flag value set, you would do
SELECT flags FROM table WHERE flags & b'100';
The advantage of this is that your flags are very compact space-wise. The disadvantage is that you can't place indexes on the field which would help improve the performance of searching for specific flags.
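The bit arithmetic behind the bitmask approach can be sketched in plain Python. The flag names FLAG_A and FLAG_B and the helper functions are hypothetical, chosen only to mirror the b'100' and b'10000' values used above: setting a flag is a bitwise OR, clearing it is an AND with the inverted mask, and testing it is a bitwise AND.

```python
# Hypothetical flag constants matching the bit patterns used in the text.
FLAG_A = 0b100
FLAG_B = 0b10000


def set_flag(flags, flag):
    """Turn a flag on without touching the other bits."""
    return flags | flag


def clear_flag(flags, flag):
    """Turn a flag off without touching the other bits."""
    return flags & ~flag


def has_flag(flags, flag):
    """Return True if the given flag bit is set."""
    return (flags & flag) != 0


flags = 0
flags = set_flag(flags, FLAG_A)
flags = set_flag(flags, FLAG_B)

print(format(flags, "b"))                              # 10100
print(has_flag(flags, FLAG_A))                         # True
print(has_flag(clear_flag(flags, FLAG_A), FLAG_A))     # False
```

The same test expressed in SQL is the flags & b'100' predicate shown above, which has to be evaluated row by row and therefore cannot use an ordinary index.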
One-to-many relationship
This is where you create another table, and each row there would have the id of the row it's linked to, and the flag:
CREATE TABLE main (
main_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY
);
CREATE TABLE flag (
main_id INT UNSIGNED NOT NULL,
name VARCHAR(16)
);
Then you would insert multiple rows into the flag table.
The advantage is that you can use indexes for lookups, and you can have any number of flags per row without changing your schema. This works best for sparse values, where most rows do not have a value set. If every row needs all flags defined, then this isn't very efficient.
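To find rows that have several flags set at once with this layout, one common pattern is to filter the flag table and count matches per main_id. The sketch below uses Python's sqlite3 module as a stand-in for MySQL; the composite primary key, the flag_name_idx index, the sample rows and the rows_with_all_flags helper are all assumptions added for illustration, not part of the original schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE main (main_id INTEGER PRIMARY KEY AUTOINCREMENT);
    CREATE TABLE flag (
        main_id INTEGER NOT NULL REFERENCES main (main_id),
        name VARCHAR(16) NOT NULL,
        PRIMARY KEY (main_id, name)       -- assumption: one row per flag per main row
    );
    CREATE INDEX flag_name_idx ON flag (name, main_id);  -- assumption: lookup by flag name
    """
)
# Made-up sample data: rows 1 and 3 have both field_5 and field_12 set.
conn.executemany("INSERT INTO main (main_id) VALUES (?)", [(1,), (2,), (3,)])
conn.executemany(
    "INSERT INTO flag (main_id, name) VALUES (?, ?)",
    [
        (1, "field_5"), (1, "field_12"),
        (2, "field_5"),
        (3, "field_12"), (3, "field_5"), (3, "field_20"),
    ],
)


def rows_with_all_flags(names):
    """Return main_id values that have every flag in `names` (names must be distinct)."""
    placeholders = ", ".join("?" for _ in names)
    sql = (
        "SELECT main_id FROM flag "
        f"WHERE name IN ({placeholders}) "
        "GROUP BY main_id "
        "HAVING COUNT(DISTINCT name) = ? "
        "ORDER BY main_id"
    )
    return [row[0] for row in conn.execute(sql, (*names, len(names)))]


matches = rows_with_all_flags(["field_5", "field_12"])
print(matches)  # [1, 3]
```

The SQL itself (IN, GROUP BY, HAVING COUNT(DISTINCT ...)) is written so that it should carry over to MySQL unchanged.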
For a performance comparison you can read a blog post I wrote on the topic:
Set Performance Compare
Also when you ask which is "Best" that's a very subjective question. Best at what? It all really depends on what your data looks like and what your requirements are and how you want to query it.
Keep in mind that if you want to do a query like:
SELECT * FROM table WHERE some_flag=true
Indexes will only help you if few rows have that value set. If most of the rows in the table have some_flag=true, then MySQL will ignore indexes and do a full table scan instead.
How many rows of data are you querying over? You can store the boolean values in an integer value and use bit operations to test for them. It's not indexable, but the storage is very well packed. Using TINYINT fields with indexes, MySQL would pick one index to use and scan from there.