Is search on index with less verity faster? - mysql

I have a table with three indexed columns, let's say col1, col2, and col3; they are all INT.
My friend said that if col2 has less verity in its data (I mean it holds only values like 12, 13, 14, while the other columns hold effectively random numbers), it is faster to put the condition on that column first in the WHERE clause, because MySQL will start fetching data from that column first.
So basically he says that the second query is faster:
select * from my_table where col1=1 and col2=2 and col3=3
select * from my_table where col2=2 and col1=1 and col3=3
Is that true? I couldn't find any reading material on the subject.

While "verity" is a term applied to datasets, it has nothing to do with databases. I think you are talking about cardinality.
The order in which you declare predicates in your query has no impact at all on how the optimiser resolves the query. You can easily test this yourself using 'explain'.
The order of columns within a composite index, on the other hand, does have a big impact on performance.
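A minimal way to check both points, assuming the table and column names from the question: run EXPLAIN on the two orderings and compare the key and rows columns, which should come out the same. The composite index below (idx_col2_col1_col3) is hypothetical and is only there to illustrate that the column order inside the index is what matters.

```sql
-- Both statements should produce the same plan: predicate order in WHERE is irrelevant.
EXPLAIN SELECT * FROM my_table WHERE col1 = 1 AND col2 = 2 AND col3 = 3;
EXPLAIN SELECT * FROM my_table WHERE col2 = 2 AND col1 = 1 AND col3 = 3;

-- Hypothetical composite index (name and column order are assumptions).
-- A lookup can use it only when the query constrains a leftmost prefix:
-- col2 alone, col2 + col1, or all three columns.
CREATE INDEX idx_col2_col1_col3 ON my_table (col2, col1, col3);

-- Can seek into the index, because the leading column col2 is constrained:
EXPLAIN SELECT * FROM my_table WHERE col2 = 2 AND col1 = 1;
-- Cannot seek into this index, because col2 is not constrained:
EXPLAIN SELECT * FROM my_table WHERE col1 = 1 AND col3 = 3;
```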

Related

Why does MySQL decide to use an index on a column specified in the ORDER BY clause when it's not present in the WHERE clause?

Why does MySQL decide to use an index on a column specified in the ORDER BY clause although that column is not present in the WHERE clause?
This happens when the ORDER BY and LIMIT clauses are used together in the query.
Example query:
select col1, col2,col3 from table_name where col1 = 'x' and col3='y' order by colY limit 3;
table_name has 9M records.
In the absence of the LIMIT clause, MySQL uses the index on the col1 column, which is way faster.
Better
select col1, col2,col3
from table_name
where col1 = 'x'
and col3 = 'y'
order by col4
limit 3;
The optimal index is one of these two:
INDEX(col1, col3, col4)
INDEX(col3, col1, col4)
In both, the Optimizer can completely resolve the WHERE and do the ORDER BY and even stop after 3 rows due to the LIMIT.
Best. Even better performance would come from adding col2 to the end of either. This makes it a "covering" index, so all the work can be done in the index's BTree without touching the data's BTree.
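As a sketch of the "Best" variant, the covering index could be created and checked as follows. The index name is hypothetical, and what EXPLAIN actually reports depends on the MySQL version and the data.

```sql
-- Hypothetical index name. Column order: the two equality filters,
-- then the sort column, then col2 so the index covers the SELECT list.
ALTER TABLE table_name ADD INDEX idx_col1_col3_col4_col2 (col1, col3, col4, col2);

EXPLAIN
SELECT col1, col2, col3
FROM table_name
WHERE col1 = 'x'
  AND col3 = 'y'
ORDER BY col4
LIMIT 3;
-- What to look for in the output (not guaranteed):
--   key   = idx_col1_col3_col4_col2
--   Extra = contains "Using index" and no "Using filesort"
```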
Back to your question
If you don't have one of those indexes, the Optimizer is in a quandary, and often picks the wrong one of the two likely choices. Let's say you have only
INDEX(col1), INDEX(col4)
Plan A focuses on filtering: Use col1, but have to sort all the matching rows before peeling off 3. But it might get a million rows and have to sort them.
Plan B avoids sorting: Scan through the index in col4 order. If it is really lucky, the first 3 rows will match the WHERE clause. If it is really unlucky, it will scan the entire table without finding 3 acceptable rows. But they will be sorted!
The "statistics" are meager, and cannot realistically decide between the two choices.
Either Plan could be really slow.
Similar problems occur with JOINs with the WHERE clause filtering on both tables.

Are multiple WHERE/AND clauses in MySQL checked sequentially? And are subsequent checks skipped if the first is false?

Scenario
Let's say I have a MySQL query that contains multiple WHERE/AND clauses.
For instance, say I have the query
SELECT * FROM some_table
WHERE col1 = 5
AND col2 = 9
AND col3 LIKE '%string%'
Question
Is the col1 = 5 check done as the first in sequence here, since it's written first? And more importantly, are the other two checks skipped if col1 != 5?
The reason I ask is that the third clause, col3 LIKE '%string%', will take more time to run, and I'm wondering if it makes sense to put it last, since I don't want to run it if one of the first two checks is false.
The SQL optimizer looks at the query as a whole and tries to determine the best query plan for it. The order of the conditions in the WHERE clause does not matter.
The optimal index for that query is
INDEX(col1, col2) -- in either order
Given that, it will definitely check both col1 and col2 simultaneously in order to whittle down the number of rows to a much smaller number. Hence the LIKE will happen only for the few rows that match both col1 and col2. This order of actions is very likely to be optimal.
Often, it is better (though not identical) to use FULLTEXT(col3) and have
WHERE col1 = 5
AND col2 = 9
AND MATCH(col3) AGAINST ("+string" IN BOOLEAN MODE)
In this case, I am pretty sure it will start with the FULLTEXT index to test col3, hoping to get very few rows to double check against the other clauses. Because of various other issues, this is optimal. Any index(es) on col1 and col2 will not be used.
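A minimal sketch of the FULLTEXT variant, assuming the table from the question. The index name ft_col3 is hypothetical. Note the "not identical" caveat: MATCH ... AGAINST matches whole words, so unlike LIKE '%string%' it will not find 'string' embedded inside a longer word (a trailing wildcard such as '+string*' matches prefixes only).

```sql
-- Hypothetical index name. FULLTEXT requires MyISAM, or InnoDB on MySQL 5.6+.
ALTER TABLE some_table ADD FULLTEXT INDEX ft_col3 (col3);

SELECT *
FROM some_table
WHERE col1 = 5
  AND col2 = 9
  AND MATCH(col3) AGAINST ('+string' IN BOOLEAN MODE);

-- Check which index the optimizer actually chose (type = fulltext is expected):
EXPLAIN
SELECT *
FROM some_table
WHERE col1 = 5
  AND col2 = 9
  AND MATCH(col3) AGAINST ('+string' IN BOOLEAN MODE);
```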
The general statement (so far) is: The Optimizer will pick AND clause(s) that it can use for one INDEX first. In that sense, the order of the AND clauses is violated -- as an optimization.
If you don't have any suitable indexes, well, shame on you.
There are many possibilities. I will be happy to discuss individual queries, but it will be hard to make too many generalities.

Issues with a MySQL Index due to a specific part

The query is
SELECT row
FROM `table`
USE INDEX(`indexName`)
WHERE row1 = '0'
AND row2 = '0'
AND row3 >= row4
AND (row5 = '0' OR row5 LIKE 'value')
I created an index for this query using:
CREATE INDEX indexName ON `table` (row1, row2, row3, row5);
However, the performance is not really good. It's extracting about 17,000+ rows out of a 5.9+ million row table in anywhere from 6-12 seconds.
It seems like the bottleneck is the row3 >= row4 - because without that part in the code it runs in 0.6-0.7 seconds.
(from Comment)
The row (placeholder column name) is actually the id (primary key, index) column in the table, which is the result set I'm outputting later on. I'm outputting an array of IDs that are matching the parameters in my query, and then selecting a random ID from that array to gather data through the final query on a specific row. This was done as a workaround for rand(). Any adjustments needed based on that knowledge?
17K rows is not a tiny result set. Large result sets often take time just because of the overhead of delivering the data from the MySQL server to the program requesting them.
The contents of the 'value' you use in row5 LIKE 'value' matter a great deal to query performance. If 'value' starts with a wildcard character like % your query will be slow.
That being said, you need a so-called covering index. You've tried to create one with the index you created. It's close but not perfect.
Your query filters on equality to constant values on row1, row2, and row5, so those columns should come first in your index. The query planner can random-access your index to the first matching entry, and then sequentially scan the index until it gets to the last matching entry. That's as fast as it gets.
Then you want to examine row3 and row4 (to compare them). Those columns should come next in the index. Finally, if your query's SELECT clause mentions a subset of the columns in your table you should put the rest of those columns in the index. So, based on the query in your question, your index should be
CREATE INDEX indexName ON `table` (row1, row2, row5, row3, row4, row);
The query planner will be able to satisfy the entire query by scanning through a subset of the index, using a so-called index range scan. That should be decently fast.
Pro tip: don't force the query planner's hand with USE INDEX(). Instead, structure your indexes to handle your queries efficiently.
An index can't be used to compare two columns in the same table (at best, it could be used for an index scan rather than a table scan if all output fields are contained in the index), so there basically is no "correct" way to do this.
If you have control over the structure AND the processes that fill the table, you could add a calculated field that holds the difference between the two fields. Then add that field to the index and adjust your query to use that field instead of the other two.
It ain't pretty and doesn't offer a lot of flexibility (eg. if you want to compare another field, you need to add it as well etc), but it does get the job done.
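One way to maintain such a calculated field without touching the processes that fill the table is a stored generated column. This is a substitution for the manually maintained field described above, and it assumes MySQL 5.7 or later and integer row3/row4; the names diff34 and idx_row1_row2_diff34 are hypothetical.

```sql
-- BIGINT leaves headroom so the difference of two INT values cannot overflow.
ALTER TABLE `table`
  ADD COLUMN diff34 BIGINT AS (row3 - row4) STORED,
  ADD INDEX idx_row1_row2_diff34 (row1, row2, diff34);

SELECT `row`
FROM `table`
WHERE row1 = '0'
  AND row2 = '0'
  AND diff34 >= 0          -- same rows as row3 >= row4 (NULLs excluded either way)
  AND (row5 = '0' OR row5 LIKE 'value');
```

With the two equality columns first and the range column last, the index can be used for a range scan on diff34; row5 is still checked per row.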
(This is an adaptation of http://mysql.rjweb.org/doc.php/random )
Let's actually fold the randomization into the query. This will eliminate gathering a bunch of ids, processing them, and then reaching back into the table. It will also avoid the need for an extra index.
Find min and max id values.
Pick a random id between min and max.
Scan forward, looking for the first row with col1...col5 matching the criteria.
Something like...
SELECT b.*   -- should replace with actual list of columns
    FROM
      ( SELECT id
            FROM tbl
            WHERE id >= ( SELECT MIN(id) +
                                 ( MAX(id) - MIN(id)
                                   - 22       -- somewhat avoids running off end
                                 ) * RAND()
                              FROM tbl )
              AND col1 = 0 ...     -- your various criteria
            ORDER BY id
            LIMIT 1
      ) AS a
    JOIN tbl AS b  USING(id);
Pros/cons:
Probably faster than anything else you can devise.
If the RAND() value lands too late in the table, the query will return nothing. In this (rare) case, run the query again, but starting at 0.
Big gaps in id will lead to a bias in which id is returned. (The link above discusses some kludges to handle such.)

Can MySQL use Indexes when there is OR between conditions?

I have two queries, each with its own EXPLAIN results:
One:
SELECT *
FROM notifications
WHERE id = 5204 OR seen = 3
Benchmark (for 10,000 rows): 0.861
Two:
SELECT h.* FROM ((SELECT n.* from notifications n WHERE id = 5204)
UNION ALL
(SELECT n.* from notifications n WHERE seen = 3)) h
Benchmark (for 10,000 rows): 2.064
The results of the two queries above are identical. I also have these two indexes on the notifications table:
notifications(id) -- this is PK
notifications(seen)
As you know, OR usually prevents effective use of indexes. That's why I wrote the second query (with UNION). But after some tests I found that using OR is still much faster than using UNION. So I'm confused and I really cannot choose the best option in my case.
Based on some logical and reasonable explanations, using union is better, but the result of benchmark says using OR is better. Could you please help me decide which approach I should use?
The query plan for the OR case appears to indicate that MySQL is indeed using indexes, so evidently yes, it can do, at least in this case. That seems entirely reasonable, because there is an index on seen, and id is the PK.
Based on some logical and reasonable explanations, using union is better, but the result of benchmark says using OR is better.
If "logical and reasonable explanations" are contradicted by reality, then it is safe to assume that the logic is flawed or the explanations are wrong or inapplicable. Performance is notoriously difficult to predict; performance testing is essential where speed is important.
Could you please help me decide which approach I should use?
You should use the one that tests faster on input that adequately models that which the program will see in real use.
Note also, however, that your two queries are not semantically equivalent: if the row with id = 5204 also has seen = 3 then the OR query will return it once, but the UNION ALL query will return it twice. It is pointless to choose between correct code and incorrect code on any basis other than which one is correct.
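Two possible ways to make the UNION form return the same rows as the OR form, sketched against the tables and columns from the question:

```sql
-- Option 1: UNION (without ALL) removes duplicate rows,
-- at the cost of a de-duplication step.
(SELECT n.* FROM notifications n WHERE id = 5204)
UNION
(SELECT n.* FROM notifications n WHERE seen = 3);

-- Option 2: keep UNION ALL, but exclude the overlapping row
-- from the second branch so it can only come from the first.
(SELECT n.* FROM notifications n WHERE id = 5204)
UNION ALL
(SELECT n.* FROM notifications n WHERE seen = 3 AND id <> 5204);
```

Because id is the primary key, every row is unique, so both options return each matching row exactly once.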
index_merge, as the name suggests, combines the primary keys from two indexes using a Sort Merge Join (for AND conditions) or a Sort Merge Union (for OR conditions), and then looks up the rest of the values in the table by PK.
For this to work, the conditions on both indexes must be such that each index yields primary keys in order (your conditions are).
You can find the strict definition of the conditions in the docs, but in a nutshell, you should filter by all parts of the index with an equality condition, plus possibly <, =, or > on the PK.
If you have an index on (col1, col2, col3), this should be col1 = :val1 AND col2 = :val2 AND col3 = :val3 [ AND id > :id ] (the part in the square brackets is not necessary).
The following conditions won't work:
col1 = :val1 -- you omit col2 and col3
col1 = :val1 AND col2 = :val2 AND col3 > :val3 -- you can only use equality on key parts
As a free side effect, your output is sorted by id.
You could achieve similar results using this:
SELECT *
FROM (
SELECT 5204 id
UNION ALL
SELECT id
FROM mytable
WHERE seen = 3
AND id <> 5204
) q
JOIN mytable m
ON m.id = q.id
, except that in earlier versions of MySQL the derived table would have to be materialized which would definitely make the query performance worse, and your results would not have been ordered by id anymore.
In short, if your query allows index_merge(union), go for it.
The answer is contained in your question. The EXPLAIN output for OR says Using union(PRIMARY, seen) - that means that the index_merge optimization is being used and the query is actually executed by unioning results from the two indexes.
So MySQL can use an index in some cases, and it does in this one. But index_merge is not always available, or is not used because the index statistics say it won't be worth it. In those cases OR may be a lot slower than UNION (or not; always check both versions if you are not sure).
In your test you "got lucky" and MySQL did the right optimization for you automatically. It is not always so.
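To see whether a given OR query is getting index_merge, and what it would cost without it, something like the following can be run in a test session. It uses the query from the question and the standard optimizer_switch flag index_merge_union.

```sql
-- Look for type = index_merge and "Using union(PRIMARY,seen)" in Extra.
EXPLAIN SELECT * FROM notifications WHERE id = 5204 OR seen = 3;

-- For comparison only: disable the union variant of index_merge in this session.
SET SESSION optimizer_switch = 'index_merge_union=off';
EXPLAIN SELECT * FROM notifications WHERE id = 5204 OR seen = 3;

-- Restore the default setting afterwards.
SET SESSION optimizer_switch = 'index_merge_union=default';
```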

MySQL simple but slow query (wrong indexes?)

I'm working on a simple query that runs in about 1.2 seconds on a MyISAM table populated with 126,000 records:
SELECT * FROM my_table
WHERE primary_key != 5 AND
(
col1 = 528 OR (col2 = 265 AND col3 = 1)
)
ORDER BY primary_key DESC
I have already created single-column indexes for each field used in the WHERE clause, but only primary_key (the auto-increment field of my_table) is used as the key, while col1 and col2 are just ignored and the query becomes much slower. How should I create the indexes (maybe multi-column indexes) or edit the query?
You will get optimal performance if you have the following multi-column "covering" indexes:
(primary_key, col1)
(primary_key, col2, col3)
And issue the following query:
(SELECT * FROM my_table
WHERE primary_key != 5 AND
col1 = 528)
UNION
(SELECT * FROM my_table
WHERE primary_key != 5 AND
col2 = 265 AND col3 = 1)
ORDER BY primary_key DESC
You may get different, possibly better, performance by changing the order of the fields in the indexes, based on cardinality.
In your original query, no index could be used for the entire selection in the WHERE clause, which caused partial table scanning.
In the above query, the first subquery is able to utilize the first index completely, avoiding any table scanning, and the second subquery uses the second index.
Unfortunately, MySQL won't be able to utilize an index to sort the records on the full result set, and will probably use filesort to order them. So, if you don't need the records ordered by primary_key, remove the outer ORDER clause for better performance, though if the result set is small, it shouldn't be an issue either way.
Use EXPLAIN to find out what is going on. This will identify what indexes are being used and hence enable you to tune the query.
Unfortunately, there is almost no way to predict which indexes will work better than others without a thorough understanding of MySQL internals. However, for each table in a query (or sub-query/join), MySQL will generally use only one index, with index_merge being the main exception. You have stated that the primary key is being used, so I assume you have looked at the EXPLAIN output.
You will likely want to try multi-column indexes on either all of (primary_key, col1, col2, col3) or a subset of these, in different orders, to find the best result. The best index will depend primarily on the cardinality of the columns and thus may even change over time depending on the data in the table.