I have two queries plus its own EXPLAIN's results:
One:
SELECT *
FROM notifications
WHERE id = 5204 OR seen = 3
Benchmark (for 10,000 rows): 0.861
Two:
SELECT h.* FROM ((SELECT n.* from notifications n WHERE id = 5204)
UNION ALL
(SELECT n.* from notifications n WHERE seen = 3)) h
Benchmark (for 10,000 rows): 2.064
The result of two queries above is identical. Also I have these two indexes on notifications table:
notifications(id) -- this is PK
notification(seen)
As you know, OR usually prevents effective use of indexes, That's why I wrote second query (by UNION). But after some tests I figured it out which still using OR is much faster that using UNION. So I'm confused and I really cannot choose the best option in my case.
Based on some logical and reasonable explanations, using union is better, but the result of benchmark says using OR is better. May you please help me should I use which approach?
The query plan for the OR case appears to indicate that MySQL is indeed using indexes, so evidently yes, it can do, at least in this case. That seems entirely reasonable, because there is an index on seen, and id is the PK.
Based on some logical and reasonable explanations, using union is better, but the result of benchmark says using OR is better.
If "logical and reasonable explanations" are contradicted by reality, then it is safe to assume that the logic is flawed or the explanations are wrong or inapplicable. Performance is notoriously difficult to predict; performance testing is essential where speed is important.
May you please help me should I use which approach?
You should use the one that tests faster on input that adequately models that which the program will see in real use.
Note also, however, that your two queries are not semantically equivalent: if the row with id = 5204 also has seen = 3 then the OR query will return it once, but the UNION ALL query will return it twice. It is pointless to choose between correct code and incorrect code on any basis other than which one is correct.
index_merge, as the name suggests, combines the primary keys of two indexes using the Sort Merge Join or Sort Merge Union for AND and OR conditions, appropriately, and then looks up the rest of the values in the table by PK.
For this to work, conditions on both indexes should be so that each index would yield primary keys in order (your conditions are).
You can find the strict definition of the conditions in the docs, but in a nutshell, you should filter by all parts of the index with an equality condition, plus possibly <, =, or > on the PK.
If you have an index on (col1, col2, col3), this should be col1 = :val1 AND col2 = :val2 AND col3 = :val3 [ AND id > :id ] (the part in the square brackets is not necessary).
The following conditions won't work:
col1 = :val1 -- you omit col2 and col3
col1 = :val1 AND col2 = :val2 AND col3 > :val3 -- you can only use equality on key parts
As a free side effect, your output is sorted by id.
You could achieve the similar results using this:
SELECT *
FROM (
SELECT 5204 id
UNION ALL
SELECT id
FROM mytable
WHERE seen = 3
AND id <> 5204
) q
JOIN mytable m
ON m.id = q.id
, except that in earlier versions of MySQL the derived table would have to be materialized which would definitely make the query performance worse, and your results would not have been ordered by id anymore.
In short, if your query allows index_merge(union), go for it.
The answer is contained in your question. The EXPLAIN output for OR says Using union(PRIMARY, seen) - that means that the index_merge optimization is being used and the query is actually executed by unioning results from the two indexes.
So MySQL can use index in some cases and it does in this one. But the index_merge is not always available or is not used because the statistics of the indexes say it won't be worth it. In those cases OR may be a lot slower than UNION (or not, you need to always check both versions if you are not sure).
In your test you "got lucky" and MySQL did the right optimization for you automatically. It is not always so.
Related
What is the best way to create index when I have a query like this?
... WHERE (user_1 = '$user_id' OR user_2 = '$user_id') ...
I know that only one index can be used in a query so I can't create two indexes, one for user_1 and one for user_2.
Also could solution for this type of query be used for this query?
WHERE ((user_1 = '$user_id' AND user_2 = '$friend_id') OR (user_1 = '$friend_id' AND user_2 = '$user_id'))
MySQL has a hard time with OR conditions. In theory, there's an index merge optimization that #duskwuff mentions, but in practice, it doesn't kick in when you think it should. Besides, it doesn't give as performance as a single index when it does.
The solution most people use to work around this is to split up the query:
SELECT ... WHERE user_1 = ?
UNION
SELECT ... WHERE user_2 = ?
That way each query will be able to use its own choice for index, without relying on the unreliable index merge feature.
Your second query is optimizable more simply. It's just a tuple comparison. It can be written this way:
WHERE (user_1, user_2) IN (('$user_id', '$friend_id'), ('$friend_id', '$user_id'))
In old versions of MySQL, tuple comparisons would not use an index, but since 5.7.3, it will (see https://dev.mysql.com/doc/refman/5.7/en/row-constructor-optimization.html).
P.S.: Don't interpolate application code variables directly into your SQL expressions. Use query parameters instead.
I know that only one index can be used in a query…
This is incorrect. Under the right circumstances, MySQL will routinely use multiple indexes in a query. (For example, a query JOINing multiple tables will almost always use at least one index on each table involved.)
In the case of your first query, MySQL will use an index merge union optimization. If both columns are indexed, the EXPLAIN output will give an explanation along the lines of:
Using union(index_on_user_1,index_on_user_2); Using where
The query shown in your second example is covered by an index on (user_1, user_2). Create that index if you plan on running those queries routinely.
The two cases are different.
At the first case both columns needs to be searched for the same value. If you have a two column index (u1,u2) then it may be used at the column u1 as it cannot be used at column u2. If you have two indexes separate for u1 and u2 probably both of them will be used. The choice comes from statistics based on how many rows are expected to be returned. If returned rows expected few an index seek will be selected, if the appropriate index is available. If the number is high a scan is preferable, either table or index.
At the second case again both columns need to be checked again, but within each search there are two sub-searches where the second sub-search will be upon the results of the first one, due to the AND condition. Here it matters more and two indexes u1 and u2 will help as any field chosen to be searched first will have an index. The choice to use an index is like i describe above.
In either case however every OR will force 1 more search or set of searches. So the proposed solution of breaking using union does not hinder more as the table will be searched x times no matter 1 select with OR(s) or x selects with union and no matter index selection and type of search (seek or scan). As a result, since each select at the union get its own execution plan part, it is more likely that (single column) indexes will be used and finally get all row result sets from all parts around the OR(s). If you do not want to copy a large select statement to many unions you may get the primary key values and then select those or use a view to be sure the majority of the statement is in one place.
Finally, if you exclude the union option, there is a way to trick the optimizer to use a single index. Create a double index u1,u2 (or u2,u1 - whatever column has higher cardinality goes first) and modify your statement so all OR parts use all columns:
... WHERE (user_1 = '$user_id' OR user_2 = '$user_id') ...
will be converted to:
... WHERE ((user_1 = '$user_id' and user_2=user_2) OR (user_1=user_1 and user_2 = '$user_id')) ...
This way a double index (u1,u2) will be used at all times. Please not that this will work if columns are nullable and bypassing this with isnull or coalesce may cause index not to be selected. It will work with ansi nulls off however.
Scenario
Let's say I have a MySQL query that contains multiple WHERE/AND clauses.
For for instance, say I have the query
SELECT * FROM some_table
WHERE col1 = 5
AND col2 = 9
AND col3 LIKE '%string%'
Question
Is the col1 = 5 check done as the first in sequence here, since it's written first? And more importantly, are the other two checks skipped if col1 != 5?
The reason I ask is that the third clause, col3 LIKE '%string3%, will take more time to run, and I'm wondering if it makes sense to put it last, since I don't want to run it if one of the first two checks are false.
The SQL optimizer looks at the query at whole and tries to determine the most optimal query plan for the query. The order of the contitions in where-clause does not matter.
The optimal index for that query is
INDEX(col1, col2) -- in either order
Given that, it will definitely check both col1 and col2 simultaneously in order to whittle down the number of rows to a much smaller number. Hence the LIKE will happen only for the few rows that match both col1 and col2. This order of actions is very likely to be optimal.
Often, it is better (though not identical) to use FULLTEXT(col3) and have
WHERE col1 = 5
AND col2 = 9
AND MATCH(col3) AGAINST ("+string" IN BOOLEAN MODE)
In this case, I am pretty sure it will start with the FULLTEXT index to test col3, hoping to get very few rows to double check against the other clauses. Because of various other issues, this is optimal. Any index(es) on col1 and col2 will not be used.
The general statement (so far) is: The Optimizer will pick AND clause(s) that it can use for one INDEX first. In that sense, the order of the AND clauses is violated -- as an optimization.
If you don't have any suitable indexes, well, shame on you.
There are many possibilities. I will be happy to discuss individual queries, but it will be hard to make too many generalities.
I have a query like this:
( SELECT * FROM mytable WHERE author_id = ? AND seen IS NULL )
UNION
( SELECT * FROM mytable WHERE author_id = ? AND date_time > ? )
Also I have these two indexes:
(author_id, seen)
(author_id, date_time)
I read somewhere:
A query can generally only use one index per table when process the WHERE clause
As you see in my query, there is two separated WHERE clause. So I want to know, "only one index per table" means my query can use just one of those two indexes or it can use one of those indexes for each subquery and both indexes are useful?
In other word, is this sentence true?
"always one of those index will be used, and the other one is useless"
That statement about only using one index is no longer true about MySQL. For instance, it implements the index merge optimization which can take advantage of two indexes for some where clauses that have or. Here is a description in the documentation.
You should try this form of your query and see if it uses index mer:
SELECT *
FROM mytable
WHERE author_id = ? AND (seen IS NULL OR date_time > ? );
This should be more efficient than the union version, because it does not incur the overhead of removing duplicates.
Also, depending on the distribution of your data, the above query with an index on mytable(author_id, date_time, seen) might work as well or better than your version.
UNION combines results of subqueries. Each subquery will be executed independent of others and then results will be merged. So, in this case WHERE limits are applied to each subquery and not to all united result.
In answer to your question: yes, each subquery can use some index.
There are cases when the database engine can use more indexes for one select statement, however when filtering one set of rows really it not possible. If you want to use indexing on two columns then build one index on both columns instead of two indexes.
Every single subquery or part of composite query is itself a query can be evaluated as single query for performance and index access .. you can also force the use of different index for eahc query .. In your case you are using union and these are two separated query .. united in a resulting query
. you can have a brief guide how mysql ue index .. acccessing at this guide
http://dev.mysql.com/doc/refman/5.7/en/mysql-indexes.html
I'm working on a project involving joins between datasets and we have a requirement to allow previews of arbitrary joins between arbitrary datasets. Which is crazy, but thats why its fun. This is use facing so given a join I want to show ~10 rows of results quickly.
I've been basing my experimentation around different ways to sub-sample the different tables in such a way that I get at least a few result rows but keep the samples small enough that the join is fast and not cause the sampling to be expensive.
Here are the methods I've found pass the smell test. I would like to know a few things about them:
What types of joins or datasets would these fail at?
How could I identify those datasets?
If both of these are bad at the same thing, how could they be improved?
Is there a type of sampling I have not put here that is better?
Subselect with a limit.
Takes a random sample of one dataset to reduce the overall size.
SELECT col1, col2 FROM table1 JOIN
(SELECT col1, col2 FROM table2 LIMIT #) AS sample2
on table1.col1 = sample2.col1
LIMIT 10;
I like this because its easy and there is potential in the future to be smart about which table to samples from. It is also possible to select a portion where table1.col1 never equals sample2.col1 so no results are returned.
Find equals values of col1 and Sample them
More complicated, multi-query approach. Here I would do a distinct select of the columns to join on, compare the results to find common values and then do a subselect limiting the results to the common values.
SELECT DISTINCT col1 FROM table1;
SELECT DISTINCT col1 FROM table2;
commonVals = intersection of above results
SELECT col1, col2 FROM table1 JOIN
(SELECT col1, col2 FROM table2 WHERE col1 IN(commonVals) LIMIT #) as sample2
on table1.col1 = sample2.col1
LIMIT 10;
This gets us a good sample of table2, but the select distinct query may be more expensive than the join. I believe there may be a way to determine if this method is faster if you knew something about how long the distinct cals would take but at this point we don't have that much knowledge of the datasets.
Slap a LIMIT on the join
This is the easiest and the one I'm leaning towards.
SELECT col1, col1 FROM table1 join table2 on table1.col1 = table2.col1 LIMIT #
Assuming the join is good, this will always return data and for at least a large set of cases it will do it fast.
The problem with the first approach is that the rows in the first table might not have a match in the second table. Remember, inner joins not only do matching, they also do filtering.
The second approach could work, if all the columns used for joining have indexes on them. You can then get a list of matching ids by doing something like:
where id in (select id from table1) and id in (select id from table2) . . .
This gets rid of the initial code and should be pretty fast.
The third method is using the capabilities of the database most directly. You would be depending on the ability of MySQL to optimize according to the size of the result set. This is something that it does, at least in theory.
I would strongly recommend the third approach in conjunction with indexes on the columns used in the joins. This requires minimal changes to the query (just add a limit clause). It allows the database to pursue additional optimizations, if appropriate. It works on a more general set of queries.
i'm working on a simple query that runs in about 1.2 seconds in a myisam table populated by 126,000 records:
SELECT * FROM my_table
WHERE primary_key != 5 AND
(
col1 = 528 OR (col2 = 265 AND col3 = 1)
)
ORDER BY primary_key DESC
I have already created single indexes for each field used in the where clause, but only primary_key (autoincrement field of my_table) is used as key while col1 and col2 are just ignored and the query becomes much slower. How should I create the indexes (maybe multiple-indexs) or edit the query?
You will get optimal performance if you have the following multi-column "covering" indexes:
(primary_key, col1)
(primary_key, col2, col3)
And issue the following query:
(SELECT * FROM my_table
WHERE primary_key != 5 AND
col1 = 528)
UNION
(SELECT * FROM my_table
WHERE primary_key != 5 AND
col2 = 265 AND col3 = 1)
ORDER BY primary_key DESC
You may get variable, better, performance by changing the order of the fields in the indexes, based on cardinality.
In your original query, no index could be used for the entire selection in the WHERE clause, which caused partial table scanning.
In the above query, the first subquery is able to utilize the first index completely, avoiding any table scanning, and the second subquery uses the second index.
Unfortunately, MySQL won't be able to utilize an index to sort the records on the full result set, and will probably use filesort to order them. So, if you don't need the records ordered by primary_key, remove the outer ORDER clause for better performance, though if the result set is small, it shouldn't be an issue either way.
Use EXPLAIN to find out what is going on. This will identify what indexes are being used and hence enable you to tune the query.
Unfortunately, there is almost no way to predict which indexes will work better than others without a thorough understanding of MySQL internals. However, for each query (or, at least sub-query/join) MySQL will only use 1 index, you have stated that the primary key is being used, so I assume you have looked at the EXPLAIN output.
You will likely want to try multi-column indexes on either all of (primary_key, col1, col2, col3) or a subset of these, in different orders to find the best result. The best index will depend on the primarily on the cardinality of the columns and thus, may even change over time depending on the data in the table.