Index when using OR in query - mysql

What is the best way to create index when I have a query like this?
... WHERE (user_1 = '$user_id' OR user_2 = '$user_id') ...
I know that only one index can be used in a query so I can't create two indexes, one for user_1 and one for user_2.
Also, could the solution for this type of query be used for this query as well?
WHERE ((user_1 = '$user_id' AND user_2 = '$friend_id') OR (user_1 = '$friend_id' AND user_2 = '$user_id'))

MySQL has a hard time with OR conditions. In theory, there's an index merge optimization that @duskwuff mentions, but in practice, it doesn't kick in when you think it should. Besides, it doesn't give as good performance as a single index when it does.
The solution most people use to work around this is to split up the query:
SELECT ... WHERE user_1 = ?
UNION
SELECT ... WHERE user_2 = ?
That way each query will be able to use its own choice for index, without relying on the unreliable index merge feature.
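As a rough check that the split returns the same rows as the OR form, here is a minimal sketch using Python's built-in sqlite3 module. SQLite stands in for MySQL only so the example is self-contained; it says nothing about which indexes MySQL would pick. The table name friendships, the index names, and the sample rows are assumptions for illustration.

```python
import sqlite3

# SQLite is a stand-in for MySQL here: this only checks that both query
# shapes return the same rows, not how MySQL's optimizer would run them.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE friendships (id INTEGER PRIMARY KEY, user_1 INTEGER, user_2 INTEGER)"
)
conn.execute("CREATE INDEX idx_user_1 ON friendships (user_1)")
conn.execute("CREATE INDEX idx_user_2 ON friendships (user_2)")
conn.executemany(
    "INSERT INTO friendships (user_1, user_2) VALUES (?, ?)",
    [(1, 2), (2, 3), (3, 1), (4, 5), (1, 1)],
)


def friends_with_or(user_id):
    """Single query with OR across both columns."""
    rows = conn.execute(
        "SELECT id, user_1, user_2 FROM friendships WHERE user_1 = ? OR user_2 = ?",
        (user_id, user_id),
    )
    return sorted(rows.fetchall())


def friends_with_union(user_id):
    """Two single-column queries combined with UNION (which removes duplicates)."""
    rows = conn.execute(
        "SELECT id, user_1, user_2 FROM friendships WHERE user_1 = ? "
        "UNION "
        "SELECT id, user_1, user_2 FROM friendships WHERE user_2 = ?",
        (user_id, user_id),
    )
    return sorted(rows.fetchall())
```

The row (1, 1) matches both branches, which is why plain UNION rather than UNION ALL is used here.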
Your second query is optimizable more simply. It's just a tuple comparison. It can be written this way:
WHERE (user_1, user_2) IN (('$user_id', '$friend_id'), ('$friend_id', '$user_id'))
In old versions of MySQL, tuple comparisons would not use an index, but since 5.7.3, it will (see https://dev.mysql.com/doc/refman/5.7/en/row-constructor-optimization.html).
P.S.: Don't interpolate application code variables directly into your SQL expressions. Use query parameters instead.
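To illustrate the point about query parameters, here is a small sketch with Python's sqlite3 module (the question uses PHP and MySQL, so the driver and the friendships table are assumptions; the idea of binding values instead of interpolating them is the same).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE friendships (id INTEGER PRIMARY KEY, user_1 INTEGER, user_2 INTEGER)"
)
conn.executemany(
    "INSERT INTO friendships (user_1, user_2) VALUES (?, ?)",
    [(1, 2), (2, 3), (3, 1)],
)


def friendship_ids(user_id):
    # The value is bound through placeholders and sent separately from the
    # SQL text, so quotes inside it cannot change the statement's structure.
    rows = conn.execute(
        "SELECT id FROM friendships WHERE user_1 = ? OR user_2 = ? ORDER BY id",
        (user_id, user_id),
    )
    return [row[0] for row in rows]


# A hostile value is compared as plain data and matches nothing.
hostile_input = "0' OR '1'='1"
```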

I know that only one index can be used in a query…
This is incorrect. Under the right circumstances, MySQL will routinely use multiple indexes in a query. (For example, a query JOINing multiple tables will almost always use at least one index on each table involved.)
In the case of your first query, MySQL will use an index merge union optimization. If both columns are indexed, the EXPLAIN output will give an explanation along the lines of:
Using union(index_on_user_1,index_on_user_2); Using where
The query shown in your second example is covered by an index on (user_1, user_2). Create that index if you plan on running those queries routinely.

The two cases are different.
In the first case, both columns need to be searched for the same value. If you have a two-column index (u1,u2), it may be used for column u1, but it cannot be used for column u2. If you have two separate indexes on u1 and u2, probably both of them will be used. The choice comes from statistics, based on how many rows are expected to be returned. If few rows are expected, an index seek will be selected, provided the appropriate index is available. If the number is high, a scan is preferable, either of the table or of an index.
In the second case, both columns again need to be checked, but within each search there are two sub-searches, where the second sub-search runs on the results of the first one because of the AND condition. Here it matters more, and two indexes on u1 and u2 will help, since whichever field is searched first will have an index. The choice to use an index is made as I describe above.
In either case, however, every OR forces one more search or set of searches. So the proposed solution of splitting the query with UNION does not cost more: the table will be searched x times whether you write one SELECT with OR(s) or x SELECTs with UNION, regardless of index selection and type of search (seek or scan). Moreover, since each SELECT in the UNION gets its own part of the execution plan, it is more likely that (single-column) indexes will be used to collect the row sets from all parts around the OR(s). If you do not want to copy a large SELECT statement into many unions, you can fetch the primary key values first and then select those rows, or use a view so that the majority of the statement lives in one place.
Finally, if you exclude the union option, there is a way to trick the optimizer into using a single index. Create a double index u1,u2 (or u2,u1; whichever column has higher cardinality goes first) and modify your statement so all OR parts use all columns:
... WHERE (user_1 = '$user_id' OR user_2 = '$user_id') ...
will be converted to:
... WHERE ((user_1 = '$user_id' and user_2=user_2) OR (user_1=user_1 and user_2 = '$user_id')) ...
This way a double index (u1,u2) will be used at all times. Please note that this will not work if the columns are nullable, because user_2 = user_2 is not true for NULL values, and working around that with ISNULL or COALESCE may cause the index not to be selected. It will work with ANSI nulls off, however.
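The NULL caveat can be seen with a small sketch. It uses Python's sqlite3 module as a stand-in for MySQL (standard three-valued NULL comparison applies in both); the friendships table and its rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE friendships (id INTEGER PRIMARY KEY, user_1 INTEGER, user_2 INTEGER)"
)
conn.executemany(
    "INSERT INTO friendships (user_1, user_2) VALUES (?, ?)",
    [(7, 8), (7, None), (None, 7), (9, 7)],
)

plain_or = "SELECT id FROM friendships WHERE user_1 = ? OR user_2 = ? ORDER BY id"
padded_or = (
    "SELECT id FROM friendships "
    "WHERE (user_1 = ? AND user_2 = user_2) OR (user_1 = user_1 AND user_2 = ?) "
    "ORDER BY id"
)

# The plain OR keeps rows whose other column is NULL.
plain_ids = [row[0] for row in conn.execute(plain_or, (7, 7))]

# In the padded form, "col = col" evaluates to NULL when col is NULL,
# so rows 2 and 3 silently drop out of the result.
padded_ids = [row[0] for row in conn.execute(padded_or, (7, 7))]
```

With NOT NULL columns the two predicates select the same rows; with nullable columns they do not.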

Related

MySQL - Poor performance in a select from a simple table

I have a very simple table with three columns:
- A BigINT,
- Another BigINT,
- A string.
The first two columns are defined as INDEX and there are no repetitions. Moreover, both columns have values in a growing order.
The table has nearly 400K records.
I need to select the string when a value is within those of columns 1 and 2, in other words:
SELECT MyString
FROM MyTable
WHERE Col_1 <= Test_Value
AND Test_Value <= Col_2 ;
The result may be either a NOT FOUND or a single value.
The query takes nearly a whole second while, intuitively (imagining a binary search throughout an array), it should take just a small fraction of a second.
I checked the index type and it is BTREE for both columns (1 and 2).
Any idea how to improve performance?
Thanks in advance.
EDIT:
The explain reads:
Select type: Simple,
Type: Range,
Possible Keys: PRIMARY
Key: Primary,
Key Length: 8,
Rows: 441,
Filtered: 33.33,
Extra: Using where.
If I understand your obfuscation correctly, you have a start and end value such as a datetime or an ip address in a pair of columns? And you want to see if your given datetime/ip is in the given range?
Well, there is no way to generically optimize such a query on such a table. The optimizer does not know whether a given value could be in multiple ranges. Or, put another way, whether the ranges are disjoint.
So, the optimizer will, at best, use an index starting with either start or end and scan half the table. Not efficient.
Are the ranges non-overlapping? IP Addresses
What can you say about the result? Perhaps a kludge like this will work: SELECT ... WHERE Col_1 <= Test_Value ORDER BY Col_1 DESC LIMIT 1.
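The ORDER BY ... DESC LIMIT 1 idea can be sketched as follows, assuming the ranges never overlap (the answer asks about this; the question does not confirm it). The sketch runs on Python's sqlite3 module as a stand-in for MySQL, and the sample rows and the extra check against Col_2 in application code are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE MyTable (Col_1 INTEGER PRIMARY KEY, Col_2 INTEGER, MyString TEXT)"
)
conn.executemany(
    "INSERT INTO MyTable VALUES (?, ?, ?)",
    [(0, 9, "a"), (10, 19, "b"), (30, 39, "c")],
)


def lookup(test_value):
    """Return MyString for the range containing test_value, or None."""
    # With non-overlapping ranges, the only candidate is the row with the
    # largest Col_1 that is still <= test_value: one index probe, one row.
    row = conn.execute(
        "SELECT Col_2, MyString FROM MyTable "
        "WHERE Col_1 <= ? ORDER BY Col_1 DESC LIMIT 1",
        (test_value,),
    ).fetchone()
    # The candidate may end before test_value (a gap between ranges).
    if row is None or row[0] < test_value:
        return None
    return row[1]
```

If ranges can overlap, this shortcut is not valid, because more than one row may contain the test value.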
Your query, rewritten with shorter identifiers, is this
SELECT s FROM t WHERE t.low <= v AND v <= t.high
To satisfy this query using indexes would go like this: First we must search a table or index for all rows matching the first of these criteria
t.low <= v
We can think of that as a half-scan of a BTREE index. It starts at the beginning and stops when it gets to v.
It requires another half-scan in another index to satisfy v <= t.high. It then requires a merge of the two resultsets to identify the rows matching both criteria. The problem is, the two resultsets to merge are large, and they're almost entirely non-overlapping.
So, the query planner probably should just choose a full table scan instead to satisfy your criteria. That's especially true in the case of MySQL, where the query planner isn't very good at using more than one index.
You may, or may not, be able to speed up this exact query with a compound index on (low, high, s) -- with your original column names (Col_1, Col_2, MyString). This is called a covering index and allows MySQL to satisfy the query completely from the index. It sometimes helps performance. (It would be easier to guess whether this will help if the exact definition of your table were available; the efficiency of covering indexes depends on stuff like other indexes, primary keys, column size, and so forth. But you've chosen minimal disclosure for that information.)
What will really help here? Rethinking your algorithm could do you a lot of good. It seems you're trying to retrieve rows where a test point v lies in the range [t.low, t.high]. Does your application offer an a-priori limit on the width of the range? That is, is there a known maximum value of t.high - t.low? If so, let's call that value maxrange. Then you can rewrite your query like this:
SELECT s
FROM t
WHERE t.low BETWEEN v-maxrange AND v
AND t.low <= v AND v <= t.high
When maxrange is available we can add the col BETWEEN const1 AND const2 clause. That turns into an efficient range scan on an index on low. In that case, the covering index I mentioned above will certainly accelerate this query.
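A quick way to convince yourself that the extra BETWEEN clause does not change the result is the sketch below. It uses Python's sqlite3 module as a stand-in for MySQL. Computing maxrange from the data is only done to keep the example self-contained; the answer assumes the application knows that limit in advance. The index name and sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (low INTEGER, high INTEGER, s TEXT)")
conn.execute("CREATE INDEX t_low_high_s ON t (low, high, s)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [(0, 4, "a"), (3, 12, "b"), (10, 15, "c"), (40, 45, "d")],
)

# Assumption: derived from the data here; in practice an a-priori limit.
MAXRANGE = conn.execute("SELECT MAX(high - low) FROM t").fetchone()[0]


def original(v):
    """The query as asked: both bounds open-ended on the index."""
    rows = conn.execute(
        "SELECT s FROM t WHERE low <= ? AND ? <= high ORDER BY s", (v, v)
    )
    return [row[0] for row in rows]


def bounded(v):
    """Same predicate plus a closed range on low, which an index can seek."""
    rows = conn.execute(
        "SELECT s FROM t WHERE low BETWEEN ? AND ? AND low <= ? AND ? <= high "
        "ORDER BY s",
        (v - MAXRANGE, v, v, v),
    )
    return [row[0] for row in rows]
```

Any row with low <= v <= high satisfies v - low <= high - low <= maxrange, so the added clause cannot exclude a matching row.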
Read this. http://use-the-index-luke.com/
Well... I found a suitable solution for me (not sure you guys will like it but, as stated, it works for me).
I simply partitioned my 400K records into a number of tables and created a simple table that serves as a selector:
The selector table holds the minimal value of the first column for each partition along with a simple index (i.e. 1, 2, ...).
I then use the following to get the index of the table that is supposed to contain the searched-for range:
SELECT Table_Index
FROM tbl_selector
WHERE start_range <= Test_Val
ORDER BY start_range DESC LIMIT 1 ;
This will give me the Index of the table I wish to select from.
I then have a CASE on the retrieved index to select the correct partition table on which to perform the actual search.
(I guess that more elegant would be to use Dynamic SQL, but will take care of that later; for now just wanted to test the approach).
The result is that I get the response well below a second (~0.08) and it is uniform regardless of the number being used for the test. This, by the way, was not the case with the previous approach: there, if the number was "close" to the beginning of the table, the result was produced quite fast; if, on the other hand, the record was near the end of the table, it would take several seconds to complete.
[By the way, I assume you understand what I mean by beginning and end of the table]
Again, I'm sure people might dislike this, but it does the job for me.
Thank you all for the effort to assist!!

Can a query only use one index per table?

I have a query like this:
( SELECT * FROM mytable WHERE author_id = ? AND seen IS NULL )
UNION
( SELECT * FROM mytable WHERE author_id = ? AND date_time > ? )
Also I have these two indexes:
(author_id, seen)
(author_id, date_time)
I read somewhere:
A query can generally only use one index per table when processing the WHERE clause
As you see in my query, there are two separate WHERE clauses. So I want to know: does "only one index per table" mean my query can use just one of those two indexes, or can it use one of those indexes for each subquery, so that both indexes are useful?
In other words, is this sentence true?
"always one of those index will be used, and the other one is useless"
That statement about only using one index is no longer true about MySQL. For instance, it implements the index merge optimization which can take advantage of two indexes for some where clauses that have or. Here is a description in the documentation.
You should try this form of your query and see if it uses index merge:
SELECT *
FROM mytable
WHERE author_id = ? AND (seen IS NULL OR date_time > ? );
This should be more efficient than the union version, because it does not incur the overhead of removing duplicates.
Also, depending on the distribution of your data, the above query with an index on mytable(author_id, date_time, seen) might work as well or better than your version.
UNION combines results of subqueries. Each subquery will be executed independent of others and then results will be merged. So, in this case WHERE limits are applied to each subquery and not to all united result.
In answer to your question: yes, each subquery can use some index.
There are cases when the database engine can use more than one index for a single SELECT statement; however, when filtering one set of rows it is really not possible. If you want to use indexing on two columns, then build one index on both columns instead of two indexes.
Every subquery, or part of a composite query, is itself a query and can be evaluated as a single query for performance and index access. You can also force the use of a different index for each query. In your case you are using UNION, so these are two separate queries united into one result.
For a brief guide to how MySQL uses indexes, see:
http://dev.mysql.com/doc/refman/5.7/en/mysql-indexes.html

Can MySQL use Indexes when there is OR between conditions?

I have two queries, each with its own EXPLAIN results:
One:
SELECT *
FROM notifications
WHERE id = 5204 OR seen = 3
Benchmark (for 10,000 rows): 0.861
Two:
SELECT h.* FROM ((SELECT n.* from notifications n WHERE id = 5204)
UNION ALL
(SELECT n.* from notifications n WHERE seen = 3)) h
Benchmark (for 10,000 rows): 2.064
The result of two queries above is identical. Also I have these two indexes on notifications table:
notifications(id) -- this is PK
notifications(seen)
As you know, OR usually prevents effective use of indexes. That's why I wrote the second query (with UNION). But after some tests I found that using OR is still much faster than using UNION. So I'm confused and I really cannot choose the best option in my case.
Based on some logical and reasonable explanations, using union is better, but the result of benchmark says using OR is better. Could you please help me decide which approach I should use?
The query plan for the OR case appears to indicate that MySQL is indeed using indexes, so evidently yes, it can do, at least in this case. That seems entirely reasonable, because there is an index on seen, and id is the PK.
Based on some logical and reasonable explanations, using union is better, but the result of benchmark says using OR is better.
If "logical and reasonable explanations" are contradicted by reality, then it is safe to assume that the logic is flawed or the explanations are wrong or inapplicable. Performance is notoriously difficult to predict; performance testing is essential where speed is important.
Could you please help me decide which approach I should use?
You should use the one that tests faster on input that adequately models that which the program will see in real use.
Note also, however, that your two queries are not semantically equivalent: if the row with id = 5204 also has seen = 3 then the OR query will return it once, but the UNION ALL query will return it twice. It is pointless to choose between correct code and incorrect code on any basis other than which one is correct.
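The difference is easy to reproduce. Below is a minimal sketch on Python's sqlite3 module (a stand-in for MySQL; OR, UNION and UNION ALL have the same duplicate semantics in both). The sample rows are made up so that the row with id = 5204 also has seen = 3.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notifications (id INTEGER PRIMARY KEY, seen INTEGER)")
conn.execute("CREATE INDEX idx_seen ON notifications (seen)")
conn.executemany(
    "INSERT INTO notifications VALUES (?, ?)",
    [(5204, 3), (1, 3), (2, 0)],
)


def ids(sql):
    """Run a query and return the selected ids in sorted order."""
    return sorted(row[0] for row in conn.execute(sql))


or_ids = ids("SELECT id FROM notifications WHERE id = 5204 OR seen = 3")

# UNION ALL keeps the row that satisfies both branches twice.
union_all_ids = ids(
    "SELECT id FROM notifications WHERE id = 5204 "
    "UNION ALL "
    "SELECT id FROM notifications WHERE seen = 3"
)

# Plain UNION removes the duplicate and matches the OR result.
union_ids = ids(
    "SELECT id FROM notifications WHERE id = 5204 "
    "UNION "
    "SELECT id FROM notifications WHERE seen = 3"
)
```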
index_merge, as the name suggests, combines the primary keys of two indexes using the Sort Merge Join or Sort Merge Union for AND and OR conditions, respectively, and then looks up the rest of the values in the table by PK.
For this to work, conditions on both indexes should be so that each index would yield primary keys in order (your conditions are).
You can find the strict definition of the conditions in the docs, but in a nutshell, you should filter by all parts of the index with an equality condition, plus possibly <, =, or > on the PK.
If you have an index on (col1, col2, col3), this should be col1 = :val1 AND col2 = :val2 AND col3 = :val3 [ AND id > :id ] (the part in the square brackets is not necessary).
The following conditions won't work:
col1 = :val1 -- you omit col2 and col3
col1 = :val1 AND col2 = :val2 AND col3 > :val3 -- you can only use equality on key parts
As a free side effect, your output is sorted by id.
You could achieve similar results using this:
SELECT *
FROM (
SELECT 5204 id
UNION ALL
SELECT id
FROM mytable
WHERE seen = 3
AND id <> 5204
) q
JOIN mytable m
ON m.id = q.id
, except that in earlier versions of MySQL the derived table would have to be materialized which would definitely make the query performance worse, and your results would not have been ordered by id anymore.
In short, if your query allows index_merge(union), go for it.
The answer is contained in your question. The EXPLAIN output for OR says Using union(PRIMARY, seen) - that means that the index_merge optimization is being used and the query is actually executed by unioning results from the two indexes.
So MySQL can use indexes in some cases, and it does in this one. But index_merge is not always available, or is not used because the index statistics say it won't be worth it. In those cases OR may be a lot slower than UNION (or not; you always need to check both versions if you are not sure).
In your test you "got lucky" and MySQL did the right optimization for you automatically. It is not always so.

Instructing MySQL to apply WHERE clause to rows returned by previous WHERE clause

I have the following query:
SELECT dt_stamp
FROM claim_notes
WHERE type_id = 0
AND dt_stamp >= :dt_stamp
AND DATE( dt_stamp ) = :date
AND user_id = :user_id
AND note LIKE :click_to_call
ORDER BY dt_stamp
LIMIT 1
The claim_notes table has about half a million rows, so this query runs very slowly since it has to search against the unindexed note column (which I can't do anything about). I know that when the type_id, dt_stamp, and user_id conditions are applied, I'll be searching against about 60 rows instead of half a million. But MySQL doesn't seem to apply these in order. What I'd like to do is to see if there's a way to tell MySQL to only apply the note LIKE :click_to_call condition to the rows that meet the former conditions so that it's not searching all rows with this condition.
What I've come up with is this:
SELECT dt_stamp
FROM (
SELECT *
FROM claim_notes
WHERE type_id = 0
AND dt_stamp >= :dt_stamp
AND DATE( dt_stamp ) = :date
AND user_id = :user_id
) AS filtered
WHERE note LIKE :click_to_call
ORDER BY dt_stamp
LIMIT 1
This works and is extremely fast. I'm just wondering if this is the right way to do this, or if there is a more official way to handle it.
It shouldn't be necessary to do this. The MySQL optimizer can handle it if you have multiple terms in your WHERE clause separated by AND. Basically, it knows how to do "apply all the conditions you can using indexes, then apply unindexed expressions only to the remaining rows."
But choosing the right index is important. A multi-column index is better for a series of AND terms than individual indexes. MySQL can apply index intersection, but that's much less effective than finding the same rows with a single index.
A few logical rules apply to creating multi-column indexes:
Conditions on unique columns are preferred over conditions on non-unique columns.
Equality conditions (=) are preferred over ranges (>=, IN, BETWEEN, !=, etc.).
After the first column in the index used for a range condition, subsequent columns won't use an index.
Most of the time, searching the result of a function on a column (e.g. DATE(dt_stamp)) won't use an index. It'd be better in that case to store a DATE data type and use = instead of >=.
If the condition matches > 20% of the table, MySQL probably will decide to skip the index and do a table-scan anyway.
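One common way to avoid wrapping the column in a function, different from the stored DATE column suggested above, is to express the day as a half-open range on dt_stamp. The sketch below checks that both forms select the same rows. It uses Python's sqlite3 module as a stand-in for MySQL, with timestamps stored as text; the index name, the (type_id, user_id, dt_stamp) column order, and the sample rows are assumptions.

```python
import sqlite3
from datetime import date, datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE claim_notes (id INTEGER PRIMARY KEY, type_id INTEGER, "
    "user_id INTEGER, dt_stamp TEXT, note TEXT)"
)
# Equality columns first, the range column last (hypothetical index).
conn.execute(
    "CREATE INDEX idx_type_user_dt ON claim_notes (type_id, user_id, dt_stamp)"
)
conn.executemany(
    "INSERT INTO claim_notes (type_id, user_id, dt_stamp, note) VALUES (?, ?, ?, ?)",
    [
        (0, 1, "2024-01-01 23:59:59", "click to call"),
        (0, 1, "2024-01-02 00:00:00", "click to call"),
        (0, 1, "2024-01-02 18:30:00", "other"),
        (0, 1, "2024-01-03 00:00:00", "click to call"),
    ],
)


def by_function(day):
    """Filter with DATE(dt_stamp), which hides the column from the index."""
    rows = conn.execute(
        "SELECT dt_stamp FROM claim_notes WHERE type_id = 0 AND user_id = 1 "
        "AND DATE(dt_stamp) = ? ORDER BY dt_stamp",
        (day.isoformat(),),
    )
    return [row[0] for row in rows]


def by_range(day):
    """Same day expressed as start <= dt_stamp < start + 1 day."""
    start = datetime.combine(day, datetime.min.time())
    end = start + timedelta(days=1)
    fmt = "%Y-%m-%d %H:%M:%S"
    rows = conn.execute(
        "SELECT dt_stamp FROM claim_notes WHERE type_id = 0 AND user_id = 1 "
        "AND dt_stamp >= ? AND dt_stamp < ? ORDER BY dt_stamp",
        (start.strftime(fmt), end.strftime(fmt)),
    )
    return [row[0] for row in rows]
```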
Here are some webinars by myself and my colleagues at Percona to help explain index design:
Tools and Techniques for Index Design
MySQL Indexing: Best Practices
Advanced MySQL Query Tuning
Really Large Queries: Advanced Optimization Techniques
You can get the slides for these webinars for free, and view the recording for free, but the recording requires registration.
Don't go for the derived table solution, as it is not performant. I'm surprised that, with = and >= conditions available, MySQL is evaluating the LIKE first.
Anyway, I'd say you could try adding some indexes on those fields and see what happens:
ALTER TABLE claim_notes ADD INDEX(type_id, user_id);
ALTER TABLE claim_notes ADD INDEX(dt_stamp);
The latter index won't actually improve the search on the indexes but rather the sorting of the results.
Of course, having an EXPLAIN of the query would help.

MySQL: SELECT(x) WHERE vs COUNT WHERE?

This is going to be one of those questions but I need to ask it.
I have a large table which may or may not have one unique row. I therefore need a MySQL query that will just tell me TRUE or FALSE.
With my current knowledge, I see two options (pseudo code):
[id = primary key]
OPTION 1:
SELECT id FROM table WHERE x=1 LIMIT 1
... and then determine in PHP whether a result was returned.
OPTION 2:
SELECT COUNT(id) FROM table WHERE x=1
... and then just use the count.
Is either of these preferable for any reason, or is there perhaps an even better solution?
Thanks.
If the selection criterion is truly unique (i.e. yields at most one result), you are going to see massive performance improvement by having an index on the column (or columns) involved in that criterion.
create index my_unique_index on table(x)
If you want to enforce the uniqueness, a plain index is not enough; you must have
create unique index my_unique_index on table(x)
Having this index, querying on the unique criterion will perform very well, regardless of minor SQL tweaks like count(*), count(id), count(x), limit 1 and so on.
For clarity, I would write
select count(*) from table where x = ?
I would avoid LIMIT 1 for two other reasons:
It is non-standard SQL. I am not religious about that, use the MySQL-specific stuff where necessary (i.e. for paging data), but it is not necessary here.
If for some reason, you have more than one row of data, that is probably a serious bug in your application. With LIMIT 1, you are never going to see the problem. This is like counting dinosaurs in Jurassic Park with the assumption that the number can only possibly go down.
AFAIK, if you have an index on your ID column both queries will be more or less equal performance. The second query will need 1 less line of code in your program but that's not going to make any performance impact either.
Personally I typically do the first one of selecting the id from the row and limiting to 1 row. I like this better from a coding perspective. Instead of having to actually retrieve the data, I just check the number of rows returned.
If I were to compare speeds, I would say not doing a count in MySQL would be faster. I don't have any proof, but my guess would be that MySQL has to get all of the rows and then count how many there are. Although... on second thought, it would have to do that in the first option as well so the code will know how many rows there are. But since you have COUNT(id) vs COUNT(*), I would say it might be slightly slower.
Intuitively, the first one could be faster since it can abort the table (or index) scan when it finds the first value. But you should retrieve x, not id, since if the engine is using an index on x, it doesn't need to go to the block where the row actually is.
Another option could be:
select exists(select 1 from mytable where x = ?) from dual
Which already returns a boolean.
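Here is a small sketch of the EXISTS form next to COUNT(*), run through Python's sqlite3 module as a stand-in for MySQL. SQLite has no dual table, so FROM dual is left out; the index name and sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, x INTEGER)")
conn.execute("CREATE INDEX idx_x ON mytable (x)")
conn.executemany("INSERT INTO mytable (x) VALUES (?)", [(1,), (2,), (2,)])


def row_exists(x):
    """EXISTS yields a single 0/1 value, whether or not any row matches."""
    (flag,) = conn.execute(
        "SELECT EXISTS(SELECT 1 FROM mytable WHERE x = ?)", (x,)
    ).fetchone()
    return bool(flag)


def row_count(x):
    """COUNT(*) also exposes unexpected duplicates, which EXISTS hides."""
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM mytable WHERE x = ?", (x,)
    ).fetchone()
    return count
```

EXISTS always produces exactly one row, so the caller never has to distinguish "no rows returned" from "false".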
Typically, you use a GROUP BY with a HAVING clause to determine whether there are duplicate rows in a table. Suppose you have a table with id and name columns, where id is the primary key, and you want to know whether name is unique or repeated. You would use
select name, count(*) as total from mytable group by name having total > 1;
The above will return the names that are repeated and the number of times each one occurs.
If you just want one query to get your answer as true or false, you can use a nested query, e.g.
select if(count(*) >= 1, True, False) from (select name, count(*) as total from mytable group by name having total > 1) a;
The above should return true, if your table has duplicate rows, otherwise false.
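The duplicate check can be tried out with the sketch below, again on Python's sqlite3 module as a stand-in for MySQL. To stay portable it writes HAVING COUNT(*) > 1 instead of referring to the alias, and it uses EXISTS for the true/false variant rather than IF(...), which is a swapped-in equivalent and not the exact query above. The sample names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO mytable (name) VALUES (?)",
    [("ann",), ("bob",), ("ann",), ("cat",), ("bob",), ("ann",)],
)

# Each repeated name together with how many times it occurs.
duplicates = conn.execute(
    "SELECT name, COUNT(*) AS total FROM mytable "
    "GROUP BY name HAVING COUNT(*) > 1 ORDER BY name"
).fetchall()

# Single true/false answer: is there at least one repeated name?
(flag,) = conn.execute(
    "SELECT EXISTS(SELECT 1 FROM mytable GROUP BY name HAVING COUNT(*) > 1)"
).fetchone()
has_duplicates = bool(flag)
```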