I have a query with a join on mytable, which has a unique index on (col1, col2).
explain
...
join mytable on col1=1 and col2=2
shows the correct use of the index with type: eq_ref
But when using a list the index is not used anymore
join mytable on col1=1 and col2 in (2,3,4)
Extra: Range checked for each record (index map: 0x1);
This gives the same result:
join mytable on (col1,col2) in ((1,2),(1,3),(1,4))
Is there a way to use an index when providing a list of values?
Normally one does not JOIN x ON c = constant. The ON clause should specify how this table is related to the other table(s). The WHERE clause should provide for filtering, such as col1=1 and col2=2.
"Row constructors" have long existed, but only recently (5.7.3) are they the least bit optimized. What version are you running? Even with the latest version, I would expect col1=1 and col2 in (2,3,4) to be optimized at least as well as the row-constructor approach.
OR often prevents any optimization. And perhaps it is never faster than some alternative.
We need to see the entire query, the EXPLAIN, and SHOW CREATE TABLE; there could be other issues lurking in the shadows.
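As a rough sketch of the suggestions above, the statements below gather the requested diagnostics and show the join rewritten so that ON carries only the relationship and WHERE carries the filters. The table other_table and its column mytable_col2 are hypothetical, since the full query was not posted.

```sql
-- Diagnostics requested above.
SELECT VERSION();
SHOW CREATE TABLE mytable;

-- Hypothetical rewrite: other_table and other_table.mytable_col2 are
-- assumed names standing in for the elided part of the query.
EXPLAIN
SELECT o.*, m.*
FROM other_table AS o
JOIN mytable AS m
  ON m.col2 = o.mytable_col2   -- ON: how the tables are related
WHERE m.col1 = 1               -- WHERE: constant filters
  AND m.col2 IN (2, 3, 4);
```

Whether the plan improves depends on the version and on the real schema, so compare the EXPLAIN output before and after.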
I was running a query of this kind:
SELECT
-- fields
FROM
table1 JOIN table2 ON (table1.c1 = table2.c1 OR table1.c2 = table2.c2)
WHERE
-- conditions
But the OR made it very slow, so I split it into two queries:
SELECT
-- fields
FROM
table1 JOIN table2 ON table1.c1 = table2.c1
WHERE
-- conditions
UNION
SELECT
-- fields
FROM
table1 JOIN table2 ON table1.c2 = table2.c2
WHERE
-- conditions
This works much better, but now I am going through the tables twice, so I was wondering whether any further optimization is possible, for instance getting the set of entries that satisfies the condition (table1.c1 = table2.c1 OR table1.c2 = table2.c2) and then querying on it. That would bring me back to what I was doing at first, but maybe there is another solution I have not thought of. Is there anything more to do, or is this already optimal?
Splitting the query into two separate ones is usually better in MySQL, since it rarely uses an "Index OR" operation (Index Merge in MySQL lingo).
There are a few items I would concentrate on for further optimization, all related to indexing:
1. Filter the rows faster
The predicates in the WHERE clause should be optimized to retrieve as few rows as possible. They should also be analyzed in terms of selectivity, so that you can create indexes that produce the data with as little filtering as possible (fewer reads).
2. Join access
Retrieving related rows should be optimized as well. According to selectivity you need to decide which table is more selective and use it as a driving table, and consider the other one as the nested loop table. Now, for the latter, you should create an index that will retrieve rows in an optimal way.
3. Covering Indexes
Last but not least, if your query is still slow, there's one more thing you can do: use covering indexes. That is, expand your indexes to include all the columns the query needs from the driving and/or secondary tables. This way the InnoDB engine won't need to read two indexes per table, but a single one.
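A minimal sketch of what such indexes might look like for the first half of the split query. The columns status, name and title are hypothetical (the question only shows "-- fields" and "-- conditions"), and table1 is assumed to be the more selective, driving table.

```sql
-- Driving table: filter column first, then the join column,
-- then the remaining selected columns so the index covers the query.
-- status and name are assumed column names.
CREATE INDEX idx_t1_status_c1_name ON table1 (status, c1, name);

-- Nested-loop table: join column first, then the selected columns.
-- title is an assumed column name.
CREATE INDEX idx_t2_c1_title ON table2 (c1, title);

-- If both indexes cover the query, EXPLAIN should report
-- "Using index" in the Extra column for each table.
EXPLAIN
SELECT table1.name, table2.title
FROM table1
JOIN table2 ON table1.c1 = table2.c1
WHERE table1.status = 'active';
```

Wider indexes cost more to maintain on writes, so this is worth doing only for queries that matter.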
Test
SELECT
-- fields
FROM
table1 JOIN table2 ON table1.c1 = table2.c1
WHERE
-- conditions
UNION ALL
SELECT
-- fields
FROM
table1 JOIN table2 ON table1.c2 = table2.c2
WHERE
-- conditions
/* add one more condition which eliminates the rows selected by 1st subquery */
AND table1.c1 != table2.c1
Copied from the comments:
Nico Haase > What do you mean by "test"?
OP shows query patterns only, so I cannot predict whether the technique will be effective. I suggest that OP test my variant on their own structure and data.
Nico Haase > what you've changed
I have added one more condition to 2nd subquery - see added comment in the code.
Nico Haase > and why?
This replaces UNION DISTINCT with UNION ALL and eliminates sorting the combined rowset to remove duplicates.
I have two queries, each with its own EXPLAIN result:
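One caveat with the added condition: if c1 can be NULL in either table, table1.c1 != table2.c1 evaluates to NULL and the row is dropped, even though the first subquery did not return it. A possible NULL-tolerant variant is sketched below; the id columns are hypothetical stand-ins for "-- fields", and the other WHERE conditions are omitted.

```sql
SELECT table1.id AS t1_id, table2.id AS t2_id   -- assumed columns
FROM table1
JOIN table2 ON table1.c1 = table2.c1
UNION ALL
SELECT table1.id AS t1_id, table2.id AS t2_id
FROM table1
JOIN table2 ON table1.c2 = table2.c2
-- Keep the row unless the first subquery's join condition is TRUE.
-- IS NOT TRUE also keeps rows where the comparison is NULL.
WHERE (table1.c1 = table2.c1) IS NOT TRUE;
```

Unlike UNION, this does not remove duplicates that arise within a single branch, so the two forms match only when each branch is duplicate-free on its own.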
One:
SELECT *
FROM notifications
WHERE id = 5204 OR seen = 3
Benchmark (for 10,000 rows): 0.861
Two:
SELECT h.* FROM ((SELECT n.* from notifications n WHERE id = 5204)
UNION ALL
(SELECT n.* from notifications n WHERE seen = 3)) h
Benchmark (for 10,000 rows): 2.064
The result of two queries above is identical. Also I have these two indexes on notifications table:
notifications(id) -- this is PK
notifications(seen)
As you know, OR usually prevents effective use of indexes. That's why I wrote the second query (with UNION). But after some tests I found that the OR version is still much faster than the UNION version. So I'm confused, and I really cannot choose the best option in my case.
Based on some logical and reasonable explanations, using union is better, but the result of benchmark says using OR is better. Could you please help me decide which approach I should use?
The query plan for the OR case appears to indicate that MySQL is indeed using indexes, so evidently yes, it can do, at least in this case. That seems entirely reasonable, because there is an index on seen, and id is the PK.
Based on some logical and reasonable explanations, using union is better, but the result of benchmark says using OR is better.
If "logical and reasonable explanations" are contradicted by reality, then it is safe to assume that the logic is flawed or the explanations are wrong or inapplicable. Performance is notoriously difficult to predict; performance testing is essential where speed is important.
Could you please help me decide which approach I should use?
You should use the one that tests faster on input that adequately models that which the program will see in real use.
Note also, however, that your two queries are not semantically equivalent: if the row with id = 5204 also has seen = 3 then the OR query will return it once, but the UNION ALL query will return it twice. It is pointless to choose between correct code and incorrect code on any basis other than which one is correct.
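One way to make the UNION ALL form return the same rows as the OR form is to exclude the overlapping row from the second branch. This is a sketch against the notifications table from the question; it relies on id being the primary key, and therefore never NULL.

```sql
SELECT n.* FROM notifications n WHERE n.id = 5204
UNION ALL
-- Skip the row already returned by the first branch.
SELECT n.* FROM notifications n WHERE n.seen = 3 AND n.id <> 5204;
```

Plain UNION (without ALL) would also remove the duplicate, at the cost of a de-duplication pass over the combined result.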
index_merge, as the name suggests, combines the primary keys from two indexes using a Sort Merge Join or Sort Merge Union, for AND and OR conditions respectively, and then looks up the rest of the values in the table by PK.
For this to work, the conditions on both indexes must be such that each index yields primary keys in order (yours are).
You can find the strict definition of the conditions in the docs, but in a nutshell, you should filter by all parts of the index with an equality condition, plus possibly <, =, or > on the PK.
If you have an index on (col1, col2, col3), this should be col1 = :val1 AND col2 = :val2 AND col3 = :val3 [ AND id > :id ] (the part in the square brackets is not necessary).
The following conditions won't work:
col1 = :val1 -- you omit col2 and col3
col1 = :val1 AND col2 = :val2 AND col3 > :val3 -- you can only use equality on key parts
As a free side effect, your output is sorted by id.
You could achieve similar results using this:
SELECT *
FROM (
SELECT 5204 id
UNION ALL
SELECT id
FROM mytable
WHERE seen = 3
AND id <> 5204
) q
JOIN mytable m
ON m.id = q.id
, except that in earlier versions of MySQL the derived table would have to be materialized which would definitely make the query performance worse, and your results would not have been ordered by id anymore.
In short, if your query allows index_merge(union), go for it.
The answer is contained in your question. The EXPLAIN output for OR says Using union(PRIMARY, seen) - that means that the index_merge optimization is being used and the query is actually executed by unioning results from the two indexes.
So MySQL can use an index in some cases, and it does in this one. But index_merge is not always available, or is not used because the index statistics say it won't be worth it. In those cases OR may be a lot slower than UNION (or not; always check both versions if you are not sure).
In your test you "got lucky" and MySQL did the right optimization for you automatically. It is not always so.
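To see how much the OR query depends on index_merge, one option is to compare its plan with the optimization switched off for the current session. This sketch assumes a MySQL version that exposes the index_merge flag in optimizer_switch.

```sql
-- Plan with index_merge available; the question reported
-- "Using union(PRIMARY, seen)" here.
EXPLAIN SELECT * FROM notifications WHERE id = 5204 OR seen = 3;

-- Disable index_merge for this session only and look again.
SET SESSION optimizer_switch = 'index_merge=off';
EXPLAIN SELECT * FROM notifications WHERE id = 5204 OR seen = 3;

-- Restore the default setting for the flag.
SET SESSION optimizer_switch = 'index_merge=default';
```

Timing both variants on realistic data remains the deciding test.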
I'm trying to get the column order in our indexes set correctly and haven't seen a direct answer on this. If we have a query like the following
SELECT ... all the things ...
FROM tb_contact
inner join tb_contact_association on tb_contact.id = tb_contact_association.attached_id
where tb_contact_association.contact_id = '498'
order by ...
We're looking at a pivot table, tb_contact_association on this join. And this table is never really queried without looking at both attached_id (on the join) and contact_id (the where).
When creating an index for tb_contact_association, should the index cover both "attached_id,contact_id" in that order? With the joined on first, then the where? Or the other way around? Or each of them individually?
Thanks.
Generally, the order of the fields in an index doesn't matter for a query that tests the indexed fields for equality, as long as the query uses the leading field(s) of the index.
e.g. for a query like:
SELECT .. WHERE f1 = 'a' AND f2 = 'b' AND f3 = 'c'
INDEX(f3, f2, f1) - index can be used
INDEX(f1, f3, f2) - can be used
INDEX(f1, f2, f3) - can be used
INDEX(f1, f3) - completely usable
INDEX(f3, f1) - completely usable
INDEX(f4, f1) - cannot be used - no 'f4' field in the where clause
INDEX(f1, f4) - can be used, because 'f1' is in the where clause, but f4
component will be ignored
The actual ordering of the WHERE clause doesn't matter. WHERE f1 = 'a' AND f2 = 'b' vs. WHERE f2 = 'b' AND f1 = 'a' are identical as far as the query compiler/optimizer is concerned.
The indexes needed depend on which direction the join will run. You can determine this by running an EXPLAIN on your select statement. In this case though, since your WHERE clause is filtering on the tb_contact_association table, the optimizer will most likely start with this table and join into the tb_contact table.
The exception would be if tb_contact is small (few rows) compared to tb_contact_association. To see why this is the case, consider an extreme example. If tb_contact is only one row long, it's obviously going to be faster to start from that row, join into the corresponding row in the tb_contact_association table, and test its value for contact_id, rather than go through the whole larger tb_contact_association table looking for contact_id=498 (even with an index), and then joining back to the tb_contact table.
But, for any normal tables, the query above would start with tb_contact_association. For a join, you need an index on the column you're joining to. In this case, that's tb_contact.id. You'll also want an index to help your WHERE clause, ie on tb_contact_association.contact_id.
You don't actually need an index on tb_contact_association.attached_id for this particular query, as long as the join always goes in the direction we expect. A composite index on (contact_id, attached_id) (in that order) in tb_contact_association should be a slight help, because it will allow all necessary info for that table to be pulled directly from the index, saving a read from the data table for each row. (With this index added, you should see "using index" in the extra section of the query EXPLAIN.) The contact_id column is used for the WHERE clause, just as with a single index on that column, but with the composite index, it can then just read attached_id straight from the index, rather than from the table.
Most likely, both fields should have an index. However, in this query only contact_id needs one; Nathan's answer explains why in more detail.
The optimal index for your specific query would be (contact_id, attached_id).
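A sketch of the composite index discussed above, followed by an EXPLAIN to confirm it is picked up. The index name is an arbitrary choice, and the select list is shortened to tb_contact.* for illustration.

```sql
-- WHERE column first, join column second.
ALTER TABLE tb_contact_association
  ADD INDEX idx_contact_attached (contact_id, attached_id);

-- With the index in place, the tb_contact_association row of the
-- plan should show "Using index" in the Extra column.
EXPLAIN
SELECT tb_contact.*
FROM tb_contact
INNER JOIN tb_contact_association
  ON tb_contact.id = tb_contact_association.attached_id
WHERE tb_contact_association.contact_id = '498';
```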
I'm working on a simple query that runs in about 1.2 seconds on a MyISAM table populated with 126,000 records:
SELECT * FROM my_table
WHERE primary_key != 5 AND
(
col1 = 528 OR (col2 = 265 AND col3 = 1)
)
ORDER BY primary_key DESC
I have already created single-column indexes for each field used in the WHERE clause, but only primary_key (the auto-increment field of my_table) is used as the key, while col1 and col2 are just ignored and the query becomes much slower. How should I create the indexes (maybe multi-column indexes), or should I edit the query?
You will get optimal performance if you have the following multi-column "covering" indexes:
(primary_key, col1)
(primary_key, col2, col3)
And issue the following query:
(SELECT * FROM my_table
WHERE primary_key != 5 AND
col1 = 528)
UNION
(SELECT * FROM my_table
WHERE primary_key != 5 AND
col2 = 265 AND col3 = 1)
ORDER BY primary_key DESC
You may get different, possibly better, performance by changing the order of the fields in the indexes, based on cardinality.
In your original query, no index could be used for the entire selection in the WHERE clause, which caused partial table scanning.
In the above query, the first subquery is able to utilize the first index completely, avoiding any table scanning, and the second subquery uses the second index.
Unfortunately, MySQL won't be able to utilize an index to sort the records on the full result set, and will probably use filesort to order them. So, if you don't need the records ordered by primary_key, remove the outer ORDER clause for better performance, though if the result set is small, it shouldn't be an issue either way.
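As one ordering worth comparing, the sketch below puts the equality-tested columns first and primary_key last, since an index is generally most useful when its leading columns are matched by equality. The index names are arbitrary, and whether this beats the ordering above should be checked with EXPLAIN and timing on the real data.

```sql
-- Alternative column order: equality columns lead, the
-- "!= 5" column trails.
CREATE INDEX idx_col1_pk ON my_table (col1, primary_key);
CREATE INDEX idx_col2_col3_pk ON my_table (col2, col3, primary_key);

-- Check which index each branch of the UNION uses.
EXPLAIN
(SELECT * FROM my_table
 WHERE primary_key != 5 AND col1 = 528)
UNION
(SELECT * FROM my_table
 WHERE primary_key != 5 AND col2 = 265 AND col3 = 1)
ORDER BY primary_key DESC;
```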
Use EXPLAIN to find out what is going on. This will identify what indexes are being used and hence enable you to tune the query.
Unfortunately, there is almost no way to predict which indexes will work better than others without a thorough understanding of MySQL internals. However, for each query (or at least each sub-query/join), MySQL will generally use only one index per table, apart from index_merge plans. You have stated that the primary key is being used, so I assume you have looked at the EXPLAIN output.
You will likely want to try multi-column indexes on either all of (primary_key, col1, col2, col3) or a subset of these, in different orders, to find the best result. The best index will depend primarily on the cardinality of the columns and thus may even change over time, depending on the data in the table.
Let's say we have a common join like the one below:
EXPLAIN SELECT *
FROM visited_links vl
JOIN device_tracker dt ON ( dt.Client_id = vl.Client_id
AND dt.Device_id = vl.Device_id )
GROUP BY dt.id
if we execute an explain, it says:
id | select_type | table | type  | possible_keys         | key       | key_len | ref                       | rows | Extra
1  | SIMPLE      | vl    | index | NULL                  | vl_id     | 273     | NULL                      | 1977 | Using index; Using temporary; Using filesort
1  | SIMPLE      | dt    | ref   | Device_id,Device_id_2 | Device_id | 257     | datumprotect.vl.device_id | 4    | Using where
I know that it is sometimes difficult to choose the right indexes when you are using GROUP BY, but what indexes could I create to avoid 'Using temporary; Using filesort' in this query? Why is this happening? And especially, why does it happen after using an index?
One point to mention is that the fields returned by the select (* in this case) should either be in the GROUP BY clause or be wrapped in aggregate functions such as SUM() or MAX(). Otherwise unexpected results can occur. This is because if the database is not told how to choose fields that are not in the GROUP BY clause, you may get any member of the group, pretty much at random.
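A sketch of the same join with a deterministic select list: only the grouped column plus aggregates. The visits alias and the choice of COUNT(*) are illustrative assumptions about what the query is meant to report.

```sql
SELECT dt.id,
       COUNT(*) AS visits   -- aggregate over each dt.id group
FROM visited_links vl
JOIN device_tracker dt
  ON dt.Client_id = vl.Client_id
 AND dt.Device_id = vl.Device_id
GROUP BY dt.id;
```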
The way I look at it is to break the query down into bits.
You have a join on (dt.Client_id = vl.Client_id and dt.Device_id = vl.Device_id), so all of those fields should be indexed in their respective tables.
You are using GROUP BY dt.id so you need an index that includes dt.id
BUT...
an index on (dt.client_id,dt.device_id,dt.id) will not work for the GROUP BY
and
an index on (dt.id, dt.client_id, dt.device_id) will not work for the join.
Sometimes you end up with a query which just can't use an index.
See also:
http://ntsrikanth.blogspot.com/2007/11/sql-query-order-of-execution.html
You didn't post your indices, but first of all, you'll want to have an index for (client_id, device_id) on visited_links, and (client_id, device_id, id) on device_tracker to make sure that query is fully indexed.
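The two indexes described above could be created as follows; the index names are arbitrary, and the existing Device_id and Device_id_2 indexes shown in the EXPLAIN output are left untouched.

```sql
ALTER TABLE visited_links
  ADD INDEX idx_vl_client_device (Client_id, Device_id);

ALTER TABLE device_tracker
  ADD INDEX idx_dt_client_device_id (Client_id, Device_id, id);
```

Re-run the EXPLAIN afterwards to see whether the plan for dt changes.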
From page 191 of the excellent High Performance MySQL, 2nd Ed.:
MySQL has two kinds of GROUP BY strategies when it can't use an index: it can use a temporary table or a filesort to perform the grouping. Either one can be more efficient depending on the query. You can force the optimizer to choose one method or the other with the SQL_BIG_RESULT and SQL_SMALL_RESULT optimizer hints.
In your case, I think the issue stems from joining on multiple columns and using GROUP BY together, even after the suggested indices are in place. If you remove either (a) one of the join conditions or (b) the GROUP BY, this shouldn't need a filesort.
However, keep in mind that a filesort doesn't always use actual files, it can also happen entirely within a memory buffer if the result set is small enough, so the performance penalty may be minimal. Consider the wall-clock time for the query too.
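For reference, the SQL_BIG_RESULT and SQL_SMALL_RESULT hints from the quoted passage go directly after SELECT. The sketch below applies them to a grouped version of the query; the COUNT(*) select list is an illustrative assumption.

```sql
-- Ask the optimizer to prefer sorting for the grouping.
EXPLAIN
SELECT SQL_BIG_RESULT dt.id, COUNT(*) AS visits
FROM visited_links vl
JOIN device_tracker dt
  ON dt.Client_id = vl.Client_id
 AND dt.Device_id = vl.Device_id
GROUP BY dt.id;

-- Ask the optimizer to prefer a temporary table instead.
EXPLAIN
SELECT SQL_SMALL_RESULT dt.id, COUNT(*) AS visits
FROM visited_links vl
JOIN device_tracker dt
  ON dt.Client_id = vl.Client_id
 AND dt.Device_id = vl.Device_id
GROUP BY dt.id;
```

Comparing the two plans, and the wall-clock time of each, shows which strategy suits this data set.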
HTH!