MySQL index order with joins - mysql

I'm trying to get the column order in our indexes set correctly and haven't seen a direct answer on this. If we have a query like the following
SELECT ... all the things ...
FROM tb_contact
inner join tb_contact_association on tb_contact.id = tb_contact_association.attached_id
where tb_contact_association.contact_id = '498'
order by ...
We're looking at a pivot table, tb_contact_association on this join. And this table is never really queried without looking at both attached_id (on the join) and contact_id (the where).
When creating an index for tb_contact_association, should the index cover both "attached_id,contact_id" in that order? With the joined on first, then the where? Or the other way around? Or each of them individually?
Thanks.

Generally, the ordering of fields in an index doesn't matter for a query like this, IF every field is compared with equality and the fields you use form a leftmost prefix of the index.
e.g. for a query like:
SELECT .. WHERE f1 = 'a' AND f2 = 'b' AND f3 = 'c'
INDEX(f3, f2, f1) - index can be used
INDEX(f1, f3, f2) - can be used
INDEX(f1, f2, f3) - can be used
INDEX(f1, f3) - completely usable
INDEX(f3, f1) - completely usable
INDEX(f4, f1) - cannot be used - no 'f4' field in the where clause
INDEX(f1, f4) - can be used, because 'f1' is in the where clause, but f4
component will be ignored
The actual ordering of the WHERE clause doesn't matter. WHERE f1 = 'a' AND f2 = 'b' vs. WHERE f2 = 'b' AND f1 = 'a' are identical as far as the query compiler/optimizer is concerned.
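These cases can be checked with EXPLAIN. The following is only a minimal sketch: the table name idx_demo, the index name and the column types are made up for illustration and do not come from the question.

```sql
-- Hypothetical table for illustration; names and types are assumptions.
CREATE TABLE idx_demo (
  id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  f1 VARCHAR(10) NOT NULL,
  f2 VARCHAR(10) NOT NULL,
  f3 VARCHAR(10) NOT NULL,
  f4 VARCHAR(10) NOT NULL,
  INDEX idx_f3_f2_f1 (f3, f2, f1)
);

-- All three indexed columns are compared with equality,
-- so the whole index is a candidate for the lookup.
EXPLAIN SELECT id FROM idx_demo WHERE f1 = 'a' AND f2 = 'b' AND f3 = 'c';

-- Same predicates written in a different order.
EXPLAIN SELECT id FROM idx_demo WHERE f3 = 'c' AND f2 = 'b' AND f1 = 'a';
```

On a table with realistic data, comparing the key and key_len columns of the two EXPLAIN results should show whether the optimizer treats both forms the same way.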

The indexes needed depend on which direction the join will run. You can determine this by running an EXPLAIN on your select statement. In this case though, since your WHERE clause is filtering on the tb_contact_association table, the optimizer will most likely start with this table and join into the tb_contact table.
The exception would be if tb_contact is small (few rows) compared to tb_contact_association. To see why this is the case, consider an extreme example. If tb_contact is only one row long, it's obviously going to be faster to start from that row, join into the corresponding row in the tb_contact_association table, and test its value for contact_id, rather than go through the whole larger tb_contact_association table looking for contact_id=498 (even with an index), and then joining back to the tb_contact table.
But, for any normal tables, the query above would start with tb_contact_association. For a join, you need an index on the column you're joining to. In this case, that's tb_contact.id. You'll also want an index to help your WHERE clause, i.e. on tb_contact_association.contact_id.
You don't actually need an index on tb_contact_association.attached_id for this particular query, as long as the join always goes in the direction we expect. A composite index on (contact_id, attached_id) (in that order) in tb_contact_association should be a slight help, because it will allow all necessary info for that table to be pulled directly from the index, saving a read from the data table for each row. (With this index added, you should see "using index" in the extra section of the query EXPLAIN.) The contact_id column is used for the WHERE clause, just as with a single index on that column, but with the composite index, it can then just read attached_id straight from the index, rather than from the table.

Most likely, both fields should have an index. However, in this query only contact_id needs an index; Nathan's answer explains why in more detail.
The optimal index for your specific query would be (contact_id, attached_id).
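As a rough sketch of that suggestion, the composite index could be created and checked like this. The index name idx_contact_attached is an assumption, and the select list is cut down to a single column because the original one is elided.

```sql
-- Index name is hypothetical; column order follows the answers above:
-- the WHERE column first, then the join column.
ALTER TABLE tb_contact_association
  ADD INDEX idx_contact_attached (contact_id, attached_id);

-- Check the join direction and whether "Using index" appears
-- for tb_contact_association in the Extra column.
EXPLAIN
SELECT tb_contact.id
FROM tb_contact
INNER JOIN tb_contact_association
  ON tb_contact.id = tb_contact_association.attached_id
WHERE tb_contact_association.contact_id = '498';
```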

Related

Is it possible to further optimize this MySQL query?

I was running a query of this kind:
SELECT
-- fields
FROM
table1 JOIN table2 ON (table1.c1 = table2.c1 OR table1.c2 = table2.c2)
WHERE
-- conditions
But the OR made it very slow, so I split it into two queries:
SELECT
-- fields
FROM
table1 JOIN table2 ON table1.c1 = table2.c1
WHERE
-- conditions
UNION
SELECT
-- fields
FROM
table1 JOIN table2 ON table1.c2 = table2.c2
WHERE
-- conditions
This works much better, but now I am going through the tables twice, so I was wondering whether any further optimization is possible, for instance getting the set of entries that satisfies the condition (table1.c1 = table2.c1 OR table1.c2 = table2.c2) and then querying on it. That would bring me back to what I was doing at first, but maybe there is another solution I don't have in mind. So is there anything more to do, or is it already optimal?
Splitting the query into two separate ones is usually better in MySQL, since it rarely uses the "Index OR" operation (Index Merge in MySQL lingo).
There are a few items I would concentrate on for further optimization, all related to indexing:
1. Filter the rows faster
The predicates in the WHERE clause should be optimized to retrieve as few rows as possible. They should also be analyzed in terms of selectivity, so that you can create indexes that produce the data with as little filtering as possible (fewer reads).
2. Join access
Retrieving related rows should be optimized as well. According to selectivity you need to decide which table is more selective and use it as a driving table, and consider the other one as the nested loop table. Now, for the latter, you should create an index that will retrieve rows in an optimal way.
3. Covering Indexes
Last but not least, if your query is still slow, there's one more thing you can do: use covering indexes. That is, expand your indexes to include all the columns the query needs from the driving and/or secondary tables. This way the InnoDB engine won't need to read two indexes per table (the secondary index and then the clustered primary key), but a single one.
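To make the covering-index idea concrete, here is a small sketch. The column col_a and the index name are hypothetical stand-ins for the elided "-- fields"; the right column list depends entirely on the real query.

```sql
-- col_a is a hypothetical column standing in for the selected fields.
-- The join column comes first so the index can serve the lookup;
-- the extra column lets the query be answered from the index alone.
CREATE INDEX idx_table2_c1_cover ON table2 (c1, col_a);

-- If the index covers the query for table2, EXPLAIN should report
-- "Using index" in the Extra column for that table.
EXPLAIN
SELECT table2.col_a
FROM table1
JOIN table2 ON table1.c1 = table2.c1;
```

The trade-off is a wider index: every added column makes the index larger and writes slightly more expensive.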
Test
SELECT
-- fields
FROM
table1 JOIN table2 ON table1.c1 = table2.c1
WHERE
-- conditions
UNION ALL
SELECT
-- fields
FROM
table1 JOIN table2 ON table1.c2 = table2.c2
WHERE
-- conditions
/* add one more condition which eliminates the rows selected by 1st subquery */
AND table1.c1 != table2.c1
Copied from the comments:
Nico Haase > What do you mean by "test"?
The OP shows query patterns only, so I cannot predict whether the technique will be effective. I suggest the OP test my variant on their own structure and data.
Nico Haase > what you've changed
I have added one more condition to 2nd subquery - see added comment in the code.
Nico Haase > and why?
This replaces UNION DISTINCT with UNION ALL and eliminates the sorting of the combined rowset that duplicate removal requires.
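One caveat worth testing: if c1 can be NULL, a plain inequality evaluates to NULL when either side is NULL, so rows that match only on c2 would be dropped from the second branch. A possible null-safe variant uses MySQL's <=> operator. This is a sketch with the elided fields and conditions left out, and the select list is an assumption.

```sql
SELECT table1.c1, table1.c2
FROM table1
JOIN table2 ON table1.c1 = table2.c1
UNION ALL
SELECT table1.c1, table1.c2
FROM table1
JOIN table2 ON table1.c2 = table2.c2
-- <=> is MySQL's null-safe equality: it returns 0 or 1, never NULL,
-- so NOT (...) keeps rows where c1 is NULL on either side.
WHERE NOT (table1.c1 <=> table2.c1);
```

Unlike UNION DISTINCT, UNION ALL also keeps duplicates that occur within a single branch, so the two forms return the same rows only when each branch is already duplicate-free.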

Index when using OR in query

What is the best way to create index when I have a query like this?
... WHERE (user_1 = '$user_id' OR user_2 = '$user_id') ...
I know that only one index can be used in a query so I can't create two indexes, one for user_1 and one for user_2.
Also could solution for this type of query be used for this query?
WHERE ((user_1 = '$user_id' AND user_2 = '$friend_id') OR (user_1 = '$friend_id' AND user_2 = '$user_id'))
MySQL has a hard time with OR conditions. In theory, there's an index merge optimization that @duskwuff mentions, but in practice, it doesn't kick in when you think it should. Besides, it doesn't perform as well as a single index when it does.
The solution most people use to work around this is to split up the query:
SELECT ... WHERE user_1 = ?
UNION
SELECT ... WHERE user_2 = ?
That way each query will be able to use its own choice for index, without relying on the unreliable index merge feature.
Your second query is optimizable more simply. It's just a tuple comparison. It can be written this way:
WHERE (user_1, user_2) IN (('$user_id', '$friend_id'), ('$friend_id', '$user_id'))
In old versions of MySQL, tuple comparisons would not use an index, but since 5.7.3, it will (see https://dev.mysql.com/doc/refman/5.7/en/row-constructor-optimization.html).
P.S.: Don't interpolate application code variables directly into your SQL expressions. Use query parameters instead.
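As an illustration of query parameters, the tuple comparison can be written as a server-side prepared statement in plain MySQL. The table name friendships, the statement name and the sample values are assumptions; application code would normally use its driver's placeholder mechanism instead.

```sql
-- Table name "friendships" is hypothetical.
PREPARE find_pair FROM
  'SELECT * FROM friendships
   WHERE (user_1, user_2) IN ((?, ?), (?, ?))';

-- Sample values, for illustration only.
SET @user_id = 1, @friend_id = 2;

-- Placeholders are bound in order: (user, friend), then (friend, user).
EXECUTE find_pair USING @user_id, @friend_id, @friend_id, @user_id;

DEALLOCATE PREPARE find_pair;
```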
I know that only one index can be used in a query…
This is incorrect. Under the right circumstances, MySQL will routinely use multiple indexes in a query. (For example, a query JOINing multiple tables will almost always use at least one index on each table involved.)
In the case of your first query, MySQL will use an index merge union optimization. If both columns are indexed, the EXPLAIN output will give an explanation along the lines of:
Using union(index_on_user_1,index_on_user_2); Using where
The query shown in your second example is covered by an index on (user_1, user_2). Create that index if you plan on running those queries routinely.
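A sketch of the indexes this answer refers to, assuming a hypothetical table named friendships and a sample user id of 42; the index names mirror the EXPLAIN text quoted above.

```sql
-- First query: one single-column index per side of the OR.
CREATE INDEX index_on_user_1 ON friendships (user_1);
CREATE INDEX index_on_user_2 ON friendships (user_2);

-- Check whether the optimizer actually picks the index merge union.
EXPLAIN
SELECT * FROM friendships
WHERE user_1 = 42 OR user_2 = 42;

-- Second query: a composite index on both columns.
-- It can also serve lookups on user_1 alone, which would make
-- index_on_user_1 redundant if both are kept.
CREATE INDEX index_on_user_1_user_2 ON friendships (user_1, user_2);
```

Whether the merge is chosen depends on table statistics, so the EXPLAIN output is worth checking against real data.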
The two cases are different.
In the first case, both columns need to be searched for the same value. A two-column index (u1, u2) can be used for the search on u1 but not for the search on u2. With separate indexes on u1 and u2, both will probably be used. The choice is driven by statistics on how many rows are expected to be returned: if few rows are expected, an index seek is chosen when a suitable index is available; if many are expected, a scan (of the table or of an index) is preferable.
In the second case, both columns need to be checked again, but each search consists of two sub-searches, where the second sub-search runs on the results of the first because of the AND condition. Here indexes matter more, and two indexes on u1 and u2 will help, since whichever field is searched first will have an index. The choice to use an index is made as described above.
In either case, however, every OR forces one more search or set of searches. So the proposed solution of splitting with UNION costs nothing extra: the table will be searched x times whether you write one SELECT with ORs or x SELECTs with UNION, regardless of index selection and type of search (seek or scan). Since each SELECT in the UNION gets its own part of the execution plan, it is more likely that the (single-column) indexes will be used, and the row sets from all parts around the ORs are then combined. If you do not want to copy a large SELECT statement into many UNION branches, you can fetch the primary key values first and then select by those, or use a view to keep the bulk of the statement in one place.
Finally, if you exclude the union option, there is a way to trick the optimizer to use a single index. Create a double index u1,u2 (or u2,u1 - whatever column has higher cardinality goes first) and modify your statement so all OR parts use all columns:
... WHERE (user_1 = '$user_id' OR user_2 = '$user_id') ...
will be converted to:
... WHERE ((user_1 = '$user_id' and user_2=user_2) OR (user_1=user_1 and user_2 = '$user_id')) ...
This way a double index (u1,u2) will be used at all times. Please note that this will not work if the columns are nullable (user_2 = user_2 is not true for NULL), and working around that with ISNULL or COALESCE may cause the index not to be selected. It will work with ANSI nulls off, however.

Need some clarification on indexes (WHERE, JOIN)

We are facing some performance issues in some reports that work on millions of rows. I tried optimizing sql queries, but it only reduces the time of execution to half.
The next step is to analyse and modify or add some indexes, so I have some questions:
1- The SQL queries contain a lot of joins: do I have to create an index for each foreign key?
2- Imagine the request SELECT * FROM A LEFT JOIN B on a.b_id = b.id where a.attribute2 = 'someValue', and we have an index on table A based on b_id and attribute2: does my request use this index for the WHERE part? (I know that if the two conditions were both in the WHERE clause, the index would be used.)
3- If an index is based on columns C1, C2 and C3, and I decide to add an index based on C2, do I need to remove C2 from the first index?
Thanks for your time
You can use EXPLAIN on a query to see what MySQL will do when executing it. This helps a LOT when trying to figure out why it's slow.
JOIN-ing happens one table at a time, and the order is determined by MySQL analyzing the query and trying to find the fastest order. You will see it in the EXPLAIN result.
Only one index can be used per JOIN and it has to be on the table being joined. In your example the index used will be the id (primary key) on table B. Creating an index on every FK will give MySQL more options for the query plan, which may help in some cases.
There is only a difference between WHERE and JOIN conditions when there are NULL (missing rows) for the joined table (there is no difference at all for INNER JOIN). For your example the index on b_id does nothing. If you change it to an INNER JOIN (e.g. by adding b.something = 42 in the where clause), then it might be used if MySQL determines that it should do the query in reverse (first b, then a).
No. It is 100% OK to have a column in multiple indexes. If you have an index on (A,B,C) and you add another one on (A), that will be redundant and pointless (because it is a prefix of another index). An index on B is perfectly fine.
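For the query in question 2, a sketch of an index that can actually serve the WHERE part. The index name is made up, and whether b_id belongs in the index at all depends on the rest of the workload.

```sql
-- The existing index (b_id, attribute2) cannot be used to seek on
-- attribute2 alone, because attribute2 is not its leftmost column.
-- Putting attribute2 first lets the WHERE filter use the index.
CREATE INDEX idx_a_attribute2_b_id ON A (attribute2, b_id);

-- The join into B should still go through B's primary key.
EXPLAIN
SELECT *
FROM A
LEFT JOIN B ON A.b_id = B.id
WHERE A.attribute2 = 'someValue';
```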

MySQL indexes optimisation

I have a big query with different tables queried with joins and with WHERE CLAUSES.
Now from my understanding the best index to have is to see the WHERE CLAUSE and add it as an index
select name from Table WHERE name = 'John'
We would have an index on the "name" field .
How would we determine the best index to have if the clause looks like this:
WHERE table1.field = 'x' and table2.field = 'y' etc...
Of course the query is much more complicated than that; I just want to know how to proceed, and whether you have a better idea.
SELECT ...
FROM tA
JOIN tB ON tA.x = tB.y
WHERE tA.name = 'foo'
AND tB.name = 'bar'
begs for
tA: INDEX(name, x)
tB: INDEX(name, y)
On the other hand:
SELECT ...
FROM tA
JOIN tB ON tA.name = tB.name
needs INDEX(name) on both tables.
If name is the PRIMARY KEY on each table, then those indexes are redundant and should not be added.
Etc.
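Spelled out as DDL, the first recommendation might look like the sketch below. The index names are assumptions, and the select list is a placeholder for the elided one.

```sql
-- The filter column comes first, then the join column.
ALTER TABLE tA ADD INDEX idx_ta_name_x (name, x);
ALTER TABLE tB ADD INDEX idx_tb_name_y (name, y);

-- Whichever table the optimizer starts with, the other one can be
-- reached through its (name, join column) index.
EXPLAIN
SELECT tA.x
FROM tA
JOIN tB ON tA.x = tB.y
WHERE tA.name = 'foo'
  AND tB.name = 'bar';
```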
How would we determine the best index to have if the clause looks like this:
WHERE table1.field = 'x' and table2.field = 'y' etc...
First of all, since you are joining two tables, the join fields should be indexed, and for better performance these fields should be of an integer type.
Next, check which condition filters out the most rows and create an index on that field, or a composite index on multiple fields (make sure the field that filters the most data is leftmost in the index), but don't let the index grow too large.
Normally (not always) a table uses a single index per query, so since you are filtering data from multiple tables, you can create an index on the filtered columns of each table if those fields filter the data sufficiently.
Further anyone can advise better after seeing your actual query.
There is no such thing as a single index for multiple tables. The first thing you could do is create an index on field for table1 and another one on field for table2. If this is still not fast enough, depending on your database schema, you could set a foreign key.
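Lastly, you can create a view which contains data from both tables and then index that view. The advantage is having the data pre-joined, which might make the query even faster. Note, however, that MySQL does not support indexed (materialized) views, so in MySQL this would have to be a regular pre-joined table that you index and keep up to date yourself.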

Index on HAVING?

In my basic understanding of an index, the index is used on a column in a WHERE clause. Since the HAVING clause is similar to a WHERE clause applied after a GROUP BY statement, does an index have the same effect on that? For example:
SELECT * FROM table WHERE full_name = 'Bob Jones'
--> index on full_name would be beneficial here
and
SELECT * FROM table WHERE first_name = 'Bob'
GROUP BY
height HAVING height > 72
In this second query, would an index on both first_name and height improve the performance? Which index would be more important, or are they roughly the equivalent? Also, do indexes improve GROUP BY performance as well (regardless of a HAVING)?
A HAVING clause is essentially the last thing done to filter a query's results before they're sent off to the client. It's only useful if you need to filter on the results of an aggregate function, whose value can NOT be available during the row-level filtering that WHERE clauses do.
Essentially, a HAVING clause can be seen as applying another query, turning your main query into a subquery.
e.g.
SELECT ...
FROM sometable
HAVING somefield = X
is really no different than
SELECT *
FROM (
SELECT ...
FROM sometable
) AS t
WHERE somefield = X
If the field you're filtering with the HAVING is NOT a derived field (aggregate value, calculated field, etc.) then you're almost certainly better off doing the filtering at the WHERE level, which keeps unnecessary rows from being loaded off disk in the first place.
Since having is applied last, rows will be loaded from disk, then possibly discarded if they don't match the HAVING criteria.
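Applied to the example from the question, that advice could be sketched as follows. The table name people, the index name and the COUNT(*) column are assumptions added to make the queries valid; the original uses SELECT * on a table called table.

```sql
-- Hypothetical index: equality column first, then the range column.
CREATE INDEX idx_people_first_name_height ON people (first_name, height);

-- height is a plain column, so the filter belongs in WHERE,
-- where the index above can help before any grouping happens.
SELECT height, COUNT(*) AS people_count
FROM people
WHERE first_name = 'Bob'
  AND height > 72
GROUP BY height;

-- HAVING is only needed for a condition on an aggregate,
-- which does not exist until after GROUP BY has run.
SELECT height, COUNT(*) AS people_count
FROM people
WHERE first_name = 'Bob'
GROUP BY height
HAVING COUNT(*) > 1;
```

With first_name fixed by equality, the same index may also let MySQL read rows already ordered by height, which can help the GROUP BY; EXPLAIN on real data will show whether it does.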