I have a big query that joins several tables and filters them with WHERE clauses.
From my understanding, the best index to add is found by looking at the WHERE clause. So for
select name from Table WHERE name = 'John'
we would have an index on the "name" field.
How would we determine the best index to have if the clause looks like this:
WHERE table1.field = 'x' and table2.field = 'y' etc...
Of course the real query is much more complicated than that; I just want to know how to proceed, and whether you have a better idea.
SELECT ...
FROM tA
JOIN tB ON tA.x = tB.y
WHERE tA.name = 'foo'
AND tB.name = 'bar'
begs for
tA: INDEX(name, x)
tB: INDEX(name, y)
On the other hand:
SELECT ...
FROM tA
JOIN tB ON tA.name = tB.name
needs INDEX(name) on both tables.
If name is the PRIMARY KEY on each table, then those indexes are redundant and should not be added.
Etc.
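As a sanity check, here is a toy Python/SQLite sketch of the first case (SQLite's planner is not MySQL's optimizer, but leftmost-prefix index usage works the same way); the tA/tB names mirror the example above:

```python
import sqlite3

# Build the two suggested indexes and ask SQLite how it would run the query.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE tA (name TEXT, x INTEGER);
    CREATE TABLE tB (name TEXT, y INTEGER);
    CREATE INDEX tA_name_x ON tA (name, x);
    CREATE INDEX tB_name_y ON tB (name, y);
""")
plan = con.execute("""
    EXPLAIN QUERY PLAN
    SELECT * FROM tA JOIN tB ON tA.x = tB.y
    WHERE tA.name = 'foo' AND tB.name = 'bar'
""").fetchall()
for row in plan:
    print(row[-1])  # e.g. "SEARCH tA USING COVERING INDEX tA_name_x (name=?)"
```

Both index names show up in the plan: each table is reached through its (name, ...) index rather than a full table scan.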
How would we determine the best index to have if the clause looks like this:
WHERE table1.field = 'x' and table2.field = 'y' etc...
First of all, since you are joining two tables, the join columns should be indexed; for better performance these columns should ideally be of an integer type.
Next, check which condition filters out the most rows, and create an index on that column, or a composite index on several columns (put the column that filters the most data leftmost in the index), while keeping the index size from growing too much.
Normally (not always) a query uses a single index per table, so since you are filtering on multiple tables, create an index on each table's filter columns if those columns filter out enough data.
Further anyone can advise better after seeing your actual query.
There is no such thing as a single index covering multiple tables. The first thing you could do is create an index for table1 on field and another one for table2 on field. If this is still not fast enough, depending on your database schema, you could set a foreign key.
Lastly, you can materialize the data from both tables into a separate table and index that (MySQL does not support indexed views). The advantage is that the data is pre-joined, which might make the query even faster.
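As an illustrative sketch (table and column names invented for the example), here is the pre-joined idea in Python with SQLite: materialize the join into a plain table once, then index it:

```python
import sqlite3

# Materialize a two-table join into a plain table and index it.
# This table must be refreshed whenever the base tables change.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE table1 (id INTEGER PRIMARY KEY, field TEXT);
    CREATE TABLE table2 (id INTEGER PRIMARY KEY, t1_id INTEGER, field TEXT);
    INSERT INTO table1 VALUES (1, 'x'), (2, 'z');
    INSERT INTO table2 VALUES (10, 1, 'y'), (11, 2, 'y');

    -- Pre-join the data once.
    CREATE TABLE joined AS
        SELECT t1.id AS t1_id, t1.field AS f1, t2.field AS f2
        FROM table1 t1 JOIN table2 t2 ON t2.t1_id = t1.id;
    CREATE INDEX joined_f1_f2 ON joined (f1, f2);
""")
rows = con.execute(
    "SELECT t1_id FROM joined WHERE f1 = 'x' AND f2 = 'y'"
).fetchall()
print(rows)  # [(1,)]
```

The WHERE clause that used to span two tables now hits a single composite index on one table.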
I am trying to improve the performance of a query by using a "materialized view" to optimize away joins. The first query below is the original, which employs joins. The second is written against a table I generated that includes all the joined data (the equivalent of a materialized view). They both return the same result set. Unfortunately, the second query is somehow MUCH slower when handling a very long set of input ids (the IN clause). I don't understand how that could be! Executing all the joins must have a fair amount of overhead that the "materialized view" saves, right?
SELECT
clinical_sample.INTERNAL_ID AS "internalId",
sample.STABLE_ID AS "sampleId",
patient.STABLE_ID AS "patientId",
clinical_sample.ATTR_ID AS "attrId",
cancer_study.CANCER_STUDY_IDENTIFIER AS "studyId",
clinical_sample.ATTR_VALUE AS "attrValue"
FROM clinical_sample
INNER JOIN sample ON clinical_sample.INTERNAL_ID = sample.INTERNAL_ID
INNER JOIN patient ON sample.PATIENT_ID = patient.INTERNAL_ID
INNER JOIN cancer_study ON patient.CANCER_STUDY_ID =
cancer_study.CANCER_STUDY_ID
WHERE cancer_study.CANCER_STUDY_IDENTIFIER = 'xxxxx'
AND sample.STABLE_ID IN
('P-0068343-T02-IM7' , 'P-0068353-T01-IM7' ,
'P-0068363-T01-IM7' , 'P-0068364-T01-IM7' )
AND clinical_sample.ATTR_ID IN
(
'CANCER_TYPE'
);
SELECT
internalId,
sampleId,
patientId,
attrId,
studyId,
attrValue
FROM test
WHERE
sampleId IN ('P-0068343-T02-IM7' , 'P-0068353-T01-IM7' ,
'P-0068363-T01-IM7' , 'P-0068364-T01-IM7' )
AND studyId = 'xxxxx'
AND attrId = 'CANCER_TYPE';
Update: I did notice in the Workbench report that the query with joins scans far fewer rows: about 829k vs ~2400k for the second, joinless query. So the joins actually seem to be a major optimization somehow. I have indexes on sampleId, studyId, and attrId, plus a composite of all three.
Both table "test" and "clinical_sample" have the same number of rows.
It would help to see what the PRIMARY KEY of each table is.
Some of these indexes are likely to help:
clinical_sample: INDEX(ATTR_ID, INTERNAL_ID, ATTR_VALUE)
sample: INDEX(STABLE_ID, INTERNAL_ID, PATIENT_ID)
patient: INDEX(INTERNAL_ID, STABLE_ID, CANCER_STUDY_ID)
cancer_study: INDEX(CANCER_STUDY_IDENTIFIER, CANCER_STUDY_ID)
I agree with Barmar's INDEX(studyId, attrId, sampleId) for the materialized view.
I have indexes on sampleId, studyId, attrId and a composite of all three.
Let's see the EXPLAIN. It may show that it is using your index just on (sampleId) when it should be using the composite index.
Also put the IN column last, not first, regardless of cardinality. More precisely, put = columns first in a composite index.
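To illustrate the "= columns first, IN column last" rule, here is a toy Python/SQLite sketch (SQLite's planner differs from MySQL's, but it shows all three columns of the composite index participating in the lookup); column names follow the question:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE test (sampleId TEXT, studyId TEXT, attrId TEXT,
                       attrValue TEXT)
""")
# '=' columns (studyId, attrId) first, the IN column (sampleId) last:
con.execute("CREATE INDEX test_ix ON test (studyId, attrId, sampleId)")
plan = con.execute("""
    EXPLAIN QUERY PLAN
    SELECT attrValue FROM test
    WHERE studyId = 'xxxxx' AND attrId = 'CANCER_TYPE'
      AND sampleId IN ('P-0068343-T02-IM7', 'P-0068353-T01-IM7')
""").fetchall()
for row in plan:
    print(row[-1])
```

The plan references test_ix with all three key parts; if the IN column came first, only that one column could be used for the lookup.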
Food for thought: When and why are database joins expensive?
This leads me to believe that normalized tables with indexes can actually be faster than my denormalized attempt (materialized view).
I am new to database indexes and have just read about what an index is, the differences between clustered and non-clustered indexes, and what a composite index is.
So for an inner join query like this:
SELECT columnA
FROM table1
INNER JOIN table2
ON table1.columnA= table2.columnA;
In order to speed up the join, should I create two indexes, one on table1.columnA and the other on table2.columnA, or just one index on either table1 or table2?
Is one good enough? I don't get it. For example, if I select some data from table2 first and, based on the result, join on columnA, then I am looping through the results from table2 one by one; an index on table2.columnA is totally useless here, because I no longer need to find anything in table2. Instead I need an index on table1.columnA.
And vice versa: I need an index on table2.columnA if I select some results from table1 first and then join on columnA.
Well, I don't know what "select xxxx first then join based on ..." actually looks like in reality, but that scenario just came to mind. It would be much appreciated if someone could also give a simple example.
One index is sufficient, but the question is which one?
It depends on how the MySQL optimizer decides to order the tables in the join.
For an inner join, the results are the same for table1 INNER JOIN table2 versus table2 INNER JOIN table1, so the optimizer may choose to change the order. It is not constrained to join the tables in the order you specified them in your query.
The difference from an indexing perspective is whether it will first loop over rows of table1, and do lookups to find matching rows in table2, or vice-versa: loop over rows of table2 and do lookups to find rows in table1.
MySQL does joins as "nested loops". It's as if you had written code in your favorite language like this:
foreach row in table1 {
look up rows in table2 matching table1.column_name
}
This lookup will make use of the index in table2. An index in table1 is not relevant to this example, since your query is scanning every row of table1 anyway.
How can you tell which table order is used? You can use EXPLAIN. It will show you a row for each table reference in the query, and it will present them in the join order.
Keep in mind the presence of an index in either table may influence the optimizer's choice of how to order the tables. It will try to pick the table order that results in the least expensive query.
So maybe it doesn't matter which table you add the index to, because whichever one you put the index on will become the second table in the join order, because it makes it more efficient to do the lookup that way. Use EXPLAIN to find out.
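The nested-loop idea above can be sketched in plain Python, with a dict standing in for the index on table2.columnA (toy data, invented values):

```python
# Nested-loop join: scan the first table, use an "index" (a dict keyed on
# the join column) to look up matching rows in the second table.
table1 = [{"id": 1, "columnA": "x"}, {"id": 2, "columnA": "y"}]
table2 = [{"id": 10, "columnA": "x"}, {"id": 11, "columnA": "x"}]

# Build the index on table2.columnA (conceptually what CREATE INDEX does).
index_on_table2 = {}
for row in table2:
    index_on_table2.setdefault(row["columnA"], []).append(row)

result = []
for outer in table1:  # full scan of table1: an index here wouldn't help
    for inner in index_on_table2.get(outer["columnA"], []):
        result.append((outer["id"], inner["id"]))

print(result)  # [(1, 10), (1, 11)]
```

Note the asymmetry: the table being scanned gains nothing from an index on the join column, while the table being probed benefits enormously, which is exactly why the optimizer's choice of join order decides which index matters.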
90% of the time in a properly designed relational database, one of the two columns you join on is a primary key, and so already has a clustered index built on it.
As long as you're in that case, you don't need to do anything at all. The only reason to add additional non-clustered indexes is if you're also filtering the join with a WHERE clause at the end of your statement; then you need to make sure both the join columns and the filtered columns are together in a correct index (i.e. correct column order, etc.).
I must run this query with MySQL:
select requests.id, requests.id_temp, categories.id
from opadithree.requests inner join
opadi.request_detail_2
on substring(requests.id_sub_temp, 3) = request_detail_2.id inner join
opadithree.categories
on request_detail_2.theme = categories.cu_code
where categories.atc = false and id_sub_temp like "2_%";
However, for some reason the query is too slow. The table requests has 15583 rows, request_detail_2 has 66469 rows, and categories has 13452 rows.
The most problematic column, id_sub_temp, holds strings in the formats "2_number" or "3_number".
Do you know some trick to make the query faster?
Here are the indexes I'd try:
First, I need an index so your WHERE condition on id_sub_temp can find the rows needed efficiently. Then add the column id_temp so the result can select that column from the index instead of forcing it to read the row.
CREATE INDEX bk1 ON requests (id_sub_temp, id_temp);
Next I'd like the join to categories to filter by atc=false and then match the cu_code. I tried reversing the order of these columns so cu_code was first, but that resulted in an expensive index-scan instead of a lookup. Maybe that was only because I was testing with empty tables. Anyway, I don't think the column order is important in this case.
CREATE INDEX bk2 ON categories (atc, cu_code);
The join to request_detail_2 is currently by primary key, which is already pretty efficient.
I was running a query of this kind of query:
SELECT
-- fields
FROM
table1 JOIN table2 ON (table1.c1 = table2.c1 OR table1.c2 = table2.c2)
WHERE
-- conditions
But the OR made it very slow, so I split it into two queries:
SELECT
-- fields
FROM
table1 JOIN table2 ON table1.c1 = table2.c1
WHERE
-- conditions
UNION
SELECT
-- fields
FROM
table1 JOIN table2 ON table1.c2 = table2.c2
WHERE
-- conditions
This works much better, but now I am going through the tables twice, so I was wondering whether there are further optimizations; for instance, getting the set of entries that satisfies (table1.c1 = table2.c1 OR table1.c2 = table2.c2) and then querying on it. That would bring me back to the first approach, but maybe there is another solution I don't have in mind. Is there anything more to do with it, or is it already optimal?
Splitting the query into two separate ones is usually better in MySQL, since it rarely uses an "index OR" operation (Index Merge in MySQL lingo).
There are a few items I would concentrate on for further optimization, all related to indexing:
1. Filter the rows faster
The predicates in the WHERE clause should be optimized to retrieve the fewest rows. Analyze them for selectivity, and create indexes that can produce the data with as little post-filtering as possible (fewer reads).
2. Join access
Retrieving related rows should be optimized as well. Based on selectivity, decide which table is more selective, use it as the driving table, and treat the other as the nested-loop table. For the latter, create an index that retrieves rows in an optimal way.
3. Covering Indexes
Last but not least, if your query is still slow, there's one more thing you can do: use covering indexes. That is, expand your indexes to include all the columns the query needs from the driving and/or secondary tables. This way the InnoDB engine won't need to go from the secondary index back to the table row; a single index read per table suffices.
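A minimal Python/SQLite sketch of the covering-index effect (invented table/column names; SQLite reports "USING COVERING INDEX" where MySQL's EXPLAIN shows "Using index" in the Extra column):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE t1 (c1 INTEGER, c2 INTEGER, payload TEXT);
    CREATE INDEX t1_c1_c2 ON t1 (c1, c2);  -- covers the query below
""")
# The query only touches c1 and c2, both present in the index,
# so the engine never has to visit the table rows at all.
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT c2 FROM t1 WHERE c1 = 42"
).fetchall()
print(plan[0][-1])  # mentions a COVERING INDEX lookup on t1_c1_c2
```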
Test
SELECT
-- fields
FROM
table1 JOIN table2 ON table1.c1 = table2.c1
WHERE
-- conditions
UNION ALL
SELECT
-- fields
FROM
table1 JOIN table2 ON table1.c2 = table2.c2
WHERE
-- conditions
/* add one more condition which eliminates the rows selected by 1st subquery */
AND table1.c1 != table2.c1
Copied from the comments:
Nico Haase > What do you mean by "test"?
The OP shows query patterns only, so I cannot predict whether the technique will be effective; I suggest the OP test my variant on his structure and data.
Nico Haase > what you've changed
I added one more condition to the 2nd subquery; see the comment added in the code.
Nico Haase > and why?
This replaces UNION DISTINCT with UNION ALL, eliminating the sort of the combined rowset that removes duplicates.
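Here is a toy Python/SQLite check of the rewrite (invented data; exact equivalence also assumes the c1 columns are NOT NULL and that each branch produces no duplicates of its own, otherwise the two forms can differ):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE table1 (c1 INTEGER, c2 INTEGER);
    CREATE TABLE table2 (c1 INTEGER, c2 INTEGER);
    INSERT INTO table1 VALUES (1, 5), (2, 6), (3, 7);
    INSERT INTO table2 VALUES (1, 5), (2, 9), (4, 7);
""")
union = con.execute("""
    SELECT table1.c1, table1.c2 FROM table1
    JOIN table2 ON table1.c1 = table2.c1
    UNION
    SELECT table1.c1, table1.c2 FROM table1
    JOIN table2 ON table1.c2 = table2.c2
""").fetchall()
union_all = con.execute("""
    SELECT table1.c1, table1.c2 FROM table1
    JOIN table2 ON table1.c1 = table2.c1
    UNION ALL
    SELECT table1.c1, table1.c2 FROM table1
    JOIN table2 ON table1.c2 = table2.c2
        AND table1.c1 != table2.c1  -- drop rows the 1st branch already found
""").fetchall()
print(sorted(union) == sorted(union_all))  # True
```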
I'm trying to get the column order in our indexes set correctly and haven't seen a direct answer on this. If we have a query like the following
SELECT ... all the things ...
FROM tb_contact
inner join tb_contact_association on tb_contact.id = tb_contact_association.attached_id
where tb_contact_association.contact_id = '498'
order by ...
We're looking at a pivot table, tb_contact_association on this join. And this table is never really queried without looking at both attached_id (on the join) and contact_id (the where).
When creating an index for tb_contact_association, should the index cover both "attached_id,contact_id" in that order? With the joined on first, then the where? Or the other way around? Or each of them individually?
Thanks.
Generally, for all-equality conditions like the one below, the ordering of fields in an index doesn't affect whether it can be used, IF you use the appropriate fields.
e.g. for a query like:
SELECT .. WHERE f1 = 'a' AND f2 = 'b' AND f3 = 'c'
INDEX(f3, f2, f1) - index can be used
INDEX(f1, f3, f2) - can be used
INDEX(f1, f2, f3) - can be used
INDEX(f1, f3) - completely usable
INDEX(f3, f1) - completely usable
INDEX(f4, f1) - cannot be used - no 'f4' field in the where clause
INDEX(f1, f4) - can be used, because 'f1' is in the where clause, but f4
component will be ignored
The actual ordering of the WHERE clause doesn't matter. WHERE f1 = 'a' AND f2 = 'b' vs. WHERE f2 = 'b' AND f1 = 'a' are identical as far as the query compiler/optimizer is concerned.
The indexes needed depend on which direction the join will run. You can determine this by running an EXPLAIN on your select statement. In this case though, since your WHERE clause is filtering on the tb_contact_association table, the optimizer will most likely start with this table and join into the tb_contact table.
The exception would be if tb_contact is small (few rows) compared to tb_contact_association. To see why this is the case, consider an extreme example. If tb_contact is only one row long, it's obviously going to be faster to start from that row, join into the corresponding row in the tb_contact_association table, and test its value for contact_id, rather than go through the whole larger tb_contact_association table looking for contact_id=498 (even with an index), and then joining back to the tb_contact table.
But, for any normal tables, the query above would start with tb_contact_association. For a join, you need an index on the column you're joining to. In this case, that's tb_contact.id. You'll also want an index to help your WHERE clause, ie on tb_contact_association.contact_id.
You don't actually need an index on tb_contact_association.attached_id for this particular query, as long as the join always goes in the direction we expect. A composite index on (contact_id, attached_id) (in that order) in tb_contact_association should be a slight help, because it will allow all necessary info for that table to be pulled directly from the index, saving a read from the data table for each row. (With this index added, you should see "using index" in the extra section of the query EXPLAIN.) The contact_id column is used for the WHERE clause, just as with a single index on that column, but with the composite index, it can then just read attached_id straight from the index, rather than from the table.
Most likely both fields should have an index, but for this particular query only contact_id needs one; Nathan's answer explains why in more detail.
The optimal index for your specific query would be (contact_id, attached_id).