I have two tables. Table 1 (for questions) has 2 columns "qid" (the question id) and "question". Table 2 (for answers) has "aid" (answer id), "qid" and "answers". The "qid" in table 2 is a foreign key from table 1 and a single question may have multiple answers.
Both tables have 30,000+ entries.
I have made a loop:
for (each qid) {
    select question from table1 where qid = id;
    select answer from table2 where qid = id;
}
The number of ids passed to the loop is 10, and the queries take about 18 seconds to execute.
Is this much delay normal, or is there something wrong with this approach? I want to know if there is any way to make the above query faster.
You can do it in a single query, which should be a lot faster.
SELECT t1.qid, t1.question, t2.aid, t2.answer
FROM table1 t1
INNER JOIN table2 t2 ON t2.qid = t1.qid
WHERE t1.qid IN (?,?,?,etc.);
Or you can do
SELECT t1.qid, t1.question, t2.aid, t2.answer
FROM table1 t1, table2 t2
WHERE t1.qid = t2.qid
  AND t1.qid IN (...some condition...);
I completely agree with @wvdz.
Additionally, here is a general list of things you can do to improve the performance of selects:
Analyze and possibly rewrite your query (there's a lot left unsaid here, so I recommend visiting one of the resource links I've included below, or both).
If the query filters on what is effectively the primary key, make sure you have actually created the primary key constraint for that column (this creates an index).
Consider creating indexes for any columns that will be used in the conditions of the query. Similar to the first point, you will want to read up on this if you think you need more optimization; it becomes more important the more data you have in a table.
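For example, here is a minimal sketch of that kind of index for the questions/answers schema in the question above (the index name is made up):
-- Index the foreign key used in the lookup conditions:
CREATE INDEX idx_table2_qid ON table2 (qid);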
Also, here are a couple of good resources for tuning your SQL queries:
http://beginner-sql-tutorial.com/sql-query-tuning.htm
http://www.quest.com/whitepapers/10_SQL_Tips.pdf
NOTE on primary keys: Should you want more information on the use of primary keys, this answer I gave in the past explains how I use primary keys and why. I mention this because, in my opinion and experience, every table should include a primary key: MySQL - Should every table contain it's own id/primary column?
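As a minimal sketch, assuming qid was not already declared as a key in the schema above:
-- Declaring the primary key also creates the index used for the qid lookups:
ALTER TABLE table1 ADD PRIMARY KEY (qid);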
Related
I am trying to improve the performance of a query by using a "materialized view" to optimize away joins. The first query below is the original, which employs joins. The second is written against a table I generated, which includes all the joined data (the equivalent of a materialized view). Both return the same result set. Unfortunately, the second query is MUCH slower when handling a very long list of input ids (the IN clause). I don't understand how that could be! Executing all the joins should have a fair amount of overhead that the "materialized view" saves, right?
SELECT
clinical_sample.INTERNAL_ID AS "internalId",
sample.STABLE_ID AS "sampleId",
patient.STABLE_ID AS "patientId",
clinical_sample.ATTR_ID AS "attrId",
cancer_study.CANCER_STUDY_IDENTIFIER AS "studyId",
clinical_sample.ATTR_VALUE AS "attrValue"
FROM clinical_sample
INNER JOIN sample ON clinical_sample.INTERNAL_ID = sample.INTERNAL_ID
INNER JOIN patient ON sample.PATIENT_ID = patient.INTERNAL_ID
INNER JOIN cancer_study ON patient.CANCER_STUDY_ID = cancer_study.CANCER_STUDY_ID
WHERE cancer_study.CANCER_STUDY_IDENTIFIER = 'xxxxx'
AND sample.STABLE_ID IN
('P-0068343-T02-IM7' , 'P-0068353-T01-IM7' ,
'P-0068363-T01-IM7' , 'P-0068364-T01-IM7' )
AND clinical_sample.ATTR_ID IN
(
'CANCER_TYPE'
);
SELECT
internalId,
sampleId,
patientId,
attrId,
studyId,
attrValue
FROM test
WHERE
sampleId IN ('P-0068343-T02-IM7' , 'P-0068353-T01-IM7' ,
'P-0068363-T01-IM7' , 'P-0068364-T01-IM7' )
AND studyId = 'xxxxx'
AND attrId = 'CANCER_TYPE';
Update: I did notice in the Workbench report that the query with joins scans far fewer rows: about 829k vs. ~2400k for the second, join-less query. So having joins actually seems to be a major optimization somehow. I have indexes on sampleId, studyId, and attrId, plus a composite of all three.
Both table "test" and "clinical_sample" have the same number of rows.
It would help to see what the PRIMARY KEY of each table is.
Some of these indexes are likely to help:
clinical_sample: INDEX(ATTR_ID, INTERNAL_ID, ATTR_VALUE)
sample: INDEX(STABLE_ID, INTERNAL_ID, PATIENT_ID)
patient: INDEX(INTERNAL_ID, STABLE_ID, CANCER_STUDY_ID)
cancer_study: INDEX(CANCER_STUDY_IDENTIFIER, CANCER_STUDY_ID)
I agree with Barmar's INDEX(studyId, attrId, sampleId) for the materialized view.
I have indexes on sampleId, studyId, and attrId, plus a composite of all three.
Let's see the EXPLAIN. It may show that the query is using your index on just (sampleId) when it should be using the composite index.
Also, put the IN column last, not first, regardless of cardinality. More precisely, put the = columns first in a composite index.
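For example, a sketch of both steps against the materialized view from the question (the index name is made up):
-- The = columns (studyId, attrId) come first; the IN column (sampleId) goes last:
ALTER TABLE test ADD INDEX idx_study_attr_sample (studyId, attrId, sampleId);
-- Then check which index the optimizer actually picks:
EXPLAIN SELECT internalId, sampleId, patientId, attrId, studyId, attrValue
FROM test
WHERE studyId = 'xxxxx'
  AND attrId = 'CANCER_TYPE'
  AND sampleId IN ('P-0068343-T02-IM7', 'P-0068353-T01-IM7');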
Food for thought: When and why are database joins expensive?
This leads me to believe that normalized tables with indexes could actually be faster than my denormalized attempt (the materialized view).
I am new to database indexes, and I've just read about what an index is, the differences between clustered and non-clustered indexes, and what a composite index is.
So for an inner join query like this:
SELECT columnA
FROM table1
INNER JOIN table2
ON table1.columnA = table2.columnA;
In order to speed up the join, should I create 2 indexes, one on table1.columnA and the other on table2.columnA, or is one index on just table1 or table2 enough?
If one is enough, I don't get it. For example, if I select some data from table2 first and join on columnA based on the result, then I am looping through the results from table2 one by one, so an index on table2.columnA is totally useless there, because I don't need to find anything in table2 anymore. What I need is an index on table1.columnA.
And vice versa, I need an index on table2.columnA if I select some results from table1 first and then join on columnA.
Well, I don't know what "select xxxx first, then join based on ..." actually looks like in reality, but that scenario just came to my mind. It would be much appreciated if someone could also give a simple example.
One index is sufficient, but the question is which one?
It depends on how the MySQL optimizer decides to order the tables in the join.
For an inner join, the results are the same for table1 INNER JOIN table2 versus table2 INNER JOIN table1, so the optimizer may choose to change the order. It is not constrained to join the tables in the order you specified them in your query.
The difference from an indexing perspective is whether it will first loop over rows of table1, and do lookups to find matching rows in table2, or vice-versa: loop over rows of table2 and do lookups to find rows in table1.
MySQL does joins as "nested loops". It's as if you had written code in your favorite language like this:
foreach row in table1 {
    look up rows in table2 matching table1.column_name
}
This lookup will make use of the index in table2. An index in table1 is not relevant to this example, since your query is scanning every row of table1 anyway.
How can you tell which table order is used? You can use EXPLAIN. It will show you a row for each table reference in the query, and it will present them in the join order.
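For example, against the query from the question (qualifying columnA to avoid ambiguity):
EXPLAIN
SELECT table1.columnA
FROM table1
INNER JOIN table2 ON table1.columnA = table2.columnA;
-- The first row of the output is the table MySQL scans;
-- the second row is the table it does the indexed lookups against.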
Keep in mind the presence of an index in either table may influence the optimizer's choice of how to order the tables. It will try to pick the table order that results in the least expensive query.
So maybe it doesn't matter which table you add the index to, because whichever one you put the index on will become the second table in the join order, because it makes it more efficient to do the lookup that way. Use EXPLAIN to find out.
90% of the time in a properly designed relational database, one of the two columns you join together is a primary key, and so should have a clustered index built for it.
So as long as you're in that case, you don't need to do anything at all. The only reason to add additional non-clustered indexes is if you're also further filtering the join with a WHERE clause at the end of your statement; in that case you need to make sure both the join columns and the filtered columns are in a correct index together (i.e. correct sort order, etc.).
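As a minimal sketch with made-up names (orders joined to customers, filtered by status), a compound index covers both the filter column and the join column:
CREATE INDEX idx_orders_status_customer ON orders (status, customer_id);
SELECT o.*
FROM orders o
INNER JOIN customers c ON c.id = o.customer_id   -- c.id is the clustered primary key
WHERE o.status = 'shipped';                      -- satisfied by the compound index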
I have 2 tables, a and b, which have 2 million and 3.2 million records respectively. I'm trying to get the ids from a that do not exist in b. I have written the query below,
select a.id from a where not exists (select b.id from b where a.id = b.id)
but it is taking a long time. Is there any better way to get the results faster?
Update: I just looked into the table structure for both tables and found that a.id has a decimal datatype while b.id has a varchar datatype.
Will this difference in datatypes cause any issues?
Could you try a LEFT JOIN with a NULL check? It returns the ids which exist in TableA but not in TableB.
SELECT T1.Id
FROM TableA T1
LEFT JOIN TableB T2 ON T2.Id = T1.Id
WHERE T2.Id IS NULL
While you could write your query using an anti-join, it probably would not affect the performance much; in fact, the underlying execution plan could even be the same. The only way I can see to speed up your query would be to add an index to the b table:
CREATE INDEX idx ON b (id);
But if b.id is a primary key, then it should already be part of the clustered index. In that case, your current performance might be as good as you can get.
(this is mostly a comment, but it's a bit long)
Please take some time to read some of the many questions about query optimization here on SO. The ones which are downvoted and closed omit table/index definitions and explain plans. The ones which receive upvotes include these along with cardinality, performance, and result metrics.
The join to table a in your subquery is redundant. When you remove the second reference to that table, you end up with a simpler query. Then you can use a "not in" or a left join (see the sketch below).
But the performance is still going to suck. Wherever possible you should try to avoid painting yourself into a corner like this in your data design.
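For what it's worth, a sketch of the "not in" form mentioned above (the IS NOT NULL filter matters, because NOT IN behaves badly if the subquery can return NULLs):
SELECT a.id
FROM a
WHERE a.id NOT IN (SELECT b.id FROM b WHERE b.id IS NOT NULL);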
Thanks for your valuable answers, I found the way. It got resolved after making the datatypes of the lookup ids the same; I now get the results in 22 seconds.
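For anyone hitting the same problem: comparing a decimal column to a varchar column forces a type conversion on every row and prevents index use. A hypothetical sketch of the fix, assuming b.id actually holds numeric values (the precision is made up):
-- Align the datatypes so the anti-join can use the index on b.id:
ALTER TABLE b MODIFY id DECIMAL(20,0);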
We are facing some performance issues in some reports that work on millions of rows. I tried optimizing the SQL queries, but it only reduced the execution time by half.
The next step is to analyse and modify or add some indexes; therefore I have some questions:
1- The SQL queries contain a lot of joins: do I have to create an index for each foreign key?
2- Imagine the request SELECT * FROM A LEFT JOIN B ON a.b_id = b.id WHERE a.attribute2 = 'someValue', and we have an index on table A based on b_id and attribute2: does my request use this index for the WHERE part? (I know that if the two conditions were both in the WHERE clause, the index would be used.)
3- If an index is based on columns C1, C2 and C3, and I decide to add an index based on C2, do I need to remove C2 from the first index?
Thanks for your time
You can use an EXPLAIN query to see what MySQL will do when executing your query. This helps a LOT when trying to figure out why it's slow.
JOIN-ing happens one table at a time, and the order is determined by MySQL analyzing the query and trying to find the fastest order. You will see it in the EXPLAIN result.
Only one index can be used per JOIN and it has to be on the table being joined. In your example the index used will be the id (primary key) on table B. Creating an index on every FK will give MySQL more options for the query plan, which may help in some cases.
There is only a difference between WHERE and JOIN conditions when there are NULLs (missing rows) for the joined table (there is no difference at all for an INNER JOIN). For your example, the index on b_id does nothing. If you change it to an INNER JOIN (e.g. by adding b.something = 42 to the WHERE clause), then it might be used if MySQL determines that it should do the query in reverse (first B, then A).
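For example, using the tables from question 2 (the b.something = 42 condition is hypothetical):
SELECT *
FROM A LEFT JOIN B ON a.b_id = b.id
WHERE a.attribute2 = 'someValue'
  AND b.something = 42;   -- filters out the NULL rows, so this is effectively an INNER JOIN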
No. It is 100% OK to have a column in multiple indexes. If you have an index on (A,B,C) and you add another one on (A), that one will be redundant and pointless (because it is a prefix of another index). An index on (B), however, is perfectly fine.
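A sketch using the columns from question 3 (the table and index names are made up):
CREATE INDEX idx_c1_c2_c3 ON t (C1, C2, C3);
-- An index on (C1) alone would be redundant: it is a prefix of idx_c1_c2_c3.
-- An index on (C2) is fine, because (C2) is not a prefix of (C1, C2, C3):
CREATE INDEX idx_c2 ON t (C2);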
I know this question has been asked 100 times, and this isn't a "how do I do it" question but an efficiency question - a topic I don't know much about.
From my internet reading, I have settled on one way of solving the most-recent-row problem that sounds pretty efficient: LEFT JOIN a "max" table (grouped by the matching conditions) and then LEFT JOIN the row that matches the grouped conditions. Something like this:
SELECT employee.*, evaluation.*
FROM employee
LEFT JOIN (SELECT MAX(report_date) report_date, employee_id
           FROM evaluation
           GROUP BY employee_id) most_recent_eval
    ON most_recent_eval.employee_id = employee.id
LEFT JOIN evaluation
    ON evaluation.employee_id = employee.id
   AND evaluation.report_date = most_recent_eval.report_date;
Are there problems with this that I don't know about? Is this doing 2 table scans (one to find the max, and one to find the row)? Does it have to do 2 full scans for every employee?
The reason I'm asking is that I am now looking at joining on 3 tables where I need the most recent row (evaluations, security clearance, and project) and it seems like any inefficiencies are going to be massively multiplied.
Can anyone give me some advice on this?
You should be in pretty good shape with the query pattern you propose.
One possible suggestion that will help if your evaluation table has its own auto-incrementing id column: you may be able to find the latest evaluation for each employee with this subquery:
SELECT MAX(id) id
FROM evaluation
GROUP BY employee_id
Then your join can look like this:
FROM employee
LEFT JOIN (
SELECT MAX(id) id
FROM evaluation
GROUP BY employee_id
) most_recent_eval ON most_recent_eval.employee_id=employee.id
LEFT JOIN evaluation ON most_recent_eval.id = evaluation.id
This will work if your id values and your report_date values in your evaluation table are in the same order. Only you know if that's the case in your application. But if it is, this is a very helpful optimization.
Other than that, you may need to add some compound indexes to some tables to speed up your queries. Get them working correctly first. Read http://use-the-index-luke.com/ . Remember that lots of single-column indexes are generally harmful to MySQL query performance unless they're chosen to accelerate particular queries.
If you create a compound index on (employee_id, report_date), this subquery
select max(report_date) report_date, employee_id
from evaluation
group by employee_id
can be satisfied with an astonishingly efficient loose index scan. Similarly, if you're using InnoDB, the query
SELECT MAX(id) id
FROM evaluation
GROUP BY employee_id
can be satisfied by a loose index scan on a single-column index on employee_id. (If you're using MyISAM, you need an explicit compound index on (employee_id, id); with InnoDB a single-column index suffices, because InnoDB implicitly appends the primary key column to every secondary index.)
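In concrete terms, the suggested indexes might look like this (the index names are made up):
-- Enables the loose index scan for MAX(report_date) per employee:
ALTER TABLE evaluation ADD INDEX idx_emp_report_date (employee_id, report_date);
-- Enough for MAX(id) GROUP BY employee_id on InnoDB,
-- since the primary key id is appended to the index implicitly:
ALTER TABLE evaluation ADD INDEX idx_emp (employee_id);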