Performance related to NOT EXISTS - MySQL

I have two tables, a and b, with 2 million and 3.2 million records respectively. I'm trying to get the ids from a that do not exist in b. I have written the query below:
select a.id from a where not exists (select b.id from b where a.id = b.id)
This is taking a long time. Is there a better way to get the results faster?
Update: I just looked into the structure of both tables and found that a.id has a decimal datatype while b.id is a varchar. Will this difference in datatype cause any issues?
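A quick way to confirm the declared types of both id columns is SHOW CREATE TABLE:
SHOW CREATE TABLE a;
SHOW CREATE TABLE b;
-- a mismatch such as a.id DECIMAL vs. b.id VARCHAR will be visible in the output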

Try a LEFT JOIN with an IS NULL check. It will return the ids that exist in TableA but not in TableB:
SELECT T1.Id
FROM TableA T1
LEFT JOIN TableB T2 ON T2.Id = T1.Id
WHERE T2.Id IS NULL

While you could write your query using an anti-join, it probably would not affect the performance much, and in fact the underlying execution plan could even be the same. The only way I can see to speed up your query would be to add an index to the b table:
CREATE INDEX idx ON b (id);
But if b.id is a primary key, then it is already part of the clustered index. In that case, your current performance might be as good as it gets.

(this is mostly a comment, but it's a bit long)
Please take some time to read some of the many questions about query optimization here on SO. The ones which are downvoted and closed omit table/index definitions and EXPLAIN plans. The ones which receive upvotes include these along with cardinality, performance, and result metrics.
The join back to table a in your subquery is redundant. When you remove the second reference to that table you end up with a simpler query, and you can then use a NOT IN or a LEFT JOIN (see the sketch below).
But the performance is still going to suck. Wherever possible, you should try to avoid painting yourself into a corner like this in your data design.
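For illustration, the simplified forms would be (a sketch; note that NOT IN misbehaves if b.id can contain NULLs, so the LEFT JOIN form is generally safer):
select a.id from a where a.id not in (select b.id from b);
select a.id from a left join b on b.id = a.id where b.id is null;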

Thanks for your valuable answers, I found the way. It got resolved after giving the lookup ids the same datatype; I now get results in 22 seconds.
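For anyone hitting the same problem, the fix amounts to aligning the two types, for example (a sketch only; the exact target type and whether the conversion is lossless depend on your data):
alter table b modify id decimal(20,0) not null;  -- hypothetical target type chosen to match a.id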

Related

Does adding a join condition on two different tables (excluding the table to be joined) slow down the query and performance

I have 3 tables in MySQL => table1, table2 and table3, and the data in all three tables is large (>100k rows).
My query is:
select * from table1 t1
join table2 t2 on t1.col1 = t2.col1
join table3 t3 on t3.col2 = t2.col2 and t3.col3 = t1.col3
This query returns results very slowly, and I believe the issue is in the second join condition, because if I remove that condition the query returns results instantly.
Can anyone please explain the reason the query is slow?
Thanks in advance.
Do you have these indexes?
table2: (col1)
table3: (col2, col3) -- in either order
Another tip: Don't use * (as in SELECT *) unless you really need all the columns. It prevents certain optimizations. If you want to discuss this further, please provide the real query and SHOW CREATE TABLE for each table.
If any of the columns used for joining are not the same datatype, character set, and collation, then indexes may not be useful.
Please provide EXPLAIN SELECT ...; it will give some clues we can discuss.
How many rows in the resultset? Sounds like over 100K? If so, then perhaps the network transfer time is the real slowdown?
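If any are missing, they could be created along these lines (index names are illustrative):
ALTER TABLE table2 ADD INDEX idx_col1 (col1);
ALTER TABLE table3 ADD INDEX idx_col2_col3 (col2, col3);
Then re-run EXPLAIN SELECT ... to see whether the plan picks them up.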
Since the second join references both of the other tables (two join conditions), it creates more checks during evaluation: the join graph becomes a triangle rather than a straight chain.
Also, since all three tables have ~100K rows, even with a clustered index on the given columns, there is bound to be a performance hit, partly because all columns are being retrieved.
At the very least, write the select list explicitly, as T1.col1, T1.col2, ..., T2.col1, ... and so on.
Also have indexes on all the columns used in the join conditions.
More to the point, do you really want a huge join without a WHERE clause? Try adding restrictive conditions for each table and see the magic: the available set of rows from each table is filtered first (100k may become 10k), and only then is the join attempted (a sketch follows below).
Also check the EXPLAIN output to see whether a table scan is being used (most probably yes); if so, getting an index scan instead should improve the situation.
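For illustration, a filtered version might look like this (the WHERE conditions are purely hypothetical; substitute whatever actually restricts your data):
select t1.col1, t1.col3, t2.col1, t2.col2, t3.col2, t3.col3
from table1 t1
join table2 t2 on t1.col1 = t2.col1
join table3 t3 on t3.col2 = t2.col2 and t3.col3 = t1.col3
where t1.col4 = 'active'         -- hypothetical filter on table1
  and t2.col5 >= '2020-01-01';   -- hypothetical filter on table2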

SQL query taking too much time to execute

I have two tables. Table 1 (for questions) has 2 columns "qid" (the question id) and "question". Table 2 (for answers) has "aid" (answer id), "qid" and "answers". The "qid" in table 2 is a foreign key from table 1 and a single question may have multiple answers.
Both tables have 30,000+ entries.
I have made a loop:
for (each qid)
{
    select question from table1 where qid = id;
    select answer from table2 where qid = id;
}
Ten ids are passed to the loop, and the queries take about 18 seconds in total.
Is this much delay normal, or is there something wrong with this approach? I want to know if there is any way to make the above queries faster.
You can do it in a single query which should be a lot faster.
SELECT t1.qid, t1.question, t2.aid, t2.answer
FROM table1 t1
INNER JOIN table2 t2 ON t2.qid = t1.qid
WHERE t1.qid IN (?,?,?,etc.);
Or you can do
SELECT t1.qid, t1.question, t2.aid, t2.answer FROM table1 t1, table2 t2 WHERE t1.qid=t2.qid AND t1.qid IN(...some condition...);
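One small usage note: since a single question may have multiple answers, the question columns repeat on each matching row; appending ORDER BY t1.qid, t2.aid to either form keeps each question's answers grouped together when you iterate the result set.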
I completely agree with @wvdz.
Additionally, here is a general list of things you can do to improve the performance of selects:
- Analyze and possibly rewrite your query (there's a lot left unsaid here, so I recommend visiting one or both of the resource links included below).
- If the query includes what is effectively the primary key, make sure you have actually created the primary key constraint for that column (this creates an index).
- Consider creating indexes for any columns that will be used in the conditions of the query (similar to the first point; you will want to read up on this if you think you need more optimization, and it becomes more important the more data you have in a table). A sketch follows this list.
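A sketch of those two points using the question's tables (constraint and index names are illustrative):
ALTER TABLE table1 ADD PRIMARY KEY (qid);           -- if not already defined
ALTER TABLE table2 ADD PRIMARY KEY (aid);           -- if not already defined
ALTER TABLE table2 ADD INDEX idx_table2_qid (qid);  -- supports the join/filter on qid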
Also, here are a couple of good resources for tuning your SQL queries:
http://beginner-sql-tutorial.com/sql-query-tuning.htm
http://www.quest.com/whitepapers/10_SQL_Tips.pdf
NOTE on primary keys: should you want more information on the use of primary keys, this answer I gave in the past explains how I use primary keys and why. I mention this because, in my opinion and experience, every table should include a primary key: MySQL - Should every table contain it's own id/primary column?

MySQL inner join speed efficiency

I'm working on the logic for creating queries with inner joins.
In theory, would it speed up the process if you started with a smaller table?
Say I'm comparing keys across two tables. Table A only has 4 rows. Table B has 100.
So would:
SELECT * FROM `a` INNER JOIN `b` ON `a`.`key` = `b`.`key` WHERE `b`.`key`='value'
run faster than:
SELECT * FROM `b` INNER JOIN `a` ON `b`.`key` = `a`.`key` WHERE `b`.`key`='value'
EDIT: I've tried this with much larger data sets (10,000+ entries) and have always seen varied results. I tried researching and couldn't find a definitive answer. If this question is too vague, apologies.
No, the combination would be the same.
4x100 = 100x4 (one match doesn't mean it's unique, so all rows will have to be looked at).
Even if the a.key and b.key columns were indexed, the index would be used, and the rows needed would still be reduced and multiplied in a similar way to the above.
Smaller sets would (I assume) show more difference, as caching can occur at the CPU, but the query optimiser should spot that and rewrite to the best execution plan.
An INNER JOIN is the Cartesian product of the two tables plus the conditions.
Another syntax of this query is:
SELECT * FROM `a`, `b` WHERE `a`.`key` = `b`.`key` AND `b`.`key`='value'
The order of tables in the FROM clause doesn't matter. So I believe that both queries actually do the same thing, but you should check it on a larger sample of data than 100 rows to verify.
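One way to verify is to compare the plans the optimiser produces for both orderings, for example:
EXPLAIN SELECT * FROM `a` INNER JOIN `b` ON `a`.`key` = `b`.`key` WHERE `b`.`key` = 'value';
EXPLAIN SELECT * FROM `b` INNER JOIN `a` ON `b`.`key` = `a`.`key` WHERE `b`.`key` = 'value';
-- identical plans confirm that the table order in FROM makes no difference here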

SQL Server Partition Performance Issue

I have the following scenario:
TableA (ID GUID, Type INT) : +60M rows
TableB (ID GUID) : +5M rows
TableA has an Index on ID and Type
TableB the Primary Key is ID
I need to improve the following query:
SELECT * FROM TableA A
INNER JOIN TableB B
ON A.ID = B.ID AND A.Type = 5
The query takes about 30 seconds to complete.
We have tried partitioning TableA on the Type column, but the query execution time remains the same; even the execution plan is still the same. As far as I understood, partitioning the table should greatly improve performance?
Do I have to adjust my query to use the partition thus increasing performance?
Are my indexes wrong?
Thanks in advance!
You are one of the people who think partitioning is a magic switch that improves performance when pressed. Partitioning mostly reduces performance and helps only in a few narrow cases. It is mostly a management feature for bulk loading and data archiving/deletion.
Partitioning has serious consequences and should not be done without proper understanding, planning and testing.
Create the proper indexes: in your case A(Type, ID) would be a good start; alternatively A(ID) WHERE Type = 5 (a filtered index), as sketched below.
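In SQL Server syntax those two suggestions look roughly like this (index names are illustrative):
CREATE NONCLUSTERED INDEX IX_TableA_Type_ID ON TableA (Type, ID);
-- or a filtered index covering only the rows this query touches:
CREATE NONCLUSTERED INDEX IX_TableA_ID_Type5 ON TableA (ID) WHERE Type = 5;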

MySQL Join clause vs WHERE clause

What's the difference in a clause done the two following ways?
SELECT * FROM table1 INNER JOIN table2 ON (
table2.col1 = table1.col2 AND
table2.member_id = 4
)
SELECT * FROM table1 INNER JOIN table2 ON (
table2.col1 = table1.col2
)
WHERE table2.member_id = 4
I've compared them both with basic queries and EXPLAIN EXTENDED and don't see a difference. I'm wondering if someone here has discovered a difference in a more complex/processing-intensive environment.
With an INNER join the two approaches give identical results and should produce the same query plan.
However there is a semantic difference between a JOIN (which describes a relationship between two tables) and a WHERE clause (which removes rows from the result set). This semantic difference should tell you which one to use. While it makes no difference to the result or to the performance, choosing the right syntax will help other readers of your code understand it more quickly.
Note that there can be a difference if you use an outer join instead of an inner join. For example, if you change INNER to LEFT and the join condition fails you would still get a row if you used the first method but it would be filtered away if you used the second method (because NULL is not equal to 4).
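To make that concrete, here is the question's pair of queries with INNER changed to LEFT (a sketch):
SELECT * FROM table1 LEFT JOIN table2 ON (
table2.col1 = table1.col2 AND
table2.member_id = 4
)
-- keeps every table1 row; table2 columns come back NULL where the condition fails
SELECT * FROM table1 LEFT JOIN table2 ON (
table2.col1 = table1.col2
)
WHERE table2.member_id = 4
-- the WHERE clause then discards those NULL rows, so this behaves like an INNER JOIN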
If you are trying to optimize and know your data, adding STRAIGHT_JOIN can tremendously improve performance. You have an inner join ON... So, just to confirm: you want only records where table1 and table2 are joined, but only for a table2 member_id of some value, in this case 4.
I would change the query to have table2 as the primary table of the select, as it has an explicit member_id that could be optimized by an index to limit rows, then join to table1, like:
select STRAIGHT_JOIN
t1.*
from
table2 t2,
table1 t1
where
t2.member_id = 4
and t2.col1 = t1.col2
So the query would pre-qualify only the member_id = 4 records, then match between table1 and table2. If table2 had 50,000 records and table1 had 400,000, listing table2 first means it is processed first: the member_id = 4 filter narrows it down, and the join to table1 narrows it down further.
I know for a fact that STRAIGHT_JOIN works, as I've implemented it many times dealing with gov't data of 14+ million records linking to over 15 lookup tables, where the engine got confused trying to think for me on the critical table. One such query was taking 24+ hours before hanging... Adding STRAIGHT_JOIN and prioritizing the "primary" table in the query dropped it to a final correct result set in under 2 hours.
There's not really much of a difference in the situation you describe. In a situation with multiple complex joins, my understanding is that the first is somewhat preferable, as it reduces the complexity somewhat; that said, it's going to be a small difference. Overall, you shouldn't notice much of a difference in most if not all situations.
With an inner join, it makes almost* no difference; if you switch to outer join, all the difference in the world.
*I say "almost" because optimizers are quirky beasts and it isn't impossible that under some circumstances, it might do a better job optimizing the former or the latter. Do not attempt to take advantage of this behavior.