How Join works in SQL?

How Join works in SQL? - mysql

If I have three tables say A,B and C
A is join to B and
B is join to C
Then is it required to join A to C?
I tried this with join A to C and because of this my query performance is increased but I dont understand how my query performance increased?
It will be very helpful if you can explain me why A to C join increase the performance?
there are 3 tables.
AWARD_MEMBER, HOUSEHOLD_MEMBER and CONDITIONALITY_GROUP
This is how i joined them,
AWARD_MEMBER.HH_MEMBER_ID = HOUSEHOLD_MEMBER.HH_MEMBER_ID AND
HOUSEHOLD_MEMBER.HH_MEMBER_ID = CONDITIONALITY_GROUP.HH_MEMBER_ID AND
AWARD_MEMBER.HH_MEMBER_ID = CONDITIONALITY_GROUP.HH_MEMBER_ID
say AWARD_MEMBER as A, HOUSEHOLD_MEMBER as B and CONDITIONALITY_GROUP as C
so AWARD_MEMBER.HH_MEMBER_ID = CONDITIONALITY_GROUP.HH_MEMBER_ID join is really required in query? If yes then how it works?
Thanks,
Sandeep

Since you are not mentioning about the table structure i cant say why it is happening (increasing perfomance while joining A to C ).
But if you are looking for performance optimisation , you should look at the EXPALIN keyword.
EXPLAIN is one of the most powerful tools at your disposal for understanding and optimizing troublesome MySQL queries
EXPALIN
EXPALIN returns a row of information for each table used in the SELECT statement.These are the output columns produced by EXPLAIN.
The type column of EXPLAIN output describes how tables are joined.
The following list describes the join types, ordered from the best
type to the worst
system
const
eq_ref
ref
fulltext
ref_or_null
index_merge
unique_subquery
index_subquery
range
index
ALL
The "rows" column in the EXPLAIN output is an estimate of the number
of rows that need to be examined.
Whether or not the query will actually need to examine every row in
the table really depends on the access plan (shown in other columns in
the EXPLAIN output. For example, an index range scan operation doesn't
have to look at EVERY row, only the rows in particular index range.
But a full scan operation will look at EVERY row in the table).
By looking at rows and type you can easily understand why it got
improved when you joined (A to C).
Note : EXPLAIN EXTENDED - show you additional information about the way it executes the query.
Hope this helps..

Related

In which order are MySQL left join conditions evaluated?

I'm having the following query:
SELECT
*
FROM ARTICLE AS article
LEFT JOIN VALUATION AS valuation ON (valuation.ARTICLEID = article.ID AND valuation.BUYDATE <= '2021-10-21'
AND valuation.SELLDATE > '2021-10-21' )
LEFT JOIN VALUATION AS previousvaluation ON(previousvaluation.ARTICLEID = article.ID AND
AND previousvaluation.BUYDATE < '2021-10-21' AND previousvaluation.SELLDATE >= '2021-10-21' AND article.NOTICEDATE < '2021-10-21')
LEFT JOIN ART_OWNER AS articleOwner ON (articleOwner.ID = article.owner )
WHERE article.QUANTITY = 0
It is giving me the following execution plan:
As seen in the execution plan,the "previousValuation" lookup is showing 10 rows produced which multiply data processed by the "articleOwner" join by 10.
My "previousValuation" join will ALWAYS return 0 or 1 line but it is showing 10 rows just because the join is not a ref join and is only taking usage of one column in the table PK.
Why this join is not taking in consideration non indexed columns and is join condition on those non indexed columns evaluated at the join time or later?
(When is the "attached_condition" condition evaluated)
Thanks

These might help:
VALUATION: INDEX(ARTICLEID, SELLDATE)
VALUATION: INDEX(ARTICLEID, BUYDATE)
And drop INDEX(ARTICLEID) if you have such. (The single-column version may get in the way of my suggested pair of indexes.)
The numbers in Explain are estimates -- sometimes very crude estimates. They are sometimes very far off and can lead to using the wrong query plan.
The order of LEFT JOINs should not matter. The number of rows fetched (or found to be missing) does not change depending on the order.
When there are two ranges in ON (or WHERE), the Optimizer will use only one of them. My suggestion should help the Optimizer try both directions (past and future) in hopes that it will discover (via probes into the index) which one will be more productive.
LEFT is often used when it should not be. Are you sure you need it in these cases?
Do you really want SELECT *? It provides all columns from all 4 tables.
Why this join is not taking in consideration non indexed columns and is join condition on those non indexed columns evaluated at the join time or later?
(I'm unclear on what you are asking.) The evaluation happens in (crudely) this order:
Filter by INDEX (if appropriate)
Fetch the entire row for any rows that have not been filtered out by the INDEX
Perform the "attached condition" to finish the filtering.
That's all. The second step gathered all the data; there is no step 4 to optimize. I could be wrong here. If there are columns that are stored "off-record" (eg big TEXT or BLOB), they may not have been fetched in step 2. (I do not know the answer to this.)
(And that hints at a big reason for not saying SELECT * if you have big columns that you do not need to fetch.)

Need some clarification on indexes (WHERE, JOIN)

We are facing some performance issues in some reports that work on millions of rows. I tried optimizing sql queries, but it only reduces the time of execution to half.
The next step is to analyse and modify or add some indexes, therefore i have some questions:
1- the sql queries contain a lot of joins: do i have to create an index for each foreignkey?
2- Imagine the request SELECT * FROM A LEFT JOIN B on a.b_id = b.id where a.attribute2 = 'someValue', and we have an index on the table A based on b_id and attribute2: does my request use this index for the where part ( i know if the two conditions were on the where clause the index will be used).
3- If an index is based on columns C1, C2 and C3, and I decided to add an index based on C2, do i need to remove the C2 from the first index?
Thanks for your time

You can use EXPLAIN query to see what MySQL will do when executing it. This helps a LOT when trying to figure out why its slow.
JOIN-ing happens one table at a time, and the order is determined by MySQL analyzing the query and trying to find the fastest order. You will see it in the EXPLAIN result.
Only one index can be used per JOIN and it has to be on the table being joined. In your example the index used will be the id (primary key) on table B. Creating an index on every FK will give MySQL more options for the query plan, which may help in some cases.
There is only a difference between WHERE and JOIN conditions when there are NULL (missing rows) for the joined table (there is no difference at all for INNER JOIN). For your example the index on b_id does nothing. If you change it to an INNER JOIN (e.g. by adding b.something = 42 in the where clause), then it might be used if MySQL determines that it should do the query in reverse (first b, then a).
No.. It is 100% OK to have a column in multiple indexes. If you have an index on (A,B,C) and you add another one on (A) that will be redundant and pointless (because it is a prefix of another index). An index on B is perfectly fine.

how to convert left join to sub query?

I'm beginner in mysql, i have written a query by using left join to get columns as mentioned in query, i want to convert that query to sub-query please help me out.
SELECT b.service_status,
s.b2b_acpt_flag,
b2b.b2b_check_in_report,
b2b.b2b_swap_flag
FROM user_booking_tb AS b
LEFT JOIN b2b.b2b_booking_tbl AS b2b ON b.booking_id=b2b.gb_booking_id
LEFT JOIN b2b.b2b_status AS s ON b2b.b2b_booking_id = s.b2b_booking_id
WHERE b.booking_id='$booking_id'

In this case would actually recommend the join which should generally be quicker as long as you have proper indexes on the joining columns in both tables.
Even with subqueries, you will still want those same joins.
Size and nature of your actual data will affect performance so to know for sure you are best to test both options and measure results. However beware that the optimal query can potentially switch around as your tables grow.
SELECT b.service_status,
(SELECT b2b_acpt_flag FROM b2b_status WHERE b.booking_id=b2b_booking_id)as b2b_acpt_flag,
(SELECT b2b_check_in_report FROM b2b_booking_tbl WHERE b.booking_id=gb_booking_id) as b2b_check_in_report,
(SELECT b2b_check_in_report FROM b2b_booking_tbl WHERE b.booking_id=gb_booking_id) as b2b_swap_flag
FROM user_booking_tb AS b
WHERE b.booking_id='$booking_id'
To dig into how this query works, you are effectively performing 3 additional queries for each and every row returned by the main query.
If b.booking_id='$booking_id' is unique, this is an extra 3 queries, but if there may be multiple entries, this could multiply and become quite slow.
Each of these extra queries will be fast, no network overhead, single row, hopefully matching on a primary key. So 3 extra queries are nominal performance, as long as quantity is low.
A join would result as a single query across 2 indexed tables, which often will shave a few milliseconds off.
Another instance where a subquery may work is where you are filtering the results rather than adding extra columns to output.
SELECT b.*
FROM user_booking_tb AS b
WHERE b.booking_id in (SELECT booking_id FROM othertable WHERE this=this and that=that)
Depending how large the typical list of booking_id's is will affect which is more efficient.

explain plan meanings in mysql

i use explain plan,but i am confused what is its real meaning.
explain extended
select *
from (select type_id from con_consult_type cct
where cct.consult_id = (select id
from con_consult
where id = 1))
cctt left join con_type ct on cctt.type_id = ct.id;
the results is
i google the derived is temporary table,but what is its sql of the temporary table?is ctt table?
and the step 2,is result of cctt left join con_type ct on cctt.type_id = ct.id?
the FK_CONSULT_TO_CONSULT_TYPE is consult_id refer con_consult id column,
how to use the index in the sql?
get all results of ctt,and then use the index filter?
please help me explain what the explain meanings.

This is a bad query to learn the basics of the explain output, there is simply too much happening with all the sub queries, and joins.
I can give a run down of some of the essentials;
'rows' column: Less is better, it shows how many rows had to be scanned by the database, anything less than a couple of hundred is good, generally indicates how well it is able to find your data from the indexes;
'possible_keys': and 'keys': If 'rows' is big, you may have to tweek your keys to provide the engine with some help finding your data
'type': Type of join
To answer some of your questions;
'sql of the temporary table' - it's the first subquery in your sql
With FK_CONSULT_TO_CONSULT_TYPE you dont have to do anything, the engine has allready picked this up as an index which is what the explain is saying.
Queries are broken into 3 essentials steps; select data, filter, and join. Each row in the explain is a detail into one or more of these operations, it may not necessarily relate to a specific section of your SQL as the engine may have combined various parts into one.

Condition on joined table faster than condition on reference

I have a query involving two tables: table A has lots of rows, and contains a field called b_id, which references a record from table B, which has about 30 different rows. Table A has an index on b_id, and table B has an index on the column name.
My query looks something like this:
SELECT COUNT(A.id) FROM A INNER JOIN B ON B.id = A.b_id WHERE (B.name != 'dummy') AND <condition>;
With condition being some random condition on table A (I have lots of those, all exhibiting the same behavior).
This query is extremely slow (taking north of 2 seconds), and using explain, shows that query optimizer starts with table B, coming up with about 29 rows, and then scans table A. Doing a STRAIGHT_JOIN, turned the order around and the query ran instantaneously.
I'm not a fan of black magic, so I decided to try something else: come up with the id for the record in B that has the name dummy, let's say 23, and then simplify the query to:
SELECT COUNT(A.id) FROM A WHERE (b_id != 23) AND <condition>;
To my surprise, this query was actually slower than the straight join, taking north of a second.
Any ideas on why the join would be faster than the simplified query?
UPDATE: following a request in the comments, the outputs from explain:
Straight join:
+----+-------------+-------+--------+-----------------+---------+---------+---------------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+-----------------+---------+---------+---------------+--------+-------------+
| 1 | SIMPLE | A | ALL | b_id | NULL | NULL | NULL | 200707 | Using where |
| 1 | SIMPLE | B | eq_ref | PRIMARY,id_name | PRIMARY | 4 | schema.A.b_id | 1 | Using where |
+----+-------------+-------+--------+-----------------+---------+---------+---------------+--------+-------------+
No join:
+----+-------------+-------+------+---------------+------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+------+--------+-------------+
| 1 | SIMPLE | A | ALL | b_id | NULL | NULL | NULL | 200707 | Using where |
+----+-------------+-------+------+---------------+------+---------+------+--------+-------------+
UPDATE 2:
Tried another variant:
SELECT COUNT(A.id) FROM A WHERE b_id IN (<all the ids except for 23>) AND <condition>;
This runs faster than the no join, but still slower than the join, so it seems that the inequality operation is responsible for part of the performance hit, but not all.

If you are using MySQL 5.6 or later then you can ask the query optimizer what it is doing;
SET optimizer_trace="enabled=on";
## YOUR QUERY
SELECT COUNT(*) FROM transactions WHERE (id < 9000) and user != 11;
##END YOUR QUERY
SELECT trace FROM information_schema.optimizer_trace;
SET optimizer_trace="enabled=off";
You will almost certainly need to refer to the following sections in the MySQL reference Tracing the Optimiser and The Optimizer
Looking at the first explain it appears that the query is quicker probably because the optimizer can use the table B to filter down to the rows required based on the join and then use the foreign key to get the rows in table A.
In the explain it's this bit that is interesting; there is only one row matching and it's using schema.A.b_id. Effectively this is pre-filtering the rows from A which is where I think the performance difference comes from.
| ref | rows | Extra |
| schema.A.b_id | 1 | Using where |
So, as is usual with queries it is all down to indexes - or more accurately missing indexes. Just because you have indexes on individual fields it doesn't necessarily mean that these are suitable for the query you're running.
Basic rule: If the EXPLAIN doesn't say Using Index then you need to add a suitable index.
Looking at the explain output the first interesting thing is ironically the last thing on each line; namely the Extra
In the first example we see
| 1 | SIMPLE | A | .... Using where |
| 1 | SIMPLE | B | ... Using where |
Both of these Using where is not good; ideally at least one, and preferably both should say Using index
When you do
SELECT COUNT(A.id) FROM A WHERE (b_id != 23) AND <condition>;
and see Using where then you need to add an index as it's doing a table scan.
for example if you did
EXPLAIN SELECT COUNT(A.id) FROM A WHERE (Id > 23)
You should see Using where; Using index (assuming here that Id is the primary key and has an index)
If you then added a condition onto the end
EXPLAIN SELECT COUNT(A.id) FROM A WHERE (Id > 23) and Field > 0
and see Using where then you need to add an index for the two fields. Just having an index on a field doesn't mean that MySQL will be able to use that index during the query across multiple fields - this is something that internally the query optimizer will decide upon. I'm not exactly certain of the internal rules; but generally adding an extra index to match the query helps immensely.
So adding an index (on the two fields in the query above):
ALTER TABLE `A` ADD INDEX `IndexIdField` (`Id`,`Field`)
should change it such that when querying based upon those two fields there is an index.
I've tried this on one of my databases that has Transactions and User tables.
I'll use this query
EXPLAIN SELECT COUNT(*) FROM transactions WHERE (id < 9000) and user != 11;
Running without index on the two fields:
PRIMARY,user PRIMARY 4 NULL 14334 Using where
Then add an index:
ALTER TABLE `transactions` ADD INDEX `IndexIdUser` (`id`, `user`);
Then the same query again and this time
PRIMARY,user,Index 4 Index 4 4 NULL 12628 Using where; Using index
This time it's using the indexes - and as a result will be a lot quicker.
From comments by #Wrikken - and also bear in mind that I don't have the accurate schema / data so some of this investigation has required assumptions about the schema (which may be wrong)
SELECT COUNT(A.id) FROM A FORCE INDEX (b_id)
would perform at least as good as
SELECT COUNT(A.id) FROM A INNER JOIN B ON A.b_id = B.id.
If we look at the first EXPLAIN in the OP we see that there are two elements to the query. Referring to the EXPLAIN documentation for *eq_ref* I can see that this is going to define the rows for consideration based on this relationship.
The order of the explain output doesn't necessarily mean it's doing one and then the other; it's simply what has been chosen to execute the query (at least as far as I can tell).
For some reason the query optimizer has decided not to use the index on b_id - I'm assuming here that because of the query the optimizer has decided that it will be more efficient to do a table scan.
The second explain concerns me a little because it's not considering the index on b_id; possibly because of the AND <condition> (which is omitted so I'm guessing as to what it could be). When I try this with an index on b_id it does use the index; but as soon as a condition is added it doesn't use the index.
So, when doing
SELECT COUNT(A.id) FROM A INNER JOIN B ON A.b_id = B.id.
This all indicates to me is that the PRIMARY index on B is where the speed difference is coming from; I'm assuming because of the schema.A.b_id in the explain that there is a Foreign key on this table; which must be a better collection of related rows than the index on b_id - so the query optimizer can use this relationship to define which rows to pick - and because a primary index is better than secondary indexes it's going to be much quicker to select rows out of B and then use the relationship link to match against the rows in A.

I do not see any strange behavior here. What you need is to understand the basics of how MySQL uses indexes. Here is an article I usually recommend: 3 ways MySQL uses indexes.
It is always funny to observe people writing things like WHERE (B.name != 'dummy') AND <condition>, because this AND <condition> might be the reason why MySQL optimizer chose the specific index, and there is no valid reason to compare the performance of the query with that of another one with WHERE b_id != 23 AND <condition>, because the two queries usually need different indexes to perform good.
One thing you should understand, is that MySQL likes equality comparisons, and does not like range conditions and inequality comparisons. It is usually better to specify the correct values than to use a range condition or specify a != value.
So, let's compare the two queries.
With straight join
For each row in the A.id order (which is the primary key and is clustered, that is data is stored in its order on disk) take data for the row from disk to check if your <condition> is met and b_id, then (I repeat for each matching row) find the appropriate row for b_id, go on disk, take b.name, compare it with 'dummy'. Even though this plan in not at all efficient, you have only 200000 rows in your A table, so that it seems rather performant.
Without straight join
For each row in table B compare if name is matching, look into the A.b_id index (which is obviously sorted by b_id, since it is an index, hence contains A.ids in random order), and for each A.id for the given A.b_id find the corresponding A row on disk to check the <condition>, if it matches count the id, otherwise, discard the row.
As you see, there is nothing strange in the fact that the second query takes so long, you basically force MySQL to randomly access almost each row in A table, where in the first query you read the A table in the order it is stored on disk.
The query with no join does not use any index at all. It actually should take about the same as the query with straight join. My guess is that the order of the b_id!=23 and <condition> is significant.
UPD1: Could you still compare the performance of your query without join with the following:
SELECT COUNT(A.id)
FROM A
WHERE IF(b_id!=23, <condition>, 0);
UPD2: the fact the you do not see an index in EXPLAIN does not mean that no index is used at all. An index is at least used to define the reading order: when there is no other useful index, it is usually the primary key, but, as I said above, when there is an equility condition and the corresponding index, MySQL will use the index. So, basically, to understand which index is used you can look at the order in which rows are output. If the order is the same as the primary key, than no index was used (that is the primary key index was used), if the order of rows is shuffled - than there was some other index involved.
In your case, the second condition seems to be true for most of the rows, but the index is still used, that is to get b_id MySQL goes on disk in random order, that's why it is slow. No black magic here, and this second condition does affect the performance.

Probably this should be a comment rather than an answer but it will be a bit long.
First of all, it is hard to believe that two queries that have (almost) exactly the same explain run at different speed. Furthermore, this is less likely if the one with the extra line in the explain runs faster. And I guess the word faster is the key here.
You've compared speed (the time it takes for a query to finish) and that is an extremely empiric way of testing. For example, you could have improperly disabled the cache, which makes that comparison useless. Not to mention that your <insert your preferred software application here> could have made a page fault or any other operation at the time you've run the test that could have resulted in a decrease of the query speed.
The right way of measuring query performance is based on the explain (that's why it is there)
So the closest thing I have to answer the question: Any ideas on why the join would be faster than the simplified query?... is, in short, a layer 8 error.
I do have some other comments, though, that should be taken into account in order to speed things up. If A.id is a primary key (the name smells like it is), according to your explain, why does the count(A.id) have to scan all the rows? It should be able to get the data directly from the index but I don't see the Using index in the extra flags. It seems you don't even have a unique index on and that it is not a non nullable field. That also smells odd. Make sure that the field is not null and that there is a unique index on it, run the explain again, confirm the extra flags contain the Using index and then (properly) time the query. It should run much faster.
Also note that an approach that would result in the same performance improvement as I mentioned above would be to replace count(A.id) with count(*).
Just my 2 cents.

Because MySQL will not use index for index!=val in where.
The optimizer will decide to use an index by guessing. As a "!=" will more likely fetch everything, it skip and prevent using index to reduce overhead. (yes, mysql is stupid, and it does not statistic index column)
You may do a faster SELECT, by using index in(everything other then val), that MySQL will learn to use the index.
Example here showing query optimizer will choose to not use index by value

The answer to this question is actually a very simple consequence of algorithm design:
The key difference between these two queries is the merge operation.
Before I give a lesson on algorithms, I will mention the reason why the merge operation improves the performance. The merge improves the performance because it reduces the overall load on the aggregation. This is an iteration vs recursion issue. In the iteration analogy, we are simply looping through the entire index and counting the matches. In the recursion analogy, we are dividing and conquering (so to speak); or in other words, we are filtering the results that we need to count, thus reducing the volume of numbers we actually need to count.
Here are the key questions:
Why is a merge sort faster than an insertion sort?
Is a merge sort always faster than an insertion sort?
Let's explain this with a parable:
Let's say we have a deck of playing cards, and we need to sum the numbers of playing cards that have the numbers 7, 8 and 9 (assuming we don't know the answer in advance).
Let's say that we decide upon two ways to solve this problem:
We can hold the deck in one hand and move the cards to the table, one by one, counting as we go.
We can separate the cards into two groups: black suits and red suits. Then we can perform step 1 upon one of the groups and reuse the results for the second group.
If we choose option 2, then we have divided our problem in half. As a consequence, we can count the matching black cards and multiply the number by 2. In other words, we are re-using the part of the query execution plan that required the counting. This reasoning especially works when we know in advance how the cards were sorted (aka "clustered index"). Counting half of the cards is obviously much less time consuming than counting the entire deck.
If we wanted to improve the performance yet again, depending on how large the size of our database is, we may even further consider sorting into four groups (instead of two groups): clubs, diamonds, hearts, and spades. Whether or not we want to perform this further step depends on whether or not the overhead of sorting the cards into the additional groups is justified by the performance gain. In small numbers of cards, the performance gain is likely not worth the extra overhead required to sort into the different groups. As the number of cards grows, the performance gain begins to outweigh the overhead cost.
Here is an excerpt from "Introduction to Algorithms, 3rd edition," (Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein):
(Note: If someone can tell me how to format the sub-notation, I will edit this to improve readability.)
(Also, keep in mind that "n" is the number of objects we are dealing with.)
"As an example, in Chapter 2, we will see two algorithms for sorting.
The ﬁrst, known as insertion sort, takes time roughly equal to c1n2
to sort n items, where c1 is a constant that does not depend on n.
That is, it takes time roughly proportional to n2. The second, merge
sort, takes time roughly equal to c2n lg n, where lg n stands for
log2 n and c2 is another constant that also does not depend on n.
Insertion sort typically has a smaller constant factor than merge
sort, so that c1 < c2. We shall see that the constant factors can
have far less of an impact on the running time than the dependence on
the input size n. Let’s write insertion sort’s running time as c1n ·
n and merge sort’s running time as c2n · lg n. Then we see that where
insertion sort has a factor of n in its running time, merge sort has
a factor of lg n, which is much smaller. (For example, when n = 1000,
lg n is approximately 10, and when n equals one million, lg n is
approximately only 20.) Although insertion sort usually runs faster
than merge sort for small input sizes, once the input size n becomes
large enough, merge sort’s advantage of lg n vs. n will more than
compensate for the difference in constant factors. No matter how much
smaller c1 is than c2, there will always be a crossover point beyond
which merge sort is faster."
Why is this relevant? Let us look at the query execution plans for these two queries. We will see that there is a merge operation caused by the inner join.

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008