I have two tables
Table A (Primary Key is ID)
id \ firstname \ lastname \ zip \ state
Table B
some_field\ business name \ zip \ id
I need to get the first name and last name associated with an id, using the id from Table B (note this is the same id as in Table A).
I did a JOIN on Table A and Table B so that I could get the first name and last name
A friend of mine said I should not use a JOIN this way and that I should have just done two separate queries. Does that make any sense?
Does JOIN do anything that makes the process slower than two separate queries? How could two separate queries ever be faster than one query?
Q: Does this make sense?
A: No, without some valid reasons, it doesn't make sense.
Q: Does JOIN do anything that makes the process slower than two separate queries?
A: Yes, there are some things that can make a join slower, so we can't rule out that possibility. We can't make a blanket statement that "two separate queries will be faster" or that a "join will be slower".
An equijoin of two tables that are properly indexed is likely to be more efficient. But performance is best gauged by actually executing the statements, at expected production volumes of data, and observing and measuring performance.
Some of the things that could potentially make a join slower: a complicated join predicate (columns wrapped in functions, inequality comparisons, compound predicates combined with OR), multiple tables involved so that the optimizer has more join paths and operations to consider when coming up with an execution plan, or a join that produces a huge intermediate result that is later collapsed with a GROUP BY. (In short, it is possible to write a horrendously inefficient statement that uses a join operation. But it is usually not the join operation itself that is the culprit. This list is just a sampling, not an exhaustive list.)
A JOIN is the normative pattern for the use case you describe. It's not clear what reason your friend gives for recommending that you avoid a JOIN operation.
If your main query is primarily against (the unfortunately named) Table_B, and you want to look up first_name and last_name from Table_A, a JOIN is well suited to that.
If you are only returning one row (or a few rows) from Table_B, then an extra round trip for another query to get first_name and last_name won't be a problem. But if you are returning thousands of rows from Table_B, then executing thousands of separate, singleton queries against Table_A is going to kill performance and scalability.
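To make the round-trip cost concrete, here is a sketch in Python with an in-memory SQLite database (the table and column names are illustrative stand-ins for Table_A/Table_B). Both approaches return the same rows, but the second issues one query per Table_B row — the classic "N+1 queries" pattern.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (id INTEGER PRIMARY KEY, firstname TEXT, lastname TEXT);
    CREATE TABLE table_b (some_field TEXT, business_name TEXT, id INTEGER);
    INSERT INTO table_a VALUES (1, 'Ada', 'Lovelace'), (2, 'Alan', 'Turing');
    INSERT INTO table_b VALUES ('x', 'Acme', 1), ('y', 'Globex', 2);
""")

# One round trip: the JOIN.
joined = conn.execute("""
    SELECT b.business_name, a.firstname, a.lastname
    FROM table_b b JOIN table_a a ON a.id = b.id
    ORDER BY b.business_name
""").fetchall()

# N+1 round trips: one singleton lookup per Table_B row.
rows = conn.execute(
    "SELECT business_name, id FROM table_b ORDER BY business_name").fetchall()
singleton = []
for business_name, person_id in rows:
    first, last = conn.execute(
        "SELECT firstname, lastname FROM table_a WHERE id = ?",
        (person_id,)).fetchone()
    singleton.append((business_name, first, last))

print(joined == singleton)  # same result, 1 query vs. N+1 queries
```

With two rows the difference is invisible; with thousands of rows the per-query overhead of the second approach dominates.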
If your friend is concerned that a value in the foreign key column in Table_B won't match a value in the id column of Table_A, or there is a NULL value in the foreign key column, your friend would be right to point out that an inner join would prevent the row from Table_B from being returned.
In that case, we'd use an outer join, so we can return the row from Table_B even when a matching row from Table_A is not found.
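A quick sketch of that inner-versus-outer difference (SQLite via Python; the data is made up): a Table_B row whose id has no match in Table_A, or is NULL, disappears from an inner join but survives a LEFT (outer) join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (id INTEGER PRIMARY KEY, firstname TEXT, lastname TEXT);
    CREATE TABLE table_b (business_name TEXT, id INTEGER);
    INSERT INTO table_a VALUES (1, 'Ada', 'Lovelace');
    -- one matched row, one orphaned foreign key, one NULL foreign key
    INSERT INTO table_b VALUES ('Acme', 1), ('Orphan LLC', 99), ('NullCo', NULL);
""")

inner = conn.execute("""
    SELECT b.business_name, a.firstname
    FROM table_b b JOIN table_a a ON a.id = b.id
""").fetchall()

outer = conn.execute("""
    SELECT b.business_name, a.firstname
    FROM table_b b LEFT JOIN table_a a ON a.id = b.id
""").fetchall()

print(len(inner), len(outer))  # the outer join keeps the unmatched rows
```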
Your friend might also be concerned about performance of the JOIN operation, possibly because your friend has been burned by not having suitable indexes defined.
Assuming that a suitable index exists on Table_A (with leading column id), and that id is UNIQUE in Table_A... then a single query with a simple equijoin between a single-column foreign key and a single-column primary key will likely be more efficient than running a boatload of separate statements.
Or, perhaps your friend is concerned about issues with an immature ORM framework, one that doesn't efficiently handle the results returned from a join query.
If the database is being implemented in a way that the two tables could end up on separate database servers, then using a JOIN would fly in the face of that design. And if that was the design intent, a separation of the tables, then the application should also be using a separate connection for each of the two tables.
Unless your friend can provide some specific reason for avoiding a JOIN operation, my recommendation is that you ignore his advice.
(There has to be a good reason to avoid a JOIN operation. I suspect that maybe your friend doesn't understand how relational databases work.)
In your case it doesn't make any big difference, because you are joining on an id foreign key, which already has an index. Since it's indexed, the join will be efficient, and using a join here is the best approach.
It becomes more complicated depending on what you want, what the fields are, what you want to accomplish, etc.
So, yes, no big difference in your case.
Related
I am new to SQL. At the moment I am experiencing some slow MySQL queries. I think I need to improve my indexes but am not sure how.
DROP TEMPORARY TABLE IF EXISTS temp;

CREATE TEMPORARY TABLE temp
  (INDEX idx_a (EXTRACT_DATE, project_id, SERVICE_NAME))
SELECT DISTINCT
  DATE(c.EXTRACT_DATETIME) AS EXTRACT_DATE,
  p.project_id,
  p.project_name,
  c.CLUSTER_NAME,
  c.SERVICE_NAME,
  UPPER(CONCAT(SUBSTRING_INDEX(c.ENV_NAME, '-', 1), '-', c.CLUSTER_NAME)) AS CLUSTER_ID
FROM p
LEFT JOIN c ON p.project_id = c.project_id;
The short answer is that you need indexes, at least to optimize the lookups done by the JOIN. The EXPLAIN shows that both tables you are joining are doing a full table scan, then being joined the hard way, using a "block nested loop", which indicates it is not using an index.
It would help to at least create an index on c.project_id.
ALTER TABLE c ADD INDEX (project_id);
This would mean there is still a table-scan to read the p table (estimated 5720 rows), but at least when it needs to find the related rows in c, it only reads the rows it needs, without doing a table-scan of 287K rows for each row of p.
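As a rough illustration of the effect, here is a sketch using SQLite from Python (its EXPLAIN QUERY PLAN is the analogue of MySQL's EXPLAIN; the p/c tables are minimal stand-ins for the ones in the question): before the index the join scans c for every row of p, after the index it probes c through the index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Disable SQLite's ad-hoc automatic indexes so the bare table scan is visible.
conn.execute("PRAGMA automatic_index = OFF")
conn.executescript("""
    CREATE TABLE p (project_id INTEGER, project_name TEXT);
    CREATE TABLE c (project_id INTEGER, cluster_name TEXT);
""")

query = "SELECT * FROM p LEFT JOIN c ON p.project_id = c.project_id"

def plan(sql):
    # Each EXPLAIN QUERY PLAN row's 4th column is the human-readable step.
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

before = plan(query)  # c is scanned for every row of p
conn.execute("CREATE INDEX idx_c_project ON c (project_id)")
after = plan(query)   # c is now probed through the index

print(before)
print(after)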
The query you posted in an earlier question had another condition:
where DAYNAME(c.EXTRACT_DATETIME) = 'Friday' ;
I don't know why you haven't included this condition in the new question you posted.
If this is still a condition you need to handle, this could help optimize the query further. MySQL 5.7 (which you said in the other question you are using) supports virtual columns, defined for an expression, and you can index virtual columns.
ALTER TABLE c
  ADD COLUMN isFriday BOOLEAN AS (DAYNAME(EXTRACT_DATETIME) = 'Friday'),
  ADD INDEX (isFriday);
Then if you search on the new isFriday column, or even if you search on the same expression used for the virtual column definition, it will use the index.
So what you really need is an index on c that uses both columns, one for the join, and then for the additional condition.
ALTER TABLE c
  ADD COLUMN isFriday BOOLEAN AS (DAYNAME(EXTRACT_DATETIME) = 'Friday'),
  ADD INDEX (project_id, isFriday);
You aren’t filtering on anything other than the outer join column, which leads me to expect that most of the rows in both tables will need reading. In order to read them only once, you may be best off using a hash join rather than a nested loop with an index. A hash join allows each table to be read completely once, rather than the back-and-forth of a nested loop, which will likely re-read the same pages each time a row is looked up.
In order to use hash joins, you need to be running at least MySQL 8.0.18, the version in which they were introduced. It is recommended to use the latest available stable release.
We are facing some performance issues in some reports that work on millions of rows. I tried optimizing the SQL queries, but that only cut the execution time in half.
The next step is to analyze and modify or add some indexes, so I have some questions:
1- The SQL queries contain a lot of joins: do I have to create an index for each foreign key?
2- Imagine the query SELECT * FROM A LEFT JOIN B ON a.b_id = b.id WHERE a.attribute2 = 'someValue', and we have an index on table A based on (b_id, attribute2): does my query use this index for the WHERE part? (I know the index would be used if both conditions were in the WHERE clause.)
3- If an index is based on columns C1, C2 and C3, and I decide to add an index on C2, do I need to remove C2 from the first index?
Thanks for your time
You can use EXPLAIN on a query to see what MySQL will do when executing it. This helps a LOT when trying to figure out why it's slow.
JOIN-ing happens one table at a time, and the order is determined by MySQL analyzing the query and trying to find the fastest order. You will see it in the EXPLAIN result.
Only one index can be used per JOIN and it has to be on the table being joined. In your example the index used will be the id (primary key) on table B. Creating an index on every FK will give MySQL more options for the query plan, which may help in some cases.
There is only a difference between WHERE and JOIN conditions when there are NULL (missing rows) for the joined table (there is no difference at all for INNER JOIN). For your example the index on b_id does nothing. If you change it to an INNER JOIN (e.g. by adding b.something = 42 in the where clause), then it might be used if MySQL determines that it should do the query in reverse (first b, then a).
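The ON-versus-WHERE distinction can be seen in a small sketch (SQLite via Python, using the a/b tables from the question; the data is made up). Putting the condition on the joined table into the WHERE clause filters out the NULL-extended rows, effectively turning the LEFT JOIN back into an inner join:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY, b_id INTEGER);
    CREATE TABLE b (id INTEGER PRIMARY KEY, something INTEGER);
    INSERT INTO a VALUES (1, 10), (2, NULL);  -- row 2 has no match in b
    INSERT INTO b VALUES (10, 42);
""")

# Condition in the ON clause: unmatched rows of a are kept (NULL-extended).
in_on = conn.execute("""
    SELECT a.id FROM a LEFT JOIN b ON a.b_id = b.id AND b.something = 42
""").fetchall()

# Same condition in WHERE: the NULL-extended row fails the test and is dropped.
in_where = conn.execute("""
    SELECT a.id FROM a LEFT JOIN b ON a.b_id = b.id WHERE b.something = 42
""").fetchall()

print(len(in_on), len(in_where))
```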
No. It is 100% OK to have a column in multiple indexes. If you have an index on (A, B, C) and you add another one on (A), that one is redundant and pointless (because it is a prefix of the other index). An index on B alone is perfectly fine.
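To illustrate the prefix rule, here is a sketch using SQLite's EXPLAIN QUERY PLAN (table and column names follow the (A, B, C) example above): filtering on the leading column uses the composite index, filtering on the middle column alone cannot, and a separate index on that column fixes it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER, c INTEGER)")
conn.execute("CREATE INDEX idx_abc ON t (a, b, c)")

def plan(sql):
    return " | ".join(r[3] for r in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Leading column a: the composite index is usable for a seek.
p1 = plan("SELECT * FROM t WHERE a = 1")
# Column b alone is not a prefix of (a, b, c), so no seek is possible.
p2 = plan("SELECT * FROM t WHERE b = 1")
# A dedicated index on b restores the seek.
conn.execute("CREATE INDEX idx_b ON t (b)")
p3 = plan("SELECT * FROM t WHERE b = 1")

print(p1)
print(p2)
print(p3)
```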
I have a table which has a huge amount of data. I have 9 columns in that table (bp_detail) plus 1 ID column, which is my primary key. So I am fetching data using the query
select * from bp_detail
so what do I need to do to get the data quickly? Should I make indexes? If yes, then on which column?
I am also using that table (bp_detail) in an inner join with a table (extras) to get records based on a WHERE clause; the query I am using is:
select * from bp_detail bp inner join extras e
on (bp.id = e.bp_id)
where bp.id = '4' or bp.name = 'john'
I have joined these tables with a foreign key from extras bp_id to bp_detail id, so in this case what should I do to get the data quickly? Right now I have an index on the column "name" in the extras table.
Guidance highly obliged
If you are selecting all records, you gain nothing by indexing any column. An index makes filtering/ordering by the database engine quicker. Imagine a large book with 20,000 pages. With an index on the first page, listing chapter names and page numbers, you can quickly navigate through the book. The same applies to a database, since it is nothing more than a collection of records kept one after another.
You are planning to join tables though. The filtering takes place when JOINING:
on (bp.id = e.bp_id)
and in the WHERE:
where bp.id = '4' or bp.name = 'john'
(Anyway, any reason why you are filtering by both the ID and the NAME? The ID should be unique enough.)
Usually table IDs should be primary keys, so the join is covered. If you plan to filter by name frequently, consider adding an index there too. You ought to read up on how database indexes work as well.
Regarding the name index, the lookup speed depends on the search type. If you use an = equality search, it will be very quick. It will also be quite quick with a trailing wildcard (e.g. name LIKE 'john%'), but quite slow with wildcards on both sides (e.g. name LIKE '%john%').
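A sketch of that difference using SQLite's EXPLAIN QUERY PLAN (the extras table is a stand-in; note that SQLite needs case_sensitive_like turned on before a plain index can serve a prefix LIKE): the prefix pattern becomes an index range seek, while the double-wildcard pattern forces a full scan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Required for SQLite's LIKE optimization to use a default (BINARY) index.
conn.execute("PRAGMA case_sensitive_like = ON")
conn.execute("CREATE TABLE extras (bp_id INTEGER, name TEXT)")
conn.execute("CREATE INDEX idx_name ON extras (name)")

def plan(sql):
    return " | ".join(r[3] for r in conn.execute("EXPLAIN QUERY PLAN " + sql))

prefix   = plan("SELECT * FROM extras WHERE name LIKE 'john%'")   # index range seek
anywhere = plan("SELECT * FROM extras WHERE name LIKE '%john%'")  # full table scan

print(prefix)
print(anywhere)
```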
Anyway, is your database large enough for this to matter? Without much data, and if your application is not read-intensive, this feels like the beginner's mistake called premature optimization.
Depending on your search criteria: if you are just selecting all of the data, then the primary key is enough. To speed up the join part, you can create an index on e.bp_id. I could help you more if you shared the table schemas.
Let's say my model contains 2 tables: persons and addresses. One person can have 0, 1 or more addresses. I'm trying to execute a query that lists all persons and includes the number of addresses each has. Here are the 2 queries I have to achieve that:
SELECT
persons.*,
count(addresses.id) AS number_of_addresses
FROM `persons`
LEFT JOIN addresses ON persons.id = addresses.person_id
GROUP BY persons.id
and
SELECT
persons.*,
(SELECT COUNT(*)
FROM addresses
WHERE addresses.person_id = persons.id) AS number_of_addresses
FROM `persons`
And I was wondering if one is better than the other in term of performance.
The way to determine performance characteristics is to actually run the queries and see which is better.
If you have no indexes, then the first is probably better. If you have an index on addresses(person_id), then the second is probably better.
The reason is a little complicated. The basic reason is that GROUP BY (in MySQL) uses a sort, and sorts are O(n * log(n)) in complexity. So the time to do a sort grows faster than the data grows (not much faster, but a bit faster). The consequence is that a bunch of small aggregations, one per person, is faster than one aggregation over all the data at once.
That is conceptual. In fact, MySQL will use the index for the correlated subquery, so it is often faster than the overall group by, which does not make use of an index.
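For what it's worth, the two query shapes can be checked for equivalence on a toy dataset (SQLite via Python; the data is made up, and actual performance must be measured on your own data and server):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE persons (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE addresses (id INTEGER PRIMARY KEY, person_id INTEGER);
    INSERT INTO persons VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cid');
    INSERT INTO addresses (person_id) VALUES (1), (1), (3);
""")

# Shape 1: LEFT JOIN plus GROUP BY.
group_by = sorted(conn.execute("""
    SELECT persons.id, persons.name, COUNT(addresses.id) AS number_of_addresses
    FROM persons
    LEFT JOIN addresses ON persons.id = addresses.person_id
    GROUP BY persons.id
""").fetchall())

# Shape 2: correlated scalar subquery.
subquery = sorted(conn.execute("""
    SELECT persons.id, persons.name,
           (SELECT COUNT(*) FROM addresses
             WHERE addresses.person_id = persons.id) AS number_of_addresses
    FROM persons
""").fetchall())

print(group_by == subquery)  # same answer either way
```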
I think the first query is optimal, and further optimization can be gained by changing the table structure. For example, define both person_id and address_id (order is important) as a composite primary key in the addresses table to make the join faster.
MySQL's InnoDB storage engine uses an index-organized table (clustered index), so a primary key lookup is faster than a normal secondary-index lookup, especially in a join operation.
What is the order MySQL joins the tables, how is it chosen and when does STRAIGHT_JOIN comes in handy?
MySQL (before hash joins arrived in 8.0.18) is only capable of doing nested loops (possibly using indexes), so if both join tables are indexed, the time for the join is calculated as A * log(B) if A is leading and B * log(A) if B is leading.
It is easy to see that the table with fewer records satisfying the WHERE condition should be made leading.
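The back-of-the-envelope cost model above can be checked with quick arithmetic (the row counts here are made up): with a small filtered set A and a big table B, leading with A is far cheaper.

```python
import math

A, B = 1_000, 1_000_000  # rows satisfying the WHERE clause in each table

a_leading = A * math.log2(B)  # scan A, index-probe B for each row
b_leading = B * math.log2(A)  # scan B, index-probe A for each row

print(f"A leading: {a_leading:,.0f} probes, B leading: {b_leading:,.0f} probes")
```

Leading with the smaller qualifying set wins by roughly a factor of 500 here, which is why the optimizer normally puts it first.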
There are some other factors that affect the join performance, such as WHERE conditions, ORDER BY and LIMIT clauses etc. MySQL tries to predict the time for the join orders and if statistics are up to date does it quite well.
STRAIGHT_JOIN is useful when the statistics are not accurate (say, naturally skewed) or in case of bugs in the optimizer.
For instance, the following spatial join:
SELECT *
FROM a
JOIN b
ON MBRContains(a.area, b.area)
is subject to a join swap (the smaller table is made leading); however, MBRContains is not converted to MBRWithin, and the resulting plan does not make use of the index.
In this case you should explicitly set the join order using STRAIGHT_JOIN.
Others have explained how the optimizer prefers the tables that yield smaller result sets, but that may not always work out. I had been working with a government contracts/grants database. The main table had some 14+ million records, and there were over 20 lookup tables (states, congressional districts, type of business classification, owner ethnicity, etc.).
Anyhow, with these smaller tables, the optimizer started the join from one of the small lookups, back to the master table, and then joined all the others. It CHOKED the database, and the query was cancelled after 30+ hours. Since my primary table was listed FIRST and all the lookups were joined AFTER, just adding STRAIGHT_JOIN at the top FORCED the order I had listed, and the complex query ran in about 2 hours (expected, for all it had to do).
Get whatever your primary basis table is to the top, with all the subsequent extras joined later; I've found that definitely helps.
The order of tables is chosen by the optimizer. STRAIGHT_JOIN comes in handy when the optimizer gets it wrong, which is not so often. I used it only once, in a big join where the optimizer put one particular table first (I saw it in the EXPLAIN SELECT output), so I rearranged the query so that the table was joined later. It sped up the query a lot.