Performance Penalties for Unused Joins - mysql

I'm writing a script that generates a report based on a query that joins several tables together. One of the inputs to the script is going to be a list of the fields that are required on the report. Depending on the fields requested, some of the tables might not be needed. My question is: is there a [significant] performance penalty for including a join when it is not referenced in a SELECT or WHERE clause?
Consider the following tables:
mysql> SELECT * FROM `Books`;
+----------------------+----------+
| title                | authorId |
+----------------------+----------+
| Animal Farm          |        3 |
| Brave New World      |        2 |
| Fahrenheit 451       |        1 |
| Nineteen Eighty-Four |        3 |
+----------------------+----------+
mysql> SELECT * FROM `Authors`;
+----+----------+-----------+
| id | lastName | firstName |
+----+----------+-----------+
|  1 | Bradbury | Ray       |
|  2 | Huxley   | Aldous    |
|  3 | Orwell   | George    |
+----+----------+-----------+
Does
SELECT
`Authors`.`lastName`
FROM
`Authors`
WHERE
`Authors`.`id` = 1
Outperform:
SELECT
`Authors`.`lastName`
FROM
`Authors`
JOIN
`Books`
ON `Authors`.`id` = `Books`.`authorId`
WHERE
`Authors`.`id` = 1
?
It seems to me that MySQL should just know to ignore the JOIN completely, since the table is not referenced in the SELECT or WHERE clause. But somehow I doubt this is the case. Of course, this is a really basic example. The actual data involved will be much more complex.
And really, it's not a terribly huge deal... I just need to know if my script needs to be "smart" about the joins, and only include them if the fields requested will rely on them.

This isn't actually unused since it means that only Authors that exist in Books are included in the result set.
JOIN
`Books`
ON `Authors`.`id` = `Books`.`authorId`
However, if you "knew" that every Author existed in Books, then there would be some performance benefit in removing the join, but it would largely depend on indexes, the number of records in the table, and the logic in the join (especially when doing data conversions).
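As a quick illustration of that point using the sample tables above (the extra author row is hypothetical), an author with no matching rows in Books simply disappears from the joined result:

-- Hypothetical author with no rows in Books
INSERT INTO `Authors` (`id`, `lastName`, `firstName`) VALUES (4, 'Salinger', 'J.D.');

-- Without the join: returns 'Salinger'
SELECT `Authors`.`lastName` FROM `Authors` WHERE `Authors`.`id` = 4;

-- With the join: returns an empty set, because the inner join requires at least one matching book
SELECT `Authors`.`lastName`
FROM `Authors`
JOIN `Books` ON `Authors`.`id` = `Books`.`authorId`
WHERE `Authors`.`id` = 4;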

This is the kind of question that is impossible to answer. Yes, adding the join will take additional time; it's impossible to tell whether you'd be able to measure that time without, well, uh....measuring the time.
Broadly speaking, if - like in your example - you're joining on primary keys, with unique indices, it's unlikely to make a measurable difference.
If you've got more complex joins (which you hint at), or are joining on fields without an index, or if your join involves a function, the performance penalty may be significant.
Of course, it may still be easier to do it this way than to write multiple queries which are essentially the same, other than removing unneeded joins.
Final bit of advice - try abstracting the queries into views. That way, you can optimize performance once, and perhaps write your report queries in a simpler way...
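A rough sketch of that idea, if it helps (the view name and column list are made up for illustration; MySQL will generally still execute the join under the hood, the win is that the join logic lives in one place):

CREATE VIEW `report_base` AS
SELECT
    `Authors`.`id` AS `authorId`,
    `Authors`.`lastName`,
    `Authors`.`firstName`,
    `Books`.`title`
FROM `Authors`
LEFT JOIN `Books` ON `Books`.`authorId` = `Authors`.`id`;

-- The report script then just picks the columns it needs:
SELECT `lastName` FROM `report_base` WHERE `authorId` = 1;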

Joins will always take time.
Side effects
On top of that, an inner join (which is the default join) influences the result by limiting the number of rows you get.
So depending on whether all authors are in Books, the two queries may or may not be identical.
Also, if an author has written more than one book, the result set of the 'joined' query will show duplicate rows.
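With the sample data above, for example, Orwell (id = 3) has two books, so the joined query returns his last name twice:

SELECT `Authors`.`lastName`
FROM `Authors`
JOIN `Books` ON `Authors`.`id` = `Books`.`authorId`
WHERE `Authors`.`id` = 3;

+----------+
| lastName |
+----------+
| Orwell   |
| Orwell   |
+----------+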
Performance
In the WHERE clause you have stated Authors.id to be a constant = 1, therefore (provided you have indexes on Authors.id and Books.authorId) it will be a very fast lookup for both tables, and the query times of the two versions will be very close.
In general, though, joins can take quite a lot of time, and with all the added side effects they should only be undertaken if you really want to use the extra info the join offers.

It seems that there are two things you are trying to determine: whether there are any optimizations that can be done between the two select statements, and which of the two would be the fastest to execute.
Since the join actually limits the returned results to authors who have books, there is not much optimization that can be done between them.
For the case you describe, where the joined table has no limiting effect on the returned results, the query without the join should perform faster.

MySQL query painfully slow on large data

I'm no MySQL whiz, but I get by. I have just inherited a pretty large table (600,000 rows and around 90 columns -- please kill me...) and I have a smaller table that I've created to link it with a categories table.
I'm trying to query said table with a left join so I have both sets of data in one object, but it runs terribly slow and I'm not hot enough to sort it out; I'd really appreciate a little guidance and an explanation as to why it's so slow.
SELECT
`products`.`Product_number`,
`products`.`Price`,
`products`.`Previous_Price_1`,
`products`.`Previous_Price_2`,
`products`.`Product_number`,
`products`.`AverageOverallRating`,
`products`.`Name`,
`products`.`Brand_description`
FROM `product_categories`
LEFT OUTER JOIN `products`
ON `products`.`product_id`= `product_categories`.`product_id`
WHERE COALESCE(product_categories.cat4, product_categories.cat3,
product_categories.cat2, product_categories.cat1) = '123456'
AND `product_categories`.`product_id` != 0
The two tables are MyISAM, the products table has indexes on Product_number and Brand_Description, and the product_categories table has a unique index on all columns combined, if this info is of any help at all.
Having inherited this system I need to get this working asap before I nuke it and do it properly so any help right now will earn you my utmost respect!
[Edit]
Here is the output of the explain extended:
+----+-------------+--------------------+-------+---------------+------+---------+------+---------+----------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------------------+-------+---------------+------+---------+------+---------+----------+--------------------------+
| 1 | SIMPLE | product_categories | index | NULL | cat1 | 23 | NULL | 1224419 | 100.00 | Using where; Using index |
| 1 | SIMPLE | products | ALL | Product_id | NULL | NULL | NULL | 512376 | 100.00 | |
+----+-------------+--------------------+-------+---------------+------+---------+------+---------+----------+--------------------------+
Optimize Table
To establish a baseline, I would first recommend running an OPTIMIZE TABLE command on both tables. Please note that this might take some time. From the docs:
OPTIMIZE TABLE should be used if you have deleted a large part of a
table or if you have made many changes to a table with variable-length
rows (tables that have VARCHAR, VARBINARY, BLOB, or TEXT columns).
Deleted rows are maintained in a linked list and subsequent INSERT
operations reuse old row positions. You can use OPTIMIZE TABLE to
reclaim the unused space and to defragment the data file. After
extensive changes to a table, this statement may also improve
performance of statements that use the table, sometimes significantly.
[...]
For MyISAM tables, OPTIMIZE TABLE works as follows:
If the table has deleted or split rows, repair the table.
If the index pages are not sorted, sort them.
If the table's statistics are not up to date (and the repair could not be accomplished by sorting the index), update them.
Indexing
If space and index management isn't a concern, you can try adding a composite index on
product_categories.cat4, product_categories.cat3, product_categories.cat2, product_categories.cat1
This would be advised if you use a leftmost subset of these columns often in your queries. The query plan indicates that it can use the cat1 index of product_categories. This most likely only includes the cat1 column. By adding all four category columns to an index, it can more efficiently seek to the desired row. From the docs:
MySQL can use multiple-column indexes for queries that test all the
columns in the index, or queries that test just the first column, the
first two columns, the first three columns, and so on. If you specify
the columns in the right order in the index definition, a single
composite index can speed up several kinds of queries on the same
table.
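A sketch of what that could look like (the index name is arbitrary, and the column order follows the suggestion above):

ALTER TABLE product_categories
    ADD INDEX idx_cats (cat4, cat3, cat2, cat1);

Note that this only helps queries that compare the leading columns directly, not ones that wrap them in a function like COALESCE.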
Structure
Furthermore, given that your table has 90 columns you should also be aware that a wider table can lead to slower query performance. You may want to consider Vertically Partitioning your table into multiple tables:
Having too many columns can bloat your record size, which in turn
results in more memory blocks being read in and out of memory causing
higher I/O. This can hurt performance. One way to combat this is to
split your tables into smaller more independent tables with smaller
cardinalities than the original. This should now allow for a better
Blocking Factor (as defined above) which means less I/O and faster
performance. This process of breaking apart the table like this is a
called a Vertical Partition.
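As a rough sketch of that idea (the split and the column types below are assumptions, using the columns from the query above): keep the handful of frequently queried columns in a narrow table and move the rest into a second table keyed by the same product_id.

CREATE TABLE products_core (
    product_id           INT NOT NULL PRIMARY KEY,
    Product_number       VARCHAR(50),
    Name                 VARCHAR(255),
    Price                DECIMAL(10,2),
    Previous_Price_1     DECIMAL(10,2),
    Previous_Price_2     DECIMAL(10,2),
    AverageOverallRating DECIMAL(3,2),
    Brand_description    VARCHAR(255)
) ENGINE=MyISAM;

CREATE TABLE products_extra (
    product_id INT NOT NULL PRIMARY KEY
    -- ... the other ~80 rarely used columns go here
) ENGINE=MyISAM;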
The meaning of your query seems to be "find all products that have the category '123456'." Is that correct?
COALESCE is an extraordinarily expensive thing to use in a WHERE clause, because wrapping the category columns in a function prevents MySQL from using an index on them. Your explain result shows that your query is not being very selective on your product_categories table. In MySQL you need to avoid functions on columns in WHERE clauses altogether if you want to exploit indexes to make your queries fast.
The thing someone else said about 90-column tables being harmful is also true. But you're stuck with it, so let's just deal with it.
Can we rework your query to get rid of the function-based WHERE? Let's try this.
SELECT /* some columns from the products table */
FROM products
WHERE product_id IN
(
SELECT DISTINCT product_id
FROM product_categories
WHERE product_id <> 0
AND ( cat1='123456'
OR cat2='123456'
OR cat3='123456'
OR cat4='123456')
)
For this to work fast you're going to need to create separate indexes on your four cat columns. The composite unique index ("on all columns combined") is not going to help you. It still may not be so good.
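For example (index names are arbitrary):

CREATE INDEX idx_cat1 ON product_categories (cat1);
CREATE INDEX idx_cat2 ON product_categories (cat2);
CREATE INDEX idx_cat3 ON product_categories (cat3);
CREATE INDEX idx_cat4 ON product_categories (cat4);

With all four in place, MySQL can potentially use an index_merge union for the four OR conditions.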
A better solution might be FULLTEXT searching IN BOOLEAN MODE. You're working with the MyISAM access method so this is possible. It's definitely worth a try. It could be very fast indeed.
SELECT /* some columns from the products table */
FROM products
WHERE product_id IN
(
SELECT product_id
FROM product_categories
WHERE MATCH(cat1,cat2,cat3,cat4)
AGAINST('123456' IN BOOLEAN MODE)
AND product_id <> 0
)
For this to work fast you're going to need to create a FULLTEXT index like so.
CREATE FULLTEXT INDEX cat_lookup
ON product_categories (cat1, cat2, cat3, cat4)
Note that neither of these suggested queries produces precisely the same results as your COALESCE query. The way your COALESCE query is set up, some combinations that match these queries won't match it. For example:
cat1     cat2     cat3     cat4
123451   123453   123455   123456   matches your and my queries
123456   123455   123454   123452   matches my queries but not yours
But it's likely that my queries will produce a useful list of products, even if it has a few more items than yours.
You can debug this stuff by just working with the inner queries on product_categories.
There is something strange. Does the table product_categories indeed have a product_id column? Shouldn't the from and where clauses be like this:
FROM `product_categories` pc
LEFT OUTER JOIN `products` p ON p.category_id = pc.id
WHERE
COALESCE(product_categories.cat4, product_categories.cat3,product_categories.cat2, product_categories.cat1) = '123456'
AND pc.id != 0

Most efficient way to count all rows in a table but select only one

Currently I'm running these two queries:
SELECT COUNT(*) FROM `mytable`
SELECT * FROM `mytable` WHERE `id`=123
I'm wondering what format will be the most efficient. Does the order the queries are executed make a difference? Is there a single query that will do what I want?
Both queries are fairly unrelated. The COUNT doesn't do an index look-up on any particular value, while the SELECT likely uses the primary key for a fast look-up. The only thing the queries have in common is the table.
Since these are so simple, the query optimizer and results cache shouldn't have a problem performing very well on these queries.
Are they causing you performance problems? If not, don't bother optimizing them.
Does the order the queries are executed make a difference?
No, they fetch different things. The count will read a field that contains the number of rows in the table (at least on MyISAM, where that value is stored), and the select by id will use the index. Both are fast and simple.
Is there a single query that will do what I want?
Yes, but it will make your code less clear and less maintainable (due to mixing concepts), and in the best case it will not improve the performance (probably it will make it worse).
If you really, really want to group them somehow, create a stored procedure, but unless you use this pair of queries a lot or in several places of the code, it can be overkill.
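If you did decide to go the stored procedure route, a minimal sketch might look like this (the procedure name is made up, and the client has to be able to handle two result sets):

DELIMITER //
CREATE PROCEDURE get_count_and_row(IN p_id INT)
BEGIN
    SELECT COUNT(*) FROM `mytable`;
    SELECT * FROM `mytable` WHERE `id` = p_id;
END //
DELIMITER ;

-- One round trip, two result sets:
CALL get_count_and_row(123);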
First off: Ben S. makes a good point. This is not worth optimizing.
But if one wants to put those two statements in one SQL statement, I think this is one way to do it:
select *,count(*) from mytable
union all
select *,-1 from mytable where id = 123
This will give one row for the count(*) (where one ignores all but the last column) and as many rows as match id = 123 (where one ignores the last column, as it is always -1).
Like this:
| Column1    | Column2    | Column3    | ColumnN    | Count(*) Column |
-----------------------------------------------------------------------
| ignore     | ignore     | ignore     | ignore     | 4711            |
| valid data | valid data | valid data | valid data | -1 (ignore)     |
Regards
Sigersted
What table engine are you using?
select count(*) is better on MyISAM compared to InnoDB. In MyISAM the number of rows in each table is stored, and when doing count(*) that value is simply returned. InnoDB doesn't do this because it supports transactions.
More info:
http://www.mysqlperformanceblog.com/2006/12/01/count-for-innodb-tables/
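If you're not sure which engine the table uses, something like this will tell you (for MyISAM the row count shown is exact, while for InnoDB it is only an estimate):

SHOW TABLE STATUS LIKE 'mytable';

-- or query the data dictionary directly
SELECT ENGINE, TABLE_ROWS
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = 'mytable';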

JOIN queries vs multiple queries

Are JOIN queries faster than several queries? (You run your main query, and then you run many other SELECTs based on the results from your main query)
I'm asking because JOINing them would complicate A LOT the design of my application
If they are faster, can anyone approximate very roughly by how much? If it's 1.5x I don't care, but if it's 10x I guess I do.
For inner joins, a single query makes sense, since you only get matching rows.
For left joins, multiple queries are much better... look at the following benchmark I did:
Single query with 5 Joins
query: 8.074508 seconds
result size: 2268000
5 queries in a row
combined query time: 0.00262 seconds
result size: 165 (6 + 50 + 7 + 12 + 90)
Note that both approaches return the same underlying data (6 x 50 x 7 x 12 x 90 = 2,268,000 joined rows versus 6 + 50 + 7 + 12 + 90 = 165 rows fetched separately); the left joins just repeat it redundantly, which is why they use far more memory.
The overhead might not be as bad if you only join two tables, but with three or more it generally becomes worth running separate queries.
As a side note, my MySQL server is right beside my application server... so connection time is negligible. If your connection time is in the seconds, then maybe there is a benefit to the single-query approach.
Frank
This is way too vague to give you an answer relevant to your specific case. It depends on a lot of things. Jeff Atwood (founder of this site) actually wrote about this. For the most part, though, if you have the right indexes and you properly do your JOINs it is usually going to be faster to do 1 trip than several.
This question is old, but is missing some benchmarks. I benchmarked JOIN against its 2 competitors:
N+1 queries
2 queries, the second one using a WHERE IN(...) or equivalent
The result is clear: on MySQL, JOIN is much faster, and N+1 queries can drop the performance of an application drastically.
That is, unless you select a lot of records that all point to a very small number of distinct foreign records; I benchmarked that extreme case as well.
It is very unlikely to happen in a typical application, though, unless you're joining a *-to-many relationship, in which case the foreign key is on the other table and you're duplicating the main table data many times.
Takeaway:
For *-to-one relationships, always use JOIN
For *-to-many relationships, a second query might be faster
See my article on Medium for more information.
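For what it's worth, a concrete sketch of the "2 queries, the second one using a WHERE IN(...)" pattern mentioned above (table and column names are hypothetical):

-- 1st query: fetch the main rows
SELECT id, title FROM posts WHERE author_id = 42;

-- 2nd query: fetch all related rows in one go (ids taken from the first result),
-- instead of issuing N+1 individual lookups
SELECT post_id, body FROM comments WHERE post_id IN (1, 2, 3);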
I actually came to this question looking for an answer myself, and after reading the given answers I can only agree that the best way to compare DB query performance is to get real-world numbers, because there are just too many variables to take into account. BUT I also think that comparing those numbers with each other leads to no good in almost all cases. What I mean is that the numbers should always be compared with an acceptable threshold, not with each other.
I can understand that if one way of querying takes, say, 0.02 seconds and the other one takes 20 seconds, that's an enormous difference. But what if one way takes 0.0000000002 seconds and the other takes 0.0000002 seconds? In both cases one way is a whopping 1000 times faster than the other, but is it really still "whopping" in the second case?
Bottom line as I personally see it: if it performs well, go for the easy solution.
The real question is: Do these records have a one-to-one relationship or a one-to-many relationship?
TLDR Answer:
If one-to-one, use a JOIN statement.
If one-to-many, use one (or many) SELECT statements with server-side code optimization.
Why and How To Use SELECT for Optimization
SELECTing (with multiple queries instead of joins) a large group of records based on a one-to-many relationship is the more efficient option, because JOINing multiplies the rows and blows up memory use with redundant data. Grab all of the data, then use a server-side scripting language to sort it out:
SELECT * FROM Address WHERE Personid IN(1,2,3);
Results:
Address.id : 1 // First person and their address
Address.Personid : 1
Address.City : "Boston"
Address.id : 2 // First person's second address
Address.Personid : 1
Address.City : "New York"
Address.id : 3 // Second person's address
Address.Personid : 2
Address.City : "Barcelona"
Here, I am getting all of the records, in one select statement. This is better than JOIN, which would be getting a small group of these records, one at a time, as a sub-component of another query. Then I parse it with server-side code that looks something like...
<?php
foreach ($addresses as $address) {
    $persons[$address['Personid']]->Address[] = $address;
}
?>
When Not To Use JOIN for Optimization
JOINing a large group of records to one single record on a one-to-one relationship is optimally efficient compared to multiple SELECT statements run one after the other, each of which simply fetches the next record type.
But JOIN is inefficient when getting records with a one-to-many relationship.
Example: The database Blogs has 3 tables of interest, Blogpost, Tag, and Comment.
SELECT * from BlogPost
LEFT JOIN Tag ON Tag.BlogPostid = BlogPost.id
LEFT JOIN Comment ON Comment.BlogPostid = BlogPost.id;
If there is 1 blogpost, 2 tags, and 2 comments, you will get results like:
Row1: tag1, comment1,
Row2: tag1, comment2,
Row3: tag2, comment1,
Row4: tag2, comment2,
Notice how each record is duplicated. Okay, so, 2 comments and 2 tags is 4 rows. What if we have 4 comments and 4 tags? You don't get 8 rows -- you get 16 rows:
Row1: tag1, comment1,
Row2: tag1, comment2,
Row3: tag1, comment3,
Row4: tag1, comment4,
Row5: tag2, comment1,
Row6: tag2, comment2,
Row7: tag2, comment3,
Row8: tag2, comment4,
Row9: tag3, comment1,
Row10: tag3, comment2,
Row11: tag3, comment3,
Row12: tag3, comment4,
Row13: tag4, comment1,
Row14: tag4, comment2,
Row15: tag4, comment3,
Row16: tag4, comment4,
Add more tables, more records, etc., and the problem will quickly inflate to hundreds of rows that are all full of mostly redundant data.
What do these duplicates cost you? Memory (in the SQL server and the code that tries to remove the duplicates) and networking resources (between SQL server and your code server).
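For comparison, the multi-query version of the same blog example fetches each related set separately, so no row is ever multiplied:

SELECT * FROM BlogPost WHERE id = 1;
SELECT * FROM Tag WHERE BlogPostid = 1;
SELECT * FROM Comment WHERE BlogPostid = 1;

With 4 tags and 4 comments, that's 1 + 4 + 4 = 9 rows transferred instead of 16.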
Source: https://dev.mysql.com/doc/refman/8.0/en/nested-join-optimization.html ; https://dev.mysql.com/doc/workbench/en/wb-relationship-tools.html
Did a quick test selecting one row from a 50,000 row table and joining with one row from a 100,000 row table. Basically looked like:
$id = mt_rand(1, 50000);
$row = $db->fetchOne("SELECT * FROM table1 WHERE id = " . $id);
$row = $db->fetchOne("SELECT * FROM table2 WHERE other_id = " . $row['other_id']);
vs
$id = mt_rand(1, 50000);
$db->fetchOne("SELECT table1.*, table2.*
FROM table1
LEFT JOIN table2 ON table1.other_id = table2.other_id
WHERE table1.id = " . $id);
The two-select method took 3.7 seconds for 50,000 reads whereas the JOIN took 2.0 seconds on my slow at-home computer. INNER JOIN and LEFT JOIN did not make a difference. Fetching multiple rows (e.g., using WHERE id IN (...)) yielded similar results.
Construct both separate queries and joins, then time each of them -- nothing helps more than real-world numbers.
Then, even better, add "EXPLAIN" to the beginning of each query. This will show you how MySQL executes the request -- which indexes it uses and how many rows it expects to examine for each table.
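For example (the query shown is just illustrative):

EXPLAIN SELECT t1.*, t2.*
FROM table1 AS t1
LEFT JOIN table2 AS t2 ON t2.other_id = t1.other_id
WHERE t1.id = 123;

In the output, the key column shows which index (if any) each table uses, and the rows column is MySQL's estimate of how many rows it will examine for that table.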
Depending on the complexity for the database compared to developer complexity, it may be simpler to do many SELECT calls.
Try running some database statistics against both the JOIN and the multiple SELECTS. See if in your environment the JOIN is faster/slower than the SELECT.
Then again, if changing it to a JOIN would mean an extra day/week/month of dev work, I'd stick with multiple SELECTs
Cheers,
BLT
In my experience I have found it's usually faster to run several queries, especially when retrieving large data sets.
When interacting with the database from another application, such as PHP, there is the argument of one trip to the server over many.
There are other ways to limit the number of trips made to the server and still run multiple queries that are often not only faster but also make the application easier to read - for example mysqli_multi_query.
I'm no novice when it comes to SQL, and I think there is a tendency for developers, especially juniors, to spend a lot of time trying to write very clever joins because they look smart, whereas there are actually smart ways to extract data that look simple.
The last paragraph was a personal opinion, but I hope this helps. I do agree with the others though who say you should benchmark. Neither approach is a silver bullet.
Whether you should use a join is first and foremost about whether a join makes sense. Only at that point is performance even something to be considered, as nearly all other cases will result in significantly worse performance.
Performance differences will largely be tied to how related the info you're querying for is. Joins work, and they're fast when the data is related and you index stuff correctly, but they do often result in some redundancy and sometimes more results than needed. And if your data sets are not directly related, sticking them in a single query will result in what's called a Cartesian product (basically, all possible combinations of rows), which is almost never what you want.
This is often caused by many-to-one-to-many relationships. For example, HoldOffHunger's answer mentioned a single query for posts, tags, and comments. Comments are related to a post, as are tags...but tags are unrelated to comments.
+------------+ +---------+ +---------+
| comment | | post | | tag |
|------------|* 1|---------|1 *|---------|
| post_id |-----| post_id |-----| post_id |
| comment_id | | ... | | tag_id |
| user_id | | | | ... |
| ... | | | | ... |
+------------+ +---------+ +---------+
In this case, it is unambiguously better for this to be at least two separate queries. If you try to join tags and comments, because there's no direct relation between the two, you end up with every possible combination of tag and comment. many * many == manymany. Aside from that, since posts and tags are unrelated, you can do those two queries in parallel, leading to potential gain.
Let's consider a different scenario, though: You want the comments attached to a post, and the commenters' contact info.
+----------+ +------------+ +---------+
| user | | comment | | post |
|----------|1 *|------------|* 1|---------|
| user_id |-----| post_id |-----| post_id |
| username | | user_id | | ... |
| ... | | ... | +---------+
+----------+ +------------+
This is where you should consider a join. Aside from being a much more natural query, most database systems (including MySQL) have lots of smart people put lots of hard work into optimizing queries just like it. For separate queries, since each query depends on the results of the previous one, the queries can't be done in parallel, and the total time becomes not just the actual execute time of the queries, but also the time spent fetching results, sifting through them for IDs for the next query, linking rows together, etc.
Will it be faster in terms of throughput? Probably. But it also potentially locks more database objects at a time (depending on your database and your schema) and thereby decreases concurrency. In my experience, people are often misled by the "fewer database round-trips" argument when, in reality, on most OLTP systems where the database is on the same LAN, the real bottleneck is rarely the network.
Here is a link with 100 useful queries. These are tested against an Oracle database, but remember SQL is a standard; what differs between Oracle, MS SQL Server, MySQL and other databases is the SQL dialect:
http://javaforlearn.com/100-sql-queries-learn/
There are several factors at play, which means there is no binary answer. The question of what is best for performance depends on your environment. By the way, if your single select with an identifier is not sub-second, something may be wrong with your configuration.
The real question to ask is how do you want to access the data. Single selects support late-binding. For example if you only want employee information, you can select from the Employees table. The foreign key relationships can be used to retrieve related resources at a later time and as needed. The selects will already have a key to point to so they should be extremely fast, and you only have to retrieve what you need. Network latency must always be taken into account.
Joins will retrieve all of the data at once. If you are generating a report or populating a grid, this may be exactly what you want. Compiled and optimized joins are simply going to be faster than single selects in this scenario. Remember, ad-hoc joins may not be as fast -- you should compile them (into a stored procedure). The speed answer depends on the execution plan, which details exactly what steps the DBMS takes to retrieve the data.
Yes, one query using JOINS would be quicker. Although without knowing the relationships of the tables you are querying, the size of your dataset, or where the primary keys are, it's almost impossible to say how much faster.
Why not test both scenarios out, then you'll know for sure...

Optimizing a simple query on two large tables

I'm trying to offer a feature where I can show pages most viewed by friends. My friends table has 5.7M rows and the views table has 5.3M rows. At the moment I just want to run a query on these two tables and find the 20 most viewed page id's by a person's friend.
Here's the query as I have it now:
SELECT page_id
FROM `views` INNER JOIN `friendships` ON friendships.receiver_id = views.user_id
WHERE (`friendships`.`creator_id` = 143416)
GROUP BY page_id
ORDER BY count(views.user_id) desc
LIMIT 20
And here's how an explain looks:
+----+-------------+-------------+------+-----------------------------------------+---------------------------------+---------+-----------------------------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+------+-----------------------------------------+---------------------------------+---------+-----------------------------------------+------+----------------------------------------------+
| 1 | SIMPLE | friendships | ref | PRIMARY,index_friendships_on_creator_id | index_friendships_on_creator_id | 4 | const | 271 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | views | ref | PRIMARY | PRIMARY | 4 | friendships.receiver_id | 11 | Using index |
+----+-------------+-------------+------+-----------------------------------------+---------------------------------+---------+-----------------------------------------+------+----------------------------------------------+
The views table has a primary key of (user_id, page_id), and you can see this is being used. The friendships table has a primary key of (receiver_id, creator_id), and a secondary index of (creator_id).
If I run this query without the group by and limit, there's about 25,000 rows for this particular user - which is typical.
On the most recent real run, this query took 7 seconds to execute, which is way too long for a decent response in a web app.
One thing I'm wondering is if I should adjust the secondary index to be (creator_id, receiver_id). I'm not sure that will give much of a performance gain though. I'll likely try it today depending on answers to this question.
Can you see any way the query can be rewritten to make it lightning fast?
Update: I need to do more testing on it, but it appears my nasty query works out better if I don't do the grouping and sorting in the DB, but do it in Ruby afterwards. The overall time is much shorter - by about 80%, it seems. Perhaps my early testing was flawed - but this definitely warrants more investigation. If it's true - then wtf is MySQL doing?
As far as I know, the best way to make a query like that "lightning fast", is to create a summary table that tracks friend page views per page per creator.
You would probably want to keep it up-to-date with triggers. Then your aggregation is already done for you, and it is a simple query to get the most viewed pages. You can make sure you have proper indexes on the summary table, so that the database doesn't even have to sort to get the most viewed.
Summary tables are the key to maintaining good performance for aggregation-type queries in read-mostly environments. You do the work up-front, when the updates occur (infrequent) and then the queries (frequent) don't have to do any work.
If your stats don't have to be perfect, and your writes are actually fairly frequent (which is probably the case for something like page views), you can batch up views in memory and process them in the background, so that the friends don't have to take the hit of keeping the summary table up-to-date, as they view pages. That solution also reduces contention on the database (fewer processes updating the summary table).
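A minimal sketch of that idea, assuming the schema in the question (the table, index, and trigger names are made up, and a production version would want the batching described above rather than a synchronous trigger):

CREATE TABLE friend_page_views (
    creator_id INT NOT NULL,
    page_id    INT NOT NULL,
    view_count INT NOT NULL DEFAULT 0,
    PRIMARY KEY (creator_id, page_id),
    KEY idx_creator_count (creator_id, view_count)
);

DELIMITER //
CREATE TRIGGER views_after_insert AFTER INSERT ON views
FOR EACH ROW
BEGIN
    -- Bump the counter for everyone who is friends with the viewer
    INSERT INTO friend_page_views (creator_id, page_id, view_count)
    SELECT f.creator_id, NEW.page_id, 1
    FROM friendships f
    WHERE f.receiver_id = NEW.user_id
    ON DUPLICATE KEY UPDATE view_count = view_count + 1;
END //
DELIMITER ;

-- The report query then becomes a straight indexed read:
SELECT page_id
FROM friend_page_views
WHERE creator_id = 143416
ORDER BY view_count DESC
LIMIT 20;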
You should absolutely look into denormalizing this. If you create a separate table that maintains the user ids and the exact counts for every page they viewed, your query should become a lot simpler.
You can easily maintain this table by putting a trigger on your views table that updates the 'views_summary' table whenever an insert happens on the 'views' table.
You might even be able to denormalize this further by looking at the actual relationships, or just maintain the top x pages per person
Hope this helps,
Evert
Your indexes look correct, although if friendships has very big rows, you might want the index on (creator_id, receiver_id) so the lookup doesn't have to read the full rows.
However something's not right here, why are you doing a filesort for 271 rows?
Make sure that your MySQL has at least a few megabytes for tmp_table_size and max_heap_table_size. That should make the GROUP BY faster.
sort_buffer_size should also have a sane value.
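For example, to experiment with those settings in the current session before changing my.cnf (the values here are arbitrary starting points):

SET SESSION tmp_table_size      = 64 * 1024 * 1024;  -- 64 MB
SET SESSION max_heap_table_size = 64 * 1024 * 1024;  -- keep in sync with tmp_table_size
SET SESSION sort_buffer_size    = 4 * 1024 * 1024;   -- 4 MB

-- then re-run the query (and its EXPLAIN) to see whether the GROUP BY and filesort get cheaper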