Will normalization of a school database give better performance? - MySQL

I have an old server and a huge MySQL database of students' marks.
There is only one common table for all students:
student_id | teacher_id | mark | comment
There are six schools in this project and ~800 students; every day we get ~5000 marks.
Students have a performance problem - every query of their marks takes about two minutes to return results.
By the way, I do use table indexing.
My question is: if I normalize and create a separate table for every student, like this:
STUDENTS_TABLE
student_id | table
ivanov | ivanov_table
IVANOV_TABLE
teacher_id | mark | comment
will it give me better performance?
I have no opportunity to buy a new server.
ADD:
When I use mysql> SELECT * FROM all_students_table WHERE student_id=001 it takes too long. I think it is because the information for all students is in one huge table. And I suppose that if an individual table is created for every student, a query like this will take less time: mysql> SELECT * FROM student_001_table. Am I right?
ADD:
This table is three years old, and
mysql> SELECT COUNT(*) FROM students_marks
gives 2,453,389 rows, and it grows every day.

Since a simple query like SELECT * FROM all_students_table WHERE student_id=001 takes too long, the only sensible conclusion is that the table does not have proper indexes. A query like this needs an index on student_id. When that index is present, the query should perform almost as well for 2.5 million rows as it does for 1,000 rows (assuming each student_id appears similarly frequently in the table).
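As a minimal sketch (the table and column names here are taken from the question and may need adjusting), adding that index and checking that the optimizer uses it could look like:
-- Add a secondary index on the column used in the WHERE clause.
ALTER TABLE students_marks ADD INDEX idx_student (student_id);
-- Verify that the optimizer now uses the index instead of a full table scan.
EXPLAIN SELECT * FROM students_marks WHERE student_id = 001;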

First, what @Arjan said about indexes.
Second, from experience, you will need at least 3 tables and probably 5 tables with very precise INDEXes and PRIMARY KEYS
Schools
Teachers
Students
Student_In_School
Student_Grade (Links Teacher/Student/Grade/Comment) <- Your current table
Third, and this is counter-intuitive, performance will DECREASE, because you have to search multiple tables and link them. Normalization is NOT for performance, but rather for sanity checks, validation, and elimination of repeated information. For example, you can NOW store the actual names of the teachers/students as opposed to obscure IDs.
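A rough sketch of what that normalized schema could look like (all column names and types below are illustrative assumptions, not taken from the original table):
-- Hypothetical normalized schema sketch.
CREATE TABLE Schools  (school_id INT PRIMARY KEY, school_name VARCHAR(100));
CREATE TABLE Teachers (teacher_id INT PRIMARY KEY, teacher_name VARCHAR(100));
CREATE TABLE Students (student_id INT PRIMARY KEY, student_name VARCHAR(100));
CREATE TABLE Student_In_School (
  student_id INT,
  school_id  INT,
  PRIMARY KEY (student_id, school_id)
);
CREATE TABLE Student_Grade (      -- your current marks table
  student_id INT,
  teacher_id INT,
  mark       TINYINT,
  comment    TEXT,
  INDEX idx_student (student_id)
);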
The good news is you have very little data (despite what you think) and an old machine can handle it with proper INDEXes
Hope this helps


organize data according to the first letter it starts with, like an English dictionary [closed]

I want to organize data according to the first letter it starts with, like an English dictionary.
Option A: Store words in a single table.
The field word_id is a PK, name is a unique/indexed field, and word_code is possibly a primary column or just a regular column, but it has to be indexed.
Storing all the words in one table is an easy way to start my project, but I'm not sure whether this table will become slower as it grows.
Table: word
word_id  | word_code | name
1        | 112       | pizza
2        | 97        | alien
3        | 111       | orange
...      | 99        | candy
10000000 | 99        | cat
Option B: Create partitions in word.
Partitioning may provide an INSERT performance benefit, but it does not affect SELECT speed. According to a book, when I read/write all the partitions I've created regularly, I won't get better read performance regardless of the partition type. Some people say that creating partitions seems pointless when the table is small (no performance benefit, and maybe slower than a single table in some situations).
When would be the right time to create one? Maybe when the size goes over a million rows? I'm not even sure it is worth creating partitions in the first place.
Option C: Create 26 tables, a-z, and store each word in the corresponding table.
At first glance it looks reasonable for organizing data for the long term. On the other hand, it forces me to write more server-side code and might be difficult to maintain as well.
Table: word_starts_with_a
word_id  | name
1        | apple
2        | alert
...      | art
10000000 | angle
Table: word_starts_with_b
word_id  | name
1        | bus
2        | bear
...      | bath
10000000 | bubble
Which of these options should I use (or is there a better way to store the data efficiently) for a parent table referenced by child tables?
The parent table is subject to INSERT and SELECT operations.
A database Table is unordered. SELECTs that include an ORDER BY clause produce ordered output.
Fetch all words in alphabetical order, regardless of what order they were inserted:
SELECT name FROM t ORDER BY name;
Fetch all words beginning with "B", in arbitrary order:
SELECT name FROM t WHERE name LIKE 'B%';
Fetch all words beginning with "B", in alphabetical order:
SELECT name FROM t WHERE name LIKE 'B%' ORDER BY name;
With a suitable index each of these will take the same amount of time (very few milliseconds) from a combined table of any number of rows:
SELECT * FROM t WHERE name = 'aardvark';
SELECT * FROM t WHERE name = 'zebra';
Do have an index on name. If name is unique throughout the table, then consider making it the PRIMARY KEY -- PRIMARY KEY(name). If there can be duplicates, then word_id may be necessary.
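A minimal sketch of such a combined table (assuming name is unique and picking an arbitrary type for word_code):
-- Single combined table; the PRIMARY KEY on name makes the lookups above fast.
CREATE TABLE word (
  name      VARCHAR(64) NOT NULL,
  word_code INT NOT NULL,
  PRIMARY KEY (name),   -- or a word_id PK plus INDEX(name) if duplicates are possible
  INDEX (word_code)
) ENGINE=InnoDB;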
A thousand-row table is pretty fast simply because it is not big. A million-row table may need the suggested index (or PK). A billion-row table certainly will need it. How many "names" do you expect to collect? There are only a few million words in any language; and under 8 billion persons in the world (last I heard!).
That is, do not fear performance as the table grows; instead, look for a suitable index. A "suitable" index is a function of the query. So we need to see the important queries in order to advise in more detail.
Rule of Thumb: If you split a word list into the number of tables as there are letters, the largest table will be about 10% of the total. For example, about 10% of English words start with S. Note that that is not a lot of improvement over having all the words in a single list -- especially since the index is in a BTree structure that has about 100 child nodes for each node. That is, a billion rows has a BTree of only 5 levels. That is, my aardvark/zebra example will touch 5 disk blocks to get each of the two rows. (Caching may prevent the actual need to hit the disk.) Without a suitable index, 'zebra' would, as you fear, take seconds or even minutes.
Do not split the table into multiple tables, nor into multiple PARTITIONs. At least not until we have found a real reason to do so.

mysql optimization - copy or serialized old rows

Suppose I have a simple table with these columns:
| id | user_id | order_id |
About 1,000,000 rows are inserted into this table per month, and as you can see, the relation between user_id and order_id is 1 to M.
The records from the last month are needed for accounting purposes; the others are just for showing order histories to users. To archive the records from before the last month, I have two options in mind:
First, create a similar table and copy old records into it each month, so it will get bigger and bigger each month according to the growth of orders.
Second, create a table like this:
| id | user_id | order_ids |
and each month, for each row to be inserted into this table, if the user_id already exists, just append the new order_id to the end of order_ids.
With this solution, the number of rows in the table grows only with the number of users.
Suppose for each solution we have an index on user_id.
Now the question is: which one is more optimized for SELECTing all order_ids per user when the server is under load?
The first one has many more records than the second one, but in the second one some programming-language work is needed to split order_ids apart.
The first choice is the better choice from among the two you have shown. With respect, I should say your second choice is a terrible idea.
MySQL (like all SQL DBMSs) is excellent at handling very large numbers of rows of uniformly laid-out (that is, normalized) data.
But, your best choice is to do nothing except create appropriate indexes to make it easy to look up order history by date or by user. Leave all your data in this table and optimize lookup instead.
Until this table contains at least fifty million rows (at least four years' worth of data), the time you spend reprogramming your system to allow it to be split into a current and an archive version will be far more costly than just keeping it together.
If you want help figuring out which indexes you need, you should ask another question showing your queries. It's not clear from this question how you look up orders by date.
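For illustration only (the question's table shows no date column, so order_date below is hypothetical), the kind of indexes meant here might look like:
-- Hypothetical indexes; adjust to the real columns and the real queries.
ALTER TABLE orders
  ADD INDEX idx_user (user_id),       -- look up order history by user
  ADD INDEX idx_date (order_date);    -- look up order history by date (order_date is assumed)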
In a 1:many relationship, don't make an extra table. Instead have the user_id be a column in the Orders table. Furthermore, this is likely to help performance:
PRIMARY KEY(user_id, order_id),
INDEX(order_id)
Is a "month" a calendar month? Or "30 days ago until now"?
If it is a calendar month, consider PARTITION BY RANGE(TO_DAYS(datetime)) and have an ever-increasing list of monthly partitions. However, do not create future months in advance; create them just before they are needed. More details: http://mysql.rjweb.org/doc.php/partitionmaint
Note: This would require adding datetime to the end of the PK.
At 4 years' worth of data (48 partitions), it will be time to rethink things. (I recommend not going much beyond that number of partitions.)
Read about "transportable tablespaces". This may become part of your "archiving" process.
Use InnoDB.
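Putting those points together, a sketch of the table might look like this (column types and the `datetime` column name are assumptions; add new monthly partitions just before they are needed, as described above):
-- Sketch only: order history partitioned by calendar month.
CREATE TABLE orders (
  user_id    INT NOT NULL,
  order_id   INT NOT NULL,
  `datetime` DATETIME NOT NULL,
  PRIMARY KEY (user_id, order_id, `datetime`),
  INDEX (order_id)
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(`datetime`)) (
  PARTITION p201711 VALUES LESS THAN (TO_DAYS('2017-12-01')),
  PARTITION p201712 VALUES LESS THAN (TO_DAYS('2018-01-01'))
  -- create the next month's partition just before it is needed
);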
With that partitioning, either of these becomes reasonably efficient:
WHERE user_id = 123
AND datetime > CURDATE() - INTERVAL 30 DAY
WHERE user_id = 123
AND datetime >= '2017-11-01' -- or whichever start-of-month you need
Each of the above will hit at most one non-empty partition more than the number of months desired.
If you want to discuss this more, please provide SHOW CREATE TABLE (in any variation), plus some of the important SELECTs.

If your table has more selects than inserts, are indexes always beneficial?

I have a mysql innodb table where I'm performing a lot of selects using different columns. I thought that adding an index on each of those fields could help performance, but after reading a bit on indexes I'm not sure if adding an index on a column you select on always helps.
I have far more selects than inserts/updates happening in my case.
My table 'students' looks like:
id | student_name | nickname | team | time_joined_school | honor_roll
and I have the following queries:
# The team column is varchar(32), and only has about 20 different values.
# The honor_roll field is a smallint and is only either 0 or 1.
1. select * from students where team = '?' and honor_roll = ?;
# The student_name field is varchar(32).
2. select * from students where student_name = '?';
# The nickname field is varchar(64).
3. select * from students where nickname like '%?%';
all the results are ordered by time_joined_school, which is a bigint(20).
So I was just going to add an index on each of the columns, does that make sense in this scenario?
Thanks
Indexes help the database more efficiently find the data you're looking for. Which is to say you don't need an index simply because you're selecting a given column, but instead you (generally) need an index for columns you're selecting based on - i.e. using a WHERE clause (even if you don't end up including the searched column in your result).
Broadly, this means you should have indexes on columns that segregate your data in logical ways, and not on extraneous, purely informative columns. Before even looking at your specific queries, all of these columns seem like reasonable candidates for indexing, since you could reasonably construct queries around them. Examples of columns that would make less sense would be things like phone_number, address, or student_notes - you could index such columns, but generally you don't need or want to.
Specifically based on your queries, you'll want student_name, team, and honor_roll to be indexed, since you're defining WHERE conditions based on the values of these columns. You'll also benefit from indexing time_joined_school if, as you suggest, you're ORDER BYing your queries based on that column. Your LIKE query is not actually easy for most RDBs to handle, and indexing nickname won't help. Check out How to speed up SELECT .. LIKE queries in MySQL on multiple columns? for more.
Note also that the ratio of SELECT to INSERT is not terribly relevant for deciding whether to use an index or not. Even if you only populate the table once, and it's read-only from that point on, SELECTs will run faster if you index the correct columns.
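A sketch of the indexes this suggests (table and column names match the question; the composite-index column order is one reasonable choice, not the only one):
-- Composite index for query 1, plus indexes for query 2 and the ORDER BY column.
ALTER TABLE students
  ADD INDEX idx_team_honor (team, honor_roll),
  ADD INDEX idx_name (student_name),
  ADD INDEX idx_joined (time_joined_school);
-- An index on nickname will not help LIKE '%...%' because of the leading wildcard.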
Yes, indexes help accelerate your queries.
In your case you should have indexes on:
1) team and honor_roll from query 1 (a single index with 2 fields)
2) student_name
3) time_joined_school, for the ORDER BY
For query 3 you can't use an index because of the LIKE statement with a leading wildcard. Hope this helps.

JOIN queries vs multiple queries

Are JOIN queries faster than several queries? (You run your main query, and then you run many other SELECTs based on the results from your main query)
I'm asking because JOINing them would complicate A LOT the design of my application
If they are faster, can anyone approximate very roughly by how much? If it's 1.5x I don't care, but if it's 10x I guess I do.
For inner joins, a single query makes sense, since you only get matching rows.
For left joins, multiple queries are much better... look at the following benchmark I did:
Single query with 5 Joins
query: 8.074508 seconds
result size: 2268000
5 queries in a row
combined query time: 0.00262 seconds
result size: 165 (6 + 50 + 7 + 12 + 90)
Note that we get the same results in both cases (6 x 50 x 7 x 12 x 90 = 2268000)
Left joins use exponentially more memory because of the redundant data.
The memory hit might not be as bad if you only join two tables, but with three or more it generally becomes worth running separate queries.
As a side note, my MySQL server is right beside my application server... so connection time is negligible. If your connection time is in the seconds, then maybe there is a benefit
Frank
This is way too vague to give you an answer relevant to your specific case. It depends on a lot of things. Jeff Atwood (founder of this site) actually wrote about this. For the most part, though, if you have the right indexes and you properly do your JOINs it is usually going to be faster to do 1 trip than several.
This question is old, but is missing some benchmarks. I benchmarked JOIN against its 2 competitors:
N+1 queries
2 queries, the second one using a WHERE IN(...) or equivalent
The result is clear: on MySQL, JOIN is much faster. N+1 queries can drop the performance of an application drastically:
That is, unless you select a lot of records that point to a very small number of distinct, foreign records. Here is a benchmark for the extreme case:
This is very unlikely to happen in a typical application, unless you're joining on a *-to-many relationship, in which case the foreign key is on the other table and you're duplicating the main table's data many times.
Takeaway:
For *-to-one relationships, always use JOIN
For *-to-many relationships, a second query might be faster
See my article on Medium for more information.
I actually came to this question looking for an answer myself, and after reading the given answers I can only agree that the best way to compare DB query performance is to get real-world numbers, because there are just too many variables to take into account. BUT I also think that comparing the numbers only against each other leads to no good in almost all cases. What I mean is that the numbers should always be compared with an acceptable target, and definitely not just with each other.
I can understand if one way of querying takes, say, 0.02 seconds and the other one takes 20 seconds: that's an enormous difference. But what if one way takes 0.0000000002 seconds and the other one takes 0.0000002 seconds? In both cases one way is a whopping 1000 times faster than the other, but is it really still "whopping" in the second case?
Bottom line as I personally see it: if it performs well, go for the easy solution.
The real question is: Do these records have a one-to-one relationship or a one-to-many relationship?
TLDR Answer:
If one-to-one, use a JOIN statement.
If one-to-many, use one (or many) SELECT statements with server-side code optimization.
Why and How To Use SELECT for Optimization
SELECTing (with multiple queries instead of joins) a large group of records based on a one-to-many relationship gives optimal efficiency, because JOINing multiplies the rows and floods the result set with redundant data. Grab all of the data, then use a server-side scripting language to sort it out:
SELECT * FROM Address WHERE Personid IN(1,2,3);
Results:
Address.id : 1 // First person and their address
Address.Personid : 1
Address.City : "Boston"
Address.id : 2 // First person's second address
Address.Personid : 1
Address.City : "New York"
Address.id : 3 // Second person's address
Address.Personid : 2
Address.City : "Barcelona"
Here, I am getting all of the records, in one select statement. This is better than JOIN, which would be getting a small group of these records, one at a time, as a sub-component of another query. Then I parse it with server-side code that looks something like...
<?php
// Group each address row under its owner, keyed by the Personid foreign key.
foreach ($addresses as $address) {
    $persons[$address['Personid']]->Address[] = $address;
}
?>
When Not To Use JOIN for Optimization
JOIN'ing a large group of records based on a one-to-one relationship with one single record produces an optimal efficiency compared to multiple SELECT statements, one after the other, which simply get the next record type.
But JOIN is inefficient when getting records with a one-to-many relationship.
Example: The database Blogs has 3 tables of interest, Blogpost, Tag, and Comment.
SELECT * from BlogPost
LEFT JOIN Tag ON Tag.BlogPostid = BlogPost.id
LEFT JOIN Comment ON Comment.BlogPostid = BlogPost.id;
If there is 1 blogpost, 2 tags, and 2 comments, you will get results like:
Row1: tag1, comment1,
Row2: tag1, comment2,
Row3: tag2, comment1,
Row4: tag2, comment2,
Notice how each record is duplicated. Okay, so, 2 comments and 2 tags is 4 rows. What if we have 4 comments and 4 tags? You don't get 8 rows -- you get 16 rows:
Row1: tag1, comment1,
Row2: tag1, comment2,
Row3: tag1, comment3,
Row4: tag1, comment4,
Row5: tag2, comment1,
Row6: tag2, comment2,
Row7: tag2, comment3,
Row8: tag2, comment4,
Row9: tag3, comment1,
Row10: tag3, comment2,
Row11: tag3, comment3,
Row12: tag3, comment4,
Row13: tag4, comment1,
Row14: tag4, comment2,
Row15: tag4, comment3,
Row16: tag4, comment4,
Add more tables, more records, etc., and the problem will quickly inflate to hundreds of rows that are all full of mostly redundant data.
What do these duplicates cost you? Memory (in the SQL server and the code that tries to remove the duplicates) and networking resources (between SQL server and your code server).
Source: https://dev.mysql.com/doc/refman/8.0/en/nested-join-optimization.html ; https://dev.mysql.com/doc/workbench/en/wb-relationship-tools.html
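As a sketch of the alternative this answer argues for, the same blog data could be fetched without the row blow-up by running one query per child table (table and column names follow the example above; the post id 1 is just an illustration):
-- One query per one-to-many child table: no tag x comment combinations.
SELECT * FROM BlogPost WHERE id = 1;
SELECT * FROM Tag     WHERE BlogPostid = 1;   -- 4 rows for 4 tags
SELECT * FROM Comment WHERE BlogPostid = 1;   -- 4 rows for 4 comments
-- 1 + 4 + 4 = 9 rows total, instead of the 16 joined rows shown above.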
Did a quick test selecting one row from a 50,000 row table and joining with one row from a 100,000 row table. Basically looked like:
$id = mt_rand(1, 50000);
$row = $db->fetchOne("SELECT * FROM table1 WHERE id = " . $id);
$row = $db->fetchOne("SELECT * FROM table2 WHERE other_id = " . $row['other_id']);
vs
$id = mt_rand(1, 50000);
$db->fetchOne("SELECT table1.*, table2.*
FROM table1
LEFT JOIN table2 ON table1.other_id = table2.other_id
WHERE table1.id = " . $id);
The two-select method took 3.7 seconds for 50,000 reads, whereas the JOIN took 2.0 seconds on my slow at-home computer. INNER JOIN and LEFT JOIN did not make a difference. Fetching multiple rows (e.g., using IN (...)) yielded similar results.
Construct both separate queries and joins, then time each of them -- nothing helps more than real-world numbers.
Then, even better -- add "EXPLAIN" to the beginning of each query. This shows the execution plan MySQL will use to answer your request for data: how the tables are joined and roughly how many rows are examined at each step.
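For example (the table and column names here are placeholders):
-- Prefix any SELECT with EXPLAIN to see the plan instead of the rows.
EXPLAIN SELECT orders.*, users.username
FROM orders
JOIN users ON users.id = orders.user_id
WHERE orders.id = 42;
-- Look at the "type", "key", and "rows" columns: "ref"/"eq_ref" with a small
-- row estimate is good; "ALL" means a full table scan.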
Depending on the complexity of the database compared to developer complexity, it may be simpler to do many SELECT calls.
Try running some database statistics against both the JOIN and the multiple SELECTS. See if in your environment the JOIN is faster/slower than the SELECT.
Then again, if changing it to a JOIN would mean an extra day/week/month of dev work, I'd stick with multiple SELECTs
Cheers,
BLT
In my experience I have found it's usually faster to run several queries, especially when retrieving large data sets.
When interacting with the database from another application, such as PHP, there is the argument of one trip to the server over many.
There are other ways to limit the number of trips made to the server and still run multiple queries that are often not only faster but also make the application easier to read - for example mysqli_multi_query.
I'm no novice when it comes to SQL, and I think there is a tendency for developers, especially juniors, to spend a lot of time trying to write very clever joins because they look smart, whereas there are actually smart ways to extract data that look simple.
The last paragraph was a personal opinion, but I hope this helps. I do agree with the others though who say you should benchmark. Neither approach is a silver bullet.
Whether you should use a join is first and foremost about whether a join makes sense. Only at that point is performance even something to be considered, as nearly all other cases will result in significantly worse performance.
Performance differences will largely be tied to how related the info you're querying for is. Joins work, and they're fast when the data is related and you index stuff correctly, but they do often result in some redundancy and sometimes more results than needed. And if your data sets are not directly related, sticking them in a single query will result in what's called a Cartesian product (basically, all possible combinations of rows), which is almost never what you want.
This is often caused by many-to-one-to-many relationships. For example, HoldOffHunger's answer mentioned a single query for posts, tags, and comments. Comments are related to a post, as are tags...but tags are unrelated to comments.
+------------+ +---------+ +---------+
| comment | | post | | tag |
|------------|* 1|---------|1 *|---------|
| post_id |-----| post_id |-----| post_id |
| comment_id | | ... | | tag_id |
| user_id | | | | ... |
| ... | | | | ... |
+------------+ +---------+ +---------+
In this case, it is unambiguously better for this to be at least two separate queries. If you try to join tags and comments, because there's no direct relation between the two, you end up with every possible combination of tag and comment. many * many == manymany. Aside from that, since posts and tags are unrelated, you can do those two queries in parallel, leading to potential gain.
Let's consider a different scenario, though: You want the comments attached to a post, and the commenters' contact info.
+----------+ +------------+ +---------+
| user | | comment | | post |
|----------|1 *|------------|* 1|---------|
| user_id |-----| post_id |-----| post_id |
| username | | user_id | | ... |
| ... | | ... | +---------+
+----------+ +------------+
This is where you should consider a join. Aside from being a much more natural query, most database systems (including MySQL) have lots of smart people put lots of hard work into optimizing queries just like it. For separate queries, since each query depends on the results of the previous one, the queries can't be done in parallel, and the total time becomes not just the actual execute time of the queries, but also the time spent fetching results, sifting through them for IDs for the next query, linking rows together, etc.
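A sketch of that natural single-round-trip query, using the column names from the diagram above:
-- Comments on one post, each row carrying the commenter's contact info.
SELECT comment.*, user.username
FROM comment
JOIN user ON user.user_id = comment.user_id
WHERE comment.post_id = 42;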
Will it be faster in terms of throughput? Probably. But it also potentially locks more database objects at a time (depending on your database and your schema) and thereby decreases concurrency. In my experience people are often misled by the "fewer database round-trips" argument when in reality, on most OLTP systems where the database is on the same LAN, the real bottleneck is rarely the network.
Here is a link with 100 useful queries. These are tested on an Oracle database, but remember SQL is a standard; what differs between Oracle, MS SQL Server, MySQL, and other databases is the SQL dialect:
http://javaforlearn.com/100-sql-queries-learn/
There are several factors at play, which means there is no binary answer. What is best for performance depends on your environment. By the way, if your single select by an identifier is not sub-second, something may be wrong with your configuration.
The real question to ask is how do you want to access the data. Single selects support late-binding. For example if you only want employee information, you can select from the Employees table. The foreign key relationships can be used to retrieve related resources at a later time and as needed. The selects will already have a key to point to so they should be extremely fast, and you only have to retrieve what you need. Network latency must always be taken into account.
Joins will retrieve all of the data at once. If you are generating a report or populating a grid, this may be exactly what you want. Compiled and optimized joins are simply going to be faster than single selects in this scenario. Remember, ad-hoc joins may not be as fast -- you should compile them (into a stored proc). The speed answer depends on the execution plan, which details exactly what steps the DBMS takes to retrieve the data.
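A minimal sketch of "compiling" such a join into a stored procedure (the Orders table and all column names here are assumptions for illustration; only Employees is mentioned above):
-- Hypothetical: wrap a report-style join in a stored procedure.
DELIMITER //
CREATE PROCEDURE employee_report(IN p_employee_id INT)
BEGIN
  SELECT e.*, o.order_id, o.order_date
  FROM Employees e
  JOIN Orders o ON o.employee_id = e.employee_id
  WHERE e.employee_id = p_employee_id;
END //
DELIMITER ;
-- Usage:
CALL employee_report(42);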
Yes, one query using JOINs would be quicker. Although, without knowing the relationships of the tables you are querying, the size of your dataset, or where the primary keys are, it's almost impossible to say how much faster.
Why not test both scenarios out, then you'll know for sure...