I have an old server and a huge MySQL database of students' marks.
There is only one common table for all students:
student_id | teacher_id | mark | comment
There are six schools in this project and ~800 students; every day we add ~5000 marks.
Students have a performance problem: every query for their marks takes about two minutes to return results.
By the way, I use table indexing.
My question is: if I use normalization and make a separate table for every student, like this:
STUDENTS_TABLE
student_id | table
ivanov | ivanov_table
IVANOV_TABLE
teacher_id | mark | comment
will it help me get better performance?
I have no opportunity to buy a new server.
ADD:
When I use mysql> SELECT * FROM all_students_table where student_id=001 it takes too long. I think this is because the information for all students is in one huge table. I suppose that if an individual table were created for every student, a query like mysql> SELECT * FROM student_001_table would take less time. Am I right?
ADD:
This table is three years old and
mysql> SELECT COUNT(*) FROM students_marks
gives a result of 2 453 389 rows, and it grows every day.
Since a simple query like SELECT * FROM all_students_table where student_id=001 takes too long, the only sensible conclusion is that the table does not have proper indexes. A query like this needs an index on student_id. When that index is present, the query should perform almost as well for 2.5 million rows as it does for 1,000 rows (assuming each student_id appears similarly frequently in the table).
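To make the effect of the index concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption made only so the example is self-contained), and the index name idx_marks_student and the generated data are hypothetical. In MySQL you would create the index with CREATE INDEX or ALTER TABLE ... ADD INDEX and inspect the plan with EXPLAIN.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students_marks ("
    "student_id INTEGER, teacher_id INTEGER, mark INTEGER, comment TEXT)"
)

# Hypothetical data: 800 students with 30 marks each.
rows = [(s, s % 40, 3 + (s + d) % 3, "") for s in range(1, 801) for d in range(30)]
conn.executemany("INSERT INTO students_marks VALUES (?, ?, ?, ?)", rows)


def plan(sql):
    """Return the human-readable plan steps SQLite reports for a query."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


query = "SELECT * FROM students_marks WHERE student_id = 1"

plan_before = plan(query)  # no index yet: the whole table is scanned
conn.execute("CREATE INDEX idx_marks_student ON students_marks (student_id)")
plan_after = plan(query)   # with the index: a direct search on student_id

print(plan_before)
print(plan_after)
```

If the plan still shows a full scan after the index exists, the index is probably on the wrong column or the query compares the column in a way that prevents its use.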
First, what @Arjan said about indexes.
Second, from experience, you will need at least 3 tables and probably 5 tables with very precise INDEXes and PRIMARY KEYS
Schools
Teachers
Students
Student_In_School
Student_Grade (Links Teacher/Student/Grade/Comment) <- Your current table
Third, and this is counterintuitive, performance will DECREASE, because you have to search multiple tables and link them. Normalization is NOT for performance, but rather for sanity checks, validation, and elimination of repeated information. For example, you can NOW have the actual names of the teachers/students as opposed to obscure IDs.
The good news is you have very little data (despite what you think) and an old machine can handle it with proper INDEXes
Hope this helps
I would like to ask whether there is any performance advantage of one over the other.
Here is an example:
-- suppose I want to retrieve 10000 different records
-- (`from` is a reserved word, so it is quoted with backticks)
select *
from table_a
where `from` in (1,2,3,4,5,6 .... 10000)
-- alternatively
select *
from table_a
where `from`=1 or `from`=2 or `from`=3 ... or `from`=10000
compared to
select * from table_a where `from`=1
select * from table_a where `from`=2
select * from table_a where `from`=3
.
.
select * from table_a where `from`=10000
In which scenarios will one outperform the other?
The WHERE clause is simplified here, it may have nested AND and OR clauses.
There are many factors involved beyond your simple example.
For your exact example, one query is better than thousands, because the example is simple and filters on one field.
The main factors are network I/O operations, physical and/or logical reads, and so on.
But if you have more WHERE conditions, especially when there are joins, it can be questionable which is better.
It also depends on the actual DB tables, relationships, index design, types of joins, size of tables, and so on.
As a general direction, in most cases one SQL statement is better, but other factors can be much more important than that.
It all starts with very careful database design. Mistakes there (which happen quite often) cost a lot later.
Usually, many separate queries are better only when the database was designed badly.
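As a rough illustration of the two approaches, the sketch below fetches the same rows once with a single IN list and once with one statement per value. It uses Python's sqlite3 module as a stand-in for MySQL, and the column names id and payload are hypothetical (the question's column is called from, which is a reserved word). It only shows that both approaches return identical rows; which one is faster has to be measured on your own server, where every extra statement also costs a network round trip.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_a (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO table_a VALUES (?, ?)",
    [(i, f"row {i}") for i in range(1, 1001)],
)

wanted = list(range(1, 501))  # the ids we want to retrieve

# Approach 1: a single statement with an IN list (one round trip).
placeholders = ",".join("?" * len(wanted))
one_query = conn.execute(
    f"SELECT * FROM table_a WHERE id IN ({placeholders}) ORDER BY id", wanted
).fetchall()

# Approach 2: one statement per id (one round trip each on a real server).
many_queries = []
for i in wanted:
    many_queries.extend(
        conn.execute("SELECT * FROM table_a WHERE id = ?", (i,)).fetchall()
    )

print(len(one_query), len(many_queries))
```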
I'm writing a script that generates a report based on a query that uses several tables joined together. One of the inputs to the script is going to be a list of the fields that are required on the report. Depending on the fields requested, some of the tables might not be needed. My question is: is there a [significant] performance penalty for including a join when it is not referenced in a SELECT or WHERE clause?
Consider the following tables:
mysql> SELECT * FROM `Books`;
+----------------------+----------+
| title                | authorId |
+----------------------+----------+
| Animal Farm          |        3 |
| Brave New World      |        2 |
| Fahrenheit 451       |        1 |
| Nineteen Eighty-Four |        3 |
+----------------------+----------+
mysql> SELECT * FROM `Authors`;
+----+----------+-----------+
| id | lastName | firstName |
+----+----------+-----------+
|  1 | Bradbury | Ray       |
|  2 | Huxley   | Aldous    |
|  3 | Orwell   | George    |
+----+----------+-----------+
Does
SELECT
`Authors`.`lastName`
FROM
`Authors`
WHERE
`Authors`.`id` = 1
Outperform:
SELECT
`Authors`.`lastName`
FROM
`Authors`
JOIN
`Books`
ON `Authors`.`id` = `Books`.`authorId`
WHERE
`Authors`.`id` = 1
?
It seems to me that MySQL should just know to ignore the JOIN completely, since the table is not referenced in the SELECT or WHERE clause. But somehow I doubt this is the case. Of course, this is a really basic example. The actual data involved will be much more complex.
And really, it's not a terribly huge deal... I just need to know if my script needs to be "smart" about the joins, and only include them if the fields requested will rely on them.
This join isn't actually unused, since it means that only Authors that exist in Books are included in the result set:
JOIN
`Books`
ON `Authors`.`id` = `Books`.`authorId`
However, if you "knew" that every Author existed in Books, then there would be some performance benefit in removing the join, but it would largely depend on the indexes, the number of records in the tables, and the logic in the join (especially when doing data conversions).
This is the kind of question that is impossible to answer. Yes, adding the join will take additional time; it's impossible to tell whether you'd be able to measure that time without, well, uh....measuring the time.
Broadly speaking, if - like in your example - you're joining on primary keys, with unique indices, it's unlikely to make a measurable difference.
If you've got more complex joins (which you hint at), or are joining on fields without an index, or if your join involves a function, the performance penalty may be significant.
Of course, it may still be easier to do it this way than to write multiple queries which are essentially the same, other than removing unneeded joins.
Final bit of advice - try abstracting the queries into views. That way, you can optimize performance once, and perhaps write your report queries in a simpler way...
Joins will always take time.
Side effects
On top of that, an inner join (which is the default join) influences the result by limiting the number of rows you get.
So depending on whether all authors are in books the two queries may or may not be identical.
Also if an author has written more than one book the resultset of the 'joined' query will show duplicate results.
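Both side effects can be seen with the Books and Authors data from the question. The sketch below uses Python's sqlite3 module as a stand-in for MySQL, and author 4 ("Nobody") is a hypothetical extra row added only to show an author without books. The helper last_names is likewise made up for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Authors (id INTEGER PRIMARY KEY, lastName TEXT, firstName TEXT);
CREATE TABLE Books (title TEXT, authorId INTEGER);
INSERT INTO Authors VALUES
    (1, 'Bradbury', 'Ray'), (2, 'Huxley', 'Aldous'), (3, 'Orwell', 'George');
-- Hypothetical author with no books, not part of the original data.
INSERT INTO Authors VALUES (4, 'Nobody', 'Ann');
INSERT INTO Books VALUES
    ('Animal Farm', 3), ('Brave New World', 2),
    ('Fahrenheit 451', 1), ('Nineteen Eighty-Four', 3);
""")


def last_names(author_id, with_join):
    """Run the question's query for one author, with or without the JOIN."""
    sql = "SELECT Authors.lastName FROM Authors "
    if with_join:
        sql += "JOIN Books ON Authors.id = Books.authorId "
    sql += "WHERE Authors.id = ?"
    return [row[0] for row in conn.execute(sql, (author_id,))]


print(last_names(1, with_join=True))   # one book: same result as without the join
print(last_names(3, with_join=True))   # two books: the name is duplicated
print(last_names(4, with_join=True))   # no books: the author disappears
print(last_names(4, with_join=False))  # without the join the author is found
```

So a report script can only drop the join safely when it knows the join neither filters nor multiplies the rows it cares about.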
Performance
In the WHERE clause you have stated Authors.id to be a constant (= 1), therefore (provided you have indexes on Authors.id and Books.authorId) it will be a very fast lookup for both tables. The query time of the two queries will be very close.
In general joins can take quite a lot of time though and with all the added side effects should only be undertaken if you really want to use the extra info the join offers.
It seems that there are two things that you are trying to determine: If there are any optimizations that can be done between the two select statements, and which of the two would be the fastest to execute.
It seems that since the join really is limiting the returned results by authors who have books in the list, that there can not be that much optimization done.
It also seems that for the case that you were describing where the joined table really has no limiting effect on the returned results, that the query where there was no joining of the tables would perform faster.
Are JOIN queries faster than several queries? (You run your main query, and then you run many other SELECTs based on the results from your main query)
I'm asking because JOINing them would complicate the design of my application A LOT.
If they are faster, can anyone approximate very roughly by how much? If it's 1.5x I don't care, but if it's 10x I guess I do.
For inner joins, a single query makes sense, since you only get matching rows.
For left joins, multiple queries is much better... look at the following benchmark I did:
Single query with 5 Joins
query: 8.074508 seconds
result size: 2268000
5 queries in a row
combined query time: 0.00262 seconds
result size: 165 (6 + 50 + 7 + 12 + 90)
Note that we get the same results in both cases (6 x 50 x 7 x 12 x 90 = 2268000)
Left joins across several independent tables multiply the row counts together, so the result uses far more memory and is mostly redundant data.
The memory limit might not be as bad if you only do a join of two tables, but generally three or more and it becomes worth different queries.
As a side note, my MySQL server is right beside my application server... so connection time is negligible. If your connection time is in the seconds, then maybe there is a benefit
Frank
This is way too vague to give you an answer relevant to your specific case. It depends on a lot of things. Jeff Atwood (founder of this site) actually wrote about this. For the most part, though, if you have the right indexes and you properly do your JOINs it is usually going to be faster to do 1 trip than several.
This question is old, but is missing some benchmarks. I benchmarked JOIN against its 2 competitors:
N+1 queries
2 queries, the second one using a WHERE IN(...) or equivalent
The result is clear: on MySQL, JOIN is much faster. N+1 queries can drop the performance of an application drastically:
That is, unless you select a lot of records that point to a very small number of distinct, foreign records. Here is a benchmark for the extreme case:
This is very unlikely to happen in a typical application, unless you're joining a *-to-many relationship, in which case the foreign key is on the other table, and you're duplicating the main table data many times.
Takeaway:
For *-to-one relationships, always use JOIN
For *-to-many relationships, a second query might be faster
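For readers unfamiliar with the three strategies being compared, here is a minimal sketch of each for a *-to-one relationship (each book points to one author). It uses Python's sqlite3 module as a stand-in for MySQL and borrows the Books/Authors tables from an earlier question in this collection; the function names are hypothetical. It does not reproduce the benchmark, it only shows what each strategy does and that all three yield the same data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Authors (id INTEGER PRIMARY KEY, lastName TEXT);
CREATE TABLE Books (title TEXT, authorId INTEGER);
INSERT INTO Authors VALUES (1, 'Bradbury'), (2, 'Huxley'), (3, 'Orwell');
INSERT INTO Books VALUES
    ('Animal Farm', 3), ('Brave New World', 2),
    ('Fahrenheit 451', 1), ('Nineteen Eighty-Four', 3);
""")

BOOKS_SQL = "SELECT title, authorId FROM Books ORDER BY title"


def n_plus_one():
    """1 query for the books, then 1 extra query per book for its author."""
    result = []
    for title, author_id in conn.execute(BOOKS_SQL).fetchall():
        (last_name,) = conn.execute(
            "SELECT lastName FROM Authors WHERE id = ?", (author_id,)
        ).fetchone()
        result.append((title, last_name))
    return result


def two_queries():
    """1 query for the books, 1 query with WHERE IN for all their authors."""
    books = conn.execute(BOOKS_SQL).fetchall()
    ids = sorted({author_id for _, author_id in books})
    marks = ",".join("?" * len(ids))
    names = dict(
        conn.execute(f"SELECT id, lastName FROM Authors WHERE id IN ({marks})", ids)
    )
    return [(title, names[author_id]) for title, author_id in books]


def single_join():
    """Everything in one statement."""
    return conn.execute(
        "SELECT Books.title, Authors.lastName FROM Books "
        "JOIN Authors ON Authors.id = Books.authorId ORDER BY Books.title"
    ).fetchall()


print(single_join())
```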
See my article on Medium for more information.
I actually came to this question looking for an answer myself, and after reading the given answers I can only agree that the best way to compare the performance of DB queries is to get real-world numbers, because there are just too many variables to take into account. BUT I also think that comparing the numbers with each other does no good in almost all cases. What I mean is that the numbers should always be compared with an acceptable number, and definitely not with each other.
I can understand it if one way of querying takes, say, 0.02 seconds and the other one takes 20 seconds; that's an enormous difference. But what if one way of querying takes 0.0000000002 seconds, and the other one takes 0.0000002 seconds? In both cases one way is a whopping 1000 times faster than the other one, but is it really still "whopping" in the second case?
Bottom line as I personally see it: if it performs well, go for the easy solution.
The real question is: Do these records have a one-to-one relationship or a one-to-many relationship?
TLDR Answer:
If one-to-one, use a JOIN statement.
If one-to-many, use one (or many) SELECT statements with server-side code optimization.
Why and How To Use SELECT for Optimization
SELECT'ing (with multiple queries instead of joins) on a large group of records based on a one-to-many relationship is more efficient, as JOIN'ing several one-to-many relationships multiplies the number of rows returned and the memory needed to hold them. Grab all of the data, then use a server-side scripting language to sort it out:
SELECT * FROM Address WHERE Personid IN(1,2,3);
Results:
Address.id : 1 // First person and their address
Address.Personid : 1
Address.City : "Boston"
Address.id : 2 // First person's second address
Address.Personid : 1
Address.City : "New York"
Address.id : 3 // Second person's address
Address.Personid : 2
Address.City : "Barcelona"
Here, I am getting all of the records, in one select statement. This is better than JOIN, which would be getting a small group of these records, one at a time, as a sub-component of another query. Then I parse it with server-side code that looks something like...
<?php
foreach($addresses as $address) {
$persons[$address['Personid']]->Address[] = $address;
}
?>
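The same grouping step can be sketched outside PHP as well. The Python version below is only an illustration: the addresses list hard-codes the three result rows shown above, and group_by_person is a hypothetical helper name.

```python
from collections import defaultdict

# The three rows returned by the SELECT ... WHERE Personid IN (1,2,3) above.
addresses = [
    {"id": 1, "Personid": 1, "City": "Boston"},
    {"id": 2, "Personid": 1, "City": "New York"},
    {"id": 3, "Personid": 2, "City": "Barcelona"},
]


def group_by_person(rows):
    """Attach each address row to its person id in a single pass."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["Personid"]].append(row)
    return dict(grouped)


addresses_by_person = group_by_person(addresses)
for person_id, rows in addresses_by_person.items():
    print(person_id, [row["City"] for row in rows])
```

Person 3 has no address rows, so it simply does not appear in the grouped result; the calling code has to treat a missing key as "no addresses".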
When Not To Use JOIN for Optimization
JOIN'ing a large group of records based on a one-to-one relationship with one single record produces an optimal efficiency compared to multiple SELECT statements, one after the other, which simply get the next record type.
But JOIN is inefficient when getting records with a one-to-many relationship.
Example: The database Blogs has 3 tables of interest: BlogPost, Tag, and Comment.
SELECT * from BlogPost
LEFT JOIN Tag ON Tag.BlogPostid = BlogPost.id
LEFT JOIN Comment ON Comment.BlogPostid = BlogPost.id;
If there is 1 blogpost, 2 tags, and 2 comments, you will get results like:
Row1: tag1, comment1,
Row2: tag1, comment2,
Row3: tag2, comment1,
Row4: tag2, comment2,
Notice how each record is duplicated. Okay, so, 2 comments and 2 tags is 4 rows. What if we have 4 comments and 4 tags? You don't get 8 rows -- you get 16 rows:
Row1: tag1, comment1,
Row2: tag1, comment2,
Row3: tag1, comment3,
Row4: tag1, comment4,
Row5: tag2, comment1,
Row6: tag2, comment2,
Row7: tag2, comment3,
Row8: tag2, comment4,
Row9: tag3, comment1,
Row10: tag3, comment2,
Row11: tag3, comment3,
Row12: tag3, comment4,
Row13: tag4, comment1,
Row14: tag4, comment2,
Row15: tag4, comment3,
Row16: tag4, comment4,
Add more tables, more records, etc., and the problem will quickly inflate to hundreds of rows that are all full of mostly redundant data.
What do these duplicates cost you? Memory (in the SQL server and the code that tries to remove the duplicates) and networking resources (between SQL server and your code server).
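The row multiplication is easy to reproduce. The sketch below uses Python's sqlite3 module as a stand-in for MySQL, with the BlogPost, Tag, and Comment tables from the example; the columns title, name, and body are hypothetical. With 4 tags and 4 comments, the double LEFT JOIN returns 16 rows, while two separate SELECTs return 8 rows in total.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE BlogPost (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE Tag (id INTEGER PRIMARY KEY, BlogPostid INTEGER, name TEXT);
CREATE TABLE Comment (id INTEGER PRIMARY KEY, BlogPostid INTEGER, body TEXT);
INSERT INTO BlogPost VALUES (1, 'post1');
""")
conn.executemany(
    "INSERT INTO Tag (BlogPostid, name) VALUES (1, ?)",
    [(f"tag{i}",) for i in range(1, 5)],
)
conn.executemany(
    "INSERT INTO Comment (BlogPostid, body) VALUES (1, ?)",
    [(f"comment{i}",) for i in range(1, 5)],
)

# One query: every tag is paired with every comment of the same post.
joined = conn.execute("""
    SELECT Tag.name, Comment.body
    FROM BlogPost
    LEFT JOIN Tag ON Tag.BlogPostid = BlogPost.id
    LEFT JOIN Comment ON Comment.BlogPostid = BlogPost.id
    WHERE BlogPost.id = 1
""").fetchall()

# Two queries: each child table is read once, with no pairing.
tags = conn.execute("SELECT name FROM Tag WHERE BlogPostid = 1").fetchall()
comments = conn.execute("SELECT body FROM Comment WHERE BlogPostid = 1").fetchall()

print(len(joined), len(tags) + len(comments))  # 16 8
```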
Source: https://dev.mysql.com/doc/refman/8.0/en/nested-join-optimization.html ; https://dev.mysql.com/doc/workbench/en/wb-relationship-tools.html
Did a quick test selecting one row from a 50,000 row table and joining with one row from a 100,000 row table. Basically looked like:
$id = mt_rand(1, 50000);
$row = $db->fetchOne("SELECT * FROM table1 WHERE id = " . $id);
$row = $db->fetchOne("SELECT * FROM table2 WHERE other_id = " . $row['other_id']);
vs
$id = mt_rand(1, 50000);
$db->fetchOne("SELECT table1.*, table2.*
FROM table1
LEFT JOIN table2 ON table1.other_id = table2.other_id
WHERE table1.id = " . $id);
The two-SELECT method took 3.7 seconds for 50,000 reads, whereas the JOIN took 2.0 seconds on my slow at-home computer. INNER JOIN and LEFT JOIN did not make a difference. Fetching multiple rows (e.g., using IN SET) yielded similar results.
Construct both separate queries and joins, then time each of them -- nothing helps more than real-world numbers.
Then even better -- add "EXPLAIN" to the beginning of each query. This will tell you how many subqueries MySQL is using to answer your request for data, and how many rows are scanned for each query.
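A timing harness for that comparison can be very small. The sketch below is an assumption-heavy illustration: it uses Python's sqlite3 module with an in-memory database, so there is no network round trip and the absolute numbers say nothing about your MySQL server. The tables table1 and table2, the payload column, and the helper names are hypothetical; the point is only the shape of the measurement, which you would repeat against your real database.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id INTEGER PRIMARY KEY, other_id INTEGER)")
conn.execute("CREATE TABLE table2 (other_id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?)", [(i, 1001 - i) for i in range(1, 1001)]
)
conn.executemany(
    "INSERT INTO table2 VALUES (?, ?)", [(i, f"payload {i}") for i in range(1, 1001)]
)


def two_selects(row_id):
    """Look up the key first, then fetch the related row with a second query."""
    (other_id,) = conn.execute(
        "SELECT other_id FROM table1 WHERE id = ?", (row_id,)
    ).fetchone()
    return conn.execute(
        "SELECT payload FROM table2 WHERE other_id = ?", (other_id,)
    ).fetchone()


def one_join(row_id):
    """Fetch the related row in a single joined query."""
    return conn.execute(
        "SELECT table2.payload FROM table1 "
        "JOIN table2 ON table1.other_id = table2.other_id "
        "WHERE table1.id = ?",
        (row_id,),
    ).fetchone()


def time_it(fn, repeats=2000):
    """Return the wall-clock seconds needed to call fn `repeats` times."""
    start = time.perf_counter()
    for i in range(repeats):
        fn(1 + i % 1000)
    return time.perf_counter() - start


print("two selects:", time_it(two_selects))
print("one join:   ", time_it(one_join))
```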
Depending on how complex the database is compared to how complex the code would be for the developer, it may be simpler to do many SELECT calls.
Try running some database statistics against both the JOIN and the multiple SELECTs. See if in your environment the JOIN is faster/slower than the SELECT.
Then again, if changing it to a JOIN would mean an extra day/week/month of dev work, I'd stick with multiple SELECTs
Cheers,
BLT
In my experience I have found it's usually faster to run several queries, especially when retrieving large data sets.
When interacting with the database from another application, such as PHP, there is the argument of one trip to the server over many.
There are other ways to limit the number of trips made to the server and still run multiple queries that are often not only faster but also make the application easier to read - for example mysqli_multi_query.
I'm no novice when it comes to SQL. I think there is a tendency for developers, especially juniors, to spend a lot of time trying to write very clever joins because they look smart, whereas there are actually smart ways to extract data that look simple.
The last paragraph was a personal opinion, but I hope this helps. I do agree with the others though who say you should benchmark. Neither approach is a silver bullet.
Whether you should use a join is first and foremost about whether a join makes sense. Only at that point is performance even something to be considered, as nearly all other cases will result in significantly worse performance.
Performance differences will largely be tied to how related the info you're querying for is. Joins work, and they're fast when the data is related and you index stuff correctly, but they do often result in some redundancy and sometimes more results than needed. And if your data sets are not directly related, sticking them in a single query will result in what's called a Cartesian product (basically, all possible combinations of rows), which is almost never what you want.
This is often caused by many-to-one-to-many relationships. For example, HoldOffHunger's answer mentioned a single query for posts, tags, and comments. Comments are related to a post, as are tags...but tags are unrelated to comments.
+------------+     +---------+     +---------+
|  comment   |     |  post   |     |   tag   |
|------------|*   1|---------|1   *|---------|
| post_id    |-----| post_id |-----| post_id |
| comment_id |     | ...     |     | tag_id  |
| user_id    |     |         |     | ...     |
| ...        |     |         |     | ...     |
+------------+     +---------+     +---------+
In this case, it is unambiguously better for this to be at least two separate queries. If you try to join tags and comments, because there's no direct relation between the two, you end up with every possible combination of tag and comment. many * many == manymany. Aside from that, since comments and tags are unrelated, you can do those two queries in parallel, leading to potential gain.
Let's consider a different scenario, though: You want the comments attached to a post, and the commenters' contact info.
+----------+     +------------+     +---------+
|   user   |     |  comment   |     |  post   |
|----------|1   *|------------|*   1|---------|
| user_id  |-----| post_id    |-----| post_id |
| username |     | user_id    |     | ...     |
| ...      |     | ...        |     +---------+
+----------+     +------------+
This is where you should consider a join. Aside from being a much more natural query, most database systems (including MySQL) have lots of smart people put lots of hard work into optimizing queries just like it. For separate queries, since each query depends on the results of the previous one, the queries can't be done in parallel, and the total time becomes not just the actual execute time of the queries, but also the time spent fetching results, sifting through them for IDs for the next query, linking rows together, etc.
Will it be faster in terms of throughput? Probably. But it also potentially locks more database objects at a time (depending on your database and your schema) and thereby decreases concurrency. In my experience people are often misled by the "fewer database round-trips" argument when in reality on most OLTP systems where the database is on the same LAN, the real bottleneck is rarely the network.
Here is a link with 100 useful queries. They were tested on an Oracle database, but remember that SQL is a standard; what differs between Oracle, MS SQL Server, MySQL, and other databases is the SQL dialect:
http://javaforlearn.com/100-sql-queries-learn/
There are several factors, which means there is no binary answer. The question of what is best for performance depends on your environment. By the way, if your single select with an identifier is not sub-second, something may be wrong with your configuration.
The real question to ask is how do you want to access the data. Single selects support late-binding. For example if you only want employee information, you can select from the Employees table. The foreign key relationships can be used to retrieve related resources at a later time and as needed. The selects will already have a key to point to so they should be extremely fast, and you only have to retrieve what you need. Network latency must always be taken into account.
Joins will retrieve all of the data at once. If you are generating a report or populating a grid, this may be exactly what you want. Compiled and optimized joins are simply going to be faster than single selects in this scenario. Remember, ad-hoc joins may not be as fast; you should compile them (into a stored proc). The speed answer depends on the execution plan, which details exactly what steps the DBMS takes to retrieve the data.
Yes, one query using JOINS would be quicker. Although without knowing the relationships of the tables you are querying, the size of your dataset, or where the primary keys are, it's almost impossible to say how much faster.
Why not test both scenarios out, then you'll know for sure...
I know it's generally a bad idea to do queries like this:
SELECT * FROM `group_relations`
But when I just want the count, should I go for this query, since it allows the table to change but still yields the same results?
SELECT COUNT(*) FROM `group_relations`
Or the more specific
SELECT COUNT(`group_id`) FROM `group_relations`
I have a feeling the latter could potentially be faster, but are there any other things to consider?
Update: I am using InnoDB in this case, sorry for not being more specific.
If the column in question is NOT NULL, both of your queries are equivalent. When group_id contains null values,
select count(*)
will count all rows, whereas
select count(group_id)
will only count the rows where group_id is not null.
Also, some database systems, like MySQL, employ an optimization when you ask for count(*) which makes such queries a bit faster than the specific one.
Personally, when just counting, I'm doing count(*) to be on the safe side with the nulls.
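The NULL behaviour is quick to verify. The sketch below uses Python's sqlite3 module as a stand-in for MySQL (COUNT follows the same standard SQL semantics here), and the user_id column and the four sample rows are hypothetical. One of the four rows has a NULL group_id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE group_relations (group_id INTEGER, user_id INTEGER)")
conn.executemany(
    "INSERT INTO group_relations VALUES (?, ?)",
    [(1, 10), (1, 11), (None, 12), (2, 13)],  # third row has a NULL group_id
)

count_star, count_col, count_one = conn.execute(
    "SELECT COUNT(*), COUNT(group_id), COUNT(1) FROM group_relations"
).fetchone()

# COUNT(*) and COUNT(1) count rows; COUNT(group_id) skips the NULL.
print(count_star, count_col, count_one)  # 4 3 4
```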
If I remember it right, in MySQL COUNT(*) counts all rows, whereas COUNT(column_name) counts only the rows that have a non-NULL value in the given column.
COUNT(*) counts all rows, while COUNT(column_name) counts only rows without NULL values in the specified column.
Important to note in MySQL:
COUNT() is very fast on MyISAM tables for * or not-null columns, since the row count is cached. InnoDB has no row count caching, so there is no difference in performance for COUNT(*) or COUNT(column_name), regardless if the column can be null or not. You can read more on the differences on this post at the MySQL performance blog.
If you try SELECT COUNT(1) FROM group_relations it will be a bit faster because it will not try to retrieve information from your columns.
Edit: I just did some research and found out that this only happens in some db. In sqlserver it's the same to use 1 or *, but on oracle it's faster to use 1.
http://social.msdn.microsoft.com/forums/en-US/transactsql/thread/9367c580-087a-4fc1-bf88-91a51a4ee018/
Apparently there is no difference between them in MySQL; as in SQL Server, the parser appears to change the query to select(1). Sorry if I misled you in some way.
I was curious about this myself. It's all fine to read documentation and theoretical answers, but I like to balance those with empirical evidence.
I have a MySQL table (InnoDB) that has 5,607,997 records in it. The table is in my own private sandbox, so I know the contents are static and nobody else is using the server. I think this effectively removes all outside effects on performance. I have a table with an auto_increment Primary Key field (Id) that I know will never be null that I will use for my where clause test (WHERE Id IS NOT NULL).
The only other possible glitch I see in running tests is the cache. The first time a query is run will always be slower than subsequent queries that use the same indexes. I'll refer to that below as the cache Seeding call. Just to mix it up a little I ran it with a where clause I know will always evaluate to true regardless of any data (TRUE = TRUE).
That said here are my results:
QueryType | w/o WHERE          | where id is not null | where true=true
----------+--------------------+----------------------+-------------------
COUNT(*)  | 9 min 30.13 sec ++ | 6 min 16.68 sec ++   | 2 min 21.80 sec ++
          | 6 min 13.34 sec    | 1 min 36.02 sec      | 2 min 0.11 sec
          | 6 min 10.06 sec    | 1 min 33.47 sec      | 1 min 50.54 sec
COUNT(Id) | 5 min 59.87 sec    | 1 min 34.47 sec      | 2 min 3.96 sec
          | 5 min 44.95 sec    | 1 min 13.09 sec      | 2 min 6.48 sec
COUNT(1)  | 6 min 49.64 sec    | 2 min 0.80 sec       | 2 min 11.64 sec
          | 6 min 31.64 sec    | 1 min 41.19 sec      | 1 min 43.51 sec
++This is considered the cache Seeding call. It is expected to be slower than the rest.
I'd say the results speak for themselves. COUNT(Id) usually edges out the others. Adding a Where clause dramatically decreases the access time even if it's a clause you know will evaluate to true. The sweet spot appears to be COUNT(Id)... WHERE Id IS NOT NULL.
I would love to see other people's results, perhaps with smaller tables or with where clauses against different fields than the field you're counting. I'm sure there are other variations I haven't taken into account.
Seek Alternatives
As you've seen, when tables grow large, COUNT queries get slow. I think the most important thing is to consider the nature of the problem you're trying to solve. For example, many developers use COUNT queries when generating pagination for large sets of records in order to determine the total number of pages in the result set.
Knowing that COUNT queries will grow slow, you could consider an alternative way to display pagination controls that simply allows you to side-step the slow query. Google's pagination is an excellent example.
Denormalize
If you absolutely must know the number of records matching a specific count, consider the classic technique of data denormalization. Instead of counting the number of rows at lookup time, consider incrementing a counter on record insertion, and decrementing that counter on record deletion.
If you decide to do this, consider using idempotent, transactional operations to keep those denormalized values in sync.
START TRANSACTION;
INSERT INTO `group_relations` (`group_id`) VALUES (1);
UPDATE `group_relations_count` SET `count` = `count` + 1;
COMMIT;
Alternatively, you could use database triggers if your RDBMS supports them.
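A trigger-based version of the counter might look like the sketch below. It uses Python's sqlite3 module as a stand-in for MySQL, so the trigger syntax is SQLite's and would need adapting (MySQL triggers are declared with FOR EACH ROW). The trigger names and the row_count column are hypothetical; the counter table is assumed to hold exactly one row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE group_relations (group_id INTEGER NOT NULL);

-- Single-row table holding the denormalized count.
CREATE TABLE group_relations_count (row_count INTEGER NOT NULL);
INSERT INTO group_relations_count VALUES (0);

CREATE TRIGGER group_relations_after_insert AFTER INSERT ON group_relations
BEGIN
    UPDATE group_relations_count SET row_count = row_count + 1;
END;

CREATE TRIGGER group_relations_after_delete AFTER DELETE ON group_relations
BEGIN
    UPDATE group_relations_count SET row_count = row_count - 1;
END;
""")


def cached_count():
    """Read the maintained counter instead of scanning the table."""
    return conn.execute("SELECT row_count FROM group_relations_count").fetchone()[0]


def real_count():
    """The slow path the counter is meant to replace."""
    return conn.execute("SELECT COUNT(*) FROM group_relations").fetchone()[0]


# Each `with conn:` block is one transaction; the triggers run inside it,
# so the counter and the data are committed (or rolled back) together.
with conn:
    conn.executemany("INSERT INTO group_relations VALUES (?)", [(1,), (1,), (2,)])
with conn:
    conn.execute("DELETE FROM group_relations WHERE group_id = 2")

print(cached_count(), real_count())  # 2 2
```

The trade-off is the usual one for denormalization: reads become a single-row lookup, while every insert and delete now also updates the same counter row, which can become a hot spot under heavy concurrent writes.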
Depending on your architecture, it might make sense to use a caching layer like memcached to store, increment and decrement the denormalized value, and simply fall through to the slow COUNT query when the cache key is missing. This can reduce overall write-contention if you have very volatile data, though in cases like this, you'll want to consider solutions to the dog-pile effect.
MySQL MyISAM tables should have an optimisation for COUNT(*), skipping the full table scan.
An asterisk in COUNT has no bearing on the asterisk for selecting all fields of a table. It's pure rubbish to say that COUNT(*) is slower than COUNT(field).
I intuit that select COUNT(*) is faster than select COUNT(field). If the RDBMS detected that you specify "*" on COUNT instead of field, it doesn't need to evaluate anything to increment count. Whereas if you specify field on COUNT, the RDBMS will always evaluate if your field is null or not to count it.
But if your field is nullable, specify the field in COUNT.
COUNT(*) facts and myths:
MYTH: "InnoDB doesn't handle count(*) queries well":
Most count(*) queries are executed the same way by all storage engines if you have a WHERE clause; otherwise InnoDB will have to perform a full table scan.
FACT: InnoDB doesn't optimize count(*) queries without the where clause
It is best to count by an indexed column such as a primary key.
SELECT COUNT(`group_id`) FROM `group_relations`
It should depend on what you are actually trying to achieve as Sebastian has already said, i.e. make your intentions clear! If you are just counting the rows then go for the COUNT(*), or counting a single column go for the COUNT(column).
It might be worth checking out your DB vendor too. Back when I used to use Informix, it had an optimisation for COUNT(*) which had a query plan execution cost of 1, compared to counting single or multiple columns, which would result in a higher figure.
if you try SELECT COUNT(1) FROM group_relations it will be a bit faster because it will not try to retrieve information from your columns.
COUNT(1) used to be faster than COUNT(*), but that's not true anymore, since modern DBMS are smart enough to know that you don't wanna know about columns
The advice I got from MySQL about things like this is that, in general, trying to optimize a query based on tricks like this can be a curse in the long run. There are examples over MySQL's history where somebody's high-performance technique that relies on how the optimizer works ends up being the bottleneck in the next release.
Write the query that answers the question you're asking -- if you want a count of all rows, use COUNT(*). If you want a count of non-null columns, use COUNT(col) WHERE col IS NOT NULL. Index appropriately, and leave the optimization to the optimizer. Trying to make your own query-level optimizations can sometimes make the built-in optimizer less effective.
That said, there are things you can do in a query to make it easier for the optimizer to speed it up, but I don't believe COUNT is one of them.
Edit: The statistics in the answer above are interesting, though. I'm not sure whether there is actually something at work in the optimizer in this case. I'm just talking about query-level optimizations in general.
I know it's generally a bad idea to do
queries like this:
SELECT * FROM `group_relations`
But when I just want the count, should
I go for this query since that allows
the table to change but still yields
the same results.
SELECT COUNT(*) FROM `group_relations`
As your question implies, the reason SELECT * is ill-advised is that changes to the table could require changes in your code. That doesn't apply to COUNT(*). It's pretty rare to want the specialized behavior that SELECT COUNT(`group_id`) gives you - typically you want to know the number of records. That's what COUNT(*) is for, so use it.