Which query will execute faster and which is perfect query ?
SELECT
COUNT(*) AS count
FROM
students
WHERE
status = 1
AND
classes_id IN(
SELECT
id
FROM
classes
WHERE
departments_id = 1
);
Or
SELECT
COUNT(*) AS count
FROM
students s
LEFT JOIN
classes c
ON
c.id = s.classes_id
WHERE
status = 1
AND
c.departments_id = 1
I have placed two queries both will output same result. Now I want to know which method will execute faster and which method is correct way ?
You should always use EXPLAIN to determine how your query will run.
Unfortunately, MySQL will execute your subquery as a DEPENDENT QUERY, which means that the subquery will be ran for each row in the outer query. You'd think MySQL would be smart enough to detect that the subquery isn't a correlated subquery and would run it just once, alas, it's not yet that smart.
So, MySQL will scan through all of the rows in students, running the subquery for each row, and not utilizing any indexes on the outer query whatsoever.
Writing the query as a JOIN would allow MySQL to utilize indexes, and the following query would be the optimum way to write it:
SELECT COUNT(*) AS count
FROMstudents s
JOIN classes c
ON c.id = s.classes_id
AND c.departments_id = 1
WHERE s.status = 1
This would utilize the following indexes:
students(`status`)
classes(`id`, `departements_id`) : multi-column index
From a design and clarity standpoint I'd avoid inner selects like the first one. It is true that to be 100% sure on if or how each query will be optimized and which will run 'better' requires seeing how the SQL server you're using will interperet it and its plan. In Mysql, use "Explain".
However.... Even without seeing this, my money is still on the Join only version... The inner select version has to perform the inner select in it's entirety before determining the values to use inside the "IN" clause--I know this to be true when you wrap stuff in functions, and pretty sure it's true when sticking a select in as IN arguements. I also know that that's a good way to totally neutralize any benefit you might have with indexes on the tables inside the inner select.
I'm generally of the opinion that Inner selects are only really needed for very rare query situations. Usually, those who use them often are thinking like traditional iterative flow programmers not really thinking in relational DB result set terms...
EXPLAIN Both the queries individually
The difference between both queries is of Sub-Queries vs Joins
Mostly Joins are faster than sub-queries. Join creates execution plan and predict what data is going to process, hence it saves time. On the other hand sub-queries run all the queries until all the data is loaded. Most developer use Sub-queries because these are more readable than JOINS, but where the performance is matter, JOIN is better solution.
The best way to find out is to measure it:
Without index
Query 1: 0.9s
Query 2: 0.9s
With index
Query 1: 0.4s
Query 2: 0.2s
The conclusion is:
If you don't have indexes then it makes no difference which query you use.
The join is faster if you have the right index.
The effect of adding the correct index is greater than the effect of choosing the right query. If performance matters, make sure you have the correct indexes.
Of course, your results may vary depending on the MySQL version and the distribution of data you have.
Here's how I tested it:
1,000,000 students (25% with status 1).
50,000 courses.
10 departments.
Here's the SQL I used to create the test data:
CREATE TABLE students
(id INT PRIMARY KEY AUTO_INCREMENT,
status int NOT NULL,
classes_id int NOT NULL);
CREATE TABLE classes
(id INT PRIMARY KEY AUTO_INCREMENT,
departments_id INT NOT NULL);
CREATE TABLE numbers(id INT PRIMARY KEY AUTO_INCREMENT);
INSERT INTO numbers VALUES (),(),(),(),(),(),(),(),(),();
INSERT INTO numbers
SELECT NULL
FROM numbers AS n1
CROSS JOIN numbers AS n2
CROSS JOIN numbers AS n3
CROSS JOIN numbers AS n4
CROSS JOIN numbers AS n5
CROSS JOIN numbers AS n6;
INSERT INTO classes (departments_id)
SELECT id % 10 FROM numbers WHERE id <= 50000;
INSERT INTO students (status, classes_id)
SELECT id % 4 = 0, id % 50000 + 1 FROM numbers WHERE id <= 1000000;
SELECT COUNT(*) AS count
FROM students
WHERE status = 1
AND classes_id IN (SELECT id FROM classes WHERE departments_id = 1);
SELECT COUNT(*) AS count
FROM students s
LEFT JOIN classes c
ON c.id = s.classes_id
WHERE status = 1
AND c.departments_id = 1;
CREATE INDEX ix_students ON students(status, classes_id);
The two queries won't produce the same results:
SELECT
COUNT(*) AS count
FROM
students
WHERE
status = 1
AND
classes_id IN(
SELECT
id
FROM
classes
WHERE
departments_id = 1
);
...will return the number of rows in the students table that have a classes_id field that is also in the classes table with a departments_id of 1.
SELECT
COUNT(*) AS count
FROM
students s
LEFT JOIN
classes c
ON
c.id = s.classes_id
WHERE
status = 1
AND
c.departments_id = 1
...will return the total number of rows in the students table where the status field is 1 and possibly more than that depending on how your data is organised.
If you want the queries to return the same thing, you need to change the LEFT JOIN to an INNER JOIN so it will match only the rows that suit both conditions.
Run EXPLAIN SELECT ... on both queries and check which one does what ;)
Related
I have a query with a long list (> 2000 ids) in a WHERE IN clause in mysql (InnoDB):
SELECT id
FROM table
WHERE user_id IN ('list of >2000 ids')
I tried to optimize this by using an INNER JOIN instead of the wherein like this (both ids and the user_id use an index):
SELECT table.id
FROM table
INNER JOIN users ON table.user_id = users.id WHERE users.type = 1
Surprisingly, however, the first query is much faster (by the factor 5 to 6). Why is this the case? Could it be that the second query outperforms the first one, when the number of ids in the where in clause becomes much larger?
This is not Ans to your Question but you may use as alternative to your first query, You can better increase performance by replacing IN Clause with EXISTS since EXISTS performance better than IN ref : Here
SELECT id
FROM table t
WHERE EXISTS (SELECT 1 FROM USERS WHERE t.user_id = users.id)
This is an unfair comparison between the 2 queries.
In the 1st query you provide a list of constants as a search criteria, therefore MySQL has to open and search only table and / or 1 index file.
In the 2nd query you instruct MySQL to obtain the list dynamically from another table and join that list back to the main table. It is also not clear, if indexes were used to create a join or a full table scan was needed.
To have a fair comparison, time the query that you used to obtain the list in the 1st query along with the query itself. Or try
SELECT table.id FROM table WHERE user_id IN (SELECT users.id FROM users WHERE users.type = 1)
The above fetches the list of ids dynamically in a subquery.
I have a MYSQL query of this form:
SELECT
employee.name,
totalpayments.totalpaid
FROM
employee
JOIN (
SELECT
paychecks.employee_id,
SUM(paychecks.amount) totalpaid
FROM
paychecks
GROUP BY
paychecks.employee_id
) totalpayments on totalpayments.employee_id = employee.id
I've recently found that this returns MUCH faster in this form:
SELECT
employee.name,
(
SELECT
SUM(paychecks.amount)
FROM
paychecks
WHERE
paychecks.employee_id = employee.id
) totalpaid
FROM
employee
It surprises me that there would be a difference in speed, and that the lower query would be faster. I prefer the upper form for development, because I can run the subquery independently.
Is there a way to get the "best of both worlds": speedy results return AND being able to run the subquery in isolation?
Likely, the correlated subquery is able to make effective use of an index, which is why it's fast, even though that subquery has to be executed multiple times.
For the first query with the inline view, that causing MySQL to create a derived table, and for large sets, that's effectively a MyISAM table.
In MySQL 5.6.x and later, the optimizer may choose to add an index on the derived table, if that would allow a ref operation and the estimated cost of the ref operation is lower than the nested loops scan.
I recommend you try using EXPLAIN to see the access plan. (Based on your report of performance, I suspect you are running on MySQL version 5.5 or earlier.)
The two statements are not entirely equivalent, in the case where there are rows in employees for which there are no matching rows in paychecks.
An equivalent result could be obtained entirely avoiding a subquery:
SELECT e.name
, SUM(p.amount) AS total_paid
FROM employee e
JOIN paychecks p
ON p.employee_id = e.id
GROUP BY e.id
(Use an inner join to get a result equivalent to the first query, use a LEFT outer join to be equivalent to the second query. Wrap the SUM() aggregate in an IFNULL function if you want to return a zero rather than a NULL value when no matching row with a non-null value of amount is found in paychecks.)
Join is basically Cartesian product that means all the records of table A will be combined with all the records of table B. The output will be
number of records of table A * number of records of table b =rows in the new table
10 * 10 = 100
and out of those 100 records, the ones that match the filters will be returned in the query.
In the nested queries, there is a sample inner query and whatever is the total size of records of the inner query will be the input to the outter query that is why nested queries are faster than joins.
[Summary of the question: 2 SQL statements produce same results, but at different speeds. One statement uses JOIN, other uses IN. JOIN is faster than IN]
I tried a 2 kinds of SELECT statement on 2 tables, named booking_record and inclusions. The table inclusions has a many-to-one relation with table booking_record.
(Table definitions not included for simplicity.)
First statement: (using IN clause)
SELECT
id,
agent,
source
FROM
booking_record
WHERE
id IN
( SELECT DISTINCT
foreign_key_booking_record
FROM
inclusions
WHERE
foreign_key_bill IS NULL
AND
invoice_closure <> FALSE
)
Second statement: (using JOIN)
SELECT
id,
agent,
source
FROM
booking_record
JOIN
( SELECT DISTINCT
foreign_key_booking_record
FROM
inclusions
WHERE
foreign_key_bill IS NULL
AND
invoice_closure <> FALSE
) inclusions
ON
id = foreign_key_booking_record
with 300,000+ rows in booking_record-table and 6,100,000+ rows in inclusions-table; the 2nd statement delivered 127 rows in just 0.08 seconds, but the 1st statement took nearly 21 minutes for same records.
Why JOIN is so much faster than IN clause?
This behavior is well-documented. See here.
The short answer is that until MySQL version 5.6.6, MySQL did a poor job of optimizing these types of queries. What would happen is that the subquery would be run each time for every row in the outer query. Lots and lots of overhead, running the same query over and over. You could improve this by using good indexing and removing the distinct from the in subquery.
This is one of the reasons that I prefer exists instead of in, if you care about performance.
EXPLAIN should give you some clues (Mysql Explain Syntax
I suspect that the IN version is constructing a list which is then scanned by each item (IN is generally considered a very inefficient construct, I only use it if I have a short list of items to manually enter).
The JOIN is more likely constructing a temp table for the results, making it more like normal JOINs between tables.
You should explore this by using EXPLAIN, as said by Ollie.
But in advance, note that the second command has one more filter: id = foreign_key_booking_record.
Check if this has the same performance:
SELECT
id,
agent,
source
FROM
booking_record
WHERE
id IN
( SELECT DISTINCT
foreign_key_booking_record
FROM
inclusions
WHERE
id = foreign_key_booking_record -- new filter
AND
foreign_key_bill IS NULL
AND
invoice_closure <> FALSE
)
I have 3 SQL queries as given:
select student_id from user where user_id =4; // returns 35
select * from student where student_id in (35);
select * from student where student_id in (select student_id from user where user_id =4);
first 2 queries take less than 0.5 second, but the third, similar as 2nd containing 1st as subquery, is taking around 8 seconds.
I indexed tables according to my need, but time is not reducing.
Can someone please give me a solution or provide some explanation for this behaviour.
Thanks!
Actually, MySQL execute the inner query at the end, it scans every indexes before. MySQL rewrites the subquery in order to make the inner query fully dependent of the outer one.
For exemple, it select * from student (depend of your database, but could return many results), then apply the inner query user_id=4 to the previous result.
The dev team are working on this problem and it should be "solved" in the 6.0 http://dev.mysql.com/doc/refman/5.5/en/optimizing-subqueries.html
EDIT:
In your case, you should use a JOIN method.
Not with a subquery but why don't you use a join here?
select
s.*
from
student s
inner join
user u
on s.id_student_id = u.student_id
where
u.user_id = 4
;
Hi i have an issue with a mysql select statement i cant get my head around,
Table client_directory_data
id int,
verified int,
client_id int,
created timestamp,
description longtext
select * from client_directory_data where verified = 1 order by created desc
but this selects multiple rows for each client_id
what i need to do is to select every client_id which has a verified = 1 but only get the most recent row for each client_id, i hope that makes sense.
This is an issue I face all the time. Fortunately there's a nice little trick for doing this:
SELECT
client_id,
SUBSTRING_INDEX(GROUP_CONCAT(id ORDER BY created DESC),",",1) AS `id`
FROM client_directory_data
WHERE verified = 1
GROUP BY client_id
And if you want the whole row you can just join onto it like so:
SELECT
*
FROM (
SELECT
client_id,
SUBSTRING_INDEX(GROUP_CONCAT(id ORDER BY created DESC),",",1) AS `id`
FROM client_directory_data
WHERE verified = 1
GROUP BY client_id
) ids
JOIN client_directory_data USING (id);
Of course if you're ordering by an indexed field anyway (that you could therefore join on efficiently anyway), it's better to use MAX(id) AS id, although it actually has very little impact on performance. The main reason to use MAX() is really to make the code a little simpler. It also avoids the pitfalls you may encounter if the field contains commas (which you can get around with a different seperator for the group concat) or hitting the max GROUP_CONCAT length (which can be extended with SET group_concat_max_len = xxx; and only causes warnings anyway).
I can see why this would intuitively seem like it would have performance issues, however it's actually the best performng method I've found for these queries - especially on large tables.
Here are some benchmarks I've taken from some of the larger tables currently available to me comparing the three methods in this thread.
Query A: (~5,000 records, ~900 results, non-indexed field)
GROUP_CONCAT method: 0.0100 seconds
MAX method: 0.102 seconds
LEFT JOIN method: 0.0082 seconds
Query B : (~300,000 records, ~95,000 results)
GROUP_CONCAT method: 1.8618 seconds
MAX method: 1.7904 seconds
LEFT JOIN method: 6.4649 seconds
Query C : (~300,000 records, ~7 results)
GROUP_CONCAT method: 0.103 seconds
MAX method: 0.0102 seconds
LEFT JOIN method: (I got bored after 4 hours)
Query D : (~500,000 records, ~5,000 different values of the field being grouped)
GROUP method: 0.1355 seconds
MAX Method : 0.0429 seconds
LEFT JOIN method: (I got bored after 10 minutes)
That makes sense and is a classic question.
Assuming that the most recent row is the one with highest id, you can use:
SELECT *
FROM client_directory_data c
LEFT JOIN client_directory_data d ON c.client_id = d.client_id AND d.verified = 1 AND d.id > c.id
WHERE d.id IS NULL
AND c.verified = 1;
You can have an explanation of this query pattern here.
Make id as primary key for the table client_directory_data