MySQL: Why is WHERE IN much faster than JOIN in this case?

I have a query with a long list (> 2000 ids) in a WHERE IN clause in MySQL (InnoDB):
SELECT id
FROM table
WHERE user_id IN ('list of >2000 ids')
I tried to optimize this by using an INNER JOIN instead of the WHERE IN, like this (both the id columns and user_id are indexed):
SELECT table.id
FROM table
INNER JOIN users ON table.user_id = users.id
WHERE users.type = 1
Surprisingly, however, the first query is much faster (by a factor of 5 to 6). Why is this the case? Could the second query outperform the first one when the number of ids in the WHERE IN clause becomes much larger?

This does not answer your question, but as an alternative to your first query you can replace the IN clause with EXISTS, since EXISTS can perform better than IN:
SELECT id
FROM table t
WHERE EXISTS (SELECT 1 FROM users WHERE t.user_id = users.id)

This is an unfair comparison between the 2 queries.
In the 1st query you provide a list of constants as the search criteria, so MySQL has to open and search only one table and/or one index.
In the 2nd query you instruct MySQL to obtain the list dynamically from another table and join that list back to the main table. It is also not clear whether indexes were used to perform the join or a full table scan was needed.
To have a fair comparison, time the query that you used to obtain the list for the 1st query along with the query itself. Or try:
SELECT table.id FROM table WHERE user_id IN (SELECT users.id FROM users WHERE users.type = 1)
The above fetches the list of ids dynamically in a subquery.
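The three formulations discussed above can be compared side by side. The following is only a minimal sketch: it uses Python's built-in sqlite3 module as a stand-in for MySQL (so timings and plans will not carry over), the table is called items because table is a reserved word, and the sample rows are invented for illustration.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MySQL (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, type INTEGER NOT NULL);
    CREATE TABLE items (id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL);
    CREATE INDEX ix_items_user ON items(user_id);
    INSERT INTO users (id, type) VALUES (1, 1), (2, 2), (3, 1), (4, 2);
    INSERT INTO items (id, user_id) VALUES (10, 1), (11, 2), (12, 3), (13, 3), (14, 4);
""")

def ids(sql, params=()):
    """Run a query and return its first column as a sorted list."""
    return sorted(row[0] for row in conn.execute(sql, params))

# Step 1 of the "constant list" approach: fetching the ids is a query of its
# own, and its cost belongs in a fair comparison.
user_ids = ids("SELECT id FROM users WHERE type = 1")

# Step 2: the WHERE IN query with the list passed in as constants.
placeholders = ",".join("?" * len(user_ids))
in_list = ids(f"SELECT id FROM items WHERE user_id IN ({placeholders})", user_ids)

# The INNER JOIN formulation.
join = ids("""
    SELECT items.id
    FROM items
    INNER JOIN users ON items.user_id = users.id
    WHERE users.type = 1
""")

# The IN formulation with the list fetched dynamically in a subquery.
in_subquery = ids("""
    SELECT id FROM items
    WHERE user_id IN (SELECT id FROM users WHERE type = 1)
""")

print(in_list, join, in_subquery)
```

All three return the same ids here; only the first one hides part of its work in a separate query.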

Related

MySQL internal order of operations in SELECT query

What is the internal order of operations in a MySQL SELECT query and a relational query?
For instance, a SELECT query to a single table:
SELECT `name`
FROM `users`
WHERE `publication_count`>0
ORDER BY `publication_count` DESC
I know that at first all table fields are fetched and only the name field is left at the end. Does that happen before or after the condition in WHERE is applied? When is ORDER BY applied?
A relational query using two tables:
SELECT `users`.`name`, `posts`.`text`
FROM `users`, `posts`
WHERE `posts`.`author_id`=`users`.`id`
ORDER BY `posts`.`date` DESC
Same question. What happens after what? (I know that at first the Cartesian product is generated)
Simplifying the rules, the logical processing of your examples goes as follows:
1. FROM -- all elements in list (including multiple tables)
2. WHERE -- discard rows not matching conditions
3. SELECT -- output rows are computed (not fetched)
4. ORDER BY -- sort output rows
Also, you shouldn't use the old-fashioned implicit join syntax with the join condition in WHERE. Instead, please use JOIN:
SELECT ...
FROM users
INNER JOIN posts ON users.id = posts.author_id
ORDER BY ...
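To see that the implicit and explicit join forms are interchangeable in their results, here is a small sketch. It assumes Python's built-in sqlite3 as a stand-in for MySQL and uses made-up rows; the table and column names follow the question.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MySQL (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE posts (
        id INTEGER PRIMARY KEY,
        author_id INTEGER NOT NULL,
        text TEXT NOT NULL,
        date TEXT NOT NULL
    );
    INSERT INTO users VALUES (1, 'ann'), (2, 'bob'), (3, 'cat');
    INSERT INTO posts VALUES
        (1, 1, 'first',  '2020-01-01'),
        (2, 2, 'second', '2020-02-01'),
        (3, 1, 'third',  '2020-03-01');
""")

# Old-style implicit join: tables listed in FROM, join condition in WHERE.
implicit = conn.execute("""
    SELECT users.name, posts.text
    FROM users, posts
    WHERE posts.author_id = users.id
    ORDER BY posts.date DESC
""").fetchall()

# Explicit INNER JOIN: the join condition lives in ON.
explicit = conn.execute("""
    SELECT users.name, posts.text
    FROM users
    INNER JOIN posts ON users.id = posts.author_id
    ORDER BY posts.date DESC
""").fetchall()

print(implicit)
print(explicit)
```

User 'cat' has no posts and is dropped by the join condition before any output row is computed, and the remaining rows come back newest first in both forms.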

Fast to query slow to create table

I have an issue creating tables using the SELECT keyword (it runs very slowly). The query should take only the details of the animal with the latest entry date. That query will then be inner joined with another query.
SELECT *
FROM amusementPart a
INNER JOIN (
SELECT DISTINCT name, type, cageID, dateOfEntry
FROM bigRegistrations
GROUP BY cageID
) r ON a.type = r.cageID
Because of the slow performance, someone suggested steps to improve it: 1) use a temporary table, 2) store the result and join it to the other statement.
use myzoo;
CREATE TABLE animalRegistrations AS
SELECT DISTINCT name, type, cageID, MAX(dateOfEntry) as entryDate
FROM bigRegistrations
GROUP BY cageID
Unfortunately, it is still slow. If I only run the SELECT statement, the result is shown in 1-2 seconds. But if I add the CREATE TABLE, the query takes ages (approx. 25 minutes).
Is there a good approach to improve the query time?
Edit: the size of the bigRegistrations table is around 3.5 million rows.
Please try the query below. Your requirement is to take only the details of the animal with the latest entry date and inner join that with another query, but the query you are using does not fetch records as per that requirement. The following does, and it should be faster:
SELECT a.*, b.name, b.type, b.cageID, b.dateOfEntry
FROM amusementPart a
INNER JOIN bigRegistrations b ON a.type = b.cageID
INNER JOIN (SELECT c.cageID, max(c.dateOfEntry) dateofEntry
FROM bigRegistrations c
GROUP BY c.cageID) t ON t.cageID = b.cageID AND t.dateofEntry = b.dateofEntry
Suggested indexing: cageID and dateOfEntry.
This is a multipart question:
1. Use a temporary table.
2. Don't use DISTINCT; group by all columns to make the rows distinct (don't forget to check for an index).
3. Check the SQL execution plans.
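The join-to-the-latest-date technique can be tried on a few rows. This is a minimal sketch, assuming Python's built-in sqlite3 in place of MySQL, invented sample animals, and a hypothetical index name ix_cage_date; the join to amusementPart is left out to keep it short.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MySQL (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bigRegistrations (
        name TEXT, type TEXT, cageID INTEGER, dateOfEntry TEXT
    );
    -- Composite index in the spirit of the indexing suggestion (name is hypothetical).
    CREATE INDEX ix_cage_date ON bigRegistrations(cageID, dateOfEntry);
    INSERT INTO bigRegistrations VALUES
        ('leo',   'lion',     1, '2019-05-01'),
        ('simba', 'lion',     1, '2021-03-15'),
        ('dumbo', 'elephant', 2, '2020-07-04'),
        ('ella',  'elephant', 2, '2018-01-20');
""")

# The derived table t holds one row per cage with its latest entry date;
# joining back on (cageID, dateOfEntry) recovers the full registration row.
latest = conn.execute("""
    SELECT b.name, b.type, b.cageID, b.dateOfEntry
    FROM bigRegistrations b
    INNER JOIN (
        SELECT cageID, MAX(dateOfEntry) AS dateOfEntry
        FROM bigRegistrations
        GROUP BY cageID
    ) t ON t.cageID = b.cageID AND t.dateOfEntry = b.dateOfEntry
    ORDER BY b.cageID
""").fetchall()

print(latest)
```

Unlike selecting name and type next to GROUP BY cageID, this returns the name and type that actually belong to the latest entry of each cage.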
Here you are not creating a temporary table. Try the following...
CREATE TEMPORARY TABLE IF NOT EXISTS animalRegistrations AS
SELECT name, type, cageID, MAX(dateOfEntry) as entryDate
FROM bigRegistrations
GROUP BY cageID
Have you tried doing an explain to see how the plan is different from one execution to the next?
Also, I have found that there can be locking issues in some DB when doing insert(select) and table creation using select. I ran this in MySQL, and it solved some deadlock issues I was having.
SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
The reason the query runs so slowly is probably that it is creating the temp table based on all 3.5 million rows, when really you only need a subset of those, i.e. the bigRegistrations rows that match your join to amusementPart. The first single SELECT statement is faster because SQL is smart enough to know it only needs to calculate the bigRegistrations rows where a.type = r.cageID.
I'd suggest that you don't need a temp table; your first query is quite simple. Rather, you may just need an index. You can determine this manually by studying the estimated execution plan, or by running your query in the database tuning advisor. My guess is that you need to create an index similar to the one below. Notice that I index by cageId first, since that is what you join to amusementPart, so it would help SQL narrow the results down the quickest. But I'm guessing a bit: view the query plan or tuning advisor to be sure.
CREATE NONCLUSTERED INDEX IX_bigRegistrations ON bigRegistrations
(cageId, name, type, dateOfEntry)
Also, if you want the animal with the latest entry date, I think you want this query instead of the one you're using. I'm assuming the PK is all 4 columns.
SELECT name, type, cageID, dateOfEntry
FROM bigRegistrations BR
WHERE BR.dateOfEntry =
(SELECT MAX(BR1.dateOfEntry)
FROM bigRegistrations BR1
WHERE BR1.name = BR.name
AND BR1.type = BR.type
AND BR1.cageID = BR.cageID)

MySQL(version 5.5): Why `JOIN` is faster than `IN` clause?

[Summary of the question: 2 SQL statements produce same results, but at different speeds. One statement uses JOIN, other uses IN. JOIN is faster than IN]
I tried two kinds of SELECT statement on two tables, named booking_record and inclusions. The table inclusions has a many-to-one relation with the table booking_record.
(Table definitions not included for simplicity.)
First statement: (using IN clause)
SELECT
id,
agent,
source
FROM
booking_record
WHERE
id IN
( SELECT DISTINCT
foreign_key_booking_record
FROM
inclusions
WHERE
foreign_key_bill IS NULL
AND
invoice_closure <> FALSE
)
Second statement: (using JOIN)
SELECT
id,
agent,
source
FROM
booking_record
JOIN
( SELECT DISTINCT
foreign_key_booking_record
FROM
inclusions
WHERE
foreign_key_bill IS NULL
AND
invoice_closure <> FALSE
) inclusions
ON
id = foreign_key_booking_record
With 300,000+ rows in the booking_record table and 6,100,000+ rows in the inclusions table, the 2nd statement delivered 127 rows in just 0.08 seconds, but the 1st statement took nearly 21 minutes for the same records.
Why is JOIN so much faster than the IN clause?
This behavior is well-documented.
The short answer is that until MySQL version 5.6.6, MySQL did a poor job of optimizing these types of queries. The subquery would be run once for every row in the outer query: lots and lots of overhead, running the same query over and over. You could improve this by using good indexing and removing the DISTINCT from the IN subquery.
This is one of the reasons that I prefer EXISTS instead of IN, if you care about performance.
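For reference, the IN, derived-table JOIN, and EXISTS forms of this query can be written side by side. This is only a sketch under stated assumptions: Python's built-in sqlite3 stands in for MySQL (so it says nothing about the MySQL 5.5 timings), the table definitions are guessed from the question, invoice_closure is stored as 0/1, and the rows are invented.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MySQL (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE booking_record (id INTEGER PRIMARY KEY, agent TEXT, source TEXT);
    CREATE TABLE inclusions (
        id INTEGER PRIMARY KEY,
        foreign_key_booking_record INTEGER NOT NULL,
        foreign_key_bill INTEGER,            -- NULL means "not billed yet"
        invoice_closure INTEGER NOT NULL     -- 0/1 used in place of FALSE/TRUE
    );
    INSERT INTO booking_record VALUES (1, 'a1', 'web'), (2, 'a2', 'phone'), (3, 'a3', 'web');
    INSERT INTO inclusions VALUES
        (1, 1, NULL, 1),
        (2, 1, NULL, 1),
        (3, 2, 7,    1),
        (4, 3, NULL, 0),
        (5, 3, NULL, 1);
""")

def ids(sql):
    """Run a query and return its first column as a sorted list."""
    return sorted(row[0] for row in conn.execute(sql))

with_in = ids("""
    SELECT id FROM booking_record
    WHERE id IN (
        SELECT foreign_key_booking_record FROM inclusions
        WHERE foreign_key_bill IS NULL AND invoice_closure <> 0
    )
""")

with_join = ids("""
    SELECT b.id
    FROM booking_record b
    JOIN (
        SELECT DISTINCT foreign_key_booking_record FROM inclusions
        WHERE foreign_key_bill IS NULL AND invoice_closure <> 0
    ) i ON b.id = i.foreign_key_booking_record
""")

with_exists = ids("""
    SELECT b.id
    FROM booking_record b
    WHERE EXISTS (
        SELECT 1 FROM inclusions i
        WHERE i.foreign_key_booking_record = b.id
          AND i.foreign_key_bill IS NULL
          AND i.invoice_closure <> 0
    )
""")

print(with_in, with_join, with_exists)
```

The DISTINCT in the derived table matters for the JOIN form (booking 1 has two matching inclusions and would otherwise appear twice), while IN and EXISTS never multiply outer rows.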
EXPLAIN should give you some clues (see the MySQL EXPLAIN syntax documentation).
I suspect that the IN version is constructing a list which is then scanned for each item (IN is generally considered a very inefficient construct; I only use it if I have a short list of items to enter manually).
The JOIN is more likely constructing a temp table for the results, making it more like normal JOINs between tables.
You should explore this by using EXPLAIN, as said by Ollie.
But in advance, note that the second command has one more filter: id = foreign_key_booking_record.
Check if this has the same performance:
SELECT
id,
agent,
source
FROM
booking_record
WHERE
id IN
( SELECT DISTINCT
foreign_key_booking_record
FROM
inclusions
WHERE
id = foreign_key_booking_record -- new filter
AND
foreign_key_bill IS NULL
AND
invoice_closure <> FALSE
)

How to determine what is more effective: DISTINCT or WHERE EXISTS?

For example, I have 3 tables: user, group and permission, and two many2many relationships between them: user_groups and group_permissions.
I need to select all permissions of given user, without repeats. Every time I encounter a similar problem, I can not determine which version of a query better:
SELECT permisson_id FROM group_permission WHERE EXISTS(
SELECT 1 FROM user_groups
WHERE user_groups.user_id = 42
AND user_groups.group_id = group_permission.group_id
)
SELECT DISTINCT permisson_id FROM group_permission
INNER JOIN user_groups ON user_groups.user_id = 42
AND user_groups.group_id = group_permission.group_id
I have enough experience to draw conclusions based on EXPLAIN. The first query has a subquery, but in my experience the first query is faster, perhaps because of the large number of filtered permissions in the result.
What would you do in this situation? Why?
Thanks!
Use EXISTS Rather than DISTINCT
You can suppress the display of duplicate rows using DISTINCT; you use EXISTS to check for the existence of rows returned by a subquery. Whenever possible, you should use EXISTS rather than DISTINCT because DISTINCT sorts the retrieved rows before suppressing the duplicate rows.
In your case there would be many duplicate rows, so EXISTS should be faster.
Source: http://my.safaribooksonline.com/book/-/9780072229813/high-performance-sql-tuning/ch16lev1sec10
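Before comparing speed, it may be worth checking that the two queries return the same rows on your data. A minimal sketch, assuming Python's built-in sqlite3 in place of MySQL and invented rows (the column name permisson_id is spelled as in the question): EXISTS avoids the duplicates that the join would introduce, but it does not remove duplicates already present in group_permission, for example when the same permission is attached to two of the user's groups.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MySQL (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user_groups (user_id INTEGER NOT NULL, group_id INTEGER NOT NULL);
    CREATE TABLE group_permission (group_id INTEGER NOT NULL, permisson_id INTEGER NOT NULL);
    -- User 42 belongs to groups 1 and 2; permission 100 is granted by both.
    INSERT INTO user_groups VALUES (42, 1), (42, 2), (7, 3);
    INSERT INTO group_permission VALUES (1, 100), (1, 101), (2, 100), (3, 102);
""")

exists_rows = sorted(row[0] for row in conn.execute("""
    SELECT permisson_id FROM group_permission WHERE EXISTS (
        SELECT 1 FROM user_groups
        WHERE user_groups.user_id = 42
          AND user_groups.group_id = group_permission.group_id
    )
"""))

distinct_rows = sorted(row[0] for row in conn.execute("""
    SELECT DISTINCT permisson_id FROM group_permission
    INNER JOIN user_groups ON user_groups.user_id = 42
        AND user_groups.group_id = group_permission.group_id
"""))

print(exists_rows)    # permission 100 shows up once per granting group
print(distinct_rows)  # each permission exactly once
```

If permissions can be shared between groups, the EXISTS version therefore still needs a DISTINCT (or a different outer table) to meet the "without repeats" requirement.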

Which MySQL Query is faster?

Which query will execute faster, and which is the better query?
SELECT
COUNT(*) AS count
FROM
students
WHERE
status = 1
AND
classes_id IN(
SELECT
id
FROM
classes
WHERE
departments_id = 1
);
Or
SELECT
COUNT(*) AS count
FROM
students s
LEFT JOIN
classes c
ON
c.id = s.classes_id
WHERE
status = 1
AND
c.departments_id = 1
I have shown two queries; both output the same result. Now I want to know which method will execute faster and which is the correct way.
You should always use EXPLAIN to determine how your query will run.
Unfortunately, MySQL will execute your subquery as a DEPENDENT SUBQUERY, which means that the subquery will be run for each row in the outer query. You'd think MySQL would be smart enough to detect that the subquery isn't a correlated subquery and would run it just once; alas, it's not yet that smart.
So, MySQL will scan through all of the rows in students, running the subquery for each row, and not utilizing any indexes on the outer query whatsoever.
Writing the query as a JOIN would allow MySQL to utilize indexes, and the following query would be the optimum way to write it:
SELECT COUNT(*) AS count
FROM students s
JOIN classes c
ON c.id = s.classes_id
AND c.departments_id = 1
WHERE s.status = 1
This would utilize the following indexes:
students(`status`)
classes(`id`, `departments_id`) : multi-column index
From a design and clarity standpoint I'd avoid inner selects like the first one. It is true that to be 100% sure whether or how each query will be optimized, and which will run 'better', you need to see how the SQL server you're using interprets it and what plan it produces. In MySQL, use EXPLAIN.
However, even without seeing this, my money is still on the JOIN-only version. The inner select version has to perform the inner select in its entirety before determining the values to use inside the IN clause. I know this to be true when you wrap things in functions, and I'm pretty sure it's true when putting a select in as IN arguments. I also know that it's a good way to neutralize any benefit you might have from indexes on the tables inside the inner select.
I'm generally of the opinion that inner selects are only really needed for very rare query situations. Usually, those who use them often are thinking like traditional iterative flow programmers, not really thinking in relational DB result set terms.
EXPLAIN both queries individually.
The difference between the two queries is subqueries vs. joins.
Joins are usually faster than subqueries. With a join, the optimizer can build an execution plan and predict which data will be processed, which saves time. Subqueries, on the other hand, may be run in full until all their data is loaded. Many developers use subqueries because they are more readable than joins, but where performance matters, a JOIN is the better solution.
The best way to find out is to measure it:
Without index
Query 1: 0.9s
Query 2: 0.9s
With index
Query 1: 0.4s
Query 2: 0.2s
The conclusion is:
If you don't have indexes then it makes no difference which query you use.
The join is faster if you have the right index.
The effect of adding the correct index is greater than the effect of choosing the right query. If performance matters, make sure you have the correct indexes.
Of course, your results may vary depending on the MySQL version and the distribution of data you have.
Here's how I tested it:
1,000,000 students (25% with status 1).
50,000 classes.
10 departments.
Here's the SQL I used to create the test data:
CREATE TABLE students
(id INT PRIMARY KEY AUTO_INCREMENT,
status int NOT NULL,
classes_id int NOT NULL);
CREATE TABLE classes
(id INT PRIMARY KEY AUTO_INCREMENT,
departments_id INT NOT NULL);
CREATE TABLE numbers(id INT PRIMARY KEY AUTO_INCREMENT);
INSERT INTO numbers VALUES (),(),(),(),(),(),(),(),(),();
INSERT INTO numbers
SELECT NULL
FROM numbers AS n1
CROSS JOIN numbers AS n2
CROSS JOIN numbers AS n3
CROSS JOIN numbers AS n4
CROSS JOIN numbers AS n5
CROSS JOIN numbers AS n6;
INSERT INTO classes (departments_id)
SELECT id % 10 FROM numbers WHERE id <= 50000;
INSERT INTO students (status, classes_id)
SELECT id % 4 = 0, id % 50000 + 1 FROM numbers WHERE id <= 1000000;
SELECT COUNT(*) AS count
FROM students
WHERE status = 1
AND classes_id IN (SELECT id FROM classes WHERE departments_id = 1);
SELECT COUNT(*) AS count
FROM students s
LEFT JOIN classes c
ON c.id = s.classes_id
WHERE status = 1
AND c.departments_id = 1;
CREATE INDEX ix_students ON students(status, classes_id);
The two queries are not guaranteed to produce the same results:
SELECT
COUNT(*) AS count
FROM
students
WHERE
status = 1
AND
classes_id IN(
SELECT
id
FROM
classes
WHERE
departments_id = 1
);
...will return the number of rows in the students table that have status 1 and a classes_id that also appears as the id of a classes row with a departments_id of 1.
SELECT
COUNT(*) AS count
FROM
students s
LEFT JOIN
classes c
ON
c.id = s.classes_id
WHERE
status = 1
AND
c.departments_id = 1
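...will count one row for every students row with status 1 joined to a classes row with departments_id 1. Because the WHERE clause tests c.departments_id = 1, students without a matching class (where the c columns are NULL) are filtered out, so this LEFT JOIN effectively behaves like an INNER JOIN. The count can still exceed that of the IN version if classes.id is not unique, because each student is then counted once per matching class.
To make the intent explicit, change the LEFT JOIN to an INNER JOIN so that only rows satisfying both conditions are matched. A LEFT JOIN with the departments_id condition moved into the ON clause would instead count every student with status 1.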
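How the LEFT JOIN behaves depends on where the departments_id condition sits, which can be checked on a handful of rows. This is a minimal sketch, assuming Python's built-in sqlite3 in place of MySQL and invented data; student 3 deliberately points at a class that does not exist.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MySQL (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE classes (id INTEGER PRIMARY KEY, departments_id INTEGER NOT NULL);
    CREATE TABLE students (
        id INTEGER PRIMARY KEY,
        status INTEGER NOT NULL,
        classes_id INTEGER NOT NULL
    );
    INSERT INTO classes VALUES (1, 1), (2, 2);
    INSERT INTO students VALUES (1, 1, 1), (2, 1, 2), (3, 1, 99), (4, 0, 1);
""")

def count(sql):
    """Return the single COUNT(*) value produced by the query."""
    return conn.execute(sql).fetchone()[0]

in_count = count("""
    SELECT COUNT(*) FROM students
    WHERE status = 1
      AND classes_id IN (SELECT id FROM classes WHERE departments_id = 1)
""")

# Filter on c.departments_id in WHERE: NULL-extended rows are discarded.
left_where_count = count("""
    SELECT COUNT(*) FROM students s
    LEFT JOIN classes c ON c.id = s.classes_id
    WHERE s.status = 1 AND c.departments_id = 1
""")

inner_count = count("""
    SELECT COUNT(*) FROM students s
    INNER JOIN classes c ON c.id = s.classes_id
    WHERE s.status = 1 AND c.departments_id = 1
""")

# Filter moved into ON: every student with status 1 survives the LEFT JOIN.
left_on_count = count("""
    SELECT COUNT(*) FROM students s
    LEFT JOIN classes c ON c.id = s.classes_id AND c.departments_id = 1
    WHERE s.status = 1
""")

print(in_count, left_where_count, inner_count, left_on_count)
```

With classes.id unique, the first three counts agree; only the variant with the filter in the ON clause counts all students with status 1.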
Run EXPLAIN SELECT ... on both queries and check which one does what ;)
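The workflow of inspecting both plans can be sketched as follows. Note the assumptions: this uses Python's built-in sqlite3 and SQLite's EXPLAIN QUERY PLAN, not MySQL's EXPLAIN, so the output format differs (MySQL reports columns such as select_type, key and rows) and the plans chosen will not match MySQL's. It only illustrates comparing the two queries side by side.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MySQL (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (
        id INTEGER PRIMARY KEY,
        status INTEGER NOT NULL,
        classes_id INTEGER NOT NULL
    );
    CREATE TABLE classes (id INTEGER PRIMARY KEY, departments_id INTEGER NOT NULL);
    CREATE INDEX ix_students ON students(status, classes_id);
""")

queries = {
    "in_subquery": """
        SELECT COUNT(*) FROM students
        WHERE status = 1
          AND classes_id IN (SELECT id FROM classes WHERE departments_id = 1)
    """,
    "join": """
        SELECT COUNT(*) FROM students s
        JOIN classes c ON c.id = s.classes_id AND c.departments_id = 1
        WHERE s.status = 1
    """,
}

plans = {}
for label, sql in queries.items():
    # Each plan row ends with a human-readable description of one step.
    plans[label] = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]
    print(label)
    for step in plans[label]:
        print("   ", step)
```

On a real MySQL server the equivalent is to prefix each statement with EXPLAIN and compare which indexes are used and how many rows are examined.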