MySQL(version 5.5): Why `JOIN` is faster than `IN` clause? - mysql

[Summary of the question: 2 SQL statements produce same results, but at different speeds. One statement uses JOIN, other uses IN. JOIN is faster than IN]
I tried a 2 kinds of SELECT statement on 2 tables, named booking_record and inclusions. The table inclusions has a many-to-one relation with table booking_record.
(Table definitions not included for simplicity.)
First statement: (using IN clause)
SELECT
id,
agent,
source
FROM
booking_record
WHERE
id IN
( SELECT DISTINCT
foreign_key_booking_record
FROM
inclusions
WHERE
foreign_key_bill IS NULL
AND
invoice_closure <> FALSE
)
Second statement: (using JOIN)
SELECT
id,
agent,
source
FROM
booking_record
JOIN
( SELECT DISTINCT
foreign_key_booking_record
FROM
inclusions
WHERE
foreign_key_bill IS NULL
AND
invoice_closure <> FALSE
) inclusions
ON
id = foreign_key_booking_record
with 300,000+ rows in booking_record-table and 6,100,000+ rows in inclusions-table; the 2nd statement delivered 127 rows in just 0.08 seconds, but the 1st statement took nearly 21 minutes for same records.
Why JOIN is so much faster than IN clause?

This behavior is well-documented. See here.
The short answer is that until MySQL version 5.6.6, MySQL did a poor job of optimizing these types of queries. What would happen is that the subquery would be run each time for every row in the outer query. Lots and lots of overhead, running the same query over and over. You could improve this by using good indexing and removing the distinct from the in subquery.
This is one of the reasons that I prefer exists instead of in, if you care about performance.

EXPLAIN should give you some clues (Mysql Explain Syntax
I suspect that the IN version is constructing a list which is then scanned by each item (IN is generally considered a very inefficient construct, I only use it if I have a short list of items to manually enter).
The JOIN is more likely constructing a temp table for the results, making it more like normal JOINs between tables.

You should explore this by using EXPLAIN, as said by Ollie.
But in advance, note that the second command has one more filter: id = foreign_key_booking_record.
Check if this has the same performance:
SELECT
id,
agent,
source
FROM
booking_record
WHERE
id IN
( SELECT DISTINCT
foreign_key_booking_record
FROM
inclusions
WHERE
id = foreign_key_booking_record -- new filter
AND
foreign_key_bill IS NULL
AND
invoice_closure <> FALSE
)

Related

Join Performances When Searching For NULL Value

I need to find a value that exists in LoyaltyTransactionBasketItemStores table but not in DimProductConsolidate table. I need the item code and its corresponding company. This is my query
SELECT
A.ProductReference, A.CompanyCode
FROM
(SELECT ProductReference, CompanyCode FROM dwhdb.LoyaltyTransactionsBasketItemsStores GROUP BY ProductReference) A
LEFT JOIN
(SELECT LoyaltyVariantArticleCode FROM dwhdb.DimProductConsolidate) B ON B.LoyaltyVariantArticleCode = A.ProductReference
WHERE
B.LoyaltyVariantArticleCode IS NULL
It is a pretty straight forward query. But when I run it, it's taking 1 hour and still not finish. Then I use EXPLAIN and this is the result
But when I remove the CompanyCode from my query, its performance is increasing a lot. This is the EXPLAIN result
I want to know why is this happening and is there any way to get ProductReference and its company with a lot more better performance?
Your current query is rife with syntax and structural errors. I would use exists logic here:
SELECT a.ProductReference, a.CompanyCode
FROM dwhdb.LoyaltyTransactionsBasketItemsStores a
WHERE NOT EXISTS (SELECT 1 FROM dwhdb.DimProductConsolidate b
WHERE b.LoyaltyVariantArticleCode = a.ProductReference);
Your current query is doing a GROUP BY in the first subquery, but you never select aggregates, but rather other non aggregate columns. On most other databases, and even on MySQL in strict mode, this syntax is not allowed. Also, there is no need to have 2 subqueries here. Rather, just select from the basket table and then assert that matching records do not exist in the other table.

Query performance issue with multiple left joins

In mysql v8.x I have a table with about 7000 records in it. I'm trying to create a single query combines two subqueries of the same table.
I thought I could achieve this by left joining on the subqueries and then matching on any records that have values for these as shown in the example below (note: this effect happens when my_table has just just an id column).
The query seems to work quickly when the subqueries return records but not when the subqueries return empty (which I've recreated in the example below with WHERE FALSE). When this happens there is a situation where executing these queries on their own that each take a millisecond or so, takes 12 seconds.
My understanding is that these these joins should return the same number of rows as the source table and as such there shouldn't be such a big difference. I'm interested in understanding how the join works in this type of case and why it's producing such a difference in execution time.
SELECT my_table.* FROM accessory_requests
LEFT JOIN
( SELECT my_table.id
FROM my_table
WHERE FALSE
) as join1
ON join1.id = my_table.id
LEFT JOIN
( SELECT my_table.id
FROM my_table
WHERE FALSE
) as join2
ON join2.id = my_table.id
WHERE join1.id IS NOT NULL OR join2.id IS NOT NULL;
Your query is all messed up and it is not really clear what you are trying to do.
However, I can comment on your performance issues. MySQL has a tendency to materialize subqueries in the FROM clause. That means that a new copy of the table is created. In doing so, indexes are lost on the table. So, eliminate the subqueries in the FROM clause.
If you ask another question with sample data, desired results, and a decent explanation, then it might be possible to help with a more efficient form of the query. I suspect you just want not exists, but that is a rather large leap from this question.
combines two subqueries of the same table.
What do you mean?
If you want to take the rows from each subquery, then simply do
( SELECT ... ) -- what you are calling the first subquery
UNION
( SELECT ... ) -- 2nd
Also,
LEFT JOIN ( ... ) as join1 ON ...
WHERE join1.id IS NOT NULL;
is probably the same as simply
JOIN ( ... ) as join1 ON ...
If by "combining" you mean to have multiple columns, then see the tag [pivot-table].

How to avoid running an expensive sub-query twice in a union

I want to union two queries. Both queries use an inner join into a data set, that is very intensive to compute, but the dataset query is the same for both queries. For example:
SELECT veggie_id
FROM potatoes
INNER JOIN ( [...] ) massive_market
ON massive_market.potato_id=potatoes.potato_id
UNION
SELECT veggie_id
FROM carrots
INNER JOIN ( [...] ) massive_market
ON massive_market.carrot_id=carrots.carrot_id
Where [...] corresponds to a subquery that takes a second to compute, and returns rows of at least carrot_id and potato_id.
I want to avoid having the query for massive_market [...] twice in my overal query.
Whats the best way to do this?
If that subquery takes more than a second to run, I'd say it's down to an indexing issue as opposed to the query itself (of course, without seeing that query, that is somewhat conjecture, I'd recommend posting that query too). In my experience, 9/10 slow queries issues are down to improper indexing of the database.
Ensure veggie_id, potato_id and carrot_id are indexed
Also, if you're using any joins in the massive_market subquery, ensure the columns you're performing the joins on are indexed too.
Edit
If indexing has been done properly, the only other solution I can think of off the top of my head is:
CREATE TEMPORARY TABLE tmp_veggies (potato_id [datatype], carrot_id [datatype]);
INSERT IGNORE INTO tmp_veggies (potato_id, carrot_id) select potatoes.veggie_id, carrots.veggie_id from [...] massive_market
RIGHT OUTER JOIN potatoes on massive_market.potato_id = potatoes.potato_id
RIGHT OUTER JOIN carrots on massive_market.carrot_id = carrots.carrot_id;
SELECT carrot_id FROM tmp_veggies
UNION
SELECT potato_id FROM tmp_veggies;
This way, you've reversed the query so it's only running the massive subquery once and the UNION is happening on the temporary table (which'll be dropped automatically but not until the connection is closed, so you may want to drop the table manually). You can add any additional columns you need into the CREATE TEMPORARY TABLE and SELECT statement
The goal is to pull all repeated query-strings out of the list of query-strings requiring the repeated query-strings. So I kept potatoes and carrots within one unionizing subquery, and placed massive_market afterwards and outside this unification.
This seems obrvious, but my question originated from a much more complex query, and the work needed to pull this strategy off was a bit more involving in my case. For my simple example in my question above, this would resolve in something like:
SELECT veggie_id
FROM (
SELECT veggie_id, potato_id, NULL AS carrot_id FROM potatoes
UNION
SELECT veggie_id, NULL AS potato_id, carrot_id FROM carrots
) unionized
INNER JOIN ( [...] ) massive_market
ON massive_market.potato_id=unionized.potato_id
OR massive_market.carrot_id=unionized.carrot_id

MySQL slow query when combining two very fast queries

I have two MySQL queries that run very fast, but when I combine them the new query is very slow.
Fast (<1 second, 15 results):
SELECT DISTINCT
Id, Name, Company_Id
FROM people
where Company_Id in (5295, 1834)
and match(Locations) against('austin')
Fast (<1 second, 2970 results):
select distinct Company_Id from technologies
where match(Name) against('elastic')
and Company_Id is not null
When I combine the two like this:
SELECT DISTINCT Id, Name, Company_Id
FROM people
where Company_Id in
( select Company_Id from technologies
where match(Name) against('elastic')
and Company_Id is not null
)
and match(Locations) against('austin')
The result query takes over 2 minutes to complete. It has 278 rows hit.
I've tried rewriting the slow query a few ways. One other example is like this:
SELECT DISTINCT
`Extent1`.`Id`, `Extent1`.`Name`, `Extent1`.`Company_Id`
FROM `people` AS `Extent1`
INNER JOIN `technologies` AS `Extent2`
ON (`Extent1`.`Company_Id` = `Extent2`.`Company_Id`)
WHERE (`Extent1`.`Company_Id` IS NOT NULL)
AND ((match(`Extent1`.`Locations`) against('austin'))
AND (match(`Extent2`.`Name`) against('elastic')))
I'm using MySQL 5.7 on Windows. I have full text index on the Name and Location columns. My InnoDB Buffer Usage never goes above 40%. I tried to use MySQL workbench to look at the execution plan, but it shows "Explain data not available for statement"
Please let me know if you see anything I could improve or try. Thank you.
IN ( SELECT ... ) is poorly optimized, at least in older versions of MySQL. What version are you using?
When using a FULLTEXT index (MATCH...), that part is performed first, if possible. This is because nearly always the FT lookup is faster than whatever else is going on.
But when using two fulltext queries, it picks one, then can't use fulltext on the other.
Here's one possible workaround:
Have a extra table for searches. It includes both Name and Locations in it.
Have FULLTEXT(Name, Locations)
MATCH (Name, Locations) AGAINST ('+austin +elastic' IN BOOLEAN MODE)
If necessary, AND that with something to verify that it is not, for example, finding a person named 'Austin'.
Another possibility:
5.7 (or 5.6?) might be able to optimize this by creating indexes on the subqueries:
SELECT ...
FROM ( SELECT Company_Id FROM ... MATCH(Name) ... ) AS x
JOIN ( SELECT Company_Id FROM ... MATCH(Locations) ... ) AS y
USING(Company_id);
Provide the EXPLAIN; I am hoping to see <auto-key>.
Test that. If it is 'fast', then you may need to add on another JOIN and/or WHERE. (I am unclear what your ultimate query needs to be.)
Write the query with the subquery in the from clause:
select distinct p.Id, p.Name, p.Company_Id
from people p join
(select Company_Id
from technologies
where match(Name) against('elastic') and Company_Id is not null
) t
on p.Company_Id = t.Company_Id
where match(p.Locations) against ('austin');
I suspect that you have a problem with your data structure, though. You should have a CompanyLocations table, rather than storing locations in a list in the table.

SQL query takes too much time (3 joins)

I'm facing an issue with an SQL Query. I'm developing a php website, and to avoid making too much queries, I prefer to make a big one looking like :
select m.*, cj.*, cjb.*, me.pseudo as pseudo_acheteur
from mercato m
JOIN cartes_joueur cj
ON m.ID_carte = cj.ID_carte_joueur
JOIN cartes_joueur_base cjb
ON cj.ID_carte_joueur_base = cjb.ID_carte_joueur_base
JOIN membres me
ON me.ID_membre = cj.ID_membre
where not exists (select * from mercato_encheres me where me.ID_mercato = m.ID_mercato)
and cj.ID_membre = 2
and m.status <> 'cancelled'
ORDER BY total_carac desc, cj.level desc, cjb.nom_carte asc
This should return all cards sold by the member without any bet on it. In the result, I need all the information to display them.
Here is the approximate rows in each table :
mercato : 1200
cartes_joueur : 800 000
carte_joueur_base : 62
membres : 2000
mercato_enchere : 15 000
I tried to reduce them (in dev environment) by deleting old data; but the query still needs 10~15 seconds to execute (which is way too long on a website )
Thanks for your help.
Let's take a look.
The use of * in SELECT clauses is harmful to query performance. Why? It's wasteful. It needlessly adds to the volume of data the server must process, and in the case of JOINs, can force the processing of columns with duplicate values. If you possibly can do so, try to enumerate the columns you need.
You may not have useful indexes on your tables for accelerating this. We can't tell. Please notice that MySQL can't exploit multiple indexes in a single query, so to make a query fast you often need a well-chosen compound index. I suggest you try defining the index (ID_membre, ID_carte_jouer, ID_carte_joueur_base) on your cartes_joueur table. Why? Your query matches for equality on the first of those columns, and then uses the second and third column in ON conditions.
I have often found that writing a query with the largest table (most rows) first helps me think clearly about optimizing. In your case your largest table is cartes_jouer and you are choosing just one ID_membre value from that table. Your clearest path to optimization is the knowledge that you only need to examine approximately 400 rows from that table, not 800 000. An appropriate compound index will make that possible, and it's easiest to imagine that index's columns if the table comes first in your query.
You have a correlated subquery -- this one.
where not exists (select *
from mercato_encheres me
where me.ID_mercato = m.ID_mercato)
MySQL's query planner can be stupidly literal-minded when it sees this, running it thousands of times. In your case it's even worse: it's got SELECT * in it: see point 1 above.
It should be refactored to use the LEFT JOIN ... IS NULL pattern. Here's how that goes.
select whatever
from mercato m
JOIN ...
JOIN ...
LEFT JOIN mercato_encheres mench ON mench.ID_mercato = m.ID_mercato
WHERE mench.ID_mercato IS NULL
and ...
ORDER BY ...
Explanation: The use of LEFT JOIN rather than ordinary inner JOIN allows rows from the mercato table to be preserved in the output even when the ON condition does not match them to tables in the mercato_encheres table. The mismatching rows get NULL values for the second table. The mench.ID_mercato IS NULL condition in the WHERE clause then selects only the mismatching rows.