How to avoid running an expensive sub-query twice in a union

I want to union two queries. Both use an inner join against a data set that is very expensive to compute, and the data set query is the same in both. For example:
SELECT veggie_id
FROM potatoes
INNER JOIN ( [...] ) massive_market
ON massive_market.potato_id=potatoes.potato_id
UNION
SELECT veggie_id
FROM carrots
INNER JOIN ( [...] ) massive_market
ON massive_market.carrot_id=carrots.carrot_id
Here [...] stands for a subquery that takes a second to compute and returns rows containing at least carrot_id and potato_id.
I want to avoid having the query for massive_market [...] twice in my overall query.
What's the best way to do this?

If that subquery takes more than a second to run, I'd say it's down to an indexing issue rather than the query itself (of course, without seeing the query, that is somewhat conjecture; I'd recommend posting it too). In my experience, 9 out of 10 slow-query issues are down to improper indexing of the database.
Ensure veggie_id, potato_id and carrot_id are indexed.
Also, if you're using any joins in the massive_market subquery, ensure the columns you're joining on are indexed too.
Edit
If indexing has been done properly, the only other solution I can think of off the top of my head is:
CREATE TEMPORARY TABLE tmp_veggies (potato_id [datatype], carrot_id [datatype]);
INSERT IGNORE INTO tmp_veggies (potato_id, carrot_id) select potatoes.veggie_id, carrots.veggie_id from [...] massive_market
RIGHT OUTER JOIN potatoes on massive_market.potato_id = potatoes.potato_id
RIGHT OUTER JOIN carrots on massive_market.carrot_id = carrots.carrot_id;
SELECT carrot_id FROM tmp_veggies
UNION
SELECT potato_id FROM tmp_veggies;
This way, you've reversed the query so that it only runs the massive subquery once, and the UNION happens on the temporary table (which will be dropped automatically, but not until the connection is closed, so you may want to drop the table manually). You can add any additional columns you need to the CREATE TEMPORARY TABLE and SELECT statements.
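On MySQL 8.0 or later, a common table expression (WITH) is another way to write the subquery only once, without a temporary table. The sketch below is an assumption-laden illustration rather than the answerer's method: it uses Python's sqlite3 module as a stand-in for MySQL so that it runs anywhere, the market table and all rows are made up, and the "expensive" subquery is a trivial placeholder. It only checks that the CTE form returns the same rows as the duplicated form; whether your server actually evaluates the CTE once is worth confirming with EXPLAIN.

```python
import sqlite3

# SQLite stands in for MySQL so the sketch runs anywhere. The table contents
# are made up, and the "expensive" subquery is a trivial placeholder.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE potatoes (veggie_id INTEGER, potato_id INTEGER);
    CREATE TABLE carrots  (veggie_id INTEGER, carrot_id INTEGER);
    CREATE TABLE market   (potato_id INTEGER, carrot_id INTEGER);
    INSERT INTO potatoes VALUES (1, 10), (2, 11), (3, 12);
    INSERT INTO carrots  VALUES (4, 20), (5, 21);
    INSERT INTO market   VALUES (10, 20), (11, 99), (98, 21);
""")

# Original shape: the subquery text appears twice.
duplicated_sql = """
    SELECT veggie_id FROM potatoes
    INNER JOIN (SELECT potato_id, carrot_id FROM market) massive_market
        ON massive_market.potato_id = potatoes.potato_id
    UNION
    SELECT veggie_id FROM carrots
    INNER JOIN (SELECT potato_id, carrot_id FROM market) massive_market
        ON massive_market.carrot_id = carrots.carrot_id
"""

# CTE shape: the subquery is written once and referenced by name twice.
cte_sql = """
    WITH massive_market AS (SELECT potato_id, carrot_id FROM market)
    SELECT veggie_id FROM potatoes
    INNER JOIN massive_market
        ON massive_market.potato_id = potatoes.potato_id
    UNION
    SELECT veggie_id FROM carrots
    INNER JOIN massive_market
        ON massive_market.carrot_id = carrots.carrot_id
"""

duplicated = sorted(row[0] for row in conn.execute(duplicated_sql))
with_cte = sorted(row[0] for row in conn.execute(cte_sql))
print(duplicated, with_cte)
```

Both lists should contain the same veggie_id values, which is the only property this sketch demonstrates.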

The goal is to pull the repeated subquery out of the queries that each need it. So I kept potatoes and carrots together in one UNION subquery, and joined massive_market once, outside that union.
This seems obvious, but my question originated from a much more complex query, and the work needed to pull this strategy off was a bit more involved in my case. For the simple example in my question above, this would result in something like:
SELECT veggie_id
FROM (
SELECT veggie_id, potato_id, NULL AS carrot_id FROM potatoes
UNION
SELECT veggie_id, NULL AS potato_id, carrot_id FROM carrots
) unionized
INNER JOIN ( [...] ) massive_market
ON massive_market.potato_id=unionized.potato_id
OR massive_market.carrot_id=unionized.carrot_id
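One detail worth checking when applying this rewrite: the original query deduplicated veggie_id through UNION, while here the UNION runs before the join, so a row that matches several massive_market rows can appear more than once unless the outer SELECT uses DISTINCT. The sketch below illustrates this under stated assumptions: SQLite (via Python's sqlite3) stands in for MySQL, the market table and its rows are invented, and the placeholder subquery is trivial. An OR in a join condition can also make index use harder, so EXPLAIN is worth a look on real data.

```python
import sqlite3

# SQLite stands in for MySQL; all rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE potatoes (veggie_id INTEGER, potato_id INTEGER);
    CREATE TABLE carrots  (veggie_id INTEGER, carrot_id INTEGER);
    CREATE TABLE market   (potato_id INTEGER, carrot_id INTEGER);
    INSERT INTO potatoes VALUES (1, 10), (2, 11);
    INSERT INTO carrots  VALUES (3, 20), (4, 21);
    -- potato 10 and carrot 20 each match two market rows
    INSERT INTO market   VALUES (10, 20), (10, 21), (99, 20);
""")

original_sql = """
    SELECT veggie_id FROM potatoes
    INNER JOIN (SELECT potato_id, carrot_id FROM market) massive_market
        ON massive_market.potato_id = potatoes.potato_id
    UNION
    SELECT veggie_id FROM carrots
    INNER JOIN (SELECT potato_id, carrot_id FROM market) massive_market
        ON massive_market.carrot_id = carrots.carrot_id
"""

# Union first, join once. {distinct} is filled in below to compare both forms.
join_once_template = """
    SELECT {distinct} veggie_id
    FROM (
        SELECT veggie_id, potato_id, NULL AS carrot_id FROM potatoes
        UNION
        SELECT veggie_id, NULL AS potato_id, carrot_id FROM carrots
    ) unionized
    INNER JOIN (SELECT potato_id, carrot_id FROM market) massive_market
        ON massive_market.potato_id = unionized.potato_id
        OR massive_market.carrot_id = unionized.carrot_id
"""

def run(sql):
    """Return the veggie_id values of a query as a sorted list."""
    return sorted(row[0] for row in conn.execute(sql))

original = run(original_sql)
join_once_plain = run(join_once_template.format(distinct=""))
join_once_distinct = run(join_once_template.format(distinct="DISTINCT"))
print(original, join_once_plain, join_once_distinct)
```

With this sample data the plain form returns repeated ids, and only the DISTINCT form matches the original UNION result.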

Related

Query performance issue with multiple left joins

In MySQL v8.x I have a table with about 7000 records in it. I'm trying to create a single query that combines two subqueries of the same table.
I thought I could achieve this by left joining on the subqueries and then matching on any records that have values for these, as shown in the example below (note: this effect happens even when my_table has just an id column).
The query runs quickly when the subqueries return records, but not when they return nothing (which I've recreated in the example below with WHERE FALSE). In that case, subqueries that each take a millisecond or so on their own produce a combined query that takes 12 seconds.
My understanding is that these joins should return the same number of rows as the source table, so there shouldn't be such a big difference. I'm interested in understanding how the join works in this type of case and why it produces such a difference in execution time.
SELECT my_table.* FROM accessory_requests
LEFT JOIN
( SELECT my_table.id
FROM my_table
WHERE FALSE
) as join1
ON join1.id = my_table.id
LEFT JOIN
( SELECT my_table.id
FROM my_table
WHERE FALSE
) as join2
ON join2.id = my_table.id
WHERE join1.id IS NOT NULL OR join2.id IS NOT NULL;

Your query is all messed up and it is not really clear what you are trying to do.
However, I can comment on your performance issues. MySQL has a tendency to materialize subqueries in the FROM clause. That means that a new copy of the table is created. In doing so, indexes are lost on the table. So, eliminate the subqueries in the FROM clause.
If you ask another question with sample data, desired results, and a decent explanation, then it might be possible to help with a more efficient form of the query. I suspect you just want not exists, but that is a rather large leap from this question.

"combines two subqueries of the same table."
What do you mean?
If you want to take the rows from each subquery, then simply do
( SELECT ... ) -- what you are calling the first subquery
UNION
( SELECT ... ) -- 2nd
Also,
LEFT JOIN ( ... ) as join1 ON ...
WHERE join1.id IS NOT NULL;
is probably the same as simply
JOIN ( ... ) as join1 ON ...
If by "combining" you mean to have multiple columns, then see the tag [pivot-table].
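The claim that LEFT JOIN followed by an IS NOT NULL filter on the joined side behaves like a plain JOIN can be checked on a toy table. This is a minimal sketch, assuming SQLite (through Python's sqlite3) as a stand-in for MySQL; the four rows and the `id > 2` filter are made up purely so the subquery returns something.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE my_table (id INTEGER PRIMARY KEY);
    INSERT INTO my_table VALUES (1), (2), (3), (4);
""")

# LEFT JOIN keeps every my_table row, then the WHERE clause throws away
# the rows that found no partner in join1.
left_join_filtered_sql = """
    SELECT my_table.id
    FROM my_table
    LEFT JOIN (SELECT id FROM my_table WHERE id > 2) AS join1
        ON join1.id = my_table.id
    WHERE join1.id IS NOT NULL
"""

# A plain (inner) JOIN never produces the unmatched rows in the first place.
inner_join_sql = """
    SELECT my_table.id
    FROM my_table
    JOIN (SELECT id FROM my_table WHERE id > 2) AS join1
        ON join1.id = my_table.id
"""

left_join_filtered = sorted(r[0] for r in conn.execute(left_join_filtered_sql))
inner_join = sorted(r[0] for r in conn.execute(inner_join_sql))
print(left_join_filtered, inner_join)
```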

how to convert left join to sub query?

I'm a beginner in MySQL. I have written a query using LEFT JOIN to get the columns mentioned in the query, and I want to convert it to a subquery. Please help me out.
SELECT b.service_status,
s.b2b_acpt_flag,
b2b.b2b_check_in_report,
b2b.b2b_swap_flag
FROM user_booking_tb AS b
LEFT JOIN b2b.b2b_booking_tbl AS b2b ON b.booking_id=b2b.gb_booking_id
LEFT JOIN b2b.b2b_status AS s ON b2b.b2b_booking_id = s.b2b_booking_id
WHERE b.booking_id='$booking_id'

In this case I would actually recommend the join, which should generally be quicker as long as you have proper indexes on the joining columns in both tables.
Even with subqueries, you will still want those same indexes.
The size and nature of your actual data will affect performance, so to know for sure you are best off testing both options and measuring the results. However, beware that the optimal query can change as your tables grow.
SELECT b.service_status,
(SELECT s.b2b_acpt_flag FROM b2b.b2b_status AS s JOIN b2b.b2b_booking_tbl AS t ON t.b2b_booking_id = s.b2b_booking_id WHERE t.gb_booking_id = b.booking_id) AS b2b_acpt_flag,
(SELECT b2b_check_in_report FROM b2b.b2b_booking_tbl WHERE gb_booking_id = b.booking_id) AS b2b_check_in_report,
(SELECT b2b_swap_flag FROM b2b.b2b_booking_tbl WHERE gb_booking_id = b.booking_id) AS b2b_swap_flag
FROM user_booking_tb AS b
WHERE b.booking_id='$booking_id'
To dig into how this query works: you are effectively performing 3 additional queries for each and every row returned by the main query.
If b.booking_id='$booking_id' is unique, this is an extra 3 queries, but if there may be multiple entries, this could multiply and become quite slow.
Each of these extra queries will be fast: no network overhead, a single row, hopefully matching on a primary key. So 3 extra queries add only nominal overhead, as long as the row count is low.
A join results in a single query across 2 indexed tables, which will often shave a few milliseconds off.
Another instance where a subquery may work is where you are filtering the results rather than adding extra columns to output.
SELECT b.*
FROM user_booking_tb AS b
WHERE b.booking_id in (SELECT booking_id FROM othertable WHERE this=this and that=that)
Which is more efficient depends on how large the typical list of booking_id values is.

SQL query takes too much time (3 joins)

I'm facing an issue with an SQL query. I'm developing a PHP website, and to avoid making too many queries, I prefer to make one big query that looks like this:
select m.*, cj.*, cjb.*, me.pseudo as pseudo_acheteur
from mercato m
JOIN cartes_joueur cj
ON m.ID_carte = cj.ID_carte_joueur
JOIN cartes_joueur_base cjb
ON cj.ID_carte_joueur_base = cjb.ID_carte_joueur_base
JOIN membres me
ON me.ID_membre = cj.ID_membre
where not exists (select * from mercato_encheres me where me.ID_mercato = m.ID_mercato)
and cj.ID_membre = 2
and m.status <> 'cancelled'
ORDER BY total_carac desc, cj.level desc, cjb.nom_carte asc
This should return all cards the member has put up for sale that have no bids on them. In the result, I need all the information required to display them.
Here are the approximate row counts for each table:
mercato : 1200
cartes_joueur : 800 000
cartes_joueur_base : 62
membres : 2000
mercato_encheres : 15 000
I tried to reduce them (in the dev environment) by deleting old data, but the query still needs 10~15 seconds to execute (which is far too long for a website).
Thanks for your help.

Let's take a look.
1. The use of * in SELECT clauses is harmful to query performance. Why? It's wasteful. It needlessly adds to the volume of data the server must process, and in the case of JOINs, can force the processing of columns with duplicate values. If you possibly can do so, try to enumerate the columns you need.
2. You may not have useful indexes on your tables for accelerating this. We can't tell. Please note that MySQL generally uses only one index per table in a query, so to make a query fast you often need a well-chosen compound index. I suggest you try defining the index (ID_membre, ID_carte_joueur, ID_carte_joueur_base) on your cartes_joueur table. Why? Your query matches for equality on the first of those columns, and then uses the second and third columns in ON conditions.
3. I have often found that writing a query with the largest table (most rows) first helps me think clearly about optimizing. In your case your largest table is cartes_joueur and you are choosing just one ID_membre value from that table. Your clearest path to optimization is the knowledge that you only need to examine approximately 400 rows from that table, not 800 000. An appropriate compound index will make that possible, and it's easiest to imagine that index's columns if the table comes first in your query.
4. You have a correlated subquery -- this one.
where not exists (select *
from mercato_encheres me
where me.ID_mercato = m.ID_mercato)
MySQL's query planner can be stupidly literal-minded when it sees this, running it thousands of times. In your case it's even worse: it's got SELECT * in it: see point 1 above.
It should be refactored to use the LEFT JOIN ... IS NULL pattern. Here's how that goes.
select whatever
from mercato m
JOIN ...
JOIN ...
LEFT JOIN mercato_encheres mench ON mench.ID_mercato = m.ID_mercato
WHERE mench.ID_mercato IS NULL
and ...
ORDER BY ...
Explanation: The use of LEFT JOIN rather than an ordinary inner JOIN allows rows from the mercato table to be preserved in the output even when the ON condition does not match them to any rows in the mercato_encheres table. The mismatching rows get NULL values for the second table's columns. The mench.ID_mercato IS NULL condition in the WHERE clause then selects only the mismatching rows.
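To see that the LEFT JOIN ... IS NULL pattern selects the same rows as NOT EXISTS, here is a minimal sketch. It assumes SQLite (through Python's sqlite3) in place of MySQL, keeps only the two tables involved in the anti-join, and uses invented rows; the ID_enchere column is hypothetical. It shows equivalence of results only and says nothing about which form is faster on a given server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mercato (ID_mercato INTEGER PRIMARY KEY, status TEXT);
    CREATE TABLE mercato_encheres (ID_enchere INTEGER PRIMARY KEY,
                                   ID_mercato INTEGER);
    INSERT INTO mercato VALUES (1, 'open'), (2, 'open'),
                               (3, 'cancelled'), (4, 'open');
    -- only listing 2 has bids, and it has two of them
    INSERT INTO mercato_encheres VALUES (100, 2), (101, 2);
""")

not_exists_sql = """
    SELECT m.ID_mercato
    FROM mercato m
    WHERE NOT EXISTS (SELECT 1 FROM mercato_encheres me
                      WHERE me.ID_mercato = m.ID_mercato)
      AND m.status <> 'cancelled'
"""

# Anti-join: keep every mercato row, then keep only those whose joined
# mercato_encheres columns came back NULL (no matching bid).
left_join_sql = """
    SELECT m.ID_mercato
    FROM mercato m
    LEFT JOIN mercato_encheres mench ON mench.ID_mercato = m.ID_mercato
    WHERE mench.ID_mercato IS NULL
      AND m.status <> 'cancelled'
"""

via_not_exists = sorted(r[0] for r in conn.execute(not_exists_sql))
via_left_join = sorted(r[0] for r in conn.execute(left_join_sql))
print(via_not_exists, via_left_join)
```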

MySQL(version 5.5): Why `JOIN` is faster than `IN` clause?

[Summary of the question: two SQL statements produce the same results, but at different speeds. One statement uses JOIN, the other uses IN. JOIN is faster than IN.]
I tried two kinds of SELECT statement on two tables, named booking_record and inclusions. The inclusions table has a many-to-one relation with the booking_record table.
(Table definitions not included for simplicity.)
First statement: (using IN clause)
SELECT
id,
agent,
source
FROM
booking_record
WHERE
id IN
( SELECT DISTINCT
foreign_key_booking_record
FROM
inclusions
WHERE
foreign_key_bill IS NULL
AND
invoice_closure <> FALSE
)
Second statement: (using JOIN)
SELECT
id,
agent,
source
FROM
booking_record
JOIN
( SELECT DISTINCT
foreign_key_booking_record
FROM
inclusions
WHERE
foreign_key_bill IS NULL
AND
invoice_closure <> FALSE
) inclusions
ON
id = foreign_key_booking_record
With 300,000+ rows in the booking_record table and 6,100,000+ rows in the inclusions table, the second statement delivered 127 rows in just 0.08 seconds, but the first statement took nearly 21 minutes for the same records.
Why is JOIN so much faster than the IN clause?

This behavior is well-documented. See here.
The short answer is that until MySQL version 5.6.6, MySQL did a poor job of optimizing these types of queries. What would happen is that the subquery would be run each time for every row in the outer query. Lots and lots of overhead, running the same query over and over. You could improve this by using good indexing and removing the distinct from the in subquery.
This is one of the reasons that I prefer exists instead of in, if you care about performance.
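For reference, the IN, EXISTS and JOIN forms discussed here can be compared side by side. The sketch below is only a correctness check under assumptions: it runs on SQLite through Python's sqlite3 rather than MySQL 5.5, the rows are made up, and `<> 0` is written instead of `<> FALSE` for portability. It cannot reproduce the timing difference, which depends on the MySQL version's optimizer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE booking_record (id INTEGER PRIMARY KEY,
                                 agent TEXT, source TEXT);
    CREATE TABLE inclusions (foreign_key_booking_record INTEGER,
                             foreign_key_bill INTEGER,
                             invoice_closure INTEGER);
    INSERT INTO booking_record VALUES (1, 'a', 'web'), (2, 'b', 'web'),
                                      (3, 'c', 'phone'), (4, 'd', 'phone');
    INSERT INTO inclusions VALUES (1, NULL, 1), (1, NULL, 1),
                                  (2, 7, 1),
                                  (3, NULL, 0), (3, NULL, 1);
""")

in_sql = """
    SELECT id FROM booking_record
    WHERE id IN (SELECT foreign_key_booking_record FROM inclusions
                 WHERE foreign_key_bill IS NULL AND invoice_closure <> 0)
"""

# EXISTS form: the subquery is correlated to the outer row explicitly.
exists_sql = """
    SELECT b.id FROM booking_record b
    WHERE EXISTS (SELECT 1 FROM inclusions i
                  WHERE i.foreign_key_booking_record = b.id
                    AND i.foreign_key_bill IS NULL
                    AND i.invoice_closure <> 0)
"""

# JOIN form: DISTINCT in the derived table prevents duplicate bookings.
join_sql = """
    SELECT id FROM booking_record
    JOIN (SELECT DISTINCT foreign_key_booking_record FROM inclusions
          WHERE foreign_key_bill IS NULL AND invoice_closure <> 0) inc
        ON id = inc.foreign_key_booking_record
"""

via_in = sorted(r[0] for r in conn.execute(in_sql))
via_exists = sorted(r[0] for r in conn.execute(exists_sql))
via_join = sorted(r[0] for r in conn.execute(join_sql))
print(via_in, via_exists, via_join)
```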

EXPLAIN should give you some clues (see the MySQL EXPLAIN syntax documentation).
I suspect that the IN version is constructing a list which is then scanned by each item (IN is generally considered a very inefficient construct, I only use it if I have a short list of items to manually enter).
The JOIN is more likely constructing a temp table for the results, making it more like normal JOINs between tables.

You should explore this by using EXPLAIN, as said by Ollie.
But in advance, note that the second command has one more filter: id = foreign_key_booking_record.
Check if this has the same performance:
SELECT
id,
agent,
source
FROM
booking_record
WHERE
id IN
( SELECT DISTINCT
foreign_key_booking_record
FROM
inclusions
WHERE
id = foreign_key_booking_record -- new filter
AND
foreign_key_bill IS NULL
AND
invoice_closure <> FALSE
)

Learning SQL: UNION or JOIN?

Forgive me if this seems like common sense as I am still learning how to split my data between multiple tables.
Basically, I have two:
general with the fields userID,owner,server,name
count with the fields userID,posts,topics
I wish to fetch the data from them and cannot decide how I should do it: in a UNION:
SELECT `userID`, `owner`, `server`, `name`
FROM `english`.`general`
WHERE `userID` = 54 LIMIT 1
UNION
SELECT `posts`, `topics`
FROM `english`.`count`
WHERE `userID` = 54 LIMIT 1
Or a JOIN:
SELECT `general`.`userID`, `general`.`owner`, `general`.`server`,
`general`.`name`, `count`.`posts`, `count`.`topics`
FROM `english`.`general`
JOIN `english`.`count` ON
`general`.`userID`=`count`.`userID` AND `general`.`userID`=54
LIMIT 1
Which do you think would be the more efficient way and why? Or perhaps both are too messy to begin with?

It's not about efficiency, but about how they work.
UNION just combines two independent queries, so you get the two result sets one after another.
JOIN appends each row from one result set to each matching row from the other. So in the combined result set you get "long" rows (in terms of the number of columns).

Just for completeness as I don't think it's mentioned elsewhere: often UNION ALL is what's intended when people use UNION.
UNION will remove duplicates (so it is relatively expensive, because the engine has to sort or otherwise deduplicate the combined rows). It removes duplicates in the final result (so it doesn't matter whether the duplicates come from a single query or from different SELECTs). UNION is a set operation.
UNION ALL just sticks the results together: no sorting, no duplicate removal. This is going to be quicker (or at least no worse) than UNION.
If you know the individual queries won't return duplicate results use UNION ALL. (In fact often best to assume UNION ALL and think about UNION if you need that behaviour; using SELECT DISTINCT with UNION is redundant).
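The difference between the two operators is easy to see on a tiny example. This is a minimal sketch using SQLite through Python's sqlite3, with two made-up single-column tables `a` and `b`; it shows the row counts only, not the cost difference.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (x INTEGER);
    CREATE TABLE b (x INTEGER);
    INSERT INTO a VALUES (1), (2), (2);   -- duplicate inside one table
    INSERT INTO b VALUES (2), (3);        -- duplicate across tables
""")

# UNION: duplicates are removed from the combined result.
union_rows = sorted(
    r[0] for r in conn.execute("SELECT x FROM a UNION SELECT x FROM b"))

# UNION ALL: the two result sets are simply concatenated.
union_all_rows = sorted(
    r[0] for r in conn.execute("SELECT x FROM a UNION ALL SELECT x FROM b"))

print(union_rows)
print(union_all_rows)
```

Note that UNION drops the duplicate 2 that exists inside table `a` alone as well as the one shared with `b`.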

You want to use a JOIN. Joining is used to create a single set that combines related data. Your UNION example doesn't make sense (and probably won't run). UNION is for linking two result sets with identical columns to create a set that has the combined rows (it does not 'union' the columns).
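A quick experiment makes the row-versus-column distinction concrete. The sketch below assumes SQLite (through Python's sqlite3) rather than MySQL, and the single row in each table is invented. It tries the question's UNION of a four-column SELECT with a two-column SELECT, which SQLite rejects because the column counts differ, and then runs the JOIN, which returns one wide row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE general (userID INTEGER, owner TEXT, server TEXT, name TEXT);
    CREATE TABLE "count" (userID INTEGER, posts INTEGER, topics INTEGER);
    INSERT INTO general VALUES (54, 'alice', 'eu1', 'forum');
    INSERT INTO "count" VALUES (54, 12, 3);
""")

# UNION stacks rows, so both SELECTs must have the same number of columns.
try:
    conn.execute("""
        SELECT userID, owner, server, name FROM general WHERE userID = 54
        UNION
        SELECT posts, topics FROM "count" WHERE userID = 54
    """)
    union_error = None
except sqlite3.OperationalError as exc:
    union_error = str(exc)

# JOIN widens rows: columns from both tables end up side by side.
join_rows = conn.execute("""
    SELECT g.userID, g.owner, g.server, g.name, c.posts, c.topics
    FROM general g
    JOIN "count" c ON c.userID = g.userID
    WHERE g.userID = 54
""").fetchall()

print(union_error)
print(join_rows)
```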

If you want to fetch users along with their posts and topics, you need to write the query using a JOIN, like this:
SELECT general.*,count.posts,count.topics FROM general LEFT JOIN count ON general.userID=count.userID
