In my university SQL course, relational databases were all about JOINs between tables.
So I adopted the general approach of first doing all the necessary JOINs, then selecting data, filtering with WHERE, and adding GROUP BY when necessary. That way the code and logic are straightforward.
But very often, when things get more complicated than a single LEFT JOIN, I get very poor performance.
Today I rewrote a JOIN query that had a 600 second execution time
to a different approach with:
SELECT (SELECT ... WHERE ID = X.ID) FROM X
and
SELECT ... WHERE Y IN (SELECT ...)
and now it finishes in 0.0027 seconds.
I am frustrated. I use indexes on the fields I join on, but performance is still so poor...
LEFT JOIN may, but does not always, force looking at the 'left' table first.
JOINs (but not LEFT JOINs) plus a WHERE that touches one table give the Optimizer a strong, reliable hint to look at that one table first.
A JOIN plus a WHERE touching multiple tables -- the Optimizer sometimes picks the right 'first' table, sometimes it does not.
The Optimizer usually gets the rows from one table (whichever it picked as the best to start with), then does a NLJ (Nested Loop Join). This means reaching into the next table one row at a time. This 'reach' needs a good index.
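For example (hypothetical orders and customers tables), each 'reach' from orders into customers is a point lookup, which is cheap only if customers has an index on the joined column:

SELECT o.id, c.name
FROM orders AS o
JOIN customers AS c ON c.id = o.customer_id;  -- wants PRIMARY KEY or INDEX on customers.id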
IN ( SELECT ... ) was terribly inefficient in old versions. Now it may be turned into a "semi-join", like an EXISTS ( SELECT ... ), and be quite efficient. Sometimes doing that rewrite manually is beneficial.
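A sketch of that manual rewrite (hypothetical tables t1 and t2):

-- IN form:
SELECT * FROM t1 WHERE id IN ( SELECT t1_id FROM t2 WHERE flag = 1 );

-- manual "semi-join" via EXISTS:
SELECT * FROM t1
WHERE EXISTS ( SELECT 1 FROM t2 WHERE t2.t1_id = t1.id AND t2.flag = 1 );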
"Explode-Implode" hits a lot of people. This is where there is a JOIN and a GROUP BY. The grouping is mostly to implode the large number of rows created by the join. Sometimes a "derived" table can be an excellent optimization. (This is a manual reformulation of the query.)
Often a LEFT JOIN that is used for an aggregate can be folded into something like SELECT ..., ( SELECT SUM(foo) FROM ... ) AS foos, ..., thereby mitigating the explode-implode.
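Sketched concretely (hypothetical orders and order_items tables):

-- explode-implode: the JOIN multiplies the rows, then GROUP BY shrinks them back
SELECT o.id, SUM(i.qty * i.price) AS total
FROM orders AS o
LEFT JOIN order_items AS i ON i.order_id = o.id
GROUP BY o.id;

-- folded: the aggregate moves into a correlated subquery; no explode, no GROUP BY
SELECT o.id,
       ( SELECT SUM(i.qty * i.price)
           FROM order_items AS i
          WHERE i.order_id = o.id ) AS total
FROM orders AS o;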
Not understanding the benefit of 'composite' indexes is perhaps the most common problem on this forum.
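A sketch of the idea (hypothetical table and columns): one composite index can handle both the filter and the sort:

ALTER TABLE orders ADD INDEX (customer_id, order_date);

SELECT *
FROM orders
WHERE customer_id = 123
ORDER BY order_date;  -- read in index order; no extra sort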
Shall I ramble on? I doubt if I have covered more than 1/4 of the cases. So, I agree with #leftjoin.
Here are some simple tips: http://mysql.rjweb.org/doc.php/index_cookbook_mysql
Related
I am wondering how MySQL (or its underlying engine) processes the queries.
There are two queries below (one uses a left join, the other a cross join) which eventually give the same result.
My question is: how come the processing times of the two queries are similar?
What I expected is that the first query would run quicker, because with the left join the size of the intermediate "table" won't expand, while the second query makes that "table" relatively larger (my assumption being that the computer needs to build the full cross-join result from the tables before it can apply the WHERE clause).
select s.*, a.score as score_01, b.score as score_02
from student s
left join (select * from sc where cid = '01') a using (sid)
left join (select * from sc where cid = '02') b using (sid)
where a.score > b.score;
select s.*, a.score as score_01, b.score as score_02
from student s
,(select * from sc where cid = '01') a
,(select * from sc where cid = '02') b
where a.score > b.score and a.sid = b.sid and s.sid = a.sid;
I tried both queries and expected the processing time for the first one to be shorter, but that is not the case.
Add this to sc:
INDEX(sid, cid, score)
Better yet, if you have a useless id on sc, replace it with
PRIMARY KEY(sid, cid)
(Assuming that pair is Unique.)
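In DDL, assuming sc currently has a hypothetical auto-increment id as its PRIMARY KEY, the two options would look like:

ALTER TABLE sc ADD INDEX (sid, cid, score);

-- or, if (sid, cid) is unique and id is otherwise unused:
ALTER TABLE sc DROP COLUMN id, ADD PRIMARY KEY (sid, cid);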
With either of those fixes, I expect both of your queries to run at similar speed, and faster than they do now.
For further discussion, please provide SHOW CREATE TABLE.
Addressing some of the Comments
MySQL ignores the keywords INNER, OUTER, and CROSS. So it is up to the WHERE to figure out whether it is "inner" or "outer".
MySQL throws the ON and WHERE conditions together (except where it matters for LEFT), then decides which of them are for filtering (WHERE) so it may be able to do that first. The other conditions (the ones that belonged in ON) then help it get to the 'next' table.
So... Please use ON to say how the tables are related; use WHERE for filtering. (And don't use the old comma-join.)
That is, MySQL will [usually] look at one table at a time, doing a "Nested Loop Join" (NLJ) to get to the next.
There are many possible ways to evaluate a JOIN; MySQL ponders which one might be best, then uses that.
The order of non-LEFT JOINs does not matter, nor does the order of expressions AND'd together in WHERE.
In some situations, a HAVING expression can be (and is) moved to the WHERE clause.
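For example (hypothetical table t):

SELECT a, COUNT(*) FROM t GROUP BY a HAVING a > 10;
-- the HAVING condition only touches the grouping column,
-- so the rows can be filtered before grouping:
SELECT a, COUNT(*) FROM t WHERE a > 10 GROUP BY a;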
Although FROM comes before WHERE, the two get somewhat tangled up together. But, in general, the clauses are required to be in a certain order, and that order is logically the order that things have to happen in.
It is up to the Optimizer to combine steps. For example, given
WHERE a = 1
ORDER BY b
and a table with INDEX(a,b), the index will be used to do both, essentially at the same time. Similarly,
SELECT a, MAX(b)
...
GROUP BY a
ORDER BY a
can hop through the BTree index on (a,b) and deliver the results without an extra sort pass for either the GROUP BY or the ORDER BY.
"SELECT x is executed after WHERE y = 'abc'" -- Well, in some sense it is. But if you have INDEX(y,x), the Optimizer is smart enough to grab the x values while it is performing the WHERE.
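A sketch (hypothetical table t):

ALTER TABLE t ADD INDEX (y, x);

SELECT x FROM t WHERE y = 'abc';  -- EXPLAIN shows "Using index";
                                  -- x is read from the index, no row lookup needed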
When a WHERE references more than one table of a JOIN, the Optimizer has a quandary. Which table should it start its NLJ with? It has some statistics to help make the decision, but it does not always get it right. It will usually
filter on one of the tables
NLJ to get to the next table, meanwhile throwing in any WHERE clauses for that table in with the ON clause.
Repeat for other tables.
When there is both a WHERE and an ORDER BY, the Optimizer will usually filter, then sort. But sometimes (not always correctly) it will decide to use an index for the ORDER BY (thereby eliminating the sort) and filter as it reads the table. LIMIT, which is logically done last, further muddies the decision.
MySQL does not have FULL OUTER JOIN. It can be simulated with two JOIN and a UNION. (It is only very rarely needed.)
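One common simulation (hypothetical tables a and b, joined on id):

SELECT a.*, b.*
FROM a LEFT JOIN b ON b.id = a.id
UNION ALL
SELECT a.*, b.*
FROM a RIGHT JOIN b ON b.id = a.id
WHERE a.id IS NULL;  -- the second SELECT adds only the b rows with no match in a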
I was playing around with SQLite and ran into an odd performance issue with CROSS JOINs on very small data sets. Any cross join I do in SQLite takes about 3x as long as, or longer than, the same cross join in MySQL. For example, here is an example for 3,000 rows in MySQL:
select COUNT(*) from (
select * from main_s limit 3000
) x cross join (
select * from main_s limit 3000
) x2 group by x.territory
Does SQLite use a different algorithm than other client-server databases for doing cross joins or other types of joins? I have had a lot of luck using SQLite on a single table/database, but whenever I join tables, it seems to become a bit more problematic.
Does SQLite use a different algorithm than other client-server databases for doing cross joins or other types of joins?
Yes. The algorithm used by SQLite is very simple. In SQLite, joins are executed as nested loop joins. The database goes through one table and, for each row, searches for matching rows in the other table.
SQLite is unable to figure out how to use an index to speed up the join, and without indexes a k-way join takes time proportional to N^k. MySQL, for example, creates some "ghostly" indexes which help the iteration process run faster.
It has been commented already by Shawn that this question would need much more detail in order to get a really accurate answer.
However, as a general answer, please be aware that this note in the SQLite documentation states that the algorithm used to perform CROSS JOINs may be suboptimal (by design!), and that their usage is generally discouraged:
Side note: Special handling of CROSS JOIN. There is no difference between the "INNER JOIN", "JOIN" and "," join operators. They are completely interchangeable in SQLite. The "CROSS JOIN" join operator produces the same result as the "INNER JOIN", "JOIN" and "," operators, but is handled differently by the query optimizer in that it prevents the query optimizer from reordering the tables in the join. An application programmer can use the CROSS JOIN operator to directly influence the algorithm that is chosen to implement the SELECT statement. Avoid using CROSS JOIN except in specific situations where manual control of the query optimizer is desired. Avoid using CROSS JOIN early in the development of an application as doing so is a premature optimization. The special handling of CROSS JOIN is an SQLite-specific feature and is not a part of standard SQL.
This clearly indicates that the SQLite query planner handles CROSS JOINs differently than other RDBMS.
Note: nevertheless, I am unsure that this really applies to your use case, where both derived tables being joined have the same number of records.
Why MySQL might be faster: It uses the optimization that it calls "Using join buffer (Block Nested Loop)".
But... There are many things that are "wrong" with the query. I would hate for you to draw a conclusion on comparing DB engines based on your findings.
It could be that one DB will create an index to help with the join, even if none were already there.
SELECT * probably hauls around all the columns, unless the Optimizer is smart enough to toss all the columns except for territory.
A LIMIT without an ORDER BY gives you arbitrary rows. You might think that the resultset is necessarily 3000 rows with the value "3000" in each, but it is perfectly valid to come up with other results. (And depending on what you ORDER BY, it still may not be deterministic.)
Having a COUNT(*) without a column saying what it is counting (territory) seems unrealistic.
You have the same subquery twice. Some engines may be smart enough to evaluate it only once. Or you could reformulate it with WITH to (possibly) give the Optimizer a big hint to do so. (I think the example below shows how it would be reformulated in MySQL 8.0 or MariaDB 10.2; I don't know about SQLite.)
If you are pitting one DB against the other, please use multiple queries that relate to your application.
This is not necessarily a "small" dataset, since the intermediate table (unless optimized away) has 9,000,000 rows.
I doubt if I have written more than one cross join in a hundred queries, maybe a thousand. Its performance is hardly worth worrying about.
WITH w AS ( SELECT territory FROM main_s LIMIT 3000 )
SELECT COUNT(*)
FROM w AS x1
JOIN w AS x2
GROUP BY x1.territory;
As noted above, using CROSS JOIN in SQLite restricts the optimiser from reordering tables, so you can influence the order the nested loops that perform the join will take.
However, that's a red herring here: you are limiting both subselects to 3000 rows, and it's the same table, so there is no optimisation to be had there anyway.
Let's see what your query actually does:
select COUNT(*) from (
select * from main_s limit 3000
) x cross join (
select * from main_s limit 3000
) x2 group by x.territory
You say: produce an intermediate result set of 9 million rows (3000 x 3000), group them on x.territory, and return the count of the size of each group.
So let's say the row size of your table is 100 bytes.
You say, for each of 3000 rows of 100 bytes, give me 3000 rows of 100 bytes.
Hence you get 9 million rows of 200 bytes each: an intermediate result set of 1.8GB.
So here are some optimisations you could make.
select COUNT(*) from (
select territory from main_s limit 3000
) x cross join (
select * from main_s limit 3000
) x2 group by x.territory
You don't use anything other than territory from x, so select just that. Let's assume it is 8 bytes; now you create an intermediate result set of:
9M x 108 bytes = 972MB
So we nearly halve the amount of data. Let's try the same for x2.
But wait, you are not using any data fields from x2. You are just using it to multiply the result set by 3000. If we do that directly we get:
select COUNT(*) * 3000 from (
select territory from main_s limit 3000
) group by territory
The intermediate result set is now:
3000 x 8 bytes = 24KB, which is about 0.001% of the original.
Further, now that SELECT * is not being used, it's possible the optimiser will be able to use an index on main_s that includes territory as a covering index (meaning it doesn't need to look up the row to get territory).
When there is a WHERE clause, the optimiser will try to choose a covering index that also satisfies the query without row lookups, but the documentation is not explicit about whether this is also done when there is no WHERE.
If you determined that a covering index was not being used (assuming one exists), then, counterintuitively (because sorting takes time), you could add ORDER BY territory to the subselect to cause the covering index to be used.
select COUNT(*) * 3000 from (
select territory from main_s order by territory limit 3000
) group by territory
Check the optimiser documentation here:
https://www.sqlite.org/draft/optoverview.html
To summarise:
The optimiser uses the structure of your query to look for hints and clues about how the query may be optimised to run quicker.
These clues take the form of clauses such as WHERE, ORDER BY, JOIN (ON), etc.
Your query as written provides none of these clues.
If I understand your question correctly, you are interested in why other SQL systems are able to optimise your query as written.
The most likely reasons seem to be:
Ability to eliminate unused columns from sub selects (likely)
Ability to use covering indexes without WHERE or ORDER BY (likely)
Ability to eliminate unused sub selects (unlikely)
But this is a theory that would need testing.
SQLite uses CROSS JOIN as a flag to the query planner to disable optimizations. The docs are quite clear:
Programmers can force SQLite to use a particular loop nesting order for a join by using the CROSS JOIN operator instead of just JOIN, INNER JOIN, NATURAL JOIN, or a "," join. Though CROSS JOINs are commutative in theory, SQLite chooses to never reorder the tables in a CROSS JOIN. Hence, the left table of a CROSS JOIN will always be in an outer loop relative to the right table.
https://www.sqlite.org/optoverview.html#crossjoin
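For instance (hypothetical tables), to force small_t into the outer loop:

-- small_t is guaranteed to be the outer loop; big_t is probed inside it
SELECT *
FROM small_t CROSS JOIN big_t
WHERE big_t.small_id = small_t.id;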
I'm a beginner in MySQL. I have written a query using LEFT JOIN to get the columns mentioned in the query below, and I want to convert that query to use subqueries. Please help me out.
SELECT b.service_status,
s.b2b_acpt_flag,
b2b.b2b_check_in_report,
b2b.b2b_swap_flag
FROM user_booking_tb AS b
LEFT JOIN b2b.b2b_booking_tbl AS b2b ON b.booking_id=b2b.gb_booking_id
LEFT JOIN b2b.b2b_status AS s ON b2b.b2b_booking_id = s.b2b_booking_id
WHERE b.booking_id='$booking_id'
In this case I would actually recommend the join, which should generally be quicker as long as you have proper indexes on the joining columns in both tables.
Even with subqueries, you will still want indexes on those same joining columns.
The size and nature of your actual data will affect performance, so to know for sure you are best to test both options and measure the results. However, beware that the optimal query can switch around as your tables grow.
SELECT b.service_status,
(SELECT b2b_acpt_flag FROM b2b.b2b_status
  WHERE b2b_booking_id = (SELECT b2b_booking_id FROM b2b.b2b_booking_tbl
                           WHERE gb_booking_id = b.booking_id)) AS b2b_acpt_flag,
(SELECT b2b_check_in_report FROM b2b.b2b_booking_tbl WHERE b.booking_id = gb_booking_id) AS b2b_check_in_report,
(SELECT b2b_swap_flag FROM b2b.b2b_booking_tbl WHERE b.booking_id = gb_booking_id) AS b2b_swap_flag
FROM user_booking_tb AS b
WHERE b.booking_id='$booking_id'
To dig into how this query works, you are effectively performing 3 additional queries for each and every row returned by the main query.
If b.booking_id='$booking_id' is unique, this is an extra 3 queries, but if there may be multiple entries, this could multiply and become quite slow.
Each of these extra queries will be fast: no network overhead, a single row, hopefully matching on a primary key. So the 3 extra queries cost only nominal performance, as long as the row count stays low.
A join would run as a single query across 2 indexed tables, which will often shave a few milliseconds off.
Another instance where a subquery may work is where you are filtering the results rather than adding extra columns to output.
SELECT b.*
FROM user_booking_tb AS b
WHERE b.booking_id in (SELECT booking_id FROM othertable WHERE this=this and that=that)
How large the typical list of booking_ids is will affect which approach is more efficient.
I have 2 tables sl and sd.
I want to optimize the following query, if it is possible
select sl.*, sd.* from sl join sd where sl.conf_id='blah' and sd.for_as=1
My understanding is that the cartesian product is first performed and then filtering happens.
Is there a way to have the filtering done first?
Run EXPLAIN SELECT ... -- it will probably say "Using join buffer". This is where it loads one table into memory (if not too big) and repeatedly scans it for the data. Not a pretty sight, but a lot faster than before the 'join buffer' came into play.
Since you have no ON or WHERE tying the two tables together, you really want the "cross join"? That is, if there are 40 'blah' and 70 '1', you will end up with 40*70 = 2800 rows?
As for optimizing, the optimizer will pick one of the tables, giving preference to one that has a useful index, scan it (index or table), then repeatedly use the join buffer (if possible) to scan (index or table) of the other.
In other words, one table will use an index if possible, doing the filtering before the Cartesian product, the other might use the join buffer. If the tables aren't too big, the performance won't be too bad.
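A sketch of indexes that would let each table be filtered before the product is formed (assuming no such indexes exist yet):

ALTER TABLE sl ADD INDEX (conf_id);
ALTER TABLE sd ADD INDEX (for_as);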
In what order does MySQL join the tables, how is that order chosen, and when does STRAIGHT_JOIN come in handy?
MySQL is only capable of doing nested loops (possibly using indexes), so if both join tables are indexed, the time for the join is calculated as A * log(B) if A is leading and B * log(A) if B is leading.
It is easy to see that the table with fewer records satisfying the WHERE condition should be made leading.
There are some other factors that affect join performance, such as WHERE conditions, ORDER BY and LIMIT clauses, etc. MySQL tries to predict the time for each join order and, if the statistics are up to date, does it quite well.
STRAIGHT_JOIN is useful when the statistics are not accurate (say, naturally skewed) or in case of bugs in the optimizer.
For instance, the following spatial join:
SELECT *
FROM a
JOIN b
ON MBRContains(a.area, b.area)
is subject to a join swap (the smaller table is made leading); however, MBRContains is not converted to MBRWithin, and the resulting plan does not make use of the index.
In this case you should explicitly set the join order using STRAIGHT_JOIN.
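For the example above, that would look like this (STRAIGHT_JOIN as a join operator keeps a as the leading table; the SELECT STRAIGHT_JOIN modifier works as well):

SELECT *
FROM a
STRAIGHT_JOIN b
ON MBRContains(a.area, b.area)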
As others have stated, the optimizer tries to pick the table that will produce the smaller result set first, but that may not always work out. I was working with a gov't contracts/grants database. The main table had some 14+ million records, but it also had over 20 lookup tables (states, congressional districts, type of business classification, owner ethnicity, etc.).
Anyhow, with these smaller tables, the join was starting from one of the small lookups, going back to the master table, and then joining all the others. It CHOKED the database, and the query was cancelled after 30+ hours. Since my primary table was listed FIRST and all the lookup tables were joined AFTER it, just adding STRAIGHT_JOIN at the top FORCED the order I had listed, and the complex query was running again in just about 2 hrs (expected for all it had to do).
I've found that getting whatever your primary basis is to the top, with all the subsequent extras later, definitely helps.
The order of tables is chosen by the optimizer. STRAIGHT_JOIN comes in handy when the optimizer gets it wrong, which is not so often. I used it only once, in a big join where the optimizer put one particular table first in the join (I saw it in the EXPLAIN SELECT output); I placed that table so that it was joined later. It helped a lot to speed up the query.