When to use straight_join? - mysql

In what order does MySQL join the tables, how is that order chosen, and when does STRAIGHT_JOIN come in handy?

MySQL is only capable of doing nested loops (possibly using indexes), so if both join tables are indexed, the time for the join is calculated as A * log(B) if A is leading and B * log(A) if B is leading.
It is easy to see that the table with fewer records satisfying the WHERE condition should be made leading.
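For example (illustrative numbers only): if A has 1,000 rows satisfying the WHERE condition and B has 1,000,000, leading with A costs roughly 1,000 * log(1,000,000) ≈ 20,000 index probes, while leading with B costs roughly 1,000,000 * log(1,000) ≈ 10,000,000.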
There are some other factors that affect join performance, such as WHERE conditions, ORDER BY and LIMIT clauses, etc. MySQL tries to predict the time for each join order and, if the statistics are up to date, does it quite well.
STRAIGHT_JOIN is useful when the statistics are not accurate (say, naturally skewed) or in case of bugs in the optimizer.
For instance, the following spatial join:
SELECT *
FROM a
JOIN b
ON MBRContains(a.area, b.area)
is subject to a join swap (the smaller table is made leading); however, MBRContains is not converted to MBRWithin, and the resulting plan does not make use of the index.
In this case you should explicitly set the join order using STRAIGHT_JOIN.
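For instance, a sketch of forcing the order for the spatial join above (assuming the SPATIAL index is on a.area, so b has to be the leading table for the index to be usable):
SELECT *
FROM b
STRAIGHT_JOIN a
ON MBRContains(a.area, b.area)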

Others have covered how the optimizer tends to lead with the table that has the smaller qualifying result set, but that may not always work out. I had been working with a gov't contract / grants database. The main table was some 14+ million records, and it also had over 20 lookup tables (states, congressional districts, type of business classification, owner ethnicity, etc.).
Anyhow, with all these smaller tables, the join was starting from one of the small lookups, going back to the master table, and then joining all the others. It CHOKED the database, and I cancelled the query after 30+ hours. Since my primary table was listed FIRST and all the lookups were joined AFTER it, just adding STRAIGHT_JOIN at the top FORCED the order I had listed, and the complex query was running again in just about 2 hrs (about what I expected for all it had to do).
Getting whatever your primary basis table is to the top, with all the subsequent extras after it, definitely helps, I've found.
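For illustration only (these table and column names are made up), the shape of the fix was simply:
SELECT STRAIGHT_JOIN c.contract_id, st.state_name, cd.district_name
FROM contracts AS c -- the big 14M-row table, listed first
JOIN states AS st ON st.state_id = c.state_id -- small lookups joined after it
JOIN congressional_districts AS cd ON cd.district_id = c.district_id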

The order of tables is chosen by the optimizer. STRAIGHT_JOIN comes in handy when the optimizer gets it wrong, which is not that often. I used it only once, in a big join where the optimizer put one particular table first (I saw it in the EXPLAIN SELECT output), so I reordered the tables so that it was joined later. It helped a lot to speed up the query.
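For reference, you can check the chosen join order the same way: prefix the query with EXPLAIN, and the tables are listed in the output in the order they will be joined. A minimal sketch with made-up tables t1 and t2:
EXPLAIN
SELECT t1.id, t2.val
FROM t1
JOIN t2 ON t2.t1_id = t1.id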

Related

In SQL, is it true that the shorter the query, the faster it runs?

I'm very curious to know, because when I run these two queries in MySQL the shorter one runs faster:
SELECT FirstName, LastName, City, State
FROM Person
LEFT JOIN Address ON Person.PersonId = Address.PersonId;
and
SELECT FirstName, LastName, City, State
FROM Person
LEFT OUTER JOIN Address ON Person.PersonId = Address.PersonId;
In addition, I want to ask whether the same thing happens in other RDBMSs such as Postgres, MS SQL Server, Oracle and SQLite.
LEFT OUTER JOIN is just a synonym for LEFT JOIN, and the execution plan for both of your queries should be identical. Therefore, I would attribute any difference in performance to things other than the use of OUTER in the left join syntax. You should check what other tasks might be running on your database during the test, and also what other processes might be running on the OS.
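One way to convince yourself is to compare the plans directly (using the tables from the question); both statements should produce identical EXPLAIN output:
EXPLAIN SELECT FirstName, LastName, City, State
FROM Person
LEFT JOIN Address ON Person.PersonId = Address.PersonId;
EXPLAIN SELECT FirstName, LastName, City, State
FROM Person
LEFT OUTER JOIN Address ON Person.PersonId = Address.PersonId;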
First, the performance of a query is not based on how long its text is. There are many "obvious" reasons for this, but perhaps the most important is that the amount of data processed is determined by the size of the tables in the FROM clause, and that has nothing to do with how many characters are in the query.
Second, LEFT JOIN and LEFT OUTER JOIN are synonyms for each other. No doubt the OUTER takes a few more compute cycles on a modern computer to parse, but the difference would be measured in fractions of a microsecond (probably) and is pretty much immeasurable.
Some of the things that affect the performance of queries are:
The size of the data.
Operations such as GROUP BY and ORDER BY.
JOINs, particularly without indexes.
And much, much more.
No, you cannot correlate the length of a SQL statement's text to its execution duration. It's more accurate to correlate a statement's execution duration to its execution plan with respect to the cardinality of the plan's row source operations.
Existence proof:
Take a billion row table, T, with a single column primary key, P
The full table scan query
select * from T
will take longer to execute than the unique index scan query (followed by a table access by index rowid):
select * from T where P = :pval
Except for some odd maintenance operations (e.g. a global index rebuild over a partitioned table), the shorter query will no doubt take longer to execute than the longer query.

Cross join in SQLite vs other dbs

I was playing around with SQLite and I ran into an odd performance issue with CROSS JOINs on very small data sets. Any cross join I do in SQLite takes about 3x as long or longer than the same cross join in MySQL. For example, here is the query for 3,000 rows in MySQL:
select COUNT(*) from (
select * from main_s limit 3000
) x cross join (
select * from main_s limit 3000
) x2 group by x.territory
Does SQLite use a different algorithm or something than other client-server databases for doing cross joins or other types of joins? I have had a lot of luck using SQLite on a single table/database, but whenever joining tables, it seems to become a bit more problematic.
Does SQLite use a different algorithm or something than other client-server databases for doing cross joins or other types of joins?
Yes. The algorithm used by SQLite is very simple. In SQLite, joins are executed as nested loop joins: the database goes through one table and, for each row, searches for matching rows in the other table.
If SQLite is unable to figure out how to use an index to speed up the join, then without indices a k-way join takes time proportional to N^k. MySQL, for example, creates some "ghostly" (automatic) indexes which help the iteration go faster.
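So when a join does have an equality condition, it usually pays to create the index yourself. A minimal sketch with made-up tables a and b:
-- with this index, the nested-loop join can probe b by index for each row of a
-- e.g. SELECT a.id, b.val FROM a JOIN b ON b.a_id = a.id
CREATE INDEX idx_b_a_id ON b(a_id);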
It has been commented already by Shawn that this question would need much more detail in order to get a really accurate answer.
However, as a general answer, please be aware that this note in the SQLite documentation states that the algorithm used to perform CROSS JOINs may be suboptimal (by design!), and that their usage is generally discouraged:
Side note: Special handling of CROSS JOIN. There is no difference between the "INNER JOIN", "JOIN" and "," join operators. They are completely interchangeable in SQLite. The "CROSS JOIN" join operator produces the same result as the "INNER JOIN", "JOIN" and "," operators, but is handled differently by the query optimizer in that it prevents the query optimizer from reordering the tables in the join. An application programmer can use the CROSS JOIN operator to directly influence the algorithm that is chosen to implement the SELECT statement. Avoid using CROSS JOIN except in specific situations where manual control of the query optimizer is desired. Avoid using CROSS JOIN early in the development of an application as doing so is a premature optimization. The special handling of CROSS JOIN is an SQLite-specific feature and is not a part of standard SQL.
This clearly indicates that the SQLite query planner handles CROSS JOINs differently than other RDBMS.
Note: nevertheless, I am unsure that this really applies to your use case, where both derived tables being joined have the same number of records.
Why MySQL might be faster: It uses the optimization that it calls "Using join buffer (Block Nested Loop)".
But... There are many things that are "wrong" with the query. I would hate for you to draw a conclusion on comparing DB engines based on your findings.
It could be that one DB will create an index to help with the join, even if none was already there.
SELECT * probably hauls around all the columns, unless the Optimizer is smart enough to toss all the columns except for territory.
A LIMIT without an ORDER BY gives you an arbitrary set of rows. You might think that the resultset is necessarily 3000 rows with the value "3000" in each, but it is perfectly valid to come up with other results. (Depending on what you ORDER BY, it still may not be deterministic.)
Having a COUNT(*) without a column saying what it is counting (territory) seems unrealistic.
You have the same subquery twice. Some engines may be smart enough to evaluate it only once. Or you could reformulate it with WITH to (possibly) give the Optimizer a big hint to do so. (I think the example below shows how it would be reformulated in MySQL 8.0 or MariaDB 10.2; I don't know about SQLite.)
If you are pitting one DB against the other, please use multiple queries that relate to your application.
This is not necessarily a "small" dataset, since the intermediate table (unless optimized away) has 9,000,000 rows.
I doubt if I have written more than one cross join in a hundred queries, maybe a thousand. Its performance is hardly worth worrying about.
WITH w AS ( SELECT territory FROM main_s LIMIT 3000 )
SELECT COUNT(*)
FROM w AS x1
JOIN w AS x2
GROUP BY x1.territory;
As noted above, using CROSS JOIN in SQLite prevents the optimiser from reordering tables, so you can influence the order of the nested loops that perform the join.
However, that's a red herring here, as you are limiting both sub selects to 3000 rows, and it's the same table, so there is no optimisation to be had there anyway.
Let's see what your query actually does:
select COUNT(*) from (
select * from main_s limit 3000
) x cross join (
select * from main_s limit 3000
) x2 group by x.territory
You say: produce an intermediate result set of 9 million rows (3000 x 3000), group them on x.territory, and return the size of each group.
So let's say the row size of your table is 100 bytes.
You say, for each of 3000 rows of 100 bytes, give me 3000 rows of 100 bytes.
Hence you get 9 million rows of 200 bytes each, an intermediate result set of 1.8GB.
So here are some optimisations you could make.
select COUNT(*) from (
select territory from main_s limit 3000
) x cross join (
select * from main_s limit 3000
) x2 group by x.territory
You don't use anything other than territory from x, so select just that. Let's assume it is 8 bytes; now you create an intermediate result set of:
9M x 108 = 972MB
So we nearly halve the amount of data. Let's try the same for x2.
But wait, you are not using any data fields from x2. You are just using it to multiply the result set by 3000. If we do this directly we get:
select COUNT(*) * 3000 from (
select territory from main_s limit 3000
) group by territory
The intermediate result set is now:
3000 x 8 = 24KB which is now 0.001% of the original.
Further, now that SELECT * is not being used, it's possible the optimiser will be able to use an index on main_s that includes territory as a covering index (meaning it doesn't need to look up the row to get territory).
When there is a WHERE clause, SQLite will try to choose a covering index that also satisfies the query without row lookups, but the documentation is not explicit about whether this is also done when there is no WHERE clause.
If you determine that a covering index is not being used (assuming one exists), then counterintuitively (because sorting takes time) you could add ORDER BY territory in the sub select to cause the covering index to be used.
select COUNT(*) * 3000 from (
select territory from main_s order by territory limit 3000
) group by territory
Check the optimiser documentation here:
https://www.sqlite.org/draft/optoverview.html
To summarise:
The optimiser uses the structure of your query to look for hints and clues about how the query may be optimised to run quicker.
These clues take the form of keywords such as WHERE clauses, ORDER BY, JOIN (ON), etc.
Your query as written provides none of these clues.
If I understand your question correctly, you are interested in why other SQL systems are able to optimise your query as written.
The most likely reasons seem to be:
Ability to eliminate unused columns from sub selects (likely)
Ability to use covering indexes without WHERE or ORDER BY (likely)
Ability to eliminate unused sub selects (unlikely)
But this is a theory that would need testing.
SQLite uses CROSS JOIN as a flag telling the query planner not to reorder the tables of the join. The docs are quite clear:
Programmers can force SQLite to use a particular loop nesting order for a join by using the CROSS JOIN operator instead of just JOIN, INNER JOIN, NATURAL JOIN, or a "," join. Though CROSS JOINs are commutative in theory, SQLite chooses to never reorder the tables in a CROSS JOIN. Hence, the left table of a CROSS JOIN will always be in an outer loop relative to the right table.
https://www.sqlite.org/optoverview.html#crossjoin
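For example (made-up tables), this pins small as the outer loop and big as the inner loop, regardless of what the planner would otherwise choose:
-- small is always scanned in the outer loop; big is probed in the inner loop
-- (ideally via an index on big.small_id)
SELECT *
FROM small CROSS JOIN big ON big.small_id = small.id;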

how to convert left join to sub query?

I'm a beginner in MySQL. I have written a query using LEFT JOINs to get the columns mentioned below; I want to convert that query to use sub-queries. Please help me out.
SELECT b.service_status,
s.b2b_acpt_flag,
b2b.b2b_check_in_report,
b2b.b2b_swap_flag
FROM user_booking_tb AS b
LEFT JOIN b2b.b2b_booking_tbl AS b2b ON b.booking_id=b2b.gb_booking_id
LEFT JOIN b2b.b2b_status AS s ON b2b.b2b_booking_id = s.b2b_booking_id
WHERE b.booking_id='$booking_id'
In this case I would actually recommend the join, which should generally be quicker as long as you have proper indexes on the joining columns in both tables.
Even with subqueries, you will still want those same joins.
Size and nature of your actual data will affect performance so to know for sure you are best to test both options and measure results. However beware that the optimal query can potentially switch around as your tables grow.
SELECT b.service_status,
(SELECT b2b_acpt_flag FROM b2b_status WHERE b.booking_id=b2b_booking_id)as b2b_acpt_flag,
(SELECT b2b_check_in_report FROM b2b_booking_tbl WHERE b.booking_id=gb_booking_id) as b2b_check_in_report,
(SELECT b2b_swap_flag FROM b2b_booking_tbl WHERE b.booking_id=gb_booking_id) as b2b_swap_flag
FROM user_booking_tb AS b
WHERE b.booking_id='$booking_id'
To dig into how this query works, you are effectively performing 3 additional queries for each and every row returned by the main query.
If b.booking_id='$booking_id' is unique, this is an extra 3 queries, but if there may be multiple entries, this could multiply and become quite slow.
Each of these extra queries will be fast: no network overhead, a single row, hopefully matching on a primary key. So the 3 extra queries add only nominal cost, as long as the row count is low.
A join would run as a single query across 2 indexed tables, which will often shave a few milliseconds off.
Another instance where a subquery may work is where you are filtering the results rather than adding extra columns to output.
SELECT b.*
FROM user_booking_tb AS b
WHERE b.booking_id in (SELECT booking_id FROM othertable WHERE this=this and that=that)
How large the typical list of booking_ids is will affect which approach is more efficient.

Optimized query to find avg in MySQL with inner join

Currently I am running a query to find averages, joining one more table.
The results are as expected, but it does not perform very well and takes a lot of time to execute, so I need help finding a better query. The current query is:
SELECT AVG(t2.a),
AVG(t2.b),
AVG(t2.c),
t1.column1,
t1.column2
FROM table1 t1
INNER JOIN table2 t2
ON t1.column = t2.column
GROUP BY t1.column1, t2.column2
In the future, when asking questions related to performance, please always include EXPLAIN output. You basically just write "EXPLAIN SELECT ....;" and it'll show you the execution plan of that query, which includes detailed information that may hint at possible optimizations.
Two things:
JOINs on unindexed columns may be very slow.
GROUP BY is a generally slow kind of operation, as it requires sorting, especially when grouping on multiple columns. GROUP BY can use index scans, but that requires a composite index on the involved columns, which in your case probably doesn't work since you're grouping on columns from different tables.
How many rows do you have? If you're grouping hundreds of millions of rows you can easily expect query times in the range of hours (and I'm dead serious about hours). Grouping is just a horrifically expensive operation, especially because you have memory limits, which means the sorting takes place on disk, and that induces an additional slowdown due to disk I/O, which is just so much slower than memory.
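Regarding the first point, a quick sketch using the placeholder names from the question (backticks only because COLUMN is a reserved word when used literally):
-- an index on the join column lets the nested-loop join probe table2 by index
ALTER TABLE table2 ADD INDEX idx_table2_join (`column`);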
There are two possible answers.
The query is wrong -- because the JOIN occurs before the AVG(), the averages are computed over too many rows.
The query is right -- in which case, there is a lot of work to do, so it takes time. I have to believe that this is the case, since you GROUP BY columns from both tables.
Please provide real column names; it could help our understanding of the query.
But assuming the first case, let's fix the math and speed it up.
Compute the averages in a "derived" table.
Do the JOIN.
I won't attempt to write the code until I have some assurance that the case is worth pursuing.
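Purely as an illustration of that shape (using the placeholder names from the question, and assuming the intent is one set of averages per join key), the reformulation would look roughly like this:
SELECT t1.column1,
       t1.column2,
       d.avg_a,
       d.avg_b,
       d.avg_c
FROM table1 t1
INNER JOIN (
    -- averages computed once per join key, before the join
    SELECT `column`,
           AVG(a) AS avg_a,
           AVG(b) AS avg_b,
           AVG(c) AS avg_c
    FROM table2
    GROUP BY `column`
) AS d ON d.`column` = t1.`column`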

Should I avoid JOINs when MySQL is not optimizing them?

In my university SQL course, relational databases were all about JOINs between tables.
So I adopted the general approach of first doing all the necessary JOINs, then selecting data, filtering with WHERE, and using GROUP BY when necessary. This way the code and logic are straightforward.
But very often, when things get more complicated than a single LEFT JOIN, I get very poor performance.
Today I rewrote a JOIN query that had a 600 second execution time
to a different approach with:
SELECT (SELECT ... WHERE ID = X.ID) FROM X
and
SELECT ... WHERE Y IN (SELECT ...)
and now it finishes in 0.0027 seconds.
I am frustrated; I use indexes on the fields I join on, but performance is still so poor...
LEFT JOIN may, but does not always, force looking at the 'left' table first.
JOINs (but not LEFT JOINs) plus a WHERE that touches one table give the Optimizer a strong, reliable hint to look at that one table first.
A JOIN plus a WHERE touching multiple tables -- the Optimizer sometimes picks the right 'first' table, sometimes does not.
The Optimizer usually gets the rows from one table (whichever it picked as the best to start with), then does a NLJ (Nested Loop Join). This means reaching into the next table one row at a time. This 'reach' needs a good index.
IN ( SELECT ... ) was terribly unoptimal in old versions. Now it may turn into a "semi-join", like an EXISTS ( SELECT ... ), and be quite efficient. Sometimes doing such a rewrite manually is beneficial.
"Explode-Implode" hits a lot of people. This is where there is a JOIN and a GROUP BY. The grouping is mostly to implode the large number of rows created by the join. Sometimes a "derived" table can be an excellent optimization. (This is a manual reformulation of the query.)
Often a LEFT JOIN that is used for an aggregate can be folded into something like this: SELECT ..., ( SELECT SUM(foo) FROM ... ) AS foos, ..., thereby mitigating the explode-implode (a sketch of this follows at the end, after the link below).
Not understanding the benefit of 'composite' indexes is perhaps the most common problem on this forum.
Shall I ramble on? I doubt if I have covered more than 1/4 of the cases. So, I agree with #leftjoin.
Here are some simple tips: http://mysql.rjweb.org/doc.php/index_cookbook_mysql
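To illustrate the explode-implode point and the aggregate-folding trick above, a sketch with made-up orders/items tables:
-- explode-implode: the LEFT JOIN multiplies the rows, then GROUP BY collapses them again
SELECT o.id, SUM(i.qty) AS total_qty
FROM orders AS o
LEFT JOIN items AS i ON i.order_id = o.id
GROUP BY o.id;
-- folding the aggregate into a correlated subquery avoids the blow-up
SELECT o.id,
       ( SELECT SUM(i.qty) FROM items AS i WHERE i.order_id = o.id ) AS total_qty
FROM orders AS o;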