Join performance: Oracle vs MySQL

Join performance: Oracle vs MySQL - mysql

Given a query reduced to the form:
select b.field1
from table_a a
inner join table_b b on b.field1 = a.field1
left join table_c c on c.field1 = a.field1
left join table_d d on d.field1 = b.field1
left join table_e e on e.field1 = b.field6
group by b.field1,
b.field2,
b.field3,
b.field4,
b.field5,
e.field2,
e.field3
;
With a certain amount of data it is running in 20 seconds in Oracle. Nothing is indexed in Oracle.
Migrated into MySQL the query does not want to finish (executes in minutes). Every field in question is indexed in MySQL. Explain tells that everything is fine.
After still not working, the grouping fields got multiple-column indexes. Still nothing.
What can be the problem that there is still a huge leak in the MySQL performance? Is there a method to speed it up?

Oracle is able to do hash joins and merge joins, MySQL is not.
Since your tables are not filtered in any way, hash joins would be the most efficient way to do the joins, especially if you don't have any indexes.
With nested loops, even if all join fields are indexed, MySQL needs to do an index seek on each value from the leading table in a loop (each time starting from the root index page), then do the table lookup to retrieve the record, then repeat it for each joined table. This involves lots of random seeks.
A hash join, on the other side, requires scanning the smaller table once (building a hash table) then scanning the bigger table once (searching the hash table built). This involves sequential scans which are much faster.
Also, with nested loops, a left-joined table can only be driven (scanned in the inner loop), while with a hash join tables on either side can be leading (scanned) or driven (hashed then searched). This affects performance too.
MySQL's optimizer, though does support a couple of handy tricks which other engines lack, has very limited capabilities compared to other engines and currently supports neither hash joins nor merge joins. Thus said, a query like this would most probably be slow on MySQL, even if it's fast on other engines on the same data.

Related

how to convert left join to sub query?

I'm beginner in mysql, i have written a query by using left join to get columns as mentioned in query, i want to convert that query to sub-query please help me out.
SELECT b.service_status,
s.b2b_acpt_flag,
b2b.b2b_check_in_report,
b2b.b2b_swap_flag
FROM user_booking_tb AS b
LEFT JOIN b2b.b2b_booking_tbl AS b2b ON b.booking_id=b2b.gb_booking_id
LEFT JOIN b2b.b2b_status AS s ON b2b.b2b_booking_id = s.b2b_booking_id
WHERE b.booking_id='$booking_id'

In this case would actually recommend the join which should generally be quicker as long as you have proper indexes on the joining columns in both tables.
Even with subqueries, you will still want those same joins.
Size and nature of your actual data will affect performance so to know for sure you are best to test both options and measure results. However beware that the optimal query can potentially switch around as your tables grow.
SELECT b.service_status,
(SELECT b2b_acpt_flag FROM b2b_status WHERE b.booking_id=b2b_booking_id)as b2b_acpt_flag,
(SELECT b2b_check_in_report FROM b2b_booking_tbl WHERE b.booking_id=gb_booking_id) as b2b_check_in_report,
(SELECT b2b_check_in_report FROM b2b_booking_tbl WHERE b.booking_id=gb_booking_id) as b2b_swap_flag
FROM user_booking_tb AS b
WHERE b.booking_id='$booking_id'
To dig into how this query works, you are effectively performing 3 additional queries for each and every row returned by the main query.
If b.booking_id='$booking_id' is unique, this is an extra 3 queries, but if there may be multiple entries, this could multiply and become quite slow.
Each of these extra queries will be fast, no network overhead, single row, hopefully matching on a primary key. So 3 extra queries are nominal performance, as long as quantity is low.
A join would result as a single query across 2 indexed tables, which often will shave a few milliseconds off.
Another instance where a subquery may work is where you are filtering the results rather than adding extra columns to output.
SELECT b.*
FROM user_booking_tb AS b
WHERE b.booking_id in (SELECT booking_id FROM othertable WHERE this=this and that=that)
Depending how large the typical list of booking_id's is will affect which is more efficient.

Temp tables vs subqueries in inner join

Both SQL, return the same results. The first my joins are on the subqueries the second the final queryis a join with a temporary that previously I create/populate them
SELECT COUNT(*) totalCollegiates, SUM(getFee(c.collegiate_id, dateS)) totalMoney
FROM collegiates c
LEFT JOIN (
SELECT collegiate_id FROM collegiateRemittances r
INNER JOIN remittances r1 USING(remittance_id)
WHERE r1.type_id = 1 AND r1.name = remesa
) hasRemittance ON hasRemittance.collegiate_id = c.collegiate_id
WHERE hasRemittance.collegiate_id IS NULL AND c.typePayment = 1 AND c.active = 1 AND c.exentFee = 0 AND c.approvedBoard = 1 AND IF(notCollegiate, c.collegiate_id NOT IN (notCollegiate), '1=1');
DROP TEMPORARY TABLE IF EXISTS hasRemittance;
CREATE TEMPORARY TABLE hasRemittance
SELECT collegiate_id FROM collegiateRemittances r
INNER JOIN remittances r1 USING(remittance_id)
WHERE r1.type_id = 1 AND r1.name = remesa;
SELECT COUNT(*) totalCollegiates, SUM(getFee(c.collegiate_id, dateS)) totalMoney
FROM collegiates c
LEFT JOIN hasRemittance ON hasRemittance.collegiate_id = c.collegiate_id
WHERE hasRemittance.collegiate_id IS NULL AND c.typePayment = 1 AND c.active = 1 AND c.exentFee = 0 AND c.approvedBoard = 1 AND IF(notCollegiate, c.collegiate_id NOT IN (notCollegiate), '1=1');
Which will have better performance for a few thousand records?

The two formulations are identical except that your explicit temp table version is 3 sql statements instead of just 1. That is, the overhead of the back and forth to the server makes it slower. But...
Since the implicit temp table is in a LEFT JOIN, that subquery may be evaluated in one of two ways...
Older versions of MySQL were 'dump' and re-evaluated it. Hence slow.
Newer versions automatically create an index. Hence fast.
Meanwhile, you could speed up the explicit temp table version by adding a suitable index. It would be PRIMARY KEY(collegiate_id). If there is a chance of that INNER JOIN producing dups, then say SELECT DISTINCT.
For "a few thousand" rows, you usually don't need to worry about performance.
Oracle has a zillion options for everything. MySQL has very few, with the default being (usually) the best. So ignore the answer that discussed various options that you could use in MySQL.
There are issues with
AND IF(notCollegiate,
c.collegiate_id NOT IN (notCollegiate),
'1=1')
I can't tell which table notCollegiate is in. notCollegiate cannot be a list, so why use IN? Instead simply use !=. Finally, '1=1' is a 3-character string; did you really want that?
For performance (of either version)
remittances needs INDEX(type_id, name, remittance_id) with remittance_id specifically last.
collegiateRemittances needs INDEX(remittance_id) (unless it is the PK).
collegiates needs INDEX(typePayment, active, exentFee , approvedBoard) in any order.
Bottom line: Worry more about indexes than how you formulate the query.
Ouch. Another wrinkle. What is getFee()? If it is a Stored Function, maybe we need to worry about optimizing it?? And what is dateS?

It depends actually. You'll have to test performance of every option. On my website I had 2 tables with articles and comments to them. It turned out it's faster to call comment counts 20 times for each article, than using a single union query. MySQL (like other DBs) caches queries, so small simple queries can run amazingly fast.

I did not saw that you have tagged the question as mysql so I initialy aswered for Oracle. Here is what I think about mySQL.
MySQL
There are two options when it comes to temporary tables Memory or Disk. And for Disk you can have MyIsam - non transactional and InnoDB transactional. Of course you can expect better performance for non transactional type of storage.
Additionaly you need to figure out how big resultset are you dealing with. For small resultset the memory option would be faster for large resultset the disk option would be faster.
Again at the end as in my original answer you need to figure out what performance is good enough and go for the most descriptive and easy to read option.
Oracle
It depends on what kind of temporary table you are dealing with.
You can have session based temporary tables - data is held until logout, or transaction based - data is held until commit . On top of this they can support transaction logging or not support it. Depending on configuration you can get better performance from a temporary table.
As everything in the world performance is relative therm. Most probably for few thousand records it will not do significant difference between the two queries. In which case I would go not for the most performant on but for the most easier to read and understand one.

Filtering order for SQL joins on large tables with text

I have multiple large tables (several million rows) of data that need to all be combined via inner joins in a single query and filtered. These tables are all large and some of them contain large text columns. However, I don't need all the large text columns in the result of my query. I could filter the tables incrementally as I join them in subqueries or I could skip the subqueries and just join all the tables and filter in the select clause. Which one of these would be faster, and why?
Example with filtering subquery:
select aa.col1, aa.col2, aa.col3, aa.col4, c.col5, c.col6
from
(select a.col1, a.col2, b.col3, b.col4
from table_a a
join table_b b using(col1)
where a.col2 < 10 and b.col3 > 3)
as aa
join table_c c using(col1)
Example without subquery:
select a.col1, a.col2, b.col3, b.col4, c.col5, c.col6
from table_a a
join table_b b using(col1)
join table_c c using(col1)
where a.col2 < 10 and b.col3 > 3
I've done a little bit of research and some people are saying that the filtering order doesn't matter and that the sql query optimizer will choose the most efficient route. However, I've also seen some answers saying to filter incrementally.
With my own experiments in MYSQL, I've found that using subqueries speeds things up due to the large text field. The fetch time dominates the sql execution time (I guess due to large text fields) and filtering the data before the second join cuts down on the fetch time considerably. However, I don't understand the underlying mechanism for this and don't know if it's a fluke of my particular setup or generally applicable. Are there general rules for this type of query in SQL? Is there a difference between these types of queries in Microsoft SQL Server vs MYSQL? I primarily care about the speed of the entire query.

As per my study the second query is faster. Because subquery takes time.
Suppose you have a query:
SELECT * FROM table where id IN (SELECT id FROM table where condition1 AND condition 2 )
In this query first the subquery will execute, after selecting the subquery it checks the outer where conditions and then select.
And if you are using joins then it is faster because first it join table on the common field and then it check the other condition and then selects the data. So they are faster.

Filtering in derived tables can indeed be faster, but... it will depend specifically on the database design, the number of records filtered out, the indexes and other local conditions. So it is best to write both queries and do performance testing with your own system. Look at the explain plan for both and test the actual timing for both (you may need to clear the cache bewtteeen for a fair test)

Optimising sql query with cartesian join

I have 2 tables sl and sd.
I want to optimize the following query, if it is possible
select sl.*, sd.* from sl join sd where sl.conf_id='blah' and sd.for_as=1
My understanding is that the cartesian product is first performed and then filtering happens.
Is there a way to have the filtering done first?

Run EXPLAIN SELECT ... -- it will probably say "Using join buffer". This is where it loads one table into memory (if not too big) and repeatedly scans it for the data. Not a pretty site, but a lot faster than before the 'join buffer' came into play.
Since you have no ON or WHERE tying the two tables together, you really want the "cross join"? That is, if there are 40 'blah' and 70 '1', you will end up with 40*70 = 2800 rows?
As for optimizing, the optimizer will pick one of the tables, giving preference to one that has a useful index, scan it (index or table), then repeatedly use the join buffer (if possible) to scan (index or table) of the other.
In other words, one table will use an index if possible, doing the filtering before the Cartesian product, the other might use the join buffer. If the tables aren't too big, the performance won't be too bad.

Where is better to put 'on' conditions in multiple joins? (mysql)

I have multiple joins including left joins in mysql. There are two ways to do that.
I can put "ON" conditions right after each join:
select * from A join B ON(A.bid=B.ID) join C ON(B.cid=C.ID) join D ON(c.did=D.ID)
I can put them all in one "ON" clause:
select * from A join B join C join D ON(A.bid=B.ID AND B.cid=C.ID AND c.did=D.ID)
Which way is better?
Is it different if I need Left join or Right join in my query?

For simple uses MySQL will almost inevitably execute them in the same manner, so it is a manner of preference and readability (which is a great subject of debate).
However with more complex queries, particularly aggregate queries with OUTER JOINs that have the potential to become disk and io bound - there may be performance and unseen implications in not using a WHERE clause with OUTER JOIN queries.
The difference between a query that runs for 8 minutes, or .8 seconds may ultimately depend on the WHERE clause, particularly as it relates to indexes (How MySQL uses Indexes): The WHERE clause is a core part of providing the query optimizer the information it needs to do it's job and tell the engine how to execute the query in the most efficient way.
From How MySQL Optimizes Queries using WHERE:
"This section discusses optimizations that can be made for processing
WHERE clauses...The best join combination for joining the tables is
found by trying all possibilities. If all columns in ORDER BY and
GROUP BY clauses come from the same table, that table is preferred
first when joining."
For each table in a join, a simpler WHERE is constructed to get a fast
WHERE evaluation for the table and also to skip rows as soon as
possible
Some examples:
Full table scans (type = ALL) with NO Using where in EXTRA
[SQL] SELECT cr.id,cr2.role FROM CReportsAL cr
LEFT JOIN CReportsCA cr2
ON cr.id = cr2.id AND cr.role = cr2.role AND cr.util = 1000
[Err] Out of memory
Uses where to optimize results, with index (Using where,Using index):
[SQL] SELECT cr.id,cr2.role FROM CReportsAL cr
LEFT JOIN CReportsCA cr2
ON cr.id = cr2.id
WHERE cr.role = cr2.role
AND cr.util = 1000
515661 rows in set (0.124s)
****Combination of ON/WHERE - Same result - Same plan in EXPLAIN*******
[SQL] SELECT cr.id,cr2.role FROM CReportsAL cr
LEFT JOIN CReportsCA cr2
ON cr.id = cr2.id
AND cr.role = cr2.role
WHERE cr.util = 1000
515661 rows in set (0.121s)
MySQL is typically smart enough to figure out simple queries like the above and will execute them similarly but in certain cases it will not.
Outer Join Query Performance:
As both LEFT JOIN and RIGHT JOIN are OUTER JOINS (Great in depth review here) the issue of the Cartesian product arises, the avoidance of Table Scans must be avoided, so that as many rows as possible not needed for the query are eliminated as fast as possible.
WHERE, Indexes and the query optimizer used together may completely eliminate the problems posed by cartesian products when used carefully with aggregate functions like AVERAGE, GROUP BY, SUM, DISTINCT etc. orders of magnitude of decrease in run time is achieved with proper indexing by the user and utilization of the WHERE clause.
Finally
Again, for the majority of queries, the query optimizer will execute these in the same manner - making it a manner of preference but when query optimization becomes important, WHERE is a very important tool. I have seen some performance increase in certain cases with INNER JOIN by specifying an indexed col as an additional ON..AND ON clause but I could not tell you why.

Put the ON clause with the JOIN it applies to.
The reasons are:
readability: others can easily see how the tables are joined
performance: if you leave the conditions later in the query, you'll get way more joins happening than need to - it's like putting the conditions in the where clause
convention: by following normal style, your code will be more portable and less likely to encounter problems that may occur with unusual syntax - do what works

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008