I have two tables, sl and sd.
I want to optimize the following query, if it is possible:
select sl.*, sd.* from sl join sd where sl.conf_id='blah' and sd.for_as=1
My understanding is that the Cartesian product is performed first and then the filtering happens.
Is there a way to have the filtering done first?
Run EXPLAIN SELECT ... -- it will probably say "Using join buffer". This is where it loads one table into memory (if not too big) and repeatedly scans it for the data. Not a pretty sight, but a lot faster than before the 'join buffer' came into play.
Since you have no ON or WHERE clause tying the two tables together, do you really want the "cross join"? That is, if there are 40 rows matching 'blah' and 70 matching '1', you will end up with 40*70 = 2800 rows?
As for optimizing, the optimizer will pick one of the tables, giving preference to one that has a useful index, scan it (index or table), then repeatedly use the join buffer (if possible) to scan (index or table) of the other.
In other words, one table will use an index if possible, doing its filtering before the Cartesian product; the other might use the join buffer. If the tables aren't too big, the performance won't be too bad.
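If you want the filtering to be explicit, a minimal sketch (assuming conf_id and for_as each have an index; f1 and f2 are just aliases I made up) is to filter each table in a derived table before the cross join. Whether this actually helps depends on the MySQL version, since the optimizer usually pushes this filtering down on its own:
SELECT f1.*, f2.*
FROM (SELECT * FROM sl WHERE conf_id = 'blah') AS f1
CROSS JOIN (SELECT * FROM sd WHERE for_as = 1) AS f2;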
I'm a beginner in MySQL. I have written a query using LEFT JOIN to get the columns mentioned in the query below, and I want to convert that query to a sub-query. Please help me out.
SELECT b.service_status,
s.b2b_acpt_flag,
b2b.b2b_check_in_report,
b2b.b2b_swap_flag
FROM user_booking_tb AS b
LEFT JOIN b2b.b2b_booking_tbl AS b2b ON b.booking_id=b2b.gb_booking_id
LEFT JOIN b2b.b2b_status AS s ON b2b.b2b_booking_id = s.b2b_booking_id
WHERE b.booking_id='$booking_id'
In this case I would actually recommend the join, which should generally be quicker as long as you have proper indexes on the joining columns in both tables.
Even with subqueries, you will still want those same joins.
The size and nature of your actual data will affect performance, so to know for sure you are best off testing both options and measuring the results. However, beware that the optimal query can switch around as your tables grow.
That said, the sub-query version you asked for would look like this:
SELECT b.service_status,
(SELECT b2b_acpt_flag FROM b2b_status WHERE b.booking_id=b2b_booking_id)as b2b_acpt_flag,
(SELECT b2b_check_in_report FROM b2b_booking_tbl WHERE b.booking_id=gb_booking_id) as b2b_check_in_report,
(SELECT b2b_swap_flag FROM b2b_booking_tbl WHERE b.booking_id=gb_booking_id) as b2b_swap_flag
FROM user_booking_tb AS b
WHERE b.booking_id='$booking_id'
To dig into how this query works, you are effectively performing 3 additional queries for each and every row returned by the main query.
If b.booking_id='$booking_id' is unique, this is an extra 3 queries, but if there may be multiple entries, this could multiply and become quite slow.
Each of these extra queries will be fast: no network overhead, a single row, and hopefully matching on a primary key. So the 3 extra queries add only nominal overhead, as long as the quantity is low.
A join would result as a single query across 2 indexed tables, which often will shave a few milliseconds off.
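For reference, a hedged sketch of the joining-column indexes mentioned above (the index names are made up, and booking_id is assumed to already be the primary key of user_booking_tb):
ALTER TABLE b2b.b2b_booking_tbl ADD INDEX idx_gb_booking_id (gb_booking_id);
ALTER TABLE b2b.b2b_status ADD INDEX idx_b2b_booking_id (b2b_booking_id);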
Another instance where a subquery may work is where you are filtering the results rather than adding extra columns to output.
SELECT b.*
FROM user_booking_tb AS b
WHERE b.booking_id in (SELECT booking_id FROM othertable WHERE this=this and that=that)
How large the typical list of booking_ids is will affect which approach is more efficient.
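For comparison, a hedged sketch of the join form that the IN version above would be measured against (othertable and the this/that conditions are the same placeholders as in the example):
SELECT b.*
FROM user_booking_tb AS b
JOIN (SELECT DISTINCT booking_id
      FROM othertable
      WHERE this=this AND that=that) AS o ON o.booking_id = b.booking_id;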
In my university SQL course, relational databases were all about JOINs between tables.
So I adopted the general approach of first doing all the necessary JOINs, then selecting the data, filtering with WHERE, and using GROUP BY when necessary. That way the code and logic are straightforward.
But very often, when things get more complicated than a single LEFT JOIN, I get very poor performance.
Today I rewrote a JOIN query that had a 600 second execution time
to a different approach with:
SELECT (SELECT ... WHERE ID = X.ID) FROM X
and
SELECT ... WHERE Y IN (SELECT ...)
and now it finishes in 0.0027 seconds.
I am frustrated; I use indexes on the fields I join on, but performance is still so poor...
LEFT JOIN may, but does not always, force looking at the 'left' table first.
JOINs (but not LEFT JOINs) plus a WHERE that touches one table give the Optimizer a strong, reliable hint to look at that one table first.
JOIN, plus a WHERE touching multiple tables -- the Optimizer sometimes picks the right 'first' table, sometimes does not.
The Optimizer usually gets the rows from one table (whichever it picked as the best to start with), then does a NLJ (Nested Loop Join). This means reaching into the next table one row at a time. This 'reach' needs a good index.
IN ( SELECT ... ) was terribly unoptimized in old versions. Now it may be turned into a "semi-join", like an EXISTS ( SELECT ... ), and be quite efficient. Sometimes doing that rewrite manually is beneficial.
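A minimal sketch of that manual rewrite (t, other and y are placeholder names, not from any query above):
-- IN ( SELECT ... ) form:
SELECT t.* FROM t WHERE t.y IN (SELECT o.y FROM other AS o);
-- Equivalent EXISTS form, which old versions handled much better:
SELECT t.* FROM t WHERE EXISTS (SELECT 1 FROM other AS o WHERE o.y = t.y);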
"Explode-Implode" hits a lot of people. This is where there is a JOIN and a GROUP BY. The grouping is mostly to implode the large number of rows created by the join. Sometimes a "derived" table can be an excellent optimization. (This is a manual reformulation of the query.)
Often a LEFT JOIN that is used for an aggregate can be folded into something like this: SELECT ..., ( SELECT SUM(foo) FROM ... ) AS foos, ..., thereby mitigating the explode-implode.
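As a hedged illustration of that fold (orders, order_items and all the column names are made up):
-- Explode-implode form: the LEFT JOIN multiplies the rows, GROUP BY implodes them again.
SELECT o.id, o.customer, SUM(i.qty) AS total_qty
FROM orders AS o
LEFT JOIN order_items AS i ON i.order_id = o.id
GROUP BY o.id, o.customer;
-- Folded form: the aggregate moves into a correlated subquery, so no explode-implode occurs.
SELECT o.id, o.customer,
       (SELECT SUM(i.qty) FROM order_items AS i WHERE i.order_id = o.id) AS total_qty
FROM orders AS o;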
Not understanding the benefit of 'composite' indexes is perhaps the most common problem on this forum.
Shall I ramble on? I doubt if I have covered more than 1/4 of the cases. So, I agree with #leftjoin.
Here are some simple tips: http://mysql.rjweb.org/doc.php/index_cookbook_mysql
Given a query reduced to the form:
select b.field1
from table_a a
inner join table_b b on b.field1 = a.field1
left join table_c c on c.field1 = a.field1
left join table_d d on d.field1 = b.field1
left join table_e e on e.field1 = b.field6
group by b.field1,
b.field2,
b.field3,
b.field4,
b.field5,
e.field2,
e.field3
;
With a certain amount of data it is running in 20 seconds in Oracle. Nothing is indexed in Oracle.
Migrated to MySQL, the query does not want to finish (it runs for minutes). Every field in question is indexed in MySQL, and EXPLAIN says that everything is fine.
Since it still wasn't working, I added multiple-column indexes on the grouping fields. Still nothing.
What can the problem be that leaves such a huge gap in MySQL's performance? Is there a method to speed it up?
Oracle is able to do hash joins and merge joins, MySQL is not.
Since your tables are not filtered in any way, hash joins would be the most efficient way to do the joins, especially if you don't have any indexes.
With nested loops, even if all join fields are indexed, MySQL needs to do an index seek on each value from the leading table in a loop (each time starting from the root index page), then do the table lookup to retrieve the record, then repeat it for each joined table. This involves lots of random seeks.
A hash join, on the other side, requires scanning the smaller table once (building a hash table) then scanning the bigger table once (searching the hash table built). This involves sequential scans which are much faster.
Also, with nested loops, a left-joined table can only be driven (scanned in the inner loop), while with a hash join tables on either side can be leading (scanned) or driven (hashed then searched). This affects performance too.
MySQL's optimizer, though it does support a couple of handy tricks that other engines lack, has very limited capabilities compared to other engines and currently supports neither hash joins nor merge joins. That said, a query like this will most probably be slow on MySQL, even if it's fast on other engines with the same data.
In what order does MySQL join the tables, how is that order chosen, and when does STRAIGHT_JOIN come in handy?
MySQL is only capable of doing nested loops (possibly using indexes), so if both join tables are indexed, the time for the join is calculated as A * log(B) if A is leading and B * log(A) if B is leading.
It is easy to see that the table with fewer records satisfying the WHERE condition should be made leading.
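For instance (rough arithmetic, using base-2 logs): with A = 1,000 rows and B = 1,000,000 rows, leading with A costs about 1,000 * log(1,000,000) ≈ 20,000 index probes, while leading with B costs about 1,000,000 * log(1,000) ≈ 10,000,000.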
There are some other factors that affect the join performance, such as WHERE conditions, ORDER BY and LIMIT clauses, etc. MySQL tries to predict the time for each join order and, if its statistics are up to date, does it quite well.
STRAIGHT_JOIN is useful when the statistics are not accurate (say, naturally skewed) or in case of bugs in the optimizer.
For instance, the following spatial join:
SELECT *
FROM a
JOIN b
ON MBRContains(a.area, b.area)
is subject to a join swap (the smaller table is made leading); however, MBRContains is not converted to MBRWithin, so the resulting plan does not make use of the index.
In this case you should explicitly set the join order using STRAIGHT_JOIN.
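A minimal sketch of that (the direction shown is only illustrative; which table should lead depends on which spatial index you want the inner lookups to use):
SELECT *
FROM a
STRAIGHT_JOIN b ON MBRContains(a.area, b.area);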
Others have explained how the optimizer prefers whichever table will satisfy the criteria with the smaller result set, but that may not always work. I had been working with a gov't contract/grants database. The main table was some 14+ million records, and it also had over 20 lookup tables (states, congressional districts, type of business classification, owner ethnicity, etc.).
Anyhow, with these smaller tables, the optimizer started the join from one of the small lookups, went back to the master table, and then joined all the others. It CHOKED the database, and the query was cancelled after 30+ hours. Since my primary table was listed FIRST and all the lookups were joined AFTER it, just adding STRAIGHT_JOIN at the top FORCED the order I had listed, and the complex query was running again in just about 2 hrs (expected for all it had to do).
Getting whatever is your primary basis to the top, with all the subsequent extras after it, definitely helps, I've found.
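A hedged sketch of that pattern (the table and column names are made up, not the actual contract/grants schema):
SELECT STRAIGHT_JOIN c.*, s.state_name, d.district_name
FROM contracts AS c                 -- large primary table listed first
JOIN states AS s ON s.state_id = c.state_id
JOIN districts AS d ON d.district_id = c.district_id
WHERE c.fiscal_year = 2010;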
The order of the tables is chosen by the optimizer. STRAIGHT_JOIN comes in handy when the optimizer gets it wrong, which is not that often. I used it only once, in a big join where the optimizer put one particular table in first place (I saw it in the EXPLAIN SELECT output), so I placed that table so it is joined later. It helped a lot to speed up the query.
I have two tables: p_group.full_data, which is a large dataset I'm working on (100k rows, 200 columns) and p_group.full_data_aggregated, which I've produced to summarise a load of other tables.
Now, what I'd like to do is perform a join between full_data and full_data_aggregated to select out certain rows, averages, and so on. The query I have is as follows:
SELECT 'name', p.group_id, a.group_condition, p.event_index, AVG(p.value) FROM p_group.full_data p
JOIN p_group.full_data_aggregated as a on p.group_id = a.group_id AND p.event_index = a.event_index
WHERE (a.group_condition='open')
GROUP BY p.group_id, p.event_index
I have an index on: full_data.group_id, full_data.event_index and full_data_aggregated.group_id, full_data_aggregated.event_index, full_data_aggregated.group_condition.
Now, the problem is that this query simply won't finish: previously, I had my full_data split up into different tables (one for each group_id), and that worked fine. But now that I have joined the groups together, the query sits there running and so I can only assume I have done something stupid.
Is there anything else I can try to actually get this query to run at a decent speed? I'm sure I've messed up something with indices and the group by function, but I can't work out what. I've tried all sorts of variations of the above query. EXPLAIN indicates that it is "using where; using temporary; using filesort" but I'm not sure how to fix this.
Thanks!
I assume that your indexes are combination indexes (with group_id and event_index together). If you have separate indexes for each field, then only one index is used at a time and the database engine is going through significantly more data.
For example, if you only have a few unique group_id values but lots of event_index values, and you have two indexes, one on group_id only and the other on event_index only, then your query is going to run through a large number of rows for each group_id. If you have one index instead, with both fields in order, then the query will run much faster.
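If they are in fact separate single-column indexes, combination indexes along these lines are what is being described (the index names are hypothetical):
ALTER TABLE p_group.full_data
  ADD INDEX idx_group_event (group_id, event_index);
ALTER TABLE p_group.full_data_aggregated
  ADD INDEX idx_group_event_cond (group_id, event_index, group_condition);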