Optimize JOIN query in MySQL

This is my MySQL query in general:
SELECT *
FROM
Table1
INNER JOIN Table2 ON (Table1.Table2_id = Table2.id)
INNER JOIN Table3 ON (Table1.Table3_id = Table3.id)
INNER JOIN Table4 ON (Table1.Table4_id = Table4.id)
INNER JOIN Table5 ON (Table1.Table5_id = Table5.id)
LEFT JOIN Table6 ON (Table1.Table6_id = Table6.id)
ORDER BY Table1.barcode DESC
LIMIT 50
I have indexes on all the ID columns (they exist by default, but I rechecked them to be sure) and also on Table1.barcode.
It's a very slow query on my database (about 15 secs).
I checked it without the ORDER BY and, as I expected, it was really fast.
When I removed only the LIMIT 50 and kept the ORDER BY, it took the same time (about 15 secs).
I should say that there is definitely a lot of data:
Table1: 300442 records,
Table2: 77 records,
Table3: 314085 records,
Table4: 28987 records,
Table5: 127805 records,
Table6: 3230 records
I want to make it fast.
Maybe I can change the * to only the fields I need (so I will try that).
Would changing the join order help me?
I can increase the server's memory, the number of CPUs, and the CPU speed; for my query, which of these is most effective?
Is there any other recommendation?
Thanks in advance.

You should perhaps try to use EXPLAIN to figure out what is going on:
https://dev.mysql.com/doc/refman/5.7/en/using-explain.html
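A minimal sketch, run against the exact query from the question; in the output, the type, key and rows columns show which index each join step uses and roughly how many rows it examines:

EXPLAIN
SELECT *
FROM Table1
INNER JOIN Table2 ON (Table1.Table2_id = Table2.id)
INNER JOIN Table3 ON (Table1.Table3_id = Table3.id)
INNER JOIN Table4 ON (Table1.Table4_id = Table4.id)
INNER JOIN Table5 ON (Table1.Table5_id = Table5.id)
LEFT JOIN Table6 ON (Table1.Table6_id = Table6.id)
ORDER BY Table1.barcode DESC
LIMIT 50;

If the row for Table1 shows "Using temporary; Using filesort" in the Extra column, the ORDER BY is being resolved without the barcode index, which would explain the 15-second timings.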
Would changing the join order help me?
The join order usually does not matter, since most engines optimize the order internally. Some also offer a way to force the join order, so you can check for yourself whether a specific order gives better results in your case than what the engine generated.
I can increase the server's memory, the number of CPUs, and the CPU speed; for my query, which of these is most effective?
I would focus on memory (caching, shared_buffers), but before making any server change you should first investigate the actual issue well and try to tune your existing system (for ideas see the General Setup and Optimization section of https://wiki.postgresql.org/wiki/Performance_Optimization).
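For MySQL specifically, the InnoDB buffer pool is usually the first memory setting to check. A sketch of the relevant my.cnf entry; the 8G figure is purely illustrative and should be sized to roughly 70% of RAM on a dedicated database host:

# my.cnf (illustrative value only; size it to your machine)
[mysqld]
innodb_buffer_pool_size = 8G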
Maybe I can change the * to only the fields I need (so I will try that).
Definitely. Prefer an explicit column list to * in general.
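A sketch of what that could look like for the query above; the column names are hypothetical stand-ins for whatever your application actually reads:

SELECT
  Table1.id,
  Table1.barcode,          -- hypothetical: list only the columns you need
  Table2.name AS t2_name   -- hypothetical column
FROM Table1
INNER JOIN Table2 ON (Table1.Table2_id = Table2.id)
-- ... remaining joins unchanged ...
ORDER BY Table1.barcode DESC
LIMIT 50;

A narrower select list mainly saves network and memory; it will not by itself fix a filesort, but it can allow a covering index to be used.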

Related

MySQL Optimiser - cost planner doesn't know when DuplicateWeedout Strategy creates disk table

This is my sample query:
Select table1.id
from table1
where table1.id in (select table2.id
from table2
where table2.id in (select table3.id
from table3)
)
order by table1.id
limit 100
On checking the optimiser trace for the above query, the costs were:
DUPLICATE-WEEDOUT strategy - Cost: 1.08e7
FIRST MATCH strategy - Cost: 1.85e7
As the DUPLICATE-WEEDOUT cost is lower, MySQL chose the DUPLICATE-WEEDOUT strategy for the above query.
Everything seems good in the join_optimization part, right? But then I checked the join_execution part.
DUPLICATE-WEEDOUT usually creates a temp table. But here, as the heap size was not enough for the temp table, it went on to create an on-disk temp table (converting_tmp_table_to_ondisk).
Due to the disk temp table, my query execution became slower.
So what happened here?
The optimiser trace doesn't calculate the cost of the disk table in the join-optimisation part itself. If the disk-table cost had been calculated, it would have been higher than FIRST MATCH.
Then the final_semijoin_strategy would be FIRST-MATCH, and my query would have been faster.
Is there any way to make MySQL calculate the cost of the disk table in the join-optimisation part itself, or any other workaround for this particular issue?
MySQL 5.7, InnoDB
Note: this is a very dynamic query to which multiple conditions are added depending on the request, so I have already optimised the query in every way I can, and I am stuck with this disk-table cost issue. Please avoid suggestions that optimise the query itself (changing its structure, forcing the FIRST-MATCH strategy). As for increasing the heap size: I'm not sure much about it, and in various forums many have said it might cause issues for other queries.
IN( SELECT ... ) has been notoriously inefficient. Try to avoid it.
The query, as presented, is probably equivalent to
SELECT t1.id
FROM t1
JOIN t2 USING(id)
JOIN t3 USING(id)
ORDER BY id
LIMIT 100
which will optimize nicely.
This formulation should not need to build any temp table, much less a disk-based one.
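One caveat, which is an assumption on my part rather than something taken from the question: the JOIN form is only an exact equivalent if id is unique in t2 and t3. If it is not, an EXISTS form keeps the semi-join semantics without duplicating t1 rows:

SELECT t1.id
FROM t1
WHERE EXISTS (SELECT 1 FROM t2 WHERE t2.id = t1.id)
  AND EXISTS (SELECT 1 FROM t3 WHERE t3.id = t1.id)
ORDER BY t1.id
LIMIT 100;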

Query SQL runs slow with group by in MySQL

I have the SQL below; it runs for about 30 minutes, which is too long for me.
SELECT LPP.learning_project_pupilID, SL.serviceID, MAX(LPPO.start_date), SUM(LPPOT.license_mode_value) totalAssignedLicenses
FROM t_services_licenses SL
INNER JOIN t_pupils_offers_services POS ON POS.service_licenseID = SL.service_licenseID
INNER JOIN j_learning_projects_pupils_offers LPPO ON LPPO.learning_project_pupil_offerID = POS.learning_project_pupil_offerID
INNER JOIN j_learning_projects_pupils LPP ON LPPO.learning_project_pupilID = LPP.learning_project_pupilID
INNER JOIN j_learning_projects_pupils_offers_tracking LPPOT ON LPPOT.pupil_offer_serviceID = POS.pupil_offer_serviceID
INNER JOIN t_filters_items FI ON FI.itemID = LPP.learning_project_pupilID_for_filter_join
WHERE FI.filterID = '4dce2235-aafd-4ba2-b248-c137ad6ce8ca'
AND SL.serviceID IN ('OnlineConversationClasses', 'TwentyFourSeven')
GROUP BY LPP.learning_project_pupilID, SL.serviceID
The EXPLAIN result is below (tell me if you can't view the image):
http://images0.cnblogs.com/blog2015/47012/201508/140920298959608.png
I have viewed the profile result; "copying temp data" wasted almost all of the time. I know the cause is the GROUP BY functionality, so I made some changes to verify it:
I removed the MAX and SUM functions as well as the GROUP BY and ran it; it took only about 40 seconds, which is OK for us.
So here I want to know: are there other methods to make the above SQL execute much faster?
More info can be found here: http://www.cnblogs.com/scy251147/p/4728995.html
EDIT:
From the explain view, I can see that about 50802 rows are filtered in the t_filters_items table, and unluckily this table is Using temporary to store temp data, which is not a good choice for me. I really don't like GROUP BY in MySQL.
Do not use CHARACTER SET utf8 on UUID columns; change them to ascii. Further discussion of UUIDs and how to shrink them further: http://mysql.rjweb.org/doc.php/uuid
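A sketch of that change, assuming filterID is currently a CHAR(36) utf8 column; match the real length and NULL-ability of your column before running it:

ALTER TABLE t_filters_items
  MODIFY filterID CHAR(36) CHARACTER SET ascii COLLATE ascii_general_ci;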
Are there really 50K rows with FI.filterID = '4dce2235-aafd-4ba2-b248-c137ad6ce8ca'?
The GROUP BY spans two tables (LPP and SL), making it impossible to optimize. Can that be changed?
The SUM(...) is likely to have a bigger value than you expect, because of the JOINs. Try to rewrite the computation of the SUM in a subquery.
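A minimal sketch of that rewrite, pre-aggregating the tracking table in a derived table so the other joins can no longer multiply its rows; this assumes the tracking table is the only source of fan-out, so verify the totals against a known case:

SELECT LPP.learning_project_pupilID, SL.serviceID,
       MAX(LPPO.start_date) AS max_start_date,
       SUM(T.license_sum) AS totalAssignedLicenses
FROM t_services_licenses SL
INNER JOIN t_pupils_offers_services POS ON POS.service_licenseID = SL.service_licenseID
INNER JOIN j_learning_projects_pupils_offers LPPO ON LPPO.learning_project_pupil_offerID = POS.learning_project_pupil_offerID
INNER JOIN j_learning_projects_pupils LPP ON LPPO.learning_project_pupilID = LPP.learning_project_pupilID
INNER JOIN t_filters_items FI ON FI.itemID = LPP.learning_project_pupilID_for_filter_join
INNER JOIN (SELECT pupil_offer_serviceID, SUM(license_mode_value) AS license_sum
            FROM j_learning_projects_pupils_offers_tracking
            GROUP BY pupil_offer_serviceID) T ON T.pupil_offer_serviceID = POS.pupil_offer_serviceID
WHERE FI.filterID = '4dce2235-aafd-4ba2-b248-c137ad6ce8ca'
  AND SL.serviceID IN ('OnlineConversationClasses', 'TwentyFourSeven')
GROUP BY LPP.learning_project_pupilID, SL.serviceID;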
Are you using InnoDB? Is innodb_buffer_pool_size set to about 70% of available RAM?
Approximately how many rows in each table?

MySQL: Grouped/Ordered/Left Join query very slow

I have a problem with a query which takes far too long (over two seconds just for this simple query).
At first glance it appears to be an indexing issue; all joined fields are indexed, but I cannot find what else I may need to index to speed this up. As soon as I add the fields I need to the query, it gets even slower.
SELECT `jobs`.`job_id` AS `job_id` FROM tabledef_Jobs AS jobs
LEFT JOIN tabledef_JobCatLink AS jobcats ON jobs.job_id = jobcats.job_id
LEFT JOIN tabledef_Applications AS apps ON jobs.job_id = apps.job_id
LEFT JOIN tabledef_Companies AS company ON jobs.company_id = company.company_id
GROUP BY `jobs`.`job_id`
ORDER BY `jobs`.`date_posted` ASC
LIMIT 0 , 50
Table row counts (~): tabledef_Jobs (108k), tabledef_JobCatLink (109k), tabledef_Companies (100), tabledef_Applications (50k)
The DESCRIBE output and table index screenshots are not reproduced here; 'Using temporary' appears to be what is slowing down the query.
Any help would be greatly appreciated
EDIT WITH ANSWER
Final improved query, with thanks to Steve (marked answer). Ultimately, the final query was reduced from ~22s to ~0.3s:
SELECT `jobs`.`job_id` AS `job_id` FROM
(
SELECT * FROM tabledef_Jobs as jobs ORDER BY `jobs`.`date_posted` ASC LIMIT 0 , 50
) AS jobs
LEFT JOIN tabledef_JobCatLink AS jobcats ON jobs.job_id = jobcats.job_id
LEFT JOIN tabledef_Applications AS apps ON jobs.job_id = apps.job_id
LEFT JOIN tabledef_Companies AS company ON jobs.company_id = company.company_id
GROUP BY `jobs`.`job_id`
ORDER BY `jobs`.`date_posted` ASC
LIMIT 0 , 50
Right, I’ll have a stab at this.
It would appear that the Query Optimiser cannot use an index to fulfil the query upon the tabledef_Jobs table.
You've got an offset limit, and in combination with your ORDER BY this cannot limit the amount of data before joining. It therefore has to group by job_id (which is a PK, and fast), then order that data (a temporary table and a filesort) before limiting and throwing away the vast majority of it, and only then join everything else to it.
I would suggest adding a composite index to jobs of (job_id, date_posted), as sketched below.
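A sketch of that index; the name is arbitrary:

ALTER TABLE tabledef_Jobs ADD INDEX ix_jobid_dateposted (job_id, date_posted);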
So firstly optimise the base query:
SELECT * FROM tabledef_Jobs
GROUP BY job_id
ORDER BY date_posted
LIMIT 0,50
Then you can combine the joins and the final structure together to make a more efficient query.
I cannot let this go by without suggesting you rethink your LIMIT offset. It is fine for small initial offsets, but when it starts to get large it can be a major cause of performance issues. Let's say, for example's sake, that you're using this for pagination; what happens if they want page 3,000? You would use
LIMIT 3000, 50
This will then collect 3,050 rows, manipulate the data, and then throw away the first 3,000.
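A common alternative is keyset (seek) pagination: instead of an offset, remember the last date_posted you served and seek past it. Sketched here on the assumption that date_posted is indexed; in practice, ties on date_posted need (date_posted, job_id) as a composite tiebreaker:

-- first page
SELECT job_id, date_posted FROM tabledef_Jobs
ORDER BY date_posted ASC LIMIT 50;

-- next page: '2015-01-31 12:00:00' is a placeholder for the last value of the previous page
SELECT job_id, date_posted FROM tabledef_Jobs
WHERE date_posted > '2015-01-31 12:00:00'
ORDER BY date_posted ASC LIMIT 50;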
[edit 1 - In response to comments below]
I will expand with some more information that might point you in the right direction. Unfortunately there isn't a simple fix that will resolve it; you must understand why this is happening to be able to address it. Simply removing the LIMIT or ORDER BY may not work, and after all you don't want to remove them, as they are part of your query, which means they must be there for a purpose.
Optimise the simple base query first; that is usually a lot easier than working with multi-joined datasets.
Despite all the bashing it receives, there is nothing wrong with filesort. Sometimes it is the only way to execute the query. Agreed, it can be the cause of many performance issues (especially on larger data sets), but that is usually the fault not of filesort itself but of the underlying query / indexing strategy.
Within MySQL you cannot mix indexes or mix orders of the same index – performing such a task will result in a filesort.
How about, as I suggested, creating an index on date_posted and then using:
SELECT jobs.job_id, jobs.date_posted, jobcats.*, apps.*, company.* FROM
(
SELECT DISTINCT job_id, date_posted FROM tabledef_Jobs
ORDER BY date_posted
LIMIT 0,50
) AS jobs
LEFT JOIN tabledef_JobCatLink AS jobcats ON jobs.job_id = jobcats.job_id
LEFT JOIN tabledef_Applications AS apps ON jobs.job_id = apps.job_id
LEFT JOIN tabledef_Companies AS company ON jobs.company_id = company.company_id

JOIN or INNER SELECT with IN, which is faster?

I was wondering which is faster: an INNER JOIN or an inner SELECT with IN?
select t1.* from test1 t1
inner join test2 t2 on t1.id = t2.id
where t2.id = 'blah'
OR
select t1.* from test1 t1
where t1.id IN (select t2.id from test2 t2 where t2.id = 'blah')
Assuming id is a key, these queries mean the same thing, and a decent DBMS will execute them in the exact same way. Unfortunately MySQL doesn't, as comparing the execution plans (for example in an SQL Fiddle) shows. Which one will be faster probably depends on the size of the tables: if test1 has very few rows, IN has a chance of being faster, while JOIN will likely be faster in all other cases.
This is a peculiarity of MySQL's query optimizer. I've never seen Oracle, PostgreSQL or MS SQL Server execute such simple equivalent queries differently.
If you have to guess, INNER JOIN is likely to be more efficient than an IN (SELECT ...), but that can vary from one query to another.
The EXPLAIN keyword is one of your best friends. Type EXPLAIN in front of your complete SELECT query and MySQL will give you some basic information about how it will execute the query. It'll tell you where it's using file sorts, where it's using indices you've created (and where it's ignoring them), and how many rows it will probably have to examine to fulfill the request.
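For the two queries in this question, that looks like this; comparing the select_type and rows columns of the two plans shows whether MySQL treats them differently:

EXPLAIN SELECT t1.* FROM test1 t1
INNER JOIN test2 t2 ON t1.id = t2.id
WHERE t2.id = 'blah';

EXPLAIN SELECT t1.* FROM test1 t1
WHERE t1.id IN (SELECT t2.id FROM test2 t2 WHERE t2.id = 'blah');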
If all else is equal, use the INNER JOIN, mostly because it's more predictable and thus easier for a new developer coming in to understand. But of course, if you see a real advantage to the IN (SELECT ...) form, use it!
Though you'd have to check the execution plan on whatever RDBMS you're inquiring about, I would guess the inner join would be faster, or at least the same. Perhaps someone will correct me if I'm wrong.
The nested select will most likely run the entire inner query anyway, and build a hash table of possible values from test2. If that query returns a million rows, you've incurred the cost of loading that data into memory no matter what.
With the inner join, if test1 only has 2 rows, it will probably just do 2 index scans on test2 for the id values of each of those rows, and not have to load a million rows into memory.
It's also possible that a more modern database system can optimize the first scenario, since it has statistics on each table; however, in the very best case the inner join would only be the same.
In most cases a JOIN is much faster than a subquery, but a subquery is more readable than a JOIN.
The RDBMS creates an execution plan for the JOIN, so it can predict what data should be loaded and processed, which definitely saves time. For the subquery, on the other hand, it may run all the queries and load all their data before doing the processing.

MySQL Join clause vs WHERE clause

What's the difference between a clause written in the two following ways?
SELECT * FROM table1 INNER JOIN table2 ON (
table2.col1 = table1.col2 AND
table2.member_id = 4
)
I've compared them both with basic queries and EXPLAIN EXTENDED and don't see a difference. I'm wondering if someone here has discovered a difference in a more complex/processing-intensive environment.
SELECT * FROM table1 INNER JOIN table2 ON (
table2.col1 = table1.col2
)
WHERE table2.member_id = 4
With an INNER join the two approaches give identical results and should produce the same query plan.
However there is a semantic difference between a JOIN (which describes a relationship between two tables) and a WHERE clause (which removes rows from the result set). This semantic difference should tell you which one to use. While it makes no difference to the result or to the performance, choosing the right syntax will help other readers of your code understand it more quickly.
Note that there can be a difference if you use an outer join instead of an inner join. For example, if you change INNER to LEFT and the join condition fails you would still get a row if you used the first method but it would be filtered away if you used the second method (because NULL is not equal to 4).
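To make that concrete, here are the two variants with LEFT JOIN; the first keeps table1 rows that have no table2 match (the table2 columns come back NULL), while the second filters those rows out again:

-- condition inside ON: unmatched table1 rows survive with NULL table2 columns
SELECT * FROM table1 LEFT JOIN table2 ON (
    table2.col1 = table1.col2 AND
    table2.member_id = 4
);

-- condition in WHERE: NULL member_id fails "= 4", so unmatched rows disappear
SELECT * FROM table1 LEFT JOIN table2 ON (
    table2.col1 = table1.col2
)
WHERE table2.member_id = 4;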
If you are trying to optimize and know your data, adding the STRAIGHT_JOIN clause can tremendously improve performance. You have an inner join with an ON condition; so, just to confirm: you want only records where table1 and table2 are joined, but only for the table2 rows where member_id = some value, in this case 4.
I would change the query to make table2 the primary table of the select, as it has an explicit member_id that could be optimized by an index to limit the rows, and then join to table1, like this:
select STRAIGHT_JOIN
t1.*
from
table2 t2,
table1 t1
where
t2.member_id = 4
and t2.col1 = t1.col2
So the query would pre-qualify only the member_id = 4 records and then match between table1 and table2. If table2 had 50,000 records and table1 had 400,000 records, having table2 listed first means it is processed first; the member_id = 4 filter reduces that even further, and the join to table1 reduces it further still.
I know for a fact that STRAIGHT_JOIN works, as I've implemented it many times dealing with government data of 14+ million records linking to over 15 lookup tables, where the engine got confused trying to think for me on the critical table. One such query was taking 24+ hours before hanging; adding STRAIGHT_JOIN and prioritizing the "primary" table in the query dropped it to a final correct result set in under 2 hours.
There's not really much of a difference in the situation you describe. In a situation with multiple complex joins, my understanding is that the first form is somewhat preferable, as it reduces the complexity a little; that said, it's going to be a small difference. Overall, you shouldn't notice much of a difference in most, if not all, situations.
With an inner join, it makes almost* no difference; if you switch to outer join, all the difference in the world.
*I say "almost" because optimizers are quirky beasts and it isn't impossible that under some circumstances, it might do a better job optimizing the former or the latter. Do not attempt to take advantage of this behavior.