I have the following query, which takes about 20 seconds against a sale table of 60,000 records. I understand that the ORDER BY and LIMIT are causing the issue, because when the ORDER BY is removed the query returns in 0.10 seconds.
I am unsure how to optimise this query. Any ideas?
The explain output is here https://gist.github.com/anonymous/1b92fa64261559de32da
SELECT sale.saleID as id,
node.title as location,
sale.saleTotal as total,
sale.saleStatus as status,
payment.paymentMethod,
field_data_field_band_name.field_band_name_value as band,
invoice.invoiceID,
field_data_field_first_name.field_first_name_value as firstName,
field_data_field_last_name.field_last_name_value as lastName,
sale.created as date
FROM sale
LEFT JOIN payment
ON payment.saleID = sale.saleID
LEFT JOIN field_data_field_location
ON field_data_field_location.entity_id = sale.registerSessionID
LEFT JOIN node
ON node.nid = field_data_field_location.field_location_target_id
LEFT JOIN invoice
ON invoice.saleID = sale.saleID
LEFT JOIN profile
ON profile.uid = sale.clientID
LEFT JOIN field_data_field_band_name
ON field_data_field_band_name.entity_id = profile.pid
LEFT JOIN field_data_field_first_name
ON field_data_field_first_name.entity_id = profile.pid
LEFT JOIN field_data_field_last_name
ON field_data_field_last_name.entity_id = profile.pid
ORDER BY sale.created DESC
LIMIT 0,50
Possibly, you cannot do anything. For instance, when you are measuring performance, are you looking at the time to return the first record or the entire result set? Without the ORDER BY, the query can return the first row quite quickly, but you still might need to wait a bit to get all the rows.
Assuming the comparison is valid, the following index might help: sale(created, saleID, clientID, registerSessionID, saleTotal, saleStatus). This is a covering index for the sale columns the query uses (registerSessionID is included because the query joins on it), carefully constructed so it can be read in the right order. If this avoids the final sort, then it should speed up the query, even for fetching the first row.
Minimal:
ALTER TABLE `sale` ADD INDEX (`saleID` , `created`);
ALTER TABLE `invoice` ADD INDEX (`saleID`);
Indices first. Another technique concerning LIMIT, used for paging for instance, is to seek by value: the last row of one page yields the start value for the search for the next page.
As you use ORDER BY sale.created DESC you could guess a sufficiently large period:
WHERE sale.created > CURRENT_DATE() - INTERVAL 2 MONTH
An index on created is a must.
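The "last row of the page yields the start value for the next page" idea can be sketched as follows. This is only an illustration under assumptions: it uses Python's sqlite3 module with an in-memory table instead of MySQL, integer timestamps instead of real dates, and hypothetical helper names (first_page, next_page). The saleID column is added to the ORDER BY as a tie-breaker so that rows sharing a created value are paged consistently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (saleID INTEGER PRIMARY KEY, created INTEGER NOT NULL)")
conn.execute("CREATE INDEX sale_created ON sale (created, saleID)")
# 200 rows; pairs of rows share a timestamp so the tie-breaker matters.
conn.executemany("INSERT INTO sale (saleID, created) VALUES (?, ?)",
                 [(i, 1000 + i // 2) for i in range(1, 201)])

PAGE = 50

def first_page():
    return conn.execute(
        "SELECT saleID, created FROM sale "
        "ORDER BY created DESC, saleID DESC LIMIT ?", (PAGE,)).fetchall()

def next_page(last_created, last_id):
    # Seek past the last row of the previous page instead of using an OFFSET.
    return conn.execute(
        "SELECT saleID, created FROM sale "
        "WHERE created < ? OR (created = ? AND saleID < ?) "
        "ORDER BY created DESC, saleID DESC LIMIT ?",
        (last_created, last_created, last_id, PAGE)).fetchall()

page1 = first_page()
last_id, last_created = page1[-1]
page2 = next_page(last_created, last_id)

# Same page fetched the OFFSET way, for comparison only.
offset_page2 = conn.execute(
    "SELECT saleID, created FROM sale "
    "ORDER BY created DESC, saleID DESC LIMIT ? OFFSET ?", (PAGE, PAGE)).fetchall()
```

Both approaches return the same rows here; the difference is that the seek version never has to read and discard the rows of earlier pages.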
I'm trying to speed up a MySQL query. The listings table has several million rows. Without sorting I get the result in 0.1 seconds, but once I sort it takes 7 seconds. What can I improve to speed up the query?
SELECT l.*
FROM listings l
INNER JOIN listings_categories lc
ON l.id=lc.list_id
AND lc.cat_id='2058'
INNER JOIN locations loc
ON l.location_id=loc.id
WHERE l.location_id
IN (7841,7842,7843,7844,7845,7846,7847,7848,7849,7850,7851,7852,7853,7854,7855,7856,7857,7858,7859,7860,7861,7862,7863,7864,7865,7866,7867,7868,7869,7870,7871,7872,7873,7874,7875,7876,7877,7878,7879,7880,7881,7882,7883,7884,7885,7886,7887,7888,7889,7890,7891,7892,7893,7894,7895,7896,7897,7898,7899,7900,7901,7902,7903)
ORDER BY date DESC
LIMIT 0,10;
EXPLAIN SELECT: Using Index l=date, loc=primary, lc=primary
Performance questions like this are really difficult to answer and depend on the setup, indexes, etc. So there is unlikely to be one single solution, or even clearly correct or incorrect attempts to improve the speed. This involves a lot of trial and error. Anyway, some points I noted which often cause performance issues are:
Avoid conditions within joins that should be placed in the WHERE instead. A join should contain only the columns that are being joined, no further conditions. So the "lc.cat_id='2058'" should be put in the WHERE clause.
Using IN is often slow. You could try to replace it by using OR (l.location_id = 7841 OR l.location_id = 7842 OR ...)
Open the query execution plan and check whether there is something useful for you.
Try to find out if there are special cases/values within the affected columns which slow down your query
Change "ORDER BY date" to "ORDER BY tablealias.date" and check if this makes a difference in performance. Even if not, it is easier to read.
If you can rename the column "date", do so, because using SQL keywords as table or column names is not a good idea. I'm unsure if this influences the performance, but it should be avoided if possible.
Good luck!
You can try additional indexes to speed up the query, but there is a tradeoff: they add cost when creating or modifying data.
These combined keys could speed up the query:
listings: date, location_id
listings_categories: cat_id, list_id
Since the plan says it uses the date index, with the new index there would be no need to read the record to check the location_id. The same goes for the join with listings_categories: an index read would be enough.
l: INDEX(location_id, id)
lc: INDEX(cat_id, list_id)
If those don't suffice, try the following rewrite.
SELECT l2.*
FROM
(
SELECT l1.id
FROM listings AS l1
JOIN listings_categories AS lc ON lc.list_id = l1.id
JOIN locations AS loc ON loc.id = l1.location_id
WHERE lc.cat_id='2058'
AND l1.location_id IN (7841, ..., 7903)
ORDER BY l1.date DESC
LIMIT 0,10
) AS x
JOIN listings l2 ON l2.id = x.id
ORDER BY l2.date DESC
With
listings: INDEX(location_id, date, id)
listings_categories: INDEX(cat_id, list_id)
The idea here is to get the 10 ids from the index before reaching to the table itself. Your version is probably shoveling around the whole table before sorting, and then delivering the 10.
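A small, runnable sketch of the "fetch the 10 ids first, then join back for the full rows" rewrite is shown below. It is an illustration under assumptions: it uses Python's sqlite3 module with made-up data rather than MySQL, omits the locations join, and stores date as an integer. It only demonstrates that both query shapes return the same rows; it says nothing about timings on a real multi-million-row table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE listings (id INTEGER PRIMARY KEY, location_id INTEGER, date INTEGER, body TEXT);
CREATE TABLE listings_categories (list_id INTEGER, cat_id INTEGER, PRIMARY KEY (cat_id, list_id));
CREATE INDEX listings_loc_date ON listings (location_id, date, id);
""")
for i in range(1, 501):
    # body stands in for the wide columns that make SELECT l.* expensive.
    conn.execute("INSERT INTO listings VALUES (?, ?, ?, ?)",
                 (i, 7841 + i % 5, i, "x" * 100))
    conn.execute("INSERT INTO listings_categories VALUES (?, ?)",
                 (i, 2058 if i % 2 else 1))

# Original shape: full rows are carried through the join and the sort.
direct = conn.execute("""
    SELECT l.* FROM listings l
    JOIN listings_categories lc ON lc.list_id = l.id
    WHERE lc.cat_id = 2058 AND l.location_id IN (7841, 7842, 7843)
    ORDER BY l.date DESC LIMIT 10
""").fetchall()

# Rewritten shape: pick the 10 ids first, then fetch only those rows.
deferred = conn.execute("""
    SELECT l2.* FROM (
        SELECT l1.id FROM listings l1
        JOIN listings_categories lc ON lc.list_id = l1.id
        WHERE lc.cat_id = 2058 AND l1.location_id IN (7841, 7842, 7843)
        ORDER BY l1.date DESC LIMIT 10
    ) AS x
    JOIN listings l2 ON l2.id = x.id
    ORDER BY l2.date DESC
""").fetchall()
```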
I'm building a sample "recent" screen that displays a list of records; id is the table's primary key.
My query returns the expected result, but on a table with a large amount of data it may perform slowly.
This is the sample query below:
SELECT distinct H.id, -- (Primary Key)
H.partnerid as PartnerId,
H.partnername AS partner, H.accountname AS accountName,
H.accountid as AccountNo
FROM myschema.mytransactionstable H
INNER JOIN (
SELECT S.accountid, S.partnerid, S.accountname,
max(S.transdate) AS maxDate
from myschema.mytransactionstable S
group by S.accountid, S.partnerid, S.accountname
) ms ON H.accountid = ms.accountid
AND H.partnerid = ms.partnerid
AND H.accountname =ms.accountname
AND H.transdate = maxDate
WHERE H.accountid = ms.accountid
AND H.partnerid = ms.partnerid
AND H.accountname = ms.accountname
AND H.transdate = maxDate
GROUP BY H.partnerid,H.accountid, H.accountname
ORDER BY H.id DESC
LIMIT 5
In my case, there are rows whose values are the same in the selected columns and differ only in their ids.
Below is a link to an image without executing the query above. They are all the records that have not yet been filtered.
Sample result query click here
I only want the 5 most recent rows by id, but the other columns (accountname, accountid, partnerid) can contain similar values.
I already have a correct query, but I want to improve its performance. Any suggestions?
You can try using row_number()
select * from
(
select *,row_number() over(order by transdate desc) as rn
from myschema.mytransactionstable
)A where rn<=5
Don't repeat ON and WHERE clauses. Use ON to say how the tables (or subqueries) are "related"; use WHERE for filtering (that is, which rows to keep). Probably in your case, all the WHERE should be removed.
Please provide SHOW CREATE TABLE
This 'composite' index would probably help because of dealing with the subquery and the JOIN:
INDEX(partnerid, accountid, accountname, transdate)
That would also avoid a separate sort for the GROUP BY.
But then the ORDER BY is different, so it cannot avoid a sort.
This might avoid the sort without changing the result set ordering: ORDER BY partnerid, accountid, accountname, transdate DESC
Please provide EXPLAIN SELECT ... and EXPLAIN FORMAT=JSON SELECT ... if you have further questions.
If we cannot get an index to handle the WHERE, GROUP BY, and ORDER BY, the query will generate all the rows before seeing the LIMIT 5. If the index does work, then the outer query will stop after 5 -- potentially a big savings.
I have the following query, which takes around 15 seconds to execute. If I remove the ORDER BY, it takes 3 seconds, which is still way too long.
SELECT
pages.id AS id,
pages.page_title AS name,
SUM(visitors.bounce) AS bounce,
SUM(visitors.goal) AS goal,
count(visitors.id) AS volume
FROM
pages
LEFT JOIN visitors ON pages.id = visitors.page_id
GROUP BY pages.id
ORDER BY volume DESC
For readability, I slightly simplified this query from the one used in the application, but I've been testing with this simplified query and the problem still exists. So the problem is in this part.
Table pages: around 3K records. Table visitors: around 300K records.
What I have done:
I have an index on visitors.page_id (with a foreign key linking to pages.id).
Obviously my ID fields are set as primary key.
What I have tried:
I have increased the read_buffer_size, sort_buffer_size, read_rnd_buffer_size, to 64M.
EXPLAIN query with sorting (15 secs):
EXPLAIN query without sorting (3 secs, still way too long, and that's not the output I want):
Removed the SUM and COUNT calculations; they didn't really have an effect on the execution time.
Any ideas to improve this query?
First, try
My first suggestion is to do the aggregation before the join:
SELECT p.id, p.page_title AS name,
v.bounce, v.goal, v.volume
FROM pages p LEFT JOIN
-- aggregate visitors once per page before joining
(SELECT page_id, SUM(v.bounce) AS bounce, SUM(v.goal) AS goal,
COUNT(*) AS volume
FROM visitors v
GROUP BY page_id
) v
ON p.id = v.page_id
ORDER BY volume DESC;
However, your query needs to do both an aggregation and a sort -- and you have no filtering. I'm not sure you'll be able to get it much faster.
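To check that aggregating before the join gives the same answer as joining first, here is a minimal sketch. It is an assumption-laden illustration: Python's sqlite3 module with a handful of made-up rows stands in for MySQL, the inner count is named n, and COALESCE is added so that pages without visitors report a volume of 0 (as COUNT(visitors.id) does in the original query) rather than NULL. A secondary sort on id is added only to make the comparison deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pages (id INTEGER PRIMARY KEY, page_title TEXT);
CREATE TABLE visitors (id INTEGER PRIMARY KEY, page_id INTEGER, bounce INTEGER, goal INTEGER);
CREATE INDEX visitors_page ON visitors (page_id);
""")
conn.executemany("INSERT INTO pages VALUES (?, ?)",
                 [(p, "page %d" % p) for p in range(1, 6)])
# Page p (1..4) gets 3*p visitors; page 5 gets none.
rows = [(p, v % 2, int(v % 3 == 0)) for p in range(1, 5) for v in range(p * 3)]
conn.executemany("INSERT INTO visitors (page_id, bounce, goal) VALUES (?, ?, ?)", rows)

# Original shape: join every visitor row to its page, then group.
join_then_group = conn.execute("""
    SELECT p.id, p.page_title, SUM(v.bounce), SUM(v.goal), COUNT(v.id) AS volume
    FROM pages p LEFT JOIN visitors v ON p.id = v.page_id
    GROUP BY p.id
    ORDER BY volume DESC, p.id
""").fetchall()

# Suggested shape: group visitors per page first, then join the small result.
group_then_join = conn.execute("""
    SELECT p.id, p.page_title, v.bounce, v.goal, COALESCE(v.n, 0) AS volume
    FROM pages p LEFT JOIN (
        SELECT page_id, SUM(bounce) AS bounce, SUM(goal) AS goal, COUNT(*) AS n
        FROM visitors
        GROUP BY page_id
    ) v ON p.id = v.page_id
    ORDER BY volume DESC, p.id
""").fetchall()
```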
I have a problem with a query which takes far too long (over two seconds just for this simple query).
At first glance it appears to be an indexing issue. All joined fields are indexed, but I cannot find what else I may need to index to speed this up. As soon as I add the fields I need to the query, it gets even slower.
SELECT `jobs`.`job_id` AS `job_id` FROM tabledef_Jobs AS jobs
LEFT JOIN tabledef_JobCatLink AS jobcats ON jobs.job_id = jobcats.job_id
LEFT JOIN tabledef_Applications AS apps ON jobs.job_id = apps.job_id
LEFT JOIN tabledef_Companies AS company ON jobs.company_id = company.company_id
GROUP BY `jobs`.`job_id`
ORDER BY `jobs`.`date_posted` ASC
LIMIT 0 , 50
Table row counts (~): tabledef_Jobs (108k), tabledef_JobCatLink (109k), tabledef_Companies (100), tabledef_Applications (50k)
Here you can see the Describe. 'Using temporary' appears to be what is slowing down the query:
table index screenshots:
Any help would be greatly appreciated
EDIT WITH ANSWER
Final improved query, with thanks to @Steve (marked answer). Ultimately, the final query was reduced from ~22s to ~0.3s:
SELECT `jobs`.`job_id` AS `job_id` FROM
(
SELECT * FROM tabledef_Jobs as jobs ORDER BY `jobs`.`date_posted` ASC LIMIT 0 , 50
) AS jobs
LEFT JOIN tabledef_JobCatLink AS jobcats ON jobs.job_id = jobcats.job_id
LEFT JOIN tabledef_Applications AS apps ON jobs.job_id = apps.job_id
LEFT JOIN tabledef_Companies AS company ON jobs.company_id = company.company_id
GROUP BY `jobs`.`job_id`
ORDER BY `jobs`.`date_posted` ASC
LIMIT 0 , 50
Right, I’ll have a stab at this.
It would appear that the Query Optimiser cannot use an index to fulfil the query upon the tabledef_Jobs table.
You've got an offset LIMIT, and in combination with your ORDER BY the optimiser cannot limit the amount of data before joining. It therefore has to group by job_id (which is a PK and fast), then order that data (temporary table and a filesort) before limiting and throwing away the vast majority of it, before finally joining everything else to it.
I would suggest adding a composite index to jobs on (job_id, date_posted).
So firstly optimise the base query:
SELECT * FROM tabledef_Jobs
GROUP BY job_id
ORDER BY date_posted
LIMIT 0,50
Then you can combine the joins and the final structure together to make a more efficient query.
I cannot let it go by without suggesting you rethink your LIMIT offset. This is fine for small initial offsets, but when the offset gets large it can be a major cause of performance issues. Say, for example, you're using this for pagination: what happens when the offset reaches 3,000? You will use
LIMIT 3000, 50
This will then collect 3050 rows / manipulate the data and then throw away the first 3000.
[edit 1 - In response to comments below]
I will expand with some more information that might point you in the right direction. Unfortunately there isn't a simple fix that will resolve it; you must understand why this is happening to be able to address it. Simply removing the LIMIT or ORDER BY may not work, and after all you don't want to remove them: they are part of your query, which means they are there for a purpose.
Optimise the simple base query first; that is usually a lot easier than working with multi-joined datasets.
Despite all the bashing it receives there is nothing wrong with filesort. Sometimes this is the only way to execute the query. Agreed it can be the cause of many performance issues (especially on larger data sets) but that’s not usually the fault of filesort but the underlying query / indexing strategy.
Within MySQL you cannot mix indexes or mix orders of the same index – performing such a task will result in a filesort.
How about creating an index on date_posted, as I suggested, and then using:
SELECT jobs.job_id, jobs.date_posted, jobcats.*, apps.*, company.* FROM
(
-- job_id is the primary key, so DISTINCT is not needed; date_posted and
-- company_id are selected here because the outer query refers to them
SELECT job_id, date_posted, company_id FROM tabledef_Jobs
ORDER BY date_posted
LIMIT 0,50
) AS jobs
LEFT JOIN tabledef_JobCatLink AS jobcats ON jobs.job_id = jobcats.job_id
LEFT JOIN tabledef_Applications AS apps ON jobs.job_id = apps.job_id
LEFT JOIN tabledef_Companies AS company ON jobs.company_id = company.company_id
ORDER BY jobs.date_posted
Hi, I have an issue with a MySQL SELECT statement that I can't get my head around.
Table client_directory_data
id int,
verified int,
client_id int,
created timestamp,
description longtext
select * from client_directory_data where verified = 1 order by created desc
But this selects multiple rows for each client_id.
What I need is to select every client_id that has verified = 1, but get only the most recent row for each client_id. I hope that makes sense.
This is an issue I face all the time. Fortunately there's a nice little trick for doing this:
SELECT
client_id,
SUBSTRING_INDEX(GROUP_CONCAT(id ORDER BY created DESC),",",1) AS `id`
FROM client_directory_data
WHERE verified = 1
GROUP BY client_id
And if you want the whole row you can just join onto it like so:
SELECT
*
FROM (
SELECT
client_id,
SUBSTRING_INDEX(GROUP_CONCAT(id ORDER BY created DESC),",",1) AS `id`
FROM client_directory_data
WHERE verified = 1
GROUP BY client_id
) ids
JOIN client_directory_data USING (id);
Of course, if you're ordering by an indexed field anyway (that you could therefore join on efficiently), it's better to use MAX(id) AS id, although it actually has very little impact on performance. The main reason to use MAX() is really to make the code a little simpler. It also avoids the pitfalls you may encounter if the field contains commas (which you can get around with a different separator for the GROUP_CONCAT) or hitting the maximum GROUP_CONCAT length (which can be extended with SET group_concat_max_len = xxx; and only causes warnings anyway).
I can see why this would intuitively seem like it would have performance issues; however, it's actually the best performing method I've found for these queries, especially on large tables.
Here are some benchmarks I've taken from some of the larger tables currently available to me comparing the three methods in this thread.
Query A: (~5,000 records, ~900 results, non-indexed field)
GROUP_CONCAT method: 0.0100 seconds
MAX method: 0.102 seconds
LEFT JOIN method: 0.0082 seconds
Query B : (~300,000 records, ~95,000 results)
GROUP_CONCAT method: 1.8618 seconds
MAX method: 1.7904 seconds
LEFT JOIN method: 6.4649 seconds
Query C : (~300,000 records, ~7 results)
GROUP_CONCAT method: 0.103 seconds
MAX method: 0.0102 seconds
LEFT JOIN method: (I got bored after 4 hours)
Query D : (~500,000 records, ~5,000 different values of the field being grouped)
GROUP method: 0.1355 seconds
MAX Method : 0.0429 seconds
LEFT JOIN method: (I got bored after 10 minutes)
That makes sense and is a classic question.
Assuming that the most recent row is the one with highest id, you can use:
SELECT *
FROM client_directory_data c
LEFT JOIN client_directory_data d ON c.client_id = d.client_id AND d.verified = 1 AND d.id > c.id
WHERE d.id IS NULL
AND c.verified = 1;
You can have an explanation of this query pattern here.
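The self LEFT JOIN pattern above and the MAX(id) variant mentioned earlier in the thread can be compared on a tiny data set. The sketch below is illustrative only: it uses Python's sqlite3 module with six made-up rows instead of MySQL, stores created as an integer, and relies on the same assumption as the answer, namely that the most recent row is the one with the highest id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE client_directory_data (
        id INTEGER PRIMARY KEY, verified INTEGER, client_id INTEGER,
        created INTEGER, description TEXT)
""")
conn.executemany("INSERT INTO client_directory_data VALUES (?, ?, ?, ?, ?)", [
    (1, 1, 10, 100, "a"),
    (2, 1, 10, 200, "b"),
    (3, 0, 10, 300, "c"),  # newest row for client 10, but not verified
    (4, 1, 20, 150, "d"),
    (5, 0, 30, 400, "e"),  # client 30 has no verified rows at all
    (6, 1, 20, 250, "f"),
])

# Anti-join: keep a verified row only if no newer verified row exists.
anti_join = conn.execute("""
    SELECT c.id, c.client_id, c.description
    FROM client_directory_data c
    LEFT JOIN client_directory_data d
           ON c.client_id = d.client_id AND d.verified = 1 AND d.id > c.id
    WHERE d.id IS NULL AND c.verified = 1
    ORDER BY c.client_id
""").fetchall()

# MAX(id) per client, joined back to fetch the whole row.
max_join = conn.execute("""
    SELECT t.id, t.client_id, t.description
    FROM (SELECT MAX(id) AS id FROM client_directory_data
          WHERE verified = 1 GROUP BY client_id) ids
    JOIN client_directory_data t ON t.id = ids.id
    ORDER BY t.client_id
""").fetchall()
```

On this data both queries pick row 2 for client 10 and row 6 for client 20, and neither returns anything for client 30.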
Make id the primary key of the client_directory_data table.