I have a query with GROUP BY and SUM. I have close to 1 million records. When I run the query it takes 2.5s; if I remove the GROUP BY clause it takes 0.89s. Is there any way to optimize a query that uses GROUP BY and SUM together?
SELECT aggEI.ei_uuid AS uuid, aggEI.companydm_id AS companyId, aggEI.rating AS rating, aggEI.ei_name AS name,
       compdm.company_name AS companyName, SUM(aggEI.count) AS activity
FROM AGG_EXTERNALINDIVIDUAL AS aggEI
JOIN COMPANYDM AS compdm ON aggEI.companydm_id = compdm.companydm_id
WHERE aggEI.ei_uuid IS NOT NULL
  AND aggEI.companydm_id IN (8)
  AND aggEI.datedm_id = 20130506
  AND aggEI.topicgroupdm_id IN (1,2,3,4,5,6,7)
  AND aggEI.rating >= 0
  AND aggEI.rating <= 100
GROUP BY aggEI.ei_uuid, aggEI.companydm_id
LIMIT 0,200000
Explain result is as below:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE compdm const PRIMARY,companydm_id_UNIQUE,comp_idx PRIMARY 8 const 1 Using temporary; Using filesort
1 SIMPLE aggEI ref PRIMARY,datedm_id_UNIQUE,agg_ei_comdm_fk_idx,agg_ei_datedm_fk_idx,agg_ei_topgrp_fk_idx,uid_comp_ei_dt_idx,uid_comp_dt_idx,comp_idx datedm_id_UNIQUE 4 const 197865 Using where
Also, I didn't understand why the compdm table is listed first. Can someone explain?
I have an index on the AGG_EXTERNALINDIVIDUAL table on the combination of ei_uuid, companydm_id, datedm_id. It shows up for the aggEI table under possible_keys as uid_comp_dt_idx, but aggEI is using datedm_id_UNIQUE as the key instead. I didn't understand that behavior.
Can someone explain?
EXPLAIN lists tables in the order MySQL joins them, and compdm is resolved as a const table (a single row looked up by its primary key), so it is placed first.
You need to review the indexing on AGG_EXTERNALINDIVIDUAL.
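For example (a sketch only, since the full column list isn't shown; the index name is made up), an index whose leading columns match the constant filters and then the GROUP BY column would let MySQL group in index order instead of building a temporary table:
ALTER TABLE AGG_EXTERNALINDIVIDUAL
  ADD INDEX agg_ei_dt_comp_uuid_idx      -- hypothetical name
    (datedm_id, companydm_id, ei_uuid);  -- equality filters first, then the grouping column
This also bears on the key choice you asked about: uid_comp_dt_idx leads with ei_uuid, which only has an IS NOT NULL filter, so the optimizer cannot use it to narrow the scan and falls back to datedm_id_UNIQUE.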
SELECT
call_id
,call_date
,call_no
,call_amountdue
,rechargesamount
,call_penalty
,callpayment_received
,calldiscount
FROM `call`
WHERE calltype = 'Regular'
AND callcode = 98
AND call_connect = 1
AND call_date < '2018-01-01'
ORDER BY
`call_date` DESC
,`call_id` DESC
limit 1
There is already an index on each of call_date, callcode, calltype, and call_connect.
The table has 10 million records, and the query takes 2 minutes.
How can I get results within 3 seconds?
INDEX (calltype, callcode, call_connect, -- in any order
call_date, -- next
call_id) -- last
This will make it possible to find the one row that is desired without having to step over other rows.
Since you seem to have INDEX(calltype), drop it; it will get in the way and is redundant anyway. The rest of the indexes you mentioned will be ignored for this query.
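For example (assuming the table and column names from the question; idx_call_seek is a made-up name):
ALTER TABLE `call`
  ADD INDEX idx_call_seek (calltype, callcode, call_connect, call_date, call_id);
With the three equality columns first, MySQL can jump straight to the highest (call_date, call_id) entry in that range and satisfy ORDER BY ... DESC LIMIT 1 by reading a single index entry.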
More discussion in Index Cookbook
I have a SQL query that I tried to optimize: through various means I reduced the time from over 5 seconds to about 1.3 seconds, but no further. I was wondering if anyone could suggest further improvements.
The EXPLAIN diagram shows a full scan:
[image: explain diagram]
The EXPLAIN table gives more details:
[image: explain tabular]
The query is simplified and shown below - just for reference, I'm using MySQL 5.6
select * from (
select
@row_num := if(@yacht_id = yacht_id and @charter_type = charter_type and @start_base_id = start_base_id and @end_base_id = end_base_id, @row_num + 1, 1) as row_number,
@yacht_id := yacht_id as yacht_id,
@charter_type := charter_type as charter_type,
@start_base_id := start_base_id as start_base_id,
@end_base_id := end_base_id as end_base_id,
model, offer_type, instant, rating, reviews, loa, berths, cabins, currency, list_price, list_price_per_day,
discount, client_price, client_price_per_day, days, date_from, date_to, start_base_city, end_base_city, start_base_country, end_base_country,
service_binary, product_id, ext_yacht_id, main_image_url
from (
select
offer.yacht_id, offer.charter_type, yacht.model, offer.offer_type, offer.instant, yacht.rating, yacht.reviews, yacht.loa,
yacht.berths, yacht.cabins, offer.currency, offer.list_price, offer.list_price_per_day,
offer.discount, offer.client_price, offer.client_price_per_day, offer.days, date_from, date_to,
offer.start_base_city, offer.end_base_city, offer.start_base_country, offer.end_base_country,
offer.service_binary, offer.product_id, offer.start_base_id, offer.end_base_id,
yacht.ext_yacht_id, yacht.main_image_url
from website_offer as offer
join website_yacht as yacht
on offer.yacht_id = yacht.yacht_id,
(select @yacht_id := '') as init
where date_from > CURDATE()
and date_to <= CURDATE() + INTERVAL 3 MONTH
and days = 7
order by offer.yacht_id, charter_type, start_base_id, end_base_id, list_price_per_day asc, discount desc
) as filtered_offers
) as offers
where row_number=1;
Thanks,
goppi
UPDATE
I had to abandon some performance improvements and replaced the original select with the new one. The select query is actually built dynamically by the backend based on which filter criteria are set, so the WHERE clause of the innermost select can expand quite a lot. However, this is the default select when no filter is set, and it is the version that takes significantly longer than 1 second.
EXPLAIN in text form (it doesn't come out pretty, as I couldn't figure out how to format a table, but here it is):
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY ref <auto_key0> <auto_key0> 9 const 10
2 DERIVED ALL 385967
3 DERIVED system 1 Using filesort
3 DERIVED offer ref idx_yachtid,idx_search,idx_dates idx_dates 5 const 385967 Using index condition; Using where
3 DERIVED yacht eq_ref PRIMARY,id_UNIQUE PRIMARY 4 yachtcharter.offer.yacht_id 1
4 DERIVED No tables used
Subselects are never great.
You could sign up here: https://www.eversql.com/
Run the query through it and it will suggest the indexes and optimisations you need for this query.
There's still some optimization you can use. Considering the subquery returns only 5000 rows, you could use an index for it.
First rephrase the predicate as:
select *
from website_offer
where date_from >= CURDATE() + INTERVAL 1 DAY -- rephrased here
and date(date_to) <= CURDATE() + INTERVAL 3 MONTH
and days = 7
order by yacht_id, charter_type, list_price_per_day asc, discount desc
limit 5000
Then, if you add the following index the performance could improve:
create index ix1 on website_offer (days, date_from, date_to);
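Separately, if an upgrade from 5.6 to MySQL 8.0 ever becomes an option, the user-variable trick can be replaced with ROW_NUMBER(), which does not rely on undocumented evaluation order. A sketch with the column list trimmed for brevity:
select *
from (
    select offer.yacht_id, offer.charter_type, offer.start_base_id, offer.end_base_id,
           offer.list_price_per_day, offer.discount,
           row_number() over (
               partition by offer.yacht_id, offer.charter_type, offer.start_base_id, offer.end_base_id
               order by offer.list_price_per_day asc, offer.discount desc
           ) as row_num
    from website_offer as offer
    join website_yacht as yacht on offer.yacht_id = yacht.yacht_id
    where date_from > CURDATE()
      and date_to <= CURDATE() + INTERVAL 3 MONTH
      and days = 7
) as ranked
where row_num = 1;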
This query takes 18 seconds
SELECT `wd`.`week` AS `start_week`, `wd`.`hold_code`, COUNT(wd.hold_code) AS hold_code_count
FROM `weekly_data` AS `wd`
JOIN aol_reporting_hold_codes hc ON hc.hold_code = wd.hold_code AND chart = 'GR'
WHERE `wd`.`days` <= 6
AND `wd`.`hold_code` IS NOT NULL
AND NOT `wd`.`hold_code` = ''
AND `wd`.`week` >= '201717'
AND `wd`.`itemgroup` IN ('BOTDTO', 'BOTDWG', 'C&FORG', 'C&FOTO', 'MF-SUB', 'MI-SUB', 'PROPRI', 'PROPTO', 'STRSTO', 'STRSUB')
AND `production_type` = 2
AND `contract` = "1234"
AND `project` = 8
GROUP BY `start_week`, `wd`.`hold_code`
This query takes 4 seconds
SELECT `wd`.`week` AS `start_week`, `wd`.`hold_code`, COUNT(wd.hold_code) AS hold_code_count
FROM `weekly_data` AS `wd`
JOIN aol_reporting_hold_codes hc ON hc.hold_code = wd.hold_code AND chart = 'GR'
WHERE `wd`.`days` <= 6
AND `wd`.`hold_code` IS NOT NULL
AND NOT `wd`.`hold_code` = ''
AND `wd`.`week` >= '201717'
AND `wd`.`itemgroup` IN ('BOTDWG', 'C&FORG', 'C&FOTO', 'MF-SUB', 'MI-SUB', 'PROPRI', 'PROPTO', 'STRSTO', 'STRSUB')
AND `production_type` = 2
AND `contract` = "1234"
AND `project` = 8
GROUP BY `start_week`, `wd`.`hold_code`
All I have done is remove one item from the IN clause. I can remove any one of the items: the query runs in 4 seconds as long as there are 9 items or fewer, and takes 18 seconds as soon as I increase it to 10 items.
I thought MySQL only limited the length of a command by size, i.e. 1MB?
More than just the EXPLAIN, use EXPLAIN FORMAT=JSON and get the "Optimizer trace" for the query. I suspect the length of the IN leads to picking a different query plan.
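A minimal sketch of how to capture the trace (available since MySQL 5.6):
SET SESSION optimizer_trace = 'enabled=on';
-- run the 18-second query here, then:
SELECT trace FROM information_schema.OPTIMIZER_TRACE;
SET SESSION optimizer_trace = 'enabled=off';
Comparing the traces of the 9-item and 10-item versions should show where the plans diverge. One plausible culprit: eq_range_index_dive_limit defaults to 10 in MySQL 5.6, and once the number of equality ranges reaches that limit the optimizer switches from index dives to index statistics, which can flip the chosen plan.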
There is virtually no limit to the number of items in IN. I have seen as many as 70K.
That aside, you may be able to speed up even the 4-sec version...
I suggest having this index. Grrr... I can't tell which columns are in which tables. So, if these are all in one table, then make such an index:
INDEX(production_type, contract, project) -- in any order
If those are all in wd, then tack on a 4th column - any of week, itemgroup, days.
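For example, if those columns are all in weekly_data (idx_wd_filter is a made-up name):
ALTER TABLE weekly_data
  ADD INDEX idx_wd_filter (production_type, contract, project, week);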
Be cautious about COUNT(wd.hold_code).
COUNT(x) checks x for being non-NULL; is that what you want? If not, then simply say COUNT(*).
When JOINing, then GROUP BY, you get an "explode-implode". The number of intermediate rows is big; that is when the COUNT is performed.
It seems wrong to both COUNT(hold_code) and GROUP BY hold_code. What are you trying to do?
For further discussion, please provide SHOW CREATE TABLE and EXPLAIN.
Please note that the MySQL IN clause limit is governed by the max_allowed_packet value. You may want to check whether NOT IN returns results faster. I would also suggest putting the values for the IN clause into a buffer string instead of comma-separated literals and giving that a try.
I have a data set consisting of minute-by-minute records. My goal is to return those records and, for each one, add a calculation that sums a certain field over the 24 hours counting back from that minute.
The query I have is the following:
SELECT main.recorded_at AS x,
       (SELECT SUM(precipitation)
        FROM data AS sub
        WHERE sub.host = main.host
          AND sub.recorded_at BETWEEN SUBTIME(main.recorded_at, '24:00:00') AND main.recorded_at) AS y
FROM data AS main
WHERE host = 'xxxx'
ORDER BY x ASC;
Is there a more efficient way to write this query? I have tried, but failed, so far, using LEFT JOINS and different GROUP BYs.
When I explain this query, I get the following:
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY main ref host host 767 const 4038 100.00 Using where; Using filesort
2 DEPENDENT SUBQUERY sub ref host,recorded_at host 767 const 4038 100.00 Using where
In total, the query takes about 200 seconds to run with 8000 records, getting slower all the time. My goal is to get the aggregate 24-hour precipitation for each result, and somehow in under 2 seconds.
Maybe I'm going about this the wrong way? I'm open to suggestions for other avenues to get the same result. :)
Thanks!
~Mike
Assuming I'm understanding your question correctly, it looks like you can use SUM with CASE to achieve the same result without using the correlated subquery.
SELECT recorded_at AS x,
SUM(CASE WHEN recorded_at BETWEEN SUBTIME(recorded_at, '24:00:00') AND recorded_at
THEN precipitation END) As y
FROM data
WHERE host = 'xxxx'
GROUP BY recorded_at
ORDER BY x ASC;
While I'm not sure this would yield better performance, I do think an OUTER JOIN with GROUP BY would solve your issue:
SELECT main.recorded_at AS x,
SUM(sub.precipitation) As y
FROM data main LEFT JOIN data sub ON
main.host = sub.host AND
sub.recorded_at BETWEEN SUBTIME(main.recorded_at, '24:00:00') AND main.recorded_at
WHERE main.host = 'xxxx'
GROUP BY main.recorded_at
ORDER BY x ASC;
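Either way, a composite index should cut the lookups for both the correlated-subquery and the self-join versions (assuming no such index exists yet; the name is made up):
CREATE INDEX idx_data_host_time ON data (host, recorded_at);
And if an upgrade to MySQL 8.0 is possible, a window frame computes the rolling 24-hour sum directly, with no self-join at all (a sketch using the names from the question):
SELECT recorded_at AS x,
       SUM(precipitation) OVER (
           ORDER BY recorded_at
           RANGE BETWEEN INTERVAL 24 HOUR PRECEDING AND CURRENT ROW
       ) AS y
FROM data
WHERE host = 'xxxx'
ORDER BY x ASC;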
I am running a query to get the total notes entered by each user within a date range. This is the query I am running:
SELECT SQL_NO_CACHE
COUNT(notes.user_id) AS "Number of Notes"
FROM csu_users
JOIN notes ON notes.user_id = csu_users.user_id
WHERE notes.timestamp BETWEEN "2013-01-01" AND "2013-01-31"
AND notes.system = 0
GROUP BY csu_users.user_id
Some notes about my setup:
The query takes between 30 and 35 seconds to run, which is too long for our system
This is an InnoDB table
The notes table is about 1GB, with ~3,000,000 rows
I'm deliberately using SQL_NO_CACHE to ensure an accurate benchmark
The output of EXPLAIN SELECT is as follows (I've tried my best to format it):
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE csu_users index user_id user_id 5 NULL 1 Using index
1 SIMPLE notes ref user_id,timestamp,system user_id 4 REFSYS_DEV.csu_users.user_id 152 Using where
I have the following indexes applied:
notes
Primary Key - id
item_id
user_id
timestamp (note: this is actually a DATETIME. The name is just misleading, sorry!)
system
csu_users
Primary Key - id
user_id
Any ideas how I can speed this up? Thank you!
If I'm not mistaken, by converting your timestamp to a string representation you're losing all the advantages of the index on that column. Try using timestamp values in your comparison.
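For example, a half-open range keeps the comparison index-friendly and also avoids silently dropping rows from later in the day on 2013-01-31, since the string '2013-01-31' is interpreted as '2013-01-31 00:00:00' when compared against a DATETIME:
WHERE notes.timestamp >= '2013-01-01 00:00:00'
  AND notes.timestamp <  '2013-02-01 00:00:00'
  AND notes.system = 0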
Is the csu_users table necessary? If the relationship is 1-1 and the user id is always present, then you can run this query instead:
SELECT COUNT(notes.user_id) AS "Number of Notes"
FROM notes
WHERE notes.timestamp BETWEEN "2013-01-01" AND "2013-01-31" AND notes.system = 0
GROUP BY notes.user_id
Even if that is not the case, you can do the join after the aggregation and filtering, because all the conditions are on notes:
select "Number of Notes"
from (SELECT notes.user_id, COUNT(notes.user_id) AS "Number of Notes"
FROM notes
WHERE notes.timestamp BETWEEN "2013-01-01" AND "2013-01-31" AND notes.system = 0
GROUP BY notes.user_id
) n join
csu_users cu
on n.user_id = cu.user_id
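One further, hedged suggestion (assuming no such composite index exists yet; the name is made up): since system is an equality filter and timestamp is a range, a composite index on notes lets the aggregation run from the index alone:
ALTER TABLE notes
  ADD INDEX idx_notes_sys_time_user (system, timestamp, user_id);  -- equality first, range next; user_id makes it covering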