I have this query (4.8 seconds run time):
SELECT a, b, c
FROM table_a ta
INNER JOIN table_b tb ON ta.id = tb.id
AND ta.id2 = tb.id2
WHERE ta.id2 = 1
AND tb.id2 = 1
AND ta.id IN (*100K strings list*)
(I know the condition on id2 = 1 can be done better; let's ignore that for now.)
When profiling the above query, I get:
| statistics | 3.471655 |
Reading around online, I saw that this means the thread is performing "disk-bound other work".
After changing the query to insert the 100K strings into a temp table and joining with that table, I managed to reduce the run time to 0.82 seconds, but I can't say I completely understand why.
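For reference, the temp-table variant looked roughly like this (a sketch only; the identifiers are illustrative, since the real names are obfuscated in this post):
CREATE TEMPORARY TABLE ttt (
    id CHAR(32) NOT NULL,
    PRIMARY KEY (id)
);

# ... bulk-insert the ~100K values into ttt here ...

SELECT a, b, c
FROM ttt
INNER JOIN table_a ta ON ta.id = ttt.id AND ta.id2 = 1
INNER JOIN table_b tb ON tb.id = ta.id AND tb.id2 = ta.id2;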
So:
What does "disk-bound other work" mean exactly? What determines how long this step runs? Table size in rows? Table size in bytes?
Where does that run-time improvement come from? Is a JOIN really that much more efficient than IN? I always figured the server just builds a hash set in memory for the IN list and probes it, which should be very fast.
EDIT:
I'm using MariaDB 10.2.25.
CREATE TABLE:
CREATE TABLE table_a (
d VARCHAR(100) DEFAULT NULL,
e VARCHAR(100) DEFAULT NULL,
f VARCHAR(100) DEFAULT NULL,
g VARCHAR(100) DEFAULT NULL,
h VARCHAR(100) DEFAULT NULL,
i VARCHAR(100) DEFAULT NULL,
a CHAR(32) NOT NULL,
j CHAR(32),
k CHAR(27) NOT NULL,
id BIGINT(20),
l VARCHAR(100) DEFAULT NULL,
m VARCHAR(100) DEFAULT NULL,
n VARCHAR(100) DEFAULT NULL,
o TEXT,
b INT DEFAULT NULL,
p VARCHAR(100) DEFAULT NULL,
q INT(10) DEFAULT NULL,
r INT(10) DEFAULT NULL,
s INT(10) DEFAULT NULL,
t CHAR(5) DEFAULT NULL,
u INT(3) DEFAULT NULL,
v BOOL,
w BOOL,
x BOOL,
y VARCHAR(100) DEFAULT NULL,
z VARCHAR(100) DEFAULT NULL,
dd VARCHAR(100) DEFAULT NULL,
ee VARCHAR(100) DEFAULT NULL,
ff VARCHAR(100) DEFAULT NULL,
gg VARCHAR(500) DEFAULT NULL,
hh VARCHAR(50) DEFAULT NULL,
ii VARCHAR(50) DEFAULT NULL,
jj BOOL DEFAULT NULL,
kk VARCHAR(500) DEFAULT NULL,
id2 INT NOT NULL,
ll INT UNSIGNED DEFAULT NULL,
KEY idx1 (m),
KEY idx2 (id2,id),
KEY idx3 (id2),
PRIMARY KEY (id2,a)
) DEFAULT CHARSET=utf8;
CREATE TABLE table_b (
aaa CHAR(27) NOT NULL,
id BIGINT(20),
bbb INT UNSIGNED,
c INT UNSIGNED NOT NULL,
ccc VARCHAR(50),
id2 INT UNSIGNED NOT NULL,
KEY idx1 (id2,id),
PRIMARY KEY (id2,aaa),
KEY `id` (`id`)
);
EXPLAIN Query 1:
+------+-------------+-------+-------+----------------------------------+-------------+---------+---------------------------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+-------+----------------------------------+-------------+---------+---------------------------+--------+-------------+
| 1 | SIMPLE | table_a | range | PRIMARY,idx3,idx2 | PRIMARY | 100 | NULL | 100000 | Using where |
| 1 | SIMPLE | table_b | ref | PRIMARY,id | id | 13 | table_a.id,const | 1 | |
+------+-------------+-------+-------+----------------------------------+-------------+---------+---------------------------+--------+-------------+
EXPLAIN Query 2:
+------+-------------+-------+--------+----------------------------------+-------------+---------+---------------------------+-------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+--------+----------------------------------+-------------+---------+---------------------------+-------+-------------+
| 1 | SIMPLE | ttt | index | PRIMARY | PRIMARY | 62 | NULL | 76191 | Using index |
| 1 | SIMPLE | table_a | eq_ref | PRIMARY,idx3,idx2 | PRIMARY | 100 | const,ttt.id | 1 | Using where |
| 1 | SIMPLE | table_b | ref | PRIMARY,id | id | 13 | table_a.id,const | 1 | |
+------+-------------+-------+--------+----------------------------------+-------------+---------+---------------------------+-------+-------------+
(Too much for a Comment; may lead to an Answer.)
CHAR(32) -- are these UUIDs? If so, declare them CHARACTER SET ascii, not utf8.
Be consistent on SIGNED vs UNSIGNED, especially with columns used in JOIN.
id | 13 | table_a.id,const does not agree with the table definition given. Please fix.
How much time does it take to build ttt? If that is too large, you are not saving any time.
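To make the first two points above concrete, here is a sketch of the kind of change meant (assuming the CHAR(32) columns really are hex UUIDs; adjust to your data):
# Shrink the PRIMARY KEY (id2, a): CHAR(32) costs 96 bytes in utf8 but 32 in ascii.
ALTER TABLE table_a
    MODIFY a CHAR(32) CHARACTER SET ascii NOT NULL,
    MODIFY j CHAR(32) CHARACTER SET ascii;

# Align signedness of the join column (table_a.id2 is signed, table_b.id2 is UNSIGNED).
ALTER TABLE table_a
    MODIFY id2 INT UNSIGNED NOT NULL;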
Related
I'm trying to understand why these two queries are treated differently with regards to use of the primary keys in joins.
This query with a join on icd_codes (the SELECT query, without the EXPLAIN, of course) completes in 56 ms:
EXPLAIN
SELECT var.Var_ID,
var.Gene,
var.HGVSc,
pVCF_145K.PT_ID,
pVCF_145K.AD_ALT,
pVCF_145K.AD_REF,
icd_codes.ICD_NM,
icd_codes.PT_AGE
FROM public.variants_145K var
INNER JOIN public.pVCF_145K USING (Var_ID)
INNER JOIN public.icd_codes using (PT_ID)
# INNER JOIN public.demographics USING (PT_ID)
WHERE Gene IN ('SLC9A6', 'SLC9A7')
AND Canonical
AND impact = 'high'
+------+-------------+-----------+-------+------------------------------------------------------------------+---------------------------------+---------+------------------------+------+------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-----------+-------+------------------------------------------------------------------+---------------------------------+---------+------------------------+------+------------------------------------+
| 1 | SIMPLE | var | range | PRIMARY,variants_145K_Gene_index,variants_145K_Impact_Gene_index | variants_145K_Impact_Gene_index | 125 | NULL | 280 | Using index condition; Using where |
| 1 | SIMPLE | pVCF_145K | ref | PRIMARY,pVCF_145K_PT_ID_index | PRIMARY | 326 | public.var.Var_ID | 268 | |
| 1 | SIMPLE | icd_codes | ref | PRIMARY | PRIMARY | 38 | public.pVCF_145K.PT_ID | 29 | |
+------+-------------+-----------+-------+------------------------------------------------------------------+---------------------------------+---------+------------------------+------+------------------------------------+
This query with a join on demographics takes over 11 minutes, and I'm not sure how to interpret the difference in the explain results. Why is it resorting to using the join buffer? How can I optimize this further?
EXPLAIN
SELECT variants_145K.Var_ID,
variants_145K.Gene,
variants_145K.HGVSc,
pVCF_145K.PT_ID,
pVCF_145K.AD_ALT,
pVCF_145K.AD_REF,
demographics.Sex,
demographics.Age
FROM public.variants_145K
INNER JOIN public.pVCF_145K USING (Var_ID)
# inner join public.icd_codes using (PT_ID)
INNER JOIN public.demographics USING (PT_ID)
WHERE Gene IN ('SLC9A6', 'SLC9A7')
AND Canonical
AND impact = 'high'
+------+-------------+---------------+--------+------------------------------------------------------------------+---------------------------------+---------+-------------------------------------------------------+---------+------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+---------------+--------+------------------------------------------------------------------+---------------------------------+---------+-------------------------------------------------------+---------+------------------------------------+
| 1 | SIMPLE | variants_145K | range | PRIMARY,variants_145K_Gene_index,variants_145K_Impact_Gene_index | variants_145K_Impact_Gene_index | 125 | NULL | 280 | Using index condition; Using where |
| 1 | SIMPLE | demographics | ALL | PRIMARY | NULL | NULL | NULL | 1916393 | Using join buffer (flat, BNL join) |
| 1 | SIMPLE | pVCF_145K | eq_ref | PRIMARY,pVCF_145K_PT_ID_index | PRIMARY | 364 | public.variants_145K.Var_ID,public.demographics.PT_ID | 1 | |
+------+-------------+---------------+--------+------------------------------------------------------------------+---------------------------------+---------+-------------------------------------------------------+---------+------------------------------------+
Adding a further filter on demographics (WHERE demographics.Platform IS NOT NULL), as shown below, reduces the run time to 38 seconds. However, there are queries where we do not use such filters, so it would be ideal if it could use the primary PT_ID key in the joins.
+------+-------------+---------------+--------+------------------------------------------------------------------+---------------------------------+---------+-------------------------------------------------------+--------+------------------------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+---------------+--------+------------------------------------------------------------------+---------------------------------+---------+-------------------------------------------------------+--------+------------------------------------------------------------------------+
| 1 | SIMPLE | variants_145K | range | PRIMARY,variants_145K_Gene_index,variants_145K_Impact_Gene_index | variants_145K_Impact_Gene_index | 125 | NULL | 280 | Using index condition; Using where |
| 1 | SIMPLE | demographics | range | PRIMARY,Demographics_PLATFORM_index | Demographics_PLATFORM_index | 17 | NULL | 258544 | Using index condition; Using where; Using join buffer (flat, BNL join) |
| 1 | SIMPLE | pVCF_145K | eq_ref | PRIMARY,pVCF_145K_PT_ID_index | PRIMARY | 364 | public.variants_145K.Var_ID,public.demographics.PT_ID | 1 | |
+------+-------------+---------------+--------+------------------------------------------------------------------+---------------------------------+---------+-------------------------------------------------------+--------+------------------------------------------------------------------------+
The tables:
create table public.demographics # 1,916,393 rows
(
PT_ID varchar(9) not null
primary key,
Age float(3,1) null,
Status varchar(8) not null,
Sex varchar(7) not null,
Race_1 varchar(41) not null,
Race_2 varchar(41) not null,
Ethnicity varchar(22) not null,
Smoker_flag tinyint(1) not null,
Platform char(4) null,
MyCode_Consent tinyint(1) not null,
MR_ENC_DT date null,
Birthday date null,
Deathday date null,
max_unrelated_145K tinyint unsigned null
);
create index Demographics_PLATFORM_index
on public.demographics (Platform);
create table public.icd_codes # 116,220,141 rows
(
PT_ID varchar(9) not null,
ICD_CD varchar(8) not null,
ICD_NM varchar(217) not null,
DX_DT date not null,
PT_AGE float(3,1) unsigned not null,
CODE_SYSTEM char(7) not null,
primary key (PT_ID, ICD_CD, DX_DT)
);
create table public.pVCF_145K # 10,113,244,082 rows
(
Var_ID varchar(81) not null,
PT_ID varchar(9) not null,
GT tinyint unsigned not null,
GQ smallint unsigned not null,
AD_REF smallint unsigned not null,
AD_ALT smallint unsigned not null,
DP smallint unsigned not null,
FT varchar(30) null,
primary key (Var_ID, PT_ID)
);
create index pVCF_145K_PT_ID_index
on public.pVCF_145K (PT_ID);
create table public.variants_145K # 151,314,917 rows
(
Var_ID varchar(81) not null,
Gene varchar(22) null,
Feature varchar(18) not null,
Feature_type varchar(10) null,
HIGH_INF_POS tinyint(1) null,
Consequence varchar(26) not null,
rsid varchar(34) null,
Impact varchar(8) not null,
Canonical tinyint(1) not null,
Exon smallint unsigned null,
Intron smallint unsigned null,
HGVSc varchar(323) null,
HGVSp varchar(196) null,
AA_position smallint unsigned null,
gnomAD_NFE_MAF float null,
SIFT varchar(14) null,
PolyPhen varchar(17) null,
GHS_Hom mediumint(5) unsigned null,
GHS_Het mediumint(5) unsigned null,
GHS_WT mediumint(5) unsigned null,
IDT_MAF float null,
VCR_MAF float null,
UKB_MAF float null,
Chr tinyint unsigned not null,
Pos int(9) unsigned not null,
Ref varchar(298) not null,
Alt varchar(306) not null,
primary key (Var_ID, Feature)
);
create index variants_145K_Chr_Pos_Ref_Alt_index
on public.variants_145K (Chr, Pos, Ref, Alt);
create index variants_145K_Gene_index
on public.variants_145K (Gene);
create index variants_145K_Impact_Gene_index
on public.variants_145K (Impact, Gene);
create index variants_145K_rsid_index
on public.variants_145K (rsid);
This is on MariaDB 10.5.8 (InnoDB).
Thank you!
INDEX(impact, canonical, gene) or INDEX(canonical, impact, gene) is better for the var.
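A sketch of the DDL for one of those two suggestions (the index name is illustrative):
ALTER TABLE public.variants_145K
    ADD INDEX variants_145K_Impact_Canonical_Gene_index (Impact, Canonical, Gene);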
If you don't need it, remove INNER JOIN public.icd_codes USING (PT_ID). It is costly to reach into that table, and all it does is filter out any rows that fail in the JOIN.
Ditto for demographics.
The "join buffer" is not always a "resort to"; however, it is often a fast way. Especially if most of the table is needed and the join_buffer is big enough.
More
Note that demographics has a single-column PRIMARY KEY(PT_ID), but the other table has a composite PK. This probably impacts whether the Optimizer will even consider using the "join buffer".
Depending on a lot of things (in the query and the data), the Optimizer may make the wrong choice between join_buffer and repeatedly doing lookups.
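One way to test whether repeated primary-key lookups beat the join buffer here is to pin the join order with STRAIGHT_JOIN, purely as an experiment (a sketch based on the second query above, not a production recommendation):
EXPLAIN
SELECT STRAIGHT_JOIN
       variants_145K.Var_ID,
       variants_145K.Gene,
       variants_145K.HGVSc,
       pVCF_145K.PT_ID,
       pVCF_145K.AD_ALT,
       pVCF_145K.AD_REF,
       demographics.Sex,
       demographics.Age
FROM public.variants_145K
INNER JOIN public.pVCF_145K USING (Var_ID)
INNER JOIN public.demographics USING (PT_ID)
WHERE Gene IN ('SLC9A6', 'SLC9A7')
  AND Canonical
  AND impact = 'high';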
This is my query, which runs on one page of my site:
SELECT
DISTINCT b.CruisePortID,
b.SailingDates,
b.CruisePortID,
b.ArriveTime,
b.DepartTime,
b.PortName,
b.DayNumber
FROM
cruise_itineraries a,
cruise_itinerary_days b,
cruise_ports c
WHERE
a.ID = b.CruiseItineraryID
AND a.CruisePortID = c.ID
AND a.ID = '352905'
AND b.CruisePortID != 0
GROUP BY b.DayNumber;
When I run this query in phpMyAdmin it takes 3.20 sec, because cruise_itineraries has more than 300,000 records.
I also tried indexing; after indexing it shows 2.92 sec. Is it possible to reduce the query time to less than 0.10 sec? It would help my site's performance.
Here are the details:
CREATE TABLE IF NOT EXISTS `cruise_itineraries` (
`cl` int(11) NOT NULL,
`ID` bigint(20) NOT NULL,
`Description` varchar(500) NOT NULL,
`SailingPlanID` varchar(100) NOT NULL,
`VendorID` varchar(100) NOT NULL,
`VendorName` varchar(100) NOT NULL,
`ShipID` varchar(100) NOT NULL,
`ShipName` varchar(100) NOT NULL,
`Duration` int(11) NOT NULL,
`DestinationID` varchar(100) NOT NULL,
`Date` datetime NOT NULL,
`CruisePortID` varchar(100) NOT NULL,
`TradeRestriction` varchar(100) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
CREATE TABLE IF NOT EXISTS `cruise_itinerary_days` (
`cld` int(11) NOT NULL,
`CruiseItineraryID` varchar(100) NOT NULL,
`SailingDates` datetime NOT NULL,
`VendorID` int(11) NOT NULL,
`VendorName` varchar(100) NOT NULL,
`ShipID` int(11) NOT NULL,
`ShipName` varchar(100) NOT NULL,
`SailingPlanID` int(11) NOT NULL,
`PlanName` varchar(100) NOT NULL,
`DayNumber` bigint(20) NOT NULL,
`PortName` varchar(100) NOT NULL,
`CruisePortID` varchar(100) NOT NULL,
`ArriveTime` varchar(100) NOT NULL,
`DepartTime` varchar(100) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
CREATE TABLE IF NOT EXISTS `cruise_ports` (
`cp` int(11) NOT NULL,
`ID` varchar(100) NOT NULL,
`Name` varchar(100) NOT NULL,
`Description` varchar(1000) NOT NULL,
`NearestAirportCode` varchar(100) NOT NULL,
`UNCode` varchar(100) NOT NULL,
`Address` varchar(500) NOT NULL,
`City` varchar(100) NOT NULL,
`StateCode` varchar(100) NOT NULL,
`CountryCode` varchar(100) NOT NULL,
`PostalCode` varchar(100) NOT NULL,
`Phone` varchar(50) NOT NULL,
`Fax` varchar(100) NOT NULL,
`Directions` varchar(1000) NOT NULL,
`Content` varchar(1000) NOT NULL,
`HomePageURL` varchar(100) NOT NULL,
`Longitude` varchar(100) NOT NULL,
`Latitude` varchar(500) NOT NULL,
`CarnivalID` varchar(100) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
ALTER TABLE `cruise_itineraries`
ADD PRIMARY KEY (`cl`),
ADD KEY `ID_2` (`ID`);
ALTER TABLE `cruise_itinerary_days`
ADD PRIMARY KEY (`cld`);
ALTER TABLE `cruise_ports`
ADD PRIMARY KEY (`cp`);
ALTER TABLE `cruise_itineraries`
MODIFY `cl` int(11) NOT NULL AUTO_INCREMENT;
ALTER TABLE `cruise_itinerary_days`
MODIFY `cld` int(11) NOT NULL AUTO_INCREMENT;
ALTER TABLE `cruise_ports`
MODIFY `cp` int(11) NOT NULL AUTO_INCREMENT;
EXPLAIN RESULT:
+----+-------------+-------+------+---------------+------+---------+-------+---------+--------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+-------+---------+--------------------------------------------------------+
| 1 | SIMPLE | a | ref | ID_2 | ID_2 | 8 | const | 1 | Using index condition; Using temporary; Using filesort |
| 1 | SIMPLE | c | ALL | NULL | NULL | NULL | NULL | 3267 | Using where; Using join buffer (Block Nested Loop) |
| 1 | SIMPLE | b | ALL | NULL | NULL | NULL | NULL | 2008191 | Using where; Using join buffer (Block Nested Loop) |
+----+-------------+-------+------+---------------+------+---------+-------+---------+--------------------------------------------------------+
+----+-------------+-------+------+------------------------------------+------------------------------------+---------+-------+------+--------------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+------------------------------------+------------------------------------+---------+-------+------+--------------------------------------------------------------+
| 1 | SIMPLE | b | ref | Idx_CruiseItineraryID_CruisePortID | Idx_CruiseItineraryID_CruisePortID | 9 | const | 12 | Using index condition; Using temporary; Using filesort |
| 1 | SIMPLE | a | ref | ID_2 | ID_2 | 8 | const | 1 | Distinct |
| 1 | SIMPLE | c | ALL | NULL | NULL | NULL | NULL | 3267 | Using where; Distinct; Using join buffer (Block Nested Loop) |
+----+-------------+-------+------+------------------------------------+------------------------------------+---------+-------+------+--------------------------------------------------------------+
First, I would like to suggest that you avoid implicit MySQL joins; use INNER JOIN instead.
I personally think the INNER JOIN is better because it is more readable. It shows the relations between the tables more clearly: the relations live in the JOIN, and the filtering lives in the WHERE clause. This separation makes the query easier to read.
The faults I've found:
The data type of cruise_itineraries.ID is BIGINT, while cruise_itinerary_days.CruiseItineraryID is VARCHAR, yet you are matching them in the query. Because the comparison has to convert types for every row, it will run slowly no matter what index you put on cruise_itinerary_days.CruiseItineraryID.
Change the data type of cruise_itinerary_days.CruiseItineraryID to BIGINT.
ALTER TABLE cruise_itinerary_days MODIFY CruiseItineraryID BIGINT;
Next, create a composite index on the cruise_itinerary_days table based on your query:
ALTER TABLE cruise_itinerary_days ADD INDEX Idx_CruiseItineraryID_CruisePortID (CruiseItineraryID, CruisePortID);
Now create an index on the cruise_ports.ID field:
ALTER TABLE cruise_ports ADD INDEX Idx_cruise_ports_ID (ID);
And finally, here is the query reformulated with INNER JOINs, for the reasons stated above:
SELECT
DISTINCT b.CruisePortID,
b.SailingDates,
b.CruisePortID,
b.ArriveTime,
b.DepartTime,
b.PortName,
b.DayNumber
FROM cruise_itineraries a
INNER JOIN cruise_itinerary_days b ON a.ID = b.CruiseItineraryID
INNER JOIN cruise_ports c ON a.CruisePortID = c.ID
WHERE a.ID = 352905
AND b.CruisePortID != 0
GROUP BY b.DayNumber;
I can't find a way to speed up simple queries on a huge table.
I don't think I'm asking MySQL for anything crazy, even with this amount of data… and I can't understand why the following queries have such different execution times!
I did my best to read the articles about large data sets in MySQL and field optimization, and I already managed to reduce query time by tuning field types… but really, I'm getting lost now with this kind of simple query!
Here is an example on MySQL 5.1.69:
SELECT rv.`id_prd`,SUM(`quantite`)
FROM `report_ventes` AS rv
WHERE `periode` BETWEEN 201301 AND 201312
GROUP BY rv.`id_prd`
Execution time: 3.76 sec
Let's add a LEFT JOIN and another selected field:
SELECT rv.`id_prd`,SUM(`quantite`),`acl_cip_7`
FROM `report_ventes` AS rv
LEFT JOIN `report_produits` AS rp
ON (rv.`id_prd` = rp.`id_prd`)
WHERE `periode` BETWEEN 201301 AND 201312
GROUP BY rv.`id_prd`
Execution time: 12.10 sec
Explain:
+----+-------------+-------+--------+---------------+---------+---------+--------------------------+----------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------+---------+---------+--------------------------+----------+----------------------------------------------+
| 1 | SIMPLE | rv | ALL | periode | NULL | NULL | NULL | 16556188 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | rp | eq_ref | PRIMARY | PRIMARY | 4 | main_reporting.rv.id_prd | 1 | Using index |
+----+-------------+-------+--------+---------------+---------+---------+--------------------------+----------+----------------------------------------------+
And let's add another WHERE clause:
SELECT rv.`id_prd`,SUM(`quantite`),`acl_cip_7`
FROM `report_ventes` AS rv
LEFT JOIN `report_produits` AS rp
ON (rv.`id_prd` = rp.`id_prd`)
WHERE rp.`id_clas_prd` LIKE '1%'
AND `periode` BETWEEN 201301 AND 201312
GROUP BY rv.`id_prd`
Execution time: 21.00 sec
Explain:
+----+-------------+-------+--------+---------------------+---------+---------+--------------------------+----------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------------+---------+---------+--------------------------+----------+----------------------------------------------+
| 1 | SIMPLE | rv | ALL | periode | NULL | NULL | NULL | 16556188 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | rp | eq_ref | PRIMARY,id_clas_prd | PRIMARY | 4 | main_reporting.rv.id_prd | 1 | Using where |
+----+-------------+-------+--------+---------------------+---------+---------+--------------------------+----------+----------------------------------------------+
And here are the table definitions:
report_produits: 80,000 rows
CREATE TABLE `report_produits` (
`id_prd` int(11) unsigned NOT NULL,
`acl_cip_7` int(7) NOT NULL,
`acl_cip_ean_13` varchar(255) DEFAULT NULL,
`lib_prd` varchar(255) DEFAULT NULL,
`id_clas_prd` char(7) NOT NULL DEFAULT '',
`id_lab_prd` int(11) unsigned NOT NULL,
`id_rbt_prd` int(11) unsigned NOT NULL,
`id_tva_prd` int(11) unsigned NOT NULL,
`t_gen` varchar(255) NOT NULL,
`id_grp_gen` varchar(16) NOT NULL DEFAULT '',
`id_liste_delivrance` int(11) unsigned NOT NULL,
PRIMARY KEY (`id_prd`),
KEY `index_lab` (`id_lab_prd`),
KEY `index_grp` (`id_grp_gen`),
KEY `id_clas_prd` (`id_clas_prd`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
report_ventes: 16,556,188 rows
CREATE TABLE `report_ventes` (
`id` int(13) NOT NULL AUTO_INCREMENT,
`periode` mediumint(6) DEFAULT NULL,
`id_phie` smallint(4) unsigned NOT NULL,
`id_prd` mediumint(8) unsigned NOT NULL,
`quantite` smallint(11) DEFAULT NULL,
`ca_ht` decimal(10,2) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `periode` (`periode`)
) ENGINE=MyISAM AUTO_INCREMENT=18491315 DEFAULT CHARSET=utf8;
There is no covering index, and MySQL decides that scanning the whole table is more effective than using an index and looking up the requested values.
You are joining to report_ventes on id_prd, but that column is not part of the clustered index (the PK in MySQL). This means the server has to look up all the values. The server bypasses the periode index, probably because it is not selective enough to be worth using.
An index that includes the id_prd, periode and quantite columns could help. With such an index there is a good chance the MySQL server will use it, since it is a covering index for this query.
Give it a try, but it's hard to tell for sure without testing it in the actual environment.
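A sketch of such a covering index (the column order is a judgment call: periode first so the BETWEEN range can use it, then the GROUP BY column, then the summed column; the index name is illustrative):
ALTER TABLE report_ventes
    ADD INDEX idx_periode_prd_qte (periode, id_prd, quantite);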
Basically, your indexes are not being used. I can't spot the precise reason without trying it on a SQL server, but a common cause is that the data has different types.
AND periode BETWEEN 201301 AND 201312
periode has data type mediumint(6), while the literal 201301 probably has data type int.
LEFT JOIN `report_produits` AS rp ON (rv.`id_prd` = rp.`id_prd`)
Here, too, the two data types are different.
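If you also want the join columns to have identical types, a sketch of aligning report_ventes.id_prd with report_produits.id_prd (which is INT(11) UNSIGNED in the posted definition; check that this fits your data first):
ALTER TABLE report_ventes
    MODIFY id_prd INT(11) UNSIGNED NOT NULL;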
I need help optimizing the following query:
select
DATE_FORMAT( traffic.stat_date, '%Y/%m'),
pt.promotion,
sum(traffic.voice_nat_onnet_mins - pt.promo_minutes_onnet) as total_onnet_mins,
sum(traffic.voice_nat_offnet_mins + traffic.voice_nat_landline_mins + traffic.voice_int_mins + traffic.voice_nng_mins + traffic.voice_not_rec_mins - pt.promo_minutes_offnet) as total_offnet_mins,
sum(traffic.sms_ptp_onnet_evts) as total_onnet_sms,
sum(traffic.sms_ptp_offnet_evts + traffic.sms_vas_pta_evts) as total_offnet_sms,
sum(traffic.dati_kb) as internet_kb
from
stats_novercanet.mnp_prod_stat_outgoing_traffic traffic
INNER JOIN stats_novercanet.mnp_prod_stat_promotion_traffic pt
ON pt.id_source_user=traffic.id_source_user
INNER JOIN stats_novercanet.mnp_prod_stat_customer_first_signup fs
ON pt.id_source_user = fs.id_source_user
where
traffic.stat_date between '2013-11-01' and '2013-11-30'
and traffic.stat_date >= (
select min(ft.stat_date)
from stats_novercanet.mnp_prod_stat_promotion_traffic ft
where
traffic.id_source_user=ft.id_source_user
and (ft.sub_rev>0 or ft.ren_rev>0)
and pt.promotion=ft.promotion
)
and pt.stat_date between '2013-11-01' and '2013-11-30'
group by
DATE_FORMAT( traffic.stat_date, '%Y/%m'),
pt.promotion
order by
DATE_FORMAT( traffic.stat_date, '%Y/%m'),
pt.promotion
I have used EXPLAIN for this query and it showed me the following result:
+----+--------------------+---------+-------+------------------------------------------------+---------------------------------+---------+-----------------------------------------+--------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+---------+-------+------------------------------------------------+---------------------------------+---------+-----------------------------------------+--------+----------------------------------------------+
| 1 | PRIMARY | pt | range | idx_prod_stat_pro_tra_stat_date,id_source_user | idx_prod_stat_pro_tra_stat_date | 4 | NULL | 530114 | Using where; Using temporary; Using filesort |
| 1 | PRIMARY | fs | ref | id_source_user | id_source_user | 5 | stats_novercanet.pt.id_source_user | 1 | Using where; Using index |
| 1 | PRIMARY | traffic | ref | stat_date,id_source_user | id_source_user | 5 | stats_novercanet.pt.id_source_user | 60 | Using where |
| 2 | DEPENDENT SUBQUERY | ft | ref | id_source_user,promotion | id_source_user | 5 | stats_novercanet.traffic.id_source_user | 93 | Using where |
+----+--------------------+---------+-------+------------------------------------------------+---------------------------------+---------+-----------------------------------------+--------+----------------------------------------------+
Any help on optimization would be great. I have created indexes on id_source_user, stat_date and promotion as well, but no luck. I also tried rewriting the subquery as a join, but no luck; one form of that rewrite is sketched below.
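For reference, one common way to express that dependent subquery as a derived-table join (a hypothetical sketch; it may or may not match what was already tried, and it is untested against this schema):
SELECT
    DATE_FORMAT(traffic.stat_date, '%Y/%m'),
    pt.promotion,
    SUM(traffic.voice_nat_onnet_mins - pt.promo_minutes_onnet) AS total_onnet_mins,
    SUM(traffic.voice_nat_offnet_mins + traffic.voice_nat_landline_mins + traffic.voice_int_mins + traffic.voice_nng_mins + traffic.voice_not_rec_mins - pt.promo_minutes_offnet) AS total_offnet_mins,
    SUM(traffic.sms_ptp_onnet_evts) AS total_onnet_sms,
    SUM(traffic.sms_ptp_offnet_evts + traffic.sms_vas_pta_evts) AS total_offnet_sms,
    SUM(traffic.dati_kb) AS internet_kb
FROM stats_novercanet.mnp_prod_stat_outgoing_traffic traffic
INNER JOIN stats_novercanet.mnp_prod_stat_promotion_traffic pt
        ON pt.id_source_user = traffic.id_source_user
INNER JOIN stats_novercanet.mnp_prod_stat_customer_first_signup fs
        ON pt.id_source_user = fs.id_source_user
# Pre-aggregate the earliest revenue date per (user, promotion) once,
# instead of re-running a correlated MIN() subquery per joined row.
INNER JOIN (
    SELECT id_source_user, promotion, MIN(stat_date) AS first_rev_date
    FROM stats_novercanet.mnp_prod_stat_promotion_traffic
    WHERE sub_rev > 0 OR ren_rev > 0
    GROUP BY id_source_user, promotion
) ft ON  ft.id_source_user = traffic.id_source_user
     AND ft.promotion = pt.promotion
WHERE traffic.stat_date BETWEEN '2013-11-01' AND '2013-11-30'
  AND traffic.stat_date >= ft.first_rev_date
  AND pt.stat_date BETWEEN '2013-11-01' AND '2013-11-30'
GROUP BY DATE_FORMAT(traffic.stat_date, '%Y/%m'), pt.promotion
ORDER BY DATE_FORMAT(traffic.stat_date, '%Y/%m'), pt.promotion;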
The SHOW CREATE TABLE output for mnp_prod_stat_promotion_traffic is as follows:
| mnp_prod_stat_promotion_traffic | CREATE TABLE `mnp_prod_stat_promotion_traffic` (
`stat_date` date DEFAULT NULL,
`id_source_user` int(64) DEFAULT NULL,
`promotion` varchar(64) DEFAULT NULL,
`num_of_sub` int(64) DEFAULT NULL,
`num_of_ren` int(64) DEFAULT NULL,
`credit` float DEFAULT NULL,
`minutes` float DEFAULT NULL,
`kb` float DEFAULT NULL,
`sms` int(64) DEFAULT NULL,
`lbs` int(64) DEFAULT NULL,
`sub_rev` float DEFAULT NULL,
`ren_rev` float DEFAULT NULL,
`consumed_credit` float DEFAULT NULL,
`sim_type` varchar(32) DEFAULT NULL,
`price_plan` varchar(64) DEFAULT NULL,
`WiFi_mins` float DEFAULT NULL,
`over_min` float DEFAULT NULL,
`over_min_consumed` float DEFAULT NULL,
`over_sms` float DEFAULT NULL,
`over_sms_consumed` float DEFAULT NULL,
`over_data` float DEFAULT NULL,
`over_data_consumed` float DEFAULT NULL,
`promo_minutes_onnet` float DEFAULT NULL,
`promo_minutes_offnet` float DEFAULT NULL,
`promo_sms_onnet` int(64) DEFAULT NULL,
`promo_sms_offnet` int(64) DEFAULT NULL,
KEY `idx_prod_stat_pro_tra_stat_date` (`stat_date`),
KEY `id_source_user` (`id_source_user`),
KEY `promotion` (`promotion`) USING BTREE
) ENGINE=MyISAM DEFAULT CHARSET=latin1 |
How many results are you expecting to get back? If you know, for example, that you only want one record returned, then you can use LIMIT.
When a query runs it will search the whole table for matching records, but if you know there are only one, two or three results to return, then a LIMIT will save MySQL a lot of time. Again, it depends on the number of results you are expecting, and you will have to apply it to the queries you are running.
Also, another tip is to check which table (storage engine) types you are using. Have a look at this webpage for more information: http://www.mysqltutorial.org/understand-mysql-table-types-innodb-myisam.aspx
Another option is to build a script that uses your existing query above and stores the result in a new table, and only run the script once a month via cron, at midnight for example. I did this for an analytical project, and it worked well.
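A sketch of that pre-aggregation idea (the table name is hypothetical, and the column list is trimmed to two of the question's aggregates):
CREATE TABLE IF NOT EXISTS stats_novercanet.monthly_promotion_summary (
    stat_month        CHAR(7)     NOT NULL,   # e.g. '2013/11'
    promotion         VARCHAR(64) NOT NULL,
    total_onnet_mins  DOUBLE,
    total_offnet_mins DOUBLE,
    PRIMARY KEY (stat_month, promotion)
);

# A monthly cron job would then run the existing report query and do:
# INSERT INTO stats_novercanet.monthly_promotion_summary
# SELECT <the question's query, with its SELECT list matched to these columns>;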
Here is the query:
select timespans.id as timespan_id, count(*) as num
from reports, timespans
where timespans.after_date >= '2011-04-13 22:08:38' and
timespans.after_date <= reports.authored_at and
reports.authored_at < timespans.before_date
group by timespans.id;
Here are the table defs:
CREATE TABLE `reports` (
`id` int(11) NOT NULL auto_increment,
`source_id` int(11) default NULL,
`url` varchar(255) default NULL,
`lat` decimal(20,15) default NULL,
`lng` decimal(20,15) default NULL,
`content` text,
`notes` text,
`authored_at` datetime default NULL,
`created_at` datetime default NULL,
`updated_at` datetime default NULL,
`data` text,
`title` varchar(255) default NULL,
`author_id` int(11) default NULL,
`orig_id` varchar(255) default NULL,
PRIMARY KEY (`id`),
KEY `index_reports_on_title` (`title`),
KEY `index_content_on_reports` (`content`(128))
);
CREATE TABLE `timespans` (
`id` int(11) NOT NULL auto_increment,
`after_date` datetime default NULL,
`before_date` datetime default NULL,
`after_offset` int(11) default NULL,
`before_offset` int(11) default NULL,
`is_common` tinyint(1) default NULL,
`created_at` datetime default NULL,
`updated_at` datetime default NULL,
`is_search_chunk` tinyint(1) default NULL,
`is_day` tinyint(1) default NULL,
PRIMARY KEY (`id`),
KEY `index_timespans_on_after_date` (`after_date`),
KEY `index_timespans_on_before_date` (`before_date`)
);
And here is the explain:
+----+-------------+-----------+-------+--------------------------------------------------------------+-------------------------------+---------+------+--------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+-------+--------------------------------------------------------------+-------------------------------+---------+------+--------+----------------------------------------------+
| 1 | SIMPLE | timespans | range | index_timespans_on_after_date,index_timespans_on_before_date | index_timespans_on_after_date | 9 | NULL | 84 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | reports | ALL | NULL | NULL | NULL | NULL | 183297 | Using where |
+----+-------------+-----------+-------+--------------------------------------------------------------+-------------------------------+---------+------+--------+----------------------------------------------+
And here is the explain after I created an index on authored_at. As you can see, the index is not actually getting used (I think...):
+----+-------------+-----------+-------+--------------------------------------------------------------+-------------------------------+---------+------+--------+------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+-------+--------------------------------------------------------------+-------------------------------+---------+------+--------+------------------------------------------------+
| 1 | SIMPLE | timespans | range | index_timespans_on_after_date,index_timespans_on_before_date | index_timespans_on_after_date | 9 | NULL | 86 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | reports | ALL | index_reports_on_authored_at | NULL | NULL | NULL | 183317 | Range checked for each record (index map: 0x8) |
+----+-------------+-----------+-------+--------------------------------------------------------------+-------------------------------+---------+------+--------+------------------------------------------------+
There are about 142k rows in the reports table, and far fewer in the timespans table.
The query is taking about 3 seconds now.
The strange thing is that if I add an index on reports.authored_at, it actually makes the query far slower, about 20 seconds. I would have thought it would do the opposite, since it would make it easy to find the reports at either end of the range, and throw the rest away, rather than having to examine all of them.
Can someone clarify? I'm stumped.
Instead of two separate indexes on the timespans table, try merging them into a single multi-column index that has after_date and before_date together. Then add an index on authored_at as well.
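A sketch of that indexing advice (reading it as: one composite index on timespans, keeping an index on reports.authored_at; the new index name is illustrative):
ALTER TABLE timespans
    ADD INDEX index_timespans_on_after_before (after_date, before_date),
    DROP INDEX index_timespans_on_after_date,
    DROP INDEX index_timespans_on_before_date;

# The reports.authored_at index shown in the second EXPLAIN above would stay in place:
# ALTER TABLE reports ADD INDEX index_reports_on_authored_at (authored_at);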
I would rewrite your query like this:
select t.id, count(*) as num
from timespans t
join reports r
  on r.authored_at >= t.after_date       # keeps the original per-timespan lower bound
 and r.authored_at <  t.before_date
where t.after_date >= '2011-04-13 22:08:38'
  and r.authored_at >= '2011-04-13 22:08:38'   # sargable constant filter so the authored_at index can help
group by t.id
order by null;
and change the indexes of the tables:
alter table reports add index authored_at_idx(authored_at);
You can also use the database's partitioning feature on the after_date column. It can help a lot.
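A heavily hedged sketch of what that would involve: MySQL requires the partitioning column to be part of every unique key and NOT NULL when it is in the primary key, so the schema would have to change first (illustrative only, not a drop-in recommendation):
ALTER TABLE timespans
    MODIFY after_date DATETIME NOT NULL,
    DROP PRIMARY KEY,
    ADD PRIMARY KEY (id, after_date);

ALTER TABLE timespans
    PARTITION BY RANGE (TO_DAYS(after_date)) (
        PARTITION p2010 VALUES LESS THAN (TO_DAYS('2011-01-01')),
        PARTITION p2011 VALUES LESS THAN (TO_DAYS('2012-01-01')),
        PARTITION pmax  VALUES LESS THAN MAXVALUE
    );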