MySQL: Why are not all keys of the index used?

I have a table with 50 columns. I defined one (non-unique) index over the following six columns:
rdsr_id (int),
StartOfXrayIrradiation (datetime),
PatientsBirthDate (date),
DeviceObserverUID (varchar(100)),
IdentifiedProtocolShort (varchar(50)),
RedundantEntryFromDoseSummary (tinyint(1))
The table is called report and has around 20'000 rows and is growing. When running the following query, the result shows that only 4 keys of the index are used.
EXPLAIN EXTENDED SELECT r.PatientID, r.StartOfXrayIrradiation, MeanCTDIvol_in_mGy
FROM report r
INNER JOIN ct_irradiation_events e ON r.rdsr_id = e.rdsr_id
INNER JOIN patient_age_categories a ON ( DATEDIFF( r.StartOfXrayIrradiation, r.PatientsBirthDate ) <= a.max_age_days
AND DATEDIFF( r.StartOfXrayIrradiation, r.PatientsBirthDate ) >= a.min_age_days
AND a.description = 'Erwachsene' )
WHERE MeanCTDIvol_in_mGy IS NOT NULL
AND r.DeviceObserverUID = '2.25'
AND r.IdentifiedProtocolShort = 'XXXXX'
AND r.RedundantEntryFromDoseSummary =0
AND e.CTAcquisitionType != 'Constant Angle Acquisition'
AND DATEDIFF( r.StartOfXrayIrradiation, '2013-01-06' ) >=0
AND DATEDIFF( r.StartOfXrayIrradiation, '2014-03-06' ) <=0;
result for table report:
> id: 1
> select_type: SIMPLE
> table: r
> type: ref
> possible_keys: TimelineHistogramQueries
> key: TimelineHistogramQueries
> key_len: 4
> ref: rdsr.e.rdsr_id
> rows: 1
> filtered: 100.00
> Extra: Using where
So I guess the columns IdentifiedProtocolShort and RedundantEntryFromDoseSummary are not used? The query returns 1400 rows; when I remove those two conditions from the WHERE clause, it returns 9500 rows. BTW: I did run "ANALYZE TABLE report" after creating the index, if that matters...
Why are not all keys of the index used? Should I change my index?

Assuming that your TimelineHistogramQueries key is a compound key over the six columns that you list in that order, then the key_len value of 4 (bytes) does indeed indicate that only the rdsr_id column is being used from the index: this is supported by the ref value of rdsr.e.rdsr_id.
You ask why IdentifiedProtocolShort and RedundantEntryFromDoseSummary (columns 5 and 6 in the index) are not being used. As documented under Multiple-Column Indexes:
MySQL cannot use the index to perform lookups if the columns do not form a leftmost prefix of the index.
If you do not require the columns of this index to be in their current order for any other query, you could merely reorder the columns; otherwise, you may need to define a second index.
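For illustration only, a sketch of such a second index, assuming the predicates in this query are representative (the index name is made up). The equality-tested columns come first so that they all form a usable leftmost prefix; StartOfXrayIrradiation is only ever wrapped in DATEDIFF() here, which makes it non-sargable, so it contributes little either way:
CREATE INDEX TimelineHistogramQueries2 ON report
    (rdsr_id,                        -- join equality
     DeviceObserverUID,              -- = '2.25'
     IdentifiedProtocolShort,        -- = 'XXXXX'
     RedundantEntryFromDoseSummary,  -- = 0
     StartOfXrayIrradiation,
     PatientsBirthDate);
As an aside, rewriting the date filter as r.StartOfXrayIrradiation >= '2013-01-06' AND r.StartOfXrayIrradiation < '2014-03-07' would make that range condition sargable as well.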

It depends on what you want out of your query. Leave patient ID and date of birth out of it if you are only interested in seeing, e.g., on which dates your patients had an X-ray; keep them if you are running your analysis by age. You may be confusing the optimizer by trying to index everything at once.

Related

How to speed up a query containing HAVING?

I have a table with close to a billion records and need to query it with HAVING. It is very slow (about 15 minutes on decent hardware). How can I speed it up?
SELECT ((mean - 3.0E-4)/(stddev/sqrt(N))) as t, ttest.strategyid, mean, stddev, N,
kurtosis, strategies.strategyId
FROM ttest,strategies
WHERE ttest.strategyid=strategies.id AND dataset=3 AND patternclassid="1"
AND exitclassid="1" AND N>= 300 HAVING t>=1.8
I think the problem is t cannot be indexed because it needs to be computed. I cannot add it as a column because the '3.0E-4' will vary per query.
Table:
create table ttest (
strategyid bigint,
patternclassid integer not null,
exitclassid integer not null,
dataset integer not null,
N integer,
mean double,
stddev double,
skewness double,
kurtosis double,
primary key (strategyid, dataset)
);
create index ti3 on ttest (mean);
create index ti4 on ttest (dataset,patternclassid,exitclassid,N);
create table strategies (
id bigint ,
strategyId varchar(500),
primary key(id),
unique key(strategyId)
);
explain select ...:
id | select_type | table      | partitions | type   | possible_keys | key     | key_len | ref                             | rows    | filtered | Extra
1  | SIMPLE      | ttest      | NULL       | range  | PRIMARY,ti4   | ti4     | 17      | NULL                            | 1910344 | 100.00   | Using index condition; Using MRR
1  | SIMPLE      | strategies | NULL       | eq_ref | PRIMARY       | PRIMARY | 8       | Jellyfish_test.ttest.strategyid | 1       | 100.00   | Using where
The query needs to be reformulated, and an index needs to be added.
Plan A:
SELECT ((tt.mean - 3.0E-4)/(tt.stddev/sqrt(tt.N))) as t,
tt.strategyid, tt.mean, tt.stddev, tt.N, tt.kurtosis,
s.strategyId
FROM ttest AS tt
JOIN strategies AS s ON tt.strategyid = s.id
WHERE tt.dataset = 3
AND tt.patternclassid = 1
AND tt.exitclassid = 1
AND tt.N >= 300
AND ((tt.mean - 3.0E-4)/(tt.stddev/sqrt(tt.N))) >= 1.8
and a 'composite' and 'covering' index on ttest. Replace your ti4 with this (to make it 'covering'):
INDEX(dataset, patternclassid, exitclassid, -- any order
N, strategyid) -- in this order
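Spelled out as DDL (a sketch; the name ti4b is made up). Note that to be truly covering for this SELECT, the index also has to carry the mean, stddev and kurtosis columns it reads:
DROP INDEX ti4 ON ttest;
CREATE INDEX ti4b ON ttest
    (dataset, patternclassid, exitclassid,  -- equality columns, any order
     N,                                     -- the range column
     strategyid, mean, stddev, kurtosis);   -- remaining columns the SELECT reads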
Plan B:
SELECT ((tt.mean - 3.0E-4)/(tt.stddev/sqrt(tt.N))) as t,
tt.strategyid, tt.mean, tt.stddev, tt.N, tt.kurtosis,
( SELECT s.strategyId
FROM strategies AS s
WHERE s.id = tt.strategyid
) AS strategyId
FROM ttest AS tt
WHERE tt.dataset = 3
AND tt.patternclassid = 1
AND tt.exitclassid = 1
AND tt.N >= 300
AND ((tt.mean - 3.0E-4)/(tt.stddev/sqrt(tt.N))) >= 1.8
With the same index.
Unfortunately the expression for t needs to be repeated. Moving it from HAVING to WHERE avoids gathering unwanted rows only to throw them away later. Maybe the optimizer does that automatically; please provide EXPLAIN SELECT ... so we can check.
Also, it is unclear whether one of the two formulations will run faster than the other.
To be honest, I've never seen HAVING used like this; for 20+ years I've assumed it could only be used in GROUP BY situations!
Anyway, IMHO you don't need it here: as Rick James points out, you can put it all in the WHERE clause.
Rewriting it a bit I end up with:
SELECT ((t.mean - 3.0E-4)/(t.stddev/sqrt(t.N))) as t,
t.strategyid,
t.mean,
t.stddev,
t.N,
t.kurtosis,
s.strategyId
FROM ttest t
JOIN strategies s
ON s.id = t.strategyid
WHERE t.dataset=3
AND t.patternclassid="1"
AND t.exitclassid="1"
AND t.N>= 300
AND ((t.mean - 3.0E-4)/(t.stddev/sqrt(t.N))) >= 1.8
For most of those predicates we can indeed foresee a reasonable index. The problem remains with the last calculation:
AND ((t.mean - 3.0E-4)/(t.stddev/sqrt(t.N))) >= 1.8
However, before we go there: how many rows are there if you ignore this 'formula'? 100? 200? If so, indexing as foreseen in Rick James' answer should be sufficient IMHO.
If it's in the 1000's or many more, the question becomes: how many of those are thrown out by the formula? 1%? 50%? 99%? If it's on the low side then again, indexing as proposed by Rick James will do. If however you only need to keep a few, you may want to optimize this further and index accordingly.
From your explanation I understand that 3.0E-4 varies per query, so we can't include it in the index; we'll need to extract the parts we can.
If my algebra isn't failing me (and assuming t.stddev / sqrt(t.N) is positive, so the inequality keeps its direction when we multiply by it), you can play with the formula like this:
AND ((t.mean - 3.0E-4) / (t.stddev / sqrt(t.N))) >= 1.8
AND (t.mean - 3.0E-4) >= 1.8 * (t.stddev / sqrt(t.N))
AND t.mean - 3.0E-4 >= (1.8 * (t.stddev / sqrt(t.N)))
AND -3.0E-4 >= (1.8 * (t.stddev / sqrt(t.N))) - t.mean
So the query becomes:
SELECT ((t.mean - 3.0E-4)/(t.stddev/sqrt(t.N))) as t,
t.strategyid,
t.mean,
t.stddev,
t.N,
t.kurtosis,
s.strategyId
FROM ttest t
JOIN strategies s
ON s.id = t.strategyid
WHERE t.dataset=3
AND t.patternclassid="1"
AND t.exitclassid="1"
AND t.N>= 300
AND (1.8 * (t.stddev / sqrt(t.N))) - t.mean <= -3.0E-4
I'm not familiar with MySQL, but glancing at the documentation it should be possible to include 'generated columns' in an index (supported since MySQL 5.7). So we'll do exactly that with (1.8 * (stddev / sqrt(N))) - mean.
Your indexed fields thus become:
dataset, patternclassid, exitclassid, N, (1.8 * (stddev / sqrt(N))) - mean
Note that the system will have to calculate this value for each and every row you insert (and possibly update) in the table. However, once it is there (and indexed) it should make the query quite a bit faster.
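A minimal sketch of that approach, assuming MySQL 5.7+ and that the 1.8 factor is constant across queries (the column name t_threshold and index name ti5 are made up; only the variable 3.0E-4 term stays outside the index):
ALTER TABLE ttest
    ADD COLUMN t_threshold DOUBLE
        GENERATED ALWAYS AS (1.8 * (stddev / SQRT(N)) - mean) STORED,
    ADD INDEX ti5 (dataset, patternclassid, exitclassid, N, t_threshold);
The final predicate then becomes simply AND tt.t_threshold <= -3.0E-4.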

Optimizing SQL query with sub queries

I have an SQL query that I have tried to optimize; through various means I could reduce the time from over 5 seconds to about 1.3 seconds, but no further. I was wondering if anyone could suggest further improvements.
The EXPLAIN diagram shows a full scan:
[image: EXPLAIN diagram]
The EXPLAIN table gives more details:
[image: EXPLAIN tabular output]
The query is simplified and shown below - just for reference, I'm using MySQL 5.6
select * from (
select
@row_num := if(@yacht_id = yacht_id and @charter_type = charter_type and @start_base_id = start_base_id and @end_base_id = end_base_id, @row_num + 1, 1) as row_number,
@yacht_id := yacht_id as yacht_id,
@charter_type := charter_type as charter_type,
@start_base_id := start_base_id as start_base_id,
@end_base_id := end_base_id as end_base_id,
model, offer_type, instant, rating, reviews, loa, berths, cabins, currency, list_price, list_price_per_day,
discount, client_price, client_price_per_day, days, date_from, date_to, start_base_city, end_base_city, start_base_country, end_base_country,
service_binary, product_id, ext_yacht_id, main_image_url
from (
select
offer.yacht_id, offer.charter_type, yacht.model, offer.offer_type, offer.instant, yacht.rating, yacht.reviews, yacht.loa,
yacht.berths, yacht.cabins, offer.currency, offer.list_price, offer.list_price_per_day,
offer.discount, offer.client_price, offer.client_price_per_day, offer.days, date_from, date_to,
offer.start_base_city, offer.end_base_city, offer.start_base_country, offer.end_base_country,
offer.service_binary, offer.product_id, offer.start_base_id, offer.end_base_id,
yacht.ext_yacht_id, yacht.main_image_url
from website_offer as offer
join website_yacht as yacht
on offer.yacht_id = yacht.yacht_id,
(select @yacht_id := '') as init
where date_from > CURDATE()
and date_to <= CURDATE() + INTERVAL 3 MONTH
and days = 7
order by offer.yacht_id, charter_type, start_base_id, end_base_id, list_price_per_day asc, discount desc
) as filtered_offers
) as offers
where row_number=1;
Thanks,
goppi
UPDATE
I had to abandon some performance improvements and replaced the original select with the new one. The select query is actually built dynamically by the backend based on which filter criteria are set, so the WHERE clause of the innermost select can expand quite a lot. However, this is the default select if no filter is set, and it is the version that takes significantly longer than 1 sec.
explain in text form:
id | select_type | table | type   | possible_keys                    | key         | key_len | ref                         | rows   | Extra
1  | PRIMARY     |       | ref    | <auto_key0>                      | <auto_key0> | 9       | const                       | 10     |
2  | DERIVED     |       | ALL    |                                  |             |         |                             | 385967 |
3  | DERIVED     |       | system |                                  |             |         |                             | 1      | Using filesort
3  | DERIVED     | offer | ref    | idx_yachtid,idx_search,idx_dates | idx_dates   | 5       | const                       | 385967 | Using index condition; Using where
3  | DERIVED     | yacht | eq_ref | PRIMARY,id_UNIQUE                | PRIMARY     | 4       | yachtcharter.offer.yacht_id | 1      |
4  | DERIVED     |       |        |                                  |             |         |                             |        | No tables used
Sub-selects are never great.
You could sign up here: https://www.eversql.com/
Run the query through it and it will suggest the indexes and optimisations you need for this query.
There's still some optimization you can use. Considering the subquery returns only about 5000 rows, you could use an index for it.
First rephrase the predicate as:
select *
from website_offer
where date_from >= CURDATE() + INTERVAL 1 DAY -- rephrased here
and date(date_to) <= CURDATE() + INTERVAL 3 MONTH
and days = 7
order by yacht_id, charter_type, list_price_per_day asc, discount desc
limit 5000
Then, if you add the following index the performance could improve:
create index ix1 on website_offer (days, date_from, date_to);
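As an aside beyond these answers: the @variable block in the question emulates ROW_NUMBER(), so if an upgrade from MySQL 5.6 to 8.0+ ever becomes an option, the 'first row per group' step can be written directly with a window function. A sketch over website_offer alone (the join to website_yacht is omitted for brevity):
select *
from (
    select o.*,
           row_number() over (
               partition by yacht_id, charter_type, start_base_id, end_base_id
               order by list_price_per_day asc, discount desc
           ) as row_num
    from website_offer as o
    where date_from > CURDATE()
      and date_to <= CURDATE() + INTERVAL 3 MONTH
      and days = 7
) ranked
where row_num = 1;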

mysql LIKE query takes >1 minute but mix LIKE + MATCH/AGAINST 1 second

I have a small table with 3 fields and 5 million rows:
The fields are Id, DEPTAIRPORT, and ARRAIRPORT; the two airport fields are FULLTEXT indexed.
When running this query it takes more than 1 minute to get a result:
SELECT * FROM test_copy
WHERE
DEPTAIRPORT LIKE 'buenos%' AND
ARRAIRPORT LIKE 'tokyo%';
However when I use this query, it takes 1 second:
SELECT * FROM test_copy
WHERE
DEPTAIRPORT LIKE 'buenos%' AND
MATCH (ARRAIRPORT) AGAINST ('tokyo*' IN BOOLEAN MODE)
Using two MATCH/AGAINST clauses (instead of LIKE) also takes >1 minute.
Question: why do two LIKEs take so long, while LIKE plus MATCH/AGAINST is so fast?
My main concern is that the final table will have 10 fields (and >30 million records) and the search criteria will then be more complex, including LIKE on other fields... but if just two LIKEs cause such a delay, I would like to fix the problem before proceeding...
EDIT:
Fields are FULLTEXT.
EXPLAIN shows for 1st query (LIKE + LIKE):
select_type: SIMPLE
type: ALL
possible_keys: DEPTAIRPORT,ARRAIRIPORT
key: NULL
key_len: NULL
ref: NULL
rows: 4444841
extra: Using where
EXPLAIN shows for 2nd query (LIKE + MATCH/AGAINST):
select_type: SIMPLE
type: fulltext
possible_keys: DEPTAIRPORT,ARRAIRIPORT
key: ARRAIRIPORT
key_len: 0
ref: NULL
rows: 1
extra: Using where
2nd EDIT:
I have restarted HeidiSQL and now the LIKE+LIKE query takes 10 seconds (before it was >1 minute).
Looks strange...
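A note on the 'why', hedged since the full schema isn't shown: a FULLTEXT index can only serve MATCH ... AGAINST, never LIKE, which is consistent with the first EXPLAIN choosing key: NULL and scanning 4444841 rows. A prefix pattern such as 'buenos%' can use an ordinary BTREE index, though, so if DEPTAIRPORT and ARRAIRPORT currently carry only FULLTEXT indexes, adding plain ones may fix the LIKE+LIKE case (the index names below are made up):
ALTER TABLE test_copy
    ADD INDEX idx_dept (DEPTAIRPORT),
    ADD INDEX idx_arr (ARRAIRPORT);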

MySQL groupby with sum

I have a query with GROUP BY and SUM. I have close to 1 million records. When I run the query it takes 2.5 s; if I remove the GROUP BY clause it takes 0.89 s. Is there any way to optimize a query that uses GROUP BY and SUM together?
SELECT aggEI.ei_uuid AS uuid,aggEI.companydm_id AS companyId,aggEI.rating AS rating,aggEI.ei_name AS name,
compdm.company_name AS companyName,sum(aggEI.count) AS activity
FROM AGG_EXTERNALINDIVIDUAL AS aggEI
JOIN COMPANYDM AS compdm ON aggEI.companydm_id = compdm.companydm_id
WHERE aggEI.ei_uuid is not null
and aggEI.companydm_id IN (8)
and aggEI.datedm_id = 20130506
AND aggEI.topicgroupdm_id IN (1,2,3,4,5,6,7)
AND aggEI.rating >= 0
AND aggEI.rating <= 100
GROUP BY aggEI.ei_uuid,aggEI.companydm_id
LIMIT 0,200000
Explain result is as below:
id | select_type | table  | type  | possible_keys | key | key_len | ref | rows | Extra
1  | SIMPLE | compdm | const | PRIMARY,companydm_id_UNIQUE,comp_idx | PRIMARY | 8 | const | 1 | Using temporary; Using filesort
1  | SIMPLE | aggEI | ref | PRIMARY,datedm_id_UNIQUE,agg_ei_comdm_fk_idx,agg_ei_datedm_fk_idx,agg_ei_topgrp_fk_idx,uid_comp_ei_dt_idx,uid_comp_dt_idx,comp_idx | datedm_id_UNIQUE | 4 | const | 197865 | Using where
Also, I didn't understand why the compdm table is processed first. Can someone explain?
I have an index on the AGG_EXTERNALINDIVIDUAL table over the combination ei_uuid, companydm_id, datedm_id. It shows up for the aggEI table under possible_keys as uid_comp_dt_idx, yet the optimizer picks datedm_id_UNIQUE as the key instead. I didn't understand that behavior either. Can someone explain?
The optimizer reads compdm first because it is accessed as a const table: companydm_id IN (8) pins it to a single row fetched via the PRIMARY KEY, so it can be resolved before the scan of aggEI starts.
You need to check indexing on AGG_EXTERNALINDIVIDUAL.
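A hedged sketch of what 'check indexing' could mean here: the query filters datedm_id and companydm_id by equality and groups by ei_uuid, companydm_id, so an index that leads with the two equality columns and then ei_uuid might let MySQL both filter and group in index order, avoiding the temporary table and filesort (the index name is made up):
CREATE INDEX agg_ei_dt_comp_uuid_idx
    ON AGG_EXTERNALINDIVIDUAL (datedm_id, companydm_id, ei_uuid);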

Optimising a MySQL Query that takes ~30 seconds to run

I am running a query to get the total number of notes input by each user within a date range. This is the query I am running:
SELECT SQL_NO_CACHE
COUNT(notes.user_id) AS "Number of Notes"
FROM csu_users
JOIN notes ON notes.user_id = csu_users.user_id
WHERE notes.timestamp BETWEEN "2013-01-01" AND "2013-01-31"
AND notes.system = 0
GROUP BY csu_users.user_id
Some notes about my setup:
The query takes between 30 and 35 seconds to run, which is too long for our system
This is an InnoDB table
The notes table is about 1GB, with ~3,000,000 rows
I'm deliberately using SQL_NO_CACHE to ensure an accurate benchmark
The output of EXPLAIN SELECT is as follows (I've tried my best to format it):
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE csu_users index user_id user_id 5 NULL 1 Using index
1 SIMPLE notes ref user_id,timestamp,system user_id 4 REFSYS_DEV.csu_users.user_id 152 Using where
I have the following indexes applied:
notes
Primary Key - id
item_id
user_id
timestamp (note: this is actually a DATETIME. The name is just misleading, sorry!)
system
csu_users
Primary Key - id
user_id
Any ideas how I can speed this up? Thank you!
If I'm not mistaken, by converting your timestamp to a string representation you're losing all advantages of the index on that column. Try using timestamp values in your comparison.
Is the csu_users table necessary? If the relationship is 1-1 and the user id is always present, then you can run this query instead:
SELECT COUNT(notes.user_id) AS "Number of Notes"
FROM notes
WHERE notes.timestamp BETWEEN "2013-01-01" AND "2013-01-31" AND notes.system = 0
GROUP BY notes.user_id
Even if that is not the case, you can do the join after the aggregation and filtering, because all the conditions are on notes:
select "Number of Notes"
from (SELECT notes.user_id, COUNT(notes.user_id) AS "Number of Notes"
FROM notes
WHERE notes.timestamp BETWEEN "2013-01-01" AND "2013-01-31" AND notes.system = 0
GROUP BY notes.user_id
) n join
csu_users cu
on n.user_id = cu.user_id
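One hedged addition to the aggregate-first rewrite above: a composite index on notes that leads with the equality column (system), then the range column (timestamp), then user_id could let the subquery be answered from the index alone (the index name is made up; backticks are added defensively since system and timestamp are keywords in some MySQL versions):
CREATE INDEX idx_notes_system_ts_user
    ON notes (`system`, `timestamp`, user_id);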