Optimising a MySQL Query that takes ~30 seconds to run

I am running a query to get the total number of notes entered by each user within a date range. This is the query:
SELECT SQL_NO_CACHE
COUNT(notes.user_id) AS "Number of Notes"
FROM csu_users
JOIN notes ON notes.user_id = csu_users.user_id
WHERE notes.timestamp BETWEEN "2013-01-01" AND "2013-01-31"
AND notes.system = 0
GROUP BY csu_users.user_id
Some notes about my setup:
The query takes between 30 and 35 seconds to run, which is too long for our system
This is an InnoDB table
The notes table is about 1GB, with ~3,000,000 rows
I'm deliberately using SQL_NO_CACHE to ensure an accurate benchmark
The output of EXPLAIN SELECT is as follows (I've tried my best to format it):
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE csu_users index user_id user_id 5 NULL 1 Using index
1 SIMPLE notes ref user_id,timestamp,system user_id 4 REFSYS_DEV.csu_users.user_id 152 Using where
I have the following indexes applied:
notes
Primary Key - id
item_id
user_id
timestamp (note: this is actually a DATETIME. The name is just misleading, sorry!)
system
csu_users
Primary Key - id
user_id
Any ideas how I can speed this up? Thank you!

If I'm not mistaken, by comparing your timestamp to a string representation, you're losing all the advantages of the index on that column. Try using datetime values in your comparison.
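A related subtlety worth checking in the same vein (a sketch, not from the original comment): with a DATETIME column, the bare date string "2013-01-31" is interpreted as midnight, so the original BETWEEN silently excludes almost all of January 31. A half-open range with explicit datetime literals avoids that:

-- Half-open range: includes every note in January 2013, and the constant
-- bounds still allow a range scan on an index that covers `timestamp`.
WHERE notes.timestamp >= '2013-01-01 00:00:00'
  AND notes.timestamp <  '2013-02-01 00:00:00'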

Is the csu_users table necessary? If the relationship is 1-1 and the user id is always present, then you can run this query instead:
SELECT COUNT(notes.user_id) AS "Number of Notes"
FROM notes
WHERE notes.timestamp BETWEEN "2013-01-01" AND "2013-01-31" AND notes.system = 0
GROUP BY notes.user_id
Even if that is not the case, you can do the join after the aggregation and filtering, because all the conditions are on notes:
select "Number of Notes"
from (SELECT notes.user_id, COUNT(notes.user_id) AS "Number of Notes"
FROM notes
WHERE notes.timestamp BETWEEN "2013-01-01" AND "2013-01-31" AND notes.system = 0
GROUP BY notes.user_id
) n join
csu_users cu
on n.user_id = cu.user_id
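Either way, the heavy lifting is the aggregation over notes, so a composite index can let MySQL satisfy the inner query entirely from the index (a sketch worth testing; the index name is hypothetical):

-- Equality column first, then the range column, then the counted/grouped
-- column, so the range scan never has to touch the table rows.
ALTER TABLE notes ADD INDEX notes_system_ts_user_idx (`system`, `timestamp`, user_id);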


How to speed up a query containing HAVING?

I have a table with close to a billion records, and need to query it with HAVING. It's very slow (about 15 minutes on decent hardware). How to speed it up?
SELECT ((mean - 3.0E-4)/(stddev/sqrt(N))) as t, ttest.strategyid, mean, stddev, N,
kurtosis, strategies.strategyId
FROM ttest,strategies
WHERE ttest.strategyid=strategies.id AND dataset=3 AND patternclassid="1"
AND exitclassid="1" AND N>= 300 HAVING t>=1.8
I think the problem is t cannot be indexed because it needs to be computed. I cannot add it as a column because the '3.0E-4' will vary per query.
Table:
create table ttest (
strategyid bigint,
patternclassid integer not null,
exitclassid integer not null,
dataset integer not null,
N integer,
mean double,
stddev double,
skewness double,
kurtosis double,
primary key (strategyid, dataset)
);
create index ti3 on ttest (mean);
create index ti4 on ttest (dataset,patternclassid,exitclassid,N);
create table strategies (
id bigint ,
strategyId varchar(500),
primary key(id),
unique key(strategyId)
);
EXPLAIN SELECT ...:
id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 SIMPLE ttest NULL range PRIMARY,ti4 ti4 17 NULL 1910344 100.00 Using index condition; Using MRR
1 SIMPLE strategies NULL eq_ref PRIMARY PRIMARY 8 Jellyfish_test.ttest.strategyid 1 100.00 Using where
The query needs to be reformulated, and an index needs to be added.
Plan A:
SELECT ((tt.mean - 3.0E-4)/(tt.stddev/sqrt(tt.N))) as t,
tt.strategyid, tt.mean, tt.stddev, tt.N, tt.kurtosis,
s.strategyId
FROM ttest AS tt
JOIN strategies AS s ON tt.strategyid = s.id
WHERE tt.dataset = 3
AND tt.patternclassid = 1
AND tt.exitclassid = 1
AND tt.N >= 300
AND ((tt.mean - 3.0E-4)/(tt.stddev/sqrt(tt.N))) >= 1.8
and a 'composite' and 'covering' index on ttest. Replace your ti4 with this (to make it 'covering'):
INDEX(dataset, patternclassid, exitclassid, -- any order
N, strategyid) -- in this order
Plan B:
SELECT ((tt.mean - 3.0E-4)/(tt.stddev/sqrt(tt.N))) as t,
tt.strategyid, tt.mean, tt.stddev, tt.N, tt.kurtosis,
( SELECT s.strategyId
FROM strategies AS s
WHERE s.id = tt.strategyid
) AS strategyId
FROM ttest AS tt
WHERE tt.dataset = 3
AND tt.patternclassid = 1
AND tt.exitclassid = 1
AND tt.N >= 300
AND ((tt.mean - 3.0E-4)/(tt.stddev/sqrt(tt.N))) >= 1.8
With the same index.
Unfortunately, the expression for t needs to be repeated. Moving it from HAVING to WHERE avoids gathering unwanted rows only to throw them away. Maybe the optimizer already does that automatically; run EXPLAIN SELECT ... to check.
Also, it is unclear whether one of the two formulations will run faster than the other.
To be honest, I've never seen HAVING being used like this; for 20+ years I've assumed it can only be used in GROUP BY situations!
Anyway, IMHO you don't need it here; as Rick James points out, you can put it all in the WHERE.
Rewriting it a bit I end up with:
SELECT ((t.mean - 3.0E-4)/(t.stddev/sqrt(t.N))) as t,
t.strategyid,
t.mean,
t.stddev,
t.N,
t.kurtosis,
s.strategyId
FROM ttest t
JOIN strategies s
ON s.id = t.strategyid
WHERE t.dataset=3
AND t.patternclassid="1"
AND t.exitclassid="1"
AND t.N>= 300
AND ((t.mean - 3.0E-4)/(t.stddev/sqrt(t.N))) >= 1.8
For most of that we can indeed foresee a reasonable index. The problem remains with the last calculation:
AND ((t.mean - 3.0E-4)/(t.stddev/sqrt(t.N))) >= 1.8
However, before we go to that: how many rows are there if you ignore this 'formula'? 100? 200? If so, indexing as foreseen in Rick James' answer should be sufficient IMHO.
If it's thousands or many more, then the question becomes: how many of those are thrown out by the formula? 1%? 50%? 99%? If it's on the low side, then again, indexing as proposed by Rick James will do. If, however, you only need to keep a few, you may want to optimize further and index accordingly.
From your explanation I understand that the 3.0E-4 is variable, so we can't include it in the index; we'll need to extract the parts we can.
If my algebra isn't failing me, you can rework the formula step by step (multiplying both sides by t.stddev / sqrt(t.N) is safe to do without flipping the inequality, since stddev and sqrt(N) are both positive):
AND ((t.mean - 3.0E-4) / (t.stddev / sqrt(t.N))) >= 1.8
AND (t.mean - 3.0E-4) >= 1.8 * (t.stddev / sqrt(t.N))
AND - 3.0E-4 >= (1.8 * (t.stddev / sqrt(t.N))) - t.mean
So the query becomes:
SELECT ((t.mean - 3.0E-4)/(t.stddev/sqrt(t.N))) as t,
t.strategyid,
t.mean,
t.stddev,
t.N,
t.kurtosis,
s.strategyId
FROM ttest t
JOIN strategies s
ON s.id = t.strategyid
WHERE t.dataset=3
AND t.patternclassid="1"
AND t.exitclassid="1"
AND t.N>= 300
AND (1.8 * (t.stddev / sqrt(t.N))) - t.mean <= -3.0E-4
I'm not familiar with MySQL, but glancing at the documentation it should be possible to include 'generated columns' in an index. So we'll do exactly that with (1.8 * (t.stddev / sqrt(t.N))) - t.mean.
Your indexed fields thus become:
dataset, patternclassid, exitclassid, N, (1.8 * (t.stddev / sqrt(t.N))) - t.mean
Note that the system will have to calculate this value for every insert (and possibly update) you do on the table. However, once it's there (and indexed), it should make the query quite a bit faster.
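As a concrete sketch of that (assuming MySQL 5.7 or later; the column and index names are hypothetical):

ALTER TABLE ttest
  -- Stored generated column holding the constant-free part of the formula.
  -- It is computed on INSERT/UPDATE and can then be indexed like any column.
  -- Note it bakes in the 1.8 multiplier, so it only helps while that
  -- constant stays fixed across queries.
  ADD COLUMN t_bound DOUBLE AS (1.8 * (stddev / SQRT(N)) - mean) STORED,
  ADD INDEX ti5 (dataset, patternclassid, exitclassid, N, t_bound);

The last predicate of the query then simplifies to AND t.t_bound <= -3.0E-4.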

Optimizing an SQL query that utilizes Entity Attribute Value (EAV) tables, which add significant time

My goal in this post is to provide only the information needed and see if it is possible to optimize this query. Here is the query itself:
FLUSH TABLES;
SELECT SQL_CALC_FOUND_ROWS
nl.id,
nl.node_id,
nl.name,
nl.parent_name,
nlv.status,
nlv.version,
nlv.code,
nlv.region_id AS region
-- added fields if specified
,IF(ISNULL(nf_youtube_id.value), '', nf_youtube_id.value) AS youtube_id
,IF(ISNULL(nf_brightcove_id.value), '', nf_brightcove_id.value) AS brightcove_id
,IF(ISNULL(nf_youku_id.value), '', nf_youku_id.value) AS youku_id
FROM
node_lists nl
JOIN node_list_versions nlv ON nlv.node_list_id = nl.id
JOIN node_versions nv ON nl.node_id = nv.node_id AND nlv.region_id = nv.region_id
-- added field joins if specified
LEFT JOIN node_fields nf_youtube_id ON nf_youtube_id.node_version_id = nv.id AND nf_youtube_id.name = 'youtube_id'
LEFT JOIN node_fields nf_brightcove_id ON nf_brightcove_id.node_version_id = nv.id AND nf_brightcove_id.name = 'brightcove_id'
LEFT JOIN node_fields nf_youku_id ON nf_youku_id.node_version_id = nv.id AND nf_youku_id.name = 'youku_id'
WHERE
nl.model_type = 'Resource'
AND ((nl.name LIKE '%') OR (nlv.code LIKE '%'))
GROUP BY
nl.node_id, nlv.region_id
ORDER BY
status asc LIMIT 0, 1000000
So the stats on the tables are:
node_lists: InnoDB, 10,400 records
node_list_versions: InnoDB, 199,600 records
node_versions: InnoDB, 12,065 records
node_fields: InnoDB, 205,900 records
Although all the tables are InnoDB, there are no actual constraints between the keys: no REFERENCES, ON UPDATE CASCADE, etc. I do have B-tree indexes on all the join and WHERE clause fields (that much I understand).
The query takes about 2.5 seconds without the join/fields from the node_fields table, and LEFT JOINing each instance of node_fields adds about a half second on this particular query.
Is there ANYTHING I can do to optimize this query, especially any glaring omissions? For reference, an EXPLAIN (that query) yielded the following:
select_type table type possible_keys key key_len ref rows filt Extra
SIMPLE nl ref PRIMARY,node_lists_model_type_index,node_lists_node_id_index node_lists_model_type_index 194 const 7624 100 Using temporary; Using filesort
SIMPLE nv ref node_versions_node_id_index,node_versions_region_id_index node_versions_node_id_index 4 db.nl.node_id 1 100 NULL
SIMPLE nlv ref node_list_region_node_list_id_index,node_list_region_region_id_index node_list_region_node_list_id_index 4 db.nl.id 17 6.25 Using where
SIMPLE nf_youtube_id ref node_fields_node_version_id_index,node_fields_name_index node_fields_node_version_id_index 4 db.nv.id 7 100 Using where
SIMPLE nf_brightcove_id ref node_fields_node_version_id_index,node_fields_name_index node_fields_node_version_id_index 4 db.nv.id 7 100 Using where
SIMPLE nf_youku_id ref node_fields_node_version_id_index,node_fields_name_index node_fields_node_version_id_index 4 db.nv.id 7 100 Using where
I can't see where I'm missing anything or where the indexes are incomplete. It seems like 0.5 seconds for each LEFT JOIN of node_fields is just too much? How can I improve this?
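One avenue worth testing (a sketch, not from the original thread; the index name is hypothetical): each node_fields probe filters on both node_version_id and name, yet the EXPLAIN shows only the single-column node_fields_node_version_id_index being used, leaving ~7 rows per probe to be filtered by name. A composite index over both columns narrows each probe to the matching attribute row:

ALTER TABLE node_fields
  -- Both join conditions participate in the index lookup, so each
  -- LEFT JOIN probe lands directly on the one matching attribute row.
  ADD INDEX node_fields_version_name_idx (node_version_id, name);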

MySQL: Why are not all keys of the index used?

I've a table with 50 columns. I defined one index (not unique) with the following 6 columns:
rdsr_id (int),
StartOfXrayIrradiation (datetime),
PatientsBirthDate (date),
DeviceObserverUID (varchar(100)),
IdentifiedProtocolShort (varchar(50)),
RedundantEntryFromDoseSummary (tinyint(1))
The table is called report and has around 20'000 rows and is growing. When running the following query, the result shows that only 4 keys of the index are used.
EXPLAIN EXTENDED SELECT r.PatientID, r.StartOfXrayIrradiation, MeanCTDIvol_in_mGy
FROM report r
INNER JOIN ct_irradiation_events e ON r.rdsr_id = e.rdsr_id
INNER JOIN patient_age_categories a ON ( DATEDIFF( r.StartOfXrayIrradiation, r.PatientsBirthDate ) <= a.max_age_days
AND DATEDIFF( r.StartOfXrayIrradiation, r.PatientsBirthDate ) >= a.min_age_days
AND a.description = 'Erwachsene' )
WHERE MeanCTDIvol_in_mGy IS NOT NULL
AND r.DeviceObserverUID = '2.25'
AND r.IdentifiedProtocolShort = 'XXXXX'
AND r.RedundantEntryFromDoseSummary =0
AND e.CTAcquisitionType != 'Constant Angle Acquisition'
AND DATEDIFF( r.StartOfXrayIrradiation, '2013-01-06' ) >=0
AND DATEDIFF( r.StartOfXrayIrradiation, '2014-03-06' ) <=0;
result for table report:
> id: 1
> select_type: SIMPLE
> table: r
> type: ref
> possible_keys: TimelineHistogramQueries
> key: TimelineHistogramQueries
> key_len: 4
> ref: rdsr.e.rdsr_id
> rows: 1
> filtered: 100.00
> Extra: Using where
So I guess the columns IdentifiedProtocolShort and RedundantEntryFromDoseSummary are not used? The result of the query are 1400 rows. When removing the two columns from the WHERE clause, 9500 rows are found. BTW: I did run "ANALYZE TABLE report" after creating the index, if that matters...
Why are not all keys of the index used? Should I change my index?
Assuming that your TimelineHistogramQueries key is a compound key over the six columns that you list in that order, then the key_len value of 4 (bytes) does indeed indicate that only the rdsr_id column is being used from the index: this is supported by the ref value of rdsr.e.rdsr_id.
You ask why IdentifiedProtocolShort and RedundantEntryFromDoseSummary (columns 5 and 6 in the index) are not being used. As documented under Multiple-Column Indexes:
MySQL cannot use the index to perform lookups if the columns do not form a leftmost prefix of the index.
If you do not require the columns of this index to be in their current order for any other query, you could merely reorder the columns; otherwise, you may need to define a second index.
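A sketch of such a second index (the name is hypothetical), leading with the columns the query tests for equality so that they form a usable leftmost prefix:

ALTER TABLE report
  -- The join column and the three equality filters first; the date column
  -- last, so a range scan on it can still use the index.
  ADD INDEX report_eq_first_idx (rdsr_id, DeviceObserverUID,
      IdentifiedProtocolShort, RedundantEntryFromDoseSummary,
      StartOfXrayIrradiation);

Note that the DATEDIFF(r.StartOfXrayIrradiation, ...) predicates wrap the column in a function, so they would also have to be rewritten as direct comparisons (e.g. r.StartOfXrayIrradiation >= '2013-01-06') before that last column could be range-scanned.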
It depends on what you want out of your query. Leave patient ID and date of birth out of it if you are only interested in, say, the dates on which your patients had an X-ray, unless you are running your analysis by age. You are confusing the system by trying to index it all.

GROUP BY HAVING not working as expected

I'm struggling with what should be a simple query.
An event table stores user activity in an application. Each click generates a new event and datetime stamp. I need to show a list of recently accessed records having the most recent datetime stamp. I need to only show the past 7 days of activity.
The table has an auto-increment field (eventID), which corresponds with the date_event field, so it's better to use that for determining the most recent record in the group.
I found that some records are not appearing in my results with the expected most recent datetime. So I stripped my query down to the basics:
NOTE that the real-life query does not look at custID. I am including it here to narrow down the problem.
SELECT
el.eventID,
el.custID,
el.date_event
FROM
event_log el
WHERE
el.custID = 12345 AND
el.userID=987
GROUP BY
el.custID
HAVING
MAX( el.eventID )
This is returned:
eventID custID date_event
346290 12345 2013-06-21 09:58:44
Here's the EXPLAIN
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE el ref userID,custID,Composite custID 5 const 203 Using where
If I change the query to use HAVING MIN, the results don't change. I should see a different eventID and date_event, as there are dozens of records matching that custID and userID.
SELECT
el.eventID,
el.custID,
el.date_event
FROM
event_log el
WHERE
el.custID = 12345 AND
el.userID=987
GROUP BY
el.custID
HAVING
MIN( el.eventID )
Same results as before:
eventID custID date_event
346290 12345 2013-06-21 09:58:44
No change.
This tells me I have another problem, but I am not seeing what that might be.
Some pointers would be appreciated.
SELECT
el.eventID,
el.custID,
el.date_event
FROM
event_log el
WHERE
el.custID = 12345 AND
el.userID=987 AND
el.eventID IN (SELECT MAX(eventID)
FROM event_log
WHERE custID = 12345
AND userID = 987)
Your query doesn't work because you misunderstand what HAVING does. It evaluates the expression on each row of the grouped result set and keeps the rows where the expression evaluates to true. The bare expression MAX(el.eventID) simply returns the maximum event ID in the group, which is a nonzero number and therefore always "true", so no groups are filtered out; it never compares the current row to that event ID. Meanwhile, the non-aggregated eventID and date_event you see are arbitrary values MySQL picks from the group, which is why HAVING MAX and HAVING MIN return the same row.
Another way is:
SELECT
el.eventID,
el.custID,
el.date_event
FROM
event_log el
WHERE
el.custID = 12345 AND
el.userID=987
ORDER BY eventID DESC
LIMIT 1
The more general form that works for multiple custID is:
SELECT el.*
FROM event_log el
JOIN (SELECT custID, max(date_event) maxdate
FROM event_log
WHERE userID = 987
GROUP BY custID) emax
ON el.custID = emax.custID AND el.date_event = emax.maxdate
WHERE el.userID = 987
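A composite index supporting that per-customer lookup might look like this (a sketch; the index name is hypothetical):

ALTER TABLE event_log
  -- userID equality first, then custID for the grouping, then date_event,
  -- so both the inner MAX() and the join back on (custID, date_event)
  -- can be resolved from the index.
  ADD INDEX event_log_user_cust_date_idx (userID, custID, date_event);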
You can use a group function in a statement containing no GROUP BY clause, but it would be equivalent to grouping on all rows. But I guess you're looking for the common syntax,
SELECT
MIN(el.eventID) AS `min_eventID`, -- yes, it is wrong :(
el.custID,
el.date_event
FROM
event_log el
WHERE
el.userID = 987
GROUP BY el.custID;
But disagreements are welcome.
[ Edit ]
I think I didn't show a solution fast enough... but maybe you're rather looking for the fastest solution.
Assuming field date_event defaults to CURRENT_TIMESTAMP (am I wrong?), ordering by date_event would be a waste of time (and thus money).
I've made some tests with 20K rows and execution time was about 5ms.
SELECT STRAIGHT_JOIN y.*
FROM ((
SELECT MAX(eventId) as eventId
FROM event_log
WHERE userId = 987 AND custId = 12345
)) AS x
INNER JOIN event_log AS y
USING (eventId);
Maybe (possibly, who knows) you didn't get the straight_join thing; as documented in the scriptures, STRAIGHT_JOIN is similar to JOIN, except that the left table is always read before the right table. Sometimes it's useful.
For your specific situation, we want to filter down to a single eventID first (in table "x") rather than retrieve 99.99% useless rows from table "y".
More disagreements expected in 3, 2, ...

MySQL GROUP BY with SUM

I have a query with GROUP BY and SUM over close to 1 million records. When I run the query it takes 2.5 s; if I remove the GROUP BY clause it takes 0.89 s. Is there any way to optimize the query using GROUP BY and SUM together?
SELECT aggEI.ei_uuid AS uuid,aggEI.companydm_id AS companyId,aggEI.rating AS rating,aggEI.ei_name AS name,
compdm.company_name AS companyName,sum(aggEI.count) AS activity
FROM AGG_EXTERNALINDIVIDUAL AS aggEI
JOIN COMPANYDM AS compdm ON aggEI.companydm_id = compdm.companydm_id
WHERE aggEI.ei_uuid is not null
and aggEI.companydm_id IN (8)
and aggEI.datedm_id = 20130506
AND aggEI.topicgroupdm_id IN (1,2,3,4,5,6,7)
AND aggEI.rating >= 0
AND aggEI.rating <= 100
GROUP BY aggEI.ei_uuid,aggEI.companydm_id
LIMIT 0,200000
Explain result is as below:
1 SIMPLE compdm const PRIMARY,companydm_id_UNIQUE,comp_idx PRIMARY 8 const 1 Using temporary; Using filesort
1 SIMPLE aggEI ref PRIMARY,datedm_id_UNIQUE,agg_ei_comdm_fk_idx,agg_ei_datedm_fk_idx,agg_ei_topgrp_fk_idx,uid_comp_ei_dt_idx,uid_comp_dt_idx,comp_idx datedm_id_UNIQUE 4 const 197865 Using where
Also, I didn't understand why the compdm table is executed first.
I have an index on AGG_EXTERNALINDIVIDUAL over the combination of ei_uuid, companydm_id and datedm_id. It shows up for the aggEI table under possible_keys as uid_comp_dt_idx, but aggEI is using datedm_id_UNIQUE as the key instead. I didn't understand this behavior.
Can someone explain?
The table order in the EXPLAIN output is the join order the optimizer chose: compdm is read first because its lookup is const (a single row fetched via the primary key), so it is resolved before the scan of aggEI.
You need to check indexing on AGG_EXTERNALINDIVIDUAL.
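As a starting point (a sketch; the index name is hypothetical), an index whose leading columns match the equality filters and whose tail matches the GROUP BY can help MySQL avoid the temporary table and filesort:

ALTER TABLE AGG_EXTERNALINDIVIDUAL
  -- datedm_id and companydm_id are fixed by the WHERE clause, so rows
  -- arrive already ordered by ei_uuid for the GROUP BY.
  ADD INDEX agg_ei_date_comp_uuid_idx (datedm_id, companydm_id, ei_uuid);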