I have this query. It takes ~0.0854 seconds to execute, which I find a little slow. The query and its EXPLAIN are below.
SELECT
stops.stop_number,
stops.stop_name_1,
stops.stop_name_2
FROM
tranzit.stops_times
INNER JOIN
tranzit.stops
ON
(
stops_times.stop_id = stops.stop_id
)
INNER JOIN
tranzit.trips
ON
(
stops_times.trip_id = trips.trip_id
)
WHERE
trips.route_id = 109 AND
trips.trip_direction = 1 AND
trips.trip_period_start <= "2011-11-24" AND
trips.trip_period_end >= "2011-11-24"
GROUP BY
stops.stop_id
ORDER BY
stops_times.time_sequence ASC
LIMIT
0, 200
Explain
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE trips index_merge trip_id,trip_period_start,trip_period_end,trip_dir... route_id,trip_direction 3,1 NULL 271 Using intersect(route_id,trip_direction); Using wh...
1 SIMPLE stops_times ref stop_id,trip_id trip_id 16 tranzit.trips.trip_id 24
1 SIMPLE stops ref stop_id stop_id 3 tranzit.stops_times.stop_id 1 Using where
And I have these indexes on trips:
Table Non_unique Key_name Seq_in_index Column_name Collation Cardinality Sub_part Packed Null Index_type Comment
trips 1 agency_id 1 agency_id A 2 NULL NULL BTREE
trips 1 trip_id 1 trip_id A 9361 NULL NULL BTREE
trips 1 trip_period_start 1 trip_period_start A 2 NULL NULL BTREE
trips 1 trip_period_end 1 trip_period_end A 2 NULL NULL BTREE
trips 1 trip_direction 1 trip_direction A 2 NULL NULL BTREE
trips 1 route_id 1 route_id A 106 NULL NULL BTREE
trips 1 shape_id 1 shape_id A 520 NULL NULL BTREE
trips 1 trip_terminus 1 trip_terminus A 301 NULL NULL BTREE
Indexes on stops
stop_number BTREE No No stop_number 4626 A
agency_id BTREE No No agency_id 1 A
stop_id BTREE No No stop_id 4626 A
Thanks for any help
Given how many rows you have in the tables, it is already running pretty quickly. You could try a few different approaches, such as adding more WHERE conditions or performing a simple select and then running a second query to get the needed join fields. But these aren't where you really need to focus.
The important question is how this query will behave in the wild. If you are running it 100 times every second, you need to know whether it is going to degrade and become a bottleneck. If it can run in 0.08 seconds every time, then that still allows for a very responsive application.
The most important strategy, however, if it is possible and can be made effective, is using memcache or a similar option to avoid running the query all the time.
As people wrote before, split it into two queries.
First, the trip information, using GROUP_CONCAT to build the id list:
SELECT group_concat(trip_id) FROM trips WHERE
trips.route_id = 109 AND
trips.trip_direction = 1 AND
trips.trip_period_start <= "2011-11-24" AND
trips.trip_period_end >= "2011-11-24"
Then the stop information:
SELECT
stops.stop_number,
stops.stop_name_1,
stops.stop_name_2
FROM
tranzit.stops_times,
tranzit.stops
WHERE
stops_times.stop_id = stops.stop_id
AND
stops_times.trip_id in ( ...)
GROUP BY, ...
I think it will be faster, as you don't need any other information from the trips table outside the subquery.
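A note on mechanics: GROUP_CONCAT returns one comma-separated string, so the application would normally splice that list into the second query's IN (...). If you want to stay in pure SQL, here is a hedged sketch using a session variable and FIND_IN_SET. Be aware that FIND_IN_SET cannot use the index on trip_id, and that group_concat_max_len may truncate a long list:
-- fetch the id list once into a session variable
SELECT group_concat(trip_id) INTO @ids
FROM tranzit.trips
WHERE trips.route_id = 109 AND
      trips.trip_direction = 1 AND
      trips.trip_period_start <= "2011-11-24" AND
      trips.trip_period_end >= "2011-11-24";

-- then filter against the list (no index on trip_id is used here)
SELECT stops.stop_number, stops.stop_name_1, stops.stop_name_2
FROM tranzit.stops_times, tranzit.stops
WHERE stops_times.stop_id = stops.stop_id
  AND FIND_IN_SET(stops_times.trip_id, @ids);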
The trickiest part is the range condition on trip_period_start and trip_period_end. I think you can consider a composite key like:
alter table trips
add index testing
(
route_id, trip_direction, trip_period_start, trip_period_end
);
It depends on how many unique values trip_direction has. If there are always only a few unique values, try:
alter table trips
add index testing
(
route_id, trip_period_start, trip_period_end, trip_direction
);
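Either way, you can verify that the optimizer actually picks the new index by running EXPLAIN on the trips filter alone (a sketch; the output will vary with your data):
EXPLAIN
SELECT trip_id
FROM tranzit.trips
WHERE route_id = 109
  AND trip_direction = 1
  AND trip_period_start <= '2011-11-24'
  AND trip_period_end >= '2011-11-24';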
Already less than a tenth of a second and you want it faster? OK...
I would build a composite index on (route_id, trip_direction, trip_period_start), as those are the three critical elements of your query, and in that order, to put the smallest granularity at the front of the index (the specific route), then within that its direction, then the dates. Next, I would swap the order of the query, with the trips table up front, since you are doing INNER joins. Additionally, have an index on your "stops_times" table on trip_id. By starting with the first table and its qualifiers, then joining to the child-level tables via their relations, you still get the same elements, but you are running against the smallest index set first on trips.
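A sketch of those suggested indexes (the index names are mine; skip the stops_times one if it already exists):
-- composite index driving the trips filter
ALTER TABLE tranzit.trips
    ADD INDEX route_dir_start (route_id, trip_direction, trip_period_start);
-- join index for the child table
ALTER TABLE tranzit.stops_times
    ADD INDEX trip_id_idx (trip_id);
The reordered query then looks like this: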
select STRAIGHT_JOIN
stops.stop_number,
stops.stop_name_1,
stops.stop_name_2
from
tranzit.trips
join tranzit.stops_times
on trips.trip_id = stops_times.trip_id
join tranzit.stops
on stops_times.stop_id = stops.stop_id
where
trips.route_id = 109
AND trips.trip_direction = 1
AND trips.trip_period_start <= "2011-11-24"
AND trips.trip_period_end >= "2011-11-24"
group by
stops.stop_id
ORDER BY
stops_times.time_sequence
LIMIT
0, 200
I found something that works like a charm. My result times are:
0.0011
0.0008
0.0017 (highest)
0.0006 (lowest)
0.0013
These results aren't from the cache. I moved all the WHERE conditions into t (trips.agency_id, trips.route_id, trips.trip_direction, trips.trip_period_start, trips.trip_period_end) and it is working very well! I can't explain why, but if someone can, I'd like to know. Thanks a lot everyone!
PS: Even without trips.agency_id it works great.
SELECT
stops.stop_number,
stops.stop_name_1,
stops.stop_name_2
FROM
tranzit.stops_times,
tranzit.stops,
(
SELECT
trips.trip_id
FROM
tranzit.trips
WHERE
trips.agency_id = 5 AND
trips.route_id = 109 AND
trips.trip_direction = 0 AND
trips.trip_period_start <= "2011-12-01" AND
trips.trip_period_end >= "2011-12-01"
LIMIT 1
) as t
WHERE
stops_times.stop_id = stops.stop_id AND
stops_times.trip_id in (t.trip_id)
GROUP BY
stops_times.stop_id
ORDER BY
stops_times.time_sequence ASC
LIMIT
0, 200
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY <derived2> system NULL NULL NULL NULL 1 Using temporary; Using filesort
1 PRIMARY stops_times ref trip_id,stop_id trip_id 16 const 33 Using where
1 PRIMARY stops ref stop_id stop_id 3 tranzit.stops_times.stop_id 1 Using where
2 DERIVED trips ref testing testing 4 275 Using where
Related
I have the following queries which both return the same result and row count:
select * from (
select UNIX_TIMESTAMP(network_time) * 1000 as epoch_network_datetime,
hbrl.business_rule_id,
display_advertiser_id,
hbrl.campaign_id,
truncate(sum(coalesce(hbrl.ad_spend_network, 0))/100000.0, 2) as demand_ad_spend_network,
sum(coalesce(hbrl.ad_view, 0)) as demand_ad_view,
sum(coalesce(hbrl.ad_click, 0)) as demand_ad_click,
truncate(coalesce(case when sum(hbrl.ad_view) = 0 then 0 else 100*sum(hbrl.ad_click)/sum(hbrl.ad_view) end, 0), 2) as ctr_percent,
truncate(coalesce(case when sum(hbrl.ad_view) = 0 then 0 else sum(hbrl.ad_spend_network)/100.0/sum(hbrl.ad_view) end, 0), 2) as ecpm,
truncate(coalesce(case when sum(hbrl.ad_click) = 0 then 0 else sum(hbrl.ad_spend_network)/100000.0/sum(hbrl.ad_click) end, 0), 2) as ecpc
from hourly_business_rule_level hbrl
where (publisher_network_id = 31534)
and network_time between str_to_date('2017-08-13 17:00:00.000000', '%Y-%m-%d %H:%i:%S.%f') and str_to_date('2017-08-14 16:59:59.999000', '%Y-%m-%d %H:%i:%S.%f')
and (network_time IS NOT NULL and display_advertiser_id > 0)
group by network_time, hbrl.campaign_id, hbrl.business_rule_id
having demand_ad_spend_network > 0
OR demand_ad_view > 0
OR demand_ad_click > 0
OR ctr_percent > 0
OR ecpm > 0
OR ecpc > 0
order by epoch_network_datetime) as atb
left join dim_demand demand on atb.display_advertiser_id = demand.advertiser_dsp_id
and atb.campaign_id = demand.campaign_id
and atb.business_rule_id = demand.business_rule_id
ran explain extended, and these are the results:
+----+-------------+----------------------------+------+-------------------------------------------------------------------------------+---------+---------+-----------------+---------+----------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+----------------------------+------+-------------------------------------------------------------------------------+---------+---------+-----------------+---------+----------+----------------------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 1451739 | 100.00 | NULL |
| 1 | PRIMARY | demand | ref | PRIMARY,join_index | PRIMARY | 4 | atb.campaign_id | 1 | 100.00 | Using where |
| 2 | DERIVED | hourly_business_rule_level | ALL | _hourly_business_rule_level_supply_idx,_hourly_business_rule_level_demand_idx | NULL | NULL | NULL | 1494447 | 97.14 | Using where; Using temporary; Using filesort |
+----+-------------+----------------------------+------+-------------------------------------------------------------------------------+---------+---------+-----------------+---------+----------+----------------------------------------------+
and the other is:
select UNIX_TIMESTAMP(network_time) * 1000 as epoch_network_datetime,
hbrl.business_rule_id,
display_advertiser_id,
hbrl.campaign_id,
truncate(sum(coalesce(hbrl.ad_spend_network, 0))/100000.0, 2) as demand_ad_spend_network,
sum(coalesce(hbrl.ad_view, 0)) as demand_ad_view,
sum(coalesce(hbrl.ad_click, 0)) as demand_ad_click,
truncate(coalesce(case when sum(hbrl.ad_view) = 0 then 0 else 100*sum(hbrl.ad_click)/sum(hbrl.ad_view) end, 0), 2) as ctr_percent,
truncate(coalesce(case when sum(hbrl.ad_view) = 0 then 0 else sum(hbrl.ad_spend_network)/100.0/sum(hbrl.ad_view) end, 0), 2) as ecpm,
truncate(coalesce(case when sum(hbrl.ad_click) = 0 then 0 else sum(hbrl.ad_spend_network)/100000.0/sum(hbrl.ad_click) end, 0), 2) as ecpc
from hourly_business_rule_level hbrl
join dim_demand demand on hbrl.display_advertiser_id = demand.advertiser_dsp_id
and hbrl.campaign_id = demand.campaign_id
and hbrl.business_rule_id = demand.business_rule_id
where (publisher_network_id = 31534)
and network_time between str_to_date('2017-08-13 17:00:00.000000', '%Y-%m-%d %H:%i:%S.%f') and str_to_date('2017-08-14 16:59:59.999000', '%Y-%m-%d %H:%i:%S.%f')
and (network_time IS NOT NULL and display_advertiser_id > 0)
group by network_time, hbrl.campaign_id, hbrl.business_rule_id
having demand_ad_spend_network > 0
OR demand_ad_view > 0
OR demand_ad_click > 0
OR ctr_percent > 0
OR ecpm > 0
OR ecpc > 0
order by epoch_network_datetime;
and these are the results for the second query:
+----+-------------+----------------------------+------+-------------------------------------------------------------------------------+---------+---------+---------------------------------------------------------------+---------+----------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+----------------------------+------+-------------------------------------------------------------------------------+---------+---------+---------------------------------------------------------------+---------+----------+----------------------------------------------+
| 1 | SIMPLE | hourly_business_rule_level | ALL | _hourly_business_rule_level_supply_idx,_hourly_business_rule_level_demand_idx | NULL | NULL | NULL | 1494447 | 97.14 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | demand | ref | PRIMARY,join_index | PRIMARY | 4 | my6sense_datawarehouse.hourly_business_rule_level.campaign_id | 1 | 100.00 | Using where; Using index |
+----+-------------+----------------------------+------+-------------------------------------------------------------------------------+---------+---------+---------------------------------------------------------------+---------+----------+----------------------------------------------+
The first one takes about 2 seconds while the second one takes over 2 minutes!
Why is the second query taking so long? What am I missing here?
Thanks.
To reinforce Rick James's Plan B and Paul's answer, which you have already documented: use a subquery whenever the subquery significantly shrinks the number of rows before ANY join. The answers by Rick and Paul deserve acceptance.
One possible reason is the number of rows that have to be joined with the second table.
The GROUP BY clause and the HAVING clause will limit the number of rows returned from your subquery.
Only those rows will be used for the join.
Without the subquery only the WHERE clause is limiting the number of rows for the JOIN.
The JOIN is done before the GROUP BY and HAVING clauses are processed.
Depending on the group size and the selectivity of the HAVING conditions, there could be many more rows that need to be joined.
Consider the following simplified example:
We have a table users with 1000 entries and the columns id, email.
create table users(
id smallint auto_increment primary key,
email varchar(50) unique
);
Then we have a (huge) log table user_actions with 1,000,000 entries and the columns id, user_id, timestamp, action
create table user_actions(
id mediumint auto_increment primary key,
user_id smallint not null,
timestamp timestamp,
action varchar(50),
index (timestamp, user_id)
);
The task is to find all users who have at least 900 entries in the log table since 2017-02-01.
The subquery solution:
select a.user_id, a.cnt, u.email
from (
select a.user_id, count(*) as cnt
from user_actions a
where a.timestamp >= '2017-02-01 00:00:00'
group by a.user_id
having cnt >= 900
) a
left join users u on u.id = a.user_id
The subquery returns 135 rows (users). Only those rows will be joined with the users table.
The subquery runs in about 0.375 seconds. The time needed for the join is almost zero, so the full query runs in about 0.375 seconds.
Solution without subquery:
select a.user_id, count(*) as cnt, u.email
from user_actions a
left join users u on u.id = a.user_id
where a.timestamp >= '2017-02-01 00:00:00'
group by a.user_id
having cnt >= 900
The WHERE condition filters the table to 866,081 rows.
The JOIN has to be done for all those 866K rows.
After the JOIN the GROUP BY and the HAVING clauses are processed and limit the result to 135 rows.
This query needs about 0.815 seconds.
So you can already see that a subquery can improve the performance.
But let's make things worse and drop the primary key in the users table.
This way we have no index which can be used for the JOIN.
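For reference, the primary key was dropped with something like this (a sketch; the auto_increment attribute has to be removed in the same statement, since an auto-increment column must be part of a key):
ALTER TABLE users
    MODIFY id smallint not null,  -- drop auto_increment first
    DROP PRIMARY KEY;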
Now the first query runs in 0.455 seconds. The second query needs 40 seconds - almost 100 times slower.
Notes
It's difficult to say whether the same applies to your case. Reasons are:
Your queries are quite complex and far from being an MCVE.
I don't see anything being selected from the demand table, so it's unclear why you are joining it at all.
You use a LEFT JOIN in one query and an INNER JOIN in the other.
The relation between the two tables is unclear.
No information about indexes. You should provide the CREATE statements (SHOW CREATE TABLE table_name).
Test setup
drop table if exists users;
create table users(
id smallint auto_increment primary key,
email varchar(50) unique
)
select seq as id, rand(1) as email
from seq_1_to_1000
;
drop table if exists user_actions;
create table user_actions(
id mediumint auto_increment primary key,
user_id smallint not null,
timestamp timestamp,
action varchar(50),
index (timestamp, user_id)
)
select seq as id
, floor(rand(2)*1000)+1 as user_id
#, '2017-01-01 00:00:00' + interval seq*20 second as timestamp
, from_unixtime(unix_timestamp('2017-01-01 00:00:00') + seq*20) as timestamp
, rand(3) as action
from seq_1_to_1000000
;
MariaDB 10.0.19 with sequence plugin.
The queries are different. One says JOIN, the other says LEFT JOIN. You are not using demand, so the join is probably useless. However, in the case of JOIN, you are filtering out advertisers that are not in dim_demand; is that the intent?
But that does not address the question.
The EXPLAINs estimate that there are 1.5M rows in hbrl. But how many show up in the result? I would guess it is a lot fewer. From this, I can answer your question.
Consider these two:
SELECT ... FROM ( SELECT ... FROM a
GROUP BY or HAVING or LIMIT ) x
JOIN b
SELECT ... FROM a
JOIN b
GROUP BY or HAVING or LIMIT
The first will decrease the number of rows that need to join to b; the second will need to do a full 1.5M joins. I suspect that the time taken to do the JOIN (be it LEFT or not) is where the difference is.
Plan A: Remove demand from the query.
Plan B: Use a subquery whenever the subquery significantly shrinks the number of rows before the JOIN.
Indexing (may speed up both variants):
INDEX(publisher_network_id, network_time)
and get rid of this as being useless (since the between will fail anyway for NULL):
and network_time IS NOT NULL
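As a concrete statement for the suggested index above (the index name is my own):
ALTER TABLE hourly_business_rule_level
    ADD INDEX pub_net_time (publisher_network_id, network_time);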
Side note: I recommend simplifying and fixing this
and network_time
between str_to_date('2017-08-13 17:00:00.000000', '%Y-%m-%d %H:%i:%S.%f')
AND str_to_date('2017-08-14 16:59:59.999000', '%Y-%m-%d %H:%i:%S.%f')
to
and network_time >= '2017-08-13 17:00:00'
and network_time < '2017-08-13 17:00:00' + INTERVAL 24 HOUR
I have a query which is running a bit slow. It takes a table containing the items that result from a search and then gets the categories that contain one or more of those items. The categories (~300) are stored in a nested set model with around 4 levels. I need to know the counts at each level (so an item might be in the grandchild category, but also needs to be counted in the child and parent categories).
The query to do this is as follows:
SELECT category_parent.id,
category_parent.depth,
category_parent.name AS item_sub_category,
category_parent.left_index,
category_parent.right_index,
COUNT(DISTINCT item.id) as total
FROM search_enquiries_found
INNER JOIN item ON search_enquiries_found.item_id = item.id
INNER JOIN category ON item.mmg_code = category.mmg_code
INNER JOIN category category_parent ON category.left_index BETWEEN category_parent.left_index AND category_parent.right_index
WHERE search_enquiries_found.search_enquiry_id = 35
AND item.cost_price > 0
GROUP BY category_parent.id,
category_parent.depth,
item_sub_category,
category_parent.left_index,
category_parent.right_index
ORDER BY category_parent.left_index
With around 12k records found, this is taking ~1.5 seconds. The EXPLAIN follows:
id select_type table partitions type possible_keys key key_len ref rows Extra
1 SIMPLE category NULL ALL left_index NULL NULL NULL 337 Using temporary; Using filesort
1 SIMPLE item NULL ref PRIMARY,id,mmg_code mmg_code 27 em_entaonline.category.mmg_code 43 Using where
1 SIMPLE search_enquiries_found NULL eq_ref search_enquiry_id,item_id,search_enquiry_id_2,search_enquiry_id_relevance search_enquiry_id 8 const,em_entaonline.item.id 1 Using index
1 SIMPLE category_parent NULL ALL left_index NULL NULL NULL 337 Range checked for each record (index map: 0x2)
The important table is the hierarchical category table:
CREATE TABLE `category` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(128) NOT NULL,
`depth` int(11) NOT NULL,
`is_active` tinyint(1) NOT NULL DEFAULT '1',
`left_index` int(4) NOT NULL,
`right_index` int(4) NOT NULL,
`mmg_code` varchar(25) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `left_index` (`left_index`,`right_index`),
UNIQUE KEY `depth` (`depth`,`left_index`,`right_index`),
KEY `name` (`name`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
This is not usefully using any index on the join to get parent categories.
Changing this query to remove most of the GROUP BY fields and just use the unique id (in effect the other fields in the GROUP BY are not required, but I tend to include them to keep to strict standards) reduces the time taken to ~0.5 seconds:
SELECT category_parent.id,
category_parent.depth,
category_parent.name AS item_sub_category,
category_parent.left_index,
category_parent.right_index,
COUNT(DISTINCT item.id) as total
FROM search_enquiries_found
INNER JOIN item ON search_enquiries_found.item_id = item.id
INNER JOIN category ON item.mmg_code = category.mmg_code
INNER JOIN category category_parent ON category.left_index BETWEEN category_parent.left_index AND category_parent.right_index
WHERE search_enquiries_found.search_enquiry_id = 35
AND item.cost_price > 0
GROUP BY category_parent.id
ORDER BY category_parent.left_index
The EXPLAIN for this is very slightly different:
id select_type table partitions type possible_keys key key_len ref rows Extra
1 SIMPLE category NULL ALL left_index NULL NULL NULL 337 Using temporary; Using filesort
1 SIMPLE item NULL ref PRIMARY,id,mmg_code mmg_code 27 em_entaonline.category.mmg_code 43 Using where
1 SIMPLE search_enquiries_found NULL eq_ref search_enquiry_id,item_id,search_enquiry_id_2,search_enquiry_id_relevance search_enquiry_id 8 const,em_entaonline.item.id 1 Using index
1 SIMPLE category_parent NULL ALL PRIMARY,left_index,depth,name NULL NULL NULL 337 Using where; Using join buffer (Block Nested Loop)
Now I have 2 questions.
1 - Why does removing the fields from the GROUP BY (which are completely dependent on the primary key left in the GROUP BY) change the EXPLAIN this way, with such a large change in performance?
2 - In either case the performance is poor, with no key used for the join to the parent category table. Any suggestions on improving this?
As a further point, I have tried reversing the order of the joins and forcing it using STRAIGHT_JOIN. This further improves the performance (~0.2 seconds), but this is an unusual situation (it returns all the records from the parent category table; most searches will return far fewer records and hence far fewer categories), so I think it would slow things down in more normal situations, and it still isn't using indexes on some of the joins. It would be good if MySQL could decide which way to join the tables depending on the result sets.
SELECT category_parent.id,
category_parent.depth,
category_parent.name AS item_sub_category,
category_parent.left_index,
category_parent.right_index,
COUNT(DISTINCT item.id) as total
FROM category category_parent
STRAIGHT_JOIN category ON category.left_index BETWEEN category_parent.left_index AND category_parent.right_index
STRAIGHT_JOIN item ON item.mmg_code = category.mmg_code
STRAIGHT_JOIN search_enquiries_found ON search_enquiries_found.item_id = item.id
WHERE search_enquiries_found.search_enquiry_id = 35
GROUP BY category_parent.id,
category_parent.depth,
item_sub_category,
category_parent.left_index,
category_parent.right_index
ORDER BY category_parent.left_index
id select_type table partitions type possible_keys key key_len ref rows Extra
1 SIMPLE category_parent NULL ALL left_index NULL NULL NULL 337 Using temporary; Using filesort
1 SIMPLE category NULL ALL left_index NULL NULL NULL 337 Range checked for each record (index map: 0x2)
1 SIMPLE item NULL ref PRIMARY,id,mmg_code mmg_code 27 em_entaonline.category.mmg_code 43 Using index
1 SIMPLE search_enquiries_found NULL eq_ref search_enquiry_id,item_id,search_enquiry_id_2,search_enquiry_id_relevance search_enquiry_id 8 const,em_entaonline.item.id 1 Using index
I've been working at this for a while now and cannot seem to get it optimized. Although it does work, each LEFT JOINed logs* table is reading every row in the database, regardless of whether it is part of the set it is joined to (by user_id). While it returns correct results as is, this will become a problem as the user base and the DB as a whole grow.
Some quick background: given an account id, there can be any number of computers under it. On each of those computers there can be any number of users linked to it. These user_ids are then linked in the logs tables. Each of these relationships is indexed (account_id, computer_id, user_id) in the necessary tables.
I have put the LEFT JOINs in subqueries to prevent a cartesian product (a previous issue which subqueries solved).
Query :
SELECT
users.username as username,
computers.computer_name as computer_name,
l1.cnt as cnt1,
l2.cnt as cnt2,
l3.cnt as cnt3,
l4.cnt as cnt4,
l5.cnt as cnt5,
l6.cnt as cnt6
FROM computers
INNER JOIN users
on users.computer_id = computers.computer_id
LEFT JOIN
(SELECT
user_id,
count(*) as cnt
from logs1
group by user_id
) AS l1
on l1.user_id = users.user_id
LEFT JOIN
(SELECT
user_id,
count(*) as cnt
from logs2
group by user_id
) AS l2
on l2.user_id = users.user_id
LEFT JOIN
(SELECT
user_id,
count(*) as cnt
from logs3
group by user_id
) AS l3
on l3.user_id = users.user_id
LEFT JOIN
(SELECT
user_id,
count(*) as cnt
from logs4
group by user_id
) AS l4
on l4.user_id = users.user_id
LEFT JOIN
(SELECT
user_id,
count(*) as cnt
from logs5
group by user_id
) AS l5
on l5.user_id = users.user_id
LEFT JOIN
(SELECT
user_id,
count(*) as cnt
from logs6
group by user_id
) AS l6
on l6.user_id = users.user_id
WHERE computers.account_id = :cw_account_id AND computers.status = :cw_status
GROUP BY users.user_id
Plan :
computers 1 PRIMARY ref PRIMARY,unique_filter,status unique_filter 4 const 5 Using where; Using temporary; Using filesort
users 1 PRIMARY ref PRIMARY,unique_filter unique_filter 4 stephen_spcplus_inno.computers.computer_id 1 Using index
<derived2> 1 PRIMARY ref <auto_key0> <auto_key0> 4 stephen_spcplus_inno.users.user_id 3
logs1 2 DERIVED index user_id user_id 8 33 Using index
<derived3> 1 PRIMARY ref <auto_key0> <auto_key0> 4 stephen_spcplus_inno.users.user_id 10
logs2 3 DERIVED index user_id user_id 8 101 Using index
<derived4> 1 PRIMARY ref <auto_key0> <auto_key0> 4 stephen_spcplus_inno.users.user_id 4
logs3 4 DERIVED index user_id user_id 8 41 Using index
<derived5> 1 PRIMARY ref <auto_key0> <auto_key0> 4 stephen_spcplus_inno.users.user_id 2
logs4 5 DERIVED index user_id user_id 8 28 Using index
<derived6> 1 PRIMARY ref <auto_key0> <auto_key0> 4 stephen_spcplus_inno.users.user_id 2
logs5 6 DERIVED index user_id user_id 8 28 Using index
<derived7> 1 PRIMARY ref <auto_key0> <auto_key0> 4 stephen_spcplus_inno.users.user_id 275
logs6 7 DERIVED index user_id user_id 775 27516 Using index
example results :
username computer_name cnt1 cnt2 cnt3 cnt4 cnt5 cnt6
testuser COMPUTER_1 1 2 1 (null) (null) 3
testuser2 COMPUTER_1 (null) (null) (null) (null) (null) (null)
someuser COMPUTER_2 32 83 26 15 28 1157
As an example, for logs6 the plan is reading every row in the table (27516), yet there were only 1160 that should have been joined.
I have tried lots of different things but cannot get this to run in an optimized manner. As it currently stands, the reason all the rows from each table are being read is the use of COUNT(*) within each join's subquery. Remove it and only the needed rows are joined, as I want; however, I then don't know how to get the counts in the same grouped result.
Help from any gurus would be great! Yes, I know I do not have a lot of rows in the DB yet, but I can see that the results are correct and that the full table scans are going to become a problem.
EDIT (partial solution):
I have found a partial solution to this problem, but it requires an additional query to get the list of user_ids. By adding WHERE user_id IN (17,22,23) to each log subquery, where these are the user_ids which should be joined, I get the correct results and the entire table is not scanned.
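For illustration, each log subquery then becomes (a sketch; the IN list is the output of the separate user_id query):
LEFT JOIN
    (SELECT user_id, count(*) as cnt
     FROM logs1
     WHERE user_id IN (17,22,23)  -- user_ids fetched by the prior query
     GROUP BY user_id
    ) AS l1
    on l1.user_id = users.user_id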
If anyone knows of a way to make this work without this additional query and additional WHERE clause, please let me know.
I simplified your question to 2 log-tables and played around with it a bit on SQLFiddle.
=> http://sqlfiddle.com/#!2/a99e4a/2
It seems that using a sub-query makes things worse with my example data, but I wonder how it handles things when there are many more records in the tables that don't fit the criteria.
I'd suggest you give it a try and see what comes out. I don't have a MySQL db to play around with here, and I'd rather not bring SQLFiddle to its knees =)
I'm losing hair over a stupid query. First, let me explain its goal. I have a set of values fetched every hour and stored in the DB. These values can increase or stay equal over time. This query extracts the latest value day by day for the latest 60 days (I have twin queries that extract the latest value by week and by month; they are similar). The query is self-explanatory:
SELECT l.value AS value
FROM atable AS l
WHERE l.time = (
SELECT MAX(m.time)
FROM atable AS m
WHERE DATE(l.time) = DATE(m.time)
LIMIT 1
)
ORDER BY l.time DESC
LIMIT 60
It looks like nothing special. But it's extremely slow (> 30 secs), considering time is indexed and the table contains fewer than 5000 rows. And I'm sure the problem is with the sub-query.
Where is the noob mistake?
Update 1: The situation is the same if I avoid MAX() by using SELECT m.time ... ORDER BY m.time DESC.
Update 2: It seems not to be a problem with the DATE() function being called too many times. I tried creating a calculated field day DATE. The UPDATE atable SET day = DATE(time) runs in less than 2 secs. The modified query, with l.day = m.day (no functions!), runs in exactly the same time as before.
The main issue I see is using DATE() on the left side of the expression in the WHERE clause. Wrapping the column in the DATE() function prevents MySQL from using an index on that field. Instead, it must scan all rows to apply the function to each row.
Instead of this:
WHERE DATE(l.time) = DATE(m.time)
Try something like this:
WHERE l.time BETWEEN
DATE_SUB(m.time, INTERVAL TIME_TO_SEC(m.time) SECOND)
AND DATE_ADD(DATE_SUB(m.time, INTERVAL TIME_TO_SEC(m.time) SECOND), INTERVAL 86399 SECOND)
Maybe you know of a better way to turn m.time into a range like 2012-02-09 00:00:00 to 2012-02-09 23:59:59 than the above example, but the idea is that you want to keep the left side of the expression as the raw column name, l.time in this case, and give it a range in the form of two constants (or two expressions that can be converted to constants) on the right side.
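For instance, one simpler way to build that range (a sketch using a half-open interval, which keeps the raw column on the left):
WHERE l.time >= DATE(m.time)
  AND l.time < DATE(m.time) + INTERVAL 1 DAY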
EDIT
I'm using your pre-calculated day field:
SELECT *
FROM atable a
WHERE a.time IN
-- MySQL does not allow LIMIT directly inside an IN subquery,
-- so the limited query is wrapped in a derived table
(SELECT t.time FROM
    (SELECT MAX(time) AS time
     FROM atable
     GROUP BY day
     ORDER BY day DESC
     LIMIT 60) AS t)
At least here, the inner query is only run once, and then a binary search is done with the IN clause. You're still scanning the table, but just once, and the advantage of the inner query being run just once will probably make a huge dent.
If you know that you have values for every day, you could improve that inner query by adding a WHERE clause, limiting it to the last 60 calendar days, and losing the LIMIT 60. Make sure that day and time are indexed.
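For example (a sketch; the index name is mine):
ALTER TABLE atable ADD INDEX day_time (day, time);
With (day, time) indexed, the inner MAX(time) ... GROUP BY day can typically be resolved from the index alone.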
Instead of using MAX(m.time), do the following in the sub-select:
SELECT m.time
FROM atable AS m
WHERE DATE(l.time) = DATE(m.time)
ORDER BY m.time DESC
LIMIT 1
This might help speed up the query, since it gives the query optimizer an alternative.
However, one other thing I noticed: you are using DATE(l.time) and DATE(m.time). If your index is not created on DATE(m.time), then you will not be using the index, which could cause slowness.
Based on the feedback answer: if the entries are added sequentially by date/time, directly correlated to the auto-increment ID, who cares about the TIME? Get the auto-increment number for an exact, non-ambiguous join:
select
A1.AutoID,
A1.time,
A1.Value
from
( select date( A2.time ) as SingleDate,
max( A2.AutoID ) as MaxAutoID
from aTable A2
where date( A2.Time ) >= date( date_sub( now(), interval 60 day ))
group by date( A2.time ) ) as MaxPerDate
JOIN aTable A1
on MaxPerDate.MaxAutoID = A1.AutoID
order by
A1.AutoID DESC
You could use the EXPLAIN statement to get MySQL to tell you what it's doing:
EXPLAIN SELECT l.value AS value
FROM atable AS l
WHERE l.time = (
SELECT MAX(m.time)
FROM atable AS m
WHERE DATE(l.time) = DATE(m.time) LIMIT 1
)
ORDER BY l.time DESC LIMIT 60
That should at least give you an insight where to look further.
If you have an index on time, I would suggest picking the newest row with ORDER BY ... LIMIT 1 instead of MAX() (MySQL has no TOP; LIMIT plays that role), as follows:
SELECT l.value AS value
FROM atable AS l
WHERE l.time = (
SELECT m.time
FROM atable AS m
WHERE DATE(l.time) = DATE(m.time)
ORDER BY m.time DESC LIMIT 1
)
ORDER BY l.time DESC LIMIT 60
Your outer query is using a filesort without indexes.
Try changing to InnoDB engine to see if it improves things.
Doing a quick test:
mysql> show create table atable\G
*************************** 1. row ***************************
Table: atable
Create Table: CREATE TABLE `atable` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`t` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`id`),
KEY `t` (`t`)
) ENGINE=InnoDB AUTO_INCREMENT=51 DEFAULT CHARSET=utf8
1 row in set (0.00 sec)
mysql> explain SELECT id FROM atable AS l WHERE l.t = ( SELECT MAX(m.t) FROM atable AS m WHERE DATE(l.t) = DATE(m.t) LIMIT 1 ) ORDER BY l.t DESC LIMIT 50;
+----+--------------------+-------+-------+---------------+------+---------+------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+-------+-------+---------------+------+---------+------+------+--------------------------+
| 1 | PRIMARY | l | index | NULL | t | 4 | NULL | 50 | Using where; Using index |
| 2 | DEPENDENT SUBQUERY | m | index | NULL | t | 4 | NULL | 50 | Using where; Using index |
+----+--------------------+-------+-------+---------------+------+---------+------+------+--------------------------+
2 rows in set (0.00 sec)
After changing to MyISAM:
mysql> explain SELECT id FROM atable AS l WHERE l.t = ( SELECT MAX(m.t) FROM atable AS m WHERE DATE(l.t) = DATE(m.t) LIMIT 1 ) ORDER BY l.t DESC LIMIT 50;
+----+--------------------+-------+-------+---------------+------+---------+------+------+-----------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+-------+-------+---------------+------+---------+------+------+-----------------------------+
| 1 | PRIMARY | l | ALL | NULL | NULL | NULL | NULL | 50 | Using where; Using filesort |
| 2 | DEPENDENT SUBQUERY | m | index | NULL | t | 4 | NULL | 50 | Using where; Using index |
+----+--------------------+-------+-------+---------------+------+---------+------+------+-----------------------------+
2 rows in set (0.00 sec)
I'm having real difficulties optimising a MySQL query. I have to use the existing database structure, but I am getting an extremely slow response under certain circumstances.
My query is:
SELECT
`t`.*,
`p`.`trp_name`,
`p`.`trp_lname`,
`trv`.`trv_prosceslevel`,
`trv`.`trv_id`,
`v`.`visa_destcountry`,
`track`.`track_id`,
`track`.`track_datetoembassy`,
`track`.`track_expectedreturn`,
`track`.`track_status`,
`track`.`track_comments`
FROM
(SELECT
*
FROM
`_transactions`
WHERE
DATE(`tr_datecreated`) BETWEEN DATE('2011-07-01 00:00:00') AND DATE('2011-08-01 23:59:59')) `t`
JOIN
`_trpeople` `p` ON `t`.`tr_id` = `p`.`trp_trid` AND `p`.`trp_name` = 'Joe' AND `p`.`trp_lname` = 'Bloggs'
JOIN
`_trvisas` `trv` ON `t`.`tr_id` = `trv`.`trv_trid`
JOIN
`_visas` `v` ON `trv`.`trv_visaid` = `v`.`visa_code`
JOIN
`_trtracking` `track` ON `track`.`track_trid` = `t`.`tr_id` AND `p`.`trp_id` = `track`.`track_trpid` AND `trv`.`trv_id` = `track`.`track_trvid` AND `track`.`track_status` IN ('New','Missing_Info',
'En_Route',
'Ready_Pickup',
'Received',
'Awaiting_Voucher',
'Sent_Client',
'Closed')
ORDER BY `tr_id` DESC
The result of an EXPLAIN statement on the above is:
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY <derived2> ALL NULL NULL NULL NULL 164 Using temporary; Using filesort
1 PRIMARY track ALL status_index NULL NULL NULL 4677 Using where
1 PRIMARY p eq_ref PRIMARY PRIMARY 4 db.track.track_trpid 1 Using where
1 PRIMARY trv eq_ref PRIMARY PRIMARY 4 db.track.track_trvid 1 Using where
1 PRIMARY v eq_ref visa_code visa_code 4 db.trv.trv_visaid 1
2 DERIVED _transactions ALL NULL NULL NULL NULL 4276 Using where
The query times are acceptable until the value 'Closed' is included in the very last track.track_status IN clause. The run time then increases to about 10 to 15 times that of the other queries.
This makes sense, as the 'Closed' status refers to all the clients whose transactions have been dealt with, which corresponds to about 90% to 95% of the database.
The issue is that in some cases the search is taking about 45 seconds, which is ridiculous. I'm sure MySQL can do much better than that and it's just my query at fault, even if the tables do have 4000 rows, but I can't work out how to optimise this statement.
I'd be grateful for some advice about where I'm going wrong and how I should implement this query to produce a faster result.
Many thanks
Try this:
SELECT t.*,
p.trp_name,
p.trp_lname,
trv.trv_prosceslevel,
trv.trv_id,
v.visa_destcountry,
track.track_id,
track.track_datetoembassy,
track.track_expectedreturn,
track.track_status,
track.track_comments
FROM
_transactions t
JOIN _trpeople p ON t.tr_id = p.trp_trid
JOIN _trvisas trv ON t.tr_id = trv.trv_trid
JOIN _visas v ON trv.trv_visaid = v.visa_code
JOIN _trtracking track ON track.track_trid = t.tr_id
AND p.trp_id = track.track_trpid
AND trv.trv_id = track.track_trvid
WHERE DATE(t.tr_datecreated)
BETWEEN DATE('2011-07-01 00:00:00') AND DATE('2011-08-01 23:59:59')
AND track.track_status IN ('New','Missing_Info','En_Route','Ready_Pickup','Received','Awaiting_Voucher','Sent_Client', 'Closed')
AND p.trp_name = 'Joe' AND p.trp_lname = 'Bloggs'
ORDER BY tr_id DESC
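One further thought, following the same principle mentioned elsewhere on this page (keep the indexed column bare on the left side of the comparison): the DATE(t.tr_datecreated) wrapper prevents an index on tr_datecreated from being used. Assuming such an index exists, the date filter could be written sargably, for example:
-- half-open range covering 2011-07-01 through 2011-08-01 inclusive
WHERE t.tr_datecreated >= '2011-07-01'
  AND t.tr_datecreated < '2011-08-02'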