I'm losing hair on a stupid query. First, I would explain what's its goal. I have a set of values fetched every hour and stored in the DB. These values can increase or stay equal with time. This query extracts the latest value day by day for latest 60 days (I have twins query for extract lastest value by weeks and months, they are similar). The query is self explanatory:
SELECT l.value AS value
FROM atable AS l
WHERE l.time = (
SELECT MAX(m.time)
FROM atable AS m
WHERE DATE(l.time) = DATE(m.time)
LIMIT 1
)
ORDER BY l.time DESC
LIMIT 60
It looks no special. But it's extremely slow (> 30 secs), considering time is an index and table contains less than 5000 rows. And I'm sure the problem is with sub-query.
Where is the noob mistake?
Update 1: Same situation if I avoid MAX() using SELECT m.time ... ORDER BY m.time DESC.
Update 2: Seems is not a problem with DATE() function called to many times. I've tried to create a calculated field day DATE. The UPDATE atable SET day = DATE(time) runs in less than 2secs. The modified query, with l.day = m.day (no functions!), runs in the same exactly time as before.
The main issue I see is using DATE() on the left of the expression in the WHERE clause. Using the function DATE() on both sides of the WHERE expression explicitly prevents MySQL from using an index on the date field. Instead, it must scan all rows to apply the function on each row.
Instead of this:
WHERE DATE(l.time) = DATE(m.time)
Try something like this:
WHERE l.time BETWEEN
DATE_SUB(m.date, INTERVAL TIME_TO_SEC(m.date) SECOND)
AND DATE_ADD(DATE_SUB(m.date, INTERVAL TIME_TO_SEC(m.date) SECOND), INTERVAL 86399 SECOND)
Maybe you know of a better way to turn m.date into a range like 2012-02-09 00:00:00 and 2012-02-09 23:59:59 than the above example, but the idea is that you want to keep the left side of the expression as the raw column name, l.time in this case, and give it a range in the form of two constants (or two expressions that can be converted to constants) on the right side.
EDIT
I'm using your pre-calculated day field:
SELECT *
FROM atable a
WHERE a.time IN
(SELECT MAX(time)
FROM atable
GROUP BY day
ORDER BY day DESC
LIMIT 60)
At least here, the inner query is only ran once, and then a binary search is done with the IN cluase. You're still scanning the table, but just once, and the advantage of the inner query being run just once will probably make a huge dent.
If you know that you have values for every day, you could improve that inner query by adding a WHERE clause, limiting it to the last 60 calendar days, and losing the LIMIT 60. Make sure that day and time are indexed.
Instead of using MAX(m.time) do the following in the sub-select
SELECT m.time
FROM table AS m
WHERE DATE(l.time) = DATE(m.time)
ORDER BY m.time DESC
LIMIT 1
This might help speed up the query since it is giving the query parser an alternative
However one other piece i noticed is you are using the DATE(l.time) and DATE(m.time) which if your index is not created on DATE(m.time) then you will not be using the index and hence could cause slowness.
Based on the feedback answer, if the entries are sequentially added via date/time, directly correlated to the auto-increment ID, who cares about the TIME... get the auto-inc number for exact, non-ambiguous join
select
A1.AutoID,
A1.time,
A1.Value
from
( select date( A2.time ) as SingleDate,
max( A2.AutoID ) as MaxAutoID
from aTable A2
where date( A2.Time ) >= date( date_sub( now(), interval 60 day ))
group by date( A2.time ) ) into MaxPerDate
JOIN aTable A1
on MaxPerDate.MaxAutoID = A1.AutoID
order by
A1.AutoID DESC
You could use the "explain" statement to get mysql to tell you what it's doing.
EXPLAIN SELECT l.value AS value
FROM table AS l
WHERE l.time = (
SELECT MAX(m.time)
FROM table AS m
WHERE DATE(l.time) = DATE(m.time) LIMIT 1
)
ORDER BY l.time DESC LIMIT 60
That should at least give you an insight where to look further.
If you have an index on time, I would suggest getting TOP 1 instead of MAX as follows:
SELECT l.value AS value
FROM table AS l
WHERE l.time = (
SELECT TOP 1 m.time
FROM table AS m
ORDER BY m.time DESC LIMIT 1
)
ORDER BY l.time DESC LIMIT 60
Your outer query is using a filesort without indexes.
Try changing to InnoDB engine to see if it improves things.
Doing a quick test:
mysql> show create table atable\G
*************************** 1. row ***************************
Table: atable
Create Table: CREATE TABLE `atable` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`t` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`id`),
KEY `t` (`t`)
) ENGINE=InnoDB AUTO_INCREMENT=51 DEFAULT CHARSET=utf8
1 row in set (0.00 sec)
mysql> explain SELECT id FROM atable AS l WHERE l.t = ( SELECT MAX(m.t) FROM atable AS m WHERE DATE(l.t) = DATE(m.t) LIMIT 1 ) ORDER BY l.t DESC LIMIT 50;
+----+--------------------+-------+-------+---------------+------+---------+------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+-------+-------+---------------+------+---------+------+------+--------------------------+
| 1 | PRIMARY | l | index | NULL | t | 4 | NULL | 50 | Using where; Using index |
| 2 | DEPENDENT SUBQUERY | m | index | NULL | t | 4 | NULL | 50 | Using where; Using index |
+----+--------------------+-------+-------+---------------+------+---------+------+------+--------------------------+
2 rows in set (0.00 sec)
After changing to MyISAM:
mysql> explain SELECT id FROM atable AS l WHERE l.t = ( SELECT MAX(m.t) FROM atable AS m WHERE DATE(l.t) = DATE(m.t) LIMIT 1 ) ORDER BY l.t DESC LIMIT 50;
+----+--------------------+-------+-------+---------------+------+---------+------+------+-----------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+-------+-------+---------------+------+---------+------+------+-----------------------------+
| 1 | PRIMARY | l | ALL | NULL | NULL | NULL | NULL | 50 | Using where; Using filesort |
| 2 | DEPENDENT SUBQUERY | m | index | NULL | t | 4 | NULL | 50 | Using where; Using index |
+----+--------------------+-------+-------+---------------+------+---------+------+------+-----------------------------+
2 rows in set (0.00 sec)
Related
I am trying to get the 3 successful (success =1) recent records and then see their average response time.
I have manipulated the results so that the average response is always 2ms.
I have 20,000 records in this table right now, but I plan on have 1-2 million. It takes 40 seconds just with 20,000 records, so I need to optimize this query.
Here is the fiddle: http://sqlfiddle.com/#!9/dc91eb/1/0
The fiddle contains my indices too, so I am open to adding more indices if needed.
SELECT proxy,
Avg(a.responsems) AS avgResponseMs,
COUNT(*) as Count
FROM proxylog a
WHERE
a.success = 1
AND ( (SELECT Count(0)
FROM proxylog b
WHERE ( ( b.success = a.success )
AND ( b.proxy = a.proxy )
AND ( b.datetime >= a.datetime ) )) <= 3 )
GROUP BY proxy
ORDER BY avgResponseMs
Here is the result of EXPLAIN
+----+--------------------+-------+-------+----------------+-------+---------+---------------------+-------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+-------+-------+----------------+-------+---------+---------------------+-------+----------------------------------------------+
| 1 | PRIMARY | a | index | NULL | proxy | 61 | NULL | 19110 | Using where; Using temporary; Using filesort |
+----+--------------------+-------+-------+----------------+-------+---------+---------------------+-------+----------------------------------------------+
| 2 | DEPENDENT SUBQUERY | b | ref | proxy,datetime | proxy | 52 | wwwim_iroom.a.proxy | 24 | Using where; Using index |
+----+--------------------+-------+-------+----------------+-------+---------+---------------------+-------+----------------------------------------------+
Before you suggest windowed functions, I am using MariaDB 10.1.21 which is ~Mysql 5.6 AFAIK
An index on (success, proxy, datetime, responsems) should help. success, proxy and datetime are the columns shared between both queries. datetime should come after the other two, because it is used to filter a range whereas the other two filter on a point. responsems comes last as this is the column the calculation is done on. That way the needed values can be taken directly from the index.
And please edit the question and include the DDL and DML also in the question it self. The fiddle might be down some day and the question therefore useless for future readers.
I was able to mimic row_number and follow #Gordon Linoff answer
SELECT pl.proxy, Avg(pl.responsems) AS avgResponseMs, COUNT(*) as Count
FROM (
SELECT
#row_number:=CASE
WHEN #g = proxy
THEN #row_number + 1
ELSE 1
END AS RN,
#g:=proxy g,
pl.*
FROM proxyLog pl,
(SELECT #g:=0,#row_number:=0) as t
WHERE pl.success = 1
ORDER BY proxy,datetime DESC
) pl
WHERE RN <= 3
GROUP BY proxy
ORDER BY avgResponseMs
From your comment back to my question, I think I know what your problem is.
If you have a proxy that has 900 requests, your first is still counting 900 (at or greater). Second is counting 899, Third, 898 and so on. That is what is killing your performance. Now add that to having millions of records will choke the crud out of your query.
What you may want to do is have a max date applied to the first one you are querying against where it makes reasonable sense. If you have proxy requests such as times are (and all are success values)
8:00:00
8:00:18
8:00:57
9:02:12
9:15:27
Do really care about the success time between 8:00:57 and 9:02 and 9:15? If a computer is getting pounded with activity in one hour vs light in another, is that really a fair assessment of success times?
What you MAY want is to have some (your discretion) cutoff time, such as within 3 minutes. What if someone does not even resume work going through a proxy for some time. Is that really it? Again, your discresion
AND ( a.datetime <= b.datetime and b.datetime < date_add( a.datetime, interval 5 minutes )) )) <= 3 )
And the <= 3 is not giving you what I THINK you expect. Again, your innermost COUNT(*) is counting all records >= a.datetime, so it would not be until you were at the end of a given batch of proxy times that you would get these counts.
So are you looking for the HISTORICAL average times, or only the most recent 3 time cycles for a given proxy. What you are requesting and querying may be two completely different things.
You may want to edit your original post to clarify. I end here until I hear back to possible offer additional assistance.
I would advise you to try writing the query using window functions:
SELECT pl.proxy, Avg(pl.responsems) AS avgResponseMs, COUNT(*) as Count
FROM (SELECT pl.*,
ROW_NUMBER() OVER (PARTITION BY pl.proxy ORDER BY datetime DESC) as seqnum
FROM proxylog pl
WHERE pl.success = 1
) pl
WHERE seqnum <= 3
GROUP BY proxy
ORDER BY avgResponseMs;
For this, you want an index on proxylog(success, proxy, datetime, responsems).
In older versions, I would replace your version of the subquery with:
SELECT pl.proxy, Avg(pl.responsems) AS avgResponseMs, COUNT(*) as Count
FROM (SELECT pl.*,
ROW_NUMBER() OVER (PARTITION BY pl.proxy ORDER BY datetime DESC) as seqnum
FROM proxylog pl
WHERE
) pl
WHERE pl.success = 1 AND
pl.datetime >= ANY (SELECT pl2.datetime
FROM proxylog pl2
WHERE pl2.success = pl.success AND
pl2.proxy = pl.proxy
ORDER BY pl2.datetime DESC
LIMIT 1 OFFSET 2
)
GROUP BY proxy
ORDER BY avgResponseMs;
The index you want for this is the same as above.
I have the following queries which both return the same result and row count:
select * from (
select UNIX_TIMESTAMP(network_time) * 1000 as epoch_network_datetime,
hbrl.business_rule_id,
display_advertiser_id,
hbrl.campaign_id,
truncate(sum(coalesce(hbrl.ad_spend_network, 0))/100000.0, 2) as demand_ad_spend_network,
sum(coalesce(hbrl.ad_view, 0)) as demand_ad_view,
sum(coalesce(hbrl.ad_click, 0)) as demand_ad_click,
truncate(coalesce(case when sum(hbrl.ad_view) = 0 then 0 else 100*sum(hbrl.ad_click)/sum(hbrl.ad_view) end, 0), 2) as ctr_percent,
truncate(coalesce(case when sum(hbrl.ad_view) = 0 then 0 else sum(hbrl.ad_spend_network)/100.0/sum(hbrl.ad_view) end, 0), 2) as ecpm,
truncate(coalesce(case when sum(hbrl.ad_click) = 0 then 0 else sum(hbrl.ad_spend_network)/100000.0/sum(hbrl.ad_click) end, 0), 2) as ecpc
from hourly_business_rule_level hbrl
where (publisher_network_id = 31534)
and network_time between str_to_date('2017-08-13 17:00:00.000000', '%Y-%m-%d %H:%i:%S.%f') and str_to_date('2017-08-14 16:59:59.999000', '%Y-%m-%d %H:%i:%S.%f')
and (network_time IS NOT NULL and display_advertiser_id > 0)
group by network_time, hbrl.campaign_id, hbrl.business_rule_id
having demand_ad_spend_network > 0
OR demand_ad_view > 0
OR demand_ad_click > 0
OR ctr_percent > 0
OR ecpm > 0
OR ecpc > 0
order by epoch_network_datetime) as atb
left join dim_demand demand on atb.display_advertiser_id = demand.advertiser_dsp_id
and atb.campaign_id = demand.campaign_id
and atb.business_rule_id = demand.business_rule_id
ran explain extended, and these are the results:
+----+-------------+----------------------------+------+-------------------------------------------------------------------------------+---------+---------+-----------------+---------+----------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+----------------------------+------+-------------------------------------------------------------------------------+---------+---------+-----------------+---------+----------+----------------------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 1451739 | 100.00 | NULL |
| 1 | PRIMARY | demand | ref | PRIMARY,join_index | PRIMARY | 4 | atb.campaign_id | 1 | 100.00 | Using where |
| 2 | DERIVED | hourly_business_rule_level | ALL | _hourly_business_rule_level_supply_idx,_hourly_business_rule_level_demand_idx | NULL | NULL | NULL | 1494447 | 97.14 | Using where; Using temporary; Using filesort |
+----+-------------+----------------------------+------+-------------------------------------------------------------------------------+---------+---------+-----------------+---------+----------+----------------------------------------------+
and the other is:
select UNIX_TIMESTAMP(network_time) * 1000 as epoch_network_datetime,
hbrl.business_rule_id,
display_advertiser_id,
hbrl.campaign_id,
truncate(sum(coalesce(hbrl.ad_spend_network, 0))/100000.0, 2) as demand_ad_spend_network,
sum(coalesce(hbrl.ad_view, 0)) as demand_ad_view,
sum(coalesce(hbrl.ad_click, 0)) as demand_ad_click,
truncate(coalesce(case when sum(hbrl.ad_view) = 0 then 0 else 100*sum(hbrl.ad_click)/sum(hbrl.ad_view) end, 0), 2) as ctr_percent,
truncate(coalesce(case when sum(hbrl.ad_view) = 0 then 0 else sum(hbrl.ad_spend_network)/100.0/sum(hbrl.ad_view) end, 0), 2) as ecpm,
truncate(coalesce(case when sum(hbrl.ad_click) = 0 then 0 else sum(hbrl.ad_spend_network)/100000.0/sum(hbrl.ad_click) end, 0), 2) as ecpc
from hourly_business_rule_level hbrl
join dim_demand demand on hbrl.display_advertiser_id = demand.advertiser_dsp_id
and hbrl.campaign_id = demand.campaign_id
and hbrl.business_rule_id = demand.business_rule_id
where (publisher_network_id = 31534)
and network_time between str_to_date('2017-08-13 17:00:00.000000', '%Y-%m-%d %H:%i:%S.%f') and str_to_date('2017-08-14 16:59:59.999000', '%Y-%m-%d %H:%i:%S.%f')
and (network_time IS NOT NULL and display_advertiser_id > 0)
group by network_time, hbrl.campaign_id, hbrl.business_rule_id
having demand_ad_spend_network > 0
OR demand_ad_view > 0
OR demand_ad_click > 0
OR ctr_percent > 0
OR ecpm > 0
OR ecpc > 0
order by epoch_network_datetime;
and these are the results for the second query:
+----+-------------+----------------------------+------+-------------------------------------------------------------------------------+---------+---------+---------------------------------------------------------------+---------+----------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+----------------------------+------+-------------------------------------------------------------------------------+---------+---------+---------------------------------------------------------------+---------+----------+----------------------------------------------+
| 1 | SIMPLE | hourly_business_rule_level | ALL | _hourly_business_rule_level_supply_idx,_hourly_business_rule_level_demand_idx | NULL | NULL | NULL | 1494447 | 97.14 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | demand | ref | PRIMARY,join_index | PRIMARY | 4 | my6sense_datawarehouse.hourly_business_rule_level.campaign_id | 1 | 100.00 | Using where; Using index |
+----+-------------+----------------------------+------+-------------------------------------------------------------------------------+---------+---------+---------------------------------------------------------------+---------+----------+----------------------------------------------+
the first one takes about 2 seconds while the second one takes over 2 minutes!
why is the second query taking so long?
what am I missing here?
thanks.
Use a subquery whenever the subquery significantly shrinks the number of rows before - ANY JOIN - always to reinforce Rick James Plan B.
To reinforce Rick & Paul's answer which you have already documented. The answers by Rick and Paul deserve Acceptance.
One possible reason is the number of rows that have to be joined with the second table.
The GROUP BY clause and the HAVING clause will limit the number of rows returned from your subquery.
Only those rows will be used for the join.
Without the subquery only the WHERE clause is limiting the number of rows for the JOIN.
The JOIN is done before the GROUP BY and HAVING clauses are processed.
Depending on group size and the selectivity of the HAVING conditions there would be much more rows that need to be joined.
Consider the following simplified example:
We have a table users with 1000 entries and the columns id, email.
create table users(
id smallint auto_increment primary key,
email varchar(50) unique
);
Then we have a (huge) log table user_actions with 1,000,000 entries and the columns id, user_id, timestamp, action
create table user_actions(
id mediumint auto_increment primary key,
user_id smallint not null,
timestamp timestamp,
action varchar(50),
index (timestamp, user_id)
);
The task is to find all users who have at least 900 entries in the log table since 2017-02-01.
The subquery solution:
select a.user_id, a.cnt, u.email
from (
select a.user_id, count(*) as cnt
from user_actions a
where a.timestamp >= '2017-02-01 00:00:00'
group by a.user_id
having cnt >= 900
) a
left join users u on u.id = a.user_id
The subquery returns 135 rows (users). Only those rows will be joined with the users table.
The subquery runs in about 0.375 seconds. The time needed for the join is almost zero, so the full query runs in about 0.375 seconds.
Solution without subquery:
select a.user_id, count(*) as cnt, u.email
from user_actions a
left join users u on u.id = a.user_id
where a.timestamp >= '2017-02-01 00:00:00'
group by a.user_id
having cnt >= 900
The WHERE condition filters the table to 866,081 rows.
The JOIN has to be done for all those 866K rows.
After the JOIN the GROUP BY and the HAVING clauses are processed and limit the result to 135 rows.
This query needs about 0.815 seconds.
So you can already see, that a subquery can improve the performance.
But let's make things worse and drop the primary key in the users table.
This way we have no index which can be used for the JOIN.
Now the first query runs in 0.455 seconds. The second query needs 40 seconds - almost 100 times slower.
Notes
It's difficult to say if the same applies to your case. Reasons are:
Your queries are quite complex and far away from from beeing an MVCE.
I don't see anything beeng selected from the demand table - So it's unclear why you are joining it at all.
You use a LEFT JOIN in one query and an INNER JOIN in another one.
The relation between the two tables is unclear.
No information about indexes. You should provide the CREATE statements (SHOW CREATE table_name).
Test setup
drop table if exists users;
create table users(
id smallint auto_increment primary key,
email varchar(50) unique
)
select seq as id, rand(1) as email
from seq_1_to_1000
;
drop table if exists user_actions;
create table user_actions(
id mediumint auto_increment primary key,
user_id smallint not null,
timestamp timestamp,
action varchar(50),
index (timestamp, user_id)
)
select seq as id
, floor(rand(2)*1000)+1 as user_id
#, '2017-01-01 00:00:00' + interval seq*20 second as timestamp
, from_unixtime(unix_timestamp('2017-01-01 00:00:00') + seq*20) as timestamp
, rand(3) as action
from seq_1_to_1000000
;
MariaDB 10.0.19 with sequence plugin.
The queries are different. One says JOIN, the other says LEFT JOIN. You are not using demand, so the join is probably useless. However, in the case of JOIN, you are filtering out advertisers that are not in dim_demand; it that the intent?
But that does not address the question.
The EXPLAINs estimate that there are 1.5M rows in hbrl. But how many show up in the result? I would guess it is a lot fewer. From this, I can answer your question.
Consider these two:
SELECT ... FROM ( SELECT ... FROM a
GROUP BY or HAVING or LIMIT ) x
JOIN b
SELECT ... FROM a
JOIN b
GROUP BY or HAVING or LIMIT
The first will decrease the number of rows that need to join to b; the second will need to do a full 1.5M joins. I suspect that the time taken to do the JOIN (be it LEFT or not) is where the difference is.
Plan A: Remove demand from the query.
Plan B: Use a subquery whenever the subquery significantly shrinks the number of rows before the JOIN.
Indexing (may speed up both variants):
INDEX(publisher_network_id, network_time)
and get rid of this as being useless (since the between will fail anyway for NULL):
and network_time IS NOT NULL
Side note: I recommend simplifying and fixing this
and network_time
between str_to_date('2017-08-13 17:00:00.000000', '%Y-%m-%d %H:%i:%S.%f')
AND str_to_date('2017-08-14 16:59:59.999000', '%Y-%m-%d %H:%i:%S.%f')
to
and network_time >= '2017-08-13 17:00:00
and network_time < '2017-08-13 17:00:00 + INTERVAL 24 HOUR
I have the following query, all relevant columns are indexed correctly. MySQL version 5.0.8. The query takes forever:
SELECT COUNT(*) FROM `members` `t` WHERE t.member_type NOT IN (1,2)
AND ( SELECT end_date FROM subscriptions s
WHERE s.sub_auth_id = t.member_auth_id AND s.sub_status = 'Completed'
AND s.sub_pkg_id > 0 ORDER BY s.id DESC LIMIT 1 ) < curdate( )
EXPLAIN output:
----+--------------------+-------+-------+-----------------------+---------+---------+------+------+-------------
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
----+--------------------+-------+-------+-----------------------+---------+---------+------+------+-------------
1 | PRIMARY | t | ALL | membership_type | NULL | NULL | NULL | 9610 | Using where
----+--------------------+-------+-------+-----------------------+---------+---------+------+------+-------------
2 | DEPENDENT SUBQUERY | s | index | subscription_auth_id, | PRIMARY | 4 | NULL | 1 | Using where
| | | | subscription_pkg_id, | | | | |
| | | | subscription_status | | | | |
----+--------------------+-------+-------+-----------------------+---------+---------+------+------+-------------
Why?
Your subselect refers to values in the parent query. This is known as a correlated (dependent) subquery, and such a query has to be executed once for every row in the parent query, which often leads to poor performance. It is often faster to rewrite the query as a JOIN, for example like this
(Note: without a sample schema to test with, it is impossible to say in advance if this will be faster and still correct, you might need to adjust it a little):
SELECT COUNT(*) FROM members t
LEFT JOIN (
SELECT sub_auth_id as member_id, max(id) as sid FROM subscriptions
WHERE sub_status = 'Completed'
AND sub_pkg_id > 0
GROUP BY sub_auth_id
LEFT JOIN (
SELECT id AS subid, end_date FROM subscriptions
WHERE sub_status = 'Completed'
AND sub_pkg_id > 0
) sdate ON sid = subid
) sub ON sub.member_id = t.member_auth_id
WHERE t.member_type NOT IN (1,2)
AND sub.end_date < curdate( )
The logic here is:
For each member, find his latest subscription.
For each latest subscription, find its end date.
Join these member-latest_sub_date pair to the members list.
Filter the list.
Your query is slow because as written you are considering 9,610 rows and therefore performing 9,610 SELECT subqueries in your WHERE clause. You really should rewrite your query to JOIN the members and subscriptions tables first, to which your WHERE conditions could still apply.
EDIT: Try this.
SELECT COUNT(*)
FROM `members` `t`
JOIN subscriptions s ON (s.sub_auth_id = t.member_auth_id)
WHERE t.member_type NOT IN (1,2)
AND s.sub_status = 'Completed'
AND s.sub_pkg_id > 0
AND end_date < curdate()
ORDER BY s.id DESC LIMIT 1
Caveat: I'm not a MySQL expert, but pretty good in a different SQL flavour (VFP), but I believe you will save some time if:
You count just one field, let's say memberid, instead of *.
Your comparison NOT IN (1,2) is replaced with > 2 (provided that is valid).
The ORDER BY in your subselect is unnecessary, I think. You're trying to get the last completed subscription?
The < curdate() should be inside your subselect's WHERE.
(SELECT end_date FROM subscriptions s
WHERE s.end_date < curdate() and s.sub_auth_id = t.member_auth_id AND
s.sub_status = 'Completed' AND s.sub_pkg_id > 0 ORDER BY s.id DESC LIMIT 1 )
Tune your subselect so as to trim down the set as quickly as possible. The first conditional should be the one least likely to occur.
I ended up doing it like this:
select count(*) from members t
JOIN subscriptions s ON s.sub_auth_id = t.member_auth_id
WHERE t.membership_type > 2 AND s.sub_status = 'Completed' AND s.sub_pkg_id > 0
AND s.sub_end_date < curdate( )
AND s.id = (SELECT MAX(ss.id) FROM subscriptions ss WHERE ss.sub_auth_id = t.member_auth_id)
I believe that the problem is due to a bug that won't be fixed until MySQL 6.
I have a large mysql-table with about 110.000.000 items
The Table Design is:
CREATE TABLE IF NOT EXISTS `tracksim` (
`tracksimID` int(11) NOT NULL AUTO_INCREMENT,
`trackID1` int(11) NOT NULL,
`trackID2` int(11) NOT NULL,
`sim` double NOT NULL,
PRIMARY KEY (`tracksimID`),
UNIQUE KEY `TrackID1` (`trackID1`,`trackID2`),
KEY `sim` (`sim`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
Now I want to query a normal query:
SELECT trackID1, trackID2 FROM `tracksim`
WHERE sim > 0.5 AND
(`trackID1` = 168123 OR `trackID2`= 168123)
ORDER BY sim DESC LIMIT 0,100
The Explain statement gives me:
+----+-------------+----------+-------+---------------+------+---------+------+----------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+----------+-------+---------------+------+---------+------+----------+----------+-------------+
| 1 | SIMPLE | tracksim | range | TrackID1,sim | sim | 8 | NULL | 19980582 | 100.00 | Using where |
+----+-------------+----------+-------+---------------+------+---------+------+----------+----------+-------------+
The query seems to be very slow(about 185 seconds), but i don't know if it is only because of the amount of items in the table. Do you have a tip how I can speedup the query or the table-lookup?
With 110 million records, I can't imagine there are many entries with the track ID in question. I would have indexes such as
(trackID1, sim )
(trackID2, sim )
(tracksimID, sim)
and do a PREQUERY via union and join against that result
select STRAIGHT_JOIN
TS2.*
from
( select ts.tracksimID
from tracksim ts
where ts.trackID1 = 168123
and ts.sim > 0.5
UNION
select ts.trackSimID
from tracksim ts
where ts.trackid2 = 168123
and ts.sim > 0.5
) PreQuery
JOIN TrackSim TS2
on PreQuery.TrackSimID = TS2.TrackSimID
order by
TS2.SIM DESC
LIMIT 0, 100
Mostly I agree with Drap, but the following variation of the query might be even more efficient, especially for larger LIMIT:
SELECT TS2.*
FROM (
SELECT tracksimID, sim
FROM tracksim
WHERE trackID1 = 168123
AND sim > 0.5
UNION
SELECT trackSimID, sim
FROM tracksim
WHERE trackid2 = 168123
AND ts.sim > 0.5
ORDER BY sim DESC
LIMIT 0, 100
) as PreQuery
JOIN TrackSim TS2 USING (TrackSimID);
Requires (trackID1, sim) and (trackID2, sim) indexes.
Try filtering your query so you don't return the full table. Alternatively you could try applying an index to the table on one of the track ID's, for example:
CREATE INDEX TRACK_INDEX
ON tracksim (trackID1)
http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html
http://www.tutorialspoint.com/mysql/mysql-indexes.htm
Mysql is using an index on (faver_profile_id,removed,notice_id) when it should be using the index on (faver_profile_id,removed,id). The weird thing is that for some values of faver_profile_id it does use the correct index. I can use FORCE INDEX which drastically speeds up the query, but I'd like to figure out why mysql is doing this.
This is a new table (35m rows) copied from another table using INSERT INTO.. SELECT FROM.
I did not run OPTIMIZE TABLE or ANALYZE after. Could that help?
SELECT `Item`.`id` , `Item`.`cached_image` , `Item`.`submitter_id` , `Item`.`source_title` , `Item`.`source_url` , `Item`.`source_image` , `Item`.`nudity` , `Item`.`tags` , `Item`.`width` , `Item`.`height` , `Item`.`tumblr_id` , `Item`.`tumblr_reblog_key` , `Item`.`fave_count` , `Item`.`file_size` , `Item`.`animated` , `Favorite`.`id` , `Favorite`.`created`
FROM `favorites` AS `Favorite`
LEFT JOIN `items` AS `Item` ON ( `Favorite`.`notice_id` = `Item`.`id` )
WHERE `faver_profile_id` =11619
AND `Favorite`.`removed` =0
AND `Item`.`removed` =0
AND `nudity` =0
ORDER BY `Favorite`.`id` DESC
LIMIT 26
Query execution plan: "idx_notice_id_profile_id" is an index on (faver_profile_id,removed,notice_id)
1 | SIMPLE | Favorite | ref | idx_faver_idx_id,idx_notice_id_profile_id,notice_id_idx | idx_notice_id_profile_id | 4 | const,const | 15742 | Using where; Using filesort |
1 | SIMPLE | Item | eq_ref | PRIMARY | PRIMARY | 4 | gragland_imgfave.Favorite.notice_id | 1 | Using where
I don't know if its causing any confusion or not, but maybe by moving some of the AND qualifiers to the Item's join might help as its directly correlated to the ITEM and not the favorite. In addition, I've explicitly qualified table.field references where they were otherwise missing.
SELECT
Item.id,
Item.cached_image,
Item.submitter_id,
Item.source_title,
Item.source_url,
Item.source_image,
Item.nudity,
Item.tags,
Item.width,
Item.height,
Item.tumblr_id,
Item.tumblr_reblog_key,
Item.fave_count,
Item.file_size,
Item.animated,
Favorite.id,
Favorite.created
FROM favorites AS Favorite
LEFT JOIN items AS Item
ON Favorite.notice_id = Item.id
AND Item.Removed = 0
AND Item.Nudity = 0
WHERE Favorite.faver_profile_id = 11619
AND Favorite.removed = 0
ORDER BY Favorite.id DESC
LIMIT 26
So now, from the "Favorites" table, its criteria is explicitly down to faver_profile_id, removed, id (for order)