Optimize Query with MAX/MIN in HAVING clause - mysql

I have a query that is taking 17-20 seconds on our server, and I'd like to see what I can do to optimize it. We are on MySQL 5.6 and will be upgrading to 5.7 in the next couple of months.
Query:
SELECT pm.mid AS mid
FROM
pm_message pm
INNER JOIN pm_index pmi ON pmi.mid = pm.mid
GROUP BY pm.mid
HAVING (MIN(pmi.deleted) > 0 AND MAX(pmi.deleted) < '1535490002')
LIMIT 1000 OFFSET 0;
The mid column is a primary key in both pm_message and pm_index.
The tables have millions of records each:
select count(*) from pm_message;
3748290
select count(*) from pm_index;
6938947
Any suggestions for improving this query?
I'm wondering if making the 'deleted' column in the pm_index table an index would help?

I would completely rewrite the query, because you basically want a list of distinct mids where deleted falls within a certain range. You do not need to display any data from the pm_index table, so I would use a correlated subquery with the NOT EXISTS operator. This way MySQL does not have to group and order the entire pm_index table to get the minimums and maximums.
SELECT pm.mid AS mid
FROM
pm_message pm
WHERE NOT EXISTS (SELECT 1 FROM pm_index WHERE pm_index.mid=pm.mid and (pm_index.deleted<=0 OR pm_index.deleted>=1535490002))
The query would benefit from a multi-column index on the mid and deleted fields of the pm_index table.

This elaborates on Shadow's answer. Try using two NOT EXISTS clauses:
SELECT pm.mid
FROM pm_message pm
WHERE NOT EXISTS (SELECT 1
FROM pm_index pmi
WHERE pmi.mid = pm.mid AND
pmi.deleted <= 0
) AND
NOT EXISTS (SELECT 1
FROM pm_index pmi
WHERE pmi.mid = pm.mid AND
pmi.deleted >= 1535490002
) ;
And be sure you have an index on pm_index(mid, deleted). The index is very important. I'm breaking it into two clauses because OR can confuse the query optimizer.
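As a rough check of the rewrite, the sketch below compares the GROUP BY/HAVING form with the two-clause NOT EXISTS form on a tiny data set. It uses Python's built-in sqlite3 as a stand-in for MySQL, so it says nothing about MySQL's execution plan; the `recipient` column, the index name, and the sample rows are assumptions made up for illustration. Note the non-strict comparisons (`<= 0`, `>= cutoff`), which are the negation of `MIN(deleted) > 0` and `MAX(deleted) < cutoff`.

```python
import sqlite3

# SQLite stands in for MySQL; only the result sets are compared, not performance.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pm_message (mid INTEGER PRIMARY KEY);
    -- 'recipient' is a hypothetical second key column; the real schema was not posted.
    CREATE TABLE pm_index (mid INTEGER, recipient INTEGER, deleted INTEGER,
                           PRIMARY KEY (mid, recipient));
    -- The composite index recommended in the answers (index name is made up).
    CREATE INDEX pm_index_mid_deleted ON pm_index (mid, deleted);
    INSERT INTO pm_message VALUES (1), (2), (3), (4);
    INSERT INTO pm_index VALUES
        (1, 10, 100), (1, 11, 200),
        (2, 10, 0),   (2, 11, 150),
        (3, 10, 100), (3, 11, 1535490002),
        (4, 10, 50);
""")
CUTOFF = 1535490002

# Original formulation: aggregate every group, then filter.
having_rows = conn.execute("""
    SELECT pm.mid
    FROM pm_message pm
    JOIN pm_index pmi ON pmi.mid = pm.mid
    GROUP BY pm.mid
    HAVING MIN(pmi.deleted) > 0 AND MAX(pmi.deleted) < ?
    ORDER BY pm.mid
""", (CUTOFF,)).fetchall()

# Rewritten formulation: reject a message as soon as one offending row exists.
not_exists_rows = conn.execute("""
    SELECT pm.mid
    FROM pm_message pm
    WHERE NOT EXISTS (SELECT 1 FROM pm_index pmi
                      WHERE pmi.mid = pm.mid AND pmi.deleted <= 0)
      AND NOT EXISTS (SELECT 1 FROM pm_index pmi
                      WHERE pmi.mid = pm.mid AND pmi.deleted >= ?)
    ORDER BY pm.mid
""", (CUTOFF,)).fetchall()

print(having_rows, not_exists_rows)
```

The two forms agree here because every message has at least one pm_index row. A message with no pm_index rows at all would pass the NOT EXISTS version but not the inner join, so an extra EXISTS check may be needed if such messages can occur.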

Give this a try. It turns things inside out: starting with pmi, then minimizing the touching of pm.
SELECT mid, MIN(deleted) AS mind, MAX(deleted) AS maxd
FROM pm_index AS pmi
GROUP BY mid
HAVING mind > 0
AND maxd < '1535490002'
AND EXISTS (
SELECT 1
FROM pm_message AS pm
WHERE pm.mid = pmi.mid
)
LIMIT 1000 OFFSET 0;
I doubt this will help much; the query seems to need to touch virtually all rows in both tables.
If all mid values in pmi definitely exist in pm, then my EXISTS clause can be removed. However you say that both tables have PRIMARY KEY(mid)? I suspect pmi actually has a second column in its PK. Please provide SHOW CREATE TABLE.

Related

Conditionally counting while also grouping by

I am trying to join two tables
ad_data_grouped
adID, adDate (date), totalViews
This is data that has already been grouped by both adID and adDate.
The second table is
leads
leadID, DateOfBirth, adID, state, createdAt(dateTime)
What I'm struggling with is joining these two tables so I can have a column that counts the number of leads that share the same adID and where adDate = createdAt.
The problem I'm running into is that the counts are all the same for all groupings of adID. I have a few other things I'm trying to do, but they are based on similar conditional counting.
Query: (I know the temp table is probably overkill, but I'm trying to break this up into small pieces where I can understand what each piece does)
CREATE TEMPORARY TABLE ad_stats_grouped
SELECT * FROM `ad_stats`
LIMIT 0;
INSERT INTO ad_stats_grouped(AdID, adDate, DailyViews)
SELECT
AdID,
adDate,
sum(DailyViews)
FROM `ad_stats`
GROUP BY adID, adDate;
SELECT
ad_stats_grouped.adID,
ad_stats_grouped.adDate,
COUNT(case when ad_stats_grouped.adDate = Date(Leads.CreatedAt) THEN 1 ELSE 0 END)
FROM `ad_stats_grouped` INNER JOIN `LEADS` ON
ad_stats_grouped.adID = Leads.AdID
GROUP BY adID, adDate;
The problem with your original query is the logic in the COUNT(). This aggregate function takes into account all non-null values, so it counts both the 0s and the 1s. One solution would be to change COUNT() to SUM().
But I think that the query can be further improved by moving the date condition to the ON part of a left join:
select
g.adid,
g.addate,
count(l.adid)
from `ad_stats_grouped` g
left join `leads` l
on g.adid = l.adid
and l.createdat >= g.addate
and l.createdat < g.addate + interval 1 day
group by g.adid, g.addate;
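The COUNT() versus SUM() point can be reproduced on a few rows. This is a minimal sketch using Python's sqlite3 in place of MySQL (so `date()` and `date(x, '+1 day')` replace `DATE()` and `+ interval 1 day`); the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ad_stats_grouped (adID INTEGER, adDate TEXT, DailyViews INTEGER);
    CREATE TABLE leads (leadID INTEGER PRIMARY KEY, adID INTEGER, createdAt TEXT);
    INSERT INTO ad_stats_grouped VALUES (1, '2020-01-01', 10), (1, '2020-01-02', 20);
    INSERT INTO leads VALUES
        (1, 1, '2020-01-01 09:00:00'),
        (2, 1, '2020-01-01 17:30:00'),
        (3, 1, '2020-01-02 08:15:00');
""")

def conditional(aggregate):
    """Run the conditional aggregate with either COUNT or SUM."""
    sql = f"""
        SELECT g.adID, g.adDate,
               {aggregate}(CASE WHEN g.adDate = date(l.createdAt) THEN 1 ELSE 0 END)
        FROM ad_stats_grouped g
        JOIN leads l ON l.adID = g.adID
        GROUP BY g.adID, g.adDate
        ORDER BY g.adDate
    """
    return conn.execute(sql).fetchall()

# COUNT counts every non-NULL value, including the 0s from the ELSE branch.
count_result = conditional("COUNT")
# SUM adds up the 1s only.
sum_result = conditional("SUM")

# Date range in the ON clause: only matching leads are joined, so COUNT is safe.
left_join_result = conn.execute("""
    SELECT g.adID, g.adDate, COUNT(l.leadID)
    FROM ad_stats_grouped g
    LEFT JOIN leads l
           ON l.adID = g.adID
          AND l.createdAt >= g.adDate
          AND l.createdAt < date(g.adDate, '+1 day')
    GROUP BY g.adID, g.adDate
    ORDER BY g.adDate
""").fetchall()

print(count_result, sum_result, left_join_result)
```

With COUNT, both days report 3 (every joined lead is counted); with SUM or the range join, the days report 2 and 1.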

Query is taking too long even with 1k results

I have run several tests to optimize the query below, but none of them helped.
What I tried:
Adding extra indexes
Changing the query logic by checking other attributes as well in the IN clause
Testing suggestions from online query optimization tools (EverSQL etc.)
Indexes I am using:
radacct (`_accttime`);
radacct (`username`);
radacct (`acctstoptime`,`_accttime`);
Complete query:
(SELECT *
FROM `radacct`
WHERE (radacct._accttime > NOW() - INTERVAL 1.2 HOUR)
AND radacct.acctstoptime IN
(SELECT MAX(radacct.acctstoptime)
FROM `radacct`
GROUP BY radacct.username) )
UNION
(SELECT *
FROM `radacct`
WHERE (radacct._accttime >= DATE_SUB(NOW(), INTERVAL 2 MONTH)
AND radacct.acctstoptime IS NULL) )
When I execute the SELECT statements above by themselves, they only take a few milliseconds.
The issue is with the IN clause: that is the part of the query that takes ages.
As I see it, your problem is the dependent subquery in your IN. Apparently the optimizer doesn't recognize that the subquery's result doesn't depend on the outer row (also, the query itself might be suboptimal). Essentially, the subquery is executed for each row, which is bad.
Now we have to find out which part causes it to be treated as dependent, because it isn't really. My first try would be to give it a different alias:
IN (SELECT MAX(`inner`.acctstoptime) FROM radacct AS `inner` GROUP BY `inner`.username)
If that isn't enough to make it independent, make it a full-blown join (INNER, so that non-joined rows [= non-max rows] are discarded from the result):
INNER JOIN (
SELECT MAX(`inner`.acctstoptime) AS maxstoptime, `inner`.username
FROM `radacct` AS `inner`
GROUP BY `inner`.username
) sub ON (sub.maxstoptime=radacct.acctstoptime)
Hope that does the trick.
Note: since your result has rows of users with their max acctstoptimes, it might, on rare occasions, contain more than one row for a user: a row whose acctstoptime isn't the max for THAT user but matches the max of another user. In the join variant, you can just add another condition to the ON clause. In the IN subquery, you would drop the explicit GROUP BY and add WHERE radacct.username=`inner`.username (which would indeed make it an explicitly dependent subquery, but the optimizer might be able to handle it).
Update (due to miscommunication ...):
The resulting complete query with the join:
(SELECT DISTINCT radacct.*
FROM radacct
INNER JOIN (
SELECT MAX(`inner`.acctstoptime) AS maxstoptime, `inner`.username
FROM `radacct` AS `inner`
GROUP BY `inner`.username
) sub ON (sub.maxstoptime=radacct.acctstoptime)
WHERE (_accttime > NOW() - INTERVAL 1.2 HOUR)
)
UNION
(SELECT *
FROM `radacct`
WHERE (_accttime >= DATE_SUB(NOW(),INTERVAL 2 MONTH)
AND acctstoptime IS NULL)
)
You may still add the username comparison to the ON clause.
What this query does is remove the IN selector and force an intermediate result for the join (for each username, the max acctstoptime). The join then matches the normal rows to an intermediate result row if and only if the acctstoptime is the max for some user (or THAT user, if you add the username comparison). A row that doesn't have the max acctstoptime, and thus has no join "partner", is discarded from the result (this is what the INNER does; a LEFT JOIN would not be sufficient), leaving only the rows with a max acctstoptime (in the first part of the union).
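The difference between matching "the max of some user" and "the max of THAT user" is easy to see on four rows. The sketch below uses Python's sqlite3 rather than MySQL, and the `radacctid` key column and the sample rows are assumptions for illustration; it only shows which rows survive the join, not how MySQL executes it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- radacctid is an assumed surrogate key; only username/acctstoptime matter here.
    CREATE TABLE radacct (radacctid INTEGER PRIMARY KEY, username TEXT, acctstoptime TEXT);
    INSERT INTO radacct (username, acctstoptime) VALUES
        ('alice', '2020-01-01 10:00:00'),   -- not alice's max, but equal to bob's max
        ('alice', '2020-01-01 12:00:00'),   -- alice's max
        ('bob',   '2020-01-01 10:00:00'),   -- bob's max
        ('bob',   '2020-01-01 09:00:00');
""")

PER_USER_MAX = """
    SELECT username, MAX(acctstoptime) AS maxstoptime
    FROM radacct
    GROUP BY username
"""

# Join on the time only: a row qualifies if it matches ANY user's max.
any_user_ids = [r[0] for r in conn.execute(f"""
    SELECT DISTINCT r.radacctid
    FROM radacct r
    INNER JOIN ({PER_USER_MAX}) sub ON sub.maxstoptime = r.acctstoptime
    ORDER BY r.radacctid
""")]

# Join on time AND username: a row qualifies only if it is its own user's max.
same_user_ids = [r[0] for r in conn.execute(f"""
    SELECT r.radacctid
    FROM radacct r
    INNER JOIN ({PER_USER_MAX}) sub
            ON sub.maxstoptime = r.acctstoptime
           AND sub.username = r.username
    ORDER BY r.radacctid
""")]

print(any_user_ids, same_user_ids)
```

Row 1 slips through the time-only join because its stop time happens to equal bob's maximum; adding the username comparison removes it.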

Is there a way to create an SQL query faster than this one?

I have a MySQL table which stores the data of a hotel's reservations.
I need a query to see the number of guests who stayed in the hotel on each date.
I was able to create a query (using a subquery) but it performs very slowly. Is there a better way to get the requested data? (For example join the table to itself, or whatever.)
My query is:
SELECT CheckOutDate AS Date,
(SELECT SUM(NrOfGuests) FROM tblGuests tG
WHERE tG.CheckInDate <= tblGuests.CheckOutDate
AND tG.CheckOutDate > tblGuests.CheckOutDate
AND tG.IsCancelled = False AND tG.NoShow = False)
AS NrOfGestsStaying
FROM tblGuests
GROUP BY CheckOutDate
What is the best way to make it perform faster?
In the original query, the SELECT returns a SUM on every row of the table using a subquery. The duplicates are removed afterwards using a group by CheckOutDate. So, in other words, this is the SUM(NrOfGuests) for distinct CheckOutDate.
You can remove duplicate CheckOutDate in advance by subquerying distinct CheckOutDate. So in the receiving query the SUM is applied just one time for distinct CheckOutDate:
SELECT dT.CheckOutDate
,(SELECT SUM(NrOfGuests)
FROM tblGuests tG
WHERE tG.CheckInDate <= dT.CheckOutDate
AND tG.CheckOutDate > dT.CheckOutDate
AND tG.IsCancelled = 0
AND tG.NoShow = 0
) AS NrOfGuests
FROM (
SELECT DISTINCT CheckOutDate
FROM tblGuests
) AS dT
ORDER BY dT.CheckOutDate
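To sanity-check the DISTINCT-first rewrite, the sketch below runs it on a handful of made-up reservations and compares the result with a plain Python computation of the same rule. It uses sqlite3 instead of MySQL, ISO date strings instead of DATE columns, and keeps the question's strict `CheckOutDate >` comparison (guests leaving on a date are not counted as staying); all sample data is an assumption.

```python
import sqlite3

reservations = [
    # (CheckInDate, CheckOutDate, NrOfGuests, IsCancelled, NoShow)
    ("2020-01-01", "2020-01-03", 2, 0, 0),
    ("2020-01-02", "2020-01-03", 1, 0, 0),
    ("2020-01-02", "2020-01-05", 3, 0, 0),
    ("2020-01-01", "2020-01-05", 4, 1, 0),  # cancelled, must be ignored
]

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tblGuests (CheckInDate TEXT, CheckOutDate TEXT,
                            NrOfGuests INTEGER, IsCancelled INTEGER, NoShow INTEGER)
""")
conn.executemany("INSERT INTO tblGuests VALUES (?, ?, ?, ?, ?)", reservations)

# De-duplicate the dates first, then evaluate the SUM once per distinct date.
per_date = conn.execute("""
    SELECT dT.CheckOutDate,
           (SELECT SUM(tG.NrOfGuests)
            FROM tblGuests tG
            WHERE tG.CheckInDate <= dT.CheckOutDate
              AND tG.CheckOutDate > dT.CheckOutDate
              AND tG.IsCancelled = 0
              AND tG.NoShow = 0) AS NrOfGuests
    FROM (SELECT DISTINCT CheckOutDate FROM tblGuests) AS dT
    ORDER BY dT.CheckOutDate
""").fetchall()

def staying_on(day):
    """Reference computation in plain Python; None mirrors SQL's SUM over no rows."""
    matching = [n for (cin, cout, n, cancelled, noshow) in reservations
                if cin <= day < cout and not cancelled and not noshow]
    return sum(matching) if matching else None

expected = [(day, staying_on(day)) for day in sorted({r[1] for r in reservations})]
print(per_date)
```

On 2020-01-03 only the three-guest reservation is still in house; on 2020-01-05 nobody is, so the SUM is NULL (wrap it in COALESCE if 0 is preferred).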

How to join tables with union? mysql

I have two tables:
history
business
I want to run this query:
SELECT name, talias.*
FROM
(SELECT business.bussName as name, history.*
FROM history
INNER JOIN business on history.bussID = business.bussID
WHERE history.activity = 'Insert' OR history.activity = 'Update'
UNION
SELECT NULL as name, history.*
FROM history
WHERE history.activity = 'Delete'
) as talias
WHERE 1
order by talias.date DESC
LIMIT $fetch,20
This query takes 13 seconds. I think the problem is that MySQL joins all the rows of the history and business tables, while it should join just 20 rows!
How could I fix that?
If I understand you correctly, you want all rows from history where the activity is 'Delete', plus all those rows where the activity is 'Insert' or 'Update' and there is a corresponding row in the business table.
I don't know if that is going to be faster than your query - you will need to check the execution plan to verify this.
SELECT *
FROM history
where activity = 'Delete'
or ( activity in ('Insert','Update')
AND exists (select 1
from business
where history.bussID = business.bussID))
order by `date` DESC
LIMIT $fetch,20
Edit (after the question has changed)
If you do need columns from the business table, replacing the union with an outer join might improve performance.
But to be honest, I don't expect it. The MySQL optimizer isn't very smart and I wouldn't be surprised if the outer join was actually implemented using some kind of union. Again only you can test that by looking at the execution plan.
SELECT h.*,
b.bussName as name
FROM history h
LEFT JOIN business b
ON h.bussID = b.bussID
AND h.activity in ('Insert','Update')
WHERE h.activity in ('Delete', 'Insert','Update')
ORDER BY h.`date` DESC
LIMIT $fetch,20
Btw: date is a horrible column name. First because it's a reserved word, second (and more important) because it doesn't document anything. Is that the "creation date"? The "deletion date"? A "due date"? Some other date?
Try this:
SELECT h.*
FROM history AS h
WHERE (h.activity IN ('Insert', 'Update')
AND EXISTS (SELECT * FROM business AS b WHERE b.bussID = h.bussID))
OR h.activity = 'Delete'
ORDER BY h.date DESC
LIMIT $fetch, 20
For the ORDER BY and LIMIT to be efficient, make sure you have an index on history.date.
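A small sketch of the EXISTS-based filter with the suggested index on the date column follows. It runs on Python's sqlite3 rather than MySQL, and the `id` column, the index name, and the sample rows are invented for illustration. The page offset is bound as a parameter instead of being interpolated into the SQL text, as the `$fetch` placeholder in the question suggests is being done.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE business (bussID INTEGER PRIMARY KEY, bussName TEXT);
    -- 'id' is a hypothetical key column used only to identify rows in this sketch.
    CREATE TABLE history (id INTEGER PRIMARY KEY, bussID INTEGER,
                          activity TEXT, "date" TEXT);
    -- Index suggested above so ORDER BY ... LIMIT can read rows in date order.
    CREATE INDEX history_date ON history ("date");
    INSERT INTO business VALUES (1, 'Acme');
    INSERT INTO history VALUES
        (1, 1, 'Insert', '2020-01-01'),
        (2, 2, 'Update', '2020-01-02'),   -- business 2 no longer exists
        (3, 2, 'Delete', '2020-01-03'),
        (4, 1, 'Update', '2020-01-04'),
        (5, 1, 'View',   '2020-01-05');   -- activity not of interest
""")

def page(offset, size=20):
    """Return the ids of one page of history rows, newest first."""
    return [r[0] for r in conn.execute("""
        SELECT h.id
        FROM history AS h
        WHERE (h.activity IN ('Insert', 'Update')
               AND EXISTS (SELECT 1 FROM business AS b WHERE b.bussID = h.bussID))
           OR h.activity = 'Delete'
        ORDER BY h."date" DESC
        LIMIT ? OFFSET ?
    """, (size, offset))]

first_page = page(0)
print(first_page)
```

Row 2 is dropped because its business row is missing, matching the inner join in the original UNION; row 3 is kept because deletes are returned regardless.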

How to optimize query looking for rows where conditional join rows do not exist?

I've got a table of keywords that I regularly refresh against a remote search API, and I have another table that gets a row each time I refresh one of the keywords. I use this table to block multiple processes from stepping on each other and refreshing the same keyword, as well as for stat collection. So when I spin up my program, it queries for all the keywords that don't have a request currently in process and don't have a successful one within the last 15 minutes, or whatever the interval is. All was working fine for a while, but now the keywords_requests table has almost 2 million rows in it and things are bogging down badly. I've got indexes on almost every column in the keywords_requests table, but to no avail.
I'm logging slow queries and this one is taking forever, as you can see. What can I do?
# Query_time: 20 Lock_time: 0 Rows_sent: 568 Rows_examined: 1826718
SELECT Keyword.id, Keyword.keyword
FROM `keywords` as Keyword
LEFT JOIN `keywords_requests` as KeywordsRequest
ON (
KeywordsRequest.keyword_id = Keyword.id
AND (KeywordsRequest.status = 'success' OR KeywordsRequest.status = 'active')
AND KeywordsRequest.source_id = '29'
AND KeywordsRequest.created > FROM_UNIXTIME(1234551323)
)
WHERE KeywordsRequest.id IS NULL
GROUP BY Keyword.id
ORDER BY KeywordsRequest.created ASC;
It seems your most selective index is the one on KeywordsRequest.created.
Try to rewrite query this way:
SELECT Keyword.id, Keyword.keyword
FROM `keywords` as Keyword
LEFT OUTER JOIN (
SELECT *
FROM `keywords_requests` as kr
WHERE created > FROM_UNIXTIME(1234567890) /* Happy unix_time! */
) AS KeywordsRequest
ON (
KeywordsRequest.keyword_id = Keyword.id
AND (KeywordsRequest.status = 'success' OR KeywordsRequest.status = 'active')
AND KeywordsRequest.source_id = '29'
)
WHERE keyword_id IS NULL;
It will (hopefully) hash join two not so large sources.
And Bill Karwin is right, you don't need the GROUP BY or ORDER BY.
There is no fine control over the plans in MySQL, but you can try (try) to improve your query in the following ways:
Create a composite index on (keyword_id, status, source_id, created) and make it so:
SELECT Keyword.id, Keyword.keyword
FROM `keywords` as Keyword
LEFT OUTER JOIN `keywords_requests` kr
ON (
kr.keyword_id = Keyword.id
AND status = 'success'
AND source_id = '29'
AND created > FROM_UNIXTIME(1234567890)
)
WHERE keyword_id IS NULL
UNION
SELECT Keyword.id, Keyword.keyword
FROM `keywords` as Keyword
LEFT OUTER JOIN `keywords_requests` kr
ON (
kr.keyword_id = Keyword.id
AND status = 'active'
AND source_id = '29'
AND created > FROM_UNIXTIME(1234567890)
)
WHERE keyword_id IS NULL
This ideally should use NESTED LOOPS on your index.
Create a composite index on (status, source_id, created) and make it so:
SELECT Keyword.id, Keyword.keyword
FROM `keywords` as Keyword
LEFT OUTER JOIN (
SELECT *
FROM `keywords_requests` kr
WHERE
status = 'success'
AND source_id = '29'
AND created > FROM_UNIXTIME(1234567890)
UNION ALL
SELECT *
FROM `keywords_requests` kr
WHERE
status = 'active'
AND source_id = '29'
AND created > FROM_UNIXTIME(1234567890)
) AS KeywordsRequest
ON KeywordsRequest.keyword_id = Keyword.id
WHERE keyword_id IS NULL
This will hopefully use HASH JOIN on an even more restricted hash table.
When diagnosing MySQL query performance, one of the first things you need to analyze is the report from EXPLAIN.
If you learn to read the information EXPLAIN gives you, then you can see where queries are failing to make use of indexes, or where they are causing expensive filesorts, or other performance red flags.
I notice in your query, the GROUP BY is irrelevant, since there will be only one NULL row returned from KeywordsRequest. Also the ORDER BY is irrelevant, since you're ordering by a column that will always be NULL due to your WHERE clause. If you remove these clauses, you'll probably eliminate a filesort.
Also consider rewriting the query into other forms, and measure the performance of each. For example:
SELECT k.id, k.keyword
FROM `keywords` AS k
WHERE NOT EXISTS (
SELECT * FROM `keywords_requests` AS kr
WHERE kr.keyword_id = k.id
AND kr.status IN ('success', 'active')
AND kr.source_id = '29'
AND kr.created > FROM_UNIXTIME(1234551323)
);
Other tips:
Is kr.source_id an integer? If so, compare to the integer 29 instead of the string '29'.
Are there appropriate indexes on keyword_id, status, source_id, created? Perhaps even a compound index over all four columns would be best, since MySQL will use only one index per table in a given query.
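The LEFT JOIN ... IS NULL form and the NOT EXISTS form are two spellings of the same anti-join, which the following sketch checks on a few rows. It runs on Python's sqlite3, not MySQL; `created` is stored as plain epoch seconds to avoid FROM_UNIXTIME, source_id is compared as an integer as suggested in the tips, and the index name and sample rows are made up. It confirms the result sets match, not which form MySQL runs faster.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE keywords (id INTEGER PRIMARY KEY, keyword TEXT);
    -- 'created' holds epoch seconds here (an assumption made for this sketch).
    CREATE TABLE keywords_requests (id INTEGER PRIMARY KEY, keyword_id INTEGER,
                                    status TEXT, source_id INTEGER, created INTEGER);
    -- One compound index over all four filtered columns (index name is hypothetical).
    CREATE INDEX kr_lookup ON keywords_requests (keyword_id, source_id, status, created);
    INSERT INTO keywords VALUES (1, 'a'), (2, 'b'), (3, 'c'), (4, 'd');
    INSERT INTO keywords_requests VALUES
        (1, 1, 'success', 29, 2000),  -- recent success: keyword 1 is blocked
        (2, 2, 'active',  29, 2000),  -- in progress: keyword 2 is blocked
        (3, 3, 'success', 29,  500),  -- too old to block keyword 3
        (4, 3, 'failed',  29, 2000),  -- failures never block
        (5, 4, 'success', 30, 2000);  -- different source: keyword 4 not blocked
""")
CUTOFF, SOURCE = 1000, 29

left_join_ids = [r[0] for r in conn.execute("""
    SELECT k.id
    FROM keywords AS k
    LEFT JOIN keywords_requests AS kr
           ON kr.keyword_id = k.id
          AND kr.status IN ('success', 'active')
          AND kr.source_id = ?
          AND kr.created > ?
    WHERE kr.id IS NULL
    ORDER BY k.id
""", (SOURCE, CUTOFF))]

not_exists_ids = [r[0] for r in conn.execute("""
    SELECT k.id
    FROM keywords AS k
    WHERE NOT EXISTS (
        SELECT 1 FROM keywords_requests AS kr
        WHERE kr.keyword_id = k.id
          AND kr.status IN ('success', 'active')
          AND kr.source_id = ?
          AND kr.created > ?)
    ORDER BY k.id
""", (SOURCE, CUTOFF))]

print(left_join_ids, not_exists_ids)
```

Neither form needs GROUP BY or ORDER BY on the request columns, since each keyword appears at most once in an anti-join result.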
You did a screenshot of your EXPLAIN output and posted a link in the comments. I see that the query is not using an index from Keywords, which makes sense since you're scanning every row in that table anyway. The phrase "Not exists" indicates that MySQL has optimized the LEFT OUTER JOIN a bit.
I think this should be an improvement over your original query. The GROUP BY/ORDER BY was probably causing it to save an intermediate data set as a temporary table and sort it on disk (which is very slow!). What you'd look for is "Using temporary; Using filesort" in the Extra column of the EXPLAIN output.
So you may have improved it enough already to mitigate the bottleneck for now.
I do notice that the possible keys probably indicate that you have individual indexes on four columns. You may be able to improve that by creating a compound index:
CREATE INDEX kr_cover ON keywords_requests
(keyword_id, created, source_id, status);
You can give MySQL a hint to use a specific index:
... FROM `keywords_requests` AS kr USE INDEX (kr_cover) WHERE ...
Dunno about MySQL, but in MSSQL the lines of attack I would take are:
1) Create a covering index on KeywordsRequest status, source_id and created
2) UNION the results to get around the OR on KeywordsRequest.status
3) Use NOT EXISTS instead of the outer join (and try with UNION instead of OR too)
Try this
SELECT Keyword.id, Keyword.keyword
FROM keywords as Keyword
LEFT JOIN (select * from keywords_requests where source_id = '29' and (status = 'success' OR status = 'active')
AND created > FROM_UNIXTIME(1234551323)
) as KeywordsRequest
ON (
KeywordsRequest.keyword_id = Keyword.id
)
WHERE KeywordsRequest.id IS NULL;