Understanding how to design MySQL indexes for good performance - mysql

Through trial and error, I've arrived at a good index for this query, but I'd really like to understand why this and only this index helps, and how to avoid having to repeat the trial and error next time.
The InnoDB table structure for a log table is:
This is my query—it's looking for all users who have one kind of action in the log, but not another kind of action. It's also restricting to certain values of org and a certain date range.
SELECT DISTINCT USER AS 'Dormant Users'
FROM db.log
WHERE `action` = #a1
AND `org` = #orgid
AND `logdate` >= #startdate
AND USER NOT IN (SELECT DISTINCT USER
FROM db.log
WHERE `action` = #a2
AND `org` = #orgid
AND `logdate` >= #startdate)
;
With no indexes, this takes about 21 seconds, and EXPLAIN shows this:
So, I thought having an index on org, logdate, and action might help. And it does—if I create an index on those columns in that precise order, the query time is reduced to about 0.3s, and the EXPLAIN output is now:
But, if I change the order of the columns within the index, or even just add another, unrelated index (say on the user column), the query takes about 2 seconds.
So, how can I understand and even design the index to perform well based on that query, and avoid the rather degenerate case of adding another index and harming performance? Or is it just a case of test and see what works?

My answer is not a direct answer, because it is not about how to design the index but about how to write the query so that it is more efficient.
Avoid using NOT IN if the subquery does not return a small table:
SELECT DISTINCT l1.USER AS 'Dormant Users'
FROM db.log l1
WHERE `action` = #a1
AND `org` = #orgid
AND `logdate` >= #startdate
AND NOT EXISTS (SELECT 1
FROM db.log l2
WHERE l1.`user` = l2.`user`
AND l1.`org` = l2.`org`
AND l2.`action` = #a2
AND l2.`logdate` >= #startdate)
;
EDIT: I removed the explanation link as it was not what I thought. I am a developer rather than a DBA, but having optimized a lot of queries, I have always had better results with NOT EXISTS than with NOT IN once volumes get high. I cannot argue the internal reason, though (and I suspect it depends on the RDBMS).
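As a rough illustration (an assumption on my part rather than something this answer specifies), the correlated NOT EXISTS probe pairs well with a composite index whose equality columns come first and whose range column (logdate) comes last, for example:
CREATE INDEX idx_log_user_org_action_logdate ON db.log (`user`, `org`, `action`, `logdate`);
The exact name and column order are illustrative only; verify with EXPLAIN against your data.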

...or with an outer join...
SELECT DISTINCT user
FROM log x
LEFT
JOIN log y
ON y.user = x.user
AND y.org = x.org
AND y.action = #a2
AND y.logdate >= #startdate
WHERE x.action = #a1
AND x.org = #orgid
AND x.logdate >= #startdate
AND y.user IS NULL;
I'm not too hot on indexing, but I'd start with (org, action, logdate)
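In DDL form, that starting point would look roughly like this (the index name is just illustrative):
ALTER TABLE db.log ADD INDEX idx_org_action_logdate (org, `action`, logdate);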

Related

Performance issue on query with math calculations

This is my query with its performance (from the slow_query_log):
SELECT j.`offer_id`, o.`offer_name`, j.`success_rate`
FROM
(
SELECT
t.`offer_id`,
(
SUM(CASE WHEN `offer_id` = t.`offer_id` AND `sales_status` = 'SUCCESS' THEN 1 ELSE 0 END) / COUNT(*)
) AS `success_rate`
FROM `tblSales` AS t
WHERE DATE(t.`sales_time`) = CURDATE()
GROUP BY t.`offer_id`
ORDER BY `success_rate` DESC
) AS j
LEFT JOIN `tblOffers` AS o
ON j.`offer_id` = o.`offer_id`
LIMIT 5;
# Time: 180113 18:51:19
# User@Host: root[root] @ localhost [127.0.0.1] Id: 71
# Query_time: 10.472599 Lock_time: 0.001000 Rows_sent: 0 Rows_examined: 1156134
Here, tblOffers has all the offers listed, and tblSales contains all the sales. What I am trying to find is the top-selling offers, based on the success rate (i.e. those sales which are SUCCESS).
The query works fine and provides the output I need, but it is a bit slow.
offer_id and sales_status are already indexed in tblSales. Do you have any suggestions for improving the inner query (where the success rate is calculated) so that performance improves? I have been playing with the math for more than two hours but couldn't find a better way.
Btw, tblSales has lots of data. It contains those sales which are SUCCESSFUL, FAILED, PENDING, etc.
Thank you
EDIT
As you requested, I am including the table design as well (only relevant fields are included):
tblSales
`sales_id` bigint UNSIGNED NOT NULL AUTO_INCREMENT,
`offer_id` bigint UNSIGNED NOT NULL DEFAULT '0',
`sales_time` DATETIME NOT NULL DEFAULT '0000-00-00 00:00:00',
`sales_status` ENUM('WAITING', 'SUCCESS', 'FAILED', 'CANCELLED') NOT NULL DEFAULT 'WAITING',
PRIMARY KEY (`sales_id`),
KEY (`offer_id`),
KEY (`sales_status`)
There are some other fields also in this table, that holds some other info. Amount, user_id, etc. which are not relevant for my question.
Numerous 'problems', none of which involve "math".
JOINs make things difficult. LEFT JOIN says "I don't care whether the row exists in the 'right' table" (I suspect you don't need LEFT?). But it also says "there may be multiple rows in the right table". Based on the column names, I will guess that there is only one offer_name for each offer_id. If this is correct, then here is my first recommendation. (This will convince the Optimizer that there is no issue with the JOIN.) Change from
SELECT ..., o.offer_name, ...
LEFT JOIN `tblOffers` AS o ON j.`offer_id` = o.`offer_id`
...
to
SELECT ...,
( SELECT offer_name FROM tbloffers WHERE offer_id = j.offer_id
) AS offer_name, ...
It also gets rid of a bug wherein you are assuming that the inner ORDER BY will be preserved for the LIMIT. This used to be the case, but in newer versions of MariaDB / MySQL, it is not. The ORDER BY in a "derived table" (your subquery) is now ignored.
2 down, a few more to go.
"Don't hide an indexed column in a function." I am referring to DATE(t.sales_time) = CURDATE(). Assuming you have no sales_time values for the 'future', then that test can be changed to t.sales_time >= CURDATE(). If you really need to restrict to just today, then do this:
AND sales_time >= CURDATE()
AND sales_time < CURDATE() + INTERVAL 1 DAY
The ORDER BY and the LIMIT should usually be put together. In your case, you may as well add the LIMIT to the "derived table", thereby leading to only 5 rows for the outer query to work with. But... There is still the question of getting them sorted correctly. So change from
SELECT ...
FROM ( SELECT ...
ORDER BY ... )
LIMIT ...
to
SELECT ...
FROM ( SELECT ...
ORDER BY ...
LIMIT 5 ) -- trim sooner
ORDER BY ... -- deal with the loss of ordering from derived table
Rolling it all together, I have
SELECT j.`offer_id`,
( SELECT offer_name
FROM tbloffers
WHERE offer_id = j.offer_id
) AS offer_name,
j.`success_rate`
FROM
( SELECT t.`offer_id`,
AVG(t.sales_status = 'SUCCESS') AS `success_rate`
FROM `tblSales` AS t
WHERE t.sales_time >= CURDATE()
GROUP BY t.`offer_id`
ORDER BY `success_rate` DESC
LIMIT 5
) AS j
ORDER BY `success_rate` DESC;
(I took the liberty of shortening the SUM(...) in two ways.)
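One of those shortenings relies on the fact that in MySQL a boolean expression evaluates to 1 or 0 (and sales_status is NOT NULL here), so these two expressions compute the same value:
SUM(CASE WHEN sales_status = 'SUCCESS' THEN 1 ELSE 0 END) / COUNT(*)   -- original form
AVG(sales_status = 'SUCCESS')                                          -- shortened form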
Now for the indexes...
tblSales needs at least (sales_time), but let's go for a "covering" (with sales_time specifically first):
INDEX(sales_time, sales_status, offer_id)
If tbloffers has PRIMARY KEY(offer_id), then no further index is worth adding. Else, add this covering index (in this order):
INDEX(offer_id, offer_name)
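Expressed as DDL, those suggestions might look like this (the index names are made up, and the second statement is only needed if tbloffers lacks PRIMARY KEY(offer_id)):
ALTER TABLE tblSales ADD INDEX idx_sales_covering (sales_time, sales_status, offer_id);
ALTER TABLE tbloffers ADD INDEX idx_offers_covering (offer_id, offer_name);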
(Apologies to other Answerers; I stole some of your ideas.)
Here, tblOffers has all the offers listed, and tblSales contains all the sales. What I am trying to find is the top-selling offers, based on the success rate (i.e. those sales which are SUCCESS).
Approach this with a simple JOIN and GROUP BY:
SELECT s.offer_id, o.offer_name,
AVG(s.sales_status = 'SUCCESS') as success_rate
FROM tblSales s JOIN
tblOffers o
ON o.offer_id = s.offer_id
WHERE s.sales_time >= CURDATE() AND
s.sales_time < CURDATE() + INTERVAL 1 DAY
GROUP BY s.offer_id, o.offer_name
ORDER BY success_rate DESC;
Notes:
The use of date arithmetic allows the query to make use of an index on tblSales(sales_time) -- or better yet tblSales(sales_time, offer_id, sales_status).
The arithmetic for success_rate has been simplified -- although this has minimal impact on performance.
I added offer_name to the GROUP BY. If you are learning SQL, you should always have all the unaggregated keys in the GROUP BY clause.
A LEFT JOIN is only needed if you have offers in tblSales which are not in tblOffers. I am guessing you have proper foreign key relationships defined, and this is not the case.
Based on the limited information you have provided (I mean the table schema), you could try the following.
SELECT `o`.`offer_id`, `o`.`offer_name`, SUM(CASE WHEN `t`.`sales_status` = 'SUCCESS' THEN 1 ELSE 0 END) AS `success_rate`
FROM `tblOffers` `o`
INNER JOIN `tblSales` `t`
ON `o`.`offer_id` = `t`.`offer_id`
WHERE DATE(`t`.`sales_time`) = CURDATE()
GROUP BY `o`.`offer_id`
ORDER BY `success_rate` DESC
LIMIT 0,5;
You can find a sample of this query in this SQL Fiddle example
Without knowing your schema, the lowest hanging fruit I see is this part....
WHERE DATE(t.`sales_time`) = CURDATE()
Try changing that to something that looks like
Where t.sales_time >= #12-midnight-of-current-date and t.sales_time <= #23:59:59-of-current-date

Query takes more than 40 seconds to execute

This query takes more than 40 seconds to execute on a table that has 200k rows
SELECT
my_robots.*,
(
SELECT count(id)
FROM hpsi_trading
WHERE estado <= 1 and idRobot = my_robots.id
) as openorders,
apikeys.apikey,
apikeys.apisecret
FROM my_robots, apikeys
WHERE estado <= 1
and idRobot = '2'
and ready = '1'
and apikeys.id = my_robots.idApiKey
and (my_robots.id LIKE '%0'
OR my_robots.id LIKE '%1'
OR my_robots.id LIKE '%2')
I know it is because of the count inside the query, but how could I fix this efficiently?
Edit: Explain
Thanks.
Use GROUP BY instead
SELECT my_robots.*,
count(hpsi_trading.id) as openorders,
apikeys.apikey,
apikeys.apisecret
FROM my_robots
JOIN apikeys ON apikeys.id = my_robots.idApiKey
LEFT JOIN hpsi_trading ON hpsi_trading.idRobot = my_robots.id and estado <= 1
WHERE estado <= 1 and
idRobot = '2' and
ready = '1' and
(
my_robots.id LIKE '%0' OR
my_robots.id LIKE '%1' OR
my_robots.id LIKE '%2'
)
GROUP BY my_robots.id, apikeys.apikey, apikeys.apisecret
Use explicit JOIN syntax. Some indexes will be needed to make it fast; however, the database structure is not clear from your post (or from your query).
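As an illustration only (the schema is not shown, so which tables own estado, idRobot and ready is an assumption), indexes along these lines would support the join and the filter on hpsi_trading:
ALTER TABLE hpsi_trading ADD INDEX idx_trading_robot_estado (idRobot, estado);
ALTER TABLE my_robots ADD INDEX idx_robots_apikey (idApiKey);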
The explain plan shows that the largest pain is selecting the data from the table hpsi_trading.
The challenge from the database's point of view is that the query contains a correlated subquery in the SELECT clause, which needs to be executed once for each result of the outer query (after filtering).
Replacing this subquery with a JOIN + GROUP BY will require MySQL to join between all these records (inflate) and only then deflate the data using GROUP BY, which might take time.
Instead, I would extract the subquery to a temporary table, which is grouped during creation, index it and join to it. That way, the subquery will run once, using a quick covering index, it will already group the data and only then join it to the other table.
So far, it's all pros. The con is that extracting a subquery into a temporary table may require more effort on the development side.
Please try this version and let me know if it helped (if not, please provide a fresh EXPLAIN plan screenshot):
Creating the temp table:
CREATE TEMPORARY TABLE IF NOT EXISTS temp1 AS
SELECT idRobot, COUNT(id) as openorders
FROM hpsi_trading
WHERE estado <= 1
GROUP BY idRobot;
The modified query:
SELECT
my_robots.*,
temp1.openorders,
apikeys.apikey,
apikeys.apisecret
FROM
my_robots
JOIN apikeys ON apikeys.id = my_robots.idApiKey
LEFT JOIN temp1 ON temp1.idRobot = my_robots.id
WHERE
estado <= 1 AND idRobot = '2'
AND ready = '1'
AND (my_robots.id LIKE '%0'
OR my_robots.id LIKE '%1'
OR my_robots.id LIKE '%2')
The indexes to add for this solution (I assumed from logic that estado, idRobot and ready are from the apikeys table. If that's not the case, let me know and I'll adjust the indexes):
ALTER TABLE `temp1` ADD INDEX `temp1_index_1` (idRobot);
ALTER TABLE `hpsi_trading` ADD INDEX `hpsi_trading_index_1` (idRobot, estado, id);
ALTER TABLE `apikeys` ADD INDEX `apikeys_index_1` (`idRobot`, `ready`, `id`, `estado`);
ALTER TABLE `my_robots` ADD INDEX `my_robots_index_1` (`idApiKey`);

MYSQL query reducing the inner query in conditions

I am running into a small problem,
This is a demo query
select
A.order_id,
if(
A.production_date != '0000-00-00',
A.production_date,
if(
SOME INNER QUERY != '0000-00-00',
SOME INNER QUERY ,
SOME OTHER INNER QUERY
)
) as production_start_date
from
orders A
So basically, suppose SOME INNER QUERY takes 10 seconds to do its calculations, fetching data from 8 different tables, checking past history for the same order type, etc. If its result is a date, I use that date via the first condition. But then the whole thing takes 20 seconds: 10 seconds to evaluate the query in the IF condition, and another 10 seconds to re-execute it to return the result.
Is there any way I can reduce this?
if any one is interested in looking at actual query http://pastebin.com/zqzbpEei
Assuming your query looks like this (sorry, I gave up trying to locate the actual query):
IF(
(SELECT aField FROM aTable WHERE bigCondition) != '0000-00-00',
(SELECT aField FROM aTable WHERE bigCondition),
(SELECT anotherField FROM anotherTable)
)
You can rewrite it as follows:
SELECT IF (
aField != '0000-00-00',
aField,
(SELECT anotherField FROM anotherTable)
)
FROM aTable WHERE bigCondition
This way you compute bigCondition only once.
This query is quite ugly indeed.
Your major problem seems to be the misuse (and abuse, big time) of the IF() construct. It should be reserved for simple conditions and operations. The same applies to logical operators. Do not operate on entire queries. For instance, this one bit appears a few times in your query:
IF(
(SELECT v1.weekends FROM vendor v1 WHERE v1.vendor_id = A.vendor_id) IS NULL
OR (SELECT v1.weekends FROM vendor v1 WHERE v1.vendor_id = A.vendor_id) = '',
'6', -- by the way, why is this a string?! This is an integer, isn't it?
(SELECT v1.weekends FROM vendor v1 WHERE v1.vendor_id = A.vendor_id)
)
This is Bad. The condition should be moved into the SELECT directly. Rewrite it as below:
SELECT
IF (v1.weekends IS NULL OR v1.weekends = '', 6, v1.weekends)
FROM vendor v1 WHERE v1.vendor_id = A.vendor_id
That's two SELECTs saved. Do this for every IF() that contains a query, and I am ready to bet you are going to speed up your query by several orders of magnitude.
There is a lot more to say about your current code. Unfortunately, you will probably need to refactor some parts of your ORM. Add new, more specialised methods to some classes, and make them use new queries that you crafted manually. Then refactor your current operation so that it uses these new methods.

Where to use ROWLOCK, READPAST with CTE, Subquery and Update?

In trying to avoid deadlocks and synchronize requests from multiple services, I'm using ROWLOCK, READPAST. My question is where should I put it in a query that includes a CTE, a subquery and an update statement on the CTE? Is there one key spot or should all three places have it (below)? Or maybe there's a better way to write such a query so that I can select ONLY the rows that will be updated.
alter proc dbo.Notification_DequeueJob
@jobs int = null
as
set nocount on;
set xact_abort on;
declare @now datetime
set @now = getdate();
if(@jobs is null or @jobs <= 0) set @jobs = 1
;with q as (
select
*,
dense_rank() over (order by MinDate, Destination) as dr
from
(
select *,
min(CreatedDt) over (partition by Destination) as MinDate
from dbo.NotificationJob with (rowlock, readpast)
) nj
where (nj.QueuedDt is null or (DATEDIFF(MINUTE, nj.QueuedDt, @now) > 5 and nj.CompletedDt is null))
and (nj.RetryDt is null or nj.RetryDt < @now)
and not exists(
select * from dbo.NotificationJob
where Destination = nj.Destination
and nj.QueuedDt is not null and DATEDIFF(MINUTE, nj.QueuedDt, @now) < 6 and nj.CompletedDt is null)
)
update t
set t.QueuedDt = @now,
t.RetryDt = null
output
inserted.NotificationJobId,
inserted.Categories,
inserted.Source,
inserted.Destination,
inserted.Subject,
inserted.Message
from q as t
where t.dr <= @jobs
go
I don't have an answer off-hand, but there are ways you can learn more.
The code you wrote seems reasonable. Examining the actual query plan for the proc might help verify that SQL Server can generate a reasonable query plan, too.
If you don't have an index on NotificationJob.Destination that includes QueuedDt and CompletedDt, the not exists sub-query might acquire shared locks on the entire table. That would be scary for concurrency.
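If such an index is missing, it might look roughly like this (the index name is illustrative; adjust the INCLUDE list to your actual predicates):
CREATE NONCLUSTERED INDEX IX_NotificationJob_Destination
ON dbo.NotificationJob (Destination)
INCLUDE (QueuedDt, CompletedDt);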
You can observe how the proc behaves when it acquires locks. One way is to turn on trace flag 1200 temporarily, call your proc, and then turn off the flag. This will generate a lot of information about what locks the proc is acquiring. The amount of info will severely affect performance, so don't use this flag in a production system.
dbcc traceon (1200, -1) -- print detailed information for every lock request. DO NOT DO THIS ON A PRODUCTION SYSTEM!
exec dbo.Notification_DequeueJob
dbcc traceoff (1200, -1) -- turn off the trace flag ASAP

More efficient way to write multiple UPDATE queries

Is there a better / more efficient / shorter way to write this SQL Query:
UPDATE mTable SET score = 0.2537 WHERE user = 'Xthane' AND groupId = 37;
UPDATE mTable SET score = 0.2349 WHERE user = 'Mike' AND groupId = 37;
UPDATE mTable SET score = 0.2761 WHERE user = 'Jack' AND groupId = 37;
UPDATE mTable SET score = 0.2655 WHERE user = 'Isotope' AND groupId = 37;
UPDATE mTable SET score = 0.3235 WHERE user = 'Caesar' AND groupId = 37;
UPDATE mTable
SET score =
case user
when 'Xthane' then 0.2537
when 'Mike' then 0.2349
when 'Jack' then 0.2761
when 'Isotope' then 0.2655
when 'Caesar' then 0.3235
else score
end
where groupId = 37
You can use a CASE statement to perform this type of UPDATE.
UPDATE mTable
SET score
= CASE user
WHEN 'Xthane' THEN 0.2537
WHEN 'Mike' THEN 0.2349
WHEN 'Jack' THEN 0.2761
WHEN 'Isotope' THEN 0.2655
WHEN 'Caesar' THEN 0.3235
ELSE score
END
WHERE groupId = 37
You could create a temporary table, insert score, user and groupid for all the records you want to update then do something like this:
UPDATE mTable m
INNER JOIN tmpTable t
ON m.groupId = t.groupId
AND m.user = t.user
SET m.score = t.score;
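For completeness, a minimal sketch of that temporary-table step, assuming MySQL and the values from the question (the table name and column types are assumptions):
CREATE TEMPORARY TABLE tmpTable (
  `user` VARCHAR(64) NOT NULL,
  groupId INT NOT NULL,
  score DECIMAL(6,4) NOT NULL,
  PRIMARY KEY (groupId, `user`)
);
INSERT INTO tmpTable (`user`, groupId, score) VALUES
  ('Xthane', 37, 0.2537),
  ('Mike', 37, 0.2349),
  ('Jack', 37, 0.2761),
  ('Isotope', 37, 0.2655),
  ('Caesar', 37, 0.3235);
After that, the UPDATE ... JOIN above applies all five scores in a single statement.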
Your original statements look short enough, and are easy enough to understand, and you can determine whether there were any rows affected on each of those separate UPDATE statements.
For a large number of statements, however, there's a considerable amount of overhead making "roundtrips" to the database to execute each individual statement. You can get much faster execution (shorter elapsed time) for a large set of updates by "batching" the updates together in a single statement execution.
So, it depends on what you are trying to achieve.
Better? Depends on how you define that. (Should the statements be more understandable, easier to debug, less resource intensive?)
More efficient? In terms of reduced elapsed time, yes, there are other ways to accomplish these same updates, but the statements are not as easy to understand as yours.
Shorter? In terms of SQL statements with fewer characters, yes, there are ways to achieve that. (Some examples are shown in other answers, but note that the effects of the statements in some of those answers is significantly DIFFERENT than your statements.)
The actual performance of those alternatives is really going to depend on the number of rows, and available indexes. (e.g. if you have hundreds of thousands of rows with groupId = 37, but are only updating 5 of those rows).