Performance issue on query with math calculations - mysql

This is my query, with its timing from the slow_query_log:
SELECT j.`offer_id`, o.`offer_name`, j.`success_rate`
FROM
(
SELECT
t.`offer_id`,
(
SUM(CASE WHEN `offer_id` = t.`offer_id` AND `sales_status` = 'SUCCESS' THEN 1 ELSE 0 END) / COUNT(*)
) AS `success_rate`
FROM `tblSales` AS t
WHERE DATE(t.`sales_time`) = CURDATE()
GROUP BY t.`offer_id`
ORDER BY `success_rate` DESC
) AS j
LEFT JOIN `tblOffers` AS o
ON j.`offer_id` = o.`offer_id`
LIMIT 5;
# Time: 180113 18:51:19
# User@Host: root[root] @ localhost [127.0.0.1] Id: 71
# Query_time: 10.472599 Lock_time: 0.001000 Rows_sent: 0 Rows_examined: 1156134
Here, tblOffers lists all the offers, and tblSales contains all the sales. What I am trying to find is the top-selling offers, ranked by success rate (i.e. the fraction of each offer's sales whose status is SUCCESS).
The query works and returns the output I need, but it is slow: about 10.5 seconds, examining roughly 1.16 million rows.
offer_id and sales_status are already indexed in tblSales. Do you have any suggestions for improving the inner query (where it calculates the success rate) so that it runs faster? I have been playing with the math for more than two hours but couldn't find a better way.
Btw, tblSales has lots of data. It contains those sales which are SUCCESSFUL, FAILED, PENDING, etc.
Thank you
EDIT
As requested, I am including the table design as well (only the relevant fields are shown):
tblSales
`sales_id` bigint UNSIGNED NOT NULL AUTO_INCREMENT,
`offer_id` bigint UNSIGNED NOT NULL DEFAULT '0',
`sales_time` DATETIME NOT NULL DEFAULT '0000-00-00 00:00:00',
`sales_status` ENUM('WAITING', 'SUCCESS', 'FAILED', 'CANCELLED') NOT NULL DEFAULT 'WAITING',
PRIMARY KEY (`sales_id`),
KEY (`offer_id`),
KEY (`sales_status`)
The table also has other fields holding additional information (amount, user_id, etc.) that are not relevant to my question.

There are numerous 'problems', none of which involve "math".
JOINs make things difficult. LEFT JOIN says "I don't care whether the row exists in the 'right' table." (I suspect you don't need LEFT.) But it also says "There may be multiple rows in the right table." Based on the column names, I will guess that there is only one offer_name for each offer_id. If this is correct, then here is my first recommendation. (It will convince the Optimizer that there is no issue with the JOIN.) Change from
SELECT ..., o.offer_name, ...
LEFT JOIN `tblOffers` AS o ON j.`offer_id` = o.`offer_id`
...
to
SELECT ...,
( SELECT offer_name FROM `tblOffers` WHERE offer_id = j.offer_id
) AS offer_name, ...
It also gets rid of a bug wherein you are assuming that the inner ORDER BY will be preserved for the LIMIT. This used to be the case, but in newer versions of MariaDB / MySQL, it is not. The ORDER BY in a "derived table" (your subquery) is now ignored.
2 down, a few more to go.
"Don't hide an indexed column in a function." I am referring to DATE(t.sales_time) = CURDATE(). Assuming you have no sales_time values for the 'future', then that test can be changed to t.sales_time >= CURDATE(). If you really need to restrict to just today, then do this:
AND sales_time >= CURDATE()
AND sales_time < CURDATE() + INTERVAL 1 DAY
The ORDER BY and the LIMIT should usually be put together. In your case, you may as well add the LIMIT to the "derived table", thereby leading to only 5 rows for the outer query to work with. But... There is still the question of getting them sorted correctly. So change from
SELECT ...
FROM ( SELECT ...
ORDER BY ... )
LIMIT ...
to
SELECT ...
FROM ( SELECT ...
ORDER BY ...
LIMIT 5 ) -- trim sooner
ORDER BY ... -- deal with the loss of ordering from derived table
Rolling it all together, I have
SELECT j.`offer_id`,
( SELECT offer_name
FROM `tblOffers`
WHERE offer_id = j.offer_id
) AS offer_name,
j.`success_rate`
FROM
( SELECT t.`offer_id`,
AVG(t.sales_status = 'SUCCESS') AS `success_rate`
FROM `tblSales` AS t
WHERE t.sales_time >= CURDATE()
GROUP BY t.`offer_id`
ORDER BY `success_rate` DESC
LIMIT 5
) AS j
ORDER BY `success_rate` DESC;
(I took the liberty of shortening the SUM(...) in two ways.)
Now for the indexes...
tblSales needs at least (sales_time), but let's go for a "covering" index (with sales_time specifically first):
INDEX(sales_time, sales_status, offer_id)
If tblOffers has PRIMARY KEY(offer_id), then no further index is worth adding. Else, add this covering index (in this order):
INDEX(offer_id, offer_name)
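As a rough sketch, the index advice could be applied and checked like this. The index names are made up for illustration, and the third column of the tblSales index is assumed to be offer_id, since that is the column the query groups by.

```sql
-- Covering index for the inner query; sales_time comes first because it is the range filter.
-- The index names below are hypothetical.
ALTER TABLE `tblSales`
    ADD INDEX `idx_sales_time_status_offer` (`sales_time`, `sales_status`, `offer_id`);

-- Only worth adding if offer_id is not already the PRIMARY KEY of tblOffers.
ALTER TABLE `tblOffers`
    ADD INDEX `idx_offer_id_name` (`offer_id`, `offer_name`);

-- Inspect the plan of the inner query after adding the index.
EXPLAIN
SELECT t.`offer_id`,
       AVG(t.`sales_status` = 'SUCCESS') AS `success_rate`
FROM `tblSales` AS t
WHERE t.`sales_time` >= CURDATE()
  AND t.`sales_time` <  CURDATE() + INTERVAL 1 DAY
GROUP BY t.`offer_id`
ORDER BY `success_rate` DESC
LIMIT 5;
```

If the new index is picked up, the `key` column of the EXPLAIN output should name it and `Extra` should include "Using index". A temporary table and filesort for the GROUP BY and ORDER BY are still to be expected, but over today's rows only.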
(Apologies to other Answerers; I stole some of your ideas.)

Here, tblOffers have all the OFFERS listed. And the tblSales contains all the sales. What am trying to find out is the top selling offers, based on the success rate (ie. those sales which are SUCCESS).
Approach this with a simple JOIN and GROUP BY:
SELECT s.offer_id, o.offer_name,
AVG(s.sales_status = 'SUCCESS') as success_rate
FROM tblSales s JOIN
tblOffers o
ON o.offer_id = s.offer_id
WHERE s.sales_time >= CURDATE() AND
s.sales_time < CURDATE() + INTERVAL 1 DAY
GROUP BY s.offer_id, o.offer_name
ORDER BY success_rate DESC;
Notes:
The use of date arithmetic allows the query to make use of an index on tblSales(sales_time) -- or better yet tblSales(sales_time, offer_id, sales_status).
The arithmetic for success_rate has been simplified -- although this has minimal impact on performance.
I added offer_name to the GROUP BY. If you are learning SQL, you should always have all the unaggregated keys in the GROUP BY clause.
A LEFT JOIN is only needed if you have offers in tblSales which are not in tblOffers. I am guessing you have proper foreign key relationships defined, and this is not the case.
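Before dropping the LEFT JOIN, that last guess can be checked by looking for sales whose offer_id has no match in tblOffers. This is only a sketch using the table and column names from the question; `orphan_sales` is an illustrative alias.

```sql
-- Sales rows that reference an offer missing from tblOffers.
-- An empty result means an inner JOIN lists the same offers as the LEFT JOIN did.
SELECT s.offer_id,
       COUNT(*) AS orphan_sales
FROM tblSales AS s
LEFT JOIN tblOffers AS o
       ON o.offer_id = s.offer_id
WHERE o.offer_id IS NULL
GROUP BY s.offer_id;
```

If this does return rows and those offers should still appear in the report, keep the LEFT JOIN.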

You have not provided much information (I mean the table schema), but you could try the following.
SELECT `o`.`offer_id`, `o`.`offer_name`, SUM(CASE WHEN `t`.`sales_status` = 'SUCCESS' THEN 1 ELSE 0 END) AS `success_rate`
FROM `tblOffers` `o`
INNER JOIN `tblSales` `t`
ON `o`.`offer_id` = `t`.`offer_id`
WHERE DATE(`t`.`sales_time`) = CURDATE()
GROUP BY `o`.`offer_id`
ORDER BY `success_rate` DESC
LIMIT 0,5;
You can find a sample of this query in this SQL Fiddle example

Without knowing your schema, the lowest-hanging fruit I see is this part:
WHERE DATE(t.`sales_time`) = CURDATE()
Try changing that to a range test on the bare column, from midnight of the current date up to (but not including) midnight of the next day:
WHERE t.`sales_time` >= CURDATE() AND t.`sales_time` < CURDATE() + INTERVAL 1 DAY

Related

mysql is scanning table despite index

I have the following MySQL query that I think should be faster. The table has 1 million records and the query takes 3.5 seconds.
SET @numberofdayssinceexpiration = 1;
SET @today = DATE(NOW());
SET @start_position = (@pagenumber - 1) * @pagesize;
SELECT *
FROM (SELECT ad.id,
             title,
             description,
             startson,
             expireson,
             ad.appuserid UserId,
             user.email UserName,
             ExpiredCount.totalcount
      FROM advertisement ad
      LEFT JOIN (SELECT servicetypeid,
                        COUNT(*) AS TotalCount
                 FROM advertisement
                 WHERE DATEDIFF(@today, expireson) =
                       @numberofdayssinceexpiration
                   AND sendreminderafterexpiration = 1
                 GROUP BY servicetypeid) AS ExpiredCount
             ON ExpiredCount.servicetypeid = ad.servicetypeid
      LEFT JOIN aspnetusers user
             ON user.id = ad.appuserid
      WHERE DATEDIFF(@today, expireson) = @numberofdayssinceexpiration
        AND sendreminderafterexpiration = 1
      ORDER BY ad.id) AS expiredAds
LIMIT 20 OFFSET 1;
Here's the execution plan:
Here are the indexes defined on the table:
I wonder what I am doing wrong.
Thanks for any help
First, I would like to point out some problems. Then I will get into your Question.
LIMIT 20 OFFSET 1 gives you 20 rows starting with the second row.
The lack of an ORDER BY in the outer query may lead to an unpredictable ordering. In particular, the Limit and Offset can pick whatever they want. New versions will actually throw away the ORDER BY in the subquery.
DATEDIFF, being a function, makes that part of the WHERE not 'sargable'; that is, it can't use an INDEX. The usual sargable way to compare dates is (assuming expireson is of datatype DATE):
WHERE expireson >= CURDATE() - INTERVAL 1 DAY
Please qualify each column name. With that, I may be able to advise on optimal indexes.
Please provide SHOW CREATE TABLE so that we can see what column(s) are in each index.
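For reference, DATEDIFF(@today, expireson) = N is true exactly when the date part of expireson is N days before today, so the sargable equivalent is an equality (for a DATE column) or a half-open range (for a DATETIME column). A minimal sketch, assuming @today is the current date as in the question; the index name and column order are my guess at what would suit this filter, not something taken from the question's schema.

```sql
SET @numberofdayssinceexpiration = 1;

-- If expireson is a DATE: a plain equality on the bare column.
SELECT ad.id
FROM advertisement AS ad
WHERE ad.sendreminderafterexpiration = 1
  AND ad.expireson = CURDATE() - INTERVAL @numberofdayssinceexpiration DAY;

-- If expireson is a DATETIME: a half-open range covering that whole day.
SELECT ad.id
FROM advertisement AS ad
WHERE ad.sendreminderafterexpiration = 1
  AND ad.expireson >= CURDATE() - INTERVAL @numberofdayssinceexpiration DAY
  AND ad.expireson <  CURDATE() - INTERVAL @numberofdayssinceexpiration DAY + INTERVAL 1 DAY;

-- Hypothetical composite index: equality column first, then the date column.
ALTER TABLE advertisement
    ADD INDEX idx_reminder_expireson (sendreminderafterexpiration, expireson);
```

Either form leaves expireson bare on the left-hand side, which is what lets the optimizer consider an index on it.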

Is there a way I can optimize this query to make it shorter?

I am trying to make the code shorter and simpler. The code is working. I want to take the inner queries to a CTE or temp table to make it shorter. How do I go about this?
Create OR REPLACE view piper.v_da_areas_per_site(logistics_id, streams_label, city_id, type) as
SELECT
dataset.logistics_id,
dataset.streams_label,
dataset.city_id,
FROM
(SELECT
DISTINCT r.logistics_id,
r.streams_label,
r.city_id
FROM
top.distributions r
JOIN (
SELECT
distributions. Distribution_id,
max(distributions.event_time) AS event_time
FROM
top. distributions distributions
WHERE
distributions.stream_type = 'DA'
AND distributions. distribution_space = 'DaFilterName'
GROUP BY
distributions. distribution_id
) m ON r. distribution_id = m. distribution_id
AND r.event_time = m.event_time
AND current_date >= r. distribution_start_time
AND r. distribution_end_time >= current_date
AND r.stream_type = 'DA'
AND r. distribution_space = 'DaFilterName'
AND (
r.logistics_id IN (
SELECT
DISTINCT dev_class_hub_list.class_hub
FROM
piper.dev_class_hub_list
WHERE
dev_class_hub_list.is_3p = 'N'
)
)
)dataset;
These indexes may help:
distributions: INDEX(stream_type, distribution_space,
distribution_id, event_time)
distributions: INDEX(distribution_id) -- (unless it is the PK)
dev_class_hub_list: INDEX(is_3p, class_hub)
Try changing
AND ( r.logistics_id IN (
SELECT DISTINCT dev_class_hub_list.class_hub
FROM dev_class_hub_list
WHERE dev_class_hub_list.is_3p = 'N' ) )
to
AND EXISTS ( SELECT 1
FROM dev_class_hub_list AS dchl
WHERE dchl.is_3p = 'N'
AND dchl.class_hub = r.logistics_id )
The DISTINCT may be causing an extra de-dup pass; the EXISTS is a "semi-join", so it stops when it finds the first one. The Optimizer may turn one of these into the other. Please do
EXPLAIN SELECT ...;
SHOW WARNINGS; -- to see the transformations performed
Range tests like this are notoriously difficult to optimize:
AND current_date >= r.distribution_start_time
AND r.distribution_end_time >= current_date
As for CTE -- I need to understand what the SELECT is trying to do.
As for VIEW -- Views are "syntactic sugar"; they tend to be no better than the equivalent SELECT. Will you be adding WHERE or other clauses when you SELECT from this VIEW? They may or may not be efficiently folded into the resulting Select.
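On the "shorter" part of the question: on a server that supports common table expressions (MySQL 8.0 or later), the "latest event per distribution_id" subquery can be named once and the outer `dataset` wrapper dropped. This is a sketch only, on the assumption that I have read the intent correctly. `latest` is a name I made up, the EXISTS form is the one suggested above, and the view's fourth column (type) is left out because the posted SELECT does not show where it comes from.

```sql
WITH latest AS (
    -- Most recent event per distribution for the DA / DaFilterName stream.
    SELECT distribution_id,
           MAX(event_time) AS event_time
    FROM top.distributions
    WHERE stream_type = 'DA'
      AND distribution_space = 'DaFilterName'
    GROUP BY distribution_id
)
SELECT DISTINCT
       r.logistics_id,
       r.streams_label,
       r.city_id
FROM top.distributions AS r
JOIN latest AS m
  ON r.distribution_id = m.distribution_id
 AND r.event_time      = m.event_time
WHERE r.stream_type = 'DA'
  AND r.distribution_space = 'DaFilterName'
  AND CURRENT_DATE >= r.distribution_start_time
  AND r.distribution_end_time >= CURRENT_DATE
  AND EXISTS ( SELECT 1
               FROM piper.dev_class_hub_list AS dchl
               WHERE dchl.is_3p = 'N'
                 AND dchl.class_hub = r.logistics_id );
```

Moving the non-join filters from the ON clause into WHERE does not change the result of an inner join. The CTE makes the statement easier to read but should not be expected to make it faster by itself.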

MySQL - Add flag column to identify the first payment

I want to improve my current query. I have a table called income with a sourceId varchar field. I have a single SELECT for the fields I need, but I needed to add an extra field, isFirstIncome, to indicate whether the row is the first one on which that sourceId was used. This is my current query:
SELECT DISTINCT
`income`.*,
CASE WHEN (
SELECT
`income2`.id
FROM
`income` as `income2`
WHERE
`income2`."sourceId" = `income`."sourceId"
ORDER BY
`income2`.created asc
LIMIT 1
) = `income`.id THEN true ELSE false END
as isFirstIncome
FROM
`income` as `income`
WHERE `income`.incomeType IN ('passive', 'active') AND `income`.status = 'paid'
ORDER BY `income`.created desc
LIMIT 50
The query works but slows down if I keep increasing the LIMIT or OFFSET. Any suggestions?
UPDATE 1:
Added WHERE statements used on the original query
UPDATE 2:
MYSQL version 5.7.22
You can achieve this with an ordered analytic (window) function, either ROW_NUMBER or RANK.
Note that window functions require MySQL 8.0 or later, so this will not run on 5.7.
The query below gives the desired output.
SELECT *,
CASE
WHEN Row_number()
OVER(
PARTITION BY sourceid
ORDER BY created ASC) = 1 THEN true
ELSE false
END AS isFirstIncome
FROM income
WHERE incomeType IN ('passive', 'active') AND status = 'paid'
ORDER BY created desc
DB Fiddle: See the result here
My first thought is that isFirstIncome should be an extra column in the table. It should be populated as the data is inserted.
If you don't like that, let's try to optimize the query...
Let's avoid doing the subquery more than 50 times. This requires turning the query inside-out. (It's like "explode-implode", where the query gathers lots of stuff, then sorts it and throws most of the rows away.)
To summarize:
do the least amount of effort to just identify the 50 rows.
JOIN to whatever tables are needed (including itself if appropriate); this is to get any other columns desired (including isFirstIncome).
SELECT i3.*,
( ... using i3 ... ) as isFirstIncome
FROM (
SELECT i1.id, i1.sourceId
FROM `income` AS i1
WHERE i1.incomeType IN ('passive', 'active')
AND i1.status = 'paid'
ORDER BY i1.created DESC
LIMIT 50
) AS i2
JOIN income AS i3 USING(id)
ORDER BY i3.created DESC -- yes, repeated
(I left out the computation of isFirstIncome; it is discussed in other Answers. But note that it will be executed at most 50 times.)
(The aliases -- i1, i2, i3 -- are numbered in the order they will be "used"; this is to assist in following the SQL.)
To assist in performance, add
INDEX(status, incomeType, created, id, sourceId)
It should help with my formulation, but probably not for the other versions. Your version would benefit from
INDEX(sourceId, created, id)
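For completeness, here is one way the omitted isFirstIncome expression could be filled in on MySQL 5.7, which has no window functions. It is a sketch, not necessarily what the other Answers use: a row counts as the first for its sourceId when no row with the same sourceId has an earlier created value. Breaking ties on id is my assumption; the original query leaves ties undefined.

```sql
SELECT i3.*,
       -- 1 when no earlier row exists for the same sourceId, else 0.
       NOT EXISTS (
           SELECT 1
           FROM income AS older
           WHERE older.sourceId = i3.sourceId
             AND ( older.created < i3.created
                   OR ( older.created = i3.created AND older.id < i3.id ) )
       ) AS isFirstIncome
FROM (
    -- Cheapest possible pass to pick the 50 ids.
    SELECT i1.id
    FROM income AS i1
    WHERE i1.incomeType IN ('passive', 'active')
      AND i1.status = 'paid'
    ORDER BY i1.created DESC
    LIMIT 50
) AS i2
JOIN income AS i3 USING (id)
ORDER BY i3.created DESC;
```

As in the original query, the probe looks at all rows for the sourceId, not only the paid ones. It can be answered from INDEX(sourceId, created, id) alone and runs at most 50 times.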

MYSQL Check for record existence while fetching records

I've run into some performance issues with my database structure (or, better to say, with my query).
I have the following table:
http://sqlfiddle.com/#!9/348cb
The following query fetches certain rows and, for each one, checks whether there are other records matching my conditions.
It works as expected; I'm only asking whether there is a way to improve its performance, or another way to get the same results.
As you can see, there are inner (SELECT)s that check whether any other records contain the current row's data.
SELECT (
SELECT COUNT(*) FROM log AS LIKES
WHERE L.target_account=LIKES.target_account
AND LIKES.type='like'
) as liked,
(
SELECT COUNT(*) FROM log AS COMMENTS
WHERE L.target_account=COMMENTS.target_account
AND COMMENTS.type='follow_back'
) as follow_back,
(
SELECT COUNT(*) FROM log AS FOLLOW_BACK
WHERE L.target_account=FOLLOW_BACK.target_account
AND FOLLOW_BACK.type='follow_back'
) as follow_back,
L.*
FROM `log` as L
WHERE `L`.`information` = '".$target_name."'
AND `L`.`account_id` = '".$id."'
AND `L`.`date_ts` BETWEEN CURDATE() - INTERVAL ".$limit." DAY AND CURDATE()
This query takes too much time to fetch the data.
Thanks in advance.
You may be able to rewrite the query, depending on the relationship between target account and account id.
In the meantime, you want indexes. The two you want are instagram_log(target_account, type) and instagram_log(account_id, information, date_ts):
create index idx_instagram_log_1 on instagram_log(target_account, type);
create index idx_instagram_log_2 on instagram_log(account_id, information, date_ts);
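Since the title asks about checking for existence rather than counting, another option (my suggestion, not part of the question) is to replace each COUNT(*) with EXISTS, which can stop at the first matching row. This sketch uses the `log` table name from the question (the index statements above call it instagram_log) and user variables in place of the PHP values; the variable names and sample values are placeholders.

```sql
SET @target_name = 'some_target';  -- placeholder value
SET @account_id  = 1;              -- placeholder value
SET @limit_days  = 7;              -- placeholder value

SELECT EXISTS ( SELECT 1
                FROM `log` AS likes
                WHERE likes.target_account = L.target_account
                  AND likes.type = 'like' ) AS liked,
       EXISTS ( SELECT 1
                FROM `log` AS fb
                WHERE fb.target_account = L.target_account
                  AND fb.type = 'follow_back' ) AS follow_back,
       L.*
FROM `log` AS L
WHERE L.information = @target_name
  AND L.account_id  = @account_id
  AND L.date_ts BETWEEN CURDATE() - INTERVAL @limit_days DAY AND CURDATE();
```

Each flag comes back as 1 or 0, and both probes can be resolved from the (target_account, type) index. If the actual counts are needed, keep COUNT(*). Passing the values as parameters instead of concatenating them into the SQL string also avoids SQL injection.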
SELECT SUM(LIKES) LIKES,SUM(FOLLOW_BACK) FOLLOW_BACK,SUM(COMMENTS) FROM
(
SELECT
CASE WHEN L.type='like' THEN 1 ELSE 0 END LIKES,
CASE WHEN L.type='follow_back' THEN 1 ELSE 0 END FOLLOW_BACK,
CASE WHEN L.type='comments' THEN 1 ELSE 0 END COMMENTS
FROM `log` as L
WHERE `L`.`information` = '".$target_name."'
AND `L`.`account_id` = '".$id."'
AND `L`.`date_ts` BETWEEN CURDATE() - INTERVAL ".$limit." DAY AND CURDATE()
)Z
Try the above query.

ORDER BY Causes MySQL query to become Extremely Slow

I have the following query:
SELECT *
FROM products
INNER JOIN product_meta
ON products.id = product_meta.product_id
JOIN sales_rights
ON product_meta.product_id = sales_rights.product_id
WHERE ( products.categories REGEXP '[[:<:]]5[[:>:]]' )
AND ( active = '1' )
AND ( products.show_browse = 1 )
AND ( product_meta.software_platform_mac IS NOT NULL )
AND ( sales_rights.country_id = '240'
OR sales_rights.country_id = '223' )
GROUP BY products.id
ORDER BY products.avg_rating DESC
LIMIT 0, 18;
Without the ORDER BY clause the query runs in ~90ms; with it, the query takes ~8s.
I've browsed around SO and found that the reason could be that the sort is executed before all the data is returned, and that we should instead run ORDER BY on the result set. (See this post: Slow query when using ORDER BY)
But I can't quite figure out how to actually do this.
I've browsed around SO and have found the reason for this could be
that the sort is being executed before all the data is returned, and
instead we should be running ORDER BY on the result set instead?
I find that hard to believe, but if that's indeed the issue, I think you'll need to do something like this. (Note where I put the parens.)
select * from
(
SELECT products.id, products.avg_rating
FROM products
INNER JOIN product_meta
ON products.id = product_meta.product_id
JOIN sales_rights
ON product_meta.product_id = sales_rights.product_id
WHERE ( products.categories REGEXP '[[:<:]]5[[:>:]]' )
AND ( active = '1' )
AND ( products.show_browse = 1 )
AND ( product_meta.software_platform_mac IS NOT NULL )
AND ( sales_rights.country_id = '240'
OR sales_rights.country_id = '223' )
GROUP BY products.id
) as X
ORDER BY avg_rating DESC
LIMIT 0, 18;
Also, edit your question and include a link to that advice. I think many of us would benefit from reading it.
Additional, possibly unrelated issues
Every column used in a WHERE clause should probably be indexed somehow. Multi-column indexes might perform better for this particular query.
The column products.categories seems to be storing multiple values that you filter with regular expressions. Storing multiple values in a single column is usually a bad idea.
MySQL's GROUP BY is indeterminate. A standard SQL statement using a GROUP BY might return fewer rows, and it might return them faster.
If you can, you may want to index your ID columns so that the query will run quicker. This is a DBA-level solution, rather than a SQL solution - tuning the database will help overall performance.
The issue with this particular query was that, when GROUP BY and ORDER BY are used together, MySQL is unable to use an index for the sort if the GROUP BY and ORDER BY expressions are different.
Related Reading:
http://dev.mysql.com/doc/refman/5.0/en/order-by-optimization.html
http://mysqldba.blogspot.co.uk/2008/06/how-to-pick-indexes-for-order-by-and.html
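One way to act on the GROUP BY / ORDER BY observation is to drop the GROUP BY altogether by turning the two joins into EXISTS tests, so that each product appears at most once without grouping. This is a different technique from the derived-table rewrite earlier in the thread, and it is untested against the real schema. It assumes that `active` is a column of products, that only product columns are needed in the result, and that an index on products(avg_rating) exists or is added; the index name is hypothetical.

```sql
-- Hypothetical index so the sort can be read straight from the index.
ALTER TABLE products ADD INDEX idx_avg_rating (avg_rating);

SELECT p.*
FROM products AS p
WHERE p.categories REGEXP '[[:<:]]5[[:>:]]'
  AND p.active = '1'
  AND p.show_browse = 1
  AND EXISTS ( SELECT 1
               FROM product_meta AS pm
               WHERE pm.product_id = p.id
                 AND pm.software_platform_mac IS NOT NULL )
  AND EXISTS ( SELECT 1
               FROM sales_rights AS sr
               WHERE sr.product_id = p.id
                 AND sr.country_id IN ('240', '223') )
ORDER BY p.avg_rating DESC
LIMIT 0, 18;
```

With no GROUP BY in the way, the optimizer is free to walk the avg_rating index in descending order and stop after 18 qualifying rows. Whether it actually does so is worth confirming with EXPLAIN.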