MySQL query too slow - mysql

I'm trying to write a query to get some trend stats, but it benchmarks really slowly. The query execution time is around 134 seconds.
I have a MySQL table called table_1.
Below is the CREATE statement:
CREATE TABLE `table_1` (
`id` bigint(11) NOT NULL AUTO_INCREMENT,
`original_id` bigint(11) DEFAULT NULL,
`invoice_num` bigint(11) DEFAULT NULL,
`registration` timestamp NULL DEFAULT NULL,
`paid_amount` decimal(10,6) DEFAULT NULL,
`cost_amount` decimal(10,6) DEFAULT NULL,
`profit_amount` decimal(10,6) DEFAULT NULL,
`net_amount` decimal(10,6) DEFAULT NULL,
`customer_id` bigint(11) DEFAULT NULL,
`recipient_id` text,
`cashier_name` text,
`sales_type` text,
`sales_status` text,
`sales_location` text,
`invoice_duration` text,
`store_id` double DEFAULT NULL,
`is_cash` int(11) DEFAULT NULL,
`is_card` int(11) DEFAULT NULL,
`brandid` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `idx_registration_compound` (`id`,`registration`)
) ENGINE=InnoDB AUTO_INCREMENT=47420958 DEFAULT CHARSET=latin1;
I have created a compound index made of id + registration.
Below is the query:
SELECT
store_id,
CONCAT('[',GROUP_CONCAT(tot SEPARATOR ','),']') timeline_transactions,
SUM(tot) AS total_transactions,
CONCAT('[',GROUP_CONCAT(totalRevenues SEPARATOR ','),']') timeline_revenues,
SUM(totalRevenues) AS revenues,
CONCAT('[',GROUP_CONCAT(totalProfit SEPARATOR ','),']') timeline_profit,
SUM(totalProfit) AS profit,
CONCAT('[',GROUP_CONCAT(totalCost SEPARATOR ','),']') timeline_costs,
SUM(totalCost) AS costs
FROM (select t1.md,
COALESCE(SUM(t1.amount+t2.revenues), 0) AS totalRevenues,
COALESCE(SUM(t1.amount+t2.profit), 0) AS totalProfit,
COALESCE(SUM(t1.amount+t2.costs), 0) AS totalCost,
COALESCE(SUM(t1.amount+t2.tot), 0) AS tot,
t1.store_id
from
(
SELECT a.store_id,b.md,b.amount from ( SELECT DISTINCT store_id FROM table_1) AS a
CROSS JOIN
(
SELECT
DATE_FORMAT(a.DATE, "%m") as md,
'0' as amount
from (
select curdate() - INTERVAL (a.a + (10 * b.a) + (100 * c.a)) month as Date
from (select 0 as a union all select 1 union all select 2 union all select 3 union all select 4 union all select 5 union all select 6 union all select 7 union all select 8 union all select 9) as a
cross join (select 0 as a union all select 1 union all select 2 union all select 3 union all select 4 union all select 5 union all select 6 union all select 7 union all select 8 union all select 9) as b
cross join (select 0 as a union all select 1 union all select 2 union all select 3 union all select 4 union all select 5 union all select 6 union all select 7 union all select 8 union all select 9) as c
) a
where a.Date >='2019-01-01' and a.Date <= '2019-01-14'
group by md) AS b
)t1
left join
(
SELECT
COUNT(epl.invoice_num) AS tot,
SUM(paid_amount) AS revenues,
SUM(profit_amount) AS profit,
SUM(cost_amount) AS costs,
store_id,
date_format(epl.registration, '%m') md
FROM table_1 epl
GROUP BY store_id, date_format(epl.registration, '%m')
)t2
ON t2.md=t1.md AND t2.store_id=t1.store_id
group BY t1.md, t1.store_id) AS t3 GROUP BY store_id ORDER BY total_transactions desc
Below is the EXPLAIN (output not included here).
Maybe I should change the registration column from TIMESTAMP to DATETIME?

About 90% of your execution time will be used to execute GROUP BY store_id, date_format(epl.registration, '%m').
Unfortunately, you cannot use an index to group by a derived value, and since this is vital to your report, you need to precalculate this. You can do this by adding that value to your table, e.g. using a generated column:
alter table table_1 add md varchar(2) as (date_format(registration, '%m')) stored
I kept the varchar format you used for the month here; you could also use a number (e.g. tinyint) for the month.
This requires MySQL 5.7; otherwise, you can use triggers to achieve the same thing:
alter table table_1 add md varchar(2) null;
create trigger tri_table_1 before insert on table_1
for each row set new.md = date_format(new.registration,'%m');
create trigger tru_table_1 before update on table_1
for each row set new.md = date_format(new.registration,'%m');
Then add an index, preferably a covering index, starting with store_id and md, e.g.
create index idx_table_1_storeid_md on table_1
(store_id, md, invoice_num, paid_amount, profit_amount, cost_amount)
If you have other, similar reports, you may want to check if they use additional columns and could profit from covering more columns. The index will require about 1.5GB of storage space (and how long it takes your drive to read 1.5GB will basically single-handedly define your execution time, short of caching).
Then change your query to group by this new indexed column, e.g.
...
SUM(cost_amount) AS costs,
store_id,
md -- instead of date_format(epl.registration, '%m') md
FROM table_1 epl
GROUP BY store_id, md -- instead of date_format(epl.registration, '%m')
)t2 ...
This index will also take care of another 9% of your execution time, SELECT DISTINCT store_id FROM table_1, which will profit from an index starting with store_id.
Now that 99% of your query is taken care of, some further remarks:
the subquery b and your date range where a.Date >='2019-01-01' and a.Date <= '2019-01-14' might not do what you think they do. You should run the part SELECT DATE_FORMAT(a.DATE, "%m") as md, ... group by md separately to see what it does. In its current state, it will give you one row with the tuple '01', 0, representing "January", so it is basically a complicated way of doing select '01', 0. Unless today is the 15th or later, in which case it returns nothing (which is probably unintended).
Particularly, it will not limit the invoice dates to that specific range, but to all invoices from (the whole) January of any year. If that is what you intended, you should (additionally) add that filter directly, e.g. by using FROM table_1 epl where epl.md = '01' GROUP BY ..., reducing your execution time by an additional factor of about 12. So (apart from the 15th-and-up problem), with your current range you should get the same result if you use
...
SUM(cost_amount) AS costs,
store_id,
md
FROM table_1 epl
WHERE md = '01'
GROUP BY store_id, md
)t2 ...
For different date ranges you will have to adjust that term. And to emphasize my point, this is significantly different from filtering invoices by their date, e.g.
...
SUM(cost_amount) AS costs,
store_id,
md
FROM table_1 epl
WHERE epl.registration >='2019-01-01'
and epl.registration <= '2019-01-14'
GROUP BY store_id, md
)t2 ...
which you may (or may not) have tried to do. You would need a different index in that case though (and it would be a slightly different question).
there might be some additional optimizations, simplifications or beautifications in the rest of your query, e.g. group BY t1.md, t1.store_id looks redundant and/or wrong (indicating you are actually not on MySQL 5.7), and the b-subquery can only give you the values '01' to '12', so generating 1000 dates and reducing them again could be simplified. But since they are operating on 100-ish rows, they will not affect execution time significantly, and I haven't checked those in detail. Some of it is probably due to getting the right output format or to generalizations (although, if you are dynamically grouping by formats other than by month, you need other indexes/columns, but that would be a different question).
An alternative way to precalculate your values would be a summary table where you e.g. run your inner query (the expensive group by) once a day and store the result in a table and then reuse it (by selecting from this table instead of doing the group by). This is especially viable for data like invoices that never change (although otherwise you can use triggers to keep the summary tables up to date). It also becomes more viable if you have several scenarios, e.g. if your user can decide to group by weekday, year, month or zodiac sign, since otherwise you would need to add an index for each of those. It becomes less viable if you need to dynamically limit your invoice range (to e.g. 2019-01-01 ... 2019-01-14). If you need to include the current day in your report, you can still precalculate and then add the values for the current date from the table (which should only involve a very limited number of rows, which is fast if you have an index starting with your date column), or use triggers to update your summary table on-the-fly.
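A minimal sketch of that summary-table approach for the monthly grouping (the table and column names here are mine, and it assumes store_id is never null):
CREATE TABLE table_1_month_summary (
store_id double NOT NULL,
md varchar(2) NOT NULL,
tot bigint,
revenues decimal(16,6),
profit decimal(16,6),
costs decimal(16,6),
PRIMARY KEY (store_id, md)
);
-- refresh once a day, e.g. from a cron job or a MySQL EVENT
REPLACE INTO table_1_month_summary
SELECT store_id, date_format(registration, '%m'),
COUNT(*), SUM(paid_amount), SUM(profit_amount), SUM(cost_amount)
FROM table_1
GROUP BY store_id, date_format(registration, '%m');
Your outer query can then select from table_1_month_summary instead of doing the expensive GROUP BY.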

With PRIMARY KEY(id), having INDEX(id, anything) is virtually useless.
See if you can avoid nesting subqueries.
Consider building that 'date' table permanently and have a PRIMARY KEY(md) on it; see the sketch after these notes. Currently, neither subquery has an index on the join column (md).
You may have the "explode-implode" syndrome. This is where JOINs expand the number of rows, only to have the GROUP BY collapse them.
Don't use COUNT(xx) unless you need to check xx for being NULL. Simply do COUNT(*).
store_id double -- Really?
TIMESTAMP vs DATETIME -- they perform about the same; don't bother changing it.
Since you are only looking at 2019-01, get rid of
date_format(epl.registration, '%m')
That, alone, may speed it up a lot. (However, you lose generality.)
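As a concrete sketch of the permanent 'date' table suggested above (the name months is my choice; a tinyint month column would work just as well):
CREATE TABLE months (
md varchar(2) NOT NULL,
PRIMARY KEY (md)
);
INSERT INTO months (md)
VALUES ('01'),('02'),('03'),('04'),('05'),('06'),('07'),('08'),('09'),('10'),('11'),('12');
Joining against this table then gives the join column (md) an index.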

Related

How can I find the next row in a MySQL query or stop a query when it reaches a certain ID?

I am making a MySQL query of a table with thousands of records. What I'm really trying to do is find the next and previous rows that surround a particular ID. The issue is that when sorting the table in a specific way, there is no correlation between IDs (I can't just search for id > $current_id LIMIT 1, for example, because the needed ID in the next row might or might not actually be higher). Here is an example:
ID Name Date
4 Fred 1999-01-04
6 Bill 2002-04-02
7 John 2002-04-02
3 Sara 2002-04-02
24 Beth 2007-09-18
1 Dawn 2007-09-18
Say I know I want the records that come directly before and after John (ID = 7). In this case, the ID of the record after that row is actually a lower number. The table is sorted by date first and then by name, but there are many entries with the same date - so I can't just look for the next date, either. What is the best approach to find, in this case, the row before and (separately) the row after ID 7?
Thank you for any help.
As others have suggested you can use window functions for this, but I would use LEAD() and LAG() instead of ROW_NUMBER().
SELECT *
FROM (
SELECT
*,
LAG(ID) OVER (ORDER BY `Date` ASC, `Name` ASC) `prev`,
LEAD(ID) OVER (ORDER BY `Date` ASC, `Name` ASC) `next`
FROM `tbl`
) t
WHERE `ID` = 7;
With thousands of records (very small) this should be very fast but if you expect it to grow to hundreds of thousands, or even millions of rows you should try to limit the amount of work being done in the inner query. Sorting millions of rows and assigning prev and next values to all of them, just to use one row would be excessive.
Assuming your example of John (ID = 7), you could use the Date to constrain the inner query. If the adjacent records will always be within one month, you could do something like -
SELECT *
FROM (
SELECT
*,
LAG(ID) OVER (ORDER BY `Date` ASC, `Name` ASC) `prev`,
LEAD(ID) OVER (ORDER BY `Date` ASC, `Name` ASC) `next`
FROM `tbl`
WHERE `Date` BETWEEN '2002-04-02' - INTERVAL 1 MONTH AND '2002-04-02' + INTERVAL 1 MONTH
) t
WHERE `ID` = 7;
Without knowing more detail about the distribution of your data, I am only guessing but hopefully you get the idea.
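Note that these queries only return the IDs of the neighbours in prev and next; if you need the neighbouring rows themselves, one option (just a sketch) is to join back on those IDs:
SELECT nbr.*
FROM (
SELECT
ID,
LAG(ID) OVER (ORDER BY `Date` ASC, `Name` ASC) `prev`,
LEAD(ID) OVER (ORDER BY `Date` ASC, `Name` ASC) `next`
FROM `tbl`
) t
JOIN `tbl` nbr ON nbr.ID IN (t.prev, t.next)
WHERE t.ID = 7;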
You can use the window function ROW_NUMBER() in this way: ROW_NUMBER() OVER (). This will number every row in the table consecutively. Now you search for your ID and you also get the row number. For example, you search for ID = 7 and you get row number 35. Now you can search for row numbers 34 and 36 to get the rows directly above and below the one with ID 7.
This is what comes to mind:
SELECT *
FROM (
SELECT *, ROW_NUMBER() OVER (ORDER BY `date`, `name`) AS row_num
FROM people
) p1
WHERE row_num > (SELECT row_num FROM (
SELECT *, ROW_NUMBER() OVER (ORDER BY `date`, `name`) AS row_num
FROM people
) p2 WHERE p2.id = 7)
LIMIT 1;
Using the row number window function, you can compare two view of the table with id = 7 and get the row you need. You can change the condition in the subquery to suit your needs, e.g., p2.name = 'John' and p2.date = '2002-04-02'.
Here's a dbfiddle demonstrating: https://www.db-fiddle.com/f/mpQBcijLFRWBBUcWa3UcFY/2
Alternately, you can simplify the syntax a bit and avoid the redundancy using a CTE like this:
WITH p AS (
SELECT *, ROW_NUMBER() OVER (ORDER BY `date`, `name`) AS row_num
FROM people
)
SELECT *
FROM p
WHERE row_num > (SELECT row_num FROM p WHERE p.id = 7)
LIMIT 1;
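Since this version only walks forward, the row before ID 7 can be fetched the same way by flipping the comparison and the sort direction (same CTE, just a sketch):
WITH p AS (
SELECT *, ROW_NUMBER() OVER (ORDER BY `date`, `name`) AS row_num
FROM people
)
SELECT *
FROM p
WHERE row_num < (SELECT row_num FROM p WHERE p.id = 7)
ORDER BY row_num DESC
LIMIT 1;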

Compute an average number of transactions per user in a readable manner

I have always been struggling with these types of queries. So, I'd like someone to check my approach to handling them.
I am asked to find how many transactions, on average, each user executes during a 12 hours timespan starting from the first transaction.
This is the data:
CREATE TABLE IF NOT EXISTS `transactions` (
`transactions_ts` timestamp ,
`user_id` int(6) unsigned NOT NULL,
`transaction_id` bigint NOT NULL,
`item` varchar(200), PRIMARY KEY(`transaction_id`)
) DEFAULT CHARSET=utf8;
INSERT INTO `transactions` (`transactions_ts`, `user_id`, `transaction_id`,`item` ) VALUES
('2016-06-18 13:46:51.0', 13811335,1322361417, 'glove'),
('2016-06-18 17:29:25.0', 13811335,3729362318, 'hat'),
('2016-06-18 23:07:12.0', 13811335,1322363995,'vase' ),
('2016-06-19 07:14:56.0',13811335,7482365143, 'cup'),
('2016-06-19 21:59:40.0',13811335,1322369619,'mirror' ),
('2016-06-17 12:39:46.0',3378024101,9322351612, 'dress'),
('2016-06-17 20:22:17.0',3378024101,9322353031,'vase' ),
('2016-06-20 11:29:02.0',3378024101,6928364072,'tie'),
('2016-06-20 18:59:48.0',13811335,1322375547, 'mirror');
My approach is the following (with the steps and the query itself below):
1) For each distinct user_id, find the timestamp of their first transaction and the timestamp 12 hours after it. This is accomplished by the inner query aliased as t1.
2) Then, by an inner join to the second inner query (t2), I augment each row of the transactions table with the two values "first_trans" and "right_trans" from step 1.
3) Now, with a WHERE condition, I select only those transaction timestamps that fall in the interval specified by the first_trans and right_trans timestamps.
4) The filtered table from step 3 is then aggregated as a count of distinct transaction ids per user.
5) The result of the four steps above is a table where each user has a count of transactions falling into the interval of 12 hours from their first timestamp. I wrap it in another select that sums the users' transaction counts and divides by the number of users, giving an average count per user.
I am quite certain that the end result is correct overall, but I keep thinking I might be able to do without the 4th select. Or perhaps the whole code is somewhat clumsy; my aim was to make this query as readable as possible, not necessarily computationally optimal.
select
sum(dist_ts)/count(*) as avg_ts_per_user
from (
select
count(distinct transaction_id) as dist_ts,
us_id
from
(select
user_id as us_id,
min(transactions_ts) as first_trans,
min(transactions_ts) + interval 12 hour as right_trans
from transactions
group by us_id )
as t1
inner join
(select * from transactions )
as t2
on t1.us_id=t2.user_id
where transactions_ts >= first_trans
and transactions_ts < right_trans
group by us_id
) as t3
Fiddle demo
I don't think there is a mistake per se. The code can be slightly simplified (and neatened up a bit as follows):
select sum(dist_ts)/count(*) as avg_ts_per_user
from (
select count(distinct transaction_id) as dist_ts, us_id
from (
select user_id as us_id, min(transactions_ts) as first_trans, min(transactions_ts) + interval 12 hour as right_trans
from transactions
group by us_id
) as t1
inner join transactions as t2
on t1.us_id=t2.user_id and transactions_ts >= first_trans and transactions_ts < right_trans
group by us_id
) as t3
The (select * from transactions) as t2 was simplified above, and I somewhat arbitrarily moved a where clause condition to the on clause of the inner join.
My Fiddle Demo
Here is a second way that does not use inner joins:
select sum(cnt)/count(*) as avg_ts_per_user from (
select count(*) as cnt, t.user_id
from transactions t
where t.transactions_ts >= (select min(transactions_ts) from transactions where user_id = t.user_id)
and t.transactions_ts < (select min(transactions_ts) + interval 12 hour from transactions where user_id = t.user_id)
group by t.user_id
) sq
Another Fiddle
You should probably run EXPLAIN against the two queries to see which one runs better on your server. Also note that min(transactions_ts) is specified twice for each user. Is MySQL able to avoid the redundant calculation? I don't know. One possibility would be to create a temporary table consisting of user_id and min_transaction_ts so that the value is computed once. This would only make sense if your table had lots of rows, and maybe not even then.
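That temporary-table idea could look roughly like this (a sketch; the table name first_ts is made up):
CREATE TEMPORARY TABLE first_ts AS
SELECT user_id, MIN(transactions_ts) AS min_transaction_ts
FROM transactions
GROUP BY user_id;
SELECT SUM(cnt)/COUNT(*) AS avg_ts_per_user
FROM (
SELECT COUNT(*) AS cnt, t.user_id
FROM transactions t
INNER JOIN first_ts f ON f.user_id = t.user_id
WHERE t.transactions_ts >= f.min_transaction_ts
AND t.transactions_ts < f.min_transaction_ts + INTERVAL 12 HOUR
GROUP BY t.user_id
) sq;
This way the minimum timestamp is computed exactly once per user.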

MySQL put a specific row at the top of the result

I'm doing a basic SQL select query which returns a set of results. I want a specific row with the entry "Fee" to be put at the top of the results, followed by the rest.
Something like:
SELECT * FROM tbl ORDER By Charges = Fee DESC, Charges DESC
Can anyone help?
You could try this :
SELECT * from tbl ORDER BY CASE WHEN Charges = 'Fee' THEN 0 ELSE 1 END, Charges DESC;
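Since MySQL treats a boolean expression as 0 or 1, the CASE isn't strictly necessary; the attempt from the question also works once 'Fee' is quoted:
SELECT * FROM tbl ORDER BY Charges = 'Fee' DESC, Charges DESC;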
I think you'd have to use a UNION query. ORDER BY doesn't support this kind of thing by default as far as I know.
Something like this:
(SELECT *, 0 AS fee_first FROM tbl WHERE Charges = 'Fee')
UNION ALL
(SELECT *, 1 FROM tbl WHERE Charges <> 'Fee')
ORDER BY fee_first, Charges DESC
You can use ORDER BY with the FIELD function, which will then order by those values first.
As I don't have your table definitions, I have thrown one together here http://sqlfiddle.com/#!9/91376/13
In case it disappears, the script pretty much consists of:
CREATE TABLE IF NOT EXISTS `tbl` (
`id` int(6) unsigned AUTO_INCREMENT,
`Name` char(6) not null,
`Charges` char(10) NOT NULL,
PRIMARY KEY (`id`)
) DEFAULT CHARSET=utf8;
INSERT INTO `tbl` (`Name`, `Charges`)
VALUES ('One', 'Fee'), ('Two', 'Charge'), ('Three', 'Charge'),
('Four', 'Charge'), ('Five', 'Fee'), ('Six', 'Fee'),
('Seven', 'Invoice'), ('Eight', 'Fee'), ('Nine', 'Invoice'),
('Ten', 'Invoice');
SELECT *
FROM tbl
ORDER BY FIELD(`Charges`, 'Charge') DESC
;
Which returns:
id Name Charges
2 Two Charge
3 Three Charge
4 Four Charge
1 One Fee
9 Nine Invoice
8 Eight Fee
7 Seven Invoice
6 Six Fee
5 Five Fee
10 Ten Invoice
So, to directly answer your question, your query would be;
SELECT *
FROM tbl
ORDER BY FIELD(Charges, 'Fee') DESC
Edit: viewable, sorted by Charges = 'Fee', here: http://sqlfiddle.com/#!9/91376/15
SELECT * FROM tbl ORDER By FIELD(Charges, 'Fee') DESC
You can use something like the above, where Charges is the field and 'Fee' the specific value. That way you can keep it simple.

MySQL: the correct way of using GROUP BY

I have this table:
CREATE TABLE IF NOT EXISTS `goldprice` (
`price` double unsigned NOT NULL,
`days` smallint(5) unsigned NOT NULL,
`seconds` mediumint(5) unsigned NOT NULL,
`sid` smallint(4) unsigned NOT NULL,
`gid` smallint(4) NOT NULL,
PRIMARY KEY (`days`,`seconds`,`sid`),
KEY `sid` (`sid`),
KEY `gid` (`gid`),
KEY `days` (`days`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
I want the minimum and maximum price of each day, and the first and last price of each day (based on its seconds).
Using a subquery I can solve part of the problem: before grouping, I build a subquery and sort inside it, and the result in MySQL is correct. Min, max, and the last or the first price (depending on the sort direction) can be obtained this way.
Two important things remain: both the last and the first price are required, and performance is very important, so a subquery seems not good.
And I have this SQL:
SELECT price, days, seconds FROM goldprice
where gid=1 and days>=16200 group by days
order by days desc, seconds desc
change "days>=16200" to "days=16200"
will returns different result in "days=16200" row.
the sort is not remaining desc.
i know behavior of my sql group by
but i can't find good solution for my needs
MySQL order by before group by
Your query is incorrect. You select days with a random price match and a random seconds match. You should decide what price and seconds you want to show per day. The SUM? The MINimum? The MAXimum?
When starting with GROUP BY, you should make sure that for each column you either group by it or aggregate it (e.g. use SUM(price) instead of price alone or have price in the group by clause).
Example:
select a, b, MIN(c), MAX(d), e
from mytable
group by a, b;
a and b are okay, because you group by them. MIN(c) and MAX(d) are okay, because you aggregate c and d. e is incorrect; it is neither in the group by clause nor being aggregated. This is allowed in MySQL, but it's an advanced feature one must be aware of and handle carefully. The above select statement would give just any of the matching e per a, b - the minimum e, the maximum e or just any other e. Only do this when you know that e is unique for a, b, or when you don't care which e you get. As said, it's an advanced feature.
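If you need the e belonging to one specific row per group, say the row with the minimum c, a common pattern (sketched here on the same example table) is to group first and join back:
select m.a, m.b, m.minc, t.e
from (select a, b, MIN(c) as minc from mytable group by a, b) m
inner join mytable t on t.a = m.a and t.b = m.b and t.c = m.minc;
-- note: ties on c will still return several rows per (a, b)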
My solution is here:
select g2.days, group_concat(price) prices,
group_concat(seconds) times, minp, maxp from goldprice g1
inner join (
SELECT gid, days,
max(price) as maxp, min(price) as minp,
max(seconds) as maxt, min(seconds) as mint
FROM `goldprice` where gid = 1 and days = 16200
group by days
) g2
on (g1.days = g2.days and g1.gid = g2.gid and (seconds = mint or seconds = maxt))
where g1.gid = 1 and g1.days = 16200
group by g1.days order by g1.days desc, seconds asc
Is my solution correct?

Need help optimizing 4 heavy queries on one webpage

I have four queries that run on one web page. I use them for statistics and they are taking too long to load.
Here are my current configurations; use the text wrapping button on pastebin to make them easier to read.
I have a lot of RAM dedicated to MySQL, but it still takes a long time. I have also indexed most of the columns.
I'm just trying to see what other options I have.
I put "show create table" and the total count(*) in here. I'm going to rename everything and paste it into SO. I agree that someone in the future may find it useful.
QUERY ONE
SELECT SQL_NO_CACHE
DATE_FORMAT(DateActioned,'%M-%Y') as val1,
COUNT(*) AS total_count
FROM
db.statisticsresults
WHERE
DID = 28
AND ActionTypeID = 1
AND DateActioned IS NOT NULL
GROUP BY
DATE_FORMAT(DateActioned, '%m-%y')
ORDER BY
YEAR( DateActioned ) DESC,
MONTH( DateActioned ) DESC
For this, I would have a covering index based on your key elements so the engine does not have to go back to the raw data... Based on this and your following queries, I would have THAT column in the primary index position, such as
StatisticsResults -- index ( DID, ActionTypeID, DateActioned )
Ordering by the respective YEAR() descending and MONTH() descending will do the same thing as your hard-coded references to find the field in the list.
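Written out as a statement, that suggestion would be something like (the index name is my own):
CREATE INDEX ix_statisticsresults_did_type_date
ON db.statisticsresults (DID, ActionTypeID, DateActioned);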
QUERY TWO
-- 381.812
SELECT SQL_NO_CACHE
DATE_FORMAT(DateActioned,'%M-%Y') as val1,
COUNT(*) AS total_count
FROM
db.statisticsdivision
WHERE
DID = 28
AND ActionTypeID = 9
AND DateActioned IS NOT NULL
GROUP BY
DATE_FORMAT(DateActioned, '%m-%y')
ORDER BY
YEAR( DateActioned ) DESC,
MONTH( DateActioned ) DESC
On this one, I changed the DID = '28' to DID = 28. If the column is numeric, don't confuse the engine by making it convert one to the other. The same index from query 1 would apply here too.
QUERY THREE
-- 33.899
SELECT SQL_NO_CACHE DISTINCT
AID,
COUNT(*) AS acount
FROM
db.statisticsresults
JOIN db.division_id USING(AID)
WHERE
DID = 28
GROUP BY
AID
ORDER BY
count(*) DESC
LIMIT
19
This one looks like a bit of a waste... you are joining to the division table based on an "AID" column in the stats table. Why are you doing the join unless you actually expect some invalid "AID" values that are not in the division table? Again, change your "DID" comparison to 28 instead of '28'. Ensure your division table has its index on "AID" for the join. The SECOND index from query 1 appears to be your better option.
QUERY FOUR
-- 21.403
SELECT SQL_NO_CACHE DISTINCT
TID,
tax,
agent,
COUNT(*) AS t_count
FROM
db.statisticsresults sr
JOIN db.tax_id USING(TID)
JOIN db.agent_id ai ON(ai.AID = sr.AID)
WHERE
DID = 28
GROUP BY
TID,
sr.AID
ORDER BY
COUNT(*) DESC
LIMIT 19
Again, "DID" column from '28' to 28
For your tax_id table, have a covering index on that too, so it can handle the join to the agent table without going to the raw page data:
Tax_ID -- index ( tid, aid )
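Written out (again, the index name is my own):
CREATE INDEX ix_tax_id_tid_aid ON db.tax_id (TID, AID);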
Finally, if you are dealing with your original list finding things only from Jan 2012 to Dec 2013, you can avoid scanning the ENTIRE table of stats by adding to your WHERE clause...
AND DateActioned >= '2012-01-01'
So you completely skip over anything prior to 2012 (old data I presume?)