I am new to sql and this forum has been my lifeline till now. Thank you for creating and sharing on this great platform.
I am currently working on a large dataset and would appreciate some guidance.
The data table (existing_table) has 4 million rows and it looks like this:
id date sales_a sales_b sales_c sales_d sales_e
Please note that there are multiple rows with the same date.
What I want to do is to add 5 more columns in this table (cumulative_sales_a, cumulative_sales_b, etc.) which will have the cumulative sales figures for a, b, c, etc. till a particular date (this will be grouped by date). I used the following code to do this:
create table new_cumulative
select t.id, t.date, t.sales_a, t.sales_b, t.sales_c, t.sales_d, t.sales_e,
(select sum(x.sales_a) from existing_table x where x.id = t.id and x.date <= t.date) as cumulative_sales_a,
(select sum(x.sales_b) from existing_table x where x.id = t.id and x.date <= t.date) as cumulative_sales_b,
(select sum(x.sales_c) from existing_table x where x.id = t.id and x.date <= t.date) as cumulative_sales_c,
(select sum(x.sales_d) from existing_table x where x.id = t.id and x.date <= t.date) as cumulative_sales_d,
(select sum(x.sales_e) from existing_table x where x.id = t.id and x.date <= t.date) as cumulative_sales_e
from existing_table t
group by t.id, t.date;
I had created an index on the column 'id' before running this query.
Though I got the desired output, this query took almost 11 hours to finish.
I was wondering if I am doing something wrong here and if there is a better (and faster) way of running such queries.
Thank you for your help.
Some queries are expensive by nature and take a long time to execute. In this particular case you could avoid having 5 subqueries:
SELECT a.*, b.cumulative_sales_a, b.cumulative_sales_b, ...
FROM
(
select t.id, t.`date`, t.sales_a, t.sales_b, t.sales_c, t.sales_d, t.sales_e
from existing_table t
GROUP BY t.id,t.`date`
)a
LEFT JOIN
(
select x.id, x.date, sum(x.sales_a) as cumulative_sales_a,
sum(x.sales_b) as cumulative_sales_b, ...
FROM existing_table x
GROUP BY x.id, x.`date`
)b ON (b.id = a.id AND b.`date` <=a.`date`)
It's also an expensive query, but it should have a better execution plan than your original. Also, I'm not sure if
select t.id, t.`date`, t.sales_a, t.sales_b, t.sales_c, t.sales_d, t.sales_e
from existing_table t
GROUP BY t.id,t.`date`
gives you what you want: for instance, if you have 5 records with the same id and date, it will grab the values of the other fields (sales_a, sales_b, etc.) from any one of these 5 records.
You can merge all of the mini-selects with sum into a single query, changing
(select sum(x.sales_a) from existing_table x where x.id = t.id and x.date <= t.date) as cumulative_sales_a,
(select sum(x.sales_b) from existing_table x where x.id = t.id and x.date <= t.date) as cumulative_sales_b,
(select sum(x.sales_c) from existing_table x where x.id = t.id and x.date <= t.date) as cumulative_sales_c,
(select sum(x.sales_d) from existing_table x where x.id = t.id and x.date <= t.date) as cumulative_sales_d,
(select sum(x.sales_e) from existing_table x where x.id = t.id and x.date <= t.date) as cumulative_sales_e
into
select sum(..),sum(..),sum(...),sum(..),sum(..)
from existing_table x
where x.id=t.id and x.date<=t.date
This looks like an excellent spot for MySQL user variables. In this case, I would first pre-aggregate by your expected "ID" and "Date" to remove the duplicates, leaving a single row with the grand total for each day. Order that result by ID and date to prepare for the next part, which joins it to the "@sqlvariables" subquery.
Now process the rows in order and keep accumulating the respective "Sales" for each ID; when a new ID appears, reset the counters back to zero and start adding again. After each row is processed, set @lastID to the ID just processed so it can be compared when processing the next row, to tell whether we are continuing with the same ID or must reset to zero.
To help optimize the inner "PreAgg" query, ensure there is an index on (ID, Date). It should be SUPER fast for you.
SELECT
PreAgg.ID,
PreAgg.`Date`,
PreAgg.SalesA,
PreAgg.SalesB,
PreAgg.SalesC,
PreAgg.SalesD,
PreAgg.SalesE,
@CumulativeA := if( @lastID = PreAgg.ID, @CumulativeA, 0 ) + PreAgg.SalesA as CumulativeA,
@CumulativeB := if( @lastID = PreAgg.ID, @CumulativeB, 0 ) + PreAgg.SalesB as CumulativeB,
@CumulativeC := if( @lastID = PreAgg.ID, @CumulativeC, 0 ) + PreAgg.SalesC as CumulativeC,
@CumulativeD := if( @lastID = PreAgg.ID, @CumulativeD, 0 ) + PreAgg.SalesD as CumulativeD,
@CumulativeE := if( @lastID = PreAgg.ID, @CumulativeE, 0 ) + PreAgg.SalesE as CumulativeE,
@lastID := PreAgg.ID as dummyPlaceholder
from
( select
t.id,
t.`date`,
SUM( t.sales_a ) SalesA,
SUM( t.sales_b ) SalesB,
SUM( t.sales_c ) SalesC,
SUM( t.sales_d ) SalesD,
SUM( t.sales_e ) SalesE
from
existing_table t
group by
t.id,
t.`date`
order by
t.id,
t.`date` ) PreAgg,
( select
@lastID := 0,
@CumulativeA := 0,
@CumulativeB := 0,
@CumulativeC := 0,
@CumulativeD := 0,
@CumulativeE := 0 ) sqlvars
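To make the two steps of the answer above easier to follow, here is a minimal Python sketch of the same logic: pre-aggregate per (id, date), then make one ordered pass that resets the running total whenever the id changes. The sample rows and the function name `cumulative_sales` are made up for illustration, and only one sales column is shown; it is not a replacement for the SQL.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sample rows: (id, date, sales_a). The real table has five sales columns.
rows = [
    (1, "2013-01-01", 10),
    (1, "2013-01-01", 5),   # duplicate (id, date), as mentioned in the question
    (1, "2013-01-02", 7),
    (2, "2013-01-01", 3),
    (2, "2013-01-03", 4),
]

def cumulative_sales(rows):
    """Return (id, date, daily_total, running_total) for each (id, date)."""
    key = itemgetter(0, 1)
    # Step 1: collapse duplicates into one total per (id, date), like the PreAgg subquery.
    pre_agg = [
        (rid, day, sum(r[2] for r in grp))
        for (rid, day), grp in groupby(sorted(rows, key=key), key=key)
    ]
    # Step 2: single ordered pass; reset the running total when the id changes.
    result = []
    last_id, running = None, 0
    for rid, day, total in pre_agg:
        if rid != last_id:
            running = 0
        running += total
        result.append((rid, day, total, running))
        last_id = rid
    return result

result = cumulative_sales(rows)
```

Each input row is visited once after sorting, which is why this approach should scale much better than re-summing all earlier rows for every row.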
mysql table: work
|id|user_id|created_at|realization|
I have been working on a SQL query that calculates performance (realization today / realization on the first day of the month) and sorts records by performance.
Expected result:
|ranking|performance|user|
|1|0.88|36|
|2|0.712444111|444|
|3|0.711|1|
|4|0.33333|9|
|5|0.1006|29|
returned result:
|ranking|performance|user|
|4|0.88|36|
|2|0.712444111|444|
|5|0.711|1|
|3|0.33333|9|
|1|0.1006|29|
Here is my query:
SET @ranking := 0;
SELECT
@ranking := @ranking + 1 as ranking,
w1.user_id,
IFNULL(ROUND((w2.realization / w1.realization), 4), 0) AS performance
FROM work w1
JOIN (
SELECT min(created_at) AS first_month, max(created_at) AS last_month, user_id
FROM work
WHERE DATE_FORMAT(NOW(), '%Y-%m') = DATE_FORMAT(created_at, '%Y-%m')
GROUP BY user_id
ORDER BY user_id
) AS w ON w1.user_id = w.user_id AND w1.created_at = w.first_month
JOIN work AS w2 ON w1.user_id = w2.user_id AND w2.created_at = w.last_month
ORDER BY performance DESC
UPDATE
Even if I try to wrap it this way, the rankings are not right
SET @ranking := 0;
SELECT @ranking := @ranking + 1 as ranking, a.user_id, a.performance
FROM (
SELECT
w1.user_id,
IFNULL(ROUND((w2.realization / w1.realization), 4), 0) AS performance
FROM work w1
JOIN (
SELECT min(created_at) AS first_month, max(created_at) AS last_month, user_id
FROM work
WHERE DATE_FORMAT(NOW(), '%Y-%m') = DATE_FORMAT(created_at, '%Y-%m')
GROUP BY user_id
ORDER BY user_id
) AS w ON w1.user_id = w.user_id AND w1.created_at = w.first_month
JOIN work AS w2 ON w1.user_id = w2.user_id AND w2.created_at = w.last_month
ORDER BY performance DESC
) AS a
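The returned ranks are consistent with the counter being incremented before the final sort is applied, although that is an assumption on my part and not something the question confirms. As a small sketch of the intended order of operations (sort first, number afterwards), here is the expected result reproduced in Python from the values shown above. The function name `rank_by_performance` is hypothetical, and this only illustrates the logic; it is not a fix for the MySQL query itself.

```python
# (user_id, performance) pairs taken from the expected result in the question,
# deliberately listed in an unsorted order.
performance = [
    (29, 0.1006),
    (36, 0.88),
    (9, 0.33333),
    (444, 0.712444111),
    (1, 0.711),
]

def rank_by_performance(pairs):
    """Sort by performance (highest first), then assign ranks 1..n to the sorted rows."""
    ordered = sorted(pairs, key=lambda p: p[1], reverse=True)
    return [(rank, perf, user) for rank, (user, perf) in enumerate(ordered, start=1)]

ranking = rank_by_performance(performance)
for row in ranking:
    print(row)
```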
I have a MySQL database table like below:
id , name , col1
and I want to find all rows where the value of col1 is greater than the average of col1 over (at most) the past 5 rows.
For example, if I have 50 rows and row #20 is selected, the average of col1 over rows #20, #19, #18, #17, #16 should be less than the value of col1 in row #20, and so on.
Thank you in advance.
What you seem to want is a running average over the past M records, ending at the current record, and to select the current record if its column value is greater than that running average.
Here is my attempt at it:
SET @M := 2;
SELECT * FROM
(
SELECT (@rownumber:= @rownumber + 1) AS rn, yt.*
FROM your_table yt,(SELECT @rownumber:= 0) nums
ORDER BY name, id
) a
WHERE a.var1 >
(
SELECT avg(b.var1)
FROM
(
SELECT (@rownumber:= @rownumber + 1) AS rn, yt.*
FROM your_table yt,(SELECT @rownumber:= 0) nums
ORDER BY name, id
) b
WHERE b.rn > a.rn - @M AND b.rn <= a.rn
)
@M is the count of past records to be considered when finding the running average.
Here is the code at SQL Fiddle
[EDIT]:
Here is another solution which, in my view, should be more efficient than the correlated query.
SET @M := 2;
SELECT a.* FROM
(
SELECT (@rownumber:= @rownumber + 1) AS rn, yt.*
FROM your_table yt,(SELECT @rownumber:= 0) nums
ORDER BY name, id
) a
JOIN
(
SELECT b.name, b.rn, AVG(c.var1) AS av
FROM
(
SELECT (@rownumber1:= @rownumber1 + 1) AS rn, yt.*
FROM your_table yt,(SELECT @rownumber1:= 0) nums
ORDER BY name, id
) b
JOIN
(
SELECT (@rownumber2:= @rownumber2 + 1) AS rn, yt.*
FROM your_table yt,(SELECT @rownumber2:= 0) nums
ORDER BY name, id
) c
ON b.name = c.name
AND c.rn > (b.rn - @M) AND c.rn <= b.rn
GROUP BY b.name,b.rn
) runningavg
ON a.name = runningavg.name
AND a.rn = runningavg.rn
AND a.var1 > runningavg.av
Here I have used a simple inner join to calculate the running average, and another inner join to select the rows whose column value is greater than that average.
Here is the code at SQL Fiddle
Let me know whether it proves to be more efficient.
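For readers who want to check the window arithmetic, here is a minimal Python sketch of the same condition the queries use (rn > current - M AND rn <= current), applied to a plain list that is assumed to be already ordered. The sample values and the function name `rows_above_running_avg` are hypothetical, and the per-name partitioning of the second query is left out for brevity.

```python
def rows_above_running_avg(values, m):
    """Return the indexes whose value exceeds the mean of the window of
    up to m rows that ends at (and includes) that index."""
    selected = []
    for i, value in enumerate(values):
        # Window covers positions i-m+1 .. i, clipped at the start of the list.
        window = values[max(0, i - m + 1): i + 1]
        if value > sum(window) / len(window):
            selected.append(i)
    return selected

values = [10, 20, 15, 30, 5]  # hypothetical col1 values, already in row order
picked = rows_above_running_avg(values, 2)
print(picked)
```

Note that the first row can never qualify, because its window contains only itself and a value is never greater than its own average.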
So I'll show you what I'm trying to do and explain my problem, there may be an answer different to the approach I'm trying to take.
The query I'm trying to perform is as follows:
SELECT *
FROM report_keywords rk
WHERE rk.report_id = 231
AND (
SELECT SUM(t.conv) FROM (
SELECT conv FROM report_keywords t2 WHERE t2.campaign_id = rk.campaign_id ORDER BY conv DESC LIMIT 10
) t
) >= 30
GROUP BY rk.campaign_id
The error I get is
Unknown column 'rk.campaign_id' in 'where clause'
Obviously this is saying that the table alias rk is not making it to the subsubquery. What I'm trying to do is get all of the campaigns where the sum of the top 10 conversions is greater than or equal to 30.
The relevant table structure is:
id INT,
report_id INT,
campaign_id INT,
conv INT
Any help would be greatly appreciated.
Update
Thanks to Kickstart I was able to do what I wanted. Here's my final query:
SELECT campaign_id, SUM(conv) as sum_conv
FROM (
SELECT campaign_id, conv, @Sequence := if(campaign_id = @campaign_id, @Sequence + 1, 1) AS aSequence, @campaign_id := campaign_id
FROM report_keywords
CROSS JOIN (SELECT @Sequence := 0, @campaign_id := 0) Sub1
WHERE report_id = 231
ORDER BY campaign_id, conv DESC
) t
WHERE aSequence <= 10
GROUP BY campaign_id
HAVING sum_conv >= 30
Possibly use a user variable to add a sequence number to get the top 10 records for each campaign, then use SUM to get the total of those.
Something like this:-
SELECT rk.*
FROM report_keywords rk
INNER JOIN
(
SELECT campaign_id, SUM(conv) AS SumConv
FROM
(
SELECT campaign_id, conv, @Sequence := if(campaign_id = @campaign_id, @Sequence + 1, 1) AS aSequence, @campaign_id := campaign_id
FROM report_keywords
CROSS JOIN (SELECT @Sequence := 0, @campaign_id := "") Sub1
ORDER BY campaign_id, conv DESC
) Sub2
WHERE aSequence <= 10
GROUP BY campaign_id
) Sub3
ON rk.campaign_id = Sub3.campaign_id AND Sub3.SumConv >= 30
WHERE rk.report_id = 231
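The sequence-number idea amounts to "take the N largest conv values per campaign, sum them, and keep the campaigns whose sum reaches the threshold". A minimal Python sketch of that logic follows; the sample rows, the function name `campaigns_with_top_sum`, and the small N used in the example are assumptions for illustration only.

```python
from collections import defaultdict
import heapq

def campaigns_with_top_sum(rows, n=10, threshold=30):
    """Map campaign_id -> sum of its n largest conv values,
    keeping only campaigns whose sum is >= threshold."""
    by_campaign = defaultdict(list)
    for campaign_id, conv in rows:
        by_campaign[campaign_id].append(conv)
    totals = {}
    for campaign_id, convs in by_campaign.items():
        top_sum = sum(heapq.nlargest(n, convs))
        if top_sum >= threshold:
            totals[campaign_id] = top_sum
    return totals

# Hypothetical (campaign_id, conv) rows for a single report.
rows = [(1, 20), (1, 15), (1, 1), (2, 5), (2, 4)]
totals = campaigns_with_top_sum(rows, n=2, threshold=30)
print(totals)
```

With n=2, campaign 1 keeps 20 + 15 = 35 and passes, while campaign 2 only reaches 9 and is dropped.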
Here's the SQLFiddle Link to my tables.
I basically want to select only Jack and Jill, as there is a non-zero difference between the last two nums entries of the table foo with the user being their respective names.
How is this possible?
Note: just to mention, in my foo table, I have around 100000 rows, so it would be good if there was a very quick and fast way of retrieving the data.
I prefer doing this using limit with the offset to get the two most recent values. Happily, your table has an id column for determining the order.
select user,
(select num from foo f2 where f2.user = f.user order by f2.id desc limit 1
) lastval,
(select num from foo f2 where f2.user = f.user order by f2.id desc limit 1, 1
) lastval2
from foo f
group by user
having lastval <> lastval2
Here's one way (although I think you'd be more likely to JOIN on a user's id rather than their name):
SELECT u.*
FROM
( SELECT x.*, COUNT(*) rank FROM foo x JOIN foo y ON y.user = x.user AND y.id >= x.id GROUP BY x.id)a
LEFT
JOIN
( SELECT x.*, COUNT(*) rank FROM foo x JOIN foo y ON y.user = x.user AND y.id >= x.id GROUP BY x.id)b
ON b.user = a.user
AND b.num = a.num
AND b.rank = a.rank + 1
JOIN users u
ON u.user = a.user
WHERE b.id IS NULL
AND a.rank = 1;
I think this query can be rewritten as follows, which might be faster...
SELECT u.*
FROM
( SELECT id
, user
, num
, @prev_user := @curr_user
, @curr_user := user
, @rank := IF(@prev_user = @curr_user, @rank+1, @rank:=1) rank
FROM foo
JOIN (SELECT @curr_user := null, @prev_user := null, @rank := 0) sel1
ORDER
BY user
, id DESC
) a
LEFT
JOIN
( SELECT id
, user
, num
, @prev_user := @curr_user
, @curr_user := user
, @rank := IF(@prev_user = @curr_user, @rank+1, @rank:=1) rank
FROM foo
JOIN (SELECT @curr_user := null, @prev_user := null, @rank := 0) sel1
ORDER
BY user
, id DESC
) b
ON b.user = a.user
AND b.num = a.num
AND b.rank = a.rank + 1
JOIN users u
ON u.user = a.user
WHERE b.id IS NULL
AND a.rank = 1;
Based on Strawberry's 2nd solution, I have tried this.
SELECT user, MIN(num) AS MinNum, MAX(num) AS MaxNum
FROM ( SELECT id
, user
, num
, @prev_user := @curr_user
, @curr_user := user
, @rank := IF(@prev_user = @curr_user, @rank+1, 1) AS rank
FROM foo
JOIN (SELECT @curr_user := null, @prev_user := null, @rank := 1) sel1
ORDER BY user, id DESC
) AS Sub
WHERE rank <= 2
GROUP BY user
HAVING MinNum != MaxNum
This ranks the rows in a subselect and then rejects those with a rank greater than 2 (unfortunately, the user variables give strange results if you try to check this within the subselect). The results are then grouped by user, and the min and max values of num are returned. If they differ, the row is returned (as you only have 1 or 2 rows per user, the min and max can only differ if 2 rows are returned AND they have different values).
The advantage of this is that it avoids joining two 100,000-row sets against each other and only needs to do the ranking once (although you would hope that MySQL would optimise this 2nd issue away anyway).
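The "keep the two most recent rows per user, then compare them" idea can be sketched in Python as follows. The rows below are invented, since the contents of the linked fiddle are not reproduced here, and the function name `users_with_changed_last_two` is hypothetical.

```python
from collections import defaultdict

# Hypothetical (id, user, num) rows standing in for the foo table.
foo = [
    (1, "Jack", 5), (2, "Jack", 7),
    (3, "Jill", 2), (4, "Jill", 2), (5, "Jill", 9),
    (6, "John", 4), (7, "John", 4),
    (8, "Jane", 1),
]

def users_with_changed_last_two(rows):
    """Return users whose two most recent num values (by id) differ."""
    latest = defaultdict(list)
    # Walk rows from newest to oldest and keep at most two nums per user.
    for _id, user, num in sorted(rows, key=lambda r: r[0], reverse=True):
        if len(latest[user]) < 2:
            latest[user].append(num)
    return sorted(
        user for user, nums in latest.items()
        if len(nums) == 2 and nums[0] != nums[1]
    )

changed = users_with_changed_last_two(foo)
print(changed)
```

Users with a single row (Jane here) are skipped, which mirrors the MIN/MAX comparison above: with only one row, the minimum and maximum are always equal.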
I have a MySQL table with the structure:
beverages_log(id, users_id, beverages_id, timestamp)
I'm trying to compute the maximum streak of consecutive days during which a user (with id 1) logs a beverage (with id 1) at least 5 times each day. I'm pretty sure that this can be done using views as follows:
CREATE or REPLACE VIEW daycounts AS
SELECT count(*) AS n, DATE(timestamp) AS d FROM beverages_log
WHERE users_id = '1' AND beverages_id = 1 GROUP BY d;
CREATE or REPLACE VIEW t AS SELECT * FROM daycounts WHERE n >= 5;
SELECT MAX(streak) AS current FROM ( SELECT DATEDIFF(MIN(c.d), a.d)+1 AS streak
FROM t AS a LEFT JOIN t AS b ON a.d = ADDDATE(b.d,1)
LEFT JOIN t AS c ON a.d <= c.d
LEFT JOIN t AS d ON c.d = ADDDATE(d.d,-1)
WHERE b.d IS NULL AND c.d IS NOT NULL AND d.d IS NULL GROUP BY a.d) allstreaks;
However, repeatedly creating views for different users every time I run this check seems pretty inefficient. Is there a way in MySQL to perform this computation in a single query, without creating views or repeatedly calling the same subqueries a bunch of times?
This solution seems to perform quite well as long as there is a composite index on users_id and beverages_id -
SELECT *
FROM (
SELECT t.*, IF(@prev + INTERVAL 1 DAY = t.d, @c := @c + 1, @c := 1) AS streak, @prev := t.d
FROM (
SELECT DATE(timestamp) AS d, COUNT(*) AS n
FROM beverages_log
WHERE users_id = 1
AND beverages_id = 1
GROUP BY DATE(timestamp)
HAVING COUNT(*) >= 5
) AS t
INNER JOIN (SELECT @prev := NULL, @c := 1) AS vars
) AS t
ORDER BY streak DESC LIMIT 1;
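The counter logic in this query (extend the streak when the previous qualifying day is exactly one day earlier, otherwise restart at 1) can be checked with a short Python sketch. The log dates and the function name `longest_streak` are hypothetical, and the sketch assumes the rows have already been filtered to one user and one beverage.

```python
from collections import Counter
from datetime import date, timedelta

def longest_streak(log_dates, min_per_day=5):
    """Longest run of consecutive days that each have at least min_per_day logs."""
    counts = Counter(log_dates)
    # Days that meet the threshold, in chronological order.
    qualifying = sorted(d for d, n in counts.items() if n >= min_per_day)
    best = current = 0
    prev = None
    for day in qualifying:
        if prev is not None and day - prev == timedelta(days=1):
            current += 1      # consecutive day: extend the streak
        else:
            current = 1       # gap (or first day): start a new streak
        best = max(best, current)
        prev = day
    return best

# Hypothetical log: one date entry per logged beverage.
logs = (
    [date(2013, 5, 1)] * 5
    + [date(2013, 5, 2)] * 6
    + [date(2013, 5, 3)] * 2   # below the threshold, so it breaks the streak
    + [date(2013, 5, 4)] * 5
)
streak = longest_streak(logs)
print(streak)
```

The sketch sorts the qualifying days explicitly before counting; the counter only gives meaningful results when the days are processed in date order.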
Why not include user_id in the daycounts view and group by user_id and date?
Also include user_id in view t.
Then, when you are querying against t, add the user_id to the WHERE clause.
Then you don't have to recreate your views for every single user; you just need to remember to include it in your WHERE clause.
That's a little tricky. I'd start with a view to summarize events by day:
CREATE VIEW BView AS
SELECT UserID, BevID, CAST(EventDateTime AS DATE) AS EventDate, COUNT(*) AS NumEvents
FROM beverages_log
GROUP BY UserID, BevID, CAST(EventDateTime AS DATE)
I'd then use a Dates table (just a table with one row per day; very handy to have) to examine all possible date ranges and throw out any with a gap. This will probably be slow as hell, but it's a start:
SELECT
UserID, BevID, MAX(StreakLength) AS StreakLength
FROM
(
SELECT
B1.UserID, B1.BevID, B1.EventDate AS StreakStart, DATEDIFF(DD, StartDate.Date, EndDate.Date) AS StreakLength
FROM
BView AS B1
INNER JOIN Dates AS StartDate ON B1.EventDate = StartDate.Date
INNER JOIN Dates AS EndDate ON EndDate.Date > StartDate.Date
WHERE
B1.NumEvents >= 5
-- Exclude this potential streak if there's a day with no activity
AND NOT EXISTS (SELECT * FROM Dates AS MissedDay WHERE MissedDay.Date > StartDate.Date AND MissedDay.Date <= EndDate.Date AND NOT EXISTS (SELECT * FROM BView AS B2 WHERE B1.UserID = B2.UserID AND B1.BevID = B2.BevID AND MissedDay.Date = B2.EventDate))
-- Exclude this potential streak if there's a day with less than five events
AND NOT EXISTS (SELECT * FROM BView AS B2 WHERE B1.UserID = B2.UserID AND B1.BevID = B2.BevID AND B2.EventDate > StartDate.Date AND B2.EventDate <= EndDate.Date AND B2.NumEvents < 5)
) AS X
GROUP BY
UserID, BevID