Summarize data that has already been grouped - reporting-services

I have a data set that looks like this:
User | Task | Time
--------|--------|--------
User A | Task X | 100
User A | Task Y | 200
User A | Task Z | 300
User B | Task X | 400
User B | Task Y | 500
User B | Task Z | 600
User C | Task X | 700
User C | Task Y | 800
User C | Task Z | 900
User D | Task X | 1000
User D | Task Y | 1100
User D | Task Z | 1200
When I do my initial grouping, the data looks like this:
User   | Avg User Time | Avg Task X Time | Avg Task Y Time | Avg Task Z Time
-------|---------------|-----------------|-----------------|----------------
User A | 200           | 100             | 200             | 300
User B | 500           | 400             | 500             | 600
User C | 800           | 700             | 800             | 900
User D | 1100          | 1000            | 1100            | 1200
I need it to look like this:
User | Avg User Time | Avg Task X Time | Avg Task Y Time | Avg Task Z Time
-----|---------------|-----------------|-----------------|----------------
All  | 650           | 550             | 650             | 750
This is how I got those numbers:
650 = (200+500+800+1100) / 4
550 = (100+400+700+1000) / 4
650 = (200+500+800+1100) / 4
750 = (300+600+900+1200) / 4
In other words, I have a column group on Task and a row group on User. The problem is that I want the row group to get summarized an extra time.
At first glance I could just return the user's name as 'All' and it would summarize, but this doesn't actually give me the averages that I need. I need to first SUM the times by user, and then find the average per user. If I change the way the original data is shaped, my task groups will no longer work properly.
If I try to use a "Totals" row on my row group, it aggregates the ORIGINAL data and not the summarized/grouped data. That is rather disappointing because it is actually incorrect in my eyes.
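To put the target numbers in SQL terms: the 'All' value in the first column is an average of per-user averages. As a sketch only (assuming a source table named AllUsersTasks(User, Task, Time)):
SELECT AVG(UserAvgTime) AS AllUsersAvgTime
FROM (
    SELECT [User], AVG([Time]) AS UserAvgTime
    FROM AllUsersTasks
    GROUP BY [User]
) AS PerUser;
-- (200 + 500 + 800 + 1100) / 4 = 650 for the sample data above;
-- cast Time to decimal if fractional averages matter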

The only way I was able to get this type of functionality was to use the Code section of the report. I kept track of the group data I wanted to summarize in a global variable that I would later output to the field I wanted.
Here is a Microsoft article that describes how to embed code into your report:
http://msdn.microsoft.com/en-us/library/ms159238.aspx
Here is a much more detailed way to solve your problem. Link

Assuming your source is SQL Server 2008, you might be able to use a combination of grouping sets:
http://technet.microsoft.com/en-us/library/bb522495.aspx
And the SSRS Aggregate Function:
http://msdn.microsoft.com/en-us/library/ms155830(v=sql.90).aspx
This blog has an example that may also be helpful
http://beyondrelational.com/blogs/jason/archive/2010/07/03/aggregate-of-an-aggregate-function-in-ssrs.aspx
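As a rough sketch of what the GROUPING SETS part might look like (table and column names are assumptions; the report would still need the SSRS Aggregate function to surface the pre-aggregated rows, as described in the links):
SELECT [User], Task, AVG([Time]) AS AvgTime
FROM AllUsersTasks
GROUP BY GROUPING SETS
(
    ([User], Task),  -- the detail cells
    ([User]),        -- per-user averages (the "Avg User Time" column)
    (Task),          -- the per-task "All" row values
    ()               -- grand total; equals the avg of per-user averages here only because each user has the same number of tasks
);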
Good Luck

I would do this in a SQL script; doing it in Reporting Services would be overkill (although it probably would be possible).
I have an example script right here:
drop table #tmp, #tmp2, #tmp3
select 'User A' as [User],' Task X ' as [Task],100.00 as [Time]
into #tmp
union all
select 'User A ',' Task Y ',200
union all
select 'User A ',' Task Z ',300
union all
select 'User B ',' Task X ',400
union all
select 'User B ',' Task Y ',500
union all
select 'User B ',' Task Z ',600
union all
select 'User C ',' Task X ',700
union all
select 'User C ',' Task Y ',800
union all
select 'User C ',' Task Z ',900
union all
select 'User D ',' Task X ',1000
union all
select 'User D ',' Task Y ',1100
union all
select 'User D ',' Task Z ',1200
select [User],
Task,
Sum(time) as time
into #tmp2
from #tmp
group by [User],
[Task]
select [User],
avg(time) as time
into #tmp3
from #tmp2
group by [User];
declare @statement nvarchar(max);
select @statement =
'with cteTimes as (
select *
from #tmp2 t
pivot (sum (t.[time]) for Task in (' + stuff((select ', ' + quotename([Task]) from #tmp group by [Task] for xml path, type).value('.','varchar(max)'), 1, 2, '') + ')) as Task
)
select ''All'' as [User],
(select avg(usr.time) from #tmp3 usr),'
+ stuff((select ', avg(' + quotename([Task]) + ') as ' + quotename([Task]) from #tmp group by [Task] for xml path, type).value('.','varchar(max)'), 1, 2, '') +
' from cteTimes x';
exec sp_executesql @statement;
The script can probably be optimized further, for instance by using a pivot instead of the intermediate temp tables. My example is just explanatory.

Here's the query I would write that works... The "PreQuery" is done to group the counts and sum of each element for a given user... Then that is rolled-up to the top-most level of "All". Now, this is based on your data sample.
SELECT
AVG( TaskTime / TaskCount ) as TaskAvg,
SUM( XTime ) / SUM( XCount ) as XAvg,
SUM( YTime ) / SUM( YCount ) as YAvg,
SUM( ZTime ) / SUM( ZCount ) as ZAvg
from
( SELECT
user,
COUNT(*) as TaskCount,
SUM( Time ) as TaskTime,
SUM( CASE WHEN Task = 'Task X' THEN 1 ELSE 0 END ) as XCount,
SUM( CASE WHEN Task = 'Task X' THEN Time ELSE 0 END ) as XTime,
SUM( CASE WHEN Task = 'Task Y' THEN 1 ELSE 0 END ) as YCount,
SUM( CASE WHEN Task = 'Task Y' THEN Time ELSE 0 END ) as YTime,
SUM( CASE WHEN Task = 'Task Z' THEN 1 ELSE 0 END ) as ZCount,
SUM( CASE WHEN Task = 'Task Z' THEN Time ELSE 0 END ) as ZTime
FROM
AllUsersTasks
group by
user ) PreQuery
If your data can have multiple entries for a single task per user, say 3 entries for User A on Task X with Times of 95, 100 and 105, those 3 entries total 300, which still works out to an average of 100 per entry. This could skew your OVERALL average for the task, and the query would have to be modified. Let me know if a person will have multiple entries per a given task in the production data. If so, then THAT element would probably need to be put into its OWN pre-query where the "FROM AllUsersTasks" is.
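For what it's worth, here is one hedged sketch of that adjustment (T-SQL-style; only one possible interpretation: collapse duplicates to a per-user/per-task average first, then roll up as before):
WITH PerUserTask AS (
    -- Step 1: one averaged row per user/task, so duplicates cannot skew anything
    SELECT [User], Task, AVG([Time]) AS AvgTime
    FROM AllUsersTasks
    GROUP BY [User], Task
),
PerUser AS (
    -- Step 2: one row per user, the same shape as the PreQuery above
    SELECT [User],
           AVG(AvgTime) AS UserAvg,
           AVG(CASE WHEN Task = 'Task X' THEN AvgTime END) AS XAvg,
           AVG(CASE WHEN Task = 'Task Y' THEN AvgTime END) AS YAvg,
           AVG(CASE WHEN Task = 'Task Z' THEN AvgTime END) AS ZAvg
    FROM PerUserTask
    GROUP BY [User]
)
-- Step 3: the "All" row
SELECT AVG(UserAvg) AS TaskAvg, AVG(XAvg) AS XAvg, AVG(YAvg) AS YAvg, AVG(ZAvg) AS ZAvg
FROM PerUser;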


GROUP BY function that returns the first result

Here is my MySQL query with its result. I want to obtain, for each value of 'function', the first result of 'priority', and thus get the result below.
However, syntax of the type "SELECT function, FIRST(priority) ... GROUP BY function" does not exist. Do you have any idea how to do this?
SELECT
function,
priority
FROM challenge_access_rule
WHERE 10 = challenge_access_rule.challenge_id
AND (
rule = 'everybody'
OR (rule = 'friends' AND 1)
)
UNION
SELECT
function,
isRestriction AS priority
FROM challenge_access_user
WHERE 10 = challenge_access_user.challenge_id AND challenge_access_user.user_id = 2
ORDER BY ABS(priority)
-- RESULT --
emitInstance | 0
emitDeal | 0
emitDealInstance | 1
emitDeal | 100
emitDealInstance | -100
vote | -100
emitDeal | -200
view | -200
interact | -200
-- DESIRED RESULT --
emitInstance | 0
emitDeal | 0
emitDealInstance | 1
vote | -100
view | -200
interact | -200
Thanks,
Bastien
First, I would simplify your current query to:
SELECT car.function, car.priority
FROM challenge_access_rule car
WHERE 10 = car.challenge_id AND
( car.rule in ('everybody', 'friends') or
car.user_id = 2
)
ORDER BY ABS(car.priority);
Then, if you want the minimum of the absolute value of the priority, you can use aggregation and a substring_index()/group_concat() trick:
SELECT car.function,
substring_index(group_concat(car.priority order by ABS(car.priority)), ',', 1) as priority
FROM challenge_access_rule car
WHERE 10 = car.challenge_id AND
( car.rule in ('everybody', 'friends') or
car.user_id = 2
)
GROUP BY car.function;
or a case expression:
SELECT car.function,
(case when min(car.priority) = - min(abs(car.priority))
then - min(abs(car.priority))
else min(car.priority)
end) as priority
FROM challenge_access_rule car
WHERE 10 = car.challenge_id AND
( car.rule in ('everybody', 'friends') or
car.user_id = 2
)
GROUP BY car.function;
This code works!
Thanks to Gordon:
SELECT function,
substring_index(group_concat(priority ORDER BY ABS(priority)), ',', 1)
FROM (
SELECT
function,
priority
FROM challenge_access_rule
WHERE 10 = challenge_access_rule.challenge_id
AND (
rule = 'everybody'
OR (rule = 'friends' AND 1) -- 'AND 1' edit after
)
UNION
SELECT
function,
isRestriction AS priority
FROM challenge_access_user
WHERE 10 = challenge_access_user.challenge_id AND challenge_access_user.user_id = 2
ORDER BY ABS(priority)
) AS toto
GROUP BY function
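For completeness: on MySQL 8.0+ the same "first row per group" can be done with a window function instead of the group_concat trick. A sketch based on Gordon's simplified single-table form above, so it assumes user_id is available on challenge_access_rule:
SELECT `function`, priority
FROM (
    SELECT car.`function`, car.priority,
           ROW_NUMBER() OVER (PARTITION BY car.`function`
                              ORDER BY ABS(car.priority)) AS rn
    FROM challenge_access_rule car
    WHERE car.challenge_id = 10
      AND (car.rule IN ('everybody', 'friends') OR car.user_id = 2)
) ranked
WHERE rn = 1;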

Create a query that fetches rows plus their adjacent rows

I am having problems with a specific query - or rather, with creating the query in the first place.
The columns can be reduced to id, seconds and status.
=============================
| id | seconds | status |
-----------------------------
| 0 | 0 | 0 |
| 1 | 12 | 1 |
| 2 | 25 | 0 |
| 3 | 37 | 1 |
| 4 | 42 | 0 |
=============================
What I'd like to have: All entries with status = 1 PLUS all entries that are less than 10 seconds away from those entries. Basically, I want to fetch all possible pairs (or triplets, etc.) of rows to check manually (later automatically) whether they need to be paired (there is a column parent_id for this purpose, but we don't need that for the query). I could do this in code (first select all status=1, then loop), but I wonder whether it is possible to do this purely in the database.
Thus, my desired output would be the following:
=============================
| id | seconds | status |
-----------------------------
| 1 | 12 | 1 | <- status = 1
| 3 | 37 | 1 | <- status = 1
| 4 | 42 | 0 | <- only 5 seconds after status = 1
=============================
My current best guess is this:
SELECT * FROM entries e0
WHERE
e0.status = 1 OR
e0.status = 0 AND
0 < (SELECT count(*)
FROM entries e1
WHERE e1.status = 1 AND abs(e1.seconds - e0.seconds) < 10)
But this fetches the whole table, and I don't really know why - and it takes a long time to do so (there is an index on the column seconds, the table has 9000 entries).
Is there a way to do this (maybe even efficiently)?
Here's one option with union all and exists:
select * from entries where status = 1
union all
select * from entries e where status = 0 and
exists (select 1
from entries e2
where e2.status = 1 and
abs(e.seconds - e2.seconds) < 10
)
SQL Fiddle Demo
Alternatively you could use an outer join with distinct instead of exists:
select distinct e.*
from entries e
left join entries e2 on e2.status = 1
where e.status = 1 or abs(e.seconds - e2.seconds) < 10
More Fiddle
I prefer to do it in a single query, though there are also ways of doing it with EXISTS or subqueries. Utilizing an outer join means you can grab everything at once with nicely crafted WHERE and JOIN clauses; adding a GROUP BY or DISTINCT, depending on your performance situation, will tidy up your results and make the rows unique.
My suggestion for WHERE clauses, to ensure your intentions are met, is to use parentheses to establish the intended precedence. It will make your code's intent clearer.
WHERE Condition1 = True OR Condition2 = True AND Condition3 = True
Should be
WHERE Condition1 = True OR (Condition2 = True AND Condition3 = True)
Oddly, based on past experience I would not have thought it would evaluate in the manner you mention, but then again I ALWAYS use parentheses to establish precedence, to make the logic clearer and more complex conditions easier to craft.
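Applied to the query in the question, that just means writing it as below; the behaviour is unchanged (AND already binds tighter than OR), the parentheses only make the intent explicit:
SELECT *
FROM entries e0
WHERE e0.status = 1
   OR ( e0.status = 0
        AND 0 < (SELECT count(*)
                 FROM entries e1
                 WHERE e1.status = 1
                   AND abs(e1.seconds - e0.seconds) < 10) )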
The reason you are getting the whole table is the data in your table. Seriously, sometimes we go looking for the answer and make it complicated. I prefer my way of solving your query, but given your sample result set, my query and yours get the same results! Try changing the 10 seconds down to 1/2/3 etc. and see what the effect on your query is. My assumption would be that in your full dataset, any record with a status of 0 is within 10 seconds of a record that has a status of 1. I would have commented back, but this is one of the first questions I have answered.
Here is some example code based on your dataset and query.
DECLARE @Entries AS TABLE (
Id INT
,Seconds INT
,[Status] BIT
)
INSERT INTO @Entries (Id, Seconds, [Status])
VALUES (0,0,0 )
,(1,12,1 )
,(2,25,0 )
,(3,37,1 )
,(4,42,0 )
SELECT *
FROM
@Entries e0
WHERE
e0.Status = 1
OR e0.Status = 0
AND 0 < (SELECT count(*)
FROM
@Entries e1
WHERE e1.Status = 1 AND ABS(e1.Seconds - e0.Seconds) < 10)
SELECT DISTINCT
e0.*
FROM
@Entries e0
LEFT JOIN @Entries e1
ON e1.[Status] = 1
AND ABS(e1.seconds - e0.seconds) < 10
WHERE
e0.[Status] = 1
OR e1.id IS NOT NULL
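One note on the efficiency question: the ABS(e1.seconds - e0.seconds) < 10 predicate typically cannot use the index on seconds. Rewriting it as a plain range comparison keeps the same logic but can let the optimizer do a range seek (a sketch based on the EXISTS version suggested earlier):
SELECT * FROM entries WHERE status = 1
UNION ALL
SELECT *
FROM entries e
WHERE e.status = 0
  AND EXISTS (SELECT 1
              FROM entries e2
              WHERE e2.status = 1
                AND e2.seconds > e.seconds - 10
                AND e2.seconds < e.seconds + 10);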

MySQL dividing current row with current row + 1

I have a problem with MySQL query. I have two tables, table currency and table currency_detail. Table currency contains currency code, such as USD, EUR, IDR, etc. Table currency_detail contains date and rate. After joining tables to get all rates USD in 1 year, I have data which look like this :
Date | Rate
-------------------
2015-10-20 | 14463
2015-10-19 | 14452
2015-10-18 | 14442
2015-10-15 | 14371
2015-10-14 | 14322
2015-10-10 | 14306
2015-10-08 | 14322
I need to divide each row's rate by the rate of the next row (current row + 1) and take the natural log. Is it possible to get results that look like this?
Date | Rate | PX
------------------------------
2015-10-20 | 14463 | 0.000761 -> LN(14463/14452)
2015-10-19 | 14452 | 0.000692 -> LN(14452/14442)
2015-10-18 | 14442 | 0.004928 -> LN(14442/14371)
2015-10-15 | 14371 | 0.003415 -> LN(14371/14322)
2015-10-14 | 14322 | 0.001118 -> LN(14322/14306)
2015-10-10 | 14306 | -0.00112 -> LN(14306/14322)
2015-10-08 | 14322 | 0 -> 0 (because no data after this row)
I have tried many ways, but still can't find a solution. Can anyone help with the query? Thanks in advance.
In standard SQL you would simply use LAG to read the value from the previous record. In MySQL you need a workaround. The easiest way might be to select all rows twice and number them on-the-fly; then you can join by row number:
select
this.rdate, this.rate,
ln(this.rate / prev.rate) as px
from
(
select @rownum1 := @rownum1 + 1 as rownum, rates.*
from rates
cross join (select @rownum1 := 0) init
order by rdate
) this
left join
(
select @rownum2 := @rownum2 + 1 as rownum, rates.*
from rates
cross join (select @rownum2 := 0) init
order by rdate
) prev on prev.rownum = this.rownum - 1
order by this.rdate desc;
I had to use different rownum variable names in the two subqueries, by the way, as MySQL got confused otherwise. I consider this a flaw, but I must admit MySQL's variables-in-SQL thing is still kind of alien to me :-)
SQL fiddle: http://www.sqlfiddle.com/#!9/341c4/7
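For reference, on MySQL 8.0+ (or any database with window functions) the LAG approach mentioned at the top works directly; a sketch using the same rates(rdate, rate) names as the workaround above:
SELECT rdate,
       rate,
       COALESCE(LN(rate / LAG(rate) OVER (ORDER BY rdate)), 0) AS px
FROM rates
ORDER BY rdate DESC;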

How can I make an SQL query that returns time differences between checkins and checkouts?

I'm using mysql and I've got a table similar to this one:
id | user | task | time | checkout
----+-------+------+-----------------------+---------
1 | 1 | 1 | 2014-11-25 17:00:00 | 0
2 | 2 | 2 | 2014-11-25 17:00:00 | 0
3 | 1 | 1 | 2014-11-25 18:00:00 | 1
4 | 1 | 2 | 2014-11-25 19:00:00 | 0
5 | 2 | 2 | 2014-11-25 20:00:00 | 1
6 | 1 | 2 | 2014-11-25 21:00:00 | 1
7 | 1 | 1 | 2014-11-25 21:00:00 | 0
8 | 1 | 1 | 2014-11-25 22:00:00 | 1
id is just an autogenerated primary key, and checkout is 0 if that row registered a user checking in and 1 if the user was checking out from the task.
I would like to know how to make a query that returns how much time has a user spent at each task, that is to say, I want to know the sum of the time differences between the checkout=0 time and the nearest checkout=1 time for each user and task.
Edit: to make things clearer, the results I'd expect from my query would be:
user | task | SUM(timedifference)
------+------+-----------------
1 | 1 | 02:00:00
1 | 2 | 02:00:00
2 | 2 | 03:00:00
I have tried using SUM(UNIX_TIMESTAMP(time) - UNIX_TIMESTAMP(time)), while grouping by user and task to figure out how much time had elapsed, but I don't know how to make the query only sum the differences between the particular times I want instead of all of them.
Can anybody help? Is this at all possible?
As all the comments tell you, your current table structure is not ideal. However, it's still possible to pair check-ins with check-outs. This is a SQL Server implementation, but I am sure you can translate it to MySQL:
SELECT id
, user_id
, task
, minutes_per_each_task_instance = DATEDIFF(minute, time, (
SELECT TOP 1 time
FROM test AS checkout
WHERE checkin.user_id = checkout.user_id
AND checkin.task = checkout.task
AND checkin.id < checkout.id
AND checkout.checkout = 1
ORDER BY checkout.time
))
FROM test AS checkin
WHERE checkin.checkout = 0
The above code works but will become slower and slower as your table starts to grow. After a couple of hundred thousand rows it will become noticeable.
I suggest renaming the time column to checkin and, instead of having a boolean checkout field, making it a datetime that you update when the user checks out. That way you will have half the number of records and no complex logic for reading or querying.
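A minimal sketch of that restructured table and the query it enables (names like task_log are assumptions, not your actual schema):
-- one row per check-in, with the check-out time filled in later
CREATE TABLE task_log (
    id        INT AUTO_INCREMENT PRIMARY KEY,
    `user`    INT,
    `task`    INT,
    checkin   DATETIME NOT NULL,
    checkout  DATETIME NULL   -- updated when the user checks out
);

-- total time per user and task then becomes a single aggregation
SELECT `user`, `task`,
       SEC_TO_TIME(SUM(TIMESTAMPDIFF(SECOND, checkin, checkout))) AS total_time
FROM task_log
WHERE checkout IS NOT NULL
GROUP BY `user`, `task`;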
You can determine with a ranking method which check-in/check-out records match, and calculate the time differences between them.
In my example, new_table is the name of your table.
SELECT n.user, n.task, n.time, n.checkout,
CASE WHEN @prev_user = n.user
AND @prev_task = n.task
AND @prev_checkout = 0
AND n.checkout = 1
AND @prev_time IS NOT NULL
THEN HOUR(TIMEDIFF(n.time, @prev_time)) END AS timediff,
@prev_time := n.time,
@prev_user := n.user,
@prev_task := n.task,
@prev_checkout := n.checkout
FROM new_table n,
(SELECT @prev_user := 0, @prev_task := 0, @prev_checkout := 0, @prev_time := NULL) a
ORDER BY user, task, `time`
Then sum the time differences (timediff) by wrapping it in another select
SELECT x.user, x.task, sum(x.timediff) as total
FROM (
SELECT n.user, n.task,n.time, n.checkout ,
CASE WHEN @prev_user = n.user
AND @prev_task = n.task
AND @prev_checkout = 0
AND n.checkout = 1
AND @prev_time IS NOT NULL
THEN HOUR(TIMEDIFF(n.time, @prev_time)) END AS timediff,
@prev_time := n.time,
@prev_user := n.user,
@prev_task := n.task,
@prev_checkout := n.checkout
FROM new_table n,
(SELECT @prev_user := 0, @prev_task := 0, @prev_checkout := 0, @prev_time := NULL) a
ORDER BY user, task, `time`
) x
GROUP BY x.user, x.task
It would probably be easier to understand by changing the table structure, though, if that is at all possible. Then the SQL wouldn't have to be so complicated and would be more efficient. But to answer your question: it is possible. :)
In the above examples, names prefixed with '@' are MySQL user variables; you can use ':=' to set a variable to a value. Cool stuff, ay?
Select the MAX of check-outs and check-ins independently, map them based on user and task, and calculate the time difference:
select checkout.user, checkout.task,
SUM(UNIX_TIMESTAMP(checkin.time) - UNIX_TIMESTAMP(checkout.time)) from
(select user, task, MAX(time) as time
from checkouts
where checkout = 0
group by user, task) checkout
inner join
(select user, task, MAX(time) as time
from checkouts
where checkout = 1
group by user, task) checkin
on (checkin.time > checkout.time
and checkin.user = checkout.user
and checkin.task = checkout.task)
group by checkout.user, checkout.task
This should work. Join the table to itself and select the minimum times:
SELECT
`user`,
`task`,
SUM(
UNIX_TIMESTAMP(checkout) - UNIX_TIMESTAMP(checkin)
)
FROM
(SELECT
so1.`user`,
so1.`task`,
MIN(so1.`time`) AS checkin,
MIN(so2.`time`) AS checkout
FROM
so so1
INNER JOIN so so2
ON (
so1.`id` < so2.`id`
AND so1.`user` = so2.`user`
AND so1.`task` = so2.`task`
AND so1.`checkout` = 0
AND so2.`checkout` = 1
AND so1.`time` < so2.`time`
)
GROUP BY `user`,
`task`,
so1.`time`) a
GROUP BY `user`,
`task` ;
As others have suggested, though, this will not scale too well as it is; you would need to adjust it if it starts handling more data.

MySQL - Count Values occurring between other values

I'd like to count how many occurrences of a value happen before a specific value
Below is my starting table
+-----------------+--------------+------------+
| Id | Activity | Time |
+-----------------+--------------+------------+
| 1 | Click | 1392263852 |
| 2 | Error | 1392263853 |
| 3 | Finish | 1392263862 |
| 4 | Click | 1392263883 |
| 5 | Click | 1392263888 |
| 6 | Finish | 1392263952 |
+-----------------+--------------+------------+
I'd like to count how many clicks happen before a finish happens.
I've got a very roundabout way of doing it where I write a function to find the last
finished activity and query the clicks between the finishes.
Also repeat this for Error.
What I'd like to achieve is the below table
+-----------------+--------------+------------+--------------+------------+
| Id | Activity | Time | Clicks | Error |
+-----------------+--------------+------------+--------------+------------+
| 3 | Finish | 1392263862 | 1 | 1 |
| 6 | Finish | 1392263952 | 2 | 0 |
+-----------------+--------------+------------+--------------+------------+
This table is very long so I'm looking for an efficient solution.
If anyone has any ideas.
Thanks heaps!
This is a complicated problem. Here is an approach to solving it. The rows between the "Finish" records need to be identified as belonging to the same group, by assigning a group identifier to them. This identifier can be calculated by counting the number of "Finish" records with an id greater than or equal to the current row's.
Once this is assigned, your results can be calculated using an aggregation.
The group identifier can be calculated using a correlated subquery:
select max(id) as id, 'Finish' as Activity, max(time) as Time,
sum(Activity = 'Click') as Clicks, sum(activity = 'Error') as Error
from (select s.*,
(select sum(s2.activity = 'Finish')
from starting s2
where s2.id >= s.id
) as FinishCount
from starting s
) s
group by FinishCount;
A version that leverages user(session) variables
SELECT MAX(id) id,
MAX(activity) activity,
MAX(time) time,
SUM(activity = 'Click') clicks,
SUM(activity = 'Error') error
FROM
(
SELECT t.*, @g := IF(activity <> 'Finish' AND @a = 'Finish', @g + 1, @g) g, @a := activity
FROM table1 t CROSS JOIN (SELECT @g := 0, @a := NULL) i
ORDER BY time
) q
GROUP BY g
Output:
| ID | ACTIVITY | TIME | CLICKS | ERROR |
|----|----------|------------|--------|-------|
| 3 | Finish | 1392263862 | 1 | 1 |
| 6 | Finish | 1392263952 | 2 | 0 |
Here is SQLFiddle demo
Try:
select x.id
, x.activity
, x.time
, sum(case when y.activity = 'Click' then 1 else 0 end) as clicks
, sum(case when y.activity = 'Error' then 1 else 0 end) as errors
from tbl x, tbl y
where x.activity = 'Finish'
and y.time < x.time
and (y.time > (select max(z.time) from tbl z where z.activity = 'Finish' and z.time < x.time)
or x.time = (select min(z.time) from tbl z where z.activity = 'Finish'))
group by x.id
, x.activity
, x.time
order by x.id
Here's another method of using variables, which is somewhat different to @peterm's:
SELECT
Id,
Activity,
Time,
Clicks,
Errors
FROM (
SELECT
t.*,
@clicks := @clicks + (activity = 'Click') AS Clicks,
@errors := @errors + (activity = 'Error') AS Errors,
@clicks := @clicks * (activity <> 'Finish'),
@errors := @errors * (activity <> 'Finish')
FROM
`starting` t
CROSS JOIN
(SELECT @clicks := 0, @errors := 0) i
ORDER BY
time
) AS s
WHERE Activity = 'Finish'
;
What's similar to Peter's query is that this one uses a subquery that's returning all the rows, setting some variables along the way and returning the variables' values as columns. That may be common to most methods that use variables, though, and that's where the similarity between these two queries ends.
The difference is in how the accumulated results are calculated. Here all the accumulation is done in the subquery, and the main query merely filters the derived dataset on Activity = 'Finish' to return the final result set. In contrast, the other query uses grouping and aggregation at the outer level to get the accumulated results, which may make it slower than mine in comparison.
At the same time Peter's suggestion is more easily scalable in terms of coding. If you happen to have to extend the number of activities to account for, his query would only need expansion in the form of adding one SUM(activity = '...') AS ... per new activity to the outer SELECT, whereas in my query you would need to add a variable and several expressions, as well as a column in the outer SELECT, per every new activity, which would bloat the resulting code much more quickly.