I'm having trouble with a certain query in MySQL, and I hope someone can help me.
A little background info:
We have a call center reporting API available to us, from our "telephony as a service" company. The pertinent fields I'm grabbing from their XML interface are:
agent_name
interaction_id
origination <-- this is the "caller ID", which is not always accurate
create_timestamp
accept_timestamp
abandon_timestamp
queue_id
Regular phone calls (interactions, in this case) are answered by each agent, after having queued in our "Main" queue. The create_timestamp field is the time the call starts queuing to agents belonging to "Main", and the accept_timestamp is the time when the agent answers the call. The abandon_timestamp is the time the caller gets tired of queuing and 1) hangs up, or 2) presses a menu option to go to voicemail. The voicemail is saved as an .mp3 file and is queued to the same group of agents as if it were a new, inbound call, except it is associated with the "Main_VM" queue rather than the "Main" queue.
The tricky part is this:
If a call comes in and is "abandoned" to voicemail, the interaction_id does not stay the same for the voicemail .mp3 that queues to the agents. Nor is it always incremented by 1 ... there are times when other calls come in during the time the person has been queuing. Here are example record snippets:
A)
+----------------+--------------+---------------------+---------------------+---------------------+---------------+
| interaction_id | origination | create_timestamp | accept_timestamp | abandon_timestamp | queue_id |
+----------------+--------------+---------------------+---------------------+---------------------+---------------+
| 21771 | NNNPPPXXXX | 2012-09-04 08:26:15 | 0000-00-00 00:00:00 | 2012-09-04 08:27:17 | Main |
| 21772 | NNNPPPXXXX | 2012-09-04 08:27:44 | 2012-09-04 08:32:07 | 0000-00-00 00:00:00 | Main_VM |
+----------------+--------------+---------------------+---------------------+---------------------+---------------+
B)
+----------------+--------------+---------------------+---------------------+---------------------+---------------+
| interaction_id | origination | create_timestamp | accept_timestamp | abandon_timestamp | queue_id |
+----------------+--------------+---------------------+---------------------+---------------------+---------------+
| 2195 | AAAAAAAAAA | 2011-10-28 09:21:02 | 2011-10-28 09:23:50 | 0000-00-00 00:00:00 | Main |
| 2197 | NNNPPPXXXX | 2011-10-28 09:22:37 | 0000-00-00 00:00:00 | 2011-10-28 09:26:42 | Main |
| 2199 | BBBBBBBBBB | 2011-10-28 09:23:38 | 2011-10-28 09:27:23 | 0000-00-00 00:00:00 | Main |
| 2200 | CCCCCCCCCC | 2011-10-28 09:24:40 | 2011-10-28 09:33:09 | 0000-00-00 00:00:00 | Main |
| 2201 | NNNPPPXXXX | 2011-10-28 09:27:16 | 2011-10-28 09:42:28 | 0000-00-00 00:00:00 | Main_VM |
+----------------+--------------+---------------------+---------------------+---------------------+---------------+
In MySQL, I need to be able to associate interaction_id 2197 with 2201, and 21771 with 21772, for example. I'll be doing things like TIMESTAMPDIFF() to calculate the "total" time to answer the call, SLA met & abandoned percentages, ...etc.; while also accounting for hours of operation and holidays. I think I have most of that worked out, but my main trouble is what I've just described.
NOTE: I intend to change the "0000-00-00 00:00:00" timestamps to NULL. I'm still in planning.
I made some headway on this, and I thought I'd share. I just had one of the fields return the most recent call, using a LIMIT 1:
select interaction_id, origination, create_timestamp, accept_timestamp, abandon_timestamp, queue_name, parent_call, agi.agent_name
from (
    (
        select interaction_id, origination, create_timestamp, accept_timestamp, abandon_timestamp, queue_name,
            (
                -- most recent abandoned call from the same number, within 3 minutes of the voicemail
                select q1.interaction_id
                from queue_interactions q1
                where q1.origination = q2.origination
                and ABS(timestampdiff(SECOND, q1.abandon_timestamp, q2.create_timestamp)) < 180
                order by q1.abandon_timestamp desc
                LIMIT 1
            ) as parent_call
        from queue_interactions q2
        where q2.queue_name = "Service Desk VM"
    )
    UNION
    (
        select interaction_id, origination, create_timestamp, accept_timestamp, abandon_timestamp, queue_name, NULL as parent_call
        from queue_interactions q3
        where q3.queue_name = "Service Desk"
    )
) a natural left join agent_interactions agi
order by a.create_timestamp
;
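For anyone who wants to check the pairing rule outside the database, here is a minimal Python sketch of the same idea. It assumes the zero timestamps have already been turned into NULL (None here), reuses the 180-second window from the query above, and only looks backwards in time (the abandon must come before the voicemail is created), which is my assumption rather than something the API guarantees. The function name pair_voicemails is made up for illustration.

```python
from datetime import datetime, timedelta

def parse(ts):
    """Parse a 'YYYY-MM-DD HH:MM:SS' string into a datetime."""
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")

def pair_voicemails(rows, max_gap=timedelta(seconds=180)):
    """Map each Main_VM interaction_id to the abandoned Main call that produced it."""
    pairs = {}
    for vm in rows:
        if vm["queue_id"] != "Main_VM":
            continue
        created = parse(vm["create_timestamp"])
        # Abandoned calls from the same caller ID, abandoned at most max_gap earlier.
        candidates = [
            r for r in rows
            if r["queue_id"] == "Main"
            and r["origination"] == vm["origination"]
            and r["abandon_timestamp"] is not None
            and timedelta(0) <= created - parse(r["abandon_timestamp"]) <= max_gap
        ]
        if candidates:
            # Keep the most recently abandoned candidate.
            parent = max(candidates, key=lambda r: parse(r["abandon_timestamp"]))
            pairs[vm["interaction_id"]] = parent["interaction_id"]
    return pairs

# Rows taken from examples A and B in the question (only the fields needed here).
rows = [
    {"interaction_id": 21771, "origination": "NNNPPPXXXX", "queue_id": "Main",
     "create_timestamp": "2012-09-04 08:26:15", "abandon_timestamp": "2012-09-04 08:27:17"},
    {"interaction_id": 21772, "origination": "NNNPPPXXXX", "queue_id": "Main_VM",
     "create_timestamp": "2012-09-04 08:27:44", "abandon_timestamp": None},
    {"interaction_id": 2197, "origination": "NNNPPPXXXX", "queue_id": "Main",
     "create_timestamp": "2011-10-28 09:22:37", "abandon_timestamp": "2011-10-28 09:26:42"},
    {"interaction_id": 2199, "origination": "BBBBBBBBBB", "queue_id": "Main",
     "create_timestamp": "2011-10-28 09:23:38", "abandon_timestamp": None},
    {"interaction_id": 2201, "origination": "NNNPPPXXXX", "queue_id": "Main_VM",
     "create_timestamp": "2011-10-28 09:27:16", "abandon_timestamp": None},
]

print(pair_voicemails(rows))
```

Since the caller ID is not always accurate, a voicemail with no match simply stays unpaired in this sketch.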
I have spent a lot of time searching for a solution to my problem. I think I'm close, but my final query doesn't work.
First of all, I have a table of water meter readings (a running counter) sampled every 10 minutes.
----------------------------------
| DateTime | Counter |
----------------------------------
| 2020-05-13 15:00:03 | 38450 |
| 2020-05-13 15:10:03 | 38454 |
| 2020-05-15 15:00:03 | 38500 |
| 2020-06-02 12:10:03 | 38510 |
| 2020-06-15 12:10:03 | 38600 |
----------------------------------
Some samples may be missing from the table.
I would like to derive a table showing my consumption by day, week, month and year.
I have found many examples, but none works as I expect.
For the example table above, I expect to get:
------------------------------------------------------------------------------
| fromDateTime | toDateTime | fromCounter | toCounter | diff |
------------------------------------------------------------------------------
| 2020-05-13 15:00:03 | 2020-06-02 12:10:03 | 38450 | 38510 | 60 |
| 2020-06-02 12:10:03 | 2020-06-15 12:10:03 | 38510 | 38600 | 90 |
------------------------------------------------------------------------------
I have written this query:
select mt1.DateTime as fromDateTime,
mt2.DateTime as toDateTime,
mt1.Counter as fromCounter,
mt2.Counter as toCounter,
(mt2.Counter - mt1.Counter) as diff
from WaterTest as mt1
left join WaterTest as mt2
on mt2.DateTime=(
select max(dd.datetime) as DateTime from (SELECT MIN(DateTime) as DateTime
FROM WaterTest as mt3
WHERE month(mt3.DateTime) = month(mt1.DateTime + INTERVAL 1 month)
union ALL
select max(DateTime) as DateTime
from WaterTest as mt4
where month(mt4.DateTime) = month(mt1.DateTime)
) as dd
)
But MySQL returns an error saying "Field 'mt1.DateTime' unknown in where clause".
Can someone help me find where I'm going wrong?
Am I on the right track?
(And, of course, if there is a better query, I'd be glad to hear it. :) )
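To pin down the expected result independently of SQL, here is a small Python sketch of the monthly case. It assumes the rule implied by the expected table: each period runs from the first reading of a month to the first reading of the next month that has data, and the last period ends at the last reading available. The function name monthly_consumption is made up. (As far as I can tell, the error itself comes from the derived table dd referring to mt1, which a subquery in FROM normally cannot see.)

```python
from datetime import datetime
from itertools import groupby

readings = [
    ("2020-05-13 15:00:03", 38450),
    ("2020-05-13 15:10:03", 38454),
    ("2020-05-15 15:00:03", 38500),
    ("2020-06-02 12:10:03", 38510),
    ("2020-06-15 12:10:03", 38600),
]

def monthly_consumption(readings):
    """Return (fromDateTime, toDateTime, fromCounter, toCounter, diff) per month."""
    rows = sorted((datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"), counter)
                  for ts, counter in readings)
    # One list of readings per (year, month), in chronological order.
    months = [list(group)
              for _, group in groupby(rows, key=lambda r: (r[0].year, r[0].month))]
    result = []
    for i, month in enumerate(months):
        start = month[0]
        # End at the first reading of the next month, or at the last reading we have.
        end = months[i + 1][0] if i + 1 < len(months) else month[-1]
        result.append((start[0], end[0], start[1], end[1], end[1] - start[1]))
    return result

for row in monthly_consumption(readings):
    print(row)
```

Grouping by day, week or year would only change the key passed to groupby.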
I have a single-table SQL database built from DHCPD logs, structured as below:
+------+-------+------+----------+---------+-------------------+-----------------+
| id | Month | Day | Time | Type | MAC | ClientIP |
+------+-------+------+----------+---------+-------------------+-----------------+
| 9305 | Nov | 24 | 03:20:00 | DHCPACK | 00:04:f2:4b:dd:51 | 10.123.246.116 |
| 9307 | Nov | 24 | 03:20:07 | DHCPACK | 00:04:f2:99:4c:ba | 10.123.154.176 |
| 9310 | Nov | 24 | 03:20:08 | DHCPACK | 00:19:bb:cf:cd:28 | 10.99.107.3 |
| 9311 | Nov | 24 | 03:20:08 | DHCPACK | 00:19:bb:cf:cd:28 | 10.99.107.3 |
Every DHCP event from the log will eventually make its way into this database, so events from any point in time will be potentially used in the construction of graphs. To make use of the data for graphing, I need to be able to create an output table with multiple columns, but with values derived from a count of those in a single column matching a specific pattern.
The closest thing I've managed to come up with is this query:
select 'Data' as ClientIP, count(*) from Log where ClientIP like '10.99%' and MAC like '00:04:f2%'
union
select 'Voice' as ClientIP, count(*) from Log where ClientIP like '10.123%' and MAC like '00:04:f2%';
Which yields the following result:
+-----------+-------+
| ClientIP | Count |
+-----------+-------+
| Data | 4618 |
| Voice | 13876 |
+-----------+-------+
Fine for a one-off query, but I want to take those two rows, turn them into two columns, and run the same query with one row per hour (for instance). I want something like this:
+------+-------+------+
| Hour | Voice | Data |
+------+-------+------+
| 03 | 22 | 4 |
| 04 | 123 | 23 |
| 05 | 45 | 5 |
Any advice is greatly welcomed.
Thanks
You can group by hour and use conditional aggregation to count Data and Voice traffic.
For example:
SELECT
    HOUR(`Time`) AS `Hour`,
    SUM(CASE WHEN ClientIP LIKE '10.99%' AND MAC LIKE '00:04:f2%' THEN 1 ELSE 0 END) AS `Data`,
    SUM(CASE WHEN ClientIP LIKE '10.123%' AND MAC LIKE '00:04:f2%' THEN 1 ELSE 0 END) AS `Voice`
FROM Log
GROUP BY HOUR(`Time`)
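If you want to try the pattern without touching the real database, here is a rough sketch that runs the same kind of query against an in-memory SQLite table from Python. SQLite is only a stand-in for MySQL here: it has no HOUR(), so the sketch takes the first two characters of the Time string instead, and the last sample row is made up so that the Data column is not all zeros.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Log (id INTEGER, Time TEXT, MAC TEXT, ClientIP TEXT)")
conn.executemany("INSERT INTO Log VALUES (?, ?, ?, ?)", [
    (9305, "03:20:00", "00:04:f2:4b:dd:51", "10.123.246.116"),
    (9307, "03:20:07", "00:04:f2:99:4c:ba", "10.123.154.176"),
    (9310, "03:20:08", "00:19:bb:cf:cd:28", "10.99.107.3"),
    (9311, "03:20:08", "00:19:bb:cf:cd:28", "10.99.107.3"),
    (9400, "04:01:00", "00:04:f2:00:00:01", "10.99.1.1"),  # made-up row
])

# One row per hour, one conditional SUM per output column.
rows = conn.execute("""
    SELECT substr(Time, 1, 2) AS Hour,
           SUM(CASE WHEN ClientIP LIKE '10.123%' AND MAC LIKE '00:04:f2%' THEN 1 ELSE 0 END) AS Voice,
           SUM(CASE WHEN ClientIP LIKE '10.99%' AND MAC LIKE '00:04:f2%' THEN 1 ELSE 0 END) AS Data
    FROM Log
    GROUP BY substr(Time, 1, 2)
    ORDER BY Hour
""").fetchall()

for row in rows:
    print(row)
```

Rows 9310 and 9311 are counted in neither column because their MAC prefix does not match.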
Create a separate summary table with the layout you want:
+------+-------+------+
| Hour | Voice | Data |
+------+-------+------+
and keep it up to date, for example with an insert trigger on Log or with an hourly scheduled event.
I have an event system and for my repeat events I am using a cron like system.
Repeat Event:
+----+----------+--------------+
| id | event_id | repeat_value |
+----+----------+--------------+
| 1 | 11 | *_*_* |
| 2 | 12 | *_*_2 |
| 3 | 13 | *_*_4/2 |
| 4 | 14 | 23_*_* |
| 5 | 15 | 30_05_* |
+----+----------+--------------+
NOTE: The cron value is day_month_day of week
Event:
+----+------------------------+---------------------+---------------------+
| id | name | start_date_time | end_date_time |
+----+------------------------+---------------------+---------------------+
| 11 | Repeat daily | 2014-04-30 12:00:00 | 2014-04-30 12:15:00 |
| 12 | Repeat weekly | 2014-05-06 12:00:00 | 2014-05-06 13:00:00 |
| 13 | Repeat every two weeks | 2014-05-08 12:45:00 | 2014-05-08 13:45:00 |
| 14 | Repeat monthly | 2014-05-23 15:15:00 | 2014-05-23 16:00:00 |
| 15 | Repeat yearly | 2014-05-30 07:30:00 | 2014-05-30 10:15:00 |
+----+------------------------+---------------------+---------------------+
Anyway I have a query to select the events:
SELECT *
FROM RepeatEvent
JOIN `Event`
ON `Event`.`id` = `RepeatEvent`.`event_id`
That produces:
+----+----------+--------------+----+------------------------+---------------------+---------------------+
| id | event_id | repeat_value | id | name | start_date_time | end_date_time |
+----+----------+--------------+----+------------------------+---------------------+---------------------+
| 1 | 11 | *_*_* | 11 | Repeat daily | 2014-04-30 12:00:00 | 2014-04-30 12:15:00 |
| 2 | 12 | *_*_2 | 12 | Repeat weekly | 2014-05-06 12:00:00 | 2014-05-06 13:00:00 |
| 3 | 13 | *_*_4/2 | 13 | Repeat every two weeks | 2014-05-08 12:45:00 | 2014-05-08 13:45:00 |
| 4 | 14 | 23_*_* | 14 | Repeat monthly | 2014-05-23 15:15:00 | 2014-05-23 16:00:00 |
| 5 | 15 | 30_05_* | 15 | Repeat yearly | 2014-05-30 07:30:00 | 2014-05-30 10:15:00 |
+----+----------+--------------+----+------------------------+---------------------+---------------------+
However, I want to select events within a month. I will only have certain conditions: daily, weekly, every two weeks, monthly and yearly.
I want to put in my where clause a way to divide the string of the repeat value and if it fits any of the following conditions to show it as a result (repeatEvent is row that is being interrogated, search is the date being looked for):
parts = string_divide(repeat_value, '_')
daily = parts(0)
month = parts(1)
dayOfWeek = parts(2)
if (daily == '*' && month == '*' && dayOfWeek == '*') // returns all the daily events, as they always happen
    return repeatEvent
if (daily == '*' && month == '*' && dayOfWeek == search.dayOfWeek) // returns all the weekly events on a specific day of the week
    return repeatEvent
if (daily == search.date && month == '*' && dayOfWeek == '*') // returns all the monthly events on a specific day of the month
    return repeatEvent
if (contains(dayOfWeek, '/'))
    weekParts = string_divide(dayOfWeek, '/')
    specificDayOfWeek = weekParts(0)
    if (specificDayOfWeek == repeatEvent.start_date.dayNumber)
        weeks = (timestampOf(search.timestamp) - timestampOf(repeatEvent.start_date)) / 604800
        if (weeks is 0 or even) // every second week, counted from the start date
            return repeatEvent
if (daily == search.date && month == search.month && dayOfWeek == '*') // returns a single yearly event (shouldn't often crop up)
    return repeatEvent
// everything else is either an unknown format of repeat_value or not an event on this day
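To make the rules above concrete, here is a runnable Python sketch of the same check. A few things in it are assumptions on my part: days of the week are numbered as in isoweekday() (Monday = 1, which fits Tuesday = 2 and Thursday = 4 in the sample data), an event never occurs before its start date, and the field checks are combined generically instead of being spelled out case by case. The function name occurs_on is made up.

```python
from datetime import date

def occurs_on(repeat_value, start, day):
    """Return True if an event first held on `start` repeats on `day`."""
    if day < start:
        return False  # assumption: no occurrences before the first one
    dom, month, dow = repeat_value.split("_")
    if "/" in dow:
        # "4/2" means: on weekday 4, every 2nd week counted from the start date.
        weekday, interval = dow.split("/")
        weeks = (day - start).days // 7
        return (dom == "*" and month == "*"
                and day.isoweekday() == int(weekday)
                and weeks % int(interval) == 0)
    # '*' matches anything; otherwise the field must equal the date component.
    dom_ok = dom == "*" or int(dom) == day.day
    month_ok = month == "*" or int(month) == day.month
    dow_ok = dow == "*" or int(dow) == day.isoweekday()
    return dom_ok and month_ok and dow_ok

# Event 13 from the question: every two weeks on Thursday, starting 2014-05-08.
print(occurs_on("*_*_4/2", date(2014, 5, 8), date(2014, 5, 15)))  # one week later
print(occurs_on("*_*_4/2", date(2014, 5, 8), date(2014, 5, 22)))  # two weeks later
```

To answer the "which events occur in this month" question, you could call this for each day of the month and keep the events that match at least once.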
To summarise I want to run a query in which the repeat value is split in the where clause and I can interrogate the split items. I have looked at cursors but the internet seems to advise against them.
I could process the results of selecting all the repeat events in PHP, however, I imagine this being very slow.
Here is what I would like to see if looking at the month of April:
+----------+--------------+----+------------------------+---------------------+---------------------+
| event_id | repeat_value | id | name | start_date_time | end_date_time |
+----------+--------------+----+------------------------+---------------------+---------------------+
| 11 | *_*_* | 11 | Repeat daily | 2014-04-30 12:00:00 | 2014-04-30 12:15:00 |
+----------+--------------+----+------------------------+---------------------+---------------------+
Here is what I would like to see if looking at the month of May
+----------+--------------+----+------------------------+---------------------+---------------------+
| event_id | repeat_value | id | name | start_date_time | end_date_time |
+----------+--------------+----+------------------------+---------------------+---------------------+
| 11 | *_*_* | 11 | Repeat daily | 2014-04-30 12:00:00 | 2014-04-30 12:15:00 |
| 12 | *_*_2 | 12 | Repeat weekly | 2014-05-06 12:00:00 | 2014-05-06 13:00:00 |
| 13 | *_*_4/2 | 13 | Repeat every two weeks | 2014-05-08 12:45:00 | 2014-05-08 13:45:00 |
| 14 | 23_*_* | 14 | Repeat monthly | 2014-05-23 15:15:00 | 2014-05-23 16:00:00 |
| 15 | 30_05_* | 15 | Repeat yearly | 2014-05-30 07:30:00 | 2014-05-30 10:15:00 |
+----------+--------------+----+------------------------+---------------------+---------------------+
Here is what I would like to see if looking at the month of June
+----------+--------------+----+------------------------+---------------------+---------------------+
| event_id | repeat_value | id | name | start_date_time | end_date_time |
+----------+--------------+----+------------------------+---------------------+---------------------+
| 11 | *_*_* | 11 | Repeat daily | 2014-04-30 12:00:00 | 2014-04-30 12:15:00 |
| 12 | *_*_2 | 12 | Repeat weekly | 2014-05-06 12:00:00 | 2014-05-06 13:00:00 |
| 13 | *_*_4/2 | 13 | Repeat every two weeks | 2014-05-08 12:45:00 | 2014-05-08 13:45:00 |
| 14 | 23_*_* | 14 | Repeat monthly | 2014-05-23 15:15:00 | 2014-05-23 16:00:00 |
+----------+--------------+----+------------------------+---------------------+---------------------+
You could put a bandaid on this, but no one would be doing you any favors to tell you that that is the answer.
If your MySQL database can be changed, I would strongly advise you to split your current underscore-delimited column into three separate columns: day, month, and day_of_year. I would also advise you to store them as INT rather than VARCHAR. This will make the table faster and MUCH easier to search and parse, because the data no longer has to be unpacked by string handling before it can be compared. It is most of the way there already.
Here's why:
Reason 1: Your Table is not Optimized
Your table is not optimized and will be slowed regardless of what you choose to do at this stage. SQL is not built to have multiple values in one column. The entire point of an SQL database is to split values into different columns and rows.
The advantage to normalizing this table is that it will be far quicker to search it, and you will be able to build queries in MySQL. Take a look at Normalization. It is a complicated concept, but once you get it you will avoid creating messy and complicated programs.
Reason 2: Your Table could be tweaked slightly to harness the power of computer date/time functions.
Computers track time as Unix epoch time: a count of the seconds elapsed since 00:00:00 UTC on 1 January 1970. Further, every computer and computer-based program or system has quick, built-in date and time functions. MySQL is no different.
I would also recommend storing all of these as integers. repeat_doy (day of year) can easily be a SMALLINT or at least a standard INT, and instead of storing a month and day, you can store the actual 1-365 day of the year. You can use DAYOFYEAR(NOW()) to compute this in MySQL. To turn it back into a date you can use MAKEDATE(YEAR(NOW()), repeat_doy). Instead of an asterisk to signify "all", you can use either 0 or NULL.
With a cron-like system you probably will not need to do that sort of calculation anyway.
Instead, it will probably be easier to just compute the day of year elsewhere (every computer and language can do this; in Unix it is just date +%j).
Solution
Split your one repeat_value into three separate values and turn them all into integers based on UNIX time values. Day is 1-7 (or 0-6 for Sunday to Saturday), Month is 1-12, and day of year is 1-365 (remember, we are not including 366 because we are basing our year on an arbitrary non-leap year).
If you want to pull information in your SELECT query in your original format, it is much easier to use concat to merge the three columns than it is to try to search and split on one column. You can also easily harness built in MySQL functions to quickly turn what you pull into real, current, days, without a bunch of effort on your part.
To implement it in your SQL database:
+----+----------+--------------+--------------+------------+
| id | event_id | repeat_day | repeat_month | repeat_doy |
+----+----------+--------------+--------------+------------+
| 1 | 11 | * | * | * |
| 2 | 12 | * | * | 2 |
| 3 | 13 | * | * | 4/2 |
| 4 | 14 | 23 | * | * |
| 5 | 15 | 30 | 5 | * |
+----+----------+--------------+--------------+------------+
Now you should be able to build one query to get all of this data together regardless of how complicated your query. By normalizing your table, you will be able to fully harness the power of relational databases, without the headaches and hacks.
Edit
Hugo Delsing made a great point in the comments below. In my initial example I provided a fix to leap years for day_of_year in which I chose to ignore Feb 29. A much better solution removes the need for a fix. Split day_of_year to month and day with a compound index. He also has a suggestion about weeks and number of weeks, but I will just recommend you read it for more details.
Try writing the WHERE condition using these expressions:
substring_index(repeat_value, '_', 1)
instead of daily,
substring_index(substring_index(repeat_value, '_', -2), '_', 1)
instead of month,
and
substring_index(repeat_value, '_', -1)
instead of dayOfWeek.
I think you are overthinking the problem if you only want the events per month and not per day. Assuming that you always correctly fill the repeat_value, the query is very basic.
Basically, every event occurs in each month where the repeat_value is either LIKE '%_*_%' or LIKE '%_{month}_%'.
Since you mention PHP, I'm assuming you are building the query in PHP, so I used the same.
<?php
function buildQuery($searchDate) {
    // you could/should do more checking that the date is valid if the user provides the string
    $searchDate = empty($searchDate) ? date("Y-m-d") : $searchDate;
    $splitDate = explode('-', $searchDate);
    // zero-pad the month so that it matches stored values such as 30_05_*
    $month = str_pad($splitDate[1], 2, '0', STR_PAD_LEFT);
    // select every event that started before the search date
    // the \_ is needed because a bare _ would match any single character in LIKE
    $query = 'SELECT *
        FROM RepeatEvent
        JOIN `Event`
        ON `Event`.`id` = `RepeatEvent`.`event_id`
        WHERE `Event`.`start_date_time` < \''.$searchDate.'\'
        AND
        (
            `RepeatEvent`.`repeat_value` LIKE \'%\_'.$month.'\_%\'
            OR `RepeatEvent`.`repeat_value` LIKE \'%\_*\_%\'
        )
    ';
    return $query;
}
// show queries for all months on the current day/year
for ($month = 1; $month <= 12; $month++) {
    echo buildQuery(date('Y-'.$month.'-d')) . '<hr>';
}
?>
Now if the repeat_value could be wrong, you could add a simple regex check to make sure the value is always like *_*_* or *_*_*/*
You can use basic regular expressions in MySQL:
http://dev.mysql.com/doc/refman/5.0/en/pattern-matching.html
For a monthly event in May (first day) you can use a pattern like this (not tested):
[0-9\*]+\_[5\*]\_1
You can generate this pattern via PHP
I have a 'Course' table and an 'Event' table.
I would like to get all the courses that actually take place, i.e. those that are not cancelled by an event.
I have done this with a simple query for all the courses followed by script analysis (basically some loops), but this takes longer than I think it should. I believe what I want is possible in a single query, without loops.
Here are the details :
'Course' c has the fields 'date' and 'duration', and a many-to-many relation with the 'Grade' table.
'Event' e has the fields 'begin', 'end' and 'break', and a many-to-many relation with the 'Grade' table.
A course is cancelled by an event if they occur at the same time and the event is a break (e.break = 1).
A course is cancelled if all the grades of the course are covered by the events that occur at the same time (several events can occur at once, so I have to combine the grades of these events and compare them to the grades of the course). This is the part I'm doing with a loop; I have some trouble conceptualizing it.
Any help is welcome,
Thanks in advance,
PS: I'm using MySQL.
EDIT : Tables details
-Course
+-----------+-------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-----------+-------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| date | datetime | NO | | NULL | |
| duration | time | NO | | NULL | |
| type | int(11) | NO | | NULL | |
+-----------+-------------+------+-----+---------+----------------+
+-------+---------------------+----------+------+
| id | date | duration | type |
+-------+---------------------+----------+------+
| 1 | 2013-12-10 10:00:00 | 02:00:00 | 0 |
| 2 | 2013-12-11 10:00:00 | 02:00:00 | 0 |
+-------+---------------------+----------+------+
-Event
+-------------+-------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------------+-------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| begin | datetime | NO | | NULL | |
| end | datetime | YES | | NULL | |
| break | tinyint(1) | NO | | NULL | |
+-------------+-------------+------+-----+---------+----------------+
+----+---------------------+---------------------+-------+
| id | begin | end | break |
+----+---------------------+---------------------+-------+
| 1 | 2013-12-10 00:00:00 | 2013-12-11 23:59:00 | 1 |
+----+---------------------+---------------------+-------+
-course_grade
+-----------+----------+
| course_id | grade_id |
+-----------+----------+
| 1 | 66 |
| 2 | 65 |
| 2 | 66 |
+-----------+----------+
-event_grade
+----------+----------+
| grade_id | event_id |
+----------+----------+
| 66 | 1 |
+----------+----------+
So here, only course 2 should appear, because course 1 has only one grade, and this grade has an event.
I like riddles, and this is a nice one; I think it has many solutions.
As you say 'Any help is welcome', I'll give an answer although it's not the full solution (and it does not fit into a comment).
I don't know whether you just want (A) the bare statement (over and out) or (B) to understand how to get to the solution, so I'll go with (B).
I'll start with what I would change before getting to the solution:
you are mixing date, datetime, start, end and duration; try to use only one scheme (if it is your model, of course), i.e.
an event/course has a start and an end time (or a start and a duration)
duration should (IMHO) not be a time
try to find the smallest time slice for events/courses (are there 1-second events, or is a granularity of 5 minutes (i.e. 10:00, 10:05, 10:10 and so on) a valid compromise?)
My solution is a pragmatic one, not an academic one
(it sounds funny, but it works well for a similar problem I had; see the annotation)
Create a table (T_TIME_OF_DAY) holding every time slot from 00:00, 00:05, ... to 23:55
Create a table (T_DAYS) covering a valid and useful range (a year?)
take the Cartesian product of the two, call it points in time (i.e. select date, time from T_DAYS, T_TIME_OF_DAY with no join condition): days x times is 300*24*12, roughly 100,000 rows if you need to look at a whole year (and 5 minutes is fine for you). That's not much and no problem
the next step is to join the courses to the points in time they cover (and the rows are down to far fewer than 100,000)
if you then join these with your events (again by point in time), you should get what you want.
simplified quarters of a day:
12 12 12 12 12 12 12 12
08 09 10 11 12 13 14 15
|...|...|...|...|...|...|...|...
grade 65 (C).............2..................
grade 66 (C).........1...2..................
grade 65 (E)................................
grade 66 (e)........1111..................
(annotation: I use this logic to calculate the availability of services with regard to their downtimes per month / year, and could reuse the data, already prepared in time slices, for display)
(second answer, because it is a totally different and more standard approach)
I made an SQLFiddle for you
So, here is what to do to get to a solution:
step one (in mind) select courses, grades (let's call them C)
step two (in mind) select events, grades (let's call them E)
and - tada -
select everything from C where there are no rows in E that have the same grade, overlap in time (somehow) and are of event type 'break'
So your solution:
select
    c.id, c.date start_time, ADDTIME(c.date, c.duration) end_time, cg.grade_id
from Course c join course_grade cg on c.id = cg.course_id
where not exists (
    select 1
    from event_grade eg join Event e on eg.event_id = e.id
    where
        eg.grade_id = cg.grade_id
        and e.break = 1
        -- the event interval and the course interval overlap
        and e.begin <= ADDTIME(c.date, c.duration)
        and e.end >= c.date
)
I paid no attention to optimization here.
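To double-check the rule against the sample data from the question, here is a small Python sketch of the same idea: a course survives as long as at least one of its grades is not covered by an overlapping break event. The dictionary layout and the function names are made up for illustration, and a course with no grades at all would count as cancelled here.

```python
from datetime import datetime, timedelta

def overlaps(a_start, a_end, b_start, b_end):
    """Two closed intervals overlap when each starts before the other ends."""
    return a_start <= b_end and b_start <= a_end

def surviving_courses(courses, events):
    """Return the ids of courses that are not fully cancelled by break events."""
    kept = []
    for c in courses:
        c_end = c["date"] + c["duration"]
        blocked = set()
        for e in events:
            if e["break"] and overlaps(c["date"], c_end, e["begin"], e["end"]):
                blocked |= e["grades"]  # grades covered by this overlapping break
        if not c["grades"] <= blocked:  # at least one grade is still free
            kept.append(c["id"])
    return kept

courses = [
    {"id": 1, "date": datetime(2013, 12, 10, 10), "duration": timedelta(hours=2), "grades": {66}},
    {"id": 2, "date": datetime(2013, 12, 11, 10), "duration": timedelta(hours=2), "grades": {65, 66}},
]
events = [
    {"id": 1, "begin": datetime(2013, 12, 10), "end": datetime(2013, 12, 11, 23, 59),
     "break": 1, "grades": {66}},
]

print(surviving_courses(courses, events))
```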
I'm developing a script at my company that will pull information from our SCM about source code activity, such as the number of lines changed, for a given product over time. All the changes for a given product that occur within the same day are combined into a single record in a MySQL table, something like this:
+------------+-------+------+
| date | prod | line |
+------------+-------+------+
| 2011-11-25 | prod2 | 471 |
| 2011-11-28 | prod2 | 389 |
+------------+-------+------+
I then replicate the table with the cumulative results using an inner join and summation:
+------------+-------+------+
| date | prod | line |
+------------+-------+------+
| 2011-11-25 | prod2 | 471 |
| 2011-11-28 | prod2 | 860 |
+------------+-------+------+
Now, I want to create a table that has one record for every day per product. I've been able to do that by joining with a calendar table. However, where new records are created, the line field should be populated with the most recent cumulative value for that product, rather than some hardcoded default like NULL or 0:
+------------+-------+------+
| date | prod | line |
+------------+-------+------+
| 2011-11-25 | prod2 | 471 |
| 2011-11-26 | prod2 | 471 |
| 2011-11-27 | prod2 | 471 |
| 2011-11-28 | prod2 | 860 |
+------------+-------+------+
I've solved this problem in two unsatisfying ways so far:
Fill in the date gaps first, then compute the cumulative sum
Loop over every element of the final table, saving the most recent non-null elements in a @user variable.
The first solution became hugely inefficient once my table grew large enough. The second solution gets the job done, but I've been trying to find a more elegant solution. Here is the code that produces the table with NULLs:
INSERT INTO final SELECT d.date,f.prod,p.line
FROM calendar AS d
CROSS JOIN
(SELECT DISTINCT prod FROM cumulative) AS f
LEFT JOIN cumulative AS p USING (date,prod) ;
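One possible direction, sketched here against an in-memory SQLite database from Python so that it can be run as is: replace the LEFT JOIN with a correlated subquery that picks the latest cumulative row on or before each calendar date. The table and column names are the ones from the question; SQLite is only a stand-in for MySQL, and I have not measured how this performs on a large table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE calendar (date TEXT);
CREATE TABLE cumulative (date TEXT, prod TEXT, line INTEGER);
INSERT INTO calendar VALUES ('2011-11-25'), ('2011-11-26'), ('2011-11-27'), ('2011-11-28');
INSERT INTO cumulative VALUES ('2011-11-25', 'prod2', 471), ('2011-11-28', 'prod2', 860);
""")

# For every (date, product) pair, take the most recent cumulative value so far.
rows = conn.execute("""
    SELECT d.date, f.prod,
           (SELECT c.line
              FROM cumulative AS c
             WHERE c.prod = f.prod AND c.date <= d.date
             ORDER BY c.date DESC
             LIMIT 1) AS line
    FROM calendar AS d
    CROSS JOIN (SELECT DISTINCT prod FROM cumulative) AS f
    ORDER BY f.prod, d.date
""").fetchall()

for row in rows:
    print(row)
```

Dates before a product's first record come back as NULL, since there is nothing to carry forward yet.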
Any ideas? I'm using MySQL.
Seems like the most sensible thing to do would be to store one row per day, with a zero if there were no lines changed. That would eliminate the need for a join on the calendar table.
So instead of your source table looking like this
+------------+-------+------+
| date | prod | line |
+------------+-------+------+
| 2011-11-25 | prod2 | 471 |
| 2011-11-28 | prod2 | 389 |
+------------+-------+------+
it would look like this.
+------------+-------+------+
| date | prod | line |
+------------+-------+------+
| 2011-11-25 | prod2 | 471 |
| 2011-11-26 | prod2 | 0 |
| 2011-11-27 | prod2 | 0 |
| 2011-11-28 | prod2 | 389 |
+------------+-------+------+
As for the running sum itself, your report writer might be able to do this faster than SQL. If your MySQL version supports window functions (8.0 and later do), you'd just write something like
select date, prod,
sum(line) over (partition by prod order by date)
from prod
although, even then, your report writer might be faster.
On platforms that don't support windowing functions, you just need a sum in a subquery.
select p1.prod, p1.date,
(select sum(line) from prod
where prod = p1.prod and date <= p1.date) as num_lines
from prod p1
order by p1.prod, p1.date