Two methods of performing cohort analysis in MySQL using joins

I am building a cohort analysis processor. Input parameters: a time range and step, a condition (the initial event) used to extract cohorts, and an additional condition (the retention event) to check after every N hours/days/months. Output: a cohort analysis grid, like this:
           |  0h | 16h | 32h | 48h | 64h | 80h | 96h |
cohort #00 |  15 |   6 |   4 |   1 |   1 |   2 |   2 |
cohort #01 |   1 |  35 |   8 |   0 |   2 |   0 |   1 |
cohort #02 |   0 |   3 |  31 |  11 |   5 |   3 |   0 |
cohort #03 |   0 |   0 |   4 |  27 |   7 |   6 |   2 |
cohort #04 |   0 |   1 |   1 |   4 |  29 |   4 |   3 |
Basically:
Fetch cohorts: the unique users who triggered event 1 (the initial event) in each period of length time_step, starting from time_begin.
For each cohort, find how many of its users triggered event 2 (the retention event) after N seconds, N*2 seconds, N*3 seconds, and so on until now.
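The two steps above can be sketched outside the database to make the expected output concrete. This is only an illustration in Python, not the stored procedure from this post: the function name `cohort_grid` and the tiny event list in the usage note are made up, periods are the intervals (begin + step*i, begin + step*(i+1)] as in the SQL below, and substring matching stands in for `locate(...) > 0`.

```python
from collections import defaultdict


def cohort_grid(events, first_event, retention_event, begin, end, step):
    """Return {cohort_no: {period_no: number of distinct users}}.

    events is an iterable of (uid, tm, name) tuples with integer timestamps.
    """
    n_cohorts = -(-(end - begin) // step)  # ceiling division, like ceil() in the SQL

    # Step 1: assign users to cohorts by the period of their initial event.
    cohorts = defaultdict(set)  # cohort_no -> set of uids
    for uid, tm, name in events:
        if first_event in name and begin < tm <= begin + step * n_cohorts:
            cohorts[(tm - begin - 1) // step].add(uid)

    # Step 2: for every retention event, credit each cohort containing the user.
    grid = defaultdict(lambda: defaultdict(set))
    for uid, tm, name in events:
        if retention_event in name and tm > begin:
            period = (tm - begin - 1) // step
            for cohort_no, members in cohorts.items():
                if uid in members:
                    grid[cohort_no][period].add(uid)

    return {c: {p: len(users) for p, users in row.items()}
            for c, row in grid.items()}
```

For example, with `begin=100`, `end=120`, `step=10`, two users launching in the first period and one in the second, the function returns one row per cohort with the distinct user count for each later period.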
In short, I have two solutions. The first is too slow: it runs a heavy select with joins for each data step (day 1, day 2, day 3, etc.). The second is my attempt to optimize it by joining the results for every data step to the cohorts in a single query. It seems to work, but I'm not sure it's the best way, or that it will give the same result even if cohorts intersect. Please check it out.
Here's the whole story.
I have a table of > 100,000 events, something like this:
#user-id, timestamp, event_name
events_view (uid varchar(64), tm int(11), e varchar(64))
example input row:
"user_sampleid1", 1423836540, "level_end:001:win"
To make a cohort analysis, I first extract the cohorts: for example, users who sent the special event '1st_launch' in 10-hour periods starting from 2015-02-13 and ending with 2015-02-16. All code in this post is simplified and shortened to show the idea.
DROP TABLE IF EXISTS tmp_c;
create temporary table tmp_c (uid varchar(64), tm int(11), c int(11) );
set beg = UNIX_TIMESTAMP('2015-02-13 00:00:00');
set en = UNIX_TIMESTAMP('2015-02-16 00:00:00');
select min(tm) into t_start from events_view ;
select max(tm) into t_end from events_view ;
if beg < t_start then
set beg = t_start;
end if;
if en > t_end then
set en = t_end;
end if;
set period = 3600 * 10;
set cnt_c = ceil((en - beg) / period) ;
/*works quick enough*/
set i = 0;
WHILE i < cnt_c DO
insert into tmp_c (
select uid, min(tm), i from events_view where
locate("1st_launch", e) > 0 and tm > (beg + period * i)
AND tm <= (beg + period * (i+1)) group by uid );
SET i = i+1;
END WHILE;
Cohorts may contain the same user ids, though usually a user exists in only one cohort. Within each cohort, users are unique.
Now I have temp table like this:
user_id | 1st timestamp | cohort_no
uid1 1423836540 0
uid2 1423839540 0
uid3 1423841160 1
uid4 1423841460 2
...
uidN 1423843080 M
Then I need to divide the time range into periods again and calculate, for each period, how many users from each cohort have sent the event "level_end:001:win".
For each small period I select all unique users who have sent the "level_end:001:win" event and left join them to the tmp_c cohorts table. So I have something like this:
user_id | 1st timestamp | cohort_no | user_id | other fields...
uid1 1423836540 0 uid1
uid2 1423839540 0 null
uid3 1423841160 1 null
uid4 1423841460 2 uid4
...
uidN 1423843080 M null
This way I see which users from my cohorts are among those who have sent "level_end:001:win"; the unmatched rows are excluded by the where clause: where t2.uid is not null.
Finally I group the rows and get the count of users in each cohort who have sent "level_end:001:win" in this particular period.
Here's the code:
DROP TABLE IF EXISTS tmp_res;
create temporary table tmp_res (uid varchar(64) CHARACTER SET cp1251 NOT NULL, c int(11), cnt int(11) );
set i = 0;
set cnt_c = ceil((t_end - beg) / period) ;
WHILE i < cnt_c DO
insert into tmp_res
select concat(beg + period * i, "_", beg + period * (i+1)), c, count(distinct(uid)) from
(select t1.uid, t1.c from tmp_c t1 left join
(select uid, min(tm) from events_view where
locate("level_end:001:win", e) > 0 and
tm > (beg + period * i) AND tm <= (beg + period * (i+1)) group by uid ) t2
on t1.uid = t2.uid where t2.uid is not null) t3
group by c;
SET i = i+1;
END WHILE;
/*getting result of the first method: tooo slooooow!*/
select * from tmp_res;
The result I got (it's OK that some cohorts do not appear in some periods):
"1423832400_1423890000","1","35"
"1423832400_1423890000","2","3"
"1423832400_1423890000","3","1"
"1423832400_1423890000","4","1"
"1423890000_1423947600","1","21"
"1423890000_1423947600","2","50"
"1423890000_1423947600","3","2"
"1423947600_1424005200","1","9"
"1423947600_1424005200","2","24"
"1423947600_1424005200","3","70"
"1423947600_1424005200","4","6"
"1424005200_1424062800","1","7"
"1424005200_1424062800","2","15"
"1424005200_1424062800","3","21"
"1424005200_1424062800","4","32"
"1424062800_1424120400","1","7"
"1424062800_1424120400","2","13"
"1424062800_1424120400","3","24"
"1424062800_1424120400","4","18"
"1424120400_1424178000","1","10"
"1424120400_1424178000","2","12"
"1424120400_1424178000","3","18"
"1424120400_1424178000","4","14"
"1424178000_1424235600","1","6"
"1424178000_1424235600","2","7"
"1424178000_1424235600","3","9"
"1424178000_1424235600","4","12"
"1424235600_1424293200","1","6"
"1424235600_1424293200","2","8"
"1424235600_1424293200","3","9"
"1424235600_1424293200","4","5"
"1424293200_1424350800","1","5"
"1424293200_1424350800","2","3"
"1424293200_1424350800","3","11"
"1424293200_1424350800","4","10"
"1424350800_1424408400","1","8"
"1424350800_1424408400","2","5"
"1424350800_1424408400","3","7"
"1424350800_1424408400","4","7"
"1424408400_1424466000","2","6"
"1424408400_1424466000","3","7"
"1424408400_1424466000","4","3"
"1424466000_1424523600","1","3"
"1424466000_1424523600","2","4"
"1424466000_1424523600","3","8"
"1424466000_1424523600","4","2"
"1424523600_1424581200","2","3"
"1424523600_1424581200","3","3"
It works but it takes too much time to process because there are many queries here instead of one, so I need to rewrite it.
I think it can be rewritten with joins, but I'm still not sure how.
I decided to make a temporary table and write period boundaries in it:
DROP TABLE IF EXISTS tmp_times;
create temporary table tmp_times (tm_start int(11), tm_end int(11));
set cnt_c = ceil((t_end - beg) / period) ;
set i = 0;
WHILE i < cnt_c DO
insert into tmp_times values( beg + period * i, beg + period * (i+1));
SET i = i+1;
END WHILE;
Then I map periods to events (user_id + timestamp identify a particular event) in a derived table, left join it to the cohorts table and group the result:
SELECT Concat(tm_start, "_", tm_end) per,
t1.c coh,
Count(DISTINCT( t2.uid ))
FROM tmp_c t1
LEFT JOIN (SELECT *
FROM tmp_times t3
LEFT JOIN (SELECT uid,
tm
FROM events_view
WHERE Locate("level_end:001:win", e) > 0)
t4
ON ( t4.tm > t3.tm_start
AND t4.tm <= t3.tm_end )
WHERE t4.uid IS NOT NULL
ORDER BY t3.tm_start) t2
ON t1.uid = t2.uid
WHERE t2.uid IS NOT NULL
GROUP BY per,
coh
ORDER BY per,
coh;
In my tests this returns the same result as method #1. I can't check the result manually, but I understand better how method #1 works, and as far as I can see it gives what I want. Method #2 is faster, but I'm not sure it's the best way, or that it will give the same result even if cohorts intersect.
Are there well-known, common methods to perform a cohort analysis in SQL? Is method #1 more reliable than method #2? I don't work with joins that often, so I still don't fully understand join magic.
Method #2 looks like pure magic, and I tend not to believe in what I don't understand :)
Thanks for answers!
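One way to gain confidence in method #2 is to run both shapes of query on a small hand-made data set in which one user deliberately sits in two cohorts. The sketch below does that with Python's built-in sqlite3 module as a stand-in for MySQL, so `instr` replaces `locate` and `||` replaces `concat`; all rows are invented for illustration. It also writes the joins as plain inner joins, because a LEFT JOIN followed by WHERE ... IS NOT NULL on the joined table keeps exactly the rows an inner join would keep.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE events_view (uid TEXT, tm INTEGER, e TEXT);
CREATE TABLE tmp_c (uid TEXT, tm INTEGER, c INTEGER);
CREATE TABLE tmp_times (tm_start INTEGER, tm_end INTEGER);
""")
con.executemany("INSERT INTO events_view VALUES (?, ?, ?)", [
    ("u1", 5, "level_end:001:win"), ("u1", 7, "level_end:001:win"),
    ("u2", 15, "level_end:001:win"), ("u1", 18, "level_end:001:win"),
    ("u3", 25, "other"),
])
# u1 is placed in two cohorts on purpose to exercise the "intersecting" case.
con.executemany("INSERT INTO tmp_c VALUES (?, ?, ?)",
                [("u1", 1, 0), ("u2", 2, 0), ("u1", 11, 1), ("u3", 12, 1)])
periods = [(0, 10), (10, 20), (20, 30)]
con.executemany("INSERT INTO tmp_times VALUES (?, ?)", periods)

# Method 1: one grouped query per period.
method1 = []
for start, end in periods:
    method1 += con.execute("""
        SELECT ? || '_' || ?, t1.c, COUNT(DISTINCT t1.uid)
        FROM tmp_c t1
        JOIN events_view ev ON ev.uid = t1.uid
        WHERE instr(ev.e, 'level_end:001:win') > 0
          AND ev.tm > ? AND ev.tm <= ?
        GROUP BY t1.c""", (start, end, start, end)).fetchall()

# Method 2: a single query that joins the period table as well.
method2 = con.execute("""
    SELECT t.tm_start || '_' || t.tm_end AS per, t1.c AS coh,
           COUNT(DISTINCT ev.uid)
    FROM tmp_c t1
    JOIN events_view ev
      ON ev.uid = t1.uid AND instr(ev.e, 'level_end:001:win') > 0
    JOIN tmp_times t
      ON ev.tm > t.tm_start AND ev.tm <= t.tm_end
    GROUP BY per, coh""").fetchall()

same = sorted(method1) == sorted(method2)
```

On this data `same` is true, and the user who belongs to two cohorts is counted once in each of them. That is a check on one small example, not a proof for every data set.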

Related

Counting product pairs in a store whose difference in expenses is less than a certain amount in SQL

I have a table with the serial number of each product, whether it is in stock (1 = in stock, 0 = not in stock), the revenue from the product and the expenses from the product in the store. I would like to write a query that counts all product pairs (without counting the same pair twice) whose expense difference is less than NIS 1,000 and where both products are in stock or both are out of stock. It should show the average income gap (approximately) of all such pairs, how many of the pairs are in stock, and how many are not in stock.
Sample table:
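serial | Is_in_stock | Revenu_from_the_product | Expenses_from_the_product
-----: | ----------: | ----------------------: | ------------------------:
     1 |           1 |                   27627 |                     57661
     2 |           0 |                   48330 |                     20686
     3 |           0 |                   26010 |                       861
     4 |           1 |                   22798 |                     37771
     5 |           0 |                   24606 |                      8905
     6 |           1 |                   48311 |                      6433
     7 |           0 |                   29929 |                      6278
     8 |           0 |                   24254 |                      8590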
Unfortunately I am lost and unable to find a solution to my problem.
I was thinking of creating subqueries but could not find a suitable solution.
The result should look something like this (the numbers are for illustration only, please do not rely on them):
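Average income gap (in absolute value) of all pairs | Quantity of pairs in stock | The amount of pairs that are not in stock
--------------------------------------------------: | -------------------------: | ----------------------------------------:
                                                 13 |                         10 |                                         5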
In addition it is very important that the count be done without duplicates of the same pair
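To pin down what "without duplicates of the same pair" means, the requirement can be sketched in a few lines of Python on the sample rows. This is an illustration only, and the function name `pair_stats` is made up. `itertools.combinations` yields every unordered pair exactly once, which plays the same role as a condition like a.serial < b.serial in a SQL self-join.

```python
from itertools import combinations

# (serial, in_stock, revenue, expenses), copied from the sample table
products = [
    (1, 1, 27627, 57661), (2, 0, 48330, 20686), (3, 0, 26010, 861),
    (4, 1, 22798, 37771), (5, 0, 24606, 8905), (6, 1, 48311, 6433),
    (7, 0, 29929, 6278), (8, 0, 24254, 8590),
]


def pair_stats(rows, max_gap=1000):
    """Summarise pairs with the same stock flag and an expense gap below max_gap."""
    pairs = [(a, b) for a, b in combinations(rows, 2)
             if a[1] == b[1] and abs(a[3] - b[3]) < max_gap]
    income_gaps = [abs(a[2] - b[2]) for a, b in pairs]
    in_stock = sum(1 for a, _ in pairs if a[1] == 1)
    return {
        "pairs": [(a[0], b[0]) for a, b in pairs],
        "avg_income_gap": sum(income_gaps) / len(pairs) if pairs else None,
        "in_stock": in_stock,
        "out_of_stock": len(pairs) - in_stock,
    }


stats = pair_stats(products)
```

On the sample data only products 5 and 8 qualify: products 6 and 7 are close in expenses but differ in stock status.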
We can do this with two queries, without a procedure or user defined function
CREATE TABLE products(serial INT, Instock INT, Revenu INT, Expenses INT);
INSERT INTO products VALUES
(1,1,27627,57661),
(2,0,48330,20686),
(3,0,26010,861 ),
(4,1,22798,37771),
(5,0,24606,8905 ),
(6,1,48311,6433 ),
(7,0,29929,6278 ),
(8,0,24254,8590 );
SELECT a.serial,b.serial from
products a
join products b
on abs(a.expenses-b.expenses)<1000
where a.serial<b.serial
and a.instock=b.instock
serial | serial
-----: | -----:
5 | 8
select count(a.expenses) 'number of pairs',
avg(abs(a.expenses-b.expenses)) 'average difference',
sum(case when a.instock=1 and b.instock=1 then 1 else 0 end) pairsInstock,
sum(case when a.instock=0 and b.instock=0 then 1 else 0 end) pairsneitherStock,
sum(case when (a.instock+b.instock)=1 then 1 else 0 end ) oneInStock
from products a
cross join products b
where a.serial < b.serial;
number of pairs | average difference | pairsInstock | pairsneitherStock | oneInStock
--------------: | -----------------: | -----------: | ----------------: | ---------:
28 | 21362.1071 | 3 | 10 | 15
db<>fiddle here
I have solved it in a stored procedure.
It starts with the variable definitions.
A cursor iterates over the sorted list and checks whether the following condition is TRUE, according to your definition of a pair:
prev_exp - curr_Expenses_from_the_product < 1000 AND prev_in_stock - curr_Is_in_stock = 0
If it is TRUE, the counter is increased by 1.
At the end I close the cursor and return the counter value.
* You can add more logic to the procedure and return more columns.
** To use the procedure, just call it by its name.
Table creation:
CREATE TABLE A(serial INT(11), Is_in_stock INT(11), Revenu_from_the_product INT(11), Expenses_from_the_product INT(11));
Data insertion:
INSERT INTO A (serial,Is_in_stock,Revenu_from_the_product,Expenses_from_the_product) VALUES
(1,1,27627,57661),
(2,0,48330,20686),
(3,0,26010,861 ),
(4,1,22798,37771),
(5,0,24606,8905 ),
(6,1,48311,6433 ),
(7,0,29929,6278 ),
(8,0,24254,8590 );
Query:
BEGIN
DECLARE finished INTEGER DEFAULT 0;
DECLARE prev_exp int(11) DEFAULT 0;
DECLARE prev_in_stock int(11) DEFAULT 0;
DECLARE curr_Is_in_stock int(11) DEFAULT 0;
DECLARE curr_Expenses_from_the_product int(11) DEFAULT 0;
DECLARE duplications_counter int(11) DEFAULT 0;
-- declare cursor for relevant fields
DEClARE curs
CURSOR FOR
SELECT A.Is_in_stock,A.Expenses_from_the_product FROM A ORDER BY A.Expenses_from_the_product DESC;
-- declare NOT FOUND handler
DECLARE CONTINUE HANDLER
FOR NOT FOUND SET finished = 1;
OPEN curs;
getRow: LOOP
FETCH curs INTO curr_Is_in_stock,curr_Expenses_from_the_product;
IF finished = 1 THEN
LEAVE getRow;
END IF;
IF prev_exp - curr_Expenses_from_the_product < 1000 AND prev_in_stock - curr_Is_in_stock = 0 THEN
SET duplications_counter = duplications_counter+1;
END IF;
END LOOP getRow;
CLOSE curs;
-- return the counter
SELECT duplications_counter;
END
Result:
Counter: 5

Trying to make a pingpong stat tracking database with a stored procedure

I'm using a stored procedure to (try to) write to 3 different tables in MySQL to track ping-pong data and show cool statistics.
I'm a complete noob to MySQL (and StackOverflow) and haven't really used any database language before, so all of this is pretty new to me. I'm trying to make a stored procedure that writes ping-pong stats that come from Ignition. (I'm fairly certain that Ignition isn't the problem. It's telling me the writes failed, so I think it's a problem with my stored procedure.)
I currently have one stored procedure that writes to the players table and can add wins, losses, and total games played when a button is pressed. My problem now is that I want to add statistics where I can track the score and who played against whom, so I could make graphs and stuff.
This stored procedure is supposed to search through the pingpong table to find out whether the names passed have played against each other before, so I can find the corresponding MatchID. If the players haven't played before, it should create a new row with a new MatchID (this is the key, so it should be unique every time). Once I have the MatchID, I can then figure out how many games the players have played against each other before, what the score was, who beat whom, and stuff like that.
Here's what I've written. MySQL says it's fine, but obviously it's not working. I know it's not completely finished, but I really need some guidance, since this is only my second time doing anything with MySQL or any database language for that matter, and I don't think this should be failing when I test any sort of write.
CREATE DEFINER=`root`@`localhost` PROCEDURE `Matchups`(
#these are passed from Ignition and should be working
IN L1Name VARCHAR(255), #Player 1 name on the left side
IN L2Name VARCHAR(255), #Player 2 name on the left side
IN R1Name VARCHAR(255), #Player 3 name on the right side
IN R2Name VARCHAR(255), #Player 4 name on the right side
IN TWOvTWO int, #If this is 1, then L1,L2,R1,R2 are playing instead of L1,R1
IN LeftScore int,
IN RightScore int)
BEGIN
DECLARE x int DEFAULT 0;
IF((
SELECT MatchupID
FROM pingpong
WHERE (PlayerL1 = L1Name AND PlayerR1 = R1Name) OR (PlayerL1 = R1Name AND PlayerR1 = L1Name)
)
IS NULL) THEN
INSERT INTO pingpong (PlayerL1, PlayerL2, PlayerR1, PlayerR2) VALUES (L1Name, L2Name, R1Name, R2Name);
INSERT INTO pingponggames (MatchupID, Lscore, Rscore) VALUES ((SELECT MatchupID
FROM pingpong
WHERE (PlayerL1 = L1Name AND PlayerR1 = R1Name) OR (PlayerL1 = R1Name AND PlayerR1 = L1Name)), LeftScore, RightScore);
END IF;
END
Here are what my tables currently look like:
pingpong
PlayerL1 | PlayerL2 | PlayerR1 | PlayerR2 | MatchupID
-----------------------------------------------------
L1 | NULL | R1 | NULL | 1
L1 | NULL | L2 | NULL | 3
L1 | NULL | R2 | NULL | 4
L1 | NULL | test2 | NULL | 5
pingponggames
GameID | MatchupID | LScore | RScore
------------------------------------------
1 | 1 | NULL | NULL
pingpongplayers
Name | TotalWins | TotalLosses | GamesPlayed
-----------------------------------------------------
L1 | 8 | 5 | NULL
L2 | 1 | 1 | NULL
R1 | 1 | 6 | 7
R2 | 1 | 1 | NULL
test2 | 1 | 0 | 1
test1 | 0 | 0 | 0
I have explained some parts in the comments; if you need more, I will need more info.
CREATE DEFINER=`root`@`localhost` PROCEDURE `Matchups`(
#these are passed from Ignition and should be working
IN L1Name VARCHAR(255), #Player 1 name on the left side
IN L2Name VARCHAR(255), #Player 2 name on the left side
IN R1Name VARCHAR(255), #Player 3 name on the right side
IN R2Name VARCHAR(255), #Player 4 name on the right side
-- what will be the INPUT other than 1? It's to notice doubles or singles right? so taking 0 as single & 1 as doubles
IN TWOvTWO INT, #If this is 1, then L1,L2,R1,R2 are playing instead of L1,R1
IN LeftScore INT,
IN RightScore INT)
BEGIN
DECLARE x INT DEFAULT 0; # i guess you are using it in the sp
DECLARE v_matchupid INT; #used int --if data type is different, set as MatchupID column datatype
DECLARE inserted_matchupid INT; -- use data type based on your column MatchupID from pingpong tbl
IF(TWOvTWO=0) THEN -- for singles
#what is the need of this query? to check singles or doubles? Currently it search for only single from what you have written, will change according to that
SELECT MatchupID INTO v_matchupid
FROM pingpong
WHERE L1Name IN (PlayerL1, PlayerR1) AND R1Name IN (PlayerL1, PlayerR1); # avoid using direct name(string) have a master tbl for player name and use its id to compare or use to refer in another tbl
# the if part checks is it new between them and insert in both tbls
IF(v_matchupid IS NULL) THEN
INSERT INTO pingpong (PlayerL1, PlayerR1) VALUES (L1Name, R1Name);
SET inserted_matchupid=LAST_INSERT_ID();
INSERT INTO pingponggames (MatchupID, Lscore, Rscore) VALUES (inserted_matchupid, LeftScore, RightScore);
/*
Once I have the MatchID, I can then figure out how many games the players have played against each other before
A: this will not work for new matchup since matchupid is created now
*/
# so assuming if match found update pingponggames tbl with matched matchupid.. i leave it up to you
ELSE
UPDATE pingponggames SET Lscore=LeftScore, Rscore=RightScore WHERE MatchupID=v_matchupid;-- you can write your own
END IF;
-- for doubles
ELSE # assuming the possibilities of TWOvTWO will be either 0 or 1 if more use "elseif(TWOvTWO=1)" for this block as doubles
SELECT MatchupID INTO v_matchupid
FROM pingpong
# Note: If player name are same it will be difficult so better use a unique id as reference
WHERE L1Name IN (PlayerL1, PlayerL2, PlayerR1, PlayerR2) AND
L2Name IN (PlayerL1, PlayerL2, PlayerR1, PlayerR2) AND
R1Name IN (PlayerL1, PlayerL2, PlayerR1, PlayerR2) AND
R2Name IN (PlayerL1, PlayerL2, PlayerR1, PlayerR2);
IF(v_matchupid IS NULL) THEN
INSERT INTO pingpong (PlayerL1, PlayerL2, PlayerR1, PlayerR2) VALUES (L1Name, L2Name, R1Name, R2Name);
SET inserted_matchupid=LAST_INSERT_ID();
INSERT INTO pingponggames (MatchupID, Lscore, Rscore) VALUES (inserted_matchupid, LeftScore, RightScore);
ELSE
UPDATE pingponggames SET Lscore=LeftScore, Rscore=RightScore WHERE MatchupID=v_matchupid;-- you can write your own
END IF;
END IF;
END
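The find-or-create flow described in the question can also be tried outside MySQL. The sketch below uses Python's built-in sqlite3 module on a cut-down, assumed version of the two tables and covers singles only; `record_game` is a hypothetical helper, not part of the procedure above. Unlike the ELSE branches above, it appends a new game row every time, on the assumption that each game should be kept so that head-to-head history can be counted later.

```python
import sqlite3


def record_game(con, left, right, left_score, right_score):
    """Find or create the singles matchup, then append one game row."""
    row = con.execute(
        "SELECT MatchupID FROM pingpong "
        "WHERE (PlayerL1 = ? AND PlayerR1 = ?) "
        "   OR (PlayerL1 = ? AND PlayerR1 = ?)",
        (left, right, right, left)).fetchone()
    if row is None:
        cur = con.execute(
            "INSERT INTO pingpong (PlayerL1, PlayerR1) VALUES (?, ?)",
            (left, right))
        matchup_id = cur.lastrowid  # same idea as LAST_INSERT_ID() in MySQL
    else:
        matchup_id = row[0]
    # Scores are stored by side exactly as passed in.
    con.execute(
        "INSERT INTO pingponggames (MatchupID, LScore, RScore) VALUES (?, ?, ?)",
        (matchup_id, left_score, right_score))
    return matchup_id


con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE pingpong (
    MatchupID INTEGER PRIMARY KEY AUTOINCREMENT,
    PlayerL1 TEXT, PlayerL2 TEXT, PlayerR1 TEXT, PlayerR2 TEXT);
CREATE TABLE pingponggames (
    GameID INTEGER PRIMARY KEY AUTOINCREMENT,
    MatchupID INTEGER, LScore INTEGER, RScore INTEGER);
""")
first = record_game(con, "L1", "R1", 21, 15)
second = record_game(con, "R1", "L1", 21, 19)  # same two players, sides swapped
games = con.execute(
    "SELECT COUNT(*) FROM pingponggames WHERE MatchupID = ?",
    (first,)).fetchone()[0]
```

Both calls resolve to the same matchup, and the games table ends up with two rows for it.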

Oracle OR MySQL - Custom ORDER with different criterea depending on NULL values

I have a table with two pairs of DATE columns. I want to order the result based on the time of these dates. Each row has at least one pair of dates (when a row has both intervals: start2 > start1, end2 > end1 and start2 >= end1), like this:
| start1 | end1 | start2 | end2
row 1 | 4:00 | 5:00 | 5:00 | 6:00
row 2 | 4:30 | 5:00 | NULL | NULL
row 3 | NULL | NULL | 5:30 | 6:00
row 4 | 5:00 | 6:00 | 6:00 | 7:00
When two rows have both pairs, they should be compared by start1.
When one row has only pair1 and the other has only pair2, start1 should be compared to start2.
When one row has only one pair (either one) and the other has both pairs, they should be compared by the pair that the first row has (start1 to start1 or start2 to start2). E.g.: if the first row has only start2 and end2, and the second row has start1, end1, start2 and end2, these two rows should be compared by start2 only (start1 from the second row should be ignored).
How can I accomplish that?
EDIT
I can easily do that in C#, but I need to do this in the database. Below is the code showing how it should work in C#:
static void Main(string[] args)
{
...
intervals.Sort(new IntervalComparer());
}
public class IntervalComparer : IComparer<Interval>
{
public int Compare(Interval quadro1, Interval quadro2)
{
int result = 0;
if (quadro1.start1 != null && quadro2.start1 != null)
result = quadro1.start1.Value.CompareTo(quadro2.start1.Value);
else if (quadro1.start2 != null && quadro2.start2 != null)
result = quadro1.start2.Value.CompareTo(quadro2.start2.Value);
else if (quadro1.start1 != null)
result = quadro1.start1.Value.CompareTo(quadro2.start2.Value);
else
result = quadro1.start2.Value.CompareTo(quadro2.start1.Value);
return result;
}
}
There is a flaw in your logic. Consider the following example:
| start1 | start2
a | 4:00 | 7:00
b | 5:00 | NULL
c | NULL | 6:00
For that data the following would be true using your logic:
a < b (a.start1 = 4:00 < b.start1 = 5:00)
b < c (b.start1 = 5:00 < c.start2 = 6:00)
c < a (c.start2 = 6:00 < a.start2 = 7:00)
And thus a < b < c < a is also "true". But if you want a meaningful order, the < operator should be transitive. That means if a < b is TRUE and b < c is TRUE, a < c should also be TRUE. But this is not the case.
So you could order the three rows as [a, b, c] or [b, c, a] or [c, a, b]. To me that doesn't make sense.
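The cycle can be reproduced with a direct port of the comparer to Python. In this sketch times are written as minutes since midnight and None stands for NULL; the helper names `sign` and `is_consistent` are made up for the illustration.

```python
from itertools import permutations


def sign(x, y):
    """Return -1, 0 or 1, like CompareTo in the C# code."""
    return (x > y) - (x < y)


def compare(q1, q2):
    """Port of IntervalComparer.Compare; each row is (start1, start2)."""
    s1a, s2a = q1
    s1b, s2b = q2
    if s1a is not None and s1b is not None:
        return sign(s1a, s1b)
    if s2a is not None and s2b is not None:
        return sign(s2a, s2b)
    if s1a is not None:
        return sign(s1a, s2b)
    return sign(s2a, s1b)


rows = {"a": (240, 420), "b": (300, None), "c": (None, 360)}


def is_consistent(order):
    """True if every earlier row compares as smaller than every later row."""
    return all(compare(rows[x], rows[y]) < 0
               for i, x in enumerate(order) for y in order[i + 1:])


consistent_orders = [p for p in permutations(rows) if is_consistent(p)]
```

`compare` reports a < b, b < c and c < a, and `consistent_orders` comes out empty: no arrangement of the three rows satisfies all pairwise comparisons at once.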
If you want an SQL solution, you should redefine the logic. I suggest filling the NULLs using an estimated (average) difference (AVG(start2 - start1)). In my example the average difference would be 3 hours. So b.start2 would be replaced with b.start1 + 3 hours = 8:00, and c.start1 would be replaced with c.start2 - 3 hours = 3:00. You could now order by the estimated values.
MySQL example:
select t.*
, d.avg_diff
, coalesce(time_to_sec(t.start1), time_to_sec(t.start2) - d.avg_diff) as estimated1
, coalesce(time_to_sec(t.start2), time_to_sec(t.start1) + d.avg_diff) as estimated2
from my_table t
cross join (
select avg(time_to_sec(start2) - time_to_sec(start1))
as avg_diff
from my_table
) d
order by estimated1, estimated2;
You can of course use the estimation expressions in the ORDER BY clause:
select t.*
from my_table t
cross join (
select avg(time_to_sec(start2) - time_to_sec(start1))
as avg_diff
from my_table
) d
order by
coalesce(time_to_sec(t.start1), time_to_sec(t.start2) - d.avg_diff),
coalesce(time_to_sec(t.start2), time_to_sec(t.start1) + d.avg_diff);
Demo: http://rextester.com/HHLT51865
Demo with original data: http://rextester.com/AMJIDT94457
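The estimate-based ordering can be traced by hand on the three-row example (a, b, c) from this answer. The following is a small Python sketch of the same arithmetic, with times as minutes since midnight and None for NULL; the function name `estimates` is made up.

```python
rows = {"a": (240, 420), "b": (300, None), "c": (None, 360)}

# Average of start2 - start1 over the rows that have both values.
diffs = [s2 - s1 for s1, s2 in rows.values()
         if s1 is not None and s2 is not None]
avg_diff = sum(diffs) / len(diffs)


def estimates(row):
    """Fill a missing start with the other start shifted by the average gap."""
    s1, s2 = row
    est1 = s1 if s1 is not None else s2 - avg_diff
    est2 = s2 if s2 is not None else s1 + avg_diff
    return est1, est2


ordered = sorted(rows, key=lambda name: estimates(rows[name]))
```

With an average gap of 180 minutes, c gets an estimated start1 of 3:00 and b an estimated start2 of 8:00, so the rows sort as c, a, b.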
Try using the IF function; it should be something along the lines of:
SELECT IF(start1 IS NOT NULL, TIME(start1), TIME(start2)) AS sortable_value
FROM your_table ORDER BY sortable_value
Your whole logic can be simplified as in the above query. It basically reduces to: if start1 is not null, sort by its time; otherwise sort by start2's time.
If I misinterpreted something, you can nest another IF in place of TIME(start1) (the "then" branch) or in place of TIME(start2) (the "else" branch).

SQL Server: calculate field data from fields in same table but different set of data

I was looking around and found no solution to this. I'd be glad if someone could help me out here:
I have a table that has, among others, the following columns:
Vehicle_No, Stop1_DepTime, Segment_TravelTime, Stop_arrTime, Stop_Sequence
The data might look something like this:
Vehicle_No | Stop1_DepTime | Segment_TravelTime | Stop_Sequence | Stop_arrTime
       201 |         13000 |                 60 |             1 |
       201 |         13000 |                 45 |             2 |
       201 |         13000 |                120 |             3 |
       201 |         13000 |                    |             4 |
       202 |         13300 |                240 |             1 |
       202 |         13300 |                 60 |             2 |
...
and I need to calculate the arrival time at each stop from the departure time at the first stop and the travel times in between for each vehicle. What I need in this case would look like this:
Vehicle_No | Stop1_DepTime | Segment_TravelTime | Stop_Sequence | Stop_arrTime
       201 |         13000 |                 60 |             1 |
       201 |         13000 |                 45 |             2 |        13060
       201 |         13000 |                120 |             3 |        13105
       201 |         13000 |                    |             4 |        13225
       202 |         13300 |                240 |             1 |
       202 |         13300 |                 60 |             2 |        13540
...
I have tried to find a solution for some time but was not successful - Thanks for any help you can give me!
Here is the query that still does not work. I am sure I did something wrong when adapting it to my table in the database, but I don't know where. Sorry if this is a really simple error, I have just begun working with MSSQL.
Also, I have implemented the solution provided below and it works. At this point I mainly want to understand what went wrong here, to learn from it. If it takes too much time, please do not bother with my question for too long. Otherwise, thanks a lot :)
;WITH recCTE
AS
(
SELECT ZAEHL_2011.dbo.L32.Zaehl_Fahrt_Id, ZAEHL_2011.dbo.L32.PlanAbfahrtStart, ZAEHL_2011.dbo.L32.Fahrzeit, ZAEHL_2011.dbo.L32.Sequenz, ZAEHL_2011.dbo.L32.PlanAbfahrtStart AS Stop_arrTime
FROM ZAEHL_2011.dbo.L32
WHERE ZAEHL_2011.dbo.L32.Sequenz = 1
UNION ALL
SELECT t. ZAEHL_2011.dbo.L32.Zaehl_Fahrt_Id, t. ZAEHL_2011.dbo.L32.PlanAbfahrtStart, t. ZAEHL_2011.dbo.L32.Fahrzeit,t. ZAEHL_2011.dbo.L32.Sequenz, r.Stop_arrTime + r. ZAEHL_2011.dbo.L32.Fahrzeit AS Stop_arrTime
FROM recCTE AS r
JOIN ZAEHL_2011.dbo.L32 AS t
ON t. ZAEHL_2011.dbo.L32.Zaehl_Fahrt_Id = r. ZAEHL_2011.dbo.L32.Zaehl_Fahrt_Id
AND t. ZAEHL_2011.dbo.L32.Sequenz = r. ZAEHL_2011.dbo.L32.Sequenz + 1
)
SELECT ZAEHL_2011.dbo.L32.Zaehl_Fahrt_Id, ZAEHL_2011.dbo.L32.PlanAbfahrtStart, ZAEHL_2011.dbo.L32.Fahrzeit, ZAEHL_2011.dbo.L32.Sequenz, ZAEHL_2011.dbo.L32.PlanAbfahrtStart,
CASE WHEN Stop_arrTime = ZAEHL_2011.dbo.L32.PlanAbfahrtStart THEN NULL ELSE Stop_arrTime END AS Stop_arrTime
FROM recCTE
ORDER BY ZAEHL_2011.dbo.L32.Zaehl_Fahrt_Id, ZAEHL_2011.dbo.L32.Sequenz
A recursive CTE solution - assumes that each Vehicle_No appears in the table only once:
DECLARE @t TABLE
(Vehicle_No INT
,Stop1_DepTime INT
,Segment_TravelTime INT
,Stop_Sequence INT
,Stop_arrTime INT
)
INSERT @t (Vehicle_No,Stop1_DepTime,Segment_TravelTime,Stop_Sequence)
VALUES(201,13000,60,1),
(201,13000,45,2),
(201,13000,120,3),
(201,13000,NULL,4),
(202,13300,240,1),
(202,13300,60,2)
;WITH recCTE
AS
(
SELECT Vehicle_No, Stop1_DepTime, Segment_TravelTime,Stop_Sequence, Stop1_DepTime AS Stop_arrTime
FROM @t
WHERE Stop_Sequence = 1
UNION ALL
SELECT t.Vehicle_No, t.Stop1_DepTime, t.Segment_TravelTime,t.Stop_Sequence, r.Stop_arrTime + r.Segment_TravelTime AS Stop_arrTime
FROM recCTE AS r
JOIN @t AS t
ON t.Vehicle_No = r.Vehicle_No
AND t.Stop_Sequence = r.Stop_Sequence + 1
)
SELECT Vehicle_No, Stop1_DepTime, Segment_TravelTime,Stop_Sequence, Stop1_DepTime,
CASE WHEN Stop_arrTime = Stop1_DepTime THEN NULL ELSE Stop_arrTime END AS Stop_arrTime
FROM recCTE
ORDER BY Vehicle_No, Stop_Sequence
EDIT
Corrected version of OP's query - note that it's not necessary to fully qualify the column names:
;WITH recCTE
AS
(
SELECT Zaehl_Fahrt_Id, PlanAbfahrtStart, Fahrzeit, L32.Sequenz, PlanAbfahrtStart AS Stop_arrTime
FROM ZAEHL_2011.dbo.L32
WHERE Sequenz = 1
UNION ALL
SELECT t.Zaehl_Fahrt_Id, t.PlanAbfahrtStart, t.Fahrzeit,t.Sequenz, r.Stop_arrTime + r.Fahrzeit AS Stop_arrTime
FROM recCTE AS r
JOIN ZAEHL_2011.dbo.L32 AS t
ON t.Zaehl_Fahrt_Id = r.Zaehl_Fahrt_Id
AND t.Sequenz = r.Sequenz + 1
)
SELECT Zaehl_Fahrt_Id, PlanAbfahrtStart, Fahrzeit, Sequenz, PlanAbfahrtStart,
CASE WHEN Stop_arrTime = PlanAbfahrtStart THEN NULL ELSE Stop_arrTime END AS Stop_arrTime
FROM recCTE
ORDER BY Zaehl_Fahrt_Id, Sequenz
I'm quite sure this works:
SELECT a.Vehicle_No, a.Stop1_DepTime,
a.Segment_TravelTime, a.Stop_Sequence, a.Stop1_DepTime +
(SELECT SUM(b.Segment_TravelTime) FROM your_table b
WHERE b.Vehicle_No = a.Vehicle_No AND b.Stop_Sequence < a.Stop_Sequence) AS Stop_arrTime
FROM your_table a
ORDER BY a.Vehicle_No, a.Stop_Sequence
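Both answers compute the same thing: the arrival time at a stop is the first departure time plus the sum of the travel times of all earlier segments of that vehicle. As a cross-check of the expected numbers, here is a short Python sketch of that running sum on the sample rows. The function name `arrival_times` is made up, and `accumulate(..., initial=0)` needs Python 3.8 or later.

```python
from itertools import accumulate, groupby
from operator import itemgetter

# (Vehicle_No, Stop1_DepTime, Segment_TravelTime, Stop_Sequence)
rows = [
    (201, 13000, 60, 1), (201, 13000, 45, 2), (201, 13000, 120, 3),
    (201, 13000, None, 4), (202, 13300, 240, 1), (202, 13300, 60, 2),
]


def arrival_times(rows):
    """Return {Vehicle_No: [Stop_arrTime per stop, in sequence order]}."""
    result = {}
    ordered = sorted(rows, key=itemgetter(0, 3))
    for vehicle, stops in groupby(ordered, key=itemgetter(0)):
        stops = list(stops)
        # Travel time accumulated before each stop: 0 for the first stop,
        # then the running sum of the preceding segments.
        travelled = accumulate((s[2] or 0 for s in stops[:-1]), initial=0)
        result[vehicle] = [None if s[3] == 1 else s[1] + t
                           for s, t in zip(stops, travelled)]
    return result


arrivals = arrival_times(rows)
```

The first stop of each vehicle gets None, matching the empty Stop_arrTime in the desired output.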

MySQL Hierarchical Structure Data Extraction

I've been struggling for about 2 hours on one query now. Help? :(
I have a table like this:
id | name              | lft | rgt
35 | Top level board   |   1 |  16
37 | 2nd level board 3 |   6 |  15
38 | 2nd level board 2 |   4 |   5
39 | 2nd level board 1 |   2 |   3
40 | 3rd level board 1 |  13 |  14
41 | 3rd level board 2 |   9 |  12
42 | 3rd level board 3 |   7 |   8
43 | 4th level board 1 |  10 |  11
It is stored in the structure recommended in this tutorial. What I want to do is select a forum board and all sub forums ONE level below the selected forum board (no lower). Ideally, the query would get the selected forum's level while only being passed the board's ID, then it would select that forum and all its immediate children.
So, I would hopefully end up with:
id | name              | lft | rgt
35 | Top level board   |   1 |  16
37 | 2nd level board 3 |   6 |  15
38 | 2nd level board 2 |   4 |   5
39 | 2nd level board 1 |   2 |   3
Or
id | name              | lft | rgt
37 | 2nd level board 3 |   6 |  15
40 | 3rd level board 1 |  13 |  14
41 | 3rd level board 2 |   9 |  12
42 | 3rd level board 3 |   7 |   8
The top rows here are the parent forums, the others are sub forums. Also, I'd like something where a depth value is given, where the depth is relative to the selected parent forum. For example, taking the last table as some working data, we would have:
id | name              | lft | rgt | depth
37 | 2nd level board 3 |   6 |  15 |     0
40 | 3rd level board 1 |  13 |  14 |     1
41 | 3rd level board 2 |   9 |  12 |     1
42 | 3rd level board 3 |   7 |   8 |     1
Or
id | name              | lft | rgt | depth
35 | Top level board   |   1 |  16 |     0
37 | 2nd level board 3 |   6 |  15 |     1
38 | 2nd level board 2 |   4 |   5 |     1
39 | 2nd level board 1 |   2 |   3 |     1
I hope you get my drift here.
Can anyone help with this? It's really getting me annoyed now :(
James
The easiest way to do it: just add a column where you keep the depth.
Otherwise the query will be very inefficient: you will have to get the whole hierarchy, sorted by left number (which puts the very first child first), and join it to itself to make sure that for each next node the left number is equal to the previous node's right number + 1.
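To see what has to be computed when no depth column is stored, here is a small Python sketch on the question's rows. It uses a different test from the sibling-chaining idea above, chosen only because it is easy to check by hand: a descendant is an immediate child when no other node sits strictly between it and the selected board. The function name `board_with_children` is made up.

```python
# (id, name, lft, rgt), copied from the question's table
boards = [
    (35, "Top level board", 1, 16),
    (37, "2nd level board 3", 6, 15),
    (38, "2nd level board 2", 4, 5),
    (39, "2nd level board 1", 2, 3),
    (40, "3rd level board 1", 13, 14),
    (41, "3rd level board 2", 9, 12),
    (42, "3rd level board 3", 7, 8),
    (43, "4th level board 1", 10, 11),
]


def board_with_children(rows, board_id):
    """Return the board (depth 0) followed by its immediate children (depth 1)."""
    parent = next(r for r in rows if r[0] == board_id)
    out = [(*parent, 0)]
    for node in rows:
        if not (parent[2] < node[2] and node[3] < parent[3]):
            continue  # not a descendant of the selected board
        # Count nodes that contain this node and are themselves inside the parent.
        between = sum(1 for m in rows
                      if parent[2] < m[2] < node[2] and node[3] < m[3] < parent[3])
        if between == 0:
            out.append((*node, 1))
    return out


children_of_37 = [row[0] for row in board_with_children(boards, 37)]
children_of_35 = [row[0] for row in board_with_children(boards, 35)]
```

The inner count compares every descendant against every other row, which is the kind of self-join cost that a stored depth column avoids.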
In general, the nested intervals algorithm is nice, but it has a serious disadvantage: if you add something to the tree, a lot of recalculation is required.
A nice alternative is Tropashko's nested intervals algorithm with continued fractions; just google for it. Getting a single level below the parent comes very naturally with this algorithm. Also, given a child, you can calculate all numbers for all its parents without hitting the database.
One more thing to consider is that relational databases really are not the most natural way to store hierarchical data. A structure like the one you have here (a binary tree, essentially) would be much easier to represent with an XML blob that you can persist, or to store as an object in an object-oriented database.
I prefer the adjacency list approach myself. The following example uses a non-recursive stored procedure to return a tree/subtree which I then transform into an XML DOM but you could do whatever you like with the resultset. Remember it's a single call from PHP to MySQL and adjacency lists are much easier to manage.
full script here : http://pastie.org/1294143
PHP
<?php
header("Content-type: text/xml");
$conn = new mysqli("localhost", "foo_dbo", "pass", "foo_db", 3306);
// one non-recursive db call to get the tree
$result = $conn->query(sprintf("call department_hier(%d,%d)", 2,1));
$xml = new DomDocument;
$xpath = new DOMXpath($xml);
$dept = $xml->createElement("department");
$xml->appendChild($dept);
// loop and build the DOM
while($row = $result->fetch_assoc()){
$staff = $xml->createElement("staff");
// foreach($row as $col => $val) $staff->setAttribute($col, $val);
$staff->setAttribute("staff_id", $row["staff_id"]);
$staff->setAttribute("name", $row["name"]);
$staff->setAttribute("parent_staff_id", $row["parent_staff_id"]);
if(is_null($row["parent_staff_id"])){
$dept->setAttribute("dept_id", $row["dept_id"]);
$dept->setAttribute("department_name", $row["department_name"]);
$dept->appendChild($staff);
}
else{
$qry = sprintf("//*[@staff_id = '%d']", $row["parent_staff_id"]);
$parent = $xpath->query($qry)->item(0);
if(!is_null($parent)) $parent->appendChild($staff);
}
}
$result->close();
$conn->close();
echo $xml->saveXML();
?>
XML Output
<department dept_id="2" department_name="Mathematics">
<staff staff_id="1" name="f00" parent_staff_id="">
<staff staff_id="5" name="gamma" parent_staff_id="1"/>
<staff staff_id="6" name="delta" parent_staff_id="1">
<staff staff_id="7" name="zeta" parent_staff_id="6">
<staff staff_id="2" name="bar" parent_staff_id="7"/>
<staff staff_id="8" name="theta" parent_staff_id="7"/>
</staff>
</staff>
</staff>
</department>
SQL Stuff
-- TABLES
drop table if exists staff;
create table staff
(
staff_id smallint unsigned not null auto_increment primary key,
name varchar(255) not null
)
engine = innodb;
drop table if exists departments;
create table departments
(
dept_id tinyint unsigned not null auto_increment primary key,
name varchar(255) unique not null
)
engine = innodb;
drop table if exists department_staff;
create table department_staff
(
dept_id tinyint unsigned not null,
staff_id smallint unsigned not null,
parent_staff_id smallint unsigned null,
primary key (dept_id, staff_id),
key (staff_id),
key (parent_staff_id)
)
engine = innodb;
-- STORED PROCEDURES
drop procedure if exists department_hier;
delimiter #
create procedure department_hier
(
in p_dept_id tinyint unsigned,
in p_staff_id smallint unsigned
)
begin
declare v_done tinyint unsigned default 0;
declare v_dpth smallint unsigned default 0;
create temporary table hier(
dept_id tinyint unsigned,
parent_staff_id smallint unsigned,
staff_id smallint unsigned,
depth smallint unsigned
)engine = memory;
insert into hier select dept_id, parent_staff_id, staff_id, v_dpth from department_staff
where dept_id = p_dept_id and staff_id = p_staff_id;
/* http://dev.mysql.com/doc/refman/5.0/en/temporary-table-problems.html */
create temporary table tmp engine=memory select * from hier;
while not v_done do
if exists( select 1 from department_staff e
inner join hier on e.dept_id = hier.dept_id and e.parent_staff_id = hier.staff_id and hier.depth = v_dpth) then
insert into hier select e.dept_id, e.parent_staff_id, e.staff_id, v_dpth + 1 from department_staff e
inner join tmp on e.dept_id = tmp.dept_id and e.parent_staff_id = tmp.staff_id and tmp.depth = v_dpth;
set v_dpth = v_dpth + 1;
truncate table tmp;
insert into tmp select * from hier where depth = v_dpth;
else
set v_done = 1;
end if;
end while;
select
hier.dept_id,
d.name as department_name,
s.staff_id,
s.name,
p.staff_id as parent_staff_id,
p.name as parent_name,
hier.depth
from
hier
inner join departments d on hier.dept_id = d.dept_id
inner join staff s on hier.staff_id = s.staff_id
left outer join staff p on hier.parent_staff_id = p.staff_id;
drop temporary table if exists hier;
drop temporary table if exists tmp;
end #
delimiter ;
-- TEST DATA
insert into staff (name) values
('f00'),('bar'),('alpha'),('beta'),('gamma'),('delta'),('zeta'),('theta');
insert into departments (name) values
('Computing'),('Mathematics'),('English'),('Engineering'),('Law'),('Music');
insert into department_staff (dept_id, staff_id, parent_staff_id) values
(1,1,null),
(1,2,1),
(1,3,1),
(1,4,3),
(1,7,4),
(2,1,null),
(2,5,1),
(2,6,1),
(2,7,6),
(2,8,7),
(2,2,7);
-- TESTING (call this sproc from your php)
call department_hier(1,1);
call department_hier(2,1);
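The loop inside department_hier walks the tree one level at a time: it starts from the requested staff member and repeatedly adds everyone whose parent was added in the previous round. The same idea can be sketched in Python on the test data above. This is an illustration only, it returns fewer columns than the stored procedure, and like the procedure it assumes the data contains no cycles.

```python
# (dept_id, staff_id, parent_staff_id), copied from the test data
department_staff = [
    (1, 1, None), (1, 2, 1), (1, 3, 1), (1, 4, 3), (1, 7, 4),
    (2, 1, None), (2, 5, 1), (2, 6, 1), (2, 7, 6), (2, 8, 7), (2, 2, 7),
]


def department_hier(rows, dept_id, staff_id):
    """Return [(staff_id, parent_staff_id, depth)] for the subtree, level by level."""
    hier = [(s, p, 0) for d, s, p in rows if d == dept_id and s == staff_id]
    frontier = {s for s, _, _ in hier}  # staff added in the previous round
    depth = 0
    while frontier:
        depth += 1
        level = [(s, p, depth) for d, s, p in rows
                 if d == dept_id and p in frontier]
        hier += level
        frontier = {s for s, _, _ in level}
    return hier


maths = department_hier(department_staff, 2, 1)
```

For department 2 this gives staff 1 at depth 0, staff 5 and 6 at depth 1, staff 7 at depth 2, and staff 8 and 2 at depth 3, which matches the nesting in the XML output shown earlier.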