MySQL / MariaDB: Historization - How to do it better

I have exports of person data that I would like to import into a table with historization.
I wrote the individual SQL steps, but two questions arise:
1. There is a step where I get an unexpected date.
2. I would like to avoid submitting the steps manually and use a stored procedure instead.
The tables are:
Table to be filled considering historization:
CREATE TABLE person (
id INTEGER DEFAULT NULL
, name VARCHAR(50) DEFAULT NULL
, effective_dt DATE DEFAULT NULL
, expiry_dt DATE DEFAULT NULL
);
Table with person data to be imported:
CREATE TABLE person_stg (
id INTEGER DEFAULT NULL
, name VARCHAR(50) DEFAULT NULL
, export_dt DATE DEFAULT NULL
, import_flag TINYINT DEFAULT 0
);
-- Several exports which have to be imported
INSERT INTO person_stg (id, name, export_dt) VALUES
(1,'Jonn' , '2000-01-01')
, (2,'Marry' , '2000-01-01')
, (1,'John' , '2000-01-05')
, (2,'Marry' , '2000-01-06')
, (2,'Mary' , '2000-01-10')
, (3,'Samuel', '2000-01-10')
, (2,'Maria' , '2000-01-15')
;
The following first step (1) populates the table person with the first state of the person:
INSERT INTO person
SELECT a.id, a.name, a.export_dt, '9999-12-31' expiry_dt
FROM person_stg a
LEFT JOIN person_stg b
ON a.id = b.id
AND a.export_dt > b.export_dt
WHERE b.id IS NULL
;
SELECT * FROM person ORDER BY id, effective_dt;
+----+--------+--------------+------------+
| id | name | effective_dt | expiry_dt |
+----+--------+--------------+------------+
| 1 | Jonn | 2000-01-01 | 9999-12-31 |
| 2 | Marry | 2000-01-01 | 9999-12-31 |
| 3 | Samuel | 2000-01-10 | 9999-12-31 |
+----+--------+--------------+------------+
Step (2) changes the expiry date:
-- (2) Update expiry_dt where changes happened
UPDATE
person a
, person_stg b
SET a.expiry_dt = SUBDATE(b.export_dt,1)
WHERE a.id = b.id
AND a.name <> b.name
AND a.expiry_dt = '9999-12-31'
AND b.export_dt = (SELECT MIN(b.export_dt)
FROM person_stg c
WHERE b.id = c.id
AND c.import_flag = 0
)
;
SELECT * FROM person ORDER BY id, effective_dt;
+----+--------+--------------+------------+
| id | name | effective_dt | expiry_dt |
+----+--------+--------------+------------+
| 1 | Jonn | 2000-01-01 | 2000-01-04 |
| 2 | Marry | 2000-01-01 | 2000-01-09 |
| 3 | Samuel | 2000-01-10 | 9999-12-31 |
+----+--------+--------------+------------+
The third step (3) inserts the second status of person data:
-- (3) Insert new exports which have changes
INSERT INTO person
SELECT a.id, a.name, a.export_dt, '9999-12-31' expiry_dt
FROM person_stg a
INNER JOIN person b
ON a.id = b.id
AND b.expiry_dt = SUBDATE(a.export_dt,1)
AND a.export_dt > b.effective_dt
AND a.import_flag = 0
;
SELECT * FROM person ORDER BY id, effective_dt;
+----+--------+--------------+------------+
| id | name | effective_dt | expiry_dt |
+----+--------+--------------+------------+
| 1 | Jonn | 2000-01-01 | 2000-01-04 |
| 1 | John | 2000-01-05 | 9999-12-31 |
| 2 | Marry | 2000-01-01 | 2000-01-09 |
| 2 | Mary | 2000-01-10 | 9999-12-31 |
| 3 | Samuel | 2000-01-10 | 9999-12-31 |
+----+--------+--------------+------------+
And the last step (4) marks in person_stg which records were imported:
-- (4) Define imported records
UPDATE
person_stg a
, person b
SET import_flag = 1
WHERE a.id = b.id
AND a.export_dt = b.effective_dt
;
So far, so good. If I repeat step (2), I get the following table:
+----+--------+--------------+------------+
| id | name | effective_dt | expiry_dt |
+----+--------+--------------+------------+
| 1 | Jonn | 2000-01-01 | 2000-01-04 |
| 1 | John | 2000-01-05 | 9999-12-31 |
| 2 | Marry | 2000-01-01 | 2000-01-09 |
| 2 | Mary | 2000-01-10 | 1999-12-31 | <--- ??? Should be 2000-01-14
| 3 | Samuel | 2000-01-10 | 9999-12-31 |
+----+--------+--------------+------------+
Mary/2000-01-10 got expiry_dt 1999-12-31 instead of 2000-01-14. I don't understand how this can happen.
So, my questions are:
(1a) Why does this update of the expiry date give this strange date?
(1b) Is there better code than (2)?
(2) How can I repeat steps (2) to (4) automatically? I only need some hints for a stored procedure.

If I understand what you want to do, you don't need a multi-step process. You are just looking for the "end date" for each record. Here is a method that uses correlated subqueries:
SELECT p.*, export_dt as effdate,
COALESCE((SELECT export_dt - interval 1 day
FROM person_stg p2
WHERE p2.id = p.id AND
p2.export_dt > p.export_dt
ORDER BY p2.export_dt
LIMIT 1
), '9999-12-31') as enddate
FROM person_stg p;
You can also do something using variables.
I'm not sure if this answers your question, because it replaces the whole process with a simpler query.
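The query above returns an end date for every staging row, including exports where nothing changed (for example Marry on 2000-01-06). A possible variant that loads person directly and skips unchanged exports is sketched below. It assumes MySQL 8.0+ or MariaDB 10.2+ (window functions), non-NULL names, and an empty person table; the alias chg and the column prev_name are made up for illustration. As a side note, step (2) in the question takes MIN(b.export_dt), a column of the outer alias b, where c.export_dt was presumably intended, which may be why the repeated run picked up the 2000-01-01 export.

```sql
-- Sketch only: full (re)load of person from person_stg in one statement.
INSERT INTO person (id, name, effective_dt, expiry_dt)
SELECT chg.id,
       chg.name,
       chg.export_dt,
       -- day before the next change of the same id, or the open-ended date
       COALESCE(
           DATE_SUB(LEAD(chg.export_dt) OVER (PARTITION BY chg.id ORDER BY chg.export_dt),
                    INTERVAL 1 DAY),
           '9999-12-31')
FROM (
    -- attach the previous name of the same id to every export
    SELECT id, name, export_dt,
           LAG(name) OVER (PARTITION BY id ORDER BY export_dt) AS prev_name
    FROM person_stg
) AS chg
-- keep the first export of an id and every export whose name changed
WHERE chg.prev_name IS NULL
   OR chg.prev_name <> chg.name;
```

On the sample data this should produce six rows, with Marry running from 2000-01-01 to 2000-01-09 and Mary from 2000-01-10 to 2000-01-14. LEAD is evaluated after the WHERE filter, so it sees only the exports that carry a change.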

I found a solution using a cursor, which I had never used before. First I wrote a stored procedure (SP), sp_add_record, which, given an id and export_dt from person_stg, closes the current status, inserts a new status, or inserts a new element. A second SP, curs_add_records, then calls it for every staging row via a cursor:
CALL curs_add_records();
SELECT * FROM person;
+----+--------+--------------+------------+
| id | name | effective_dt | expiry_dt |
+----+--------+--------------+------------+
| 1 | Jonn | 2000-01-01 | 2000-01-04 |
| 2 | Marry | 2000-01-01 | 2000-01-09 |
| 1 | John | 2000-01-05 | 9999-12-31 |
| 2 | Mary | 2000-01-10 | 2000-01-14 |
| 3 | Samuel | 2000-01-10 | 9999-12-31 |
| 2 | Maria | 2000-01-15 | 9999-12-31 |
+----+--------+--------------+------------+
The advantage of this procedure is that I can load the table with the same code regardless of whether it is an initial load (population load) or an incremental one.
Literature I used:
Djoni Darmawikarta: Dimensional Data Warehousing with MySQL (DWH issues)
Ben Forta: MariaDB Crash Course (SP issues)
What follows are the SPs I used.
PS: Was it appropriate to answer my own question?
DELIMITER //
DROP PROCEDURE IF EXISTS sp_add_record //
CREATE PROCEDURE sp_add_record(
IN p_id INTEGER
, IN p_export_dt DATE
)
BEGIN
-- Change expiry_dt
UPDATE
person p
, person_stg s
SET p.expiry_dt = SUBDATE(p_export_dt,1)
WHERE p.id = s.id
AND p.id = p_id
AND s.export_dt = p_export_dt
AND p.effective_dt <= p_export_dt
AND ( p.name <> s.name )
AND p.expiry_dt = '9999-12-31'
;
-- Add new status
INSERT INTO person
SELECT s.id, s.name, s.export_dt, '9999-12-31' expiry_dt
FROM
person p
, person_stg s
WHERE p.id = s.id
AND p.id = p_id
AND s.export_dt = p_export_dt
AND ( p.name <> s.name )
-- does an entry exist with the new expiry_dt?
AND EXISTS (SELECT *
FROM person p2
WHERE p2.id = p.id
AND p.expiry_dt = SUBDATE(p_export_dt,1)
)
-- an entry with open expiry_dt should not exist
AND NOT EXISTS (SELECT *
FROM person p3
WHERE p3.id = p.id
AND p3.expiry_dt = '9999-12-31'
)
;
-- Add new id
INSERT INTO person
SELECT s.id, s.name, s.export_dt, '9999-12-31' expiry_dt
FROM person_stg s
WHERE s.export_dt = p_export_dt
AND s.id = p_id
-- Add new id from stage if it does not exist in person
AND s.id NOT IN (SELECT p3.id
FROM person p3
WHERE p3.id = s.id
AND p3.expiry_dt = '9999-12-31'
)
;
END
//
DELIMITER ;
DELIMITER //
DROP PROCEDURE IF EXISTS curs_add_records //
CREATE PROCEDURE curs_add_records()
BEGIN
-- Local variables
DECLARE done BOOLEAN DEFAULT 0;
DECLARE p_id INTEGER;
DECLARE p_export_dt DATE;
-- Cursor
DECLARE c1 CURSOR
FOR
SELECT id, export_dt
FROM person_stg
ORDER BY export_dt, id
;
-- Declare continue handler
DECLARE CONTINUE HANDLER FOR SQLSTATE '02000' SET done=1;
-- Open cursor
OPEN c1;
-- Loop through all rows
REPEAT
-- Get record
FETCH c1 INTO p_id, p_export_dt;
-- Call add record procedure, skipping the final fetch that found no row
IF NOT done THEN
CALL sp_add_record(p_id,p_export_dt);
END IF;
-- End of loop
UNTIL done END REPEAT;
-- Close cursor
CLOSE c1;
END;
//
DELIMITER ;

Related

Update last row in group with data from first row in group

I'm currently in the process of converting data from one structure to another, and in the process I have to take a status id from the first entry in the group and apply it to the last entry in that same group. I am able to target and update the last item in the group just fine when using a hard-coded value, but I'm hitting a wall when trying to use the status_id from the first entry. Here is an example of the data structure.
-----------------------------------------------------------
| id | ticket_id | status_id | new_status_id | created_at |
-----------------------------------------------------------
| 1 | 10 | NULL | 3 | 2018-06-20 |
| 2 | 10 | 1 | 1 | 2018-06-22 |
| 3 | 10 | 1 | 1 | 2018-06-23 |
| 4 | 10 | 1 | 1 | 2018-06-26 |
-----------------------------------------------------------
So the idea would be to take the new_status_id of ID 1 and apply it to the same field for ID 4.
Here is the query that works when using a hard-coded value
UPDATE Communications_History as ch
JOIN
(
SELECT communication_id, MAX(created_at) max_time, new_status_id
FROM Communications_History
GROUP BY communication_id
) ch2
ON ch.communication_id = ch2.communication_id AND ch.created_at = ch2.max_time
SET ch.new_status_id = 3
But when I use the following query, I get Unknown column ch.communication_id in where clause
UPDATE Communications_History as ch
JOIN
(
SELECT communication_id, MAX(created_at) max_time, new_status_id
FROM Communications_History
GROUP BY communication_id
) ch2
ON ch.communication_id = ch2.communication_id AND ch.created_at = ch2.max_time
SET ch.new_status_id = (
SELECT nsi FROM
(
SELECT new_status_id FROM Communications_History WHERE communication_id = ch.communication_id AND status_id IS NULL
) as ch3
)
Thanks!
So I just figured it out using variables. It turns out the original "solution" only worked when there was one ticket's worth of history in the table, but when all the data was imported, it no longer worked. However, this tweak did seem to fix the issue.
UPDATE Communications_History as ch
JOIN
(
SELECT communication_id, MAX(created_at) max_time, new_status_id
FROM Communications_History
GROUP BY communication_id
) ch2
ON ch.communication_id = ch2.communication_id AND ch.created_at = ch2.max_time
SET ch.new_status_id = ch2.new_status_id;
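Note that new_status_id in the grouped subquery is not aggregated, so which row's value it carries is not guaranteed (and the query is rejected when ONLY_FULL_GROUP_BY is enabled). A sketch of an alternative that names the first row explicitly through a self-join follows. It assumes created_at is unique within each communication_id; the aliases last_row, first_row and bounds are illustrative only.

```sql
-- Sketch: copy new_status_id from the earliest row of each group to the latest row.
UPDATE Communications_History AS last_row
JOIN (
    -- first and last timestamp of every group
    SELECT communication_id,
           MIN(created_at) AS min_time,
           MAX(created_at) AS max_time
    FROM Communications_History
    GROUP BY communication_id
) AS bounds
  ON bounds.communication_id = last_row.communication_id
 AND bounds.max_time = last_row.created_at
JOIN Communications_History AS first_row
  ON first_row.communication_id = bounds.communication_id
 AND first_row.created_at = bounds.min_time
SET last_row.new_status_id = first_row.new_status_id;
```

Groups with a single row are joined to themselves and keep their value.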

outer query column in inner query case expression

I am trying to write below query in vertica
SELECT a.*
FROM a
WHERE a.country="India"
AND a.language ="Hindi"
AND ( CASE WHEN (a.spoken = true
AND exist ( select 1
FROM b
WHERE b.country=a.country
AND b.language=a.language
AND ( CASE WHEN (a.population <b.population
AND a.statsyear > b.statsyear))
THEN true //pick recent stats
WHEN (a.population > b.population)
THEN true
ELSE false
END)) THEN true
WHEN (a.written = true ) THEN
true
ELSE false
END)
It is not working, because the outer query field "a.population" can't be referenced in a CASE expression of the inner query. I tried rewriting it with an OR clause, but Vertica does not allow that either.
How can I rewrite this?
I created the tables below on a local MySQL box.
Example of tables and results:
CREATE TABLE tableA
(
id INT,
country VARCHAR(20),
language VARCHAR(20),
spoken INT,
written INT,
population INT,
stats INT
);
insert into tableA values(1,'India','Hindi',1,0,9,2010);
insert into tableA values(2,'India','Hindi',1,0,11,2011);
insert into tableA values(3,'India','Hindi',1,0,10,2012);
insert into tableA values(4,'India','Hindi',0,1,10,2013);
insert into tableA values(5,'India','Hindi',1,1,10,2012);
insert into tableA values(6,'India','English',1,1,10,2012);
CREATE TABLE tableB
(
id INT,
country VARCHAR(20),
language VARCHAR(20),
population INT,
stats INT
);
insert into TableB values(1,'India','Hindi',10,2009);
insert into TableB values(2,'India','Hindi',10,2011);
insert into TableB values(3,'India','Hindi',10,2012);
I rewrote the query in a slightly different way:
select distinct a.id
from (
SELECT a.*
FROM TableA a
WHERE a.country="India"
AND a.language ="Hindi" ) a, TableB b
WHere ( CASE WHEN a.written=1 THEN
TRUE
WHEN ( (a.spoken = 1) AND (a.country=b.country) AND (a.language=b.language)) THEN
(case WHEN ((a.population < b.population) AND (a.stats > b.stats)) THEN
TRUE
WHEN (a.population > b.population) THEN
TRUE
ELSE
FALSE
END)
ELSE
FALSE
END)
and got the results below:
1,2,4,5
This is what I need. Could you please help me write it in a more efficient manner?
Boolean logic equivalent:
SELECT DISTINCT a.*
FROM TableA a
left join TableB b on a.country=b.country AND a.language=b.language
WHERE a.country='India'
AND a.language ='Hindi'
AND (
a.written=1
OR
(a.spoken = 1 AND (
(a.population < b.population AND a.stats > b.stats)
OR
a.population > b.population
))
)
;
Result:
+----+---------+----------+--------+---------+------------+-------+
| id | country | language | spoken | written | population | stats |
+----+---------+----------+--------+---------+------------+-------+
| 1 | India | Hindi | 1 | 0 | 9 | 2010 |
| 2 | India | Hindi | 1 | 0 | 11 | 2011 |
| 4 | India | Hindi | 0 | 1 | 10 | 2013 |
| 5 | India | Hindi | 1 | 1 | 10 | 2012 |
+----+---------+----------+--------+---------+------------+-------+
Demo
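The join can return one row of TableA several times, which is why DISTINCT is needed. A sketch of an EXISTS form that avoids the duplicates in the first place is shown below. It is written for the MySQL test tables above; Vertica's restrictions on correlated subqueries, which prompted the question, may still apply there.

```sql
-- Sketch: same filter as the CASE version, with TableB checked through EXISTS.
SELECT a.*
FROM TableA a
WHERE a.country = 'India'
  AND a.language = 'Hindi'
  AND (
        a.written = 1
     OR (a.spoken = 1 AND EXISTS (
            SELECT 1
            FROM TableB b
            WHERE b.country = a.country
              AND b.language = a.language
              AND (   (a.population < b.population AND a.stats > b.stats)
                   OR a.population > b.population)
        ))
      );
```

On the sample data this should again return ids 1, 2, 4 and 5.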

MySQL update join query to solve duplicate Values

I have a Categories table which has some duplicate Categories as described below,
`Categories`
+========+============+============+
| cat_id | cat_name | item_count |
+========+============+============+
| 1 | Category 1 | 2 |
| 2 | Category 1 | 1 |
| 3 | Category 2 | 2 |
| 4 | Category 3 | 1 |
| 5 | Category 3 | 1 |
+--------+------------+------------+
Here is another junction table which relates to another Items table. The item_count in the first table is the total number of items per cat_id.
`Junction`
+========+=========+
| cat_id | item_id |
+========+=========+
| 1 | 100 |
| 1 | 101 |
| 2 | 102 |
| 3 | 103 |
| 3 | 104 |
| 4 | 105 |
| 5 | 106 |
+--------+---------+
How do I add or combine those items from the duplicate Categories into ones each having maximum item_count among their duplicates? (e.g. Category 1).
Also, if the item_count is the same for those duplicate ones, then the Category with maximum cat_id will be chosen and item_count will be combined to that record. (e.g. Category 3).
Note: Instead of removing the duplicate records, the item_count will
be set to 0.
Below is the expected result.
+========+============+============+
| cat_id | cat_name | item_count |
+========+============+============+
| 1 | Category 1 | 3 |
| 2 | Category 1 | 0 |
| 3 | Category 2 | 2 |
| 4 | Category 3 | 0 |
| 5 | Category 3 | 2 |
+--------+------------+------------+
+========+=========+
| cat_id | item_id |
+========+=========+
| 1 | 100 |
| 1 | 101 |
| 1 | 102 |
| 3 | 103 |
| 3 | 104 |
| 5 | 105 |
| 5 | 106 |
+--------+---------+
In the result, there are two duplicates Category 1 and Category 3. And we have 2 scenarios,
cat_id=2 is eliminated because its item_count=1 is less than
that of cat_id=1 which is item_count=2.
cat_id=4 is eliminated even though its item_count is the same
as that of cat_id=5 since 5 is the maximum among duplicate
Category 3.
Is there a query that can join and update both tables in order to resolve the duplicates?
Here's a SELECT. You can figure out how to adapt it to an UPDATE ;-)
I've ignored the junction table for simplicity.
SELECT z.cat_id
, z.cat_name
, (z.cat_id = x.cat_id) * new_count item_count
FROM categories x
LEFT
JOIN categories y
ON y.cat_name = x.cat_name
AND (y.item_count > x.item_count OR (y.item_count = x.item_count AND y.cat_id > x.cat_id))
LEFT
JOIN
( SELECT a.cat_id, b.*
FROM categories a
JOIN
( SELECT cat_name, SUM(item_count) new_count, MAX(item_count) max_count FROM categories GROUP BY cat_name) b
ON b.cat_name = a.cat_name
) z
ON z.cat_name = x.cat_name
WHERE y.cat_id IS NULL;
+--------+------------+------------+
| cat_id | cat_name | item_count |
+--------+------------+------------+
| 1 | Category 1 | 3 |
| 2 | Category 1 | 0 |
| 3 | Category 2 | 2 |
| 4 | Category 3 | 0 |
| 5 | Category 3 | 2 |
+--------+------------+------------+
DELIMITER $$
DROP PROCEDURE IF EXISTS cursor_proc $$
CREATE PROCEDURE cursor_proc()
BEGIN
DECLARE v_cat_id INT;
DECLARE v_cat_name VARCHAR(255);
DECLARE v_item_count INT;
DECLARE v_prev_cat_name VARCHAR(255) DEFAULT NULL;
DECLARE v_max_item_count INT DEFAULT 0;
DECLARE v_best_cat_id INT DEFAULT 0;
DECLARE v_total_item_count INT DEFAULT 0;
-- this flag will be set to true when the cursor reaches the end of the table
DECLARE exit_loop BOOLEAN DEFAULT FALSE;
-- Declare the cursor
DECLARE categories_cursor CURSOR FOR
SELECT cat_id, cat_name, item_count FROM Categories ORDER BY cat_name, cat_id;
-- set exit_loop flag to true if there are no more rows
DECLARE CONTINUE HANDLER FOR NOT FOUND SET exit_loop = TRUE;
-- open the cursor
OPEN categories_cursor;
-- start looping
categories_loop: LOOP
-- read the next row into the variables
FETCH categories_cursor INTO v_cat_id, v_cat_name, v_item_count;
-- close the cursor and exit the loop if there are no more rows
IF exit_loop THEN
CLOSE categories_cursor;
LEAVE categories_loop;
END IF;
IF v_prev_cat_name IS NULL OR v_prev_cat_name <> v_cat_name THEN
-- Category has changed: give the 'best' row of the previous category the total item count
IF v_best_cat_id > 0 THEN
UPDATE Categories
SET item_count = v_total_item_count
WHERE cat_id = v_best_cat_id;
END IF;
-- Reset values with the current row's values
SET v_max_item_count = v_item_count;
SET v_prev_cat_name = v_cat_name;
SET v_best_cat_id = v_cat_id;
SET v_total_item_count = v_item_count;
ELSE
-- increment the total item count
SET v_total_item_count = v_total_item_count + v_item_count;
IF v_max_item_count <= v_item_count THEN
-- the current row is the new 'best' (ties go to the higher cat_id): zero the previous best
UPDATE Categories SET item_count = 0 WHERE cat_id = v_best_cat_id;
SET v_max_item_count = v_item_count;
SET v_best_cat_id = v_cat_id;
ELSE
-- else, this row is not the best of its category
UPDATE Categories SET item_count = 0 WHERE cat_id = v_cat_id;
END IF;
END IF;
END LOOP categories_loop;
-- write the total for the last category
IF v_best_cat_id > 0 THEN
UPDATE Categories SET item_count = v_total_item_count WHERE cat_id = v_best_cat_id;
END IF;
END $$
DELIMITER ;
It's not pretty and copied in part from Strawberry's SELECT
UPDATE categories cat,
junction jun,
(select
(z.cat_id = x.cat_id) * new_count c,
x.cat_id newcatid,
z.cat_id oldcatid
from categories x
LEFT
JOIN categories y
ON y.cat_name = x.cat_name
AND (y.item_count > x.item_count OR (y.item_count = x.item_count AND y.cat_id > x.cat_id))
LEFT
JOIN
( SELECT a.cat_id, b.*
FROM categories a
JOIN
( SELECT cat_name, SUM(item_count) new_count, MAX(item_count) max_count FROM categories GROUP BY cat_name) b
ON b.cat_name = a.cat_name
) z
ON z.cat_name = x.cat_name
WHERE
y.cat_id IS NULL) sourceX
SET cat.item_count = sourceX.c, jun.cat_id = sourceX.newcatid
WHERE cat.cat_id = jun.cat_id and cat.cat_id = sourceX.oldcatid
I think it's better to do what you want one step at time:
First, get data you need:
SELECT Max(`cat_id`), sum(`item_count`) FROM `Categories` GROUP BY `cat_name`
With these data you'll be able to check if update was correctly done.
Then, with a loop on acquired data, update:
update Categories set item_count =
(
Select Tot FROM (
Select sum(`item_count`) as Tot
FROM `Categories`
WHERE `cat_name` = '#cat_name') as tmp1
)
WHERE cat_id = (
Select MaxId
FROM (
select max(cat_id) as MaxId
FROM Categories
WHERE `cat_name` = '#cat_name') as tmp2)
Be careful: if you run this code twice, the result will be wrong.
Finally, set the item_count of the other ids to 0:
UPDATE Categories set item_count = 0
WHERE `cat_name` = '#cat_name'
AND cat_id <> (
Select MaxId
FROM (
select max(cat_id) as MaxId
FROM Categories
WHERE `cat_name` = '#cat_name') as tmp2)
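A set-based alternative that covers both tables and both tie-breaking rules might look like the sketch below. It assumes MySQL 8.0+ or MariaDB 10.2+ (window functions) and that Junction has no unique key that the remapping could violate; the temporary table cat_winner and its columns keep_id and total are hypothetical names.

```sql
-- 1. Pick one surviving cat_id per cat_name:
--    highest item_count first, then highest cat_id on ties.
CREATE TEMPORARY TABLE cat_winner AS
SELECT cat_name, cat_id AS keep_id, total
FROM (
    SELECT cat_name, cat_id,
           SUM(item_count) OVER (PARTITION BY cat_name) AS total,
           ROW_NUMBER() OVER (PARTITION BY cat_name
                              ORDER BY item_count DESC, cat_id DESC) AS rn
    FROM Categories
) AS ranked
WHERE rn = 1;

-- 2. Point every junction row at the surviving category.
UPDATE Junction j
JOIN Categories c ON c.cat_id = j.cat_id
JOIN cat_winner w ON w.cat_name = c.cat_name
SET j.cat_id = w.keep_id;

-- 3. The survivor gets the combined count, its duplicates get 0.
UPDATE Categories c
JOIN cat_winner w ON w.cat_name = c.cat_name
SET c.item_count = IF(c.cat_id = w.keep_id, w.total, 0);

DROP TEMPORARY TABLE cat_winner;
```

On the sample data the survivors should be cat_id 1, 3 and 5, which matches the expected result in the question. Since the winners are fixed in step 1, running steps 2 and 3 does not change which rows are chosen.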

Update Statement Using Max Date and userid as the criteria

I am trying to update the row of a contract table that has the latest Contract Start Date for a given employee id. The contract table stores all past information about the employee.
e.g.
contract_tbl
+------------+------------+--------------------+-----------------+---------------+
|Contractid |EmployeeId |ContractStartDate |ContractEndDate | Position |
+------------+------------+--------------------+-----------------+---------------+
| 1 | 1 | 2012-12-13 | 2013-12-12 | Data Entry |
+------------+------------+--------------------+-----------------+---------------+
| 2 | 1 | 2014-01-26 | 2015-01-25 | Data Entry |
+------------+------------+--------------------+-----------------+---------------+
| 3 | 2 | 2014-01-26 | 2015-01-25 | Data Entry |
+------------+------------+--------------------+-----------------+---------------+
This is the SQL that I have but it does not work. (using a mysql db)
UPDATE contract_tbl
SET Position='Data Analyst'
WHERE EmployeeId = 1 And ContractStartDate= (
select max(ContractStartDate
FROM contract_tbl))
So it should Update the second row shown above with Data Analyst in the Position column but I am getting an error.
Does anybody have any idea how to fix this?
Thanks in advance
This will also do:
UPDATE contract_tbl a
JOIN (
SELECT MAX(ContractStartDate) m
FROM contract_tbl
WHERE EmployeeId = 1) b ON a.ContractStartDate = b.m AND a.EmployeeId = 1
SET a.Position='Data Analyst';
Probably this is what you want:
UPDATE contract_tbl c1
SET Position='Data Analyst'
WHERE EmployeeId = 1 And ContractStartDate= (
SELECT max(ContractStartDate)
FROM contract_tbl c2
WHERE c2.EmployeeId = c1.EmployeeId
)
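One caveat: on many MySQL versions a subquery in the WHERE clause that reads the table being updated is rejected with error 1093 ("You can't specify target table ... for update in FROM clause"), so the statement above may not run as written. For a single employee, a simpler sketch is to let the single-table UPDATE sort and limit the rows itself, assuming the employee has only one contract with the latest start date:

```sql
-- Sketch: update only the most recent contract of employee 1.
-- ORDER BY ... LIMIT is accepted by MySQL/MariaDB for single-table UPDATE.
UPDATE contract_tbl
SET Position = 'Data Analyst'
WHERE EmployeeId = 1
ORDER BY ContractStartDate DESC
LIMIT 1;
```

With the sample rows this should change only Contractid 2.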

MySQL Multiplication of Hierarchical data in a single parent-children table

I am working on a project whose MySQL database contains two tables; people and percentages.
people table:
+----+------+--------+
| ID | Name | Parent |
+----+------+--------+
| 1 | A | 0 |
| 2 | B | 1 |
| 3 | C | 2 |
| 4 | D | 3 |
| 5 | E | 1 |
| 6 | F | 0 |
+----+------+--------+
Percentages table:
+----+------------+
| ID | Percentage |
+----+------------+
| 1 | 70% |
| 2 | 60% |
| 3 | 10% |
| 4 | 5% |
| 5 | 40% |
| 6 | 30% |
+----+------------+
The query result I am seeking should be as the following:
+----+------------+----------------+--------+
| ID | Percentage | Calculation | Actual |
+----+------------+----------------+--------+
| 1 | 70 | 70% | 70.00% |
| 2 | 60 | 70%*60% | 42.00% |
| 3 | 10 | 70%*60%*10% | 4.20% |
| 4 | 5 | 70%*60%*10%*5% | 0.21% |
| 5 | 40 | 70%*40% | 28.00% |
| 6 | 30 | 30% | 30.00% |
+----+------------+----------------+--------+
The Calculation column is only for elaboration. Is there any MySQL technique that i could use to achieve this hierarchical query? Even if the percentages table might contain multiple entries (percentages) for the same person ?
A solution is to utilize the function described at the following link for hierarchical queries:
http://explainextended.com/2009/03/19/hierarchical-queries-in-mysql-adding-ancestry-chains/
Instead of making a PATH though, you will want to calculate the multiplication.
SOLUTION SCRIPT
Copy and paste this directly in a mysql console. I have not had much luck in workbench. Additionally, this can be further optimized by combining hierarchy_sys_connect_by_path_percentage and hierarchy_sys_connect_by_path_percentage_result into one stored procedure. Unfortunately this may be quite slow for giant data sets.
Setup Table and Data
drop table people;
drop table percentages;
create table people
(
id int,
name varchar(10),
parent int
);
create table percentages
(
id int,
percentage float
);
insert into people values(1,' A ',0);
insert into people values(2,' B ',1);
insert into people values(3,' C ',2);
insert into people values(4,' D ',3);
insert into people values(5,' E ',1);
insert into people values(6,' F ',0);
insert into percentages values(1,0.70);
insert into percentages values(2,0.60);
insert into percentages values(3,0.10);
insert into percentages values(4,0.5);
insert into percentages values(5,0.40);
insert into percentages values(6,0.30);
DELIMITER $$
DROP FUNCTION IF EXISTS `hierarchy_sys_connect_by_path_percentage`$$
CREATE FUNCTION hierarchy_sys_connect_by_path_percentage(
delimiter TEXT,
node INT)
RETURNS TEXT
NOT DETERMINISTIC
READS SQL DATA
BEGIN
DECLARE _path TEXT;
DECLARE _id INT;
DECLARE _percentage FLOAT;
DECLARE EXIT HANDLER FOR NOT FOUND RETURN _path;
SET _id = COALESCE(node, @id);
SELECT Percentage
INTO _path
FROM percentages
WHERE id = _id;
LOOP
SELECT parent
INTO _id
FROM people
WHERE id = _id
AND COALESCE(id <> @start_with, TRUE);
SELECT Percentage
INTO _percentage
FROM percentages
WHERE id = _id;
SET _path = CONCAT( _percentage , delimiter, _path);
END LOOP;
END $$
DROP FUNCTION IF EXISTS `hierarchy_sys_connect_by_path_percentage_result`$$
CREATE FUNCTION hierarchy_sys_connect_by_path_percentage_result(
node INT)
RETURNS FLOAT
NOT DETERMINISTIC
READS SQL DATA
BEGIN
DECLARE _path TEXT;
DECLARE _id INT;
DECLARE _percentage FLOAT;
DECLARE EXIT HANDLER FOR NOT FOUND RETURN _path;
SET _id = COALESCE(node, @id);
SELECT Percentage
INTO _path
FROM percentages
WHERE id = _id;
LOOP
SELECT parent
INTO _id
FROM people
WHERE id = _id
AND COALESCE(id <> @start_with, TRUE);
SELECT Percentage
INTO _percentage
FROM percentages
WHERE id = _id;
SET _path = _percentage * _path;
END LOOP;
END $$
DELIMITER ;
Query
SELECT hi.id AS ID,
p.Percentage,
hierarchy_sys_connect_by_path_percentage('*', hi.id) AS Calculation,
hierarchy_sys_connect_by_path_percentage_result(hi.id) AS Actual
FROM people hi
JOIN percentages p
ON hi.id = p.id;
Result
+------+------------+-----------------+--------------------+
| ID | Percentage | Calculation | Actual |
+------+------------+-----------------+--------------------+
| 1 | 0.7 | 0.7 | 0.699999988079071 |
| 2 | 0.6 | 0.7*0.6 | 0.419999986886978 |
| 3 | 0.1 | 0.7*0.6*0.1 | 0.0419999994337559 |
| 4 | 0.5 | 0.7*0.6*0.1*0.5 | 0.0210000015795231 |
| 5 | 0.4 | 0.7*0.4 | 0.280000001192093 |
| 6 | 0.3 | 0.3 | 0.300000011920929 |
+------+------------+-----------------+--------------------+
Formatting the numbers is trivial, so I leave it to you.
More important are optimizations that make fewer calls to the database.
Step 1 - Create a MySQL function to return the family tree as a comma delimited TEXT column:
DELIMITER //
CREATE FUNCTION fnFamilyTree ( id INT ) RETURNS TEXT
BEGIN
SET @tree = id;
SET @qid = id;
WHILE (@qid > 0) DO
SELECT IFNULL(p.parent,-1)
INTO @qid
FROM people p
WHERE p.id = @qid LIMIT 1;
IF ( @qid > 0 ) THEN
SET @tree = CONCAT(@tree,',',@qid);
END IF;
END WHILE;
RETURN @tree;
END
//
DELIMITER ;
Then use the following SQL to retrieve your results:
SELECT ppl.id
,ppl.percentage
,GROUP_CONCAT(pct.percentage SEPARATOR '*') as Calculations
,EXP(SUM(LOG(pct.percentage))) as Actual
FROM (SELECT p1.id
,p2.percentage
,fnFamilyTree( p1.id ) as FamilyTree
FROM people p1
JOIN percentages p2
ON p2.id = p1.id
) ppl
JOIN percentages pct
ON FIND_IN_SET( pct.id, ppl.FamilyTree ) > 0
GROUP BY ppl.id
,ppl.percentage
;
SQLFiddle at http://sqlfiddle.com/#!2/9da5b/12
Results:
+------+----------------+-----------------+----------------+
| ID | Percentage | Calculations | Actual |
+------+----------------+-----------------+----------------+
| 1 | 0.699999988079 | 0.7 | 0.699999988079 |
| 2 | 0.600000023842 | 0.7*0.6 | 0.420000009537 |
| 3 | 0.10000000149 | 0.7*0.6*0.1 | 0.04200000158 |
| 4 | 0.5 | 0.1*0.5*0.7*0.6 | 0.02100000079 |
| 5 | 0.40000000596 | 0.4*0.7 | 0.279999999404 |
| 6 | 0.300000011921 | 0.3 | 0.300000011921 |
+------+----------------+-----------------+----------------+
MySQL is a relational DBMS; your requirements call for a graph database.
However, if you stay with MySQL, there are a few methods for adding some graph features. One of them is the concept of nested sets, but I don't suggest it, as it adds a lot of complexity.
SELECT a.id
, ROUND(pa.percentage/100
* COALESCE(pb.percentage/100,1)
* COALESCE(pc.percentage/100,1)
* COALESCE(pd.percentage/100,1)
* 100,2) x
FROM people a
LEFT
JOIN people b
ON b.id = a.parent
LEFT
JOIN people c
ON c.id = b.parent
LEFT
JOIN people d
ON d.id = c.parent
LEFT
JOIN percentages pa
ON pa.id = a.id
LEFT
JOIN percentages pb
ON pb.id = b.id
LEFT
JOIN percentages pc
ON pc.id = c.id
LEFT
JOIN percentages pd
ON pd.id = d.id
;
Consider switching to Postgres9, which supports recursive queries:
WITH RECURSIVE recp AS (
SELECT p.id, p.name, p.parent
, array[p.id] AS anti_loop
, array[pr.percentage ] AS percentages
, pr.percentage AS final_pr
FROM people p
JOIN percentages pr ON pr.id = p.id
WHERE parent = 0
UNION ALL
SELECT ptree.id, ptree.name, ptree.parent
, recp.anti_loop || ptree.id
, recp.percentages || pr.percentage
, recp.final_pr * pr.percentage
FROM people ptree
JOIN percentages pr ON pr.id = ptree.id
JOIN recp ON recp.id = ptree.parent AND ptree.id != ALL(recp.anti_loop)
)
SELECT id, name
, array_to_string(anti_loop, ' <- ') AS path
, array_to_string(percentages::numeric(10,2)[], ' * ') AS percentages_str
, final_pr
FROM recp
ORDER BY anti_loop
Check out sqlFiddle demo
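Recursive common table expressions are also available in MySQL from 8.0 and in MariaDB from 10.2.2, so the same idea may work without switching databases. The sketch below runs against the people and percentages tables from the setup script earlier on this page (percentages stored as fractions). It assumes root rows have parent = 0, one percentage row per person, and no cycles in the hierarchy; the CTE name tree and its column names are illustrative.

```sql
-- Sketch: walk the hierarchy top-down and multiply percentages along the way.
WITH RECURSIVE tree AS (
    -- anchor: root people; CAST reserves room for the growing text column
    SELECT p.id,
           CAST(pr.percentage AS CHAR(200)) AS calculation,
           pr.percentage AS actual
    FROM people p
    JOIN percentages pr ON pr.id = p.id
    WHERE p.parent = 0
    UNION ALL
    -- recursive step: attach each child to its already computed parent
    SELECT c.id,
           CONCAT(t.calculation, '*', pr.percentage),
           t.actual * pr.percentage
    FROM tree t
    JOIN people c ON c.parent = t.id
    JOIN percentages pr ON pr.id = c.id
)
SELECT id, calculation, ROUND(actual * 100, 2) AS actual_pct
FROM tree
ORDER BY id;
```

The column types of a recursive CTE are taken from the anchor member, which is why calculation is cast to a wide CHAR before anything is appended to it.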