Is there a SAS function to flag a word that repeats in order across columns?

Is there a way to flag rows where the word 'Add' is in sequence without any other word or missing in between?
I tried the array statement with the find function, but no luck!

This code will find all sequences of Add where there are at least two Adds in a row and save all of the sequences to a single comma-separated variable.
Sample data:
data have;
input t1$ t2$ t3$ t4$ t5$ t6$ t7$ t8$ t9$ t10$;
datalines;
Add Add No Add No Add . No Add .
Add No Add Add Add Add . . No .
Add Add Add No Add Add Add Add . .
;
run;
Code:
data want;
set have;
array t[*] t:;
array col[10] $;
length sequences $50.;
/* Check if the current and previous value is 'Add' */
do i = 1 to dim(t);
if(i > 1 AND t[i] = 'Add' AND t[i-1] = 'Add') then do;
col[i] = vname(t[i]);
col[i-1] = vname(t[i-1]);
end;
end;
/* Create a comma-separated list for each sequence. For example:
t1-t3,t3-t5
t1-t4
etc.
*/
flag_start = 0;
do i = 1 to dim(col);
/* Find the start of the sequence */
if(col[i] NE ' ' AND NOT flag_start) then do;
seq_start = col[i];
flag_start = 1;
end;
/* Find the end of the sequence */
if(col[i] = ' ' AND flag_start) then do;
seq_end = col[i-1];
flag_start = 0;
end;
/* If we are between sequences, calculate the sequence range and save it */
if(i > 1 AND col[i] = ' ' AND col[i-1] NE ' ') then do;
seq_range = cats(seq_start, '-', seq_end);
sequences = catx(',', sequences, seq_range);
end;
end;
drop i flag_start seq_start seq_end seq_range col:;
run;
Output:
t1   t2   t3   t4   t5   t6   t7   t8   t9   t10  sequences
Add  Add  No   Add  No   Add       No   Add       t1-t2
Add  No   Add  Add  Add  Add            No        t3-t6
Add  Add  Add  No   Add  Add  Add  Add            t1-t3,t5-t8
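For readers who want to check the run-detection logic outside SAS, here is a minimal Python sketch of the same idea. The function name add_runs and its parameters are illustrative only and are not part of the answer above. It assumes a run counts when it contains at least two consecutive 'Add' values, and it appends a sentinel so that a run reaching the last column is also closed.

```python
def add_runs(values, word="Add", min_len=2):
    """Return ranges like 't1-t2' for each run of `word` at least `min_len` long."""
    ranges = []
    start = None  # 1-based position where the current run began, or None
    # A trailing sentinel (None) closes a run that reaches the last column.
    for i, value in enumerate(list(values) + [None], start=1):
        if value == word:
            if start is None:
                start = i
        else:
            # The run occupied positions start..i-1, so its length is i - start.
            if start is not None and i - start >= min_len:
                ranges.append(f"t{start}-t{i - 1}")
            start = None
    return ",".join(ranges)


rows = [
    "Add Add No Add No Add . No Add .",
    "Add No Add Add Add Add . . No .",
    "Add Add Add No Add Add Add Add . .",
]
for row in rows:
    print(add_runs(row.split()))
```

A row is flagged simply when the returned string is non-empty.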

The presence of a target word at a T<index> column can be flagged using a binary value, setting the bits appropriately.
Example:
Flag up to 32 columns. For more than 32 columns you would need additional flag variables and some extra bookkeeping when calculating the flag value.
data have;
input (t1-t10) ($);
datalines;
Add Add No Add No Add . No Add .
Add No Add Add Add Add . . No .
Add Add Add No Add Add Add Add . .
;
data want;
set have;
array ts t1-t10;
flag = 0;
do over ts;
flag = BOR (flag, BLSHIFT(ts='Add', _i_-1));
end;
format flag binary32.;
run;
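The flag above only records where 'Add' occurs. One possible way to turn it into a "two in a row" test is to AND the mask with itself shifted by one bit: any surviving bit means two neighbouring columns both held 'Add'. The Python sketch below illustrates that idea under the same bit layout (bit i-1 for column t<i>). The helper names are hypothetical and the adjacency test is an addition, not something the answer itself does.

```python
def add_mask(values, word="Add"):
    """Set bit i-1 when column t<i> equals `word` (same layout as the flag above)."""
    mask = 0
    for i, value in enumerate(values):
        if value == word:
            mask |= 1 << i
    return mask


def has_adjacent_adds(mask):
    """True when two neighbouring bits are both set, i.e. 'Add' in consecutive columns."""
    return (mask & (mask >> 1)) != 0


for line in ("Add Add No Add No Add . No Add .", "Add No Add No Add . . No . ."):
    mask = add_mask(line.split())
    print(format(mask, "010b"), has_adjacent_adds(mask))
```

In SAS the same test could presumably be written with the BAND and BRSHIFT functions applied to flag.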

I have a few solutions depending on when and how many sequences are allowed.
First, a sequence is defined as 2 or more consecutive time periods with 'Add'. For my solutions I used Richard's sample data.
Solution 1: Valid sequences begin at T1 until a break
* valid sequence begins at t1 until a break;
data want1;
set have;
length sequence $20;
if t1 = 'Add' and t2 = 'Add'; * if either T1 or T2 <> 'Add' then move on to next obs;
array t(*) t1-t10;
do i = 3 to dim(t); * start loop at t3 since we know t1 & t2 = 'Add';
if t[i] ne 'Add' then do;
sequence = cats('T1-T', put(i-1, 2.));
output;
leave; * exit loop. move to next obs;
end;
end;
drop i;
run;
Result:
t1   t2   t3   t4   t5   t6   t7   t8   t9   t10  sequence
Add  Add  No   Add  No   Add       No   Add       T1-T2
Add  Add  Add  No   Add  Add  Add  Add            T1-T3
Solution 2: The next solution still detects valid sequences beginning at T1, but allows breaks and other sequences beyond the first one.
* sequence begins at t1 with a break and another sequence occurs on same row;
data want2;
set have;
length sequence $20;
if t1 = 'Add' and t2 = 'Add'; * if either T1 or T2 <> 'Add' then move to next obs;
array t(*) t1-t10;
seq_strt = 1; * start of sequence. start at 1 because of subsetting if;
break = 0; * flag for break in sequence. start at 0 because of subsetting if;
sequence = '';
do i = 3 to dim(t); * start loop at t3 since we know t1 & t2 = 'Add';
* start of sequence - 2 consecutive 'Add' during break;
if break = 1 and t[i] = 'Add' and t[i-1] = 'Add' then do; * start of new sequence;
break = 0;
seq_strt = i-1;
end;
* end of sequence;
else if break = 0 and t[i] ne 'Add' then do;
break = 1; * flag a break;
sequence = catx(',', sequence, cats('T', put(seq_strt, 2.), '-T', put(i-1, 2.)));
end;
end;
drop i seq_strt break;
run;
Result:
t1   t2   t3   t4   t5   t6   t7   t8   t9   t10  sequence
Add  Add  No   Add  No   Add       No   Add       T1-T2
Add  Add  Add  No   Add  Add  Add  Add            T1-T3,T5-T8
Solution 3: Finally, the last solution detects any sequence in any time period.
* capture any sequence at any period of time;
data want3;
set have;
length sequence $20;
array t(*) t1-t10;
seq_strt = 0; * start of sequence;
break = 1; * flag for break in sequence. start with break until new seq is found;
sequence = '';
do i = 2 to dim(t); * start loop at t2 to compare at t1;
* start of sequence - 2 consecutive 'Add' during break;
if break = 1 and t[i] = 'Add' and t[i-1] = 'Add' then do; * start of new sequence;
break = 0;
seq_strt = i-1;
end;
* end of sequence;
else if break = 0 and t[i] ne 'Add' then do;
break = 1; * flag a break;
sequence = catx(',', sequence, cats('T', put(seq_strt, 2.), '-T', put(i-1, 2.)));
end;
end;
drop i seq_strt break;
run;
Result:
t1   t2   t3   t4   t5   t6   t7   t8   t9   t10  sequence
Add  Add  No   Add  No   Add       No   Add       T1-T2
Add  No   Add  Add  Add  Add            No        T3-T6
Add  Add  Add  No   Add  Add  Add  Add            T1-T3,T5-T8

Related

How to find greatest value in a column and display it with a "Greatest Value" Flag and display the rest of the new column value as "Not Greatest"

I have a table UnitCheck:
I want to add a new column that returns the value Greatest if the CostToRestaff value is the highest, and Not Greatest for every other value of CostToRestaff.
CostToRestaff is a calculated column with the formula: 0 * FullDutyC + 10 * WoundedC + 25 * KilledC;
edited:
I was trying something along these lines. Any idea where I am going wrong?
if Unit_ID = A then --A is the user parameter for my procedure
Set check1 = 0 * FullDutyC + 10 * WoundedC + 25 * KilledC;
end if;
if Unit_ID = B then --B is the user parameter for my procedure
Set check2 = 0 * FullDutyC + 10 * WoundedC + 25 * KilledC;
end if;
if Unit_ID = C then --C is the user parameter for my procedure
Set check3 = 0 * FullDutyC + 10 * WoundedC + 25 * KilledC;
end if;
if check1 > check2 AND check1 > check3 then
set Rejection = /* this is where i need the final answer which should be
the value itself for eg: 125 - since it is the highest value*/ where Unit_ID =
A;
elseif check2 > check3 AND check2 > check1 then
Set Rejection = /* this is where i need the final answer in this case since it is not the highest value then it should jump to the else condition*/ where
Unit_ID = B;
elseif check3 > check1 AND check3 > check2 then
Set Rejection = /* this is where i need the final answer in this case since it is not the highest value then it should jump to the else condition*/ where Unit_ID =
C;
else set Rejection = "Not Greatest"
end if;
Assuming MySQL 8.0, you can use window functions:
select
t.*,
(rank() over(order by CostToRestaff desc) = 1) isGreatest
from mytable t
This gives you a 0/1 flag that indicates whether the current row has the greatest value of CostToRestaff across the whole table, which I find more meaningful than the string '(Not) Greatest'. If there are ties for the top value, they all get the 1 flag.
If you want that as a string value:
select
t.*,
case when rank() over(order by CostToRestaff desc) = 1
then 'Greatest'
else 'Not Greatest'
end isGreatest
from mytable t
In earlier versions, you can use a subquery or a join to compute the maximum value:
select
t.*,
(select max(CostToRestaff) from mytable) = CostToRestaff isGreatest
from mytable t
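The subquery variant can also return the string labels the question asks for. The sketch below tries that out with Python's built-in sqlite3 module rather than MySQL, so it only illustrates the query shape. The sample rows are hypothetical, since the real contents of UnitCheck are not shown in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE UnitCheck (Unit_ID TEXT, FullDutyC INT, WoundedC INT, KilledC INT)"
)
# Hypothetical sample rows; the real table contents are not shown in the question.
conn.executemany(
    "INSERT INTO UnitCheck VALUES (?, ?, ?, ?)",
    [("A", 5, 5, 3), ("B", 9, 2, 1), ("C", 7, 0, 2)],
)

query = """
SELECT Unit_ID,
       CostToRestaff,
       CASE WHEN CostToRestaff = (SELECT MAX(0 * FullDutyC + 10 * WoundedC + 25 * KilledC)
                                  FROM UnitCheck)
            THEN 'Greatest' ELSE 'Not Greatest'
       END AS isGreatest
FROM (SELECT Unit_ID,
             0 * FullDutyC + 10 * WoundedC + 25 * KilledC AS CostToRestaff
      FROM UnitCheck)
ORDER BY Unit_ID
"""
results = conn.execute(query).fetchall()
for row in results:
    print(row)
```

As with the rank() version, rows that tie for the maximum are all labelled 'Greatest'.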

MySQL user-defined function returns incorrect value when used in a SELECT statement

I ran into a problem when calling a user-defined function in MySQL. The computation is very simple, but I can't work out where and why it goes wrong. Here are the details.
So I created this function:
DELIMITER //
CREATE FUNCTION fn_computeLoanAmortization (_empId INT, _typeId INT)
RETURNS DECIMAL(17, 2)
BEGIN
SET @loanDeduction = 0.00;
SELECT TotalAmount, PeriodicDeduction, TotalInstallments, DeductionFlag
INTO @totalAmount, @periodicDeduction, @totalInstallments, @deductionFlag
FROM loans_table
WHERE TypeId = _typeId AND EmpId = _empId;
IF (@deductionFlag = 1) THEN
SET @remaining = @totalAmount - @totalInstallments;
IF(@remaining < @periodicDeduction) THEN
SET @loanDeduction = @remaining;
ELSE
SET @loanDeduction = @periodicDeduction;
END IF;
END IF;
RETURN @loanDeduction;
END;//
DELIMITER ;
If I call it like this, it works fine:
SELECT fn_computeLoanAmortization(3, 4)
But if I call it inside a SELECT statement, the result becomes erroneous:
SELECT Id, fn_computeLoanAmortization(Id, 4) AS Amort FROM emp_table
There is only one entry in loans_table, so the statement above should return a value in the Amort column for only one row. Instead, many other rows show the same Amort value as the row with the matching entry, which should not be the case.
Has anyone run into this kind of problem? Or I might have done something wrong on my end. Kindly enlighten me.
Thank you very much.
EDIT:
By erroneous, I meant it like this:
loans_table has one record
EmpId = 1
TypeId = 2
PeriodicDeduction = 100
TotalAmount = 1000
TotalInstallments = 200
DeductionFlag = 1
emp_table has several rows
EmpId = 1
Name = Paolo
EmpId = 2
Name = Nikko
...
EmpId = 5
Name = Ariel
when I query the following statements, I get the correct value:
SELECT fn_computeLoanAmortization(1, 2)
SELECT Id, fn_computeLoanAmortization(Id, 2) AS Amort FROM emp_table WHERE EmpId = 1
But when I query this statement, I get incorrect values:
SELECT Id, fn_computeLoanAmortization(Id, 2) AS Amort FROM emp_table
Resultset would be:
EmpId | Amort
--------------------
1 | 100
2 | 100 (this should be 0, but the query returns 100)
3 | 100 (same error here)
...
5 | 100 (same error here up to the last record)
Inside your function, the variables that receive the values from loans_table are not local to the function; they are session variables. When the SELECT inside the function finds no row, those variables keep the values from the previous execution of the function.
Use real local variables instead: drop the session-variable prefix (@ in MySQL) from the names and declare the variables with DECLARE at the beginning of the function. See this answer for more details.
I suspect the problem is that the variables in the INTO are not re-set when there is no matching row.
Just set them before the INTO:
BEGIN
SET @loanDeduction = 0.00;
SET @totalAmount = 0;
SET @periodicDeduction = 0;
SET @totalInstallments = 0;
SET @deductionFlag = 0;
SELECT TotalAmount, PeriodicDeduction, TotalInstallments, DeductionFlag
. . .
You might just want to set them to NULL.
Or, switch your logic to use local variables (each one declared with DECLARE at the top of the function body):
SET v_loanDeduction = 0.00;
SET v_totalAmount = 0;
SET v_periodicDeduction = 0;
SET v_totalInstallments = 0;
SET v_deductionFlag = 0;
And so on.
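The effect described in both answers can be reproduced with a toy model. The Python sketch below is not MySQL. It only mimics the behaviour that a SELECT ... INTO which finds no row leaves its target variables untouched, using a shared dictionary as a stand-in for session variables. The data comes from the question's EDIT; the function names are hypothetical.

```python
LOANS = {(1, 2): {"TotalAmount": 1000, "PeriodicDeduction": 100,
                  "TotalInstallments": 200, "DeductionFlag": 1}}

session = {}  # stands in for MySQL session variables, shared across calls


def amortization_session(emp_id, type_id):
    """Mimics the original function: values live in shared session state."""
    session["loanDeduction"] = 0
    row = LOANS.get((emp_id, type_id))
    if row is not None:  # SELECT ... INTO only assigns when a row is found
        session.update(row)
    if session.get("DeductionFlag") == 1:
        remaining = session["TotalAmount"] - session["TotalInstallments"]
        session["loanDeduction"] = min(remaining, session["PeriodicDeduction"])
    return session["loanDeduction"]


def amortization_local(emp_id, type_id):
    """Same logic, but every call starts from fresh local values."""
    row = LOANS.get((emp_id, type_id), {})
    if row.get("DeductionFlag") != 1:
        return 0
    remaining = row["TotalAmount"] - row["TotalInstallments"]
    return min(remaining, row["PeriodicDeduction"])


print([amortization_session(emp, 2) for emp in range(1, 6)])
print([amortization_local(emp, 2) for emp in range(1, 6)])
```

The first list repeats 100 for every employee after the first match, as in the question's result set; the second returns 100 only for employee 1.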

Performance issue with update query after adding index

I added an index to my update query, and since then the query takes several hours to complete. Without the index it finishes in a few minutes. I added the index to speed up the process, but it became much slower, the exact opposite of what I wanted.
Below is a sample snippet of my code.
Cursor c_updt_stg_rsn is
select distinct substr(r.state_claim_id, 1, 12), ROWID
from nemis.stg_state_resp_rsn r
WHERE r.seq_resp_plan_id = v_seq_resp_plan_id
and r.submitted_claim_id is null
and r.filler_2 is null;
BEGIN
OPEN c_updt_stg_rsn;
LOOP
FETCH c_updt_stg_rsn BULK COLLECT
INTO v_state_claim_id, v_rowid LIMIT c_BULK_SIZE;
FORALL i IN 1 .. v_state_claim_id.COUNT()
UPDATE /*+ index(STG_STATE_RESP_RSN,IDX2_STG_STATE_RESP_RSN) */ nemis.stg_state_resp_rsn
SET (submitted_claim_id , filler_2) = (SELECT DISTINCT submitted_claim_id, sl_group_id FROM nemis.state_sub_Resp_dtl D WHERE
(d.state, d.type_of_claim) in (select distinct state, type_of_claim
from nemis.resp_match_state
where seq_resp_match_table_level_id in
(select seq_resp_match_table_level_id
from nemis.resp_match_table_level
where seq_resp_plan_id = v_seq_resp_plan_id))
AND resp_state_claim_id LIKE v_state_claim_id(i)||'%'
)
WHERE ROWID = v_rowid(i);
IF v_state_claim_id.COUNT() != 0 THEN
v_cnt_rsn := v_cnt_rsn + SQL%ROWCOUNT;
END IF;
COMMIT;
EXIT WHEN c_updt_stg_rsn%NOTFOUND;
END LOOP;
CLOSE c_updt_stg_rsn;

SQL Queries take up longer than 1 hour

Some of our SQL queries take longer than 3 hours to execute, which is really long considering that our database currently holds only around 5000 entries.
Here are the two longest statements:
/* Show old jobs */
update vwr_synch.listings set vwr_synch.listings.Alt="alt"
where vwr_synch.listings.JobKey not in (select vwr_synch.jobs.JobKey from vwr_synch.jobs)
;
update vwr_synch.listings set vwr_synch.listings.Alt="alt" where vwr_synch.listings.VWRStatus="NoJobs" or vwr_synch.listings.VWRStatus="Problem"
;
update vwr_synch.listings set vwr_synch.listings.Alt=NULL
where vwr_synch.listings.VWRStatus="Active" and vwr_synch.listings.VWRRetry!="0" and vwr_synch.listings.Alt="alt"
;
/* Show new jobs */
update vwr_synch.jobs set vwr_synch.jobs.NeuAlt="Neu"
where vwr_synch.jobs.JobKey not in (select vwr_synch.listings.JobKey from vwr_synch.listings) and (vwr_synch.jobs.`Status`="Active" and vwr_synch.jobs.Retry="0")
;
Your sample code has multiple statements, so I'll just focus on the first one.
I prefer not exists because of its semantics with NULL, although there is evidence that not in might be more efficient (see link in comments). This is the first query:
update vwr_synch.listings
set vwr_synch.listings.Alt = 'alt'
where vwr_synch.listings.JobKey not in (select vwr_synch.jobs.JobKey from vwr_synch.jobs);
I would change it to:
update vwr_synch.listings l
set l.Alt = 'alt'
where not exists (select 1 from vwr_synch.jobs j where j.JobKey = l.JobKey);
Then, for this to work efficiently, you need an index on vwr_synch.jobs(JobKey).
The next two statements are:
update vwr_synch.listings l
set l.Alt = 'alt'
where l.VWRStatus in ('NoJobs', 'Problem');
update vwr_synch.listings l
set l.Alt = NULL
where l.VWRStatus = 'Active' and l.VWRRetry <> '0' and l.Alt = 'alt';
For these, you want an index on vwr_synch.listings(VWRStatus, Alt, VWRRetry).
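The NULL semantics mentioned above can be seen in a small experiment. The sketch below uses Python's built-in sqlite3 module instead of MySQL, on the assumption that both follow the standard three-valued logic for NOT IN. The tables are cut down to a single JobKey column, the rows are made up, and the index name is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE listings (JobKey INTEGER, Alt TEXT);
CREATE TABLE jobs (JobKey INTEGER);
CREATE INDEX idx_jobs_jobkey ON jobs (JobKey);
INSERT INTO listings (JobKey) VALUES (1), (2), (3);
INSERT INTO jobs (JobKey) VALUES (1), (NULL);  -- note the NULL key
""")

# With a NULL in the subquery, "x NOT IN (...)" is never true for non-matching x.
not_in = conn.execute(
    "SELECT JobKey FROM listings "
    "WHERE JobKey NOT IN (SELECT JobKey FROM jobs) ORDER BY JobKey"
).fetchall()

# NOT EXISTS just checks whether a matching row exists; NULL keys never match.
not_exists = conn.execute(
    "SELECT l.JobKey FROM listings l "
    "WHERE NOT EXISTS (SELECT 1 FROM jobs j WHERE j.JobKey = l.JobKey) "
    "ORDER BY l.JobKey"
).fetchall()

print(not_in)
print(not_exists)
```

If jobs.JobKey can never be NULL, the two forms select the same rows and the choice comes down to performance.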

Adjacency table to nested set conversion

I need to convert an adjacency list to a nested set in MySQL. I have found only one resource on the internet that converts an adjacency list into a nested set using MySQL (http://data.bangtech.com/sql/nested_set_treeview.htm). The code is on the same page.
CREATE TABLE test.Tree
(emp CHAR(10) NOT NULL,
boss CHAR(10));
CREATE TABLE test.Personnel(
emp CHAR(20) PRIMARY KEY,
boss CHAR(20) REFERENCES Personnel(emp),
salary DECIMAL(6,2) NOT NULL
);
INSERT INTO test.Personnel VALUES ('jerry', 'NULL',1000.00);
INSERT INTO test.Personnel VALUES ('Bert', 'jerry',900.00);
INSERT INTO test.Personnel VALUES ('chuck', 'jerry',900.00);
INSERT INTO test.Personnel VALUES ('donna', 'chuck',800.00);
INSERT INTO test.Personnel VALUES ('eddie', 'chuck',700.00);
INSERT INTO test.Personnel VALUES ('fred', 'chuck',600.00);
INSERT INTO test.Tree
SELECT emp, boss FROM test.Personnel;
I build the Tree table from the Personnel table. The Tree table holds the boss-employee hierarchy, which is an adjacency list. To convert it to a nested set, I applied this code.
BEGIN ATOMIC
DECLARE counter integer;
DECLARE max_counter integer;
DECLARE current_top integer;
SET counter = 2;
SET max_counter = 2 * (SELECT COUNT(*) FROM test.Tree);
SET current_top = 1;
INSERT INTO test.Stack
SELECT 1, emp, 1, NULL
FROM test.Tree
WHERE boss IS NULL;
DELETE FROM test.Tree
WHERE boss IS NULL;
WHILE counter <=(max_counter - 2)
LOOP IF EXISTS (SELECT * FROM test.Stack AS S1, test.Tree AS T1
WHERE S1.emp = T1.boss AND S1.stack_top = current_top)
THEN
BEGIN -- push when top has subordinates, set lft value
INSERT INTO test.Stack
SELECT (current_top + 1), MIN(T1.emp), counter, NULL
FROM test.Stack AS S1, test.Tree AS T1
WHERE S1.emp = T1.boss
AND S1.stack_top = current_top;
DELETE FROM test.Tree
WHERE emp = (SELECT emp
FROM test.Stack
WHERE stack_top = current_top + 1);
SET counter = counter + 1;
SET current_top = current_top + 1;
END
ELSE
BEGIN -- pop the stack and set rgt value
UPDATE test.Stack
SET rgt = counter,
stack_top = -stack_top -- pops the stack
WHERE stack_top = current_top
SET counter = counter + 1;
SET current_top = current_top - 1;
END IF;
END LOOP;
END;
MySQL workbench shows several syntax errors which I could not remove.
I am familiar with only very basic MySQL operations, so I could not debug the code on my own. How can I remove these errors? Please help.
The second source I found for this operation is http://www.sqlservercentral.com/articles/Hierarchy/94040/ but the code is in T-SQL and I don't have enough skill to translate it to MySQL.
You should put NULL and not 'NULL' in the line
INSERT INTO test.Personnel VALUES ('jerry', 'NULL',1000.00);
Correct version:
INSERT INTO test.Personnel VALUES ('jerry', NULL, 1000.00);
Use Bash:
Bash converting:
# SQL command to fetch necessary fields, output it to text archive "tree"
SELECT id, parent_id, name FROM projects;
# Make a list "id|parentid|name" and sort by name
cat tree |
cut -d "|" -f 2-4 |
sed 's/^ *//;s/ *| */|/g' |
sort -t "|" -k 3,3 > list
# Creates the parenthood chain on second field
while IFS="|" read i p o
do
l=$p
while [[ "$p" != "NULL" ]]
do
p=$(grep -w "^$p" list | cut -d "|" -f 2)
l="$l,$p"
done
echo "$i|$l|$o"
done < list > listpar
# Creates left and right on 4th and 5th fields for iteration 0
let left=0
while IFS="|" read i l o
do
let dif=$(grep "\b$i,NULL|" listpar | wc -l)*2+1
let right=++left+dif
echo "$i|$l|$o|$left|$right"
let left=right
done <<< "$(grep "|NULL|" listpar)" > i0
# The same for the following iterations
n=0
while [ -s i$n ]
do
while IFS="|" read i l nil left nil
do
grep "|$i,$l|" listpar |
while IFS="|" read i l o
do
let dif=$(grep "\b$i,$l|" listpar | wc -l)*2+1
let right=++left+dif
echo "$i|$l|$o|$left|$right"
let left=right
done
done < i$n > i$((++n))
done
# Show concatenated
cat i*|sort -t"|" -k 4n
# SQL commands
while IFS="|" read id nil nil left right
do
echo "UPDATE projects SET lft=$left, rgt=$right WHERE id=$id;"
done <<< "$(cat i*)"
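Whatever tool is used, the conversion itself is a depth-first walk that hands out a left number on the way down and a right number on the way back up. The Python sketch below shows that walk on the Personnel data from the question. The function name to_nested_set is hypothetical, children are visited in sorted order as an arbitrary tie-break, and the recursion assumes a tree shallow enough for Python's default recursion limit.

```python
from collections import defaultdict


def to_nested_set(adjacency):
    """Convert {node: parent} (parent None for a root) to {node: (lft, rgt)}."""
    children = defaultdict(list)
    roots = []
    for node, parent in adjacency.items():
        if parent is None:
            roots.append(node)
        else:
            children[parent].append(node)

    bounds = {}
    counter = 1

    def visit(node):
        nonlocal counter
        lft = counter            # number assigned on the way down
        counter += 1
        for child in sorted(children[node]):
            visit(child)
        bounds[node] = (lft, counter)  # number assigned on the way back up
        counter += 1

    for root in sorted(roots):
        visit(root)
    return bounds


personnel = {"jerry": None, "Bert": "jerry", "chuck": "jerry",
             "donna": "chuck", "eddie": "chuck", "fred": "chuck"}
nested = to_nested_set(personnel)
for name, (lft, rgt) in sorted(nested.items(), key=lambda kv: kv[1]):
    print(name, lft, rgt)
```

Each (lft, rgt) pair could then be written back with one UPDATE per row, in the same way as the generated statements above.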