I have a SAS dataset with an ID variable that is supposed to be unique at the person level, but in reality there are duplicates. What I'd like to do is create a duplicate ID that is only filled in when a person has duplicate values of ID, like this:
ID  Dupe_ID
1
2   1
2   1
3
4   2
4   2
Any help is much appreciated!
data have;
  input ID;
  cards;
1
2
2
3
4
4
;
run;
/* if sorted by ID */
data want;
  set have;
  by id;
  /* start a new duplicate-group number at the first row of each duplicated ID */
  if first.id and not last.id then _dup + 1;
  dup_id = _dup;
  /* a row that is both first and last of its group is unique, so clear dup_id */
  if first.id and last.id then call missing(dup_id);
  drop _dup;
run;
In SAS 9.3+, there is a new option on proc sort which can be of use. If you want to split your data into "actually unique" and "not unique" datasets (in order to later process the not uniques and work out what they should be), the following will do that:
proc sort data=have out=nonuniquedata nouniquekey uniqueout=uniquedata;
  by id;
run;
NOUNIQUEKEY is basically the opposite of NODUPKEY: it only keeps records that are NOT unique. So here the "primary" output dataset will have the nonunique records, and the "uniqueout" dataset will have the unique ones.
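If you then want the numbered Dupe_ID from the question, the non-unique output can be numbered by group afterwards. A minimal sketch, assuming the dataset names from the PROC SORT above:
data dup_groups;
  set nonuniquedata;
  by id;
  /* bump the counter once per duplicated ID; the sum statement retains it across rows */
  if first.id then dup_id + 1;
run;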
It handles things slightly differently, but just in case it's of use to you and/or others, proc sort also has a handy, simple dupout= option for separating out non-unique key observations:
proc sort data=have out=want dupout=dups;
  by id;
run;
The first occurrence of each id will go to the want dataset. Any subsequent observations with the same id will go to the dups dataset.
proc sort data=dataset out=sortdata;
  by id;
run;
data younameit;
  length dup_id 3; /* the minimum valid numeric length in SAS is 3, not 1 */
  set sortdata;
  by id;
  /* a row that is both first and last of its group is unique: leave dup_id missing */
  if first.id and last.id then dup_id = .;
  else dup_id = 1;
run;
My approach is to use a DATA step with FIRST. and LAST. processing.
You need to sort the data with PROC SORT first, and then place a BY statement immediately after the SET statement; only then do FIRST. and LAST. work in the DATA step.
If an observation is both the FIRST and the LAST observation of its BY group (i.e., ID), then it must be unique. If there are two observations with ID = 2, FIRST.ID is true for the earlier observation and LAST.ID is true for the later one.
I want to select a random row with a specific WHERE condition, but the query is taking too long (around 2.7 seconds):
SELECT * FROM PIN WHERE available = '1' ORDER BY RAND() LIMIT 1
The table contains around 900k rows.
Thanks
SELECT * FROM PIN WHERE available = '1' ORDER BY RAND() LIMIT 1
means that you are going to generate a random number for EVERY row, then sort the whole result set, and finally retrieve one row.
That's a lot of work for querying a single row.
Assuming you have IDs without gaps, or only a few of them, you are better off using your programming language to generate ONE random number and fetching that id:
Pseudo-Example:
result = null;
min_id = queryMinId();
max_id = queryMaxId();
while (result == null) {
    random_number = random_between(min_id, max_id);
    result = queryById(random_number);
}
If you have a lot of gaps, you could retrieve the whole id set first and then pick ONE random index into that result:
id_set = queryAllIds();
random_number = random_between(0, size(id_set) - 1);
result = queryById(id_set[random_number]);
The first example will work without additional constraints. In your case, you should use option 2. This ensures that all IDs with available = 1 are pre-selected into a 0 to count()-1 array, so all invalid IDs are ignored.
Then you can generate a random number between 0 and count()-1 to get an index within that result set, which you translate to an actual ID, which you finally fetch:
id_set = queryAllIdsWithAvailableEqualsOne(); // the "condition"
random_number = random_between(0, size(id_set) - 1);
result = queryById(id_set[random_number]);
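If you'd rather stay in SQL, here is a minimal sketch of the same idea in MySQL; the offset value 123 is a placeholder, and the application would pick it at random between 0 and count - 1:
-- count the candidate rows once
SELECT COUNT(*) FROM PIN WHERE available = '1';
-- the application picks r uniformly in [0, count - 1] and substitutes it:
SELECT * FROM PIN WHERE available = '1' LIMIT 1 OFFSET 123; -- 123 = the chosen r
The OFFSET scan still walks past r rows, but it avoids generating and sorting 900k random numbers.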
First I just wanted to say thank you for all your help. I have a new issue with the next step of my query.
For each record there are three columns. For the sake of argument let's call them A, B and C.
In each column there can be four results: Pass, Not Tested, Low, High.
Not Tested and Pass can be treated as a pass.
Low and High can be treated as a fail.
I would like to see a result in another column that if there are any fail statements in columns A, B or C a fail response will be shown. Otherwise a Pass will be shown in the new column.
A            B      C      RESULT
PASS         PASS   PASS   PASS
High         PASS   PASS   FAIL
Not Tested   PASS   PASS   PASS
LOW          PASS   PASS   FAIL
Thank You,
Set the default value for ColumnD to "PASS", then run the following query:
UPDATE ATable SET ATable.D = "FAIL"
WHERE (ATable.A = "high" OR ATable.A = "low")
   OR (ATable.B = "high" OR ATable.B = "low")
   OR (ATable.C = "high" OR ATable.C = "low");
How about the following? All you need to do is check whether any of the three columns has 'Fail' or 'Low' or 'High' to indicate failure. Or you can flip the logic...
SELECT Table1.A, Table1.B, Table1.C,
IIf([a]='Fail' Or [a]='Low' Or [a]='High','Fail',
IIf([b]='Fail' Or [b]='Low' Or [b]='High','Fail',
IIf([c]='Fail' Or [c]='Low' Or [c]='High','Fail','Pass'))) AS Result
FROM Table1;
Here is the flipped logic:
SELECT Table1.A, Table1.B, Table1.C,
IIf([a]='Pass' Or [a]='Not Tested',
IIf([b]='Pass' Or [b]='Not Tested',
IIf([c]='Pass' Or [c]='Not Tested','Pass','Fail'),'Fail'),'Fail') AS Result2
FROM Table1;
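If you prefer a single condition, an equivalent variant (a sketch, untested in Access) uses the In() operator and reads a bit more compactly:
SELECT Table1.A, Table1.B, Table1.C,
IIf([a] In ('Low','High') Or [b] In ('Low','High') Or [c] In ('Low','High'),'Fail','Pass') AS Result3
FROM Table1;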
Here is a problem that I'm facing, which I need to solve using the Perl DBI module:
Table:
c1 c2 c3 c4 c5 c6
__________________
r1 | a b c d e f
r2 | h i j k x m
r3 | n x p q r x
r4 | k l m n p q
Task: determine the name of the left-most column that has value 'x' in any of the rows. In the example it is c2.
I see two ways to do this:
First
1. Select column c1 in all the rows;
2. Loop through the retrieved fields, starting from the top-most;
3. If any of the fields has value 'x', return c1;
4. Otherwise, repeat 1-3 for the next column.
How I approximately imagine it to look in Perl:
my @bind_values = \(my $field);
my $var;
for my $i (1..6) {
    $statement = "select c$i from table";
    $dbh->selectcol_arrayref($statement, undef, @bind_values);
    if ($field eq 'x') { $var = $i; last; }
}
return $var;
Second
1. Set variable $var to 4 (the number of rows);
2. Select all columns of rows r1 to r$var;
3. Loop through the returned fields, starting from the left-most;
4. If a field has value 'x' and its column number is lower than the lowest found so far, remember that column number;
5. Repeat 3-4 for the next row;
6. Return the remembered column number.
How I approximately imagine it to look in Perl:
my @bind_values;
my $var = 6;
my @cols;
for my $i (1..6) {
    for (1..$var) { push @cols, "c$_"; push @bind_values, \my $field; }
    $statement = "select " . join(',', @cols) . " from table";
    $dbh->selectrow_array($statement, undef, @bind_values);
    for (@bind_values) {
        if ($$_ < $var) { $var = $$_; }
    }
}
return $var;
If I understood the manual correctly, selectcol_arrayref() actually performs a separate SQL call for each row in the table, so both approaches involve a two-level loop.
To people who know more about the inner workings of the Perl DBI module, my question is the following:
Which of the approaches is better performance-wise?
If it's of any significance, I'm working with a MySQL database.
EDIT: Actual table dimensions are potentially c200 x r1000 (200 columns by 1000 rows).
EDIT2:
Another idea: use a LIMIT clause to determine whether a column contains the value within the SQL statement itself, for example:
SELECT c1
FROM table
WHERE c1='x'
LIMIT 0,1
This statement should allow me to determine whether c1 contains the value 'x'. It would move some more of the performance load to the DB engine, correct? Would this improve or worsen performance?
Here is a version using SQLite. I expect the same code to work for MySQL with little or no change. It should work fine unless your database table is huge, but you don't mention its size so I presume it's not out of the ordinary.
It simply fetches the contents of the table into memory and checks each column, one by one, to see if any field is x, printing the name of the column once it is found.
use strict;
use warnings;

use DBI;
use List::Util qw/ any /;

my $dbh = DBI->connect('dbi:SQLite:test.sqlite');

my $sth = $dbh->prepare('SELECT * FROM "table"');
$sth->execute;
my $table = $sth->fetchall_arrayref;

my $first_column;
for my $i (0 .. $#{$table->[0]}) {
    my @column = map { $_->[$i] } @$table;
    if ( any { $_ eq 'x' } @column ) {
        $first_column = $sth->{NAME}[$i];
        last;
    }
}

print $first_column, "\n";
output
c2
Update
This way is likely to be faster, as it uses the database engine to search for columns that contain an x, and very little data is loaded into memory.
use strict;
use warnings;

use DBI;

my $dbh = DBI->connect('dbi:SQLite:test.sqlite');

# fetch only the column names: LIMIT 0 returns no rows
my @names = do {
    my $sth = $dbh->prepare('SELECT * FROM "table" LIMIT 0');
    $sth->execute;
    @{ $sth->{NAME_lc} };
};

my $first_column;
for my $col (@names) {
    my $sql = qq{SELECT $col FROM "table" WHERE $col = 'x' LIMIT 1};
    my $row = $dbh->selectrow_arrayref($sql);
    if ($row) {
        $first_column = $col;
        last;
    }
}

print $first_column, "\n";
Short of redesigning your table so that it can be queried more effectively, I think your optimal solution is likely to be a modified version of your Option 1. Instead of using fetchall_arrayref(), use fetchrow_arrayref() to collect one row at a time. Examine each row as you get it, and break the loop if the minimum column ever gets to column 1. This minimizes the memory used in the Perl code and uses a single SQL statement with multiple fetch operations (but then fetchall_arrayref() also performs multiple fetch operations under the hood).
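A minimal sketch of that idea (untested; it assumes the same test.sqlite database and table as the answer above):
use strict;
use warnings;

use DBI;

my $dbh = DBI->connect('dbi:SQLite:test.sqlite');

my $sth = $dbh->prepare('SELECT * FROM "table"');
$sth->execute;

my @names   = @{ $sth->{NAME_lc} };
my $min_col = @names;    # one past the right-most column

# Scan one row at a time, tracking the left-most column seen with an 'x'
while ( my $row = $sth->fetchrow_arrayref ) {
    for my $i ( 0 .. $min_col - 1 ) {
        if ( defined $row->[$i] and $row->[$i] eq 'x' ) {
            $min_col = $i;
            last;
        }
    }
    last if $min_col == 0;    # can't do better than the first column
}

$sth->finish;    # we may have stopped before the last row
print $min_col < @names ? $names[$min_col] : 'not found', "\n";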
The fact that you need to query your data this way tells me that it's stored in a bizarre and inappropriate way. Relational databases are meant to store relations, and the order of their columns should be irrelevant to how they logically function. Any need to refer to column order is a guaranteed sign that you're doing something wrong.
I understand that sometimes one needs to perform one-time queries to determine unusual things about data sets, but I stand by my assessment: this data is stored inappropriately.
My guess is that there are many columns that define related, sequential attributes, maybe something like "profits_1q2001", "profits_2q2001", etc. You'll want to create a separate table for those, maybe something like:
CREATE TABLE `department_profits` (
`id` int(10) unsigned NOT NULL,
`department_id` same_as_parent_table NOT NULL,
`y` year(4) NOT NULL,
`q` tinyint(3) unsigned NOT NULL,
`profits` decimal(9,2) DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `idx_dept_quarter` (`department_id`,`y`,`q`),
KEY `idx_profits_y_q_dept` (`profits`,`y`,`q`,`department_id`)
) ENGINE=InnoDB;
Converting the data from its current format to the proper format is left as an exercise for the reader, but it might involve 200 script-generated queries that look like:
SELECT CONCAT(
"INSERT INTO department_profits (department_id, y, q, profits) VALUES (",
"'", department_id, "',",
2001, ",",
1, ",",
profits_1q2001,
");"
)
FROM old_table;
If your question is then (say) when was the first time profits exceeded $10,000 and in which department, then finding the answer becomes something like:
SELECT department_id, y, q, profits
FROM department_profits
WHERE profits > 10000
ORDER BY y, q LIMIT 1;
For the actual question you asked -- if it really is a one-off -- since there are just 200,000 data points, I would do it manually. Export the whole table as tab-separated, drag it onto Excel, "Find/Replace" to change "x" to "-999" or some small value, then "Data -> Sort" by each column in turn until your answer pops to the top. Heck, plain old "Find" might tell you your answer. With just 200 columns, it won't take long, and you might learn something new about your data by seeing it all on the screen sorted various ways :)
Assuming your columns are c1 .. c6, you can use something like this to get it in sqlite:
select distinct (case when c1 = 'x' then 'c1'
                      when c2 = 'x' then 'c2'
                      when c3 = 'x' then 'c3'
                      when c4 = 'x' then 'c4'
                      when c5 = 'x' then 'c5'
                      when c6 = 'x' then 'c6'
                      else 'x' end)
from mje order by 1 limit 1;
I have one table named task_assignment. It has the following fields:
testId, quesId, evaluatorId, studId and marks
This table is used to store the marks for each test, including each evaluator's marks for each student, question by question.
I have testId = 1, quesId = Q1 and studId = S1 as input, and I want to get both evaluators' (E1, E2) marks for that input in a single select query.
A plain SQL query returns more than one row for this... I want the query output to be 20,15 in a single row.
Please guide me out of this issue...
I think you won't be able to get your desired output 20, 15, since there is only one record which satisfies your criteria testId = 1, quesId = Q1, studId = S1.
But to answer your question, here's my query:
SELECT GROUP_CONCAT(marks)
FROM task_assignment
WHERE testId = 1
AND quesId = 'Q1'
AND studId = 'S1';
I've tried it in SQL Fiddle.
EDIT 1
If you want to parse the output of the query in your C# code to store them in separate variables, you can use the Split function:
string marks = "20, 15"; //Suppose that this value came from database
int mark1 = Convert.ToInt32(marks.Split(',')[0]);
int mark2 = Convert.ToInt32(marks.Split(',')[1]);
The code is still error-prone depending on the value of the marks variable, so make sure you have validated the value.
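A slightly more defensive sketch of the same parsing (the variable names are just the ones from the snippet above), using int.TryParse instead of Convert.ToInt32 so malformed input sets a flag rather than throwing a FormatException:
string marks = "20, 15"; // suppose this value came from the database
string[] parts = marks.Split(',');
int mark1 = 0, mark2 = 0;
bool ok = parts.Length == 2
          && int.TryParse(parts[0].Trim(), out mark1)
          && int.TryParse(parts[1].Trim(), out mark2);
// 'ok' is false on malformed input; mark1/mark2 are only valid when it is true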
This might be unrelated to the question, but still to help you on your task, that's my answer.
Is there a way to implement this algorithm in MySQL without 100500 queries and lots of resources?
if (exists %name% in table.name) {
    num = 2;
    while (exists %name% + num in table.name) num++;
    %name% = %name% + num;
}
Thanks
I don't know how much better you can do with a stored procedure in MySQL, but you can definitely do better than 100500 queries:
SELECT name FROM table WHERE name LIKE 'somename%' ORDER BY name DESC LIMIT 1
At that point, you know that you can increment the number at the end of name and the result will be unused.
I'm glossing over some fine print (this approach will never find and fill any "holes" in the naming scheme that may exist, and it's still not guaranteed that the name will be available due to race conditions), but in practice it can be made to work quite easily.
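One more wrinkle: ORDER BY name DESC compares strings, so 'somename9' sorts after 'somename10'. A hedged sketch that extracts the highest numeric suffix instead; it assumes the suffixes are purely numeric and reuses the table and column names from the query above:
SELECT CONCAT('somename',
       GREATEST(COALESCE(MAX(CAST(SUBSTRING(name, CHAR_LENGTH('somename') + 1) AS UNSIGNED)), 1) + 1, 2)) AS next_name
FROM `table`
WHERE name LIKE 'somename%';
The GREATEST(..., 2) keeps the first generated name at 'somename2', matching the pseudocode.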
The simplest way I can see of doing it is to create a table of sequential numbers and then cross join on to it:
SELECT a.name, b.id
FROM `table` a
CROSS JOIN atableofsequentialnumbers b
WHERE a.name = 'somename'
  AND NOT EXISTS (SELECT 1 FROM `table` x WHERE x.name = CONCAT(a.name, b.id))
ORDER BY b.id
LIMIT 10;
This will return the first 10 available numbers/names.