Copy and Pasting Rows into a SAS Dataset - duplicates

Is there any quick and dirty way to create duplicates of an observation in a dataset? I know I could just subset it and then use proc append, but that seems like an inelegant solution for a task that seems so simple. Any ideas?

I think coding it is the simplest way.
data work.have;
a=1;b=2;c=3;
run;
data work.want;
set work.have;
output;
if a=1 then output; /* Again */
run;

Try this
data check2(drop=i);
set check1;
output; /* keep the original row */
do i = 1 to datediff;
output; /* write one extra copy per iteration */
end;
run;
Here there are two dates, and I am trying to insert a number of duplicate rows that varies and is equal to the month difference between the two dates.
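For reference, here is a minimal sketch of how such a datediff counter might be derived beforehand with INTCK; the input dataset and the variable names start_date and end_date are assumptions, not from the original post.
data check1;
set dates_in; /* hypothetical input containing the two date variables */
datediff = intck('month', start_date, end_date); /* number of month boundaries between the dates */
run;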

Related

Summing characters in SAS

I have a (pretty large) dataset in SAS where one of the columns contains data that looks like this -
Column Name
8
4
13
NA
NA
3
5
etc..
Because of the NAs in the column (and there are quite a few of them), SAS recognises the entire column as character. However, I want to perform some mathematical operations on the numbers within this column, e.g. SUM, but because SAS can't apply the SUM function to character variables, this is proving to be quite difficult. Is there a way to make SAS treat this column as numeric?
Thanks for your help!
You mention SUM, so there are two possibilities: the SQL SUM function over an aggregate group, or the SUM statistic in a SAS procedure such as Proc MEANS, SUMMARY, UNIVARIATE, REPORT, TABULATE, etc.
In SQL, a SUM of a computed value (the conversion from the character representation of a number to a numeric value) can be performed directly. Suppose the column in question is named amount:
data have;
length amount $2;
input amount @@;
datalines;
8 4 13 NA NA 3 5
;
proc sql;
create table want as
select
SUM(
input(amount,?best12.) /* computed value is conversion through INPUT() */
) as amount_total
/* The ? before the informat (best12.) suppresses log messages such as
NOTE: Invalid string and
NOTE: Invalid argument */
from have;
quit;
Other procedures will require a data source that delivers the column converted to numeric, i.e. a new numeric variable based on the original character variable. There are two ways to provide that data source:
As a VIEW
As a DATA set
* view;
data have_view / view=have_view;
set have;
amount_num = input(amount,?best12.);
run;
proc means noprint data=have_view;
var amount_num;
output out=want_2 sum=amount_total;
run;
* or data;
data have_num;
set have;
amount_num = input(amount,?best12.);
run;
proc means noprint data=have_num;
var amount_num;
output out=want_2 sum=amount_total;
run;
See SAS Proc Import CSV and missing data for a macro that converts a character variable in place, and does not create new variable names. With such a macro the non-numeric original values (such as NA, ??) are 'lost' because they become missing values (.)
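Without that macro, a minimal data step sketch of converting in place is shown below; the variable name amount matches the earlier example, and the ?? modifier silently turns values such as NA into missing (.).
data have_converted;
set have;
amount_num = input(amount, ?? best12.); /* 'NA' becomes . with no log notes */
drop amount;
rename amount_num = amount; /* reuse the original variable name */
run;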
If you only have one text value to deal with, you can use TRANWRD to replace it with either '0' or '.' - depending how you want to handle your NA null values.
OR...
If you want to make sure you remove all text characters from the column values, you can use the COMPRESS function with the 'KD' modifier to 'Keep Digits' only.
(A full list of compress modifiers can be found here. They're super useful when cleaning up text).
Then you can 'convert' the new column to numeric by applying a *1 multiplier to it, or use the INPUT function (proc sql won't like the *1 multiplier workaround).
data your_dataset;
input text_column $ @@;
datalines;
8 4 13 NA NA 3 5
;
run;
/* update dataset to have numeric version of column... */
data updated_dataset;
set your_dataset;
num_column1 = tranwrd(upcase(text_column), 'NA', '0') * 1; /* NA treated as 0 */
num_column2 = compress(text_column, ,'kd') * 1; /* keep digits only; NA becomes missing */
run;
/* ..or leave your dataset unchanged and just show sums using proc sql */
proc sql;
create table show_sums as
select sum(input(tranwrd(upcase(text_column), 'NA', '0'), 8.)) as sum1
,sum(input(compress(text_column, ,'kd'), 8.)) as sum2
from your_dataset
;
quit;

MySql delete where find_in_set not working

I have a MySql stored procedure that has multiple parts. Procedure receives an INT "inId" and a VARCHAR(500) argument called "inIgnoreLogTypes" that's a comma-separated list of numbers.
First part of SQL looks like this:
DECLARE affectedNumbers text;
SELECT GROUP_CONCAT(am.Numbers) INTO affectedNumbers FROM Users am WHERE am.userID = inId;
I need to do that because the variable "affectedNumbers" will be used later on throughout this rather big stored procedure, so for the sake of performance I don't want to do "IN (SELECT ...)" every time I need to look up the list.
I checked; the variable "affectedNumbers" gets correctly populated with comma-separated values.
Next part is this (and that's where the problem occurs):
DELETE FROM UserLogs WHERE
FIND_IN_SET(User_Number, affectedNumbers) AND
NOT FIND_IN_SET(LogType, inIgnoreLogTypes);
The above statement does nothing, and after hours of searching for "why" I can't find the answer... Maybe because "affectedNumbers" is TEXT and "User_Number" is INT? Or maybe because "LogType" is INT and "inIgnoreLogTypes" is VARCHAR?
I checked both sets; they are comma-separated integers...
Found the issue! I have to use something like this:
DELETE FROM UserLogs WHERE
FIND_IN_SET(UserLogs.User_Number, affectedNumbers) AND
NOT FIND_IN_SET(UserLogs.LogType, inIgnoreLogTypes);
Strange, as there were no errors.... Now it works.

Inserting a delimiter

MySql has a function CONCAT_WS that I use to export multiple fields with a delimiter into a single field. Works great!
There are multiple fields stored in a database I query that I need to extract individually, but within each field the data need to include a delimiter. I can most certainly do a concatenate, but that takes a while to set up if my data requires up to 100 unique values. Below is an example of what I am talking about.
Stored Data 01020304050607
End Result 01,02,03,04,05,06,07
Stored Data 01101213
End Result 01,10,12,13
Is there a function in MySQL that does the above?
I am not that familiar with MySQL, but I have seen questions like this come up before where a regular expression function would be useful. There are user-defined functions available that provide Oracle-like regular expression functions, since native support for them is weak in MySQL. See here: https://github.com/hholzgra/mysql-udf-regexp
So you could do something like this:
select trim(TRAILING ',' FROM regexp_replace(your_column, '(.{2})', '\1,') )
from your_table;
This adds a comma every 2 characters, then chops off the trailing one. Maybe this will give you some ideas.

Delete duplicate rows in SAS

I am trying to delete duplicate rows from a csv file using SAS but haven't been able to do so. My data looks like-
site1,variable1,20151126000000,22.8,140,1
site1,variable1,20151126010000,22.8,140,1
site1,variable2,20151126000000,22.8,140,1
site1,variable2,20151126000000,22.8,140,1
site2,variable1,20151126000000,22.8,140,1
site2,variable1,20151126010000,22.8,140,1
The 4th row is a duplicate of the 3rd one. This is just an example, I have more than a thousand records in the file. I tried doing this by creating subsets but didn't get the desired results. Thanks in advance for any help.
I think you can use nodupkey for this; just reference your key, or you can use _all_:
proc sort data = file nodupkey;
by _all_;
run;
In this paper you can find different options for removing duplicate rows: https://support.sas.com/resources/papers/proceedings17/0188-2017.pdf
If all columns are sorted, the easiest way is to use the option noduprecs:
proc sort data = file noduprecs;
by some_column;
run;
In contrast to the option nodupkey, it will always remove duplicate rows based on all columns, no matter which column or columns you state after the by.
Edit: Apparently, all columns have to be sorted (-> have a look at the comment below).
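If the data are not already sorted, one way to guarantee that identical rows end up adjacent is to sort by every column in the same step (a sketch using the dataset name file from the question):
proc sort data = file noduprecs;
by _all_; /* sorting by every column makes identical rows adjacent, so noduprecs can remove them */
run;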

Creating a table of duplicates from SAS data set with over 50 variables

I have a large SAS data set (54 variables and over 10 million observations) I need to load into Teradata. There are duplicates that must also come along, and my machine is not configured for MultiLoad. I want to simply create a table of the 300,000 duplicates that I can append to the original load that did not accept them. The logic I've read in other posts seems good for tables with just a few variables. Is there another way to create a new table in which each observation having the same combination of all 54 variables is listed? I'm trying to avoid the proc sort...by logic using 54 variables. The query builder method seemed inefficient as well. Thanks.
Using proc sort is a good way to do it, you just need to create a nicer way to key off of it.
Create some test data.
data have;
x = 1;
y = 'a';
output;
output;
x = 2;
output;
run;
Create a new field that is basically equivalent to appending all of the fields in the row together and then running them through the md5() (hashing) algorithm. This will give you a nice short field that will uniquely identify the combination of all the values on that row.
data temp;
length hash $16;
set have;
hash = md5(cats(of _all_));
run;
Now use proc sort and our new hash field as the key. Output the duplicate records to the table named 'want':
proc sort data=temp nodupkey dupout=want;
by hash;
run;
You can do something like this:
proc sql;
create table rem_dups as
select <key_fields>, count(*) from duplicates
group by <key_fields>
having count(*) > 1;
quit;
proc sql;
create table target as
select dp.* from duplicates dp
left join rem_dups rd
on <key_fields>
where rd.<key_fields> is null;
quit;
If there are more than 300K duplicates, this option does not work. Also, I am afraid to say that I don't know about Teradata and the way you load tables.
First, a few sort related suggestions, then the core 'fast' suggestion after the break.
If the table is entirely unsorted (ie, the duplicates can appear anywhere in the dataset), then proc sort is probably your simplest option. If you have a key that will guarantee putting duplicate records adjacent, then you can do:
proc sort data=have out=uniques noduprecs dupout=dups;
by <key>;
run;
That will put the duplicate records in a secondary dataset (dups in the above); note noduprecs, not nodupkey - noduprecs requires all 54 variables to be identical. However, if the fully duplicated records are not physically adjacent (ie, you have 4 or 5 duplicates by the key but only two are completely duplicated), it may not catch them; you would need a second sort, or you would need to list all variables in your by statement (which might be messy). You could also use Rob's md5 technique to simplify this some, as sketched below.
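A sketch of that md5-plus-sort combination, assuming the source dataset is named have:
data temp;
length hash $16;
set have;
hash = md5(cats(of _all_)); /* one short key standing in for all 54 variables */
run;
proc sort data=temp out=uniques nodupkey dupout=dups;
by hash;
run;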
If the table is not 'sorted' but the duplicate records will be adjacent, you can use by with the notsorted option.
data uniques dups;
set have;
by <all 54 variables> notsorted;
if not (first.<last variable in the list>) then output dups;
else output uniques;
run;
That tells SAS not to complain if things aren't in proper order, but lets you use first/last. Not a great option though particularly as you need to specify everything.
The fastest way to do this is probably to use a hash table for this, iff you have enough RAM to handle it, or you can break your table up in some fashion (without losing your duplicates). 10m rows times 54 (say 10 byte) variables means 5.4GB of data, so this only works if you have 5.4GB of RAM available to SAS to make a hash table with.
If you know that a subset of your 54 variables is sufficient for verifying uniqueness, then the unique hash only has to contain that subset of variables (ie, it might only be four or five index variables). The dups hash table does have to contain all variables (since it will be used to output the duplicates).
This works by using modify to quickly process the dataset, not rewriting the majority of the observations; using remove to remove them and the hash table output method to output the duplicates to a new dataset. The unq hash table is only used for lookup - so, again, it could contain a subset of variables.
I also use a technique here for getting the full variable list into a macro variable so you don't have to type 54 variables out.
data class; *make some dummy data with a few true duplicates;
set sashelp.class;
if age=15 then output;
output;
run;
proc sql;
select quote(name)
into :namelist separated by ','
from dictionary.columns
where libname='WORK' and memname='CLASS'
; *note UPCASE names almost always here;
quit;
data class;
if 0 then set class;
if _n_=1 then do; *make a pair of hash tables;
declare hash unq();
unq.defineKey(&namelist.);
unq.defineData(&namelist.);
unq.defineDone();
declare hash dup(multidata:'y'); *the latter allows this to have dups in it (if your dups can have dups);
dup.defineKey(&namelist.);
dup.defineData(&namelist.);
dup.defineDone();
end;
modify class end=eof;
rc_c = unq.check(); *check to see if it is in the unique hash;
if rc_c ne 0 then unq.add(); *if it is not, add it;
else do; *otherwise add it to the duplicate hash and mark to remove it;
dup.add();
delete_check=1;
end;
if eof then do; *if you are at the end, output the dups;
rc_d = dup.output(dataset:'work.dups');
end;
if delete_check eq 1 then remove; *actually remove it from unique dataset;
run;
Instead of trying to avoid proc sort, I would recommend you use proc sort with an index.
Read the documentation about indexes.
I am sure there must be identifier(s) other than _n_ to distinguish observations, and with the help of an index, sorting with noduprecs or nodupkey dupout=dataset would be an efficient choice. Furthermore, indexing could also facilitate other operations such as merging / reporting.
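A minimal sketch of creating such an index with PROC DATASETS; the variables id1 and id2 are hypothetical identifiers, not from the original question.
proc datasets library=work nolist;
modify have; /* the large dataset */
index create key_idx = (id1 id2); /* composite index on the hypothetical identifier variables */
quit;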
Anyway, I do not think a dataset with 10 million observations is easy to work with, not to mention the 54 variables.