Summing characters in SAS - csv

I have a (pretty large) dataset in SAS where one of the columns contains data that looks like this -
Column Name
8
4
13
NA
NA
3
5
etc..
Because of the NAs in the column (and there are quite a few of them), SAS reads the entire column as a character variable. However, I want to perform some mathematical operations on the numbers within this column, e.g. SUM, but because SAS can't apply the SUM function to character variables, this is proving to be quite difficult. Is there a way to make SAS treat this column as numeric?
Thanks for your help!

You mention SUM, so there are two possibilities: the SQL SUM function over an aggregate group, or the SUM statistic in a SAS procedure such as PROC MEANS, SUMMARY, UNIVARIATE, REPORT, TABULATE, etc.
In SQL, a SUM of a computed value (the conversion from the character representation of a number to a numeric value) can be performed directly. Suppose the column in question is named amount:
data have;
  length amount $2;
  input amount @@;
datalines;
8 4 13 NA NA 3 5
;
run;

proc sql;
  create table want as
  select
    SUM(
      input(amount, ?best12.) /* computed value is conversion through INPUT() */
    ) as amount_total
    /* The ? before the informat (best12.) suppresses log messages such as
       NOTE: Invalid string and
       NOTE: Invalid argument */
  from have;
quit;
Other procedures will require a data source that delivers a new numeric variable computed from the original character variable. There are two ways to provide that data source:
As a VIEW
As a DATA set
* view;
data have_view / view=have_view;
  set have;
  amount_num = input(amount, ?best12.);
run;

proc means noprint data=have_view;
  var amount_num;
  output out=want_2 sum=amount_total;
run;

* or data;
data have_num;
  set have;
  amount_num = input(amount, ?best12.);
run;

proc means noprint data=have_num;
  var amount_num;
  output out=want_2 sum=amount_total;
run;
See SAS Proc Import CSV and missing data for a macro that converts a character variable in place and does not create new variable names. With such a macro, the non-numeric original values (such as NA or ??) are 'lost' because they become missing values (.).

If you only have one text value to deal with, you can use TRANWRD to replace it with either '0' or '.', depending on how you want to handle your NA null values.
OR...
If you want to make sure you remove all text characters from the column values, you can use the COMPRESS function with the 'KD' modifier to 'Keep Digits' only.
(A full list of compress modifiers can be found here. They're super useful when cleaning up text).
Then you can 'convert' the new column to numeric by applying a *1 multiplier to it, or by using the INPUT function (PROC SQL won't like the *1 multiplier workaround).
data your_dataset;
  input text_column $ @@;
datalines;
8 4 13 NA NA 3 5
;
run;

/* update dataset to have numeric versions of the column... */
data updated_dataset;
  set your_dataset;
  num_column1 = tranwrd(upcase(text_column), 'NA', '0') * 1;
  num_column2 = compress(text_column, , 'kd') * 1;
run;
/* ..or leave your dataset unchanged and just show sums using proc sql */
proc sql;
create table show_sums as
select sum(input(tranwrd(upcase(text_column), 'NA', '0'), 8.)) as sum1
,sum(input(compress(text_column, ,'kd'), 8.)) as sum2
from your_dataset
;
quit;

Related

How do I properly remove the integer portion of a double type value so that I only have the decimal portion remaining?

I thought that this would be easy. I wanted to remove the Integer portion of all values in a field. For example if I have a field with Double data type containing a value of 17.0938 then I wanted to get a result of 0.0938. So I used MyField - INT(MyField). That should be 17.0938 - 17. The result ends up being 9.38000000000017E-02. What am I doing wrong?
Note: Some of these seem to come out right; for example, if MyField contains 8.1563 then the result ends up being 0.1563 as I would expect. The problem still exists, though, because I'm doing this so that I can join the data to another table that contains fractional increments and their decimal equivalents (0, 1/64, 1/32, 3/64, 1/16, 5/64, 3/32, all the way to 63/64). When I join the result of MyField - INT(MyField) to the column of the table that says 0.1563 is 5/32, the join doesn't find the row that has 0.1563 on it. So if a value of 17.0938 is in MyField, then the result of MyField - INT(MyField) should be 0.0938, which would join to the row of the table with fractional equivalents that has 0.0938 in it, so that I could return the column with the fraction (3/32). There is a bit more to it than that, but I'm trying to keep out any irrelevant explanation. I can give a full explanation if it would help.
Use CDec:
Fraction = CDec(MyField) - Int(MyField)
However, do study my article on extreme conversion of Imperial values:
Convert and format imperial distance (feet and inches) with high precision
(if you don't have an account, browse to the link: Read the full article)
This will allow for conversions like:
SomeExpression = "4 ft. 5-7/64 in."
DecimalInches = ParseFeetInches(SomeExpression)
' DecimalInches -> 53.109375
and (note the precision):
' Meter/inch relation. 1 inch = 0.0254 m.
Const MetersPerInch As Currency = 0.0254
Inch = CDec(1) / MetersPerInch ' Decimal.
' Inch -> 39.370078740157480314960629921
Meter = MeterInch(Inch)
' Meter -> 1.0
Full code is too much to list here. However, it can also be found on GitHub:
VBA.Round
There is nothing wrong; your logic is fine. Note that, in decimal notation,
0.0938
is essentially the same value as
9.38000000000017E-02
(the latter is just scientific notation, with a tiny floating-point error in the trailing digits).
The issue you have with the joins is due to the way data types are stored in Access; as an example, 1/3 is a periodic number and is stored as 0.33333333.
Since a join does an exact comparison, Access will not be able to match the fraction with the decimal value. One approach would be to convert both the value from your equation and the stored fraction results to strings and truncate them to a given number of characters (4, for example) before doing the join, as sketched below.
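A rough sketch of that idea in Access SQL (MyTable, Fractions, MyField, DecimalValue, and Fraction are hypothetical names; Format rounds to four decimal places and returns text, which has the same effect as truncating to a fixed number of characters):

SELECT m.MyField, f.Fraction
FROM MyTable AS m, Fractions AS f
WHERE Format(m.MyField - Int(m.MyField), "0.0000") = Format(f.DecimalValue, "0.0000");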

Creating a table of duplicates from SAS data set with over 50 variables

I have a large SAS data set (54 variables and over 10 million observations) I need to load into Teradata. There are duplicates that must also come along, and my machine is not configured for MultiLoad. I want to simply create a table of the 300,000 duplicates that I can append to the original load that did not accept them. The logic I've read in other posts seems good for tables with just a few variables. Is there another way to create a new table in which every observation having the same combination of all 54 variables is listed? I'm trying to avoid the proc sort...by logic using 54 variables. The query builder method seemed inefficient as well. Thanks.
Using proc sort is a good way to do it; you just need to create a nicer key to sort on.
Create some test data.
data have;
x = 1;
y = 'a';
output;
output;
x = 2;
output;
run;
Create a new field that is equivalent to appending all of the fields in the row together and then running them through the md5() (hashing) algorithm. This will give you a nice short field that will uniquely identify the combination of all the values on that row.
data temp;
length hash $16;
set have;
hash = md5(cats(of _all_));
run;
Now use proc sort and our new hash field as the key. Output the duplicate records to the table named 'want':
proc sort data=temp nodupkey dupout=want;
by hash;
run;
You can do something like this:
proc sql;
create table rem_dups as
select <key_fields>, count(*) from duplicates
group by <key_fields>
having count(*) > 1;
quit;
proc sql;
create table target as
select dp.* from duplicates dp
left join rem_dups rd
on <Key_fields>
where <key_fields> is null;
quit;
If there are more than 300K duplicates, this option does not work. Also, I'm afraid I don't know about Teradata and the way you load tables.
First, a few sort related suggestions, then the core 'fast' suggestion after the break.
If the table is entirely unsorted (i.e., the duplicates can appear anywhere in the dataset), then proc sort is probably your simplest option. If you have a key that guarantees putting duplicate records adjacent to each other, then you can do:
proc sort data=have out=uniques noduprecs dupout=dups;
by <key>;
run;
That will put the completely duplicated records (note NODUPRECS, not NODUPKEY - NODUPRECS requires all 54 variables to be identical) in a secondary dataset (dups in the above). However, NODUPRECS only compares physically adjacent records, so if you have 4 or 5 duplicates by the key but only two of them are complete duplicates and they are not adjacent, it may not catch them; you would need a second sort, or you would need to list all variables in your BY statement (which might be messy). You could also use Rob's MD5 technique to simplify this somewhat.
If the table is not 'sorted' but the duplicate records will be adjacent, you can use by with the notsorted option.
data uniques dups;
set have;
by <all 54 variables> notsorted;
if not (first.<last variable in the list>) then output dups;
else output uniques;
run;
That tells SAS not to complain if things aren't in proper order, but lets you use first./last. processing. It's not a great option, though, particularly as you need to specify every variable.
The fastest way to do this is probably to use a hash table, if you have enough RAM to handle it, or if you can break your table up in some fashion (without losing your duplicates). 10M rows times 54 variables of, say, 10 bytes each is about 5.4 GB of data, so this only works if you have 5.4 GB of RAM available to SAS to build a hash table with.
If you know that a subset of your 54 variables is sufficient for verifying uniqueness, then the unique hash only has to contain that subset of variables (i.e., it might only be four or five index variables). The dups hash table does have to contain all variables (since it will be used to output the duplicates).
This works by using modify to process the dataset quickly, without rewriting the majority of the observations; remove deletes the duplicate observations, and the hash table output method writes the duplicates to a new dataset. The unq hash table is only used for lookup - so, again, it could contain just a subset of the variables.
I also use a technique here for getting the full variable list into a macro variable so you don't have to type 54 variables out.
data class; *make some dummy data with a few true duplicates;
set sashelp.class;
if age=15 then output;
output;
run;
proc sql;
select quote(name)
into :namelist separated by ','
from dictionary.columns
where libname='WORK' and memname='CLASS'
; *note UPCASE names almost always here;
quit;
data class;
  if 0 then set class;
  if _n_=1 then do;                  *make a pair of hash tables;
    declare hash unq();
    unq.defineKey(&namelist.);
    unq.defineData(&namelist.);
    unq.defineDone();
    declare hash dup(multidata:'y'); *the latter allows this to have dups in it (if your dups can have dups);
    dup.defineKey(&namelist.);
    dup.defineData(&namelist.);
    dup.defineDone();
  end;
  modify class end=eof;
  rc_c = unq.check();                *check to see if it is in the unique hash;
  if rc_c ne 0 then unq.add();       *if it is not, add it;
  else do;                           *otherwise add it to the duplicate hash and mark to remove it;
    dup.add();
    delete_check=1;
  end;
  if eof then do;                    *if you are at the end, output the dups;
    rc_d = dup.output(dataset:'work.dups');
  end;
  if delete_check eq 1 then remove;  *actually remove it from unique dataset;
run;
Instead of trying to avoid proc sort, I would recommend using proc sort with an index.
Read the documentation about indexes.
I am sure there must be identifier(s) to distinguish observations other than _n_,
and with the help of an index, sorting with noduprecs or nodupkey dupout=dataset would be an efficient choice. Furthermore, indexing can also facilitate other operations such as merging and reporting.
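A minimal sketch of that suggestion, assuming a dataset named have and hypothetical key variables id1-id3:

/* create a composite index on the identifying variables (names are hypothetical) */
proc datasets library=work nolist;
  modify have;
  index create dupkey = (id1 id2 id3);
quit;

/* sort on the same key; complete duplicate records go to DUPS, the rest to UNIQUES */
proc sort data=have out=uniques noduprecs dupout=dups;
  by id1 id2 id3;
run;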
Anyway, I do not think a dataset with 10 million observations is a good dataset to work with, not to mention the 54 variables.

MySQL REPLACE string with regex

I have a table with about 50,000 records. One of the fields is an "imploded" field consisting of a variable number of parameters, from 1 to 800. I need to replace all parameters with 0.00.
Example:
1 parameter 3.45 should become 0.00
2 parameters 2.27^11.03 should become 0.00^0.00
3 parameters 809.11^0.12^3334.25 should become 0.00^0.00^0.00
and so on.
Really I need to replace anything between the ^ delimiters with 0.00 (for 1 parameter it should be just 0.00 without ^).
Or I need to somehow count the number of ^, generate a string like 0.00^0.00^0.00 ..., and replace the original value with it. The only tool available is MySQL Workbench.
I would appreciate any help.
There is no regex replace capability built in to MySQL (REGEXP_REPLACE only arrived in MySQL 8.0).
You can, however, accomplish your purpose by doing what you suggested -- counting the number of ^ and crafting a string of replacement values, with this:
TRIM(TRAILING '^' FROM REPEAT('0.00^',(LENGTH(column) - LENGTH(REPLACE(column,'^','')) + 1)));
From the inside out, we calculate the number of values by counting the number of delimiters and adding 1 to that count. We count the delimiters by comparing the length of the original string against the length of the same string with the delimiters stripped out, using REPLACE(...,'^','') to replace every ^ with nothing.
The REPEAT() function builds a string by repeating a string expression n number of times.
This results in a spurious ^ at the end of the string, which we remove easily enough with TRIM(TRAILING '^' FROM ...).
Run SELECT t1.*, <the expression above> FROM table_name t1 against your table to verify the results of this logic (replacing column with the actual name of the column); then, once you are confident in the logic, you can UPDATE table_name SET column = <the expression> to modify the values.
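Spelled out, with table_name kept and a hypothetical column name params standing in for your actual column, that looks roughly like:

-- preview the replacement values alongside the originals
SELECT t1.*,
       TRIM(TRAILING '^' FROM
            REPEAT('0.00^',
                   LENGTH(t1.params) - LENGTH(REPLACE(t1.params, '^', '')) + 1)) AS new_params
FROM table_name t1;

-- once the preview looks right, apply the same expression in place
UPDATE table_name
SET params = TRIM(TRAILING '^' FROM
                  REPEAT('0.00^',
                         LENGTH(params) - LENGTH(REPLACE(params, '^', '')) + 1));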
Note, of course, that this is indicative of a problematic database design. Each column should contain a single atomic value, not a "list" of values, as this question seems to suggest.

Force mySQL queries to be characters not numeric in R

I'm using RODBC to interface R with a MySQL database and have encountered a problem. I need to join two tables based on unique ID numbers (IDNUM below). The issue is that the ID numbers are 20 digit integers and R wants to round them. OK, no problem, I'll just pull these IDs as character strings instead of numeric using CAST(blah AS CHAR).
But R sees the incoming character strings as numbers and thinks "hey, I know these are character strings... but these character strings are just numbers, so I'm pretty sure this guy wants me to store this as numeric, let me fix that for him" then converts them back into numeric and rounds them. I need to force R to take the input as given and can't figure out how to make this happen.
Here's the code I'm using (Interval is a vector that contains a beginning and an ending timestamp, so this code is meant to pull data only from a chosen time period):
test = sqlQuery(channel, paste("SELECT CAST(table1.IDNUM AS CHAR),PartyA,PartyB FROM
table1, table2 WHERE table1.IDNUM=table2.IDNUM AND table1.Timestamp>=",Interval[1],"
AND table2.Timestamp<",Interval[2],sep=""))
You will most likely want to read the documentation for the function you are using at ?sqlQuery, which includes notes about the following two relevant arguments:
as.is which (if any) columns returned as character should be
converted to another type? Allowed values are as for read.table. See
‘Details’.
and
stringsAsFactors logical: should columns returned as character and
not excluded by as.is and not converted to anything else be converted
to factors?
In all likelihood you want to specify the columns in question in as.is.
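For example, a sketch of the original query with every character column kept as character (as.is = TRUE; you could instead pass a vector of column positions or names to exempt only the ID column):

test = sqlQuery(channel,
                paste("SELECT CAST(table1.IDNUM AS CHAR) AS IDNUM, PartyA, PartyB FROM",
                      " table1, table2 WHERE table1.IDNUM=table2.IDNUM AND table1.Timestamp>=",
                      Interval[1],
                      " AND table2.Timestamp<", Interval[2], sep=""),
                as.is = TRUE,              # keep character columns as character
                stringsAsFactors = FALSE)  # and do not convert them to factors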

Out of range value while converting MD5 to 2 Bigints in MySql

In a database (MySql) I am storing some string values and checking for the uniqueness of those strings before storing them.
For a fast string comparison (I mean checking whether the incoming string already exists in the DB before recording it) I want to hash the incoming string (MD5), split the hash into 2 equal parts, convert them into 2 bigints, store them separately along with the string, and when a record request arrives I want to search for these 2 bigints in a multiple-column index. (Of course I will take the incoming string, MD5 that string, calculate the 2 bigint parts, then query the database.)
But the "line 3" below produces an interesting error in my "MySql Routine".
...
declare mystring varchar(3000); -- line 1
declare md5bigint1value bigint; -- line 2
...
set md5bigint1value = conv(substring((md5(mystring)),1,16),16,10); -- line 3
...
At "line 3" it says: Error Code: 1264. Out of range value for column 'md5bigint1value' at row 1
Does anybody know why this is happening?
Please let me know if you need any more info.
Thank you very much.
CONV, when used with a positive to_base, converts to an unsigned value, while BIGINT is signed. An unsigned 64-bit value won't necessarily fit into a signed 64-bit variable.
If to_base is a negative number, N is regarded as a signed number. Otherwise, N is treated as unsigned.
What you want to do is use -10 for the destination base, that is:
set md5bigint1value = conv(substring((md5(mystring)),1,16),16,-10); -- line 3
SQLfiddle to test with (10 won't work, -10 will).