I have hundreds of CSV files that I need to import into SAS as .sas7bdat files. I don't want to do it manually because it is time-consuming. I'm trying to write a macro in SAS using a DATA step, but I don't know how to specify the correct format and length for each variable. I worry that if I specify the wrong length for one of the variables, some data won't be read correctly.
Here is an example:
Data_1:
A,B,C,D,E
2, Paul Twix, 5/9/2015, 2, 238
2, Paul Twix, 5/10/2015, 3, 238
2, Paul Twix, 5/11/2015, 4, 238
Data_2:
A,B,C,D,E
2345678, Carolina Ferrera, 5/9/2015, 22, 123
2345678, Carolina Ferrera, 5/10/2015, 23, 123
2345678, Carolina Ferrera, 5/11/2015, 24, 123
I thought of running this code first to determine the maximum length, but again I can only check a handful of files.
proc sql noprint ;
create table varlist as
select memname,varnum,name,type,length,format,format as informat,label
from dictionary.columns
where libname='WORK' and memname='DATA_1' /* DICTIONARY.COLUMNS stores member names in uppercase */
;
quit;
When I have only a handful of files, I can manually adjust the length of the character variable, but if I have many files and specify the variable formats based on the first file, some values will be truncated. Here is an example:
%macro import_main(inf,outdat);
DATA &outdat.;
INFILE &inf.
LRECL=32767 firstobs=2
TERMSTR=CRLF
DLM=','
MISSOVER
DSD ;
INPUT
A : ?? BEST1.
B : $CHAR9.
C : ?? MMDDYY9.
D : ?? BEST1.
E : ?? BEST3. ;
FORMAT C YYMMDD10.;
RUN;
%mend import_main;
filename inf1 'C:\SAS_data_1.csv';
filename inf2 'C:\SAS_data_2.csv';
%import_main(inf1, work.SAS_data_1);
%import_main(inf2, work.SAS_data_2);
This code displays the values for SAS_data_1 correctly but truncates the name strings in SAS_data_2.
Is there a way to avoid this error in the macro?
Thank you.
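One possible direction, sketched here under assumptions: declare every variable with a LENGTH statement before the INPUT statement and let list input read each field up to the next comma, so the width of the first file no longer matters. The macro name import_wide and the $100 upper bound for B are made up for illustration; pick a bound that suits your data.

```sas
%macro import_wide(inf, outdat);
  data &outdat.;
    infile &inf. dsd dlm=',' truncover firstobs=2 lrecl=32767 termstr=crlf;
    /* Define every variable up front; $100 is an assumed upper bound. */
    length A 8 B $100 C 8 D 8 E 8;
    informat C mmddyy10.;
    format C yymmdd10.;
    /* List input reads each field up to the next comma, so no width is given here. */
    input A B C D E;
  run;
%mend import_wide;

filename inf2 'C:\SAS_data_2.csv';
%import_wide(inf2, work.SAS_data_2);
```

A character length that is larger than needed wastes some space but does not lose data, which is the safer trade-off when the files cannot all be inspected by hand.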
I'm using SAS University Edition 9.4
This is my CSV data.
,MGAAAAAAAA,3,A0000B 2F1
11111,ハアン12222234222B56122,4,AA 0000
,テストデータ,5,AACHY 2410F1
,テストデタテストテ,5,AACHYF2
This is my infile statement.
data wk01;
infile '/folders/myfolders/data/test_csv.txt'
dsd delimiter=','
lrecl=1000 missover firstobs=1;
input firstcol :$ secondcol :$ thirdcol :$ therest :$;
run ;
I expected my result like this.
But after running the code, what I got is shown below (the yellow highlight indicates which rows/columns had their data truncated by SAS).
For example, the first row's second column is MGAAAAAAAA but SAS's output is MGAAAAAA.
Could you please point out what I am missing here? Thanks a lot.
The values of your variables are longer than the 8 bytes you are allowing for them. The UTF-8 characters can use up to 4 bytes each. Looks like some of them are getting truncated in the middle, so you get an invalid UTF-8 code.
Just define longer lengths for your variables instead of letting SAS use the default length of 8. In general it is best to define your variables explicitly with a LENGTH or ATTRIB statement, instead of forcing SAS to guess how to define them based on how you first use them in other statements like INPUT, FORMAT, INFORMAT or assignment.
data wk01;
infile '/folders/myfolders/data/test_csv.txt' dsd dlm=',' truncover ;
length firstcol $8 secondcol $30 thirdcol $30 therest $100;
input firstcol secondcol thirdcol therest;
run ;
I think what you have is a mixed-encoding problem. What's essentially happening is that after the first 5 characters, which are ASCII, the text changes to UTF-8. The commas get mixed up in this soup and the standard delimiter handling gets confused. I think you need some manual coding like this to deal with it:
data wk01;
infile "test.csv" lrecl=1000 truncover firstobs=1;
input text $utf8x70.;
firstcomma = findc(text,',', 1);
secondcomma = findc(text,',', firstcomma + 1);
thirdcomma = findc(text,',', secondcomma + 1);
fourthcomma = findc(text,',', thirdcomma + 1);
length firstcol $5;
length secondcol $30;
length thirdcol $1;
length fourthcol $30;
firstcol= substr(text,1, firstcomma - 1);
secondcol = substr(text, firstcomma + 1, (secondcomma -firstcomma-1 ));
thirdcol = substr(text, secondcomma + 1, (thirdcomma - secondcomma - 1));
fourthcol = substr(text, thirdcomma + 1);
run;
There is probably a cleaner way to do it, but this is the quick and dirty method I could come up with at 2 AM :)
When I import a large CSV file into SAS, the log always shows ‘WARNING: A character that could not be transcoded has been replaced in record XXXXX’. What should I do about it?
Thanks in advance.
/**********************************************************************
* PRODUCT: SAS
* VERSION: 9.4
* CREATOR: External File Interface
* DATE: 06MAR18
* DESC: Generated SAS Datastep Code
* TEMPLATE SOURCE: (None Specified.)
***********************************************************************/
data WORK.Companies ;
%let _EFIERR_ = 0; /* set the ERROR detection macro variable */
infile 'E:\PATSTAT\Companies.csv' delimiter = ',' MISSOVER DSD lrecl=13106 firstobs=2 ;
informat person_id best32. ;
informat person_name $46. ;
...
informat nuts3 $5. ;
informat nuts3_name $30. ;
format person_id best12. ;
format person_name $46. ;
...
format nuts3 $5. ;
format nuts3_name $30. ;
input
...
nuts3 $
nuts3_name $
;
if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection macro variable */
run;
NOTE: A byte-order mark in the file "E:\PATSTAT\Companies.csv" (for fileref "#LN00025") indicates that the data is encoded in "utf-8". This encoding will be used to process the file.
NOTE: The infile 'E:\PATSTAT\Companies.csv' is: Filename=E:\PATSTAT\Companies.csv, RECFM=V, LRECL=52424, File Size (bytes)=228293377, Last Modified=03 March 2018 19:12:47 o'clock, Create Time=27 November 2017 14:10:57 o'clock
WARNING: A character that could not be transcoded has been replaced in record 775.
WARNING: A character that could not be transcoded has been replaced in record 857.
...
WARNING: A character that could not be transcoded has been replaced in record 10881.
NOTE: Limit set by ERRORS= option reached. Further warnings of this type will not be printed.
NOTE: 1048575 records were read from the infile 'E:\PATSTAT\Companies.csv'.
The minimum record length was 103.
The maximum record length was 680.
NOTE: The data set WORK.COMPANIES has 1048575 observations and 26 variables.
NOTE: DATA statement used (Total process time): real time 7.28 seconds cpu time 3.19 seconds
1048575 rows created in WORK.Companies from E:\PATSTAT\Companies.csv.
NOTE: WORK.COMPANIES data set was successfully created.
NOTE: The data set WORK.COMPANIES has 1048575 observations and 26 variables.
You need to start SAS with unicode support to read UTF-8 characters.
You could try setting ENCODING=ANY on the INFILE or FILENAME statement in your current SAS session. The encoding shouldn't matter for the numbers. But if you really have UTF-8 characters that cannot be transcoded into single byte WLATIN1 characters then you will probably have trouble working with those strings.
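Before changing anything, it can help to confirm which encoding the current session is using. A minimal sketch: the ENCODING system option and the SYSENCODING automatic macro variable both report it. If the value is not UTF-8, the session has to be restarted with Unicode support (how that is done depends on the installation, so treat that part as an assumption about your setup).

```sas
/* Show the encoding of the current SAS session in the log. */
proc options option=encoding;
run;

/* The same value is available in an automatic macro variable. */
%put NOTE: Session encoding is &sysencoding;
```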
I'm having an issue reading a CSV file into a SAS dataset without importing every field. I don't want every field imported, but that's the only way I can seem to get this to work. The issue is that I cannot get SAS to read my data correctly, even when it reads the columns correctly... I think part of the issue is that I have data above my actual column headers that I don't want to read in.
My data is laid out like so
somevalue somevalue somevalue...
var1 var2 var3 var4
abc abc abc abc
I want to exclude somevalue and read in only selected variables and their corresponding data.
Below is a sample file where I've scrambled all the values in my fields. I only want to keep columns H (8), AT (46) and BE (57).
Here's some code I've tried so far...
This was generated by SAS from a PROC IMPORT. My PROC IMPORT worked fine to read in every field value, so I just deleted the fields that I didn't want, but I don't get the output I expect. The values do not match their fields.
A) PROC IMPORT
DATAFILE="C:\Users\dip1\Desktop\TU_&YYMM._FIN.csv"
OUT=TU_&YYMM._FIN
DBMS=csv REPLACE;
GETNAMES=NO;
DATAROW=3;
RUN;
generated this in the SAS log (I cut out the other fields I didn't want)
B) DATA TU_&YYMM._FIN_TEST;
infile 'C:\Users\fip1\Desktop\TU_1701_FIN.csv' delimiter = ',' DSD lrecl=32767
firstobs=3 ;
informat VAR8 16. ;
informat VAR46 $1. ;
informat VAR57 $22. ;
format VAR8 16. ;
format VAR46 $1. ;
format VAR57 $22. ;
input
VAR8
VAR46 $
VAR57 $;
run;
I've also tried this below... I believe I'm just missing something..
C) DATA TU_TEST;
INFILE "C:\Users\fip1\Desktop\TU_&yymm._fin.csv" DLM = "," TRUNCOVER FIRSTOBS = 3;
LABEL ACCOUNT_NUMBER = "ACCOUNT NUMBER";
LENGTH ACCOUNT_NUMBER $16.
E $1.
REJECTSUBCATEGORY $22.;
INPUT ACCOUNT_NUMBER
E
REJECTSUBCATEGORY;
RUN;
As well as trying to have SAS point to the columns I want to read in, modifying the above to:
D) DATA TU_TEST;
INFILE "C:\Users\fip1\Desktop\TU_&yymm._fin.csv" DLM = "," TRUNCOVER FIRSTOBS = 3;
LABEL ACCOUNT_NUMBER = "ACCOUNT NUMBER";
LENGTH ACCOUNT_NUMBER $16.
E $1.
REJECTSUBCATEGORY $22.;
INPUT #8 ACCOUNT_NUMBER
#46 E
#57 REJECTSUBCATEGORY;
RUN;
None of which work. Again, I can do this successfully if I bring in all of the fields with either A) or B), given that B) includes all the fields, but I can't get C) or D) to work, and I want to keep the code to a minimum if I can. I'm sure I'm missing something, but I've never had time to tinker with it so I've just been doing it the "long" way..
Here's a snippet of what the data looks like
A(1) B(2) C(3) D(4) E(5) F(6) G(7)
ABCDEFGHIJ ABCDMCARD 202020 4578917 12345674 457894A (blank)
CRA INTERNALID SUBCODE RKEY SEGT FNM FILEDATE
CREDITBUR 2ABH123 AB2CHE123 A28O5176688 J2 Name 8974561
With a delimited file you need to read all of the fields (or at least all of the fields up to the last one you want to keep) even if you do not want to keep all of them. The ones you want to skip can be read into a dummy variable that you drop, or even into one of the variables you want to keep, which you then overwrite by reading from a later column.
Also, don't model your DATA step after the code generated by PROC IMPORT. You can write much cleaner code yourself. For example, there is no need for any FORMAT or INFORMAT statements for the three variables you listed, although if VAR8 really needs 16 digits you might want to attach a format to it so that SAS doesn't use the BEST12. format.
data tu_&yymm._fin_test;
infile 'C:\Users\fip1\Desktop\TU_1701_FIN.csv'
dlm=',' dsd lrecl=32767 truncover firstobs=3
;
length var8 8 var46 $1 var57 $22 ;
length dummy $1 ;
input 7*dummy var8 37*dummy var46 10*dummy var57 ;
drop dummy ;
format var8 16. ;
run;
You can replace the VARxx variable names with more meaningful ones if you want (or add a RENAME statement). Using the position numbers here just makes it clearer in this code that the INPUT statement is reading the 57 columns from the input data.
I have many CSV files with many column headers, up to 2000 for some files.
I'm trying to import them, but at some point the headers are truncated in a 'random' manner and the rest of the data is ignored and therefore not imported. I put 'random' in quotes because it may not be random, although I don't know the reason if it is not. Let me give you more detail.
The headers are truncated at varying points: some after the 977th variable, others after the 1401st.
The headers look like this: BAL_RT,ET-CAP,EXT_EA16,IVOL-NSA,AT;BAL_RT,ET-CAP,EXT_EA16,IVOL-NSA,AT;BAL_RT,ET-CAP,EXT_EA16,IVOL-NSA,AT
This is part of the import log:
642130 VAR1439
642131 VAR1440
642132 VAR1441
642133 VAR1442 $
642134 VAR1443 $
642135 VAR1444 $
As you can see, some headers are treated as numeric although all of them are alphanumeric, since they mix characters and digits.
My import code is below:
%macro lec ;
options macrogen symbolgen;
%let nfic=37 ;
%do i=1 %to &nfic ;
PROC IMPORT OUT= fic&i
DATAFILE= "C:\cygwin\home\appEuro\pot\fic&i..csv"
DBMS=DLM REPLACE;
DELIMITER='3B'x;
guessingrows=500 ;
GETNAMES=no;
DATAROW=1;
RUN;
data dico&i ; set fic&i (drop=var1) ;
if _n_ eq 1 ;
index=0 ;
array v var2-var1000 ;
do over v ;
if v ne "" then index=index+1 ;
end ;
run ;
data dico&i ; set dico&i ;
call symput("nvar&i",trim(left(index))) ;
run ;
%put &&nvar&i ;
%end ;
%mend ;
%lec ;
The code does an import and also creates a dictionary of the headers, since some of them are long (e.g. more than 34 characters).
I'm not sure whether these elements are related, but I would welcome any insights you can give me.
Best.
You should not use PROC IMPORT here, as I mentioned in a previous comment. You need to build your dictionary by reading the header line in a DATA step, because with 2000 columns and names of 34 or more characters each, the header record will be longer than 32767 bytes.
An approach like this is necessary.
data headers;
infile "whatever" dlm=';' lrecl=99999 truncover; *or perhaps longer even, if that is needed - look at log;
length name $50; *or longer if 50 is not your real absolute maximum;
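do until (missing(name)); *TRUNCOVER leaves name blank once the line is exhausted;
input name $ @; *trailing @ holds the record so the next INPUT continues on the same line;
if not missing(name) then output;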
end;
stop; *only want to read the first line!;
run;
Now you have your variable names. You can then read the file with GETNAMES=NO; in PROC IMPORT (you'll have to discard the first line) and use that dictionary to generate rename statements (you will have lots of VARxxxx variables, but in a predictable order).
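One way to apply the dictionary is sketched below. It assumes the HEADERS data set from the step above has one row per column in file order, and that the imported data set is WORK.FIC1 with variables VAR1, VAR2, ... as in the question; the fileref lbl is made up for illustration. Because a header longer than 32 characters cannot be used as a SAS variable name, this sketch attaches the original headers as labels instead of renaming the variables.

```sas
/* Write one LABEL statement per header to a temporary file. */
filename lbl temp;

data _null_;
  set headers;
  file lbl;
  length stmt $200;
  /* _N_ is the row number, assumed to match the VARn position. */
  stmt = catx(' ', 'label', cats('VAR', _n_), '=', quote(trim(name)), ';');
  put stmt;
run;

/* Apply the labels in place, without rewriting the data. */
proc datasets lib=work nolist;
  modify fic1;
  %include lbl;
  run;
quit;
```

Writing the statements to a file and including them avoids the length limit of a single macro variable, which 2000 labels could exceed. If HEADERS has more rows than the data set has VARn variables, PROC DATASETS will report an error for the missing ones.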
I have a number of csv files in a folder. They all have the same structure (3 columns). The SAS code below imports all of them into one dataset. It includes the 3 columns plus their file name.
My challenge is that the filename variable includes the directories and drive letter (e.g. 'Z:\DIRECTORYA\DIRECTORYB\file1.csv'). How can I just list the file name and not the path (e.g. file1.csv)? Thank you
data WORK.mydata;
%let _EFIERR_ = 0; /* set the ERROR detection macro variable */
length FNAME $80.;
infile 'Z:\DIRECTORYA\DIRECTORYB\*2012*.csv' delimiter = ',' MISSOVER DSD lrecl=32767 filename=fname firstobs=2;
informat A $26. ;
informat B $6. ;
informat C 8. ;
format A $26. ;
format B $6. ;
format C 8. ;
input
A $
B $
C;
if _ERROR_ then call symputx('_EFIERR_',1);
filename=fname;
run;
I think your best bet is to use regular expressions. Add this to your DATA step:
reg1=prxparse("/\\(\w+\.csv)/");
if prxmatch(reg1, filename) then
filename=prxposn(reg1,1,filename);
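A possible alternative without regular expressions, assuming the path separator is always a backslash: the SCAN function with a negative count takes words from the right, so -1 returns the last path component. The sample path below is the one from the question.

```sas
data _null_;
  length fname filename $80;
  fname = 'Z:\DIRECTORYA\DIRECTORYB\file1.csv';
  /* A count of -1 scans from the right; '\' is the only delimiter. */
  filename = scan(fname, -1, '\');
  put filename=;
run;
```

In the import step itself, the same call would replace the plain assignment, i.e. filename = scan(fname, -1, '\');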
We can try breaking this into two DATA steps. In the first, we extract the file names into one data set. In the second, we attach the file names (including the .txt or .csv extension) to their respective observations in the combined data set.
We'll use a FILENAME PIPE with the DIR command and its /b switch.
For example, if I have three .txt files: example.txt, example2.txt and example3.txt
%let path = C:\sasdata;
filename my pipe 'dir "C:\sasdata\*.txt"/b ';
data example;
length filename $256;
infile my length=reclen;
input filename $varying256. reclen;
run;
data mydata;
length filename $256; /* match the length used in the first step to avoid a multiple-lengths warning */
set example;
location=cat("&path\",filename);
infile dummy filevar=location length=reclen end=done missover;
do while (not done);
input A B C D;
output;
end;
run;
Output of the first data step:
filename
example.txt
example2.txt
example3.txt
Output of the second data step:
filename A B C D
example.txt 171 255 61300 79077
example.txt 123 150 10300 13287
example2.txt 250 255 24800 31992
example2.txt 132 207 48200 62178
example2.txt 243 267 25600 33024
example3.txt 171 255 61300 79077
example3.txt 123 150 10300 13287
example3.txt 138 207 47400 61146
On Windows, this reads all the .txt files in the folder. It should work for .csv files as well, as long as you add delimiter=',' to the INFILE statement in the second data step and change the extension in the FILENAME statement to *.csv. Cheers.