I'm coming to SAS from R, where this problem is fairly easy to solve.
I'm trying to load a bunch of CanSim CSV files (one example table here) with a %Macro function.
%Macro ReadCSV (infile, outfile);
    PROC IMPORT
        DATAFILE= &infile.
        OUT= &outfile.
        DBMS=CSV REPLACE;
        GETNAMES=YES;
        DATAROW=2;
    RUN;
%Mend ReadCSV;
%ReadCSV("\\DATA\CanSimTables\02820135-eng.csv", work.cs02820135);
%ReadCSV("\\DATA\CanSimTables\02820158-eng.csv", work.cs02820158);
The problem is that the numeric Value column has ".." in all the CSVs whenever the value is missing. This causes an error when IMPORT reaches rows containing this character string.
Is there some way to tell IMPORT that any ".." should be removed or treated as missing values? (I found forums referring to the DSD option, but that doesn't seem to help me here.)
Thanks!
PROC IMPORT can only guess at the structure of your data. For example, it might see the .. and assume the column contains a character string instead of a number. It can also make other decisions that can render the generated dataset useless.
You will be better served by writing your own data step code to read the file. It is not very difficult to do. For your example linked file, all I did was copy and paste the first row of the CSV file, remove the commas, make the names valid variable names, and take some guesses at how long to make the character variables.
data want;
    infile "&path/&fname" dsd truncover firstobs=2;
    length Ref_Date $7 GEO $100 Geographical_classification $20
           CHARACTERISTICS $100 STATISTICS DATATYPE $50 Vector Coordinate $20
           Value 8
    ;
    input (Ref_Date -- Value) (??);
run;
The ?? modifier tells SAS not to report any errors when trying to convert the text in the VALUE column into a number, so the .. and other garbage in the file will simply generate missing values.
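Here is a minimal, self-contained sketch (made-up inline data, not the CanSim file) showing the effect of the ?? modifier; the .. in the numeric field becomes an ordinary missing value and no invalid-data notes are written to the log:
data demo;
    infile datalines dsd;
    /* ?? suppresses invalid-data notes: ".." is read as missing */
    input (geo value) (:$20. ??);
datalines;
Canada,12.3
Ontario,..
Quebec,7.5
;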
Not explicitly relevant for this question, but if your issue were "N" or "D" or similar values that you wanted to become missing, there would be a somewhat easier solution: the missing statement (importantly distinct from the missing option).
missing M;
That tells SAS to treat a single character M in the data as a missing value and to read it in accordingly. It is read in as the special missing value .M, which is functionally similar to the regular missing value . (but not actually equal to it in an equality comparison).
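A minimal sketch (made-up inline data) of the missing statement in action:
data demo;
    /* declare M as a valid special missing value for numeric input */
    missing M;
    input id value;
datalines;
1 42
2 M
3 17
;
Observation 2 gets value = .M rather than triggering an invalid-data note.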
I'm new to SAS and I wish to import a CSV file. This file has a column containing codes starting with a 0 (for instance, 01000 or 05200) that are 5 characters long.
When I open the file in a spreadsheet program there is no problem, but when I import it into SAS with:
proc import file="myfile.csv"
    out=output
    dbms=csv;
run;
the column is treated as numeric, so the leading 0 gets dropped. Changing the format afterwards doesn't solve the problem.
Is there a way to specify the column formats before reading the CSV, or simply to force the import of all columns as characters?
Thanks a lot!
The easiest solution is to read the file with a program instead of forcing SAS to guess how to read the file. PROC IMPORT will actually generate a program that you could use as a model. But it is not hard to write your own. Then you will have complete control over how the variables are defined: NAME; TYPE (numeric or character); storage LENGTH; LABEL; FORMAT to use for display; INFORMAT to use for reading the values from the line.
Just define the variables, attach any required formats and/or informats, and then read them. For example, this step reads two numeric and two character variables from the file. I made one of the variables hold DATE values so you can see how you might attach a format and/or informat to a variable that requires them. Most variables do not need either an informat or a format attached, as SAS already knows how to read and write both numbers and character strings.
data output;
    infile "myfile.csv" dsd firstobs=2 truncover;
    length var1 $10 var2 8 var3 $30 var4 8;
    informat var4 date.;
    format var4 yymmdd10.;
    input var1 var2 var3 var4;
run;
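Applied to the original question, a sketch along the same lines (the variable names here are assumptions, since the real header isn't shown): declaring the code column as character with length $5 preserves leading zeros such as 01000.
data output;
    infile "myfile.csv" dsd firstobs=2 truncover;
    /* hypothetical columns: CODE must be character so 01000 keeps its zero */
    length code $5 name $30 value 8;
    input code name value;
run;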
I am reading a csv file into Stata using
import delimited "../data_clean/winter20.csv", encoding(UTF-8)
The raw data looks like:
y id1
-.7709586 000000000020
-.4195721 000000003969
-.8932499 300000000021
-1.256116 200000007153
-.7858037 000000000000
The imported data become:
y id1
-.7709586 20
-.4195721 000000003969
-.8932499 300000000021
-1.256116 200000007153
-.7858037 0
However, some of the ID columns are read as numeric, and I would like to import them as strings. I want to read the data in exactly as it appears in the raw file.
The way I found online is:
import delimited "/Users/tianwang/Dropbox/Construction/data_clean/winter20.csv", encoding(UTF-8) stringcols(74 97 116) clear
However, the raw data may be updated and column numbers may change. The following
import delimited "/Users/tianwang/Dropbox/Construction/data_clean/winter20.csv", encoding(UTF-8) stringcols(id1 id2 id3) clear
gives the error id1: invalid numlist in stringcols() option. Is there a way to specify variable names rather than column numbers?
The reason is that the leading zeros are lost if I read the IDs as numeric. The tostring method does not recover the leading zeros, and format id1 %09.0f only works if the variables all have the same number of digits.
I think this should do it.
import delimited "../data_clean/winter20.csv", stringcols(_all) encoding(UTF-8) clear
PS: Tested in Stata16/Win10
I am trying to import reports in CSV format into MySQL for further analysis. However, I have found several negative numbers enclosed in brackets, e.g. ($184,919.02), ($182,246.50). If I use the DOUBLE type the value becomes 0, but with VARCHAR or TEXT it appears correctly.
I need it recorded as DOUBLE to automate some calculations later in the analysis. Is there any way to solve this problem? And how can I remove the $ (dollar) sign as well?
Thanks in advance.
Load into a VARCHAR column. Then update the column with REPLACE(col, '$', '') to get rid of the $.
Repeat to get rid of the commas, minus signs, parentheses, and any other garbage that is in the way.
Better yet, use a real programming language (not SQL) to cleanse the data. Many languages let you strip $, commas, and parentheses in a single pass.
I have data in the following json format:
{"metadata1":"val1","metadata2":"val2","data_rows":[{"var1":1,"var2":2,"var3":3},{"var1":4,"var2":5,"var3":6}]}
There are some metadata variables at the start, which only appear once, followed by multiple data records, all on the same line. How can I import this into a SAS dataset?
/*Create json file containing sample data*/
filename json "%sysfunc(pathname(work))\json.txt";
data _null_;
    file json;
    put '{"metadata1":"val1,","metadata2":"val2}","data_rows":[{"var1":1,"var2":2,"var3":3},{"var1":4,"var2":5,"var3":6}]}';
run;
/*Data step for importing the json file*/
data want;
    infile json dsd dlm='},' lrecl=1000000 n=1;
    retain metadata1 metadata2;
    if _n_ = 1 then input @'metadata1":' metadata1 :$8. @'metadata2":' metadata2 :$8. @;
    input @'var1":' var1 :8. @'var2":' var2 :8. @'var3":' var3 :8. @@;
run;
Notes:
- The point at which SAS starts reading each variable is set using @'string' pointer logic.
- Setting , and } as delimiters and using : format modifiers on the input statement tells SAS to keep reading characters from the specified start point until it has read the maximum requested number or reached a delimiter.
- Setting dsd on the infile statement removes the double quotes from character data values and prevents problems when character values contain delimiters.
- The double trailing @@ tells SAS to continue reading more records from the same line, using the same logic, until it reaches the end of the line.
- Metadata variables are handled as a special case using a separate input statement. They could easily be diverted to a single row in a separate file if desired.
- lrecl needs to be greater than or equal to the length of your file for this approach to work.
- Setting n=1 should help to reduce memory usage if your file is very large, by preventing SAS from attempting to buffer multiple input lines.
I've got (and will receive in the future) many CSV files that use the semicolon as delimiter and the comma as decimal separator.
So far I have not been able to find out how to import these files into SAS using proc import, or in any other automated fashion, without messing around with the variable names manually.
Create some sample data:
%let filename = %sysfunc(pathname(work))\sap.csv;
data _null_;
    file "&filename";
    put 'a;b';
    put '12345,11;67890,66';
run;
The import code:
proc import out=sap01
        datafile="&filename"
        dbms=dlm;
    delimiter=";";
    getnames=yes;
run;
After the import, a value of the variable AMOUNT such as 350,58 (which corresponds to 350.58 in the US format) looks like 35,058 (meaning thirty-five thousand...) in SAS (and after re-export to the German Excel it would look like 35.058,00).
A simple but dirty workaround would be the following:
data sap02;
    set sap01;
    AMOUNT = AMOUNT/100;
    format AMOUNT best15.2;
run;
I wonder if there is a simple way to define the decimal separator for the CSV import (similar to the specification of the delimiter), or any other cleaner solution compared to my workaround.
Many thanks in advance!
You technically should use dbms=dlm, not dbms=csv, though SAS does figure things out. CSV means "comma separated values", while DLM means "delimited", which is what we have here.
I don't think there's a direct way to make SAS read in with the comma via PROC IMPORT. You need to tell SAS to use the NUMXw.d informat when reading in the data, and I don't see a way to force that setting in PROC IMPORT. (There's an option for output with a comma, NLDECSEPARATOR, but I don't think that works here.)
Your best bet is either to write the data step code yourself, or to run PROC IMPORT, go to the log, and copy/paste the generated code into your program; then for each numeric variable add :NUMX10. (or whatever the appropriate maximum width of the field is). It will end up looking something like this:
data want;
    infile "whatever.txt" dlm=';' lrecl=32767 missover;
    input
        firstnumvar :NUMX10.
        secondnumvar :NUMX10.
        thirdnumvar :NUMX10.
        fourthnumvar :NUMX10.
        charvar :$15.
        charvar2 :$15.
    ;
run;
It will also generate lots of informat and format code; you can alternatively convert the generated informats to NUMX10. instead of BEST., rather than adding the informat to the input statement. You can also just remove the informats entirely, unless you have date fields.
data want;
    infile "whatever.txt" dlm=';' lrecl=32767 missover;
    informat firstnumvar secondnumvar thirdnumvar fourthnumvar NUMX10.;
    informat charvar $15.;
    format firstnumvar secondnumvar thirdnumvar fourthnumvar BEST12.;
    format charvar $15.;
    input
        firstnumvar
        secondnumvar
        thirdnumvar
        fourthnumvar
        charvar $
    ;
run;
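For the sample file created earlier in this thread, a minimal sketch of that approach (reusing &filename; the variable names a and b come from the header row) would be:
/* Sketch reusing the sample file from above: the NUMX informat reads
   comma-decimal values, so 12345,11 becomes the number 12345.11. */
data sap01;
    infile "&filename" dlm=';' firstobs=2 missover;
    input a :numx12. b :numx12.;
run;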
"Your best bet is either to write data step code yourself, or to run the PROC IMPORT, go to the log, and copy/paste the read in code into your program"
This has a drawback: if there is a change in the structure of the csv file, for example a changed column order, then one has to change the code in the SAS program.
So it is safer to transform the input, substituting the comma with a dot in the numeric fields, and to pass SAS the modified input.
The first idea was to use a Perl program for this, and then to use a filename with a pipe in SAS to read the modified input.
Unfortunately there is a SAS restriction in proc import: the IMPORT procedure does not support device types or access methods for the FILENAME statement, except for DISK.
So one has to create a work file on disk with the adjusted input.
I used the Text::CSV_PP module to read the csv file.
testdata.csv contains the csv data to read.
substitute_commasep.perl is the name of the Perl program.
perl code:
# use lib "/........"; # specify if Text::CSV_PP is locally installed; otherwise error: Can't locate Text/CSV_PP.pm in ...
use Text::CSV_PP;
use strict;

my $csv = Text::CSV_PP->new({ binary   => 1,
                              sep_char => ';',
                            }) or die "Error creating CSV object: " . Text::CSV_PP->error_diag();

open my $fhi, "<", "$ARGV[0]" or die "Error reading CSV file: $!";

while ( my $colref = $csv->getline($fhi) ) {
    foreach (@$colref) {                 # examine each column value
        s/,/\./ if /^\s*[\d,]*\s*$/;     # substitute , with . if the field contains only digits and commas
    }
    $csv->print(\*STDOUT, $colref);
    print "\n";
}
$csv->eof or $csv->error_diag();
close $fhi;
SAS code:
filename readcsv pipe "perl substitute_commasep.perl testdata.csv";
filename dummy "dummy.csv";

data _null_;
    infile readcsv;
    file dummy;
    input;
    put _infile_;
run;

proc import datafile=dummy
        out=data1
        dbms=dlm
        replace;
    delimiter=';';
    getnames=yes;
    guessingrows=32767;
run;