SAS Proc Import and wrong format - csv

I am trying to import a csv file through the proc import function. I am using the following syntax:
PROC IMPORT OUT= WORK.claims
DATAFILE= 'C:\Folder\File.csv'
DBMS=csv REPLACE;
GETNAMES=YES;
GUESSINGROWS=125;
RUN;
One of my variables is a character string of the following form: 15/04/2014AB280929D:01ABCDE. That is, it begins with a date, followed by 9 characters, a colon, and 7 more characters.
The problem is that SAS detects this variable as a date and puts a ddmmyy10. format on it. Then, when SAS tries to read the whole file, I get errors on every line telling me that I have invalid data for this variable.
How can I fix this?

Try using the File Import wizard from the menu instead. You'll get the same result initially, but you can then press F4, which recalls the last code submitted (in this case the data step code that the Import Wizard runs in the background).
You can then modify the informats and formats to suit your needs, then rerun the code.
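For instance, a minimal sketch of the edited data step (the variable name claim_key and the rest of the column handling are hypothetical; only the change for the misdetected column matters) would replace the guessed ddmmyy10. informat with a character informat wide enough for the whole 27-character string:
data work.claims;
infile 'C:\Folder\File.csv' dsd dlm=',' firstobs=2 truncover;
informat claim_key $27.; /* the generated code had informat claim_key ddmmyy10.; */
format claim_key $27.;
input claim_key $; /* plus the remaining variables from the generated INPUT statement */
run;
Rerunning this keeps the value as plain text instead of trying (and failing) to read it as a date.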

Can't display CSV file in pyspark(ValueError: Some of types cannot be determined by the first 100 rows, please try again with sampling)

I'm getting an error while displaying a CSV file through Pyspark. I've attached the PySpark code and CSV file that I used.
from pyspark.sql import *
spark.conf.set("fs.azure.account.key.xxocxxxxxxx","xxxxx")
time_on_site_tablepath= "wasbs://dwpocblob#dwadfpoc.blob.core.windows.net/time_on_site.csv"
time_on_site = spark.read.format("csv").options(header='true', inferSchema='true').load(time_on_site_tablepath)
display(time_on_site.head(50))
The error is shown below
ValueError: Some of types cannot be determined by the first 100 rows, please try again with sampling
CSV file format is attached below
time_on_site:pyspark.sql.dataframe.DataFrame
next_eventdate:timestamp
barcode:integer
eventdate:timestamp
sno:integer
eventaction:string
next_action:string
next_deviceid:integer
next_device:string
type_flag:string
site:string
location:string
flag_perimeter:integer
deviceid:integer
device:string
tran_text:string
flag:integer
timespent_sec:integer
gg:integer
CSV file data is attached below
next_eventdate,barcode,eventdate,sno,eventaction,next_action,next_deviceid,next_device,type_flag,site,location,flag_perimeter,deviceid,device,tran_text,flag,timespent_sec,gg
2018-03-16 05:23:34.000,1998296,2018-03-14 18:50:29.000,1,IN,OUT,2,AGATE-R02-AP-Vehicle_Exit,,NULL,NULL,1,1,AGATE-R01-AP-Vehicle_Entry,Access Granted,0,124385,0
2018-03-17 07:22:16.000,1998296,2018-03-16 18:41:09.000,3,IN,OUT,2,AGATE-R02-AP-Vehicle_Exit,,NULL,NULL,1,1,AGATE-R01-AP-Vehicle_Entry,Access Granted,0,45667,0
2018-03-19 07:23:55.000,1998296,2018-03-17 18:36:17.000,6,IN,OUT,2,AGATE-R02-AP-Vehicle_Exit,,NULL,NULL,1,1,AGATE-R01-AP-Vehicle_Entry,Access Granted,1,132458,1
2018-03-21 07:25:04.000,1998296,2018-03-19 18:23:26.000,8,IN,OUT,2,AGATE-R02-AP-Vehicle_Exit,,NULL,NULL,1,1,AGATE-R01-AP-Vehicle_Entry,Access Granted,0,133298,0
2018-03-24 07:33:38.000,1998296,2018-03-23 18:39:04.000,10,IN,OUT,2,AGATE-R02-AP-Vehicle_Exit,,NULL,NULL,1,1,AGATE-R01-AP-Vehicle_Entry,Access Granted,0,46474,0
What could be done to load the CSV file successfully?
There is no issue with your syntax; it works fine.
The issue is in the data of your CSV file: the column named type_flag has only None (null) values, so Spark cannot infer its datatype.
So here are two options.
You can display the data without using head(), like
display(time_on_site)
If you want to use head(), then you need to replace the null values; here I replaced them with the empty string ('').
time_on_site = time_on_site.fillna('')
display(time_on_site.head(50))
For some reason, probably a bug, you get the same error on a display(df.head()) call even if you provide a schema on the spark.read.schema(my_schema).csv('path') call. display(df) works though, which gave me a WTF moment.

Specify empty values by character string in PROC IMPORT

I'm coming to SAS from R in which this problem is fairly easy to solve.
I'm trying to load a bunch of CanSim CSV files (one example table here) with a %Macro function.
%Macro ReadCSV (infile , outfile );
PROC IMPORT
DATAFILE= &infile.
OUT= &outfile.
DBMS=CSV REPLACE;
GETNAMES=YES;
DATAROW=2;
RUN;
%Mend ReadCSV;
%ReadCSV("\\DATA\CanSimTables\02820135-eng.csv", work.cs02820135);
%ReadCSV("\\DATA\CanSimTables\02820158-eng.csv", work.cs02820158);
The problem is that the numeric Value column has ".." in all the csv's whenever the value is missing. This is creating an error when IMPORT gets to the rows with this character string.
Is there some way to tell IMPORT that any ".." should be removed or treated as missing values? (I found forums referring to the DSD option, but that doesn't seem to help me here.)
Thanks!
PROC IMPORT can only guess at the structure of your data. For example, it might see the .. and assume the column contains a character string instead of a number. It can also make other decisions that can make the generated dataset useless.
You will be better served by writing your own data step code to read the file. It is not very difficult to do. For your example linked file, all I did was copy and paste the first row of the CSV file, remove the commas, make the names valid variable names, and take some guesses as to how long to make the character variables.
data want ;
infile "&path/&fname" dsd truncover firstobs=2 ;
length Ref_Date $7 GEO $100 Geographical_classification $20
CHARACTERISTICS $100 STATISTICS DATATYPE $50 Vector Coordinate $20
Value 8
;
input (Ref_Date -- Value) (??) ;
run;
The ?? modifier will tell SAS not to report any errors when trying to convert the text in the VALUE column into a number, so the .. and other garbage in the file will generate missing values.
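To see the effect in isolation, here is a hypothetical one-liner using the INPUT function, which accepts the same ? / ?? modifiers:
data _null_;
x = input('..', 8.); /* returns missing, but writes an invalid-argument note to the log and sets _ERROR_ */
y = input('..', ?? 8.); /* returns missing and suppresses both the note and the error flag */
put x= y=;
run;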
Not explicitly relevant for this question, but - if your issue were "N" or "D" or similar that you wanted to become missing, there would be a somewhat easier solution: the missing statement (importantly distinct from the missing option).
missing N D;
That tells SAS to treat a single character N or D in the data as a missing value, and read it in accordingly. It would be read in as the .N (or .D) special missing value, which is functionally similar to . regular missing (but not actually equal in an equality comparison).
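As a minimal, hypothetical sketch (the file name and variable names are invented), reading a two-column csv where the numeric column sometimes contains a bare N or D might look like:
data want;
missing N D; /* a lone N or D in a numeric field becomes the special missing .N or .D */
infile 'values.csv' dsd dlm=',' firstobs=2 truncover;
input id $ value;
run;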

How to import a CSV file with delimiter as ";" and decimal separator as "," into SAS?

I've got (and will receive in the future) many CSV files that use the semicolon as delimiter and the comma as decimal separator.
So far I have not been able to find out how to import these files into SAS using proc import, or in any other automated fashion, without messing around with the variable names manually.
Create some sample data:
%let filename = %sysfunc(pathname(work))\sap.csv;
data _null_;
file "&filename";
put 'a;b';
put '12345,11;67890,66';
run;
The import code:
proc import out = sap01
datafile= "&filename"
dbms = dlm;
delimiter = ";";
GETNAMES = YES;
run;
After the import, a value of the variable "AMOUNT" such as 350,58 (which corresponds to 350.58 in the US format) would look like 35,058 (meaning thirty-five thousand...) in SAS (and after re-export to the German Excel it would look like 35.058,00).
A simple but dirty workaround would be the following:
data sap02; set sap01;
AMOUNT = AMOUNT/100;
format AMOUNT best15.2;
run;
I wonder if there is a simple way to define the decimal separator for the CSV import (similar to the specification of the delimiter), or any other "cleaner" solution compared to my workaround.
Many thanks in advance!
You technically should use dbms=dlm not dbms=csv, though it does figure things out. CSV means "Comma separated values", while DLM means "delimited", which is correct here.
I don't think there's a direct way to make SAS read in with the comma via PROC IMPORT. You need to tell SAS to use the NUMXw.d informat when reading in the data, and I don't see a way to force that setting in SAS. (There's an option for output with a comma, NLDECSEPARATOR, but I don't think that works here.)
Your best bet is either to write data step code yourself, or to run the PROC IMPORT, go to the log, and copy/paste the generated data step code into your program; then for each of the numeric fields add :NUMX10. or whatever the appropriate maximum width of the field is. It will end up looking something like this:
data want;
infile "whatever.txt" dlm=';' lrecl=32767 missover;
input
firstnumvar :NUMX10.
secondnumvar :NUMX10.
thirdnumvar :NUMX10.
fourthnumvar :NUMX10.
charvar :$15.
charvar2 :$15.
;
run;
It will also generate lots of informat and format code; alternatively, you can change the generated informats from BEST. to NUMX10. instead of adding the informat to the read-in. You can also just remove the informats, unless you have date fields.
data want;
infile "whatever.txt" dlm=';' lrecl=32767 missover;
informat firstnumvar secondnumvar thirdnumvar fourthnumvar NUMX10.;
informat charvar $15.;
format firstnumvar secondnumvar thirdnumvar fourthnumvar BEST12.;
format charvar $15.;
input
firstnumvar
secondnumvar
thirdnumvar
fourthnumvar
charvar $
;
run;
"Your best bet is either to write data step code yourself, or to run the PROC IMPORT, go to the log, and copy/paste the generated data step code into your program"
This has a drawback: if the structure of the csv file changes, for example a changed column order, then one has to change the code in the SAS program.
So it is safer to transform the input, substituting a dot for the comma in the numeric fields, and to pass SAS the modified input.
The first idea was to use a Perl program for this, and then read the modified input in SAS through a FILENAME statement with a pipe.
Unfortunately there is a SAS restriction in the proc import: The IMPORT procedure does not support device types or access methods for the FILENAME statement except for DISK.
So one has to create a workfile on disk with the adjusted input.
I used the Text::CSV_PP package to read the csv file.
testdata.csv contains the csv data to read.
substitute_commasep.perl is the name of the Perl program.
Perl code:
# use lib "/........"; # specifiy, if Text::CSV_PP is locally installed. Otherwise error message: Can't locate Text/CSV_PP.pm in ....;
use Text::CSV_PP;
use strict;
my $csv = Text::CSV_PP->new({ binary => 1
,sep_char => ';'
}) or die "Error creating CSV object: ".Text::CSV_PP->error_diag ();
open my $fhi, "<", "$ARGV[0]" or die "Error reading CSV file: $!";
while ( my $colref = $csv->getline( $fhi) ) {
foreach (@$colref) { # analyze each column value
s/,/\./ if /^\s*[\d,]*\s*$/; # substitute, if the field contains only numbers and ,
}
$csv->print(\*STDOUT, $colref);
print "\n";
}
$csv->eof or $csv->error_diag();
close $fhi;
SAS code:
filename readcsv pipe "perl substitute_commasep.perl testdata.csv";
filename dummy "dummy.csv";
data _null_;
infile readcsv;
file dummy;
input;
put _infile_;
run;
proc import datafile=dummy
out=data1
dbms=dlm
replace;
delimiter=';';
getnames=yes;
guessingrows=32767;
run;

Exported csv SAS table cannot be imported

I exported my SAS table as a csv file into a different folder, for use with a different program, with this code, which worked:
PROC EXPORT data=CA_ISO_policyBYpolicy_&thestate.
outfile="&whichfolder.CA_ISO_policyBYpolicy_&thestate..csv"
dbms=dlm replace;
delimiter=",";
run;
Using a different program in a different folder I am trying to import the data via this code:
LIBNAME Home "/sasdata/sasperm2/act_cfr/fr/SJR/AmFam_vs_ISO_Compare/" ;
%let Filepath = /sasdata/sasperm2/act_cfr/fr/SJR/AmFam_vs_ISO_Compare/;
%sdwlogin;
RUN;
%let thestate = OR;
%let policyyr = 2012;
/*---- ISO_Compare ----*/
data Work.CA_ISO_policyBYpolicy_&thestate.;
length Policy $10.;
infile "&Filepath/CA_ISO_policyBYpolicy_&thestate..csv" DELIMITER=',' TERMSTR=CRLF LRECL=2500 FIRSTOBS=2 MISSOVER DSD;
input Policy;
run;
The program runs but I am getting no data. I shortened the variable list to make the code easier to read. When I manually copy and paste the data into a different csv file and rename it to the same "CA_ISO_policyBYpolicy_OR.csv", it works in my program. My initial reason for incorporating this code was to get rid of the manual process, so if anybody has any hints I would be very thankful.
As Joe suggested in the comments, unless you need the csv files for another reason, it would be better to create a SAS data library for this.
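A minimal sketch of that approach (assuming &whichfolder. resolves to a directory both programs can reach) is to point a LIBNAME at the shared folder and copy the dataset there instead of exporting a csv:
libname share "&whichfolder."; /* same directory that currently receives the csv */
data share.CA_ISO_policyBYpolicy_&thestate.;
set CA_ISO_policyBYpolicy_&thestate.;
run;
The second program then assigns the same LIBNAME and reads share.CA_ISO_policyBYpolicy_&thestate. directly, with no import step and no delimiter or line-ending issues.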
Another way to do this would be to proc import it pretty much the same way you used proc export:
proc import datafile="&Filepath./CA_ISO_policyBYpolicy_&thestate..csv"
out=Work.CA_ISO_policyBYpolicy_&thestate. dbms=dlm replace;
delimiter=",";
getnames=yes; *this will create variable names from your first line;
*The opposite of what proc export did;
run;
The other thing I can think of is:
%let Filepath = /sasdata/sasperm2/act_cfr/fr/SJR/AmFam_vs_ISO_Compare/;
Might be causing problems because of the forward slash. Try it like:
%let Filepath = %str(/sasdata/sasperm2/act_cfr/fr/SJR/AmFam_vs_ISO_Compare/);
Also, does the sasdata directory actually exist right on the root directory or is it a subdirectory of the current directory where your sas program is located? If it's the current directory you need to lose the initial forward slash (or put a . in front of it):
%let Filepath = %str(./sasdata/sasperm2/act_cfr/fr/SJR/AmFam_vs_ISO_Compare/);

CSV to SAS dataset: no line-final comma causes problems

I'm trying to import a .CSV file into a SAS dataset, and am having some trouble. Here's a line of sample input:
Foo,5,10,3.5
Bar,2,3,1.0
The problem I'm having is that the line-final "3.5" and "1.0" are not being correctly interpreted as variable values (instead SAS complains that they are invalid values, giving me a NOTE: Invalid data for VARIABLE error). However, when I add a comma to the end of the line, like so:
Foo,5,10,3.5,
Bar,2,3,1.0,
Then everything works fine. Is there a way that I can make this import work without modifying the source file?
Currently, my DATA step's INFILE statement has the DSD, DLM=',', and MISSOVER options.
With this data in a .csv file in a windows environment
Foo,5,10,1.5
Bar,2,3,2.1
Foo,5,10,3.5
Bar,2,3,4.1
This code works (running SAS locally on a windows machine)
filename f 'D:\Data\SAS\input.csv';
data input;
infile f delimiter=',';
input char1 $ num1 num2 num3;
Run;
As @itzy mentioned, the environment is important; more info will help with the solution.
When you are working with data from a different environment, you can use the TERMSTR option on the INFILE statement to tell SAS how the lines of data are terminated.
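For example, a sketch only (reusing the variable layout from the answer above and assuming the file was created on Windows, i.e. with CRLF line endings):
data input;
infile 'input.csv' dsd dlm=',' missover termstr=crlf; /* use termstr=lf for Unix-style files */
input char1 $ num1 num2 num3;
run;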
This most likely has to do with the different codes for line endings in Unix and Windows. I'm guessing your data comes from a different operating system than the one you're running SAS on.
The solution is to change the newline codes to the correct operating system. If you're running SAS on a unix system, try the dos2unix command. If you're running Windows, you can edit the CSV file with a text editor like UltraEdit or Notepad++ and save the file in Windows format.