Updating a local SAS table from a CSV file

I have a table in SAS which I need to update. From my understanding, you can do something like the following:
data new_table;
update old_table update_table;
by some_key;
run;
My issue (well I have a few...) is that I'm importing the "update_table" from a CSV file and the formats don't match those of the "old_table", so the update fails.
I've tried creating the "update_table" from the "old_table" using proc sql create table with zero observations, which created the correct types/formats, but then I was unable to insert data into it without replacing it.
The other major issue I have is that there are a large number of columns (480), and custom formats, and I've run up against a 6000 character limit for the script.
I'm very new to SAS and any help would be greatly appreciated :)

It sounds like you need to use a data step to read in your CSV. There are lots of papers out there explaining how to do this, so I won't cover it here. This will allow you to specify the format (numeric/character) for each field. The nice thing here is you already know what formats they need to be in (from your existing dataset), so you can create this read in fairly easily.
Let's say your data looks like this:
data have;
informat x date9.;
input x y z $;
datalines;
10JAN2010 1 Base
11JAN2010 4 City
12JAN2010 8 State
;;;;
run;
Now, if you have a CSV of the same format, you can read it in by generating the input code from the above dataset. You can use PROC CONTENTS to do this, or you can generate it by using dictionary.columns, which has the same information as PROC CONTENTS.
proc sql;
select catx(' ',name,ifc(type='char','$',' ')) into :inputlist
separated by ' '
from dictionary.columns
where libname='WORK' and memname='HAVE';
select catx(' ',name,informat) into :informatlist separated by ' '
from dictionary.columns
where libname='WORK' and memname='HAVE'
and not missing(informat);
quit;
The above are two examples; they may or may not be sufficient for your particular needs.
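As one possible extension (my own addition, not strictly needed for the question), a similar query could also generate a LENGTH statement so that character lengths match the source table; the &lengthlist macro variable below is hypothetical:
proc sql;
select catx(' ',name,ifc(type='char',cats('$',length),cats(length)))
into :lengthlist separated by ' '
from dictionary.columns
where libname='WORK' and memname='HAVE';
quit;
You could then add length &lengthlist.; at the top of the data step that reads the CSV, before the INFORMAT statement.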
Then you use them like so:
data want;
infile datalines dlm=',';
informat &informatlist.;
input &inputlist.;
datalines;
13JAN2010,9,REGION
;;;;
run;
(Obviously you would point the INFILE at your CSV file instead of using datalines; it is just used here as an example.)
The point is you can write the data step code using the metadata from your original dataset.
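Putting the pieces together for the original question, a rough sketch might look like this (assumptions on my part: the two macro variables are generated from the metadata of OLD_TABLE rather than the toy HAVE dataset, the CSV sits at a hypothetical path with a header row, and OLD_TABLE is already sorted by SOME_KEY):
data update_table;
infile "/.../update.csv" dsd dlm=',' firstobs=2 truncover; /* firstobs=2 skips the header row */
informat &informatlist.;
input &inputlist.;
run;
proc sort data=update_table; by some_key; run;
data new_table;
update old_table update_table;
by some_key;
run;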

I needed this today, so I made a macro out of it: https://core.sasjs.io/mp__csv2ds_8sas.html
It doesn't wrap the input statement so it may break with a large number of columns if you have batch line length limits. If anyone would like me to fix that, just raise an issue: https://github.com/sasjs/core/issues/new


SAS - Change the format of a column in a proc imported CSV

I cannot quite figure out how to change the format of a column in my data file. I have imported the data set with PROC IMPORT, and it guessed the format of a specific column as numeric; I would like it to be character-based.
This is where I'm currently at, and it does not change the format of my NUMBER column:
proc import
datafile = 'datapath'
out = dataname
dbms = CSV
replace
;
format NUMBER $8.
;
guessingrows = 20000
;
run;
You could import the data and then convert the column afterwards - I believe the following would work.
proc sql;
create table want as
select *,
put(Number, 4.) as CharacterVersion
from data;
quit;
You cannot change the type/format via PROC IMPORT. However, you can write a data step to read in the file and then customize everything. If you're not sure how to start with that, check the log after you run a PROC IMPORT and it will have the 'skeleton' code. You can copy that code, edit it, and run to get what you need. Writing from scratch also works using an INFILE and INPUT statement.
From the help file (search for "Processing Delimited Files in SAS")
If you need to revise your code after the procedure runs, issue the RECALL command (or press F4) to the generated DATA step. At this point, you can add or remove options from the INFILE statement and customize the INFORMAT, FORMAT, and INPUT statements to your data.
Granted, the grammar in this section is horrific! The idea is that the IMPORT Procedure generates source code that can be recalled and modified for subsequent submission.
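For what it is worth, a minimal hand-written version for this case might look like the sketch below (assumptions on my part: the same 'datapath' file reference, a header row to skip, and other_var standing in as a placeholder for the remaining columns):
data dataname;
infile 'datapath' dsd dlm=',' firstobs=2 truncover;
length NUMBER $8 other_var 8; /* declare NUMBER as character up front */
input NUMBER $ other_var;
run;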

CSV File Processing - SAS

I am pretty new to SAS programming and am trying to find the most efficient approach for my current initiative. Basically, I need to modify an existing .csv file stored on the SAS server and save it in my folder on the same server.
Modification required:
keep .csv as format
use "|" instead of "," as delimiter
have the following output name: filename_YYYYMMDDhhmmss.csv
keep only 4 variables from the original file
rename some of the variables we keep
Here is the script I am currently using, but there are a few issues with it:
PROC IMPORT OUT = libname.original_file (drop=var0)
FILE = "/.../file_on_server.csv"
DBMS = CSV
REPLACE;
RUN;
%PUT date_human = %SYSFUNC(PUTN(%sysevalf(%SYSFUNC(TODAY())-1), datetime20.));
proc export data = libname.original_file ( rename= ( var1=VAR11 var2=VAR22 Type=VAR33 ))
outfile = '/.../filename_&date_human..csv' label dbms=csv replace;
delimiter='|';
run;
I also have an issue with the variable called "Type" when renaming it, as it looks like there is a conflict with some of the system keywords. The date format is not right either, and I was not able to find the exact format on the SAS forums, unfortunately.
Any advice on how to make this script more efficient is greatly appreciated.
I wouldn't bother with trying to actually read the data into a SAS dataset. Just process it and write it back out. If the input structure is consistent then it is pretty simple. Just read everything as character strings and output the columns that you want to keep.
Let's assume that the data has 12 columns and the last of the four that you want to keep is the 10th column. So you only need to read in 10 of them.
First, set up your input and output filenames in macro variables to make them easier to edit. You can use your own logic for generating the filename for the new file.
%let infile=/.../file_on_server.csv;
%let outfile=/.../filename_&date_human..csv;
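The question also asks for a filename_YYYYMMDDhhmmss.csv name; one way to build that timestamp (my assumption, not part of the original answer) is to format the current date and time and strip the separators:
%let stamp=%sysfunc(date(),yymmddn8.)%sysfunc(compress(%sysfunc(time(),tod8.),%str(:))); /* e.g. 20130617093005 */
%let outfile=/.../filename_&stamp..csv;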
Then use a simple DATA _NULL_ step to read the data as character strings and write it back out. You can even change the relative order of the four columns if you want. So this program will copy the 2nd, 5th, 4th and 10th columns and change the column headers to NewName1, NewName2, NewName3 and NewName4.
data _null_;
infile "&infile" dsd dlm=',' truncover;
file "&outfile" dsd dlm='|';
length var1-var10 $200 ;
input var1-var10;
if _n_=1 then do;
var2='NewName1';
var5='NewName2';
var4='NewName3';
var10='NewName4';
end;
put var2 var5 var4 var10 ;
run;
If some of the data for the four columns you want to keep are longer than 200 characters then just update the LENGTH statement.
So let's try a little experiment. First let's make a dummy CSV file.
filename example temp;
data _null_;
file example ;
input;
put _infile_;
cards4;
a,b,c,d,e,f,g,h,i,j,k,l,m
1,2,3,4,5,6,7,8,9,10,11,12,13
o,p,q,r,s,t,u,v,w,x,y,z
;;;;
Now let's try running it. I will modify the INFILE and FILE statements to read from my temp file and write the result to the log.
infile example /* "&infile" */ dsd dlm=',' truncover;
file log /* "&outfile" */ dsd dlm='|';
Here are the resulting rows written.
NewName1|NewName2|NewName3|NewName4
2|5|4|10
p|s|r|x

Unable to import 3.4GB csv into Redshift because values contain free text with commas

We found a 3.6GB csv that we have uploaded onto S3 and now want to import into Redshift, then do the querying and analysis from IPython.
Problem 1:
This comma-delimited file contains free-text values that themselves contain commas, which interferes with the delimiting, so we can't load it into Redshift.
When we tried opening the sample dataset in Excel, Excel surprisingly puts them into columns correctly.
Problem 2:
A column that is supposed to contain integers has some records containing letters to indicate some other scenario.
So the only way to get the import through is to declare this column as varchar. But then we can't do calculations on it later on.
Problem 3:
The datetime data type requires the date time value to be in the format YYYY-MM-DD HH:MM:SS, but the csv doesn’t contain the SS and the database is rejecting the import.
We can’t manipulate the data on a local machine because it is too big, and we can’t upload onto the cloud for computing because it is not in the correct format.
The last resort would be to scale the instance running iPython all the way up so that we can read the big csv directly from S3, but this approach doesn’t make sense as a long-term solution.
Your suggestions?
Train: https://s3-ap-southeast-1.amazonaws.com/bucketbigdataclass/stack_overflow_train.csv (3.4GB)
Train Sample: https://s3-ap-southeast-1.amazonaws.com/bucketbigdataclass/stack_overflow_train-sample.csv (133MB)
Try using a different delimiter or use escape characters.
http://docs.aws.amazon.com/redshift/latest/dg/r_COPY_preparing_data.html
For the second issue, if you want to extract only the numbers from the column after loading it as char, use regexp_replace or other functions.
For the third issue, you can likewise load it into a VARCHAR field and then use an expression such as cast(left(column_name, 10)||' '||right(column_name, 6)||':00' as timestamp) to load it into the final table from a staging table.
For the first issue, you need to find a way to differentiate between the two types of commas - the delimiters and the commas inside the text. Once you have done that, replace the delimiters with a different delimiter and use that as the delimiter in the COPY command for Redshift.
For the second issue, you first need to figure out whether this column needs to be available for numerical aggregations once loaded. If yes, you need to get this data cleaned up before loading. If no, you can load it directly as a char/varchar field. All your queries will still work, but you will not be able to do any aggregations (sum/avg and the like) on this field.
For problem 3, you can use the TEXT(date, "yyyy-mm-dd hh:mm:ss") function in Excel to do a mass replace for this field.
Let me know if this works out.

How to process multivariate time series given as multiline, multirow *.csv files with Apache Pig?

I need to process multivariate time series given as multiline, multirow *.csv files with Apache Pig. I am trying to use a custom UDF (EvalFunc) to solve my problem. However, all the loaders I have tried (except org.apache.pig.impl.io.ReadToEndLoader, which I cannot get to work) return one line of the file as one record when loading my csv files and passing them to the UDF. What I need, however, is one column (or the content of the complete file), so that I can process a complete time series. Processing one value is obviously useless because I need longer sequences of values...
The data in the csv-files looks like this (30 columns, 1st is a datetime, all others are double values, here 3 sample lines):
17.06.2013 00:00:00;427;-13.793273;2.885583;-0.074701;209.790688;233.118828;1.411723;329.099170;331.554919;0.077026;0.485670;0.691253;2.847106;297.912382;50.000000;0.000000;0.012599;1.161726;0.023110;0.952259;0.024673;2.304819;0.027350;0.671688;0.025068;0.091313;0.026113;0.271128;0.032320;0
17.06.2013 00:00:01;430;-13.879651;3.137179;-0.067678;209.796500;233.141233;1.411920;329.176863;330.910693;0.071084;0.365037;0.564816;2.837506;293.418550;50.000000;0.000000;0.014108;1.159334;0.020250;0.954318;0.022934;2.294808;0.028274;0.668540;0.020850;0.093157;0.027120;0.265855;0.033370;0
17.06.2013 00:00:02;451;-15.080651;3.397742;-0.078467;209.781511;233.117081;1.410744;328.868437;330.494671;0.076037;0.358719;0.544694;2.841955;288.345883;50.000000;0.000000;0.017203;1.158976;0.022345;0.959076;0.018688;2.298611;0.027253;0.665095;0.025332;0.099996;0.023892;0.271983;0.024882;0
Has anyone an idea how I could process this as 29 time series?
Thanks in advance!
What do you want to achieve?
If you want to read all rows in all files as a single record, this can work:
a = LOAD '...' USING PigStorage(';') as <schema> ;
b = GROUP a ALL;
b will contain all the rows in a bag.
If you want to read each CSV file as a single record, this can work:
a = LOAD '...' USING PigStorage(';','-tagsource') as <schema> ;
b = GROUP a BY $0; --$0 is the filename
b will contain all the rows per file in a bag.

Export SAS dataset to Access with formatted values

I'm generating a table in SAS and exporting it to a Microsoft Access database (mdb). Right now, I do this by connecting to the database as a library:
libname plus "C:\...\plus.mdb";
data plus.groupings;
set groupings;
run;
However, SAS doesn't export the variable formats to the database, so I end up with numeric values in a table that I want to be human-readable. I've tried proc sql with the same effect. What can I do to get the formatted data into Access?
What I've tried so far:
Plain libname to mdb, data step (as above)
Plain libname to mdb, proc sql create table
OLE DB libname (as in Rob's reference), data step
OLE DB libname, proc sql create table
There's a bunch of alternative connection types listed here; maybe one of those will work:
http://support.sas.com/techsup/technote/ts793.pdf
What's working right now:
Because SAS does preserve formatted values in csv, I'm exporting the table to a csv file that feeds a linked table in the Access database. This seems less than ideal, but it works. It's weird that SAS clearly has the capacity to export formatted values, but doesn't seem to document it.
proc export
data= groupings
outfile= "C:\...\groupings.csv"
dbms= CSV
replace;
putnames= yes;
run;
A seeming disadvantage of this approach is that I have to manually recreate the table if I add a new field. If I could drop/create in proc sql, that wouldn't be an issue.
From SAS support: if you create a SQL view using put() to define the variables, you can then export the view to the database and maintain the formatted values. Example:
libname plus "C:\...\plus.mdb";
proc sql;
create view groupings_view as (
SELECT put(gender, gender.) AS gender,
put(race, race.) AS race,
... etc.
FROM groupings
);
create table plus.groupings as (
SELECT *
FROM groupings_view
);
quit;
I wasn't able to just create the view directly in Access - it's not entirely clear to me that Jet supports views in a way that SAS understands, so maybe that's the issue. At any rate, the above does the trick for the limited number of variables I need to export. I can imagine automating the writing of such queries with a funky macro working on the output of proc contents, but I'm terribly grateful I don't have to do that...
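If that automation ever becomes necessary, a sketch using dictionary.columns (the same metadata PROC CONTENTS reports) could generate the PUT() list - this is my own assumption, and note it converts every formatted variable to character, which you may want to restrict:
proc sql noprint;
select case when not missing(format)
then cats('put(',name,',',format,') as ',name)
else name end
into :selectlist separated by ', '
from dictionary.columns
where libname='WORK' and memname='GROUPINGS';
create table plus.groupings as
select &selectlist.
from groupings;
quit;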