I have worked with coworkers on this, googled around, edited this code a million times and I cannot get it to work.
Essentially, I am trying to stack multiple CSV files into one SAS dataset. Earlier in my SAS program I built a dataset holding the names of all of the files (variable fname inside dirlist1). I've been trying to get this code to work, but the problem is that some of the observations within these CSV files are blank. So, for example, the column "apples" (see below) will be blank for a majority of observations but will occasionally have data. Right now this code reads in the right data, but when a field is blank (e.g. "apples" is blank for an observation), it shifts my data to the left instead of leaving that part blank. Is there something missing in this code that can solve that?
Basically, given a line like text,text,,text,text, it's skipping the blank between the consecutive commas and continuing on, and I WANT that blank read as a missing value.
data all_data (drop=fname);
   length bananas $256;
   length apples $25;
   length grapefruit $10;
   length berries $10;
   set dirlist1;
   filepath = "&dirname"||fname;
   /* read each CSV file named in dirlist1, skipping its header row */
   infile dummy filevar=filepath length=reclen firstobs=2 dlm=',' end=done missover;
   do while(not done);
      myfilename = fname;
      input bananas apples grapefruit berries;
      output;
   end;
run;
Edit:
To note, I based this code on code published on a UCLA site.
Add the DSD modifier to your infile statement.
infile dummy filevar = filepath length=reclen firstobs=2 dlm=',' end=done missover DSD;
That tells SAS to treat consecutive delimiters as delimiting a missing value rather than as a single delimiter (and it also lets SAS correctly handle quoted fields with embedded delimiters).
See the documentation on INFILE for more information.
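For reference, here is the step from the question with only that change applied (a sketch; the lengths, dirlist1, and the &dirname macro variable are carried over from the question as-is):

data all_data (drop=fname);
   length bananas $256 apples $25 grapefruit $10 berries $10;
   set dirlist1;
   filepath = "&dirname"||fname;
   /* DSD: consecutive delimiters now yield a missing value */
   infile dummy filevar=filepath length=reclen firstobs=2 dlm=',' end=done missover dsd;
   do while(not done);
      myfilename = fname;
      input bananas apples grapefruit berries;
      output;
   end;
run;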
I imported a CSV in SAS, however the format was incorrect in the original file. I am working with addresses, so, for example, the city will be incorrectly concatenated to the street variable, or the zip code will end up in the city variable. How do I set these attributes after importing? When I tried to use a LENGTH statement, it gave me a message saying that the length was already set and that I should work with the DATA step. I do not know where exactly to do this.
Well, you can manually define what is read into SAS and how. Here is an example of the DATA step code that PROC IMPORT generates.
Just change the delimiter (';' in this case). Also, depending on whether your data has a header row, set firstobs properly. Other than that, just list the variables and their attributes.
data WORK.Imported;
   %let _EFIERR_ = 0; /* set the ERROR detection macro variable */
   infile 'c:\input\datafile.csv' delimiter=';' MISSOVER DSD lrecl=13106 firstobs=2;
   informat first_var $15. ;
   informat second_var $24. ;
   informat third_var best32. ;
   /*... add as many as your data has */
   format first_var $15. ;
   format second_var $24. ;
   format third_var best12. ;
   /*... add as many as your data has */
   input
      first_var $
      second_var $
      third_var
      /*... add as many as your data has */
   ;
   if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection macro variable */
run;
A lazy approach is to just use PROC IMPORT with the guessingrows=max option:
proc import datafile="c:\input\input.csv" out=imported replace;
   delimiter=";";
   getnames=yes;
   guessingrows=MAX;
run;
Be aware that with very large files this is going to take a long time. It is usually better to set the rows to something 'sufficiently large', like 32000.
For more on importing, see the PROC IMPORT documentation, or the more general documentation on importing and exporting.
This happened to me as well; the problem is that there are line breaks inside the input CSV file. If you replace all the line breaks with spaces, save the file, and then import it in SAS, it will import the data correctly.
An easy way to do this is:
Press Ctrl+H to open the Find & Replace dialog box.
In the Find What field, press Ctrl+J. The field will look empty, but you will see a tiny dot.
In the Replace With field, enter the value to replace line breaks with. Usually it is a space, to avoid two words joining accidentally. If all you need is to delete the line breaks, leave the "Replace With" field empty.
See this page for other ways.
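Alternatively (this is not from the original answer): if the file has Windows CRLF line endings and the stray breaks inside fields are bare LFs, the INFILE option TERMSTR=CRLF tells SAS to treat only CR+LF as end-of-record, so the embedded LFs no longer split rows. A minimal sketch, with made-up file and variable names:

data want;
   /* only CR+LF ends a record; bare LFs inside fields are kept as data */
   infile 'c:\input\datafile.csv' dsd dlm=',' truncover termstr=crlf firstobs=2;
   length street $50 city $30;
   input street $ city $;
run;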
I am dumping some csv data into mysql with this query
LOAD DATA LOCAL INFILE 'path/LYRIC.csv' INTO TABLE LYRIC CHARACTER SET euckr FIELDS TERMINATED BY '|';
When I run this, I see the following log output in the console.
...
[2017-09-13 11:24:10] ' for column 'SONG_ID' at row 3
[2017-09-13 11:24:10] [01000][1261] Row 3 doesn't contain data for all columns
[2017-09-13 11:24:10] [01000][1261] Row 3 doesn't contain data for all columns
...
I think the CSV has line feeds inside column data, which breaks the whole parsing process.
A single record in the CSV looks like this:
000001|2014-11-17 18:10:00|2014-11-17 18:10:00|If I were your September
I''d whisper a sunset to fly through
And if I were your September
|0|dba|asass|2014-11-17 18:10:00||||2014-11-17 18:10:00
So LOAD DATA loads line 1 as a record, then tries line 2, and so on, even though these lines are a single record.
How can I fix it? Should I request a different type of file from the client?
P.S. I am quite new to working with CSV files.
Multiline fields in CSV should be surrounded with double quotes, like this:
000001|2014-11-17 18:10:00|2014-11-17 18:10:00|"If I were your September
I''d whisper a sunset to fly through
And if I were your September
"|0|dba|asass|2014-11-17 18:10:00||||2014-11-17 18:10:00
And any double quote inside that field should be escaped with another double quote.
Of course, the parser has to support multiline fields (and may need to be instructed to use them).
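In MySQL's case, LOAD DATA does support this when told about the enclosure. A sketch of the adjusted query from the question, assuming the client delivers the file quoted as above:

LOAD DATA LOCAL INFILE 'path/LYRIC.csv'
INTO TABLE LYRIC
CHARACTER SET euckr
FIELDS TERMINATED BY '|' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n';

With ENCLOSED BY specified, line terminators that occur inside a quoted field are read as data rather than as the end of the record.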
I am pretty new to SAS programming and am trying to find the most efficient approach for a current initiative. Basically, I need to modify an existing .csv file stored on the SAS server and save it to my folder on the same server.
Modification required:
keep .csv as format
use "|" instead of "," as delimiter
have the following output name: filename_YYYYMMDDhhmmss.csv
keep only 4 variables from the original file
rename some of the variables we keep
Here is the script I am currently using, but there are a few issues with it:
PROC IMPORT OUT = libname.original_file (drop=var0)
FILE = "/.../file_on_server.csv"
DBMS = CSV
REPLACE;
RUN;
%PUT date_human = %SYSFUNC(PUTN(%sysevalf(%SYSFUNC(TODAY())-1), datetime20.));
proc export data = libname.original_file ( rename= ( var1=VAR11 var2=VAR22 Type=VAR33 ))
outfile = '/.../filename_&date_human..csv' label dbms=csv replace;
delimiter='|';
run;
I also have an issue with the variable called "Type" when renaming it; it looks like there is a conflict with a reserved keyword. The date format is not right either, and I was not able to find the exact format on the SAS forums, unfortunately.
Any advice on how to make this script more efficient is greatly appreciated.
I wouldn't bother with trying to actually read the data into a SAS dataset. Just process it and write it back out. If the input structure is consistent, then it is pretty simple: just read everything as character strings and output the columns that you want to keep.
Let's assume that the data has 12 columns and the last of the four that you want to keep is the 10th column. So you only need to read in 10 of them.
First setup your input and output filenames in macro variables to make it easier to edit. You can use your logic for generating the filename for the new file.
%let infile=/.../file_on_server.csv;
%let outfile=/.../filename_&date_human..csv;
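As an aside (this part is my sketch, not code from the question): the %PUT in the question only prints text, it never creates the date_human macro variable, and the single quotes around the PROC EXPORT outfile would stop it resolving anyway. One way to build the YYYYMMDDhhmmss stamp is:

/* B8601DT writes a datetime as yyyymmddThhmmss; COMPRESS drops the T */
%let date_human = %sysfunc(compress(%sysfunc(putn(%sysfunc(datetime()), B8601DT15.)), T));
%put &=date_human;  /* e.g. date_human=20170913112410 */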
Then use a simple DATA _NULL_ step to read the data as character strings and write it back out. You can even change the relative order of the four columns if you want. So this program will copy the 2nd, 5th, 4th and 10th columns and change the column headers to NewName1, NewName2, NewName3 and NewName4.
data _null_;
   infile "&infile" dsd dlm=',' truncover;
   file "&outfile" dsd dlm='|';
   length var1-var10 $200 ;
   input var1-var10;
   /* overwrite the header values on the first record */
   if _n_=1 then do;
      var2='NewName1';
      var5='NewName2';
      var4='NewName3';
      var10='NewName4';
   end;
   put var2 var5 var4 var10 ;
run;
If some of the data for the four columns you want to keep are longer than 200 characters then just update the LENGTH statement.
So let's try a little experiment. First let's make a dummy CSV file.
filename example temp;
data _null_;
   file example;
   input;
   put _infile_;
cards4;
a,b,c,d,e,f,g,h,i,j,k,l,m
1,2,3,4,5,6,7,8,9,10,11,12,13
o,p,q,r,s,t,u,v,w,x,y,z
;;;;
Now let's try running it. I will modify the INFILE and FILE statements to read from my temp file and write the result to the log.
infile example /* "&infile" */ dsd dlm=',' truncover;
file log /* "&outfile" */ dsd dlm='|';
Here are the resulting rows written.
NewName1|NewName2|NewName3|NewName4
2|5|4|10
p|s|r|x
I have IMDB data in csv format. Here is a snapshot.
[root@jamatney IMDB]# head IMDBMovie.txt
id,name,year,rank
0,#28 (2002),2002,
1,#7 Train: An Immigrant Journey, The (2000),2000,
2,$ (1971),1971,6.4000000000000004
3,$1000 Reward (1913),1913,
4,$1000 Reward (1915),1915,
5,$1000 Reward (1923),1923,
6,$1,000,000 Duck (1971),1971,5
7,$1,000,000 Reward, The (1920),1920,
8,$10,000 Under a Pillow (1921),1921,
I'd like to import this data into a MySQL database. However, there are commas present in the name cells. This prevents me from loading the data into the database correctly, as my loading query is:
mysql> LOAD DATA LOCAL INFILE 'IMDB/IMDBMovie.txt' INTO TABLE Movie FIELDS TERMINATED BY ',' LINES TERMINATED BY '\r\n' IGNORE 1 LINES;
I've thought about using some combination of rev and cut to isolate the offending column, then find/replace the commas, but I can't seem to get it to work. I was wondering if this is the right approach, or if there's a better way.
It looks like the first field and last two fields are unambiguous, so all you have to do is write a script to pull those out and surround what remains in quotes. My bash-fu isn't quite good enough to do it with rev and cut, but I was able to write a Python script that works. You can then add an OPTIONALLY ENCLOSED BY clause to your LOAD DATA query (shown after the output below).
f = open("IMDBMovie.txt")
print(next(f).rstrip())  # header
for line in f:
    fields = line.strip().split(",")
    # Get unambiguous fields.
    id = fields.pop(0)
    rank = fields.pop(-1)
    year = fields.pop(-1)
    # Surround name with quotes.
    name = '"{}"'.format(",".join(fields))
    print("{},{},{},{}".format(id, name, year, rank))
On your test data, the output was
id,name,year,rank
0,"#28 (2002)",2002,
1,"#7 Train: An Immigrant Journey, The (2000)",2000,
2,"$ (1971)",1971,6.4000000000000004
3,"$1000 Reward (1913)",1913,
4,"$1000 Reward (1915)",1915,
5,"$1000 Reward (1923)",1923,
6,"$1,000,000 Duck (1971)",1971,5
7,"$1,000,000 Reward, The (1920)",1920,
8,"$10,000 Under a Pillow (1921)",1921,
This is too long for a comment.
Good luck. Your input file is in a lousy format. It is not really CSV. Here are two options:
(1) Open the file in Excel (or your favorite spreadsheet) and save it out with tab delimiters instead. Keep your fingers crossed that none of the fields have a tab. Or use another delimiter such as a pipe character.
(2) Load each row into a table with only one column, a big character string column. Then, parse the rows into their constituent fields (substring_index() can be very useful).
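For option (2), here is a sketch of the parsing step. It assumes the raw lines were loaded into a one-column staging table (the movie_raw table and its line column are made-up names): SUBSTRING_INDEX pulls out the unambiguous outer fields, and everything between the first and the second-to-last comma is the name.

-- one raw line per row
CREATE TABLE movie_raw (line TEXT);

INSERT INTO Movie (id, name, year, `rank`)
SELECT
  SUBSTRING_INDEX(line, ',', 1),                             -- first field: id
  SUBSTRING(line,
            LENGTH(SUBSTRING_INDEX(line, ',', 1)) + 2,       -- start just past the first comma
            LENGTH(line)
              - LENGTH(SUBSTRING_INDEX(line, ',', 1))
              - LENGTH(SUBSTRING_INDEX(line, ',', -2)) - 2), -- everything in between: name
  SUBSTRING_INDEX(SUBSTRING_INDEX(line, ',', -2), ',', 1),   -- second-to-last field: year
  NULLIF(SUBSTRING_INDEX(line, ',', -1), '')                 -- last field: rank ('' becomes NULL)
FROM movie_raw;

(`rank` is quoted because it became a reserved word in MySQL 8.)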
I am using SAS 9.3 and have a table which I need to update. From my understanding, you can do something like the following:
data new_table;
   update old_table update_table;
   by some_key;
run;
My issue (well, I have a few...) is that I'm importing the "update_table" from a CSV file and its formats don't match the "old_table", so the update fails.
I've tried creating the "update_table" from the "old_table" using PROC SQL's CREATE TABLE with zero observations, which created the correct types/formats, but then I was unable to insert data into it without replacing it.
The other major issue I have is that there are a large number of columns (480) with custom formats, and I've run up against a 6000-character limit for the script.
I'm very new to SAS and any help would be greatly appreciated :)
It sounds like you need to use a DATA step to read in your CSV. There are lots of papers out there explaining how to do this, so I won't cover it here. This will allow you to specify the format (numeric/character) for each field. The nice thing is that you already know what formats the fields need to be in (from your existing dataset), so you can create this read-in fairly easily.
Let's say your data is so:
data have;
   informat x date9.;
   input x y z $;
datalines;
10JAN2010 1 Base
11JAN2010 4 City
12JAN2010 8 State
;;;;
run;
Now, if you have a CSV of the same format, you can read it in by generating the input code from the above dataset. You can use PROC CONTENTS to do this, or you can generate it from dictionary.columns, which has the same information as PROC CONTENTS.
proc sql;
   select catx(' ', name, ifc(type='char', '$', ' '))
      into :inputlist separated by ' '
      from dictionary.columns
      where libname='WORK' and memname='HAVE';
   select catx(' ', name, informat)
      into :informatlist separated by ' '
      from dictionary.columns
      where libname='WORK' and memname='HAVE'
        and not missing(informat);
quit;
The above are two examples; they may or may not be sufficient for your particular needs.
Then you use them like so:
data want;
   infile datalines dlm=',';
   informat &informatlist.;
   input &inputlist.;
datalines;
13JAN2010,9,REGION
;;;;
run;
(Obviously you would use your CSV file instead of datalines; it is just used here as an example.)
The point is you can write the data step code using the metadata from your original dataset.
I needed this today, so I made a macro out of it: https://core.sasjs.io/mp__csv2ds_8sas.html
It doesn't wrap the INPUT statement, so it may break with a very large number of columns if you have batch line-length limits. If anyone would like me to fix that, just raise an issue: https://github.com/sasjs/core/issues/new