Appending CSV files in SAS

I have a bunch of csv files. Each has data from a different period:
filename file1 'JAN2011_PRICE.csv';
filename file2 'FEB2011_PRICE.csv';
...
Do I need to manually create intermediate datasets and then append them all together? Is there a better way of doing this?
SOLUTION
From the documentation, the preferred approach is:
data allcsv;
  length fileloc myinfile $ 300;
  input fileloc $ ; /* read instream data */

  /* The INFILE statement closes the current file
     and opens a new one if FILELOC changes value
     when INFILE executes */
  infile my filevar=fileloc
         filename=myinfile end=done dlm=',';

  /* DONE set to 1 when last input record read */
  do while(not done);
    /* Read all input records from the currently */
    /* opened input file */
    input col1 col2 col3 ...;
    output;
  end;

  put 'Finished reading ' myinfile=;
datalines;
path-to-file1
path-to-file2
...
run;

To read a bunch of CSV files into a single SAS dataset, you can use a single data step, as described in the SAS documentation. You want the example that uses the FILEVAR= option on the INFILE statement, which is the code shown above.
There should be no reason to create intermediate datasets.

The easiest method is to use a wildcard.
filename allfiles '*PRICE.csv';
data allcsv;
  infile allfiles end=done dlm=',';
  input col1 col2 col3 ...;
run;
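If each CSV starts with a header row, the wildcard approach will read those header lines as data. A minimal sketch of one way to skip them, assuming the header is the first line of every file: FIRSTOBS= covers the first file, and the EOV= option flags the first record read from each subsequent file.
filename allfiles '*PRICE.csv';
data allcsv;
  /* firstobs=2 skips the header of the first file;               */
  /* eov= is set to 1 when the first record of a new file is read */
  infile allfiles dlm=',' dsd truncover firstobs=2 eov=newfile;
  input @;              /* load the record and hold the pointer */
  if newfile then do;   /* first record of a later file = its header row */
    newfile = 0;        /* eov= is not reset automatically */
    delete;
  end;
  input col1 col2 col3; /* placeholder column list, as above */
run;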

Related

Importing data from a VERY large text file into MySQL

I have a very large CSV file (150 MB). What is the best way to import it to MySQL?
I have to do some manipulation in PHP before inserting it into the MySQL table.
You could take a look at LOAD DATA INFILE in MySQL.
You might be able to do the manipulations once the data is loaded into MySQL, rather than first reading it into PHP. First store the raw data in a temporary table using LOAD DATA INFILE, then transform the data to the target table using a statement like the following:
INSERT INTO targettable (x, y, z)
SELECT foo(x), bar(y), z
FROM temptable
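For reference, a minimal sketch of the LOAD DATA INFILE step itself (the file path and staging table name are placeholders; adjust the terminators and header handling to match your file):
LOAD DATA INFILE '/path/to/file.csv'
INTO TABLE temptable
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;  -- skip the header row, if the file has one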
I would just open it with fopen and use fgetcsv to read each line into an array.
Roughly, in PHP:
// connect to the database (mysqli shown here; fill in your credentials)
$db = mysqli_connect("localhost", "user", "password", "dbname");

$filehandle = fopen("/path/to/file.csv", "r");
while (($data = fgetcsv($filehandle, 1000, ",")) !== FALSE) {
    // $data is an array of the fields on this line
    // do your parsing here and insert into the table
}
fclose($filehandle);
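If you insert row by row from PHP, a hedged sketch of that step with a prepared statement (targettable and its columns x, y, z are placeholders, reusing $db and $filehandle from above):
$stmt = mysqli_prepare($db, "INSERT INTO targettable (x, y, z) VALUES (?, ?, ?)");
while (($data = fgetcsv($filehandle, 1000, ",")) !== FALSE) {
    // manipulate $data as needed, then bind and execute
    mysqli_stmt_bind_param($stmt, "sss", $data[0], $data[1], $data[2]);
    mysqli_stmt_execute($stmt);
}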

How do I compare two CSV files in PIG, and output only the different columns?

Between an older csv file and a newer one, I want to find what fields have changed on rows with the same key. For example, if the unique key is in field $2, and we have two files:
Old csv file:
FIELD1,FIELD2,ID,FIELD4
a,a,key1,a
b,b,key2,b
New csv file:
FIELD1,FIELD2,ID,FIELD4
a,a2,key1,a2
b,b,key2,b
Desired output something like:
{FIELD2:a2,ID:key1,FIELD4:a2}
or in other words, on the row with ID = key1, the 2nd and 4th fields changed, and these are the changed values.
A Pig script that outputs the whole row if any field has changed is:
old = load '$old' using PigStorage('\n') as (line:chararray);
new = load '$new' using PigStorage('\n') as (line:chararray);
cg = cogroup old by line, new by line;
new_only = foreach (filter cg by IsEmpty(old)) generate flatten(new);
store new_only into '$changes';
My initial idea (I'm not sure how to complete it) is:
old = LOAD '$old' USING PigStorage('|');
new = LOAD '$new' USING PigStorage('|');
cogroup_data = COGROUP old BY $2, new BY $2; -- 3rd column is the unique key
diff_data = FOREACH cogroup_data GENERATE DIFF(old, new);
-- ({(a,a,key2,a),(a,a2,key2,a2)})
-- ? what goes here ?
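One possible direction (not from the original thread): a sketch that joins the old and new files on the key and emits, per changed row, only the fields that differ. It assumes the four comma-delimited columns from the example above; the field names are placeholders, and any header rows would need to be filtered out first.
old = LOAD '$old' USING PigStorage(',') AS (f1:chararray, f2:chararray, id:chararray, f4:chararray);
new = LOAD '$new' USING PigStorage(',') AS (f1:chararray, f2:chararray, id:chararray, f4:chararray);
joined  = JOIN old BY id, new BY id;
changed = FILTER joined BY (old::f1 != new::f1) OR (old::f2 != new::f2) OR (old::f4 != new::f4);
diffs   = FOREACH changed GENERATE
            new::id AS id,
            (old::f1 != new::f1 ? new::f1 : '') AS f1_new,
            (old::f2 != new::f2 ? new::f2 : '') AS f2_new,
            (old::f4 != new::f4 ? new::f4 : '') AS f4_new;
STORE diffs INTO '$changes';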

Column names missing when exporting files using SAS data step

I have a large SAS dataset raw_data which contains data collected from various countries. This dataset has a column "country" which lists the country from which the observation originated. I would like to export a separate .csv file for each country in raw_data. I use the following data step to produce the output:
data _null_;
  set raw_data;
  length fv $ 200;
  fv = "/directory/" || strip(put(country,$32.)) || ".csv";
  file write filevar=fv dsd dlm=',';
  put (_all_) (:);
run;
However, the resulting .csv files no longer have the column names from raw_data. I have over a hundred columns in my dataset, so listing all of the column names is prohibitive. Can anyone provide me some guidance on how I can modify the above code so as to attach the column names to the .csv files being exported? Any help is appreciated!
You can create a macro variable that holds the variable names and puts them to the CSV file.
proc sql noprint;
  select name into :var_list separated by ","
    from sashelp.vcolumn
    where libname="WORK" and memname="RAW_DATA"
    order by varnum;
quit;

data _null_;
  set raw_data;
  by country;
  length fv $ 200;
  fv = "/directory/" || strip(put(country,$32.)) || ".csv";
  /* the FILE statement must execute before the header is PUT,
     so that the header goes to the new country's file */
  file write filevar=fv dsd dlm=',';
  if first.country then do;
    put "&var_list";
  end;
  put (_all_) (:);
run;
Consider this data step that is very similar to your program. It uses VNEXT to query the PDV and write the variable names as the first record of each file.
proc sort data=sashelp.class out=class;
  by age;
run;

data _null_;
  set class;
  by age;
  filevar = catx('\','C:\Users\name\Documents',catx('.',age,'csv'));
  file dummy filevar=filevar ls=256 dsd;
  if first.age then link names;
  put (_all_)(:);
  return;

names:
  /* walk the PDV with CALL VNEXT and write each variable name,
     stopping when the automatic FIRST./LAST. variables are reached */
  length _name_ $32;
  call missing(_name_);
  do while(1);
    call vnext(_name_);
    if _name_ eq: 'FIRST.' then leave;
    put _name_ @;
  end;
  put;
  return;
run;

Date variable is NULL while loading CSV data into a Hive external table

I am trying to load a SAS dataset into a Hive external table. To do that, I first converted the SAS dataset to CSV format. In the SAS dataset, the date variable (i.e. as_of_dt) has these attributes:
LENGTH=8, FORMAT=DATE9., INFORMAT=DATE9., LABEL=as_of_dt
To convert the SAS dataset to CSV, I used the code below (I had used a RETAIN statement earlier in SAS so that the order of the variables is maintained):
proc export data=input_SASdataset_for_csv_conv
outfile= "/mdl/myData/final_merged_table_201501.csv"
dbms=csv
replace;
putnames=no;
run;
Up to this point (i.e. the CSV file creation), the date variable is read correctly. But when I then load the file into a Hive external table using the command below, the DATE variable (i.e. as_of_dt) ends up as NULL:
CREATE EXTERNAL TABLE final_merged_table_20151(as_of_dt DATE, client_cm_id STRING, cm11 BIGINT, cm_id BIGINT, corp_id BIGINT, iclic_id STRING, mkt_segment_cd STRING, product_type_cd STRING, rated_company_id STRING, recovery_amt DOUBLE, total_bal_amt DOUBLE, write_off_amt DOUBLE) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/mdl/myData';
Also, when I run desc formatted final_merged_table_201501 in Hive, I get the following table parameters:
Table Parameters:
COLUMN_STATS_ACCURATE false
EXTERNAL TRUE
numFiles 0
numRows -1
rawDataSize -1
totalSize 0
transient_lastDdlTime 1447151851
But even though it shows numRows=-1, I can still see data in the table with the Hive command SELECT * FROM final_merged_table_20151 limit 10;, just with the date variable (as_of_dt) stored as NULL.
Where might the problem be?
Based on madhu's comment, you need to change the format on as_of_dt to YYMMDD10., since Hive's DATE type expects values in yyyy-MM-dd form.
You can do that with PROC DATASETS. Here is an example:
data test;
  /* Test data with AS_OF_DT formatted DATE9. per your question */
  format as_of_dt date9.;
  do as_of_dt=today() to today()+5;
    output;
  end;
run;

proc datasets lib=work nolist;
  /* Modify the test data set and change the format of AS_OF_DT */
  modify test;
  attrib as_of_dt format=yymmdd10.;
run;
quit;

/* Create the CSV */
proc export outfile="C:\temp\test.csv"
    data=test
    dbms=csv
    replace;
  putnames=no;
run;
If you open the CSV, you will see the date in YYYY-MM-DD format.

How to write data to a CSV file in ABL

I populated a temp-table from a query, and it looks like this:
ttcomp.inum
ttcomp.iname
ttcomp.iadd
There are 5000 records in this temp-table, and I now want to write them to a CSV file. I think it could be done with an output stream, but I don't know how to implement it. Could someone help me with this?
Export does the trick:
/* Define a stream */
DEFINE STREAM str.

/* Define the temp-table. I did some guessing regarding the datatypes... */
DEFINE TEMP-TABLE ttcomp
    FIELD inum  AS INTEGER
    FIELD iname AS CHARACTER
    FIELD iadd  AS INTEGER.

/* Fake logic that populates your temp-table is here */
DEFINE VARIABLE i AS INTEGER NO-UNDO.
DO i = 1 TO 5000:
    CREATE ttcomp.
    ASSIGN
        ttcomp.inum  = i
        ttcomp.iname = "ABC123"
        ttcomp.iadd  = 3.
END.
/* Fake logic done... */

/* Output the temp-table */
OUTPUT STREAM str TO VALUE("c:\temp\file.csv").
FOR EACH ttcomp NO-LOCK:
    /* The delimiter can be set to anything you like: comma, semicolon, etc. */
    EXPORT STREAM str DELIMITER "," ttcomp.
END.
OUTPUT STREAM str CLOSE.
/* Done */
Here is an alternative without a stream.
/* Using the temp-table of Jensd */
DEFINE TEMP-TABLE ttcomp
    FIELD inum  AS INTEGER
    FIELD iname AS CHARACTER
    FIELD iadd  AS INTEGER.

OUTPUT TO somefile.csv APPEND.
FOR EACH ttcomp:
    /* STRING() converts the integer fields so they can be concatenated */
    DISPLAY
        STRING(ttcomp.inum) + ",":U
        ttcomp.iname + ",":U
        STRING(ttcomp.iadd) SKIP.
END.
OUTPUT CLOSE.
There is another alternative. EXPORT is fine when formats don't matter, because EXPORT ignores them; if, for example, you don't want dates written as mm/dd/yy, it is better to use PUT STREAM and build the output explicitly. (This still needs the DEFINE STREAM and OUTPUT STREAM ... TO / CLOSE statements from the first example around the loop.)
FOR EACH ttcomp NO-LOCK:
    PUT STREAM str UNFORMATTED
        SUBSTITUTE("&1;&2;&3", ttcomp.inum, ttcomp.iname, ttcomp.iadd) SKIP.
END.