How to merge multiple csv files into 1 SAS dataset

I just started using SAS 3 days ago and I need to merge ~50 csv files into 1 SAS dataset.
The 50 csv files have multiple variables, with only 1 variable in common, i.e. "region_id".
I've used SAS Enterprise Guide's drag-and-drop functionality to do this, but it was too manual and took me half a day to upload and merge 47 csv files into 1 SAS dataset.
I was wondering whether anyone has a more intelligent way of doing this using Base SAS?
Any advice and tips appreciated!
Thank you!
Example filenames:
2011Census_B01_AUST_short
2011Census_B02A_AUST_short
2011Census_B02B_AUST_short
2011Census_B03_AUST_short
.
.
2011Census_xx_AUST_short
I have more than 50 csv files to upload and merge.
The number and type of variables varies from one csv file to the next. However, all csv files have 1 variable in common: "region_id".
Example variables:
region_id, Tot_P_M, Tot_P_F, Tot_P_P, Age_0_4_yr_F etc...

First, we'll need an automated way to import. The simple macro below takes the location of the file and the name of the file as inputs, and outputs a dataset to the work library. (You could use the CONCATENATE function in Excel to create the 50 lines of SAS code, or generate the calls automatically, as sketched after the example call below.) We also sort each dataset to make the merge easier later.
%macro importcsv(location=, filename=);
    /* SAS dataset names can't start with a digit, so prefix the
       output name with an underscore (the census filenames all
       start with "2011"). */
    proc import datafile="&location./&filename..csv"
        out=_&filename.
        dbms=csv
        replace;
        getnames=yes;
    run;

    /* sort now so the later MERGE ... BY region_id just works */
    proc sort data=_&filename.;
        by region_id;
    run;
%mend importcsv;
%importcsv(location=C:/Desktop, filename=2011Census_B01_AUST_short)
.
.
.
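If you'd rather not build the 50 macro calls by hand in Excel, one alternative is to let SAS generate them. This is only a sketch: it assumes the files all sit in C:/Desktop, that your session is allowed to run operating-system commands (the XCMD option), and that the Windows dir command is available.
filename flist pipe 'dir /b "C:\Desktop\*.csv"';

data _null_;
    infile flist truncover;
    input fname $100.;
    /* drop the .csv extension, then queue up one macro call per file */
    fname = scan(fname, 1, '.');
    call execute('%nrstr(%importcsv)(location=C:/Desktop, filename=' || strip(fname) || ')');
run;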
Then simply merge all of the data together again. I added ellipses because I didn't want to write it out 50 times. (A sketch that builds the dataset list automatically follows the code.)
data merged;
merge dataseta datasetb datasetc ... datasetax;
by region_id;
run;
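If typing out the 50 dataset names is also too manual, here is a sketch of a way to build the list automatically. It assumes the imported datasets are the only ones in WORK, and that names in dictionary.tables are stored in uppercase (excluding MERGED lets you rerun the step safely).
proc sql noprint;
    select memname into :dslist separated by ' '
    from dictionary.tables
    where libname = 'WORK' and memname ne 'MERGED';
quit;

data merged;
    merge &dslist.;
    by region_id;
run;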
Hope this helps.

Related

Importing messy time-data from CSV to SPSS

I have a large CSV data file with data on sleep. The first four fields of this dataset contain time data in varying formats.
I want to import this data file into SPSS, but as you can see from the CSV example underneath, the data is not easy for SPSS or Excel to read. How can I make use of these data?
Example of data:
Time went to bed;Minutes to sleep;Got up time;Actual sleep time in hours
22;240;08,30;4
24,00;60;09,00;8
200;120;1200;8
0;120;900;4,5
01:30;30;06:30;5
You can import it into SPSS as text (using ";" as the delimiter) and then work on it. You can't use the current labels as variable names, so you should specify variable names and formats in the command. See the example below:
GET DATA /TYPE=TXT
/FILE="path\filename.txt"
/DELCASE=LINE
/DELIMITERS=";"
/ARRANGEMENT=DELIMITED
/FIRSTCASE=2
/IMPORTCASE=ALL
/VARIABLES=
Time_went_to_bed A5
Minutes_to_sleep F8.1
Got_up_time A5
Actual_sleep_hrs F8.1.
CACHE.
EXECUTE.
Once you have this in a dataset you'll need to identify and correct each of the different shapes in which your time data appears.

Create libsvm from multiple csv files for xgboost external memory training

I am trying to train an xgboost model using its external-memory version, which takes a libsvm file as the training set. Right now, all the data is stored in a bunch of csv files which, combined, are way larger than the memory I have, say 70G (you can easily read any one of them on its own). I just wonder how to create one large libsvm file for xgboost, or whether there is any other workaround for this. Thank you.
If your csv files do not have headers, you can combine them with the Unix cat command.
Example:
> ls
file1.csv file2.csv
> cat *.csv > combined.csv
Now combined.csv is the concatenation of all the other files.
If all your csv files have headers, you'll want to do something trickier, like keeping everything after the header line of each file with tail (for example, tail -n +2 file.csv).
XGBoost supports csv as an input.
If you want to convert that to libsvm regardless, you can use phraug's scripts.

Editing a .csv file with a batch file [closed]

This is my first question on here. I work as a meteorologist and have some coding experience, though it is far from professionally taught. Basically what I have is a .csv file from a weather station that is giving me data that is too detailed (65.66 degrees and similar values). What I want to do is automate, via a script file, a way to access the .csv file and round off values that are too detailed: take a temp from 65.66 to 66 (rounding up for anything at or above .5 and down for below), or take a pressure of 29.8889 and make it 29.89 using the same rounding rules. Is this possible? If so, how should I go about it? Again, keep in mind that my coding skills for batch files are not the strongest.
Any help would be much appreciated.
Thanks,
I agree with the comments above. Math in batch is limited to integers, and won't work well for the manipulations you want.
I'd use PowerShell. Besides easily handling floating point math, it also has built-in methods for objectifying CSV data (as well as XML and other types of structured data). Take the following hypothetical CSV data contained within weather.csv:
date,time,temp,pressure,wx
20160525,12:30,65.66,30.1288,GHCND:US1TNWS0001
20160525,13:00,67.42,30.3942,GHCND:US1TNWS0001
20160525,13:30,68.92,31.0187,GHCND:US1TNWS0001
20160525,14:00,70.23,30.4523,GHCND:US1TNWS0001
20160525,14:30,70.85,29.8889,GHCND:US1TNWS0001
20160525,15:00,69.87,28.7384,GHCND:US1TNWS0001
The first thing you want to do is import that data as an object (using import-csv), then round the numbers as desired -- temp rounded to a whole number, and pressure rounded to a precision of 2 decimal places. Rounding to a whole number is easy. Just recast the data as an integer. It'll be rounded automatically. Rounding the pressure column is pretty easy as well if you invoke the .NET [math]::round() method.
# grab CSV data as a hierarchical object
$csv = import-csv weather.csv
# for each row of the CSV data...
$csv | foreach-object {
# recast the "temp" property as an integer
$_.temp = [int]$_.temp
# round the "pressure" property to a precision of 2 decimal places
$_.pressure = [math]::round($_.pressure, 2)
}
Now pretend you want to display the temperature, barometric pressure, and weather station name where "date" = 20160525 and "time" = 14:30.
$row = $csv | where-object { ($_.date -eq 20160525) -and ($_.time -eq "14:30") }
$row | select-object pressure,temp,wx | format-table
Assuming "pressure" started with a value of 29.8889 and "temp" had a value of 70.85, then the output would be:
pressure temp wx
-------- ---- --
29.89 71 GHCND:US1TNWS0001
If the CSV data had had multiple rows with the same date and time values (perhaps measurements from different weather stations), then the table would display with multiple rows.
And if you want to export that to a new csv file, just replace the format-table cmdlet with export-csv:
$row | select-object pressure,temp,wx | export-csv outfile.csv
Handy as a pocket on a shirt, right?
Now, pretend you want to display the human-readable station names rather than NOAA's designations. Make a hash table.
$stations = @{
"GHCND:US1TNWS0001" = "GRAY 1.5 E TN US"
"GHCND:US1TNWS0003" = "GRAY 1.9 SSE TN US"
"GHCND:US1TNWS0016" = "GRAY 1.3 S TN US"
"GHCND:US1TNWS0018" = "JOHNSON CITY 5.9 NW TN US"
}
Now you can add a "station" property to your "row" object.
$row = $row | select *,"station"
$row.station = $stations[$row.wx]
And now if you do this:
$row | select-object pressure,temp,station | format-table
Your console shows this:
pressure temp station
-------- ---- -------
29.89 71 GRAY 1.5 E TN US
For extra credit, say you want to export this row data to JSON (for a web page or something). That's slightly more complicated, but not impossibly so.
add-type -AssemblyName System.Web.Extensions
$JSON = new-object Web.Script.Serialization.JavaScriptSerializer
# convert $row from a PSCustomObject to a more generic hash table
$obj = @{}
# the % sign in the next line is shorthand for "foreach-object"
$row.psobject.properties | %{
$obj[$_.Name] = $_.Value
}
# Now, stringify the row and display the result
$JSON.Serialize($obj)
The output of that should be similar to this:
{"station":"GRAY 1.5 E TN US","wx":"GHCND:US1TNWS0001","temp":71,"date":"201605
25","pressure":29.89,"time":"14:30"}
... and you can redirect it to a .json file by using > or pipe it into the out-file cmdlet.
DOS batch scripting is, by far, not the best place to edit text files. However, it is possible. I will include sample, incomplete DOS batch code at the bottom of this post to demonstrate the point. I recommend you focus on Excel (no coding needed) or Python.
Excel - You don't need to code at all with Excel. Open the csv file. Let's say you have 66.667 in cell B12. In cell C12, enter a formula using the ROUND function (code below). You could also teach yourself some Visual Basic for Applications, but for this simple task that is overkill. When done, if you save in csv format, you will lose your formulas and keep only the data. Consider saving as xlsx or xlsm instead.
Visual Basic Script - you can run VBScript on your machine with cscript.exe (or wscript.exe), which is part of Windows. But if you're using VBScript, you might as well use VBA in Excel; it is almost identical.
Python is a very high-level language with built-in libraries that make editing a csv file super easy. I recommend Anaconda (a Python distribution) from continuum.io, but you can find generic Python at python.org as well. Anaconda comes prepackaged with lots of helpful libraries. For csv editing, you will likely want to use the pandas library. You can find plenty of short videos on YouTube.
Excel
Say you have 66.667 in cell B12. Set the formula in C12 to...
"=ROUND(B12,0)" to round to integer
"=ROUND(B12,1)" to round to one decimal place
As you copy and paste, Excel will attempt to intelligently update the formulas for you.
Python
import pandas as pd
import numpy as np

# load the csv file to memory; name your columns using names=[]
df = pd.read_csv("C:/temp/weather.csv", names=["city", "temperature", "date"])

# round the temperature column and keep the result
df["temperature"] = df["temperature"].apply(np.round)

df.to_csv("newfile.csv", index=False)  # export to a new csv file
df.to_excel("newfile.xlsx")            # or export to an Excel file instead
DOS Batch
A batch script for this is much, much harder. I will not write the whole program, because it is not a great solution, but I'll give you a taste in DOS batch code at the bottom of this post. Compared to using Python or Excel, it is extremely complex.
Here is a rough sketch of the DOS code. Because I don't recommend this method, I didn't take the time to fully debug it.
setlocal ENABLEDELAYEDEXPANSION
:: prep our new file for output. Let's write the header row.
echo col1, col2, col3 >newfile.csv
:: read the existing text file line by line
:: since it is csv, we will parse on comma
:: skip lines starting with a semi-colon
FOR /F "eol=; tokens=1-3 delims=, " %%i in (input_file.txt) do (
    set "col1=%%i"
    set "col2=%%j"
    set "col3=%%k"
    REM truncate col2 to 1 decimal place
    for /f "tokens=1,2 delims=." %%A in ("!col2!") do (
        set "integer=%%A"
        set "decimal=%%B"
        set "decimal=!decimal:~0,1!"
        REM or, you can use an if statement to round up or down.
        REM Now, put the integer and decimal together again and
        REM redefine the value for col2.
        set "col2=!integer!.!decimal!"
        REM write output to the new csv file.
        REM > would overwrite newfile.csv on every pass through the
        REM loop; >> appends instead, which is what we want here.
        echo !col1!, !col2!, !col3! >>newfile.csv
    )
)
:: open the csv file in the default application
start newfile.csv

All the columns of a csv file cannot be imported into a SAS dataset

My dataset contains 1,300,000 observations with 56 columns. It is a .csv file and I'm trying to import it using proc import. After importing, I find that only 44 out of 56 columns are imported.
I tried increasing the guessing rows but it is not helping.
P.S.: I'm using SAS 9.3
If (and, as far as I am aware, only in that case) you specify the file to load in a filename statement, you have to set the lrecl option to a value that is large enough.
If you don't, the default is only 256. Ergo, if your csv has lines longer than 256 characters, SAS will not read the full line.
See this link for more information (just search for lrecl): https://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a000308090.htm
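For example, a minimal sketch (the path, output name, and record length are placeholders, not your actual values):
filename incsv "C:/data/bigfile.csv" lrecl=32767;

proc import datafile=incsv
    out=want
    dbms=csv
    replace;
    getnames=yes;
run;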
If you have SAS Enterprise Guide (I think it's now included with all desktop licenses), try out the import wizard. It's excellent. And it will generate code you can reuse with a little editing.
It will take a while to run because it will read your entire file before writing the import logic.

Can SAS convert CSV files into Binary Format?

The output we need to produce is a standard delimited file, but instead of ASCII content we need binary. Is this possible using SAS?
Is there a specific Binary Format you need? Or just something non-ascii? If you're using proc export, you're probably limited to whatever formats are available. However, you can always create the csv manually.
If anything will do, you could simply zip the csv file.
Running on a *nix system, for example, you'd use something like:
filename outfile pipe "gzip -c > myfile.csv.gz";
Then create the csv manually:
data _null_;
set mydata;
file outfile;
put var1 "," var2 "," var3;
run;
If this is PC/Windows SAS, I'm not as familiar, but you'll probably need to install a command-line zip utility.
This link from SAS suggests using winzip, which has a freely downloadable version. Otherwise, the code is similar.
http://support.sas.com/kb/26/011.html
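Alternatively, if you are on SAS 9.4 or later, the ZIP filename access method writes the archive directly and avoids external utilities altogether. A minimal sketch (mydata and the variable names are placeholders):
filename outz zip "C:/temp/myfile.zip" member="myfile.csv";

data _null_;
    set mydata;
    file outz dsd dlm=',';
    put var1 var2 var3;
run;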
You can actually make a CSV file as a SAS catalog entry; CSV is a valid SAS Catalog entry type.
Here's an example:
filename of catalog "sasuser.test.class.csv";
proc export data=sashelp.class
outfile=of
dbms=dlm;
delimiter=',';
run;
filename of clear;
This little piece of code exports SASHELP.CLASS to a SAS Catalog entry of entry type CSV.
This way you get a binary format you can move between SAS installations on different platforms with PROC CPORT/CIMPORT, not having to worry if the used binary package format is available to your SAS session, since it's an internal SAS format.
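A rough sketch of that transport step (the library, catalog, and path are placeholders):
proc cport catalog=sasuser.test file="C:/temp/test.cpo";
run;

/* and on the target machine */
proc cimport infile="C:/temp/test.cpo" catalog=sasuser.test;
run;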
Are you saying you have binary data that you want to output to csv?
If so, I don't think there is necessarily a defined standard for how this should be handled.
I suggest trying it (proc export comes to mind) and seeing if the results match your expectations.
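For instance, a minimal proc export sketch to experiment with (the dataset and path are placeholders):
proc export data=sashelp.class
    outfile="C:/temp/class.csv"
    dbms=csv
    replace;
run;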
Using SAS, output a .csv file; Open it in Excel and Save As whichever format your client wants. You can automate this process with a little bit of scripting in ### as well. (Substitute ### with your favorite scripting language.)