The output we need to produce is a standard delimited file but instead of ascii content we need binary. Is this possible using SAS?
Is there a specific Binary Format you need? Or just something non-ascii? If you're using proc export, you're probably limited to whatever formats are available. However, you can always create the csv manually.
If anything will do, you could simply zip the csv file.
Running on a *nix system, for example, you'd use something like:
filename outfile pipe "gzip -c > myfile.csv.gz";
Then create the csv manually:
data _null_;
set mydata;
file outfile dsd dlm=','; /* DSD quotes values that contain the delimiter */
put var1 var2 var3;
run;
If this is PC/Windows SAS, I'm not as familiar, but you'll probably need to install a command-line zip utility.
This link from SAS suggests using winzip, which has a freely downloadable version. Otherwise, the code is similar.
http://support.sas.com/kb/26/011.html
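If a scripting language is available next to SAS, the same "just compress the csv" idea can be sketched outside SAS as well. This is only a minimal illustration in Python, assuming the rows are already in memory; the function names and the file name are made up for the example.

```python
import csv
import gzip


def write_gzipped_csv(rows, path):
    """Write rows (an iterable of sequences) to a gzip-compressed CSV file."""
    # Text mode ("wt") lets csv.writer emit str while gzip handles compression.
    with gzip.open(path, "wt", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)


def read_gzipped_csv(path):
    """Read a gzip-compressed CSV file back into a list of rows."""
    with gzip.open(path, "rt", newline="", encoding="utf-8") as handle:
        return list(csv.reader(handle))
```

The result is an ordinary delimited file inside a binary (gzip) container, so the receiver has to decompress it before reading it as text.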
You can actually make a CSV file as a SAS catalog entry; CSV is a valid SAS Catalog entry type.
Here's an example:
filename of catalog "sasuser.test.class.csv";
proc export data=sashelp.class
outfile=of
dbms=dlm;
delimiter=',';
run;
filename of clear;
This little piece of code exports SASHELP.CLASS to a SAS Catalog entry of entry type CSV.
This way you get a binary format you can move between SAS installations on different platforms with PROC CPORT/CIMPORT, not having to worry if the used binary package format is available to your SAS session, since it's an internal SAS format.
Are you saying you have binary data that you want to output to csv?
If so, I don't think there is necessarily a defined standard for how this should be handled.
I suggest trying it (proc export comes to mind) and seeing if the results match your expectations.
Using SAS, output a .csv file; Open it in Excel and Save As whichever format your client wants. You can automate this process with a little bit of scripting in ### as well. (Substitute ### with your favorite scripting language.)
I am attempting to import a CSV file which is in French to my US based analysis. I have noticed several issues in the import related to the use of accents. I put the csv file into a text reader and found that the data look like this
I am unsure how to get rid of the [sub] pieces and format this properly.
I am on SAS 9.3 and am unable to edit the CSV as it is a shared CSV with French researchers. I am also limited to what I can do in terms of additional languages within SAS because of admin rights.
I have tried the following fixes:
data want(encoding=asciiany);
set have;
comment= Compress(comment,'0D0A'x);
comment= TRANWRD(comment,'0D0A'x,'');
comment= TRANWRD(comment,'0D'x,'');
comment= TRANWRD(comment,"\u001a",'');
How can I resolve these issues?
While this would have been a major issue a few decades ago, nowadays, it's very simple to determine the encoding and then run your SAS in the right mode.
First, open the CSV in a text editor, not the basic Notepad but almost any other; Notepad++ is free, for example, or Ultraedit or Textpad, on Windows, or on the Mac, BBEdit, or several others will do. I'll assume Notepad++ for the rest of this answer, but all of them have some way of doing this. If you're in a restricted no-admin-rights environment, good news: Notepad++ can be installed in your user folder with no admin rights (or even on a USB!). (Also, an advanced text editor is a vital data science tool, so you should have one anyway.)
In Notepad++, once you open the file there will be an encoding in the bottom right: "UTF-8", "WLATIN1", "ASCII", etc., depending on the encoding of the file. Look and see what that is, and write it down.
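If you can't install an editor at all, a rough check is also possible with a few lines of script. The following is only a sketch in Python: it tries a list of candidate encodings and reports the first one that decodes without error. The function name and the candidate list are assumptions for illustration; cp1252 is the Python name for what SAS calls WLATIN1.

```python
def guess_encoding(raw_bytes, candidates=("utf-8", "cp1252")):
    """Return the first candidate encoding that decodes raw_bytes cleanly.

    A clean decode does not prove the encoding is right, so put strict
    encodings such as UTF-8 first and treat the answer as a hint only.
    """
    for name in candidates:
        try:
            raw_bytes.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None
```

For a real file you would pass in the result of `open(path, "rb").read()`. Note that a pure ASCII file is reported as utf-8, which is harmless because ASCII is a subset of UTF-8.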
Once you have that, you can try starting SAS in that encoding. For the rest of this, I assume it is in UTF-8 as that is fairly standard, but replace UTF-8 with whatever encoding you determined earlier.
See this article for more details; the instructions are for 9.4, but they have been the same for years. If this doesn't work, you'll need to talk to your SAS administrator, and they may need to modify your SAS installation.
You can either:
Make a new shortcut (a copy of the one you run SAS with) and add -encoding UTF-8 to the command line
Create a new configuration file, point SAS to it, and include ENCODING=UTF-8 in the configuration file.
Note that this will have some other impacts - the datasets you create will be encoded in UTF-8, and while SAS is capable of handling that, it will add some extra notes to the log and some extra time if you later do work in non-UTF8 SAS with this, or if you use non-UTF8 SAS datasets in this mode.
This worked:
data want;
array f[8] $4 _temporary_ ('ä' 'ö' 'ü' 'ß' 'Ä' 'Ö' 'Ü' 'É');
array t[8] $4 _temporary_ ('ae' 'oe' 'ue' 'ss' 'Ae' 'Oe' 'Ue' 'E');
set have;
newvar=oldvar;
newvar = Compress(newvar,'0D0A'x);
newvar = TRANWRD(newvar,'0D0A'x,'');
newvar = TRANWRD(newvar,'0D'x,'');
newvar = TRANWRD(newvar,"\u001a",'');
newvar = compress(newvar, , 'kw');
do _n_=1 to dim(f);
newvar=tranwrd(newvar, trim(f[_n_]), trim(t[_n_]));
end;
run;
I've got (and will receive in the future) many CSV files that use the semicolon as delimiter and the comma as decimal separator.
So far I could not find out how to import these files into SAS using proc import -- or in any other automated fashion without the need for messing around with the variable names manually.
Create some sample data:
%let filename = %sysfunc(pathname(work))\sap.csv;
data _null_;
file "&filename";
put 'a;b';
put '12345,11;67890,66';
run;
The import code:
proc import out = sap01
datafile= "&filename"
dbms = dlm;
delimiter = ";";
GETNAMES = YES;
run;
After the import a value for the variable "AMOUNT" such as 350,58 (which corresponds to 350.58 in the US format) would look like 35,058 (meaning thirtyfivethousand...) in SAS (and after re-export to the German EXCEL it would look like 35.058,00).
A simple but dirty workaround would be the following:
data sap02; set sap01;
AMOUNT = AMOUNT/100;
format AMOUNT best15.2;
run;
I wonder if there is a simple way to define the decimal separator for the CSV import (similar to the specification of the delimiter), or any other "cleaner" solution compared to my workaround.
Many thanks in advance!
You technically should use dbms=dlm not dbms=csv, though it does figure things out. CSV means "Comma separated values", while DLM means "delimited", which is correct here.
I don't think there's a direct way to make SAS read in with the comma via PROC IMPORT. You need to tell SAS to use the NUMXw.d informat when reading in the data, and I don't see a way to force that setting in SAS. (There's an option for output with a comma, NLDECSEPARATOR, but I don't think that works here.)
Your best bet is either to write data step code yourself, or to run the PROC IMPORT, go to the log, and copy/paste the read in code into your program; then for each of the read-in records add :NUMX10. or whatever the appropriate maximum width of the field is. It will end up looking something like this:
data want;
infile "whatever.txt" dlm=';' lrecl=32767 missover;
input
firstnumvar :NUMX10.
secondnumvar :NUMX10.
thirdnumvar :NUMX10.
fourthnumvar :NUMX10.
charvar :$15.
charvar2 :$15.
;
run;
It will also generate lots of informat and format code. Alternatively, instead of adding the informat to each variable in the INPUT statement, you can change the generated informats from BEST. to NUMX10. You can also just remove the informats, unless you have date fields.
data want;
infile "whatever.txt" dlm=';' lrecl=32767 missover;
informat firstnumvar secondnumvar thirdnumvar fourthnumvar NUMX10.;
informat charvar $15.;
format firstnumvar secondnumvar thirdnumvar fourthnumvar BEST12.;
format charvar $15.;
input
firstnumvar
secondnumvar
thirdnumvar
fourthnumvar
charvar $
;
run;
"Your best bet is either to write data step code yourself, or to run the PROC IMPORT, go to the log, and copy/paste the read in code into your program"
This has a drawback. If there is a change in the structure of the csv file, for example a changed column order, then one has to change the code in the SAS program.
So it is safer to change the input, substituting in the numeric fields the comma with dot and passing SAS the modified input.
The first idea was to use a perl program for this, and then use in SAS a filename with a pipe to read the modified input.
Unfortunately there is a SAS restriction in the proc import: The IMPORT procedure does not support device types or access methods for the FILENAME statement except for DISK.
So one has to create a workfile on disk with the adjusted input.
I used the Text::CSV_PP package to read the csv file.
testdata.csv contains the csv data to read.
substitute_commasep.perl is the name of the perl program
perl code:
# use lib "/........"; # specify, if Text::CSV_PP is locally installed. Otherwise error message: Can't locate Text/CSV_PP.pm in ....;
use Text::CSV_PP;
use strict;
my $csv = Text::CSV_PP->new({ binary => 1
,sep_char => ';'
}) or die "Error creating CSV object: ".Text::CSV_PP->error_diag ();
open my $fhi, "<", "$ARGV[0]" or die "Error reading CSV file: $!";
while ( my $colref = $csv->getline( $fhi) ) {
foreach (@$colref) { # analyze each column value
s/,/\./ if /^\s*[\d,]*\s*$/; # substitute, if the field contains only numbers and ,
}
$csv->print(\*STDOUT, $colref);
print "\n";
}
$csv->eof or $csv->error_diag();
close $fhi;
SAS code:
filename readcsv pipe "perl substitute_commasep.perl testdata.csv";
filename dummy "dummy.csv";
data _null_;
infile readcsv;
file dummy;
input;
put _infile_;
run;
proc import datafile=dummy
out=data1
dbms=dlm
replace;
delimiter=';';
getnames=yes;
guessingrows=32767;
run;
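For anyone without Perl at hand, the same preprocessing step can be sketched in Python. This is not part of the solution above, just a minimal equivalent under the same assumption: a field made up only of digits and commas is a number, and its first comma is the decimal separator. The function name is made up for the example.

```python
import csv
import io
import re

# Same test as the Perl regex: only digits and commas, optional blanks around.
NUMERIC_WITH_COMMA = re.compile(r"\s*[\d,]*\s*")


def convert_decimal_commas(text, delimiter=";"):
    """Replace the decimal comma with a dot in purely numeric fields."""
    source = io.StringIO(text, newline="")
    target = io.StringIO(newline="")
    writer = csv.writer(target, delimiter=delimiter, lineterminator="\n")
    for row in csv.reader(source, delimiter=delimiter):
        writer.writerow([
            # Only the first comma is replaced, like s/,/\./ in the Perl code.
            field.replace(",", ".", 1) if NUMERIC_WITH_COMMA.fullmatch(field) else field
            for field in row
        ])
    return target.getvalue()
```

Like the Perl version, this leaves text fields that happen to contain a comma untouched, and it does not recognise signed numbers, so a value such as -350,58 would pass through unchanged.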
I performed a GWAS in PLINK and now I would like to look at the data for a small set of SNPs listed one for each line, in a file called snps.txt.
I would like to export the data from PLINK for these specific SNPs into a .txt or .csv file. Ideally, this file would have the individual IDs as well as the genotypes for these SNPs so that I could later merge it with my phenotype file and perform additional analyses and plots.
Is there an easy way to do that? I know I can use --extract to request specific SNPs only but I can't find a way to tell PLINK to export the data to an "exportable" text-based format.
If you are using classic plink (1.07) you should consider upgrading to plink 1.9. It is a lot faster, and supports many more formats. This answer is for plink 1.9.
Turning binary plink data into a .csv file
It sounds like your problem is that you are unable to turn the binary data into a regular plink text file.
This is easy to do with the recode option. It should be used without any parameters to convert to the plink text format:
plink --bfile gwas_file --recode --extract snps.txt --out gwas_file_text
If you want to convert the .ped data to a csv afterwards you could do the following:
cut -d " " -f2-2,7- --output-delimiter=, gwas_file_text.ped
This produces a comma-delimited file with IDs in the first column and then genotypes.
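If cut is not available (on Windows, for example), the same column selection can be sketched in Python. This is only an illustration, assuming the usual .ped layout of six leading columns (family ID, individual ID, father, mother, sex, phenotype) followed by the allele calls; the function name is made up.

```python
def ped_line_to_csv(line):
    """Keep the individual ID (field 2) and the allele calls (fields 7 onward)."""
    fields = line.split()  # splits on any whitespace, so tab-delimited files work too
    return ",".join([fields[1]] + fields[6:])
```

As with the cut command, each SNP still takes up two columns (one per allele), so you may want to paste the pairs together before merging with the phenotype file.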
Turning plink data into other text based file formats
Note that you can also convert the data to a lot of other text-based filetypes, all described in the docs.
One of these is the common variant call format (VCF), which makes a file with the snps and individual IDs all in one file, as requested:
plink --bfile gwas_file --recode vcf --extract snps.txt --out gwas_file_text
I am uploading data from a big .csv file into Cassandra using copy in cqlsh.
I am using cassandra 1.2 and CQL 3.0.
However, since " is part of my data, I have to use some other character as the quote character, and I need it to be an extended ASCII character. I tried various approaches, but they all fail.
The following works, but I need to use an extended ASCII character for my purpose:
copy (<columnnames>) from <filename> with delimiter='|' and quote = '"';
copy (<columnnames>) from <filename> with delimiter='|' and quote = '~';
When I give quote='ß', I get the error below:
:"quotechar" must be an 1-character string
Please advise on how I can use an extended ASCII character for the quote parameter.
Thanks in advance
A note on the COPY documentation page suggests that for bulk loading (like in your case), the json2sstable utility should be used. You can then load the sstables to your cluster using sstableloader. So I suggest that you write a script/program to convert your CSV to JSON and use these tools for your big CSV. JSON will not have any problem handling all characters from ASCII table.
I had a similar problem, and inspected the source code of cqlsh (it's a python script). In my case, I was generating the csv with python, so it was a matter of finding the right python csv parameters.
Here's the key information from cqlsh:
csv_dialect_defaults = dict(delimiter=',', doublequote=False,
escapechar='\\', quotechar='"')
So if you are lucky enough to generate your .csv file from python, it's just a matter of using the csv module with:
writer = csv.writer(open("output.csv", 'w'), **csv_dialect_defaults)
Hope this helps, even if you are not using python.
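To make that a bit more concrete, here is a small self-contained sketch of writing and reading back with those settings. The dictionary is copied from the cqlsh snippet above; the two helper functions are made up for the example.

```python
import csv
import io

# Copied from the cqlsh source quoted above.
csv_dialect_defaults = dict(delimiter=',', doublequote=False,
                            escapechar='\\', quotechar='"')


def to_cqlsh_csv(rows):
    """Serialise rows using the dialect settings cqlsh uses."""
    buffer = io.StringIO(newline="")
    csv.writer(buffer, **csv_dialect_defaults).writerows(rows)
    return buffer.getvalue()


def from_cqlsh_csv(text):
    """Parse text that was written with the same dialect settings."""
    return list(csv.reader(io.StringIO(text, newline=""), **csv_dialect_defaults))
```

With doublequote=False, a " inside a value is written with a backslash in front of it instead of being doubled, so data containing " can stay as it is and no exotic quote character is needed.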
I'm trying to import a .CSV file into a SAS dataset, and am having some trouble. Here's a line of sample input:
Foo,5,10,3.5
Bar,2,3,1.0
The problem I'm having is that the line-final "3.5" and "1.0" are not being correctly interpreted as variable values (instead SAS complains that they are invalid values, giving me a NOTE: Invalid data for VARIABLE error). However, when I add a comma to the end of the line, like so:
Foo,5,10,3.5,
Bar,2,3,1.0,
Then everything works fine. Is there a way that I can make this import work without modifying the source file?
Currently, my DATA step's INFILE statement has the DSD, DLM=',', and MISSOVER options.
With this data in a .csv file in a windows environment
Foo,5,10,1.5
Bar,2,3,2.1
Foo,5,10,3.5
Bar,2,3,4.1
This code works (running SAS locally on a windows machine)
filename f 'D:\Data\SAS\input.csv';
data input;
infile f delimiter=',';
input char1 $ num1 num2 num3;
Run;
As @itzy mentioned, the environment is important; more info will help with the solution.
When you are working with data from a different environment, you can use the TERMSTR option on the INFILE statement to tell SAS how the lines of data are terminated.
This most likely has to do with the different codes for line endings in Unix and Windows. I'm guessing your data comes from a different operating system than the one you're running SAS on.
The solution is to change the newline codes to the correct operating system. If you're running SAS on a unix system, try the dos2unix command. If you're running Windows, you can edit the CSV file with a text editor like UltraEdit or Notepad++ and save the file in Windows format.
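To check whether line endings really are the culprit before changing anything, you can look at the raw bytes. Below is a rough sketch in Python; the function names are made up, and for a real file you would pass in the result of reading it in binary mode.

```python
def detect_line_ending(raw_bytes):
    """Report which line terminator appears in the data."""
    if b"\r\n" in raw_bytes:
        return "CRLF"  # Windows
    if b"\r" in raw_bytes:
        return "CR"    # old Mac style
    if b"\n" in raw_bytes:
        return "LF"    # Unix
    return None


def to_unix_line_endings(raw_bytes):
    """Convert Windows line endings to Unix ones."""
    return raw_bytes.replace(b"\r\n", b"\n")
```

When a CRLF file is split on LF only, the last field of each line keeps a trailing carriage return (3.5 followed by '0D'x), which is not a valid number. That would explain why adding a comma at the end of each line makes the problem go away.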