Plink: export subset of data into txt or csv

I performed a GWAS in PLINK and now I would like to look at the data for a small set of SNPs, listed one per line in a file called snps.txt.
I would like to export the data from PLINK for these specific SNPs into a .txt or .csv file. Ideally, this file would have the individual IDs as well as the genotypes for these SNPs, so that I could later merge it with my phenotype file and perform additional analyses and plots.
Is there an easy way to do that? I know I can use --extract to request specific SNPs only, but I can't find a way to tell PLINK to export the data to an "exportable" text-based format.

If you are using classic plink (1.07), you should consider upgrading to plink 1.9. It is a lot faster and supports many more formats. This answer is for plink 1.9.
Turning binary plink data into a .csv file
It sounds like your problem is that you are unable to turn the binary data into a regular plink text file.
This is easy to do with the recode option. It should be used without any parameters to convert to the plink text format:
plink --bfile gwas_file --recode --extract snps.txt --out gwas_file_text
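For reference, --extract expects a plain text file with one variant ID per line. A made-up snps.txt might look like this (the rs numbers are placeholders):
rs123456
rs234567
rs345678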
If you want to convert the .ped data to a csv afterwards you could do the following:
cut -d " " -f2,7- --output-delimiter=, gwas_file_text.ped
This prints comma-delimited data with the individual IDs in the first column, followed by the genotypes.
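To save that output as a file rather than printing it to the terminal, add a shell redirect (the .csv filename here is just an example):
cut -d " " -f2,7- --output-delimiter=, gwas_file_text.ped > gwas_file_text.csv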
Turning plink data into other text based file formats
Note that you can also convert the data to a lot of other text-based filetypes, all described in the docs.
One of these is the common Variant Call Format (VCF), which puts the SNPs and individual IDs together in a single file, as requested:
plink --bfile gwas_file --recode vcf --extract snps.txt --out gwas_file_text
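Since the stated goal is merging genotypes with a phenotype file, the additive coding may also be worth a look: --recode A writes a .raw file with one row per individual and each genotype coded as a 0/1/2 allele count, which is easy to join on the ID columns. A minimal sketch:
plink --bfile gwas_file --recode A --extract snps.txt --out gwas_file_raw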


file "(...).csv" not Stata file error in using merge command

I use Stata 12.
I want to add some country code identifiers from file df_all_cities.csv onto my working data.
However, this line of code:
merge 1:1 city country using "df_all_cities.csv", nogen keep(1 3)
Gives me the error:
. run "/var/folders/jg/k6r503pd64bf15kcf394w5mr0000gn/T//SD44694.000000"
file df_all_cities.csv not Stata format
r(610);
This was an attempted solution to my previous problem: the file was a .dta file that did not work on this version of Stata, so I used R to convert it to .csv, but that doesn't work either. I assume it's because the command "using" itself doesn't work with csv files, but how should I write it instead?
Your intuition is right. The command merge cannot read a .csv file directly. (using is technically not a command here, it is a common syntax tag indicating a file path follows.)
You need to read the .csv file with the command insheet. You can use it like this:
* Preserve saves a snapshot of your data which is brought back at "restore"
preserve
* Read the csv file. clear can safely be used as data is preserved
insheet using "df_all_cities.csv", clear
* Create a tempfile where the data can be saved in .dta format
tempfile country_codes
save `country_codes'
* Bring back into working memory the snapshot saved at "preserve"
restore
* Merge your country codes from the tempfile to the data now back in working memory
merge 1:1 city country using `country_codes', nogen keep(1 3)
Note how insheet also takes a using clause, and that command does accept .csv files.

line feed within a column in csv

I have a csv like the one below. Some of the columns contain a line break, like column B. When I do wc -l file.csv, Unix returns 4, but these are actually 3 records. I don't want to replace the line breaks with spaces; I am going to load the data into a database using SQL*Loader and want to load the data as it is. What should I do so that Unix counts a record containing a line break as one record?
A,B,C,D
1,"hello
world",sds,sds
2,sdsd,sdds,sdds
Unless you're dealing with trivial cases (no quoted fields, no embedded commas, no embedded newlines, etc.), CSV data is best processed with tools that understand the format. Languages like perl and python have CSV parsing libraries available, there are packages like csvkit that provide useful utilities, and more.
Using csvstat from csvkit on your example:
$ csvstat --count foo.csv
Row count: 3
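If csvkit is not available, Python's standard csv module also understands quoted newlines; a one-liner sketch that counts the records (subtracting 1 for the header row):
$ python3 -c 'import csv; print(sum(1 for _ in csv.reader(open("foo.csv"))) - 1)'
3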

Create libsvm from multiple csv files for xgboost external memory training

I am trying to train an xgboost model using its external memory version, which takes a libsvm file as the training set. Right now, all the data is stored in a bunch of csv files which, combined, are far larger than the memory I have, say 70G (any single one of them can easily be read). I just wonder how to create one large libsvm file for xgboost, or whether there is any other workaround for this. Thank you.
If your csv files do not have headers, you can combine them with the Unix cat command.
Example:
> ls
file1.csv file2.csv
> cat *.csv > combined.csv
Now combined.csv is the concatenation of all the other files.
If all your csv files have headers, you'll want to do something trickier: keep the header from the first file and strip it from the rest by taking the last n-1 lines of each with tail, as sketched below.
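A minimal sketch of that header-aware concatenation, using the filenames from the example above:
# keep the header from the first file only
head -n 1 file1.csv > combined.csv
# append every file from its second line onward
for f in file1.csv file2.csv; do
  tail -n +2 "$f" >> combined.csv
done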
XGBoost supports csv as an input.
If you want to convert that to libsvm regardless, you can use phraug's scripts.

How to rename my hadoop result into a file with ".csv" extension

Actually my intention is to rename the output of a Hadoop job to a .csv file, because I need to visualize this csv data in RapidMiner.
In How can i output hadoop result in csv format it is said that, for this purpose, I need to follow these three steps:
1. Submit the MapReduce job
2. Extract the output from HDFS using shell commands
3. Merge the parts together, rename the result as ".csv", and place it in a directory where the visualization tool can access the final file
If so, how can I achieve this?
UPDATE
myjob.sh:
bin/hadoop jar /var/root/ALA/ala_jar/clsperformance.jar ala.clsperf.ClsPerf /user/root/ala_xmlrpt/Amrita\ Vidyalayam\,\ Karwar_Class\ 1\ B_ENG.xml /user/root/ala_xmlrpt-outputshell4
bin/hadoop fs -get /user/root/ala_xmlrpt-outputshell4/part-r-00000 /Users/jobsubmit
cat /Users/jobsubmit/part-r-00000 /Users/jobsubmit/output.csv
showing:
The CSV file was empty and couldn’t be imported.
when I tried to open output.csv.
Solution:
cat /Users/jobsubmit/part-r-00000 > /Users/jobsubmit/output.csv
First you need to retrieve the MapReduce result from HDFS:
hadoop dfs -copyToLocal path_to_result/part-r-* local_path
Then cat them into a single file:
cat local_path/part-r-* > result.csv
Then it depends on your MapReduce result format: if it is already comma-separated, you are done. If not, you will probably have to use another tool like sed or awk to transform it into csv.
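For example, with the default TextOutputFormat, the key and value in each part file are separated by a tab, so translating tabs to commas is enough (a sketch; adjust it to your actual output):
cat local_path/part-r-* | tr '\t' ',' > result.csv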

Can SAS convert CSV files into Binary Format?

The output we need to produce is a standard delimited file but instead of ascii content we need binary. Is this possible using SAS?
Is there a specific binary format you need, or just something non-ascii? If you're using proc export, you're probably limited to whatever formats are available. However, you can always create the csv manually.
If anything will do, you could simply zip the csv file.
Running on a *nix system, for example, you'd use something like:
filename outfile pipe "gzip -c > myfile.csv.gz";
Then create the csv manually:
data _null_;
  set mydata;
  file outfile;
  put var1 "," var2 "," var3;
run;
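To spot-check the compressed file from the shell afterwards, it can be listed without unzipping it:
gzip -dc myfile.csv.gz | head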
If this is PC/Windows SAS, I'm not as familiar, but you'll probably need to install a command-line zip utility.
This link from SAS suggests using winzip, which has a freely downloadable version. Otherwise, the code is similar.
http://support.sas.com/kb/26/011.html
You can actually make a CSV file as a SAS catalog entry; CSV is a valid SAS Catalog entry type.
Here's an example:
filename of catalog "sasuser.test.class.csv";
proc export data=sashelp.class
outfile=of
dbms=dlm;
delimiter=',';
run;
filename of clear;
This little piece of code exports SASHELP.CLASS to a SAS Catalog entry of entry type CSV.
This way you get a binary format you can move between SAS installations on different platforms with PROC CPORT/CIMPORT, without having to worry whether some external binary package format is available to your SAS session, since it's an internal SAS format.
Are you saying you have binary data that you want to output to csv?
If so, I don't think there is necessarily a defined standard for how this should be handled.
I suggest trying it (proc export comes to mind) and seeing if the results match your expectations.
Using SAS, output a .csv file; Open it in Excel and Save As whichever format your client wants. You can automate this process with a little bit of scripting in ### as well. (Substitute ### with your favorite scripting language.)