Is there a tool or script to split a phased VCF into two separate haploid VCFs, one for each haplotype? (Linux)

I have a phased .vcf file generated by longshot from a MinION sequencing run of diploid, human DNA. I would like to be able to split the file into two haploid files, one for haplotype 1, one for haplotype 2.
Do any of the VCF toolkits provide this function out of the box?
3 variants from my file:
##fileformat=VCFv4.2
##source=Longshot v0.4.0
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth of reads passing MAPQ filter">
##INFO=<ID=AC,Number=R,Type=Integer,Description="Number of Observations of Each Allele">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">
##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phase Set">
##FORMAT=<ID=UG,Number=1,Type=String,Description="Unphased Genotype (pre-haplotype-assembly)">
##FORMAT=<ID=UQ,Number=1,Type=Float,Description="Unphased Genotype Quality (pre-haplotype-assembly)">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE
chr1 161499264 . G C 500.00 PASS DP=55;AC=27,27 GT:GQ:PS:UG:UQ 0|1:500.00:161499264:0/1:147.24
chr1 161502368 . A G 500.00 PASS DP=43;AC=4,38 GT:GQ:PS:UG:UQ 1/1:342.00:.:1/1:44.91
chr1 161504083 . A C 346.17 PASS DP=39;AC=19,17 GT:GQ:PS:UG:UQ 1|0:346.17:161499264:0/1:147.24

To extract haplotypes from a phased VCF file, you can use samplereplay from RTG Tools to generate the haplotype SDF file; then sdf2sam, sdf2fasta, and sdf2fastq to obtain the corresponding files of phased haplotypes.
Edit: I hadn't noticed that you need a haploid VCF file. The method above should still work if you first convert to SAM and then back to a VCF.

I didn't find a tool, so I coded something (not pretty, but it works):
awk '{if ($1 ~ /^##/) print; \
else if ($1=="#CHROM") { ORS="\t"; for (i=1;i<10;i++) print $i; \
for (i=10;i<NF;i++) {print $i"_A\t"$i"_B"}; ORS="\n"; print $NF"_A\t"$NF"_B"} \
else {ORS="\t"; for (i=1;i<10;i++) print $i; \
for (i=10;i<NF;i++) print substr($i,1,1)"\t"substr($i,3,1); \
ORS="\n"; print substr($NF,1,1)"\t"substr($NF,3,1)"\n"} }' VCF_FILE
The first line prints the header lines.
On the third line I duplicate the name of each individual (as NAME_A and NAME_B, but you can change that).
On the fifth line I keep only the GT, using substr().
If you want to keep the other fields you can use substr() as well.
For example, substr($i,1,1) substr($i,4,100) keeps the first allele of the GT together with the remaining sample fields.
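If you prefer Python over awk, here is a minimal sketch of the same idea. It is not from the original thread; it assumes a single-sample VCF with GT present in the FORMAT column, and the function and file names are illustrative:
#!/usr/bin/env python3
# Sketch: split a phased, single-sample VCF into two haploid VCFs by keeping
# the first or the second allele of the GT field.
import sys

def split_haplotypes(vcf_path, out1_path, out2_path):
    with open(vcf_path) as vcf, open(out1_path, "w") as out1, open(out2_path, "w") as out2:
        for line in vcf:
            if line.startswith("#"):              # header lines go to both outputs
                out1.write(line)
                out2.write(line)
                continue
            fields = line.rstrip("\n").split("\t")
            fmt = fields[8].split(":")
            sample = fields[9].split(":")
            gt_idx = fmt.index("GT")
            gt = sample[gt_idx]
            if "|" in gt:                         # phased: first allele -> haplotype 1, second -> haplotype 2
                alleles = gt.split("|")
            else:                                 # unphased: kept as-is in both outputs here;
                alleles = [gt, gt]                # you may prefer to drop or handle these differently
            for allele, out in zip(alleles[:2], (out1, out2)):
                hap_sample = sample[:]
                hap_sample[gt_idx] = allele
                hap_fields = fields[:]
                hap_fields[9] = ":".join(hap_sample)
                out.write("\t".join(hap_fields) + "\n")

if __name__ == "__main__":
    split_haplotypes(sys.argv[1], sys.argv[2], sys.argv[3])
For example, python3 split_haplotypes.py phased.vcf hap1.vcf hap2.vcf would write one VCF per haplotype.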

Get difference between two csv files based on column using bash

I have two CSV files, a.csv and b.csv; both come with no headers, and each value in a row is separated by \t.
a.csv:
1 apple
2 banana
3 orange
4 pear
b.csv:
apple 0.89
banana 0.57
cherry 0.34
I want to subtract these two files, i.e. take the difference between the second column of a.csv and the first column of b.csv, something like a.csv[1] - b.csv[0], which would give me another file c.csv that looks like:
orange
pear
Instead of using Python or another programming language, I want to use a bash command to complete this task. I found that awk would be helpful, but I'm not sure how to write the correct command. There is a similar question, but its second answer uses awk '{print $2,$6-$13}' to get the arithmetic difference between values rather than the difference in occurrences.
Thanks, and I appreciate any help.
You can easily do this with Steve's answer from the link you are referring to, with a bit of tweaking. I'm not sure the other answer, using paste, will solve this problem.
Create a hash map from the second file, b.csv, and compare it against the 2nd column of a.csv:
awk -v FS="\t" 'BEGIN { OFS = FS } FNR == NR { unique[$1]; next } !($2 in unique) { print $2 }' b.csv a.csv
To redirect the output to a new file, append > c.csv at the end of the previous command.
Set the field separators (input and output) to \t, since you are reading a tab-delimited file.
The FNR == NR { action; } { action } f1 f2 pattern is a general construct you find in many awk commands when you need to act on more than one file. The block right after FNR == NR is executed on the first file argument provided, and the next block within {..} runs on the second file argument.
The part unique[$1]; next creates a hash map, unique, whose keys are the values in the first column of b.csv; this block runs for every line of that file.
After that file is completely processed, on the next file, a.csv, we apply !($2 in unique), which selects those lines whose $2 is not among the keys of the unique hash map built from the first file.
For those lines we print only the second column: { print $2 }.
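For comparison, the same two-pass logic can be sketched in plain Python; this is not part of the original answer, and the file names simply follow the question:
#!/usr/bin/env python3
# Sketch of the awk command above: collect the keys of b.csv into a set,
# then print the second column of a.csv for rows whose value is absent.

# Build the lookup set from the first column of b.csv.
with open("b.csv") as b:
    seen = {line.rstrip("\n").split("\t")[0] for line in b}

# Write the second column of a.csv to c.csv for rows not found in the set.
with open("a.csv") as a, open("c.csv", "w") as c:
    for line in a:
        value = line.rstrip("\n").split("\t")[1]
        if value not in seen:
            c.write(value + "\n")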
Assuming your real data is sorted on the columns you care about like your sample data is:
$ comm -23 <(cut -f2 a.tsv) <(cut -f1 b.tsv)
orange
pear
This uses comm to print out the entries in the first file that aren't in the second one, after using cut to get just the columns you care about.
If not already sorted:
comm -23 <(cut -f2 a.tsv | sort) <(cut -f1 b.tsv | sort)
If you want to use Miller (https://github.com/johnkerl/miller), a clean and easy tool, the command could be
mlr --nidx --fs "\t" join --ul --np -j join -l 2 -r 1 -f 01.txt then cut -f 2 02.txt
It gives you
orange
pear
It's a join that does not emit paired records but does emit unpaired records from the left file.

How to split text file into multiple files and extract filename from line prefix?

I have a simple log file with content like:
1504007980.039:{"key":"valueA"}
1504007990.359:{"key":"valueB", "key2": "valueC"}
...
That I'd like to output to multiple files that each have as content the JSON part that comes after the timestamp. So I would get as a result the files:
1504007980039.json
1504007990359.json
...
This is similar to How to split one text file into multiple *.txt files?, but here the name of each file should be extracted from its line (with the extra dot removed), not generated from an index.
Preferably I'd want a one-liner that can be executed in bash.
Since you aren't using GNU awk, you need to close output files as you go to avoid the "too many open files" error. To avoid that, along with issues caused by specific values in your JSON and undefined behavior during output redirection, this is what you need:
awk '{
    fname = $0
    sub(/\./,"",fname)
    sub(/:.*/,".json",fname)
    sub(/[^:]+:/,"")
    print >> fname
    close(fname)
}' file
You can of course squeeze it onto 1 line if you see some benefit to that:
awk '{f=$0;sub(/\./,"",f);sub(/:.*/,".json",f);sub(/[^:]+:/,"");print>>f;close(f)}' file
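If awk isn't a hard requirement, the same per-line split can also be sketched in Python; this is a rough equivalent, not part of the original answers, and the input file name file is just the placeholder used above:
#!/usr/bin/env python3
# Split a timestamp:JSON log into one .json file per line, naming each file
# after the timestamp with its dot removed (same idea as the awk commands above).

with open("file") as log:                    # input file name as in the awk example
    for line in log:
        stamp, _, payload = line.rstrip("\n").partition(":")
        if not payload:                      # skip lines without a ':' separator
            continue
        fname = stamp.replace(".", "") + ".json"
        with open(fname, "w") as out:        # one output file per input line
            out.write(payload + "\n")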
awk solution:
awk '{ idx=index($0,":"); fn=substr($0,1,idx-1)".json"; sub(/\./,"",fn);
print substr($0,idx+1) > fn; close(fn) }' input.log
idx=index($0,":") - capturing index of the 1st :
fn=substr($0,1,idx-1)".json" - preparing filename
Viewing results (for 2 sample lines from the question):
for f in *.json; do echo "$f"; cat "$f"; echo; done
The output (filename -> content):
1504007980039.json
{"key":"valueA"}
1504007990359.json
{"key":"valueB"}

searching .CSV file with AWK - only working with first row

I've been trying to search through a specific column of a .csv file to find cells containing a particular word. However, it only works for the first row (i.e. the headings) of my .csv file.
The file is a series of over 10,000 forum posts, with column 1 as the post key and column 2 as the post text. The headings, shown below, are 'key' and 'annotated sentence'.
key,annotated sentence
"(212, 2)","Got evidence to back that up??
I'm not sure how a stoner's worse than an alcoholic really.
-Wez"
"(537, 5)","Forgive me for laughing; no, not really ha, ha, ha ha ha
Could it be that people here as well as Canada and the rest of the world has figured out your infantile ""grading system of States"" is a complete sham and your very reason for existing is but an anti-constitutional farce and has lost any claims you have or will make? You stand alone now brady, with simply a few still clinging to the false hope of having others pay for your failures and unConstitutional stance so you can sit on your hands while you keep harping on overturning the 2A."
"(595, 0)",So you're actually claiming that it is a lie to say that the UK has a lower gun crime rate than the US? Even if the police were miscounting crimes it's still a huge and unjustified leap in logic to conclude from that that the UK does not have a lower gun crime rate.
"(736, 3)","The anti-abortionists claim a load of **** on many issues. I don't listen to them. To put the ""life"" of an unfertilized egg above that of a person is grotesquely sick IMO. I support any such stem cell research wholeheartedly."
The CSV separator is a comma, and the text delimiter is ".
If I try:
awk -F, '$1 ~ /key/ {print}' posts_file.csv > output_file.csv
it will output the headings row no problem. However, I have tried:
awk -F, '$1 ~ /212/ {print}' posts_file.csv > output_file.csv
awk -F, '$2 ~ /Canada/ {print}' posts_file.csv > output_file.csv
and neither of these works: no matches are found, though there should be. I can't figure out why. Any ideas? Thanks in advance.
awk to the rescue!
In general, complex CSV doesn't work with awk, but in your case, since key and annotated sentence have very distinct value types, you can extend your pattern search to the whole record instead of to key and value. The trick is defining the record, which, again based on your format, can be done as well. For example:
$ awk -v RS='\n"' '/Canada/{print RT $0}' csv
"(537, 5)","Forgive me for laughing; no, not really ha, ha, ha ha ha
Could it be that people here as well as Canada and the rest of the world has figured out your infantile ""grading syst
em of States"" is a complete sham and your very reason for existing is but an anti-constitutional farce and has lost a
ny claims you have or will make? You stand alone now brady, with simply a few still clinging to the false hope of havi
ng others pay for your failures and unConstitutional stance so you can sit on your hands while you keep harping on ove
rturning the 2A."
and this
$ awk -v RS='\n"' '/(212, 2)/{print RT $0}' csv
"(212, 2)","Got evidence to back that up??
I'm not sure how a stoner's worse than an alcoholic really.
-Wez"
Python's CSV parsing supports your format out of the box.
Below is a simple script that you could call as follows:
# csvfilter <1-basedFieldNdx> <regexStr> < <infile> > <outfile>
csvfilter 1 'key' < posts_file.csv > output_file.csv
csvfilter 1 '212' < posts_file.csv > output_file.csv
csvfilter 2 'Canada' < posts_file.csv > output_file.csv
Sample script csvfilter:
#!/usr/bin/env python
# coding=utf-8
import csv, sys, re
# Assign arguments to variables.
fieldNdx = int(sys.argv[1]) - 1  # Index of field to filter; Python arrays are 0-based!
reStr = sys.argv[2] if (len(sys.argv) > 2) else ''  # Filter regex
# Read from stdin...
reader = csv.reader(sys.stdin)
# ... write to stdout.
writer = csv.writer(sys.stdout, reader.dialect)
# Read each line...
for row in reader:
    # Match the target field against the filter regex and
    # print the row only if it matches.
    if (re.search(reStr, row[fieldNdx])):
        writer.writerow(row)
OpenRefine could help with the search.
One way to use awk safely with complex CSV is to use a "csv2tsv" utility to convert the CSV file to a format that can be handled properly by awk.
Usually the TSV ("tab-separated values") format is just right for the job.
(If the final output must be CSV, then either a complementary "tsv2csv" utility can be used, or awk itself can do the job -- though some care may be required to get it exactly right.)
So the pipeline might look like this:
csv2tsv < input.csv | awk -F\\t 'BEGIN{OFS=FS} ....' | tsv2csv
There are several alternatives for csv-to-tsv conversion, ranging from roll-your-own scripts to Excel, but I'd recommend taking the time to check that whichever tool or toolset you select satisfies the "edge case" requirements that are of interest to you.
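As one illustration of the roll-your-own option, a minimal csv-to-tsv converter could be sketched in Python as below; it is only a sketch, and it flattens embedded tabs and newlines to spaces so the output stays strictly one record per line (whether that is acceptable depends on your data):
#!/usr/bin/env python3
# Minimal CSV-to-TSV sketch: read CSV on stdin, write TSV on stdout.
import csv
import sys

for row in csv.reader(sys.stdin):
    # Replace embedded tabs/newlines so every record stays on a single line.
    cleaned = [field.replace("\t", " ").replace("\n", " ") for field in row]
    sys.stdout.write("\t".join(cleaned) + "\n")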

splitting CSV file by columns

I have a really huge CSV file. There are about 1700 columns and 40000 rows, like below:
x,y,z,x1,x2,x3,x4,x5,x6,x7,x8,x9,...(about 1700 more)...,x1700
0,0,0,a1,a2,a3,a4,a5,a6,a7,a8,a9,...(about 1700 more)...,a1700
1,1,1,b1,b2,b3,b4,b5,b6,b7,b8,b9,...(about 1700 more)...,b1700
// (about 40000 more rows below)
I need to split this CSV file into multiple files, each containing a smaller number of columns, like:
# file1.csv
x,y,z
0,0,0
1,1,1
... (about 40000 more rows below)
# file2.csv
x1,x2,x3,x4,x5,x6,x7,x8,x9,...(about 1000 more)...,x1000
a1,a2,a3,a4,a5,a6,a7,a8,a9,...(about 1000 more)...,a1000
b1,b2,b3,b4,b5,b6,b7,b8,b9,...(about 1000 more)...,b1000
// (about 40000 more rows below)
#file3.csv
x1001,x1002,x1003,x1004,x1005,...(about 700 more)...,x1700
a1001,a1002,a1003,a1004,a1005,...(about 700 more)...,a1700
b1001,b1002,b1003,b1004,b1005,...(about 700 more)...,b1700
// (about 40000 more rows below)
Is there any program or library that does this?
I've googled for it, but the programs I found only split a file by rows, not by columns.
Or which language could I use to do this efficiently?
I can use R, shell script, Python, C/C++, or Java.
A one-line-per-file solution for your example data and desired output:
cut -d, -f -3 huge.csv > file1.csv
cut -d, -f 4-1003 huge.csv > file2.csv
cut -d, -f 1004- huge.csv > file3.csv
The cut program is available on most POSIX platforms and is part of GNU Core Utilities. There is also a Windows version.
Update in Python, since the OP asked for a program in an acceptable language:
# python 3 (or python 2, if you must)
import csv
import fileinput
output_specifications = (  # csv file name, selector function
    ('file1.csv', slice(3)),
    ('file2.csv', slice(3, 1003)),
    ('file3.csv', slice(1003, 1703)),
)
output_row_writers = [
    (
        # Python 3 needs text mode with newline=''; on Python 2 use open(file_name, 'wb') instead.
        csv.writer(open(file_name, 'w', newline=''), quoting=csv.QUOTE_MINIMAL).writerow,
        selector,
    ) for file_name, selector in output_specifications
]
reader = csv.reader(fileinput.input())
for row in reader:
    for row_writer, selector in output_row_writers:
        row_writer(row[selector])
This works with the sample data given and can be called with the input.csv as an argument or by piping from stdin.
Use a small Python script like:
fin = 'file_in.csv'
fout1 = 'file_out1.csv'
fout1_fd = open(fout1, 'w')
# ... open the other output files the same way
lines = []
with open(fin) as fin_fd:
    lines = fin_fd.read().split('\n')
for l in lines:
    l_arr = l.split(',')
    fout1_fd.write(','.join(l_arr[0:3]))
    fout1_fd.write('\n')
    # ... write the remaining column slices to the other output files
fout1_fd.close()
# ... close the other output files
You can open the file in Microsoft Excel, delete the extra columns, and save as CSV for file #1. Repeat the same procedure for the other two files.
I usually use OpenOffice (or Microsoft Excel, in case you are using Windows) to do that without writing any program: change the file and save it. The following are two useful links showing how to do that.
https://superuser.com/questions/407082/easiest-way-to-open-csv-with-commas-in-excel
http://office.microsoft.com/en-us/excel-help/import-or-export-text-txt-or-csv-files-HP010099725.aspx

Use LibreOffice to convert HTML to PDF from Mac command in terminal?

I'm trying to convert an HTML file to a PDF using the Mac terminal.
I found a similar post and used the code it provided, but I kept getting nothing: I did not find the output file anywhere when I issued this command:
./soffice --headless --convert-to pdf --outdir /home/user ~/Downloads/*.odt
I'm using Mac OS X 10.8.5.
Can someone show me a terminal command line that I can use to convert HTML to PDF?
I'm trying to convert an HTML file to a PDF using the Mac terminal.
Ok, here is an alternative way to do convert (X)HTML to PDF on a Mac command line. It does not use LibreOffice at all and should work on all Macs.
This method (ab)uses a filter from the Mac's print subsystem, called xhtmltopdf. This filter is usually not meant to be used by end-users but only by the CUPS printing system.
However, if you know about it, know where to find it and know how to run it, there is no problem with doing so:
The first thing to know is that it is not in any desktop user's $PATH. It is in /usr/libexec/cups/filter/xhtmltopdf.
The second thing to know is that it requires a specific syntax and order of parameters to run, otherwise it won't. If you call it with no parameters at all (or with the wrong number of parameters), it will emit a small usage hint:
$ /usr/libexec/cups/filter/xhtmltopdf
Usage: xhtmltopdf job-id user title copies options [file]
Most of these parameter names show that the tool is clearly related to printing. The command requires at least 5 parameters, with an optional 6th. If only 5 parameters are given, it reads its input from <stdin>; otherwise it reads from the 6th parameter, a file name. It always emits its output to <stdout>.
The only CLI params which are interesting to us are number 5 (the "options") and the (optional) number 6 (the input file name).
When we run it on the command line, we have to supply 5 dummy or empty parameters first, before we can put the input file's name. We also have to redirect the output to a PDF file.
So, let's try it:
/usr/libexec/cups/filter/xhtmltopdf "" "" "" "" "" my.html > my.pdf
Or, alternatively (this is faster to type and easier to check for completeness, using 5 dummy parameters instead of 5 empty ones):
/usr/libexec/cups/filter/xhtmltopdf 1 2 3 4 5 my.html > my.pdf
While we are at it, we could try to apply some other CUPS print subsystem filters to the output: /usr/libexec/cups/filter/cgpdftopdf looks like one that could be interesting. This additional filter expects the same number and order of parameters, like all CUPS filters.
So this should work:
/usr/libexec/cups/filter/xhtmltopdf 1 2 3 4 5 my.html \
| /usr/libexec/cups/filter/cgpdftopdf 1 2 3 4 "" \
> my.pdf
However, piping the output of xhtmltopdf into cgpdftopdf is only interesting if we try to apply some "print options". That is, we need to come up with some settings in parameter no. 5 which achieve something.
Looking up the CUPS command line options on the CUPS web page suggests a few candidates:
-o number-up=4
-o page-border=double-thick
-o number-up-layout=tblr
do look like they could be applied while doing a PDF-to-PDF transformation. Let's try:
/usr/libexec/cups/filter/xhtmltopdf 1 2 3 4 5 my.html \
| /usr/libexec/cups/filter/cgpdftopdf 1 2 3 4 \
"number-up=4 page-border=double-thick number-up-layout=tblr" \
> my.pdf
Here are two screenshots of results I achieved with this method. Both used as input two HTML files which were identical apart from one line: the line which referenced the CSS file to be used for rendering the HTML.
As you can see, the xhtmltopdf filter is able to (at least partially) take CSS settings into account when it converts its input to PDF.
Starting with LibreOffice 3.6.0.1, you would need unoconv on the system to convert documents.
Using unoconv with Mac OS X:
LibreOffice 3.6.0.1 or later is required to use unoconv under Mac OS X. This is the first version distributed with an internal Python script that works. No version of OpenOffice for Mac OS X (3.4 is the current version) works, because the necessary internal files are not included inside the application.
I just had the same problem, but I found this LibreOffice help post. It seems that headless mode won't work if you've got LibreOffice (the usual GUI version) running too. The fix is to add an -env option, e.g.
libreoffice "-env:UserInstallation=file:///tmp/LibO_Conversion" \
--headless \
--invisible \
--convert-to csv file.xls
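If you need to drive the same headless conversion from a script, a small Python wrapper along these lines may help; it is only a sketch, and the soffice path and the throwaway profile directory are assumptions you may need to adjust:
#!/usr/bin/env python3
# Sketch: call LibreOffice headless from Python, using a separate user profile
# so the conversion does not clash with an already-running GUI instance.
import subprocess
import sys

def convert(input_path, target_format="pdf", outdir="."):
    # Assumption: "soffice" is on PATH; on macOS it usually lives at
    # /Applications/LibreOffice.app/Contents/MacOS/soffice.
    cmd = [
        "soffice",
        "-env:UserInstallation=file:///tmp/LibO_Conversion",  # throwaway profile, as in the answer above
        "--headless",
        "--convert-to", target_format,
        "--outdir", outdir,
        input_path,
    ]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    convert(sys.argv[1])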