I have a phased .vcf file generated by longshot from a MinION sequencing run of diploid human DNA. I would like to be able to split the file into two haploid files: one for haplotype 1, one for haplotype 2.
Do any of the VCF toolkits provide this function out of the box?
3 variants from my file:
##fileformat=VCFv4.2
##source=Longshot v0.4.0
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth of reads passing MAPQ filter">
##INFO=<ID=AC,Number=R,Type=Integer,Description="Number of Observations of Each Allele">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">
##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phase Set">
##FORMAT=<ID=UG,Number=1,Type=String,Description="Unphased Genotype (pre-haplotype-assembly)">
##FORMAT=<ID=UQ,Number=1,Type=Float,Description="Unphased Genotype Quality (pre-haplotype-assembly)">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE
chr1 161499264 . G C 500.00 PASS DP=55;AC=27,27 GT:GQ:PS:UG:UQ 0|1:500.00:161499264:0/1:147.24
chr1 161502368 . A G 500.00 PASS DP=43;AC=4,38 GT:GQ:PS:UG:UQ 1/1:342.00:.:1/1:44.91
chr1 161504083 . A C 346.17 PASS DP=39;AC=19,17 GT:GQ:PS:UG:UQ 1|0:346.17:161499264:0/1:147.24
To extract haplotypes from phased vcf files, you can use samplereplay from RTGtools to generate the haplotype SDF file; then sdf2sam, sdf2fasta, and sdf2fastq to obtain corresponding files of phased haplotypes.
Edit: I hadn't noticed that you need a haploid VCF file. The method above should work if you first convert it to SAM and then back to a VCF.
I didn't find a tool, so I coded something (not pretty, but it works):
awk '{
  if ($1 ~ /^##/) print;
  else if ($1 == "#CHROM") {
    ORS="\t"; for (i=1; i<10; i++) print $i;
    for (i=10; i<NF; i++) print $i"_A\t"$i"_B";
    ORS="\n"; print $NF"_A\t"$NF"_B"
  } else {
    ORS="\t"; for (i=1; i<10; i++) print $i;
    for (i=10; i<NF; i++) print substr($i,1,1)"\t"substr($i,3,1);
    ORS="\n"; print substr($NF,1,1)"\t"substr($NF,3,1)
  }
}' VCF_FILE
The first branch prints the ## header lines unchanged.
In the #CHROM branch I duplicate the name of each individual (as NAME_A and NAME_B, but you can change that).
In the last branch I keep only the two GT alleles, using substr().
If you want to keep the other FORMAT fields as well, you can use substr() for that too.
For example: substr($i,1,1)substr($i,4,100) will keep the first allele of the GT plus the remaining fields.
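If a small script is easier to adapt than the awk one-liner, here is a rough Python sketch of the same idea, except that it writes two separate haploid VCFs instead of extra columns. The file names are illustrative, it assumes a single-sample VCF with GT present in FORMAT, and unphased genotypes such as 1/1 are split the same way as phased ones, which you may or may not want:

# Minimal sketch: split a single-sample phased VCF into two haploid VCFs.
# File names are illustrative; GT is assumed to be present in FORMAT.
vcf_in = "phased.vcf"
outs = [open("haplotype1.vcf", "w"), open("haplotype2.vcf", "w")]

with open(vcf_in) as fh:
    for line in fh:
        if line.startswith("#"):
            # header lines (## and #CHROM) go to both outputs unchanged
            for out in outs:
                out.write(line)
            continue
        fields = line.rstrip("\n").split("\t")
        fmt_keys = fields[8].split(":")          # e.g. GT:GQ:PS:UG:UQ
        sample = fields[9].split(":")
        gt_idx = fmt_keys.index("GT")
        alleles = sample[gt_idx].replace("/", "|").split("|")   # "0|1" -> ["0", "1"]
        for hap, out in enumerate(outs):
            new_sample = sample[:]
            new_sample[gt_idx] = alleles[hap] if hap < len(alleles) else alleles[-1]
            out.write("\t".join(fields[:9] + [":".join(new_sample)]) + "\n")

for out in outs:
    out.close()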
I'm working on stream silence detection.
It works with the following ffmpeg command:
ffmpeg -i http://mystream.com/stream -af silencedetect=n=-50dB:d=0.5 -f null - 2> log.txt
I would like to get a json output of the logfile.
There is a JSON option in ffprobe, but silencedetect=n=-50dB:d=0.5 isn't working there.
Help!
Cheers!
ffprobe is meant to probe container-level or stream-level metadata. silencedetect is a filter which analyses the content of decoded audio streams; its output isn't controlled by the choice of writer.
What you could do, since silencedetect also logs its result to metadata tags, is output just that data to a file.
ffmpeg -i http://mystream.com/stream -af silencedetect=n=-50dB:d=0.5,ametadata=print:file=log.txt -f null -
Output
frame:281 pts:323712 pts_time:6.744
lavfi.silence_start=6.244
frame:285 pts:328320 pts_time:6.84
lavfi.silence_end=6.84
lavfi.silence_duration=0.596
frame:413 pts:475776 pts_time:9.912
lavfi.silence_start=9.412
frame:1224 pts:1410048 pts_time:29.376
lavfi.silence_end=29.376
lavfi.silence_duration=19.964
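If you then want JSON, the metadata log is easy to post-process yourself. Here is a rough Python sketch that turns a log like the one above into a JSON array of silence intervals (log.txt and the output field names are just illustrative choices):

import json

intervals = []
current = {}
with open("log.txt") as fh:          # the file written by ametadata=print:file=log.txt
    for line in fh:
        line = line.strip()
        if line.startswith("lavfi.silence_start="):
            current = {"start": float(line.split("=", 1)[1])}
        elif line.startswith("lavfi.silence_end="):
            current["end"] = float(line.split("=", 1)[1])
        elif line.startswith("lavfi.silence_duration="):
            current["duration"] = float(line.split("=", 1)[1])
            intervals.append(current)
            current = {}

print(json.dumps(intervals, indent=2))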
When I use ffprobe against an animated GIF, I get, among other things, this:
> ffprobe.exe foo.gif
. . .
Stream #0:0: Video: gif, bgra, 500x372, 6.67 fps, 6.67 tbr, 100 tbn, 100 tbc
Great; this tells me the frame rate is 6.67 frames per second. But I'm going to be using this in a program and want it in a parsed format. ffprobe does json, but when I use it:
> ffprobe.exe -show_streams -of json foo.gif
The json shows:
"r_frame_rate": "20/3",
"avg_frame_rate": "20/3",
But I want the decimal form 6.67 instead of 20/3. Is there a way to have FFProbe produce its JSON output in decimal? I can't seem to find it in the docs.
My platform is Windows; FFProbe is version N-68482-g92a596f.
I did look into using ImageMagick, but the GIF file in question is corrupted (I'm working on a simple repair program); IM's "identify" command halts on it, while FFMpeg & FFProbe handle it just fine.
Addition: this is kind of academic now; I just used (in Python):
framerate_as_decimal = "%4.2f" % (float(fractions.Fraction(framerate_as_fraction)))
But I'm still kind of curious if there's an answer.
I know this is a bit of an old question, but today I tried to do the same thing and found two options:
You can use the subprocess module in Python together with mediainfo: fps = float(subprocess.check_output('mediainfo --Inform="Video;%FrameRate%" input.mp4', shell=True)); the returned value is a string (bytes on Python 3, so decode it first), which is why I am converting it to float. Unfortunately I wasn't able to execute the same thing without shell=True, but perhaps I am missing something.
Using ffprobe: ffprobe -v error -select_streams v:0 -show_entries stream=avg_frame_rate -of default=noprint_wrappers=1:nokey=1 input.mp4; here the problem is that the output is 50/1, or 20/3 in your case, so you need to split the output on "/" and then convert and divide the two elements of the list. Something like:
import subprocess

fps = subprocess.check_output(['ffprobe', '-v', 'error', '-select_streams', 'v:0', '-show_entries', 'stream=avg_frame_rate', '-of', 'default=noprint_wrappers=1:nokey=1', 'input.mp4'])
fps_lst = fps.decode().strip().split('/')
fps_real = float(fps_lst[0]) / int(fps_lst[1])
So the normal commands for getting the frame rate are:
ffprobe -v error -select_streams v:0 -show_entries stream=r_frame_rate -of default=noprint_wrappers=1:nokey=1 input.mp4 and mediainfo --Inform="Video;%FrameRate%" input.mp4
In Python, you can just use:
frame_rate_str = "15/3"
frame_rate = eval(frame_rate_str)
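If you'd rather not eval a string that came from an external tool, the fractions module already used in the question does the same job; a minimal sketch:

from fractions import Fraction

frame_rate_str = "20/3"
frame_rate = float(Fraction(frame_rate_str))   # 6.666...
print(round(frame_rate, 2))                    # 6.67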
I'm creating a Bash script to parse the air pollution levels from the webpage:
http://aqicn.org/city/beijing/m/
There is a lot of stuff in the file, but this is the relevant bit:
"iaqi":[{"p":"pm25","v":[59,21,112],"i":"Beijing pm25 (fine
particulate matter) measured by U.S Embassy Beijing Air Quality
Monitor
(\u7f8e\u56fd\u9a7b\u5317\u4eac\u5927\u4f7f\u9986\u7a7a\u6c14\u8d28\u91cf\u76d1\u6d4b).
Values are converted from \u00b5g/m3 to AQI levels using the EPA
standard."},{"p":"pm10","v":[15,5,69],"i":"Beijing pm10
(respirable particulate matter) measured by Beijing Environmental
Protection Monitoring Center
I want the script to parse and display 2 numbers: the current PM2.5 and PM10 levels (59 and 15 in the excerpt above).
CITY="beijing"
AQIDATA=$(wget -q http://aqicn.org/city/$CITY/m/ -O -)
PM25=$(awk -v FS="(\"p\":\"pm25\",\"v\":\\\[|,[0-9]+)" '{print $2}' <<< $AQIDATA)
PM100=$(awk -v FS="(\"p\":\"pm10\",\"v\":\\\[|,[0-9]+)" '{print $2}' <<< $AQIDATA)
echo $PM25 $PM100
Even though I can get PM2.5 levels to display correctly, I cannot get PM10 levels to display. I cannot understand why, because the strings are similar.
Anyone here able to explain?
The following approach is based on two steps:
(1) Extracting the relevant JSON;
(2) Extracting the relevant information from the JSON using a JSON-aware tool -- here jq.
(1) Ideally, the web service would provide a JSON API that would allow one to obtain the JSON directly, but as the URL you have is intended for viewing with a browser, some form of screen-scraping is needed. There is a certain amount of brittleness to such an approach, so here I'll just provide something that currently works:
wget -O - http://aqicn.org/city/beijing/m |
gawk 'BEGIN{RS="function"}
$1 ~/getAqiModel/ {
sub(/.*var model=/,"");
sub(/;return model;}/,"");
print}'
(gawk or an awk that supports multi-character RS can be used; if you have another awk, then first split on "function", using e.g.:
sed $'s/function/\\\n/g' # three backslashes )
The output of the above can be piped to the following jq command, which performs the filtering envisioned in (2) above.
(2)
jq -c '.iaqi | .[]
| select(.p? =="pm25" or .p? =="pm10") | [.p, .v[0]]'
The result:
["pm25",59]
["pm10",15]
I think your problem is that you have a single line HTML file that contains a script that contains a variable that contains the data you are looking for.
Your field delimiters are either "p":"pm10","v":[ (or the pm25 equivalent) or a comma followed by some digits.
For pm25 this works, because it comes first and there are no occurrences of ,21 or anything similar before it.
However, for pm10, there are such matches associated with pm25 ahead of it, so the second field is the empty string between ,21 and ,112.
karakfa has a hack that seems to work, but he doesn't explain very well why it works.
What he does is take awk's record separator (which is usually a newline) and set it so that any of :, ,, or [ splits records. So in your case, one of the records is "pm25", because it is preceded by a colon, which is a separator, and followed by a comma, also a separator.
Once awk hits the matching record ("pm25") it sets a counter to 4. Then, for this and the following records, it counts that counter down: "pm25" itself, "v", the empty string between : and [, and finally the record holding the number you want. The condition evaluates as 4 && !3 (false), 3 && !2 (false), 2 && !1 (false), and then 1 && !0 (true). Since there is no action block, awk simply prints that record, which is the value you want.
A more robust approach would probably be to use XPath to find the script, then use a JSON parser or similar to get the value.
chw21's helpful answer explains why your approach didn't work.
peak's helpful answer is the most robust, because it employs proper JSON parsing.
If you don't want to or can't use third-party utility jq for JSON parsing, I suggest using sed rather than awk, because awk is not a good fit for field-based parsing of this data.
$ sed -E 's/^.*"pm25"[^[]+\[([0-9]+).+"pm10"[^[]+\[([0-9]+).*$/\1 \2/' <<< "$AQIDATA"
59 15
The above should work with both GNU and BSD/OSX sed.
To read the result into variables:
read pm25 pm10 < \
<(sed -E 's/^.*"pm25"[^[]+\[([0-9]+).+"pm10"[^[]+\[([0-9]+).*$/\1 \2/' <<< "$AQIDATA")
Note how I've chosen lowercase variable names, because it's best to avoid all upper-case variables in shell programming, so as to avoid conflicts with special shell and environment variables.
If you can't rely on the order of the values in the source string, use two separate sed commands:
pm25=$(sed -E 's/^.*"pm25"[^[]+\[([0-9]+).*$/\1/' <<< "$AQIDATA")
pm10=$(sed -E 's/^.*"pm10"[^[]+\[([0-9]+).*$/\1/' <<< "$AQIDATA")
awk to the rescue!
If you have to, you can use this hacky way using smart counters with hand-crafted delimiters. Setting RS instead of FS transfers looping through fields to awk itself. Multi-char RS is not available for all awks (gawk supports it).
$ awk -v RS='[:,[]' '$0=="\"pm25\""{c=4} c&&!--c' file
59
$ awk -v RS='[:,[]' '$0=="\"pm10\""{c=4} c&&!--c' file
15
I have a 10 GB gzip archive (about 60 GB uncompressed).
Is there a way to decompress this file with multithreading, and split the output on the fly into parts of 1 GB each (or n lines per part, maybe)?
If I do something like this:
pigz -dc 60GB.csv.gz | dd bs=8M skip=0 count=512 of=4G-part-1.csv
I can get a 4 GB file, but it doesn't care about always starting from the next line, so the lines in my files won't be terminated properly.
Also, as I noticed, my GCE instance with a persistent disk has a maximum block size of 33 kB, so I can't actually use a command like the one above, but have to write something like:
pigz -dc 60GB.csv.gz | dd bs=1024 skip=0 count=4194304 of=4G-part-1.csv
pigz -dc 60GB.csv.gz | dd bs=1024 skip=4194304 count=4194304 of=4G-part-2.csv
pigz -dc 60GB.csv.gz | dd bs=1024 skip=$((4194304*2)) count=4194304 of=4G-part-3.csv
So I have to use some trick to always start a file from a new line.
UPDATE:
zcat 60GB.csv.gz |awk 'NR%43000000==1{x="part-"++i".csv";}{print > x}'
did the trick.
Based on the sizes you mention in your question, it looks like you get about 6-to-1 compression. That doesn't seem great for text, but anyway...
As Mark states, you can't just dip mid-stream into your gz file and expect to land on a new line. Your dd options won't work because dd only copies bytes; it doesn't pay attention to line boundaries. If indexing is out of scope for this, the following command line solution might help:
$ gzcat 60GB.csv.gz | awk -v n=1 '!(NR%20000000){n++} {print|("gzip>part-"n".gz")}'
This decompresses your file so that we can count lines, then processes the stream, changing the output file name every 20000000 lines. You can adjust your recompression options where you see "gzip" in the code above.
If you don't want your output to be compressed, you can simplify the last part of the line:
$ gzcat 60GB.csv.gz | awk -v n=1 '!(NR%3500000){n++} {print>("part-"n".csv")}'
You might have to play with the number of lines to get something close to the file size you're aiming for.
Note that if your shell is csh/tcsh, you may have to escape the exclamation point in the awk script to avoid it being interpreted as a history reference.
UPDATE:
If you'd like to get status of what the script is doing, awk can do that. Something like this might be interesting:
$ gzcat 60GB.csv.gz | awk -v n=1 '!(NR%3500000){n++} !(NR%1000){printf("part=%d / line=%d\r",n,NR)} {print>("part-"n".csv")}'
This should show you the current part and line number every thousand lines.
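If you prefer to do the splitting in Python rather than awk, the same line-based approach is only a few lines with the standard gzip module (decompression is single-threaded here, unlike pigz, and the part size is illustrative):

import gzip

LINES_PER_PART = 20000000     # tune this to hit roughly the part size you want

part = 0
out = None
with gzip.open("60GB.csv.gz", "rt") as fh:
    for lineno, line in enumerate(fh):
        if lineno % LINES_PER_PART == 0:   # start each part on a line boundary
            if out:
                out.close()
            part += 1
            out = open("part-%d.csv" % part, "w")
        out.write(line)
if out:
    out.close()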
Unless it was especially prepared for such an operation, or unless an index was built for that purpose, then no. The gzip format inherently requires the decompression of the data before any point in the stream, in order to decompress data after that point in the stream. So it cannot be parallelized.
The way out is to either a) recompress the gzip file with synchronization points and save those locations, or b) go through the entire gzip file once and create another file of entry points with the previous context at those points.
For a), zlib provides Z_FULL_FLUSH operations that insert synchronization points in the stream from which you can start decompression with no previous history. You would want to create such points sparingly, since they degrade compression.
For b), zran.c provides an example of how to build an index into a gzip file. You need to go through the stream once in serial order to build the index, but having done so, you can then start decompression at the locations you have saved.