I'm writing a Snakemake pipeline to take publicly available SRA files, convert them to FASTQ files, and then run them through alignment, peak calling and LD score regression.
I'm having an issue with the rule called SRA2fastq below, in which I use parallel-fastq-dump to convert SRA files to paired-end FASTQ files. This rule generates two outputs for each SRA file: SRRXXXXXXX_1 and SRRXXXXXXX_2.
Here is my config file:
samples:
  fullard2018_NpfcATAC_1: SRR5367824
  fullard2018_NpfcATAC_2: SRR5367798
  fullard2018_NpfcATAC_3: SRR5367778
  fullard2018_NpfcATAC_4: SRR5367754
  fullard2018_NpfcATAC_5: SRR5367729
And here are the first few rules of my Snakefile:
# read config info into this namespace
configfile: "config.yaml"
print(config['samples'])

rule all:
    input:
        expand("fastq_files/{SRA}_{num}.fastq.gz", SRA=[config['samples'][x] for x in config['samples']], num=[1,2]),
        expand("FastQC/{SRA}_{num}_fastqc.html", SRA=[config['samples'][x] for x in config['samples']], num=[1,2]),
        "FastQC/fastq_multiqc.html",
        expand("peak_files/{sample}_peaks.blrm.narrowPeak", sample=config['samples']),
        "peak_files/Fullard2018_peaks.mrgd.blrm.narrowPeak",
        expand("LD_annotation_files/Fullard_2018.{chr}.l2.ldscore.gz", chr=range(1,23))

rule SRA_prefetch:
    params:
        SRA="{SRA}"
    output:
        "/home/c1477909/ncbi/public/sra/{SRA}.sra"
    log:
        "logs/prefetch/{SRA}.log"
    shell:
        "prefetch {params.SRA}"

rule SRA2fastq:
    input:
        "/home/c1477909/ncbi/public/sra/{SRA}.sra"
    output:
        "fastq_files/{SRA}_1.fastq.gz",
        "fastq_files/{SRA}_2.fastq.gz"
    log:
        "logs/SRA2fastq/{SRA}.log"
    shell:
        """
        parallel-fastq-dump --sra-id {input} --threads 8 \
        --outdir fastq_files --split-files --gzip
        """

rule fastqc:
    input:
        rules.SRA2fastq.output
    output:
        # Output needs to end in '_fastqc.html' for multiqc to work
        html="FastQC/{SRA}_{num}_fastqc.html"
    log:
        "logs/FASTQC/{SRA}_{num}.log"
    wrapper:
        "0.27.1/bio/fastqc"

rule multiqc_fastq:
    input:
        lambda wildcards: expand("FastQC/{SRA}_{num}_fastqc.html", SRA=[config['samples'][x] for x in config['samples']], num=[1,2])
    output:
        "FastQC/fastq_multiqc.html"
    wrapper:
        "0.27.1/bio/multiqc"

rule bowtie2:
    input:
        sample=lambda wildcards: expand("fastq_files/{SRA}_{num}.fastq.gz", SRA=config['samples'][wildcards.sample], num=[1,2])
    output:
        "bam_files/{sample}.bam"
    log:
        "logs/bowtie2/{sample}.txt"
    params:
        index=config["index"],  # prefix of reference genome index (built with bowtie2-build)
        extra=""
    threads: 8
    wrapper:
        "0.27.1/bio/bowtie2/align"
However, when I run the Snakefile I get the following error:
Error in job SRA2fastq while creating output files fastq_files/SRR5367754_1.fastq.gz, fastq_files/SRR5367754_2.fastq.gz
I've seen this error many times before, and it's usually caused by the name of the output file generated by the program not exactly matching the output file name specified in the corresponding Snakemake rule. However, that is not the case here: if I run the command Snakemake generates for this particular rule on its own, the files are created as expected and the file names match. Here is an example of one instance of the rule, taken after running snakemake -np:
rule SRA2fastq:
    input: /home/c1477909/ncbi/public/sra/SRR5367779.sra
    output: fastq_files/SRR5367779_1.fastq.gz, fastq_files/SRR5367779_2.fastq.gz
    log: logs/SRA2fastq/SRR5367779.log
    jobid: 18
    wildcards: SRA=SRR5367779

parallel-fastq-dump --sra-id /home/c1477909/ncbi/public/sra/SRR5367779.sra --threads 8 --outdir fastq_files --split-files --gzip
Note the output files generated by the parallel-fastq-dump command run separately (i.e. not using snakemake) are named as specified in the SRA2fastq rule:
ls fastq_files
SRR5367729_1.fastq.gz SRR5367729_2.fastq.gz
I'm a bit stumped by this, as this error is usually easy to rectify, but I can't work out what the issue is here. I've tried changing the output section of the SRA2fastq rule to:
output:
    file1="fastq_files/{SRA}_1.fastq.gz",
    file2="fastq_files/{SRA}_2.fastq.gz"
However, this throws the same error. I've also tried specifying just one output file, but this affects the bowtie2 rule later on, as I then get a missing input files error.
Any ideas what's going on here? Is there something I'm missing when trying to look for multiple output files in a single rule?
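One thing I noticed while writing this up is that the rule declares a log file that the shell command never writes to; redirecting the tool's output there (a sketch, assuming bash is the shell interpreter) should at least capture the underlying error from parallel-fastq-dump:
shell:
    """
    parallel-fastq-dump --sra-id {input} --threads 8 \
    --outdir fastq_files --split-files --gzip \
    > {log} 2>&1
    """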
Many Thanks
Related
The following Snakefile fails with AmbiguousRuleException:
library_id = ['S1']
run_id = ['R1']
samples = dict(zip(library_id, run_id))

rule all:
    input:
        expand('{library_id}.bam', library_id=library_id),

rule bwa:
    output:
        '{run_id}.bam',

rule merge_bam:
    input:
        lambda wc: '%s.bam' % samples[wc.library_id],
    output:
        '{library_id}.bam',
Gives:
AmbiguousRuleException:
Rules bwa and merge_bam are ambiguous for the file S1.bam.
Consider starting rule output with a unique prefix, constrain your wildcards, or use the ruleorder directive.
Wildcards:
    bwa: run_id=S1
    merge_bam: library_id=S1
Expected input files:
    bwa:
    merge_bam: R1.bam
Expected output files:
    bwa: S1.bam
    merge_bam: S1.bam
That's expected and it's OK. However, if library_id and run_id have the same value, the ambiguity is not detected and only the first rule is executed:
library_id = ['S1']
run_id = ['S1'] # Same as library_id!
samples = dict(zip(library_id, run_id))

rule all:
    input:
        expand('{library_id}.bam', library_id=library_id),

rule bwa:
    output:
        '{run_id}.bam',

rule merge_bam:
    input:
        lambda wc: '%s.bam' % samples[wc.library_id],
    output:
        '{library_id}.bam',
Dry-run execution:
Job counts:
        count   jobs
        1       all
        1       bwa
        2

[Mon Aug 23 11:27:39 2021]
localrule bwa:
    output: S1.bam
    jobid: 1
    wildcards: run_id=S1

[Mon Aug 23 11:27:39 2021]
localrule all:
    input: S1.bam
    jobid: 0

Job counts:
        count   jobs
        1       all
        1       bwa
        2
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
Is this a bug, or am I missing something? The second example should give an AmbiguousRuleException just like the first, and it's even more obvious.
This is with Snakemake 6.4.1.
TL;DR
Snakemake performs some checks for cycles, and jobs with the same input and output file(s) are removed from consideration during DAG creation. In your working case, the job from the merge_bam rule has the same input/output file (S1.bam), so it is not considered in the DAG and there is no ambiguity when satisfying the input of the all rule.
Details
Snakemake starts with the final target file (in this case S1.bam) and works backward to find parameterized rules (jobs) that can be executed to create the target file from existing input files. To do this, it recursively calls snakemake/dag.py::DAG.update() and snakemake/dag.py::DAG.update_() to construct the DAG from the initial target file(s). DAG.update() has the following check to remove jobs from consideration if they produce the same output file that they require for input:
if file in job.input:
    cycles.append(job)
    continue
I.e., if the target file is also the candidate job's input file, skip this candidate job.
In your working case, the job from the merge_bam rule is considered as a candidate for producing the S1.bam file requested by the all rule. However, the merge_bam job also requests S1.bam as its own input, so it is caught by the above cycle check. Consequently, it is not considered a producer of the S1.bam file requested by the all rule, leaving only the bwa job.
In the exception case, the merge_bam rule outputs S1.bam but asks for R1.bam as input, so it passes the cycle check and is considered a potential producer of the S1.bam file requested by the all rule. Since both merge_bam and bwa can produce S1.bam (and there is no ruleorder defined), an AmbiguousRuleException is thrown.
Conclusions
The mixing of a cyclic DAG and ambiguous rules causes this unintuitive behavior. Snakemake doesn't aim to find all possible rule ambiguities, so I would not necessarily say that this is a bug.
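If the ambiguity in the first example needs to be resolved explicitly rather than raising, the exception message's own suggestions apply; as a minimal sketch (not from the original post, using the first example's names), either declare a rule order or constrain the wildcard:
# Option 1: state which rule wins when both could produce a file
ruleorder: merge_bam > bwa

# Option 2: constrain bwa's wildcard so it only matches actual run IDs
rule bwa:
    output:
        '{run_id}.bam',
    wildcard_constraints:
        run_id='|'.join(run_id)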
I am extracting prosody features from an audio file using the Windows version of openSMILE. It runs successfully and an output CSV is generated, but when I open the CSV, it shows rows that are not readable. I used this command to extract the prosody features:
SMILEXtract -C \opensmile-3.0-win-x64\config\prosody\prosodyShs.conf -I audio_sample_01.wav -O prosody_sample1.csv
And the output CSV looks like this: [unreadable binary content]
I even tried using the sample wave file in the example audio folder of the openSMILE directory, and the output is the same (not readable). Can someone help me identify where the problem actually is, and how I can fix it?
You need to enable the csvSink component in the configuration file to make it work. The file config\prosody\prosodyShs.conf that you are using does not have this component defined and always writes binary output.
You can verify that it is the standard binary output in this way: omit the -O parameter from your command so it becomes SMILEXtract -C \opensmile-3.0-win-x64\config\prosody\prosodyShs.conf -I audio_sample_01.wav and execute it. You will get an output.htk file which is exactly the same as the prosody_sample1.csv.
How do you output CSV? Take a look at the example configuration in opensmile-3.0-win-x64\config\demo\demo1_energy.conf, where a csvSink component is defined.
You can find more information in the official documentation:
Get started page of the openSMILE documentation
The section on configuration files
Documentation for cCsvSink
This is how I solved the issue. First I added the csvSink component to the list of component instances:
instance[csvSink].type = cCsvSink
Next I added the configuration parameters for this instance:
[csvSink:cCsvSink]
reader.dmLevel = energy
filename = \cm[outputfile(O){output.csv}:file name of the output CSV file]
delimChar = ;
append = 0
timestamp = 1
number = 1
printHeader = 1
\{../shared/standard_data_output_lldonly.conf.inc}
Now if you run this file it will throw errors, because reader.dmLevel = energy depends on the waveframes level. So the final changes would be:
[energy:cEnergy]
reader.dmLevel = waveframes
writer.dmLevel = energy
[int:cIntensity]
reader.dmLevel = waveframes
[framer:cFramer]
reader.dmLevel=wave
writer.dmLevel=waveframes
Further reference on how to write openSMILE configuration files can be found in the official documentation linked above.
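With these edits in place (assuming they were made to the same configuration file passed via -C), re-running the original command should produce a plain-text, semicolon-delimited CSV instead of the binary output:
SMILEXtract -C \opensmile-3.0-win-x64\config\prosody\prosodyShs.conf -I audio_sample_01.wav -O prosody_sample1.csv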
I have a Snakefile as follows:
SAMPLES, = glob_wildcards("data/{sample}_R1.fq.gz")

rule all:
    input:
        expand("samtools_sorted_out/{sample}.raw.snps.indels.g.vcf", sample=SAMPLES),
        expand("samtools_sorted_out/combined_gvcf")

rule combine_gvcf:
    input: "samtools_sorted_out/{sample}.raw.snps.indels.g.vcf"
    output: directory("samtools_sorted_out/combined_gvcf")
    params:
        gvcf_file_list="gvcf_files.list",
        gatk4="/storage/anaconda3/envs/exome/share/gatk4-4.1.0.0-0/gatk-package-4.1.0.0-local.jar"
    shell: """
        java -DGATK_STACKTRACE_ON_USER_EXCEPTION=true \
        -jar {params.gatk4} GenomicsDBImport \
        -V {params.gvcf_file_list} \
        --genomicsdb-workspace-path {output}
        """
When I test it with a dry run, I get this error:
RuleException in line 335 of /data/yifangt/exomecapture/Snakefile:
Wildcards in input, params, log or benchmark file of rule combine_gvcf cannot be determined from output files:
'sample'
There are two places where I need some help:
1. The {output} is a folder that will be created by the shell part;
2. The {output} folder is hard-coded manually, as required by the command line (and its contents are unknown ahead of time).
The problem seems to be that {output} has no wildcard to expand, whereas {input} does.
How should I handle this situation? Thanks a lot!
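For reference, a common way to avoid this particular error is to aggregate all per-sample GVCFs in the rule's input with expand(), so the rule no longer carries a wildcard that its output cannot determine; a sketch reusing the paths from the question (untested against this exact setup):
rule combine_gvcf:
    input:
        # aggregate every sample's GVCF; used here for dependency tracking only,
        # since the actual file list is passed via params.gvcf_file_list
        expand("samtools_sorted_out/{sample}.raw.snps.indels.g.vcf", sample=SAMPLES)
    output:
        directory("samtools_sorted_out/combined_gvcf")
    params:
        gvcf_file_list="gvcf_files.list",
        gatk4="/storage/anaconda3/envs/exome/share/gatk4-4.1.0.0-0/gatk-package-4.1.0.0-local.jar"
    shell: """
        java -DGATK_STACKTRACE_ON_USER_EXCEPTION=true \
        -jar {params.gatk4} GenomicsDBImport \
        -V {params.gvcf_file_list} \
        --genomicsdb-workspace-path {output}
        """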
Here is the situation: JMeter results were recorded in .csv quite a long time ago (around 6 months). The version of JMeter has not changed (3.0), but the config files have. Now I've been trying to generate a report from the old CSV as usual, using
jmeter.bat -g my.csv -o reportFolder
Also, to work around the incompatibility of configurations, I created a file named local-saveservice.properties and passed it in through the -q command line option. Playing with settings in this file, I managed to get past several errors like "column number mismatch" or "No column xxx found in sample metadata", but I still haven't generated the report successfully, and here is the trouble:
File 'D:\ .....\load_NSI_stepping3_2017-03-24-new.csv' does not contain the field names header, ensure the jmeter.save.saveservice.* properties are the same as when the CSV file was created or the file may be read incorrectly
An error occurred: Error while processing samples:Consumer failed with message :Consumer failed with message :Consumer failed with message :Consumer failed with message :Error in sample at line:1 converting field:Latency at column:11 to:long, fieldValue:'UTF-8'
However, in my .csv, column number 11 has the header "Latency" and contains numeric values; 'UTF-8' is the content of the next column, "Encoding".
Here are first lines of my .csv
timeStamp,elapsed,label,responseCode,responseMessage,success,bytes,grpThreads,allThreads,URL,Latency,Encoding,SampleCount,ErrorCount,Connect,threadName
1490364040950,665,searchItemsInCatalogRequest,200,OK,true,25457,1,1,http://*.*.*.*:9080/em/.....Service,654,UTF-8,1,0,9,NSI - search item in catalog
1490364041620,507,searchItemsInCatalogRequest,200,OK,true,25318,1,1,http://*.*.*.*:9080/em/.....Service,499,UTF-8,1,0,0,NSI - search item in catalog
1490364042134,495,searchItemsInCatalogRequest,200,OK,true,24266,2,2,http://*.*.*.*:9080/em/.....Service,487,UTF-8,1,0,0,NSI - search item in catalog
1490364043595,563,searchItemsInCatalogRequest,200,OK,true,24266,2,2,http://*.*.*.*:9080/em/.....Service,556,UTF-8,1,0,6,NSI - search item in catalog
P.S. I had to add threadName manually, because it was not saved during the initial data recording (my knowledge of JMeter was even less than it is now :) )
First, you should update to JMeter 3.3, as report-generation bugs have been fixed in the three versions released after 3.0.
Second, add to your command line:
jmeter.bat -p <path to jmeter.properties> -q <path to your custom.properties used when you generated the file> -g my.csv -o reportFolder
Ensure that in your custom.properties you set to false all fields prefixed with jmeter.save.saveservice that did not yet exist at the time you generated the file.
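As an illustration only (the exact list depends on which columns your old CSV actually contains), the custom properties file might keep the field-names header enabled and switch off a column that, as far as I know, did not yet exist in 3.0, such as sent_bytes:
# Custom save-service overrides for reading the old 3.0-era CSV
# The file does start with a header row of field names
jmeter.save.saveservice.print_field_names=true
# sent_bytes was added after JMeter 3.0, so disable it for this file
jmeter.save.saveservice.sent_bytes=false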
I want to include the value of the "version" parameter in package.json as part of the Jenkins build name.
I'm using the Jenkins Build Name Setter plugin - https://wiki.jenkins-ci.org/display/JENKINS/Build+Name+Setter+Plugin
So far I've tried to use PROPFILE syntax in the "Build name macro template" step:
${PROPFILE,file="./mainline/projectDirectory/package.json",property="\"version\""}
This successfully creates a build, but includes the quotes and comma surrounding the value of the version property in package.json, for example:
"0.0.1",
I want just the value inside returned, so it reads
0.0.1
How can I do this? Is there a different plugin that would work better for parsing package.json and getting it into the template, or should I resort to some sort of regex for removing the characters I don't want?
UPDATE:
I tried using token transforms based on reading the Token Macro Plugin documentation, but it's not working:
${PROPFILE%\"\,#\",file="./mainline/projectDirectory/package.json",property="\"version\""}
still just returns
However, using only one escaped character and only one of # or % works. No other combinations I tried work.
${PROPFILE%\,,file="./mainline/projectDirectory/package.json",property="\"version\""}
which returns "0.0.1" (comma removed)
${PROPFILE#\"%\"\,,file="./mainline/projectDirectory/package.json",property="\"version\""}
which returns "0.0.1", (no characters removed)
UPDATE:
Tried to use the new Jenkins Token Macro plugin's JSON macro with no luck.
Jenkins Build Name Setter set to update the build name with Macro:
${JSON,file="./mainline/pathToFiles/package.json",path="version"}-${P4_CHANGELIST}
Jenkins build logs for this job show:
10:57:55 Evaluated macro: 'Error processing tokens: Error while parsing action 'Text/ZeroOrMore/FirstOf/Token/DelimitedToken/DelimitedToken_Action3' at input position (line 1, pos 74):
10:57:55 ${JSON,file="./mainline/pathToFiles/package.json",path="version"}-334319
10:57:55 ^
10:57:55
10:57:55 java.io.IOException: Unable to serialize org.jenkinsci.plugins.tokenmacro.impl.JsonFileMacro$ReadJSON#2707de37'
I implemented a new macro, JSON, in token-macro 2.1; it takes a file and a path (the key hierarchy in the JSON for the value you want). You can only use a single transform per macro usage.
Try the token transformations # and % (see the Token Macro Plugin):
${PROPFILE#"%",file="./mainline/projectDirectory/package.json",property="\"version\""}
(This will only help if you are using Pipelines, but for what it's worth...)
What works for me is a combination of readJSON from the Pipeline Utility Steps plugin and directly setting currentBuild.displayName, thusly:
script {
    // readJSON from "Pipeline Utility Steps"
    def packageJson = readJSON file: 'package.json'
    def version = packageJson.version
    echo "Setting build version: ${packageJson.version}"
    currentBuild.displayName = env.BUILD_NUMBER + " - " + packageJson.version
    // currentBuild.description = "other cool stuff"
}
Omitting error handling etc obvs.
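For context, in a declarative pipeline that script block would sit inside a stage's steps, roughly like this (a sketch assuming the Pipeline Utility Steps plugin is installed; the stage name and agent are illustrative):
pipeline {
    agent any
    stages {
        stage('Set build name') {
            steps {
                script {
                    // readJSON comes from the Pipeline Utility Steps plugin
                    def packageJson = readJSON file: 'package.json'
                    // e.g. "42 - 0.0.1"
                    currentBuild.displayName = env.BUILD_NUMBER + " - " + packageJson.version
                }
            }
        }
    }
}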