snakemake: Ambiguous rule not detected? - exception

The following Snakefile fails with AmbiguousRuleException:
library_id = ['S1']
run_id = ['R1']
samples = dict(zip(library_id, run_id))
rule all:
    input:
        expand('{library_id}.bam', library_id=library_id),
rule bwa:
    output:
        '{run_id}.bam',
rule merge_bam:
    input:
        lambda wc: '%s.bam' % samples[wc.library_id],
    output:
        '{library_id}.bam',
Gives:
AmbiguousRuleException:
Rules bwa and merge_bam are ambiguous for the file S1.bam.
Consider starting rule output with a unique prefix, constrain your wildcards, or use the ruleorder directive.
Wildcards:
bwa: run_id=S1
merge_bam: library_id=S1
Expected input files:
bwa:
merge_bam: R1.bamExpected output files:
bwa: S1.bam
merge_bam: S1.bam
That's expected and it's ok. However, if library_id and run_id have the same value the ambiguity is not detected and only the first rule is executed:
library_id = ['S1']
run_id = ['S1'] # Same as library_id!
samples = dict(zip(library_id, run_id))
rule all:
    input:
        expand('{library_id}.bam', library_id=library_id),
rule bwa:
    output:
        '{run_id}.bam',
rule merge_bam:
    input:
        lambda wc: '%s.bam' % samples[wc.library_id],
    output:
        '{library_id}.bam',
Dry-run execution:
Job counts:
count jobs
1 all
1 bwa
2
[Mon Aug 23 11:27:39 2021]
localrule bwa:
output: S1.bam
jobid: 1
wildcards: run_id=S1
[Mon Aug 23 11:27:39 2021]
localrule all:
input: S1.bam
jobid: 0
Job counts:
count jobs
1 all
1 bwa
2
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
Is this a bug or am I missing something? The second example should give AmbiguousRuleException just like the first and it's even more obvious.
This is with snakemake 6.4.1

TL;DR
Snakemake performs some checks for cycles, and jobs that have the same file as both input and output are removed from consideration during DAG creation. In your working case, the job from the merge_bam rule has the same input and output file (S1.bam), so it is not considered in the DAG and there is no ambiguity when satisfying the input of the all rule.
Details
Snakemake starts with the final target file (in this case S1.bam) and works backward to find parameterized rules (jobs) that can be executed to create the target file from existing input files. To do this, it recursively calls snakemake/dag.py::DAG.update() and snakemake/dag.py::DAG.update_() to construct the DAG from the initial target file(s). DAG.update() has the following check to remove jobs from consideration if they produce the same output file that they require for input:
if file in job.input:
    cycles.append(job)
    continue
I.e., if the target file is also the candidate job's input file, skip this candidate job.
In your working case, the job from the merge_bam rule is considered as a candidate for producing the S1.bam file requested by the all rule. However, the merge_bam job also requests S1.bam for its own input, so it fails the above check for cycles. Consequently, it is not considered a producer for the S1.bam file requested by the all rule, leaving only the bwa job.
In the exception case, the merge_bam rule outputs S1.bam but asks for R1.bam as input, so it passes the cycle check and is considered a potential producer of the S1.bam file requested by the all rule. Since both merge_bam and bwa can produce S1.bam (and there is no ruleorder defined) an AmbiguousRuleException is thrown.
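The two cases can be reproduced with a small stand-alone model of the candidate selection. This is only a sketch of the idea described above, not Snakemake's actual code: the Job class, the AmbiguousRule exception, and the producers() and jobs_for() helpers are hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Job:
    rule: str
    output: str
    input: List[str]


class AmbiguousRule(Exception):
    """Hypothetical stand-in for snakemake's AmbiguousRuleException."""


def producers(target: str, jobs: List[Job]) -> List[Job]:
    """Return the candidate jobs for `target`, skipping self-cyclic ones."""
    candidates = []
    for job in jobs:
        if job.output != target:
            continue
        if target in job.input:  # same idea as `if file in job.input`
            continue  # the job needs the file it would create: skip it
        candidates.append(job)
    if len(candidates) > 1:
        names = " and ".join(j.rule for j in candidates)
        raise AmbiguousRule(f"Rules {names} are ambiguous for the file {target}.")
    return candidates


def jobs_for(samples: dict, target_id: str) -> List[Job]:
    """Instantiate the two rules from the question for `<target_id>.bam`."""
    return [
        Job("bwa", f"{target_id}.bam", []),
        Job("merge_bam", f"{target_id}.bam", [f"{samples[target_id]}.bam"]),
    ]


# library_id == run_id: merge_bam is dropped by the cycle check, bwa remains
same = producers("S1.bam", jobs_for({"S1": "S1"}, "S1"))
print([j.rule for j in same])

# library_id != run_id: both rules survive and the ambiguity is reported
try:
    producers("S1.bam", jobs_for({"S1": "R1"}, "S1"))
except AmbiguousRule as exc:
    print(exc)
```

In this model the cycle check runs before the ambiguity check, which is why the self-referencing merge_bam job never gets the chance to be reported as ambiguous.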
Conclusions
The mixing of a cyclic DAG and ambiguous rules causes this unintuitive behavior. Snakemake doesn't aim to find all possible rule ambiguities, so I would not necessarily say that this is a bug.

Related

Snakemake fails to produce output

I'm trying to run fastqc on two paired files (1.fq.gz and 2.fq.gz). Running:
snakemake --use-conda -np newenv/1_fastqc.html
...produces what looks like a sensible DAG:
Building DAG of jobs...
Job stats:
job count min threads max threads
---------- ------- ------------- -------------
curlewtest 1 1 1
total 1 1 1
[Sat May 21 11:27:40 2022]
rule curlewtest:
input: 1.fq.gz, 2.fq.gz
output: newenv/1_fastqc.html, newenv/2_fastqc.html
jobid: 0
resources: tmpdir=/tmp
fastqc 1.fq.gz 2.fq.gz
Job stats:
job count min threads max threads
---------- ------- ------------- -------------
curlewtest 1 1 1
total 1 1 1
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
When I run the job with snakemake --use-conda --cores all newenv/1_fastqc.html, the analysis runs, but the output files fail to appear. Snakemake also throws the following error:
Waiting at most 5 seconds for missing files.
MissingOutputException in line 2 of /mnt/data/kcollier/snakemake-workspace/snakefile:
Job Missing files after 5 seconds. This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait:
newenv/1_fastqc.html
newenv/2_fastqc.html completed successfully, but some output files are missing. 0
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Increasing latency does not help. The output directory I created beforehand (newenv) also disappears. Does anyone else know why this is?
Without your code, it is difficult to answer precisely what causes the error. But this can happen if your shell command (or script) does not produce the output exactly as stated in the output directive of the rule - it may be something as simple as an error in the file paths.
Perhaps try running the shell command that Snakemake runs yourself and check whether the output files you expect get created. You can see the commands that Snakemake runs by adding the --printshellcmds/-p flag to your snakemake command; --verbose prints additional debugging output.
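After running the command by hand, a small helper can compare the declared outputs with what is actually on disk. The sketch below is not part of Snakemake; missing_outputs() and stray_matches() are hypothetical helpers. Note that the dry-run shows the command as "fastqc 1.fq.gz 2.fq.gz" while the declared outputs live under newenv/; as far as I know, FastQC writes its reports next to the input files unless an output directory is given with -o/--outdir, so a path mismatch of this kind is worth checking first.

```python
from pathlib import Path
from typing import Iterable, List


def missing_outputs(declared: Iterable[str], workdir: str = ".") -> List[str]:
    """Return the declared output paths that do not exist under `workdir`."""
    root = Path(workdir)
    return [p for p in declared if not (root / p).exists()]


def stray_matches(declared: Iterable[str], workdir: str = ".") -> List[str]:
    """Find files that have a declared file name but sit in another directory."""
    root = Path(workdir)
    found = []
    for p in declared:
        for hit in root.rglob(Path(p).name):
            if hit != root / p:
                found.append(str(hit.relative_to(root)))
    return sorted(found)


if __name__ == "__main__":
    # Output paths taken from the rule in the question
    declared = ["newenv/1_fastqc.html", "newenv/2_fastqc.html"]
    print("missing:", missing_outputs(declared))
    print("found elsewhere:", stray_matches(declared))
```

If a file shows up under "found elsewhere", the tool is writing to a different location than the rule's output directive declares.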

Snakemake: Error when trying to generate multiple output files

I'm writing a snakemake pipeline to take publicly available sra files, convert them to fastq files then run them through alignment, peak calling and LD score regression.
I'm having an issue in the rule called SRA2fastq below in which I use parallel-fastq-dump to convert SRA files to paired end fastq files. This rule generates two outputs for each SRA file, SRRXXXXXXX_1, and SRRXXXXXXX_2.
Here is my config file:
samples:
  fullard2018_NpfcATAC_1: SRR5367824
  fullard2018_NpfcATAC_2: SRR5367798
  fullard2018_NpfcATAC_3: SRR5367778
  fullard2018_NpfcATAC_4: SRR5367754
  fullard2018_NpfcATAC_5: SRR5367729
And here are the first few rules of my Snakefile:
# read config info into this namespace
configfile: "config.yaml"
print(config['samples'])
rule all:
    input:
        expand("fastq_files/{SRA}_{num}.fastq.gz", SRA=[config['samples'][x] for x in config['samples']], num=[1,2]),
        expand("FastQC/{SRA}_{num}_fastqc.html", SRA=[config['samples'][x] for x in config['samples']], num=[1,2]),
        "FastQC/fastq_multiqc.html",
        expand("peak_files/{sample}_peaks.blrm.narrowPeak", sample=config['samples']),
        "peak_files/Fullard2018_peaks.mrgd.blrm.narrowPeak",
        expand("LD_annotation_files/Fullard_2018.{chr}.l2.ldscore.gz", chr=range(1,23))
rule SRA_prefetch:
    params:
        SRA="{SRA}"
    output:
        "/home/c1477909/ncbi/public/sra/{SRA}.sra"
    log:
        "logs/prefetch/{SRA}.log"
    shell:
        "prefetch {params.SRA}"
rule SRA2fastq:
    input:
        "/home/c1477909/ncbi/public/sra/{SRA}.sra"
    output:
        "fastq_files/{SRA}_1.fastq.gz",
        "fastq_files/{SRA}_2.fastq.gz"
    log:
        "logs/SRA2fastq/{SRA}.log"
    shell:
        """
        parallel-fastq-dump --sra-id {input} --threads 8 \
            --outdir fastq_files --split-files --gzip
        """
rule fastqc:
    input:
        rules.SRA2fastq.output
    output:
        # Output needs to end in '_fastqc.html' for multiqc to work
        html="FastQC/{SRA}_{num}_fastqc.html"
    log:
        "logs/FASTQC/{SRA}_{num}.log"
    wrapper:
        "0.27.1/bio/fastqc"
rule multiqc_fastq:
    input:
        lambda wildcards: expand("FastQC/{SRA}_{num}_fastqc.html", SRA=[config['samples'][x] for x in config['samples']], num=[1,2])
    output:
        "FastQC/fastq_multiqc.html"
    wrapper:
        "0.27.1/bio/multiqc"
rule bowtie2:
    input:
        sample=lambda wildcards: expand("fastq_files/{SRA}_{num}.fastq.gz", SRA=config['samples'][wildcards.sample], num=[1,2])
    output:
        "bam_files/{sample}.bam"
    log:
        "logs/bowtie2/{sample}.txt"
    params:
        index=config["index"], # prefix of reference genome index (built with bowtie2-build),
        extra=""
    threads: 8
    wrapper:
        "0.27.1/bio/bowtie2/align"
However, when I run the Snakefile I get the following error:
Error in job SRA2fastq while creating output files fastq_files/SRR5367754_1.fastq.gz, fastq_files/SRR5367754_2.fastq.gz
I've seen this error many times before, and it's usually caused by the name of the output file generated by the program not exactly matching the output file name specified in the corresponding snakemake rule. However, that is not the case here: if I run the command that snakemake generates for this rule separately, the files are created as expected and the file names match. Here is one instance of the rule, taken from the output of snakemake -np:
rule SRA2fastq:
input: /home/c1477909/ncbi/public/sra/SRR5367779.sra
output: fastq_files/SRR5367779_1.fastq.gz, fastq_files/SRR5367779_2.fastq.gz
log: logs/SRA2fastq/SRR5367779.log
jobid: 18
wildcards: SRA=SRR5367779
parallel-fastq-dump --sra-id /home/c1477909/ncbi/public/sra/SRR5367779.sra --threads 8 --outdir fastq_files --split-files --gzip
Note the output files generated by the parallel-fastq-dump command run separately (i.e. not using snakemake) are named as specified in the SRA2fastq rule:
ls fastq_files
SRR5367729_1.fastq.gz SRR5367729_2.fastq.gz
I'm a bit stumped by this, as this error is usually easily rectified, but I can't work out what the issue is. I've tried changing the output section of the SRA2fastq rule to:
    output:
        file1="fastq_files/{SRA}_1.fastq.gz",
        file2="fastq_files/{SRA}_2.fastq.gz"
However, this throws the same error. I've also tried just specifying one output file but this affects the bowtie2 rule later on as I get an input files missing error.
Any ideas what's going on here? Is there something I'm missing when trying to look for multiple output files in a single rule?
Many Thanks

Legal change to JSON input invalidates simple jq

Another department continually updates a JSON file that I then query. Its format is three lists of similar-looking dictionaries:
{
"levels":
[
{"a":1, "b":false, "c":"2012", "d":"2017"}
,{"a":2, "b":true, "c":"2013", "d":"9999"}
,...
]
,"costs":
[
{"e":12, "f":"foo", "g":"blarg", "h":"2015", "i":"2018"}
,{"e":-3, "f":"foo", "g":"glorb", "h":"2013", "i":"9999"}
,...
]
,"recipes":
[
{"j":"BAZ", "k":["blarg","glorb","bleeg"], "l":"dill", "m":"2016", "n":"2017"}
,{"j":"BAZ", "k":["blarg","bleeg"], "l":"dill", "m":"2017", "n":"9999"}
,...
]
} # line 3943 (see below)
Recently, my simple jq queries like
jq '.["recipes"][] | select(.l | test("ill"))' < jsonfile
stopped returning all of the results they should (e.g. returning only one of the two "dill" lines above) and started printing this error message:
jq: error (at <stdin>:3943): null (null) cannot be matched, as it is not a string
Line 3943 mentioned in the error is the final line of the file. Queries against the "levels" and "costs" sections of the file continue to work like normal; it's only the "recipes" section of the file that is breaking, as though jq thinks the closing brace of the file is still part of the "recipes" section.
To me this suggests there's been a formatting change or error in the last section of the file. However, software other than jq (e.g. python) doesn't report any problems parsing it. Before I start going through the input line by line ... does this error message indicate anything obvious to a jq expert?
Alas, I do not keep old versions of the file around for comparison. (I think I will start today.)
(self-answering after a bit of investigating)
I think there was no formatting error or change in formatting in the input.
I don't know why my query syntax did not encounter errors previously (maybe I just did not notice), but it seems that the entries in the "recipes" section often do not contain an "l" attribute, and jq will cease processing as soon as it encounters one that does not.
I also don't know why jq does not generate the same error message for every record that lacks that attribute, nor why it waits to the final line of the input to generate the single message. (Maybe that behavior is documented somewhere.)
In any case, I fixed the error (not just the message, but also the failure to display all relevant records) by testing for the presence of the attribute first:
jq '.["recipes"][] | select(has("l") and (.l | test("ill")))' < jsonfile
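The same failure mode and the same guard can be illustrated in Python. This is a sketch for comparison only, not a description of how jq is implemented; the second record, which lacks "l", is a hypothetical entry added to trigger the problem, and unguarded() and guarded() are illustrative names.

```python
import re

recipes = [
    {"j": "BAZ", "k": ["blarg", "glorb", "bleeg"], "l": "dill", "m": "2016", "n": "2017"},
    {"j": "QUX", "k": ["blarg"], "m": "2016", "n": "2017"},  # hypothetical: no "l" key
    {"j": "BAZ", "k": ["blarg", "bleeg"], "l": "dill", "m": "2017", "n": "9999"},
]


def unguarded(records):
    """Like select(.l | test("ill")): a missing "l" yields None, which cannot be matched."""
    return [r for r in records if re.search("ill", r.get("l"))]


def guarded(records):
    """Like select(has("l") and (.l | test("ill"))): check for the key first."""
    return [r for r in records if "l" in r and re.search("ill", r["l"])]


try:
    unguarded(recipes)
except TypeError as exc:
    print("unguarded filter failed:", exc)

print(len(guarded(recipes)), "records matched with the guard")
```

One difference from jq is that the Python comprehension fails as a whole, whereas jq had already printed the matches that came before the first record without "l", which fits the truncated results seen in the question.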

Lexical Analysis of Preprocessed Code

I have programmed an assembler with a preprocessor for the MOS 6502 microprocessor. The assembler spits out the correct binary and the preprocessor performs constant substitution, inclusions and conditional inclusions. The problem is retaining file positions of the included files. At this point the preprocessor emits a file directive just before and after a file is included. Here is an example.
Proggie.asm
JSR init
JSR loop
JSR end
%include "Init.asm"
%include "Loop.asm"
%include "End.asm"
Init.asm
init:
LDX #$00
RTS
Loop.asm
loop:
INX
CPX #$05
BNE loop
RTS
End.asm
end:
BRK
Pre Processor Result
%file "D:\Proggie.asm" 1
JSR init
JSR loop
JSR end
%file "D:\Init.asm" 1
init:
LDX #$00
RTS%file "D:\Init.asm" 2
%file "D:\Loop.asm" 1
loop:
INX
CPX #$05
BNE loop
RTS%file "D:\Loop.asm" 2
%file "D:\End.asm" 1
end:
BRK%file "D:\End.asm" 2
%file "D:\Proggie.asm" 2
This idea comes from the output that GCC's preprocessor produces. The %file directive tells the lexical analyzer that a file has just been entered or exited: the number after the file path is 1 when the analyzer enters the given file and 2 when it exits it. My lexical analyzer more or less works with this, but it is still a bit off when reporting the current line number.
So my question is: Is this the way to go? Or is there another algorithm I could use?
GCC's preprocessor fabricates line control directives which look like this:
# 122 "/usr/include/x86_64-linux-gnu/bits/types.h" 2 3 4
Here, the 122 is the line number in the file /usr/include/x86_64-linux-gnu/bits/types.h. Including the line number means that a downstream lexer doesn't need to track the include stack in order to tell which line it is on.
The rest of the line consists of flags, which are similar to your approach with the addition of a couple of gcc-specific flags:
'1' This indicates the start of a new file.
'2' This indicates returning to a file (after having included another file).
'3' This indicates that the following text comes from a system header file, so certain warnings should be suppressed.
'4' This indicates that the following text should be treated as being wrapped in an implicit 'extern "C"' block.
These allow the downstream lexer to track the include stack if it wishes, and the gcc lexer does so in order to produce more informative (or at least more wordy) error messages.
I think the logic is easier with the preprocessor maintaining the stack, but it doesn't make a huge amount of difference, particularly if you're also going to want to generate "included from" notes in your error messages.
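A minimal sketch of the line-number-carrying approach, assuming a hypothetical directive of the form %line <number> "<path>" (this is neither your current %file syntax nor GCC's exact format), and a hypothetical three-line Main.asm that includes Init.asm on its second line. Because every directive states the line number of the text that follows, the lexer only has to reset a counter; it needs no include stack.

```python
import re
from typing import Iterator, Tuple

# Hypothetical directive format: %line <number> "<path>"
DIRECTIVE = re.compile(r'^%line\s+(\d+)\s+"([^"]*)"\s*$')


def locate(preprocessed: str) -> Iterator[Tuple[str, int, str]]:
    """Yield (file, line_number, text) for every non-directive line."""
    current_file, current_line = "<unknown>", 1
    for raw in preprocessed.splitlines():
        match = DIRECTIVE.match(raw)
        if match:
            # The directive names the position of the line that follows it.
            current_line = int(match.group(1))
            current_file = match.group(2)
            continue
        yield current_file, current_line, raw
        current_line += 1


# Main.asm: line 1 "JSR init", line 2 %include "Init.asm", line 3 "BRK"
sample = "\n".join([
    '%line 1 "Main.asm"',
    "JSR init",
    '%line 1 "Init.asm"',
    "init:",
    "LDX #$00",
    "RTS",
    '%line 3 "Main.asm"',
    "BRK",
])

for position in locate(sample):
    print(position)
```

The preprocessor pays for this by tracking where it is in each file when it emits the directive after an include, which it has to know anyway in order to resume reading the including file.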

How to trigger an OpenNMS event with thresholds

it seems that it is not possible for me to trigger an event in OpenNMS using a threshold...
first the facts (in as much detail as i can)
i want to monitor an html file, or rather its content.
if a value is not what i expected, OpenNMS should call me.
my html file:
Document Count: 5
in /var/lib/opennms/rrd/snmp/NODE are two files named: "documentCount" (.jrb & .meta)
--> because of the http-datacollection-config.xml
in my logfiles is written:
INFO [LegacyScheduler-Thread-2-of-50] RrdUtils: updateRRD: updating RRD file /var/lib/opennms/rrd/snmp/21/documentCount.jrb with values '1385031023:5'"
so the "5" is collected correctly.
now i created a threshold for this case:
<threshold type="high" ds-type="node"
value="4.0" rearm="2.0" trigger="1" triggeredUEI="uei.opennms.org/threshold/highThresholdExceeded"
filterOperator="or" ds-name="documentCount"
/>
in my collectd-configuration.xml is the threshold also enabled:
in my opinion the threshold of 4 is exceeded, because the value is 5. so the highThresholdExceeded event should be fired. BUT IT ISN'T.
so i'm here to ask if someone had an idea.
regards dawn
Check collectd.log with the following
tail -f collectd.log | grep -i thresholding
A while back, threshold checking was changed so that it is evaluated while the data is being retrieved, rather than as a post-processing step on the RRD files.
Even with the log setting at info you should find some clues as to why the threshold rule is not matching any data.
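For reference, the expectation in the question (value 4.0, rearm 2.0, trigger 1, sample 5) can be written down as a small state machine. This is a simplified model of a high threshold with trigger and rearm semantics, not OpenNMS's implementation; the HighThreshold class is hypothetical, and the inclusive comparisons are an assumption.

```python
from dataclasses import dataclass


@dataclass
class HighThreshold:
    value: float   # level that counts as exceeded
    rearm: float   # level at which the threshold is armed again
    trigger: int   # consecutive exceeding samples needed to fire
    exceeded_count: int = 0
    armed: bool = True

    def evaluate(self, sample: float) -> str:
        """Return 'triggered', 'rearmed' or 'no_change' for one sample."""
        if self.armed:
            if sample >= self.value:  # assumption: inclusive comparison
                self.exceeded_count += 1
                if self.exceeded_count >= self.trigger:
                    self.armed = False
                    self.exceeded_count = 0
                    return "triggered"
            else:
                self.exceeded_count = 0
        elif sample <= self.rearm:  # assumption: inclusive comparison
            self.armed = True
            return "rearmed"
        return "no_change"


threshold = HighThreshold(value=4.0, rearm=2.0, trigger=1)
for sample in [5, 5, 1, 5]:
    print(sample, threshold.evaluate(sample))
```

Under this model a collected value of 5 does fire on the first sample, so if no event appears, the likelier explanation is that the threshold definition is never matched against the collected data, which is what the thresholding lines in collectd.log should reveal.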