Snakemake fails to produce output

I'm trying to run fastqc on two paired files (1.fq.gz and 2.fq.gz). Running:
snakemake --use-conda -np newenv/1_fastqc.html
...produces what looks like a sensible DAG:
Building DAG of jobs...
Job stats:
job          count    min threads    max threads
----------   -------  -------------  -------------
curlewtest   1        1              1
total        1        1              1
[Sat May 21 11:27:40 2022]
rule curlewtest:
input: 1.fq.gz, 2.fq.gz
output: newenv/1_fastqc.html, newenv/2_fastqc.html
jobid: 0
resources: tmpdir=/tmp
fastqc 1.fq.gz 2.fq.gz
Job stats:
job          count    min threads    max threads
----------   -------  -------------  -------------
curlewtest   1        1              1
total        1        1              1
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
When I run the job with snakemake --use-conda --cores all newenv/1_fastqc.html, the analysis runs, but the output files fail to appear. Snakemake also throws the following error:
Waiting at most 5 seconds for missing files.
MissingOutputException in line 2 of /mnt/data/kcollier/snakemake-workspace/snakefile:
Job Missing files after 5 seconds. This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait:
newenv/1_fastqc.html
newenv/2_fastqc.html
Job completed successfully, but some output files are missing. 0
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Increasing the latency wait does not help. The output directory I created beforehand (newenv) also disappears. Does anyone know why this happens?

Without your code, it is difficult to say precisely what causes the error. But this can happen if your shell command (or script) does not produce the output exactly as stated in the output directive of the rule - it may be something as simple as an error in the file paths.
Perhaps try running the shell command that Snakemake runs and see if the output files you expect actually get created. You can easily see the commands that Snakemake runs by adding the -p (--printshellcmds) flag or the --debug flag to your snakemake command.
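For instance, a common cause with fastqc is that it writes its reports next to the input files unless told otherwise, so declared outputs under newenv/ never appear. A hedged sketch of what the rule might look like with the output directory passed explicitly (the rule and paths are reconstructed from the dry-run output, not from the actual Snakefile):

rule curlewtest:
    input:
        "1.fq.gz",
        "2.fq.gz"
    output:
        "newenv/1_fastqc.html",
        "newenv/2_fastqc.html"
    # fastqc names its reports <input>_fastqc.html and writes them to the
    # directory given by -o/--outdir; without -o they land next to the inputs
    # and never match the declared outputs above. Snakemake creates newenv/
    # for the declared outputs before the job runs.
    shell:
        "fastqc -o newenv {input}"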


snakemake: Ambiguous rule not detected?

The following Snakefile fails with AmbiguousRuleException:
library_id = ['S1']
run_id = ['R1']
samples = dict(zip(library_id, run_id))

rule all:
    input:
        expand('{library_id}.bam', library_id= library_id),

rule bwa:
    output:
        '{run_id}.bam',

rule merge_bam:
    input:
        lambda wc: '%s.bam' % samples[wc.library_id],
    output:
        '{library_id}.bam',
Gives:
AmbiguousRuleException:
Rules bwa and merge_bam are ambiguous for the file S1.bam.
Consider starting rule output with a unique prefix, constrain your wildcards, or use the ruleorder directive.
Wildcards:
bwa: run_id=S1
merge_bam: library_id=S1
Expected input files:
bwa:
merge_bam: R1.bam
Expected output files:
bwa: S1.bam
merge_bam: S1.bam
That's expected and it's ok. However, if library_id and run_id have the same value the ambiguity is not detected and only the first rule is executed:
library_id = ['S1']
run_id = ['S1'] # Same as library_id!
samples = dict(zip(library_id, run_id))

rule all:
    input:
        expand('{library_id}.bam', library_id= library_id),

rule bwa:
    output:
        '{run_id}.bam',

rule merge_bam:
    input:
        lambda wc: '%s.bam' % samples[wc.library_id],
    output:
        '{library_id}.bam',
Dry-run execution:
Job counts:
count jobs
1 all
1 bwa
2
[Mon Aug 23 11:27:39 2021]
localrule bwa:
output: S1.bam
jobid: 1
wildcards: run_id=S1
[Mon Aug 23 11:27:39 2021]
localrule all:
input: S1.bam
jobid: 0
Job counts:
count jobs
1 all
1 bwa
2
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
Is this a bug or am I missing something? The second example should give an AmbiguousRuleException just like the first; if anything, the ambiguity there is even more obvious.
This is with snakemake 6.4.1.
TL;DR
Snakemake performs some checks for cycles, and jobs with the same input and output file(s) are removed from consideration during DAG creation. In your working case, the job from the merge_bam rule has the same input/output file (S1.bam), so it is not considered in the DAG and there is no ambiguity when satisfying the input of the all rule.
Details
Snakemake starts with the final target file (in this case S1.bam) and works backward to find parameterized rules (jobs) that can be executed to create the target file from existing input files. To do this, it recursively calls snakemake/dag.py::DAG.update() and snakemake/dag.py::DAG.update_() to construct the DAG from the initial target file(s). DAG.update() has the following check to remove jobs from consideration if they produce the same output file that they require for input:
if file in job.input:
    cycles.append(job)
    continue
E.g. if the target file is also the candidate job's input file, skip this candidate job.
In your working case, the job from the merge_bam rule is considered as a candidate for producing the S1.bam file requested by the all rule. However, the merge_bam job also requests S1.bam for its own input, so it is caught by the above check for cycles. Consequently, it is not considered a producer for the S1.bam file requested by the all rule, leaving only the bwa job.
In the exception case, the merge_bam rule outputs S1.bam but asks for R1.bam as input, so it passes the cycle check and is considered a potential producer of the S1.bam file requested by the all rule. Since both merge_bam and bwa can produce S1.bam (and there is no ruleorder defined), an AmbiguousRuleException is thrown.
Conclusions
The mixing of a cyclic DAG and ambiguous rules causes this unintuitive behavior. Snakemake doesn't aim to find all possible rule ambiguities, so I would not necessarily say that this is a bug.
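For completeness, the remedies that the exception message itself suggests look roughly like the sketch below. The runs/ prefix is purely illustrative, and which rule ruleorder should prefer depends on your pipeline:

# Option 1: state the preference explicitly so the ambiguity is resolved
# deterministically whenever both rules could produce the same file.
ruleorder: merge_bam > bwa

# Option 2: keep the two outputs from ever matching the same path,
# e.g. by giving run-level BAMs their own (hypothetical) prefix.
rule bwa:
    output:
        'runs/{run_id}.bam',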

The best way to run several functions

I need help with running different functions at the same time with the same arguments.
I have a PowerShell script built like this:
$ObjectsArray = @($Object1, $Object2, $Object3)

function function1($arg) {
    # do something...
}
function function2($arg) {
    # do something...
}
function function3($arg) {
    # do something...
}

foreach ($Object in $ObjectsArray) {
    function1 -arg $Object.Name
    function2 -arg $Object.Name
    function3 -arg $Object.Name
}
In my script I have many functions and I want to optimize the code.
Is there any way to run all of these functions at once, maybe with a regex?
All of the functions are called with the same arguments.
Thanks!!
Short answer: yes, it's possible.
Longer answer: you will need to separate the various executions into PowerShell jobs. This is sort of multi-threading, but I don't know enough to tell you whether it's actually peeling threads from the (virtual) core(s).
Here is how you call an individual job:
PS C:\temp\StackWork\csvimport> start-job -ScriptBlock {Get-ChildItem c:\ | select Mode,Name | ft -w -auto}
Id  Name  PSJobTypeName  State    HasMoreData  Location   Command
--  ----  -------------  -----    -----------  --------   -------
9   Job9  BackgroundJob  Running  True         localhost  Get-ChildItem c:\ | se...
There you see the output is not the results of the command, but the properties of the job itself. That 'State' field is how you check if the job is still running or completed.
Then this is how you get the resulting output of the job:
PS C:\temp\StackWork\csvimport> receive-job 9
Mode Name
---- ----
d----- inetpub
d----- PerfLogs
d-r--- Program Files
d-r--- Program Files (x86)
d----- Python27
d----- Quarantine
d----- Tech
d----- Temp
d-r--- Users
d----- Windows
Here is how you get the info on a running job:
PS C:\temp\StackWork\csvimport> get-job -Id 9
Id  Name  PSJobTypeName  State      HasMoreData  Location   Command
--  ----  -------------  -----      -----------  --------   -------
9   Job9  BackgroundJob  Completed  False        localhost
Expanding this really depends on what you need to see in the output, and what you need to trigger as a next action. In your example, it's only really 3 parallel runs, but as you increase you may need to track running jobs and set limits to complete some before starting new ones. A good rule of thumb I've always heard was two running threads x (# cores -1).
All of that is very specific to your needs, but hopefully this helps with the basis of implementation.
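To tie that back to the question's loop, here is a rough, untested sketch of how each function call could be pushed into a background job. Note that a job runs in a separate process, so the session's function definitions have to be shipped into it; the function names and the -arg parameter come from the question, everything else is illustrative:

foreach ($Object in $ObjectsArray) {
    foreach ($funcName in 'function1', 'function2', 'function3') {
        # Grab the function body so it can be recreated inside the job process.
        $funcDef = (Get-Command $funcName).Definition
        Start-Job -ScriptBlock {
            param($name, $definition, $argValue)
            # Recreate the function in the job's session, then call it by name.
            Set-Item -Path "function:$name" -Value ([ScriptBlock]::Create($definition))
            & $name -arg $argValue
        } -ArgumentList $funcName, $funcDef, $Object.Name
    }
}

# Wait for all jobs to finish and collect their output.
Get-Job | Wait-Job | Receive-Job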
In case you want to avoid the repetition of explicitly enumerating individual function calls that take the same arguments:
# Get function-info objects for the target functions by name pattern
# (wildcard expression), as an array.
# Note: Alternatively, you can store the names explicitly in an array:
#   $funcsToCall = 'function1', 'function2', 'function3'
$funcsToCall = Get-Item function:function?

foreach ($object in $ObjectsArray) {
    # Loop over all functions in the array and call each.
    foreach ($func in $funcsToCall) {
        # Use & (the call operator) to call a function
        # by function-info object or name.
        & $func -arg $object.Name
    }
}
This will still execute the functions sequentially, however.

How to format output from a program spawned from an expect script

I am writing a load-testing script for a RADIUS server using Tcl and Expect.
I am invoking radclient, which comes bundled with the RADIUS server, from my script on a remote server.
The script does the following:
- take the remote server IP
- spawn ssh to the remote server
- invoke radclient
- perform the load test using radclient commands
- collect the result from the output (as shown in the sample output) into a variable
- extract authentications/sec as transactions per second (TPS) from the output, or from the variable in the previous step
I need help with the last two steps.
Sample output from radclient:
*--> timetest 20 10 2 1 1
Cycles: 10, Repetitions: 2, Requests per Cycle: 10
Starting User Number: 1, Increment: 1
Current Repetition Number=1
Skipping Accounting On Request
Total Requests=100, Total Responses=100, Total Accepts=0 Total Not Accepts=100
1: Sending 100 requests and getting 100 responses took 449ms, or 0.00 authentications/sec
Current Repetition Number=2
Skipping Accounting On Request
Total Requests=100, Total Responses=100, Total Accepts=0 Total Not Accepts=100
2: Sending 100 requests and getting 100 responses took 471ms, or 0.00 authentications/sec
Expected Output:
TPS achieved = 0
You might use something like this:
expect -re {([\d.]+) authentications/sec}
set authPerSec $expect_out(1,string)
puts "TPS achieved = $authPerSec"
However, that's not to say that the information extracted is the right information. For example, when run against your test data it is likely to come unstuck, as there are two places where you have authentications/sec due to all the repetitions; we don't account for that at all! More complex patterns might extract more information and so on. A version that accounts for the repetitions might look like this:
expect {
    -re {([\d.]+) authentications/sec} {
        set authPerSec $expect_out(1,string)
        puts "TPS achieved #[incr count] = $authPerSec"
        exp_continue
    }
    "bash$" {
        # System prompt means stop expecting; tune for what you've got...
    }
}
Doing the right thing can be complex sometimes…
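For context, here is a rough end-to-end sketch of how this could sit inside the larger script. The host, login handling, prompt patterns, and the radclient invocation are placeholders, not taken from the question:

#!/usr/bin/expect -f
# Placeholder connection details; adjust to your environment.
set remote_ip [lindex $argv 0]
spawn ssh admin@$remote_ip
expect "password:"            ;# or rely on SSH keys instead
send "secret\r"
expect "$ "                   ;# remote shell prompt (placeholder pattern)
send "radclient ...\r"        ;# invoke radclient however you normally do
expect "*-->"                 ;# radclient prompt, as in the sample output
send "timetest 20 10 2 1 1\r"

set authPerSec 0
set timeout -1                ;# the test run can take a while
expect {
    -re {([\d.]+) authentications/sec} {
        set authPerSec $expect_out(1,string)
        exp_continue
    }
    "*-->" {
        # Prompt is back: the test run is finished.
    }
}
puts "TPS achieved = $authPerSec"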

keepalived + MySQL with periodic MISC_CHECK

I have a Keepalived + MySQL (master - master) setup.
I have kept the priority the same for MASTER and BACKUP because I don't want them to start flapping frequently (a one-time switch of the VIP is good enough).
This setup works fine if I use a simple vrrp_script to check whether the mysql daemon is down, e.g.:
# script to check the mysql daemon
vrrp_script chk_mysql {
    script "killall -0 mysqld"    # verify the pid exists or not
    interval 2                    # check every 2 seconds
    weight 2
}
I want to make it work with a deeper health check using a Python script, and I want to use MISC_CHECK for that, e.g.:
MISC_CHECK {
    misc_path "script_to_call_python_script.sh xxxx xxxx xxxx xxxx"
    misc_timeout 5
}
My questions are:
How can I make the MISC_CHECK run at specified intervals?
Alternatively, what is the required output of the script in vrrp_script, so that I could run my shell script there (which itself runs at a periodic interval)?
Place the Python code in a folder and call it from your vrrp_script like this:
vrrp_script chk_mysql {
    script "location of your python script"
    interval "the specified interval"
    weight 2
}
Set the script's exit code to 0 or 1 depending on the result of the check.
As @nimesh said above, vrrp_script supports Python scripts directly. Just put your shell/Python/Ruby script location in the script "location of your script" setting.
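As an illustration, a deeper check script called from vrrp_script might look like the sketch below. It assumes the MySQLdb client library and hypothetical monitoring credentials; keepalived only cares about the exit status (0 = healthy, non-zero = failed):

#!/usr/bin/env python
# Hypothetical deeper health check for keepalived's vrrp_script.
# Exit 0 if MySQL answers a real query, exit 1 otherwise.
import sys

import MySQLdb  # assumes the MySQL-python client library is installed

try:
    conn = MySQLdb.connect(host="127.0.0.1", user="monitor",
                           passwd="secret", connect_timeout=3)
    cur = conn.cursor()
    cur.execute("SELECT 1")  # replace with whatever deeper check you need
    cur.fetchone()
    conn.close()
except Exception:
    sys.exit(1)

sys.exit(0)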

How can I log "show processlist" when there are more than n queries?

Our mysql processes can sometimes get backlogged and processes begin queuing up. I'd like to debug when and why this occurs by logging the processlist during slow times.
I'd like to run show full processlist; via a cron job and save output to text file if there are more than 50 rows returned.
Can you point me in the right direction?
For example:
echo "show full processlist;" | mysql -uroot > processlist-`date +%F-%H-%M`.log
I'd like to run that only when the result contains the text 50 rows in set (or greater than 50 rows).
pt-stalk is designed for this exact purpose. It samples the process list every second (or whatever time you specify), then when a threshold is reached (Threads_running is the default and is what you want in this case), collects a whole bunch of data, including disk activity, tcpdumps, multiple samples of the process list, server status variables, mutex/innodb status, and a bunch more.
Here's how to start it:
pt-stalk --daemonize --dest /var/lib/pt-stalk --collect-tcpdump --threshold 50 --cycles 1 --disk-pct-free 20 --retention-time 3 -- --defaults-file=/etc/percona-toolkit/pt-stalk_my.cnf
The command above will sample Threads_running (--threshold; set this to your value for n) every second (the default --interval) and fire a data collection if Threads_running is greater than 50 for 1 consecutive sample (--cycles). 3 days (--retention-time) of samples will be kept, and collection will not fire if less than 20% of your disk is free (--disk-pct-free). At each collection, a pcap-format tcpdump will be executed (--collect-tcpdump), which can be analyzed with either conventional tcpdump tools or a number of other Percona Toolkit tools, including pt-query-digest and pt-tcp-model. There will be a 5 minute rest in between samples (the default --sleep) in order to prevent you from DoS'ing yourself. The process will be daemonized (--daemonize). The parameters after -- will be passed to all mysql/mysqladmin commands, so it is a good place to set things like --defaults-file, where you can store your login credentials away from prying eyes.
First of all, make sure MySQL's slow queries log isn't what you need. Also, MySQL's -e parameter allows you to specify a query on the command line.
Turning the logic around, this saves the process list and removes it when the process list isn't long enough:
date=$(date +...) # set the desired date format here
[ $(mysql -uroot -e "show full processlist" | tee plist-$date.log | wc -l) -lt 51 ] && rm plist-$date.log
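For completeness, one way this could be wired up for cron (the script path, log location, and schedule below are illustrative, not from the answer):

#!/bin/sh
# Hypothetical wrapper, e.g. saved as /usr/local/bin/log-processlist.sh.
# Keep the dump only when the output has 51+ lines (50 data rows plus the header).
date=$(date +%F-%H-%M)
[ "$(mysql -uroot -e 'show full processlist' | tee /var/log/plist-$date.log | wc -l)" -lt 51 ] \
    && rm -f "/var/log/plist-$date.log"

A crontab entry such as * * * * * /usr/local/bin/log-processlist.sh would then run the check every minute.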