SGE hold_jid and catching failed jobs - sungridengine

I have a script that submits a number of jobs to run in parallel on an SGE queue, and a gathering script that is executed when all of those jobs have finished. I am using -hold_jid wc_job_list to hold the gathering script while the parallel jobs are running.
I just noticed that sometimes some of the parallel jobs fail and the gathering script still runs. The documentation states that:
If any of the referenced jobs exits with exit code 100, the submitted
job will remain ineligible for execution.
How can I catch the exit status of the failed parallel jobs so that, if any of them fails for any reason, the gathering script is not executed or gives an error message?

In Bash, you can check the exit status of your program (available as $?) and, if it is not 0 (the exit status for normal termination), call exit 100 at the end of your job script.
The drawback is that your job will remain in the queue in state Eqw and has to be deleted manually.
UPDATE: For every job you set to Eqw your administrators get an email...
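The same idea can be sketched as a small Python wrapper, for cases where the job script is not written in Bash. This is only an illustration: the names SGE_HOLD_EXIT, run_and_map and main are made up here, and the only fact taken from the documentation quoted above is that exit code 100 keeps dependent jobs on hold.

```python
import subprocess
import sys

# Exit code that, per the quoted documentation, keeps -hold_jid dependents on hold.
SGE_HOLD_EXIT = 100


def run_and_map(cmd):
    """Run cmd; return 0 on success and SGE_HOLD_EXIT on any kind of failure."""
    if not cmd:
        # Nothing to run counts as a failure.
        return SGE_HOLD_EXIT
    try:
        returncode = subprocess.run(cmd).returncode
    except OSError:
        # The program could not be started at all (e.g. not found).
        return SGE_HOLD_EXIT
    return 0 if returncode == 0 else SGE_HOLD_EXIT


def main(argv=None):
    """Hypothetical entry point: python wrapper.py my_program arg1 arg2"""
    argv = sys.argv if argv is None else argv
    return run_and_map(argv[1:])
```

A job script would end with sys.exit(main()), so that any failure of the wrapped program is reported to SGE as exit code 100. The Eqw caveat above still applies.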

Related

Successful build, but incomplete writeAndRead tasks

Is it usual for a build to be successful when not all the tasks for the writeAndRead stage have completed?
e.g. I have 38637 of 50000 tasks complete

Save model weights when a program receives TIME LIMIT while training on a SLURM cluster

I use deep learning models written in pytorch_lightning (PyTorch) and train them on Slurm clusters. I submit a job like this:
sbatch --gpus=1 -t 100 python train.py
When the requested GPU time ends, Slurm kills my program and shows this message:
Epoch 0: : 339it [01:10, 4.84it/s, loss=-34] slurmstepd: error: *** JOB 375083 ON cn-007 CANCELLED AT 2021-10-04T22:20:54 DUE TO TIME LIMIT ***
How can I configure a Trainer to save the model when the available time ends?
I know about automatic saving after each epoch, but I have only one long epoch that lasts more than 10 hours, so that does not work for me.
You can use Slurm's signalling mechanism to pass a signal to your application when it is within a certain number of seconds of the time limit (see man sbatch). In your submission script, use --signal=USR1@30 to send USR1 30 seconds before the time limit is reached. Your submit script would contain these lines:
#SBATCH -t 100
#SBATCH --signal=USR1@30
srun python train.py
Then, in your code, you can handle that signal like this:
import signal

def handler(signum, frame):
    print('Signal handler got signal', signum)
    # e.g. exit(0), or call your pytorch save routines

# enable the handler
signal.signal(signal.SIGUSR1, handler)

# your code here
You need to call your Python application via srun in order for Slurm to be able to propagate the signal to the Python process. (You can probably use --signal on the command line to sbatch, I tend to prefer writing self-contained submit scripts :))
Edit: This link has a nice summary of the issues involved with signal propagation and Slurm.
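Since a signal can arrive in the middle of a training step, one common pattern is to have the handler only set a flag and to do the actual saving at a safe point in the loop. The sketch below illustrates that pattern under some assumptions: TimeLimitFlag, train and save_checkpoint are hypothetical names, and the loop body stands in for real training code. With pytorch_lightning the saving would go through Lightning's own checkpoint routines, which are not shown here.

```python
import os
import signal


class TimeLimitFlag:
    """Records that the warning signal (SIGUSR1 by default) has arrived."""

    def __init__(self, signum=signal.SIGUSR1):
        self.triggered = False
        # Must be called from the main thread, like any signal.signal() call.
        signal.signal(signum, self._handle)

    def _handle(self, signum, frame):
        # Keep the handler tiny: just remember that the signal was received.
        self.triggered = True


def train(steps, flag, save_checkpoint):
    """Run up to `steps` iterations; save and stop early once the flag is set."""
    for step in range(steps):
        # ... one training step would go here ...
        if flag.triggered:
            save_checkpoint(step)  # hypothetical callback that writes the weights
            return step
    return steps
```

To try it without Slurm, send the signal to the running process yourself, for example with os.kill(os.getpid(), signal.SIGUSR1) before calling train.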

Receive email only when all the tasks are completed

I am launching a lot of jobs on a cluster as an array (similar to what is explained at http://www3.imperial.ac.uk/bioinfsupport/help/cluster_usage/submitting_array_jobs)
If I use #$ -m ea I receive hundreds of emails, one per job.
How can I receive an email only when all the tasks are completed? Is it possible to receive one email when all the tasks are completed, and also an email when any of the tasks is aborted?
As far as I know, this is not possible directly. Others may have more experience, so I defer to them for a final solution.
However, what you can do is:
1. Submit your job array without the -m option (or with -m a to track aborted tasks).
2. Submit a second, single dummy job using the -hold_jid_ad <job_id_of_job_array> and -m e options.
This sends an email when the hold on the single job (step 2) is satisfied, i.e. when all tasks in your job array (step 1) have completed.
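The two submissions can be sketched as command lines built in Python. This is an illustration rather than a tested recipe: the function names and the /bin/true payload (run with qsub's -b y option) are assumptions. The sketch defaults to plain -hold_jid, the option used for the same purpose earlier on this page; pass hold_option="-hold_jid_ad" to follow the answer literally.

```python
def array_job_cmd(script, first, last, mail_on_abort=True):
    """Step 1: the job array, optionally mailing only when a task is aborted."""
    cmd = ["qsub", "-t", f"{first}-{last}"]
    if mail_on_abort:
        cmd += ["-m", "a"]
    cmd.append(script)
    return cmd


def notifier_cmd(array_job_id, hold_option="-hold_jid"):
    """Step 2: a dummy job that waits for the array and mails when it ends."""
    return ["qsub", hold_option, str(array_job_id),
            "-m", "e", "-b", "y", "/bin/true"]
```

Each list can be passed to subprocess.run on a submit host; the job ID for step 2 comes from the output of step 1.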

How to know when a QtConcurrent::run function actually starts running

As mentioned in doc for QtConcurrent::run:
Note that the function may not run immediately; the function will only be run when a thread is available.
So here is my question: how can I detect when the job actually starts running?
The QFutureWatcher started() signal is not suitable because
This signal is emitted when this QFutureWatcher starts watching the future set with setFuture().

Avoid printing job exit codes in SGE with option -sync yes

I have a Perl script which submits a bunch of array jobs to SGE. I want all the jobs to be run in parallel to save me time, and the script to wait for them all to finish, then go on to the next processing step, which integrates information from all SGE output files and produces the final output.
In order to send all the jobs into the background and then wait, I use Parallel::ForkManager and a loop:
use Parallel::ForkManager;

# @as: list of jobs; its size is used as the max nb of processes to run simultaneously
$fork_manager = Parallel::ForkManager->new(scalar @as);
for $a (@as) {
    $fork_manager->start and next;  # Starts the child process
    system "qsub <qsub_options> ./script.plx";
    $fork_manager->finish;          # Terminates the child process
}
$fork_manager->wait_all_children;
<next processing step, local>
In order for the "waiting" part to work, however, I have had to add "-sync yes" to the qsub options. As a side effect, SGE prints the exit code for each task in each array job. Since there are many jobs and the individual tasks are light, this basically renders my shell unusable because of all those interrupting messages while the qsub jobs are running.
How can I get rid of those messages? If anything, I would be interested in checking qsub's exit code for the jobs (so I can check everything went ok before the next step), but not in one exit code for each task (I log the tasks' error via option -e anyway in case I need it).
The simplest solution would be to redirect the output from qsub somewhere, i.e.
system("qsub <qsub options> ./script.plx >/dev/null 2>&1");
but this masks errors that you might want to see. Alternatively, you can use open() to start the subprocess and read its output, only printing something if the subprocess generates an error.
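The "capture the output and only show it on error" idea looks roughly like this. Note that the sketch uses Python's subprocess module rather than Perl's open(), and run_quietly is a made-up helper name; the command passed in would be the qsub invocation.

```python
import subprocess
import sys


def run_quietly(cmd):
    """Run cmd, hiding its output unless it exits with a nonzero status."""
    result = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into the captured output
        text=True,
    )
    if result.returncode != 0:
        # Only failures are shown, so per-task chatter stays out of the shell.
        sys.stderr.write(result.stdout)
    return result.returncode
```

The return value is the exit status of the command, so the caller can still check that everything went well before moving on to the next step.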
I do have an alternative solution for you, though. You could submit the jobs to SGE without -sync y and capture the job ID when qsub prints it. Then turn your summarization and results collection code into a follow-on job and submit it with a dependency on the completion of the first jobs. You can submit this final job with -sync y so that your calling script waits for it to end. See the documentation for -hold_jid in the qsub man page.
Also, rather than making your calling script decide when to submit the next job (up to your maximum), use SGE's -tc option to specify the maximum number of simultaneous jobs (note that -tc isn't in the man page, but it is in qsub's -help output). This depends on you using a new enough version of SGE to have -tc, of course.
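A rough sketch of that workflow, again as command lines built in Python. It assumes an SGE version whose qsub supports -terse (print only the job ID) and that array jobs print the ID in the form <id>.<first>-<last>:<step>; both are assumptions to check on your installation. The function names and gather.sh are hypothetical.

```python
def parse_terse_job_id(output):
    """Extract the numeric job ID from `qsub -terse` output.

    Assumes array jobs print '<id>.<first>-<last>:<step>' and plain jobs '<id>'.
    """
    return output.strip().split(".", 1)[0]


def throttled_array_cmd(script, n_tasks, max_running):
    """Array job limited to max_running simultaneous tasks via -tc."""
    return ["qsub", "-terse", "-t", f"1-{n_tasks}", "-tc", str(max_running), script]


def gather_cmd(script, job_ids):
    """Follow-on job held until all job_ids finish; -sync y makes qsub wait for it."""
    return ["qsub", "-hold_jid", ",".join(job_ids), "-sync", "y", script]
```

The calling script would run the first command once per array job, collect the parsed IDs, and then run the gather command, which blocks until the final job ends.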