Avoid printing job exit codes in SGE with option -sync yes

I have a Perl script which submits a bunch of array jobs to SGE. I want all the jobs to be run in parallel to save me time, and the script to wait for them all to finish, then go on to the next processing step, which integrates information from all SGE output files and produces the final output.
In order to send all the jobs into the background and then wait, I use Parallel::ForkManager and a loop:
use Parallel::ForkManager;

# @as: the list of jobs; using its size as the cap lets them all run at once
my $fork_manager = Parallel::ForkManager->new(scalar @as);
for my $a (@as) {
    $fork_manager->start and next; # forks; the parent moves on to the next job
    system "qsub <qsub_options> ./script.plx";
    $fork_manager->finish; # terminates the child process
}
$fork_manager->wait_all_children;
<next processing step, local>
In order for the "waiting" part to work, however, I have had to add -sync yes to the qsub options. But as a side effect of this, SGE prints the exit code for each task in each array job, and since there are many jobs and the individual tasks are light, it basically renders my shell unusable due to all those interrupting messages while the qsub jobs are running.
How can I get rid of those messages? If anything, I would be interested in checking qsub's exit code for the jobs (so I can check that everything went OK before the next step), but not in one exit code per task (I log the tasks' errors via the -e option anyway, in case I need them).

The simplest solution would be to redirect the output from qsub somewhere, e.g.
system("qsub <qsub options> ./script.plx >/dev/null 2>&1");
but this masks errors that you might want to see. Alternatively, you can use open() to start the subprocess and read its output, only printing something if the subprocess generates an error.
I do have an alternative solution for you, though. You could submit the jobs to SGE without -sync y, and capture the job id when qsub prints it. Then, turn your summarization and results-collection code into a follow-on job and submit it with a dependency on the completion of the first jobs. You can submit this final job with -sync y so your calling script waits for it to end. See the documentation for -hold_jid in the qsub man page.
Also, rather than making your calling script decide when to submit the next job (up to your maximum), use SGE's -tc option to specify the maximum number of simultaneous jobs (note that -tc isn't in the man page, but it is in qsub's -help output). This depends on you using a new enough version of SGE to have -tc, of course.
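Putting those pieces together, a rough shell sketch (assuming a version of SGE that has -terse and -tc; the task range, the -tc limit, and collect.sh are placeholders):
# submit the array job; -terse makes qsub print only the job id
jobid=$(qsub -terse -t 1-100 -tc 20 ./script.plx)
jobid=${jobid%%.*}   # array jobs print e.g. "123.1-100:1"; keep just "123"

# submit the gathering step, held until the whole array has finished;
# -sync y makes this qsub call block until the gathering job ends
qsub -sync y -hold_jid "$jobid" ./collect.sh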

Related

Save model weights when a program receives TIME LIMIT while learning on a SLURM cluster

I use deep learning models written in pytorch_lightning (PyTorch) and train them on Slurm clusters. I submit jobs like this:
sbatch --gpus=1 -t 100 python train.py
When the requested GPU time ends, Slurm kills my program and shows a message like this:
Epoch 0: : 339it [01:10, 4.84it/s, loss=-34] slurmstepd: error: *** JOB 375083 ON cn-007 CANCELLED AT 2021-10-04T22:20:54 DUE TO TIME LIMIT ***
How can I configure a Trainer to save the model when the available time ends?
I know about automatic saving after each epoch, but I have only one long epoch that lasts >10 hours, so that does not work for me.
You can use Slurm's signalling mechanism to pass a signal to your application when it is within a certain number of seconds of the time limit (see man sbatch). In your submission script, use --signal=USR1@30 to send USR1 30 seconds before the time limit is reached. Your submit script would contain these lines:
#SBATCH -t 100
#SBATCH --signal=USR1@30
srun python train.py
Then, in your code, you can handle that signal like this:
import signal

def handler(signum, frame):
    print('Signal handler got signal ', signum)
    # e.g. exit(0), or call your pytorch save routines

# enable the handler
signal.signal(signal.SIGUSR1, handler)

# your code here
You need to call your Python application via srun in order for Slurm to be able to propagate the signal to the Python process. (You can probably use --signal on the command line to sbatch; I tend to prefer writing self-contained submit scripts. :))
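For reference, the command-line variant would look something like this (job.sh is a placeholder for the submit script above, containing the srun line):
# same flags as in the script, just moved to the sbatch command line
sbatch --signal=USR1@30 -t 100 job.sh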
Edit: This link has a nice summary of the issues involved with signal propagation and Slurm.

How to complete a converging parallel gateway in a test

I wrote some JUnit tests for my process. In some cases I used
runtimeService
    .createProcessInstanceByKey("ID")
    .startBeforeActivity("taskID")
    .setVariables(map)
    .execute();
to start a process from a given task (not from the beginning).
This works well so far. In one case, the starting task is in one of two flows after a parallel gateway. The process now just executes until it reaches the 'end' gateway of this parallel flow.
Is there a way to 'mock' that missing token on the second incoming sequence flow?
I hope, you understood me ;-)
You can execute
runtimeService
.createProcessInstanceModification(processInstanceId)
.startBeforeActivity(idOfGateway)
.execute();
If there are n missing tokens, make sure to call #startBeforeActivity n times.

Receive email only when all the tasks are completed

I am launching a lot of jobs on a cluster as an array (similar to what is explained in http://www3.imperial.ac.uk/bioinfsupport/help/cluster_usage/submitting_array_jobs).
If I use #$ -m ea I receive hundreds of emails, one per job.
How can I receive an email only when all the tasks are completed? Is it possible to receive an email when all the tasks are completed, but also one when any of the tasks is aborted?
As far as I know, this does not seem possible; others may have more experience, so I defer the final answer to them.
However, what you can do is:
Submit your job array without the -m option (or with -m a to track aborted tasks).
Submit a second, single dummy job using -hold_jid <job_id_of_job_array> and the -m e option.
This will send an email when the hold on the single job (step 2) is satisfied, i.e. when all tasks in your job array have completed (step 1).
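For illustration, a rough shell sketch of the two submissions (the task range, script name, and mail address are placeholders):
# step 1: the array job; -m a mails you only if a task is aborted
jobid=$(qsub -terse -m a -t 1-500 ./array_task.sh)
jobid=${jobid%%.*}   # strip the task range, keeping the numeric job id

# step 2: a dummy job that exists only to trigger the end-of-job mail
qsub -hold_jid "$jobid" -m e -M you@example.com -b y /bin/true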

My Node.js script is not exiting on its own after successful execution

I have written a script to update my db table after reading data from db tables and Solr. I am using the async.waterfall module. The problem is that the script does not exit after successful completion of all operations. I have used a db connection pool as well, thinking that it may be what keeps the script waiting indefinitely.
I want to put this script in a crontab, and if it does not exit properly it will create a hell of a lot of instances unnecessarily.
I just went through this issue.
The problem with just using process.exit() is that the program I am working on was creating handles, but never destroying them.
It was processing a directory and putting data into OrientDB.
Some of the things that I have come to learn are that database connections need to be closed before getting rid of the reference, and that process.exit() does not solve all cases.
When my project processed 2,000 files, it would get down to about 500 left, and by then the extra handles had filled up the available working memory, which meant it could not continue and never reached the process.exit() at the end.
On the other hand, if you close the items that are requesting the app to stay open, you can solve the problem at its source.
The two "Undocumented Functions" that I was able to use, were
process._getActiveHandles();
process._getActiveRequests();
I am not sure what other functions will help with debugging these types of issues, but these ones were amazing.
They return an array, and you can determine a lot about what is going on in your process by using these methods.
You have to tell it when you're done, by calling
process.exit();
More specifically, you'll want to call this in the callback from async.waterfall() (the second argument to that function). At that point, all your asynchronous code has executed, and your script should be ready to exit.
EDIT: As pointed out by @Aaron below, this likely has to do with something like a database connection being active, and not allowing the node process to end.
You can use the node module why-is-node-running:
Run npm install -D why-is-node-running
Add import * as log from 'why-is-node-running'; in your code
When you expect your program to exit, add a log statement:
afterAll(async () => {
    await app.close();
    log();
});
This will print a list of open handles with a stacktrace to find out where they originated:
There are 5 handle(s) keeping the process running
# Timeout
/home/maf/dev/node_modules/why-is-node-running/example.js:6 - setInterval(function () {}, 1000)
/home/maf/dev/node_modules/why-is-node-running/example.js:10 - createServer()
# TCPSERVERWRAP
/home/maf/dev/node_modules/why-is-node-running/example.js:7 - server.listen(0)
/home/maf/dev/node_modules/why-is-node-running/example.js:10 - createServer()
We can quit the execution by using:
connection.destroy();
If you use Visual Studio code, you can attach to an already running Node script directly from it.
First, run the Debug: Attach to Node Process command. When you invoke the command, VS Code will prompt you to pick the Node.js process to attach to.
Your terminal should display this message:
Debugger listening on ws://127.0.0.1:9229/<...>
For help, see: https://nodejs.org/en/docs/inspector
Debugger attached.
Then, inside your debug console, you can use the code from The Lazy Coder’s answer:
process._getActiveHandles();
process._getActiveRequests();

SGE hold_jid and catching failed jobs

I have a script that submits a number of jobs to run in parallel on an SGE queue, and another gathering script that is executed when this list of jobs are finished. I am using -hold_jid wc_job_list to hold the execution of the gathering script while the parallel jobs are running.
I just noticed that sometimes some of the parallel jobs fail but the gathering script still runs. The documentation states that:
If any of the referenced jobs exits with exit code 100, the submitted job will remain ineligible for execution.
How can I catch the parallel failed jobs exit status so that if any of them fail for any reason, the gathering script is not executed or gives an error message?
In the case of bash, you could check the exit status of your program (available as $?) and, if it is not 0 (the exit status for normal termination), call exit 100 at the end of your job script, e.g.:
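A minimal job script sketch of that idea (assuming an array job, so $SGE_TASK_ID is set; run_task.sh stands in for the real work and is a placeholder):
#!/bin/bash
#$ -S /bin/bash

./run_task.sh "$SGE_TASK_ID"    # the actual work; placeholder script

# map any failure to exit code 100 so that jobs held with
# -hold_jid on this one remain ineligible for execution
if [ $? -ne 0 ]; then
    exit 100
fi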
The problem with this is that the failed job will remain in the queue in state Eqw and has to be deleted manually.
UPDATE: For every job that ends up in Eqw, your administrators get an email...