SGE unknown resource "nodes"

I submit a job on SGE with the -l option, like this:
qsub -pe orte 4 -l nodes=4 run.sh
However, the system reports:
Unable to run job: unknown resource "nodes".
Could you tell me why and how to solve it?
Thank you very much!

With Sun Grid Engine, the correct resource parameter is h, not nodes:
echo 'echo `hostname`' | qsub -l h=<some_hostname>
Using this example, you should see the hostname you specified in the standard output file.

There isn't a nodes resource. Instead you request a parallel environment and a number of slots (which usually map to cores). The number of nodes you actually get is determined by the allocation_rule of the parallel environment. There is usually a simple PE called something like mpi that will pack as many slots (cores) onto each node as will fit. Some people have created configurations for Grid Engine that give it a more PBS-like syntax.
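For example, on a hypothetical cluster whose nodes provide 4 slots each, requesting 16 slots in the orte PE from the question would land on roughly 4 nodes, depending on that PE's allocation_rule; the qconf commands show what your site actually defines:

# Request 16 slots in the "orte" parallel environment; how they are spread
# across nodes is decided by the PE's allocation_rule, not by a nodes= resource.
qsub -pe orte 16 run.sh

# Show the definition of the "orte" PE (including its allocation_rule),
# and list all parallel environments configured on the cluster.
qconf -sp orte
qconf -spl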

Save a model weights when a program receives TIME LIMIT while learning on a SLURM cluster

I use deep learning models written in pytorch_lightning (pytorch) and train them on Slurm clusters. I submit jobs like this:
sbatch --gpus=1 -t 100 python train.py
When the requested GPU time ends, Slurm kills my program and shows a message like this:
Epoch 0: : 339it [01:10, 4.84it/s, loss=-34] slurmstepd: error: *** JOB 375083 ON cn-007 CANCELLED AT 2021-10-04T22:20:54 DUE TO TIME LIMIT ***
How can I configure the Trainer to save the model when the available time ends?
I know about automatic saving after each epoch, but I have only one long epoch that lasts more than 10 hours, so that approach does not work for me.
You can use Slurm's signalling mechanism to pass a signal to your application when it is within a certain number of seconds of the time limit (see man sbatch). In your submission script, use --signal=USR1@30 to send USR1 30 seconds before the time limit is reached. Your submit script would contain these lines:
#SBATCH -t 100
#SBATCH --signal=USR1@30
srun python train.py
Then, in your code, you can handle that signal like this:
import signal

def handler(signum, frame):
    print('Signal handler got signal ', signum)
    # e.g. exit(0), or call your pytorch save routines

# enable the handler
signal.signal(signal.SIGUSR1, handler)

# your code here
You need to call your Python application via srun in order for Slurm to be able to propagate the signal to the Python process. (You can probably use --signal on the command line to sbatch; I tend to prefer writing self-contained submit scripts :))
Edit: This link has a nice summary of the issues involved with signal propagation and Slurm.
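Putting the pieces together, a minimal submit script could look like the sketch below; the GPU request, the 100-minute limit and the 30-second warning window are simply the values from the question and answer, so adjust them to your job:

#!/bin/bash
#SBATCH --gpus=1
#SBATCH -t 100
#SBATCH --signal=USR1@30

# srun is required so that Slurm can deliver SIGUSR1 to the Python process itself.
srun python train.py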

How to get who or what turned off a pod?

We are currently trying to debug an issue with a pod and found that 6 other (unrelated) pods were shut down. We want to figure out when that happened and who or what shut them down, to see whether or not it is related to the first issue.
Is it possible to get this kind of information with OpenShift?
These operations are typically recorded in the audit logs (if you have enabled those): https://docs.openshift.com/container-platform/4.7/security/audit-log-view.html
You can then filter for specific actions; for example, the following excludes read-only GET requests:
oc adm node-logs node-1.example.com --path=oauth-apiserver/audit.log \
| jq 'select(.verb != "get")'
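To get closer to "who or what turned the pod off", you can filter for delete requests against pods instead. The sketch below assumes the standard Kubernetes audit event fields and the kube-apiserver audit log path; adjust the node name (and add a namespace filter) for your cluster:

oc adm node-logs node-1.example.com --path=kube-apiserver/audit.log \
| jq 'select(.verb == "delete" and .objectRef.resource == "pods")
      | {time: .requestReceivedTimestamp, user: .user.username,
         namespace: .objectRef.namespace, pod: .objectRef.name}'

If the reported user is a controller service account rather than a person, the pod was most likely removed by the platform itself, for example during a scale-down or a node drain.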

Can PM2 take an action upon a process being marked "Errored"

PM2 will mark a process status "Errored" if it restarts more than "max_restarts" times, with each restart lasting less than "min_uptime". Perhaps it happens in other circumstances as well.
I'd like to take an action when such a string of fatal errors occurs. In my case, I'd like to reboot the whole machine, since it means something horrible has happened. Is this possible?
Note: I now see that it's possible to do this when PM2 is being used programmatically (see answer below). Is there a way to do it automatically through the CLI instead? Something similar to a githook that runs automatically upon the "errored" status being raised.
If PM2 is being used programmatically, this function can be used:
pm2.describe(process,errback)
It returns 'processDescription', which includes 'pm2_env', which includes 'status', which would show 'errored'.
This may answer the question for someone else, but it does not answer the question for me, as I would like to use PM2 via CLI call, and not from within another node script.
The question is quite old, but I had the same problem, and nowadays there is a CLI solution:
You can use pm2 jlist to get the current process list as JSON and parse it, for example with jq. To find all pm2-managed processes in the "errored" status, you could call something like:
pm2 jlist | jq '.[] | {"name": .name, "status": .pm2_env.status} | select(.status=="errored")'
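Building on that, a small watchdog run from cron (or a systemd timer) could take the reboot action. This is only a sketch; it assumes pm2 and jq are on the PATH of the user running it and that this user is allowed to reboot the machine:

#!/bin/bash
# Count the processes PM2 currently reports as "errored".
errored=$(pm2 jlist | jq '[.[] | select(.pm2_env.status == "errored")] | length')

if [ "$errored" -gt 0 ]; then
  # A string of fatal restarts has occurred: reboot the whole machine.
  sudo /sbin/shutdown -r now
fi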

How to ping from Zabbix agent?

Is it possible to ping from the Zabbix agent and pass that data to the Zabbix server? I would like to be able to get the response time from the agent.
I read that it is possible using fping; it would be great if someone could point me in the right direction.
Thank you,
Rijath Mohammed
While that is not currently available out of the box, you can implement such functionality using a feature called "user parameters". This forum thread has a simple example:
UserParameter=myping[*],/etc/zabbix/fping -q $1;echo $?
Although for you the path to fping is likely to be /usr/sbin/fping or /usr/bin/fping.
You can read more about user parameters in the official manual: https://www.zabbix.com/documentation/3.0/manual/config/items/userparameters .
While I haven't ever configured that, it would be similar on Windows - see this forum thread for some inspiration.
And if you would like to see this feature implemented out of the box, make sure to vote on this feature request.
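If you want the actual response time rather than just an exit code, a variation on the same user parameter idea can parse fping's summary output. This is a sketch: the item key myping.rtt[*] is made up here, the fping path may differ on your system, and it assumes fping -c1 -q prints its usual slash-separated min/avg/max summary (on stderr, hence the redirection):

# Average round-trip time in ms for the host given as the key parameter,
# e.g. item key myping.rtt[google.com]; the 8th slash-separated field is the avg value.
UserParameter=myping.rtt[*],/usr/sbin/fping -c1 -q $1 2>&1 | cut -d/ -f8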
Got it working using the PowerShell script below :)
$Test = Test-Connection google.com -Count 1
$Test.ResponseTime
This just returns the response time for google.com, and that value is passed to Zabbix using the user parameter below:
UnsafeUserParameters=1
UserParameter=ping.google,C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe C:\zabbix\pinggoogle.ps1
I am calling this parameter from Zabbix using the key "ping.google"

Avoid printing job exit codes in SGE with option -sync yes

I have a Perl script which submits a bunch of array jobs to SGE. I want all the jobs to be run in parallel to save me time, and the script to wait for them all to finish, then go on to the next processing step, which integrates information from all SGE output files and produces the final output.
In order to send all the jobs into the background and then wait, I use Parallel::ForkManager and a loop:
$fork_manager = new Parallel::ForkManager(scalar @as);
# scalar @as: max nb of processes to run simultaneously
for $a (@as) {
    $fork_manager->start and next; # Starts the child process
    system "qsub <qsub_options> ./script.plx";
    $fork_manager->finish; # Terminates the child process
}
$fork_manager->wait_all_children;
<next processing step, local>
In order for the "waiting" part to work, however, I have had to add "-sync yes" to the qsub options. But as a "side effect" of this, SGE prints the exit code for each task in each array job, and since there are many jobs and the individual tasks are light, it basically renders my shell unusable due to all those interrupting messages while the qsub jobs are running.
How can I get rid of those messages? If anything, I would be interested in checking qsub's exit code for the jobs (so I can check that everything went OK before the next step), but not in one exit code for each task (I log the tasks' errors via option -e anyway in case I need them).
The simplest solution would be to redirect the output from qsub somewhere, e.g.
system("qsub <qsub options> ./script.plx >/dev/null 2>&1");
but this masks errors that you might want to see. Alternatively, you can use open() to start the subprocess and read its output, only printing something if the subprocess generates an error.
I do have an alternate solution for you, though. You could submit the jobs to SGE without -sync y, and capture the job id when qsub prints it. Then, turn your summarization and results collection code into a follow-on job and submit it with a dependency on the completion of the first jobs. You can submit this final job with -sync y so your calling script waits for it to end. See the documentation for -hold_jid in the qsub man page; a sketch of this approach appears at the end of this answer.
Also, rather than making your calling script decide when to submit the next job (up to your maximum), use SGE's -tc option to cap the number of array tasks that run simultaneously (note that -tc isn't in the man page, but it is in qsub's -help output). This depends on you using a new enough version of SGE to have -tc, of course.
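Here is a rough shell sketch of the -hold_jid approach, under a few assumptions: qsub's -terse option (which makes qsub print only the job id) is available, <qsub_options> stands for the options from your original script, and summarize.plx is a made-up name for your collection step:

#!/bin/bash
jids=""
for chunk in 1 2 3; do                            # placeholder for your real list of jobs
  jid=$(qsub -terse <qsub_options> ./script.plx)  # -terse prints only the job id
  jid=${jid%%.*}                                  # strip any array-task suffix, e.g. "123.1-10:1" -> "123"
  jids="$jids,$jid"
done
jids=${jids#,}                                    # drop the leading comma

# The collection job starts only once every job in the list has finished (-hold_jid),
# and -sync y makes this single qsub call block until it completes, without the
# per-task exit-code messages from the array jobs themselves.
qsub -sync y -hold_jid "$jids" ./summarize.plx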