Save model weights when a program hits the TIME LIMIT while training on a SLURM cluster - deep-learning

I use deep learning models written in pytorch_lightning (PyTorch) and train them on SLURM clusters. I submit jobs like this:
sbatch --gpus=1 -t 100 python train.py
When the requested GPU time ends, SLURM kills my program and shows this message:
Epoch 0: : 339it [01:10, 4.84it/s, loss=-34] slurmstepd: error: *** JOB 375083 ON cn-007 CANCELLED AT 2021-10-04T22:20:54 DUE TO TIME LIMIT ***
How can I configure a Trainer to save the model when the available time ends?
I know about automatic saving after each epoch, but I have only one long epoch that lasts >10 hours, so that approach does not work for me.

You can use Slurm's signalling mechanism to pass a signal to your application when it's within a certain number of seconds of the timelimit (see man sbatch). In your submission script use --signal=USR1@30 to send USR1 30 seconds before the timelimit is reached. Your submit script would contain these lines:
#SBATCH -t 100
#SBATCH --signal=USR1@30
srun python train.py
Then, in your code, you can handle that signal like this:
import signal

def handler(signum, frame):
    print('Signal handler got signal ', signum)
    # e.g. exit(0), or call your pytorch save routines

# enable the handler
signal.signal(signal.SIGUSR1, handler)

# your code here
You need to call your Python application via srun in order for Slurm to be able to propagate the signal to the Python process. (You can probably use --signal on the command line to sbatch, I tend to prefer writing self-contained submit scripts :))
Edit: This link has a nice summary of the issues involved with signal propagation and Slurm.
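If you are using PyTorch Lightning, one thing the handler can do is write a checkpoint via Trainer.save_checkpoint(). The sketch below is only illustrative: model stands in for your existing LightningModule, the checkpoint filename is made up, and saving directly inside the handler is a simplification (you may prefer to just set a flag and save at the next batch boundary):
import signal
import pytorch_lightning as pl

model = ...                   # placeholder for your existing LightningModule
trainer = pl.Trainer(gpus=1)

def save_on_signal(signum, frame):
    # save_checkpoint() writes weights, optimizer state and loop counters,
    # so training can later be resumed from this file.
    trainer.save_checkpoint("time_limit.ckpt")
    print("Got signal", signum, "- checkpoint written")

signal.signal(signal.SIGUSR1, save_on_signal)  # pairs with --signal=USR1@30

trainer.fit(model)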

Related

Memory issues in a Cloud Storage Function

I have deployed a storage-trigger Cloud Function that needs more memory. I deployed the GCF in the following manner, with the appropriate flags:
gcloud functions deploy GCF_name --runtime python37 --trigger-resource bucket_name --trigger-event google.storage.object.finalize --timeout 540s --memory 8192MB
But I observed in the Google Cloud console that the memory utilization graph never goes beyond 2GB. And in the logs I am getting this error, Function execution took 34566 ms, finished with status: 'connection error', which happens because of memory shortage. Can I get some help on this?
Edited
The application uploads text files to the storage that contain a certain number of samples. Each file is read when it is uploaded to the storage and the data appended to a pre-existing file. The total number of samples will be a maximum of 75600002. That's why I need 8GB of memory. It's giving the connection error while appending the data to the file.
def write_to_file(filename, data, write_meta=False, metadata=[]):
    file1 = open('/tmp/' + filename, "a+")
    if write_meta:
        file1.write(":".join(metadata))
        file1.write('\n')
    file1.write(",".join(data.astype(str)))
    file1.close()
The memory utilisation map was the same after every upload.
You are writing a file to /tmp, which is an in-memory filesystem, so start by deleting that file when you finish uploading it. In fact, the documentation notes:
Files that you write consume memory available to your function, and sometimes persist between invocations. Failing to explicitly delete these files may eventually lead to an out-of-memory error and a subsequent cold start.
Ref : https://cloud.google.com/functions/docs/bestpractices/tips#always_delete_temporary_files
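As a rough sketch of that advice (the bucket argument is assumed to be a google.cloud.storage Bucket object, and the helper name is made up), upload the accumulated file and then delete it so it stops consuming the function's memory:
import os

def upload_and_cleanup(bucket, filename):
    # Hypothetical helper: push the accumulated /tmp file to Cloud Storage,
    # then remove it so it no longer occupies the in-memory filesystem.
    local_path = os.path.join('/tmp', filename)
    bucket.blob(filename).upload_from_filename(local_path)
    os.remove(local_path)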

SGE unknown resource "nodes"

I submit a job on SGE with parameter -l like:
qsub -pe orte 4 -l nodes=4 run.sh
However, the system displays that:
Unable to run job: unknown resource "nodes".
Could you tell me why and how to solve it?
Thank you very much!
With Sun Grid Engine, the correct resource parameter is h, not nodes:
echo 'echo `hostname`' | qsub -l h=<some_hostname>
Using this example, you should see the hostname you specified in the standard output file.
There isn't a nodes resource. Instead you request a parallel environment and a number of slots (which usually map to cores). The number of nodes you get is determined by the allocation_rule of the parallel environment. There is usually a simple PE, called something like mpi, that will pack as many slots (cores) onto each node as will fit. Some people have created configs for Grid Engine that let it have a more PBS-like syntax.

my nodejs script is not exiting on its own after successful execution

I have written a script to update my DB table after reading data from DB tables and Solr. I am using the async.waterfall module. The problem is that the script does not exit after successful completion of all operations. I have also used a DB connection pool, thinking that might be causing the script to wait indefinitely.
I want to put this script in crontab, and if it does not exit properly it will create a lot of instances unnecessarily.
I just went through this issue.
The problem with just using process.exit() is that the program I am working on was creating handles, but never destroying them.
It was processing a directory and putting data into orientdb.
So some of the things that I have come to learn are that database connections need to be closed before getting rid of the reference, and that process.exit() does not solve all cases.
When my project processed 2,000 files, it would get down to about 500 left, and the extra handles would have filled up the available working memory. That meant it was not able to continue, and therefore never reached the process.exit() at the end.
On the other hand, if you close the items that are requesting the app to stay open, you can solve the problem at its source.
The two "Undocumented Functions" that I was able to use were:
process._getActiveHandles();
process._getActiveRequests();
I am not sure what other functions will help with debugging these types of issues, but these ones were amazing.
They return an array, and you can determine a lot about what is going on in your process by using these methods.
You have to tell it when you're done, by calling
process.exit();
More specifically, you'll want to call this in the callback from async.waterfall() (the second argument to that function). At that point, all your asynchronous code has executed, and your script should be ready to exit.
EDIT: As pointed out by @Aaron below, this likely has to do with something like a database connection being active, and not allowing the node process to end.
You can use the node module why-is-node-running:
Run npm install -D why-is-node-running
Add import * as log from 'why-is-node-running'; in your code
When you expect your program to exit, add a log statement:
afterAll(async () => {
  await app.close();
  log();
})
This will print a list of open handles with a stacktrace to find out where they originated:
There are 5 handle(s) keeping the process running
# Timeout
/home/maf/dev/node_modules/why-is-node-running/example.js:6 - setInterval(function () {}, 1000)
/home/maf/dev/node_modules/why-is-node-running/example.js:10 - createServer()
# TCPSERVERWRAP
/home/maf/dev/node_modules/why-is-node-running/example.js:7 - server.listen(0)
/home/maf/dev/node_modules/why-is-node-running/example.js:10 - createServer()
We can quit the execution by using:
connection.destroy();
If you use Visual Studio Code, you can attach to an already running Node script directly from it.
First, run the Debug: Attach to Node Process command:
When you invoke the command, VS Code will prompt you which Node.js process to attach to:
Your terminal should display this message:
Debugger listening on ws://127.0.0.1:9229/<...>
For help, see: https://nodejs.org/en/docs/inspector
Debugger attached.
Then, inside your debug console, you can use the code from The Lazy Coder’s answer:
process._getActiveHandles();
process._getActiveRequests();

Avoid printing job exit codes in SGE with option -sync yes

I have a Perl script which submits a bunch of array jobs to SGE. I want all the jobs to be run in parallel to save me time, and the script to wait for them all to finish, then go on to the next processing step, which integrates information from all SGE output files and produces the final output.
In order to send all the jobs into the background and then wait, I use Parallel::ForkManager and a loop:
$fork_manager = new Parallel::ForkManager(@as);
# @as: Max nb of processes to run simultaneously
for $a (@as) {
    $fork_manager->start and next; # Starts the child process
    system "qsub <qsub_options> ./script.plx";
    $fork_manager->finish; # Terminates the child process
}
$fork_manager->wait_all_children;
<next processing step, local>
In order for the "waiting" part to work, however, I have had to add "-sync yes" to the qsub options. But as a "side effect" of this, SGE prints the exit code for each task in each array job, and since there are many jobs and the single tasks are light, it basically renders my shell unusable due to all those interrupting messages while the qsub jobs are running.
How can I get rid of those messages? If anything, I would be interested in checking qsub's exit code for the jobs (so I can check everything went ok before the next step), but not in one exit code for each task (I log the tasks' error via option -e anyway in case I need it).
The simplest solution would be to redirect the output from qsub somewhere, i.e.
system("qsub <qsub options> ./script.plx >/dev/null 2>&1");
but this masks errors that you might want to see. Alternatively, you can use open() to start the subprocess and read its output, only printing something if the subprocess generates an error.
I do have an alternate solution for you, though. You could submit the jobs to SGE without -sync y, and capture the job id when qsub prints it. Then, turn your summarization and results collection code into a follow on job and submit it with a dependency on the completion of the first jobs. You can submit this final job with -sync y so your calling script waits for it to end. See the documentation for -hold_jid in the qsub man page.
Also, rather than making your calling script decide when to submit the next job (up to your maximum), use SGE's -tc option to specify the maximum number of simultaneous jobs (note that -tc isn't in the man page, but it is in qsub's -help output). This depends on you using a new enough version of SGE to have -tc, of course.

Using a Webpage to Control Robot Arm Written in Linux

Firstly, I am no longer a student and am currently working on a favour for a friend. I am making a website which has a live video feed of a robotic arm and a set of buttons that will allow users basic interaction with the robotic arm.
I have set up the website and live video feed. I do have a 4-second delay using Flash Media Encoder and Flash Server 4.5. Does anyone have suggestions for reducing the delay?
I have written the Python code required for the Maplin robotic arm, and now I am stuck and not sure how to link my Python code with a webpage interface. Can anyone that has done this before provide me with code that I could edit and learn from?
Python Code
import usb.core
import usb.util
import sys
import time
# This program is intended to control a robotic arm via USB from Linux
# The code is written in Python by Neil Polwart (c) 2011
# It is a work in progress and will improved!
# locate the device device
dev = usb.core.find(idVendor=0x1267, idProduct=0x0000)
# assigns the device to the handle "dev"
# can check the device is visible to Linux with command line command lsusb
# which should report a device with the above vendor and id codes.
# was it found?
if dev is None:
    raise ValueError('Device not found') # if device not found report an error
# set the active configuration
dev.set_configuration()
# as no arguments, the first configuration will be the active one
# note as commands are sent to device as commands not data streams
# no need to define the endpoint
# defines the command packet to send
datapack=0x80,0,0
# change this packet to make different moves.
# first byte defines most of the movements, second byte shoulder rotation, third byte light
# command structure in more detail:
# http://notbrainsurgery.livejournal.com/38622.html?view=93150#t93150
print "requested move",datapack # reports the requested movement to the user
# send the command
bytesout=dev.ctrl_transfer(0x40, 6, 0x100, 0, datapack, 1000)
# outputs the command to the USB device, using the ctrl_transfer method
# 0x40, 6, 0x100, 0 defines the details of the write - bRequestType, bRequest, wValue, wIndex
# datapack is our command (3 bytes)
# the final value is a timeout (in ms) which is optional
# bytesout = the number of bytes written (i.e. 3 if successful)
print "Written :",bytesout,"bytes" # confirm to user that data was sent OK
# wait for a defined period
time.sleep(1) # waits for 1 second whilst motors move.
# now STOP the motors
datapack=0,0,0
bytesout=dev.ctrl_transfer(0x40, 6, 0x100, 0, datapack, 1000)
if bytesout == 3: print "Motors stopped"
So I need to find a way to set the datapack line via a website interface. Any help is appreciated! I am using a Windows 7 setup but do have access to VMware.
I'd set up an Apache server with mod_python and create a handler that imports your script and runs the necessary code. You can set up an AJAX script in JavaScript (with or without jQuery). Every time you want to run the Python script, a request needs to be made to the server. You can pass any information back and forth as needed over HTTP.
Here's a good tutorial for Python and the CGI Module.
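As a rough sketch of the CGI route (the form field name, the move table and the script path are all made up; the USB calls mirror the script above), a handler could look something like this:
#!/usr/bin/env python
# Hypothetical CGI handler: maps a form field to a command packet and sends it
# to the arm with the same ctrl_transfer call used in the script above.
import cgi
import time
import usb.core

MOVES = {                      # made-up mapping from form values to packets
    "rotate": (0, 1, 0),
    "light_on": (0, 0, 1),
    "stop": (0, 0, 0),
}

form = cgi.FieldStorage()
move = form.getfirst("move", "stop")
datapack = MOVES.get(move, (0, 0, 0))

print("Content-Type: text/plain")
print("")

dev = usb.core.find(idVendor=0x1267, idProduct=0x0000)
if dev is None:
    print("Arm not found")
else:
    dev.set_configuration()
    dev.ctrl_transfer(0x40, 6, 0x100, 0, datapack, 1000)   # start the move
    time.sleep(1)                                          # let the motors run
    dev.ctrl_transfer(0x40, 6, 0x100, 0, (0, 0, 0), 1000)  # stop the motors
    print("Sent move: " + move)
A button on the page can then hit this script with an AJAX GET such as /cgi-bin/arm.py?move=rotate (path assumed).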