Sun Grid Engine finished job info

Is there a way to list the node that executed a Sun Grid Engine job using qstat or another SGE command?
I have to get this information from a Python script. I have figured out how to execute SGE commands from Python, but I haven't found a way to list the execution node for a particular job. I have tried to list finished jobs using
qstat -s z -f -F
but the name of the host that executed the job doesn't appear in this list. Could anyone help me, please?

If you have your job id:
qacct -j {job id}
Otherwise you can list recent jobs:
qacct -j -b {YYYYMMDDHHmm}
The node that executed the job is the value under the key "hostname".
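Since you are already driving SGE commands from Python, note that qacct prints plain key/value lines, so the host can be extracted directly in the shell and read back from stdout (a sketch; the job id 1234567 is a placeholder):
qacct -j 1234567 | awk '$1 == "hostname" {print $2}'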

Related

How could I send training data to the node that executes the command?

I am using Slurm to allocate GPUs to train my model. I configured the Python environment on node A, which is where my code and data are stored. The common practice is like this:
srun -p gpu --ntasks-per-node=1 --gres=gpu:2 python train.py
This lets Slurm find a node for me and run my code on it. I found that my code runs about 3 times slower than it does on a local machine with the same number of GPUs. I suspect the reason is that the data used by the code is stored on node A, while Slurm assigned a node B to run my code, so the data has to be continuously transferred from node A to node B, which slows down the process.
My question is: is there a way to copy my data to node B so that the code can use the data as if it were local?
You can replace the python train.py part of your command with a Bash script that first transfers the data and then runs python train.py.
Even better would be to create a proper submission script and submit it with sbatch rather than using srun on its own:
#!/bin/bash
#SBATCH -p gpu
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:2
cp /global/directory/data /local/directory/
python train.py
You would need to replace the line cp /global/directory/data /local/directory/ with a proper command to copy the files. It could be scp rather than cp.
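If you prefer to stay with srun, the wrapper mentioned above could be a small Bash script along these lines (a sketch; every path here is a placeholder and the copy command depends on how node A's storage is reachable from the compute node):
#!/bin/bash
# stage_and_train.sh -- copy the dataset to node-local storage, then train
set -e
mkdir -p /local/directory
cp -r /global/directory/data /local/directory/   # or: scp -r nodeA:/global/directory/data /local/directory/
python train.py
and submit it with:
srun -p gpu --ntasks-per-node=1 --gres=gpu:2 bash stage_and_train.sh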

Running a task in the background?

If we submit a task to a Compute Engine instance through ssh from a host machine and then shut down the host machine, is there a way to get hold of the output of the submitted task later on, when we switch the host machine back on?
From the Linux point of view, ssh and gcloud compute ssh are commands like any other, so it is possible to redirect their output to a file while the command runs, using for example >> to redirect and append stdout to a file or 2>> to store stderr.
For example, if you run from the first instance 'name1':
$ gcloud compute ssh name2 --command='watch hostname' --zone=XXXX >> output.out
where 'name2' is the second instance. If at some point you shut down 'name1', you will find in output.out the output produced by the command up to the moment the shutdown occurred.
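If you also want to keep the remote command's error output, the same idea applies (a sketch using the placeholder instance and zone names from above):
$ gcloud compute ssh name2 --command='watch hostname' --zone=XXXX >> output.out 2>> error.out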
Note that there is also the possibility to create shutdown scripts, which in this scenario could be useful for uploading output.out to a bucket or performing any kind of clean-up operation.
In order to do so you can run the following command:
$ gcloud compute instances add-metadata example-instance --metadata-from-file shutdown-script=path/to/script_file
where the content of the script could be something like:
#! /bin/bash
gsutil cp path/output.out gs://yourbucketname
Always keep in mind that Compute Engine only executes shutdown scripts on a best-effort basis and does not guarantee that the shutdown script will be run in all cases.
More documentation about shutdown scripts is available if needed.

Detect when an instance has completed its setup script?

I'm launching instances using the following command:
gcutil addinstance \
--image=debian-7 \
--persistent_boot_disk \
--zone=us-central1-a \
--machine_type=n1-standard-1 \
--metadata_from_file=startup-script:install.sh \
instance-name
How can I detect when this instance has completed its install script? I'd like to be able to place this launch command in a larger provisioning script that then goes on to issue commands to the server that depend on the install script having completed successfully.
There are a number of ways: sending yourself an email, uploading to Cloud Storage, sending a jabber message, ...
One simple, observable way IMHO is to add a logger entry at the end of your install.sh script (I also tweak the beginning for symmetry). Something like:
#!/bin/bash
/usr/bin/logger "== Startup script START =="
#
# Your code goes here
#
/usr/bin/logger "== Startup script END =="
You can check then if the script started or ended in two ways:
From your Developer's Console, select "Projects" > "Compute" > "VM Instances" > your instance > "Serial console" > "View Output".
From the CLI, by issuing gcutil getserialportoutput instance-name.
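For scripting this check, the CLI form can be polled until the end marker from install.sh shows up (a sketch; instance-name is the placeholder used above):
until gcutil getserialportoutput instance-name | grep -q "== Startup script END =="; do
  sleep 10
done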
I don't know of a way to do all of this within gcutil addinstance.
I'd suggest:
Adding the instance via gcutil addinstance, making sure to use the --wait_until_running flag to ensure that the instance is running before you continue
Copying your script over to the instance via something like gcutil push
Using gcutil ssh <instance-name> </path-to-script/script-to-run> to run your script manually.
This way, you can write your script in such a way that it blocks until it's finished, and the ssh command will not return until your script on the remote machine is done executing.
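Put together, the sequence could look roughly like this (a sketch; the flags are the ones from the question, /tmp/install.sh is an arbitrary remote path, and the exact gcutil push argument order should be checked against its help):
gcutil addinstance --image=debian-7 --persistent_boot_disk --zone=us-central1-a --machine_type=n1-standard-1 --wait_until_running instance-name
gcutil push instance-name install.sh /tmp/
gcutil ssh instance-name bash /tmp/install.sh
Because the last command runs the script over ssh, it only returns once install.sh has finished on the remote machine.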
There really are a lot of ways to accomplish this goal. One that tickles my fancy is to use the metadata server associated with the instance. Have the startup script set a piece of metadata to "FINISHED" when the script is done. You can query the metadata server with a hanging GET that will only return when the metadata updates. Just use gcutil setmetadata from within the script as the last command.
I like this method because the hanging GET just gives you one command to run, rather than a poll to run in a loop, and it doesn't involve any services besides Compute Engine.
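A sketch of the waiting side, assuming the controlling script itself runs on a Compute Engine instance in the same project and the startup script writes a project-level attribute (startup-state is a hypothetical key, and it must already exist, e.g. pre-set to RUNNING, before you can wait on it):
# Blocks until the value of startup-state changes, e.g. to FINISHED.
curl -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/project/attributes/startup-state?wait_for_change=true"
Note that the metadata server is only reachable from inside Compute Engine; from an external machine you would poll the value through the API or gcutil instead.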
One more hacky way:
startup_script_finished=false
while [[ "$startup_script_finished" = false ]]; do
  pid=$(gcloud compute ssh $GCLOUD_USER@$GCLOUD_INSTANCE -- pgrep -f "\"/usr/bin/python /usr/bin/google_metadata_script_runner --script-type startup\"")
  if [[ -z $pid ]]; then
    startup_script_finished=true
  else
    sleep 2
  fi
done
One possible solution would be to have your install script create a text file in a Cloud Storage bucket as the last thing it does, using the host name as the filename.
Your main script that ran the original gcutil addinstance command could then periodically poll the contents of the bucket (using gsutil ls) until it sees a file with a matching name, at which point it knows the install has completed on that instance.
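A sketch of both halves, with the bucket name and object prefix as placeholders. The last lines of install.sh would be something like:
hostname > /tmp/startup-done.txt
gsutil cp /tmp/startup-done.txt gs://your-bucket/startup-done/$(hostname)
and the provisioning script would poll for the marker object:
until gsutil ls gs://your-bucket/startup-done/instance-name > /dev/null 2>&1; do
  sleep 10
done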

OpenShift Online build ends with a timeout when building directly from a Bitbucket git repo

When trying to create an app:
rhc --debug app create mezzgit python-2.7 --from-code https://radeksvarz@bitbucket.org/radeksvarz/mezzanineopenshift.git
The app is not created and I get:
...
DEBUG: code 422 270080 ms
The initial build for the application failed: Shell command '/sbin/runuser -s
/bin/sh 53c31a3ae0b8cd298e0009c0 -c "exec /usr/bin/runcon
'unconfined_u:system_r:openshift_t:s0:c2,c490' /bin/sh -c \"gear postreceive
--init >> /tmp/initial-build.log 2>&1\""' exceeded timeout of 232
...
However this works:
rhc app create mezzgit python-2.7
cd mezzgit
del *
git remote add mezzanineopenshift -m master git@bitbucket.org:radeksvarz/mezzanineopenshift.git
git pull -s recursive -X theirs mezzanineopenshift master
git add -A
git commit -m "initial mezzanine deploy"
git push origin
Why is there an error in the first case?
The first example you provided:
rhc --debug app create mezzgit python-2.7 --from-code https://radeksvarz@bitbucket.org/radeksvarz/mezzanineopenshift.git
is essentially doing two major actions in one call: the first is creating everything necessary for your app to run, and the second is using a custom template for the application creation. The app creation process has a timeout associated with it to keep things from running forever or taking an excessive amount of time and thus causing issues with the OpenShift broker.
Now, --from-code says that you want to use a quickstart as part of your application creation. One of the things that determines whether that succeeds is how long it takes to set up and complete the quickstart on your OpenShift gear. So if it executes a lot of lengthy scripts or downloads large files, it is likely your app creation will time out.
Therefore, if a quickstart or downloadable cartridge can't be created and completed within the specified timeout, it's best to go with the second option: create the basic app, pull in the remote repo, and then push your changes back up. This splits things into two separate actions and thus requires the OpenShift broker to do a lot less work and waiting.

Redirect output to different directories for sun grid engine array jobs

I'm running a lot of jobs with Sun Grid Engine. Since there are so many jobs (~100,000), I would like to use array jobs, which seem to be easier on the queue.
Another problem is that each job produces an stdout and an stderr file, which I need in order to track errors. If I define them as in qsub -t 1-100000 -o outputdir -e errordir, I will end up with directories containing 100,000 files each, which is too much.
Is there a way to have each job write its output files to a subdirectory (say, a directory named after the first 2 characters of the job ID, which are random hex letters; or the job number modulo 1000, or something of that sort)?
Thanks
I can't think of a good way to do this with qsub, as there is no programmatic interface into the -o and -e options. There is, however, a way to accomplish what you want.
Run your qsub with -o and -e pointing to /dev/null. Make the command you run a wrapper of some kind that redirects its own stdout and stderr to files in whatever fashion you want (i.e., your broken-down directory structure) before it execs the real job.
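A sketch of such a wrapper, using the JOB_ID and SGE_TASK_ID environment variables that SGE sets for array tasks and bucketing logs by task id modulo 1000 (the base log directory and the real command are placeholders):
#!/bin/bash
# wrapper.sh -- submitted as: qsub -t 1-100000 -o /dev/null -e /dev/null wrapper.sh
subdir=$(( SGE_TASK_ID % 1000 ))
logdir=/path/to/logs/$subdir
mkdir -p "$logdir"
exec  > "$logdir/${JOB_ID}.${SGE_TASK_ID}.out"   # stdout for this task
exec 2> "$logdir/${JOB_ID}.${SGE_TASK_ID}.err"   # stderr for this task
exec /path/to/real_job "$@"                      # replace with the actual command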