I have a host in our cluster with 8 Nvidia K80s and I would like to set it up so that each device can run at most 1 process. Before, if I ran multiple jobs on the host and each use a large amount of memory, they would all attempt to hit the same device and fail.
I set all the devices to compute mode 3 (E. Process) via nvidia-smi -c 3 which I believe makes it so that each device can accept a job from only one CPU process. I then run 2 jobs (each of which only takes about ~150 MB out of 12 GB of memory on the device) without specifying cudaSetDevice, but the second job fails with ERROR: CUBLAS_STATUS_ALLOC_FAILED, rather than going to the second available device.
I am modeling my assumptions off of this site's explanation and was expecting each job to cascade onto the next device, but it is not working. Is there something I am missing?
UPDATE: I ran Matlab using gpuArray in multiple different instances, and it is correctly cascading the Matlab jobs onto different devices. Because of this, I believe I am correctly setting up the compute modes at the OS level. Aside from cudaSetDevice, what could be forcing my CUDA code to lock into device 0?
This is relying on an officially undocumented behavior (or else prove me wrong and point out the official documentation, please) of the CUDA runtime that would, when a device was set to an Exclusive compute mode, automatically select another available device, when one is in use.
The CUDA runtime apparently enforced this behavior but it was "broken" in CUDA 7.0.
My understanding is that it should have been "fixed" again in CUDA 7.5.
My guess is you are running CUDA 7.0 on those nodes. If so, I would try updating to CUDA 7.5, or else revert to CUDA 6.5 if you really need this behavior.
It's suggested, rather than relying on this, that you instead use an external means, such as a job scheduler (e.g. Torque) to manage resources in a situation like this.
I am new to qemu simulator.I want to emulate our existing pure c h264(video decoder)code in arm platform(cortex-a9) using qemu in ubuntu 12.04 and I had done it successfully from the links available in the internet.
Also we are having multithreading(pthreads) code in our application to speed up the process.If we enable multithreading we are getting the same performance (i.e)single thread(without multithreading).
Eg. single thread 9.75sec
Multithread 9.76sec
Since qemu will support parallel processing we are not able to get the performance.
steps done are as follows
1.compile the code using arm-linux-gnueabi-toolchain
2.Execute the code
qemu-arm -L executable
3.qemu version 1.6.1
Is there any option or settings has to be done in qemu if we want measure the performance in multi threading because we want to get the difference between single thread and multithread using qemu since we are not having any arm board with us.
Moreover,multithreading application hangs if we run for third time or fourth time i.e inconsistent behaviour in qemu.
whether we can rely on this qemu simulator or not since it is not cycle accurate.
You will not be able to use QEMU to estimate real hardware speed.
Also QEMU currently supports SMP running in a single thread... this means your guest OS will see multiple CPUs but will not recieve adicional cycles since all the emulation is occuring in a single thread.
Note that IO is delegated to separate threads... so usually if your VM is doing cpu and IO work you will see at least 1.5+ cores on the host being used.
There has been alot of research into parallelizing the cpu emulation in qemu but without much sucess. I suggest you buy some real hardware and run it there especially consiering that coretex-a9 hardware is cheap these days.
I have a single machine with 32 cores (2 processors), and 32G RAM. I installed gridengine to submit jobs to those queues I created. But it seems jobs are running on all cores.
I wonder if there is way to limit cores and RAMs for each job. For example I have two queues: parallel.q and serial.q, so that I allocate 20G RAMS and 20 cores to serial.q but I want each job only use one core and maximum 1G RAMs, and 8G RAMs + 8 cores to a single parallel job. All 4 cores and 4G rams left for other usage.
How can I config my queue or gridengine to get the setting right? I tried to read the manual, but don't have a clue.
Thanks!
I don't have problem with parallel jobs. I have some serial jobs will call several different programs somehow the system will assign them all cores available. But I don't want all cores be used for jobs rather for example only two cores available for each job.(Each job has several programs run sequentially, in which case systems allocate each program a core). BTW, I would like have some idle cores all the time to process other jobs, like processing data. Is it possible or necessary?
In fact, if I understand well, you want to partition a single machine with several sub-queues, is that right?
This may be problematic with SGE because the host configuration allows you to set the number of CPU available on a given node. Than you create your queues and assign different hosts to different queues.
In your case, you shoud assign the same host to one master queue, and then add subordinate queues that can use only a given MAX_SLOTS slots.
But if I may ask one question: why should you partition it? If you set up only one queue and configure some parallel environment then you can just submit your jobs using qsub -pe <parallelEnvironment> <NSLOTS> and the grid engine takes care of everything. I suggest you setup at least an OpenMP parallel environment, because you won't probably need MPI on a shared memory machine like yours (it seems a great machine BTW).
Another thing is that you must be able to configure your model run so that the code that you are using can be used with a limited number of CPU; this is very important. In practice you must assign the same number of CPUs to the simulation code than to the SGE. This information is contained in the $NSLOTS variable of your qsub-script.
I'm working on a cluster with a lot of nodes, and each node has two gpus. In the cluster, I can't launch "nvidia-smi" to check which device is busy. My code selects the best device (with cudaChooseDevice) in terms of capability, but when the cluster assign me the same node for two different jobs, then I have two tasks running on the same gpu.
My question is: There is a way to check at runtime if the device is busy or not?
Thanks
Your cluster managers should install and use cluster management (job-scheduling) software that allows them to assign and track GPUs just like CPUs and memory. There are a number of job schedulers that can do this. Even without explicit GPU support in the job-scheduler, it's possible to build job entry/exit scripts that will assign GPUs properly.
You can effectively include the same functionality that nvidia-smi uses by embedding NVML in your applications. Any query or data item reported on by nvidia-smi can be accessed programmatically through NVML.
It's also not clear to me why you could not launch a script for your job which checks which devices are busy using nvidia-smi, then picks an un-busy device.
But keep in mind that any runtime check you might do would be subject to the behavior of other applications. If those applications (whether launched by you or other users) have unusual behavior, your runtime check can easily be defeated.
I've noticed that CUDA applications tend to have a rough maximum run-time of 5-15 seconds before they will fail and exit out. I realize it's ideal to not have CUDA application run that long but assuming that it is the correct choice to use CUDA and due to the amount of sequential work per thread it must run that long, is there any way to extend this amount of time or to get around it?
I'm not a CUDA expert, --- I've been developing with the AMD Stream SDK, which AFAIK is roughly comparable.
You can disable the Windows watchdog timer, but that is highly not recommended, for reasons that should be obvious.
To disable it, you need to regedit HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Watchdog\Display\DisableBugCheck, create a REG_DWORD and set it to 1.
You may also need to do something in the NVidia control panel. Look for some reference to "VPU Recovery" in the CUDA docs.
Ideally, you should be able to break your kernel operations up into multiple passes over your data to break it up into operations that run in the time limit.
Alternatively, you can divide the problem domain up so that it's computing fewer output pixels per command. I.e., instead of computing 1,000,000 output pixels in one fell swoop, issue 10 commands to the gpu to compute 100,000 each.
The basic unit that has to fit within the time slice is not your entire application, but the execution of a single command buffer. In the AMD Stream SDK, a long sequence of operations can be broken up into multiple time slices by explicitly flushing the command queue with a CtxFlush() call. Perhaps CUDA has something similar?
You should not have to read all of your data back and forth across the PCIX bus on every time slice; you can leave your textures, etc. in gpu local memory; you just have some command buffers complete occasionally, to prove to the OS that you're not stuck in an infinite loop.
Finally, GPUs are fast, so if your application is not able to do useful work in that 5 or 10 seconds, I'd take that as a sign that something is wrong.
[EDIT Mar 2010 to update:] (outdated again, see the updates below for the most recent information) The registry key above is out-of-date. I think that was the key for Windows XP 64-bit. There are new registry keys for Vista and Windows 7. You can find them here: http://www.microsoft.com/whdc/device/display/wddm_timeout.mspx
or here: http://msdn.microsoft.com/en-us/library/ee817001.aspx
[EDIT Apr 2015 to update:] This is getting really out of date. The easiest way to disable TDR for Cuda programming, assuming you have the NVIDIA Nsight tools installed, is to open the Nsight Monitor, click on "Nsight Monitor options", and under "General" set "WDDM TDR enabled" to false. This will change the registry setting for you. Close and reboot. Any change to the TDR registry setting won't take effect until you reboot.
[EDIT August 2018 to update:]
Although the NVIDIA tools allow disabling the TDR now, the same question is relevant for AMD/OpenCL developers. For those: The current link that documents the TDR settings is at https://learn.microsoft.com/en-us/windows-hardware/drivers/display/tdr-registry-keys
On Windows, the graphics driver has a watchdog timer that kills any shader programs that run for more than 5 seconds. Note that the Xorg/XFree86 drivers don't do this, so one possible workaround is to run the CUDA apps on Linux.
AFAIK it is not possible to disable the watchdog timer on Windows. The only way to get around this on Windows is to use a second card that has no displayed screens on it. It doesn't have to be a Tesla but it must have no active screens.
Resolve Timeout Detection and Recovery - WINDOWS 7 (32/64 bit)
Create a registry key in Windows to change the TDR settings to a
higher amount, so that Windows will allow for a longer delay before
TDR process starts.
Open Regedit from Run or DOS.
In Windows 7 navigate to the correct registry key area, to create the
new key:
HKEY_LOCAL_MACHINE>SYSTEM>CurrentControlSet>Control>GraphicsDrivers.
There will probably one key in there called DxgKrnlVersion there as a
DWord.
Right click and select to create a new key REG_DWORD, and name it
TdrDelay. The value assigned to it is the number of seconds before
TDR kicks in - it > is currently 2 automatically in Windows (even
though the reg. key value doesn't exist >until you create it). Assign
it with a new value (I tried 4 seconds), which doubles the time before
TDR. Then restart PC. You need to restart the PC before the value will
work.
Source from Win7 TDR (Driver Timeout Detection & Recovery)
I have also verified this and works fine.
The most basic solution is to pick a point in the calculation some percentage of the way through that I am sure the GPU I am working with is able to complete in time, save all the state information and stop, then to start again.
Update:
For Linux: Exiting X will allow you to run CUDA applications as long as you want. No Tesla required (A 9600 was used in testing this)
One thing to note, however, is that if X is never entered, the drivers probably won't be loaded, and it won't work.
It also seems that for Linux, simply not having any X displays up at the time will also work, so X does not need to be exited as long as you screen to a non-X full-screen terminal.
This isn't possible. The time-out is there to prevent bugs in calculations from taking up the GPU for long periods of time.
If you use a dedicated card for CUDA work, the time limit is lifted. I'm not sure if this requires a Tesla card, or if a GeForce with no monitor connected can be used.
The solution I use is:
1. Pass all information to device.
2. Run iterative versions of algorithms, where each iteration invokes the kernel on the memory already stored within the device.
3. Finally transfer memory to host only after all iterations have ended.
This enables control over iterations from CPU (including option to abort), without the costly device<-->host memory transfers between iterations.
The watchdog timer only applies on GPUs with a display attached.
On Windows the timer is part of the WDDM, it is possible to modify the settings (timeout, behaviour on reaching timeout etc.) with some registry keys, see this Microsoft article for more information.
It is possible to disable this behavior in Linux. Although the "watchdog" has an obvious purpose, it may cause some very unexpected results when doing extensive computations using shaders / CUDA.
The option can be toggled in your X-configuration (likely /etc/X11/xorg.conf)
Adding: Option "Interactive" "0" to the device section of your GPU does the job.
see CUDA Visual Profiler 'Interactive' X config option?
For details on the config
and
see ftp://download.nvidia.com/XFree86/Linux-x86/270.41.06/README/xconfigoptions.html#Interactive
For a description of the parameter.