I am using a workstation containing 4 GeForce GTX Titan Black cards for CUDA development. I am working on Ubuntu 12.04.5 and none of these GPUs are used for display. Using cudaGetDeviceProperties, I notice that the kernel execution timeout is enabled. Does this apply when I am not on Windows and not using a display?
To test this, I put the following code into one of my kernels, which normally runs fine:
__global__ void update1(double *alpha_out, const double *sDotZ, const double *rho, double *minusAlpha_out, clock_t *global_now)
{
    clock_t start = clock();
    clock_t now;
    for (;;) {
        now = clock();
        clock_t cycles = now > start ? now - start : now + (0xffffffff - start);
        if (cycles >= 50000000000) {
            break;
        }
    }
    *global_now = now;
}
The kernel launch looks like:
update1<<<1, 1>>>(d_alpha + idx, d_tmp, d_rho + idx, d_tmp, global_now);
CudaCheckError();
cudaDeviceSynchronize();
For a large enough number of cycles waiting, I see the error:
CudaCheckError() with sync failed at /home/.../xxx.cu:295:
the launch timed out and was terminated
It runs fine for a small number of cycles. If I run this same code on a Tesla K20m GPU with the kernel execution timeout disabled, I do not see this error and the program runs as normal. If I see this error, does it definitely mean I am hitting the kernel time limit that appears to be enabled, or could there be something else wrong with my code? All mentions of this problem seem to come from people using Windows or also using their card for display, so how is it possible that I am seeing this error?
Linux has a display watchdog as well. On Ubuntu, in my experience, it is active for display devices that are configured via xorg.conf (typically /etc/X11/xorg.conf, though the exact configuration method will vary by distro and version).
So yes, it is possible to see the kernel execution timeout error on Linux.
In general, you can work around it in several ways, but since you have multiple GPUs, the best approach is to remove the GPUs on which you want to run compute tasks from your display configuration (e.g. xorg.conf or whatever), and then run your compute tasks on those GPUs. Once X is not configured to use a particular GPU, that GPU won't have a watchdog associated with it.
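To confirm which GPUs still have the watchdog active after changing the display configuration, you can query the same device property the question mentions. A minimal sketch using only the runtime API (error checking omitted):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        // kernelExecTimeoutEnabled is nonzero when the display watchdog applies
        printf("Device %d (%s): kernel execution timeout %s\n", dev, prop.name,
               prop.kernelExecTimeoutEnabled ? "enabled" : "disabled");
    }
    return 0;
}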
Additional specific details are given here.
If you were to reinstall things, another approach that generally works to keep your compute GPUs out of the display path is to load the Linux OS with those GPUs not plugged into the system. After things are configured the way you want display-wise, add the compute GPUs to the system and install the CUDA toolkit. You will want to install the display driver manually instead of letting the CUDA toolkit installer do it, and deselect the option to have the display driver installer modify xorg.conf. This will similarly get your GPUs configured for compute usage while keeping them out of the display path.
Related
I have a question:
Let's say I have 2 GPUs in my system and I have 2 host processes running CUDA code. How can I be sure that each process takes a different GPU?
I'm considering setting exclusive_thread, but I cannot understand how to take advantage of it: once I check that a device is free, how can I be sure that it remains free until I do a cudaSetDevice?
EDIT:
So far I've tried this:
int devN = 0;
while (cudaSuccess != cudaSetDevice(devN))
    devN = (devN + 1) % 2;
but I get a
CUDA Runtime API error 77: an illegal memory access was encountered.
which is not strange since I am in EXCLUSIVE_PROCESS mode.
There are two elements to this question: assigning a process to a GPU, and making sure a GPU is available to a single process only.
Assigning a process to a GPU
There is a simple way to accomplish this using the CUDA_VISIBLE_DEVICES environment variable: start your first process with CUDA_VISIBLE_DEVICES=0 and your second process with CUDA_VISIBLE_DEVICES=1. Each process will then see a single GPU with device index 0, and each will see a different physical GPU.
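A quick way for each process to verify which physical GPU it has been handed is to print the PCI bus id of the single device it sees; a minimal sketch:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    // With CUDA_VISIBLE_DEVICES=0 (or =1) this prints a single device,
    // always with index 0, but the PCI bus id differs between the two processes.
    for (int dev = 0; dev < count; ++dev) {
        char busId[64];
        cudaDeviceGetPCIBusId(busId, (int)sizeof(busId), dev);
        printf("device %d -> PCI bus id %s\n", dev, busId);
    }
    return 0;
}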
Running nvidia-smi topo -m will display the GPU topology and provide you with the corresponding CPU affinity.
Then, you may set the CPU affinity for your process with taskset or numactl on Linux, or SetProcessAffinityMask on Windows.
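taskset and numactl work from the command line; if you would rather pin the process from inside your own code on Linux, a sketch along these lines can be used (the CPU indices below are placeholders, take the real ones from nvidia-smi topo -m):

#include <sched.h>   // sched_setaffinity, CPU_ZERO, CPU_SET (may need _GNU_SOURCE with plain gcc)
#include <cstdio>

int main()
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    // Example only: pin to CPUs 0-3; use the cores that
    // nvidia-smi topo -m reports as affine to your GPU.
    for (int cpu = 0; cpu < 4; ++cpu)
        CPU_SET(cpu, &mask);
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    return 0;
}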
Process has exclusive access to a GPU
To make sure that no other process can access your GPU, configure the GPU driver for exclusive process mode: nvidia-smi --compute-mode=EXCLUSIVE_PROCESS (equivalently, nvidia-smi -c 3).
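Regarding the EDIT in the question: cudaSetDevice by itself does not create a context, so it can appear to succeed on a device that another process already owns. A common pattern (a sketch, not the only way) is to force context creation, for example with cudaFree(0), and move on to the next device when the current one turns out to be unavailable:

#include <cuda_runtime.h>

// Try each visible device in turn and keep the first one on which a
// context can actually be created (relevant in EXCLUSIVE_PROCESS mode).
int pickFreeDevice()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess)
        return -1;
    for (int dev = 0; dev < count; ++dev) {
        cudaSetDevice(dev);
        if (cudaFree(0) == cudaSuccess)  // forces context creation
            return dev;                  // this device is now ours
        cudaGetLastError();              // clear the error and try the next device
    }
    return -1;                           // no free device found
}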
I am developing software that should run on several CUDA GPUs with varying amounts of memory and compute capability. It has happened to me more than once that customers would report a reproducible problem on their GPU that I couldn't reproduce on my machine; maybe because I have 8 GB of GPU memory and they have 4 GB, maybe because of compute capability 3.0 rather than 2.0, things like that.
Thus the question: can I temporarily "downgrade" my GPU so that it would pretend to be a lesser model, with smaller amount of memory and/or with less advanced compute capability?
Per the comments, here is a clarification of what I'm asking.
Suppose a customer reports a problem running on a GPU with compute capability C with M gigs of GPU memory and T threads per block. I have a better GPU on my machine, with higher compute capability, more memory, and more threads per block.
Can I run my program on my GPU restricted to M gigs of GPU memory? The answer to this one seems to be "yes, just allocate (whatever mem you have) - M at startup and never use it; that would leave only M until your program exits."
Can I reduce the size of the blocks on my GPU to no more than T threads for the duration of runtime?
Can I reduce compute capability of my GPU for the duration of runtime, as seen by my program?
I originally wanted to make this a comment but it was getting far too big for that scope.
As @RobertCrovella mentioned, there is no native way to do what you are asking for. That said, you can take the following measures to minimize the bugs you see on other architectures.
0) Try to get the output of cudaGetDeviceProperties for the CUDA GPUs you want to target. You could crowd-source this from your users or the community.
1) To restrict memory, you can either implement a memory manager and manually keep track of the memory being used, or use cudaMemGetInfo to get a fairly close estimate (see the sketch below). Note: the values it returns reflect memory used by other applications as well.
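Combining this with the approach from the question (reserve everything above the target at startup), a minimal sketch, assuming a hypothetical target of targetBytes of usable memory on the current device:

#include <cuda_runtime.h>

// Reserve all free memory above targetBytes so the rest of the program
// behaves as if the GPU only had (roughly) targetBytes available.
void *reserveDownTo(size_t targetBytes)
{
    size_t freeB = 0, totalB = 0;
    cudaMemGetInfo(&freeB, &totalB);   // free/total memory on the current device
    if (freeB <= targetBytes)
        return nullptr;                // already at or below the target
    void *blocker = nullptr;
    // Keep this pointer alive (and never touch the allocation) until the
    // program exits; back off a little if the allocation fails.
    cudaMalloc(&blocker, freeB - targetBytes);
    return blocker;
}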
2) Have a wrapper macro to launch the kernel where you can explicitly check if the number of blocks / threads fit in the current profile. i.e. Instead of launching
kernel<float><<<blocks, threads>>>(a, b, c);
You'd do something like this:
LAUNCH_KERNEL((kernel<float>), blocks, threads, a, b, c);
Where you can have the macro be defined like this:
#define LAUNCH_KERNEL(kernel, blocks, threads, ...)    \
    do {                                               \
        check_blocks(blocks);                          \
        check_threads(threads);                        \
        kernel<<<blocks, threads>>>(__VA_ARGS__);      \
    } while (0)
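check_blocks and check_threads are left to the reader; a possible sketch of those helpers, checking against a hypothetical, lower target profile (fill in the limits of the customer's GPU):

#include <cassert>
#include <cuda_runtime.h>   // dim3

// Hypothetical target profile for the GPU you are trying to emulate.
const unsigned int TARGET_MAX_THREADS_PER_BLOCK = 1024;
const unsigned int TARGET_MAX_GRID_DIM_X        = 65535;

inline void check_threads(dim3 threads)
{
    assert(threads.x * threads.y * threads.z <= TARGET_MAX_THREADS_PER_BLOCK);
}

inline void check_blocks(dim3 blocks)
{
    assert(blocks.x <= TARGET_MAX_GRID_DIM_X);
}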
3) Reducing the compute capability at runtime is not possible, but you can compile your code for various compute capabilities and make sure your kernels contain backwards-compatible code. If a certain part of your kernel errors out on an older compute capability, you can do something like this:
#if !defined(TEST_FALLBACK) && __CUDA_ARCH__ >= 300 // Or any other newer compute
// Implement using new fancy feature
#else
// Implement a fallback version
#endif
You can define TEST_FALLBACK (e.g. by passing -DTEST_FALLBACK to nvcc) whenever you want to test your fallback code and ensure it works on older compute capabilities.
I have a host in our cluster with 8 NVIDIA K80s and I would like to set it up so that each device can run at most 1 process. Previously, if I ran multiple jobs on the host and each used a large amount of memory, they would all attempt to hit the same device and fail.
I set all the devices to compute mode 3 (E. Process) via nvidia-smi -c 3, which I believe makes it so that each device can accept work from only one CPU process. I then run 2 jobs (each of which only takes about ~150 MB out of the 12 GB of memory on the device) without specifying cudaSetDevice, but the second job fails with ERROR: CUBLAS_STATUS_ALLOC_FAILED rather than going to the second available device.
I am modeling my assumptions off of this site's explanation and was expecting each job to cascade onto the next device, but it is not working. Is there something I am missing?
UPDATE: I ran Matlab using gpuArray in multiple different instances, and it is correctly cascading the Matlab jobs onto different devices. Because of this, I believe I am correctly setting up the compute modes at the OS level. Aside from cudaSetDevice, what could be forcing my CUDA code to lock into device 0?
This relies on an officially undocumented behavior (or else prove me wrong and point out the official documentation, please) of the CUDA runtime: when a device was set to an exclusive compute mode and was already in use, the runtime would automatically select another available device.
The CUDA runtime apparently enforced this behavior but it was "broken" in CUDA 7.0.
My understanding is that it should have been "fixed" again in CUDA 7.5.
My guess is you are running CUDA 7.0 on those nodes. If so, I would try updating to CUDA 7.5, or else revert to CUDA 6.5 if you really need this behavior.
It's suggested, rather than relying on this, that you instead use an external means, such as a job scheduler (e.g. Torque) to manage resources in a situation like this.
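If you do go the scheduler route, a simple pattern (a sketch; the variable name is whatever your scheduler exports, here a hypothetical MY_JOB_GPU) is to have each job select its assigned device explicitly rather than relying on the runtime:

#include <cstdlib>
#include <cuda_runtime.h>

int main()
{
    // Hypothetical variable set by the scheduler or the job script.
    const char *assigned = std::getenv("MY_JOB_GPU");
    int dev = assigned ? std::atoi(assigned) : 0;
    if (cudaSetDevice(dev) != cudaSuccess)
        return 1;   // the assigned device is not usable; fail fast
    // ... the rest of the job runs on the assigned device ...
    return 0;
}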
From this, it appears that two kernels from different contexts cannot execute concurrently. In this regard, I am confused when reading CUPTI activity traces from two applications. The traces show kernel_start_timestamp, kernel_end_timestamp and duration (which is kernel_end_timestamp - kernel_start_timestamp).
Application 1:
.......
8024328958006530 8024329019421612 61415082
.......
Application 2:
.......
8024328940410543 8024329048839742 108429199
To make the long timestamp and duration more readable:
Application 1: kernel X of 61.415 ms ran from xxxxx28.958 s to xxxxx29.019 s
Application 2: kernel Y of 108.429 ms ran from xxxxx28.940 s to xxxxx29.0488 s
So, the execution of kernel X completely overlaps with that of kernel Y.
I am using the /path_to_cuda_install/extras/CUPTI/sample/activity_trace_async for tracing the applications. I modified CUPTI_ACTIVITY_ATTR_DEVICE_BUFFER_SIZE to 1024 and CUPTI_ACTIVITY_ATTR_DEVICE_BUFFER_POOL_LIMIT to 1. I have only enabled tracing for CUPTI_ACTIVITY_KIND_MEMCPY, CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL and CUPTI_ACTIVITY_KIND_OVERHEAD. My applications are calling cuptiActivityFlushAll(0) once in each of their respective logical timesteps.
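For reference, the configuration described above corresponds roughly to the following CUPTI calls (a sketch based on the activity_trace_async sample; the sample's buffer-request/buffer-complete callbacks and error checking are omitted):

#include <cupti.h>

void initTracing()
{
    // Shrink the device activity buffers as described above.
    size_t bufSize   = 1024;
    size_t poolLimit = 1;
    size_t attrSize  = sizeof(size_t);
    cuptiActivitySetAttribute(CUPTI_ACTIVITY_ATTR_DEVICE_BUFFER_SIZE, &attrSize, &bufSize);
    cuptiActivitySetAttribute(CUPTI_ACTIVITY_ATTR_DEVICE_BUFFER_POOL_LIMIT, &attrSize, &poolLimit);

    // Only the three activity kinds mentioned above.
    cuptiActivityEnable(CUPTI_ACTIVITY_KIND_MEMCPY);
    cuptiActivityEnable(CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL);
    cuptiActivityEnable(CUPTI_ACTIVITY_KIND_OVERHEAD);
}

void endOfTimestep()
{
    // Flush completed activity records once per logical timestep.
    cuptiActivityFlushAll(0);
}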
Are these erroneous CUPTI values that I am seeing due to improper usage or is it something else?
Clarification: MPS is not enabled, running on a single GPU.
UPDATE: bug filed, this seems to be a known problem for CUDA 6.5
Waiting for a chance to test this with CUDA 7 (have a GPU shared between multiple users and need a window of inactivity for temporary switch to CUDA 7)
I don't know how to set up the CUPTI activity traces, but two kernels can share a time-span on a single GPU even without the MPS server, though only one of them will be running on the GPU at any instant.
If the CUDA MPS server is not in use, kernels from different contexts cannot overlap. Assuming that you're not using the MPS server, a time-sliced scheduler decides which context gets to access the GPU at any given time. Without MPS, a context can only access the GPU in the time slots that the time-sliced scheduler assigns to it. Thus, only kernels from a single context are running on the GPU at a time (without the MPS server).
Note that it is possible for multiple kernels to share a time-span with each other on a GPU, but within that time-span only kernels from a single context can access the GPU resources (I am also assuming that you're using a single GPU).
For more information you can also check the MPS Service documentation.
I'm playing with the matrixMulCUBLAS sample code and tried changing the default matrix sizes to something slightly more fun (rows = 5k x cols = 2.5k), and then the example fails with the error Failed to synchronize on the stop event (error code unknown error)! at line #377, once all the computation is done and it is apparently cleaning up cublas. What does this mean, and how can it be fixed?
I've got CUDA 5.0 installed with an EVGA FTW NVIDIA GeForce GTX 670 with 2 GB of memory. The driver version is 314.22, the latest one as of today.
In general, when using CUDA on Windows, it's necessary to make sure the execution time of a single kernel is not longer than about 2 seconds. If the execution time becomes longer, you may hit a Windows TDR event. This is a Windows watchdog timer that will reset the GPU driver if it does not respond within a certain period of time. Such a reset halts the execution of your kernel and generates bogus results, usually along with a briefly black display and a brief message in the system tray. If your kernel execution is triggering the Windows watchdog timer, you have a few options:
If you have the possibility to use more than one GPU in your system (i.e. usually not talking about a laptop here) and one of your GPUs is a Quadro or Tesla device, that device can usually be placed in TCC mode. This means the GPU can no longer drive a physical display (if it was driving one) and that it is removed from the WDDM subsystem, so it is no longer subject to the watchdog timer. You can use the nvidia-smi.exe tool that ships with the NVIDIA GPU driver to change the setting from WDDM to TCC for a given GPU. Use the Windows file search function to find nvidia-smi.exe, then use nvidia-smi --help to get command-line help on how to switch from WDDM to TCC mode.
If the above method is not available to you (don't have 2 GPUs, don't have a Quadro or Tesla GPU...) then you may want to investigate changing the watchdog timer setting. Unfortunately this requires modifying the system registry, and the process and specific keys vary by OS. There are a number of resources on the web, such as here from Microsoft, as well as other questions on Stack Overflow, such as here, which may help with this.
A third option is simply to limit the execution time of your kernel(s). Successive operations might be broken into multiple kernel calls. The "gap" between kernel calls will allow the display driver to respond to the OS, and prevent the watchdog timeout.
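As an illustration of the third option, a long-running loop can be split so that each launch stays well under the TDR limit; a minimal sketch with a hypothetical processChunk kernel:

#include <cuda_runtime.h>

// Hypothetical kernel that processes elements [offset, offset + chunk) of the data.
__global__ void processChunk(float *data, size_t offset, size_t chunk)
{
    size_t i = offset + blockIdx.x * blockDim.x + threadIdx.x;
    if (i < offset + chunk)
        data[i] = data[i] * 2.0f;   // placeholder work
}

void processAll(float *d_data, size_t n)
{
    const size_t chunkSize = 1 << 20;   // tune so one launch stays far below ~2 s
    const int threads = 256;
    for (size_t offset = 0; offset < n; offset += chunkSize) {
        size_t thisChunk = (n - offset < chunkSize) ? (n - offset) : chunkSize;
        int blocks = (int)((thisChunk + threads - 1) / threads);
        processChunk<<<blocks, threads>>>(d_data, offset, thisChunk);
        cudaDeviceSynchronize();        // this gap lets the display driver service the OS
    }
}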
The statement about TCC support is a general one. Not all Quadro GPUs are supported. The final determinant of support for TCC (or not) on a particular GPU is the nvidia-smi tool. Nothing here should be construed as a guarantee of support for TCC on your particular GPU.