CUDA Kernels Randomly Fail, but only when I use certain transcendental functions

CUDA Kernels Randomly Fail, but only when I use certain transcendental functions - cuda

I've been working on a CUDA program, that randomly crashes with a unspecified launch failure, fairly frequently. Through careful debugging, I localized which kernel was failing, and furthermore that the failure occurred only if certain transcendental functions were called from within the CUDA kernel, (e.g. sinf() or atanhf()).
This led me to write a much simpler program (see below), to confirm that these transcendental functions really were causing an issue, and it looks like that is indeed the case. When I compile and run the code below, which just has repeated calls to a kernel that uses tanh and atanh, repeatedly, sometimes the program works, and sometimes it prints Error with Kernel along with a message from the driver that says:
NVRM: XiD (0000:01:00): 13, 0002 000000 000050c0 00000368 00000000 0000080
With regards to frequency, it probably crashes 50% of the time that I run the executable.
From what I've read online, it sounds like XiD 13 is analogous to a host-based seg fault. However, given the array indexing, I can't see how that could be the case. Furthermore the program doesn't crash if I replace the transcendental functions in the kernel with other functions (e.g. repeated floating point subtraction and addition). That is, I don't get the XiD error message, and the program ultimately returns the correct value of atanh(0.7).
I'm running cuda-5.0 on Ubuntu 11.10 x64 Desktop. Driver version is 304.54, and I'm using a GeForce 9800 GTX.
I'm inclined to say that this is a hardware issue or a driver bug. What's strange is that the example applications from nvidia work fine, perhaps because they do not use the affected transcendental functions.
The final bit of potentially important information is that if I run either my main project, or this test program under cuda-memcheck, it reports no errors, and never crashes. Honestly, I'd just run my project under cuda-memcheck, but the performance hit makes it impractical.
Thanks in advance for any help/insight here. If any one has a 9800 GTX and would be willing to run this code to see if it works, it would be greatly appreciated.
#include <iostream>
#include <stdlib.h>
using namespace std;
__global__ void test_trans (float *a, int length) {
if ((threadIdx.x + blockDim.x*blockIdx.x) < length) {
float temp=0.7;
for (int i=0;i<100;i++) {
temp=atanh(temp);
temp=tanh(temp);
}
a[threadIdx.x+ blockDim.x*blockIdx.x] = atanh(temp);
}
}
int main () {
float *array_dev;
float *array_host;
unsigned int size=10000000;
if (cudaSuccess != cudaMalloc ((void**)&array_dev, size*sizeof(float)) ) {
cerr << "Error with memory Allocation\n"; exit (-1);}
array_host = new float [size];
for (int i=0;i<10;i++) {
test_trans <<< size/512+1, 512 >>> (array_dev, size);
if (cudaSuccess != cudaDeviceSynchronize()) {
cerr << "Error with kernel\n"; exit (-1);}
}
cudaMemcpy (array_host, array_dev, sizeof(float)*size, cudaMemcpyDeviceToHost);
cout << array_host[size-1] << "\n";
}
Edit: I dropped this project for a few months, but yesterday upon updating to driver version 319.23, I'm no longer having this problem. I think the issue I described must have been a bug that was fixed. Hope this helps.

The asker determined that this was a temporary issue fixed by a newer CUDA release. See the edit to the original question.

Related

why the first cuda kernel cannot overlap with previous memcpy?

Here is a demo. The kernel cannot overlap with previous cudaMemcpyAsync, although they are in different streams.
#include <iostream>
#include <cuda_runtime.h>
__global__ void warmUp(){
int Id = blockIdx.x*blockDim.x+threadIdx.x;
if(Id == 0){
printf("warm up!");
}
}
__global__ void kernel(){
int Id = blockIdx.x*blockDim.x+threadIdx.x;
if(Id == 0){
long long x = 0;
for(int i=0; i<1000000; i++){
x += i>>1;
}
printf("kernel!%d\n", x);
}
}
int main(){
//warmUp<<<1,32>>>();
int *data, *data_dev;
int dataSize = pow(10, 7);
cudaMallocHost(&data, dataSize*sizeof(int));
cudaMalloc(&data_dev, dataSize*sizeof(int));
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
cudaMemcpyAsync(data_dev, data, dataSize*sizeof(int), cudaMemcpyHostToDevice, stream1);
kernel<<<1, 32, 0, stream2>>>();
}
Visual Profiler show
After some attempts, I found out that this is due to it being the first kernel call.
Uncomment warmUp<<<1,32>>>();, Visual Profiler show, overlap!
Why?

CUDA uses lazy initialization. Because of this, the first time you do a particular operation or a particular operation type, it's possible that the behavior will not be as you expect.
The operation will/should work "correctly", but performance measurements may not be as you expect.
Contrary to the linked article, there really is no specified formula to force the lazy initialization to complete, without performing the actual work you intend to do.
If the only thing you ever intend to do with your application is launch a single kernel, then having that kernel overlap with a previous copy operation doesn't seem to make a lot of sense to me. In any event, you should expect that device initialization is necessary before all operations will proceed at expected speeds or in expected ways.
Lazy initialization behavior may vary based on CUDA version, platform (e.g. OS) and GPU type.
Additionally, kernel launches are asynchronous. So this particular coding pattern:
int main(){
...
kernel<<<1, 32, 0, stream2>>>();
}
is generally not recommended in CUDA, and specifically is not recommended when using a profiler. Your code should provide the opportunity for all issued work to complete properly, in order for the profiler to provide useful results. You should provide a cudaDeviceSynchronize() or similar operation at the end of your code, if you want to profile it, for this type of pattern.
I also don't recommend doing performance analysis on kernels that are issuing printf calls. The printf call imposes additional host/device synchronization behavior/needs, and this can be confusing; its not easy to predict the performance impact of that.

Implementing of mutex on cuda kernel function happens to be deadlocked

I'm a newcomer to cuda, and I try to perform mutex in the kernel function.
I read some tutorials and wrote my function, but in some case, deadlock happened.
Here are my codes, kernel function is very simple to count numbers of running thread started by main function.
#include <iostream>
#include <cuda_runtime.h>
__global__ void countThreads(int* sum, int* mutex) {
while(atomicCAS(mutex, 0, 1) != 0); // lock
*sum += 1;
__threadfence();
atomicExch(mutex, 0); // unlock
}
int main() {
int* mutex = nullptr;
cudaMalloc(&mutex, sizeof(int));
cudaMemset(&mutex, 0, sizeof(int));
int* sum = nullptr;
cudaMalloc(&sum, sizeof(int));
cudaMemset(&mutex, 0, sizeof(int));
int ret = 0;
// pass, result is 1024
countThreads<<<1024, 1>>>(sum, mutex);
cudaMemcpy(&ret, sum, sizeof(int), cudaMemcpyDeviceToHost);
std::cout << ret << std::endl;
// deadlock, why?
countThreads<<<1, 2>>>(sum, mutex);
cudaMemcpy(&ret, sum, sizeof(int), cudaMemcpyDeviceToHost);
std::cout << ret << std::endl;
return 0;
}
So, can anyone tell me why the program deadlocked when calling countThreads<<<1, 2>>>(), and how to fix it? I want to perform cross-block mutex, may be it is not a good idea though. Many thanks.
I experimented for some time, and found if use thread in the same block, deadlock happens, otherwise, everything works well.

Threads in the same warp attempting to negotiate for a lock or mutex is probably the worst-case scenario. It is fairly difficult to program correctly, and the behavior may change depending on the exact GPU you are running on.
Here is an example of the type of analysis needed to explain the exact reason for the deadlock in a particular case. Such analysis is not readily done on what you have shown here because you have not indicated the type of GPU you are compiling for, or running on. It's also fairly important to provide the CUDA version you are using for compilation. I have witnessed code changes from one compiler generation to another, that may impact this. Even if you provided that information, I'm not sure the analysis is really worth-while, because I consider the negotiation-within-a-warp case to be extra troublesome to program correctly. This question/answer may also be of interest.
My general suggestion for a newcomer in CUDA (as you say) would be to use a method similar to what is described here. Briefly, negotiate for a lock at the threadblock level (ie. have one thread in each block negotiate among other blocks for the lock) then manage singleton activity within the block using standard, available block-level coordination schemes, such as __syncthreads(), and conditional coding.
You can learn more about this topic by searching on the cuda tag for such keywords as "lock" "critical section" etc.
FWIW, for me, anyway, your code does deadlock on a Kepler device and does not deadlock on a Volta device, as suggested by the reference in the comments. I'm not attempting to communicate any statement about whether your code is defect-free, it's just an observation. If I modify your kernel to look like this:
__global__ void countThreads(int* sum, int* mutex) {
int old = 1;
while (old){
old = atomicCAS(mutex, 0, 1); // lock
if (old == 0){
*sum += 1;
__threadfence();
atomicExch(mutex, 0); // unlock
}
}
}
Then it seems to me to work in either the Kepler case or the Volta case. I'm not advancing this example to suggest it is "correct", rather to show that somewhat innocuous code modifications can change a code from deadlock to non-deadlock case, or vice versa. This kind of fragility is best avoided, certainly in the pre-Volta case, in my opinion.
For the volta and forward case, CUDA 11 and forward, you may want to use capability from the libcu++ library such as semaphore

CUDA performance: branching and shared memory

I wish to ask two questions on performance. I have been unable to create simple code to illustrate.
Question 1: How expensive is non-divergent branching? In my code it seems that it even goes up as to more then the equivalent of 4 non-fma FLOPS. Note that I am speaking of the BRA PTX code whereby the predicate is already calculated
Question 2: I have been reading a lot about performance of shared memory and some articles like a Dr Dobbs article even state that it can be as fast as registers (as far as accessed well). In my code all threads within the warps within the block access the same shared variable. I believe in this case shared memory is accessed in broadcast mode, isn't it? Should it reach the performance of registers in this way? Is there any special things that should be considered to make it work?
EDIT: I have been able to construct some simple code that give more insight for my query
Here it is
#include <math.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <float.h>
#include "cuComplex.h"
#include "time.h"
#include "cuda_runtime.h"
#include <iostream>
using namespace std;
__global__ void test()
{
__shared__ int t[1024];
int v=t[0];
bool b=(v==-1);
bool c=(v==-2);
int myValue=0;
for (int i=0;i<800;i++)
{
#if 1
v=i;
#else
v=t[i];
#endif
#if 0
if (b) {
printf("abs");
}
#endif
if (c)
{
printf ("IT HAPPENED");
v=8;
}
myValue+=v;
}
if (myValue==1000)
printf ("IT HAPPENED");
}
int main(int argc, char *argv[])
{
cudaEvent_t event_start,event_stop;
float timestamp;
float4 *data;
// Initialise
cudaDeviceReset();
cudaSetDevice(0);
dim3 threadsPerBlock;
dim3 blocks;
threadsPerBlock.x=32;
threadsPerBlock.y=32;
threadsPerBlock.z=1;
blocks.x=1;
blocks.y=1000;
blocks.z=1;
cudaEventCreate(&event_start);
cudaEventCreate(&event_stop);
cudaEventRecord(event_start, 0);
test<<<blocks,threadsPerBlock,0>>>();
cudaEventRecord(event_stop, 0);
cudaEventSynchronize(event_stop);
cudaEventElapsedTime(&timestamp, event_start, event_stop);
printf("Calculated in %f", timestamp);
}
I am running this code on a GTX680.
Now the results are as follows ..
If run as it is it takes 5.44 ms
If I change the first #if conditional to 0 (which will enable reading from shared memory) it will take 6.02ms.. Not much more but still not enough for me
If I enable the second #if conditional (inserts a branch that will never evaluate to true) the it runs in 9.647040ms. The performance reduction is very big. What is the cause and what can be done?
I have also changed slightly the code to make further checks with shared memory
Instead of
__shared__ int t[1024]
I did
__shared__ int2 t[1024]
and wherever I access t[] I just access t[].x. In got a further drop in performance to 10ms..(another 400micro seconds) Why this should happen?
Regards
Daniel

Have you determined if your kernel is compute bound or memory bound? Your first question would be most relevant if your kernel is compute bound, while the second wold be most relevant if your kernel is memory bound. You might be getting results that are confusing or hard to reproduce if you're assuming one, while it is the other.
(1) I don't think the cost of a branch has been published. You might be left to determining that experimentally for your architecture. The CUDA Programming Guide does say that there is no "branch prediction and no speculative execution."
(2) You're right that when you access a single 32-bit value in shared memory from all the threads in a warp, the value is broadcast. But my guess would be that accessing a single value from all threads would have the same cost as accessing any combination of values as long as you don't incur any bank conflicts. So you end up with the latency of a single fetch from shared memory. I don't think the number of cycles of latency has been published. It is short enough that it is normally easily hidden.

You need to keep in mind that the compiler is highly optimizing. So if you comment out the branch, you also eliminate the evaluation of the conditional, wether or not you leave it in the source code. Thus a difference of four instructions seems very plausible for your example:
load -1,
compare v to it (and store result in b),
test b,
branch,
although I have not compiled your example and looked at the code (which is what you should do - run cuobjdump -sass on your binaries and look at the actual differences in machine code.
Using the only the .x compnent of an int2 changes the layout in shared memory so that you go from bank conflict free access to a 2-way bank conflict, which causes the slight further slowdown in your example. IIRC the latency of a shared memory access is of the order of 30 cycles, which usually is easily hidden by other threads (as Roger has already mentioned).

clock() in opencl

I know that there is function clock() in CUDA where you can put in kernel code and query the GPU time. But I wonder if such a thing exists in OpenCL? Is there any way to query the GPU time in OpenCL? (I'm using NVIDIA's tool kit).

There is no OpenCL way to query clock cycles directly. However, OpenCL does have a profiling mechanism that exposes incremental counters on compute devices. By comparing the differences between ordered events, elapsed times can be measured. See clGetEventProfilingInfo.

Just for others coming her for help: Short introduction to profiling kernel runtime with OpenCL
Enable profiling mode:
cmdQueue = clCreateCommandQueue(context, *devices, CL_QUEUE_PROFILING_ENABLE, &err);
Profiling kernel:
cl_event prof_event;
clEnqueueNDRangeKernel(cmdQueue, kernel, 1 , 0, globalWorkSize, NULL, 0, NULL, &prof_event);
Read profiling data in:
cl_ulong ev_start_time=(cl_ulong)0;
cl_ulong ev_end_time=(cl_ulong)0;
clFinish(cmdQueue);
err = clWaitForEvents(1, &prof_event);
err |= clGetEventProfilingInfo(prof_event, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &ev_start_time, NULL);
err |= clGetEventProfilingInfo(prof_event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &ev_end_time, NULL);
Calculate kernel execution time:
float run_time_gpu = (float)(ev_end_time - ev_start_time)/1000; // in usec
Profiling of individual work-items / work-goups is NOT possible yet.
You can set globalWorkSize = localWorkSize for profiling. Then you have only one workgroup.
Btw: Profiling of a single work-item (some work-items) isn't very helpful. With only some work-items you won't be able to hide memory latencies and the overhead leading to not meaningful measurements.

Try this (Only work with NVidia OpenCL of course) :
uint clock_time()
{
uint clock_time;
asm("mov.u32 %0, %%clock;" : "=r"(clock_time));
return clock_time;
}

The NVIDIA OpenCL SDK has an example Using Inline PTX with OpenCL. The clock register is accessible through inline PTX as the special register %clock. %clock is described in PTX: Parallel Thread Execution ISA manual. You should be able to replace the %%laneid with %%clock.
I have never tested this with OpenCL but use it in CUDA.
Please be warned that the compiler may reorder or remove the register read.

On NVIDIA you can use the following:
typedef unsigned long uint64_t; // if you haven't done so earlier
inline uint64_t n_nv_Clock()
{
uint64_t n_clock;
asm volatile("mov.u64 %0, %%clock64;" : "=l" (n_clock)); // make sure the compiler will not reorder this
return n_clock;
}
The volatile keyword tells the optimizer that you really mean it and don't want it moved / optimized away. This is a standard way of doing so both in PTX and e.g. in gcc.
Note that this returns clocks, not nanoseconds. You need to query for device clock frequency (using clGetDeviceInfo(device, CL_DEVICE_MAX_CLOCK_FREQUENCY, sizeof(freq), &freq, 0))). Also note that on older devices there are two frequencies (or three if you count the memory frequency which is irrelevant in this case): the device clock and the shader clock. What you want is the shader clock.
With the 64-bit version of the register you don't need to worry about overflowing as it generally takes hundreds of years. On the other hand, the 32-bit version can overflow quite often (you can still recover the result - unless it overflows twice).

Now, 10 years later after the question was posted I did some tests on NVidia. I tried running the answers given by user 'Spectral' and 'the swine'. Answer given by 'Spectral' does not work. I always got same invalid values returned by clock_time function.
uint clock_time()
{
uint clock_time;
asm("mov.u32 %0, %%clock;" : "=r"(clock_time)); // this is wrong
return clock_time;
}
After subtracting start and end time I got zero.
So had a look at the PTX assembly which in PyOpenCL you can get this way:
kernel_string = """
your OpenCL code
"""
prg = cl.Program(ctx, kernel_string).build()
print(prg.binaries[0].decode())
It turned out that the clock command was optimized away! So there was no '%clock' instruction in the printed assembly.
Looking into Nvidia's PTX documentation I found the following:
'Normally any memory that is written to will be specified as an out operand, but if there is a hidden side effect on user memory (for example, indirect access of a memory location via an operand), or if you want to stop any memory optimizations around the asm() statement performed during generation of PTX, you can add a "memory" clobbers specification after a 3rd colon, e.g.:'
So the function that actually work is this:
uint clock_time()
{
uint clock_time;
asm volatile ("mov.u32 %0, %%clock;" : "=r"(clock_time) :: "memory");
return clock_time;
}
The assembly contained lines like:
// inline asm
mov.u32 %r13, %clock;
// inline asm
The version given by 'the swine' also works.

Memory Error in CUDA Program for Fermi GPU

I am facing the following problem on a GeForce GTX 580 (Fermi-class) GPU.
Just to give you some background, I am reading single-byte samples packed in the following manner in a file: Real(Signal 1), Imaginary(Signal 1), Real(Signal 2), Imaginary(Signal 2). (Each byte is a signed char, taking values between, -128 and 127.) I read these into a char4 array, and use the kernel given below to copy them to two float2 arrays corresponding to each signal. (This is just an isolated part of a larger program.)
When I run the program using cuda-memcheck, I get either an unqualified unspecified launch failure, or the same message along with User Stack Overflow or Breakpoint Hit or Invalid __global__ write of size 8 at random thread and block indices.
The main kernel and launch-related code is reproduced below. The strange thing is that this code works (and cuda-memcheck throws no error) on a non-Fermi-class GPU that I have access to. Another thing that I observed is that the Fermi gives no error for N less than 16384.
#define N 32768
int main(int argc, char *argv[])
{
char4 *pc4Buf_h = NULL;
char4 *pc4Buf_d = NULL;
float2 *pf2InX_d = NULL;
float2 *pf2InY_d = NULL;
dim3 dimBCopy(1, 1, 1);
dim3 dimGCopy(1, 1);
...
/* i do check for errors in the actual code */
pc4Buf_h = (char4 *) malloc(N * sizeof(char4));
(void) cudaMalloc((void **) &pc4Buf_d, N * sizeof(char4));
(void) cudaMalloc((void **) &pf2InX_d, N * sizeof(float2));
(void) cudaMalloc((void **) &pf2InY_d, N * sizeof(float2));
...
dimBCopy.x = 1024; /* number of threads in a block, for my GPU */
dimGCopy.x = N / 1024;
CopyDataForFFT<<<dimGCopy, dimBCopy>>>(pc4Buf_d,
pf2InX_d,
pf2InY_d);
...
}
__global__ void CopyDataForFFT(char4 *pc4Data,
float2 *pf2FFTInX,
float2 *pf2FFTInY)
{
int i = (blockIdx.x * blockDim.x) + threadIdx.x;
pf2FFTInX[i].x = (float) pc4Data[i].x;
pf2FFTInX[i].y = (float) pc4Data[i].y;
pf2FFTInY[i].x = (float) pc4Data[i].z;
pf2FFTInY[i].y = (float) pc4Data[i].w;
return;
}
One other thing I noticed in my program is that if I comment out any two char-to-float assignment statements in my kernel, there's no memory error. One other thing I noticed in my program is that if I comment out either the first two or the last two char-to-float assignment statements in my kernel, there's no memory error. If I comment out one from the first two (pf2FFTInX), and another from the second two (pf2FFTInY), errors still crop up, but less frequently. The kernel uses 6 registers with all four assignment statements uncommented, and uses 5 4 registers with two assignment statements commented out.
I tried the 32-bit toolkit in place of the 64-bit toolkit, 32-bit compilation with the -m32 compiler option, running without X windows, etc. but the program behaviour is the same.
I use CUDA 4.0 driver and runtime (also tried CUDA 3.2) on RHEL 5.6. The GPU compute capability is 2.0.
Please help! I could post the entire code if anybody is interested in running it on their Fermi cards.
UPDATE: Just for the heck of it, I inserted a __syncthreads() between the pf2FFTInX and the pf2FFTInY assignment statements, and memory errors disappeared for N = 32768. But at N = 65536, I still get errors. <-- This didn't last long. Still getting errors.
UPDATE: In continuing with the weird behaviour, when I run the program using cuda-memcheck, I get these 16x16 blocks of multi-coloured pixels distributed randomly all over my screen. This does not happen if I run the program directly.

The problem was a bad GPU card (see the comments). [I'm Adding this answer to remove the question from the unanswered list and make it more useful.]

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008