CUDA, cuPrintf causes "unspecified launch failure"?

I have a kernel which runs twice with different grid sizes.
My problem is with cuPrintf. When I don't call cudaPrintfInit() before the kernel launch and cudaPrintfDisplay(stdout, true) and cudaPrintfEnd() after it, I get no error, but as soon as I add them I get an "unspecified launch failure" error.
In my device code, there is only one loop like this for printing:
if (threadIdx.x == 0) {
    cuPrintf("MAX:%f x:%d y:%d\n", maxVal, blockIdx.x, blockIdx.y);
}
I'm using CUDA 4.0 with a card of compute capability 2.0, so I'm compiling my code with this syntax:
nvcc LB2.0.cu -arch=compute_20 -code=sm_20

If you are on a CC 2.0 GPU, you don't need cuPrintf at all -- CUDA has printf built in for CC 2.0 and higher GPUs. So just replace your call to cuPrintf with this:
#if __CUDA_ARCH__ >= 200
if (threadIdx.x == 0) {
    printf("MAX:%f x:%d y:%d\n", maxVal, blockIdx.x, blockIdx.y);
}
#endif
(Note that you only need the #if / #endif lines if you are compiling your code for sm_20 as well as earlier architectures. With the example compilation command line you gave, you can eliminate them.)
With printf, you don't need cudaPrintfInit() or cudaPrintfDisplay() -- it is automatic. However if you print a lot of data, you may need to increase the default printf FIFO size with cudaDeviceSetLimit(), passing the cudaLimitPrintfFifoSize option.
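For example, here is a minimal sketch of growing the FIFO from host code before the kernel launch (the 8 MB figure is just an illustrative choice, not a recommendation):
cudaDeviceSetLimit(cudaLimitPrintfFifoSize, 8 * 1024 * 1024); // 8 MB, arbitrary example value
// ... launch kernels that call printf() ...
cudaDeviceSynchronize(); // device printf output is flushed at synchronization points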

Related

CUDA atomicAdd for doubles definition error

In previous versions of CUDA, atomicAdd was not implemented for doubles, so it is common to implement it yourself, as in the code below. With the new CUDA 8 RC, I run into trouble when I try to compile code which includes such a function. I guess this is because with Pascal and compute capability 6.0, a native double version of atomicAdd has been added, but somehow it is not properly ignored for previous compute capabilities.
The code below used to compile and run fine with previous CUDA versions, but now I get this compilation error:
test.cu(3): error: function "atomicAdd(double *, double)" has already been defined
But if I remove my implementation, I instead get this error:
test.cu(33): error: no instance of overloaded function "atomicAdd" matches the argument list
argument types are: (double *, double)
I should add that I only see this if I compile with -arch=sm_35 or similar. If I compile with -arch=sm_60 I get the expected behavior, i.e. only the first error, and successful compilation in the second case.
Edit: Also, it is specific for atomicAdd -- if I change the name, it works well.
It really looks like a compiler bug. Can someone else confirm that this is the case?
Example code:
__device__ double atomicAdd(double* address, double val)
{
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);
}

__global__ void kernel(double *a)
{
    double b = 1.3;
    atomicAdd(a, b);
}

int main(int argc, char **argv)
{
    double *a;
    cudaMalloc(&a, sizeof(double));
    kernel<<<1,1>>>(a);
    cudaFree(a);
    return 0;
}
Edit: I got an answer from NVIDIA, who recognized the problem; here is what the developers say about it:
The sm_60 architecture, that is newly supported in CUDA 8.0, has
native fp64 atomicAdd function. Because of the limitations of our
toolchain and CUDA language, the declaration of this function needs to
be present even when the code is not being specifically compiled for
sm_60. This causes a problem in your code because you also define a
fp64 atomicAdd function.
CUDA builtin functions such as atomicAdd are implementation-defined
and can be changed between CUDA releases. Users should not define
functions with the same names as any CUDA builtin functions. We would
suggest you to rename your atomicAdd function to one that is not the
same as any CUDA builtin functions.
That flavor of atomicAdd is a new method introduced for compute capability 6.0. You may keep your previous implementation for other compute capabilities by guarding it with a macro definition:
#if !defined(__CUDA_ARCH__) || __CUDA_ARCH__ >= 600
#else
<... place here your own pre-pascal atomicAdd definition ...>
#endif
This architecture identification macro is documented here:
5.7.4. Virtual Architecture Identification Macro
The architecture identification macro __CUDA_ARCH__ is assigned a three-digit value string xy0 (ending in a literal 0) during each nvcc compilation stage 1 that compiles for compute_xy.
This macro can be used in the implementation of GPU functions for determining the virtual architecture for which it is currently being compiled. The host code (the non-GPU code) must not depend on it.
I assume NVIDIA did not declare it for previous compute capabilities, to avoid conflicts with users who define it themselves and have not yet moved to compute capability >= 6.x. I would not consider it a bug, though, rather a release delivery practice.
EDIT: the macro guard was incomplete (fixed) - here is a complete example.
#if !defined(__CUDA_ARCH__) || __CUDA_ARCH__ >= 600
#else
__device__ double atomicAdd(double* a, double b) { return b; }
#endif

__device__ double s_global;

__global__ void kernel() { atomicAdd(&s_global, 1.0); }

int main(int argc, char* argv[])
{
    kernel<<<1,1>>>();
    return ::cudaDeviceSynchronize();
}
Compilation with:
$> nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Wed_May__4_21:01:56_CDT_2016
Cuda compilation tools, release 8.0, V8.0.26
Command lines (both successful):
$> nvcc main.cu -arch=sm_60
$> nvcc main.cu -arch=sm_35
You can see why it works by looking at the include file sm_60_atomic_functions.h, where the method is not declared if __CUDA_ARCH__ is lower than 600.
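Roughly speaking (this is a paraphrase, not the verbatim header text), the declaration there is wrapped in the same kind of guard:
// Paraphrased sketch of the guard in sm_60_atomic_functions.h (not the exact header contents):
#if !defined(__CUDA_ARCH__) || __CUDA_ARCH__ >= 600
__device__ double atomicAdd(double *address, double val);
#endif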

Maximum number of threads for a CUDA kernel on Tesla M2050

I am testing what the maximum number of threads for a simple kernel is. I find that the total number of threads cannot exceed 4096. The code is as follows:
#include <stdio.h>

#define N 100

__global__ void test(){
    printf("%d %d\n", blockIdx.x, threadIdx.x);
}

int main(void){
    double *p;
    size_t size = N*sizeof(double);
    cudaMalloc(&p, size);
    test<<<64,128>>>();
    //test<<<64,128>>>();
    cudaFree(p);
    return 0;
}
My test environment: CUDA 4.2.9 on Tesla M2050. The code is compiled with
nvcc -arch=sm_20 test.cu
While checking the output, I found that some combinations are missing. Running the command
./a.out|wc -l
I always got 4096. When I check the cc 2.0 limits, I can only find that the maximum number of blocks for the x, y, z dimensions is (1024, 1024, 512) and the maximum number of threads per block is 1024, and the kernel calls (either <<<64,128>>> or <<<128,64>>>) are well within those limits. Any idea?
NB: The CUDA memory operations are there to block the code so that the output from the kernel will be shown.
You are abusing kernel printf, and using it to judge how many threads you can run is a completely nonsensical idea. The runtime has a limited buffer size for printf output, and you are simply overflowing it with output when you run enough threads. There is an API to query and set the printf buffer size, using cudaDeviceGetLimit and cudaDeviceSetLimit (thanks to Robert Crovella for the link to the printf documentation in comments).
You can find the maximum number of threads a given kernel can run by looking here in the documentation.
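As an illustrative sketch (reusing the test kernel from the question), the per-kernel limit can also be queried at runtime with cudaFuncGetAttributes, and the current printf buffer size with cudaDeviceGetLimit:
#include <stdio.h>

__global__ void test(){
    printf("%d %d\n", blockIdx.x, threadIdx.x);
}

int main(void){
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, test);                 // per-kernel launch limit
    printf("max threads per block for test(): %d\n", attr.maxThreadsPerBlock);

    size_t fifo = 0;
    cudaDeviceGetLimit(&fifo, cudaLimitPrintfFifoSize); // current printf FIFO size in bytes
    printf("printf FIFO size: %zu bytes\n", fifo);
    return 0;
}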

CUDA Kernels Randomly Fail, but only when I use certain transcendental functions

I've been working on a CUDA program that randomly crashes with an unspecified launch failure, fairly frequently. Through careful debugging, I localized which kernel was failing, and found furthermore that the failure occurred only if certain transcendental functions were called from within the CUDA kernel (e.g. sinf() or atanhf()).
This led me to write a much simpler program (see below) to confirm that these transcendental functions really were causing an issue, and it looks like that is indeed the case. When I compile and run the code below, which just makes repeated calls to a kernel that uses tanh and atanh, sometimes the program works, and sometimes it prints Error with Kernel along with a message from the driver that says:
NVRM: XiD (0000:01:00): 13, 0002 000000 000050c0 00000368 00000000 0000080
With regards to frequency, it probably crashes 50% of the time that I run the executable.
From what I've read online, it sounds like XiD 13 is analogous to a host-based seg fault. However, given the array indexing, I can't see how that could be the case. Furthermore, the program doesn't crash if I replace the transcendental functions in the kernel with other operations (e.g. repeated floating point subtraction and addition). That is, I don't get the XiD error message, and the program ultimately returns the correct value of atanh(0.7).
I'm running cuda-5.0 on Ubuntu 11.10 x64 Desktop. Driver version is 304.54, and I'm using a GeForce 9800 GTX.
I'm inclined to say that this is a hardware issue or a driver bug. What's strange is that the example applications from NVIDIA work fine, perhaps because they do not use the affected transcendental functions.
The final bit of potentially important information is that if I run either my main project, or this test program under cuda-memcheck, it reports no errors, and never crashes. Honestly, I'd just run my project under cuda-memcheck, but the performance hit makes it impractical.
Thanks in advance for any help/insight here. If anyone has a 9800 GTX and would be willing to run this code to see if it works, it would be greatly appreciated.
#include <iostream>
#include <stdlib.h>

using namespace std;

__global__ void test_trans (float *a, int length) {
    if ((threadIdx.x + blockDim.x*blockIdx.x) < length) {
        float temp = 0.7;
        for (int i = 0; i < 100; i++) {
            temp = atanh(temp);
            temp = tanh(temp);
        }
        a[threadIdx.x + blockDim.x*blockIdx.x] = atanh(temp);
    }
}

int main () {
    float *array_dev;
    float *array_host;
    unsigned int size = 10000000;

    if (cudaSuccess != cudaMalloc ((void**)&array_dev, size*sizeof(float))) {
        cerr << "Error with memory Allocation\n"; exit (-1);
    }
    array_host = new float [size];

    for (int i = 0; i < 10; i++) {
        test_trans <<< size/512+1, 512 >>> (array_dev, size);
        if (cudaSuccess != cudaDeviceSynchronize()) {
            cerr << "Error with kernel\n"; exit (-1);
        }
    }

    cudaMemcpy (array_host, array_dev, sizeof(float)*size, cudaMemcpyDeviceToHost);
    cout << array_host[size-1] << "\n";
}
Edit: I dropped this project for a few months, but yesterday upon updating to driver version 319.23, I'm no longer having this problem. I think the issue I described must have been a bug that was fixed. Hope this helps.
The asker determined that this was a temporary issue fixed by a newer driver release. See the edit to the original question.

Can I get the CUDA compute capability (version) at compile time by #define?

How can I get the CUDA compute capability (version) at compile time by #define?
For example, if I use __ballot and compile with
nvcc -c -gencode arch=compute_20,code=sm_20 \
        -gencode arch=compute_13,code=sm_13 \
        source.cu
can I get the compute capability version in my code via a #define, so that I can choose between a code branch that uses __ballot and one that does not?
Yes. First, it's best to understand what happens when you use -gencode. NVCC will compile your input device code multiple times, once for each device target architecture. So in your example, NVCC will run compilation stage 1 once for compute_20 and once for compute_13.
When nvcc compiles a .cu file, it defines two preprocessor macros, __CUDACC__ and __CUDA_ARCH__. __CUDACC__ does not have a value; it is simply defined if nvcc is the compiler, and not defined if it isn't.
__CUDA_ARCH__ is defined to an integer value representing the SM version being compiled.
100 = compute_10
110 = compute_11
200 = compute_20
etc. To quote the NVCC documentation included with the CUDA Toolkit:
The architecture identification macro __CUDA_ARCH__ is assigned a three-digit value string xy0 (ending in a literal 0) during each nvcc compilation stage 1 that compiles for compute_xy. This macro can be used in the implementation of GPU functions for determining the virtual architecture for which it is currently being compiled. The host code (the non-GPU code) must not depend on it.
So, in your case where you want to use __ballot(), you can do this:
// ... (context omitted; predicate and lanemask are assumed to be defined earlier)
#if __CUDA_ARCH__ >= 200
    int b = __ballot(predicate);   // warp vote, available from CC 2.0 onwards
    int p = __popc(b & lanemask);  // population count of the masked ballot
#else
    // do something else for earlier architectures
#endif

clock() in OpenCL

I know that in CUDA there is a clock() function you can put in kernel code to query the GPU time. But I wonder if such a thing exists in OpenCL. Is there any way to query the GPU time in OpenCL? (I'm using NVIDIA's toolkit.)
There is no OpenCL way to query clock cycles directly. However, OpenCL does have a profiling mechanism that exposes incremental counters on compute devices. By comparing the differences between ordered events, elapsed times can be measured. See clGetEventProfilingInfo.
Just for others coming here for help: a short introduction to profiling kernel runtime with OpenCL
Enable profiling mode:
cmdQueue = clCreateCommandQueue(context, *devices, CL_QUEUE_PROFILING_ENABLE, &err);
Profile the kernel:
cl_event prof_event;
clEnqueueNDRangeKernel(cmdQueue, kernel, 1, NULL, globalWorkSize, NULL, 0, NULL, &prof_event);
Read the profiling data:
cl_ulong ev_start_time = (cl_ulong)0;
cl_ulong ev_end_time   = (cl_ulong)0;
clFinish(cmdQueue);
err = clWaitForEvents(1, &prof_event);
err |= clGetEventProfilingInfo(prof_event, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &ev_start_time, NULL);
err |= clGetEventProfilingInfo(prof_event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &ev_end_time, NULL);
Calculate the kernel execution time:
float run_time_gpu = (float)(ev_end_time - ev_start_time)/1000; // timestamps are in ns, so this is in usec
Profiling of individual work-items / work-groups is NOT possible yet.
You can set globalWorkSize = localWorkSize for profiling. Then you have only one workgroup.
Btw: profiling of a single work-item (or just a few work-items) isn't very helpful. With only a few work-items you won't be able to hide memory latencies or the overhead, which leads to measurements that aren't meaningful.
Try this (it only works with NVIDIA OpenCL, of course):
uint clock_time()
{
    uint clock_time;
    asm("mov.u32 %0, %%clock;" : "=r"(clock_time));
    return clock_time;
}
The NVIDIA OpenCL SDK has an example, Using Inline PTX with OpenCL. The clock register is accessible through inline PTX as the special register %clock. %clock is described in the PTX: Parallel Thread Execution ISA manual. In the SDK example you should be able to replace %%laneid with %%clock.
I have never tested this with OpenCL but use it in CUDA.
Please be warned that the compiler may reorder or remove the register read.
On NVIDIA you can use the following:
typedef unsigned long uint64_t; // if you haven't done so earlier

inline uint64_t n_nv_Clock()
{
    uint64_t n_clock;
    asm volatile("mov.u64 %0, %%clock64;" : "=l" (n_clock)); // make sure the compiler will not reorder this
    return n_clock;
}
The volatile keyword tells the optimizer that you really mean it and don't want it moved / optimized away. This is a standard way of doing so both in PTX and e.g. in gcc.
Note that this returns clocks, not nanoseconds. You need to query the device clock frequency (using clGetDeviceInfo(device, CL_DEVICE_MAX_CLOCK_FREQUENCY, sizeof(freq), &freq, 0)). Also note that on older devices there are two frequencies (or three, if you count the memory frequency, which is irrelevant in this case): the device clock and the shader clock. What you want is the shader clock.
With the 64-bit version of the register you don't need to worry about overflowing as it generally takes hundreds of years. On the other hand, the 32-bit version can overflow quite often (you can still recover the result - unless it overflows twice).
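For completeness, a minimal host-side sketch of that frequency query (device is assumed to be a valid cl_device_id and n_cycles a clock delta measured with %clock64; whether the reported frequency matches the shader clock depends on the device, as noted above):
// Convert a cycle count into seconds using the reported device clock frequency.
cl_uint freq_mhz = 0; // CL_DEVICE_MAX_CLOCK_FREQUENCY is reported in MHz
clGetDeviceInfo(device, CL_DEVICE_MAX_CLOCK_FREQUENCY, sizeof(freq_mhz), &freq_mhz, NULL);
double seconds = (double)n_cycles / ((double)freq_mhz * 1.0e6);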
Now, 10 years after the question was posted, I did some tests on NVIDIA hardware. I tried the answers given by users 'Spectral' and 'the swine'. The answer given by 'Spectral' does not work; I always got the same invalid values returned by the clock_time function.
uint clock_time()
{
    uint clock_time;
    asm("mov.u32 %0, %%clock;" : "=r"(clock_time)); // this is wrong
    return clock_time;
}
After subtracting the start time from the end time I got zero.
So I had a look at the PTX assembly, which in PyOpenCL you can get this way:
kernel_string = """
your OpenCL code
"""
prg = cl.Program(ctx, kernel_string).build()
print(prg.binaries[0].decode())
It turned out that the clock command was optimized away! So there was no '%clock' instruction in the printed assembly.
Looking into Nvidia's PTX documentation I found the following:
'Normally any memory that is written to will be specified as an out operand, but if there is a hidden side effect on user memory (for example, indirect access of a memory location via an operand), or if you want to stop any memory optimizations around the asm() statement performed during generation of PTX, you can add a "memory" clobbers specification after a 3rd colon, e.g.:'
So the function that actually works is this:
uint clock_time()
{
    uint clock_time;
    asm volatile ("mov.u32 %0, %%clock;" : "=r"(clock_time) :: "memory");
    return clock_time;
}
The assembly contained lines like:
// inline asm
mov.u32 %r13, %clock;
// inline asm
The version given by 'the swine' also works.