Why is cudaLaunchCooperativeKernel() returning not permitted? - cuda

I am using a GTX 1050 (compute capability 6.1) with CUDA 11.0. I need grid synchronization in my program, so cudaLaunchCooperativeKernel() is required. I have checked my device query, and the GPU does support cooperative launch. I am unable to execute the following kernel:
extern "C" __global__ void test(int x) {
if (x) {
printf("%d", x);
if (threadIdx.x == 0)
test<<<1, 1>>>(--x);
}
}
After calling,
cudaLaunchCooperativeKernel((void *)test, 1, 1, (void **) (&x));
I get the error 'operation not permitted' (error code 800). This error is supposedly returned when the device has no support for cooperative launch, which is not the case here. So what could be causing the problem?

Your kernel makes use of dynamic parallelism.
However, dynamic parallelism is not allowed in kernels launched via cudaLaunchCooperativeKernel().
This is mentioned in the documentation of the runtime API: https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__EXECUTION.html
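To illustrate the point, here is a minimal sketch of a cooperative kernel that uses grid-wide synchronization instead of launching kernels from the device. This is not the asker's code: the kernel and buffer names are illustrative, and it assumes compilation for sm_60 or later with -rdc=true (required for grid.sync()).
#include <cstdio>
#include <cooperative_groups.h>
namespace cg = cooperative_groups;
__global__ void coopKernel(int *data)
{
    cg::grid_group grid = cg::this_grid();
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = i;        // phase 1: every thread writes its slot
    grid.sync();        // grid-wide barrier (needs a cooperative launch)
    if (i == 0)
        printf("first element after sync: %d\n", data[0]);
}
int main()
{
    const int blocks = 2, threads = 128;
    int *d_data = NULL;
    cudaMalloc((void **)&d_data, blocks * threads * sizeof(int));
    void *args[] = { &d_data };
    // Cooperative launch: no dynamic parallelism inside the kernel.
    cudaLaunchCooperativeKernel((void *)coopKernel,
                                dim3(blocks), dim3(threads),
                                args, 0, 0);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}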

Related

Multiple global functions in the same CUDA source file

Can I write two separate global functions, that compute different things, in the same CUDA source file? Something like this:
__global__ void Ker1(mpz_t *d, mpz_t *c, mpz_t e, mpz_t n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    mpz_powm(d[i], c[i], e, n);
}
__global__ void Ker2(mpz_t *d, mpz_t *c, mpz_t d, mpz_t n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    mpz_powm(c[i], d[i], d, n);
}
int main()
{
    /* ... */
    cudaMemcpy(decode_device, decode_buffer, memSize, cudaMemcpyHostToDevice);
    Ker1<<<dimGrid, dimBlock>>>(d_device, c_device, e, n);
    Ker2<<<dimGrid, dimBlock>>>(c_device, d_device, d, n);
    cudaMemcpy(decode_buffer, decode_device, memSize, cudaMemcpyDeviceToHost);
}
If not, how would you do something like this?
It is quite unclear what you're asking, but after three readings I assume it is: "Can I write several kernels in the same source file?"
Yes, you can, and you can write as many kernel launches as you want in your main function.
Here is an example, from page 9 of the slides:
...
cudaMemcpy( dev1, host1, size, H2D ) ;
kernel2 <<< grid, block, 0 >>> ( ..., dev2, ... ) ;
kernel3 <<< grid, block, 0 >>> ( ..., dev3, ... ) ;
cudaMemcpy( host4, dev4, size, D2H ) ;
...
From: the Streams and Concurrency webinar.
The calls are asynchronous by default, so as soon as the kernel is launched on the GPU, the CPU moves on to the instructions that follow.
To force synchronization you have to use cudaDeviceSynchronize(), or any memory transfer via cudaMemcpy, which forces synchronization by itself.
Source: the CUDA FAQ.
Q: Can the CPU and GPU run in parallel?
Kernel invocation in CUDA is asynchronous, so the driver will return control to the application as soon as it has launched the kernel.
The "cudaThreadSynchronize()" API call should be used when measuring
performance to ensure that all device operations have completed before
stopping the timer.
CUDA functions that perform memory copies and that control graphics
interoperability are synchronous, and implicitly wait for all kernels
to complete.
By the way, if you don't need to synchronize between kernels, they can be executed concurrently if your GPU has the required compute capability (CC):
Q: Is it possible to execute multiple kernels at the same time?
Yes. GPUs of compute capability 2.x or higher support concurrent kernel execution and launches.
(again quoted from the CUDA FAQ).
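If you do want kernels to overlap, a minimal sketch using two streams might look like this; kernelA and kernelB are placeholder kernels, not taken from the question:
#include <cuda_runtime.h>
__global__ void kernelA(float *a) { a[threadIdx.x] *= 2.0f; }
__global__ void kernelB(float *b) { b[threadIdx.x] += 1.0f; }
int main()
{
    float *d_a = NULL, *d_b = NULL;
    cudaMalloc((void **)&d_a, 256 * sizeof(float));
    cudaMalloc((void **)&d_b, 256 * sizeof(float));
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    // Independent kernels in different streams may run concurrently
    // on devices that support concurrent kernel execution.
    kernelA<<<1, 256, 0, s1>>>(d_a);
    kernelB<<<1, 256, 0, s2>>>(d_b);
    cudaDeviceSynchronize();    // wait for both streams to finish
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(d_a);
    cudaFree(d_b);
    return 0;
}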

CUDA atomicAdd for doubles definition error

In previous versions of CUDA, atomicAdd was not implemented for doubles, so it is common to implement it yourself, as shown here. With the new CUDA 8 RC, I run into trouble when I try to compile code that includes such a function. I guess this is because, with Pascal and compute capability 6.0, a native double version of atomicAdd has been added, but somehow it is not properly ignored for previous compute capabilities.
The code below used to compile and run fine with previous CUDA versions, but now I get this compilation error:
test.cu(3): error: function "atomicAdd(double *, double)" has already been defined
But if I remove my implementation, I instead get this error:
test.cu(33): error: no instance of overloaded function "atomicAdd" matches the argument list
argument types are: (double *, double)
I should add that I only see this if I compile with -arch=sm_35 or similar. If I compile with -arch=sm_60 I get the expected behavior, i.e. only the first error, and successful compilation in the second case.
Edit: Also, it is specific for atomicAdd -- if I change the name, it works well.
It really looks like a compiler bug. Can someone else confirm that this is the case?
Example code:
__device__ double atomicAdd(double* address, double val)
{
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);
}
__global__ void kernel(double *a)
{
    double b = 1.3;
    atomicAdd(a, b);
}
int main(int argc, char **argv)
{
    double *a;
    cudaMalloc(&a, sizeof(double));
    kernel<<<1, 1>>>(a);
    cudaFree(a);
    return 0;
}
Edit: I got an answer from Nvidia who recognize this problem, and here is what the developers say about it:
The sm_60 architecture, that is newly supported in CUDA 8.0, has
native fp64 atomicAdd function. Because of the limitations of our
toolchain and CUDA language, the declaration of this function needs to
be present even when the code is not being specifically compiled for
sm_60. This causes a problem in your code because you also define a
fp64 atomicAdd function.
CUDA builtin functions such as atomicAdd are implementation-defined
and can be changed between CUDA releases. Users should not define
functions with the same names as any CUDA builtin functions. We would
suggest you to rename your atomicAdd function to one that is not the
same as any CUDA builtin functions.
That flavor of atomicAdd is a new method introduced for compute capability 6.0. You may keep your previous implementation for other compute capabilities by guarding it with the architecture macro:
#if !defined(__CUDA_ARCH__) || __CUDA_ARCH__ >= 600
#else
<... place here your own pre-pascal atomicAdd definition ...>
#endif
This architecture identification macro is documented here:
5.7.4. Virtual Architecture Identification Macro
The architecture identification macro __CUDA_ARCH__ is assigned a three-digit value string xy0 (ending in a literal 0) during each nvcc compilation stage 1 that compiles for compute_xy.
This macro can be used in the implementation of GPU functions for determining the virtual architecture for which it is currently being compiled. The host code (the non-GPU code) must not depend on it.
I assume NVIDIA did not provide it for previous compute capabilities to avoid conflicts with users who define it themselves and have not moved to compute capability >= 6.x. I would not consider it a bug, though, rather a release delivery practice.
EDIT: the macro guard was incomplete (now fixed); here is a complete example.
#if !defined(__CUDA_ARCH__) || __CUDA_ARCH__ >= 600
#else
__device__ double atomicAdd(double* a, double b) { return b; }
#endif
__device__ double s_global;
__global__ void kernel() { atomicAdd(&s_global, 1.0); }
int main(int argc, char* argv[])
{
    kernel<<<1, 1>>>();
    return ::cudaDeviceSynchronize();
}
Compilation with:
$> nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Wed_May__4_21:01:56_CDT_2016
Cuda compilation tools, release 8.0, V8.0.26
Command lines (both successful):
$> nvcc main.cu -arch=sm_60
$> nvcc main.cu -arch=sm_35
You can see why this works by looking at the include file sm_60_atomic_functions.h, where the method is not declared if __CUDA_ARCH__ is lower than 600.
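In other words, the double overload is only visible on the host pass (where __CUDA_ARCH__ is undefined) and on device passes for sm_60 and above. Schematically, the guard has roughly this shape; this is only an illustration of the pattern, not the literal header text:
// Illustration only -- not copied from sm_60_atomic_functions.h.
#if !defined(__CUDA_ARCH__) || __CUDA_ARCH__ >= 600
__device__ double atomicAdd(double *address, double val);
#endif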

How to avoid memcpy if number of blocks depends on device variable?

I am computing a number, X, on the device. Now I need to launch a kernel with X threads. I can set the blockSize to 1024. Is there a way to set the number of blocks to ceil(X / 1024) without performing a memcpy?
I see two possibilities:
Use dynamic parallelism (if feasible). Rather than copying the result back to determine the execution parameters of the next launch, just have the device perform the next launch itself.
Use zero-copy or managed memory. In that case the GPU writes directly to CPU memory over the PCI-e bus, rather than requiring an explicit memory transfer.
Of those options, dynamic parallelism and managed memory require hardware features which are not available on all GPUs. Zero-copy memory is supported by all GPUs with compute capability >= 1.1, which in practice is just about every CUDA compatible device ever made.
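For the zero-copy route, a minimal sketch might look like the following; the compute_blocks kernel and the constant standing in for X are hypothetical placeholders:
#include <cstdio>
#include <cuda_runtime.h>
// Hypothetical kernel: computes X on the device and stores ceil(X / 1024)
// directly into mapped (zero-copy) host memory.
__global__ void compute_blocks(int *num_blocks)
{
    if (threadIdx.x == 0) {
        int X = 5000;                       // stand-in for the real computation
        *num_blocks = (X + 1023) / 1024;    // ceil(X / 1024)
    }
}
__global__ void work_kernel()
{
    if (threadIdx.x == 0)
        printf("block %d\n", blockIdx.x);
}
int main()
{
    int *h_blocks = NULL, *d_blocks = NULL;
    cudaSetDeviceFlags(cudaDeviceMapHost);                          // allow mapped host memory
    cudaHostAlloc((void **)&h_blocks, sizeof(int), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_blocks, (void *)h_blocks, 0);  // device view of the same memory
    compute_blocks<<<1, 1024>>>(d_blocks);
    cudaDeviceSynchronize();                                        // result is now visible on the host
    work_kernel<<<*h_blocks, 1024>>>();                             // no cudaMemcpy needed
    cudaDeviceSynchronize();
    cudaFreeHost(h_blocks);
    return 0;
}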
Here is an example of using managed memory, as outlined by @talonmies, which allows kernel1 to determine the number of blocks for kernel2 without an explicit memcpy.
#include <stdio.h>
#include <cuda.h>
using namespace std;
__device__ __managed__ int kernel2_blocks;
__global__ void kernel1() {
    if (threadIdx.x == 0) {
        kernel2_blocks = 42;
    }
}
__global__ void kernel2() {
    if (threadIdx.x == 0) {
        printf("block: %d\n", blockIdx.x);
    }
}
int main() {
    kernel1<<<1, 1024>>>();
    cudaDeviceSynchronize();
    kernel2<<<kernel2_blocks, 1024>>>();
    cudaDeviceSynchronize();
    return 0;
}
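For completeness, a rough sketch of the dynamic parallelism option might look like the following, assuming a device of compute capability >= 3.5 and compilation with something like nvcc -arch=sm_35 -rdc=true -lcudadevrt; the kernel names and the constant standing in for X are illustrative only:
#include <cstdio>
__global__ void child_kernel()
{
    if (threadIdx.x == 0)
        printf("child block %d\n", blockIdx.x);
}
__global__ void parent_kernel()
{
    if (threadIdx.x == 0) {
        int X = 5000;                      // stand-in for the value computed on the device
        int blocks = (X + 1023) / 1024;    // ceil(X / 1024)
        child_kernel<<<blocks, 1024>>>();  // launched from the device, no copy back to the host
    }
}
int main()
{
    parent_kernel<<<1, 1024>>>();
    cudaDeviceSynchronize();
    return 0;
}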

cuda kernels using pthreads Missing Configuration Error

What is the meaning of the "missing configuration" error in CUDA?
The code below is a thread function. When I run it, the error obtained is 1, which corresponds to the missing configuration error. What is the mistake in this code?
void* run(void *args)
{
    cudaError_t error;
    Matrix *matrix = (Matrix *)args;
    int scalar = 2;
    dim3 dimGrid(1, 1, 1);
    dim3 dimBlock(1024, 1, 1);
    cudaEvent_t start, stop;
    cudaSetDevice(0);
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    for (int i = 0; i < matrix->number; i++)
    {
        syntheticKernel<<<dimGrid, dimBlock>>>();
        cudaThreadSynchronize();
    }
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&matrix->time, start, stop);
    error = cudaGetLastError();
    assert(error != 0);
    printf("%d\n", error);
}
Can you add more detail about your program please? The CUDA API routines each return a status code; you should check the status of each API call to catch and decode the first reported error.
One point to check is that you have not called any CUDA API routines before you fork the pthreads. Creating a CUDA context (which is automatic for most, but not all, CUDA API routines) before you fork the threads will cause problems. Check this, and if it's not the problem add more details to your question and check the return value of all API calls.
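As a sketch of what that per-call checking might look like (the CHECK macro here is just an illustrative helper, not part of the CUDA API):
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
// Illustrative helper: wraps a runtime API call and reports the first
// failure with its location and the decoded error string.
#define CHECK(call)                                                \
    do {                                                           \
        cudaError_t err = (call);                                  \
        if (err != cudaSuccess) {                                  \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                    cudaGetErrorString(err));                      \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)
int main()
{
    CHECK(cudaSetDevice(0));
    // After a kernel launch one would typically also do:
    //   CHECK(cudaGetLastError());        // catches launch/configuration errors
    //   CHECK(cudaDeviceSynchronize());   // catches errors during execution
    return 0;
}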
Why are you launching a single block in the grid? This configuration seems suspicious:
dim3 dimGrid(1,1,1);
dim3 dimBlock(1024,1,1);
Try increasing the grid size and putting fewer threads in each block. But your main problem is probably about contexts, as Tom suggests.

Memory Error in CUDA Program for Fermi GPU

I am facing the following problem on a GeForce GTX 580 (Fermi-class) GPU.
Just to give you some background, I am reading single-byte samples packed in the following manner in a file: Real(Signal 1), Imaginary(Signal 1), Real(Signal 2), Imaginary(Signal 2). (Each byte is a signed char, taking values between -128 and 127.) I read these into a char4 array, and use the kernel given below to copy them to two float2 arrays corresponding to each signal. (This is just an isolated part of a larger program.)
When I run the program using cuda-memcheck, I get either an unqualified unspecified launch failure, or the same message along with User Stack Overflow or Breakpoint Hit or Invalid __global__ write of size 8 at random thread and block indices.
The main kernel and launch-related code is reproduced below. The strange thing is that this code works (and cuda-memcheck throws no error) on a non-Fermi-class GPU that I have access to. Another thing that I observed is that the Fermi gives no error for N less than 16384.
#define N 32768
int main(int argc, char *argv[])
{
    char4 *pc4Buf_h = NULL;
    char4 *pc4Buf_d = NULL;
    float2 *pf2InX_d = NULL;
    float2 *pf2InY_d = NULL;
    dim3 dimBCopy(1, 1, 1);
    dim3 dimGCopy(1, 1);
    ...
    /* i do check for errors in the actual code */
    pc4Buf_h = (char4 *) malloc(N * sizeof(char4));
    (void) cudaMalloc((void **) &pc4Buf_d, N * sizeof(char4));
    (void) cudaMalloc((void **) &pf2InX_d, N * sizeof(float2));
    (void) cudaMalloc((void **) &pf2InY_d, N * sizeof(float2));
    ...
    dimBCopy.x = 1024;  /* number of threads in a block, for my GPU */
    dimGCopy.x = N / 1024;
    CopyDataForFFT<<<dimGCopy, dimBCopy>>>(pc4Buf_d,
                                           pf2InX_d,
                                           pf2InY_d);
    ...
}
__global__ void CopyDataForFFT(char4 *pc4Data,
                               float2 *pf2FFTInX,
                               float2 *pf2FFTInY)
{
    int i = (blockIdx.x * blockDim.x) + threadIdx.x;
    pf2FFTInX[i].x = (float) pc4Data[i].x;
    pf2FFTInX[i].y = (float) pc4Data[i].y;
    pf2FFTInY[i].x = (float) pc4Data[i].z;
    pf2FFTInY[i].y = (float) pc4Data[i].w;
    return;
}
One other thing I noticed in my program is that if I comment out either the first two or the last two char-to-float assignment statements in my kernel, there's no memory error. If I comment out one from the first two (pf2FFTInX) and another from the second two (pf2FFTInY), errors still crop up, but less frequently. The kernel uses 6 registers with all four assignment statements uncommented, and 4 registers with two assignment statements commented out.
I tried the 32-bit toolkit in place of the 64-bit toolkit, 32-bit compilation with the -m32 compiler option, running without X windows, etc. but the program behaviour is the same.
I use CUDA 4.0 driver and runtime (also tried CUDA 3.2) on RHEL 5.6. The GPU compute capability is 2.0.
Please help! I could post the entire code if anybody is interested in running it on their Fermi cards.
UPDATE: Just for the heck of it, I inserted a __syncthreads() between the pf2FFTInX and the pf2FFTInY assignment statements, and the memory errors disappeared for N = 32768 (though not for N = 65536). This didn't last long, though; I'm still getting errors.
UPDATE: In continuing with the weird behaviour, when I run the program using cuda-memcheck, I get these 16x16 blocks of multi-coloured pixels distributed randomly all over my screen. This does not happen if I run the program directly.
The problem was a bad GPU card (see the comments). [I'm adding this answer to remove the question from the unanswered list and make it more useful.]