CUDA kernels using pthreads: missing configuration error

What is the meaning of the missing configuration error in CUDA?
The code below is a thread function. When I run it, the error returned is 1, which corresponds to a missing configuration error. What is the mistake in this code?
void* run(void *args)
{
    cudaError_t error;
    Matrix *matrix = (Matrix*)args;
    int scalar = 2;
    dim3 dimGrid(1, 1, 1);
    dim3 dimBlock(1024, 1, 1);
    cudaEvent_t start, stop;
    cudaSetDevice(0);
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    for(int i = 0; i < matrix->number; i++)
    {
        syntheticKernel<<<dimGrid, dimBlock>>>();
        cudaThreadSynchronize();
    }
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&matrix->time, start, stop);
    error = cudaGetLastError();
    assert(error != 0);
    printf("%d\n", error);
}

Can you add more detail about your program, please? Each CUDA API routine returns a status code; you should check the status of every API call so that you catch and decode the first reported error.
One point to check is that you have not called any CUDA API routines before you fork the pthreads. Creating a CUDA context (which happens automatically for most, but not all, CUDA API routines) before you fork the threads will cause problems. Check this, and if it's not the problem, add more details to your question and check the return value of all API calls.
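For illustration, a minimal error-checking wrapper (the macro name CUDA_CHECK is just an example, not taken from the question) could be placed around every runtime call so the first failure is reported with its location:

#include <cstdio>
#include <cstdlib>

#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",             \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

/* usage inside the thread function */
// CUDA_CHECK(cudaSetDevice(0));
// CUDA_CHECK(cudaEventCreate(&start));
// syntheticKernel<<<dimGrid, dimBlock>>>();
// CUDA_CHECK(cudaGetLastError());          /* catches launch/configuration errors */
// CUDA_CHECK(cudaDeviceSynchronize());     /* catches errors raised during execution */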

Why are you launching a single block in a Grid? This configuration seems suspicious:
dim3 dimGrid(1,1,1);
dim3 dimBlock(1024,1,1);
Try increasing the grid size and putting fewer threads in each block. But your main problem is probably about contexts, as Tom suggests.
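For example, a launch that spreads the work over many smaller blocks (the problem size N and the block size of 256 below are illustrative, not taken from the question) might look like this:

int N = 1 << 20;                        // illustrative number of work items
int threadsPerBlock = 256;              // fewer threads per block
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
dim3 dimBlock(threadsPerBlock, 1, 1);
dim3 dimGrid(blocksPerGrid, 1, 1);
syntheticKernel<<<dimGrid, dimBlock>>>();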


Why can't the first CUDA kernel overlap with a previous memcpy?

Here is a demo. The kernel does not overlap with the preceding cudaMemcpyAsync, even though they are in different streams.
#include <iostream>
#include <cuda_runtime.h>

__global__ void warmUp(){
    int Id = blockIdx.x*blockDim.x + threadIdx.x;
    if(Id == 0){
        printf("warm up!");
    }
}

__global__ void kernel(){
    int Id = blockIdx.x*blockDim.x + threadIdx.x;
    if(Id == 0){
        long long x = 0;
        for(int i = 0; i < 1000000; i++){
            x += i >> 1;
        }
        printf("kernel!%lld\n", x);   // %lld to match the long long argument
    }
}

int main(){
    //warmUp<<<1,32>>>();
    int *data, *data_dev;
    int dataSize = pow(10, 7);
    cudaMallocHost(&data, dataSize*sizeof(int));
    cudaMalloc(&data_dev, dataSize*sizeof(int));
    cudaStream_t stream1, stream2;
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);
    cudaMemcpyAsync(data_dev, data, dataSize*sizeof(int), cudaMemcpyHostToDevice, stream1);
    kernel<<<1, 32, 0, stream2>>>();
}
The Visual Profiler shows that the copy and the kernel do not overlap.
After some experimentation, I found that this is because it is the first kernel call.
If I uncomment warmUp<<<1,32>>>();, the Visual Profiler shows that they do overlap.
Why?
CUDA uses lazy initialization. Because of this, the first time you do a particular operation or a particular operation type, it's possible that the behavior will not be as you expect.
The operation will/should work "correctly", but performance measurements may not be as you expect.
Contrary to the linked article, there really is no specified formula to force the lazy initialization to complete, without performing the actual work you intend to do.
If the only thing you ever intend to do with your application is launch a single kernel, then having that kernel overlap with a previous copy operation doesn't seem to make a lot of sense to me. In any event, you should expect that device initialization is necessary before all operations will proceed at expected speeds or in expected ways.
Lazy initialization behavior may vary based on CUDA version, platform (e.g. OS) and GPU type.
Additionally, kernel launches are asynchronous. So this particular coding pattern:
int main(){
    ...
    kernel<<<1, 32, 0, stream2>>>();
}
is generally not recommended in CUDA, and specifically is not recommended when using a profiler. Your code should provide the opportunity for all issued work to complete properly, in order for the profiler to provide useful results. You should provide a cudaDeviceSynchronize() or similar operation at the end of your code, if you want to profile it, for this type of pattern.
I also don't recommend doing performance analysis on kernels that issue printf calls. The printf call imposes additional host/device synchronization behavior/needs, and this can be confusing; it's not easy to predict its performance impact.
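Putting these suggestions together, a sketch of how the question's main might end up (the warm-up kernel and stream setup come from the question; the cudaDeviceSynchronize() calls are the additions being suggested, shown only as an illustration):

int main(){
    warmUp<<<1, 32>>>();            // force lazy initialization to happen up front
    cudaDeviceSynchronize();        // wait for initialization/warm-up to finish

    int *data, *data_dev;
    int dataSize = 10000000;
    cudaMallocHost(&data, dataSize*sizeof(int));
    cudaMalloc(&data_dev, dataSize*sizeof(int));
    cudaStream_t stream1, stream2;
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);

    cudaMemcpyAsync(data_dev, data, dataSize*sizeof(int), cudaMemcpyHostToDevice, stream1);
    kernel<<<1, 32, 0, stream2>>>();
    cudaDeviceSynchronize();        // let all issued work finish so the profiler can see it
    return 0;
}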

Why is cudaLaunchCooperativeKernel() returning not permitted?

So I am using a GTX 1050 (compute capability 6.1) with CUDA 11.0. I need grid synchronization in my program, so cudaLaunchCooperativeKernel() is needed. I have checked the device query, and the GPU does have support for cooperative groups. I am unable to execute the following kernel:
extern "C" __global__ void test(int x) {
if (x) {
printf("%d", x);
if (threadIdx.x == 0)
test<<<1, 1>>>(--x);
}
}
After calling,
cudaLaunchCooperativeKernel((void *)test, 1, 1, (void **) (&x));
I get the error 'operation not permitted' (code 800). This is supposed to be returned when the device has no support for cooperative groups, which is not the case here. So what could cause this problem?
Your kernel makes use of dynamic parallelism.
However, dynamic parallelism is not allowed in kernels that are launched via cudaLaunchCooperativeKernel.
This is mentioned in the documentation of the runtime API. https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__EXECUTION.html
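As an illustration (not the asker's code), a cooperative launch that uses a grid-wide barrier instead of dynamic parallelism could look roughly like this; the kernel name and sizes are made up, and grid.sync() typically also requires compiling with nvcc -rdc=true:

#include <cstdio>
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void gridSyncKernel(int *data) {
    cg::grid_group grid = cg::this_grid();
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    data[id] = id;
    grid.sync();                          // grid-wide barrier; needs a cooperative launch
    if (id == 0)
        printf("after grid sync: %d\n", data[0]);
}

int main() {
    int *data;
    cudaMalloc(&data, 32 * sizeof(int));
    void *args[] = { &data };             // one pointer per kernel argument
    // all blocks of a cooperative launch must be resident on the device at once
    cudaLaunchCooperativeKernel((void *)gridSyncKernel, dim3(1), dim3(32), args, 0, 0);
    cudaDeviceSynchronize();
    cudaFree(data);
    return 0;
}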

Implementing a mutex in a CUDA kernel function leads to deadlock

I'm a newcomer to CUDA, and I am trying to implement a mutex in a kernel function.
I read some tutorials and wrote my function, but in some cases a deadlock happens.
Here is my code. The kernel function simply counts the number of threads launched from the main function.
#include <iostream>
#include <cuda_runtime.h>

__global__ void countThreads(int* sum, int* mutex) {
    while(atomicCAS(mutex, 0, 1) != 0); // lock
    *sum += 1;
    __threadfence();
    atomicExch(mutex, 0); // unlock
}

int main() {
    int* mutex = nullptr;
    cudaMalloc(&mutex, sizeof(int));
    cudaMemset(mutex, 0, sizeof(int));   // zero the device-side lock
    int* sum = nullptr;
    cudaMalloc(&sum, sizeof(int));
    cudaMemset(sum, 0, sizeof(int));     // zero the device-side counter
    int ret = 0;

    // pass, result is 1024
    countThreads<<<1024, 1>>>(sum, mutex);
    cudaMemcpy(&ret, sum, sizeof(int), cudaMemcpyDeviceToHost);
    std::cout << ret << std::endl;

    // deadlock, why?
    countThreads<<<1, 2>>>(sum, mutex);
    cudaMemcpy(&ret, sum, sizeof(int), cudaMemcpyDeviceToHost);
    std::cout << ret << std::endl;
    return 0;
}
So, can anyone tell me why the program deadlocks when calling countThreads<<<1, 2>>>(), and how to fix it? I want a cross-block mutex; maybe that is not a good idea, though. Many thanks.
I experimented for some time and found that the deadlock happens when the threads are in the same block; otherwise, everything works well.
Threads in the same warp attempting to negotiate for a lock or mutex is probably the worst-case scenario. It is fairly difficult to program correctly, and the behavior may change depending on the exact GPU you are running on.
Here is an example of the type of analysis needed to explain the exact reason for the deadlock in a particular case. Such analysis is not readily done on what you have shown here, because you have not indicated the type of GPU you are compiling for or running on. It's also fairly important to provide the CUDA version you are using for compilation; I have witnessed code changes from one compiler generation to another that may impact this. Even if you provided that information, I'm not sure the analysis is really worthwhile, because I consider the negotiation-within-a-warp case to be extra troublesome to program correctly. This question/answer may also be of interest.
My general suggestion for a newcomer in CUDA (as you say) would be to use a method similar to what is described here. Briefly, negotiate for a lock at the threadblock level (ie. have one thread in each block negotiate among other blocks for the lock) then manage singleton activity within the block using standard, available block-level coordination schemes, such as __syncthreads(), and conditional coding.
You can learn more about this topic by searching on the cuda tag for such keywords as "lock" "critical section" etc.
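A minimal sketch of that block-level idea (the kernel name and structure here are illustrative, not the asker's code): one thread per block negotiates for the device-wide lock, and __syncthreads() keeps the rest of the block in step:

__global__ void countThreadsBlockLock(int* sum, int* mutex) {
    if (threadIdx.x == 0) {
        // only one thread per block spins on the device-wide lock
        while (atomicCAS(mutex, 0, 1) != 0);
    }
    __syncthreads();               // whole block waits until its leader holds the lock
    if (threadIdx.x == 0) {
        *sum += 1;                 // singleton work for this block
        __threadfence();
        atomicExch(mutex, 0);      // release the lock
    }
    __syncthreads();
}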
FWIW, for me, anyway, your code does deadlock on a Kepler device and does not deadlock on a Volta device, as suggested by the reference in the comments. I'm not attempting to communicate any statement about whether your code is defect-free, it's just an observation. If I modify your kernel to look like this:
__global__ void countThreads(int* sum, int* mutex) {
    int old = 1;
    while (old){
        old = atomicCAS(mutex, 0, 1); // lock
        if (old == 0){
            *sum += 1;
            __threadfence();
            atomicExch(mutex, 0); // unlock
        }
    }
}
Then it seems to me to work in either the Kepler case or the Volta case. I'm not advancing this example to suggest it is "correct", rather to show that somewhat innocuous code modifications can change a code from deadlock to non-deadlock case, or vice versa. This kind of fragility is best avoided, certainly in the pre-Volta case, in my opinion.
For the Volta-and-forward case, with CUDA 11 and forward, you may want to use capabilities from the libcu++ library, such as a semaphore.

CUDA Kernels Randomly Fail, but only when I use certain transcendental functions

I've been working on a CUDA program that randomly crashes, fairly frequently, with an unspecified launch failure. Through careful debugging, I localized which kernel was failing, and furthermore found that the failure occurred only if certain transcendental functions were called from within the CUDA kernel (e.g. sinf() or atanhf()).
This led me to write a much simpler program (see below) to confirm that these transcendental functions really were causing an issue, and it looks like that is indeed the case. When I compile and repeatedly run the code below, which just makes repeated calls to a kernel that uses tanh and atanh, sometimes the program works, and sometimes it prints Error with Kernel along with a message from the driver that says:
NVRM: XiD (0000:01:00): 13, 0002 000000 000050c0 00000368 00000000 0000080
With regards to frequency, it probably crashes 50% of the time that I run the executable.
From what I've read online, it sounds like XiD 13 is analogous to a host-based seg fault. However, given the array indexing, I can't see how that could be the case. Furthermore the program doesn't crash if I replace the transcendental functions in the kernel with other functions (e.g. repeated floating point subtraction and addition). That is, I don't get the XiD error message, and the program ultimately returns the correct value of atanh(0.7).
I'm running cuda-5.0 on Ubuntu 11.10 x64 Desktop. Driver version is 304.54, and I'm using a GeForce 9800 GTX.
I'm inclined to say that this is a hardware issue or a driver bug. What's strange is that the example applications from nvidia work fine, perhaps because they do not use the affected transcendental functions.
The final bit of potentially important information is that if I run either my main project, or this test program under cuda-memcheck, it reports no errors, and never crashes. Honestly, I'd just run my project under cuda-memcheck, but the performance hit makes it impractical.
Thanks in advance for any help/insight here. If any one has a 9800 GTX and would be willing to run this code to see if it works, it would be greatly appreciated.
#include <iostream>
#include <stdlib.h>
using namespace std;

__global__ void test_trans (float *a, int length) {
    if ((threadIdx.x + blockDim.x*blockIdx.x) < length) {
        float temp = 0.7;
        for (int i = 0; i < 100; i++) {
            temp = atanh(temp);
            temp = tanh(temp);
        }
        a[threadIdx.x + blockDim.x*blockIdx.x] = atanh(temp);
    }
}

int main () {
    float *array_dev;
    float *array_host;
    unsigned int size = 10000000;
    if (cudaSuccess != cudaMalloc ((void**)&array_dev, size*sizeof(float)) ) {
        cerr << "Error with memory Allocation\n"; exit (-1);}
    array_host = new float [size];
    for (int i = 0; i < 10; i++) {
        test_trans <<< size/512+1, 512 >>> (array_dev, size);
        if (cudaSuccess != cudaDeviceSynchronize()) {
            cerr << "Error with kernel\n"; exit (-1);}
    }
    cudaMemcpy (array_host, array_dev, sizeof(float)*size, cudaMemcpyDeviceToHost);
    cout << array_host[size-1] << "\n";
}
Edit: I dropped this project for a few months, but yesterday upon updating to driver version 319.23, I'm no longer having this problem. I think the issue I described must have been a bug that was fixed. Hope this helps.
The asker determined that this was a temporary issue fixed by a newer CUDA release. See the edit to the original question.

Memory Error in CUDA Program for Fermi GPU

I am facing the following problem on a GeForce GTX 580 (Fermi-class) GPU.
Just to give you some background, I am reading single-byte samples packed in the following manner in a file: Real(Signal 1), Imaginary(Signal 1), Real(Signal 2), Imaginary(Signal 2). (Each byte is a signed char taking values between -128 and 127.) I read these into a char4 array and use the kernel given below to copy them to two float2 arrays corresponding to each signal. (This is just an isolated part of a larger program.)
When I run the program using cuda-memcheck, I get either an unqualified unspecified launch failure, or the same message along with User Stack Overflow or Breakpoint Hit or Invalid __global__ write of size 8 at random thread and block indices.
The main kernel and launch-related code is reproduced below. The strange thing is that this code works (and cuda-memcheck throws no error) on a non-Fermi-class GPU that I have access to. Another thing that I observed is that the Fermi gives no error for N less than 16384.
#define N 32768

int main(int argc, char *argv[])
{
    char4 *pc4Buf_h = NULL;
    char4 *pc4Buf_d = NULL;
    float2 *pf2InX_d = NULL;
    float2 *pf2InY_d = NULL;
    dim3 dimBCopy(1, 1, 1);
    dim3 dimGCopy(1, 1);
    ...
    /* i do check for errors in the actual code */
    pc4Buf_h = (char4 *) malloc(N * sizeof(char4));
    (void) cudaMalloc((void **) &pc4Buf_d, N * sizeof(char4));
    (void) cudaMalloc((void **) &pf2InX_d, N * sizeof(float2));
    (void) cudaMalloc((void **) &pf2InY_d, N * sizeof(float2));
    ...
    dimBCopy.x = 1024; /* number of threads in a block, for my GPU */
    dimGCopy.x = N / 1024;
    CopyDataForFFT<<<dimGCopy, dimBCopy>>>(pc4Buf_d,
                                           pf2InX_d,
                                           pf2InY_d);
    ...
}

__global__ void CopyDataForFFT(char4 *pc4Data,
                               float2 *pf2FFTInX,
                               float2 *pf2FFTInY)
{
    int i = (blockIdx.x * blockDim.x) + threadIdx.x;
    pf2FFTInX[i].x = (float) pc4Data[i].x;
    pf2FFTInX[i].y = (float) pc4Data[i].y;
    pf2FFTInY[i].x = (float) pc4Data[i].z;
    pf2FFTInY[i].y = (float) pc4Data[i].w;
    return;
}
One other thing I noticed in my program is that if I comment out either the first two or the last two char-to-float assignment statements in my kernel, there's no memory error. If I comment out one from the first two (pf2FFTInX) and another from the last two (pf2FFTInY), errors still crop up, but less frequently. The kernel uses 6 registers with all four assignment statements uncommented, and 4 registers with two assignment statements commented out.
I tried the 32-bit toolkit in place of the 64-bit toolkit, 32-bit compilation with the -m32 compiler option, running without X windows, etc. but the program behaviour is the same.
I use CUDA 4.0 driver and runtime (also tried CUDA 3.2) on RHEL 5.6. The GPU compute capability is 2.0.
Please help! I could post the entire code if anybody is interested in running it on their Fermi cards.
UPDATE: Just for the heck of it, I inserted a __syncthreads() between the pf2FFTInX and the pf2FFTInY assignment statements, and memory errors disappeared for N = 32768. But at N = 65536, I still get errors. <-- This didn't last long. Still getting errors.
UPDATE: In continuing with the weird behaviour, when I run the program using cuda-memcheck, I get these 16x16 blocks of multi-coloured pixels distributed randomly all over my screen. This does not happen if I run the program directly.
The problem was a bad GPU card (see the comments). [Adding this answer to remove the question from the unanswered list and make it more useful.]