Wildly varying performance of cuMemAlloc/cuMemFree - CUDA

In my application, cuMemAlloc/cuMemFree seem awfully slow most of the time. However, I found that they are sometimes 10 times faster than usual. The test program below finishes in about 0.4 s on two machines, both with CUDA 5.5, but one with a compute capability 2.0 card and the other with a 3.5 one.
If the cuBLAS initialization is removed, it takes about 5 s. With the cuBLAS initialization in, but allocating a different number of bytes such as 4000, it slows down by about the same factor. Needless to say, I'm puzzled by this.
What can be causing this? If it's not a bug in my code, what kind of workaround do I have? The only thing I could think of is preallocating an arena and implementing my own allocator.
#include <stdio.h>
#include <cuda.h>
#include <cublas_v2.h>

#define cudaCheck(ans) { gpuAssert((ans), __FILE__, __LINE__); }

inline void gpuAssert(CUresult code, const char *file, int line)
{
    if (code != CUDA_SUCCESS) {
        fprintf(stderr, "GPUassert: %d %s %d\n", code, file, line);
        exit(1);
    }
}

int main(int argc, char *argv[])
{
    CUcontext context;
    CUdevice device;
    int devCount;

    cudaCheck(cuInit(0));
    cudaCheck(cuDeviceGetCount(&devCount));
    cudaCheck(cuDeviceGet(&device, 0));
    cudaCheck(cuCtxCreate(&context, 0, device));

    // Removing this cuBLAS initialization makes the loop below roughly 10x slower.
    cublasStatus_t stat;
    cublasHandle_t handle;
    stat = cublasCreate(&handle);
    if (stat != CUBLAS_STATUS_SUCCESS) {
        printf("CUBLAS initialization failed\n");
        exit(1);
    }

    {
        int i;
        for (i = 0; i < 30000; i++) {
            CUdeviceptr devBufferA;
            cudaCheck(cuMemAlloc(&devBufferA, 8000));
            cudaCheck(cuMemFree(devBufferA));
        }
    }
    return 0;
}

I took your code and profiled it on a 64-bit Linux system with the 319.21 driver, CUDA 5.5, and a non-display compute capability 3.0 device. My first observation is that the run time is about 0.5 s, which seems much faster than you are reporting. If I analyse the nvprof output, I get these histograms:
cuMemFree
Time (us)   Frequency
3.6519      29667
4.5938        276
5.5357         32
6.4776          1
7.4195          1
8.3614          6
9.3033          0
10.2452         1
11.1871         2
12.1290        14

cuMemAlloc
Time (us)   Frequency
3.5384      29869
4.5058         86
5.4732         20
6.4406          0
7.4080          0
8.3754          6
9.3428          0
10.3102         0
11.2776        12
12.2450         5
which tells me that 99.6% of cuMemAlloc calls take less than 3.5384 microseconds, and 98.9% of cuMemFree calls take less than 3.6519 microseconds. No free or allocate operation took more than 12.25 microseconds.
So my conclusions based on these results are
Both cuMemFree and cuMemAlloc are extremely fast, with every one of the 60000 total calls to those APIs in your example taking less than 12.25 microseconds
The median call time for both APIs is 2.7 microseconds, with a standard deviation of 0.25 microseconds, suggesting that there is very little variability in the API latency as well
Very occasionally (about 0.01% of the time), both APIs can be around six times slower than this median. This is probably due to operating system level resource contention
Every single one of the above points completely contradicts every assertion you have made in your question.
Given how different your results apparently are, I can only guess that you are running on a known high-latency platform like WDDM Windows, and that driver batching and WDDM subsystem latency are completely dominating the performance of the code. In that case, it would seem that the simplest workaround is to change platforms.

The CUDA memory manager is known to be slow. I've seen mention that it is "two orders of magnitude" slower than host malloc() and free(). This information may be dated, but there are some graphs here:
http://www.cs.virginia.edu/~mwb7w/cuda_support/memory_management_overhead.html
I think this is because the CUDA memory manager is optimized for handling a small number of allocations, at the cost of slowing down when there are many, and that this in turn is because it is generally not efficient to handle many small buffers in a kernel.
There are two main issues with dealing with many buffers in a kernel:
1) It implies passing a table of pointers to the kernel. If there is a pointer for each thread, you incur an initial cost of loading the pointer from a table in global memory before you can start working with the memory. Following a series of pointers is sometimes called "pointer chasing", and it is especially expensive on a GPU because memory accesses are relatively more expensive.
2) More importantly, a pointer for each thread implies a non-coalesced memory access pattern. On current architectures, if each thread in a warp loads a 32-bit value from global memory that is more than 128 bytes away from the others, 32 memory transactions are required to serve the warp. Each transaction will load 128 bytes and then discard 124 bytes. If all threads in a warp load values from the same naturally aligned 128-byte region, all the loads are served by a single memory transaction. So, in a memory-bound kernel, memory throughput may be only 1/32 of its potential.
The most efficient way to handle memory with CUDA is often to allocate a few large chunks and index into them in the kernel.
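If you do end up sub-allocating yourself, a minimal bump-style arena over a single up-front cuMemAlloc might look like the sketch below (illustrative names, driver API to match the question, no thread safety or real free list; "freeing" is just resetting the offset):

#include <cuda.h>

typedef struct {
    CUdeviceptr base;   // start of the arena
    size_t      size;   // total bytes reserved
    size_t      offset; // next free byte
} DeviceArena;

int arenaInit(DeviceArena *a, size_t bytes)
{
    a->size = bytes;
    a->offset = 0;
    return (cuMemAlloc(&a->base, bytes) == CUDA_SUCCESS) ? 0 : -1;
}

// Hands out 256-byte-aligned chunks from the arena.
CUdeviceptr arenaAlloc(DeviceArena *a, size_t bytes)
{
    size_t aligned = (bytes + 255) & ~(size_t)255;
    if (a->offset + aligned > a->size) return 0;   // out of arena space
    CUdeviceptr p = a->base + a->offset;
    a->offset += aligned;
    return p;
}

void arenaReset(DeviceArena *a)   { a->offset = 0; }
void arenaDestroy(DeviceArena *a) { cuMemFree(a->base); }

The 30000-iteration loop in the question then pays for one cuMemAlloc up front and only a few integer operations per request.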

Related

CUDA: Write directly from device to host pinned memory without sacrificing throughput?

In CUDA, is it possible to write directly to host (pinned) memory from a device kernel?
In my current setup, I first write to device DRAM and then copy from DRAM into host pinned memory.
I'm wondering if I can just write directly to host memory (i.e. use one step instead of two) without sacrificing throughput.
From what I understand, unified memory isn't the answer - guides mention that it's slower (perhaps because of its paging semantics?).
But I haven't tried it, so perhaps I'm mistaken - maybe there's an option to force everything to reside in host pinned memory?
There are numerous questions here under the cuda SO tag about how to use pinned memory for "zero-copy" operations. Here is one example. You can find many more.
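For reference, here is a minimal sketch of the zero-copy pattern (illustrative names; a mapped, pinned allocation that the kernel writes to directly):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void fill(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = i;            // the store goes straight to host-pinned memory
}

int main()
{
    const int n = 1 << 20;
    cudaSetDeviceFlags(cudaDeviceMapHost);   // may be needed on older setups; must precede context creation

    int *h_out = 0;
    cudaHostAlloc((void **)&h_out, n * sizeof(int), cudaHostAllocMapped);  // pinned + mapped

    int *d_out = 0;
    cudaHostGetDevicePointer((void **)&d_out, h_out, 0);  // device-side alias of h_out

    fill<<<(n + 255) / 256, 256>>>(d_out, n);
    cudaDeviceSynchronize();          // required before the host reads the results

    printf("h_out[12345] = %d\n", h_out[12345]);
    cudaFreeHost(h_out);
    return 0;
}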
If you only have to write to each output point once, and your writes are/would be nicely coalesced, then there should not be a major performance difference between the costs of:
writing to device memory and then cudaMemcpy D->H after the kernel
writing directly to host-pinned memory
You will still need a cudaDeviceSynchronize() after the kernel call, before accessing the data on the host, to ensure consistency.
Differences on the order of ~10 microseconds are still possible due to CUDA operation overheads.
It should be possible to demonstrate that bulk transfer of data using direct read/writes to pinned memory from kernel code will achieve approximately the same bandwidth as what you would get with a cudaMemcpy transfer.
As an aside, the "paging semantics" of unified memory can be worked around, but again, well-optimized code in any of these three scenarios is not likely to show marked performance or duration differences.
Responding to comments: my use of "approximately" above is probably a stretch. Here's a kernel that writes 4GB of data in less than half a second on a PCIe Gen2 system:
$ cat t2138.cu
template <typename T>
__global__ void k(T *d, size_t n){
    for (size_t i = blockIdx.x*blockDim.x+threadIdx.x; i < n; i+=gridDim.x*blockDim.x)
        d[i] = 0;
}
int main(){
    int *d;
    size_t n = 1048576*1024;
    cudaHostAlloc(&d, sizeof(d[0])*n, cudaHostAllocDefault);
    k<<<160, 1024>>>(d, n);
    k<<<160, 1024>>>(d, n);
    cudaDeviceSynchronize();
    int *d1;
    cudaMalloc(&d1, sizeof(d[0])*n);
    cudaMemcpy(d, d1, sizeof(d[0])*n, cudaMemcpyDeviceToHost);
}
$ nvcc -o t2138 t2138.cu
$ compute-sanitizer ./t2138
========= COMPUTE-SANITIZER
========= ERROR SUMMARY: 0 errors
$ nvprof ./t2138
==21201== NVPROF is profiling process 21201, command: ./t2138
==21201== Profiling application: ./t2138
==21201== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 72.48% 889.00ms 2 444.50ms 439.93ms 449.07ms void k<int>(int*, unsigned long)
27.52% 337.47ms 1 337.47ms 337.47ms 337.47ms [CUDA memcpy DtoH]
API calls: 60.27% 1.88067s 1 1.88067s 1.88067s 1.88067s cudaHostAlloc
28.49% 889.01ms 1 889.01ms 889.01ms 889.01ms cudaDeviceSynchronize
10.82% 337.55ms 1 337.55ms 337.55ms 337.55ms cudaMemcpy
0.17% 5.1520ms 1 5.1520ms 5.1520ms 5.1520ms cudaMalloc
0.15% 4.6178ms 4 1.1544ms 594.35us 2.8265ms cuDeviceTotalMem
0.09% 2.6876ms 404 6.6520us 327ns 286.07us cuDeviceGetAttribute
0.01% 416.39us 4 104.10us 59.830us 232.21us cuDeviceGetName
0.00% 151.42us 2 75.710us 13.663us 137.76us cudaLaunchKernel
0.00% 21.172us 4 5.2930us 3.0730us 8.5010us cuDeviceGetPCIBusId
0.00% 9.5270us 8 1.1900us 428ns 4.5250us cuDeviceGet
0.00% 3.3090us 4 827ns 650ns 1.2230us cuDeviceGetUuid
0.00% 3.1080us 3 1.0360us 485ns 1.7180us cuDeviceGetCount
$
4GB/0.44s = 9GB/s
4GB/0.34s = 11.75GB/s (typical for PCIE Gen2 to pinned memory)
We can see that contrary to my previous statement, the transfer of data using in-kernel copying to a pinned allocation does seem to be slower (about 33% slower in my test case) than using a bulk copy (cudaMemcpy DtoH to a pinned allocation). However this isn't quite an apples-to-apples comparison, because the kernel itself would still have to write the 4GB of data to the device allocation to make the comparison to cudaMemcpy be sensible. The speed of this operation will depend on the GPU device memory bandwidth, which varies by GPU of course. So 33% higher is probably "too high" of an estimate of the comparison. But if your GPU has lots of memory bandwidth, this estimate will be pretty close. (On my V100, writing 4GB to device memory only takes ~7ms).
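For what it's worth, the apples-to-apples variant described above would look roughly like the following sketch (my guess at the arrangement, not code from the original test): the kernel first writes the 4GB to a device allocation, and the bulk DtoH copy to pinned memory follows.

#include <cuda_runtime.h>

// Same grid-stride fill kernel as in t2138.cu above.
template <typename T>
__global__ void k(T *d, size_t n){
    for (size_t i = blockIdx.x*blockDim.x+threadIdx.x; i < n; i += gridDim.x*blockDim.x)
        d[i] = 0;
}

int main(){
    int *h, *ddev;
    size_t n = 1048576ULL * 1024ULL;
    cudaHostAlloc((void **)&h, sizeof(h[0]) * n, cudaHostAllocDefault);  // pinned destination
    cudaMalloc((void **)&ddev, sizeof(h[0]) * n);                        // device staging buffer
    k<<<160, 1024>>>(ddev, n);                                           // write 4GB to device memory first
    cudaMemcpy(h, ddev, sizeof(h[0]) * n, cudaMemcpyDeviceToHost);       // bulk copy; waits for the kernel on the default stream
    return 0;
}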

Getting an unexpected value in global device memory when multiple threads write to it

Here is a problem with CUDA threads and memory management: the code returns a single thread's result, "100", but I would expect the result of 9 threads, "900".
#include <stdio.h>
#include <assert.h>
#include <cuda_runtime.h>
#include <helper_functions.h>
#include <helper_cuda.h>
#include <windows.h>   // for Sleep()

__global__
void test(int in1, int *ptr){
    int e = 0;
    for (int i = 0; i < 100; i++){
        e++;
    }
    *ptr += e;
}

int main(int argc, char **argv)
{
    int devID = 0;
    cudaError_t error;
    error = cudaGetDevice(&devID);
    if (error == cudaSuccess)
    {
        printf("GPU Device fine\n");
    }
    else {
        printf("GPU Device problem, aborting");
        abort();
    }

    int *d_A;
    cudaMalloc(&d_A, sizeof(int));
    int res = 0;
    //cudaMemcpy(d_A, &res, sizeof(int), cudaMemcpyHostToDevice);
    test<<<3, 3>>>(0, d_A);
    cudaDeviceSynchronize();
    cudaMemcpy(&res, d_A, sizeof(int), cudaMemcpyDeviceToHost);
    printf("res is : %i", res);
    Sleep(10000);
    return 0;
}
It returns:
GPU Device fine\n
res is : 100
I would expect it to return a higher number because of the 3x3 (blocks, threads) launch, instead of just one thread's result.
What is done wrong, and where do the numbers get lost?
You can't accumulate your sum into global memory this way.
You have to use an atomic function to ensure that the read-modify-write is atomic.
In general, when multiple device threads write to the same location in global memory, you have to use either atomic functions:
float atomicAdd(float* address, float val);
double atomicAdd(double* address, double val);
reads the 32-bit or 64-bit word old located at the address address in global or shared memory, computes (old + val), and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old.
or thread synchronization:
Throughput for __syncthreads() is 16 operations per clock cycle for devices of compute capability 2.x, 128 operations per clock cycle for devices of compute capability 3.x, 32 operations per clock cycle for devices of compute capability 6.0 and 64 operations per clock cycle for devices of compute capability 5.x, 6.1 and 6.2.
Note that __syncthreads() can impact performance by forcing the multiprocessor to idle as detailed in Device Memory Accesses.
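Applied to the kernel in the question, the atomic fix is a one-line change (a sketch; remember to zero d_A first, e.g. with the commented-out cudaMemcpy or with cudaMemset, so the sum starts from 0):

__global__ void test(int in1, int *ptr)
{
    int e = 0;
    for (int i = 0; i < 100; i++) {
        e++;
    }
    atomicAdd(ptr, e);   // atomic read-modify-write: the 9 threads now accumulate 900
}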
(adapting another answer of mine:)
You are experiencing the effects of the compound assignment *ptr += e not being atomic (here is a C++-oriented description of what that means). What's happening, chronologically, is the following sequence of events (not necessarily in the same order across threads, though):
...(other work)...
block 0 thread 0 issues a LOAD instruction with address ptr into register r
block 0 thread 1 issues a LOAD instruction with address ptr into register r
...
block 2 thread 0 issues a LOAD instruction with address ptr into register r
block 0 thread 0 completes the LOAD, now having 0 in register r
...
block 2 thread 2 completes the LOAD, now having 0 in register r
block 0 thread 0 adds 100 to r
...
block 2 thread 2 adds 100 to r
block 0 thread 0 issues a STORE instruction from register r to address ptr
...
block 2 thread 2 issues a STORE instruction from register r to address ptr
Thus every thread sees the initial value of *ptr, which is 0; adds 100; and stores 0+100=100 back. The order of the stores doesn't matter here, since all of the threads are trying to store the same incorrect value.
What you need to do is one of the following:
Use atomic operations - the least amount of modification to your code, but very inefficient, since it serializes your work to a great extent, or
Use a block-level reduction primitive. This will ensure some partial ordering of the computational activity vis-a-vis shared block memory - using __syncthreads() or other mechanisms. Thus it might first have each thread add its own two elements up; then synchronize block threads; then have fewer threads add up pairs of pair-sums, and so on. Here's an nVIDIA blog post on implementing fast reductions on their more modern GPU architectures. A sketch of this option follows below. Or,
Use block-local or warp-local and/or work-group-specific partial results, which require less (or cheaper) synchronization, and combine them only after having done a lot of work on them.
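Here is a sketch of the block-level reduction option: a plain shared-memory tree reduction per block, followed by one atomicAdd per block. It assumes blockDim.x is a power of two and at most 256 (e.g. launch with <<<3, 256>>> rather than <<<3, 3>>>); the blog post linked above shows faster warp-shuffle variants.

__global__ void test_reduced(int *ptr)
{
    __shared__ int partial[256];
    int e = 0;
    for (int i = 0; i < 100; i++) e++;     // same per-thread work as in the question
    partial[threadIdx.x] = e;
    __syncthreads();

    // Halve the number of active threads each step until partial[0] holds the block sum.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        atomicAdd(ptr, partial[0]);        // one atomic per block instead of one per thread
}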

How to avoid Cuda error 6 (Launch Timeout) with consecutive asynchronous kernel launches?

I get a Cuda error 6 (also known as cudaErrorLaunchTimeout and CUDA_ERROR_LAUNCH_TIMEOUT) with this (simplified) code:
for(int i = 0; i < 650; ++i)
{
    int param = foo(i); // some CPU computation here, but no memory copy
    MyKernel<<<dimGrid, dimBlock>>>(&data, param);
}
The Cuda error 6 indicates that the kernel took too much time to return. The duration of a single MyKernel is only ~60 ms though. The block size is a classic 16×16.
Now, when I call cudaDeviceSynchronize() every, say, 50 iterations, the error doesn't occur:
for(int i = 0; i < 650; ++i)
{
    int param = foo(i); // some CPU computation here, but no memory copy
    MyKernel<<<dimGrid, dimBlock>>>(&data, param);
    if(i % 50 == 0) cudaDeviceSynchronize();
}
I would like to avoid this synchronization, because it slows the program down a lot.
Since kernel launches are asynchronous, I guess the error occurs because the watchdog measures the execution duration of a kernel from its asynchronous launch, and not from the actual beginning of its execution.
I am new to CUDA. Is it common for error 6 to occur in a case like this? Is there a way to avoid this error without hurting performance?
Thanks to talonmies and Robert Crovella (whose proposed solution didn't work for me), I've been able to find an acceptable workaround.
To prevent the CUDA driver from batching the kernel launches together, another operation must be performed before or after each kernel launch; e.g. a dummy copy does the trick:
void* dummy;
cudaMalloc(&dummy, 1);
for(int i = 0; i < 650; ++i)
{
    int param = foo(i); // some CPU computation here, but no memory copy
    cudaMemcpyAsync(dummy, dummy, 1, cudaMemcpyDeviceToDevice);
    MyKernel<<<dimGrid, dimBlock>>>(&data, param);
}
This solution is 8 seconds faster (50s to 42s) than the one that includes calls to cudaDeviceSynchronize() (see question).
Besides, it's more reliable, since the 50 in the synchronization approach was an arbitrary, device-specific period.
The watchdog isn't measuring execution time of kernels, per se. The watchdog is keeping track of requests in the command queue that goes to the GPU, and determining if any of them have not been acknowledged by the GPU within a timeout period.
As @talonmies indicated in the comments, my best guess is that (if you are certain that no kernel execution exceeds the timeout period) this behavior is due to the CUDA driver WDDM batching mechanism, which seeks to reduce average latency by batching GPU commands together and sending them to the GPU in batches.
You don't have direct control over the batching behavior, and so in general, trying to work around this without disabling or modifying the windows TDR mechanism will be an imprecise exercise.
The general (somewhat undocumented) suggestion for a low-cost "flush" of the command queue, which you might try experimenting with, is to use cudaEventQuery(0); (as suggested here) in place of cudaDeviceSynchronize();, perhaps every 50 kernel launches or so. To some degree the specifics may depend on the machine configuration, and the GPU in use.
I'm not sure how effective it will be in your case. I don't think that it can be advanced as a "guarantee" of avoiding a TDR event without a lot more experimentation. Your mileage may vary.
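In code, the experiment would look something like the loop below (reusing the names from the question; the interval of 50 is as arbitrary here as it was there):

for (int i = 0; i < 650; ++i)
{
    int param = foo(i);                  // some CPU computation here, but no memory copy
    MyKernel<<<dimGrid, dimBlock>>>(&data, param);
    if (i % 50 == 0) cudaEventQuery(0);  // cheap, non-blocking nudge to flush the WDDM command batch
}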

Increasing achieved occupancy doesn't enhance computation speed linearly

I had a CUDA program in which kernel registers were limiting the maximum theoretical achieved occupancy to 50%. So I decided to use shared memory instead of registers for those variables that were constant across block threads and almost read-only throughout the kernel run. I cannot provide source code here; what I did was conceptually like this:
My initial program:
__global__ void GPU_Kernel (...) {
    __shared__ int sharedData[N]; // N: the maximum amount that doesn't limit maximum occupancy
    int r_1 = A; // except for this first initialization, these registers don't change anymore
    int r_2 = B;
    ...
    int r_m = Y;
    ... // rest of kernel
}
I changed above program to:
__global__ void GPU_Kernel (...) {
    __shared__ int sharedData[N-m];
    __shared__ int r_1, r_2, ..., r_m;
    if ( threadIdx.x == 0 ) {
        r_1 = A;
        r_2 = B;
        ...
        r_m = Y; // last of them
    }
    __syncthreads();
    ... // rest of kernel
}
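For concreteness, a compilable sketch of this transformation (hypothetical names, with m fixed at 4) might look like:

__global__ void GPU_Kernel_shared(const int *params)   // A, B, ..., Y packed by the host
{
    __shared__ int sharedData[256];   // stands in for the N - m slots of the real kernel
    __shared__ int r[4];              // the m values that used to live in registers

    if (threadIdx.x == 0) {
        for (int j = 0; j < 4; j++)
            r[j] = params[j];         // a single thread initializes the per-block constants
    }
    __syncthreads();                  // all other threads wait before reading r[]

    // ... rest of kernel: all threads of a warp reading the same r[j] hit the
    // shared-memory broadcast path, so the reads are cheap ...
}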
Now the threads of each warp perform broadcast reads to access the newly created shared memory variables, and at the same time they no longer use enough registers to limit achieved occupancy.
The second program has a maximum theoretical occupancy of 100%. In actual runs, the average achieved occupancy for the first program was ~48% and for the second one it is around ~80%. But the issue is that the net speedup is only around 5% to 10%, much less than what I was anticipating given the improved occupancy. Why isn't this correlation linear?
Considering the image from the Nvidia whitepaper, what I've been thinking is that when achieved occupancy is 50%, for example, half of the SMX cores (in newer architectures) are idle at a time, because the excessive resources requested by the others keep them from being active. Is my understanding flawed? Or is it incomplete to explain the phenomenon above? Or is it the cost of the added __syncthreads() and the shared memory accesses?
Why isn't this correlation linear?
If you are already memory bandwidth bound or compute bound, and either one of those bounds is near the theoretical performance of the device, improving occupancy may not help much. Improving occupancy usually helps when neither of these is the limiter to performance for your code (i.e. you are not at or near peak memory bandwidth utilization or peak compute). Since you haven't provided any code or any metrics for your program, nobody can tell you why it didn't speed up more. The profiling tools can help you find the limiters to performance.
You might be interested in a couple webinars:
CUDA Optimization: Identifying Performance Limiters by Dr Paulius Micikevicius
CUDA Warps and Occupancy Considerations + Live with Dr Justin Luitjens, NVIDIA
In particular, review slide 10 from the second webinar.

driver.Context.synchronize() - what else to take into consideration - a clean-up operation failed

I have this code here (modified due to the answer).
Info:
32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 46 registers, 120 bytes cmem[0], 176 bytes cmem[2], 76 bytes cmem[16]
I don't know what else to take into consideration in order to make it work for different combinations of the point counts "numPointsRs" and "numPointsRp".
When, for example, I run the code with Rs=10000 and Rp=100000 with block=(128,1,1), grid=(200,1), it works fine.
My computations:
46 registers * 128 threads = 5888 registers.
My card has a limit of 32768 registers, so 32768 / 5888 = 5 and change => 5 blocks/SM (my card's limit is 6).
With the occupancy calculator I found that using 128 threads/block gives me 42%, and I am within the limits of my card.
Also, the number of threads per MP is 640 (the limit is 1536).
Now, if I try to use Rs=100000 and Rp=100000 (with the same threads and blocks), it gives me the message in the title, with:
cuEventDestroy failed: launch timeout
cuModuleUnload failed: launch timeout
1) I don't know/understand what else needs to be computed.
2) I can't understand how we choose/find the number of blocks. I can see that mostly people use (threads - 1 + points) / threads, but that still doesn't work.
--------------UPDATED----------------------------------------------
After using driver.Context.synchronize(), the code works for many points (1000000)!
But what impact does this addition have on the code? (For many points the screen freezes for 1 minute or more.) Should I use it or not?
--------------UPDATED2----------------------------------------------
Now the code doesn't work again, without my having changed anything!
Snapshot of code:
import pycuda.gpuarray as gpuarray
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy as np
import cmath
import pycuda.driver as drv
import pycuda.tools as t
#---- Initialization and passing(allocate memory and transfer data) to GPU -------------------------
Rs_gpu=gpuarray.to_gpu(Rs)
Rp_gpu=gpuarray.to_gpu(Rp)
J_gpu=gpuarray.to_gpu(np.ones((numPointsRs,3)).astype(np.complex64))
M_gpu=gpuarray.to_gpu(np.ones((numPointsRs,3)).astype(np.complex64))
Evec_gpu=gpuarray.to_gpu(np.zeros((numPointsRp,3)).astype(np.complex64))
Hvec_gpu=gpuarray.to_gpu(np.zeros((numPointsRp,3)).astype(np.complex64))
All_gpu=gpuarray.to_gpu(np.ones(numPointsRp).astype(np.complex64))
#-----------------------------------------------------------------------------------
mod = SourceModule("""
#include <pycuda-complex.hpp>
#include <cmath>
#include <vector>

typedef pycuda::complex<float> cmplx;
typedef float fp3[3];
typedef cmplx cp3[3];

__device__ __constant__ float Pi;

extern "C"{

__device__ void computeEvec(fp3 Rs_mat[], int numPointsRs,
                            cp3 J[],
                            cp3 M[],
                            fp3 Rp,
                            cmplx kp,
                            cmplx eta,
                            cmplx *Evec,
                            cmplx *Hvec, cmplx *All)
{
    while (c < numPointsRs){
        ...
        c++;
    }
}

__global__ void computeEHfields(float *Rs_mat_, int numPointsRs,
                                float *Rp_mat_, int numPointsRp,
                                cmplx *J_,
                                cmplx *M_,
                                cmplx kp,
                                cmplx eta,
                                cmplx E[][3],
                                cmplx H[][3], cmplx *All )
{
    fp3 * Rs_mat = (fp3 *)Rs_mat_;
    fp3 * Rp_mat = (fp3 *)Rp_mat_;
    cp3 * J = (cp3 *)J_;
    cp3 * M = (cp3 *)M_;

    int k = threadIdx.x + blockIdx.x * blockDim.x;
    while (k < numPointsRp)
    {
        computeEvec( Rs_mat, numPointsRs, J, M, Rp_mat[k], kp, eta, E[k], H[k], All );
        k += blockDim.x * gridDim.x;
    }
}

}
""", no_extern_c=1, options=['--ptxas-options=-v'])
#call the function(kernel)
func = mod.get_function("computeEHfields")
func(Rs_gpu,np.int32(numPointsRs),Rp_gpu,np.int32(numPointsRp),J_gpu, M_gpu, np.complex64(kp), np.complex64(eta),Evec_gpu,Hvec_gpu, All_gpu, block=(128,1,1),grid=(200,1))
#----- get data back from GPU-----
Rs=Rs_gpu.get()
Rp=Rp_gpu.get()
J=J_gpu.get()
M=M_gpu.get()
Evec=Evec_gpu.get()
Hvec=Hvec_gpu.get()
All=All_gpu.get()
My card:
Device 0: "GeForce GTX 560"
CUDA Driver Version / Runtime Version 4.20 / 4.10
CUDA Capability Major/Minor version number: 2.1
Total amount of global memory: 1024 MBytes (1073283072 bytes)
( 0) Multiprocessors x (48) CUDA Cores/MP: 0 CUDA Cores //CUDA Cores 336 => 7 MP and 48 Cores/MP
There are quite a few issues that you have to deal with. The answer provided by @njuffa is the best general solution. I'll provide more feedback based upon the limited data you have provided.
PTX output of 46 registers is not the number of registers used by your kernel. PTX is an intermediate representation. The offline or JIT compiler will convert this to device code. Device code may use more or less registers. Nsight Visual Studio Edition, the Visual Profiler, and the CUDA command line profiler can all provide you the correct register count.
The occupancy calculation is not simply RegistersPerSM / RegistersPerThread. Registers are allocated based upon a granularity. For CC 2.1 the granularity is 4 registers per thread per warp (128 registers). 2.x devices can actually allocate at a 2 register granularity but this can lead to fragmentation later in the kernel.
In your occupancy calculation you state
My card has limit 32768registers,so 32768/5888=5 +some => 5 block/SM
(my card has limit 6).
I'm not sure what 6 means. Your device has 7 SMs. The maximum blocks per SM for 2.x devices is 8 blocks per SM.
You have provided an insufficient amount of code. If you provide pieces of code please provide the size of all inputs, the number of times each loop will be executed, and a description of the operations per function. Looking at the code you may be doing too many loops per thread. Without knowing the order of magnitude of the outer loop we can only guess.
Given that the launch is timing out you should probably approach debugging as follows:
a. Add a line to the beginning of the code
if (blockIdx.x > 0) { return; }
Run the exact code you have in one of the previously mentioned profilers to estimate the duration of a single block. Using the launch information provided by the profiler (registers per thread, shared memory, ...), use the occupancy calculator in the profiler or the xls to determine the maximum number of blocks that you can run concurrently. For example, if the theoretical block occupancy is 3 blocks per SM, and the number of SMs is 7, then you can run 21 blocks at a time, which for your launch is 9 waves. NOTE: this assumes equal work per thread. Change the early-exit code to allow 1 wave (21 blocks). If this launch times out then you need to reduce the amount of work per thread. If this passes then calculate how many waves you have and estimate when you will timeout (2 sec on Windows, ? on Linux).
b. If you have too many waves then you have to reduce the launch configuration. Given that you index by gridDim.x and blockDim.x, you can do this by passing in these dimensions as parameters to your kernel. This will require you to minimally change your indexing code. You will also have to pass a blockIdx.x offset. Change your host code to launch multiple kernels back to back. Since there should be no conflict, you can launch these in multiple streams to benefit from overlap at the end of each wave.
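In CUDA C terms (an illustrative sketch of the idea, not your PyCUDA code; all names here are hypothetical), splitting one oversized launch into several smaller ones might look like this:

#include <cuda_runtime.h>

// The kernel takes an explicit block offset so that several smaller launches
// together cover the same index range as one big (watchdog-triggering) launch.
__global__ void computeChunk(float *out, int numPoints, int blockOffset)
{
    int k = threadIdx.x + (blockIdx.x + blockOffset) * blockDim.x;
    if (k < numPoints) {
        // ... the per-point work that previously lived in the single big launch ...
        out[k] = 0.0f;
    }
}

void launchInWaves(float *out, int numPoints, int threads, int blocksPerWave)
{
    int totalBlocks = (numPoints + threads - 1) / threads;
    for (int offset = 0; offset < totalBlocks; offset += blocksPerWave) {
        int blocks = totalBlocks - offset;
        if (blocks > blocksPerWave) blocks = blocksPerWave;
        computeChunk<<<blocks, threads>>>(out, numPoints, offset);
        // Keep each wave comfortably under the watchdog limit; streams could be
        // used instead of a synchronization here if overlap is desired.
        cudaDeviceSynchronize();
    }
}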
"launch timeout" would appear to indicate that the kernel ran too long and was killed by the watchdog timer. This can happen on GPUs that are also used for graphics output (e.g. a graphical desktop), where the task of the watchdog timer is to prevent the desktop from locking up for more than a few seconds. Best I can recall the watchdog time limit is on the order of 5 seconds or thereabouts.
At any given moment, the GPU can either run graphics, or CUDA, so the watchdog timer is needed when running a GUI to prevent the GUI from locking up for an extended period of time, which renders the machine inoperable through the GUI.
If possible, avoid using this GPU for the desktop and/or other graphics (e.g. don't run X if you are on Linux). If running without graphics isn't an option, to reduce kernel execution time to avoid hitting watchdog timer kernel termination, you will have to do less work per kernel launch, optimize the code so the kernel runs faster for the same amount of work, or deploy a faster GPU.
To provide more input on @njuffa's answer: on Windows systems you can increase the launch timeout, or TDR (Timeout Detection & Recovery), by following these steps:
1: Open the options in Nsight Monitor.
2: Set an appropriate value for WDDM TDR Delay
CAUTION: If this value is too small you may get timeout errors, and for higher values your screen will stay frozen until the kernel finishes its job.