CUDA - Making Sense of Profiler Results: Shared Memory Atomics

I have an application using CUDA where the core computation is essentially an overly complicated histogram. As such I looked at using shared memory atomics as described here. I had some initial bugs with the shared memory version so I wrote a global memory version as well, then was later able to fix the shared memory version and benchmark them. The rough flow is:
Global version:
binIndex = BinningCalculations(dataPoint)
atomicAdd_block(globalBins + binIndex, dataPoint)
Shared version:
init shared memory to 0
sync block
binIndex = BinningCalculations(dataPoint)
atomicAdd_block(sharedBins + binIndex, dataPoint)
sync block
add shared memory bins to global memory
In both of these, a single thread is responsible for calculating a single datapoint. The datapoints are ints.
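To make the shapes concrete, here is a rough sketch of the two variants (the kernel names, bin count, and binning function are placeholders rather than my actual code, and I've written plain atomicAdd for illustration; 4096 int bins happens to match the 16 KB of shared memory shown in the table below):
#define NUM_BINS 4096   // placeholder; 4096 ints = 16 KB of shared memory

__device__ int BinningCalculations(int x) { return x & (NUM_BINS - 1); } // placeholder binning

__global__ void histGlobal(const int* data, int n, int* globalBins)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        int binIndex = BinningCalculations(data[tid]);
        atomicAdd(&globalBins[binIndex], data[tid]);     // global-memory atomic
    }
}

__global__ void histShared(const int* data, int n, int* globalBins)
{
    __shared__ int sharedBins[NUM_BINS];

    // init shared memory to 0
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        sharedBins[i] = 0;
    __syncthreads();

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        int binIndex = BinningCalculations(data[tid]);
        atomicAdd(&sharedBins[binIndex], data[tid]);     // shared-memory atomic
    }
    __syncthreads();

    // add shared memory bins to global memory
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        atomicAdd(&globalBins[i], sharedBins[i]);
}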
The results are as follows (all metrics from Nsight Compute except the graph results):
Value (avg)                           Global    Shared
Kernel Duration                       7.3 us    10.6 us
Registers/thread                      20        24
Shared Memory Usage                   0         16 KB
Occupancy                             50%       60%
Compute Throughput                    5.5%      10.2%
Memory Throughput                     35%       25%
Graph Completion Time - cuda event    ~115 us   ~115 us
Graph Completion Time - host timing   ~180 us   ~165 us
I know that shared memory atomics are faster than global memory atomics (so I've been told by the people who know these things), but the results I'm seeing here are puzzling to me. The global memory kernel shows a shorter kernel duration, yet the total graph completion with the shared memory kernel is faster. This could be because of the higher occupancy, but does the extra 10% occupancy really equate to being 40% faster? If the extra work of initializing shared memory, syncing, and copying the results out to global memory extends the kernel duration, where does the shared-atomic speed advantage show up?
The global memory version uses less shared memory (none at all) and fewer registers, but somehow has lower average occupancy.
Both of these seem counterintuitive to me. Any explanation or pointers to helpful resources on profiling/optimization would be appreciated.
I can't share any significant amount of source code but I can supply more metrics if needed.
edit: Adding some code and new results
auto err = cudaEventRecord(timerStart, primaryStream);
cudaErrCheck(err, "Ingest Begin Event Failed");

std::string msg = "Global (us)";
if (useShared) msg = "Shared (us)";

{
    auto timer = raiTimer("host timer " + msg);
    if (cOpt_TIMING && benchmarkPtr) timer.setRecorder(benchmarkPtr);

    err = cudaGraphLaunch(ingestGraph, primaryStream);
    cudaErrCheck(err, "Graph Launch Fail");

    err = cudaEventRecord(timerEnd, primaryStream);
    cudaErrCheck(err, "Ingest End Event Failed");

    if (cOpt_TIMING && benchmarkPtr)
    {
        float time = 0;
        cudaEventSynchronize(timerEnd);
        err = cudaEventElapsedTime(&time, timerStart, timerEnd);
        cudaErrCheck(err, "Device Ingest Timing Fail");
        benchmarkPtr->inputTimePoint("cuda timer " + msg, time * 1000.0f);
    }
}
This section is where I actually run the timers. I have two different timers set up:
the 'raiTimer' takes a high-resolution system clock timestamp with std::chrono and automatically sends its time data to the benchmark object when it is destroyed (a rough sketch of the idea follows below). It covers a few lines that the cuda version doesn't, but the trend is the important part.
the cudaEvent timing is what it looks like - it times how long it takes for the memory copy and three kernels to complete in the primaryStream.
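For reference, the 'raiTimer' is conceptually along these lines (a minimal sketch of the idea, not the real class, which reports to the benchmark object instead of printing):
#include <chrono>
#include <cstdio>
#include <string>

struct raiTimerSketch {                       // hypothetical stand-in for raiTimer
    std::string label;
    std::chrono::high_resolution_clock::time_point start;

    explicit raiTimerSketch(std::string l)
        : label(std::move(l)), start(std::chrono::high_resolution_clock::now()) {}

    ~raiTimerSketch() {                       // reports when it goes out of scope
        auto end = std::chrono::high_resolution_clock::now();
        auto us  = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
        std::printf("%s: %lld us\n", label.c_str(), static_cast<long long>(us));
    }
};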
The results from this were interesting. I increased the sample size to 10 runs from the 4-5 I was doing before, which means each kernel, shared and global, runs a total of 30 times as far as the profiler sees. I took an average of the 10 graph completion times for each:
cuda timer Global (us) 114.97 08/17/22 17:05:52
host timer Global (us) 182.84 08/17/22 17:05:52
cuda timer Shared (us) 111.984 08/17/22 17:05:52
host timer Shared (us) 167.95 08/17/22 17:05:5
So my earlier assertion that shared was running faster in total looks to be in question. According to the cuda timer they are basically neck and neck - typically within a few microseconds of each other every time, with no definitive advantage either way. The host timer has the shared version fairly consistently ~15 us faster than the global version. The host timer takes the same path regardless of which kernel version is running, so I'm not sure why the host would see a difference when it also has to wait through the event sync.
So the first question is now more about this timing oddity. The occupancy question remains.

Related

Why does this kernel not achieve peak IPC on a GK210?

I decided that it would be educational for me to try to write a CUDA kernel that achieves peak IPC, so I came up with this kernel (host code omitted for brevity but is available here)
#define WORK_PER_THREAD 4

__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    i *= WORK_PER_THREAD;
    if (i < n)
    {
        #pragma unroll
        for (int j = 0; j < WORK_PER_THREAD; j++)
            y[i+j] = a * x[i+j] + y[i+j];
    }
}
I ran this kernel on a GK210, with n=32*1000000 elements, and expected to see an IPC of close to 4, but ended up with a lousy IPC of 0.186
ubuntu@ip-172-31-60-181:~/ipc_example$ nvcc saxpy.cu
ubuntu@ip-172-31-60-181:~/ipc_example$ sudo nvprof --metrics achieved_occupancy --metrics ipc ./a.out
==5828== NVPROF is profiling process 5828, command: ./a.out
==5828== Warning: Auto boost enabled on device 0. Profiling results may be inconsistent.
==5828== Profiling application: ./a.out
==5828== Profiling result:
==5828== Metric result:
Invocations   Metric Name          Metric Description   Min        Max        Avg
Device "Tesla K80 (0)"
    Kernel: saxpy_parallel(int, float, float*, float*)
          1   achieved_occupancy   Achieved Occupancy   0.879410   0.879410   0.879410
          1   ipc                  Executed IPC         0.186352   0.186352   0.186352
I was even more confused when I set WORK_PER_THREAD=16, resulting in fewer threads launched but 16 (as opposed to 4) independent instructions for each to execute: the IPC dropped to 0.01.
My two questions are:
What is the peak IPC I can expect on a GK210? I think it is 8 = 4 warp schedulers * 2 instruction dispatches per cycle, but I want to be sure.
Why does this kernel achieve such low IPC while achieved occupancy is high, why does IPC decrease as WORK_PER_THREAD increases, and how can I improve the IPC of this kernel?
What is the peak IPC I can expect on a GK210?
The peak IPC per SM is equal to the number of warp schedulers in an SM times the issue rate of each warp scheduler. This information can be found in the whitepaper for a particular GPU. The GK210 whitepaper is here. From that document (e.g. SM diagram on p8) we see that each SM has 4 warp schedulers capable of dual issue. Therefore the peak theoretically achievable IPC is 8 instructions per clock per SM. (however as a practical matter even for well-crafted codes, you're unlikely to see higher than 6 or 7).
Why does this kernel achieve such low IPC while achieved occupancy is high, why does IPC decrease as WORK_PER_THREAD increases, and how can I improve the IPC of this kernel?
Your kernel requires global transactions at nearly every operation. Global loads and even L2 cache loads have latency. When everything you do is dependent on those, there is no way to avoid the latency, so your warps are frequently stalled. The peak observable IPC per SM on a GK210 is somewhere in the vicinity of 6, but you won't get that with continuous load and store operations. Your kernel does 2 loads, and one store (12 bytes total moved), for each multiply/add. You won't be able to improve it. (Your kernel has high occupancy because the SMs are loaded up with warps, but low IPC because those warps are frequently stalled, unable to issue an instruction, waiting for latency of load operations to expire.) You'll need to find other useful work to do.
What might that be? Well if you do a matrix multiply operation, which has considerable data reuse and a relatively low number of bytes per math op, you're likely to see better measurements.
What about your code? Sometimes the work you need to do is like this. We'd call that a memory-bound code. For a kernel like this, the figure of merit to use for judging "goodness" is not IPC but achieved bandwidth. If your kernel requires a particular number of bytes loaded and stored to perform its work, then if we compare the kernel duration to just the memory transactions, we can get a measure of goodness. Stated another way, for a pure memory bound code (i.e. your kernel) we would judge goodness by measuring the total number of bytes loaded and stored (profiler has metrics for this, or for a simple code you can compute it directly by inspection), and divide that by the kernel duration. This gives the achieved bandwidth. Then, we compare that to the achievable bandwidth based on a proxy measurement. A possible proxy measurement tool for this is bandwidthTest CUDA sample code.
As the ratio of these two bandwidths approaches 1.0, your kernel is doing "well", given the memory bound work it is trying to do.
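As a concrete sketch of that arithmetic for the saxpy kernel above (the kernel duration here is an assumed placeholder, not a measured value):
// Back-of-the-envelope achieved-bandwidth check for the saxpy kernel
// (kernel_ms would come from the profiler or cudaEvent timing).
#include <cstdio>

int main()
{
    const long long n = 32LL * 1000000;           // elements, as in the question
    const double bytes = n * 3.0 * sizeof(float); // 2 loads + 1 store = 12 bytes per element
    const double kernel_ms = 10.0;                // assumed measured kernel duration

    double achievedGBs = bytes / (kernel_ms * 1e-3) / 1e9;
    // Compare against the device-to-device figure reported by the
    // bandwidthTest CUDA sample to judge how close the kernel is to peak.
    std::printf("achieved bandwidth: %.1f GB/s\n", achievedGBs);
    return 0;
}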

Understanding CUDA profiler output (nvprof)

I'm just looking at the following output and trying to wrap my mind around the numbers:
==2906== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 23.04%  10.9573s     16436  666.67us  64.996us  1.5927ms  sgemm_sm35_ldg_tn_32x16x64x8x16
 22.28%  10.5968s     14088  752.18us  612.13us  1.6235ms  sgemm_sm_heavy_nt_ldg
 18.09%  8.60573s     14088  610.86us  513.05us  1.2504ms  sgemm_sm35_ldg_nn_128x8x128x16x16
 16.48%  7.84050s     68092  115.15us  1.8240us  503.00us  void axpy_kernel_val<float, int=0>(cublasAxpyParamsVal<float>)
...
  0.25%  117.53ms      4744  24.773us     896ns  11.803ms  [CUDA memcpy HtoD]
  0.23%  107.32ms     37582  2.8550us  1.8880us  8.9556ms  [CUDA memcpy DtoH]
...
==2906== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 83.47%  41.8256s     42326  988.18us  16.923us  13.332ms  cudaMemcpy
  9.27%  4.64747s    326372  14.239us  10.846us  11.601ms  cudaLaunch
  1.49%  745.12ms   1502720     495ns     379ns  1.7092ms  cudaSetupArgument
  1.37%  688.09ms      4702  146.34us     879ns  615.09ms  cudaFree
...
When it comes to optimizing memory access, what are the numbers I really need to look at when comparing different implementations? At first glance it looks like the memcpys only take 117.53 + 107.32 ms (in both directions), but then there is the API call cudaMemcpy at 41.8256 s, which is much more. Also, the min/avg/max columns don't add up between the upper and the lower output block.
Why is there a difference and what is the "true" number that is important for me to optimize the memory transfer?
EDIT: second question is: is there a way to figure out who is calling e.g. axpy_kernel_val (and how many times)?
The difference in total time is due to the fact that work is launched to the GPU asynchronously. If you have a long running kernel or set of kernels with no explicit synchronisation to the host, and follow them with a call to cudaMemcpy, the cudaMemcpy call will be launched well before the kernel(s) have finished executing. The total time of the API call is from the moment it is launched to the moment it completes, so will overlap with executing kernels. You can see this very clearly if you run the output through the NVIDIA Visual Profiler (nvprof -o xxx ./myApp, then import xxx into nvvp).
The difference in min time is due to launch overhead. While the API profiling takes all of the launch overhead into account, the kernel timing only contains a small part of it. Launch overhead can be ~10-20us, as you can see here.
In general, the API calls section lets you know what the CPU is doing, while the profiling results tell you what the GPU is doing. In this case, I'd argue you're underusing the CPU, as arguably the cudaMemcpy is launched too early and CPU cycles are wasted. In practice, however, it's often hard or impossible to get anything useful out of these spare cycles.
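To illustrate the overlap with a toy example (placeholder kernel and sizes; uncommenting the synchronization is what would separate kernel time from copy time for an API-level timer):
// Sketch of why the cudaMemcpy API time can swallow kernel execution time.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void busyKernel(float* x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int k = 0; k < 1000; ++k) x[i] = x[i] * 1.0000001f + 1.0f;
}

int main()
{
    const int n = 1 << 20;
    float *d_x, *h_x = new float[n]();
    cudaMalloc(&d_x, n * sizeof(float));

    busyKernel<<<(n + 255) / 256, 256>>>(d_x, n);   // returns immediately (asynchronous)
    // Without a synchronization here, the following cudaMemcpy blocks until the
    // kernel finishes *and* the copy completes, so an API-level timer charges
    // the kernel's runtime to cudaMemcpy.
    // cudaDeviceSynchronize();
    cudaMemcpy(h_x, d_x, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_x);
    delete[] h_x;
    std::printf("done\n");
    return 0;
}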

How to avoid Cuda error 6 (Launch Timeout) with consecutive asynchronous kernel launches?

I get a Cuda error 6 (also known as cudaErrorLaunchTimeout and CUDA_ERROR_LAUNCH_TIMEOUT) with this (simplified) code:
for (int i = 0; i < 650; ++i)
{
    int param = foo(i); // some CPU computation here, but no memory copy
    MyKernel<<<dimGrid, dimBlock>>>(&data, param);
}
The Cuda error 6 indicates that the kernel took too much time to return. The duration of a single MyKernel is only ~60 ms though. The block size is a classic 16×16.
Now, when I call cudaDeviceSynchronize() every, say, 50 iterations, the error doesn't occur:
for (int i = 0; i < 650; ++i)
{
    int param = foo(i); // some CPU computation here, but no memory copy
    MyKernel<<<dimGrid, dimBlock>>>(&data, param);
    if (i % 50 == 0) cudaDeviceSynchronize();
}
I would like to avoid this synchronization, because it slows the program down a lot.
Since kernel launches are asynchronous, I guess the error occurs because the watchdog measures the execution duration of a kernel from its asynchronous launch, and not from the actual beginning of its execution.
I am new to Cuda. Is this a common case for the error 6 to occur? Is there a way to avoid this error without altering the performance?
Thanks to talonmies and Robert Crovella (whose proposed solution didn't work for me), I've been able to find an acceptable workaround.
To prevent the CUDA driver from batching the kernel launches together, another operation must be performed before or after each kernel launch. E.g. a dummy copy does the trick:
void* dummy;
cudaMalloc(&dummy, 1);

for (int i = 0; i < 650; ++i)
{
    int param = foo(i); // some CPU computation here, but no memory copy
    cudaMemcpyAsync(dummy, dummy, 1, cudaMemcpyDeviceToDevice);
    MyKernel<<<dimGrid, dimBlock>>>(&data, param);
}
This solution is 8 seconds faster (50s to 42s) than the one that includes calls to cudaDeviceSynchronize() (see question).
Besides, it's more reliable, 50 being an arbitrary, device-specific period.
The watchdog isn't measuring execution time of kernels, per se. The watchdog is keeping track of requests in the command queue that goes to the GPU, and determining if any of them have not been acknowledged by the GPU within a timeout period.
As @talonmies indicated in the comments, my best guess is that (if you are certain that no kernel execution exceeds the timeout period) this behavior is due to the CUDA driver WDDM batching mechanism, which seeks to reduce average latency by batching GPU commands together and sending them to the GPU in batches.
You don't have direct control over the batching behavior, and so in general, trying to work around this without disabling or modifying the windows TDR mechanism will be an imprecise exercise.
The general (somewhat undocumented) suggestion for a low-cost "flush" of the command queue, which you might try experimenting with, is to use cudaEventQuery(0); (as suggested here) in place of cudaDeviceSynchronize();, perhaps every 50 kernel launches or so. To some degree the specifics may depend on the machine configuration, and the GPU in use.
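Concretely, applied to the loop from the question, that would look something like this (the kernel, foo(), and launch configuration are placeholder stand-ins):
#include <cuda_runtime.h>

__global__ void MyKernel(float* data, int param)     // stand-in for the real kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < 1024) atomicAdd(&data[i], (float)param);  // placeholder work
}

int foo(int i) { return i * 2; }                      // placeholder CPU computation

int main()
{
    float* data;
    cudaMalloc(&data, 1024 * sizeof(float));
    dim3 dimBlock(16, 16);                            // the "classic 16x16" from the question
    dim3 dimGrid(200, 1);

    for (int i = 0; i < 650; ++i)
    {
        int param = foo(i);
        MyKernel<<<dimGrid, dimBlock>>>(data, param);
        // Low-cost flush: nudges the driver to submit the queued launches
        // without fully stalling the CPU like cudaDeviceSynchronize() would.
        if (i % 50 == 0) cudaEventQuery(0);
    }

    cudaDeviceSynchronize();
    cudaFree(data);
    return 0;
}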
I'm not sure how effective it will be in your case. I don't think that it can be advanced as a "guarantee" of avoiding a TDR event without a lot more experimentation. Your mileage may vary.

wildly varying performance of cuMemAlloc/cuMemFree

In my application cuMemAlloc/cuMemFree seem awfully slow most of the time. However, I found that they are sometimes 10 times faster than usual. The test program below finishes in about 0.4s on two machines, both with cuda 5.5 but one with a compute capability 2.0 card, the other with a 3.5 one.
If the cublas initialization is removed then it takes about 5s. With the cublas initialization in, but allocating a different number of bytes such as 4000, it slows down by about the same amount. Needless to say, I'm puzzled by this.
What can be causing this? If it's not a bug in my code, what kind of workaround do I have? The only thing I could think of is preallocating an arena and implementing my own allocator.
#include <stdio.h>
#include <cuda.h>
#include <cublas_v2.h>

#define cudaCheck(ans) { gpuAssert((ans), __FILE__, __LINE__); }

inline void gpuAssert(CUresult code, const char *file, int line)
{
    if (code != CUDA_SUCCESS) {
        fprintf(stderr, "GPUassert: %d %s %d\n", code, file, line);
        exit(1);
    }
}

int main(int argc, char *argv[])
{
    CUcontext context;
    CUdevice device;
    int devCount;

    cudaCheck(cuInit(0));
    cudaCheck(cuDeviceGetCount(&devCount));
    cudaCheck(cuDeviceGet(&device, 0));
    cudaCheck(cuCtxCreate(&context, 0, device));

    cublasStatus_t stat;
    cublasHandle_t handle;
    stat = cublasCreate(&handle);
    if (stat != CUBLAS_STATUS_SUCCESS) {
        printf("CUBLAS initialization failed\n");
        exit(1);
    }

    {
        int i;
        for (i = 0; i < 30000; i++) {
            CUdeviceptr devBufferA;
            cudaCheck(cuMemAlloc(&devBufferA, 8000));
            cudaCheck(cuMemFree(devBufferA));
        }
    }
    return 0;
}
I took your code and profiled it on a linux 64 bit system with the 319.21 driver and CUDA 5.5 and a non-display compute 3.0 device. My first observation is that the run time is about 0.5s, which seems much faster than you are reporting. If I analyse the nvprof output, I get these histograms:
cuMemFree
Time (us)        Frequency
3.65190000e+00   2.96670000e+04
4.59380000e+00   2.76000000e+02
5.53570000e+00   3.20000000e+01
6.47760000e+00   1.00000000e+00
7.41950000e+00   1.00000000e+00
8.36140000e+00   6.00000000e+00
9.30330000e+00   0.00000000e+00
1.02452000e+01   1.00000000e+00
1.11871000e+01   2.00000000e+00
1.21290000e+01   1.40000000e+01

cuMemAlloc
Time (us)        Frequency
3.53840000e+00   2.98690000e+04
4.50580000e+00   8.60000000e+01
5.47320000e+00   2.00000000e+01
6.44060000e+00   0.00000000e+00
7.40800000e+00   0.00000000e+00
8.37540000e+00   6.00000000e+00
9.34280000e+00   0.00000000e+00
1.03102000e+01   0.00000000e+00
1.12776000e+01   1.20000000e+01
1.22450000e+01   5.00000000e+00
which tells me that 99.6% of cuMemAlloc calls take less than 3.5384 microseconds, and 98.9% of cuMemFree calls take less than 3.6519 microseconds. No free or allocate operation took more than 12.25 microseconds.
So my conclusions based on these results are
Both cuMemFree and cuMemAlloc are extremely fast, with every one of the 60000 total calls to those APIs in your example taking less than 12.25 microseconds
The median call time for both APIs is 2.7 microseconds, with a standard deviation of 0.25 microseconds, suggesting that there is very little variability in the API latency as well
Very occasionally (about 0.01% of the time), both APIs can be around six times slower than this median. This is probably due to operating system level resource contention
Every single one of the above points completely contradicts every assertion you have made in your question.
Given how different your results apparently are, I can only guess that you are running on a known high latency platform like WDDM Windows, and that driver batching and WDDM subsystem latency are completely dominating the performance of the code. In that case, it would seem that the simplest workaround is to change platforms...
The CUDA memory manager is known to be slow. I've seen mention that it is "two orders of magnitude" slower than host malloc() and free(). This information may be dated, but there are some graphs here:
http://www.cs.virginia.edu/~mwb7w/cuda_support/memory_management_overhead.html
I think this is because the CUDA memory manager is optimized for handling a small number of memory allocations at the cost of slowing down when there is a large number of allocations. And that this is because, in general, it is not efficient to handle many small buffers in a kernel.
There are two main issues with dealing with many buffers in a kernel:
1) It implies passing a table of pointers to the kernel. If there is a pointer for each thread, you incur an initial cost of loading the pointer from a table in global memory before you can start working with the memory. Following a series of pointers is sometimes called "pointer chasing" and it is especially expensive on a GPU because memory access is relatively more expensive.
2) More importantly, a pointer for each thread implies a non-coalesced memory access pattern. On current architectures, if each thread in a warp loads a 32-bit value from global memory that is more than 128 bytes away from the others, 32 memory transactions are required to serve the warp. Each transaction will load 128 bytes and then discard 124 bytes. If all threads in a warp load values from the same naturally aligned 128 byte area, all the loads are served by a single memory transaction. So, in a memory bound kernel, memory throughput may be only 1/32 of potential.
The most efficient way to handle memory with CUDA is often to allocate a few large chunks and index into them in the kernel.
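A minimal sketch of that pattern with the driver API (the sizes mirror the question; the fixed-size slicing and the omitted error checking are just for illustration):
#include <cuda.h>

int main()
{
    CUcontext ctx; CUdevice dev;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    const size_t sliceBytes = 8000;
    const int numSlices = 30000;

    CUdeviceptr pool;
    cuMemAlloc(&pool, sliceBytes * numSlices);   // single up-front allocation (~240 MB)

    for (int i = 0; i < numSlices; ++i) {
        // Hand out a slice by pointer arithmetic instead of calling cuMemAlloc per buffer.
        CUdeviceptr buf = pool + (CUdeviceptr)(i * sliceBytes);
        (void)buf;  // use buf as you would a freshly allocated buffer
    }

    cuMemFree(pool);                             // single free at the end
    cuCtxDestroy(ctx);
    return 0;
}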

driver.Context.synchronize() - what else to take into consideration - "a clean-up operation failed"

I have this code here (modified due to the answer).
Info:
32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 46 registers, 120 bytes cmem[0], 176 bytes cmem[2], 76 bytes cmem[16]
I don't know what else to take into consideration in order to make it work for different combinations of the point counts "numPointsRs" and "numPointsRp".
When, for example, I run the code with Rs=10000 and Rp=100000 with block=(128,1,1), grid=(200,1), it's fine.
My computations:
46 registers * 128 threads = 5888 registers.
My card has a limit of 32768 registers, so 32768/5888 = 5 + some => 5 blocks/SM
(my card has a limit of 6).
With the occupancy calculator I found that using 128 threads/block
gives me 42%, and I am within the limits of my card.
Also, the number of threads per MP is 640 (the limit is 1536).
Now, if I try to use Rs=100000 and Rp=100000 (for the same threads and blocks) it gives me the message in the title, with:
cuEventDestroy failed: launch timeout
cuModuleUnload failed: launch timeout
1) I don't know/understand what else needs to be computed.
2) I can't understand how we use/find the number of blocks. I can see
that mostly someone puts (threads-1+points)/threads, but that still
doesn't work.
--------------UPDATED----------------------------------------------
After using driver.Context.synchronize(), the code works for many points (1000000)!
But what impact does this addition have on the code? (For many points the screen freezes for 1 minute or more.) Should I use it or not?
--------------UPDATED2----------------------------------------------
Now the code doesn't work again, without my changing anything!
Snapshot of code:
import pycuda.gpuarray as gpuarray
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy as np
import cmath
import pycuda.driver as drv
import pycuda.tools as t
#---- Initialization and passing(allocate memory and transfer data) to GPU -------------------------
Rs_gpu=gpuarray.to_gpu(Rs)
Rp_gpu=gpuarray.to_gpu(Rp)
J_gpu=gpuarray.to_gpu(np.ones((numPointsRs,3)).astype(np.complex64))
M_gpu=gpuarray.to_gpu(np.ones((numPointsRs,3)).astype(np.complex64))
Evec_gpu=gpuarray.to_gpu(np.zeros((numPointsRp,3)).astype(np.complex64))
Hvec_gpu=gpuarray.to_gpu(np.zeros((numPointsRp,3)).astype(np.complex64))
All_gpu=gpuarray.to_gpu(np.ones(numPointsRp).astype(np.complex64))
#-----------------------------------------------------------------------------------
mod = SourceModule("""
#include <pycuda-complex.hpp>
#include <cmath>
#include <vector>

typedef pycuda::complex<float> cmplx;
typedef float fp3[3];
typedef cmplx cp3[3];

__device__ __constant__ float Pi;

extern "C"{

__device__ void computeEvec(fp3 Rs_mat[], int numPointsRs,
                            cp3 J[],
                            cp3 M[],
                            fp3 Rp,
                            cmplx kp,
                            cmplx eta,
                            cmplx *Evec,
                            cmplx *Hvec, cmplx *All)
{
    while (c < numPointsRs){
        ...
        c++;
    }
}

__global__ void computeEHfields(float *Rs_mat_, int numPointsRs,
                                float *Rp_mat_, int numPointsRp,
                                cmplx *J_,
                                cmplx *M_,
                                cmplx kp,
                                cmplx eta,
                                cmplx E[][3],
                                cmplx H[][3], cmplx *All )
{
    fp3 *Rs_mat = (fp3 *)Rs_mat_;
    fp3 *Rp_mat = (fp3 *)Rp_mat_;
    cp3 *J = (cp3 *)J_;
    cp3 *M = (cp3 *)M_;

    int k = threadIdx.x + blockIdx.x*blockDim.x;
    while (k < numPointsRp)
    {
        computeEvec( Rs_mat, numPointsRs, J, M, Rp_mat[k], kp, eta, E[k], H[k], All );
        k += blockDim.x*gridDim.x;
    }
}

}
""", no_extern_c=1, options=['--ptxas-options=-v'])

#call the function (kernel)
func = mod.get_function("computeEHfields")
func(Rs_gpu, np.int32(numPointsRs), Rp_gpu, np.int32(numPointsRp), J_gpu, M_gpu,
     np.complex64(kp), np.complex64(eta), Evec_gpu, Hvec_gpu, All_gpu,
     block=(128,1,1), grid=(200,1))
#----- get data back from GPU-----
Rs=Rs_gpu.get()
Rp=Rp_gpu.get()
J=J_gpu.get()
M=M_gpu.get()
Evec=Evec_gpu.get()
Hvec=Hvec_gpu.get()
All=All_gpu.get()
My card:
Device 0: "GeForce GTX 560"
CUDA Driver Version / Runtime Version 4.20 / 4.10
CUDA Capability Major/Minor version number: 2.1
Total amount of global memory: 1024 MBytes (1073283072 bytes)
( 0) Multiprocessors x (48) CUDA Cores/MP: 0 CUDA Cores //CUDA Cores 336 => 7 MP and 48 Cores/MP
There are quite a few issues that you have to deal with. Answer 1 provided by @njuffa is the best general solution. I'll provide more feedback based upon the limited data you have provided.
PTX output of 46 registers is not the number of registers used by your kernel. PTX is an intermediate representation. The offline or JIT compiler will convert this to device code. Device code may use more or less registers. Nsight Visual Studio Edition, the Visual Profiler, and the CUDA command line profiler can all provide you the correct register count.
The occupancy calculation is not simply RegistersPerSM / RegistersPerThread. Registers are allocated based upon a granularity. For CC 2.1 the granularity is 4 registers per thread per warp (128 registers). 2.x devices can actually allocate at a 2 register granularity but this can lead to fragmentation later in the kernel.
In your occupancy calculation you state
My card has a limit of 32768 registers, so 32768/5888 = 5 + some => 5 blocks/SM
(my card has a limit of 6).
I'm not sure what 6 means. Your device has 7 SMs. The maximum blocks per SM for 2.x devices is 8 blocks per SM.
You have provided an insufficient amount of code. If you provide pieces of code please provide the size of all inputs, the number of times each loop will be executed, and a description of the operations per function. Looking at the code you may be doing too many loops per thread. Without knowing the order of magnitude of the outer loop we can only guess.
Given that the launch is timing out you should probably approach debugging as follows:
a. Add a line to the beginning of the code
if (blockIdx.x > 0) { return; }
Run the exact code you have in one of the previously mentioned profilers to estimate the duration of a single block. Using the launch information provided by the profiler (registers per thread, shared memory, ...), use the occupancy calculator in the profiler or the xls to determine the maximum number of blocks that you can run concurrently. For example, if the theoretical block occupancy is 3 blocks per SM, and the number of SMs is 7, then you can run 21 blocks at a time, which for your launch is 9 waves. NOTE: this assumes equal work per thread. Change the early exit code to allow 1 wave (21 blocks). If this launch times out then you need to reduce the amount of work per thread. If this passes then calculate how many waves you have and estimate when you will time out (2 sec on Windows, ? on Linux).
b. If you have too many waves then you have to reduce the launch configuration. Given that you index by gridDim.x and blockDim.x, you can do this by passing in these dimensions as parameters to your kernel. This will require you to minimally change your indexing code. You will also have to pass a blockIdx.x offset. Change your host code to launch multiple kernels back to back. Since there should be no conflict, you can launch these in multiple streams to benefit from overlap at the end of each wave. A rough sketch of this launch-splitting idea is below.
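Sketch of the launch-splitting approach (computeChunk stands in for your kernel; the blockOffset parameter replaces the implicit full-grid index):
#include <algorithm>
#include <cuda_runtime.h>

__global__ void computeChunk(float* out, int numPoints, int blockOffset)
{
    int k = threadIdx.x + (blockIdx.x + blockOffset) * blockDim.x;
    if (k < numPoints)
        out[k] = 0.0f;   // stand-in for the real per-point work
}

// Host side: issue the grid in waves small enough to finish under the watchdog.
void launchInWaves(float* d_out, int numPoints, int threads, int blocksPerWave)
{
    int totalBlocks = (numPoints + threads - 1) / threads;
    for (int offset = 0; offset < totalBlocks; offset += blocksPerWave)
    {
        int blocks = std::min(blocksPerWave, totalBlocks - offset);
        computeChunk<<<blocks, threads>>>(d_out, numPoints, offset);
        cudaDeviceSynchronize();   // optionally let each wave retire before queuing the next
    }
}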
"launch timeout" would appear to indicate that the kernel ran too long and was killed by the watchdog timer. This can happen on GPUs that are also used for graphics output (e.g. a graphical desktop), where the task of the watchdog timer is to prevent the desktop from locking up for more than a few seconds. Best I can recall the watchdog time limit is on the order of 5 seconds or thereabouts.
At any given moment, the GPU can either run graphics, or CUDA, so the watchdog timer is needed when running a GUI to prevent the GUI from locking up for an extended period of time, which renders the machine inoperable through the GUI.
If possible, avoid using this GPU for the desktop and/or other graphics (e.g. don't run X if you are on Linux). If running without graphics isn't an option, then to avoid hitting watchdog-timer kernel termination you will have to reduce kernel execution time: do less work per kernel launch, optimize the code so the kernel runs faster for the same amount of work, or deploy a faster GPU.
To provide more input on @njuffa's answer: on Windows systems you can increase the launch timeout or TDR (Timeout Detection & Recovery) by following these steps:
1: Open the options in Nsight Monitor.
2: Set an appropriate value for WDDM TDR Delay
CAUTION: If this value is small you may get a timeout error, and for higher values your screen will stay frozen until the kernel finishes its job.
source