Cuda kernel configuration - cuda

I'm writing a cuda c code to process pictures for example i created a swap function (swap blocs of the matrix) but it dos not work every time i thing i have a problem with the number of blocs and number of threads whene i lunch my kernel.
For example if i tak an image of size 2048*2048 with
threadsPerBlock.x=threadsPerBlock.y=64 and numBlocks.x=numBlocks.y=2048/threadsPerBlock.x
then swap<<<threadsPerBlock,numBlocks>>>(...) works fine.
But if I take an image of size 2560*2160, threadsPerBlock.x=threadsPerBlock.y=64 and numBlocks.x=2560/64 and numBlocks.y=2160/64+1, I have an error 9 wish is error invalid configuration argument.
I'm using CUDA 7.5 and a GPU with compute capability 5.0

The maximum number of threads per block for your compute 5.0 device is 1024. The source of your problem is that you have the arguments in the kernel launch reversed. When the maximum dimension of the image is less than 2048, that gives you a launch with less than 1024 threads per block. Larger than 2048 and the block size becomes illegal
If you do something like this:
threadsPerBlock.x=threadsPerBlock.y=32
numBlocks.x=numBlocks.y=2048/threadsPerBlock.x
swap<<<numBlocks,threadsPerBlock>>>(...)
You should find the kernel launch works unconditionally.

Related

Why does this kernel not achieve peak IPC on a GK210?

I decided that it would be educational for me to try to write a CUDA kernel that achieves peak IPC, so I came up with this kernel (host code omitted for brevity but is available here)
#define WORK_PER_THREAD 4
__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
int i = blockIdx.x*blockDim.x + threadIdx.x;
i *= WORK_PER_THREAD;
if (i < n)
{
#pragma unroll
for(int j=0; j<WORK_PER_THREAD; j++)
y[i+j] = a * x[i+j] + y[i+j];
}
}
I ran this kernel on a GK210, with n=32*1000000 elements, and expected to see an IPC of close to 4, but ended up with a lousy IPC of 0.186
ubuntu#ip-172-31-60-181:~/ipc_example$ nvcc saxpy.cu
ubuntu#ip-172-31-60-181:~/ipc_example$ sudo nvprof --metrics achieved_occupancy --metrics ipc ./a.out
==5828== NVPROF is profiling process 5828, command: ./a.out
==5828== Warning: Auto boost enabled on device 0. Profiling results may be inconsistent.
==5828== Profiling application: ./a.out
==5828== Profiling result:
==5828== Metric result:
Invocations Metric Name Metric Description Min Max Avg
Device "Tesla K80 (0)"
Kernel: saxpy_parallel(int, float, float*, float*)
1 achieved_occupancy Achieved Occupancy 0.879410 0.879410 0.879410
1 ipc Executed IPC 0.186352 0.186352 0.186352
I was even more confused when I set WORK_PER_THREAD=16, resulting in less threads launched, but 16, as opposed to 4, independent instructions for each to execute, the IPC dropped to 0.01
My two questions are:
What is the peak IPC I can expect on a GK210? I think it is 8 = 4 warp schedulers * 2 instruction dispatches per cycle, but I want to be sure.
Why does this kernel achieve such low IPC while achieved occupancy is high, why does IPC decrease as WORK_PER_THREAD increases, and how can I improve the IPC of this kernel?
What is the peak IPC I can expect on a GK210?
The peak IPC per SM is equal to the number of warp schedulers in an SM times the issue rate of each warp scheduler. This information can be found in the whitepaper for a particular GPU. The GK210 whitepaper is here. From that document (e.g. SM diagram on p8) we see that each SM has 4 warp schedulers capable of dual issue. Therefore the peak theoretically achievable IPC is 8 instructions per clock per SM. (however as a practical matter even for well-crafted codes, you're unlikely to see higher than 6 or 7).
Why does this kernel achieve such low IPC while achieved occupancy is high, why does IPC decrease as WORK_PER_THREAD increases, and how can I improve the IPC of this kernel?
Your kernel requires global transactions at nearly every operation. Global loads and even L2 cache loads have latency. When everything you do is dependent on those, there is no way to avoid the latency, so your warps are frequently stalled. The peak observable IPC per SM on a GK210 is somewhere in the vicinity of 6, but you won't get that with continuous load and store operations. Your kernel does 2 loads, and one store (12 bytes total moved), for each multiply/add. You won't be able to improve it. (Your kernel has high occupancy because the SMs are loaded up with warps, but low IPC because those warps are frequently stalled, unable to issue an instruction, waiting for latency of load operations to expire.) You'll need to find other useful work to do.
What might that be? Well if you do a matrix multiply operation, which has considerable data reuse and a relatively low number of bytes per math op, you're likely to see better measurements.
What about your code? Sometimes the work you need to do is like this. We'd call that a memory-bound code. For a kernel like this, the figure of merit to use for judging "goodness" is not IPC but achieved bandwidth. If your kernel requires a particular number of bytes loaded and stored to perform its work, then if we compare the kernel duration to just the memory transactions, we can get a measure of goodness. Stated another way, for a pure memory bound code (i.e. your kernel) we would judge goodness by measuring the total number of bytes loaded and stored (profiler has metrics for this, or for a simple code you can compute it directly by inspection), and divide that by the kernel duration. This gives the achieved bandwidth. Then, we compare that to the achievable bandwidth based on a proxy measurement. A possible proxy measurement tool for this is bandwidthTest CUDA sample code.
As the ratio of these two bandwidths approaches 1.0, your kernel is doing "well", given the memory bound work it is trying to do.

What is the number of registers in CUDA CC 5.0?

I have a GeForce GTX 745 (CC 5.0).
The deviceQuery command shows that the total number of registers available per block is 65536 (65536 * 4 / 1024 = 256KB).
I wrote a kernel that uses an array of size 10K and the kernel is invoked as follows. I have tried two ways of allocating the array.
// using registers
fun1() {
short *arr = new short[100*100]; // 100*100*sizeof(short)=256K / per block
....
delete[] arr;
}
fun1<<<4, 64>>>();
// using global memory
fun2(short *d_arr) {
...
}
fun2<<<4, 64>>>(d_arr);
I can get the correct result in both cases.
The first one which uses registers runs much faster.
But when invoking the kernel using 6 blocks I got the error code 77.
fun1<<<6, 64>>>();
an illegal memory access was encountered
Now I'm wondering, actually how many of registers can I use? And how is it related to the number of blocks?
The important misconception in your question is that the new operator somehow uses registers to store memory allocated at runtime on the device. It does not. Registers are only allocated statically by the compiler. The new operator uses a dedicated heap for device allocation.
In detail: In your code, fun1, the first line is invoked by all threads, hence each thread of each block would allocate 10,000 16 bits values, that is 1,280,000 bytes per block. For 4 blocks, that make 5,120,000 bytes, for 6 that makes 7,680,000 bytes which for some reason seems to overflow the preallocated limit (default limit is 8MB - see Heap memory allocation). This may be why you get this Illegal Access Error (77).
Using new will make use of some preallocated global memory as malloc would, but not registers - maybe the code you provided is not exactly the one you run. If you want registers, you need to define the data in a fixed array:
func1()
{
short arr [100] ;
...
}
The compiler will then try to fit the array in registers. Note however that this register data is per thread, and maximum number of 32 bits registers per thread is 255 on your device.

cuda - loop unrolling issue

I have a kernel with a #pragma unroll 80 and I'm running it with NVIDIA GT 285, compute capability 1.3,
with grid architecture: dim3 thread_block( 16, 16 ) and dim3 grid( 40 , 30 ) and it works fine.
When I tried running it with NVIDIA GT 580, compute capability 2.0 and with the above grid architecture it works fine.
When I change the grid architecture on the GT 580 to
dim3 thread_block( 32 , 32 ) and dim3 grid( 20 , 15 ), thus producing the same number of threads as above, I get incorrect results.
If I remove #pragma unroll 80 or replace it with #pragma unroll 1 in GT 580 it works fine. If I don't then the kernel crashes.
Would anyone know why does this happen? Thank you in advance
EDIT: checked for kernel errors on both devices and I got the "invalid argument".
As I searched for the causes of this error I found that this happens when the dimensions of the grid and the block exceed their limits.
But this is not the case for me since I use 16x16=256 threads per block and 40x30=1200 total blocks. As far as I know these values are in the boundaries of the GPU grid for compute capability 1.3.
I would like to know if this could have anything to do with the loop unrolling issue I have.
I figured out what the problem was.
After some bug fixes I got the "Too Many Resources Requested for Launch" error.
For a loop unroll, extra registers per thread are needed and I was running out of registers, hence the error and the kernel fail.
I needed 22 registers per thread, and I have 1024 threads per block.
By inserting my data into the CUDA_Occupancy_calculator it showed me that 1 block per SM is scheduled, leaving me with 32678 registers for a whole block on the compute capability 2.0 device.
22 registers*1024 threads = 22528 registers<32678 which should have worked.
But I was compiling with nvcc -arch sm_13 using the C.C. 1.3 characteristic of 16384 registers per SM
I compiled with nvcc -arch sm_20 taking advantage of the 32678 registers, more than enough for the needed 22528, and it works fine now.
Thanks to everyone, I learned about kernel errors.

CUDA unspecified launch failure error

I have the following code http://pastebin.com/vLeD1GJm wich works just fine, but if I increase:
#define GPU_MAX_PW 100000000
to:
#define GPU_MAX_PW 1000000000
Then I receive:
frederico#zeus:~/Dropbox/coisas/projetos/delta_cuda$ optirun ./a
block size = 97657 grid 48828 grid 13951
unspecified launch failure in a.cu at line 447.. err number 4
I'm running this on a GTX 675M which has 2GB of memory. And the second definition of GPU_MAX_PW will have around 1000000000×2÷1024÷1024 = 1907 MB, so I'm not out of memory. What can be the problem since I'm only allocating more memory? Maybe the grid and block configuration become impossible?
Note that the error is pointing to this line:
HANDLE_ERROR(cudaMemcpy(gwords, gpuHashes, sizeof(unsigned short) * GPU_MAX_PW, cudaMemcpyDeviceToHost));
First of all you have your sizes listed incorrectly. The program works for 10,000,000 and not 100,000,000 (whereas you said it works for 100,000,000 and not 1,000,000,000). So memory size is not the issue, and your calculations there are based on the wrong numbers.
calculate_grid_parameters is messed up. The objective of this function is to figure out how many blocks are needed and therefore grid size, based on the GPU_MAX_PW specifying the total number of threads needed and 1024 threads per block (hard coded). The line that prints out block size = grid ... grid ... actually has the clue to the problem. For GPU_MAX_PW of 100,000,000, this function correctly computes that 100,000,000/1024 = 97657 blocks are needed. However, the grid dimensions are computed incorrectly. The grid dimensions grid.x * grid.y should equal the total number of blocks desired (approximately). But this function has decided that it wants grid.x of 48828 and grid.y of 13951. If I multiply those two, I get 681,199,428, which is much larger than the desired total block count of 97657. Now if I then launch a kernel with requested grid dimensions of 48828 (x) and 13951 (y), and also request 1024 threads per block, I have requested 697,548,214,272 total threads in that kernel launch. First of all this is not your intent, and secondly, while at the moment I can't say exactly why, this is apparently too many threads. Suffice it to say that this overall grid request exceeds some resource limitation of the machine.
Note that if you drop from 100,000,000 to 10,000,000 for GPU_MAX_PW, the grid calculation becomes "sensible", I get:
block size = 9766 grid 9766 grid 1
and no launch failure.

What is the maximum block count possible in CUDA?

Theoretically, you can have 65535 blocks per dimension of the grid, up to 65535 * 65535 * 65535.
If you call a kernel like this:
kernel<<< BLOCKS,THREADS >>>()
(without dim3 objects), what is the maximum number available for BLOCKS?
In an application of mine, I've set it up to 192000 and seemed to work fine... The problem is that the kernel I used changes the contents of a huge array, so although I checked some parts of the array and seemed fine, I can't be sure whether the kernel behaved strangely at other parts.
For the record I have a 2.1 GPU, GTX 500 ti.
With compute capability 3.0 or higher, you can have up to 2^31 - 1 blocks in the x-dimension, and at most 65535 blocks in the y and z dimensions. See Table H.1. Feature Support per Compute Capability of the CUDA C Programming Guide Version 9.1.
As Pavan pointed out, if you do not provide a dim3 for grid configuration, you will only use the x-dimension, hence the per dimension limit applies here.
In case anybody lands here based on a Google search (as I just did):
Nvidia changed the specification since this question was asked. With compute capability 3.0 and newer, the x-Dimension of a grid of thread blocks is allowed to be up to 2'147'483'647 or 2^31 - 1.
See the current: Technical Specification
65535 in a single dimension. Here's the complete table
I manually checked on my laptop (MX130), program crashes when #blocks > 678*1024+651. Each block with 1 thread, Adding even a single more block gives SegFault. Kernal code had no grid, linear structure only.