CUDA unspecified launch failure error

I have the following code http://pastebin.com/vLeD1GJm which works just fine, but if I increase:
#define GPU_MAX_PW 100000000
to:
#define GPU_MAX_PW 1000000000
Then I receive:
frederico#zeus:~/Dropbox/coisas/projetos/delta_cuda$ optirun ./a
block size = 97657 grid 48828 grid 13951
unspecified launch failure in a.cu at line 447.. err number 4
I'm running this on a GTX 675M, which has 2 GB of memory. The second definition of GPU_MAX_PW needs around 1000000000 × 2 ÷ 1024 ÷ 1024 = 1907 MB, so I'm not out of memory. What can the problem be, given that I'm only allocating more memory? Maybe the grid and block configuration becomes impossible?
Note that the error is pointing to this line:
HANDLE_ERROR(cudaMemcpy(gwords, gpuHashes, sizeof(unsigned short) * GPU_MAX_PW, cudaMemcpyDeviceToHost));

First of all, you have your sizes listed incorrectly. The program works for 10,000,000 and not 100,000,000 (whereas you said it works for 100,000,000 and not 1,000,000,000). So memory size is not the issue, and your calculations there are based on the wrong numbers.

calculate_grid_parameters is messed up. The objective of this function is to figure out how many blocks are needed, and therefore grid size, based on GPU_MAX_PW specifying the total number of threads needed and 1024 threads per block (hard-coded).

The line that prints out block size = ... grid ... grid ... actually has the clue to the problem. For GPU_MAX_PW of 100,000,000, this function correctly computes that 100,000,000/1024 = 97657 blocks are needed. However, the grid dimensions are computed incorrectly. The grid dimensions grid.x * grid.y should equal the total number of blocks desired (approximately). But this function has decided that it wants grid.x of 48828 and grid.y of 13951. If I multiply those two, I get 681,199,428, which is much larger than the desired total block count of 97657.

Now if I then launch a kernel with requested grid dimensions of 48828 (x) and 13951 (y), and also request 1024 threads per block, I have requested 697,548,214,272 total threads in that kernel launch. First of all this is not your intent, and secondly, while at the moment I can't say exactly why, this is apparently too many threads. Suffice it to say that this overall grid request exceeds some resource limitation of the machine.
Note that if you drop from 100,000,000 to 10,000,000 for GPU_MAX_PW, the grid calculation becomes "sensible", I get:
block size = 9766 grid 9766 grid 1
and no launch failure.

Related

Why is it said max blocks per grid dimension are 65535?

For a constant block size of 128 (cores per MP):
I did a performance comparison of a grid with 2D dimensions dim3(WIDTH, HEIGHT) versus a flattened 1D grid of WIDTH * HEIGHT blocks, where WIDTH and HEIGHT can be arbitrarily large values representing a 2D array/matrix, so long as the int doesn't overflow in C.
According to my research, such as this answer here: Maximum blocks per grid: CUDA, only 65535 blocks should be supported in a single dimension.
Yet with WIDTH = 4000, HEIGHT = 4000, the speed results end up essentially the same over multiple trials regardless of whether the grid has 1 dimension or 2. E.g., given gridDim { x = 125000, y = 1, z = 1 }, I get the same performance as gridDim { x = 375, y = 375, z = 1 }, with a block size of 128 (computationally expensive operations are performed on the array for each thread).
I thought for the 1D gridDim, any value over 65535 shouldn't even work, going by prior answers. Why is such a large dimension accepted then?
Even if it does work, I thought this should somehow lead to wasted cores. Yet the speed between dim3 and a flattened 1D grid, with threads per block of 128 (# of cores per MP), is the same from my tests. What's the point then of using dim3 with multiple dimensions instead of a single dimension for the grid size?
Could someone please enlighten me here?
Thank you.
As can be seen in Table 15. Technical Specifications per Compute Capability, the x-dimension is not restricted to 65535 like the other two dimensions, instead it can go up to 2^31 - 1 for all supported compute architectures. As to why this is the case, you might not get a good answer as this seems like an old implementation detail.
The information in the linked SO answer is outdated (as mentioned in the comments). I edited it for future readers.
The dimensionality of the grid does not matter for "wasting cores". The amount of threads per block (together with the use of shared memory and registers) is what is important for utilization. And even there the dimensionality is just to make the code easier to write and read, as many GPU use-cases are not one-dimensional.
The amount of blocks in a grid (together with the amount of blocks that can be fitted onto each SM) can matter for minimizing the tail effect in smaller kernels (see this blog post), but again the dimensionality should be of no concern for that.
I have never seen any information about the dimensionality of the grid or blocks mattering directly to performance in a way that could not be emulated using 1D grids and blocks (i.e. 2D tiles for e.g. matrix multiplication are important for performance, but one could emulate them and use 1D blocks instead), so I view them just as a handy abstraction to keep index computations in user code at a reasonable level.

Nsight Compute Grid Size Inconsistent Unit

I am launching a vector add kernel in the following way:
//cuda processing sequence step 1 is complete
int blocks = 1; // modify this line for experimentation
int threads = 1024; // modify this line for experimentation
vadd<<<blocks, threads>>>(d_A, d_B, d_C, DSIZE);
Then, I compile it with
nvcc -o vector_add_2b vector_add.cu
And profile it with
nv-nsight-cu-cli -fo vector_add_2b ./vector_add_2b
I found it strange that the Grid Size in Nsight Compute is given as 1024,1,1, especially considering that this size is followed by an X (block dimension).
As I was writing this question, I also noticed that under Launch Statistics, they have the number I was expecting: 1
This makes me believe that in the first case the size of the grid is given in threads, whereas in the second it is given in blocks.
Why is that?

How to programmatically determine the correct launch parameters for a persistent kernel?

What is the correct way to programmatically determine the launch parameters of a persistent kernel? All examples I have found use hard-coded values.
Is the following correct?
cudaDeviceProp props;
cudaGetDeviceProperties(&props, 0);
int blockCount = props.maxBlocksPerMultiProcessor * props.multiProcessorCount;
int blockThreadCount = props.maxThreadsPerMultiProcessor / props.maxBlocksPerMultiProcessor;
// Gives <<<1312, 96>>> on a RTX 3090
PersistentKernel<<<blockCount, blockThreadCount>>>(...);
Is the following correct?
No.
Use cudaOccupancyMaxPotentialBlockSize. That will give you both the grid size and block size for the current device which maximize the occupancy of a given kernel with the minimum number of blocks. Those are the optimal launch parameters for a given persistent kernel.
Note that the returned block and grid dimensions are scalars. You are free to reshape them into multidimensional dim3 block and/or grid dimensions which preserve the total number of threads per block and blocks which are returned by the API.
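As a minimal sketch of what that looks like (PersistentKernel and its empty argument list here stand in for your own kernel and arguments):

```cuda
#include <cstdio>

__global__ void PersistentKernel(/* your arguments */)
{
    // persistent work loop goes here
}

int main()
{
    int minGridSize = 0, blockSize = 0;

    // Returns the block size that maximizes occupancy for this kernel,
    // and the minimum grid size needed to achieve that occupancy.
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize,
                                       PersistentKernel, 0, 0);

    printf("launching <<<%d, %d>>>\n", minGridSize, blockSize);
    PersistentKernel<<<minGridSize, blockSize>>>();
    cudaDeviceSynchronize();
    return 0;
}
```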

Should I check the number of threads in kernel code?

I am a beginner with CUDA, and my coworkers always design kernels with the following wrapping:
__global__ void myKernel(int nbThreads)
{
    int threadId = blockDim.x*blockIdx.y*gridDim.x  // rows preceding current row in grid
                 + blockDim.x*blockIdx.x            // blocks preceding current block
                 + threadIdx.x;
    if (threadId < nbThreads)
    {
        statement();
        statement();
        statement();
    }
}
They think there are some situations where CUDA might launch more threads than specified, for the sake of alignment/warp size, so we need to check it every time.
However, I've seen no example kernel on the internet so far where they actually do this verification.
Can CUDA actually launch more threads than specified block/grid dimensions?
CUDA will not launch more threads than what are specified by the block/grid dimensions.
However, due to the granularity of block dimensions (e.g. it's desirable to have block dimensions be a multiple of 32, and it is limited in size to 1024 or 512), it is frequently the case that it is difficult to match a grid of threads to be numerically equal to the desired problem size.
In these cases, the typical behavior is to launch more threads, effectively rounding up to the next even size based on the block granularity, and use the "thread check" code in the kernel to make sure that the "extra threads", i.e. those beyond the problem size, don't do anything.
In your example, this could be clarified by writing:
__global__ void myKernel(int problem_size)
    if (threadId < problem_size)
which communicates what is intended: that only threads corresponding to the problem size (which may not match the launched grid size) do any actual work.
As a very simple example, suppose I wanted to do a vector add, on a vector whose length was 10000 elements. 10000 is not a multiple of 32, nor is it less than 1024, so in a typical implementation I would launch multiple threadblocks to do the work.
If I want each threadblock to be a multiple of 32, there is no number of threadblocks that I can choose which will give me exactly 10000 threads. Therefore, I might choose 256 threads in a threadblock, and launch 40 threadblocks, giving me 10240 threads total. Using the thread check, I prevent the "extra" 240 threads from doing anything.

kernel failure: invalid configuration argument

I have a question about my code and whether I can run it on my current device or not.
Basically, I want to do a 3D interpolation.
When I launch my interpolation kernel, I get the following error: kernel failure: invalid configuration argument
I saw in this discussion that it can happen if you call too many threads or blocks, but I am not sure it is the case in my code. Could someone have a look at it and tell me what's wrong?
Here is how I call my kernel:
dim3 blockSize(6,6,6);
dim3 threadSize(dimX/blockSize.x,dimY/blockSize.y,dimZ/blockSize.z);
d_interpolate_kernel<<<blockSize,threadSize>>>(output,dimX,dimY,dimZ);
My dimensions are dimX = 54 or 108, and dimY = dimZ = 42 or 84.
So I have blockSize(6,6,6) and threadSize(9,7,7) or (18,14,14).
My card has the following capabilities:
MAX_BLOCK_DIM_X = 512
MAX_BLOCK_DIM_Y = 512
MAX_BLOCK_DIM_Z = 64
MAX_GRID_DIM_X = 65535
MAX_GRID_DIM_Y = 65535
MAX_GRID_DIM_Z = 1
Do I get the error because MAX_GRID_DIM_Z is 1?
If yes, is there a way around this?
Thank you!
One problem is you have your blockSize and threadSize variables reversed in your kernel call.
You want something like this:
d_interpolate_kernel<<<threadSize,blockSize>>>(output,dimX,dimY,dimZ);
The first configuration argument is the size of the grid in blocks.
The second configuration argument is the size of the block in threads.
Since you have them reversed, your (18,14,14) values are not acceptable block sizes (too many threads), since the max number of threads per block is 512 (for cc1.x) or 1024 (otherwise), whereas 18x14x14 = 3528.
For me, threadSize is a confusing name. I would have called it gridSize or something like that.
The second problem, as you've pointed out, is that for a cc1.x card (which seems to be what you have) your Z grid dimension must be 1. At least for your 42 case, you can fix this by re-structuring the thread blocks to have a dimension of, say, (2,2,42) and your grid a dimension of, say, (27, 21, 1).
Otherwise, these indices are just arbitrary numbering schemes. You can come up with a 2D grid that covers all of your 3D volume, using a (6, 6, 6) block size if that is what you want. You just need to get creative about how you map the blockIdx.x and blockIdx.y built-in variables in your interpolation kernel, to simulate a 3D grid.