I am launching a vector add kernel in the following way:
//cuda processing sequence step 1 is complete
int blocks = 1; // modify this line for experimentation
int threads = 1024; // modify this line for experimentation
vadd<<<blocks, threads>>>(d_A, d_B, d_C, DSIZE);
Then, I compile it with
nvcc -o vector_add_2b vector_add.cu
And profile it with
nv-nsight-cu-cli -fo vector_add_2b ./vector_add_2b
I found it strange that the Grid Size in Nsight Compute is given as 1024,1,1, especially considering that this size is followed by an X (block dimension).
As I was writing this question, I also noticed that under Launch Statistics, they have the number I was expecting: 1
This makes me believe that in the first case the size of the grid is given in threads, whereas in the second it is given in blocks.
Why is that?
I am reading the book Professional CUDA C Programming. I downloaded the source code from Wiley; the file I tested is chapter03/nestedReduce2.cu. The file can also be found on GitHub.
I built the .cu file with its Makefile as well as with a simple command:
nvcc -o nestedReduce2 ./nestedReduce2.cu -rdc=true
The output was like:
./nestedReduce2 starting reduction at device 0: Quadro RTX 4000 array 1048576 grid 2048 block 512
cpu reduce elapsed 0.000858 sec cpu_sum: 1048576
gpu Neighbored elapsed 0.000404 sec gpu_sum: 1048576 <<<grid 2048 block 512>>>
gpu nested elapsed 0.044057 sec gpu_sum: 1048576 <<<grid 2048 block 512>>>
gpu nestedNosyn elapsed 0.019464 sec gpu_sum: 1048576 <<<grid 2048 block 512>>>
gpu nested2 elapsed 0.001051 sec gpu_sum: 946688 <<<grid 2048 block 512>>>
Test failed!
How to solve this problem? Is there some update for CUDA recursive programming since the last update of the book?
I don't have that book and have never read it, so my response is directed to the code posted on the github site and nothing else.
Concerning the kernel in question:
__global__ void gpuRecursiveReduce2(int *g_idata, int *g_odata, int iStride,
                                    int const iDim)
{
    // convert global data pointer to the local pointer of this block
    int *idata = g_idata + blockIdx.x * iDim;
    // stop condition
    if (iStride == 1 && threadIdx.x == 0)
    {
        g_odata[blockIdx.x] = idata[0] + idata[1];
        return;
    }
    // in place reduction
    idata[threadIdx.x] += idata[threadIdx.x + iStride];
    // nested invocation to generate child grids
    if(threadIdx.x == 0 && blockIdx.x == 0)
    {
        gpuRecursiveReduce2<<<gridDim.x, iStride / 2>>>(g_idata, g_odata,
                                                        iStride / 2, iDim);
    }
}
I believe it should be fairly evident that, for correctness, the child kernel launch:
    gpuRecursiveReduce2<<<gridDim.x, iStride / 2>>>(g_idata, g_odata,
                                                    iStride / 2, iDim);
should not be allowed to execute until the preceding parent reduction:
    // in place reduction
    idata[threadIdx.x] += idata[threadIdx.x + iStride];
is complete. Both items potentially span up to half the entire dataset, and therefore depend on results from multiple blocks (to be complete, for correctness).
On my V100 GPU (CUDA 11.4), the code gives the expected result. However as OP has demonstrated, it may not give the expected result in all scenarios.
In order to be confident of correct results, we would need something like a grid-wide sync, in between the parent reduction step, and the child kernel execution, for each sweep phase (except the last, since there is only 1 thread per block in that case, and so all blocks terminate before reaching the child kernel launch.)
Unfortunately, the cooperative groups grid-wide sync is not supported with CUDA dynamic parallelism (CDP).
The other grid-wide sync formally provided by CUDA is the kernel launch boundary. Therefore:
How to solve this problem?
My suggestion would be to dispense with CDP launches and use a set of (non-recursive) kernel launches driven by a for-loop in host code. For someone at the level of study indicated here, this should be a trivial refactoring.
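The host-driven refactoring suggested above could look roughly like the following. This is a minimal sketch, not code from the book or the github repository: the names reduceStep and reduceWithHostLoop are hypothetical, error checking is omitted, and it assumes the block size is a power of two (at most 2048) and the array length is a multiple of it. It relies on the fact that kernels launched into the same stream execute in order, so each launch boundary acts as the grid-wide sync.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// One sweep of the in-place reduction. Launched with exactly `stride`
// threads per block, so every thread has a partner element at +stride.
__global__ void reduceStep(int *g_idata, int stride, int iDim)
{
    int *idata = g_idata + blockIdx.x * iDim;   // this block's chunk
    idata[threadIdx.x] += idata[threadIdx.x + stride];
}

// Hypothetical host driver. Assumes blockSize is a power of two (<= 2048)
// and h_data.size() is a multiple of blockSize. Error checking omitted.
long long reduceWithHostLoop(const std::vector<int> &h_data, int blockSize)
{
    const size_t n = h_data.size();
    const int grid = static_cast<int>(n / blockSize);

    int *d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMemcpy(d_data, h_data.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    // Launches into the same stream run in order: sweep k+1 cannot start
    // until every block of sweep k has finished.
    for (int stride = blockSize / 2; stride >= 1; stride /= 2)
        reduceStep<<<grid, stride>>>(d_data, stride, blockSize);

    // The blocking copy also waits for the last sweep to finish.
    std::vector<int> h_out(n);
    cudaMemcpy(h_out.data(), d_data, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_data);

    // Each block left its partial sum in the first element of its chunk.
    long long sum = 0;
    for (int b = 0; b < grid; ++b)
        sum += h_out[static_cast<size_t>(b) * blockSize];
    return sum;
}
```

Calling reduceWithHostLoop on an array of 1048576 ones with blockSize 512 mirrors the configuration in the question (grid 2048, block 512). The final summation of per-block results is done on the host here only to keep the sketch short.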
Additional discussion:
In particular, we could surmise that a case where the GPU is "smaller" (i.e. fewer SMs) and the grid size is "larger" might be a problem. This might give rise to a situation where child kernel blocks are executing prior to the completion of some parent kernel blocks.
Coupled with this, a question might be asked "is there any characteristic of null stream behavior (e.g. synchronization) between the parent kernel null stream and the child kernel null stream that would (or should have) created the desired ordering?" The answer is no. You can refer to the documentation, where null stream behavior of CDP kernels is discussed.
In my view it is clear that the child kernel null stream does not synchronize with the parent kernel null stream. As an additional thought experiment, we should keep in mind that the documentation states that a parent kernel is not considered complete until all child kernels are complete. If we assumed null stream synchronizing behavior between parent and child, that would immediately give rise to deadlock. So we reject that hypothesis.
For additional inspection, we can derive a test case to convince ourselves that a parent kernel null stream and child kernel null stream do not interact:
$ cat t2099.cu
#include <iostream>
__global__ void child(int *d, int val){
  *d = val;
}
__global__ void parent(int *d, int val){
  *d = val;
  if (blockIdx.x == 1048577) child<<<1,1>>>(d, 1);
}
int main(){
  int *d;
  cudaMallocManaged(&d, sizeof(d[0]));
  parent<<<2*1048576, 1>>>(d, 0);
  cudaDeviceSynchronize();
  std::cout << d[0] << std::endl;
}
$ nvcc -o t2099 t2099.cu -rdc=true
$ ./t2099
0
$
In the above simplified test case, we are launching a parent kernel of ~2M blocks, where all parent kernel blocks set a variable to zero, and the child kernel launched from a single block picked arbitrarily sets the variable to 1.
If there were parent/child synchronization, we would expect the variable to be 1 at conclusion. Since it is 0, we conclude that there is no synchronization between parent and child kernel. The child kernel (block) somehow "intermixed" with the execution of the parent kernel blocks. (The "intermixing" is not in any way guaranteed by CUDA, but we could surmise that one reason the block scheduler might choose to intermix is that the parent kernel block is not complete until its child kernel block is complete. Therefore, from a throughput perspective, it might be advantageous to make forward progress on the child kernel in the midst of the parent kernel.)
This discussion and experiment help to reinforce the idea that the presented code requires a grid-wide sync for correctness, and neither the code itself nor the CDP mechanism provides any guarantee of that.
(For completeness, the test case I presented is not guaranteed to produce 0, and it may not produce 0 if you run it on your machine. The fact that it does produce 0 in at least one test setup, mine, is sufficient for the argument. In my test case, if I change the number of blocks launched to 1048578, then the output changes from 0 to 1.)
What is the correct way to programmatically determine the launch parameters of a persistent kernel? All examples I have found use hard coded values.
Is the following correct?
cudaDeviceProp props;
cudaGetDeviceProperties(&props, 0);
int blockCount = props.maxBlocksPerMultiProcessor * props.multiProcessorCount;
int blockThreadCount = props.maxThreadsPerMultiProcessor / props.maxBlocksPerMultiProcessor;
// Gives <<<1312, 96>>> on a RTX 3090
PersistentKernel<<<blockCount, blockThreadCount>>>(...);
Is the following correct?
No.
Use cudaOccupancyMaxPotentialBlockSize. That will give you both the grid size and block size for the current device which maximize the occupancy of a given kernel with the minimum number of blocks. Those are the optimal launch parameters for a given persistent kernel.
Note that the returned block and grid dimensions are scalars. You are free to reshape them into multidimensional dim3 block and/or grid dimensions which preserve the total number of threads per block and blocks which are returned by the API.
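A minimal sketch of how that API call might be used follows. The kernel persistentKernel, the LaunchConfig struct and the helpers queryLaunchConfig and doubleAll are hypothetical names introduced only for illustration, and the kernel body (a grid-stride loop that doubles every element) merely stands in for real persistent work. Error handling is reduced to a single check.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical persistent-style kernel: a fixed-size grid strides over all
// of the work, so the grid size need not match the problem size.
__global__ void persistentKernel(float *data, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        data[i] *= 2.0f;
}

struct LaunchConfig { int gridSize; int blockSize; };

// Ask the runtime for the block size that maximizes occupancy of this
// kernel on the current device, and the minimum grid size that reaches it.
LaunchConfig queryLaunchConfig()
{
    LaunchConfig cfg = {0, 0};
    cudaError_t err = cudaOccupancyMaxPotentialBlockSize(
        &cfg.gridSize, &cfg.blockSize, persistentKernel,
        0 /* dynamic shared memory bytes */, 0 /* no block size limit */);
    if (err != cudaSuccess) cfg = LaunchConfig{0, 0};
    return cfg;
}

// Launch with the queried configuration (scalar grid and block sizes).
std::vector<float> doubleAll(std::vector<float> v)
{
    const LaunchConfig cfg = queryLaunchConfig();
    float *d = nullptr;
    cudaMalloc(&d, v.size() * sizeof(float));
    cudaMemcpy(d, v.data(), v.size() * sizeof(float), cudaMemcpyHostToDevice);
    persistentKernel<<<cfg.gridSize, cfg.blockSize>>>(d, static_cast<int>(v.size()));
    cudaMemcpy(v.data(), d, v.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return v;
}
```

If the kernel uses dynamic shared memory, the fourth argument should be the number of bytes requested per block, since that affects the achievable occupancy.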
I am a beginner with CUDA, and my coworkers always design kernels with the following wrapping:
__global__ void myKernel(int nbThreads)
{
    int threadId = blockDim.x*blockIdx.y*gridDim.x  //rows preceding current row in grid
                 + blockDim.x*blockIdx.x            //blocks preceding current block
                 + threadIdx.x;
    if (threadId < nbThreads)
    {
        statement();
        statement();
        statement();
    }
}
They think there are some situations where CUDA might launch more threads than specified for alignment/warping sake, so we need to check it every time.
However, I've seen no example kernel on the internet so far where they actually do this verification.
Can CUDA actually launch more threads than specified block/grid dimensions?
CUDA will not launch more threads than what are specified by the block/grid dimensions.
However, because of the granularity of block dimensions (e.g. it's desirable for the block size to be a multiple of 32, and it is limited to 1024 or 512 threads depending on the GPU), it is often difficult to make the launched grid of threads numerically equal to the desired problem size.
In these cases, the typical approach is to launch more threads, effectively rounding up to the next multiple of the block size, and use the "thread check" code in the kernel to make sure that the "extra threads", i.e. those beyond the problem size, don't do anything.
In your example, this could be clarified by writing:
__global__ void myKernel(int problem_size)
if (threadId < problem_size)
which communicates what is intended, that only threads corresponding to the problem size (which may not match the launched grid size) do any actual work.
As a very simple example, suppose I wanted to do a vector add, on a vector whose length was 10000 elements. 10000 is not a multiple of 32, nor is it less than 1024, so in a typical implementation I would launch multiple threadblocks to do the work.
If I want each threadblock to be a multiple of 32, there is no number of threadblocks that I can choose which will give me exactly 10000 threads. Therefore, I might choose 256 threads in a threadblock, and launch 40 threadblocks, giving me 10240 threads total. Using the thread check, I prevent the "extra" 240 threads from doing anything.
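The vector add example above can be sketched as follows. The names blocksFor, vadd and vectorAdd are hypothetical, and error checking is omitted. The helper rounds the block count up, and the thread check in the kernel keeps the extra threads idle.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Number of blocks needed to cover problem_size elements, rounded up.
int blocksFor(int problem_size, int threads_per_block)
{
    return (problem_size + threads_per_block - 1) / threads_per_block;
}

__global__ void vadd(const float *a, const float *b, float *c, int problem_size)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < problem_size)          // threads beyond the problem size do nothing
        c[idx] = a[idx] + b[idx];
}

// Hypothetical host wrapper; assumes a and b have the same length.
std::vector<float> vectorAdd(const std::vector<float> &a,
                             const std::vector<float> &b)
{
    const int n = static_cast<int>(a.size());
    const size_t bytes = n * sizeof(float);
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 256;
    vadd<<<blocksFor(n, threads), threads>>>(d_a, d_b, d_c, n);

    std::vector<float> c(n);
    cudaMemcpy(c.data(), d_c, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return c;
}
```

For 10000 elements and 256 threads per block, blocksFor returns 40, so 10240 threads are launched and the 240 threads with idx >= 10000 skip the addition.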
Is it possible to write a CUDA kernel that shows how many threads are in a warp without using any of the warp related CUDA device functions and without using benchmarking? If so, how?
Since you indicated a solution with atomics would be interesting, I advance this as something that I believe gives an answer, but I'm not sure it is necessarily the answer you are looking for. I acknowledge it is somewhat statistical in nature. I provide this merely because I found the question interesting. I don't suggest that it is the "right" answer, and I suspect someone clever will come up with a "better" answer. This may provide some ideas, however.
In order to avoid using anything that explicitly references warps, I believe it is necessary to focus on "implicit" warp-synchronous behavior. I initially went down a path thinking about how to use an if-then-else construct, (which has some warp-synchronous implications) but struggled with that and came up with this approach instead:
#include <stdio.h>
#define LOOPS 100000
__device__ volatile int test2 = 0;
__device__ int test3 = 32767;
__global__ void kernel(){
  for (int i = 0; i < LOOPS; i++){
    unsigned long time = clock64();
    // while (clock64() < (time + (threadIdx.x * 1000)));
    int start = test2;
    atomicAdd((int *)&test2, 1);
    int end = test2;
    int diff = end - start;
    atomicMin(&test3, diff);
  }
}
int main() {
  kernel<<<1, 1024>>>();
  int result;
  cudaMemcpyFromSymbol(&result, test3, sizeof(int));
  printf("result = %d threads\n", result);
  return 0;
}
I compile with:
nvcc -O3 -arch=sm_20 -o t331 t331.cu
I call it "statistical" because it requires a large number of iterations (LOOPS) to produce a correct estimate (32). As the iteration count is decreased, the "estimate" increases.
We can apply additional warp-synchronous leverage by uncommenting the line that is commented out in the kernel. For my test case*, with that line uncommented, the estimate is correct even when LOOPS = 1.
*my test case is CUDA 5, Quadro5000, RHEL 5.5
Here are several easy solutions. There are other solutions that use warp synchronous programming; however, many of the solutions will not work across all devices.
SOLUTION 1: Launch one or more blocks with max threads per block, read the special registers %smid and %warpid, and blockIdx and write values to memory. Group data by the three variables to find the warp size. This is even easier if you limit the launch to a single block then you only need %warpid.
SOLUTION 2: Launch one block with max threads per block and read the special register %clock. This requires the following assumptions, which can be shown to be true on CC 1.0-3.5 devices:
%clock is defined as an unsigned 32-bit read-only cycle counter that wraps silently and updates every issue cycle
all threads in a warp read the same value for %clock
due to warp launch latency and instruction fetch, warps on the same SM but on different warp schedulers cannot issue their first instruction on the same cycle
Under these assumptions, all threads in the block that read the same clock value on CC 1.0-3.5 devices (this may change in the future) belong to the same warp.
SOLUTION 3: Use Nsight VSE or cuda-gdb debugger. The warp state views show you sufficient information to determine the warp size. It is also possible to single step and see the change to the PC address for each thread.
SOLUTION 4: Use Nsight VSE, Visual Profiler, nvprof, etc. Launch kernels of 1 block with increasing thread count per launch. Determine the thread count that causes warps_launched to go from 1 to 2.
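Solution 1 above, restricted to a single block, might be sketched like this. The names recordWarpId and estimateWarpSize are hypothetical, error checking is omitted, and the sketch assumes that consecutive thread indices are packed into warps and that neighbouring warps of the block report different %warpid values. The PTX documentation describes %warpid as volatile, so treat the result as an estimate rather than a guarantee.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Each thread stores the value of the PTX special register %warpid.
__global__ void recordWarpId(unsigned *warpIds)
{
    unsigned w;
    asm volatile("mov.u32 %0, %%warpid;" : "=r"(w));
    warpIds[threadIdx.x] = w;
}

// Launch one block with the maximum number of threads and count how many
// leading threads report the same warp id as thread 0.
int estimateWarpSize()
{
    cudaDeviceProp props;
    cudaGetDeviceProperties(&props, 0);
    const int threads = props.maxThreadsPerBlock;

    unsigned *d_ids = nullptr;
    cudaMalloc(&d_ids, threads * sizeof(unsigned));
    recordWarpId<<<1, threads>>>(d_ids);

    std::vector<unsigned> h_ids(threads);
    cudaMemcpy(h_ids.data(), d_ids, threads * sizeof(unsigned),
               cudaMemcpyDeviceToHost);
    cudaFree(d_ids);

    int count = 0;
    while (count < threads && h_ids[count] == h_ids[0])
        ++count;
    return count;
}
```

The host code reads only maxThreadsPerBlock from the device properties; the warp size itself is inferred from the recorded register values.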
I have the following code http://pastebin.com/vLeD1GJm which works just fine, but if I increase:
#define GPU_MAX_PW 100000000
to:
#define GPU_MAX_PW 1000000000
Then I receive:
frederico#zeus:~/Dropbox/coisas/projetos/delta_cuda$ optirun ./a
block size = 97657 grid 48828 grid 13951
unspecified launch failure in a.cu at line 447.. err number 4
I'm running this on a GTX 675M which has 2GB of memory. And the second definition of GPU_MAX_PW will have around 1000000000×2÷1024÷1024 = 1907 MB, so I'm not out of memory. What can be the problem since I'm only allocating more memory? Maybe the grid and block configuration become impossible?
Note that the error is pointing to this line:
HANDLE_ERROR(cudaMemcpy(gwords, gpuHashes, sizeof(unsigned short) * GPU_MAX_PW, cudaMemcpyDeviceToHost));
First of all you have your sizes listed incorrectly. The program works for 10,000,000 and not 100,000,000 (whereas you said it works for 100,000,000 and not 1,000,000,000). So memory size is not the issue, and your calculations there are based on the wrong numbers.
calculate_grid_parameters is messed up. The objective of this function is to figure out how many blocks are needed, and therefore the grid size, based on GPU_MAX_PW specifying the total number of threads needed and 1024 threads per block (hard coded). The line that prints out block size = ... grid ... grid ... actually has the clue to the problem. For GPU_MAX_PW of 100,000,000, this function correctly computes that 100,000,000/1024 = 97657 blocks (rounded up) are needed. However, the grid dimensions are computed incorrectly. The product grid.x * grid.y should (approximately) equal the total number of blocks desired. But this function has decided that it wants grid.x of 48828 and grid.y of 13951. Multiplying those two gives 681,199,428, which is much larger than the desired total block count of 97657. Launching a kernel with requested grid dimensions of 48828 (x) and 13951 (y), and 1024 threads per block, requests 697,548,214,272 total threads. First of all, this is not your intent, and secondly, while at the moment I can't say exactly why, this is apparently too many threads. Suffice it to say that this overall grid request exceeds some resource limitation of the machine.
Note that if you drop from 100,000,000 to 10,000,000 for GPU_MAX_PW, the grid calculation becomes "sensible", I get:
block size = 9766 grid 9766 grid 1
and no launch failure.
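One way the grid calculation could be written so that grid.x * grid.y stays close to the required block count is sketched below. This is not the code from the pastebin: makeGrid and globalThreadId are hypothetical names, and the per-dimension limit of 65535 is an assumption that should be replaced by the limits reported in cudaDeviceProp::maxGridSize for the actual device.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Split the required number of blocks across grid.x and grid.y so that
// neither dimension exceeds maxGridDim and the product is only slightly
// larger than the block count. Assumes the result fits, i.e. the block
// count is at most maxGridDim * maxGridDim.
dim3 makeGrid(unsigned long long totalThreads, unsigned threadsPerBlock,
              unsigned maxGridDim = 65535)
{
    unsigned long long blocks =
        (totalThreads + threadsPerBlock - 1) / threadsPerBlock;   // round up
    if (blocks == 0) blocks = 1;
    unsigned long long gy = (blocks + maxGridDim - 1) / maxGridDim;
    unsigned long long gx = (blocks + gy - 1) / gy;
    return dim3(static_cast<unsigned>(gx), static_cast<unsigned>(gy), 1);
}

// Linear thread index for a 2D grid of 1D blocks. Because the grid is
// rounded up, a kernel using this must still compare the returned value
// against the total thread count before doing any work.
__device__ unsigned long long globalThreadId()
{
    unsigned long long block =
        blockIdx.y * static_cast<unsigned long long>(gridDim.x) + blockIdx.x;
    return block * blockDim.x + threadIdx.x;
}
```

With 1024 threads per block, makeGrid returns 9766 x 1 for 10,000,000 threads, matching the sensible output above, and 48829 x 2 for 100,000,000 threads, which is 97658 blocks for the 97657 that are needed.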