Empirically determining how many threads are in a warp - cuda

Is it possible to write a CUDA kernel that shows how many threads are in a warp without using any of the warp related CUDA device functions and without using benchmarking? If so, how?

Since you indicated a solution with atomics would be interesting, I advance this as something that I believe gives an answer, but I'm not sure it is necessarily the answer you are looking for. I acknowledge it is somewhat statistical in nature. I provide this merely because I found the question interesting. I don't suggest that it is the "right" answer, and I suspect someone clever will come up with a "better" answer. This may provide some ideas, however.
In order to avoid using anything that explicitly references warps, I believe it is necessary to focus on "implicit" warp-synchronous behavior. I initially went down a path thinking about how to use an if-then-else construct, (which has some warp-synchronous implications) but struggled with that and came up with this approach instead:
#include <stdio.h>
#define LOOPS 100000
__device__ volatile int test2 = 0;
__device__ int test3 = 32767;
__global__ void kernel(){
for (int i = 0; i < LOOPS; i++){
long long time = clock64();
// while (clock64() < (time + (threadIdx.x * 1000)));
int start = test2;
atomicAdd((int *)&test2, 1);
int end = test2;
int diff = end - start;
atomicMin(&test3, diff);
}
}
int main() {
kernel<<<1, 1024>>>();
int result;
cudaMemcpyFromSymbol(&result, test3, sizeof(int));
printf("result = %d threads\n", result);
return 0;
}
I compile with:
nvcc -O3 -arch=sm_20 -o t331 t331.cu
I call it "statistical" because it requires a large number of iterations (LOOPS) to produce a correct estimate (32). As the iteration count is decreased, the "estimate" increases.
We can apply additional warp-synchronous leverage by uncommenting the line that is commented out in the kernel. For my test case*, with that line uncommented, the estimate is correct even when LOOPS = 1.
*my test case is CUDA 5, Quadro5000, RHEL 5.5

Here are several easy solutions. There are other solutions that use warp synchronous programming; however, many of the solutions will not work across all devices.
SOLUTION 1: Launch one or more blocks with max threads per block, read the special registers %smid and %warpid, and blockIdx, and write the values to memory. Group the data by the three variables to find the warp size. This is even easier if you limit the launch to a single block, since then you only need %warpid.
SOLUTION 2: Launch one block with max threads per block and read the special register %clock. This requires the following assumptions which can be shown to be true on CC 1.0-3.5 devices:
%clock is defined as an unsigned 32-bit read-only cycle counter that wraps silently and updates every issue cycle
all threads in a warp read the same value for %clock
due to warp launch latency and instruction fetch, warps on the same SM but on different warp schedulers cannot issue the first instruction of a warp on the same cycle
On CC 1.0-3.5 devices (this may change in the future), all threads in the block that read the same clock value therefore belong to the same warp.
SOLUTION 3: Use Nsight VSE or cuda-gdb debugger. The warp state views show you sufficient information to determine the warp size. It is also possible to single step and see the change to the PC address for each thread.
SOLUTION 4: Use Nsight VSE, Visual Profiler, nvprof, etc. Launch kernels of 1 block with an increasing thread count per launch. Determine the thread count at which warps_launched goes from 1 to 2.
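A minimal sketch of SOLUTION 2, valid only under the assumptions listed there. Each thread of a single block records clock(), and the host counts how many leading threads recorded the same value as thread 0. The kernel name recordClock and the host-side counting loop are illustrative and not part of the original answer. The 1024-thread launch assumes a CC 2.0 or later device.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define THREADS 1024  // assumption: device supports 1024 threads per block

// Each thread stores the cycle counter value it reads.
__global__ void recordClock(unsigned int *stamps)
{
    stamps[threadIdx.x] = (unsigned int)clock();
}

int main()
{
    unsigned int *d_stamps = nullptr;
    unsigned int h_stamps[THREADS];

    cudaMalloc(&d_stamps, sizeof(h_stamps));
    recordClock<<<1, THREADS>>>(d_stamps);
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(h_stamps, d_stamps, sizeof(h_stamps), cudaMemcpyDeviceToHost);
    cudaFree(d_stamps);

    // Assumption: threads of one warp read the same clock value, and the
    // next warp reads a different one.
    int run = 1;
    while (run < THREADS && h_stamps[run] == h_stamps[0])
        ++run;

    printf("%d consecutive threads share thread 0's clock value\n", run);
    return 0;
}
```

If the assumptions do not hold on a given device (for example, if two warps happen to read the counter on the same cycle), the printed count will overestimate the warp size.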

Related

Failed to test nestedReduce2.cu from book Professional CUDA C Programming

I am reading the book Professional CUDA C Programming. I've downloaded the source code from Wiley; the file I tested was chapter03/nestedReduce2.cu. The file can also be found on github.
I built the .cu file both with its Makefile and with a simple command:
nvcc -o nestedReduce2 ./nestedReduce2.cu -rdc=true
The output was like:
./nestedReduce2 starting reduction at device 0: Quadro RTX 4000 array 1048576 grid 2048 block 512
cpu reduce elapsed 0.000858 sec cpu_sum: 1048576
gpu Neighbored elapsed 0.000404 sec gpu_sum: 1048576 <<<grid 2048 block 512>>>
gpu nested elapsed 0.044057 sec gpu_sum: 1048576 <<<grid 2048 block 512>>>
gpu nestedNosyn elapsed 0.019464 sec gpu_sum: 1048576 <<<grid 2048 block 512>>>
gpu nested2 elapsed 0.001051 sec gpu_sum: 946688 <<<grid 2048 block 512>>>
Test failed!
How to solve this problem? Is there some update for CUDA recursive programming since the last update of the book?
I don't have that book and have never read it. I don't really know what is in the book, so my response is directed to the code posted on the github site and nothing else. I'm unable to make any statements about a book I don't have and have never read.
Concerning the kernel in question:
__global__ void gpuRecursiveReduce2(int *g_idata, int *g_odata, int iStride,
int const iDim)
{
// convert global data pointer to the local pointer of this block
int *idata = g_idata + blockIdx.x * iDim;
// stop condition
if (iStride == 1 && threadIdx.x == 0)
{
g_odata[blockIdx.x] = idata[0] + idata[1];
return;
}
// in place reduction
idata[threadIdx.x] += idata[threadIdx.x + iStride];
// nested invocation to generate child grids
if(threadIdx.x == 0 && blockIdx.x == 0)
{
gpuRecursiveReduce2<<<gridDim.x, iStride / 2>>>(g_idata, g_odata,
iStride / 2, iDim);
}
}
I believe it should be fairly evident that, for correctness, the child kernel launch:
gpuRecursiveReduce2<<<gridDim.x, iStride / 2>>>(g_idata, g_odata,
iStride / 2, iDim);
should not be allowed to execute until the preceding parent reduction:
// in place reduction
idata[threadIdx.x] += idata[threadIdx.x + iStride];
is complete. Both items potentially span up to half the entire dataset, and therefore depend on results from multiple blocks (to be complete, for correctness).
On my V100 GPU (CUDA 11.4), the code gives the expected result. However as OP has demonstrated, it may not give the expected result in all scenarios.
In order to be confident of correct results, we would need something like a grid-wide sync, in between the parent reduction step, and the child kernel execution, for each sweep phase (except the last, since there is only 1 thread per block in that case, and so all blocks terminate before reaching the child kernel launch.)
Unfortunately, the cooperative groups grid-wide sync is not supported with CUDA dynamic parallelism (CDP).
The other grid-wide sync formally provided by CUDA is the kernel launch boundary. Therefore:
How to solve this problem?
My suggestion would be to dispense with CDP launches, and use a set of (non-recursive) kernel launches driven by a for-loop in host code. For someone at the level of study indicated here, this should be a trivial refactoring, so I will not present it here.
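For readers who want a starting point anyway, here is a minimal sketch of that host-driven refactoring. It is not the book's code: the kernel name reduceStep, the all-ones test data, and the choice to leave each block's partial sum in the first element of its chunk (instead of a separate g_odata array) are assumptions made for illustration. It relies on the fact that kernels launched into the same stream execute in order, so each launch boundary acts as the grid-wide sync.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// One sweep of the in-place reduction: each block folds the upper half of its
// active range onto the lower half. Launched with `stride` threads per block.
__global__ void reduceStep(int *g_idata, int stride, int iDim)
{
    int *idata = g_idata + blockIdx.x * iDim;
    idata[threadIdx.x] += idata[threadIdx.x + stride];
}

int main()
{
    const int iDim = 512;   // elements per block (power of two)
    const int grid = 2048;  // number of blocks
    const int n = grid * iDim;

    std::vector<int> h(n, 1);
    int *d = nullptr;
    cudaMalloc(&d, n * sizeof(int));
    cudaMemcpy(d, h.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    // Same-stream launches run in order, so sweep k+1 starts only after
    // every block of sweep k has finished.
    for (int stride = iDim / 2; stride >= 1; stride /= 2)
        reduceStep<<<grid, stride>>>(d, stride, iDim);

    cudaMemcpy(h.data(), d, n * sizeof(int), cudaMemcpyDeviceToHost);

    // Each block's partial sum is in the first element of its chunk.
    long long sum = 0;
    for (int b = 0; b < grid; ++b)
        sum += h[b * iDim];
    printf("gpu_sum: %lld\n", sum);

    cudaFree(d);
    return 0;
}
```

With the all-ones input used here, the program should print gpu_sum: 1048576, matching the array size reported in the question.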
Additional discussion:
In particular, we could surmise that a case where the GPU is "smaller" (i.e. fewer SMs) and the grid size is "larger" might be a problem. This might give rise to a situation where child kernel blocks are executing prior to the completion of some parent kernel blocks.
Coupled with this, a question might be asked "is there any characteristic of null stream behavior (e.g. synchronization) between the parent kernel null stream and the child kernel null stream that would (or should have) created the desired ordering?" The answer is no. You can refer to the documentation, where null stream behavior of CDP kernels is discussed.
In my view it is clear that the child kernel NULL stream does not synchronize with the parent kernel null stream. As an additional thought experiment, we should keep in mind that the documentation states that a parent kernel is not considered complete until all child kernels are complete. Coupled with that, if we assumed null stream synchronizing behavior between parent and child, it would immediately give rise to deadlock. So we reject that hypothesis.
For additional inspection, we can derive a test case to convince ourselves that a parent kernel null stream and child kernel null stream do not interact:
$ cat t2099.cu
#include <iostream>
__global__ void child(int *d, int val){
*d = val;
}
__global__ void parent(int *d, int val){
*d = val;
if (blockIdx.x == 1048577) child<<<1,1>>>(d, 1);
}
int main(){
int *d;
cudaMallocManaged(&d, sizeof(d[0]));
parent<<<2*1048576, 1>>>(d, 0);
cudaDeviceSynchronize();
std::cout << d[0] << std::endl;
}
$ nvcc -o t2099 t2099.cu -rdc=true
$ ./t2099
0
$
In the above simplified test case, we are launching a parent kernel of ~2M blocks, where all parent kernel blocks set a variable to zero, and the child kernel launched from a single block picked arbitrarily sets the variable to 1.
If there were parent/child synchronization, we would expect the variable to be 1 at conclusion. Since it is 0, we conclude that there is no synchronization between parent and child kernel. The child kernel (block) somehow "intermixed" with the execution of the parent kernel blocks. (the "intermixing" is not in any way guaranteed by CUDA, but we could surmise that one reason the block scheduler might choose to intermix is because the parent kernel block is not complete until its child kernel block is complete. Therefore, from a throughput perspective, it might be advantageous to make forward progress on the child kernel, in the midst of the parent kernel.)
This discussion and experiment help to reinforce the idea that the presented code needs/requires a grid-wide sync for correctness, and neither the code itself nor the CDP mechanism provide any guarantee of that.
(For completeness, the test case I presented is not guaranteed to produce 0, and it may not produce 0 if you run it on your machine. The fact that it does produce 0 in at least one test setup - mine - is sufficient for the argument. In my test case, if I change the number of blocks launched to 1048578, then the output changes from 0 to 1.)

How to avoid Cuda error 6 (Launch Timeout) with consecutive asynchronous kernel launches?

I get a Cuda error 6 (also known as cudaErrorLaunchTimeout and CUDA_ERROR_LAUNCH_TIMEOUT) with this (simplified) code:
for(int i = 0; i < 650; ++i)
{
int param = foo(i); //some CPU computation here, but no memory copy
MyKernel<<<dimGrid, dimBlock>>>(&data, param);
}
The Cuda error 6 indicates that the kernel took too much time to return. The duration of a single MyKernel is only ~60 ms though. The block size is a classic 16×16.
Now, when I call cudaDeviceSynchronize() every, say, 50 iterations, the error doesn't occur:
for(int i = 0; i < 650; ++i)
{
int param = foo(i); //some CPU computation here, but no memory copy
MyKernel<<<dimGrid, dimBlock>>>(&data, param);
if(i % 50 == 0) cudaDeviceSynchronize();
}
I would like to avoid this synchronization, because it slows the program down a lot.
Since kernel launches are asynchronous, I guess the error occurs because the watchdog measures the execution duration of a kernel from its asynchronous launch, and not from the actual beginning of its execution.
I am new to Cuda. Is this a common case for the error 6 to occur? Is there a way to avoid this error without altering the performance?
Thanks to talonmies and Robert Crovella (whose proposed solution didn't work for me), I've been able to find an acceptable workaround.
To prevent the CUDA driver from batching the kernel launches together, another operation must be performed before or after each kernel launch. E.g. a dummy copy does the trick:
void* dummy;
cudaMalloc(&dummy, 1);
for(int i = 0; i < 650; ++i)
{
int param = foo(i); //some CPU computation here, but no memory copy
cudaMemcpyAsync(dummy, dummy, 1, cudaMemcpyDeviceToDevice);
MyKernel<<<dimGrid, dimBlock>>>(&data, param);
}
This solution is 8 seconds faster (50s to 42s) than the one that includes calls to cudaDeviceSynchronize() (see question).
Besides, it's more reliable, 50 being an arbitrary, device-specific period.
The watchdog isn't measuring execution time of kernels, per se. The watchdog is keeping track of requests in the command queue that goes to the GPU, and determining if any of them have not been acknowledged by the GPU within a timeout period.
As @talonmies indicated in the comments, my best guess is that (if you are certain that no kernel execution exceeds the timeout period) this behavior is due to the CUDA driver WDDM batching mechanism, which seeks to reduce average latency by collecting GPU commands and sending them to the GPU in batches.
You don't have direct control over the batching behavior, and so in general, trying to work around this without disabling or modifying the Windows TDR mechanism will be an imprecise exercise.
The general (somewhat undocumented) suggestion for a low-cost "flush" of the command queue, which you might try experimenting with, is to use cudaEventQuery(0); (as suggested here) in place of cudaDeviceSynchronize();, perhaps every 50 kernel launches or so. To some degree the specifics may depend on the machine configuration, and the GPU in use.
I'm not sure how effective it will be in your case. I don't think that it can be advanced as a "guarantee" of avoiding a TDR event without a lot more experimentation. Your mileage may vary.
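A minimal sketch of the periodic "flush" idea, for experimentation only. Note that it swaps in cudaStreamQuery(0), a non-blocking query of the default stream, in place of the cudaEventQuery(0) call mentioned above; the assumption that either call nudges the WDDM driver to submit its batch is not documented behavior. The kernel body, the array size, and the period of 50 are placeholders.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in for MyKernel from the question (hypothetical body).
__global__ void MyKernel(float *data, int param)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] += param;
}

int main()
{
    const int n = 1 << 20;
    float *data = nullptr;
    cudaMalloc(&data, n * sizeof(float));
    cudaMemset(data, 0, n * sizeof(float));

    for (int i = 0; i < 650; ++i) {
        MyKernel<<<n / 256, 256>>>(data, i);
        // Non-blocking: returns cudaSuccess or cudaErrorNotReady without
        // waiting for the queued kernels. Assumed (not guaranteed) to make
        // the driver submit any batched launches.
        if (i % 50 == 0)
            cudaStreamQuery(0);
    }

    // Wait once at the end and report whether any launch failed.
    cudaError_t err = cudaDeviceSynchronize();
    printf("final status: %s\n", cudaGetErrorString(err));

    cudaFree(data);
    return 0;
}
```

Unlike cudaDeviceSynchronize() inside the loop, the query does not stall the host, so the CPU-side work between launches can still overlap with GPU execution.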

Counting registers/thread in Cuda kernel

The nSight profiler tells me that the following kernel uses 52 registers per thread:
//Just the first lines of the kernel.
__global__ void voles_kernel(float *params, int *ctrl_params,
float dt, float currTime,
float *dev_voles, float *dev_weasels,
curandStateMtgp32 *state)
{
__shared__ float dev_params[9];
__shared__ int BuYeSimStep[4];
if(threadIdx.x < 4)
{
BuYeSimStep[threadIdx.x] = ctrl_params[threadIdx.x];
}
if(threadIdx.x < 9){
dev_params[threadIdx.x] = params[threadIdx.x];
}
__syncthreads();
float currVole = curand_uniform(&state[blockIdx.x]) + 3.0;
float currWeas = curand_uniform(&state[blockIdx.x]) + 0.1;
float oldVole = currVole;
float oldWeas = currWeas;
int jj;
if (blockIdx.x * blockDim.x + threadIdx.x < BuYeSimStep[2])
{
int dayIndex = 0;
/* Not declaring any new variable from here on, just doing arithmetics.
....... */
If each register has 4 bytes I don't understand how we get to 52 registers, even
assuming that the arrays params[9] and ctrl_params[4] end up in registers (in which
case using shared memory as I did doesn't make sense). I would
like to increase occupancy, but I don't get why I'm using so many registers.
Any ideas?
It's generally difficult to look at C code and predict the register usage from it. The compiler may aggressively optimize code by increasing register usage, perhaps to save an instruction here or there. You seem to be making an assumption that register usage can be predicted from your C code variable allocations, and while there is some connection between the two, you cannot assume register usage can be computed directly from C code variable allocations.
Since you haven't provided your code, nobody can actually help with the register usage. If you want to better understand the register usage, you will need to look at the PTX code directly. To do this, compile your code using nvcc with the -ptx switch, and inspect the resultant .ptx file directly. To do this you may wish to refer to the PTX documentation as well as the nvcc documentation to look at the various compiler options.
You haven't provided your code, so it's not really possible to make any direct suggestions, but you may be able to reduce register usage by reducing constant usage, reducing or refactoring arithmetic usage, switching from double to float, and I'm sure there are many other suggestions as well. Register usage will also be affected if you are passing the -G switch to the compiler.
You can limit the compiler's usage of registers per thread by passing the -maxrregcount switch to nvcc with an appropriate parameter, such as -maxrregcount 20, which will instruct the compiler to limit itself to 20 registers per thread. This tactic may not give good results, however, or you may need to tune the parameter to a value which doesn't sacrifice too much performance. You may nevertheless find an optimum choice which doesn't sacrifice too much basic performance but allows you to improve occupancy. If you constrain the compiler too much, it will begin to spill its needed register usage to local memory, which will generally reduce performance.
You should also be aware that you can pass -Xptxas -v to nvcc which will give useful output about the compiler's register usage and other related data (spilling, etc.) at compile time.
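As a complement to the compile-time output, the register count the compiler settled on can also be read back at run time with cudaFuncGetAttributes. The sketch below is illustrative: demoKernel is a hypothetical stand-in for the kernel in the question, and the numbers it reports depend on the compiler version, target architecture, and flags.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel; substitute the kernel you want to inspect.
__global__ void demoKernel(float *out, float a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = a * i + 1.0f;
}

int main()
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, demoKernel);
    if (err != cudaSuccess) {
        printf("cudaFuncGetAttributes failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // numRegs is the per-thread register count of the compiled kernel;
    // a nonzero localSizeBytes can indicate spilling to local memory.
    printf("registers per thread: %d\n", attr.numRegs);
    printf("local memory per thread (bytes): %zu\n", attr.localSizeBytes);
    printf("static shared memory per block (bytes): %zu\n", attr.sharedSizeBytes);
    return 0;
}
```

Rebuilding with a different -maxrregcount value and rerunning shows how the limit changes numRegs and localSizeBytes for the same source.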
If you want to increase the occupancy, a direct way is to use the compiler flag -maxrregcount to restrict the usage of registers, but this may cause a performance loss because some registers will be spilled to local memory, which is very slow.
I suggest you debug your code with Eclipse Nsight.
Create a breakpoint at the first line of your kernel and step to there.
In Debug Perspective, inside the CUDA Thread, you have the current stack trace. Right-click on the stack and click on "Instruction Stepping Mode". The window "Disassembly" will open your kernel PTX Assembly. You can continue stepping in your kernel to track the correlation of your source code and the assembly. This lets you discover what each register is used for.

Very poor memory access performance with CUDA

I'm very new to CUDA, and trying to write a test program.
I'm running the application on GeForce GT 520 card, and get VERY poor performance.
The application is used to process some image, with each row being handled by a separate thread.
Below is a simplified version of the application. Please note that in the real application, all constants are actually variables, provided by the caller.
When running the code below, it takes more than 20 seconds to complete the execution.
But as opposed to using malloc/free, when l_SrcIntegral is defined as a local array (as it appears in the commented line), it takes less than 1 second to complete the execution.
Since the actual size of the array is dynamic (and not 1700), this local array can't be used in the real application.
Any advice how to improve the performance of this rather simple code would be appreciated.
#include "cuda_runtime.h"
#include <stdio.h>
#define d_MaxParallelRows 320
#define d_MinTreatedRow 5
#define d_MaxTreatedRow 915
#define d_RowsResolution 1
#define k_ThreadsPerBlock 64
__global__ void myKernel(int Xi_FirstTreatedRow)
{
int l_ThreadIndex = blockDim.x * blockIdx.x + threadIdx.x;
if (l_ThreadIndex >= d_MaxParallelRows)
return;
int l_Row = Xi_FirstTreatedRow + (l_ThreadIndex * d_RowsResolution);
if (l_Row <= d_MaxTreatedRow) {
//float l_SrcIntegral[1700];
float* l_SrcIntegral = (float*)malloc(1700 * sizeof(float));
for (int x=185; x<1407; x++) {
for (int i=0; i<1700; i++)
l_SrcIntegral[i] = i;
}
free(l_SrcIntegral);
}
}
int main()
{
cudaError_t cudaStatus;
cudaStatus = cudaSetDevice(0);
int l_ThreadsPerBlock = k_ThreadsPerBlock;
int l_BlocksPerGrid = (d_MaxParallelRows + l_ThreadsPerBlock - 1) / l_ThreadsPerBlock;
int l_FirstRow = d_MinTreatedRow;
while (l_FirstRow <= d_MaxTreatedRow) {
printf("CUDA: FirstRow=%d\n", l_FirstRow);
fflush(stdout);
myKernel<<<l_BlocksPerGrid, l_ThreadsPerBlock>>>(l_FirstRow);
cudaDeviceSynchronize();
l_FirstRow += (d_MaxParallelRows * d_RowsResolution);
}
printf("CUDA: Done\n");
return 0;
}
1.
As @aland said, you may encounter even worse performance by calculating just one row in each kernel call.
You have to think about processing the whole input in order to even theoretically use the power of massively parallel processing.
Why start multiple kernels with just 320 threads, each calculating only one row?
How about using as many blocks as you have rows, and letting the threads of each block process one row?
(320 threads per block is not a good choice; check out how to reach better occupancy.)
2.
If your fast resources, such as registers and shared memory, are not enough, you have to use a tile approach, which is one of the basics of GPGPU programming.
Separate the input data into tiles of equal size and process them in a loop in your thread.
Here I posted an example of such a tile approach:
Parallelization in CUDA, assigning threads to each column
Be aware of range checks in that tile approach!
Example to give you the idea:
Calculate the sum of all elements in a column vector in an arbitrary sized matrix.
Each block processes one column. In a tile loop, the threads of that block store their elements in a shared memory array. When a tile is loaded, they calculate its sum using parallel reduction, and then start the next iteration.
At the end, each block has calculated the sum of its vector.
You can still use dynamic array sizes using shared memory. Just pass a third argument in the <<<...>>> of the kernel call. That'd be the size of your shared memory per block.
Once you're there, just bring all relevant data into your shared array (you should still try to keep coalesced accesses) bringing one or several (if it's relevant to keep coalesced accesses) elements per thread. Sync threads after it's been brought (only if you need to stop race conditions, to make sure the whole array is in shared memory before any computation is done) and you're good to go.
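A minimal sketch of the dynamic shared memory approach described above, assuming one block per row and a row that fits in shared memory (1700 floats is about 6.8 KB). The kernel name rowSum, the all-ones input, and the single-thread summation at the end are placeholders chosen to keep the example short; a real kernel would do its per-row work in parallel.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// One block per row; the row is staged in dynamically sized shared memory.
__global__ void rowSum(const float *in, float *out, int width)
{
    extern __shared__ float row[];  // size set by the 3rd launch argument

    // Consecutive threads load consecutive elements (coalesced access).
    for (int i = threadIdx.x; i < width; i += blockDim.x)
        row[i] = in[blockIdx.x * width + i];
    __syncthreads();                // the whole row is now in shared memory

    // Kept deliberately simple: one thread sums the staged row.
    if (threadIdx.x == 0) {
        float s = 0.0f;
        for (int i = 0; i < width; ++i)
            s += row[i];
        out[blockIdx.x] = s;
    }
}

int main()
{
    const int rows = 320, width = 1700;  // width may be a run-time value
    std::vector<float> h_in(rows * width, 1.0f), h_out(rows);
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, h_in.size() * sizeof(float));
    cudaMalloc(&d_out, h_out.size() * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), h_in.size() * sizeof(float),
               cudaMemcpyHostToDevice);

    size_t shmemBytes = width * sizeof(float);  // shared memory per block
    rowSum<<<rows, 64, shmemBytes>>>(d_in, d_out, width);

    cudaMemcpy(h_out.data(), d_out, h_out.size() * sizeof(float),
               cudaMemcpyDeviceToHost);
    printf("row 0 sum = %.1f\n", h_out[0]);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

With the all-ones input, each row sums to its width, so the program should print row 0 sum = 1700.0.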
Also: you should tessellate using blocks and threads, not loops. I understand that's just an example using a local array, but still, it could be done tessellating through blocks/threads and not nested for loops (which are VERY bad for performance!) I hope you're running your sample code using just 1 block and 1 thread, otherwise it wouldn't make much sense.

CUDA finding the max value in given array

I tried to develop a small CUDA program to find the max value in a given array,
int input_data[0...50] = 1,2,3,4,5....,50
max_value initialized by the first value of the input_data[0],
The final answer is stored in result[0].
The kernel is giving 0 as the max value. I don't know what the problem is.
I launched it with 1 block of 50 threads.
__device__ int lock=0;
__global__ void max(float *input_data,float *result)
{
float max_value = input_data[0];
int tid = threadIdx.x;
if( input_data[tid] > max_value)
{
do{} while(atomicCAS(&lock,0,1));
max_value=input_data[tid];
__threadfence();
lock=0;
}
__syncthreads();
result[0]=max_value; //Final result of max value
}
I know there are built-in functions for this; I am just practicing on small problems.
You are trying to set up a "critical section", but this approach on CUDA can lead to hang of your whole program - try to avoid it whenever possible.
Why does your code hang?
Your kernel (__global__ function) is executed by groups of 32 threads, called warps. All threads inside a single warp execute synchronously. So, the warp will stop in your do{} while(atomicCAS(&lock,0,1)) until all threads from your warp succeed in obtaining the lock. But the lock admits only one thread at a time, and the thread that holds it cannot move on to release it while the rest of its warp is still spinning. This leads to a hang.
Alternative solution
What you need is a "parallel reduction algorithm". You can start reading here:
Parallel prefix sum @ wikipedia
Parallel Reduction @ CUDA website
NVIDIA's Guide to Reduction
Your code has a potential race. I'm not sure if you defined the 'max_value' variable in shared memory or not, but both are wrong.
1) If 'max_value' is just a local variable, then each thread holds the local copy of it, which are not the actual maximum value (they are just the maximum value between input_data[0] and input_data[tid]). In the last line of code, all threads write to result[0] their own max_value, which will result in undefined behavior.
2) If 'max_value' is a shared variable, 49 threads will enter the if-statements block, and they will try to update the 'max_value' one at a time using locks. But the order of executions among 49 threads is not defined, and therefore some threads may overwrite the actual maximum value to smaller values. You would need to compare the maximum value again within the critical section.
Max is a 'reduction' - check out the Reduction sample in the SDK, and do max instead of summation.
The white paper's a little old but still reasonably useful:
http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/reduction/doc/reduction.pdf
The final optimization step is to use 'warp synchronous' coding to avoid unnecessary __syncthreads() calls.
It requires at least 2 kernel invocations - one to write a bunch of intermediate max() values to global memory, then another to take the max() of that array.
If you want to do it in a single kernel invocation, check out the threadfenceReduction SDK sample. That uses __threadfence() and atomicAdd() to track progress, then has 1 block do a final reduction when all blocks have finished writing their intermediate results.
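For the single-block, 50-element case in the question, the reduction idea can be sketched as follows. This is a simplified shared-memory tree reduction, not the SDK sample: the kernel name maxReduce and the BLOCK constant are made up for illustration, and it assumes the whole input fits in one block whose size is a power of two.

```cuda
#include <cstdio>
#include <cfloat>
#include <cuda_runtime.h>

#define BLOCK 64  // power of two, at least the number of elements

__global__ void maxReduce(const float *in, float *result, int n)
{
    __shared__ float s[BLOCK];
    int tid = threadIdx.x;

    // Threads past the end of the input contribute the smallest float.
    s[tid] = (tid < n) ? in[tid] : -FLT_MAX;
    __syncthreads();

    // Halve the active range each step, keeping the larger of each pair.
    for (int stride = BLOCK / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            s[tid] = fmaxf(s[tid], s[tid + stride]);
        __syncthreads();
    }

    if (tid == 0)
        result[0] = s[0];  // only one thread writes the final answer
}

int main()
{
    const int n = 50;
    float h_in[n], h_result = 0.0f;
    for (int i = 0; i < n; ++i)
        h_in[i] = (float)(i + 1);  // 1, 2, ..., 50

    float *d_in = nullptr, *d_result = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_result, sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    maxReduce<<<1, BLOCK>>>(d_in, d_result, n);

    cudaMemcpy(&h_result, d_result, sizeof(float), cudaMemcpyDeviceToHost);
    printf("max = %.1f\n", h_result);

    cudaFree(d_in);
    cudaFree(d_result);
    return 0;
}
```

For the input 1 to 50 this should print max = 50.0. No lock is needed because each step has every active thread write only its own shared-memory slot, with __syncthreads() separating the steps.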
Variables have different scopes depending on how they are declared. When you declare a variable with __device__, it is placed in GPU global memory and is accessible by all threads in the grid. __shared__ places the variable in the block's shared memory, where it is accessible only by the threads of that block. Finally, if you don't use any qualifier, as in float max_value, the variable is placed in thread registers and can be accessed only by that thread. In your code, each thread has its own local variable max_value, which is not visible to the other threads.