Understanding CUDA heap memory limitations per thread

This question is about the heap size limitation in CUDA.
Having visited some questions concerning this topic, including this one:
new operator in kernel .. strange behaviour
I've made some tests. Given a kernel as follows:
#include <cuda.h>
#include <cuda_runtime.h>
#include <stdio.h>

#define CUDA_CHECK( err ) __cudaSafeCall( err, __FILE__, __LINE__ )
#define CUDA_CHECK_ERROR() __cudaCheckError( __FILE__, __LINE__ )

inline void __cudaSafeCall( cudaError err, const char *file, const int line )
{
    if ( cudaSuccess != err )
    {
        fprintf( stderr, "cudaSafeCall() failed at %s:%i : %s\n",
                 file, line, cudaGetErrorString( err ) );
        exit( -1 );
    }
    return;
}

inline void __cudaCheckError( const char *file, const int line )
{
    cudaError err = cudaGetLastError();
    if ( cudaSuccess != err )
    {
        fprintf( stderr, "cudaCheckError() failed at %s:%i : %s\n",
                 file, line, cudaGetErrorString( err ) );
        exit( -1 );
    }
    return;
}

#define NP 900000

__device__ double *temp;
__device__ double *temp2;

__global__
void test(){
    int i = blockDim.x*blockIdx.x + threadIdx.x;

    if(i==0){
        temp = new double[NP];
        //temp2 = new double[NP];
    }
    if(i==0){
        for(int k=0;k<NP;k++){
            temp[i] = 1.;
            if(k%1000 == 0){
                printf("%d : %g\n", k, temp[i]);
            }
        }
    }
    if(i==0){
        delete[] temp;   // array form of delete for new[]
        //delete[] temp2;
    }
}

int main(){
    //cudaDeviceSetLimit(cudaLimitMallocHeapSize, 32*1024*1024);
    //for(int k=0;k<2;k++){
        test<<<ceil((float)NP/512), 512>>>();
        CUDA_CHECK_ERROR();
    //}
    return 0;
}
I want to test the heap size limitation.
1. Dynamically allocating one array (temp) with a single thread, whose size is roughly over 960,000*sizeof(double) (close to 8 MB, which is the default limit of the heap size), gives an error: ok. 900,000 works. (Does someone know how to calculate the true limit?)
2. Raising the heap size limit allows more memory to be allocated: normal, ok.
3. Back to an 8 MB heap size, allocating one array per thread with TWO threads (so, replacing if(i==0) by if(i==0 || i==1)), each of size 900,000 * sizeof(double), fails. But 450,000*sizeof(double) each works. Still ok.
4. Here comes my problem: allocating TWO arrays with ONE thread (so, temp and temp2 for thread 0), each of size 900,000 * sizeof(double), works too, but it should not? Indeed, when I try to write into both arrays, it fails. Does anyone have an idea why allocation behaves differently with two arrays in one thread versus one array in each of two threads?
EDIT: another test, which I find interesting for those who, like me, are learning how the heap works:
5. Executing the kernel two times, with one array of size 900,000 * sizeof(double) allocated by the single thread 0, works if the delete is there. If the delete is omitted, it fails the second time, but the first call is executed.
EDIT 2: how can a device-wide variable be allocated once but remain writable by all threads (not from the host, i.e. using dynamic allocation in device code)?

Probably you are not testing for a returned null pointer on the new operation, which is a valid method for the operator to report a failure in CUDA device code.
When I modify your code as follows, I get the message "second new failed":
#include <stdio.h>

#define NP 900000

__device__ double *temp;
__device__ double *temp2;

__global__
void test(){
    int i = blockDim.x*blockIdx.x + threadIdx.x;

    if(i==0){
        temp = new double[NP];
        if (temp == 0) {printf("first new failed\n"); return;}
        temp2 = new double[NP];
        if (temp2 == 0) {printf("second new failed\n"); return;}
    }
    if(i==0){
        for(int k=0;k<NP;k++){
            temp[i] = 1.;
            if(k%1000 == 0){
                printf("%d : %g\n", k, temp[i]);
            }
        }
    }
    if(i==0){
        delete[] temp;
        delete[] temp2;
    }
}

int main() {
    test<<<1,1>>>();
    cudaDeviceSynchronize();
    return 0;
}
It's convenient if you provide complete, compilable code for others to work with, just as I have done here.
For your first EDIT question, it's not surprising that the second new will work if the first allocation is deleted. The first allocates nearly all of the 8 MB available. If you delete that allocation, then the second one will succeed. Referring to the documentation, we see that memory allocated dynamically in this fashion lives for the entire lifetime of the CUDA context, or until a corresponding delete operation is performed (i.e. not just for a single kernel call; the completion of the kernel does not necessarily free the allocation).
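As a side note, the heap limit itself is straightforward to raise and inspect from the host. A minimal sketch, using the cudaDeviceSetLimit call that is commented out in the question (the 32 MB value is just an illustrative choice):
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Must be called before any kernel that uses in-kernel new/malloc is launched,
    // because the device heap is created lazily at that point.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 32 * 1024 * 1024);

    size_t heapSize = 0;
    cudaDeviceGetLimit(&heapSize, cudaLimitMallocHeapSize);
    printf("device malloc heap size: %zu bytes\n", heapSize);
    return 0;
}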
For your second EDIT question, you are already demonstrating a method, using your __device__ double *temp; pointer, by which one thread can allocate storage that all threads can access. You will have a problem across blocks, however, because there is no guarantee of execution order amongst blocks, so if you allocate from thread 0 of block 0, that is only useful if block 0 executes before the other blocks. You could come up with a complicated scheme to check whether the allocation was already done (perhaps by testing the pointer for NULL, and also perhaps using atomics), but it creates fragile code. It's better to plan your global allocations ahead of time and allocate accordingly from the host.
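To make that last suggestion concrete, here is a minimal sketch (not your exact use case) of allocating from the host while still exposing the buffer through the __device__ pointer temp, so every thread of every block can dereference it; NP and temp are taken from the question:
#include <cuda_runtime.h>

#define NP 900000
__device__ double *temp;                 // device-wide pointer, visible to all threads

__global__ void useTemp() {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < NP) temp[i] = 1.;            // any thread of any block can use it
}

int main() {
    double *d_buf = 0;
    cudaMalloc((void **)&d_buf, NP * sizeof(double));   // ordinary device allocation from the host
    cudaMemcpyToSymbol(temp, &d_buf, sizeof(d_buf));    // publish the pointer to the __device__ symbol
    useTemp<<<(NP + 511) / 512, 512>>>();
    cudaDeviceSynchronize();
    cudaFree(d_buf);
    return 0;
}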

Related

The way to properly do multiple CUDA block synchronization

I would like to do CUDA synchronization across multiple blocks. It is not within a single block, where __syncthreads() can easily handle it.
I saw there are existing discussions on this topic, for example cuda block synchronization, and I like the simple solution brought up by #johan, https://stackoverflow.com/a/67252761/3188690, which essentially uses a 64-bit counter to track the synchronized blocks.
However, I wrote the following code trying to accomplish a similar job but ran into a problem. Here I use the term environment: the wkBlocksPerEnv blocks within each environment shall be synchronized. Each environment has a counter. I used atomicAdd() to count how many blocks have already synchronized themselves; once the number of synced blocks == wkBlocksPerEnv, I know all blocks finished syncing and are free to go. However, it has a strange outcome that I cannot explain.
The problem comes from this while loop. Since the first thread of every block does the atomicAdd, there is a while loop that checks until the condition is met. But I find that some blocks get stuck in an endless loop, and I am not sure why the condition is never met eventually. And if I printf some messages either at *** I can print here 1 or at *** I can print here 2, there is no endless loop and everything is perfect. I do not see anything obvious.
const int wkBlocksPerEnv = 2;

__device__ int env_sync_block_count[wkNumberEnvs];

__device__ void syncthreads_for_env(){
    // sync threads for each block so all threads in this block finished the previous tasks
    __syncthreads();

    // sync threads for wkBlocksPerEnv blocks for each environment
    if(wkBlocksPerEnv > 1){
        const int kThisEnvId = get_env_scope_block_id(blockIdx.x);
        if (threadIdx.x == 0){
            // incrementing env_sync_block_count by 1
            atomicAdd(&env_sync_block_count[kThisEnvId], 1);
            // *** I can print here 1
            while(env_sync_block_count[kThisEnvId] != wkBlocksPerEnv){
                // *** I can print here 2
            }
            // Do the next job ...
        }
    }
}
There are two potential issues with your code: caching and block scheduling.
Caching can prevent you from observing an updated value during the while loop.
Block scheduling can cause a deadlock if you wait for an update from a block which has not yet been scheduled. Since CUDA does not guarantee a specific order of scheduled blocks, the only way to prevent this deadlock is to limit the number of blocks in the grid such that all blocks can run simultaneously.
The following code shows how you could synchronize multiple blocks while avoiding the above issues. I adapted the code from the multi-grid synchronization in the CUDA sample conjugateGradientMultiDeviceCG https://github.com/NVIDIA/cuda-samples/blob/master/Samples/4_CUDA_Libraries/conjugateGradientMultiDeviceCG/conjugateGradientMultiDeviceCG.cu#L186
On pre-Volta devices, it uses volatile memory accesses; on Volta and later, it uses acquire/release semantics.
The grid size is limited by querying device properties.
#include <cassert>
#include <cstdio>

constexpr int wkBlocksPerEnv = 13;

__device__
int getEnv(int blockId){
    return blockId / wkBlocksPerEnv;
}

__device__
int getRankInEnv(int blockId){
    return blockId % wkBlocksPerEnv;
}

__device__
unsigned char load_arrived(unsigned char *arrived) {
#if __CUDA_ARCH__ < 700
    return *(volatile unsigned char *)arrived;
#else
    unsigned int result;
    asm volatile("ld.acquire.gpu.global.u8 %0, [%1];"
                 : "=r"(result)
                 : "l"(arrived)
                 : "memory");
    return result;
#endif
}

__device__
void store_arrived(unsigned char *arrived,
                   unsigned char val) {
#if __CUDA_ARCH__ < 700
    *(volatile unsigned char *)arrived = val;
#else
    unsigned int reg_val = val;
    asm volatile(
        "st.release.gpu.global.u8 [%1], %0;" ::"r"(reg_val), "l"(arrived)
        : "memory");
    // Avoids compiler warnings from unused variable val.
    (void)(reg_val = reg_val);
#endif
}
#if 0
//wrong implementation which does not synchronize. to check that kernel assert does trigger without proper synchronization
__device__
void syncthreads_for_env(unsigned char* temp){
}
#else
//temp must have at least size sizeof(unsigned char) * total_number_of_blocks in grid
__device__
void syncthreads_for_env(unsigned char* temp){
    __syncthreads();
    const int env = getEnv(blockIdx.x);
    const int blockInEnv = getRankInEnv(blockIdx.x);
    unsigned char* const mytemp = temp + env * wkBlocksPerEnv;

    if(threadIdx.x == 0){
        if(blockInEnv == 0){
            // Leader block waits for others to join and then releases them.
            // Other blocks in env can arrive in any order, so the leader has to wait for
            // all others.
            for (int i = 0; i < wkBlocksPerEnv - 1; i++) {
                while (load_arrived(&mytemp[i]) == 0)
                    ;
            }
            for (int i = 0; i < wkBlocksPerEnv - 1; i++) {
                store_arrived(&mytemp[i], 0);
            }
            __threadfence();
        }else{
            // Other blocks in env note their arrival and wait to be released.
            store_arrived(&mytemp[blockInEnv - 1], 1);
            while (load_arrived(&mytemp[blockInEnv - 1]) == 1)
                ;
        }
    }

    __syncthreads();
}
#endif

__global__
void kernel(unsigned char* synctemp, int* array){
    const int env = getEnv(blockIdx.x);
    const int blockInEnv = getRankInEnv(blockIdx.x);

    if(threadIdx.x == 0){
        array[blockIdx.x] = 1;
    }

    syncthreads_for_env(synctemp);

    if(threadIdx.x == 0){
        int sum = 0;
        for(int i = 0; i < wkBlocksPerEnv; i++){
            sum += array[env * wkBlocksPerEnv + i];
        }
        assert(sum == wkBlocksPerEnv);
    }
}

int main(){
    const int smem = 0;
    const int blocksize = 128;

    int deviceId = 0;
    int numSMs = 0;
    int maxBlocksPerSM = 0;

    cudaGetDevice(&deviceId);
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, deviceId);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &maxBlocksPerSM,
        kernel,
        blocksize,
        smem
    );

    int maxBlocks = maxBlocksPerSM * numSMs;
    maxBlocks -= maxBlocks % wkBlocksPerEnv; //round down to nearest multiple of wkBlocksPerEnv
    printf("wkBlocksPerEnv %d, maxBlocks: %d\n", wkBlocksPerEnv, maxBlocks);

    int* d_array;
    unsigned char* d_synctemp;
    cudaMalloc(&d_array, sizeof(int) * maxBlocks);
    cudaMalloc(&d_synctemp, sizeof(unsigned char) * maxBlocks);
    cudaMemset(d_synctemp, 0, sizeof(unsigned char) * maxBlocks);

    kernel<<<maxBlocks, blocksize>>>(d_synctemp, d_array);

    cudaFree(d_synctemp);
    cudaFree(d_array);

    return 0;
}
The atomic update goes to global memory, but in the while loop you read the value directly, and that read can be served from a cache that is not automatically kept coherent between threads (coherence is only enforced by explicit synchronization such as a thread fence). The thread sees its own update, but other threads may not see it.
Even if you use a thread fence, the threads in the same warp could deadlock, waiting forever, if they were the first to check the value before any other thread updates it. It should, however, work on newer GPUs that support independent thread scheduling.
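A hedged sketch of what that means for the question's loop (env_sync_block_count, kThisEnvId and wkBlocksPerEnv are the question's names, here turned into parameters of a standalone fragment); the volatile reads go back to global memory on every iteration instead of re-using a stale cached value, and the block-scheduling caveat from the other answer still applies:
__device__ void wait_for_env(volatile int *env_sync_block_count,
                             int kThisEnvId, int wkBlocksPerEnv) {
    if (threadIdx.x == 0) {
        // announce this block's arrival
        atomicAdd((int *)&env_sync_block_count[kThisEnvId], 1);
        __threadfence();   // make this block's prior global writes visible first

        // volatile read: fetched from global memory on every iteration
        while (env_sync_block_count[kThisEnvId] != wkBlocksPerEnv) {
            /* spin */
        }
    }
    __syncthreads();       // release the rest of the block
}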
I like to do CUDA synchronization for multiple blocks.
You should learn to dis-like it. Synchronization is always costly, even when implemented just right, and inter-core synchronization all the more so.
if (threadIdx.x == 0){
    // incrementing env_sync_block_count by 1
    atomicAdd(&env_sync_block_count[kThisEnvId], 1);
    while(env_sync_block_count[kThisEnvId] != wkBlocksPerEnv)
    // OH NO!!
    {
    }
}
This is bad. With this code, the first warp of each block will perform repeated reads of env_sync_block_count[kThisEnvId]. First, as #AbatorAbetor mentioned, you will face the problem of cache incoherence, causing your blocks to potentially read a stale value from a local cache well after the global value has changed.
Also, your blocks will hog the multiprocessors. Blocks will stay resident and keep at least one active warp, indefinitely. Who's to say they will be evicted from their multiprocessor so that additional blocks can be scheduled to execute? If I were the GPU, I wouldn't allow more and more active blocks to pile up. Even if you don't deadlock, you'll be wasting a lot of time.
Now, #AbatorAbetor's answer avoids the deadlock by limiting the grid size, and I guess that works. But unless you have a very good reason to write your kernels this way, the real solution is to just break up your algorithm into consecutive kernels (or, better yet, figure out how to avoid the need to synchronize altogether).
A mid-way approach is to only have some blocks get past the point of synchronization. You could do that by not waiting except on some condition which holds for a very limited number of blocks (say you had a single workgroup; then only the blocks which got the last K possible counter values wait).
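For completeness, a minimal sketch of the "consecutive kernels" suggestion above; phase1 and phase2 are placeholder names, and the point is simply that a kernel boundary on the same stream already acts as a grid-wide synchronization:
__global__ void phase1(float *data) { /* per-block work before the sync point */ }
__global__ void phase2(float *data) { /* work that requires all of phase1 to be done */ }

void run(float *d_data, int numBlocks, int blockSize) {
    phase1<<<numBlocks, blockSize>>>(d_data);
    // No explicit synchronization needed between the launches:
    // phase2 will not start until every block of phase1 has finished.
    phase2<<<numBlocks, blockSize>>>(d_data);
}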

Can I run a CUDA device function without parallelization or calling it as part of a kernel?

I have a program that loads an image onto a CUDA device, analyzes it with cufft and some custom stuff, and updates a single number on the device which the host then queries as needed. The analysis is mostly parallelized, but the last step sums everything up (using thrust::reduce) for a couple final calculations that aren't parallel.
Once everything is reduced, there's nothing to parallelize, but I can't figure out how to just run a device function without calling it as its own tiny kernel with <<<1, 1>>>. That seems like a hack. Is there a better way to do this? Maybe a way to tell the parallelized kernel "just do these last lines once after the parallel part is finished"?
I feel like this must have been asked before, but I can't find it. Might just not know what to search for though.
Code snip below, I hope I didn't remove anything relevant:
float *d_phs_deltas;        // Allocated using cudaMalloc (data is on device)
__device__ float d_Z;

static __global__ void getDists(const cufftComplex* data, const bool* valid, float* phs_deltas)
{
    const int i = blockIdx.x*blockDim.x + threadIdx.x;

    // Do stuff with the line indicated by index i
    // ...

    // Save result into array, gets reduced to single number in setDist
    phs_deltas[i] = phs_delta;
}

static __global__ void setDist(const cufftComplex* data, const bool* valid, const float* phs_deltas)
{
    // Final step; does it need to be its own kernel if it only runs once??
    d_Z += phs2dst * thrust::reduce(thrust::device, phs_deltas, phs_deltas + d_y);

    // Save some other stuff to refer to next frame
    // ...
}

void fftExec(unsigned __int32 *host_data)
{
    // Copy image to device, do FFT, etc
    // ...

    // Last parallel analysis step, sets d_phs_deltas
    getDists<<<out_blocks, N_THREADS>>>(d_result, d_valid, d_phs_deltas);

    // Should this be a serial part at the end of getDists somehow?
    setDist<<<1, 1>>>(d_result, d_valid, d_phs_deltas);
}

// d_Z is copied out only on request
void getZ(float *Z) { cudaMemcpyFromSymbol(Z, d_Z, sizeof(float)); }
Thank you!
There is no way to run a device function directly without launching a kernel. As pointed out in comments, there is a working example in the Programming Guide which shows how to use memory fence functions and an atomically incremented counter to signal that a given block is the last block:
__device__ unsigned int count = 0;

__global__ void sum(const float* array, unsigned int N, volatile float* result)
{
    __shared__ bool isLastBlockDone;

    float partialSum = calculatePartialSum(array, N);

    if (threadIdx.x == 0) {
        result[blockIdx.x] = partialSum;

        // Thread 0 makes sure that the incrementation
        // of the "count" variable is only performed after
        // the partial sum has been written to global memory.
        __threadfence();

        // Thread 0 signals that it is done.
        unsigned int value = atomicInc(&count, gridDim.x);

        // Thread 0 determines if its block is the last
        // block to be done.
        isLastBlockDone = (value == (gridDim.x - 1));
    }

    // Synchronize to make sure that each thread reads
    // the correct value of isLastBlockDone.
    __syncthreads();

    if (isLastBlockDone) {
        // The last block sums the partial sums
        // stored in result[0 .. gridDim.x-1]
        float totalSum = calculateTotalSum(result);

        if (threadIdx.x == 0) {
            // Thread 0 of last block stores the total sum
            // to global memory and resets the count
            // variable, so that the next kernel call
            // works properly.
            result[0] = totalSum;
            count = 0;
        }
    }
}
I would recommend benchmarking both ways and choosing whichever is faster. On most platforms, kernel launch latency is only a few microseconds, so a short-running kernel that finishes an action after a long-running kernel can be the most efficient way to get this done.
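If you do want to benchmark it, a sketch using CUDA events could look like the following fragment; out_blocks, N_THREADS and the kernel arguments are assumed to be the ones from the question's code:
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
getDists<<<out_blocks, N_THREADS>>>(d_result, d_valid, d_phs_deltas);
setDist<<<1, 1>>>(d_result, d_valid, d_phs_deltas);   // tiny tail kernel
cudaEventRecord(stop);

cudaEventSynchronize(stop);
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("getDists + setDist<<<1,1>>>: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);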

CUDA/C - Using malloc in kernel functions gives strange results

I'm new to CUDA/C and new to Stack Overflow. This is my first question.
I'm trying to allocate memory dynamically in a kernel function, but the results are unexpected.
I read that using malloc() in a kernel can lower performance a lot, but I need it anyway, so I first tried with a simple int ** array just to test the possibility; then I'll actually need to allocate more complex structs.
In my main I used cudaMalloc() to allocate the space for the array of int *, and then I used malloc() for every thread in the kernel function to allocate the array for every index of the outer array. I then used another kernel with a single thread to check the result, but it doesn't always work.
Here's main code:
#define N_CELLE 1024*2
#define L_CELLE 512

extern "C" {

int main(int argc, char **argv) {
    int *result = (int *)malloc(sizeof(int));
    int *d_result;
    int size_numbers = N_CELLE * sizeof(int *);
    int **d_numbers;

    cudaMalloc((void **)&d_numbers, size_numbers);
    cudaMalloc((void **)&d_result, sizeof(int *));

    kernel_one<<<2, 1024>>>(d_numbers);
    cudaDeviceSynchronize();
    kernel_two<<<1, 1>>>(d_numbers, d_result);

    cudaMemcpy(result, d_result, sizeof(int), cudaMemcpyDeviceToHost);
    printf("%d\n", *result);

    cudaFree(d_numbers);
    cudaFree(d_result);
    free(result);
}

}
I used extern "C" because I couldn't compile while importing my header, which is not used in this example code. I mention it since I don't know whether it may be relevant or not.
This is kernel_one code:
__global__ void kernel_one(int **d_numbers) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    d_numbers[i] = (int *)malloc(L_CELLE*sizeof(int));
    for(int j=0; j<L_CELLE;j++)
        d_numbers[i][j] = 1;
}
And this is kernel_two code:
__global__ void kernel_two(int **d_numbers, int *d_result) {
    int temp = 0;
    for(int i=0; i<N_CELLE; i++) {
        for(int j=0; j<L_CELLE;j++)
            temp += d_numbers[i][j];
    }
    *d_result = temp;
}
Everything works fine (i.e. the count is correct) as long as I allocate no more than 1024*2*512 ints in total in device memory. For example, if I #define N_CELLE 1024*4 the program starts giving "random" results, such as negative numbers.
Any idea what the problem could be?
Thanks anyone!
In-kernel memory allocation draws memory from a statically allocated runtime heap. At larger sizes, you are exceeding the size of that heap and then your two kernels are attempting to read and write from uninitialised memory. This produces a runtime error on the device and renders the results invalid. You would already know this if you either added correct API error checking on the host side, or ran your code with the cuda-memcheck utility.
The solution is to ensure that the heap size is set to something appropriate before trying to run a kernel. Adding something like this:
size_t heapsize = sizeof(int) * size_t(N_CELLE) * size_t(2*L_CELLE);
cudaDeviceSetLimit(cudaLimitMallocHeapSize, heapsize);
to your host code before any other API calls, should solve the problem.
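Independently of the heap size, it is also worth checking the allocation inside the kernel: device-side malloc returns NULL when the heap is exhausted, so a check like this sketch (based on the question's kernel_one) turns silent memory corruption into an obvious message:
#include <cstdio>

#define L_CELLE 512    // value taken from the question

__global__ void kernel_one(int **d_numbers) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    d_numbers[i] = (int *)malloc(L_CELLE * sizeof(int));
    if (d_numbers[i] == NULL) {               // device malloc signals failure with NULL
        printf("thread %d: device malloc failed\n", i);
        return;                               // don't write through a null pointer
    }
    for (int j = 0; j < L_CELLE; j++)
        d_numbers[i][j] = 1;
}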
I don't know anything about CUDA but these are severe bugs:
You cannot convert from int** to void**. They are not compatible types. Casting doesn't solve the problem, but hides it.
&d_numbers gives the address of a pointer to pointer which is wrong. It is of type int***.
Both of the above bugs result in undefined behavior. If your program somehow seems to work under some conditions, that's just pure (bad) luck.

cuda - memory allocation crashes

OK, I am trying to allocate an array of structs on the GPU, and it crashes (gives the "stopped working" message).
This is the struct:
typedef struct point_t {
    int id;
    float x, y;
} point;
This is a part of the CUDA code:
cudaError_t d_LoadPoints(point* points, int n, int chunkSize){
    // Error code to check return values for CUDA calls
    cudaError_t err = cudaSuccess;

    int nBytes = n * sizeof(point);

    // Allocate the device input points array
    point* d_points;
    err = cudaMalloc((void** )&d_points, nBytes);
    if (err != cudaSuccess)
    {
        fprintf(stderr, "Failed to allocate device vector points (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    cudaMemcpy(d_points, points, nBytes, cudaMemcpyHostToDevice);
    puts("memory allocated successfully");

    return err;   // the function is declared cudaError_t, so return the status
}
I tried printing the first elements of the points array, as well as n and chunkSize, and they come out correctly.
This is the point it seems to crash at (I disabled the rest).
It crashes regardless of the debug prints.
The only thing I can think of is the size.
n is 250,000 and chunkSize is 64,000, which I was planning to allocate across 125 blocks with 512 threads each.
I have no idea if this is a good idea or not, but this is a side topic, since I can't even reach the kernel call.
Restarting Visual Studio fixed the problem.

Implementing a critical section in CUDA

I'm trying to implement a critical section in CUDA using atomic instructions, but I ran into some trouble. I have created the test program to show the problem:
#include <cuda_runtime.h>
#include <cutil_inline.h>
#include <stdio.h>

__global__ void k_testLocking(unsigned int* locks, int n) {
    int id = threadIdx.x % n;
    while (atomicExch(&(locks[id]), 1u) != 0u) {} //lock
    //critical section would go here
    atomicExch(&(locks[id]), 0u); //unlock
}

int main(int argc, char** argv) {
    //initialize the locks array on the GPU to (0...0)
    unsigned int* locks;
    unsigned int zeros[10]; for (int i = 0; i < 10; i++) {zeros[i] = 0u;}
    cutilSafeCall(cudaMalloc((void**)&locks, sizeof(unsigned int)*10));
    cutilSafeCall(cudaMemcpy(locks, zeros, sizeof(unsigned int)*10, cudaMemcpyHostToDevice));

    //Run the kernel:
    k_testLocking<<<dim3(1), dim3(256)>>>(locks, 10);

    //Check the error messages:
    cudaError_t error = cudaGetLastError();
    cutilSafeCall(cudaFree(locks));
    if (cudaSuccess != error) {
        printf("error 1: CUDA ERROR (%d) {%s}\n", error, cudaGetErrorString(error));
        exit(-1);
    }
    return 0;
}
This code, unfortunately, hard freezes my machine for several seconds and finally exits, printing out the message:
fcudaSafeCall() Runtime API error in file <XXX.cu>, line XXX : the launch timed out and was terminated.
which means that one of those while loops is not returning, but it seems like this should work.
As a reminder atomicExch(unsigned int* address, unsigned int val) atomically sets the value of the memory location stored in address to val and returns the old value. So the idea behind my locking mechanism is that it is initially 0u, so one thread should get past the while loop and all other threads should wait on the while loop since they will read locks[id] as 1u. Then when the thread is done with the critical section, it resets the lock to 0u so another thread can enter.
What am I missing?
By the way, I am compiling with:
nvcc -arch sm_11 -Ipath/to/cuda/C/common/inc XXX.cu
Okay, I figured it out, and this is yet another one of the CUDA-paradigm pains.
As any good CUDA programmer knows (notice that I did not remember this, which makes me a bad CUDA programmer, I think), all threads in a warp must execute the same code. The code I wrote would work perfectly if not for this fact. As it is, however, there are likely to be two threads in the same warp accessing the same lock. If one of them acquires the lock, it exits the loop, but it cannot continue past the loop until all other threads in its warp have also completed the loop. Unfortunately, the other thread will never complete, because it is waiting for the first one to unlock.
Here is a kernel that will do the trick without error:
__global__ void k_testLocking(unsigned int* locks, int n) {
    int id = threadIdx.x % n;
    bool leaveLoop = false;
    while (!leaveLoop) {
        if (atomicExch(&(locks[id]), 1u) == 0u) {
            //critical section
            leaveLoop = true;
            atomicExch(&(locks[id]), 0u);
        }
    }
}
The poster has already found an answer to his own issue. Nevertheless, in the code below, I'm providing a general framework to implement a critical section in CUDA. More in detail, the code performs block counting, but it is easily modifiable to host other operations to be performed in a critical section. Below, I'm also reporting some explanation of the code, along with some "typical" mistakes in the implementation of critical sections in CUDA.
THE CODE
#include <stdio.h>

#include "Utilities.cuh"

#define NUMBLOCKS  512
#define NUMTHREADS 512 * 2

/***************/
/* LOCK STRUCT */
/***************/
struct Lock {

    int *d_state;

    // --- Constructor
    Lock(void) {
        int h_state = 0;                                        // --- Host side lock state initializer
        gpuErrchk(cudaMalloc((void **)&d_state, sizeof(int)));  // --- Allocate device side lock state
        gpuErrchk(cudaMemcpy(d_state, &h_state, sizeof(int), cudaMemcpyHostToDevice)); // --- Initialize device side lock state
    }

    // --- Destructor
    __host__ __device__ ~Lock(void) {
#if !defined(__CUDACC__)
        gpuErrchk(cudaFree(d_state));
#else
#endif
    }

    // --- Lock function
    __device__ void lock(void) { while (atomicCAS(d_state, 0, 1) != 0); }

    // --- Unlock function
    __device__ void unlock(void) { atomicExch(d_state, 0); }
};

/*************************************/
/* BLOCK COUNTER KERNEL WITHOUT LOCK */
/*************************************/
__global__ void blockCountingKernelNoLock(int *numBlocks) {
    if (threadIdx.x == 0) { numBlocks[0] = numBlocks[0] + 1; }
}

/**********************************/
/* BLOCK COUNTER KERNEL WITH LOCK */
/**********************************/
__global__ void blockCountingKernelLock(Lock lock, int *numBlocks) {
    if (threadIdx.x == 0) {
        lock.lock();
        numBlocks[0] = numBlocks[0] + 1;
        lock.unlock();
    }
}

/****************************************/
/* BLOCK COUNTER KERNEL WITH WRONG LOCK */
/****************************************/
__global__ void blockCountingKernelDeadlock(Lock lock, int *numBlocks) {
    lock.lock();
    if (threadIdx.x == 0) { numBlocks[0] = numBlocks[0] + 1; }
    lock.unlock();
}

/********/
/* MAIN */
/********/
int main(){

    int h_counting, *d_counting;
    Lock lock;

    gpuErrchk(cudaMalloc(&d_counting, sizeof(int)));

    // --- Unlocked case
    h_counting = 0;
    gpuErrchk(cudaMemcpy(d_counting, &h_counting, sizeof(int), cudaMemcpyHostToDevice));

    blockCountingKernelNoLock<<<NUMBLOCKS, NUMTHREADS>>>(d_counting);
    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());

    gpuErrchk(cudaMemcpy(&h_counting, d_counting, sizeof(int), cudaMemcpyDeviceToHost));
    printf("Counting in the unlocked case: %i\n", h_counting);

    // --- Locked case
    h_counting = 0;
    gpuErrchk(cudaMemcpy(d_counting, &h_counting, sizeof(int), cudaMemcpyHostToDevice));

    blockCountingKernelLock<<<NUMBLOCKS, NUMTHREADS>>>(lock, d_counting);
    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());

    gpuErrchk(cudaMemcpy(&h_counting, d_counting, sizeof(int), cudaMemcpyDeviceToHost));
    printf("Counting in the locked case: %i\n", h_counting);

    gpuErrchk(cudaFree(d_counting));
}
CODE EXPLANATION
Critical sections are sequences of operations that must be executed sequentially by the CUDA threads.
Suppose we want to construct a kernel whose task is to compute the number of thread blocks in a thread grid. One possible idea is to let the thread with threadIdx.x == 0 in each block increase a global counter. To prevent race conditions, all the increments must occur sequentially, so they must be incorporated in a critical section.
The above code has two kernel functions: blockCountingKernelNoLock and blockCountingKernelLock. The former does not use a critical section to increase the counter and, as one can see, returns wrong results. The latter encapsulates the counter increase within a critical section and so produces correct results. But how does the critical section work?
The critical section is governed by a global state d_state. Initially, the state is 0. Furthermore, two __device__ methods, lock and unlock, can change this state. The lock and unlock methods can be invoked only by a single thread within each block and, in particular, by the thread having local thread index threadIdx.x == 0.
Randomly during the execution, one of the threads having local thread index threadIdx.x == 0 and global thread index, say, t will be the first to invoke the lock method. In particular, it will launch atomicCAS(d_state, 0, 1). Since initially d_state == 0, d_state will be updated to 1, atomicCAS will return 0, and the thread will exit the lock function, passing on to the update instruction. While that thread performs the mentioned operations, all the other threads with threadIdx.x == 0 in the other blocks will execute the lock method. They will, however, find a value of d_state equal to 1, so atomicCAS(d_state, 0, 1) will perform no update and will return 1, leaving these threads spinning in the while loop. After thread t accomplishes the update, it executes the unlock function, namely atomicExch(d_state, 0), thus restoring d_state to 0. At this point, randomly, another of the threads with threadIdx.x == 0 will lock the state again.
The above code also contains a third kernel function, namely blockCountingKernelDeadlock. However, this is another wrong implementation of the critical section, leading to deadlocks. Indeed, we recall that warps operate in lockstep and synchronize after every instruction. So, when we execute blockCountingKernelDeadlock, there is the possibility that one of the threads in a warp, say a thread with local thread index t≠0, will lock the state. Under this circumstance, the other threads in the same warp as t, including the one with threadIdx.x == 0, will execute the same while loop statement as thread t, since threads in the same warp execute in lockstep. Accordingly, all the threads will wait for someone to unlock the state, but no other thread will be able to do so, and the code will be stuck in a deadlock.
By the way, you have to remember that global memory writes and reads aren't necessarily completed at the point where you write them in the code, so for this to work in practice you need to add a global memory fence, i.e. __threadfence().
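To make that concrete, a hedged sketch of where the fences could go in the Lock struct from the answer above (only the two methods are shown; the rest of the struct stays as posted):
// Fragment of the Lock struct: fences added around the critical section.
__device__ void lock(void) {
    while (atomicCAS(d_state, 0, 1) != 0) { /* spin */ }
    __threadfence();   // make the previous owner's writes visible before we proceed
}

__device__ void unlock(void) {
    __threadfence();   // flush this thread's writes before releasing the lock
    atomicExch(d_state, 0);
}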