CUDA atomic lock: threads in sequence

I have code in which a section needs to be executed critically. I am using a lock for that piece of code so that each thread of the kernel (set up with one thread per block) executes it atomically. What bothers me is the order of the threads: I need the threads to execute in chronological order according to their indices (or actually, in order of their blockIdx), from 0 up to, say, 10 (instead of randomly, e.g. 5, 8, 3, 0, ... etc.). Is it possible to do that?
Here is an example code:
#include<stdio.h>
#include<stdlib.h>
#include<math.h>
#include<math_functions.h>
#include<time.h>
#include<cuda.h>
#include<cuda_runtime.h>
// number of blocks
#define nob 10
struct Lock{
    int *mutex;
    Lock(void){
        int state = 0;
        cudaMalloc((void**) &mutex, sizeof(int));
        cudaMemcpy(mutex, &state, sizeof(int), cudaMemcpyHostToDevice);
    }
    ~Lock(void){
        cudaFree(mutex);
    }
    __device__ void lock(void){
        while(atomicCAS(mutex, 0, 1) != 0);
    }
    __device__ void unlock(void){
        atomicExch(mutex, 0);
    }
};

__global__ void theKernel(Lock myLock){
    int index = blockIdx.x; //using only one thread per block

    // execute some parallel code

    // critical section of code (thread with index=0 needs to start, followed by index=1, etc.)
    myLock.lock();
    printf("Thread with index=%i inside critical section now...\n", index);
    myLock.unlock();
}

int main(void)
{
    Lock myLock;
    theKernel<<<nob, 1>>>(myLock);
    return 0;
}
which gives the following results:
Thread with index=1 inside critical section now...
Thread with index=0 inside critical section now...
Thread with index=5 inside critical section now...
Thread with index=9 inside critical section now...
Thread with index=7 inside critical section now...
Thread with index=6 inside critical section now...
Thread with index=3 inside critical section now...
Thread with index=2 inside critical section now...
Thread with index=8 inside critical section now...
Thread with index=4 inside critical section now...
I want these indices to start from 0 and execute chronologically to 9.
One way I thought to modify the Lock to achieve this is as follows:
struct Lock{
    int *indexAllow;
    Lock(void){
        int startVal = 0;
        cudaMalloc((void**) &indexAllow, sizeof(int));
        cudaMemcpy(indexAllow, &startVal, sizeof(int), cudaMemcpyHostToDevice);
    }
    ~Lock(void){
        cudaFree(indexAllow);
    }
    __device__ void lock(int index){
        while(index!=*indexAllow);
    }
    __device__ void unlock(void){
        atomicAdd(indexAllow,1);
    }
};
and then to just initialize the lock by passing the index as an argument:
myLock.lock(index);
but this stalls my pc... I'm probably missing something obvious.
If anyone can help I'd appreciate it!
Thanks!!!

I changed your code a bit. Now it produces your desired output:
#include<stdio.h>
#include<stdlib.h>
#include<math.h>
#include<math_functions.h>
#include<time.h>
#include<cuda.h>
#include<cuda_runtime.h>
// number of blocks
#define nob 10
struct Lock{
    int *mutex;
    Lock(void){
        int state = 0;
        cudaMalloc((void**) &mutex, sizeof(int));
        cudaMemcpy(mutex, &state, sizeof(int), cudaMemcpyHostToDevice);
    }
    ~Lock(void){
        cudaFree(mutex);
    }
    __device__ void lock(uint compare){
        while(atomicCAS(mutex, compare, 0xFFFFFFFF) != compare); // 0xFFFFFFFF is just a very large number. The point is no block index can be this big (currently).
    }
    __device__ void unlock(uint val){
        atomicExch(mutex, val+1);
    }
};

__global__ void theKernel(Lock myLock){
    int index = blockIdx.x; //using only one thread per block

    // execute some parallel code

    // critical section of code (thread with index=0 needs to start, followed by index=1, etc.)
    myLock.lock(index);
    printf("Thread with index=%i inside critical section now...\n", index);
    __threadfence_system(); // For the printf. I'm not sure __threadfence_system() can guarantee the order for calls to printf().
    myLock.unlock(index);
}

int main(void)
{
    Lock myLock;
    theKernel<<<nob, 1>>>(myLock);
    return 0;
}
The lock() function accepts compare as the parameter and checks whether it is equal to the value already in mutex. If yes, it puts 0xFFFFFFFF into the mutex to indicate the lock is acquired by a thread. Because the mutex is initialized to 0 in the constructor, only the thread with block ID 0 can succeed in acquiring the lock at first. In unlock, we place the next block ID index into the mutex to guarantee your desired ordering. Also, because you have used printf() inside the CUDA kernel, I think a call to __threadfence_system() is required for you to see the messages in the output in the same order.
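For what it's worth, your second attempt (spinning on *indexAllow) was close; one likely reason it hangs is that the plain read in while(index != *indexAllow); can be cached or hoisted out of the loop, so the spinning thread never sees the update. Below is a minimal sketch of the same "take turns by index" idea, using an atomic read to force a fresh load on every iteration. The helper names and the turn counter are hypothetical, not part of your code:
__device__ void lock_in_order(int index, int *turn){
    // atomicAdd(turn, 0) is an atomic read; it re-reads the counter on every iteration
    while (atomicAdd(turn, 0) != index);
}
__device__ void unlock_in_order(int *turn){
    atomicAdd(turn, 1); // hand the turn to the next block index
}
Here turn would be a device int initialized to 0, just like indexAllow in your version.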

Related

Is there a way to prevent a memory address being accessed?

Is it possible to prevent a memory address from being accessed by other threads for some period? For example:
__global__ void func(int* a){
    // other computation
    __lock_address(a);
    a[0] += threadIdx.x;
    __unlock_address(a);
}
The first thread that finishes the other computations and reaches __lock_address will lock that memory address until __unlock_address is called; any other thread that reaches __lock_address will have to wait until the first thread unlocks it.
The above example is basically equivalent to atomicAdd, but what if I want to do more complicated computation rather than a simple addition?
Edit:
mutex is initialized to 0, a is initialized to -1
__global__ void func(int *a, int *mutex){
    a[0] = atomicCAS(mutex, 0, 1); // a[0] = 1
}
If I do this, a[0] is equal to 1, but it should be 0, since that is the old value of mutex.
__global__ void func(int *a, int *mutex){
    a[0] = mutex[0]; // a[0] = 0
}
This is a sanity check: the value at a[0] is 0 now, which means mutex is initialized to 0 correctly.
You can use a mutex to protect multithreaded access to the memory region. The CUDA Programming Guide has a nice example of using atomic operations to implement one (https://docs.nvidia.com/cuda/cuda-c-programming-guide/#scheduling-example):
__device__ void mutex_lock(unsigned int *mutex) {
    unsigned int ns = 8;
    while (atomicCAS(mutex, 0, 1) == 1) {
        __nanosleep(ns);
        if (ns < 256) {
            ns *= 2;
        }
    }
}

__device__ void mutex_unlock(unsigned int *mutex) {
    atomicExch(mutex, 0);
}
Ok, I figured out what's wrong. Essentially, only the first thread in the thread block gets the old value 0 while simultaneously setting mutex to 1; the other threads read mutex after it has been set to 1 by the first thread and then get stuck in a deadlock.
I found this solution that worked for me.
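In case it helps others, here is a minimal sketch of how that deadlock can be avoided with the mutex_lock/mutex_unlock helpers above: let only one thread per block (the "leader") contend for the lock, so no other lane in its warp spins on the same mutex, and let the rest of the block wait for it. This kernel is a hypothetical illustration, not the exact code from the question:
__global__ void func(int *a, unsigned int *mutex) {
    if (threadIdx.x == 0) {      // only the block leader takes the lock
        mutex_lock(mutex);
        a[0] += 1;               // any more complicated critical-section work would go here
        mutex_unlock(mutex);
    }
    __syncthreads();             // the rest of the block waits for the leader
}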

Can I run a CUDA device function without parallelization or calling it as part of a kernel?

I have a program that loads an image onto a CUDA device, analyzes it with cufft and some custom stuff, and updates a single number on the device which the host then queries as needed. The analysis is mostly parallelized, but the last step sums everything up (using thrust::reduce) for a couple final calculations that aren't parallel.
Once everything is reduced, there's nothing to parallelize, but I can't figure out how to just run a device function without calling it as its own tiny kernel with <<<1, 1>>>. That seems like a hack. Is there a better way to do this? Maybe a way to tell the parallelized kernel "just do these last lines once after the parallel part is finished"?
I feel like this must have been asked before, but I can't find it. Might just not know what to search for though.
Code snip below, I hope I didn't remove anything relevant:
float *d_phs_deltas; // Allocated using cudaMalloc (data is on device)
__device__ float d_Z;

static __global__ void getDists(const cufftComplex* data, const bool* valid, float* phs_deltas)
{
    const int i = blockIdx.x*blockDim.x + threadIdx.x;

    // Do stuff with the line indicated by index i
    // ...

    // Save result into array, gets reduced to single number in setDist
    phs_deltas[i] = phs_delta;
}

static __global__ void setDist(const cufftComplex* data, const bool* valid, const float* phs_deltas)
{
    // Final step; does it need to be its own kernel if it only runs once??
    d_Z += phs2dst * thrust::reduce(thrust::device, phs_deltas, phs_deltas + d_y);

    // Save some other stuff to refer to next frame
    // ...
}

void fftExec(unsigned __int32 *host_data)
{
    // Copy image to device, do FFT, etc
    // ...

    // Last parallel analysis step, sets d_phs_deltas
    getDists<<<out_blocks, N_THREADS>>>(d_result, d_valid, d_phs_deltas);

    // Should this be a serial part at the end of getDists somehow?
    setDist<<<1, 1>>>(d_result, d_valid, d_phs_deltas);
}

// d_Z is copied out only on request
void getZ(float *Z) { cudaMemcpyFromSymbol(Z, d_Z, sizeof(float)); }
Thank you!
There is no way to run a device function directly without launching a kernel. As pointed out in comments, there is a working example in the Programming Guide which shows how to use memory fence functions and an atomically incremented counter to signal that a given block is the last block:
__device__ unsigned int count = 0;

__global__ void sum(const float* array, unsigned int N, volatile float* result)
{
    __shared__ bool isLastBlockDone;

    float partialSum = calculatePartialSum(array, N);

    if (threadIdx.x == 0) {
        result[blockIdx.x] = partialSum;

        // Thread 0 makes sure that the incrementation
        // of the "count" variable is only performed after
        // the partial sum has been written to global memory.
        __threadfence();

        // Thread 0 signals that it is done.
        unsigned int value = atomicInc(&count, gridDim.x);

        // Thread 0 determines if its block is the last
        // block to be done.
        isLastBlockDone = (value == (gridDim.x - 1));
    }

    // Synchronize to make sure that each thread reads
    // the correct value of isLastBlockDone.
    __syncthreads();

    if (isLastBlockDone) {
        // The last block sums the partial sums
        // stored in result[0 .. gridDim.x-1]
        float totalSum = calculateTotalSum(result);

        if (threadIdx.x == 0) {
            // Thread 0 of the last block stores the total sum
            // to global memory and resets the count
            // variable, so that the next kernel call
            // works properly.
            result[0] = totalSum;
            count = 0;
        }
    }
}
I would recommend benchmarking both ways and choosing which is faster. On most platforms kernel launch latency is only a few microseconds, so a short running kernel to finish an action after a long running kernel can be the most efficient way to get this done.
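As a rough sketch of how such a benchmark could look with CUDA events (reusing the kernel names and launch parameters from the question; error checking omitted):
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
getDists<<<out_blocks, N_THREADS>>>(d_result, d_valid, d_phs_deltas);
setDist<<<1, 1>>>(d_result, d_valid, d_phs_deltas);   // the tiny "serial" kernel
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("two-kernel version: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
Timing the single-kernel alternative (with the last-block reduction folded into getDists) the same way lets you compare the two directly.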

Does only one Thread executing a Kernel implement the declaration of an array in Shared Memory in CUDA

I am implementing the following CUDA kernel that stores an array in Shared Memory:
// Difference between adjacent array elements
__global__ void kernel( int* in, int* out ) {
int i = threadIdx.x + blockDim.x * blockIdx.x;
// Allocate a shared array, one element per thread
__shared__ int sh_arr[BOCK_SIZE];
// each thread reads one element to sh_arr
sh_arr[i] = in[i];
// Ensure reads from all Threads in Block complete before continuing
__syncthreads();
if( i > 0 )
out[i] = sh_arr[i] - sh_arr[i-1];
// Ensure writes from all Threads in Block complete before continuing
__syncthreads();
}
BLOCK_SIZE is a constant declared outside the kernel.
It seems like every Thread that executes this Kernel will create a new array because every Thread that executes this Kernel will see this line:
__shared__ int sh_arr[BLOCK_SIZE];
Is it the case that only the first Thread that executes this Kernel will "see" this line, and all subsequent Threads overlook this line?
Shared variables in CUDA are shared between threads in the same block. I don't know exactly how it is done under the hood, but every thread in a thread block will see the line __shared__ int sh_arr[BLOCK_SIZE]; however, because of the __shared__ modifier, only one copy of the array exists per block, and all the threads of that block use it.
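A tiny sketch that illustrates the point (a hypothetical kernel, assuming a launch with a single block of at most 256 threads): every thread writes its own slot of the one per-block array and can then read a slot written by a different thread.
__global__ void sharedDemo(int *out) {
    __shared__ int sh[256];            // one array per block, not one per thread
    sh[threadIdx.x] = threadIdx.x;     // each thread fills its own slot
    __syncthreads();                   // make all writes visible to the whole block
    out[threadIdx.x] = sh[(threadIdx.x + 1) % blockDim.x];  // read a neighbour's slot
}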

Making CUB blockradixsort on-chip entirely?

I am reading the CUB documentations and examples:
#include <cub/cub.cuh>   // or equivalently <cub/block/block_radix_sort.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize BlockRadixSort for 128 threads owning 4 integer items each
    typedef cub::BlockRadixSort<int, 128, 4> BlockRadixSort;

    // Allocate shared memory for BlockRadixSort
    __shared__ typename BlockRadixSort::TempStorage temp_storage;

    // Obtain a segment of consecutive items that are blocked across threads
    int thread_keys[4];
    ...

    // Collectively sort the keys
    BlockRadixSort(temp_storage).Sort(thread_keys);
    ...
}
In the example, each thread has 4 keys. It looks like thread_keys will be allocated in (off-chip) local memory. If I only have 1 key per thread, could I declare "int thread_key;" and keep this variable in a register only?
BlockRadixSort(temp_storage).Sort() takes a pointer to the key as a parameter. Does that mean the keys have to be in global memory?
I would like to use this code but I want each thread to hold one key in register and keep it on-chip in register/shared memory after they are sorted.
Thanks in advance!
You can do this using shared memory (which will keep it "on-chip"). I'm not sure I know how to do it using strictly registers without de-constructing the BlockRadixSort object.
Here's an example code that uses shared memory to hold the initial data to be sorted, and the final sorted results. This sample is mostly set up for one data element per thread, since that seems to be what you are asking for. It's not difficult to extend it to multiple elements per thread, and I have put most of the plumbing in place to do that, with the exception of the data synthesis and debug printouts:
#include <cub/cub.cuh>
#include <stdio.h>

#define nTPB 32
#define ELEMS_PER_THREAD 1

// Block-sorting CUDA kernel (nTPB threads each owning ELEMS_PER_THREAD integers)
__global__ void BlockSortKernel()
{
    __shared__ int my_val[nTPB*ELEMS_PER_THREAD];

    using namespace cub;

    // Specialize BlockRadixSort collective types
    typedef BlockRadixSort<int, nTPB, ELEMS_PER_THREAD> my_block_sort;

    // Allocate shared memory for collectives
    __shared__ typename my_block_sort::TempStorage sort_temp_stg;

    // need to extend synthetic data for ELEMS_PER_THREAD > 1
    my_val[threadIdx.x*ELEMS_PER_THREAD] = (threadIdx.x + 5)%nTPB; // synth data
    __syncthreads();
    printf("thread %d data = %d\n", threadIdx.x, my_val[threadIdx.x*ELEMS_PER_THREAD]);

    // Collectively sort the keys
    my_block_sort(sort_temp_stg).Sort(*static_cast<int(*)[ELEMS_PER_THREAD]>(static_cast<void*>(my_val+(threadIdx.x*ELEMS_PER_THREAD))));
    __syncthreads();

    printf("thread %d sorted data = %d\n", threadIdx.x, my_val[threadIdx.x*ELEMS_PER_THREAD]);
}

int main(){
    BlockSortKernel<<<1,nTPB>>>();
    cudaDeviceSynchronize();
}
This seems to work correctly for me; in this case I happened to be using RHEL 5.5/gcc 4.1.2, CUDA 6.0 RC, and CUB v1.2.0 (which is quite recent).
The strange/ugly static casting is needed, as far as I can tell, because the CUB Sort expects a reference to an array of length equal to the customization parameter ITEMS_PER_THREAD (i.e. ELEMS_PER_THREAD):
__device__ __forceinline__ void Sort(
    Key (&keys)[ITEMS_PER_THREAD],
    int begin_bit = 0,
    int end_bit = sizeof(Key) * 8)
{ ...
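As a side note, if each thread really does own just one key, you can avoid the cast entirely by keeping the key in a per-thread local array of length ELEMS_PER_THREAD and passing that array to Sort() directly; the sorted key then stays in the thread's local storage (registers, subject to the compiler's decisions). A sketch, reusing nTPB and ELEMS_PER_THREAD from the example above:
__global__ void BlockSortKernelLocal()
{
    typedef cub::BlockRadixSort<int, nTPB, ELEMS_PER_THREAD> my_block_sort;
    __shared__ typename my_block_sort::TempStorage sort_temp_stg;

    int thread_keys[ELEMS_PER_THREAD];            // one key per thread, held locally
    thread_keys[0] = (threadIdx.x + 5) % nTPB;    // synth data, as above

    my_block_sort(sort_temp_stg).Sort(thread_keys);  // no static_cast needed

    printf("thread %d sorted key = %d\n", threadIdx.x, thread_keys[0]);
}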

Implementing a critical section in CUDA

I'm trying to implement a critical section in CUDA using atomic instructions, but I ran into some trouble. I have created the test program to show the problem:
#include <cuda_runtime.h>
#include <cutil_inline.h>
#include <stdio.h>

__global__ void k_testLocking(unsigned int* locks, int n) {
    int id = threadIdx.x % n;
    while (atomicExch(&(locks[id]), 1u) != 0u) {} //lock
    //critical section would go here
    atomicExch(&(locks[id]), 0u); //unlock
}

int main(int argc, char** argv) {
    //initialize the locks array on the GPU to (0...0)
    unsigned int* locks;
    unsigned int zeros[10]; for (int i = 0; i < 10; i++) { zeros[i] = 0u; }
    cutilSafeCall(cudaMalloc((void**)&locks, sizeof(unsigned int)*10));
    cutilSafeCall(cudaMemcpy(locks, zeros, sizeof(unsigned int)*10, cudaMemcpyHostToDevice));

    //Run the kernel:
    k_testLocking<<<dim3(1), dim3(256)>>>(locks, 10);

    //Check the error messages:
    cudaError_t error = cudaGetLastError();
    cutilSafeCall(cudaFree(locks));
    if (cudaSuccess != error) {
        printf("error 1: CUDA ERROR (%d) {%s}\n", error, cudaGetErrorString(error));
        exit(-1);
    }
    return 0;
}
This code, unfortunately, hard freezes my machine for several seconds and finally exits, printing out the message:
fcudaSafeCall() Runtime API error in file <XXX.cu>, line XXX : the launch timed out and was terminated.
which means that one of those while loops is not returning, but it seems like this should work.
As a reminder atomicExch(unsigned int* address, unsigned int val) atomically sets the value of the memory location stored in address to val and returns the old value. So the idea behind my locking mechanism is that it is initially 0u, so one thread should get past the while loop and all other threads should wait on the while loop since they will read locks[id] as 1u. Then when the thread is done with the critical section, it resets the lock to 0u so another thread can enter.
What am I missing?
By the way, I am compiling with:
nvcc -arch sm_11 -Ipath/to/cuda/C/common/inc XXX.cu
Okay, I figured it out, and this is yet-another-one-of-the-cuda-paradigm-pains.
As any good CUDA programmer knows (notice that I did not remember this, which makes me a bad CUDA programmer, I think), all threads in a warp must execute the same code. The code I wrote would work perfectly if not for this fact. As it is, however, there are likely to be two threads in the same warp accessing the same lock. If one of them acquires the lock, it is done with the loop, but it cannot continue past the loop until all the other threads in its warp have also completed the loop. Unfortunately, the other thread will never complete, because it is waiting for the first one to unlock.
Here is a kernel that will do the trick without error:
__global__ void k_testLocking(unsigned int* locks, int n) {
int id = threadIdx.x % n;
bool leaveLoop = false;
while (!leaveLoop) {
if (atomicExch(&(locks[id]), 1u) == 0u) {
//critical section
leaveLoop = true;
atomicExch(&(locks[id]),0u);
}
}
}
The poster has already found an answer to his own issue. Nevertheless, in the code below, I'm providing a general framework to implement a critical section in CUDA. More in detail, the code performs block counting, but it is easily modifiable to host other operations to be performed in a critical section. Below, I also report some explanation of the code, together with some "typical" mistakes in the implementation of critical sections in CUDA.
THE CODE
#include <stdio.h>

#include "Utilities.cuh"

#define NUMBLOCKS  512
#define NUMTHREADS 512 * 2

/***************/
/* LOCK STRUCT */
/***************/
struct Lock {

    int *d_state;

    // --- Constructor
    Lock(void) {
        int h_state = 0;                                                                // --- Host side lock state initializer
        gpuErrchk(cudaMalloc((void **)&d_state, sizeof(int)));                          // --- Allocate device side lock state
        gpuErrchk(cudaMemcpy(d_state, &h_state, sizeof(int), cudaMemcpyHostToDevice));  // --- Initialize device side lock state
    }

    // --- Destructor
    __host__ __device__ ~Lock(void) {
#if !defined(__CUDACC__)
        gpuErrchk(cudaFree(d_state));
#else
#endif
    }

    // --- Lock function
    __device__ void lock(void) { while (atomicCAS(d_state, 0, 1) != 0); }

    // --- Unlock function
    __device__ void unlock(void) { atomicExch(d_state, 0); }
};

/*************************************/
/* BLOCK COUNTER KERNEL WITHOUT LOCK */
/*************************************/
__global__ void blockCountingKernelNoLock(int *numBlocks) {
    if (threadIdx.x == 0) { numBlocks[0] = numBlocks[0] + 1; }
}

/**********************************/
/* BLOCK COUNTER KERNEL WITH LOCK */
/**********************************/
__global__ void blockCountingKernelLock(Lock lock, int *numBlocks) {
    if (threadIdx.x == 0) {
        lock.lock();
        numBlocks[0] = numBlocks[0] + 1;
        lock.unlock();
    }
}

/****************************************/
/* BLOCK COUNTER KERNEL WITH WRONG LOCK */
/****************************************/
__global__ void blockCountingKernelDeadlock(Lock lock, int *numBlocks) {
    lock.lock();
    if (threadIdx.x == 0) { numBlocks[0] = numBlocks[0] + 1; }
    lock.unlock();
}

/********/
/* MAIN */
/********/
int main(){

    int h_counting, *d_counting;
    Lock lock;

    gpuErrchk(cudaMalloc(&d_counting, sizeof(int)));

    // --- Unlocked case
    h_counting = 0;
    gpuErrchk(cudaMemcpy(d_counting, &h_counting, sizeof(int), cudaMemcpyHostToDevice));

    blockCountingKernelNoLock<<<NUMBLOCKS, NUMTHREADS>>>(d_counting);
    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());

    gpuErrchk(cudaMemcpy(&h_counting, d_counting, sizeof(int), cudaMemcpyDeviceToHost));
    printf("Counting in the unlocked case: %i\n", h_counting);

    // --- Locked case
    h_counting = 0;
    gpuErrchk(cudaMemcpy(d_counting, &h_counting, sizeof(int), cudaMemcpyHostToDevice));

    blockCountingKernelLock<<<NUMBLOCKS, NUMTHREADS>>>(lock, d_counting);
    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());

    gpuErrchk(cudaMemcpy(&h_counting, d_counting, sizeof(int), cudaMemcpyDeviceToHost));
    printf("Counting in the locked case: %i\n", h_counting);

    gpuErrchk(cudaFree(d_counting));
}
CODE EXPLANATION
Critical sections are sequences of operations that must be executed sequentially by the CUDA threads.
Suppose you want to construct a kernel whose task is to compute the number of thread blocks in a thread grid. One possible idea is to let the thread with threadIdx.x == 0 in each block increase a global counter. To prevent race conditions, all the increases must occur sequentially, so they must be incorporated in a critical section.
The above code has two kernel functions: blockCountingKernelNoLock and blockCountingKernelLock. The former does not use a critical section to increase the counter and, as one can see, returns wrong results. The latter encapsulates the counter increase within a critical section and so produces correct results. But how does the critical section work?
The critical section is governed by a global state d_state. Initially, the state is 0. Furthermore, two __device__ methods, lock and unlock, can change this state. The lock and unlock methods should be invoked by only a single thread within each block, in particular by the thread having local thread index threadIdx.x == 0.
At some point during the execution, one of the threads having local thread index threadIdx.x == 0 and global thread index, say, t will be the first to invoke the lock method. In particular, it will launch atomicCAS(d_state, 0, 1). Since initially d_state == 0, d_state will be updated to 1, atomicCAS will return 0, and the thread will exit the lock function and move on to the update instruction. While that thread performs the mentioned operations, all the other threads with threadIdx.x == 0 in all the other blocks will execute the lock method as well. They will, however, find d_state equal to 1, so atomicCAS(d_state, 0, 1) will perform no update and will return 1, leaving these threads spinning in the while loop. After thread t completes the update, it executes the unlock function, namely atomicExch(d_state, 0), thus restoring d_state to 0. At that point, another of the threads with threadIdx.x == 0 will, at random, lock the state again.
The above code also contains a third kernel function, namely blockCountingKernelDeadlock. This is another wrong implementation of the critical section, one that leads to deadlocks. Indeed, recall that warps operate in lockstep and synchronize after every instruction. So, when we execute blockCountingKernelDeadlock, there is the possibility that one of the threads in a warp, say a thread with local thread index t≠0, will lock the state. Under this circumstance, the other threads in the same warp as t, including the one with threadIdx.x == 0, will execute the same while-loop statement as thread t, since threads in the same warp execute in lockstep. Accordingly, all the threads will wait for someone to unlock the state, but no other thread will be able to do so, and the code will be stuck in a deadlock.
By the way, you have to remember that global memory writes (and reads!) aren't necessarily completed at the point where you write them in the code, so for this to work in practice you need to add a global memory fence, i.e. __threadfence().
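A minimal sketch of where such a fence could go in the Lock struct above (an illustration of the comment, not part of the original code): issue __threadfence() before releasing the lock, so the critical-section writes to global memory become visible before the next thread acquires it.
__device__ void unlock(void) {
    __threadfence();            // flush the critical-section writes to global memory
    atomicExch(d_state, 0);     // then release the lock
}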