CUDA kernel executing a statement by a single thread only

How can I write a statement in my CUDA kernel that is executed by a single thread only? For example, suppose I have the following kernel:
__global__ void Kernel(bool *d_over, bool *d_update_flag_threads, int no_nodes)
{
    int tid = blockIdx.x*blockDim.x + threadIdx.x;
    if (tid < no_nodes && d_update_flag_threads[tid])
    {
        ...
        *d_over = true; // writing a single memory location, only 1 thread should do?
        ...
    }
}
In the above kernel, "d_over" is a single boolean flag, while "d_update_flag_threads" is a boolean array.
What I normally did before was to use the first thread in the thread block, e.g.:
if(threadIdx.x==0)
but that does not work in this case, since I have a flag array here and only threads whose associated flag is "true" will execute the if statement. The flag array is set by another CUDA kernel called earlier, and I have no knowledge of its contents in advance.
In short, I need something similar to the "single" construct in OpenMP.

A possible approach is to use atomic operations. If you need only one thread per block to do the update, you can perform the atomic operation in shared memory (for compute capability >= 1.2), which is generally much faster than performing it in global memory.
That said, the idea is as follows:
int tid = blockIdx.x*blockDim.x + threadIdx.x;
__shared__ int sFlag;

// initialize flag
if (threadIdx.x == 0) sFlag = 0;
__syncthreads();

if (tid < no_nodes && d_update_flag_threads[tid])
{
    // safely update the flag
    int singleFlag = atomicAdd(&sFlag, 1);
    // custom single operation
    if (singleFlag == 0)
        *d_over = true; // writing a single memory location, only 1 thread will do it
    ...
}
It is just an idea. I have not tested it, but it comes close to an operation performed by a single thread that need not be the first thread of the block.

You could use atomicCAS(d_over, 0, 1), where d_over is declared or type-cast as int*.
This ensures that only the first thread that sees d_over as 0 (false) updates it, and nobody else does.
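A minimal sketch of that idea (untested; the flag is declared as int rather than bool so that it is valid for atomicCAS, which operates on 32-bit words):

__global__ void Kernel(int *d_over, bool *d_update_flag_threads, int no_nodes)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < no_nodes && d_update_flag_threads[tid])
    {
        // atomicCAS returns the old value: exactly one thread sees 0 and
        // swaps it to 1; every later thread sees 1 and does nothing
        if (atomicCAS(d_over, 0, 1) == 0)
        {
            // this thread "won" and can do the single-thread work
        }
    }
}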

Related

Cuda: is there any way to prevent other threads from changing a shared or global variable?

Say I have a shared variable Checker, and the program works on different densities, i.e. each thread works on one density value:
__shared__ int Checker;
int TID = blockDim.x * blockIdx.x + threadIdx.x;
so the density on each thread is density[TID]
**** a few calculations ****
At some point, if the density exceeds a threshold value, I need to change the value of Checker, something like:
if (density[TID] > threshold)
    Checker = density[TID];
But if more than one thread satisfies the condition, there might be a race condition. How can I do this while avoiding the race?
I could use __syncthreads() and a for loop checking the threads one by one, but that would be heavily serialized and slow.
I didn't find a suitable atomic operation.
So, how do I avoid the race condition here?
The canonical way to handle the race condition you mention is with an atomic compare-and-swap operation, which is supported on CUDA-capable GPUs for both shared and global memory. See atomicCAS in the CUDA Programming Guide.
__shared__ int Checker;
int TID = blockDim.x * blockIdx.x + threadIdx.x;
int localChecker;

// Do some ops

if (density[TID] > threshold) {
    localChecker = *(volatile int*)&Checker;
    if (atomicCAS(&Checker, localChecker, density[TID]) == localChecker) {
        // This thread won the write
    }
}
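As an aside, if the goal is simply to keep the largest qualifying density, a dedicated atomic may be simpler than the CAS pattern; a minimal sketch, assuming density holds int values:

if (density[TID] > threshold) {
    // atomicMax atomically keeps the larger of Checker and density[TID];
    // it returns the old value, which could identify the "winning" thread
    atomicMax(&Checker, density[TID]);
}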

How to collect individual results of the threads within a block?

In my Kernel, the threads are processing a small part of an array in global memory.
After processing I would also like to set a flag indicating that the result of the calculation is zero for all threads within a block:
__global__ void kernel( int *a, bool *blockIsNull) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int result = 0;

    // {...} Here calculate result

    a[tid] = result;

    // some code here, but I don't know, that's my question...
    if (condition)
        blockIsNull[blockIdx.x] = true; // if all threads have returned result==0
}
Each individual thread owns the information, but I can't find an efficient way to collect it.
For example, I could have a counter in shared memory that is atomically incremented by each thread when result==0. When the counter reaches blockDim.x, it means that all threads have returned zero. Although I have not tested it, I am afraid that this solution would have a negative impact on performance (atomic functions are slow).
A zero result does not occur very often, so it is very unlikely to have zeros for all threads within a block. I would like to find a solution that has little impact on the performance in the general case.
What would be your recommendation ?
It sounds like you want to perform a block-level reduction of the condition value across a block. Just about all CUDA hardware supports a set of very useful warp voting primitives. You could use the __all() warp vote to determine whether every thread in a warp satisfies the condition, and then use __all() again to check whether all warps do. In code, it might look like this:
__global__ void kernel( int *a, bool *blockIsNull) {
    // assume that threads per block is <= 1024
    __shared__ volatile int blockcondition[32];
    int laneid = threadIdx.x % 32;
    int warpid = threadIdx.x / 32;

    // Set each condition value to non zero to begin
    if (warpid == 0) {
        blockcondition[threadIdx.x] = 1;
    }
    __syncthreads();

    //
    // your code goes here
    //

    // warpcondition holds the vote from each warp
    int warpcondition = __all(condition);

    // First thread in each warp loads the warp vote to shared memory
    if (laneid == 0) {
        blockcondition[warpid] = warpcondition;
    }
    __syncthreads();

    // First warp reduces all the votes in shared memory
    if (warpid == 0) {
        int result = __all(blockcondition[threadIdx.x] != 0);

        // first thread stores the block result to global memory
        if (laneid == 0) {
            blockIsNull[blockIdx.x] = (result != 0);
        }
    }
}
[ Huge disclaimer: written in browser, never compiled or tested, use at own risk ]
This code should (I think) work for any number of threads per block up to 1024. You could, if required, reduce the size of blockcondition if you are confident the block size will stay below 1024. Probably the smartest way would be to use C++ templates and make the warp count a template parameter.

Efficient method to check for matrix stability in CUDA

A number of algorithms iterate until a certain convergence criterion is reached (e.g. stability of a particular matrix). In many cases, one CUDA kernel must be launched per iteration. My question is: how then does one efficiently and accurately determine whether a matrix has changed over the course of the last kernel call? Here are three possibilities which seem equally unsatisfying:
Writing a global flag each time the matrix is modified inside the kernel. This works, but is highly inefficient and is not technically thread safe.
Using atomic operations to do the same as above. Again, this seems inefficient since in the worst case scenario one global write per thread occurs.
Using a reduction kernel to compute some parameter of the matrix (e.g. sum, mean, variance). This might be faster in some cases, but still seems like overkill. Also, it is possible to dream up cases where a matrix has changed but the sum/mean/variance haven't (e.g. two elements are swapped).
Is there any of the three options above, or an alternative, that is considered best practice and/or is generally more efficient?
I'll also go back to the answer I would have posted in 2012 but for a browser crash.
The basic idea is that you can use warp voting instructions to perform a simple, cheap reduction and then use zero or one atomic operations per block to update a pinned, mapped flag that the host can read after each kernel launch. Using a mapped flag eliminates the need for an explicit device-to-host transfer after each kernel launch.
This requires one word of shared memory per warp in the kernel, which is a small overhead, and some templating tricks can allow for loop unrolling if you provide the number of warps per block as a template parameter.
A complete working example (with C++ host code; I don't have access to a working PyCUDA installation at the moment) looks like this:
#include <cstdlib>
#include <vector>
#include <algorithm>
#include <assert.h>

__device__ unsigned int process(int & val)
{
    return (++val < 10);
}

template<int nwarps>
__global__ void kernel(int *inout, unsigned int *kchanged)
{
    __shared__ int wchanged[nwarps];
    unsigned int laneid = threadIdx.x % warpSize;
    unsigned int warpid = threadIdx.x / warpSize;

    // Do calculations then check for change/convergence
    // and set tchanged to be !=0 if required
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int tchanged = process(inout[idx]);

    // Simple blockwise reduction using voting primitives
    // increments kchanged if any thread in the block
    // returned tchanged != 0
    tchanged = __any(tchanged != 0);
    if (laneid == 0) {
        wchanged[warpid] = tchanged;
    }
    __syncthreads();

    if (threadIdx.x == 0) {
        int bchanged = 0;
#pragma unroll
        for(int i=0; i<nwarps; i++) {
            bchanged |= wchanged[i];
        }
        if (bchanged) {
            atomicAdd(kchanged, 1);
        }
    }
}
int main(void)
{
    const int N = 2048;
    const int min = 5, max = 15;

    std::vector<int> data(N);
    for(int i=0; i<N; i++) {
        data[i] = min + (std::rand() % (int)(max - min + 1));
    }

    int* _data;
    size_t datasz = sizeof(int) * (size_t)N;
    cudaMalloc<int>(&_data, datasz);
    cudaMemcpy(_data, &data[0], datasz, cudaMemcpyHostToDevice);

    unsigned int *kchanged, *_kchanged;
    cudaHostAlloc((void **)&kchanged, sizeof(unsigned int), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&_kchanged, kchanged, 0);

    const int nwarps = 4;
    dim3 blcksz(32*nwarps), grdsz(16);

    // Loop while the kernel signals it needs to run again
    do {
        *kchanged = 0;
        kernel<nwarps><<<grdsz, blcksz>>>(_data, _kchanged);
        cudaDeviceSynchronize();
    } while (*kchanged != 0);

    cudaMemcpy(&data[0], _data, datasz, cudaMemcpyDeviceToHost);
    cudaDeviceReset();

    int minval = *std::min_element(data.begin(), data.end());
    assert(minval == 10);

    return 0;
}
Here, kchanged is the flag the kernel uses to signal to the host that it needs to run again. The kernel runs until each entry in the input has been incremented above a threshold value. At the end of each thread's processing, it participates in a warp vote, after which one thread from each warp loads the vote result into shared memory. One thread then reduces the warp results and atomically updates the kchanged value. The host thread waits until the device has finished, and can then read the result directly from the mapped host variable.
You should be able to adapt this to whatever your application requires.
I'll go back to my original suggestion. I've updated the related question with an answer of my own, which I believe is correct.
Create a flag in global memory:
__device__ int flag;
At each iteration, initialize the flag to zero (in host code):
int init_val = 0;
cudaMemcpyToSymbol(flag, &init_val, sizeof(int));
In your kernel device code, modify the flag to 1 if a change is made to the matrix:
__global__ void iter_kernel(float *matrix){

    ...

    if (new_val[i] != matrix[i]){
        matrix[i] = new_val[i];
        flag = 1;
    }

    ...
}
After calling the kernel, at the end of the iteration (in host code), test for modification:
int modified = 0;
cudaMemcpyFromSymbol(&modified, flag, sizeof(int));
if (modified){
    ...
}
Even if multiple threads, in separate blocks or even separate grids, are writing the flag value, as long as the only thing they do is write the same value (i.e. 1 in this case), there is no hazard. The write will not get "lost" and no spurious values will show up in the flag variable.
Testing float or double quantities for equality in this fashion is questionable, but that doesn't seem to be the point of your question. If you have a preferred method to declare "modification" use that instead (such as testing for equality within a tolerance, perhaps).
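For example, a tolerance-based test might look like this (tol being an application-chosen threshold, not something from the code above):

// hypothetical replacement for the exact-equality test; tol is app-specific
if (fabsf(new_val[i] - matrix[i]) > tol) {
    matrix[i] = new_val[i];
    flag = 1;
}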
Some obvious enhancements to this method would be to create one (local) flag variable per thread, and have each thread update the global flag variable once per kernel, rather than on every modification. This would result in at most one global write per thread per kernel. Another approach would be to keep one flag variable per block in shared memory, and have all threads simply update that variable. At the completion of the block, one write is made to global memory (if necessary) to update the global flag. We don't need to resort to complicated reductions in this case, because there is only one boolean result for the entire kernel, and we can tolerate multiple threads writing to either a shared or global variable, as long as all threads are writing the same value.
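A minimal sketch of that second enhancement (untested; it reuses the global flag and the new_val/matrix comparison from above, passes new_val explicitly, and assumes a one-element-per-thread launch with no bounds check):

__device__ int flag;  // initialized to zero from the host at each iteration

__global__ void iter_kernel(float *matrix, float *new_val)
{
    __shared__ int block_flag;
    if (threadIdx.x == 0) block_flag = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (new_val[i] != matrix[i]) {
        matrix[i] = new_val[i];
        block_flag = 1;  // benign race: every writer stores the same value
    }
    __syncthreads();

    // at most one global write per block, and only when something changed
    if (threadIdx.x == 0 && block_flag)
        flag = 1;
}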
I can't see any reason to use atomics, or how it would benefit anything.
A reduction kernel seems like overkill, at least compared to one of the optimized approaches (e.g. a shared flag per block). And it would have the drawbacks you mention, such as the fact that anything less than a CRC or similarly complicated computation might alias two different matrix results as "the same".

Race condition with CUDA shuffle?

Using the shuffle command, are there race conditions/lost updates when two different threads concurrently attempt to update the same register value?
This is a late answer provided here to remove this question from the unanswered list.
From the CUDA C Programming Guide:
The __shfl() intrinsics permit exchanging of a variable between threads within a warp without use of shared memory.
The idea is that a thread i can read, but not alter, the value of a register r assigned to thread j. So, as pointed out in the comments above, there is no race condition.
The CUDA C Programming Guide also provides the following example of broadcasting a single value across a warp:
#include <cstdio>

__global__ void bcast(int arg) {
    int laneId = threadIdx.x & 0x1f;
    int value;
    if (laneId == 0)            // Note unused variable for
        value = arg;            // all threads except lane 0
    value = __shfl(value, 0);   // Get "value" from lane 0
    if (value != arg) printf("Thread %d failed.\n", threadIdx.x);
}

int main() {
    bcast<<< 1, 32 >>>(1234);
    cudaDeviceSynchronize();
    return 0;
}
In this example, the value of the value register assigned to thread 0 in the warp is broadcast to all the other threads in the warp and assigned to their local value registers. The other threads do not attempt (and indeed cannot) alter the value register assigned to thread 0.
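As a side note, from CUDA 9 onwards __shfl() is deprecated in favor of __shfl_sync(), which takes an explicit mask of participating lanes; the broadcast line would then read:

// CUDA 9+ equivalent of the broadcast above; 0xffffffff selects all 32 lanes
value = __shfl_sync(0xffffffff, value, 0);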

Parallel Reduction in CUDA for calculating primes

I have some code to calculate primes, which I have parallelized using OpenMP:
#pragma omp parallel for private(i,j) reduction(+:pcount) schedule(dynamic)
for (i = sqrt_limit+1; i < limit; i++)
{
    check = 1;
    for (j = 2; j <= sqrt_limit; j++)
    {
        if ( !(j&1) && (i&(j-1)) == 0 )
        {
            check = 0;
            break;
        }
        if ( j&1 && i%j == 0 )
        {
            check = 0;
            break;
        }
    }
    if (check)
        pcount++;
}
I am trying to port it to the GPU, and I want to reduce the count as I did in the OpenMP example above. Below is my code, which apart from giving incorrect results is also slower:
__global__ void sieve ( int *flags, int *o_flags, long int sqrootN, long int N)
{
    long int gid = blockIdx.x*blockDim.x+threadIdx.x, tid = threadIdx.x, j;
    __shared__ int s_flags[NTHREADS];

    if (gid > sqrootN && gid < N)
        s_flags[tid] = flags[gid];
    else
        return;
    __syncthreads();

    s_flags[tid] = 1;
    for (j = 2; j <= sqrootN; j++)
    {
        if ( gid%j == 0 )
        {
            s_flags[tid] = 0;
            break;
        }
    }

    //reduce
    for(unsigned int s=1; s < blockDim.x; s*=2)
    {
        if( tid % (2*s) == 0 )
        {
            s_flags[tid] += s_flags[tid + s];
        }
        __syncthreads();
    }

    //write results of this block to the global memory
    if (tid == 0)
        o_flags[blockIdx.x] = s_flags[0];
}
First of all, how do I make this kernel fast? I think the bottleneck is the for loop, and I am not sure how to replace it. Next, my counts are not correct. I did change the '%' operator and noticed some benefit.
In the flags array, I have marked the primes from 2 to sqroot(N), in this kernel I am calculating primes from sqroot(N) to N, but I would need to check whether each number in {sqroot(N),N} is divisible by primes in {2,sqroot(N)}. The o_flags array stores the partial sums for each block.
EDIT: Following the suggestion, I modified my code (I now understand the comment about __syncthreads() better). I realized that I do not need the flags array; the global indexes alone work in my case. What concerns me at this point is the slowness of the code (more than its correctness), which could be attributed to the for loop. Also, after a certain data size (100000), the kernel was producing incorrect results for subsequent data sizes. Even for data sizes less than 100000, the GPU reduction results are incorrect (a member of the NVIDIA forum pointed out that this may be because my data size is not a power of 2).
So there are still three (maybe related) questions:
How could I make this kernel faster? Is it a good idea to use shared memory in my case where I have to loop over each tid?
Why does it produce correct results only for certain data sizes?
How could I modify the reduction?
__global__ void sieve ( int *o_flags, long int sqrootN, long int N )
{
    unsigned int gid = blockIdx.x*blockDim.x+threadIdx.x, tid = threadIdx.x;
    volatile __shared__ int s_flags[NTHREADS];

    s_flags[tid] = 1;
    for (unsigned int j=2; j<=sqrootN; j++)
    {
        if ( gid % j == 0 )
            s_flags[tid] = 0;
    }
    __syncthreads();

    //reduce
    reduce(s_flags, tid, o_flags);
}
While I profess to know nothing about sieving for primes, there are a host of correctness problems in your GPU version which will stop it from working correctly irrespective of whether the algorithm you are implementing is correct or not:
__syncthreads() calls must be unconditional. It is incorrect to write code where branch divergence could leave some threads within the same warp unable to execute a __syncthreads() call. The underlying PTX is bar.sync and the PTX guide says this:
Barriers are executed on a per-warp basis as if all the threads in a warp are active. Thus, if any thread in a warp executes a bar instruction, it is as if all the threads in the warp have executed the bar instruction. All threads in the warp are stalled until the barrier completes, and the arrival count for the barrier is incremented by the warp size (not the number of active threads in the warp). In conditionally executed code, a bar instruction should only be used if it is known that all threads evaluate the condition identically (the warp does not diverge). Since barriers are executed on a per-warp basis, the optional thread count must be a multiple of the warp size.
Your code unconditionally sets s_flags to one after conditionally loading some values from global memory. Surely that cannot be the intent of the code?
The code lacks a synchronization barrier between the sieving code and the reduction; this can lead to a shared memory race and incorrect results from the reduction.
If you are planning on running this code on a Fermi class card, the shared memory array should be declared volatile to prevent compiler optimization from potentially breaking the shared memory reduction.
If you fix those things, the code might work. Performance is a completely different issue. Certainly on older hardware, the integer modulo operation was very, very slow and is not recommended. I can recall reading some material suggesting that the Sieve of Atkin was a useful approach to fast prime generation on GPUs.
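To make those fixes concrete, a minimal sketch of the kernel with the three problems addressed might look like this (untested; it assumes NTHREADS is a power-of-two block size and keeps the questioner's trial-division logic unchanged):

__global__ void sieve(int *o_flags, long int sqrootN, long int N)
{
    unsigned int gid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int tid = threadIdx.x;
    volatile __shared__ int s_flags[NTHREADS];

    // every thread writes its slot; threads outside (sqrootN, N) contribute
    // 0, so no thread returns early and all threads reach the barriers
    int flag = (gid > sqrootN && gid < N) ? 1 : 0;
    for (unsigned int j = 2; flag && j <= sqrootN; j++) {
        if (gid % j == 0)
            flag = 0;
    }
    s_flags[tid] = flag;
    __syncthreads(); // unconditional barrier between sieve and reduction

    // simple power-of-two tree reduction in shared memory
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            s_flags[tid] += s_flags[tid + s];
        __syncthreads();
    }
    if (tid == 0)
        o_flags[blockIdx.x] = s_flags[0];
}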