GPU CUDA internal thread scheduling not working properly

I have been trying to make the following code work:
__global__ void kernel(){
    if (threadIdx.x == 1){
        while(var == 0){
        }
    }
    if (threadIdx.x == 0){
        var = 1;
    }
}
where var is a global device variable. I am simply launching
two threads in the same block using kernel<<<1,2>>>();
If I switch the order of the ifs, the code terminates. However,
if I do not switch the order of the ifs, the code does not terminate.
It almost seems as if, when one thread goes into an infinite loop,
no other thread is allocated any run time before that thread
finishes all of its code.
I was under the impression that on a GPU all threads get some run time
allocated to them (although the order might be unknown to us).
I have also tried putting __threadfence() inside the while loop and inside
the if statements, and I have tried putting a
printf inside the while loop. It still doesn't work.
What is going on? Any feedback would be appreciated.
Thanks!

If var is some sort of global variable, what you see makes perfect sense when you consider how instructions from threads are scheduled.
You need to walk through your code as if you were a warp of threads (32 threads). Divergence is when some of those 32 threads execute some code while the others do not. When divergence happens, only the threads that are running the same instruction actually run, until the other threads catch back up.
In other words...
__global__ void kernel(){
    // Both threads encounter this at the same time. Thread 0 is put on "hold" while thread 1 continues into the if block.
    if (threadIdx.x == 1){
        while(var == 0){
        } // infinite loop: thread 0 will always be on hold, thread 1 will always be in this loop
    }
    if (threadIdx.x == 0){
        var = 1;
    }
}
as opposed to...
__global__ void kernel(){
    // Both threads encounter this at the same time. Thread 1 is put on "hold" while thread 0 continues into the if block.
    if (threadIdx.x == 0){
        // thread 0 sets the global variable var to 1
        var = 1;
    }
    // Threads 0 and 1 join again.
    // Both encounter this. Thread 0 is put on hold while thread 1 continues.
    if (threadIdx.x == 1){
        // var was already set to 1, so this loop is skipped.
        while(var == 0){
        }
    }
    // Both threads join
}
Revisit the programming guide and review warps. If you want to test this further, try putting the two threads in two different blocks; this will prevent them from being in the same warp.
Be forewarned, though, that CUDA in general does not guarantee thread execution order between warps and blocks (unless some method of synchronization is used, such as __syncthreads() or exiting the kernel).
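If you want to experiment with that, here is a minimal sketch of the two-block version, assuming var is a __device__ int; the volatile access is an addition so the compiler re-reads var on every iteration, and even then CUDA does not guarantee that the spinning block will ever observe the write:
__device__ int var = 0;

__global__ void kernel(){
    volatile int *pvar = &var;   // force a fresh read of var on each iteration
    if (blockIdx.x == 1){
        while (*pvar == 0){      // spin until the other block writes var
        }
    }
    if (blockIdx.x == 0){
        *pvar = 1;               // release the spinning block
    }
}
// launched as kernel<<<2,1>>>(); so the two threads are in different blocks (and warps)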

Related

In CUDA, how do I detect that __syncthreads() was not called by all threads in the block?

I just ran into a weird and hard to reproduce problem in CUDA which turned out to involve undefined behaviour. I wanted thread 0 to set up some value in shared memory which should be used by all the threads.
__shared__ bool p;
p = false;
if (threadIdx.x == 0) p = true;
__syncthreads();
assert(p);
Now the assert(p); failed seemingly at random as I moved the code around and commented parts out to find the issue.
I had used this construction in effectively the following undefined-behaviour context:
#include <assert.h>

__global__ void test() {
    if (threadIdx.x == 0) __syncthreads(); // call __syncthreads in thread 0 only: this is a very bad idea
    // everything below may exhibit undefined behaviour
    // If the above __syncthreads runs only in thread 0, this will fail for all threads not in the first warp
    __shared__ bool p;
    p = false;
    if (threadIdx.x == 0) p = true;
    __syncthreads();
    assert(p);
}

int main() {
    test<<<1, 32 + 1>>>(); // nothing happens if you have only one warp, so we use one more thread
    cudaDeviceSynchronize();
    return 0;
}
The earlier __syncthreads() that was reached by only one thread was of course hidden in some functions, so it was hard to find. On my setup (sm50, GTX 980), this kernel runs through (no deadlock, as advertised...) and the assertion fails for all threads outside of the first warp.
TL;DR
Is there any standard way to detect __syncthreads() not being called by all threads in a block? Maybe some debugger setting I am missing?
I could maybe construct my own (very slow) checked __syncthreads() that detects the situation using atomics and global memory, but I'd rather have a standard solution.
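Something along these lines is what I have in mind (completely untested; checked_syncthreads and the caller-supplied counter are only placeholders, and if a barrier has already been skipped the behaviour is undefined anyway):
#include <assert.h>

// A very slow "checked" barrier: every thread records its arrival in a shared
// counter, and after the barrier thread 0 verifies that the whole block made it.
// The caller is assumed to provide a __shared__ unsigned int that starts at zero.
__device__ void checked_syncthreads(unsigned int *sCount)
{
    atomicAdd(sCount, 1u);             // this thread reached the barrier
    __syncthreads();
    if (threadIdx.x == 0)
    {
        assert(*sCount == blockDim.x); // fires if part of the block skipped the call
        *sCount = 0;                   // reset for the next use
    }
    __syncthreads();                   // keep the reset from racing with later calls
}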
You have a data race in your original code.
Thread 0 may advance up to and execute "p = true", but after that another thread that has not progressed at all may still be back at the p = false line and overwrite the result.
The easiest fix for this specific example is simply to have ONLY thread 0 write to p, something like:
__shared__ bool p;
if (threadIdx.x == 0) p = true;
__syncthreads();
assert(p);

Overuse of __syncthreads() in the code

I understand the purpose of __syncthreads(), but I sometimes find it overused in some code.
For instance, in the code below, taken from NVIDIA notes, each thread mainly calculates s_data[tx]-s_data[tx-1]. Each thread needs the data it reads from global memory and the data read by its neighboring thread. Both threads will be in the same warp, and hence should complete the retrieval of their data from global memory and be scheduled for execution simultaneously.
I believe the code would still work without __syncthreads(), but obviously the NVIDIA notes say otherwise. Any comment, please?
// Example - shared variables
// optimized version of adjacent difference
__global__ void adj_diff(int *result, int *input)
{
    // shorthand for threadIdx.x
    int tx = threadIdx.x;
    // allocate a __shared__ array, one element per thread
    __shared__ int s_data[BLOCK_SIZE];
    // each thread reads one element to s_data
    unsigned int i = blockDim.x * blockIdx.x + tx;
    s_data[tx] = input[i];
    // avoid race condition: ensure all loads
    // complete before continuing
    __syncthreads();
    if(tx > 0)
        result[i] = s_data[tx] - s_data[tx-1];
    else if(i > 0)
    {
        // handle thread block boundary
        result[i] = s_data[tx] - input[i-1];
    }
}
It would be nice if you included a link to where, in the "Nvidia notes", this appeared.
both threads will be in the same warp
No, they won't, at least not in all cases. What happens when tx = 32? Then the thread corresponding to tx belongs to warp 1 in the block, and the thread corresponding to tx-1 belongs to warp 0 in the block.
There's no guarantee that warp 0 has executed before warp 1, so the code could fail without the call to __syncthreads() (without it, the value of s_data[tx-1] could be invalid, because warp 0 may not have run yet and therefore may not have loaded it).
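To make the warp boundary concrete, here is a small hypothetical test kernel (not from the notes) that simply reports which warp a thread belongs to:
#include <cstdio>

__global__ void which_warp()
{
    int tx = threadIdx.x;
    if (tx == 31 || tx == 32)
        printf("tx = %d is in warp %d of its block\n", tx, tx / warpSize);
}
// With a block of 64 threads, tx = 31 reports warp 0 and tx = 32 reports warp 1,
// so s_data[tx] and s_data[tx-1] straddle a warp boundary at tx = 32.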

Kernel Launch Failure

I'm working on a Linux system with a Tesla C2075. I am launching a kernel that is a modified version of the reduction kernel. My aim is to find the mean and a step-by-step averaged version (time_avg) of a large data set (result). See the code below.
The sizes of "result" and "time_avg" are the same and equal to "nsamps". "time_avg" contains successive averaged sets of the array result: the first half contains averages of every two non-overlapping samples, the quarter after that has averages of every four non-overlapping samples, the next eighth of every 8 samples, and so on.
__global__ void timeavg_mean(float *result, unsigned int *nsamps, float *time_avg, float *mean) {
    __shared__ float temp[1024];
    int ltid = threadIdx.x, gtid = blockIdx.x*blockDim.x + threadIdx.x, stride;
    int start = 0, index;
    unsigned int npts = *nsamps;
    printf("here here\n");
    // Store chunk of memory=2*blockDim.x (which is to be reduced) into shared memory
    if ( (2*gtid) < npts ){
        temp[2*ltid] = result[2*gtid];
        temp[2*ltid+1] = result[2*gtid + 1];
    }
    for (stride=1; stride<blockDim.x; stride>>=1) {
        __syncthreads();
        if (ltid % (stride*2) == 0){
            if ( (2*gtid) < npts ){
                temp[2*ltid] += temp[2*ltid + stride];
                index = (int)(start + gtid/stride);
                time_avg[index] = (float)( temp[2*ltid]/(2.0*stride) );
            }
        }
        start += npts/(2*stride);
    }
    __syncthreads();
    if (ltid == 0)
    {
        atomicAdd(mean, temp[0]);
    }
    __syncthreads();
    printf("%f\n", *mean);
}
The launch configuration is 40 blocks of 512 threads. The data set is ~40k samples.
In my main code, I call cudaGetLastError() after the kernel call and it returns no error. Memory allocations and memory copies return no errors. If I add a cudaDeviceSynchronize() (or a cudaMemcpy to check the value of mean) after the kernel call, the program hangs completely. If I remove it, the program runs and exits. In neither case do I get the "here here" output or the mean value printed. I understand that unless the kernel executes successfully, the printfs won't print.
Has this got to do with __syncthreads() in a recursion? All threads go to the same depth, so I think that checks out.
What is the problem here?
Thank you!
A kernel call is asynchronous: if the kernel starts successfully, your host code continues to run and you see no error. Errors that happen during the kernel run appear only after you do an explicit synchronization or call a function that causes an implicit synchronization.
If your host hangs on synchronization, then your kernel probably didn't finish running: it is either stuck in an infinite loop or waiting on __syncthreads() or some other synchronization primitive.
Your code does contain an infinite loop: for (stride=1; stride<blockDim.x; stride>>=1). You probably want to shift the stride left, not right: stride<<=1.
You mentioned recursion but your code contains only one __global__ function, there are no recursive calls.
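For completeness, a typical host-side pattern for surfacing such execution-time errors looks roughly like this (a sketch; the device pointer names are assumed, since the host code isn't shown):
timeavg_mean<<<40, 512>>>(d_result, d_nsamps, d_time_avg, d_mean);

cudaError_t launchErr = cudaGetLastError();      // errors from the launch itself
cudaError_t runErr    = cudaDeviceSynchronize(); // errors that occur while the kernel runs
if (launchErr != cudaSuccess || runErr != cudaSuccess)
    printf("kernel failed: %s / %s\n",
           cudaGetErrorString(launchErr), cudaGetErrorString(runErr));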
Your kernel has an infinite loop. Replace the for loop with
for (stride=1; stride<blockDim.x; stride<<=1) {

CUDA Kernel executing a statement by a single thread only

How can I write a statement in my CUDA kernel that is executed by a single thread? For example, if I have the following kernel:
__global__ void Kernel(bool *d_over, bool *d_update_flag_threads, int no_nodes)
{
    int tid = blockIdx.x*blockDim.x + threadIdx.x;
    if( tid<no_nodes && d_update_flag_threads[tid])
    {
        ...
        *d_over=true; // writing a single memory location, only 1 thread should do?
        ...
    }
}
In the above kernel, "d_over" is a single boolean flag while "d_update_flag_threads" is a boolean array.
What I normally did before was to use the first thread in the thread block, e.g.:
if(threadIdx.x==0)
but it does not work in this case, because only threads whose associated flag is "true" will execute the if statement. That flag array is set by another CUDA kernel called beforehand, and I don't have any knowledge about it in advance.
In short, I need something similar to the "single" construct in OpenMP.
A possible approach is to use atomic operations. If you need only one thread per block to do the update, you could do the atomic operation in shared memory (for compute capability >= 1.2), which is generally much faster than performing it in global memory.
That said, the idea is as follows:
int tid = blockIdx.x*blockDim.x + threadIdx.x;

__shared__ int sFlag;
// initialize flag
if (threadIdx.x == 0) sFlag = 0;
__syncthreads();

if( tid<no_nodes && d_update_flag_threads[tid])
{
    // safely update the flag
    int singleFlag = atomicAdd(&sFlag, 1);
    // custom single operation
    if ( singleFlag == 0)
        *d_over=true; // writing a single memory location, only 1 thread will do it
    ...
}
It is just an idea; I haven't tested it, but it comes close to an operation performed by a single thread that is not necessarily the first thread of the block.
You could use atomicCAS(d_over, 0, 1) where d_over is declared or type-cast as int*.
This would ensure that only the first thread that sees d_over equal to 0 (false) updates it, and nobody else does.
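A minimal sketch of that variant, with d_over stored as an int (0 = false, 1 = true) instead of a bool:
__global__ void Kernel(int *d_over, bool *d_update_flag_threads, int no_nodes)
{
    int tid = blockIdx.x*blockDim.x + threadIdx.x;
    if (tid < no_nodes && d_update_flag_threads[tid])
    {
        // atomicCAS returns the old value, so exactly one of the flagged
        // threads sees 0 here and performs the "single" work.
        if (atomicCAS(d_over, 0, 1) == 0)
        {
            // code that should run in exactly one thread goes here
        }
    }
}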

Parallel Reduction in CUDA for calculating primes

I have code to calculate primes, which I have parallelized using OpenMP:
#pragma omp parallel for private(i,j) reduction(+:pcount) schedule(dynamic)
for (i = sqrt_limit+1; i < limit; i++)
{
    check = 1;
    for (j = 2; j <= sqrt_limit; j++)
    {
        if ( !(j&1) && (i&(j-1)) == 0 )
        {
            check = 0;
            break;
        }
        if ( j&1 && i%j == 0 )
        {
            check = 0;
            break;
        }
    }
    if (check)
        pcount++;
}
I am trying to port it to the GPU, and I want to reduce the count as I did in the OpenMP example above. Below is my code, which, apart from giving incorrect results, is also slower:
__global__ void sieve ( int *flags, int *o_flags, long int sqrootN, long int N)
{
    long int gid = blockIdx.x*blockDim.x+threadIdx.x, tid = threadIdx.x, j;
    __shared__ int s_flags[NTHREADS];

    if (gid > sqrootN && gid < N)
        s_flags[tid] = flags[gid];
    else
        return;
    __syncthreads();

    s_flags[tid] = 1;
    for (j = 2; j <= sqrootN; j++)
    {
        if ( gid%j == 0 )
        {
            s_flags[tid] = 0;
            break;
        }
    }

    //reduce
    for(unsigned int s=1; s < blockDim.x; s*=2)
    {
        if( tid % (2*s) == 0 )
        {
            s_flags[tid] += s_flags[tid + s];
        }
        __syncthreads();
    }

    //write results of this block to the global memory
    if (tid == 0)
        o_flags[blockIdx.x] = s_flags[0];
}
First of all, how do I make this kernel fast? I think the bottleneck is the for loop, and I am not sure how to replace it. Next, my counts are not correct. I did change the '%' operator and noticed some benefit.
In the flags array I have marked the primes from 2 to sqroot(N); in this kernel I am calculating primes from sqroot(N) to N, but I need to check whether each number in {sqroot(N), N} is divisible by the primes in {2, sqroot(N)}. The o_flags array stores the partial sums for each block.
EDIT: Following the suggestions, I modified my code (I now understand the comment about __syncthreads() better); I realized that I do not need the flags array and that just the global indexes work in my case. What concerns me at this point is the slowness of the code (more than correctness), which could be attributed to the for loop. Also, after a certain data size (100000), the kernel was producing incorrect results for subsequent data sizes. Even for data sizes less than 100000, the GPU reduction results are incorrect (a member of the NVIDIA forum pointed out that this may be because my data size is not a power of 2).
So there are still three (possibly related) questions:
How could I make this kernel faster? Is it a good idea to use shared memory in my case where I have to loop over each tid?
Why does it produce correct results only for certain data sizes?
How could I modify the reduction?
__global__ void sieve ( int *o_flags, long int sqrootN, long int N )
{
    unsigned int gid = blockIdx.x*blockDim.x+threadIdx.x, tid = threadIdx.x;
    volatile __shared__ int s_flags[NTHREADS];

    s_flags[tid] = 1;
    for (unsigned int j=2; j<=sqrootN; j++)
    {
        if ( gid % j == 0 )
            s_flags[tid] = 0;
    }
    __syncthreads();
    //reduce
    reduce(s_flags, tid, o_flags);
}
While I profess to know nothing about sieving for primes, there are a host of correctness problems in your GPU version which will stop it from working correctly, irrespective of whether the algorithm you are implementing is correct or not:
__syncthreads() calls must be unconditional. It is incorrect to write code where branch divergence could leave some threads within the same warp unable to execute a __syncthreads() call. The underlying PTX instruction is bar.sync, and the PTX guide says this:
Barriers are executed on a per-warp basis as if all the threads in a warp are active. Thus, if any thread in a warp executes a bar instruction, it is as if all the threads in the warp have executed the bar instruction. All threads in the warp are stalled until the barrier completes, and the arrival count for the barrier is incremented by the warp size (not the number of active threads in the warp). In conditionally executed code, a bar instruction should only be used if it is known that all threads evaluate the condition identically (the warp does not diverge). Since barriers are executed on a per-warp basis, the optional thread count must be a multiple of the warp size.
Your code unconditionally sets s_flags to one after conditionally loading some values from global memory. Surely that cannot be the intent of the code?
The code lacks a synchronization barrier between the sieving code and the reduction; this can lead to a shared memory race and incorrect results from the reduction.
If you are planning on running this code on a Fermi class card, the shared memory array should be declared volatile to prevent compiler optimization from potentially breaking the shared memory reduction.
If you fix those things, the code might work. Performance is a completely different issue. Certainly on older hardware, the integer modulo operation was very, very slow and is not recommended. I can recall reading some material suggesting that the Sieve of Atkin was a useful approach to fast prime generation on GPUs.
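For reference, here is a rough, untested sketch of the kernel with those fixes folded in; it assumes NTHREADS equals the block size and that the block size is a power of two, and it changes nothing about the algorithm or its performance:
__global__ void sieve(int *o_flags, long int sqrootN, long int N)
{
    long int gid = blockIdx.x*blockDim.x + threadIdx.x, tid = threadIdx.x, j;
    volatile __shared__ int s_flags[NTHREADS];

    // Every thread writes its own flag; out-of-range threads contribute 0
    // instead of returning early, so no thread misses the barriers below.
    s_flags[tid] = (gid > sqrootN && gid < N) ? 1 : 0;

    for (j = 2; j <= sqrootN; j++)
    {
        if (s_flags[tid] && gid % j == 0)
        {
            s_flags[tid] = 0;   // gid has a divisor, so it is not prime
            break;
        }
    }
    __syncthreads();            // sieving is finished before the reduction starts

    // Block-wide sum of the flags; every thread reaches every barrier.
    for (unsigned int s = 1; s < blockDim.x; s *= 2)
    {
        if (tid % (2*s) == 0)
            s_flags[tid] += s_flags[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        o_flags[blockIdx.x] = s_flags[0];
}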