Is there a Liveness guarantee in CUDA? [duplicate] - cuda

This question already has answers here:
Cuda In-Situ memory race issue for algorithms such as convolution or morphological dilation
(1 answer)
Inter-block synchronization in CUDA
(1 answer)
question about modifying flag array in cuda
(1 answer)
Closed 4 months ago.
I am trying to implement a GPU application which requires the use of a mutex. I know this isn't ideal, but for correctness it is required. When the mutex is acquired, which isn't often, all other threads halt, and only the single holding thread is allowed to continue until it finishes, at which point all threads may resume normal operation.
I have tried to implement this using atomic operations to modify the flags and busy waiting for the waiting threads; however, at some point the execution just stops. I thought there was simply a deadlock somewhere in my code, but this doesn't seem to be the case. The execution appears to get stuck at a seemingly arbitrary print statement.
Therefore, I was wondering, is there some guarantee that all threads will eventually be processed, or is it possible that the busy waiting loop is hogging all the scheduling cycles of the GPU?
This is the busy waiting loop:
while (flag) {
    if (count > 10000) {
        count = 0; //Only used as breakpoint to see when the cycle has been entered
    }
    if (failFlag) {
        return false;
    }
    count++;
}
This is how the flags are set
bool setFlag(int* loc, int val, bool strict=true) {
    int val_i = val == 0 ? 1 : 0;
    //In devices, atomically exchange
    uint64_cu res = atomicCAS(loc, val_i, val);
    //Make sure the value hasn't changed in the meantime
    if ((res != val_i) && strict) {
        return false;
    }
    __threadfence();
    return true;
}
and this is the seemingly arbitrary line that the second thread never seems to move past:
printf("%i:\t\t\t\tRebuild\n", getThreadID());
where getThreadID() returns threadIdx.x
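For reference, the overall shape of the lock I am trying to build, written directly with atomicCAS/atomicExch, would be something like this sketch (illustrative names, not my actual code):
__device__ int lockVar = 0;       // 0 = free, 1 = held

__device__ void acquireLock(int* lock) {
    // Spin until this thread flips the lock from 0 to 1.
    while (atomicCAS(lock, 0, 1) != 0) { /* busy wait */ }
    __threadfence();              // make the previous holder's writes visible
}

__device__ void releaseLock(int* lock) {
    __threadfence();              // publish this thread's writes before releasing
    atomicExch(lock, 0);
}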
I first tried using memcheck to see if some issue with the memory was coming up, which gave no errors. Then I tried racecheck, which also didn't show any issues. I then used some print statements to see roughly where the execution was hanging in the executing thread. Finally, I used the debugger, which showed that the first thread was moving through the busy waiting loop, while the other thread was seemingly stuck on a random print statement I was using to debug (while there were several other similar statements before that point).
Here is the debugger output; lines 377 to 385 are the busy wait loop, while line 206 is just a print statement:
Thread 1 "main" hit Breakpoint 1, MyProgram::insert (this=0x7fffba000000, k=152796131036661202) at /home/User/MyProgramParallel/src/DataStructure.cu:379
379 in /home/User/MyProgramParallel/src/DataStructure.cu
(cuda-gdb) info cuda thread
Unrecognized option: 'thread'.
(cuda-gdb) info cuda threads
BlockIdx ThreadIdx To BlockIdx ThreadIdx Count Virtual PC Filename Line
Kernel 0
(0,0,0) (0,0,0) (0,0,0) (0,0,0) 1 0x0000555558f25e00 /home/User/MyProgramParallel/src/DataStructure.cu 206
* (0,0,0) (1,0,0) (0,0,0) (1,0,0) 1 0x0000555558f20c70 /home/User/MyProgramParallel/src/DataStructure.cu 379
(cuda-gdb) step
381 in /home/User/MyProgramParallel/src/DataStructure.cu
(cuda-gdb) info cuda threads
BlockIdx ThreadIdx To BlockIdx ThreadIdx Count Virtual PC Filename Line
Kernel 0
(0,0,0) (0,0,0) (0,0,0) (0,0,0) 1 0x0000555558f25e00 /home/User/MyProgramParallel/src/DataStructure.cu 206
* (0,0,0) (1,0,0) (0,0,0) (1,0,0) 1 0x0000555558f20ce0 /home/User/MyProgramParallel/src/DataStructure.cu 381
(cuda-gdb) step
384 in /home/User/MyProgramParallel/src/DataStructure.cu
(cuda-gdb) info cuda threads
BlockIdx ThreadIdx To BlockIdx ThreadIdx Count Virtual PC Filename Line
Kernel 0
(0,0,0) (0,0,0) (0,0,0) (0,0,0) 1 0x0000555558f25e00 /home/User/MyProgramParallel/src/DataStructure.cu 206
* (0,0,0) (1,0,0) (0,0,0) (1,0,0) 1 0x0000555558f20ea0 /home/User/MyProgramParallel/src/DataStructure.cu 384
I would expect both threads to execute steps, with the first moving past line 206, and the other moving through the busy waiting loop. However, this is not the case, no matter how many times I continue the execution past the breakpoint. That is why I'm wondering: is there a liveness guarantee in CUDA? Or is this what a thread looks like after it has crashed? And otherwise, what is another possible reason for this behaviour? Before this point, the two threads seemed to be working in lockstep.
The CUDA version is 11.3, and the operating system is Ubuntu.

Related

When can threads of a warp get scheduled independently on Volta+?

Quoting from the Independent Thread Scheduling section (page 27) of the Volta whitepaper:
Note that execution is still SIMT: at any given clock cycle, CUDA cores execute the same instruction for all active threads in a warp just as before, retaining the execution efficiency of previous architectures.
From my understanding, this implies that if there is no divergence within threads of a warp (i.e. all threads of a warp are active), the threads should execute in lockstep.
Now, consider listing 8 from this blog post, reproduced below:
unsigned tid = threadIdx.x;
int v = 0;
v += shmem[tid+16]; __syncwarp(); // 1
shmem[tid] = v; __syncwarp(); // 2
v += shmem[tid+8]; __syncwarp(); // 3
shmem[tid] = v; __syncwarp(); // 4
v += shmem[tid+4]; __syncwarp(); // 5
shmem[tid] = v; __syncwarp(); // 6
v += shmem[tid+2]; __syncwarp(); // 7
shmem[tid] = v; __syncwarp(); // 8
v += shmem[tid+1]; __syncwarp(); // 9
shmem[tid] = v;
Since we don't have any divergence here, I would expect the threads to already be executing in lockstep without any of the __syncwarp() calls.
This seems to contradict the statement I quote above.
I would appreciate it if someone could clarify this confusion.
From my understanding, this implies that if there is no divergence within threads of a warp, (i.e. all threads of a warp are active), the threads should execute in lockstep.
If all threads in a warp are active for a particular instruction, then by definition there is no divergence. This has been true since day 1 in CUDA. It's not logical in my view to connect your statement with the one you excerpted, because it is a different case:
Note that execution is still SIMT: at any given clock cycle, CUDA cores execute the same instruction for all active threads in a warp just as before, retaining the execution efficiency of previous architectures
This indicates that the active threads are in lockstep. Divergence is still possible. The inactive threads (if any) would be somehow divergent from the active threads. Note that both of these statements are describing the CUDA SIMT model and they have been correct and true since day 1 of CUDA. They are not specific to the Volta execution model.
For the remainder of your question, I guess instead of this:
I would appreciate it if someone could clarify this confusion.
You are asking:
Why is the syncwarp needed?
Two reasons:
1. As stated near the top of that post:
Thread synchronization: synchronize threads in a warp and provide a memory fence. __syncwarp
A memory fence is needed in this case to prevent the compiler from "optimizing" shared memory locations into registers.
2. The CUDA programming model provides no specified order of thread execution. It would be a good idea for you to acknowledge that statement as ground truth. If you write code that requires a specific order of thread execution (for correctness), and you don't provide for it explicitly in your source code as a programmer, your code is broken, regardless of the way it behaves or what results it produces.
The Volta whitepaper is describing the behavior of a specific hardware implementation of a CUDA-compliant device. The hardware may ensure things that are not guaranteed by the programming model.
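As an aside, the same warp-level reduction is often written with __shfl_down_sync, which keeps the partial sums in registers and makes the intra-warp exchange (and its synchronization) explicit; a minimal sketch (warpReduceSum is a made-up name, not code from the post):
__device__ int warpReduceSum(int v) {
    unsigned mask = 0xffffffffu;                 // all 32 lanes participate
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(mask, v, offset);  // add the value held by lane (lane + offset)
    return v;                                    // lane 0 ends up with the warp's sum
}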

What is the number of registers in CUDA CC 5.0?

I have a GeForce GTX 745 (CC 5.0).
The deviceQuery command shows that the total number of registers available per block is 65536 (65536 * 4 / 1024 = 256KB).
I wrote a kernel that uses an array of size 10K and the kernel is invoked as follows. I have tried two ways of allocating the array.
// using registers
__global__ void fun1() {
    short *arr = new short[100*100]; // 100*100*sizeof(short)=256K / per block
    ....
    delete[] arr;
}
fun1<<<4, 64>>>();
// using global memory
__global__ void fun2(short *d_arr) {
    ...
}
fun2<<<4, 64>>>(d_arr);
I can get the correct result in both cases.
The first one which uses registers runs much faster.
But when invoking the kernel using 6 blocks I got the error code 77.
fun1<<<6, 64>>>();
an illegal memory access was encountered
Now I'm wondering: how many registers can I actually use, and how is that related to the number of blocks?
The important misconception in your question is that the new operator somehow uses registers to store memory allocated at runtime on the device. It does not. Registers are only allocated statically by the compiler. The new operator uses a dedicated heap for device allocation.
In detail: in your code for fun1, the first line is executed by all threads, hence each thread allocates 10,000 16-bit values, that is 20,000 bytes per thread and 1,280,000 bytes per block of 64 threads. For 4 blocks, that makes 5,120,000 bytes; for 6 blocks, 7,680,000 bytes, which for some reason seems to overflow the preallocated limit (the default limit is 8 MB; see Heap memory allocation). This may be why you get this illegal memory access error (77).
Using new will make use of some preallocated global memory, as malloc would, but not registers; maybe the code you provided is not exactly the one you run. If you want registers, you need to define the data as a fixed-size array:
func1()
{
    short arr[100];
    ...
}
The compiler will then try to fit the array in registers. Note however that this register data is per thread, and the maximum number of 32-bit registers per thread is 255 on your device.
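If in-kernel new/malloc is really what is wanted, the preallocated heap mentioned above can be enlarged from the host before the first kernel launch; a minimal sketch (the 32 MB figure is just an example):
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    size_t heapBytes = 32u * 1024u * 1024u;   // e.g. 32 MB device heap
    cudaError_t err = cudaDeviceSetLimit(cudaLimitMallocHeapSize, heapBytes);
    if (err != cudaSuccess)
        printf("cudaDeviceSetLimit failed: %s\n", cudaGetErrorString(err));
    // ... launch fun1<<<6, 64>>>() afterwards ...
    return 0;
}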

conditional syncthreads & deadlock (or not)

A follow-up question to: EarlyExit and DroppedThreads
According to the above links, the code below should dead-lock.
Please explain why this does NOT dead-lock. (CUDA 5 on a Fermi)
__device__ int add[144];
__device__ int result;

add<<<1,96>>>(); // the calling

__global__ void add() {
    for (idx = 72 >> 1; idx > 0; idx >>= 1) {
        if (thrdIdx < idx)
            add[thrdIdx] += add[thrdIdx + idx];
        else
            return;
        __syncthreads();
    }
    if (thrdIdx == 0)
        result = add[0];
}
This is technically an ill-defined program.
Most, but not all (for example G80 does not), NVIDIA GPUs support early exit in this way because the hardware maintains an active thread count for each block, and this count is used for barrier synchronization rather than the initial thread count for the block.
Therefore, when the __syncthreads() in your code is reached, the hardware will not wait on any threads that have already returned, and the program runs without deadlock.
A more common use of this style is:
__global__ void foo(int n, ...) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= n) return;
    ... // do some computation with remaining threads
}
Important note: barrier counts are updated per-warp (see here), not per-thread. So you may have the case where, say, only a few (or zero) threads of a warp return early; the barrier count is then not decremented for that warp. However, as long as at least one thread from each warp reaches the barrier, it will not deadlock.
So in general, you need to use barriers carefully. But specifically, (simple) early exit patterns like this do work.
Edit: for your specific case.
Iteration Idx==36: 2 active warps so barrier exit count is 64. All threads from warp 0 reach barrier, incrementing count from 0 to 32. 4 threads from warp 1 reach barrier, incrementing count from 32 to 64, and warps 0 and 1 are released from barrier. Read the link above to understand why this happens.
Iteration Idx==18: 1 active warp so barrier exit count is 32. 18 threads from warp 0 reach barrier, incrementing count from 0 to 32. Barrier is satisfied and warp 0 is released.
Etc...
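If you prefer not to rely on these hardware details at all, the usual fix is to keep every thread alive and guard only the work, so that each thread reaches each __syncthreads(); a minimal sketch (illustrative names, assumes a single-block launch and a power-of-two n, not the question's exact code):
__global__ void blockReduce(int* data, int* result, int n) {
    // Tree reduction in which no thread returns early: threads outside the
    // active range simply skip the work but still arrive at the barrier.
    for (int stride = n >> 1; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            data[threadIdx.x] += data[threadIdx.x + stride];
        __syncthreads();   // reached by every thread of the block
    }
    if (threadIdx.x == 0)
        *result = data[0];
}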

Execution order inside a warp for divergent cases in CUDA [duplicate]

I have the following code for a CUDA program:
#include <stdio.h>
#define NUM_BLOCKS 4
#define THREADS_PER_BLOCK 4
__global__ void hello()
{
    printf("Hello. I'm a thread %d in block %d\n", threadIdx.x, blockIdx.x);
}

int main(int argc, char **argv)
{
    // launch the kernel
    hello<<<NUM_BLOCKS, THREADS_PER_BLOCK>>>();
    // force the printf()s to flush
    cudaDeviceSynchronize();
    return 0;
}
in which every thread will print its threadIdx.x and blockIdx.x. One possible output of this program is this:
Hello. I'm a thread 0 in block 0
Hello. I'm a thread 1 in block 0
Hello. I'm a thread 2 in block 0
Hello. I'm a thread 3 in block 0
Hello. I'm a thread 0 in block 2
Hello. I'm a thread 1 in block 2
Hello. I'm a thread 2 in block 2
Hello. I'm a thread 3 in block 2
Hello. I'm a thread 0 in block 3
Hello. I'm a thread 1 in block 3
Hello. I'm a thread 2 in block 3
Hello. I'm a thread 3 in block 3
Hello. I'm a thread 0 in block 1
Hello. I'm a thread 1 in block 1
Hello. I'm a thread 2 in block 1
Hello. I'm a thread 3 in block 1
Running the program several times I get similar results, except that the block order is random. For example, in the above output we have the block order 0, 2, 3, 1. Running the program again I get 1, 2, 3, 0. This is expected. However, the thread order in every block is always 0, 1, 2, 3. Why is this happening? I thought it would be random too.
I tried to change my code to force thread 0 in every block to take longer to execute. I did it like this:
__global__ void hello()
{
    if (threadIdx.x == 0)
    {
        int k = 0;
        for (int i = 0; i < 1000000; i++)
        {
            k = k + 1;
        }
    }
    printf("Hello. I'm a thread %d in block %d\n", threadIdx.x, blockIdx.x);
}
I would expect the thread order to be 1, 2, 3, 0. However, I got a similar result to the one I have shown above, where the thread order was always 0, 1, 2, 3. Why is this happening?
However, the thread order in every block is always 0, 1, 2, 3. Why is this happening? I thought it would be random too
With 4 threads per block you are only launching one warp per block. A warp is the unit of execution (and scheduling, and resource assignment) in CUDA, not a thread. Currently, a warp consists of 32 threads.
This means that all 4 of your threads per block (since there is no conditional behavior in this case) are executing in lockstep. When they reach the printf function call, they all execute the call to that function in the same line of code, in lockstep.
So the question becomes, in this situation, how does the CUDA runtime dispatch these "simultaneous" function calls? The answer to that question is unspecified, but it is not "random". Therefore it's reasonable that the order of dispatch for operations within a warp does not change from run to run.
If you launch enough threads to create multiple warps per block, and probably also include some other code to disperse and/or "randomize" the behavior between warps, you should be able to see printf operations emanating from separate warps occurring in "random" order.
To answer the second part of your question, when control flow diverges at the if statement, the threads where threadIdx.x != 0 simply wait at the convergence point after the if statement. They do not go on to the printf statement until thread 0 has completed the if block.
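Along the lines of the suggestion above, here is a sketch that launches several warps per block and gives each warp a different amount of dummy work, so that the printf output from different warps can interleave in varying order (hello_warps and the work loop are made up for illustration):
#include <stdio.h>

__global__ void hello_warps()
{
    int warp = threadIdx.x / 32;
    volatile int k = 0;                          // volatile so the dummy loop is not optimized away
    for (int i = 0; i < (warp + 1) * 100000; i++)
        k = k + 1;                               // each warp does a different amount of work
    printf("block %d, warp %d, thread %d\n", blockIdx.x, warp, threadIdx.x);
}

int main()
{
    hello_warps<<<2, 128>>>();                   // 4 warps per block
    cudaDeviceSynchronize();
    return 0;
}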

Understanding CUDA grid dimensions, block dimensions and threads organization (simple explanation) [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
How are threads organized to be executed by a GPU?
Hardware
If a GPU device has, for example, 4 multiprocessing units and they can run 768 threads each, then at a given moment no more than 4*768 threads will really be running in parallel (if you planned for more threads, they will be waiting their turn).
Software
Threads are organized in blocks. A block is executed by a multiprocessing unit.
The threads of a block can be identified (indexed) using a 1D index (x), 2D indices (x,y), or 3D indices (x,y,z), but in any case x*y*z <= 768 for our example (other restrictions apply to x, y, z; see the guide and your device capability).
Obviously, if you need more than those 4*768 threads you need more than 4 blocks.
Blocks may also be indexed 1D, 2D or 3D. There is a queue of blocks waiting to enter the GPU (because, in our example, the GPU has 4 multiprocessors and only 4 blocks are being executed simultaneously).
Now a simple case: processing a 512x512 image
Suppose we want one thread to process one pixel (i,j).
We can use blocks of 64 threads each. Then we need 512*512/64 = 4096 blocks (so as to have 512x512 threads = 4096*64).
It's common to organize (to make indexing the image easier) the threads in 2D blocks having blockDim = 8 x 8 (the 64 threads per block). I prefer to call it threadsPerBlock.
dim3 threadsPerBlock(8, 8); // 64 threads
and 2D gridDim = 64 x 64 blocks (the 4096 blocks needed). I prefer to call it numBlocks.
dim3 numBlocks(imageWidth / threadsPerBlock.x,   /* for instance 512/8 = 64 */
               imageHeight / threadsPerBlock.y);
The kernel is launched like this:
myKernel <<<numBlocks,threadsPerBlock>>>( /* params for the kernel function */ );
Finally: there will be something like "a queue of 4096 blocks", where a block is waiting to be assigned one of the multiprocessors of the GPU to get its 64 threads executed.
In the kernel the pixel (i,j) to be processed by a thread is calculated this way:
uint i = (blockIdx.x * blockDim.x) + threadIdx.x;
uint j = (blockIdx.y * blockDim.y) + threadIdx.y;
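For completeness, a minimal sketch of what the kernel body could look like (the pixel inversion and the parameter names are made up for illustration; the guard handles images whose sides are not multiples of the block size):
__global__ void myKernel(unsigned char* image, unsigned int width, unsigned int height)
{
    unsigned int i = (blockIdx.x * blockDim.x) + threadIdx.x;
    unsigned int j = (blockIdx.y * blockDim.y) + threadIdx.y;
    if (i < width && j < height)                            // skip out-of-range threads
        image[j * width + i] = 255 - image[j * width + i];  // e.g. invert the pixel
}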
Suppose a 9800GT GPU:
it has 14 multiprocessors (SM)
each SM has 8 thread-processors (AKA stream-processors, SP or cores)
allows up to 512 threads per block
warpsize is 32 (which means each of the 14x8=112 thread-processors can schedule up to 32 threads)
https://www.tutorialspoint.com/cuda/cuda_threads.htm
A block cannot have more than 512 active threads, therefore __syncthreads can only synchronize a limited number of threads, i.e. if you execute the following with 600 threads:
func1();
__syncthreads();
func2();
__syncthreads();
then the kernel must run twice and the order of execution will be:
func1 is executed for the first 512 threads
func2 is executed for the first 512 threads
func1 is executed for the remaining threads
func2 is executed for the remaining threads
Note:
The main point is __syncthreads is a block-wide operation and it does not synchronize all threads.
I'm not sure about the exact number of threads that __syncthreads can synchronize, since you can create a block with more than 512 threads and let the warp scheduler handle the scheduling. To my understanding it's more accurate to say: func1 is executed at least for the first 512 threads.
Before I edited this answer (back in 2010) I measured 14x8x32 threads were synchronized using __syncthreads.
I would greatly appreciate it if someone tested this again and provided more accurate information.
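To make the block-wide scope concrete, here is a minimal sketch (func1/func2 are placeholders): the barrier orders func1 before func2 only among the threads of one block; threads in different blocks are never synchronized by it.
__device__ void func1() { /* ... */ }
__device__ void func2() { /* ... */ }

__global__ void kernel()
{
    func1();            // every thread of this block runs func1
    __syncthreads();    // barrier over the threads of this block only
    func2();            // this block's func1 results are now visible within the block
}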