I am recently working on a code that requires a initialization of a piece of global memory before each kernel launch, which will be modified later in the same kernel. I used to do a cudaMemset before each kernel launch. But the overhead cannot be neglected when I need to call this kernel for thousands of times. So I finally come up with this idea which is to use global memory to judge if all initialization work has been done. But I soon found that when some threads within the active blocks are doing the loop, the following blocks will not keep launching, which results in a dead loop.
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < (n + n)) {
data[i] = 0;
}//working.
__syncthreads();//sync
if (threadIdx.x == 0) {
atomicAdd((unsigned *)&flag, 1);//voting
while (flag < gridDim.x); //waiting
}
}
__syncthreads();
//do something with data
So is there a way to manually put the current blocks to sleep and keep the kernels launching? Or is there better solution for my initialization problem?
As you found out, you should not attempt block synchronization in CUDA - this will prevent later blocks from launching (because earlier blocks do not give up their resources) and deadlocks at the synchronization point.
Instead of trying to put blocks to sleep until their work is ready, try to move the work to a block that happens to be currently running. The Programming Guide has a worked example at the end of it's memory fence section for doing some extra work in the last block of a kernel. You could use this to prepare the global memory variables for the next block.
The benefit of not having to perform an extra cudaMemcpy() or an additional kernel launch however needs to be weighed against the extra atomic memory access per block and synchronization within each block. So with increasing number of blocks per grid at some point it is cheaper to just perform the extra cudaMemcpy().
Related
I've written a CUDA kernel that looks something like this:
int tIdx = threadIdx.x; // Assume a 1-D thread block and a 1-D grid
int buffNo = 0;
for (int offset=buffSz*blockIdx.x; offset<totalCount; offset+=buffSz*gridDim.x) {
// Select which "page" we're using on this iteration
float *buff = &sharedMem[buffNo*buffSz];
// Load data from global memory
if (tIdx < nLoadThreads) {
for (int ii=tIdx; ii<buffSz; ii+=nLoadThreads)
buff[ii] = globalMem[ii+offset];
}
// Wait for shared memory
__syncthreads();
// Perform computation
if (tIdx >= nLoadThreads) {
// Perform some computation on the contents of buff[]
}
// Switch pages
buffNo ^= 0x01;
}
Note that there's only one __syncthreads() in the loop, so the first nLoadThreads threads will start loading the data for the 2nd iteration while the rest of the threads are still computing the results for the 1st iteration.
I was thinking about how many threads to allocate for loading vs. computing, and I reasoned that I would only need a single warp for loading, regardless of buffer size, because that inner for loop consists of independent loads from global memory: they can all be in flight at the same time. Is this a valid line of reasoning?
And yet when I try this out, I find that (1) increasing the # of load warps dramatically increases performance, and (2) the disassembly in nvvp shows that buff[ii] = globalMem[ii+offset] was compiled into a load from global memory followed 2 instructions later by a store to shared memory, indicating that the compiler is not applying instruction-level parallelism here.
Would additional qualifiers (const, __restrict__, etc) on buff or globalMem help ensure the compiler does what I want?
I suspect the problem has to do with the fact that buffSz is not known at compile-time (the actual data is 2-D and the appropriate buffer size depends on the matrix dimensions). In order to do what I want, the compiler will need to allocate a separate register for each LD operation in flight, right? If I manually unroll the loop, the compiler re-orders the instructions so that there are a few LD in flight before the corresponding ST needs to access that register. I tried a #pragma unroll but the compiler only unrolled the loop without reordering the instructions, so that didn't help. What else can I do?
The compiler has no chance to reorder stores to shared memory away from loads from global memory, because a __syncthreads() barrier is immediately following.
As all off the threads have to wait at the barrier anyway, it is faster to use more threads for loading. This means that more global memory transactions can be in flight at any time, and each load thread has to incur global memory latency less often.
All CUDA devices so far do not support out-of-order execution, so the load loop will incur exactly one global memory latency per loop iteration, unless the compiler can unroll it and reorder loads before stores.
To allow full unrolling, the number of loop iterations needs to be known at compile time. You can use talonmies' suggestion of templating the loop trips to achieve this.
You can also use partial unrolling. Annotating the load loop with #pragma unroll 2 will allow the compiler to issue two loads, then two stores for every two loop iterations, thus achieve a similar effect to doubling nLoadThreads. Replacing 2 with higher numbers is possible, but you will hit the maximum number of transactions in flight at some point (use float2 or float4 moves to transfer more data with the same number of transactions). Also it is difficult to predict whether the compiler will prefer reordering instructions over the cost of more complex code for the final, potentially partial, trip through the unrolled loop.
So the suggestions are:
Use as many load threads as possible.
Unroll the load loop by templating the number of loop iterations and instantiating it for all possible number of loop trips (or the most common ones, with a generic fallback), or by using partial loop unrolling.
If the data is suitably aligned, move it as float2 or float4 to move more data with the same number of transactions.
Is it possible to write a CUDA kernel that shows how many threads are in a warp without using any of the warp related CUDA device functions and without using benchmarking? If so, how?
Since you indicated a solution with atomics would be interesting, I advance this as something that I believe gives an answer, but I'm not sure it is necessarily the answer you are looking for. I acknowledge it is somewhat statistical in nature. I provide this merely because I found the question interesting. I don't suggest that it is the "right" answer, and I suspect someone clever will come up with a "better" answer. This may provide some ideas, however.
In order to avoid using anything that explicitly references warps, I believe it is necessary to focus on "implicit" warp-synchronous behavior. I initially went down a path thinking about how to use an if-then-else construct, (which has some warp-synchronous implications) but struggled with that and came up with this approach instead:
#include <stdio.h>
#define LOOPS 100000
__device__ volatile int test2 = 0;
__device__ int test3 = 32767;
__global__ void kernel(){
for (int i = 0; i < LOOPS; i++){
unsigned long time = clock64();
// while (clock64() < (time + (threadIdx.x * 1000)));
int start = test2;
atomicAdd((int *)&test2, 1);
int end = test2;
int diff = end - start;
atomicMin(&test3, diff);
}
}
int main() {
kernel<<<1, 1024>>>();
int result;
cudaMemcpyFromSymbol(&result, test3, sizeof(int));
printf("result = %d threads\n", result);
return 0;
}
I compile with:
nvcc -O3 -arch=sm_20 -o t331 t331.cu
I call it "statistical" because it requres a large number of iterations (LOOPS) to produce a correct estimate (32). As the iteration count is decreased, the "estimate" increases.
We can apply additional warp-synchronous leverage by uncommenting the line that is commented out in the kernel. For my test case*, with that line uncommented, the estimate is correct even when LOOPS = 1
*my test case is CUDA 5, Quadro5000, RHEL 5.5
Here are several easy solutions. There are other solutions that use warp synchronous programming; however, many of the solutions will not work across all devices.
SOLUTION 1: Launch one or more blocks with max threads per block, read the special registers %smid and %warpid, and blockIdx and write values to memory. Group data by the three variables to find the warp size. This is even easier if you limit the launch to a single block then you only need %warpid.
SOLUTION 2: Launch one block with max threads per block and read the special register %clock. This requires the following assumptions which can be shown to be true on CC 1.0-3.5 devices:
%clock is defined as a unsigned 32-bit read-only cycle counter that wraps silently and updates every issue cycle
all threads in a warp read the same value for %clock
due to warp launch latency and instruction fetch warps on the same SM but different warp schedulers cannot issue the first instruction of a warp on the same cycle
All threads in the block that have the same clock time on CC1.0 - 3.5 devices (may change in the future) will have the same clock time.
SOLUTION 3: Use Nsight VSE or cuda-gdb debugger. The warp state views show you sufficient information to determine the warp size. It is also possible to single step and see the change to the PC address for each thread.
SOLUTION 4: Use Nsight VSE, Visual Profiler, nvprof, etc. Launch kernels of of 1 block with increasing thread count per launch. Determine when the thread count causing warps_launched to go from 1 to 2.
I have a piece of serial code which does something like this
if( ! variable )
{
do some initialization here
variable = true;
}
I understand that this works perfectly fine in serial and will only be executed once. What atomics operation would be the correct one here in CUDA?
It looks to me like what you want is a "critical section" in your code. A critical section allows one thread to execute a sequence of instructions while preventing any other thread or threadblock from executing those instructions.
A critical section can be used to control access to a memory area, for example, so as to allow un-conflicted access to that area by a single thread.
Atomics by themselves can only be used for a very limited, basically single operation, on a single variable. But atomics can be used to build a critical section.
You should use the following code in your kernel to control thread access to a critical section:
__syncthreads();
if (threadIdx.x == 0)
acquire_semaphore(&sem);
__syncthreads();
//begin critical section
// ... your critical section code goes here
//end critical section
__threadfence(); // not strictly necessary for the lock, but to make any global updates in the critical section visible to other threads in the grid
__syncthreads();
if (threadIdx.x == 0)
release_semaphore(&sem);
__syncthreads();
Prior to the kernel define these helper functions and device variable:
__device__ volatile int sem = 0;
__device__ void acquire_semaphore(volatile int *lock){
while (atomicCAS((int *)lock, 0, 1) != 0);
}
__device__ void release_semaphore(volatile int *lock){
*lock = 0;
__threadfence();
}
I have tested and used successfully the above code. Note that it essentially arbitrates between threadblocks using thread 0 in each threadblock as a requestor. You should further condition (e.g. if (threadIdx.x < ...)) your critical section code if you want only one thread in the winning threadblock to execute the critical section code.
Having multiple threads within a warp arbitrate for a semaphore presents additional complexities, so I don't recommend that approach. Instead, have each threadblock arbitrate as I have shown here, and then control your behavior within the winning threadblock using ordinary threadblock communication/synchronization methods (e.g. __syncthreads(), shared memory, etc.)
Note that this methodology will be costly to performance. You should only use critical sections when you cannot figure out how to otherwise parallelize your algorithm.
Finally, a word of warning. As in any threaded parallel architecture, improper use of critical sections can lead to deadlock. In particular, making assumptions about order of execution of threadblocks and/or warps within a threadblock is a flawed approach.
Here is an example of usage of binary_semaphore to implement a single device global "lock" that could be used for access control to a critical section.
I read cuda reference manual for about synchronization in cuda but i don't know it clearly. for example why we use cudaDeviceSynchronize() or __syncthreads()? if don't use them what happens, program can't work correctly? what difference between cudaMemcpy and cudaMemcpyAsync in action? can you show an example that show this difference?
cudaDeviceSynchronize() is used in host code (i.e. running on the CPU) when it is desired that CPU activity wait on the completion of any pending GPU activity. In many cases it's not necessary to do this explicitly, as GPU operations issued to a single stream are automatically serialized, and certain other operations like cudaMemcpy() have an inherent blocking device synchronization built into them. But for some other purposes, such as debugging code, it may be convenient to force the device to finish any outstanding activity.
__syncthreads() is used in device code (i.e. running on the GPU) and may not be necessary at all in code that has independent parallel operations (such as adding two vectors together, element-by-element). However, one example where it is commonly used is in algorithms that will operate out of shared memory. In these cases it's frequently necessary to load values from global memory into shared memory, and we want each thread in the threadblock to have an opportunity to load it's appropriate shared memory location(s), before any actual processing occurs. In this case we want to use __syncthreads() before the processing occurs, to ensure that shared memory is fully populated. This is just one example. __syncthreads() might be used any time synchronization within a block of threads is desired. It does not allow for synchronization between blocks.
The difference between cudaMemcpy and cudaMemcpyAsync is that the non-async version of the call can only be issued to stream 0 and will block the calling CPU thread until the copy is complete. The async version can optionally take a stream parameter, and returns control to the calling thread immediately, before the copy is complete. The async version typically finds usage in situations where we want to have asynchronous concurrent execution.
If you have basic questions about CUDA programming, it's recommended that you take some of the webinars available.
Moreover, __syncthreads() becomes really necessary when you have some conditional paths in your code, and then you want to run an operation that depends on several array element.
Consider the following example:
int n = threadIdx.x;
if( myarray[n] > 0 )
{
myarray[n] = - myarray[n];
}
double y = myarray[n] + myarray[n+1]; // Not all threads reaches here at the same time
In the above example, not all threads will have the same execution sequence. Some threads will take longer based on the if condition. When considering the last line of the example, you need to make sure that all the threads had exactly finished the if-condition and updated myarray correctly. If this wasn't the case, y may use some updated and non-updated values.
In this case, it becomes a must to add __syncthreads() before evaluating y to overcome this problem:
if( myarray[n] > 0 )
{
myarray[n] = - myarray[n];
}
__syncthreads(); // All threads will wait till they come to this point
// We are now quite confident that all array values are updated.
double y = myarray[n] + myarray[n+1];
I am aware that block sync is not possible, the only way is launching a new kernel.
BUT, let's suppose that I launch X blocks, where X corresponds to the number of the SM on my GPU. I should aspect that the scheduler will assign a block to each SM...right? And if the GPU is being utilized as a secondary graphic card (completely dedicated to CUDA), this means that, theoretically, no other process use it... right?
My idea is the following: implicit synchronization.
Let's suppose that sometimes I need only one block, and sometimes I need all the X blocks. Well, in those cases where I need just one block, I can configure my code so that the first block (or the first SM) will work on the "real" data while the other X-1 blocks (or SMs) on some "dummy" data, executing exactly the same instruction, just with some other offset.
So that all of them will continue to be synchronized, until I am going to need all of them again.
Is the scheduler reliable under this conditions? Or can you be never sure?
You've got several questions in one, so I'll try to address them separately.
One block per SM
I asked this a while back on nVidia's own forums, as I was getting results that indicated that this is not what happens. Apparently, the block scheduler will not assign a block per SM if the number of blocks is equal to the number of SMs.
Implicit synchronization
No. First of all, you cannot guarantee that each block will have its own SM (see above). Secondly, all blocks cannot access the global store at the same time. If they run synchronously at all, they will loose this synchronicity as of the first memory read/write.
Block synchronization
Now for the good news: Yes, you can. The atomic instructions described in Section B.11 of the CUDA C Programming Guide can be used to create a barrier. Assume that you have N blocks executing concurrently on your GPU.
__device__ int barrier = N;
__global__ void mykernel ( ) {
/* Do whatever it is that this block does. */
...
/* Make sure all threads in this block are actually here. */
__syncthreads();
/* Once we're done, decrease the value of the barrier. */
if ( threadIdx.x == 0 )
atomicSub( &barrier , 1 );
/* Now wait for the barrier to be zero. */
if ( threadIdx.x == 0 )
while ( atomicCAS( &barrier , 0 , 0 ) != 0 );
/* Make sure everybody has waited for the barrier. */
__syncthreads();
/* Carry on with whatever else you wanted to do. */
...
}
The instruction atomicSub(p,i) computes *p -= i atomically and is only called by the zeroth thread in the block, i.e. we only want to decrement barrier once. The instruction atomicCAS(p,c,v) sets *p = v iff *p == c and returns the old value of *p. This part just loops until barrier reaches 0, i.e. until all blocks have crossed it.
Note that you have to wrap this part in calls to __synchtreads() as the threads in a block do not execute in strict lock-step and you have to force them all to wait for the zeroth thread.
Just remember that if you call your kernel more than once, you should set barrier back to N.
Update
In reply to jHackTheRipper's answer and Cicada's comment, I should have pointed out that you should not try to start more blocks than can be concurrently scheduled on the GPU! This is limited by a number of factors, and you should use the CUDA Occupancy Calculator to find the maximum number of blocks for your kernel and device.
Judging by the original question, though, only as many blocks as there are SMs are being started, so this point is moot.
#Pedro is definitely wrong!
Achieving global synchronization has been the subject of several research works recently and, at last for non-Kepler architectures (I don't have one yet). The conclusion is always the same (or should be): it is not possible to achieve such a global synchronization across the whole GPU.
The reason is simple: CUDA blocks cannot be preempted, so given that you fully occupy the GPU, threads waiting for the barrier rendez-vous will never allow the block to terminate. Thus, it will not be removed from the SM, and will prevent the remaining blocks to run.
As a consequence, you will just freeze the GPU that will never be able to escape from this deadlock state.
-- edit to answer Pedro's remarks --
Such shortcomings have been noticed by other authors such as:
http://www.openclblog.com/2011/04/eureka.html
by the author of OpenCL in action
-- edit to answer Pedro's second remarks --
The same conclusion is made by #Jared Hoberock in this SO post:
Inter-block barrier on CUDA