In CUDA 9, nVIDIA has introduced a new notion of "cooperative groups"; and, for a reason not entirely clear to me, __ballot() is now (as of CUDA 9) deprecated in favor of __ballot_sync(). Is that an alias, or have the semantics changed?
... similar question for other builtins which now have _sync() appended to their names.
No, the semantics are not the same. The function calls themselves are different, one is not an alias for the other, new functionality has been exposed, and the implementation behavior now differs between the Volta architecture and previous architectures.
First of all, to lay the groundwork, it's necessary to be cognizant of the fact that Volta introduced the possibility of independent thread scheduling, by adding a per-thread program counter and other changes. As a result, Volta can behave in a non-warp-synchronous fashion for extended periods of time, including during periods of execution when previous architectures would still be warp-synchronous.
Most of the warp intrinsics work by delivering expected results only for threads that are actually participating (i.e. are actually active for the issue of that instruction, in that cycle). The programmer can now be explicit about which threads are expected to participate, via the new mask parameter. However, there are some requirements, in particular on Pascal and earlier architectures. From the programming guide:
Note, however, that for Pascal and earlier architectures, all threads in mask must execute the same warp intrinsic instruction in convergence, and the union of all values in mask must be equal to the warp's active mask.
On Volta, however, the warp execution engine will bring about the necessary synchronization/participation amongst the threads indicated in the mask, in order to make the desired/indicated operation valid (assuming the appropriate _sync version of the intrinsic is used). To be clear, the warp execution engine will reconverge threads that are diverged on Volta in order to match the mask; however, it will not overcome programmer-induced errors, such as preventing a thread from participating in a _sync() intrinsic via conditional statements.
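As a minimal sketch (my own illustrative kernel, not from the question), the new-style intrinsic takes the participation mask as its first argument:

    __global__ void vote_example(const int *in, unsigned *out)
    {
        int v = in[threadIdx.x];
        // All 32 lanes are named in the mask; on Volta the execution
        // engine will reconverge them, if necessary, before the vote.
        unsigned ballot = __ballot_sync(0xFFFFFFFFu, v > 0);
        // Lane 0 of each warp records the warp's vote pattern.
        if ((threadIdx.x & 31) == 0)
            out[threadIdx.x >> 5] = ballot;
    }

On Pascal and earlier, the same call is legal only if all threads named in the mask actually reach it in convergence, per the programming guide excerpt above.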
This related question discusses the mask parameter. This answer is not intended to address all possible questions that may arise from independent thread scheduling and the impact on warp level intrinsics. For that, I encourage reading of the programming guide.
Related
While reading through the CUDA programming guide:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/#simt-architecture
I came across the following paragraph:
Prior to Volta, warps used a single program counter shared amongst all 32 threads in the warp together with an active mask specifying the active threads of the warp. As a result, threads from the same warp in divergent regions or different states of execution cannot signal each other or exchange data, and algorithms requiring fine-grained sharing of data guarded by locks or mutexes can easily lead to deadlock, depending on which warp the contending threads come from.
However, at the start of the same section, it says:
Individual threads composing a warp start together at the same program address, but they have their own instruction address counter and register state and are therefore free to branch and execute independently.
Which appears to contradict the other paragraph, because it says that threads have their own program counter, while the first paragraph claims they do not.
How is this active mask handled when a program has nested branches (such as if statements)?
How does a thread know when the divergent part which it did not need to execute is done, if it supposedly does not have its own program counter?
This answer is highly speculative, but based on the available information and some educated guessing, I believe the way it used to work before Volta is that each warp would basically keep a stack of "return addresses" together with the active mask (or, more likely, the inverse of the active mask, i.e., the mask for running the other part of the branch once you return). With this design, each warp can have only a single active branch at any point in time. A consequence of this is that the warp scheduler can only ever schedule the one active branch of a warp. This makes fair, starvation-free scheduling impossible and gives rise to all the limitations there used to be, e.g., concerning locks.
I believe what they basically did with Volta is introduce a separate such stack and program counter for each branch (or maybe even for each thread; it should be functionally indistinguishable whether each thread has its own physical program counter or whether there is one shared program counter per branch; if you really wanted to probe this implementation detail, you could perhaps design an experiment based on checking at which point you run out of stack space). This change gives all current branches an explicit representation and allows the warp scheduler to pick threads from any branch to run at any time. As a result, the warp scheduling can be made starvation-free, which gets rid of many of the restrictions that earlier architectures had…
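To make the scheduling limitation concrete, here is a sketch of the classic intra-warp spin lock (my own illustrative code, not from the guide) that could deadlock before Volta:

    __device__ int lock = 0;

    __global__ void increment_with_lock(int *counter)
    {
        bool done = false;
        while (!done) {
            if (atomicCAS(&lock, 0, 1) == 0) {  // try to acquire the lock
                *counter += 1;                  // critical section
                atomicExch(&lock, 0);           // release the lock
                done = true;
            }
        }
    }

Pre-Volta, only one branch of a warp can be scheduled at a time, so the spinning losers can starve the lane that holds the lock; once every branch has an explicit representation, the scheduler can let the lock holder make progress.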
I have basically the same question as posed in this discussion. In particular I want to refer to this final response:
I think there are two different questions mixed together in this thread:

1. Is there a performance benefit to using a 2D or 3D mapping of input or output data to threads? The answer is "absolutely" for all the reasons you and others have described. If the data or calculation has spatial locality, then so should the assignment of work to threads in a warp.

2. Is there a performance benefit to using CUDA's multidimensional grids to do this work assignment? In this case, I don't think so, since you can do the index calculation trivially yourself at the top of the kernel. This burns a few arithmetic instructions, but that should be negligible compared to the kernel launch overhead.

This is why I think the multidimensional grids are intended as a programmer convenience rather than a way to improve performance. You do absolutely need to think about each warp's memory access patterns, though.
I want to know whether this situation still holds today, and what the reason is for needing a multidimensional "outer" grid at all.
What I'm trying to understand is whether or not there is a significant purpose to this (e.g. an actual benefit from spatial locality) or is it there for convenience (e.g. in an image processing context, is it there only so that we can have CUDA be aware of the x/y "patch" that a particular block is processing so it can report it to the CUDA Visual Profiler or something)?
A third option is that this is nothing more than a holdover from earlier versions of CUDA, where it was a workaround for hardware indexing limits.
There is definitely a benefit to using a multi-dimensional grid. The different entries (tid, ctaid) are read-only variables visible as special registers; see the PTX ISA:
PTX includes a number of predefined, read-only variables, which are visible as special registers and accessed through mov or cvt instructions.
The special registers are:
%tid
%ntid
%laneid
%warpid
%nwarpid
%ctaid
%nctaid
If some of this data can be used without further processing, not only do you save arithmetic instructions (potentially at each indexing step of multi-dimensional data) but, more importantly, you save registers, which are a very scarce resource on any GPU hardware.
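As a rough illustration (hypothetical kernels, not from the PTX ISA), compare reading a 2D index straight from the special registers with recomputing it from a flat 1D launch:

    // 2D launch: x and y come directly from the %tid/%ctaid registers.
    __global__ void copy2d_builtin(float *dst, const float *src, int width)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        dst[y * width + x] = src[y * width + x];
    }

    // 1D launch: x and y must be recomputed, spending extra instructions
    // and registers on the division and modulo.
    __global__ void copy2d_manual(float *dst, const float *src, int width)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int x = i % width;
        int y = i / width;
        dst[y * width + x] = src[y * width + x];
    }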
OpenCL and CUDA have included atomic operations for several years now (although obviously not every CUDA or OpenCL device supports these). But my question is about the possibility of "living with" races due to non-atomic writes.
Suppose several threads in a grid all write to the same location in global memory. Are we guaranteed that, when kernel execution has concluded, the results of one of these writes will be present in that location, rather than some junk?
Relevant parameters for this question (choose any combination(s), except nVIDIA+CUDA, which already has an answer):
Memory space: Global memory only; this question is not about local/shared/private memory.
Alignment: Within a single memory write width (e.g. 128 bits on nVIDIA GPUs)
GPU Manufacturer: AMD / nVIDIA
Programming framework: CUDA / OpenCL
Position of store instruction in code: Same line of code for all threads / different lines of code.
Write destination: Fixed address / fixed offset from the address of a function parameter / completely dynamic
Write width: 8 / 32 / 64 bits.
Are we guaranteed that, when kernel execution has concluded, the results of one of these writes will be present in that location, rather than some junk?
For current CUDA GPUs, and I'm pretty sure for NVIDIA GPUs with OpenCL, the answer is yes. Most of my terminology below will have CUDA in view. If you require an exhaustive answer for both CUDA and OpenCL, let me know, and I'll delete this answer. Very similar questions to this one have been asked, and answered, before anyway. Here's another, and I'm sure there are others.
When multiple "simultaneous" writes occur to the same location, one of them will win, intact.
Which one will win is undefined. The behavior of the non-winning writes is also undefined (they may occur but be replaced by the winner, or they may not occur at all). The actual contents of the memory location may transit through various values (such as the original value, or any of the validly written values), but the transit will not pass through "junk" values (i.e. values that were not already there and were not written by any thread). The transit eventually ends up at the "winner".
Example 1:
Location X contains zero. Threads 1, 5, 32, 30000, and 450000 all write 1 to that location. If there is no other write traffic to that location, that location will eventually contain the value 1 (at kernel termination, or earlier).
Example 2:
Location X contains 5. Thread 32 writes 1 to X. Thread 90303 writes 7 to X. Thread 432322 writes 972 to X. If there is no other write traffic to that location, upon kernel termination, or earlier, location X will contain either 1, 7 or 972. It will not contain any other value, including 5.
I'm assuming X is in global memory, and all traffic to it is naturally aligned to it, and all traffic to it is of the same size, although these principles apply to shared memory as well. I'm also assuming you have not violated CUDA programming principles, such as the requirement for naturally aligned traffic to device memory locations. The transactions I have in view here are those transactions that originate from a single SASS instruction (per thread). Such transactions can have a width of 1, 2, 4, or 8 bytes. The claims I've made here apply whether the writes are originating from "the same line of code" or "different lines".
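To make Example 2 concrete under those assumptions, a toy kernel might look like this (illustrative only):

    __global__ void racy_write(int *x)   // *x initially contains 5
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if      (tid == 32)     *x = 1;
        else if (tid == 90303)  *x = 7;
        else if (tid == 432322) *x = 972;
        // Afterwards *x is exactly one of 1, 7, or 972; never 5, and
        // never a "junk" mixture of bits from different writes.
    }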
These claims are based on the PTX memory consistency model, and therefore the "correctness" is ensured by the GPU hardware, not by the compiler, the CUDA programming model, or the C++ standard that CUDA is based on.
This is a fairly complex topic (especially when we factor in cache behavior, and what to expect when we throw reads in the mix), but "junk" values should never occur. The only values that should occur in global memory are those values that were there to begin with, or those values that were written by some thread, somewhere.
I have gone through many forum posts and the NVIDIA documentation, but I couldn't understand what __threadfence() does and how to use it. Could someone explain what the purpose of that intrinsic is?
Normally, there is no guarantee that if one block writes something to global memory, another block will "see" it. There is also no guarantee regarding the ordering of writes to global memory, except from the point of view of the block that issued them.
There are two exceptions:
atomic operations - these are always visible to other blocks
__threadfence()
Imagine that one block produces some data and then uses an atomic operation to mark a flag saying the data is there. It is possible that the other block, after seeing the flag, still reads incorrect or incomplete data.
The __threadfence function, coming to the rescue, ensures that ordering: all writes before it really happen before all writes after it, as seen from other blocks.
Note that the __threadfence function doesn't necessarily need to stall the current thread until its writes to global memory are visible to all other threads in the grid. Implemented in this naive way, the __threadfence function could hurt performance severely.
As an example, if you do something like:
store your data
__threadfence()
atomically mark a flag
it is guaranteed that if the other block sees the flag, it will also see the data.
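A minimal sketch of that pattern (the variable names are mine, not from the guide):

    __device__ int result;
    __device__ int flag = 0;

    __global__ void producer()
    {
        if (threadIdx.x == 0 && blockIdx.x == 0) {
            result = 42;           // 1. store your data
            __threadfence();       // 2. order the store before the flag
            atomicExch(&flag, 1);  // 3. atomically mark the flag
        }
    }

Any block that subsequently observes flag == 1 (for example via an atomic read) is guaranteed to observe result == 42 as well.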
Further reading: CUDA Programming Guide, Chapter B.5 (as of version 11.5)
I googled around a bit, but it is not clear to me whether GPUs programmed with CUDA can take advantage of, or use, instructions similar to those of the SSE SIMD extensions; for instance, whether we can sum two vectors of floats in double precision, each one with 4 values. If so, I wonder whether it would be better to use more, lighter threads, one for each of the 4 values of the vector, or to use SIMD.
CUDA programs compile to the PTX instruction set. That instruction set does not contain SIMD instructions. So, CUDA programs cannot make explicit use of SIMD.
However, the whole idea of CUDA is to do SIMD on a grand scale. Individual threads are part of groups called warps, within which every thread executes exactly the same sequence of instructions (although some of the instructions may be suppressed for some threads, giving the illusion of different execution sequences). NVidia call it Single Instruction, Multiple Thread (SIMT), but it's essentially SIMD.
As was mentioned in a comment on one of the replies, NVIDIA GPUs have some SIMD instructions. They operate on unsigned int on a per-byte and per-halfword basis. As of July 2015, there are several flavours of the following operations:
absolute value
addition/subtraction
computing average value
comparison
maximum/minimum
negation
sum of absolute difference
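For example, two of these flavours are exposed as the device intrinsics __vadd2 and __vmaxu4 (a small sketch; the kernel itself is hypothetical):

    __global__ void simd_demo(const unsigned *a, const unsigned *b,
                              unsigned *sum2, unsigned *max4)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        sum2[i] = __vadd2(a[i], b[i]);   // two 16-bit additions per word
        max4[i] = __vmaxu4(a[i], b[i]);  // four unsigned 8-bit maxima per word
    }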