I'm trying to figure out whether load and store operations on primitive types are atomic when we load/store from shared memory in CUDA.
On the one hand, it seems that any load/store is compiled to the PTX instruction ld.weak.shared.cta, which does not enforce atomicity. But on the other hand, the manual says that accesses are serialized (9.2.3.1):
However, if multiple addresses of a memory request map to the same memory bank, the accesses are serialized
which hints at load/store atomicity "by default" in shared memory. Thus, would the instructions ld.weak.shared.cta and ld.relaxed.shared.cta have the same effect?
Or is that information the compiler needs anyway to avoid optimizing loads and stores away?
More generally, supposing variables are properly aligned, would __shared__ int and __shared__ cuda::atomic<int, cuda::thread_scope_block> provide the same guarantees (when considering only load and store operations)?
Bonus (relevant) question: with a primitive data type, properly aligned and stored in global memory, accessed by threads from a single block, are __device__ int and __device__ cuda::atomic<int, cuda::thread_scope_block> equivalent in terms of atomicity of load/store operations?
Thanks for any insight.
Serialization does not imply atomicity: thread A writes the first 2 bytes of an integer, then thread B reads the variable, and finally thread A writes the last 2 bytes. All of this happens in sequence (not in parallel), yet the access is still not atomic.
Further, serialization is not guaranteed in all cases, see:
Devices of compute capability 2.0 and higher have the additional ability to multicast shared memory accesses, meaning that multiple accesses to the same location by any number of threads within a warp are served simultaneously.
Conclusion: use atomic.
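For illustration, here is a minimal sketch of the two options being compared (assuming a toolkit that ships libcu++, i.e. CUDA 10.2 or later, and a GPU supported by cuda::atomic; all names are made up):

#include <cuda/atomic>

__global__ void compare_loads_stores(int* out)
{
    // Plain shared int: accesses compile to weak loads/stores, which the
    // compiler may reorder or keep in a register.
    __shared__ int plain;

    // Block-scoped atomic: loads/stores are performed as single, indivisible
    // operations and cannot be optimized away.
    __shared__ cuda::atomic<int, cuda::thread_scope_block> strong;

    if (threadIdx.x == 0) {
        plain = 42;                                           // weak store
        strong.store(42, cuda::std::memory_order_relaxed);    // relaxed atomic store
    }
    __syncthreads();

    int a = plain;                                            // weak load
    int b = strong.load(cuda::std::memory_order_relaxed);     // relaxed atomic load
    out[threadIdx.x] = a + b;
}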
I am trying to understand the Parallel Forall post on instruction-level profiling, and especially the following lines in the section Reducing Memory Dependency Stalls:
NVIDIA GPUs do not have indexed register files, so if a stack array is accessed with dynamic indices, the compiler must allocate the array in local memory. In the Maxwell architecture, local memory stores are not cached in L1 and hence the latency of local memory loads after stores is significant.
I understand what register files are, but what does it mean that they are not indexed? And why does it prevent the compiler from keeping a stack array accessed with dynamic indices in registers?
The quote says that the array will be stored in local memory. What block does this local memory correspond to in the architecture below?
... what does it mean that they are not indexed
It means that indirect addressing of registers is not supported. So it isn't possible to index from one register (theoretically the register holding the first element of an array) to another arbitrary register. As a result, the compiler can't generate code for non-static indexing of an array stored in registers.
What block does this local memory correspond to in the architecture below?
It doesn't correspond to any of them. Local memory is stored in DRAM, not on the GPU itself.
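A small sketch of the difference (hypothetical kernel; check the generated SASS with cuobjdump -sass, since for very small arrays the compiler can sometimes still use registers by emitting select instructions):

__global__ void indexing_demo(const int* in, int* out)
{
    int a[4];

    // Constant indices: the compiler can map a[0]..a[3] onto four registers.
    a[0] = in[0];
    a[1] = in[1];
    a[2] = in[2];
    a[3] = in[3];

    // Dynamic index: there is no hardware way to address "the register
    // holding a[0], plus i", so the array is typically demoted to local
    // memory (off-chip, and not cached in L1 on Maxwell).
    int i = threadIdx.x & 3;
    out[threadIdx.x] = a[i];
}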
I have a warp which writes some data to shared memory - with no overwrites, and soon after reads from shared memory. While there may be other warps in my block, they're not going to touch any part of that shared memory or write to anywhere my warp of interest reads from.
Now, I recall that despite warps executing in lockstep, we are not guaranteed that the shared memory reads following the shared memory writes will return the respective values supposedly written earlier by the warp. (This could theoretically be due to instruction reordering or, as #RobertCrovella points out, the compiler optimizing a shared memory access away.)
So, we need to resort to some explicit synchronization. Obviously, the block-level __syncthreads() works. This is what it does:
__syncthreads() is used to coordinate communication between the threads of the same block. When some threads within a block access the same addresses in shared or global memory, there are potential read-after-write, write-after-read, or write-after-write hazards for some of these memory accesses. These data hazards can be avoided by synchronizing threads in-between these accesses.
That's too powerful for my needs:
It applies to global memory also, not just shared memory.
It performs inter-warp synchronization; I only need intra-warp.
It prevents all types of hazards R-after-W, W-after-R, W-after-W; I only need R-after-W.
It works also for cases of multiple threads performing writes to the same location in shared memory; in my case all shared memory writes are disjoint.
On the other hand, something like __threadfence_block() does not seem to suffice. Is there anything "in-between" those two levels of strength?
Notes:
Related question: CUDA __syncthreads() usage within a warp.
If you're going to suggest I use shuffling instead, then, yes, that's sometimes possible - but not if you want to have array access to the data, i.e. dynamically decide which element of the shared data you're going to read. That would probably spill into local memory, which seems scary to me.
I was thinking maybe volatile could be useful to me, but I'm not sure if using it would do what I want.
If you have an answer that assumes the compute capability is at least XX.YY, that's useful enough.
If I understand #RobertCrovella correctly, this fragment of code should be safe from the hazard:
/* ... */
volatile MyType* ptr = get_some_shared_mem();
ptr[lane::index()] = foo();
auto other_lane_index = bar(); // returns a value within 0..31
auto other_lane_value = ptr[other_lane_index];
/* ... */
because of the use of volatile. (And assuming bar() doesn't introduce hazards of its own.)
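For reference, here is a self-contained version of the same pattern, with hypothetical int data, threadIdx.x % 32 standing in for lane::index(), and a made-up index computation standing in for bar():

__global__ void intra_warp_exchange(int* out)
{
    // One int per thread of the block; volatile forces the compiler to
    // actually perform the shared memory store and the later load instead
    // of keeping the value in a register.
    extern __shared__ int smem_raw[];
    volatile int* ptr = smem_raw;

    unsigned lane      = threadIdx.x % 32;
    unsigned warp_base = threadIdx.x - lane;   // first slot of this warp

    ptr[threadIdx.x] = (int) (lane * lane);    // each lane writes its own slot

    unsigned other_lane = 31 - lane;           // stand-in for bar()
    out[threadIdx.x] = ptr[warp_base + other_lane];
}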
indirectJ2[MAX_SUPER_SIZE] is a shared array.
My CUDA device kernel contains the following statement (executed by all threads in the thread block):
int nnz_col = indirectJ2[MAX_SUPER_SIZE - 1];
I suspect this would cause bank conflicts.
Is there any way I can implement the above thread-block-level broadcast efficiently using the new shuffle instructions for Kepler GPUs? I understand how it works at the warp level. Other solutions beyond the shuffle instruction (for instance, use of CUB etc.) are also welcome.
There is no bank conflict for that line of code on K40. Shared memory accesses already offer a broadcast mechanism. Quoting from the programming guide:
"A shared memory request for a warp does not generate a bank conflict between two threads that access any sub-word within the same 32-bit word or within two 32-bit words whose indices i and j are in the same 64-word aligned segment (i.e., a segment whose first index is a multiple of 64) and such that j=i+32 (even though the addresses of the two sub-words fall in the same bank): In that case, for read accesses, the 32-bit words are broadcast to the requesting threads "
There is no such concept as shared memory bank conflicts at the threadblock level. Bank conflicts only pertain to the access pattern generated by the shared memory request emanating from a single warp, for a single instruction in that warp.
If you like, you can write a simple test kernel and use profiler metrics (e.g. shared_replay_overhead) to test for shared memory bank conflicts.
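For example, a minimal test kernel along those lines (hypothetical, with MAX_SUPER_SIZE arbitrarily set to 64; run it under the profiler and inspect shared_replay_overhead):

#define MAX_SUPER_SIZE 64

__global__ void broadcast_test(int* out)
{
    __shared__ int indirectJ2[MAX_SUPER_SIZE];

    // Fill the shared array (one element per thread, no conflicts).
    if (threadIdx.x < MAX_SUPER_SIZE)
        indirectJ2[threadIdx.x] = threadIdx.x;
    __syncthreads();

    // Every thread in each warp reads the same 32-bit word, so the value is
    // broadcast: no bank conflicts, no replays.
    int nnz_col = indirectJ2[MAX_SUPER_SIZE - 1];
    out[blockIdx.x * blockDim.x + threadIdx.x] = nnz_col;
}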
Warp shuffle mechanisms do not extend beyond a single warp; therefore there is no short shuffle-only sequence that can broadcast a single quantity to multiple warps in a threadblock. Shared memory can be used to provide direct access to a single quantity to all threads in a warp; you are already doing that.
Global memory, __constant__ memory, and kernel parameters can all also be used to "broadcast" the same value to all threads in a threadblock.
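Two hedged sketches of those alternatives (names are hypothetical):

// Broadcast via a kernel parameter: the argument lives in constant memory
// and every thread that reads it gets the same value.
__global__ void use_param(int nnz_col, int* out)
{
    out[blockIdx.x * blockDim.x + threadIdx.x] = nnz_col;
}

// Broadcast via __constant__ memory, set from the host with
// cudaMemcpyToSymbol before the launch; all threads read the same address,
// which the constant cache serves as a broadcast.
__constant__ int c_nnz_col;

__global__ void use_constant(int* out)
{
    out[blockIdx.x * blockDim.x + threadIdx.x] = c_nnz_col;
}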
Since compute capability 2.0 (Fermi), global memory access works through a 768 KB L2 cache. It looks like developers no longer need to care about global memory banks. But global memory is still very slow, so the right access pattern is important. Now the point is to use/reuse the L2 as much as possible. And my question is: how? I would be thankful for some detailed info on how the L2 works and how I should organize and access global memory if I need, for example, a 100-200 element array per thread.
L2 cache helps in some ways, but it does not obviate the need for coalesced access of global memory. In a nutshell, coalesced access means that for a given read (or write) instruction, individual threads in a warp are reading (or writing) adjacent, contiguous locations in global memory, preferably that are aligned as a group on a 128-byte boundary. This will result in the most effective utilization of the available memory bandwidth.
In practice this is often not difficult to accomplish. For example:
int idx=threadIdx.x + (blockDim.x * blockIdx.x);
int mylocal = global_array[idx];
will give coalesced (read) access across all the threads in a warp, assuming global_array is allocated in an ordinary fashion using cudaMalloc in global memory. This type of access makes 100% usage of the available memory bandwidth.
A key takeaway is that memory transactions ordinarily occur in 128-byte blocks, which happens to be the size of a cache line. If you request even one of the bytes in a block, the entire block will be read (and stored in L2, normally). If you later read other data from that block, it will normally be serviced from L2, unless it has been evicted by other memory activity. This means that the following sequence:
int mylocal1 = global_array[0];
int mylocal2 = global_array[1];
int mylocal3 = global_array[31];
would all typically be serviced from a single 128-byte block. The first read, for mylocal1, will trigger the 128-byte read. The second read, for mylocal2, would normally be serviced from the cached value (in L2 or L1), not by triggering another read from memory. However, if the algorithm can be suitably modified, it's better to read all your data contiguously from multiple threads, as in the first example. This may be just a matter of clever organization of data, for example using structures of arrays rather than arrays of structures.
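As an illustration of that last point, here is a sketch of the two layouts (hypothetical particle struct):

// Array of structures: thread i reads aos[i].x, so consecutive threads touch
// addresses 16 bytes apart and each warp pulls in several 128-byte blocks it
// only partially uses.
struct Particle { float x, y, z, w; };

__global__ void read_aos(const Particle* aos, float* out)
{
    int idx = threadIdx.x + (blockDim.x * blockIdx.x);
    out[idx] = aos[idx].x;
}

// Structure of arrays: consecutive threads read consecutive floats from the
// x array, so each warp's request coalesces into full 128-byte blocks.
struct ParticlesSoA { float *x, *y, *z, *w; };

__global__ void read_soa(ParticlesSoA soa, float* out)
{
    int idx = threadIdx.x + (blockDim.x * blockIdx.x);
    out[idx] = soa.x[idx];
}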
In many respects, this is similar to CPU cache behavior. The concept of a cache line is similar, along with the behavior of servicing requests from the cache.
Fermi L1 and L2 can support write-back and write-through. L1 is available on a per-SM basis and is configurably split with shared memory to be either 16 KB L1 (and 48 KB shared memory) or 48 KB L1 (and 16 KB shared memory). L2 is unified across the device and is 768 KB.
Some advice I would offer is to not assume that the L2 cache just fixes sloppy memory accesses. The GPU caches are much smaller than equivalent caches on CPUs, so it's easier to get into trouble there. A general piece of advice is simply to code as if the caches were not there. Rather than CPU oriented strategies like cache-blocking, it's usually better to focus your coding effort on generating coalesced accesses and then possibly make use of shared memory in some specific cases. Then for the inevitable cases where we can't make perfect memory accesses in all situations, we let the caches provide their benefit.
You can get more in-depth guidance by looking at some of the available NVIDIA webinars. For example, the Global Memory Usage & Strategy webinar (and slides) or the CUDA Shared Memory & Cache webinar would be instructive for this topic. You may also want to read the Device Memory Access section of the CUDA C Programming Guide.
I am writing a CUDA kernel that requires maintaining a small associative array per thread. By small, I mean 8 elements max worst case, and an expected number of entries of two or so; so nothing fancy; just an array of keys and an array of values, and indexing and insertion happens by means of a loop over said arrays.
Now I do this by means of thread-local memory, that is, identifier[size], where size is a compile-time constant. Now I've heard that under some circumstances this memory is stored off-chip, and under other circumstances it is stored on-chip. Obviously I want the latter, under all circumstances. I understand that I can accomplish this with a block of shared memory, where I let each thread work on its own private block; but really? I don't want to share anything between threads, and it would be a horrible kludge.
What exactly are the rules for where this memory goes? I can't seem to find any word from NVIDIA. For the record, I am using CUDA 5 and targeting Kepler.
Local variables are stored either in registers or in off-chip memory (which is cached on compute capability >= 2.0).
Registers are only used for arrays if all array indices are constant and can be determined at compile time, as the architecture has no means for indexed access to registers at runtime.
In your case the number of keys may be small enough to use registers (and to tolerate the increase in register pressure). Unroll all loops over the array accesses to allow the compiler to place the keys in registers, and use cuobjdump -sass to check that it actually does.
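Here is a sketch of what that can look like (hypothetical 8-entry key/value lookup; after inlining, every array index is a compile-time constant):

#define MAX_ENTRIES 8

__device__ __forceinline__ int lookup(const int (&keys)[MAX_ENTRIES],
                                      const int (&values)[MAX_ENTRIES],
                                      int key)
{
    int result = -1;
    // Fully unrolled: each keys[i]/values[i] access uses a constant index,
    // so the compiler is free to keep both arrays entirely in registers.
    #pragma unroll
    for (int i = 0; i < MAX_ENTRIES; ++i)
        if (keys[i] == key)
            result = values[i];
    return result;
}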
If you don't want to spend registers, you can either choose shared memory with a per-thread offset (but check that the additional registers needed to hold per-thread indices into shared memory don't cost more than the keys themselves would), as you mentioned, or do nothing and use off-chip "local" memory (really "global" memory with just a different addressing scheme), hoping for the cache to do its work.
If you hope for the cache to hold the keys and values, and don't use much shared memory, it may be beneficial to select the 48 KB cache / 16 KB shared memory setting over the default 16 KB / 48 KB split using cudaDeviceSetCacheConfig().
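For example (host side, before the kernel launch; my_kernel is a placeholder name):

#include <cuda_runtime.h>

__global__ void my_kernel() { }   // placeholder for the kernel whose local arrays should stay cached

void configure_cache()
{
    // Prefer 48 KB L1 / 16 KB shared memory device-wide...
    cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);

    // ...or just for this particular kernel.
    cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferL1);
}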