How to avoid reading the same memory address in CUDA?

I am writing a CUDA program. The shared memory is declared as __shared__ int shared_memory[1024], and there are 32x32 threads per block. In the first round, each thread will access the value in shared_memory[0] and do the calculation; in the second round, each thread will access the value in shared_memory[31], ..., and in the 32nd round, each thread will access the value in shared_memory[1023].
I am wondering whether there is an efficient method to avoid all threads accessing the same shared memory position in each round. I understand that memory padding is a way to avoid bank conflicts, but the program needs to access the same position in shared memory.
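For reference, a minimal sketch of the access pattern described above (the kernel name, buffers, and per-round index are placeholders, not the asker's code); as the broadcast-related answers below explain, such same-word reads within a warp are serviced as a broadcast rather than a bank conflict:

__global__ void rounds_kernel(const int *in, int *out)    // hypothetical names
{
    __shared__ int shared_memory[1024];

    int tid = threadIdx.y * blockDim.x + threadIdx.x;      // 32x32 block -> 0..1023
    shared_memory[tid] = in[blockIdx.x * 1024 + tid];
    __syncthreads();

    int acc = 0;
    for (int round = 0; round < 32; ++round) {
        // Every thread in the block reads the same element in this round.
        // Within each warp, a read of the same 32-bit word is serviced as a
        // broadcast, not as a bank conflict.
        int v = shared_memory[round * 32];   // placeholder for the per-round index
        acc += v;                            // placeholder for "the calculation"
    }
    out[blockIdx.x * 1024 + tid] = acc;
}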

Related

Dynamic parallelism - passing contents of shared memory to spawned blocks?

While I've been writing CUDA kernels for a while now, I've not used dynamic parallelism (DP) yet. I've come up against a task for which I think it might fit; however, the way I would like to be able to use DP is:
If a block figures out that it needs more threads to complete its work, it spawns them; it imparts to its spawned threads "what it knows" - essentially, the contents of its shared memory, of which each block of spawned threads gets a copy in its own shared memory; the spawned threads then use what their parent "knew" to figure out what they need to do next, and do it.
AFAICT, though, this "inheritance" of shared memory does not happen. Is global memory (and constant memory via kernel arguments) the only way the "parent" DP kernel block can impart information to its "child" blocks?
There is no mechanism of the type you are envisaging. A parent thread cannot share either its local memory or its block's shared memory with a child kernel launched via dynamic parallelism. The only options are to use global or dynamically allocated heap memory when it is necessary for a parent thread to pass transient data to a child kernel.
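A minimal sketch of that workaround (illustrative names, a simple global staging buffer, and a parent block assumed to have 256 threads; requires compute capability 3.5+ and compilation with -rdc=true):

__global__ void child_kernel(int *staged, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Read what the parent block "knew" from the global staging buffer.
        staged[i] = staged[i] + 1;            // placeholder for the child's real work
    }
}

__global__ void parent_kernel(int *staging)
{
    __shared__ int smem[256];                 // the parent block's working state
    int tid = threadIdx.x;
    smem[tid] = tid;                          // placeholder computation
    __syncthreads();

    // Stage the block's shared state out to global memory; shared memory
    // itself cannot be passed to (or inherited by) the child grid.
    int *my_stage = staging + blockIdx.x * blockDim.x;
    my_stage[tid] = smem[tid];
    __threadfence();                          // make the writes visible device-wide
    __syncthreads();

    if (tid == 0) {
        // One thread per block launches the child grid over the staged data.
        child_kernel<<<1, blockDim.x>>>(my_stage, blockDim.x);
    }
}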

CUDA: Thread-block level broadcast on K40 using shuffle instructions

indirectJ2[MAX_SUPER_SIZE] is a shared array.
My CUDA device kernel contains the following statement (executed by all threads in the thread block):
int nnz_col = indirectJ2[MAX_SUPER_SIZE - 1];
I suspect this would cause bank conflicts.
Is there any way I can implement the above thread-block-level broadcast efficiently using the new shuffle instructions for Kepler GPUs? I understand how it works at the warp level. Other solutions that go beyond shuffle instructions (for instance, the use of CUB, etc.) are also welcome.
There is no bank conflict for that line of code on K40. Shared memory accesses already offer a broadcast mechanism. Quoting from the programming guide:
"A shared memory request for a warp does not generate a bank conflict between two threads that access any sub-word within the same 32-bit word or within two 32-bit words whose indices i and j are in the same 64-word aligned segment (i.e., a segment whose first index is a multiple of 64) and such that j=i+32 (even though the addresses of the two sub-words fall in the same bank): In that case, for read accesses, the 32-bit words are broadcast to the requesting threads "
There is no such concept as shared memory bank conflicts at the threadblock level. Bank conflicts only pertain to the access pattern generated by the shared memory request emanating from a single warp, for a single instruction in that warp.
If you like, you can write a simple test kernel and use profiler metrics (e.g. shared_replay_overhead) to test for shared memory bank conflicts.
Warp shuffle mechanisms do not extend beyond a single warp; therefore there is no short shuffle-only sequence that can broadcast a single quantity to multiple warps in a threadblock. Shared memory can be used to provide direct access to a single quantity to all threads in a warp; you are already doing that.
Global memory, __constant__ memory, and kernel parameters can also all be used to "broadcast" the same value to all threads in a threadblock.
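For illustration, a small sketch (names assumed) contrasting the two mechanisms discussed above; __shfl_sync is the modern spelling of the Kepler-era __shfl intrinsic:

__global__ void broadcast_demo(const int *in, int *out)
{
    __shared__ int indirectJ2[64];            // stand-in for MAX_SUPER_SIZE
    if (threadIdx.x < 64)
        indirectJ2[threadIdx.x] = in[threadIdx.x];
    __syncthreads();

    // Block-wide "broadcast": every thread reads the same 32-bit word.
    // Within each warp this is serviced as a broadcast, not a bank conflict.
    int nnz_col = indirectJ2[64 - 1];

    // Warp-wide broadcast: lane 0's value reaches only the 32 lanes of this
    // warp, not the other warps in the block.
    int from_lane0 = __shfl_sync(0xffffffff, (int)threadIdx.x, 0);

    out[blockIdx.x * blockDim.x + threadIdx.x] = nnz_col + from_lane0;
}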

Local, global, constant & shared memory

I read some CUDA documentation that refers to local memory. (It is mostly the early documentation.) The device properties report a local-memory size (per thread). What does 'local' memory mean? Where is 'local' memory, and how do I access it? Is it __device__ memory?
The device-properties also reports: global, shared, & constant mem size.
Are these statements correct:
Global memory is __device__ memory. It has grid scope, and a lifetime of the grid (kernel).
Constant memory is __device__ __constant__ memory. It has grid scope & a lifetime of the grid (kernel).
Shared memory is __device__ __shared__ memory. It has single block scope & a lifetime of that block (of threads).
I'm thinking shared memory is SM memory, i.e. memory that only a single SM has direct access to - a rather limited resource. Isn't an SM assigned a bunch of blocks at a time? Does this mean an SM can interleave the execution of different blocks (or not)? i.e. run block A's threads until they stall, then run block B's threads until they stall, then swap back to block A's threads again. OR does the SM run a set of block A's threads until they stall, then swap in another set of block A's threads, continuing until block A is exhausted, and only then begin work on block B?
I ask because of shared memory. If a single SM is swapping code in from 2 different blocks, then how does the SM quickly swap in/out the shared memory chunks?
(I'm thinking the latter scenario is true, and there is no swapping in/out of shared memory space. Block A runs until completion, then block B starts execution. Note: block A could belong to a different kernel than block B.)
From the CUDA C Programming Guide section 5.3.2.2, we see that local memory is used in several circumstances:
When each thread has some arrays but their size is not known at compile time (so they might not fit in the registers)
When the size of the arrays is known at compile time but is too big for register memory (this can also happen with big structs)
When the kernel has already used up all the register memory (so if we have filled the registers with n ints, the (n+1)th int will go into local memory) - this last case is register spilling, and it should be avoided, because:
"Local" memory actually lives in the global memory space, which means reads and writes to it are comparatively slow compared to register and shared memory. You'll access local memory every time you use some variable, array, etc in the kernel that doesn't fit in the registers, isn't shared memory, and wasn't passed as global memory. You don't have to do anything explicit to use it - in fact you should try to minimize its use, since registers and shared memory are much faster.
Edit:
Re: shared memory, you cannot have two blocks exchanging shared memory or looking at each other's shared memory. Since the order of execution of blocks is not guaranteed, if you tried to do this you might tie up an SM for hours waiting for another block to get executed. Similarly, two kernels running on the device at the same time can't see each other's memory UNLESS it is global memory, and even then you're playing with fire (race conditions). As far as I am aware, blocks/kernels can't really send "messages" to each other. Your scenario doesn't really make sense, since the order of execution of the blocks will be different every time and it's bad practice to stall a block waiting for another.

CUDA shared memory

I need to know something about CUDA shared memory. Let's say I assign 50 blocks with 10 threads per block in a G80 card. Each SM processor of a G80 can handle 8 blocks simultaneously. Assume that, after doing some calculations, the shared memory is fully occupied.
What will be the values in shared memory when the next 8 new blocks arrive? Will the previous values reside there? Or will the previous values be copied to global memory and the shared memory refreshed for next 8 blocks?
The CUDA documentation says the following about the type qualifiers:
Variables in registers have per-thread scope and last only for the lifetime of the kernel.
Variables in local memory have per-thread scope and last only for the lifetime of the kernel.
__device__ __shared__ type variables live in shared memory, have per-block scope, and last only for the lifetime of the kernel.
__device__ type variables live in global memory, have per-grid scope, and persist until the application exits.
__device__ __constant__ type variables have per-grid scope and persist until the application exits.
Thus, from this reference, the answer to your question is that the shared memory is refreshed for the next 8 blocks; values left there by earlier blocks are not preserved.
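A minimal sketch (illustrative names, assuming 128 threads per block) of the declarations behind that list and their scopes/lifetimes:

__device__ __constant__ float coeff[16];   // constant memory: grid scope, persists until the app exits
__device__ float global_table[128];        // global memory: grid scope, persists until the app exits

__global__ void scopes_example(float *out)
{
    __shared__ float tile[128];            // shared memory: one copy per block, kernel lifetime
    float r = coeff[threadIdx.x % 16];     // register variable: per thread, kernel lifetime

    tile[threadIdx.x] = r + global_table[threadIdx.x];
    __syncthreads();
    out[blockIdx.x * blockDim.x + threadIdx.x] = tile[threadIdx.x];
}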
For kernel blocks, the execution order and the SM assignment are effectively arbitrary. In that sense, even if an old value or address were preserved, it would be hard to keep track of it, and I doubt there is even a way to do that. Communication between blocks is done via off-chip memory. The latency associated with off-chip memory is the performance killer, which is what makes GPU programming tricky. On Fermi cards, blocks share some L2 cache, but one can't alter the behavior of these caches.

Size of statically allocated shared memory per block with Compute Prof (CUDA/OpenCL)

In Nvidia's Compute Prof there is a column called "static private mem per work group", and its tooltip says "Size of statically allocated shared memory per block". My application shows that I am getting 64 (bytes, I assume) per block. Does that mean I am using somewhere between 1 and 64 of those bytes, or is the profiler just telling me that this amount of shared memory was allocated, with no indication of whether it was used at all?
If it's allocated, it's probably because you used it. AFAIK CUDA passes parameters to kernels via shared memory, so it must be that.
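For illustration only (an assumption about what such a kernel might contain, not the asker's actual code), a declaration like the one below accounts for exactly 64 bytes of statically allocated shared memory per block, and the profiler reports that allocation regardless of how many of those bytes the kernel actually touches:

__global__ void static_smem_example(float *out)
{
    __shared__ float buf[16];              // 16 * 4 bytes = 64 bytes of static shared memory per block
    buf[threadIdx.x % 16] = (float)threadIdx.x;
    __syncthreads();
    out[blockIdx.x * blockDim.x + threadIdx.x] = buf[threadIdx.x % 16];
}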