CUDA shared memory

I need to know something about CUDA shared memory. Let's say I launch 50 blocks with 10 threads per block on a G80 card. Each SM of a G80 can handle 8 blocks simultaneously. Assume that, after doing some calculations, the shared memory is fully occupied.
What will the values in shared memory be when the next 8 blocks arrive? Will the previous values still be there? Or will the previous values be copied to global memory and the shared memory refreshed for the next 8 blocks?

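The reference states the following about the variable type qualifiers:
Automatic variables (other than arrays): in registers, scoped to a thread, lifetime of the kernel
Automatic array variables: in local memory, scoped to a thread, lifetime of the kernel
__device__ __shared__ variables: in shared memory, scoped to a block, lifetime of the kernel
__device__ variables: in global memory, scoped to the grid, lifetime of the application
__device__ __constant__ variables: in constant memory, scoped to the grid, lifetime of the application
So, going by this reference, the previous values are not preserved for the next 8 blocks, and nothing is copied to global memory automatically. Each new block gets its own shared memory allocation whose initial contents are undefined, so every block has to write its shared variables before reading them, and anything worth keeping has to be stored to global memory before the block finishes.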
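To make the lifetimes concrete, here is a minimal sketch using the launch shape from the question (50 blocks of 10 threads). The names count_threads, launches_seen and run_lifetime_demo are made up for illustration, and the code assumes a device that supports atomicAdd on shared memory. Each block initializes its own __shared__ variable, while the __device__ variable keeps its value from one launch to the next.

```cuda
#include <cuda_runtime.h>

// Global-memory variable: grid scope, lives until the application exits.
__device__ int launches_seen;

__global__ void count_threads(int *per_block_out)
{
    // Block scope. Contents are undefined when the block starts.
    __shared__ int scratch;

    if (threadIdx.x == 0) scratch = 0;   // each block initializes its own copy
    __syncthreads();

    atomicAdd(&scratch, 1);              // every thread of the block adds 1
    __syncthreads();

    if (threadIdx.x == 0) {
        per_block_out[blockIdx.x] = scratch;      // results leave via global memory
        if (blockIdx.x == 0) launches_seen += 1;  // survives across launches
    }
}

// Hypothetical host helper: launches the kernel twice with 50 blocks of
// 10 threads. Returns launches_seen after both launches, or -1 on a CUDA
// error. *last_block_count receives the thread count seen by block 49.
int run_lifetime_demo(int *last_block_count)
{
    const int blocks = 50, threads = 10;
    int zero = 0, seen = -1;
    int *d_out = nullptr;

    if (cudaMemcpyToSymbol(launches_seen, &zero, sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, blocks * sizeof(int)) != cudaSuccess) return -1;

    count_threads<<<blocks, threads>>>(d_out);
    count_threads<<<blocks, threads>>>(d_out);

    bool ok = cudaDeviceSynchronize() == cudaSuccess
           && cudaMemcpyFromSymbol(&seen, launches_seen, sizeof(int)) == cudaSuccess
           && cudaMemcpy(last_block_count, d_out + blocks - 1, sizeof(int),
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_out);
    return ok ? seen : -1;
}
```

If the initialization of scratch were removed, the per-block counts would depend on whatever happened to be in that shared memory location, which is exactly the situation the question asks about.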

For kernel blocks, the execution order and the assignment of blocks to SMs are not defined. So even if an old value happened to survive at some address, there is no reliable way to keep track of it, and I doubt there is a supported way to do that at all. Communication between blocks is done via off-chip memory. The latency of off-chip memory is the performance killer, which is what makes GPU programming tricky. On Fermi cards, blocks share an L2 cache, but you can't control the behavior of that cache.

Related

Why must kernels that use shared memory be synchronized? (CUDA)

A theoretical question about CUDA and parallel computation on the GPU.
As far as I know, a kernel is a function that is executed by the GPU.
Each kernel is executed by a grid, which consists of blocks, and the blocks consist of threads.
So each kernel can be executed by thousands of threads.
My question is about shared memory and synchronization in kernel code.
Why is synchronization necessary in kernels that use shared memory?
How does synchronization affect processing efficiency?
CW answer to get this off the unanswered list:
Why is synchronization necessary in kernels that use shared memory?
__syncthreads() is frequently found in kernels that use shared memory, after the shared memory load, to prevent race conditions. Since the shared memory is usually loaded cooperatively (by all threads in the block), it's necessary to make sure that all threads have completed the loading operation before any thread begins to use the loaded data for further processing.
__syncthreads() is documented in the CUDA C Programming Guide.
Note that it only synchronizes threads within a given block, not grid-wide.
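As a rough illustration of the cooperative load pattern, the sketch below reverses a 64-element tile inside one block. The names reverse_tile, run_reverse_demo and TILE are invented for this example. Each thread reads an element that a different thread wrote, so without the barrier a thread could read a slot before it has been filled.

```cuda
#include <cuda_runtime.h>

#define TILE 64   // threads per block and elements per tile (illustrative)

__global__ void reverse_tile(const int *in, int *out)
{
    __shared__ int tile[TILE];
    int t = threadIdx.x;

    tile[t] = in[t];               // cooperative load: one element per thread
    __syncthreads();               // wait until the whole tile is loaded
    out[t] = tile[TILE - 1 - t];   // read an element another thread wrote
}

// Hypothetical host helper: fills the input with 0..TILE-1, runs one block,
// and copies the reversed tile into host_out (TILE entries).
// Returns 0 on success, -1 on a CUDA error.
int run_reverse_demo(int *host_out)
{
    int host_in[TILE];
    for (int i = 0; i < TILE; ++i) host_in[i] = i;

    int *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, sizeof(host_in)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, sizeof(host_in)) != cudaSuccess) {
        cudaFree(d_in);
        return -1;
    }

    cudaMemcpy(d_in, host_in, sizeof(host_in), cudaMemcpyHostToDevice);
    reverse_tile<<<1, TILE>>>(d_in, d_out);

    // cudaMemcpy waits for the kernel in the default stream to finish.
    cudaError_t err = cudaMemcpy(host_out, d_out, sizeof(host_in),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return err == cudaSuccess ? 0 : -1;
}
```

The barrier has a cost, since every thread in the block waits for the slowest one, so it is usually placed only where one thread depends on data written by another.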

What is the fastest way to transfer a 256-byte block of data from one CUDA block to another?

What is the fastest way to transfer a 256-byte block of data from one CUDA block to another?
And is there a way to do it faster than going through global memory?
In theory, on devices of compute capability >= 2.0, transfers between blocks, using global memory, could be very fast because global memory transactions use the L1 and L2 caches.
However, the only way to safely transfer memory between blocks is to launch those blocks in separate kernel invocations. Then, you lose the theoretical advantage I just described, as the caches are flushed between invocations.
Within a given kernel invocation, you cannot know in which order your blocks will be launched.
Transferring data between blocks launched by separate kernel invocations is a common paradigm in CUDA and if there is enough computational work to be done, the latency of the global memory transactions can be completely hidden.
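A minimal sketch of that paradigm, assuming both kernels are launched into the same (default) stream so that the second starts only after the first has finished. The names produce, consume, run_handoff_demo and PAYLOAD are made up for illustration.

```cuda
#include <cuda_runtime.h>

#define PAYLOAD 256   // bytes handed from the producer to the consumer

// First launch: one block writes the 256-byte payload to global memory.
__global__ void produce(unsigned char *buf)
{
    buf[threadIdx.x] = (unsigned char)threadIdx.x;
}

// Second launch: a different block reads the payload. Reading in reverse
// order shows that every byte written by the producer is visible.
__global__ void consume(const unsigned char *buf, int *out)
{
    out[threadIdx.x] = 2 * buf[PAYLOAD - 1 - threadIdx.x];
}

// Hypothetical host helper: runs both kernels back to back and copies the
// consumer's PAYLOAD results into host_out.
// Returns 0 on success, -1 on a CUDA error.
int run_handoff_demo(int *host_out)
{
    unsigned char *d_buf = nullptr;
    int *d_out = nullptr;
    if (cudaMalloc(&d_buf, PAYLOAD) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, PAYLOAD * sizeof(int)) != cudaSuccess) {
        cudaFree(d_buf);
        return -1;
    }

    // Kernels in the same stream run in launch order, so consume sees
    // everything produce wrote.
    produce<<<1, PAYLOAD>>>(d_buf);
    consume<<<1, PAYLOAD>>>(d_buf, d_out);

    cudaError_t err = cudaMemcpy(host_out, d_out, PAYLOAD * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_buf);
    cudaFree(d_out);
    return err == cudaSuccess ? 0 : -1;
}
```

The payload never leaves the device between the two launches, so the only extra cost compared with a single kernel is the launch overhead and the global memory round trip.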

Local, global, constant & shared memory

I read some CUDA documentation that refers to local memory. (It is mostly the early documentation.) The device-properties reports a local-mem size (per thread). What does 'local' memory mean? What is 'local' memory? Where is 'local' memory? How do I access 'local' mem? It is __device__ memory, no?
The device-properties also reports: global, shared, & constant mem size.
Are these statements correct:
Global memory is __device__ memory. It has grid scope, and a lifetime of the grid (kernel).
Constant memory is __device__ __constant__ memory. It has grid scope & a lifetime of the grid (kernel).
Shared memory is __device__ __shared__ memory. It has single block scope & a lifetime of that block (of threads).
I'm thinking shared mem is SM memory, i.e. memory that only that single SM has direct access to. A resource that is rather limited. Isn't an SM assigned a bunch of blocks at a time? Does this mean an SM can interleave the execution of different blocks (or not)? i.e. Run block A threads until they stall. Then run block B threads until they stall. Then swap back to block A threads again. OR does the SM run a set of threads for block A until they stall. Then another set of block A threads is swapped in. This swap continues until block A is exhausted. Then and only then does work begin on block B.
I ask because of shared memory. If a single SM is swapping code in from 2 different blocks, then how does the SM quickly swap in/out the shared memory chunks?
(I'm thinking the latter scenario is true, and there is no swapping in/out of shared memory space. Block A runs until completion, then block B starts execution. Note: block A could belong to a different kernel than block B.)
From the CUDA C Programming Guide section 5.3.2.2, we see that local memory is used in several circumstances:
When a thread has arrays that the compiler cannot prove are indexed only with constant quantities (registers are not addressable, so an index computed at run time forces the array out of them)
When the arrays are too big for register memory (this can also happen with big structs)
When the kernel has already used up all the register memory (so if we have filled the registers with n ints, the n+1th int will go into local memory). This last case is register spilling, and it should be avoided, because:
"Local" memory actually lives in the global memory space, which means reads and writes to it are slow compared to registers and shared memory. You'll access local memory every time you use some variable, array, etc. in the kernel that doesn't fit in the registers, isn't shared memory, and wasn't passed as global memory. You don't have to do anything explicit to use it. In fact you should try to minimize its use, since registers and shared memory are much faster.
Edit:
Re: shared memory, you cannot have two blocks exchanging shared memory or looking at each other's shared memory. Since the order of execution of blocks is not guaranteed, if you tried to do this you might tie up an SM for hours waiting for another block to get executed. Similarly, two kernels running on the device at the same time can't see each other's memory UNLESS it is global memory, and even then you're playing with fire (race conditions). As far as I am aware, blocks/kernels can't really send "messages" to each other. Your scenario doesn't really make sense, since the order of execution of the blocks can be different every time and it's bad practice to stall a block waiting for another.
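Going back to the local memory cases listed above, here is a small sketch of the kind of kernel that typically ends up using it: a per-thread array indexed with a value that is only known at run time. Whether the compiler really places counts in local memory depends on the compiler version and target, so treat that as an assumption; compiling with nvcc -Xptxas -v is one way to see the reported local memory usage. The names count_keys, run_local_demo and BINS are made up for illustration.

```cuda
#include <cuda_runtime.h>

#define BINS 32   // number of counters per thread (illustrative)

__global__ void count_keys(const int *keys, int n, int *out)
{
    int counts[BINS];   // automatic per-thread array, no qualifier needed
    for (int i = 0; i < BINS; ++i) counts[i] = 0;

    // keys[i] is only known at run time, so the index is not a constant.
    // This is the pattern that usually keeps the array out of registers.
    for (int i = 0; i < n; ++i) counts[keys[i] & (BINS - 1)] += 1;

    out[threadIdx.x] = counts[threadIdx.x & (BINS - 1)];
}

// Hypothetical host helper: uses the keys 0..2*BINS-1, so every bin is hit
// exactly twice, and copies the BINS per-thread results into host_out.
// Returns 0 on success, -1 on a CUDA error.
int run_local_demo(int *host_out)
{
    const int n = 2 * BINS;
    int host_keys[n];
    for (int i = 0; i < n; ++i) host_keys[i] = i;

    int *d_keys = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_keys, sizeof(host_keys)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, BINS * sizeof(int)) != cudaSuccess) {
        cudaFree(d_keys);
        return -1;
    }

    cudaMemcpy(d_keys, host_keys, sizeof(host_keys), cudaMemcpyHostToDevice);
    count_keys<<<1, BINS>>>(d_keys, n, d_out);

    cudaError_t err = cudaMemcpy(host_out, d_out, BINS * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_keys);
    cudaFree(d_out);
    return err == cudaSuccess ? 0 : -1;
}
```

Note that nothing in the source marks counts as "local": it is an ordinary automatic variable, and the placement is entirely the compiler's decision.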

How can I use shared memory between kernel calls in CUDA?

I want to keep data in shared memory between calls of one kernel.
Can shared memory be used across kernel calls?
No, you can't. Shared memory has the lifetime of a thread block. A variable stored in it is accessible by all the threads belonging to one block during one __global__ function invocation.
You could try mapped page-locked host memory instead, but it will be much slower than device (graphics) memory.
cudaHostAlloc((void **)&ptr, size, cudaHostAllocMapped);
Then pass the corresponding device pointer (obtained with cudaHostGetDevicePointer()) to the kernel.
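A slightly fuller sketch of the mapped-memory suggestion follows. The names add_one and run_mapped_demo are invented for this example, and it assumes a device and driver where mapped host memory is available (older setups may also need cudaSetDeviceFlags(cudaDeviceMapHost) before the first CUDA call). The same buffer is passed to two launches, so the state written by the first launch is still there for the second.

```cuda
#include <cuda_runtime.h>

// Each thread bumps one element of the persistent state.
__global__ void add_one(int *state, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) state[i] += 1;
}

// Hypothetical host helper: allocates mapped page-locked memory, runs the
// kernel twice on it, and stores element 0 in *first_value.
// Returns 0 on success, -1 on a CUDA error.
int run_mapped_demo(int *first_value)
{
    const int n = 64;
    int *h_state = nullptr, *d_state = nullptr;

    if (cudaHostAlloc((void **)&h_state, n * sizeof(int),
                      cudaHostAllocMapped) != cudaSuccess) return -1;
    for (int i = 0; i < n; ++i) h_state[i] = 0;

    // Device-side address of the same host allocation.
    if (cudaHostGetDevicePointer((void **)&d_state, h_state, 0) != cudaSuccess) {
        cudaFreeHost(h_state);
        return -1;
    }

    add_one<<<1, n>>>(d_state, n);   // first call writes the state
    add_one<<<1, n>>>(d_state, n);   // second call sees the first call's result

    // The host must not read the buffer until the kernels have finished.
    cudaError_t err = cudaDeviceSynchronize();
    if (err == cudaSuccess) *first_value = h_state[0];

    cudaFreeHost(h_state);
    return err == cudaSuccess ? 0 : -1;
}
```

An ordinary cudaMalloc buffer in global memory persists between launches in the same way and is normally faster for the kernel to access. Mapped memory mainly helps when the host also needs to read the data without an explicit copy.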
Previously you could do it in a non-standard way where you would have a unique id for each shared memory block, and the next kernel would check the id and then carry out the required processing on that shared memory block. This was hard to implement, as you needed to ensure full occupancy for each kernel and deal with various corner cases. In addition, without formal support you could not rely on compatibility across compute devices and CUDA versions.

Is local memory access coalesced?

Suppose, I declare a local variable in a CUDA kernel function for each thread:
float f = ...; // some calculations here
Suppose also that the compiler placed the declared variable in local memory (which, as far as I know, is the same as global memory except that it is visible to one thread only). My question is: will accesses to f be coalesced when reading it?
I don't believe there is official documentation of how local memory (or the stack on Fermi) is laid out in memory, but I am pretty certain that multiprocessor allocations are accessed in a "striped" fashion, so that non-diverging threads in the same warp will get coalesced access to local memory. On Fermi, local memory is also cached using the same L1/L2 access mechanism as global memory.
CUDA cards normally keep local scalar variables such as f in registers. The compiler only moves them to local memory when the kernel runs out of registers. Complex kernels with lots of variables need more registers per thread, which reduces the number of threads that can run concurrently, a condition known as low occupancy.