Local, global, constant & shared memory - CUDA

I read some CUDA documentation that refers to local memory. (It is mostly the early documentation.) The device properties report a local-memory size (per thread). What does 'local' memory mean? What is 'local' memory? Where is 'local' memory? How do I access 'local' memory? It is __device__ memory, no?
The device properties also report global, shared, and constant memory sizes.
Are these statements correct:
Global memory is __device__ memory. It has grid scope, and a lifetime of the grid (kernel).
Constant memory is __device__ __constant__ memory. It has grid scope & a lifetime of the grid (kernel).
Shared memory is __device__ __shared__ memory. It has single block scope & a lifetime of that block (of threads).
I'm thinking shared memory is SM memory, i.e. memory that only that single SM has direct access to. A resource that is rather limited. Isn't an SM assigned a bunch of blocks at a time? Does this mean an SM can interleave the execution of different blocks (or not)? i.e. run block*A* threads until they stall, then run block*B* threads until they stall, then swap back to block*A* threads again. OR does the SM run a set of threads for block*A* until they stall, then swap in another set of block*A* threads, and continue this until block*A* is exhausted, and only then begin work on block*B*?
I ask because of shared memory. If a single SM is swapping code in from 2 different blocks, then how does the SM quickly swap in/out the shared memory chunks?
(I'm thinking the latter scenario is true, and there is no swapping in/out of shared memory space. Block*A* runs until completion, then block*B* starts execution. Note: block*A* could be from a different kernel than block*B*.)

From the CUDA C Programming Guide section 5.3.2.2, we see that local memory is used in several circumstances:
When each thread has some arrays but their size is not known at compile time (so they might not fit in the registers)
When the size of the arrays is known at compile time, but this size is too big for register memory (this can also happen with big structs)
When the kernel has already used up all the register memory (so if we have filled the registers with n ints, the n+1th int will go into local memory) - this last case is register spilling, and it should be avoided, because:
"Local" memory actually lives in the global memory space, which means reads and writes to it are comparatively slow compared to register and shared memory. You'll access local memory every time you use some variable, array, etc in the kernel that doesn't fit in the registers, isn't shared memory, and wasn't passed as global memory. You don't have to do anything explicit to use it - in fact you should try to minimize its use, since registers and shared memory are much faster.
Edit:
Re: shared memory, you cannot have two blocks exchanging shared memory or looking at each other's shared memory. Since the order of execution of blocks is not guaranteed, if you tried to do this you might tie up an SM for hours waiting for another block to be executed. Similarly, two kernels running on the device at the same time can't see each other's memory UNLESS it is global memory, and even then you're playing with fire (race conditions). As far as I am aware, blocks/kernels can't really send "messages" to each other. Your scenario doesn't really make sense, since the order of execution of the blocks will be different every time and it's bad practice to stall a block waiting for another.


cudaMalloc vs __device__ variables

I have a CUDA (v5.5) application that will need to use global memory. Ideally I would prefer to use constant memory, but I have exhausted constant memory and the overflow will have to be placed in global memory. I also have some variables that will need to be written to occasionally (after some reduction operations on the GPU) and I am placing this in global memory.
For reading, I will be accessing the global memory in a simple way. My kernel is called inside a for loop, and on each call of the kernel, every thread will access the exact same global memory addresses with no offsets. For writing, after each kernel call a reduction is performed on the GPU, and I have to write the results to global memory before the next iteration of my loop. There are far more reads from than writes to global memory in my application however.
My question is whether there are any advantages to using global memory declared in global (variable) scope over using dynamically allocated global memory? The amount of global memory that I need will change depending on the application, so dynamic allocation would be preferable for that reason. I know the upper limit on my global memory use however and I am more concerned with performance, so it is also possible that I could declare memory statically using a large fixed allocation that I am sure not to overflow. With performance in mind, is there any reason to prefer one form of global memory allocation over the other? Do they exist in the same physical place on the GPU and are they cached the same way, or is the cost of reading different for the two forms?
Global memory can be allocated statically (using __device__), dynamically (using device malloc or new) and via the CUDA runtime (e.g. using cudaMalloc).
All of the above methods allocate physically the same type of memory, i.e. memory carved out of the on-board (but not on-chip) DRAM subsystem. This memory has the same access, coalescing, and caching rules regardless of how it is allocated (and therefore has the same general performance considerations).
Since dynamic allocations take some non-zero time, there may be a performance improvement for your code by doing the allocations once, at the beginning of your program, either using the static (i.e. __device__) method or via the runtime API (i.e. cudaMalloc, etc.). This avoids taking the time to dynamically allocate memory in performance-sensitive areas of your code.
Also note that the 3 methods I outline, while having similar C/C++ -like access methods from device code, have differing access methods from the host. Statically allocated memory is accessed using the runtime API functions like cudaMemcpyToSymbol and cudaMemcpyFromSymbol, runtime API allocated memory is accessed via ordinary cudaMalloc / cudaMemcpy type functions, and dynamically allocated global memory (device new and malloc) is not directly accessible from the host.
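A minimal sketch of the three methods side by side (names are mine); it assumes a device of compute capability 2.0 or later for the device-side malloc:

    #include <cstdio>

    // Statically allocated global memory: accessed from the host via
    // cudaMemcpyToSymbol / cudaMemcpyFromSymbol.
    __device__ int d_static[256];

    __global__ void touchAll(int *d_runtime)
    {
        int i = threadIdx.x;
        d_static[i] += 1;        // static __device__ allocation
        d_runtime[i] += 1;       // runtime-API (cudaMalloc) allocation

        // Device-side dynamic allocation: same DRAM, but the pointer is
        // not directly accessible from the host.
        int *scratch = (int *)malloc(4 * sizeof(int));
        if (scratch) {
            scratch[0] = i;
            free(scratch);
        }
    }

    int main()
    {
        int h[256] = {0};

        cudaMemcpyToSymbol(d_static, h, sizeof(h));   // host -> static allocation

        int *d_runtime = nullptr;
        cudaMalloc(&d_runtime, sizeof(h));            // runtime-API allocation
        cudaMemcpy(d_runtime, h, sizeof(h), cudaMemcpyHostToDevice);

        touchAll<<<1, 256>>>(d_runtime);
        cudaDeviceSynchronize();

        cudaMemcpyFromSymbol(h, d_static, sizeof(h)); // static allocation -> host
        printf("d_static[0] = %d\n", h[0]);

        cudaFree(d_runtime);
        return 0;
    }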
First of all, you need to think about coalescing your memory accesses. You didn't mention which GPU you are using. On the latest GPUs, a coalesced memory read will give the same performance as constant memory, so always make your memory reads and writes as coalesced as possible.
Alternatively, you can use texture memory (if the data size fits into it). Texture memory has its own caching mechanism; it was traditionally used when global memory reads were non-coalesced, but the latest GPUs give almost the same performance for texture and global memory.
I don't think globally declared memory gives more performance than dynamically allocated global memory, since the coalescing considerations are the same either way. Variables declared at file scope, such as __constant__ and texture variables (and __device__ globals), do not need to be passed to kernels as arguments.
For memory optimization, please see the memory optimization section in the CUDA C Best Practices Guide: http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/#memory-optimizations
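As a rough sketch of what "coalesced" means in practice (illustrative kernels, not from the answer): consecutive threads should touch consecutive addresses.

    __global__ void coalescedCopy(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];   // thread i touches element i: the warp's accesses coalesce
    }

    __global__ void stridedCopy(const float *in, float *out, int n, int stride)
    {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        if (i < n)
            out[i] = in[i];   // large strides break coalescing and waste bandwidth
    }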

What is the fastest way to transfer a 256-byte block of data from one CUDA block to another?

What is the fastest way to transfer a 256-byte block of data from one CUDA block to another?
And is there a way to transfer faster than global memory?
In theory, on devices of compute capability >= 2.0, transfers between blocks, using global memory, could be very fast because global memory transactions use the L1 and L2 caches.
However, the only way to safely transfer memory between blocks is to launch those blocks in separate kernel invocations. Then, you lose the theoretical advantage I just described, as the caches are flushed between invocations.
Within a given kernel invocation, you cannot know in which order your blocks will be launched.
Transferring data between blocks launched by separate kernel invocations is a common paradigm in CUDA and if there is enough computational work to be done, the latency of the global memory transactions can be completely hidden.
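A sketch of that two-kernel pattern for the 256-byte case (all names are illustrative): blocks of the first launch write their payloads to a global-memory exchange buffer, and blocks of the second launch read payloads written by other blocks.

    __global__ void producer(unsigned char *exchange)   // one 256-byte slot per block
    {
        unsigned char *slot = exchange + blockIdx.x * 256;
        for (int i = threadIdx.x; i < 256; i += blockDim.x)
            slot[i] = (unsigned char)(blockIdx.x + i);   // fill this block's slot
    }

    __global__ void consumer(const unsigned char *exchange, unsigned int *sums)
    {
        // Read the slot written by a different block in the previous launch
        // (here: the neighbouring block, wrapped around).
        const unsigned char *slot = exchange + ((blockIdx.x + 1) % gridDim.x) * 256;

        __shared__ unsigned int partial;
        if (threadIdx.x == 0) partial = 0;
        __syncthreads();

        for (int i = threadIdx.x; i < 256; i += blockDim.x)
            atomicAdd(&partial, (unsigned int)slot[i]);
        __syncthreads();

        if (threadIdx.x == 0) sums[blockIdx.x] = partial;
    }

The two launches (e.g. producer<<<N, 128>>>(d_exchange); followed by consumer<<<N, 128>>>(d_exchange, d_sums);) provide the ordering guarantee; within a single launch there is no safe way to know which blocks have already written their slots.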

Persistent GPU shared memory

I am new to CUDA programming, and I am mostly working with shared memory per block for performance reasons. The way my program is structured right now, I use one kernel to load the shared memory and another kernel to read the pre-loaded shared memory. But, as I understand it, shared memory cannot persist between two different kernels.
I have two solutions in mind; I am not sure about the first one, and second might be slow.
First Solution: Instead of using two kernels, I use one kernel. After loading the shared memory, the kernel may wait for an input from the host, perform the operation and then return the value to host. I am not sure whether a kernel can wait for a signal from the host.
Second solution: After loading the shared memory, copy the shared memory value in the global memory. When the next kernel is launched, copy the value from global memory back into the shared memory and then perform the operation.
Please comment on the feasibility of the two solutions.
I would use a variation of your proposed first solution: as you already suspected, you can't wait for host input inside a kernel, but you can synchronise the threads of a block within the kernel. Just call __syncthreads() after loading your data into shared memory, then perform the operation in the same kernel.
I don't really understand your second solution: why would you copy data to shared memory just to copy it back to global memory in the first kernel? Or would this first kernel also compute something? In that case, I guess it will not help to store the preliminary results in shared memory first; I would rather store them directly in global memory (however, this might depend on the algorithm).
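A minimal sketch of that single-kernel variant (illustrative names): load the block's data into shared memory, call __syncthreads(), then do the work in the same launch.

    __global__ void loadAndCompute(const float *in, float *out, int n)
    {
        extern __shared__ float tile[];              // dynamic shared memory
        int gid = blockIdx.x * blockDim.x + threadIdx.x;

        // Phase 1: each thread loads one element into shared memory.
        tile[threadIdx.x] = (gid < n) ? in[gid] : 0.0f;

        // Synchronises the threads of this block only; it does not wait for
        // the host or for other blocks.
        __syncthreads();

        // Phase 2: operate on the pre-loaded data, e.g. a neighbour sum.
        if (gid < n) {
            float left  = (threadIdx.x > 0)              ? tile[threadIdx.x - 1] : 0.0f;
            float right = (threadIdx.x < blockDim.x - 1) ? tile[threadIdx.x + 1] : 0.0f;
            out[gid] = left + tile[threadIdx.x] + right;
        }
    }

Launch it with the shared-memory size as the third launch parameter, e.g. loadAndCompute<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_out, n);.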

Is local memory access coalesced?

Suppose I declare a local variable for each thread in a CUDA kernel function:
float f = ...; // some calculations here
Suppose also that the declared variable was placed by the compiler in local memory (which, as far as I know, is the same as global memory except that it is visible to one thread only). My question is: will accesses to f be coalesced when reading it?
I don't believe there is official documentation of how local memory (or stack on Fermi) is laid out in memory, but I am pretty certain that multiprocessor allocations are accessed in a "striped" fashion, so that non-diverging threads in the same warp will get coalesced access to local memory. On Fermi, local memory is also cached using the same L1/L2 mechanism as global memory.
CUDA GPUs don't set aside dedicated memory for local variables; the compiler keeps local variables in registers whenever it can. Complex kernels with lots of variables reduce the number of threads that can run concurrently, a condition known as low occupancy.
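To make the first answer concrete, here is an illustrative kernel (mine, and it relies on the assumed interleaved layout described above): every thread in the warp reads the same index of its own local array, so if local memory is striped per thread, the warp's reads can coalesce much like a well-formed global memory access.

    __global__ void localReadSketch(const int *in, int *out)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;

        int buf[32];                     // likely placed in local memory if the
        for (int i = 0; i < 32; ++i)     // compiler cannot keep it in registers
            buf[i] = in[tid] + i;

        int k = in[0] & 31;              // same runtime index for every thread,
        out[tid] = buf[k];               // so the warp reads one "stripe"
    }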

CUDA shared memory

I need to know something about CUDA shared memory. Let's say I assign 50 blocks with 10 threads per block on a G80 card. Each SM of a G80 can handle 8 blocks simultaneously. Assume that, after doing some calculations, the shared memory is fully occupied.
What will be the values in shared memory when the next 8 new blocks arrive? Will the previous values reside there? Or will the previous values be copied to global memory and the shared memory refreshed for next 8 blocks?
The reference states the following about the type qualifiers:
Variables in registers: per thread, lifetime of the kernel only
Variables in local memory: per thread, lifetime of the kernel only
__device__ __shared__ variables in shared memory: per block, lifetime of the kernel only
__device__ variables in global memory: per grid, persist until the application exits
__device__ __constant__ variables in constant memory: per grid, persist until the application exits
Thus, from this reference, the answer to your question is that the shared memory is refreshed for the next 8 blocks; the previous values do not persist.
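A declaration-level sketch of that qualifier list (names are illustrative):

    __constant__ float c_coeffs[16];   // constant memory: grid scope, persists until the app exits
    __device__   float g_table[256];   // global memory:   grid scope, persists until the app exits

    __global__ void qualifierDemo(float *out)
    {
        __shared__ float s_tile[128];               // shared memory: block scope, kernel lifetime
        float r_tmp = c_coeffs[threadIdx.x % 16];   // automatic scalar: register, thread scope
        float l_buf[64];                            // per-thread array: may end up in local memory

        for (int i = 0; i < 64; ++i)
            l_buf[i] = r_tmp + i;

        s_tile[threadIdx.x % 128] = l_buf[threadIdx.x % 64];
        __syncthreads();
        out[threadIdx.x] = s_tile[threadIdx.x % 128] + g_table[threadIdx.x % 256];
    }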
For kernel blocks, the execution order and SM assignment are effectively random. In that sense, even if the old value or address were preserved, it would be hard to keep track of it, and I doubt there is even a way to do that. Communication between blocks is done via off-chip memory. The latency associated with off-chip memory is the performance killer, which is what makes GPU programming tricky. On Fermi cards, blocks share some L2 cache, but one can't alter the behavior of these caches.