How to use L2 Cache in CUDA

I have searched other threads on the usage of the L2 cache in CUDA, but was unable to find a solution. How do I make use of the L2 cache? Is there any function to invoke or declaration for its use? For example, to use shared memory we use __device__ __shared__. Is there anything like that for the L2 cache?

The L2 cache is transparent to device code. All accesses to memory (global, local, surface, texture, constant, and instruction) that do not hit in L1 go to L2. All writes go through L2.
CUDA C Programming Guide F.4.2 : Global Memory
This section provides a few more details on L2.
The compiler flag -Xptxas -dlcm=cg can be used to make global accesses uncached in L1 and cached only in L2.
CUDA C Programming Guide B.5 : Memory Fence Functions
The function __threadfence() can be used to make sure that all writes to global memory are visible in L2.
The function __threadfence_system() can be used to make sure that all writes to global memory are visible to host threads.
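As a rough sketch of what __threadfence() buys you (the variable and function names below are made up for illustration, not taken from any particular codebase): one thread writes a result, issues the fence, and only then sets a flag; any thread that later observes the flag set is guaranteed to also observe the result.
__device__ int result;              // hypothetical result slot in global memory
__device__ volatile int flag = 0;   // hypothetical "result is ready" flag

__device__ void publish(int value)
{
    result = value;     // ordinary global write
    __threadfence();    // make the write above visible device-wide before continuing
    flag = 1;           // only now advertise that the result is ready
}

__device__ void consume(int *out)
{
    if (flag == 1)      // if the flag is observed as set...
        *out = result;  // ...the fenced write to result is visible as well
}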

Related

cudaMalloc vs __device__ variables [duplicate]

I have a CUDA (v5.5) application that will need to use global memory. Ideally I would prefer to use constant memory, but I have exhausted constant memory and the overflow will have to be placed in global memory. I also have some variables that will need to be written to occasionally (after some reduction operations on the GPU) and I am placing this in global memory.
For reading, I will be accessing the global memory in a simple way. My kernel is called inside a for loop, and on each call of the kernel, every thread will access the exact same global memory addresses with no offsets. For writing, after each kernel call a reduction is performed on the GPU, and I have to write the results to global memory before the next iteration of my loop. There are far more reads from than writes to global memory in my application however.
My question is whether there are any advantages to using global memory declared in global (variable) scope over using dynamically allocated global memory? The amount of global memory that I need will change depending on the application, so dynamic allocation would be preferable for that reason. I know the upper limit on my global memory use however and I am more concerned with performance, so it is also possible that I could declare memory statically using a large fixed allocation that I am sure not to overflow. With performance in mind, is there any reason to prefer one form of global memory allocation over the other? Do they exist in the same physical place on the GPU and are they cached the same way, or is the cost of reading different for the two forms?
Global memory can be allocated statically (using __device__), dynamically (using device malloc or new) and via the CUDA runtime (e.g. using cudaMalloc).
All of the above methods allocate physically the same type of memory, i.e. memory carved out of the on-board (but not on-chip) DRAM subsystem. This memory has the same access, coalescing, and caching rules regardless of how it is allocated (and therefore has the same general performance considerations).
Since dynamic allocations take some non-zero time, there may be performance improvement for your code by doing the allocations once, at the beginning of your program, either using the static (i.e. __device__ ) method, or via the runtime API (i.e. cudaMalloc, etc.) This avoids taking the time to dynamically allocate memory during performance-sensitive areas of your code.
Also note that the 3 methods I outline, while having similar C/C++ -like access methods from device code, have differing access methods from the host. Statically allocated memory is accessed using the runtime API functions like cudaMemcpyToSymbol and cudaMemcpyFromSymbol, runtime API allocated memory is accessed via ordinary cudaMalloc / cudaMemcpy type functions, and dynamically allocated global memory (device new and malloc) is not directly accessible from the host.
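To make the three variants concrete, here is a minimal sketch (array sizes and names are placeholders, and error checking is omitted):
#include <cuda_runtime.h>

__device__ float d_static[256];                 // 1. static allocation, host access via symbol APIs

__global__ void kernel(float *d_runtime)
{
    float *scratch = (float *)malloc(4 * sizeof(float));   // 2. device-side dynamic allocation
    if (scratch) {                                          //    (not directly visible to the host)
        scratch[0] = d_static[threadIdx.x];
        d_runtime[threadIdx.x] = scratch[0] + 1.0f;
        free(scratch);
    }
}

int main()
{
    float h_data[256] = {0};
    cudaMemcpyToSymbol(d_static, h_data, sizeof(h_data));  // host -> static global memory

    float *d_runtime;                                       // 3. runtime-API allocation,
    cudaMalloc(&d_runtime, 256 * sizeof(float));            //    host access via cudaMemcpy
    kernel<<<1, 256>>>(d_runtime);
    cudaMemcpy(h_data, d_runtime, 256 * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_runtime);
    return 0;
}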
First of all, you need to think about coalescing your memory accesses. You didn't mention which GPU you are using. On recent GPUs, a coalesced memory read can give performance comparable to that of constant memory, so always make your memory reads and writes as coalesced as possible.
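As a rough illustration of the difference (the kernel names are made up), the first pattern below is coalesced, while the second is strided and wastes most of each memory transaction:
__global__ void coalesced_copy(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];            // adjacent threads touch adjacent addresses
}

__global__ void strided_copy(const float *in, float *out, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    out[i] = in[i];            // each warp now spans many memory segments
}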
You can also use texture memory (if the data size fits into it). Texture memory has its own caching mechanism; it was traditionally used when global memory reads were non-coalesced, but the latest GPUs give almost the same performance for texture and global memory.
I don't think globally declared memory gives more performance than dynamically allocated global memory, since the coalescing issue still exists either way. Note that, besides __device__ variables, the variables you can declare at global (program) scope are constant memory variables and texture references, which do not need to be passed to kernels as arguments.
For memory optimizations, please see the memory optimization section in the CUDA C Best Practices Guide: http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/#memory-optimizations

Utilizing Register memory in CUDA

I have some questions regarding CUDA register memory.
1) Is there any way to free registers in a CUDA kernel? I have variables and 1D and 2D arrays in registers (max array size 48).
2) If I use device functions, what happens to the registers used in a device function after it finishes executing? Will they be available to the calling kernel or to other device functions?
3) How does nvcc optimize register usage? Please share the points that are important for optimizing a memory-intensive kernel.
PS: I have a complex algorithm to port to CUDA that uses a lot of registers for computation. I am trying to figure out whether to store intermediate data in registers and write one kernel, or store it in global memory and break the algorithm into multiple kernels.
Only local variables are eligible to reside in registers (see also Declaring Variables in a CUDA kernel). You don't have direct control over which variables (scalar or static array) will reside in registers. The compiler makes its own choices, striving for performance while sparing registers.
Register usage can be limited using the -maxrregcount option of the nvcc compiler.
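For example, an invocation along these lines (the file names are placeholders) caps every kernel in the translation unit at 32 registers per thread:
nvcc -maxrregcount 32 -o mykernel mykernel.cu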
You can also put most small 1D and 2D arrays in shared memory or, if you are accessing constant data, put that content into constant memory (which is cached on-chip, very close to the registers, much like L1 cache content).
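As a sketch of staging a small per-thread array in shared memory rather than in registers or local memory (the tile length, kernel name, and launch configuration are illustrative):
#define PER_THREAD 48   // illustrative per-thread array length

__global__ void stage_in_shared(const float *in, float *out)
{
    extern __shared__ float buf[];                 // sized at launch time
    float *mine = &buf[threadIdx.x * PER_THREAD];  // this thread's slice

    int base = (blockIdx.x * blockDim.x + threadIdx.x) * PER_THREAD;
    for (int k = 0; k < PER_THREAD; ++k)           // load pattern kept simple here;
        mine[k] = in[base + k];                    // in practice you would also coalesce these reads

    float acc = 0.0f;
    for (int k = 0; k < PER_THREAD; ++k)
        acc += mine[k];

    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}

// launched with: stage_in_shared<<<blocks, threads, threads * PER_THREAD * sizeof(float)>>>(d_in, d_out);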
Another way of reducing register usage when dealing with compute-bound kernels in CUDA is to process data in stages, using multiple global kernel calls and storing intermediate results in global memory. Each kernel will use far fewer registers, so more active threads per SM will be able to hide load/store data movements. This technique, in combination with proper use of streams and asynchronous data transfers, is very successful most of the time.
Regarding the use of device functions, I'm not sure, but I guess the calling function's register contents will be moved/stored to local memory (backed by the L1 cache), in the same way that register spilling occurs when using too many local variables (see CUDA Programming Guide -> Device Memory Accesses -> Local Memory). This operation frees up some registers for the called device function. After the device function completes, its local variables no longer exist, and the registers can again be used by the caller function and refilled with the previously saved content.
Keep in mind that small device functions defined in the same source file as the global kernel may be inlined by the compiler for performance reasons; when this happens, the resulting kernel will in general require more registers.

CUDA Compute Capability 2.0. Global memory access pattern

Starting with CUDA Compute Capability 2.0 (Fermi), global memory accesses go through a 768 KB L2 cache. It looks like developers no longer need to care about global memory banks. But global memory is still very slow, so the right access pattern is important. Now the point is to use/reuse L2 as much as possible. My question is: how? I would be thankful for some detailed info on how L2 works and how I should organize and access global memory if I need, for example, a 100-200 element array per thread.
L2 cache helps in some ways, but it does not obviate the need for coalesced access of global memory. In a nutshell, coalesced access means that for a given read (or write) instruction, individual threads in a warp are reading (or writing) adjacent, contiguous locations in global memory, preferably that are aligned as a group on a 128-byte boundary. This will result in the most effective utilization of the available memory bandwidth.
In practice this is often not difficult to accomplish. For example:
int idx=threadIdx.x + (blockDim.x * blockIdx.x);
int mylocal = global_array[idx];
will give coalesced (read) access across all the threads in a warp, assuming global_array is allocated in an ordinary fashion using cudaMalloc in global memory. This type of access makes 100% usage of the available memory bandwidth.
A key takeaway is that memory transactions ordinarily occur in 128-byte blocks, which happens to be the size of a cache line. If you request even one of the bytes in a block, the entire block will be read (and stored in L2, normally). If you later read other data from that block, it will normally be serviced from L2, unless it has been evicted by other memory activity. This means that the following sequence:
int mylocal1 = global_array[0];
int mylocal2 = global_array[1];
int mylocal3 = global_array[31];
would all typically be serviced from a single 128-byte block. The first read for mylocal1 will trigger the 128-byte read. The second read for mylocal2 would normally be serviced from the cached value (in L2 or L1), not by triggering another read from memory. However, if the algorithm can be suitably modified, it's better to read all your data contiguously from multiple threads, as in the first example. This may be just a matter of clever organization of data, for example using structures of arrays rather than arrays of structures.
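To illustrate the structure-of-arrays point, a minimal sketch (the struct layouts here are just an example):
struct ParticleAoS { float x, y, z, w; };          // Array of Structures layout

__global__ void read_aos(const ParticleAoS *p, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = p[i].x;       // strided: each thread uses only 4 of every 16 bytes fetched
}

struct ParticleSoA { float *x, *y, *z, *w; };      // Structure of Arrays layout

__global__ void read_soa(ParticleSoA p, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = p.x[i];       // coalesced: adjacent threads read adjacent floats
}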
In many respects, this is similar to CPU cache behavior. The concept of a cache line is similar, along with the behavior of servicing requests from the cache.
Fermi L1 and L2 can support write-back and write-through. L1 is available on a per-SM basis, and is configurably split with shared memory to be either 16KB L1 (and 48KB SM) or 48KB L1 (and 16KB SM). L2 is unified across the device and is 768KB.
Some advice I would offer is to not assume that the L2 cache just fixes sloppy memory accesses. The GPU caches are much smaller than equivalent caches on CPUs, so it's easier to get into trouble there. A general piece of advice is simply to code as if the caches were not there. Rather than CPU oriented strategies like cache-blocking, it's usually better to focus your coding effort on generating coalesced accesses and then possibly make use of shared memory in some specific cases. Then for the inevitable cases where we can't make perfect memory accesses in all situations, we let the caches provide their benefit.
You can get more in-depth guidance by looking at some of the available NVIDIA webinars. For example, the Global Memory Usage & Strategy webinar (and slides) or the CUDA Shared Memory & Cache webinar would be instructive for this topic. You may also want to read the Device Memory Access section of the CUDA C Programming Guide.

Is CUDA shared memory also cached

In my CUDA application, I am copying data from device memory to shared memory. Is that data cached in L1 as well?
By default, all memory loads from global memory are cached in L1. The target location for the global memory load has no effect on the L1 caching (whether it is a register, or shared memory or thread local memory). The shared memory itself is obviously not cached.
This is just to expand on what @talonmies said.
A copy is two separate operations at a low level, a load and a store. Both load and store can be cached in L1 and L2 if they access global memory.
Since the load part of your copy is from global memory, it will be cached both in L1 and L2 by default. So, unless the compiler detects the special situation of copying from global to shared memory and uses an uncached load, you end up with two copies of your data that can be accessed at the same latency because the shared memory and L1 cache are implemented with the same physical on-chip memory.
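For reference, the copy in question usually looks something like this sketch (the tile size and names are illustrative); the load side passes through L1/L2 on its way into shared memory, and the store side goes back out through L2:
#define TILE 256    // illustrative: one element per thread in the block

__global__ void process_tile(const float *global_in, float *global_out)
{
    __shared__ float tile[TILE];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = global_in[i];   // load: global -> (L1/L2) -> register -> shared
    __syncthreads();                    // make the whole tile visible to the block

    // ... work on tile[] in shared memory ...

    global_out[i] = tile[threadIdx.x];  // store: written back out through L2
}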
From the CUDA C Programming Guide 4.2:
There is an L1 cache for each multiprocessor and an L2 cache shared by all multiprocessors, both of which are used to cache accesses to local or global memory, including temporary register spills. The cache behavior (e.g. whether reads are cached in both L1 and L2 or in L2 only) can be partially configured on a per-access basis using modifiers to the load or store instruction.
I couldn't find anything about how this behavior may be modified from CUDA C.

Is local memory access coalesced?

Suppose I declare a local variable in a CUDA kernel function for each thread:
float f = ...; // some calculations here
Suppose also that the compiler placed the declared variable in local memory (which, as far as I know, is the same as global memory except that it is visible to one thread only). My question is: will the access to f be coalesced when reading it?
I don't believe there is official documentation of how local memory (or the stack on Fermi) is laid out in memory, but I am pretty certain that per-multiprocessor allocations are accessed in a "striped" fashion, so that non-diverging threads in the same warp will get coalesced access to local memory. On Fermi, local memory is also cached using the same L1/L2 mechanism as global memory.
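As a sketch of code that typically ends up in local memory (the array is indexed with a value unknown at compile time, so it cannot be kept entirely in registers); the sizes and names are illustrative:
__global__ void local_array_kernel(const int *idx, float *out)
{
    float f[64];                       // dynamically indexed below -> placed in local memory
    for (int k = 0; k < 64; ++k)
        f[k] = k * 0.5f;

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Local memory is interleaved per thread, so f[j] for consecutive threads of a warp
    // sits in consecutive words; this read is coalesced as long as idx[i] does not
    // diverge within the warp.
    out[i] = f[idx[i] & 63];
}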
CUDA GPUs don't set aside dedicated storage for local variables; the compiler keeps local variables in registers whenever it can, spilling to local memory only under register pressure or when an array is indexed dynamically. Complex kernels with lots of variables reduce the number of threads that can run concurrently, a condition known as low occupancy.