CUDA Compute Capability 2.0. Global memory access pattern

From CUDA Compute Capability 2.0 (Fermi), global memory access works through a 768 KB L2 cache. It looks like developers no longer need to care about global memory banks. But global memory is still very slow, so the right access pattern is important. Now the point is to use/reuse L2 as much as possible. My question is: how? I would be thankful for some detailed info on how L2 works and how I should organize and access global memory if I need, for example, an array of 100-200 elements per thread.

L2 cache helps in some ways, but it does not obviate the need for coalesced access of global memory. In a nutshell, coalesced access means that for a given read (or write) instruction, individual threads in a warp are reading (or writing) adjacent, contiguous locations in global memory, preferably that are aligned as a group on a 128-byte boundary. This will result in the most effective utilization of the available memory bandwidth.
In practice this is often not difficult to accomplish. For example:
int idx=threadIdx.x + (blockDim.x * blockIdx.x);
int mylocal = global_array[idx];
will give coalesced (read) access across all the threads in a warp, assuming global_array is allocated in an ordinary fashion using cudaMalloc in global memory. This type of access makes 100% usage of the available memory bandwidth.
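For reference, a minimal self-contained version of that pattern might look like the sketch below; the kernel name, sizes, and launch configuration are illustrative assumptions, not taken from the question:

#include <cuda_runtime.h>

// Consecutive threads read consecutive addresses, so each warp's 32
// four-byte loads fall within aligned 128-byte segments (fully coalesced).
__global__ void coalesced_read(const int *global_array, int *out, int n)
{
    int idx = threadIdx.x + (blockDim.x * blockIdx.x);
    if (idx < n) {
        int mylocal = global_array[idx];   // coalesced read
        out[idx] = mylocal;                // coalesced write
    }
}

int main()
{
    const int n = 1 << 20;
    int *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(int));   // ordinary allocation in global memory
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemset(d_in, 0, n * sizeof(int));  // placeholder initialization
    coalesced_read<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaDeviceSynchronize();
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}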
A key takeaway is that memory transactions ordinarily occur in 128-byte blocks, which happens to be the size of a cache line. If you request even one of the bytes in a block, the entire block will be read (and stored in L2, normally). If you later read other data from that block, it will normally be serviced from L2, unless it has been evicted by other memory activity. This means that the following sequence:
int mylocal1 = global_array[0];
int mylocal2 = global_array[1];
int mylocal3 = global_array[31];
would all typically be serviced from a single 128-byte block. The first read for mylocal1 will trigger the 128-byte read. The second read for mylocal2 would normally be serviced from the cached value (in L2 or L1), not by triggering another read from memory. However, if the algorithm can be suitably modified, it's better to read all your data contiguously from multiple threads, as in the first example. This may be just a matter of clever organization of data, for example using structures of arrays rather than arrays of structures.
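As a rough sketch of that last point, compare an array-of-structures layout with a structure-of-arrays layout; the particle structs below are hypothetical, chosen only to show the access pattern:

// Array of Structures (AoS): a warp reading only .x touches every 16th byte,
// so each 128-byte line it pulls in is only 25% used.
struct ParticleAoS { float x, y, z, w; };
__global__ void read_aos(const ParticleAoS *p, float *out, int n)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (idx < n) out[idx] = p[idx].x;      // strided, poorly coalesced
}

// Structure of Arrays (SoA): the x components are contiguous, so a warp's
// 32 reads fall in consecutive addresses and coalesce fully.
struct ParticleSoA { float *x, *y, *z, *w; };  // device pointers
__global__ void read_soa(ParticleSoA p, float *out, int n)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (idx < n) out[idx] = p.x[idx];      // contiguous, fully coalesced
}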
In many respects, this is similar to CPU cache behavior. The concept of a cache line is similar, along with the behavior of servicing requests from the cache.
Fermi L1 and L2 can support write-back and write-through. L1 is available on a per-SM basis, and is configurably split with shared memory to be either 16KB L1 (and 48KB shared memory) or 48KB L1 (and 16KB shared memory). L2 is unified across the device and is 768KB.
Some advice I would offer is to not assume that the L2 cache just fixes sloppy memory accesses. The GPU caches are much smaller than equivalent caches on CPUs, so it's easier to get into trouble there. A general piece of advice is simply to code as if the caches were not there. Rather than CPU oriented strategies like cache-blocking, it's usually better to focus your coding effort on generating coalesced accesses and then possibly make use of shared memory in some specific cases. Then for the inevitable cases where we can't make perfect memory accesses in all situations, we let the caches provide their benefit.
You can get more in-depth guidance by looking at some of the available NVIDIA webinars. For example, the Global Memory Usage & Strategy webinar (and slides) or the CUDA Shared Memory & Cache webinar would be instructive for this topic. You may also want to read the Device Memory Access section of the CUDA C Programming Guide.

Related

why shared memory is faster than global memory?

Is that difference in speed due to the technology each is made with? (I read that shared memory is scratchpad memory, mainly SRAM, while global memory is typically DRAM.)
What if both were made with the same technology: would there still be a performance difference, based on shared memory being on-chip and global memory being off-chip, due to extra instructions (load instructions) or extra hardware circuitry needed for global memory to load its data into the processor?
At least two reasons are the ones you've already pointed out. There is a:
Location difference - shared memory is on-chip; global memory (at least for ordinary global memory accesses that do not hit in one of the caches) is off-chip. Memory is generally clocked at a fixed frequency, and the maximum frequency will depend on how fast the system can be clocked. Long transmission lines, buffers that drive signals from off-chip to on-chip or vice versa, and many other circuit effects will slow down the maximum rate at which a particular circuit can be clocked. Therefore shared memory is considerably advantaged by being on-chip. The caches (L1, L2, read-only, constant cache, texture cache, etc.) all benefit from the same principle.
Technology difference. An SRAM cell (e.g. shared memory) might be clocked faster than a DRAM cell (e.g. off-chip global memory), and SRAM is more amenable to fast random access. DRAM has a more complicated access sequence that comes into play when a cell is accessed. DRAM is also burdened by mechanisms such as refresh that may get in the way of continuous fast access. However I would suggest that the technology difference is less of an issue. Another technology related issue is that SRAM arrays are generally more amenable (able to be placed in higher density) on the logic processes that modern processors use. For highest density, DRAM arrays use a semiconductor process that differs substantially from the one used for general logic inside a processor.
Processor instructions required wouldn't be a meaningful differentiator between shared memory and global memory access times.

filtering an image, best practices

I have an input image (let it be a buffer of 1024 * 1024 pixels, with RGBA color data).
What I want to do for each pixel is to filter it depending on its neighbors, like [-15,15] in the x and y directions.
My concern is that doing this with global memory will mean something like 31 * 31 global memory accesses for each pixel (which would be a serious performance bottleneck). I'm also not sure about the behavior of multiple threads trying to read from the same memory location at the same time (maybe some of them fail to read, so: rubbish data in -> rubbish data out).
This question applies to CUDA or OpenCL, as the concept should be the same.
I know that shared memory (per work group) or local memory (per thread) won't solve this, as I can't read another thread's local memory or another group's shared memory (correct me if I misunderstand this concept).
Shared memory is a typical approach to this problem, although the stencil area (31*31) is quite large. Data re-use benefit can still be gained however. Since adjacent pixel computations only extend the region required by one column, in a 16KB shared memory array of 32bit RGBA pixels, you could have enough data for at least 64 threads to cooperatively compute their pixel values out of a single shared memory load.
Regarding the concern about multiple threads reading the same location, there is no possibility for garbage data reads. Certainly there is a possibility for contention leading to a performance impact, but in fact with an orderly for-loop progression in the kernel, no threads will be reading the same location at the same time anyway. With appropriate data organization there will be good opportunity for coalesced reads from global memory and no bank conflicts in shared memory.
This type of problem is well-suited for GPUs e.g. CUDA or OpenCL, and there are many examples of programs like this on SO.
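As a concrete illustration of the shared-memory tiling described above, here is a minimal CUDA sketch; the 16x16 tile size, the border clamping, and the simple single-channel box average are my own illustrative choices, not a recommended production filter:

#define RADIUS 15   // [-15,15] neighborhood from the question
#define TILE   16   // threads per block dimension (launch with dim3 block(TILE, TILE))

// Each block loads a (TILE + 2*RADIUS)^2 tile of pixels into shared memory
// (about 8.3KB of uchar4 here), then every thread filters its own pixel
// out of that single cooperative load.
__global__ void box_filter(const uchar4 *in, uchar4 *out, int width, int height)
{
    __shared__ uchar4 tile[TILE + 2*RADIUS][TILE + 2*RADIUS];

    int gx = blockIdx.x * TILE + threadIdx.x;
    int gy = blockIdx.y * TILE + threadIdx.y;

    // Cooperative load of the padded tile, clamping at the image border.
    for (int ty = threadIdx.y; ty < TILE + 2*RADIUS; ty += TILE)
        for (int tx = threadIdx.x; tx < TILE + 2*RADIUS; tx += TILE) {
            int ix = min(max(blockIdx.x * TILE + tx - RADIUS, 0), width  - 1);
            int iy = min(max(blockIdx.y * TILE + ty - RADIUS, 0), height - 1);
            tile[ty][tx] = in[iy * width + ix];
        }
    __syncthreads();

    if (gx >= width || gy >= height) return;

    // 31x31 box average of one channel (placeholder for a real filter).
    float sum = 0.0f;
    for (int dy = -RADIUS; dy <= RADIUS; ++dy)
        for (int dx = -RADIUS; dx <= RADIUS; ++dx)
            sum += tile[threadIdx.y + RADIUS + dy][threadIdx.x + RADIUS + dx].x;

    uchar4 result = in[gy * width + gx];
    result.x = (unsigned char)(sum / ((2*RADIUS + 1) * (2*RADIUS + 1)));
    out[gy * width + gx] = result;
}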

GPU coalesced global memory access vs using shared memory

If a thread is accessing global memory, why does it access a large chunk? Where is this large chunk stored?
If you're reading from global memory in a coalesced manner, would it be beneficial to copy a common chunk of the global memory into shared memory, or would there not be any improvement?
i.e.: If each thread is reading the next 5 or 10 or 100 memory locations and averaging them, and you could fit a chunk of X points from global memory into shared memory, could you not write an if statement saying: if you're looking for one of these memory values, read from shared memory rather than global? I'm assuming the warp divergence penalty would be less than reading from global memory each time.
When you read from global memory, the data are searched first in the L1 cache (high bandwidth, about 1,600 GB/s on Fermi, but limited in size, up to 48KB per SM on Fermi), then, if not present in L1, they are searched in L2 (lower bandwidth, but larger than L1, 768KB on Fermi) and finally, if not present in L2, they are loaded from global memory*.
When a global memory load occurs, the data are moved to L2 and then to L1, so that they can be accessed faster the next time a global memory read is required.
Possibly, such data are evicted by a subsequent global memory load, possibly not. So, in principle, if you are reading "small" chunks of data, you do not need to necessarily force the data to be located in the shared memory to access them next time in a fast way.
Take into account that, on Fermi and Kepler, shared memory is made from the same circuitry as the L1 cache. You can then see shared memory as a controlled L1 cache.
You should then force the data to reside in shared memory when you want to be sure they stay in, say, the "fastest available cache", and you do that whenever you need to access the same data multiple times.
Note that the above is the general philosophy behind global memory transfers. Implementation details can differ depending on the underlying architecture.
*It should be noted that L1 caching of global memory loads can be disabled by a compiler option. This is useful in terms of performance for random access data patterns.
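As a hedged sketch of using shared memory as that "controlled cache" for data that every thread reuses many times, consider the following; the kernel, names, and sizes are illustrative assumptions:

#define CHUNK 256   // coefficients staged per block (illustrative)

// Each block stages a chunk of coefficients that every thread will reuse
// CHUNK times. Staging them once in shared memory guarantees they stay in
// fast on-chip storage, rather than hoping L1/L2 keeps them resident.
__global__ void reuse_coeffs(const float *coeffs, const float *in,
                             float *out, int n)
{
    __shared__ float s_coeffs[CHUNK];

    // Cooperative, coalesced load of the shared chunk.
    for (int i = threadIdx.x; i < CHUNK; i += blockDim.x)
        s_coeffs[i] = coeffs[i];
    __syncthreads();

    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (idx >= n) return;

    float v = in[idx];          // one coalesced global read per thread
    float acc = 0.0f;
    for (int i = 0; i < CHUNK; ++i)
        acc += s_coeffs[i] * v; // heavy reuse of the staged data
    out[idx] = acc;
}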

the latency of accessing shared memory

Which latency is longer between the two situations below?
1. The data are filled into shared memory from global memory, and all the threads access the shared memory concurrently. The data may be the same for multiple accessing threads.
2. All the threads access global memory, but the data are neighbors.
If you plan on accessing each value only once, then you won't gain anything from using shared memory.
Values in shared memory are only valid within a block, so one or more threads in each block will have to load the values from global memory. So you're not able to avoid the global memory accesses.
If you have a device of compute capability >= 2.0 (Fermi), values read from global memory are automatically cached in the L1 and L2 caches. L1 has the same latency as shared memory.
Latency is a fixed value that depends on which memory you're accessing. It doesn't change. Latency is always much lower for shared memory than for global memory.
I think what you might really be asking is what type of access would give you the best memory throughput. If you will be using each value only once, case (2) will give the best throughput. If you will be reusing values and have CC >= 2.0, letting L1 handle the caching is likely to give the best throughput. If you're reusing values on CC < 2.0, using shared memory will give the best throughput.
Case (1) may or may not cause bank conflicts but will give better throughput regardless, for values that are already stored in shared memory.
Case (2) describes the optimal access pattern for global memory.
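To make the two cases concrete, here is a minimal sketch; the kernel names and the 256-thread block size are illustrative assumptions:

// Case (2): each thread reads its own neighboring element straight from
// global memory - one fully coalesced transaction per warp.
__global__ void case2_direct(const float *g, float *out, int n)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (idx < n) out[idx] = g[idx];
}

// Case (1): the block first stages the values in shared memory, then the
// threads read them from there. This only pays off if the values are reused.
__global__ void case1_staged(const float *g, float *out, int n)
{
    __shared__ float s[256];                  // assumes blockDim.x <= 256
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (idx < n) s[threadIdx.x] = g[idx];     // same coalesced global load
    __syncthreads();
    if (idx < n) out[idx] = s[threadIdx.x];   // extra shared-memory round trip
}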
Perhaps I don't understand the difference between the two cases. But if I do:
The second is faster if your hardware architecture allows it, for example on a multicore machine with parallel registers. Notice also that in the second case, even from a pure software viewpoint, the data do not need to be made thread-safe against concerns such as race conditions due to interleaving.
Think of it like this:
CASE 2:
You have a large table with five dinners, and you have five kids to eat them: no synchronization needed.
CASE 1:
You have, say, three tables with three dinners; so that two kids may have to eat from the same plate and thus may need to synchronize their movements so they don't hit each other. Synchronization means delay.

What is the fastest way to transfer a 256-byte data block from one CUDA block to another?

What is the fastest way to transfer a 256-byte data block from one CUDA block to another?
And is there a way to transfer it faster than through global memory?
In theory, on devices of compute capability >= 2.0, transfers between blocks, using global memory, could be very fast because global memory transactions use the L1 and L2 caches.
However, the only way to safely transfer memory between blocks is to launch those blocks in separate kernel invocations. Then, you lose the theoretical advantage I just described, as the caches are flushed between invocations.
Within a given kernel invocation, you cannot know in which order your blocks will be launched.
Transferring data between blocks launched by separate kernel invocations is a common paradigm in CUDA and if there is enough computational work to be done, the latency of the global memory transactions can be completely hidden.
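As a minimal sketch of that paradigm, two kernels launched back-to-back on the same stream can pass 256 bytes per block through a global buffer; the kernel names, the 64-threads-per-block layout (64 ints = 256 bytes), and the placeholder computations are illustrative assumptions:

#include <cuda_runtime.h>

// Stage 1: each block writes its 256-byte result (64 ints) to its own slot
// in a global buffer.
__global__ void producer(int *block_data)
{
    int idx = blockIdx.x * 64 + threadIdx.x;   // 64 threads per block assumed
    block_data[idx] = blockIdx.x;              // placeholder computation
}

// Stage 2: launched afterwards, each block reads the slot written by a
// different block in the previous kernel. This is safe because the first
// kernel has fully completed (and its writes are visible) before this one starts.
__global__ void consumer(const int *block_data, int *out, int num_blocks)
{
    int src_block = (blockIdx.x + 1) % num_blocks;   // read a neighbor's data
    out[blockIdx.x * 64 + threadIdx.x] = block_data[src_block * 64 + threadIdx.x] + 1;
}

int main()
{
    const int num_blocks = 128, elems = num_blocks * 64;
    int *d_data, *d_out;
    cudaMalloc(&d_data, elems * sizeof(int));
    cudaMalloc(&d_out,  elems * sizeof(int));
    producer<<<num_blocks, 64>>>(d_data);
    consumer<<<num_blocks, 64>>>(d_data, d_out, num_blocks);   // separate launch
    cudaDeviceSynchronize();
    cudaFree(d_data);
    cudaFree(d_out);
    return 0;
}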