GPU coalesced global memory access vs using shared memory - cuda

If a thread is accessing global memory, why does it access a large chunk? Where is this large chunk stored?
If you're reading from global memory in a coalesced manner, would it be beneficial to copy a common chunk of global memory into shared memory, or would there be no improvement?
For example: if each thread reads the next 5, 10, or 100 memory locations and averages them, and you could fit a chunk of X points from global memory into shared memory, could you not write an if statement saying that if you are looking for one of these values, you read it from shared memory rather than from global memory? I'm assuming the warp divergence penalty would be less than the cost of reading from global memory each time.

When you read from global memory, the data are first looked up in the L1 cache (high bandwidth, 1,600 GB/s on Fermi, but limited in size, 48 KB on Fermi). If they are not present in L1, they are looked up in L2 (lower bandwidth, but larger than L1, 768 KB on Fermi), and finally, if they are not present in L2 either, they are loaded from global memory*.
When a global memory load occurs, the data are moved to L2 and then to L1, so that they can be accessed faster the next time the same global memory read is required.
Such data may or may not be evicted by a subsequent global memory load. So, in principle, if you are reading "small" chunks of data, you do not necessarily need to place the data in shared memory to access them quickly the next time.
Take into account that, on Fermi and Kepler, shared memory is built from the same circuitry as the L1 cache. You can therefore think of shared memory as a programmer-controlled L1 cache.
You should place the data in shared memory when you want to be sure that they reside in, say, the "fastest available cache", and you should do so whenever you need to access the same data multiple times.
Note that the above is the general philosophy behind global memory transfers. Implementation details can differ depending on the underlying architecture.
*It should be noted that L1 caching of global loads can be disabled with a compiler option (on Fermi, nvcc's -Xptxas -dlcm=cg). This can improve performance for random data access patterns.
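To make the "controlled L1 cache" idea concrete for the averaging case in the question, here is a minimal sketch. It assumes each thread averages a window of consecutive elements; the names windowAverageShared, windowAverage, BLOCK and WINDOW are made up for illustration, and error checking is omitted. Note that no if statement choosing between shared and global memory is needed: the block loads its tile once with coalesced reads, and every thread then reads only from shared memory.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int BLOCK  = 256;  // threads per block (assumed; kernel must be launched with this)
constexpr int WINDOW = 5;    // elements averaged per thread (assumed)

// out[i] = mean(in[i], ..., in[i + WINDOW - 1]) for i in [0, n - WINDOW].
__global__ void windowAverageShared(const float* in, float* out, int n)
{
    // Tile = the BLOCK elements owned by this block plus a right-hand halo.
    __shared__ float tile[BLOCK + WINDOW - 1];
    int base = blockIdx.x * BLOCK;

    // Cooperative load: consecutive threads read consecutive addresses (coalesced).
    for (int t = threadIdx.x; t < BLOCK + WINDOW - 1; t += BLOCK) {
        int g = base + t;
        tile[t] = (g < n) ? in[g] : 0.0f;
    }
    __syncthreads();  // reached by every thread: there is no early return above

    int i = base + threadIdx.x;
    if (i + WINDOW <= n) {
        float sum = 0.0f;
        for (int k = 0; k < WINDOW; ++k)
            sum += tile[threadIdx.x + k];  // each reuse is served from shared memory
        out[i] = sum / WINDOW;
    }
}

// Host helper (hypothetical): runs the kernel and returns the n - WINDOW + 1 averages.
std::vector<float> windowAverage(const std::vector<float>& h_in)
{
    int n = static_cast<int>(h_in.size());
    int m = n - WINDOW + 1;
    std::vector<float> h_out(m > 0 ? m : 0);
    if (m <= 0) return h_out;

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, m * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int blocks = (m + BLOCK - 1) / BLOCK;
    windowAverageShared<<<blocks, BLOCK>>>(d_in, d_out, n);

    cudaMemcpy(h_out.data(), d_out, m * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

Whether this beats simply relying on L1/L2 depends on the architecture and on the window size; with a small window the caches may already capture most of the reuse, so it is worth measuring both versions.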

Related

When should texture memory be preferred over constant memory?

Does storing data in constant memory provide any benefit over texture memory on the Pascal architecture if the data is requested very frequently by the threads (every thread picks at least one value from a specific column)?
EDIT: This is a split version of this question to improve community searching
If the expectations for constant memory usage are satisfied, using constant memory is a good idea in the general case. It allows your code to take advantage of an additional cache mechanism provided by the GPU hardware, and in doing so puts less pressure on the texture cache used by other parts of your code.
Because constant memory and its cache, like texture and surface memory and their cache, are defined by the hardware's compute capability, the target hardware must be taken into account. The choice between constant memory and texture memory therefore depends on the access pattern, on how the caches are used, and on which caches are available.
Constant memory performance depends on broadcasting data among the threads of a warp, so maximum performance is achieved when all threads request the very same address and the data is already in the cache. If threads of the same warp request multiple addresses, the request is split into multiple requests, since only a single address can be retrieved per operation. If the number of split requests is too high, texture and surface memory may outperform constant memory in this specific situation. This is detailed in the CUDA Programming Guide:
The constant memory space resides in device memory and is cached in the constant cache mentioned in Compute Capability 2.x.
A request is then split into as many separate requests as there are
different memory addresses in the initial request, decreasing
throughput by a factor equal to the number of separate requests.
The resulting requests are then serviced at the throughput of the
constant cache in case of a cache hit, or at the throughput of device
memory otherwise.
The texture cache is more flexible than the constant cache: it can take advantage of reads within the same warp to addresses that are close together in 2D. Despite these advantages over constant memory, texture memory should in general be used when the data access pattern or the data size does not meet the constant memory requirements, or when you specifically want to make use of the texture cache. More detailed information can be found in the guide:
The texture and surface memory spaces
reside in device memory and are cached in texture cache, so a texture
fetch or surface read costs one memory read from device memory only on
a cache miss, otherwise it just costs one read from texture cache. The
texture cache is optimized for 2D spatial locality, so threads of the
same warp that read texture or surface addresses that are close
together in 2D will achieve best performance. Also, it is designed for
streaming fetches with a constant latency; a cache hit reduces DRAM
bandwidth demand but not fetch latency.
Reading device memory through texture or surface fetching present some benefits that can make it an advantageous alternative to reading device memory from global or constant memory:
If the memory reads do not follow the access patterns that global or constant memory reads must follow to get good performance, higher bandwidth can be achieved providing that there is locality in the texture fetches or surface reads;
Addressing calculations are performed outside the kernel by dedicated units;
Packed data may be broadcast to separate variables in a single operation;
8-bit and 16-bit integer input data may be optionally converted to 32 bit floating-point values in the range [0.0, 1.0] or [-1.0, 1.0] (see Texture Memory).
The developer should keep in mind that combining texture memory with constant memory can be a real advantage over relying on just one of them, because it lets the code benefit from both dedicated caches, each of which is faster than retrieving data from outside the cache (i.e. from device memory).
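As an illustration of the broadcast case that constant memory is designed for, here is a minimal sketch that evaluates a polynomial whose coefficients live in constant memory. The names c_coef, evalPoly, evalPolyOnDevice and NCOEF are hypothetical, and error checking is omitted. At each step of the loop every thread of a warp reads the same c_coef[k], so the request is not split.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int NCOEF = 4;             // number of coefficients (assumed)
__constant__ float c_coef[NCOEF];    // small read-only table in constant memory

// y[i] = c_coef[0] + c_coef[1]*x + c_coef[2]*x^2 + c_coef[3]*x^3 (Horner's rule).
__global__ void evalPoly(const float* x, float* y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float xi = x[i];                 // coalesced global read
    float acc = 0.0f;
    for (int k = NCOEF - 1; k >= 0; --k)
        acc = acc * xi + c_coef[k];  // same address for the whole warp: one broadcast
    y[i] = acc;
}

// Host helper (hypothetical): coef must point to NCOEF floats.
std::vector<float> evalPolyOnDevice(const float* coef, const std::vector<float>& h_x)
{
    int n = static_cast<int>(h_x.size());
    std::vector<float> h_y(n);
    if (n == 0) return h_y;

    cudaMemcpyToSymbol(c_coef, coef, NCOEF * sizeof(float));

    float *d_x = nullptr, *d_y = nullptr;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));
    cudaMemcpy(d_x, h_x.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 128;
    evalPoly<<<(n + threads - 1) / threads, threads>>>(d_x, d_y, n);

    cudaMemcpy(h_y.data(), d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_y);
    return h_y;
}
```

If the index into the table depended on the thread (for example c_coef[threadIdx.x % NCOEF]), the threads of a warp would request several addresses at once and the request would be split as the quoted guide text describes; that is the kind of access for which texture or plain global memory is worth considering instead.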

Why is shared memory faster than global memory?

Is the difference in speed due to the technology each is built with? (I read that shared memory is a scratchpad memory that is mainly SRAM, while global memory is typically DRAM.)
What if both were built with the same technology? Would there still be a performance difference because shared memory is on-chip and global memory is off-chip, due to extra instructions (load instructions) or extra hardware circuitry needed for global memory to load its data into the processor?
At least two reasons are the ones you've already pointed out. There is a:
Location difference - shared memory is on-chip, whereas global memory (at least for ordinary global memory accesses that do not hit in one of the caches) is off-chip. Memory is generally clocked at a fixed frequency, and the maximum frequency depends on how fast the system can be clocked. Long transmission lines, buffers that drive signals from off-chip to on-chip or vice versa, and many other circuit effects lower the maximum rate at which a particular circuit can be clocked. Shared memory therefore has a considerable advantage from being on-chip. The caches (L1, L2, read-only, constant cache, texture cache, etc.) all benefit from the same principle.
Technology difference. An SRAM cell (e.g. shared memory) might be clocked faster than a DRAM cell (e.g. off-chip global memory), and SRAM is more amenable to fast random access. DRAM has a more complicated access sequence that comes into play when a cell is accessed. DRAM is also burdened by mechanisms such as refresh that may get in the way of continuous fast access. However, I would suggest that the technology difference is less of an issue. Another technology-related point is that SRAM arrays are generally better suited to the logic processes that modern processors use, and can be placed at higher density on them. For highest density, DRAM arrays use a semiconductor process that differs substantially from the one used for general logic inside a processor.
The processor instructions required would not be a meaningful differentiator between shared memory and global memory access times.

What's the difference between "requested global load throughput" and "global load throughput" in CUDA

In CUDA, there are two metrics I don't quite understand clearly, which are "requested global load throughput" and "global load throughput".
From the question "What's the difference between "gld/st_throughput" and "dram_read/write_throughput" metrics?" I know the difference between global load throughput and DRAM load throughput, but what exactly is "requested global load throughput"?
If I want to tell how well my CUDA application behaves in terms of global memory access, which metric should I use?
Requested global loads are the loads you, as the programmer, write. This is to distinguish from "effective" global loads that the memory engine performs.
For example, when you load 32 floats from global memory, you are requesting a global load of 32x4 bytes. If those 32 floats lie within the same 128-byte segment, the 32 loads are coalesced into a single memory transaction of 128 bytes. But if the floats are scattered, the memory engine may have to perform several transactions to load all 32 of them. In the worst case, where all floats are more than 128 bytes apart, the memory engine issues one transaction per float: 32x128 bytes are effectively loaded from global memory, as opposed to the 32x4 bytes requested.
On a related note, the metric gld_efficiency is defined as 100 * gld_requested_throughput / gld_throughput. Therefore it hits 100% when all your accesses are perfectly coalesced. You may want to keep an eye on these different metrics to see how your application performs.
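The arithmetic above can be sketched as a small host-side model. This is only an illustration of the simplified picture in this answer (one 128-byte transaction per distinct segment touched by a warp, 4-byte elements, an aligned base address); it is not how the profiler computes the metric, and the name modelLoadEfficiency is made up.

```cuda
#include <cstddef>
#include <set>

constexpr int         WARP    = 32;   // threads per warp
constexpr std::size_t SEGMENT = 128;  // bytes per memory transaction in this model
constexpr std::size_t ELEM    = 4;    // sizeof(float)

// Models one warp-wide load where lane k reads a float at byte offset k * strideBytes.
// Returns 100 * requested bytes / transferred bytes, mirroring the gld_efficiency formula.
double modelLoadEfficiency(std::size_t strideBytes)
{
    std::set<std::size_t> segments;  // distinct 128-byte segments touched by the warp
    for (int lane = 0; lane < WARP; ++lane)
        segments.insert(lane * strideBytes / SEGMENT);

    double requested   = static_cast<double>(WARP * ELEM);                 // 32 x 4 bytes
    double transferred = static_cast<double>(segments.size() * SEGMENT);   // one transaction each
    return 100.0 * requested / transferred;
}
```

With a stride of 4 bytes (perfectly coalesced) the model gives 100%, and with a stride of 128 bytes it gives 3.125%, which is the 32x4 versus 32x128 case described above.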

CUDA Compute Capability 2.0. Global memory access pattern

Starting with CUDA Compute Capability 2.0 (Fermi), global memory access goes through a 768 KB L2 cache. It looks like developers no longer need to care about global memory banks. But global memory is still very slow, so the right access pattern is important. Now the point is to use and reuse L2 as much as possible. My question is: how? I would be thankful for some detailed information on how L2 works and how I should organize and access global memory if I need, for example, an array of 100-200 elements per thread.
L2 cache helps in some ways, but it does not obviate the need for coalesced access of global memory. In a nutshell, coalesced access means that for a given read (or write) instruction, individual threads in a warp are reading (or writing) adjacent, contiguous locations in global memory, preferably that are aligned as a group on a 128-byte boundary. This will result in the most effective utilization of the available memory bandwidth.
In practice this is often not difficult to accomplish. For example:
int idx=threadIdx.x + (blockDim.x * blockIdx.x);
int mylocal = global_array[idx];
will give coalesced (read) access across all the threads in a warp, assuming global_array is allocated in an ordinary fashion using cudaMalloc in global memory. This type of access makes 100% usage of the available memory bandwidth.
A key takeaway is that memory transactions ordinarily occur in 128-byte blocks, which happens to be the size of a cache line. If you request even one of the bytes in a block, the entire block will be read (and stored in L2, normally). If you later read other data from that block, it will normally be serviced from L2, unless it has been evicted by other memory activity. This means that the following sequence:
int mylocal1 = global_array[0];
int mylocal2 = global_array[1];
int mylocal3 = global_array[31];
would all typically be serviced from a single 128-byte block. The first read for mylocal1 will trigger the 128 byte read. The second read for mylocal2 would normally be serviced from the cached value (in L2 or L1) not by triggering another read from memory. However, if the algorithm can be suitably modified, it's better to read all your data contiguously from multiple threads, as in the first example. This may be just a matter of clever organization of data, for example using Structures of Arrays rather than Arrays of structures.
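The data-layout remark can be illustrated with a minimal sketch. The record type ParticleAoS, the kernels shiftXAoS and shiftXSoA, and the helper shiftX are all hypothetical, and error checking is omitted. Both kernels compute the same result; they differ only in how far apart the addresses read by adjacent threads are.

```cuda
#include <cuda_runtime.h>
#include <vector>

struct ParticleAoS { float x, y, z, mass; };  // 16 bytes per record (hypothetical)

// Array of Structures: adjacent threads touch x fields 16 bytes apart, so a warp's
// 32 reads span 512 bytes (four 128-byte blocks) and only a quarter of each is used.
__global__ void shiftXAoS(ParticleAoS* p, float dx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i].x += dx;
}

// Structure of Arrays: adjacent threads touch adjacent floats, so a warp's
// 32 reads fall in one 128-byte block when the array is suitably aligned.
__global__ void shiftXSoA(float* x, float dx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += dx;
}

// Host helper (hypothetical): shifts the x values using either layout.
std::vector<float> shiftX(const std::vector<float>& xs, float dx, bool useSoA)
{
    int n = static_cast<int>(xs.size());
    std::vector<float> result(n);
    if (n == 0) return result;
    int threads = 128, blocks = (n + threads - 1) / threads;

    if (useSoA) {
        float* d_x = nullptr;
        cudaMalloc(&d_x, n * sizeof(float));
        cudaMemcpy(d_x, xs.data(), n * sizeof(float), cudaMemcpyHostToDevice);
        shiftXSoA<<<blocks, threads>>>(d_x, dx, n);
        cudaMemcpy(result.data(), d_x, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d_x);
    } else {
        std::vector<ParticleAoS> h_p(n);
        for (int i = 0; i < n; ++i) h_p[i] = ParticleAoS{xs[i], 0.0f, 0.0f, 0.0f};
        ParticleAoS* d_p = nullptr;
        cudaMalloc(&d_p, n * sizeof(ParticleAoS));
        cudaMemcpy(d_p, h_p.data(), n * sizeof(ParticleAoS), cudaMemcpyHostToDevice);
        shiftXAoS<<<blocks, threads>>>(d_p, dx, n);
        cudaMemcpy(h_p.data(), d_p, n * sizeof(ParticleAoS), cudaMemcpyDeviceToHost);
        cudaFree(d_p);
        for (int i = 0; i < n; ++i) result[i] = h_p[i].x;
    }
    return result;
}
```

If a kernel uses all four fields of each record, the AoS penalty shrinks, since the extra bytes fetched are eventually consumed (often from cache); the SoA layout pays off most when a kernel touches only some of the fields.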
In many respects, this is similar to CPU cache behavior. The concept of a cache line is similar, along with the behavior of servicing requests from the cache.
Fermi L1 and L2 can support write-back and write-through. L1 is available on a per-SM basis, and is configurably split with shared memory to be either 16 KB L1 (and 48 KB shared memory) or 48 KB L1 (and 16 KB shared memory). L2 is unified across the device and is 768 KB.
Some advice I would offer is to not assume that the L2 cache just fixes sloppy memory accesses. The GPU caches are much smaller than equivalent caches on CPUs, so it's easier to get into trouble there. A general piece of advice is simply to code as if the caches were not there. Rather than CPU oriented strategies like cache-blocking, it's usually better to focus your coding effort on generating coalesced accesses and then possibly make use of shared memory in some specific cases. Then for the inevitable cases where we can't make perfect memory accesses in all situations, we let the caches provide their benefit.
You can get more in-depth guidance by looking at some of the available NVIDIA webinars. For example, the Global Memory Usage & Strategy webinar (and slides) or the CUDA Shared Memory & Cache webinar would be instructive for this topic. You may also want to read the Device Memory Access section of the CUDA C Programming Guide.

Is CUDA shared memory also cached

In my CUDA application, I am copying data from device memory to shared memory. Is that data cached in L1 as well?
By default, all memory loads from global memory are cached in L1. The target location for the global memory load has no effect on the L1 caching (whether it is a register, or shared memory or thread local memory). The shared memory itself is obviously not cached.
This is just to expand on what @talonmies said.
A copy is two separate operations at a low level, a load and a store. Both load and store can be cached in L1 and L2 if they access global memory.
Since the load part of your copy is from global memory, it will be cached both in L1 and L2 by default. So, unless the compiler detects the special situation of copying from global to shared memory and uses an uncached load, you end up with two copies of your data that can be accessed at the same latency because the shared memory and L1 cache are implemented with the same physical on-chip memory.
From the CUDA C Programming Guide 4.2:
There is an L1 cache for each multiprocessor and an L2 cache shared by all
multiprocessors, both of which are used to cache accesses to local or global
memory, including temporary register spills. The cache behavior (e.g. whether reads
are cached in both L1 and L2 or in L2 only) can be partially configured on a per-access basis using modifiers to the load or store instruction.
I couldn't find anything about how this behavior may be modified from CUDA C.
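For what it is worth, later releases of the Programming Guide do document per-access load functions with cache hints, such as __ldcg(), which asks for the load to be cached in L2 only. The sketch below assumes a toolkit and compute capability where that intrinsic is available (check the guide for the minimum required); the kernel reverseTiles, the helper reverseTilesOnDevice and the constant TILE are hypothetical, and error checking is omitted. It stages each tile in shared memory through such a load, so the data is not intended to be duplicated in L1. For a whole translation unit on Fermi, the nvcc option -Xptxas -dlcm=cg has a similar effect on global loads.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int TILE = 256;  // elements per block (assumed; launch with TILE threads)

// Copies each tile from global to shared memory with an L2-only cache hint,
// then writes the tile back in reverse order so the shared copy is actually used.
__global__ void reverseTiles(const float* in, float* out, int n)
{
    __shared__ float tile[TILE];
    int base = blockIdx.x * TILE;
    int i = base + threadIdx.x;

    if (i < n)
        tile[threadIdx.x] = __ldcg(in + i);  // load with the "cache global" (L2) hint
    __syncthreads();

    int count = min(TILE, n - base);         // valid elements in this tile
    if (threadIdx.x < count)
        out[i] = tile[count - 1 - threadIdx.x];
}

// Host helper (hypothetical): returns the input with every TILE-sized chunk reversed.
std::vector<float> reverseTilesOnDevice(const std::vector<float>& h_in)
{
    int n = static_cast<int>(h_in.size());
    std::vector<float> h_out(n);
    if (n == 0) return h_out;

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    reverseTiles<<<(n + TILE - 1) / TILE, TILE>>>(d_in, d_out, n);

    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

How much this matters depends on the architecture: the cache hierarchy has changed since Fermi, so the hint is best treated as a request whose effect should be confirmed with a profiler.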