On NVIDIA's 2.x architecture, each warp has 64kb of memory that is by default partitioned into 48kb of Shared Memory and 16kb of L1 cache (servicing global and constant memory).
We all know about the bank conflicts of accessing Shared Memory - the memory is divided into 32 banks of size 32-bits to allow simultaneous independent access by all 32 threads. On the other hand, Global Memory, though much slower, does not experience bank conflicts because memory requests are coalesced across the warp.
Question: Suppose some data from global or constant memory is cached in the L1 cache for a given warp. Is access to this data subject to bank conflicts, like Shared Memory (since the L1 Cache and the Shared Memory are in fact the same hardware), or is it bank-conflict-free in the way that Global/Constant memory is?
On NVIDIA's 2.x architecture, each warp has 64kb of memory that is by
default partitioned into 48kb of Shared Memory and 16kb of L1 cache
Compute capability 2.x devices have 64 KB of SRAM per Streaming Multiprocessor (SM) that can be configured as
16 KB L1 and 48 KB shared memory, or
48 KB L1 and 16 KB shared memory.
(servicing global and constant memory).
Loads and stores to global memory, local memory, and surface memory go through the L1. Accesses to constant memory go through dedicated constant caches.
We all know about the bank conflicts of accessing Shared Memory - the
memory is divided into 32 banks of size 32-bits to allow simultaneous
independent access by all 32 threads. On the other hand, Global
Memory, though much slower, does not experience bank conflicts because
memory requests are coalesced across the warp.
Accesses through L1 to global or local memory are done per cache line (128 B). When a load request is issued to L1, the LSU needs to perform an address divergence calculation to determine which threads are accessing the same cache line. The LSU then has to perform an L1 cache tag lookup. If the line is cached then it is written back to the register file; otherwise, the request is sent to L2. If the warp has threads not serviced by the request then a replay is requested and the operation is reissued with the remaining threads.
Multiple threads in a warp can access the same bytes in the cache line without causing a conflict.
Question: Suppose some data from global or constant memory is cached
in the L1 cache for a given warp.
Constant memory is not cached in L1; it is cached in the dedicated constant caches.
Is access to this data subject to bank conflicts, like Shared Memory
(since the L1 Cache and the Shared Memory are in fact the same
hardware), or is it bank-conflict-free in the way that Global/Constant
memory is?
L1 and the constant cache access a single cache line at a time so there are no bank conflicts.
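To make that concrete, here is a small illustrative kernel (not taken from the answer; names are hypothetical). Both loads touch a single 128-byte line per warp, so L1 services each of them with one cache-line access and there is no shared-memory-style bank serialization.

__global__ void sameLineLoads(const float *in, float *out, int n)
{
    // Assumes n >= 32 and that in points to a cudaMalloc'd buffer
    // (hence at least 128-byte aligned).
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float broadcast = in[0];                 // every thread reads the same word
    float perThread = in[threadIdx.x % 32];  // 32 consecutive words = one 128 B line

    out[i] = broadcast + perThread;
}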
Related
If a thread is accessing global memory, why does it access a large chunk? Where is this large chunk stored?
If you're reading from global memory in a coalesced manner, would it be beneficial to copy a common chunk of the global memory into shared memory, or would there not be any improvement?
i.e., if each thread is reading the next 5 or 10 or 100 memory locations and averaging them, and you could fit a chunk of X points from global memory into shared memory, could you not write an if statement saying that if you're looking for one of these memory values, you read from shared memory rather than global? I'm assuming the warp divergence penalty would be less than reading from global memory each time.
When you read from global memory, the data are searched first in the L1 cache (high bandwidth, about 1,600 GB/s on Fermi, but limited in size, 48 KB on Fermi), then, if not present in L1, in the L2 cache (lower bandwidth, but larger than L1, 768 KB on Fermi), and finally, if not present in L2, they are loaded from global memory*.
When a global memory load occurs, the data are moved to L2 and then to L1, so that they can be accessed faster the next time a global memory read is required.
Such data may or may not be evicted by a subsequent global memory load. So, in principle, if you are reading "small" chunks of data, you do not necessarily need to force the data into shared memory to access them quickly the next time.
Take into account that, on Fermi and Kepler, shared memory is built from the same circuitry as the L1 cache. You can then see shared memory as a controlled L1 cache.
You should then force the data to reside in shared memory whenever you need to access the same data multiple times, so you can be sure they stay in, say, the "fastest available cache".
Note that the above is the general philosophy behind global memory transfers. Implementation details can differ depending on the underlying architecture.
*It should be noted that L1 caching of global loads can be disabled by a compiler option. This is useful in terms of performance for random-access data patterns.
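For the averaging scenario in the question above, here is a rough sketch of using shared memory as a "controlled L1 cache" as described in this answer (all names and sizes are hypothetical, not taken from the original posts): each block stages a tile of global memory into shared memory once, then every thread averages its window out of the tile instead of re-reading global memory.

#define TILE 256   // threads per block; launch with TILE threads per block
#define WIN  10    // window each thread averages (hypothetical size)

__global__ void windowAverage(const float *in, float *out, int n)
{
    __shared__ float tile[TILE + WIN];              // tile plus halo for the window

    int base = blockIdx.x * TILE;
    int gid  = base + threadIdx.x;

    // Stage the tile (and its halo) from global into shared memory once.
    if (gid < n)
        tile[threadIdx.x] = in[gid];
    if (threadIdx.x < WIN && base + TILE + threadIdx.x < n)
        tile[TILE + threadIdx.x] = in[base + TILE + threadIdx.x];
    __syncthreads();

    // Each thread averages its next WIN values from shared memory.
    if (gid + WIN <= n) {
        float sum = 0.0f;
        for (int k = 0; k < WIN; ++k)
            sum += tile[threadIdx.x + k];
        out[gid] = sum / WIN;
    }
}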
I need a large amount of constant data, more than 6-8 KB, up to 16 KB. At the same time, I don't use shared memory, and now I want to store this constant data in shared memory. Is that a good idea? Any rough performance estimates? Does broadcasting work for shared memory as well as it does for constant memory?
Performance is critical for the application. And I think I have only an 8 KB constant memory cache on my Tesla C2075 (compute capability 2.0).
In compute capability 2.0, the same memory is used for L1 and shared memory. The partitioning between L1 and shared memory can be controlled with the cudaFuncSetCacheConfig() call. I would suggest setting L1 to the maximum possible (48 KB) with
cudaFuncSetCacheConfig(MyKernel, cudaFuncCachePreferL1);
Then, pull your constant data from global memory, and let L1 handle the caching. If you have multiple arrays that are const, you can direct the compiler to use the constant cache for some of them by using the const qualifier in the kernel argument list. That way, you can leverage both L1 and the constant cache to cache your constants.
Broadcasting works both for L1 and constant cache accesses.
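A minimal sketch of that suggestion (kernel and variable names are hypothetical, not from the answer): the table stays in global memory, the kernel reads it through a const-qualified pointer, and the host requests the 48 KB L1 configuration before launching.

// Hypothetical kernel: a coefficient table of up to ~16 KB kept in global
// memory and read through a const-qualified pointer, letting L1 cache it.
__global__ void applyCoeffs(const float *coeffs, const float *in,
                            float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * coeffs[i % 4096];   // 4096 floats = 16 KB of coefficients
}

// Host side: request the 48 KB L1 / 16 KB shared split, then launch.
void launchApplyCoeffs(const float *d_coeffs, const float *d_in,
                       float *d_out, int n)
{
    cudaFuncSetCacheConfig(applyCoeffs, cudaFuncCachePreferL1);
    applyCoeffs<<<(n + 255) / 256, 256>>>(d_coeffs, d_in, d_out, n);
}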
In my CUDA application, I am copying data from device memory to shared memory. Is that data cached in L1 as well?
By default, all memory loads from global memory are cached in L1. The target location for the global memory load has no effect on the L1 caching (whether it is a register, or shared memory or thread local memory). The shared memory itself is obviously not cached.
This is just to expand on what @talonmies said.
A copy is two separate operations at a low level, a load and a store. Both load and store can be cached in L1 and L2 if they access global memory.
Since the load part of your copy is from global memory, it will be cached both in L1 and L2 by default. So, unless the compiler detects the special situation of copying from global to shared memory and uses an uncached load, you end up with two copies of your data that can be accessed at the same latency because the shared memory and L1 cache are implemented with the same physical on-chip memory.
From the CUDA C Programming Guide 4.2:
There is an L1 cache for each multiprocessor and an L2 cache shared by all
multiprocessors, both of which are used to cache accesses to local or global
memory, including temporary register spills. The cache behavior (e.g. whether reads
are cached in both L1 and L2 or in L2 only) can be partially configured on a per-access basis using modifiers to the load or store instruction.
I couldn't find anything about how this behavior may be modified from CUDA C.
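For illustration only, a minimal sketch (names hypothetical) of the kind of global-to-shared copy being discussed. The single copy line compiles into a global load, cached in L1 and L2 by default, followed by a shared-memory store, so the data end up both in L1 and in shared memory.

// Assumes blockDim.x == 256.
__global__ void stageTile(const float *in, float *out)
{
    __shared__ float tile[256];

    int i = blockIdx.x * 256 + threadIdx.x;
    tile[threadIdx.x] = in[i];          // load from global, store to shared
    __syncthreads();

    out[i] = tile[255 - threadIdx.x];   // reuse the staged data
}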
According to "CUDA C Programming Guide", a constant memory access benefits only if a multiprocessor constant cache is hit (Section 5.3.2.4)1. Otherwise there can be even more memory requests for a half-warp than in case of the coalesced global memory read. So why the constant memory size is limited to 64 KB?
One more question in order not to ask twice. As far as I understand, in the Fermi architecture the texture cache is combined with the L2 cache. Does texture usage still make sense or the global memory reads are cached in the same manner?
¹ Constant Memory (Section 5.3.2.4)
The constant memory space resides in device memory and is cached in the constant cache mentioned in Sections F.3.1 and F.4.1.
For devices of compute capability 1.x, a constant memory request for a warp is first split into two requests, one for each half-warp, that are issued independently.
A request is then split into as many separate requests as there are different memory addresses in the initial request, decreasing throughput by a factor equal to the number of separate requests.
The resulting requests are then serviced at the throughput of the constant cache in case of a cache hit, or at the throughput of device memory otherwise.
The constant memory size is 64 KB for compute capability 1.0-3.0 devices. The cache working set is only 8KB (see the CUDA Programming Guide v4.2 Table F-2).
Constant memory is used by the driver, compiler, and variables declared __device__ __constant__. The driver uses constant memory to communicate parameters, texture bindings, etc. The compiler uses constants in many of the instructions (see disassembly).
Variables placed in constant memory can be read and written using the host runtime functions cudaMemcpyToSymbol() and cudaMemcpyFromSymbol() (see the CUDA Programming Guide v4.2 section B.2.2). Constant memory is in device memory but is accessed through the constant cache.
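A hypothetical example of those host runtime functions (names are illustrative, not from the answer): a small table declared __constant__ and filled with cudaMemcpyToSymbol() before the kernel runs.

__constant__ float d_filter[256];            // lives in the 64 KB constant space

__global__ void scaleByFilter(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * d_filter[i % 256];  // read through the constant cache;
                                             // fastest when the index is uniform
                                             // across a warp
}

void uploadFilter(const float *h_filter)     // h_filter: 256 floats on the host
{
    cudaMemcpyToSymbol(d_filter, h_filter, 256 * sizeof(float));
}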
On Fermi, the texture, constant, L1 and I-Caches are all level 1 caches in or around each SM. All level 1 caches access device memory through the L2 cache.
The 64 KB constant limit is per CUmodule which is a CUDA compilation unit. The concept of CUmodule is hidden under the CUDA runtime but accessible by the CUDA Driver API.
I'm not finding an improvement in speed with shared memory on an NVIDIA Tesla M2050
with about 49K shared memory per block. Actually if I allocate
a large char array in shared memory it slows down my program. For example
__shared__ char database[49000];
gives me slower running times than
__shared__ char database[4900];
The program accesses only the first 100 chars of database so the extra space
is unnecessary. I can't figure out why this is happening. Any help would be appreciated.
Thanks.
The reason for the relatively poor performance of CUDA shared memory when using larger arrays may have to do with the fact that each multiprocessor has a limited amount of available shared memory.
Each multiprocessor hosts several processors; on modern devices, typically 32, the same as the number of threads in a warp. This means that, in the absence of divergence or memory stalls, the average processing rate is 32 instructions per cycle (latency is high due to pipelining).
CUDA schedules several blocks to a multiprocessor. Each block consists of several warps. When a warp stalls on a global memory access (even coalesced accesses have high latency), other warps are processed. This effectively hides the latency, which is why high-latency global memory is acceptable in GPUs. To effectively hide latency, you need enough extra warps to execute until the stalled warp can continue. If all warps stall on memory accesses, you can no longer hide the latency.
Shared memory is allocated to blocks in CUDA and stored on a single multiprocessor on the GPU device. Each multiprocessor has a relatively small, fixed amount of shared memory space. CUDA cannot schedule more blocks to a multiprocessor than the multiprocessor can support in terms of shared memory and register usage. In other words, if the amount of shared memory on a multiprocessor is X and each block requires Y shared memory, CUDA will schedule no more than floor(X/Y) blocks at a time to each multiprocessor (it might be less, since there are other constraints, such as register usage). For example, with 48 KB (49152 bytes) of shared memory per multiprocessor, a block declaring __shared__ char database[49000] allows only one resident block per multiprocessor, whereas database[4900] allows up to ten (before other limits apply).
Ergo, by increasing shared memory usage of a block, you might be reducing the number of active warps - the occupancy - of your kernel, thereby hurting performance. You should look into your kernel code by compiling with the -Xptxas="-v" flag; this should give you register and shared & constant memory usage for each kernel. Use this data and your kernel launch parameters, as well as other required information, in the most recent version of the CUDA Occupancy Calculator to determine whether you might be affected by occupancy.
EDIT:
To address the other part of your question, assuming no shared memory bank conflicts and perfect coalescing of global memory accesses... there are two dimensions to this answer: latency and bandwidth. The latency of shared memory will be lower than that of global memory, since shared memory is on-chip. The bandwidth will be much the same. Ergo, if you are able to hide global memory access latency through coalescing, there is no penalty (note: the access pattern is important here, in that shared memory allows for potentially more diverse access patterns with little to no performance loss, so there can be benefits to using shared memory even if you can hide all the global memory latency).
Also, if you increase the per-block shared memory, CUDA will schedule fewer concurrent blocks so that they all have enough shared memory, which reduces parallelism and increases execution time.
The amount of resources available on the GPU is limited. The number of blocks running concurrently is roughly inversely proportional to the amount of shared memory per block.
This explains why the runtime is slower when you launch a kernel that uses a very large amount of shared memory.