Local cache hit metric in CUDA profiler

When profiling some CUDA applications, I see that the local hit rate (the local_hit_rate metric) is 0%.
From that value alone, I want to distinguish between the following two cases:
The application has no accesses to the local cache.
All accesses to the local cache were misses.
How can I tell which is the case? Since the values of inst_compute_ld_st, ldst_issued and ldst_executed are non-zero, is it safe to rule out the first case? Or is there something else I should check?
The device is an M2000, which is CC 5.2.

nvprof supports both events (raw counters) and metrics. These can be queried using the following commands:
nvprof --query-events
nvprof --query-metrics
CC 5.x/6.x Local Memory Metrics
local_load_transactions_per_request: Average number of local memory load transactions performed for each local memory load
local_store_transactions_per_request: Average number of local memory store transactions performed for each local memory store
local_load_transactions: Number of local memory load transactions
local_store_transactions: Number of local memory store transactions
local_hit_rate: Hit rate for local loads and stores
local_memory_overhead: Ratio of local memory traffic to total memory traffic between the L1 and L2 caches expressed as percentage
local_load_throughput: Local memory load throughput
local_store_throughput: Local memory store throughput
inst_executed_local_loads: Warp level instructions for local loads
inst_executed_local_stores: Warp level instructions for local stores
l2_local_load_bytes: Bytes read from L2 for misses in Unified Cache for local loads
l2_local_global_store_bytes: Bytes written to L2 from Unified Cache for local and global stores. This does not include global atomics.
local_load_requests: Total number of local load requests from Multiprocessor
local_store_requests: Total number of local store requests from Multiprocessor
local_*_requests is the number of instructions executed that access local memory via the generic address space or the local address space. On CC 5.x/6.x I do not recall whether this includes fully predicated-off instructions.
local_*_transactions is the number of cache accesses that occurred due to the size (32-bit, 64-bit, ...) of the request and the address divergence of the request. If this is non-zero then local memory was accessed.
l2_local_*_bytes is the number of bytes of data loaded/stored to the L2 cache.
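If you want a quick sanity check that the first case (no local traffic at all) is what you are seeing, you can profile a kernel that is known to use local memory, such as the minimal sketch below (the kernel and its launch configuration are purely illustrative, not from the question), with something like nvprof --metrics local_hit_rate,local_load_transactions,local_store_transactions ./app, and compare its counts against your application's.
__global__ void force_local(int *out, int idx)
{
    // A per-thread array indexed with a runtime value is usually placed in
    // local memory rather than registers, so it generates local_* traffic.
    int scratch[64];
    for (int i = 0; i < 64; ++i)
        scratch[i] = threadIdx.x + i;
    out[threadIdx.x] = scratch[(idx + threadIdx.x) % 64];   // launch with <= 64 threads per block
}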

Related

Do I need to externally call flush if using cuda api to copy from GPU Memory to Persistent Memory?

I am using the CUDA API:
cudaMemcpyAsync ( void* dst, const void* src, size_t count, cudaMemcpyKind kind, cudaStream_t stream = 0 )
to copy data from GPU memory to CPU memory. When copying data from CPU memory to Persistent Memory using memcpy(), we need to explicitly call a flush operation (e.g. clflush()) to make sure the data is flushed from the CPU caches. Do I need to call the flush operation when copying from GPU Memory to Persistent Memory using cudaMemcpyAsync()?
Do I need to call the flush operation when copying from GPU Memory to Persistent Memory using cudaMemcpyAsync()?
No.
However, you are calling a potentially asynchronous API, so you may need to use one of the synchronization APIs (stream or device scope) in order to ensure data consistency between operations that can potentially overlap and need to access the same memory area.
Intel processors with the server uncore design, starting with Sandy Bridge, support Data Direct I/O (DDIO), which is enabled by default. With DDIO, an inbound PCIe write targeting a system memory location of type WB is an allocating write transaction.
For a full write (that writes to an entire cache line), the IIO first obtains ownership of the target cache line by invalidating all copies in the coherence domain except in the L3 that exists in the same NUMA node to which the originating device is attached. If the line doesn't already exist in the target L3, an L3 entry is allocated, which may require evicting another line to make space. The write is performed in the L3 and the coherence state of the line becomes M. This means that the data is not sent to the memory controller to which its address is mapped. Partial writes are buffered in the IIO (which is in the coherence domain) until they are eventually evicted to be written into the LLC (allocate or update). In DDIO, reads are never allocating.
Even if DDIO is disabled, PCIe writes can be buffered in the IIO. When cudaMemcpyAsync or even cudaMemcpy returns, there is no guarantee that all writes have reached the persistence domain on Intel processors (unless you have Whole System Persistence). In addition, the memory copy is not guaranteed to be persistently atomic, and there is no guarantee of the order in which the bytes will move from the IIO to the target memory controllers. You need a flag to tell you whether the entire data was persisted or not.
You can use a barrier (cudaStreamSynchronize() or cudaDeviceSynchronize()) to wait on the host until the data copy operation is complete, and then flush each cache line, followed by writing a flag, in that order.
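As a rough host-side sketch of that ordering (assuming an x86 CPU with CLWB support and that the destination buffer and the flag live in persistent memory mapped into the host address space; the function, buffer and flag names are illustrative, not from the question):
#include <cuda_runtime.h>
#include <immintrin.h>   // _mm_clwb / _mm_sfence; build with CLWB support enabled
#include <stddef.h>
#include <stdint.h>

void copy_and_persist(void *dst, const void *src, size_t count,
                      volatile uint64_t *flag, cudaStream_t stream)
{
    cudaMemcpyAsync(dst, src, count, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);               // 1. wait for the copy to complete

    for (size_t off = 0; off < count; off += 64)
        _mm_clwb((char *)dst + off);             // 2. flush every cache line of the destination
    _mm_sfence();                                //    order the flushes before the flag write

    *flag = 1;                                   // 3. publish "data is persistent"
    _mm_clwb((void *)flag);
    _mm_sfence();
}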

What's the difference between "requested global load throughput" and "global load throughput" in CUDA

In CUDA, there are two metrics I don't quite understand clearly, which are "requested global load throughput" and "global load throughput".
From "What's the difference between gld/st_throughput and dram_read/write_throughput metrics?" I know the difference between global load throughput and DRAM load throughput, but what exactly is "requested global load throughput"?
If I want to tell how good my CUDA application behaves in global memory access, which metric should I use?
Requested global loads are the loads you, as the programmer, write. This is to distinguish from "effective" global loads that the memory engine performs.
For example, when you load 32 floats from global memory, you are requesting a 32x4 bytes global load. If those 32 floats are within the same 128 bytes segment, these 32 loads will be coalesced into a single memory transaction of 128 bytes. But if those floats are scattered, the memory engine may have to do several transactions to load all 32 floats. In the worst case, where all floats are more than 128 bytes from each other, the memory engine will issue 1 transaction per float: you get 32x128 bytes effectively loaded from global memory as opposed to 32x4 requested.
On a related note, the metric gld_efficiency is defined as 100 * gld_requested_throughput / gld_throughput. Therefore it hits 100% when all your accesses are perfectly coalesced. You may want to keep an eye on these different metrics to see how your application performs.
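To make this concrete, the two (purely illustrative) kernels below request the same number of bytes per thread, but the strided version scatters each warp's addresses across many cache lines, so its gld_throughput ends up much higher than its gld_requested_throughput and its gld_efficiency drops well below 100%.
// Consecutive threads touch consecutive addresses, so a warp's 32 requests
// coalesce into a handful of wide transactions.
__global__ void coalesced_copy(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Same requested bytes per thread, but a large stride forces the memory
// engine to issue many more transactions for each warp.
__global__ void strided_copy(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n)
        out[i] = in[i * stride];
}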

GPU coalesced global memory access vs using shared memory

If a thread is accessing global memory, why does it access a large chunk? Where is this large chunk stored?
If you're reading from global memory in a coalesced manner, would it be beneficial to copy a common chunk of the global memory into shared memory, or would there be no improvement?
i.e.: if each thread is reading the next 5 or 10 or 100 memory locations and averaging them, and you could fit a chunk of X points from global memory into shared memory, could you not write an if statement saying: if you're looking for one of these memory values, read from shared memory rather than global? I'm assuming the warp divergence penalty would be less than reading from global memory each time.
When you read from global memory, the data are searched for first in the L1 cache (high bandwidth, 1,600 GB/s on Fermi, but limited in size, 48 KB on Fermi), then, if not present in L1, they are searched for in L2 (lower bandwidth, but larger than L1, 768 KB on Fermi), and finally, if not present in L2, they are loaded from global memory*.
When a global memory load occurs, the data are moved to L2 and then to L1, so to be able to access them in a faster way next time a global memory read is required.
Possibly, such data are evicted by a subsequent global memory load, possibly not. So, in principle, if you are reading "small" chunks of data, you do not need to necessarily force the data to be located in the shared memory to access them next time in a fast way.
Take into account that, on Fermi and Kepler, shared memory is made of the same circuitry as the L1 cache. You can then see the shared memory as a controlled L1 cache.
You should then force the data to reside in shared memory when you need to be sure that they stay in, say, the "fastest available cache", and you do that whenever you need to access the same data multiple times.
Note that the above is the general philosophy behind global memory transfers. Implementation details can differ depending on the underlying architecture.
*It should be noted that the L1 cache can be disabled by a compiler option. This is useful in terms of performance for random-access data patterns.
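As a rough sketch of the pattern the question describes (each thread averaging the next K elements), one possible shared-memory version is shown below; TILE and K are illustrative values, and whether this beats simply relying on L1 depends on the architecture and on how often each element is actually reused.
#define K    10     // how many following elements each thread averages
#define TILE 256    // threads (and base elements) per block

__global__ void average_next_k(const float *in, float *out, int n)
{
    // Load one tile plus a halo of K extra elements into shared memory once,
    // so each element is read from global memory only once per block.
    __shared__ float s[TILE + K];
    int g = blockIdx.x * TILE + threadIdx.x;

    if (g < n)
        s[threadIdx.x] = in[g];
    if (threadIdx.x < K && g + TILE < n)
        s[TILE + threadIdx.x] = in[g + TILE];
    __syncthreads();

    if (g + K < n) {
        float sum = 0.0f;
        for (int j = 0; j < K; ++j)
            sum += s[threadIdx.x + j];   // reused data comes from shared memory
        out[g] = sum / K;
    }
}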

CUDA bank conflict for L1 cache?

On NVIDIA's 2.x architecture, each warp has 64kb of memory that is by default partitioned into 48kb of Shared Memory and 16kb of L1 cache (servicing global and constant memory).
We all know about the bank conflicts of accessing Shared Memory - the memory is divided into 32 banks of size 32-bits to allow simultaneous independent access by all 32 threads. On the other hand, Global Memory, though much slower, does not experience bank conflicts because memory requests are coalesced across the warp.
Question: Suppose some data from global or constant memory is cached in the L1 cache for a given warp. Is access to this data subject to bank conflicts, like Shared Memory (since the L1 Cache and the Shared Memory are in fact the same hardware), or is it bank-conflict-free in the way that Global/Constant memory is?
On NVIDIA's 2.x architecture, each warp has 64kb of memory that is by default partitioned into 48kb of Shared Memory and 16kb of L1 cache
Compute capability 2.x devices have 64 KB of SRAM per Streaming Multiprocessor (SM) that can be configured as
16 KB L1 and 48 KB shared memory, or
48 KB L1 and 16 KB shared memory.
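The split can be selected at runtime with the cache configuration API, for example (a minimal sketch; the kernel name and body are hypothetical):
#include <cuda_runtime.h>

__global__ void myKernel(float *data) { /* hypothetical kernel */ }

int main()
{
    // Prefer 48 KB shared memory / 16 KB L1 for the whole device ...
    cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);

    // ... or pick the split for one kernel only.
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);   // 48 KB L1 / 16 KB shared
    return 0;
}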
(servicing global and constant memory).
Loads and stores to global memory, local memory, and surface memory go through the L1. Accesses to constant memory go through dedicated constant caches.
We all know about the bank conflicts of accessing Shared Memory - the memory is divided into 32 banks of size 32-bits to allow simultaneous independent access by all 32 threads. On the other hand, Global Memory, though much slower, does not experience bank conflicts because memory requests are coalesced across the warp.
Accesses through L1 to global or local memory are done per cache line (128 B). When a load request is issued to L1, the LSU needs to perform an address divergence calculation to determine which threads are accessing the same cache line. The LSU then has to perform an L1 cache tag lookup. If the line is cached, it is written back to the register file; otherwise, the request is sent to L2. If the warp has threads not serviced by the request, a replay is requested and the operation is reissued with the remaining threads.
Multiple threads in a warp can access the same bytes in the cache line without causing a conflict.
Question: Suppose some data from global or constant memory is cached in the L1 cache for a given warp.
Constant memory is not cached in L1; it is cached in the constant caches.
Is access to this data subject to bank conflicts, like Shared Memory (since the L1 Cache and the Shared Memory are in fact the same hardware), or is it bank-conflict-free in the way that Global/Constant memory is?
L1 and the constant cache access a single cache line at a time so there are no bank conflicts.
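For contrast, bank conflicts show up only on explicit shared memory accesses whose addresses map to the same bank. A minimal (illustrative) kernel, launched with 32 threads per block, that exhibits both patterns:
__global__ void bank_conflict_demo(float *out)
{
    __shared__ float s[32 * 32];
    for (int i = threadIdx.x; i < 32 * 32; i += blockDim.x)
        s[i] = (float)i;
    __syncthreads();

    // Conflict-free: consecutive threads read consecutive 32-bit banks.
    float a = s[threadIdx.x];

    // 32-way conflict: a stride of 32 words maps every thread of the warp
    // to the same bank, so the access is serialized.
    float b = s[threadIdx.x * 32];

    out[threadIdx.x] = a + b;
}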

Is CUDA shared memory also cached

In my CUDA application, I am copying data from device memory to shared memory. Is that data cached in L1 as well?
By default, all memory loads from global memory are cached in L1. The target location for the global memory load has no effect on the L1 caching (whether it is a register, shared memory, or thread-local memory). The shared memory itself is obviously not cached.
This is just to expand on what @talonmies said.
A copy is two separate operations at a low level, a load and a store. Both load and store can be cached in L1 and L2 if they access global memory.
Since the load part of your copy is from global memory, it will be cached both in L1 and L2 by default. So, unless the compiler detects the special situation of copying from global to shared memory and uses an uncached load, you end up with two copies of your data that can be accessed at the same latency because the shared memory and L1 cache are implemented with the same physical on-chip memory.
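To make the "two operations" point concrete, a plain global-to-shared copy such as the (illustrative) kernel below compiles to a global load into a register followed by a shared store; it is the global-load half that gets cached in L1/L2:
__global__ void stage_to_shared(const float *g_in, float *g_out, int n)
{
    __shared__ float tile[256];                // assumes 256 threads per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n)
        tile[threadIdx.x] = g_in[i];           // global load (cached in L1/L2) + shared store
    __syncthreads();

    // ... work on tile[] here; the same data may also still be resident in L1 ...

    if (i < n)
        g_out[i] = tile[threadIdx.x];
}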
From the CUDA C Programming Guide 4.2:
There is an L1 cache for each multiprocessor and an L2 cache shared by all multiprocessors, both of which are used to cache accesses to local or global memory, including temporary register spills. The cache behavior (e.g. whether reads are cached in both L1 and L2 or in L2 only) can be partially configured on a per-access basis using modifiers to the load or store instruction.
I couldn't find anything about how this behavior may be modified from CUDA C.