I am trying to fully understand the information that ptxas -v reports about kernel stack usage and register spilling (for the sm_35 architecture). For one of my kernels it produces:
3536 bytes stack frame, 3612 bytes spill stores, 6148 bytes spill loads
ptxas info : Used 255 registers, 392 bytes cmem[0]
I know that the stack frame is allocated in local memory, which physically resides in the same device memory as global memory and is private to each thread.
My questions are:
1. Is the memory needed for register spilling also allocated in local memory?
2. Is the total amount of memory needed for register spilling and stack usage equal to [number of threads] x [3536 bytes]? That is, do the spill loads/stores operate on the stack frame?
3. The spill store/load numbers don't say anything about the size of the individual transfers. Are these always 32-bit registers? If so, would spilling a 64-bit floating-point number be counted as 2 spill stores?
4. Are spill stores/loads cached in the L2 cache?
Registers are spilled to local memory. "local" means "thread-local", i.e. storage private to each thread.
The amount of local memory required for the entire launch is at least number_of_threads times local_memory_bytes_per_thread. Due to allocation granularity it can often be more.
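For example, with the 3536-byte stack frame above, a hypothetical launch of 64 blocks of 256 threads (16384 threads) needs at least 16384 x 3536 bytes, roughly 58 MB of local memory; in practice the driver may reserve more, since it typically sizes the allocation for the maximum number of threads that can be resident on the device.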
The compiler statistics for spill transfers are already normalized to bytes, as individual local memory accesses may have different widths. Inspecting the generated machine code (run cuobjdump --dump-sass on the binary) will show the width of the individual accesses. The relevant instructions will have names like LLD, LST, LDL, STL.
I am reasonably sure that local memory accesses are cached in the L1 and L2 caches, but cannot quote the relevant paragraphs from the documentation at this time.
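As a hedged illustration (kernel name and sizes are made up): a per-thread array indexed with a runtime-dependent value is placed in local memory, and the same local load/store instructions serve both such stack data and compiler-generated spills, so dumping the SASS shows the access widths.

__global__ void local_demo(const double *in, double *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    double scratch[64];                      // 512 bytes of per-thread storage
    for (int i = 0; i < 64; ++i)
        scratch[i] = in[(tid + i) % n];
    // Runtime-dependent indexing keeps the array out of registers,
    // so it lives in thread-local memory (part of the stack frame).
    out[tid] = scratch[tid % 64];
}

// Build and inspect (shown as comments so the file stays valid CUDA):
//   nvcc -arch=sm_35 -cubin --ptxas-options=-v local_demo.cu
//   cuobjdump --dump-sass local_demo.cubin
// Width suffixes on the local load/store instructions (e.g. a 64-bit access)
// indicate how many bytes each individual spill store/load moves.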
I am relatively new to CUDA programming.
In this blog (How to Access Global Memory Efficiently in CUDA C/C++ Kernels), we have the following:
"The device can access global memory via 32-, 64-, or 128-byte transactions that are aligned to their size."
The 128-byte transaction is also mentioned in this post (The cost of CUDA global memory transactions)
In addition, 32- and 128-byte memory transactions are also mentioned in the CUDA C Programming Guide. The guide also shows Figure 20 about aligned and misaligned accesses, which I couldn't quite understand.
Could you explain and give examples of how 32-, 64-, and 128-byte transactions happen?
Could you go through Figure 20 in more detail? What is the point the figure is making?
Both of these need to be understood in the context of a CUDA warp. All operations are issued warp-wide, and this includes instructions that access memory.
An individual CUDA thread can access 1,2,4,8,or 16 bytes in a single instruction or transaction. When considered warp-wide, that translates to 32 bytes all the way up to 512 bytes. The GPU memory controller can typically issue requests to memory in granularities of 32 bytes, up to 128 bytes. Larger requests (say, 512 bytes, considered warp wide) will get issued via multiple "transactions" of typically no more than 128 bytes.
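A minimal sketch of what those per-thread widths look like in code (the kernel names are made up; each body is a plain copy and the indexing assumes a 1D launch):

__global__ void copy4 (const float  *in, float  *out) { int i = blockIdx.x * blockDim.x + threadIdx.x; out[i] = in[i]; }  // 4 bytes/thread  -> 128 bytes/warp
__global__ void copy8 (const float2 *in, float2 *out) { int i = blockIdx.x * blockDim.x + threadIdx.x; out[i] = in[i]; }  // 8 bytes/thread  -> 256 bytes/warp
__global__ void copy16(const float4 *in, float4 *out) { int i = blockIdx.x * blockDim.x + threadIdx.x; out[i] = in[i]; }  // 16 bytes/thread -> 512 bytes/warp, issued as multiple transactions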
Modern DRAM memory has the design characteristic that you don't typically ask for a single byte; you request a "segment", typically 32 bytes at a time on current GPU designs. The division of memory into segments is fixed at design time. As a result, you can request either the first 32 bytes (the first segment) or the second 32 bytes (the second segment), but you cannot request bytes 16-47, for example. This is all a function of the DRAM design, but it manifests in terms of memory behavior.
The diagrams depict the behavior of each thread in a warp. Individually, the threads are depicted by the gray/black arrows pointing upwards. Each arrow represents the request from a thread, and it points to the relative location in memory that the thread would like to load or store.
The diagrams are presented in comparison to each other to show the effect of "alignment". When considered warp-wide, if all 32 threads are requesting bytes of data that belong to a single segment, this would require the memory controller to retrieve only one segment to satisfy the request. This would arguably be the most efficient possible behavior (and therefore data organization as well as access pattern, considered warp-wide) for a single request (i.e. a single load or store instruction).
However if the addresses emanating from each thread in the warp result in a pattern depicted in the 2nd figure, this would be "unaligned", and even though you are effectively asking for a similar data "footprint", the lack of alignment to a single segment means the memory controller will need to retrieve 2 segments from memory to satisfy the request.
That is the key point of understanding associated with the figure. But there is more to the story than that. Misaligned access is not necessarily as tragic (performance cut in half) as this might suggest. The GPU caches play a role here, when we consider these transactions not just in the context of a single warp, but across many warps.
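As a concrete illustration of the aligned vs. misaligned case above, here is a hypothetical copy kernel that shifts every access by a runtime offset; with offset = 0 each warp's 128-byte request falls on whole 32-byte segments, while with offset = 1 the same request straddles an extra segment:

__global__ void copy_with_offset(const float *in, float *out, int offset)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Any offset that is not a multiple of 8 floats (32 bytes) shifts the
    // warp's request off the segment boundaries.
    out[i] = in[i + offset];
}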
For a more complete and orderly treatment of these topics, I suggest working through some training material. It's by no means the only resource, but unit 4 of this training series covers the topic in more detail.
I'd like to know if the configuration of constant memory changes as the underlying architecture evolves from Kepler to Volta. To be specific, I have two questions:
1) Do the sizes of constant memory and the per-SM constant cache change?
2) What's the mapping from the cmem space to constant memory?
When compiling CUDA code and passing -v to ptxas (e.g. --ptxas-options=-v on the nvcc command line), we can see the memory usage reported like: ptxas info : Used 20 registers, 80 bytes cmem[0], 348 bytes cmem[2]. So do the cmem spaces map to constant memory? Does an access to each cmem space go through the on-chip constant cache?
I have found the answer for the 1st question.
In the CUDA C Programming Guide, Table 14 shows the size of constant memory and the constant cache for different compute capabilities.
The constant memory size is 64 KB for all compute capabilities from 2.x to 6.x. The on-chip constant cache size is 8 KB up to CC 3.0 and increases to 10 KB for later compute capabilities.
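For what it's worth, here is a minimal sketch (the variable and kernel names are invented) of the kind of code behind those numbers; compiling it with --ptxas-options=-v typically reports the kernel parameters in cmem[0], while user __constant__ data and compiler-generated constants show up in other cmem[] banks whose numbering varies across architectures and toolkit versions:

__constant__ float coeffs[256];                 // 1024 bytes of user constant memory

__global__ void apply(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * coeffs[i & 255];       // read is served through the constant cache
}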
In CUDA, there are two metrics I don't quite understand clearly, which are "requested global load throughput" and "global load throughput".
From What's the difference between "gld/st_throughput" and "dram_read/write_throughput" metrics? I know the difference between global load throughput and DRAM load throughput, but what exactly is "requested global load throughput"?
If I want to tell how good my CUDA application behaves in global memory access, which metric should I use?
Requested global loads are the loads you, as the programmer, write. This is to distinguish from "effective" global loads that the memory engine performs.
For example, when you load 32 floats from global memory, you are requesting a 32x4 bytes global load. If those 32 floats are within the same 128 bytes segment, these 32 loads will be coalesced into a single memory transaction of 128 bytes. But if those floats are scattered, the memory engine may have to do several transactions to load all 32 floats. In the worst case, where all floats are more than 128 bytes from each other, the memory engine will issue 1 transaction per float: you get 32x128 bytes effectively loaded from global memory as opposed to 32x4 requested.
On a related note, the metric gld_efficiency is defined as 100 * gld_requested_throughput / gld_throughput. Therefore it hits 100% when all your accesses are perfectly coalesced. You may want to keep an eye on these different metrics to see how your application performs.
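As a hedged sketch (kernel names invented), the two kernels below both request 4 bytes per thread, i.e. 128 bytes per warp, but the strided version spreads those requests over many 128-byte segments, so gld_throughput rises well above gld_requested_throughput and gld_efficiency drops:

__global__ void coalesced_copy(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];              // consecutive threads touch consecutive addresses: one 128-byte transaction per warp
}

__global__ void strided_copy(const float *in, float *out, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i * stride];     // with stride >= 32, every thread hits a different 128-byte segment
}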
According to the CUDA C Programming Guide, a constant memory access is only beneficial if the multiprocessor's constant cache is hit (Section 5.3.2.4)¹. Otherwise there can be even more memory requests for a half-warp than in the case of a coalesced global memory read. So why is the constant memory size limited to 64 KB?
One more question, in order not to ask twice. As far as I understand, in the Fermi architecture the texture cache is combined with the L2 cache. Does using textures still make sense, or are global memory reads cached in the same manner?
¹ Constant Memory (Section 5.3.2.4)
The constant memory space resides in device memory and is cached in the constant cache mentioned in Sections F.3.1 and F.4.1.
For devices of compute capability 1.x, a constant memory request for a warp is first split into two requests, one for each half-warp, that are issued independently.
A request is then split into as many separate requests as there are different memory addresses in the initial request, decreasing throughput by a factor equal to the number of separate requests.
The resulting requests are then serviced at the throughput of the constant cache in case of a cache hit, or at the throughput of device memory otherwise.
The constant memory size is 64 KB for compute capability 1.0-3.0 devices. The cache working set is only 8KB (see the CUDA Programming Guide v4.2 Table F-2).
Constant memory is used by the driver, compiler, and variables declared __device__ __constant__. The driver uses constant memory to communicate parameters, texture bindings, etc. The compiler uses constants in many of the instructions (see disassembly).
Variables placed in constant memory can be written using the host runtime function cudaMemcpyToSymbol() and read back using cudaMemcpyFromSymbol() (see the CUDA Programming Guide v4.2 section B.2.2). Constant memory is in device memory but is accessed through the constant cache.
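A minimal sketch of that host-side access (the variable name and size are made up):

__constant__ float gain[16];                     // device constant memory

void update_and_check(const float *host_values)
{
    // Write 16 floats from host memory into the constant variable...
    cudaMemcpyToSymbol(gain, host_values, 16 * sizeof(float));

    // ...and read them back to the host to verify.
    float check[16];
    cudaMemcpyFromSymbol(check, gain, 16 * sizeof(float));
}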
On Fermi texture, constant, L1 and I-Cache are all level 1 caches in or around each SM. All level 1 caches access device memory through the L2 cache.
The 64 KB constant limit is per CUmodule which is a CUDA compilation unit. The concept of CUmodule is hidden under the CUDA runtime but accessible by the CUDA Driver API.
My kernel call fails with "out of memory". It makes significant usage of the stack frame and I was wondering if this is the reason for its failure.
When invoking nvcc with --ptxas-options=-v it prints the following profile information:
150352 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 59 registers, 40 bytes cmem[0]
Hardware: GTX480, sm20, 1.5GB device memory, 48KB shared memory/multiprocessor.
My question is: where is the stack frame allocated? In shared memory, global memory, constant memory, ...?
I tried with 1 thread per block, as well as with 32 threads per block. Same "out of memory".
Another issue: one can only increase the number of threads resident on one multiprocessor as long as the total number of registers does not exceed the number of registers available on the multiprocessor (32 K for my card). Does something similar apply to the stack frame size?
Stack is allocated in local memory. Allocation is per physical thread (GTX480: 15 SM * 1536 threads/SM = 23040 threads). You are requesting 150,352 bytes/thread => ~3.4 GB of stack space. CUDA may reduce the maximum physical threads per launch if the size is that high. The CUDA language is not designed to have a large per thread stack.
In terms of registers GTX480 is limited to 63 registers per thread and 32K registers per SM.
The stack frame is most likely in local memory.
I believe there is some limit on local memory usage, but even without it, the CUDA driver will likely allocate local memory for more threads than the single thread of your <<<1,1>>> launch configuration.
One way or another, even if you manage to actually run your code, I fear it may be quite slow because of all those stack operations. Try to reduce the number of function calls (e.g. by inlining those functions).
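A hedged sketch of that suggestion (the helper and kernel are invented): marking a __device__ function __forceinline__ removes the ABI call sequence and its per-call stack usage, which can shrink the stack frame reported by ptxas -v.

// A true (non-inlined) call follows the ABI and may need per-call stack
// space; forcing the inline avoids that.
__forceinline__ __device__ float helper(float x)
{
    return x * x + 1.0f;
}

__global__ void kernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = helper(in[i]);
}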