How to measure Streaming Multiprocessor use/idle times in CUDA?

A simple question, really: I have a kernel which runs with the maximum number of blocks per Streaming Multiprocessor (SM) possible and would like to know how much more performance I could theoretically extract from it. Ideally, I'd like to know the percentage of SM cycles that are idle, i.e. all warps are blocked on memory access.
I'm really just interested in finding that number. What I'm not looking for is:
General tips on increasing occupancy. I'm using all the occupancy I can get, and even if I manage to get more performance, it won't tell me how much more would theoretically be possible.
How to compute the theoretical peak GFlops. My computations are not FP-centric; there's a lot of integer arithmetic and logic going on too.

The Nsight Visual Studio Edition 2.1 and 2.2 Issue Efficiency experiments provide the information you are requesting. These counters/metrics should be added to the Visual Profiler in a release after CUDA 5.0.
Nsight Visual Studio Edition
From Nsight Visual Studio Edition 2.2 User Guide | Analysis Tools | Other Analysis Reports | Profiler CUDA Settings | Issue Efficiency Section
Issue Efficiency provides information about the device's ability to
issue the instructions of the kernel. The data reported includes
execution dependencies, eligible warps, and SM stall reasons.
For devices of compute capability 2.x, a multiprocessor has two warp
schedulers. Each warp scheduler manages at most 24 warps, for a total
of 48 warps per multiprocessor. The kernel execution configuration may
reduce the runtime limit. For information on occupancy, see the
Achieved Occupancy experiment. The first scheduler is in charge of the
warps with an odd ID, and the second scheduler is in charge of warps
with an even ID.
KEPLER UPDATE: For compute capability 3.x, a multiprocessor has four warp schedulers. Each warp scheduler manages at most 16 warps, for a total of 64 warps per multiprocessor.
At every instruction issue time, each scheduler will pick an eligible
warp from its list of active warps and issue an instruction. A warp is
eligible if the instruction has been fetched, the execution unit
required by the instruction is available, and the instruction has no
dependencies that have not been met.
The schedulers report the following statistics on the warps in the
multiprocessor:
Active Warps – A warp is active from the time it is scheduled on a
multiprocessor until it completes the last instruction. The active
warps counter increments by 0-48 per cycle. The maximum increment per
cycle is defined by the theoretical occupancy.
KEPLER UPDATE Range is 0-64 per cycle.
Eligible Warps – An active warp is eligible if it is able to issue the next instruction.
Warps that are not eligible will report an Issue Stall Reason. This
counter will increment by 0-ActiveWarps per cycle.
UPDATE On Fermi the Issue Stall Reason counters are updated only on cycles in which the warp scheduler had no eligible warps. On Kepler the Issue Stall Reason counters are updated every cycle even if the warp scheduler issues an instruction.
Zero Eligible Warps – This counter increments each cycle by 1 if
neither scheduler has a
warp that can be issued.
One Eligible Warp – This counter increments
each cycle by 1 if only one of the two schedulers has a warp that can
be issued.
KEPLER UPDATE: On Kepler the counters are per scheduler, so One Eligible Warp means that the scheduler could issue an instruction. On Fermi there is a single counter for both schedulers, so on Fermi you want the One Eligible Warp counter to be as small as possible.
Warp Issue Holes – This counter increments each cycle by
the number of active warps that are not eligible. This is the same as
Active Warps minus Eligible Warps.
Long Warp Issue Holes – This
counter increments each cycle by the number of active warps that have
not been eligible to issue an instruction for more than 32 clock
cycles. Long holes indicate that warps are stalled on long latency
reasons such as barriers and memory operations.
Issue Stall Reasons – Each cycle, each ineligible warp will increment one of the issue stall reason counters. The sum of all issue stall reason counters is equal to Warp Issue Holes. An ineligible warp will increment the Instruction Fetch stall reason if the next assembly instruction has not yet been fetched. It will increment the Execution Dependency stall reason if an input dependency is not yet available; this can be reduced by increasing the number of independent instructions. It will increment the Data Request stall reason if the request cannot currently be made because the required resources are not available, are fully utilized, or too many operations of that type are already outstanding; if data requests make up a large portion of the stall reasons, you should also run the memory experiments to determine whether you can optimize the transactions per request or whether you need to revisit your algorithm. It will increment the Texture stall reason if the texture sub-system is already fully utilized and currently not able to accept further operations. It will increment the Synchronization stall reason if the warp is blocked at a __syncthreads(). If this reason is large and the kernel execution configuration is limited to a small number of blocks, then consider dividing the kernel grid into more thread blocks.
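To make the Execution Dependency advice above concrete, the sketch below shows a hypothetical kernel rewritten with more independent instructions (instruction-level parallelism). The kernel names and shapes are made up for illustration, and remainder handling is omitted.

    // Hypothetical example: each add in this loop depends on the previous
    // one, so ineligible cycles show up as Execution Dependency stalls.
    __global__ void partial_sum_dependent(const float *in, float *out, int n)
    {
        float acc = 0.0f;
        for (int i = threadIdx.x; i < n; i += blockDim.x)
            acc += in[i];                  // serial dependency chain
        out[threadIdx.x] = acc;            // one partial sum per thread
    }

    // Same work with four independent accumulators: each thread now has up
    // to four independent adds the scheduler can issue while earlier ones
    // are in flight, reducing Execution Dependency stalls (remainder
    // elements are ignored here for brevity).
    __global__ void partial_sum_ilp4(const float *in, float *out, int n)
    {
        float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
        int stride = blockDim.x;
        for (int i = threadIdx.x; i + 3 * stride < n; i += 4 * stride) {
            a0 += in[i];
            a1 += in[i + stride];
            a2 += in[i + 2 * stride];
            a3 += in[i + 3 * stride];
        }
        out[threadIdx.x] = a0 + a1 + a2 + a3;
    }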
Visual Profiler 5.0
The Visual Profiler does not have counters that address your question. Until the counters are added you can use the following counters:
sm_efficiency[_instance]
ipc[_instance]
achieved_occupancy.
The target and max IPCs for the relevant compute capabilities are:

Compute Capability   Target IPC   Max IPC
2.0                  1.7          2.0
2.1                  2.3          4.0
3.x                  4.4          7.0
The target IPC is for ALU limited computation. The target IPC for memory bound kernels will be less. For compute capability 2.1 devices and higher it is harder to use IPC as each warp scheduler can dual-issue.

Related

task scheduling of NVIDIA GPU

I have some doubts about the task scheduling of NVIDIA GPUs.
(1) If a warp of threads in a block (CTA) has finished but other warps are still running, will this warp wait for the others to finish? In other words, do all threads in a block (CTA) release their resources only when every thread has finished? I think this should be right, since the threads in a block share the shared memory and other resources, and these resources are allocated at CTA granularity.
(2) If all threads in a block (CTA) are hung up on some long latency such as a global memory access, will a new CTA's threads occupy their resources, the way it happens on a CPU? In other words, once a block (CTA) has been dispatched to an SM (streaming multiprocessor), does it hold its resources until it has finished?
I would appreciate it if someone could recommend a book or articles about GPU architecture. Thanks!
The Compute Work Distributor will schedule a thread block (CTA) on an SM only if the SM has sufficient resources for the thread block (shared memory, warps, registers, barriers, ...). Thread block level resources such as shared memory are allocated, and the allocation creates sufficient warps for all threads in the thread block. The resource manager allocates warps round robin to the SM sub-partitions. Each SM sub-partition contains a warp scheduler, register file, and execution units. Once a warp is allocated to a sub-partition it will remain on that sub-partition until it completes or is pre-empted by a context switch (Pascal architecture). On context switch restore the warp will be restored to the same SM with the same warp-id.
When all threads in a warp have completed, the warp scheduler waits for all outstanding instructions issued by the warp to complete and then the resource manager releases the warp level resources, which include the warp-id and register file.
When all warps in a thread block complete, the block level resources are released and the SM notifies the Compute Work Distributor that the block has completed.
Once a warp is allocated to a sub-partition and all resources are allocated, the warp is considered active, meaning that the warp scheduler is actively tracking the state of the warp. On each cycle the warp scheduler determines which active warps are stalled and which are eligible to issue an instruction. The warp scheduler picks the highest priority eligible warp and issues 1-2 consecutive instructions from the warp. The rules for dual-issue are specific to each architecture. If a warp issues a memory load it can continue to execute independent instructions until it reaches a dependent instruction; the warp will then report stalled until the load completes. The same is true for dependent math instructions. The SM architecture is designed to hide both ALU and memory latency by switching per cycle between warps.
This answer does not use the term CUDA core as this introduces an incorrect mental model. CUDA cores are pipelined single precision floating point/integer execution units. The issue rate and dependency latency are specific to each architecture. Each SM sub-partition and SM has other execution units including load/store units, double precision floating point units, half precision floating point units, branch units, etc.
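A tiny, hypothetical kernel illustrating the point about loads and independent instructions: the math on the second line can issue while the load is still in flight, and the warp only reports a stall at the first use of the loaded value.

    __global__ void overlap_load_and_math(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float loaded = in[i];            // memory load is issued here
        float indep  = i * 0.5f + 1.0f;  // independent math can issue while the load is outstanding
        out[i] = loaded * indep;         // first use of the load: the warp stalls here if the data is not ready
    }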
I recommend this article. It's somewhat outdated, but I think it is a good starting point. The article targets the Kepler architecture, so the most recent architectures, such as Pascal, may behave somewhat differently.
Answers for your specific questions (based on the article):
Q1. Do threads in a block release their resources only after all threads in the block finish running?
Yes. A warp that has finished running while other warps in the same block have not still holds its resources such as registers and shared memory.
Q2. When all threads in a block are stalled, does the block still occupy its resources, or does a new block of threads take over the resources?
You are asking whether a block can be preempted. I've searched the web and found the answer here.
On compute capabilities < 3.2 blocks are never preempted.
On compute capabilities 3.2+ the only two instances when blocks can be preempted are during device-side kernel launch (dynamic parallelism) or single-gpu debugging.
So blocks don't give up their resources when stalled on global memory accesses. Rather than expecting stalled warps to be preempted, you should design your CUDA code so that there are plenty of warps resident on an SM, waiting to be dispatched. In that case, even when some warps are waiting for global memory accesses to finish, the schedulers can issue other warps, effectively hiding the latency.
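One common way to keep plenty of warps resident, as suggested above, is a grid-stride loop launched with many more blocks than SMs. A minimal sketch (the kernel and launch numbers are only illustrative):

    // Grid-stride kernel: launching many blocks keeps warps resident on every
    // SM, so the schedulers can issue from other warps while some are stalled
    // on global memory loads.
    __global__ void scale(float *data, float alpha, int n)
    {
        for (int i = blockIdx.x * blockDim.x + threadIdx.x;
             i < n;
             i += gridDim.x * blockDim.x)
            data[i] = alpha * data[i];
    }

    // Example launch: oversubscribe the SMs instead of launching one block per SM.
    // int numSMs;
    // cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, 0);
    // scale<<<8 * numSMs, 256>>>(d_data, 2.0f, n);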

CUDA blocks & warps - which can run in parallel on a single SM?

Ok I know that related questions have been asked over and over again and I read pretty much everything I found about this, but things are still unclear. Probably also because I found and read things contradicting each other (maybe because, being from different times, they referred to devices with different compute capability, between which there seems to be quite a gap). I am looking to be more efficient, to reduce my execution time and thus I need to know exactly how many threads/warps/blocks can run at once in parallel. Also I was thinking of generalizing this and calculating an optimal number of threads and blocks to pass to my kernel based only on the number of operations I know I have to do (for simpler programs) and the system specs.
I have a GTX 550Ti, btw with compute capability 2.1.
4 SMs x 48 cores = 192 CUDA cores.
Ok so what's unclear to me is:
Can more than 1 block run AT ONCE (in parallel) on a multiprocessor (SM)? I read that up to 8 blocks can be assigned to an SM, but nothing about how they're run. From the fact that my max number of threads per SM (1536) is barely larger than my max number of threads per block (1024) I would think that blocks aren't run in parallel (maybe 1 and a half?). Or at least not if I have the max number of threads in them. Also, if I set the number of blocks to, let's say, 4 (my number of SMs), will they each be sent to a different SM?
Or I can't really control how all this is distributed on the hardware and then this is a moot point, my execution time will vary based on the whims of my device ...
Secondly, I know that a block will divide its threads into groups of 32 threads that run in parallel, called warps. Now these warps (presuming they have no relation to each other) can be run in parallel as well? Because in the Fermi architecture it states that 2 warps are executed concurrently, sending one instruction from each warp to a group of 16 (?) cores, while somewhere else I read that each core handles a warp, which would explain the 1536 max threads (32*48) but seems a bit much. Can 1 CUDA core handle 32 threads concurrently?
On a simpler note, what I'm asking is: (for ex) if I want to sum 2 vectors in a third one, what length should I give them (nr of operations) and how should I split them in blocks and threads for my device to work concurrently (in parallel) at full capacity (without having idle cores or SMs).
I'm sorry if this was asked before and I didn't get it or didn't see it. Hope you can help me. Thank you!
The distribution and parallel execution of work are determined by the launch configuration and the device. The launch configuration states the grid dimensions, block dimensions, registers per thread, and shared memory per block. Based upon this information and the device you can determine the number of blocks and warps that can execute on the device concurrently. When developing a kernel you usually look at the ratio of warps that can be active on the SM to the maximum number of warps per SM for the device. This is called the theoretical occupancy. The CUDA Occupancy Calculator can be used to investigate different launch configurations.
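Besides the Occupancy Calculator spreadsheet, later CUDA toolkits (6.5 and newer) expose the same calculation through a runtime API. A sketch, assuming a placeholder kernel named myKernel that is not part of the original posts:

    #include <cstdio>

    __global__ void myKernel(float *data)
    {
        // placeholder body; the kernel is only used for the occupancy query
        data[threadIdx.x] += 1.0f;
    }

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        int blockSize = 256;      // candidate launch configuration
        int blocksPerSM = 0;
        // Reports how many blocks of myKernel can be resident on one SM for
        // this block size and 0 bytes of dynamic shared memory.
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel,
                                                      blockSize, 0);

        int activeWarps = blocksPerSM * blockSize / prop.warpSize;
        int maxWarps    = prop.maxThreadsPerMultiProcessor / prop.warpSize;
        printf("theoretical occupancy: %.1f%%\n", 100.0 * activeWarps / maxWarps);
        return 0;
    }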
When a grid is launched the compute work distributor will rasterize the grid and distribute thread blocks to SMs and SM resources will be allocated for the thread block. Multiple thread blocks can execute simultaneously on the SM if the SM has sufficient resources.
In order to launch a warp, the SM assigns the warp to a warp scheduler and allocates registers for the warp. At this point the warp is considered an active warp.
Each warp scheduler manages a set of warps (24 on Fermi, 16 on Kepler). Warps that are not stalled are called eligible warps. On each cycle the warp scheduler picks an eligible warp and issues instruction(s) for the warp to execution units such as int/fp units, double precision floating point units, special function units, branch resolution units, and load store units. The execution units are pipelined, allowing many warps to have 1 or more instructions in flight each cycle. Warps can be stalled on instruction fetch, data dependencies, execution dependencies, barriers, etc.
Each kernel has a different optimal launch configuration. Tools such as Nsight Visual Studio Edition and the NVIDIA Visual Profiler can help you tune your launch configuration. I recommend that you try to write your code in a flexible manner so you can try multiple launch configurations. I would start by using a configuration that gives you at least 50% occupancy then try increasing and decreasing the occupancy.
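One way to write the code in a flexible manner, as recommended, is to parameterize the block size and simply time each configuration with CUDA events. A rough sketch with a placeholder kernel (not from the original question):

    #include <cstdio>

    __global__ void doubleElements(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    int main()
    {
        const int n = 1 << 20;
        float *d_data;
        cudaMalloc(&d_data, n * sizeof(float));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        // Sweep a few block sizes and keep whichever runs fastest on this device.
        for (int block = 64; block <= 1024; block *= 2) {
            int grid = (n + block - 1) / block;
            cudaEventRecord(start);
            doubleElements<<<grid, block>>>(d_data, n);
            cudaEventRecord(stop);
            cudaEventSynchronize(stop);
            float ms = 0.0f;
            cudaEventElapsedTime(&ms, start, stop);
            printf("block size %4d: %.3f ms\n", block, ms);
        }

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(d_data);
        return 0;
    }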
Answers to each Question
Q: Can more than 1 block run AT ONCE (in parallel) on a multiprocessor (SM)?
Yes, the maximum number is based upon the compute capability of the device. See Table 10. Technical Specifications per Compute Capability: Maximum number of resident blocks per multiprocessor to determine the value. In practice the launch configuration limits the value achieved at run time. See the occupancy calculator or one of the NVIDIA analysis tools for more details.
Q: From the fact that my max number of threads per SM (1536) is barely larger than my max number of threads per block (1024) I would think that blocks aren't run in parallel (maybe 1 and a half?).
The launch configuration determines the number of blocks per SM. The ratio of maximum threads per block to maximum threads per SM is set to allow the developer more flexibility in how they partition work.
Q: If I set the number of blocks to, let's say 4 (my number of SMs), will they be sent to a different SM each? Or I can't really control how all this is distributed on the hardware and then this is a moot point, my execution time will vary based on the whims of my device ...
You have limited control over work distribution. You can artificially control this by limiting occupancy (for example, by allocating more shared memory per block), but this is an advanced optimization.
Q: Secondly, I know that a block will divide its threads into groups of 32 threads that run in parallel, called warps. Now these warps (presuming they have no relation to each other) can be run in parallel as well?
Yes, warps can run in parallel.
Q: Because in the Fermi architecture it states that 2 warps are executed concurrently
Each Fermi SM has 2 warp schedulers. Each warp scheduler can dispatch instruction(s) for 1 warp each cycle. Instruction execution is pipelined so many warps can have 1 or more instructions in flight every cycle.
Q: Sending one instruction from each warp to a group of 16 (?) cores, while somewhere else I read that each core handles a warp, which would explain the 1536 max threads (32x48) but seems a bit much. Can 1 CUDA core handle 32 threads concurrently?
Yes. The CUDA core count is the number of integer and floating point execution units; the SM also has other types of execution units, which I listed above. The GTX 550 Ti is a CC 2.1 device. On each cycle an SM has the potential to dispatch at most 4 instructions (128 threads). Depending on the definition of execution, the total number of threads in flight per cycle can range from many hundreds to many thousands.
I am looking to be more efficient, to reduce my execution time and thus I need to know exactly how many threads/warps/blocks can run at once in parallel.
In short, the number of threads/warps/blocks that can run concurrently depends on several factors. The CUDA C Best Practices Guide has a writeup on Execution Configuration Optimizations that explains these factors and provides some tips for reasoning about how to shape your application.
One of the concepts that took a while to sink in, for me, is the efficiency of the hardware support for context switching on the CUDA chip.
In effect, a context switch can occur on every memory access, allowing calculations to proceed for many contexts alternately while the others wait on their memory accesses. One of the ways that GPGPU architectures achieve performance is the ability to parallelize this way, in addition to parallelizing on the multiple cores.
Best performance is achieved when no core is ever waiting on a memory access, which happens when there are just enough contexts to ensure this.
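Tying this back to the vector-sum example from the question: a typical approach is one thread per element, with a block size that is a multiple of 32 and a grid size derived from the vector length. Any reasonably large n (say, a million elements or more) gives the SMs far more warps than they can run at once, which is exactly what keeps them busy. A minimal sketch:

    __global__ void vecAdd(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];   // one thread per element
    }

    // Typical launch for n elements (d_a, d_b, d_c are device pointers):
    // int block = 256;                      // multiple of the warp size
    // int grid  = (n + block - 1) / block;  // enough blocks to cover the vector
    // vecAdd<<<grid, block>>>(d_a, d_b, d_c, n);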

How do nVIDIA CC 2.1 GPU warp schedulers issue 2 instructions at a time for a warp?

Note: This question is specific to nVIDIA Compute Capability 2.1 devices. The following information is obtained from the CUDA Programming Guide v4.1:
In compute capability 2.1 devices, each SM has 48 SP (cores)
for integer and floating point operations. Each warp is composed
of 32 consecutive threads. Each SM has 2 warp schedulers. At every
instruction issue time, one warp scheduler picks a ready warp of
threads and issues 2 instructions for the warp on the cores.
My doubts:
One thread will execute on one core. How can the device issue 2 instructions to a thread in a single clock cycle or a single multi-cycle operation?
Does this mean the 2 instructions should be independent of each other?
That the 2 instructions can be executed in parallel on the core, maybe because they use different execution units in the core? Does this also mean that the warp will be ready next only after 2 instructions are finished executing or is it after one of them?
This is instruction-level parallelism (ILP). The instructions issued from a warp simultaneously must be independent of each other. They are issued by the SM instruction scheduler to separate functional units in the SM.
For example, if there are two independent FMAD instructions in the warp's instruction stream that are ready to issue and the SM has two available sets of FMAD units on which to issue them, they can both be issued in the same cycle. (Instructions can be issued together in various combinations, but I have not memorized them so I won't provide details here.)
The FMAD/IMAD execution units in SM 2.1 are 16 SPs wide. This means that it takes 2 cycles to issue a warp (32-thread) instruction to one of the 16-wide execution units. There are multiple (3) of these 16-wide execution units (48 SPs total) per SM, plus special function units. Each warp scheduler can issue to two of them per cycle.
Assume the FMAD execution units are pipe_A, pipe_B and pipe_C. Let us say that at cycle 135, there are two independent FMAD instructions fmad_1 and fmad_2 that are waiting:
At cycle 135, the instruction scheduler will issue the first half warp (16 threads) of fmad_1 to FMAD pipe_A, and the first half warp of fmad_2 to FMAD pipe_B.
At cycle 136, the first half warp of fmad_1 will have moved to the next stage in FMAD pipe_A, and similarly the first half warp of fmad_2 will have moved to the next stage in FMAD pipe_B. The warp scheduler now issues the second half warp of fmad_1 to FMAD pipe_A, and the second half warp of fmad_2 to FMAD pipe_B.
So it takes 2 cycles to issue 2 instructions from the same warp. But as OP mentions there are two warp schedulers, which means this whole process can be done simultaneously for instructions from another warp (assuming there are sufficient functional units). Hence the maximum issue rate is 2 warp instructions per cycle. Note, this is an abstracted view for a programmer's perspective—the actual low-level architectural details may be different.
As for your question about when the warp will be ready next, if there are more instructions that don't depend on any outstanding (already issued but not retired) instructions, then they can be issued in the very next cycle. But as soon as the only available instructions are dependent on in-flight instructions, the warp will not be able to issue. However that is where other warps come in -- the SM can issue instructions for any resident warp that has available (non-blocked) instructions. This arbitrary switching between warps is what provides the "latency hiding" that GPUs depend on for high throughput.
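A toy kernel in which each thread has two independent FMAs of the kind described above. The constants are arbitrary, and whether these actually dual-issue depends on what the compiler emits; only the profiler can confirm it.

    __global__ void two_independent_fmas(const float *x, const float *y, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        // fma_1 and fma_2 have no data dependence on each other, so a CC 2.1
        // warp scheduler could issue them back to back to two FMAD pipes.
        float fma_1 = fmaf(x[i], 2.0f, 1.0f);
        float fma_2 = fmaf(y[i], 3.0f, 1.0f);
        out[i] = fma_1 + fma_2;
    }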

CUDA warps and occupancy

I have always thought that the warp scheduler will execute one warp at a time, depending on which warp is ready, and this warp can be from any one of the thread blocks in the multiprocessor. However, in one of the Nvidia webinar slides, it is stated that "Occupancy = Number of warps running concurrently on a multiprocessor divided by maximum number of warps that can run concurrently". So more than one warp can run at one time? How does this work?
Thank you.
"Running" might be better interpreted as "having state on the SM and/or instructions in the pipeline". The GPU hardware schedules up as many blocks as are available or will fit into the resources of the SM (whichever is smaller), allocates state for every warp they contain (ie. register file and local memory), then starts scheduling the warps for execution. The instruction pipeline seems to be about 21-24 cycles long, and so there are a lot of threads in various stages of "running" at any given time.
The first two generations of CUDA capable GPU (so G80/90 and G200) only retire instructions from a single warp per four clock cycles. Compute 2.0 devices dual-issue instructions from two warps per two clock cycles, so there are two warps retiring instructions per clock. Compute 2.1 extends this by allowing what is effectively out of order execution - still only two warps per clock, but potentially two instructions from the same warp at a time. So the extra 16 cores per SM get used for instruction level parallelism, still issued from the same shared scheduler.

Streaming multiprocessors, Blocks and Threads (CUDA)

What is the relationship between a CUDA core, a streaming multiprocessor and the CUDA model of blocks and threads?
What gets mapped to what and what is parallelized and how? and what is more efficient, maximize the number of blocks or the number of threads?
My current understanding is that there are 8 CUDA cores per multiprocessor, that every CUDA core will be able to execute one CUDA block at a time, and that all the threads in that block are executed serially on that particular core.
Is this correct?
The thread / block layout is described in detail in the CUDA programming guide. In particular, chapter 4 states:
The CUDA architecture is built around a scalable array of multithreaded Streaming Multiprocessors (SMs). When a CUDA program on the host CPU invokes a kernel grid, the blocks of the grid are enumerated and distributed to multiprocessors with available execution capacity. The threads of a thread block execute concurrently on one multiprocessor, and multiple thread blocks can execute concurrently on one multiprocessor. As thread blocks terminate, new blocks are launched on the vacated multiprocessors.
Each SM contains 8 CUDA cores, and at any one time they're executing a single warp of 32 threads - so it takes 4 clock cycles to issue a single instruction for the whole warp. You can assume that threads in any given warp execute in lock-step, but to synchronise across warps, you need to use __syncthreads().
For the GTX 970 there are 13 Streaming Multiprocessors (SM) with 128 Cuda Cores each. Cuda Cores are also called Stream Processors (SP).
You can define grids which map blocks to the GPU.
You can define blocks which map threads to Stream Processors (the 128 CUDA cores per SM).
One warp is always formed by 32 threads and all threads of a warp are executed simultaneously.
To use the full possible power of a GPU you need many more threads per SM than the SM has SPs. For each compute capability there is a certain number of threads which can reside in one SM at a time. All blocks you define are queued and wait for an SM to have the resources (number of SPs free); then the block is loaded onto the SM, and the SM starts to execute warps. Since one warp only has 32 threads and an SM has, for example, 128 SPs, an SM can execute 4 warps at a given time. The catch is that when threads do a memory access, each thread blocks until its memory request is satisfied. In numbers: an arithmetic calculation on the SP has a latency of 18-22 cycles, while a non-cached global memory access can take up to 300-400 cycles. This means that if the threads of one warp are waiting for data, only a subset of the 128 SPs would be working. Therefore the scheduler switches to execute another warp if one is available, and if this warp blocks, it executes the next, and so on. This concept is called latency hiding. The number of warps and the block size determine the occupancy (how many warps the SM can choose from to execute). If the occupancy is high, it is less likely that there is no work for the SPs.
Your statement that each CUDA core will execute one block at a time is wrong. If you talk about streaming multiprocessors, they can execute warps from all threads which reside on the SM. If one block has a size of 256 threads and your GPU allows 2048 threads to be resident per SM, each SM would have 8 blocks residing on it from which the SM can choose warps to execute. All threads of the executed warps are executed in parallel.
You find numbers for the different Compute Capabilities and GPU Architectures here:
https://en.wikipedia.org/wiki/CUDA#Limitations
You can download an occupancy calculation sheet from Nvidia (the CUDA Occupancy Calculator).
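The residency arithmetic above (for example, 2048 resident threads divided by 256-thread blocks giving 8 blocks per SM) can also be checked at run time by querying the device properties. A small sketch that only considers the thread limit; registers and shared memory can lower the real number:

    #include <cstdio>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        int blockThreads = 256;   // example block size from the text
        // Upper bound from the thread limit alone.
        int residentBlocks = prop.maxThreadsPerMultiProcessor / blockThreads;

        printf("%s: %d SMs, up to %d threads per SM -> at most %d blocks of %d threads per SM\n",
               prop.name, prop.multiProcessorCount,
               prop.maxThreadsPerMultiProcessor, residentBlocks, blockThreads);
        return 0;
    }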
The scheduling details described in the earlier answer apply here as well: the Compute Work Distributor places a thread block (CTA) on an SM only when sufficient resources are available, warps are allocated round robin to the SM sub-partitions and stay there until they complete, and warp and block level resources are released only when all of their threads and outstanding instructions have finished.
In order to maximize performance the developer has to understand the trade off of blocks vs. warps vs. registers/thread.
The term occupancy is the ratio of active warps to maximum warps on an SM. Kepler through Pascal architectures (except GP100) have 4 warp schedulers per SM. The minimal number of warps per SM should at least be equal to the number of warp schedulers. If the architecture has a dependent execution latency of 6 cycles (Maxwell and Pascal), then you would need at least 6 warps per scheduler, which is 24 per SM (24 / 64 = 37.5% occupancy), to cover the latency. If the threads have instruction level parallelism then this could be reduced. Almost all kernels issue variable latency instructions such as memory loads that can take 80-1000 cycles. This requires more active warps per warp scheduler to hide latency. For each kernel there is a trade off point between the number of warps and other resources such as shared memory or registers, so optimizing for 100% occupancy is not advised as some other sacrifice will likely be made. The CUDA profiler can help identify instruction issue rate, occupancy, and stall reasons in order to help the developer find that balance.
The size of a thread block can impact performance. If the kernel has large blocks and uses synchronization barriers, then barrier stalls can become a common stall reason. This can be alleviated by reducing the number of warps per thread block.
There are multiple streaming multiprocessors on one device.
An SM may contain multiple blocks. Each block may contain several threads.
An SM has multiple CUDA cores (as a developer, you should not need to care about this because it is abstracted by the warp), which work on threads. An SM always works on warps of threads (always 32), and a warp only works on threads from the same block.
Both the SM and the block have limits on the number of threads, the number of registers, and shared memory.