What's the mechanism of the warps and the banks in CUDA?

I'm a rookie learning CUDA parallel programming. Right now I'm confused about global memory access on the device, specifically the warp model and coalescing.
There are some points:
It's said that the threads in one block are split into warps. Each warp contains at most 32 threads, which means all the threads of the same warp execute simultaneously on the same processor. So what is the sense of a half-warp?
When it comes to the shared memory of one block, it is split into 16 banks. To avoid bank conflicts, multiple threads can READ from one bank at the same time, but cannot WRITE to the same bank. Is this a correct interpretation?
Thanks in advance!

The principal usage of "half-warp" applied to CUDA processors prior to the Fermi generation (e.g. the "Tesla" or GT200 generation, and the original G80/G92 generation). These GPUs were architected with an SM (streaming multiprocessor -- a HW block inside the GPU) that had fewer than 32 thread processors.
The definition of a warp was still the same, but the actual HW execution took place in "half-warps" at a time. Actually the granular details are more complicated than this, but suffice it to say that the execution model caused memory requests to be issued according to the needs of a half-warp, i.e. 16 threads within the warp. A full warp executing a memory transaction would thus generate a total of 2 requests for that transaction.
Fermi and newer GPUs have at least 32 thread processors per SM. Therefore a memory transaction is immediately visible across a full warp. As a result, memory requests are issued at the per-warp level, rather than per-half-warp. However, a full memory request can only retrieve 128 bytes at a time. Therefore, for data sizes larger than 32 bits per thread per transaction, the memory controller may still break the request down into half-warp-sized pieces.
My view is that, especially for a beginner, it's not necessary to have a detailed understanding of the half-warp. It's generally sufficient to understand that it refers to a group of 16 threads executing together and that it has implications for memory requests.
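As a rough illustration of why the per-warp request behavior matters, here is a minimal sketch (the kernel names are mine, not from the answer) contrasting an access pattern where the threads of a warp read consecutive elements with one where they read widely strided elements; in the first case the warp-level request can be served by a small number of transactions, in the second it cannot.

    // Coalesced: thread k of a warp reads element k, so the 32 threads of a
    // warp touch one contiguous segment of global memory.
    __global__ void coalescedCopy(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];
    }

    // Strided: thread k reads element k * stride, so the threads of a warp
    // land in many different segments and the request is split into many
    // more memory transactions.
    __global__ void stridedCopy(const float *in, float *out, int n, int stride)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i * stride < n)
            out[i] = in[i * stride];
    }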
Shared memory, for example, on Fermi-class GPUs is broken into 32 banks. On previous GPUs it was broken into 16 banks. Bank conflicts occur any time an individual bank is accessed at more than one address by more than one thread in the same memory request (i.e. originating from the same code instruction).
To avoid bank conflicts, the basic strategies are very similar to the strategies for coalescing memory requests, e.g. for global memory. On Fermi and newer GPUs, multiple threads can read the same address without causing a bank conflict; a conflict arises when threads in the same request access different addresses that fall in the same bank. For further understanding of shared memory and how to avoid bank conflicts, I would recommend the NVIDIA webinar on this topic.
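To make the conflict/no-conflict distinction concrete, here is a minimal sketch of the classic padding trick (the kernel name is mine), assuming Fermi-style shared memory with 32 banks of 4-byte words and a 32x32 thread block transposing a tile of a square matrix:

    #define TILE 32

    // With a plain 32x32 float tile, every element of a column falls in the
    // same bank, so the transposed read below (all 32 threads of a warp reading
    // one column) would be a 32-way bank conflict. Padding each row by one
    // float shifts successive rows into different banks and removes it.
    __global__ void transposeTile(const float *in, float *out, int width)
    {
        __shared__ float tile[TILE][TILE + 1];   // +1 column of padding

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced, conflict-free
        __syncthreads();

        // Transposed read: threadIdx.x now walks a column of the original tile,
        // the conflict-prone pattern that the padding fixes.
        int tx = blockIdx.y * TILE + threadIdx.x;
        int ty = blockIdx.x * TILE + threadIdx.y;
        out[ty * width + tx] = tile[threadIdx.x][threadIdx.y];
    }

Removing the "+ 1" and profiling the two variants is an easy way to see the effect the webinar describes.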

Related

Issued load/store instructions for replay

There are two nvprof metrics regarding load/store instructions, ldst_executed and ldst_issued. We know that executed <= issued. I expected that loads/stores that are issued but not executed would be related to branch predication and other incorrect predictions. However, from this (slide 9) document and this topic, instructions that are issued but not executed are related to serialization and replay.
I don't know whether that reason applies to load/store instructions or not. Moreover, I would like to know why such terminology is used for issued-but-not-executed instructions. If there is serialization for any reason, the instructions are executed multiple times, so why are they not counted as executed?
Any explanation for that?
The NVIDIA architecture optimizes memory throughput by issuing one instruction for a group of threads called a warp. If each thread accesses a consecutive data element or the same element, then the access can be performed very efficiently. However, if each thread accesses data in a different cache line or at a different address in the same bank, then there is a conflict and the instruction has to be replayed.
inst_executed is the count of instructions retired.
inst_issued is the count of instructions issued. An instruction may be issued multiple times in the case of a vector memory access, memory address conflict, memory bank conflict, etc. On each issue the thread mask is reduced until all threads have completed.
The distinction is made for two reasons:
1. Retirement of an instruction indicates completion of a data dependency. The data dependency is only resolved 1 time despite possible replays.
2. The ratio between issued and executed is a simple way to show opportunities to save warp scheduler issue cycles.
In the Fermi and Kepler SMs, if a memory conflict was encountered then the instruction was replayed (re-issued) until all threads completed. This was performed by the warp scheduler. These replays consume issue cycles, reducing the ability of the SM to issue instructions to the math pipes. In these SMs, issued > executed indicates an opportunity for optimization, especially if issued IPC is high.
In the Maxwell through Turing SMs, vector accesses, address conflicts, and memory conflicts are replayed by the memory unit (shared memory, L1, etc.) and do not steal warp scheduler issue cycles. In these SMs, issued is very seldom more than a few % above executed.
EXAMPLE: A kernel loads a 32-bit value. All 32 threads in the warp are active and each thread accesses a unique cache line (stride = 128 bytes).
On a Kepler (CC 3.x) SM the instruction is issued 1 time, then replayed 31 additional times, as the Kepler L1 can only perform 1 tag lookup per request.
inst_executed = 1
inst_issued = 32
On Kepler the instruction has to be replayed again for each request that missed in the L1. If all threads miss in the L1 cache then
inst_executed = 1
inst_issued >= 64 (32 requests + 32 replays for the misses)
On the Maxwell through Turing architectures the replay is performed by the SM memory system. The replays can limit memory throughput but will not block the warp scheduler from issuing instructions to the math pipes.
inst_executed = 1
inst_issued = 1
On Maxwell through Turing, Nsight Compute/Perfworks expose throughput counters for each of the memory pipelines, including the number of cycles lost to memory bank conflicts, serialization of atomics, address divergence, etc.
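For reference, a load with the access pattern described in the example might look like the following minimal sketch (the kernel name is mine); profiling it with the inst_executed and inst_issued metrics on a Kepler part should show issued well above executed, in line with the counts above.

    __global__ void strided32bitLoad(const float *in, float *out)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        // Each thread reads one 32-bit value from its own 128-byte cache line
        // (32 floats * 4 bytes = 128-byte stride), so the 32 threads of a warp
        // touch 32 distinct lines and a single executed load gets replayed for
        // each of them on Kepler-class hardware.
        out[tid] = in[tid * 32];
    }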
GPU architecture is based on maximizing throughput rather than minimizing latency. Thus, GPUs (currently) don't really do out-of-order execution or branch prediction. Instead of building a few cores full of complex control logic to make one thread run really fast (like you'd have on a CPU), GPUs rather use those transistors to build more cores to run as many threads as possible in parallel.
As explained on slide 9 of the presentation you linked, executed instructions are the instructions that control flow passes over in your program (basically, the number of lines of assembly code that were run). When you, e.g., execute a global load instruction and the memory request cannot be served immediately (it misses the cache), the GPU will switch to another thread. Once the value is ready in the cache and the GPU switches back to your thread, the load instruction will have to be issued again to complete fetching the value (see also this answer and this thread). When you, e.g., access shared memory and there is a bank conflict, the shared memory access will have to be replayed multiple times for different threads in the warp…
The main reason to differentiate between executed and issued instructions would seem to be that the ratio of the two can serve as a measurement for the amount of overhead your code produces due to instructions that cannot be completed immediately at the time they are executed…

Role of Warps in NVIDIA GPU architecture with OpenCL

I'm studying OpenCL concepts as well as the CUDA architecture for a small project, and there is one thing that is unclear to me: the necessity for Warps.
I know a lot of questions have been asked on this subject, however after having read some articles I still don't get the "meaning" of warps.
As far as I understand (speaking for my GPU card which is a Tesla, but I guess this easily translates to other boards):
A work-item is linked to a CUDA thread, and several of them can be executed by a Streaming Processor (SP). BTW, does an SP treat those work-items in parallel?
Work-items are grouped into work-groups. Work-groups operate on a Streaming Multiprocessor and cannot migrate. However, work-items in a work-group can collaborate via shared memory (a.k.a. local memory). One or more work-groups may be executed by a Streaming Multiprocessor. BTW, does an SM treat those work-groups in parallel?
Work-items are executed in parallel inside a work-group. However, synchronization is NOT guaranteed; that's why you need concurrent programming primitives, such as barriers.
As far as I understand, all of this is rather a logical view than a 'physical', hardware perspective.
If all of the above is correct, can you help me on the following. Is that true to say that:
1 - Warps execute 32 threads or work-items simultaneously. Thus, they will 'consume' parts of a work-group. And that's why in the end you need stuff like memory fences to synchronize work-items in work groups.
2 - The Warp scheduler allocates the registers for the 32 threads of warp when it becomes active.
3 - Also, are the threads executed in a warp synchronized at all?
Thanks for any input on Warps, and especially why they are necessary in the CUDA architecture.
My best analogy is that a warp is the vector that gets processed in parallel, not unlike an AVX or SSE vector on an Intel CPU. This makes an SM a 32-lane vector processor.
Then, to your questions:
Yes, all 32 elements will be run in parallel. Note also that a GPU takes hyperthreading to the extreme: a work-group will consist of multiple warps, which are all run more or less in parallel. You will need memory fences to synchronise all of that.
Yes, typically all 32 work-items (CUDA: threads) in a warp will work in parallel. Note that you will typically have multiple registers per work-item.
Not guaranteed, AFAIK.
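To illustrate point 1 and the "not guaranteed" part of point 3, here is a minimal CUDA sketch (the kernel and buffer names are mine; the OpenCL analogue would use barrier(CLK_LOCAL_MEM_FENCE)). A 128-thread block is executed as four warps, and nothing orders those warps relative to each other, so a block-wide barrier is needed before one warp reads shared memory written by another:

    __global__ void reverseInBlock(const int *in, int *out)
    {
        __shared__ int buf[128];               // assumes blockDim.x == 128, i.e. 4 warps

        int t = threadIdx.x;
        buf[t] = in[blockIdx.x * blockDim.x + t];   // each warp fills its own 32 slots
        __syncthreads();                            // wait for all warps of the block

        // Only after the barrier is it safe to read a slot written by another warp.
        out[blockIdx.x * blockDim.x + t] = buf[127 - t];
    }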

How can concurrent blocks run on a single GPU streaming multiprocessor?

I was studying the CUDA programming structure, and what I understood after studying is that after creating the blocks and threads, each of these blocks is assigned to one of the streaming multiprocessors (e.g. I am using a GeForce 560 Ti which has 14 streaming multiprocessors, so at one time 14 blocks can be assigned across the streaming multiprocessors). But as I am going through several online materials such as this one:
http://moss.csc.ncsu.edu/~mueller/cluster/nvidia/GPU+CUDA.pdf
it has been mentioned that several blocks can be run concurrently on one multiprocessor. I am basically very confused about the execution of the threads and the blocks on the streaming multiprocessors. I know that the assignment of blocks and the execution of threads are absolutely arbitrary, but I would like to know how the mapping of the blocks and the threads actually happens so that concurrent execution can occur.
The Streaming Multiprocessors (SM) can execute more than one block at a time using hardware multithreading, a process akin to Hyper-Threading.
The CUDA C Programming Guide describes this as follows in Section 4.2:
4.2 Hardware Multithreading
The execution context (program counters, registers, etc) for each warp
processed by a multiprocessor is maintained on-chip during the entire
lifetime of the warp. Therefore, switching from one execution context
to another has no cost, and at every instruction issue time, a warp
scheduler selects a warp that has threads ready to execute its next
instruction (the active threads of the warp) and issues the
instruction to those threads.
In particular, each multiprocessor has a set of 32-bit registers that
are partitioned among the warps, and a parallel data cache or shared
memory that is partitioned among the thread blocks.
The number of blocks and warps that can reside and be processed
together on the multiprocessor for a given kernel depends on the
amount of registers and shared memory used by the kernel and the
amount of registers and shared memory available on the multiprocessor.
There are also a maximum number of resident blocks and a maximum
number of resident warps per multiprocessor. These limits as well as the
amount of registers and shared memory available on the multiprocessor
are a function of the compute capability of the device and are given
in Appendix F. If there are not enough registers or shared memory
available per multiprocessor to process at least one block, the kernel
will fail to launch.
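The resource-based limit described in the quote can be queried directly in newer CUDA toolkits (6.5 and later) with the runtime occupancy API. The kernel below is a made-up placeholder, but the call itself is part of the CUDA runtime:

    #include <cstdio>

    __global__ void myKernel(float *data)            // hypothetical kernel
    {
        data[blockIdx.x * blockDim.x + threadIdx.x] *= 2.0f;
    }

    int main()
    {
        // Reports how many blocks of myKernel, at 256 threads per block and no
        // dynamic shared memory, can be resident on one multiprocessor at a time,
        // given its register/shared memory usage and the device's limits.
        int blocksPerSM = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel, 256, 0);
        printf("Resident blocks per SM at 256 threads/block: %d\n", blocksPerSM);
        return 0;
    }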

What is the context switching mechanism in GPU?

As I know, GPUs switch between warps to hide memory latency. But I wonder under which conditions a warp will be switched out. For example, what if a warp performs a load and the data is already there in the cache? Is the warp switched out, or does it continue with the next computation? What happens if there are two consecutive adds?
Thanks
First of all, once a thread block is launched on a multiprocessor (SM), all of its warps are resident until they all exit the kernel. Thus a block is not launched until there are sufficient registers for all warps of the block, and until there is enough free shared memory for the block.
So warps are never "switched out" -- there is no inter-warp context switching in the traditional sense of the word, where a context switch requires saving registers to memory and restoring them.
The SM does, however, choose instructions to issue from among all resident warps. In fact, the SM is more likely to issue two instructions in a row from different warps than from the same warp, no matter what type of instruction they are, regardless of how much ILP (instruction-level parallelism) there is. Not doing so would expose the SM to dependency stalls. Even "fast" instructions like adds have a non-zero latency, because the arithmetic pipeline is multiple cycles long. On Fermi, for example, the hardware can issue 2 or more warp-instructions per cycle (peak), and the arithmetic pipeline latency is ~12 cycles. Therefore you need multiple warps in flight just to hide arithmetic latency, not just memory latency.
In general, the details of warp scheduling are architecture dependent, not publicly documented, and pretty much guaranteed to change over time. The CUDA programming model is independent of the scheduling algorithm, and you should not rely on it in your software.

How much is run concurrently on a GPU given its number of SMs and SPs?

I am having some trouble understanding threads in the NVIDIA GPU architecture with CUDA.
Could anybody please clarify this info:
An 8800 GPU has 16 SMs with 8 SPs each, so we have 128 SPs.
I was viewing Stanford University's video presentation, and it said that every SP is capable of running 96 threads concurrently. Does this mean that an SP can run 96/32 = 3 warps concurrently?
Moreover, since every SP can run 96 threads and we have 8 SPs in every SM, does this mean that every SM can run 96*8 = 768 threads concurrently? But if every SM can run a single block at a time, and the maximum number of threads in a block is 512, then what is the purpose of being able to run 768 threads concurrently while having a max of 512 threads per block?
A more general question is: how are blocks, threads, and warps distributed to SMs and SPs? I read that every SM gets a single block to execute at a time, threads in a block are divided into warps (32 threads), and SPs execute warps.
You should check out the webinars on the NVIDIA website, you can join a live session or view the pre-recorded sessions. Below is a quick overview, but I strongly recommend you watch the webinars, they will really help as you can see the diagrams and have it explained at the same time.
When you execute a function (a kernel) on a GPU it executes as a grid of blocks of threads.
A thread is the finest granularity, each thread has a unique identifier within the block (threadIdx) which is used to select which data to operate on. The thread can have a relatively large number of registers and also has a private area of memory known as local memory which is used for register file spilling and any large automatic variables.
A block is a group of threads which execute together in a batch. The main reason for this level of granularity is that threads within a block can cooperate by communicating using the fast shared memory. Each block has a unique identifier (blockIdx) which, in conjunction with the threadIdx, is used to select data.
A grid is a set of blocks which together execute the GPU operation.
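A minimal sketch of how the two identifiers combine to select data (the kernel and variable names are mine):

    __global__ void scale(float *data, float factor, int n)
    {
        // Global index = block offset within the grid + thread offset within the block.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= factor;
    }

    // Launched as a grid of blocks, e.g. 256 threads per block and enough blocks to cover n:
    // scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);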
That's the logical hierarchy. You really only need to understand the logical hierarchy to implement a function on the GPU; however, to get performance you also need to understand the hardware, which is SMs and SPs.
A GPU is composed of SMs, and each SM contains a number of SPs. Currently there are 8 SPs per SM and between 1 and 30 SMs per GPU, but really the actual number is not a major concern until you're getting really advanced.
The first point to consider for performance is that of warps. A warp is a set of 32 threads: if you have 128 threads in a block (for example) then threads 0-31 will be in one warp, 32-63 in the next, and so on. Warps are very important for a few reasons, the most important being:
Threads within a warp are bound together: if one thread within a warp goes down the 'if' side of an if-else block and the others go down the 'else', then actually all 32 threads will go down both sides (see the sketch after these points). Functionally there is no problem, since those threads which should not have taken the branch are disabled so you will always get the correct result, but if both sides are long then the performance penalty is significant.
Threads within a warp (actually a half-warp, but if you get it right for warps then you're safe on the next generation too) fetch data from the memory together, so if you can ensure that all threads fetch data within the same 'segment' then you will only pay one memory transaction and if they all fetch from random addresses then you will pay 32 memory transactions. See the Advanced CUDA C presentation for details on this, but only when you are ready!
Threads within a warp (again half-warp on current GPUs) access shared memory together and if you're not careful you will have 'bank conflicts' where the threads have to queue up behind each other to access the memories.
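Here is the divergence point in code form, as a minimal sketch (the kernel name is mine): the branch condition splits every warp down the middle, so the hardware runs both sides with half the lanes masked off each time.

    __global__ void divergentScale(const float *in, float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (threadIdx.x % 2 == 0)        // splits every warp: even vs. odd lanes
            out[i] = in[i] * 2.0f;       // odd lanes are masked off here...
        else
            out[i] = in[i] + 1.0f;       // ...and even lanes are masked off here
        // Branching on something that is uniform per warp (e.g. threadIdx.x / 32)
        // would keep all 32 lanes on the same path and avoid the penalty.
    }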
So having understood what a warp is, the final point is how the blocks and grid are mapped onto the GPU.
Each block will start on one SM and will remain there until it has completed. As soon as it has completed it will retire and another block can be launched on the SM. It's this dynamic scheduling that gives GPUs their scalability - if you have one SM then all blocks run on the same SM in one big queue; if you have 30 SMs then the blocks will be scheduled across the SMs dynamically. So you should ensure that when you launch a GPU function your grid is composed of a large number of blocks (at least hundreds) to ensure it scales across any GPU.
The final point to make is that an SM can execute more than one block at any given time. This explains why a SM can handle 768 threads (or more in some GPUs) while a block is only up to 512 threads (currently). Essentially, if the SM has the resources available (registers and shared memory) then it will take on additional blocks (up to 8). The Occupancy Calculator spreadsheet (included with the SDK) will help you determine how many blocks can execute at any moment.
Sorry for the brain dump, watch the webinars - it'll be easier!
It's a little confusing at first, but it helps to know that each SP does something like 4-way SMT - it cycles through 4 threads, issuing one instruction per clock, with a 4-cycle latency on each instruction. So that's how you get 32 threads per warp running on 8 SPs.
Rather than go through all the rest of the stuff with warps, blocks, threads, etc, I'll refer you to the nVidia CUDA Forums, where this kind of question crops up regularly and there are already some good explanations.