Hierarchy correspondance between hardware and program in CUDA model - cuda

In my current understanding, the hardware hierarchy of CUDA model is GPU card -> Streaming Multiprocessors (SMs) -> cores, and the program hierarchy is kernel-> grid -> block -> warp -> single thread. I want to know the correspondence between the hardware and program hierarchy. That is, Is a kernel in general composed of several grids? is grid contained in the GPU card or in the SMs? if grid is contained in the GPU card, can the the GPU card contain only one grid or multiple grids? Does block correspond to a SMS? Can a SMs contains only one block or multiple blocks? Can a block span several SMs? Can a core execute only one thread or multiple threads? etc.

A kernel is a function that runs on the GPU.
The grid is all threadblocks associated with a kernel launch. A kernel launch creates a single grid.
A grid can run on the entire GPU devices (all SM's in the GPU).
A grid is composed of threadblocks.
Threadblocks are groups of threads. Threads are grouped into warps (32 threads) for execution purposes, so we can also say threadblocks are groups of warps.
Threadblocks (the warps they contain) execute on an SM. Once a threadblock begins executing on a particular SM, it stays on that SM and will not migrate to another SM.
SMs are composed of cores. Each core executes one thread. The core execution engine may have the ability to handle multiple instructions at a time, so it can actually handle more than one thread, but not from the same warp. This part gets complicated and it's not essentialy to good beginner understanding of how a GPU works, so it's convenient and useful to think of a core only handling one thread at any given instant (instruction cycle).
An SM can handle multiple blocks simultaneously.
Please don't post questions that contain many questions.
Questions on SO should show some research effort.
Good research effort for questions like these would be take take some basic webinars from the nvidia webinar page, which will only require a couple hours of study.
Try these two first:
GPU Computing using CUDA C – An Introduction (2010)
An introduction to the basics of GPU computing using CUDA C. Concepts will be illustrated with walkthroughs of code samples. No prior GPU Computing experience required
GPU Computing using CUDA C – Advanced 1 (2010)
First level optimization techniques such as global memory optimization, and processor utilization. Concepts will be illustrated using real code examples

Related

Understanding Streaming Multiprocessors (SM) and Streaming Processors (SP)

I am trying to understand the basic architecture of a GPU. I have gone through a lot of material including this very good SO answer. But I am still confused not able to get a good picture of it.
My Understanding:
A GPU contains two or more Streaming Multiprocessors (SM) depending upon the compute capablity value.
Each SM consists of Streaming Processors (SP) which are actually responisible for the execution of instructions.
Each block is processed by SP in form of warps (32 threads).
Each block has access to a shared memory. A different block cannot access the data of some other block's shared memory.
Confusion:
In the following image, I am not able to understand which one is the Streaming Multiprocessor (SM) and which one is SP. I think that Multiprocessor-1 respresent a single SM and Processor-1 (upto M) respresent a single SP. But I am not sure about this because I can see that each Processor (in blue color) has been provided a Register but as far as I know, a register is provided to a thread unit.
It would be very helpful to me if you could provide some basic overview w.r.t this image or any other image.
First, some comments on the "My understanding" portion of the question:
The number of SMs depends on GPU model - there are low-end models with just one SM, and high-end ones with as many as 30! Compute capability defines what those SMs are capable of, but not how many SMs there are in a GPU.
Each thread block is assigned to an SM, not SP. There can be multiple thread blocks running on a given SM, subject to its resource limitations.
On to the diagram:
Orange boxes are indeed SMs, just as they are labeled. Each SM has shared memory pool, divided between all thread blocks running on this SM.
Blue boxes are SPs. Since SP is a scalar lane, it runs one thread, and each thread is provided with its own set of registers, again, just like the diagram shows.
Addressing the follow-up question:
Each SM can have multiple resident thread blocks. The maximum number of thread blocks resident on SM is determined by compute capability. Achieved number can be lower than maximum when it is limited by the number of registers or the amount of shared memory consumed by each thread block.
SM will then schedule instruction from all warps resident on it, picking among warps that have instructions ready for execution - and those warps may come from any thread block resident on this SM. You generally want to have many warps resident, so that at any given moment of time SPs can be kept busy running instructions from whatever warps are ready.
Number of cores per SM is not a very useful metric, and you need not think too much about it at this point.

Role of Warps in NVIDIA GPU architecture with OpenCL

I'm studying OpenCL concepts as well as the CUDA architecture for a small project, and there is one thing that is unclear to me: the necessity for Warps.
I know a lot of questions have been asked on this subject, however after having read some articles i still don't get the "meaning" of warps.
As far as I understand (speaking for my GPU card which is a Tesla, but i guess this easily translates to other boards):
A work-item is linked to a CUDA thread, which several of them can be executed by a Streaming Processor (SP). BTW, does a SP treats those WI in parallel?
Work-items are grouped into Work-groups. Work-groups operate on a Stream Multiprocessor and can not migrate. However, work-items in a work-group can collaborate via shared memory (a.k.a local memory). One or more work-groups may be executed by a Stream MultiProcessor. BTW, does a SM treats those WG in parallel?
Work-item are executed in parallel inside a work-group. However, synchronization is NOT guaranteed, that's why you need concurrent programming primitives, such as barriers.
As far as I understand, all of this is rather a logical view than a 'physical', hardware perspective.
If all of the above is correct, can you help me on the following. Is that true to say that:
1 - Warps execute 32 threads or work-items simultaneously. Thus, they will 'consume' parts of a work-group. And that's why in the end you need stuff like memory fences to synchronize work-items in work groups.
2 - The Warp scheduler allocates the registers for the 32 threads of warp when it becomes active.
3 - Also, are executed thread in a warp synchronized at all?
Thanks for any input on Warps, and especially why they are necessary in the CUDA architecture.
My best analogon is that a Warp is the vector that be processed in parallel, not unlike an AVX or SSE vector with an Intel CPU. This makes an SM a 32-length vector processor.
Then, to your questions:
Yes, all 32 elements will be run in parallel. Note that also a GPU puts hyperthreading to the extreme: a workgroup will consist of multiple Warps, which all are run more-or-less in parallel. You will need memory fences to sychronise that all.
Yes, typically all 32 work elements (CUDA: thread) in a Warp will work in parallel. Note that you typically will have multiple regsters per work element.
Not guaranteed, AFAIK.

CUDA blocks & warps - which can run in parallel on a single SM?

Ok I know that related questions have been asked over and over again and I read pretty much everything I found about this, but things are still unclear. Probably also because I found and read things contradicting each other (maybe because, being from different times, they referred to devices with different compute capability, between which there seems to be quite a gap). I am looking to be more efficient, to reduce my execution time and thus I need to know exactly how many threads/warps/blocks can run at once in parallel. Also I was thinking of generalizing this and calculating an optimal number of threads and blocks to pass to my kernel based only on the number of operations I know I have to do (for simpler programs) and the system specs.
I have a GTX 550Ti, btw with compute capability 2.1.
4 SMs x 48 cores = 192 CUDA cores.
Ok so what's unclear to me is:
Can more than 1 block run AT ONCE (in parallel) on a multiprocessor (SM)? I read that up to 8 blocks can be assigned to a SM, but nothing as to how they're ran. From the fact that my max number of threads per SM (1536) is barely larger than my max number of threads per block (1024) I would think that blocks aren't ran in parallel (maybe 1 and a half?). Or at least not if I have a max number of threads on them. Also if I set the number of blocks to, let's say 4 (my number of SMs), will they be sent to a different SM each?
Or I can't really control how all this is distributed on the hardware and then this is a moot point, my execution time will vary based on the whims of my device ...
Secondly, I know that a block will divide it's threads into groups of 32 threads that run in parallel, called warps. Now these warps (presuming they have no relation to each other) can be ran in parallel aswell? Because in the Fermi architecture it states that 2 warps are executed concurrently, sending one instruction from each warp to a group of 16 (?) cores, while somewhere else i read that each core handles a warp, which would explain the 1536 max threads (32*48) but seems a bit much. Can 1 CUDA core handle 32 threads concurrently?
On a simpler note, what I'm asking is: (for ex) if I want to sum 2 vectors in a third one, what length should I give them (nr of operations) and how should I split them in blocks and threads for my device to work concurrently (in parallel) at full capacity (without having idle cores or SMs).
I'm sorry if this was asked before and I didn't get it or didn't see it. Hope you can help me. Thank you!
The distribution and parallel execution of work are determined by the launch configuration and the device. The launch configuration states the grid dimensions, block dimensions, registers per thread, and shared memory per block. Based upon this information and the device you can determine the number of blocks and warps that can execute on the device concurrently. When developing a kernel you usually look at the ratio of warps that can be active on the SM to the maximum number of warps per SM for the device. This is called the theoretical occupancy. The CUDA Occupancy Calculator can be used to investigate different launch configurations.
When a grid is launched the compute work distributor will rasterize the grid and distribute thread blocks to SMs and SM resources will be allocated for the thread block. Multiple thread blocks can execute simultaneously on the SM if the SM has sufficient resources.
In order to launch a warp, the SM assigns the warp to a warp scheduler and allocates registers for the warp. At this point the warp is considered an active warp.
Each warp scheduler manages a set of warps (24 on Fermi, 16 on Kepler). Warps that are not stalled are called eligible warps. On each cycle the warp scheduler picks an eligible warp and issue instruction(s) for the warp to execution units such as int/fp units, double precision floating point units, special function units, branch resolution units, and load store units. The execution units are pipelined allowing many warps to have 1 or more instructions in flight each cycle. Warps can be stalled on instruction fetch, data dependencies, execution dependencies, barriers, etc.
Each kernel has a different optimal launch configuration. Tools such as Nsight Visual Studio Edition and the NVIDIA Visual Profiler can help you tune your launch configuration. I recommend that you try to write your code in a flexible manner so you can try multiple launch configurations. I would start by using a configuration that gives you at least 50% occupancy then try increasing and decreasing the occupancy.
Answers to each Question
Q: Can more than 1 block run AT ONCE (in parallel) on a multiprocessor (SM)?
Yes, the maximum number is based upon the compute capability of the device. See Tabe 10. Technical Specifications per Compute Capability : Maximum number of residents blocks per multiprocessor to determine the value. In general the launch configuration limits the run time value. See the occupancy calculator or one of the NVIDIA analysis tools for more details.
Q:From the fact that my max number of threads per SM (1536) is barely larger than my max number of threads per block (1024) I would think that blocks aren't ran in parallel (maybe 1 and a half?).
The launch configuration determines the number of blocks per SM. The ratio of maximum threads per block to maximum threads per SM is set to allow developer more flexibility in how they partition work.
Q: If I set the number of blocks to, let's say 4 (my number of SMs), will they be sent to a different SM each? Or I can't really control how all this is distributed on the hardware and then this is a moot point, my execution time will vary based on the whims of my device ...
You have limited control of work distribution. You can artificially control this by limiting occupancy by allocating more shared memory but this is an advanced optimization.
Q: Secondly, I know that a block will divide it's threads into groups of 32 threads that run in parallel, called warps. Now these warps (presuming they have no relation to each other) can be ran in parallel as well?
Yes, warps can run in parallel.
Q: Because in the Fermi architecture it states that 2 warps are executed concurrently
Each Fermi SM has 2 warps schedulers. Each warp scheduler can dispatch instruction(s) for 1 warp each cycle. Instruction execution is pipelined so many warps can have 1 or more instructions in flight every cycle.
Q: Sending one instruction from each warp to a group of 16 (?) cores, while somewhere else i read that each core handles a warp, which would explain the 1536 max threads (32x48) but seems a bit much. Can 1 CUDA core handle 32 threads concurrently?
Yes. CUDA cores is the number of integer and floating point execution units. The SM has other types of execution units which I listed above. The GTX550 is a CC 2.1 device. On each cycle a SM has the potential to dispatch at most 4 instructions (128 threads) per cycle. Depending on the definition of execution the total threads in flight per cycle can range from many hundreds to many thousands.
I am looking to be more efficient, to reduce my execution time and thus I need to know exactly how many threads/warps/blocks can run at once in parallel.
In short, the number of threads/warps/blocks that can run concurrently depends on several factors. The CUDA C Best Practices Guide has a writeup on Execution Configuration Optimizations that explains these factors and provides some tips for reasoning about how to shape your application.
One of the concepts that took a whle to sink in, for me, is the efficiency of the hardware support for context-switching on the CUDA chip.
Consequently, a context-switch occurs on every memory access, allowing calculations to proceed for many contexts alternately while the others wait on theri memory accesses. ne of the ways that GPGPU architectures achieve performance is the ability to parallelize this way, in addition to parallelizing on the multiples cores.
Best performance is achieved when no core is ever waiting on a memory access, and is achieved by having just enough contexts to ensure this happens.

Working of CUDA scheduler

How do I know the behavior of CUDA scheduler? Apart from testing it by varying the grid sizes, block sizes etc. in my application is there any vendor provided documentation that explains exactly in what fashion the blocks are distributed?
It depends on the architecture you are working on.
On the Fermi architecture, for example, you have a GigaThread global scheduler that distributes thread blocks to the Streaming Multiprocessors (SM) schedulers. For each SM, a Dual Warp scheduler schedules threads in groups of 32 parallel threads called warps.
This is well explained in the NVIDIA White Paper on Fermi. I suggest also to take a look at this other document.

blocks, threads, warpSize

There has been much discussion about how to choose the #blocks & blockSize, but I still missing something. Many of my concerns address this question: How CUDA Blocks/Warps/Threads map onto CUDA Cores? (To simplify the discussion, there is enough perThread & perBlock memory. Memory limits are not an issue here.)
kernelA<<<nBlocks, nThreads>>>(varA,constB, nThreadsTotal);
1) To keep the SM as busy as possible, I should set nThreads to a multiple of warpSize. True?
2) An SM can only execute one kernel at a time. That is all HWcores of that SM are executing only kernelA. (Not some HWcores running kernelA, while others run kernelB.) So if I have only one thread to run, I'm "wasting" the other HWcores. True?
3)If the warp-scheduler issues work in units of warpSize (32 threads), and each SM has 32 HWcores, then the SM would be full utilized. What happens when the SM has 48 HWcores? How can I keep all 48 cores full utilized when the scheduler is issuing work in chunks of 32? (If the previous paragraph is true, wouldn't it be better if the scheduler issued work in units of HWcore size?)
4) It looks like the warp-scheduler queues up 2 tasks at a time. So that when the currently-executing kernel stalls or blocks, the 2nd kernel is swapped in. (It is not clear, but I'll guess the queue here is more than 2 kernels deep.) Is this correct?
5) If my HW has an upper limit of 512 threads-per-block (nThreadsMax), that doesn't mean the kernel with 512 threads will run fastest on one block. (Again, mem not an issue.) There is a good chance I'll get better performance if I spread the 512-thread kernel across many blocks, not just one. The block is executed on one or many SM's. True?
5a) I'm thinking the smaller the better, but does it matter how small I make nBlocks? The question is, how to choose the value of nBlocks that is decent? (Not necessarily optimal.) Is there a mathematical approach to choosing nBlocks, or is it simply trial-n-err.
1) Yes.
2) CC 2.0 - 3.0 devices can execute up to 16 grids concurrently. Each SM is limited to 8 blocks so in order to reach full concurrency the device has to have at least 2 SMs.
3) Yes the warp schedulers select and issue warps at at time. Forget the concept of CUDA cores they are irrelevant. In order to hide latency you need to have high instruction level parallelism or a high occupancy. It is recommended to have >25% for CC 1.x and >50% for CC >= 2.0. In general CC 3.0 requires higher occupancy than 2.0 devices due to the doubling of schedulers but only a 33% increase in warps per SM. The Nsight VSE Issue Efficiency experiment is the best way to determine if you had sufficient warps to hide instruction and memory latency. Unfortunately, the Visual Profiler does not have this metric.
4) The warp scheduler algorithm is not documented; however, it does not consider which grid the thread block originated. For CC 2.x and 3.0 devices the CUDA work distributor will distribute all blocks from a grid before distributing blocks from the next grid; however, this is not guaranteed by the programming model.
5) In order to keep the SM busy you have to have sufficient blocks to fill the device. After that you want to make sure you have sufficient warps to reach a reasonable occupancy. There are both pros and cons to using large thread blocks. Large thread blocks in general use less instruction cache and have smaller footprints on cache; however, large thread blocks stall at syncthreads (SM can become less efficient as there are less warps to choose from) and tend to keep instructions executing on similar execution units. I recommend trying 128 or 256 threads per thread block to start. There are good reasons for both larger and smaller thread blocks.
5a) Use the occupancy calculator. Picking too large of a thread block size will often cause you to be limited by registers. Picking too small of a thread block size can find you limited by shared memory or the 8 blocks per SM limit.
Let me try to answer your questions one by one.
That is correct.
What do you mean exactly by "HWcores"? The first part of your statement is correct.
According to the NVIDIA Fermi Compute Architecture Whitepaper: "The SM schedules threads in groups of 32 parallel threads called warps. Each SM features two warp schedulers and two instruction dispatch units, allowing two warps to be issued and executed concurrently. Fermi’s dual warp scheduler selects two warps, and issues one instruction from each warp to a group of sixteen cores, sixteen load/store units, or four SFUs. Because warps execute independently, Fermi’s scheduler does not need to check for dependencies from within the instruction stream".
Furthermore, the NVIDIA Keppler Architecture Whitepaper states: "Kepler’s quad warp scheduler selects four warps, and two independent instructions per warp can be dispatched each cycle."
The "excess" cores are therefore used by scheduling more than one warp at a time.
The warp scheduler schedules warps of the same kernel, not different kernels.
Not quite true: Each block is locked in to a single SM, since that's where its shared memory resides.
That's a tricky issue and depends on how your kernel is implemented. You may want to have a look at the nVidia Webinar Better Performance at Lower Occupancy by Vasily Volkov which explains some of the more important issues. Primarily, though, I would suggest you choose your thread count to improve occupancy, using the CUDA Occupancy Calculator.