I have a basic question for my understanding. I apologize if the answer is covered somewhere in the documentation; I couldn't find anything related to this in the CUDA C Programming Guide.
I have a Fermi architecture GPU, a GeForce GTX 470. It has
14 Streaming Multiprocessors
32 Stream Cores per SM
I want to understand the thread pre-emption mechanism with an example. Suppose I have the simplest possible kernel with a 'printf' statement (printing out the thread id), and I use the following dimensions for the grid and blocks:
dim3 grid, block;
grid.x = 14;
grid.y = 1;
grid.z = 1;
block.x = 32;
block.y = 1;
block.z = 1;
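For concreteness, a minimal sketch of the kind of kernel I mean (the name printTid is just a placeholder; device-side printf requires compute capability 2.0 or higher, which the GTX 470 has):

#include <cstdio>

// Each thread prints its global thread id.
__global__ void printTid()
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    printf("thread %d\n", tid);
}

int main()
{
    dim3 grid, block;
    grid.x = 14;  grid.y = 1;  grid.z = 1;
    block.x = 32; block.y = 1; block.z = 1;

    printTid<<<grid, block>>>();
    cudaDeviceSynchronize();   // flush device-side printf output
    return 0;
}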
So, as I understand it, 14 blocks will be scheduled to the 14 streaming multiprocessors. And as each streaming multiprocessor has 32 cores, each core will execute one kernel (one thread). Is this correct?
If this is correct, then what happens in the following case?
grid.x = 14;
grid.y = 1;
grid.z = 1;
block.x = 64;
block.y = 1;
block.z = 1;
I understand that whatever number of blocks I assign to the grid, they will be scheduled without any particular sequence or predictability. That is because, as soon as a resource bottleneck is encountered, the GPU will schedule those blocks which do not require those resources.
1) Is the same criterion used for scheduling threads?
2) But, like I mentioned, I have only a printf statement and no shared resource usage; what happens in that case? After the first 32 threads are executed, are the remaining 32 threads executed?
3) If I also have a y-dimension in the block, what is the sequence then? Are the first 32 threads in the x-dimension done for all y indices first, and then the rest?
Can someone please comment on this?
So, as I understand it, 14 blocks will be scheduled to the 14 streaming multiprocessors.
Not necessarily. A single block with 32 threads is not enough to saturate an SM, so multiple blocks may be scheduled on a single SM while some go unused. As you increase the number of blocks, you will get to a point where they get evenly distributed over all SMs.
And as each multiprocessor has 32 cores, each core will execute one kernel (one thread).
The CUDA cores are heavily pipelined, so each core processes many threads at the same time; each thread is in a different stage of the pipeline. There are also different types of execution resources, in varying numbers.
Taking a closer look at the Fermi SM (see below), you see the 32 CUDA Cores (marketing speak for ALUs), each of which can hold around 20 threads in their pipelines. But there are only 16 LD/ST (Load/Store) units and only 4 SFU (Special Function) units. So, when a warp gets to an instruction that is not supported by the ALUs, the warp will be scheduled multiple times. For instance, if the instruction requires the SFU units, the warp will be scheduled 8 (32 / 4) times.
I understand that whatever number of blocks I assign to the grid, they will be scheduled without any particular sequence or predictability. That is because, as soon as a resource bottleneck is encountered, the GPU will schedule those blocks which do not require those resources.
1) Is the same criterion used for scheduling threads?
Because the CUDA architecture guarantees that all threads in a block will have access to the same shared memory, a block can never move between SMs. When the first warp for a block has been scheduled on a given SM, all other warps in that block will be run on that same SM regardless of resources becoming available on other SMs.
2) But, like I mentioned, I have only a printf statement and no shared resource usage; what happens in that case? After the first 32 threads are executed, are the remaining 32 threads executed?
Think of blocks as sets of warps that are guaranteed to run on the same SM. So, in your example, the 64 threads (2 warps) of each block will be executed on the same SM. On the first clock, the first instruction of one warp is scheduled. On the second clock, that instruction has moved one step into the pipelines, so the resource that was used is free to accept either the second instruction from the same warp or the first instruction from the second warp. Since there are around 20 stages in the ALU pipelines on Fermi, 2 warps will not contain enough thread-level parallelism to fill all the stages of the pipeline, and they will probably not contain enough ILP to make up the difference.
3) If I also have a y-dimension in the block, what is the sequence then? Are the first 32 threads in the x-dimension done for all y indices first, and then the rest?
The dimensions are only there to enable offloading the generation of 2D and 3D thread indexes to dedicated hardware. The schedulers see the blocks as a 1D array of warps. The order in which they search for eligible warps is undefined. The scheduler will search in a fairly small set of "active" warps for a warp whose current instruction needs a resource that is currently open. When a warp is complete, a new one will be added to the active set. So, the order in which the warps are completed becomes unpredictable.
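As a sketch of that linearization (the function name is illustrative; the ordering, x fastest, then y, then z, matches the warp-partitioning rule in the programming guide):

// Device-side helper: compute which warp and lane a thread belongs to
// within its block.
__device__ void warpAndLane(int *warpId, int *laneId)
{
    int linearTid = threadIdx.x
                  + threadIdx.y * blockDim.x
                  + threadIdx.z * blockDim.x * blockDim.y;
    *warpId = linearTid / warpSize;   // index into the 1D array of warps
    *laneId = linearTid % warpSize;   // position within that warp
}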
Fermi SM: (block diagram of the Fermi streaming multiprocessor referenced above)
My questions are about warps and scheduling. I'm using NVIDIA Fermi terminology here. My observations are below; are they correct?
A. Threads in the same warp execute the same instruction. Each warp includes 32 threads.
According to the Fermi Whitepaper:
"Fermi’s dual warp scheduler selects two warps, and issues one
instruction from each warp to a group of sixteen cores, sixteen load/store units, or four SFUs. "
From here, I think a warp (32 threads) is scheduled twice, since 16 cores out of 32 are grouped together. Each scheduler issues half of a warp to 16 cores in a cycle, and in all, the two schedulers issue two warp-halves into two 16-core scheduling groups per cycle. In other words, one warp needs to be scheduled twice, half by half, on this Fermi architecture. If a warp contains only SFU operations, then this warp needs to be issued 8 times (32/4), since there are only 4 SFUs in an SM.
B. When a large number of threads (say a 1-D array of 320 threads) is launched, consecutive threads will be grouped into 10 warps automatically, each with 32 threads. Therefore, if all threads are doing the same work, they will execute exactly the same instruction. Then all warps are always carrying the same instruction in this case.
Questions:
Q1. Which part handles the grouping of threads into warps? Software or hardware? If hardware, is it the warp scheduler? And how is the hardware warp scheduler implemented, and how does it work?
Q2. If I have 64 threads, with threads 0-15 and 32-47 executing the same instruction while 16-31 and 48-63 execute another instruction, is the scheduler smart enough to group nonconsecutive threads (with the same instruction) into the same warp (i.e., to group threads 0-15 and 32-47 into one warp, and threads 16-31 and 48-63 into another warp)?
Q3. What's the point of having a warp size (32) larger than the scheduling group size (16 cores)? (This is a hardware question.) Since in this case (Fermi) a warp will be scheduled twice (in two cycles) anyway, if a warp were 16 wide, simply two warps would be scheduled (also in two cycles), which seems the same as the previous case. I wonder whether this organization is due to performance concerns.
What I can imagine now is: threads in the same warp can be guaranteed to be synchronized, which can be useful sometimes, or other resources such as registers and memory are organized on a warp-size basis. I'm not sure whether this is correct.
Correcting some misconceptions:
A. ...From here, I think a warp (32 threads) is scheduled twice, since 16 cores out of 32 are grouped together.
When the warp instruction is issued to a group of 16 cores, the entire warp executes the instruction, because the cores are clocked twice (Fermi's "hotclock") so that each core actually executes two threads' worth of computation in a single cycle (= 2 hotclocks). When a warp instruction is dispatched, the entire warp gets serviced. It does not need to be scheduled twice.
B. ...Therefore, if all threads are doing the same work, they will execute exactly the same instruction. Then all warps are always carrying the same instruction in this case.
It's true that all threads in a block (and therefore all warps) are executing from the same instruction stream, but they are not necessarily executing the same instruction. Certainly all threads in a warp are executing the same instruction at any given time. But warps execute independently from each other and so different warps within a block may be executing different instructions from the stream, at any given time. The diagram on page 10 of the Fermi whitepaper makes this evident.
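A small sketch of why this matters (the name and the 64-thread block size are mine): because the two warps of a block progress independently, a barrier is needed before one warp reads what the other wrote.

__global__ void staggeredWarps(int *out)          // launch with 64 threads per block
{
    __shared__ int buf[64];

    buf[threadIdx.x] = threadIdx.x;   // each warp fills its own 32 slots

    __syncthreads();                  // without this, one warp could read the
                                      // other warp's slots before they are written

    out[threadIdx.x] = buf[(threadIdx.x + 32) % 64];   // read the other warp's slot
}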
Q1: Which part handles the grouping of threads into warps? Software or hardware?
It is done by hardware, as explained in the hardware implementation section of the programming guide: "The way a block is partitioned into warps is always the same; each warp contains threads of consecutive, increasing thread IDs with the first warp containing thread 0. Thread Hierarchy describes how thread IDs relate to thread indices in the block. "
and how is the hardware warp scheduler implemented, and how does it work?
I don't believe this is formally documented anywhere. Greg Smith has provided various explanations about it, and you may wish to search on "user:124092 scheduler" or a similar search, to read some of his comments.
Q2. If I have 64 threads, with threads 0-15 and 32-47 executing the same instruction while 16-31 and 48-63 execute another instruction, is the scheduler smart enough to group nonconsecutive threads (with the same instruction) into the same warp (i.e., to group threads 0-15 and 32-47 into one warp, and threads 16-31 and 48-63 into another warp)?
This question is predicated on misconceptions outlined earlier. The grouping of threads into a warp is not dynamic; it is fixed at threadblock launch time, and it follows the methodology described above in the answer to Q1. Furthermore, threads 0-15 will never be scheduled with any threads other than 16-31, as 0-31 comprise a warp, which is indivisible for scheduling purposes, on Fermi.
Q3. What's the point of having a warp size (32) larger than the scheduling group size (16 cores)?
Again, I believe this question is predicated on previous misconceptions. The hardware units used to provide resources for a warp may exist in 16 units (or some other number) at some functional level, but from an operational level, the warp is scheduled as 32 threads, and each instruction is scheduled for the entire warp, and executed together, within some number of Fermi hotclocks.
As far as I know:
Q1 - Scheduling is done at the hardware level; warps are the scheduling units. Warps, their constituent lanes (a lane id is the hardware equivalent of the thread index within a warp), SMs, and the other components at this level are all hardware units which are abstracted away and programmed via the CUDA programming model.
Q2 - It also depends on the grid: if you launch two blocks containing a single thread each, you end up with two warps, each of which contains only one active thread. As I said, all scheduling and execution is done on a warp basis, and the more warps the hardware has, the more it can schedule (although they may contain dummy NOP threads) to try to hide latency and reduce instruction pipeline stalls.
Q3 - Once resources are allocated, threads are always divided into 32-thread warps. On Fermi, the warp schedulers pick two warps per cycle and dispatch them to execution units. On pre-Fermi architectures SMs had fewer than 32 thread processors; Fermi has 32. However, a full memory request can only retrieve 128 bytes at a time, so for data sizes larger than 32 bits per thread per transaction the memory controller may still break the request down to half-warp size (https://stackoverflow.com/a/14927626/1938163). Besides,
The SM schedules threads in groups of 32 parallel threads called warps. Each SM features two warp schedulers and two instruction dispatch units, allowing two warps to be issued and executed concurrently. Fermi’s dual warp scheduler selects two warps, and issues one instruction from each warp to a group of sixteen cores, sixteen load/store units, or four SFUs.
you don't have a "scheduling group size" at thread-level as you wrote, but if you re-read the above statement you'll have that 16 cores (or 16 load/store units or 4 SFUs) are readied with one instruction from a 32-thread warp each. If you were asking "why 16?" well.. that's another architectural story... and I suspect it's a carefully designed tradeoff. I'm sorry but I don't know more about this.
I have a GeForce GTX 460 SE, so it has 6 SMs x 48 CUDA cores = 288 CUDA cores.
It is known that one warp contains 32 threads, and that only one warp can be executed in a block at a time.
That is, can a single multiprocessor (SM) simultaneously execute only one block, one warp, and only 32 threads, even if there are 48 cores available?
In addition, to address a specific thread and block, threadIdx.x and blockIdx.x can be used. To allocate them, use kernel<<<Blocks, Threads>>>().
But how can I allocate a specific number of warps and distribute them? And if that is not possible, then why bother knowing about warps?
The situation is quite a bit more complicated than what you describe.
The ALUs (cores), load/store (LD/ST) units and Special Function Units (SFU) (green in the image) are pipelined units. They hold many computations or operations in flight at the same time, in various stages of completion. So, in one cycle they can accept a new operation and provide the results of another operation that was started a long time ago (around 20 cycles for the ALUs, if I remember correctly). So, in theory, a single SM has resources for keeping 48 cores * 20 stages = 960 ALU operations in flight at the same time, which is 960 / 32 threads per warp = 30 warps. In addition, it can process LD/ST operations and SFU operations at whatever their latency and throughput are.
The warp schedulers (yellow in the image) can schedule 2 warps * 32 threads per warp = 64 threads to the pipelines per cycle, so that's the number of results that can be obtained per clock. Given that there is a mix of computing resources (48 cores, 16 LD/ST, 8 SFU), each of which has a different latency, a mix of warps is being processed at the same time. At any given cycle, the warp schedulers try to "pair up" two warps to schedule, to maximize the utilization of the SM.
The warp schedulers can issue warps either from different blocks, or from different places in the same block, if the instructions are independent. So, warps from multiple blocks can be processed at the same time.
Adding to the complexity, warps that are executing instructions for which there are fewer than 32 resources must be issued multiple times for all the threads to be serviced. For instance, there are 8 SFUs, so a warp containing an instruction that requires the SFUs must be scheduled 4 times.
This description is simplified. There are other restrictions that come into play as well that determine how the GPU schedules the work. You can find more information by searching the web for "fermi architecture".
So, coming to your actual question,
why bother to know about Warps?
Knowing the number of threads in a warp and taking it into consideration becomes important when you try to maximize the performance of your algorithm. If you don't follow these rules, you lose performance:
In the kernel invocation, <<<Blocks, Threads>>>, try to choose a number of threads that is a multiple of the number of threads in a warp. If you don't, you end up launching blocks that contain inactive threads.
In your kernel, try to have each thread in a warp follow the same code path. If you don't, you get what's called warp divergence. This happens because the GPU has to run the entire warp through each of the divergent code paths.
In your kernel, try to have each thread in a warp load and store data in specific patterns. For instance, have the threads in a warp access consecutive 32-bit words in global memory, as in the sketch below.
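A minimal sketch of the third rule (the kernel name is mine): thread i of each warp touches the i-th consecutive 32-bit word, so a warp's 32 loads fall into a single 128-byte transaction (assuming the pointers are suitably aligned, as cudaMalloc guarantees).

__global__ void coalescedCopy(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];   // consecutive threads -> consecutive addresses
}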
Are threads grouped into Warps necessarily in order, 1 - 32, 33 - 64 ...?
Yes, the programming model guarantees that the threads are grouped into warps in that specific order.
As a simple example of optimizing away divergent code paths, can the threads in the block be separated into groups of 32 threads? For example: switch (threadIdx.x / 32) { case 0: /* warp 1 */ break; case 1: /* warp 2 */ break; /* etc. */ }
Exactly :)
How many bytes must be read at one time for a single warp: 4 bytes * 32 threads, 8 bytes * 32 threads, or 16 bytes * 32 threads? As far as I know, one transaction to global memory receives 128 bytes at a time.
Yes, transactions to global memory are 128 bytes. So, if each thread reads a 32-bit word from consecutive addresses (they probably need to be 128-byte aligned as well), all the threads in the warp can be serviced with a single transaction (4 bytes * 32 threads = 128 bytes). If each thread reads more bytes, or if the addresses are not consecutive, more transactions need to be issued (with separate transactions for each separate 128-byte line that is touched).
This is described in the CUDA Programming Manual 4.2, section F.4.2, "Global Memory". There's also a blurb in there saying that the situation is different with data that is cached only in L2, as the L2 cache has 32-byte cache lines. I don't know how to arrange for data to be cached only in L2 or how many transactions one ends up with.
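A sketch of the two cases discussed above (the kernel name and parameters are mine): with stride == 1 a warp reads 4 bytes * 32 threads = 128 contiguous bytes, one transaction; with stride == 32 every thread of the warp lands on a different 128-byte line, so up to 32 transactions are issued per warp.

__global__ void stridedRead(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i * stride];   // in must hold at least n * stride elements
}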
I've got a few questions regarding CUDA's scheduling system.
A. When I use, for example, the foo<<<255, 255>>>() function, what actually happens inside the card? I know that each SM receives a block to schedule from the upper level, and that each SM is responsible for scheduling its incoming block, but which part does it? If, for example, I've got 8 SMs, each of which contains 8 small CPUs, is the upper level responsible for scheduling the remaining 255*255 - (8 * 8) threads?
B. What's the maximum number of threads that one can define? I mean foo<<<X, Y>>>(); X, Y = ?
C. Regarding the last example, how many threads can be inside one block? Can we say that the more blocks/threads we have, the faster the execution will be?
Thanks for your help
A. The compute work distributor will distribute a block from the grid to an SM. The SM will convert the block into warps (WARP_SIZE = 32 on all NVIDIA GPUs). On Fermi 2.0 GPUs, each SM has two warp schedulers which share a set of data paths. Every cycle, each warp scheduler picks a warp and issues an instruction to one of the data paths (please don't think in terms of CUDA cores). On Fermi 2.1 GPUs, each warp scheduler has independent data paths as well as a set of shared data paths. On 2.1, every cycle each warp scheduler will pick a warp and attempt to dual-issue instructions for that warp.
The warp schedulers attempt to optimize the use of the data paths. This means that a single warp may execute multiple instructions in back-to-back cycles, or the warp scheduler may choose to issue from a different warp every cycle.
The number of warps/threads that each SM can handle is specified in the CUDA Programming Guide v.4.2 Table F-1. This scales from 768 threads to 2048 threads (24-64 warps).
B. The maximum threads per launch is defined by the maximum GridDims * the maximum threads per block. See Table F-1 or refer to the documentation for cudaGetDeviceProperties.
C. See the same resources as (B). The optimum distribution of threads per block is defined by how you partition your algorithm and is influenced by the occupancy calculation. There are observable performance impacts based on the problem set size, the number of warps on the SM, and the amount of time blocked at instruction barriers (among other things). For starters, I recommend at least 2 blocks per SM and ~50% occupancy.
B. It depends on your device. You can use the CUDA function cudaGetDeviceProperties to see the specifications of your device. A common maximum is Y = 1024 threads per block and X = 65535 blocks per grid dimension.
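A minimal sketch of such a query (device 0 assumed; the fields shown are standard members of cudaDeviceProp):

#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0

    printf("max threads per block : %d\n", prop.maxThreadsPerBlock);
    printf("max grid size (x)     : %d\n", prop.maxGridSize[0]);
    printf("warp size             : %d\n", prop.warpSize);
    printf("multiprocessors       : %d\n", prop.multiProcessorCount);
    return 0;
}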
C. A common practice is to have a power of 2 (128, 256, 512, etc.) threads per block. Reducing large arrays is very effective that way (see Reduction, and the sketch below). The optimum distribution of blocks and threads actually depends on your application and your hardware. I personally use 512 threads/block for large sparse linear algebra computations on a Tesla M2050, since it's the most efficient for my applications.
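For illustration, a basic shared-memory block reduction, which is where the power-of-two block size pays off (a sketch, not tuned; a second pass or a host-side sum over blockSums finishes the job):

__global__ void blockSum(const float *in, float *blockSums, int n)
{
    extern __shared__ float sdata[];   // one float per thread, sized at launch

    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // halve the number of active threads each step (blockDim.x must be a power of 2)
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        blockSums[blockIdx.x] = sdata[0];
}

A launch with 512 threads per block would then look like blockSum<<<numBlocks, 512, 512 * sizeof(float)>>>(d_in, d_sums, n);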
I have been using CUDA for a few weeks, but I have some doubts about the allocation of blocks/warps/threads.
I am studying the architecture from a didactic point of view (university project), so reaching peak performance is not my concern.
First of all, I would like to understand if I got these facts straight:
The programmer writes a kernel, and organizes its execution in a grid of thread blocks.
Each block is assigned to a Streaming Multiprocessor (SM). Once assigned it cannot migrate to another SM.
Each SM splits its blocks into warps (currently with a maximum size of 32 threads). All the threads in a warp execute concurrently on the resources of the SM.
The actual execution of a thread is performed by the CUDA Cores contained in the SM. There is no specific mapping between threads and cores.
If a warp contains 20 threads, but currently there are only 16 cores available, the warp will not run.
On the other hand if a block contains 48 threads, it will be split into 2 warps and they will execute in parallel provided that enough memory is available.
If a thread starts on a core and is then stalled waiting for a memory access or for a long floating point operation, its execution could resume on a different core.
Are they correct?
Now, I have a GeForce 560 Ti so according to the specifications it is equipped with 8 SM, each containing 48 CUDA cores (384 cores in total).
My goal is to make sure that every core of the architecture executes the SAME instructions. Assuming that my code will not require more registers than are available in each SM, I imagined different approaches:
I create 8 blocks of 48 threads each, so that each SM has 1 block to execute. In this case will the 48 threads execute in parallel in the SM (exploiting all the 48 cores available for them)?
Is there any difference if I launch 64 blocks of 6 threads? (Assuming that they will be mapped evenly among the SMs)
If I "submerge" the GPU in scheduled work (creating 1024 blocks of 1024 thread each, for example) is it reasonable to assume that all the cores will be used at a certain point, and will perform the same computations (assuming that the threads never stall)?
Is there any way to check these situations using the profiler?
Is there any reference for this stuff? I read the CUDA Programming guide and the chapters dedicated to hardware architecture in "Programming Massively Parallel Processors" and "CUDA Application design and development"; but I could not get a precise answer.
Two of the best references are
NVIDIA Fermi Compute Architecture Whitepaper
GF104 Reviews
I'll try to answer each of your questions.
The programmer divides work into threads, threads into thread blocks, and thread blocks into grids. The compute work distributor allocates thread blocks to Streaming Multiprocessors (SMs). Once a thread block is distributed to a SM the resources for the thread block are allocated (warps and shared memory) and threads are divided into groups of 32 threads called warps. Once a warp is allocated it is called an active warp. The two warp schedulers pick two active warps per cycle and dispatch warps to execution units. For more details on execution units and instruction dispatch see 1 p.7-10 and 2.
4'. There is a mapping between a lane id (a thread's index within a warp) and a core.
5'. If a warp contains fewer than 32 threads, it will in most cases be executed the same as if it had 32 threads. Warps can have fewer than 32 active threads for several reasons: the number of threads per block is not divisible by 32, the program executes a divergent branch so threads that did not take the current path are marked inactive, or a thread in the warp has exited.
6'. A thread block will be divided into
WarpsPerBlock = (ThreadsPerBlock + WarpSize - 1) / WarpSize
There is no requirement for the warp schedulers to select two warps from the same thread block.
7'. An execution unit will not stall on a memory operation. If a resource is not available when an instruction is ready to be dispatched the instruction will be dispatched again in the future when the resource is available. Warps can stall at barriers, on memory operations, texture operations, data dependencies, ... A stalled warp is ineligible to be selected by the warp scheduler. On Fermi it is useful to have at least 2 eligible warps per cycle so that the warp scheduler can issue an instruction.
See reference 2 for differences between a GTX480 and GTX560.
If you read the reference material (few minutes) I think you will find that your goal does not make sense. I'll try to respond to your points.
1'. If you launch kernel<<<8, 48>>> you will get 8 blocks each with 2 warps of 32 and 16 threads. There is no guarantee that these 8 blocks will be assigned to different SMs. If 2 blocks are allocated to a SM then it is possible that each warp scheduler can select a warp and execute the warp. You will only use 32 of the 48 cores.
2'. There is a big difference between 8 blocks of 48 threads and 64 blocks of 6 threads. Let's assume that your kernel has no divergence and each thread executes 10 instructions.
8 blocks with 48 threads = 16 warps * 10 instructions = 160 instructions
64 blocks with 6 threads = 64 warps * 10 instructions = 640 instructions
In order to get optimal efficiency the division of work should be in multiples of 32 threads. The hardware will not coalesce threads from different warps.
3'. A GTX560 can have 8 SMs * 8 blocks = 64 blocks at a time, or 8 SMs * 48 warps = 512 warps, if the kernel does not max out registers or shared memory. At any given time, only a portion of the work will be active on the SMs. Each SM has multiple execution units (more than just CUDA cores). Which resources are in use at any given time depends on the warp schedulers and the instruction mix of the application. If you don't do TEX operations then the TEX units will be idle. If you don't do special floating point operations, the SFU units will be idle.
4'. Parallel Nsight and the Visual Profiler show
a. executed IPC
b. issued IPC
c. active warps per active cycle
d. eligible warps per active cycle (Nsight only)
e. warp stall reasons (Nsight only)
f. active threads per instruction executed
The profilers do not show the utilization percentage of any of the execution units. For the GTX560, a rough estimate would be IssuedIPC / MaxIPC.
For MaxIPC assume
GF100 (GTX480) is 2
GF10x (GTX560) is 4, but 3 is a better target.
"E. If a warp contains 20 threads, but currently there are only 16 cores available, the warp will not run."
is incorrect. You are confusing cores in their usual sense (as also used for CPUs), i.e. the number of "multiprocessors" in a GPU, with cores in NVIDIA marketing speak ("our card has thousands of CUDA cores").
A CUDA core (see this SO answer) is a hardware concept and a thread is a software concept. Even with only 16 cores available, you can still run 32 threads; however, you may need 2 clock cycles to run them with only 16 hardware cores.
The CUDA core count represents the total number of single precision floating point or integer thread instructions that can be executed per cycle
The warp scheduler is responsible for finding cores to run instructions on (see this SO answer).
A warp is a logical assembly of 32 threads of execution. To execute a single instruction from a single warp, the warp scheduler must usually schedule 32 execution units (or "cores", although the definition of a "core" is somewhat loose).
A warp itself can only be scheduled on an SM (multiprocessor, or streaming multiprocessor), and can run up to 32 threads at the same time (depending on the number of cores in the SM); it cannot use more than one SM.
The number "48 warps" is the maximum number of active warps (warps which may be chosen to be scheduled for work in the next cycle, at any given cycle) per multiprocessor, on NVIDIA GPUs with Compute Capability 2.x; and this number corresponds to 1536 = 48 x 32 threads.
Answer based on this webinar
Assuming a block has a limit of 512 threads, and say my kernel needs more than 512 threads for execution, how should one design the thread hierarchy for optimal performance?
(case 1)
1st block - 512 threads
2nd block - remaining threads
(case 2) distribute an equal number of threads across a certain number of blocks.
I don't think that it really matters, but it is more important to group the thread blocks logically, so that you are able to use other CUDA optimizations (like memory coalescing).
This link provides some insight into how CUDA will (likely) organize your threads.
A quote from the summary:
To summarize, special parameters at a kernel launch define the dimensions of a grid and its blocks. Unique coordinates in blockId and threadId variables allow threads of a grid to distinguish among them. It is the programmer's responsibility to use these variables in the kernel functions so that the threads can properly identify the portion of the data to process. These variables compel the programmers to organize threads and their data into hierarchical and multi-dimensional organizations.
It is preferable to divide the threads equally into two blocks, in order to maximize the overlap of computation and memory access. When you have, for instance, 256 threads in a block, they do not all compute at the same time; they are scheduled on the SM in warps of 32 threads. When a warp is waiting for data from global memory, another warp is scheduled. If you have a small block of threads, your global memory accesses are a lot more penalizing.
Furthermore, in your example you underuse your GPU. Just remember that a GPU has dozens of multiprocessors (e.g. 30 for the Tesla C1060), and a block is mapped to a multiprocessor. In your case, you will only use 2 multiprocessors.
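As a sketch of how such a split is usually expressed in code (the kernel name, block size, and the doubling operation are placeholders): every block gets the same size, a multiple of the warp size, and the extra threads in the last block are simply masked out by a bounds check.

__global__ void process(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)              // threads past the end of the data do nothing
        data[i] *= 2.0f;
}

void launch(float *d_data, int n)
{
    int threadsPerBlock = 256;                                 // multiple of 32
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // round up
    process<<<blocks, threadsPerBlock>>>(d_data, n);
}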