Resident warps per SM in the Tegra K1 (GK20a GPU) - CUDA

How many resident warps are present per SM in the Tegra K1 (GK20a GPU)?
As per the documents, I got the following information:
The Tegra K1 has 1 SMX and 192 cores per multiprocessor
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Can anyone specify the maximum number of blocks per SMX?
Is it 32 * 4 = 128 threads (threads per warp * number of warps) running concurrently, given that Kepler allows four warps to be issued and executed concurrently? If not, how many threads run concurrently?
Kindly help me solve and understand this.

Can anyone specify the maximum number of blocks per SMX?
The maximum number of resident blocks per multiprocessor is 16 for Kepler (cc 3.x) devices.
Is it 32 * 4 = 128 threads (threads per warp * number of warps) running concurrently, given that Kepler allows four warps to be issued and executed concurrently? If not, how many threads run concurrently?
There is a difference between what can be issued in a given clock cycle and what may be executing "concurrently".
Since instruction execution is pipelined, multiple instructions from multiple different warps can be executing at any point in the pipeline(s).
Kepler has 4 warp schedulers, each of which can issue up to two instructions from a given warp (4 warps total for the 4 warp schedulers, up to 2 instructions per issue slot, for a maximum of 8 instructions that can be issued per clock cycle).
Up to 64 warps (32 threads per warp x 64 warps = 2048 max threads per multiprocessor) can be resident (i.e. open and schedulable) per multiprocessor. This is also the maximum number that may be currently executing (at various phases of the pipeline) at any given moment.
So, at any given instant, instructions from any of the 64 (maximum) available warps can be in various stages of execution, in the various pipelines for the various functional units in a Kepler multiprocessor.
However, the maximum thread-instruction issue per clock cycle per multiprocessor for Kepler is 4 warp schedulers x (at most) 2 instructions = 8 instructions, i.e. 8 * 32 = 256 thread-instructions. In practice, well-optimized codes don't usually achieve this maximum, but an average of 4-6 instructions per issue slot (i.e. per clock cycle) may be achievable.
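The residency and issue limits above can be tallied in a quick sketch (the per-SMX limits are the Kepler numbers quoted in this answer; the variable names are mine):

```python
# Kepler (cc 3.x) per-SMX limits, as quoted above.
WARP_SIZE = 32
MAX_THREADS_PER_SM = 2048
WARP_SCHEDULERS = 4
MAX_ISSUE_PER_SCHEDULER = 2  # dual issue

# Resident (schedulable) warps per multiprocessor.
max_resident_warps = MAX_THREADS_PER_SM // WARP_SIZE                  # 64
# Warp-instructions issued per clock cycle.
max_issued_instructions = WARP_SCHEDULERS * MAX_ISSUE_PER_SCHEDULER   # 8
# Thread-instructions issued per clock cycle.
max_thread_instructions = max_issued_instructions * WARP_SIZE         # 256

print(max_resident_warps, max_issued_instructions, max_thread_instructions)
```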

Each block deployed for execution to an SM requires certain resources: registers and/or shared memory. Let's imagine the following situation:
each thread of a certain kernel uses 64 32-bit registers (256 B of register memory),
the kernel is launched with blocks of 1024 threads,
so such a block would consume 256 * 1024 B of registers on a particular SM.
I don't know about the Tegra, but on the card I am using now (a GK110 chip), every SM has 65536 32-bit registers (~256 kB) available, so in this scenario all of the registers would be used by a single block deployed to the SM, and the limit of blocks per SM would be 1 in this case...
Shared memory works the same way: in the kernel launch parameters you can define the amount of shared memory used by each launched block, so if you set it to 32 kB, two blocks could be deployed to an SM with 64 kB of shared memory. Worth mentioning is that, as of now, I believe only blocks from the same kernel can be deployed to one SM at the same time.
I am not sure whether there is some blocking factor other than registers or shared memory, but obviously, if the blocking factor for registers is 1 and for shared memory is 2, then the lower number is the limit on the number of blocks per SM.
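The reasoning above can be sketched as a small estimator (the helper name and default GK110-style limits are illustrative; real occupancy also depends on register/shared-memory allocation granularity, which this sketch ignores):

```python
# Rough blocks-per-SM estimate from register and shared-memory pressure.
def blocks_per_sm(regs_per_thread, threads_per_block, smem_per_block,
                  regs_per_sm=65536, smem_per_sm=64 * 1024,
                  max_blocks=16, max_threads=2048):
    # Hard limits: resident blocks and resident threads per SM.
    limits = [max_blocks, max_threads // threads_per_block]
    if regs_per_thread > 0:
        limits.append(regs_per_sm // (regs_per_thread * threads_per_block))
    if smem_per_block > 0:
        limits.append(smem_per_sm // smem_per_block)
    # The tightest blocking factor wins.
    return min(limits)

# Scenario above: 64 registers/thread, 1024-thread blocks -> 1 block/SM.
print(blocks_per_sm(64, 1024, 0))        # 1
# 32 kB shared memory per block, 64 kB per SM -> at most 2 blocks/SM.
print(blocks_per_sm(0, 256, 32 * 1024))  # 2
```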
As for your second question, how many threads can run concurrently, the answer is: as many as there are cores in one SM, so for an SMX of the Kepler architecture it is 192. The number of concurrent warps is then 192 / 32.
If you are interested in this stuff, I advise you to use the Nsight profiling tool, where you can inspect all kernel launches, their blocking factors, and much more useful info.
EDIT:
Reading Robert Crovella's answer, I realized there really are these limits on blocks per SM and threads per SM, but I was never able to reach them, as my kernels typically used too many registers or too much shared memory. Again, these values can be investigated using Nsight, which displays all the useful info about the available CUDA devices; such info can also be found, for the GK110 chip for example, on NVIDIA's pages in the related document.


In CUDA compute capability 3.5+, can all threads (on an SM) really have 255 registers each?

I'm looking at the following maximum values for different CUDA compute capabilities:
Registers per thread
Registers per SM (symmetric multiprocessor)
Threads per SM
as appearing here. Well, it looks like for CUDA compute capability 3.5 and upwards, at least, 1. x 3. > 2., i.e. (registers per thread) x (threads per SM) exceeds (registers per SM). That implies that while a single thread can use up to 255 registers, if too many threads attempt to do so there will be register spill. Is my interpretation correct? Or is figure 1. not really correct, and is it really 64 registers per thread?
Rather than Wikipedia, we can use the documentation provided by NVIDIA to answer these questions.
Table 12 of the programming guide indicates that (for cc 3.5):
The maximum registers per thread is 255
The maximum number of threads per block is 1024
The maximum registers per multiprocessor is 64K (i.e. 65536)
Registers per thread is decided at compile-time, is a specific number, and does not vary at runtime. Likewise, "spilling" as used in this context is a decision made at compile-time.
Therefore, I cannot simultaneously use 255 registers per thread while launching a threadblock of 1024 threads (1024 * 255 = 255K > 64K)
But if I launch a threadblock of 64 threads, I can certainly use up to 255 registers per thread, legally, with a properly launched threadblock.
Therefore, like some other CUDA constraints (such as the individual dimensions of a threadblock and the aggregate number of threads in a threadblock), the individual constraint of registers per thread is one limit, but the maximum number of registers per multiprocessor is another (aggregate) limit, and both must be satisfied, at launch, for a kernel to launch. If there are other threadblocks currently resident, this could impact occupancy. If there are no threadblocks currently resident, and the limits cannot be met, this is a condition that is detectable at launch-time and will be reported as a kernel launch error (too many resources requested for launch).
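The two constraints can be checked with a short sketch (`launch_fits` is a hypothetical helper; real launches are also constrained by shared memory, block dimensions, and other resources not modeled here):

```python
# cc 3.5 register limits quoted above: the per-thread limit and the
# per-multiprocessor aggregate limit must both hold for a launch to succeed.
MAX_REGS_PER_THREAD = 255
REGS_PER_SM = 64 * 1024  # 65536 32-bit registers

def launch_fits(regs_per_thread, threads_per_block):
    if regs_per_thread > MAX_REGS_PER_THREAD:
        return False
    return regs_per_thread * threads_per_block <= REGS_PER_SM

print(launch_fits(255, 1024))  # False: 255 * 1024 = 255K > 64K registers
print(launch_fits(255, 64))    # True:  255 * 64 = ~16K registers
```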

Determining number of warps allowed in CUDA SM

So if a streaming multiprocessor allows a maximum of X threads, while each block in the SM allows Y threads, how many warps can we have in a block, and how many warps can we have in an SM?
Here is my take on this question:
(1) A warp consists of 32 threads. In a block we can have Y/32, right?
(2) As for the number of warps per SM, we cannot exceed X, the maximum number of threads in an SM, so we can have X/32, right? I hope somebody can confirm these calculations.
(1) Yes, rounding up if needed (i.e. if the number of threads Y per block is not evenly divisible by 32).
(2) Yes, that is one limit on the number of warps that may be active. Remember that the SM scheduler works by scheduling blocks first. The number of blocks that will be scheduled is a function of available resources (registers, shared memory, threads, etc.). A block will only be scheduled when there are enough resources available to support its needs. So, for example, if I have 1024 threads per block, I can schedule at most 1 block on an SM, because the limit of 1536 threads per SM (using CC 2.0 as an example here) prevents 2 blocks from being scheduled. In that case, even though your X/32 number predicts a max of 48 warps, only 1024/32 = 32 warps will be scheduled.
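The example above can be sketched as follows (CC 2.0 numbers taken from the answer; the helper name is illustrative):

```python
# Warp counts for the CC 2.0 example above (1536 threads/SM).
WARP_SIZE = 32
MAX_THREADS_PER_SM = 1536

def warps_per_block(threads_per_block):
    # Round up: a partially full warp still occupies a whole warp slot.
    return (threads_per_block + WARP_SIZE - 1) // WARP_SIZE

threads_per_block = 1024
# Only one 1024-thread block fits under the 1536-thread limit.
blocks_per_sm = MAX_THREADS_PER_SM // threads_per_block
scheduled_warps = blocks_per_sm * warps_per_block(threads_per_block)
print(scheduled_warps)  # 32, even though the SM could hold 48 warps
```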

How do CUDA blocks/warps/threads map onto CUDA cores?

I have been using CUDA for a few weeks, but I have some doubts about the allocation of blocks/warps/thread.
I am studying the architecture from a didactic point of view (university project), so reaching peak performance is not my concern.
First of all, I would like to understand if I got these facts straight:
1. The programmer writes a kernel and organizes its execution in a grid of thread blocks.
2. Each block is assigned to a Streaming Multiprocessor (SM). Once assigned, it cannot migrate to another SM.
3. Each SM splits its own blocks into warps (currently with a maximum size of 32 threads). All the threads in a warp execute concurrently on the resources of the SM.
4. The actual execution of a thread is performed by the CUDA cores contained in the SM. There is no specific mapping between threads and cores.
5. If a warp contains 20 threads, but currently there are only 16 cores available, the warp will not run.
6. On the other hand, if a block contains 48 threads, it will be split into 2 warps and they will execute in parallel provided that enough memory is available.
7. If a thread starts on a core and is then stalled by a memory access or a long floating-point operation, its execution could resume on a different core.
Are they correct?
Now, I have a GeForce 560 Ti, so according to the specifications it is equipped with 8 SMs, each containing 48 CUDA cores (384 cores in total).
My goal is to make sure that every core of the architecture executes the SAME instructions. Assuming that my code will not require more registers than the ones available in each SM, I imagined different approaches:
1. I create 8 blocks of 48 threads each, so that each SM has 1 block to execute. In this case, will the 48 threads execute in parallel in the SM (exploiting all the 48 cores available to them)?
2. Is there any difference if I launch 64 blocks of 6 threads? (Assuming that they will be mapped evenly among the SMs.)
3. If I "submerge" the GPU in scheduled work (creating 1024 blocks of 1024 threads each, for example), is it reasonable to assume that all the cores will be used at a certain point, and will perform the same computations (assuming that the threads never stall)?
4. Is there any way to check these situations using the profiler?
Is there any reference for this stuff? I read the CUDA Programming Guide and the chapters dedicated to hardware architecture in "Programming Massively Parallel Processors" and "CUDA Application Design and Development", but I could not get a precise answer.
Two of the best references are
NVIDIA Fermi Compute Architecture Whitepaper
GF104 Reviews
I'll try to answer each of your questions.
The programmer divides work into threads, threads into thread blocks, and thread blocks into grids. The compute work distributor allocates thread blocks to Streaming Multiprocessors (SMs). Once a thread block is distributed to a SM the resources for the thread block are allocated (warps and shared memory) and threads are divided into groups of 32 threads called warps. Once a warp is allocated it is called an active warp. The two warp schedulers pick two active warps per cycle and dispatch warps to execution units. For more details on execution units and instruction dispatch see 1 p.7-10 and 2.
4'. There is a mapping between laneid (a thread's index within a warp) and a core.
5'. If a warp contains fewer than 32 threads, it will in most cases be executed the same as if it had 32 threads. Warps can have fewer than 32 active threads for several reasons: the number of threads per block is not divisible by 32, the program executes a divergent branch so threads that did not take the current path are marked inactive, or a thread in the warp has exited.
6'. A thread block will be divided into
WarpsPerBlock = (ThreadsPerBlock + WarpSize - 1) / WarpSize
warps. There is no requirement for the warp schedulers to select two warps from the same thread block.
7'. An execution unit will not stall on a memory operation. If a resource is not available when an instruction is ready to be dispatched the instruction will be dispatched again in the future when the resource is available. Warps can stall at barriers, on memory operations, texture operations, data dependencies, ... A stalled warp is ineligible to be selected by the warp scheduler. On Fermi it is useful to have at least 2 eligible warps per cycle so that the warp scheduler can issue an instruction.
See reference 2 for differences between a GTX480 and GTX560.
If you read the reference material (a few minutes), I think you will find that your goal does not make sense. I'll try to respond to your points.
1'. If you launch kernel<<<8, 48>>> you will get 8 blocks, each with 2 warps of 32 and 16 threads. There is no guarantee that these 8 blocks will be assigned to different SMs. If 2 blocks are allocated to an SM, then it is possible that each warp scheduler can select a warp and execute it. You will only use 32 of the 48 cores.
2'. There is a big difference between 8 blocks of 48 threads and 64 blocks of 6 threads. Let's assume that your kernel has no divergence and each thread executes 10 instructions.
8 blocks with 48 threads = 16 warps * 10 instructions = 160 instructions
64 blocks with 6 threads = 64 warps * 10 instructions = 640 instructions
In order to get optimal efficiency the division of work should be in multiples of 32 threads. The hardware will not coalesce threads from different warps.
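The comparison above can be reproduced with a short sketch (`total_warp_instructions` is an illustrative helper counting issued warp-instructions, assuming the answer's 10 instructions per thread and no divergence):

```python
WARP_SIZE = 32

def total_warp_instructions(blocks, threads_per_block, instrs_per_thread=10):
    # Each block occupies whole warps, rounding up.
    warps_per_block = (threads_per_block + WARP_SIZE - 1) // WARP_SIZE
    return blocks * warps_per_block * instrs_per_thread

print(total_warp_instructions(8, 48))   # 160: 8 blocks * 2 warps * 10
print(total_warp_instructions(64, 6))   # 640: 64 blocks * 1 warp * 10
```

Same total thread count, but the 6-thread blocks issue four times as many warp-instructions because each tiny block still occupies a whole warp.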
3'. A GTX 560 can have 8 SMs * 8 blocks = 64 blocks at a time, or 8 SMs * 48 warps = 384 warps, if the kernel does not max out registers or shared memory. At any given time only a portion of the work will be active on the SMs. Each SM has multiple execution units (more than just CUDA cores). Which resources are in use at any given time depends on the warp schedulers and the instruction mix of the application. If you don't do TEX operations, the TEX units will be idle. If you don't do special floating-point operations, the SFU units will be idle.
4'. Parallel Nsight and the Visual Profiler show
a. executed IPC
b. issued IPC
c. active warps per active cycle
d. eligible warps per active cycle (Nsight only)
e. warp stall reasons (Nsight only)
f. active threads per instruction executed
The profilers do not show the utilization percentage of any of the execution units. For a GTX 560, a rough estimate would be IssuedIPC / MaxIPC.
For MaxIPC, assume
GF100 (GTX480) is 2
GF10x (GTX560) is 4, but 3 is a better target.
"5. If a warp contains 20 threads, but currently there are only 16 cores available, the warp will not run."
is incorrect. You are confusing cores in their usual sense (also used in CPUs) - the number of "multiprocessors" in a GPU - with cores in NVIDIA marketing speak ("our card has thousands of CUDA cores").
A CUDA core (see this SO answer) is a hardware concept, while a thread is a software concept. Even with only 16 cores available, you can still run 32 threads; however, you may need 2 clock cycles to run them with only 16 hardware cores.
The CUDA core count represents the total number of single-precision floating-point or integer thread instructions that can be executed per cycle.
The warp scheduler is responsible for finding cores to run instructions on (see this SO answer).
A warp is a logical assembly of 32 threads of execution. To execute a single instruction from a single warp, the warp scheduler must usually schedule 32 execution units (or "cores", although the definition of a "core" is somewhat loose).
A warp itself can only be scheduled on a single SM (multiprocessor, or streaming multiprocessor), and can run up to 32 threads at the same time (depending on the number of cores in the SM); it cannot use more than one SM.
The number "48 warps" is the maximum number of active warps (warps that may be chosen to be scheduled for work in the next cycle, at any given cycle) per multiprocessor on NVIDIA GPUs with compute capability 2.x; this number corresponds to 1536 = 48 x 32 threads.
Answer based on this webinar

CUDA: How many concurrent threads in total?

I have a GeForce GTX 580, and I want to make a statement about the total number of threads that can (ideally) actually be run in parallel, to compare with 2 or 4 multi-core CPU's.
deviceQuery gives me the following possibly relevant information:
CUDA Capability Major/Minor version number: 2.0
(16) Multiprocessors x (32) CUDA Cores/MP: 512 CUDA Cores
Maximum number of threads per block: 1024
I think I heard that each CUDA core can run a warp in parallel, and that a warp is 32 threads. Would it be correct to say that the card can run 512*32 = 16384 threads in parallel then, or am I way off and the CUDA cores are somehow not really running in parallel?
The GTX 580 can have 16 * 48 concurrent warps (32 threads each) running at a time. That is 16 multiprocessors (SMs) * 48 resident warps per SM * 32 threads per warp = 24,576 threads.
Don't confuse concurrency and throughput. The number above is the maximum number of threads whose resources can be stored on-chip simultaneously -- the number that can be resident. In CUDA terms we also call this maximum occupancy. The hardware switches between warps constantly to help cover or "hide" the (large) latency of memory accesses as well as the (small) latency of arithmetic pipelines.
While each SM can have 48 resident warps, it can only issue instructions from a small number (on average between 1 and 2 for GTX 580, but it depends on the program instruction mix) of warps at each clock cycle.
So you are probably better off comparing throughput, which is determined by the available execution units and how the hardware is capable of performing multi-issue. On GTX580, there are 512 FMA execution units, but also integer units, special function units, memory instruction units, etc, which can be dual-issued to (i.e. issue independent instructions from 2 warps simultaneously) in various combinations.
Taking into account all of the above is too difficult, though, so most people compare on two metrics:
Peak GFLOP/s (which for GTX 580 is 512 FMA units * 2 flops per FMA * 1544e6 cycles/second = 1581.1 GFLOP/s (single precision))
Measured throughput on the application you are interested in.
The most important comparison is always measured wall-clock time on a real application.
There are certain traps that you can fall into by doing that comparison to 2 or 4-core CPUs:
The number of concurrent threads does not match the number of threads that actually run in parallel. Of course you can launch 24576 threads concurrently on GTX 580 but the optimal value is in most cases lower.
A 2- or 4-core CPU can have arbitrarily many concurrent threads! As with a GPU, beyond some point adding more threads won't help, and may even slow things down.
A "CUDA core" is a single scalar processing unit, while a CPU core is usually a bigger thing, containing for example a 4-wide SIMD unit. To compare apples to apples, you should multiply the number of advertised CPU cores by 4 to match what NVIDIA calls a core.
A CPU may support hyperthreading, which allows a single core to process 2 threads concurrently in a lightweight way. Because of that, an operating system may actually see twice as many "logical cores" as hardware cores.
To sum it up: For a fair comparison, your 4-core CPU can actually run 32 "scalar threads" concurrently, because of SIMD and hyperthreading.
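The tallies in this answer, as a quick sketch (GTX 580 and CPU figures are the ones quoted above):

```python
# Resident GPU threads on a GTX 580 versus "scalar threads" on a
# 4-core CPU with 4-wide SIMD and hyperthreading.
gpu_resident = 16 * 48 * 32           # SMs * resident warps/SM * threads/warp
cpu_scalar = 4 * 4 * 2                # cores * SIMD width * hyperthreads

# Peak single-precision throughput for the GTX 580.
peak_gflops = 512 * 2 * 1544e6 / 1e9  # FMA units * flops/FMA * clock

print(gpu_resident, cpu_scalar, round(peak_gflops, 1))  # 24576 32 1581.1
```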
I realize this is a bit late, but I figured I'd help out anyway. From page 10 of the CUDA Fermi architecture whitepaper:
Each SM features two warp schedulers and two instruction dispatch units, allowing two warps to be issued and executed concurrently.
To me this means that each SM can have 2*32=64 threads running concurrently. I don't know if that means that the GPU can have a total of 16*64=1024 threads running concurrently.