How will the GPU execute when the number of threads per block is larger than the maximum number of active threads on a Streaming Multiprocessor? [duplicate] - cuda

I have a GeForce GTX 580, and I want to make a statement about the total number of threads that can (ideally) actually be run in parallel, to compare with 2 or 4 multi-core CPU's.
deviceQuery gives me the following possibly relevant information:
CUDA Capability Major/Minor version number: 2.0
(16) Multiprocessors x (32) CUDA Cores/MP: 512 CUDA Cores
Maximum number of threads per block: 1024
I think I heard that each CUDA core can run a warp in parallel, and that a warp is 32 threads. Would it be correct to say that the card can run 512*32 = 16384 threads in parallel then, or am I way off and the CUDA cores are somehow not really running in parallel?

The GTX 580 can have 16 * 48 concurrent warps (32 threads each) running at a time. That is 16 multiprocessors (SMs) * 48 resident warps per SM * 32 threads per warp = 24,576 threads.
Don't confuse concurrency and throughput. The number above is the maximum number of threads whose resources can be stored on-chip simultaneously -- the number that can be resident. In CUDA terms we also call this maximum occupancy. The hardware switches between warps constantly to help cover or "hide" the (large) latency of memory accesses as well as the (small) latency of arithmetic pipelines.
While each SM can have 48 resident warps, it can only issue instructions from a small number (on average between 1 and 2 for GTX 580, but it depends on the program instruction mix) of warps at each clock cycle.
So you are probably better off comparing throughput, which is determined by the available execution units and how the hardware is capable of performing multi-issue. On GTX580, there are 512 FMA execution units, but also integer units, special function units, memory instruction units, etc, which can be dual-issued to (i.e. issue independent instructions from 2 warps simultaneously) in various combinations.
Taking into account all of the above is too difficult, though, so most people compare on two metrics:
Peak GFLOP/s (which for GTX 580 is 512 FMA units * 2 flops per FMA * 1544e6 cycles/second = 1581.1 GFLOP/s (single precision))
Measured throughput on the application you are interested in.
The most important comparison is always measured wall-clock time on a real application.
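Not part of the original answer, but for reference, a minimal sketch of how these numbers can be derived at runtime: cudaGetDeviceProperties reports the SM count and the maximum resident threads per SM, while the CUDA-cores-per-SM value (32 for this compute capability 2.0 card) is an assumption, since it is not reported by the runtime.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Maximum number of resident ("concurrent") threads: SMs * max threads per SM.
    int residentThreads = prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor;
    printf("Max resident threads: %d\n", residentThreads);  // 16 * 1536 = 24576 on GTX 580

    // Rough peak single-precision GFLOP/s, assuming 32 FMA-capable cores per SM
    // (true for compute capability 2.0; not a field of cudaDeviceProp).
    const int coresPerSM = 32;
    double peakGflops = prop.multiProcessorCount * coresPerSM * 2.0   // 2 flops per FMA
                        * (prop.clockRate * 1e3) / 1e9;               // clockRate is in kHz
    printf("Approx. peak GFLOP/s: %.1f\n", peakGflops);
    return 0;
}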

There are certain traps that you can fall into by doing that comparison to 2 or 4-core CPUs:
The number of concurrent threads does not match the number of threads that actually run in parallel. Of course you can launch 24576 threads concurrently on GTX 580 but the optimal value is in most cases lower.
A 2- or 4-core CPU can have arbitrarily many concurrent threads! As with the GPU, beyond some point adding more threads won't help, and may even slow things down.
A "CUDA core" is a single scalar processing unit, while a CPU core is usually a bigger thing, containing for example a 4-wide SIMD unit. To compare apples to apples, you should multiply the number of advertised CPU cores by 4 to match what NVIDIA calls a core.
CPUs support hyperthreading, which allows a single core to process 2 threads concurrently in a lightweight way. Because of that, an operating system may actually see 2 times more "logical cores" than hardware cores.
To sum it up: For a fair comparison, your 4-core CPU can actually run 32 "scalar threads" concurrently, because of SIMD and hyperthreading.

I realize this is a bit late but I figured I'd help out anyway. From page 10 of the CUDA Fermi architecture whitepaper:
Each SM features two warp schedulers and two instruction dispatch units, allowing two warps to be issued and executed concurrently.
To me this means that each SM can have 2*32=64 threads running concurrently. I don't know if that means that the GPU can have a total of 16*64=1024 threads running concurrently.

Related

How many threads can a Nvidia GPU launch?

How many threads can a Nvidia GTX 1050 4GB GPU launch?
For example:
kernel<<<1,32>>>(args); can launch 32 threads.
So what is the maximum number of threads possible?
I am aware of this post: how many threads does an Nvidia GTS 450 have
This question may often arise from a misunderstanding of GPU execution behavior. However, the limits that apply to a launch like the one you have given:
For example: kernel<<<1,32>>>(args); can launch 32 threads.
do not vary across GPUs supported by recent CUDA toolkits (i.e. GPUs of compute capability 3.0 through 8.6, which includes your GTX 1050). These limits are given in the documentation as well as via runtime queries such as those demonstrated by the deviceQuery sample code.
These limits are that a threadblock (the second kernel launch configuration parameter) is limited to 1024 threads total, which is the product of the 3 dimensions x, y, z, and each of those dimensions has its own individual limit:
              x         y       z
threadblock   1024      1024    64
grid          2^31-1    65535   65535
and likewise, as indicated above, the grid (the first kernel launch configuration parameter) has individual limits on each dimension, but no limit on the product.
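As a sketch (not from the original answer), the same limits can be read at runtime with cudaGetDeviceProperties, which is essentially what the deviceQuery sample does:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max block dimensions:  %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("Max grid dimensions:   %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}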
The maximum number of threads that can currently be specified in a kernel launch is therefore the product of the grid dimension limits and the threadblock limit:
total = (2^31-1)*65535*65535*1024
That product is 9,444,444,733,164,249,676,800
Note that on most GPUs, a kernel launch this large, even with an empty kernel, will take a very long time to complete. (*)
The documentation covers thread hierarchy as well as how to specify multi-dimensional grids and threadblocks.
(*) For amusement, an empty kernel launch of <<<dim3(1,65535,65535),1024>>> takes about 1 minute to process on a GTX960. So a "maximal" empty kernel launch would take about 2^31 minutes to process (over 4000 years) on that GPU.
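For reference, a minimal sketch (my own, not from the answer) of how such a large empty-kernel launch can be timed with CUDA events; the grid below is the 1 x 65535 x 65535 example from the footnote, and the measured time will of course vary by GPU:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void emptyKernel() {}

int main() {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    emptyKernel<<<dim3(1, 65535, 65535), 1024>>>();   // ~4.3 billion blocks of 1024 threads
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Elapsed: %.1f s\n", ms / 1000.0f);
    return 0;
}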

Resident warps per SM in (GK20a GPU) Tegra K1

How many resident warps are present per SM in the (GK20a GPU) Tegra K1?
As per the documentation I got the following information:
In the Tegra K1 there is 1 SMX and 192 cores/multiprocessor
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Can anyone specify the value of maximum blocks per SMX?
Is 32 * 4 = 128 (number of threads in a warp * number of warps, as Kepler allows four warps to be issued and executed concurrently) the number of threads running concurrently?
If not, how many threads run concurrently?
Kindly help me to solve and understand it.
Can anyone specify the value of maximum blocks per SMX?
The maximum number of resident blocks per multiprocessor is 16 for Kepler (cc 3.x) devices.
Is 32 * 4 = 128 (number of threads in a warp * number of warps, as Kepler allows four warps to be issued and executed concurrently) the number of threads running concurrently? If not, how many threads run concurrently?
There is a difference between what can be issued in a given clock cycle and what may be executing "concurrently".
Since instruction execution is pipelined, multiple instructions from multiple different warps can be executing at any point in the pipeline(s).
Kepler has 4 warp schedulers, each of which can issue up to two instructions from a given warp (4 warps total for the 4 warp schedulers, up to 2 instructions per issue slot, for a maximum of 8 instructions that can be issued per clock cycle).
Up to 64 warps (32 threads per warp x 64 warps = 2048 max threads per multiprocessor) can be resident (i.e. open and schedulable) per multiprocessor. This is also the maximum number that may be currently executing (at various phases of the pipeline) at any given moment.
So, at any given instant, instructions from any of the 64 (maximum) available warps can be in various stages of execution, in the various pipelines for the various functional units in a Kepler multiprocessor.
However, the maximum thread-instruction issue per clock cycle per multiprocessor for Kepler is 4 warp schedulers x (max) 2 instructions = 8 warp-instructions, i.e. 8 * 32 = 256 thread-instructions. In practice, well-optimized codes don't usually achieve this maximum, but an average of 4-6 instructions per issue slot (i.e. per clock cycle) may in practice be achievable.
Each block deployed for execution to an SM requires certain resources, namely registers and shared memory. Let's imagine the following situation:
each thread of a certain kernel uses 64 32-bit registers (256 B of register memory),
the kernel is launched with blocks of 1024 threads,
so such a block would consume 256 * 1024 B = 256 KB of registers on its SM.
I don't know about Tegra, but on the card I am using now (a GK110 chip), every SM has 65536 32-bit registers (~256 KB) available, so in this scenario all of the registers would be used up by a single block deployed to that SM, and the limit of blocks per SM would be 1 in this case...
The example with shared memory works the same way: in the kernel launch parameters you can define the amount of shared memory used by each launched block, so if you set it to 32 KB, then two blocks could be deployed to an SM with 64 KB of shared memory. It is worth mentioning that, as of now, I believe only blocks from the same kernel can be deployed to one SM at the same time.
I am not sure at the moment whether there is some blocking factor other than registers or shared memory, but obviously, if the blocking factor for registers is 1 and for shared memory is 2, then the lower number is the limit on the number of blocks per SM.
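Not part of the original answer, but the runtime can do this bookkeeping for you: cudaOccupancyMaxActiveBlocksPerMultiprocessor reports how many blocks of a given kernel fit on one SM, taking registers, shared memory, and the hardware block/warp limits into account. someKernel and its block size here are placeholders.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void someKernel(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;
}

int main() {
    const int blockSize = 1024;    // threads per block
    const size_t dynamicSmem = 0;  // dynamic shared memory per block, in bytes

    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, someKernel,
                                                  blockSize, dynamicSmem);
    printf("Resident blocks per SM for this kernel: %d\n", blocksPerSM);
    return 0;
}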
As for your second question, how many threads can run concurrently, the answer is: as many as there are cores in one SM, so for the Kepler SMX that is 192. The number of concurrent warps is then 192 / 32 = 6.
If you are interested in this stuff I advise you to use the Nsight profiling tool, where you can inspect all kernel launches, their blocking factors, and much more useful info.
EDIT:
Reading Robert Crovella's answer, I realized there really are these limits for blocks per SM and threads per SM, but I was never able to reach them, as my kernels typically used too many registers or too much shared memory. Again, these values can be investigated using Nsight, which displays all the useful info about the available CUDA devices; for the GK110 chip, for example, such info can also be found in the related document on the NVIDIA pages.

Why bother to know about CUDA Warps?

I have a GeForce GTX460 SE, so it has: 6 SMs x 48 CUDA cores = 288 CUDA cores.
It is known that one warp contains 32 threads, and that in one block only one warp can be executed simultaneously (at a time).
That is, can a single multiprocessor (SM) simultaneously execute only one block, one warp, and only 32 threads, even if there are 48 cores available?
In addition, concrete threads and blocks can be distinguished using threadIdx.x and blockIdx.x, and are allocated with kernel <<< Blocks, Threads >>> ().
But how does one allocate a specific number of warps and distribute them, and if that is not possible, then why bother to know about warps?
The situation is quite a bit more complicated than what you describe.
The ALUs (cores), load/store (LD/ST) units and Special Function Units (SFU) are pipelined units. They keep the results of many computations or operations at the same time, in various stages of completion. So, in one cycle they can accept a new operation and provide the results of another operation that was started a long time ago (around 20 cycles for the ALUs, if I remember correctly). So, a single SM in theory has resources for processing 48 * 20 cycles = 960 ALU operations at the same time, which is 960 / 32 threads per warp = 30 warps. In addition, it can process LD/ST operations and SFU operations at whatever their latency and throughput are.
The warp schedulers can schedule 2 * 32 threads per warp = 64 threads to the pipelines per cycle. So that's the number of results that can be obtained per clock. Given that there is a mix of computing resources, 48 cores, 16 LD/ST, 8 SFU, each of which has a different latency, a mix of warps is being processed at the same time. At any given cycle, the warp schedulers try to "pair up" two warps to schedule, to maximize the utilization of the SM.
The warp schedulers can issue warps either from different blocks, or from different places in the same block, if the instructions are independent. So, warps from multiple blocks can be processed at the same time.
Adding to the complexity, warps that are executing instructions for which there are fewer than 32 resources must be issued multiple times for all the threads to be serviced. For instance, there are 8 SFUs, so that means that a warp containing an instruction that requires the SFUs must be scheduled 4 times.
This description is simplified. There are other restrictions that come into play as well that determine how the GPU schedules the work. You can find more information by searching the web for "fermi architecture".
So, coming to your actual question,
why bother to know about Warps?
Knowing the number of threads in a warp and taking it into consideration becomes important when you try to maximize the performance of your algorithm. If you don't follow these rules, you lose performance:
In the kernel invocation, <<<Blocks, Threads>>>, try to choose a number of threads per block that is a multiple of the number of threads in a warp. If you don't, you end up launching a block that contains inactive threads (see the sketch after this list).
In your kernel, try to have each thread in a warp follow the same code path. If you don't, you get what's called warp divergence. This happens because the GPU has to run the entire warp through each of the divergent code paths.
In your kernel, try to have each thread in a warp load and store data in specific patterns. For instance, have the threads in a warp access consecutive 32-bit words in global memory.
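A minimal sketch of the first rule (mine, not from the answer): the block size is a multiple of the 32-thread warp size, and the grid size is rounded up so every element is covered; N and the kernel body are placeholders.

#include <cuda_runtime.h>

__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)             // guard the threads past the end of the array
        data[i] *= 2.0f;
}

int main() {
    const int N = 100000;
    float *d_data;
    cudaMalloc(&d_data, N * sizeof(float));

    const int threadsPerBlock = 256;                                  // multiple of the warp size
    const int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;   // round up
    scale<<<blocks, threadsPerBlock>>>(d_data, N);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}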
Are threads grouped into Warps necessarily in order, 1 - 32, 33 - 64 ...?
Yes, the programming model guarantees that the threads are grouped into warps in that specific order.
So, as a simple example of optimizing divergent code paths, one can separate all the threads in the block into groups of 32 threads? For example: switch (threadIdx.x / 32) { case 0: /* warp 1 */ break; case 1: /* warp 2 */ break; /* etc. */ }
Exactly :)
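Expanding the questioner's switch example into a small sketch (my own): the branch condition threadIdx.x / 32 is the same for all 32 threads of a warp, so no warp has to execute both paths; a and b are placeholder arrays.

__global__ void perWarpWork(float *a, float *b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int warpInBlock = threadIdx.x / 32;   // identical for every thread of a warp

    if (i < n) {
        if (warpInBlock == 0)
            a[i] += 1.0f;                 // warp 0 of the block takes this path
        else
            b[i] += 1.0f;                 // all other warps take this path
    }
}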
How many bytes must be read at one time for a single warp: 4 bytes * 32 threads, 8 bytes * 32 threads, or 16 bytes * 32 threads? As far as I know, one transaction to global memory at a time receives 128 bytes.
Yes, transactions to global memory are 128 bytes. So, if each thread reads a 32-bit word from consecutive addresses (they probably need to be 128-byte aligned as well), all the threads in the warp can be serviced with a single transaction (4 bytes * 32 threads = 128 bytes). If each thread reads more bytes, or if the addresses are not consecutive, more transactions need to be issued (with separate transactions for each separate 128-byte line that is touched).
This is described in the CUDA Programming Manual 4.2, section F.4.2, "Global Memory". There's also a blurb in there saying that the situation is different with data that is cached only in L2, as the L2 cache has 32-byte cache lines. I don't know how to arrange for data to be cached only in L2 or how many transactions one ends up with.
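To illustrate (a sketch of mine, not from the answer): in the first kernel the 32 threads of a warp read 32 consecutive 32-bit words, which fits in a single 128-byte transaction; in the second, with a large stride each thread touches a different 128-byte line, so the warp needs up to 32 transactions.

__global__ void coalescedCopy(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];                          // consecutive addresses within a warp
}

__global__ void stridedCopy(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[(long long)i * stride % n];  // widely separated addresses within a warp
}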

How do CUDA blocks/warps/threads map onto CUDA cores?

I have been using CUDA for a few weeks, but I have some doubts about the allocation of blocks/warps/threads.
I am studying the architecture from a didactic point of view (university project), so reaching peak performance is not my concern.
First of all, I would like to understand if I got these facts straight:
The programmer writes a kernel, and organizes its execution in a grid of thread blocks.
Each block is assigned to a Streaming Multiprocessor (SM). Once assigned it cannot migrate to another SM.
Each SM splits its own blocks into warps (currently with a maximum size of 32 threads). All the threads in a warp execute concurrently on the resources of the SM.
The actual execution of a thread is performed by the CUDA Cores contained in the SM. There is no specific mapping between threads and cores.
If a warp contains 20 threads, but currently there are only 16 cores available, the warp will not run.
On the other hand, if a block contains 48 threads, it will be split into 2 warps and they will execute in parallel provided that enough memory is available.
If a thread starts on a core and is then stalled waiting for a memory access or a long floating-point operation, its execution could resume on a different core.
Are they correct?
Now, I have a GeForce 560 Ti, so according to the specifications it is equipped with 8 SMs, each containing 48 CUDA cores (384 cores in total).
My goal is to make sure that every core of the architecture executes the SAME instructions. Assuming that my code will not require more registers than are available in each SM, I imagined different approaches:
I create 8 blocks of 48 threads each, so that each SM has 1 block to execute. In this case will the 48 threads execute in parallel in the SM (exploiting all the 48 cores available for them)?
Is there any difference if I launch 64 blocks of 6 threads? (Assuming that they will be mapped evenly among the SMs)
If I "submerge" the GPU in scheduled work (creating 1024 blocks of 1024 thread each, for example) is it reasonable to assume that all the cores will be used at a certain point, and will perform the same computations (assuming that the threads never stall)?
Is there any way to check these situations using the profiler?
Is there any reference for this stuff? I read the CUDA Programming guide and the chapters dedicated to hardware architecture in "Programming Massively Parallel Processors" and "CUDA Application design and development"; but I could not get a precise answer.
Two of the best references are
1. NVIDIA Fermi Compute Architecture Whitepaper
2. GF104 Reviews
I'll try to answer each of your questions.
The programmer divides work into threads, threads into thread blocks, and thread blocks into grids. The compute work distributor allocates thread blocks to Streaming Multiprocessors (SMs). Once a thread block is distributed to an SM, the resources for the thread block are allocated (warps and shared memory) and threads are divided into groups of 32 threads called warps. Once a warp is allocated it is called an active warp. The two warp schedulers pick two active warps per cycle and dispatch warps to execution units. For more details on execution units and instruction dispatch, see reference 1, pp. 7-10, and reference 2.
4'. There is a mapping between laneid (a thread's index within a warp) and a core.
5'. If a warp contains fewer than 32 threads it will in most cases be executed the same as if it had 32 threads. Warps can have fewer than 32 active threads for several reasons: the number of threads per block is not divisible by 32, the program executes a divergent branch so threads that did not take the current path are marked inactive, or a thread in the warp has exited.
6'. A thread block will be divided into
WarpsPerBlock = (ThreadsPerBlock + WarpSize - 1) / WarpSize
There is no requirement for the warp schedulers to select two warps from the same thread block.
7'. An execution unit will not stall on a memory operation. If a resource is not available when an instruction is ready to be dispatched the instruction will be dispatched again in the future when the resource is available. Warps can stall at barriers, on memory operations, texture operations, data dependencies, ... A stalled warp is ineligible to be selected by the warp scheduler. On Fermi it is useful to have at least 2 eligible warps per cycle so that the warp scheduler can issue an instruction.
See reference 2 for differences between a GTX480 and GTX560.
If you read the reference material (a few minutes) I think you will find that your goal does not make sense. I'll try to respond to your points.
1'. If you launch kernel<<<8, 48>>> you will get 8 blocks, each with 2 warps of 32 and 16 threads. There is no guarantee that these 8 blocks will be assigned to different SMs. If 2 blocks are allocated to an SM then it is possible that each warp scheduler can select a warp and execute it. You will only use 32 of the 48 cores.
2'. There is a big difference between 8 blocks of 48 threads and 64 blocks of 6 threads. Let's assume that your kernel has no divergence and each thread executes 10 instructions.
8 blocks with 48 threads = 16 warps * 10 instructions = 160 instructions
64 blocks with 6 threads = 64 warps * 10 instructions = 640 instructions
In order to get optimal efficiency the division of work should be in multiples of 32 threads. The hardware will not coalesce threads from different warps.
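A small host-side sketch (my own, not from the answer) of that arithmetic: every block is padded up to whole warps, so 6-thread blocks still cost one warp each and waste 26 of every 32 issue lanes.

#include <cstdio>

int warpsPerLaunch(int blocks, int threadsPerBlock, int warpSize = 32) {
    int warpsPerBlock = (threadsPerBlock + warpSize - 1) / warpSize;  // round up to whole warps
    return blocks * warpsPerBlock;
}

int main() {
    printf("8 blocks x 48 threads -> %d warps\n", warpsPerLaunch(8, 48));   // 16 warps
    printf("64 blocks x 6 threads -> %d warps\n", warpsPerLaunch(64, 6));   // 64 warps
    return 0;
}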
3'. A GTX560 can have 8 SMs * 8 blocks = 64 blocks at a time, or 8 SMs * 48 warps = 512 warps, if the kernel does not max out registers or shared memory. At any given time only a portion of the work will be active on the SMs. Each SM has multiple execution units (more than CUDA cores). Which resources are in use at any given time depends on the warp schedulers and the instruction mix of the application. If you don't do TEX operations then the TEX units will be idle. If you don't do a special floating-point operation the SFU units will be idle.
4'. Parallel Nsight and the Visual Profiler show
a. executed IPC
b. issued IPC
c. active warps per active cycle
d. eligible warps per active cycle (Nsight only)
e. warp stall reasons (Nsight only)
f. active threads per instruction executed
The profilers do not show the utilization percentage of any of the execution units. For GTX560 a rough estimate would be IssuedIPC / MaxIPC.
For MaxIPC assume
GF100 (GTX480) is 2
GF10x (GTX560) is 4, but 3 is a better target.
"E. If a warp contains 20 threads, but currently there are only 16 cores available, the warp will not run."
is incorrect. You are confusing cores in their usual sense (also used in CPUs) - the number of "multiprocessors" in a GPU, with cores in nVIDIA marketing speak ("our card has thousands of CUDA cores").
A CUDA core (SO answer) is a hardware concept and a thread is a software concept. Even with only 16 cores available, you can still run 32 threads; however, you may need 2 clock cycles to run them with only 16 hardware cores.
The CUDA core count represents the total number of single precision floating point or integer thread instructions that can be executed per cycle
The warp scheduler is responsible for finding cores to run instructions on (SO answer).
A warp is a logical assembly of 32 threads of execution. To execute a single instruction from a single warp, the warp scheduler must usually schedule 32 execution units (or "cores", although the definition of a "core" is somewhat loose).
A warp itself can only be scheduled on an SM (multiprocessor, or streaming multiprocessor), and can run up to 32 threads at the same time (depending on the number of cores in the SM); it cannot use more than one SM.
The number "48 warps" is the maximum number of active warps (warps which may be chosen to be scheduled for work in the next cycle, at any given cycle) per multiprocessor, on NVIDIA GPUs with Compute Capability 2.x; and this number corresponds to 1536 = 48 x 32 threads.
Answer based on this webinar
