CUDA workflow - possible scenario

The GeForce GTX 560 Ti has 8 SMs, and each SM has 48 CUDA cores (SPs). I'm going to launch a kernel this way: kernel<<<1024,1024>>>. The SM schedules threads in groups of 32 parallel threads called warps. How will blocks and threads be distributed between the 8 SMs and the 48 SPs in each SM? We have 1024 blocks of 1024 threads each, so what is a possible scenario? What is the maximum number of threads executing literally at the same time? What is the difference between the Fermi dual warp scheduler and earlier schedulers?

The NVIDIA supplied occupancy calculator spreadsheet, which ships in every SDK or is available for download here, can provide the answer to the first three "sub-questions" you have asked.
As for the difference between multiprocessor level scheduling in Fermi compared with earlier architectures, the name ("dual warp scheduler") really says it all. In Fermi, MPs retire instructions from two warps simultaneously, compared to a single warp, as was the case in the first two generations of CUDA capable architectures. If you want a more detailed answer than that, I recommend reading the Fermi architecture whitepaper, available for download here.
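For the launch in the question, the numbers the occupancy calculator would report can be approximated by hand. The sketch below is only that, a rough approximation: it assumes the published compute capability 2.1 limits of 1536 resident threads and 8 resident blocks per SM, ignores register and shared memory pressure entirely, and treats block scheduling as if it happened in whole "waves", which real hardware does not strictly do. The struct and function names are made up for illustration.

```cpp
#include <cassert>

// Assumed per-SM limits for a compute capability 2.1 device (GTX 560 Ti class).
struct SmLimits {
    int maxThreadsPerSm;  // resident threads per SM
    int maxBlocksPerSm;   // resident blocks per SM
    int coresPerSm;       // CUDA cores (SPs) per SM
};

const SmLimits kFermi21 = {1536, 8, 48};

// Blocks resident on one SM when only the thread and block limits apply.
int residentBlocksPerSm(const SmLimits& sm, int threadsPerBlock) {
    assert(threadsPerBlock > 0);
    int byThreads = sm.maxThreadsPerSm / threadsPerBlock;
    return byThreads < sm.maxBlocksPerSm ? byThreads : sm.maxBlocksPerSm;
}

// Threads resident on the whole device at once.
int residentThreadsOnDevice(const SmLimits& sm, int numSms, int threadsPerBlock) {
    return numSms * residentBlocksPerSm(sm, threadsPerBlock) * threadsPerBlock;
}

// Upper bound on threads executing an instruction in the same cycle.
int lanesOnDevice(const SmLimits& sm, int numSms) {
    return numSms * sm.coresPerSm;
}

// Rough number of scheduling "waves" needed to drain the grid.
int wavesNeeded(const SmLimits& sm, int numSms, int numBlocks, int threadsPerBlock) {
    int concurrent = numSms * residentBlocksPerSm(sm, threadsPerBlock);
    assert(concurrent > 0);
    return (numBlocks + concurrent - 1) / concurrent;
}
```

Under those assumptions, kernel<<<1024,1024>>> on 8 SMs gives one resident block per SM (1536 / 1024 rounds down to 1), so 8 blocks and 8192 threads are resident at a time, at most 8 * 48 = 384 threads are in the CUDA cores in any one cycle, and the grid drains in roughly 128 rounds of 8 blocks.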

Related

Different Kernels sharing SMx [duplicate]

Is it possible, using streams, to have multiple unique kernels on the same streaming multiprocessor in Kepler 3.5 GPUs? I.e. run 30 kernels of size <<<1,1024>>> at the same time on a Kepler GPU with 15 SMs?
On a compute capability 3.5 device, it might be possible.
Those devices support up to 32 concurrent kernels per GPU and 2048 threads per multiprocessor. With 64k registers per multiprocessor, two blocks of 1024 threads could run concurrently if their register footprint was no more than 32 registers per thread (65536 / 2048) and each block used no more than 24 KB of shared memory.
You can find all of this in the hardware description in the appendices of the CUDA programming guide.
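The per-multiprocessor budget can be checked with a few integer divisions. This is a minimal sketch, assuming the compute capability 3.5 figures of 65536 registers and 2048 resident threads per multiprocessor, plus the 48 KB shared memory configuration; it ignores register allocation granularity and the cap on resident blocks per multiprocessor, and blocksThatFit is a hypothetical helper name.

```cpp
#include <algorithm>
#include <cassert>

// Assumed compute capability 3.5 per-multiprocessor resources.
const int kRegistersPerSm   = 65536;      // 64k 32-bit registers
const int kSharedBytesPerSm = 48 * 1024;  // assumes the 48 KB shared memory configuration
const int kMaxThreadsPerSm  = 2048;

// How many blocks of a given shape fit on one multiprocessor at once,
// limited by thread slots, registers, and shared memory.
int blocksThatFit(int threadsPerBlock, int regsPerThread, int sharedBytesPerBlock) {
    assert(threadsPerBlock > 0 && regsPerThread > 0 && sharedBytesPerBlock >= 0);
    int byThreads = kMaxThreadsPerSm / threadsPerBlock;
    int byRegs    = kRegistersPerSm / (threadsPerBlock * regsPerThread);
    int blocks    = std::min(byThreads, byRegs);
    if (sharedBytesPerBlock > 0) {
        blocks = std::min(blocks, kSharedBytesPerSm / sharedBytesPerBlock);
    }
    return blocks;
}
```

With these numbers, blocksThatFit(1024, 32, 24 * 1024) is 2, while going to 33 registers per thread or 25 KB of shared memory per block drops it to 1.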

clarification about CUDA number of threads executed per SM

I am new to cuda programming and am reading about a G80 chip which has 128 SPs(16 SMs, each with 8 SPs) from the book "Programming Massively Parallel Processors - A hands on approach".
There is a comparison between Intel CPUs and G80 chip.
Intel CPUs support 2 to 4 threads per core, depending on the machine model,
whereas the G80 chip supports 768 threads per SM, which sums to about 12,000 threads for the chip (16 x 768 = 12,288).
My question is: can the G80 chip execute 768 threads simultaneously?
If not, then what is meant by saying Intel CPUs support 2 to 4 threads per core? We can always have many threads/processes running on an Intel CPU, scheduled by the OS.
The G80 keeps the context for 768 threads per SM resident concurrently and interleaves their execution. This is the key difference between a CPU and a GPU. GPUs are deeply multithreaded processors that hide the memory accesses of some threads behind the computation of other threads. The latency of executing a single thread is much higher than on a CPU, because the GPU is optimized for thread throughput instead of thread latency. In comparison, CPUs use out-of-order speculative execution to reduce the execution delay of one thread. GPUs use several techniques to reduce thread scheduling overhead. For example, GPUs group threads into coarser schedulable elements called warps (or wavefronts) and execute the threads of a warp over a SIMD unit. GPU threads run identical code, making them a suitable fit for the SIMD model. From the programmer's point of view, threads execute in MIMD fashion, and they are grouped into thread blocks to reduce communication overhead.
Threads employed in a CPU core are used to fill different execution units by dynamic scheduling. CPU threads are not necessarily of the same type. This means that while one thread is busy with the floating-point unit, other threads may find the ALU idle, so these threads can execute concurrently. Multiple threads per core are maintained to keep the different execution units busy. However, dynamic scheduling is costly in terms of power and energy consumption, so manufacturers support only a few threads per CPU core.
In answer to the second part of your question: threads on GPUs are scheduled by hardware (the per-SM warp scheduler); neither the OS nor the driver affects that scheduling.
As far as I know, 768 is the maximum number of resident threads in an SM. The threads are executed in warps, each of which consists of 32 threads. So in an SM, all 768 threads will not be executed at the same time; they will be scheduled in chunks of 32 threads at a time, i.e. one warp at a time.
The analogous technology on CPUs is called "simultaneous multithreading" (SMT), or hyperthreading in Intel's marketing speak. It allows usually two, and on some CPUs four, threads to be scheduled by the CPU itself in hardware.
This is different from the fact that the operating system may on top of that schedule a larger number of threads in software.
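The G80 figures quoted in this thread reduce to simple arithmetic. The sketch below is only illustrative: the helper names are invented, and the issue-cycle estimate assumes one instruction per SP per clock, which is a simplification of the real pipeline.

```cpp
#include <cassert>

const int kWarpSize = 32;  // threads per warp on all CUDA hardware discussed here

// Warps that can be resident on one SM for a given resident-thread limit.
int residentWarpsPerSm(int residentThreadsPerSm) {
    return residentThreadsPerSm / kWarpSize;
}

// Threads whose context the whole chip keeps resident at once.
int residentThreadsOnChip(int numSms, int residentThreadsPerSm) {
    return numSms * residentThreadsPerSm;
}

// Clock cycles needed to push one warp-wide instruction through the SPs of an SM,
// assuming each SP accepts one thread's instruction per cycle.
int cyclesToIssueWarp(int spsPerSm) {
    assert(spsPerSm > 0);
    return (kWarpSize + spsPerSm - 1) / spsPerSm;
}
```

For the G80 (16 SMs, 8 SPs each, 768 resident threads per SM) this gives 24 resident warps per SM, 12,288 resident threads on the chip, and 4 cycles to issue one instruction for a whole warp. Only 8 threads per SM are in the SPs in any given cycle; the other resident warps are what the scheduler switches to when one warp stalls.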

CUDA Kernel register size

On a compute capability 1.3 CUDA card,
we run the following code
for(int i=1;i<20;++i)
kernelrun<<<30,320>>>(...);
We know that each SM has 8 SPs and can hold 1024 resident threads,
so the 30 SMs in a Tesla C1060 can hold 30*1024 threads concurrently.
With the given code, how many threads can run concurrently?
If the kernelrun kernel uses 48 registers per thread, what are the limitations on a Tesla C1060,
which has 16384 registers and 16 KB of shared memory per SM?
Since concurrent kernel execution is not supported on the Tesla C1060, how
can we execute the kernels in the loop concurrently? Are streams an option?
Is there only one concurrent copy-and-execute engine in the Tesla C1060?
NVIDIA have been shipping an Occupancy calculator which you can use to answer this question for yourself since 2007. You should try it.
But to answer your question: each SM in your compute 1.3 device has 16384 registers, so if your kernel is register limited, the number of resident threads per SM would be roughly 320 for the 48 registers per thread you quote (16384/48 is about 341, rounded down to the nearest multiple of 32). There is also a register page allocation granularity to consider, which can lower that figure further.
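The register-limited estimate above is a division followed by rounding down to a whole number of warps. Here is a minimal sketch of that calculation; registerLimitedThreads is a hypothetical helper, and it deliberately ignores the register allocation granularity mentioned in the answer, so treat its result as an upper bound.

```cpp
#include <algorithm>
#include <cassert>

const int kWarpSize = 32;

// Upper bound on resident threads per SM when registers are the limiting resource.
// Rounds down to a whole number of warps and caps at the SM's thread limit.
int registerLimitedThreads(int registersPerSm, int regsPerThread, int maxThreadsPerSm) {
    assert(regsPerThread > 0);
    int raw = registersPerSm / regsPerThread;   // threads the register file can back
    int wholeWarps = raw - raw % kWarpSize;     // round down to a multiple of 32
    return std::min(wholeWarps, maxThreadsPerSm);
}
```

For a compute 1.3 SM (16384 registers, 1024 resident threads at most), 48 registers per thread gives 320 threads and 45 registers per thread gives 352. A 320-thread block therefore fits once per SM in the 48-register case.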

How many threads does an NVIDIA GTS 450 have

Dear friends:
I want to study CUDA programming, so I bought an NVIDIA GTS 450 PCI-E card. It has 192 SMs, so how many threads does it have? 192 threads, or 192*512 threads?
Regards
In CUDA, the term "threads" refers to a property of a specific kernel invocation, not a property of the hardware.
For instance in this CUDA invocation:
someFunction<<<2,32>>>(1,2,3);
you have 2 blocks of 32 threads each, so 64 threads in total.
The hardware schedules threads to processors automatically.
According to the specs, your device has 192 "processor cores" - these are not the same as SMs. In CUDA, a SM is a multiprocessor that executes multiple threads in lockstep (8 for the 1.3 family of devices, more for later devices).
As shoosh pointed out, the number of threads used is a function of your kernel invocation.
Typically to get good performance in CUDA, you should run many more threads than you have CUDA processor cores - this is to hide the latency of your global memory accesses.

Streaming multiprocessors, Blocks and Threads (CUDA)

What is the relationship between a CUDA core, a streaming multiprocessor and the CUDA model of blocks and threads?
What gets mapped to what, and what is parallelized and how? And which is more efficient: maximizing the number of blocks or the number of threads?
My current understanding is that there are 8 CUDA cores per multiprocessor, that every CUDA core is able to execute one CUDA block at a time, and that all the threads in that block are executed serially on that particular core.
Is this correct?
The thread / block layout is described in detail in the CUDA programming guide. In particular, chapter 4 states:
The CUDA architecture is built around a scalable array of multithreaded Streaming Multiprocessors (SMs). When a CUDA program on the host CPU invokes a kernel grid, the blocks of the grid are enumerated and distributed to multiprocessors with available execution capacity. The threads of a thread block execute concurrently on one multiprocessor, and multiple thread blocks can execute concurrently on one multiprocessor. As thread blocks terminate, new blocks are launched on the vacated multiprocessors.
On compute capability 1.x devices, each SM contains 8 CUDA cores, and at any one time they're executing a single warp of 32 threads, so it takes 4 clock cycles to issue a single instruction for the whole warp. You can assume that threads in any given warp execute in lock-step, but to synchronise across the warps of a block, you need to use __syncthreads().
For the GTX 970 there are 13 Streaming Multiprocessors (SM) with 128 Cuda Cores each. Cuda Cores are also called Stream Processors (SP).
You can define grids, which map blocks onto the GPU.
You can define blocks, which map threads onto Stream Processors (the 128 CUDA cores per SM).
One warp is always formed by 32 threads, and all threads of a warp are executed simultaneously.
To use the full power of a GPU you need many more threads per SM than the SM has SPs. For each Compute Capability there is a certain number of threads which can reside in one SM at a time. All blocks you define are queued and wait for an SM to have the resources free (registers, shared memory, thread slots); then the block is loaded. The SM starts to execute warps. Since one warp only has 32 threads and an SM has, for example, 128 SPs, an SM can execute 4 warps at a given time. The catch is that when threads access memory, they block until the memory request is satisfied. In numbers: an arithmetic calculation on the SP has a latency of 18-22 cycles, while a non-cached global memory access can take up to 300-400 cycles. This means that if the threads of one warp are waiting for data, only a subset of the 128 SPs would be working. Therefore the scheduler switches to another warp if one is available. If that warp blocks, it executes the next, and so on. This concept is called latency hiding. The number of warps and the block size determine the occupancy (how many warps the SM can choose from). If the occupancy is high, it is less likely that the SPs run out of work.
Your statement that each CUDA core will execute one block at a time is wrong. Streaming Multiprocessors can execute warps from all threads resident on the SM. If one block has a size of 256 threads and your GPU allows 2048 threads to be resident per SM, each SM would have 8 blocks resident, from which the SM can choose warps to execute. All threads of the executed warps are executed in parallel.
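The 256-threads-per-block example works out as follows. This is a small sketch with invented helper names; it only looks at the resident-thread limit and assumes registers, shared memory, and the per-SM block cap are not the bottleneck.

```cpp
#include <cassert>

const int kWarpSize = 32;

// Blocks resident on one SM when only the resident-thread limit applies.
int residentBlocks(int maxThreadsPerSm, int threadsPerBlock) {
    assert(threadsPerBlock > 0);
    return maxThreadsPerSm / threadsPerBlock;
}

// Warps per block, rounding up because a partial warp still occupies a warp slot.
int warpsPerBlock(int threadsPerBlock) {
    return (threadsPerBlock + kWarpSize - 1) / kWarpSize;
}

// Warps the SM's schedulers can choose from at any moment.
int selectableWarps(int maxThreadsPerSm, int threadsPerBlock) {
    return residentBlocks(maxThreadsPerSm, threadsPerBlock) * warpsPerBlock(threadsPerBlock);
}
```

With 2048 resident threads per SM and 256-thread blocks, that is 8 resident blocks and 64 warps to pick from, even though only about 4 warps' worth of threads fit into 128 SPs in a single cycle.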
You find numbers for the different Compute Capabilities and GPU Architectures here:
https://en.wikipedia.org/wiki/CUDA#Limitations
You can download an occupancy calculation sheet from NVIDIA.
The Compute Work Distributor will schedule a thread block (CTA) on an SM only if the SM has sufficient resources for the thread block (shared memory, warps, registers, barriers, ...). Thread block level resources such as shared memory are allocated. The allocation creates sufficient warps for all threads in the thread block. The resource manager allocates warps round-robin to the SM subpartitions. Each SM subpartition contains a warp scheduler, register file, and execution units. Once a warp is allocated to a subpartition, it will remain on the subpartition until it completes or is pre-empted by a context switch (Pascal architecture). On context switch restore, the warp will be restored to the same SM with the same warp-id.
When all threads in warp have completed the warp scheduler waits for all outstanding instructions issued by the warp to complete and then the resource manager releases the warp level resources which include warp-id and register file.
When all warps in a thread block complete then block level resources are released and the SM notifies the Compute Work Distributor that the block has completed.
Once a warp is allocated to a subpartition and all resources are allocated, the warp is considered active, meaning that the warp scheduler is actively tracking the state of the warp. On each cycle the warp scheduler determines which active warps are stalled and which are eligible to issue an instruction. The warp scheduler picks the highest priority eligible warp and issues 1-2 consecutive instructions from the warp. The rules for dual-issue are specific to each architecture. If a warp issues a memory load, it can continue to execute independent instructions until it reaches a dependent instruction. The warp will then report stalled until the load completes. The same is true for dependent math instructions. The SM architecture is designed to hide both ALU and memory latency by switching per cycle between warps.
This answer does not use the term CUDA core, as this introduces an incorrect mental model. CUDA cores are pipelined single precision floating point/integer execution units. The issue rate and dependency latency are specific to each architecture. Each SM subpartition and SM has other execution units, including load/store units, double precision floating point units, half precision floating point units, branch units, etc.
In order to maximize performance the developer has to understand the trade off of blocks vs. warps vs. registers/thread.
The term occupancy is the ratio of active warps to maximum warps on an SM. Kepler through Pascal architectures (except GP100) have 4 warp schedulers per SM. The number of warps per SM should be at least equal to the number of warp schedulers. If the architecture has a dependent execution latency of 6 cycles (Maxwell and Pascal), then you would need at least 6 warps per scheduler, which is 24 per SM (24 / 64 = 37.5% occupancy), to cover the latency. If the threads have instruction level parallelism, then this could be reduced. Almost all kernels issue variable latency instructions such as memory loads that can take 80-1000 cycles. This requires more active warps per warp scheduler to hide latency. For each kernel there is a trade-off point between the number of warps and other resources such as shared memory or registers, so optimizing for 100% occupancy is not advised, as some other sacrifice will likely be made. The CUDA profiler can help identify instruction issue rate, occupancy, and stall reasons in order to help the developer determine that balance.
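The 24-warp, 37.5% figure can be reproduced directly. This is a back-of-the-envelope sketch with hypothetical function names; it assumes no instruction level parallelism and only fixed-latency dependent instructions, so it gives a floor, not a target.

```cpp
#include <cassert>

// Warps per SM needed so each scheduler always has an eligible warp while
// the others wait out a fixed dependent-instruction latency.
int warpsToCoverLatency(int schedulersPerSm, int dependentLatencyCycles) {
    return schedulersPerSm * dependentLatencyCycles;
}

// Occupancy as defined above: active warps divided by the SM's maximum warps.
double occupancy(int activeWarps, int maxWarpsPerSm) {
    assert(maxWarpsPerSm > 0);
    return static_cast<double>(activeWarps) / maxWarpsPerSm;
}
```

For 4 schedulers and a 6-cycle dependent latency, warpsToCoverLatency(4, 6) is 24, and occupancy(24, 64) is 0.375. Memory loads in the 80-1000 cycle range would push the required warp count well past this floor.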
The size of a thread block can impact performance. If the kernel has large blocks and uses synchronization barriers, then barrier stalls can be a common stall reason. This can be alleviated by reducing the number of warps per thread block.
There are multiple streaming multiprocessors on one device.
An SM may hold multiple blocks. Each block may contain several threads.
An SM has multiple CUDA cores, which work on threads (as a developer, you should not need to care about this, because it is abstracted by the warp). An SM always works on warps of threads (always 32 threads). A warp only ever contains threads from the same block.
Both the SM and the block have limits on the number of threads, the number of registers, and shared memory.