Kepler blocks per MP? - CUDA

I am reading from the Kepler whitepaper here
that Kepler supports up to 16 blocks per MP.
But max threads per block = 1024 and max threads per MP = 2048, hence blocks per MP = 2.
Am I missing something here?

As you said, for Kepler a streaming multiprocessor can run up to 16 thread blocks.
In your example, if a thread block consists of 1024 threads, then only two blocks can be launched at the same time on one MP, because in this case you are limited by the maximum number of threads per multiprocessor: 2048 / 1024 = 2 blocks.
There are several factors that influence how many blocks can run concurrently on a streaming multiprocessor. An SM also has a limited number of registers and a limited amount of shared memory. If you use too many registers or too much shared memory, you will be limited by these factors instead.
A good overview of this is the CUDA occupancy calculator. With the Excel sheet you can easily set up a kernel configuration for all CUDA architectures, and you will see what the kernel is limited by.
The CUDA programming guide also provides all the required information.
Maybe a simple example can help (done with the occupancy calculator for compute capability 3.0):
If your thread block consists of 512 threads and you don't use any registers or shared memory, then the number of parallel blocks is limited only by the block size.
For CC 3.0, 2048 threads can be launched per SM, so 2048 / 512 = 4: it's only possible to run 4 thread blocks at the same time.
In the second step, each thread additionally uses 48 registers.
Per thread block, 512 * 48 = 24576 registers are used. But an SM only has 65536 registers, and 65536 / 24576 = 2 (rounded down), so now it's only possible to run two blocks instead of four.
In the last step, let's assume a block uses 32000 bytes of shared memory. Because an SM only has 49152 bytes of shared memory, only a single thread block can run at a time.
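If you'd rather query these limits programmatically than read them off the spreadsheet, the CUDA runtime (6.5 and later) offers cudaOccupancyMaxActiveBlocksPerMultiprocessor. A minimal sketch, assuming a hypothetical kernel myKernel and a block size of 512:

    #include <cstdio>

    // Hypothetical kernel - stands in for whatever you are analysing.
    __global__ void myKernel(float *data) { }

    int main()
    {
        int numBlocks = 0;
        int blockSize = 512;
        // Ask the runtime how many blocks of this kernel fit on one SM,
        // given the block size and dynamic shared memory per block (0 here).
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, myKernel,
                                                      blockSize, 0);
        printf("Active blocks per SM: %d\n", numBlocks);
        return 0;
    }

This reports the same limit the spreadsheet computes, but for the registers and shared memory the compiler actually assigned to your kernel.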

Related

Different Kernels sharing SMx [duplicate]

Is it possible, using streams, to have multiple unique kernels on the same streaming multiprocessor in Kepler 3.5 GPUs? That is, can I run 30 kernels of size <<<1,1024>>> at the same time on a Kepler GPU with 15 SMs?
On a compute capability 3.5 device, it might be possible.
Those devices support up to 32 concurrent kernels per GPU and 2048 threads per multiprocessor. With 64k registers per multiprocessor, two blocks of 1024 threads could run concurrently if their register footprint was less than 32 per thread (65536 / 2048) and they used less than 24KB of shared memory per block.
You can find all of this in the hardware description in the appendices of the CUDA programming guide.
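For what it's worth, here is a rough sketch of the launch pattern the question describes: 30 single-block kernels issued on separate streams. The kernel body is a placeholder; whether the launches actually overlap depends on the resource limits above:

    __global__ void worker(int id) { /* placeholder body */ }

    int main()
    {
        const int numKernels = 30;            // from the question
        cudaStream_t streams[numKernels];
        for (int i = 0; i < numKernels; ++i)
            cudaStreamCreate(&streams[i]);

        // One block of 1024 threads per launch, each on its own stream,
        // so the hardware is free to run them concurrently across SMs.
        for (int i = 0; i < numKernels; ++i)
            worker<<<1, 1024, 0, streams[i]>>>(i);

        cudaDeviceSynchronize();
        for (int i = 0; i < numKernels; ++i)
            cudaStreamDestroy(streams[i]);
        return 0;
    }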

CUDA: Does the compute capability impact the maximum number of active threads?

If I have a device supporting CC 3.0, that means it has a maximum number of active threads equal to 2048 per multiprocessor. If I set the CC to 2.0 (compute_20,sm_20), does that mean the maximum number of active threads will be only 1536 per multiprocessor, or does the compute capability have no impact on this?
Or does it have an impact on the shared memory size?
CUDA is designed for scalability; kernels will expand to use all of the resources they can. So it doesn't matter how you compile the kernel: it will fill up all of the available threads unless you do something that prevents it from doing so, like launching it with 768 threads per block.
Now, GPU threads aren't like CPU cores; you aren't losing the ability to do computation if you aren't using all of the threads. A streaming multiprocessor (SM) on a device of compute capability 3.0 can manage 2048 threads simultaneously, but is only capable of executing 256 instructions per tick. There are other limits too; e.g. if you're doing 32-bit floating point addition, it can only do 192 of those per tick. Doing left shifts on 32-bit integers? Only 64 per tick.
The point of having more threads is latency hiding: when one thread is blocked for some reason, such as waiting to fetch a value from memory or for the result of an arithmetic instruction, the SM runs a different thread instead. More threads give you more opportunities to hide this latency: more chances to have independent work available to do when some instructions are blocked, waiting for data.
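If you want to see the actual per-SM limits of the device you have, rather than what you compiled for, you can query them at runtime. A small sketch using cudaGetDeviceProperties:

    #include <cstdio>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);    // device 0

        printf("Compute capability: %d.%d\n", prop.major, prop.minor);
        printf("Max threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);
        printf("Shared mem per SM:  %zu bytes\n", prop.sharedMemPerMultiprocessor);
        return 0;
    }

On a CC 3.0 device this prints 2048 threads per SM regardless of which architecture flags the rest of your code was built with.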

How do a SM in CUDA run multiple blocks simultaneously?

In CUDA, can an SM run multiple blocks simultaneously if each block doesn't cost too many resources?
On Fermi, we know that an SM has 32KB of register space. Suppose a thread uses 32 registers; then this SM can launch one block which contains 256 ((32*1024)/(32*4)) threads. If an SM can run multiple blocks simultaneously, we could also configure 32 threads per block and 8 blocks for the SM. Is there any difference?
As #talonmies commented, your math is not entirely correct. But the key point is that an SM contains a balance of many different types of resources. The better your kernel and kernel launch parameters fit with this balance, the better your performance.
I haven't checked the numbers for Kepler (compute capability 3.x), but for Fermi (2.x), an SM can keep track of 48 concurrent warps (1,536 threads) and 8 concurrent blocks. This means that if you choose a low thread count for your blocks, the limit of 8 concurrent blocks becomes the limiting factor for occupancy in your kernel. For instance, if you choose 32 threads per block, you get at most 256 (8 * 32) concurrent threads running on the SM, while the SM could run up to 1,536 threads (48 * 32).
In the occupancy calculator, you can see what the different hardware limits are and it will tell you which of them becomes the limiting factor with your specific kernel. You can experiment with variations in launch parameters, shared memory usage and register usage to see how they affect your occupancy.
Occupancy is not everything when it comes to performance. Increased occupancy translates to an increased ability to hide the latency of memory transfers; once memory bandwidth is saturated, increasing occupancy further does not help. There is another effect in play as well: increasing the size of a block may decrease occupancy but at the same time increase the amount of instruction-level parallelism (ILP) available in your kernel. In this case, decreasing occupancy can increase performance.
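To illustrate the ILP point, here is a hedged sketch (names and the unroll factor are arbitrary) of a kernel that gives each thread four independent elements, trading resident warps for independent instructions within each thread:

    // Each thread handles ILP_FACTOR independent elements. The loads and
    // multiply-adds for the four elements have no dependencies on each
    // other, so the scheduler can keep the pipelines busy even when
    // fewer warps are resident.
    #define ILP_FACTOR 4

    __global__ void scaleAdd(const float *in, float *out, float s, int n)
    {
        int base = (blockIdx.x * blockDim.x + threadIdx.x) * ILP_FACTOR;
        #pragma unroll
        for (int k = 0; k < ILP_FACTOR; ++k) {
            int i = base + k;
            if (i < n)
                out[i] = in[i] * s + 1.0f;
        }
    }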

CUDA Kernel register size

On a compute capability 1.3 CUDA card, we run the following code:

    for (int i = 1; i < 20; ++i)
        kernelrun<<<30,320>>>(...);
We know that each SM has 8 SPs and can run 1024 threads, so the 30 SMs in a Tesla C1060 can run 30*1024 threads concurrently.
Given this code, how many threads can run concurrently?
If the kernelrun kernel uses 48 registers, what are the limitations on the Tesla C1060, which has 16384 registers and 16KB of shared memory per SM?
Since concurrent kernel execution is not supported on the Tesla C1060, how can we execute the kernel in the loop concurrently? Is it possible with streams? Is there only one copy engine and one execute engine in the Tesla C1060?
NVIDIA has been shipping an occupancy calculator, which you can use to answer this question for yourself, since 2007. You should try it.
But to answer your question: each SM in your compute capability 1.3 device has 16384 registers, so if your kernel is register limited, the number of threads per block would be roughly 320 (16384 / 48, rounded down to the nearest multiple of 32). There is also a register page allocation granularity to consider.
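Two standard tools help here (shown on a hypothetical kernel): compiling with -Xptxas -v makes ptxas print the registers and shared memory each kernel actually uses, and __launch_bounds__ asks the compiler to keep register usage low enough for a given block size (possibly by spilling to local memory):

    // Compile with: nvcc -arch=sm_13 -Xptxas -v kernel.cu
    // ptxas then reports per-kernel register and shared memory usage.

    // Ask the compiler to ensure a block of 320 threads can launch.
    __global__ void __launch_bounds__(320)
    kernelrun(const float *in, float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i] * 2.0f;    // placeholder body
    }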

CUDA, shared memory limit on CC 2.0 card

I know "Maximum amount of shared memory per multiprocessor" for GPU with Compute Capability 2.0 is 48KB as is said in the guide.
I'm a little confused about the amount of shared memory I can use for each block ? How many blocks are in a multiprocessor. I'm using GeForce GTX 580.
On Fermi, you can use up to 16KB or 48KB (depending on the configuration you select) of shared memory per block. The number of blocks which will run concurrently on a multiprocessor is determined by how much shared memory and how many registers each block requires, up to a maximum of 8. If you use 48KB, then only a single block can run concurrently. If you use 1KB per block, then up to 8 blocks could run concurrently per multiprocessor, depending on their register usage.
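The 16KB/48KB split is selected per kernel (or device-wide) through the runtime API. A minimal sketch, assuming a hypothetical kernel myKernel:

    __global__ void myKernel(float *data) { }

    int main()
    {
        // Prefer 48KB shared memory / 16KB L1 for this kernel...
        cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);
        // ...or prefer 48KB L1 / 16KB shared memory instead:
        // cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);

        myKernel<<<8, 256>>>(NULL);
        cudaDeviceSynchronize();
        return 0;
    }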