CUDA, shared memory limit on CC 2.0 card - cuda

I know the "maximum amount of shared memory per multiprocessor" for a GPU with compute capability 2.0 is 48 KB, as stated in the programming guide.
I'm a little confused about how much shared memory I can use for each block. How many blocks can reside on a multiprocessor? I'm using a GeForce GTX 580.

On Fermi, you can use up to 16 KB or 48 KB (depending on the configuration you select) of shared memory per block. The number of blocks that will run concurrently on a multiprocessor is determined by how much shared memory and how many registers each block requires, up to a maximum of 8. If you use 48 KB, then only a single block can run concurrently. If you use 1 KB per block, then up to 8 blocks could run concurrently per multiprocessor, depending on their register usage.
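As a minimal sketch, the 16 KB / 48 KB split mentioned above is selected per kernel with `cudaFuncSetCacheConfig` (the kernel name and sizes here are made up for illustration):

```cuda
#include <cuda_runtime.h>

__global__ void myKernel(float *data)    // hypothetical kernel
{
    __shared__ float buf[1024];          // 4 KB of static shared memory
    buf[threadIdx.x] = data[threadIdx.x];
    __syncthreads();
    data[threadIdx.x] = buf[threadIdx.x];
}

int main()
{
    float *d_data;
    cudaMalloc(&d_data, 1024 * sizeof(float));

    // Request the 48 KB shared / 16 KB L1 split for this kernel.
    // (cudaFuncCachePreferL1 would request 16 KB shared / 48 KB L1 instead.)
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);

    myKernel<<<1, 1024>>>(d_data);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```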

Related

Different Kernels sharing SMx [duplicate]

Is it possible, using streams, to have multiple unique kernels on the same streaming multiprocessor in Kepler 3.5 GPUs? I.e. run 30 kernels of size <<<1,1024>>> at the same time on a Kepler GPU with 15 SMs?
On a compute capability 3.5 device, it might be possible.
Those devices support up to 32 concurrent kernels per GPU and 2048 threads per multiprocessor. With 64K registers per multiprocessor, two blocks of 1024 threads could run concurrently if their register footprint was no more than 32 per thread (65536 / 2048) and each block used no more than 24 KB of shared memory.
You can find all of this in the hardware description in the appendices of the CUDA programming guide.
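A rough sketch of the launch pattern from the question: each kernel goes into its own stream, which is what allows the hardware to overlap them if the per-SM resource limits above are met (the kernel body here is a placeholder):

```cuda
#include <cuda_runtime.h>

__global__ void busyKernel(int id)       // placeholder work
{
    // real work would go here
}

int main()
{
    const int nKernels = 30;             // 30 kernels of <<<1,1024>>>, as asked
    cudaStream_t streams[nKernels];
    for (int i = 0; i < nKernels; ++i)
        cudaStreamCreate(&streams[i]);

    // Launches in different streams may execute concurrently,
    // resources permitting; launches in one stream serialize.
    for (int i = 0; i < nKernels; ++i)
        busyKernel<<<1, 1024, 0, streams[i]>>>(i);

    cudaDeviceSynchronize();
    for (int i = 0; i < nKernels; ++i)
        cudaStreamDestroy(streams[i]);
    return 0;
}
```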

CUDA shared memory occupancy

If I have 48 KB of shared memory per SM and I write a kernel that allocates 32 KB of shared memory, does that mean only 1 block can be running on one SM at the same time?
Yes, that is correct.
Shared memory must support the "footprint" of all "resident" threadblocks. In order for a threadblock to be launched on a SM, there must be enough shared memory to support it. If not, it will wait until the presently executing threadblock has finished.
There is some nuance to this arriving with Maxwell GPUs (cc 5.0, 5.2). These GPUs support either 64KB (cc 5.0) or 96KB (cc 5.2) of shared memory. In this case, the maximum shared memory available to a single threadblock is still limited to 48KB, but multiple threadblocks may use more than 48KB in aggregate, on a single SM. This means a cc 5.2 SM could support 2 threadblocks, even if both were using 32KB shared memory.
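You can also ask the runtime directly instead of doing the arithmetic by hand. A sketch using `cudaOccupancyMaxActiveBlocksPerMultiprocessor` (available in CUDA 6.5 and later; kernel and block size are made up for illustration):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void smemKernel(float *out)
{
    __shared__ float buf[8192];          // 8192 * 4 B = 32 KB static shared memory
    buf[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = buf[threadIdx.x];
}

int main()
{
    int numBlocks = 0;
    // How many blocks of 256 threads fit on one SM, given this
    // kernel's register and shared-memory footprint?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, smemKernel,
                                                  256, /*dynamic smem*/ 0);
    printf("Resident blocks per SM: %d\n", numBlocks);
    return 0;
}
```

On a 48 KB-per-SM device this would report 1 resident block, matching the answer above.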

kepler blocks per mp?

I am reading the Kepler whitepaper here,
which says that Kepler supports up to 16 blocks / MP.
But threads/block = 1024 and threads/MP = 2048, hence blocks/MP = 2.
Am I missing something here?
As you said, on Kepler a streaming multiprocessor can run up to 16 thread blocks.
In your example, if a thread block consists of 1024 threads, then only two blocks can be launched at the same time on one MP, because in this case you are limited by the maximum number of threads per multiprocessor: 2048 / 1024 = 2 blocks.
Several factors influence how many blocks can run concurrently on a streaming multiprocessor. An SM also has a limited number of registers and a limited amount of shared memory. If you use too many registers or too much shared memory, you will be limited by those factors instead.
A good overview of this is the CUDA occupancy calculator. With the spreadsheet you can easily set up a kernel configuration for any CUDA architecture and see what the kernel will be limited by.
The CUDA programming guide also provides all the required information.
Maybe a simple example can help - done with occupancy calculator for compute capability 3.0:
If your thread block consists of 512 threads and uses hardly any registers or shared memory, then the number of parallel blocks is only influenced by the block size.
For cc 3.0, 2048 threads can be launched per SM. So 2048 / 512 = 4: only 4 thread blocks can run at the same time.
In the second step, assume the kernel additionally uses 48 registers per thread.
Per thread block, 512 * 48 = 24576 registers will be used, but an SM only has 65536 registers. Now only two blocks can run instead of four.
In the last step, assume each block uses 32000 bytes of shared memory. Because an SM only has 49152 bytes of shared memory, only one thread block can run at a time.

How Concurrent blocks can run a single GPU streaming multiprocessor?

I was studying the CUDA programming structure, and what I felt after studying is that, after the blocks and threads are created, each block is assigned to a streaming multiprocessor (e.g. I am using a GeForce 560Ti, which has 14 streaming multiprocessors, so at one time 14 blocks can be assigned across all the streaming multiprocessors). But as I am going through several online materials, such as this one:
http://moss.csc.ncsu.edu/~mueller/cluster/nvidia/GPU+CUDA.pdf
it has been mentioned that several blocks can run concurrently on one multiprocessor. I am basically very confused about the execution of the threads and the blocks on the streaming multiprocessors. I know that the assignment of blocks and the execution of the threads are absolutely arbitrary, but I would like to know how the mapping of the blocks and the threads actually happens so that the concurrent execution can occur.
The streaming multiprocessors (SMs) can execute more than one block at a time using hardware multithreading, a process akin to Hyper-Threading.
The CUDA C Programming Guide describes this as follows in Section 4.2:
4.2 Hardware Multithreading
The execution context (program counters, registers, etc) for each warp
processed by a multiprocessor is maintained on-chip during the entire
lifetime of the warp. Therefore, switching from one execution context
to another has no cost, and at every instruction issue time, a warp
scheduler selects a warp that has threads ready to execute its next
instruction (the active threads of the warp) and issues the
instruction to those threads.
In particular, each multiprocessor has a set of 32-bit registers that
are partitioned among the warps, and a parallel data cache or shared
memory that is partitioned among the thread blocks.
The number of blocks and warps that can reside and be processed
together on the multiprocessor for a given kernel depends on the
amount of registers and shared memory used by the kernel and the
amount of registers and shared memory available on the multiprocessor.
There are also a maximum number of resident blocks and a maximum
number of resident warps per multiprocessor. These limits, as well as the
amount of registers and shared memory available on the multiprocessor
are a function of the compute capability of the device and are given
in Appendix F. If there are not enough registers or shared memory
available per multiprocessor to process at least one block, the kernel
will fail to launch.

My GPU has 2 multiprocessors with 48 CUDA cores each. What does this mean?

My GPU has 2 multiprocessors with 48 CUDA cores each. Does this mean that I can execute 96 thread blocks in parallel?
No it doesn't.
From chapter 4 of the CUDA C programming guide:
The number of blocks and warps that can reside and be processed together on the multiprocessor for a given kernel depends on the amount of registers and shared memory used by the kernel and the amount of registers and shared memory available on the multiprocessor. There are also a maximum number of resident blocks and a maximum number of resident warps per multiprocessor. These limits, as well as the amount of registers and shared memory available on the multiprocessor, are a function of the compute capability of the device and are given in Appendix F. If there are not enough registers or shared memory available per multiprocessor to process at least one block, the kernel will fail to launch.
Get the guide at: http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf
To check the limits for your specific device, compile and execute the deviceQuery example from the SDK.
For compute capability 2.x and earlier, the maximum number of resident blocks per multiprocessor is 8 across all compute capabilities (Kepler later raised this to 16).
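If you'd rather query the limits programmatically than run deviceQuery, a minimal sketch with `cudaGetDeviceProperties` looks like this (field names are from the CUDA runtime API; which fields exist depends on your CUDA version):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0

    printf("Device:                  %s\n", prop.name);
    printf("Compute capability:      %d.%d\n", prop.major, prop.minor);
    printf("Multiprocessors:         %d\n", prop.multiProcessorCount);
    printf("Max threads per SM:      %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Registers per block:     %d\n", prop.regsPerBlock);
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}
```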
This comes down to semantics. What does "execute" and "running in parallel" really mean?
At a basic level, having 96 CUDA cores really means that you have a potential throughput of 96 results of calculations per cycle of the core clock.
A core is mainly an Arithmetic Logic Unit (ALU), it performs basic arithmetic and logical operations. Aside from access to an ALU, a thread needs other resources, such as registers, shared memory and global memory to run. The GPU will keep many threads "in flight" to keep all these resources utilized to the fullest. The number of threads "in flight" will typically be much higher than the number of cores. On one hand, these threads can be seen as being "executed in parallel" because they are all consuming resources on the GPU at the same time. But on the other hand, most of them are actually waiting for something, such as data to arrive from global memory or for results of arithmetic to go through the pipelines in the cores. The GPU puts threads that are waiting for something on the "back burner". They are consuming some resources, but are they actually running? :)
The number of concurrently executed threads depends on your code and on the type of your CUDA device. For example, Fermi has two warp schedulers per streaming multiprocessor, and on each clock cycle two half-warps can be scheduled for computation, a memory load, or a transcendental function calculation. While one half-warp waits on a load or executes a transcendental function, the CUDA cores may execute something else. So you can keep 96 threads busy on the cores, but only if your code allows it. And, of course, you must have enough memory.