CUDA shared memory occupancy - cuda

If I have a 48kB shared memory per SM and I make a kernel where I allocate 32kB in shared memory that means that only 1 block can be running on one SM at the same time?

Yes, that is correct.
Shared memory must support the "footprint" of all "resident" threadblocks. In order for a threadblock to be launched on a SM, there must be enough shared memory to support it. If not, it will wait until the presently executing threadblock has finished.
There is some nuance to this arriving with Maxwell GPUs (cc 5.0, 5.2). These GPUs support either 64KB (cc 5.0) or 96KB (cc 5.2) of shared memory. In this case, the maximum shared memory available to a single threadblock is still limited to 48KB, but multiple threadblocks may use more than 48KB in aggregate, on a single SM. This means a cc 5.2 SM could support 2 threadblocks, even if both were using 32KB shared memory.

Related

amount of data that can be hold in shared memory CUDA

In my gpu max threads per block is 1024. I am working on a image processing project using CUDA. Now if I want to use shared memory is that mean that I can only work with 1024 pixels using one block and need to copy only those 1024 elements to the shared memory
Your question is quite unclear, so I will answer to what is asked in the title.
The amount of data that can be hold in shared memory in CUDA depends on the Compute Capability of your GPU.
For instance, on CC 2.x and 3.x :
On devices of compute capability 2.x and 3.x, each multiprocessor has 64KB of on-chip memory that can be partitioned between L1 cache and shared memory.
See Configuring the amount of shared memory section here : Nvidia Parallel Forall Devblog : Using Shared Memory in CUDA C/C++
The optimization you have to think about is to avoid bank conflicts by mapping the threads' access to memory banks. This is introduced in this blog and you should read about it.

CUDA Programming - Shared memory configuration

Could you please explain the differences between using both "16 KB shared memory + 48K L1 cache" or "48 KB shared memory + 16 KB L1 cache" in CUDA programming? What should I expect in time execution? When could I expect smaller gpu time?
On Fermi and Kepler nVIDIA GPUs, each SM has a 64KB chunk of memory which can be configured as 16/48 or 48/16 shared memory/L1 cache. Which mode you use depends on how much use of shared memory your kernel makes. If your kernel uses a lot of shared memory then you would probably find that configuring it as 48KB shared memory allows higher occupancy and hence better performance.
On the other hand, if your kernel does not use shared memory at all, or if it only uses a very small amount per thread, then you would configure it as 48KB L1 cache.
How much a "very small amount" is is probably best illustrated with the occupancy calculator which is a spreadsheet included with the CUDA Toolkit and also available here. This spreadsheet allows you to investigate the effect of different shared memory per block and different block sizes.

CUDA: Is coalesced global memory access faster than shared memory? Also, does allocating a large shared memory array slow down the program?

I'm not finding an improvement in speed with shared memory on an NVIDIA Tesla M2050
with about 49K shared memory per block. Actually if I allocate
a large char array in shared memory it slows down my program. For example
__shared__ char database[49000];
gives me slower running times than
__shared__ char database[4900];
The program accesses only the first 100 chars of database so the extra space
is unnecessary. I can't figure out why this is happening. Any help would be appreciated.
Thanks.
The reason for the relatively poor performance of CUDA shared memory when using larger arrays may have to do with the fact that each multiprocessor has a limited amount of available shared memory.
Each multiprocessor hosts several processors; for modern devices, typically 32, the number of threads in a warp. This means that, in the absence of divergence or memory stalls, the average processing rate is 32 instructions per cycle (latency is high due to pipelining).
CUDA schedules several blocks to a multiprocessor. Each block consists of several warps. When a warp stalls on a global memory access (even coalesced accesses have high latency), other warps are processed. This effectively hides the latency, which is why high-latency global memory is acceptable in GPUs. To effectively hide latency, you need enough extra warps to execute until the stalled warp can continue. If all warps stall on memory accesses, you can no longer hide the latency.
Shared memory is allocated to blocks in CUDA, and stored on a singly multiprocessor on the GPU device. Each multiprocessor has a relatively small, fixed amount of shared memory space. CUDA cannot schedule more blocks to multiprocessors than the multiprocessors can support in terms of shared memory and register usage. In other words, if the amount of shared memory on a multiprocessor is X and each block requires Y shared memory, CUDA will schedule no more than floor(X/Y) blocks at a time to each multiprocessor (it might be less since there are other constraints, such as register usage).
Ergo, by increasing shared memory usage of a block, you might be reducing the number of active warps - the occupancy - of your kernel, thereby hurting performance. You should look into your kernel code by compiling with the -Xptxas="-v" flag; this should give you register and shared & constant memory usage for each kernel. Use this data and your kernel launch parameters, as well as other required information, in the most recent version of the CUDA Occupancy Calculator to determine whether you might be affected by occupancy.
EDIT:
To address the other part of your question, assuming no shared memory bank conflicts and perfect coalescing of global memory accesses... there are two dimensions to this answer: latency and bandwidth. The latency of shared memory will be lower than that of global memory, since shared memory is on-chip. The bandwidth will be much the same. Ergo, if you are able to hide global memory access latency through coalescing, there is no penalty (note: the access pattern is important here, in that shared memory allows for potentially more diverse access patterns with little to no performance loss, so there can be benefits to using shared memory even if you can hide all the global memory latency).
Also, if you increase the shared memory per-block, CUDA will schedule grids with less concurrent blocks, so they all have enough shared memory, so it reduces parallelism and increases execution time.
The amount of resources available on the gpu are limited. The number of blocks running concurrently is roughly inversly proportional to the size of shared memory per block.
This explains why the runtime is slower when you launch the kernel which uses a really large amount of shared memory.

My GPU has 2 multiprocessors with 48 CUDA cores each. What does this mean?

My GPU has 2 multiprocessors with 48 CUDA cores each. Does this mean that I can execute 96 thread blocks in parallel?
No it doesn't.
From chapter 4 of the CUDA C programming guide:
The number of blocks and warps that can reside and be processed together on the multiprocessor for a given kernel depends on the amount of registers and shared memory used by the kernel and the amount of registers and shared memory available on the multiprocessor. There are also a maximum number of resident blocks and a maximum number of resident warps per multiprocessor. These limits as well the amount of registers and shared memory available on the multiprocessor are a function of the compute capability of the device and are given in Appendix F. If there are not enough registers or shared memory available per multiprocessor to process at least one block, the kernel will fail to launch.
Get the guide at: http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf
To check the limits for your specific device compile and execute the cudaDeviceQuery example from the SDK.
So far the maximum number of resident blocks per multiprocessor is the same across all compute capabilities and is equal to 8.
This comes down to semantics. What does "execute" and "running in parallel" really mean?
At a basic level, having 96 CUDA cores really means that you have a potential throughput of 96 results of calculations per cycle of the core clock.
A core is mainly an Arithmetic Logic Unit (ALU), it performs basic arithmetic and logical operations. Aside from access to an ALU, a thread needs other resources, such as registers, shared memory and global memory to run. The GPU will keep many threads "in flight" to keep all these resources utilized to the fullest. The number of threads "in flight" will typically be much higher than the number of cores. On one hand, these threads can be seen as being "executed in parallel" because they are all consuming resources on the GPU at the same time. But on the other hand, most of them are actually waiting for something, such as data to arrive from global memory or for results of arithmetic to go through the pipelines in the cores. The GPU puts threads that are waiting for something on the "back burner". They are consuming some resources, but are they actually running? :)
The number of concurrently executed threads depends on your code and type of your CUDA device. For example Fermi has 2 thread schedulers for each stream multiprocessor and for current CPU clock will be scheduled 2 half-warps for calculation or memory load or transcendent function calculation. While one half-warp wait load or executed transcendent function CUDA cores may execute anything else. So you can get 96 threads on cores but if your code may get it. And, of course, you must have enough memory.

CUDA, shared memory limit on CC 2.0 card

I know "Maximum amount of shared memory per multiprocessor" for GPU with Compute Capability 2.0 is 48KB as is said in the guide.
I'm a little confused about the amount of shared memory I can use for each block ? How many blocks are in a multiprocessor. I'm using GeForce GTX 580.
On Fermi, you can use up to 16kb or 48kb (depending on the configuration you select) of shared memory per block - the number of blocks which will run concurrently on a multiprocessor is determined by how much shared memory and registers each block requires, up to a maximum of 8. If you use 48kb, then only a single block can run concurrently. If you use 1kb per block, then up to 8 blocks could run concurrently per multiprocessor, depending on their register usage.