Using GPU Shared Memory : A check on my understanding - cuda

I have a GTX Titan that has 49152 bytes/block of shared memory. I'm trying to solve ~9000 coupled ODEs and would like to store these ~9000 concentrations, which are doubles, in shared memory in order to calculate the rate of change of each concentration.
So I'd just like to confirm that this is NOT possible, since a double is 8 bytes and 49152/8 = 6144. Right?

Your understanding is correct. You cannot simultaneously store 9000 double quantities in shared memory that is accessible by a single threadblock (i.e. in a single SM).

You can use the register file too! On each SM it is a significant fraction of the size of shared memory. Private per-thread variables held in register space can be exchanged with other threads in the same block by swapping them through local memory, and that register swapping could help compensate for the 48kB of shared memory per SM on the Titan.
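For concreteness, here is a minimal sketch (not the poster's actual code) of the usual workaround when the full array does not fit: stage the ~9000 concentrations through shared memory in tiles. 9000 doubles need 72,000 bytes, well over the 48 KB limit, but an 8 KB tile fits easily; rate_contribution() below is a placeholder for the real coupling term.

    #define TILE 1024   // 1024 doubles = 8 KB, comfortably under 48 KB

    // Placeholder for the real ODE coupling between species i and j.
    __device__ double rate_contribution(double ci, double cj, int i, int j)
    {
        return 0.0;
    }

    __global__ void rates(const double *conc, double *dcdt, int n)
    {
        __shared__ double tile[TILE];            // one tile of concentrations

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        double ci   = (i < n) ? conc[i] : 0.0;   // species owned by this thread
        double rate = 0.0;

        // Sweep the full concentration vector one tile at a time.
        for (int base = 0; base < n; base += TILE) {
            for (int k = threadIdx.x; k < TILE && base + k < n; k += blockDim.x)
                tile[k] = conc[base + k];        // cooperative load
            __syncthreads();

            if (i < n)
                for (int k = 0; k < TILE && base + k < n; ++k)
                    rate += rate_contribution(ci, tile[k], i, base + k);
            __syncthreads();
        }

        if (i < n)
            dcdt[i] = rate;
    }

Each thread owns one concentration and accumulates its rate while the block re-reads the vector in 8 KB chunks, so no block ever needs more than one tile of shared memory at a time.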

Related

total number of registers

I wanted to ask: we say that using --ptxas-options=-v doesn't give the exact number of registers that our program uses.
1) Then, how am I going to supply the occupancy calculator with registers per thread and shared memory per block?
2) In my program I also use Thrust calls, which generate PTX code. I have 2 kernels, but I can see that the Thrust functions produce PTX as well. So, should I also take these numbers into account when counting the total number of registers I use? (I think yes!)
(the same applies for the shared memory)
1) Then, how am I going to supply the occupancy calculator with registers per thread and shared memory per block?
The only other thing needed should be rounding up (if necessary) the output of ptxas to an even granularity of register allocation, which varies by device (see Greg's answer here). I think the common register allocation granularities are 4 and 8, but I don't have a table of register allocation granularity by compute capability.
I think shared memory also has an allocation granularity. Since the max number of threadblocks per SM is limited anyway, this should only matter (for occupancy) if your allocation/usage is within a granular amount of exceeding the limit for however many blocks you are otherwise limited to.
I think in most cases you'll get a pretty good feel by using the numbers from ptxas without rounding. If you feel you need this level of accuracy in the occupancy calculator, asking a nice directed question like "what are the allocation granularities for registers and shared memory for various GPUs" may get someone like Greg to give you a crisp answer.
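As a rough illustration of the rounding being described (a sketch only; the granularity values are illustrative assumptions, not a per-compute-capability table):

    #include <cstdio>

    // Round a value reported by ptxas up to an assumed allocation granularity
    // before feeding it to the occupancy calculator. The granularities used
    // below (4 registers, 128 bytes of shared memory) are assumptions.
    static int round_up(int value, int granularity)
    {
        return ((value + granularity - 1) / granularity) * granularity;
    }

    int main()
    {
        // e.g. ptxas reports 26 registers/thread and 1000 bytes smem/block
        printf("registers per thread: %d\n", round_up(26, 4));     // 28
        printf("smem bytes per block: %d\n", round_up(1000, 128)); // 1024
        return 0;
    }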
2) In my program I also use Thrust calls, which generate PTX code. I have 2 kernels, but I can see that the Thrust functions produce PTX as well. So, should I also take these numbers into account when counting the total number of registers I use? (I think yes!) (The same applies for the shared memory.)
Fundamentally I believe this thinking is incorrect. The only place I could see where it might matter is if you are running concurrent kernels, and I doubt that is the case since you mention thrust. The only figures that matter for occupancy are the metrics for a single kernel launch. You do not add threads, or registers, or shared memory across different kernels, to calculate resource usage. When a kernel completes execution, it releases its resource usage, at least for these resource types (registers, shared memory, threads).

the number of blocks that can be scheduled at the same time

This question also arose from the following link: shared memory optimization confusion
In the above link, from talonmies's answer, I found that the first condition on the number of blocks that will be scheduled to run is "8". I have 3 questions, shown below.
Does it mean that only 8 blocks can be scheduled at the same time when the number of blocks from conditions 2 and 3 is over 8? Is this regardless of any condition such as CUDA environment, GPU device, or algorithm?
If so, it really means that in some cases it is better not to use shared memory; it depends. Then we have to think about how to judge which is better: using or not using shared memory. I think one approach is to check whether there is a global memory access limitation (a memory bandwidth bottleneck) or not, and to choose "not using shared memory" if there is no such limitation. Is that a good approach?
In addition to question 2 above, I think that if the data my CUDA program has to handle is huge, then "not using shared memory" may be better, because it is hard to handle it all within shared memory. Is that also a good approach?
The number of concurrently scheduled blocks is always going to be limited by something.
Playing with the CUDA Occupancy Calculator should make it clear how it works. The usage of three types of resources affects the number of concurrently scheduled blocks: Threads Per Block, Registers Per Thread, and Shared Memory Per Block.
If you set up a kernel that uses 1 Thread Per Block, 1 Register Per Thread and 1 byte of Shared Memory Per Block on Compute Capability 2.0, you are limited by Max Blocks per Multiprocessor, which is 8. If you start increasing Shared Memory Per Block, the Max Blocks per Multiprocessor will continue to be your limiting factor until you reach a threshold at which Shared Memory Per Block becomes the limiting factor. Since there are 49152 bytes of shared memory per SM, that happens at around 49152 / 8 = 6144 bytes per block (it's a bit less, because some shared memory is used by the system and it's allocated in chunks of 128 bytes).
In other words, given the limit of 8 Max Blocks per Multiprocessor, using shared memory is completely free (as it relates to the number of concurrently running blocks), as long as you stay below the threshold at which Shared Memory Per Block becomes the limiting factor.
The same goes for register usage.
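If your toolkit is recent enough to ship the occupancy API, you can also probe this threshold directly in code rather than in the spreadsheet. A minimal sketch (dummy_kernel and the 128-thread block size are just placeholders):

    #include <cstdio>

    __global__ void dummy_kernel(float *p)
    {
        if (p) p[threadIdx.x] = 0.0f;
    }

    int main()
    {
        // Report resident blocks per SM as the dynamic shared memory request
        // grows; the point where the count starts dropping is the threshold
        // discussed above.
        for (size_t smem = 0; smem <= 48 * 1024; smem += 4096) {
            int blocks = 0;
            cudaOccupancyMaxActiveBlocksPerMultiprocessor(
                &blocks, dummy_kernel, 128 /* threads per block */, smem);
            printf("%6zu bytes smem/block -> %d resident blocks per SM\n",
                   smem, blocks);
        }
        return 0;
    }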

CUDA, shared memory limit on CC 2.0 card

I know "Maximum amount of shared memory per multiprocessor" for GPU with Compute Capability 2.0 is 48KB as is said in the guide.
I'm a little confused about the amount of shared memory I can use for each block ? How many blocks are in a multiprocessor. I'm using GeForce GTX 580.
On Fermi, you can use up to 16kb or 48kb (depending on the configuration you select) of shared memory per block - the number of blocks which will run concurrently on a multiprocessor is determined by how much shared memory and registers each block requires, up to a maximum of 8. If you use 48kb, then only a single block can run concurrently. If you use 1kb per block, then up to 8 blocks could run concurrently per multiprocessor, depending on their register usage.
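For reference, one way to select the 16kb/48kb configuration mentioned above is from the host API; a short sketch (my_kernel is a placeholder name):

    __global__ void my_kernel() { /* ... */ }

    void configure()
    {
        // Request the 48 KB shared memory / 16 KB L1 split for this kernel;
        // cudaFuncCachePreferL1 would give 16 KB shared memory / 48 KB L1.
        cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferShared);
    }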

shared memory optimization confusion

I have written an application in CUDA which uses 1kb of shared memory in each block.
Since there is only 16kb of shared memory in each SM, only 16 blocks can be accommodated overall (am I understanding it correctly?), though at a time only 8 can be scheduled. Now, if some block is busy doing a memory operation, another block will be scheduled on the GPU, but all the shared memory is already used by the 16 blocks which have been scheduled there. So will CUDA not schedule more blocks on the same SM unless the previously allocated blocks are completely finished? Or will it move some block's shared memory to global memory and allocate another block there (and in that case, should we worry about global memory access latency)?
It does not work like that. The number of blocks which will be scheduled to run at any given moment on a single SM will always be the minimum of the following:
8 blocks
The number of blocks whose sum of static and dynamically allocated shared memory is less than 16kb or 48kb, depending on GPU architecture and settings. There are also shared memory page size limitations, which mean that per-block allocations get rounded up to the next multiple of the page size.
The number of blocks whose sum of per-block register usage is less than 8192/16384/32768, depending on architecture. There are also register file page sizes, which mean that per-block allocations get rounded up to the next multiple of the page size.
That is all there is to it. There is no "paging" of shared memory to accommodate more blocks. NVIDIA produces a spreadsheet for computing occupancy which ships with the toolkit and is available as a separate download. You can see the exact rules in the formulas it contains. They are also discussed in section 4.2 of the CUDA programming guide.
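As a rough worked example of taking the minimum of those three limits (the register figures are made-up numbers, and the page-size rounding above is ignored):

    #include <algorithm>
    #include <cstdio>

    int main()
    {
        const int max_blocks_per_sm = 8;          // architectural limit
        const int smem_per_sm       = 16 * 1024;  // 16kb on this GPU
        const int smem_per_block    = 1024;       // 1kb used by each block
        const int regs_per_sm       = 16384;
        const int regs_per_block    = 256 * 8;    // e.g. 256 threads x 8 regs

        int by_smem  = smem_per_sm / smem_per_block;   // 16
        int by_regs  = regs_per_sm / regs_per_block;   // 8
        int resident = std::min(max_blocks_per_sm,
                                std::min(by_smem, by_regs));
        printf("resident blocks per SM: %d\n", resident);  // 8 in this case
        return 0;
    }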

Size of statically allocated shared memory per block with Compute Prof (Cuda/OpenCL)

In Nvidia's compute prof there is a column called "static private mem per work group" and the tooltip of it says "Size of statically allocated shared memory per block". My application shows that I am getting 64 (bytes I assume) per block. Does that mean I am using somewhere between 1-64 of those bytes or is the profiler just telling me that this amount of shared memory was allocated and who knows if it was used at all?
If it's allocated, it's probably because you used it. AFAIK CUDA passes parameters to kernels via shared memory, so it must be that.
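A small illustration of that last point (hedged, as the answer itself is): a kernel that declares no __shared__ arrays can still show a few dozen bytes of static shared memory on older devices that stage kernel arguments through shared memory.

    // No explicit __shared__ usage, yet a small static shared memory figure
    // (like the 64 bytes above) can still be reported for it on devices that
    // pass kernel arguments through shared memory.
    __global__ void scale(float *out, const float *in, float factor)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i] * factor;
    }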