the number of blocks can be scheduled at the same time - cuda

This question is also started from following link: shared memory optimization confusion
In above link, from talonmies's answer, I found that the first condition of the number of blocks which will be scheduled to run is "8". I have 3 questions as shown in below.
Does it mean that only 8 blocks can be scheduled at the same time when the number of blocks from condition 2 and 3 is over 8? Is it regardless of any condition such as cuda environment, gpu device, or algorithm?
If so, it really means that it is better not to use shared memory in some cases, it depends. Then we have to think how can we judge which one is better, using or not using shared memory. I think one approach is checking whether there is global memory access limitation (memory bandwidth bottleneck) or not. It means we can select "not using shared memory" if there is no global memory access limitation. Is it good approach?
Plus above question 2, I think if the data that my CUDA program should handle is huge, then we can think "not using shared memory" is better because it is hard to handle within the shared memory. Is it also good approach?

The number of concurrently scheduled blocks are always going to be limited by something.
Playing with the CUDA Occupancy Calculator should make it clear how it works. The usage of three types of resources affect the number of concurrently scheduled blocks. They are, Threads Per Block, Registers Per Thread and Shared Memory Per Block.
If you set up a kernel that uses 1 Threads Per Block, 1 Registers Per Thread and 1 Shared Memory Per Block on Compute Capability 2.0, you are limited by Max Blocks per Multiprocessor, which is 8. If you start increasing Shared Memory Per Block, the Max Blocks per Multiprocessor will continue to be your limiting factor until you reach a threshold at which Shared Memory Per Block becomes the limiting factor. Since there are 49152 bytes of shared memory per SM, that happens at around 8 / 49152 = 6144 bytes (It's a bit less because some shared memory is used by the system and it's allocated in chunks of 128 bytes).
In other words, given the limit of 8 Max Blocks per Multiprocessor, using shared memory is completely free (as it relates to the number of concurrently running blocks), as long as you stay below the threshold at which Shared Memory Per Block becomes the limiting factor.
The same goes for register usage.

Related

total number of registers

I wanted to ask.We say that using --ptxas-options=-v doesn't give the exact number of registers that our program uses.
1) Then , how am I going to supply the occupancu calculator with registers per thread and shared memory per block?
2) In my program I use also thrust calls which generate ptx code.I am having 2 kernels but I can see the thrust functions to produce ptx as well.So , I am taking into account these numbers also when I am counting the total number of registers I use? (I think yes!)
(the same applies for the shared memory)
1) Then , how am I going to supply the occupancy calculator with registers per thread and shared memory per block?
The only other thing needed should be rounding up (if necessary) the output of ptxas to an even granularity of register allocation, which varies by device (see Greg's answer here) I think the common register allocation granularities are 4 and 8, but I don't have a table of register allocation granularity by compute capability.
I think shared memory also has an allocation granularity. Since the max number of threadblocks per SM is limited anyway, this should only matter (for occupancy) if your allocation/usage is within a granular amount of exceeding the limit for however many blocks you are otherwise limited to.
I think in most cases you'll get a pretty good feel by using the numbers from ptxas without rounding. If you feel you need this level of accuracy in the occupancy calculator, asking a nice directed question like "what are the allocation granularities for registers and shared memory for various GPUs" may get someone like Greg to give you a crisp answer.
2) In my program I use also thrust calls which generate ptx code.I am having 2 kernels but I can see the thrust functions to produce ptx as well.So , I am taking into account these numbers also when I am counting the total number of registers I use? (I think yes!) (the same applies for the shared memory)
Fundamentally I believe this thinking is incorrect. The only place I could see where it might matter is if you are running concurrent kernels, and I doubt that is the case since you mention thrust. The only figures that matter for occupancy are the metrics for a single kernel launch. You do not add threads, or registers, or shared memory across different kernels, to calculate resource usage. When a kernel completes execution, it releases its resource usage, at least for these resource types (registers, shared memory, threads).

CUDA blocks & warps - which can run in parallel on a single SM?

Ok I know that related questions have been asked over and over again and I read pretty much everything I found about this, but things are still unclear. Probably also because I found and read things contradicting each other (maybe because, being from different times, they referred to devices with different compute capability, between which there seems to be quite a gap). I am looking to be more efficient, to reduce my execution time and thus I need to know exactly how many threads/warps/blocks can run at once in parallel. Also I was thinking of generalizing this and calculating an optimal number of threads and blocks to pass to my kernel based only on the number of operations I know I have to do (for simpler programs) and the system specs.
I have a GTX 550Ti, btw with compute capability 2.1.
4 SMs x 48 cores = 192 CUDA cores.
Ok so what's unclear to me is:
Can more than 1 block run AT ONCE (in parallel) on a multiprocessor (SM)? I read that up to 8 blocks can be assigned to a SM, but nothing as to how they're ran. From the fact that my max number of threads per SM (1536) is barely larger than my max number of threads per block (1024) I would think that blocks aren't ran in parallel (maybe 1 and a half?). Or at least not if I have a max number of threads on them. Also if I set the number of blocks to, let's say 4 (my number of SMs), will they be sent to a different SM each?
Or I can't really control how all this is distributed on the hardware and then this is a moot point, my execution time will vary based on the whims of my device ...
Secondly, I know that a block will divide it's threads into groups of 32 threads that run in parallel, called warps. Now these warps (presuming they have no relation to each other) can be ran in parallel aswell? Because in the Fermi architecture it states that 2 warps are executed concurrently, sending one instruction from each warp to a group of 16 (?) cores, while somewhere else i read that each core handles a warp, which would explain the 1536 max threads (32*48) but seems a bit much. Can 1 CUDA core handle 32 threads concurrently?
On a simpler note, what I'm asking is: (for ex) if I want to sum 2 vectors in a third one, what length should I give them (nr of operations) and how should I split them in blocks and threads for my device to work concurrently (in parallel) at full capacity (without having idle cores or SMs).
I'm sorry if this was asked before and I didn't get it or didn't see it. Hope you can help me. Thank you!
The distribution and parallel execution of work are determined by the launch configuration and the device. The launch configuration states the grid dimensions, block dimensions, registers per thread, and shared memory per block. Based upon this information and the device you can determine the number of blocks and warps that can execute on the device concurrently. When developing a kernel you usually look at the ratio of warps that can be active on the SM to the maximum number of warps per SM for the device. This is called the theoretical occupancy. The CUDA Occupancy Calculator can be used to investigate different launch configurations.
When a grid is launched the compute work distributor will rasterize the grid and distribute thread blocks to SMs and SM resources will be allocated for the thread block. Multiple thread blocks can execute simultaneously on the SM if the SM has sufficient resources.
In order to launch a warp, the SM assigns the warp to a warp scheduler and allocates registers for the warp. At this point the warp is considered an active warp.
Each warp scheduler manages a set of warps (24 on Fermi, 16 on Kepler). Warps that are not stalled are called eligible warps. On each cycle the warp scheduler picks an eligible warp and issue instruction(s) for the warp to execution units such as int/fp units, double precision floating point units, special function units, branch resolution units, and load store units. The execution units are pipelined allowing many warps to have 1 or more instructions in flight each cycle. Warps can be stalled on instruction fetch, data dependencies, execution dependencies, barriers, etc.
Each kernel has a different optimal launch configuration. Tools such as Nsight Visual Studio Edition and the NVIDIA Visual Profiler can help you tune your launch configuration. I recommend that you try to write your code in a flexible manner so you can try multiple launch configurations. I would start by using a configuration that gives you at least 50% occupancy then try increasing and decreasing the occupancy.
Answers to each Question
Q: Can more than 1 block run AT ONCE (in parallel) on a multiprocessor (SM)?
Yes, the maximum number is based upon the compute capability of the device. See Tabe 10. Technical Specifications per Compute Capability : Maximum number of residents blocks per multiprocessor to determine the value. In general the launch configuration limits the run time value. See the occupancy calculator or one of the NVIDIA analysis tools for more details.
Q:From the fact that my max number of threads per SM (1536) is barely larger than my max number of threads per block (1024) I would think that blocks aren't ran in parallel (maybe 1 and a half?).
The launch configuration determines the number of blocks per SM. The ratio of maximum threads per block to maximum threads per SM is set to allow developer more flexibility in how they partition work.
Q: If I set the number of blocks to, let's say 4 (my number of SMs), will they be sent to a different SM each? Or I can't really control how all this is distributed on the hardware and then this is a moot point, my execution time will vary based on the whims of my device ...
You have limited control of work distribution. You can artificially control this by limiting occupancy by allocating more shared memory but this is an advanced optimization.
Q: Secondly, I know that a block will divide it's threads into groups of 32 threads that run in parallel, called warps. Now these warps (presuming they have no relation to each other) can be ran in parallel as well?
Yes, warps can run in parallel.
Q: Because in the Fermi architecture it states that 2 warps are executed concurrently
Each Fermi SM has 2 warps schedulers. Each warp scheduler can dispatch instruction(s) for 1 warp each cycle. Instruction execution is pipelined so many warps can have 1 or more instructions in flight every cycle.
Q: Sending one instruction from each warp to a group of 16 (?) cores, while somewhere else i read that each core handles a warp, which would explain the 1536 max threads (32x48) but seems a bit much. Can 1 CUDA core handle 32 threads concurrently?
Yes. CUDA cores is the number of integer and floating point execution units. The SM has other types of execution units which I listed above. The GTX550 is a CC 2.1 device. On each cycle a SM has the potential to dispatch at most 4 instructions (128 threads) per cycle. Depending on the definition of execution the total threads in flight per cycle can range from many hundreds to many thousands.
I am looking to be more efficient, to reduce my execution time and thus I need to know exactly how many threads/warps/blocks can run at once in parallel.
In short, the number of threads/warps/blocks that can run concurrently depends on several factors. The CUDA C Best Practices Guide has a writeup on Execution Configuration Optimizations that explains these factors and provides some tips for reasoning about how to shape your application.
One of the concepts that took a whle to sink in, for me, is the efficiency of the hardware support for context-switching on the CUDA chip.
Consequently, a context-switch occurs on every memory access, allowing calculations to proceed for many contexts alternately while the others wait on theri memory accesses. ne of the ways that GPGPU architectures achieve performance is the ability to parallelize this way, in addition to parallelizing on the multiples cores.
Best performance is achieved when no core is ever waiting on a memory access, and is achieved by having just enough contexts to ensure this happens.

Maximum (shared memory per block) / (threads per block) in CUDA with 100% MP load

I'm trying to process array of big structures with CUDA 2.0 (NVIDIA 590). I'd like to use shared memory for it. I've experimented with CUDA occupancy calculator, trying to allocate maximum shared memory per thread, so that each thread can process whole element of array.
However maximum of (shared memory per block) / (threads per block) I can see in calculator with 100% Multiprocessor load is 32 bytes, which is not enough for single element (on the order of magnitude).
Is 32 bytes a maximum possible value for (shared memory per block) / (threads per block)?
Is it possible to say which alter4native is preferable - allocate part of array in global memory or just use underloaded multiprocessor? Or it can only be decided by experiment?
Yet another alternative I can see is to process array in several passes, but it looks like a last resort.
That is first time I'm trying something really complex with CUDA, so I could be missing some other options...
There are many hardware limitations you need to keep in mind when designing a CUDA kernel. Here are some of the constraints you need to consider:
maximum number of threads you can run in a single block
maximum number of blocks you can load on a streaming multiprocessor at once
maximum number of registers per streaming multiprocessor
maximum amount of shared memory per streaming multiprocessor
Whichever of these limits you hit first becomes a constraint that limits your occupancy (is maximum occupancy what you are referring to by "100% Multiprocessor load"?). Once you reach a certain threshold of occupancy, it becomes less important to pay attention to occupancy. For example, occupancy of 33% does not mean that you are only able to achieve 33% of the maximum theoretical performance of the GPU. Vasily Volkov gave a great talk at the 2010 GPU Technology Conference which recommends not worrying too much about occupancy, and instead trying to minimize memory transactions by using some explicit caching tricks (and other stuff) in the kernel. You can watch the talk here: http://www.gputechconf.com/gtcnew/on-demand-GTC.php?sessionTopic=25&searchByKeyword=occupancy&submit=&select=+&sessionEvent=&sessionYear=&sessionFormat=#193
The only real way to be sure that you are using a kernel design that gives best performance is to test all the possibilities. And you need to redo this performance testing for each type of device you run it on, because they all have different constraints in some way. This can obviously be tedious, especially when the different design patterns result in fundamentally different kernels. I get around this to some extent by using a templating engine to dynamically generate kernels at runtime according to the device hardware specifications, but it's still a bit of a hassle.

shared memory optimization confusion

I have written an application in cuda , which uses 1kb of shared memory in each block.
Since there is only 16kb of shared memory in each SM, so only 16 blocks can be accommodated overall ( am i understanding it correctly ?), though at a time only 8 can be scheduled, but now if some block is busy in doing memory operation, so other block will be scheduled on gpu, but all the shared memory is used by other 16 blocks which already been scheduled there, so will cuda will not scheduled more blocks on the same sm , unless previous allocated blocks are completely finished ? or it will move some block's shared memory to global memory, and allocated other block there (in this case should we worry about global memory access latency ?)
It does not work like that. The number of blocks which will be scheduled to run at any given moment on a single SM will always be the minimum of the following:
8 blocks
The number of blocks whose sum of static and dynamically allocated shared memory is less than 16kb or 48kb, depending on GPU architecture and settings. There is also shared memory page size limitations which mean per block allocations get rounded up to the next largest multiple of the page size
The number of blocks whose sum of per block register usage is less than 8192/16384/32678 depending on architecture. There is also register file page sizes which mean that per block allocations get rounded up to the next largest multiple of the page size.
That is all there is to it. There is no "paging" of shared memory to accomodate more blocks. NVIDIA produce a spreadsheet for computing occupancy which ships with the toolkit and is available as a separate download. You can see the exact rules in the formulas it contains. They are also discussed in section 4.2 of the CUDA programming guide.

CUDA determining threads per block, blocks per grid

I'm new to the CUDA paradigm. My question is in determining the number of threads per block, and blocks per grid. Does a bit of art and trial play into this? What I've found is that many examples have seemingly arbitrary number chosen for these things.
I'm considering a problem where I would be able to pass matrices - of any size - to a method for multiplication. So that, each element of C (as in C = A * B) would be calculated by a single thread. How would you determine the threads/block, blocks/grid in this case?
In general you want to size your blocks/grid to match your data and simultaneously maximize occupancy, that is, how many threads are active at one time. The major factors influencing occupancy are shared memory usage, register usage, and thread block size.
A CUDA enabled GPU has its processing capability split up into SMs (streaming multiprocessors), and the number of SMs depends on the actual card, but here we'll focus on a single SM for simplicity (they all behave the same). Each SM has a finite number of 32 bit registers, shared memory, a maximum number of active blocks, AND a maximum number of active threads. These numbers depend on the CC (compute capability) of your GPU and can be found in the middle of the Wikipedia article http://en.wikipedia.org/wiki/CUDA.
First of all, your thread block size should always be a multiple of 32, because kernels issue instructions in warps (32 threads). For example, if you have a block size of 50 threads, the GPU will still issue commands to 64 threads and you'd just be wasting them.
Second, before worrying about shared memory and registers, try to size your blocks based on the maximum numbers of threads and blocks that correspond to the compute capability of your card. Sometimes there are multiple ways to do this... for example, a CC 3.0 card each SM can have 16 active blocks and 2048 active threads. This means if you have 128 threads per block, you could fit 16 blocks in your SM before hitting the 2048 thread limit. If you use 256 threads, you can only fit 8, but you're still using all of the available threads and will still have full occupancy. However using 64 threads per block will only use 1024 threads when the 16 block limit is hit, so only 50% occupancy. If shared memory and register usage is not a bottleneck, this should be your main concern (other than your data dimensions).
On the topic of your grid... the blocks in your grid are spread out over the SMs to start, and then the remaining blocks are placed into a pipeline. Blocks are moved into the SMs for processing as soon as there are enough resources in that SM to take the block. In other words, as blocks complete in an SM, new ones are moved in. You could make the argument that having smaller blocks (128 instead of 256 in the previous example) may complete faster since a particularly slow block will hog fewer resources, but this is very much dependent on the code.
Regarding registers and shared memory, look at that next, as it may be limiting your occupancy. Shared memory is finite for a whole SM, so try to use it in an amount that allows as many blocks as possible to still fit on an SM. The same goes for register use. Again, these numbers depend on compute capability and can be found tabulated on the wikipedia page.
https://docs.nvidia.com/cuda/cuda-occupancy-calculator/index.html
The CUDA Occupancy Calculator allows you to compute the multiprocessor occupancy of a GPU by a given CUDA kernel. The multiprocessor occupancy is the ratio of active warps to the maximum number of warps supported on a multiprocessor of the GPU. Each multiprocessor on the device has a set of N registers available for use by CUDA program threads. These registers are a shared resource that are allocated among the thread blocks executing on a multiprocessor. The CUDA compiler attempts to minimize register usage to maximize the number of thread blocks that can be active in the machine simultaneously. If a program tries to launch a kernel for which the registers used per thread times the thread block size is greater than N, the launch will fail...
With rare exceptions, you should use a constant number of threads per block. The number of blocks per grid is then determined by the problem size, such as the matrix dimensions in the case of matrix multiplication.
Choosing the number of threads per block is very complicated. Most CUDA algorithms admit a large range of possibilities, and the choice is based on what makes the kernel run most efficiently. It is almost always a multiple of 32, and at least 64, because of how the thread scheduling hardware works. A good choice for a first attempt is 128 or 256.
You also need to consider shared memory because threads in the same block can access the same shared memory. If you're designing something that requires a lot of shared memory, then more threads-per-block might be advantageous.
For example, in terms of context switching, any multiple of 32 works just the same. So for the 1D case, launching 1 block with 64 threads or 2 blocks with 32 threads each makes no difference for global memory accesses. However, if the problem at hand naturally decomposes into 1 length-64 vector, then the first option will be better (less memory overhead, every thread can access the same shared memory) than the second.
There is no silver bullet. The best number of threads per block depends a lot on the characteristics of the specific application being parallelized. CUDA's design guide recommends using a small amount of threads per block when a function offloaded to the GPU has several barriers, however, there are experiments showing that for some applications a small number of threads per block increases the overhead of synchronizations, imposing a larger overhead. In contrast, a larger number of threads per block may decrease the amount of synchronizations and improve the overall performance.
For an in-depth discussion (too lengthy for StackOverflow) about the impact of the number of threads per block on CUDA kernels, check this journal article, it shows tests of different configurations of the number of threads per block in the NPB (NAS Parallel Benchmarks) suite, a set of CFD (Computational Fluid Dynamics) applications.