total number of registers - cuda

I wanted to ask: we say that using --ptxas-options=-v doesn't give the exact number of registers that our program uses.
1) Then, how am I going to supply the occupancy calculator with registers per thread and shared memory per block?
2) In my program I also use Thrust calls, which generate PTX code. I have 2 kernels, but I can see that the Thrust functions produce PTX as well. So, should I take these numbers into account too when counting the total number of registers I use? (I think yes!)
(The same applies to the shared memory.)

1) Then, how am I going to supply the occupancy calculator with registers per thread and shared memory per block?
The only other thing needed should be rounding up (if necessary) the output of ptxas to an even granularity of register allocation, which varies by device (see Greg's answer here). I think the common register allocation granularities are 4 and 8, but I don't have a table of register allocation granularity by compute capability.
I think shared memory also has an allocation granularity. Since the max number of threadblocks per SM is limited anyway, this should only matter (for occupancy) if your allocation/usage is within one granularity increment of exceeding the limit for however many blocks you are otherwise limited to.
I think in most cases you'll get a pretty good feel by using the numbers from ptxas without rounding. If you feel you need this level of accuracy in the occupancy calculator, asking a nice directed question like "what are the allocation granularities for registers and shared memory for various GPUs" may get someone like Greg to give you a crisp answer.
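On newer CUDA toolkits you can also sidestep the manual bookkeeping: the runtime's occupancy API applies the device's real allocation granularities internally. A minimal sketch, assuming a hypothetical kernel myKernel and a block size of 256:

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for whatever kernel ptxas reported on.
__global__ void myKernel(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;
}

int main()
{
    int blockSize = 256;  // threads per block you intend to launch with
    int numBlocks = 0;    // active blocks per SM, as computed by the runtime

    // The runtime rounds register and shared memory usage to the device's
    // allocation granularities for you, so no manual rounding is needed.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, myKernel, blockSize, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("active blocks/SM: %d, occupancy: %d%%\n", numBlocks,
           100 * numBlocks * blockSize / prop.maxThreadsPerMultiProcessor);
    return 0;
}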
2) In my program I also use Thrust calls, which generate PTX code. I have 2 kernels, but I can see that the Thrust functions produce PTX as well. So, should I take these numbers into account too when counting the total number of registers I use? (I think yes!) (The same applies to the shared memory.)
Fundamentally I believe this thinking is incorrect. The only place I could see where it might matter is if you are running concurrent kernels, and I doubt that is the case since you mention Thrust. The only figures that matter for occupancy are the metrics for a single kernel launch. You do not add threads, or registers, or shared memory across different kernels to calculate resource usage. When a kernel completes execution, it releases its resource usage, at least for these resource types (registers, shared memory, threads).

Related

Will performance be hit if a kernel is too short?

If I'm doing an element-by-element operation on a matrix M, say M[i, j] *= (1 - M[i, j]), is it fine to launch a thread for each element (i, j)? I'm just concerned at what point the overhead of launching threads outweighs the parallelism achieved.
It's oftentimes a better idea to try to do more work per thread if possible, with the goal of having instruction-level parallelism. If a given thread executes multiple, independent operations, the instructions can be pipelined and executed without stalls, which will increase your arithmetic throughput. In contrast, if you have each thread doing 1 piece of (trivial) work, then there's no opportunity for any sort of instruction-level parallelism and no opportunity to hide any of your memory latency.
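To make that concrete, here's a sketch of the usual grid-stride pattern applied to the M[i] *= (1 - M[i]) operation from the question (the kernel name is mine). Launched with fewer total threads than elements, each thread processes several independent elements, giving the hardware instructions it can overlap:

// Each thread walks the array with a grid-wide stride, so its successive
// iterations are independent and can be pipelined.
__global__ void update(float *M, int n)
{
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        float m = M[i];
        M[i] = m * (1.0f - m);  // M[i] *= (1 - M[i])
    }
}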
Also, there's a finite number of registers available, so the more threads you launch with, the fewer registers are available per thread. I'm not sure about Kepler cards, but back in the Fermi generation, registers had roughly 8x the bandwidth of shared memory, so using registers when possible was important (again, I don't have a Kepler card, so I don't know if this has since changed).
Although it's a bit dated, the recommendations detailed here are still very relevant.

How to adjust the cuda number of block and of thread to get optimal performances

I've tested empirically with several values of blocks and threads, and the execution time can be greatly reduced with specific values.
I don't see what the differences between blocks and threads are. I figure it may be that threads in a block have access to a specific cache memory, but it's quite fuzzy to me. For the moment, I parallelize my functions into N parts, which are allocated to blocks/threads.
My goal could be to automatically adjust the number of blocks and threads according to the size of the memory that I have to use. Could that be possible? Thank you.
Hong Zhou's answer is good, so far. Here are some more details:
When using shared memory you might want to consider it first, because it's a very limited resource, and it's not unlikely for kernels to have very specific needs that constrain those many variables controlling parallelism. You either have blocks with many threads sharing larger regions, or blocks with fewer threads sharing smaller regions (under constant occupancy).
If your code can live with as little as 16 KB of shared memory per multiprocessor, you might want to opt for the larger (48 KB) L1 cache by calling
cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);
Further, the L1 cache can be disabled for non-local global accesses using the compiler option -Xptxas=-dlcm=cg, to avoid pollution when the kernel accesses global memory carefully.
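A minimal sketch of that configuration (the cache-config call and the compiler flag are as above; the kernel itself is a hypothetical placeholder, and n is assumed to be a multiple of 256):

__global__ void scaleKernel(float *data)  // hypothetical kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 0.5f;
}

void configureAndLaunch(float *d_data, int n)
{
    // Request the 48 KB L1 / 16 KB shared split (a device-wide preference).
    cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);
    scaleKernel<<<n / 256, 256>>>(d_data);
}

// To additionally bypass L1 for global loads, compile with:
//   nvcc -Xptxas=-dlcm=cg kernel.cu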
Before worrying about optimal performance based on occupancy, you might also want to check that device debugging support is turned off for CUDA >= 4.1 (or that appropriate optimization options are given; read my post in this thread for a suitable compiler configuration).
Now that we have a memory configuration and registers are actually used aggressively, we can analyze the performance under varying occupancy:
The higher the occupancy (warps per multiprocessor) the less likely the multiprocessor will have to wait (for memory transactions or data dependencies) but the more threads must share the same L1 caches, shared memory area and register file (see CUDA Optimization Guide and also this presentation).
The ABI can generate code for a variable number of registers (more details can be found in the thread I cited). At some point, however, register spilling occurs. That is, register values are temporarily stored on the (relatively slow, off-chip) local memory stack.
Watching stall reasons, memory statistics and arithmetic throughput in the profiler while varying the launch bounds and parameters will help you find a suitable configuration.
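For illustration, here is how launch bounds are specified in code; the numbers are arbitrary placeholders to be tuned with the profiler:

// __launch_bounds__ promises the compiler a maximum block size (256 here)
// and asks for at least 4 resident blocks per SM, which constrains the
// register budget; ask for too much and values spill to local memory.
__global__ void
__launch_bounds__(256, 4)
boundedKernel(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = data[i] * data[i];
}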
It's theoretically possible to find optimal values from within an application. However, having the client code adjust optimally to both different devices and launch parameters can be nontrivial, and will require recompilation, or different variants of the kernel to be deployed for every target device architecture.
I believe automatically adjusting the block and thread sizes is a highly difficult problem. If it were easy, CUDA would most probably have this feature for you.
The reason is that the optimal configuration is dependent on the implementation and the kind of algorithm you are implementing. It requires profiling and experimentation to get the best performance.
Here are some limiting factors which you can consider:
Register usage in your kernel.
Occupancy of your current implementation.
Note: having more threads does not equate to best performance. Best performance is obtained by getting the right occupancy in your application and keeping the GPU cores busy all the time.
I've given a fairly thorough answer here; in a word, computing the optimal distribution over blocks and threads is a difficult problem.

Maximum (shared memory per block) / (threads per block) in CUDA with 100% MP load

I'm trying to process array of big structures with CUDA 2.0 (NVIDIA 590). I'd like to use shared memory for it. I've experimented with CUDA occupancy calculator, trying to allocate maximum shared memory per thread, so that each thread can process whole element of array.
However, the maximum (shared memory per block) / (threads per block) I can see in the calculator with 100% multiprocessor load is 32 bytes, which is not enough for a single element (by an order of magnitude).
Is 32 bytes the maximum possible value for (shared memory per block) / (threads per block)?
Is it possible to say which alternative is preferable: allocating part of the array in global memory, or just using an underloaded multiprocessor? Or can it only be decided by experiment?
Yet another alternative I can see is to process the array in several passes, but it looks like a last resort.
This is the first time I'm trying something really complex with CUDA, so I could be missing some other options...
There are many hardware limitations you need to keep in mind when designing a CUDA kernel. Here are some of the constraints you need to consider:
maximum number of threads you can run in a single block
maximum number of blocks you can load on a streaming multiprocessor at once
maximum number of registers per streaming multiprocessor
maximum amount of shared memory per streaming multiprocessor
Whichever of these limits you hit first becomes a constraint that limits your occupancy (is maximum occupancy what you are referring to by "100% Multiprocessor load"?). Once you reach a certain threshold of occupancy, it becomes less important to pay attention to occupancy. For example, occupancy of 33% does not mean that you are only able to achieve 33% of the maximum theoretical performance of the GPU. Vasily Volkov gave a great talk at the 2010 GPU Technology Conference which recommends not worrying too much about occupancy, and instead trying to minimize memory transactions by using some explicit caching tricks (and other stuff) in the kernel. You can watch the talk here: http://www.gputechconf.com/gtcnew/on-demand-GTC.php?sessionTopic=25&searchByKeyword=occupancy&submit=&select=+&sessionEvent=&sessionYear=&sessionFormat=#193
The only real way to be sure that you are using a kernel design that gives best performance is to test all the possibilities. And you need to redo this performance testing for each type of device you run it on, because they all have different constraints in some way. This can obviously be tedious, especially when the different design patterns result in fundamentally different kernels. I get around this to some extent by using a templating engine to dynamically generate kernels at runtime according to the device hardware specifications, but it's still a bit of a hassle.
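A lightweight cousin of that approach uses compile-time templates rather than runtime generation; this sketch (with a shared-memory reduction purely as an example, and a deliberately simplistic selection rule) shows the idea:

#include <cuda_runtime.h>

// Compile-time variants of the same kernel, one per candidate block size.
template <int BLOCK_SIZE>
__global__ void reduceSum(const float *in, float *out, int n)
{
    __shared__ float buf[BLOCK_SIZE];
    int i = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    for (int s = BLOCK_SIZE / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) buf[threadIdx.x] += buf[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = buf[0];
}

// Host-side dispatch: pick a variant based on the device actually present
// (the threshold here is arbitrary; a real version would profile first).
void launchBest(const float *in, float *out, int n)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    if (prop.maxThreadsPerBlock >= 1024)
        reduceSum<512><<<(n + 511) / 512, 512>>>(in, out, n);
    else
        reduceSum<256><<<(n + 255) / 256, 256>>>(in, out, n);
}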

the number of blocks that can be scheduled at the same time

This question also stems from the following link: shared memory optimization confusion
In the above link, from talonmies's answer, I found that the first condition on the number of blocks which will be scheduled to run is "8". I have 3 questions, as shown below.
Does it mean that only 8 blocks can be scheduled at the same time when the number of blocks from conditions 2 and 3 is over 8? Is that regardless of any conditions such as the CUDA environment, GPU device, or algorithm?
If so, it really means that it is better not to use shared memory in some cases; it depends. Then we have to think about how we can judge which one is better, using or not using shared memory. I think one approach is checking whether there is a global memory access limitation (a memory bandwidth bottleneck) or not. That is, we can select "not using shared memory" if there is no global memory access limitation. Is that a good approach?
In addition to question 2 above, I think that if the data my CUDA program has to handle is huge, then we can consider "not using shared memory" to be better, because it is hard to fit within the shared memory. Is that also a good approach?
The number of concurrently scheduled blocks is always going to be limited by something.
Playing with the CUDA Occupancy Calculator should make it clear how it works. The usage of three types of resources affects the number of concurrently scheduled blocks: Threads Per Block, Registers Per Thread, and Shared Memory Per Block.
If you set up a kernel that uses 1 Thread Per Block, 1 Register Per Thread and 1 byte of Shared Memory Per Block on Compute Capability 2.0, you are limited by Max Blocks per Multiprocessor, which is 8. If you start increasing Shared Memory Per Block, Max Blocks per Multiprocessor will continue to be your limiting factor until you reach a threshold at which Shared Memory Per Block becomes the limiting factor. Since there are 49152 bytes of shared memory per SM, that happens at around 49152 / 8 = 6144 bytes (it's a bit less because some shared memory is used by the system, and it's allocated in chunks of 128 bytes).
In other words, given the limit of 8 Max Blocks per Multiprocessor, using shared memory is completely free (as it relates to the number of concurrently running blocks), as long as you stay below the threshold at which Shared Memory Per Block becomes the limiting factor.
The same goes for register usage.
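A quick sketch of that arithmetic with the compute capability 2.0 numbers from above (ignoring the granularity and system-reserved bytes just mentioned):

#include <stdio.h>

int main(void)
{
    const int maxBlocksPerSM = 8;  // hard block limit on CC 2.0
    const int smemPerSM = 49152;   // bytes of shared memory per SM

    for (int smemPerBlock = 2048; smemPerBlock <= 8192; smemPerBlock += 2048) {
        int bySmem = smemPerSM / smemPerBlock;  // blocks permitted by shared memory
        int blocks = bySmem < maxBlocksPerSM ? bySmem : maxBlocksPerSM;
        // Below ~6144 bytes/block the 8-block limit wins; above it, shared memory does.
        printf("%5d bytes/block -> %d concurrent blocks\n", smemPerBlock, blocks);
    }
    return 0;
}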

CUDA determining threads per block, blocks per grid

I'm new to the CUDA paradigm. My question is about determining the number of threads per block, and blocks per grid. Does a bit of art and trial play into this? What I've found is that many examples have seemingly arbitrary numbers chosen for these things.
I'm considering a problem where I would be able to pass matrices of any size to a method for multiplication, so that each element of C (as in C = A * B) would be calculated by a single thread. How would you determine the threads/block and blocks/grid in this case?
In general you want to size your blocks/grid to match your data and simultaneously maximize occupancy, that is, how many threads are active at one time. The major factors influencing occupancy are shared memory usage, register usage, and thread block size.
A CUDA-enabled GPU has its processing capability split up into SMs (streaming multiprocessors), and the number of SMs depends on the actual card, but here we'll focus on a single SM for simplicity (they all behave the same). Each SM has a finite number of 32-bit registers, shared memory, a maximum number of active blocks, AND a maximum number of active threads. These numbers depend on the CC (compute capability) of your GPU and can be found in the middle of the Wikipedia article http://en.wikipedia.org/wiki/CUDA.
First of all, your thread block size should always be a multiple of 32, because kernels issue instructions in warps (32 threads). For example, if you have a block size of 50 threads, the GPU will still issue commands to 64 threads and you'd just be wasting them.
Second, before worrying about shared memory and registers, try to size your blocks based on the maximum numbers of threads and blocks that correspond to the compute capability of your card. Sometimes there are multiple ways to do this... for example, on a CC 3.0 card each SM can have 16 active blocks and 2048 active threads. This means if you have 128 threads per block, you could fit 16 blocks in your SM before hitting the 2048-thread limit. If you use 256 threads, you can only fit 8, but you're still using all of the available threads and will still have full occupancy. However, using 64 threads per block will only use 1024 threads when the 16-block limit is hit, so only 50% occupancy. If shared memory and register usage are not a bottleneck, this should be your main concern (other than your data dimensions).
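The arithmetic behind those numbers, as a small sketch you can extend to other block sizes:

#include <stdio.h>

int main(void)
{
    const int maxBlocksPerSM = 16, maxThreadsPerSM = 2048;  // CC 3.0 limits

    for (int blockSize = 64; blockSize <= 256; blockSize *= 2) {
        int byThreads = maxThreadsPerSM / blockSize;  // blocks permitted by threads
        int blocks = byThreads < maxBlocksPerSM ? byThreads : maxBlocksPerSM;
        // 64 -> 16 blocks / 1024 threads (50%); 128 and 256 -> full occupancy
        printf("%3d threads/block: %2d blocks, %3d%% occupancy\n", blockSize,
               blocks, 100 * blocks * blockSize / maxThreadsPerSM);
    }
    return 0;
}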
On the topic of your grid... the blocks in your grid are spread out over the SMs to start, and then the remaining blocks are placed into a pipeline. Blocks are moved into the SMs for processing as soon as there are enough resources in that SM to take the block. In other words, as blocks complete in an SM, new ones are moved in. You could make the argument that having smaller blocks (128 instead of 256 in the previous example) may complete faster since a particularly slow block will hog fewer resources, but this is very much dependent on the code.
Regarding registers and shared memory, look at them next, as they may be limiting your occupancy. Shared memory is finite for a whole SM, so try to use it in an amount that allows as many blocks as possible to still fit on an SM. The same goes for register use. Again, these numbers depend on compute capability and can be found tabulated on the Wikipedia page.
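Applied to the C = A * B question, a hedged sketch of the one-thread-per-element configuration (the kernel and all names are mine, not from any particular library):

// Each thread computes one element of C = A * B
// (A is heightC x K, B is K x widthC, all row-major).
__global__ void matMulKernel(const float *A, const float *B, float *C,
                             int widthC, int heightC, int K)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col >= widthC || row >= heightC) return;  // guard the rounded-up grid

    float sum = 0.0f;
    for (int k = 0; k < K; ++k)
        sum += A[row * K + k] * B[k * widthC + col];
    C[row * widthC + col] = sum;
}

void launchMatMul(const float *A, const float *B, float *C,
                  int widthC, int heightC, int K)
{
    dim3 block(16, 16);  // 256 threads, a multiple of the 32-thread warp
    dim3 grid((widthC + block.x - 1) / block.x,    // round up so any
              (heightC + block.y - 1) / block.y);  // matrix size is covered
    matMulKernel<<<grid, block>>>(A, B, C, widthC, heightC, K);
}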
https://docs.nvidia.com/cuda/cuda-occupancy-calculator/index.html
The CUDA Occupancy Calculator allows you to compute the multiprocessor occupancy of a GPU by a given CUDA kernel. The multiprocessor occupancy is the ratio of active warps to the maximum number of warps supported on a multiprocessor of the GPU. Each multiprocessor on the device has a set of N registers available for use by CUDA program threads. These registers are a shared resource that are allocated among the thread blocks executing on a multiprocessor. The CUDA compiler attempts to minimize register usage to maximize the number of thread blocks that can be active in the machine simultaneously. If a program tries to launch a kernel for which the registers used per thread times the thread block size is greater than N, the launch will fail...
With rare exceptions, you should use a constant number of threads per block. The number of blocks per grid is then determined by the problem size, such as the matrix dimensions in the case of matrix multiplication.
Choosing the number of threads per block is very complicated. Most CUDA algorithms admit a large range of possibilities, and the choice is based on what makes the kernel run most efficiently. It is almost always a multiple of 32, and at least 64, because of how the thread scheduling hardware works. A good choice for a first attempt is 128 or 256.
You also need to consider shared memory because threads in the same block can access the same shared memory. If you're designing something that requires a lot of shared memory, then more threads-per-block might be advantageous.
For example, in terms of context switching, any multiple of 32 works just the same. So for the 1D case, launching 1 block with 64 threads or 2 blocks with 32 threads each makes no difference for global memory accesses. However, if the problem at hand naturally decomposes into 1 length-64 vector, then the first option will be better (less memory overhead, every thread can access the same shared memory) than the second.
There is no silver bullet. The best number of threads per block depends a lot on the characteristics of the specific application being parallelized. CUDA's design guide recommends using a small number of threads per block when a function offloaded to the GPU has several barriers; however, there are experiments showing that for some applications a small number of threads per block increases the overhead of synchronization, imposing a larger overhead. In contrast, a larger number of threads per block may decrease the amount of synchronization and improve the overall performance.
For an in-depth discussion (too lengthy for StackOverflow) about the impact of the number of threads per block on CUDA kernels, check this journal article; it shows tests of different configurations of the number of threads per block in the NPB (NAS Parallel Benchmarks) suite, a set of CFD (Computational Fluid Dynamics) applications.