CUDA purpose of manually specifying thread blocks

I just started learning CUDA and there is something I can't quite understand yet. I was wondering whether there is a reason for splitting threads into blocks besides optimizing the GPU workload. Because if there isn't, I can't understand why you would need to manually specify the number of blocks and their sizes. Wouldn't it be better to simply supply the number of threads needed to solve the task and let the GPU distribute the threads over the SMs?
That is, consider the following dummy task and GPU setup.
number of available SMs: 16
max number of blocks per SM: 8
max number of threads per block: 1024
Let's say we need to process every entry of a 256x256 matrix and we want a thread assigned to every entry, i.e. the overall number of threads is 256x256 = 65536. Then the number of blocks is:
overall number of threads / max number of threads per block = 65536 / 1024 = 64
Finally, 64 blocks will be distributed among 16 SMs, making it 4 blocks per SM. These are trivial calculations that the GPU could handle automatically, right?
The only other reason I can think of for manually supplying the number of blocks and their sizes is separating threads in a specific fashion so that they can use shared memory, i.e. somewhat isolating one block of threads from another block of threads.
But surely there must be another reason?
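The arithmetic in the question can be written out as a small host-side sketch in plain C++ (no CUDA toolkit needed). The function names blocks_needed and blocks_per_sm are made up for this illustration, and the rounding-up division is an assumption added to cover problem sizes that are not a multiple of the block size.

```cpp
#include <cassert>

// Number of blocks needed so that every element gets its own thread.
// Rounds up so that a partially filled last block is still counted.
int blocks_needed(int total_threads, int threads_per_block) {
    assert(total_threads >= 0 && threads_per_block > 0);
    return (total_threads + threads_per_block - 1) / threads_per_block;
}

// Average number of blocks each SM receives over the whole launch,
// assuming blocks are spread evenly (rounded up).
int blocks_per_sm(int num_blocks, int num_sms) {
    assert(num_blocks >= 0 && num_sms > 0);
    return (num_blocks + num_sms - 1) / num_sms;
}
```

For the 256x256 example, blocks_needed(256 * 256, 1024) gives 64 blocks, and blocks_per_sm(64, 16) gives 4 blocks per SM on average. This only counts blocks; how many of them are resident on an SM at the same time depends on the hardware limits.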

I will try to answer your question from the point of view I understand best.
The major factor that decides the number of threads per block is multiprocessor occupancy. The occupancy of a multiprocessor is the ratio of active warps to the maximum number of active warps it supports. The threads of a warp may be active or dormant for many reasons, depending on the application. Hence a fixed structure for the number of threads may not be viable.
Besides, each multiprocessor has a fixed number of registers shared among all the threads resident on that multiprocessor. If the total number of registers needed exceeds that maximum, the application is liable to fail.
In addition, the fixed amount of shared memory available to a given block may also affect the decision on the number of threads if shared memory is heavily used.
Hence a naive way to decide the number of threads is to use the occupancy calculator spreadsheet directly, if you want to be completely oblivious to the type of application at hand. The better option is to consider occupancy together with the type of application being run.
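As a rough illustration of how shared memory can cap occupancy, here is a host-side sketch in plain C++. All names are invented for this example, the figures in the usage note (48 KB of shared memory per SM, 64 warps per SM, 16 blocks per SM) are assumptions rather than values for any particular device, and allocation granularity is ignored.

```cpp
#include <algorithm>
#include <cassert>

// Resident blocks per SM when only shared memory and the device block
// limit are considered. A block that uses no shared memory is not
// restricted by this limiter, so the device block limit is returned.
int blocks_limited_by_smem(int smem_per_sm, int smem_per_block,
                           int max_blocks_per_sm) {
    assert(smem_per_sm > 0 && smem_per_block >= 0 && max_blocks_per_sm > 0);
    if (smem_per_block == 0) return max_blocks_per_sm;
    return std::min(max_blocks_per_sm, smem_per_sm / smem_per_block);
}

// Theoretical occupancy: active warps divided by the maximum number of
// warps an SM supports (warp size assumed to be 32).
double occupancy(int resident_blocks, int threads_per_block,
                 int max_warps_per_sm) {
    assert(resident_blocks >= 0 && threads_per_block > 0 && max_warps_per_sm > 0);
    int warps_per_block = (threads_per_block + 31) / 32;
    int active_warps = std::min(resident_blocks * warps_per_block, max_warps_per_sm);
    return static_cast<double>(active_warps) / max_warps_per_sm;
}
```

With the assumed figures, a kernel using 16 KB of shared memory per 256-thread block fits 3 blocks per SM, i.e. 24 of 64 warps, so the theoretical occupancy is 0.375 no matter how many registers are free.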

Related

Causes of Low Achieved Occupancy

The NVIDIA website mentions a few causes of low achieved occupancy, among them uneven distribution of workload among blocks, which results in blocks hoarding shared memory resources and not releasing them until the block has finished. The suggestion is to decrease the size of a block, thus increasing the overall number of blocks (given that we keep the number of threads constant, of course).
A good explanation of that was also given here on Stack Overflow.
Given the aforementioned information, shouldn't the right course of action (in order to maximize performance) be simply to set the block size as small as possible (equal to the size of a warp, say 32 threads)? That is, unless a larger number of threads needs to communicate through shared memory, I assume.

Given aforementioned information, shouldn't the right course of
actions be (in order to maximize performance) simply setting the size
of a block as small as possible (equal to the size of a warp, say 32
threads)?
No.
As shown in the documentation here, there is a limit on the number of blocks per multiprocessor which would leave you with a maximum theoretical occupancy of 25% or 50% when using 32 thread blocks, depending on what hardware you run the kernel on.
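The 25% and 50% figures follow from simple arithmetic. Here is a minimal sketch in plain C++, assuming a warp size of 32 and 64 warps per SM, with block limits of 16 and 32 per SM standing in for two hardware generations. The function name occupancy_from_block_limit is invented here, and registers and shared memory are ignored.

```cpp
#include <algorithm>
#include <cassert>

// Theoretical occupancy when the only limiters are the maximum number of
// resident blocks per SM and the maximum number of warps per SM.
double occupancy_from_block_limit(int threads_per_block,
                                  int max_blocks_per_sm,
                                  int max_warps_per_sm) {
    assert(threads_per_block > 0 && max_blocks_per_sm > 0 && max_warps_per_sm > 0);
    int warps_per_block = (threads_per_block + 31) / 32;  // warp size 32
    int active_warps = std::min(max_blocks_per_sm * warps_per_block,
                                max_warps_per_sm);
    return static_cast<double>(active_warps) / max_warps_per_sm;
}
```

With 32-thread blocks each block is a single warp, so a limit of 16 blocks gives 16 of 64 warps (0.25) and a limit of 32 blocks gives 32 of 64 warps (0.5). Under the same assumptions, 128-thread blocks reach 1.0 with a limit of 16 blocks.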

Usually it is a good approach to use blocks that are as small as possible but still big enough to saturate the device (64 or 128 threads per block, depending on the device). This is not always possible, since you might want to synchronize threads or communicate via shared memory.
Having a large number of small blocks allows the GPU to do a kind of "autobalancing" and keep all SMs running.
The same applies to a CPU: if you have 5 independent tasks that each take 4 seconds to finish, but only 4 cores, the job will end after 8 seconds (during the first 4 seconds, 4 cores work on the first 4 tasks; then 1 core works on the last task while 3 cores idle).
If you are able to divide the whole job into 20 tasks that take 1 second each, the whole job will be done in 5 seconds. So having a lot of small tasks helps to utilize the hardware.
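The CPU analogy can be checked with a few lines of plain C++. This is only a sketch of the reasoning above: it assumes equal-length, independent tasks and one task per core at a time, and makespan_seconds is a name chosen for this example.

```cpp
#include <cassert>

// Wall-clock time to finish num_tasks equal, independent tasks on
// num_cores cores when each core runs one task at a time.
int makespan_seconds(int num_tasks, int seconds_per_task, int num_cores) {
    assert(num_tasks >= 0 && seconds_per_task >= 0 && num_cores > 0);
    int rounds = (num_tasks + num_cores - 1) / num_cores;  // round up
    return rounds * seconds_per_task;
}
```

makespan_seconds(5, 4, 4) gives 8 seconds, while makespan_seconds(20, 1, 4) gives 5 seconds for the same 20 seconds of total work.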
On a GPU you can have a large number of active blocks (on a Titan X it is 24 SMs x 32 active blocks = 768 blocks), and it would be good to use this power.
That said, it is not always true that you need to fully saturate the device. On many tasks I see that using 32 threads per block (so 50% possible occupancy) gives the same performance as using 64 threads per block.
In the end it is all a matter of running some benchmarks and choosing whatever is best for your case on your hardware.

NSIGHT: What are those Red and Black colour in kernel-level experiments?

I am trying to learn NSIGHT.
Can someone tell me what the red marks indicate in the following screenshot taken from the User Guide? There are two red marks in Occupancy per SM and two in the warps section, as you can see.
Similarly, what do the black lines of varying length indicate?
Another example from same page:

Here is the basic explanation:
Grey bars represent the amount of resources available on your particular device (due to both its hardware and its compute capability).
Black bars represent the theoretical limit that your kernel can achieve under your launch configuration (blocks per grid and threads per block).
The red dots represent the resources that you are actually using.
For instance, looking at "Active warps" in the first picture:
Grey: The device supports 64 active warps concurrently.
Black: Given the register usage, it is theoretically possible to map 64 warps.
Red: You achieve 63.56 active warps.
In this case the grey bar is hidden under the black one, so you can't see it.
In some cases the theoretical limit can be greater than the device limit. This is OK. You can see examples in the second picture (block limit (shared memory) and block limit (registers)). That makes sense if you consider that your kernel uses only a small fraction of the available resources. If one block used 1 register, it would be possible to launch 65536 blocks (without taking other factors into account), but your device limit is still 16. The number 128 then comes from 65536/512. The same applies to the shared memory section: since you use 0 bytes of shared memory per block, you could launch an unlimited number of blocks as far as shared memory is concerned.
About blank spaces
The theoretical and the achieved values are the same for all rows except for "Active warps" and "Occupancy".
You are really executing 1024 threads per block with 32 warps per block on the first picture.
For Occupancy and Active warps, I guess the achieved number is a kind of statistical measure. I think so because of the nature of the CUDA model. In CUDA, the threads within a warp are executed simultaneously on an SM. High-latency operations, such as memory reads, are hidden through "almost-free" warp context switches. I guess it is difficult to take an exact measurement of the number of active warps in that situation. Besides hardware concepts, we also have to take the kernel implementation into account: branch divergence, for instance, could make one warp slower than the others.
Extended information
As you saw, these numbers are closely related to your device specific hardware and compute capability, so perhaps a concrete example could help here:
A device with CUDA compute capability (CCC) 3.0 can handle a maximum of 2048 threads per SM, 16 blocks per SM and 64 warps per SM. You also have a maximum number of registers available to use (65536 in that case).
This Wikipedia entry is a handy place to look up the features of each CCC.
You can query these parameters using the deviceQuery utility sample code provided with the CUDA toolkit or, at execution time, using the CUDA API as here.
Performance considerations
The thing is that, ideally, 16 blocks of 128 threads can run concurrently if each thread uses at most 32 registers (65536 registers / 2048 threads). That means a high occupancy rate. In most cases your kernel needs more than 32 registers per thread, so it is no longer possible to execute 16 blocks concurrently on the SM. The reduction then happens at block granularity, i.e., by decreasing the number of blocks. And this is what the bars capture.
You can play with the number of threads and blocks, or even with the __launch_bounds__ qualifier, to optimize your kernel, or you can use the --maxrregcount setting to lower the number of registers your kernels use and see if it improves overall execution speed.
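The register argument under "Performance considerations" can be sketched as follows in plain C++. It uses the figures quoted above (65536 registers and 16 blocks per SM) and ignores the register allocation granularity of real hardware, so actual limits may be lower. The function name blocks_limited_by_registers is hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Resident blocks per SM when only registers and the device block limit
// are considered. Whole blocks are dropped, never individual threads.
int blocks_limited_by_registers(int regs_per_sm, int regs_per_thread,
                                int threads_per_block, int max_blocks_per_sm) {
    assert(regs_per_sm > 0 && regs_per_thread > 0);
    assert(threads_per_block > 0 && max_blocks_per_sm > 0);
    int regs_per_block = regs_per_thread * threads_per_block;
    return std::min(max_blocks_per_sm, regs_per_sm / regs_per_block);
}
```

For 128-thread blocks, 32 registers per thread still allows 16 blocks, 33 registers per thread already drops to 15 blocks, and 64 registers per thread leaves 8 blocks, which is half the warps.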

Minimizing registers per thread + "maxrregcount" effect

The profiling result of my program says the maximum theoretical occupancy is 50% and the limiter is registers. What are the general guidelines for minimizing the number of registers in CUDA code? The profiling results show that the number of registers is much larger than the number of 32-bit and 16-bit variables I have in my code (per thread). What could be the reason?
Also, setting "maxrregcount" to 32 (32 * 2048 (max threads per SMX) = 65536 (max registers per SMX)) solves the occupancy limit issue, but I don't get much of a speedup. Does "maxrregcount" try to optimize the code more, so that it is less wasteful in using registers? Or does it simply spill registers to L1 cache or local memory?

According to the NVIDIA presentation given here, local memory is used if the source exceeds the register limit. It's worth spending time studying this presentation, as it describes various options for increasing performance. As Vasily Volkov says in this presentation, occupancy is one of the metrics, not the only one.
Also note that
32 * 2048 (max threads per SMX) = 65536 (max registers per SMX) is somewhat wrong, I feel.
32 * 1024 (max threads per block) = 32768 < 65536 (max registers per block). You can still increase the number of registers per thread up to 64.

maxrregcount does cause the compiler to rearrange its use of registers, but it is always trying to keep the register count low anyway. Where it can't stay below your imposed limit, it will simply spill registers to L1, L2 and DRAM. When you have to go to DRAM to fetch your spilled local variables, it can crowd out your explicit memory fetches and/or cause your kernel to become "latency-bound", that is, computation is held up while waiting for the data to come back.
You might have better luck choosing something between unlimited registers and 32. Often some spilling and less than perfect occupancy beats lots of spilling with 100% occupancy, for the reasons given above.
As a side note, you can limit registers for a specific kernel (rather than the whole file) by using __launch_bounds__, which you can read about in the Programming Guide.

How does CUDA schedule its threads

I've got a few questions regarding CUDA's scheduling system.
A. When I launch, for example, foo<<<255, 255>>>(), what actually happens inside the card? I know that each SM receives a block from the upper level and is responsible for scheduling it, but which part does that? If, for example, I've got 8 SMs, each of which contains 8 small CPUs, is the upper level responsible for scheduling the remaining 255*255 - (8 * 8) threads?
B. What is the maximum number of threads that one can define? I mean foo<<<X, Y>>>(); X, Y = ?
C. Regarding the last example, how many threads can be inside one block? Can we say that the more blocks/threads we have, the faster the execution will be?
Thanks for your help.

A. The compute work distributor will distribute a block from the grid to an SM. The SM will split the block into warps (WARP_SIZE = 32 on all NVIDIA GPUs). On Fermi 2.0 GPUs each SM has two warp schedulers which share a set of data paths. Every cycle each warp scheduler picks a warp and issues an instruction to one of the data paths (please don't think of CUDA cores). On Fermi 2.1 GPUs each warp scheduler has independent data paths as well as a set of shared data paths. On 2.1, every cycle each warp scheduler will pick a warp and attempt to dual-issue instructions for that warp.
The warp schedulers attempt to optimize the use of data paths. This means that a single warp may execute multiple instructions in back-to-back cycles, or the warp scheduler can choose to issue from a different warp every cycle.
The number of warps/threads that each SM can handle is specified in the CUDA Programming Guide v.4.2 Table F-1. This scales from 768 threads to 2048 threads (24-64 warps).
B. The maximum threads per launch is defined by the maximum GridDims * the maximum threads per block. See Table F-1 or refer to the documentation for cudaGetDeviceProperties.
C. See the same resources as (B). The optimum distribution of threads/block is defined by your algorithm partitioning and is influenced by the occupancy calculation. There are observable performance impacts based around problem set size of the warps on the SM and the amount of time blocked at instruction barriers (among other things). For starters I recommend at least 2 blocks per SM and ~50% occupancy.
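To make the block-to-warp step from (A) concrete, here is a small sketch in plain C++ of how many warps a block occupies, assuming the warp size of 32 mentioned above. The names kWarpSize, warps_per_block and idle_lanes_per_block are hypothetical and are not part of the CUDA API.

```cpp
#include <cassert>

const int kWarpSize = 32;  // warp size assumed from the answer above

// Number of warps a block is split into (a partial warp still counts).
int warps_per_block(int threads_per_block) {
    assert(threads_per_block > 0);
    return (threads_per_block + kWarpSize - 1) / kWarpSize;
}

// Lanes in the last warp that have no thread assigned to them.
int idle_lanes_per_block(int threads_per_block) {
    return warps_per_block(threads_per_block) * kWarpSize - threads_per_block;
}
```

For the foo<<<255, 255>>>() launch in the question, each block of 255 threads becomes 8 warps with 1 idle lane, so the grid consists of 255 * 8 = 2040 warps. A block size of 256 would fill all 8 warps.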

B. It depends on your device. You can use the CUDA function cudaGetDeviceProperties to see the specifications for your device. A common maximum is y = 1024 threads per block and x = 65535 blocks per grid dimension.
C. A common practice is to use powers of 2 (128, 256, 512, etc.) threads per block. Reducing large arrays is very effective that way (see Reduction). The optimum distribution of blocks and threads actually depends on your application and your hardware. I personally use 512 threads per block for large sparse linear algebra computations on a Tesla M2050, since it's the most efficient for my applications.

Threads hierarchy design in kernel in CUDA

Assuming a block has a limit of 512 threads and my kernel needs more than 512 threads for execution, how should one design the thread hierarchy for optimal performance?
(case 1)
1st block - 512 threads
2nd block - remaining threads
(case 2) distribute equal number of threads across certain blocks.

I don't think that it really matters, but it is more important to group the thread blocks logically, so that you are able to use other CUDA optimizations (like memory coalescing).
This link provides some insight into how CUDA will (likely) organize your threads.
A quote from the summary:
To summarize, special parameters at a kernel launch define the dimensions of a grid and its blocks. Unique coordinates in blockId and threadId variables allow threads of a grid to distinguish among them. It is the programmer's responsibility to use these variables in the kernel functions so that the threads can properly identify the portion of the data to process. These variables compel the programmers to organize threads and their data into hierarchical and multi-dimensional organizations.

It is preferable to divide the threads equally into two blocks, in order to maximize the overlap between computation and memory access. When you have, for instance, 256 threads in a block, they do not all compute at the same time; they are scheduled on the SM in warps of 32 threads. When a warp is waiting for data from global memory, another warp is scheduled. If you have a small block of threads, your global memory accesses are penalized a lot more.
Furthermore, in your example you underuse your GPU. Just remember that a GPU has dozens of multiprocessors (e.g. 30 for the Tesla C1060), and a block is mapped to a multiprocessor. In your case, you will only use 2 multiprocessors.
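One way to sketch the "equal split" of case 2 in plain C++ is shown below. This is only an illustration under stated assumptions: a warp size of 32, a block limit that is a multiple of 32, and a block size rounded up to whole warps. The Split struct and the even_split function are invented for this example.

```cpp
#include <cassert>

struct Split {
    int num_blocks;
    int threads_per_block;
};

// Uses the smallest number of blocks that respects the per-block limit,
// spreads the threads evenly, and rounds the block size up to whole warps.
Split even_split(int total_threads, int max_threads_per_block) {
    assert(total_threads > 0);
    assert(max_threads_per_block >= 32 && max_threads_per_block % 32 == 0);
    int num_blocks =
        (total_threads + max_threads_per_block - 1) / max_threads_per_block;
    int per_block = (total_threads + num_blocks - 1) / num_blocks;
    int rounded = ((per_block + 31) / 32) * 32;  // whole warps only
    return {num_blocks, rounded};
}
```

For 600 threads and a 512-thread limit this gives 2 blocks of 320 threads instead of one block of 512 and one of 88. Since 2 * 320 is more than 600, the kernel would need a bounds check so that the extra threads do nothing.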
