Minimizing registers per thread + "maxregcount" effect - cuda

The profiling results for my program say the maximum theoretical achieved occupancy is 50% and that the limiter is registers. What are the general guidelines for minimizing the number of registers in CUDA code? The profiler also reports many more registers per thread than the number of 32-bit and 16-bit variables I have in my code (per thread). What could be the reason for that?
Also, setting "maxregcount" to 32 (32 * 2048 (max threads per SMX) = 65536 (max registers per SMX)) removes the occupancy limit, but I don't get much of a speedup. Does "maxregcount" make the compiler optimize the code further so that it is less wasteful with registers, or does it simply spill the excess to L1 cache or local memory?

As per the NVIDIA presentation given here, if the source exceeds the register limit, local memory is used. It's worth spending time studying this presentation, as it describes various options to increase performance. As Vasily Volkov says in this presentation, occupancy is one of the metrics, not the only one.
Also notice that
32 * 2048 (max threads per SMX) = 65536 (max registers per SMX) is somewhat misleading, I feel.
32 registers per thread * 1024 (max threads per block) = 32768 < 65536 (max registers per block), so you can still increase the number of registers per thread up to 64 before hitting the per-block register limit.

maxrregcount does cause the compiler to rearrange its use of registers, but the compiler is always trying to keep register count low. Where it can't stay below your imposed limit, it will simply spill to L1, L2 and DRAM. When you have to go to DRAM to fetch your spilled local variables, the spill traffic can crowd out your explicit memory fetches and/or cause your kernel to become "latency-bound", that is, computation is held up while waiting for the data to come back.
You might have better luck choosing something between unlimited registers and 32. Often some spilling and less than perfect occupancy beats lots of spilling with 100% occupancy for the reasons given above.
As a side note, you can limit registers for a specific kernel (rather than the whole file) by using __launch_bounds__, which you can read about in the Programming Guide.
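For illustration, a per-kernel limit could look like the following minimal sketch (the kernel name, the chosen bounds, and the Kepler-style 65536 registers per SM are assumptions for the example, not taken from the question):

// Ask the compiler to budget registers so that blocks of 256 threads can have
// at least 4 blocks resident per SM. With 65536 registers per SM, that implies
// a target of at most 65536 / (256 * 4) = 64 registers per thread.
__global__ void __launch_bounds__(256, 4) myKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * 2.0f;    // placeholder work
}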

Related

CUDA memory bandwidth when reading a limited number of finite-sized chunks?

Knowing hardware limits is useful for understanding if your code is performing optimally. The global device memory bandwidth limits how many bytes you can read per second, and you can approach this limit if the chunks you are reading are large enough.
But suppose you are reading, in parallel, N chunks of D bytes each, scattered in random locations in global device memory. Is there a useful formula limiting how much of the bandwidth you'd be able to achieve then?
Let's assume:
we are talking about accesses from device code
a chunk of D bytes means D contiguous bytes
when reading a chunk, the read operation is fully coalesced - those bytes are read 4 bytes per thread, by however many adjacent threads in the block D/4 dictates.
the temporal and spatial characteristics are such that no two chunks are within 32 bytes of each other - either they are all gapped by at least that much, or else the distribution of loads in time is such that the L2 doesn't provide any benefit. This is pretty much saying the L2 hit rate is zero, which seems implicit in your statement "global device memory bandwidth" - if the L2 hit rate is not zero, you're not measuring (purely) global device memory bandwidth
we are talking about a relatively recent GPU architecture, say Pascal or newer, or else for an older architecture the L1 is disabled for global loads. Pretty much saying the L1 hitrate is zero.
the overall footprint is not so large as to thrash the TLB
the starting address of each chunk is aligned to a 32-byte boundary (&)
your GPU is sufficiently saturated with warps and blocks to make full use of all resources (e.g. all SMs, all SM partitions, etc.)
the actual chunk access pattern (distribution of addresses) does not result in partition camping or some other hard-to-predict effect
In that case, you can simply round the chunk size D up to the next multiple of 32, and do a calculation based on that. What does that mean?
The predicted bandwidth (B) is:
Bd = the device memory bandwidth of your GPU as indicated by deviceQuery
B = Bd/(((D+31)/32)*32)
And the resultant units there are chunks/sec (bytes/sec divided by bytes/chunk). The second division shown, (D+31)/32, is integer division, i.e. any fractional part is dropped.
(&) In the case where we don't want this assumption, the worst case is to add an additional 32-byte segment per chunk. The formula then becomes:
B = Bd/((((D+31)/32)+1)*32)
Note that this worst case cannot occur when the chunk size is less than 34 bytes.
All I am really doing here is calculating the number of 32-byte DRAM transactions that would be generated by a stream of such requests, and using that to "derate" the observed peak (100% coalesced/100% utilized) case.
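As a quick sanity check on the arithmetic, a small host-side helper (the function name and the example numbers below are illustrative, not from the question) could compute the predicted rate directly:

// Illustrative helper implementing the formula above.
// bd_bytes_per_sec: device memory bandwidth Bd (e.g. from deviceQuery);
// d_bytes: chunk size D; aligned: whether chunks start on 32-byte boundaries.
double predicted_chunks_per_sec(double bd_bytes_per_sec, unsigned d_bytes, bool aligned)
{
    unsigned segments = (d_bytes + 31) / 32;       // integer division, as in the text
    if (!aligned)
        segments += 1;                             // worst case: one extra 32-byte segment
    return bd_bytes_per_sec / (segments * 32.0);   // bytes/sec divided by bytes/chunk
}
// Example: a 100 GB/s device reading aligned 40-byte chunks:
// predicted_chunks_per_sec(100e9, 40, true) = 100e9 / 64 = ~1.56e9 chunks/sec.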
Under @RobertCrovella's assumptions, and assuming the chunk sizes are multiples of 32 bytes and the chunks are 32-byte aligned, you will get the same bandwidth as for a single chunk - as Robert's formula tells you. So, no benefit and no detriment.
But ensuring these assumptions hold is often not trivial (even merely ensuring coalesced memory reads).

CUDA coalesced memory access speed depending on word size

I have a CUDA program where one warp needs to access (for example) 12 bytes of global memory.
It properly aligns the memory location and lane indices such that the access is coalesced and done in a single transaction.
The program could do the access using 12 lanes each accessing a uint8_t. Alternatively, it could use 6 lanes accessing a uint16_t, or 3 lanes accessing a uint32_t.
Is there a performance difference between these alternatives, is the access faster if each thread accesses a smaller amount of memory?
When the amounts of memory each warp needs to access vary, is there a benefit in optimizing it such that the threads are made to access smaller units (16bit or 8bit) when possible?
Without knowing how the data will be used once it is in registers, it is hard to state the optimal option. For almost all GPUs the performance difference between these options will likely be very small.
The NVIDIA GPU L1 cache can return either 64 bytes/warp (CC 5.x, 6.x) or 128 bytes/warp (CC 3.x, 7.x). As long as the access size is <= 32 bits per thread, the performance should be very similar.
On CC 5.x/6.x there may be a small performance benefit to reducing the number of predicated-true threads (prefer larger data). The L1TEX unit breaks a global access into requests of 4 x 8 threads. If full groups of 8 threads are predicated off, an L1TEX cycle is saved. Write-back to the register file takes the same number of cycles. The grouping order of threads is not disclosed.
Good practice is to write a micro-benchmark. The CUDA profilers have numerous counters for different portions of the L1TEX path to help see the difference.
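A minimal sketch of such a micro-benchmark (the kernel names and the idea of a plain copy are made up for illustration) is to run the same data movement at different word widths and compare the timings and L1TEX counters:

#include <cstdint>
#include <cstddef>

// Same bytes moved, different access width per thread.
// Launch both over the same buffer and compare profiler counters / timings.
__global__ void copy_u8(const uint8_t *in, uint8_t *out, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];        // 1 byte per thread
}

__global__ void copy_u32(const uint32_t *in, uint32_t *out, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];        // 4 bytes per thread
}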

CUDA purpose of manually specifying thread blocks

I just started learning CUDA and there is something I can't quite understand yet. I was wondering whether there is a reason for splitting threads into blocks besides optimizing GPU workload. Because if there isn't, I can't understand why you would need to manually specify the number of blocks and their sizes. Wouldn't it be better to simply supply the number of threads needed to solve the task and let the GPU distribute the threads over the SMs?
That is, consider the following dummy task and GPU setup.
number of available SMs: 16
max number of blocks per SM: 8
max number of threads per block: 1024
Let's say we need to process every entry of a 256x256 matrix and we want a thread assigned to every entry, i.e. the overall number of threads is 256x256 = 65536. Then the number of blocks is:
overall number of threads / max number of threads per block = 65536 / 1024 = 64
Finally, 64 blocks will be distributed among 16 SMs, making it 4 blocks per SM. Now these are trivial calculations that the GPU could handle automatically, right?
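(For reference, the manual version of that configuration in code would be something like the following; processEntry and d_matrix are placeholders, not part of the question.)

// One thread per matrix entry; placeholder per-entry work.
__global__ void processEntry(float *m)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < 256 * 256) m[i] += 1.0f;
}

void launch(float *d_matrix)
{
    dim3 block(1024);                                  // max threads per block from the example
    dim3 grid((256 * 256 + block.x - 1) / block.x);    // = 64 blocks for the 256x256 matrix
    processEntry<<<grid, block>>>(d_matrix);
}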
The only other reason for manually supplying the number of blocks and their sizes, that I can think of, is separating threads in a specific fashion in order for them to have shared local memory, i.e. somewhat isolating one block of threads from another block of threads.
But surely there must be another reason?
I will try to answer your question from the point of view of what I understand best.
The major factor that decides the number of threads per block is the multiprocessor occupancy. The occupancy of a multiprocessor is calculated as the ratio of the active warps to the maximum number of active warps supported. The threads of a warp may be active or dormant for many reasons depending on the application. Hence a fixed structure for the number of threads may not be viable.
Besides, each multiprocessor has a fixed number of registers shared among all the threads of that multiprocessor. If the total number of registers needed exceeds the maximum, the kernel launch is liable to fail.
Further to the above, the fixed amount of shared memory available to a given block may also affect the decision on the number of threads, in case the shared memory is heavily used.
Hence a naive way to decide the number of threads is to use the occupancy calculator spreadsheet, if you want to stay oblivious to the type of application at hand. The better option is to consider occupancy along with the characteristics of the application being run.
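If you would rather have the runtime suggest a block size than fix one by hand, the occupancy API can do that. A minimal sketch, with myKernel standing in for your own kernel:

#include <cuda_runtime.h>

__global__ void myKernel(float *data, int n)       // placeholder kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

void launch(float *d_data, int n)
{
    int minGridSize = 0, blockSize = 0;
    // Ask the runtime for the block size that maximizes occupancy for this kernel.
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, myKernel, 0, 0);
    int gridSize = (n + blockSize - 1) / blockSize;  // enough blocks to cover n elements
    myKernel<<<gridSize, blockSize>>>(d_data, n);
}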

NSIGHT: What are those Red and Black colour in kernel-level experiments?

I am trying to learn NSIGHT.
Can someone tell me what these red marks indicate in the following screenshot taken from the User Guide? There are two red marks in the Occupancy per SM section and two in the Warps section, as you can see.
Similarly what are those black lines which are varying in length, indicating?
Another example from same page:
Here is the basic explanation:
Grey bars represent the available amount of resources your particular device has (due to both its hardware and its compute capability).
Black bars represent the theoretical limit that it is possible to achieve for your kernel under your launch configuration (blocks per grid and threads per block).
The red dots represent the resources that you are actually using.
For instance, looking at "Active warps" on the first picture:
Grey: The device supports 64 active warps concurrently.
Black: Because of the kernel's register usage, it is theoretically possible to have 64 active warps.
Red: You achieve 63.56 active warps.
In this case, the grey bar is under the black one, so you can't see the grey one.
In some cases, it can happen that the theoretical limit is greater than the device limit. This is OK. You can see examples in the second picture (block limit (shared memory) and block limit (registers)). That makes sense if you consider that your kernel uses only a small fraction of your resources: if one block used only 1 register, it would be possible to launch 65536 blocks (without taking other factors into account), but your device limit is still 16. The number 128 in the picture comes from 65536/512. The same applies to the shared memory section: since you use 0 bytes of shared memory per block, you could launch an unlimited number of blocks as far as the shared memory limitation is concerned.
About blank spaces
The theoretical and the achieved values are the same for all rows except for "Active warps" and "Occupancy".
You are really executing 1024 threads per block with 32 warps per block on the first picture.
In the case of Occupancy and Active warps, I guess the achieved number is a kind of statistical measure. I think that is because of the nature of the CUDA model. In CUDA each thread within a warp is executed simultaneously on an SM, and the way of hiding high-latency operations, such as memory reads, is through "almost-free" warp context switches. I guess it is difficult to take an exact measure of the number of active warps in that situation. Besides hardware concepts, we also have to take the kernel implementation into account; branch divergence, for instance, could make one warp slower than others, etc.
Extended information
As you saw, these numbers are closely related to your device specific hardware and compute capability, so perhaps a concrete example could help here:
A device with compute capability 3.0 can handle a maximum of 2048 threads per SM, 16 blocks per SM and 64 warps per SM. You also have a maximum number of registers available (65536 in that case).
This Wikipedia entry is a handy reference for the features of each compute capability.
You can query these parameters using the deviceQuery utility sample code provided with the CUDA toolkit or, at execution time, using the CUDA API as here.
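For instance, a minimal runtime query of a few of these limits with the CUDA API could look like this:

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                  // properties of device 0
    printf("SMs:                   %d\n", prop.multiProcessorCount);
    printf("Max threads per SM:    %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Registers per SM:      %d\n", prop.regsPerMultiprocessor);
    printf("Shared memory / block: %zu\n", prop.sharedMemPerBlock);
    printf("Warp size:             %d\n", prop.warpSize);
    return 0;
}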
Performance considerations
The thing is that, ideally, 16 blocks of 128 threads could be executed using fewer than 32 registers per thread. That would mean a high occupancy rate. In most cases your kernel needs more than 32 registers per thread, so it is no longer possible to execute 16 blocks concurrently on the SM; the reduction is then done at block-level granularity, i.e. by decreasing the number of blocks. And this is what the bars capture.
You can play with the number of threads and blocks, or even with the __launch_bounds__ qualifier, to optimize your kernel, or you can use the --maxrregcount setting to lower the number of registers used by a single kernel and see if it improves overall execution speed.

Increasing block size decreases performance

In my CUDA code, if I increase blocksizeX, blocksizeY it actually takes more time (therefore I run it at 1x1). Also, a large chunk of my execution time (e.g. 7 out of 9 s) is taken by just the call to the kernel. In fact I am quite amazed that even if I comment out the entire kernel, the time is almost the same. Any suggestions on where and how to optimize?
P.S. I have edited this post with my actual code. I am downsampling an image, so every 4 neighboring pixels (e.g. pixels 1, 2 from row 1 and pixels 1, 2 from row 2) give one output pixel. I get an effective bandwidth of 0.5 GB/s compared to a theoretical maximum of 86.4 GB/s. The time I use is the difference between calling the kernel with instructions and calling an empty kernel.
It looks pretty bad to me right now, but I can't figure out what I am doing wrong.
__global__ void streamkernel(int *r_d, int *g_d, int *b_d, int height, int width, int *f_r, int *f_g, int *f_b)
{
    // Flattened global thread index over a 2D grid of 2D blocks.
    int id = blockIdx.x * blockDim.x * blockDim.y
           + threadIdx.y * blockDim.x + threadIdx.x
           + blockIdx.y * gridDim.x * blockDim.x * blockDim.y;
    // Top-left input pixel of the 2x2 group this thread averages.
    int number = 2 * (id % (width / 2)) + (id / (width / 2)) * width * 2;
    if (id < height * width / 4)
    {
        f_r[id] = (r_d[number] + r_d[number + 1] + r_d[number + width] + r_d[number + width + 1]) / 4;
        f_g[id] = (g_d[number] + g_d[number + 1] + g_d[number + width] + g_d[number + width + 1]) / 4;
        f_b[id] = (b_d[number] + b_d[number + 1] + b_d[number + width] + b_d[number + width + 1]) / 4;
    }
}
Try looking up the matrix multiplication example in CUDA SDK examples for how to use shared memory.
The problem with your current kernel is that it's doing 4 global memory reads and 1 global memory write for each 3 additions and 1 division. Each global memory access costs roughly 400 cycles. This means you're spending the vast majority of time doing memory access (what GPUs are bad at) rather than compute (what GPUs are good at).
Shared memory in effect allows you to cache this so that amortized, you get roughly 1 read and 1 write at each pixel for 3 additions and 1 division. That is still not doing so great on the CGMA ratio (compute to global memory access ratio, the holy grail of GPU computing).
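For one channel only, a minimal sketch of that staging idea (assuming TILE threads per block, width a multiple of 2*TILE, an even height, and a grid of ((width/2)/TILE, height/2) blocks; all names here are illustrative) could look like:

#define TILE 128   // output pixels (and threads) per block

__global__ void downsample_r(const int *r_d, int *f_r, int width)
{
    __shared__ int row0[2 * TILE];                 // staged pixels from the upper source row
    __shared__ int row1[2 * TILE];                 // staged pixels from the lower source row

    int outW   = width / 2;                        // output image width
    int outX   = blockIdx.x * TILE + threadIdx.x;  // output column handled by this thread
    int outY   = blockIdx.y;                       // output row handled by this block
    int base   = 2 * blockIdx.x * TILE;            // first input column covered by this tile
    int inRow0 = 2 * outY * width;                 // offset of the upper source row
    int inRow1 = inRow0 + width;                   // offset of the lower source row

    // Coalesced staging: adjacent threads read adjacent input pixels.
    row0[threadIdx.x]        = r_d[inRow0 + base + threadIdx.x];
    row0[threadIdx.x + TILE] = r_d[inRow0 + base + threadIdx.x + TILE];
    row1[threadIdx.x]        = r_d[inRow1 + base + threadIdx.x];
    row1[threadIdx.x + TILE] = r_d[inRow1 + base + threadIdx.x + TILE];
    __syncthreads();

    // Each thread averages its own 2x2 neighbourhood out of shared memory.
    int c = 2 * threadIdx.x;
    f_r[outY * outW + outX] = (row0[c] + row0[c + 1] + row1[c] + row1[c + 1]) / 4;
}

The same pattern would be repeated for the g and b channels; the main change versus the original kernel is that the global loads become fully coalesced instead of strided.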
Overall, I think for a simple kernel like this, a CPU implementation is likely going to be faster given the overhead of transferring data across the PCI-E bus.
You're forgetting the fact that one multiprocessor can execute up to 8 blocks simultaneously, and maximum performance is typically reached exactly then. However, there are many factors that limit the number of blocks that can exist in parallel (incomplete list):
Maximum amount of shared memory per multiprocessor limits the number of blocks if #blocks * shared memory per block would be > total shared memory.
Maximum number of threads per multiprocessor limits the number of blocks if #blocks * #threads / block would be > max total #threads.
...
You should try to find a kernel execution configuration that causes exactly 8 blocks to be run on one multiprocessor. This will almost always yield the highest performance, even if the occupancy is not 1.0! From this point on you can iteratively make changes that reduce the number of blocks executed per MP but increase the occupancy of your kernel, and see if the performance increases.
The NVIDIA occupancy calculator (Excel sheet) will be of great help.