Can a const * __restrict__ increase CUDA register usage?

Because my pointers all point to non-overlapping memory, I've gone all out and changed the pointers passed to my kernels (and their inlined functions) to be restricted, and made them const too, wherever possible. This, however, increased the register usage of some kernels and decreased it for others. This doesn't make much sense to me.
Does anybody know why this can be the case?

Yes, it can increase register usage.
Referring to the programming guide for __restrict__:
The effects here are a reduced number of memory accesses and reduced number of computations. This is balanced by an increase in register pressure due to "cached" loads and common sub-expressions.
Since register pressure is a critical issue in many CUDA codes, use of restricted pointers can have negative performance impact on CUDA code, due to reduced occupancy.
const __restrict__ may be beneficial for at least 2 reasons:
On architectures that support it (cc 3.5 and later), it may enable the compiler to route loads through the read-only data cache (e.g. via __ldg()), which can be a performance-enhancing feature.
As indicated in the above linked programming guide section, it may enable other optimizations to be made by the compiler (e.g. reducing instructions and memory accesses) which also may improve performance if the corresponding register pressure does not become an issue.
It may be non-intuitive that reducing instructions and memory accesses can lead to increased register pressure. Let's consider the example given in the programming guide link above:
void foo(const float* a, const float* b, float* c) {
    c[0] = a[0] * b[0];
    c[1] = a[0] * b[0];
    c[2] = a[0] * b[0] * a[1];
    c[3] = a[0] * a[1];
    c[4] = a[0] * b[0];
    c[5] = b[0];
    ...
}
If we allow for pointer aliasing in the above example, then the compiler can't make many optimizations, and it is essentially reduced to performing the code exactly as written. The first line of code:
c[0] = a[0] * b[0];
will require 3 registers. The next line of code:
c[1] = a[0] * b[0];
will also require 3 registers, and because everything is being generated as-written, they can be the same 3 registers, reused. Similar register reuse can occur for the remainder of the example, resulting in low overall register usage/pressure.
But if we allow the compiler to re-order things, then we must have registers assigned for each value loaded up front, and reserved until that value is retired. This re-ordering can increase register usage/pressure, but may ultimately lead to faster code (or it may lead to slower code, if the register pressure becomes a performance limiter).
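To make the trade-off concrete, here is roughly the transformation the compiler is allowed to make once all three pointers are declared __restrict__, adapted from that same programming guide section. The temporaries t0 through t3 must all stay live across the re-ordered writes, which is exactly the extra register pressure described above:
void foo(const float* __restrict__ a,
         const float* __restrict__ b,
         float* __restrict__ c)
{
    float t0 = a[0];
    float t1 = b[0];
    float t2 = t0 * t1;   // a[0]*b[0] computed once, reused for c[0], c[1], c[4]
    float t3 = a[1];
    c[0] = t2;
    c[1] = t2;
    c[4] = t2;
    c[2] = t2 * t3;
    c[3] = t0 * t3;
    c[5] = t1;
    ...
}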


How to reuse code for CPU fallback in CUDA

I have some calculations that I want to parallelize if my user has a CUDA-compliant GPU, otherwise I want to execute the same code on the CPU. I don't want to have two versions of the algorithm code, one for CPU and one for GPU to maintain. I'm considering the following approach but am wondering if the extra level of indirection will hurt performance or if there is a better practice.
For my test, I took the basic CUDA template that adds the elements of two integer arrays and stores the result in a third array. I removed the actual addition operation and placed it into its own function marked with both device and host directives...
__device__ __host__ void addSingleItem(int* c, const int* a, const int* b)
{
    *c = *a + *b;
}
... then modified the kernel to call the aforementioned function on the element identified by threadIdx...
__global__ void addKernel(int* c, const int* a, const int* b)
{
    const unsigned i = threadIdx.x;
    addSingleItem(c + i, a + i, b + i);
}
So now my application can check for the presence of a CUDA device. If one is found I can use...
addKernel <<<1, size>>> (dev_c, dev_a, dev_b);
... and if not I can forego parallelization and iterate through the elements calling the host version of the function...
int* pA = (int*)a;
int* pB = (int*)b;
int* pC = (int*)c;
for (int i = 0; i < arraySize; i++)
{
    addSingleItem(pC++, pA++, pB++);
}
Everything seems to work in my small test app but I'm concerned about the extra call involved. Do device-to-device function calls incur any significant performance hits? Is there a more generally accepted way to do CPU fallback that I should adopt?
If addSingleItem and addKernel are defined in the same translation unit/module/file, there should be no cost to having a device-to-device function call. The compiler will aggressively inline that code, as if you wrote it in a single function.
That is undoubtedly the best approach if it can be managed, for the reason described above.
If it's desired to still have some file-level modularity, it is possible to break code into a separate file and include that file in the compilation of the kernel function. Conceptually this is no different than what is described already.
Another possible approach is to use compiler macros to assist in the addition or removal or modification of code to handle the GPU case vs. non-GPU case. There are endless possibilities here, but see here for a simple idea. You can redefine what __host__ __device__ means in different scenarios, for example. I would say this probably only makes sense if you are building separate binaries for the GPU vs. non-GPU case, but you may find a clever way to handle it in the same executable.
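As a minimal sketch of that macro idea (the CUDA_CALLABLE name is made up for illustration; __CUDACC__ is the macro nvcc defines while compiling CUDA sources):
// Hypothetical helper macro: expands to __host__ __device__ when compiled
// by nvcc, and to nothing for a plain host compiler, so the same function
// definition serves both the GPU build and the CPU-fallback build.
#ifdef __CUDACC__
#define CUDA_CALLABLE __host__ __device__
#else
#define CUDA_CALLABLE
#endif

CUDA_CALLABLE void addSingleItem(int* c, const int* a, const int* b)
{
    *c = *a + *b;
}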
Finally, if you desire this but must place the __device__ function in a separate translation unit, it is still possible, but there may be some performance loss due to the device-to-device function call across module boundaries. The amount of performance loss here is hard to generalize since it depends heavily on code structure, but it's not unusual to see a 10% or 20% performance hit. In that case, you may wish to investigate the link-time optimization that became available in CUDA 11.
This question may also be of interest, although only tangentially related here.

Loading from global memory

Suppose a simple kernel like this:
__global__ void fg(struct s_tp tp, struct s_param p)
{
    const uint bid = blockIdx.y * gridDim.x + blockIdx.x;
    const uint tid = threadIdx.x;
    const uint idx = bid * blockDim.x + tid;
    if (idx >= p.ntp) return;
    double3 r = tp.rh[idx];
    double d = sqrt(r.x*r.x + r.y*r.y + r.z*r.z);
    tp.d[idx] = d;
}
Is the following true?
double3 r = tp.rh[idx];
• the data are loaded from global memory into the r variable.
• r is stored in registers or, if there are many variables, in local memory.
• r is not stored in shared memory.
• d is calculated and then written back into global memory.
• registers are faster than the other memory spaces.
• if the register space is full (in some big kernels), local memory is used, and access is slower.
• when I need doubles, is there any way to speed it up? For example, load the data into shared memory first and then operate on it?
Thanks to all.
Yes, it's pretty much all true.
• when I need doubles, is there any way to speed it up? For example, load the data into shared memory first and then operate on it?
Using shared memory is useful when there is either data reuse (loading the same data item more than once, usually by more than one thread in a threadblock), or possibly when you are making a specialized use of shared memory to aid in global coalescing, such as during an optimized matrix transpose.
Data reuse means that you are using (loading) the data more than once, and for shared memory to be useful, it means you are loading it more than once by more than one thread. If you are using it more than once in a single thread, then the single load plus the compiler (automatic) "optimization" of storing it in a register is all you need.
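For illustration, here is a minimal (hypothetical) case where shared memory does pay off, because every loaded value is read by two different threads after the barrier:
// Each element of 'in' is loaded from global memory exactly once, but is
// consumed by both the thread that loaded it and the thread to its left.
__global__ void neighbor_sum(const float *in, float *out, int n)
{
    __shared__ float tile[256];                   // assumes blockDim.x == 256
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        tile[threadIdx.x] = in[idx];              // one global load per element
    __syncthreads();
    if (idx < n) {
        float right = (threadIdx.x + 1 < blockDim.x && idx + 1 < n)
                          ? tile[threadIdx.x + 1]  // value loaded by the neighbouring thread
                          : 0.0f;
        out[idx] = tile[threadIdx.x] + right;
    }
}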
EDIT
The answer given by @Jez has some good ideas for optimal loading. Another idea I would suggest is to convert your AoS data storage scheme to an SoA scheme. Data storage transformation is a common approach to improving the speed of CUDA codes.
Your s_tp struct, which you haven't shown, appears to have storage for several double quantities per item/struct. If you instead create separate arrays for each of these quantities, you'll have opportunities for optimal loading/storage. Something like this:
__global__ void fg(struct s_tp tp, double* s_tp_rx, double* s_tp_ry,
                   double* s_tp_rz, double* s_tp_d, struct s_param p)
{
    const uint bid = blockIdx.y * gridDim.x + blockIdx.x;
    const uint tid = threadIdx.x;
    const uint idx = bid * blockDim.x + tid;
    if (idx >= p.ntp) return;
    double rx = s_tp_rx[idx];
    double ry = s_tp_ry[idx];
    double rz = s_tp_rz[idx];
    double d = sqrt(rx*rx + ry*ry + rz*rz);
    s_tp_d[idx] = d;
}
This approach is likely to have benefits elsewhere in your device code also, for similar types of usage patterns.
It's all true.
when I need doubles, is there any way to speed it up? For example, load the data into shared memory first and then operate on it?
For the example you gave, your implementation is possibly not optimal. The first thing you should do is compare the bandwidth achieved to that of a reference kernel, for example a cudaMemcpy. If the gap is large, and the speedup you'll gain from closing it is significant, optimisations may be possible.
Looking at your kernel there are two things that strike me as potentially suboptimal:
There's not much work per thread. If possible, processing multiple elements per thread can improve performance. This is, in part, because it avoids thread initialisation/removal overheads.
Loading from a double3 isn't as efficient as loading from some other types. The best way to load data is usually using 128-bit loads per thread. Loading three consecutive 64-bit values will be slower, perhaps not by a lot, but slower all the same.
EDIT: Robert Crovella's answer below gives a good solution to the second point, which requires changing your data type. For some reason I had originally thought this wasn't an option, so the solution below is probably over-the-top if you can just change your data type!
While adding more work per thread is a fairly simple thing to try, optimising your memory access pattern (without changing your data type) is less so. Fortunately there are libraries that can help. I think that using CUB, and in particular the BlockLoad collective, should allow you to load more efficiently. By loading, say, 6 doubles per thread using a transpose operator, you can process two elements per thread, pack the results into a double2, and store them normally.
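As a rough sketch of that CUB idea (block size and items-per-thread are arbitrary choices here, the kernel is assumed to be launched with blockDim.x equal to BLOCK_THREADS, and for brevity the total number of doubles is assumed to be a multiple of the per-block tile; otherwise use the Load overload that takes a valid-items count):
#include <cub/cub.cuh>

// BLOCK_LOAD_TRANSPOSE does a coalesced, striped read through shared memory
// and hands each thread 6 consecutive doubles, i.e. two double3 elements.
__global__ void fg_cub(const double *rh, double *d_out, unsigned ntp)
{
    enum { BLOCK_THREADS = 128, ITEMS_PER_THREAD = 6 };

    typedef cub::BlockLoad<double, BLOCK_THREADS, ITEMS_PER_THREAD,
                           cub::BLOCK_LOAD_TRANSPOSE> BlockLoad;
    __shared__ typename BlockLoad::TempStorage temp_storage;

    double items[ITEMS_PER_THREAD];
    const unsigned block_offset = blockIdx.x * BLOCK_THREADS * ITEMS_PER_THREAD;
    BlockLoad(temp_storage).Load(rh + block_offset, items);

    // Each thread now holds two (x, y, z) triples in registers; compute both
    // distances and pack them into a double2 so the store back to global
    // memory is a single 128-bit transaction.
    double2 result;
    result.x = sqrt(items[0]*items[0] + items[1]*items[1] + items[2]*items[2]);
    result.y = sqrt(items[3]*items[3] + items[4]*items[4] + items[5]*items[5]);

    const unsigned first = block_offset / 3 + 2 * threadIdx.x;  // first of this thread's two elements
    if (first + 1 < ntp)
        reinterpret_cast<double2 *>(d_out)[first / 2] = result;
}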

Counting registers/thread in a CUDA kernel

The Nsight profiler tells me that the following kernel uses 52 registers per thread:
//Just the first lines of the kernel.
__global__ void voles_kernel(float *params, int *ctrl_params,
                             float dt, float currTime,
                             float *dev_voles, float *dev_weasels,
                             curandStateMtgp32 *state)
{
    __shared__ float dev_params[9];
    __shared__ int BuYeSimStep[4];
    if (threadIdx.x < 4)
    {
        BuYeSimStep[threadIdx.x] = ctrl_params[threadIdx.x];
    }
    if (threadIdx.x < 9) {
        dev_params[threadIdx.x] = params[threadIdx.x];
    }
    __syncthreads();
    float currVole = curand_uniform(&state[blockIdx.x]) + 3.0;
    float currWeas = curand_uniform(&state[blockIdx.x]) + 0.1;
    float oldVole = currVole;
    float oldWeas = currWeas;
    int jj;
    if (blockIdx.x * blockDim.x + threadIdx.x < BuYeSimStep[2])
    {
        int dayIndex = 0;
        /* Not declaring any new variable from here on, just doing arithmetics.
           ....... */
If each register has 4 bytes, I don't understand how we get to 52 registers, even assuming that the arrays params[9] and ctrl_params[4] end up in registers (in which case using shared memory as I did doesn't make sense). I would like to increase occupancy, but I don't get why I'm using so many registers. Any ideas?
It's generally difficult to look at C code and predict the register usage from it. The compiler may aggressively optimize code by increasing register usage, perhaps to save an instruction here or there. You seem to be making an assumption that register usage can be predicted from your C code variable allocations, and while there is some connection between the two, you cannot assume register usage can be computed directly from C code variable allocations.
Since you haven't provided your complete code, nobody can give specific help with the register usage. If you want to better understand it, you will need to look at the PTX code directly. To do this, compile your code using nvcc with the -ptx switch and inspect the resultant .ptx file; you may wish to refer to the PTX documentation as well as the nvcc documentation to look at the various compiler options.
You haven't provided your code, so it's not really possible to make any direct suggestions, but you may be able to reduce register usage by reducing constant usage, reducing or refactoring arithmetic usage, switching from double to float, and I'm sure there are many other suggestions as well. Register usage will also be affected if you are passing the -G switch to the compiler.
You can limit the compiler's usage of registers per thread by passing the -maxrregcount switch to nvcc with an appropriate parameter, such as -maxrregcount 20, which will instruct the compiler to limit itself to 20 registers per thread. This tactic may not give good results, or you may need to tune the parameter to a value which doesn't sacrifice too much performance. However, you may find an optimum choice which doesn't sacrifice too much basic performance but allows you to improve occupancy. If you constrain the compiler too much, it will begin to spill its register usage to local memory, which will generally reduce performance.
You should also be aware that you can pass -Xptxas -v to nvcc which will give useful output about the compiler's register usage and other related data (spilling, etc.) at compile time.
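For reference, a hypothetical compile line combining the switches mentioned above might look like this (the source file name and target architecture are just placeholders):
nvcc -arch=sm_21 -Xptxas -v -maxrregcount 20 -o voles voles_kernel.cu
nvcc -arch=sm_21 -ptx voles_kernel.cu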
If you want to increase occupancy, a direct way is to use the compiler flag -maxrregcount to restrict the usage of registers, but it may cause a performance loss because some registers will be spilled to local memory, which is very slow.
I suggest you debug your code with Eclipse Nsight.
Create a breakpoint at the first line of your kernel and step to it.
In the Debug perspective, inside the CUDA thread view, you have the current stack trace. Right-click on the stack and click "Instruction Stepping Mode". The "Disassembly" window will show your kernel's assembly. You can continue stepping through your kernel to track the correlation between your source code and the assembly, so you can discover what each register is used for.

Could setting an independent variable early increase performance?

Threads don’t stall on memory access
From the famous paper http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf by Vasily Volkov
I am assuming based on this statement that this:
__device__ int a;
int b, c, d;
a = b * c;
// Do some work that is independent of 'a'
// ...
d = a + 1;
Is faster than this
__device__ int a;
int b, c, d;
a = b * c;
d = a + 1;
// Do some work that is independent of 'a'
// ...
I am assuming this only because, in the first version, I give the thread the chance to execute other instructions while the write to global memory is in flight, while in the second version I do not.
Is my assumption right?
And if my assumption is right, is it then good practice to set all variables that are going to be used later at the beginning of the kernel, given that they are independent of each other, and assuming that a is not cached?
Really, the stall being referenced concerns a memory read.
The point is that a memory read by itself does not generate a stall; using the value that was read, before it is actually available, is what causes the stall.
Suppose I have:
__device__ int a[32];
Then this thread code does not cause a stall (although it generates a memory transaction):
int b = a[0];
But if I do this, I will get a stall:
int b = a[0];
int c = a[1];
int d = b * c; // stall occurs here
Therefore, if I can do this:
int b = a[0];
int c = a[1];
// do lots of other work here
int d = b * c; // this might not stall
For Fermi and Kepler GPUs, writes to global memory (and reads of values previously written, assuming they have not been evicted from the cache) are serviced by caches, so thread code that appears to be writing to global memory is usually writing to the L1 or L2 cache, and the actual write transaction to global memory will occur later, and does not necessarily cause a stall of any kind.
So in your example, ordinarily a will be serviced by a cache:
__device__ int a;
int b, c, d;
a = b * c; // a gets written to cache
d = a + 1; // a is serviced from cache
Note that servicing from the cache is still slower than the fastest access mechanisms (e.g. registers and shared mem) but it's much much faster than a global memory stall.
Having said all this, the compiler will ordinarily do a number of things that may affect this. First of all, rather than you manually re-ordering your code, the compiler may spot independent work, and, to some degree, re-order your code for you. Secondly, in your example, the compiler will spot that a is re-used and most likely assign it to a register variable, in addition to updating the value in global memory at some point. The fact that it is in a register means using a in the last line of your example above will most likely get serviced out of the register, not global memory or the cache.
So to answer your questions, I would say that generally, your assumption will not be correct. The compiler will spot the re-use of a and assign it to a register, completely eliminating the hazard that you think exists. In theory, if there were no caches (true for compute 1.x devices) and no registers, then the compiler might be forced to use global memory as you suggest, but in practice it won't happen.

Code running perfectly on host, put in a kernel, fails for mysterious reasons

I have to port a pre-existing "host-only" backpropagation implementation to CUDA. I think the nature of the algorithm doesn't matter here, so I won't give much explanation about the way it works. What I think matters, though, is that it uses 3-dimensional arrays, all three dimensions of which are dynamically allocated.
I use VS2010 with CUDA 5.0, and my device has compute capability 2.1. The original host-only code can be downloaded here
→ http://files.getwebb.org/view-cre62u4d.html
Main points of the code:
patterns from adult.data are loaded into memory, using the Data structure present in "pattern.h".
several multi-dimensional arrays are allocated
the algorithm is run over the patterns, using the arrays allocated just before.
If you want to try to run the code, don't forget to modify the PATH constant at the beginning of kernel.cu. I also advise you to use "2" layers, "5" neurons, and a learning rate of "0.00001". As you can see, this works perfectly: the "MSE" is improving. For those who have no clue what this algorithm does, let's simply say that it learns how to predict a target value based on the 14 variables present in the patterns. The "MSE" decreases, meaning that the algorithm makes fewer mistakes after each "epoch".
I spent a really long time trying to run this code on the device, and I'm still unsuccessful. My last attempt was simply copying the code that initializes the arrays and runs the algorithm into a big kernel, which failed again. This code can be downloaded here
→ http://files.getwebb.org/view-cre62u4c.html
To be precise, here are the differences with the original host-only code:
f() and fder(), which are used by the algorithm, become device functions.
parameters are hardcoded: 2 layers, 5 neurons, and a learning rate of 0.00001
the "w" array is initialized using a fixed value (0.5), not rand() anymore
a Data structure is allocated in device memory, and the data are sent to device memory after they have been loaded from adult.data into host memory
I think I did the minimal amount of modification needed to make the code run in a kernel. The "kernel_check_learningData" kernel shows some information about the patterns loaded into device memory, proving that the following code, which sends the patterns from the host to the device, did work:
Data data;
Data* dev_data;
int* dev_t;
double* dev_x;
...
input_adult(PathFile, &data);
...
cudaMalloc((void**)&dev_data, sizeof(Data));
cudaMalloc((void**)&dev_t, data.N * sizeof(int));
cudaMalloc((void**)&dev_x, data.N * data.n * sizeof(double));
// Filling the device with t and x's data.
cudaMemcpy(dev_t, data.t, data.N * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(dev_x, data.x, data.N * data.n * sizeof(double), cudaMemcpyHostToDevice);
// Updating t and x pointers into devices Data structure.
cudaMemcpy(&dev_data->t, &dev_t, sizeof(int*), cudaMemcpyHostToDevice);
cudaMemcpy(&dev_data->x, &dev_x, sizeof(double*), cudaMemcpyHostToDevice);
// Copying N and n.
cudaMemcpy(&dev_data->N, &data.N, sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(&dev_data->n, &data.n, sizeof(int), cudaMemcpyHostToDevice);
It apparently fails at the beginning of the forward phase, when reading the “w” array. I can’t find any explanation for that.
I see two possibilities:
the code sending the patterns into device memory is bugged, despite the fact that it seems to work properly, and provokes a bug much further on, when the forward phase begins.
the CUDA API is not behaving as it should!
I've been desperately searching for my mistake for a very long time, so I wondered if the community could provide me with some help.
Thanks.
Here's the problem in your code, and why it works in 64 bit machine mode but not 32 bit machine mode.
In your backpropagation kernel, in the forward path, you have a sequence of code like this:
/*
 * for layer = 0
 */
for (i = 0; i < N[0]; i++) {            // for all neurons i of layer 0
    a[0][i] = x[ data->n * pat + i];    // a[0][i] = input i
}
In 32 bit machine mode (Win32 project, --machine 32 is being passed to nvcc), the failure occurs on the iteration i=7 when the write of a[0][7] occurs; this write is out of bounds. At this point, a[0][7] is intended to hold a double value, but for some reason the indexing is placing us out of bounds.
By the way, you can verify this by simply opening a command prompt in the directory where your executable is built, and running the command:
cuda-memcheck test_bp
assuming test_bp.exe is the name of your executable. cuda-memcheck conveniently identifies that there is an out of bounds write occurring, and even identifies the line of source that it is occurring on.
So why is this out of bounds? Let's take a look earlier in the kernel code where a[0][] is allocated:
a[0] = (double *)malloc( N[0] * sizeof(double *) );
                                              ^ oops!!
a[0][] is intended to hold double data but you're allocating pointer storage.
As it turns out, in a 64 bit machine the two types of storage are the same size, so it ends up working. But in a 32-bit machine, a double pointer is 4 bytes whereas double data is 8 bytes. So, in a 32-bit machine, when we index through this array taking data strides of 8 bytes, we eventually run off the end of the array.
Elsewhere in the kernel code you are allocating storage for the other "layers" of a like this:
a[layer] = (double *)malloc( N[layer] * sizeof(double) );
which is correct. I see that the original "host-only" code seems to contain this error too, so there may be a latent defect in that code as well.
You will still need to address the kernel running time in some fashion to avoid the Windows TDR event, if you want to run on a Windows WDDM device. And as I already pointed out, this code makes no attempt to use the parallel capability of the machine.