Struggling with intuition regarding how warp-synchronous thread execution works - cuda

I am new to CUDA. I am working through basic parallel algorithms, like reduction, in order to understand how thread execution works. I have the following code:
__global__ void
Reduction2_kernel( int *out, const int *in, size_t N )
{
    extern __shared__ int sPartials[];
    int sum = 0;
    const int tid = threadIdx.x;
    for ( size_t i = blockIdx.x*blockDim.x + tid;
          i < N;
          i += blockDim.x*gridDim.x ) {
        sum += in[i];
    }
    sPartials[tid] = sum;
    __syncthreads();

    for ( int activeThreads = blockDim.x>>1;
          activeThreads > 32;
          activeThreads >>= 1 ) {
        if ( tid < activeThreads ) {
            sPartials[tid] += sPartials[tid+activeThreads];
        }
        __syncthreads();
    }

    if ( threadIdx.x < 32 ) {
        volatile int *wsSum = sPartials;
        if ( blockDim.x > 32 ) wsSum[tid] += wsSum[tid + 32]; // why do we need this statement? any example please?
        wsSum[tid] += wsSum[tid + 16]; // how are these statements executed in parallel within a warp?
        wsSum[tid] += wsSum[tid + 8];
        wsSum[tid] += wsSum[tid + 4];
        wsSum[tid] += wsSum[tid + 2];
        wsSum[tid] += wsSum[tid + 1];
        if ( tid == 0 ) {
            volatile int *wsSum = sPartials; // why is this statement needed?
            out[blockIdx.x] = wsSum[0];
        }
    }
}
Unfortunately it is not clear to me how the code works from the if ( threadIdx.x < 32 ) condition onward. Can somebody give an intuitive example with thread IDs and how the statements are executed? I think it is important to understand these concepts, so any help would be appreciated!

Let's look at the code in blocks, and answer your questions along the way:
int sum = 0;
const int tid = threadIdx.x;
for ( size_t i = blockIdx.x*blockDim.x + tid;
      i < N;
      i += blockDim.x*gridDim.x ) {
    sum += in[i];
}
The above code travels through a data set of size N. An assumption we can make for understanding purposes is that N > blockDim.x*gridDim.x, this last term simply being the total number of threads in the grid. Since N is larger than the total number of threads, each thread sums multiple elements from the data set. From the standpoint of a given thread, it sums elements that are spaced apart by the total number of threads in the grid (blockDim.x*gridDim.x). Each thread stores its sum in a local (presumably register) variable named sum.
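As a concrete (made-up) illustration: suppose the kernel were launched with gridDim.x = 2 and blockDim.x = 4, so 8 threads in total, and N = 20. Then the loop looks like this from each thread's point of view:

// thread (block 0, tid 1): i starts at 1 and advances by 8 each pass
//   it sums in[1] + in[9] + in[17]
// thread (block 1, tid 3): i starts at 7 and advances by 8 each pass
//   it sums in[7] + in[15]
// every element of in[0..19] is read by exactly one thread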
sPartials[tid] = sum;
__syncthreads();
As each thread finishes (i.e., as its for-loop index exceeds N), it stores its intermediate sum in shared memory and then waits for all the other threads in the block to finish.
for ( int activeThreads = blockDim.x>>1;
      activeThreads > 32;
      activeThreads >>= 1 ) {
    if ( tid < activeThreads ) {
        sPartials[tid] += sPartials[tid+activeThreads];
    }
    __syncthreads();
}
So far we haven't talked about the dimension of the block - it hasn't mattered. Let's assume each block has some integer multiple of 32 threads. The next step is to start gathering the various intermediate sums stored in shared memory into smaller and smaller groups of variables. The above code starts out by selecting half of the threads in the threadblock (blockDim.x>>1) and uses each of those threads to combine two of the partial sums in shared memory. So if our threadblock started out at 128 threads, we just used 64 of those threads to reduce 128 partial sums into 64 partial sums. This process repeats in the for-loop, each time cutting the number of threads in half and combining partial sums, two at a time per thread, for as long as activeThreads > 32. So if activeThreads is 64, those 64 threads will combine 128 partial sums into 64 partial sums. But when activeThreads becomes 32, the for-loop terminates without combining 64 partial sums into 32. So at the completion of this block of code, we have taken the (arbitrary multiple of 32 threads) threadblock and reduced however many partial sums we started out with down to 64. This process of combining, say, 256 partial sums to 128 partial sums to 64 partial sums must wait at each iteration for all threads (in multiple warps) to complete their work, so the __syncthreads(); statement is executed on each pass of the for-loop.
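To make that concrete, here is the sequence of iterations for an assumed blockDim.x of 256 (any power-of-two block size of at least 64 behaves the same way):

activeThreads = 128 : threads 0..127 each do sPartials[tid] += sPartials[tid+128]   (256 -> 128 partial sums)
activeThreads =  64 : threads 0..63  each do sPartials[tid] += sPartials[tid+64]    (128 -> 64 partial sums)
activeThreads =  32 : the condition activeThreads > 32 is false, so the loop exits, leaving 64 partial sums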
Keep in mind, at this point, we have reduced our threadblock to 64 partial sums.
if ( threadIdx.x < 32 ) {
For the remainder of the kernel after this point, we will only be using the first 32 threads (i.e. the first warp). All other threads will remain idle. Note that there are no __syncthreads(); after this point either, as that would be a violation of the rule for using it (all threads must participate in a __syncthreads();).
volatile int *wsSum = sPartials;
We are now creating a volatile pointer to shared memory. This tells the compiler that it should not perform various optimizations, such as caching a particular value in a register. Why didn't we need this before? Because __syncthreads(); also carries with it a memory-fencing function. A __syncthreads(); call, in addition to causing all threads to wait at the barrier for each other, also forces all thread updates back into shared or global memory. We can no longer depend on this feature, however, because from here on out we will not be using __syncthreads(); we have restricted ourselves, for the remainder of the kernel, to a single warp.
if ( blockDim.x > 32 ) wsSum[tid] += wsSum[tid + 32]; // why do we need this
The previous reduction block left us with 64 partial sums. But we have at this point restricted ourselves to 32 threads. So we must do one more combination to gather the 64 partial sums into 32 partial sums, before we can proceed with the remainder of the reduction.
wsSum[tid] += wsSum[tid + 16]; // how are these statements executed in parallel within a warp?
Now we are finally getting into some warp-synchronous programming. This line of code depends on the fact that 32 threads are executing in lockstep. To understand why (and how it works at all) it will be convenient to break this down into the sequence of operations needed to complete this line of code. It looks something like:
read the partial sum of my thread into a register
read the partial sum of the thread that is 16 higher than my thread, into a register
add the two partial sums
store the result back into the partial sum corresponding to my thread
All 32 threads will follow the above sequence in lock-step. All 32 threads will begin by reading wsSum[tid] into a (thread-local) register. That means thread 0 reads wsSum[0], thread 1 reads wsSum[1], etc. After that, each thread reads another partial sum into a different register: thread 0 reads wsSum[16], thread 1 reads wsSum[17], etc. It's true that we don't care about the wsSum[32] (and higher) values; we've already collapsed those into the first 32 wsSum[] values. However, as we'll see, only the first 16 threads (at this step) will contribute to the final result, so the first 16 threads will be combining the 32 partial sums into 16. The next 16 threads will be acting as well, but they are just doing garbage work -- it will be ignored.
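Putting thread IDs on that, here is a sketch of the wsSum[tid] += wsSum[tid + 16]; step as the warp executes it in lockstep (the indices assume blockDim.x is at least 64 so they stay in bounds):

thread  0:  r1 = wsSum[0];   r2 = wsSum[16];  wsSum[0]  = r1 + r2   // useful
thread  1:  r1 = wsSum[1];   r2 = wsSum[17];  wsSum[1]  = r1 + r2   // useful
...
thread 15:  r1 = wsSum[15];  r2 = wsSum[31];  wsSum[15] = r1 + r2   // useful
thread 16:  r1 = wsSum[16];  r2 = wsSum[32];  wsSum[16] = r1 + r2   // garbage, ignored by the final result
...
thread 31:  r1 = wsSum[31];  r2 = wsSum[47];  wsSum[31] = r1 + r2   // garbage, ignored by the final result

All 32 loads happen before any of the stores, which is why thread 0 can safely read wsSum[16] even though thread 16 is about to overwrite it.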
The above step combined 32 partial sums into the first 16 locations in wsSum[]. The next line of code:
wsSum[tid] += wsSum[tid + 8];
repeats this process with a granularity of 8. Again, all 32 threads are active, and the micro-sequence is something like this:
read the partial sum of my thread into a register
read the partial sum of the thread that is 8 higher than my thread, into a register
add the two partial sums
store the result back into the partial sum corresponding to my thread
So the first 8 threads combine the first 16 partial sums (wsSum[0..15]) into 8 partial sums (contained in wsSum[0..7]). The next 8 threads are also combining wsSum[8..23] into wsSum[8..15], but the writes to locations 8..15 occur after those values were read by threads 0..7, so the valid data is not corrupted. It's just extra junk work going on. Likewise for the other groups of 8 threads within the warp. So at this point we have combined the partial sums of interest into 8 locations.
wsSum[tid] += wsSum[tid + 4]; //this combines partial sums of interest into 4 locations
wsSum[tid] += wsSum[tid + 2]; //this combines partial sums of interest into 2 locations
wsSum[tid] += wsSum[tid + 1]; //this combines partial sums of interest into 1 location
And these lines of code follow a similar pattern as the previous two, partitioning the warp into 8 groups of 4 threads (only the first 4-thread group contributes to the final result) and then partitioning the warp into 16 groups of 2 threads, with only the first 2-thread group contributing to the final result. And finally, into 32 groups of 1 thread each, each thread generating a partial sum, with only the first partial sum being of interest.
if ( tid == 0 ) {
volatile int *wsSum = sPartials; // why is this statement needed?
out[blockIdx.x] = wsSum[0];
}
At last, in the previous step, we had reduced all partial sums down to a single value. It's now time to write that single value out to global memory. Are we done with the reduction? Perhaps, but probably not. If the above kernel were launched with only 1 threadblock, then we would be done -- our final "partial" sum is in fact the sum of all elements in the data set. But if we launched multiple blocks, then the final result from each block is still a "partial" sum, and the results from all blocks must be added together (somehow).
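One common way to handle that (a sketch of one possibility, not necessarily what the original author did; numBlocks, numThreads, d_in, d_partial and d_out are hypothetical host-side names) is to call the same kernel a second time with a single block, using the per-block results as the new input:

// first pass: writes numBlocks per-block sums into d_partial
Reduction2_kernel<<<numBlocks, numThreads, numThreads*sizeof(int)>>>( d_partial, d_in, N );
// second pass: one block reduces those numBlocks sums; d_out[0] then holds the total
Reduction2_kernel<<<1, numThreads, numThreads*sizeof(int)>>>( d_out, d_partial, numBlocks );

(numThreads should be a power of two of at least 64 for the indexing inside the kernel to stay in bounds.)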
And to answer your final question:
I don't know why that statement is needed.
My guess is that it was left around from a previous iteration of the reduction kernel, and the programmer forgot to delete it, or didn't notice that it wasn't needed. Perhaps someone else will know the answer to this one.
Finally, the CUDA reduction sample provides very good reference code for study, and the accompanying PDF document does a good job of describing the optimizations that can be made along the way.

After the first two code blocks (separated by __syncthreads()), you are left with 64 values in each thread block (stored in sPartials[] of each thread block). So the code from if ( threadIdx.x < 32 ) onward accumulates the 64 values in each sPartials[]. It is just there to optimize the speed of the reduction: because the amount of data in the remaining accumulation steps is small, it is not worth continuing to halve the thread count and loop. You can just adjust the condition in the second code block
for ( int activeThreads = blockDim.x>>1;
      activeThreads > 32;
      activeThreads >>= 1 )
to
for ( int activeThreads = blockDim.x>>1;
      activeThreads > 0;
      activeThreads >>= 1 )
instead of
if ( threadIdx.x < 32 ) {
    volatile int *wsSum = sPartials;
    if ( blockDim.x > 32 ) wsSum[tid] += wsSum[tid + 32];
    wsSum[tid] += wsSum[tid + 16];
    wsSum[tid] += wsSum[tid + 8];
    wsSum[tid] += wsSum[tid + 4];
    wsSum[tid] += wsSum[tid + 2];
    wsSum[tid] += wsSum[tid + 1];
for better understanding.
After the accumulation, you get only one value per sPartials[], stored in sPartials[0] (wsSum[0] in your code).
And after the kernel finishes, you can accumulate the per-block results on the CPU to get the final result.
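For example, a minimal host-side sketch (h_partial, d_out and numBlocks are assumed names) could be:

cudaMemcpy(h_partial, d_out, numBlocks * sizeof(int), cudaMemcpyDeviceToHost);
int total = 0;
for (int b = 0; b < numBlocks; ++b)
    total += h_partial[b];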

The CUDA execution model in a nutshell: computations are divided between blocks on a grid. The blocks can have some shared resources (shared memory).
Each block is executed on a single Streaming Multiprocessor (SM), which is what makes the fast shared memory possible.
The work for each block is again split into warps of 32 threads. You can look at the work done by warps as independent tasks. The SM switches between warps very quickly. For example, when a thread accesses global memory, the SM will switch to a different warp.
You know nothing about the order in which warps are executed. All you know is that, after a call to __syncthreads, all threads will have run up to that point, and all memory reads and writes have been completed.
The important thing to note is that all threads in a warp execute the same instruction; when there is a branch and different threads take different branches, some of the threads are paused.
So, in the reduction example, the first part may be executed by multiple warps. In the last part, there are only 32 threads left, so only one warp is active. The line
if ( blockDim.x > 32 ) wsSum[tid] += wsSum[tid + 32];
is there to add the partial sums computed by the other warps to the partial sums of our final warp.
The next lines work as follows. Because execution within a warp is synchronized, it is safe to assume that the write operation to wsSum[tid] is completed before the next read, and so there is no need for a __syncthreads call.
The volatile keyword lets the compiler know that values in the wsSum array may be changed by other threads, so it will make sure that the value of wsSum[tid + X] isn't read earlier, before it was updated by some thread in the previous instruction.
The last volatile declaration seems redundant: you could just as well use the existing wsSum variable.

Related

prefix sum using CUDA

I am having trouble understanding CUDA code for a naive prefix sum.
This code is from https://developer.nvidia.com/gpugems/GPUGems3/gpugems3_ch39.html
In Example 39-1 (naive scan), we have code like this:
__global__ void scan(float *g_odata, float *g_idata, int n)
{
    extern __shared__ float temp[]; // allocated on invocation
    int thid = threadIdx.x;
    int pout = 0, pin = 1;
    // Load input into shared memory.
    // This is exclusive scan, so shift right by one
    // and set first element to 0
    temp[pout*n + thid] = (thid > 0) ? g_idata[thid-1] : 0;
    __syncthreads();
    for (int offset = 1; offset < n; offset *= 2)
    {
        pout = 1 - pout; // swap double buffer indices
        pin = 1 - pout;
        if (thid >= offset)
            temp[pout*n+thid] += temp[pin*n+thid - offset];
        else
            temp[pout*n+thid] = temp[pin*n+thid];
        __syncthreads();
    }
    g_odata[thid] = temp[pout*n+thid]; // write output
}
My questions are
Why do we need to create a shared-memory temp?
Why do we need the "pout" and "pin" variables? What do they do? Since we only use one block and at most 1024 threads here, can we just use threadIdx.x to index the elements in the block?
In CUDA, do we use one thread to do one add operation? Is it like one thread doing what one iteration of a for loop would do (as when looping over threads/processors in OpenMP, with one thread per array element)?
My previous two questions may seem naive... I think the key is that I don't understand the relation between the above implementation and the following pseudocode:
for d = 1 to log2 n do
    for all k in parallel do
        if k >= 2^d then
            x[k] = x[k - 2^(d-1)] + x[k]
This is my first time using CUDA, so I'll appreciate it if anyone can answer my questions...
1 - It's faster to put data in shared memory and do the calculations there rather than using global memory. It's important to sync the threads after loading shared memory, hence the __syncthreads.
2 - These variables implement the double buffering: they toggle which half of the temp array is read from and which half is written to in a line like
temp[pout*n+thid] += temp[pin*n+thid - offset];
On the first iteration, pout = 1 and pin = 0; on the second iteration, pout = 0 and pin = 1, and so on.
The output half of the buffer is offset by n on odd iterations and the input half is offset by n on even iterations. To come back to your question, you can't achieve the same thing with threadIdx.x alone, because it doesn't change within the loop.
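If it helps to see the toggling in action, here is a small host-side C sketch (the input values and n = 8 are just assumptions) that mimics the kernel's double buffering serially and prints the exclusive scan:

#include <stdio.h>

int main(void)
{
    const int n = 8;
    float in[8] = {3, 1, 7, 0, 4, 1, 6, 3};
    float temp[2 * 8];
    int pout = 0, pin = 1;

    // exclusive scan: shift right by one, first element is 0
    for (int thid = 0; thid < n; ++thid)
        temp[pout * n + thid] = (thid > 0) ? in[thid - 1] : 0;

    for (int offset = 1; offset < n; offset *= 2)
    {
        pout = 1 - pout;   // swap double buffer indices
        pin = 1 - pout;
        for (int thid = 0; thid < n; ++thid)   // "for all k in parallel", done serially here
        {
            if (thid >= offset)
                temp[pout * n + thid] = temp[pin * n + thid] + temp[pin * n + thid - offset];
            else
                temp[pout * n + thid] = temp[pin * n + thid];
        }
    }

    for (int thid = 0; thid < n; ++thid)
        printf("%g ", temp[pout * n + thid]);   // prints: 0 3 4 11 11 15 16 22
    printf("\n");
    return 0;
}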
3 & 4 - CUDA launches many threads to run the kernel, meaning each thread runs that code on its own. If you compare the pseudocode with the CUDA code, the "for all k in parallel" loop is the part that has been parallelized: each thread handles one value of k. Each thread then runs the loop over the offsets inside the kernel to the end, and __syncthreads makes every thread wait for the others before the result is written to global memory.
Hope it helps.

Does this CUDA scan kernel only work within a single block, or across multiple blocks?

I am doing a homework assignment and have been given a CUDA kernel that performs a primitive scan operation. From what I can tell, this kernel will only do a scan of the data if a single block is used (because of int id = threadIdx.x). Is this true?
//Hillis & Steele: Kernel Function
//Altered by Jake Heath, October 8, 2013 (c)
// - KD: Changed input array to be unsigned ints instead of ints
__global__ void scanKernel(unsigned int *in_data, unsigned int *out_data, size_t numElements)
{
    //we are creating an extra space for every numElement so the size of the array needs to be 2*numElements
    //cuda does not like dynamic arrays in shared memory so it might be necessary to explicitly state
    //the size of this memory allocation
    __shared__ int temp[1024 * 2];

    //instantiate variables
    int id = threadIdx.x;
    int pout = 0, pin = 1;

    // load input into shared memory.
    // Exclusive scan: shift right by one and set first element to 0
    temp[id] = (id > 0) ? in_data[id - 1] : 0;
    __syncthreads();

    //for each thread, loop through each of the steps
    //each step, move the next resultant addition to the thread's
    //corresponding space to be manipulated for the next iteration
    for (int offset = 1; offset < numElements; offset <<= 1)
    {
        //these switch so that data can move back and forth between the extra spaces
        pout = 1 - pout;
        pin = 1 - pout;

        //IF: the number needs to be added to something, so add those contents with the contents of
        //the element an offset number of elements away, then move it to its corresponding space
        //ELSE: the number only needs to be dropped down, simply move those contents to its corresponding space
        if (id >= offset)
        {
            //this element needs to be added to something; do that and copy it over
            temp[pout * numElements + id] = temp[pin * numElements + id] + temp[pin * numElements + id - offset];
        }
        else
        {
            //this element just drops down, so copy it over
            temp[pout * numElements + id] = temp[pin * numElements + id];
        }
        __syncthreads();
    }

    // write output
    out_data[id] = temp[pout * numElements + id];
}
I would like to modify this kernel to work across multiple blocks, I want it to be as simple as changing the int id... to int id = threadIdx.x + blockDim.x * blockIdx.x. But the shared memory is only within the block, meaning the scan kernels across blocks cannot share the proper information.
From what I can tell this kernel will only do a scan of the data if a single block is used (because of the int id = threadIdx.x). Is this true?
Not exactly. This kernel will work regardless of how many blocks you launch, but all blocks will fetch the same input and compute the same output, because of how id is calculated:
int id = threadIdx.x;
This id is independent of blockIdx, and is therefore identical across blocks, no matter how many there are.
If I were to make a multi-block version of this scan without changing too much code, I would introduce an auxiliary array to store the per-block sums. Then, run a similar scan on that array, calculating per-block increments. Finally, run a last kernel to add those per-block increments to the block elements. If memory serves, there is a similar kernel in the CUDA SDK samples.
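A rough host-side outline of that three-step approach could look like this (scanBlockKernel, scanBlockSumsKernel and addBlockIncrementsKernel are hypothetical names, not kernels from the SDK; it also assumes the number of blocks is small enough that the middle step fits in a single block):

// 1. each block scans its own chunk and writes its block total to d_blockSums[blockIdx.x]
scanBlockKernel<<<numBlocks, blockSize, 2 * blockSize * sizeof(unsigned int)>>>(d_in, d_out, d_blockSums, numElements);
// 2. scan the per-block totals to turn them into per-block increments
scanBlockSumsKernel<<<1, numBlocks, 2 * numBlocks * sizeof(unsigned int)>>>(d_blockSums, numBlocks);
// 3. add each block's increment to every element that block produced
addBlockIncrementsKernel<<<numBlocks, blockSize>>>(d_out, d_blockSums, numElements);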
Since Kepler, the original kernel could be rewritten much more efficiently, notably through the use of __shfl. Additionally, changing the algorithm to work per-warp rather than per-block would get rid of the __syncthreads and may improve performance. A combination of both of these improvements would allow you to get rid of shared memory and work only with registers for maximal performance.
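As a sketch of the shuffle idea (written with the modern __shfl_up_sync; on the original Kepler intrinsics it would be __shfl_up without the mask argument), a per-warp inclusive scan can be done entirely in registers:

// after this returns, lane k of the warp holds the sum of the values held by lanes 0..k
__device__ unsigned int warpInclusiveScan(unsigned int val)
{
    const unsigned int lane = threadIdx.x & 31;
    for (int offset = 1; offset < 32; offset <<= 1)
    {
        unsigned int n = __shfl_up_sync(0xffffffff, val, offset);
        if (lane >= offset)   // lanes below 'offset' have no lane 'offset' positions beneath them
            val += n;
    }
    return val;
}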

Parallel reduction example

I found this parallel reduction code from Stanford which uses shared memory.
The code is an example with 1<<18 (= 262144) elements and it gets correct results.
Why do I get correct results for certain numbers of elements, while for other numbers of elements, like 200000 or 25000, I get results different from what is expected?
It looks to me like it always launches the required number of thread blocks.
// launch a single block to compute the sum of the partial sums
block_sum<<<1,num_blocks,num_blocks * sizeof(float)>>>
This code causes the bug. Suppose num_blocks is 13. Then, in the kernel, blockDim.x / 2 will be 6, and
if(threadIdx.x < offset)
{
    // add a partial sum upstream to our own
    sdata[threadIdx.x] += sdata[threadIdx.x + offset];
}
will only combine the first 12 elements on the first pass; sdata[12] never gets added in (and later passes drop further elements), causing the bug.
When the element count is 200000 or 250000, num_blocks is not a power of two and this happens; the halving pattern only covers every element when num_blocks is a power of two, as it is for the 1<<18 case, which is why that case works fine.
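To see why, here is a sketch of the final launch's reduction when num_blocks is 13 (assuming the usual for (offset = blockDim.x/2; offset > 0; offset >>= 1) loop in block_sum):

offset = 6 : sdata[0..5] += sdata[6..11]   // sdata[12] is never touched
offset = 3 : sdata[0..2] += sdata[3..5]
offset = 1 : sdata[0]    += sdata[1]       // whatever accumulated in sdata[2] is also left out

So the value written from sdata[0] misses several of the 13 per-block partial sums.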
This kernel is sensitive to the blocking parameters (grid and threadblock size) of the kernel. Are you invoking it with enough threads to cover the input size?
It is more robust to formulate kernels like this with for loops - instead of:
unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
something like:
for ( size_t i = blockIdx.x*blockDim.x + threadIdx.x;
      i < N;
      i += blockDim.x*gridDim.x ) {
    sum += in[i];
}
The source code in the CUDA Handbook has lots of examples of "blocking agnostic" code. The reduction code is here:
https://github.com/ArchaeaSoftware/cudahandbook/tree/master/reduction

Parallel Reduction in CUDA for calculating primes

I have code to calculate primes, which I have parallelized using OpenMP:
#pragma omp parallel for private(i,j) reduction(+:pcount) schedule(dynamic)
for (i = sqrt_limit+1; i < limit; i++)
{
    check = 1;
    for (j = 2; j <= sqrt_limit; j++)
    {
        if ( !(j&1) && (i&(j-1)) == 0 )
        {
            check = 0;
            break;
        }
        if ( j&1 && i%j == 0 )
        {
            check = 0;
            break;
        }
    }
    if (check)
        pcount++;
}
I am trying to port it to the GPU, and I want to reduce the count as I did for the OpenMP example above. Following is my code, which apart from giving incorrect results is also slower:
__global__ void sieve ( int *flags, int *o_flags, long int sqrootN, long int N)
{
    long int gid = blockIdx.x*blockDim.x+threadIdx.x, tid = threadIdx.x, j;
    __shared__ int s_flags[NTHREADS];

    if (gid > sqrootN && gid < N)
        s_flags[tid] = flags[gid];
    else
        return;
    __syncthreads();

    s_flags[tid] = 1;
    for (j = 2; j <= sqrootN; j++)
    {
        if ( gid%j == 0 )
        {
            s_flags[tid] = 0;
            break;
        }
    }

    //reduce
    for(unsigned int s=1; s < blockDim.x; s*=2)
    {
        if( tid % (2*s) == 0 )
        {
            s_flags[tid] += s_flags[tid + s];
        }
        __syncthreads();
    }

    //write results of this block to the global memory
    if (tid == 0)
        o_flags[blockIdx.x] = s_flags[0];
}
First of all, how do I make this kernel fast? I think the bottleneck is the for loop, and I am not sure how to replace it. And next, my counts are not correct. I did change the '%' operator and noticed some benefit.
In the flags array, I have marked the primes from 2 to sqroot(N); in this kernel I am calculating primes from sqroot(N) to N, but I would need to check whether each number in {sqroot(N),N} is divisible by primes in {2,sqroot(N)}. The o_flags array stores the partial sums for each block.
EDIT: Following the suggestions, I modified my code (I now understand the comment about syncthreads better); I realized that I do not need the flags array and the global indexes alone work in my case. What concerns me at this point is the slowness of the code (more than correctness), which could be attributed to the for loop. Also, after a certain data size (100000), the kernel was producing incorrect results for subsequent data sizes. Even for data sizes less than 100000, the GPU reduction results are incorrect (a member of the NVIDIA forum pointed out that this may be because my data size is not a power of 2).
So there are still three (maybe related) questions:
How could I make this kernel faster? Is it a good idea to use shared memory in my case where I have to loop over each tid?
Why does it produce correct results only for certain data sizes?
How could I modify the reduction?
__global__ void sieve ( int *o_flags, long int sqrootN, long int N )
{
    unsigned int gid = blockIdx.x*blockDim.x+threadIdx.x, tid = threadIdx.x;
    volatile __shared__ int s_flags[NTHREADS];

    s_flags[tid] = 1;
    for (unsigned int j=2; j<=sqrootN; j++)
    {
        if ( gid % j == 0 )
            s_flags[tid] = 0;
    }
    __syncthreads();
    //reduce
    reduce(s_flags, tid, o_flags);
}
While I profess to know nothing about sieving for primes, there are a host of correctness problems in your GPU version which will stop it from working correctly irrespective of whether the algorithm you are implementing is correct or not:
__syncthreads() calls must be unconditional. It is incorrect to write code where branch divergence could leave some threads within the same warp unable to execute a __syncthreads() call. The underlying PTX is bar.sync and the PTX guide says this:
Barriers are executed on a per-warp basis as if all the threads in a warp are active. Thus, if any thread in a warp executes a bar instruction, it is as if all the threads in the warp have executed the bar instruction. All threads in the warp are stalled until the barrier completes, and the arrival count for the barrier is incremented by the warp size (not the number of active threads in the warp). In conditionally executed code, a bar instruction should only be used if it is known that all threads evaluate the condition identically (the warp does not diverge). Since barriers are executed on a per-warp basis, the optional thread count must be a multiple of the warp size.
Your code unconditionally sets s_flags to one after conditionally loading some values from global memory. Surely that cannot be the intent of the code?
The code lacks a synchronization barrier between the sieving code and the reduction, this can lead to a shared memory race and incorrect results from the reduction.
If you are planning on running this code on a Fermi class card, the shared memory array should be declared volatile to prevent compiler optimization from potentially breaking the shared memory reduction.
If you fix those things, the code might work. Performance is a completely different issue. Certainly on older hardware, the integer modulo operation was very, very slow and not recommended. I can recall reading some material suggesting that Sieve of Atkin was a useful approach to fast prime generation on GPUs.
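For reference, a sketch of the kernel with those structural issues addressed might look something like the following. It keeps the (slow) trial-division approach, assumes NTHREADS is the block size and a power of two, and is untested:

__global__ void sieve( int *o_flags, long int sqrootN, long int N )
{
    long int gid = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int tid = threadIdx.x;
    volatile __shared__ int s_flags[NTHREADS];

    // out-of-range threads contribute 0 instead of returning before a barrier
    s_flags[tid] = (gid > sqrootN && gid < N) ? 1 : 0;

    for (long int j = 2; j <= sqrootN; j++)
    {
        if (s_flags[tid] && gid % j == 0)
            s_flags[tid] = 0;
    }
    __syncthreads();   // sieve results must be visible before the reduction starts

    // tree reduction; the barrier sits outside the branch, so every thread reaches it
    for (unsigned int s = blockDim.x/2; s > 0; s >>= 1)
    {
        if (tid < s)
            s_flags[tid] += s_flags[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        o_flags[blockIdx.x] = s_flags[0];
}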

cuda register pressure

I have a kernel that does a linear least squares fit. It turns out the threads are using too many registers and, therefore, the occupancy is low. Here is the kernel:
__global__
void strainAxialKernel(
    float* d_dis,
    float* d_str
){
    int i = threadIdx.x;
    float a = 0;
    float c = 0;
    float e = 0;
    float f = 0;
    int shift = (int)((float)(i*NEIGHBOURS)/(float)WINDOW_PER_LINE);
    int j;
    __shared__ float dis[WINDOW_PER_LINE];
    __shared__ float str[WINDOW_PER_LINE];

    // fetch data from global memory
    dis[i] = d_dis[blockIdx.x*WINDOW_PER_LINE+i];
    __syncthreads();

    // least square fit
    for (j=-shift; j<NEIGHBOURS-shift; j++)
    {
        a += j;
        c += j*j;
        e += dis[i+j];
        f += (float(j))*dis[i+j];
    }
    str[i] = AMP*(a*e-NEIGHBOURS*f)/(a*a-NEIGHBOURS*c)/(float)BLOCK_SPACING;

    // compensate attenuation
    if (COMPEN_EXP>0 && COMPEN_BASE>0)
    {
        str[i] = (float)(str[i]*pow((float)i/(float)COMPEN_BASE+1.0f,COMPEN_EXP));
    }

    // write back to global memory
    if (!SIGN_PRESERVE && str[i]<0)
    {
        d_str[blockIdx.x*WINDOW_PER_LINE+i] = -str[i];
    }
    else
    {
        d_str[blockIdx.x*WINDOW_PER_LINE+i] = str[i];
    }
}
I have 32x404 blocks with 96 threads in each block. On a GTS 250, the SM should be able to handle 8 blocks. Yet, the Visual Profiler shows I have 11 registers per thread; as a result, occupancy is 0.625 (5 blocks per SM). BTW, the shared memory used by each block is 792 B, so the registers are the problem.
The performance is not the end of the world. I am just curious whether there is any way I can get around this. Thanks.
There is always a trade-off between the fast but limited registers/shared memory and the slow but large global memory. There's no way to "get around" that trade-off. If you reduce register usage by using global memory, you should get higher occupancy but slower memory access.
That said, here are some ideas to use fewer registers:
Can shift be precomputed and stored in constant memory? Then each thread just needs to look up shift[i] (see the sketch after these suggestions).
Do a and c have to be floats?
Or, can a and c be removed from the loop and computed once? And thus removed completely?
a is the sum of a simple arithmetic sequence, so it can be computed in closed form. The loop runs j from -shift up to (but not including) NEIGHBOURS-shift, which is NEIGHBOURS terms, so
a = NEIGHBOURS * ((-shift) + (NEIGHBOURS - shift - 1)) / 2
or
a = NEIGHBOURS * (NEIGHBOURS - 1 - 2*shift) / 2
so instead, do something like the following (you can probably simplify these expressions further):
str[i] = AMP*((NEIGHBOURS*(NEIGHBOURS - 1 - 2*shift)/2)*e - NEIGHBOURS*f);
str[i] /= ((NEIGHBOURS*(NEIGHBOURS - 1 - 2*shift)/2)*(NEIGHBOURS*(NEIGHBOURS - 1 - 2*shift)/2) - NEIGHBOURS*c);
str[i] /= (float)BLOCK_SPACING;
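For the first suggestion above (precomputing shift), a sketch could look like this; c_shift and uploadShiftTable are hypothetical names, while WINDOW_PER_LINE and NEIGHBOURS are the same compile-time constants the kernel already uses:

__constant__ int c_shift[WINDOW_PER_LINE];

// host side, called once before launching the kernel
void uploadShiftTable(void)
{
    int h_shift[WINDOW_PER_LINE];
    for (int i = 0; i < WINDOW_PER_LINE; ++i)
        h_shift[i] = (int)((float)(i * NEIGHBOURS) / (float)WINDOW_PER_LINE);
    cudaMemcpyToSymbol(c_shift, h_shift, sizeof(h_shift));
}

// inside the kernel the computation then becomes a lookup:
//     int shift = c_shift[threadIdx.x];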
Occupancy is NOT a problem.
The SM in GTS 250 (compute capability 1.1) may be able to hold 8 blocks (8x96 threads) simultaneously in its registers, but it only has 8 execution units, meaning that only 8 out of 8x96 (or, in your case, 5x96) threads would be advancing at any given moment of time. There's very little value in trying to squeeze more blocks onto the overloaded SM.
In fact, you could try to play with -maxrregcount option to INCREASE the number of registers, that could have a positive effect on performance.
You can use launch bounds to instruct the compiler to generate a register mapping for a maximum number of threads and a minimum number of blocks per multiprocessor. This can reduce register counts so that you can achieve the desired occupancy.
For your case, Nvidia's occupancy calculator shows a theoretical peak occupancy of 63%, which seems to be what you're achieving. This is due to your register count, as you mention, but it is also due to the number of threads per block. Increasing the number of threads per block to 128 and decreasing the register count to 10 yields 100% theoretical peak occupancy.
To control the launch bounds for your kernel:
__global__ void
__launch_bounds__(128, 6)
MyKernel(...)
{
...
}
Then just launch with a block size of 128 threads and enjoy your occupancy. The compiler should generate your kernel such that it uses 10 or fewer registers.