Optimizing CUDA kernel interpolation with nonuniform node points - cuda

ORIGINAL QUESTION
I have the following kernel performing an interpolation with nonuniform node points, and I would like to optimize it:
__global__ void interpolation(cufftDoubleComplex *Uj, double *points, cufftDoubleComplex *result, int N, int M)
{
int i = threadIdx.x + blockDim.x * blockIdx.x;
int PP;
double P;
const double alfa=(2.-1./cc)*pi_double-0.01;
double phi_cap_s;
cufftDoubleComplex temp;
double cc_points=cc*points[i];
double r_cc_points=rint(cc*points[i]);
temp = make_cuDoubleComplex(0.,0.);
if(i<M) {
for(int m=0; m<(2*K+1); m++) {
P = (K*K-(cc_points-(r_cc_points+m-K))*(cc_points-(r_cc_points+m-K)));
if(P>0.) phi_cap_s = (1./pi_double)*((sinh(alfa*sqrt(P)))/sqrt(P));
if(P<0.) phi_cap_s = (1./pi_double)*((sin(alfa*sqrt(-P)))/sqrt(-P));
if(P==0.) phi_cap_s = alfa/pi_double;
PP = modulo((r_cc_points + m -K ),(cc*N));
temp.x = temp.x+phi_cap_s*Uj[PP].x;
temp.y = temp.y+phi_cap_s*Uj[PP].y;
}
result[i] = temp;
}
}
K and cc are constants, points contains the nodes and Uj the values to be interpolated. modulo is a function basically working as %, but properly extended to negative values. For a certain arrangement, the kernel call takes 2.3ms. I have verified that the most expensive parts are
if(P>0.) phi_cap_s = (1./pi_double)*((sinh(alfa*sqrt(P)))/sqrt(P));
if(P<0.) phi_cap_s = (1./pi_double)*((sin(alfa*sqrt(-P)))/sqrt(-P));
if(P==0.) phi_cap_s = alfa/pi_double;
which takes about 40% of the total time, and
PP = modulo((r_cc_points + m -K ),(cc*N));
temp.x = temp.x+phi_cap_s*Uj[PP].x;
temp.y = temp.y+phi_cap_s*Uj[PP].y;
which takes about 60%. Using the Visual Profiler, I have verified that the performance of the former is not influenced by the presence of the if statements. Please note that I want double precision, so I'm avoiding the __expf() solution. I suspect that, for the latter, the "random" memory access Uj[PP] could be responsible for such a large share of the time. Any suggestions or tricks to reduce the computation time? Thanks in advance.
VERSION FOLLOWING COMMENTS AND ANSWERS
Following the suggestions kindly provided in the answers and comments, I ended up with the code below:
__global__ void interpolation(cufftDoubleComplex *Uj, double *points, cufftDoubleComplex *result, int N, int M)
{
int i = threadIdx.x + blockDim.x * blockIdx.x;
int PP;
double P,tempd;
const double alfa=(2.-1./cc)*pi_double-0.01;
cufftDoubleComplex temp = make_cuDoubleComplex(0.,0.);
double cc_points=cc*points[i];
double r_cc_points=rint(cc_points);
cufftDoubleComplex rtemp[(2*K+1)];
double phi_cap_s[2*K+1];
if(i<M) {
#pragma unroll //unroll the loop
for(int m=0; m<(2*K+1); m++) {
PP = modulo(((int)r_cc_points + m -K ),(cc*N));
rtemp[m] = Uj[PP]; //2
P = (K*K-(cc_points-(r_cc_points+(double)(m-K)))*(cc_points-(r_cc_points+(double)(m-K))));
if(P<0.) {tempd=rsqrt(-P); phi_cap_s[m] = (1./pi_double)*((sin(alfa/tempd))*tempd); }
else if(P>0.) {tempd=rsqrt(P); phi_cap_s[m] = (1./pi_double)*((sinh(alfa/tempd))*tempd); }
else phi_cap_s[m] = alfa/pi_double;
}
#pragma unroll //unroll the loop
for(int m=0; m<(2*K+1); m++) {
temp.x = temp.x+phi_cap_s[m]*rtemp[m].x;
temp.y = temp.y+phi_cap_s[m]*rtemp[m].y;
}
result[i] = temp;
}
}
In particular:
1) I moved the global memory variable Uj to the register array rtemp of size 2*K+1 (K is a constant equal to 6 in my case);
2) I moved the variable phi_cap_s to a 2*K+1 sized register array;
3) I used if ... else statements instead of the three previously used if's (the conditions P<0. and P>0. have the same occurrence probability);
4) I defined extra variables for the square root;
5) I used rsqrt instead of sqrt (as far as I know, sqrt() is calculated by CUDA as 1/rsqrt());
I added each new feature one at a time, verifying the improvement against the original version, but I must say that none of them gave me any relevant improvement.
The execution speed is limited by:
1) the calculation of the sin/sinh functions (about 40% of the time); is there any way to calculate them in double precision arithmetic by somehow exploiting intrinsic math as a "starting guess"? (One reformulation is sketched right after this list.)
2) the fact that many threads end up accessing the same global memory locations Uj[PP] due to the mapping index PP; one possibility to avoid this would be using shared memory, but that would imply strong thread cooperation.
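For 1), one reformulation worth trying (just a sketch, written against the variables of the kernel above; it is not obvious it beats the library sinh, which is presumably built on exp itself) computes sinh from a single double-precision exp call:
double t = alfa*sqrt(P); // P > 0 branch only
double et = exp(t); // sinh(t) = 0.5*(exp(t) - exp(-t)) = 0.5*(et - 1./et)
phi_cap_s[m] = (1./pi_double)*0.5*(et - 1./et)/sqrt(P);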
My question is: am I done? That is, is there any way to improve the code further? I profiled the code with the NVIDIA Visual Profiler and here are the results:
IPC = 1.939 (compute capability 2.1);
Global Memory Load Efficiency = 38.9%;
Global Memory Store Efficiency = 18.8%;
Warp Execution Efficiency = 97%;
Instruction Replay Overhead = 0.7%;
Finally, I would like to note that this discussion is linked to the discussion at CUDA: 1-dimensional cubic spline interpolation in CUDA
VERSION USING SHARED MEMORY
I have made a feasibility study on using shared memory. I have considered N=64 so that the whole of Uj fits into shared memory. Below is the code (it is basically my original version):
__global__ void interpolation_shared(cufftDoubleComplex *Uj, double *points, cufftDoubleComplex *result, int N, int M)
{
int i = threadIdx.x + blockDim.x * blockIdx.x;
int PP;
double P;
const double alfa=(2.-1./cc)*pi_double-0.01;
double phi_cap_s;
cufftDoubleComplex temp;
double cc_points=cc*points[i];
double r_cc_points=rint(cc*points[i]);
temp = make_cuDoubleComplex(0.,0.);
__shared__ cufftDoubleComplex Uj_shared[128];
if (threadIdx.x < cc*N) Uj_shared[threadIdx.x]=Uj[threadIdx.x];
__syncthreads(); // make sure Uj_shared is fully populated before any thread reads it
if(i<M) {
for(int m=0; m<(2*K+1); m++) {
P = (K*K-(cc_points-(r_cc_points+m-K))*(cc_points-(r_cc_points+m-K)));
if(P>0.) phi_cap_s = (1./pi_double)*((sinh(alfa*sqrt(P)))/sqrt(P));
if(P<0.) phi_cap_s = (1./pi_double)*((sin(alfa*sqrt(-P)))/sqrt(-P));
if(P==0.) phi_cap_s = alfa/pi_double;
PP = modulo((r_cc_points + m -K ),(cc*N));
temp.x = temp.x+phi_cap_s*Uj_shared[PP].x;
temp.y = temp.y+phi_cap_s*Uj_shared[PP].y;
}
result[i] = temp;
}
}
The result again does not improve significantly, although this might depend on the small size of the input array.
VERBOSE PTXAS OUTPUT
ptxas : info : Compiling entry function '_Z13interpolationP7double2PdS0_ii' for 'sm_20'
ptxas : info : Function properties for _Z13interpolationP7double2PdS0_ii
352 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas : info : Used 55 registers, 456 bytes cumulative stack size, 52 bytes cmem[0]
VALUES OF P, FOR FIRST WARP AND m=0
0.0124300933082964
0.0127183892149176
0.0135847002913749
0.0161796378170038
0.0155488126345702
0.0138890822153499
0.0121163187739057
0.0119998374528905
0.0131600831194518
0.0109574866163769
0.00962949548477354
0.00695850974164358
0.00446426651940612
0.00423369284281705
0.00632921297092537
0.00655137618976198
0.00810202954519923
0.00597974034698723
0.0076811348379735
0.00604267951733561
0.00402922460255439
0.00111841719893846
-0.00180949615796777
-0.00246283218698551
-0.00183256444286428
-0.000462696661685413
0.000725108980390132
-0.00126793006072035
0.00152263101649197
0.0022499598348702
0.00463681632275836
0.00359856091027666
MODULO FUNCTION
__device__ int modulo(int val, int modulus)
{
if(val > 0) return val%modulus;
else
{
int P = (-val)%modulus;
if(P > 0) return modulus -P;
else return 0;
}
}
MODULO FUNCTION OPTIMIZED ACCORDING TO ANSWER
__device__ int modulo(int val, int _mod)
{
if(val > 0) return val&(_mod-1);
else
{
int P = (-val)&(_mod-1);
if(P > 0) return _mod -P;
else return 0;
}
}

//your code above
cufftDoubleComplex rtemp[(2*K+1)]; //if it fits into the available registers; assumes K is a constant
if(i<M) {
#pragma unroll //unroll the loop
for(int m=0; m<(2*K+1); m++) {
PP = modulo((r_cc_points + m -K ),(cc*N));
rtemp[m] = Uj[PP]; //2
}
#pragma unroll
for(int m=0; m<(2*K+1); m++) {
P = (K*K-(cc_points-(r_cc_points+m-K))*(cc_points-(r_cc_points+m-K)));
// 1
if(P>0.) phi_cap_s = (1./pi_double)*((sinh(alfa*sqrt(P)))/sqrt(P));
else if(P<0.) phi_cap_s = (1./pi_double)*((sin(alfa*sqrt(-P)))/sqrt(-P));
else phi_cap_s = alfa/pi_double;
temp.x = temp.x+phi_cap_s*rtemp[m].x; //3
temp.y = temp.y+phi_cap_s*rtemp[m].y;
}
result[i] = temp;
}
Explanation
Added else if and else as these conditions are mutually exclusive; if you can, you should order the statements by probability of occurrence. E.g. if P<0. most of the time, you should evaluate that first.
This will fetch the requested memory into multiple registers; what you did before may well have caused a stall on that thread because the memory was not available in time for the calculation. Keep in mind that if one thread stalls in a warp, the whole warp is stalled. If there are not enough warps in the ready queue, execution will stall until some warp is ready.
We have now moved the calculations forward in time relative to the memory accesses, so that the work done in the meantime hopefully hides the latency of the bad access pattern.
The reason why this should work is the following:
A request to global memory (GMEM) takes roughly 400-600 cycles. If a thread tries to operate on memory that is not yet available, it will stall. That means that if a memory request does not hit in L1/L2, the warp has to wait that long or longer before it can continue.
What I suspect is that temp.x+phi_cap_s*Uj[PP].x is doing just that. By staging (step 2) each memory transfer into a register, and moving on to stage the next one, you hide the latency by doing other work while the memory is transferred.
By the time you reach step 3 the memory is hopefully available, or you have to wait less time.
If rtemp does not fit into the registers needed to achieve 100% occupancy, you may have to do it in batches.
You could also try to make phi_cap_s into an array and put it into the first loop like this:
#pragma unroll //unroll the loop
for(int m=0; m<(2*K+1); m++) {
//stage memory first
PP = modulo((r_cc_points + m -K ),(cc*N));
rtemp[m] = Uj[PP]; //2
P = (K*K-(cc_points-(r_cc_points+m-K))*(cc_points-(r_cc_points+m-K)));
// 1
if(P>0.) phi_cap_s[m] = (1./pi_double)*((sinh(alfa*sqrt(P)))/sqrt(P));
else if(P<0.) phi_cap_s[m] = (1./pi_double)*((sin(alfa*sqrt(-P)))/sqrt(-P));
else phi_cap_s[m] = alfa/pi_double;
}
#pragma unroll
for(int m=0; m<(2*K+1); m++) {
temp.x = temp.x+phi_cap_s[m]*rtemp[m].x; //3
temp.y = temp.y+phi_cap_s[m]*rtemp[m].y;
}
Edit
Expression
P = (K*K-(cc_points-(r_cc_points+(double)(m-K)))*(cc_points-(r_cc_points+(double)(m-K))));
can be broken down into:
const double cc_diff = cc_points-r_cc_points;
double exp = cc_diff - (double)(m-K);
exp *= exp;
P = (K*K-exp);
This may reduce the number of instructions used.
Edit 2
__global__ void interpolation(cufftDoubleComplex *Uj, double *points, cufftDoubleComplex *result, int N, int M)
{
int i = threadIdx.x + blockDim.x * blockIdx.x;
int PP;
double P,tempd;
cufftDoubleComplex rtemp[(2*K+1)];
double phi_cap_s[2*K+1];
if(i<M) {
const double cc_points=cc*points[i];
cufftDoubleComplex temp = make_cuDoubleComplex(0.,0.);
const double alfa=(2.-1./cc)*pi_double-0.01;
const double r_cc_points=rint(cc_points);
const double cc_diff = cc_points-r_cc_points;
#pragma unroll //unroll the loop
for(int m=0; m<(2*K+1); m++) {
PP = m-K; //reuse PP
double exp = cc_diff - (double)(PP); //stage exp to be used later, will explain
PP = modulo(((int)r_cc_points + PP ),(cc*N));
rtemp[m] = Uj[PP]; //2
exp *= exp;
P = (K*K-exp);
if(P<0.) {tempd=rsqrt(-P); phi_cap_s[m] = (1./pi_double)*((sin(alfa/tempd))*tempd); }
else if(P>0.) {tempd=rsqrt(P); phi_cap_s[m] = (1./pi_double)*((sinh(alfa/tempd))*tempd); }
else phi_cap_s[m] = alfa/pi_double;
}
#pragma unroll //unroll the loop
for(int m=0; m<(2*K+1); m++) {
temp.x = temp.x+phi_cap_s[m]*rtemp[m].x;
temp.y = temp.y+phi_cap_s[m]*rtemp[m].y;
}
result[i] = temp;
}
}
What I have done is move all the calculations inside the if statement, to free up resources both in terms of computation and memory fetches; I do not know how much divergence you have on the first if statement, if(i<M). As m-K appeared twice in the code, I first put it in PP so it can be reused when computing both exp and PP.
What else you can do is try to order your instructions so that, after you set a variable, as many instructions as possible sit between it and the next use of that variable, as it takes ~20 cycles for it to be written into the registers. Hence, I put the constant cc_diff at the top; however, as this is only a single instruction, it may not show any benefit.
Modulo function
__device__ int modulo(int val, int _mod)
{
return val&(_mod-1); // valid because _mod is a power of 2; with two's complement integers this already gives the correct non-negative result, even for negative val
}
As _mod is always a power of two here (cc = 2, N = 64, cc*N = 128), we can use a bit mask instead of the mod operator, which should be considerably faster; with two's complement integers the mask alone already handles negative values. Check the arithmetic yourself, though (a small host-side check is sketched below). The idea is from Optimizing CUDA - Part II, NVIDIA, page 14.
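For instance (a sketch; modulo_ref and the test range are made up just for this check):
#include <cstdio>
int modulo_ref(int val, int modulus) // reference: plain % fixed up for negative values
{
int r = val % modulus;
return (r < 0) ? r + modulus : r;
}
int modulo_fast(int val, int _mod) // the bit-mask version; _mod must be a power of 2
{
return val & (_mod - 1);
}
int main()
{
const int _mod = 128; // cc*N in the kernel
for (int val = -4*_mod; val <= 4*_mod; val++)
if (modulo_ref(val, _mod) != modulo_fast(val, _mod))
printf("mismatch at val = %d\n", val);
printf("check finished\n");
return 0;
}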

One optimization you would probably want to look into is fast math: use intrinsic math functions and compile with the --use_fast_math option.
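For example, the build could look like this (a sketch; the file name is made up, and note that --use_fast_math mainly affects single-precision math, so the double-precision sin/sinh above are not replaced by intrinsics):
nvcc -O3 -arch=sm_21 --use_fast_math -o interpolation interpolation.cu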

Related

CUDA n body shared memory does not speed up the computation

I am quite new to CUDA. I wrote a short code ONLY for testing a kernel that computes the accelerations of mass particles. I time it using only time ./example. I have Kubuntu 12.04, an Intel(R) Core(TM) i5 CPU 760 @ 2.80GHz and a GeForce GTX 560, and compile with nvcc -O3 -arch=sm_20 -o example example.cu. Here is my code.
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <time.h> // for time(), used to seed srand below
__global__ void acc_sh(double *x, double *y, double *z, double *ax, double *ay, double *az, double *mass, int N)
{
extern __shared__ double4 shPos[]; //make dynamic
int p = blockDim.x;
int idx = blockIdx.x*p + threadIdx.x;
if (idx > N-1) return;
double3 acc = (double3){0.0,0.0,0.0};
double posx = x[idx];
double posy = y[idx];
double posz = z[idx];
// Tile
for (int k = 0; k < N; k += p) {
//Load positions into shmem
shPos[threadIdx.x].x = x[k + threadIdx.x];
shPos[threadIdx.x].y = y[k + threadIdx.x];
shPos[threadIdx.x].z = z[k + threadIdx.x];
shPos[threadIdx.x].w = mass[k + threadIdx.x];
__syncthreads();
for (int j = 0; j < p && k + j < N; j++) {
//Loop over the shmem
double rijx = posx - shPos[j].x;
double rijy = posy - shPos[j].y;
double rijz = posz - shPos[j].z;
double dist = rijx*rijx + rijy*rijy + rijz*rijz;
double dist3 = dist*dist*dist;
double apre = 0.0;
if (dist3 != 0) //avoid self-interaction
{
apre = rsqrt(dist3)*shPos[j].w;
}
acc.x += apre*rijx;
acc.y += apre*rijy;
acc.z += apre*rijz;
}
__syncthreads();
}
ax[idx] = acc.x;
ay[idx] = acc.y;
az[idx] = acc.z;
}
__global__ void acc(double *x, double *y, double *z, double *ax, double *ay, double *az, double *mass, int N)
{
int p = blockDim.x;
int idx = blockIdx.x*p + threadIdx.x;
if (idx > N-1) return;
double3 acc = (double3){0.0,0.0,0.0};
double posx = x[idx];
double posy = y[idx];
double posz = z[idx];
// Do not use shmem and loop over all bodies
for (int k = 0; k < N; k++) {
double rijx = posx - x[k];
double rijy = posy - y[k];
double rijz = posz - y[k];
double dist = rijx*rijx + rijy*rijy + rijz*rijz;
double dist3 = dist*dist*dist;
double apre = 0.0;
if (dist3 != 0) //avoid self-interaction
{
apre = rsqrt(dist3)*mass[k];
}
acc.x += apre*rijx;
acc.y += apre*rijy;
acc.z += apre*rijz;
__syncthreads();
}
ax[idx] = acc.x;
ay[idx] = acc.y;
az[idx] = acc.z;
}
int main()
{
srand(time(NULL));
const int N = 16384;
double t, dt, tend;
//INIT TEST PARTICLES
// HOST
double *x, *y, *z, *mass;
double *ax, *ay, *az, *dmass;
//DEVICE
double *dx, *dy, *dz;
double *dax, *day, *daz;
double size = N*sizeof(double);
cudaMalloc((void**)&dx, size);
cudaMalloc((void**)&dy, size);
cudaMalloc((void**)&dz, size);
cudaMalloc((void**)&dmass, size);
cudaMalloc((void**)&dax, size);
cudaMalloc((void**)&day, size);
cudaMalloc((void**)&daz, size);
x = (double*) malloc(size);
y = (double*) malloc(size);
z = (double*) malloc(size);
mass = (double*) malloc(size);
ax = (double*) malloc(size);
ay = (double*) malloc(size);
az = (double*) malloc(size);
for (int i = 0; i < N; i++)
{
x[i] = (double) rand()/RAND_MAX;
y[i] = (double) rand()/RAND_MAX;
z[i] = (double) rand()/RAND_MAX;
mass[i] = (double) rand()/RAND_MAX;
// printf("%d %10.5e %10.5e %10.5e %10.5e \n", i, x[i], y[i], z[i], mass[i]);
ax[i] = 0;
ay[i] = 0;
az[i] = 0;
}
cudaMemcpy(dx, x, size, cudaMemcpyHostToDevice);
cudaMemcpy(dy, y, size, cudaMemcpyHostToDevice);
cudaMemcpy(dz, z, size, cudaMemcpyHostToDevice);
cudaMemcpy(dmass, mass, size, cudaMemcpyHostToDevice);
cudaMemcpy(dax, ax, size, cudaMemcpyHostToDevice);
cudaMemcpy(day, ay, size, cudaMemcpyHostToDevice);
cudaMemcpy(daz, az, size, cudaMemcpyHostToDevice);
t = 0.0; //start integ. time
tend = 365.0; //end integr. time, about one year
dt = 1.0;
int TPB = 128;
int BPG = (N/TPB)+1;
//********************************************************
//********************************************************
//********************************************************
//MAIN CYCLE**********************************************
//********************************************************
//********************************************************
//********************************************************
while (t <= tend) {
printf("time [d] %24.20f \n", t);
acc_sh<<< BPG, TPB, sizeof(double4)*TPB >>>(dx,dy,dz,dax,day,daz,dmass,N);
//acc<<< BPG, TPB >>>(dx,dy,dz,dax,day,daz,dmass,N);
t += dt;
}
cudaMemcpy(x, dx, size, cudaMemcpyDeviceToHost);
cudaMemcpy(y, dy, size, cudaMemcpyDeviceToHost);
cudaMemcpy(z, dz, size, cudaMemcpyDeviceToHost);
cudaMemcpy(ax, dax, size, cudaMemcpyDeviceToHost);
cudaMemcpy(ay, day, size, cudaMemcpyDeviceToHost);
cudaMemcpy(az, daz, size, cudaMemcpyDeviceToHost);
//********************************************************
//********************************************************
//********************************************************
//OUTPUT RESULTS******************************************
//********************************************************
//********************************************************
//********************************************************
/*for (int j = 0; j < N; j++) {
printf("%d %23.16e %23.16e %23.16e \n", j+1, ax[j], ay[j], az[j]);
}*/
cudaFree(dx);
cudaFree(dy);
cudaFree(dz);
cudaFree(ax);
cudaFree(ay);
cudaFree(az);
return 0;
}
When I run it and measure the total running time of the app, I obtain these running times:
NO SHARED (in MAIN CYCLE only acc_sh is commented):
real 0m44.933s
user 0m32.838s
sys 0m12.001s
SHARED (in MAIN CYCLE only acc is commented):
real 0m44.259s
user 0m32.710s
sys 0m11.445s
Times are comparable! Why? I expected that using acc_sh, which uses shared memory, should be faster... The next question is: why is the program so fast at the beginning, and why does it wait for "something" at the end?
Don't use a double quantity to specify the number of bytes to allocate or transfer:
double size = N*sizeof(double);
use int, unsigned, or size_t instead. When I compile your code, I see numerous warnings due to this.
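For example (one possible fix):
size_t size = N*sizeof(double); // an integral type is what cudaMalloc and cudaMemcpy expect for byte counts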
You have a bug in your acc kernel code which will produce incorrect results and affect the timing:
double rijy = posy - y[k];
double rijz = posz - y[k];
^
that should be z[k], not y[k]
This coding error significantly reduces the amount of data that your non-shared kernel needs to load, which makes this kernel (incorrectly) perform better. If you had bothered to compare and check the results between the two cases, you would have found a discrepancy there as well.
When I fix those errors, on my particular setup, I get timings of ~21 seconds for the non-shared case, and ~18 seconds for the shared case.
If you're looking for 10x improvement going from global to shared memory, that's simply implausible. Shared memory bandwidth is only about 5x better than global memory bandwidth, so it's unreasonable to expect 10x even in a perfect case. Furthermore, this type of comparison discounts the effect of the L1 and L2 caches in your GPU, which can bring global memory accesses, for frequently accessed data, up to nearly the level of shared memory.
Regarding the question "why is the program at the beginning so fast, and at the end it waits for 'something'?": kernel launches are asynchronous. A kernel launch returns control to the host thread before the kernel begins executing. When you launch kernels in a loop like this, each launch immediately returns control to the host thread (before that kernel begins executing), which then launches the next kernel; the "wait" at the end is simply the cudaMemcpy calls after the loop blocking until all of the queued kernels have actually finished.
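For example, to see where the time actually goes you could block the host once after the loop (a sketch):
while (t <= tend) {
acc_sh<<< BPG, TPB, sizeof(double4)*TPB >>>(dx,dy,dz,dax,day,daz,dmass,N);
t += dt;
}
cudaDeviceSynchronize(); // wait here until every queued kernel has actually finished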

GPU Precision issues for relatively small array sizes?

I have some CUDA code that does some linear algebra to invert a special type of structured matrix. I calculate RMS error using the results of a serialized version of the algorithm. The error grows with problem size to a greater extent than I would expect. Can anyone provide insight as to why this may be the case?
The GPU code is very naive. This is intentional, and I will optimize it very soon - I just wanted a simple baseline kernel that gives the proper results.
__global__ void levinson_durbin_gpu(TYPE *h0_d, TYPE *h_d, TYPE *v_d, TYPE *x_d, TYPE *y_d, int N) //Naive kernel
{
int j = threadIdx.x;
int i;
__shared__ TYPE hn_1[512];
hn_1[j] = h_d[j];
for(i=1; i<N; i++)
{
if(j < i)
{
TYPE hn = h_d[i];
TYPE yn = y_d[i];
__syncthreads();
//Set up temporary arrays, compute inner products
__shared__ TYPE temp[512]; //Temp for hn_1_J_v
__shared__ TYPE temp2[512]; //Temp for hn_1_J_x
__shared__ TYPE temp3[512]; //Temp for hn_1_v
temp[j] = hn_1[j]*v_d[i-j-1];
temp2[j] = hn_1[j]*x_d[i-j-1];
temp3[j] = hn_1[j]*v_d[j];
__syncthreads();
//Three reductions at once
for(unsigned int s=1; s<i; s*=2)
{
int index = 2*s*j;
if((index+s) < i)
{
temp[index] += temp[index+s];
temp2[index] += temp2[index+s];
temp3[index] += temp3[index+s];
}
__syncthreads();
}
TYPE hn_1_J_v = temp[0];
TYPE hn_1_J_x = temp2[0];
TYPE hn_1_v = temp3[0];
TYPE alpha_v = (hn - hn_1_J_v)/(h0_d[0] - hn_1_v);
TYPE alpha_x = (yn - hn_1_J_x)/(h0_d[0] - hn_1_v);
__shared__ TYPE w_v[512];
w_v[j] = v_d[j] - alpha_v*v_d[i-j-1];
__shared__ TYPE w_x[512];
w_x[j] = x_d[j] - alpha_x*v_d[i-j-1];
v_d[j] = w_v[j];
x_d[j] = w_x[j];
if(j == 0)
{
v_d[i] = alpha_v;
x_d[i] = alpha_x;
}
}
__syncthreads();
}
}
The identifier TYPE is either float or double depending on how I compile the code. I'm using 1 block with N threads (again, keeping things naive and simple here). With single precision I see the following results:
N=4: RMS Error = 0.0000000027
N=8: RMS Error = 0.0000001127
N=16: RMS Error = 0.0000008832
N=32: RMS Error = 0.0000009233
N=64: RMS Error = 42.0136776452
N=80: RMS Error = 281371.7533760048
I can't tell if this is an error with my algorithm or some sort of precision issue. If it helps I can show the above results using double precision, the CPU version of the algorithm, or the code that calculates the RMS error. I'm using a GeForce GTX 660 Ti (cc 3.0) GPU. The variable x_d contains the end result.
Thanks to the help from the comments section I was able to solve the problem myself, so I'll document it here in case others experience a similar issue.
The problem indeed was a synchronization issue: my use of __syncthreads() within a divergent control flow block. The solution was to break that control flow block into multiple parts and call __syncthreads() after each part:
__global__ void levinson_durbin_gpu(TYPE *h0_d, TYPE *h_d, TYPE *v_d, TYPE *x_d, TYPE *y_d, int N) //Naive kernel
{
int j = threadIdx.x;
int i;
__shared__ TYPE hn_1[512];
hn_1[j] = h_d[j];
__syncthreads();
//Set up temporary arrays
__shared__ TYPE temp[512]; //Temp for hn_1_J_v
__shared__ TYPE temp2[512]; //Temp for hn_1_J_x
__shared__ TYPE temp3[512]; //Temp for hn_1_v
TYPE hn;
TYPE yn;
for(i=1; i<N; i++)
{
if(j < i)
{
hn = h_d[i];
yn = y_d[i];
//Compute inner products
temp[j] = hn_1[j]*v_d[i-j-1];
temp2[j] = hn_1[j]*x_d[i-j-1];
temp3[j] = hn_1[j]*v_d[j];
}
__syncthreads();
//Have all threads complete this section to avoid synchronization issues
//Three reductions at once
for(unsigned int s=1; s<i; s*=2)
{
int index = 2*s*j;
if((index+s) < i)
{
temp[index] += temp[index+s];
temp2[index] += temp2[index+s];
temp3[index] += temp3[index+s];
}
__syncthreads();
}
if(j < i)
{
TYPE hn_1_J_v = temp[0];
TYPE hn_1_J_x = temp2[0];
TYPE hn_1_v = temp3[0];
TYPE alpha_v = (hn - hn_1_J_v)/(h0_d[0] - hn_1_v);
TYPE alpha_x = (yn - hn_1_J_x)/(h0_d[0] - hn_1_v);
__shared__ TYPE w_v[512];
w_v[j] = v_d[j] - alpha_v*v_d[i-j-1];
__shared__ TYPE w_x[512];
w_x[j] = x_d[j] - alpha_x*v_d[i-j-1];
v_d[j] = w_v[j];
x_d[j] = w_x[j];
if(j == 0)
{
v_d[i] = alpha_v;
x_d[i] = alpha_x;
}
}
__syncthreads();
}
}
N=32: RMS Error = 0.0000009233
N=64: RMS Error = 0.0000027644
N=128: RMS Error = 0.0000058276
N=256: RMS Error = 0.0000117755
N=512: RMS Error = 0.0000237040
What I learned: when you use synchronization mechanisms in CUDA, make sure all threads reach the same barrier point! I feel as though this sort of thing should produce a compiler warning.
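A minimal illustration of the hazard and the fix (not the actual kernel, just the pattern):
// hazardous: only threads with j < i reach the barrier
if (j < i) {
temp[j] = hn_1[j]*v_d[i-j-1];
__syncthreads(); // undefined behaviour: the other threads never arrive here
}
// safe: divergent work first, then a barrier that every thread in the block reaches
if (j < i) {
temp[j] = hn_1[j]*v_d[i-j-1];
}
__syncthreads();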

Cuda kernel - possible optimizations

Here is the kernel that I am launching for calculating some array in parallel.
__device__ bool mult(int colsize,int rowsize,int *Aj,int *Bi)
{
for(int j = 0; j < rowsize;j++)
{
for(int k = 0;k < colsize;k++)
{
if(Aj[j] == Bi[k])
{
return true;
}
}
}
return false;
}
__global__ void kernel(int *Aptr,int *Aj,int *Bptr,int *Bi,int rows,int cols,int *Cjc)
{
int tid = threadIdx.x + blockIdx.x * blockDim.x;
int i;
if(tid < cols)
{
int beg = Bptr[tid];
int end = Bptr[tid+1];
for(i = 0;i < rows;i++)
{
int cbeg = Aptr[i];
int cend = Aptr[i+1];
if(mult(end - beg,cend - cbeg,Aj+cbeg,Bi+beg))
{
Cjc[tid+1] += 1;
//atomicAdd(Cjc+tid+1,1);
}
}
}
}
My launch configurations and kernel call are as follows.
int numBlocks,numThreads;
if(q % 32 == 0)
{
numBlocks = q/32;
numThreads = 32;
}
else
{
numBlocks = (q+31)/32;
numThreads = 32;
}
kernel<<<numBlocks,numThreads>>>(devAptr,devAcol,devBjc,devBir,m,q,d_Cjc);
I have to admit, this kernel is running pretty slow. Once I get the array back to the host side, I use thrust::inclusive_scan to find my resultant array.
My question is: is there any room for improvement/optimization in my kernel? I tried using shared memory but it's producing either wrong answers or throwing runtime exceptions.
Also, how is the dynamically allocated shared memory (allocated via the third parameter in the kernel launch) distributed among the blocks?
Any help/hints/insinuations will be appreciated.
Thanks in advance.
As for the shared memory allocated using kernel<<<blocks,threads,mem>>>: mem is the amount of memory allocated per block, so each block gets mem bytes of shared memory.
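For example (a small sketch with made-up names, not your kernel):
__global__ void stage(const int *in, int *out, int n)
{
extern __shared__ int smem[]; // sized by the third launch parameter; every block gets its own copy
int tid = threadIdx.x + blockIdx.x * blockDim.x;
if (tid < n) smem[threadIdx.x] = in[tid];
__syncthreads();
if (tid < n) out[tid] = smem[threadIdx.x];
}
// each of the numBlocks blocks receives numThreads*sizeof(int) bytes of shared memory
stage<<<numBlocks, numThreads, numThreads*sizeof(int)>>>(d_in, d_out, n);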
For your code, I don't understand why there are 2 for loops in the mult function. Just want to point out that each thread will be executing these 2 for loops. Moreover, as you also have a for loop in the kernel function, it means that each thread will execute the 2 for loops in the mult function several times. This is slow. Moreover, doing
int beg = Bptr[tid];
int end = Bptr[tid+1];
is not exactly coalesced access. Non-coalesced access is slow.

cuda multiplication

Serial code snippet looks like this:
int i, j;
for(j=0; j<ny; j++)
{
for(i=0; i<nx; i++)
{
x[i + j*nx] *= y[i];
}
}
I converted this to CUDA using this kernel:
int tid = blockIdx.x * blockDim.x + threadIdx.x;
int i,j;
for(tid = 0; tid <nx*ny; tid++)
{
j = tid/nx;
i = tid - j*nx;
x[tid] *= y[i];
}
However, the GPU kernel does not give any speedup. Any suggestions for a better solution? Thanks in advance.
If this is the serial code:
int i, j;
for(j=0; j<ny; j++)
{
for(i=0; i<nx; i++)
{
x[i + j*nx] *= y[i];
}
}
then you should be doing this:
__global__ void fn(float *x, const float *y, int nx)
{
int tid = blockIdx.x * blockDim.x + threadIdx.x;
int j = tid/nx, i = tid - j * nx;
x[tid] *= y[i];
}
fn<<<nx*ny/B, B>>>(x, y, nx); // with B = 256, 512, etc.
What you're doing is fairly bizarre: you're instructing each thread of the CUDA kernel to iterate over all values of tid between 0 and nx*ny, and compute the same function as your CPU version! Moreover, instead of just iterating over the indices, you're actually doing the loop less efficiently than you did for the CPU version; in other words, you do the same thing in each thread, just less efficiently, than you are doing in 1 thread on the CPU. It's no wonder that this is slower; it should be much, much slower. Your CUDA kernel is:
int tid = blockIdx.x * blockDim.x + threadIdx.x;
int i,j;
for(tid = 0; tid <nx*ny; tid++)
{
j = tid/nx;
i = tid - j*nx;
x[tid] *= y[i];
}
This does nx*ny iterations, same as your host code, for each thread; you lose all benefit of the parallelism, since each thread is doing the same thing; you would get the same performance using one thread on the GPU, and the same result!
If this is the verbatim code from your CUDA source file, you need to change it and redo the comparison; if this is code you have written to help explain what your code is doing for a lay non-CUDA audience, then you need to present your actual CUDA code so that we can see what's going on... as it is, the performance analysis I have done - the trivial one - is all you can expect.
Given your comment to this answer:
the nx * ny = 2205; so I used no. of blocks =
(nx*ny+(threads-1))/threads and threads = 64.
which implies you intend to launch one thread per computation, the correct CUDA implementation would simply be:
int tid = blockIdx.x * blockDim.x + threadIdx.x;
int j = tid/nx;
int i = tid - j*nx;
if (tid < (nx*ny))
x[tid] *= y[i];
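With the launch configuration from that comment, this would be driven by something like (a sketch, assuming the snippet above is the body of the fn kernel shown earlier, taking x, y and nx):
int threads = 64;
int blocks = (nx*ny + threads - 1)/threads; // round up; the if (tid < (nx*ny)) guard covers the excess threads
fn<<<blocks, threads>>>(x, y, nx);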
If you were intending for each thread to compute more than one computation per kernel launch, then you would size the grid to "fill" each of the SM on the target GPU, not use the same number of threads as the input size, and then do something like:
int tid = blockIdx.x * blockDim.x + threadIdx.x;
int gsize = blockDim.x * gridDim.x;
int i,j;
for(; tid <nx*ny; tid+=gsize)
{
j = tid/nx;
i = tid - j*nx;
x[tid] *= y[i];
}
That would get you at least coalesced reads and writes to x, and remove the enormous number of redundant calculations in your posted version. There are a number of further optimizations that could be made, but they would require more information about the problem than has been supplied in the question and subsequent comments. Your indexing scheme contains an integer division and then an integer multiply-add per calculation. That is a lot of overhead for a single FLOP per input value. However, having said all of that, if the problem size I quoted is the actual problem size you are interested in, the GPU will never be faster than even a modest host CPU. You would require many orders of magnitude larger problems to realize a useful speedup using the GPU for this sort of low arithmetic intensity operation.
How big is the block? It may be that the time needed to copy a small amount of data to the GPU and set up the environment is much longer than the calculation time.
Remember also that CUDA does a JIT compile on the first run, so to get accurate benchmarking you need to run it many times.
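For example, a warm-up launch followed by CUDA event timing keeps the one-time cost out of the measurement (a sketch, reusing fn and the launch configuration from above):
fn<<<blocks, threads>>>(x, y, nx); // warm-up launch: absorbs one-time initialization / JIT cost
cudaDeviceSynchronize();
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
for (int r = 0; r < 100; r++)
fn<<<blocks, threads>>>(x, y, nx);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("average kernel time: %f ms\n", ms/100.0f);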
Try this using shared memory. One of the best implementations around:
// Matrices are stored in row-major order:
// M(row, col) = *(M.elements + row * M.stride + col)
typedef struct {
int width;
int height;
int stride; // In number of elements
float *elements;
} Matrix;
// Thread block size
#define BLOCK_SIZE 16
// Get a matrix element
__device__ float GetElement(const Matrix A, int row, int col)
{
return A.elements[row * A.stride + col];
}
// Set a matrix element
__device__ void SetElement(Matrix A, int row, int col, float value)
{
A.elements[row * A.stride + col] = value;
}
// Get the BLOCK_SIZExBLOCK_SIZE sub-matrix Asub of A that is
// located col sub-matrices to the right and row sub-matrices down
// from the upper-left corner of A
__device__ Matrix GetSubMatrix(Matrix A, int row, int col)
{
Matrix Asub;
Asub.width = BLOCK_SIZE; Asub.height = BLOCK_SIZE;
Asub.stride = A.stride;
Asub.elements = &A.elements[A.stride * BLOCK_SIZE * row +
BLOCK_SIZE * col];
return Asub;
}
// Forward declaration of the matrix multiplication kernel
__global__ void MatMulKernel(const Matrix, const Matrix, Matrix);
// Matrix multiplication - Host code
// Matrix dimensions are assumed to be multiples of BLOCK_SIZE
void MatMul(const Matrix A, const Matrix B, Matrix C)
{
// Same as in previous example, except the followings:
// d_A.width = d_A.stride = A.width;
// d_B.width = d_B.stride = B.width;
// d_C.width = d_C.stride = C.width;
}
// Matrix multiplication kernel called by MatMul()
__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
// Block row and column
int blockRow = blockIdx.y;
int blockCol = blockIdx.x;
// Each thread block computes one sub-matrix Csub of C
Matrix Csub = GetSubMatrix(C, blockRow, blockCol);
// Each thread computes one element of Csub
// by accumulating results into Cvalue
float Cvalue = 0;
// Thread row and column within Csub
int row = threadIdx.y;
int col = threadIdx.x;
// Loop over all the sub-matrices of A and B that are
// required to compute Csub
// Multiply each pair of sub-matrices together
// and accumulate the results
for (int m = 0; m < (A.width / BLOCK_SIZE); ++m)
{
// Get sub-matrix Asub of A and Bsub of B
Matrix Asub = GetSubMatrix(A, blockRow, m);
Matrix Bsub = GetSubMatrix(B, m, blockCol);
// Shared memory used to store Asub and Bsub respectively
__shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
__shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
// Load Asub and Bsub from device memory to shared memory
// Each thread loads one element of each sub-matrix
As[row][col] = GetElement(Asub, row, col);
Bs[row][col] = GetElement(Bsub, row, col);
// Synchronize to make sure the sub-matrices are loaded
// before starting the computation
__syncthreads();
// Multiply Asub and Bsub together
for (int e = 0; e < BLOCK_SIZE; ++e)
Cvalue += As[row][e] * Bs[e][col];
// Synchronize to make sure that the preceding
// computation is done before loading two new
// sub-matrices of A and B in the next iteration
__syncthreads();
}
// Write Csub to device memory
// Each thread writes one element
SetElement(Csub, row, col, Cvalue);
}

CUDA memory troubles

I have a CUDA kernel which I'm compiling to a cubin file without any special flags:
nvcc text.cu -cubin
It compiles, though with this message:
Advisory: Cannot tell what pointer points to, assuming global memory space
and a reference to a line in some temporary cpp file. I can get this to work by commenting out some seemingly arbitrary code which makes no sense to me.
The kernel is as follows:
__global__ void string_search(char** texts, int* lengths, char* symbol, int* matches, int symbolLength)
{
int localMatches = 0;
int blockId = blockIdx.x + blockIdx.y * gridDim.x;
int threadId = threadIdx.x + threadIdx.y * blockDim.x;
int blockThreads = blockDim.x * blockDim.y;
__shared__ int localMatchCounts[32];
bool breaking = false;
for(int i = 0; i < (lengths[blockId] - (symbolLength - 1)); i += blockThreads)
{
if(texts[blockId][i] == symbol[0])
{
for(int j = 1; j < symbolLength; j++)
{
if(texts[blockId][i + j] != symbol[j])
{
breaking = true;
break;
}
}
if (breaking) continue;
localMatches++;
}
}
localMatchCounts[threadId] = localMatches;
__syncthreads();
if(threadId == 0)
{
int sum = 0;
for(int i = 0; i < 32; i++)
{
sum += localMatchCounts[i];
}
matches[blockId] = sum;
}
}
If I replace the line
localMatchCounts[threadId] = localMatches;
after the first for loop with this line
localMatchCounts[threadId] = 5;
it compiles with no notices. This can also be achieved by commenting out seemingly random parts of the loop above the line. I have also tried replacing the local memory array with a normal array to no effect. Can anyone tell me what the problem is?
The system is Vista 64-bit, for what it's worth.
Edit: I fixed the code so it actually works, though it still produces the compiler notice. It does not seem as though the warning is a problem, at least with regards to correctness (it might affect performance).
Arrays of pointers like char** are problematic in kernels, since the kernels have no access to the host's memory.
It is better to allocate a single contiguous buffer and to divide it in a manner that enables parallel access.
In this case I'd define a 1D array which contains all the strings positioned one after another, and another 1D array, sized 2*numberOfStrings, which contains the offset of each string within the first array and its length:
For example - preparation for kernel:
char* buffer = st[0] + st[1] + st[2] + ....; // pseudocode: concatenate all strings into one contiguous buffer
int* metadata = new int[numberOfStrings * 2];
int lastpos = 0;
for (int cnt = 0; cnt < 2* numberOfStrings; cnt+=2)
{
metadata[cnt] = lastpos; // offset of string cnt/2 inside buffer
metadata[cnt+1] = length(st[cnt/2]); // length of string cnt/2
lastpos += length(st[cnt/2]);
}
In kernel:
currentIndex = threadId + blockId * blockDim.x;
char* currentString = buffer + metadata[2 * currentIndex];
int currentStringLength = metadata[2 * currentIndex + 1];
The problem seems to be associated with the char** parameter. Turning it into a char* resolved the warning, so I suspect that CUDA might have problems with this form of data. Perhaps CUDA prefers that one use its specific 2D array mechanisms in this case.