GPU Precision issues for relatively small array sizes? - cuda

I have some CUDA code that does some linear algebra to invert a special type of structured matrix. I calculate RMS error using the results of a serialized version of the algorithm. The error grows with problem size to a greater extent that I would expect. Can anyone provide insight as to why this may be the case?
The GPU code is very naive. This is intentional, and I will optimize it very soon - I just wanted a simple baseline kernel that gives the proper results.
__global__ void levinson_durbin_gpu(TYPE *h0_d, TYPE *h_d, TYPE *v_d, TYPE *x_d, TYPE *y_d, int N) //Naive kernel
{
int j = threadIdx.x;
int i;
__shared__ TYPE hn_1[512];
hn_1[j] = h_d[j];
for(i=1; i<N; i++)
{
if(j < i)
{
TYPE hn = h_d[i];
TYPE yn = y_d[i];
__syncthreads();
//Set up temporary arrays, compute inner products
__shared__ TYPE temp[512]; //Temp for hn_1_J_v
__shared__ TYPE temp2[512]; //Temp for hn_1_J_x
__shared__ TYPE temp3[512]; //Temp for hn_1_v
temp[j] = hn_1[j]*v_d[i-j-1];
temp2[j] = hn_1[j]*x_d[i-j-1];
temp3[j] = hn_1[j]*v_d[j];
__syncthreads();
//Three reductions at once
for(unsigned int s=1; s<i; s*=2)
{
int index = 2*s*j;
if((index+s) < i)
{
temp[index] += temp[index+s];
temp2[index] += temp2[index+s];
temp3[index] += temp3[index+s];
}
__syncthreads();
}
TYPE hn_1_J_v = temp[0];
TYPE hn_1_J_x = temp2[0];
TYPE hn_1_v = temp3[0];
TYPE alpha_v = (hn - hn_1_J_v)/(h0_d[0] - hn_1_v);
TYPE alpha_x = (yn - hn_1_J_x)/(h0_d[0] - hn_1_v);
__shared__ TYPE w_v[512];
w_v[j] = v_d[j] - alpha_v*v_d[i-j-1];
__shared__ TYPE w_x[512];
w_x[j] = x_d[j] - alpha_x*v_d[i-j-1];
v_d[j] = w_v[j];
x_d[j] = w_x[j];
if(j == 0)
{
v_d[i] = alpha_v;
x_d[i] = alpha_x;
}
}
__syncthreads();
}
}
The identifier TYPE is either float or double depending on how I compile the code. I'm using 1 block with N threads (again, keeping things naive and simple here). With single precision I see the following results:
N=4: RMS Error = 0.0000000027
N=8: RMS Error = 0.0000001127
N=16: RMS Error = 0.0000008832
N=32: RMS Error = 0.0000009233
N=64: RMS Error = 42.0136776452
N=80: RMS Error = 281371.7533760048
I can't tell if this is an error with my algorithm or some sort of precision issue. If it helps I can show the above results using double precision, the CPU version of the algorithm, or the code that calculates the RMS error. I'm using a GeForce GTX 660 Ti (cc 3.0) GPU. The variable x_d contains the end result.

Thanks to the help from the comments section I was able to solve the problem myself, so I'll document it here in case others experience a similar issue.
The problem indeed was synchronization issue - my use of __syncthreads() within a divergent control flow block. The solution was to break that control flow block into multiple parts and calling __syncthreads() after each part:
__global__ void levinson_durbin_gpu(TYPE *h0_d, TYPE *h_d, TYPE *v_d, TYPE *x_d, TYPE *y_d, int N) //Naive kernel
{
int j = threadIdx.x;
int i;
__shared__ TYPE hn_1[512];
hn_1[j] = h_d[j];
__syncthreads();
//Set up temporary arrays
__shared__ TYPE temp[512]; //Temp for hn_1_J_v
__shared__ TYPE temp2[512]; //Temp for hn_1_J_x
__shared__ TYPE temp3[512]; //Temp for hn_1_v
TYPE hn;
TYPE yn;
for(i=1; i<N; i++)
{
if(j < i)
{
hn = h_d[i];
yn = y_d[i];
//Compute inner products
temp[j] = hn_1[j]*v_d[i-j-1];
temp2[j] = hn_1[j]*x_d[i-j-1];
temp3[j] = hn_1[j]*v_d[j];
}
__syncthreads();
//Have all threads complete this section to avoid synchronization issues
//Three reductions at once
for(unsigned int s=1; s<i; s*=2)
{
int index = 2*s*j;
if((index+s) < i)
{
temp[index] += temp[index+s];
temp2[index] += temp2[index+s];
temp3[index] += temp3[index+s];
}
__syncthreads();
}
if(j < i)
{
TYPE hn_1_J_v = temp[0];
TYPE hn_1_J_x = temp2[0];
TYPE hn_1_v = temp3[0];
TYPE alpha_v = (hn - hn_1_J_v)/(h0_d[0] - hn_1_v);
TYPE alpha_x = (yn - hn_1_J_x)/(h0_d[0] - hn_1_v);
__shared__ TYPE w_v[512];
w_v[j] = v_d[j] - alpha_v*v_d[i-j-1];
__shared__ TYPE w_x[512];
w_x[j] = x_d[j] - alpha_x*v_d[i-j-1];
v_d[j] = w_v[j];
x_d[j] = w_x[j];
if(j == 0)
{
v_d[i] = alpha_v;
x_d[i] = alpha_x;
}
}
__syncthreads();
}
}
N=32: RMS Error = 0.0000009233
N=64: RMS Error = 0.0000027644
N=128: RMS Error = 0.0000058276
N=256: RMS Error = 0.0000117755
N=512: RMS Error = 0.0000237040
what I learned: When you use synchronization mechanisms in CUDA, make sure all threads reach the same barrier point! I feel as though this sort of thing should produce a compiler warning.

Related

Segmentation Fault with 3D array

I am trying to work with 3D arrays in CUDA (200x200x100).
The moment I change my z dimension (model_num) from 4 to 5, I get a segmentation fault. Why, and how can I fix it?
const int nrcells = 200;
const int nphicells = 200;
const int model_num = 5; //So far, 4 is the maximum model_num that works. At 5 and after, there is a segmentation fault
__global__ void kernel(float* mgridb)
{
const unsigned long long int i = (blockIdx.y * gridDim.x + blockIdx.x) * blockDim.x + threadIdx.x;
if(tx >= 0 && tx < nphicells && ty >=0 && ty < nrcells && tz >= 0 && tz < model_num){
//Do stuff with mgridb[i]
}
}
int main (void)
{
unsigned long long int size_matrices = nphicells*nrcells*model_num;
unsigned long long int mem_size_matrices = sizeof(float) * size_matrices;
float *h_mgridb = (float *)malloc(mem_size_matrices);
float mgridb[nphicells][nrcells][model_num];
for(int k = 0; k < model_num; k++){
for(int j = 0; j < nrcells; j++){
for(int i = 0; i < nphicells; i++){
mgridb[i][j][k] = 0;
}
}
}
float *d_mgridb;
cudaMalloc( (void**)&d_mgridb, mem_size_matrices );
cudaMemcpy(d_mgridb, h_mgridb, mem_size_matrices, cudaMemcpyHostToDevice);
int threads = nphicells;
uint3 blocks = make_uint3(nrcells,model_num,1);
kernel<<<blocks,threads>>>(d_mgridb);
cudaMemcpy( h_mgridb, d_mgridb, mem_size_matrices, cudaMemcpyDeviceToHost);
cudaFree(d_mgridb);
return 0;
}
This is getting stored on the stack:
float mgridb[nphicells][nrcells][model_num];
Your stack space is limited. When you exceed the amount you can store on the stack, you are getting a seg fault, either at the point of allocation, or as soon as you try and access it.
Use malloc instead. That allocates heap storage, which has much higher limits.
None of the above has anything to do with CUDA. Furthermore its not unique or specific to "3D" arrays. Any large stack based allocation (e.g. 1D array) is going to have the same trouble.
You may also have to adjust how you access the array, but it's not difficult to handle a flattened array using pointer indexing.
Your code is actually strange looking, because you are creating an appropriately sized array h_mgridb using malloc and then copying that array to the device (into d_mgridb). It's not clear what purpose mgridb serves in your code. h_mgridb and mgridb are not the same.

Optimizing CUDA kernel interpolation with nonuniform node points

ORIGINAL QUESTION
I have the following kernel performing an interpolation with nonuniform node points, and I would like to optimize it:
__global__ void interpolation(cufftDoubleComplex *Uj, double *points, cufftDoubleComplex *result, int N, int M)
{
int i = threadIdx.x + blockDim.x * blockIdx.x;
int PP;
double P;
const double alfa=(2.-1./cc)*pi_double-0.01;
double phi_cap_s;
cufftDoubleComplex temp;
double cc_points=cc*points[i];
double r_cc_points=rint(cc*points[i]);
temp = make_cuDoubleComplex(0.,0.);
if(i<M) {
for(int m=0; m<(2*K+1); m++) {
P = (K*K-(cc_points-(r_cc_points+m-K))*(cc_points-(r_cc_points+m-K)));
if(P>0.) phi_cap_s = (1./pi_double)*((sinh(alfa*sqrt(P)))/sqrt(P));
if(P<0.) phi_cap_s = (1./pi_double)*((sin(alfa*sqrt(-P)))/sqrt(-P));
if(P==0.) phi_cap_s = alfa/pi_double;
PP = modulo((r_cc_points + m -K ),(cc*N));
temp.x = temp.x+phi_cap_s*Uj[PP].x;
temp.y = temp.y+phi_cap_s*Uj[PP].y;
}
result[i] = temp;
}
}
K and cc are constants, points contains the nodes and Uj the values to be interpolated. modulo is a function basically working as %, but properly extended to negative values. For a certain arrangement, the kernel call takes 2.3ms. I have verified that the most expensive parts are
if(P>0.) phi_cap_s = (1./pi_double)*((sinh(alfa*sqrt(P)))/sqrt(P));
if(P<0.) phi_cap_s = (1./pi_double)*((sin(alfa*sqrt(-P)))/sqrt(-P));
if(P==0.) phi_cap_s = alfa/pi_double;
which takes about 40% of the total time, and
PP = modulo((r_cc_points + m -K ),(cc*N));
temp.x = temp.x+phi_cap_s*Uj[PP].x;
temp.y = temp.y+phi_cap_s*Uj[PP].y;
which takes about 60%. By the Visual Profiler, I have verified that the performance of the former is not influenced by the presence of the if statement. Please, note that I want double precision, so I'm avoiding the __exp() solution. I suspect that, for the latter, the "random" memory access Uj[PP] could be responsible of that much calculation percentage. Any suggestion on tricks or comments to reduce the computation time? Thanks in advance.
VERSION FOLLOWING COMMENTS AND ANSWERS
Following the suggestions kindly provided in the answers and comments, I ended up with the code below:
__global__ void interpolation(cufftDoubleComplex *Uj, double *points, cufftDoubleComplex *result, int N, int M)
{
int i = threadIdx.x + blockDim.x * blockIdx.x;
int PP;
double P,tempd;
const double alfa=(2.-1./cc)*pi_double-0.01;
cufftDoubleComplex temp = make_cuDoubleComplex(0.,0.);
double cc_points=cc*points[i];
double r_cc_points=rint(cc_points);
cufftDoubleComplex rtemp[(2*K+1)];
double phi_cap_s[2*K+1];
if(i<M) {
#pragma unroll //unroll the loop
for(int m=0; m<(2*K+1); m++) {
PP = modulo(((int)r_cc_points + m -K ),(cc*N));
rtemp[m] = Uj[PP]; //2
P = (K*K-(cc_points-(r_cc_points+(double)(m-K)))*(cc_points-(r_cc_points+(double)(m-K))));
if(P<0.) {tempd=rsqrt(-P); phi_cap_s[m] = (1./pi_double)*((sin(alfa/tempd))*tempd); }
else if(P>0.) {tempd=rsqrt(P); phi_cap_s[m] = (1./pi_double)*((sinh(alfa/tempd))*tempd); }
else phi_cap_s[m] = alfa/pi_double;
}
#pragma unroll //unroll the loop
for(int m=0; m<(2*K+1); m++) {
temp.x = temp.x+phi_cap_s[m]*rtemp[m].x;
temp.y = temp.y+phi_cap_s[m]*rtemp[m].y;
}
result[i] = temp;
}
}
In particular:
1) I moved the global memory variable Uj to the register rtemp array of size 2*K+1 (K is a constant equal to 6 in my case);
2) I moved the variable phi_cap_s to a 2*K+1 sized register;
3) I used the if ... else statements instead of the three previously used if's (the conditions P<0. and P>0. have the same occurrence probability);
3) I defined extra variables for the square root;
4) I used rsqrt instead of sqrt (as long as I know, the sqrt() is calculated by CUDA as 1/rsqrt());
I added each new feature once at a time, verifying the improvement against the original version, but I must say that none of them gave me any relevant improvement.
The execution speed is limited by:
1) the calculation of the sin/sinh functions (about 40% of the time); is there any way to calculate them in double precision arithmetics by somehow exploiting intrinsic math as a "starting guess"?
2) the fact that many threads end up to access the same global memory locations Uj[PP] due to the mapping index PP; one possibility to avoid it would be using shared memory, but this would imply a strong thread cooperation.
My question is. Am I done? Namely, is there any mean to improve the code? I profiled the code by the NVIDIA Visual Profiler and here are the results:
IPC = 1.939 (compute capability 2.1);
Global Memory Load Efficiency = 38.9%;
Global Memory Store Efficiency = 18.8%;
Warp Execution Efficiency = 97%;
Instruction Replay Overhead = 0.7%;
Finally, I would like to notice that this discussion is linked to the discussion at CUDA: 1-dimensional cubic spline interpolation in CUDA
VERSION USING SHARED MEMORY
I have made a feasibility study on using shared memory. I have considered N=64 so that the whole Uj fits the shared memory. Below is the code (basically is my original version)
__global__ void interpolation_shared(cufftDoubleComplex *Uj, double *points, cufftDoubleComplex *result, int N, int M)
{
int i = threadIdx.x + blockDim.x * blockIdx.x;
int PP;
double P;
const double alfa=(2.-1./cc)*pi_double-0.01;
double phi_cap_s;
cufftDoubleComplex temp;
double cc_points=cc*points[i];
double r_cc_points=rint(cc*points[i]);
temp = make_cuDoubleComplex(0.,0.);
__shared__ cufftDoubleComplex Uj_shared[128];
if (threadIdx.x < cc*N) Uj_shared[threadIdx.x]=Uj[threadIdx.x];
if(i<M) {
for(int m=0; m<(2*K+1); m++) {
P = (K*K-(cc_points-(r_cc_points+m-K))*(cc_points-(r_cc_points+m-K)));
if(P>0.) phi_cap_s = (1./pi_double)*((sinh(alfa*sqrt(P)))/sqrt(P));
if(P<0.) phi_cap_s = (1./pi_double)*((sin(alfa*sqrt(-P)))/sqrt(-P));
if(P==0.) phi_cap_s = alfa/pi_double;
PP = modulo((r_cc_points + m -K ),(cc*N));
temp.x = temp.x+phi_cap_s*Uj_shared[PP].x;
temp.y = temp.y+phi_cap_s*Uj_shared[PP].y;
}
result[i] = temp;
}
}
The result again does not improve significantly, although this might depend on the small size of the input array.
VERBOSE PTXAS OUTPUT
ptxas : info : Compiling entry function '_Z13interpolationP7double2PdS0_ii' for 'sm_20'
ptxas : info : Function properties for _Z13interpolationP7double2PdS0_ii
352 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas : info : Used 55 registers, 456 bytes cumulative stack size, 52 bytes cmem[0]
VALUES OF P, FOR FIRST WARP AND m=0
0.0124300933082964
0.0127183892149176
0.0135847002913749
0.0161796378170038
0.0155488126345702
0.0138890822153499
0.0121163187739057
0.0119998374528905
0.0131600831194518
0.0109574866163769
0.00962949548477354
0.00695850974164358
0.00446426651940612
0.00423369284281705
0.00632921297092537
0.00655137618976198
0.00810202954519923
0.00597974034698723
0.0076811348379735
0.00604267951733561
0.00402922460255439
0.00111841719893846
-0.00180949615796777
-0.00246283218698551
-0.00183256444286428
-0.000462696661685413
0.000725108980390132
-0.00126793006072035
0.00152263101649197
0.0022499598348702
0.00463681632275836
0.00359856091027666
MODULO FUNCTION
__device__ int modulo(int val, int modulus)
{
if(val > 0) return val%modulus;
else
{
int P = (-val)%modulus;
if(P > 0) return modulus -P;
else return 0;
}
}
MODULO FUNCTION OPTIMIZED ACCORDING TO ANSWER
__device__ int modulo(int val, int _mod)
{
if(val > 0) return val&(_mod-1);
else
{
int P = (-val)&(_mod-1);
if(P > 0) return _mod -P;
else return 0;
}
}
//your code above
cufftDoubleComplex rtemp[(2*K+1)] //if it fits into available registers, assumes K is a constant
if(i<M) {
#pragma unroll //unroll the loop
for(int m=0; m<(2*K+1); m++) {
PP = modulo((r_cc_points + m -K ),(cc*N));
rtemp[m] = Uj[PP]; //2
}
#pragma unroll
for(nt m=0; m<(2*K+1); m++) {
P = (K*K-(cc_points-(r_cc_points+m-K))*(cc_points-(r_cc_points+m-K)));
// 1
if(P>0.) phi_cap_s = (1./pi_double)*((sinh(alfa*sqrt(P)))/sqrt(P));
else if(P<0.) phi_cap_s = (1./pi_double)*((sin(alfa*sqrt(-P)))/sqrt(-P));
else phi_cap_s = alfa/pi_double;
temp.x = temp.x+phi_cap_s*rtemp[m].x; //3
temp.y = temp.y+phi_cap_s*rtemp[m].y;
}
result[i] = temp;
}
Explanation
Added else if and else as these conditions are mutually exclusive, if you could, you should order the statements after probability of occurrence. E.g. if P<0. most of the times, you should evaluate that first.
This will fetch the requested memory to multiple registers, what you did before may have certainly caused a block on that thread due to not having the memory available in time for calculation. And keep in mind that if one thread blocks in a warp, the whole warp is blocked. If not enough warps in the ready queue, the program will block until any warp is ready.
We have now moved the calculations further forward in time in order to compensate for the bad memory access, hopefully the calculations done previously have compensated for the bad access pattern.
The reason why this should work is the following:
A request from memory that is in GMEM is around >~400-600 ticks. If a thread tries to perform operations on memory that is not available at time, it will block. That means if each memory request does not live in L1-L2 each warp have to wait that time or more until it can continue.
What I suspect is that temp.x+phi_cap_s*Uj[PP].x is doing just that. By staging (step 2) each memory transfer to a register, and moving on to stage the next you will hide the latency by allowing you to do other work while the memory is transfered.
By the time you reach step 3 the memory is hopefully available or you have to wait less time.
If rtemp does not fit into the registers to achieve 100% occupancy you may have to do it in batches.
You could also try to make phi_cap_s into an array and put it into the first loop like this:
#pragma unroll //unroll the loop
for(int m=0; m<(2*K+1); m++) {
//stage memory first
PP = modulo((r_cc_points + m -K ),(cc*N));
rtemp[m] = Uj[PP]; //2
P = (K*K-(cc_points-(r_cc_points+m-K))*(cc_points-(r_cc_points+m-K)));
// 1
if(P>0.) phi_cap_s[m] = (1./pi_double)*((sinh(alfa*sqrt(P)))/sqrt(P));
else if(P<0.) phi_cap_s[m] = (1./pi_double)*((sin(alfa*sqrt(-P)))/sqrt(-P));
else phi_cap_s[m] = alfa/pi_double;
}
#pragma unroll
for(nt m=0; m<(2*K+1); m++) {
temp.x = temp.x+phi_cap_s[m]*rtemp[m].x; //3
temp.y = temp.y+phi_cap_s[m]*rtemp[m].y;
}
Edit
Expression
P = (K*K-(cc_points-(r_cc_points+(double)(m-K)))*(cc_points-(r_cc_points+(double)(m-K))));
Can be broken down into:
const double cc_diff = cc_points-r_cc_points;
double exp = cc_diff - (double)(m-K);
exp *= exp;
P = (K*K-exp);
Which may reduce the number of instructions used.
Edit 2
__global__ void interpolation(cufftDoubleComplex *Uj, double *points, cufftDoubleComplex *result, int N, int M)
{
int i = threadIdx.x + blockDim.x * blockIdx.x;
int PP;
double P,tempd;
cufftDoubleComplex rtemp[(2*K+1)];
double phi_cap_s[2*K+1];
if(i<M) {
const double cc_points=cc*points[i];
cufftDoubleComplex temp = make_cuDoubleComplex(0.,0.);
const double alfa=(2.-1./cc)*pi_double-0.01;
const double r_cc_points=rint(cc_points);
const double cc_diff = cc_points-r_cc_points;
#pragma unroll //unroll the loop
for(int m=0; m<(2*K+1); m++) {
PP = m-k; //reuse PP
double exp = cc_diff - (double)(PP); //stage exp to be used later, will explain
PP = modulo(((int)r_cc_points + PP ),(cc*N));
rtemp[m] = Uj[PP]; //2
exp *= exp;
P = (K*K-exp);
if(P<0.) {tempd=rsqrt(-P); phi_cap_s[m] = (1./pi_double)*((sin(alfa/tempd))*tempd); }
else if(P>0.) {tempd=rsqrt(P); phi_cap_s[m] = (1./pi_double)*((sinh(alfa/tempd))*tempd); }
else phi_cap_s[m] = alfa/pi_double;
}
#pragma unroll //unroll the loop
for(int m=0; m<(2*K+1); m++) {
temp.x = temp.x+phi_cap_s[m]*rtemp[m].x;
temp.y = temp.y+phi_cap_s[m]*rtemp[m].y;
}
result[i] = temp;
}
}
What I have done is moved in all the calculations inside the if statement to free up some resources both in terms of calculations but also memory fetch, do not know the divergence you have on the first if statement if(i<M). As m-K appeared twice in the code, I first put it in PP to be used when you calculate exp and PP.
What else you can do is to try and order you instructions so that, if you set a variable, make as many instruction in in between the next usage of said variable as possible, as it takes ~20 tics for it to be set into the registers. Hence, I put the constant cc_diff at the top, however, as this is only a one of instruction, it may not show any benefit.
Modulo function
__device__ modulo(int val, int _mod) {
int p = (val&(_mod-1));// as modulo is always the power of 2
if(val < 0) {
return _mod - p;
} else {
return p;
}
}
As we have _mod always as an integer of power of 2(cc = 2, N = 64, cc*N = 128), we can use this function instead of the mod operator. This should be "much" faster. Check it though so I have the arithmetic correct. It is from Optimizing Cuda - Part II Nvidia page 14.
One optimization you probably would want to look into is to use fast math. Use intrinsics math functions and compile with -use-fast-math option.
intrinsics math

Cuda kernel - possible optimizations

Here is the kernel that I am launching for calculating some array in parallel.
__device__ bool mult(int colsize,int rowsize,int *Aj,int *Bi)
{
for(int j = 0; j < rowsize;j++)
{
for(int k = 0;k < colsize;k++)
{
if(Aj[j] == Bi[k])
{
return true;
}
}
}
return false;
}
__global__ void kernel(int *Aptr,int *Aj,int *Bptr,int *Bi,int rows,int cols,int *Cjc)
{
int tid = threadIdx.x + blockIdx.x * blockDim.x;
int i;
if(tid < cols)
{
int beg = Bptr[tid];
int end = Bptr[tid+1];
for(i = 0;i < rows;i++)
{
int cbeg = Aptr[i];
int cend = Aptr[i+1];
if(mult(end - beg,cend - cbeg,Aj+cbeg,Bi+beg))
{
Cjc[tid+1] += 1;
//atomicAdd(Cjc+tid+1,1);
}
}
}
}
My launch configurations and kernel call are as follows.
int numBlocks,numThreads;
if(q % 32 == 0)
{
numBlocks = q/32;
numThreads = 32;
}
else
{
numBlocks = (q+31)/32;
numThreads = 32;
}
findkernel<<<numBlocks,numThreads>>>(devAptr,devAcol,devBjc,devBir,m,q,d_Cjc);
I have to admit, this kernel is running pretty slow.Once I get the array back to host side, I use thrust::inclusive_scan to find my resultant array.
My question is, is there any room for improvement / optimization for my kernel? I tried using shared memory but its producing either wrong answers or throwing runtime exceptions.
Also, how does the dynamically allocated shared memory ( which is allocated by third parameter in kernel launch ) is distributed among the blocks?
Any help/hints/insinuations will be appreciated.
Thanks in advance.
As for the shared memory allocated using kernel<<<blocks,threads,mem>>> mem is the amount of memory allocated each block. So each block gets mem amount of memory.
For your code, I don't understand why are there 2 for loops in the mult function. Just want to point out that each thread will be executing these 2 for loops. Moreover, as you also have a for loop in the kernel function, it means that each thread will be executing the 2 for loops in the mult function several times. THis is slow. Moreover, doing
int beg = Bptr[tid];
int end = Bptr[tid+1];
is not exactly coalesced access. Non coalesced access is slow.

Cuda kernel producing the resultant vector as zero

Here is the kernel that I am launching for calculating some array in parallel.
__device__ bool mult(int colsize,int rowsize,int *Aj,int *Bi,int *val)
{
for(int j = 0; j < rowsize;j++)
{
for(int k = 0;k < colsize;k++)
{
if(Aj[j] == Bi[k])
{
return true;
}
}
}
return false;
}
__global__ void kernel(int *Aptr,int *Aj,int *Bptr,int *Bi,int rows,int cols,int *Cjc)
{
int tid = threadIdx.x + blockIdx.x * blockDim.x;
int i;
if(tid < cols)
{
int beg = Bptr[tid];
int end = Bptr[tid+1];
for(i = 0;i < rows;i++)
{
int cbeg = Aptr[i];
int cend = Aptr[i+1];
if(mult(end - beg,cend - cbeg,Aj+cbeg,Bi+beg))
{
Cjc[tid+1] += 1;
//atomicAdd(Cjc+tid+1,1);
}
}
}
}
And here is how I decide the configuration of grid and blocks
int numBlocks,numThreads;
if(q % 32 == 0)
{
numBlocks = q/32;
numThreads = 32;
}
else
{
numBlocks = (q+31)/32;
numThreads = 32;
}
findkernel<<<numBlocks,numThreads>>>(devAptr,devAcol,devBjc,devBir,m,q,d_Cjc);
I am using GTX 480 with CC 2.0.
Now the problem that I am facing is that whenever q increases beyond 4096 the values in Cjc array are all produced as 0.
I know maximum number of blocks that I can use in X direction is 65535 and each block can have at most (1024,1024,64) threads. Then why does this kernel calculate the wrong output for Cjc array?
I seems like there are a couple of things wrong with the code you posted:
I guess findkernel is kernel in the CUDA code above?
kernel has 8 parameters, but you only use 7 parameters to call findkernel. This doesn't look right!
In kernel, you test if(tid < cols) - I guess this should be if(tid < count)??
Why does kernel expect count to be a pointer? I think you don't pass in an int pointer but a regular integer value to findkernel.
Why does __device__ bool mult get count/int *val if it is not used?
I guess #3 or #4 could be the source of your problem, but you should look at the other things as well.
OK so I finally figured out using cudaError_t that when I tried to cudaMemcpy the d_Cjc array from device to host, it throws following error.
CUDA error: the launch timed out and was terminated
It turns out that some of the calculations in findkernel are taking reasonably large amount of time which causes the display driver to terminate the program because of OS 'watchdog' time limit.
I believe I will have to shut down X server or ssh my gpu machine (from another machine) by removing its display.This will buy me some time to do the calculations that will not exceed the 'watchdog' limit of OS.

CUDA memory troubles

I have a CUDA kernel which I'm compiling to a cubin file without any special flags:
nvcc text.cu -cubin
It compiles, though with this message:
Advisory: Cannot tell what pointer points to, assuming global memory space
and a reference to a line in some temporary cpp file. I can get this to work by commenting out some seemingly arbitrary code which makes no sense to me.
The kernel is as follows:
__global__ void string_search(char** texts, int* lengths, char* symbol, int* matches, int symbolLength)
{
int localMatches = 0;
int blockId = blockIdx.x + blockIdx.y * gridDim.x;
int threadId = threadIdx.x + threadIdx.y * blockDim.x;
int blockThreads = blockDim.x * blockDim.y;
__shared__ int localMatchCounts[32];
bool breaking = false;
for(int i = 0; i < (lengths[blockId] - (symbolLength - 1)); i += blockThreads)
{
if(texts[blockId][i] == symbol[0])
{
for(int j = 1; j < symbolLength; j++)
{
if(texts[blockId][i + j] != symbol[j])
{
breaking = true;
break;
}
}
if (breaking) continue;
localMatches++;
}
}
localMatchCounts[threadId] = localMatches;
__syncthreads();
if(threadId == 0)
{
int sum = 0;
for(int i = 0; i < 32; i++)
{
sum += localMatchCounts[i];
}
matches[blockId] = sum;
}
}
If I replace the line
localMatchCounts[threadId] = localMatches;
after the first for loop with this line
localMatchCounts[threadId] = 5;
it compiles with no notices. This can also be achieved by commenting out seemingly random parts of the loop above the line. I have also tried replacing the local memory array with a normal array to no effect. Can anyone tell me what the problem is?
The system is Vista 64bit, for what its worth.
Edit: I fixed the code so it actually works, though it still produces the compiler notice. It does not seem as though the warning is a problem, at least with regards to correctness (it might affect performance).
Arrays of pointers like char** are problematic in kernels, since the kernels have no access to the host's memory.
It is better to allocate a single continuous buffer and to divide it in a manner that enables parallel access.
In this case I'd define a 1D array which contains all the strings positioned one after another and another 1D array, sized 2*numberOfStrings which contains the offset of each string within the first array and it's length:
For example - preparation for kernel:
char* buffer = st[0] + st[1] + st[2] + ....;
int* metadata = new int[numberOfStrings * 2];
int lastpos = 0;
for (int cnt = 0; cnt < 2* numberOfStrings; cnt+=2)
{
metadata[cnt] = lastpos;
lastpos += length(st[cnt]);
metadata[cnt] = length(st[cnt]);
}
In kernel:
currentIndex = threadId + blockId * numberOfBlocks;
char* currentString = buffer + metadata[2 * currentIndex];
int currentStringLength = metadata[2 * currentIndex + 1];
The problem seems to be associated with the char** parameter. Turning this into a char* solved the warning, so I suspect that cuda might have problems with this form of data. Perhaps cuda prefers that one uses the specific cuda 2D arrays in this case.