Use threads for cuBLAS calls from a kernel? - cuda

Before reading further, here is how I understand calling cuBLAS from a kernel:
cuBLAS calls are kernels themselves.
The threads and blocks are managed by the cuBLAS calls.
A cuBLAS call is launched by 1 thread (and 1 block); it then checks the number of elements and schedules threads/blocks automatically. So you don't specify the number of threads/blocks when you make a cuBLAS call.
I am launching a kernel with 1 thread and 1 block, as described above.
__global__ void (...)
{
...
cublasCtrsm( CublasHandle , CUBLAS_SIDE_LEFT ,CUBLAS_FILL_MODE_LOWER , CUBLAS_OP_N , CUBLAS_DIAG_NON_UNIT , M , N , &alpha , inCov, M , inSample, M );
for ( int i = 0; i < N; i++ )
cublasCdotc( CublasHandle , M , inCoil + i * M , 1 , inSample + i * M , 1 , devImage + i );
}
Now, this code works fine (I get an image), but the for loop takes too much time. I want to optimize this for loop.
So I tried:
int i = threadIdx.x + blockDim.x * blockIdx.x;
if ( i < N )
cublasCdotc( CublasHandle , M , inCoil + i * M , 1 , inSample + i * M , 1 , devImage + i );
But, as I said, I am calling the kernel with 1 thread and 1 block, so this is going to be executed by 1 thread only, right?
(That's why I am not getting the image I want, but only 1 pixel.)
As a consequence, the expression i * M is not evaluated for all N values of i.
My question is: how can I accomplish what I want?

For anyone who wants to find out, I came up with this solution.
In a __global__ function:
int i = threadIdx.x + blockIdx.x * blockDim.x;
if ( i < N )
{
devImage[ i ] = 0; // initialize inside the bounds check so out-of-range threads never write
for ( int j = 0; j < M; j++ )
{
devImage[ i ] += inCoil[ i * M + j ] * inSample[ i * M + j ] - inCoil[ i * M + j ] \
* inSample[ i * M + j ] + inCoil[ i * M + j ] * inSample[ i * M + j ] + inCoil[ i * M + j ] \
* inSample[ i * M + j ];
}
}
Each thread runs a small loop (j < M) instead of one big loop over N (M is much smaller than N).
However, I still can't think of a way to use cublasCdotc that runs fast.
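For reference, cublasCdotc computes the conjugated dot product, i.e. the sum over j of conj(x[j]) * y[j]. Below is a minimal host-side C++ sketch of that quantity evaluated once per row, which is what a one-thread-per-i kernel would compute. The function name rowwiseDotc and the use of std::complex<float> are assumptions for illustration; the original code uses device pointers and presumably cuComplex.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using cfloat = std::complex<float>;

// Hypothetical host-side reference (not the poster's kernel).
// x and y hold N rows of length M, stored one row after another.
// out[i] = sum over j of conj(x[i*M + j]) * y[i*M + j],
// which is the value cublasCdotc produces for row i.
std::vector<cfloat> rowwiseDotc(const std::vector<cfloat>& x,
                                const std::vector<cfloat>& y,
                                std::size_t M, std::size_t N) {
    std::vector<cfloat> out(N, cfloat(0.0f, 0.0f));
    for (std::size_t i = 0; i < N; ++i) {   // in a kernel: one thread per i
        cfloat acc(0.0f, 0.0f);
        for (std::size_t j = 0; j < M; ++j) {
            acc += std::conj(x[i * M + j]) * y[i * M + j];
        }
        out[i] = acc;
    }
    return out;
}
```

Comparing a hand-written kernel against a reference like this on a small input is a cheap way to check that the real and imaginary parts are combined with the right signs.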

Related

Read data in a proper way

I have a cpp file where I create an image and store the data through the myOutput pointer:
int Rows = 80;
int Cols = 64;
for (int i = 0; i < Rows; i++ ){
for (int j = 0; j < Cols; j++ )
{
X = 1.0f * ((float) i - (float) Rows / 2) / (float) Rows;
Y = 2.0f * ((float) j - (float) Cols / 2) / (float) Cols;
.....
myOutput->Re = cosf( ......);
myOutput->Im = sinf(.......);
++myOutput;
}
}
Then, in CUDA, I read it like this:
int bx = blockIdx.x , by = blockIdx.y;
int tx = threadIdx.x , ty = threadIdx.y;
int RowIdx = ty + by * TILE_WIDTH;
int ColIdx = tx + bx * TILE_WIDTH;
Index = RowIdx * Cols + ColIdx;
//copy input data to shared memory
myshared[ty+1][tx+1] = *( devInputArray + Index );
(So the myOutput data generated in the cpp file is loaded into devInputArray.)
Now I want to process many images simultaneously.
So, in the cpp file, the following additions must be made (for 2 images, for example):
int ImagesNb = 2;
for ( ImagesIdx = 0; ImagesIdx < ImagesNb; ImagesIdx++ ){
for (int i = 0; i < Rows; i++ ){
for (int j = 0; j < Cols; j++ )
{
X = (ImagesIdx + 1) * 1.0f * ((float) i - (float) Rows / 2) / (float) Rows;
Y = (ImagesIdx + 1) * 2.0f * ((float) j - (float) Cols / 2) / (float) Cols;
...
But now I am not sure how to read the data in CUDA.
I don't know how to take the number of images into account.
Before, I had a pointer to the data of a single image (80 x 64).
Now every image still has the same dimensions, but the pointer holds the data of several images.
I must change this:
Index = RowIdx * Cols + ColIdx;
//copy input data to shared memory
myshared[ty+1][tx+1] = *( devInputArray + Index );
but I can't figure out how!
I hope this is clear!
UPDATED
I am trying something like this:
int bx = blockIdx.x , by = blockIdx.y , bz = blockIdx.z;
int tx = threadIdx.x , ty = threadIdx.y , tz = threadIdx.z;
int RowIdx = ty + by * TILE_WIDTH;
int ColIdx = tx + bx * TILE_WIDTH;
int ImagesIdx = tz + bz * blockDim.z;
Index = RowIdx * Cols + ColIdx + Rows * Cols * ImagesIdx;
and:
dim3 dimGrid( ImagesNb * (Cols / TILE_WIDTH) , ImagesNb * (Rows / TILE_WIDTH) , ImagesNb);
dim3 dimBlock( TILE_WIDTH , TILE_WIDTH , 2);
but when I try this with 2 images, I do not get the right results.
OK: to handle several images, you must add an extra dimension to the shared variable so that it can hold the data for each image.
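To make the indexing concrete, here is a small host-side C++ sketch that emulates a launch and counts how often each element is touched. It assumes the images are stored back to back (which matches the ++myOutput loop above), TILE_WIDTH = 16, a grid of (Cols / TILE_WIDTH, Rows / TILE_WIDTH, ImagesNb) and a block of (TILE_WIDTH, TILE_WIDTH, 1). Those launch dimensions are my assumption, not the posted ones: the posted dimGrid also multiplies x and y by ImagesNb and uses a block depth of 2, which would create threads that index past the data. The names flatIndex and coverage are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Flat offset of (row, col) inside image `img`, with images stored back to back.
// Same formula as: RowIdx * Cols + ColIdx + Rows * Cols * ImagesIdx.
int flatIndex(int row, int col, int img, int rows, int cols) {
    return img * rows * cols + row * cols + col;
}

// Emulates grid = (cols / tile, rows / tile, images) and block = (tile, tile, 1)
// on the host and counts how many emulated threads touch each element.
std::vector<int> coverage(int rows, int cols, int images, int tile) {
    std::vector<int> hits(rows * cols * images, 0);
    for (int bz = 0; bz < images; ++bz)
        for (int by = 0; by < rows / tile; ++by)
            for (int bx = 0; bx < cols / tile; ++bx)
                for (int ty = 0; ty < tile; ++ty)
                    for (int tx = 0; tx < tile; ++tx) {
                        int row = ty + by * tile;   // RowIdx
                        int col = tx + bx * tile;   // ColIdx
                        int img = bz;               // ImagesIdx, block depth 1
                        ++hits[flatIndex(row, col, img, rows, cols)];
                    }
    return hits;
}
```

With these dimensions every element of every image is visited exactly once, which is the property the real launch configuration needs as well.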

Parallel prefix sum with multiple elements per thread without using thrust

I'm trying to perform an inclusive scan to find the cumulative sum of an array. Following the advice given by harrism here, I'm using the procedure given here, but following the advice of those authors, I'm trying to write code that has each thread calculate 4 elements instead of one to mask memory latency.
I am staying away from thrust as performance is essential, and I need multi-stream capability. I have only just discovered CUB, and that will be my next effort, but I would like a multi-block solution and would also like to know where I've gone wrong on my existing code, just as an exercise to better understand CUDA.
The code below assigns 4 data elements to each thread, and each block must have a multiple of 32 threads. My data length will be a multiple of 128 elements, so this restriction is acceptable to me. Each block is allocated enough shared memory for its 4*blockDim.x elements plus an additional 32 elements for the sums between warps. scanBlockAnyLength4 then adds the necessary offset to correct the mismatch between warps, saving the final value of each block to dev_blockSum in device global memory. sumWarp4_32 then scans this array to find the offsets that correct the mismatch between blocks, which are then added in kernel_offsetBlocks.
#include<cuda.h>
#include<iostream>
using std::cout;
using std::endl;
#define MAX_THREADS 1024
#define MAX_BLOCKS 65536
#define N 512
__device__ float sumWarp4_128(float* ptr, const int tidx = threadIdx.x) {
const unsigned int lane = tidx & 31;
const unsigned int warpid = tidx >> 5; //32 threads per warp
unsigned int i = warpid*128+lane; //first element of block data set this thread looks at
if( lane >= 1 ) ptr[i] += ptr[i-1];
if( lane >= 2 ) ptr[i] += ptr[i-2];
if( lane >= 4 ) ptr[i] += ptr[i-4];
if( lane >= 8 ) ptr[i] += ptr[i-8];
if( lane >= 16 ) ptr[i] += ptr[i-16];
if( lane==0 ) ptr[i+32] += ptr[i+31];
if( lane >= 1 ) ptr[i+32] += ptr[i+32-1];
if( lane >= 2 ) ptr[i+32] += ptr[i+32-2];
if( lane >= 4 ) ptr[i+32] += ptr[i+32-4];
if( lane >= 8 ) ptr[i+32] += ptr[i+32-8];
if( lane >= 16 ) ptr[i+32] += ptr[i+32-16];
if( lane==0 ) ptr[i+64] += ptr[i+63];
if( lane >= 1 ) ptr[i+64] += ptr[i+64-1];
if( lane >= 2 ) ptr[i+64] += ptr[i+64-2];
if( lane >= 4 ) ptr[i+64] += ptr[i+64-4];
if( lane >= 8 ) ptr[i+64] += ptr[i+64-8];
if( lane >= 16 ) ptr[i+64] += ptr[i+64-16];
if( lane==0 ) ptr[i+96] += ptr[i+95];
if( lane >= 1 ) ptr[i+96] += ptr[i+96-1];
if( lane >= 2 ) ptr[i+96] += ptr[i+96-2];
if( lane >= 4 ) ptr[i+96] += ptr[i+96-4];
if( lane >= 8 ) ptr[i+96] += ptr[i+96-8];
if( lane >= 16 ) ptr[i+96] += ptr[i+96-16];
return ptr[i+96];
}
__host__ __device__ float sumWarp4_32(float* ptr, const int tidx = threadIdx.x) {
const unsigned int lane = tidx & 31;
const unsigned int warpid = tidx >> 5; //32 elements per warp
unsigned int i = warpid*32+lane; //first element of block data set this thread looks at
if( lane >= 1 ) ptr[i] += ptr[i-1];
if( lane >= 2 ) ptr[i] += ptr[i-2];
if( lane >= 4 ) ptr[i] += ptr[i-4];
if( lane >= 8 ) ptr[i] += ptr[i-8];
if( lane >= 16 ) ptr[i] += ptr[i-16];
return ptr[i];
}
__device__ float sumBlock4(float* ptr, const int tidx = threadIdx.x, const int bdimx = blockDim.x ) {
const unsigned int lane = tidx & 31;
const unsigned int warpid = tidx >> 5; //32 threads per warp
float val = sumWarp4_128(ptr);
__syncthreads();//should be included
if( tidx==bdimx-1 ) ptr[4*bdimx+warpid] = val;
__syncthreads();
if( warpid==0 ) sumWarp4_32((float*)&ptr[4*bdimx]);
__syncthreads();
if( warpid>0 ) {
ptr[warpid*128+lane] += ptr[4*bdimx+warpid-1];
ptr[warpid*128+lane+32] += ptr[4*bdimx+warpid-1];
ptr[warpid*128+lane+64] += ptr[4*bdimx+warpid-1];
ptr[warpid*128+lane+96] += ptr[4*bdimx+warpid-1];
}
__syncthreads();
return ptr[warpid*128+lane+96];
}
__device__ void scanBlockAnyLength4(float *ptr, float* dev_blockSum, const float* dev_input, float* dev_output, const int idx = threadIdx.x, const int bdimx = blockDim.x, const int bidx = blockIdx.x) {
const unsigned int lane = idx & 31;
const unsigned int warpid = idx >> 5;
ptr[lane+warpid*128] = dev_input[lane+warpid*128+bdimx*bidx*4];
ptr[lane+warpid*128+32] = dev_input[lane+warpid*128+bdimx*bidx*4+32];
ptr[lane+warpid*128+64] = dev_input[lane+warpid*128+bdimx*bidx*4+64];
ptr[lane+warpid*128+96] = dev_input[lane+warpid*128+bdimx*bidx*4+96];
__syncthreads();
float val = sumBlock4(ptr);
__syncthreads();
dev_blockSum[0] = 0.0f;
if( idx==0 ) dev_blockSum[bidx+1] = ptr[bdimx*4-1];
dev_output[lane+warpid*128+bdimx*bidx*4] = ptr[lane+warpid*128];
dev_output[lane+warpid*128+bdimx*bidx*4+32] = ptr[lane+warpid*128+32];
dev_output[lane+warpid*128+bdimx*bidx*4+64] = ptr[lane+warpid*128+64];
dev_output[lane+warpid*128+bdimx*bidx*4+96] = ptr[lane+warpid*128+96];
__syncthreads();
}
__global__ void kernel_sumBlock(float* dev_blockSum, const float* dev_input, float* dev_output ) {
extern __shared__ float ptr[];
scanBlockAnyLength4(ptr,dev_blockSum,dev_input,dev_output);
}
__global__ void kernel_offsetBlocks(float* dev_blockSum, float* dev_arr) {
const int tidx = threadIdx.x;
const int bidx = blockIdx.x;
const int bdimx = blockDim.x;
const int lane = tidx & 31;
const int warpid = tidx >> 5;
if( warpid==0 ) sumWarp4_32(dev_blockSum);
float val = dev_blockSum[warpid];
dev_arr[warpid*128+lane] += val;
dev_arr[warpid*128+lane+32] += val;
dev_arr[warpid*128+lane+64] += val;
dev_arr[warpid*128+lane+96] += val;
}
void scan4( const float input[], float output[]) {
int blocks = 2;
int threadsPerBlock = 64; //multiple of 32
int smemsize = (threadsPerBlock*4+32)*sizeof(float);
float* dev_input, *dev_output;
cudaMalloc((void**)&dev_input,blocks*threadsPerBlock*4*sizeof(float));
cudaMalloc((void**)&dev_output,blocks*threadsPerBlock*4*sizeof(float));
float *dev_blockSum;
cudaMalloc((void**)&dev_blockSum,blocks*sizeof(float));
int offset = 0;
int Nrem = N;
int chunksize;
while( Nrem ) {
chunksize = min(Nrem,blocks*threadsPerBlock*4); // never copy more than the device buffers hold
cudaMemcpy(dev_input,(void**)&input[offset],chunksize*sizeof(float),cudaMemcpyHostToDevice);
kernel_sumBlock<<<blocks,threadsPerBlock,smemsize>>>(dev_blockSum,dev_input,dev_output);
kernel_offsetBlocks<<<blocks,threadsPerBlock>>>(dev_blockSum,dev_output);
cudaMemcpy((void**)&output[offset],dev_output,chunksize*sizeof(float),cudaMemcpyDeviceToHost);
offset += chunksize;
Nrem -= chunksize;
}
cudaFree(dev_input);
cudaFree(dev_output);
}
int main() {
float h_vec[N], sol[N];
for( int i = 0; i < N; i++ ) h_vec[i] = (float)i+1.0f;
scan4(h_vec,sol);
cout << "solution:" << endl;
for( int i = 0; i < N; i++ ) cout << i << " " << (i+2)*(i+1)/2 << " " << sol[i] << endl;
return 0;
}
To my eye, the code produces wrong results because the lines in sumWarp4_128 are not executed in order within a warp. I.e., the if( lane==0 ) lines are executing before the other logical blocks that precede them. I thought this was not possible within a warp.
If I call __syncthreads() before and after the lane==0 lines, I get some new exotic error that I just can't figure out.
Any help pointing out where I've gone wrong would be appreciated.
The code you are writing has race conditions due to not synchronizing between threads that are sharing data. While it is true that this can be done on current hardware for communication within a warp (so-called warp-synchronous programming), it is highly discouraged because the race conditions in the code could cause it to fail on possible future hardware.
While it is true that you will get higher performance by processing multiple items per thread, 4 is not a magic number -- you should make this a tunable parameter if possible. CUDPP uses 8 per thread, for example.
I would highly recommend that you use CUB for this. You should use cub::BlockLoad() to load multiple items per thread and cub::BlockScan() to scan them. Then you would just need some code to combine multiple blocks. The most bandwidth-efficient way to do this is to use the "Reduce-Scan-Scan" approach that Thrust uses. First reduce each block (cub::BlockReduce) and store the sum from each block to a blockSums array. Then scan that array to get the per-block offset. Then perform a cub::BlockScan on the blocks and add the previously computed per-block offset to each element.
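The three phases described above can be sketched on the host to check the arithmetic before writing any kernels. This is plain sequential C++, not CUB or Thrust code; the function name reduceScanScan and the blockSize parameter are hypothetical, and each outer loop iteration stands in for what one thread block would do.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side sketch of the Reduce-Scan-Scan structure for an inclusive scan.
// blockSize must be greater than zero.
std::vector<float> reduceScanScan(const std::vector<float>& in, std::size_t blockSize) {
    const std::size_t n = in.size();
    const std::size_t numBlocks = (n + blockSize - 1) / blockSize;

    // Phase 1 (reduce): one partial sum per block.
    std::vector<float> blockSums(numBlocks, 0.0f);
    for (std::size_t b = 0; b < numBlocks; ++b)
        for (std::size_t i = b * blockSize; i < n && i < (b + 1) * blockSize; ++i)
            blockSums[b] += in[i];

    // Phase 2 (scan): exclusive scan of the block sums gives each block's offset.
    std::vector<float> offsets(numBlocks, 0.0f);
    for (std::size_t b = 1; b < numBlocks; ++b)
        offsets[b] = offsets[b - 1] + blockSums[b - 1];

    // Phase 3 (scan): inclusive scan inside each block, seeded with its offset.
    std::vector<float> out(n);
    for (std::size_t b = 0; b < numBlocks; ++b) {
        float running = offsets[b];
        for (std::size_t i = b * blockSize; i < n && i < (b + 1) * blockSize; ++i) {
            running += in[i];
            out[i] = running;
        }
    }
    return out;
}
```

Phases 1 and 3 are independent across blocks, which is why only the small blockSums array needs any cross-block communication.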

Multi GPU performance degrade when allocated memory increases

I've tested the following on a GTX 690 GPU with 4GB RAM in Windows 7 x64, Visual C++ 10:
I've written a function that receives 2 vectors and adds them into a 3rd vector. The task is split across 2 GPU devices. I gradually increased the vector size to benchmark GPU performance. The required time increases linearly with vector size up to a certain point, and then it abruptly jumps. When I disable either one of the GPUs, the required time stays linear up to the limit of available memory. I've enclosed a diagram showing required time versus allocated memory.
You can see the speed diagram here: Speed Comparison Diagram!
Can you tell me what is wrong?
Bests,
Ramin
This is my code:
unsigned BenchMark( unsigned VectorSize )
{
unsigned * D[ 2 ][ 3 ] ;
for ( int i = 0 ; i < 2 ; i++ )
{
cudaSetDevice( i ) ;
for ( int j = 0 ; j < 3 ; j++ )
cudaMalloc( & D[ i ][ j ] , VectorSize * sizeof( unsigned ) ) ;
}
unsigned uStartTime = clock() ;
// TEST
for ( int i = 0 ; i < 2 ; i++ )
{
cudaSetDevice( i ) ;
AddKernel<<<VectorSize/256,256>>>(
D[ i ][ 0 ] ,
D[ i ][ 1 ] ,
D[ i ][ 2 ] ,
VectorSize ) ;
}
cudaDeviceSynchronize() ;
cudaSetDevice( 0 ) ;
cudaDeviceSynchronize() ;
unsigned uEndTime = clock() ;
for ( int i = 0 ; i < 2 ; i++ )
{
cudaSetDevice( i ) ;
for ( int j = 0 ; j < 3 ; j++ )
cudaFree( D[ i ][ j ] ) ;
}
return uEndTime - uStartTime ;
}
__global__ void AddKernel(
const Npp32u * __restrict__ pSource1 ,
const Npp32u * __restrict__ pSource2 ,
Npp32u * __restrict__ pDestination ,
unsigned uLength )
{
unsigned x = blockIdx.x * blockDim.x + threadIdx.x ;
if ( x < uLength )
pDestination[ x ] = pSource1[ x ] + pSource2[ x ] ;
}
I found the answer. The problem occurred because SLI was active. I disabled it, and now it is working smoothly.

Distribute the threads between blocks in CUDA

I'm working on a project in CUDA. At first I used only one block with dimensions 8*8, the same as my matrix, and calculated the index as follows:
int idx = blockIdx.x * blockDim.x + threadIdx.x;
int idy = blockIdx.y * blockDim.y + threadIdx.y;
And it gave me the correct answer. After that, I wanted to distribute the threads between blocks to measure the performance. I set the grid dimensions to (2,1) and the block dimensions to (4,8).
When I debug the code by hand, it seems to give me the correct index without changing the formula mentioned above. But when I run the program, the screen hangs and the results are all zero.
What did I do wrong, and how can I fix this?
This is the kernel function
__global__ void cover_fault(int *a,int *b, int *c, int *d, int *mulFV1, int *mulFV2, int *checkDalU1, int *checkDalU2, int N)
{
//Fig.2
__shared__ int f[9][9];
__shared__ int compV1[9],compV2[9];
int dalU1[9] , dalU2[9];
int Ra=2 , Ca=2;
for (int i = 0 ; i < N ; i++)
for (int j = 0 ; j < N ; j++)
f[i][j]=0;
f[3][0] = 1;
f[0][2] = 1;
f[0][6] = 1;
f[3][7] = 1;
f[2][4] = 1;
f[6][4] = 1;
f[7][1] = 1;
int t =0 ,A = 1,B = 1 , UTP = 5 , LTP = -5 , U_max = 40 , U_min = -160;
bool flag = true;
int sumV1, sumV2;
int checkZero1 , checkZero2;
int idx = blockIdx.x * blockDim.x + threadIdx.x;
int idy = blockIdx.y * blockDim.y + threadIdx.y;
while ( flag == true)
{
if ( c[idy] == 0 )
compV1[idy] = 1;
else if ( c[idy]==1)
compV1[idy] = 0 ;
if ( d[idy] == 0 )
compV2[idy] = 1;
else if ( d[idy]==1 )
compV2[idy] = 0 ;
sumV1 = reduce ( c, N );
sumV2 = reduce ( d, N );
if (idx<N && idy <N)
{
if(idx==0)
mulFV1[idy]=0;
if(idy==0)
mulFV2[idx]=0;
__syncthreads();
atomicAdd(&(mulFV1[idy]),f[idy][idx]*compV2[idx]);
atomicAdd(&(mulFV2[idx]),f[idy][idx]*compV1[idy]);
}
dalU1[idy] = ( -1*A*( sumV1 - Ra )) + (B * mulFV1[idy] * compV1[idy]) ;
dalU2[idy] = ( -1*A*( sumV2 - Ca )) + (B * mulFV2[idy] * compV2[idy]) ;
a[idy] = a[idy] + dalU1[idy];
b[idy] = b[idy] + dalU2[idy];
if ( a[idy] > U_max )
a[idy] = U_max;
else
if (a[idy] < U_min )
a[idy] = U_min;
if ( b[idy] > U_max )
b[idy] = U_max;
else
if (b[idy] < U_min )
b[idy] = U_min;
if (dalU1[idy]==0)
checkDalU1[idy]=0;
else
checkDalU1[idy]=1;
if (dalU2[idy]==0)
checkDalU2[idy]=0;
else
checkDalU2[idy]=1;
__syncthreads();
checkZero1 = reduce(checkDalU1,N);
checkZero2 = reduce(checkDalU2,N);
if ( checkZero1==0 && checkZero2==0)
flag = false;
else
{
if ( a[idy] > UTP )
c[idy] = 1;
else
if ( a[idy] < LTP )
c[idy] = 0 ;
if ( b[idy] > UTP )
d[idy] = 1;
else
if ( b[idy] < LTP )
d[idy] = 0 ;
t++;
}//end else
sumV1=0;
sumV2=0;
mulFV1[idy]=0;
mulFV2[idy]=0;
} //end while
}//end function
In your index computation, idx gives you the column index and idy the row index. Are you accessing your matrix as M[idy][idx]?
CUDA threads are laid out like a Cartesian grid: x is horizontal and y is vertical. So the point with x = 0 and y = 1 is M[1][0] in the actual matrix.
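As a quick sanity check of the index formula, the launch can be emulated on the host. The sketch below (hypothetical names Dim and visit, fixed 8x8 matrix) counts how often each M[idy][idx] cell is reached for a given grid and block shape. Note that it only checks the index arithmetic; it does not model __shared__ arrays or __syncthreads(), both of which are per-block in CUDA and so behave differently once the work is split over two blocks.

```cpp
#include <array>
#include <cassert>

struct Dim { int x, y; };

using Hits = std::array<std::array<int, 8>, 8>;

// Emulates the launch on the host and counts visits to each M[idy][idx] cell.
Hits visit(Dim grid, Dim block) {
    Hits hits{};  // zero-initialized
    for (int by = 0; by < grid.y; ++by)
        for (int bx = 0; bx < grid.x; ++bx)
            for (int ty = 0; ty < block.y; ++ty)
                for (int tx = 0; tx < block.x; ++tx) {
                    int idx = bx * block.x + tx;  // column (x is horizontal)
                    int idy = by * block.y + ty;  // row (y is vertical)
                    if (idx < 8 && idy < 8) ++hits[idy][idx];
                }
    return hits;
}
```

Both the original launch, grid (1,1) with block (8,8), and the split launch, grid (2,1) with block (4,8), reach every cell exactly once, so the formula itself does not need to change.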

CUDA kernel - nested for loop

Hello
I'm trying to write a CUDA kernel to perform the following piece of code.
for (n = 0; n < (total-1); n++)
{
a = values[n];
for ( i = n+1; i < total ; i++)
{
b = values[i] - a;
c = b*b;
if( c < 10)
newvalues[i] = c;
}
}
This is what I have currently, but it does not seem to give the correct results. Does anyone know what I'm doing wrong? Cheers
__global__ void calc(int total, float *values, float *newvalues){
float a,b,c;
int idx = blockIdx.x * blockDim.x + threadIdx.x;
for (int n = idx; n < (total-1); n += blockDim.x*gridDim.x){
a = values[n];
for(int i = n+1; i < total; i++){
b = values[i] - a;
c = b*b;
if( c < 10)
newvalues[i] = c;
}
}
}
Treat this problem as 2D and launch your kernel with 2D thread blocks. The total number of threads in each of the x and y dimensions should be at least total. The kernel code should look like this:
__global__ void calc(float *values, float *newvalues, int total){
float a,b,c;
int n= blockIdx.y * blockDim.y + threadIdx.y;
int i= blockIdx.x * blockDim.x + threadIdx.x;
if (n>=total || i>=total)
return;
a = values[n];
b = values[i] - a;
c = b*b;
if( c < 10)
newvalues[i] = c;
// I don't know your problem statement but i think it should be like: newvalues[n*total+i] = c;
}
Update:
This is how you should call the kernel
dim3 block(16,16);
dim3 grid ( (total+15)/16, (total+15)/16 );
calc<<<grid,block>>>(val, newval, T); // val, newval: device pointers; T: number of elements
Also make sure you add this line in kernel (see updated kernel)
if (n>=total || i>=total)
return;
Update 2:
fixed blockIdy.y, correct is blockIdx.y
I may well be wrong, but the n < (total-1) check in
for (int n = idx; n < (total-1); n += blockDim.x*gridDim.x)
seems different from the original version.
Why don't you just remove the outer loop and start the kernel with as many threads as you need for this loop? It's a bit odd to have a loop that depends on your block index. Normally you try to avoid these loops.
Secondly, it seems to me that newvalues[i] can be overwritten by different threads.
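One way to sidestep the overwrite problem is to give each output element a single owner. In the sequential loop, n only increases, so the value that finally stays in newvalues[i] comes from the largest n < i for which c < 10. A thread that owns index i can therefore search n downwards from i - 1 and stop at the first hit. The host-side C++ sketch below compares that formulation with the original loop; the function names are hypothetical, and the outer loop of perOutput stands in for one GPU thread per i.

```cpp
#include <cassert>
#include <vector>

// Sequential reference: the same loop nest as in the question.
void sequential(const std::vector<float>& values, std::vector<float>& newvalues) {
    const int total = static_cast<int>(values.size());
    for (int n = 0; n < total - 1; ++n) {
        const float a = values[n];
        for (int i = n + 1; i < total; ++i) {
            const float b = values[i] - a;
            const float c = b * b;
            if (c < 10) newvalues[i] = c;
        }
    }
}

// One owner per output element: each i is written by at most one iteration,
// so a one-thread-per-i kernel built this way has no write conflicts.
void perOutput(const std::vector<float>& values, std::vector<float>& newvalues) {
    const int total = static_cast<int>(values.size());
    for (int i = 1; i < total; ++i) {          // in a kernel: one thread per i
        for (int n = i - 1; n >= 0; --n) {     // largest qualifying n wins
            const float b = values[i] - values[n];
            const float c = b * b;
            if (c < 10) { newvalues[i] = c; break; }
        }
    }
}
```

Elements with no qualifying n keep whatever value newvalues held before the call, exactly as in the original loop.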