Does the leading dimension in cuBLAS allow for accessing any submatrix? - cuda

I'm trying to understand the idea of the leading dimension in cuBLAS. It's mentioned that lda must always be greater than or equal to the # of rows in a matrix.
If I have a 100x100 matrix A and I wanted to access A(90:99, 0:99), what would be the arguments of cublasSetMatrix? lda specifies the number of rows between the elements in the same column(100 in this case), but where would I specify the 90? I can only see a way by adjusting *A.
The function definition is:
cublasStatus_t cublasSetMatrix(int rows, int cols, int elemSize, const void *A, int lda, void *B, int ldb)
And I'm also guessing that I wouldn't be able to transfer the bottom right 3x3 portion of a 5x5 matrix given the length limits.

You have to "adjust *A", as you called it. The pointer that is given to this function must be the starting entry of the respective sub-matrix.
You did not say whether your matrix A is actually the input- or the output matrix, but this should not change much, conceptually.
Assuming you have the following code:
// The matrix in host memory
int rowsA = 100;
int colsA = 100;
float *A = new float[rowsA*colsA];
// Fill A with values
...
// The sub-matrix that should be copied to the device.
// The minimum index is INCLUSIVE
// The maximum index is EXCLUSIVE
int minRowA = 0;
int maxRowA = 100;
int minColA = 90;
int maxColA = 100;
int rowsB = maxRowA-minRowA;
int colsB = maxColA-minColA;
// Allocate the device matrix
float *dB = nullptr;
cudaMalloc(&dB, rowsB * colsB * sizeof(float));
Then, for the cublasSetMatrix call, you have to compute the starting element of the source matrix:
float *sourceA = A + (minRowA + minColA * rowsA);
cublasSetMatrix(rowsB, colsB, sizeof(float), sourceA, rowsA, dB, rowsB);
And this is where the 90 that you asked for comes into play: It is the minColA in the computation of the source pointer.

Related

Multiplication of two cudaArray in kernel?(using texture memory)

I have two cudaArray, a1 and a2 (which have the same size) which reprensent two matrices .
Using texture memory, I want to multiplicate those two cudaArrays .
Then I want to copy back the result in one normal arrays,let's name it *a1_h.
The fact is, I just don't know how to do it . I've managed to define, allocate my two cudaArrays and to put floats into them .
Now I want to do a kernel which does those multiplications .
Can somebody help me ?
ROOM_X and ROOM_Y are int, they define width and height of matrices .
mytex_M1 and mytex_M2 are texture defined as : texture < float,2,cudaReadModeElementType > .
Here is my main :
int main(int argc, char * argv[]) {
int size = ROOM_X * ROOM_Y * sizeof(float);
//creation of arrays on host.Will be useful for filling the cudaArrays
float *M1_h, *M2_h;
//allocating memories on Host
M1_h = (float *)malloc(size);
M2_h = (float *)malloc(size);
//creation of channel descriptions for 2d texture
cudaChannelFormatDesc channelDesc_M1 = cudaCreateChannelDesc<float>();
cudaChannelFormatDesc channelDesc_M2 = cudaCreateChannelDesc<float>();
//creation of 2 cudaArray * .
cudaArray *M1_array,*M2_array;
//bind arrays and channel in order to allocate space
cudaMallocArray(&M1_array,&channelDesc_M1,ROOM_X,ROOM_Y);
cudaMallocArray(&M2_array,&channelDesc_M2,ROOM_X,ROOM_Y);
//filling the matrices on host
Matrix(M1_h);
Matrix(M2_h);
//copy from host to device (putting the initial values of M1 and M2 into the arrays)
cudaMemcpyToArray(M1_array, 0, 0,M1_h, size,cudaMemcpyHostToDevice);
cudaMemcpyToArray(M2_array, 0, 0,M2_h, size,cudaMemcpyHostToDevice);
//set textures parameters
mytex_M1.addressMode[0] = cudaAddressModeWrap;
mytex_M1.addressMode[1] = cudaAddressModeWrap;
mytex_M1.filterMode = cudaFilterModeLinear;
mytex_M1.normalized = true; //NB coordinates in [0,1]
mytex_M2.addressMode[0] = cudaAddressModeWrap;
mytex_M2.addressMode[1] = cudaAddressModeWrap;
mytex_M2.filterMode = cudaFilterModeLinear;
mytex_M2.normalized = true; //NB coordinates in [0,1]
//bind arrays to the textures
cudaBindTextureToArray(mytex_M1,M1_array);
cudaBindTextureToArray(mytex_M2,M2_array);
//allocate device memory for result
float* M1_d;
cudaMalloc( (void**)&M1_d, size);
//dimensions of grid and blocks
dim3 dimGrid(ROOM_X,ROOM_Y);
dim3 dimBlock(1,1);
//execution of the kernel . The result of the multiplication has to be put in M1_d
mul_texture<<<dimGrid, dimBlock >>>(M1_d);
//copy result from device to host
cudaMemcpy(M1_h,M1_d, size, cudaMemcpyDeviceToHost);
//free memory on device
cudaFreeArray(M1_array);
cudaFreeArray(M2_array);
cudaFree(M1_d);
//free memory on host
free(M1_h);
free(M2_h);
return 0;
}
When you declare a texture
A texture reference can only be declared as a static global variable and cannot be passed as an argument to a function.
http://docs.nvidia.com/cuda/cuda-c-programming-guide/#texture-reference-api
So, if you have successfully define the texture references, initialize the arrays, copy then to the texture space and prepare the output buffers (something that seems to be done according to your code), what you need to do is implement the kernel. For example:
__global__ void
mul_texture(float* M1_d, int w, int h)
{
// map from threadIdx/BlockIdx to pixel position
int x = threadIdx.x + blockIdx.x * blockDim.x;
int y = threadIdx.y + blockIdx.y * blockDim.y;
// take care of the size of the image, it's a good practice
if ( x < w && y < h )
{
// the output M1_d is actually represented as 1D array
// so the offset of each value is related to their (x,y) position
// in a tow-major order
int gid = x + y * w;
// As texture are declared at global scope,
// we can access their content at any kernel
float M1_value = tex2D(mytex_M1,x,y);
float M2_value = tex2D(mytex_M2,x,y);
// The final results is the pointwise multiplication
M1_d[ gid ] = M1_value * M2_value;
}
}
You need to change the kernel invocation to include the w and h values, corresponding to the width (number of columns in the matrix) and height (number of rows of the matrix).
mul_texture<<<dimGrid, dimBlock >>>(M1_d, ROOM_X, ROOM_Y);
Note you are not doing any error checking, something that will help you quite a lot now and in the future. I have not checked if the kernel provided in this answer works as your code didnt compile.

Finding min and max by the cuda kernel reduction

Here is my kernel code
typedef unsigned char Npp8u;
...
// Kernel Implementation
__device__ unsigned int min_device;
__device__ unsigned int max_device;
__global__ void findMax_Min(Npp8u * data, int numEl){
int index = blockDim.x*blockIdx.x + threadIdx.x;
int shared_index = threadIdx.x;
__shared__ Npp8u data_shared_min[BLOCKDIM];
__shared__ Npp8u data_shared_max[BLOCKDIM];
// check index condition
if(index < numEl){
data_shared_min[shared_index] = data[index]; //pass values from global to shared memory
__syncthreads();
data_shared_max[shared_index] = data[index]; //pass values from global to shared memory
for (unsigned int stride = BLOCKDIM/2; stride > 0; stride >>= 1) {
if(threadIdx.x < stride){
if(data_shared_max[threadIdx.x] < data_shared_max[threadIdx.x+stride]) data_shared_max[shared_index] = data_shared_max[shared_index+stride];
if(data_shared_min[threadIdx.x]> data_shared_min[threadIdx.x+stride]) data_shared_min[shared_index] = data_shared_min[shared_index+stride];
}
__syncthreads();
}
if(threadIdx.x == 0 ){
atomicMin(&(min_device), (unsigned int)data_shared_min[threadIdx.x ]);
//min_device =10;
__syncthreads();
atomicMax(&(max_device), (unsigned int)data_shared_max[threadIdx.x ]);
}
}else{
data_shared_min[shared_index] = 9999;
}
}
I have an image that is 512x512 and I want to find the min and max pixel values. data is the 1-D version of the image. This code works for max but not for min value. As I checked from matlab max value is 202 and min value is 10 but it finds 0 for the min value. Here is my kernel codes and memcpy calls
int main(){
// Host parameter declarations.
Npp8u * imageHost;
int nWidth, nHeight, nMaxGray;
// Load image to the host.
std::cout << "Load PGM file." << std::endl;
imageHost = LoadPGM("lena_before.pgm", nWidth, nHeight, nMaxGray);
// Device parameter declarations.
Npp8u * imageDevice;
unsigned int max, min;
size_t size = sizeof(Npp8u)*nWidth*nHeight;
cudaMalloc((Npp8u**)&imageDevice, size);
cudaMemcpy(imageDevice, imageHost, size, cudaMemcpyHostToDevice);
int numPixels = nWidth*nHeight;
dim3 numThreads(BLOCKDIM);
dim3 numBlocks(numPixels/BLOCKDIM + (numPixels%BLOCKDIM == 0 ? 0 : 1));
findMax_Min<<<numBlocks, numThreads>>>(imageDevice,numPixels);
cudaMemcpyFromSymbol(&max,max_device, sizeof(max_device), 0, cudaMemcpyDeviceToHost);
cudaMemcpyFromSymbol(&min,min_device, sizeof(min_device), 0, cudaMemcpyDeviceToHost);
printf("Min value for image : %i\n", min);
printf("Max value for image : %i\n", max);
...
Another interesting thing is changing the order of cudaMemcpy just after the kernel call also causes malfunctioning and values both are read as zero. I do not see the problem. Is there anyone sees the obstructed part?
You might want to do cuda error checking. You might also want to initialize min_device to a large value and max_device to zero. There are other problems with your reduction method related to stride (what happens in the last block of an odd size image when you add stride to threadIdx.x, it may exceed the defined image range in shared memory) , but I don't think it matters for a 512x512 image. If min_device just happened to start out at zero, all of your atomicMin operations would always leave zero there.
You can try initializing min_device and max_device like this:
__device__ unsigned int min_device = 9999;
__device__ unsigned int max_device = 0;
For the cudamemcpy calls at the end, you are copying 4 bytes (size of max_device) into a one-byte variable (Npp8u max) and likewise for min. So that's a problem. Since you're using pointers, the copy operation is definitely overwriting something that you don't intend. If the compiler stores the variables sequentially the way you have them defined, one copy operation is overwriting the other variable, which I think would explain the behavior you're seeing. If you created min and max as unsigned int quantities, I think this problem would go away.
EDIT: Since you haven't shown your actual block dimensions, it's possible that you still have a problem with your reduction. You might want to change this line:
if(threadIdx.x < stride){
To something like:
if((threadIdx.x < stride) && ((index + stride)< numEl)){
This or something like it should correct the hazard I mention in the first paragraph. I guess you're trying to account for the hazard using this line:
data_shared_min[shared_index] = 9999;
But there's no guarantee that line of code gets executed before the data element that it is setting in shared memory gets read by some other thread. I also don't know what happens when you assign a value of 9999 to a byte quantity, but it's probably not what you expect.

Tips for optimizing X_transpose*X CUDA kernel

I am writing my first CUDA application and am writing all the kernels my self for practice.
In one portion I am simply calculating X_transpose * X.
I have been using cudaMallocPitch and cudaMemcpy2D, I first allocate enough space on the device for X and X_transpose*X. I copy X to the device, my kernel takes two inputs, the X matrix, then the space to write the X_transpose * X result.
Using the profiler the kernel originally took 104 seconds to execute on a matrix of size 5000x6000. I pad the matrix with zeros on the host so that it is a multiple of the block size to avoid checking the bounds of the matrix in the kernel. I use a block size of 32 by 32.
I made some changes to try to maximize coalesced reads/writes to global memory, this seemed to help significantly. Using the visual profiler to profile the release build of my code, the kernel now takes 4.27 seconds to execute.
I haven't done an accurate timing of my matlab execution(just the operation X'*X;), but it appears to be about 3 seconds. I was hoping I could get much better speedups than matlab using CUDA.
The nvidia visual profiler is unable to find any issues with my kernel, I was hoping the community here might have some suggestions as to how I can make it go faster.
The kernel code:
__global__ void XTXKernel(Matrix X, Matrix XTX) {
//find location in output matrix
int blockRow = blockIdx.y;
int blockCol = blockIdx.x;
int row = threadIdx.y;
int col = threadIdx.x;
Matrix XTXsub = GetSubMatrix(XTX, blockRow, blockCol);
float Cvalue = 0;
for(int m = 0; m < (X.paddedHeight / BLOCK_SIZE); ++m) {
//Get sub-matrix
Matrix Xsub = GetSubMatrix(X, m, blockCol);
Matrix XTsub = GetSubMatrix(X, m, blockRow);
__shared__ float Xs[BLOCK_SIZE][BLOCK_SIZE];
__shared__ float XTs[BLOCK_SIZE][BLOCK_SIZE];
//Xs[row][col] = GetElement(Xsub, row, col);
//XTs[row][col] = GetElement(XTsub, col, row);
Xs[row][col] = *(float*)((char*)Xsub.data + row*Xsub.pitch) + col;
XTs[col][row] = *(float*)((char*)XTsub.data + row*XTsub.pitch) + col;
__syncthreads();
for(int e = 0; e < BLOCK_SIZE; ++e)
Cvalue += Xs[e][row] * XTs[col][e];
__syncthreads();
}
//write the result to the XTX matrix
//SetElement(XTXsub, row, col, Cvalue);
((float *)((char*)XTXsub.data + row*XTX.pitch) + col)[0] = Cvalue;
}
The definition of my Matrix structure:
struct Matrix {
matrixLocation location;
unsigned int width; //width of matrix(# cols)
unsigned int height; //height of matrix(# rows)
unsigned int paddedWidth; //zero padded width
unsigned int paddedHeight; //zero padded height
float* data; //pointer to linear array of data elements
size_t pitch; //pitch in bytes, the paddedHeight*sizeof(float) for host, device determines own pitch
size_t size; //total number of elements in the matrix
size_t paddedSize; //total number of elements counting zero padding
};
Thanks in advance for your suggestions.
EDIT: I forgot to mention, I am running the on a Kepler card, GTX 670 4GB.
Smaller block size like 16x16 or 8x8 may be faster. This slides also demos larger non-square size of block/shared mem may be faster for particular matrix size.
For shared mem allocation, add a dumy element on the leading dimension by using [BLOCK_SIZE][BLOCK_SIZE+1] to avoid the bank conflict.
Try to unroll the inner for loop by using #pragma unroll
On the other hand, You probably won't be much faster than matlab GPU code for large enough A'*A. Since the performance bottleneck of matlab is the invoking overhead rather than the kernel performance.
The cuBLAS routine culas_gemm() may have highest performance for matrix multiplication. You could compare yours with it.
MAGMA routine magma_gemm() has higher performance than cuBLAS in some cases. It's a open source project. You may also get some ideas from their code.

CUFFT: How to calculate fft of pitched pointer?

I'm trying to calculate the fft of an image using CUFFT. It seems like CUFFT only offers fft of plain device pointers allocated with cudaMalloc.
My input images are allocated using cudaMallocPitch but there is no option for handling pitch of the image pointer.
Currently, I have to remove the alignment of rows, then execute the fft, and copy back the results to the pitched pointer. My current code is as follows:
void fft_device(float* src, cufftComplex* dst, int width, int height, int srcPitch, int dstPitch)
{
//src and dst are device pointers allocated with cudaMallocPitch
//Convert them to plain pointers. No padding of rows.
float *plainSrc;
cufftComplex *plainDst;
cudaMalloc<float>(&plainSrc,width * height * sizeof(float));
cudaMalloc<cufftComplex>(&plainDst,width * height * sizeof(cufftComplex));
cudaMemcpy2D(plainSrc,width * sizeof(float),src,srcPitch,width * sizeof(float),height,cudaMemcpyDeviceToDevice);
cufftHandle handle;
cufftPlan2d(&handle,width,height,CUFFT_R2C);
cufftSetCompatibilityMode(handle,CUFFT_COMPATIBILITY_NATIVE);
cufftExecR2C(handle,plainSrc,plainDst);
cufftDestroy(handle);
cudaMemcpy2D(dst,dstPitch,plainDst,width * sizeof(cufftComplex),width * sizeof(cufftComplex),height,cudaMemcpyDeviceToDevice);
cudaFree(plainSrc);
cudaFree(plainDst);
}
It gives correct result, but I don't want to do 2 extra memory allocations and copies inside the function. I want to do something like this:
void fft_device(float* src, cufftComplex* dst, int width, int height, int srcPitch, int dstPitch)
{
//src and dst are device pointers allocated with cudaMallocPitch
//Don't know how to handle pitch here???
cufftHandle handle;
cufftPlan2d(&handle,width,height,CUFFT_R2C);
cufftSetCompatibilityMode(handle,CUFFT_COMPATIBILITY_NATIVE);
cufftExecR2C(handle,src,dst);
cufftDestroy(handle);
}
Question:
How to calculate the fft of pitched pointer directly using CUFFT?
I think you may be interested in cufftPlanMany which would let you do 1D, 2D, and 3D ffts with pitches. The key here is inembed and onembed parameters.
You can look up CUDA_CUFFT_Users_Guide.pdf (Pages 23-24) for more information. But for your example, you'd be doing something like the follows.
void fft_device(float* src, cufftComplex* dst,
int width, int height,
int srcPitch, int dstPitch)
{
cufftHandle handle;
int rank = 2; // 2D fft
int n[] = {width, height}; // Size of the Fourier transform
int istride = 1, ostride = 1; // Stride lengths
int idist = 1, odist = 1; // Distance between batches
int inembed[] = {srcPitch, height}; // Input size with pitch
int onembed[] = {dstPitch, height}; // Output size with pitch
int batch = 1;
cufftPlanMany(&handle, rank, n,
inembed, istride, idist,
onembed, ostride, odist, CUFFT_R2C, batch);
cufftSetCompatibilityMode(handle,CUFFT_COMPATIBILITY_NATIVE);
cufftExecR2C(handle,src,dst);
cufftDestroy(handle);
}
P.S. I did not add return checks for the sake of example here. Always check for return values in your code.

CUDA-Kernel supposed to be dynamic crashes depending upon block size

I want to do a Sparse Matrix, Dense Vector multiplication. Lets assume the only storage format for compressing the entries in the Matrix is compressed row storage CRS.
My kernel looks like the following:
__global__ void
krnlSpMVmul1(
float *data_mat,
int num_nonzeroes,
unsigned int *row_ptr,
float *data_vec,
float *data_result)
{
extern __shared__ float local_result[];
local_result[threadIdx.x] = 0;
float vector_elem = data_vec[blockIdx.x];
unsigned int start_index = row_ptr[blockIdx.x];
unsigned int end_index = row_ptr[blockIdx.x + 1];
for (int index = (start_index + threadIdx.x); (index < end_index) && (index < num_nonzeroes); index += blockDim.x)
local_result[threadIdx.x] += (data_mat[index] * vector_elem);
__syncthreads();
// Reduction
// Writing accumulated sum into result vector
}
As you can see the kernel is supposed to be as naive as possible and it even does a few things wrong (e.g. vector_elem is just not always the correct value). I am aware of those things.
Now to my problem:
Suppose I am using a blocksize of 32 or 64 threads. As soon as a row in my matrix has more than 16 nonzeroes (e.g. 17) only the first 16 multiplications are done and save to shared memory. I know that the value at local_result[16] which is the result of the 17th multiplication is just zero. Using a blocksize of 16 or 128 threads fixes the explained problem.
Since I am fairly new to CUDA I might have overlooked the simplest thing but I cannot make up any more situations to look at.
Help is very much appreciated!
Edit towards talonmies comment:
I printed the values which were in local_result[16] directly after the computation. It was 0. Nevertheless, here is the missing code:
The reduction part:
int k = blockDim.x / 2;
while (k != 0)
{
if (threadIdx.x < k)
local_result[threadIdx.x] += local_result[threadIdx.x + k];
else
return;
__syncthreads();
k /= 2;
}
and how I write the results back to global memory:
data_result[blockIdx.x] = local_result[0];
Thats all I got.
Right now I am testing a scenario with a matrix consisting of a single row with 17 element which all are non-zeroes. The buffers look like this in pseudocode:
float data_mat[17] = { val0, .., val16 }
unsigned int row_ptr[2] = { 0, 17 }
float data_vec[17] = { val0 } // all values are the same
float data_result[1] = { 0 }
And thats an excerpt of my wrapper function:
float *dev_data_mat;
unsigned int *dev_row_ptr;
float *dev_data_vec;
float *dev_data_result;
// Allocate memory on the device
HANDLE_ERROR(cudaMalloc((void**) &dev_data_mat, num_nonzeroes * sizeof(float)));
HANDLE_ERROR(cudaMalloc((void**) &dev_row_ptr, num_row_ptr * sizeof(unsigned int)));
HANDLE_ERROR(cudaMalloc((void**) &dev_data_vec, dim_x * sizeof(float)));
HANDLE_ERROR(cudaMalloc((void**) &dev_data_result, dim_y * sizeof(float)));
// Copy each buffer into the allocated memory
HANDLE_ERROR(cudaMemcpy(
dev_data_mat,
data_mat,
num_nonzeroes * sizeof(float),
cudaMemcpyHostToDevice));
HANDLE_ERROR(cudaMemcpy(
dev_row_ptr,
row_ptr,
num_row_ptr * sizeof(unsigned int),
cudaMemcpyHostToDevice));
HANDLE_ERROR(cudaMemcpy(
dev_data_vec,
data_vec,
dim_x * sizeof(float),
cudaMemcpyHostToDevice));
HANDLE_ERROR(cudaMemcpy(
dev_data_result,
data_result,
dim_y * sizeof(float),
cudaMemcpyHostToDevice));
// Calc grid dimension and block dimension
dim3 grid_dim(dim_y);
dim3 block_dim(BLOCK_SIZE);
// Start kernel
krnlSpMVmul1<<<grid_dim, block_dim, BLOCK_SIZE>>>(
dev_data_mat,
num_nonzeroes,
dev_row_ptr,
dev_data_vec,
dev_data_result);
I hope this is straightforward but will explain things if it is of any interest.
One more thing: I just realized that using a BLOCK_SIZE of 128 and having 33 nonzeroes makes the kernel fail as well. Again just the last value is not being computed.
Your dynamically allocated shared memory size is incorrect. Right now you are doing this:
krnlSpMVmul1<<<grid_dim, block_dim, BLOCK_SIZE>>>(.....)
The shared memory size should be given in bytes. Using your 64 threads per block case, that means you would be allocating enough shared memory for 16 float sized words and explains why the magic 17 entries per row case results in failure - you have a shared buffer overflow which will trigger a protection fault in the GPU and abort the kernel.
You should be doing something like this:
krnlSpMVmul1<<<grid_dim, block_dim, BLOCK_SIZE * sizeof(float)>>>(.....)
That will give you the correct dynamic shared memory size and should eliminate the problem.