The data in the 3D matrix was generated by layers (from top to bottom) and I want to multiply that data with a 2D matrix B, but instead of taking each layer, I need to take a vector from layer 1, a vector from layer 2, and so on.
Currently, I copy those vectors from the 3D matrix to a 2D matrix tmpA, multiply it with B (using CUBLAS) and store the result in tmpB, and finally copy the result back row by row to where it corresponds in a 3D matrix C.
Overall, my whole app runs at least twice as fast as the CPU version, but it seems to me that those memory copies, even though they are device-to-device, are not good for performance at all.
What would be a better way to do this computation? I was thinking about rearranging the data before multiplying, so as to avoid the memory copies.
The 3D matrices A and C and the 2D matrix B are already in GPU memory.
EDIT
Let M, N and P be the dimensions of the 3D matrix A, stored in row-major order in a linear array in device memory. My code looks like this:
cudaMalloc((void**)&d_tmpIn,  sizeof(double)*M*P);
cudaMalloc((void**)&d_tmpOut, sizeof(double)*M*P);
cudaMalloc((void**)&d_C,      sizeof(double)*M*N*P);

for (int iN = 0; iN < N; iN++)
{
    // gather one P-length vector from each of the M layers into d_tmpIn
    double *dst = d_tmpIn;
    for (int iM = 0; iM < M; iM++)
    {
        cudaMemcpy(dst, &(d_A[iN*P + iM*N*P]), sizeof(double)*P, cudaMemcpyDeviceToDevice);
        dst += P;
    }

    cublasDgemm(cublasHandle, CUBLAS_OP_N, CUBLAS_OP_N, P, M, M,
                &alpha, d_tmpIn, P, d_B, M, &beta, d_tmpOut, P);

    // scatter the result back, row by row, into its place in d_C
    double *src = d_tmpOut;
    for (int iM = 0; iM < M; iM++)
    {
        cudaMemcpy(&(d_C[iN*P + iM*N*P]), src, sizeof(double)*P, cudaMemcpyDeviceToDevice);
        src += P;
    }
}
Hope this helps.
You don't need to do memory copies! The BLAS and LAPACK APIs were designed so that you can specify the starting point, the stride, the leading dimension, and so on.
This way you can use the 3D arrays A and C as they are, and call cublasDgemm with the right parameters.
In your case (if I understand the code correctly), each matrix is P x M and you have N of them, with the 3D array laid out as P x N x M. So, without allocating memory for d_tmpIn and d_tmpOut, you can do something like this: the number of rows of A is P and the number of columns is M, but the leading dimension (lda) should be set to N * P. The same goes for C.
int lda = N * P;
int ldc = N * P;

for (int iN = 0; iN < N; iN++)
{
    double *d_tmpIn  = d_A + iN * P;
    double *d_tmpOut = d_C + iN * P;

    cublasSetStream(cublasHandle, streams[iN]); // optional
    cublasDgemm(cublasHandle, CUBLAS_OP_N, CUBLAS_OP_N,
                P, M, M, &alpha, d_tmpIn, lda, d_B, M, &beta, d_tmpOut, ldc);
}
You could also create N streams and run each cublas call in a separate stream. Note that this is only going to be useful if M and P are small enough that a single call does not already saturate the GPU computationally.
EDIT: If you do plan to go ahead with streams, try to create them once at the beginning of the program and re-use them. Do not create and destroy streams in the same loop as the Dgemm: that only increases the overhead.
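For reference, a minimal sketch of that one-time stream setup (assuming N is known at initialization; error checking omitted):

// create the streams once, e.g. during initialization
cudaStream_t *streams = (cudaStream_t *)malloc(N * sizeof(cudaStream_t));
for (int i = 0; i < N; i++)
    cudaStreamCreate(&streams[i]);

// ... run the Dgemm loop above, possibly many times ...

// destroy the streams once, e.g. during shutdown
for (int i = 0; i < N; i++)
    cudaStreamDestroy(streams[i]);
free(streams);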
I am trying to use CUBLAS in C++ to rewrite a Python/TensorFlow script which operates on batches of input samples (of shape BxD, where B is the batch size and D is the depth of the flattened 2D matrix).
For the first step, I decided to use CUBLAS's cublasSgemmBatched to compute MatMul for batches of matrices.
I've found a couple of working sample codes, such as the one in the linked question, but what I want is to allocate one big contiguous device array to store batches of flattened, identically shaped matrices. I DO NOT want to store the batches separate from each other in device memory (as they are in the sample code provided in the linked StackOverflow question).
From what I can imagine, somehow I have to get a list of pointers to the starting elements of each batch in device memory, something like this:
float **device_batch_ptr;
cudaMalloc((void**)&device_batch_ptr, batch_size*sizeof(float *));
for(int i = 0 ; i < batch_size; i++ ) {
    // set device_batch_ptr[i] to the starting point of the i'th batch in the device array
}
Note that cublasSgemmBatched needs a float** in which each float* points to the starting element of the corresponding batch in a given input matrix.
Any advice and suggestions will be greatly appreciated.
If your arrays are in contiguous linear memory (device_array), then all you need to do is calculate the offsets using standard pointer arithmetic and store the device addresses in a host array, which you then copy to the device. Something like:
float** device_batch_ptr;
float** h_device_batch_ptr = new float*[batch_size];

cudaMalloc((void**)&device_batch_ptr, batch_size*sizeof(float *));

size_t nelementsperarray = N * N;
for(int i = 0 ; i < batch_size; i++ ) {
    // set h_device_batch_ptr[i] to the starting point of the i'th batch in the device array
    h_device_batch_ptr[i] = device_array + i * nelementsperarray;
}
cudaMemcpy(device_batch_ptr, h_device_batch_ptr, batch_size*sizeof(float *),
           cudaMemcpyHostToDevice);
delete[] h_device_batch_ptr; // the host-side copy is no longer needed
[Obviously never compiled or tested, use at own risk]
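Once the pointer array is populated, the batched multiply itself might look like the sketch below. This is only a sketch: the pointer arrays A_ptrs, B_ptrs and C_ptrs are hypothetical names for device arrays of pointers built exactly like device_batch_ptr above, and the matrices are assumed to be square (N x N) and column-major:

const float alpha = 1.0f;
const float beta  = 0.0f;
// one gemm per batch entry: C[i] = A[i] * B[i]
cublasSgemmBatched(handle,
                   CUBLAS_OP_N, CUBLAS_OP_N,
                   N, N, N,
                   &alpha,
                   (const float**)A_ptrs, N,
                   (const float**)B_ptrs, N,
                   &beta,
                   C_ptrs, N,
                   batch_size);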
I have an N x N square matrix of integers (which is stored in the device as a 1-d array for convenience).
I'm implementing an algorithm which requires the following to be performed:
There are 2N - 1 anti-diagonals in this square (anti-diagonals are the parallel lines running from the top edge to the left edge and from the right edge to the bottom edge).
I need a for loop with 2N - 1 iterations, with each iteration computing one anti-diagonal, starting from the top left and ending at the bottom right.
In each iteration, all the elements on that anti-diagonal must be processed in parallel.
Each anti-diagonal is calculated based on the values of the previous anti-diagonal.
So, how do I index the threads with this requirement in CUDA?
As far as I understand, you want something like
Parallelizing the Smith-Waterman Local Alignment Algorithm using CUDA
At each iteration, the kernel is launched with a different number of threads.
Perhaps the code in Parallel Anti diagonal 'for' loop could be modified as follows:

int iDivUp(const int a, const int b) { return (a % b != 0) ? (a / b + 1) : (a / b); }

#define BLOCKSIZE 32

__global__ void antiparallel(float* d_A, int step, int N) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int j = step - i;
    // guard against threads that fall outside the current anti-diagonal
    if (i < N && j >= 0 && j < N) {
        /* do work on d_A[i*N+j] */
    }
}

for (int step = 0; step < 2*N-1; step++) {
    dim3 dimBlock(BLOCKSIZE);
    dim3 dimGrid(iDivUp(step+1, dimBlock.x)); // anti-diagonal 'step' has at most step+1 elements
    antiparallel<<<dimGrid, dimBlock>>>(d_A, step, N);
}
This code is untested and is just a sketch of a possible solution (provided that I have not misunderstood your question). Furthermore, I do not know how efficient a solution like this would be, since some of the kernels will be launched with very few threads.
I've been struggling the whole day, trying to make a basic CUFFT example work properly. However, I run into a little problem which I cannot identify. Basically, I have a linear 2D array vx with x and y coordinates. Then I just calculate a forward and then a backward CUFFT (in-place); that's it. Then I copy back the array vx, normalize it by NX*NY, and display it.
#define NX 32
#define NY 32
#define LX (2*M_PI)
#define LY (2*M_PI)
float *x = new float[NX*NY];
float *y = new float[NX*NY];
float *vx = new float[NX*NY];
for(int j = 0; j < NY; j++){
    for(int i = 0; i < NX; i++){
        x[j*NX + i] = i * LX/NX;
        y[j*NX + i] = j * LY/NY;
        vx[j*NX + i] = cos(x[j*NX + i]);
    }
}
float *d_vx;
CUDA_CHECK(cudaMalloc(&d_vx, NX*NY*sizeof(float)));
CUDA_CHECK(cudaMemcpy(d_vx, vx, NX*NY*sizeof(float), cudaMemcpyHostToDevice));
cufftHandle planr2c;
cufftHandle planc2r;
CUFFT_CHECK(cufftPlan2d(&planr2c, NY, NX, CUFFT_R2C));
CUFFT_CHECK(cufftPlan2d(&planc2r, NY, NX, CUFFT_C2R));
CUFFT_CHECK(cufftSetCompatibilityMode(planr2c, CUFFT_COMPATIBILITY_NATIVE));
CUFFT_CHECK(cufftSetCompatibilityMode(planc2r, CUFFT_COMPATIBILITY_NATIVE));
CUFFT_CHECK(cufftExecR2C(planr2c, (cufftReal *)d_vx, (cufftComplex *)d_vx));
CUFFT_CHECK(cufftExecC2R(planc2r, (cufftComplex *)d_vx, (cufftReal *)d_vx));
CUDA_CHECK(cudaMemcpy(vx, d_vx, NX*NY*sizeof(cufftReal), cudaMemcpyDeviceToHost));
for (int j = 0; j < NY; j++){
    for (int i = 0; i < NX; i++){
        printf("%.3f ", vx[j*NX + i]/(NX*NY));
    }
    printf("\n");
}
When vx is defined as cos(x) or sin(x), it works fine, but when using sin(y) or cos(y), it gives me back the correct function (sin or cos) but with half the amplitude (that is, oscillating between 0.5 and -0.5 instead of 1 and -1)! Note that using sin(2*y) or cos(2*y) (or sin(4*y), cos(4*y), ...) works fine. Any idea?
The problem here is that the input and output of an in-place real-to-complex transform is a complex type whose size isn't the same as that of the input real data (it is twice as large). You haven't allocated enough memory to hold the intermediate complex results of the real-to-complex transform. Quoting from the documentation:
cufftExecR2C() (cufftExecD2Z()) executes a single-precision (double-precision) real-to-complex, implicitly forward, CUFFT transform plan. CUFFT uses as input data the GPU memory pointed to by the idata parameter. This function stores the nonredundant Fourier coefficients in the odata array. Pointers to idata and odata are both required to be aligned to cufftComplex data type in single-precision transforms and cufftDoubleComplex data type in double-precision transforms.
The solution is either to allocate a second device buffer to hold the intermediate result, or to enlarge the in-place allocation so it is large enough to hold the complex data. So the core transform code changes to something like:
float *d_vx;
CUDA_CHECK(cudaMalloc(&d_vx, NX*NY*sizeof(cufftComplex))); // large enough for the complex results
CUDA_CHECK(cudaMemcpy(d_vx, vx, NX*NY*sizeof(cufftReal), cudaMemcpyHostToDevice));

cufftHandle planr2c;
cufftHandle planc2r;
CUFFT_CHECK(cufftPlan2d(&planr2c, NY, NX, CUFFT_R2C));
CUFFT_CHECK(cufftPlan2d(&planc2r, NY, NX, CUFFT_C2R));
CUFFT_CHECK(cufftSetCompatibilityMode(planr2c, CUFFT_COMPATIBILITY_NATIVE));
CUFFT_CHECK(cufftSetCompatibilityMode(planc2r, CUFFT_COMPATIBILITY_NATIVE));

CUFFT_CHECK(cufftExecR2C(planr2c, (cufftReal *)d_vx, (cufftComplex *)d_vx));
CUFFT_CHECK(cufftExecC2R(planc2r, (cufftComplex *)d_vx, (cufftReal *)d_vx));

CUDA_CHECK(cudaMemcpy(vx, d_vx, NX*NY*sizeof(cufftReal), cudaMemcpyDeviceToHost));
[disclaimer: written in browser, never compiled or tested, use at own risk]
Note that only the real-valued data is copied to and from the host here; the device allocation is simply enlarged so that the intermediate complex results fit in place.
As a final comment, would it have been that hard to add the additional 8 or 10 lines required to turn what you posted into a compilable, runnable example that someone trying to help you could work with?
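For completeness, the other option (a separate device buffer for the intermediate complex result) might look something like the following sketch, where the buffer name d_vc is my invention and the complex buffer is sized for the non-redundant R2C output [same disclaimer as above]:

float        *d_vx;
cufftComplex *d_vc;
CUDA_CHECK(cudaMalloc(&d_vx, NX*NY*sizeof(float)));
CUDA_CHECK(cudaMalloc(&d_vc, NY*(NX/2+1)*sizeof(cufftComplex))); // non-redundant coefficients only
CUDA_CHECK(cudaMemcpy(d_vx, vx, NX*NY*sizeof(float), cudaMemcpyHostToDevice));

CUFFT_CHECK(cufftExecR2C(planr2c, d_vx, d_vc)); // out-of-place forward
CUFFT_CHECK(cufftExecC2R(planc2r, d_vc, d_vx)); // out-of-place inverse

CUDA_CHECK(cudaMemcpy(vx, d_vx, NX*NY*sizeof(float), cudaMemcpyDeviceToHost));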
I have a vector, and I would like to do the following, using CUDA and Thrust transformations:
// thrust::device_vector v;
// for k times:
//     calculate constants a and b as functions of k;
//     for (i = 0; i < v.size()-1; i++)
//         v[i] = a*v[i] + b*v[i+1];
How should I correctly implement this? One way would be to have a vector w and apply thrust::transform on v, saving the results to w. But k is unknown ahead of time, and I don't want to create w1, w2, ... and waste a lot of GPU memory. Preferably I want to minimize the amount of data copying. But I'm not sure how to implement this using one vector without the values stepping on each other. Is there something Thrust provides that can do this?
If v.size() is large enough to fully utilize the GPU, you could launch k kernels to do this, with one extra buffer and no extra data transfer: ping-pong between v and a scratch vector u, so each pass reads from one vector and writes into the other. After the loop, the result is in u if K is odd and in v if K is even.
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h> // thrust::placeholders

using namespace thrust::placeholders;

thrust::device_vector<float> u(v.size()); // scratch buffer, same element type as v

for (int k = 0;;)
{
    // calculate a & b as functions of k, then read from v, write to u
    thrust::transform(v.begin(), v.end()-1, v.begin()+1, u.begin(), a*_1 + b*_2);
    k++;
    if (k >= K)
        break; // result is in u

    // calculate a & b as functions of k, then read from u, write to v
    thrust::transform(u.begin(), u.end()-1, u.begin()+1, v.begin(), a*_1 + b*_2);
    k++;
    if (k >= K)
        break; // result is in v
}
I don't actually understand the "k times" part, but the following code may help you.
struct OP {
    const int a, b;
    OP(const int p, const int q) : a(p), b(q) {}
    __host__ __device__
    int operator()(const int v1, const int v2) const {
        return a*v1 + b*v2;
    }
};

thrust::device_vector<int> w(v.size());
thrust::transform(v.begin(), v.end()-1, // input_1
                  v.begin()+1,          // input_2
                  w.begin(),            // output
                  OP(a, b));            // functor
v = w; // note: this is a device-to-device copy; v.swap(w) would avoid it
I think learning about functors and working through a few of the Thrust examples will give you a good guide.
Hope this helps you to solve your problem. :)
I am writing my first CUDA application and am writing all the kernels myself for practice.
In one portion I am simply calculating X_transpose * X.
I have been using cudaMallocPitch and cudaMemcpy2D. I first allocate enough space on the device for X and for X_transpose * X, then copy X to the device; my kernel takes two inputs, the X matrix and the space to write the X_transpose * X result.
Using the profiler, the kernel originally took 104 seconds to execute on a matrix of size 5000x6000. I pad the matrix with zeros on the host so that it is a multiple of the block size, to avoid checking the bounds of the matrix in the kernel. I use a block size of 32 by 32.
I made some changes to try to maximize coalesced reads/writes to global memory, and this seemed to help significantly. Using the visual profiler on the release build of my code, the kernel now takes 4.27 seconds to execute.
I haven't done an accurate timing of the MATLAB execution (just the operation X'*X;), but it appears to be about 3 seconds. I was hoping I could get a much better speedup than MATLAB using CUDA.
The NVIDIA visual profiler is unable to find any issues with my kernel, so I was hoping the community here might have some suggestions as to how I can make it go faster.
The kernel code:
__global__ void XTXKernel(Matrix X, Matrix XTX) {
    //find location in output matrix
    int blockRow = blockIdx.y;
    int blockCol = blockIdx.x;
    int row = threadIdx.y;
    int col = threadIdx.x;

    Matrix XTXsub = GetSubMatrix(XTX, blockRow, blockCol);
    float Cvalue = 0;

    for(int m = 0; m < (X.paddedHeight / BLOCK_SIZE); ++m) {
        //Get sub-matrices
        Matrix Xsub  = GetSubMatrix(X, m, blockCol);
        Matrix XTsub = GetSubMatrix(X, m, blockRow);

        __shared__ float Xs[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float XTs[BLOCK_SIZE][BLOCK_SIZE];

        //Xs[row][col] = GetElement(Xsub, row, col);
        //XTs[row][col] = GetElement(XTsub, col, row);
        //note: the 'col' offset must apply to the pointer, not the loaded value
        Xs[row][col]  = *((float*)((char*)Xsub.data + row*Xsub.pitch) + col);
        XTs[col][row] = *((float*)((char*)XTsub.data + row*XTsub.pitch) + col);

        __syncthreads();

        for(int e = 0; e < BLOCK_SIZE; ++e)
            Cvalue += Xs[e][row] * XTs[col][e];

        __syncthreads();
    }

    //write the result to the XTX matrix
    //SetElement(XTXsub, row, col, Cvalue);
    ((float *)((char*)XTXsub.data + row*XTXsub.pitch) + col)[0] = Cvalue;
}
The definition of my Matrix structure:
struct Matrix {
    matrixLocation location;
    unsigned int width;        //width of matrix (# cols)
    unsigned int height;       //height of matrix (# rows)
    unsigned int paddedWidth;  //zero-padded width
    unsigned int paddedHeight; //zero-padded height
    float* data;               //pointer to linear array of data elements
    size_t pitch;              //pitch in bytes; paddedHeight*sizeof(float) on the host, the device determines its own pitch
    size_t size;               //total number of elements in the matrix
    size_t paddedSize;         //total number of elements counting zero padding
};
Thanks in advance for your suggestions.
EDIT: I forgot to mention that I am running this on a Kepler card, a GTX 670 4GB.
A smaller block size like 16x16 or 8x8 may be faster. These slides also demonstrate that larger, non-square block/shared-memory sizes may be faster for particular matrix sizes.
For the shared-memory allocations, add a dummy element on the leading dimension, i.e. use [BLOCK_SIZE][BLOCK_SIZE+1], to avoid bank conflicts (see the sketch below).
Try to unroll the inner for loop using #pragma unroll.
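For illustration, a minimal sketch of those last two suggestions applied to your kernel:

// pad the leading dimension so column-wise accesses such as XTs[col][e]
// fall into different banks
__shared__ float Xs[BLOCK_SIZE][BLOCK_SIZE + 1];
__shared__ float XTs[BLOCK_SIZE][BLOCK_SIZE + 1];

// ...

#pragma unroll
for (int e = 0; e < BLOCK_SIZE; ++e)
    Cvalue += Xs[e][row] * XTs[col][e];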
On the other hand, you probably won't be much faster than the MATLAB GPU code for a large enough A'*A, since MATLAB's performance bottleneck there is the invocation overhead rather than the kernel performance.
The cuBLAS routine cublas<t>gemm() (e.g. cublasSgemm()) may have the highest performance for matrix multiplication. You could compare yours with it.
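As a hedged sketch of that comparison (assuming, unlike your pitched layout, that d_X holds X contiguously in column-major order with 'rows' rows and 'cols' columns, and d_XTX is a cols x cols output buffer):

const float alpha = 1.0f, beta = 0.0f;
cublasSgemm(handle,
            CUBLAS_OP_T, CUBLAS_OP_N, // computes X^T * X
            cols, cols, rows,
            &alpha,
            d_X, rows,                // A = X, lda = rows
            d_X, rows,                // B = X, ldb = rows
            &beta,
            d_XTX, cols);             // C = X^T * X, ldc = cols

Since X^T * X is symmetric, cublasSsyrk() would also work and computes only one triangle of the result.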
The MAGMA routine magma_gemm() has higher performance than cuBLAS in some cases. It's an open-source project, so you may also get some ideas from their code.