cuFFT of a matrix as a 1D transformation of rows or columns - cuda

I could not find an example of applying cuFFT in CUDA where the transform of a matrix is realized as 1D transforms of its rows and columns.
I have a 2048x2048 array (stored as a 1D array of cuComplex data). With the 2D transform there is no problem. But what I need now is to do the transform along x, do some work on it, take the inverse FFT, then do the transform along y, do other work on it, and then take its inverse transform.
What exactly would the sequence of commands look like if I want to use parallel processing? Should I use cufftPlanMany? How? Or is there perhaps an example somewhere that I was not able to find?

In the cuFFT Library User's Guide, on page 3, there is an example of how to compute a batch of BATCH one-dimensional DFTs of size NX. Using cufftPlan1d(&plan, NX, CUFFT_C2C, BATCH); and then cufftExecC2C will perform BATCH 1D FFTs of size NX. To achieve that, you have to arrange your data in a complex array of length BATCH*NX. In your case, for the transform along x, it would be BATCH=2048 and NX=2048. For the transforms along y, you have to transpose the matrix arising from the previous calculations.
Your code will look like the following:
#include <cufft.h>
#include <cuda_runtime.h>

#define NX 2048
#define NY 2048

int main() {
    cufftHandle plan;
    cufftComplex *data;
    ...
    cudaMalloc((void**)&data, sizeof(cufftComplex)*NX*NY);

    // NY 1D transforms of length NX: the transforms along x
    cufftPlan1d(&plan, NX, CUFFT_C2C, NY);
    ...
    cufftExecC2C(plan, data, data, CUFFT_FORWARD);
    ...
    // do some work
    ...
    // make a transposition
    ...
    // NX 1D transforms of length NY: the transforms along y
    cufftDestroy(plan);                     // release the first plan before reusing the handle
    cufftPlan1d(&plan, NY, CUFFT_C2C, NX);
    ...
    cufftExecC2C(plan, data, data, CUFFT_FORWARD);
    ...
    cufftDestroy(plan);
    cudaFree(data);
    return 0;
}
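To answer the cufftPlanMany part of the question: in principle the transposition can be avoided by describing the column layout with cuFFT's advanced data layout. The following is only a minimal sketch, assuming row-major storage with NY elements per row; the stride and distance choices are my reading of the cufftPlanMany documentation and should be verified:

int n[1] = { NX };        // length of each transform along y (one column)
int embed[1] = { NX };    // storage descriptor; must be non-NULL to enable the advanced layout
cufftHandle plan_y;
cufftPlanMany(&plan_y, 1, n,
              embed, NY, 1,    // istride = NY: column elements are NY apart; idist = 1: next column starts one element later
              embed, NY, 1,
              CUFFT_C2C, NY);  // batch = NY: one transform per column
cufftExecC2C(plan_y, data, data, CUFFT_FORWARD);
cufftDestroy(plan_y);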

Related

Confusion on row/column vectors in CUBLAS

I am just starting with CUBLAS/CUDA programming. I mainly use it for matrix and vector operations. I am quite confused about the orientation of vectors used in CUBLAS. It seems that there is no difference between a row and a column vector. So if I use a level-2 function to multiply a matrix with a vector, how can I specify the orientation of the vector? Will it always be treated as a column vector? If I want to multiply a column vector (nx1) by a row vector (1xm) to make a matrix (nxm), should I always treat them as matrices and use a level-3 function for the multiplication?
Also, I am using thrust to generate the vectors, so if I pass a thrust vector (n elements) to cublasCgemm to form a 1xn or nx1 matrix (i.e. a row or column vector), will that vector be treated as 1xn or nx1 if I set the cublasOperation_t to CUBLAS_OP_N?
Thanks.
All data are stored through single pointers, i.e. double*, and laid out sequentially in memory. There is no difference between row and column vectors. Single pointers are used for 2D arrays, too. CUBLAS gives you a convenient macro for locating elements in a matrix:
#define IDX2F(i,j,ld) ((((j)-1)*(ld))+((i)-1))
where i is the row index, j is the column index, and ld is the leading dimension of the matrix. ld is needed when you want to operate on a submatrix of a full matrix.
The multiplication (nx1)(1xm)=(nxm) is performed by the cublasDger function.
cublasStatus_t cublasDger(cublasHandle_t handle, int m, int n,
const double *alpha,
const double *x, int incx,
const double *y, int incy,
double *A, int lda)
If, for example, y is part of a (kxm) matrix, then use incy=k.
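As an illustration, here is a minimal sketch of the outer product A = x*y^T with cublasDger (my example, not part of the original answer; d_x, d_y and d_A are assumed to be device pointers holding n, m and n*m doubles respectively):

#include <cublas_v2.h>

cublasHandle_t handle;
cublasCreate(&handle);
double alpha = 1.0;
cudaMemset(d_A, 0, n * m * sizeof(double)); // cublasDger accumulates: A = alpha*x*y^T + A
cublasDger(handle, n, m, &alpha,            // A is n x m column-major, so the BLAS m is n and the BLAS n is m
           d_x, 1,                          // the (nx1) column vector
           d_y, 1,                          // the (1xm) row vector
           d_A, n);                         // lda = number of rows of A
cublasDestroy(handle);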

running FFTW on GPU vs using CUFFT

I have a basic C++ FFTW implementation that looks like this:
for (int i = 0; i < N; i++){
    // declare pointers and plan
    fftw_complex *in, *out;
    fftw_plan p;
    // allocate
    in = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
    out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
    // initialize "in"
    ...
    // create plan
    p = fftw_plan_dft_1d(N, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
    // execute plan
    fftw_execute(p);
    // clean up
    fftw_destroy_plan(p);
    fftw_free(in); fftw_free(out);
}
I'm doing N fft's in a for loop. I know I can execute many plans at once with FFTW, but in my implementation in and out are different every loop. The point is I'm doing the entire FFTW pipeline INSIDE a for loop.
I want to transition to using CUDA to speed this up. I understand that CUDA has its own FFT library, CUFFT. The syntax is very similar. From their online documentation:
#define NX 64
#define NY 64
#define NZ 128
cufftHandle plan;
cufftComplex *data1, *data2;
cudaMalloc((void**)&data1, sizeof(cufftComplex)*NX*NY*NZ);
cudaMalloc((void**)&data2, sizeof(cufftComplex)*NX*NY*NZ);
/* Create a 3D FFT plan. */
cufftPlan3d(&plan, NX, NY, NZ, CUFFT_C2C);
/* Transform the first signal in place. */
cufftExecC2C(plan, data1, data1, CUFFT_FORWARD);
/* Transform the second signal using the same plan. */
cufftExecC2C(plan, data2, data2, CUFFT_FORWARD);
/* Destroy the cuFFT plan. */
cufftDestroy(plan);
cudaFree(data1); cudaFree(data2);
However, each of these "kernels" (as NVIDIA calls them) (cufftPlan3d, cufftExecC2C, etc.) is a call to and from the GPU. If I understand the CUDA structure correctly, each of these method calls is an INDIVIDUALLY parallelized operation:
#define NX 64
#define NY 64
#define NZ 128
cufftHandle plan;
cufftComplex *data1, *data2;
cudaMalloc((void**)&data1, sizeof(cufftComplex)*NX*NY*NZ);
cudaMalloc((void**)&data2, sizeof(cufftComplex)*NX*NY*NZ);
/* Create a 3D FFT plan. */
cufftPlan3d(&plan, NX, NY, NZ, CUFFT_C2C); // DO THIS IN PARALLEL ON GPU, THEN COME BACK TO CPU
/* Transform the first signal in place. */
cufftExecC2C(plan, data1, data1, CUFFT_FORWARD); // DO THIS IN PARALLEL ON GPU, THEN COME BACK TO CPU
/* Transform the second signal using the same plan. */
cufftExecC2C(plan, data2, data2, CUFFT_FORWARD); // DO THIS IN PARALLEL ON GPU, THEN COME BACK TO CPU
/* Destroy the cuFFT plan. */
cufftDestroy(plan);
cudaFree(data1); cudaFree(data2);
I understand how this can speed up my code by running each FFT step on a GPU. But, what if I want to parallelize my entire for loop? What if I want each of my original N for loops to run the entire FFTW pipeline on the GPU? Can I create a custom "kernel" and call FFTW methods from the device (GPU)?
You cannot call FFTW methods from device code. The FFTW libraries are compiled x86 code and will not run on the GPU.
If the "heavy lifting" in your code is in the FFT operations, and the FFT operations are of reasonably large size, then just calling the cufft library routines as indicated should give you good speedup and approximately fully utilize the machine. Once the machine is fully utilized, there is generally no additional benefit to trying to run more things in parallel.
cufft routines can be called by multiple host threads, so it is possible to make multiple calls into cufft for multiple independent transforms. It's unlikely you would see much speedup from this if the individual transforms are large enough to utilize the machine.
cufft also supports batched plans, which are another way to execute multiple transforms "at once".
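For example, a hedged sketch of how the loop of N length-N transforms above could map onto a single batched plan (assuming all N input signals are packed contiguously on the device, with signal i starting at offset i*N):

cufftHandle plan;
cufftComplex *d_data;
cudaMalloc((void **)&d_data, sizeof(cufftComplex) * N * N);
// ... copy or generate the N signals on the device ...
cufftPlan1d(&plan, N, CUFFT_C2C, N);   // one plan executing N transforms of length N
cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
cufftDestroy(plan);
cudaFree(d_data);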

cudaMemcpy function usage

How will the cudaMemcpy function work in this case?
I have declared a matrix like this
float imagen[par->N][par->M];
and I want to copy it to the cuda device so I did this
float *imagen_cuda;
int tam_cuda=par->M*par->N*sizeof(float);
cudaMalloc((void**) &imagen_cuda,tam_cuda);
cudaMemcpy(imagen_cuda,imagen,tam_cuda,cudaMemcpyHostToDevice);
Will this copy the 2D array into a 1D array correctly?
And how can I copy it to another 2D array? Can I change the declaration to this, and will it work?
float **imagen_cuda;
It's not trivial to handle a doubly-subscripted C array when copying data between host and device. For the most part, cudaMemcpy (including cudaMemcpy2D) expects an ordinary pointer for source and destination, not a pointer-to-pointer.
The simplest approach (I think) is to "flatten" the 2D arrays, both on host and device, and use index arithmetic to simulate 2D coordinates:
float imagen[par->N][par->M];                 // statically declared 2D array on the host
float *myimagen = &(imagen[0][0]);            // "flattened" view: a single pointer to the first element
float myval = myimagen[(rowsize*row) + col];  // 2D indexing done by hand (rowsize == par->M here)
You can then use ordinary cudaMemcpy operations to handle the transfers (using the myimagen pointer):
float *d_myimagen;
cudaMalloc((void **)&d_myimagen, (par->N * par->M)*sizeof(float));
cudaMemcpy(d_myimagen, myimagen, (par->N * par->M)*sizeof(float), cudaMemcpyHostToDevice);
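To illustrate (a hedged sketch, not part of the original answer; the kernel name and the doubling operation are made up), the flattened image can be indexed on the device with the same row*cols + col arithmetic, and copied back through the same flat pointers:

__global__ void process(float *img, int rows, int cols)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < rows && col < cols)
        img[row * cols + col] *= 2.0f;   // placeholder per-pixel operation
}

// after the kernel:
// cudaMemcpy(myimagen, d_myimagen, (par->N * par->M)*sizeof(float), cudaMemcpyDeviceToHost);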
If you really want to handle dynamically sized (i.e. not known at compile time) doubly-subscripted arrays, you can review this question/answer.

Multiply matrix by scalar

I'm a newbie with CUDA and CUBLAS.
I want to multiply each element of a matrix (I used cublasSetMatrix) by a scalar value.
Can I use cublas<t>scal() for that? The documentation says it's for a vector.
Thanks.
Yes, you can use it for a matrix scaling operation as well, assuming your matrix is stored contiguously. That means you did an ordinary cudaMalloc with a flat pointer to store the matrix. In that case, even though it's a "matrix", it's stored contiguously in memory, so the storage looks the same as a vector's. If you have an MxN matrix, pass M*N as the number of elements of the vector.
For example, something like (omitting error checking for clarity/brevity):
float *mymatrix, *d_mymatrix;
int size = M*N*sizeof(float);
mymatrix = (float *)malloc(size);
cudaMalloc((void **)&d_mymatrix, size);
... (cublas/handle setup)
cublasSetVector(M*N, sizeof(float), mymatrix, 1, d_mymatrix, 1); // copy the matrix as a flat vector of M*N elements
float alpha = 5.0f;
cublasSscal(handle, M*N, &alpha, d_mymatrix, 1);                 // scale all M*N elements in place

Non Square Matrix Multiplication in CUDA

The code I use for matrix multiplications in CUDA lets me multiply both square and non square matrices, however, both Width and Height MUST be multiples of blocksize.
So, for example, I can multiply [3][6] * [6][3] (using blocksize=3), but I can't multiply [3][2]*[2][3].
Does anyone know a way to do that? This is my kernel:
#include <stdio.h>
#include <limits.h>
#include <stdlib.h>

#define blocksize 3
#define HM (1*blocksize)
#define WM (2*blocksize)
#define WN (1*blocksize)
#define HN WM
#define WP WN
#define HP HM
#define PTH WM
#define PTW HM

__global__ void nonsquare(float*M, float*N, float*P, int uWM, int uWN)
{
    __shared__ float MS[blocksize][blocksize];
    __shared__ float NS[blocksize][blocksize];
    int tx=threadIdx.x, ty=threadIdx.y, bx=blockIdx.x, by=blockIdx.y;
    int rowM=ty+by*blocksize;
    int colN=tx+bx*blocksize;
    float Pvalue=0;
    for(int m=0; m< uWM/blocksize;++m){
        MS[ty][tx]=M[rowM*uWM+(m*blocksize+tx)];
        NS[ty][tx]=M[colN + uWN*(m*blocksize+ty)];
        __syncthreads();
        for(int k=0;k<blocksize;k++)
            Pvalue+=MS[ty][k]*NS[k][tx];
        __syncthreads();
        P[rowM*WP+colN]=Pvalue;
    }
}
Thanks in advance!
I think the easiest thing to do would be to just pad the blocks on the end with zeros:
for(int m=0; m< uWM/blocksize;++m){
    colM = m*blocksize+tx;
    rowN = m*blocksize+ty;
    if (rowM > uWN || rowN > uWM || colM > uWM || colN > uWN) {
        MS[ty][tx]=0.;
        NS[ty][tx]=0.;
    } else {
        MS[ty][tx]=M[rowM*uWM+colM];
        NS[ty][tx]=N[colN + uWN*rowN];
    }
plus or minus. (That NS line should reference N, not M, right?)
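For completeness, here is a self-contained hedged sketch of the same zero-padding idea (my own rewrite rather than a drop-in patch of the code above, reusing the blocksize macro from the question; A is hA x wA, B is wA x wB, P is hA x wB, all row-major):

__global__ void nonsquare_padded(const float *A, const float *B, float *P,
                                 int hA, int wA, int wB)
{
    __shared__ float AS[blocksize][blocksize];
    __shared__ float BS[blocksize][blocksize];
    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * blocksize + ty;            // row of P (and of A)
    int col = blockIdx.x * blocksize + tx;            // column of P (and of B)
    float Pvalue = 0.0f;
    int numTiles = (wA + blocksize - 1) / blocksize;  // round up so partial tiles are covered
    for (int m = 0; m < numTiles; ++m) {
        int aCol = m * blocksize + tx;                // column of A loaded by this thread
        int bRow = m * blocksize + ty;                // row of B loaded by this thread
        // zero-pad loads that fall outside the matrices
        AS[ty][tx] = (row < hA && aCol < wA) ? A[row * wA + aCol] : 0.0f;
        BS[ty][tx] = (bRow < wA && col < wB) ? B[bRow * wB + col] : 0.0f;
        __syncthreads();
        for (int k = 0; k < blocksize; k++)
            Pvalue += AS[ty][k] * BS[k][tx];
        __syncthreads();
    }
    if (row < hA && col < wB)
        P[row * wB + col] = Pvalue;
}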
But, since I seem to be the only one here advocating using existing tuned libraries when possible -- why not use CUBLAS or MAGMA instead of rolling your own? They're fast, and tested by hundreds of users.
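If you do go the library route, a minimal cuBLAS sketch might look like the following (my example; d_A, d_B and d_C are assumed device pointers to row-major hA x wA, wA x wB and hA x wB matrices). cuBLAS assumes column-major storage, so a common trick for row-major data is to compute C = A*B as C^T = B^T*A^T by swapping the operand order:

#include <cublas_v2.h>

cublasHandle_t handle;
cublasCreate(&handle);
float alpha = 1.0f, beta = 0.0f;
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
            wB, hA, wA,   // m, n, k of the transposed problem
            &alpha,
            d_B, wB,      // cuBLAS reads the row-major B buffer as B^T, leading dimension wB
            d_A, wA,      // and the row-major A buffer as A^T, leading dimension wA
            &beta,
            d_C, wB);     // the result buffer holds C^T column-major, i.e. C row-major
cublasDestroy(handle);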
The underlying performance requirement here is that either the first or the second dimension of the shared memory "tile" be a round multiple of 16; historically, that is what is necessary to achieve optimal global memory bandwidth (i.e. half-warp coalesced transactions). Whether it should be the first or the second dimension of the tile is dictated by whether the matrices are stored in column- or row-major order. There is nothing to say that the shared memory tile need be square, only that the leading dimension of the storage (LDA in BLAS notation) be a round multiple of 16.
You could easily template the kernel with the tile dimensions as template arguments and instantiate several versions, depending on matrix dimensions. For a given architecture, there is probably an optimal tile dimension which balances occupancy and instruction level parallelism. The "clever" way to solve this is probably to decompose the matrix multiplication into two operations - the first doing the bulk of the work at the optimal tile size, and the second at a different size for the remaining columns. If the result is going straight back to host memory after the product is completed, the second operation might best be done on the host using an optimised BLAS, overlapped with the GPU kernel. This is the approach that many of the routines in the UTK Magma library use.