running FFTW on GPU vs using CUFFT

I have a basic C++ FFTW implementation that looks like this:
for (int i = 0; i < N; i++){
    // declare pointers and plan
    fftw_complex *in, *out;
    fftw_plan p;
    // allocate
    in = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
    out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
    // initialize "in"
    ...
    // create plan
    p = fftw_plan_dft_1d(N, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
    // execute plan
    fftw_execute(p);
    // clean up
    fftw_destroy_plan(p);
    fftw_free(in); fftw_free(out);
}
I'm doing N FFTs in a for loop. I know I can execute many plans at once with FFTW, but in my implementation in and out are different in every iteration. The point is that I'm running the entire FFTW pipeline INSIDE a for loop.
I want to transition to using CUDA to speed this up. I understand that CUDA has its own FFT library, cuFFT. The syntax is very similar; from their online documentation:
#define NX 64
#define NY 64
#define NZ 128
cufftHandle plan;
cufftComplex *data1, *data2;
cudaMalloc((void**)&data1, sizeof(cufftComplex)*NX*NY*NZ);
cudaMalloc((void**)&data2, sizeof(cufftComplex)*NX*NY*NZ);
/* Create a 3D FFT plan. */
cufftPlan3d(&plan, NX, NY, NZ, CUFFT_C2C);
/* Transform the first signal in place. */
cufftExecC2C(plan, data1, data1, CUFFT_FORWARD);
/* Transform the second signal using the same plan. */
cufftExecC2C(plan, data2, data2, CUFFT_FORWARD);
/* Destroy the cuFFT plan. */
cufftDestroy(plan);
cudaFree(data1); cudaFree(data2);
However, each of these "kernels" (as NVIDIA calls them) (cufftPlan3d, cufftExecC2C, etc.) is a call to and from the GPU. If I understand the CUDA structure correctly, each of these method calls is an INDIVIDUALLY parallelized operation:
#define NX 64
#define NY 64
#define NZ 128
cufftHandle plan;
cufftComplex *data1, *data2;
cudaMalloc((void**)&data1, sizeof(cufftComplex)*NX*NY*NZ);
cudaMalloc((void**)&data2, sizeof(cufftComplex)*NX*NY*NZ);
/* Create a 3D FFT plan. */
cufftPlan3d(&plan, NX, NY, NZ, CUFFT_C2C); // DO THIS IN PARALLEL ON GPU, THEN COME BACK TO CPU
/* Transform the first signal in place. */
cufftExecC2C(plan, data1, data1, CUFFT_FORWARD); // DO THIS IN PARALLEL ON GPU, THEN COME BACK TO CPU
/* Transform the second signal using the same plan. */
cufftExecC2C(plan, data2, data2, CUFFT_FORWARD); // DO THIS IN PARALLEL ON GPU, THEN COME BACK TO CPU
/* Destroy the cuFFT plan. */
cufftDestroy(plan);
cudaFree(data1); cudaFree(data2);
I understand how this can speed up my code by running each FFT step on a GPU. But what if I want to parallelize my entire for loop? What if I want each of my original N loop iterations to run the entire FFTW pipeline on the GPU? Can I create a custom "kernel" and call FFTW methods from the device (GPU)?

You cannot call FFTW methods from device code. The FFTW libraries are compiled x86 code and will not run on the GPU.
If the "heavy lifting" in your code is in the FFT operations, and the FFT operations are of reasonably large size, then just calling the cufft library routines as indicated should give you good speedup and approximately fully utilize the machine. Once the machine is fully utilized, there is generally no additional benefit to trying to run more things in parallel.
cufft routines can be called by multiple host threads, so it is possible to make multiple calls into cufft for multiple independent transforms. It's unlikely you would see much speedup from this if the individual transforms are large enough to utilize the machine.
cufft also supports batched plans, which are another way to execute multiple transforms "at once".
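For example, the N length-N transforms from the question's loop could be expressed as a single batched plan, provided the data for all transforms is packed contiguously into one device array. A minimal sketch (error checking omitted):

// Perform `batch` 1D complex-to-complex FFTs of length `n` in a single call.
// d_data must hold batch*n cufftComplex elements, with transform i starting at d_data + i*n.
#include <cufft.h>

void batched_fft(cufftComplex *d_data, int n, int batch)
{
    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, batch);            // one plan describes all `batch` transforms
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);  // all transforms run from one API call
    cufftDestroy(plan);
}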

Related

How do I take real FFT using STM32F407G?

I've been trying to use FFT for real data on the STM32F407G for quite some time with no luck. However, when I use the cfft function for complex data, it works. But I don't want to process the imaginary part. So my question is - how do I use the rfft function on the STM32?
Here is the code which I have tried -
#include "stm32f407xx.h"
#include "arm_math.h"
#include "arm_const_structs.h"
#include "core_cm4.h"
#define TEST_LENGTH_SAMPLES 32
extern float32_t ffttestrealip[TEST_LENGTH_SAMPLES];
static float32_t ffttestrealop[TEST_LENGTH_SAMPLES];
uint32_t fftSize = 32;
uint8_t ifftFlag = 0;
uint8_t doBitReverse = 1;
int32_t main(void)
{
    arm_rfft_fast_instance_f32 * S;
    arm_rfft_fast_init_f32 (S,fftSize);
    arm_rfft_fast_f32 (S,ffttestrealip,ffttestrealop,ifftFlag ) ;
    while(1);
}
But when I compile this, it says "error: L6047U: The size of this image (83968 bytes) exceeds the maximum allowed for this version of the linker"
When I comment out "arm_rfft_fast_init_f32 (S,fftSize);", it compiles, but I get a wrong result.
The 32k limit is the binary size limit as opposed to the RAM limit - there is no guarantee that the binary for performing the real FFT will be smaller than the one for the complex FFT.
In reality the real FFT could be a more optimized algorithm that is also more complex: it might require less RAM and fewer CPU cycles to compute, yet still need more (cheaper) operations, and hence more code, to complete.
I have run up to a 4096-point complex FFT on these parts with no trouble.
You're using a size-limited linker; 32 k is the common limit for evaluation versions of IAR and Keil.
You can either pay to get the full version, or build your code with arm gcc, which is free.
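For reference, the CMSIS-DSP rfft_fast functions are normally given the address of a real instance structure rather than an uninitialized pointer; a minimal sketch of typical usage, assuming the standard arm_math.h signatures:

#include "arm_math.h"

#define FFT_SIZE 32

float32_t input[FFT_SIZE];   /* time-domain samples         */
float32_t output[FFT_SIZE];  /* packed real/imag FFT output */

void do_rfft(void)
{
    arm_rfft_fast_instance_f32 S;             /* instance is a struct, not a pointer */
    arm_rfft_fast_init_f32(&S, FFT_SIZE);     /* initialize the instance             */
    arm_rfft_fast_f32(&S, input, output, 0);  /* 0 = forward transform               */
}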

Multiple global functions in the same CUDA source file

Can I write two separate global functions, that compute different things, in the same CUDA source file? Something like this:
__global__ void Ker1(mpz_t *d, mpz_t *c, mpz_t e, mpz_t n)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    mpz_powm(d[i], c[i], e, n);
}
__global__ void Ker2(mpz_t *d, mpz_t *c, mpz_t d, mpz_t n)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    mpz_powm(c[i], d[i], d, n);
}
int main()
{
    /* ... */
    cudaMemcpy(decode_device, decode_buffer, memSize, cudaMemcpyHostToDevice);
    Ker1<<<dimGrid, dimBlock>>>(d_device, c_device, e, n);
    Ker2<<<dimGrid, dimBlock>>>(c_device, d_device, d, n);
    cudaMemcpy(decode_buffer, decode_device, memSize, cudaMemcpyDeviceToHost);
}
If not, how would you do something like this?
It is quite unclear what you're asking, but after three readings I assume: "Can I write several kernels in the same source file?"
You can write as many kernel launches as you want in your main function.
Here is an example, from page 9:
...
cudaMemcpy( dev1, host1, size, H2D ) ;
kernel2 <<< grid, block, 0 >>> ( ..., dev2, ... ) ;
kernel3 <<< grid, block, 0 >>> ( ..., dev3, ... ) ;
cudaMemcpy( host4, dev4, size, D2H ) ;
...
From: the Streams and concurrency webinar
The calls are asynchronous by default, so as soon as a kernel is launched on the GPU, the CPU moves on to the instructions that follow.
To force synchronization you have to use cudaDeviceSynchronize(), or any memory transfer via cudaMemcpy, which forces synchronization by itself.
Source: the CUDA FAQ.
Q: Can the CPU and GPU run in parallel?
Kernel invocation in CUDA is asynchronous, so the driver will return control to the application as soon as it has launched the kernel.
The "cudaThreadSynchronize()" API call should be used when measuring
performance to ensure that all device operations have completed before
stopping the timer.
CUDA functions that perform memory copies and that control graphics
interoperability are synchronous, and implicitly wait for all kernels
to complete.
By the way, if you don't need to synchronize between kernels, they can be executed concurrently if your GPU has the required compute capability (CC):
Q: Is it possible to execute multiple kernels at the same time?
Yes. GPUs of compute capability 2.x or higher support concurrent kernel execution and launches.
(also quoted from the CUDA FAQ).
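To make the original question concrete, here is a minimal sketch of two unrelated __global__ functions defined in the same .cu file and launched back to back (hypothetical kernels and sizes, error checking omitted):

#include <cuda_runtime.h>

__global__ void addOne(int *v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] += 1;
}

__global__ void doubleIt(int *v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= 2;
}

int main()
{
    const int n = 1 << 20;
    int *d_v;
    cudaMalloc(&d_v, n * sizeof(int));
    cudaMemset(d_v, 0, n * sizeof(int));

    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);
    addOne<<<grid, block>>>(d_v, n);    // launch returns immediately to the CPU
    doubleIt<<<grid, block>>>(d_v, n);  // runs after addOne on the same (default) stream

    cudaDeviceSynchronize();            // wait for both kernels to finish
    cudaFree(d_v);
    return 0;
}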

cuBLAS dsyrk slower than dgemm

I am trying to compute C = A*A' on the GPU using cuBLAS and am finding that the rank-k update cublasDsyrk is running about 5x slower than the general matrix-matrix multiplication routine cublasDgemm.
This is surprising to me; I thought syrk would be faster since it is a more specialized piece of code. Is that an unreasonable expectation? Am I doing this wrong?
Timing the code
Ultimately I'm writing CUDA code to be compiled into MEX files for MATLAB, so apologies for not providing a complete working example (there would be a lot of extraneous code for wrangling with the MATLAB objects).
I know this is probably not the best way, but I'm using clock() to time how long the code takes to run:
// Start of main function
clock_t tic = clock();
clock_t toc;
/* ---- snip ---- */
cudaDeviceSynchronize();
toc = clock();
printf("%8d (%7.3f ms) Allocated memory on GPU for output matrix\n",
toc-tic,1000*(double)(toc-tic)/CLOCKS_PER_SEC);
// Compute the upper triangle of C = alpha*A*A' + beta*C
stat = cublasDsyrk(handle, CUBLAS_FILL_MODE_UPPER, CUBLAS_OP_N,
M, N, &alpha, A, M, &beta, C, M);
toc = clock();
printf("%8d (%7.3f ms) cublasDsyrk launched\n",
toc-tic,1000*(double)(toc-tic)/CLOCKS_PER_SEC);
cudaDeviceSynchronize();
toc = clock();
printf("%8d (%7.3f ms) cublasDsyrk completed\n",
toc-tic,1000*(double)(toc-tic)/CLOCKS_PER_SEC);
/* ----- snip ----- */
Runtimes
The output, running on a [12 x 500,000] random matrix (column-major storage):
911 ( 0.911 ms) Loaded inputs, initialized cuBLAS context
1111 ( 1.111 ms) Allocated memory on GPU for output matrix
1352 ( 1.352 ms) cublasDsyrk launched
85269 ( 85.269 ms) cublasDsyrk completed
85374 ( 85.374 ms) Launched fillLowerTriangle kernel
85399 ( 85.399 ms) kernel completed
85721 ( 85.721 ms) Finished and cleaned up
After replacing the syrk call with
stat = cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T, M, M, N,
&alpha, A, M, A, M, &beta, C, M);
the whole thing runs way faster:
664 ( 0.664 ms) Loaded inputs, initialized cuBLAS context
796 ( 0.796 ms) Allocated memory on GPU for output matrix
941 ( 0.941 ms) cublasDgemm launched
16787 ( 16.787 ms) cublasDgemm completed
16837 ( 16.837 ms) Launched fillLowerTriangle kernel
16859 ( 16.859 ms) kernel completed
17263 ( 17.263 ms) Finished and cleaned up
I tried it with a few matrices of other sizes; interestingly it seems that the speed difference is most pronounced when the matrix has few rows. At 100 rows, gemm is only 2x faster, and at 1000 rows it's slightly slower (which is what I would have expected all along).
Other details
I'm using CUDA Toolkit 7.5 and the GPU device is an NVIDIA GRID K520 (Kepler, compute capability 3.0). I'm running on an Amazon EC2 g2.2xlarge instance.
[n x 500,000] for n = 12, 100, 1000 are all very wide matrices. In these corner cases, gemm() and syrk() may not be able to reach their peak performance, where syrk() would ideally be nearly twice as fast as gemm() (since the result matrix is symmetric, half of the computation can be saved).
Another consideration is that CUDA gemm()/syrk() usually divides the matrix into fixed-size sub-matrices as the basic computing unit to achieve high performance. The sub-matrix can be as large as 32x64 for dgemm(), as shown in the following link.
http://www.netlib.org/lapack/lawnspdf/lawn267.pdf
The performance usually drops a lot if your size (12 or 100) is neither much larger than the sub-matrix nor a multiple of it.
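As a side note on the timing method: instead of clock() plus cudaDeviceSynchronize(), CUDA events are a common way to time device work. A minimal sketch, reusing the handle, M, N, alpha, beta, A and C from the question's snippet (error checking omitted):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);                        // enqueue start marker
cublasDsyrk(handle, CUBLAS_FILL_MODE_UPPER, CUBLAS_OP_N,
            M, N, &alpha, A, M, &beta, C, M);  // work to be timed
cudaEventRecord(stop);                         // enqueue stop marker

cudaEventSynchronize(stop);                    // wait until the recorded work has finished
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);        // elapsed GPU time in milliseconds
printf("cublasDsyrk took %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);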

cuFFT of a matrix as a 1D transformation of rows or columns

I could not find an example of using cuFFT in which the transform of a matrix is performed as 1D transforms of its rows and columns.
I have a 2048x2048 array (stored as 1D cuComplex data). With the 2D transform there is no problem. But what I need now is to do the transform along x, do some work on it, take the inverse FFT, then do the transform along y, do some other work on it, and then take its inverse transform.
What exactly would the sequence of commands look like if I want to use parallel processing? Should I use cufftPlanMany? How? Or, perhaps, is there an example somewhere that I was not able to find?
In the cuFFT Library User's Guide, on page 3, there is an example of how to compute a number BATCH of one-dimensional DFTs of size NX. Using cufftPlan1d(&plan, NX, CUFFT_C2C, BATCH);, cufftExecC2C will then perform BATCH 1D FFTs of size NX. To achieve that, you have to arrange your data in a complex array of length BATCH*NX. In your case, for the transform along x, it would be BATCH=2048 and NX=2048. For the transforms along y, you have to transpose the matrix arising from the previous calculations.
Your code will look like the following:
#define NX 2048
#define NY 2048

int main() {
    cufftHandle plan;
    cufftComplex *data;
    ...
    cudaMalloc((void**)&data, sizeof(cufftComplex)*NX*NY);
    cufftPlan1d(&plan, NX, CUFFT_C2C, NY);
    ...
    cufftExecC2C(plan, data, data, CUFFT_FORWARD);
    ...
    // do some work
    ...
    // make a transposition
    ...
    cufftDestroy(plan);                    // release the x-direction plan before reusing the handle
    cufftPlan1d(&plan, NY, CUFFT_C2C, NX);
    ...
    cufftExecC2C(plan, data, data, CUFFT_FORWARD);
    ...
}
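The transposition step is not provided by cuFFT itself. A minimal sketch of a straightforward (non-optimized) complex transpose kernel, assuming a second device buffer (here called data_t) of the same size:

// Naive out-of-place transpose: out[x*ny + y] = in[y*nx + x].
__global__ void transpose(const cufftComplex *in, cufftComplex *out, int nx, int ny)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column index in the input
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row index in the input
    if (x < nx && y < ny)
        out[x * ny + y] = in[y * nx + x];
}

// Launch, e.g.:
// dim3 block(16, 16), grid((NX + 15) / 16, (NY + 15) / 16);
// transpose<<<grid, block>>>(data, data_t, NX, NY);

A tiled shared-memory transpose is usually faster, but the idea is the same.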

memset in cuda that allows to set values within kernel

I am making several cudaMemset calls in order to set my values to 0, as below:
void allocateByte(char **gStoreR, const int byte){
    char **cStoreR = (char **)malloc(N * sizeof(char*));
    for (int i = 0; i < N; i++){
        char *c;
        cudaMalloc((void**)&c, byte*sizeof(char));
        cudaMemset(c, 0, byte);
        cStoreR[i] = c;
    }
    cudaMemcpy(gStoreR, cStoreR, N * sizeof(char *), cudaMemcpyHostToDevice);
}
However, this is proving to be very slow. Is there a memset function on the GPU, since calling it from the CPU takes a lot of time? Also, does cudaMalloc((void**)&c, byte*sizeof(char)) automatically set the bits that c points to to 0?
Every cudaMemset call launches a kernel, so if N is large and byte is small, then you will have a lot of kernel launch overhead slowing down the code. There is no device side memset, so the solution would be to write a kernel which traverses the allocations and zeros the storage in a single launch.
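A minimal sketch of such a kernel for the pointer-array layout in the question (hypothetical kernel name; one thread block per allocation, each block striding over its buffer):

// Zero N separately allocated buffers of `bytes` bytes each in a single launch.
// d_ptrs is the device-resident array of device pointers (gStoreR in the question).
__global__ void zeroAll(char **d_ptrs, int bytes)
{
    char *p = d_ptrs[blockIdx.x];                 // one block per allocation
    for (int i = threadIdx.x; i < bytes; i += blockDim.x)
        p[i] = 0;
}

// Launch, e.g.: zeroAll<<<N, 256>>>(gStoreR, byte);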
As an aside, I would strongly recommend against using an array of pointers to separately allocated arrays in CUDA. It is a lot slower and much more complex to manage than achieving the same outcome using a single large block of linear memory and indexing into that memory. In your example, it would reduce the code to a single cudaMalloc call and a single cudaMemset call. On the device side, pointer indirection, which is slow, gets replaced by a few integer operations, which are very fast. If your source material on the host is an array of structures, I would recommend using something like the excellent thrust::zip_iterator to get the data into a GPU-friendly form on the device.
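For comparison, a sketch of the flat layout this recommends: one allocation of N*byte bytes, with element (i, j) located by index arithmetic instead of pointer indirection:

char *gStore = NULL;
cudaMalloc((void**)&gStore, (size_t)N * byte);   // one contiguous block for all N "rows"
cudaMemset(gStore, 0, (size_t)N * byte);         // one memset zeroes everything

// On the device, row i starts at gStore + i * byte,
// so what was cStoreR[i][j] becomes gStore[i * byte + j].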