MKL FFTW is slower than FFTPACK for small data sizes

I wrote a matrix-computation C++ library 20 years ago, and I'm looking to boost its performance using the Intel MKL library.
For complex-valued vectors/matrices, my library uses two split arrays: one for the real part and one for the imaginary part.
Here are the timing results:
N=65536, fftw time = 0.005(s), fftpack time = 0.001(s)
N=100000, fftw time = 0.005(s), fftpack time = 0.003(s)
N=131072, fftw time = 0.006(s), fftpack time = 0.004(s)
N=250000, fftw time = 0.013(s), fftpack time = 0.007(s)
N=262144, fftw time = 0.012(s), fftpack time = 0.008(s)
N=524288, fftw time = 0.022(s), fftpack time = 0.018(s)
N=750000, fftw time = 0.037(s), fftpack time = 0.025(s)
N=1048576, fftw time = 0.063(s), fftpack time = 0.059(s)
N=1500000, fftw time = 0.114(s), fftpack time = 0.079(s)
N=2097152, fftw time = 0.126(s), fftpack time = 0.146(s)
N=4194304, fftw time = 0.241(s), fftpack time = 0.35(s)
N=8388608, fftw time = 0.433(s), fftpack time = 0.788(s)
For vectors with fewer than about 1,500,000 double values, FFTPACK is faster than FFTW.
Here is the code I use:
Matrix X=randn(M,1); //input vector
//start timer
Matrix Y = MyFFTW(X);
// measure time
//function to compute the FFT
Matrix MyFFTW(Matrix X)
{
    int M = X.rows();
    int N = X.cols();
    Matrix Y(T_COMPLEX, M, N); // complex output that stores the FFT results
    // The input may also be a matrix; the FFT is computed column by column
    double* in_data = (double*)fftw_malloc(sizeof(double) * M);
    fftw_complex* out_data = (fftw_complex*)fftw_malloc(sizeof(fftw_complex) * (M / 2 + 1));
    fftw_plan fftplan = fftw_plan_dft_r2c_1d(M, in_data, out_data, FFTW_ESTIMATE);
    // one plan is reused for all the matrix columns
    for (int i = 1; i <= N; i++)
    {
        // copy column i into in_data used by the plan; array indexing is 1-based, like MATLAB
        memcpy(in_data, X.pr(1, i), M * sizeof(double));
        fftw_execute(fftplan);
        // unpack the r2c half-spectrum in out_data into full-length real and imaginary parts
        double* pr = Y.pr(1, i), * pi = Y.pi(1, i);
        int k = (M - 1) / 2, j;
        for (j = 0; j <= k; j++)
        {
            *pr++ = out_data[j][0];
            *pi++ = out_data[j][1];
        }
        if (M % 2 == 0)
        {
            // Nyquist bin for even lengths
            *pr++ = out_data[M / 2][0];
            *pi++ = out_data[M / 2][1];
        }
        // mirrored upper half: the complex conjugate of the lower half
        for (j = k; j >= 1; j--)
        {
            *pr++ = out_data[j][0];
            *pi++ = -out_data[j][1];
        }
    }
    fftw_destroy_plan(fftplan);
    fftw_free(in_data);
    fftw_free(out_data);
    return Y;
}
Results are obtained on an Intel Core i7 @ 3.2 GHz, using Visual Studio 2019 as the compiler and the latest Intel MKL library.
Compiler flags are:
/fp:fast /DWIN32 /O2 /Ot /Oi /Oy /arch:AVX2 /openmp /MD
Linker libs are:
mkl_intel_c.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib
Is there a better way to make FFTW faster for vectors of small size?
Update:
I tested against MATLAB, which uses MKL FFTW for its fft computation:
N=65536, matlab fft time = 0.071233(s)
N=100000, matlab fft time = 0.011437(s)
N=131072, matlab fft time = 0.0074411(s)
N=250000, matlab fft time = 0.015349(s)
N=262144, matlab fft time = 0.0082545(s)
N=524288, matlab fft time = 0.011395(s)
N=750000, matlab fft time = 0.022364(s)
N=1048576, matlab fft time = 0.019683(s)
N=1500000, matlab fft time = 0.033493(s)
N=2097152, matlab fft time = 0.035345(s)
N=4194304, matlab fft time = 0.069539(s)
N=8388608, matlab fft time = 0.1387(s)
Except for the first call to fft with N=65536, MATLAB (64-bit) is faster than my function (Win32) both when it uses FFTPACK (for N > 500000) and when it uses MKL FFTW.
Thanks

Regarding FFTW: AFAIK, there are no specific performance tips from MKL that would help accelerate small cases. The overhead of using FFTW through MKL is actually pretty negligible.
Regarding your benchmark: I see you also measure the allocation/deallocation, the creation of the FFTW plan, and the memcpy operations. But the only routine in this benchmark that is optimized by MKL is fftw_execute.
That could be the problem with this pipeline.
You could enable MKL_VERBOSE mode to check the execution time of fftw_execute on its own.
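For example, a minimal sketch of such a measurement (not your exact harness): the plan and buffers are created outside the timed region and only fftw_execute is timed; running with the environment variable MKL_VERBOSE=1 should also make MKL print statistics for each FFT call.
#include <chrono>
#include <cstdio>
#include <fftw3.h>

// times only the execute step of a real-to-complex 1D transform of length M
void time_execute_only(int M, int reps)
{
    double*       in  = (double*)fftw_malloc(sizeof(double) * M);
    fftw_complex* out = (fftw_complex*)fftw_malloc(sizeof(fftw_complex) * (M / 2 + 1));
    for (int i = 0; i < M; ++i) in[i] = (double)i;   // dummy input data

    // plan creation (and any allocation) stays outside the timed region
    fftw_plan plan = fftw_plan_dft_r2c_1d(M, in, out, FFTW_ESTIMATE);

    auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r)
        fftw_execute(plan);                          // the only MKL-optimized call
    auto t1 = std::chrono::steady_clock::now();

    std::printf("N=%d, execute-only time = %.6f s per call\n", M,
                std::chrono::duration<double>(t1 - t0).count() / reps);

    fftw_destroy_plan(plan);
    fftw_free(in);
    fftw_free(out);
}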

Related

Negative Speed Gain Using Numba Vectorize target='cuda'

I am trying to test the effectiveness of the Python Numba module's @vectorize decorator for speeding up a code snippet relevant to my actual code. I'm using a code snippet provided in CUDAcast #10, available here and shown below:
import numpy as np
from timeit import default_timer as timer
from numba import vectorize

@vectorize(["float32(float32, float32)"], target='cpu')
def VectorAdd(a, b):
    return a + b

def main():
    N = 32000000
    A = np.ones(N, dtype=np.float32)
    B = np.ones(N, dtype=np.float32)
    C = np.zeros(N, dtype=np.float32)
    start = timer()
    C = VectorAdd(A, B)
    vectoradd_time = timer() - start
    print("C[:5] = " + str(C[:5]))
    print("C[-5:] = " + str(C[-5:]))
    print("VectorAdd took %f seconds" % vectoradd_time)

if __name__ == '__main__':
    main()
In the demo in the CUDAcast, the presenter gets a 100x speedup by sending the large array operation to the GPU via the @vectorize decorator. However, when I set the @vectorize target to the GPU:
@vectorize(["float32(float32, float32)"], target='cuda')
... the result is 3-4 times slower. With target='cpu' my runtime is 0.048 seconds; with target='cuda' my runtime is 0.22 seconds. I'm using a DELL Precision laptop with an Intel Core i7-4710MQ processor and an NVIDIA Quadro K2100M GPU. The output of nvprof (the NVIDIA profiler tool) indicates that the majority of the time is spent in memory handling (expected), but even the function evaluation takes longer on the GPU than the whole process did on the CPU. Obviously this isn't the result I was hoping for, but is it due to some error on my part, or is it reasonable given my hardware and code?
This question is also interesting for me.
I tried your code and got similar results.
To investigate this issue a bit, I wrote a CUDA kernel using cuda.jit and added it to your code:
import numpy as np
from timeit import default_timer as timer
from numba import vectorize, cuda

N = 16*50000  # 32000000
blockdim = 16, 1
griddim = int(N/blockdim[0]), 1

@cuda.jit("void(float32[:], float32[:])")
def VectorAdd_GPU(a, b):
    i = cuda.grid(1)
    if i < N:
        a[i] += b[i]

@vectorize("float32(float32, float32)", target='cpu')
def VectorAdd(a, b):
    return a + b

A = np.ones(N, dtype=np.float32)
B = np.ones(N, dtype=np.float32)
C = np.zeros(N, dtype=np.float32)

start = timer()
C = VectorAdd(A, B)
vectoradd_time = timer() - start
print("VectorAdd took %f seconds" % vectoradd_time)

start = timer()
d_A = cuda.to_device(A)
d_B = cuda.to_device(B)
VectorAdd_GPU[griddim, blockdim](d_A, d_B)
C = d_A.copy_to_host()
vectoradd_time = timer() - start
print("VectorAdd_GPU took %f seconds" % vectoradd_time)

print("C[:5] = " + str(C[:5]))
print("C[-5:] = " + str(C[-5:]))
In this 'benchmark' I also take into account the time for copying the arrays from host to device and from device to host. In this case the GPU function is slower than the CPU one.
For the case above:
CPU - 0.0033;
GPU - 0.0096;
Vectorize (target='cuda') - 0.15 (for my PC).
If the copying time is not counted:
GPU - 0.000245
So, here is what I have learned: (1) Copying from host to device and from device to host is time-consuming; that is obvious and well known. (2) I do not know the reason, but @vectorize can significantly slow down calculations on the GPU. (3) It is better to use self-written kernels (and, of course, to minimize memory copying).
By the way, I have also tested @cuda.jit by solving the heat-conduction equation with an explicit finite-difference scheme, and found that in that case the Python program's execution time is comparable to a C program's and gives about a 100x speedup. That is because, fortunately, in that case you can do many iterations without data exchange between host and device.
UPD. Used Software & Hardware: Win7 64bit, CPU: Intel Core2 Quad 3GHz, GPU: NVIDIA GeForce GTX 580.

Why is the result different between CPU and GPU?

This is my code running on the GPU:
tid  = threadidx%x
bid  = blockidx%x
bdim = blockdim%x
isec = mesh_sec_1(lev)+bid-1
if (isec .le. mesh_sec_0(lev)) then
    if(.not. sec_is_int(isec)) return
    do iele = tid, sec_n_ele(isec), bdim
        idx = n_ele_idx(isec)+iele
        u(1:5) = fv_u(1:5,idx)
        u(6)   = fv_t(idx)
        g = 0.0d0
        do j = sec_iA_ls(idx), sec_iA_ls(idx+1)-1
            ss = sec_jA_ls(1,j)
            ee = sec_jA_ls(2,j)
            tem = n_ele_idx(ss)+ee
            du(1:5) = fv_u(1:5,n_ele_idx(ss)+ee)-u(1:5)
            du(6)   = fv_t(n_ele_idx(ss)+ee)-u(6)
            coe(1:3) = sec_coe_ls(1:3,j)
            do k = 1,6
                g(1:3,k) = g(1:3,k)+du(k)*sec_coe_ls(1:3,j)
            end do
        end do
        do j = 1,6
            do i = 1,3
                fv_gra(i+(j-1)*3,idx) = g(i,j)
            end do
        end do
    end do
end if
And here is my code running on the CPU:
do isec = h_mesh_sec_1(lev), h_mesh_sec_0(lev)
    if(.not. h_sec_is_int(isec)) cycle
    do iele = 1, h_sec_n_ele(isec)
        idx = h_n_ele_idx(isec)+iele
        u(1:5) = h_fv_u(1:5,idx)
        u(6)   = h_fv_t(idx)
        g = 0.0d0
        do j = h_sec_iA_ls(idx), h_sec_iA_ls(idx+1)-1
            ss = h_sec_jA_ls(1,j)
            ee = h_sec_jA_ls(2,j)
            du(1:5) = h_fv_u(1:5,h_n_ele_idx(ss)+ee)-u(1:5)
            du(6)   = h_fv_t(h_n_ele_idx(ss)+ee)-u(6)
            do k = 1,6
                g(1:3,k) = g(1:3,k) + du(k)*h_sec_coe_ls(1:3,j)
            end do
        end do
        do j = 1,6
            do i = 1,3
                h_fv_gra(i+(j-1)*3,idx) = g(i,j)
            end do
        end do
    end do
end do
Variables prefixed with h_ belong to the CPU version; the unprefixed ones belong to the GPU version.
The results are the same at most points, but at some points they differ slightly. I added check code like this:
do i = 1, size(h_fv_gra,1)
    do j = 1, size(h_fv_gra,2)
        if(hd_fv_gra(i,j)-h_fv_gra(i,j) .ge. 1.0d-9) then
            print *, hd_fv_gra(i,j)-h_fv_gra(i,j), i, j
        end if
    end do
end do
The hd_* variables hold a copy of the GPU result. We can see the differences:
1.8626451492309570E-009 13 14306
1.8626451492309570E-009 13 14465
1.8626451492309570E-009 13 14472
1.8626451492309570E-009 14 14128
1.8626451492309570E-009 14 14146
1.8626451492309570E-009 14 14150
1.8626451492309570E-009 14 14153
1.8626451492309570E-009 14 14155
1.8626451492309570E-009 14 14156
So I am confused about that. The error in CUDA should not be as large as this. Any reply is welcome.
In addition, I don't know how to print variables in GPU code, which would help me debug.
In your code, the calculation of the g value most probably benefits from fused multiply-add (FMA) optimization in CUDA:
g(1:3,k)=g(1:3,k)+du(k)*sec_coe_ls(1:3,j)
On the CPU side, this is not impossible, but it strongly depends on compiler choices (and on whether the actual CPU running the code implements FMA).
To enforce the use of separate multiply and add operations, you want to use intrinsics from CUDA, as defined here, such as:
__device__ double __dadd_rn ( double x, double y ): adds two floating-point values in round-to-nearest-even mode,
and
__device__ double __dmul_rn ( double x, double y ): multiplies two floating-point values in round-to-nearest-even mode,
with a rounding mode identical to the one used on the CPU (which depends on the CPU architecture, be it Power, Intel x86, or other).
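For illustration, here is a minimal CUDA C sketch of that replacement (your kernel is CUDA Fortran, so this is not a drop-in change; the names g, du and coe simply mirror the Fortran variables):
__device__ void accumulate_no_fma(double g[3], double du, const double coe[3])
{
    for (int i = 0; i < 3; ++i) {
        // equivalent to g[i] = g[i] + du * coe[i], but the multiply and the add
        // are rounded separately, so the compiler cannot fuse them into an FMA
        g[i] = __dadd_rn(g[i], __dmul_rn(du, coe[i]));
    }
}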
An alternate approach is to pass the --fmad false option to ptxas when compiling CUDA code, using the --ptxas-options option of nvcc detailed here.
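For instance, a command line along these lines should do it (the file name is just a placeholder, and recent nvcc versions also accept --fmad=false directly):
nvcc --ptxas-options=--fmad=false -o mykernel mykernel.cu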

Using arrayfun to apply two arguments of a function on every combination

Let i = [1 2] and j = [3 5]. Now, in Octave:
arrayfun(@(x,y) x+y, i, j)
we get [4 7]. But I want to apply the function to every combination of the elements of i and j, to get [i(1)+j(1) i(1)+j(2) i(2)+j(1) i(2)+j(2)] = [4 6 5 7].
How do I accomplish this? I know I could use for-loops, but I want vectorized code because it's faster.
In Octave, for finding summations between two vectors, you can use a truly vectorized approach with broadcasting like so -
out = reshape(ii(:).' + jj(:),[],1)
Here's a runtime test on ideone for the input vectors of size 1 x 100 each -
-------------------- With FOR-LOOP
Elapsed time is 0.148444 seconds.
-------------------- With BROADCASTING
Elapsed time is 0.00038299 seconds.
If you want to keep it generic to accommodate operations other than just summations, you can use anonymous functions like so -
func1 = @(I,J) I+J;
out = reshape(func1(ii,jj.'),1,[])
In MATLAB, you could accomplish the same with two bsxfun alternatives as listed next.
I. bsxfun with Anonymous Function -
func1 = @(I,J) I+J;
out = reshape(bsxfun(func1,ii(:).',jj(:)),1,[]);
II. bsxfun with Built-in @plus -
out = reshape(bsxfun(@plus,ii(:).',jj(:)),1,[]);
With the input vectors of size 1 x 10000 each, the runtimes at my end were -
-------------------- With FOR-LOOP
Elapsed time is 1.193941 seconds.
-------------------- With BSXFUN ANONYMOUS
Elapsed time is 0.252825 seconds.
-------------------- With BSXFUN BUILTIN
Elapsed time is 0.215066 seconds.
First, your first example is not the best because the most efficient way to accomplish what you're doing with arrayfun would be to vectorize:
a = [1 2];
b = [3 5];
out = a+b
Second, in MATLAB at least, arrayfun is not necessarily faster than a simple for loop. arrayfun is mainly a convenience (especially for its more advanced options). Try this simple timing example yourself:
a = 1:1e5;
b = a+1;
y = arrayfun(@(x,y)x+y,a,b); % Warm up
tic
y = arrayfun(@(x,y)x+y,a,b);
toc

y = zeros(1,numel(a));
for k = 1:numel(a)
    y(k) = a(k)+b(k); % Warm up
end
tic
y = zeros(1,numel(a));
for k = 1:numel(a)
    y(k) = a(k)+b(k);
end
toc
In Matlab R2015a, the for loop method is over 70 times faster run from the Command window and over 260 times faster when run from an M-file function. Octave may be different, but you should experiment.
Finally, you can accomplish what you want using meshgrid:
a = [1 2];
b = [3 5];
[x,y] = meshgrid(a,b);
out = x(:).'+y(:).'
which returns [4 6 5 7] as in your question. You can also use ndgrid to get output in a different order.

Why is cuFFT so slow?

I'm hoping to accelerate a computer vision application that computes many FFTs using FFTW and OpenMP on an Intel CPU. However, for a variety of FFT problem sizes, I've found that cuFFT is slower than FFTW with OpenMP.
In the experiments and discussion below, I find that cuFFT is slower than FFTW for batched 2D FFTs. Why is cuFFT so slow, and is there anything I can do to make cuFFT run faster?
Experiments (code download)
Our computer vision application requires a forward FFT on a bunch of small planes of size 256x256. I'm running the FFTs on HOG features with a depth of 32, so I use the batch mode to do 32 FFTs per function call. Typically, I do about 8 FFT function calls of size 256x256 with a batch size of 32.
FFTW + OpenMP
The following code executes in 16.0ms on an Intel i7-2600 8-core CPU.
int depth = 32; int nRows = 256; int nCols = 256; int nIter = 8;
int n[2] = {nRows, nCols};
//if nCols is even, cols_padded = (nCols+2). if nCols is odd, cols_padded = (nCols+1)
int cols_padded = 2*(nCols/2 + 1); //allocate this width, but tell FFTW that it's nCols width
int inembed[2] = {nRows, 2*(nCols/2 + 1)};
int onembed[2] = {nRows, (nCols/2 + 1)}; //default -- equivalent to onembed=NULL
float* h_in = (float*)malloc(sizeof(float)*nRows*cols_padded*depth);
memset(h_in, 0, sizeof(float)*nRows*cols_padded*depth);
fftwf_complex* h_freq = reinterpret_cast<fftwf_complex*>(h_in); //in-place version
fftwf_plan forwardPlan = fftwf_plan_many_dft_r2c(2, //rank
n, //dims -- this doesn't include zero-padding
depth, //howmany
h_in, //in
inembed, //inembed
depth, //istride
1, //idist
h_freq, //out
onembed, //onembed
depth, //ostride
1, //odist
FFTW_PATIENT /*flags*/);
double start = read_timer();
#pragma omp parallel for
for(int i=0; i<nIter; i++){
    fftwf_execute_dft_r2c(forwardPlan, h_in, h_freq);
}
double responseTime = read_timer() - start;
printf("did %d FFT calls in %f ms \n", nIter, responseTime);
cuFFT
The following code executes in 21.7ms on a top-of-the-line NVIDIA K20 GPU. Note that, even if I use streams, cuFFT does not run multiple FFTs concurrently.
int depth = 32; int nRows = 256; int nCols = 256; int nIter = 8;
int n[2] = {nRows, nCols};
int cols_padded = 2*(nCols/2 + 1); //allocate this width, but tell FFTW that it's nCols width
int inembed[2] = {nRows, 2*(nCols/2 + 1)};
int onembed[2] = {nRows, (nCols/2 + 1)}; //default -- equivalent to onembed=NULL in FFTW
cufftHandle forwardPlan;
float* d_in; cufftComplex* d_freq;
CHECK_CUFFT(cufftPlanMany(&forwardPlan,
2, //rank
n, //dimensions = {nRows, nCols}
inembed, //inembed
depth, //istride
1, //idist
onembed, //onembed
depth, //ostride
1, //odist
CUFFT_R2C, //cufftType
depth /*batch*/));
CHECK_CUDART(cudaMalloc(&d_in, sizeof(float)*nRows*cols_padded*depth));
d_freq = reinterpret_cast<cufftComplex*>(d_in);
double start = read_timer();
for(int i=0; i<nIter; i++){
    CHECK_CUFFT(cufftExecR2C(forwardPlan, d_in, d_freq));
}
CHECK_CUDART(cudaDeviceSynchronize());
double responseTime = read_timer() - start;
printf("did %d FFT calls in %f ms \n", nIter, responseTime);
Other notes
In the GPU version, cudaMemcpys between the CPU and GPU are not included in my computation time.
The performance numbers presented here are averages of several experiments, where each experiment has 8 FFT function calls (total of 10 experiments, so 80 FFT function calls).
I've tried many problem sizes (e.g. 128x128, 256x256, 512x512, 1024x1024), all with depth=32. Based on the nvvp profiler, some sizes like 1024x1024 are able to fully saturate the GPU. But, for all of these sizes, the CPU FFTW+OpenMP is faster than cuFFT.
The question might be outdated, but here is a possible explanation for the slowness of cuFFT.
When structuring your data for cufftPlanMany, the data arrangement is not very GPU-friendly. Using an istride and ostride of 32 means no data read is coalesced. See here for details on the read pattern:
input[b * idist + (x * inembed[1] + y) * istride]
output[b * odist + (x * onembed[1] + y) * ostride]
in which case, if istride/ostride is 32, reads are very unlikely to be coalesced/optimal (b is the batch number). Here are the changes I applied:
CHECK_CUFFT(cufftPlanMany(&forwardPlan,
2, //rank
n, //dimensions = {nRows, nCols}
inembed, //inembed
1, // WAS: depth, //istride
nRows*cols_padded, // WAS: 1, //idist
onembed, //onembed
1, // WAS: depth, //ostride
nRows*cols_padded, // WAS:1, //odist
CUFFT_R2C, //cufftType
depth /*batch*/));
Running this, I got an unspecified launch failure due to an illegal memory access. You might want to change the memory allocation: cufftComplex is two floats, so you need a factor of 2 in your allocation size (it looks like a typo).
// WAS : CHECK_CUDART(cudaMalloc(&d_in, sizeof(float)*nRows*cols_padded*depth));
CHECK_CUDART(cudaMalloc(&d_in, sizeof(float)*nRows*cols_padded*depth*2));
Running it this way, I got an 8x performance improvement on my card.

How to optimize matrix multiplication using OpenACC?

I am learning OpenACC (with PGI's compiler) and trying to optimize the matrix multiplication example. The fastest implementation I have come up with so far is the following:
void matrix_mul(float *restrict r, float *a, float *b, int N, int accelerate){
    #pragma acc data copyin(a[0:N*N], b[0:N*N]) copyout(r[0:N*N]) if(accelerate)
    {
        #pragma acc region if(accelerate)
        {
            #pragma acc loop independent vector(32)
            for (int j = 0; j < N; j++)
            {
                #pragma acc loop independent vector(32)
                for (int i = 0; i < N; i++)
                {
                    float sum = 0;
                    for (int k = 0; k < N; k++) {
                        sum += a[i + k*N] * b[k + j*N];
                    }
                    r[i + j*N] = sum;
                }
            }
        }
    }
}
This results in thread blocks of size 32x32 threads and gives me the best performance so far.
Here are the benchmarks:
Matrix multiplication (1500x1500):
GPU: Geforce GT650 M, 64-bit Linux
Data sz : 1500
Unaccelerated:
matrix_mul() time : 5873.255333 msec
Accelerated:
matrix_mul() time : 420.414700 msec
Data size : 1750 x 1750
matrix_mul() time : 876.271200 msec
Data size : 2000 x 2000
matrix_mul() time : 1147.783400 msec
Data size : 2250 x 2250
matrix_mul() time : 1863.458100 msec
Data size : 2500 x 2500
matrix_mul() time : 2516.493200 msec
Unfortunately I realized that the generated CUDA code is quite primitive (e.g. it does not even use shared memory) and hence cannot compete with a hand-optimized CUDA program. As a reference implementation I took the ArrayFire library, with the following results:
Arrayfire 1500 x 1500 matrix mul
CUDA toolkit 4.2, driver 295.59
GPU0 GeForce GT 650M, 2048 MB, Compute 3.0 (single,double)
Memory Usage: 1932 MB free (2048 MB total)
af: 0.03166 seconds
Arrayfire 1750 x 1750 matrix mul
af: 0.05042 seconds
Arrayfire 2000 x 2000 matrix mul
af: 0.07493 seconds
Arrayfire 2250 x 2250 matrix mul
af: 0.10786 seconds
Arrayfire 2500 x 2500 matrix mul
af: 0.14795 seconds
I wonder if there are any suggestions on how to get better performance from OpenACC?
Perhaps my choice of directives is not right?
You're getting right at a 14x speedup, which is pretty good for PGI's compiler in my experience.
First off, are you compiling with -Minfo? That will give you a lot of feedback from the compiler regarding optimization choices.
You are using a 32x32 thread block, but in my experience 16x16 thread blocks tend to get better performance. If you omit the vector(32) clauses, what scheduling does the compiler choose?
Declaring a and b with restrict might let the compiler generate better code.
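For instance, here is a rough sketch of the same routine with restrict added to the read-only inputs and a 16x16 schedule. This is untested; whether it actually helps depends on your PGI version, so check the -Minfo feedback.
void matrix_mul(float *restrict r, const float *restrict a,
                const float *restrict b, int N, int accelerate){
    #pragma acc data copyin(a[0:N*N], b[0:N*N]) copyout(r[0:N*N]) if(accelerate)
    {
        #pragma acc region if(accelerate)
        {
            /* 16x16 thread blocks instead of 32x32 */
            #pragma acc loop independent vector(16)
            for (int j = 0; j < N; j++)
            {
                #pragma acc loop independent vector(16)
                for (int i = 0; i < N; i++)
                {
                    float sum = 0;
                    for (int k = 0; k < N; k++)
                        sum += a[i + k*N] * b[k + j*N];
                    r[i + j*N] = sum;
                }
            }
        }
    }
}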
Just by looking at your code, I'm not sure that shared memory would help performance. Shared memory only helps improve performance if your code can store and reuse values there instead of going to global memory. In this case you're not reusing any part of a or b after reading it.
It's also worth noting that I've had bad experiences with PGI's compiler when it comes to shared memory usage. It will sometimes do funny stuff and cache the wrong values (seems to mostly happen if you iterate a loop backward), generating wrong results. I actually have to compile my current application using the undocumented -ta=nvidia,nocache option to get it to work correctly, by bypassing shared memory usage altogether.