2D array FFT - ios Accelerate performance gains nullified by API limitations

2D array FFT - ios Accelerate performance gains nullified by API limitations - fft

The aim is to do correlation/convolutions(flip) of two 2D arrays using ios Accelerate framework for gaining speed.
My first attempt was with vImageConvolve_PlanarF/vdsp_imgfir which was good for lower sized arrays. But as array size increased, performance dropped drastically as it was an O(n2) implementation as mentioned by Accelerate developers themselves here(1).
I moved to FFT implementation(2) for reducing complexity to O(nlog2n). Using vDSP_fft2d_zip, gained speed to an extent. But using vDSP_fft2d_zip on 2D arrays at non powers of 2, we need to pad zeros. For e.g. on a 2D array of size 640 * 480, we need to pad zeros to make it 1024 * 512. Other FFT implementations like FFTW or OpenCV's DFT allow sizes which could be expressed as size = 2p * 3p * 5r. That allows, FFTW/OpenCV to do FFT of 640 * 480 2D array at the same size.
So for 2D arrays at size 640*480, in an Accelerate vs FFTW/OpenCV comparison, it is effectively between 1024*512 and 640*480 dimensions. So what ever performance gains I get from Accelerate's FFT on 2D arrays is nullified by it's inability to performs FFT at reasonable dimensions like size = 2p * 3p * 5r
2 Queries.
Am I missing any Accelerate functionality to perform this easily ? For e.g. any Accelerate function which could perform 2D array FFT at size = 2p * 3p * 5r. I assume vDSP_DFT_Execute performs only 1D FFT.
Better approaches to 2D FFT or correlation. Like in this answer(3), which asks to split arrays like 480 = 256 + 128 + 64 + 32 with repeated 1D FFT's over rows and then over columns. But this will need too many functions calls and I assume, will not help with performance.
Of lesser importance: I am doing correlation in tiles as one of the 2D arrays is far bigger then another. Say like 1920*1024 vs 150*100.

Linear convolution or correlation requires zero padding anyway, otherwise the result will be circular convolution or correlation.
1d iOS vDSP/Accelerate FFTs do allow N to be the product of small primes, not just 2^M. Not sure about 2d, but one can build a 2d FFT out of a 1d FFT.

Related

2D FFT using 1D FFT on 1D vector

Based on this post i can apply 1DFFT on each row and then each column of a 2D vector.
If i have a 1D vector i can view it as a 2D using cells like this v[rowIndex * columnCount + columnIndex]
My FFT1D algorithm is padding 0's to a vector until the next power of 2 position.
So if i am using it in this case on my 1D vector viewed as 2D, will it not add 0's where it is not supposed to, in between values, thus messing my result of the FFT2D?

The actual algorithm (I am quoting from the post you referenced in your question) is:
1D FFT on each row (real to complex)
1D FFT on each column resulting from (1) (complex to complex)
The zero-padding might cause problems only if you modify the original 2D matrix while you are working on it, which doesn’t appear to be necessary.

DM Script, why does the fourier transform of gaussian-kenel needs modulus

Recently I learn DM_Script for TEM image processing
I needed Gaussian blur process and I found one whose name is 'Gaussian Blur' in http://www.dmscripting.com/recent_updates.html
This code implements Gaussian blur algorithm by multiplying the fast fourier transform(FFT) of source image by the FFT of Gaussian-kernel image and finally doing inverse fourier transform of it.
Here is the part of the code,
// Carry out the convolution in Fourier space
compleximage fftkernelimg:=realFFT(kernelimg) (-> FFT of Gaussian-kernel image)
compleximage FFTSource:=realfft(warpimg) (-> FFT of source image)
compleximage FFTProduct:=FFTSource*fftkernelimg.modulus().sqrt()
realimage invFFT:=realIFFT(FFTProduct)
The point I want to ask is this
compleximage FFTProduct:=FFTSource*fftkernelimg.modulus().sqrt()
Why does the FFT of Gaussian-kernel need '.modulus().sqrt()' for the convolution?
It is related to the fact that the fourier transform of a Gaussian function becomes another Gaussian function?
Or It is related to a sort of limitation of discrete fourier transform?
Please answer me
Thanks

This is related to the general precision limitation of any floating point numeric computing. (see f.e. here, or more in depth here)
A rotational (real-valued) Gaussian of stand.dev. sigma should be transformed into a 100% real-values rotational Gaussioan of 1/sigma. However, doing this numerically will show you deviations: Just try the following:
number sigma = 30
number A0 = 1
realimage first := RealImage( "First", 8, 256, 256 )
first = A0 * exp( - (iradius**2/(2*sigma*sigma) ))
first.showimage()
complexImage second := FFT(first)
second.Showimage()
image nonZeroImaginaryMask = ( 0 != second.Imaginary() )
nonZeroImaginaryMask.Showimage()
nonZeroImaginaryMask.SetLimits(0,1)
When you then multiply these complex images (before back-transferring) you are introducing even more errors. By using modulus, one ensures that the forward transformed kernel is purely real and hence a better "damping" curve.
A better implementation of a FFT filtering code would actually create the FFT(Gaussian) directly with a std.dev of 1/sigma, as this is the analytically correct result. Doing a FFT of the kernel only makes sense if the kernel (or its FFT) is not analytically known.
In general: When implementing any "maths" into a program code, it can pay hugely to think it through with numerical computation limits in the back of your head. Reduce actual computation whenever possible (i.e. compute analytically and use the result instead of relying on brute force numerical computation) and try to "reshape" equations when possible, f.e. avoid large sums over many small numbers, be careful about checks against exact numeric values, try to avoid expressions which are very sensitive on small numerica errors etc.

STFT Clarification (FFT for real-time input)

I get how the DFT via correlation works, and use that as a basis for understanding the results of the FFT. If I have a discrete signal that was sampled at 44.1kHz, then that means if I were to take 1s of data, I would have 44,100 samples. In order to run the FFT on that, I would have to have an array of 44,100 and a DFT with N=44,100 in order to get the resolution necessary to detect a frequencies up to 22kHz, right? (Because the FFT can only correlate the input with sinusoidal components up to a frequency of N/2)
That's obviously a lot of data points and calculation time, and I have read that this is where the Short-time FT (STFT) comes in. If I then take the first 1024 samples (~23ms) and run the FFT on that, then take an overlapping 1024 samples, I can get the continuous frequency domain of the signal every 23ms. Then how do I interpret the output? If the output of the FFT on static data is N/2 data points with fs/(N/2) bandwidth, what is the bandwidth of the STFT's frequency output?
Here's an example that I ran in Mathematica:
100Hz sine wave at 44.1kHz sample rate:
Then I run the FFT on only the first 1024 points:
The frequency of interest is then at data point 3, which should somehow correspond to 100Hz. I think 44100/1024 = 43 is something like a scaling factor, which means that a signal with 1Hz in this little window will then correspond to a signal of 43Hz in the full data array. However, this would give me an output of 43Hz*3 = 129Hz. Is my logic correct but not my implementation?

As I have already stated in my earlier comments, the variable N affects the resolution achievable by the output frequency spectrum and not the range of frequencies you can detect.A larger N gives you a higher resolution at the expense of higher computation time and a lower N gives you lower computation time but can cause spectral leakage, which is the effect you have seen in your last figure.
As for your other question, well, theoretically the bandwidth of an FFT is infinite but we band-limit our result to the band of frequencies in the range [-fs/2 to fs/2] because all frequencies outside that band are susceptible to aliasing and are therefore of no use.Furthermore, if the input signal is real (which is true in most cases including ours) then the frequencies from [-fs/2 to 0] are just a reflection of the frequencies from [0 to fs/2] and so some FFT procedures just output the FFT spectrum from [0 to fs/2], which I think applies to your case.This means that the N/2 data points that you received as output represent the frequencies in the range [0 to fs/2] so that is the bandwidth you are working with in the case of the FFT and also in the case of the STFT (the STFT is just a series of FFT's, each FFT in a STFT will give you a spectrum with data points in this band).
I would also like to point out that the STFT will most likely not reduce your computation time if your input is a varying signal such as music because in that case you will need to take perform it several times over the duration of the song for it to be of any use, it will however enable you to understand the frequency characteristics of your song much better that you would do if you just performed one FFT.
To visualise the results of an FFT you use frequency (and/or phase) spectrum plots but in order to visualise the results of an STFT you will most probably need to create a spectrogram which is basically a graph can is made by just basically putting the individual FFT spectrums side by side.The process of creating a spectrogram can be seen in the figure below (Source: Dan Ellis - Introduction to Speech Processing).The spectrogram will show you how your signal's frequency characteristics change over time and how you interpret it will depend on what specific features you are looking to extract/detect from the audio.You might want to look at the spectrogram wikipedia page for more information.

zero padded FFT using FFTW

To interpolate a signal in frequency domain, one can pad zeros in time domain and do an FFT.
Suppose the number of elements in a given vector X is N and Y is the same as X but padded one sided with N zeros. Then the following give the same result.
$$\hat{x}(k)=\sum_{n=0}^{2N-1} Y(n)e^{i2\pi k n/2N},\quad k=0,...,2N-1,$$
$$\hat{x}(k)=\sum_{n=0}^{ N-1} X(n)e^{i2\pi k n/2N},\quad k=0,...,2N-1.$$
Now if we use FFTW package, the first equation needs 2N memory space for the input vector while the second one needs only N memory space (I do not know if it is even possible to do in the existing FFTW package)! Also the computational complexity lowers from 2N^2log(2N) to 2N^2log(N). The problem is worse whenever we do a 2D FFT or 3D FFT. Is it possible to do the second approach using FFTW package? This is fairly easy to do in MATLAB though.

If x is a 2N signal padded with zeros above N , its DFT writes :
If k is even :
Hence, the coefficients of even frequencies arise from the N-point discrete Fourier transform of x(n).
if k is odd :
Hence, the coefficients of odd frequencies arise from the N-point discrete Fourier transform of x(n)exp(i*M_PI*n/N).
Thus, the discrete Fourier transform of a zero-padded 2N signal resumes to two DFT of signals of length N and fftw can be used to compute them.
The overall computation time will be 2*c*N*ln(N), where c is a constant. It is expected to be faster than the direct computation of the DFT c*2*N*ln(2*N). Remember that ln(2*N)=ln(2)+ln(N) : as N gets large, the extra work in case of direct computation is negligible compared to ln(N) : the trick becomes useless, even if the dimension is larger than one. It does not affect complexity.
Moreover, FFTW is really efficient, using lots of features of your PC if it is correctly installed, and it will be hard to do better than this in any case, even if the presented trick is used. Finally, if the input signal is real, you may use fftw_plan fftw_plan_dft_r2c_2d : only half the coefficients in the Fourier space are computed and stored.
Regarding memory requirements, if you are really short of memory, you can use the FFTW_IN_PLACE flag and use the same array for input and output. Yet, it is slightly slower.
The procedure presented above can be extended to compute the DFT of a LN signal of a N-point signal padded with (L-1)N zeros : it resumes to the computation of L DFTs of length N.
Do you have any reference showing how MATLAB handles and optimizes the DFT of padded signals compared to FFTW ?
EDIT : Further research about the 3D case :
The 3D DFT of a padded 3D signal x(n,m,p) is :
If k_n, k_m and k_p are even :
If k_n and k_m are even and k_p is odd :
...There are 8 cases.
So, the computation of the 3d dft of a 3D x of size NxNxN padded to 2Nx2Nx2N resumes to the computation of 8 3d dft of size NxNxN. Size a 3d dft is a combination of 3 1d dft, the total number of dft of size N is 3x8xNxN while the direct computation requires 3x(2N)*(2N) dft of size 2N. Computational time is 24cN^3ln(N) against 24cN^3ln(2N) : a small gain is possible...Again fftw is fast...
Yet, instead of using a black-box 3d fft, let's compute the 8 dfts of size N at once, by performing the 1d dfts in each direction.
1d dft along N : 2 cases, NxN dfts => 2cN^3ln(N)
1d dft along M : 2 cases, 2NxN dfts => 4cN^3ln(N)
1d dft along P : 2 cases, 2Nx2N dfts => 8cN^3ln(N)
Hence, the total computation time is expected to be 14cN^3ln(N) against 24cN^3ln(2N) : a small gain is possible...Again fftw is fast...
Moreover, the computation of
requires only a single call to exp : first compute w=exp(I*M_PI/N) then update wn=wn*w; x(n)=x(n)*wn or use pow if precision becomes an issue.

Best approach for convolution of multiple small matrices using CUDA

I need to preform multiple convolutions with small matrices and kernels, and I was hoping that utilizing the many processors of the GPU would enable me to it as fast as possible.
The problem is as follows: I have many matrices (~1,000 to ~10,000) or relatively small sizes (~15x15 down to 1x1 - as in scalar), and a certain number of convolution masks (~20 to 1). I need to convolve all the matrices with each convolution mask
example:
A; %5,000 matrices of size 10x10, A(i) = a 10x10 matrix
B; 10 matrices of size 5x5, B(k) = a 5x5 matrix
res(j)=conv(A,B(1)); %res(j) is the result of convolving all 5,000
%matrices in A by the j'th kernel B(j)
the goal is computing res(1),...,res(10) as quickly as possible
I would like to hear suggestions about how to implement the most efficient algorithm.
FFT based convolution would probably be too slow.
Every implementation I've seen so far is for 2d convolution, meant to convolve 2 large matrices, while I need to convolve many small matrices.
I know very little about CUDA programming right now, but I'm in the process of learning.
I was hoping to figure this out myself, but due to time constraints, I am forced to ask for any advice anyone with experience can give me, while I learn how to code in CUDA.
Thank you!
p.s. any pointers to an implementation that suits my purposes is more than appreciated. I am a university students, and this is for a small research project, so nothing I need to pay for please...

I do not pretend to give an ultimate answer to your question, but I would just like to point out a couple of things:
As you mentioned, a first possibility would be to use the FFT approach. A problem on this line is that (correct me if I'm wrong) the cuFFT library is primarily designed to cope with large matrices, so to fruitfully benefit from this approach would be developing FFT routines efficient for small matrices. I just want to indicate that there are some algorithms of this kind, please see for example the paper: Small Discrete Fourier Transforms on GPUs. I have no direct experience with the performance of CUDA FFTs on small matrices of the indicated type, but perhaps it could be interesting for you since the mask matrices are in a low number (10) and so you can "recycle" their FFTs for a large number of convolutions (5000).
If you decide not to use the FFT approach, then, if you have a GPU architecture with compute capability >=3.5, then dynamic parallelism could be a good candidate to calculate convolutions. If you regard the evaluation of each convolution matrix element as an interpolation, then you will have interpolation problems of size 15x15 and dynamic parallelism could help, see the post: Benefit of splitting a big CUDA kernel and using dynamic parallelism

One approach is to use ArrayFire's GFOR loop, which I work on.
You can tile as many small convolutions into one big kernel launch as you want, as long as you don't run out of GPU memory, as follows:
array x = randu(5); // the input
array y = randu(m,5); // the output
array f = constant(1,3); // the kernel
gfor (array k, 0, m-1) {
y(span,k) = convolve(x,f);
}
Good luck!

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008