2D FFT using 1D FFT on 1D vector - fft

Based on this post, I can apply a 1D FFT to each row and then to each column of a 2D vector.
If I have a 1D vector, I can view it as 2D by addressing cells like this: v[rowIndex * columnCount + columnIndex]
My FFT1D algorithm pads a vector with 0's up to the next power-of-2 length.
So if I use it on my 1D vector viewed as 2D, will it not add 0's where it is not supposed to, in between values, and so corrupt the result of the FFT2D?

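The actual algorithm (I am quoting from the post you referenced in your question) is:
1. 1D FFT on each row (real to complex)
2. 1D FFT on each column resulting from (1) (complex to complex)
The zero-padding might cause problems only if you modify the original 2D matrix while you are working on it, which doesn't appear to be necessary. If each row (and later each column) is copied into a separate buffer before it is transformed, any padding goes at the end of that buffer and never lands between values of the flat vector.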
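The row-then-column procedure on a flat vector can be sketched as below. This is a minimal illustration, not the poster's FFT1D: `fft1d` and `fft2d_flat` are hypothetical names, the 1D transform is a plain recursive radix-2 FFT, and both dimensions are assumed to already be powers of two, so no padding happens at all. Each row and each column is sliced out into its own list before it is transformed, which is why nothing can be inserted between values of the flat vector.

```python
import cmath

def fft1d(x):
    """Recursive radix-2 FFT. Assumes len(x) is a power of two (no padding here)."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft1d(x[0::2])
    odd = fft1d(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def fft2d_flat(v, rows, cols):
    """2D FFT of a row-major flat vector addressed as v[r * cols + c]."""
    out = [complex(a) for a in v]
    # Step 1: transform every row (a contiguous slice of the flat vector).
    for r in range(rows):
        out[r * cols:(r + 1) * cols] = fft1d(out[r * cols:(r + 1) * cols])
    # Step 2: transform every column (a strided slice of the flat vector).
    for c in range(cols):
        out[c::cols] = fft1d(out[c::cols])
    return out

# A 2 x 4 matrix stored as a flat vector.
F = fft2d_flat([1, 2, 3, 4, 5, 6, 7, 8], rows=2, cols=4)
```

If the 1D routine does pad, the padded row or column has to be written into a larger output grid; it cannot be written back over the original `rows * cols` storage.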

Related

Flatten data matrix for better deep learning?

I am implementing Algorithm 1 from this paper. The core of the algorithm is a deep learning optimization procedure (the loss is not important). Each of my data points is a matrix of dimension 2 by N, with n such data points. I have two options for the architecture of the Neural Network:
Make the NN a function from R^2, i.e. 2 input neurons. When I apply it to a 2 by N matrix, I simply apply it to each individual column of the matrix. This makes sense from the physical point of view underlying the problem (each data point is a collection of N interacting particles in 2-dimensional space).
Make the NN a function from R^{2 x N}, i.e. 2 x N input neurons. When I apply it to a 2 by N matrix, I flatten the matrix, and apply the NN to the resulting vector. This makes sense from the deep learning point of view, since the data is "really" 2xN-dimensional.
I have implemented the former architecture. But is the second architecture more "correct"? Can I expect better accuracy from the second approach? Mathematically speaking it can approximate a strictly larger family of functions (functions with interdependence of columns of the 2xN input matrix).
I know that in image classification an image matrix gets flattened and fed into the NN as a vector. This suggests using the latter approach over the former. However, images lack the physical structure that my problem possesses: the columns of each 2xN matrix are positions of N physical particles in 2-dimensional space.
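The "strictly larger family" claim can be made concrete for a single linear layer. The sketch below is only an illustration under that simplification (the question's network is nonlinear, and all function names here are hypothetical): applying one weight matrix W to every column gives the same numbers as a flattened layer whose big weight matrix is block-diagonal with W repeated N times. The flattened layer is free to fill in the off-diagonal blocks as well, which is exactly the column interdependence the per-column architecture cannot express.

```python
def per_column(W, X):
    """Apply the same linear map W (h x 2) to every column of X (2 x N)."""
    n = len(X[0])
    return [[sum(W[i][d] * X[d][c] for d in range(2)) for c in range(n)]
            for i in range(len(W))]

def block_diagonal(W, n):
    """(h*n) x (2*n) matrix with W repeated n times on the diagonal."""
    h = len(W)
    big = [[0.0] * (2 * n) for _ in range(h * n)]
    for c in range(n):
        for i in range(h):
            for d in range(2):
                big[c * h + i][2 * c + d] = W[i][d]
    return big

def flattened(W_big, X):
    """Apply one linear map to the column-major flattening of X."""
    n = len(X[0])
    flat = [X[d][c] for c in range(n) for d in range(2)]
    return [sum(w * v for w, v in zip(row, flat)) for row in W_big]

W = [[1.0, 2.0], [3.0, 4.0]]            # shared 2 -> 2 layer
X = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]  # 2 x 3 data point (3 particles)
shared = per_column(W, X)
full = flattened(block_diagonal(W, 3), X)
```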

DM Script: why does the Fourier transform of the Gaussian kernel need the modulus?

I recently started learning DM script for TEM image processing.
I needed a Gaussian blur and found a script named 'Gaussian Blur' at http://www.dmscripting.com/recent_updates.html
This code implements the Gaussian blur by multiplying the fast Fourier transform (FFT) of the source image by the FFT of a Gaussian-kernel image and then taking the inverse Fourier transform of the product.
Here is the relevant part of the code:
// Carry out the convolution in Fourier space
compleximage fftkernelimg:=realFFT(kernelimg) // FFT of Gaussian-kernel image
compleximage FFTSource:=realfft(warpimg) // FFT of source image
compleximage FFTProduct:=FFTSource*fftkernelimg.modulus().sqrt()
realimage invFFT:=realIFFT(FFTProduct)
My question is about this line:
compleximage FFTProduct:=FFTSource*fftkernelimg.modulus().sqrt()
Why does the FFT of the Gaussian kernel need '.modulus().sqrt()' for the convolution?
Is it related to the fact that the Fourier transform of a Gaussian function is another Gaussian function?
Or is it related to some limitation of the discrete Fourier transform?
Please answer me
Thanks
This is related to the general precision limitation of any floating-point numeric computing (see e.g. here, or more in depth here).
A rotationally symmetric (real-valued) Gaussian of standard deviation sigma should be transformed into a 100% real-valued rotationally symmetric Gaussian of standard deviation 1/sigma. However, doing this numerically will show you deviations. Just try the following:
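// Rotationally symmetric, real-valued Gaussian centred in a 256 x 256 image
number sigma = 30
number A0 = 1
realimage first := RealImage( "First", 8, 256, 256 )
first = A0 * exp( - (iradius**2/(2*sigma*sigma) ))
first.showimage()
// Forward transform: analytically the result should be purely real
complexImage second := FFT(first)
second.Showimage()
// Mask of all pixels whose imaginary part is not exactly zero
image nonZeroImaginaryMask = ( 0 != second.Imaginary() )
nonZeroImaginaryMask.Showimage()
nonZeroImaginaryMask.SetLimits(0,1)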
When you then multiply these complex images (before back-transforming), you introduce even more errors. By using the modulus, one ensures that the forward-transformed kernel is purely real and hence a better "damping" curve.
A better implementation of FFT filtering code would actually create the FFT(Gaussian) directly as a Gaussian with a standard deviation of 1/sigma, as this is the analytically correct result. Doing an FFT of the kernel only makes sense if the kernel (or its FFT) is not analytically known.
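As a rough 1D illustration of both points (in Python rather than DM script, with a naive DFT and arbitrarily chosen N and sigma), the sketch below transforms a sampled Gaussian, measures the stray imaginary part, and compares the modulus with the analytic Gaussian. In units of frequency bins the analytic standard deviation works out to N/(2*pi*sigma), which is the discrete counterpart of "1/sigma".

```python
import cmath
import math

def dft(x):
    """Naive O(N^2) forward DFT, good enough for a small demonstration."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

N, sigma = 64, 4.0  # illustrative values, not taken from the original script
gauss = [math.exp(-((j - N // 2) ** 2) / (2 * sigma * sigma)) for j in range(N)]
spectrum = dft(gauss)

def analytic(k):
    """Analytic transform of the Gaussian, evaluated at frequency bin k."""
    f = min(k, N - k) / N  # signed frequency folded to its magnitude
    return sigma * math.sqrt(2 * math.pi) * math.exp(-2 * (math.pi * sigma * f) ** 2)

# Imaginary parts should vanish analytically; numerically they are tiny residue.
max_imag = max(abs(v.imag) for v in spectrum)
# abs() drops that residue (and the alternating sign caused by centring at N/2).
max_err = max(abs(abs(v) - analytic(k)) for k, v in enumerate(spectrum))
```

With these values the kernel in Fourier space could be filled in directly from `analytic(k)`, with no forward transform of the kernel at all.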
In general: when implementing any "maths" in program code, it can pay hugely to think it through with numerical computation limits in the back of your head. Reduce actual computation whenever possible (i.e. compute analytically and use the result instead of relying on brute-force numerical computation) and try to "reshape" equations when possible, e.g. avoid large sums over many small numbers, be careful about checks against exact numeric values, and try to avoid expressions which are very sensitive to small numerical errors.
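Two of these pitfalls are easy to reproduce. The following is a small Python illustration, not DM script: a running sum of ten 0.1 values misses 1.0, while the standard library's `math.fsum` tracks the lost low-order bits, and an exact equality check fails where a tolerance-based one succeeds.

```python
import math

values = [0.1] * 10

# Plain accumulation: each addition rounds, and the errors pile up.
running = 0.0
for v in values:
    running += v

# math.fsum keeps track of the rounding errors of the partial sums.
compensated = math.fsum(values)

# Exact comparison against a literal versus comparison with a tolerance.
exact_check = (0.1 + 0.2 == 0.3)
tolerant_check = math.isclose(0.1 + 0.2, 0.3)
```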

2D array FFT - ios Accelerate performance gains nullified by API limitations

The aim is to do correlation/convolutions(flip) of two 2D arrays using ios Accelerate framework for gaining speed.
My first attempt was with vImageConvolve_PlanarF/vdsp_imgfir, which was good for smaller arrays. But as the array size increased, performance dropped drastically, as it is an O(n^2) implementation, as mentioned by the Accelerate developers themselves here(1).
I moved to an FFT implementation(2) to reduce the complexity to O(n log2 n). Using vDSP_fft2d_zip, I gained speed to an extent. But to use vDSP_fft2d_zip on 2D arrays whose sizes are not powers of 2, we need to pad with zeros. For example, a 2D array of size 640 * 480 has to be padded to 1024 * 512. Other FFT implementations like FFTW or OpenCV's DFT allow sizes which can be expressed as size = 2^p * 3^q * 5^r. That allows FFTW/OpenCV to do the FFT of a 640 * 480 2D array at the same size.
So for 2D arrays of size 640*480, an Accelerate vs FFTW/OpenCV comparison is effectively between 1024*512 and 640*480 dimensions. So whatever performance gain I get from Accelerate's FFT on 2D arrays is nullified by its inability to perform the FFT at reasonable dimensions like size = 2^p * 3^q * 5^r.
Two queries:
1. Am I missing any Accelerate functionality that would do this easily? For example, any Accelerate function which could perform a 2D array FFT at size = 2^p * 3^q * 5^r. I assume vDSP_DFT_Execute performs only a 1D FFT.
2. Are there better approaches to 2D FFT or correlation? For instance this answer(3) suggests splitting arrays like 480 = 256 + 128 + 64 + 32, with repeated 1D FFTs over rows and then over columns. But this will need too many function calls and, I assume, will not help with performance.
Of lesser importance: I am doing the correlation in tiles, as one of the 2D arrays is far bigger than the other, say 1920*1024 vs 150*100.
Linear convolution or correlation requires zero padding anyway, otherwise the result will be circular convolution or correlation.
1d iOS vDSP/Accelerate FFTs do allow N to be the product of small primes, not just 2^M. Not sure about 2d, but one can build a 2d FFT out of a 1d FFT.
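The difference between circular and linear convolution is easy to see on a tiny 1D case. The sketch below is Python with a naive DFT, not Accelerate code, and `fft_convolve` is a hypothetical helper: multiplying spectra at the original length wraps the tail of the result around to the front, while padding both inputs to at least len(a) + len(b) - 1 gives the linear result.

```python
import cmath

def dft(x, inverse=False):
    """Naive DFT; the inverse divides by the length."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[j] * cmath.exp(sign * 2j * cmath.pi * j * k / n) for j in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out

def fft_convolve(a, b, size):
    """Convolve a and b by multiplying their spectra at transform length `size`."""
    pa = list(a) + [0] * (size - len(a))  # zero-pad both inputs to `size`
    pb = list(b) + [0] * (size - len(b))
    prod = [p * q for p, q in zip(dft(pa), dft(pb))]
    return [v.real for v in dft(prod, inverse=True)]

a = [1, 2, 3, 4]
b = [1, 1]
circular = fft_convolve(a, b, len(a))             # wraps: a[3]*b[1] lands in slot 0
linear = fft_convolve(a, b, len(a) + len(b) - 1)  # padded: no wrap-around
```

So some padding is needed for correlation in any case; the open question in the post is only how far beyond len(a) + len(b) - 1 the library forces the size to grow.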

zero padded FFT using FFTW

To interpolate a signal in frequency domain, one can pad zeros in time domain and do an FFT.
Suppose the number of elements in a given vector X is N and Y is the same as X but padded one sided with N zeros. Then the following give the same result.
$$\hat{x}(k)=\sum_{n=0}^{2N-1} Y(n)e^{i2\pi k n/2N},\quad k=0,...,2N-1,$$
$$\hat{x}(k)=\sum_{n=0}^{ N-1} X(n)e^{i2\pi k n/2N},\quad k=0,...,2N-1.$$
Now if we use FFTW package, the first equation needs 2N memory space for the input vector while the second one needs only N memory space (I do not know if it is even possible to do in the existing FFTW package)! Also the computational complexity lowers from 2N^2log(2N) to 2N^2log(N). The problem is worse whenever we do a 2D FFT or 3D FFT. Is it possible to do the second approach using FFTW package? This is fairly easy to do in MATLAB though.
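If x is a signal of length 2N padded with zeros from index N onwards, its DFT reads:
$$\hat{x}(k)=\sum_{n=0}^{N-1} x(n)e^{i2\pi k n/2N},\quad k=0,...,2N-1.$$
If k is even, k=2q:
$$\hat{x}(2q)=\sum_{n=0}^{N-1} x(n)e^{i2\pi q n/N},\quad q=0,...,N-1.$$
Hence, the coefficients of even frequencies arise from the N-point discrete Fourier transform of x(n).
If k is odd, k=2q+1:
$$\hat{x}(2q+1)=\sum_{n=0}^{N-1} \left[x(n)e^{i\pi n/N}\right]e^{i2\pi q n/N},\quad q=0,...,N-1.$$
Hence, the coefficients of odd frequencies arise from the N-point discrete Fourier transform of x(n)exp(i*M_PI*n/N).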
Thus, the discrete Fourier transform of a zero-padded signal of length 2N reduces to two DFTs of signals of length N, and FFTW can be used to compute them.
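The even/odd split can be checked numerically. The sketch below is Python with a naive DFT standing in for FFTW, uses the e^{+i...} sign convention of the question, and `padded_dft` is a hypothetical name. It builds the 2N-point spectrum from two N-point transforms without ever allocating the padded input.

```python
import cmath

def dft(x):
    """Naive N-point DFT with the e^{+i 2 pi k n / N} convention of the question."""
    n = len(x)
    return [sum(x[j] * cmath.exp(2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def padded_dft(x):
    """2N-point DFT of x followed by N zeros, from two N-point DFTs."""
    n = len(x)
    even = dft(x)                                   # even output frequencies
    twisted = [x[j] * cmath.exp(1j * cmath.pi * j / n) for j in range(n)]
    odd = dft(twisted)                              # odd output frequencies
    out = [0j] * (2 * n)
    out[0::2] = even
    out[1::2] = odd
    return out

x = [1.0, -2.0, 3.5, 0.25]
direct = dft(x + [0.0] * len(x))   # reference: transform of the padded signal
split = padded_dft(x)
```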
The overall computation time will be 2*c*N*ln(N), where c is a constant. It is expected to be faster than the direct computation of the DFT, c*2*N*ln(2*N). But remember that ln(2*N)=ln(2)+ln(N): as N gets large, the extra ln(2) term of the direct computation becomes negligible compared to ln(N), so the trick becomes useless, even if the dimension is larger than one. It does not affect the complexity.
Moreover, FFTW is really efficient, using lots of features of your PC if it is correctly installed, and it will be hard to do better than this in any case, even if the presented trick is used. Finally, if the input signal is real, you may use fftw_plan_dft_r2c_2d: only half the coefficients in the Fourier space are computed and stored.
Regarding memory requirements, if you are really short of memory, you can do the transform in place and use the same array for input and output (the FFTW_IN_PLACE flag in FFTW 2; in FFTW 3, pass the same pointer as input and output when creating the plan). Yet, it is slightly slower.
The procedure presented above can be extended to compute the DFT of an N-point signal padded with (L-1)N zeros to length LN: it reduces to the computation of L DFTs of length N.
Do you have any reference showing how MATLAB handles and optimizes the DFT of padded signals compared to FFTW ?
EDIT : Further research about the 3D case :
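The 3D DFT of a padded 3D signal x(n,m,p) is:
$$\hat{x}(k_n,k_m,k_p)=\sum_{n=0}^{N-1}\sum_{m=0}^{N-1}\sum_{p=0}^{N-1} x(n,m,p)e^{i2\pi (k_n n+k_m m+k_p p)/2N}$$
If k_n, k_m and k_p are even (k_n=2q_n, k_m=2q_m, k_p=2q_p):
$$\hat{x}(2q_n,2q_m,2q_p)=\sum_{n=0}^{N-1}\sum_{m=0}^{N-1}\sum_{p=0}^{N-1} x(n,m,p)e^{i2\pi (q_n n+q_m m+q_p p)/N}$$
If k_n and k_m are even and k_p is odd (k_p=2q_p+1):
$$\hat{x}(2q_n,2q_m,2q_p+1)=\sum_{n=0}^{N-1}\sum_{m=0}^{N-1}\sum_{p=0}^{N-1} \left[x(n,m,p)e^{i\pi p/N}\right]e^{i2\pi (q_n n+q_m m+q_p p)/N}$$
...There are 8 cases.
So, the computation of the 3D DFT of a 3D signal x of size NxNxN padded to 2Nx2Nx2N reduces to the computation of 8 3D DFTs of size NxNxN. Since a 3D DFT is a combination of 1D DFTs along 3 directions, the total number of 1D DFTs of size N is 3x8xNxN, while the direct computation requires 3x(2N)x(2N) DFTs of size 2N. Computational time is 24cN^3ln(N) against 24cN^3ln(2N): a small gain is possible... Again, FFTW is fast...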
Yet, instead of using a black-box 3d fft, let's compute the 8 dfts of size N at once, by performing the 1d dfts in each direction.
1d dft along N : 2 cases, NxN dfts => 2cN^3ln(N)
1d dft along M : 2 cases, 2NxN dfts => 4cN^3ln(N)
1d dft along P : 2 cases, 2Nx2N dfts => 8cN^3ln(N)
Hence, the total computation time is expected to be 14cN^3ln(N) against 24cN^3ln(2N) : a small gain is possible...Again fftw is fast...
Moreover, the computation of x(n)exp(i*M_PI*n/N)
requires only a single call to exp: first compute w=exp(I*M_PI/N), then update wn=wn*w; x(n)=x(n)*wn, or use pow if precision becomes an issue.

How do I calculate variance of gpu_array?

I am trying to compute the variance of a 2D gpu_array. A reduction kernel sounds like a good idea:
http://documen.tician.de/pycuda/array.html
However, this documentation implies that reduction kernels just reduce 2 arrays into 1 array. How do I reduce a single 2D array into a single value?
I guess the first step is to define variance for this case. In matlab, the variance function on a 2D array returns a vector (1D-array) of values. But it sounds like you want a single-valued variance, so as others have already suggested, probably the first thing to do is to treat the 2D-array as 1D. In C we won't require any special steps to accomplish this. If you have a pointer to the array you can index into it as if it were a 1D array. I'm assuming you don't need help on how to handle a 2D array with a 1D index.
Now if it's the 1D variance you're after, I'm assuming a function like variance(x)=sum((x[i]-mean(x))^2)/N, where the sum is over all i and N is the number of elements, is what you're after (based on my read of the wikipedia article; divide by N-1 instead if you want the sample variance). We can break this down into 3 steps:
1. compute the mean (this is a classical reduction - one value is produced for the data set - sum all elements then divide by the number of elements)
2. compute the value (x[i]-mean)^2 for all i - this is an element by element operation producing an output data set equal in size (number of elements) to the input data set
3. compute the sum of the elements produced in step 2, then divide by the number of elements - this is another classical reduction, as one value is produced for the entire data set.
Both steps 1 and 3 are classical reductions which are summing all elements of an array. Rather than cover that ground here, I'll point you to Mark Harris' excellent treatment of the topic as well as some CUDA sample code. For step 2, I'll bet you could figure out the kernel code on your own, but it would look something like this:
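// Step 2: write the squared deviation from the mean for every element.
__global__ void var(const float *input, float *output, unsigned N, float mean){
    unsigned idx = threadIdx.x + (blockDim.x * blockIdx.x);  // global thread index
    if (idx < N) {                                           // guard the last, partial block
        float d = input[idx] - mean;
        output[idx] = d * d;  // plain multiply: exact, unlike the approximate __powf(d, 2)
    }
}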
Note that you will probably want to combine the reductions and the above code into a single kernel.
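For checking a GPU result on a small array, the three steps can be written out on the host. This is a plain Python reference sketch, not PyCUDA code; `variance_2d` is a hypothetical name and it computes the population variance (division by N).

```python
def variance_2d(rows):
    """Population variance of all elements of a 2D array, in three steps."""
    flat = [v for row in rows for v in row]     # treat the 2D array as 1D
    n = len(flat)
    mean = sum(flat) / n                        # step 1: reduction to the mean
    squared = [(v - mean) ** 2 for v in flat]   # step 2: element-wise (x[i]-mean)^2
    return sum(squared) / n                     # step 3: reduction, then normalise

result = variance_2d([[1.0, 2.0], [3.0, 4.0]])
```

On the GPU, steps 1 and 3 are the two sum reductions and step 2 is the element-wise kernel shown above.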