How to calculate a combination of convolution and correlation by FFT?

I'm trying to find an efficient way to calculate a combination of convolution and correlation such as the following:
c(x,y) = sum_i sum_j a(x-i, y+j) * b(i,j)
I know that a 1-D convolution or correlation can be computed as
a conv b = ifft(fft(a).*fft(b))
a corr b = ifft(fft(a).*conjg(fft(b)))
But I have no idea how to combine the two in 2-D or N-D problems. I think it is similar to 2-D convolution, but I don't know the specific derivation.

The correlation can be written in terms of the convolution by reversing one of the arguments:
corr(x(t),y(t)) = conv(x(t),y(-t))
Thus, if you want the x-axis to behave like a convolution and the y-axis to behave like a correlation, reverse the y-axis only and compute the convolution. It doesn’t matter if you use a spatial or frequency domain implementation.
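A minimal numpy sketch of that recipe (illustrative only; it assumes real-valued inputs, zero-pads to the full linear size to avoid circular wrap-around, and the origin along the correlated axis ends up shifted by b.shape[1] - 1):
import numpy as np

def conv_x_corr_y(a, b):
    # Reversing b along axis 1 turns "convolve in x, correlate in y"
    # into an ordinary 2-D convolution, which the FFT computes directly.
    b_flipped = np.flip(b, axis=1)
    shape = (a.shape[0] + b.shape[0] - 1, a.shape[1] + b.shape[1] - 1)
    A = np.fft.rfft2(a, shape)
    B = np.fft.rfft2(b_flipped, shape)
    return np.fft.irfft2(A * B, shape)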

Related

Questions about convolution (in CNN)

I suddenly came up with a question about convolution and just wanted to be clear if I'm missing something. The question is whether the two operations below are identical.
Case1)
Suppose we have a feature map of shape C^2 x H x W, and a K x K x C^2 Conv weight with stride S. (To be clear, C^2 is the channel dimension; I just wanted it to be a square number. K is the kernel size.)
Case2)
Suppose we have a feature map of shape 1 x CH x CW, and a CK x CK x 1 Conv weight with stride CS.
So basically Case 2 is a pixel-upshuffled version of Case 1 (both the feature map and the Conv weight). As convolutions are simply element-wise multiplications, both operations seem identical to me.
# given a feature map and a conv_weight, namely f_map, conv_weight
#case1)
convLayer = Conv(conv_weight)
result = convLayer(f_map, stride=1)
#case2)
f_map = pixelshuffle(f_map, scale=C)
conv_weight = pixelshuffle(conv_weight, scale=C)  # shuffle the weight as well, not f_map again
convLayer = Conv(conv_weight)
result = convLayer(f_map, stride=C)
But this means that, for example, given a 256 x H x W feature map with a 3x3 Conv (as in many deep learning models), performing the convolution would simply be applying a HUUUGE 48x48 Conv to a 1 x 16H x 16W feature map.
But this doesn't match my basic intuition of CNNs: stacking multiple layers with the smallest 3x3 Conv, resulting in a somewhat large receptive field, with each channel holding different (possibly redundant) information.
You can, in a sense, think of it as "folding" spatial information into the channel dimension. This is the rationale behind ResNet's trade-off between spatial resolution and feature dimension: whenever ResNet downsamples by 2 in space, it increases the feature dimension by 2. However, since there are two spatial dimensions and both are downsampled by 2, the "volume" of the feature map is effectively halved.
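As a concrete sketch of that folding, here is a small numpy space-to-depth example (assuming a channels-last H x W x C layout and a 2x2 block):
import numpy as np

def space_to_depth(x, block=2):
    # Fold each block x block spatial patch into the channel dimension:
    # (H, W, C) -> (H/block, W/block, C*block*block), no values lost.
    H, W, C = x.shape
    x = x.reshape(H // block, block, W // block, block, C)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(H // block, W // block, C * block * block)

x = np.random.rand(64, 64, 32)
print(space_to_depth(x).shape)   # (32, 32, 128): folding preserves the "volume"
# A ResNet downsampling stage maps 64x64x32 -> 32x32x64 instead,
# i.e. it keeps only half of that "volume".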

Write two vectors as convolution

How can two column vectors be written as an analytic convolution so that the discrete FFT may be used? MATLAB syntax is used.
Consider:
a set of vectors which, when sorted into a step function appears as any of the following:
[1,1,1,1,0,0,0,0], or [1,1,1,1,1,0,0,0], or [1,1,1,1,1,1,0,0]
(...the location at which the function "steps up" varies over members of this set)
The other vector is random, vec=[1,0,1,0,1,1,1,0], and obviously both contain only 0s and 1s.
Is it possible to write these vectors as an analytic convolution? I would like the 1st, 2nd, 3rd, 4th... entries of the convolution to have values of:
sum(vec.*[1,0,0,0,0,0,0,0])
sum(vec.*[1,1,0,0,0,0,0,0])
sum(vec.*[1,1,1,0,0,0,0,0])
sum(vec.*[1,1,1,1,0,0,0,0])
...
sum(vec.*[1,1,1,1,1,1,1,1])
For speed, I am trying to avoid using a for-loop. I cannot vectorize because this would require terabytes of RAM. (I work with vectors that are not of length 8, but rather of length nearly a million.)
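As a sketch of what is being asked for: the entries listed above are running sums of vec, which are exactly the first n samples of the linear convolution of vec with a vector of n ones. In numpy (the MATLAB fft/ifft version is analogous; the zero-padding avoids the circular wrap-around of the DFT):
import numpy as np

vec = np.array([1, 0, 1, 0, 1, 1, 1, 0], dtype=float)
n = len(vec)

direct = np.cumsum(vec)                      # the sums described above

m = 2 * n - 1                                # full linear-convolution length
step = np.ones(n)                            # [1, 1, ..., 1]
full = np.fft.irfft(np.fft.rfft(vec, m) * np.fft.rfft(step, m), m)
via_fft = np.rint(full[:n])                  # first n samples = running sums

print(np.allclose(direct, via_fft))          # True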
The convolution theorem gives the function R, the convolution of the functions L and 1/w, in terms of the Fourier transform F and its inverse F^-1 as
R(w) = integral dw' L(w') * 1/(w-w') = F^-1( F(L) .* F(1/w) )
Clearly, the function 1/(w-w') in the convolution is from 1/w under F; it's as if you just set w'=0. But if I use analogous reasoning on my [1,1,1,1,0,0,0,0], I get either [1,1,1,1,1,1,1,1] (the identity under .* in MATLAB) or [0,0,0,0,0,0,0,0] (a very boring result).
What is the mistake in reasoning I've made?

Determining the values of the filter matrices in a CNN

I am getting started with deep learning and have a basic question on CNN's.
I understand how gradients are adjusted using backpropagation according to a loss function.
But I thought the values of the convolving filter matrices (in CNNs) need to be determined by us.
I'm using Keras and this is how (from a tutorial) the convolution layer was defined:
classifier = Sequential()
classifier.add(Conv2D(32, (3, 3), input_shape = (64, 64, 3), activation = 'relu'))
There are 32 filter matrices with dimensions 3x3 in use.
But how are the values for these 32 3x3 matrices determined?
It's not the gradients that are adjusted: the gradient calculated with the backpropagation algorithm is just the collection of partial derivatives with respect to each weight in the network, and these components are in turn used to adjust the network weights in order to minimize the loss.
Take a look at this introductory guide.
The weights in the convolution layer in your example will be initialized to random values (according to a specific method), and then tweaked during training, using the gradient at each iteration to adjust each individual weight. Same goes for weights in a fully connected layer, or any other layer with weights.
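For instance, you can inspect the freshly initialized kernels of the layer from the question (a small check assuming TensorFlow's Keras, whose Conv2D defaults to glorot_uniform initialization):
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D

model = Sequential([
    Conv2D(32, (3, 3), input_shape=(64, 64, 3), activation='relu')
])
kernels, biases = model.layers[0].get_weights()
print(kernels.shape)  # (3, 3, 3, 32): 32 random 3x3x3 kernels, nothing hand-crafted
print(biases.shape)   # (32,): biases start at zero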
EDIT: I'm adding some more details about the answer above.
Let's say you have a neural network with a single layer, which has some weights W. Now, during the forward pass, you calculate your output yHat for your network, compare it with your expected output y for your training samples, and compute some cost C (for example, using the quadratic cost function).
Now, you're interested in making the network more accurate, i.e. you'd like to minimize C as much as possible. Imagine you want to find the minimum value of a simple function like f(x)=x^2. You can start at some random point (as you did with your network), then compute the slope of the function at that point (i.e. the derivative) and move down in that direction, until you reach a minimum value (a local minimum at least).
With a neural network it's the same idea, with the difference that your inputs are fixed (the training samples), and you can see your cost function C as having n variables, where n is the number of weights in your network. To minimize C, you need the slope of the cost function C in each direction (ie. with respect to each variable, each weight w), and that vector of partial derivatives is the gradient.
Once you have the gradient, the part where you "move a bit following the slope" is the weights update part, where you update each network weight according to its partial derivative (in general, you subtract some learning rate multiplied by the partial derivative with respect to that weight).
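As a toy sketch of that update rule, using the f(x) = x^2 example from above (learning rate and iteration count chosen arbitrarily):
import numpy as np

learning_rate = 0.1
w = np.random.randn()                 # random starting point, like a random weight
for _ in range(100):
    grad = 2 * w                      # derivative of the "cost" f(w) = w**2
    w = w - learning_rate * grad      # move a bit down the slope
print(w)                              # ends up very close to the minimum at 0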
A trained network is just a network whose weights have been adjusted over many iterations in such a way that the value of the cost function C over the training dataset is as small as possible.
This is the same for a convolutional layer too: you first initialize the weights at random (ie. you place yourself on a random position on the plot for the cost function C), then compute the gradients, then "move downhill", ie. you adjust each weight following the gradient in order to minimize C.
The only difference between a fully connected layer and a convolutional layer is how they calculate their outputs, and how the gradient is in turn computed, but the part where you update each weight with the gradient is the same for every weight in the network.
So, to answer your question, those filters in the convolutional kernels are initially random and are later adjusted with the backpropagation algorithm, as described above.
Hope this helps!
Sergio0694 states, "The weights in the convolution layer in your example will be initialized to random values". So if they are random and, say, I want 10 filters, every run of the algorithm could find different filters. Also, say I have the MNIST data set; the digits are formed of edges and curves. Is it guaranteed that there will be an edge filter or a curve filter among the 10?
I mean, are the first 10 filters the most meaningful, most distinctive filters we can find?
Best

DM Script: why does the Fourier transform of the Gaussian kernel need the modulus?

Recently I have been learning DM Script for TEM image processing.
I needed a Gaussian blur process and I found one called 'Gaussian Blur' at http://www.dmscripting.com/recent_updates.html
This code implements a Gaussian blur by multiplying the fast Fourier transform (FFT) of the source image by the FFT of a Gaussian-kernel image and finally taking the inverse Fourier transform.
Here is the part of the code,
// Carry out the convolution in Fourier space
compleximage fftkernelimg := realFFT(kernelimg)   // FFT of the Gaussian-kernel image
compleximage FFTSource := realFFT(warpimg)        // FFT of the source image
compleximage FFTProduct:=FFTSource*fftkernelimg.modulus().sqrt()
realimage invFFT:=realIFFT(FFTProduct)
The point I want to ask about is this:
compleximage FFTProduct:=FFTSource*fftkernelimg.modulus().sqrt()
Why does the FFT of the Gaussian kernel need '.modulus().sqrt()' for the convolution?
Is it related to the fact that the Fourier transform of a Gaussian function is another Gaussian function?
Or is it related to some limitation of the discrete Fourier transform?
Please answer me
Thanks
This is related to the general precision limitation of any floating-point numerical computing. (See e.g. here, or more in depth here.)
A rotationally symmetric (real-valued) Gaussian of standard deviation sigma should be transformed into a 100% real-valued, rotationally symmetric Gaussian of standard deviation 1/sigma. However, doing this numerically will show you deviations. Just try the following:
number sigma = 30
number A0 = 1

// Build a real-valued, rotationally symmetric Gaussian
realimage first := RealImage( "First", 8, 256, 256 )
first = A0 * exp( - (iradius**2/(2*sigma*sigma) ))
first.ShowImage()

// Its FFT should be purely real, but numerically it is not
complexImage second := FFT(first)
second.ShowImage()

// Mark every pixel whose imaginary part is not exactly zero
image nonZeroImaginaryMask = ( 0 != second.Imaginary() )
nonZeroImaginaryMask.ShowImage()
nonZeroImaginaryMask.SetLimits(0,1)
When you then multiply these complex images (before transforming back) you introduce even more errors. By using the modulus, one ensures that the forward-transformed kernel is purely real and hence acts as a better "damping" curve.
A better implementation of an FFT filtering code would actually create the FFT(Gaussian) directly, with a standard deviation of 1/sigma, as this is the analytically correct result. Doing an FFT of the kernel only makes sense if the kernel (or its FFT) is not known analytically.
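A minimal numpy sketch of that idea (an illustration rather than DM Script; the scaling assumes np.fft.fftfreq's cycles-per-sample convention, and the blur is circular because the DFT is periodic):
import numpy as np

def gaussian_blur_fft(img, sigma):
    # Write down the Gaussian's Fourier transform analytically instead of
    # FFT-ing a sampled kernel; H(0) = 1, so the mean intensity is preserved.
    ky = np.fft.fftfreq(img.shape[0])[:, None]   # frequencies in cycles/sample
    kx = np.fft.fftfreq(img.shape[1])[None, :]
    H = np.exp(-2.0 * np.pi**2 * sigma**2 * (kx**2 + ky**2))
    return np.real(np.fft.ifft2(np.fft.fft2(img) * H))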
In general: when implementing any "maths" in program code, it can pay hugely to think it through with numerical computation limits in the back of your head. Reduce actual computation whenever possible (i.e. compute analytically and use the result instead of relying on brute-force numerical computation), and try to "reshape" equations when possible; for example, avoid large sums over many small numbers, be careful about checks against exact numeric values, and try to avoid expressions that are very sensitive to small numerical errors.
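For instance, the classic exact-comparison pitfall (Python shown, but the issue exists in any language):
print(0.1 + 0.2 == 0.3)                  # False: floating point is not exact
print(abs((0.1 + 0.2) - 0.3) < 1e-12)    # True: compare against a tolerance instead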

How does opengl project a 3d point to screenspace?

I am trying to emulate a subset of opengl with my own software rasterizer.
I'm taking a wild guess that the process looks like this:
Multiply the 3d point by the modelview matrix -> multiply that result by the projection matrix
Is this correct?
Also what size is the projection matrix and how does it work?
The point is multiplied by the modelview matrix and then by the projection matrix. The result is normalized (the perspective divide) and then mapped by the viewport transform to get the screen coordinates. All matrices are 4x4. You can view this link for further details:
http://www.songho.ca/opengl/gl_transform.html#example2
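A rough numpy sketch of that chain (illustrative; it assumes column vectors, the default depth range [0, 1], and a viewport whose origin is at (0, 0)):
import numpy as np

def project_to_screen(point, modelview, projection, viewport_w, viewport_h):
    # Follow a 3-D point through the fixed-function OpenGL transform chain.
    clip = projection @ modelview @ np.append(point, 1.0)  # object -> clip space
    ndc = clip[:3] / clip[3]                 # perspective divide -> [-1, 1] cube
    x = (ndc[0] * 0.5 + 0.5) * viewport_w    # viewport transform
    y = (ndc[1] * 0.5 + 0.5) * viewport_h
    depth = ndc[2] * 0.5 + 0.5               # value usable for the depth buffer
    return x, y, depth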
(shameless self-promotion, sorry) I wrote a tutorial on the subject:
http://www.opengl-tutorial.org/beginners-tutorials/tutorial-3-matrices/
There is a slight caveat that I don't explain, though. At the end of the tutorial, you're in Normalized Device Coordinates, i.e. -1 to +1. A simple linear mapping transforms this to [0, screensize].
You might also benefit from looking at the gluProject() code. This takes an x, y, z point in object coordinates as well as pointers to modelView, projection, and viewport matrices and tells you what the x, y, (z) coordinates are in screenspace (the z is a value between 0 and 1 that can be used in the depth buffer). All three matrix multiplications are shown there in the code, along with the divisions necessary for perspective.