How to calculate the gradient for a custom deep learning layer that takes a 1x1 neuron as input and multiplies it by a constant NxN matrix?

I want to create a custom deep learning layer that takes a 1x1 neuron as input and uses it to scale a constant, predefined NxN matrix. I do not understand how to calculate the gradient for this layer.
I understand that in this case dLdZ is NxN and dLdX should be 1x1, but I don't understand what dZdX should be to satisfy that. It's obviously not a simple chained matrix product where dLdX = dLdZ*dZdX, since the dimensions don't match.
The question is not really language dependent; I write here in MATLAB.
% M is the constant NxN matrix
% X is 1x1x1xb (one scalar per batch element)
Z = zeros(N, N, 1, b);
for i = 1:b
    Z(:,:,:,i) = X(1,1,1,i) * M;  % scale the constant matrix by the scalar
end
==============================
Edit: the answer I got was very helpful. I now perform the calculation as follows:
dLdX = zeros(1, 1, 1, b);
for i = 1:b
    dLdX(:,:,:,i) = sum(sum(dLdZ(:,:,:,i) .* M));  % sum of the elementwise product
end
This works perfectly. Thanks!!

I think your question is a little unclear. I will assume your goal is to propagate the gradients through the layer defined above to the batch of scalar values. Let me answer according to how I understand it.
You have a parameter X, which is a scalar, batched to dimension b (b: batch size). It is used to scale the constant NxN matrix M, giving Z = X*M, where Z is of dimension bxNxN. Let's assume you calculate some scalar loss L directly from the scaled matrix Z.
Then you can calculate the gradient with respect to X according to:
dL/dX = dL/dZ * dZ/dX. Note that the dimensions of this product indeed match (unlike your initial impression), since dL/dZ is bxNxN and dZ/dX is bxNxN (it is just M broadcast over the batch). Multiplying elementwise and summing over the matrix indices yields dL/dX, which is of dimension b.
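As a sanity check, here is a minimal PyTorch sketch; the toy loss L = sum(Z) and the random M are my own illustrative assumptions. It compares the manual gradient against autograd:

import torch

N, b = 4, 3
M = torch.randn(N, N)                          # constant NxN matrix (illustrative values)
X = torch.randn(b, 1, 1, requires_grad=True)   # one scalar per batch element

Z = X * M                                      # broadcasts to b x N x N
L = Z.sum()                                    # stand-in scalar loss
L.backward()

# Manual gradient: dL/dX_i = sum over all entries of dL/dZ_i .* M.
# For L = Z.sum(), dL/dZ is all ones, so each dL/dX_i equals M.sum().
dLdZ = torch.ones(b, N, N)
manual = (dLdZ * M).sum(dim=(1, 2), keepdim=True)
print(torch.allclose(X.grad, manual))          # prints True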
Did I understand you correctly?
Cheers

Related

How to calculate a combination of convolution and correlation by FFT?

I'm trying to find an algorithm to efficiently calculate a combination of convolution and correlation such as the following:
c(x,y) = sum_i sum_j a(x-i, y+j) * b(i,j)
I know that 1-D convolution and correlation can be computed as
a conv b = ifft(fft(a) .* fft(b))
a corr b = ifft(fft(a) .* conj(fft(b)))
but I have no idea how to combine them in 2-D or N-D problems. I think it is similar to 2-D convolution, but I don't know the specific derivation.
The correlation can be written in terms of the convolution by reversing one of the arguments:
corr(x(t),y(t)) = conv(x(t),y(-t))
Thus, if you want the x-axis to behave like a convolution and the y-axis to behave like a correlation, reverse the y-axis only and compute the convolution. It doesn’t matter if you use a spatial or frequency domain implementation.
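For instance, here is a small NumPy sketch under circular (DFT) boundary conditions; the convention that axis 0 is x and axis 1 is y is my own assumption. It flips the y-axis of b circularly, performs a plain FFT convolution, and checks the result against the direct sum:

import numpy as np

N = 8
a = np.random.rand(N, N)
b = np.random.rand(N, N)

# Circular reversal of the y-axis: b_flip[i, j] = b[i, (-j) % N]
b_flip = np.roll(b[:, ::-1], 1, axis=1)
c_fft = np.real(np.fft.ifft2(np.fft.fft2(a) * np.fft.fft2(b_flip)))

# Brute-force reference: c(x,y) = sum_{i,j} a(x-i, y+j) * b(i,j), indices mod N
c_ref = np.zeros((N, N))
for x in range(N):
    for y in range(N):
        for i in range(N):
            for j in range(N):
                c_ref[x, y] += a[(x - i) % N, (y + j) % N] * b[i, j]
print(np.allclose(c_fft, c_ref))  # prints True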

How to reshape a pytorch matrix without mixing elements of items in a batch

In my neural network model, I represent an 8-word sentence with an 8x256-dimensional embedding matrix. I want to feed it to an LSTM as input, where the LSTM takes a single word embedding at a time and processes it. According to the PyTorch documentation, the input should have the shape (seq_len, batch, input_size). What is the correct way to convert my input to the desired shape? I don't want to mix up the numbers by mistake. I am quite new to PyTorch and row-major calculations, so I wanted to ask here. I do it as follows; is it correct?
x = torch.rand(8,256)
lstm_input = torch.reshape(x, (8, 1, 256))  # (seq_len, batch, input_size)
Your solution is correct: you added a singleton dimension for the "batch" dimension, leaving x with temporal dimension 8 and input dimension 256.
Since you are new to PyTorch, here are a few equivalent ways of doing the same thing:
x = x[:, None, :]
Putting None at dim=1 tells PyTorch to add a singleton dimension there.
Another way is to use view:
x = x.view(8, 1, 256)
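A third equivalent option is unsqueeze, which inserts a singleton dimension at the position you give it:

x = torch.rand(8, 256)
lstm_input = x.unsqueeze(1)  # shape (8, 1, 256): (seq_len, batch, input_size)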

Backpropagation on Two-Layer Networks

I have been following the CS231n lectures from Stanford and trying to complete the assignments on my own, sharing my solutions both on GitHub and my blog. But I'm having a hard time understanding how to model backpropagation. I mean, I can code modular forward and backward passes, but what bothers me is when I have the model below: a two-layer neural network.
Let's assume that our loss function here is a softmax loss function. In my modular softmax_loss() function I am calculating the loss and the gradient with respect to the scores (dSoft = dL/dY). After that, when I'm following backwards, let's say for b2: would db2 be equal to dSoft*1, and would dW2 be equal to dSoft*dX2 (the outputs of the ReLU gate)? What's the chain rule here? Why isn't dSoft equal to 1? Because dL/dL would be 1?
The softmax loss function outputs a number given an input x.
What dSoft means is that you're computing the derivative of the loss with respect to the input x of the last layer. To calculate the derivative with respect to W of the last layer you use the chain rule, i.e. dL/dW = dsoftmax/dx * dx/dW. Note that x = W*x_prev + b, where x_prev is the input to the last node. Therefore dx/dW is just x_prev and dx/db is just 1, which means that dL/dW (or simply dW) is dsoftmax/dx * x_prev, and dL/db (or simply db) is dsoftmax/dx * 1. Note that here dsoftmax/dx is the dSoft we defined earlier.
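As a concrete sketch (the shapes and the softmax cross-entropy loss are my own assumptions for illustration), the last-layer gradients look like this in NumPy:

import numpy as np

def softmax(s):
    e = np.exp(s - s.max())          # shift for numerical stability
    return e / e.sum()

# Hypothetical last layer: scores x = W @ x_prev + b, softmax cross-entropy loss
x_prev = np.random.randn(5)          # input to the last node
W = np.random.randn(3, 5)
b = np.random.randn(3)
y = 1                                # true class index

scores = W @ x_prev + b
p = softmax(scores)
dscores = p.copy()
dscores[y] -= 1                      # dSoft = dL/dscores for softmax cross-entropy

dW = np.outer(dscores, x_prev)       # dL/dW = dL/dscores * x_prev (chain rule)
db = dscores                         # dL/db = dL/dscores * 1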

DM Script: why does the Fourier transform of the Gaussian kernel need modulus?

Recently I have been learning DM Script for TEM image processing.
I needed a Gaussian blur process and found one named 'Gaussian Blur' at http://www.dmscripting.com/recent_updates.html
This code implements the Gaussian blur algorithm by multiplying the fast Fourier transform (FFT) of the source image by the FFT of a Gaussian-kernel image, and finally taking the inverse Fourier transform.
Here is the relevant part of the code:
// Carry out the convolution in Fourier space
compleximage fftkernelimg := realFFT(kernelimg)  // FFT of the Gaussian-kernel image
compleximage FFTSource := realfft(warpimg)       // FFT of the source image
compleximage FFTProduct := FFTSource * fftkernelimg.modulus().sqrt()
realimage invFFT := realIFFT(FFTProduct)
The point I want to ask about is this line:
compleximage FFTProduct := FFTSource * fftkernelimg.modulus().sqrt()
Why does the FFT of the Gaussian kernel need '.modulus().sqrt()' for the convolution?
Is it related to the fact that the Fourier transform of a Gaussian function is another Gaussian function?
Or is it related to some limitation of the discrete Fourier transform?
Thanks
This is related to the general precision limitation of any floating-point numerical computing (see e.g. here, or in more depth here).
A rotationally symmetric (real-valued) Gaussian of standard deviation sigma should be transformed into a 100% real-valued, rotationally symmetric Gaussian of standard deviation 1/sigma. However, doing this numerically will show you deviations. Just try the following:
number sigma = 30
number A0 = 1
realimage first := RealImage( "First", 8, 256, 256 )
first = A0 * exp( - (iradius**2/(2*sigma*sigma) ))
first.showimage()
complexImage second := FFT(first)
second.Showimage()
image nonZeroImaginaryMask = ( 0 != second.Imaginary() )
nonZeroImaginaryMask.Showimage()
nonZeroImaginaryMask.SetLimits(0,1)
When you then multiply these complex images (before transforming back) you introduce even more errors. Using the modulus ensures that the forward-transformed kernel is purely real and hence a better "damping" curve.
A better implementation of FFT filtering code would actually create FFT(Gaussian) directly, with a standard deviation of 1/sigma, as this is the analytically correct result. Doing an FFT of the kernel only makes sense if the kernel (or its FFT) is not analytically known.
In general: when implementing any maths in program code, it can pay hugely to think it through with numerical computation limits in the back of your head. Reduce actual computation whenever possible (i.e. compute analytically and use the result instead of relying on brute-force numerical computation), and try to "reshape" equations when possible: for example, avoid large sums over many small numbers, be careful about checks against exact numeric values, and avoid expressions which are very sensitive to small numerical errors.
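To illustrate that suggestion in NumPy rather than DM Script (the normalization below follows the continuous transform pair exp(-r^2/(2*sigma^2)) <-> exp(-2*pi^2*sigma^2*rho^2), and scaling the filter to unit DC gain is my own choice):

import numpy as np

# Build the Gaussian low-pass filter directly in frequency space.
# It is exactly real by construction, so no modulus trick is needed.
N, sigma = 256, 30.0
u = np.fft.fftfreq(N)                 # frequencies in cycles per sample
U, V = np.meshgrid(u, u)
G = np.exp(-2 * np.pi**2 * sigma**2 * (U**2 + V**2))  # real-valued, unit DC gain

img = np.random.rand(N, N)            # stand-in for the source image
blurred = np.real(np.fft.ifft2(np.fft.fft2(img) * G))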

How does OpenGL project a 3D point to screen space?

I am trying to emulate a subset of OpenGL with my own software rasterizer.
I'm taking a wild guess that the process looks like this:
multiply the 3D point by the modelview matrix, then multiply that result by the projection matrix.
Is this correct?
Also, what size is the projection matrix, and how does it work?
The point is multiplied by the modelview matrix and then by the projection matrix. The result is normalized by the perspective divide (division by w) and then transformed by the viewport matrix to get the screen coordinates. All of these matrices are 4x4. You can view this link for further details.
http://www.songho.ca/opengl/gl_transform.html#example2
(Shameless self-promotion, sorry.) I wrote a tutorial on the subject:
http://www.opengl-tutorial.org/beginners-tutorials/tutorial-3-matrices/
There is a slight caveat that I don't explain there, though. At the end of the tutorial you're in normalized device coordinates, i.e. -1 to +1. A simple linear mapping transforms this to [0, screensize].
You might also benefit from looking at the gluProject() code. It takes an x, y, z point in object coordinates, together with the modelview matrix, the projection matrix, and the viewport, and tells you what the x, y (and z) coordinates are in screen space (the z is a value between 0 and 1 that can be used in the depth buffer). All the matrix multiplications are shown there in the code, along with the divisions necessary for perspective.
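For a concrete picture, here is a small NumPy sketch of that path; the column-vector convention is assumed, and project() is a hypothetical helper, not the actual gluProject source:

import numpy as np

def project(p_obj, modelview, projection, viewport):
    # viewport = (x, y, width, height)
    x0, y0, w, h = viewport
    p = projection @ modelview @ np.append(p_obj, 1.0)  # clip coordinates
    ndc = p[:3] / p[3]                                  # perspective divide -> [-1, 1]
    win_x = x0 + (ndc[0] + 1.0) * w / 2.0               # linear map to window coords
    win_y = y0 + (ndc[1] + 1.0) * h / 2.0
    win_z = (ndc[2] + 1.0) / 2.0                        # depth in [0, 1]
    return win_x, win_y, win_z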