During backpropagation, will these cases have different effects?
1. Sum up the loss over all pixels, then backpropagate.
2. Average the loss over all pixels, then backpropagate.
3. Backpropagate individually over all pixels.
My main doubt is not about the numerical value but about the effect each of these would have.
The difference between options 1 and 2 is basically this: since the sum is N times the mean (N being the number of pixels), the gradients from the sum are N times larger in magnitude, but their direction is the same.
Here's a little demonstration. Let's first declare the necessary variables:
import torch
x = torch.tensor([4,1,3,7],dtype=torch.float32,requires_grad=True)
target = torch.tensor([4,2,5,4],dtype=torch.float32)
Now let's compute the gradient for x using an L2 loss with sum:
loss = ((x-target)**2).sum()
loss.backward()
print(x.grad)
This outputs: tensor([ 0., -2., -4., 6.])
Now using mean (after resetting x.grad, for example with x.grad.zero_()):
loss = ((x-target)**2).mean()
loss.backward()
print(x.grad)
And this outputs: tensor([ 0.0000, -0.5000, -1.0000, 1.5000])
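Notice how the latter gradients are exactly 1/4 of those from the sum; that's because the tensors here contain 4 elements.
About the third option, if I understand you correctly, that's not possible as stated. backward() needs a scalar to start from, so the individual pixel errors must first be aggregated into one, using sum, mean or anything else. (Calling backward() separately on each pixel's loss and letting the gradients accumulate gives the same result as the sum.)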
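To make the comparison concrete, here is a small sketch that runs all three options on the tensors from above. The helper name `grad_of` is made up for this example, and option 3 is interpreted (an assumption about what was meant) as one backward() call per element, relying on the fact that `.grad` accumulates across calls.

```python
import torch

target = torch.tensor([4, 2, 5, 4], dtype=torch.float32)

def grad_of(reduce):
    # Hypothetical helper: builds a fresh leaf tensor each time,
    # so gradients never accumulate between the runs.
    x = torch.tensor([4, 1, 3, 7], dtype=torch.float32, requires_grad=True)
    reduce((x - target) ** 2).backward()
    return x.grad

grad_sum = grad_of(torch.sum)    # option 1
grad_mean = grad_of(torch.mean)  # option 2

# Option 3 (as interpreted here): backpropagate each element's loss separately.
# x.grad accumulates over the calls, which amounts to summing the losses.
x = torch.tensor([4, 1, 3, 7], dtype=torch.float32, requires_grad=True)
per_element = (x - target) ** 2
for i in range(per_element.numel()):
    per_element[i].backward(retain_graph=True)
grad_each = x.grad

print(grad_sum)   # same values as the sum example above
print(grad_mean)  # grad_sum divided by the number of elements
print(grad_each)  # matches grad_sum
```

So sum and per-element backpropagation agree, and mean only rescales the gradient. In practice that constant factor can be absorbed into the learning rate.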
I have a 5 x 5 x 21 array. The last dimension represents channels. In a few channels, only one of the pixel values is 1 and the rest are 0. For the other channels, all of the pixel values are 0. I am applying softmax activation along the spatial dimension while training a deep neural network. Is it a good idea to use softmax even when all the values are zero, i.e. when the sum of all pixel values along the spatial dimension is not equal to 1?
I am not sure I understand your question.
Softmax should be applied in places where we want an [almost] one-hot distribution in the trained network. The output of softmax defines a distribution (its sum is equal to 1), but there are no restrictions on the input of softmax. If you pass all zeros to softmax, you will get a uniform distribution as the output.
Whether that makes sense depends on the goal of the network.
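A quick way to see this is to apply softmax over the 25 spatial positions of a single 5 x 5 channel. This is only an illustrative sketch in NumPy: the `softmax` helper and the position of the 1 are made up for the example, and the arrays stand in for whatever is fed into the softmax.

```python
import numpy as np

def softmax(v):
    # Subtracting the max does not change the result; it only avoids overflow.
    e = np.exp(v - np.max(v))
    return e / e.sum()

zero_channel = np.zeros((5, 5))        # a channel where every pixel is 0
one_hot_channel = np.zeros((5, 5))
one_hot_channel[2, 3] = 1.0            # arbitrary position, chosen for illustration

# Softmax along the (flattened) spatial dimension
p_zero = softmax(zero_channel.ravel()).reshape(5, 5)
p_one_hot = softmax(one_hot_channel.ravel()).reshape(5, 5)

print(p_zero)            # every entry is 1/25
print(p_one_hot.sum())   # sums to 1 (up to rounding)
```

The all-zero input gives a uniform distribution, so the output always sums to 1 even though an all-zero target channel does not.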
I want to create a custom deep learning layer that takes a 1X1 neuron as input and uses it to scale a constant, predefined NXN matrix. I do not understand how to calculate the gradient for this layer.
I understand that in this case dLdZ is NXN and dLdX should be 1X1, but I don't understand what dZdX should be to satisfy that. It's obviously not a simple chained derivative where dLdX = dLdZ*dZdX, since the dimensions don't match.
The question is not really language dependent; I write here in Matlab.
%M is the constant NXN matrix
%X is 1X1X1Xb
Z = zeros(N,N,1,b);
for i = 1:b
Z(:,:,:,i) = squeeze(X(:,:,1,i))*M;
end
==============================
edit: the answer I got was very helpful. I now perform the calculation as follows:
dLdX = zeros(1,1,1,b);
for i = 1:b
dLdX(:,:,:,i) = sum(sum(dLdZ(:,:,:,i).*M));
end
This works perfectly. Thanks!!
I think your question is a little unclear. I will assume your goal is to propagate the gradients through the layer you defined above to the batch of scalar values. Let me answer according to how I understand it.
You have a parameter X, which is one scalar per sample and so has dimension b (b: batch_size). It is used to scale a constant matrix Z of dimension NxN. Let's assume you calculate some scalar loss L immediately from the scaled matrix Z' = Z*X, where Z' is of dimension bxNxN.
Then you can calculate the gradients with respect to X according to:
dL/dX = dL/dZ' * dZ'/dX --> Note that the dimensions of this product do match (unlike your initial impression), since dL/dZ' is bxNxN and dZ'/dX is bxNxN. Multiplying elementwise and summing over the two NxN indices yields dL/dX, which is of dimension b.
Did I understand you correctly?
Cheers
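For anyone who wants to check the rule numerically, here is a minimal NumPy sketch. The sizes, the random data and the toy loss L = 0.5 * sum(Z'^2) are all assumptions made for the example; the constant matrix is called M as in the question. It compares the analytic gradient with a finite-difference estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
N, b = 3, 4                       # illustrative sizes
M = rng.normal(size=(N, N))       # constant N x N matrix
X = rng.normal(size=b)            # one scalar per batch item

def forward(X):
    # Z[i] = X[i] * M, giving shape (b, N, N)
    return X[:, None, None] * M

def loss(X):
    # Toy scalar loss (an assumption): L = 0.5 * sum(Z**2), hence dL/dZ = Z
    return 0.5 * np.sum(forward(X) ** 2)

dLdZ = forward(X)
# Chain rule: dL/dX[i] = sum over j,k of dL/dZ[i,j,k] * M[j,k]
dLdX = np.sum(dLdZ * M, axis=(1, 2))

# Central finite differences as an independent check
eps = 1e-6
basis = np.eye(b)
numeric = np.array([
    (loss(X + eps * basis[i]) - loss(X - eps * basis[i])) / (2 * eps)
    for i in range(b)
])

print(dLdX)
print(numeric)
```

The `np.sum(dLdZ * M, axis=(1, 2))` line is the vectorised form of the `sum(sum(dLdZ(:,:,:,i).*M))` loop from the edit in the question.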
In my neural network model, I represent an 8-word sentence with an 8x256 embedding matrix. I want to give it to an LSTM as input, where the LSTM takes a single word embedding at a time and processes it. According to the PyTorch documentation, the input should be in the shape (seq_len, batch, input_size). What is the correct way to convert my input to the desired shape? I don't want to mix up the numbers by mistake. I am quite new to PyTorch and row-major calculations, so I wanted to ask here. I do it as follows; is it correct?
x = torch.rand(8,256)
lstm_input = torch.reshape(x,(8,1,256))
Your solution is correct: you added a singleton dimension for the "batch" dimension, leaving x with temporal dimension 8 and input dimension 256.
Since you are new to PyTorch, here are a few equivalent ways of doing the same thing:
x = x[:, None, :]
Putting None at dim=1 tells PyTorch to add a singleton dimension there.
Another way is to use view:
x = x.view(8, 1, 256)
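If you want to convince yourself that these really are equivalent and that nothing gets mixed up, a small check like the following may help. The variable names and `hidden_size=16` are arbitrary choices for the sketch; `unsqueeze(1)` is one more way to add the same dimension.

```python
import torch

x = torch.rand(8, 256)                    # (seq_len, input_size)

via_reshape = torch.reshape(x, (8, 1, 256))
via_none = x[:, None, :]
via_view = x.view(8, 1, 256)
via_unsqueeze = x.unsqueeze(1)

# All four give the same (seq_len, batch, input_size) tensor
print(via_reshape.shape)                  # torch.Size([8, 1, 256])

# Row i of x is still the embedding of word i
same_rows = all(torch.equal(via_reshape[i, 0], x[i]) for i in range(8))

# Feed it to an LSTM (hidden_size=16 is an arbitrary choice for illustration)
lstm = torch.nn.LSTM(input_size=256, hidden_size=16)
output, (h_n, c_n) = lstm(via_reshape)
print(output.shape)                       # torch.Size([8, 1, 16])
```

Because only a dimension of size 1 is inserted, the order of the elements in memory does not change, which is why none of the variants can scramble the words.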
I'm looking at this PyTorch starter tutorial:
https://pytorch.org/tutorials/beginner/blitz/neural_networks_tutorial.html#sphx-glr-beginner-blitz-neural-networks-tutorial-py
The zero_grad() function is being used to zero the gradients, which I take to mean that it's running with mini-batches. Is this a correct assumption? If so, where is the batch size defined?
I found the following for nn.Conv2d:
For example, nn.Conv2d will take in a 4D Tensor of nSamples x nChannels x Height x Width.
In that case, is nSamples the batch size?
But how do you specify the batch size for an nn.Linear layer? Do you decide what your mini-batches are when you load the data, or what?
I am making a few assumptions here that may be totally incorrect; please correct me if I'm wrong.
Thank you!
You predefine the batch size in the DataLoader. For a linear layer you do not specify the batch size, only the number of features of the previous layer and the number of features you wish to get after the linear operation.
This is a code sample from the PyTorch docs:
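import torch
from torch import nn

m = nn.Linear(20, 30)           # 20 input features, 30 output features
input = torch.randn(128, 20)    # a batch of 128 samples (Variable is no longer needed)
output = m(input)
print(output.size())            # torch.Size([128, 30])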
As Ryan said, you don't have to specify the batch size in Linear layers.
Here I'd like to add a few details to make this clearer.
Let's first consider the equation of a linear layer:
Y = X W^T + b
where X is a tensor with size batch_size * in_feats_dim, W is a weight matrix with size out_feats_dim * in_feats_dim, and b is a bias vector with size out_feats_dim.
By now you have probably noticed that your parameters, W and b, are independent of your batch size.
You can check the implementation in PyTorch's torch.nn.functional.linear, line 1000 to line 1002. It matches what we discussed above.
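Here is a short sketch that puts both answers together. The dataset of 100 random samples and batch_size=32 are made-up values for illustration: the batch size is set only on the DataLoader, while the layer's parameters keep the same shape whatever batch comes in.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

m = nn.Linear(20, 30)
print(m.weight.shape)   # torch.Size([30, 20]), no batch dimension
print(m.bias.shape)     # torch.Size([30])

# The batch size lives in the DataLoader, not in the layer
data = TensorDataset(torch.randn(100, 20))   # 100 hypothetical samples
loader = DataLoader(data, batch_size=32)

output_shapes = []
for (batch,) in loader:
    out = m(batch)
    output_shapes.append(tuple(out.shape))
print(output_shapes)    # the last batch is smaller: 100 = 3 * 32 + 4

# The layer computes Y = X W^T + b for any number of rows in X
x = torch.randn(5, 20)
manual = x @ m.weight.T + m.bias
matches = torch.allclose(m(x), manual, atol=1e-5)
```

The same layer happily processes batches of 32, 4 or 5 rows, which is why there is no batch-size argument on nn.Linear.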
I am getting started with deep learning and have a basic question on CNNs.
I understand how gradients are adjusted using backpropagation according to a loss function.
But I thought the values of the convolving filter matrices (in CNNs) need to be determined by us.
I'm using Keras, and this is how the convolution layer was defined (from a tutorial):
classifier = Sequential()
classifier.add(Conv2D(32, (3, 3), input_shape = (64, 64, 3), activation = 'relu'))
Here, 32 filter matrices with dimensions 3x3 are used.
But how are the values of these 32x3x3 matrices determined?
It's not the gradients that are adjusted. The gradient calculated with the backpropagation algorithm is just the set of partial derivatives of the loss with respect to each weight in the network, and these components are in turn used to adjust the network weights in order to minimize the loss.
Take a look at this introductory guide.
The weights in the convolution layer in your example will be initialized to random values (according to a specific method), and then tweaked during training, using the gradient at each iteration to adjust each individual weight. The same goes for the weights in a fully connected layer, or any other layer with weights.
EDIT: I'm adding some more details about the answer above.
Let's say you have a neural network with a single layer, which has some weights W. Now, during the forward pass, you calculate your output yHat for your network, compare it with your expected output y for your training samples, and compute some cost C (for example, using the quadratic cost function).
Now, you're interested in making the network more accurate, i.e. you'd like to minimize C as much as possible. Imagine you want to find the minimum value of a simple function like f(x)=x^2. You can start at some random point (as you did with your network), then compute the slope of the function at that point (i.e. the derivative) and move downhill, against the slope, until you reach a minimum value (a local minimum at least).
With a neural network it's the same idea, with the difference that your inputs are fixed (the training samples), and you can see your cost function C as having n variables, where n is the number of weights in your network. To minimize C, you need the slope of the cost function C in each direction (i.e. with respect to each variable, each weight w), and that vector of partial derivatives is the gradient.
Once you have the gradient, the part where you "move a bit following the slope" is the weights update part, where you update each network weight according to its partial derivative (in general, you subtract some learning rate multiplied by the partial derivative with respect to that weight).
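As a toy illustration of that update rule, here is the f(x)=x^2 example in plain Python. The starting point, the learning rate and the number of steps are arbitrary values picked for the sketch.

```python
def f(x):
    return x ** 2

def df(x):
    # Derivative (slope) of f at x
    return 2 * x

x = 5.0                # arbitrary starting point
learning_rate = 0.1    # arbitrary step size
history = [f(x)]

for _ in range(50):
    # Move against the slope: subtract learning rate times the derivative
    x = x - learning_rate * df(x)
    history.append(f(x))

print(x)               # close to 0, the minimum of f
print(history[:3])     # the cost shrinks at every step
```

With n weights instead of one x, the derivative becomes the gradient and the same subtraction is applied to every weight.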
A trained network is just a network whose weights have been adjusted over many iterations in such a way that the value of the cost function C over the training dataset is as small as possible.
This is the same for a convolutional layer too: you first initialize the weights at random (i.e. you place yourself at a random position on the plot of the cost function C), then compute the gradients, then "move downhill", i.e. you adjust each weight following the gradient in order to minimize C.
The only difference between a fully connected layer and a convolutional layer is how they calculate their outputs, and how the gradient is in turn computed, but the part where you update each weight with the gradient is the same for every weight in the network.
So, to answer your question, the filters in the convolutional layer are initially random and are later adjusted using the gradients computed by the backpropagation algorithm, as described above.
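To see this in code, here is a rough sketch. It uses PyTorch rather than Keras (a substitution made only for the example), with random input data and a dummy loss that are both invented here. It shows that the filters already hold values right after construction and that one gradient step changes them.

```python
import torch
from torch import nn

torch.manual_seed(0)

# 32 filters of size 3x3 over a 3-channel input, as in the Keras example
conv = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3)
print(conv.weight.shape)       # torch.Size([32, 3, 3, 3])

before = conv.weight.detach().clone()   # the random initial filters

optimizer = torch.optim.SGD(conv.parameters(), lr=0.1)
images = torch.randn(4, 3, 64, 64)      # random stand-in for training images
loss = conv(images).pow(2).mean()       # dummy loss, for illustration only

optimizer.zero_grad()
loss.backward()                         # gradient of the loss w.r.t. every filter value
optimizer.step()                        # weight = weight - lr * gradient

changed = not torch.equal(before, conv.weight.detach())
print(changed)                          # True: the filters were adjusted
```

Note that each of the 32 filters actually spans all 3 input channels, so the layer holds 32x3x3x3 weights plus 32 biases.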
Hope this helps!
Sergio0694 states, "The weights in the convolution layer in your example will be initialized to random values". So if they are random and, say, I want 10 filters, every execution of the algorithm could find different filters. Also, say I have the MNIST data set. Numbers are formed of edges and curves. Is it guaranteed that there will be an edge filter or a curve filter among the 10?
I mean, are the first 10 filters the most meaningful, most distinctive filters we can find?
best