I'm looking at this pytorch starter tutorial:
https://pytorch.org/tutorials/beginner/blitz/neural_networks_tutorial.html#sphx-glr-beginner-blitz-neural-networks-tutorial-py
The zero_grad() function is being used to zero the gradients, which suggests it's running with mini-batches. Is this a correct assumption? If so, where is the batch size defined?
I found the following for nn.conv2d:
For example, nn.Conv2d will take in a 4D Tensor of nSamples x nChannels x Height x Width.
In that case, is nSamples the batch size?
But how do you specify the batch size for an nn.Linear layer? Do you decide what your mini-batches are when you load the data, or what?
I am making a few assumptions here that may be totally incorrect, pls correct me if i'm wrong.
thank you!
You predefine the batch_size in the DataLoader. For a linear layer you do not specify the batch size, only the number of features coming out of the previous layer and the number of features you wish to get after the linear operation.
This is a code sample from the Pytorch Docs
m = nn.Linear(20, 30)          # in_features=20, out_features=30 -- no batch size here
input = torch.randn(128, 20)   # 128 happens to be the batch size
output = m(input)
print(output.size())           # torch.Size([128, 30])
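To make the first point concrete: the batch size is typically defined when you build the DataLoader, not in any layer. A minimal sketch (the dataset and shapes below are just placeholders):

import torch
from torch.utils.data import TensorDataset, DataLoader

# dummy dataset: 1000 samples with 20 features each
features = torch.randn(1000, 20)
labels = torch.randint(0, 2, (1000,))
dataset = TensorDataset(features, labels)

# the mini-batch size is chosen here, not in the layers
loader = DataLoader(dataset, batch_size=64, shuffle=True)

for x, y in loader:
    print(x.shape)  # torch.Size([64, 20]) -- 64 is the batch size
    break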
As Ryan said, you don't have to specify the batch size in Linear layers.
Here I'd like to add a bit more detail to make that clearer.
Let's first consider the equation of a linear layer:

Y = X * W^T + b

where X is a tensor with size batch_size x in_feats_dim, W is a weight matrix with size out_feats_dim x in_feats_dim, and b is a bias vector with size out_feats_dim.
So far you probably find that your parameters, W and b, are independent of your batch size.
You can check the implementation in PyTorch's torch.nn.functional.linear (around lines 1000 to 1002 of the source at the time of writing). It matches what we discussed above.
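To illustrate that batch-size independence concretely, here is a small sketch calling the functional version directly with two different batch sizes (the dimensions are arbitrary):

import torch
import torch.nn.functional as F

W = torch.randn(30, 20)   # out_feats_dim x in_feats_dim
b = torch.randn(30)       # out_feats_dim

out_small = F.linear(torch.randn(4, 20), W, b)     # batch_size = 4
out_large = F.linear(torch.randn(128, 20), W, b)   # batch_size = 128
print(out_small.shape, out_large.shape)  # torch.Size([4, 30]) torch.Size([128, 30])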
I am working on a classification task using transfer learning. I am using ResNet50 and weights from ImageNet.
My_model = ResNet50(include_top=False, weights='imagenet', input_tensor=None,
                    input_shape=(img_height, img_width, 3), pooling=None)
I didn't rescale my input images (whose pixel values are in the 0-255 range), but my result is quite good (acc: 93.25%). So my question is: do I need to rescale the images? Do you think my result is wrong without rescaling?
Thank you.
No, basically your result is not wrong. To give you some intuition: we standardize the pixel values to a range between 0 and 1 mainly to avoid producing large values during the forward-pass computation z = w*x + b, and consequently during the backward propagation.
Why do we do that?
To elaborate: the optimization algorithm depends directly on the result of the backward pass, so when the updates are driven by very large weight/bias values, we typically need a lot of epochs to reach the global minimum.
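If you do decide to rescale, here is a sketch of two common options in Keras; which one is appropriate depends on how the pretrained weights were produced, so treat this only as an illustration:

from keras.applications.resnet50 import preprocess_input
from keras.preprocessing.image import ImageDataGenerator

# Option 1: rescale raw 0-255 pixel values to [0, 1] while loading images
datagen_01 = ImageDataGenerator(rescale=1.0 / 255)

# Option 2: apply the same preprocessing the ImageNet weights were trained with
# (channel-wise mean subtraction etc., handled by preprocess_input)
datagen_imagenet = ImageDataGenerator(preprocessing_function=preprocess_input)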
In my neural network model, I represent an 8-word sentence with an 8x256 dimensional embedding matrix. I want to give it to an LSTM as input, where the LSTM takes a single word embedding at a time and processes it. According to the PyTorch documentation, the input should be in the shape (seq_len, batch, input_size). What is the correct way to convert my input to the desired shape? I don't want to mix up the numbers by mistake. I am quite new to PyTorch and row-major calculations, so I wanted to ask here. I do it as follows; is it correct?
x = torch.rand(8,256)
lstm_input = torch.reshape(x,(8,1,256))
Your solution is correct: you added a singleton dimension for the "batch" dimension, leaving x with temporal dimension 8 and input dimension 256.
Since you are new to pytorch, here are a few equivalent ways of doing the same thing:
x = x[:, None, :]
Putting None at dim=1 tells pytorch to add a singleton dimension.
Another way is to use view:
x = x.view(8, 1, 256)
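One more equivalent option, and arguably the most idiomatic one in PyTorch, is unsqueeze, which inserts a singleton dimension at the given position:

x = torch.rand(8, 256)
lstm_input = x.unsqueeze(1)   # add a singleton batch dimension at dim=1
print(lstm_input.shape)       # torch.Size([8, 1, 256])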
I am getting started with deep learning and have a basic question on CNN's.
I understand how gradients are adjusted using backpropagation according to a loss function.
But I thought the values of the convolving filter matrices (in CNNs) need to be determined by us.
I'm using Keras and this is how (from a tutorial) the convolution layer was defined:
classifier = Sequential()
classifier.add(Conv2D(32, (3, 3), input_shape = (64, 64, 3), activation = 'relu'))
There are 32 filters with dimensions 3x3 being used.
But how are the values for these 32 3x3 matrices determined?
It's not the gradients that are adjusted: the gradient calculated with the backpropagation algorithm is just the collection of partial derivatives with respect to each weight in the network, and those components are in turn used to adjust the network weights in order to minimize the loss.
Take a look at this introductory guide.
The weights in the convolution layer in your example will be initialized to random values (according to a specific method), and then tweaked during training, using the gradient at each iteration to adjust each individual weight. Same goes for weights in a fully connected layer, or any other layer with weights.
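You can see this directly in Keras: right after building the model, and before any training, the convolution kernels already hold random values drawn by the default initializer (glorot_uniform). A small sketch using the same layer as in your example:

from keras.models import Sequential
from keras.layers import Conv2D

classifier = Sequential()
classifier.add(Conv2D(32, (3, 3), input_shape=(64, 64, 3), activation='relu'))

kernels, biases = classifier.layers[0].get_weights()
print(kernels.shape)         # (3, 3, 3, 32): 32 filters of size 3x3 over 3 input channels
print(kernels[:, :, 0, 0])   # one 3x3 filter -- random values before training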
EDIT: I'm adding some more details about the answer above.
Let's say you have a neural network with a single layer, which has some weights W. Now, during the forward pass, you calculate your output yHat for your network, compare it with your expected output y for your training samples, and compute some cost C (for example, using the quadratic cost function).
Now, you're interested in making the network more accurate, i.e. you'd like to minimize C as much as possible. Imagine you want to find the minimum value of a simple function like f(x) = x^2. You can start at some random point (as you did with your network), then compute the slope of the function at that point (i.e. the derivative) and move in the downhill direction, until you reach a minimum value (a local minimum at least).
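For the f(x) = x^2 toy example, the whole procedure fits in a few lines (the starting point and learning rate are arbitrary):

x = 5.0    # random starting point
lr = 0.1   # learning rate
for i in range(50):
    grad = 2 * x     # derivative of f(x) = x^2 at the current point
    x -= lr * grad   # take a small step downhill
print(x)  # very close to 0, the minimum of x^2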
With a neural network it's the same idea, with the difference that your inputs are fixed (the training samples), and you can see your cost function C as having n variables, where n is the number of weights in your network. To minimize C, you need the slope of the cost function C in each direction (ie. with respect to each variable, each weight w), and that vector of partial derivatives is the gradient.
Once you have the gradient, the part where you "move a bit following the slope" is the weights update part, where you update each network weight according to its partial derivative (in general, you subtract some learning rate multiplied by the partial derivative with respect to that weight).
A trained network is just a network whose weights have been adjusted over many iterations in such a way that the value of the cost function C over the training dataset is as small as possible.
This is the same for a convolutional layer too: you first initialize the weights at random (ie. you place yourself on a random position on the plot for the cost function C), then compute the gradients, then "move downhill", ie. you adjust each weight following the gradient in order to minimize C.
The only difference between a fully connected layer and a convolutional layer is how they calculate their outputs, and how the gradient is in turn computed, but the part where you update each weight with the gradient is the same for every weight in the network.
So, to answer your question, those filters in the convolutional kernels are initially random and are later adjusted with the backpropagation algorithm, as described above.
Hope this helps!
Sergio0694 states, "The weights in the convolution layer in your example will be initialized to random values". So if they are random and, say, I want 10 filters, every run of the algorithm could find different filters. Also, say I have the MNIST dataset: the digits are formed of edges and curves. Is it guaranteed that there will be an edge filter or a curve filter among the 10?
I mean, are the first 10 filters the most meaningful, most distinctive filters we can find?
best
I found that in the current Keras, all input arrays (x) must have the same number of samples.
For many multi-input and multi-output models, it would be more desirable if we could define a different number of samples (i.e. batch size) for each input.
This is really important when one input X1 is much 'cheaper' than another input X2.
Say now I have two inputs X1, X2 and two outputs Y1, Y2.
Y1 is a function of X1 and Y2 is a function of X1,X2.
The mapping X1->Y1 is much faster ('cheaper') to train than the mapping X1,X2->Y2.
So I may desire a large batch size of X1 and a small batch size of X2.
Or is it possible to hack the current code so as to make input with different batch-size possible?
Looking forward to anyone who can give me some suggestions. Thanks!
You can assign different weights to your samples. By setting the sample_weight argument (https://keras.io/models/model/#fit), your training algorithm will take cheaper samples into account.
If you want to train on different batch sizes, you will need to use model.train_on_batch and pass in batches of different sizes yourself. This essentially means writing your own fit loop, as in the sketch below.
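Here is a rough sketch of what such a hand-written loop could look like. It assumes the problem has been split into two models, model_1 for X1 -> Y1 and model_2 for (X1, X2) -> Y2, with X1, X2, Y1, Y2 as numpy arrays; all of these names are placeholders, not part of your current code:

big_bs, small_bs = 256, 32   # different batch sizes for the two mappings

for epoch in range(10):
    # cheap mapping X1 -> Y1: train with large batches
    for i in range(0, len(X1), big_bs):
        model_1.train_on_batch(X1[i:i + big_bs], Y1[i:i + big_bs])

    # expensive mapping (X1, X2) -> Y2: train with small batches
    for i in range(0, len(X2), small_bs):
        model_2.train_on_batch([X1[i:i + small_bs], X2[i:i + small_bs]],
                               Y2[i:i + small_bs])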
I am trying to implement discriminant condition codes in Keras as proposed in
Xue, Shaofei, et al., "Fast adaptation of deep neural network based
on discriminant codes for speech recognition."
The main idea is you encode each condition as an input parameter and let the network learn dependency between the condition and the feature-label mapping. On a new dataset instead of adapting the entire network you just tune these weights using backprop. For example say my network looks like this
X ----> |-----|
        | DNN |----> Y
Z ----> |-----|

X: features, Y: labels, Z: condition codes
Now given a pretrained DNN, and X',Y' on a new dataset I am trying to estimate the Z' using backprop that will minimize prediction error on Y'. The math seems straightforward except I am not sure how to implement this in keras without having access to the backprop itself.
For instance, can I add an Input() layer with trainable=True with all other layers set to trainable= False. Can backprop in keras update more than just layer weights? Or is there a way to hack keras layers to do this?
Any suggestions welcome.
thanks
I figured out how to do this (exactly) in Keras by looking at fchollet's post here
Using the keras backend I was able to compute the gradient of my loss w.r.t to Z directly and used it to drive the update.
Code below:
import keras.backend as K
import numpy as np

model.summary()  # pretrained model; X, Y, Z are nodes in its graph (see below)

loss = K.categorical_crossentropy(Y, Y_out)        # cross-entropy between labels and predictions
grads = K.gradients(loss, Z)                        # gradient of the loss w.r.t. Z
grads /= (K.sqrt(K.mean(K.square(grads))) + 1e-5)   # normalize the gradient
iterate = K.function([X, Z], [loss, grads])

step = 0.1
Z_adapt = Z_in.copy()
for i in range(100):
    loss_val, grads_val = iterate([X_in, Z_adapt])
    Z_adapt -= grads_val[0] * step                  # gradient-descent update on Z
    print("iter:", i, np.mean(loss_val))

print("Before:")
print(model.evaluate([X_in, Z_in], Y_out))
print("After:")
print(model.evaluate([X_in, Z_adapt], Y_out))
X, Y, Z are nodes in the model graph. Z_in is an initial value for Z'; I set it to an average value from the train set. Z_adapt is the result after 100 iterations of gradient descent and should give you a better result.
Assume that the size of Z is m x n. Then you can first define an input layer of size m*n x 1. The input will be an m*n x 1 vector of ones. You can define a dense layer containing m*n neurons and set trainable=True for it. The response of this layer will give you a flattened version of Z. Reshape it appropriately and give it as input to the rest of the network, which can be appended after this layer.
Keep in mind that if the size of Z is too large, the network may not be able to learn a dense layer with that many neurons. In that case, you may need to put additional constraints on it or look into convolutional layers. However, convolutional layers will put some structural constraints on Z.
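A rough sketch of that idea; m, n, the layer names and the downstream wiring are placeholders you would replace with your own model:

import numpy as np
from keras.layers import Input, Dense, Reshape

m, n = 4, 16                      # size of Z (placeholder values)

ones_in = Input(shape=(m * n,))   # fed with a constant vector of ones
z_flat = Dense(m * n, use_bias=False, trainable=True, name='z_codes')(ones_in)  # output learns the flattened Z
z = Reshape((m, n))(z_flat)       # flattened Z reshaped back to m x n

# connect `z` (together with the feature input) to the frozen, pretrained part
# of the network, then train so that only the 'z_codes' weights get updated

batch_size = 8
ones_batch = np.ones((batch_size, m * n))   # the actual array fed into ones_in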