PyTorch - unexpected shape of model parameters weights - deep-learning

I created a fully connected network in Pytorch with an input layer of shape (1,784) and a first hidden layer of shape (1,256).
To be short: nn.Linear(in_features=784, out_features=256, bias=True)
Method 1 : model.fc1.weight.data.shape gives me torch.Size([128, 256]), while
Method 2 : list(model.parameters())[0].shape gives me torch.Size([256, 784])
In fact, between an input layer of size 784 and a hidden layer of size 256, I was expecting a matrix of shape (784,256).
So, in the first case, I see the size of the next hidden layer (128), which does not make sense for the weights between the input and first hidden layer, and, in the second case, it looks like PyTorch took the transpose of the weight matrix.
I don't really understand how PyTorch shapes the different weight matrices, or how I can access individual weights after training. Should I use method 1 or method 2? When I display the corresponding tensors, the outputs look totally similar, while the shapes are different.

In PyTorch, the weight matrix of a Linear layer is stored with shape (out_features, in_features) and is transposed before the matmul with the input. That's why the dimensions are flipped from what you expect; i.e., instead of [784, 256], you observe [256, 784].
You can see this in the PyTorch source code for nn.Linear, where we have:
...
self.weight = Parameter(torch.Tensor(out_features, in_features))
...
def forward(self, input):
    return F.linear(input, self.weight, self.bias)
When looking at the implementation of F.linear, we see the corresponding line that multiplies the input matrix with the transpose of the weight matrix:
output = input.matmul(weight.t())
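To answer the "how do I access individual weights after training" part: both methods point at the same Parameter objects, so either works. Below is a minimal sketch (assuming a model whose first layer is registered as fc1) showing that the two access paths return the same (out_features, in_features) tensor:
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(in_features=784, out_features=256, bias=True)

    def forward(self, x):
        return self.fc1(x)

model = Net()
print(model.fc1.weight.shape)             # torch.Size([256, 784])
print(list(model.parameters())[0].shape)  # torch.Size([256, 784]) -- the same tensor
# weight[i, j] is the weight connecting input feature j to output unit i
print(model.fc1.weight[0, 5].item())
Given the shapes you report, model.fc1 in your network apparently is not the 784 -> 256 layer (its weight is 128x256, i.e. a 256 -> 128 mapping), while list(model.parameters())[0] is; so double-check which attribute name corresponds to which layer, e.g. by printing the model.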

Related

Combine two tensors of same dimension to get a final output tensor using trainable weights

While working on a problem related to question answering (MRC), I have implemented two different architectures that independently give two tensors (probability distributions over the tokens). Both tensors are of dimension (batch_size, 512). I wish to obtain a final output of the form (batch_size, 512). How can I combine the two tensors using trainable weights and then train the model on the final prediction?
Edit (Additional Information):
So in the forward function of my NN model, I have used BERT model to encode the 512 tokens. These encodings are 768 dimensional. These are then passed to a Linear layer nn.Linear(768,1) to output a tensor of shape (batch_size,512,1). Apart from this I have another model built on top of the BERT encodings that also yields a tensor of shape (batch_size, 512, 1). I wish to combine these two tensors to finally get a tensor of shape (batch_size, 512, 1) which can be trained against the output logits of the same shape using CrossEntropyLoss.
Please share the PyTorch code snippet if possible.
Assume your two vectors are V1 and V2. You need to combine them (ensembling) to get a new vector. You can use a weighted sum like this:
alpha = sigmoid(alpha)
V_final = alpha * V1 + (1 - alpha) * V2
where alpha is a learnable scalar. The sigmoid bounds the weight between 0 and 1,
and you can initialise alpha = 0 so that sigmoid(alpha) is 0.5, meaning you are adding V1 and V2 with equal weights.
This is a linear combination, and there can be non-linear versions as well.
For example, you can have a nonlinear layer that accepts the concatenation (V1;V2) and outputs a softmaxed combination, e.g. softmax(W * (V1;V2) + b).
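Since you asked for a PyTorch snippet, here is a minimal sketch of the weighted-sum idea (module and tensor names are illustrative, not from your code):
import torch
import torch.nn as nn

class WeightedSum(nn.Module):
    def __init__(self):
        super().__init__()
        # learnable scalar, initialised to 0 so that sigmoid(alpha) starts at 0.5
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, v1, v2):
        a = torch.sigmoid(self.alpha)
        return a * v1 + (1 - a) * v2

combine = WeightedSum()
v1 = torch.randn(4, 512, 1)  # e.g. (batch_size, 512, 1) as in your edit
v2 = torch.randn(4, 512, 1)
out = combine(v1, v2)        # same shape as the inputs
Because alpha is an nn.Parameter, it is returned by model.parameters() and gets updated by the optimizer together with the rest of the network.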

How to create a CNN for image classification with dynamic input

I would like to create a fully convolutional network for binary image classification in PyTorch that can take dynamic input image sizes, but I don't quite understand conceptually the idea behind changing the final layer from a fully connected layer to a convolution layer. Here and here both state that this is possible by using a 1x1 convolution.
Suppose I have a 16x16x1 image as input to the CNN. After several convolutions, the output is a 16x16x32. If using a fully connected layer, I can produce a single value output by creating 16*16*32 weights and feeding it to a single neuron. What I don't understand is how you would get a single value output by applying a 1x1 convolution. Wouldn't you end up with 16x16x1 output?
Check this link: http://cs231n.github.io/convolutional-networks/#convert
In this case, your convolution layer should be a 16 x 16 filter with 1 output channel. This will convert the 16 x 16 x 32 input into a single output.
Sample code to test:
from keras.layers import Conv2D, Input
from keras.models import Model
import numpy as np

inp = Input((16, 16, 32))
out = Conv2D(1, 16)(inp)   # one 16x16 filter -> a single 1x1x1 output
model = Model(inp, out)
model.summary()            # check the output shape
pred = model.predict(np.zeros((1, 16, 16, 32)))  # check on sample data
print(f'output is {np.squeeze(pred)}')
This fully convolutional approach is useful in segmentation tasks with patch-based pipelines, since you can speed up prediction (inference) by feeding in a larger portion of the image at once.
For classification tasks, you usually have an fc layer at the end. In that case, a layer like AdaptiveAvgPool2d is used, which ensures the fc layer sees a constant input feature size irrespective of the input image size (see the PyTorch sketch after the links below).
https://pytorch.org/docs/stable/nn.html#adaptiveavgpool2d
See this pull request for torchvision VGG: https://github.com/pytorch/vision/pull/747
In case of Keras, GlobalAveragePooling2D. See the example, "Fine-tune InceptionV3 on a new set of classes".
https://keras.io/applications/
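For illustration, a minimal PyTorch sketch of the AdaptiveAvgPool2d pattern (layer sizes are arbitrary, not from your network):
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d((1, 1)),  # always yields 32 x 1 x 1, whatever the spatial input size
    nn.Flatten(),
    nn.Linear(32, 2),              # binary classification head
)

for size in (16, 64, 113):         # different input resolutions
    x = torch.randn(1, 1, size, size)
    print(model(x).shape)          # torch.Size([1, 2]) every time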
I hope you are familiar with Keras. Your image is 16*16*1. It will be passed to the Keras convolutional layers, but first we have to create the model, e.g. model = Sequential(), which gives us a Keras model instance. Now we add a convolutional layer with its parameters, like
model.add(Conv2D(20,(2,2),padding="same"))
Here we apply 20 filters to the image, so it becomes 16*16*20. For richer features we add more conv layers, like
model.add(Conv2D(32,(2,2),padding="same"))
This applies 32 filters, after which the feature map has size 16*16*32.
Don't forget to put an activation after the conv layers. If you are new to this, you should study activations, optimization and the loss of the network; these are the basic parts of neural networks.
Now it's time to move to the fully connected layers. First we need to flatten the feature maps, because a fully connected layer only works on 2D tensors of shape (no_of_examples, features); in your case the dimension after flattening will be 16*16*32:
model.add(Flatten())
After flattening, the network passes the result to the fully connected layers:
model.add(Dense(32))
model.add(Activation("relu"))
model.add(Dense(8))
model.add(Activation("relu"))
model.add(Dense(2))
because you have a binary classification problem. If you had to classify 3 classes, the last layer would have 3 neurons; for 10 classes, the last Dense layer would have 10 neurons.
model.add(Activation("softmax"))
model.compile(loss='binary_crossentropy',
              optimizer=Adam(),
              metrics=['accuracy'])
return model
After this you have to fit the model:
estimator = model(2)
estimator.fit(X_train, y_train)
Full code:
from keras.models import Sequential
from keras.layers import Conv2D, Activation, Flatten, Dense
from keras.optimizers import Adam

def model(classes):
    model = Sequential()
    # conv blocks: Conv2D ====> relu
    model.add(Conv2D(20, (5, 5), padding="same", input_shape=(16, 16, 1)))  # input_shape matches the 16*16*1 example
    model.add(Activation("relu"))
    model.add(Conv2D(32, (5, 5), padding="same"))
    model.add(Activation("relu"))
    model.add(Flatten())
    model.add(Dense(32))
    model.add(Activation("relu"))
    model.add(Dense(8))
    model.add(Activation("relu"))
    # final Dense + softmax classifier over `classes` classes
    model.add(Dense(classes))
    model.add(Activation("softmax"))
    model.compile(loss='categorical_crossentropy',
                  optimizer=Adam(lr=0.0001, decay=1e-6),
                  metrics=['accuracy'])
    return model
You can take help from this kernel.

How to reshape a pytorch matrix without mixing elements of items in a batch

In my neural network model, I represent an 8-word sentence with an 8x256 dimensional embedding matrix. I want to give it to an LSTM as input, where the LSTM takes a single word embedding at a time and processes it. According to the PyTorch documentation, the input should be of shape (seq_len, batch, input_size). What is the correct way to convert my input to the desired shape? I don't want to mix up the numbers by mistake. I am quite new to PyTorch and row-major calculations, so I wanted to ask here. I do it as follows; is it correct?
x = torch.rand(8,256)
lstm_input = torch.reshape(x,(8,1,256))
Your solution is correct: you added a singleton dimension for the "batch" dimension, leaving x with temporal dimension 8 and input dimension 256.
Since you are new to PyTorch, here are a few equivalent ways of doing the same thing:
x = x[:, None, :]
Putting None at dim=1 tells PyTorch to add a singleton dimension there.
Another way is to use view:
x = x.view(8, 1, 256)
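As a quick sanity check (a small sketch, not part of your model), all three forms produce identical tensors:
import torch

x = torch.rand(8, 256)
a = torch.reshape(x, (8, 1, 256))
b = x[:, None, :]
c = x.view(8, 1, 256)
print(torch.equal(a, b), torch.equal(b, c))  # True True
print(a.shape)                               # torch.Size([8, 1, 256])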

What exactly is a Softmax output layer?

I'm trying to make a simple conv net in C#, and I want to make a Softmax output layer, but I don't really know what it is. Is it a fully connected layer with Softmax activation, or just a layer which outputs the softmax of the data?
Softmax is just a function that takes a vector and outputs a vector of the same size with values in the range [0, 1]. The values also follow the fundamental rule of probability, i.e. the values in the vector sum to 1.
softmax(x)_i = exp(x_i) / ( SUM_{j=1}^K exp(x_j) )   # for each i = 1, ..., K
But sometimes people use the term "softmax classifier", which refers to an MLP with an input layer and one output layer (which makes it a linear classifier, like a linear SVM), where the softmax function is applied to the outputs of the output layer. This setup gives the probability of the input belonging to each of the output classes.
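For illustration, a minimal NumPy sketch of the formula above (names are mine, not from your code):
import numpy as np

def softmax(x):
    # subtracting the max improves numerical stability and does not change the result
    e = np.exp(x - np.max(x))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs)        # all values in [0, 1]
print(probs.sum())  # 1.0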

Understanding convolutional layer depth in caffe

I'm new to caffe. I have trained a convolutional neural network whose first layer has 64 feature maps with 7x7 filters; when I get the weights of a filter I get a 7x7 matrix. However, my second layer has 32 feature maps of 3x3, and when I get the weights of any filter of that second layer I get 64 matrices of 3x3 kernels per filter.
Does anybody know why?
TL;DR:
The number of channels of a convolutional layer's filters must match the number of channels of that layer's input.
Let's see an example:
So let's say your network receives 3-channel colored images (RGB, for example) with dimensions 128x128 (height and width of 128 pixels) as input. So the input to your first convolution layer (let's call it conv1) would be 3x128x128 (channels x width x height).
Now suppose conv1 has 64 filters of size 7x7. In order to process all values from the input, a single filter must match the number of input channels being fed to that layer (or else some of the channels would not be taken into account during the convolution). So each filter must also be a 3-channel filter and, in the end, we will have 64 filters of dimension 3x7x7 for conv1.
Conv1 will output maps of dimension 64x128x128 (number of filters x width x height). If this is not clear to you, please check this demo [1].
The filters of the next conv layer (conv2) will then also have to match the number of channels of conv1's output: for example, 32 filters of size 64x5x5 (for filters with a spatial dimension of 5x5). And so on...
(For the sake of simplicity, we assumed that we zero-pad the input before convolving. Zero-padding is the "border" of zeroes that we wrap around the input map; it means the spatial dimensions, i.e. width and height, do not change. If there is no padding, the output is smaller than the input: e.g., for 7x7 filters with stride 1 and an input of size 128x128, the output would end up having size 122x122, since each spatial dimension shrinks by filter_size - 1.)
[1] CS231n Convolutional Neural Networks for Visual Recognition
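The same shape logic is easy to verify in PyTorch (a small sketch whose layer sizes mirror the example above, not your exact caffe net):
import torch
import torch.nn as nn

conv1 = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=7, padding=3)
conv2 = nn.Conv2d(in_channels=64, out_channels=32, kernel_size=5, padding=2)

print(conv1.weight.shape)  # torch.Size([64, 3, 7, 7])  -> 64 filters, each 3x7x7
print(conv2.weight.shape)  # torch.Size([32, 64, 5, 5]) -> 32 filters, each 64x5x5

x = torch.randn(1, 3, 128, 128)
y = conv1(x)
print(y.shape)             # torch.Size([1, 64, 128, 128])
print(conv2(y).shape)      # torch.Size([1, 32, 128, 128])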