PyTorch and Chainer implementations of the Linear layer- are they equivalent? - deep-learning

I want to use a Linear, Fully-Connected Layer as one of the input layers in my network. The input has shape (batch_size, in_channels, num_samples). It is based on the Tacotron paper: https://arxiv.org/pdf/1703.10135.pdf, the Enocder prenet part.
It feels to me as if Chainer and PyTorch have different implementations of the Linear layer - are they really performing the same operations or am I misunderstanding something?
In PyTorch, behavior of the Linear layer follows the documentations: https://pytorch.org/docs/0.3.1/nn.html#torch.nn.Linear
according to which, the shape of the input and output data are as follows:
Input: (N,∗,in_features) where * means any number of additional dimensions
Output: (N,∗,out_features) where all but the last dimension are the same shape as the input.
Now, let's try creating a linear layer in pytorch and performing the operation. I want an output with 8 channels, and the input data will have 3 channels.
import numpy as np
import torch
from torch import nn
linear_layer_pytorch = nn.Linear(3, 8)
Let's create some dummy input data of shape (1, 4, 3) - (batch_size, num_samples, in_channels:
data = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4], dtype=np.float32).reshape(1, 4, 3)
data_pytorch = torch.from_numpy(data)
and finally, perform the operation:
results_pytorch = linear_layer_pytorch(data_pytorch)
results_pytorch.shape
the shape of the output is as follows: Out[27]: torch.Size([1, 4, 8])
Taking a look at the source of the PyTorch implementation:
def linear(input, weight, bias=None):
# type: (Tensor, Tensor, Optional[Tensor]) -> Tensor
r"""
Applies a linear transformation to the incoming data: :math:`y = xA^T + b`.
Shape:
- Input: :math:`(N, *, in\_features)` where `*` means any number of
additional dimensions
- Weight: :math:`(out\_features, in\_features)`
- Bias: :math:`(out\_features)`
- Output: :math:`(N, *, out\_features)`
"""
if input.dim() == 2 and bias is not None:
# fused op is marginally faster
ret = torch.addmm(bias, input, weight.t())
else:
output = input.matmul(weight.t())
if bias is not None:
output += bias
ret = output
return ret
It transposes the weight matrix that is passed to it, broadcasts it along the batch_size axis and performs a matrix multiplications. Having in mind how a linear layer works, I imagine it as 8 nodes, connected through a synapse, holding a weight, with every channel in an input sample, thus in my case it has 3*8 weights. And that is exactly the shape I see in debugger (8, 3).
Now, let's jump to Chainer. The Chainer's linear layer documentation is available here: https://docs.chainer.org/en/stable/reference/generated/chainer.links.Linear.html#chainer.links.Linear. According to this documentation, the Linear layer wraps the function linear, which according to the docs, flattens the input along the non-batch dimensions and the shape of it's weight matrix is (output_size, flattend_input_size)
import chainer
linear_layer_chainer = chainer.links.Linear(8)
results_chainer = linear_layer_chainer(data)
results_chainer.shape
Out[21]: (1, 8)
Creating the layer as linear_layer_chainer = chainer.links.Linear(3, 8) and calling it causes a size mismatch. So in case of chainer, I have gotten a totally different results, because this time around I have a weight matrix that is of shape (8, 12) and my results have a shape of (1, 8). So now, here is my question : since the results are clearly different,both the weight matrices and the outputs have different shapes, how can I make them equivalent and what should be the desired output? In the PyTorch implementation of Tacotron it seems that the PyTorch approach is used as is (https://github.com/mozilla/TTS/blob/master/layers/tacotron.py) - Prenet. If that is the case, how can I make the Chainer produce the same results (I have to implement this in Chainer). I will be grateful for any inshight, sorry that the post has gotten this long.

Chainer Linear layer (a bit frustratingly) does not apply the transformation to the last axis. Chainer flattens the rest of the axes. Instead you need to provide how many batch axes there are, documentation which is 2 in your case:
# data.shape == (1, 4, 3)
results_chainer = linear_layer_chainer(data, n_batch_axes=2)
# 2 batch axes (1,4) means you apply linear to (..., 3)
# results_chainer.shape == (1, 4, 8)
You can also use l(data, n_batch_axes=len(data.shape)-1) to always apply to the last dimension which is the default behaviour in PyTorch, Keras etc.

Related

how to model and train a PyTorch Model to to learn a mathematical function f(x)?

I'm currently learning how to use pytorch to model NNs and did the "Getting Started" Session on the PyTorch Website.
I tried to train a PyTorch NN to apply the function e.g. f(x)=2x-1 to a given input integer list but my model is far apart from learning the right thing.
How can I model and train a PyTorch model to learn a given mathematical function f(x) ?
I've tried this model and trained it with 10 random numbers with labels generated by the 'myFunc' function to learn the function 2x-1.
Thanks for your help.
batch_size = 10
def myFunc(a):
#y = 2x-1
return 2*a-1
class NeuralNetwork(nn.Module):
def __init__(self):
super().__init__()
self.lin1 = nn.Linear(batch_size,1)
self.lin2 = nn.Linear(1,batch_size)
def forward(self, x):
x = self.lin1(x)
x = F.relu(x)
x = self.lin2(x)
return x
model = NeuralNetwork()
Theoretically for your example of an affine-linear function over a bounded interval you need only
linear(bias) -> relu -> linear(bias)
with one node per linear layer. Or just one linear layer without activation.
For more general functions, you will need larger layers in the construction of the first type, with one node for every piece in a piece-wise approximation. The last layer always needs to be linear without activation. Using more layers might give more pieces with less total nodes.

Training LSTM but output converges to a single value after a few inputs

Background
I'm learning about LSTMs and figured I'd try training a very basic LSTM model (big believer of learning through doing).
To start with something basic, I tried implement a LSTM that would sum up the last 10 inputs it has seen. I generated a dataset consisting of 1000 random numbers between 0 and 1, with 1000 labels representing the sum of the previous 10 numbers (label[i] = data[i-9:i+1].sum()), and tried to train the LSTM to recognize this pattern.
I know that simpler models can solve this (ie. linear regression), but I believe that LSTMs should also be able to solve this fairly basic problem.
My initial implementation does seem to work, but when I try to improve the implementation, I start getting constant output values after a few timestamps, looks like it's approximately the average of the training labels.
I'd appreciate any insights as to why the second and third iterations don't work, especially the third iteration, since it looks like it's the same implementation as what I read in "Deep Learning for Coders with fastai & PyTorch" book.
What I've tried so far
I've done 3 iterations so far:
Initially, I generated all sub-sequences of length 10 from the input data along with the corresponding label ([(data[i-9:i+1], label[i]) for i in range(9, len(data))] and fed this into the LSTM
This iteration worked very well, if I feed a sequence of 10 inputs I get an output from the LSTM that is very close to the sum. However, it is kinda cheating in that I'm basically telling the LSTM that the sequerce is length 10. I believe that LSTM should be able to infer the sequence length, so I tried to remove that bit of information.
In my second iteration, I feed the entire sequence into the LSTM at once: ([data[i] for i in range(len(data)), [label[i] for i in range(len(data))]). Basically, a single input with a sequence length of 1000, with a single output of 1000 labels.
However, after training, while running on validation data of 100, all except the first few labels are always a constant number, approximately the average of the training labels.
In my last iteration, I tried feeding inputs one at a time to the LSTM (1000 inputs with a sequence length of 1), with manually storing the hidden and cell states and passing it into the next run with the next input. This produces similar results to #2.
Network setup
For all runs, I used a single layer LSTM with 25 hidden since it's a fairly simple problem. I did try adding layers with dropout or increasing hidden size but didn't help.
Code samples
First iteration:
class LSTMModel(Module):
def __init__(self, seq_len, layers, input_size, hidden_size):
self.lstm1 = nn.LSTM(input_size=input_size, hidden_size=hidden_size, num_layers=layers, bidirectional=False, batch_first=True)
self.fc = nn.Linear(hidden_size, 1)
def forward(self,x):
x, _ = self.lstm1(x)
# [:,-1,:] to grab the last output for sequence length of 10
x = self.fc(x[:,-1,:])
return x[:,0]
Second iteration:
class LSTMModel(Module):
def __init__(self, layers, input_size, hidden_size):
self.lstm1 = nn.LSTM(input_size=input_size, hidden_size=hidden_size, num_layers=layers, bidirectional=False)
self.fc = nn.Linear(hidden_size, 1)
self.input_size = input_size
def forward(self,x):
x,h = self.lstm1(x)
x = self.fc(x)
return x
Final iteration:
class LSTMModel(Module):
def __init__(self, layers, input_size, hidden_size):
self.lstm1 = nn.LSTM(input_size=input_size, hidden_size=hidden_size, num_layers=layers, bidirectional=False, batch_first=True)
self.fc = nn.Linear(hidden_size, 1)
self.h = [torch.zeros(layers, 1, hidden_size).cuda() for _ in range(2)]
def forward(self,x):
x,h = self.lstm1(x, self.h)
self.h = [h_.detach() for h_ in h]
x = self.fc(x)
return x
def reset(self):
for h in self.h: h.zero_()

Compare two segmentation maps predictions

I am using consistency between two predicted segmentation maps on unlabeled data. For labeled data, I’m using nn.BCEwithLogitsLoss and Dice Loss.
I’m working on videos that’s why 5 dimensions output.
(batch_size, channels, frames, height, width)
I want to know how can we compare two predicted segmentation maps.
gmentation maps.
# gt_seg - Ground truth segmentation map. - (8, 1, 8, 112, 112)
# aug_gt_seg - Augmented ground truth segmentation map - (8, 1, 8, 112, 112)
predicted_seg_1 = model(data, targets) # (8, 1, 8, 112, 112)
predicted_seg_2 = model(augmented_data, augmented_targets) #(8, 1, 8, 112, 112)
# define criterion
seg_criterion_1 = nn.BCEwithLogitsLoss(size_average=True)
seg_criterion_2 = nn.DiceLoss()
# labeled losses
supervised_loss_1 = seg_criterion_1(predicted_seg_1, gt_seg)
supervised_loss_2 = seg_criterion_2(predicted_seg_1, gt_seg)
# Consistency loss
if consistency_loss == "l2":
consistency_criterion = nn.MSELoss()
cons_loss = consistency_criterion(predicted_gt_seg_1, predicted_gt_seg_2)
elif consistency_loss == "l1":
consistency_criterion = nn.L1Loss()
cons_loss = consistency_criterion(predicted_gt_seg_1, predicted_gt_seg_2)
total_supervised_loss = supervised_loss_1 + supervised_loss_2
total_consistency_loss = cons_loss
Is this the right way to apply consistency between two predicted segmentation maps?
I’m mainly confused due to the definition on the torch website. It’s a comparison with input x with target y. I thought it looks correct since I want both predicted segmentation maps similar. But, 2nd segmentation map is not a target. That’s why I’m confused. Because if this could be valid, then every loss function can be applied in some or another way. That doesn’t look appealing to me. If it’s the correct way to compare, can it be extended to other segmentation-based losses such as Dice Loss, IoU Loss, etc.?
One more query regarding loss computation on labeled data:
# gt_seg - Ground truth segmentation map
# aug_gt_seg - Augmented ground truth segmentation map
predicted_seg_1 = model(data, targets)
predicted_seg_2 = model(augmented_data, augmented_targets)
# define criterion
seg_criterion_1 = nn.BCEwithLogitsLoss(size_average=True)
seg_criterion_2 = nn.DiceLoss()
# labeled losses
supervised_loss_1 = seg_criterion_1(predicted_seg_1, gt_seg)
supervised_loss_2 = seg_criterion_2(predicted_seg_1, gt_seg)
# augmented labeled losses
aug_supervised_loss_1 = seg_criterion_1(predicted_seg_2, aug_gt_seg)
aug_supervised_loss_2 = seg_criterion_2(predicted_seg_2, aug_gt_seg)
total_supervised_loss = supervised_loss_1 + supervised_loss_2 + aug_supervised_loss_1 + aug_supervised_loss_2
Is the calculation of total_supervised_loss correct? Can I apply loss.backward() on this?
Yes, this is a valid way to implement consistency loss. The nomenclature used by pytorch documentation lists one input as the target and the other as the prediction, but consider that L1, L2, Dice, and IOU loss are all symmetrical (that is, Loss(a,b) = Loss(b,a)). So any of these functions will accomplish a form of consistency loss with no regard for whether one input is actually a ground-truth or "target".

About Softmax function as output layer in preddictions

I know the softmax activation function: The sum of the ouput layer with a softmax activation is equal to one always, that say: the output vector is normalized, also this is neccesary because the maximun accumalated probability can not exceeds one. Ok, this is clear.
But my question is the following: When the softmax is used as a classifier, is use the argmax function to get the index of the class. so, what is the difference between get a acumulative probability of one or higher if the important parameter is the index to get the correct class?
An example in python, where I made another softmax (really is not a softmax function) but the classifier works in the same way that the classifier with the real softmax function:
import numpy as np
classes = 10
classes_list = ['dog', 'cat', 'monkey', 'butterfly', 'donkey',
'horse', 'human', 'car', 'table', 'bottle']
# This simulates and NN with her weights and the previous
# layer with a ReLU activation
a = np.random.normal(0, 0.5, (classes,512)) # Output from previous layer
w = np.random.normal(0, 0.5, (512,1)) # weights
b = np.random.normal(0, 0.5, (classes,1)) # bias
# correct solution:
def softmax(a, w, b):
a = np.maximum(a, 0) # ReLU simulation
x = np.matmul(a, w) + b
e_x = np.exp(x - np.max(x))
return e_x / e_x.sum(axis=0), np.argsort(e_x.flatten())[::-1]
# approx solution (probability is upper than one):
def softmax_app(a, w, b):
a = np.maximum(a, 0) # ReLU simulation
w_exp = np.exp(w)
coef = np.sum(w_exp)
matmul = np.exp(np.matmul(a,w) + b)
res = matmul / coef
return res, np.argsort(res.flatten())[::-1]
teor = softmax(a, w, b)
approx = softmax_app(a, w, b)
class_teor = classes_list[teor[-1][0]]
class_approx = classes_list[approx[-1][0]]
print(np.array_equal(teor[-1], approx[-1]))
print(class_teor == class_approx)
The obtained class between both methods are always the same (I'm talking about preddictions, not to training). I ask this because I'm implementing the softmax in a FPGA device and with the second method it is not necessary 2 runs to calculate the softmax function: first to find the exponentiated matrix and the sum of it and second to perform the division.
Let's review the uses of softmax:
You should use softmax if:
You are training a NN and want to limit the range of output values during training (you could use other activation functions instead). This can marginally help towards clipping the gradient.
You are performing inference on a NN and you want to obtain a metric on the "degree of confidence" of your classification result (in the range of 0-1).
You are performing inference on a NN and wish to get the top K results. In this case it is recommended as a way to have a "degree of confidence" metric to compare them.
You are performing inference on several NN (ensemble methods) and wish to average them out (otherwise their results wouldn't easily comparable).
You should not use (or remove) softmax if:
You are performing inference on a NN and you only care about the top class. Note that the NN could have been trained with Softmax (for better accuracy, faster convergence, etc..).
In your case, your insights are right: Softmax as an activation function in the last layer is meaningless if your problem only requires you to get the index of the maximum value during the inference phase. Besides, since you are targetting an FPGA implementation, this would only give you extra headaches.

Is there some "scale invariant" substitute for the softmax function?

It is very common tu use softmax function for converting an array of values in an array of probabilities. In general, the function amplifies the probability of the greater values of the array.
However, this function is not scale invariant. Let us consider an example:
If we take an input of [1, 2, 3, 4, 1, 2, 3], the softmax of that is [0.024, 0.064, 0.175, 0.475, 0.024, 0.064, 0.175]. The output has most of its weight where the '4' was in the original input. That is, softmax highlights the largest values and suppress values which are significantly below the maximum value. However, if the input were [0.1, 0.2, 0.3, 0.4, 0.1, 0.2, 0.3] (which sums to 1.6) the softmax would be [0.125, 0.138, 0.153, 0.169, 0.125, 0.138, 0.153]. This shows that for values between 0 and 1 softmax, in fact, de-emphasizes the maximum value (note that 0.169 is not only less than 0.475, it is also less than the initial proportion of 0.4/1.6=0.25).
I would need a function that amplifies differences between values in an array, emphasizing the greatest values and that is not so affected by the scale of the numbers in the array.
Can you suggest some function with these properties?
As Robert suggested in the comment, you can use temperature. Here is a toy realization in Python using numpy:
import numpy as np
def softmax(preds):
exp_preds = np.exp(preds)
sum_preds = np.sum(exp_preds)
return exp_preds / sum_preds
def softmax_with_temperature(preds, temperature=0.5):
preds = np.log(preds) / temperature
preds = np.exp(preds)
sum_preds = np.sum(preds)
return preds / sum_preds
def check_softmax_scalability():
base_preds = [1, 2, 3, 4, 1, 2, 3]
base_preds = np.asarray(base_preds).astype("float64")
for i in range(1,3):
print('logits: ', base_preds*i,
'\nsoftmax: ', softmax(base_preds*i),
'\nwith temperature: ', softmax_with_temperature(base_preds*i))
Calling check_softmax_scalability() would return:
logits: [1. 2. 3. 4. 1. 2. 3.]
softmax: [0.02364054 0.06426166 0.1746813 0.474833 0.02364054 0.06426166
0.1746813 ]
with temperature: [0.02272727 0.09090909 0.20454545 0.36363636 0.02272727 0.09090909
0.20454545]
logits: [2. 4. 6. 8. 2. 4. 6.]
softmax: [0.00188892 0.01395733 0.10313151 0.76204449 0.00188892 0.01395733
0.10313151]
with temperature: [0.02272727 0.09090909 0.20454545 0.36363636 0.02272727 0.09090909
0.20454545]
But the scale invariance comes with a cost: as you increase temperature, the output values will come closer to each other. Increase it too much, and you will have an output that looks like a uniform distribution. In your case, you should pick a low value for temperature to emphasize the maximum value.
You can read more about how temperature works here.