Is there some "scale invariant" substitute for the softmax function?

It is very common to use the softmax function to convert an array of values into an array of probabilities. In general, the function amplifies the probability of the greater values of the array.
However, this function is not scale invariant. Let us consider an example:
If we take an input of [1, 2, 3, 4, 1, 2, 3], its softmax is [0.024, 0.064, 0.175, 0.475, 0.024, 0.064, 0.175]. The output has most of its weight where the '4' was in the original input. That is, softmax highlights the largest values and suppresses values that are significantly below the maximum. However, if the input were [0.1, 0.2, 0.3, 0.4, 0.1, 0.2, 0.3] (which sums to 1.6), the softmax would be [0.125, 0.138, 0.153, 0.169, 0.125, 0.138, 0.153]. This shows that for values between 0 and 1, softmax in fact de-emphasizes the maximum value (note that 0.169 is not only less than 0.475, it is also less than the initial proportion of 0.4/1.6=0.25).
I would need a function that amplifies differences between values in an array, emphasizing the greatest values and that is not so affected by the scale of the numbers in the array.
Can you suggest some function with these properties?

As Robert suggested in the comment, you can use temperature. Here is a toy realization in Python using numpy:
import numpy as np

def softmax(preds):
    # Plain softmax: exponentiate, then normalize to sum to 1
    exp_preds = np.exp(preds)
    sum_preds = np.sum(exp_preds)
    return exp_preds / sum_preds

def softmax_with_temperature(preds, temperature=0.5):
    # log followed by exp is the same as preds ** (1 / temperature),
    # so a common scale factor cancels out; requires preds > 0
    preds = np.log(preds) / temperature
    preds = np.exp(preds)
    sum_preds = np.sum(preds)
    return preds / sum_preds

def check_softmax_scalability():
    base_preds = [1, 2, 3, 4, 1, 2, 3]
    base_preds = np.asarray(base_preds).astype("float64")
    # Compare both functions on the base input and on a rescaled copy
    for i in range(1, 3):
        print('logits: ', base_preds*i,
              '\nsoftmax: ', softmax(base_preds*i),
              '\nwith temperature: ', softmax_with_temperature(base_preds*i))
Calling check_softmax_scalability() prints:
logits: [1. 2. 3. 4. 1. 2. 3.]
softmax: [0.02364054 0.06426166 0.1746813 0.474833 0.02364054 0.06426166
0.1746813 ]
with temperature: [0.02272727 0.09090909 0.20454545 0.36363636 0.02272727 0.09090909
0.20454545]
logits: [2. 4. 6. 8. 2. 4. 6.]
softmax: [0.00188892 0.01395733 0.10313151 0.76204449 0.00188892 0.01395733
0.10313151]
with temperature: [0.02272727 0.09090909 0.20454545 0.36363636 0.02272727 0.09090909
0.20454545]
But the scale invariance comes with a cost: as you increase temperature, the output values will come closer to each other. Increase it too much, and you will have an output that looks like a uniform distribution. In your case, you should pick a low value for temperature to emphasize the maximum value.
You can read more about how temperature works here.
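Taking the log, dividing by the temperature and exponentiating again is the same as raising each value to the power 1/T, which is why a common scale factor cancels in the normalization. The following is a minimal sketch of that idea in plain Python (no numpy, so it runs without dependencies); the function name power_normalize is made up for illustration, and it assumes all inputs are strictly positive, since the log is undefined otherwise.

```python
def power_normalize(values, temperature=0.5):
    """Normalize positive values as x**(1/T) / sum(x**(1/T)).

    This equals softmax(log(x) / T), the form used in the answer above.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if any(v <= 0 for v in values):
        raise ValueError("all values must be strictly positive")
    # Dividing by the maximum first keeps the powers from overflowing and,
    # because the function is scale invariant, does not change the result.
    largest = max(values)
    powered = [(v / largest) ** (1.0 / temperature) for v in values]
    total = sum(powered)
    return [p / total for p in powered]


base = [1, 2, 3, 4, 1, 2, 3]
scaled = [0.1 * v for v in base]

# The two printed distributions agree up to rounding.
print([round(p, 4) for p in power_normalize(base)])
print([round(p, 4) for p in power_normalize(scaled)])

# A lower temperature puts more weight on the maximum value.
print(round(max(power_normalize(base, temperature=0.25)), 4))
```

With temperature=0.5 this squares the inputs before normalizing, so the largest entry gets 16/44 of the mass for both the original and the rescaled array.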

Related

Compare two segmentation maps predictions

I am using consistency between two predicted segmentation maps on unlabeled data. For labeled data, I’m using nn.BCEWithLogitsLoss and Dice Loss.
I’m working on videos, which is why the output has 5 dimensions:
(batch_size, channels, frames, height, width)
I want to know how we can compare two predicted segmentation maps.
# gt_seg - Ground truth segmentation map. - (8, 1, 8, 112, 112)
# aug_gt_seg - Augmented ground truth segmentation map - (8, 1, 8, 112, 112)
predicted_seg_1 = model(data, targets) # (8, 1, 8, 112, 112)
predicted_seg_2 = model(augmented_data, augmented_targets) #(8, 1, 8, 112, 112)
# define criterion
seg_criterion_1 = nn.BCEWithLogitsLoss(reduction="mean")
seg_criterion_2 = nn.DiceLoss()  # not part of torch.nn; assumed to be a custom module
# labeled losses
supervised_loss_1 = seg_criterion_1(predicted_seg_1, gt_seg)
supervised_loss_2 = seg_criterion_2(predicted_seg_1, gt_seg)
# Consistency loss
if consistency_loss == "l2":
    consistency_criterion = nn.MSELoss()
    cons_loss = consistency_criterion(predicted_seg_1, predicted_seg_2)
elif consistency_loss == "l1":
    consistency_criterion = nn.L1Loss()
    cons_loss = consistency_criterion(predicted_seg_1, predicted_seg_2)
total_supervised_loss = supervised_loss_1 + supervised_loss_2
total_consistency_loss = cons_loss
Is this the right way to apply consistency between two predicted segmentation maps?
I’m mainly confused by the definition on the torch website, which describes a comparison of an input x with a target y. I thought my approach looks correct, since I want both predicted segmentation maps to be similar. But the second segmentation map is not a target, which is why I’m confused: if this is valid, then every loss function could be applied in one way or another, and that doesn’t look appealing to me. If it is the correct way to compare, can it be extended to other segmentation-based losses such as Dice Loss, IoU Loss, etc.?
One more query regarding loss computation on labeled data:
# gt_seg - Ground truth segmentation map
# aug_gt_seg - Augmented ground truth segmentation map
predicted_seg_1 = model(data, targets)
predicted_seg_2 = model(augmented_data, augmented_targets)
# define criterion
seg_criterion_1 = nn.BCEWithLogitsLoss(reduction="mean")
seg_criterion_2 = nn.DiceLoss()  # not part of torch.nn; assumed to be a custom module
# labeled losses
supervised_loss_1 = seg_criterion_1(predicted_seg_1, gt_seg)
supervised_loss_2 = seg_criterion_2(predicted_seg_1, gt_seg)
# augmented labeled losses
aug_supervised_loss_1 = seg_criterion_1(predicted_seg_2, aug_gt_seg)
aug_supervised_loss_2 = seg_criterion_2(predicted_seg_2, aug_gt_seg)
total_supervised_loss = supervised_loss_1 + supervised_loss_2 + aug_supervised_loss_1 + aug_supervised_loss_2
Is the calculation of total_supervised_loss correct? Can I apply loss.backward() on this?
Yes, this is a valid way to implement consistency loss. The nomenclature used by the PyTorch documentation lists one input as the target and the other as the prediction, but consider that L1, L2, Dice, and IoU loss are all symmetrical (that is, Loss(a,b) = Loss(b,a)). So any of these functions will accomplish a form of consistency loss, regardless of whether one input is actually a ground truth or "target".
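The symmetry claim is easy to check numerically. Below is a small sketch in plain Python on flat lists of probabilities; the soft Dice and soft IoU formulas are one common formulation and are an assumption here (the question does not show how its Dice loss is defined), and all function names are illustrative rather than PyTorch API.

```python
def l1_loss(a, b):
    """Mean absolute difference."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def l2_loss(a, b):
    """Mean squared difference."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def soft_dice_loss(a, b, eps=1e-7):
    """1 - 2*overlap / (sum(a) + sum(b)), for values in [0, 1]."""
    overlap = sum(x * y for x, y in zip(a, b))
    return 1.0 - (2.0 * overlap + eps) / (sum(a) + sum(b) + eps)


def soft_iou_loss(a, b, eps=1e-7):
    """1 - overlap / union, for values in [0, 1]."""
    overlap = sum(x * y for x, y in zip(a, b))
    union = sum(a) + sum(b) - overlap
    return 1.0 - (overlap + eps) / (union + eps)


# Two made-up flattened predictions (already passed through a sigmoid).
pred_1 = [0.9, 0.2, 0.7, 0.1]
pred_2 = [0.8, 0.3, 0.4, 0.2]

for loss in (l1_loss, l2_loss, soft_dice_loss, soft_iou_loss):
    forward = loss(pred_1, pred_2)
    backward = loss(pred_2, pred_1)
    # Swapping the arguments leaves each of these losses unchanged.
    print(loss.__name__, round(forward, 4), abs(forward - backward) < 1e-12)
```

Because swapping the arguments does not change the value, it does not matter which of the two predictions is passed in the "target" slot for these particular losses.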

multi label problem with intermediate labels

I am trying to create a model for the following problem
id   input (diagnoses)   elapsed_days   output (medication)
1    [2,3,4]             0              [3,4]
1    [4,5,6]             7              [1]
1    [2,3]               56             [6,3]
2    [6,5,9,10]          0              [5,3,1]
Rather than a single label for the different codes over time, there are labels at each time period.
I am thinking that my architecture would be [input] -> [embedding for diagnoses] -> [append normalized elapsed days to embeddings] -> [LSTM] -> [FFNs] -> [labels over time]
I am familiar with how to set this up if there were a single label per id. Given there are labels for each row (i.e. multiple per id), should I be passing the hidden layers of the LSTM through the FFN and then assigning the labels? I would really appreciate if somebody could point me to a reference/blog/github/anything for this kind of problem or suggest an alternative approach here.
Assuming [6, 3] is equivalent to [3, 6] (that is, the order of the labels does not matter):
You can use Sigmoid activation with Binary Cross-Entropy loss function (nn.BCELoss class) instead of Softmax Cross-Entropy (nn.CrossEntropyLoss class).
However, the ground truth can then no longer be given as class integers, as it is when using nn.CrossEntropyLoss. You need to turn it into a kind of one-hot (multi-hot) encoding instead. For example, if the desired output is [6, 3] and the output layer has 10 nodes, y_true has to be [0, 0, 0, 1, 0, 0, 1, 0, 0, 0].
Depending on how you implement your data generator, this is one way to do it.
output = [3, 6]
out_tensor = torch.zeros(10)
out_tensor[output] = 1
But if [6, 3] is not equivalent to [3, 6], then more information about the problem is needed.
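Since the question has one label set per row (time step) rather than one per id, the same encoding can be applied to every step, giving a target of shape (time steps, number of classes) that can be compared against the per-step outputs. A minimal sketch in plain Python follows; the helper name multi_hot, the choice of 10 classes (taken from the example above) and zero-based medication codes are assumptions for illustration.

```python
def multi_hot(labels, num_classes):
    """Return a 0/1 vector with ones at the given label indices."""
    vector = [0.0] * num_classes
    for label in labels:
        if not 0 <= label < num_classes:
            raise ValueError(f"label {label} is outside 0..{num_classes - 1}")
        vector[label] = 1.0
    return vector


# Outputs for id 1 from the table in the question, one entry per time step.
patient_outputs = [[3, 4], [1], [6, 3]]

# One target vector per time step: shape (3 time steps, 10 classes).
targets = [multi_hot(step, num_classes=10) for step in patient_outputs]

for step_target in targets:
    print(step_target)
```

The order of the labels inside a step has no effect on the encoded vector, which matches the assumption that [6, 3] and [3, 6] mean the same thing.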

About the softmax function as output layer in predictions

I know the softmax activation function: the outputs of a layer with a softmax activation always sum to one, that is, the output vector is normalized. This is necessary because the maximum accumulated probability cannot exceed one. OK, this is clear.
But my question is the following: when softmax is used in a classifier, the argmax function is used to get the index of the class. So what difference does it make whether the accumulated probability is one or higher, if the only thing that matters is the index of the correct class?
Here is an example in Python, where I made another "softmax" (it is not really a softmax function), but the classifier works in the same way as the classifier with the real softmax function:
import numpy as np
classes = 10
classes_list = ['dog', 'cat', 'monkey', 'butterfly', 'donkey',
                'horse', 'human', 'car', 'table', 'bottle']
# This simulates a NN with its weights and the previous
# layer with a ReLU activation
a = np.random.normal(0, 0.5, (classes,512)) # Output from previous layer
w = np.random.normal(0, 0.5, (512,1)) # weights
b = np.random.normal(0, 0.5, (classes,1)) # bias
# correct solution:
def softmax(a, w, b):
    a = np.maximum(a, 0) # ReLU simulation
    x = np.matmul(a, w) + b
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0), np.argsort(e_x.flatten())[::-1]
# approx solution (the outputs can be greater than one):
def softmax_app(a, w, b):
    a = np.maximum(a, 0) # ReLU simulation
    w_exp = np.exp(w)
    coef = np.sum(w_exp)
    matmul = np.exp(np.matmul(a,w) + b)
    res = matmul / coef
    return res, np.argsort(res.flatten())[::-1]
teor = softmax(a, w, b)
approx = softmax_app(a, w, b)
class_teor = classes_list[teor[-1][0]]
class_approx = classes_list[approx[-1][0]]
print(np.array_equal(teor[-1], approx[-1]))
print(class_teor == class_approx)
The class obtained by both methods is always the same (I'm talking about predictions, not training). I ask this because I'm implementing the softmax on an FPGA device, and with the second method it is not necessary to make two passes to calculate the softmax function: a first one to compute the exponentiated matrix and its sum, and a second one to perform the division.
Let's review the uses of softmax:
You should use softmax if:
- You are training a NN and want to limit the range of output values during training (you could use other activation functions instead). This can marginally help towards clipping the gradient.
- You are performing inference on a NN and you want to obtain a metric on the "degree of confidence" of your classification result (in the range of 0-1).
- You are performing inference on a NN and wish to get the top K results. In this case it is recommended as a way to have a "degree of confidence" metric to compare them.
- You are performing inference on several NNs (ensemble methods) and wish to average them out (otherwise their results wouldn't be easily comparable).
You should not use (or remove) softmax if:
- You are performing inference on a NN and you only care about the top class. Note that the NN could have been trained with softmax (for better accuracy, faster convergence, etc.).
In your case, your insights are right: softmax as an activation function in the last layer is meaningless if your problem only requires you to get the index of the maximum value during the inference phase. Besides, since you are targeting an FPGA implementation, it would only give you extra headaches.
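The reason the index is unaffected is that the exponential is strictly increasing and the normalization divides every entry by the same positive number, so the ordering of the outputs is preserved. A small sketch in plain Python (the logits are made-up values, and the helper names are illustrative) shows that the argmax, and in fact the whole ranking, can be read from the raw pre-softmax values:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a flat list."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=values.__getitem__)


def ranking(values):
    """Indices sorted from largest to smallest entry."""
    return sorted(range(len(values)), key=values.__getitem__, reverse=True)


logits = [1.3, -0.2, 4.1, 0.7, 2.9]  # hypothetical last-layer outputs
probs = softmax(logits)

# Both the top class and the full ordering match the raw logits.
print(argmax(logits) == argmax(probs))
print(ranking(logits) == ranking(probs))
```

For an inference-only pipeline that needs just the top class, this means the exponentiation, the sum and the division can all be skipped.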

PyTorch and Chainer implementations of the Linear layer - are they equivalent?

I want to use a Linear, Fully-Connected Layer as one of the input layers in my network. The input has shape (batch_size, in_channels, num_samples). It is based on the Tacotron paper: https://arxiv.org/pdf/1703.10135.pdf, the Encoder prenet part.
It feels to me as if Chainer and PyTorch have different implementations of the Linear layer - are they really performing the same operations or am I misunderstanding something?
In PyTorch, the behavior of the Linear layer follows the documentation: https://pytorch.org/docs/0.3.1/nn.html#torch.nn.Linear
according to which, the shape of the input and output data are as follows:
Input: (N,∗,in_features) where * means any number of additional dimensions
Output: (N,∗,out_features) where all but the last dimension are the same shape as the input.
Now, let's try creating a linear layer in pytorch and performing the operation. I want an output with 8 channels, and the input data will have 3 channels.
import numpy as np
import torch
from torch import nn
linear_layer_pytorch = nn.Linear(3, 8)
Let's create some dummy input data of shape (1, 4, 3) - (batch_size, num_samples, in_channels):
data = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4], dtype=np.float32).reshape(1, 4, 3)
data_pytorch = torch.from_numpy(data)
and finally, perform the operation:
results_pytorch = linear_layer_pytorch(data_pytorch)
results_pytorch.shape
the shape of the output is as follows: Out[27]: torch.Size([1, 4, 8])
Taking a look at the source of the PyTorch implementation:
def linear(input, weight, bias=None):
    # type: (Tensor, Tensor, Optional[Tensor]) -> Tensor
    r"""
    Applies a linear transformation to the incoming data: :math:`y = xA^T + b`.
    Shape:
        - Input: :math:`(N, *, in\_features)` where `*` means any number of
          additional dimensions
        - Weight: :math:`(out\_features, in\_features)`
        - Bias: :math:`(out\_features)`
        - Output: :math:`(N, *, out\_features)`
    """
    if input.dim() == 2 and bias is not None:
        # fused op is marginally faster
        ret = torch.addmm(bias, input, weight.t())
    else:
        output = input.matmul(weight.t())
        if bias is not None:
            output += bias
        ret = output
    return ret
It transposes the weight matrix that is passed to it, broadcasts it along the batch_size axis and performs a matrix multiplication. Having in mind how a linear layer works, I imagine it as 8 nodes, each connected through a synapse holding a weight to every channel in an input sample, so in my case it has 3*8 weights. And that is exactly the shape I see in the debugger: (8, 3).
Now, let's jump to Chainer. Chainer's linear layer documentation is available here: https://docs.chainer.org/en/stable/reference/generated/chainer.links.Linear.html#chainer.links.Linear. According to this documentation, the Linear layer wraps the function linear, which, according to the docs, flattens the input along the non-batch dimensions, and the shape of its weight matrix is (output_size, flattened_input_size).
import chainer
linear_layer_chainer = chainer.links.Linear(8)
results_chainer = linear_layer_chainer(data)
results_chainer.shape
Out[21]: (1, 8)
Creating the layer as linear_layer_chainer = chainer.links.Linear(3, 8) and calling it causes a size mismatch. So in the case of Chainer I got totally different results, because this time I have a weight matrix of shape (8, 12) and my results have a shape of (1, 8). So here is my question: since the results are clearly different (both the weight matrices and the outputs have different shapes), how can I make them equivalent, and what should the desired output be? In the PyTorch implementation of Tacotron it seems that the PyTorch approach is used as is (https://github.com/mozilla/TTS/blob/master/layers/tacotron.py) - Prenet. If that is the case, how can I make Chainer produce the same results (I have to implement this in Chainer)? I will be grateful for any insight, and sorry that the post has gotten this long.
The Chainer Linear layer (a bit frustratingly) does not apply the transformation to the last axis by default; it flattens all axes other than the batch axis. Instead, you need to tell it how many batch axes there are (see the documentation), which is 2 in your case:
# data.shape == (1, 4, 3)
results_chainer = linear_layer_chainer(data, n_batch_axes=2)
# 2 batch axes (1,4) means you apply linear to (..., 3)
# results_chainer.shape == (1, 4, 8)
You can also use l(data, n_batch_axes=len(data.shape)-1) to always apply to the last dimension which is the default behaviour in PyTorch, Keras etc.
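To make the difference between the two conventions concrete without depending on either framework, here is a sketch in plain Python on nested lists. It is not how PyTorch or Chainer are implemented; the function names and the constant weight values are made up, and only the resulting shapes are the point.

```python
def shape(x):
    """Shape of a regular nested list, as a tuple."""
    dims = []
    while isinstance(x, list):
        dims.append(len(x))
        x = x[0]
    return tuple(dims)


def flatten(x):
    """Flatten a nested list into a flat list of numbers."""
    if not isinstance(x, list):
        return [x]
    return [v for sub in x for v in flatten(sub)]


def affine(vector, weight, bias):
    """y = W x + b for a single flat input vector."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weight, bias)]


def linear_last_axis(x, weight, bias):
    """Apply the map to every innermost vector (the PyTorch convention)."""
    if not isinstance(x[0], list):
        return affine(x, weight, bias)
    return [linear_last_axis(sub, weight, bias) for sub in x]


def linear_flattened(x, weight, bias):
    """Flatten all non-batch axes first (Chainer with one batch axis)."""
    return [affine(flatten(item), weight, bias) for item in x]


data = [[[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]]]  # shape (1, 4, 3)
bias = [0.0] * 8
w_small = [[0.1] * 3 for _ in range(8)]   # (8, 3): acts on the last axis
w_large = [[0.1] * 12 for _ in range(8)]  # (8, 12): acts on 4*3 flattened values

print(shape(linear_last_axis(data, w_small, bias)))  # (1, 4, 8)
print(shape(linear_flattened(data, w_large, bias)))  # (1, 8)
```

Treating the first two axes as batch axes, as n_batch_axes=2 does, corresponds to the first function: every length-3 vector is mapped independently with the same (8, 3) weight matrix.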

LSTM time series recursive prediction converges to the same value

I'm working on Timeseries sequence prediction using LSTM.
My goal is to use window of 25 past values in order to generate a prediction for the next 25 values. I'm doing that recursively:
I use 25 known values to predict the next value, append that value as a known value, then shift the window of 25 values and predict the next one again, until I have 25 new generated values (or more).
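The recursive procedure described here can be sketched as a sliding window that is fed its own predictions. The code below is plain Python with a hypothetical predict_next callable standing in for the trained network; with Keras, that callable would presumably wrap regressor.predict on the window reshaped to (1, 25, 1), but this is an assumption and not the code from this post. The stand-in model simply continues the last step so that the example runs on its own.

```python
from collections import deque


def recursive_forecast(predict_next, history, window=25, steps=25):
    """Generate `steps` values by repeatedly predicting one step ahead."""
    if len(history) < window:
        raise ValueError("need at least `window` known values")
    # A bounded deque drops the oldest value whenever a new one is appended.
    buffer = deque(history[-window:], maxlen=window)
    generated = []
    for _ in range(steps):
        next_value = predict_next(list(buffer))
        generated.append(next_value)
        buffer.append(next_value)  # the prediction becomes a "known" value
    return generated


def last_difference_model(window_values):
    """Hypothetical stand-in for the network: repeat the last increment."""
    return window_values[-1] + (window_values[-1] - window_values[-2])


known = [float(i) for i in range(30)]
forecast = recursive_forecast(last_difference_model, known)
print(len(forecast), forecast[:3])
```

After 25 steps the window contains only generated values, so from that point on the output depends on the original sequence only through the model's own earlier predictions.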
I'm using "Keras" to implement the RNN
Architecture:
regressor = Sequential()
regressor.add(LSTM(units = 50, return_sequences = True, input_shape = (X_train.shape[1], 1)))
regressor.add(Dropout(0.1))
regressor.add(LSTM(units = 50, return_sequences = True))
regressor.add(Dropout(0.1))
regressor.add(LSTM(units = 50))
regressor.add(Dropout(0.1))
regressor.add(Dense(units = 1))
regressor.compile(optimizer = 'rmsprop', loss = 'mean_squared_error')
regressor.fit(X_train, y_train, epochs = 10, batch_size = 32)
Problem:
The recursive prediction always converges to the same value, no matter what sequence comes before.
This is certainly not what I want. I was expecting the generated sequence to differ depending on what came before, and I'm wondering if someone has an idea about this behavior and how to avoid it. Maybe I'm doing something wrong ...
I tried different numbers of epochs and it didn't help much; actually, more epochs made it worse. Changing the batch size, number of units, number of layers, and window size didn't help avoid this issue either.
I'm using MinMaxScaler for the data.
Edit:
scaling new inputs for testing:
dataset_test = sc.transform(dataset_test.reshape(-1, 1))