Compare two segmentation map predictions - deep learning

I am using a consistency loss between two predicted segmentation maps on unlabeled data. For labeled data, I'm using nn.BCEWithLogitsLoss and Dice loss.
I'm working on videos, which is why the output has 5 dimensions:
(batch_size, channels, frames, height, width)
I want to know how we can compare two predicted segmentation maps.
# gt_seg - Ground truth segmentation map. - (8, 1, 8, 112, 112)
# aug_gt_seg - Augmented ground truth segmentation map - (8, 1, 8, 112, 112)
predicted_seg_1 = model(data, targets) # (8, 1, 8, 112, 112)
predicted_seg_2 = model(augmented_data, augmented_targets) #(8, 1, 8, 112, 112)
# define criterion
seg_criterion_1 = nn.BCEWithLogitsLoss()  # size_average is deprecated; reduction='mean' is the default
seg_criterion_2 = nn.DiceLoss()           # note: torch.nn has no DiceLoss, so this assumes a custom Dice loss implementation
# labeled losses
supervised_loss_1 = seg_criterion_1(predicted_seg_1, gt_seg)
supervised_loss_2 = seg_criterion_2(predicted_seg_1, gt_seg)
# Consistency loss
if consistency_loss == "l2":
    consistency_criterion = nn.MSELoss()
    cons_loss = consistency_criterion(predicted_seg_1, predicted_seg_2)
elif consistency_loss == "l1":
    consistency_criterion = nn.L1Loss()
    cons_loss = consistency_criterion(predicted_seg_1, predicted_seg_2)
total_supervised_loss = supervised_loss_1 + supervised_loss_2
total_consistency_loss = cons_loss
Is this the right way to apply consistency between two predicted segmentation maps?
I'm mainly confused by the definition on the PyTorch website: it describes a comparison of an input x with a target y. My usage seems correct, since I want both predicted segmentation maps to be similar, but the second segmentation map is not a target, and that's why I'm confused. If this is valid, then almost any loss function could be applied in one way or another, which doesn't look right to me. If it is the correct way to compare them, can it be extended to other segmentation-based losses such as Dice loss, IoU loss, etc.?
One more query regarding loss computation on labeled data:
# gt_seg - Ground truth segmentation map
# aug_gt_seg - Augmented ground truth segmentation map
predicted_seg_1 = model(data, targets)
predicted_seg_2 = model(augmented_data, augmented_targets)
# define criterion
seg_criterion_1 = nn.BCEWithLogitsLoss()  # size_average is deprecated; reduction='mean' is the default
seg_criterion_2 = nn.DiceLoss()           # again assuming a custom Dice loss implementation
# labeled losses
supervised_loss_1 = seg_criterion_1(predicted_seg_1, gt_seg)
supervised_loss_2 = seg_criterion_2(predicted_seg_1, gt_seg)
# augmented labeled losses
aug_supervised_loss_1 = seg_criterion_1(predicted_seg_2, aug_gt_seg)
aug_supervised_loss_2 = seg_criterion_2(predicted_seg_2, aug_gt_seg)
total_supervised_loss = supervised_loss_1 + supervised_loss_2 + aug_supervised_loss_1 + aug_supervised_loss_2
Is the calculation of total_supervised_loss correct? Can I apply loss.backward() on this?

Yes, this is a valid way to implement a consistency loss. The nomenclature used by the PyTorch documentation lists one input as the target and the other as the prediction, but consider that L1, L2, Dice, and IoU losses are all symmetric (that is, Loss(a, b) = Loss(b, a)). So any of these functions will serve as a consistency loss regardless of whether one input is actually a ground truth or "target".
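As a concrete illustration, here is a minimal sketch (not your exact training loop; the sigmoid call assumes the model outputs raw logits, as your use of BCEWithLogitsLoss suggests):
import torch
import torch.nn as nn

consistency_criterion = nn.MSELoss()  # or nn.L1Loss()

# two predictions of shape (batch, channels, frames, height, width)
logits_1 = torch.randn(8, 1, 8, 112, 112)
logits_2 = torch.randn(8, 1, 8, 112, 112)

# compare probabilities rather than raw logits
probs_1 = torch.sigmoid(logits_1)
probs_2 = torch.sigmoid(logits_2)

cons_loss = consistency_criterion(probs_1, probs_2)
# symmetry: swapping the arguments gives the same value
assert torch.isclose(cons_loss, consistency_criterion(probs_2, probs_1))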

Related

About the Softmax function as output layer in predictions

I know how the softmax activation function works: the sum of an output layer with a softmax activation is always equal to one, that is, the output vector is normalized. This is also necessary because the maximum accumulated probability cannot exceed one. OK, this is clear.
But my question is the following: when softmax is used as a classifier, the argmax function is used to get the index of the class. So, what difference does it make whether the accumulated probability is one or higher, if the important result is the index of the correct class?
An example in Python, where I made another "softmax" (it is really not a softmax function), but the classifier works in the same way as the classifier with the real softmax function:
import numpy as np

classes = 10
classes_list = ['dog', 'cat', 'monkey', 'butterfly', 'donkey',
                'horse', 'human', 'car', 'table', 'bottle']

# This simulates an NN with its weights and the previous
# layer with a ReLU activation
a = np.random.normal(0, 0.5, (classes, 512))  # output from the previous layer
w = np.random.normal(0, 0.5, (512, 1))        # weights
b = np.random.normal(0, 0.5, (classes, 1))    # bias

# correct solution:
def softmax(a, w, b):
    a = np.maximum(a, 0)  # ReLU simulation
    x = np.matmul(a, w) + b
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0), np.argsort(e_x.flatten())[::-1]

# approximate solution (accumulated "probability" is higher than one):
def softmax_app(a, w, b):
    a = np.maximum(a, 0)  # ReLU simulation
    w_exp = np.exp(w)
    coef = np.sum(w_exp)
    matmul = np.exp(np.matmul(a, w) + b)
    res = matmul / coef
    return res, np.argsort(res.flatten())[::-1]

teor = softmax(a, w, b)
approx = softmax_app(a, w, b)
class_teor = classes_list[teor[-1][0]]
class_approx = classes_list[approx[-1][0]]
print(np.array_equal(teor[-1], approx[-1]))
print(class_teor == class_approx)
The class obtained with both methods is always the same (I'm talking about predictions, not training). I ask this because I'm implementing the softmax on an FPGA device, and with the second method it does not take two passes to calculate the softmax function: first to compute the exponentiated matrix and its sum, and second to perform the division.
Let's review the uses of softmax:
You should use softmax if:
You are training an NN and want to limit the range of output values during training (you could use other activation functions instead). This can marginally help with clipping the gradient.
You are performing inference on an NN and you want to obtain a metric on the "degree of confidence" of your classification result (in the range 0-1).
You are performing inference on an NN and wish to get the top K results. In this case it is recommended as a way to have a "degree of confidence" metric to compare them.
You are performing inference on several NNs (ensemble methods) and wish to average them out (otherwise their results wouldn't be easily comparable).
You should not use (or should remove) softmax if:
You are performing inference on an NN and you only care about the top class. Note that the NN could still have been trained with softmax (for better accuracy, faster convergence, etc.).
In your case, your insight is right: softmax as the activation function in the last layer is meaningless if your problem only requires you to get the index of the maximum value during the inference phase. Besides, since you are targeting an FPGA implementation, this would only give you extra headaches.
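To make the argmax point concrete, here is a tiny sketch (my own illustration, independent of your FPGA code): softmax is a monotonic transformation of the logits, so the ranking of the classes never changes.
import numpy as np

logits = np.array([2.0, -1.0, 0.5, 3.2])       # raw outputs of the last layer
probs = np.exp(logits - logits.max())
probs /= probs.sum()                           # proper softmax, sums to 1

# the top class is the same whether or not softmax is applied
print(np.argmax(logits) == np.argmax(probs))   # True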

Why can't my LSTM autoencoder model detect outliers?

I am trying to build an LSTM autoencoder for anomaly detection,
but the model does not seem to work on my data.
Here is the normal data that I use for training.
And here is the abnormal data that I use for validation.
If the model worked, its loss should be high at samples #200000~#500000.
Unfortunately, here is the result when I feed the validation data to the model:
in the abnormal interval, the loss is still low.
Here is my code for training the model.
I would greatly appreciate any suggestions.
# imports assumed by the snippet below (not in the original post)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras import optimizers
from keras.callbacks import EarlyStopping

scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit(healthy_data)
data_scaled = scaler.transform(healthy_data)
data_broken_scaled = scaler.transform(broken_data)

timesteps = 32
data = data_scaled
dim = 1
data.shape = (-1, timesteps, dim)

lr = 0.0001
Nadam = optimizers.Nadam(lr=lr)

model = Sequential()
model.add(LSTM(50, input_shape=(timesteps, dim), return_sequences=True))
model.add(Dense(dim))
model.compile(loss='mae', optimizer=Nadam, metrics=['mse'])

EStop = EarlyStopping(monitor='val_loss', min_delta=0.001, patience=150, verbose=2, mode='auto', restore_best_weights=True)
history = model.fit(data, data, validation_data=(data, data), epochs=3000, batch_size=72, verbose=2, shuffle=False, callbacks=[EStop]).history

# data_broken_scaled presumably needs the same (-1, timesteps, dim) reshape before predict
pred_broken = model.predict(data_broken_scaled)
loss_broken = np.mean(np.abs(pred_broken - data_broken_scaled), axis=1)

fig, ax = plt.subplots(figsize=(20, 6), dpi=80, facecolor='w', edgecolor='k')
ax.plot(range(0, len(loss_broken)), loss_broken, '-', color='red', animated=True, linewidth=1)
I think it may be better to use a Fourier transform and detect anomalies in the frequency domain.
That is, convert the train and test data to the frequency domain via the Fourier transform. I also recommend using time windows; a rough sketch follows below.
Most models perform better when they have enough good training data. Your anomaly points should be labeled 1 (in a pre-processing step of the data), and they should show up as high-frequency content within your time windows.
In short, I think your training data may not be sufficiently representative of the anomalies yet.
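A minimal sketch of the windowed frequency-domain features described above (my own illustration; the window length, the step size, and the use of numpy.fft.rfft are assumptions, not part of the original answer):
import numpy as np

def windowed_spectra(signal, window=256, step=128):
    """Magnitude spectrum of each sliding window of a 1-D signal."""
    spectra = []
    for start in range(0, len(signal) - window + 1, step):
        chunk = signal[start:start + window]
        spectra.append(np.abs(np.fft.rfft(chunk)))
    return np.array(spectra)  # shape: (num_windows, window // 2 + 1)

# these per-window spectra could then be fed to the autoencoder instead of the raw samples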

PyTorch and Chainer implementations of the Linear layer - are they equivalent?

I want to use a Linear, fully-connected layer as one of the input layers in my network. The input has shape (batch_size, in_channels, num_samples). It is based on the Tacotron paper (https://arxiv.org/pdf/1703.10135.pdf), specifically the Encoder prenet part.
It feels to me as if Chainer and PyTorch have different implementations of the Linear layer - are they really performing the same operations or am I misunderstanding something?
In PyTorch, the behavior of the Linear layer follows the documentation: https://pytorch.org/docs/0.3.1/nn.html#torch.nn.Linear
according to which the shapes of the input and output data are as follows:
Input: (N, *, in_features), where * means any number of additional dimensions
Output: (N, *, out_features), where all but the last dimension have the same shape as the input.
Now, let's try creating a linear layer in pytorch and performing the operation. I want an output with 8 channels, and the input data will have 3 channels.
import numpy as np
import torch
from torch import nn
linear_layer_pytorch = nn.Linear(3, 8)
Let's create some dummy input data of shape (1, 4, 3) - (batch_size, num_samples, in_channels):
data = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4], dtype=np.float32).reshape(1, 4, 3)
data_pytorch = torch.from_numpy(data)
and finally, perform the operation:
results_pytorch = linear_layer_pytorch(data_pytorch)
results_pytorch.shape
the shape of the output is as follows: Out[27]: torch.Size([1, 4, 8])
Taking a look at the source of the PyTorch implementation:
def linear(input, weight, bias=None):
    # type: (Tensor, Tensor, Optional[Tensor]) -> Tensor
    r"""
    Applies a linear transformation to the incoming data: :math:`y = xA^T + b`.

    Shape:
        - Input: :math:`(N, *, in\_features)` where `*` means any number of
          additional dimensions
        - Weight: :math:`(out\_features, in\_features)`
        - Bias: :math:`(out\_features)`
        - Output: :math:`(N, *, out\_features)`
    """
    if input.dim() == 2 and bias is not None:
        # fused op is marginally faster
        ret = torch.addmm(bias, input, weight.t())
    else:
        output = input.matmul(weight.t())
        if bias is not None:
            output += bias
        ret = output
    return ret
It transposes the weight matrix that is passed to it, broadcasts it along the batch_size axis, and performs a matrix multiplication. Having in mind how a linear layer works, I imagine it as 8 nodes, each connected by a weighted synapse to every channel of an input sample, so in my case it has 3*8 weights. And that is exactly the shape I see in the debugger: (8, 3).
Now, let's jump to Chainer. The Chainer Linear layer documentation is available here: https://docs.chainer.org/en/stable/reference/generated/chainer.links.Linear.html#chainer.links.Linear. According to this documentation, the Linear link wraps the function linear, which flattens the input along the non-batch dimensions, and the shape of its weight matrix is (output_size, flattened_input_size).
import chainer
linear_layer_chainer = chainer.links.Linear(8)
results_chainer = linear_layer_chainer(data)
results_chainer.shape
Out[21]: (1, 8)
Creating the layer as linear_layer_chainer = chainer.links.Linear(3, 8) and calling it causes a size mismatch. So in the case of Chainer, I get a totally different result: this time the weight matrix has shape (8, 12) and my results have shape (1, 8). So here is my question: since the results are clearly different, and both the weight matrices and the outputs have different shapes, how can I make them equivalent, and what should the desired output be? In the PyTorch implementation of Tacotron it seems that the PyTorch approach is used as is (https://github.com/mozilla/TTS/blob/master/layers/tacotron.py - Prenet). If that is the case, how can I make Chainer produce the same results (I have to implement this in Chainer)? I will be grateful for any insight; sorry that the post has gotten this long.
The Chainer Linear link (a bit frustratingly) does not apply the transformation to the last axis only; it flattens all of the non-batch axes. Instead, you need to tell it how many batch axes there are (see the n_batch_axes argument in the documentation), which is 2 in your case:
# data.shape == (1, 4, 3)
results_chainer = linear_layer_chainer(data, n_batch_axes=2)
# 2 batch axes (1,4) means you apply linear to (..., 3)
# results_chainer.shape == (1, 4, 8)
You can also use l(data, n_batch_axes=len(data.shape)-1) to always apply the transformation to the last dimension, which is the default behaviour in PyTorch, Keras, etc.
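For completeness, here is a small sketch of how you could check the equivalence yourself by copying the PyTorch parameters into the Chainer link (an untested illustration; it relies on both libraries storing the weight as (out_features, in_features), which matches the shapes discussed above):
import numpy as np
import torch
from torch import nn
import chainer

data = np.arange(12, dtype=np.float32).reshape(1, 4, 3)

torch_linear = nn.Linear(3, 8)
chainer_linear = chainer.links.Linear(3, 8)

# copy PyTorch's parameters into the Chainer link
chainer_linear.W.data = torch_linear.weight.detach().numpy()
chainer_linear.b.data = torch_linear.bias.detach().numpy()

out_torch = torch_linear(torch.from_numpy(data)).detach().numpy()
out_chainer = chainer_linear(data, n_batch_axes=2).array   # shape (1, 4, 8)

print(np.allclose(out_torch, out_chainer, atol=1e-6))      # expected: True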

Implementing WNGrad in PyTorch?

I'm trying to implement the WNGrad optimizer (technically WN-Adam, Algorithm 4 in the WNGrad paper) in PyTorch. I've never implemented an optimizer in PyTorch before, so I don't know if I've done it correctly (I started from the Adam implementation). The optimizer does not make much progress and then flattens out, as I would expect (the bj values can only increase monotonically, and they grow quickly, so no progress is made), but I'm guessing I have a bug. Standard optimizers (Adam, SGD) work fine on the same model I'm trying to optimize.
Does this implementation look correct?
import torch
from torch.optim import Optimizer

class WNAdam(Optimizer):
    """Implements the WNAdam algorithm.

    It has been proposed in `WNGrad: Learn the Learning Rate in Gradient Descent`_.

    Arguments:
        params (iterable): iterable of parameters to optimize or dicts defining
            parameter groups
        lr (float, optional): learning rate (default: 0.1)
        beta1 (float, optional): exponential smoothing coefficient for the gradient.
            When beta1=0 this implements WNGrad.

    .. _WNGrad\: Learn the Learning Rate in Gradient Descent:
        https://arxiv.org/abs/1803.02865
    """
    def __init__(self, params, lr=0.1, beta1=0.9):
        if not 0.0 <= beta1 < 1.0:
            raise ValueError("Invalid beta1 parameter: {}".format(beta1))
        defaults = dict(lr=lr, beta1=beta1)
        super().__init__(params, defaults)

    def step(self, closure=None):
        """Performs a single optimization step.

        Arguments:
            closure (callable, optional): A closure that reevaluates the model
                and returns the loss.
        """
        loss = None
        if closure is not None:
            loss = closure()
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad.data
                state = self.state[p]
                # State initialization
                if len(state) == 0:
                    state['step'] = 0
                    # Exponential moving average of gradient values
                    state['exp_avg'] = torch.zeros_like(p.data)
                    # Learning rate adjustment
                    state['bj'] = 1.0
                exp_avg = state['exp_avg']
                beta1 = group['beta1']
                state['step'] += 1
                state['bj'] += (group['lr'] ** 2) / state['bj'] * grad.pow(2).sum()
                # update exponential moving average
                exp_avg.mul_(beta1).add_(1 - beta1, grad)
                bias_correction = 1 - beta1 ** state['step']
                p.data.sub_(group['lr'] / state['bj'] / bias_correction, exp_avg)
        return loss
The paper's author has an open-sourced implementation on GitHub. The WNGrad paper states it is inspired by batch (and weight) normalization. You should use the L2 norm with respect to the weight dimensions (don't sum over everything), as shown in the algorithm in the paper.
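For illustration only, here is one possible reading of "L2 norm with respect to the weight dimensions" (this is my own sketch, not code from the paper or the author's repository):
import torch

# hypothetical gradient of a Linear layer's weight, shape (out_features, in_features)
grad = torch.randn(8, 3)

global_sq_norm = grad.pow(2).sum()        # what the question's code accumulates: one scalar per tensor
per_row_sq_norm = grad.pow(2).sum(dim=1)  # one squared L2 norm per output unit, shape (8,)

# with the per-row version, state['bj'] would become a tensor of shape (8,)
# and would have to be broadcast against exp_avg in the parameter update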

LSTM timeseries recursive prediction converges to the same value

I'm working on timeseries sequence prediction using an LSTM.
My goal is to use a window of the 25 past values to generate a prediction for the next 25 values. I'm doing that recursively:
I use the 25 known values to predict the next value, append that value as a known value, then shift the window of 25 values and predict the next one again, until I have 25 newly generated values (or more).
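In code, the recursive loop looks roughly like this (a simplified sketch; recursive_forecast and the hard-coded window size of 25 are just for illustration):
import numpy as np

def recursive_forecast(model, seed_window, n_steps=25):
    """Predict one value at a time, append it, and slide the window forward."""
    window = list(seed_window)            # 25 known (scaled) values
    generated = []
    for _ in range(n_steps):
        x = np.array(window[-25:], dtype=np.float32).reshape(1, 25, 1)
        next_value = float(model.predict(x)[0, 0])
        generated.append(next_value)
        window.append(next_value)         # feed the prediction back in
    return generated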
I'm using "Keras" to implement the RNN
Architecture:
# imports assumed by the snippet (not in the original post)
from keras.models import Sequential
from keras.layers import LSTM, Dropout, Dense

regressor = Sequential()
regressor.add(LSTM(units=50, return_sequences=True, input_shape=(X_train.shape[1], 1)))
regressor.add(Dropout(0.1))
regressor.add(LSTM(units=50, return_sequences=True))
regressor.add(Dropout(0.1))
regressor.add(LSTM(units=50))
regressor.add(Dropout(0.1))
regressor.add(Dense(units=1))
regressor.compile(optimizer='rmsprop', loss='mean_squared_error')
regressor.fit(X_train, y_train, epochs=10, batch_size=32)
Problem:
The recursive prediction always converges to the same value, no matter what sequence comes before.
This is certainly not what I want; I was expecting the generated sequence to differ depending on what comes before, and I'm wondering if someone has an idea about this behavior and how to avoid it. Maybe I'm doing something wrong...
I tried different numbers of epochs and it didn't help much; actually, more epochs made it worse. Changing the batch size, the number of units, the number of layers, and the window size didn't help avoid this issue either.
I'm using MinMaxScaler for the data.
Edit:
Scaling the new inputs for testing (sc is the fitted MinMaxScaler):
dataset_test = sc.transform(dataset_test.reshape(-1, 1))