MNIST trained with Sigmoid fails while Softmax works fine
I am trying to investigate how different activation affects the final results, so I implemented a simple net for MNIST with PyTorch.
I am using NLLLoss (Negative log likelihood) as it implements Cross Entropy Loss when used with softmax.
When I have softmax as activation of the last layer, it works great.
But when I used sigmoid instead, I noticed that things fall apart
Here is my network code
def forward(self, x):
x = F.relu(F.max_pool2d(self.conv1(x), 2))
x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
x = x.view(-1, 80)
x = F.relu(self.fc1(x))
x = F.dropout(x, training=self.training)
x = self.fc2(x)
return F.XXXX(x)
where XXXX is the activation function
both Sigmoid and Softmax output values between (0, 1).
Yes Softmax guarantees the sum of 1 but I am not sure if this answers why the training fails with Sigmoid.
Is there any detail I am not catching here?
Sigmoid + crossentropy can be used for multilabel classification (assume a picture with a dog and a cat, you want the model to return "dog and cat"). It works when the classes aren't mutually exclusive or the samples contain more than one object that you want to recognize.
In your case MNIST has mutually exclusive classes and in each image there is only one number, so it is better to use logsoftmax + negative loglikelihood, which assume that the classes are mutually exclusive and there is only one correct label associated to the image.
So, you can't really expect to have that behavior from sigmoid.
Related
I am trying to reproduce a Neural Network trained to detect whether there is a 0-3 digit in an image with another confounding image. The paper I am following lists the corresponding architecture:
A neural network with 28×56 input neurons and one output neuron is
trained on this task. The input values are coded between −0.5 (black)
and +1.5 (white). The neural network is composed of a first detection
pooling layer with 400 detection neurons sum-pooled into 100 units
(i.e. we sum-pool non-overlapping groups of 4 detection units). A
second detection-pooling layer with 400 detection neurons is applied
to the 100-dimensional output of the previous layer, and activities
are sum-pooled onto a single unit representing the deep network
output. Positive examples (0-3 digit in the image) are assigned target
value 100 and negative examples are assigned target value 0. The
neural network is trained to minimize the mean-square error between
the target values and its output.
My main doubt is in this context what they mean by detection neurons, if they mean filters or a single standard ReLU neuron. Also, if the mean filters, how could they be applied in the second layer to a 100-dimensional output when they are designed to operate on 2x2 matrixes.
Reference:
Montavon, G., Bach, S., Binder, A., Samek, W., & Müller, K. (2015).
Explaining NonLinear Classification Decisions with Deep Taylor
Decomposition. arXiv. https://doi.org/10.1016/j.patcog.2016.11.008.
Specifically section 4.C
Thanks a lot for the help!
My best guess at this is something like (code not tested - just rough PyTorch):
from torch import nn
class Model(nn.Module):
def __init__(self):
super().__init__()
self.layer1 = nn.Sequential(
nn.Flatten(), # Flatten row-wise into a 1D sequence
nn.Linear(28 * 56, 400), # Linear layer with 400 outputs.
nn.AvgPool1D(4, 4), # Sum pool to 100 outputs.
)
self.layer2 = nn.Sequential(
nn.Linear(100, 400), # Linear layer with 400 outputs.
nn.AdaptiveAvgPool1D(1), # Sum pool to 1 output.
)
def forward(self, x):
return self.layer2(self.layer1(x))
But overall I would agree with the commentor on your post that there are some issues with language here.
Simple and short question. I have a network (Unet) which performs image segmentation. I want the logits as the output to feed into the cross entropy loss (using pytorch). Currently my final layer looks as so:
class Logits(nn.Sequential):
def __init__(self,
in_channels,
n_class
):
super(Logits, self).__init__()
# fully connected layer outputting the prediction layers for each of my classes
self.conv = self.add_module('conv_out',
nn.Conv2d(in_channels,
n_class,
kernel_size = 1
)
)
self.activ = self.add_module('sigmoid_out',
nn.Sigmoid()
)
Is it correct to use the sigmoid activation function here? Does this give me logits?
When people talk about "logits" they usually refer to the "raw" n_class-dimensional output vector. For multi-class classification (n_class > 2) you want to convert the n_class-dimensional vector of raw "logits" into a n_class-dim probability vector.
That is, you want prob = f(logits) with prob_i >= 0 for all n_class entries, and that sum(prob)=1.
The most straight forward way of doing that in a differentiable way is to use the Softmax function:
prob_i = softmax(logits) = exp(logits_i) / sum_j exp(logits_j)
It is easy to see that the output of softmax is indeed a n_class-dim probability vector (I leave it to you as a short exercise).
BTW, this is why the raw predictions are called "logits" because they are kind of "log" of the output predicted probabilities.
Now, it is customary not to explicitly compute the softmax on top of a classification network and defer its computation to the loss function, e.g. nn.CrossEntropyLoss that internally computes the softmax and requires the raw logits as inputs, rather than the normalized probabilities. This is done mainly for numerical stability.
Therefore, if you are training a multi-class classification network with nn.CrossEntropyLoss you do not need to worry at all about the final activation and simply output the raw logits from your final conv/linear layer.
Most importantly, do not use nn.Sigmoid() activation as it tends to have saturated gradients and will mess up your training.
As far as I understood, you are working on a multi-label classification task where a single input can have several labels, hence your usage of nn.Sigmoid (vs nn.Softmax for multi-class classification).
There a loss function which combines nn.Sigmoid and the nn.BCELoss: nn.BCEWithLogitsLoss. So you would have as input, a vector of logits whose length is the number of classes. And, the target would as well have the same shape: as a multi-hot-encoding, with 1s for active classes.
I'm wondering if I need to do anything special when training with BatchNorm in pytorch. From my understanding the gamma and beta parameters are updated with gradients as would normally be done by an optimizer. However, the mean and variance of the batches are updated slowly using momentum.
So do we need to specify to the optimizer when the mean and variance parameters are updated, or does pytorch automatically take care of this?
Is there a way to access the mean and variance of the BN layer so that I can make sure it was changing while I trained the model.
If needed here is my model and training procedure:
def bn_drop_lin(n_in:int, n_out:int, bn:bool=True, p:float=0.):
"Sequence of batchnorm (if `bn`), dropout (with `p`) and linear (`n_in`,`n_out`) layers followed by `actn`."
layers = [nn.BatchNorm1d(n_in)] if bn else []
if p != 0: layers.append(nn.Dropout(p))
layers.append(nn.Linear(n_in, n_out))
return nn.Sequential(*layers)
class Model(nn.Module):
def __init__(self, i, o, h=()):
super().__init__()
nodes = (i,) + h + (o,)
self.layers = nn.ModuleList([bn_drop_lin(i,o, p=0.5)
for i, o in zip(nodes[:-1], nodes[1:])])
def forward(self, x):
x = x.view(x.shape[0], -1)
for layer in self.layers[:-1]:
x = F.relu(layer(x))
return self.layers[-1](x)
Training:
for i, data in enumerate(trainloader):
# get the inputs; data is a list of [inputs, labels]
inputs, labels = data
# zero the parameter gradients
optimizer.zero_grad()
# forward + backward + optimize
outputs = net(inputs)
loss = criterion(outputs, labels)
loss.backward()
optimizer.step()
Batchnorm layers behave differently depending on if the model is in train or eval mode.
When net is in train mode (i.e. after calling net.train()) the batch norm layers contained in net will use batch statistics along with gamma and beta parameters to scale and translate each mini-batch. The running mean and variance will also be adjusted while in train mode. These updates to running mean and variance occur during the forward pass (when net(inputs) is called). The gamma and beta parameters are like any other pytorch parameter and are updated only once optimizer.step() is called.
When net is in eval mode (net.eval()) batch norm uses the historical running mean and running variance computed during training to scale and translate samples.
You can check the batch norm layers running mean and variance by displaying the layers running_mean and running_var members to ensure batch norm is updating them as expected. The learnable gamma and beta parameters can be accessed by displaying the weight and bias members of a batch norm layer respectively.
Edit
Below is a simple demonstration code showing that running_mean is updated during forward. Observe that it is not updated by the optimizer.
>>> import torch
>>> import torch.nn as nn
>>> layer = nn.BatchNorm1d(5)
>>> layer.train()
>>> layer.running_mean
tensor([0., 0., 0., 0., 0.])
>>> result = layer(torch.randn(5,5))
>>> layer.running_mean
tensor([ 0.0271, 0.0152, -0.0403, -0.0703, -0.0056])
I am finetuning resnet on my dataset which has multiple labels.
I would like to convert the 'scores' of the classification layer to probabilities and use those probabilities to calculate the loss at the training.
Could you give an example code for this?
Can I use like this:
P = net.forward(x)
p = torch.nn.functional.softmax(P, dim=1)
loss = torch.nn.functional.cross_entropy(P, y)
I am unclear whether this is the correct way or not as I am passing probabilities as the input to crossentropy loss.
So, you are training a model i.e resnet with cross-entropy in pytorch. Your loss calculation would look like this.
logit = model(x)
loss = torch.nn.functional.cross_entropy(logits=logit, target=y)
In this case, you can calculate the probabilities of all classes by doing,
logit = model(x)
p = torch.nn.functional.softmax(logit, dim=1)
# to calculate loss using probabilities you can do below
loss = torch.nn.functional.nll_loss(torch.log(p), y)
Note that if you use probabilities you will have to manually take a log, which is bad for numerical reasons. Instead, either use log_softmax or cross_entropy in which case you may end up computing losses using cross entropy and computing probability separately.
When not using KL divergence term, the VAE reconstructs mnist images almost perfectly but fails to generate new ones properly when provided with random noise.
When using KL divergence term, the VAE gives the same weird output both when reconstructing and generating images.
Here's the pytorch code for the loss function:
def loss_function(recon_x, x, mu, logvar):
BCE = F.binary_cross_entropy(recon_x, x.view(-1, 784), size_average=True)
KLD = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
return (BCE+KLD)
recon_x is the reconstructed image, x is the original_image, mu is the mean vector while logvar is the vector containing the log of variance.
What is going wrong here? Thanks in advance :)
A possible reason is the numerical unbalance between the two losses, with your BCE loss computed as an average over the batch (c.f. size_average=True) while the KLD one is summed.
Multiplying KLD with 0.0001 did it. The generated images are a little distorted, but similarity issue is resolved.
Yes, try out with different weight factor for the KLD loss term. Weighing down the KLD loss term resolves the same reconstruction output issue in the CelebA dataset (http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html).
There are many possible reasons for that. As benjaminplanche stated you need to use .mean instead of .sum reduction. Also, KLD term weight could be different for different architecture and data sets. So, try different weights and see the reconstruction loss, and latent space to decide. There is a trade-off between reconstruction loss (output quality) and KLD term which forces the model to shape a gausian like latent space.
To evaluate different aspects of VAEs I trained a Vanilla autoencoder and VAE with different KLD term weights.
Note that, I used the MNIST hand-written digits dataset to train networks with input size 784=28*28 and latent size 30 dimensions. Although for data samples in range of [0, 1] we normally use a Sigmoid activation function, I used a Tanh for experimental reasons.
Vanilla Autoencoder:
Autoencoder(
(encoder): Encoder(
(nn): Sequential(
(0): Linear(in_features=784, out_features=30, bias=True)
)
)
(decoder): Decoder(
(nn): Sequential(
(0): Linear(in_features=30, out_features=784, bias=True)
(1): Tanh()
)
)
)
Afterward, I implemented the VAE model as shown in the following code blocks. I trained this model with different KLD weights from the set {0.5, 1, 5}.
class VAE(nn.Module):
def __init__(self,dim_latent_representation=2):
super(VAE,self).__init__()
class Encoder(nn.Module):
def __init__(self, output_size=2):
super(Encoder, self).__init__()
# needs your implementation
self.nn = nn.Sequential(
nn.Linear(28 * 28, output_size),
)
def forward(self, x):
# needs your implementation
return self.nn(x)
class Decoder(nn.Module):
def __init__(self, input_size=2):
super(Decoder, self).__init__()
# needs your implementation
self.nn = nn.Sequential(
nn.Linear(input_size, 28 * 28),
nn.Tanh(),
)
def forward(self, z):
# needs your implementation
return self.nn(z)
self.dim_latent_representation = dim_latent_representation
self.encoder = Encoder(output_size=dim_latent_representation)
self.mu_layer = nn.Linear(self.dim_latent_representation, self.dim_latent_representation)
self.logvar_layer = nn.Linear(self.dim_latent_representation, self.dim_latent_representation)
self.decoder = Decoder(input_size=dim_latent_representation)
# Implement this function for the VAE model
def reparameterise(self, mu, logvar):
if self.training:
std = logvar.mul(0.5).exp_()
eps = std.data.new(std.size()).normal_()
return eps.mul(std).add_(mu)
else:
return mu
def forward(self,x):
# This function should be modified for the DAE and VAE
x = self.encoder(x)
mu, logvar = self.mu_layer(x), self.logvar_layer(x)
z = self.reparameterise(mu, logvar)
return self.decoder(z), mu, logvar
Vanilla Autoencoder
Training loss: 0.4089 Validation loss
Validation loss (reconstruction error) : 0.4171
VAE Loss = MSE + 0.5 * KLD
Training loss: 0.6420
Validation loss (reconstruction error) : 0.6060
VAE Loss = MSE + 1 * KLD
Training loss: 0.6821
Validation loss (reconstruction error) : 0.6550
VAE Loss = MSE + 5 * KLD
Training loss: 0.7122
Validation loss (reconstruction error) : 0.7154
Here you can see output results from different models. I also visualized the 30 dimensional latent space in 2D using sklearn.manifold.TSNE transformation.
We observe a low loss value for the vanilla autoencoder with 30D bottleneck size which results in high-quality reconstructed images. Although loss values increased in VAEs, the VAE arranged the latent space such that gaps between latent representations for different classes decreased. It means we can get better manipulated (mixed latents) output. Since VAE follows an isotropic multivariate normal distribution at the latent space, we can generate new unseen images by taking samples from the latent space with higher quality compared to the Vanilla autoencoder. However, the reconstruction quality was reduced (loss values increased) since the loss function is a weighted combination of MSE and KLD terms to be optimized where the KLD term forces the latent space to resemble a Gaussian distribution. As we increased the KLD weight, we achieved a more compact latent space closer to the prior distribution by sacrificing the reconstruction quality.