I'm wondering if I need to do anything special when training with BatchNorm in pytorch. From my understanding the gamma and beta parameters are updated with gradients as would normally be done by an optimizer. However, the mean and variance of the batches are updated slowly using momentum.
So do we need to specify to the optimizer when the mean and variance parameters are updated, or does pytorch automatically take care of this?
Is there a way to access the mean and variance of the BN layer so that I can make sure it was changing while I trained the model.
If needed here is my model and training procedure:
def bn_drop_lin(n_in:int, n_out:int, bn:bool=True, p:float=0.):
"Sequence of batchnorm (if `bn`), dropout (with `p`) and linear (`n_in`,`n_out`) layers followed by `actn`."
layers = [nn.BatchNorm1d(n_in)] if bn else []
if p != 0: layers.append(nn.Dropout(p))
layers.append(nn.Linear(n_in, n_out))
return nn.Sequential(*layers)
class Model(nn.Module):
def __init__(self, i, o, h=()):
super().__init__()
nodes = (i,) + h + (o,)
self.layers = nn.ModuleList([bn_drop_lin(i,o, p=0.5)
for i, o in zip(nodes[:-1], nodes[1:])])
def forward(self, x):
x = x.view(x.shape[0], -1)
for layer in self.layers[:-1]:
x = F.relu(layer(x))
return self.layers[-1](x)
Training:
for i, data in enumerate(trainloader):
# get the inputs; data is a list of [inputs, labels]
inputs, labels = data
# zero the parameter gradients
optimizer.zero_grad()
# forward + backward + optimize
outputs = net(inputs)
loss = criterion(outputs, labels)
loss.backward()
optimizer.step()
Batchnorm layers behave differently depending on if the model is in train or eval mode.
When net is in train mode (i.e. after calling net.train()) the batch norm layers contained in net will use batch statistics along with gamma and beta parameters to scale and translate each mini-batch. The running mean and variance will also be adjusted while in train mode. These updates to running mean and variance occur during the forward pass (when net(inputs) is called). The gamma and beta parameters are like any other pytorch parameter and are updated only once optimizer.step() is called.
When net is in eval mode (net.eval()) batch norm uses the historical running mean and running variance computed during training to scale and translate samples.
You can check the batch norm layers running mean and variance by displaying the layers running_mean and running_var members to ensure batch norm is updating them as expected. The learnable gamma and beta parameters can be accessed by displaying the weight and bias members of a batch norm layer respectively.
Edit
Below is a simple demonstration code showing that running_mean is updated during forward. Observe that it is not updated by the optimizer.
>>> import torch
>>> import torch.nn as nn
>>> layer = nn.BatchNorm1d(5)
>>> layer.train()
>>> layer.running_mean
tensor([0., 0., 0., 0., 0.])
>>> result = layer(torch.randn(5,5))
>>> layer.running_mean
tensor([ 0.0271, 0.0152, -0.0403, -0.0703, -0.0056])
Related
Simple and short question. I have a network (Unet) which performs image segmentation. I want the logits as the output to feed into the cross entropy loss (using pytorch). Currently my final layer looks as so:
class Logits(nn.Sequential):
def __init__(self,
in_channels,
n_class
):
super(Logits, self).__init__()
# fully connected layer outputting the prediction layers for each of my classes
self.conv = self.add_module('conv_out',
nn.Conv2d(in_channels,
n_class,
kernel_size = 1
)
)
self.activ = self.add_module('sigmoid_out',
nn.Sigmoid()
)
Is it correct to use the sigmoid activation function here? Does this give me logits?
When people talk about "logits" they usually refer to the "raw" n_class-dimensional output vector. For multi-class classification (n_class > 2) you want to convert the n_class-dimensional vector of raw "logits" into a n_class-dim probability vector.
That is, you want prob = f(logits) with prob_i >= 0 for all n_class entries, and that sum(prob)=1.
The most straight forward way of doing that in a differentiable way is to use the Softmax function:
prob_i = softmax(logits) = exp(logits_i) / sum_j exp(logits_j)
It is easy to see that the output of softmax is indeed a n_class-dim probability vector (I leave it to you as a short exercise).
BTW, this is why the raw predictions are called "logits" because they are kind of "log" of the output predicted probabilities.
Now, it is customary not to explicitly compute the softmax on top of a classification network and defer its computation to the loss function, e.g. nn.CrossEntropyLoss that internally computes the softmax and requires the raw logits as inputs, rather than the normalized probabilities. This is done mainly for numerical stability.
Therefore, if you are training a multi-class classification network with nn.CrossEntropyLoss you do not need to worry at all about the final activation and simply output the raw logits from your final conv/linear layer.
Most importantly, do not use nn.Sigmoid() activation as it tends to have saturated gradients and will mess up your training.
As far as I understood, you are working on a multi-label classification task where a single input can have several labels, hence your usage of nn.Sigmoid (vs nn.Softmax for multi-class classification).
There a loss function which combines nn.Sigmoid and the nn.BCELoss: nn.BCEWithLogitsLoss. So you would have as input, a vector of logits whose length is the number of classes. And, the target would as well have the same shape: as a multi-hot-encoding, with 1s for active classes.
MNIST trained with Sigmoid fails while Softmax works fine
I am trying to investigate how different activation affects the final results, so I implemented a simple net for MNIST with PyTorch.
I am using NLLLoss (Negative log likelihood) as it implements Cross Entropy Loss when used with softmax.
When I have softmax as activation of the last layer, it works great.
But when I used sigmoid instead, I noticed that things fall apart
Here is my network code
def forward(self, x):
x = F.relu(F.max_pool2d(self.conv1(x), 2))
x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
x = x.view(-1, 80)
x = F.relu(self.fc1(x))
x = F.dropout(x, training=self.training)
x = self.fc2(x)
return F.XXXX(x)
where XXXX is the activation function
both Sigmoid and Softmax output values between (0, 1).
Yes Softmax guarantees the sum of 1 but I am not sure if this answers why the training fails with Sigmoid.
Is there any detail I am not catching here?
Sigmoid + crossentropy can be used for multilabel classification (assume a picture with a dog and a cat, you want the model to return "dog and cat"). It works when the classes aren't mutually exclusive or the samples contain more than one object that you want to recognize.
In your case MNIST has mutually exclusive classes and in each image there is only one number, so it is better to use logsoftmax + negative loglikelihood, which assume that the classes are mutually exclusive and there is only one correct label associated to the image.
So, you can't really expect to have that behavior from sigmoid.
I recently read a paper about embedding.
In Eq. (3), the f is a 4096X1 vector. the author try to compress the vector in to theta (a 20X1 vector) by using an embedding matrix E.
The equation is simple theta = E*f
I was wondering if it can using pytorch to achieve this goal, then in the training, the E can be learned automatically.
How to finish the rest? thanks so much.
The demo code is follow:
import torch
from torch import nn
f = torch.randn(4096,1)
Assuming your input vectors are one-hot that is where "embedding layers" are used, you can directly use embedding layer from torch which does above as well as some more things. nn.Embeddings take non-zero index of one-hot vector as input as a long tensor. For ex: if feature vector is
f = [[0,0,1], [1,0,0]]
then input to nn.Embeddings will be
input = [2, 0]
However, what OP has asked in question is getting embeddings by matrix multiplication and below I will address that. You can define a module to do that as below. Since, param is an instance of nn.Parameter it will be registered as a parameter and will be optimized when you call Adam or any other optimizer.
class Embedding(nn.Module):
def __init__(self, input_dim, embedding_dim):
super().__init__()
self.param = torch.nn.Parameter(torch.randn(input_dim, embedding_dim))
def forward(self, x):
return torch.mm(x, self.param)
If you notice carefully this is the same as a linear layer with no bias and slightly different initialization. Therefore, you can achieve the same by using a linear layer as below.
self.embedding = nn.Linear(4096, 20, bias=False)
# change initial weights to normal[0,1] or whatever is required
embedding.weight.data = torch.randn_like(embedding.weight)
When not using KL divergence term, the VAE reconstructs mnist images almost perfectly but fails to generate new ones properly when provided with random noise.
When using KL divergence term, the VAE gives the same weird output both when reconstructing and generating images.
Here's the pytorch code for the loss function:
def loss_function(recon_x, x, mu, logvar):
BCE = F.binary_cross_entropy(recon_x, x.view(-1, 784), size_average=True)
KLD = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
return (BCE+KLD)
recon_x is the reconstructed image, x is the original_image, mu is the mean vector while logvar is the vector containing the log of variance.
What is going wrong here? Thanks in advance :)
A possible reason is the numerical unbalance between the two losses, with your BCE loss computed as an average over the batch (c.f. size_average=True) while the KLD one is summed.
Multiplying KLD with 0.0001 did it. The generated images are a little distorted, but similarity issue is resolved.
Yes, try out with different weight factor for the KLD loss term. Weighing down the KLD loss term resolves the same reconstruction output issue in the CelebA dataset (http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html).
There are many possible reasons for that. As benjaminplanche stated you need to use .mean instead of .sum reduction. Also, KLD term weight could be different for different architecture and data sets. So, try different weights and see the reconstruction loss, and latent space to decide. There is a trade-off between reconstruction loss (output quality) and KLD term which forces the model to shape a gausian like latent space.
To evaluate different aspects of VAEs I trained a Vanilla autoencoder and VAE with different KLD term weights.
Note that, I used the MNIST hand-written digits dataset to train networks with input size 784=28*28 and latent size 30 dimensions. Although for data samples in range of [0, 1] we normally use a Sigmoid activation function, I used a Tanh for experimental reasons.
Vanilla Autoencoder:
Autoencoder(
(encoder): Encoder(
(nn): Sequential(
(0): Linear(in_features=784, out_features=30, bias=True)
)
)
(decoder): Decoder(
(nn): Sequential(
(0): Linear(in_features=30, out_features=784, bias=True)
(1): Tanh()
)
)
)
Afterward, I implemented the VAE model as shown in the following code blocks. I trained this model with different KLD weights from the set {0.5, 1, 5}.
class VAE(nn.Module):
def __init__(self,dim_latent_representation=2):
super(VAE,self).__init__()
class Encoder(nn.Module):
def __init__(self, output_size=2):
super(Encoder, self).__init__()
# needs your implementation
self.nn = nn.Sequential(
nn.Linear(28 * 28, output_size),
)
def forward(self, x):
# needs your implementation
return self.nn(x)
class Decoder(nn.Module):
def __init__(self, input_size=2):
super(Decoder, self).__init__()
# needs your implementation
self.nn = nn.Sequential(
nn.Linear(input_size, 28 * 28),
nn.Tanh(),
)
def forward(self, z):
# needs your implementation
return self.nn(z)
self.dim_latent_representation = dim_latent_representation
self.encoder = Encoder(output_size=dim_latent_representation)
self.mu_layer = nn.Linear(self.dim_latent_representation, self.dim_latent_representation)
self.logvar_layer = nn.Linear(self.dim_latent_representation, self.dim_latent_representation)
self.decoder = Decoder(input_size=dim_latent_representation)
# Implement this function for the VAE model
def reparameterise(self, mu, logvar):
if self.training:
std = logvar.mul(0.5).exp_()
eps = std.data.new(std.size()).normal_()
return eps.mul(std).add_(mu)
else:
return mu
def forward(self,x):
# This function should be modified for the DAE and VAE
x = self.encoder(x)
mu, logvar = self.mu_layer(x), self.logvar_layer(x)
z = self.reparameterise(mu, logvar)
return self.decoder(z), mu, logvar
Vanilla Autoencoder
Training loss: 0.4089 Validation loss
Validation loss (reconstruction error) : 0.4171
VAE Loss = MSE + 0.5 * KLD
Training loss: 0.6420
Validation loss (reconstruction error) : 0.6060
VAE Loss = MSE + 1 * KLD
Training loss: 0.6821
Validation loss (reconstruction error) : 0.6550
VAE Loss = MSE + 5 * KLD
Training loss: 0.7122
Validation loss (reconstruction error) : 0.7154
Here you can see output results from different models. I also visualized the 30 dimensional latent space in 2D using sklearn.manifold.TSNE transformation.
We observe a low loss value for the vanilla autoencoder with 30D bottleneck size which results in high-quality reconstructed images. Although loss values increased in VAEs, the VAE arranged the latent space such that gaps between latent representations for different classes decreased. It means we can get better manipulated (mixed latents) output. Since VAE follows an isotropic multivariate normal distribution at the latent space, we can generate new unseen images by taking samples from the latent space with higher quality compared to the Vanilla autoencoder. However, the reconstruction quality was reduced (loss values increased) since the loss function is a weighted combination of MSE and KLD terms to be optimized where the KLD term forces the latent space to resemble a Gaussian distribution. As we increased the KLD weight, we achieved a more compact latent space closer to the prior distribution by sacrificing the reconstruction quality.
I am trying to implement discriminant condition codes in Keras as proposed in
Xue, Shaofei, et al., "Fast adaptation of deep neural network based
on discriminant codes for speech recognition."
The main idea is you encode each condition as an input parameter and let the network learn dependency between the condition and the feature-label mapping. On a new dataset instead of adapting the entire network you just tune these weights using backprop. For example say my network looks like this
X ---->|----|
|DNN |----> Y
Z --- >|----|
X: features Y: labels Z:condition codes
Now given a pretrained DNN, and X',Y' on a new dataset I am trying to estimate the Z' using backprop that will minimize prediction error on Y'. The math seems straightforward except I am not sure how to implement this in keras without having access to the backprop itself.
For instance, can I add an Input() layer with trainable=True with all other layers set to trainable= False. Can backprop in keras update more than just layer weights? Or is there a way to hack keras layers to do this?
Any suggestions welcome.
thanks
I figured out how to do this (exactly) in Keras by looking at fchollet's post here
Using the keras backend I was able to compute the gradient of my loss w.r.t to Z directly and used it to drive the update.
Code below:
import keras.backend as K
import numpy as np
model.summary() #Pretrained model
loss = K.categorical_crossentropy(Y, Y_out)
grads = K.gradients(loss, Z)
grads /= (K.sqrt(K.mean(K.square(grads)))+ 1e-5)
iterate = K.function([X,Z],[loss,grads])
step = 0.1
Z_adapt = Z_in.copy()
for i in range(100):
loss_val, grads_val = iterate([X_in,Z_adapt])
Z_adapt -= grads_val[0] * step
print "iter:",i,np.mean(loss_value)
print "Before:"
print model.evaluate([X_in, Z_in],Y_out)
print "After:"
print model.evaluate([X_in, Z_adapt],Y_out)
X,Y,Z are nodes in the model graph. Z_in is an initial value for Z'. I set it to an average value from the train set. Z_adapt is after 100 iterations of gradient descent and should give you a better result.
Assume that the size of Z is m x n. Then you can first define an input layer of size m * n x 1. The input will be an m * n x 1 vector of ones. You can define a dense layer containing m * n neurons and set trainable = True for it. The response of this layer will give you a flattened version of Z. Reshape it appropriately and give it as input to the rest of the network that can be appended ahead of this.
Keep in mind that if the size of Z is too large, then network may not be able to learn a dense layer of that many neurons. In that case, maybe you need to put additional constraints or look into convolutional layers. However, convolutional layers will put some constraints on Z.