When I train a neural network consisting of 2 convolutional and 2 fully connected layers on the MNIST handwritten digits task, I receive the following train loss curve:
The dataset contains 235 batches and I plotted the loss after each batch, for 1500 batches in total, so the model was trained for a little more than 6 epochs. The batches are sampled using the following PyTorch code:
sampler = torch.utils.data.RandomSampler(train_dataset, replacement=False)
train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, sampler=sampler)
So in each epoch every sample is seen exactly once, and every epoch uses a new shuffling of the data. As you can see in the plot, the loss drops rapidly at the start of every epoch.
I have never seen this behavior before and I was wondering why that is the case.
For comparison, when I run the same code, but I choose to sample with replacement, I get the following (normal looking) loss curve:
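For reference, the with-replacement run only differs in the sampler; a sketch of that change is below (num_samples is set to the dataset size here so that an "epoch" still spans 235 batches):

sampler = torch.utils.data.RandomSampler(train_dataset, replacement=True, num_samples=len(train_dataset))
train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, sampler=sampler)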
My thoughts so far:
Since by the start of each new epoch the specific samples have already been seen before, the network would have an easier task if it memorized the specific batches. Still, since every epoch uses a different shuffling, the network would have to memorize all of the samples, which in my experience does not happen after just 2 epochs.
I have also tried it on another dataset and with a variation of the model, with the same results.
My model structure is the following:
self.conv_part = nn.Sequential(
    nn.Conv2d(1, 32, (5, 5), stride=1, padding=2),   # 1x28x28 -> 32x28x28
    nn.ReLU(),
    nn.MaxPool2d((2, 2)),                            # 32x28x28 -> 32x14x14
    nn.Conv2d(32, 64, (5, 5), stride=1, padding=2),  # 32x14x14 -> 64x14x14
    nn.ReLU(),
    nn.MaxPool2d((2, 2))                             # 64x14x14 -> 64x7x7
)
self.linear_part = nn.Sequential(
    nn.Linear(7*7*64, 1024),
    nn.ReLU(),
    nn.Linear(1024, 10)
)
When the model is simplified by significantly reducing the number of channels in each layer, the problem vanishes almost completely. I have marked the epoch boundaries in the plot to get a clearer view.
Related
I am training a convolutional network using pytorch that works on 3D medical raster images (.nrrd files) to get estimated volume measurements from very noisy ultrasound images.
I have around 200 individual raster images from 30 patients and have augmented them to over 5000 by applying all kinds of transforms and noise along all 3 axes (chosen randomly). All the rasters are resized to 128x128x128 before being used.
I am doing 6-fold cross validation, where I make sure that the validation set is composed of entirely different patients from those in the training set. I think this helps see if the model is actually generalizing and is capable of estimating rasters of unseen patients.
Problem is, the model is failing to generalize or learn at all. See the results I get for 2 test runs I have made (10 hours processing each):
First Training Failure
Second Training Failure
The architecture used is just 6 convolutional layers followed by 2 densely connected ones, nothing too fancy. What could be causing this? Could it be that I don't have enough data for my model to learn?
I tried lowering the learning rate and raising the weight decay, with no luck. I haven't tried other loss functions or optimizers (currently using MSE loss and Adam).
Edit: Added code:
class RasterNet(nn.Module):
    def __init__(self):
        super(RasterNet, self).__init__()

        self.conv0 = nn.Sequential( # 128x128x128 -> 256x32x32
            nn.Conv2d(128, 256, kernel_size=7, stride=2, padding=3),
            nn.BatchNorm2d(256),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        self.conv1 = nn.Sequential( # 256x32x32 -> 512x16x16
            nn.Conv2d(256, 512, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(512),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        self.conv2 = nn.Sequential( # 512x16x16 -> 1024x8x8
            nn.Conv2d(512, 1024, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(1024),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        self.conv3 = nn.Sequential( # 1024x8x8 -> 2048x4x4
            nn.Conv2d(1024, 2048, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(2048),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        self.conv4 = nn.Sequential( # 2048x4x4 -> 4096x2x2
            nn.Conv2d(2048, 4096, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(4096),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        self.conv5 = nn.Sequential( # 4096x2x2 -> 8192x1x1
            nn.Conv2d(4096, 8192, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(8192),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        self.linear = nn.Sequential(
            nn.Linear(8192, 4096),
            nn.ReLU(),
            nn.Linear(4096, 1)
        )

    def forward(self, base):
        base = base.squeeze().float().to(dml)
        # View from y axis (Coronal, as this is the clearest view)
        base = torch.transpose(base, 2, 1)
        x = self.conv0(base)
        x = self.conv1(x)
        x = self.conv2(x)
        x = self.conv3(x)
        x = self.conv4(x)
        x = self.conv5(x)
        x = x.view(x.size(0), -1)
        return self.linear(x)
OK, a few notes which are not an "answer" per se but are too extended for comments:
First, the fact that your training loss converges to a low value, but your validation loss is high, means that your model is overfit to the training distribution. This could mean:
Your model architecture is not expressive enough to meaningfully distill high-level information from the low-level (pixel/voxel) information, so it instead learns training-set-wide bias terms that bring the loss relatively low. This could indicate that your validation and training splits come from different distributions, or that your loss function is not well chosen for the task.
Your model is too expressive (high variance), such that it can learn the exact training examples (classic overfitting).
Second, an almost-ubiquitous trick for NN training is to use at-runtime data augmentation. This means that, rather than generating a set of augmented images before training, you instead define a set of augmenting functions which apply data transformations randomly. These functions are applied to each data batch during training, so that the model never sees exactly the same data example twice.
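As a rough sketch of the idea (assuming a map-style PyTorch dataset over your pre-loaded volumes; torchvision's transforms are 2D, so for 3D .nrrd volumes the random transforms here operate on the tensors directly):

import random
import torch
from torch.utils.data import Dataset

class AugmentedVolumes(Dataset):
    # Hypothetical wrapper: `volumes` and `targets` are assumed to be
    # pre-loaded tensors of shape (N, 128, 128, 128) and (N,).
    def __init__(self, volumes, targets, train=True):
        self.volumes, self.targets, self.train = volumes, targets, train

    def __len__(self):
        return len(self.volumes)

    def __getitem__(self, idx):
        x = self.volumes[idx].clone()
        if self.train:
            # New random transforms are drawn every time the sample is
            # requested, so each epoch sees a different version of it.
            if random.random() < 0.5:
                x = torch.flip(x, dims=[random.choice([0, 1, 2])])
            x = x + 0.01 * torch.randn_like(x)  # mild additive noise
        return x, self.targets[idx]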
Third, this model architecture is relatively simplistic (simpler than AlexNet, the first modern deep CNN). Far greater performance has been achieved by making much deeper architectures and using residual connections (see ResNet) to deal with the vanishing-gradient problem. I'd be somewhat surprised if you could achieve good performance on this task with this architecture.
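For illustration, a residual block is just a couple of convolutions whose output is added back to their input, giving gradients an identity path to flow through (a minimal sketch, not the exact ResNet block):

import torch.nn as nn

class ResidualBlock(nn.Module):
    # Minimal residual block: output = F(x) + x.
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.body(x) + x)  # skip connection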
It is normal for the validation loss to be higher on average than the training loss. It is possible that your model is learning to some extent but the loss curve is relatively shallow when compared to the (likely overfit) training curve. I suggest also computing epoch-wide validation accuracy and reporting this value across epochs. You should see training accuracy increase, and possibly validation accuracy as well.
Do note that cross-validation is not exactly meant to determine whether the model generalizes to unseen patients; that is the purpose of the validation set. Instead, cross-validation ensures that the training/validation performance holds across multiple data partitions and isn't simply the result of selecting an "easy" validation set.
Purely for speed/simplicity, I recommend training the model first without cross-validation (i.e. use a single training-testing partition). Once you achieve good performance on the whole dataset, you can retrain with k-fold cross-validation to ensure the above, but this should make your debug cycles a bit faster.
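For that single split, something like scikit-learn's GroupShuffleSplit keeps the validation patients disjoint from the training patients (a sketch; samples and patient_ids are assumed arrays with one entry per augmented sample):

from sklearn.model_selection import GroupShuffleSplit

# One patient-disjoint train/validation split: no patient appears in both sets.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, val_idx = next(splitter.split(samples, groups=patient_ids))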
I have a question regarding the role of the batch size. My MLP model has 2 Dense-layers with "softmax" activation function:
from keras.models import Sequential
from keras.layers import Dense, BatchNormalization, Activation

# Create my MLP model:
model = Sequential()
model.add(Dense(units=64, input_dim=100))
model.add(BatchNormalization())
model.add(Activation("softmax"))
model.add(Dense(units=64))
model.add(BatchNormalization())
model.add(Activation("softmax"))
model.add(Dense(units=1))
Green: batch size = 2, Pink: batch size = 8, Red: batch size = 5
The dataset has 84000 samples. Each sample consists of 100 input values and 1 output value, and each sample describes a different subprocess, so there is no relationship between the samples. I have evaluated the training process with different batch sizes. What is the reason that the training result looks better when the batch size is increased to 8? Is there a relationship in my data samples that I was not aware of?
First of all, you are using batch norm, which, as the name suggests, normalises samples based on statistics of the current batch, so it works better when the batch size is bigger. Apart from this, a higher batch size also means lower variance of your gradient estimator, which is often good.
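To illustrate the first point with a toy snippet (unrelated to your actual data): the spread of per-batch mean estimates shrinks roughly as 1/sqrt(batch size), so the statistics BatchNorm relies on are far noisier at batch size 2 than at 8.

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=(84000, 1))
for batch_size in (2, 5, 8, 64):
    # mean of each of 1000 batches, then the spread of those estimates
    batch_means = data[:1000 * batch_size].reshape(-1, batch_size).mean(axis=1)
    print(batch_size, batch_means.std())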
For fun, I tried to create a neural network that can detect the time difference (t2 - t1) of two consecutive bounces of a ball within 1.5 seconds (disregarding the third bounce). "The idea is that if you have the time difference of the first two bounces, you can calculate the initial rebound height through a physics formula."
The input to the CNN is a spectrogram image as shown below. The output is one neuron, which outputs the time difference between the first bounce and the second bounce (t2, the second bounce, minus t1, the first bounce). Overall there are 1000 samples in the dataset.
The first two bounces can have the same time difference but be placed elsewhere in time. For example, one sample might be t2-t1=0.810-0.530=0.280 and another sample might be 0.980-0.700=0.280. This is clear in example 1 and example 2.
Example 1 of Spectrogram
Example 2 of Spectrogram
Here is the full code (it isn't much):
https://www.codepile.net/pile/Al51wXl6
Here's the network structure:
import tensorflow as tf

cnn = tf.keras.models.Sequential()
cnn.add(tf.keras.layers.Conv2D(filters=32, kernel_size=5, activation='relu', input_shape=[1025, 65, 1]))
cnn.add(tf.keras.layers.MaxPool2D(pool_size=2, strides=2))
cnn.add(tf.keras.layers.Conv2D(filters=32, kernel_size=3, activation='relu'))
cnn.add(tf.keras.layers.MaxPool2D(pool_size=2, strides=2))
cnn.add(tf.keras.layers.Flatten())
cnn.add(tf.keras.layers.Dense(units=128, activation='relu'))
cnn.add(tf.keras.layers.Dense(units=1, activation='sigmoid'))
cnn.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
cnn.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=30)
The output was far off the accuracy I was hoping for:
Mean Absolute error is: ~0.3
So my question is: am I misunderstanding CNNs, or why can't my CNN perform this task?
Most critical mistake
Choice of loss function and output unit
You have a regression task (predicting the continuous variable time difference), but your loss function is binary_crossentropy, which is for classification. You should use something like "mean_squared_error" instead.
The output neuron's non-linearity is sigmoid, which is for classification (or other things that should saturate between 0.0 and 1.0). I recommend using linear instead.
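A minimal sketch of the fix on your existing model (the rest of the network stays the same; mean absolute error replaces accuracy as the metric, since accuracy is not meaningful for regression):

cnn.add(tf.keras.layers.Dense(units=1, activation='linear'))
cnn.compile(optimizer='adam', loss='mean_squared_error', metrics=['mean_absolute_error'])
cnn.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=30)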
When not using KL divergence term, the VAE reconstructs mnist images almost perfectly but fails to generate new ones properly when provided with random noise.
When using KL divergence term, the VAE gives the same weird output both when reconstructing and generating images.
Here's the pytorch code for the loss function:
def loss_function(recon_x, x, mu, logvar):
    BCE = F.binary_cross_entropy(recon_x, x.view(-1, 784), size_average=True)
    KLD = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return BCE + KLD
recon_x is the reconstructed image, x is the original image, mu is the mean vector, and logvar is the vector containing the log of the variance.
What is going wrong here? Thanks in advance :)
A possible reason is the numerical imbalance between the two losses, with your BCE loss computed as an average over the batch (cf. size_average=True) while the KLD one is summed.
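For example, a sketch of a rebalanced version, where both terms use the same reduction and the KLD gets an explicit, tunable weight (beta is not part of the original code):

import torch
import torch.nn.functional as F

def loss_function(recon_x, x, mu, logvar, beta=1.0):
    batch_size = x.size(0)
    # Both terms are summed over elements and averaged over the batch.
    BCE = F.binary_cross_entropy(recon_x, x.view(-1, 784), reduction='sum') / batch_size
    KLD = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / batch_size
    return BCE + beta * KLD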
Multiplying the KLD by 0.0001 did it. The generated images are a little distorted, but the similarity issue is resolved.
Yes, try out different weight factors for the KLD loss term. Weighing down the KLD loss term resolves the same reconstruction-output issue on the CelebA dataset (http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html).
There are many possible reasons for that. As benjaminplanche stated, you need to use .mean instead of .sum reduction. Also, the KLD term weight can differ for different architectures and datasets, so try different weights and look at the reconstruction loss and the latent space to decide. There is a trade-off between the reconstruction loss (output quality) and the KLD term, which forces the model to shape a Gaussian-like latent space.
To evaluate different aspects of VAEs I trained a Vanilla autoencoder and VAE with different KLD term weights.
Note that I used the MNIST hand-written digits dataset to train networks with input size 784 = 28*28 and a 30-dimensional latent space. Although for data samples in the range [0, 1] we normally use a Sigmoid activation function, I used a Tanh for experimental reasons.
Vanilla Autoencoder:
Autoencoder(
  (encoder): Encoder(
    (nn): Sequential(
      (0): Linear(in_features=784, out_features=30, bias=True)
    )
  )
  (decoder): Decoder(
    (nn): Sequential(
      (0): Linear(in_features=30, out_features=784, bias=True)
      (1): Tanh()
    )
  )
)
Afterward, I implemented the VAE model as shown in the following code blocks. I trained this model with different KLD weights from the set {0.5, 1, 5}.
class VAE(nn.Module):
    def __init__(self, dim_latent_representation=2):
        super(VAE, self).__init__()

        class Encoder(nn.Module):
            def __init__(self, output_size=2):
                super(Encoder, self).__init__()
                self.nn = nn.Sequential(
                    nn.Linear(28 * 28, output_size),
                )

            def forward(self, x):
                return self.nn(x)

        class Decoder(nn.Module):
            def __init__(self, input_size=2):
                super(Decoder, self).__init__()
                self.nn = nn.Sequential(
                    nn.Linear(input_size, 28 * 28),
                    nn.Tanh(),
                )

            def forward(self, z):
                return self.nn(z)

        self.dim_latent_representation = dim_latent_representation
        self.encoder = Encoder(output_size=dim_latent_representation)
        self.mu_layer = nn.Linear(self.dim_latent_representation, self.dim_latent_representation)
        self.logvar_layer = nn.Linear(self.dim_latent_representation, self.dim_latent_representation)
        self.decoder = Decoder(input_size=dim_latent_representation)

    def reparameterise(self, mu, logvar):
        if self.training:
            std = logvar.mul(0.5).exp_()
            eps = std.data.new(std.size()).normal_()
            return eps.mul(std).add_(mu)
        else:
            return mu

    def forward(self, x):
        x = self.encoder(x)
        mu, logvar = self.mu_layer(x), self.logvar_layer(x)
        z = self.reparameterise(mu, logvar)
        return self.decoder(z), mu, logvar
Vanilla Autoencoder
Training loss: 0.4089
Validation loss (reconstruction error): 0.4171
VAE Loss = MSE + 0.5 * KLD
Training loss: 0.6420
Validation loss (reconstruction error): 0.6060
VAE Loss = MSE + 1 * KLD
Training loss: 0.6821
Validation loss (reconstruction error): 0.6550
VAE Loss = MSE + 5 * KLD
Training loss: 0.7122
Validation loss (reconstruction error): 0.7154
Here you can see output results from different models. I also visualized the 30-dimensional latent space in 2D using an sklearn.manifold.TSNE transformation.
We observe a low loss value for the vanilla autoencoder with its 30D bottleneck, which results in high-quality reconstructed images. Although the loss values increased for the VAEs, the VAE arranged the latent space such that the gaps between the latent representations of different classes decreased, which means we get better manipulated (mixed-latent) outputs. Since the VAE's latent space follows an isotropic multivariate normal distribution, we can generate new unseen images by sampling from the latent space, with higher quality than the vanilla autoencoder. However, the reconstruction quality is reduced (loss values increase) because the loss function is a weighted combination of the MSE and KLD terms, where the KLD term forces the latent space to resemble a Gaussian distribution. As we increase the KLD weight, we obtain a more compact latent space, closer to the prior distribution, at the cost of reconstruction quality.
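For completeness, generating new digits from a trained VAE amounts to decoding samples drawn from the prior (a sketch; vae stands for the trained model above with its 30-dimensional latent space):

import torch

with torch.no_grad():
    z = torch.randn(16, 30)      # samples from the isotropic Gaussian prior
    generated = vae.decoder(z)   # shape (16, 784); reshape to 28x28 to view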
I am using Keras to implement a simple network for binary classification. I have a dataset with 2 categories and I am trying to train my network on this data. I don't have a huge dataset; the total number of images across both categories is around 500.
The network is as below:
self.model = Sequential()
self.model.add(Conv2D(128, (2, 2), padding='same', input_shape=dataset.X_train.shape[1:]))
self.model.add(Activation('relu'))
self.model.add(MaxPooling2D(pool_size=(2, 2)))
self.model.add(Dropout(0.25))
self.model.add(Conv2D(64, (2, 2), padding='same'))
self.model.add(Activation('relu'))
self.model.add(MaxPooling2D(pool_size=(2, 2)))
self.model.add(Dropout(0.25))
SGD config:
sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
I am using binary_crossentropy as the loss.
The model training and loss graph look as below:
I am just wondering why there are a lot of big peaks in the graphs and what I can do to smooth them out.
I am a newbie thus any comments and suggestions will be appreciated.
thanks!
If you look at the end of each epoch in the training/test curves, it seems that the accuracy drops (and the loss also increases), which suggests that the ordering of your dataset doesn't change between epochs. This might not lead to better generalization of the model. In my opinion, what you should do at each epoch is to shuffle your dataset (batches) in the training phase; for the testing phase you can just leave it as it is, since the model isn't doing any learning there anymore.
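In Keras this usually just means checking the shuffle argument of fit, which reshuffles the training data before each epoch (it defaults to True, so make sure it wasn't disabled; the batch size below is only illustrative, and the validation data is never shuffled):

model.fit(X_train, y_train,
          validation_data=(X_test, y_test),
          batch_size=32, epochs=30,
          shuffle=True)  # reshuffle the training samples before every epoch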
I believe these peaks actually coincide with the start of a new epoch. Throughout one epoch, the gradients of previous batches contribute to the current update when you use momentum. This explains why the loss decreases steadily throughout an epoch and spikes at the beginning of the next one; that is, when the new epoch starts, the optimiser doesn't use gradients computed for batches in previous epochs.
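To make the momentum part concrete, here is a minimal sketch of an SGD-with-momentum update: the velocity accumulates an exponentially decaying sum of past batch gradients, so each update mixes in information from earlier batches.

import numpy as np

def sgd_momentum_step(weights, grad, velocity, lr=0.01, momentum=0.9):
    # The velocity carries a decaying memory of previous batch gradients,
    # which is why consecutive updates within an epoch are correlated.
    velocity = momentum * velocity + grad
    return weights - lr * velocity, velocity

# usage (weights, batch_grad, velocity are numpy arrays of the same shape):
# weights, velocity = sgd_momentum_step(weights, batch_grad, velocity)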