Why is this deep learning convolutional model not generalizing? - deep-learning

I am training a convolutional network with PyTorch that works on 3D medical raster images (.nrrd files) to get estimated volume measurements from very noisy ultrasound images.
I have around 200 individual raster images of 30 patients and have augmented them to over 5000 by applying all kinds of transforms and noise on all 3 axes (chosen randomly). All the rasters are resized to 128x128x128 before being used.
I am doing 6-fold cross validation, where I make sure that the validation set is composed of entirely different patients from those in the training set. I think this helps show whether the model is actually generalizing and is capable of estimating rasters of unseen patients.
Problem is, the model is failing to generalize or learn at all. See the results I get for 2 test runs I have made (10 hours processing each):
First Training Failure
Second Training Failure
The architecture used is just 6 convolutional layers followed by 2 densely connected ones, nothing too fancy. What could be causing this? Could it be I don't have enough data for my model to learn?
I tried lowering the learning rate and raising the weight decay, with no luck. I haven't tried other loss criteria or optimizers (currently using MSE loss and Adam).
Edit: Added code:
class RasterNet(nn.Module):
    def __init__(self):
        super(RasterNet, self).__init__()
        self.conv0 = nn.Sequential( # 128x128x128 -> 256x32x32
            nn.Conv2d(128, 256, kernel_size=7, stride=2, padding=3),
            nn.BatchNorm2d(256),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        self.conv1 = nn.Sequential( # 256x32x32 -> 512x16x16
            nn.Conv2d(256, 512, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(512),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        self.conv2 = nn.Sequential( # 512x16x16 -> 1024x8x8
            nn.Conv2d(512, 1024, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(1024),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        self.conv3 = nn.Sequential( # 1024x8x8 -> 2048x4x4
            nn.Conv2d(1024, 2048, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(2048),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        self.conv4 = nn.Sequential( # 2048x4x4 -> 4096x2x2
            nn.Conv2d(2048, 4096, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(4096),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        self.conv5 = nn.Sequential( # 4096x2x2 -> 8192x1x1
            nn.Conv2d(4096, 8192, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(8192),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        self.linear = nn.Sequential(
            nn.Linear(8192, 4096),
            nn.ReLU(),
            nn.Linear(4096, 1)
        )

    def forward(self, base):
        base = base.squeeze().float().to(dml)
        # View from y axis (Coronal, as this is the clearest view)
        base = torch.transpose(base, 2, 1)
        x = self.conv0(base)
        x = self.conv1(x)
        x = self.conv2(x)
        x = self.conv3(x)
        x = self.conv4(x)
        x = self.conv5(x)
        x = x.view(x.size(0), -1)
        return self.linear(x)

OK, a few notes which are not an "answer" per se but are too extended for comments:
First, the fact that your training loss converges to a low value but your validation loss is high means that your model is overfit to the training distribution. This could mean:
Your model architecture is not expressive enough to meaningfully distill high-level information from low-level (pixel/voxel) information, so it instead learns training-set-wide bias terms that bring the loss relatively low. This could indicate that your validation and training splits are from different distributions, or else that your loss function is not well chosen for the task.
Your model is too expressive (high variance), such that it can learn the exact training examples (classic overfitting).
Second, an almost-ubiquitous trick for NN training is to use at-runtime data augmentation. This means that, rather than generating a set of augmented images before training, you instead define a set of augmenting functions which apply data transformations randomly. This set of functions is used to transform the data batch at each training epoch, so that the model never sees exactly the same data example twice.
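For illustration, here is a minimal sketch of what runtime augmentation could look like for this setting. It assumes your volumes and volume targets are already loaded as tensors; the class name and the specific transforms (random flips plus Gaussian noise) are placeholders, not your actual pipeline:
import torch
from torch.utils.data import Dataset

class AugmentedRasterDataset(Dataset):
    """Sketch: apply random transforms at load time, so every epoch sees a
    differently perturbed version of each raster."""
    def __init__(self, volumes, targets):
        self.volumes = volumes   # list of 128x128x128 tensors
        self.targets = targets   # list of scalar volume measurements

    def __len__(self):
        return len(self.volumes)

    def __getitem__(self, idx):
        vol = self.volumes[idx].clone().float()
        # random flips along each spatial axis
        for axis in (0, 1, 2):
            if torch.rand(1).item() < 0.5:
                vol = torch.flip(vol, dims=[axis])
        # additive Gaussian noise, freshly sampled on every call
        vol = vol + 0.01 * torch.randn_like(vol)
        return vol, self.targets[idx]
Because the transforms run inside __getitem__, the DataLoader produces a new perturbation of each example every epoch instead of a fixed pre-augmented set.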
Third, this model architecture is relatively simplistic (simpler than AlexNet, the first modern deep CNN). Far greater performance has been achieved by making much deeper architectures and using residual layers (see ResNet) to deal with the vanishing gradient problem. I'd be somewhat surprised if you could achieve good performance on this task with this architecture.
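For reference, the core idea of a residual layer is just an identity skip connection around a small stack of convolutions. A minimal sketch (using 3D convolutions to match volumetric input; this is the idea only, not the exact ResNet block):
import torch.nn as nn

class ResidualBlock3d(nn.Module):
    """Sketch of a residual block: output = activation(F(x) + x)."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm3d(channels),
            nn.ReLU(),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm3d(channels),
        )
        self.act = nn.ReLU()

    def forward(self, x):
        # skip connection: gradients can flow around the conv stack
        return self.act(self.body(x) + x)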
It is normal for the validation loss to be higher on average than the training loss. It is possible that your model is learning to some extent but the loss curve is relatively shallow when compared to the (likely overfit) training curve. I suggest also computing epoch-wide validation accuracy and reporting this value across epochs. You should see training accuracy increase, and possibly validation accuracy as well.
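If it helps, here is a minimal sketch of computing such an epoch-wide validation metric. Since the target here is a continuous volume, mean absolute error stands in for accuracy; the function and the val_loader name are hypothetical, not taken from your code:
import torch

def epoch_validation_mae(model, val_loader, device):
    """Sketch: compute an epoch-wide validation MAE to track alongside the loss."""
    model.eval()
    total_err, n = 0.0, 0
    with torch.no_grad():
        for vols, targets in val_loader:
            preds = model(vols.to(device)).squeeze(-1)
            total_err += (preds - targets.to(device)).abs().sum().item()
            n += targets.numel()
    model.train()
    return total_err / n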
Do note that cross-validation is not exactly meant to determine whether the model generalizes to unseen patients; that is the purpose of the validation set. Instead, cross-validation ensures that the training/validation performance holds across multiple data partitions and isn't simply the result of selecting an "easy" validation set.
Purely for speed/simplicity, I recommend training the model first without cross-validation (i.e. use a single training-testing partition). Once you achieve good performance on the whole dataset, you can retrain with k-fold cross-validation to confirm the above, but this should make your debug cycles a bit faster.
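One hedged way to build that single, patient-disjoint partition is sklearn's GroupShuffleSplit (a sketch only; raster_paths and patient_ids are hypothetical names for your file list and the patient each file belongs to, and test_size is arbitrary):
from sklearn.model_selection import GroupShuffleSplit

# single patient-disjoint train/validation split
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, val_idx = next(splitter.split(raster_paths, groups=patient_ids))
train_files = [raster_paths[i] for i in train_idx]
val_files = [raster_paths[i] for i in val_idx]
Grouping by patient keeps the property you already enforce in your k-fold setup (no patient appears in both sets) while giving you a single fast debug split.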

Related

Unclear Architecture of MNIST Neural Network

I am trying to reproduce a Neural Network trained to detect whether there is a 0-3 digit in an image with another confounding image. The paper I am following lists the corresponding architecture:
A neural network with 28×56 input neurons and one output neuron is
trained on this task. The input values are coded between −0.5 (black)
and +1.5 (white). The neural network is composed of a first detection
pooling layer with 400 detection neurons sum-pooled into 100 units
(i.e. we sum-pool non-overlapping groups of 4 detection units). A
second detection-pooling layer with 400 detection neurons is applied
to the 100-dimensional output of the previous layer, and activities
are sum-pooled onto a single unit representing the deep network
output. Positive examples (0-3 digit in the image) are assigned target
value 100 and negative examples are assigned target value 0. The
neural network is trained to minimize the mean-square error between
the target values and its output.
My main doubt is what they mean by detection neurons in this context: whether they mean filters or single standard ReLU neurons. Also, if they mean filters, how could these be applied in the second layer to a 100-dimensional output when they are designed to operate on 2x2 matrices?
Reference:
Montavon, G., Bach, S., Binder, A., Samek, W., & Müller, K. (2015). Explaining Nonlinear Classification Decisions with Deep Taylor Decomposition. arXiv. https://doi.org/10.1016/j.patcog.2016.11.008
Specifically section 4.C.
Thanks a lot for the help!
My best guess at this is something like (code not tested - just rough PyTorch):
from torch import nn

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Sequential(
            nn.Flatten(),             # Flatten row-wise into a 1D sequence
            nn.Linear(28 * 56, 400),  # Linear layer with 400 outputs.
            nn.AvgPool1d(4, 4),       # Pool non-overlapping groups of 4 down to 100 outputs
                                      # (average pooling; sum pooling differs only by a constant factor of 4).
        )
        self.layer2 = nn.Sequential(
            nn.Linear(100, 400),      # Linear layer with 400 outputs.
            nn.AdaptiveAvgPool1d(1),  # Pool down to a single output.
        )

    def forward(self, x):
        return self.layer2(self.layer1(x))
But overall I would agree with the commenter on your post that there are some issues with the language here.

Why does my train loss jump down when a new epoch starts?

When I train a neural network consisting of 2 convolutional and 2 fully connected layers on the MNIST handwritten digits task, I receive the following train loss curve:
The dataset contains 235 batches and I plotted the loss after each batch for 1500 batches in total, therefore training the model for a little more than 6 epochs. The batches are sampled using the following PyTorch code:
sampler = torch.utils.data.RandomSampler(train_dataset, replacement=False)
train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, sampler=sampler)
So, in each epoch every batch is looked at once, and every epoch has a new shuffling of the batches. As you can see in the plot, the loss decreases rapidly at the start of every epoch.
I have never seen this behavior before and I was wondering why that is the case.
For comparison, when I run the same code, but I choose to sample with replacement, I get the following (normal looking) loss curve:
My thoughts so far:
Since by the start of each epoch the specific samples have already been seen before, the network could have an easier task if it memorized the specific batches. Still, since every epoch uses a different shuffling, the network would have to memorize all samples, which in my experience does not happen after only 2 epochs.
I have also tried it on another dataset and with a variation of the model with the same results.
My model structure is the following:
self.conv_part = nn.Sequential(
    nn.Conv2d(1, 32, (5, 5), stride=1, padding=2),
    nn.ReLU(),
    nn.MaxPool2d((2, 2)),
    nn.Conv2d(32, 64, (5, 5), stride=1, padding=2),
    nn.ReLU(),
    nn.MaxPool2d((2, 2))
)
self.linear_part = nn.Sequential(
    nn.Linear(7*7*64, 1024),
    nn.ReLU(),
    nn.Linear(1024, 10)
)
When the model is simplified by reducing the number of channels in each layer significantly, the problem vanishes almost completely. I have marked the epochs to get a clearer view.
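For concreteness, a reduced-channel variant of the model above might look like the sketch below; the exact channel counts used are not given in the question, so 8, 16 and 128 here are placeholders for illustration only:
# Illustrative reduced-channel variant (placeholder channel counts)
self.conv_part = nn.Sequential(
    nn.Conv2d(1, 8, (5, 5), stride=1, padding=2),
    nn.ReLU(),
    nn.MaxPool2d((2, 2)),
    nn.Conv2d(8, 16, (5, 5), stride=1, padding=2),
    nn.ReLU(),
    nn.MaxPool2d((2, 2))
)
self.linear_part = nn.Sequential(
    nn.Linear(7*7*16, 128),
    nn.ReLU(),
    nn.Linear(128, 10)
)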

Training with BatchNorm in pytorch

I'm wondering if I need to do anything special when training with BatchNorm in pytorch. From my understanding the gamma and beta parameters are updated with gradients as would normally be done by an optimizer. However, the mean and variance of the batches are updated slowly using momentum.
So do we need to specify to the optimizer when the mean and variance parameters are updated, or does pytorch automatically take care of this?
Is there a way to access the mean and variance of the BN layer so that I can make sure they were changing while I trained the model?
If needed here is my model and training procedure:
def bn_drop_lin(n_in:int, n_out:int, bn:bool=True, p:float=0.):
    "Sequence of batchnorm (if `bn`), dropout (with `p`) and linear (`n_in`,`n_out`) layers followed by `actn`."
    layers = [nn.BatchNorm1d(n_in)] if bn else []
    if p != 0: layers.append(nn.Dropout(p))
    layers.append(nn.Linear(n_in, n_out))
    return nn.Sequential(*layers)
class Model(nn.Module):
    def __init__(self, i, o, h=()):
        super().__init__()
        nodes = (i,) + h + (o,)
        self.layers = nn.ModuleList([bn_drop_lin(i, o, p=0.5)
                                     for i, o in zip(nodes[:-1], nodes[1:])])

    def forward(self, x):
        x = x.view(x.shape[0], -1)
        for layer in self.layers[:-1]:
            x = F.relu(layer(x))
        return self.layers[-1](x)
Training:
for i, data in enumerate(trainloader):
    # get the inputs; data is a list of [inputs, labels]
    inputs, labels = data

    # zero the parameter gradients
    optimizer.zero_grad()

    # forward + backward + optimize
    outputs = net(inputs)
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()
Batchnorm layers behave differently depending on whether the model is in train or eval mode.
When net is in train mode (i.e. after calling net.train()) the batch norm layers contained in net will use batch statistics along with gamma and beta parameters to scale and translate each mini-batch. The running mean and variance will also be adjusted while in train mode. These updates to running mean and variance occur during the forward pass (when net(inputs) is called). The gamma and beta parameters are like any other PyTorch parameter and are updated only once optimizer.step() is called.
When net is in eval mode (net.eval()) batch norm uses the historical running mean and running variance computed during training to scale and translate samples.
You can check the batch norm layers' running mean and variance by displaying the layers' running_mean and running_var members to ensure batch norm is updating them as expected. The learnable gamma and beta parameters can be accessed by displaying the weight and bias members of a batch norm layer, respectively.
Edit
Below is a simple demonstration showing that running_mean is updated during the forward pass. Observe that it is not updated by the optimizer.
>>> import torch
>>> import torch.nn as nn
>>> layer = nn.BatchNorm1d(5)
>>> layer.train()
>>> layer.running_mean
tensor([0., 0., 0., 0., 0.])
>>> result = layer(torch.randn(5,5))
>>> layer.running_mean
tensor([ 0.0271, 0.0152, -0.0403, -0.0703, -0.0056])
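As a further check (continuing the same session), the learnable parameters and the running statistics live in different places: weight and bias (gamma and beta) are returned by named_parameters() and therefore seen by the optimizer, while the running statistics are buffers updated only in forward. In recent PyTorch versions this looks like:
>>> [name for name, _ in layer.named_parameters()]
['weight', 'bias']
>>> [name for name, _ in layer.named_buffers()]
['running_mean', 'running_var', 'num_batches_tracked']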

Variational Autoencoder gives same output image for every input mnist image when using KL divergence

When not using the KL divergence term, the VAE reconstructs MNIST images almost perfectly but fails to generate new ones properly when provided with random noise.
When using the KL divergence term, the VAE gives the same weird output both when reconstructing and when generating images.
Here's the pytorch code for the loss function:
def loss_function(recon_x, x, mu, logvar):
    BCE = F.binary_cross_entropy(recon_x, x.view(-1, 784), size_average=True)
    KLD = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return (BCE + KLD)
recon_x is the reconstructed image, x is the original image, mu is the mean vector, and logvar is the vector containing the log of the variance.
What is going wrong here? Thanks in advance :)
A possible reason is the numerical imbalance between the two losses, with your BCE loss computed as an average over the batch (cf. size_average=True) while the KLD one is summed.
Multiplying the KLD by 0.0001 did it. The generated images are a little distorted, but the similarity issue is resolved.
Yes, try different weight factors for the KLD loss term. Weighting down the KLD term resolves the same identical-reconstruction issue on the CelebA dataset (http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html).
There are many possible reasons for that. As benjaminplanche stated, you need to use a .mean instead of a .sum reduction. Also, the KLD term weight can differ for different architectures and data sets, so try different weights and inspect the reconstruction loss and the latent space to decide. There is a trade-off between the reconstruction loss (output quality) and the KLD term, which forces the model to shape a Gaussian-like latent space.
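As a concrete illustration of the two suggestions above (mean reduction plus an explicit KLD weight), here is a sketch of a re-weighted loss; the kld_weight default is only a placeholder and needs tuning per dataset and architecture:
import torch
import torch.nn.functional as F

def weighted_vae_loss(recon_x, x, mu, logvar, kld_weight=1e-4):
    """Sketch: BCE averaged per element, KLD summed per sample and averaged
    over the batch, with an explicit weight on the KLD term."""
    bce = F.binary_cross_entropy(recon_x, x.view(-1, 784), reduction='mean')
    kld = -0.5 * torch.mean(torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1))
    return bce + kld_weight * kld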
To evaluate different aspects of VAEs, I trained a vanilla autoencoder and a VAE with different KLD term weights.
Note that I used the MNIST handwritten digits dataset to train networks with input size 784 = 28*28 and a 30-dimensional latent space. Although for data samples in the range [0, 1] we normally use a Sigmoid activation function, I used a Tanh for experimental reasons.
Vanilla Autoencoder:
Autoencoder(
  (encoder): Encoder(
    (nn): Sequential(
      (0): Linear(in_features=784, out_features=30, bias=True)
    )
  )
  (decoder): Decoder(
    (nn): Sequential(
      (0): Linear(in_features=30, out_features=784, bias=True)
      (1): Tanh()
    )
  )
)
Afterward, I implemented the VAE model as shown in the following code blocks. I trained this model with different KLD weights from the set {0.5, 1, 5}.
class VAE(nn.Module):
    def __init__(self, dim_latent_representation=2):
        super(VAE, self).__init__()

        class Encoder(nn.Module):
            def __init__(self, output_size=2):
                super(Encoder, self).__init__()
                # needs your implementation
                self.nn = nn.Sequential(
                    nn.Linear(28 * 28, output_size),
                )

            def forward(self, x):
                # needs your implementation
                return self.nn(x)

        class Decoder(nn.Module):
            def __init__(self, input_size=2):
                super(Decoder, self).__init__()
                # needs your implementation
                self.nn = nn.Sequential(
                    nn.Linear(input_size, 28 * 28),
                    nn.Tanh(),
                )

            def forward(self, z):
                # needs your implementation
                return self.nn(z)

        self.dim_latent_representation = dim_latent_representation
        self.encoder = Encoder(output_size=dim_latent_representation)
        self.mu_layer = nn.Linear(self.dim_latent_representation, self.dim_latent_representation)
        self.logvar_layer = nn.Linear(self.dim_latent_representation, self.dim_latent_representation)
        self.decoder = Decoder(input_size=dim_latent_representation)

    # Implement this function for the VAE model
    def reparameterise(self, mu, logvar):
        if self.training:
            std = logvar.mul(0.5).exp_()
            eps = std.data.new(std.size()).normal_()
            return eps.mul(std).add_(mu)
        else:
            return mu

    def forward(self, x):
        # This function should be modified for the DAE and VAE
        x = self.encoder(x)
        mu, logvar = self.mu_layer(x), self.logvar_layer(x)
        z = self.reparameterise(mu, logvar)
        return self.decoder(z), mu, logvar
Vanilla Autoencoder:
Training loss: 0.4089
Validation loss (reconstruction error): 0.4171
VAE Loss = MSE + 0.5 * KLD:
Training loss: 0.6420
Validation loss (reconstruction error): 0.6060
VAE Loss = MSE + 1 * KLD:
Training loss: 0.6821
Validation loss (reconstruction error): 0.6550
VAE Loss = MSE + 5 * KLD:
Training loss: 0.7122
Validation loss (reconstruction error): 0.7154
Here you can see output results from different models. I also visualized the 30 dimensional latent space in 2D using sklearn.manifold.TSNE transformation.
We observe a low loss value for the vanilla autoencoder with a 30D bottleneck, which results in high-quality reconstructed images. Although the loss values increased for the VAEs, the VAE arranged the latent space such that the gaps between latent representations of different classes decreased. This means we can get better manipulated (mixed-latent) outputs. Since the VAE latent space follows an isotropic multivariate normal distribution, we can generate new, unseen images by sampling from the latent space, with higher quality compared to the vanilla autoencoder. However, the reconstruction quality is reduced (loss values increase) because the loss function is a weighted combination of the MSE and KLD terms, where the KLD term forces the latent space to resemble a Gaussian distribution. As we increase the KLD weight, we obtain a more compact latent space closer to the prior distribution, at the cost of reconstruction quality.
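As an illustration of that last point, new images can be generated by decoding latent vectors drawn from the prior. A sketch, assuming a trained instance named vae of the VAE class above:
import torch

# sketch: generate new images by sampling the latent prior and decoding
vae.eval()
with torch.no_grad():
    z = torch.randn(16, vae.dim_latent_representation)  # z ~ N(0, I)
    samples = vae.decoder(z)                 # shape (16, 784), in [-1, 1] due to Tanh
    samples = samples.view(16, 1, 28, 28)    # reshape to image grid for plotting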

Why does Keras LSTM batch size used for prediction have to be the same as fitting batch size?

When using a Keras LSTM to predict on time series data, I've been getting errors when I train the model using a batch size of 50 and then try to predict on the same model using a batch size of 1 (i.e. just predicting the next value).
Why am I not able to train and fit the model with multiple batches at once, and then use that model to predict for anything other than the same batch size? It doesn't seem to make sense, but I could easily be missing something about this.
Edit: this is the model. batch_size is 50, sl is sequence length, which is set at 20 currently.
model = Sequential()
model.add(LSTM(1, batch_input_shape=(batch_size, 1, sl), stateful=True))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(trainX, trainY, epochs=epochs, batch_size=batch_size, verbose=2)
Here is the line for predicting on the training set for RMSE:
# make predictions
trainPredict = model.predict(trainX, batch_size=batch_size)
Here is the actual prediction of unseen time steps:
for i in range(test_len):
    print('Prediction %s: ' % str(pred_count))
    next_pred_res = np.reshape(next_pred, (next_pred.shape[1], 1, next_pred.shape[0]))
    # make predictions
    forecastPredict = model.predict(next_pred_res, batch_size=1)
    forecastPredictInv = scaler.inverse_transform(forecastPredict)
    forecasts.append(forecastPredictInv)
    next_pred = next_pred[1:]
    next_pred = np.concatenate([next_pred, forecastPredict])
    pred_count += 1
The issue is with the line:
forecastPredict = model.predict(next_pred_res, batch_size=batch_size)
The error when batch_size here is set to 1 is:
ValueError: Cannot feed value of shape (1, 1, 2) for Tensor 'lstm_1_input:0', which has shape '(10, 1, 2)'
which is the same error that is thrown when batch_size is set to 50, or to other batch sizes as well.
The total error is:
forecastPredict = model.predict(next_pred_res, batch_size=1)
File "/home/entelechy/tf_keras/lib/python3.5/site-packages/keras/models.py", line 899, in predict
return self.model.predict(x, batch_size=batch_size, verbose=verbose)
File "/home/entelechy/tf_keras/lib/python3.5/site-packages/keras/engine/training.py", line 1573, in predict
batch_size=batch_size, verbose=verbose)
File "/home/entelechy/tf_keras/lib/python3.5/site-packages/keras/engine/training.py", line 1203, in _predict_loop
batch_outs = f(ins_batch)
File "/home/entelechy/tf_keras/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 2103, in __call__
feed_dict=feed_dict)
File "/home/entelechy/tf_keras/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 767, in run
run_metadata_ptr)
File "/home/entelechy/tf_keras/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 944, in _run
% (np_val.shape, subfeed_t.name, str(subfeed_t.get_shape())))
ValueError: Cannot feed value of shape (1, 1, 2) for Tensor 'lstm_1_input:0', which has shape '(10, 1, 2)'
Edit: Once I set the model to stateful=False, I am able to use different batch sizes for fitting/training and prediction. What is the reason for this?
Unfortunately, what you want to do is impossible with Keras... I have also struggled a long time with this problem, and the only way is to dive into the rabbit hole and work with TensorFlow directly to do LSTM rolling prediction.
First, to be clear on terminology: batch_size usually means the number of sequences that are trained together, and num_steps means how many time steps are trained together. When you say batch_size=1 and "just predicting the next value", I think you mean to predict with num_steps=1.
Otherwise, it should be possible to train and predict with batch_size=50, meaning you are training on 50 sequences and making 50 predictions every time step, one for each sequence (meaning training/prediction num_steps=1).
However, I think what you mean is that you want to use a stateful LSTM to train with num_steps=50 and do prediction with num_steps=1. Theoretically this makes sense and should be possible, and it is possible with TensorFlow, just not with Keras.
The problem: Keras requires an explicit batch size for stateful RNN. You must specify batch_input_shape (batch_size, num_steps, features).
The reason: Keras must allocate a fixed-size hidden state vector in the computation graph with shape (batch_size, num_units) in order to persist the values between training batches. On the other hand, when stateful=False, the hidden state vector can be initialized dynamically with zeroes at the beginning of each batch so it does not need to be a fixed size. More details here: http://philipperemy.github.io/keras-stateful-lstm/
Possible workaround: Train and predict with num_steps=1. Example: https://github.com/keras-team/keras/blob/master/examples/lstm_stateful.py. This might or might not work for your problem, as the gradient for backpropagation will be computed on only one time step. See: https://github.com/fchollet/keras/issues/3669
My solution: use TensorFlow. In TensorFlow you can train with batch_size=50, num_steps=100, then do predictions with batch_size=1, num_steps=1. This is possible by creating different model graphs for training and prediction that share the same RNN weight matrices. See this example for next-character prediction: https://github.com/sherjilozair/char-rnn-tensorflow/blob/master/model.py#L11 and the blog post http://karpathy.github.io/2015/05/21/rnn-effectiveness/. Note that one graph can still only work with one specified batch_size, but you can set up multiple model graphs sharing weights in TensorFlow.
Sadly what you wish for is impossible because you specify the batch_size when you define the model...
However, I found a simple way around this problem: create 2 models! The first is used for training and the second for predictions, and have them share weights:
train_model = Sequential([Input(batch_input_shape=(batch_size, ...)),
                          <continue specifying your model>])

predict_model = Sequential([Input(batch_input_shape=(1, ...)),
                            <continue specifying exact same model>])

train_model.compile(loss='sparse_categorical_crossentropy', optimizer=Adam())
predict_model.compile(loss='sparse_categorical_crossentropy', optimizer=Adam())
Now you can use any batch size you want. After you fit your train_model, just save its weights and load them with the predict_model:
train_model.save_weights('lstm_model.h5')
predict_model.load_weights('lstm_model.h5')
Notice that you only want to save and load the weights, not the whole model (which includes the architecture, optimizer, etc.). This way you get the weights, but you can input one batch at a time...
More on Keras save/load models:
https://keras.io/getting-started/faq/#how-can-i-save-a-keras-model
Notice that you need to install h5py to use save_weights.
Another easy workaround is:
def create_model(batch_size):
    model = Sequential()
    model.add(LSTM(1, batch_input_shape=(batch_size, 1, sl), stateful=True))
    model.add(Dense(1))
    return model

model_train = create_model(batch_size=50)
model_train.compile(loss='mean_squared_error', optimizer='adam')
model_train.fit(trainX, trainY, epochs=epochs, batch_size=batch_size)

model_predict = create_model(batch_size=1)
weights = model_train.get_weights()
model_predict.set_weights(weights)
The best solution to this problem is "Copy Weights". It can be really helpful if you want to train and predict with your LSTM model using different batch sizes.
For example, once you have trained your model with batch size 'n' as shown below:
# configure network
n_batch = len(X)
n_epoch = 1000
n_neurons = 10
# design network
model = Sequential()
model.add(LSTM(n_neurons, batch_input_shape=(n_batch, X.shape[1], X.shape[2]), stateful=True))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
Now you want to predict values with a batch size smaller than the one you trained with, e.g. n=1.
What you can do is copy the weights of your fitted model, reinitialize a new LSTM model with the same architecture, and set its batch size equal to 1.
# re-define the batch size
n_batch = 1
# re-define model
new_model = Sequential()
new_model.add(LSTM(n_neurons, batch_input_shape=(n_batch, X.shape[1], X.shape[2]), stateful=True))
new_model.add(Dense(1))
# copy weights
old_weights = model.get_weights()
new_model.set_weights(old_weights)
Now you can easily predict and train LSTMs with different batch sizes.
For more information please read: https://machinelearningmastery.com/use-different-batch-sizes-training-predicting-python-keras/
I found the article below helpful (and fully in line with the above). The section "Solution 3: Copy Weights" worked for me:
How to use Different Batch Sizes when Training and Predicting with LSTMs, by Jason Brownlee
n_neurons = 10
# design network
model = Sequential()
model.add(LSTM(n_neurons, batch_input_shape=(n_batch, X.shape[1], X.shape[2]), stateful=True))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
# fit network
for i in range(n_epoch):
    model.fit(X, y, epochs=1, batch_size=n_batch, verbose=1, shuffle=False)
    model.reset_states()
# re-define the batch size
n_batch = 1
# re-define model
new_model = Sequential()
new_model.add(LSTM(n_neurons, batch_input_shape=(n_batch, X.shape[1], X.shape[2]), stateful=True))
new_model.add(Dense(1))
# copy weights
old_weights = model.get_weights()
new_model.set_weights(old_weights)
# compile model
new_model.compile(loss='mean_squared_error', optimizer='adam')
I also had the same problem and resolved it.
Alternatively, you can save your weights, and when you test your results, reload the model with the same architecture and set batch_size=1 as below:
n_neurons = 10
# design network (batch size of 1 for prediction)
model = Sequential()
model.add(LSTM(n_neurons, batch_input_shape=(1, X.shape[1], X.shape[2]), stateful=True))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
model.load_weights("w.h5")
It will work well. I hope it will be helpful for you.
If you don't have access to the code that created the model, or if you just don't want your prediction/validation code to depend on your model creation and training code, there is another way:
You could create a new model from a modified version of the loaded model's config like this:
loaded_model = tf.keras.models.load_model('model_file.h5')
config = loaded_model.get_config()
old_batch_input_shape = config['layers'][0]['config']['batch_input_shape']
config['layers'][0]['config']['batch_input_shape'] = (new_batch_size,) + tuple(old_batch_input_shape[1:])
new_model = loaded_model.__class__.from_config(config)
new_model.set_weights(loaded_model.get_weights())
This works well for me in situations where I have several different models with stateful RNN layers working together in a graph network but being trained separately, leading to different batch sizes. It allows me to experiment with the model structures and training batches without needing to change anything in my validation script.