Why are all weights in my GRU close to zero after training? - deep-learning

Problem description:
Problem description:
I want to train a GRU on a small dataset of 40 training samples. The input sequence length is 3 and the output sequence length is 1. Below is the configuration:
hidden units: 128
input size(input feature num): 11
layer: 2
dropout: 0.5
bidirectional: False
lr: 0.0001
optimizer: Adam
batch size: 8
Below is my model. Thanks to Azhar Khan for helping me format the code.
class GRU(nn.Module):
    def __init__(self, GRU_input_num, GRU_hidden_num, GRU_layer_num, dropout, bidirectional, seed):
        super(GRU, self).__init__()
        torch.manual_seed(seed)
        self.GRU_input_num = GRU_input_num
        self.GRU_hidden_num = GRU_hidden_num
        self.GRU_layer_num = GRU_layer_num
        self.dropout = dropout
        self.bidirectional = bidirectional
        self.direction = 2 if self.bidirectional else 1
        if GRU_layer_num > 1:
            self.gru = nn.GRU(GRU_input_num, GRU_hidden_num, GRU_layer_num, dropout=dropout,
                              batch_first=True, bidirectional=bidirectional)
        else:
            self.gru = nn.GRU(GRU_input_num, GRU_hidden_num, GRU_layer_num,
                              batch_first=True, bidirectional=bidirectional)
        self.predict = nn.Linear(in_features=self.GRU_hidden_num * self.direction * self.GRU_layer_num,
                                 out_features=constants.CATEGORY_NUM * constants.BACTERIA_PER_CATEGORY)

    def forward(self, x):
        """GRU forward. Input shape: [batch, input_step, features]"""
        # shape of hidden state: [layer * direction, batch, hidden]
        batch = x.shape[0]
        _, hidden_state = self.gru(x)
        hidden_state = hidden_state.permute(1, 0, 2)  # [batch, layer * direction, hidden]
        hidden_state = hidden_state.contiguous().view(
            batch, self.GRU_hidden_num * self.direction * self.GRU_layer_num)  # [batch, layer * direction * hidden]
        return self.predict(hidden_state), hidden_state.cpu().detach().numpy()  # [batch, features]
I am curious about how the hidden state changes during training.
My observations:
Heatmap of the hidden state.
I plotted a heatmap of the hidden states every 25 epochs. The x axis is the sample index within one batch, the y axis is the hidden state dimension, and the title shows the sum of all hidden states. The figures are shown below.
[heatmaps after 25, 75, 150, 300, and 550 epochs]
The summation keeps shrinking as training progresses.
Weights of the GRU.
I also took a glance at the GRU weights; below is a snapshot of some of them.
[snapshot of GRU weights]
Almost all of the weights end in 'e-40' (or e-41, etc.), which means every single weight is close to zero.
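To reproduce this check programmatically, here is a minimal sketch (assuming the GRU model above is instantiated as model) that prints the mean absolute value of each GRU parameter:
for name, param in model.gru.named_parameters():
    print(f"{name}: mean |w| = {param.abs().mean().item():.3e}")
With PyTorch's default initialization these magnitudes typically sit around 1e-2 to 1e-1, so values near 1e-40 would indeed be abnormal.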
My question:
Is this a normal phenomenon? If not, what could be the cause of this issue?
Thanks to anyone who can give me comments on this, and do let me know what other information you need.

Related

Tensor shape for multivariable LSTM on Pytorch

I have a dataset with 8 features and 4 timesteps. I am trying to implement an LSTM but need help understanding whether I have set up my tensor correctly. The aim is to take the output features from the LSTM and pass them through a NN.
My tensor shape is currently #samples x #timesteps x #features, i.e. 4500x4x8. This works with the code below. I want to make sure that the model is indeed taking each timestep matrix as a new sequence step (with matrix 4500x[0]x8 being the first timestep and 4500x[3]x8 being the last). I then take the final timestep output (output[:, -1, :]) to feed through a NN.
Is the code doing what I think it is doing? I ask because performance is marginally worse than a simple RF that only uses the final timestep data. This would be unexpected, as the data has strong time-series correlations (it tracks patients' vitals declining before going on ventilation).
I have the following code:
class LSTM1(nn.Module):
    def __init__(self, num_classes, input_size, hidden_size, num_layers):
        super(LSTM1, self).__init__()
        self.num_classes = num_classes  # number of classes
        self.num_layers = num_layers    # number of layers
        self.input_size = input_size    # input size
        self.hidden_size = hidden_size  # hidden state size
        self.lstm = nn.LSTM(input_size=input_size, hidden_size=hidden_size,
                            num_layers=num_layers, batch_first=True)  # lstm
        self.fc_1 = nn.Linear(hidden_size, 32)  # fully connected 1
        self.fc_2 = nn.Linear(32, 12)           # fully connected 2
        self.fc_3 = nn.Linear(12, 1)            # fully connected 3
        self.fc = nn.Sigmoid()                  # sigmoid on the final output
        self.relu = nn.ReLU()

    def forward(self, x):
        h_0 = Variable(torch.zeros(self.num_layers, x.size(0), self.hidden_size))  # hidden state
        c_0 = Variable(torch.zeros(self.num_layers, x.size(0), self.hidden_size))  # internal (cell) state
        # Propagate input through LSTM
        output, (hn, cn) = self.lstm(x, (h_0, c_0))  # lstm with input, hidden, and internal state
        out = output[:, -1, :]  # take the final timestep for the Dense layers
        out = self.relu(out)
        out = self.fc_1(out)  # first Dense
        out = self.relu(out)  # relu
        out = self.fc_2(out)  # 2nd Dense
        out = self.relu(out)  # relu
        out = self.fc_3(out)  # 3rd Dense
        out = self.relu(out)  # relu
        out = self.fc(out)    # final output (sigmoid)
        return out
Error
Your error stems from the last three lines.
Do not use a ReLU activation at the end of your network.
Use nn.Linear -> nn.Sigmoid with BCELoss, or
nn.Linear with nn.BCEWithLogitsLoss (see here for what logits are).
What is going on
With ReLU you output values in the range [0, +inf).
Applying sigmoid on top of it “squashes” the values to (0, 1), with the threshold at 0 (e.g. 0 becomes a probability of 0.5, hence 1 after thresholding at 0.5!).
In effect, you always predict 1 with this code, which is probably not what you want.
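For completeness, a minimal sketch of the second option (this is an illustration, not the asker's code: it assumes LSTM1's trailing self.relu and self.fc are removed so that forward returns the raw fc_3 output, and that model, x, and y already exist):
criterion = nn.BCEWithLogitsLoss()   # applies the sigmoid internally; numerically stable
logits = model(x)                    # raw logits, shape [batch, 1]
loss = criterion(logits, y.float())  # y: 0/1 targets with the same shape as logits
probs = torch.sigmoid(logits)        # recover probabilities explicitly at inference time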

Pytorch model running out of memory on both CPU and GPU, can’t figure out what I’m doing wrong

Trying to implement a simple multi-label image classifier using Pytorch Lightning. Here's the model definition:
import torch
from torch import nn
import pytorch_lightning as pl  # needed for pl.LightningModule and pl.metrics

# creates network class
class Net(pl.LightningModule):
    def __init__(self):
        super().__init__()
        # defines conv layers
        self.conv_layer_b1 = nn.Sequential(
            nn.Conv2d(in_channels=3, out_channels=32,
                      kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Flatten(),
        )
        # passes a dummy x matrix to find the input size of the fc layer
        x = torch.randn(1, 3, 800, 600)
        self._to_linear = None
        self.forward(x)
        # defines fc layer
        self.fc_layer = nn.Sequential(
            nn.Linear(in_features=self._to_linear,
                      out_features=256),
            nn.ReLU(),
            nn.Linear(256, 5),
        )
        # defines accuracy metric
        self.accuracy = pl.metrics.Accuracy()
        self.confusion_matrix = pl.metrics.ConfusionMatrix(num_classes=5)

    def forward(self, x):
        x = self.conv_layer_b1(x)
        if self._to_linear is None:
            # does not run fc layer if input size is not determined yet
            self._to_linear = x.shape[1]
        else:
            x = self.fc_layer(x)
        return x

    def cross_entropy_loss(self, logits, y):
        criterion = nn.CrossEntropyLoss()
        return criterion(logits, y)

    def training_step(self, train_batch, batch_idx):
        x, y = train_batch
        logits = self.forward(x)
        train_loss = self.cross_entropy_loss(logits, y)
        train_acc = self.accuracy(logits, y)
        train_cm = self.confusion_matrix(logits, y)
        self.log('train_loss', train_loss)
        self.log('train_acc', train_acc)
        self.log('train_cm', train_cm)
        return train_loss

    def validation_step(self, val_batch, batch_idx):
        x, y = val_batch
        logits = self.forward(x)
        val_loss = self.cross_entropy_loss(logits, y)
        val_acc = self.accuracy(logits, y)
        return {'val_loss': val_loss, 'val_acc': val_acc}

    def validation_epoch_end(self, outputs):
        avg_val_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        avg_val_acc = torch.stack([x['val_acc'] for x in outputs]).mean()
        self.log("val_loss", avg_val_loss)
        self.log("val_acc", avg_val_acc)

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=0.0008)
        return optimizer
The issue is probably not the machine, since I'm using a cloud instance with 60 GB of RAM and 12 GB of VRAM. Whenever I run this model, even for a single epoch, I get an out-of-memory error. On the CPU it looks like this:
RuntimeError: [enforce fail at CPUAllocator.cpp:64] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 1966080000 bytes. Error code 12 (Cannot allocate memory)
and on the GPU it looks like this:
RuntimeError: CUDA out of memory. Tried to allocate 7.32 GiB (GPU 0; 11.17 GiB total capacity; 4.00 KiB already allocated; 2.56 GiB free; 2.00 MiB reserved in total by PyTorch)
Clearing the cache and reducing the batch size did not work. I'm a novice so clearly something here is exploding but I can't tell what. Any help would be appreciated.
Thank you!
Indeed, it's not a machine issue; the model itself is simply unreasonably big. Typically, if you take a look at common CNN models, the fc layers occur near the end, after the inputs already pass through quite a few convolutional blocks (and have their spatial resolutions reduced).
Assuming inputs of shape (batch, 3, 800, 600), after passing through conv_layer_b1 the feature map shape would be (batch, 32, 400, 300) following the MaxPool operation. After flattening, the inputs become (batch, 32 * 400 * 300), i.e. (batch, 3840000).
The immediately following fc_layer thus contains nn.Linear(3840000, 256), which is simply absurd: this single linear layer contains ~983 million trainable parameters! For reference, popular image classification CNNs have roughly 3 to 30 million parameters on average, with larger variants reaching 60 to 80 million. Few ever cross the 100 million mark.
You can count your model's parameters with this:
def count_params(model):
    return sum(p.numel() for p in model.parameters())
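As a sanity check, that ~983 million figure can be verified with back-of-the-envelope arithmetic (plain Python using the shapes above, no model instantiation needed):
in_features = 32 * 400 * 300         # flattened feature map size: 3,840,000
fc_params = in_features * 256 + 256  # weights + biases of nn.Linear(3840000, 256)
print(f"{fc_params:,}")              # 983,040,256 -> ~983 million parameters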
My advice: 800 x 600 is really a massive input size. Reduce it to something like 400 x 300, if possible. Furthermore, add several convolutional blocks similar to conv_layer_b1, before the FC layer. For example:
def get_conv_block(C_in, C_out):
    return nn.Sequential(
        nn.Conv2d(in_channels=C_in, out_channels=C_out,
                  kernel_size=3, padding=1),
        nn.ReLU(),
        nn.MaxPool2d(kernel_size=2, stride=2)
    )

class Net(pl.LightningModule):
    def __init__(self):
        super().__init__()
        # defines conv layers
        self.conv_layer_b1 = get_conv_block(3, 16)
        self.conv_layer_b2 = get_conv_block(16, 32)
        self.conv_layer_b3 = get_conv_block(32, 64)
        self.conv_layer_b4 = get_conv_block(64, 128)
        self.conv_layer_b5 = get_conv_block(128, 256)
        # passes a dummy x matrix to find the input size of the fc layer
        x = torch.randn(1, 3, 800, 600)
        self._to_linear = None
        self.forward(x)
        # defines fc layer
        self.fc_layer = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_features=self._to_linear,
                      out_features=256),
            nn.ReLU(),
            nn.Linear(256, 5)
        )
        # defines accuracy metric
        self.accuracy = pl.metrics.Accuracy()
        self.confusion_matrix = pl.metrics.ConfusionMatrix(num_classes=5)

    def forward(self, x):
        x = self.conv_layer_b1(x)
        x = self.conv_layer_b2(x)
        x = self.conv_layer_b3(x)
        x = self.conv_layer_b4(x)
        x = self.conv_layer_b5(x)
        if self._to_linear is None:
            # does not run fc layer if input size is not determined yet
            self._to_linear = nn.Flatten()(x).shape[1]
        else:
            x = self.fc_layer(x)
        return x
Here, because more conv-relu-pool blocks are applied, the input is reduced to a feature map of a much smaller shape, (batch, 256, 25, 18), and the overall number of trainable parameters drops to roughly 30 million.
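To see where (batch, 256, 25, 18) comes from, here is a quick sketch of the spatial arithmetic: each of the five MaxPool2d(kernel_size=2, stride=2) blocks halves the height and width (with flooring):
h, w = 800, 600
for _ in range(5):        # five conv-relu-pool blocks
    h, w = h // 2, w // 2
print(h, w)               # 25 18
fc_in = 256 * h * w       # 115,200 inputs to the first Linear layer
print(fc_in * 256 + 256)  # 29491456 -> ~29.5 million, the bulk of the ~30 million total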

Building RNN from scratch in pytorch

I am trying to build an RNN from scratch using pytorch and I am following this tutorial to build it.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicRNN(nn.Module):
    def __init__(self, n_inputs, n_neurons):
        super(BasicRNN, self).__init__()
        self.Wx = torch.randn(n_inputs, n_neurons)   # n_inputs X n_neurons
        self.Wy = torch.randn(n_neurons, n_neurons)  # n_neurons X n_neurons
        self.b = torch.zeros(1, n_neurons)           # 1 X n_neurons

    def forward(self, X0, X1):
        self.Y0 = torch.tanh(torch.mm(X0, self.Wx) + self.b)  # batch_size X n_neurons
        self.Y1 = torch.tanh(torch.mm(self.Y0, self.Wy) +
                             torch.mm(X1, self.Wx) + self.b)  # batch_size X n_neurons
        return self.Y0, self.Y1

class CleanBasicRNN(nn.Module):
    def __init__(self, batch_size, n_inputs, n_neurons):
        super(CleanBasicRNN, self).__init__()
        self.rnn = BasicRNN(n_inputs, n_neurons)
        self.hx = torch.randn(batch_size, n_neurons)  # initialize hidden state

    def forward(self, X):
        output = []
        # for each time step
        for i in range(2):
            self.hx = self.rnn(X[i], self.hx)
            output.append(self.hx)
        return output, self.hx

FIXED_BATCH_SIZE = 4  # our batch size is fixed for now
N_INPUT = 3
N_NEURONS = 5

X_batch = torch.tensor([[[0, 1, 2], [3, 4, 5],
                         [6, 7, 8], [9, 0, 1]],
                        [[9, 8, 7], [0, 0, 0],
                         [6, 5, 4], [3, 2, 1]]
                        ], dtype=torch.float)  # X0 and X1

model = CleanBasicRNN(FIXED_BATCH_SIZE, N_INPUT, N_NEURONS)
a1, a2 = model(X_batch)
Running this code returns this error
RuntimeError: size mismatch, m1: [4 x 5], m2: [3 x 5] at /pytorch/..
After some digging, I found that this error happens when passing the hidden state to the BasicRNN model:
N_INPUT = 3    # number of features in input
N_NEURONS = 5  # number of units in layer

X0_batch = torch.tensor([[0, 1, 2], [3, 4, 5],
                         [6, 7, 8], [9, 0, 1]],
                        dtype=torch.float)  # t=0 => 4 X 3
X1_batch = torch.tensor([[9, 8, 7], [0, 0, 0],
                         [6, 5, 4], [3, 2, 1]],
                        dtype=torch.float)  # t=1 => 4 X 3

test_model = BasicRNN(N_INPUT, N_NEURONS)
a1, a2 = test_model(X0_batch, X1_batch)
a1, a2 = test_model(X0_batch, torch.randn(1, N_NEURONS))  # THIS LINE GIVES AN ERROR
What is happening with the hidden states, and how can I solve this problem?
Maybe the tutorial is wrong: torch.mm(X1, self.Wx) multiplies a 4 x 5 tensor (the hidden state passed in as X1) by a 3 x 5 tensor (Wx), which doesn't work. Even if you made the shapes compatible by rewriting it as torch.mm(self.Wx, X1.t()), you would expect a 4 x 5 output, but the result would be a 4 x 3 tensor (after transposing back).
The BasicRNN is not an implementation of an RNN cell, but rather the full RNN fixed for two time steps, as depicted in the image from the tutorial.
There, Y0, the first time step, does not include a previous hidden state (it is technically zero), and Y0 is also h0, which is then used for the second time step to produce Y1 (h1).
An RNN cell is one of the time steps in isolation, particularly the second one, as it should include the hidden state of the previous time step.
The next hidden state is calculated as described in the nn.RNNCell documentation: h' = tanh(W_ih x + b_ih + W_hh h + b_hh)
In your BasicRNN there is only one bias term, but you still have a weight Wx for the input and a weight Wy for the hidden state, which should probably be called Wh instead. As for the forward method, its arguments become the input and the previous hidden state, instead of two inputs at different time steps. This also means that you only need one calculation, corresponding to the formula of nn.RNNCell, i.e. the calculation for Y1, except that it uses the hidden state that was passed to the forward method.
class BasicRNN(nn.Module):
    def __init__(self, n_inputs, n_neurons):
        super(BasicRNN, self).__init__()
        self.Wx = torch.randn(n_inputs, n_neurons)   # n_inputs X n_neurons
        self.Wh = torch.randn(n_neurons, n_neurons)  # n_neurons X n_neurons
        self.b = torch.zeros(1, n_neurons)           # 1 X n_neurons

    def forward(self, x, hidden):
        return torch.tanh(torch.mm(x, self.Wx) + torch.mm(hidden, self.Wh) + self.b)
In the tutorial, they opted to use nn.RNNCell directly instead of implementing the cell.
Note: The terms of the matrix multiplications are in a different order, because the weights are usually transposed in comparison to yours, and the formula assumes the input and hidden state to be vectors (not batches). Technically, the batched inputs and hidden states would have to be transposed, and the output transposed back for it to work with batches. It's easier to just use the transposed weight, as the result is the same due to the transpose property of matrix multiplication: (xW)ᵀ = Wᵀxᵀ.
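To tie this together, a minimal sketch (using the shapes and names from the question) of driving the corrected cell across the sequence, which is exactly what CleanBasicRNN's loop would do:
cell = BasicRNN(N_INPUT, N_NEURONS)            # the corrected cell from above
hx = torch.zeros(FIXED_BATCH_SIZE, N_NEURONS)  # initial hidden state (h0 is zero)
outputs = []
for t in range(X_batch.shape[0]):              # X_batch: [time, batch, features]
    hx = cell(X_batch[t], hx)                  # each step takes the input and the previous hidden state
    outputs.append(hx)
# outputs[0] corresponds to Y0/h0 and outputs[1] to Y1/h1 in the tutorial's figure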

What should I do if my regression model is stuck at a high loss value?

I'm using neural nets for a regression problem where I have 3 features and I'm trying to predict one continuous value. I noticed that my neural net starts learning well, but after 10 epochs it gets stuck at a high loss value and cannot improve any further.
I tried using Adam and other adaptive optimizers instead of SGD, but that didn't work. I tried more complex architectures, adding layers, neurons, batch normalization, other activations, etc., and that also didn't work.
I tried to debug and find out whether something was wrong with the implementation, but when I use only 10 examples of the data my model learns fast, so there are no obvious errors. I started increasing the number of examples while monitoring my model's results as the data grew. When I reach 3000 data examples, my model starts to get stuck at a high loss value.
I tried increasing layers and neurons, and also tried other activations and batch normalization. My data are normalized to [-1, 1]; my target value is not normalized, since it is regression and I'm predicting a continuous value. I also tried using Keras and got the same result.
My real dataset has 40,000 samples. I don't know what else to try; I have tried almost everything I know for optimization, but none of it worked. I would appreciate it if someone could guide me on this. I'll post my code, but maybe it is too messy to understand easily; I'm sure there is no problem with my implementation. I'm using skorch/pytorch and some sklearn functions:
# take all features as independent variables, except the bearing and distance
# here, when I start small, the model learns well; but from 3000 data points, as you can see,
# the model gets stuck at a high value. I mean, the starting loss is 15 and it decreases
# nicely, but when it reaches 9 it gets stuck there
# and if I try to use the whole dataset for training, then the loss starts at 47,
# decreases until it reaches 36, and then gets stuck there too
X = dataset.iloc[:3000, 0:-2].reset_index(drop=True).to_numpy().astype(np.float32)
# take distance and bearing as the output values:
y = dataset.iloc[:3000, -2:].reset_index(drop=True).to_numpy().astype(np.float32)
y_bearing = y[:, 0].reshape(-1, 1)
y_distance = y[:, 1].reshape(-1, 1)

# normalize the input values
scaler = StandardScaler()
X_norm = scaler.fit_transform(X, y)

X_br_train, X_br_test, y_br_train, y_br_test = train_test_split(X_norm,
                                                                y_bearing,
                                                                test_size=0.1,
                                                                random_state=42,
                                                                shuffle=True)
X_dis_train, X_dis_test, y_dis_train, y_dis_test = train_test_split(X_norm,
                                                                    y_distance,
                                                                    test_size=0.1,
                                                                    random_state=42,
                                                                    shuffle=True)

bearing_trainset = Dataset(X_br_train, y_br_train)
bearing_testset = Dataset(X_br_test, y_br_test)
distance_trainset = Dataset(X_dis_train, y_dis_train)
distance_testset = Dataset(X_dis_test, y_dis_test)

def root_mse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

class RMSELoss(nn.Module):
    def __init__(self):
        super().__init__()
        self.mse = nn.MSELoss()

    def forward(self, yhat, y):
        return torch.sqrt(self.mse(yhat, y))

class AED(nn.Module):
    """custom average euclidean distance loss"""
    def __init__(self):
        super().__init__()

    def forward(self, yhat, y):
        # torch.dist returns the p-norm (p=2 by default) of (yhat - y) over the whole tensor
        return torch.dist(yhat, y)

def train(on_target, hidden_units, batch_size, epochs, optimizer,
          lr, regularisation_factor, train_shuffle):
    network = None
    trainset = distance_trainset if on_target.lower() == 'distance' else bearing_trainset
    testset = distance_testset if on_target.lower() == 'distance' else bearing_testset
    print(f"shape of trainset.X = {trainset.X.shape}, shape of trainset.y = {trainset.y.shape}")
    print(f"shape of testset.X = {testset.X.shape}, shape of testset.y = {testset.y.shape}")

    mse = EpochScoring(scoring=mean_squared_error, lower_is_better=True, name='MSE')
    r2 = EpochScoring(scoring=r2_score, lower_is_better=False, name='R2')
    rmse = EpochScoring(scoring=make_scorer(root_mse), lower_is_better=True, name='RMSE')
    checkpoint = Checkpoint(dirname=f'results/{on_target}/checkpoints')
    train_end_checkpoint = TrainEndCheckpoint(dirname=f'results/{on_target}/checkpoints')

    if on_target.lower() == 'bearing':
        network = BearingNetwork(n_features=X_norm.shape[1],
                                 n_hidden=hidden_units,
                                 n_out=y_distance.shape[1])
    elif on_target.lower() == 'distance':
        network = DistanceNetwork(n_features=X_norm.shape[1],
                                  n_hidden=hidden_units,
                                  n_out=1)

    model = NeuralNetRegressor(
        module=network,
        criterion=RMSELoss,
        device='cpu',
        batch_size=batch_size,
        lr=lr,
        optimizer=optim.Adam if optimizer.lower() == 'adam' else optim.SGD,
        optimizer__weight_decay=regularisation_factor,
        max_epochs=epochs,
        iterator_train__shuffle=train_shuffle,
        train_split=predefined_split(testset),
        callbacks=[mse, r2, rmse, checkpoint, train_end_checkpoint]
    )

    print(f"{'*' * 10} start training the {on_target} model {'*' * 10}")
    history = model.fit(trainset, y=None)
    print(f"{'*' * 10} End Training the {on_target} Model {'*' * 10}")

if __name__ == '__main__':
    args = parser.parse_args()
    train(on_target=args.on_target,
          hidden_units=args.hidden_units,
          batch_size=args.batch_size,
          epochs=args.epochs,
          optimizer=args.optimizer,
          lr=args.learning_rate,
          regularisation_factor=args.regularisation_lambda,
          train_shuffle=args.shuffle)
and this is my network declaration:
class DistanceNetwork(nn.Module):
    """separate NN for predicting distance"""
    def __init__(self, n_features=5, n_hidden=16, n_out=1):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(n_features, n_hidden),
            nn.LeakyReLU(),
            nn.Linear(n_hidden, 5),
            nn.LeakyReLU(),
            nn.Linear(5, n_out)
        )

    def forward(self, x):  # the forward pass appears to have been cut off in the original snippet
        return self.model(x)
Here is the log while training: [screenshot of the training log]

Strange jumps in val_loss for Keras mobile_net based model

I am trying to train a model based on MobileNet to do landmark detection for dog faces (the output is a tensor with x/y coordinates for the positions of the eyes and the nose of the dog).
In my training, I am seeing the following graph for val_loss:
[val_loss curve with random spikes]
Question: What is going on with the random spikes in val_loss?
My model looks like this:
model = applications.MobileNet(weights="imagenet",
                               include_top=False,
                               input_shape=(224, 224, 3))
for layer in model.layers:
    layer.trainable = True

x = model.output
p = 0.6
x = AveragePooling2D()(x)  # note: the original snippet used an undefined `output_layer` here
x = BatchNormalization(axis=1, name="net_out")(x)
x = Dropout(p / 4)(x)
x = Flatten()(x)
x = Dense(512, activation='relu')(x)
x = BatchNormalization()(x)
x = Dropout(p)(x)
x = Dense(512, activation='relu')(x)
x = BatchNormalization()(x)
x = Dropout(p / 2)(x)
# PREDICT_SIZE is 6 - the number of landmarks I'm extracting
x = Dense(PREDICT_SIZE, activation="linear")(x)
points = Reshape((PREDICT_SIZE,), name="f")(x)

model_final = Model(input=model.input, output=points)
model_final.compile(loss="mse",
                    optimizer=optimizers.Adam(epsilon=1e-7),
                    metrics=["accuracy"])
Further questions: Should I set epsilon to an even larger value in my optimizer? Will my model suffer if I make epsilon larger?
Should I try a different loss function?
Is the model itself at fault?
Should I try different top layers for my model?
Update: Removing the "relu" activation resulted in no more funky jumps in val_loss, and also in a model with lower loss and even better precision! WOOHOO!
Here's the val_loss value after removing "relu" from the top layers (and no other changes).
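For concreteness, a hedged sketch of the change described in the update, assuming "removing relu" means dropping the activation from the two Dense(512) layers (leaving them linear) while keeping everything else identical:
x = Dense(512)(x)  # 'relu' removed; the layer is now linear
x = BatchNormalization()(x)
x = Dropout(p)(x)
x = Dense(512)(x)  # 'relu' removed here as well
x = BatchNormalization()(x)
x = Dropout(p / 2)(x)
x = Dense(PREDICT_SIZE, activation="linear")(x)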