train function
def train(model, iterator, optimizer, criterion, clip):
    model.train()
    epoch_loss = 0
    for i, batch in enumerate(iterator):
        optimizer.zero_grad()
        output = model(batch.text)
        loss = criterion(output, torch.unsqueeze(batch.labels, 1))
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()
        epoch_loss += loss.item()
    return epoch_loss / len(iterator)
main_script
def main(
    train_file,
    test_file,
    config_file,
    checkpoint_path,
    best_model_path
):
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    with open(config_file, 'r') as j:
        config = json.loads(j.read())
    for k, v in config['model'].items():
        v = float(v)
        if v < 1.0:
            config['model'][k] = float(v)
        else:
            config['model'][k] = int(v)
    for k, v in config['training'].items():
        v = float(v)
        if v < 1.0:
            config['training'][k] = float(v)
        else:
            config['training'][k] = int(v)
    train_itr, val_itr, test_itr, vocab_size = data_pipeline(
        train_file,
        test_file,
        config['training']['max_vocab'],
        config['training']['min_freq'],
        config['training']['batch_size'],
        device
    )
    model = CNNNLPModel(
        vocab_size,
        config['model']['emb_dim'],
        config['model']['hid_dim'],
        config['model']['model_layer'],
        config['model']['model_kernel_size'],
        config['model']['model_dropout'],
        device
    )
    optimizer = optim.Adam(model.parameters())
    criterion = nn.CrossEntropyLoss()
    num_epochs = config['training']['n_epoch']
    clip = config['training']['clip']
    is_best = False
    best_valid_loss = float('inf')
    model = model.to(device)
    for epoch in tqdm(range(num_epochs)):
        train_loss = train(model, train_itr, optimizer, criterion, clip)
        valid_loss = evaluate(model, val_itr, criterion)
        if (epoch + 1) % 2 == 0:
            print("training loss {}, validation loss {}".format(train_loss, valid_loss))
I was training a convolutional neural network for binary text classification: given a sentence, it detects whether or not it is hate speech. Training loss and validation loss were fine until epoch 5; after that, both suddenly shot up from about 0.2 to 10,000.
What could be the reason for such a huge, sudden increase in loss?
The default learning rate of Adam is 0.001, which, depending on the task, might be too high.
It looks like instead of converging, your neural network became divergent (it left the previous ~0.2 loss minimum and fell into a different region).
Lowering your learning rate at some point (after 50% or 70% of training) would probably fix the issue.
Usually people divide the learning rate by 10 (0.0001 in your case) or halve it (0.0005 in your case). Try halving it first and see if the issue persists; in general, you want to keep your learning rate as high as possible until divergence occurs, as is probably the case here.
This is what schedulers are for (gamma specifies the learning rate multiplier; you might want to change it to 0.5 first).
One can think of the lower learning rate phase as fine-tuning the already found solution (placing the weights in a better region of the loss valley), and it might require some patience.
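For example, a minimal sketch of such a schedule in PyTorch, reusing the optimizer, iterators, and helper functions from the training script above and assuming a single drop at roughly 70% of training:

# halve the learning rate once, at ~70% of training (the exact drop point is an assumption)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer,
    milestones=[int(0.7 * num_epochs)],
    gamma=0.5  # use 0.1 instead to divide the learning rate by 10
)
for epoch in range(num_epochs):
    train_loss = train(model, train_itr, optimizer, criterion, clip)
    valid_loss = evaluate(model, val_itr, criterion)
    scheduler.step()  # advance the schedule once per epoch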
Related
I am developing an AI for video classification, which classifies a video file into one of three labels: Normal, Violent, or Pornography.
Here is a summary of my efforts so far to improve the accuracy of the model:
1. Dataset: I have collected a training dataset of 50,000 videos, consisting of 5000 original videos and 45,000 augmented videos, evenly split between the three labels.
2. Pre-processing: I have used an InceptionV3 model pre-trained on the ImageNet dataset to extract features from the videos for feeding into my main model.
3. Model Architecture: I have tried many different model architectures, but all of them resulted in overfitting problems after a maximum of 15 epochs.
4. Regularization: I have added L1 and L2 regularization, but they did not help improve the model.
5. Early Stopping: I have implemented early stopping, but it stopped training when the validation values were still not good enough to achieve good accuracy.
6. Model Complexity: I have tried both complex and less complex models, but both still resulted in overfitting.
7. Batch Normalization: I have added batch normalization, but it did not solve the overfitting problem.
8. Learning Rate Scheduler: I have tried using ReduceLROnPlateau and LearningRateScheduler, both together and alone, but still no luck:
reduce_lr = keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss', factor=0.5, patience=5, verbose=0, mode='min',
    min_delta=0.0001, cooldown=0, min_lr=0)
lr_schedule = keras.callbacks.LearningRateScheduler(
    lambda epoch: 0.0005 * tf.math.exp(-0.05 * epoch),
    verbose=True)
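For context, both callbacks are then handed to the training call; a minimal sketch of how that looks (train_data, val_data, and the epoch count here are placeholders, not the actual training code):

# hypothetical fit call, only to show where the callbacks plug in
rnn_model.fit(
    train_data,
    validation_data=val_data,
    epochs=50,
    callbacks=[reduce_lr, lr_schedule],
)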
9. Computing Resources: I am running the training on AWS SageMaker ml.t3.2xlarge with 32 GB of RAM.
10. Dataset Size: I would prefer to avoid increasing the size of the dataset as I am running short on time for the project delivery. However, if this is my only option, I am open to suggestions.
11. Regularizer Tuning: I gradually increased the regularization value in each layer to fine-tune the model.
Please note that these are just examples of the models I have tried; I have experimented with many others, with similar results.
x = keras.layers.GRU(32, return_sequences=True, kernel_regularizer=keras.regularizers.l2(0.001))(
    frame_features_input, mask=mask_input
)
x = keras.layers.GRU(16, kernel_regularizer=keras.regularizers.l2(0.001))(x)
x = keras.layers.Dropout(0.4)(x)
x = keras.layers.Dense(1024, activation="relu",
                       kernel_regularizer=keras.regularizers.l2(0.001))(x)
x = keras.layers.Dropout(0.5)(x)
x = keras.layers.Dense(256, activation="relu",
                       kernel_regularizer=keras.regularizers.l2(0.001))(x)
x = keras.layers.Dropout(0.5)(x)
x = keras.layers.Dense(128, activation="relu",
                       kernel_regularizer=keras.regularizers.l2(0.001))(x)
output = keras.layers.Dense(len(class_vocab), activation="softmax")(x)
rnn_model = keras.Model([frame_features_input, mask_input], output)

opt = keras.optimizers.experimental.AdamW(
    learning_rate=0.0001,  # 0.001
    weight_decay=0.004,  # .004 best perform
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-07,
    amsgrad=False,
    clipnorm=None,
    clipvalue=None,
    global_clipnorm=None,
    use_ema=False,
    ema_momentum=0.99,
    ema_overwrite_frequency=None,
    jit_compile=True,
    name="AdamW")

rnn_model.compile(
    loss="sparse_categorical_crossentropy", optimizer=opt, metrics=["accuracy"]
)
x = keras.layers.GRU(128, return_sequences=True, recurrent_dropout=0.3)(frame_features_input)
x = keras.layers.BatchNormalization()(x)
x = keras.layers.GRU(64, return_sequences=False, recurrent_dropout=0.3)(x)
x = keras.layers.BatchNormalization()(x)
x = keras.layers.Dense(32, activation="relu", kernel_regularizer=keras.regularizers.l2(0.01))(x)
x = keras.layers.Dropout(0.5)(x)
x = keras.layers.BatchNormalization()(x)
output = keras.layers.Dense(len(class_vocab), activation="softmax")(x)
x = keras.layers.GRU(256, return_sequences=True, recurrent_dropout=0.3)(frame_features_input)
x = keras.layers.BatchNormalization()(x)
x = keras.layers.GRU(128, return_sequences=True, recurrent_dropout=0.3)(x)
x = keras.layers.BatchNormalization()(x)
x = keras.layers.GRU(64, return_sequences=False, recurrent_dropout=0.3)(x)
x = keras.layers.BatchNormalization()(x)
x = keras.layers.Dense(32, activation="relu", kernel_regularizer=keras.regularizers.l2(0.01))(x)
x = keras.layers.Dropout(0.5)(x)
x = keras.layers.BatchNormalization()(x)
output = keras.layers.Dense(len(class_vocab), activation="softmax")(x)
[Screenshot: the results]
[Screenshot: the results using the learning rate scheduler]
I tried different model architectures, added regularization, early stopping, and batch normalization, but I still faced overfitting. I expected improved accuracy, but the actual results still show overfitting.
In the following code, I am using a Neural Network (net) to minimize the expectation of a complex stochastic function (complex_function).
loss = torch.tensor(0., requires_grad=True)
params = list(net.parameters())
optimizer = torch.optim.SGD(params, lr=learning_rate)
nb_train = 100
nb_mean = 1000

for i in range(nb_train):
    for j in range(nb_mean):
        value = complex_function(net)
        loss = loss + value
    loss = loss / nb_mean
    loss.backward(retain_graph=True)
    optimizer.step()
    optimizer.zero_grad()
    if i == 0:
        best_loss = loss
        best_net = copy.deepcopy(net)
    else:
        if loss < best_loss:
            best_net = copy.deepcopy(net)
            best_loss = loss

return best_net
Weirdly, this code does not return the best net that I have encountered. Could someone explain to me why? I suppose it is related to copy.deepcopy(net), but I do not know how...
I'm using neural nets for a regression problem where I have 3 features and I'm trying to predict one continuous value. I noticed that my neural net starts learning well, but after 10 epochs it gets stuck on a high loss value and cannot improve any more.
I tried using Adam and other adaptive optimizers instead of SGD, but that didn't work. I tried more complex architectures, adding layers, neurons, batch normalization, other activations, etc., and that didn't work either.
I tried to debug and find out whether something was wrong with the implementation, but when I use only 10 examples of the data my model learns fast, so there are no obvious errors. I started increasing the number of examples and monitoring my model's results as I did so. When I reach 3000 data examples, my model starts to get stuck on a high loss value.
I tried increasing layers and neurons, and also tried other activations and batch normalization. My data are normalized between [-1, 1]; my target value is not normalized since this is a regression problem and I'm predicting a continuous value. I also tried using Keras, but I got the same result.
My real dataset has 40,000 data points. I don't know what else to try; I have tried almost everything I know for optimization, but none of it worked. I would appreciate it if someone could guide me on this. I'll post my code, though it may be too messy to follow; I'm sure there is no problem with my implementation. I'm using skorch/PyTorch and some scikit-learn functions:
# take all features as independent variables except the bearing and distance
# when I start small the model learns well, but from 3000 data points the model gets stuck on a high value:
# the starting loss is 15, it decreases nicely, but when it reaches 9 it gets stuck there;
# if I use the whole dataset for training, the loss starts at 47, decreases until it reaches 36, and then gets stuck there too
X = dataset.iloc[:3000, 0:-2].reset_index(drop=True).to_numpy().astype(np.float32)
# take distance and bearing as the output values:
y = dataset.iloc[:3000, -2:].reset_index(drop=True).to_numpy().astype(np.float32)
y_bearing = y[:, 0].reshape(-1, 1)
y_distance = y[:, 1].reshape(-1, 1)
# normalize the input values
scaler = StandardScaler()
X_norm = scaler.fit_transform(X, y)
X_br_train, X_br_test, y_br_train, y_br_test = train_test_split(X_norm,
                                                                 y_bearing,
                                                                 test_size=0.1,
                                                                 random_state=42,
                                                                 shuffle=True)
X_dis_train, X_dis_test, y_dis_train, y_dis_test = train_test_split(X_norm,
                                                                     y_distance,
                                                                     test_size=0.1,
                                                                     random_state=42,
                                                                     shuffle=True)
bearing_trainset = Dataset(X_br_train, y_br_train)
bearing_testset = Dataset(X_br_test, y_br_test)
distance_trainset = Dataset(X_dis_train, y_dis_train)
distance_testset = Dataset(X_dis_test, y_dis_test)
def root_mse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

class RMSELoss(nn.Module):
    def __init__(self):
        super().__init__()
        self.mse = nn.MSELoss()

    def forward(self, yhat, y):
        return torch.sqrt(self.mse(yhat, y))

class AED(nn.Module):
    """custom average euclidean distance loss"""
    def __init__(self):
        super().__init__()

    def forward(self, yhat, y):
        return torch.dist(yhat, y)
def train(on_target,
          hidden_units,
          batch_size,
          epochs,
          optimizer,
          lr,
          regularisation_factor,
          train_shuffle):
    network = None
    trainset = distance_trainset if on_target.lower() == 'distance' else bearing_trainset
    testset = distance_testset if on_target.lower() == 'distance' else bearing_testset
    print(f"shape of trainset.X = {trainset.X.shape}, shape of trainset.y = {trainset.y.shape}")
    print(f"shape of testset.X = {testset.X.shape}, shape of testset.y = {testset.y.shape}")

    mse = EpochScoring(scoring=mean_squared_error, lower_is_better=True, name='MSE')
    r2 = EpochScoring(scoring=r2_score, lower_is_better=False, name='R2')
    rmse = EpochScoring(scoring=make_scorer(root_mse), lower_is_better=True, name='RMSE')
    checkpoint = Checkpoint(dirname=f'results/{on_target}/checkpoints')
    train_end_checkpoint = TrainEndCheckpoint(dirname=f'results/{on_target}/checkpoints')

    if on_target.lower() == 'bearing':
        network = BearingNetwork(n_features=X_norm.shape[1],
                                 n_hidden=hidden_units,
                                 n_out=y_distance.shape[1])
    elif on_target.lower() == 'distance':
        network = DistanceNetwork(n_features=X_norm.shape[1],
                                  n_hidden=hidden_units,
                                  n_out=1)

    model = NeuralNetRegressor(
        module=network,
        criterion=RMSELoss,
        device='cpu',
        batch_size=batch_size,
        lr=lr,
        optimizer=optim.Adam if optimizer.lower() == 'adam' else optim.SGD,
        optimizer__weight_decay=regularisation_factor,
        max_epochs=epochs,
        iterator_train__shuffle=train_shuffle,
        train_split=predefined_split(testset),
        callbacks=[mse, r2, rmse, checkpoint, train_end_checkpoint]
    )

    print(f"{'*' * 10} start training the {on_target} model {'*' * 10}")
    history = model.fit(trainset, y=None)
    print(f"{'*' * 10} End Training the {on_target} Model {'*' * 10}")
if __name__ == '__main__':
    args = parser.parse_args()
    train(on_target=args.on_target,
          hidden_units=args.hidden_units,
          batch_size=args.batch_size,
          epochs=args.epochs,
          optimizer=args.optimizer,
          lr=args.learning_rate,
          regularisation_factor=args.regularisation_lambda,
          train_shuffle=args.shuffle)
and this is my network declaration:
class DistanceNetwork(nn.Module):
    """separate NN for predicting distance"""
    def __init__(self, n_features=5, n_hidden=16, n_out=1):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(n_features, n_hidden),
            nn.LeakyReLU(),
            nn.Linear(n_hidden, 5),
            nn.LeakyReLU(),
            nn.Linear(5, n_out)
        )
Here is the log while training:
I am currently using PyTorch for deep neural networks. I wrote a toy neural network, shown below, and I found that whether or not I set requires_grad=True for the label y makes a huge difference. When y.requires_grad=True, the neural network diverges. I am wondering why this happens.
import torch

x = torch.unsqueeze(torch.linspace(-1, 1, 100), dim=1)
y = x.pow(2) + 10 * torch.rand(x.size())
x.requires_grad = True
# this is where problem occurs
y.requires_grad = True

class Net(torch.nn.Module):
    def __init__(self, n_feature, n_hidden, n_output):
        super(Net, self).__init__()
        self.hidden = torch.nn.Linear(n_feature, n_hidden)
        self.predict = torch.nn.Linear(n_hidden, n_output)

    def forward(self, x):
        x = torch.relu(self.hidden(x))
        x = self.predict(x)
        return x

net = Net(1, 10, 1)
optimizer = torch.optim.SGD(net.parameters(), lr=0.5)
criterion = torch.nn.MSELoss()

for t in range(200):
    y_pred = net(x)
    loss = criterion(y_pred, y)
    optimizer.zero_grad()
    loss.backward()
    print("Epoch {}: {}".format(t, loss))
    optimizer.step()
It seems that you are using an outdated version of PyTorch. In more recent versions (0.4.0+), this will throw the following error:
AssertionError: nn criterions don't compute the gradient w.r.t. targets -
please mark these tensors as not requiring gradients
Essentially, it tells you that it will only work if you set the requires_grad flag to False for your targets. Why this works at all in prior versions is indeed very interesting, and it also explains the diverging behavior.
My guess would be that the backward pass then also changes your targets (instead of only changing your weights), which is obviously something you do not want.
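As a minimal sketch of the fix for the toy example above (my rewording, reusing the net, optimizer, and criterion already defined): build the target without tracking gradients, or detach it explicitly before passing it to the criterion:

# the target is created without requires_grad (the default); detach() additionally cuts off
# any graph it might carry, since x already has requires_grad=True at this point
y = (x.pow(2) + 10 * torch.rand(x.size())).detach()

for t in range(200):
    y_pred = net(x)
    loss = criterion(y_pred, y)  # gradients now flow only through y_pred
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()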
I have a binary classification problem with a balanced number of examples per class. When testing the performance of the classifier on the test set, if I use all examples from both classes I get an accuracy of 79.87%. However, when testing on the classes individually, the accuracy for class 1 is 73.41% and the accuracy for class 2 is 63.31%. The problem is that if I compute the average accuracy for the two classes, i.e. (73.41 + 63.31) / 2 = 68.36%, it does not equal 79.87%.
How is this possible? I am using the model.evaluate function from Keras to obtain the accuracy numbers. My code is as follows:
model.compile(loss='binary_crossentropy',
              optimizer=optim,
              metrics=['accuracy'])
earlystop = EarlyStopping(monitor='val_acc', min_delta=0.001, patience=5, verbose=0, mode='auto')
callbacks_list = [earlystop]
X_train, y_train, X_val, y_val = data()
hist = model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=30, batch_size=batch_size, shuffle=True, callbacks=callbacks_list)
#get training accuracy
training_accuracy = np.mean(hist.history["acc"])
validation_accuracy = np.mean(hist.history["val_acc"])
print("Training accuracy: %.2f%%" % (training_accuracy * 100))
print("Validation accuracy: %.2f%%" % (validation_accuracy * 100))
scores = model.evaluate(X_test, y_test, verbose=2)
y_pred = model.predict_classes(X_test)
print(metrics.classification_report(y_test, y_pred))
print("Testing loss: %.2f%%" % (scores[0]))
print("Testing accuracy: %.2f%%" % (scores[1]*100))
Why do I get results which don't add up? My setup is very trivial so I am sure there is no bug in my code. Thank you!
I can't find where in your code you're separating the classes to test each one.
But there is a big problem in taking the mean value of the history in np.mean(hist.history["val_acc"]).
The history evolves: you start with terrible accuracy, and every epoch improves the value. The only value that can meaningfully be compared is the last one.
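In other words, a small sketch of what the comparison should read from (same hist object as in the question):

# report the final epoch's accuracy instead of the mean over all epochs
final_training_accuracy = hist.history["acc"][-1]
final_validation_accuracy = hist.history["val_acc"][-1]
print("Training accuracy: %.2f%%" % (final_training_accuracy * 100))
print("Validation accuracy: %.2f%%" % (final_validation_accuracy * 100))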