Overcoming Overfitting: How to Improve Video Classification AI Training Accuracy - deep-learning

I am developing an AI for video classification, which classifies a video file into one of three labels: Normal, Violent, or Pornography.
Here is a summary of my efforts so far to improve the accuracy of the model:
1. Dataset: I have collected a training dataset of 50,000 videos, consisting of 5,000 original videos and 45,000 augmented videos, evenly split across the three labels.
2. Pre-processing: I have used an InceptionV3 model pre-trained on the ImageNet dataset to extract features from the videos for feeding into my main model.
3. Model Architecture: I have tried many different model architectures, but all of them resulted in overfitting problems after a maximum of 15 epochs.
4. Regularization: I have added L1 and L2 regularization, but they did not help improve the model.
5. Early Stopping: I have implemented early stopping, but it halted training while the validation metrics were still too poor to reach good accuracy (a best-weights variant is sketched right after this list).
6. Model Complexity: I have tried both complex and less complex models, but both still resulted in overfitting.
7. Batch Normalization: I have added batch normalization, but it did not solve the overfitting problem.
8. Learning Rate Scheduler: I have tried using ReduceLROnPlateau and LearningRateScheduler, both together and on their own, but still no luck.
reduce_lr = keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss', factor=0.5, patience=5, verbose=0, mode='min',
    min_delta=0.0001, cooldown=0, min_lr=0)
lr_schedule = keras.callbacks.LearningRateScheduler(
    lambda epoch: 0.0005 * tf.math.exp(-0.05 * epoch),
    verbose=True)
9. Computing Resources: I am running the training on an AWS SageMaker ml.t3.2xlarge instance with 32 GB of RAM.
10. Dataset Size: I would prefer to avoid increasing the size of the dataset as I am running short on time for the project delivery. However, if this is my only option, I am open to suggestions.
11. Regularizer Tuning: I have gradually increased the regularization value in each layer to fine-tune the model.
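As referenced in point 5, the early-stopping variant below rolls the model back to the best epoch instead of keeping the last weights. This is only a minimal sketch: the monitor, patience and min_delta values are illustrative assumptions, not tuned settings.
import keras

early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss",         # stop when the validation loss stops improving
    patience=10,                # placeholder: epochs to wait without improvement
    min_delta=0.0001,
    restore_best_weights=True,  # roll back to the epoch with the lowest val_loss
    verbose=1,
)
# rnn_model.fit(..., validation_split=0.2, callbacks=[early_stop, reduce_lr])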
Please note that these are just examples of the models I have tried; I have experimented with many others, with similar results.
x = keras.layers.GRU(32, return_sequences=True, kernel_regularizer=keras.regularizers.l2(0.001))(
    frame_features_input, mask=mask_input
)
x = keras.layers.GRU(16, kernel_regularizer=keras.regularizers.l2(0.001))(x)
x = keras.layers.Dropout(0.4)(x)
x = keras.layers.Dense(1024, activation="relu",
                       kernel_regularizer=keras.regularizers.l2(0.001))(x)
x = keras.layers.Dropout(0.5)(x)
x = keras.layers.Dense(256, activation="relu",
                       kernel_regularizer=keras.regularizers.l2(0.001))(x)
x = keras.layers.Dropout(0.5)(x)
x = keras.layers.Dense(128, activation="relu",
                       kernel_regularizer=keras.regularizers.l2(0.001))(x)
output = keras.layers.Dense(len(class_vocab), activation="softmax")(x)
rnn_model = keras.Model([frame_features_input, mask_input], output)
opt = keras.optimizers.experimental.AdamW(
    learning_rate=0.0001,  # 0.001
    weight_decay=0.004,    # .004 best perform
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-07,
    amsgrad=False,
    clipnorm=None,
    clipvalue=None,
    global_clipnorm=None,
    use_ema=False,
    ema_momentum=0.99,
    ema_overwrite_frequency=None,
    jit_compile=True,
    name="AdamW")
rnn_model.compile(
    loss="sparse_categorical_crossentropy", optimizer=opt, metrics=["accuracy"]
)
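# A second architecture that was also tried (GRU layers with batch normalization):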
x = keras.layers.GRU(128, return_sequences=True, recurrent_dropout=0.3)(frame_features_input)
x = keras.layers.BatchNormalization()(x)
x = keras.layers.GRU(64, return_sequences=False, recurrent_dropout=0.3)(x)
x = keras.layers.BatchNormalization()(x)
x = keras.layers.Dense(32, activation="relu", kernel_regularizer=keras.regularizers.l2(0.01))(x)
x = keras.layers.Dropout(0.5)(x)
x = keras.layers.BatchNormalization()(x)
output = keras.layers.Dense(len(class_vocab), activation="softmax")(x)
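# A third, deeper variant with three stacked GRU layers: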
x = keras.layers.GRU(256, return_sequences=True, recurrent_dropout=0.3)(frame_features_input)
x = keras.layers.BatchNormalization()(x)
x = keras.layers.GRU(128, return_sequences=True, recurrent_dropout=0.3)(x)
x = keras.layers.BatchNormalization()(x)
x = keras.layers.GRU(64, return_sequences=False, recurrent_dropout=0.3)(x)
x = keras.layers.BatchNormalization()(x)
x = keras.layers.Dense(32, activation="relu", kernel_regularizer=keras.regularizers.l2(0.01))(x)
x = keras.layers.Dropout(0.5)(x)
x = keras.layers.BatchNormalization()(x)
output = keras.layers.Dense(len(class_vocab), activation="softmax")(x)
The results: [plot not included]
The results using the learning rate scheduler: [plot not included]
I tried different model architectures, added regularization, early stopping, and batch normalization, but still faced the overfitting issue. I expected improved accuracy, but the actual results still show overfitting.

Related

Difference between WGAN and WGAN-GP (Gradient Penalty)

I just found that in the code here:
https://github.com/NUS-Tim/Pytorch-WGAN/tree/master/models
the "generator" loss, G, differs between WGAN and WGAN-GP. For WGAN:
g_loss = self.D(fake_images)
g_loss = g_loss.mean().mean(0).view(1)
g_loss.backward(one) # !!!
g_cost = -g_loss
But for WGAN-GP:
g_loss = self.D(fake_images)
g_loss = g_loss.mean()
g_loss.backward(mone) # !!!
g_cost = -g_loss
Why does one use one=1 and the other mone=-1?
You might have misread the source code: the first sample you gave does not average the result of D to compute its loss but instead uses the binary cross-entropy.
To be more precise:
The first method ("GAN") uses the BCE loss to compute the loss terms for D and G. The standard GAN optimization objective for D is to maximize E_x[log(D(x))] + E_z[log(1 - D(G(z)))], which is exactly what minimizing the two BCE terms below achieves. Source code:
outputs = self.D(images)
d_loss_real = self.loss(outputs.flatten(), real_labels) # <- bce loss
real_score = outputs
# Compute BCELoss using fake images
fake_images = self.G(z)
outputs = self.D(fake_images)
d_loss_fake = self.loss(outputs.flatten(), fake_labels) # <- bce loss
fake_score = outputs
# Optimize discriminator
d_loss = d_loss_real + d_loss_fake
self.D.zero_grad()
d_loss.backward()
self.d_optimizer.step()
For d_loss_real you optimize towards 1s (output is considered real), while d_loss_fake optimizes towards 0s (output is considered fake).
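For completeness, a minimal sketch of what those BCE targets typically look like (the batch size is a made-up placeholder, not a value from the repository):
import torch

batch_size = 64                        # hypothetical
real_labels = torch.ones(batch_size)   # targets for real images: push D(x) towards 1
fake_labels = torch.zeros(batch_size)  # targets for generated images: push D(G(z)) towards 0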
The second method ("WGAN") uses the Wasserstein loss (ref), whereby for D we maximise the objective E_x[D(x)] - E_z[D(G(z))]. Source code:
# Train discriminator
# WGAN - Training discriminator more iterations than generator
# Train with real images
d_loss_real = self.D(images)
d_loss_real = d_loss_real.mean()
d_loss_real.backward(mone)
# Train with fake images
z = self.get_torch_variable(torch.randn(self.batch_size, 100, 1, 1))
fake_images = self.G(z)
d_loss_fake = self.D(fake_images)
d_loss_fake = d_loss_fake.mean()
d_loss_fake.backward(one)
# [...]
Wasserstein_D = d_loss_real - d_loss_fake
By doing d_loss_real.backward(mone) you backpropagate with a gradient of the opposite sign, i.e. it is gradient ascent, and you end up maximizing d_loss_real.
To update the D network:
lossD = E[D(fake data)] - E[D(real data)] + gradient penalty
Pushing lossD down (lossD ↓) pushes D(real data) up (D(real data) ↑), so you need to feed minus one into the gradient computation.
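To make the sign trick concrete, here is a minimal, self-contained sketch (the numbers are arbitrary and not taken from the WGAN code):
import torch

x = torch.tensor(2.0, requires_grad=True)
y = 3.0 * x                 # stand-in for E[D(x)]
one = torch.tensor(1.0)
mone = -one

y.backward(mone)            # same as calling (-y).backward()
print(x.grad)               # -3.0, so a normal descent step on this gradient
                            # increases y, i.e. gradient ascent on y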

My training and validation loss suddenly increased by orders of magnitude

train function
def train(model, iterator, optimizer, criterion, clip):
    model.train()
    epoch_loss = 0
    for i, batch in enumerate(iterator):
        optimizer.zero_grad()
        output = model(batch.text)
        loss = criterion(output, torch.unsqueeze(batch.labels, 1))
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()
        epoch_loss += loss.item()
    return epoch_loss / len(iterator)
main_script
def main(
        train_file,
        test_file,
        config_file,
        checkpoint_path,
        best_model_path
):
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    with open(config_file, 'r') as j:
        config = json.loads(j.read())
    for k, v in config['model'].items():
        v = float(v)
        if v < 1.0:
            config['model'][k] = float(v)
        else:
            config['model'][k] = int(v)
    for k, v in config['training'].items():
        v = float(v)
        if v < 1.0:
            config['training'][k] = float(v)
        else:
            config['training'][k] = int(v)
    train_itr, val_itr, test_itr, vocab_size = data_pipeline(
        train_file,
        test_file,
        config['training']['max_vocab'],
        config['training']['min_freq'],
        config['training']['batch_size'],
        device
    )
    model = CNNNLPModel(
        vocab_size,
        config['model']['emb_dim'],
        config['model']['hid_dim'],
        config['model']['model_layer'],
        config['model']['model_kernel_size'],
        config['model']['model_dropout'],
        device
    )
    optimizer = optim.Adam(model.parameters())
    criterion = nn.CrossEntropyLoss()
    num_epochs = config['training']['n_epoch']
    clip = config['training']['clip']
    is_best = False
    best_valid_loss = float('inf')
    model = model.to(device)
    for epoch in tqdm(range(num_epochs)):
        train_loss = train(model, train_itr, optimizer, criterion, clip)
        valid_loss = evaluate(model, val_itr, criterion)
        if (epoch + 1) % 2 == 0:
            print("training loss {}, validation_loss{}".format(train_loss, valid_loss))
I was training a Convolutional Neural Network for binary text classification: given a sentence, it detects whether it is hate speech or not. Training loss and validation loss were fine until epoch 5; after that, both suddenly shot up from 0.2 to 10,000.
What could be the reason for such a huge, sudden increase in loss?
The default learning rate of Adam is 0.001, which, depending on the task, might be too high.
It looks like instead of converging your neural network became divergent (it left the previous ~0.2 loss minimum and fell into a different region).
Lowering your learning rate at some point (after 50% or 70% of training) would probably fix the issue.
Usually people divide the learning rate by 10 (0.0001 in your case) or by half (0.0005 in your case). Try dividing by half and see if the issue persists; in general you want to keep your learning rate as high as possible until divergence occurs, as is probably the case here.
This is what schedulers are for (gamma specifies the learning rate multiplier; you might want to change that to 0.5 first).
One can think of the lower learning rate phase as fine-tuning an already found solution (placing the weights in a better region of the loss valley), and it might require some patience.
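As a rough sketch (reusing names from your training script; the milestone epochs and gamma are placeholder values, not tuned), a step-wise decay in PyTorch could look like this:
import torch.optim as optim
from torch.optim.lr_scheduler import MultiStepLR

optimizer = optim.Adam(model.parameters(), lr=0.001)
# Halve the learning rate at roughly 50% and 70% of training (placeholder milestones).
scheduler = MultiStepLR(optimizer,
                        milestones=[num_epochs // 2, int(num_epochs * 0.7)],
                        gamma=0.5)

for epoch in range(num_epochs):
    train_loss = train(model, train_itr, optimizer, criterion, clip)
    valid_loss = evaluate(model, val_itr, criterion)
    scheduler.step()  # apply the decay schedule once per epoch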

What should I do if my regression model is stuck at a high loss value?

I'm using neural nets for a regression problem where I have 3 features and I'm trying to predict one continuous value. I noticed that my neural net starts learning well, but after 10 epochs it gets stuck at a high loss value and cannot improve anymore.
I tried to use Adam and other adaptive optimizers instead of SGD, but that didn't work. I tried more complex architectures, adding layers, neurons, batch normalization, other activations, etc., and that also didn't work.
I tried to debug and find out whether something was wrong with the implementation, but when I use only 10 examples of the data my model learns fast, so there are no errors. I started increasing the number of data examples and monitoring the results as I did so. When I reach 3000 data examples my model starts to get stuck at a high loss value.
I tried increasing layers and neurons and also trying other activations and batch normalization. My data are normalized between [-1, 1]; my target value is not normalized since it is a regression problem and I'm predicting a continuous value. I also tried using Keras, but I got the same result.
My real dataset has 40,000 examples. I don't know what else I should try; I have tried almost everything I know for optimization, but none of it worked. I would appreciate it if someone could guide me on this. I'll post my code, though it may be too messy to follow; I'm fairly sure there is no problem with my implementation. I'm using skorch/PyTorch and some sklearn functions:
# take all features as an Independant variable except the bearing and distance
# here when I start small the model learn good but from 3000 data points as you can see the model stuck on a high value. I mean the start loss is 15 and it start to learn good but when it reach 9 it stucks there
# and if I try to use the whole dataset for training then the loss start at 47 and start decreasing until it reach 36 and then stucks there too
X = dataset.iloc[:3000, 0:-2].reset_index(drop=True).to_numpy().astype(np.float32)
# take distance and bearing as the output values:
y = dataset.iloc[:3000, -2:].reset_index(drop=True).to_numpy().astype(np.float32)
y_bearing = y[:, 0].reshape(-1, 1)
y_distance = y[:, 1].reshape(-1, 1)
# normalize the input values
scaler = StandardScaler()
X_norm = scaler.fit_transform(X, y)
X_br_train, X_br_test, y_br_train, y_br_test = train_test_split(X_norm,
                                                                 y_bearing,
                                                                 test_size=0.1,
                                                                 random_state=42,
                                                                 shuffle=True)
X_dis_train, X_dis_test, y_dis_train, y_dis_test = train_test_split(X_norm,
                                                                     y_distance,
                                                                     test_size=0.1,
                                                                     random_state=42,
                                                                     shuffle=True)
bearing_trainset = Dataset(X_br_train, y_br_train)
bearing_testset = Dataset(X_br_test, y_br_test)
distance_trainset = Dataset(X_dis_train, y_dis_train)
distance_testset = Dataset(X_dis_test, y_dis_test)
def root_mse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

class RMSELoss(nn.Module):
    def __init__(self):
        super().__init__()
        self.mse = nn.MSELoss()

    def forward(self, yhat, y):
        return torch.sqrt(self.mse(yhat, y))

class AED(nn.Module):
    """custom average euclidean distance loss"""
    def __init__(self):
        super().__init__()

    def forward(self, yhat, y):
        return torch.dist(yhat, y)
def train(on_target,
          hidden_units,
          batch_size,
          epochs,
          optimizer,
          lr,
          regularisation_factor,
          train_shuffle):
    network = None
    trainset = distance_trainset if on_target.lower() == 'distance' else bearing_trainset
    testset = distance_testset if on_target.lower() == 'distance' else bearing_testset
    print(f"shape of trainset.X = {trainset.X.shape}, shape of trainset.y = {trainset.y.shape}")
    print(f"shape of testset.X = {testset.X.shape}, shape of testset.y = {testset.y.shape}")
    mse = EpochScoring(scoring=mean_squared_error, lower_is_better=True, name='MSE')
    r2 = EpochScoring(scoring=r2_score, lower_is_better=False, name='R2')
    rmse = EpochScoring(scoring=make_scorer(root_mse), lower_is_better=True, name='RMSE')
    checkpoint = Checkpoint(dirname=f'results/{on_target}/checkpoints')
    train_end_checkpoint = TrainEndCheckpoint(dirname=f'results/{on_target}/checkpoints')
    if on_target.lower() == 'bearing':
        network = BearingNetwork(n_features=X_norm.shape[1],
                                 n_hidden=hidden_units,
                                 n_out=y_distance.shape[1])
    elif on_target.lower() == 'distance':
        network = DistanceNetwork(n_features=X_norm.shape[1],
                                  n_hidden=hidden_units,
                                  n_out=1)
    model = NeuralNetRegressor(
        module=network,
        criterion=RMSELoss,
        device='cpu',
        batch_size=batch_size,
        lr=lr,
        optimizer=optim.Adam if optimizer.lower() == 'adam' else optim.SGD,
        optimizer__weight_decay=regularisation_factor,
        max_epochs=epochs,
        iterator_train__shuffle=train_shuffle,
        train_split=predefined_split(testset),
        callbacks=[mse, r2, rmse, checkpoint, train_end_checkpoint]
    )
    print(f"{'*' * 10} start training the {on_target} model {'*' * 10}")
    history = model.fit(trainset, y=None)
    print(f"{'*' * 10} End Training the {on_target} Model {'*' * 10}")
if __name__ == '__main__':
    args = parser.parse_args()
    train(on_target=args.on_target,
          hidden_units=args.hidden_units,
          batch_size=args.batch_size,
          epochs=args.epochs,
          optimizer=args.optimizer,
          lr=args.learning_rate,
          regularisation_factor=args.regularisation_lambda,
          train_shuffle=args.shuffle)
and this is my network declaration:
class DistanceNetwork(nn.Module):
    """separate NN for predicting distance"""
    def __init__(self, n_features=5, n_hidden=16, n_out=1):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(n_features, n_hidden),
            nn.LeakyReLU(),
            nn.Linear(n_hidden, 5),
            nn.LeakyReLU(),
            nn.Linear(5, n_out)
        )
here is the log while training:

tensorflow GPU crashes for 0 batch size CUDNN_STATUS_BAD_PARAM

This issue seems to have existed for a long time, and lots of users are facing it.
stream_executor/cuda/cuda_dnn.cc:444] could not convert BatchDescriptor {count: 0 feature_map_count: 64 spatial: 7 264 value_min: 0.000000 value_max: 0.000000 layout: BatchDepthYX} to cudnn tensor descriptor: CUDNN_STATUS_BAD_PARAM
The message is so mysterious that I do not know what happened in my code; however, my code works fine on CPU TensorFlow.
I heard that we can use tf.cond to get around this, but I'm new to tensorflow-gpu, so can someone please help me? My code uses Keras and takes a generator as input; this is to avoid any out-of-memory issue. The generator is built with a while True loop that yields data in batches of some batch size.
def resnet_model(bin_multiple):
    # input and reshape
    inputs = Input(shape=input_shape)
    reshape = Reshape(input_shape_channels)(inputs)
    # normal convnet layer (have to do one initially to get 64 channels)
    conv = Conv2D(64, (1, bin_multiple * note_range), padding="same", activation='relu')(reshape)
    pool = MaxPooling2D(pool_size=(1, 2))(conv)
    for i in range(int(np.log2(bin_multiple)) - 1):
        print(i)
        # residual block
        bn = BatchNormalization()(pool)
        re = Activation('relu')(bn)
        freq_range = int((bin_multiple / (2 ** (i + 1))) * note_range)
        print(freq_range)
        conv = Conv2D(64, (1, freq_range), padding="same", activation='relu')(re)
        # add and downsample
        ad = add([pool, conv])
        pool = MaxPooling2D(pool_size=(1, 2))(ad)
    flattened = Flatten()(pool)
    fc = Dense(1024, activation='relu')(flattened)
    do = Dropout(0.5)(fc)
    fc = Dense(512, activation='relu')(do)
    do = Dropout(0.5)(fc)
    outputs = Dense(note_range, activation='sigmoid')(do)
    model = Model(inputs=inputs, outputs=outputs)
    return model

model = resnet_model(bin_multiple)
init_lr = float(args['init_lr'])
model.compile(loss='binary_crossentropy',
              optimizer=SGD(lr=init_lr, momentum=0.9),
              metrics=['accuracy', 'mae', 'categorical_accuracy'])
model.summary()
history = model.fit_generator(trainGen.next(), trainGen.steps(), epochs=epochs,
                              verbose=1, validation_data=valGen.next(),
                              validation_steps=valGen.steps(),
                              callbacks=callbacks, workers=8, use_multiprocessing=True)
The problem occurs when your model receives a batch of size 0. For me, the error happened because I had 1000 examples and ran the training on multiple GPUs (2 GPUs) with a batch size of 32. In my graph I split the batch into mini-batches so that each GPU takes 16 examples. At step 31 (31 * 32) I have already consumed 992 examples, so there are only 8 examples left; they go to GPU 1, and GPU 2 ends up with a batch size of zero, which is why I received the error above.
I still couldn't solve it and am still searching for a proper solution.
I hope this helps you discover where in your code you receive a zero batch size.
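If it helps, one common workaround is to make the data generator yield only full batches, dropping the remainder, so that no GPU ever receives an empty batch. This is only a sketch under the assumption that the data fits in memory as NumPy arrays; X, y and batch_size are hypothetical names, not taken from the code above:
import numpy as np

def full_batches_only(X, y, batch_size):
    """Yield only complete batches, dropping the remainder (e.g. 1000 -> 31 * 32 = 992)."""
    n_full = (len(X) // batch_size) * batch_size
    while True:  # Keras generators are expected to loop forever
        idx = np.random.permutation(len(X))[:n_full]
        for start in range(0, n_full, batch_size):
            sel = idx[start:start + batch_size]
            yield X[sel], y[sel]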

Function approximator and q-learning

I am trying to implement Q-learning with an action-value approximation function. I am using openai-gym and the "MountainCar-v0" environment to test my algorithm. My problem is that it does not converge or find the goal at all.
Basically the approximator works like the following: you feed in the 2 features, position and velocity, plus one of the 3 actions in a one-hot encoding (0 -> [1,0,0], 1 -> [0,1,0] and 2 -> [0,0,1]). The output is the action-value approximation Q_approx(s,a) for that specific action.
I know that usually the input is just the state (2 features) and the output layer contains one output for each action. The big difference I see is that I run the feed-forward pass 3 times (once for each action) and take the max, while in the standard implementation you run it once and take the max over the outputs.
Maybe my implementation is just completely wrong and I am thinking about it the wrong way. I'll paste the code here; it is a mess, but I am just experimenting a bit:
import gym
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation

env = gym.make('MountainCar-v0')
# The mean reward over 20 episodes
mean_rewards = np.zeros(20)
# Feature numpy holder
features = np.zeros(5)
# Q_a value holder
qa_vals = np.zeros(3)

one_hot = {
    0: np.asarray([1, 0, 0]),
    1: np.asarray([0, 1, 0]),
    2: np.asarray([0, 0, 1])
}

model = Sequential()
model.add(Dense(20, activation="relu", input_dim=5))
model.add(Dense(10, activation="relu"))
model.add(Dense(1))
model.compile(optimizer='rmsprop',
              loss='mse',
              metrics=['accuracy'])

epsilon_greedy = 0.1
discount = 0.9
batch_size = 16

# Experience replay containing features and target
experience = np.ones((10 * 300, 5 + 1))
# Ring-buffer bookkeeping (initialised here so the loop below can run)
fill_index = 0
filled_once = False

# Ring buffer
def add_exp(features, target, index):
    if index % experience.shape[0] == 0:
        index = 0
        global filled_once
        filled_once = True
    experience[index, 0:5] = features
    experience[index, 5] = target
    index += 1
    return index

for e in range(0, 100000):
    obs = env.reset()
    old_obs = None
    new_obs = obs
    rewards = 0
    loss = 0
    for i in range(0, 300):
        if old_obs is not None:
            # Find q_a max for s_(t+1)
            features[0:2] = new_obs
            for i, pa in enumerate([0, 1, 2]):
                features[2:5] = one_hot[pa]
                qa_vals[i] = model.predict(features.reshape(-1, 5))
            rewards += reward
            target = reward + discount * np.max(qa_vals)
            features[0:2] = old_obs
            features[2:5] = one_hot[a]
            fill_index = add_exp(features, target, fill_index)
            # Find new action
            if np.random.random() < epsilon_greedy:
                a = env.action_space.sample()
            else:
                a = np.argmax(qa_vals)
        else:
            a = env.action_space.sample()
        obs, reward, done, info = env.step(a)
        old_obs = new_obs
        new_obs = obs
        if done:
            break
    if filled_once:
        samples_ids = np.random.choice(experience.shape[0], batch_size)
        loss += model.train_on_batch(experience[samples_ids, 0:5],
                                     experience[samples_ids, 5].reshape(-1))[0]
    mean_rewards[e % 20] = rewards
    print("e = {} and loss = {}".format(e, loss))
    if e % 50 == 0:
        print("e = {} and mean = {}".format(e, mean_rewards.mean()))
Thanks in advance!
There shouldn't be much difference between feeding the actions as inputs to your network or having them as separate outputs of your network. It does make a huge difference if your states are images, for example, because conv nets work very well with images and there would be no obvious way of integrating the actions into the input.
Have you tried the CartPole balancing environment? It is better for testing whether your model is working correctly.
MountainCar is pretty hard. It has no reward until you reach the top, which often doesn't happen at all. The model will only start learning something useful once you get to the top once. If you are never getting to the top, you should probably increase the time spent on exploration; in other words, take more random actions, a lot more...
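To illustrate the standard layout mentioned above (state in, one Q-value per action out), here is a minimal sketch; the layer sizes and the example observation are arbitrary choices, not a reference implementation:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# State-only input (position, velocity); one Q-value per action on the output.
q_model = Sequential([
    Dense(20, activation="relu", input_dim=2),
    Dense(10, activation="relu"),
    Dense(3)                       # Q(s, a) for actions 0, 1, 2 in a single forward pass
])
q_model.compile(optimizer="rmsprop", loss="mse")

state = np.array([[-0.5, 0.0]])    # example MountainCar observation
q_values = q_model.predict(state)  # shape (1, 3); greedy action = np.argmax(q_values)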