Validation loss higher than training loss from first epoch - deep-learning

From what I’ve learned, when the validation loss is greater than the training loss, the model is overfitting. However, I’m seeing this from the very first epoch.
See below:
I'm using a tabular learner (from FastAI v2) that has about 72 inputs.
I have 360K unevenly distributed cases: the majority are cat1, then cat2, and so on, down to about 20K cases of cat6. I upsample the training data so all categories are equally represented. The validation set is 2% of the training set.
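The post doesn't show the resampling code; for reference, a minimal sketch of this kind of per-category upsampling with pandas (assuming the same CSV and target column as in the code below) could look like:
import pandas as pd

df = pd.read_csv('/content/total_training.csv')
target = 'corrected_person_position_type_id'

# Sample every category (with replacement) up to the size of the largest one.
max_count = df[target].value_counts().max()
balanced = pd.concat(
    grp.sample(max_count, replace=True, random_state=42)
    for _, grp in df.groupby(target)
).reset_index(drop=True)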
I tried lowering the number of layers. This brings the training loss a bit closer to the validation loss, but the validation loss always stays higher than the training loss, from the first epoch on.
What could be the explanation for this?
Code used:
from fastai.tabular.all import *  # provides TabularDataLoaders, tabular_learner, Categorify, Normalize, accuracy

coord_labels, semantic_labels = [], []
for i in range(18):
    coord_labels += [f'x{i+1}', f'y{i+1}', f'conf{i+1}']
    semantic_labels += [f'sem{i+1}']

dls = TabularDataLoaders.from_csv(
    '/content/total_training.csv',
    y_names='corrected_person_position_type_id',
    cont_names=coord_labels,
    cat_names=semantic_labels,
    procs=[Categorify, Normalize],
    valid_idx=valid_idx,  # indices of the 2% validation split, defined elsewhere
    bs=2048,
)
learn = tabular_learner(dls, metrics=accuracy)
learn.fit_one_cycle(10)

Related

Validation loss with NaN in time-series classification

Case 1:
I am feeding variable-length time-series windows to a GRU model: sometimes a window contains 900 samples, sometimes only 16. I chose an RNN (GRU) since I learned that RNN methods work well on long sequences. I use one GRU layer and return the hidden states across all time stamps in order to capture the maximum information from every time stamp. Then I apply average pooling over the GRU outputs to obtain a fixed-length representation; the intuition for using average pooling instead of max pooling is that it summarizes information across all the time stamps. Here is the code of the model:
input_layer = tf.keras.Input(shape=input_shape, name="time_series_activity")
input_mask = tf.keras.layers.Masking(mask_value=0.0)(input_layer)
gru_l5 = tf.keras.layers.GRU(
    64, activation='tanh', recurrent_activation='sigmoid',
    recurrent_initializer=tf.keras.initializers.Orthogonal(),
    dropout=0.5, recurrent_dropout=0.5, return_sequences=True,
)(input_mask)
AP = tf.keras.layers.GlobalAveragePooling1D()(gru_l5)
gru_fm = tf.keras.layers.Dropout(0.3)(AP)
output_layer = tf.keras.layers.Dense(total_classes, activation="softmax")(gru_fm)
return tf.keras.models.Model(inputs=input_layer, outputs=output_layer)
From this model I get better performance on the validation set, while on the training data the performance goes up to 100% (i.e., it overfits). The major issue, however, is that the validation loss is NaN. This issue has been discussed on GitHub and StackOverflow.
I tried nearly all of the options provided here, here, and here, but I am unable to resolve this validation_loss = NaN issue.
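One generic diagnostic worth running before anything else (a sketch, not among the linked suggestions): make TensorFlow raise on the first op that produces a NaN or Inf, and verify the raw inputs themselves are finite. X_train is a placeholder name for the input array.
import numpy as np
import tensorflow as tf

# Raise an error at the first op that outputs NaN or Inf anywhere in the model.
tf.debugging.enable_check_numerics()

# Rule out non-finite values in the data itself (X_train is a placeholder name).
assert np.isfinite(X_train).all(), 'input contains NaN or Inf'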
Case 2:
Then I decided not to take all of the GRU's hidden states but rather to retrieve only the last hidden state, which provides a fixed-length representation and eliminates the need for pooling. Here, the NaN validation loss problem is fixed, but performance on the test data drops drastically. Here is this model's source code:
input_layer = tf.keras.Input(shape=input_shape, name="time_series_activity")
input_mask = tf.keras.layers.Masking(mask_value=0.0)(input_layer)
gru_l5 = tf.keras.layers.GRU(
    64, activation='tanh', recurrent_activation='sigmoid',
    recurrent_initializer=tf.keras.initializers.Orthogonal(),
    dropout=0.5, recurrent_dropout=0.5,
)(input_mask)
gru_fm = tf.keras.layers.Dropout(0.3)(gru_l5)
output_layer = tf.keras.layers.Dense(total_classes, activation="softmax")(gru_fm)
return tf.keras.models.Model(inputs=input_layer, outputs=output_layer)
Comparing the results of the two cases, my feeling is that in Case 1 a vanishing-gradient problem occurs on the longer sequences. Any thoughts or discussion on resolving the NaN issue while keeping high performance would be much appreciated.
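If the NaNs come from exploding (rather than vanishing) gradients on the long sequences, one common mitigation is to clip gradients in the optimizer. A minimal sketch, assuming the model built in Case 1; the sparse-label loss is an assumption, not from the post:
import tensorflow as tf

# clipnorm caps the norm of each gradient, so one bad batch cannot blow up the update.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1.0)
model.compile(optimizer=optimizer,
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])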

Fully connected neural network with constant loss

I am working on a project to predict soccer player values from a set of inputs. The data consists of about 19,000 rows and 8 columns (7 input columns and 1 target column), all numerical values.
I am using a fully connected neural network for the prediction, but the problem is that the loss is not decreasing as it should.
The loss is very large (around 1e+13) and doesn't decrease; it just fluctuates.
This is the function I am using to run the model:
def gradient_descent(model, learning_rate, num_epochs, data_loader, criterion):
    losses = []
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)  # use the passed-in learning rate
    for epoch in range(num_epochs):  # one epoch
        for inputs, outputs in data_loader:  # one iteration
            inputs, outputs = inputs.to(torch.float32), outputs.to(torch.float32)
            logits = model(inputs)
            loss = criterion(torch.squeeze(logits), outputs)  # forward pass
            optimizer.zero_grad()  # zero out the gradients
            loss.backward()  # compute the gradients (backward pass)
            optimizer.step()  # take one step
            losses.append(loss.item())
        epoch_loss = sum(losses[-len(data_loader):]) / len(data_loader)
        print(f'Epoch #{epoch}: Loss={epoch_loss:.3e}')
    return losses
The model is a fully connected neural network with 4 hidden layers of 7 neurons each; the input layer has 7 neurons and the output has 1. I am using MSE as the loss function. I tried changing the learning rate, but the results are still bad.
What could be the reason behind this?
Thank you!
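For reference, a minimal sketch of the architecture as described (7 inputs, 4 hidden layers of 7 neurons, 1 output, MSE loss); the ReLU activations are an assumption, since the post doesn't name one:
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(7, 7), nn.ReLU(),  # hidden layer 1
    nn.Linear(7, 7), nn.ReLU(),  # hidden layer 2
    nn.Linear(7, 7), nn.ReLU(),  # hidden layer 3
    nn.Linear(7, 7), nn.ReLU(),  # hidden layer 4
    nn.Linear(7, 1),             # single regression output
)
criterion = nn.MSELoss()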
It is difficult to diagnose your problem from the information you provided, but I'll try to point you in some useful directions.
Data Normalization:
The way we initialize the weights in a deep NN has a significant effect on the training process. See, e.g.:
He, K., Zhang, X., Ren, S. and Sun, J., Delving deep into rectifiers: Surpassing human-level performance on imagenet classification (ICCV 2015).
Most initialization methods assume the inputs have zero mean and unit variance (or similar statistics). If your inputs violate these assumptions, you will find it difficult to train. See, e.g., this post.
Normalize the Targets:
You are trying to solve a regression problem (MSE loss); it might be that your targets are poorly scaled, causing very large loss values. Try normalizing the targets so they span a more compact range.
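A minimal sketch of both normalizations with scikit-learn (X_train and y_train are placeholder NumPy arrays; fit the scalers on the training split only and reuse them on validation/test data):
from sklearn.preprocessing import StandardScaler

x_scaler = StandardScaler()
y_scaler = StandardScaler()

X_train = x_scaler.fit_transform(X_train)                         # zero-mean, unit-variance inputs
y_train = y_scaler.fit_transform(y_train.reshape(-1, 1)).ravel()  # compactly scaled targets

# At inference time, map predictions back to the original scale:
# y_pred = y_scaler.inverse_transform(pred.reshape(-1, 1))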
Learning Rate:
Try adjusting your learning rate: both increasing it and decreasing it by orders of magnitude.

Which parameters of Mask-RCNN control mask recall?

I'm interested in fine-tuning a Mask-RCNN model that I'm using for instance segmentation. Currently I have trained the model for 6 epochs and the various Mask-RCNN losses are as follows (loss plots omitted).
The reason I'm stopping is that the COCO evaluation metrics seem to have dipped in the last epoch (metric plots omitted).
I know this is a far-reaching question, but I'm looking to gain some intuition about which parameters will be most impactful in improving the evaluation metrics. I understand there are three places to consider:
Should I be looking at batch size, learning rate, and momentum? This run uses an SGD optimizer with a learning rate of 1e-4 and a batch size of 2.
Should I be looking at using more training data or adding augmentation (I don't currently use any)? My dataset is already pretty large at 40K images.
Should I be looking at the specific Mask-RCNN parameters?
I think I'll likely be asked to be more specific about what I want to improve, so let me say that I would like to improve the recall of the individual masks. The model is performing well but doesn't quite capture the full extent of what I would like it to. I'm also leaving out details of the specific learning problem, as I'd like to gain intuition about how to approach this in general.
A couple of notes:
6 epochs is too few for the network to converge, even if you use a pre-trained network, especially one as big as resnet50. I think you need at least 50 epochs. With a pre-trained resnet18 I started to get good results after 30 epochs; resnet34 needed another 10-20; your resnet50 with a 40K-image training set definitely needs more than 6;
definitely use a pre-trained network;
in my experience, I failed to get the results I wanted with SGD. I started using AdamW plus a ReduceLROnPlateau scheduler. The network converges quite fast, reaching 50-60% AP around epoch 7 or 8, but it only climbs to 80-85% after 50-60 epochs of very small improvements from epoch to epoch, and only if the LR is small enough. You must be familiar with the notion of gradient descent: I think of it as if, with more augmentation, your "hill" is covered with "boulders" that you can only bypass if you control the LR. Additionally, AdamW helps with overfitting.
This is how I do it. For networks with higher input resolution (your input images are scaled on input by the net itself), I use a higher LR.
init_lr = 0.00005
weight_decay = init_lr * 100
optimizer = torch.optim.AdamW(params, lr=init_lr, weight_decay=weight_decay)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, verbose=True, patience=3, factor=0.75)

for epoch in range(epochs):
    # train for one epoch, printing every 10 iterations
    metric_logger = train_one_epoch(model, optimizer, train_loader, scaler, device,
                                    epoch, print_freq=10)
    scheduler.step(metric_logger.loss.global_avg)
    # keep weight decay proportional to the current learning rate
    optimizer.param_groups[0]["weight_decay"] = optimizer.param_groups[0]["lr"] * 100
    # scheduler.step()
    # evaluate on the test dataset
    evaluate(model, test_loader, device=device)

print("[INFO] serializing model to '{}' ...".format(args["model"]))
save_and_print_size_of_model(model, args["model"], script=False)
Find an LR and weight decay such that training exhausts the LR down to a very small value, like 1/10 of your initial LR, by the end of training. If you hit a plateau too often, the scheduler quickly drives the LR to very small values and the network will learn nothing for the rest of the epochs.
Your plots indicate that your LR is too high at some point in the training: the network stops improving and then the AP goes down. You need constant improvements, even small ones. The longer the network trains, the more subtle the details it learns about your domain, and the smaller the learning rate should be. Imho, a constant LR will not allow that.
anchor generator settings. Here is how I initialize the network:
def get_maskrcnn_resnet_model(name, num_classes, pretrained, res='normal'):
    print('Using maskrcnn with {} backbone...'.format(name))
    backbone = resnet_fpn_backbone(name, pretrained=pretrained, trainable_layers=5)

    sizes = ((4,), (8,), (16,), (32,), (64,))
    aspect_ratios = ((0.25, 0.5, 1.0, 2.0, 4.0),) * len(sizes)
    anchor_generator = AnchorGenerator(sizes=sizes, aspect_ratios=aspect_ratios)

    roi_pooler = torchvision.ops.MultiScaleRoIAlign(featmap_names=['0', '1', '2', '3'],
                                                    output_size=7, sampling_ratio=2)

    default_min_size = 800
    default_max_size = 1333
    if res == 'low':
        min_size = int(default_min_size / 1.25)
        max_size = int(default_max_size / 1.25)
    elif res == 'normal':
        min_size = default_min_size
        max_size = default_max_size
    elif res == 'high':
        min_size = int(default_min_size * 1.25)
        max_size = int(default_max_size * 1.25)
    else:
        raise ValueError('Invalid res={} param'.format(res))

    model = MaskRCNN(backbone, min_size=min_size, max_size=max_size, num_classes=num_classes,
                     rpn_anchor_generator=anchor_generator, box_roi_pool=roi_pooler)
    model.roi_heads.detections_per_img = 512
    return model
I need to find small objects here, which is why I use such anchor params.
class imbalance issue. If you only have your object and background, there is no problem. If you have more classes, make sure that your training split (e.g., 80% for train and 20% for test) holds more or less exactly for every class used in your particular training; a stratified split does this, as shown in the sketch below.
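A minimal sketch of a stratified split with scikit-learn, where labels is a placeholder array holding one class id per image:
import numpy as np
from sklearn.model_selection import train_test_split

indices = np.arange(len(labels))
train_idx, test_idx = train_test_split(
    indices,
    test_size=0.2,      # the 80/20 split mentioned above
    stratify=labels,    # preserve the split ratio within every class
    random_state=42,
)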
Good luck!

PyTorch find keypoints: output nodes to be in a range and negative loss

I am a beginner in deep learning.
I am using this dataset and I want my network to detect keypoints of a hand.
How can I make my output layer's nodes fall in the range [-1, 1] (the range of normalized 2D points)?
Another problem: when I train for more than 1 epoch, the loss takes negative values.
criterion: torch.nn.MultiLabelSoftMarginLoss() and optimizer: torch.optim.SGD()
Here you can find my repo:
net = nnModel.Net()
net = net.to(device)
criterion = nn.MultiLabelSoftMarginLoss()
optimizer = optim.SGD(net.parameters(), lr=learning_rate)
lr_scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer=optimizer, gamma=decay_rate)
You can use the Tanh activation function, since the image of the function lies in [-1, 1].
The problem of predicting keypoints in an image is more of a regression problem than a classification problem (especially if your model's outputs and targets fall within a continuous interval). Therefore, I suggest you use the L2 loss.
In fact, it could be a good exercise to determine, using cross-validation, which of the loss functions appropriate for regression problems gives the lowest expected generalization error. There are several such functions available in PyTorch.
One way I can think of is to use torch.nn.Sigmoid, which produces outputs in the [0, 1] range, and scale the outputs to [-1, 1] using the transformation 2*x - 1.
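A minimal sketch combining these suggestions (a Tanh head, with the scaled-Sigmoid variant shown as an alternative, trained with the L2 loss); the layer sizes are placeholders, not from the linked repo:
import torch
import torch.nn as nn

class KeypointHead(nn.Module):
    """Maps feature vectors to 2D keypoints in [-1, 1]."""
    def __init__(self, in_features, num_keypoints):
        super().__init__()
        self.fc = nn.Linear(in_features, num_keypoints * 2)

    def forward(self, x):
        return torch.tanh(self.fc(x))  # squashes outputs into [-1, 1]
        # alternative: 2 * torch.sigmoid(self.fc(x)) - 1

criterion = nn.MSELoss()  # L2 loss for the regression formulation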

Pytorch LSTM text-generator repeats same words

UPDATE: It was a mistake in the logic generating new characters. See answer below.
ORIGINAL QUESTION: I built an LSTM for character-level text generation with Pytorch. The model trains well (loss decreases reasonably etc.) but the trained model ends up outputting the last handful of words of the input repeated over and over again (e.g. Input: "She told her to come back later, but she never did"; Output: ", but she never did, but she never did, but she never did" and so on).
I have played around with the hyperparameters a bit, and the problem persists. I'm currently using:
Loss function: BCE
Optimizer: Adam
Learning rate: 0.001
Sequence length: 64
Batch size: 32
Embedding dim: 128
Hidden dim: 512
LSTM layers: 2
I also tried not always choosing the top choice, but this only introduces incorrect words and doesn't break the loop. I've been looking at countless tutorials, and I can't quite figure out what I'm doing differently/wrong.
The following is the code for training the model. training_data is one long string and I'm looping over it predicting the next character for each substring of length SEQ_LEN. I'm not sure if my mistake is here or elsewhere but any comment or direction is highly appreciated!
loss_dict = dict()
for e in range(EPOCHS):
    print("------ EPOCH {} OF {} ------".format(e+1, EPOCHS))
    lstm.reset_cell()
    for i in range(0, DATA_LEN, BATCH_SIZE):
        if i % 50000 == 0:
            print(i/float(DATA_LEN))
        optimizer.zero_grad()
        input_vector = torch.tensor([[
            vocab.get(char, len(vocab))
            for char in training_data[i+b:i+b+SEQ_LEN]
        ] for b in range(BATCH_SIZE)])
        if USE_CUDA and torch.cuda.is_available():
            input_vector = input_vector.cuda()
        output_vector = lstm(input_vector)
        target_vector = torch.zeros(output_vector.shape)
        if USE_CUDA and torch.cuda.is_available():
            target_vector = target_vector.cuda()
        for b in range(BATCH_SIZE):
            target_vector[b][vocab.get(training_data[i+b+SEQ_LEN])] = 1
        error = loss(output_vector, target_vector)
        error.backward()
        optimizer.step()
        loss_dict[(e, int(i/BATCH_SIZE))] = error.detach().item()
ANSWER: I had made a stupid mistake when producing the characters with the trained model: I got confused with the batch size and assumed that at each step the network would predict an entire batch of new characters when in fact it only predicts a single one… That's why it simply repeated the end of the input. Yikes!
Anyways, if you run into this problem, DOUBLE CHECK that you have the right logic for producing new output with the trained model (especially if you're using batches). If it's not that and the problem persists, you can try fine-tuning the following:
sequence length
greediness (e.g. probabilistic choice vs. top choice for the next character; see the sampling sketch after this list)
batch size
epochs
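On the greediness point, a common middle ground between always taking the top choice and fully random sampling is temperature sampling; a minimal sketch, where logits stands for the model's raw output scores for the next character:
import torch

def sample_next_char(logits, temperature=0.8):
    # temperature < 1 sharpens the distribution (greedier),
    # temperature > 1 flattens it (more random).
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()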