I'm interested in fine-tuning a Mask-RCNN model that I'm using for instance segmentation. Currently I have trained the model for 6 epochs and the various Mask-RCNN losses are as follows:
The reason I'm stopping is that the COCO evaluation metrics seem to have dipped in the last epoch:
I know this is a far reaching question, but I'm looking to gain some intuition of how to understand which parameters are going to be the most impactful in improving the evaluation metrics. I understand there are three places to consider:
Should I be looking at batch size, learning rate and momentum, this uses an SGD optimizer with a learning rate of 1e-4 and batch size 2?
Should I be looking at using more training data or adding augmentation (I don't currently use any) and my dataset is current pretty large 40K images?
Should I be looking at the specific MaskRCNN parameters?
I thing I'll likely be asked to me more specific on what I want to improve so let me say that I would like to improve the recall of the individual masks. The model is performing well but doesn't quite capture the full extend of what I would like it to. I'm also leaving out details of the specific learning problem as I'd like to gain intuition of how to approach this in general.
A couple of notes:
6 epochs are too small for the network to converge even if you use a pre-trained network—especially such a big one as resnet50. I think you need at least 50 epochs. On a pre-trained resnet18 I started to get good results after 30 epochs, resnet34 needed +10-20 epochs and your resnet50 + 40k images of the train set - definitely need more epochs than 6;
definitely use a pre-trained network;
in my experience, I failed to get the results I like with SGD. I started using AdamW + ReduceLROnPlateau scheduler. The network converges quite fast, like 50-60% AP on epoch 7 or 8 but then it comes up to 80-85 after 50-60 epochs using very small improvements from epoch to epoch, only if the LR is small enough. You must be familiar with the gradient descent notion. I used to think of it as if you have more augmentation, your "hill" is covered with "boulders" that you have to be able to bypass and this is only possible if you control the LR. Additionally, AdamW helps with the overfitting.
This is how I do it. For networks with higher input resolution (your input images are scaled on input by the net itself), I use higher LR.
init_lr = 0.00005
weight_decay = init_lr * 100
optimizer = torch.optim.AdamW(params, lr=init_lr, weight_decay=weight_decay)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, verbose=True, patience=3, factor=0.75)
for epoch in range(epochs):
# train for one epoch, printing every 10 iterations
metric_logger = train_one_epoch(model, optimizer, train_loader, scaler, device,
epoch, print_freq=10)
scheduler.step(metric_logger.loss.global_avg)
optimizer.param_groups[0]["weight_decay"] = optimizer.param_groups[0]["lr"] * 100
# scheduler.step()
# evaluate on the test dataset
evaluate(model, test_loader, device=device)
print("[INFO] serializing model to '{}' ...".format(args["model"]))
save_and_print_size_of_model(model, args["model"], script=False)
Find such an LR and weight decay that the training exhausts LR to a very small value, like 1/10 of your initial LR, at the end of the training. If you will have a plateau too often, the scheduler quickly brings it to very small values and the network will learn nothing all the rest of the epochs.
Your plots indicate that your LR is too high at some point in the training, the network stops training and then AP is going down. You need constant improvements, even small ones. The more network trains the more subtle details it learns about your domain and the smaller the learning rate. Imho, constant LR will not allow doing that correctly.
anchor generator settings. Here is how I initialize the network.
def get_maskrcnn_resnet_model(name, num_classes, pretrained, res='normal'):
print('Using maskrcnn with {} backbone...'.format(name))
backbone = resnet_fpn_backbone(name, pretrained=pretrained, trainable_layers=5)
sizes = ((4,), (8,), (16,), (32,), (64,))
aspect_ratios = ((0.25, 0.5, 1.0, 2.0, 4.0),) * len(sizes)
anchor_generator = AnchorGenerator(
sizes=sizes, aspect_ratios=aspect_ratios
)
roi_pooler = torchvision.ops.MultiScaleRoIAlign(featmap_names=['0', '1', '2', '3'],
output_size=7, sampling_ratio=2)
default_min_size = 800
default_max_size = 1333
if res == 'low':
min_size = int(default_min_size / 1.25)
max_size = int(default_max_size / 1.25)
elif res == 'normal':
min_size = default_min_size
max_size = default_max_size
elif res == 'high':
min_size = int(default_min_size * 1.25)
max_size = int(default_max_size * 1.25)
else:
raise ValueError('Invalid res={} param'.format(res))
model = MaskRCNN(backbone, min_size=min_size, max_size=max_size, num_classes=num_classes,
rpn_anchor_generator=anchor_generator, box_roi_pool=roi_pooler)
model.roi_heads.detections_per_img = 512
return model
I need to find small objects here why I use such anchor params.
classes in-balancing issue. If you have only your object and bg - no problem. If you have more classes then make sure that your training split (as 80% for train and 20% for the test) is more or less precisely applied to all the classes used in your particular training.
Good luck!
Related
I am working on a project to predict soccer player values from a set of inputs. The data consists of about 19,000 rows and 8 columns (7 columns for input and 1 column for the target) all of numerical values.
I am using a fully connected Neural Network for the prediction but the problem is the loss is not decreasing as it should.
The loss is very large (1e+13) and doesn’t decrease as it should, it just fluctuates.
This is the function I am using to run the model:
def gradient_descent(model, learning_rate, num_epochs, data_loader, criterion):
losses = []
optimizer = torch.optim.Adam(model.parameters())
for epoch in range(num_epochs): # one epoch
for inputs, outputs in data_loader: # one iteration
inputs, outputs = inputs.to(torch.float32), outputs.to(torch.float32)
logits = model(inputs)
loss = criterion(torch.squeeze(logits), outputs) # forward-pass
optimizer.zero_grad() # zero out the gradients
loss.backward() # compute the gradients (backward-pass)
optimizer.step() # take one step
losses.append(loss.item())
loss = sum(losses[-len(data_loader):]) / len(data_loader)
print(f'Epoch #{epoch}: Loss={loss:.3e}')
return losses
The model is fully connected neural network with 4 hidden layers, each with 7 neurons. input layer has 7 neurons and output has 1. I am using MSE for loss function. I tried changing the learning rate but it is still bad.
What could be the reason behind this?
Thank you!
It is difficult to diagnose your problem from the information you provided, but I'll try to point you in some useful directions.
Data Normalization:
The way we initialize the weights in deep NN has a significant effect on the training process. See, e.g.:
He, K., Zhang, X., Ren, S. and Sun, J., Delving deep into rectifiers: Surpassing human-level performance on imagenet classification (ICCV 2015).
Most initialization methods assume the inputs have zero mean and unit variance (or similar statistics). If your inputs violate these assumptions, you will find it difficult to train. See, e.g., this post.
Normalize the Targets:
You are trying to solve a regression problem (MSE loss), it might be the case that your targets are poorly scaled and causing very large loss values. Try and normalize the targets to span a more compact range.
Learning Rate:
Try and adjust your learning rate: both increasing it and decreasing it by orders of magnitude.
I am working with a very large dataset with hundreds of long videos to be used as training and I'm using Google Colab to perform some tests. The whole code I wrote is quite simple and uses PyTorch.
When I try to perform the training, if I use more than 200 videos at a time, the RAM fullfills during the training and the Colab crashes. I noticed that this does not happens if I train with lower number of training videos.
For that reason I thought that my model may be trained incrementally creaing a structure as follows:
model = torch.nn.Sequential( # create a model
...
nn.Softmax(dim=1)
)
MAX_VIDEOS_PER_BATCH = 100
for current_batch in range (0, TOTAL_BATCHES): # Perform TOTAL_BATCHES trainings
videos = []
labels = []
for index, video_file_name in enumerate(os.listdir(VIDEOS_DIR)): # Read 100 videos as training set
if index < MAX_VIDEOS_PER_BATCH * current_batch:
continue
... # read the video and add it to videos
... # add the considered labels to videos list
video_training = torch.tensor(np.asarray(videos)).float() # (batch x frames x channels x height x width)
learning_rate = 1e-4
for t in range(ITERATIONS): # Train the model, if I already trained it the model is not resetted
y_pred = model(torch.FloatTensor(np.asarray(video_training )))
loss = loss_fn(y_pred, torch.tensor(labels))
print("#" + str(t), " loss:" + str(loss.item()))
model.zero_grad()
loss.backward()
with torch.no_grad():
for param in model.parameters():
param -= learning_rate * param.grad
My question is, is this method correct? I am training the network in a correct manner or this batches approach will create some damages or biases to the model?
When I go from batch 1 to batch 2, the model won't lose the previous trained knowledge, is it correct?
This is correct but the best way is to do the reverse.
Below is the code to configure TrainingArguments consumed from the HuggingFace transformers library to finetune the GPT2 language model.
training_args = TrainingArguments(
output_dir="./gpt2-language-model", #The output directory
num_train_epochs=100, # number of training epochs
per_device_train_batch_size=8, # batch size for training #32, 10
per_device_eval_batch_size=8, # batch size for evaluation #64, 10
save_steps=100, # after # steps model is saved
warmup_steps=500,# number of warmup steps for learning rate scheduler
prediction_loss_only=True,
metric_for_best_model = "eval_loss",
load_best_model_at_end = True,
evaluation_strategy="epoch",
learning_rate=0.00004, # learning rate
)
early_stop_callback = EarlyStoppingCallback(early_stopping_patience = 3)
trainer = Trainer(
model=gpt2_model,
args=training_args,
data_collator=data_collator,
train_dataset=train_dataset,
eval_dataset=test_dataset,
callbacks = [early_stop_callback],
)
The number of epochs as 100 and learning_rate as 0.00004 and also the early_stopping is configured with the patience value as 3.
The model ran for 5/100 epochs and noticed that the difference in loss_value is negligible. The latest checkpoint is saved as checkpoint-latest.
Now Can I modify the learning_rate may be to 0.01 from 0.00004 and resume the training from the latest saved checkpoint - checkpoint-latest? Doing that will be efficient?
Or to train with the new learning_rate value should I start the training from the beginning?
No, you don't have to restart your training.
Changing the learning rate is like changing how big a step your model take in the direction determined by your loss function.
You can also think of it as transfer learning where the model has some experience (no matter how little or irrelevant) and the weights are in a state most likely better than a randomly initialised one.
As a matter of fact, changing the learning rate mid-training is considered an art in deep learning and you should change it if you have a very very good reason to do it.
You would probably want to write down when (why, what, etc) you did it if you or someone else wants to "reproduce" the result of your model.
Pytorch provides several methods to adjust the learning_rate: torch.optim.lr_scheduler.
Check the docs for usage https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
I'm currently working on a project in pytorch on Wasserstein GAN (https://arxiv.org/pdf/1701.07875.pdf).
In Wasserstain GAN a new objective function is defined using the wasserstein distance as :
Which leads to the following algorithms for training the GAN:
My question is :
When implementing line 5 and 6 of the algorithm in pytorch should I be multiplying my loss -1 ? As in my code (I use RMSprop as my optimizer for both the generator and critic):
############################
# (1) Update D network: maximize (D(x)) + (D(G(x)))
###########################
for n in range(n_critic):
D.zero_grad()
real_cpu = data[0].to(device)
b_size = real_cpu.size(0)
output = D(real_cpu)
#errD_real = -criterion(output, label) #DCGAN
errD_real = torch.mean(output)
# Calculate gradients for D in backward pass
errD_real.backward()
D_x = output.mean().item()
## Train with all-fake batch
# Generate batch of latent vectors
noise = torch.randn(b_size, 100, device=device) #Careful here we changed shape of input (original : torch.randn(4, 100, 1, 1, device=device))
# Generate fake image batch with G
fake = G(noise)
# Classify all fake batch with D
output = D(fake.detach())
# Calculate D's loss on the all-fake batch
errD_fake = torch.mean(output)
# Calculate the gradients for this batch
errD_fake.backward()
D_G_z1 = output.mean().item()
# Add the gradients from the all-real and all-fake batches
errD = -(errD_real - errD_fake)
# Update D
optimizerD.step()
#Clipping weights
for p in D.parameters():
p.data.clamp_(-0.01, 0.01)
As you can see, I do the operation errD = -(errD_real - errD_fake), with errD_real and errD_fake being respectively the mean of the predictions of the critic on real and fake samples.
To my understanding RMSprop should optimize the weights of the critic the following way :
w <- w - alpha*gradient(w)
(alpha being the learning rate divided by the square root of the weighted moving average of the squared gradient)
Since the optimization problem requires to "go" in the same direction as the gradient it should be required to multiply gradient(w) by -1 before optimizing the weights.
Do you think that my reasoning is right ?
The program runs but my results are quiet poor.
I follow the same logic for the generator's weights but this time in order to go in the opposite direction of the gradient:
############################
# (2) Update G network: minimize -D(G(x))
###########################
G.zero_grad()
noise = torch.randn(b_size, 100, device=device)
fake = G(noise)
#label.fill_(fake_label) # fake labels are real for generator cost
# Since we just updated D, perform another forward pass of all-fake batch through D
output = D(fake).view(-1)
# Calculate G's loss based on this output
#errG = criterion(output, label) #DCGAN
errG = -torch.mean(output)
# Calculate gradients for G
errG.backward()
D_G_z2 = output.mean().item()
# Update G
optimizerG.step()
Sorry for the long question, I tried to explain my doubt as clear as possible. Thank you everyone.
I noticed some errors in the implementation of your discriminator training protocol. You call your backward functions twice with both the real and fake values loss being backpropagated at different time steps.
Technically an implementation using this scheme is possible but highly unreadable. There was a mistake with your errD_real in which your output is going to be positive instead of negative as an optimal D(G(z))>0 and so you penalize it for being correct. Overall your model converges simply by predicting D(x)<0 for all inputs.
To fix this do not call your errD_readl.backward() or your errD_fake.backward(). Simply using an errD.backward() after you define errD would work perfectly fine. Otherwise, your generator seems to be correct.
I am working with Keras 2.0.0 and I'd like to train a deep model with a huge amount of parameters on a GPU.
As my data are big, I have to use the ImageDataGenerator. To be honest, I want to abuse the ImageDataGenerator in that sense, that I don't want to perform any augmentations. I just want to put my training images into batches (and rescale them), so I can feed them to model.fit_generator.
I adapted the code from here and did some small changes according to my data (i.e. changing binary classification to categorical. But this doesn't matter for this problem which should be discussed here).
I have 15000 train images and the only 'augmentation' I want to perform, is rescaling to scope [0,1] by train_datagen = ImageDataGenerator(rescale=1./255).
After creating my 'train_generator' :
train_generator = train_datagen.flow_from_directory(
train_data_dir,
target_size=(img_width, img_height),
batch_size=batch_size,
class_mode='categorical',
shuffle = True,
seed = 1337,
save_to_dir = save_data_dir)
I fit the model by using model.fit_generator().
I set amount of epochs to: epochs = 1
And batch_size to: batch_size = 60
What I expect to see in the directory where my augmented (i.e. resized) images are stored: 15.000 rescaled images per epoch, i.e. with only one epoch: 15.000 rescaled images. But, mysteriously, there are 15.250 images.
Is there a reason for this amount of images?
Do I have the power to control the amount of augmented images?
Similar problems:
Model fit_generator not pulling data samples as expected (respectively at stackoverflow: Keras - How are batches and epochs used in fit_generator()?)
A concrete example for using data generator for large datasets such as ImageNet
I appreciate your help.