Gradient accumulation for Siamese network - deep-learning

I am implementing gradient accumulation in PyTorch for the training of a Siamese network where one of the inputs is constant throughout the accumulated batches. Currently, my training loop looks roughly as follows:
for x_in_batch in x_in_batches:
    x_out_batch = siamese_network(x_in_batch)
    y_out = siamese_network(y_in)
    batch_loss = criterion(x_out_batch, y_out, *args)
    batch_loss /= num_accum_steps
    batch_loss.backward()
optimizer.step()
optimizer.zero_grad()
However, with this loop, I recompute y_out in every iteration. If possible, I would like to have a more efficient solution, like this:
y_out = siamese_network(y_in)
for x_in_batch in x_in_batches:
    x_out_batch = siamese_network(x_in_batch)
    batch_loss = criterion(x_out_batch, y_out, *args)
    batch_loss /= num_accum_steps
    batch_loss.backward()
optimizer.step()
optimizer.zero_grad()
Unfortunately, this does not work since the backward() call destroys the whole computation graph needed for the backpropagation. Using retain_graph to retain the whole computation graph is not an option due to lack of memory on the GPU; this would be against the idea of gradient accumulation.
Is there a possibility to somehow retain just the data that does not change between the iteration steps? (This should be at least the value of y_out, but I guess also part of the computation graph could be reused, couldn't it?)
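For illustration, the kind of pattern I am asking about would look roughly like this (an untested sketch; y_leaf is just a placeholder name): detach the constant embedding once, accumulate the gradient that flows into it during the loop, and then backpropagate that accumulated gradient through the y-branch a single time at the end.

y_out = siamese_network(y_in)                  # y-branch graph is built and kept only once
y_leaf = y_out.detach().requires_grad_(True)   # leaf stand-in used inside the loop

for x_in_batch in x_in_batches:
    x_out_batch = siamese_network(x_in_batch)
    batch_loss = criterion(x_out_batch, y_leaf, *args)
    batch_loss /= num_accum_steps
    batch_loss.backward()    # frees only this iteration's x-branch graph; the gradient
                             # w.r.t. the constant embedding accumulates in y_leaf.grad

y_out.backward(y_leaf.grad)  # one backward pass through the retained y-branch graph
optimizer.step()
optimizer.zero_grad()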

Related

Fully connected neural network with constant loss

I am working on a project to predict soccer player values from a set of inputs. The data consists of about 19,000 rows and 8 columns (7 input columns and 1 target column), all numerical values.
I am using a fully connected neural network for the prediction, but the problem is that the loss is not decreasing as it should.
The loss is very large (around 1e+13) and does not decrease; it just fluctuates.
This is the function I am using to run the model:
def gradient_descent(model, learning_rate, num_epochs, data_loader, criterion):
    losses = []
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)  # pass the learning rate to the optimizer
    for epoch in range(num_epochs):  # one epoch
        for inputs, outputs in data_loader:  # one iteration
            inputs, outputs = inputs.to(torch.float32), outputs.to(torch.float32)
            logits = model(inputs)
            loss = criterion(torch.squeeze(logits), outputs)  # forward pass
            optimizer.zero_grad()  # zero out the gradients
            loss.backward()  # compute the gradients (backward pass)
            optimizer.step()  # take one step
            losses.append(loss.item())
        epoch_loss = sum(losses[-len(data_loader):]) / len(data_loader)  # mean loss over this epoch
        print(f'Epoch #{epoch}: Loss={epoch_loss:.3e}')
    return losses
The model is a fully connected neural network with 4 hidden layers, each with 7 neurons. The input layer has 7 neurons and the output layer has 1. I am using MSE as the loss function. I tried changing the learning rate, but it is still bad.
What could be the reason behind this?
Thank you!
It is difficult to diagnose your problem from the information you provided, but I'll try to point you in some useful directions.
Data Normalization:
The way we initialize the weights in a deep NN has a significant effect on the training process. See, e.g.:
He, K., Zhang, X., Ren, S. and Sun, J., Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification (ICCV 2015).
Most initialization methods assume the inputs have zero mean and unit variance (or similar statistics). If your inputs violate these assumptions, you will find it difficult to train. See, e.g., this post.
Normalize the Targets:
You are trying to solve a regression problem (MSE loss); it might be the case that your targets are poorly scaled, causing very large loss values. Try to normalize the targets to span a more compact range.
Learning Rate:
Try and adjust your learning rate: both increasing it and decreasing it by orders of magnitude.
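As a minimal sketch of the two normalization suggestions above (with placeholder tensors standing in for your data, not your actual pipeline): standardize the inputs and the targets using statistics computed on the training split, and undo the target scaling at prediction time.

import torch

def standardize(t, eps=1e-8):
    # statistics should be computed on the training split only, then reused
    mean, std = t.mean(dim=0), t.std(dim=0)
    return (t - mean) / (std + eps), mean, std

# placeholder tensors with roughly your shapes (19,000 rows, 7 inputs, 1 target)
x_train, y_train = torch.randn(19000, 7), torch.rand(19000) * 1e7

x_train_norm, x_mean, x_std = standardize(x_train)
y_train_norm, y_mean, y_std = standardize(y_train)

# train on (x_train_norm, y_train_norm) with MSE; at inference time, map the
# prediction back to the original scale:
# y_pred = model((x_new - x_mean) / (x_std + 1e-8)) * y_std + y_mean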

Which parameters of Mask-RCNN control mask recall?

I'm interested in fine-tuning a Mask-RCNN model that I'm using for instance segmentation. Currently I have trained the model for 6 epochs and the various Mask-RCNN losses are as follows:
The reason I'm stopping is that the COCO evaluation metrics seem to have dipped in the last epoch:
I know this is a far reaching question, but I'm looking to gain some intuition of how to understand which parameters are going to be the most impactful in improving the evaluation metrics. I understand there are three places to consider:
Should I be looking at batch size, learning rate, and momentum? This uses an SGD optimizer with a learning rate of 1e-4 and a batch size of 2.
Should I be looking at using more training data or at adding augmentation (I don't currently use any)? My dataset is currently pretty large, about 40K images.
Should I be looking at the specific Mask-RCNN parameters?
I think I'll likely be asked to be more specific about what I want to improve, so let me say that I would like to improve the recall of the individual masks. The model is performing well but doesn't quite capture the full extent of what I would like it to. I'm also leaving out details of the specific learning problem, as I'd like to gain intuition about how to approach this in general.
A couple of notes:
6 epochs are too few for the network to converge, even if you use a pre-trained network, especially one as big as resnet50. I think you need at least 50 epochs. On a pre-trained resnet18 I started to get good results after 30 epochs, resnet34 needed another 10-20 epochs, and your resnet50 with a 40k-image training set definitely needs more than 6 epochs;
definitely use a pre-trained network;
in my experience, I failed to get the results I wanted with SGD. I started using AdamW + a ReduceLROnPlateau scheduler. The network converges quite fast, reaching something like 50-60% AP at epoch 7 or 8, but it only climbs to 80-85% after 50-60 epochs through very small improvements from epoch to epoch, and only if the LR is small enough. You must be familiar with the notion of gradient descent; I think of it as follows: the more augmentation you have, the more your "hill" is covered with "boulders" that you have to be able to bypass, and this is only possible if you control the LR. Additionally, AdamW helps with overfitting.
This is how I do it. For networks with higher input resolution (your input images are scaled on input by the net itself), I use higher LR.
init_lr = 0.00005
weight_decay = init_lr * 100
optimizer = torch.optim.AdamW(params, lr=init_lr, weight_decay=weight_decay)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, verbose=True, patience=3, factor=0.75)

for epoch in range(epochs):
    # train for one epoch, printing every 10 iterations
    metric_logger = train_one_epoch(model, optimizer, train_loader, scaler, device,
                                    epoch, print_freq=10)
    scheduler.step(metric_logger.loss.global_avg)
    optimizer.param_groups[0]["weight_decay"] = optimizer.param_groups[0]["lr"] * 100
    # scheduler.step()

    # evaluate on the test dataset
    evaluate(model, test_loader, device=device)
    print("[INFO] serializing model to '{}' ...".format(args["model"]))
    save_and_print_size_of_model(model, args["model"], script=False)
Find an LR and weight decay such that the training exhausts the LR down to a very small value, like 1/10 of your initial LR, by the end of the training. If you hit a plateau too often, the scheduler quickly brings the LR down to very small values and the network learns nothing for the rest of the epochs.
Your plots indicate that your LR is too high at some point in the training: the network stops training and then the AP goes down. You need constant improvements, even small ones. The longer the network trains, the more subtle details it learns about your domain, and the smaller the learning rate should be. Imho, a constant LR will not allow that.
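A quick sanity check of that "exhaust the LR to about 1/10" target with the scheduler settings above (my own arithmetic, not taken from a training run): with factor=0.75, each plateau event multiplies the LR by 0.75, so it takes roughly nine reductions to get there.

init_lr = 0.00005
lr, reductions = init_lr, 0
while lr > init_lr / 10:   # stop once the LR has dropped below 1/10 of the initial value
    lr *= 0.75             # one ReduceLROnPlateau(factor=0.75) event
    reductions += 1
print(reductions, lr)      # 9 reductions -> ~3.8e-06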
anchor generator settings. Here is how I initialize the network.
def get_maskrcnn_resnet_model(name, num_classes, pretrained, res='normal'):
    print('Using maskrcnn with {} backbone...'.format(name))
    backbone = resnet_fpn_backbone(name, pretrained=pretrained, trainable_layers=5)
    sizes = ((4,), (8,), (16,), (32,), (64,))
    aspect_ratios = ((0.25, 0.5, 1.0, 2.0, 4.0),) * len(sizes)
    anchor_generator = AnchorGenerator(
        sizes=sizes, aspect_ratios=aspect_ratios
    )
    roi_pooler = torchvision.ops.MultiScaleRoIAlign(featmap_names=['0', '1', '2', '3'],
                                                    output_size=7, sampling_ratio=2)
    default_min_size = 800
    default_max_size = 1333
    if res == 'low':
        min_size = int(default_min_size / 1.25)
        max_size = int(default_max_size / 1.25)
    elif res == 'normal':
        min_size = default_min_size
        max_size = default_max_size
    elif res == 'high':
        min_size = int(default_min_size * 1.25)
        max_size = int(default_max_size * 1.25)
    else:
        raise ValueError('Invalid res={} param'.format(res))
    model = MaskRCNN(backbone, min_size=min_size, max_size=max_size, num_classes=num_classes,
                     rpn_anchor_generator=anchor_generator, box_roi_pool=roi_pooler)
    model.roi_heads.detections_per_img = 512
    return model
I need to find small objects here, which is why I use such anchor params.
class imbalance issue. If you have only your object and the background, there is no problem. If you have more classes, then make sure that your training split (say 80% for train and 20% for test) is applied more or less evenly to all the classes used in your particular training.
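For that last point, a tiny sketch of the kind of check I mean (the annotation lists and the "label" field are hypothetical names, adapt them to your dataset format): count the instances per class in each split and make sure the ratios are roughly the same.

from collections import Counter

def class_counts(annotations):
    # one annotation record per instance; "label" is a placeholder field name
    return Counter(ann["label"] for ann in annotations)

print("train:", class_counts(train_annotations))
print("test: ", class_counts(test_annotations))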
Good luck!

How to manually compute the number of FLOPs in the backward pass for a CNN like ResNet?

I've been trying to figure out how to compute the number of FLOPs in the backward pass of ResNet. For the forward pass, it seems straightforward: apply the conv filters to the input for each layer. But how does one do the FLOP counts for the gradient computation and the update of all weights during the backward pass?
Specifically,
how do I compute the FLOPs of the gradient computation for each layer?
which gradients need to be computed, so that the FLOPs for each of them can be counted?
how many FLOPs are in the gradient computation for Pool, BatchNorm, and ReLU layers?
I understand the chain rule for gradient computation, but I'm having a hard time formulating how it applies to the weight filters in the conv layers of ResNet and how many FLOPs each of those would take. It would be very useful to get any comments about a method to compute the total FLOPs for the backward pass. Thanks
You can definitely count the number of multiplications and additions for the backward pass manually, but I guess that's an exhausting process for complex models.
Usually, models are benchmarked with the FLOPs of a forward pass rather than a backward FLOP count, for CNNs and other models. I guess the reason is that inference matters more when comparing different CNN variants and other deep learning models in applications.
The backward pass only matters during training, and for most simple models the backward and forward FLOPs should be close, up to some constant factor.
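To make that constant factor concrete for a single conv layer, here is a back-of-the-envelope count (my own rough estimate, counting a multiply-add as 2 FLOPs; the layer shape is just an example): the backward pass needs the gradient w.r.t. the input and the gradient w.r.t. the weights, and each of those costs about as much as the forward convolution, so roughly 2x the forward FLOPs.

def conv2d_flops(c_in, c_out, k_h, k_w, h_out, w_out):
    # forward: one multiply-add per kernel element per output element
    forward = 2 * c_in * c_out * k_h * k_w * h_out * w_out
    # backward w.r.t. the input: convolve the output gradient with the flipped
    # weights -- roughly the same cost as the forward pass (stride 1, same padding)
    grad_input = forward
    # backward w.r.t. the weights: correlate the layer input with the output
    # gradient -- again roughly the same cost as the forward pass
    grad_weights = forward
    return forward, grad_input + grad_weights

fwd, bwd = conv2d_flops(c_in=64, c_out=64, k_h=3, k_w=3, h_out=56, w_out=56)
print(fwd, bwd)  # backward is roughly 2x forward for this layer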
In practice, I tried a hacky approach: add the gradient computation for the whole ResNet model to the graph, profile the FLOP count for the forward pass plus the gradient calculation, and then subtract the forward FLOPs. It's not an exact measurement and may miss many operations for a complex graph/model.
But this may give a flop estimate for most models.
[Following code snippet works with tensorflow 2.0]
import tensorflow as tf

def get_flops():
    for_flop = 0
    total_flop = 0
    session = tf.compat.v1.Session()
    graph = tf.compat.v1.get_default_graph()

    # forward
    with graph.as_default():
        with session.as_default():
            model = tf.keras.applications.ResNet50()  # change your model here
            run_meta = tf.compat.v1.RunMetadata()
            opts = tf.compat.v1.profiler.ProfileOptionBuilder.float_operation()
            # We use the Keras session graph in the call to the profiler.
            flops = tf.compat.v1.profiler.profile(graph=graph,
                                                  run_meta=run_meta, cmd='op', options=opts)
            for_flop = flops.total_float_ops
            # print(for_flop)

    # forward + backward
    with graph.as_default():
        with session.as_default():
            model = tf.keras.applications.ResNet50()  # change your model here
            outputTensor = model.output
            listOfVariableTensors = model.trainable_weights
            gradients = tf.gradients(outputTensor, listOfVariableTensors)
            run_meta = tf.compat.v1.RunMetadata()
            opts = tf.compat.v1.profiler.ProfileOptionBuilder.float_operation()
            # We use the Keras session graph in the call to the profiler.
            flops = tf.compat.v1.profiler.profile(graph=graph,
                                                  run_meta=run_meta, cmd='op', options=opts)
            total_flop = flops.total_float_ops
            # print(total_flop)

    return for_flop, total_flop
for_flops, total_flops = get_flops()
print(f'forward: {for_flops}')
print(f'backward: {total_flops - for_flops}')
Out:
51112224
102224449
forward: 51112224
backward: 51112225

Wasserstein GAN implementation in pytorch. How to implement the loss?

I'm currently working on a project in pytorch on Wasserstein GAN (https://arxiv.org/pdf/1701.07875.pdf).
In Wasserstein GAN a new objective function is defined using the Wasserstein distance: the critic f_w is trained to maximize E_{x~P_r}[f_w(x)] − E_{z~p(z)}[f_w(g_θ(z))] while keeping f_w (approximately) Lipschitz via weight clipping.
This leads to the training algorithm given in the paper (Algorithm 1), in which lines 5 and 6 compute the critic's gradient and apply the RMSProp update to its weights.
My question is:
When implementing lines 5 and 6 of the algorithm in PyTorch, should I be multiplying my loss by -1? As in my code (I use RMSprop as the optimizer for both the generator and the critic):
############################
# (1) Update D network: maximize (D(x)) + (D(G(x)))
###########################
for n in range(n_critic):
    D.zero_grad()
    real_cpu = data[0].to(device)
    b_size = real_cpu.size(0)
    output = D(real_cpu)
    #errD_real = -criterion(output, label) #DCGAN
    errD_real = torch.mean(output)
    # Calculate gradients for D in backward pass
    errD_real.backward()
    D_x = output.mean().item()

    ## Train with all-fake batch
    # Generate batch of latent vectors
    noise = torch.randn(b_size, 100, device=device) #Careful here we changed shape of input (original : torch.randn(4, 100, 1, 1, device=device))
    # Generate fake image batch with G
    fake = G(noise)
    # Classify all fake batch with D
    output = D(fake.detach())
    # Calculate D's loss on the all-fake batch
    errD_fake = torch.mean(output)
    # Calculate the gradients for this batch
    errD_fake.backward()
    D_G_z1 = output.mean().item()
    # Add the gradients from the all-real and all-fake batches
    errD = -(errD_real - errD_fake)
    # Update D
    optimizerD.step()
    # Clipping weights
    for p in D.parameters():
        p.data.clamp_(-0.01, 0.01)
As you can see, I do the operation errD = -(errD_real - errD_fake), with errD_real and errD_fake being respectively the mean of the predictions of the critic on real and fake samples.
To my understanding, RMSprop should optimize the weights of the critic in the following way:
w <- w - alpha*gradient(w)
(alpha being the learning rate divided by the square root of the weighted moving average of the squared gradient)
Since the optimization problem requires "going" in the same direction as the gradient, it should be required to multiply gradient(w) by -1 before optimizing the weights.
Do you think that my reasoning is right ?
The program runs but my results are quite poor.
I follow the same logic for the generator's weights but this time in order to go in the opposite direction of the gradient:
############################
# (2) Update G network: minimize -D(G(x))
###########################
G.zero_grad()
noise = torch.randn(b_size, 100, device=device)
fake = G(noise)
#label.fill_(fake_label) # fake labels are real for generator cost
# Since we just updated D, perform another forward pass of all-fake batch through D
output = D(fake).view(-1)
# Calculate G's loss based on this output
#errG = criterion(output, label) #DCGAN
errG = -torch.mean(output)
# Calculate gradients for G
errG.backward()
D_G_z2 = output.mean().item()
# Update G
optimizerG.step()
Sorry for the long question, I tried to explain my doubt as clearly as possible. Thank you everyone.
I noticed some errors in the implementation of your discriminator training protocol. You call your backward functions twice, with the losses on the real and the fake values being backpropagated at different time steps.
Technically an implementation using this scheme is possible, but it is highly unreadable. There was also a mistake with your errD_real, in which your output is going to be positive instead of negative, as an optimal D(G(z)) > 0, and so you penalize it for being correct. Overall, your model converges simply by predicting D(x) < 0 for all inputs.
To fix this, do not call errD_real.backward() or errD_fake.backward(). Simply calling errD.backward() after you define errD would work perfectly fine. Otherwise, your generator seems to be correct.
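A condensed sketch of that fix (reusing the names D, G, data, optimizerD, n_critic and the clipping constant from your code; not a full training loop): compute the combined critic loss first and call backward() on it once.

for n in range(n_critic):
    D.zero_grad()

    real = data[0].to(device)
    noise = torch.randn(real.size(0), 100, device=device)
    fake = G(noise)

    # critic loss: -(E[D(real)] - E[D(fake)]), which RMSprop then minimizes
    errD = -(torch.mean(D(real)) - torch.mean(D(fake.detach())))
    errD.backward()      # single backward call on the combined loss
    optimizerD.step()

    # weight clipping as in the WGAN paper
    for p in D.parameters():
        p.data.clamp_(-0.01, 0.01)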

pytorch - loss.backward() and optimizer.step() in eval mode with batch norm layers?

I have a ResNet-8 network I am using for a project of Domain Adaptation over images. Basically, I have trained the network on one dataset and now I want to evaluate it on another dataset, simulating a real-time environment where I try to predict one image at a time. But here comes the fun part:
The way I want to do the evaluation on the target dataset is, for each image, to do a forward pass in train mode so that the batch norm statistics get updated (inside torch.no_grad(), since I don't want to update the network parameters, only "adapt" the batch norm layers), and then do another forward pass in eval mode to get the actual prediction, so that the batch norm layers use a mean and variance based on the whole set of images seen so far (and not only those of the current batch, a single image in this case):
optimizer.zero_grad()
model.train()
with torch.no_grad():
    output_train = model(inputs)
model.eval()
output_eval = model(inputs)
loss = criterion(output_eval, targets)
The idea is that I do domain adaptation just by updating the batch norm layers to the new target distribution.
Then after doing this let's say I get an accuracy of 60%.
Now if I add these two other lines I am able to achieve something like 80% accuracy:
loss.backward()
optimizer.step()
Therefore my question is: what happens if I do backward() and step() while in eval mode? I know about the different behaviour of batch norm and dropout layers between train and eval mode, and I know about torch.no_grad(), how gradients are calculated and how parameters are then updated by the optimizer, but I wasn't able to find any information about my specific problem.
I think that since the model is set to eval mode, those two lines should be useless, but something clearly happens. Does this have something to do with the affine parameters of the batch norm layers?
UPDATE: OK, I misunderstood something: eval mode does not block parameters from being updated, it only changes the behaviour of some layers (batch norm and dropout) during the forward pass, am I right? Therefore with those two lines I am actually training the network, hence the better accuracy. Anyway, does this change anything if batch norm affine is set to True? Are those parameters considered "normal" parameters to be updated during optimizer.step(), or is it different?
eval mode does not block parameters from being updated, it only changes the behaviour of some layers (batch norm and dropout) during the forward pass, am I right?
True.
Therefore with those two lines I am actually training the network, hence the better accuracy. Anyway, does this change anything if batch norm affine is set to True? Are those parameters considered "normal" parameters to be updated during optimizer.step(), or is it different?
BN parameters are updated during optimizer step. Look:
if self.affine:
    self.weight = Parameter(torch.Tensor(num_features))
    self.bias = Parameter(torch.Tensor(num_features))
else:
    self.register_parameter('weight', None)
    self.register_parameter('bias', None)
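As a small self-contained check of this (my own snippet, not from the BatchNorm source above), the affine weight and bias show up in model.parameters() and are therefore updated by optimizer.step() even when the model is in eval mode:

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 4), nn.BatchNorm1d(4, affine=True))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

model.eval()                           # eval mode: running stats are used, not updated
before = model[1].weight.clone()

loss = model(torch.randn(8, 4)).sum()
loss.backward()                        # gradients are still computed in eval mode
optimizer.step()                       # ...and the BN affine parameters are still updated

print(torch.allclose(before, model[1].weight))  # False: the weight changed after the step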