Correct way of freezing layers - deep-learning

I have a model M and I am cloning it with M.clone().
Now, I want to freeze certain layers of M.clone(). When I set requires_grad=False, it results in this error:
RuntimeError: you can only change requires_grad flags of leaf variables. If you want to use a computed variable in a subgraph that doesn't require differentiation use var_no_grad = var.detach().
How do I freeze the layers of M.clone() in that case? I want to ensure that when I backpropagate the loss computed on a batch with M.clone(), the gradients of M are computed.
A small script:
model = ResNet()
optimizer = Adam(model.parameters())
cloned_model = model.clone()
for p in cloned_model.features.parameters():
    p.requires_grad = False
error = loss(cloned_model(data), labels)
error.backward()
optimizer.step()
P.S. I am not sure if I can use .detach() as I don't want to break the graph. Do correct me if I am wrong.
Thank you!

You can use the in-place requires_grad_ function either on an nn.Module or on a torch.Tensor directly. Here you could do:
cloned_model = copy.deepcopy(model)
cloned_model.requires_grad_(False)
Here, deepcopy comes from the copy module.
You should copy your optimizer as well; otherwise the optimizer will keep updating model, not cloned_model, resulting in no changes at all since you are not backpropagating through model.
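For completeness, here is a minimal runnable sketch of that idea; the tiny model, data, and loss function below are stand-ins for the ResNet, batch, and loss in the question:

import copy
import torch
from torch import nn, optim

class TinyNet(nn.Module):  # stand-in for ResNet() with a .features submodule
    def __init__(self):
        super().__init__()
        self.features = nn.Linear(10, 10)
        self.classifier = nn.Linear(10, 2)

    def forward(self, x):
        return self.classifier(self.features(x))

model = TinyNet()
cloned_model = copy.deepcopy(model)

# Freeze only the feature extractor of the clone, in place.
cloned_model.features.requires_grad_(False)

# The optimizer must hold the clone's parameters, otherwise it would keep updating model.
optimizer = optim.Adam([p for p in cloned_model.parameters() if p.requires_grad])

data, labels = torch.randn(4, 10), torch.randint(0, 2, (4,))
error = nn.functional.cross_entropy(cloned_model(data), labels)
error.backward()
optimizer.step()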

Related

Having issue with max_norm parameter of torch.nn.Embedding

I use torch.nn.Embedding to embed my model’s categorical input features; however, I face problems when I set the max_norm parameter to something other than None.
There is a note on the pytorch docs page that explains how to use the max_norm parameter through the following example:
n, d, m = 3, 5, 7
embedding = nn.Embedding(n, d, max_norm=True)
W = torch.randn((m, d), requires_grad=True)
idx = torch.tensor([1, 2])
a = embedding.weight.clone() @ W.t()  # weight must be cloned for this to be differentiable
b = embedding(idx) @ W.t()  # modifies weight in-place
out = (a.unsqueeze(0) + b.unsqueeze(1))
loss = out.sigmoid().prod()
loss.backward()
I can’t easily understand this example from the docs. What is the purpose of having both ‘a’ and ‘b’, and why is ‘out’ defined as out = (a.unsqueeze(0) + b.unsqueeze(1))?
Do we need to first clone the entire embedding tensor as in ‘a’, and then find the embeddings for our desired indices as in ‘b’? And how do ‘a’ and ‘b’ need to be added?
In my code, I don’t have W explicitly, I am assuming that W is representative of the weights applied by the torch.nn.Linear layers. So, I just need to prepare the input (which includes the embeddings for categorical features) that goes into my network.
I greatly appreciate any instructions on this, as understanding this example would help me adapt my code accordingly.
Because W in the line computing a requires gradients, embedding.weight must be saved to compute those gradients in the backward pass. However, in the line computing b, the call embedding(idx) renormalizes the looked-up rows of embedding.weight to max_norm, in place. So without the clone in line a, embedding.weight would already be modified when line b executes, changing what was saved for the backward pass to update W. Hence the requirement to clone embedding.weight: to save it before it gets rescaled in line b.
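A small sketch (not from the docs; sizes are arbitrary) that makes the in-place renormalization visible:

import torch
import torch.nn as nn

emb = nn.Embedding(10, 4, max_norm=1.0)
with torch.no_grad():
    emb.weight.mul_(5.0)        # push all rows well above max_norm

print(emb.weight[2].norm())     # > 1.0 before the lookup
_ = emb(torch.tensor([2]))      # the lookup renormalizes row 2 in place
print(emb.weight[2].norm())     # ~1.0 after the lookup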
If you don't use embedding.weight outside of the normal forward pass, you don't need to worry about all this.
If you get an error, post it (and your code).

Pytorch : different behaviours in GAN training with different, but conceptually equivalent, code

I'm trying to implement a simple GAN in Pytorch. The following training code works:
for epoch in range(max_epochs):  # loop over the dataset multiple times
    print(f'epoch: {epoch}')
    running_loss = 0.0

    for batch_idx, (data, _) in enumerate(data_gen_fn):
        # data preparation
        real_data = data
        input_shape = real_data.shape
        inputs_generator = torch.randn(*input_shape).detach()

        # generator forward
        fake_data = generator(inputs_generator).detach()

        # discriminator forward
        optimizer_generator.zero_grad()
        optimizer_discriminator.zero_grad()

        #################### ALERT CODE #######################
        predictions_on_real = discriminator(real_data)
        predictions_on_fake = discriminator(fake_data)
        predictions = torch.cat((predictions_on_real,
                                 predictions_on_fake), dim=0)
        #########################################################

        # loss discriminator
        labels_real_fake = torch.tensor([1]*batch_size + [0]*batch_size)
        loss_discriminator_batch = criterion_discriminator(predictions,
                                                           labels_real_fake)

        # update discriminator
        loss_discriminator_batch.backward()
        optimizer_discriminator.step()

        # generator
        # zero the parameter gradients
        optimizer_discriminator.zero_grad()
        optimizer_generator.zero_grad()

        fake_data = generator(inputs_generator)  # make again fake data but without detaching
        predictions_on_fake = discriminator(fake_data)  # D(G(encoding))

        # loss generator
        labels_fake = torch.tensor([1]*batch_size)
        loss_generator_batch = criterion_generator(predictions_on_fake,
                                                   labels_fake)
        loss_generator_batch.backward()  # dL(D(G(encoding)))/dW_{G,D}
        optimizer_generator.step()
If I plot the generated images for each iteration, I see that the generated images look like the real ones, so the training procedure seems to work well.
However, if I try to change the code in the ALERT CODE part, i.e., instead of:
#################### ALERT CODE #######################
predictions_on_real = discriminator(real_data)
predictions_on_fake = discriminator(fake_data)
predictions = torch.cat((predictions_on_real,
predictions_on_fake), dim=0)
#########################################################
I use the following:
#################### ALERT CODE #######################
predictions = discriminator(torch.cat( (real_data, fake_data), dim=0))
#######################################################
That is conceptually the same: in a nutshell, instead of doing two separate forward passes on the discriminator (the former on the real data, the latter on the fake data) and then concatenating the results, the new code first concatenates the real and fake data and then makes just one forward pass on the concatenated batch.
However, this version does not work: the generated images always seem to be random noise.
Any explanation for this behavior?
Why do we get different results?
Supplying inputs in either the same batch or separate batches can make a difference if the model includes dependencies between different elements of the batch. By far the most common source of this in current deep learning models is batch normalization. As you mentioned, the discriminator does include batchnorm, so this is likely the reason for the different behaviors. Here is an example, using single numbers and a batch size of 4:
import numpy as np

features = [1., 2., 5., 6.]
print("mean {}, std {}".format(np.mean(features), np.std(features)))
print("normalized features", (features - np.mean(features)) / np.std(features))
>>>mean 3.5, std 2.0615528128088303
>>>normalized features [-1.21267813 -0.72760688 0.72760688 1.21267813]
Now we split the batch into two parts. First part:
features = [1., 2.]
print("mean {}, std {}".format(np.mean(features), np.std(features)))
print("normalized features", (features - np.mean(features)) / np.std(features))
>>>mean 1.5, std 0.5
>>>normalized features [-1. 1.]
Second part:
features = [5., 6.]
print("mean {}, std {}".format(np.mean(features), np.std(features)))
print("normalized features", (features - np.mean(features)) / np.std(features))
>>>mean 5.5, std 0.5
>>>normalized features [-1. 1.]
As we can see, in the split-batch version the two halves are normalized to exactly the same numbers, even though the inputs are very different. In the joint-batch version, on the other hand, the larger numbers are still larger than the smaller ones, because they are normalized with the same statistics.
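The same effect can be reproduced with an actual BatchNorm layer; here is a small sketch using torch.nn.BatchNorm1d in training mode (so batch statistics are used, as during GAN training):

import torch
import torch.nn as nn

bn = nn.BatchNorm1d(1, affine=False)
bn.train()  # normalize with batch statistics

real = torch.tensor([[1.], [2.]])
fake = torch.tensor([[5.], [6.]])

joint = bn(torch.cat((real, fake), dim=0))           # one forward pass, shared statistics
separate = torch.cat((bn(real), bn(fake)), dim=0)    # two forward passes, per-half statistics

print(joint.flatten())     # ~[-1.21, -0.73, 0.73, 1.21]: real and fake stay clearly separated
print(separate.flatten())  # ~[-1, 1, -1, 1]: each half collapses to the same values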
Why does this matter?
With deep learning, it's always hard to say, and especially with GANs and their complex training dynamics. A possible explanation is that, as we can see in the example above, the separate batches result in more similar features after normalization even if the original inputs are quite different. This may help early in training, as the generator tends to output "garbage" which has very different statistics from real data.
With a joint batch, these differing statistics make it easy for the discriminator to tell the real and generated data apart, and we end up in a situation where the discriminator "overpowers" the generator.
By using separate batches, however, the different normalizations make the generated and real data look more similar, which makes the task less trivial for the discriminator and allows the generator to learn.

Understanding when to call zero_grad() in pytorch, when training with multiple losses

I am going through an open-source implementation of a domain-adversarial model (GAN-like). The implementation uses pytorch and I am not sure they use zero_grad() correctly. They call zero_grad() for the encoder optimizer (aka the generator) before updating the discriminator loss. However zero_grad() is hardly documented, and I couldn't find information about it.
Here is pseudo-code comparing a standard GAN training loop (option 1) with their implementation (option 2). I think the second option is wrong, because it may accumulate the D_loss gradients with E_opt. Can someone tell me if these two pieces of code are equivalent?
Option 1 (a standard GAN implementation):
X, y = get_D_batch()
D_opt.zero_grad()
pred = model(X)
D_loss = loss(pred, y)
D_opt.step()
X, y = get_E_batch()
E_opt.zero_grad()
pred = model(X)
E_loss = loss(pred, y)
E_opt.step()
Option 2 (calling zero_grad() for both optimizers at the beginning):
E_opt.zero_grad()
D_opt.zero_grad()
X, y = get_D_batch()
pred = model(X)
D_loss = loss(pred, y)
D_opt.step()
X, y = get_E_batch()
pred = model(X)
E_loss = loss(pred, y)
E_opt.step()
It depends on the params argument of the torch.optim.Optimizer subclass (e.g. torch.optim.SGD) and on the exact structure of the model.
Assuming E_opt and D_opt have disjoint sets of parameters (model.encoder and model.decoder do not share weights), something like this:
E_opt = torch.optim.Adam(model.encoder.parameters())
D_opt = torch.optim.Adam(model.decoder.parameters())
both options MIGHT indeed be equivalent (see the commentary in your source code below; additionally, I have added backward(), which is really important here, and changed model to discriminator and generator where appropriate, as I assume that's the case):
# Starting with zero gradient
E_opt.zero_grad()
D_opt.zero_grad()
# See comment below for possible cases
X, y = get_D_batch()
pred = discriminator(X)
D_loss = loss(pred, y)
# This will accumulate gradients in discriminator only
# OR in discriminator and generator, depends on other parts of code
# See below for commentary
D_loss.backward()
# Correct weights of discriminator
D_opt.step()
# This only relies on random noise input so discriminator
# Is not part of this equation
X, y = get_E_batch()
pred = generator(X)
E_loss = loss(pred, y)
E_loss.backward()
# So only parameters of generator are updated always
E_opt.step()
Now it all comes down to how get_D_batch feeds data to the discriminator.
Case 1 - real samples
This is not a problem as it does not involve generator, you pass real samples and only discriminator takes part in this operation.
Case 2 - generated samples
Naive case
Here gradient accumulation may indeed occur. It would occur if get_D_batch simply called X = generator(noise) and passed this data to the discriminator.
In that case both the discriminator and the generator have their gradients accumulated during backward(), as both are used.
Correct case
We should take the generator out of the equation. In the PyTorch DCGAN example there is a little line like this:
# Generate fake image batch with G
fake = generator(noise)
label.fill_(fake_label)
# DETACH HERE
output = discriminator(fake.detach()).view(-1)
What detach does is "stop" the gradient by detaching the tensor from the computational graph, so gradients will not be backpropagated along this variable. This leaves the generator's gradients untouched, so no accumulation happens there.
Another way (IMO better) would be to use a with torch.no_grad(): block, like this:
# Generate fake image batch with G
with torch.no_grad():
fake = generator(noise)
label.fill_(fake_label)
# NO DETACH NEEDED
output = discriminator(fake).view(-1)
This way the generator's operations will not build part of the graph, so we get better performance (in the first case the graph would be built and then detached afterwards).
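As a quick sanity check, here is a small sketch (toy modules with arbitrary shapes, not the real networks) showing that with the no_grad (or detach) approach the generator accumulates no gradients from the discriminator's backward pass:

import torch
import torch.nn as nn

generator = nn.Linear(8, 8)        # toy stand-ins for the real networks
discriminator = nn.Linear(8, 1)

noise = torch.randn(4, 8)
with torch.no_grad():
    fake = generator(noise)        # generator is excluded from the graph

discriminator(fake).sum().backward()

print(all(p.grad is None for p in generator.parameters()))          # True: no accumulation
print(all(p.grad is not None for p in discriminator.parameters()))  # True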
Finally
Yeah, all in all the first option is better for standard GANs, as one doesn't have to think about such stuff (the people implementing it should, but readers should not). There are also other approaches, though, such as a single optimizer for both generator and discriminator (in which case you cannot zero_grad() only a subset of the parameters, e.g. the encoder), weight sharing, and others, which further clutter the picture.
with torch.no_grad() should alleviate the problem in all or most cases, as far as I can tell and imagine at the moment.

How to call a function batchwise in Pytorch

I have a function defined and I just want to know if it is possible to perform it batchwise. For instance,
def function():
    # some processing here
    return x

def forward():
    encode = self._encoding(embedded_premises, premises_lengths)
Now, since encode will be a 3D tensor of shape (batch_size, seq_length, hidden_size), I want to apply function() batchwise and have it return x batchwise as well.
Is there any other way than looping over all batches?
If you're working with pytorch functions inside your function (which you most likely are, if you want your method to work with autograd), it can work batchwise. Most pytorch operations respect, or can be made to respect, the first dimension as the batch dimension (for instance convolutions, linear layers, etc.). Sometimes it's more complex to express your operation so that it is both correct and fast, but in general pytorch is built with the assumption that operations will be used on batched data, and this is made as simple as reasonably possible. If you have a more specific example of your function, please post it.
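As an illustration (a sketch, since the actual function isn't shown), a function written purely with tensor operations on the last dimension works on the whole (batch_size, seq_length, hidden_size) tensor at once, with no loop over batches:

import torch

def function(encode):
    # example per-position processing: normalize along the hidden dimension
    mean = encode.mean(dim=-1, keepdim=True)
    std = encode.std(dim=-1, keepdim=True)
    return (encode - mean) / (std + 1e-8)

encode = torch.randn(32, 20, 128)   # (batch_size, seq_length, hidden_size)
x = function(encode)                # same shape, computed batchwise and autograd-friendly
print(x.shape)                      # torch.Size([32, 20, 128])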

Testing from an LMDB file in Caffe

I am wondering how to go about setting up ONLY a test phase in Caffe for an LMDB file. I have already trained my model, everything seems good, my loss has decreased, and the output I get on images loaded one by one also looks good.
Now I would like to see how my model performs on a separate LMDB test set, but I seem unable to do so successfully. It would not be ideal for me to loop over images one at a time, since my loss function is already defined in Caffe and this would require me to redefine it.
This is what I have so far, but the results don't make sense: when I compare the loss from the training set to the loss I get here, they don't match (they are orders of magnitude apart). Does anyone have any idea what my problem could be?
import numpy as np
import caffe

caffe.set_device(0)
caffe.set_mode_gpu()
net = caffe.Net('/home/jeremy/Desktop/caffestuff/JP_Kitti/all_proto/mirror_shuffle/deploy_JP.prototxt',
                '/home/jeremy/Desktop/caffestuff/JP_Kitti/all_proto/mirror_shuffle/snapshot_iter_10000.caffemodel',
                caffe.TEST)
solver = None  # ignore this workaround for lmdb data (can't instantiate two solvers on the same data)
solver = caffe.SGDSolver('/home/jeremy/Desktop/caffestuff/JP_Kitti/all_proto/mirror_shuffle/lenet_auto_solverJP_test.prototxt')
niter = 100
test_loss = np.zeros(niter)
count = 0
for it in range(niter):
    solver.test_nets[0].forward()  # SGD by Caffe
    # store the test loss
    test_loss[count] = solver.test_nets[0].blobs['loss'].data
    print(solver.test_nets[0].blobs['loss'].data)
    count = count + 1
See my answer here. Do not forget to subtract the mean, otherwise you'll get low accuracy. The link to the code, posted above, takes care of that.
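The linked code isn't reproduced here, but for feeding images one by one the usual pattern (a sketch; the paths and the mean file below are hypothetical) uses caffe.io.Transformer to subtract the dataset mean before the forward pass:

import numpy as np
import caffe

net = caffe.Net('deploy_JP.prototxt', 'snapshot_iter_10000.caffemodel', caffe.TEST)

mu = np.load('mean.npy').mean(1).mean(1)        # per-channel mean from a hypothetical mean file
transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
transformer.set_transpose('data', (2, 0, 1))    # HWC -> CHW
transformer.set_mean('data', mu)                # subtract the dataset mean
transformer.set_raw_scale('data', 255)          # caffe.io.load_image returns floats in [0, 1]

image = caffe.io.load_image('example.png')      # hypothetical test image
net.blobs['data'].data[...] = transformer.preprocess('data', image)
out = net.forward()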