I am trying to get the input gradients from a BERT model in pytorch. How can I do that?
Suppose, y' = BertModel(x). I am trying to find $d(loss(y,y'))/dx$
One of the problems with BERT models is that the input consists of token IDs rather than token embeddings, which makes getting gradients difficult: the lookup from token ID to token embedding is discrete and therefore not differentiable.
To solve this issue, you can work with token embeddings.
# get your batch data: token_ids, mask and labels
token_ids, mask, labels = batch
# get your token embeddings as a leaf tensor, detached from the embedding matrix
token_embeds = BertModel.bert.get_input_embeddings().weight[token_ids].detach().clone()
# track gradient of token embeddings
token_embeds.requires_grad_(True)
# get model output that contains loss value
outs = BertModel(inputs_embeds=token_embeds, attention_mask=mask, labels=labels)
loss = outs.loss
After getting the loss value, you can use torch.autograd.grad (as in the other answer) or the backward function:
loss.backward()
grad=token_embeds.grad
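As a self-contained illustration of the pattern above, here is a minimal sketch that swaps BERT for a toy model (an nn.Embedding followed by a linear classifier). The ToyClassifier class, its sizes and the way it consumes inputs_embeds are assumptions made for illustration, not the transformers API. The key step is the same: detach the looked-up embeddings into a leaf tensor so that .grad is populated after backward().

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class ToyClassifier(nn.Module):
    """Hypothetical stand-in for a BERT classifier that accepts inputs_embeds."""
    def __init__(self, vocab_size=20, embed_dim=8, num_labels=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.head = nn.Linear(embed_dim, num_labels)

    def forward(self, inputs_embeds):
        # Mean-pool over the sequence dimension, then classify.
        return self.head(inputs_embeds.mean(dim=1))

model = ToyClassifier()
token_ids = torch.tensor([[1, 5, 7, 2], [3, 3, 9, 0]])  # (batch, seq_len)
labels = torch.tensor([0, 2])

# Look up embeddings, then detach so the result is a leaf tensor we can track.
token_embeds = model.embed(token_ids).detach().clone()
token_embeds.requires_grad_(True)

loss = nn.functional.cross_entropy(model(token_embeds), labels)
loss.backward()

input_grads = token_embeds.grad        # d(loss)/d(embedding): (batch, seq_len, embed_dim)
saliency = input_grads.norm(dim=-1)    # optional: one gradient-norm score per token
```

Without the detach(), token_embeds would be a non-leaf tensor and its .grad would stay None unless retain_grad() were called on it.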
You can use torch.autograd.grad (documentation):
y_pred = BertModel(x)  # x must be a floating-point tensor with requires_grad=True
out = loss_func(y_label, y_pred)  # not necessarily a scalar!
grad = torch.autograd.grad(
    outputs=out,
    inputs=x,
    grad_outputs=torch.ones_like(out),  # or simply None if out is a scalar
    retain_graph=False,
    create_graph=False,
    only_inputs=True)[0]
You should set retain_graph and create_graph to True if you want to use grad for computing a loss and call backward on it (typically for computing a gradient penalty). Otherwise keep them False to save memory and time.
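To make the create_graph case concrete, here is a minimal sketch on a toy function instead of a BERT model. The tensor x, the function x ** 2 and the penalty form are all assumptions chosen so the numbers are easy to check by hand.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
out = x ** 2  # non-scalar output, so grad_outputs is required

grad = torch.autograd.grad(
    outputs=out,
    inputs=x,
    grad_outputs=torch.ones_like(out),
    create_graph=True,  # keep the graph so that grad itself is differentiable
)[0]                    # analytically equal to 2 * x

# Hypothetical penalty: push the gradient norm towards 1.
penalty = (grad.norm() - 1.0) ** 2
penalty.backward()      # second-order backward pass through grad
```

With create_graph=False the last line would fail, because grad would carry no graph to differentiate through.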
Simple and short question. I have a network (Unet) which performs image segmentation. I want the logits as the output to feed into the cross entropy loss (using pytorch). Currently my final layer looks like this:
class Logits(nn.Sequential):
    def __init__(self,
                 in_channels,
                 n_class
                 ):
        super(Logits, self).__init__()
        # fully connected layer outputting the prediction layers for each of my classes
        self.conv = self.add_module('conv_out',
                                    nn.Conv2d(in_channels,
                                              n_class,
                                              kernel_size = 1
                                              )
                                    )
        self.activ = self.add_module('sigmoid_out',
                                     nn.Sigmoid()
                                     )
Is it correct to use the sigmoid activation function here? Does this give me logits?
When people talk about "logits" they usually refer to the "raw" n_class-dimensional output vector. For multi-class classification (n_class > 2) you want to convert the n_class-dimensional vector of raw "logits" into a n_class-dim probability vector.
That is, you want prob = f(logits) with prob_i >= 0 for all n_class entries, and that sum(prob)=1.
The most straightforward way of doing that in a differentiable way is to use the Softmax function:
prob_i = softmax(logits)_i = exp(logits_i) / sum_j exp(logits_j)
It is easy to see that the output of softmax is indeed a n_class-dim probability vector (I leave it to you as a short exercise).
BTW, this is why the raw predictions are called "logits" because they are kind of "log" of the output predicted probabilities.
Now, it is customary not to explicitly compute the softmax on top of a classification network and defer its computation to the loss function, e.g. nn.CrossEntropyLoss that internally computes the softmax and requires the raw logits as inputs, rather than the normalized probabilities. This is done mainly for numerical stability.
Therefore, if you are training a multi-class classification network with nn.CrossEntropyLoss you do not need to worry at all about the final activation and simply output the raw logits from your final conv/linear layer.
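To illustrate both points (softmax yields a probability vector, and computing the loss straight from the logits is numerically safer), here is a small pure-Python sketch. The function names and the example logits are assumptions for illustration; nn.CrossEntropyLoss does the equivalent internally on tensors.

```python
import math

def softmax(logits):
    """Softmax with the usual max-subtraction trick for stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_from_logits(logits, target):
    """-log softmax(logits)[target], computed via log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

logits = [2.0, -1.0, 0.5]
probs = softmax(logits)
loss = cross_entropy_from_logits(logits, target=0)

# Very large logits would overflow a naive exp(), but the log-sum-exp form is fine.
big_loss = cross_entropy_from_logits([1000.0, 0.0, -1000.0], target=0)
```

Here probs is non-negative and sums to 1, and loss equals -log(probs[0]) up to rounding.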
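Most importantly, do not put an nn.Sigmoid() activation in front of nn.CrossEntropyLoss: the loss expects raw logits, and a sigmoid squashes them into (0, 1) and tends to have saturated gradients, which will mess up your training.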
As far as I understood, you are working on a multi-label classification task where a single input can have several labels, hence your usage of nn.Sigmoid (vs nn.Softmax for multi-class classification).
There is a loss function which combines nn.Sigmoid and nn.BCELoss: nn.BCEWithLogitsLoss. Its input is a vector of logits whose length is the number of classes. The target has the same shape: a multi-hot encoding, with 1s for the active classes.
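A minimal sketch of that setup, with made-up logits and targets (the sizes and values are assumptions), showing that the fused loss matches applying the sigmoid and nn.BCELoss separately:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_class = 4
logits = torch.randn(2, n_class)            # raw scores from the last layer, no sigmoid
target = torch.tensor([[1., 0., 1., 0.],    # multi-hot: classes 0 and 2 are active
                       [0., 0., 0., 1.]])   # only class 3 is active

loss_fused = nn.BCEWithLogitsLoss()(logits, target)
loss_two_step = nn.BCELoss()(torch.sigmoid(logits), target)
```

Both values agree up to floating-point error; the fused version is the numerically safer one, so the model's final layer can output logits directly.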
I'm currently working on a project in pytorch on Wasserstein GAN (https://arxiv.org/pdf/1701.07875.pdf).
In Wasserstein GAN, a new objective function is defined using the Wasserstein distance as:
Which leads to the following algorithms for training the GAN:
My question is :
When implementing lines 5 and 6 of the algorithm in PyTorch, should I be multiplying my loss by -1? As in my code (I use RMSprop as my optimizer for both the generator and critic):
############################
# (1) Update D network: maximize (D(x)) + (D(G(x)))
###########################
for n in range(n_critic):
    D.zero_grad()
    real_cpu = data[0].to(device)
    b_size = real_cpu.size(0)
    output = D(real_cpu)
    #errD_real = -criterion(output, label) #DCGAN
    errD_real = torch.mean(output)
    # Calculate gradients for D in backward pass
    errD_real.backward()
    D_x = output.mean().item()
    ## Train with all-fake batch
    # Generate batch of latent vectors
    noise = torch.randn(b_size, 100, device=device) #Careful here we changed shape of input (original : torch.randn(4, 100, 1, 1, device=device))
    # Generate fake image batch with G
    fake = G(noise)
    # Classify all fake batch with D
    output = D(fake.detach())
    # Calculate D's loss on the all-fake batch
    errD_fake = torch.mean(output)
    # Calculate the gradients for this batch
    errD_fake.backward()
    D_G_z1 = output.mean().item()
    # Add the gradients from the all-real and all-fake batches
    errD = -(errD_real - errD_fake)
    # Update D
    optimizerD.step()
    #Clipping weights
    for p in D.parameters():
        p.data.clamp_(-0.01, 0.01)
As you can see, I do the operation errD = -(errD_real - errD_fake), with errD_real and errD_fake being respectively the mean of the predictions of the critic on real and fake samples.
To my understanding RMSprop should optimize the weights of the critic the following way :
w <- w - alpha*gradient(w)
(alpha being the learning rate divided by the square root of the weighted moving average of the squared gradient)
Since the optimization problem requires to "go" in the same direction as the gradient it should be required to multiply gradient(w) by -1 before optimizing the weights.
Do you think that my reasoning is right ?
The program runs but my results are quite poor.
I follow the same logic for the generator's weights but this time in order to go in the opposite direction of the gradient:
############################
# (2) Update G network: minimize -D(G(x))
###########################
G.zero_grad()
noise = torch.randn(b_size, 100, device=device)
fake = G(noise)
#label.fill_(fake_label) # fake labels are real for generator cost
# Since we just updated D, perform another forward pass of all-fake batch through D
output = D(fake).view(-1)
# Calculate G's loss based on this output
#errG = criterion(output, label) #DCGAN
errG = -torch.mean(output)
# Calculate gradients for G
errG.backward()
D_G_z2 = output.mean().item()
# Update G
optimizerG.step()
Sorry for the long question, I tried to explain my doubt as clear as possible. Thank you everyone.
I noticed some errors in the implementation of your discriminator training protocol. You call backward twice, so the losses on the real and the fake batch are backpropagated separately.
Technically an implementation using this scheme is possible, but it is hard to read. The mistake is in errD_real: it is backpropagated with a positive sign, so the optimizer pushes D(x) down on real samples, whereas the critic should push it up, and you penalize it for being correct. Overall, your model can lower its loss simply by predicting D(x)<0 for all inputs.
To fix this, do not call errD_real.backward() or errD_fake.backward(). Simply calling errD.backward() after you define errD works fine. Otherwise, your generator seems to be correct.
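For reference, a minimal sketch of the critic update with a single backward pass could look like the following. The toy D and G networks, the synthetic "real" batch and the learning rate are assumptions so the snippet runs on its own; only the loss and the backward/step/clamp order reflect the fix described above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
device = torch.device("cpu")

# Toy critic and generator; the layer sizes are arbitrary assumptions.
D = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1)).to(device)
G = nn.Sequential(nn.Linear(100, 16), nn.ReLU(), nn.Linear(16, 2)).to(device)
optimizerD = torch.optim.RMSprop(D.parameters(), lr=5e-5)

real = torch.randn(8, 2, device=device) + 3.0  # stand-in for a batch of real data
n_critic = 5

for _ in range(n_critic):
    D.zero_grad()
    noise = torch.randn(real.size(0), 100, device=device)
    fake = G(noise).detach()  # do not backpropagate into G during the critic step

    # The critic maximizes mean(D(real)) - mean(D(fake)); optimizers minimize,
    # so the loss is the negative of that quantity.
    errD = -(D(real).mean() - D(fake).mean())
    errD.backward()           # single backward pass on the combined loss
    optimizerD.step()

    # Weight clipping
    for p in D.parameters():
        p.data.clamp_(-0.01, 0.01)
```

The sign flip lives entirely in errD, so nothing else needs to be negated before optimizerD.step().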
I'm currently trying to implement the CBOW model and have managed to get the training and testing working, but am facing some confusion as to the "proper" way to finally extract the weights from the model to use as our word embeddings.
Model
class CBOW(nn.Module):
    def __init__(self, config, vocab):
        super().__init__()  # required before submodules can be registered
        self.config = config  # Basic config file to hold arguments.
        self.vocab = vocab
        self.vocab_size = len(self.vocab.token2idx)
        self.window_size = self.config.window_size
        self.embed = nn.Embedding(num_embeddings=self.vocab_size, embedding_dim=self.config.embed_dim)
        self.linear = nn.Linear(in_features=self.config.embed_dim, out_features=self.vocab_size)

    def forward(self, x):
        x = self.embed(x)
        x = torch.mean(x, dim=0)  # Average out the embedding values.
        x = self.linear(x)
        return x
Main process
After I run my model through a Solver with the training and testing data, I basically told the train and test functions to also return the model that's used. Then I assigned the embedding weights to a separate variable and used those as the word embeddings.
Training and testing was conducted using cross entropy loss, and each training and testing sample is of the form ([context words], target word).
def run(solver, config, vocabulary):
    for epoch in range(config.num_epochs):
        loss_train, model_train = solver.train()
        loss_test, model_test = solver.test()
    embeddings = model_train.embed.weight
I'm not sure if this is the correct way of going about extracting and using the embeddings. Is there usually another way to do this? Thanks in advance.
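Yes, model_train.embed.weight will give you a torch tensor that stores the embedding weights. Note, however, that this tensor is an nn.Parameter that is still tracked by autograd and carries the latest gradients in its .grad attribute. If you don't want/need them, model_train.embed.weight.data (or, in current PyTorch, model_train.embed.weight.detach()) will give you the weights only.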
A more generic option is to call model_train.embed.parameters(). This will give you a generator of all the weight tensors of the layer. In general, there are multiple weight tensors in a layer and weight will give you only one of them. Embedding happens to have only one, so here it doesn't matter which option you use.
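A minimal sketch of both options, using a standalone nn.Embedding as a stand-in for model_train.embed (the sizes and the token index are assumptions for illustration):

```python
import torch
import torch.nn as nn

embed = nn.Embedding(num_embeddings=10, embedding_dim=4)  # stand-in for model_train.embed

# Option 1: take the weight matrix and cut it loose from autograd.
weights = embed.weight.detach().clone()  # plain tensor, no gradient tracking

token_idx = 3                 # hypothetical index, e.g. from vocab.token2idx
vector = weights[token_idx]   # embedding vector for that token

# Option 2: the generic route; nn.Embedding has exactly one parameter tensor.
params = list(embed.parameters())
```

The clone() makes the extracted matrix independent of the model, so further training does not change it.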
I have recently started to learn Deep Learning and CNNs. I have come across the following code which defines a simple CNN.
Can anyone help me to understand how these lines work:
loss = layer_output[:, :, :, 0] - What is the result of this ? My question is that, the network has not been trained yet. Weights [Kernels] are not yet calculated. so, what data it is going to return !! Does 0 represent the first kernel ?
iterate = K.function([input_img], [loss, grads]) - There is not much documentation available on Keras site. What I understand is that iterate is a function which takes an Input tensor and returns a list of tensors, first one is loss and second one is grads. But, they are defined elsewhere !!
Define Input Image with these dimensions:
img_data = np.random.uniform(size=(1, 250, 250, 3))
There is a Simple CNN, which has one Convolutional layer. It uses two 3 X 3 kernels.
input = Input(shape=(250, 250, 3,), name='input_1')
First_Conv2D = Conv2D(2, kernel_size=(3, 3), padding="same", name='conv2d_1', activation='relu')(input)
flat = Flatten(name='flatten_1')(First_Conv2D)
output = Dense(2, name='dense_1', activation='softmax')(flat)
model = Model(inputs=[input], outputs=[output])
layer_dict = dict([(layer.name, layer) for layer in model.layers[0:]])
layer_output = layer_dict['conv2d_1'].output
input_img = model.input
# Calculate loss and gradient.
loss = layer_output[:, :, :, 0]
grads = K.gradients(loss, input_img)[0]
# Define a Keras function
iterate = K.function([input_img], [loss, grads])
# Call iterate function
loss_value, grads_value = iterate([img_data])
Thank You.
This looks like a nasty dissection of Keras as an API. I reckon it leads to more confusion rather than an introduction to deep learning. Anyway, addressing your questions:
All tensors are symbolic, meaning that until we run a session they do not contain any values. They instead define a directed computation graph. loss = layer_output[:,:,:,0] is a slicing operation that takes the first element of the last dimension, returning another tensor with 3 dimensions; so yes, the 0 selects the output of the first kernel. The weights are not trained yet, but they are initialized, so the operations still produce values. When you run the session with actual inputs, the tensors will have values on which these operations run. The operations are almost identical to those on NumPy ndarrays, which are not symbolic and do contain values, so you can build an intuition from them.
K.function just glues the inputs to the outputs, returning a single operation that, when given the inputs, follows the computation graph from the inputs to the defined outputs. In this case, given a list with a single input, it returns a list of 2 outputs, loss and gradients. Remember that loss and grads themselves are still symbolic: if you try to print one, you'll just get what it is, its shape and its data type.
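For intuition, the same slice on a concrete NumPy array looks like this. The random array is only a stand-in for the conv2d_1 output, assuming the channels-last layout (batch, height, width, kernels) used in the question.

```python
import numpy as np

# Stand-in for the conv2d_1 output: (batch, height, width, n_kernels)
layer_output = np.random.uniform(size=(1, 250, 250, 2))

# Same indexing as in the question: keep everything, pick kernel 0.
first_kernel_map = layer_output[:, :, :, 0]
```

The result has one dimension less than the input: the activation map of the first kernel for every image in the batch.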
Question: How do I print/return the softmax layer for a multiclass problem using Keras?
my motivation: it is important for visualization/debugging.
it is important to do this for the 'training' setting. ergo batch normalization and dropout must behave as they do in train time.
it should be efficient. calling vanilla model.predict() every now and then is less desirable as the model I am using is heavy and this is extra forward passes. The most desirable case is finding a way to simply display the original network output which was calculated during training.
it is ok to assume that this is done while using Tensorflow as a backend.
Thank you.
You can get the outputs of any layer by using: model.layers[index].output
For all layers use this:
from keras import backend as K
import numpy as np
inp = model.input  # input placeholder
outputs = [layer.output for layer in model.layers]  # all layer outputs
functor = K.function([inp] + [K.learning_phase()], outputs)  # evaluation function
# Testing; the 1. sets the learning phase to training mode
test = np.random.random(input_shape)[np.newaxis, ...]
layer_outs = functor([test, 1.])
print(layer_outs)