I realized that the optimizer with which I trained my PyTorch models had repeated parameters, e.g. like this:
model = MyModel()
params = list(model.parameters()) + list(model.parameters()) + list(model.parameters())
optimizer = Adam(params)
I have now fixed the issue but I need to understand what was happening before. Was the training being performed normally? Or did each optimizer step() update the parameters 3 times?
Related
I am using T5-Large by HuggingFace for inference. Given a premise and a hypothesis, I need to determine whether they are related or not. So, if I feed a string "mnli premise: This game will NOT open unless you agree to them sharing your information to advertisers. hypothesis: Personal data disclosure is discussed." the model is supposed to return either entailment, neutral, or contradiction.
Though I am able to determine the result, I am unable to determine the probability of the sequence generated. For instance, consider the model will generate entailment for the example given above. I also want to know what is the probability of entailment. So far, I have been using the following code,
from transformers import T5Tokenizer, T5ForConditionalGeneration
def is_entailment(premise, hypothesis):
entailment_premise = premise
entailment_hypothesis = hypothesis
token_output = tokenizer("mnli premise: " + entailment_premise + " hypothesis: " + entailment_hypothesis,
return_tensors="pt", return_length=True)
input_ids = token_output.input_ids
output = model.generate(input_ids, output_scores=True, return_dict_in_generate=True, max_new_tokens=15)
entailment_ids = output["sequences"]
entailment = tokenizer.decode(entailment_ids[0], skip_special_tokens=True)
return entailment
tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small', return_dict=True)
premise = "This game will NOT open unless you agree to them sharing your information to advertisers."
hypothesis = "Personal data disclosure is discussed."
print(is_entailment(premise, hypothesis))
I have tried using the scores we get as output, but not sure how to calculate the probability from them. Same goes for the last hidden states that can be fetched as the output from the generate(). I saw in another question on Stack Overflow that suggested using a softmax function on the last hidden states but I am unsure how to do it.
How can I calculate the probability of the sequence being generated? That is, if I get entailment for a pair of hypothesis and premise, what would be the P(entailment)?
What you get as the scores are output token distributions before the softmax, so-called logits. You can get the probabilities of generated tokens by normalizing the logits and taking respective token ids. You can get them from the field sequences from what the generate method returns.
These are, however, not the probabilities you are looking for because T5 segments your output words into smaller units (e.g., "entailment" gets segmented to ['▁', 'en', 'tail', 'ment'] using the t5-small tokenizer). This is even trickier because different answers get split into a different number of tokens. You can get an approximate score by averaging the token probabilities (this is typically used during beam search). Such scores do not sum up to one.
If you want a normalized score, the only way is to feed all three possible answers to the decoder, get their scores, and normalize them to sum to one.
I am working on a Reinforcement Learning problem in StableBaselines3.
I am trying to understand why this code:
model = MaskablePPO(MaskableActorCriticPolicy, env, verbose=1, learning_rate=0.0003, gamma=0.975, seed=10, batch_size=256, clip_range=0.2)
model.learn(100000)
Does not give the exact same result as this code:
model = MaskablePPO(MaskableActorCriticPolicy, env, verbose=1, learning_rate=0.0003, gamma=0.975, seed=10, batch_size=256, clip_range=0.2)
model.learn(50000)
model.learn(50000)
I say they don't give the same results because in both cases, I tested out the model on a test-set through a for-loop, and the performance was different. Given that I set deterministic=True in the for-loop and I didn't change the seed, the different performance must mean the networks are different, which means the training process was different.
I was under the impression that if I run model.learn() on an existing model, it would just pick up the training where it was previously left off, but I guess that's incorrect.
Can someone help me understand why those two situations deliver different results?
I have a model M and I am cloning it M.clone()
Now, I want to freeze certain layers of M.clone(). When I set requires_grad=False, it results in this error:
RuntimeError: you can only change requires_grad flags of leaf variables. If you want to use a computed variable in a subgraph that doesn't require differentiation use var_no_grad = var.detach().
How to freeze the layers of M.clone() in that case? I want to ensure that when I backpropagate using the loss computed on a batch using M.clone(), I compute the gradients of M
A small script:
model = ResNet()
optimizer = Adam(model.parameters())
cloned_model = model.clone()
for p in cloned_model.features.parameters():
p.require_grad = False
error = loss(cloned_model(data), labels)
error.backward()
optimizer.step()
P.S. I am not sure if I can use .detach() as I don't want to break the graph. Do correct me if I am wrong.
Thank you!
You can use the in-place requires_grad_ function either on a nn.Module or on a torch.Tensor directly. Here you could do:
cloned_model = copy.deepcopy(model)
cloned_model.requires_grad_(False)
Where deepcopy is from copy.
You should copy your optimizer as well otherwise optimizer will be updating model, not cloned_model... resulting in no changes at all since you are not back-propagating on model.
I'm totally new to pytorch, so it might be a very basic question. I have two networks that should be trained together.
First one takes data as input and returns its embedding as output.
Second one takes pairs of embedded datapoints and returns their 'similarity' as output.
Partial loss is then computed for every datapoint, and then all the losses are combined.
This final loss should be backpropagated through both networks.
How should the code for that look like? I'm thinking something like this:
def train_models(inputs, targets):
network1.train()
network2.train()
embeddings = network1(inputs)
paired_embeddings = pair_embeddings(embeddings)
similarities = network2(similarities)
"""
I don't know how the loss should be calculated here.
I have a loss formula for every embedded datapoint,
but not for every similarity.
But if I only calculate loss for every embedding (using similarites),
won't backpropagate() only modify network1,
since embeddings are network1's outputs
and have not been modified in network2?
"""
optimizer1.step()
optimizer2.step()
scheduler1.step()
scheduler2.step()
network1.eval()
network2.eval()
I hope this specific enough. I'll gladly share more details if necessary. I'm just so inexperienced with pytorch and deep learning in general, that I'm not even sure how to ask this question.
You can use single optimizer for this purpose, and even pass different learning rate for each network.
optimizer = optim.Adam([
{'params': network1.parameters()},
{'params': network2.parameters(), 'lr': 1e-3}
], lr=1e-4)
# ...
loss = loss1 + loss2
loss.backward()
optimizer.step()
I tried to train & cross validate a set of data with 8616 samples using epsilon SVR.
Among the datasets, I take 4368 for test, 4248 for CV.
Kernel type = RBF kernel. Libsvm provides a result as shown below.
optimization finished, #iter = 502363
nu = 0.689607
obj = -6383530527604706.000000, rho = 2884789.960212
nSV = 3023, nBSV = 3004
This is a result gotten by setting
-s 3 -t 2 -c 2^28 -g 2^-13 -p 2^12
(a) What does "nu" means? Sometimes I got nu = 0.99xx for different parameter.
(b) It seems that "obj" is surprisingly large. Does it sounds correct? Libsvm FAQ said this is "optimal objective value of the dual SVM problem". Does it means that this is the min value of f(alpha)?
(c) "rho" is large too. This is the bias term, b. The dataset labels (y) consist of value between 82672 to 286026. So I guess this is reasonable, am I right?
For training set,
Mean squared error = 1.26991e+008 (regression)
Squared correlation coefficient = 0.881112 (regression)
For cross-validation set,
Mean squared error = 1.38909e+008 (regression)
Squared correlation coefficient = 0.883144 (regression)
Using the selected param, I have produced the below result
kernel_type=2 (best c:2^28=2.68435e+008, g:2^-13=0.00012207, e:2^12=4096)
NRMS: 0.345139, best_gap:0.00199433
Mean Absolute Percent Error (MAPE): 5.39%
Mean Absolute Error (MAE): 8956.12 MWh
Daily Peak MAPE: 5.30%
The CV set MAPE is low (5.39%). Using Bias-Variance test, the difference between train set MAPE and CV set MAPE is only 0.00199433, which mean the param seems to be set correctly. But I wonder if the extremely large "obj", "rho" value is correct....
I am very new to SVR, do correct me if my interpretation or validation method is incorrect/insufficient.
Method to calculate MAPE
train_model = svmtrain(train_label, train_data, cmd);
[result_label, train_accuracy, train_dec_values] = svmpredict(train_label, train_data, train_model);
train_err = train_label-result_label;
train_errpct = abs(train_err)./train_label*100;
train_MAPE = mean(train_errpct(~isinf(train_errpct)));
The objective and rho values are high because (most probably) the data were not scaled. Scaling is highly recommended to avoid overflow; the overflow risk also depends on the type of kernel. Btw, when scaling the training data, do not forget to also scale the test data, which is most easily accomplished by scaling all data first, and then splitting them into a training and test set.