loss increase fine-tuning caffe - deep-learning

I have a classification problem of 280 classes with ~278,000 images.
I am fine-tuning the GoogLeNet model (bvlc_googlenet in Caffe) using quick_solver.txt.
My solver is as follows:
test_iter: 1000
test_interval: 4000
test_initialization: false
display: 40
average_loss: 40
base_lr: 0.001
lr_policy: "poly"
power: 0.5
max_iter: 800000
momentum: 0.9
weight_decay: 0.0002
snapshot: 20000
During training I use a batch size of 32, and a test batch size of 32 as well. I retrain three layers from scratch (loss1/classifier, loss2/classifier, and loss3/classifier) by renaming them. I set the global learning rate to 0.001, i.e. 10 times smaller than the one used for training from scratch; the last three layers, however, still get a learning rate of 0.01.
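For reference, Caffe's "poly" policy decays the learning rate as base_lr * (1 - iter/max_iter)^power. A minimal Python sketch using the solver values above, which reproduces the lr values printed in the logs below:

# Sketch of the "poly" learning-rate schedule from the solver above.
base_lr, power, max_iter = 0.001, 0.5, 800000

def poly_lr(iteration):
    return base_lr * (1.0 - iteration / max_iter) ** power

print(poly_lr(40))      # ~0.000999975, matches the lr logged at iteration 40
print(poly_lr(119000))  # ~0.000922632, matches the lr logged at iteration 119000

So the schedule itself is smooth; nothing special happens to the learning rate around iteration 119,040.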
Logfile of the very first iterations:
I0515 08:44:41.838122 1279 solver.cpp:228] Iteration 40, loss = 9.72169
I0515 08:44:41.838163 1279 solver.cpp:244] Train net output #0: loss1/loss1 = 5.7261 (* 0.3 = 1.71783 loss)
I0515 08:44:41.838170 1279 solver.cpp:244] Train net output #1: loss2/loss1 = 5.65961 (* 0.3 = 1.69788 loss)
I0515 08:44:41.838173 1279 solver.cpp:244] Train net output #2: loss3/loss3 = 5.46685 (* 1 = 5.46685 loss)
I0515 08:44:41.838179 1279 sgd_solver.cpp:106] Iteration 40, lr = 0.000999975
By the 100,000th iteration, my net obtains ~50% top-1 and ~80% top-5 accuracy:
I0515 13:45:59.789113 1279 solver.cpp:337] Iteration 100000, Testing net (#0)
I0515 13:46:53.914217 1279 solver.cpp:404] Test net output #0: loss1/loss1 = 2.08631 (* 0.3 = 0.625893 loss)
I0515 13:46:53.914274 1279 solver.cpp:404] Test net output #1: loss1/top-1 = 0.458375
I0515 13:46:53.914279 1279 solver.cpp:404] Test net output #2: loss1/top-5 = 0.768781
I0515 13:46:53.914284 1279 solver.cpp:404] Test net output #3: loss2/loss1 = 1.88489 (* 0.3 = 0.565468 loss)
I0515 13:46:53.914288 1279 solver.cpp:404] Test net output #4: loss2/top-1 = 0.494906
I0515 13:46:53.914290 1279 solver.cpp:404] Test net output #5: loss2/top-5 = 0.805906
I0515 13:46:53.914294 1279 solver.cpp:404] Test net output #6: loss3/loss3 = 1.77118 (* 1 = 1.77118 loss)
I0515 13:46:53.914297 1279 solver.cpp:404] Test net output #7: loss3/top-1 = 0.517719
I0515 13:46:53.914299 1279 solver.cpp:404] Test net output #8: loss3/top-5 = 0.827125
At the 119,000th iteration everything is still normal:
I0515 14:43:38.669674 1279 solver.cpp:228] Iteration 119000, loss = 2.70265
I0515 14:43:38.669777 1279 solver.cpp:244] Train net output #0: loss1/loss1 = 2.41406 (* 0.3 = 0.724217 loss)
I0515 14:43:38.669783 1279 solver.cpp:244] Train net output #1: loss2/loss1 = 2.38374 (* 0.3 = 0.715123 loss)
I0515 14:43:38.669787 1279 solver.cpp:244] Train net output #2: loss3/loss3 = 1.92663 (* 1 = 1.92663 loss)
I0515 14:43:38.669798 1279 sgd_solver.cpp:106] Iteration 119000, lr = 0.000922632
Right after that, the loss suddenly jumps back to roughly its initial value (between 8 and 9):
I0515 14:43:45.377710 1279 solver.cpp:228] Iteration 119040, loss = 8.3068
I0515 14:43:45.377751 1279 solver.cpp:244] Train net output #0: loss1/loss1 = 5.77026 (* 0.3 = 1.73108 loss)
I0515 14:43:45.377758 1279 solver.cpp:244] Train net output #1: loss2/loss1 = 5.76971 (* 0.3 = 1.73091 loss)
I0515 14:43:45.377763 1279 solver.cpp:244] Train net output #2: loss3/loss3 = 5.70022 (* 1 = 5.70022 loss)
I0515 14:43:45.377768 1279 sgd_solver.cpp:106] Iteration 119040, lr = 0.000922605
And the net cannot reduce that loss, even long after the sudden change happened:
I0515 16:51:10.485610 1279 solver.cpp:228] Iteration 161040, loss = 9.01994
I0515 16:51:10.485649 1279 solver.cpp:244] Train net output #0: loss1/loss1 = 5.63485 (* 0.3 = 1.69046 loss)
I0515 16:51:10.485656 1279 solver.cpp:244] Train net output #1: loss2/loss1 = 5.63484 (* 0.3 = 1.69045 loss)
I0515 16:51:10.485661 1279 solver.cpp:244] Train net output #2: loss3/loss3 = 5.62972 (* 1 = 5.62972 loss)
I0515 16:51:10.485666 1279 sgd_solver.cpp:106] Iteration 161040, lr = 0.0008937
I reran the experiment twice and the jump repeats at exactly iteration 119,040. For further information: I shuffled the data when creating the LMDB database, and I used the same database to train a VGG-16 (step learning-rate policy, 80k max iterations, 20k iterations per step) without any problem; with VGG I obtain 55% top-1 accuracy.
Has anybody run into a similar problem?

Related

PyTorch autograd backward() doesn't work (output 0 of MmBackward is at version 1; expected version 0 instead)

I'm building a model that mixes a fine-tuned CLIP model and a frozen CLIP model, and I compute a custom loss from a KL-divergence term and a cross-entropy term:
with torch.no_grad():
    zero_shot_image_features = zero_shot_model.encode_image(input_image)
    zero_shot_context_text_features = zero_shot_model.encode_text(context_label_text)
    zero_shot_image_features /= zero_shot_image_features.norm(dim=-1, keepdim=True)
    zero_shot_context_text_features /= zero_shot_context_text_features.norm(dim=-1, keepdim=True)
    zero_shot_output_context = (zero_shot_image_features @ zero_shot_context_text_features.T).softmax(dim=-1)

fine_tunning_image_features = fine_tunning_model.encode_image(input_image)
fine_tunning_context_text_features = fine_tunning_model.encode_text(context_label_text)
fine_tunning_image_features /= fine_tunning_image_features.norm(dim=-1, keepdim=True)
fine_tunning_context_text_features /= fine_tunning_context_text_features.norm(dim=-1, keepdim=True)
fine_tunning_output_context = (fine_tunning_image_features @ fine_tunning_context_text_features.T).softmax(dim=-1)

fine_tunning_label_text_features = fine_tunning_model.encode_text(label_text)
fine_tunning_label_text_features /= fine_tunning_label_text_features.norm(dim=-1, keepdim=True)
fine_tunning_output_label = (fine_tunning_image_features @ fine_tunning_label_text_features.T).softmax(dim=-1)

optimizer_zeroshot.zero_grad()
optimizer_finetunning.zero_grad()
loss.backward(retain_graph=True)

def custom_loss(zero_shot_output_context, fine_output_context, fine_output_label, target, alpha):
    # Compute the cross-entropy loss
    ce_loss = F.cross_entropy(fine_output_label, target)
    # Compute the KL divergence between the output and the target
    kl_loss = F.kl_div(zero_shot_output_context.log(), fine_output_context.log(), reduction='batchmean').requires_grad_(True)
    final_loss = ce_loss + alpha * kl_loss
    return final_loss
RuntimeError                              Traceback (most recent call last)
Cell In[18], line 81
     78 optimizer2.zero_grad()
     79 optimizer.zero_grad()
---> 81 loss.backward(retain_graph=True)
     83 if device == "cpu":
     84     optimizer.step()

File ~/anaconda3/envs/sh_clip/lib/python3.8/site-packages/torch/tensor.py:221, in Tensor.backward(self, gradient, retain_graph, create_graph)
    213 if type(self) is not Tensor and has_torch_function(relevant_args):
    214     return handle_torch_function(
    215         Tensor.backward,
    216         relevant_args,
    (...)
    219         retain_graph=retain_graph,
    220         create_graph=create_graph)
--> 221 torch.autograd.backward(self, gradient, retain_graph, create_graph)

File ~/anaconda3/envs/sh_clip/lib/python3.8/site-packages/torch/autograd/__init__.py:130, in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
    127 if retain_graph is None:
    128     retain_graph = create_graph
--> 130 Variable._execution_engine.run_backward(
    131     tensors, grad_tensors, retain_graph, create_graph,
    132     allow_unreachable=True)

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.HalfTensor [6, 1024]], which is output 0 of MmBackward, is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
But when I train the model, the backward() call fails with the error above. How can I fix it?
You use 'a /= b', which is an in-place operation; it will work if you change it to 'a = a / b'.
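For example (a minimal sketch; the variable names follow the code in the question):

# Out-of-place normalization keeps the tensor that autograd saved for backward untouched.
fine_tunning_image_features = fine_tunning_model.encode_image(input_image)
fine_tunning_image_features = fine_tunning_image_features / fine_tunning_image_features.norm(dim=-1, keepdim=True)

# Equivalent out-of-place alternative:
# fine_tunning_image_features = F.normalize(fine_tunning_image_features, dim=-1)

The in-place division inside the torch.no_grad() block (the zero-shot branch) is harmless, since no gradients flow through it; only the fine-tuned branch needs the change.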

how to prevent my DNN / MLP from converging to the average

I want to use several available features to predict a variable. The task does not seem to be related to vision or NLP, but I believe there are good reasons to think the variable to be predicted is a nonlinear function of these features. So I just use a plain MLP like the following:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(53, 200)
        self.fc2 = nn.Linear(200, 100)
        self.fc3 = nn.Linear(100, 36)
        self.fc4 = nn.Linear(36, 1)

    def forward(self, x):
        x = F.leaky_relu(self.fc1(x))
        x = F.leaky_relu(self.fc2(x))
        x = F.leaky_relu(self.fc3(x))
        x = self.fc4(x)
        return x

net = Net().to(device)
loss_function = nn.MSELoss()
optimizer = optim.Adam(net.parameters(), lr=0.001, weight_decay=1e-6)

def train_normal(model, device, train_loader, optimizer, epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data = data.to(device)
        target = target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = loss_function(output, target)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 100)
        optimizer.step()
        if batch_idx % 100 == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))
At first it seemed to work and did learn something:
Train Epoch: 9 [268800/276316 (97%)] Loss: 0.217219
Train Epoch: 9 [275200/276316 (100%)] Loss: 0.234965
predicted actual diff
-1.18 -1.11 -0.08
0.15 -0.15 0.31
0.19 0.27 -0.08
-0.49 -0.48 -0.01
-0.05 0.08 -0.14
0.44 0.50 -0.06
-0.17 -0.05 -0.12
1.81 1.92 -0.12
1.55 0.76 0.79
-0.05 -0.30 0.26
But as it kept training, the predictions converged toward roughly the same value, regardless of the input:
predicted actual diff
-0.16 -0.06 -0.10
-0.16 -0.55 0.39
-0.13 -0.26 0.14
-0.15 0.50 -0.66
-0.16 0.02 -0.18
-0.16 -0.12 -0.04
-0.16 -0.40 0.24
-0.01 1.20 -1.21
-0.07 0.33 -0.40
-0.09 0.02 -0.10
What technique / trick can prevent this? Also, how can I increase the accuracy: should I add more hidden layers, or add more neurons to each layer?
One possible problem is that there is nothing to learn.
Check that your data is standardized and try different learning rates (maybe even a cyclic learning rate). One thing that can happen is that the optimizer cannot settle into a minimum and keeps jumping around it.
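For instance, a rough sketch of both suggestions (assuming your features are in a NumPy array X_train, and reusing torch and the Adam optimizer already defined in the question):

# Standardize inputs: zero mean, unit variance per feature (fit on training data only).
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)

# Cyclic learning rate: lr oscillates between base_lr and max_lr.
scheduler = torch.optim.lr_scheduler.CyclicLR(
    optimizer, base_lr=1e-4, max_lr=1e-2,
    step_size_up=2000, cycle_momentum=False)  # cycle_momentum=False is required with Adam
# ...then call scheduler.step() after each optimizer.step() in the training loop.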
I am not sure if you are already doing this, but start from a standard implementation that works on another dataset and then adapt it to your problem, just to avoid small development mistakes. You can check the tutorial "How to apply Deep Learning on tabular data with FastAi", and if you are really new I would strongly recommend this MOOC: https://course.fast.ai/. It should help you gain some experience and understanding.
If you already have all the data in tabular form, you can also try a classical machine learning algorithm such as linear regression or gradient boosting, just to check whether your data carries any signal. For example, with scikit-learn:
>>> import numpy as np
>>> from sklearn.linear_model import LinearRegression
>>> X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
>>> # y = 1 * x_0 + 2 * x_1 + 3
>>> y = np.dot(X, np.array([1, 2])) + 3
>>> reg = LinearRegression().fit(X, y)
>>> reg.score(X, y)
1.0
>>> reg.coef_
array([1., 2.])
>>> reg.intercept_
3.0000...
>>> reg.predict(np.array([[3, 5]]))
array([16.])
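A similar quick check can be done with gradient boosting (a sketch; here X and y would be your own tabular features and targets rather than the toy arrays above):
>>> from sklearn.ensemble import GradientBoostingRegressor
>>> gbr = GradientBoostingRegressor().fit(X, y)
>>> gbr.score(X, y)  # R^2 on the training data; a value near 0 suggests little signal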
Let me know if you find the solution to your problem!

Weird loss pattern when using two losses in caffe

I am training a CNN in Caffe and get the following weird loss pattern:
I0425 16:38:58.305482 23335 solver.cpp:398] Test net output #0: loss = nan (* 1 = nan loss)
I0425 16:38:58.305524 23335 solver.cpp:398] Test net output #1: loss_intermediate = inf (* 1 = inf loss)
I0425 16:38:59.235857 23335 solver.cpp:219] Iteration 0 (-4.2039e-45 iter/s, 20.0094s/50 iters), loss = 18284.4
I0425 16:38:59.235926 23335 solver.cpp:238] Train net output #0: loss = 18274.9 (* 1 = 18274.9 loss)
I0425 16:38:59.235942 23335 solver.cpp:238] Train net output #1: loss_intermediate = 9.46859 (* 1 = 9.46859 loss)
I0425 16:38:59.235955 23335 sgd_solver.cpp:105] Iteration 0, lr = 1e-06
I0425 16:39:39.330327 23335 solver.cpp:219] Iteration 50 (1.24704 iter/s, 40.0948s/50 iters), loss = 121737
I0425 16:39:39.330410 23335 solver.cpp:238] Train net output #0: loss = 569.695 (* 1 = 569.695 loss)
I0425 16:39:39.330425 23335 solver.cpp:238] Train net output #1: loss_intermediate = 121168 (* 1 = 121168 loss)
I0425 16:39:39.330433 23335 sgd_solver.cpp:105] Iteration 50, lr = 1e-06
I0425 16:40:19.372197 23335 solver.cpp:219] Iteration 100 (1.24868 iter/s, 40.0421s/50 iters), loss = 34088.4
I0425 16:40:19.372268 23335 solver.cpp:238] Train net output #0: loss = 369.577 (* 1 = 369.577 loss)
I0425 16:40:19.372283 23335 solver.cpp:238] Train net output #1: loss_intermediate = 33718.8 (* 1 = 33718.8 loss)
I0425 16:40:19.372292 23335 sgd_solver.cpp:105] Iteration 100, lr = 1e-06
I0425 16:40:59.501541 23335 solver.cpp:219] Iteration 150 (1.24596 iter/s, 40.1297s/50 iters), loss = 21599.6
I0425 16:40:59.501606 23335 solver.cpp:238] Train net output #0: loss = 478.262 (* 1 = 478.262 loss)
I0425 16:40:59.501621 23335 solver.cpp:238] Train net output #1: loss_intermediate = 21121.3 (* 1 = 21121.3 loss)
...
I0425 17:09:01.895849 23335 solver.cpp:219] Iteration 2200 (1.24823 iter/s, 40.0568s/50 iters), loss = 581.874
I0425 17:09:01.895912 23335 solver.cpp:238] Train net output #0: loss = 532.049 (* 1 = 532.049 loss)
I0425 17:09:01.895926 23335 solver.cpp:238] Train net output #1: loss_intermediate = 49.8377 (* 1 = 49.8377 loss)
I0425 17:09:01.895936 23335 sgd_solver.cpp:105] Iteration 2200, lr = 1e-06
FYI: my network basically consists of two stages, hence the two losses. The first stage can be seen as a coarse stage, and the second one upsamples the output of the coarse stage.
My question is: is this a typical loss pattern? In the first iteration the loss is high and the intermediate_loss is low, and then it basically turns around in the following iterations, so the loss is lower and the intermediate_loss is higher. In the end only the intermediate_loss converges.
"typical" isn't really an applicable term. There is such a variety of models and topologies that you can find many examples of strange loss progressions.
In your case, the intermediate loss may well start out low because it "doesn't know any better" yet. As the later layers become trained enough to give reliable feedback to the intermediate layers, the intermediate stage begins learning in earnest, and its serious mistakes start to show up in the loss.
The final loss computation is in direct contact with ground truth; it learns from the first iteration, so it has a more understandable progression from high loss to low.

caffe outputs a negative loss value with the SoftmaxWithLoss layer?

Below is the last layer in my training net:
layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "final"
  bottom: "label"
  top: "loss"
  loss_param {
    ignore_label: 255
    normalization: VALID
  }
}
Note that I use a softmax loss layer. Since it computes -log(probability), it is strange that the loss can be negative, as shown below (iteration 80).
I0404 23:32:49.400624 6903 solver.cpp:228] Iteration 79, loss = 0.167006
I0404 23:32:49.400806 6903 solver.cpp:244] Train net output #0: loss = 0.167008 (* 1 = 0.167008 loss)
I0404 23:32:49.400825 6903 sgd_solver.cpp:106] Iteration 79, lr = 0.0001
I0404 23:33:25.660655 6903 solver.cpp:228] Iteration 80, loss = -1.54972e-06
I0404 23:33:25.660845 6903 solver.cpp:244] Train net output #0: loss = 0 (* 1 = 0 loss)
I0404 23:33:25.660862 6903 sgd_solver.cpp:106] Iteration 80, lr = 0.0001
I0404 23:34:00.451464 6903 solver.cpp:228] Iteration 81, loss = 1.89034
I0404 23:34:00.451661 6903 solver.cpp:244] Train net output #0: loss = 1.89034 (* 1 = 1.89034 loss)
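Just to make the point concrete, -log(p) is nonnegative for any probability p in (0, 1], e.g.:

import numpy as np
p = np.array([1e-6, 0.5, 1.0])
print(-np.log(p))  # roughly [13.8155, 0.6931, 0.0] -- never negative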
Can anyone explain this to me? How can this happen?
Thank you very much!
PS:
The task I do here is semantic segmentation.
There are 20 object classes plus background in total (so 21 classes). The labels range from 0 to 21. The extra label 255 is ignored, as can be seen in the SoftmaxWithLoss definition at the beginning of this post.
Is Caffe running on GPU or CPU?
Print out the prob_data that you get after the softmax operation:
// find the next line in your cpu or gpu Forward function
softmax_layer_->Forward(softmax_bottom_vec_, softmax_top_vec_);
// make sure you have data in cpu
const Dtype* prob_data = prob_.cpu_data();
for (int i = 0; i < prob_.count(); i++) {
  printf("%f ", prob_data[i]);
}

why iteration loss does not match the sum of all train net output in caffe

Iteration 1, loss = 0.0486978
Train net output #0: loss_bbox = 0.236533 (* 0.5 = 0.118266 loss)
Train net output #1: loss_cls = 0.353563 (* 0.5 = 0.176781 loss)
Hi, I am training a CNN using Caffe, and I found that the iteration loss does not match the sum of the weighted train net outputs: 0.118266 + 0.176781 = 0.295047, while the reported iteration loss is 0.0486978.
Any advice? Thanks!