Custom Loss Function in Caffe (Spearman Coefficient) while Finetuning (Regression) - regression

I am fine-tuning an ImageNet-pretrained network for a regression problem in Caffe. At present I am using Euclidean loss, but I don't think it is a good fit for my case.
I want the loss value to be the Spearman coefficient between the predicted labels and the actual labels. How can I do so?
Please help!

As clarified in the comments: since the loss function needs to be differentiable and the Spearman coefficient isn't (it is computed on ranks, which change in discrete jumps), we can't use it as a loss function.
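If you still want to track it, one option is to train with a differentiable loss (e.g. Euclidean) and monitor the Spearman coefficient as a validation metric. A minimal sketch using SciPy, with hypothetical values:

```python
import numpy as np
from scipy.stats import spearmanr

# Train with a differentiable surrogate (e.g. Euclidean/MSE loss),
# then monitor Spearman correlation on a validation set.
preds = np.array([0.2, 0.5, 0.1, 0.9])   # hypothetical network outputs
labels = np.array([0.3, 0.6, 0.2, 0.8])  # hypothetical ground-truth targets

rho, _ = spearmanr(preds, labels)
print(rho)  # 1.0 here: the two rankings agree perfectly
```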

Related

Most weird loss function shape (because of weight decay parameter)

I am training a large neural network model (a 1-module Hourglass) for a facial landmark recognition task, using the WFLW database for training.
The loss function is MSELoss() between the predicted output heatmaps and the ground-truth heatmaps.
- Batch size = 32
- Adam Optimizer
- Learning rate = 0.0001
- Weight decay = 0.0001
As I am building a baseline model, I launched a basic experiment with the parameters shown above. I had previously run a model with exactly the same parameters but with weight decay = 0, and it converged successfully. Thus, the problem is with the new weight-decay value.
I was expecting a smooth loss curve that slowly decreased. As can be seen in the image below, the loss curve instead has a very weird shape.
This will probably be fixed by changing the weight decay parameter (decreasing it, maybe?).
I would highly appreciate it if someone could provide a more in-depth explanation of the strange shape of this loss curve and its relation to the weight-decay parameter.
In addition, can someone explain the premature convergence to the very specific value of 0.000415 with a very narrow standard deviation? Is it a strong local minimum?
Thanks in advance.
Loss should not consistently increase when using gradient descent, whether you use weight decay or not. Either there is a bug in your code (e.g. it's worth checking what happens with plain gradient descent instead of Adam, as there are ways to implement weight decay incorrectly with Adam), or your learning rate is too large.
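For example, in PyTorch (which MSELoss() and Adam above suggest), the two standard ways of attaching weight decay to Adam behave differently; a minimal sketch, with the linear layer standing in for the real model:

```python
import torch

model = torch.nn.Linear(10, 1)  # stand-in for the hourglass model

# Adam's weight_decay adds an L2 term to the gradient *before* the adaptive
# rescaling, which couples the decay to the per-parameter learning rates.
opt_coupled = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-4)

# AdamW applies decoupled weight decay directly to the weights, which is
# usually what "weight decay" is intended to mean.
opt_decoupled = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
```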

ResNet50 does not converge; VGG16 works fine

I trained a regression network using ResNet50 as the backbone. The input of the network is an image of size 224*224*3; the output is a single value ranging from 0 to 1.
But the network cannot converge, no matter whether I use sigmoid or ReLU as the output layer's activation, or MAE or MSE as the loss function.
For example, with ResNet50 as backbone, MAE as loss function, sigmoid as the output layer's activation, and SGD as optimizer, the training loss is:
Epoch 1 training loss is 0.4900, val_loss is 0.4797
Epoch 2 training loss is 0.4923, val_loss is 0.4794
Epoch 3 training loss is 0.4923, val_loss is 0.4783
...
Epoch 35 training loss is 0.4923, val_loss is 0.4771
The training loss does not change; it stays constant at 0.4923, and the val_loss is always about 0.47. I tested different optimizers and learning rates, but the network still does not converge.
When I use VGG16 or MobileNet as the backbone, the network converges.
Could anyone give me some suggestions about how to fix this problem?
Can you somehow validate that the ResNet50 backbone is correctly implemented? Maybe try to train it on MNIST and see if it works in general.
It kind of seems to me that the ResNet variant just outputs some mean value instead of learning the actual problem.
Can you give some more information on what you want to achieve, what your regression looks like, and what input the backbone expects? Also, you might want to have a look at similar work (if it exists) and read what architectures and hyperparameters they used. A sketch of the setup as described follows below.
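For reference, a minimal Keras sketch of the setup as the question describes it (ResNet50 backbone, single sigmoid output in [0, 1], MAE loss, SGD); all hyperparameter values here are illustrative, and starting from pretrained weights is one easy sanity check:

```python
import tensorflow as tf

# ResNet50 backbone with global average pooling instead of the classifier head.
backbone = tf.keras.applications.ResNet50(
    include_top=False,
    weights="imagenet",        # try pretrained weights as a sanity check
    input_shape=(224, 224, 3),
    pooling="avg")

# Single sigmoid output for a target in [0, 1].
out = tf.keras.layers.Dense(1, activation="sigmoid")(backbone.output)
model = tf.keras.Model(backbone.input, out)

model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3),
              loss="mae")
# model.fit(train_images, train_targets, validation_data=(val_images, val_targets))
```

If this variant learns while the custom ResNet50 does not, the problem is likely in the backbone implementation rather than in the loss or the output activation.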

How to implement soft-argmax in caffe?

In the Caffe deep-learning framework there is an ArgMax layer, which is not differentiable and hence cannot be used for end-to-end training of a CNN.
Can anyone tell me how I could implement the soft version of argmax, i.e. soft-argmax?
I want to regress coordinates from a heatmap and then use those coordinates in the loss calculation. I am very new to this framework and have no idea how to do this; any help will be much appreciated.
I don't get exactly what you want, but there are the following options:
- Use an L2 loss to train the regression task (EuclideanLoss), or SmoothL1Loss (from SSD Caffe by Wei Liu), or L1 (I don't know where you would get it).
- Use softmax with cross-entropy loss (SoftmaxWithLoss) to train a classification task with classes corresponding to the possible values of the x or y coordinate; for example, one loss layer for x and one for y. SoftmaxWithLoss accepts the label as a numeric value and casts it to int with static_cast(). But take into account that the implementation doesn't check that the casted value is within the 0..(num_classes-1) range, so you have to be careful.
- If you want something more unusual, you'll have to write your own layer in C++, C++/CUDA, or Python+NumPy. This is very often the case unless you are already using someone else's implementation.
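For the soft-argmax itself, here is a minimal NumPy sketch of the 1-D case (softmax over the heatmap, then the expected coordinate); in Caffe this logic would go into a custom Python layer's forward pass, and beta is an illustrative sharpness parameter:

```python
import numpy as np

def soft_argmax_1d(heatmap, beta=100.0):
    """Differentiable approximation of argmax over a 1-D heatmap.

    Larger beta makes the softmax sharper, i.e. closer to a hard argmax.
    """
    z = beta * heatmap
    z = z - z.max()                          # subtract max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()      # softmax over heatmap positions
    coords = np.arange(heatmap.shape[0])
    return (probs * coords).sum()            # expected coordinate

# Example: the peak is at index 3, and soft-argmax recovers ~3.0.
h = np.array([0.1, 0.2, 0.1, 5.0, 0.3])
print(soft_argmax_1d(h))
```

For 2-D heatmaps the same idea applies: softmax over all pixels, then the expectations of the row and column indices give the (y, x) coordinates.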

Tensorflow Multiple Input Loss Function

I am trying to implement a CNN in TensorFlow (with an architecture quite similar to VGG) which splits into two branches after the first fully connected layer. It follows this paper: https://arxiv.org/abs/1612.01697
Each of the two branches outputs a set of 32 numbers. I want to write a joint loss function which takes 3 inputs:
The predictions of branch 1 (y)
The predictions of branch 2 (alpha)
The ground-truth labels (q)
and calculates a weighted loss, as in the image below:
Loss function definition
q_hat = tf.divide(tf.reduce_sum(tf.multiply(alpha, y),0), tf.reduce_sum(alpha,0))
loss = tf.abs(tf.subtract(q_hat, q))
I understand that I need to use tf ops to implement this loss function. Having implemented the above, the network trains, but once trained it does not output the expected results.
Has anyone ever tried combining the outputs of two branches of a network in one joint loss function? Is this something TensorFlow supports? Maybe I am making a mistake somewhere? Any help would be greatly appreciated; let me know if you would like me to add any further details.
From the TensorFlow perspective, there is absolutely no difference between a "regular" CNN graph and a "branched" graph: to TensorFlow it is just a graph that needs to be executed, so TensorFlow certainly supports this. "Combining two branches into a joint loss" is also nothing special. In fact, it is good that the loss depends on both branches: it means that when you ask TensorFlow to compute the loss, it has to do the forward pass through both branches, which is exactly what you want.
One thing I noticed is that your code for the loss is different from the image; your code appears to do this: https://ibb.co/kbEH95
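A minimal sketch of a two-branch model with a joint loss, mirroring the question's code rather than the paper's image (the input size, layer widths, and reduction axis are assumptions):

```python
import tensorflow as tf

# Hypothetical shared trunk that splits into two 32-unit branches.
inputs = tf.keras.Input(shape=(64,))
trunk = tf.keras.layers.Dense(128, activation="relu")(inputs)
y = tf.keras.layers.Dense(32, name="y")(trunk)                                 # branch 1: predictions y
alpha = tf.keras.layers.Dense(32, activation="softplus", name="alpha")(trunk)  # branch 2: weights alpha
model = tf.keras.Model(inputs, [y, alpha])

def joint_loss(q, y, alpha):
    # alpha-weighted average of y per sample, then L1 distance to the label q.
    q_hat = tf.reduce_sum(alpha * y, axis=-1) / tf.reduce_sum(alpha, axis=-1)
    return tf.reduce_mean(tf.abs(q_hat - q))
```

Since the loss reads both branch outputs, gradients flow back through both branches during training.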

Weight decay in Caffe: how exactly is it used?

From some old discussions (link1, link2) I got the idea that the 'weight_decay' parameter is the regularization parameter for the L2 loss over the weights. For example, in the cifar10 solver, the weight_decay value is 0.004. Does it mean the loss to be minimized is "cross-entropy + 0.004*sum_of_L2_Norm_of_all_weights"? Or is it, by any chance, "cross-entropy + 0.004/2*sum_of_L2_Norm_of_all_weights"?
The loss seems to be cross-entropy + 0.004/2*sum_of_L2_Norm_of_all_weights.
Looking at the official Caffe implementation of AlexNet, the solver file (https://github.com/BVLC/caffe/blob/master/models/bvlc_alexnet/solver.prototxt) sets weight_decay=0.0005, while in the original AlexNet paper (http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf, page 6) the gradient update includes the term
-0.0005*e*w_i
(where e is the learning rate and w_i the weight).
Since the gradient is the partial derivative of the loss, and the regularization component of the loss is usually expressed as lambda*||w||^2, it seems as if
weight_decay = 2*lambda
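Spelled out, with E the data loss, lambda the L2 coefficient, and e the learning rate:

```latex
L = E + \lambda \lVert w \rVert^{2}
\quad\Longrightarrow\quad
\frac{\partial L}{\partial w_i} = \frac{\partial E}{\partial w_i} + 2 \lambda w_i
```

so the SGD step contains the term -e*2*lambda*w_i. Matching it against AlexNet's -0.0005*e*w_i gives 2*lambda = 0.0005 = weight_decay, i.e. the loss actually minimized is cross-entropy + (weight_decay/2)*||w||^2, consistent with the answer above.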