Does caffe multiply the regularization parameter by the bias?

I have a bunch of questions about the way regularization and biases work in caffe.
First, do biases exist in the network by default, or do I need to ask caffe to add them?
Second, when it obtains the loss value, it does not consider the regularization, right? I mean, the loss just contains the loss function value; as I understand it, the regularization is only considered in the gradient calculation. Is that right?
Third, when caffe obtains the gradient, does it include the bias values in the regularization as well, or does it only consider the weights of the network?
Thanks in advance,
Afshin

For your 3 questions, my answer is:
Yes. Biases do exist in the network by default. For example, in ConvolutionParameter and InnerProductParameter in caffe.proto, the default value of bias_term is true, which means convolution/inner-product layers in the network have a bias by default.
Yes. The loss value returned by the loss layer does not contain the regularization term. The regularization is only taken into account after calling net_->ForwardBackward(), namely in the ApplyUpdate() function, where the network parameters are actually updated.
Take a convolution layer in a network for example:
layer {
  name: "SomeLayer"
  type: "Convolution"
  bottom: "data"
  top: "conv"
  # for weights
  param {
    lr_mult: 1
    decay_mult: 1.0  # coefficient of regularization for the weights
                     # default is 1.0, shown here for the sake of clarity
  }
  # for bias
  param {
    lr_mult: 2
    decay_mult: 1.0  # coefficient of regularization for the bias
                     # default is 1.0, shown here for the sake of clarity
  }
  ...  # rest of the layer definition omitted
}
The answer to this question is: when caffe obtains the gradient, the solver will include the bias values in the regularization only if the two variables, the second decay_mult above and the weight_decay in solver.prototxt, are both larger than zero.
Details can be found in the function void SGDSolver::Regularize().
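For intuition, here is a rough Python sketch of what that function effectively does for L2 regularization; this is not Caffe's actual code, and the function and variable names are made up:

import numpy as np

def regularize(param_data, param_grad, weight_decay, decay_mult):
    # weight_decay comes from solver.prototxt, decay_mult from the layer's param block
    local_decay = weight_decay * decay_mult
    if local_decay > 0:
        # add the gradient of (local_decay / 2) * ||w||^2, i.e. local_decay * w
        param_grad += local_decay * param_data
    return param_grad

So if either the bias's decay_mult or the global weight_decay is zero, the bias is simply not regularized.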
Hope this will help you.

Related

How to get/print the regularization loss / L2 loss / weight decay value from the optimizer in PyTorch?

As described in the title.
I know the regularization loss in PyTorch is usually defined through the definition of the optimizer (weight_decay):
torch.optim.SGD(params, lr=<required parameter>, momentum=0, dampening=0, weight_decay=1e-5, nesterov=False)
How can I get the regularization loss value so that I can print it?
According to this answer, the regularization loss is never computed explicitly. So what you need to do is calculate it yourself from the parameters, something like:

import torch

l2_loss = 0.0
for param in net.parameters():
    l2_loss += 0.5 * torch.sum(param ** 2)
# multiplying by weight_decay gives the penalty that SGD's weight_decay implicitly applies:
# reg_loss = weight_decay * l2_loss

Binary classification with Softmax

I am training a binary classifier using the Sigmoid activation function with binary crossentropy, which gives good accuracy, around 98%.
When I train the same model using softmax with categorical_crossentropy, the accuracy is very low (< 40%).
I am passing the targets for binary_crossentropy as a list of 0s and 1s, e.g. [0,1,1,1,0].
Any idea why this is happening?
This is the model I am using for the second classifier:
Right now, your second model always answers "Class 0", as it can choose between only one class (the number of outputs of your last layer is 1).
As you have two classes, you need to compute the softmax + categorical_crossentropy on two outputs to pick the most probable one.
Hence, your last layer should be:
model.add(Dense(2, activation='softmax'))
model.compile(...)
Your sigmoid + binary_crossentropy model, which computes the probability of "Class 0" being True by analyzing just a single output number, is already correct.
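If you keep the two-output softmax model, note that categorical_crossentropy also expects one-hot encoded targets rather than the integer labels [0,1,1,1,0] from the question. A minimal sketch using keras.utils.to_categorical, assuming the rest of your pipeline stays the same:

from keras.utils import to_categorical

labels = [0, 1, 1, 1, 0]                        # integer class labels
one_hot = to_categorical(labels, num_classes=2)
# one_hot is now [[1,0], [0,1], [0,1], [0,1], [1,0]], which is what a
# Dense(2, activation='softmax') output trained with categorical_crossentropy
# expects as targets

Alternatively, sparse_categorical_crossentropy accepts the integer labels directly.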
EDIT: Here is a small explanation about the Sigmoid function
Sigmoid can be viewed as a mapping between the real numbers space and a probability space.
Notice that:
Sigmoid(-infinity) = 0
Sigmoid(0) = 0.5
Sigmoid(+infinity) = 1
So if the real-number output of your network is very low, the sigmoid will decide the probability of "Class 0" is close to 0, and decide "Class 1".
On the contrary, if the output of your network is very high, the sigmoid will decide the probability of "Class 0" is close to 1, and decide "Class 0".
Its decision is similar to deciding the class only by looking at the sign of your output. However, this would not allow your model to learn! Indeed, the gradient of this binary loss is null nearly everywhere, making it impossible for your model to learn from its errors, as they are not quantified properly.
That's why sigmoid and binary_crossentropy are used:
they are a surrogate for the binary loss, with nice smooth properties that enable learning.
Also, please find more info about Softmax Function and Cross Entropy

What's the difference between Softmax and SoftmaxWithLoss layer in caffe?

While defining a prototxt in caffe, I found that sometimes we use Softmax as the last layer type and sometimes SoftmaxWithLoss. I know the Softmax layer returns the probability that the input data belongs to each class, but it seems that SoftmaxWithLoss also returns the class probabilities, so what's the difference between them? Or did I misunderstand the usage of the two layer types?
While Softmax returns the probability of each target class given the model predictions, SoftmaxWithLoss not only applies the softmax operation to the predictions, but also computes the multinomial logistic loss, returned as output. This is fundamental for the training phase (without a loss there will be no gradient that can be used to update the network parameters).
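As a rough numerical illustration of the difference, here is a plain NumPy sketch rather than Caffe code; the logits and label below are made up:

import numpy as np

logits = np.array([2.0, 1.0, 0.1])  # raw scores from the last inner-product layer
label = 0                           # ground-truth class index

# What a Softmax layer outputs: class probabilities only
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# What a SoftmaxWithLoss layer outputs: the multinomial logistic loss
# computed on top of those same probabilities
loss = -np.log(probs[label])
print(probs, loss)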
See SoftmaxWithLossLayer and Caffe Loss for more info.

weight decay in caffe. How exactly is it used?

From some old discussions (link1, link2) I got the idea that the 'weight_decay' parameter is the regularization parameter for the L2 loss over the weights. For example, in the cifar10 solver, the weight_decay value is 0.004. Does that mean the loss to be minimized is "cross-entropy + 0.004*sum_of_L2_Norm_of_all_weights"? Or is it, by any chance, "cross-entropy + 0.004/2*sum_of_L2_Norm_of_all_weights"?
The loss seems to be cross-entropy+0.004/2*sum_of_L2_Norm_of_all_weights.
Looking at the official caffe implementation of AlexNet, the solver file (https://github.com/BVLC/caffe/blob/master/models/bvlc_alexnet/solver.prototxt) sets weight_decay=0.0005, while in the original AlexNet paper (http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf, page 6) the gradient update includes the term
-0.0005*e*w_i
Since the gradient is the partial derivative of the loss, and the regularization component of the loss is usually expressed as lambda*||w||^2, it seems as if
weight_decay=2*lambda
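A quick numerical check of that relationship, sketched in NumPy under the assumption that the regularized loss is cross-entropy + lambda*||w||^2; the numbers are arbitrary:

import numpy as np

w = np.array([0.3, -0.7])
lr = 0.01
weight_decay = 0.0005
lam = weight_decay / 2                 # lambda in the loss term lam * ||w||^2

# the gradient of lam * ||w||^2 is 2 * lam * w, so its contribution to the update is
update_from_loss = -lr * 2 * lam * w
# which matches the paper's explicit weight-decay term -weight_decay * lr * w
update_from_weight_decay = -lr * weight_decay * w

assert np.allclose(update_from_loss, update_from_weight_decay)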

Deconvolution layer in caffe

After some reading about deconvolution in caffe, I am confused about FCN's train.prototxt here. The deconvolution layer's default weight filler is 'constant' with a default value of zero. According to the deconvolution operation in caffe, wouldn't all the outputs then be zero, since the inputs are multiplied by zero?
This model uses pretrained parameters for initialization. You should use the 'xavier' filler (as in the mnist model):
weight_filler {
  type: "xavier"
}
bias_filler {
  type: "constant"
}
You are absolutely right, the inference of an FCN initialised with zero deconv weights would be zero. You don't want that.
Initialising a deconv layer with weight_filler: { type: "bilinear" } would be appropriate. That would initialise the filter weights to a bilinear filter of the required size.
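For reference, the 'bilinear' filler produces roughly the following kernel; this is a NumPy sketch of the standard FCN-style bilinear upsampling weights, not Caffe's actual implementation:

import numpy as np

def bilinear_kernel(kernel_size):
    # 2-D bilinear interpolation kernel of shape (kernel_size, kernel_size)
    factor = (kernel_size + 1) // 2
    if kernel_size % 2 == 1:
        center = factor - 1
    else:
        center = factor - 0.5
    og = np.ogrid[:kernel_size, :kernel_size]
    return ((1 - abs(og[0] - center) / factor) *
            (1 - abs(og[1] - center) / factor))

print(bilinear_kernel(4))  # e.g. a 4x4 kernel for 2x upsampling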