Caffe Fast R-CNN SmoothL1LossLayer implementation

I was reading the Fast R-CNN Caffe code. Inside the SmoothL1LossLayer, I found that the implementation is not the same as the paper's equation. Is that how it should be?
The paper's equation: for each labeled bounding box with class u, we compute the summed error over tx, ty, tw, th. But in the code, no class label information is used. Can anyone explain why?
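For reference, the localization loss from the Fast R-CNN paper (transcribed here, since the original screenshot is not included) is

L_loc(t^u, v) = sum over i in {x, y, w, h} of smoothL1(t_i^u - v_i)

where smoothL1(z) = 0.5*z^2 if |z| < 1 and |z| - 0.5 otherwise, u is the true class of the box, and v is the ground-truth regression target.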
And in the backpropagation step, why is there an index i (in propagate_down[i]) here?

In train.prototxt, bbox_pred has an output size of 84 = 4 (x, y, h, w) * 21 (number of labels), and so does bbox_targets, so all labels are represented.
As for loss layers, the code loops over the bottom blobs to find which one to propagate the gradient through; here only one of the propagate_down[i] entries is true.
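Roughly, the class information enters through the targets themselves. Here is a sketch of the idea (not the exact fast-rcnn code; names and shapes are illustrative): the 4 regression targets of each ROI are written into the slot of its true class inside the 84-wide target vector, and the matching inside-weights are set to 1, so the loss effectively only sees the true class's coordinates.

import numpy as np

def expand_bbox_targets(labels, targets, num_classes=21):
    # labels:  (N,) true class index per ROI (0 = background)
    # targets: (N, 4) class-agnostic regression targets (tx, ty, tw, th)
    n = labels.shape[0]
    bbox_targets = np.zeros((n, 4 * num_classes), dtype=np.float32)
    bbox_inside_weights = np.zeros_like(bbox_targets)
    for i, cls in enumerate(labels):
        if cls > 0:  # background ROIs get no regression target
            start = 4 * cls
            bbox_targets[i, start:start + 4] = targets[i]
            bbox_inside_weights[i, start:start + 4] = 1.0
    return bbox_targets, bbox_inside_weights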

Related

How does contrastive loss work intuitively in a Siamese network?

I am having trouble getting a clear understanding of the contrastive loss used in a Siamese network.
Here is the PyTorch formula:
torch.mean((1 - label) * torch.pow(euclidean_distance, 2) +
           (label) * torch.pow(torch.clamp(margin - euclidean_distance, min=0.0), 2))
where margin=2.
If we convert this to equation form, it can be written as
(1-Y)*D^2 + Y*max(m-D, 0)^2
Y = 0 if both images are from the same class
Y = 1 if the images are from different classes
What I think is: if the images are from the same class, the distance between the embeddings should decrease, and if they are from different classes, the distance should increase.
I am unable to map this idea onto the contrastive loss.
Say Y is 1 and the distance value is large: the first term becomes zero because of (1-Y), and the second also becomes zero, because max(m-D, 0) picks whichever of m-D and 0 is bigger.
So the loss is zero, which does not make sense to me.
Can you please help me understand this?
If the distance for a negative pair (different classes) is already greater than the specified margin, it is already separable from the positive pairs, so there is no benefit in pushing it farther apart.
For details please check this blog post, where the concept of "equilibrium" is explained, along with why the contrastive loss makes reaching this point easier.
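To make the point concrete, here is a minimal sketch of that loss in PyTorch (variable names and the margin follow the formula in the question; it is illustrative, not a reference implementation):

import torch

def contrastive_loss(euclidean_distance, label, margin=2.0):
    # label == 0: same class      -> penalize the distance (pull embeddings together)
    # label == 1: different class -> push apart, but only while distance < margin
    same = (1 - label) * torch.pow(euclidean_distance, 2)
    diff = label * torch.pow(torch.clamp(margin - euclidean_distance, min=0.0), 2)
    return torch.mean(same + diff)

# A negative pair that is already farther apart than the margin contributes
# zero loss -- which is the intended behaviour described above, not a bug:
print(contrastive_loss(torch.tensor([3.0]), torch.tensor([1.0])))  # tensor(0.)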

What does a.sub_(lr*a.grad) actually do?

I am doing the fast.ai course, the SGD lesson, and I cannot understand this line.
It subtracts (learning rate * gradient) from the coefficients...
But why is it necessary to subtract?
Here is the code:
def update():
    y_hat = x @ a
    loss = mse(y_hat, y)
    if t % 10 == 0: print(loss)
    loss.backward()
    with torch.no_grad():
        a.sub_(lr * a.grad)
Picture the loss function J plotted as a function of the parameter W. This is a simplified representation with W being the only parameter; for a convex loss function, the curve is a U shape with the minimum at the bottom.
Note that the learning rate is positive. On the left side, the gradient (slope of the line tangent to the curve at that point) is negative, so the product of the learning rate and gradient is negative. Thus, subtracting the product from W will actually increase W (since 2 negatives make a positive). In this case, this is good because loss decreases.
On the other hand (on the right side), the gradient is positive, so the product of the learning rate and gradient is positive. Thus, subtracting the product from W reduces W. In this case also, this is good because the loss decreases.
We can extend the same idea to more parameters (the graph becomes higher dimensional and hard to visualize, which is why we took a single parameter W initially) and to other loss functions (even non-convex ones, though gradient descent won't always converge to the global minimum, only to a nearby local minimum).
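A quick numeric check of that sign argument (illustrative numbers, not taken from any course material):

lr = 0.1

# Left of the minimum: the gradient is negative, so W increases.
W, grad = -2.0, -4.0
W = W - lr * grad   # -2.0 - 0.1 * (-4.0) = -1.6, i.e. W moved right, toward the minimum

# Right of the minimum: the gradient is positive, so W decreases.
W, grad = 3.0, 6.0
W = W - lr * grad   # 3.0 - 0.1 * 6.0 = 2.4, i.e. W moved left, toward the minimum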
Note: this explanation can be found in Andrew Ng's deeplearning.ai courses, but I couldn't find a direct link to it, so I wrote this answer.
I'm assuming a represents your model parameters, based on y_hat = x @ a. The subtraction is necessary because the stochastic gradient descent algorithm aims to find a minimum of the loss function. You therefore take the gradient of the loss w.r.t. your model parameters and update them a little in the direction opposite to that gradient.
Think of the analogy of sliding down a hill: if the landscape represents your loss, the negative gradient is the direction of steepest descent. To get to the bottom (i.e. to minimize the loss), you take little steps in that direction from where you're standing.
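If it helps, here is a self-contained toy version of that update loop (x, y, a, lr and mse are made-up stand-ins, not the course's exact data):

import torch

torch.manual_seed(0)
n = 100
x = torch.cat([torch.rand(n, 1), torch.ones(n, 1)], dim=1)        # feature + bias column
y = x @ torch.tensor([[3.0], [2.0]]) + 0.1 * torch.randn(n, 1)    # noisy linear data

a = torch.randn(2, 1, requires_grad=True)                         # parameters being learned
lr = 0.1

def mse(y_hat, y):
    return ((y_hat - y) ** 2).mean()

for t in range(100):
    loss = mse(x @ a, y)
    if t % 10 == 0: print(loss.item())        # the printed loss keeps shrinking
    loss.backward()
    with torch.no_grad():
        a.sub_(lr * a.grad)                   # a <- a - lr * gradient
        a.grad.zero_()                        # reset the gradient for the next step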

Confused about Yolo

I am a bit confused with how Yolo works.
In the paper, they say that:
"The confidence prediction represents the IOU between the
predicted box and any ground truth box."
But how do we have the ground truth box? Let's say I use my Yolo network (already trained) on an image that is not labelled. What is my confidence then?
Sorry if the question is simple, but I really don't get this part...
Thank you!
But how do we have the ground truth box?
You seem to be confused about what exactly is the training data and what is the output (prediction) of YOLO.
The training data is a bounding box along with the class label(s). This is referred to as the 'ground truth box': b = [bx, by, bh, bw, class_name (or number)], where bx, by is the midpoint of the annotated bounding box and bh, bw are its height and width.
The output, or prediction, is a bounding box b along with a class c for an image i.
Formally: y = [pc, bx, by, bh, bw, c1, ..., cn], where bx, by is the midpoint of the predicted box, bh, bw are its height and width, c1...cn are the class scores, and pc is the confidence that there is an object of some class in box b.
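As a concrete (made-up) illustration for a 3-class detector, one target vector could look like this:

# [pc, bx, by, bh, bw, c1, c2, c3] -- made-up numbers, not from the paper
y = [1.0,            # pc: an object is present in this cell/box
     0.45, 0.60,     # bx, by: midpoint of the box (relative coordinates)
     0.30, 0.20,     # bh, bw: height and width of the box
     0.0, 1.0, 0.0]  # one-hot class scores: the object belongs to class 2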
Let's say I use my Yolo network (already trained) on an image that is not labelled. What is my confidence then?
When you say you have a pre-trained model (what you refer to as "already trained"), your network already 'knows' bounding boxes for certain object classes, and it tries to approximate where the object might be in a new image. While doing so, the network might predict the bounding box somewhere other than where it is supposed to be. So how do you measure how far off the box is? IOU to the rescue!
What IOU (Intersection Over Union) does is give you a score: the area of overlap divided by the area of union.
IOU = Area of Overlap / Area of Union
In practice it is rarely a perfect 1, only somewhat close to it; the lower the IOU, the worse YOLO is predicting the bounding box with respect to the ground truth.
An IOU score of 1 means the bounding box is predicted exactly on top of the ground truth.
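A small sketch of that computation (boxes assumed to be [x1, y1, x2, y2] with x2 > x1 and y2 > y1; purely illustrative):

def iou(box_a, box_b):
    # Intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)   # 0 if the boxes do not overlap

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou([0, 0, 10, 10], [5, 5, 15, 15]))   # 25 / 175 ≈ 0.14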
YOLO uses IOU as a training signal. During training, the confidence target for a predicted box is the probability that an object is present multiplied by the IOU between the prediction and the ground truth:
(probability of object) * (IoU score)
Hope this helps.
I think all you need is a good image that clarifies what the ground truth is.
In such an image, the rectangle that perfectly envelopes the object is the ground truth (the blue one in the original figure), and the orange rectangle is the predicted one. The IoU is the overlap between the two, which you can judge visually.
Hope this helps.
I think I know the answer. I guess YOLO uses IoU in two cases, for different goals:
1- to assess predictions while training;
2- when you use an already trained model, you sometimes get many boxes for the same object, and I have read that IoU is how YOLO tackles this issue (not sure whether this counts as part of Non-Maximum Suppression; see the sketch below).
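For point 2, the usual recipe is non-maximum suppression. Here is a bare-bones sketch (illustrative, not YOLO's exact implementation, reusing the iou helper sketched in the answer above):

def nms(boxes, scores, iou_threshold=0.5):
    # Sort box indices by score, best first
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too much
        order = [i for i in order if iou(boxes[i], boxes[best]) <= iou_threshold]
    return keep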

Loss function for Bounding Box Regression using CNN

I am trying to understand Loss functions for Bounding Box Regression in CNNs. Currently I use Lasagne and Theano, which makes writing loss expressions very easy. Many sources propose different methods and I am asking myself which one is usually used in practice.
The bounding boxes coordinates are represented as normalized coordinates in the order [left, top, right, bottom] (using T.matrix('targets', dtype=theano.config.floatX)).
I have tried the following functions so far; however all of them have their drawbacks.
Intersection over Union
I was advised to use the Intersection over Union measure to identify how well two bounding boxes align and overlap. However, a problem occurs when the boxes don't overlap: the intersection becomes 0, and then the whole quotient is 0 regardless of how far apart the bounding boxes are. I implemented it as:
def get_area(A):
    return (A[:,2] - A[:,0]) * (A[:,1] - A[:,3])

def get_intersection(A, B):
    return (T.minimum(A[:,2], B[:,2]) - T.maximum(A[:,0], B[:,0])) \
         * (T.minimum(A[:,1], B[:,1]) - T.maximum(A[:,3], B[:,3]))

def bbox_overlap_loss(A, B):
    """Computes the bounding box overlap using the
    Intersection over union"""
    intersection = get_intersection(A, B)
    union = get_area(A) + get_area(B) - intersection
    # Turn into loss
    l = 1.0 - intersection / union
    return l.mean()
Squared Diameter Difference
To create an error measure for non-overlapping bounding boxes, I tried computing the squared difference of the bounding box diameters. It seems to work, but I am almost sure there is a much better way to do this. I implemented it as:
def squared_diameter_loss(A, B):
    # Represent the squared distance from the real diameter
    # in normalized pixel coordinates
    l = (abs(A[:,0:2]-B[:,0:2]) + abs(A[:,2:4]-B[:,2:4]))**2
    return l.mean()
Euclidean Loss
The simplest function would be the Euclidean (squared error) loss, which computes the mean squared difference of the bounding box parameters. However, this doesn't take the overlap of the bounding boxes into account, only the difference of the parameters left, right, top, bottom. I implemented it as:
def euclidean_loss(A, B):
    l = lasagne.objectives.squared_error(A, B)
    return l.mean()
Could someone guide me on which would be the best loss function for bounding box regression for this use case, or point out if I am doing something wrong here? Which loss function is usually used in practice?
Speaking from personal implementation experience, I had much better results training a CNN using IOU as the loss function as opposed to Euclidean (MSE or L2) Loss. Have not used the squared diameter difference loss. In general, a loss function that explicitly represents the goodness of your outputs for the tasks you hope to accomplish is probably best.
With regards to the IOU having a value of zero, you can introduce some additional term in the formulation so that it gracefully trends towards 0, perhaps based on normalized distance between bbox centers. This might give the additional effect of helping to center bounding boxes relative to the ground truth.
This response is mostly conceptual but I'd be happy to supply code examples if desired.
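To make the suggestion above concrete, here is one rough sketch of an IOU-based loss with an added normalized center-distance term (written with NumPy for readability; the same ops exist in Theano's tensor module, and the exact weighting of the distance term is an assumption, not a recommendation from any particular paper):

import numpy as np

def iou_with_center_distance_loss(A, B, eps=1e-7):
    # Boxes are assumed to be [x1, y1, x2, y2] with x2 > x1 and y2 > y1,
    # in normalized coordinates.
    iw = np.clip(np.minimum(A[:, 2], B[:, 2]) - np.maximum(A[:, 0], B[:, 0]), 0, None)
    ih = np.clip(np.minimum(A[:, 3], B[:, 3]) - np.maximum(A[:, 1], B[:, 1]), 0, None)
    inter = iw * ih                                   # clipped, so no negative "overlap"

    area_a = (A[:, 2] - A[:, 0]) * (A[:, 3] - A[:, 1])
    area_b = (B[:, 2] - B[:, 0]) * (B[:, 3] - B[:, 1])
    iou = inter / (area_a + area_b - inter + eps)

    # Squared distance between box centers: still provides a useful signal
    # when the boxes do not overlap at all and the IOU term is flat at zero.
    centers_a = (A[:, 0:2] + A[:, 2:4]) / 2.0
    centers_b = (B[:, 0:2] + B[:, 2:4]) / 2.0
    center_dist = ((centers_a - centers_b) ** 2).sum(axis=1)

    return (1.0 - iou + center_dist).mean()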

Blending two functions, where one is inverse

Let me first explain the idea. The actual math question is below the screenshots.
For musical purpose I am building a groove algorithm where event positions are translated by a mathematical function F(X). The positions are normalized inside the groove range, so I am basically dealing with values between zero and one (which makes shaping groove curves way easier-the only limitation is x'>=0).
This groove algorithm accepts any event position and also works by filtering static notes from a data structure like a timeline note track. For filtering events in a certain range (the audio block size) I need the inverse groove function to locate the notes in the track and transform them into the groove space. So far so good. It works!
In short: I use the inverse function because it is the mirror image across y = x. So I can plug in a value x and get a y, and this y can obviously be plugged back into the inverse function to get the original x again.
Problem: I now want to be able to blend one groove into another, but the usual linear (hint hint) blending code does not behave as I expected. To make it easier, I first tried to blend toward y = x.
B(x)=alpha*F(x) + (1-alpha)*x;
iB(x)=alpha*iF(x) + (1-alpha)*x;
For alpha=1 we get the full curve, and for alpha=0 we get the straight line. But for alpha between 0 and 1, B(x) and iB(x) are no longer mirror images of each other (close, but not close enough), while F(x) and iF(x) are still mirrored.
Is there a solution for that (besides quantizing the curve into line segments)? Any subject I should throw an eye on?
You are combining two functions, f(x) and g(x), so that y = a f(x) + (1-a) g(x), and given some y, a, f and g, you want to find x. At least, that is what I understand.
I don't see how to do this in general (although I haven't tried very hard, and it would be worth asking someone else), but I suspect that for "nicely" shaped functions like the ones you seem to be using, Newton's method would be fairly quick.
You want to find x such that y = a f(x) + (1-a) g(x), in other words where 0 = a f(x) + (1-a) g(x) - y.
So define r(x) = a f(x) + (1-a) g(x) - y and find the zero of that. Start with a guess in the middle, x_0 = 0.5, then calculate x_1 = x_0 - r(x_0) / r'(x_0), and repeat. If you are lucky this will converge rapidly (if not, you might consider defining the functions relative to y = x, which you already seem to be doing, and trying again).
See the Wikipedia article on Newton's method.
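A quick sketch of that iteration in Python (f(x) = 2e^x and g(x) = 2x are just example functions; the derivatives, tolerance and starting guess are illustrative choices):

import math

def blended_inverse(y, a, f, g, df, dg, x0=0.5, tol=1e-10, max_iter=50):
    x = x0
    for _ in range(max_iter):
        r = a * f(x) + (1 - a) * g(x) - y      # residual we want to drive to zero
        dr = a * df(x) + (1 - a) * dg(x)       # derivative of the residual
        step = r / dr
        x -= step
        if abs(step) < tol:
            break
    return x

f, df = lambda x: 2 * math.exp(x), lambda x: 2 * math.exp(x)
g, dg = lambda x: 2 * x, lambda x: 2.0

a = 0.5
y = a * f(0.3) + (1 - a) * g(0.3)
print(blended_inverse(y, a, f, g, df, dg))     # ≈ 0.3, recovering the original x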
This problem can't be solved algebraically, in general.
Consider for instance
y = 2e^x (inverse: x = log(0.5 y))
and
y = 2x (inverse: x = 0.5 y).
Blending these together with weight 0.5 gives y = e^x + x, and it is well-known that it is not possible to solve for x here using only elementary functions, even though the inverse of each piece was easy to find.
You will want to use a numerical method to approximate the inverse, as discussed by andrew above.