How does contrastive loss work intuitively in a siamese network? - deep-learning

I am having trouble getting a clear understanding of the contrastive loss used in a siamese network.
Here is the PyTorch formula:
torch.mean((1-label) * torch.pow(euclidean_distance, 2) +
           (label) * torch.pow(torch.clamp(margin - euclidean_distance, min=0.0), 2))
where margin=2.
If we convert this to equation form, it can be written as
(1-Y)*D^2 + Y*max(m - D, 0)^2
Y=0, if both images are from same class
Y=1, if both images are from different class
What I think is: if the images are from the same class, the distance between the embeddings should decrease, and if the images are from different classes, the distance should increase.
I am unable to map this concept to contrastive loss.
Let's say Y is 1 and the distance value is large. The first part becomes zero because of (1-Y), and the second part also becomes zero, because max picks whichever of m - D and 0 is bigger, and that is 0 once D exceeds the margin m.
So the loss is zero, which does not make sense to me.
Can you please help me to understand this?

If the distance of a negative pair is already greater than the specified margin, the pair is already separable from the positive pairs, so there is no benefit in pushing it farther away.
For details please check this blog post, where the concept of "Equilibrium" is explained, along with why the contrastive loss makes reaching this point easier.
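To see that this zero loss is the intended behaviour rather than a bug, here is a small self-contained PyTorch sketch of the same formula; the distances and labels are made-up values, just for illustration:

import torch

def contrastive_loss(dist, label, margin=2.0):
    # label = 0: same class      -> pull embeddings together (penalise distance)
    # label = 1: different class -> push apart, but only until the margin is reached
    positive_term = (1 - label) * dist.pow(2)
    negative_term = label * torch.clamp(margin - dist, min=0.0).pow(2)
    return torch.mean(positive_term + negative_term)

# Hypothetical distances between pairs of embeddings
dist  = torch.tensor([0.5, 1.5, 3.0])
label = torch.tensor([0.0, 1.0, 1.0])   # same, different, different

# Pair 1 (same class, D=0.5):      contributes 0.25 -> still pulled closer
# Pair 2 (different, D=1.5 < m=2): contributes 0.25 -> pushed further apart
# Pair 3 (different, D=3.0 > m=2): contributes 0    -> already far enough, ignored
print(contrastive_loss(dist, label))

The zero contribution of the third pair is exactly the point of the margin: gradient is only spent on negative pairs that are still closer than the margin, and on positive pairs until their distance reaches zero.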

Related

What does a.sub_(lr*a.grad) actually do?

I am doing the fast.ai SGD lesson and I cannot understand this step.
It subtracts (learning rate * gradient) from the coefficients...
But why is it necessary to subtract?
Here is the code:
def update():
    y_hat = x@a
    loss = mse(y_hat, y)
    if t % 10 == 0: print(loss)
    loss.backward()
    with torch.no_grad():
        a.sub_(lr * a.grad)
Picture the loss function J plotted against the parameter W. This is a simplified representation with W being the only parameter, so for a convex loss function the curve is a bowl shape with a single minimum.
Note that the learning rate is positive. On the left side, the gradient (slope of the line tangent to the curve at that point) is negative, so the product of the learning rate and gradient is negative. Thus, subtracting the product from W will actually increase W (since 2 negatives make a positive). In this case, this is good because loss decreases.
On the other hand (on the right side), the gradient is positive, so the product of the learning rate and gradient is positive. Thus, subtracting the product from W reduces W. In this case also, this is good because the loss decreases.
We can extend the same idea to more parameters (the graph becomes higher-dimensional and hard to visualize, which is why we took a single parameter W to start with) and to other loss functions, even non-convex ones, although then it won't always converge to the global minimum, only to a nearby local minimum.
Note: This explanation can be found in Andrew Ng's deeplearning.ai courses, but I couldn't find a direct link, so I wrote this answer.
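To make the sign argument concrete, here is a tiny Python sketch (my own example, not from the course) using the convex loss J(W) = W^2, whose gradient is 2W:

lr = 0.1

def grad(W):          # dJ/dW for J(W) = W**2
    return 2 * W

# Left of the minimum: W negative, gradient negative
W = -3.0
W_new = W - lr * grad(W)       # -3.0 - 0.1*(-6.0) = -2.4 -> moved towards 0
print(W, "->", W_new)

# Right of the minimum: W positive, gradient positive
W = 3.0
W_new = W - lr * grad(W)       #  3.0 - 0.1*(6.0)  =  2.4 -> moved towards 0
print(W, "->", W_new)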
I'm assuming a represents your model parameters, based on y_hat = x @ a. The subtraction is necessary because stochastic gradient descent aims to find a minimum of the loss function: you take the gradient of the loss w.r.t. your model parameters, and update them a little in the direction opposite to that gradient.
Think of the analogy of sliding down a hill: if the landscape represents your loss, the negative gradient is the direction of steepest descent. To get to the bottom (i.e. minimize the loss), you take little steps in that direction from wherever you're standing.
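If it helps to see the whole loop run, below is a minimal, self-contained PyTorch version of the update step from the question. The data x, y, the initial a and the mse function are not shown in the original snippet, so the ones here are invented for illustration; I also zero the gradient after each step, which the snippet above doesn't show:

import torch

# Toy data: y depends roughly linearly on x (values are illustrative only)
x = torch.cat([torch.ones(100, 1), torch.rand(100, 1)], dim=1)
y = x @ torch.tensor([[2.0], [3.0]]) + 0.1 * torch.randn(100, 1)

a = torch.randn(2, 1, requires_grad=True)   # the parameters being learned
lr = 0.1

def mse(y_hat, y):
    return ((y_hat - y) ** 2).mean()

for t in range(100):
    y_hat = x @ a
    loss = mse(y_hat, y)
    if t % 10 == 0: print(loss.item())
    loss.backward()
    with torch.no_grad():
        a.sub_(lr * a.grad)     # step against the gradient to reduce the loss
        a.grad.zero_()          # reset gradients so they don't accumulate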

Loss function for Bounding Box Regression using CNN

I am trying to understand Loss functions for Bounding Box Regression in CNNs. Currently I use Lasagne and Theano, which makes writing loss expressions very easy. Many sources propose different methods and I am asking myself which one is usually used in practice.
The bounding boxes coordinates are represented as normalized coordinates in the order [left, top, right, bottom] (using T.matrix('targets', dtype=theano.config.floatX)).
I have tried the following functions so far; however all of them have their drawbacks.
Intersection over Union
I was advised to use the Intersection over Union measure to identify how well two bounding boxes align and overlap. However, a problem occurs when the boxes don't overlap: the intersection is 0, so the whole quotient becomes 0 regardless of how far apart the bounding boxes are. I implemented it as:
def get_area(A):
    return (A[:,2] - A[:,0]) * (A[:,1] - A[:,3])

def get_intersection(A, B):
    return (T.minimum(A[:,2], B[:,2]) - T.maximum(A[:,0], B[:,0])) \
         * (T.minimum(A[:,1], B[:,1]) - T.maximum(A[:,3], B[:,3]))

def bbox_overlap_loss(A, B):
    """Computes the bounding box overlap using the
    Intersection over Union"""
    intersection = get_intersection(A, B)
    union = get_area(A) + get_area(B) - intersection
    # Turn into loss
    l = 1.0 - intersection / union
    return l.mean()
Squared Diameter Difference
To create an error measure for non-overlapping bounding boxes, I tried to compute the squared difference of the bounding box diameters. It seems to work, but I am almost sure that there is a much better way to do this. I implemented it as:
def squared_diameter_loss(A, B):
    # Represent the squared distance from the real diameter
    # in normalized pixel coordinates
    l = (abs(A[:,0:2]-B[:,0:2]) + abs(A[:,2:4]-B[:,2:4]))**2
    return l.mean()
Euclidean Loss
The simplest option would be the Euclidean loss, which only measures the squared difference of the bounding box parameters (left, top, right, bottom) and doesn't take the overlap of the bounding boxes into account. I implemented it as:
def euclidean_loss(A, B):
    l = lasagne.objectives.squared_error(A, B)
    return l.mean()
Could someone guide me on which would be the best loss function for bounding box regression for this use case or spot if I am doing something wrong here. Which loss function is usually used in practice?
Speaking from personal implementation experience, I had much better results training a CNN using IOU as the loss function as opposed to Euclidean (MSE or L2) Loss. Have not used the squared diameter difference loss. In general, a loss function that explicitly represents the goodness of your outputs for the tasks you hope to accomplish is probably best.
With regards to the IoU being zero for non-overlapping boxes, you can introduce an additional term into the formulation so that the loss still degrades gracefully in that case, for example one based on the normalized distance between the box centers. This may have the additional effect of helping to center the predicted boxes relative to the ground truth.
This response is mostly conceptual but I'd be happy to supply code examples if desired.
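Since code was offered above, here is a rough NumPy sketch (my own, not the answerer's implementation) of what such a combined loss could look like. The box layout [left, top, right, bottom] follows the question, top < bottom is assumed, and the weight alpha on the centre term is an arbitrary choice:

import numpy as np

def iou_with_center_penalty(A, B, alpha=1.0):
    """A, B: arrays of shape (N, 4) with normalized [left, top, right, bottom]
    (assuming top < bottom). Behaves like 1 - IoU when the boxes overlap and
    falls back to a centre-distance term when they do not.
    alpha: arbitrary weight on the centre term (an assumption, tune to taste)."""
    # Intersection, clamped at zero so disjoint boxes don't go negative
    iw = np.maximum(np.minimum(A[:, 2], B[:, 2]) - np.maximum(A[:, 0], B[:, 0]), 0.0)
    ih = np.maximum(np.minimum(A[:, 3], B[:, 3]) - np.maximum(A[:, 1], B[:, 1]), 0.0)
    inter = iw * ih
    area_a = (A[:, 2] - A[:, 0]) * (A[:, 3] - A[:, 1])
    area_b = (B[:, 2] - B[:, 0]) * (B[:, 3] - B[:, 1])
    union = area_a + area_b - inter
    iou = inter / np.maximum(union, 1e-9)

    # Normalized squared distance between box centres: non-zero even with no overlap
    ca = (A[:, 0:2] + A[:, 2:4]) / 2.0
    cb = (B[:, 0:2] + B[:, 2:4]) / 2.0
    center_dist = np.sum((ca - cb) ** 2, axis=1)

    return np.mean((1.0 - iou) + alpha * center_dist)

A differentiable Theano version would follow the same structure; only the np.* calls change (T.minimum, T.maximum, etc., as in the question's code).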

What is the purpose of setEulerZYX() in bullet physics?

I have been looking into the RagdollDemo and have got stuck on the part where setEulerZYX() is used on the basis matrix.
transform.setIdentity();
transform.setOrigin(btVector3(btScalar(-0.35), btScalar(1.45), btScalar(0.)));
transform.getBasis().setEulerZYX(0,0,M_PI_2);
m_bodies[BODYPART_LEFT_UPPER_ARM] = localCreateRigidBody(btScalar(1.), offset*transform, m_shapes[BODYPART_LEFT_UPPER_ARM]);
I did some research, yet couldn't fully understand what exactly this function does and why it is needed. Any help would be very nice.
It is a way (there are others) to set the rotation of the body.
http://bulletphysics.org/Bullet/BulletFull/classbtMatrix3x3.html#a0acce3d3502e34b4f34efd275c140d2a
So this sets the Euler angles to (0, 0, M_PI_2); M_PI_2 is Pi/2, which means a rotation of a quarter turn, i.e. 90 degrees, about a single axis (see the linked btMatrix3x3 documentation for which argument corresponds to which axis).
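For intuition only, here is a small NumPy sketch (not Bullet code) of how three Euler angles about the Z, Y and X axes compose into one 3x3 rotation matrix, which is the kind of basis matrix setEulerZYX fills in. The argument order and axis convention below are this sketch's own; check the linked btMatrix3x3 documentation for Bullet's exact convention.

import numpy as np

def rot_x(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def rot_y(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

# One common ZYX convention: rotate about X, then Y, then Z
def euler_zyx(x, y, z):
    return rot_z(z) @ rot_y(y) @ rot_x(x)

# A quarter turn (90 degrees), analogous to the M_PI_2 in the ragdoll snippet;
# the argument mapping here is this sketch's convention, not necessarily Bullet's.
print(euler_zyx(0, 0, np.pi / 2))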

center of a cluster of points and track shape

I have plots of points which look like this.
The tracks which these points form can be a circle or an ellipse. Clearly the centers of the circular tracks in the two images above are different.
How can I find the center point of these tracks (circular/elliptical)? I want to find the (x, y) coordinates of the center; it does not have to be a point in the plotted data set, i.e. I don't want a medoid.
EDIT: Also, is there any way to find an equation for a circle/ellipse that envelopes a majority of these points? In the elliptical track, I've added an ellipse that envelopes the points on the track. Its values were found by trial and error, and the center was estimated by eyeballing the plot. How can I do this programmatically?
Have a look at the smallest circle problem, and here is a paper (PDF download available) on the smallest ellipse problem. Both have O(N) algorithms and should be able to give you the formula for the circle/ellipse and its area, from which you can get the center. However, they focus on enclosing all of the points. To deal with that you'll need to remove a number of the bounding points, which you should get from the algorithms as well. Unfortunately, it's pretty much up to you to decide what qualifies as a good enough solution.
A fast and simple randomized solution is:
1. Randomly divide the set of points into k sets of N/k points each.
2. Run the smallest circle/ellipse algorithm for each set.
3. For each of the k sets, pick at least 1 but no more than m bounding points to remove from the main point set.
4. Return to step 1, t times.
5. Return the result of the circle/ellipse algorithm on the remaining points.
The algorithm removes between k and mk bounding points every pass at a cost of O(N). For your purpose you'll probably want to remove some percentage of the bounding points, 1-25% seems like a good starting point. This solution assumes that k is very small compared to N, otherwise you'll be removing too many points.
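A sketch of the loop above, assuming a smallest_enclosing_circle(points) helper that returns the centre, radius and bounding points (e.g. an O(N) Welzl-style implementation, not included here):

import numpy as np

def trim_outliers(points, smallest_enclosing_circle, k=8, m=2, t=5):
    """Rough sketch of the randomized trimming loop described above.
    smallest_enclosing_circle(pts) is assumed to return
    (center, radius, boundary_points); it is not implemented here."""
    pts = np.asarray(points, dtype=float)
    for _ in range(t):
        order = np.random.permutation(len(pts))
        chunks = np.array_split(pts[order], k)
        drop = set()
        for chunk in chunks:
            _, _, boundary = smallest_enclosing_circle(chunk)
            # remove at least 1, at most m bounding points per subset
            drop.update(map(tuple, boundary[:max(1, m)]))
        pts = np.array([p for p in pts if tuple(p) not in drop])
    # final fit on the surviving points
    return smallest_enclosing_circle(pts)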
A slower but likely better algorithm is useful if you want to repeatedly remove one or all of the bounding points from the smallest ellipse, recalculate the smallest ellipse, and remove the bounding points again.
You can do this with a tree in which each parent node stores the bounding points (kept as a set for faster removal) of the smallest enclosing ellipse of its children. The maximum number of bounding points should be no more than k (which I think is 9 for an ellipse, compared to 3 for a circle). Removing a point from the data structure is then O(k log N): it requires recalculating the smallest ellipse, which is O(k), for each affected parent, and there are O(log N) of them. So removing m points from the data structure should be O(mk log N). You might also want to record the area of the ellipse after every removed point, removing point after point at a total cost of O(Nk log N) until only three points are left, and then analyze the area data to decide which ellipse to use. A simple rule would be to use the ellipse whose area is closest to the average area of all of the ellipses created, but that may not be exactly what you seek. It might also be too slow, in which case I recommend a single pass of the faster algorithm.
This looks like an instance of robust ellipse fitting. Check this paper: "Outlier Elimination for Robust Ellipse and Ellipsoid Fitting", http://arxiv.org/pdf/0910.4610.pdf.
A first rough and easy solution is provided by the ellipse of inertia (the 2D version of the ellipsoid of inertia, http://en.wikipedia.org/wiki/Moment_of_inertia#Inertia_ellipsoid). Its center is just the centroid, and its axes are given by the eigenvectors/eigenvalues of the 2x2 inertia matrix.
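For the ellipse-of-inertia suggestion, a minimal NumPy sketch might look like the following; points is an N x 2 array, and how much you scale the axes so the ellipse envelopes a chosen fraction of the points is left as a free parameter:

import numpy as np

def inertia_ellipse(points):
    """Centroid plus principal axes of the 2D 'ellipse of inertia' of a point cloud."""
    pts = np.asarray(points, dtype=float)
    center = pts.mean(axis=0)                  # centroid = ellipse centre
    cov = np.cov((pts - center).T)             # 2x2 covariance (inertia-like) matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # principal axes of the cloud
    # Axis half-lengths proportional to sqrt of the eigenvalues; scale to taste
    half_lengths = np.sqrt(eigvals)
    return center, eigvecs, half_lengths

# Example with synthetic points roughly on an ellipse (made-up data)
theta = np.linspace(0, 2 * np.pi, 200)
pts = np.c_[3 * np.cos(theta) + 1, 1.5 * np.sin(theta) - 2]
pts += 0.1 * np.random.randn(*pts.shape)       # add some noise
print(inertia_ellipse(pts))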

How to divide tiny double precision numbers correctly without precision errors?

I'm trying to diagnose and fix a bug which boils down to X/Y yielding an unstable result when X and Y are small.
In this case, both cx and patharea increase smoothly. Their ratio approaches a smooth asymptote for large values, but is erratic for "small" ones. The obvious first thought is that we're reaching the limit of floating-point accuracy, but the actual numbers themselves are nowhere near it. ActionScript "Number" types are IEEE 754 double-precision floats, so they should have about 15 decimal digits of precision (if I read it right).
Some typical values of the denominator (patharea):
0.0000000002119123
0.0000000002137313
0.0000000002137313
0.0000000002155502
0.0000000002182787
0.0000000002200977
0.0000000002210072
And the numerator (cx):
0.0000000922932995
0.0000000930474444
0.0000000930582124
0.0000000938123574
0.0000000950458711
0.0000000958000159
0.0000000962901528
0.0000000970442977
0.0000000977984426
Each of these increases monotonically, but the ratio is chaotic as seen above.
At larger numbers it settles down to a smooth hyperbola.
So, my question: what's the correct way to deal with very small numbers when you need to divide one by another?
I thought of multiplying numerator and/or denominator by 1000 in advance, but couldn't quite work it out.
The actual code in question is the recalculate() function here. It computes the centroid of a polygon, but when the polygon is tiny, the centroid jumps erratically around the place, and can end up a long distance from the polygon. The data series above are the result of moving one node of the polygon in a consistent direction (by hand, which is why it's not perfectly smooth).
This is Adobe Flex 4.5.
I believe the problem most likely is caused by the following line in your code:
sc = (lx*latp-lon*ly)*paint.map.scalefactor;
If your polygon is very small, then lx and lon are almost the same, as are ly and latp. The two products lx*latp and lon*ly are therefore both very large compared to their difference, so you are subtracting two numbers that are almost equal.
To get around this, we can make use of the fact that:
x1*y2 - x2*y1 = (x2 + (x1-x2))*y2 - x2*(y2 + (y1-y2))
              = x2*y2 + (x1-x2)*y2 - x2*y2 - x2*(y1-y2)
              = (x1-x2)*y2 - x2*(y1-y2)
So, try this:
dlon = lx - lon
dlat = ly - latp
sc = (dlon*latp-lon*dlat)*paint.map.scalefactor;
The value is mathematically the same, but the terms are an order of magnitude smaller, so the error should be an order of magnitude smaller as well.
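Here is a quick Python sketch of the two formulations side by side (the question's code is ActionScript, but IEEE 754 doubles behave the same); the coordinate values are invented but have the same flavour as the ones in the question:

# Coordinates of moderate size whose differences are tiny (made-up values)
lx, lon  = 50.0000001, 50.0000000     # x1, x2: almost equal
ly, latp = 7.0000002,  7.0000001      # y1, y2: almost equal

# Naive form: subtracts two products (~350) that agree to many digits
naive = lx * latp - lon * ly

# Rewritten form: every term is already on the scale of the result
dlon = lx - lon
dlat = ly - latp
rewritten = dlon * latp - lon * dlat

print(naive, rewritten)   # mathematically identical; the second loses fewer digits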
Jeffrey Sax has correctly identified the basic issue - loss of precision from combining terms that are (much) larger than the final result.
The suggested rewriting eliminates part of the problem - apparently sufficient for the actual case, given the happy response.
You may find, however, that if the polygon becomes (much) smaller again and/or moves farther away from the origin, inaccuracy will show up again. In the rewritten formula the terms are still quite a bit larger than their difference.
Furthermore, there's another 'combining large, comparable numbers with different signs' issue in the algorithm. The various 'sc' values in subsequent cycles of the iteration over the edges of the polygon effectively combine into a final number that is (much) smaller than the individual sc(i) are. (If you have a convex polygon you will find one contiguous sequence of positive values and one contiguous sequence of negative values; in non-convex polygons the negatives and positives may be intertwined.)
What the algorithm is doing, effectively, is computing the area of the polygon by adding areas of triangles spanned by the edges and the origin, where some of the terms are negative (whenever an edge is traversed clockwise, viewing it from the origin) and some positive (anti-clockwise walk over the edge).
You get rid of ALL the loss-of-precision issues by putting the origin at one of the polygon's corners, say (lx,ly), and then adding the triangle surfaces spanned by the edges and that corner (i.e. transforming lon to (lon-lx) and latp to (latp-ly)), with the additional bonus that you need to process two fewer triangles, because the edges that connect to the chosen origin corner obviously yield zero surface.
For the area part that's all. For the centroid part, you will of course have to transform the result back to the original frame, i.e. add (lx,ly) at the end.
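To make this concrete, here is a small Python sketch (not the original ActionScript) of the centroid computed as a fan of triangles around the first vertex, with the result shifted back at the end:

def polygon_centroid(pts):
    """Signed area and centroid of a simple polygon given as a list of (x, y) tuples.
    All cross products are taken relative to the first vertex, so the terms stay
    on the scale of the polygon itself even if it is tiny and far from the origin."""
    x0, y0 = pts[0]
    area2 = 0.0          # twice the signed area
    cx = cy = 0.0
    for (x1, y1), (x2, y2) in zip(pts[1:], pts[2:]):
        # shift both vertices so the first vertex becomes the local origin
        ax, ay = x1 - x0, y1 - y0
        bx, by = x2 - x0, y2 - y0
        cross = ax * by - ay * bx          # twice the triangle's signed area
        area2 += cross
        cx += (ax + bx) * cross            # centroid accumulators (local frame)
        cy += (ay + by) * cross
    area = area2 / 2.0
    # shift the centroid back to the original frame
    return area, (x0 + cx / (3.0 * area2), y0 + cy / (3.0 * area2))

# Tiny square far from the origin: the centroid should be its middle
print(polygon_centroid([(1000.0, 1000.0), (1000.0001, 1000.0),
                        (1000.0001, 1000.0001), (1000.0, 1000.0001)]))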