What does a.sub_(lr*a.grad) actually do? - deep-learning

I am taking the fast.ai course and I am stuck on the SGD lesson.
I understand that this line subtracts (learning rate * gradient) from the coefficients.
But why is it necessary to subtract?
Here is the code:
def update():
    y_hat = x @ a
    loss = mse(y_hat, y)
    if t % 10 == 0: print(loss)
    loss.backward()
    with torch.no_grad():
        a.sub_(lr * a.grad)

Look at the image. It shows the loss function J as a function of the parameter W. Here it is a simplified representation with W being the only parameter. So, for a convex loss function, the curve looks as shown.
Note that the learning rate is positive. On the left side, the gradient (slope of the line tangent to the curve at that point) is negative, so the product of the learning rate and gradient is negative. Thus, subtracting the product from W will actually increase W (since 2 negatives make a positive). In this case, this is good because loss decreases.
On the other hand (on the right side), the gradient is positive, so the product of the learning rate and gradient is positive. Thus, subtracting the product from W reduces W. In this case also, this is good because the loss decreases.
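The two cases above can be checked numerically. The following is a minimal sketch in plain Python, assuming a toy convex loss J(W) = (W - 3)^2; the loss, the starting points and the learning rate are illustrative choices, not taken from the course.

```python
# Toy convex loss J(W) = (W - 3)^2 with its minimum at W = 3 (an assumption
# made for illustration only).

def loss(w):
    return (w - 3.0) ** 2

def grad(w):
    # dJ/dW: negative to the left of the minimum, positive to the right
    return 2.0 * (w - 3.0)

def sgd_step(w, lr=0.1):
    # Same rule as a.sub_(lr * a.grad): subtract lr * gradient
    return w - lr * grad(w)

def descend(w, lr=0.1, steps=100):
    # Repeat the update; W should approach the minimum at 3
    for _ in range(steps):
        w = sgd_step(w, lr)
    return w

for start in (0.0, 5.0):
    stepped = sgd_step(start)
    print("W:", start, "gradient:", grad(start), "W after one step:", stepped)
    print("loss before:", loss(start), "loss after:", loss(stepped))
```

Starting left of the minimum (W = 0) the gradient is negative and the step increases W; starting right of it (W = 5) the gradient is positive and the step decreases W. In both cases the loss goes down.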
We can extend the same reasoning to more parameters (the graph becomes higher-dimensional and hard to visualize, which is why we started with a single parameter W) and to other loss functions, including non-convex ones. For a non-convex loss, gradient descent is not guaranteed to reach the global minimum; with a suitably small learning rate it typically settles in a nearby local minimum.
Note: This explanation can be found in Andrew Ng's deeplearning.ai courses, but I couldn't find a direct link, so I wrote this answer.

I'm assuming a represents your model parameters, based on y_hat = x @ a. The subtraction is necessary because the stochastic gradient descent algorithm aims to find a minimum of the loss function. Therefore, you take the gradient of the loss w.r.t. your model parameters, and update them a little in the direction opposite to the gradient.
Think of the analogy of sliding down a hill: if the landscape represents your loss, the gradient points in the direction of steepest ascent, so the negative gradient is the direction of steepest descent. To get to the bottom (i.e. minimize the loss), you take little steps in the direction of steepest descent from where you're standing.
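The sign of the update can also be checked with a first-order Taylor expansion. In the sketch below, L is the loss, a the parameters, g the gradient and \eta the learning rate lr; these symbol names are chosen here for illustration, and the approximation only holds for a small enough \eta.

```latex
\begin{align}
a_{\text{new}} &= a - \eta\, g, \qquad g = \nabla_a L(a) \\
L(a_{\text{new}}) &\approx L(a) + g^{\top}\,(a_{\text{new}} - a)
                   = L(a) - \eta\, \lVert g \rVert^{2} \;\le\; L(a)
\end{align}
```

Adding \eta g instead of subtracting it would give L(a) + \eta \lVert g \rVert^2 to first order, so the loss would go up rather than down.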

Related

How does contrastive loss work intuitively in a Siamese network?

I am having trouble getting a clear understanding of the contrastive loss used in Siamese networks.
Here is the PyTorch formula:
torch.mean((1-label) * torch.pow(euclidean_distance, 2) +
           (label) * torch.pow(torch.clamp(margin - euclidean_distance, min=0.0), 2))
where margin=2.
If we convert this to equation form, it can be written as
(1-Y)*D^2 + Y*max(m-D, 0)^2
Y=0 if both images are from the same class
Y=1 if the images are from different classes
As I understand it, if the images are from the same class, the distance between their embeddings should decrease, and if they are from different classes, the distance should increase.
I am unable to map this concept to the contrastive loss.
Say Y is 1 and the distance D is larger than the margin m: the first term becomes zero because of (1-Y), and the second term also becomes zero, because max picks the larger of m-D and 0.
So the loss is zero, which does not make sense to me.
Can you please help me understand this?
If the distance of a negative sample is greater than the specified margin, it should be already separable from a positive sample. Therefore, there is no benefit in pushing it farther away.
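To make this concrete, here is a minimal per-pair sketch in plain Python (not the PyTorch expression from the question), using the same convention of label 0 for the same class and label 1 for different classes. The function names and the sample distances are assumptions chosen for illustration.

```python
def contrastive_loss(distance, label, margin=2.0):
    """Per-pair loss: label 0 = same class, label 1 = different class."""
    same_term = (1 - label) * distance ** 2
    diff_term = label * max(margin - distance, 0.0) ** 2
    return same_term + diff_term

def contrastive_grad(distance, label, margin=2.0):
    """Derivative of the per-pair loss with respect to the distance."""
    same_grad = (1 - label) * 2.0 * distance
    diff_grad = label * -2.0 * max(margin - distance, 0.0)
    return same_grad + diff_grad

# Different-class pair already farther apart than the margin: nothing to fix.
print(contrastive_loss(3.0, 1), contrastive_grad(3.0, 1))
# Different-class pair inside the margin: negative gradient, so gradient
# descent increases the distance.
print(contrastive_loss(0.5, 1), contrastive_grad(0.5, 1))
# Same-class pair: positive gradient, so gradient descent shrinks the distance.
print(contrastive_loss(0.5, 0), contrastive_grad(0.5, 0))
```

A zero loss for a distant negative pair therefore means "this pair is already separated well enough", not that the loss is broken.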
For details please check this blog post, where the concept of "Equilibrium" gets explained and why the Contrastive Loss makes reaching this point easier.

Gradient Ascent vs Gradient Descent

I'm a programmer who has just recently started looking into machine and deep learning.
What exactly is the difference between the usages of gradient ascent and gradient descent? Why would we want to maximize a loss instead of minimizing it? More specifically, I'm curious about its usage for convolutional networks.
The difference is only a sign: gradient ascent changes the parameters along the gradient of the function (so its value increases), while gradient descent changes them against the gradient (so its value decreases).
You almost never want to increase the loss (apart from say some form of gamified system, e.g. a GAN). But if you frame your problem as maximisation of probability of correct answer then you want to utilise gradient ascent. It is always a dual thing, for every problem expressed as gradient ascent of something you can think about it as gradient descent of minus this function, and vice versa.
theta_t + grad(f)[theta_t]  =  theta_t - grad(-f)[theta_t]
(gradient ascent on f)         (gradient descent on -f)
In other words there is absolutely no difference in usage of these two methods, they are equivalent. The reason why people use one or the other is just what helps explain the method in most natural terms. It is more natural to say "I am going to decrease the cost" or "I am going to maximise the probability" than it is to say "I am going to decrease minus cost" or "I am going to minimise 1 minus probability".
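The equivalence can be verified on a toy objective. This is a minimal sketch in plain Python; the function f, its hand-written gradients and the step size are assumptions chosen for illustration.

```python
def f(theta):
    # Toy objective with its maximum at theta = 2
    return -(theta - 2.0) ** 2

def grad_f(theta):
    return -2.0 * (theta - 2.0)

def neg_f(theta):
    # The same problem written as a cost to minimize
    return -f(theta)

def grad_neg_f(theta):
    return 2.0 * (theta - 2.0)

def ascent_step(theta, grad, lr=0.1):
    # Move along the gradient
    return theta + lr * grad(theta)

def descent_step(theta, grad, lr=0.1):
    # Move against the gradient
    return theta - lr * grad(theta)

theta = 0.5
print(ascent_step(theta, grad_f), descent_step(theta, grad_neg_f))
```

Both calls produce the same new theta, and that step raises f while lowering -f by the same amount.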

Loss function for Bounding Box Regression using CNN

I am trying to understand Loss functions for Bounding Box Regression in CNNs. Currently I use Lasagne and Theano, which makes writing loss expressions very easy. Many sources propose different methods and I am asking myself which one is usually used in practice.
The bounding boxes coordinates are represented as normalized coordinates in the order [left, top, right, bottom] (using T.matrix('targets', dtype=theano.config.floatX)).
I have tried the following functions so far; however all of them have their drawbacks.
Intersection over Union
I was advised to use the Intersection over Union measure to identify how well the 2 bounding boxes align and overlap. However, a problem occurs when the boxes don't overlap and the intersection is 0; then the whole quotient becomes 0 regardless of how far apart the bounding boxes are. I implemented it as:
def get_area(A):
    return (A[:,2] - A[:,0]) * (A[:,1] - A[:,3])

def get_intersection(A, B):
    return (T.minimum(A[:,2], B[:,2]) - T.maximum(A[:,0], B[:,0])) \
         * (T.minimum(A[:,1], B[:,1]) - T.maximum(A[:,3], B[:,3]))

def bbox_overlap_loss(A, B):
    """Computes the bounding box overlap using the
    Intersection over union"""
    intersection = get_intersection(A, B)
    union = get_area(A) + get_area(B) - intersection
    # Turn into loss
    l = 1.0 - intersection / union
    return l.mean()
Squared Diameter Difference
To create an error measure for non-overlapping bounding boxes, I tried to compute the squared difference of the bounding box diameters. It seems to work, but I am almost sure that there is a much better way to do this. I implemented it as:
def squared_diameter_loss(A, B):
    # Represent the squared distance from the real diameter
    # in normalized pixel coordinates
    l = (abs(A[:,0:2]-B[:,0:2]) + abs(A[:,2:4]-B[:,2:4]))**2
    return l.mean()
Euclidean Loss
The simplest function would be the Euclidean loss, which is based on the squared difference of the bounding box parameters (the implementation below averages the squared errors and does not take the square root). However, this doesn't take into account the area of overlap between the bounding boxes, only the differences of the parameters left, top, right, bottom. I implemented it as:
def euclidean_loss(A, B):
    l = lasagne.objectives.squared_error(A, B)
    return l.mean()
Could someone guide me on which would be the best loss function for bounding box regression for this use case or spot if I am doing something wrong here. Which loss function is usually used in practice?
Speaking from personal implementation experience, I had much better results training a CNN using IOU as the loss function as opposed to Euclidean (MSE or L2) Loss. Have not used the squared diameter difference loss. In general, a loss function that explicitly represents the goodness of your outputs for the tasks you hope to accomplish is probably best.
With regards to the IOU having a value of zero, you can introduce some additional term in the formulation so that it gracefully trends towards 0, perhaps based on normalized distance between bbox centers. This might give the additional effect of helping to center bounding boxes relative to the ground truth.
This response is mostly conceptual but I'd be happy to supply code examples if desired.
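As a rough illustration of the idea of adding a distance term, here is a minimal sketch in plain Python (not Theano) for a single pair of boxes given as [left, top, right, bottom]. The helper names are hypothetical, and the particular penalty (squared distance between centers divided by the squared diagonal of the smallest enclosing box) is an assumption on my part; it is essentially the form known as Distance-IoU loss, not necessarily what the answer above had in mind.

```python
def _normalize(box):
    # Accept either y-axis orientation: return (x_lo, y_lo, x_hi, y_hi)
    left, top, right, bottom = box
    return min(left, right), min(top, bottom), max(left, right), max(top, bottom)

def iou(a, b):
    ax0, ay0, ax1, ay1 = _normalize(a)
    bx0, by0, bx1, by1 = _normalize(b)
    # Clamp at zero so non-overlapping boxes give an intersection of 0
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0

def center_penalty(a, b):
    ax0, ay0, ax1, ay1 = _normalize(a)
    bx0, by0, bx1, by1 = _normalize(b)
    # Squared distance between the two box centers
    dx = (ax0 + ax1) / 2.0 - (bx0 + bx1) / 2.0
    dy = (ay0 + ay1) / 2.0 - (by0 + by1) / 2.0
    # Squared diagonal of the smallest box enclosing both boxes
    ex = max(ax1, bx1) - min(ax0, bx0)
    ey = max(ay1, by1) - min(ay0, by0)
    diag2 = ex * ex + ey * ey
    return (dx * dx + dy * dy) / diag2 if diag2 > 0 else 0.0

def iou_distance_loss(a, b):
    return 1.0 - iou(a, b) + center_penalty(a, b)

target = (0.0, 0.0, 1.0, 1.0)
near = (2.0, 0.0, 3.0, 1.0)
far = (5.0, 0.0, 6.0, 1.0)
print(iou_distance_loss(target, near), iou_distance_loss(target, far))
```

Plain 1 - IoU is 1.0 for both non-overlapping boxes, while the extra term makes the farther box cost more, so there is still a signal that pulls the prediction towards the target.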

How to divide tiny double precision numbers correctly without precision errors?

I'm trying to diagnose and fix a bug which boils down to X/Y yielding an unstable result when X and Y are small:
In this case, both cx and patharea increase smoothly. Their ratio approaches a smooth asymptote for large values, but is erratic for "small" values. The obvious first thought is that we're reaching the limit of floating point accuracy, but the actual numbers themselves are nowhere near it. ActionScript "Number" types are IEEE 754 double-precision floats, so they should have about 15 decimal digits of precision (if I read it right).
Some typical values of the denominator (patharea):
0.0000000002119123
0.0000000002137313
0.0000000002137313
0.0000000002155502
0.0000000002182787
0.0000000002200977
0.0000000002210072
And the numerator (cx):
0.0000000922932995
0.0000000930474444
0.0000000930582124
0.0000000938123574
0.0000000950458711
0.0000000958000159
0.0000000962901528
0.0000000970442977
0.0000000977984426
Each of these increases monotonically, but the ratio is chaotic as seen above.
At larger numbers it settles down to a smooth hyperbola.
So, my question: what's the correct way to deal with very small numbers when you need to divide one by another?
I thought of multiplying numerator and/or denominator by 1000 in advance, but couldn't quite work it out.
The actual code in question is the recalculate() function here. It computes the centroid of a polygon, but when the polygon is tiny, the centroid jumps erratically around the place, and can end up a long distance from the polygon. The data series above are the result of moving one node of the polygon in a consistent direction (by hand, which is why it's not perfectly smooth).
This is Adobe Flex 4.5.
I believe the problem most likely is caused by the following line in your code:
sc = (lx*latp-lon*ly)*paint.map.scalefactor;
If your polygon is very small, then lx and lon are almost the same, as are ly and latp. The two products lx*latp and lon*ly are both very large compared to their difference, so you are subtracting two numbers that are almost equal, and most of the significant digits cancel.
To get around this, we can make use of the fact that:
x1*y2 - x2*y1 = (x2+(x1-x2))*y2 - x2*(y2+(y1-y2))
              = x2*y2 + (x1-x2)*y2 - x2*y2 - x2*(y1-y2)
              = (x1-x2)*y2 - x2*(y1-y2)
So, try this:
dlon = lx - lon
dlat = ly - latp
sc = (dlon*latp-lon*dlat)*paint.map.scalefactor;
The value is mathematically the same, but the terms are much smaller in magnitude, so the rounding error should be correspondingly smaller as well.
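The effect is easy to reproduce outside ActionScript, since Python floats are also IEEE 754 doubles. The following is a minimal sketch with x1, y1, x2, y2 standing in for lx, ly, lon, latp; the coordinates are made-up values (powers of two plus small offsets) chosen so that the arithmetic can be checked by hand, and `cross_exact` uses exact rational arithmetic as the reference.

```python
from fractions import Fraction

def cross_naive(x1, y1, x2, y2):
    # Two huge products that nearly cancel
    return x1 * y2 - x2 * y1

def cross_diff(x1, y1, x2, y2):
    # Rewritten form: take the small differences first
    dx = x1 - x2
    dy = y1 - y2
    return dx * y2 - x2 * dy

def cross_exact(x1, y1, x2, y2):
    # Exact reference value computed with rational arithmetic
    fx1, fy1, fx2, fy2 = (Fraction(v) for v in (x1, y1, x2, y2))
    return fx1 * fy2 - fx2 * fy1

# Two nearby points far from the origin (illustrative values)
x2, y2 = 2.0**20 + 1.0, 2.0**20 + 3.0
x1, y1 = x2 + 2.0**-20, y2 + 2.0**-21

exact = cross_exact(x1, y1, x2, y2)
print("naive error:", float(abs(Fraction(cross_naive(x1, y1, x2, y2)) - exact)))
print("diff error: ", float(abs(Fraction(cross_diff(x1, y1, x2, y2)) - exact)))
```

With these inputs the naive form loses the small fractional part of the result when the two large products are rounded, while the differenced form reproduces the exact value.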
Jeffrey Sax has correctly identified the basic issue - loss of precision from combining terms that are (much) larger than the final result.
The suggested rewriting eliminates part of the problem - apparently sufficient for the actual case, given the happy response.
You may find, however, that if the polygon becomes (much) smaller again and/or moves farther away from the origin, the inaccuracy will show up again. In the rewritten formula the terms are still quite a bit larger than their difference.
Furthermore, there's another 'combining-large&comparable-numbers-with-different-signs'-issue in the algorithm. The various 'sc' values in subsequent cycles of the iteration over the edges of the polygon effectively combine into a final number that is (much) smaller than the individual sc(i) are. (if you have a convex polygon you will find that there is one contiguous sequence of positive values, and one contiguous sequence of negative values, in non-convex polygons the negatives and positives may be intertwined).
What the algorithm is doing, effectively, is computing the area of the polygon by adding areas of triangles spanned by the edges and the origin, where some of the terms are negative (whenever an edge is traversed clockwise, viewing it from the origin) and some positive (anti-clockwise walk over the edge).
You get rid of ALL the loss-of-precision issues by defining the origin at one of the polygon's corners, say (lx,ly), and then adding the triangle areas spanned by the edges and that corner (so: transforming lon to (lon-lx) and latp to (latp-ly)). As an additional bonus, you need to process two fewer triangles, because the edges that link to the chosen origin corner obviously yield zero area.
For the area-part that's all. For the centroid-part, you will of course have to "transform back" the result to the original frame, i.e. adding (lx,ly) at the end.
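The shifted-origin approach can be sketched as follows. This is a minimal Python version of the standard shoelace area and centroid formulas, not the recalculate() function from the question; the function name and the test polygon are assumptions for illustration.

```python
def polygon_area_centroid(points):
    """Signed area and centroid of a simple polygon given as (x, y) vertices."""
    x0, y0 = points[0]
    # Shift so that the first vertex is the origin; all terms become small
    local = [(x - x0, y - y0) for x, y in points]
    twice_area = 0.0
    sum_x = 0.0
    sum_y = 0.0
    # Walk over the edges, pairing each vertex with the next one (wrapping)
    for (xa, ya), (xb, yb) in zip(local, local[1:] + local[:1]):
        cross = xa * yb - xb * ya
        twice_area += cross
        sum_x += (xa + xb) * cross
        sum_y += (ya + yb) * cross
    if twice_area == 0.0:
        raise ValueError("degenerate polygon with zero area")
    cx = sum_x / (3.0 * twice_area)
    cy = sum_y / (3.0 * twice_area)
    # Transform the centroid back to the original frame
    return twice_area / 2.0, (cx + x0, cy + y0)

# A tiny square far away from the origin (illustrative values)
bx, by, s = 2.0**20 + 1.0, 2.0**20 + 3.0, 2.0**-20
square = [(bx, by), (bx + s, by), (bx + s, by + s), (bx, by + s)]
area, centroid = polygon_area_centroid(square)
print(area, centroid)
```

The edges touching the chosen corner contribute zero here, as described above; the sketch still loops over them for simplicity. The area is signed, positive for counter-clockwise vertex order.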

Blending two functions, where one is inverse

Let me first explain the idea. The actual math question is below the screenshots.
For musical purpose I am building a groove algorithm where event positions are translated by a mathematical function F(X). The positions are normalized inside the groove range, so I am basically dealing with values between zero and one (which makes shaping groove curves way easier-the only limitation is x'>=0).
This groove algorithm accepts any event position and also works by filtering static notes from a data structure such as a timeline note track. To filter events in a certain range (the audio block size) I need the inverse groove function to locate the notes in the track and transform them into the groove space. So far so good. It works!
In short: I use an inverse function, which is the original function mirrored about the line y=x. So I can plug in a value x and get a y. This y can then be plugged into the inverse function to get the original x back.
Problem: I now want to be able to blend the groove into another, but the usual linear (hint hint) blending code does not behave like I expected it. To make it easier, I first tried to blend to y=x.
B(x)=alpha*F(x) + (1-alpha)*x;
iB(x)=alpha*iF(x) + (1-alpha)*x;
For alpha=1 we get the full curve. For alpha=0 we get the straight line. But for alpha between 0 and 1 B(x) and iB(x) are not mirrored anymore (close, but not enough), F(x) and iF(x) are still mirrored.
Is there a solution for that (besides quantizing the curve into line segments)? Any topic I should look into?
you are combining two functions, f(x) and g(x), so that y = a f(x) + (1-a) g(x). and given some y, a, f and g, you want to find x. at least, that is what i understand.
i don't see how to do this generally (although i haven't tried very hard - i mean, it would be worth asking someone else), but i suspect that for "nice" shaped functions, like you seem to be using, newton's method would be fairly quick.
you want to find x such that y = a f(x) + (1-a) g(x). in other words, when 0 = a f(x) + (1-a) g(x) - y.
so let's define r(x) = a f(x) + (1-a) g(x) - y and find the "zero" of that. start with a guess in the middle, x_0 = 0.5. calculate x_1 = x_0 - r(x_0) / r'(x_0). repeat. if you are lucky this will rapidly converge (if not, you might consider defining the functions relative to y=x, which you already seem to be doing, and trying it again).
see wikipedia
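The iteration described above can be sketched like this, for the case from the question where the second function is g(x) = x. It is a minimal Python version that assumes the derivative of f is available; the groove curve f(x) = x^2, the blend weight and the tolerances are illustrative assumptions.

```python
def invert_blend(y, f, df, alpha, x0=0.5, tol=1e-12, max_iter=50):
    """Solve alpha*f(x) + (1-alpha)*x = y for x with Newton's method."""
    x = x0
    for _ in range(max_iter):
        # r(x) = a f(x) + (1-a) x - y, whose zero is the wanted x
        r = alpha * f(x) + (1.0 - alpha) * x - y
        if abs(r) < tol:
            return x
        slope = alpha * df(x) + (1.0 - alpha)
        if slope == 0.0:
            raise ZeroDivisionError("flat blend curve; Newton step undefined")
        x -= r / slope
    raise RuntimeError("Newton iteration did not converge")

def f(x):
    # Hypothetical groove curve on [0, 1]
    return x * x

def df(x):
    return 2.0 * x

def blend(x, alpha):
    return alpha * f(x) + (1.0 - alpha) * x

alpha = 0.5
for x in (0.0, 0.25, 0.9):
    y = blend(x, alpha)
    print(x, y, invert_blend(y, f, df, alpha))
```

Inverting the blended curve numerically gives a true mirror image of B(x) for any alpha, which blending F and its inverse separately does not.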
This problem can't be solved algebraically, in general.
Consider for instance
y = 2e^x (inverse x = log(0.5y))
and
y = 2x (inverse x = 0.5y).
Blending these together with weight 0.5 gives y = e^x+x, and it is well-known that it is not possible to solve for x here using only elementary functions, even though the inverse of each piece was easy to find.
You will want to use a numerical method to approximate the inverse, as discussed by andrew above.