Gradient Ascent vs Gradient Descent - deep-learning

I'm a programmer who has just recently started looking into machine learning and deep learning.
What exactly is the difference between the use cases for gradient ascent and gradient descent? Why would we want to maximise a loss instead of minimising it? More specifically, I'm curious about its usage for convolutional networks.

The difference is just the sign: gradient ascent means changing the parameters along the gradient of the function (so as to increase its value), and gradient descent means changing them against the gradient (thus decreasing it).
You almost never want to increase the loss (apart from, say, some form of gamified system, e.g. a GAN). But if you frame your problem as maximisation of the probability of the correct answer, then you want to utilise gradient ascent. It is always a dual thing: every problem expressed as gradient ascent of some function can be thought of as gradient descent of minus that function, and vice versa.
theta_{t+1} = theta_t + grad(f)(theta_t)      <- gradient ascent on f
            = theta_t - grad(-f)(theta_t)     <- gradient descent on -f
In other words there is absolutely no difference in the usage of these two methods; they are equivalent. The reason people use one or the other is just which one explains the method in the most natural terms. It is more natural to say "I am going to decrease the cost" or "I am going to maximise the probability" than it is to say "I am going to decrease minus the cost" or "I am going to minimise 1 minus the probability".
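To make the duality concrete, here is a minimal numerical sketch (the toy concave function f(theta) = -(theta - 3)^2 is my own choice, purely for illustration): gradient ascent on f and gradient descent on -f produce identical iterates at every step.

def grad_f(theta):
    # gradient of the toy concave function f(theta) = -(theta - 3)**2
    return -2.0 * (theta - 3.0)

lr = 0.1
theta_ascent = theta_descent = 0.0
for _ in range(50):
    theta_ascent = theta_ascent + lr * grad_f(theta_ascent)        # ascent on f
    theta_descent = theta_descent - lr * (-grad_f(theta_descent))  # descent on -f
    assert abs(theta_ascent - theta_descent) < 1e-12               # always equal

print(theta_ascent, theta_descent)  # both converge to the maximiser, 3.0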

Related

How does contrastive loss work intuitively in a siamese network?

I am having trouble getting a clear understanding of the contrastive loss used in a siamese network.
Here is the PyTorch formula:
torch.mean((1-label) * torch.pow(euclidean_distance, 2) +
(label) * torch.pow(torch.clamp(margin - euclidean_distance, min=0.0), 2))
where margin=2.
If we convert this to equation form, it can be written as
(1-Y) * D^2 + Y * max(m - D, 0)^2
Y=0, if both images are from same class
Y=1, if both images are from different class
What I think: if the images are from the same class, the distance between the embeddings should decrease, and if the images are from different classes, the distance should increase.
I am unable to map this concept to the contrastive loss.
Let's say Y is 1 and the distance value is large. The first part becomes zero because of (1-Y), and the second part also becomes zero, because max picks whichever of m-D and 0 is bigger, which is 0 here.
So the loss is zero, which does not make sense to me.
Can you please help me understand this?
If the distance for a negative (different-class) pair is already greater than the specified margin, that pair is already separable from positive pairs. Therefore, there is no benefit in pushing it farther away, and a loss of zero is exactly what you want.
For details please check this blog post, where the concept of "Equilibrium" is explained, and why the contrastive loss makes reaching this point easier.
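Here is a minimal runnable sketch of that loss (the function name and toy embeddings are my own, for illustration). It also shows numerically that a different-class pair whose distance already exceeds the margin contributes zero loss, while the same embeddings treated as a same-class pair are penalised:

import torch
import torch.nn.functional as F

def contrastive_loss(emb1, emb2, label, margin=2.0):
    # label = 0: same class (pull embeddings together)
    # label = 1: different class (push apart, but only up to the margin)
    d = F.pairwise_distance(emb1, emb2)
    return torch.mean((1 - label) * d.pow(2)
                      + label * torch.clamp(margin - d, min=0.0).pow(2))

emb1 = torch.zeros(1, 8)
emb2 = torch.full((1, 8), 2.0)   # distance = sqrt(32) ~ 5.66 > margin
print(contrastive_loss(emb1, emb2, torch.tensor([1.0])))  # tensor(0.) - already separated
print(contrastive_loss(emb1, emb2, torch.tensor([0.0])))  # ~32 - same-class pair gets pulled in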

What does a.sub_(lr*a.grad) actually do?

I am working through the fast.ai course, the lesson on SGD, and I cannot understand this...
It subtracts (learning rate * gradient) from the coefficients...
But why is it necessary to subtract?
Here is the code:
def update():
    y_hat = x@a                    # predictions: inputs times parameters (@ is matrix multiply)
    loss = mse(y_hat, y)
    if t % 10 == 0: print(loss)
    loss.backward()                # compute d(loss)/d(a), stored in a.grad
    with torch.no_grad():
        a.sub_(lr * a.grad)        # in-place update: a = a - lr * a.grad
        a.grad.zero_()             # reset the gradient so it doesn't accumulate
Picture the loss function J plotted as a function of the parameter W. This is a simplified representation, with W being the only parameter, so for a convex loss function the curve is a U shape with a single minimum.
Note that the learning rate is positive. On the left side of the minimum, the gradient (the slope of the tangent to the curve at that point) is negative, so the product of the learning rate and the gradient is negative. Subtracting that product from W therefore increases W (two negatives make a positive). This is good, because the loss decreases.
On the right side, the gradient is positive, so the product of the learning rate and the gradient is positive. Subtracting the product from W reduces W. In this case too, the loss decreases.
We can extend the same idea to more parameters (the graph becomes higher dimensional and is not easy to visualize, which is why we took a single parameter W initially) and to other loss functions, even non-convex ones, though gradient descent won't always converge to the global minimum, only to a nearby local minimum.
Note: this explanation can be found in Andrew Ng's deeplearning.ai courses, but I couldn't find a direct link, so I wrote this answer.
I'm assuming a represents your model parameters, based on y_hat = x@a. The subtraction is necessary because the stochastic gradient descent algorithm aims to find a minimum of the loss function. You therefore take the gradient of the loss w.r.t. your model parameters and update them a little in the direction opposite to the gradient.
Think of the analogy of sliding down a hill: if the landscape represents your loss, the gradient points in the direction of steepest ascent. To get to the bottom (i.e. minimize the loss), you take little steps against the gradient, in the direction of steepest descent from where you're standing.
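Putting it together, here is a self-contained version of the same loop (the synthetic linear data and all variable names are my own, chosen just to make the snippet runnable):

import torch

# synthetic data: y = 3*x0 + 2 plus a little noise
n = 100
x = torch.cat([torch.rand(n, 1), torch.ones(n, 1)], dim=1)  # second column models the bias
y = x @ torch.tensor([[3.0], [2.0]]) + 0.05 * torch.randn(n, 1)

a = torch.randn(2, 1, requires_grad=True)  # parameters to learn
lr = 0.1

def mse(y_hat, y):
    return ((y_hat - y) ** 2).mean()

for t in range(1000):
    y_hat = x @ a
    loss = mse(y_hat, y)
    if t % 100 == 0: print(loss.item())   # loss shrinks over time
    loss.backward()
    with torch.no_grad():
        a.sub_(lr * a.grad)               # step against the gradient
        a.grad.zero_()

print(a)  # approaches [[3.], [2.]]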

What is a soft constraint in Box2D?

I am creating a mouse joint and I bumped into this term. What does it actually mean?
The documentation for the mouse joint says: "A mouse joint is used to make a point on a body track a specified world point. This is a soft constraint with a maximum force. This allows the constraint to stretch without applying huge forces."
Let's say that we have a distance joint:
b2DistanceJointDef DistJointDef;
You can achieve a spring-like effect by tuning its frequency and damping ratio:
DistJointDef.frequencyHz = 0.5f;
DistJointDef.dampingRatio = 0.5f;
frequencyHz determines how much the body should stretch/shrink over time,
whereas dampingRatio determines how long the spring-like effect lasts.
These principles also apply to mouse joints: you can modify their frequency and damping ratio to achieve a similar effect.
If I recall correctly, you can apply soft constraints to wheel joints as well.
Here is a little more info on the subject from the Box2D manual:
Softness is achieved by tuning two constants in the definition: frequency and damping ratio. Think of the frequency as the frequency of a harmonic oscillator (like a guitar string). The frequency is specified in Hertz. Typically the frequency should be less than half the frequency of the time step. So if you are using a 60Hz time step, the frequency of the distance joint should be less than 30Hz. The reason is related to the Nyquist frequency.
The damping ratio is non-dimensional and is typically between 0 and 1, but can be larger. At 1, the damping is critical (all oscillations should vanish).
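As a concrete illustration, here is a small sketch using the pybox2d bindings (the kwargs-style CreateDistanceJoint call is how pybox2d wraps the C++ joint def shown above; treat the exact API as an assumption and adjust for your binding/version):

from Box2D import b2World

world = b2World(gravity=(0, -10))
ground = world.CreateStaticBody(position=(0, 10))
box = world.CreateDynamicBody(position=(0, 5))
box.CreatePolygonFixture(box=(0.5, 0.5), density=1.0)

# Soft distance joint: frequencyHz sets how fast it oscillates,
# dampingRatio sets how quickly the oscillation dies out.
world.CreateDistanceJoint(
    bodyA=ground, bodyB=box,
    anchorA=(0, 10), anchorB=(0, 5),
    frequencyHz=0.5,     # well below half the 60 Hz step rate (Nyquist)
    dampingRatio=0.5,    # < 1: underdamped, so a visible spring-like wobble
)

for _ in range(120):                 # simulate two seconds at a 60 Hz time step
    world.Step(1.0 / 60.0, 8, 3)
    print(round(box.position.y, 3))  # watch the body bounce and settle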

Anchor boxes in YOLO: how are they decided?

I have gone through a couple of YOLO tutorials, but I am finding it somewhat hard to figure out whether the anchor boxes for each cell of the image are predetermined. In one of the guides I went through, the image was divided into 13x13 cells, and it stated that each cell predicts 5 anchor boxes (bigger than it; OK, here is my first problem, because it also says the cell would first detect what object is present before predicting the boxes).
How can a small cell predict anchor boxes for an object bigger than itself? Also, it is said that each cell classifies before predicting its anchor boxes: how can a small cell classify the right object without querying neighbouring cells, if only a small part of the object falls within the cell?
E.g. say one of the 13x13 cells contains only the white pocket of a man wearing a T-shirt: how can that cell correctly classify that a man is present without being linked to its neighbouring cells? With a normal CNN, when trying to localize a single object, I know the bounding-box prediction relates to the whole image, so at least I can say the network has an idea of what's going on everywhere in the image before deciding where the box should be.
PS: What I currently think is that each cell is assigned predetermined anchor boxes, each with a classifier, and then the boxes with the highest scores for each class are selected; but I am sure it doesn't add up somewhere.
UPDATE: I made a mistake with this question; it should have been about how regular bounding boxes are decided rather than anchor/prior boxes. So I am marking @craq's answer as correct, because that is how anchor boxes are decided according to the YOLOv2 paper.
I think there are two questions here. Firstly, the one in the title, asking where the anchors come from. Secondly, how anchors are assigned to objects. I'll try to answer both.
Anchors are determined by a k-means procedure, looking at all the bounding boxes in your dataset. If you're looking at vehicles, the ones you see from the side will have an aspect ratio of about 2:1 (width = 2*height). The ones viewed from the front will be roughly square, 1:1. If your dataset includes people, the aspect ratio might be 1:3. Foreground objects will be large, background objects will be small. The k-means routine will figure out a selection of anchors that represents your dataset: k=5 for YOLOv2, nine anchors for YOLOv3; the number differs between YOLO versions.
It's useful to have anchors that represent your dataset, because YOLO learns how to make small adjustments to the anchor boxes in order to create an accurate bounding box for your object. YOLO can learn small adjustments better/more easily than large ones.
The assignment problem is trickier. As I understand it, part of the training process is for YOLO to learn which anchors to use for which object. So the "assignment" isn't deterministic like it might be for the Hungarian algorithm. Because of this, in general, multiple anchors will detect each object, and you need to do non-max suppression afterwards in order to pick the "best" one (i.e. the one with the highest confidence).
There are a couple of points that I needed to understand before I came to grips with anchors:
Anchors can be any size, so they can extend beyond the boundaries of the 13x13 grid cells. They have to, in order to detect large objects.
Anchors only enter in the final layers of YOLO. YOLO's neural network makes 13x13x5=845 predictions (assuming a 13x13 grid and 5 anchors). The predictions are interpreted as offsets to anchors from which to calculate a bounding box. (The predictions also include a confidence/objectness score and a class label.)
YOLO's loss function compares each object in the ground truth with one anchor. It picks the anchor (before any offsets) with highest IoU compared to the ground truth. Then the predictions are added as offsets to the anchor. All other anchors are designated as background.
If anchors which have been assigned to objects have high IoU, their loss is small. Anchors which have not been assigned to objects should predict background by setting confidence close to zero. The final loss function is a combination from all anchors. Since YOLO tries to minimise its overall loss function, the anchor closest to ground truth gets trained to recognise the object, and the other anchors get trained to ignore it.
The following pages helped my understanding of YOLO's anchors:
https://medium.com/@vivek.yadav/part-1-generating-anchor-boxes-for-yolo-like-network-for-vehicle-detection-using-kitti-dataset-b2fe033e5807
https://github.com/pjreddie/darknet/issues/568
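A minimal sketch of that k-means procedure, using the 1 - IoU distance from the YOLOv2 paper (the toy dataset of box widths/heights and all names here are placeholders of my own):

import numpy as np

def iou_wh(boxes, anchors):
    # IoU between boxes (N, 2) and anchors (k, 2), compared as if all were
    # centred at the same point, so only widths and heights matter.
    inter = (np.minimum(boxes[:, None, 0], anchors[None, :, 0])
             * np.minimum(boxes[:, None, 1], anchors[None, :, 1]))
    areas = (boxes[:, 0] * boxes[:, 1])[:, None] + (anchors[:, 0] * anchors[:, 1])[None, :]
    return inter / (areas - inter)

def kmeans_anchors(boxes, k=5, iters=50):
    anchors = boxes[np.random.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(iou_wh(boxes, anchors), axis=1)    # nearest = highest IoU
        for j in range(k):
            if np.any(assign == j):
                anchors[j] = boxes[assign == j].mean(axis=0)  # update centroid (w, h)
    return anchors

boxes = np.abs(np.random.randn(500, 2)) + 0.5   # toy (width, height) pairs in grid units
print(kmeans_anchors(boxes, k=5))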
I think that your statement about the number of predictions of the network could be misleading. Assuming a 13 x 13 grid and 5 anchor boxes, the output of the network has, as I understand it, the following shape: 13 x 13 x 5 x (2+2+1+nbOfClasses)
13 x 13: the grid
x 5: the anchors
x (2+2+1+nbOfClasses): the (x, y)-coordinates of the centre of the bounding box (in the coordinate system of each cell), the (h, w)-deviation of the bounding box (deviation from the prior anchor boxes), an objectness/confidence score, and a softmax-activated class vector indicating a probability for each class.
If you want to have more information about the determination of the anchor priors you can take a look at the original paper in the arxiv: https://arxiv.org/pdf/1612.08242.pdf.
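As a shape-only illustration of that output layout (the tensor here is random, and the names and the 13/5/20 sizes are assumptions made for the sketch):

import torch

grid, anchors, n_classes = 13, 5, 20
raw = torch.randn(1, grid, grid, anchors * (2 + 2 + 1 + n_classes))
out = raw.view(1, grid, grid, anchors, 2 + 2 + 1 + n_classes)

xy      = torch.sigmoid(out[..., 0:2])         # centre offsets within each cell
hw      = out[..., 2:4]                        # deviations from the anchor priors
obj     = torch.sigmoid(out[..., 4:5])         # objectness/confidence score
classes = torch.softmax(out[..., 5:], dim=-1)  # per-class probabilities
print(xy.shape, hw.shape, obj.shape, classes.shape)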

Why do the negative frequencies of a discrete Fourier transform appear where high frequencies are supposed to be?

I am very new to this topic; please help me understand why this happens.
If I take the DFT of a cosine wave,
cos(w*x) = 0.5 * (exp(i*w*x) + exp(-i*w*x))
I expect to get one peak at the frequency w, and the negative one, -w, should not be visible; but it appears at the far end of the spectrum, where higher frequencies are supposed to be.
Why? Do high frequencies produce the same effect as negative ones?
If you imagine a wave with a phase step of 3*pi/2 per sample, i.e. sampled phases (0, 3pi/2, 3pi, 9pi/2), it looks exactly like a wave with negative frequency -pi/2, i.e. phases (0, -pi/2, -pi, -3pi/2), since the two agree modulo 2*pi.
Is that the reason for what is happening? Please help!
High frequencies between the Nyquist frequency (Fs/2) and the sample rate (Fs) alias to negative frequencies, because they are above the Nyquist limit. And vice versa: negative frequencies alias to the top half of an FFT. "Alias" means you can't tell those frequencies apart at that sample rate: sampling sinewaves at either of those frequencies can produce absolutely identical sets of samples.
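You can see this directly with NumPy: a cosine at bin k produces magnitude peaks at bins k and N-k, the latter being where the -k (negative-frequency) component lands:

import numpy as np

N = 64                              # number of samples
k = 5                               # cosine frequency, in cycles per N samples
n = np.arange(N)
x = np.cos(2 * np.pi * k * n / N)

mag = np.abs(np.fft.fft(x))
print(np.nonzero(mag > 1)[0])       # -> [ 5 59]; bin 59 = N - k holds the -k peak

# fftshift reorders the spectrum so negative frequencies sit to the left of zero:
freqs = np.fft.fftshift(np.fft.fftfreq(N))            # runs from -0.5 up to just below +0.5
print(freqs[np.argsort(np.fft.fftshift(mag))[-2:]])   # the two peaks: -k/N and +k/N cycles/sample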