I am a bit confused about how YOLO works.
In the paper, they say that:
"The confidence prediction represents the IOU between the
predicted box and any ground truth box."
But how do we have the ground truth box? Let's say I use my YOLO network (already trained) on an image that is not labelled. What is my confidence then?
Sorry if the question is simple, but I really don't get this part...
Thank you!
But how do we have the ground truth box?
You seem to be confused about what exactly is training data and what is the output or prediction by YOLO.
Training data is a bounding box along with the class label(s). This is referred to as the 'ground truth box', b = [bx, by, bh, bw, class_name (or number)], where (bx, by) is the midpoint of the annotated bounding box and (bh, bw) are the height and width of the box.
Output or prediction is bounding box b along with class c for an image i.
Formally: y = [pc, bx, by, bh, bw, c1, ..., cn], where (bx, by) is the midpoint of the predicted bounding box, (bh, bw) are the height and width of the box, pc is the probability that an object is present in 'box' b, and c1, ..., cn are the class probabilities.
Let's say I use my YOLO network (already trained) on an image that is not labelled. What is my confidence then?
When you use a pre-trained model (which you refer to as "already trained"), the network has already 'learned' bounding boxes for certain object classes, and it tries to estimate where the object is in the new image. While doing so, it might predict the bounding box somewhere other than where it is supposed to be. So how do you measure how far off the box is? IOU to the rescue!
What IOU (Intersection Over Union) does is, it gets you a score of area of overlap over area of union.
IOU = Area of Overlap / Area of Union
It is rarely a perfect 1, but the closer to 1 the better: the lower the IOU, the worse YOLO is predicting the bounding box with respect to the ground truth.
An IOU score of 1 means the bounding box is predicted exactly on top of the ground truth.
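For intuition, here is a minimal sketch of computing IOU between two boxes (plain Python; the [x1, y1, x2, y2] corner format is just for illustration, it is not how YOLO parameterizes boxes internally):

    def iou(box_a, box_b):
        """Intersection over Union for two boxes given as [x1, y1, x2, y2]."""
        # Coordinates of the intersection rectangle
        x1 = max(box_a[0], box_b[0])
        y1 = max(box_a[1], box_b[1])
        x2 = min(box_a[2], box_b[2])
        y2 = min(box_a[3], box_b[3])

        # Clamp at zero so non-overlapping boxes give an intersection of 0
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)

        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        union = area_a + area_b - inter
        return inter / union if union > 0 else 0.0

    # Example: a predicted box shifted slightly from the ground truth
    print(iou([0, 0, 10, 10], [2, 2, 12, 12]))  # ~0.47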
YOLO uses IOU when computing its training targets. The confidence a box should predict is
(Probability of object) * IoU score
i.e. the probability that an object is present in the box, scaled by how well the predicted box overlaps the ground truth. The same score is used when evaluating predictions on validation data during training.
Hope it helps.
I think all you need is a good image that clarifies what is the ground truth.
As you can see on the left, the rectangle that perfectly encloses the object is the ground truth (the blue one).
The orange rectangle is the predicted one. The IoU is what you can visually understand from the right hand side of the image.
Hope this helps.
I think I know the answer.
I guess YOLO uses IoU in two cases, for different goals:
1- to assess predictions while training;
2- when you use an already trained model, you sometimes get many boxes for the same object. I have read that this is the way YOLO tackles that issue (not sure if this is part of Non-Maximum Suppression).
Related
I am doing the fast.ai course, the SGD lesson, and there is something I cannot understand.
This subtracts (learning rate * gradient) from the coefficients...
But why is it necessary to subtract?
here is the code:
def update():
    y_hat = x @ a                 # predictions: inputs x times parameters a
    loss = mse(y_hat, y)          # mean squared error against targets y
    if t % 10 == 0: print(loss)
    loss.backward()               # compute d(loss)/d(a), stored in a.grad
    with torch.no_grad():
        a.sub_(lr * a.grad)       # gradient descent step: a <- a - lr * a.grad
        a.grad.zero_()            # reset the gradient for the next iteration
Look at the image. It shows the loss function J as a function of the parameter W. Here it is a simplified representation with W being the only parameter. So, for a convex loss function, the curve looks as shown.
Note that the learning rate is positive. On the left side, the gradient (slope of the line tangent to the curve at that point) is negative, so the product of the learning rate and gradient is negative. Thus, subtracting the product from W will actually increase W (since 2 negatives make a positive). In this case, this is good because loss decreases.
On the other hand (on the right side), the gradient is positive, so the product of the learning rate and gradient is positive. Thus, subtracting the product from W reduces W. In this case also, this is good because the loss decreases.
We can extend the same reasoning to a larger number of parameters (the graph will be higher dimensional and hard to visualize, which is why we took a single parameter W initially) and to other loss functions (even non-convex ones, though then it won't always converge to the global minimum, but rather to a nearby local minimum).
Note: This explanation can be found in Andrew Ng's deeplearning.ai courses, but I couldn't find a direct link, so I wrote this answer.
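To make the sign argument above concrete, here is a tiny sketch with a toy one-parameter loss (plain Python; the quadratic loss and the learning rate are made up for illustration):

    # Toy loss J(w) = (w - 3)**2, minimized at w = 3.
    def grad(w):
        return 2 * (w - 3)   # derivative of (w - 3)**2

    lr = 0.1

    w = 0.0                  # start to the LEFT of the minimum: gradient is negative
    w = w - lr * grad(w)     # subtracting a negative number INCREASES w (towards 3)
    print(w)                 # 0.6

    w = 5.0                  # start to the RIGHT of the minimum: gradient is positive
    w = w - lr * grad(w)     # subtracting a positive number DECREASES w (towards 3)
    print(w)                 # 4.6

In both cases the subtraction moves w towards the minimum, which is exactly why the update subtracts rather than adds.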
I'm assuming a represents your model parameters, based on y_hat = x @ a. This is necessary because the stochastic gradient descent algorithm aims to find a minimum of the loss function. Therefore, you take the gradient w.r.t. your model parameters and update them a little in the direction of the negative gradient.
Think of the analogy of sliding down a hill: if the landscape represents your loss, the gradient is the direction of steepest descent. To get to the bottom (i.e. minimize loss), you take little steps in the direction of the steepest descent from where you're standing.
I have gone through a couple of YOLO tutorials, but I am finding it somewhat hard to figure out whether the anchor boxes for each cell the image is divided into are predetermined. In one of the guides I went through, the image was divided into 13x13 cells, and it stated that each cell predicts 5 anchor boxes (bigger than the cell; here is my first problem, because it also says the cell first detects what object is present in it before predicting the boxes).
How can a small cell predict anchor boxes for an object bigger than itself? Also, if each cell classifies before predicting its anchor boxes, how can a small cell classify the right object without querying neighbouring cells when only a small part of the object falls within it?
E.g. say one of the 13x13 cells contains only the white pocket of a man's T-shirt: how can that cell correctly classify that a man is present without being linked to its neighbouring cells? With a normal CNN, when localizing a single object, I know the bounding box prediction relates to the whole image, so at least I can say the network has an idea of what's going on everywhere in the image before deciding where the box should be.
PS: What I currently think is that YOLO basically assigns each cell predetermined anchor boxes, with a classifier at the end of each, and then selects the boxes with the highest scores for each class, but I am sure it doesn't add up somewhere.
UPDATE: I made a mistake with this question; it should have been about how regular bounding boxes are decided rather than anchor/prior boxes. So I am marking @craq's answer as correct, because that is how anchor boxes are decided according to the YOLOv2 paper.
I think there are two questions here. Firstly, the one in the title, asking where the anchors come from. Secondly, how anchors are assigned to objects. I'll try to answer both.
Anchors are determined by a k-means procedure, looking at all the bounding boxes in your dataset. If you're looking at vehicles, the ones you see from the side will have an aspect ratio of about 2:1 (width = 2*height). The ones viewed from in front will be roughly square, 1:1. If your dataset includes people, the aspect ratio might be 1:3. Foreground objects will be large, background objects will be small. The k-means routine will figure out a selection of anchors that represents your dataset. k=5 for YOLOv2 (YOLOv3 uses 9), but there are different numbers of anchors for each YOLO version.
It's useful to have anchors that represent your dataset, because YOLO learns how to make small adjustments to the anchor boxes in order to create an accurate bounding box for your object. YOLO can learn small adjustments better/easier than large ones.
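A rough sketch of that clustering step (plain NumPy, using the common 1 - IoU distance for YOLOv2-style anchors; the widths/heights are assumed to be pre-normalized, so treat this as illustrative rather than the exact darknet implementation):

    import numpy as np

    def iou_wh(wh, anchors):
        # IoU between one (w, h) pair and each anchor, as if all boxes shared the same centre
        inter = np.minimum(wh[0], anchors[:, 0]) * np.minimum(wh[1], anchors[:, 1])
        union = wh[0] * wh[1] + anchors[:, 0] * anchors[:, 1] - inter
        return inter / union

    def kmeans_anchors(boxes_wh, k=5, iters=100):
        # boxes_wh: (N, 2) array of ground-truth box widths and heights
        boxes_wh = np.asarray(boxes_wh, dtype=float)
        anchors = boxes_wh[np.random.choice(len(boxes_wh), k, replace=False)]
        for _ in range(iters):
            # Assign each box to the closest anchor, using 1 - IoU as the distance
            assign = np.array([np.argmax(iou_wh(wh, anchors)) for wh in boxes_wh])
            # Move every anchor to the mean width/height of the boxes assigned to it
            for i in range(k):
                if np.any(assign == i):
                    anchors[i] = boxes_wh[assign == i].mean(axis=0)
        return anchors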
The assignment problem is trickier. As I understand it, part of the training process is for YOLO to learn which anchors to use for which object. So the "assignment" isn't deterministic like it might be for the Hungarian algorithm. Because of this, in general, multiple anchors will detect each object, and you need to do non-max-suppression afterwards in order to pick the "best" one (i.e. highest confidence).
There are a couple of points that I needed to understand before I came to grips with anchors:
Anchors can be any size, so they can extend beyond the boundaries of the 13x13 grid cells. They have to be, in order to detect large objects.
Anchors only enter in the final layers of YOLO. YOLO's neural network makes 13x13x5=845 predictions (assuming a 13x13 grid and 5 anchors). The predictions are interpreted as offsets to anchors from which to calculate a bounding box. (The predictions also include a confidence/objectness score and a class label.)
YOLO's loss function compares each object in the ground truth with one anchor. It picks the anchor (before any offsets) with highest IoU compared to the ground truth. Then the predictions are added as offsets to the anchor. All other anchors are designated as background.
If anchors which have been assigned to objects have high IoU, their loss is small. Anchors which have not been assigned to objects should predict background by setting confidence close to zero. The final loss function is a combination from all anchors. Since YOLO tries to minimise its overall loss function, the anchor closest to ground truth gets trained to recognise the object, and the other anchors get trained to ignore it.
The following pages helped my understanding of YOLO's anchors:
https://medium.com/@vivek.yadav/part-1-generating-anchor-boxes-for-yolo-like-network-for-vehicle-detection-using-kitti-dataset-b2fe033e5807
https://github.com/pjreddie/darknet/issues/568
I think that your statement about the number of predictions of the network could be misleading. Assuming a 13 x 13 grid and 5 anchor boxes, the output of the network has, as I understand it, the following shape: 13 x 13 x 5 x (2+2+1+nbOfClasses)
13 x 13: the grid
x 5: the anchors
x (2+2+1+nbOfClasses): the (x, y)-coordinates of the center of the bounding box (in the coordinate system of each cell), the (h, w)-deviation of the bounding box (relative to the prior anchor boxes), an objectness/confidence score, and a softmax-activated class vector giving a probability for each class.
If you want to have more information about the determination of the anchor priors you can take a look at the original paper in the arxiv: https://arxiv.org/pdf/1612.08242.pdf.
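For intuition, here is a rough sketch of how an output tensor of that shape might be decoded into boxes (NumPy; the sigmoid/exponential transforms and the grid/anchor handling follow the YOLOv2 paper's equations, but the function and variable names are illustrative, not the darknet implementation):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def decode(output, anchors, num_classes):
        """output: (13, 13, 5, 5 + num_classes) raw network output.
        anchors: (5, 2) prior widths/heights in grid-cell units."""
        S = output.shape[0]                       # grid size, e.g. 13
        boxes = []
        for row in range(S):
            for col in range(S):
                for a, (p_w, p_h) in enumerate(anchors):
                    t_x, t_y, t_w, t_h, t_o = output[row, col, a, :5]
                    class_scores = output[row, col, a, 5:]
                    # Centre is offset from the cell's top-left corner, in cell units
                    b_x = col + sigmoid(t_x)
                    b_y = row + sigmoid(t_y)
                    # Width/height are the anchor priors scaled by exp(t)
                    b_w = p_w * np.exp(t_w)
                    b_h = p_h * np.exp(t_h)
                    confidence = sigmoid(t_o)     # objectness score
                    boxes.append((b_x, b_y, b_w, b_h, confidence, np.argmax(class_scores)))
        return boxes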
I am trying to understand Loss functions for Bounding Box Regression in CNNs. Currently I use Lasagne and Theano, which makes writing loss expressions very easy. Many sources propose different methods and I am asking myself which one is usually used in practice.
The bounding boxes coordinates are represented as normalized coordinates in the order [left, top, right, bottom] (using T.matrix('targets', dtype=theano.config.floatX)).
I have tried the following functions so far; however all of them have their drawbacks.
Intersection over Union
I was advised to use the Intersection over Union measure to identify how well two bounding boxes align and overlap. However, a problem occurs when the boxes don't overlap: the intersection is 0, so the whole quotient turns 0 regardless of how far apart the bounding boxes are. I implemented it as:
import theano.tensor as T

def get_area(A):
    # Boxes are [left, top, right, bottom] in normalized coordinates
    return (A[:,2] - A[:,0]) * (A[:,1] - A[:,3])

def get_intersection(A, B):
    # Width/height of the overlap, clamped at 0 so that non-overlapping
    # boxes yield an intersection of 0 instead of a negative value
    return T.maximum(0.0, T.minimum(A[:,2], B[:,2]) - T.maximum(A[:,0], B[:,0])) \
         * T.maximum(0.0, T.minimum(A[:,1], B[:,1]) - T.maximum(A[:,3], B[:,3]))

def bbox_overlap_loss(A, B):
    """Computes the bounding box overlap using the
    Intersection over Union"""
    intersection = get_intersection(A, B)
    union = get_area(A) + get_area(B) - intersection
    # Turn into loss
    l = 1.0 - intersection / union
    return l.mean()
Squared Diameter Difference
To create an error measure for non-overlapping bounding boxes, I tried to compute the squared difference of the bounding box diameters. It seems to work, but I am almost sure that there is a much better way to do this. I implemented it as:
def squared_diameter_loss(A, B):
    # Represent the squared distance from the real diameter
    # in normalized pixel coordinates
    l = (abs(A[:,0:2] - B[:,0:2]) + abs(A[:,2:4] - B[:,2:4]))**2
    return l.mean()
Euclidean Loss
The simplest function would be the Euclidean loss, which computes the squared differences of the bounding box parameters. However, this doesn't take into account the area of the overlapping bounding boxes, only the differences of the parameters left, right, top and bottom. I implemented it as:
import lasagne

def euclidean_loss(A, B):
    l = lasagne.objectives.squared_error(A, B)
    return l.mean()
Could someone guide me on which would be the best loss function for bounding box regression for this use case or spot if I am doing something wrong here. Which loss function is usually used in practice?
Speaking from personal implementation experience, I had much better results training a CNN using IOU as the loss function as opposed to Euclidean (MSE or L2) Loss. Have not used the squared diameter difference loss. In general, a loss function that explicitly represents the goodness of your outputs for the tasks you hope to accomplish is probably best.
With regards to the IOU having a value of zero, you can introduce an additional term into the formulation so that the loss still provides a useful signal when the boxes don't overlap, perhaps based on the normalized distance between bbox centers. This might have the additional effect of helping to center bounding boxes relative to the ground truth.
This response is mostly conceptual but I'd be happy to supply code examples if desired.
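Since code was offered, here is a minimal NumPy sketch of one way such a combined loss could look: a plain IoU term plus a normalized centre-distance penalty. This is an illustrative formulation in the spirit of the suggestion above (the alpha weight and the epsilon are arbitrary choices), not a standard or tuned loss:

    import numpy as np

    def iou_with_distance_loss(A, B, alpha=1.0):
        """A, B: (N, 4) boxes as [left, top, right, bottom], normalized to [0, 1],
        with y increasing downward."""
        # Intersection (clamped at 0 for non-overlapping boxes)
        iw = np.maximum(0.0, np.minimum(A[:, 2], B[:, 2]) - np.maximum(A[:, 0], B[:, 0]))
        ih = np.maximum(0.0, np.minimum(A[:, 3], B[:, 3]) - np.maximum(A[:, 1], B[:, 1]))
        inter = iw * ih
        area_a = (A[:, 2] - A[:, 0]) * (A[:, 3] - A[:, 1])
        area_b = (B[:, 2] - B[:, 0]) * (B[:, 3] - B[:, 1])
        iou = inter / (area_a + area_b - inter + 1e-9)

        # Squared distance between box centres; non-zero even when IoU is 0,
        # so the loss still indicates which direction to move the box
        centers_a = (A[:, 0:2] + A[:, 2:4]) / 2.0
        centers_b = (B[:, 0:2] + B[:, 2:4]) / 2.0
        center_dist = np.sum((centers_a - centers_b) ** 2, axis=1)

        return np.mean((1.0 - iou) + alpha * center_dist)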
I have a molecular dynamics trajectory of a micelle in a water box. For each frame, I want to get center of mass (COM) of the micelle and calculate the density of water molecules from COM to the sides of the water box. Does anybody know what is the best way to do that? I was wondering whether somebody has dealt with such an issue before.
Thanks
Sadegh
First - a lot of useful commands and how to use them are written here.
You should select your lipids (I suppose the micelle is built from phospholipids), selecting either by type - lipids:
set sel1 [atomselect top "lipids"]
or by resname:
set sel1 [atomselect top "resname DPC"]
And then:
center_of_mass $sel1
calculate the density of water molecules from COM to the sides of the water box
I don't know exactly what you are talking about - are there some water molecules inside the micelle? Or is the water box the water molecules within a certain distance from the micelle? Please give some more info (ideally some pictures of the system with descriptions).
I have plots of points which look like this.
The tracks which these points form can be a circle or an ellipse. Clearly the center of the circular tracks in the two images above are different.
How can I find the center point of these tracks (circular/elliptical)? I want to find the (x,y) coordinates of the center; it doesn't have to be a point in the plotted data set, i.e. I don't want a medoid.
EDIT: Also, is there any way I can find an equation for a circle/ellipse that envelopes the majority of these points? In the elliptical track, I've added an ellipse that envelopes the points on the track. The values were calculated by trial and error, and the center was estimated by eyeballing the plot. How can I do this programmatically?
The smallest circle problem is relevant here, and here is a paper (PDF download available) on the smallest ellipse problem. Both have O(N) algorithms and should be able to provide the formula for the circle/ellipse and its area, from which you can get the center. However, they focus on enclosing all of the points. To deal with that you'll need to remove a number of the bounding points, which you should be able to get from the algorithms as well. Unfortunately, it's pretty much up to you what qualifies as a good enough solution.
A fast and simple randomized solution is:
Randomly divide the set of points into k sets of N/k points each.
Run the smallest circle/ellipse algorithm for each set
For each of the k sets, pick at least 1 but no more than m bounding points to remove from the main point set.
Return to step 1, t times.
Return the result of the circle/ellipse algorithm on remaining points.
The algorithm removes between k and mk bounding points every pass at a cost of O(N). For your purpose you'll probably want to remove some percentage of the bounding points, 1-25% seems like a good starting point. This solution assumes that k is very small compared to N, otherwise you'll be removing too many points.
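A rough sketch of that randomized trimming loop (Python with NumPy and OpenCV's cv2.minEnclosingCircle standing in for "the smallest circle algorithm"; the k, m, t defaults and the boundary-point test are illustrative choices, not part of the description above):

    import numpy as np
    import cv2

    def trim_and_fit(points, k=8, m=2, t=5):
        """points: (N, 2) float array of track points.
        Repeatedly removes a few bounding points, then fits the smallest circle."""
        pts = np.asarray(points, dtype=np.float32).copy()
        for _ in range(t):
            np.random.shuffle(pts)                  # step 1: random partition into k chunks
            keep = np.ones(len(pts), dtype=bool)
            start = 0
            for chunk in np.array_split(pts, k):
                n = len(chunk)
                if n >= 3:
                    # step 2: smallest enclosing circle of this chunk
                    (cx, cy), r = cv2.minEnclosingCircle(chunk)
                    d = np.hypot(chunk[:, 0] - cx, chunk[:, 1] - cy)
                    # step 3: drop up to m points that lie on the enclosing circle
                    boundary_idx = np.where(np.isclose(d, r, rtol=1e-3))[0][:m]
                    keep[start + boundary_idx] = False
                start += n
            pts = pts[keep]                         # step 4: back to step 1, t times
        # step 5: final circle fitted on the remaining (hopefully outlier-free) points
        return cv2.minEnclosingCircle(pts)

    # Usage: (cx, cy), radius = trim_and_fit(xy_points)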
A slower but likely better algorithm is useful if you want to repeatedly remove one or all of the bounding points from the smallest ellipse, recalculate the smallest ellipse, and then remove the bounding points again.

You can do this by having the parent node hold the bounding points (stored as a set for faster removal) of the smallest enclosing ellipse of its children. The maximum number of bounding points should be no more than k (which I think is 9 for an ellipse, compared to 3 for a circle). So removing a point from the data structure is O(k log N), as it requires recalculating the smallest ellipse, which is O(k), for each parent that is affected, which is O(log N). Removing m points from the data structure should therefore be O(mk log N). You might also want to consider calculating the area of the ellipse after every removed point, removing points one at a time at a total cost of O(Nk log N) until you only have three points left. You could then analyse the area data to determine which ellipse should be used. A simple choice would be to use the ellipse whose area is closest to the average area of all of the ellipses created, but that may not be exactly what you seek. It might also be too slow, in which case I recommend a single pass of the faster algorithm.
This looks like an instance of Robust Ellipse Fitting. Check this paper: Outlier Elimination for Robust Ellipse and Ellipsoid Fitting, http://arxiv.org/pdf/0910.4610.pdf
A first rough and easy solution is provided by the ellipse of inertia (the 2D version of the ellipsoid of inertia, http://en.wikipedia.org/wiki/Moment_of_inertia#Inertia_ellipsoid). Its center is just the centroid, and its axes are given by the eigenvectors/eigenvalues of the 2x2 inertia matrix.
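A minimal NumPy sketch of that idea, using the covariance matrix of the points as the 2x2 inertia-like matrix (the factor used to scale the axes so the ellipse actually envelopes most points is an arbitrary choice here):

    import numpy as np

    def inertia_ellipse(points):
        """points: (N, 2) array. Returns the centre, axis directions and axis half-lengths."""
        pts = np.asarray(points, dtype=float)
        center = pts.mean(axis=0)                      # centroid = centre of the ellipse
        cov = np.cov((pts - center).T)                 # 2x2 second-moment (inertia-like) matrix
        eigvals, eigvecs = np.linalg.eigh(cov)         # principal axes of the point cloud
        # Axis half-lengths proportional to sqrt of the eigenvalues; the factor 2 is
        # an arbitrary choice to make the ellipse cover most of the points
        half_lengths = 2.0 * np.sqrt(eigvals)
        return center, eigvecs, half_lengths

    # Usage:
    # center, axes, half_lengths = inertia_ellipse(xy_points)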