Why does YOLO divide an image into grid cells? - deep-learning

I'm trying to understand how YOLO works for a project I'm doing. I've gone through the papers, many articles, and blog posts, but I'm still not sure why YOLO divides the entire image into grid cells and considers each cell for computations. What would happen if we considered the whole image as just one cell (without dividing)? What purpose do these grid cells serve? Is there a maximum number of objects a particular cell can detect?

Grid cells put the network's predictions in a more structured form. Each grid cell corresponds to a specific region of the image, and each cell predicts the objects whose centers fall inside that region. So it is about having a structured output representation that exploits the spatial regularity of images.
Each grid cell predicts a vector of the form [objectness_value, bbox_h, bbox_w, bbox_cx, bbox_cy, p1, p2, ..., pn].
objectness_value: how confident the prediction is.
bbox_h, bbox_w, bbox_cx, bbox_cy: offsets for the bounding box height, width, center coordinate on the x-axis, and center coordinate on the y-axis, respectively.
p1, p2, ..., pn: predicted class probabilities for each object category (n categories in total).
More grid cells mean more predictions. If you have one grid cell (the image itself), you will get a single bounding box prediction. That is not practical because images usually contain many objects.
Note that a grid cell can make multiple bounding box predictions by adding more bbox offsets to the output vector.
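As a rough illustration of this output format, here is a minimal sketch of decoding one cell's prediction vector. The function name and the exact offset parameterization are assumptions for illustration; they vary between YOLO versions:

    import numpy as np

    def decode_cell(pred, cell_row, cell_col, grid_size=7):
        # pred: [objectness, bbox_h, bbox_w, bbox_cx, bbox_cy, p1, ..., pn]
        objectness = pred[0]
        bh, bw, cx, cy = pred[1:5]       # offsets predicted by this cell
        class_probs = pred[5:]
        # Center offsets are relative to the cell; convert to image fractions.
        center_x = (cell_col + cx) / grid_size
        center_y = (cell_row + cy) / grid_size
        return objectness, (center_x, center_y, bw, bh), np.argmax(class_probs)

    pred = np.array([0.9, 0.4, 0.3, 0.5, 0.5, 0.1, 0.8, 0.1])
    print(decode_cell(pred, cell_row=3, cell_col=4))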

Related

Anchor Boxes in YOLO: How are they decided

I have gone through a couple of YOLO tutorials, but I am finding it somewhat hard to figure out whether the anchor boxes for each cell the image is divided into are predetermined. In one of the guides I went through, the image was divided into 13x13 cells and it stated that each cell predicts 5 anchor boxes (bigger than the cell itself; here's my first problem, because it also says the cell would first detect what object is present in it before predicting the boxes).
How can a small cell predict anchor boxes for an object bigger than itself? It's also said that each cell classifies before predicting its anchor boxes; how can a small cell classify the right object without querying neighbouring cells if only a small part of the object falls within it?
E.g., say one of the 13x13 cells contains only the white pocket of a T-shirt worn by a man; how can that cell correctly classify that a man is present without being linked to its neighbouring cells? With a normal CNN, when trying to localize a single object, I know the bounding box prediction relates to the whole image, so at least I can say the network has an idea of what's going on everywhere in the image before deciding where the box should be.
PS: What I currently think is that each cell is assigned predetermined anchor boxes, each with a classifier, and the boxes with the highest scores for each class are then selected, but I am sure it doesn't add up somewhere.
UPDATE: I made a mistake with this question; it should have been about how regular bounding boxes are decided rather than anchor/prior boxes. So I am marking @craq's answer as correct, because that is how anchor boxes are decided according to the YOLOv2 paper.
I think there are two questions here. Firstly, the one in the title, asking where the anchors come from. Secondly, how anchors are assigned to objects. I'll try to answer both.
Anchors are determined by a k-means procedure, looking at all the bounding boxes in your dataset. If you're looking at vehicles, the ones you see from the side will have an aspect ratio of about 2:1 (width = 2*height). The ones viewed from the front will be roughly square, 1:1. If your dataset includes people, the aspect ratio might be 1:3. Foreground objects will be large, background objects will be small. The k-means routine will figure out a selection of anchors that represents your dataset: k=5 for YOLOv2, while YOLOv3 uses 9 (three per detection scale); the number of anchors differs between YOLO versions.
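Here's a minimal sketch of that k-means procedure, using 1 - IoU as the distance (as in the YOLOv2 paper). boxes is assumed to be an (N, 2) float array of (width, height) pairs taken from your dataset's ground-truth boxes:

    import numpy as np

    def iou_wh(boxes, anchors):
        # IoU between boxes and anchors, assuming both share the same center.
        inter = np.minimum(boxes[:, None, 0], anchors[None, :, 0]) * \
                np.minimum(boxes[:, None, 1], anchors[None, :, 1])
        union = boxes[:, 0] * boxes[:, 1]                      # shape (N,)
        union = union[:, None] + anchors[:, 0] * anchors[:, 1] - inter
        return inter / union                                   # shape (N, k)

    def kmeans_anchors(boxes, k=5, iters=100):
        boxes = np.asarray(boxes, dtype=float)
        anchors = boxes[np.random.choice(len(boxes), k, replace=False)].copy()
        for _ in range(iters):
            # nearest anchor = highest IoU (i.e. smallest 1 - IoU distance)
            assign = np.argmax(iou_wh(boxes, anchors), axis=1)
            for j in range(k):
                if np.any(assign == j):
                    # mean update; some implementations use the median instead
                    anchors[j] = boxes[assign == j].mean(axis=0)
        return anchors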
It's useful to have anchors that represent your dataset, because YOLO learns how to make small adjustments to the anchor boxes in order to create an accurate bounding box for your object. YOLO can learn small adjustments better and more easily than large ones.
The assignment problem is trickier. As I understand it, part of the training process is for YOLO to learn which anchors to use for which object. So the "assignment" isn't deterministic like it might be for the Hungarian algorithm. Because of this, in general, multiple anchors will detect each object, and you need to do non-max-suppression afterwards in order to pick the "best" one (i.e. highest confidence).
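For reference, here is a minimal sketch of that non-max-suppression step, with boxes given as (x1, y1, x2, y2) corners and one confidence score per box:

    import numpy as np

    def nms(boxes, scores, iou_thresh=0.5):
        # Greedily keep the highest-scoring box, drop boxes overlapping it.
        order = np.argsort(scores)[::-1]
        keep = []
        while order.size > 0:
            i = order[0]
            keep.append(i)
            # intersection of box i with the remaining boxes
            xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
            yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
            xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
            yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
            inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
            area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
            areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                    (boxes[order[1:], 3] - boxes[order[1:], 1])
            iou = inter / (area_i + areas - inter)
            order = order[1:][iou <= iou_thresh]
        return keep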
There are a couple of points that I needed to understand before I came to grips with anchors:
Anchors can be any size, so they can extend beyond the boundaries of the 13x13 grid cells. They have to, in order to detect large objects.
Anchors only enter into the final layers of YOLO. YOLO's neural network makes 13x13x5=845 predictions (assuming a 13x13 grid and 5 anchors). The predictions are interpreted as offsets to anchors from which to calculate a bounding box. (The predictions also include a confidence/objectness score and a class label.)
YOLO's loss function compares each object in the ground truth with one anchor. It picks the anchor (before any offsets) with the highest IoU against the ground truth. Then the predictions are added as offsets to that anchor. All other anchors are designated as background.
If anchors which have been assigned to objects have high IoU, their loss is small. Anchors which have not been assigned to objects should predict background by setting their confidence close to zero. The final loss function is a combination over all anchors. Since YOLO tries to minimise its overall loss, the anchor closest to the ground truth gets trained to recognise the object, and the other anchors get trained to ignore it.
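A tiny worked example of that matching rule, comparing one ground-truth box against the anchor shapes with centers aligned and offsets ignored (the numbers are made up for illustration):

    import numpy as np

    anchors = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 3.0]])   # (w, h) priors
    gt = np.array([1.8, 1.1])                                   # one GT box (w, h)

    # IoU with centers aligned: intersection of the two w/h rectangles
    inter = np.minimum(gt[0], anchors[:, 0]) * np.minimum(gt[1], anchors[:, 1])
    union = gt[0] * gt[1] + anchors[:, 0] * anchors[:, 1] - inter
    print(np.argmax(inter / union))   # index of the responsible anchor -> 1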
The following pages helped my understanding of YOLO's anchors:
https://medium.com/@vivek.yadav/part-1-generating-anchor-boxes-for-yolo-like-network-for-vehicle-detection-using-kitti-dataset-b2fe033e5807
https://github.com/pjreddie/darknet/issues/568
I think that your statement about the number of predictions of the network could be misleading. Assuming a 13 x 13 grid and 5 anchor boxes, the output of the network has, as I understand it, the following shape: 13 x 13 x 5 x (2+2+1+nbOfClasses)
13 x 13: the grid
x 5: the anchors
x (2+2+1+nbOfClasses): the (x, y)-coordinates of the center of the bounding box (in the coordinate system of each cell), the (h, w)-deviation of the bounding box (relative to the prior anchor boxes), an objectness/confidence score, and a softmax-activated class vector giving a probability for each class.
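To make the shape concrete, here is a sketch of reshaping and activating a raw output tensor of that size. The activation choices follow YOLOv2-style decoding and are assumptions; implementations differ:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    grid, n_anchors, n_classes = 13, 5, 20
    raw = np.random.randn(grid, grid, n_anchors * (5 + n_classes))
    out = raw.reshape(grid, grid, n_anchors, 5 + n_classes)

    xy   = sigmoid(out[..., 0:2])         # center offsets within each cell
    wh   = np.exp(out[..., 2:4])          # scale factors on the anchor w/h
    conf = sigmoid(out[..., 4])           # objectness/confidence score
    cls  = np.exp(out[..., 5:])           # softmax over the class scores
    cls /= cls.sum(axis=-1, keepdims=True)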
If you want more information about the determination of the anchor priors, you can take a look at the original paper on arXiv: https://arxiv.org/pdf/1612.08242.pdf.

What does the training data look like in the YOLO model

I'm trying to figure out how to create a YOLOv1 model from scratch, but I can't figure out what the training data should look like. I suspect the training labels (ground truth) look like a matrix of shape (7, 7, 5*2 + 10) where
7x7 stands for the prediction grid
5 is the object location and confidence (always equal to 1); x,y - the known box center; h,w - the box height and width
*2 is because there should be horizontal and vertical boxes for each cell
10 is the one-hot encoding for the class present at this position
What I don't understand is:
whether to put confidence == 1 on both the horizontal and vertical bounding boxes?
whether x and y should be coordinates in the original (resized for the input) image?
...or maybe I'm completely off with my whole understanding. Does somebody have experience with YOLO?
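For concreteness, here is a minimal sketch of building one target tensor with the layout described above, under common YOLOv1 conventions (S=7 grid, B=2 boxes, C=10 classes; coordinates normalized to [0, 1], centers expressed as offsets within their cell, confidence 1 for the responsible cell). Implementations differ, notably in whether the ground truth is copied to both box slots, so treat this purely as an illustration:

    import numpy as np

    S, B, C = 7, 2, 10
    target = np.zeros((S, S, B * 5 + C))

    # one ground-truth box: center (cx, cy), width w, height h, class k
    cx, cy, w, h, k = 0.55, 0.30, 0.20, 0.40, 3
    col, row = int(cx * S), int(cy * S)           # responsible cell
    x_off, y_off = cx * S - col, cy * S - row     # center offset within cell

    for b in range(B):    # same GT copied to both box slots (one convention)
        target[row, col, b * 5 : b * 5 + 5] = [x_off, y_off, w, h, 1.0]
    target[row, col, B * 5 + k] = 1.0             # one-hot class vector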

SSRS - How to align the vertical axes of 2 chart areas on one chart

I have basically loosely followed this link
http://www.angelsbiblog.com/2012/02/improve-data-visualization-in-your-ssrs.html
and made the graph linked below. It's one dataset; I have simply pulled in the Gross Profit and Sales fields. Neither is a calculated field. I put them in two different chart areas, but then, as per that link, made the chart areas the same size so they overlay.
*Apologies for a photobucket link instead of inserted image but I don't have 10 reputation points to be able to insert images.
http://i1375.photobucket.com/albums/ag447/AndrewJacksons/IncomeandProfit_zpse074ac02.jpg
What I want to do, as illustrated by the green arrow inserted in the graph image, is raise the zero line for the Income bars (yellow) to the same level as the Profit/Loss bars (blue/red).
I also want the two chart areas to preferably share the same vertical axis, so I don't have to have that secondary axis on the right.
However, the main thing is the graphs sharing their zero line. I have made the Profit bars smaller in width than the yellow bars, so in a month of blue profit, it would simply sit neatly inside the yellow income bar.
I haven't added expenses because it should be obvious what they are from the height differential between Income and Profit (or Loss).
Any ideas much appreciated.
I have just experienced this problem, but this page did not solve it.
Dan's answer ("simply set the minimum and maximum values for the vertical axes on both areas manually") came close, but did not solve the problem for me because I needed the axis to be calculated automatically. If the maximum of the two datasets is something like 193,456, then you get that exact value as a label on the axis rather than the sensible value of 200,000.
The solution is to allow SSRS to calculate the axis labels automatically but to trick it by using both sets of data in each chart. Then you hide the data set that you don't want the user to see.
In each chart I made the data series of interest a column chart and the other data series a line chart (without markers), since then all you need to do is set the fill color for the line series to None. If you instead use columns for the hidden series, the invisible columns affect the position of the visible columns even when they have been set to zero width.
Make sure both series in the chart use the primary vertical axis: go into the properties for the Income series, go to "Axes and Chart Area", and check that the series uses the primary vertical axis.

center of a cluster of points and track shape

I have plots of points which look like this.
The tracks which these points form can be a circle or an ellipse. Clearly the centers of the circular tracks in the two images above are different.
How can I find the center point of these tracks (circular/elliptical)? I want to find the (x, y) coordinates of the center; it does not have to be a point in the plotted data set, i.e., I don't want a medoid.
EDIT: Also, is there any way I can find an equation for a circle/ellipse that envelopes the majority of these points? In the elliptical track, I've added an ellipse that envelopes the points on the track. The values were calculated by trial and error, and the center was estimated by eyeballing the plot. How can I do this programmatically?
See the smallest circle problem, and here is a paper (PDF download available) on the smallest ellipse problem. Both have O(N) algorithms and should be able to provide the formula for the circle and the area, from which you can get the center. However, they focus on enclosing all of the points. To get around that, you'll need to remove a number of the bounding points, which you should get from the algorithms as well. Unfortunately, it's pretty much up to you to decide what qualifies as a good enough solution.
A fast and simple randomized solution is:
1. Randomly divide the set of points into k sets of N/k points each.
2. Run the smallest circle/ellipse algorithm on each set.
3. For each of the k sets, pick at least 1 but no more than m bounding points to remove from the main point set.
4. Return to step 1, t times.
5. Return the result of the circle/ellipse algorithm on the remaining points.
The algorithm removes between k and mk bounding points on every pass at a cost of O(N). For your purposes you'll probably want to remove some percentage of the bounding points; 1-25% seems like a good starting point. This solution assumes that k is very small compared to N, otherwise you'll be removing too many points. A sketch of this loop is given below.
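In the sketch, smallest_enclosing_circle is a hypothetical stand-in for whatever O(N) minimal-enclosing-circle/ellipse routine you use (e.g. Welzl's algorithm), assumed to return (center, radius, bounding_points); it is not a real library call:

    import random

    def trim_and_fit(points, k=8, m=2, t=5):
        # points: list of (x, y) tuples.
        # smallest_enclosing_circle() is a hypothetical helper; see above.
        points = list(points)
        for _ in range(t):                        # step 4: repeat t times
            random.shuffle(points)                # step 1: random k-way split
            removed = set()
            for i in range(k):
                chunk = points[i::k]
                _, _, boundary = smallest_enclosing_circle(chunk)   # step 2
                removed.update(boundary[:m])      # step 3: drop up to m per set
            points = [p for p in points if p not in removed]
        return smallest_enclosing_circle(points)  # step 5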
A slower but likely better algorithm is useful if you want to repeatedly remove one or all of the bounding points from the smallest ellipse, recalculate the smallest ellipse, and remove the bounding points again.
You can do this with a tree in which each parent node stores the bounding points (kept as a set for faster removal) of the smallest enclosing ellipse of its children. The maximum number of bounding points should be no more than k (which I'm thinking is 9 for an ellipse, compared to 3 for a circle). Removing a point from the data structure then costs O(k log N), since it requires recalculating the smallest ellipse, which is O(k), for each affected parent, of which there are O(log N). Removing m points from the data structure is therefore O(mk log N). You might also want to calculate the area of the ellipse after every removed point, removing points at a total cost of O(Nk log N) until only three points are left. You could then analyze the area data to determine which ellipse should be used. A simple choice is the ellipse whose area is closest to the average area of all the ellipses created, but that may not be exactly what you seek. It might also be too slow, in which case I recommend a single pass of the faster algorithm.
This looks like an instance of robust ellipse fitting. Check this paper: Outlier Elimination for Robust Ellipse and Ellipsoid Fitting, http://arxiv.org/pdf/0910.4610.pdf.
A first rough and easy solution is provided by the ellipse of inertia (the 2D version of the ellipsoid of inertia, http://en.wikipedia.org/wiki/Moment_of_inertia#Inertia_ellipsoid). Its center is just the centroid, and its axes are given by the eigenvectors/eigenvalues of the 2x2 matrix of inertia.
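A quick numpy version of this idea (the toy data here is made up for illustration):

    import numpy as np

    pts = np.random.randn(200, 2) @ np.array([[3.0, 0.0], [1.0, 1.0]])  # toy data

    center = pts.mean(axis=0)                   # centroid = ellipse center
    cov = np.cov(pts, rowvar=False)             # 2x2 matrix of inertia
    eigvals, eigvecs = np.linalg.eigh(cov)      # axis scales^2 and directions
    print("center:", center)
    print("axis directions:\n", eigvecs)        # columns are the ellipse axes
    print("axis scales:", np.sqrt(eigvals))     # up to a chosen confidence factor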

How to expand a skeleton line back to normal vessels with ITK when I have a radius for every pixel?

I did a thinning operation on vessels, and now I'm trying to reconstruct them.
How can I expand them back to normal vessels in ITK, given the skeleton line and a radius value for each pixel?
DISCLAIMER: This could be slow, but since no other answer has been suggested, here you go.
Since your question does not indicate this, I'm assuming that you're talking about a 2D image, but the following approach can be extended for 3D too. This is how I'd go about it:
1. Create a blank image with zero-filled pixel values.
2. Create multiple instances of a disk/sphere ShapedNeighborhoodIterator on the blank image, each with a different radius (choose the most common radii from the vessel width histogram).
3. Visit each pixel in the binary skeleton image. When you come upon a white (vessel skeleton) pixel, look up the vessel radius at that pixel.
4. If you already have a ShapedNeighborhoodIterator for that radius value, take the iterator to the same pixel location in the blank image and fill in a disk/sphere of white pixels centered at that pixel. If you don't have one for that radius, create it and do the same operation.
Once you finish iterating over the skeletonized image, you will have the reconstructed tree in the other image. Note that step 2 is optional, but it will help you achieve faster computation. A rough sketch of the same idea (without ITK) is given below.
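Not ITK, but here is a minimal numpy sketch of the same stamping idea in 2D (slow, as the disclaimer above says, but it shows the structure):

    import numpy as np

    def reconstruct_vessels(skeleton, radii):
        # skeleton: 2D bool array; radii: same-shape array with the vessel
        # radius stored at every skeleton pixel.
        h, w = skeleton.shape
        out = np.zeros((h, w), dtype=bool)
        yy, xx = np.mgrid[0:h, 0:w]
        for y, x in zip(*np.nonzero(skeleton)):
            r = radii[y, x]
            # stamp a filled disk of radius r centered on the skeleton pixel
            out |= (yy - y) ** 2 + (xx - x) ** 2 <= r * r
        return out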