I tried with input of B*C*H*W, it seems that the result is no difference with input of B*(C*H*W).
So, is Flatten the must do action before euclidean?
Look at the way caffe is computing the loss: as long as the count of both inputs is the same (that is, both bottoms has the same number of elements) the shape into which these elements are arranged into does not matter.
Related
I would like to add a neural network layer that takes as input from a output of another layer in the neural network and another separate number k and outputs the kth element of the list. This layer is supposed to be part a bigger deep network that supplies only the k element to succeeding layer.
The way i think is to dynamically change weights dynamically to a one hot array with only kth element = 1 rest all zeros.
Second way would be to freeze weights and mutliply the previous layer out with the one hot output and input it to next layer. But I am not sure how to do this.
You can just compose top-k modules from any library:
just_kth_element(x, k) := -topk(-topk(x, k=k), k=1)
Since kth element is nothing but smallest element in topK elements.
Or equivalent
just_kth_element(x, k) := min(topk(x, k=k))
I have gone through a couple of YOLO tutorials but I am finding it some what hard to figure if the Anchor boxes for each cell the image is to be divided into is predetermined. In one of the guides I went through, The image was divided into 13x13 cells and it stated each cell predicts 5 anchor boxes(bigger than it, ok here's my first problem because it also says it would first detect what object is present in the small cell before the prediction of the boxes).
How can the small cell predict anchor boxes for an object bigger than it. Also it's said that each cell classifies before predicting its anchor boxes how can the small cell classify the right object in it without querying neighbouring cells if only a small part of the object falls within the cell
E.g. say one of the 13 cells contains only the white pocket part of a man wearing a T-shirt how can that cell classify correctly that a man is present without being linked to its neighbouring cells? with a normal CNN when trying to localize a single object I know the bounding box prediction relates to the whole image so at least I can say the network has an idea of what's going on everywhere on the image before deciding where the box should be.
PS: What I currently think of how the YOLO works is basically each cell is assigned predetermined anchor boxes with a classifier at each end before the boxes with the highest scores for each class is then selected but I am sure it doesn't add up somewhere.
UPDATE: Made a mistake with this question, it should have been about how regular bounding boxes were decided rather than anchor/prior boxes. So I am marking #craq's answer as correct because that's how anchor boxes are decided according to the YOLO v2 paper
I think there are two questions here. Firstly, the one in the title, asking where the anchors come from. Secondly, how anchors are assigned to objects. I'll try to answer both.
Anchors are determined by a k-means procedure, looking at all the bounding boxes in your dataset. If you're looking at vehicles, the ones you see from the side will have an aspect ratio of about 2:1 (width = 2*height). The ones viewed from in front will be roughly square, 1:1. If your dataset includes people, the aspect ratio might be 1:3. Foreground objects will be large, background objects will be small. The k-means routine will figure out a selection of anchors that represent your dataset. k=5 for yolov3, but there are different numbers of anchors for each YOLO version.
It's useful to have anchors that represent your dataset, because YOLO learns how to make small adjustments to the anchor boxes in order to create an accurate bounding box for your object. YOLO can learn small adjustments better/easier than large ones.
The assignment problem is trickier. As I understand it, part of the training process is for YOLO to learn which anchors to use for which object. So the "assignment" isn't deterministic like it might be for the Hungarian algorithm. Because of this, in general, multiple anchors will detect each object, and you need to do non-max-suppression afterwards in order to pick the "best" one (i.e. highest confidence).
There are a couple of points that I needed to understand before I came to grips with anchors:
Anchors can be any size, so they can extend beyond the boundaries of
the 13x13 grid cells. They have to be, in order to detect large
objects.
Anchors only enter in the final layers of YOLO. YOLO's neural network makes 13x13x5=845 predictions (assuming a 13x13 grid and 5 anchors). The predictions are interpreted as offsets to anchors from which to calculate a bounding box. (The predictions also include a confidence/objectness score and a class label.)
YOLO's loss function compares each object in the ground truth with one anchor. It picks the anchor (before any offsets) with highest IoU compared to the ground truth. Then the predictions are added as offsets to the anchor. All other anchors are designated as background.
If anchors which have been assigned to objects have high IoU, their loss is small. Anchors which have not been assigned to objects should predict background by setting confidence close to zero. The final loss function is a combination from all anchors. Since YOLO tries to minimise its overall loss function, the anchor closest to ground truth gets trained to recognise the object, and the other anchors get trained to ignore it.
The following pages helped my understanding of YOLO's anchors:
https://medium.com/#vivek.yadav/part-1-generating-anchor-boxes-for-yolo-like-network-for-vehicle-detection-using-kitti-dataset-b2fe033e5807
https://github.com/pjreddie/darknet/issues/568
I think that your statement about the number of predictions of the network could be misleading. Assuming a 13 x 13 grid and 5 anchor boxes the output of the network has, as I understand it, the following shape: 13 x 13 x 5 x (2+2+nbOfClasses)
13 x 13: the grid
x 5: the anchors
x (2+2+nbOfClasses): (x, y)-coordinates of the center of the bounding box (in the coordinate system of each cell), (h, w)-deviation of the bounding box (deviation to the prior anchor boxes) and a softmax activated class vector indicating a probability for each class.
If you want to have more information about the determination of the anchor priors you can take a look at the original paper in the arxiv: https://arxiv.org/pdf/1612.08242.pdf.
I was reading the fast rcnn caffe code. Inside the SmoothL1LossLayer, I found that the implementation is not the same as the paper equation, is that what it should be ?
The paper equation:
For each labeled bounding box with class u, we calculate the sum error of tx, ty, tw, th, but in the code, we have:
There is no class label information used. Can anyone explain why?
And in the backpropagation step,
why there is an i here ?
In train.prototxt bbox_pred has output size 84 = 4(x,y,h,w) * 21(number of label). So does bbox_targets. So it is using all labels.
As for loss layers it is looping over bottom blobs to find which on to propagate gradient through. Here only one of propagate_down[i] is true.
I will be thankful if you answer my question. I am worried I am doing wrong, because my network always gives black image without any segmentation.
I am doing semantic segmentation in Caffe. The output of score layer is <1 5 256 256> batch_size no_classes image_width image_height. Which is sent to SoftmaxWithLoss layer, and the out input of loss layer is the groundtruth image with 5 class labels <1 1 256 256>.
My question is: the dimension of these two inputs of loss layer does not match. Should I create 5 label images for these 5 classes and send a batch_size of 5 in label layer into the loss layer?
How can I prepare label data for semantic segmentation?
Regards
your dimensions are okay. you are outputting 5 vector per pixel indicating the probability of each class. The ground truth is a single label (integer) and the loss encourages the probability of the correct label to be the maximal for the pixel
I am trying to solve a simple classification problem where the label has 12 different levels and need to classify each example into one of these 12. However, I want my output to look like refer the image:
http://i.stack.imgur.com/49USG.png
Here; assuming that I set a confidence threshold of 20%; I want my output to contain all the labels for each id which are above 20% and ordered (highest confidence first). If none of the labels are above 20%; then a default label.
More specifically, are there any existing operators in Rapidminer which could give such an output?
Whenever the Apply Model operator runs, it produces new special attributes corresponding to confidences for the individual values of the label attribute. So if the label has values one, two, three, three new attributes will be created confidence(one), confidence(two), confidence(three). It would be possible to use the Generate Attributes operator to work out some logic to decide how to really classify each example. It would also be possible to use the Apply Threshold operator (with Create Threshold) to do something similar. It's impossible to give any more guidance unless you post a representative example with data.