Suppose we pass our input image into a convolutional layer, as in the sample Caffe net:
layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  ...
  convolution_param {
    num_output: 96
    kernel_size: 11
    stride: 4
  }
  ...
}
How can the network give us exactly the number of outputs we want while also using precisely the size and stride of the convolution kernel that it's given? Shouldn't kernel size and stride already determine the number of outputs we will get (modulo padding decisions)?
If I had a 5x5 image, convolved it with a 3x3 kernel using stride 2 and zero padding the boundary, then I would expect to get a 3x3 output from the convolution. But what if I ask for num_output: 5? Or num_output: 100?
After some experimentation, it looks like this num_output parameter actually determines how many distinct filters you convolve with the entire image (at least in the single-channel case), i.e. the number of output channels or feature maps. So it in fact does not interact with the width and height of the image and filter at all.
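To check this, here is a small numpy sketch (my own illustration, not Caffe code) of how the shapes work out for the conv1 layer above, assuming a 227x227x3 input as in CaffeNet: the spatial size comes only from kernel_size, stride and pad, while num_output just sets how many independent filters, and hence output channels, there are.

import numpy as np

def conv_output_shape(h, w, kernel, stride, pad=0):
    """Spatial output size of a standard convolution (floor division, as in Caffe)."""
    return (h + 2 * pad - kernel) // stride + 1, (w + 2 * pad - kernel) // stride + 1

h, w, in_channels = 227, 227, 3          # assumed input size (CaffeNet crop)
num_output, kernel, stride = 96, 11, 4   # the conv1 parameters above

h_out, w_out = conv_output_shape(h, w, kernel, stride)
# one independent filter per output channel: num_output x in_channels x k x k weights
weights = np.random.randn(num_output, in_channels, kernel, kernel)
print(num_output, h_out, w_out)          # -> 96 55 55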
Related
Some object detection frameworks, such as SSD (Single Shot MultiBox Detector) and Faster R-CNN, have "convolutional filters" for classification and regression. The following is from the SSD paper:
For a feature layer of size m × n with p channels, the basic element for predicting parameters of a potential detection is a 3 × 3 × p small kernel that produces either a score for a category, or a shape offset relative to the default box coordinates. At each of the m × n locations where the kernel is applied, it produces an output value.
My question is: does the number of "small kernels" have to be p? What about setting an arbitrary number k (which is not the same as the number of feature channels)?
In the figure, the part labeled "Extra Feature Layers" shows how the small kernels extract a p-vector from each output location, which predicts detections for different aspect ratios and class categories.
For example, for the first convolutional feature map, p is (3 x (classes + 4)), and for the second one it is (6 x (classes + 4)). The numbers 3 and 6 indicate the number of anchor boxes defined for those feature maps, and for each of those anchor boxes the output is one score per class plus 4 box coordinates.
So you need to fix p based on the number of anchor boxes you choose for each feature map and the number of classes you want to detect.
My question is: does the number of "small kernels" have to be p? What about setting an arbitrary number k (which is not the same as the number of feature channels)?
The prediction feature map is the result of convolving with the 3x3xp kernel, so it will always have p channels; p is the output channel count of the kernel. Note that 3x3xp is actually 3 x 3 x in_channels x p: for example, the first feature layer is obtained by convolving the 38x38x512 output from VGG with a 3x3x512xp kernel to get 38x38xp.
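To make the arithmetic concrete, here is a small Python sketch (illustrative numbers and names, not SSD source code) of how p follows from the anchor and class counts for the first prediction layer:

num_classes = 21        # e.g. 20 object classes + background (assumption)
num_anchors = 3         # anchor boxes tied to the first feature map, as above
p = num_anchors * (num_classes + 4)       # class scores + 4 box offsets per anchor

in_channels = 512       # channels of the 38x38 VGG feature map
kernel_shape = (p, in_channels, 3, 3)     # Caffe weight layout: out x in x kH x kW
print(p, kernel_shape)  # -> 75 (75, 512, 3, 3); the output blob is 38x38x75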
I would be thankful if you could answer my question. I am worried I am doing something wrong, because my network always gives a black image without any segmentation.
I am doing semantic segmentation in Caffe. The output of the score layer is <1 5 256 256>, i.e. batch_size x num_classes x image_height x image_width. This is sent to a SoftmaxWithLoss layer, and the other input of the loss layer is the ground-truth image with 5 class labels, <1 1 256 256>.
My question is: the dimensions of these two inputs to the loss layer do not match. Should I create 5 label images for these 5 classes and send a batch_size of 5 in the label layer into the loss layer?
How can I prepare label data for semantic segmentation?
Regards
Your dimensions are okay. You are outputting a 5-vector per pixel, indicating the probability of each class. The ground truth is a single label (integer) per pixel, and the loss encourages the probability of the correct label to be the maximal one for that pixel.
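For intuition, here is a rough numpy sketch (my own illustration, not Caffe's SoftmaxWithLoss internals) of how a per-pixel softmax loss consumes exactly those two shapes:

import numpy as np

scores = np.random.randn(1, 5, 256, 256)             # <1 5 256 256> class scores per pixel
labels = np.random.randint(0, 5, (1, 1, 256, 256))   # <1 1 256 256> integer label per pixel

# softmax over the channel (class) axis
e = np.exp(scores - scores.max(axis=1, keepdims=True))
probs = e / e.sum(axis=1, keepdims=True)

# pick the probability of the ground-truth class at every pixel
gt = labels[:, 0]                                     # <1 256 256>
picked = np.take_along_axis(probs, gt[:, None], axis=1)[:, 0]
loss = -np.log(picked + 1e-12).mean()                 # averaged over all pixels
print(loss)

So there is no need to create 5 separate label images; one integer label map is exactly what the loss expects.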
I need help trying to understand how the math of a deconv layer works. Let's talk about this layer:
layer {
  name: "decon"
  type: "Deconvolution"
  bottom: "conv2"
  top: "decon"
  convolution_param {
    num_output: 1
    kernel_size: 4
    stride: 2
    pad: 1
  }
}
So basically this layer is supposed to "upscale" an image by a factor of 2. If I look at the learned weights, I see e.g. this:
-0.0629104823 -0.1560362280 -0.1512266700 -0.0636162385
-0.0635886043 +0.2607241870 +0.2634004350 -0.0603787377
-0.0718072355 +0.3858278100 +0.3168329000 -0.0817491412
-0.0811873227 -0.0312164668 -0.0321144797 -0.0388795212
So far, so good. Now I'm trying to understand how to apply these weights to actually achieve the upscaling effect. I need to do this in my own code because I want to use simple pixel shaders.
Looking at the Caffe code, "DeconvolutionLayer::Forward_cpu" internally calls "backward_cpu_gemm", which does "gemm", followed by "col2im". My understanding of how all this works is this: gemm takes the input image, and multiplies each pixel with each of the 16 weights listed above. So basically gemm produces 16 output "images". Then col2im sums up these 16 "images" to produce the final output image. But due to the stride of 2, it stretches the 16 gemm images over the output image in such a way that each output pixel is only comprised of 4 gemm pixels. Does that sound correct to you so far?
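To check my understanding, here is a small numpy sketch (my own code, not Caffe's) of a single-channel stride-2, kernel-4, pad-1 deconvolution computed by "scatter and add", which I believe produces the same result as the gemm + col2im path:

import numpy as np

def deconv2d(x, k, stride=2, pad=1, bias=0.0):
    """Single-channel transposed convolution by scatter-and-add."""
    kh, kw = k.shape
    h, w = x.shape
    out_h = stride * (h - 1) + kh - 2 * pad
    out_w = stride * (w - 1) + kw - 2 * pad
    full = np.zeros((stride * (h - 1) + kh, stride * (w - 1) + kw))
    for i in range(h):
        for j in range(w):
            # every input pixel "paints" a weighted copy of the 4x4 kernel;
            # overlapping copies simply add up (this is what col2im sums)
            full[i * stride:i * stride + kh, j * stride:j * stride + kw] += x[i, j] * k
    return full[pad:pad + out_h, pad:pad + out_w] + bias

x = np.arange(9, dtype=float).reshape(3, 3)   # a toy 3x3 "low-res" input
k = np.random.randn(4, 4)                     # a 4x4 deconv kernel
print(deconv2d(x, k).shape)                   # -> (6, 6): upscaled by a factor of 2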
My understanding is that each output pixel is calculated from the nearest 4 low-res pixels, using 4 weights from the 4x4 deconv weight matrix. If you look at the following image:
https://i.stack.imgur.com/X6iXE.png
Each output pixel uses either the yellow, pink, grey or white weights, but not the other weights. Do I understand that correctly? If so, I have a huge understanding problem, because in order for this whole concept to work correctly, e.g. the yellow weights should add up to the same sum as the pink weights etc. But they do not! As a result my pixel shader produces images where 1 out of 4 pixels is darker than the others, or every other line is darker, or things like that (depending on which trained model I'm using). Obviously, when running the model through Caffe, no such artifacts occur. So I must have a misunderstanding somewhere. But I can't find it... :-(
P.S: Just to complete the information: There's a conv layer in front of the deconv layer with "num_output" of e.g. 64. So the deconv layer actually has e.g. 64 4x4 weights, plus one bias, of course.
After a lot of debugging I found that my understanding of the deconv layer was perfectly alright. I fixed all the artifacts by simply dividing the bias floats by 255.0. That's necessary because pixel shaders run in the 0-1 range, while the Caffe bias constants seem to be targeted at 0-255 pixel values.
Everything working great now.
I still don't understand why the 4 weight pairs don't sum up to the same value and how that can possibly work. But what do I know. It does work, after all. I suppose some things will always be a mystery to me.
I am trying to train a fully convolutional network for my problem. I am using the implementation at https://github.com/shelhamer/fcn.berkeleyvision.org.
I have different image sizes.
I am not sure how to set the 'Offset' param in the 'Crop' layer.
What are the default values for the 'Offset' param?
How to use this param to crop the images around the center?
According to the Crop layer documentation, it takes two bottom blobs and outputs one top blob. Let's call the bottom blobs A and B, and the top blob T.
A -> 32 x 3 x 224 x 224
B -> 32 x m x n x p
Then,
T -> 32 x m x n x p
Regarding axis parameter, from docs:
Takes a Blob and crop it, to the shape specified by the second input Blob, across all dimensions after the specified axis.
which means, if we set axis = 1, then it will crop dimensions 1, 2, 3. If axis = 2, then T would have been of the size 32 x 3 x n x p. You can also set axis to a negative value, such as -1, which would mean the last dimension, i.e. 3 in this case.
Regarding the offset parameter, I checked $CAFFE_ROOT/src/caffe/proto/caffe.proto (on line 630) and did not find any default value for offset, so I assume that you have to provide that parameter, otherwise it will result in an error. However, I may be wrong.
Now, Caffe knows that you need a blob of size m on the first axis. We still need to tell Caffe where to crop from. That's where offset comes in. If offset is 10, then your blob of size m will be cropped starting at 10 and ending at 10+m-1 (for a total size of m). Set a single value for offset to crop by that amount in all the cropped dimensions (which are determined by axis, remember? In this case 1, 2, 3). Otherwise, if you want to crop each dimension differently, you have to specify as many offsets as there are dimensions being cropped (in this case 3). To sum it all up:
If you have a blob of size 32 x 3 x 224 x 224 and you want to crop a center part of size 32 x 3 x 32 x 64, then you would write the crop layer as follows:
layer {
  name: "T"
  type: "Crop"
  bottom: "A"
  bottom: "B"
  top: "T"
  crop_param {
    axis: 2
    offset: 96
    offset: 80
  }
}
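For a center crop, each offset is simply (input_dim - output_dim) // 2. A quick check in plain Python of the numbers used above (axis: 2, so only the last two dimensions are cropped):

in_h, in_w = 224, 224
out_h, out_w = 32, 64
print((in_h - out_h) // 2, (in_w - out_w) // 2)   # -> 96 80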
When a convolution uses a kernel size of 4 and a stride of 4 while the input size is only 10, it will fail when trying to do the third convolution operation at the boundary of the input. So, should the input be implicitly padded with zeros at the boundary to avoid this problem? Is there any problem if I pad with other real numbers? Is it equivalent to automatically increasing the input size?
Besides, if I want an output feature map of the same size as the input, a kernel size of 3 with a pad size of 1 can usually be used; but when the kernel size is some other odd number, how do I decide the pad size on each side of the input?
Yes, the input must be padded with zeros to overcome the small input size problem. To compute the size of the output feature maps at each level, use the following formulas:
H_out = (H_in + 2 x Padding_Height - Kernel_Height) / Stride_Height + 1
W_out = (W_in + 2 x Padding_Width - Kernel_Width) / Stride_Width + 1
Choose the padding so that these formulas give the output size you want (for a same-size output with stride 1 and an odd kernel size k, that means a pad of (k - 1) / 2 on each side).
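As a quick sanity check, here is that formula as a tiny Python helper (the function name is mine, not Caffe's), applied to the case from the question (input 10, kernel 4, stride 4):

def conv_out(size, kernel, stride, pad=0):
    # integer (floor) division, matching how Caffe computes output sizes
    return (size + 2 * pad - kernel) // stride + 1

print(conv_out(10, 4, 4, pad=0))   # -> 2 (a third window would run off the edge)
print(conv_out(10, 4, 4, pad=1))   # -> 3 (padding lets a third window fit)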