I went through the Faster R-CNN code to better understand the implementation.
I used gdb to debug the C++ code behind the Python interface, so I can step through the C++ code line by line.
The paper (page 4, first paragraph) mentions splitting the convolutional feature map into 2k scores and 4k coordinates.
That is implemented in this prototxt as:
layer {
name: "rpn_conv/3x3"
type: "Convolution"
bottom: "conv5_3"
top: "rpn/output"
param { lr_mult: 1.0 }
param { lr_mult: 2.0 }
convolution_param {
num_output: 512
kernel_size: 3 pad: 1 stride: 1
weight_filler { type: "gaussian" std: 0.01 }
bias_filler { type: "constant" value: 0 }
}
}
layer {
name: "rpn_cls_score"
type: "Convolution"
bottom: "rpn/output"
top: "rpn_cls_score"
param { lr_mult: 1.0 }
param { lr_mult: 2.0 }
convolution_param {
num_output: 18 # 2(bg/fg) * 9(anchors)
kernel_size: 1 pad: 0 stride: 1
weight_filler { type: "gaussian" std: 0.01 }
bias_filler { type: "constant" value: 0 }
}
}
layer {
name: "rpn_bbox_pred"
type: "Convolution"
bottom: "rpn/output"
top: "rpn_bbox_pred"
param { lr_mult: 1.0 }
param { lr_mult: 2.0 }
convolution_param {
num_output: 36 # 4 * 9(anchors)
kernel_size: 1 pad: 0 stride: 1
weight_filler { type: "gaussian" std: 0.01 }
bias_filler { type: "constant" value: 0 }
}
}
But when I step through the code, this is actually executed in cudnn_conv_layer.cpp and cudnn_conv_layer.cu.
After passing through the rpn_cls_score and rpn_bbox_pred layers, the output blob shapes are
capacity 4 = {1, 18, 36, 49}
capacity 4 = {1, 36, 36, 49}, so the scores and boxes have been split.
(1) How can I understand the process by which the 256- or 512-dimensional feature is split into {1, 18, 36, 49} and {1, 36, 36, 49}? There are lr_mult parameters, but I can't even find where lr_mult is used.
(2) Page 5, first column, discusses the loss. Where in the source code is this loss minimization with SGD implemented?
It would be better to work through some basic tutorials on CNNs and Caffe first. Jumping directly into the Caffe implementation without the background theory is likely to cause more confusion.
The region of the prototxt you showed is just three convolution layers.
In layer "rpn_cls_score", the 512-channel input feature maps are convolved with 18 filters to produce 18 output feature maps. In layer "rpn_bbox_pred", the same 512-channel input feature maps are convolved with 36 filters to produce 36 output feature maps. Nothing is split: both layers read the same bottom blob, "rpn/output".
Both of these are ordinary convolution layers.
See the CPU implementation : https://github.com/BVLC/caffe/blob/master/src/caffe/layers/conv_layer.cpp
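To make the shapes concrete, here is a minimal NumPy sketch of the two 1x1 convolutions. It is not Caffe's code; the function name conv1x1 and the random input are assumptions for illustration, and only the blob shapes and filler settings come from the question.

```python
import numpy as np

def conv1x1(feature_map, weights, bias):
    """Apply a 1x1 convolution: a per-pixel linear map over channels.

    feature_map: (N, C_in, H, W); weights: (C_out, C_in, 1, 1); bias: (C_out,)
    """
    w = weights[:, :, 0, 0]  # (C_out, C_in)
    out = np.einsum("oc,nchw->nohw", w, feature_map)
    return out + bias[None, :, None, None]

rng = np.random.default_rng(0)

# Stand-in for the "rpn/output" blob: 512 channels on a 36x49 grid.
rpn_output = rng.standard_normal((1, 512, 36, 49)).astype(np.float32)

# Same initialisation as the prototxt: gaussian weights (std 0.01), zero bias.
cls_w = (0.01 * rng.standard_normal((18, 512, 1, 1))).astype(np.float32)
bbox_w = (0.01 * rng.standard_normal((36, 512, 1, 1))).astype(np.float32)

# Both layers read the same input; only the number of filters differs.
rpn_cls_score = conv1x1(rpn_output, cls_w, np.zeros(18, dtype=np.float32))
rpn_bbox_pred = conv1x1(rpn_output, bbox_w, np.zeros(36, dtype=np.float32))

print(rpn_cls_score.shape)  # (1, 18, 36, 49)
print(rpn_bbox_pred.shape)  # (1, 36, 36, 49)
```

The spatial size stays 36x49 because kernel_size is 1 with pad 0 and stride 1; only the channel dimension changes, and it equals num_output.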
lr_mult is a learning-rate multiplier. Your solver.prototxt defines a base_lr, which is multiplied by the lr_mult of each parameter blob to get the effective learning rate for that blob. In the layers above, the first param block applies to the weights and the second to the bias. This is part of the parameter update and is hidden from the user. (That is the beauty of machine learning frameworks.)
Again, the entire backward pass and parameter update are done by Caffe in the background, so the user need not worry about them. Since you are looking for the implementation, see the SGD solver here: https://github.com/BVLC/caffe/blob/master/src/caffe/solvers/sgd_solver.cpp
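As a rough illustration of where lr_mult enters, here is a NumPy sketch of one SGD-with-momentum step in the style of Caffe's solver. It is a simplification, not the solver's actual code; the function name sgd_step and the values of base_lr and momentum are made up for the example.

```python
import numpy as np

def sgd_step(param, grad, history, base_lr, lr_mult=1.0,
             momentum=0.9, weight_decay=0.0, decay_mult=1.0):
    """Update one parameter blob in place and return it."""
    local_rate = base_lr * lr_mult            # effective learning rate
    local_decay = weight_decay * decay_mult   # effective L2 weight decay
    grad = grad + local_decay * param         # add the regularisation gradient
    history[:] = momentum * history + local_rate * grad
    param -= history
    return param

base_lr = 0.001  # assumed solver setting, not taken from the question

weights = np.array([1.0, -2.0])
bias = np.array([0.5])
w_hist = np.zeros_like(weights)
b_hist = np.zeros_like(bias)

# Same gradient value for both blobs, so only lr_mult changes the step size.
sgd_step(weights, np.array([0.5, 0.5]), w_hist, base_lr, lr_mult=1.0)
sgd_step(bias, np.array([0.5]), b_hist, base_lr, lr_mult=2.0)

print(weights)  # moved by 0.0005 per element
print(bias)     # moved by 0.001, twice as far
```

With lr_mult: 2.0 on the second param block, the bias takes twice the step that the weights take for the same gradient.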
Related
I have this network loaded with net = caffe.Net('mobilenet_v2_deploy.prototxt', caffe.TEST). To save the weights of the whole network I can call net.save('mymodel.caffemodel'). But how can I save only a particular layer's weights? I know that to inspect the conv1 layer's weights I can use (1) net.params['conv1'][0].data, but this just prints some text on the command line; it does not save a caffemodel-like file.
(1) Outputs something such as the following (no file is saved):
array([[[[-0.1010774 ]],
[[-0.03301976]],
[[ 0.19851202]],
...,
Example of the prototxt file
name: "MOBILENET_V2"
# transform_param {
# scale: 0.017
# mirror: false
# crop_size: 224
# mean_value: [103.94,116.78,123.68]
# }
input: "data"
input_dim: 1
input_dim: 3
input_dim: 224
input_dim: 224
layer {
name: "conv1"
type: "Convolution"
bottom: "data"
top: "conv1"
param {
lr_mult: 1
decay_mult: 1
}
convolution_param {
num_output: 32
bias_term: false
pad: 1
kernel_size: 3
stride: 2
weight_filler {
type: "msra"
}
}
}
You can use the method described in this answer to convert the weights to numpy and then numpy.save() only the weights that you care about.
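A minimal sketch of that idea follows. To keep it runnable without Caffe, the layer's blobs are replaced by a stand-in list of NumPy arrays; with pycaffe loaded you would pass [blob.data for blob in net.params['conv1']] instead. The helper name save_layer_params and the file name are hypothetical.

```python
import os
import tempfile

import numpy as np

def save_layer_params(layer_blobs, path):
    """Save one layer's parameter arrays (weights, then bias if any) to a .npz file."""
    arrays = {"param_%d" % i: np.asarray(blob) for i, blob in enumerate(layer_blobs)}
    np.savez(path, **arrays)

# Stand-in for [blob.data for blob in net.params['conv1']].
# conv1 above has 32 filters of size 3x3 over 3 input channels and no bias term.
conv1_blobs = [np.zeros((32, 3, 3, 3), dtype=np.float32)]

path = os.path.join(tempfile.mkdtemp(), "conv1_params.npz")
save_layer_params(conv1_blobs, path)

restored = np.load(path)
print(restored["param_0"].shape)  # (32, 3, 3, 3)
```

Note that this writes a NumPy archive, not a caffemodel file; it only stores the arrays of the layer you pick.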
So basically these are the dimensions of the weights from the trained CaffeNet:
conv1: (96, 3, 11, 11)
conv2: (256, 48, 5, 5)
conv3: (384, 256, 3, 3)
conv4: (384, 192, 3, 3)
conv5: (256, 192, 3, 3)
I am confused: conv1 gives 96 channels as output, so why does conv2 consider only 48 channels during convolution? Am I missing something?
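Yes, you missed the parameter 'group'. The convolution_param of the conv2 layer is given below. The group parameter is set to 2, so the 96 input channels are divided into two groups of 48 and each filter is convolved with only the 48 channels of its own group. That is why the second dimension of the conv2 weights is 48. Grouping the convolution this way can save GPU memory.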
convolution_param {
num_output: 256
pad: 2
kernel_size: 5
group: 2
weight_filler {
type: "gaussian"
std: 0.01
}
bias_filler {
type: "constant"
value: 1
}
}
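The arithmetic can be checked with a small sketch. The helper name conv_weight_shape is hypothetical, and the group values used for conv3 to conv5 are inferred from the shapes listed in the question rather than read from the prototxt.

```python
def conv_weight_shape(num_output, in_channels, kernel_size, group=1):
    """Shape of a convolution weight blob: each filter sees in_channels / group channels."""
    if in_channels % group or num_output % group:
        raise ValueError("channels must be divisible by group")
    return (num_output, in_channels // group, kernel_size, kernel_size)

print(conv_weight_shape(256, 96, 5, group=2))   # conv2: (256, 48, 5, 5)
print(conv_weight_shape(384, 256, 3, group=1))  # conv3: (384, 256, 3, 3)
print(conv_weight_shape(384, 384, 3, group=2))  # conv4: (384, 192, 3, 3)
print(conv_weight_shape(256, 384, 3, group=2))  # conv5: (256, 192, 3, 3)
```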
We know that a convolution layer in a CNN uses filters, and that different filters look for different information in the input image.
But say that in this SSD we have a prototxt file that specifies a convolution layer as
layer {
name: "conv2_1"
type: "Convolution"
bottom: "pool1"
top: "conv2_1"
param {
lr_mult: 1.0
decay_mult: 1.0
}
param {
lr_mult: 2.0
decay_mult: 0.0
}
convolution_param {
num_output: 128
pad: 1
kernel_size: 3
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
value: 0.0
}
}
}
All convolution layers in different networks (GoogLeNet, AlexNet, VGG, etc.) are more or less similar.
Just by looking at this definition, how can I tell which information from the input image the filters in this convolution layer try to extract?
EDIT:
Let me clarify my question.
I see two convolution layers in the prototxt file, as follows. They are from SSD.
layer {
name: "conv1_1"
type: "Convolution"
bottom: "data"
top: "conv1_1"
param {
lr_mult: 1.0
decay_mult: 1.0
}
param {
lr_mult: 2.0
decay_mult: 0.0
}
convolution_param {
num_output: 64
pad: 1
kernel_size: 3
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
value: 0.0
}
}
}
layer {
name: "conv2_1"
type: "Convolution"
bottom: "pool1"
top: "conv2_1"
param {
lr_mult: 1.0
decay_mult: 1.0
}
param {
lr_mult: 2.0
decay_mult: 0.0
}
convolution_param {
num_output: 128
pad: 1
kernel_size: 3
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
value: 0.0
}
}
}
Then I printed their outputs:
Data
conv1_1 and conv2_1 images are here and here.
So my question is: how do these two conv layers produce different outputs when there is no difference between them in the prototxt file?
The filters at earlier layers represent low-level features like edges. (These features retain higher spatial resolution for precise localization, with low-level visual information similar to the response map of Gabor filters.) The filters at the mid-level layers, on the other hand, extract more complex features such as corners or blobs.
And as you go deeper, you can no longer visualize and interpret these features directly, because filters in mid-level and high-level layers are not directly connected to the input image. For instance, the output of the first layer can be visualized and interpreted as edges, but when you apply a second convolution layer to these extracted edges (the output of the first layer), you get something like edges of edges, which captures more semantic information and fewer fine-grained spatial details. In the prototxt file, all convolutions and other types of operations can resemble each other, but they extract different kinds of features because they sit at different positions in the network and have different weights.
"Convolution" layers differ not only in their parameters (e.g., kernel_size, stride, pad, etc.) but also in their weights: the trainable parameters of the convolution kernels.
You see different outputs (aka "responses") because the weights of the filters are different.
See this answer regarding the difference between "data" blobs and "parameter/weights" blobs in caffe.
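A small NumPy sketch makes the point: two filters with the same kernel_size, stride and pad but different weights respond very differently to the same input. The toy image and the two hand-written edge kernels are assumptions for illustration; trained networks learn their weights instead.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Plain 2D cross-correlation (what a conv layer computes), no padding, stride 1."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=np.float64)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy image with a single vertical edge: left half 0, right half 1.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Same kernel size, different weights.
vertical_edge = np.array([[-1.0, 0.0, 1.0]] * 3)
horizontal_edge = vertical_edge.T

resp_v = conv2d_valid(image, vertical_edge)
resp_h = conv2d_valid(image, horizontal_edge)

print(np.abs(resp_v).max())  # 3.0: strong response to the vertical edge
print(np.abs(resp_h).max())  # 0.0: no response, the image has no horizontal edge
```

Both filters would be described by identical convolution_param entries in a prototxt; the difference lives entirely in the weight blob.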
I'm using the following command to draw the block diagram of networks from prototxt files in caffe
python draw_net.py <filename.prototxt> <output.png>
This works fine if I use Alexnet, BVLC Caffenet or even RCNN. But when I use VGG-16 file, it gives a blank output image of size 11x11. No error is thrown. I have verified the paths too. All the files are taken from the Caffe Model Zoo. I'm using the Caffe taken from the master branch.
Your VGG16 file may use the old layer definition syntax:
layers {
bottom: "data"
top: "conv1_1"
name: "conv1_1"
type: CONVOLUTION
convolution_param {
num_output: 64
pad: 1
kernel_size: 3
}
}
To make it work, you need to use the new layer syntax, with "layer" instead of "layers" and a quoted string for type:
layer {
bottom: "conv1_1"
top: "conv1_2"
name: "conv1_2"
type: "Convolution"
convolution_param {
num_output: 64
pad: 1
kernel_size: 3
}
param {
lr_mult: 0
}
param {
lr_mult: 0
}
}
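For a quick experiment, the two visible differences (layers vs. layer, and the enum-style type) can be rewritten with a few lines of Python. This is only a rough sketch with a hypothetical helper name and a partial type table; it does not touch other old-style fields such as blobs_lr. Caffe's tools directory also contains an upgrade_net_proto_text utility, which is the safer route for a full conversion.

```python
import re

# Partial map from old enum names to current layer type strings (not exhaustive).
V1_TO_NEW_TYPE = {
    "CONVOLUTION": "Convolution",
    "RELU": "ReLU",
    "POOLING": "Pooling",
    "INNER_PRODUCT": "InnerProduct",
    "DROPOUT": "Dropout",
    "SOFTMAX": "Softmax",
}

def upgrade_layer_syntax(prototxt_text):
    """Rewrite 'layers {' to 'layer {' and enum types to quoted strings."""
    text = re.sub(r"^([ \t]*)layers([ \t]*\{)", r"\1layer\2",
                  prototxt_text, flags=re.MULTILINE)

    def fix_type(match):
        indent, name = match.group(1), match.group(2)
        if name not in V1_TO_NEW_TYPE:
            return match.group(0)  # leave unknown types untouched
        return '%stype: "%s"' % (indent, V1_TO_NEW_TYPE[name])

    return re.sub(r"^([ \t]*)type:[ \t]*([A-Z][A-Z_0-9]*)[ \t]*$", fix_type,
                  text, flags=re.MULTILINE)

old = 'layers {\n  name: "conv1_1"\n  type: CONVOLUTION\n}\n'
print(upgrade_layer_syntax(old))
```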
I have, admittedly, a rather large network. It's based on a network from a paper that claims to use Caffe for the implementation. Here's the topology:
To the best of my ability, I've tried to recreate the model. The authors use the term "upconv" which is a combination of 2x2 unpooling followed by 5x5 convolution. I've taken this to mean a deconvolutional layer with stride 2 and kernel size 5 (please do correct me if you believe otherwise). Here's a short snippet from the full model and solver:
...
# upconv2
layer {
name: "upconv2"
type: "Deconvolution"
bottom: "upconv1rec"
top: "upconv2"
convolution_param {
num_output: 65536 # 256x16x16
kernel_size: 5
stride: 2
}
}
layer {
name: "upconv2-rec"
type: "ReLU"
bottom: "upconv2"
top: "upconv2rec"
relu_param {
negative_slope: 0.01
}
}
# upconv3
layer {
name: "upconv3"
type: "Deconvolution"
bottom: "upconv2rec"
top: "upconv3"
convolution_param {
num_output: 94208 # 92x32x32
kernel_size: 5
stride: 2
}
}
...
But it seems this is too large for Caffe to handle:
I0502 10:42:08.859184 13048 net.cpp:86] Creating Layer upconv3
I0502 10:42:08.859184 13048 net.cpp:408] upconv3 <- upconv2rec
I0502 10:42:08.859184 13048 net.cpp:382] upconv3 -> upconv3
F0502 10:42:08.859184 13048 blob.cpp:34] Check failed: shape[i] <= 2147483647 / count_ (94208 vs. 32767) blob size exceeds INT_MAX
How can I get around this limitation?
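The numbers in the failed check can be reproduced with a short sketch. It assumes the blob that fails is the weight blob of upconv3, and that a Deconvolution weight blob is laid out as (input channels, num_output / group, kernel height, kernel width); the function names are hypothetical and the guard only mimics the check reported from blob.cpp.

```python
INT_MAX = 2147483647

def check_blob_shape(shape):
    """Mimic the overflow guard: return the element count or raise."""
    count = 1
    for dim in shape:
        limit = INT_MAX // count
        if dim > limit:
            raise OverflowError("blob size exceeds INT_MAX (%d vs. %d)" % (dim, limit))
        count *= dim
    return count

def deconv_weight_shape(in_channels, num_output, kernel_size, group=1):
    """Assumed layout of a Deconvolution weight blob."""
    return (in_channels, num_output // group, kernel_size, kernel_size)

# upconv3 receives the 65536 channels produced by upconv2 and asks for 94208 outputs.
try:
    check_blob_shape(deconv_weight_shape(65536, 94208, 5))
except OverflowError as err:
    print(err)  # blob size exceeds INT_MAX (94208 vs. 32767)

# If the comments mean 256 and 92 channels, the weight blob is small.
print(check_blob_shape(deconv_weight_shape(256, 92, 5)))  # 588800
```

num_output is the number of output channels, not the total number of output elements. If the comments "256x16x16" and "92x32x32" describe channels times spatial size, setting num_output to 256 and 92 would keep the blobs well within the limit; the spatial size then follows from the input size, stride and kernel_size.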