I am trying to understand the model architecture of DINO https://arxiv.org/pdf/2203.03605.pdf
These are the last few layers I see when I execute model.children()
Question 1)
In class_embed, (0) is of dimension 256 by 91, and if it's feeding into (1) of class_embed, shouldn't the first dimension be 91?
So, I realize (0) of class_embed is not actually feeding into (1) of class_embed. Could someone explain this to me?
Question 2)
Also, the last layer (2) of the MLP (see the first picture, which says (5): MLP) has dimension 256 by 4. So shouldn't the first dimension of class_embed (0) be 4?
Now, when I use a different function to print the layers, the layers shown above appear grouped together. For example, there is only one layer of
Linear(in_features=256, out_features=91, bias=True)]
Why does this function give me a different architecture?
Question 3)
Now, I went on to create a hook for the 3rd last layer.
When I print the size, I am getting 1 by 900 by 256. Shouldn't I be getting something like 1 by 256 by 256 ?
Code to find dimension:
Output:
especially since layer 4 is :
I want to create an NN layer such that:
for an input of size 100, assume every 5 samples form a "block"
the layer should compute, let's say, 3 values for every block
so the input/output sizes of this layer should be: 100 -> 20*3
every block of size 5 (and only this block) is fully connected to the corresponding result block of size 3
If I understand it correctly I can use Conv2d for this problem. But I'm not sure how to correctly choose conv2d parameters.
Is Conv2d suitable for this task? If so, what are the correct parameters? Is that
input channels = 100
output channels = 20*3
kernel = (5,1)
?
You can use either Conv2D or Conv1D.
With the data shaped like batch x 100 x n_features you can use Conv1D with this setup:
Input channels: n_features
Output channels: 3 * output_features
kernel: 5
strides: 5
This way, the kernel is applied to 5 samples at a time and generates 3 outputs. The values of n_features and output_features can be anything you like and may as well be 1. Setting the strides to 5 makes the convolution non-overlapping, so that each block contributes to exactly one output.
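A minimal pure-Python sketch of what this strided convolution computes, assuming a single input feature and 3 output values per block. The function name blockwise_conv1d and the random weights are made up for illustration and are not part of any library. Note that, as with any convolution, the same 5-to-3 weight matrix is reused for every block.

```python
import random

def blockwise_conv1d(x, weights, biases, kernel=5, stride=5):
    """Unpadded 1-D convolution; with stride == kernel the windows do not overlap."""
    outputs = []
    for start in range(0, len(x) - kernel + 1, stride):
        window = x[start:start + kernel]
        # One output per filter: dot(window, filter) + bias.
        outputs.append([
            sum(w * v for w, v in zip(filt, window)) + b
            for filt, b in zip(weights, biases)
        ])
    return outputs

random.seed(0)
x = [random.random() for _ in range(100)]              # 100 input samples
weights = [[random.uniform(-1, 1) for _ in range(5)]    # 3 filters of length 5
           for _ in range(3)]
biases = [0.0, 0.0, 0.0]

y = blockwise_conv1d(x, weights, biases)
print(len(y), len(y[0]))  # 20 blocks, 3 values each
```

If every block needs its own independent weights rather than shared ones, a plain convolution is not enough and something like a grouped convolution or a per-block linear layer would be needed instead.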
Suppose you have a 10x10x3 colour image input and you want to stack two convolutional layers with kernel size 3x3 with 10 and 20 filters respectively.
How many parameters do you have to train for these two layers?
Don't forget bias terms!
I've tried (3*3*3+1) * (10+20) but it's apparently not right.
How to calculate the number of parameters in the CNN?
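For each layer, compute:
n: kernel width
m: kernel height
l: number of input feature maps
k: number of output feature maps
number of parameters = (n*m*l+1)*k
Then sum over the layers. Note that l for the second layer is the number of output feature maps of the first layer, not the number of image channels.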
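Applied to the two layers from the question, the formula can be checked with a few lines of Python. This is only a sketch; the helper name conv_params is an illustrative choice.

```python
def conv_params(kernel_h, kernel_w, in_maps, out_maps, bias=True):
    """Trainable parameters of one convolutional layer."""
    per_filter = kernel_h * kernel_w * in_maps + (1 if bias else 0)
    return per_filter * out_maps

layer1 = conv_params(3, 3, in_maps=3, out_maps=10)    # (3*3*3 + 1) * 10
layer2 = conv_params(3, 3, in_maps=10, out_maps=20)   # (3*3*10 + 1) * 20
total = layer1 + layer2
print(layer1, layer2, total)  # 280 1820 2100
```

The attempt (3*3*3+1) * (10+20) treats both layers as if they saw 3 input maps, whereas the second layer sees the 10 maps produced by the first.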
I seem to have some problems understanding how the model described in this paper has been designed.
This is what is written about the model dimensions:
...In these experiments we used one convolution ply, one pooling ply
and two fully connected hidden layers on the top. The fully connected
layers had 1000 units in each. The convolution and pooling parameters
were: pooling size of 6, shift size of 2, filter size of 8, 150 feature
maps for FWS..
So, according to the above, does the model consist of
Input
Convolution
Pooling
Input being the 150 feature maps (each with shape (8,3))
Convolution being 1D, as the kernel size is 8
and pooling with size 6 and stride 2.
I expected the output to have shape (1, number of filters), but what I get is (14, number of filters).
I understand why I get this, but I don't understand how the paper suggests this can give an output shape of (1, number of filters).
When using 100 filters I get these outputs from each layer:
convolution1d gives me (33, 100)
pooling gives (14, 100)
Why I expect the output to be 1 instead of 14:
The model is supposed to recognise phones. It takes in 50 frames (150 including deltas) as input; these form a context window, meaning that they are used as support to detect one single frame. That is usually why context windows are used.
As I understand from your question, the shape (14, number of filters) comes out of the pooling layer. That is expected.
What you have to do is flatten the result into a single vector before feeding it to the two fully connected layers.
Marcin Morzejko's answer to my question here may help.
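The shapes reported in the question can be reproduced with simple arithmetic. The sketch below assumes unpadded convolution and pooling, and an input length of 40, which is only inferred from the reported convolution output of 33 with filter size 8; the helper name out_len is illustrative.

```python
def out_len(n, size, stride=1):
    """Output length of an unpadded sliding window of `size` moved by `stride`."""
    return (n - size) // stride + 1

n_filters = 100
input_len = 40   # assumption: inferred from the reported conv output of 33
conv_len = out_len(input_len, size=8)             # convolution, filter size 8
pool_len = out_len(conv_len, size=6, stride=2)    # pooling size 6, shift 2
flat_units = pool_len * n_filters                 # size after flattening
print(conv_len, pool_len, flat_units)  # 33 14 1400
```

Under these assumptions, the flattened vector of 1400 values is what the first fully connected layer would receive, so the (14, number of filters) shape never has to collapse to 1 along the first axis.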
In caffe, the convolution layer takes one bottom blob and convolves it with learned filters (which are initialized using the weight filler type: "Xavier", "MSRA", etc.). However, my question is whether we can simply convolve two bottom blobs and produce a top blob. What would be the most elegant way of doing this? The purpose is that one of the bottom blobs will be data and the other will be a dynamic filter (changing depending on the data) produced by previous layers (I am trying to implement dynamic convolution).
My attempt:
One way which came to my mind was to modify the filler.hpp and assign a bottom blob as a filler matrix itself (instead of "Xavier", "MSRA" etc.). Then I thought the convolution layer would pick up from there. We can set lr = 0 to indicate that the weight initialized by our custom filler should not be changed. However, after I looked at the source code, I still don't know how to do it. On the other hand, I don't want to break the workflow of caffe. I still want conv layers to function normally, if I want them to.
Obviously a more tedious way is to use a combination of Slice, tile and/or Scale layer to literally implement convolution. I think it would work, but it will turn out to be messy. Any other thoughts?
Edit 1:
I wrote a new layer by modifying the convolution layer of caffe. In particular, in src/caffe/layers/conv_layer.cpp, on line 27, it takes the weight defined by the filler and convolves it with the bottom blob. So instead of populating that blob from the filler, I modified the layer so that it now takes two bottoms, one of which is used directly as the weights. I then had to make some other changes, such as:
Normally the weight blob has the same value for all samples. Here it has a different value for each sample, so I changed line 32 from:
this->forward_cpu_gemm(
bottom_data + n * this->bottom_dim_,
weight,
top_data + n * this->top_dim_);
to:
this->forward_cpu_gemm(
bottom_data + n * bottom[1]->count(1),
bottom[0]->cpu_data() + n * bottom[0]->count(1),
top_data + n * this->top_dim_);
To make things easier, I assumed that there is no bias term, stride is always 1, padding is always 0, group is always 1, etc. However, when I tested the forward pass, it gave me some weird answer (with a simple convolution kernel = np.ones((1,1,3,3))). The learning rates were set to zero for this kernel so that it doesn't change. However, I can't get the right answer. Any suggestions will be appreciated.
Please do not propose solutions using existing layers such as Slice, Eltwise, Crop. I have already implemented that; it works, but it is unbelievably complex and memory inefficient.
I think you are on the right track overall.
For the "weird" convolution results, I guess the most likely bug is the following:
Consider 2D convolution and suppose bottom[1]'s shape is (num, channels, height, width).
Convolution in caffe is performed as a multiplication of two matrices: weight (representing the convolution kernels) and col_buffer (reorganized from the data to be convolved). weight has num_out rows and channels / this->group_ * kernel_h * kernel_w columns; col_buffer has channels / this->group_ * kernel_h * kernel_w rows and height_out * width_out columns. So, as the weight blob of a dynamic convolution layer, bottom[0]'s shape should be (num, num_out, channels/group, kernel_h, kernel_w) to satisfy
bottom[0]->count(1) == num_out * channels / this->group_ * kernel_h * kernel_w
where num_out is the number of output feature maps of the dynamic convolution layer.
That means, to make the convolution function
this->forward_cpu_gemm(bottom_data + n * bottom[1]->count(1)
, bottom[0]->cpu_data() + n * bottom[0]->count(1)
, top_data + n * this->top_dim_);
work properly, you must make sure that
bottom[0]->shape(0) == bottom[1]->shape(0) == num
bottom[0]->count(1) == num_out * channels / this->group_ * kernel_h * kernel_w
So most likely the simple 4-dimensional convolution kernel np.ones((1,1,3,3)) you used does not satisfy the above conditions, which leads to the wrong convolution results.
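The two conditions can be checked on plain shape tuples before touching any C++ code. The following is a sketch in Python, not caffe code: count_from mimics what Blob::count(1) returns, and the data shape and num_out below are hypothetical values chosen only for illustration.

```python
from math import prod

def count_from(shape, axis=1):
    """Product of the dimensions from `axis` onward, like Blob::count(axis)."""
    return prod(shape[axis:])

def kernel_blob_ok(data_shape, kernel_shape, num_out, kernel_h, kernel_w, group=1):
    """Check the batch-size and per-sample element-count conditions."""
    num, channels = data_shape[0], data_shape[1]
    expected = num_out * (channels // group) * kernel_h * kernel_w
    return kernel_shape[0] == num and count_from(kernel_shape) == expected

data_shape = (2, 3, 8, 8)      # hypothetical (num, channels, height, width)
good = (2, 4, 3, 3, 3)         # (num, num_out, channels/group, kernel_h, kernel_w)
bad = (1, 1, 3, 3)             # a 4-D kernel like np.ones((1,1,3,3))

print(kernel_blob_ok(data_shape, good, num_out=4, kernel_h=3, kernel_w=3))  # True
print(kernel_blob_ok(data_shape, bad, num_out=4, kernel_h=3, kernel_w=3))   # False
```

A 4-D kernel can still pass this check in the special case where num, channels and num_out are all 1, so it is worth comparing against the actual shapes used in the test.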
Hope it's clear and will help you.
########## Update 1, Oct 10th,2016,Beijing time ##########
I added a dynamic convolution layer here, but with no unit test yet. This layer doesn't break the workflow of caffe and only changes some private members of the BaseConvolution class to be protected.
The files involved are:
include/caffe/layers/dyn_conv_layer.hpp,base_conv_layer.hpp
src/caffe/layers/dyn_conv_layer.cpp(cu)
It is almost the same as the convolution layer in caffe; the main differences are:
Override the function LayerSetUp() to initialize this->kernel_dim_, this->weight_offset_, etc. properly for convolution, and skip initializing this->blobs_, which the Convolution layer normally uses to hold the weight and bias;
Override the function Reshape() to check that bottom[1], as the kernel container, has the proper shape for convolution.
Because I have had no time to test it, there may be bugs, and I will be very glad to receive your feedback.
########## Update 2, Oct 12th,2016,Beijing time ##########
I have just added a test case for dynamic convolution. The file involved is src/caffe/test/test_dyn_convolution_layer.cpp. It seems to work fine, but may need more thorough tests.
You can build this caffe with cd $CAFFE_ROOT/build && ccmake .., then cmake -DBUILD_only_tests="dyn_convolution_layer" .. and make runtest to check it.