Understanding DINO (object detector) model architecture - deep-learning

I am trying to understand the model architecture of DINO (https://arxiv.org/pdf/2203.03605.pdf).
These are the last few layers I see when I execute model.children() (attached as a screenshot).
Question 1)
In class_embed, (0) has dimensions 256 by 91, and if it were feeding into (1) of class_embed, shouldn't the first dimension of (1) be 91?
So, I realize (0) of class_embed is not actually feeding into (1) of class_embed. Could someone explain this to me?
Question 2)
Also, the last layer (2) of the MLP (see the first picture, which says (5): MLP) has dimensions 256 by 4. So shouldn't the first dimension of class_embed (0) be 4?
Now, when I use a different function to print the layers, I see that the layers shown above appear merged. For example, there is only one layer of
Linear(in_features=256, out_features=91, bias=True)
Why does this function give me a different architecture?
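One general PyTorch point that may help with Questions 1 and 2: the numbered entries of an nn.ModuleList are not chained, because a ModuleList has no forward(), so (0) never feeds into (1); in DETR-style detectors such heads are typically applied in parallel, one per decoder layer. As for the different printouts, one possible explanation (an assumption about this particular checkpoint, not a certainty) is weight sharing: if the same Linear is referenced several times, print(model) lists every slot of the container, while iterators such as .modules() return a shared module only once. A small sketch:

import torch.nn as nn

shared = nn.Linear(256, 91, bias=True)
class_embed = nn.ModuleList([shared] * 6)   # six references to one and the same layer

print(class_embed)                  # the container repr lists the slots
print(list(class_embed.modules()))  # the shared Linear is returned only once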
Question 3)
Now, I went on to create a hook for the 3rd-to-last layer.
When I print the size, I am getting 1 by 900 by 256. Shouldn't I be getting something like 1 by 256 by 256?
The code used to find the dimension, its output, and the printout of layer 4 (which motivated this expectation) were attached as screenshots.
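For reference, here is a self-contained sketch of the hook mechanics, with a plain nn.Linear standing in for the actual DINO sub-module (the module itself is an assumption; only the shapes come from the question). A Linear layer acts on the last dimension only, so a (1, 900, 256) input, where 900 is presumably the number of object queries, comes out as (1, 900, 256) rather than (1, 256, 256):

import torch
import torch.nn as nn

# Stand-in for the hooked sub-module: Linear acts on the last dimension only,
# so any leading dimensions (batch, queries, ...) pass through unchanged.
layer = nn.Linear(256, 256)

def shape_hook(module, inputs, output):
    print(module.__class__.__name__, tuple(output.shape))

handle = layer.register_forward_hook(shape_hook)

x = torch.randn(1, 900, 256)   # (batch, queries, hidden dim), as in the question
layer(x)                       # prints: Linear (1, 900, 256)
handle.remove()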

Related

Softmax as activation function in CNN while doing convolution

I was working on segmentation using U-Net; it's a multiclass segmentation problem with 21 classes.
Thus, ideally we go with softmax as the activation in the last layer, which contains 21 kernels, so that the output depth will be 21, matching the number of classes.
But my question is: if we use softmax as the activation in this layer, how will it work? I mean, since softmax will be applied to each feature map, by its nature it will give probabilities that sum to 1. But we need 1s in all places where the corresponding class is present in the feature map.
Or is the softmax applied depth-wise, i.e. taking the 21 class values at each pixel along the depth and applying it over those?
I hope I have explained the problem properly.
I have tried sigmoid as the activation, and the result is not good.
If I understand correctly, you have 21 kernels that are of some shape m*n. So if you reshape your final layer to have a shape of (batch_size, 21, (m*n)), then you can apply softmax along the first dimension (the one of size 21). Then every value within a single kernel should be the same, and you can take the kernel with the max value.
In this case, you'll find the feature map that has the best overall overlap with the region of interest, rather than finding which part of every feature map overlaps with the ROI if any.
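A minimal sketch of that reshape-then-softmax idea in tf.keras (the framework and the spatial size m x n are assumptions; only the 21 classes come from the question). After the softmax over the class axis, each of the m*n positions carries a probability distribution over the 21 classes:

import tensorflow as tf
from tensorflow.keras import layers

m, n, num_classes = 64, 64, 21                             # assumed spatial size, 21 classes

feature_maps = layers.Input(shape=(m, n, num_classes))     # 21 maps from the last conv layer
x = layers.Permute((3, 1, 2))(feature_maps)                # -> (21, m, n)
x = layers.Reshape((num_classes, m * n))(x)                # -> (21, m*n)
probs = layers.Softmax(axis=1)(x)                          # softmax over the 21 classes at each position
model = tf.keras.Model(feature_maps, probs)
model.summary()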

How to use conv2d in this case

I want to create an NN layer such that:
for an input of size 100, assume every 5 samples form a "block"
the layer should compute, let's say, 3 values for every block
so the input/output sizes of this layer should be: 100 -> 20*3
every block of size 5 (and only this block) is fully connected to the result block of size 3
If I understand it correctly, I can use Conv2d for this problem, but I'm not sure how to correctly choose the conv2d parameters.
Is Conv2d suitable for this task? If so, what are the correct parameters? Would they be
input channels = 100
output channels = 20*3
kernel = (5,1)
?
You can use either Conv2D or Conv1D.
With the data shaped like batch x 100 x n_features you can use Conv1D with this setup:
Input channels: n_features
Output channels: 3 * output_features
kernel: 5
strides: 5
This way, the kernel is applied to 5 samples and generates 3 outputs. The values for n_features and output_features can be anything you like and might as well be 1. Setting the strides to 5 results in a non-overlapping convolution, so that each block uniquely contributes to one output.
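A minimal tf.keras sketch of this setup (the framework is an assumption; the parameters are the ones listed above, with n_features and output_features set to 1):

import tensorflow as tf
from tensorflow.keras import layers

n_features = 1          # features per input sample
output_features = 1     # features per output value

inp = layers.Input(shape=(100, n_features))            # batch x 100 x n_features
out = layers.Conv1D(filters=3 * output_features,
                    kernel_size=5,
                    strides=5)(inp)                    # non-overlapping blocks of 5 samples
model = tf.keras.Model(inp, out)
print(model.output_shape)                              # (None, 20, 3): 20 blocks, 3 values each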

Dimensions issues with deep learning model

I seem to have some problems understanding how the model described in this paper has been designed.
This is what is written about the model dimensions:
...In these experiments we used one convolution ply, one pooling ply and two fully connected hidden layers on the top. The fully connected layers had 1000 units in each. The convolution and pooling parameters were: pooling size of 6, shift size of 2, filter size of 8, 150 feature maps for FWS...
So, according to the above, does the model consist of:
Input
Convolution
Pooling
Input being the 150 feature maps (each with shape (8,3))
Convolution being 1D, as the kernel size is 8
and pooling with size 6 and stride 2.
What I expected of the output was a shape of (1, number of filters), but what I get is (14, number of filters).
I understand why I get this, but I don't understand how the paper suggests this can give an output shape of (1, number of filters).
When using 100 filters, I get these outputs from each layer:
convolution1d gives me (33, 100)
pooling gives (14, 100).
Why I expect the output to be 1 instead of 14:
The model is supposed to recognise phones; it takes in 50 frames (150 including deltas) as input, these being context frames, meaning that they are used as support to detect one single frame... That is usually why context windows are used.
As I understand from your question, the shape (14, number of filters) comes out after the pooling layer. That is expected.
What you have to do is flatten the results into a single vector before feeding them to the two fully connected layers.
Marcin Morzejko's answer to my question here would help.
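A minimal tf.keras sketch of that flattening step (the activations and the number of phone classes here are placeholders; the (14, 100) shape and the two 1000-unit layers come from the discussion above):

import tensorflow as tf
from tensorflow.keras import layers

num_filters = 100     # as in the example above
num_phones = 40       # placeholder for the number of phone classes

pooled = layers.Input(shape=(14, num_filters))    # shape coming out of the pooling layer
x = layers.Flatten()(pooled)                      # -> a single 14 * num_filters vector
x = layers.Dense(1000, activation='relu')(x)      # the two 1000-unit fully connected layers
x = layers.Dense(1000, activation='relu')(x)
out = layers.Dense(num_phones, activation='softmax')(x)
model = tf.keras.Model(pooled, out)
model.summary()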

What does Caffe do with the mean binary file?

In the Caffe input layer one can define a mean image that holds the mean values of all the images used. From the ImageNet example: "The model requires us to subtract the image mean from each image, so we have to compute the mean".
My question is: What is the implementation of this subtraction? Is it simply :
used_image = original_image - mean_image
or
used_image = mean_image - original_image
or
used_image = |original_image - mean_image|^2
If it is one of the first two, then how are negative pixels handled? Since the pictures are usually stored in uint8, it would mean that the values simply wrap around, e.g.
200 - 255 = 201
Why do I need to know this? I ran tests, and it seems the second or the third example would work better.
It's the first one, a trivial normalization step. Using the second instead wouldn't really matter: the weights would invert.
There are no "negative pixels", per se: this is simply integer input to the matrix operations. You are welcome to interpret this as a visual alteration of some sort, but the arithmetic doesn't care.

deep autoencoder training, small data vs. big data

I am training a deep autoencoder (for now 5 layers encoding and 5 layers decoding, using leaky ReLU) to reduce the dimensionality of the data from about 2000 dims to 2. I can train my model on 10k samples, and the outcome is acceptable.
The problem arises when I am using bigger data (50k to 1M). Using the same model with the same optimizer, dropout, etc. does not work, and the training gets stuck after a few epochs.
I am trying to do some hyper-parameter search on the optimizer (I am using Adam), but I am not sure if this will solve the problem.
Should I look for something else to change/check? Does the batch size matter in this case? Should I solve the problem by fine-tuning the optimizer? Should I play with the dropout ratio? ...
Any advice is very much appreciated.
p.s. I am using Keras. It is very convenient. If you do not know about it, then check it out: http://keras.io/
I would have the following questions when trying to find a cause of the problem:
1) What happens if you change the size of the middle layer from 2 to something bigger? Does it improve the performance of the model trained on the >50k training set?
2) Are the 10k training examples and test examples randomly selected from the 1M dataset?
My guess is that your model is simply not able to reconstruct your 50k-1M data from just 2 dimensions in the middle layer. So it's easier for the model to fit its parameters for the 10k data; the activations from the middle layer are more sensible in that case, but for >50k data the activations are random noise.
After some investigation, I have realized that the layer configuration I am using is somehow ill-suited to the problem, and this seems to cause at least part of the problem.
I have been using a sequence of layers for encoding and decoding. The layer sizes were chosen to decrease linearly, for example:
input: 1764 (dims)
hidden1: 1176
hidden2: 588
encoded: 2
hidden3: 588
hidden4: 1176
output: 1764 (same as input)
However, this seems to work only occasionally, and it is sensitive to the choice of hyperparameters.
I tried to replace this with exponentially decreasing layer sizes for encoding (and the reverse for decoding), so:
1764, 128, 16, 2, 16, 128, 1764
Now, in this case, the training seems to happen more robustly. I still have to run a hyperparameter search to see whether this configuration is sensitive or not, but a few manual trials seem to show its robustness.
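A minimal sketch of that exponentially shrinking layout using tf.keras (the loss, the optimizer settings, and the absence of dropout here are assumptions; the layer sizes are the ones listed above):

from tensorflow import keras
from tensorflow.keras import layers

dims = [1764, 128, 16, 2, 16, 128, 1764]    # encoder shrinks exponentially, decoder mirrors it

inp = keras.Input(shape=(dims[0],))
x = inp
for d in dims[1:-1]:
    x = layers.Dense(d)(x)
    x = layers.LeakyReLU()(x)               # leaky ReLU, as in the original setup
out = layers.Dense(dims[-1])(x)             # linear output reconstructing the input

autoencoder = keras.Model(inp, out)
autoencoder.compile(optimizer='adam', loss='mse')
autoencoder.summary()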
I will post an update if I encounter some other interesting points.