Do we need to reweight the output of a max-pooling layer by the probability of presence at test time? In theory, the max-pooling output is affected by the dropout ratio.
The background is that I'm reading this paper (https://arxiv.org/pdf/2111.14973.pdf). In Section 3.2, it applies max-pooling over a variable-size set of objects as the input.
For normal max pooling, we should reweight the output by (1 - p) in the test phase. But for this structure, I have no idea how to reweight.
I am working with a long sequence (~60 000 timesteps) classification task with continuous input domain. The input has the shape (B, L, C) where B is the batch size, L is the sequence length (i.e. timesteps) and C is the number of features where each feature is continuous (i.e. values like 0.6, 0.2, 0.5, 1.3, etc.).
Since the sequence is very long, I can't directly apply an RNN or Transformer encoder layer without exceeding memory limits. Some proposed methods use several CNN layers to "downsample" the sequence length before feeding it into an RNN model. A successful example of this is the CNN-LSTM model. By introducing several subsequent convolutional blocks followed by max-pooling, it is possible to "downsample" the sequence length by a given factor. The downsampled sequence would then have a length of, say, 60 timesteps, which is much more manageable for an LSTM model.
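Roughly what I mean by the downsampling front-end, as a minimal PyTorch sketch (the channel counts, kernel sizes, and number of blocks below are placeholders, not taken from any particular paper):

    import torch
    import torch.nn as nn

    class ConvDownsampleLSTM(nn.Module):
        """Sketch: Conv1d blocks with max pooling shorten the sequence
        before an LSTM. All hyperparameters here are illustrative."""
        def __init__(self, in_features=8, hidden=128, n_blocks=5, n_classes=2):
            super().__init__()
            blocks, c = [], in_features
            for _ in range(n_blocks):
                blocks += [
                    nn.Conv1d(c, hidden, kernel_size=5, padding=2),
                    nn.ReLU(),
                    nn.MaxPool1d(kernel_size=4),  # each block shortens L by 4x
                ]
                c = hidden
            self.frontend = nn.Sequential(*blocks)
            self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
            self.head = nn.Linear(hidden, n_classes)

        def forward(self, x):           # x: (B, L, C)
            x = x.transpose(1, 2)       # Conv1d expects (B, C, L)
            x = self.frontend(x)        # (B, hidden, L / 4**n_blocks)
            x = x.transpose(1, 2)       # back to (B, L', hidden)
            _, (h, _) = self.lstm(x)    # h: (1, B, hidden)
            return self.head(h[-1])     # (B, n_classes)

    # 60 000 timesteps / 4**5 gives ~58 timesteps reaching the LSTM
    logits = ConvDownsampleLSTM()(torch.randn(2, 60_000, 8))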
Does it make sense to directly substitute the LSTM model with a Transformer encoder? I have read that the transformer attention mechanism can complement the LSTM layers and be used in succession.
There also exist many variants of Transformers and other architectures designed for handling long sequences. Recent examples include Performer, Linformer, Reformer, Nyströmformer, BigBird, FNet, S4, and CDIL-CNN. Does there exist a library, similar to torchvision, for using these models in PyTorch without copy-pasting large amounts of code from the respective repositories?
I have read many papers and web articles that claim that depthwise-separable convolutions reduce the memory required by a deep learning model compared to standard convolution. However, I do not understand how this would be the case, since depthwise-separable convolution requires storing an extra intermediate-step matrix as well as the final output matrix.
Here are two scenarios:
Typical convolution: you have a single 3x3 filter (spanning all 3 input channels, so 3x3x3 weights), applied to a 7x7 RGB input volume. This results in an output of size 5x5x1 which needs to be stored in GPU memory. Assuming float32 activations, this requires 100 bytes of memory.
Depthwise-separable convolution: you have three 3x3x1 filters, one per input channel, applied to a 7x7 RGB input volume. This results in three output volumes, each of size 5x5x1. You then apply a 1x1 convolution to the concatenated 5x5x3 volume to get a final output volume of size 5x5x1. Hence, with float32 activations, this requires 300 bytes for the intermediate 5x5x3 volume plus 100 bytes for the final output, a total of 400 bytes of memory.
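To make the second scenario concrete, here is a minimal PyTorch sketch of that depthwise + pointwise pair using grouped convolutions (the shapes follow the toy example above):

    import torch
    import torch.nn as nn

    # Depthwise step: 3 filters of 3x3x1, one per input channel (groups=3).
    depthwise = nn.Conv2d(3, 3, kernel_size=3, groups=3, bias=False)
    # Pointwise step: a 1x1 convolution mixing the 3 maps into 1 output channel.
    pointwise = nn.Conv2d(3, 1, kernel_size=1, bias=False)

    x = torch.randn(1, 3, 7, 7)
    mid = depthwise(x)      # (1, 3, 5, 5) -> the intermediate 5x5x3 volume
    out = pointwise(mid)    # (1, 1, 5, 5) -> the final 5x5x1 volume

    # Parameter count: 3*3*3 + 3*1 = 30 here vs. 27 for a single full 3x3x3
    # filter; the parameter savings only appear with many output channels.
    print(sum(p.numel() for p in depthwise.parameters()),
          sum(p.numel() for p in pointwise.parameters()))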
As additional evidence: a U-Net implementation in PyTorch with ordinary nn.Conv2d convolutions has 17.3M parameters and a forward/backward pass size of 320MB. If I replace all convolutions with depthwise-separable convolutions, the model has 2M parameters and a forward/backward pass size of 500MB. So fewer parameters, but more memory required.
I am sure I am going wrong somewhere, as every article states that depthwise-separable convolutions require less memory. Where am I going wrong with my logic?
In most architectures, conv layers are followed by a pooling layer (max, average, etc.). Since those pooling layers just subsample the output of the previous (conv) layer, can we instead use a convolution with stride 2 and expect similar accuracy with less computation?
Yes, that can be done. It's explained in the paper 'Striving for Simplicity: The All Convolutional Net' (https://arxiv.org/pdf/1412.6806.pdf). Quote from the paper:
'We find that max-pooling can simply be replaced by a convolutional layer with increased stride without loss in accuracy on several image recognition benchmarks.'
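As a minimal sketch of the substitution (the channel counts and input size below are arbitrary):

    import torch
    import torch.nn as nn

    x = torch.randn(1, 64, 32, 32)  # (B, C, H, W)

    # Conv block followed by max pooling (the common pattern):
    conv_pool = nn.Sequential(
        nn.Conv2d(64, 128, kernel_size=3, padding=1),  # 32x32 -> 32x32
        nn.ReLU(),
        nn.MaxPool2d(kernel_size=2, stride=2),          # 32x32 -> 16x16
    )

    # The same downsampling folded into the convolution via stride=2:
    strided_conv = nn.Sequential(
        nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),  # 32x32 -> 16x16
        nn.ReLU(),
    )

    print(conv_pool(x).shape, strided_conv(x).shape)  # both (1, 128, 16, 16)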
I seem to have some problems understanding how the model described in this paper has been designed.
This is what is written about the model dimensions:
...In these experiments we used one convolution ply, one pooling ply and two fully connected hidden layers on the top. The fully connected layers had 1000 units in each. The convolution and pooling parameters were: pooling size of 6, shift size of 2, filter size of 8, 150 feature maps for FWS...
So, according to the above, does the model consist of:
Input
Convolution
Pooling
with the input being the 150 feature maps (each with shape (8, 3)),
the convolution being 1D, as the kernel size is 8,
and the pooling having size 6 and stride 2?
What I expected was an output of shape (1, number_of_filters), but what I get is (14, number_of_filters).
I understand why I get that, but I don't understand how the paper suggests this can give an output shape of (1, number_of_filters).
When using 100 filters I get these outputs from each layer:
Convolution1D gives me (33, 100)
Pooling gives (14, 100)
Why I expect the output to be 1 instead of 14: the model is supposed to recognise phones. It takes in 50 frames (150 including deltas) as input, these being context frames, meaning that they are used as support to detect one single frame... That's usually why context windows are used.
As I understand from your question, the shape (14, number_of_filters) comes out after the pooling layer. That is expected.
What you have to do is flatten the result into a single vector before feeding it to the two fully connected layers.
Marcin Morzejko's answer to my question here would help.
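As a minimal PyTorch sketch of that flatten step (the input length and channel count below are assumptions, chosen only so the layer shapes match the (33, 100) and (14, 100) you reported):

    import torch
    import torch.nn as nn

    n_filters = 100

    model = nn.Sequential(
        # Conv/pool settings from the quoted paper; 3 input channels and a
        # 40-step input are assumptions that reproduce the reported shapes.
        nn.Conv1d(in_channels=3, out_channels=n_filters, kernel_size=8),  # 40 -> 33
        nn.ReLU(),
        nn.MaxPool1d(kernel_size=6, stride=2),   # floor((33 - 6) / 2) + 1 = 14
        nn.Flatten(),                            # (B, 100, 14) -> (B, 1400)
        nn.Linear(14 * n_filters, 1000),
        nn.ReLU(),
        nn.Linear(1000, 1000),
    )

    x = torch.randn(4, 3, 40)   # (batch, channels, timesteps)
    print(model(x).shape)       # torch.Size([4, 1000])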
In this tutorial about object detection, Fast R-CNN is mentioned. The ROI (region of interest) layer is also mentioned.
What is happening, mathematically, when region proposals get resized according to the final convolution layer's activations (in each cell)?
Region-of-Interest (RoI) Pooling:
It is a type of pooling layer which performs max pooling on inputs (here, convnet feature maps) of non-uniform sizes and produces a small feature map of fixed size (say 7x7). The choice of this fixed size is a network hyper-parameter and is predefined.
The main purpose of such a pooling is to speed up training and test time and also to allow training the whole system end-to-end (in a joint manner).
It is because of this pooling layer that training and test time are faster compared to the original (vanilla) R-CNN architecture, hence the name Fast R-CNN.
Simple example: see the illustration in 'Region of interest pooling explained' by deepsense.io.
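If torchvision is available, the same fixed-output behaviour can be seen with a minimal sketch using torchvision.ops.roi_pool (the feature-map size, channel count, and box coordinates below are made up):

    import torch
    from torchvision.ops import roi_pool

    # A fake backbone feature map: batch of 1, 256 channels, 32x32 spatial size.
    features = torch.randn(1, 256, 32, 32)

    # Two proposals of different sizes, given as (batch_index, x1, y1, x2, y2)
    # in feature-map coordinates (spatial_scale=1.0 here for simplicity).
    rois = torch.tensor([
        [0.,  2.,  2., 20., 12.],
        [0.,  5.,  8., 30., 30.],
    ])

    pooled = roi_pool(features, rois, output_size=(7, 7), spatial_scale=1.0)
    print(pooled.shape)  # torch.Size([2, 256, 7, 7]) -- fixed size for every RoI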
The ROI (region of interest) layer was introduced in Fast R-CNN and is a special case of the spatial pyramid pooling layer introduced in 'Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition'. The main function of the ROI layer is to reshape inputs of arbitrary size into a fixed-length output, because of the size constraint of the fully connected layers.
How the ROI layer works is shown below: an input of arbitrary size is fed into the layer, which has 3 different windows: 4x4 (blue), 2x2 (green), and 1x1 (gray), producing outputs with fixed sizes of 16 x F, 4 x F, and 1 x F respectively, where F is the number of filters. Those outputs are then concatenated into a vector to be fed to the fully connected layer.
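A rough sketch of that multi-window idea, using adaptive max pooling in PyTorch to get the 4x4, 2x2, and 1x1 grids (the channel count and input sizes below are arbitrary):

    import torch
    import torch.nn as nn

    def spp_pool(feature_map):
        """Spatial-pyramid-style pooling sketch: pool the same feature map at
        three fixed grid sizes (4x4, 2x2, 1x1) and concatenate, giving a
        (16 + 4 + 1) * F vector regardless of the input's spatial size."""
        levels = [nn.AdaptiveMaxPool2d(s) for s in (4, 2, 1)]
        b = feature_map.shape[0]
        return torch.cat([lvl(feature_map).reshape(b, -1) for lvl in levels], dim=1)

    # Two inputs with different spatial sizes map to the same output length.
    print(spp_pool(torch.randn(1, 64, 13, 27)).shape)  # torch.Size([1, 1344])
    print(spp_pool(torch.randn(1, 64, 40, 9)).shape)   # torch.Size([1, 1344])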