Max pool layer vs Convolution with stride performance - deep-learning

In most architectures, conv layers are followed by a pooling layer (max, average, etc.). Since those pooling layers only select from the output of the previous (conv) layer, can we just use a convolution with stride 2 instead and expect similar accuracy with less computation?

Yes, that can be done. It's explained in the paper 'Striving for Simplicity: The All Convolutional Net' (https://arxiv.org/pdf/1412.6806.pdf). Quote from the paper:
'We find that max-pooling can simply be replaced by a convolutional layer with increased stride without loss in accuracy on several image recognition benchmarks'
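As a minimal sketch of the idea (layer sizes here are placeholders of my own, not taken from the paper), the pooling step in a Keras-style block can be swapped for a stride-2 convolution:

import tensorflow as tf

# Conventional block: convolution followed by max pooling.
pooled = tf.keras.Sequential([
    tf.keras.layers.Conv2D(64, 3, padding='same', activation='relu'),
    tf.keras.layers.MaxPooling2D(pool_size=2),
])

# All-convolutional alternative: the pooling layer is replaced by a
# stride-2 convolution, which also halves the spatial resolution but
# has learnable parameters.
strided = tf.keras.Sequential([
    tf.keras.layers.Conv2D(64, 3, padding='same', activation='relu'),
    tf.keras.layers.Conv2D(64, 3, strides=2, padding='same', activation='relu'),
])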

Related

How does the fully connected layer after global average pooling work in ResNet50?

I have a ResNet50 network whose top layers include a global average pooling layer with output shape (1, 2048) and a dense softmax layer with output shape (1, 3). How does the (1, 2048) output of the global average pooling layer become the (1, 3) output of the dense layer? How does it work? I can't find a reliable source explaining this.
Dense (fully connected) layers are just a matrix multiplication (with a bias). So what you do is multiply a matrix of shape 1x2048 with another matrix of shape 2048x3 to get an output matrix of shape 1x3, which gives you scores for your 3 classes. Softmax converts these scores to probabilities. Of course, your network learns the weights of these matrices using back-propagation.
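A minimal numpy sketch of that computation (the random weights are stand-ins for the learned ones):

import numpy as np

features = np.random.rand(1, 2048)            # output of global average pooling
W = np.random.randn(2048, 3) * 0.01           # dense layer weights (learned via back-propagation)
b = np.zeros(3)                               # dense layer bias

scores = features @ W + b                     # (1, 2048) x (2048, 3) -> (1, 3)
probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # softmax
print(probs.shape)                            # (1, 3), one probability per class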

Strided convolution & residual connection

I am studying SwishNet, a Fast CNN for Speech, Music and Noise Classification and Segmentation.
In that paper, they use strided convolution and residual connections. After passing through a stride=2 conv layer, the output length will be half of the input length.
My question is: how can the output be merged with the input (residual connection) when their array dimensions are mismatched?
G.A. is just a gated activation function, so it does not affect the output dimension!
If you look into the reference paper Conditional Image Generation with PixelCNN Decoders, from which the Gated Activation (G.A.) is borrowed, you can notice that it uses the following formula:

y = tanh(W_{k,f} * x) ⊙ σ(W_{k,g} * x)

Although stride=2 reduces the dimension to half of the input size, a G.A. layer with a properly shaped W produces an output of the same dimension as the input, which means no dimension mismatch will occur.
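A rough sketch of the gated activation unit itself (the filter count and kernel size here are illustrative, not the SwishNet values):

import tensorflow as tf

def gated_activation(x, filters):
    # y = tanh(W_f * x) ⊙ sigmoid(W_g * x): two parallel convolutions whose
    # outputs are combined element-wise, so the output keeps the input's length.
    f = tf.keras.layers.Conv1D(filters, kernel_size=3, padding='same')(x)
    g = tf.keras.layers.Conv1D(filters, kernel_size=3, padding='same')(x)
    return tf.math.tanh(f) * tf.math.sigmoid(g)

x = tf.random.normal([1, 100, 16])     # (batch, time steps, channels), made-up sizes
y = gated_activation(x, filters=16)    # same time dimension as the input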

In deep learning, what's the difference between two 3*3 convolution filters and one 5*5 convolution filter?

For example, in the famous AlexNet architecture (original paper), what's the difference between using two 3*3 convolution filters and using one 5*5 convolution filter?
The two 3*3 convolution filters and the one 5*5 convolution filter have been highlighted with red rectangles in the image below.
What about using another 5*5 convolution filter to replace the two 3*3 convolution filters, or vice versa?
If you still have some doubts, I hope this helps.
If you stack two 3x3 conv layers, you eventually get a receptive field of 5 (the same as one 5x5 conv layer) with respect to the input. However, the advantage of using smaller conv layers like 3x3 is that they need fewer parameters (you can do the parameter calculation for two 3x3 layers versus one 5x5 layer: 2*(3*3) = 18 versus 1*(5*5) = 25, assuming a single channel). Also, two conv layers have an extra non-linearity in between compared to one 5x5 layer, so the stack has more discriminative power.
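As a rough illustration of the parameter count with C input and C output channels (ignoring biases; C = 64 is an arbitrary choice of mine):

def conv_params(k, c_in, c_out):
    # Weights of a k x k convolution mapping c_in channels to c_out channels.
    return k * k * c_in * c_out

c = 64
two_3x3 = 2 * conv_params(3, c, c)   # two stacked 3x3 layers: 73,728 weights
one_5x5 = conv_params(5, c, c)       # one 5x5 layer: 102,400 weights
print(two_3x3, one_5x5)              # the 3x3 stack uses roughly 28% fewer weights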
For the receptive field part, I hope this paper of mine helps you to visualize (BTW it's the answer sheet from my exam):
I found this in the paper 'Very Deep Convolutional Networks for Large-Scale Image Recognition':
Rather than using relatively large receptive fields in the first conv. layers (e.g. 11×11 with stride 4 in (Krizhevsky et al., 2012), or 7×7 with stride 2 in (Zeiler & Fergus, 2013; Sermanet et al., 2014)), we use very small 3×3 receptive fields throughout the whole net, which are convolved with the input at every pixel (with stride 1). It is easy to see that a stack of two 3×3 conv. layers (without spatial pooling in between) has an effective receptive field of 5×5; three such layers have a 7×7 effective receptive field.
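To see where the 5x5 and 7x7 figures come from, here is a tiny helper of my own (stride-1 layers only) that accumulates the effective receptive field of stacked convolutions:

def effective_receptive_field(kernel_sizes):
    # Each stride-1 layer grows the receptive field by (k - 1).
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

print(effective_receptive_field([3, 3]))     # 5 -> two stacked 3x3 convs
print(effective_receptive_field([3, 3, 3]))  # 7 -> three stacked 3x3 convs
print(effective_receptive_field([5]))        # 5 -> a single 5x5 conv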
Two 3*3 convolution filters are equivalent to one 5*5 convolution filter in terms of receptive field.
Two 3*3 convolution filters have fewer parameters than one 5*5 convolution filter.
Two 3*3 convolution filters make the network deeper and extract more complex features than one 5*5 convolution filter.
Paper: https://arxiv.org/pdf/1409.1556.pdf

What is the purpose of the ROI layer in a Fast R-CNN?

In this tutorial about object detection, Fast R-CNN is mentioned. The ROI (region of interest) layer is also mentioned.
What is happening, mathematically, when region proposals get resized according to the final convolution layer activations (in each cell)?
Region of Interest (RoI) Pooling:
It is a type of pooling layer which performs max pooling on inputs (here, convnet feature maps) of non-uniform sizes and produces a small feature map of fixed size (say 7x7). The choice of this fixed size is a network hyper-parameter and is predefined.
The main purpose of doing such a pooling is to speed up the training and test time and also to train the whole system from end-to-end (in a joint manner).
It is because of this pooling layer that training and test time are faster compared with the original (vanilla) R-CNN architecture, hence the name Fast R-CNN.
Simple example (from Region of interest pooling explained by deepsense.io):
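A minimal numpy sketch of the same idea (the feature map, region coordinates, and the 2x2 output size below are made up, not taken from the deepsense.io example):

import numpy as np

def roi_max_pool(feature_map, roi, out_size=(2, 2)):
    # Crop the region of interest, then max-pool it into a fixed out_size grid
    # regardless of how large the region originally was.
    x0, y0, x1, y1 = roi
    region = feature_map[y0:y1, x0:x1]
    h_bins = np.array_split(np.arange(region.shape[0]), out_size[0])
    w_bins = np.array_split(np.arange(region.shape[1]), out_size[1])
    return np.array([[region[np.ix_(hb, wb)].max() for wb in w_bins] for hb in h_bins])

feature_map = np.arange(64).reshape(8, 8)        # stand-in for a convnet feature map
print(roi_max_pool(feature_map, (0, 0, 5, 7)))   # a 7-row by 5-column region -> fixed 2x2 output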
The ROI (region of interest) layer was introduced in Fast R-CNN and is a special case of the spatial pyramid pooling layer introduced in Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. The main function of the ROI layer is to reshape inputs of arbitrary size into a fixed-length output, because of the size constraint of the fully connected layers.
How the ROI layer works is shown below:
In this image, an input of arbitrary size is fed into this layer, which has 3 different windows: 4x4 (blue), 2x2 (green), and 1x1 (gray), to produce outputs with fixed sizes of 16 x F, 4 x F, and 1 x F respectively, where F is the number of filters. Those outputs are then concatenated into a vector and fed to the fully connected layer.
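Reusing the roi_max_pool sketch above, the spatial-pyramid version simply pools the same input at several fixed grid sizes and concatenates the results (sizes here are illustrative only):

pyramid = [roi_max_pool(feature_map, (0, 0, 8, 8), out_size=(n, n)).ravel()
           for n in (4, 2, 1)]
fixed_vector = np.concatenate(pyramid)   # 16 + 4 + 1 = 21 values per filter
print(fixed_vector.shape)                # (21,), independent of the input size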

Inception style convolution

I need to keep the model as small as possible so I can deploy an image classifier that runs efficiently in an app (accuracy is not really critical for me).
I recently got into deep learning and don't have much experience, so I'm currently playing with the CIFAR-10 example.
I tried to replace each of the first two 5x5 convolutional layers with two 3x3 convolutions, as described in the Inception paper.
Unfortunately, when I classify the test set, I get around 0.1 accuracy (random choice).
This is the modified code of the first layer (the second is similar):
with tf.variable_scope('conv1') as scope:
    # First 3x3 convolution: 3 input channels -> 64 feature maps.
    kernel_l1 = _variable_with_weight_decay('weights_l1', shape=[3, 3, 3, 64],
                                            stddev=1e-4, wd=0.0)
    # Second 3x3 convolution applied depthwise (one filter per channel,
    # channel multiplier 1), so it keeps 64 channels.
    kernel_l2 = _variable_with_weight_decay('weights_l2', shape=[3, 3, 64, 1],
                                            stddev=1e-4, wd=0.0)
    biases = _variable_on_cpu('biases', [64], tf.constant_initializer(0.0))
    conv_l1 = tf.nn.conv2d(images, kernel_l1, [1, 1, 1, 1], padding='SAME')
    conv_l2 = tf.nn.depthwise_conv2d(conv_l1, kernel_l2, [1, 1, 1, 1], padding='SAME')
    bias = tf.nn.bias_add(conv_l2, biases)
    conv1 = tf.nn.relu(bias, name=scope.name)
    _activation_summary(conv1)
Is it correct?
It seems you're attempting to compute 64 features (for each 3x3 patch) in the first convolutional layer and feed this directly into the second convolutional layer, with no intermediate pooling layer. Convolutional neural networks typically have a structure of stacked convolutional layers, followed by contrast normalization and max pooling.
To reduce processing overheads, researchers have experimented with moving from fully connected to sparsely connected architectures, hence the creation of the Inception architecture. However, while these yield good results for high-dimensional inputs, you may be expecting too much from the 32x32 pixels of CIFAR-10 in TensorFlow.
Therefore, I think the issue is less about patch size and more to do with the overall architecture. This code is a known-good starting point. Get that working, then start reducing parameters until it breaks.
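For reference, a plain way to stack two full (not depthwise) 3x3 convolutions with a ReLU in between, reusing the helper functions from the question's code (the shapes and variable names are illustrative, not a tested configuration):

with tf.variable_scope('conv1') as scope:
    kernel_a = _variable_with_weight_decay('weights_a', shape=[3, 3, 3, 64],
                                           stddev=1e-4, wd=0.0)
    kernel_b = _variable_with_weight_decay('weights_b', shape=[3, 3, 64, 64],
                                           stddev=1e-4, wd=0.0)
    biases_a = _variable_on_cpu('biases_a', [64], tf.constant_initializer(0.0))
    biases_b = _variable_on_cpu('biases_b', [64], tf.constant_initializer(0.0))
    # First 3x3 convolution followed by a non-linearity ...
    conv_a = tf.nn.conv2d(images, kernel_a, [1, 1, 1, 1], padding='SAME')
    relu_a = tf.nn.relu(tf.nn.bias_add(conv_a, biases_a))
    # ... then a second 3x3 convolution over all 64 channels; together the two
    # layers cover the same 5x5 receptive field as a single 5x5 layer.
    conv_b = tf.nn.conv2d(relu_a, kernel_b, [1, 1, 1, 1], padding='SAME')
    conv1 = tf.nn.relu(tf.nn.bias_add(conv_b, biases_b), name=scope.name)
    _activation_summary(conv1)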