Architecture of VGGnet. What is multi-crop, dense evaluation? - deep-learning

I was reading the VGG16 paper, "Very Deep Convolutional Networks for Large-Scale Image Recognition".
Section 3.2 (Testing) says that all fully-connected layers are replaced by convolutional layers.
Namely,
the fully-connected layers are first converted to convolutional layers (the first FC layer to a 7 × 7
conv. layer, the last two FC layers to 1 × 1 conv. layers). The resulting fully-convolutional net is
then applied to the whole (uncropped) image. The result is a class score map with the number of
channels equal to the number of classes, and a variable spatial resolution, dependent on the input
image size. Finally, to obtain a fixed-size vector of class scores for the image, the class score map is
spatially averaged (sum-pooled)
So the architecture of VGG16 (Configuration D) when predicting on the test set will be
input=(224, 224)
conv2d(64, (3,3))
conv2d(64, (3,3))
Maxpooling(2, 2)
conv2d(128, (3,3))
conv2d(128, (3,3))
Maxpooling(2, 2)
conv2d(256, (3,3))
conv2d(256, (3,3))
conv2d(256, (3,3))
Maxpooling(2, 2)
conv2d(512, (3,3))
conv2d(512, (3,3))
conv2d(512, (3,3))
Maxpooling(2, 2)
conv2d(512, (3,3))
conv2d(512, (3,3))
conv2d(512, (3,3))
Maxpooling(2, 2)
Dense(4096) is replaced by conv2d((7, 7))
Dense(4096) is replaced by conv2d((1, 1))
Dense(1000) is replaced by conv2d((1, 1))
So is this architecture used only at test time?
Do the last 3 conv layers all have 1000 channels?
The result is a class score map with the number of channels equal to the number of classes
Since the input size is 224*224, the output after the last max-pooling layer will be (7 * 7). Why does it say "a variable spatial resolution"? I know it does multi-scale evaluation, but the image will be cropped to (224, 224) before being fed in.
And how does VGG16 get a (1000, ) vector? What does "spatially averaged (sum-pooled)" mean here? Does it just add a sum-pooling layer of size (7, 7) to get a (1, 1, 1000) array?
the class score map is spatially averaged (sum-pooled)
In 3.2 TESTING
Also, multi-crop evaluation is complementary to dense evaluation due
to different convolution boundary conditions: when applying a ConvNet to a crop, the convolved
feature maps are padded with zeros, while in the case of dense evaluation the padding for the same
crop naturally comes from the neighbouring parts of an image (due to both the convolutions and
spatial pooling), which substantially increases the overall network receptive field, so more context
is captured.
So are multi-crop and dense evaluation used only on the validation set?
Let's say the input size is (256, 256). Multi-crop might take (224, 224) crops whose positions differ, say [0:223, 0:223] or [1:224, 1:224]. Is my understanding of multi-crop correct?
And what is dense evaluation? I tried to google it, but could not find relevant results.

The main idea of changing the dense layers to convolutional layers is to make inference independent of the input image size. Suppose you have a (224, 224) image: your network with FC layers will work nicely, but as soon as the image size changes, the network will throw a size-mismatch error (which means your network is image-size dependent).
Hence, to counter this, a fully convolutional network is built in which the features are stored in the channels, while the spatial dimensions are reduced with an average pooling layer or further convolutional steps down to (channel=number_of_classes, 1, 1). So when you flatten this last output, its length is number_of_classes = channel * 1 * 1.
I am not attaching complete code for this, because your questions would need more detailed answers that cover a lot of basics. I encourage you to read up on fully convolutional networks to get the idea. It's easy, and I am sure you will understand the nitty-gritty of it.
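To make the FC-to-conv conversion a bit more concrete, here is a minimal NumPy sketch (not complete code, and not the paper's implementation). The sizes C, H, K and the random weights are made up for illustration, biases are left out, and only the head on top of the last max-pooling layer is modelled. It shows that a dense layer expecting a flattened (C, 7, 7) feature map gives the same numbers as a 7 × 7 convolution with the same weights, and that a larger feature map gives a larger class score map, which is then spatially averaged.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)
C, H, K = 8, 16, 10  # hypothetical sizes: feature channels, hidden units, classes

# Weights of a dense head that expects a flattened (C, 7, 7) feature map.
W1 = rng.standard_normal((H, C * 7 * 7))
W2 = rng.standard_normal((K, H))

def dense_head(feat):
    """feat: (C, 7, 7) -> (K,) class scores via two dense layers."""
    hidden = np.maximum(W1 @ feat.reshape(-1), 0.0)  # ReLU
    return W2 @ hidden

def conv_head(feat):
    """feat: (C, h, w) with h, w >= 7 -> class score map of shape (K, h-6, w-6)."""
    kernels = W1.reshape(H, C, 7, 7)  # same weights, viewed as H kernels of size 7x7xC
    windows = sliding_window_view(feat, (7, 7), axis=(1, 2))  # (C, h-6, w-6, 7, 7)
    hidden = np.maximum(np.einsum("cyxij,ocij->oyx", windows, kernels), 0.0)
    # A 1x1 convolution is a dense layer applied at every spatial position.
    return np.einsum("ko,oyx->kyx", W2, hidden)

feat_224 = rng.standard_normal((C, 7, 7))    # stand-in for the features of a 224x224 input
feat_384 = rng.standard_normal((C, 12, 12))  # stand-in for the features of a 384x384 input

map_224 = conv_head(feat_224)            # shape (K, 1, 1)
map_384 = conv_head(feat_384)            # shape (K, 6, 6): the spatial resolution varies
scores_384 = map_384.mean(axis=(1, 2))   # spatial average -> fixed-size (K,) vector
```

For the 7 × 7 feature map the score map is 1 × 1 and matches the dense head exactly; for the 12 × 12 feature map every position of the 6 × 6 score map corresponds to one 224 × 224 window of the larger image, evaluated in a single pass.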

Related

Unclear Architecture of MNIST Neural Network

I am trying to reproduce a Neural Network trained to detect whether there is a 0-3 digit in an image with another confounding image. The paper I am following lists the corresponding architecture:
A neural network with 28×56 input neurons and one output neuron is
trained on this task. The input values are coded between −0.5 (black)
and +1.5 (white). The neural network is composed of a first detection
pooling layer with 400 detection neurons sum-pooled into 100 units
(i.e. we sum-pool non-overlapping groups of 4 detection units). A
second detection-pooling layer with 400 detection neurons is applied
to the 100-dimensional output of the previous layer, and activities
are sum-pooled onto a single unit representing the deep network
output. Positive examples (0-3 digit in the image) are assigned target
value 100 and negative examples are assigned target value 0. The
neural network is trained to minimize the mean-square error between
the target values and its output.
My main doubt is what they mean by detection neurons in this context: do they mean filters, or single standard ReLU neurons? Also, if they mean filters, how could they be applied in the second layer to a 100-dimensional output when they are designed to operate on 2x2 matrices?
Reference:
Montavon, G., Bach, S., Binder, A., Samek, W., & Müller, K. (2015).
Explaining NonLinear Classification Decisions with Deep Taylor
Decomposition. arXiv. https://doi.org/10.1016/j.patcog.2016.11.008.
Specifically section 4.C
Thanks a lot for the help!
My best guess at this is something like (code not tested - just rough PyTorch):
from torch import nn

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Sequential(
            nn.Flatten(),  # Flatten row-wise into a 1D sequence
            nn.Linear(28 * 56, 400),  # Linear layer with 400 outputs.
            nn.AvgPool1d(4, 4),  # Average groups of 4 -> 100 outputs (sum pooling up to a factor of 4).
        )
        self.layer2 = nn.Sequential(
            nn.Linear(100, 400),  # Linear layer with 400 outputs.
            nn.AdaptiveAvgPool1d(1),  # Average to 1 output (sum pooling up to a factor of 400).
        )

    def forward(self, x):
        return self.layer2(self.layer1(x))
But overall I would agree with the commenter on your post that there are some issues with the language here.
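If "sum-pooled" is read literally (my reading of the paper's wording, not something it spells out in code), pooling non-overlapping groups of 4 detection units is just a reshape followed by a sum. A small NumPy sketch, where `sum_pool` is a hypothetical helper and the activations are dummy values:

```python
import numpy as np

def sum_pool(x, group):
    """Sum non-overlapping groups of `group` consecutive units along the last axis."""
    n = x.shape[-1]
    if n % group != 0:
        raise ValueError("number of units must be divisible by the group size")
    return x.reshape(*x.shape[:-1], n // group, group).sum(axis=-1)

# Dummy activations of 400 detection neurons for a batch of one image.
detections = np.arange(400, dtype=float).reshape(1, 400)

pooled = sum_pool(detections, 4)        # first layer: 400 units -> 100 units
single_unit = sum_pool(detections, 400)  # second layer: 400 units -> 1 output unit
```

Average pooling, as in the PyTorch guess above, gives the same values divided by the group size.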

Curious positive Impact of the batchsize on the training accuracy

I have a question regarding the role of the batch size. My MLP model has 2 Dense-layers with "softmax" activation function:
from keras.models import Sequential
from keras.layers import Dense, BatchNormalization, Activation

# Create my MLP model:
model = Sequential()
model.add(Dense(units=64, input_dim=100))
model.add(BatchNormalization())
model.add(Activation("softmax"))
model.add(Dense(units=64))
model.add(BatchNormalization())
model.add(Activation("softmax"))
model.add(Dense(units=1))
[Figure: training results per batch size. Green: batch size = 2, pink: batch size = 8, red: batch size = 5]
The dataset has 84000 samples. Each sample consists of 100 input values and 1 output value. Each sample describes a different subprocess, so there is no relationship between the samples. I have evaluated the training process with different values of batch_size. Why does the training result look better when the batch size is increased to 8? Is there a relationship in my data samples that I was not aware of?
First of all, you are using batchnorm, which, as the name suggests, normalises samples based on the statistics of the batch, so it works better when the batch size is bigger. Apart from this, a higher batch size also means lower variance of your gradient estimator, which is often good.
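The effect of batch size on the batch statistics can be illustrated with a small NumPy simulation. This is only a sketch under simplifying assumptions: the activations of one unit are modelled as independent standard-normal values (not your actual data), and `batch_mean_spread` is a made-up helper name.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for one unit's pre-normalisation activations: true mean 0, true std 1.
activations = rng.standard_normal(84000)

def batch_mean_spread(x, batch_size):
    """Standard deviation of the per-batch means, i.e. how noisy the batch statistic is."""
    batches = x.reshape(-1, batch_size)
    return batches.mean(axis=1).std()

spread = {b: batch_mean_spread(activations, b) for b in (2, 5, 8)}
for b, s in spread.items():
    print(f"batch size {b}: std of batch means = {s:.3f}")
```

The spread shrinks roughly like 1/sqrt(batch_size), so with a batch size of 2 the mean that batchnorm subtracts is far noisier than with a batch size of 8.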

Resizing a convolution layer before concatenating in Keras

I'm reading U-Net: Convolutional Networks for Biomedical Image Segmentation and want to implement this in Keras.
In U-Net, I need to concatenate convolutional layers: one is in the contracting path and the other is in the expansive path (Fig. 1 in the paper).
However, their sizes don't match, so I have to resize the output of the convolutional layer before concatenating.
How do I do this in Keras?
There is a Cropping2D Layer in Keras: https://keras.io/layers/convolutional/#cropping2d
...
conv_13 = Conv2D(64, (3, 3), padding='same', activation='relu')(conv_12) # has outputsize of 568x568
...
crop_13 = Cropping2D(88)(conv_13) # crop (568 - 392) / 2 = 88 px from each side: 568x568 -> 392x392
merge_91 = Concatenate()([crop_13, upsampled_81]) # concatenate both layers with same 2D size
...
Example for concatenating the first size (568x568) to the last upsampled size (392x392).
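Note that Cropping2D takes the number of pixels to remove per side, not the target size, so for this pair of sizes the margin is (568 - 392) / 2 = 88. The same arithmetic in a framework-free NumPy sketch (the `center_crop` helper and the channel count of 4 are made up for illustration):

```python
import numpy as np

def center_crop(feature_map, target_hw):
    """Crop an (H, W, C) array symmetrically to (target_h, target_w, C)."""
    h, w = feature_map.shape[:2]
    th, tw = target_hw
    if th > h or tw > w:
        raise ValueError("target size must not exceed the input size")
    top, left = (h - th) // 2, (w - tw) // 2
    return feature_map[top:top + th, left:left + tw]

skip = np.zeros((568, 568, 4), dtype=np.float32)       # contracting-path feature map
upsampled = np.zeros((392, 392, 4), dtype=np.float32)  # expansive-path feature map

margin = (skip.shape[0] - upsampled.shape[0]) // 2     # pixels removed on each side
cropped = center_crop(skip, upsampled.shape[:2])
merged = np.concatenate([cropped, upsampled], axis=-1)  # channels are stacked
```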

Why are my Keras Conv2D kernels 3-dimensional?

In a typical CNN, a conv layer will have Y filters of size NxM, and thus it has N x M x Y trainable parameters (not including bias).
Accordingly, in the following simple keras model, I expect the second conv layer to have 16 kernels of size (7x7), and thus kernel weights of size (7x7x16). Why then are its weights actually size (7x7x8x16)?
I understand the mechanics of what is happening: the Conv2D layers are actually doing a 3D convolution, treating the output maps of the previous layer as channels. It has 16 3D kernels of size(7x7x8). What I don't understand is:
Why is this Keras's default behavior?
How do I get a "traditional" convolutional layer without dropping down into the low-level API (avoiding that is my reason for using Keras in the first place)?
from keras.models import Sequential
from keras.layers import InputLayer, Conv2D
model = Sequential([
InputLayer((101, 101, 1)),
Conv2D(8, (11, 11)),
Conv2D(16, (7, 7))
])
model.weights
Q1: and thus kernel weights of size (7x7x16). Why then are its weights actually size (7x7x8x16)?
No, the kernel weights are not of size (7x7x16).
From cs231n:
Example 2. Suppose an input volume had size [16x16x20]. Then using an example receptive field size of 3x3, every neuron in the Conv Layer would now have a total of 3*3*20 = 180 connections to the input volume. Notice that, again, the connectivity is local in space (e.g. 3x3), but full along the input depth (20).
Note the word 'every'.
In your model, 7x7 is the spatial size of a single filter, and each filter connects to all 8 output channels of the previous conv layer, so a single filter has 7x7x8 parameters. You have 16 filters, so the total number of parameters is 7x7x8x16.
Q2: why this is Keras's default behavior?
See Q1.
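The count from Q1 can be checked with a few lines of arithmetic. The helper names below are made up for illustration; the shape layout (height, width, input channels, filters) is the one reported in the question for the Keras weights, and biases are not counted.

```python
def conv2d_kernel_shape(kernel_hw, in_channels, filters):
    """Kernel tensor shape in the (height, width, in_channels, filters) layout."""
    kh, kw = kernel_hw
    return (kh, kw, in_channels, filters)

def num_weights(shape):
    """Total number of kernel weights (bias not included)."""
    total = 1
    for dim in shape:
        total *= dim
    return total

# The two layers from the question: Conv2D(8, (11, 11)) on a 1-channel input,
# then Conv2D(16, (7, 7)) on the 8 feature maps produced by the first layer.
layer1_shape = conv2d_kernel_shape((11, 11), in_channels=1, filters=8)
layer2_shape = conv2d_kernel_shape((7, 7), in_channels=8, filters=16)

layer1_weights = num_weights(layer1_shape)  # 11 * 11 * 1 * 8
layer2_weights = num_weights(layer2_shape)  # 7 * 7 * 8 * 16
```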
In the typical jargon, when someone refers to a conv layer with N kernels of size (x, y), it is implied that the kernels actually have size (x, y, z), where z is the depth of the input volume to that layer.
Imagine what happens when the input image to the network has R, G, and B channels: each of the initial kernels itself has 3 channels. Subsequent layers are the same, treating the input volume as a multi-channel image, where the channels are now maps of some other feature.
The motion of that 3D kernel as it "sweeps" across the input is only 2D, so it is still referred to as a 2D convolution, and the output of that convolution is a 2D feature map.
Edit:
I found a good quote about this in a recent paper, https://arxiv.org/pdf/1809.02601v1.pdf
"In a convolutional layer, the input feature map X is a W1 × H1 × D1 cube, with W1, H1 and D1 indicating its width, height and depth (also referred to as the number of channels), respectively. The output feature map, similarly, is a cube Z with W2 × H2 × D2 entries. The convolution Z = f(X) is parameterized by D2 convolutional kernels, each of which is a S × S × D1 cube."
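A small NumPy sketch of the same idea, using the shapes from the question. It is an illustration with random values, not Keras's implementation; bias and activation are left out, and "convolution" here means the sliding dot product without kernel flipping, as is usual in deep learning libraries.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)
# Output of Conv2D(8, (11, 11)) on a 101x101 input: 91x91 positions, 8 channels.
feature_maps = rng.standard_normal((91, 91, 8))
# Kernel layout (height, width, in_channels, filters), as in the question.
kernels = rng.standard_normal((7, 7, 8, 16))

# All 7x7 windows over the two spatial axes: shape (85, 85, 8, 7, 7).
windows = sliding_window_view(feature_maps, (7, 7), axis=(0, 1))

# Each filter multiplies a full 7x7x8 block and sums it to ONE number per position,
# so the kernel is 3D but it only slides in 2D and produces a 2D map per filter.
output = np.einsum("yxcij,ijco->yxo", windows, kernels)  # shape (85, 85, 16)

# The top-left value of the first output map, computed by hand.
by_hand = np.sum(feature_maps[0:7, 0:7, :] * kernels[:, :, :, 0])
```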

Inception style convolution

I need to keep the model as small as possible in order to deploy an image classifier that can run efficiently in an app (accuracy is not really important to me).
I have only recently started with deep learning and don't have much experience, so I'm currently playing with the CIFAR-10 example.
I tried to replace each of the first two 5x5 convolutional layers with two 3x3 convolutions, as described in the Inception paper.
Unfortunately, when I classify the test set, I get around 0.1 accuracy (random chance).
This is the modified code of the first layer (the second is similar):
with tf.variable_scope('conv1') as scope:
    kernel_l1 = _variable_with_weight_decay('weights_l1', shape=[3, 3, 3, 64],
                                            stddev=1e-4, wd=0.0)
    kernel_l2 = _variable_with_weight_decay('weights_l2', shape=[3, 3, 64, 1],
                                            stddev=1e-4, wd=0.0)
    biases = _variable_on_cpu('biases', [64], tf.constant_initializer(0.0))
    conv_l1 = tf.nn.conv2d(images, kernel_l1, [1, 1, 1, 1], padding='SAME')
    conv_l2 = tf.nn.depthwise_conv2d(conv_l1, kernel_l2, [1, 1, 1, 1], padding='SAME')
    bias = tf.nn.bias_add(conv_l2, biases)
    conv1 = tf.nn.relu(bias, name=scope.name)
    _activation_summary(conv1)
Is it correct?
It seems you're attempting to compute 64 features (for each 3x3 patch) in the first convolutional layer and feed this directly into the second convolutional layer, with no intermediate pooling layer. Convolutional neural networks typically have a structure of stacked convolutional layers, followed by contrast normalization and max pooling.
To reduce processing overheads researchers have experimented in moving from fully connected to sparsely connected architectures, and hence the creation of inception architecture. However, whilst these yield good results for high dimensional inputs, you may be expecting too much from the 32x32 pixels of Cifar10 in TensorFlow.
Therefore, I think the issue has less to do with patch size and more to do with the overall architecture. This code is a known good starting point. Get this working and start reducing parameters until it breaks.
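When reducing parameters, it may help to count them before training anything. The sketch below compares the weight counts for the first layer only, using the channel sizes from the snippet in the question (3 input channels, 64 output channels). The helper names are made up, biases are ignored, and the receptive-field formula assumes stride-1 convolutions stacked directly on top of each other.

```python
def conv_weights(kernel, in_ch, out_ch):
    """Weight count of a standard kernel x kernel convolution (bias not included)."""
    return kernel * kernel * in_ch * out_ch

def depthwise_weights(kernel, in_ch, multiplier):
    """Weight count of a depthwise convolution with the given channel multiplier."""
    return kernel * kernel * in_ch * multiplier

def stacked_receptive_field(kernels):
    """Receptive field of stride-1 convolutions stacked one after another."""
    return 1 + sum(k - 1 for k in kernels)

single_5x5 = conv_weights(5, 3, 64)
two_full_3x3 = conv_weights(3, 3, 64) + conv_weights(3, 64, 64)
# The variant in the question: a full 3x3 conv followed by a depthwise 3x3 conv.
question_variant = conv_weights(3, 3, 64) + depthwise_weights(3, 64, 1)

print(single_5x5, two_full_3x3, question_variant)
print(stacked_receptive_field([3, 3]))  # same spatial extent as one 5x5 kernel
```

With only 3 input channels, two full 3x3 convolutions hold more weights than the single 5x5 layer, while the depthwise variant holds fewer; the saving from stacking 3x3 convolutions only appears when input and output channel counts are comparable.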