How exactly does the Convolution2D layer work in Keras? - deep-learning

I want to write my own convolution layer that works the same as Convolution2D.
How does it work in Keras?
For example, given Convolution2D(64, 3, 3, activation='relu', input_shape=(3, 226, 226)),
which equation gives the output?

Your input image shape is (226, 226, 3) [tf] / (3, 226, 226) [th], the number of filters is 64, and the kernel size is 3x3. The default stride is 1, and the default border mode in Keras is 'valid', i.e. no padding (padding = 0).
So the output is 224x224x64 (it would stay 226x226x64 with border_mode='same', which pads a 3x3 kernel by 1 on each side).
output_width = output_height = (width - filter + 2*padding)/stride + 1
In your case, width = 226, filter = 3, padding = 0 and stride = 1, giving (226 - 3 + 0)/1 + 1 = 224.
If you have trouble with the basic concepts, the cs231n notes on convolutional networks are a good reference; they also walk through the convolution process step by step.
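As a rough illustration of the formula above (this is not Keras's implementation; conv2d_valid is just a made-up helper for a single channel, no padding, stride 1):
import numpy as np

def conv2d_valid(image, kernel):
    # output size follows (width - filter + 2*padding)/stride + 1 with padding=0, stride=1
    h, w = image.shape
    kh, kw = kernel.shape
    out_h = (h - kh) // 1 + 1
    out_w = (w - kw) // 1 + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

img = np.random.rand(226, 226)
k = np.random.rand(3, 3)
print(conv2d_valid(img, k).shape)  # (224, 224)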

Note that Keras does not actually slide a window over the image in conv2d.
To speed up the process, the convolution operation is converted into a matrix (row-per-column) multiplication.
More info here, at chapter 6.
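This is not Keras's actual code, just a small NumPy sketch of that idea (often called im2col): every 3x3 patch is unrolled into a row, so the whole convolution becomes a single matrix multiply.
import numpy as np

def im2col_conv(image, kernel):
    # unroll each patch into a row, then convolve via one matrix product
    h, w = image.shape
    kh, kw = kernel.shape
    out_h, out_w = h - kh + 1, w - kw + 1
    cols = np.empty((out_h * out_w, kh * kw))
    for i in range(out_h):
        for j in range(out_w):
            cols[i * out_w + j] = image[i:i + kh, j:j + kw].ravel()
    return (cols @ kernel.ravel()).reshape(out_h, out_w)

img = np.random.rand(8, 8)
k = np.random.rand(3, 3)
print(im2col_conv(img, k).shape)  # (6, 6)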

Related

Transpose Convolution Output Size

I have been learning about GANs (Generative Adversarial Networks) lately and am having a hard time understanding the output size of a transpose convolution. Let's say I am using a tensor of shape [1, 64, 1, 1] as the input noise. How do I calculate the output of each layer until I construct a 28x28 image (say an MNIST digit)? What should the kernel size, stride, and padding be, assuming I use 3 or 4 layers to reconstruct the 28x28 image?
Note: A handwritten example will be enough as well.
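For reference, with dilation 1 a transpose convolution follows out = (in - 1) * stride - 2 * padding + kernel_size + output_padding. Below is one possible (illustrative, by no means the only) PyTorch stack that takes [1, 64, 1, 1] to a 28x28 image; the channel counts and layer choices are my own assumptions:
import torch
import torch.nn as nn

g = nn.Sequential(
    nn.ConvTranspose2d(64, 128, kernel_size=7, stride=1, padding=0),  # 1 -> 7
    nn.ReLU(),
    nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),  # 7 -> 14
    nn.ReLU(),
    nn.ConvTranspose2d(64, 1, kernel_size=4, stride=2, padding=1),    # 14 -> 28
    nn.Tanh(),
)

z = torch.randn(1, 64, 1, 1)
print(g(z).shape)  # torch.Size([1, 1, 28, 28])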

Pytorch add hyperparameters for 3x3,32 conv2d layer and 2x2 maxpool layer

I am trying to create the conv2d layer below using pytorch. The hyperparameters are given in the image below. I am unsure how to implement the hyperparameters (3x3, 32) for the first conv2d layer. I want to know how to do this with torch.nn.Conv2d.
Thank you very much.
Conv2d with hyperparameters
The conv2d hyper-parameters (3x3, 32) represent kernel_size=(3, 3) and number of output channels=32.
Therefore, this is how you define the first conv layer in your diagram:
conv3x3_32 = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3)
Note that the in_channels hyper-parameter must match the out_channels of the previous layer (or the number of channels of the input).
For more details, see nn.Conv2d.
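A small sketch of that first block together with the 2x2 max-pool from your diagram; the 3-channel, 224x224 input and padding=1 are assumptions on my part, not taken from the figure:
import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3, padding=1),  # (3x3, 32)
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),                                # 2x2 max-pool
)

x = torch.randn(1, 3, 224, 224)
print(block(x).shape)  # torch.Size([1, 32, 112, 112])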

Architecture of VGGnet. What is multi-crop, dense evaluation?

I was reading the VGG16 paper, Very Deep Convolutional Networks for Large-Scale Image Recognition.
In section 3.2 (TESTING), it says that all fully-connected layers are replaced by convolutional layers.
Namely,
the fully-connected layers are first converted to convolutional layers (the first FC layer to a 7 × 7
conv. layer, the last two FC layers to 1 × 1 conv. layers). The resulting fully-convolutional net is
then applied to the whole (uncropped) image. The result is a class score map with the number of
channels equal to the number of classes, and a variable spatial resolution, dependent on the input
image size. Finally, to obtain a fixed-size vector of class scores for the image, the class score map is
spatially averaged (sum-pooled)
So the architecture of VGG16 (configuration D) when predicting on the test set will be:
input=(224, 224)
conv2d(64, (3,3))
conv2d(64, (3,3))
Maxpooling(2, 2)
conv2d(128, (3,3))
conv2d(128, (3,3))
Maxpooling(2, 2)
conv2d(256, (3,3))
conv2d(256, (3,3))
conv2d(256, (3,3))
Maxpooling(2, 2)
conv2d(512, (3,3))
conv2d(512, (3,3))
conv2d(512, (3,3))
Maxpooling(2, 2)
conv2d(512, (3,3))
conv2d(512, (3,3))
conv2d(512, (3,3))
Maxpooling(2, 2)
Dense(4096) is replaced by conv2d((7, 7))
Dense(4096) is replaced by conv2d((1, 1))
Dense(1000) is replaced by conv2d((1, 1))
Is this architecture only used for the test set?
Do the last 3 CNN layers all have 1000 channels?
The result is a class score map with the number of channels equal to the number of classes
Since the input size is 224x224, the output after the last max-pooling layer will be 7x7. Why does the paper say a variable spatial resolution? I know it does multi-scale evaluation, but the image is cropped to a (224, 224) image before being fed in.
And how does VGG16 get a (1000,) vector? What does spatially averaged (sum-pooled) mean here? Does it just add a pooling layer of size (7, 7) to get a (1, 1, 1000) array?
the class score map is spatially averaged (sum-pooled)
In 3.2 TESTING
Also, multi-crop evaluation is complementary to dense evaluation due
to different convolution boundary conditions: when applying a ConvNet to a crop, the convolved
feature maps are padded with zeros, while in the case of dense evaluation the padding for the same
crop naturally comes from the neighbouring parts of an image (due to both the convolutions and
spatial pooling), which substantially increases the overall network receptive field, so more context
is captured.
So will multi-crop and dense evaluation be used only on the validation set?
Let's say the input size is (256, 256); multi-crop might take a (224, 224) image where the centre of the crop differs, say [0:223, 0:223] or [1:224, 1:224]. Is my understanding of multi-crop correct?
And what is dense evaluation? I have tried googling it, but cannot find relevant results.
The main idea of replacing the dense layers with convolutional layers is to make inference independent of the input image size. Suppose you have a (224, 224) image: the network with FC layers will work nicely, but as soon as the image size changes, the network starts throwing size-mismatch errors (which means the network is image-size dependent).
Hence, to counter this, the network is made fully convolutional: the class scores are stored in the channel dimension, while the spatial dimensions are reduced, using an average-pooling layer or further convolutional steps, down to (channels = number_of_classes, 1, 1). So when you flatten this last output, it comes out as a vector of length number_of_classes = channels * 1 * 1.
I am not attaching complete code for this, because your questions would need more detailed answers covering a lot of basics. I encourage you to read up on fully convolutional networks; it is easy and I am 100% sure you will understand the nitty-gritty of it.
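To make that concrete, here is an illustrative PyTorch sketch of the convolutionalised head (not the paper's code): the 7x7 and 1x1 convs stand in for the FC layers, and averaging over the spatial score map gives a fixed-length 1000-way score vector whatever the input size.
import torch
import torch.nn as nn

head = nn.Sequential(
    nn.Conv2d(512, 4096, kernel_size=7),   # replaces Dense(4096)
    nn.ReLU(),
    nn.Conv2d(4096, 4096, kernel_size=1),  # replaces Dense(4096)
    nn.ReLU(),
    nn.Conv2d(4096, 1000, kernel_size=1),  # replaces Dense(1000): 1000-channel score map
)

for size in (7, 9):                        # 7x7 features come from a 224x224 input, 9x9 from 288x288
    features = torch.randn(1, 512, size, size)
    score_map = head(features)             # (1, 1000, size - 6, size - 6): variable spatial resolution
    scores = score_map.mean(dim=(2, 3))    # spatial averaging -> (1, 1000) for any input size
    print(score_map.shape, scores.shape)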

How to reshape a pytorch matrix without mixing elements of items in a batch

In my neural network model, I represent an 8-word sentence with an 8x256-dimensional embedding matrix. I want to feed it to an LSTM, where the LSTM takes a single word embedding at a time as input and processes it. According to the pytorch documentation, the input should be of shape (seq_len, batch, input_size). What is the correct way to convert my input to the desired shape? I don't want to mix up the numbers by mistake. I am quite new to PyTorch and row-major calculations, so I wanted to ask here. I do it as follows; is it correct?
x = torch.rand(8,256)
lstm_input = torch.reshape(x,(8,1,256))
Your solution is correct: you added a singleton dimension for the "batch" dimension, leaving x with temporal dimension 8 and input dimension 256.
Since you are new to pytorch, here are a few equivalent ways of doing the same thing:
x = x[:, None, :]
Putting None at dim=1 tells pytorch to add a singleton dimension.
Another way is to use view:
x = x.view(8, 1, 256)
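For completeness (assuming the same 8x256 tensor), x.unsqueeze(1) is yet another equivalent, and a quick check confirms no elements get mixed up:
import torch

x = torch.rand(8, 256)
a = torch.reshape(x, (8, 1, 256))
b = x[:, None, :]
c = x.view(8, 1, 256)
d = x.unsqueeze(1)
print(a.shape, b.shape, c.shape, d.shape)  # all torch.Size([8, 1, 256])
print(torch.equal(a, d))                   # True: same values, just a new batch axis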

Inception style convolution

I need to keep the model as small as possible to deploy an image classifier that can run efficiently in an app (accuracy is not really relevant for me).
I recently got into deep learning and don't have much experience, so I'm currently playing with the cifar-10 example.
I tried to replace the first two 5x5 convolutional layers with two 3x3 convolutions each, as described in the Inception paper.
Unfortunately, when I classify the test set, I get around 0.1 correct classifications (random choice).
This is the modified code of the first layer (the second is similar):
with tf.variable_scope('conv1') as scope:
    kernel_l1 = _variable_with_weight_decay('weights_l1', shape=[3, 3, 3, 64],
                                            stddev=1e-4, wd=0.0)
    kernel_l2 = _variable_with_weight_decay('weights_l2', shape=[3, 3, 64, 1],
                                            stddev=1e-4, wd=0.0)
    biases = _variable_on_cpu('biases', [64], tf.constant_initializer(0.0))
    conv_l1 = tf.nn.conv2d(images, kernel_l1, [1, 1, 1, 1], padding='SAME')
    conv_l2 = tf.nn.depthwise_conv2d(conv_l1, kernel_l2, [1, 1, 1, 1], padding='SAME')
    bias = tf.nn.bias_add(conv_l2, biases)
    conv1 = tf.nn.relu(bias, name=scope.name)
    _activation_summary(conv1)
Is it correct?
It seems you're computing 64 features (for each 3x3 patch) in the first convolutional layer and feeding this directly into the second convolutional layer, with no intermediate pooling layer. Convolutional neural networks typically have a structure of stacked convolutional layers followed by contrast normalization and max pooling.
To reduce processing overhead, researchers have experimented with moving from fully connected to sparsely connected architectures, hence the creation of the Inception architecture. However, while these yield good results for high-dimensional inputs, you may be expecting too much from the 32x32 pixels of CIFAR-10 in TensorFlow.
Therefore, I think the issue is less about patch size and more about the overall architecture. This code is a known good starting point. Get it working and start reducing parameters until it breaks.
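For reference, in the stacked-3x3 factorisation described in the VGG/Inception papers both layers are usually full convolutions (3 -> 64, then 64 -> 64) with a non-linearity in between. Here is a rough sketch using plain TensorFlow ops; the shapes and initializer values are illustrative only, not a drop-in replacement for the cifar-10 tutorial code:
import tensorflow as tf

images = tf.random.normal([8, 32, 32, 3])  # NHWC batch of CIFAR-sized images

w1 = tf.Variable(tf.random.truncated_normal([3, 3, 3, 64], stddev=5e-2))
b1 = tf.Variable(tf.zeros([64]))
w2 = tf.Variable(tf.random.truncated_normal([3, 3, 64, 64], stddev=5e-2))
b2 = tf.Variable(tf.zeros([64]))

x = tf.nn.conv2d(images, w1, strides=[1, 1, 1, 1], padding='SAME')
x = tf.nn.relu(tf.nn.bias_add(x, b1))      # non-linearity between the two 3x3 convs
x = tf.nn.conv2d(x, w2, strides=[1, 1, 1, 1], padding='SAME')
x = tf.nn.relu(tf.nn.bias_add(x, b2))

print(x.shape)                             # (8, 32, 32, 64)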