Confusion about the output channels of a convolutional neural network - deep-learning

I'm confused about the multi-channel scenario in a convolutional neural network.
Say I have a 10(width) * 5(height) * 6(channels) image, and I feed it into a default 2-D convolution layer with stride=1 and padding=0, expecting the output to be 8(width) * 3(height) * 16(channels).
I know the size of the kernel is 3(width) * 3(height), but I don't know exactly how many kernels there are, and how they are applied to the input data to produce the final 16 channels.
Can someone help me, please?

A 2D convolution layer contains one kernel per (input channel, output channel) pair. So in your case, that is 6*16 = 96 kernels. For 3x3 kernels, this corresponds to 3*3*96 = 864 parameters (the 16 biases are counted separately).
>>> import torch
>>> conv = torch.nn.Conv2d(6, 16, (3, 3))
>>> torch.numel(conv.weight)
864
For one image, one kernel per input channel is first applied. In your case, this results in 6 feature maps, which are summed together (plus a possible bias) to form one of the output channels. Then you repeat this 15 more times to form the 15 other output channels.
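As a quick sanity check, continuing the session above with a random tensor of your input size (note that PyTorch orders dimensions as batch x channels x height x width), the output matches your expected 8(width) * 3(height) * 16(channels):
>>> x = torch.randn(1, 6, 5, 10)  # batch=1, channels=6, height=5, width=10
>>> conv(x).shape
torch.Size([1, 16, 3, 8])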

Related

How to use conv2d in this case

I want to create an NN layer such that:
for an input of size 100, assume every 5 samples form a "block"
the layer should compute, say, 3 values for every block
so the input/output sizes of this layer should be: 100 -> 20*3
every block of size 5 (and only this block) is fully connected to the result block of size 3
If I understand it correctly, I can use Conv2d for this problem, but I'm not sure how to choose the conv2d parameters correctly.
Is Conv2d suitable for this task? If so, what are the correct parameters? Is that
input channels = 100
output channels = 20*3
kernel = (5,1)
?
You can use either Conv2D or Conv1D.
With the data shaped like batch x 100 x n_features you can use Conv1D with this setup:
Input channels: n_features
Output channels: 3 * output_features
kernel: 5
strides: 5
This way, the kernel is applied to 5 samples at a time and generates 3 outputs per block. The values for n_features and output_features can be anything you like and may as well be 1. Setting the stride to 5 makes the convolution non-overlapping, so that each block contributes to exactly one output.
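Here is a minimal PyTorch sketch of that setup, assuming n_features = 1 and output_features = 1 (note that PyTorch's Conv1d expects channels-first input, i.e. batch x n_features x length):
import torch

x = torch.randn(8, 1, 100)  # batch=8, n_features=1, length=100
conv = torch.nn.Conv1d(in_channels=1, out_channels=3, kernel_size=5, stride=5)
print(conv(x).shape)  # torch.Size([8, 3, 20]): 20 blocks, 3 values per block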

Question on the kernel dimensions for convolutions on mel filter bank features

I am currently trying to understand the following paper: https://arxiv.org/pdf/1703.08581.pdf. I am struggling to understand a part about how a convolution is performed on an input of log mel filterbank features:
We train seq2seq models for both end-to-end speech translation, and a baseline model for speech recognition. We found that the same architecture, a variation of that from [10], works well for both tasks. We use 80 channel log mel filterbank features extracted from 25ms windows with a hop size of 10ms, stacked with delta and delta-delta features. The output softmax of all models predicts one of 90 symbols, described in detail in Section 4, that includes English and Spanish lowercase letters.
The encoder is composed of a total of 8 layers. The input features are organized as a T × 80 × 3 tensor, i.e. raw features, deltas, and delta-deltas are concatenated along the 'depth' dimension. This is passed into a stack of two convolutional layers with ReLU activations, each consisting of 32 kernels with shape 3 × 3 × depth in time × frequency. These are both strided by 2 × 2, downsampling the sequence in time by a total factor of 4, decreasing the computation performed in the following layers. Batch normalization [26] is applied after each layer.
As I understand it, the input to the convolutional layer is 3-dimensional: number of 25 ms windows (T) x 80 (features for each window) x 3 (features, delta features, and delta-delta features). However, the kernels used on those inputs seem to have 4 dimensions, and I do not understand why that is. Wouldn't a 4-dimensional kernel need a 4-dimensional input? In my head, the input has the same dimensions as an RGB picture: width (time) x height (frequency) x color channels (features, delta features, and delta-delta features). Therefore I would think of a kernel for a 2D convolution as a filter of size a (filter width) x b (filter height) x 3 (depth of the input). Am I missing something here? What is wrong with my idea, or what is done differently in this paper?
Thanks in advance for your answer!
I figured it out; it turns out it was just a misunderstanding on my side: the authors are using 32 kernels of shape 3x3, which results (after two layers with 2x2 striding) in an output of shape T/4 x 20 x 32, where T stands for the time dimension.
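For illustration, here is a minimal PyTorch sketch of that front end; the padding value is my assumption, chosen so that striding by 2x2 twice downsamples time and frequency by exactly a factor of 4:
import torch

# Two conv layers as described: 32 kernels each, 3x3 in time x frequency,
# strided 2x2, with ReLU activations. The input is (batch, depth=3, T, 80),
# i.e. raw features, deltas, and delta-deltas stacked along the channels.
x = torch.randn(1, 3, 100, 80)  # T = 100 frames, for example
conv1 = torch.nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1)
conv2 = torch.nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1)
y = torch.relu(conv2(torch.relu(conv1(x))))
print(y.shape)  # torch.Size([1, 32, 25, 20]) -> (batch, 32, T/4, 80/4)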

Combine two tensors of same dimension to get a final output tensor using trainable weights

While working on a problem related to question answering (MRC), I have implemented two different architectures that independently give two tensors (probability distributions over the tokens). Both tensors are of dimension (batch_size, 512). I wish to obtain a final output of the form (batch_size, 512). How can I combine the two tensors using trainable weights and then train the model on the final prediction?
Edit (Additional Information):
So in the forward function of my NN model, I have used a BERT model to encode the 512 tokens. These encodings are 768-dimensional. They are then passed to a linear layer nn.Linear(768, 1) to output a tensor of shape (batch_size, 512, 1). Apart from this, I have another model built on top of the BERT encodings that also yields a tensor of shape (batch_size, 512, 1). I wish to combine these two tensors to finally get a tensor of shape (batch_size, 512, 1), which can be trained against the output logits of the same shape using CrossEntropyLoss.
Please share the PyTorch code snippet if possible.
Assume your two vectors are V1 and V2. You need to combine them (ensembling) to get a new vector. You can use a weighted sum like this:
a = sigmoid(alpha)
V_final = a * V1 + (1 - a) * V2
where alpha is a learnable scalar. The sigmoid bounds the weight between 0 and 1, and you can initialise alpha = 0 so that sigmoid(alpha) is 0.5, meaning you start by adding V1 and V2 with equal weights.
This is a linear combination; there are non-linear versions as well. For example, you can have a nonlinear layer that accepts the concatenation (V1;V2) and outputs a softmaxed result, e.g. softmax(W * (V1;V2) + b).
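Here is a minimal PyTorch sketch of the learnable weighted sum (the module and variable names are just for illustration):
import torch
import torch.nn as nn

class WeightedCombine(nn.Module):
    def __init__(self):
        super().__init__()
        # sigmoid(0) = 0.5, so both inputs start with equal weight
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, v1, v2):
        a = torch.sigmoid(self.alpha)
        return a * v1 + (1 - a) * v2

combine = WeightedCombine()
v1 = torch.randn(4, 512, 1)  # e.g. logits from the first model
v2 = torch.randn(4, 512, 1)  # logits from the second model
print(combine(v1, v2).shape)  # torch.Size([4, 512, 1])
Because alpha is an nn.Parameter, it is updated by the optimizer along with the rest of the model; the non-linear version could instead concatenate v1 and v2 and pass them through a small linear layer followed by a softmax.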

Calculating the number of weights in Convolutional Neural Network using Parameter Sharing

While reading the book Machine Learning: A Probabilistic Perspective by Murphy and this article by Mike O'Neill, I encountered some calculations about the number of weights in a convolutional neural network which I want to understand. The architecture of the network is as follows:
And this is the explanation from the above article:
Layer #2 is also a convolutional layer, but with 50 feature maps. Each feature map is 5x5, and each unit in the feature maps is a 5x5 convolutional kernel of corresponding areas of all 6 of the feature maps of the previous layers, each of which is a 13x13 feature map. There are therefore 5x5x50 = 1250 neurons in Layer #2, (5x5+1)x6x50 = 7800 weights, and 1250x26 = 32500 connections.
The calculation of the number of weights, (5x5+1)x6x50 = 7800, seems strange to me. Shouldn't the actual calculation be
(5x5x6+1)x50 = 7550, according to the parameter sharing explained here?
My argument is as follows:
We have 50 filters of size 5x5x6 and 1 bias for each filter, hence the total number of weights is (5x5x6+1)x50 = 7550. And here is PyTorch code which verifies this:
import torch
import torch.nn as nn
model = nn.Conv2d(in_channels=6, out_channels=50, kernel_size=5, stride=2)
params_count = sum(param.numel() for param in model.parameters() if param.requires_grad)
print(params_count) # 7550
Can anyone explain this? Which one is correct?
My calculations:
Layer-1 depth is 6, kernel: 5x5
Layer-2 depth is 50, kernel: 5x5
Total number of neurons in Layer-2: 5x5x50 = 1250
Total number of kernel weights: 5x5x6x50 = 7500
Finally, biases for Layer-2: 50 (the depth is 50)
I agree with you: the total number of weights must be 7500 + 50 = 7550.
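As a sanity check of this breakdown, you can count the kernel weights and biases separately in PyTorch:
import torch.nn as nn

layer = nn.Conv2d(in_channels=6, out_channels=50, kernel_size=5, stride=2)
print(layer.weight.numel())  # 7500 = 5*5*6*50 kernel weights
print(layer.bias.numel())    # 50 biases, one per feature map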

Why are my Keras Conv2D kernels 3-dimensional?

In a typical CNN, a conv layer will have Y filters of size NxM, and thus it has N x M x Y trainable parameters (not including bias).
Accordingly, in the following simple Keras model, I expect the second conv layer to have 16 kernels of size (7x7), and thus kernel weights of size (7x7x16). Why then are its weights actually of size (7x7x8x16)?
I understand the mechanics of what is happening: the Conv2D layers are actually doing a 3D convolution, treating the output maps of the previous layer as channels. It has 16 3D kernels of size (7x7x8). What I don't understand is:
why is this Keras's default behavior?
how do I get a "traditional" convolutional layer without dropping down into the low-level API (avoiding that is my reason for using Keras in the first place)?
from keras.models import Sequential
from keras.layers import InputLayer, Conv2D

model = Sequential([
    InputLayer((101, 101, 1)),
    Conv2D(8, (11, 11)),
    Conv2D(16, (7, 7)),
])
model.weights
Q1: "and thus kernel weights of size (7x7x16). Why then are its weights actually size (7x7x8x16)?"
No, the kernel weights are not of size (7x7x16).
From cs231n:
Example 2. Suppose an input volume had size [16x16x20]. Then using an example receptive field size of 3x3, every neuron in the Conv Layer would now have a total of 3*3*20 = 180 connections to the input volume. Notice that, again, the connectivity is local in space (e.g. 3x3), but full along the input depth (20).
Note the word 'every'.
In your model, 7x7 is the spatial size of a single filter, and each filter connects to all 8 channels of the previous conv layer, so the number of parameters in a single filter is 7x7x8. You have 16 such filters, so the total number of parameters is 7x7x8x16.
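You can verify this directly on your model; model.weights lists each layer's kernel and bias in order, and the second Conv2D's kernel indeed has shape (7, 7, 8, 16):
from keras.models import Sequential
from keras.layers import InputLayer, Conv2D

model = Sequential([
    InputLayer((101, 101, 1)),
    Conv2D(8, (11, 11)),
    Conv2D(16, (7, 7)),
])
# Weights are ordered: conv1 kernel, conv1 bias, conv2 kernel, conv2 bias.
print(model.weights[-2].shape)  # (7, 7, 8, 16) = (height, width, in_channels, out_channels)
print(model.weights[-1].shape)  # (16,), one bias per output channel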
Q2: why is this Keras's default behavior?
See Q1.
In the typical jargon, when someone refers to a conv layer with N kernels of size (x, y), it is implied that the kernels actually have size (x, y, z), where z is the depth of the input volume to that layer.
Imagine what happens when the input image to the network has R, G, and B channels: each of the initial kernels itself has 3 channels. Subsequent layers are the same, treating the input volume as a multi-channel image, where the channels are now maps of some other feature.
The motion of that 3D kernel as it "sweeps" across the input is only 2D, so it is still referred to as a 2D convolution, and the output of that convolution is a 2D feature map.
Edit:
I found a good quote about this in a recent paper, https://arxiv.org/pdf/1809.02601v1.pdf
"In a convolutional layer, the input feature map X is a W1 × H1 × D1 cube, with W1, H1 and D1 indicating its width, height and depth (also referred to as the number of channels), respectively. The output feature map, similarly, is a cube Z with W2 × H2 × D2 entries. The convolution Z = f(X) is parameterized by D2 convolutional kernels, each of which is a S × S × D1 cube."