Why are my Keras Conv2D kernels 3-dimensional? - deep-learning

In a typical CNN, a conv layer will have Y filters of size NxM, and thus it has N x M x Y trainable parameters (not including bias).
Accordingly, in the following simple Keras model, I expect the second conv layer to have 16 kernels of size (7x7), and thus kernel weights of size (7x7x16). Why then are its weights actually of size (7x7x8x16)?
I understand the mechanics of what is happening: the Conv2D layers are actually doing a 3D convolution, treating the output maps of the previous layer as channels. It has 16 3D kernels of size (7x7x8). What I don't understand is:
Why is this Keras's default behavior?
How do I get a "traditional" convolutional layer without dropping down into the low-level API (avoiding that is my reason for using Keras in the first place)?
from keras.models import Sequential
from keras.layers import InputLayer, Conv2D

model = Sequential([
    InputLayer((101, 101, 1)),
    Conv2D(8, (11, 11)),
    Conv2D(16, (7, 7))
])
model.weights

Q1: and thus kernel weights of size (7x7x16). Why then are its weights actually of size (7x7x8x16)?
No, the kernel weights are not of size (7x7x16).
from cs231n:
Example 2. Suppose an input volume had size [16x16x20]. Then using an example receptive field size of 3x3, every neuron in the Conv Layer would now have a total of 3*3*20 = 180 connections to the input volume. Notice that, again, the connectivity is local in space (e.g. 3x3), but full along the input depth (20).
Note the word 'every'.
In your model, 7x7 is the spatial size of a single filter, and that filter connects to every channel of the previous conv layer, so the parameters of a single filter are 7x7x8. You have 16 such filters, so the total number of parameters is 7x7x8x16.
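As a quick sanity check, here is a minimal sketch reusing the model from the question (the only assumption is a TensorFlow-backed Keras install) that prints the kernel shapes:

from keras.models import Sequential
from keras.layers import InputLayer, Conv2D

model = Sequential([
    InputLayer((101, 101, 1)),
    Conv2D(8, (11, 11)),   # kernel shape (11, 11, 1, 8):  11*11*1*8 = 968 weights (+ 8 biases)
    Conv2D(16, (7, 7)),    # kernel shape (7, 7, 8, 16):   7*7*8*16 = 6272 weights (+ 16 biases)
])

for layer in model.layers:
    if layer.weights:  # skip any layer without weights (e.g. the InputLayer)
        kernel, bias = layer.get_weights()
        print(layer.name, kernel.shape, bias.shape)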
Q2: why is this Keras's default behavior?
See Q1.

In the typical jargon, when someone refers to a conv layer with N kernels of size (x, y), it is implied that the kernels actually have size (x, y, z), where z is the depth of the input volume to that layer.
Imagine what happens when the input image to the network has R, G, and B channels: each of the initial kernels itself has 3 channels. Subsequent layers are the same, treating the input volume as a multi-channel image, where the channels are now maps of some other feature.
The motion of that 3D kernel as it "sweeps" across the input is only 2D, so it is still referred to as a 2D convolution, and the output of that convolution is a 2D feature map.
Edit:
I found a good quote about this in a recent paper, https://arxiv.org/pdf/1809.02601v1.pdf
"In a convolutional layer, the input feature map X is a W1 × H1 × D1 cube, with W1, H1 and D1 indicating its width, height and depth (also referred to as the number of channels), respectively. The output feature map, similarly, is a cube Z with W2 × H2 × D2 entries. The convolution Z = f(X) is parameterized by D2 convolutional kernels, each of which is a S × S × D1 cube."

Related

Is a convolution kernel size of 1 meaningful?

I have time series data of size 32x32, and I used a sliding window to form a 3D array of timesteps x 32 x 32, so my inputs have shape (batchsize x timesteps x 32 x 32).
I tried a (1, 3, 3) Conv3D filter and a (3, 3) TimeDistributed Conv2D filter separately. The performance is different.
Can I say that a convolution kernel of size 1 can extract features? What is the difference between these two kernels?
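For reference, a minimal sketch of the two configurations described above (a single input channel and 8 filters are assumed, since the question does not state them); note that both layers end up with the same number of trainable parameters:

from keras.models import Sequential
from keras.layers import InputLayer, Conv3D, Conv2D, TimeDistributed

timesteps = 10  # hypothetical number of sliding-window steps

# Conv3D with a (1, 3, 3) kernel: no mixing across the time axis
m3d = Sequential([
    InputLayer((timesteps, 32, 32, 1)),   # explicit channel dimension of 1 assumed
    Conv3D(8, (1, 3, 3)),
])

# TimeDistributed Conv2D with a (3, 3) kernel, applied to each timestep
m2d = Sequential([
    InputLayer((timesteps, 32, 32, 1)),
    TimeDistributed(Conv2D(8, (3, 3))),
])

m3d.summary()  # 8 * (1*3*3*1 + 1) = 80 parameters
m2d.summary()  # 8 * (3*3*1 + 1)   = 80 parameters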

Confusion about the output channels of a convolutional neural network

I'm confused about the multi-channel scenario in convolutional neural networks.
Say I have a 10 (width) * 5 (height) * 6 (channels) image, and I feed it into a default 2-D convolution layer with stride=1 and padding=0, and I expect the output to be 8 (width) * 3 (height) * 16 (channels).
I know the size of the kernel is 3 (width) * 3 (height), but I don't know how many kernels there are exactly, and how they are applied to the input data to give the final 16 channels.
Can someone help me, please?
A 2D convolution layer contains one kernel per input channel, per output channel. So in your case, this will be 6*16=96 kernels. For 3x3 kernels, this corresponds to 3*3*96 = 864 parameters.
>>> import torch
>>> conv = torch.nn.Conv2d(6, 16, (3, 3))
>>> torch.numel(conv.weight)
864
For one image, one kernel per input channel is first applied. In your case, this results in 6 feature maps, which are summed together (plus a possible bias) to form one of the output channels. Then, you repeat this 15 times to form the 15 other output channels.
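As a rough check of that description, here is a minimal PyTorch sketch (random input, bias disabled for simplicity) that rebuilds one output channel by convolving each input channel separately and summing:

import torch
import torch.nn.functional as F

x = torch.randn(1, 6, 5, 10)                       # (batch, channels, height, width)
conv = torch.nn.Conv2d(6, 16, (3, 3), bias=False)  # 6*16 = 96 kernels of 3x3

reference = conv(x)[:, 0]                          # first output channel

# Apply each of the 6 per-input-channel kernels separately, then sum the results.
manual = sum(
    F.conv2d(x[:, c:c + 1], conv.weight[0:1, c:c + 1])
    for c in range(6)
)[:, 0]

print(torch.allclose(reference, manual, atol=1e-6))  # True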

Why is a 1x1 conv the same as a fully connected layer?

I am currently reading the "Network in Network" paper.
And in the paper, it is stated that
"the cross channel parametric pooling layer is also equivalent to convolution layer with
1x1 convolution kernel. "
My question is, first of all, what does "cross channel parametric pooling layer" exactly mean? Is it just a fully connected layer?
And why is a cross channel parametric pooling layer the same as a 1x1 convolution kernel?
I would be thankful if you could answer both mathematically and with examples.
Please help me!
I haven't read the paper but I have a fair idea of what this is. First of all
How is a 1x1 convolution like a fully connected layer?
So we have a feature map with dims (C, H, W), where C = number of channels, H = height, W = width. I'll call positions in (H, W) "pixels". A 1x1 convolution will consist of C' (the number of output channels of the convolution) kernels, each with shape (C, 1, 1). So if we consider any pixel in the input feature map, we can apply a single (C, 1, 1) kernel to it to produce a (1, 1, 1) output. Applying C' different kernels will result in a (C', 1, 1) output. This is equivalent to applying a single fully connected layer to one pixel of the input feature map. Have a look at the following diagram to understand the action of a 1x1 convolution on a single pixel of the input feature map.
The different colors represent different kernels of the convolution, corresponding to different output channels. You can see now how the kernels effectively comprise the weights of a single fully connected layer.
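Here is a minimal PyTorch sketch of that equivalence (shapes chosen arbitrarily): a 1x1 convolution whose weights are copied into a fully connected layer produces identical per-pixel outputs:

import torch

C_in, C_out, H, W = 8, 4, 6, 6
x = torch.randn(1, C_in, H, W)

conv = torch.nn.Conv2d(C_in, C_out, kernel_size=1)
fc = torch.nn.Linear(C_in, C_out)

# Copy the (C_out, C_in, 1, 1) conv weights into the (C_out, C_in) linear layer.
with torch.no_grad():
    fc.weight.copy_(conv.weight.view(C_out, C_in))
    fc.bias.copy_(conv.bias)

out_conv = conv(x)                                        # (1, C_out, H, W)
out_fc = fc(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)    # apply the FC layer to every pixel

print(torch.allclose(out_conv, out_fc, atol=1e-6))  # True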
What is cross channel parametric pooling?
This is where I'm going to make a guess I'm 90% certain of (not 100% because I didn't read the paper). This is just an extension of the logic above, to whole feature maps rather than individual pixels. You're applying a cross-channel aggregation mechanism. The mechanism is parametric because it's not just a simple mean or sum or max, it's actually a parameterised weighted sum. Also note that the weights are held constant across all pixels (remember, that's how convolution kernels work). So it's essentially the same as applying the weights of a single fully connected layer to channels of a feature map in order to produce a different set of feature maps. But instead of applying the weights to individual neurons, you are applying them to all the neurons of the feature map at the same time:

Question on the kernel dimensions for convolutions on mel filter bank features

I am currently trying to understand the following paper: https://arxiv.org/pdf/1703.08581.pdf. I am struggling to understand a part about how a convolution is performed on an input of log mel filterbank features:
We train seq2seq models for both end-to-end speech translation, and a baseline model for speech recognition. We found that the same architecture, a variation of that from [10], works well for both tasks. We use 80 channel log mel filterbank features extracted from 25ms windows with a hop size of 10ms, stacked with delta and delta-delta features. The output softmax of all models predicts one of 90 symbols, described in detail in Section 4, that includes English and Spanish lowercase letters.
The encoder is composed of a total of 8 layers. The input features are organized as a T × 80 × 3 tensor, i.e. raw features, deltas, and delta-deltas are concatenated along the 'depth' dimension. This is passed into a stack of two convolutional layers with ReLU activations, each consisting of 32 kernels with shape 3 × 3 × depth in time × frequency. These are both strided by 2 × 2, downsampling the sequence in time by a total factor of 4, decreasing the computation performed in the following layers. Batch normalization [26] is applied after each layer.
As I understand it, the input to the convolutional layer is 3-dimensional (number of 25 ms windows (T) x 80 (features for each window) x 3 (features, delta features and delta-delta features)). However, the kernels used on those inputs seem to have 4 dimensions, and I do not understand why that is. Wouldn't a 4-dimensional kernel need a 4-dimensional input? In my head, the input has the same dimensions as an RGB picture: width (time) x height (frequency) x color channels (features, delta features and delta-delta features). Therefore I would think of a kernel for a 2D convolution as a filter of size a (filter width) x b (filter height) x 3 (depth of the input). Am I missing something here? What is wrong about my idea, or what is done differently in this paper?
Thanks in advance for your answer!
I figured it out; it turns out it was just a misunderstanding on my side: the authors are using 32 kernels of shape 3x3, which results (after two layers with 2x2 striding) in an output of shape t/4 x 20 x 32, where t stands for the time dimension.
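A minimal Keras sketch of that layout, using a concrete T = 100 for illustration (padding='same' is an assumption; the paper does not state the padding), gives the expected t/4 × 20 × 32 output:

from keras.models import Sequential
from keras.layers import InputLayer, Conv2D

T = 100  # a concrete number of frames, for illustration

model = Sequential([
    InputLayer((T, 80, 3)),   # time x frequency x (raw, delta, delta-delta)
    Conv2D(32, (3, 3), strides=(2, 2), padding='same', activation='relu'),
    Conv2D(32, (3, 3), strides=(2, 2), padding='same', activation='relu'),
])

model.summary()  # final output shape: (None, 25, 20, 32), i.e. T/4 x 20 x 32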

Interconnection between two Convolutional Layers

I have a question regarding the interconnection between two convolutional layers in a CNN. For example, suppose I have an architecture like this:
input: 28 x 28
conv1: 3 x 3 filter, no. of filters : 16
conv2: 3 x 3 filter, no. of filters : 32
After conv1 we get an output of 16 x 28 x 28, assuming the image dimensions are not reduced, so we have 16 feature maps. In the next layer, each feature map is connected to the next layer, meaning that if we consider each feature map (28 x 28) as a neuron, then each neuron will be connected to all 32 filters, for a total of
(3 x 3 x 16) x 32 parameters. How are these two layers stacked or interconnected? In the case of an Artificial Neural Network we have weights between two layers. Is there something like this in a CNN as well? How is the output of one convolutional layer fed to the next convolutional layer?
The number of parameters of a convolutional layer with n filters of size k×k which comes after f feature maps is
n ⋅ (f ⋅ k ⋅ k + 1)
where the +1 comes from the bias.
Hence each of the f filters is not of shape k×k×1 but of shape k×k×f.
How the output of one convolutional layer is fed to the next convolutional layer?
Just like the input is fed to the first convolutional layer. There is no difference (except the number of feature maps).
Convolution on one input feature map
Image source: https://github.com/vdumoulin/conv_arithmetic
See also: another animation
Multiple input feature maps
It works the same:
The filter has the same depth as the input. Before it was 1, now it is more.
You still slide the filter over all (x, y) positions. For each position, it gives one output.
Your example
First conv layer: 160 = 16*(3*3+1)
Second conv layer: 4640 = 32*(16*3*3+1)
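As a cross-check, a minimal Keras sketch of that example (a single-channel 28 x 28 input is assumed, with 'same' padding so the spatial size is not reduced) reports exactly those parameter counts:

from keras.models import Sequential
from keras.layers import InputLayer, Conv2D

model = Sequential([
    InputLayer((28, 28, 1)),                # 28 x 28 input, one channel assumed
    Conv2D(16, (3, 3), padding='same'),     # 16 * (3*3*1  + 1) = 160 parameters
    Conv2D(32, (3, 3), padding='same'),     # 32 * (3*3*16 + 1) = 4640 parameters
])
model.summary()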