Suppose you have a 10x10x3 colour image input and you want to stack two convolutional layers with kernel size 3x3 with 10 and 20 filters respectively.
How many parameters do you have to train for these two layers?
Don't forget bias terms!
I've tried (3*3*3+1) * (10+20) but it's apparently not right.
How to calculate the number of parameters in a CNN?
For each layer:
n: kernel width
m: kernel height
l: number of input feature maps
k: number of output feature maps
number of parameters = (n*m*l + 1) * k
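Applied to the example above, this shows why (3*3*3+1) * (10+20) is not right: the second layer's kernels see the 10 feature maps produced by the first layer, not the 3 colour channels. A quick check in plain Python (variable names are just for illustration):

```python
# Layer 1: 3x3 kernels over 3 input channels (RGB), 10 filters
params_conv1 = (3 * 3 * 3 + 1) * 10     # = 280

# Layer 2: 3x3 kernels over 10 input feature maps (the output of layer 1), 20 filters
params_conv2 = (3 * 3 * 10 + 1) * 20    # = 1820

print(params_conv1 + params_conv2)      # 2100
```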
I was working on segmentation using U-Net; it is a multiclass segmentation problem with 21 classes.
Ideally we would go with softmax as the activation in the last layer, which contains 21 kernels so that the output depth is 21, matching the number of classes.
But my question is: if we use softmax as the activation in this layer, how will it work? I mean, softmax will be applied to each feature map, and by its nature it gives probabilities that sum to 1. But we need 1's in all places where the corresponding class is present in the feature map.
Or is the softmax applied depth-wise, i.e. taking the 21 class values at each pixel and applying it on top of them?
I hope I have explained the problem properly.
I have tried sigmoid as the activation, and the result is not good.
If I understand correctly, you have 21 kernels that are of some shape m*n. So if you reshape your final layer to have a shape of (batch_size, 21, m*n), then you can apply softmax along the first dimension (the 21 maps). Then every value within a single kernel should be the same, and you can take the kernel with the max value.
In this case, you'll find the feature map that has the best overall overlap with the region of interest, rather than finding which part of every feature map overlaps with the ROI if any.
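A minimal PyTorch sketch of the reshaping described above (the shapes here are just assumed for illustration); note that softmax over dimension 1 of a (batch_size, 21, m*n) tensor gives a 21-way probability distribution at every spatial position, i.e. the 'depth-wise' application the question asks about:

```python
import torch
import torch.nn.functional as F

batch_size, num_classes, m, n = 2, 21, 64, 64        # assumed sizes
logits = torch.randn(batch_size, num_classes, m, n)  # raw output of the last conv layer

# Reshape to (batch_size, 21, m*n) and apply softmax over dimension 1 (the 21 maps):
probs = F.softmax(logits.view(batch_size, num_classes, m * n), dim=1)

print(probs.sum(dim=1))   # all ones: the 21 values at each position sum to 1

# Per-position class prediction, reshaped back to an (m, n) label map:
pred = probs.argmax(dim=1).view(batch_size, m, n)
```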
I am currently reading the 'Network in Network' paper.
In the paper, it is stated that
"the cross channel parametric pooling layer is also equivalent to convolution layer with 1x1 convolution kernel."
My question is, first of all, what exactly does 'cross channel parametric pooling layer' mean? Is it just a fully connected layer?
And why is the cross channel parametric pooling layer the same as a 1x1 convolution kernel?
It would be great if you could answer both mathematically and with examples.
Please help me~
I haven't read the paper but I have a fair idea of what this is. First of all
How is a 1x1 convolution like a fully connected layer?
So we have a feature map with dims (C, H, W), where C = number of channels, H = height, W = width. I'll call positions in (H, W) "pixels". A 1x1 convolution will consist of C' (the number of output channels of the convolution) kernels, each with shape (C, 1, 1). So if we consider any pixel in the input feature map, we can apply a single (C, 1, 1) kernel to it to produce a (1, 1, 1) output. Applying C' different kernels will result in a (C', 1, 1) output. This is equivalent to applying a single fully connected layer to one pixel of the input feature map. Have a look at the following diagram to understand the action of a 1x1 convolution on a single pixel of the input feature map:
The different colors represent different kernels of the convolution, corresponding to different output channels. You can see now how the kernels effectively comprise the weights of a single fully connected layer.
What is cross channel parametric pooling?
This is where I'm going to make a guess I'm 90% certain of (not 100% because I didn't read the paper). This is just an extension of the logic above, to whole feature maps rather than individual pixels. You're applying a cross-channel aggregation mechanism. The mechanism is parametric because it's not just a simple mean or sum or max, it's actually a parameterised weighted sum. Also note that the weights are held constant across all pixels (remember, that's how convolution kernels work). So it's essentially the same as applying the weights of a single fully connected layer to the channels of a feature map in order to produce a different set of feature maps. But instead of applying the weights to individual neurons, you are applying them to all the neurons of the feature map at the same time.
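To make the equivalence concrete, here is a small NumPy sketch (sizes are arbitrary) showing that a 1x1 convolution over a (C, H, W) feature map computes exactly the same thing as one fully connected layer, with shared weights, applied to every pixel:

```python
import numpy as np

C, H, W = 4, 5, 5                     # input channels, height, width
C_out = 3                             # output channels of the 1x1 convolution

x = np.random.randn(C, H, W)
weights = np.random.randn(C_out, C)   # one length-C weight vector per output channel
bias = np.random.randn(C_out)

# 1x1 convolution: at every pixel, a weighted sum across the C input channels.
conv_out = np.einsum('oc,chw->ohw', weights, x) + bias[:, None, None]

# The same thing written as a fully connected layer applied to each pixel independently.
fc_out = (weights @ x.reshape(C, H * W) + bias[:, None]).reshape(C_out, H, W)

assert np.allclose(conv_out, fc_out)
```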
I am currently trying to understand the following paper: https://arxiv.org/pdf/1703.08581.pdf. I am struggling to understand a part about how a convolution is performed on an input of log mel filterbank features:
We train seq2seq models for both end-to-end speech translation, and a baseline model for speech recognition. We found that the same architecture, a variation of that from [10], works well for both tasks. We use 80 channel log mel filterbank features extracted from 25ms windows with a hop size of 10ms, stacked with delta and delta-delta features. The output softmax of all models predicts one of 90 symbols, described in detail in Section 4, that includes English and Spanish lowercase letters.

The encoder is composed of a total of 8 layers. The input features are organized as a T × 80 × 3 tensor, i.e. raw features, deltas, and delta-deltas are concatenated along the ’depth’ dimension. This is passed into a stack of two convolutional layers with ReLU activations, each consisting of 32 kernels with shape 3 × 3 × depth in time × frequency. These are both strided by 2 × 2, downsampling the sequence in time by a total factor of 4, decreasing the computation performed in the following layers. Batch normalization [26] is applied after each layer.
As I understand it, the input to the convolutional layer is 3-dimensional: number of 25 ms windows (T) x 80 (features for each window) x 3 (features, delta features and delta-delta features). However, the kernels used on those inputs seem to have 4 dimensions and I do not understand why that is. Wouldn't a 4-dimensional kernel need a 4-dimensional input? In my head, the input has the same dimensions as an RGB picture: width (time) x height (frequency) x color channels (features, delta features and delta-delta features). Therefore I would think of a kernel for a 2D convolution as a filter of size a (filter width) x b (filter height) x 3 (depth of the input). Am I missing something here? What is wrong with my idea, or what is done differently in this paper?
Thanks in advance for your answer!
I figured it out; it turns out it was just a misunderstanding on my side: the authors are using 32 kernels of shape 3x3, which results (after two layers with 2x2 striding) in an output of shape T/4 x 20 x 32, where T stands for the time dimension.
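For anyone who wants to check the shapes, here is a minimal PyTorch sketch of the two strided convolutions (my own reconstruction from the excerpt, not the authors' code; T = 1000 is an arbitrary example value):

```python
import torch
import torch.nn as nn

T = 1000                               # hypothetical number of frames
x = torch.randn(1, 3, T, 80)           # (batch, depth, time, frequency)

conv = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
)

print(conv(x).shape)   # torch.Size([1, 32, 250, 20]) -> (T/4) x 20 x 32
```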
I'm new to Caffe. I have trained a convolutional neural network whose first layer has 64 feature maps with 7x7 filters; when I get the weights of a filter I get a single 7x7 matrix. However, my second layer has 32 feature maps with 3x3 filters, and when I get the weights of any filter of the second layer I get 64 matrices of 3x3 for that one filter.
Does anybody know why?
TL;DR:
The depth of the filters of a convolutional layer must match the number of channels of that layer's input.
Let's see an example:
So let's say your network receives 3-channel colored images (RGB, for example) with dimensions 128x128 (height and width of 128 pixels) as input. So the input to your first convolution layer (let's call it conv1) would be 3x128x128 (channels x width x height).
Now suppose conv1 has 64 filters of size 7x7. In order to process all values from the input, a single filter must match the number of input channels being fed to that layer (or else some of the channels would not be taken into account during the convolution). So it must also be a 3-channel filter and, in the end, we will have 64 filters of dimension 3x7x7 for conv1.
Conv1 will output maps of dimension 64x128x128 (number of filters x width x height). If this is not clear to you, please check this demo [1].
And then the filters from the next conv layer (conv2) will also have to match the depth of their input, which is now 64 channels. For example, 32 filters of size 64x5x5 (for filters with a spatial dimension of 5x5). And so on...
(For the sake of simplicity, we assumed that we zero-pad the input before convolving. Zero-padding is that "border" of zeroes we wrap around the input map, and it means the spatial dimensions, i.e. width and height, do not change. If there is no padding, the output is smaller than the input: e.g., for 7x7 filters with an input of size 128x128, the output would end up having size 122x122, since each spatial dimension shrinks by filter_size - 1 = 6.)
[1] CS231n Convolutional Neural Networks for Visual Recognition
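To see these shapes concretely, here is a minimal sketch using PyTorch rather than Caffe (the weight layout is analogous), with the 3-channel example above:

```python
import torch.nn as nn

conv1 = nn.Conv2d(in_channels=3,  out_channels=64, kernel_size=7)
conv2 = nn.Conv2d(in_channels=64, out_channels=32, kernel_size=5)

print(conv1.weight.shape)   # torch.Size([64, 3, 7, 7])  -> 64 filters, each of shape 3x7x7
print(conv2.weight.shape)   # torch.Size([32, 64, 5, 5]) -> 32 filters, each of shape 64x5x5
```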
I have a question regarding the interconnection between two convolutional layers in a CNN. For example, suppose I have an architecture like this:
input: 28 x 28
conv1: 3 x 3 filter, no. of filters : 16
conv2: 3 x 3 filter, no. of filters : 32
After conv1 we get an output of 16 x 28 x 28, assuming the spatial dimensions of the image are not reduced. So we have 16 feature maps. In the next layer, each feature map is connected to the next layer; that is, if we consider each feature map (28 x 28) as a neuron, then each neuron will be connected to all 32 filters, giving a total of (3 x 3 x 16) x 32 parameters. How are these two layers stacked or interconnected? In the case of an artificial neural network we have weights between two layers. Is there something like this in a CNN as well? How is the output of one convolutional layer fed to the next convolutional layer?
The number of parameters of a convolutional layer with n filters of size k×k which comes after f feature maps is
n ⋅ (f ⋅ k ⋅ k + 1)
where the +1 comes from the bias.
Hence each of the n filters is not of shape k×k×1 but of shape k×k×f.
How is the output of one convolutional layer fed to the next convolutional layer?
Just like the input is fed to the first convolutional layer. There is no difference (except the number of feature maps).
Convolution on one input feature map
Image source: https://github.com/vdumoulin/conv_arithmetic
See also: another animation
Multiple input feature maps
It works the same:
The filter has the same depth as the input. Before it was 1, now it is more.
You still slide the filter over all (x, y) positions. For each position, it gives one output (see the sketch below).
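Here is a small NumPy sketch of that idea (sizes chosen to match the example below): one filter with the same depth as the input is slid over all positions, and at every position the element-wise products are summed over all input maps to give a single number.

```python
import numpy as np

f, H, W, k = 16, 28, 28, 3           # input feature maps, spatial size, kernel size
x = np.random.randn(f, H, W)         # stack of 16 input feature maps
filt = np.random.randn(f, k, k)      # ONE filter: same depth as the input
bias = 0.1

# 'valid' convolution: slide the filter over all (x, y) positions; at each
# position, multiply element-wise across all f maps and sum to one number.
out = np.empty((H - k + 1, W - k + 1))
for i in range(H - k + 1):
    for j in range(W - k + 1):
        out[i, j] = np.sum(x[:, i:i+k, j:j+k] * filt) + bias

print(out.shape)                     # (26, 26): one output feature map per filter
```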
Your example
First conv layer: 160 = 16*(3*3+1)
Second conv layer: 4640 = 32*(16*3*3+1)
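These numbers can be double-checked with a quick PyTorch sketch (assuming a single-channel 28x28 input, which is what the 160 figure implies; padding=1 keeps the 28x28 size and does not affect the parameter count):

```python
import torch.nn as nn

conv1 = nn.Conv2d(in_channels=1,  out_channels=16, kernel_size=3, padding=1)
conv2 = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, padding=1)

print(sum(p.numel() for p in conv1.parameters()))   # 160  = 16 * (1*3*3  + 1)
print(sum(p.numel() for p in conv2.parameters()))   # 4640 = 32 * (16*3*3 + 1)
```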