I seem to have some problems understanding how the model described in this paper has been designed.
This is what is written about the model dimensions:
...In these experiments we used one convolution ply, one pooling ply and two fully connected hidden layers on the top. The fully connected layers had 1000 units in each. The convolution and pooling parameters were: pooling size of 6, shift size of 2, filter size of 8, 150 feature maps for FWS...
So, according to the above, does the model consist of
Input
Convolution
Pooling
with the input being the 150 feature maps (each with shape (8, 3)), the convolution being 1D since the kernel size is 8, and the pooling having size 6 and stride 2?
What I expected as output was a shape of (1, number of filters), but what I get is (14, number of filters).
I understand why I get that, but I don't understand how the paper suggests this can give an output shape of (1, number of filters).
When using 100 filters I get these outputs from each layer:
convolution1d gives me (33, 100)
pooling gives (14, 100)
This is why I expect the output to be 1 instead of 14: the model is supposed to recognise phones. It takes in 50 frames (150 including deltas) as input, these being context frames, meaning that they are used as support to detect one single frame. That is usually why context windows are used.
As I understand from your question, the shape (14, number of filters) comes out after the pooling layer. That is expected.
What you have to do is flatten the result into a single vector before feeding it to the two fully connected layers.
Marcin Morzejko's answer to my question here should help.
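For example, here is a minimal Keras sketch. The input length (40) and the 100 filters are just placeholders chosen to reproduce the (33, 100) and (14, 100) shapes you quoted, and I'm assuming max pooling:

```python
import tensorflow as tf

# Sketch only: sizes are placeholders, not the paper's exact configuration.
inputs = tf.keras.Input(shape=(40, 150))                                    # 150 feature maps per frame
x = tf.keras.layers.Conv1D(100, kernel_size=8, activation='relu')(inputs)   # -> (33, 100)
x = tf.keras.layers.MaxPooling1D(pool_size=6, strides=2)(x)                 # -> (14, 100)
x = tf.keras.layers.Flatten()(x)                                            # -> (1400,), a single vector
x = tf.keras.layers.Dense(1000, activation='relu')(x)
outputs = tf.keras.layers.Dense(1000, activation='relu')(x)
model = tf.keras.Model(inputs, outputs)
model.summary()
```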
I was working on segmentation using U-Net; it's a multi-class segmentation problem with 21 classes.
Ideally we go with softmax as the activation in the last layer, which contains 21 kernels, so that the output depth is 21 and matches the number of classes.
But my question is: if we use softmax as the activation in this layer, how will it work? Since softmax is applied to each feature map and by its nature gives probabilities that sum to 1, but we need 1's in all places where the corresponding class is present in the feature map.
Or is the softmax applied depth-wise, i.e. taking the 21 class values at each pixel in depth and applying it on top of those?
I hope I have explained the problem properly.
I have tried sigmoid as the activation, and the result is not good.
If I understand correctly, you have 21 kernels that produce feature maps of some shape m*n. So if you reshape your final layer to have a shape of (batch_size, 21, m*n), then you can apply softmax along the first dimension (the 21 classes). Then every value within a single kernel should be the same, and you can take the kernel with the max value.
In this case, you'll find the feature map that has the best overall overlap with the region of interest, rather than finding which part of every feature map overlaps with the ROI, if any.
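As a small TensorFlow sketch of that reshape-then-softmax step (the shapes here are made up):

```python
import tensorflow as tf

# Hypothetical logits from the last conv layer: (batch, H, W, 21 classes)
logits = tf.random.normal((2, 64, 64, 21))

# Put the 21 class maps on one axis: (batch, 21, H*W)
b, h, w, n_classes = logits.shape
reshaped = tf.reshape(tf.transpose(logits, [0, 3, 1, 2]), (b, n_classes, h * w))

# Softmax along the class axis: at each position the 21 values now sum to 1
probs = tf.nn.softmax(reshaped, axis=1)
pred = tf.argmax(probs, axis=1)   # (batch, H*W): the class with the max value per position
```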
I am currently reading the 'Network in Network' paper.
In the paper, it is stated that
"the cross channel parametric pooling layer is also equivalent to convolution layer with 1x1 convolution kernel."
My question is, first of all, what exactly does the cross channel parametric pooling layer mean? Is it just a fully connected layer?
And why is the cross channel parametric pooling layer the same as a 1x1 convolution kernel?
I would be thankful if you could answer both mathematically and with examples.
Please help me~
I haven't read the paper, but I have a fair idea of what this is. First of all:
How is a 1x1 convolution like a fully connected layer?
So we have a feature map with dims (C, H, W), where C = (number of channels), H = height, W = width. I'll call positions in (H, W) "pixels". A 1x1 convolution will consist of C' (number of output channels of the convolution) kernels, each with shape (C, 1, 1). So if we consider any pixel in the input feature map, we can apply a single (C, 1, 1) kernel to it to produce a (1, 1, 1) output. Applying C' different kernels will result in a (C', 1, 1) output. This is equivalent to applying a single fully connected layer to one pixel of the input feature map. Have a look at the following diagram to understand the action of a 1x1 convolution applied to a single pixel of the input feature map:
The different colors represent different kernels of the convolution, corresponding to different output channels. You can see now how the kernels effectively comprise the weights of a single fully connected layer.
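You can also check the equivalence numerically. Here is a small NumPy sketch (all sizes are arbitrary):

```python
import numpy as np

C, H, W, C_out = 3, 4, 4, 5            # arbitrary sizes
x = np.random.randn(C, H, W)            # input feature map (channels first)
kernels = np.random.randn(C_out, C)     # C_out kernels, each of shape (C, 1, 1), flattened to (C,)

# 1x1 convolution: apply the same (C_out, C) weight matrix at every pixel
conv_out = np.einsum('oc,chw->ohw', kernels, x)

# Fully connected layer applied independently to each pixel's channel vector
fc_out = np.stack([[kernels @ x[:, i, j] for j in range(W)] for i in range(H)])  # (H, W, C_out)
fc_out = np.transpose(fc_out, (2, 0, 1))                                          # channels first again

assert np.allclose(conv_out, fc_out)    # identical results
```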
What is cross channel parametric pooling?
This is where I'm going to make a guess I'm 90% certain of (not 100% because I didn't read the paper). This is just an extension of the logic above, from individual pixels to whole feature maps. You're applying a cross-channel aggregation mechanism. The mechanism is parametric because it's not just a simple mean or sum or max; it's actually a parameterised weighted sum. Also note that the weights are held constant across all pixels (remember, that's how convolution kernels work). So it's essentially the same as applying the weights of a single fully connected layer to the channels of a feature map in order to produce a different set of feature maps. But instead of applying the weights to individual neurons, you are applying them to all the neurons of the feature map at the same time:
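In Keras terms, cross channel parametric pooling would then just look like this (a sketch, with made-up sizes):

```python
import tensorflow as tf

x = tf.random.normal((1, 8, 8, 64))                        # (batch, H, W, C) input feature maps
cccp = tf.keras.layers.Conv2D(filters=32, kernel_size=1)   # a learned weighted sum across the 64 channels
y = cccp(x)
print(y.shape)                                             # (1, 8, 8, 32): new feature maps, same spatial size
```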
I am currently trying to understand the following paper: https://arxiv.org/pdf/1703.08581.pdf. I am struggling to understand a part about how a convolution is performed on an input of log mel filterbank features:
We train seq2seq models for both end-to-end speech translation, and a baseline model for speech recognition. We found that the same architecture, a variation of that from [10], works well for both tasks. We use 80 channel log mel filterbank features extracted from 25ms windows with a hop size of 10ms, stacked with delta and delta-delta features. The output softmax of all models predicts one of 90 symbols, described in detail in Section 4, that includes English and Spanish lowercase letters.
The encoder is composed of a total of 8 layers. The input features are organized as a T × 80 × 3 tensor, i.e. raw features, deltas, and delta-deltas are concatenated along the 'depth' dimension. This is passed into a stack of two convolutional layers with ReLU activations, each consisting of 32 kernels with shape 3 × 3 × depth in time × frequency. These are both strided by 2 × 2, downsampling the sequence in time by a total factor of 4, decreasing the computation performed in the following layers. Batch normalization [26] is applied after each layer.
As I understand it, the input to the convolutional layer is 3-dimensional (number of 25 ms windows (T) x 80 (features for each window) x 3 (features, delta features and delta-delta features)). However, the kernels used on those inputs seem to have 4 dimensions and I do not understand why that is. Wouldn't a 4-dimensional kernel need a 4-dimensional input? In my head, the input has the same dimensions as an RGB picture: width (time) x height (frequency) x color channels (features, delta features and delta-delta features). Therefore I would think of a kernel for a 2D convolution as a filter of size a (filter width) x b (filter height) x 3 (depth of the input). Am I missing something here? What is wrong with my idea, or what is done differently in this paper?
Thanks in advance for your answer!
I figured it out; it turns out it was just a misunderstanding on my side: the authors are using 32 kernels of shape 3x3, which results (after two layers with 2x2 striding) in an output of shape (T/4) x 20 x 32, where T stands for the time dimension.
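In case it helps anyone else, here is a rough Keras sketch of how I now picture that front-end. T = 100 is just an example length, and this is only my reading of the quoted description, not the authors' code:

```python
import tensorflow as tf

T = 100
inputs = tf.keras.Input(shape=(T, 80, 3))   # time x frequency x (features, deltas, delta-deltas)

x = tf.keras.layers.Conv2D(32, kernel_size=3, strides=2, padding='same', activation='relu')(inputs)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.Conv2D(32, kernel_size=3, strides=2, padding='same', activation='relu')(x)
x = tf.keras.layers.BatchNormalization()(x)

print(x.shape)   # (None, 25, 20, 32), i.e. (T/4) x 20 x 32
```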
Link to paper
I'm trying to understand the region proposal network in Faster R-CNN. I understand what it's doing, but I still don't understand how exactly training works, especially the details.
Let's assume we're using VGG16's last layer with shape 14x14x512 (before maxpool and with 228x228 images) and k=9 different anchors. At inference time I want to predict 9*2 class labels and 9*4 bounding box coordinates. My intermediate layer is a 512 dimensional vector.
(image shows 256 from ZF network)
In the paper they write
"we randomly sample 256 anchors in an image to compute the loss
function of a mini-batch, where the sampled positive and negative
anchors have a ratio of up to 1:1"
That's the part I'm not sure about. Does this mean that for each one of the 9 (k) anchor types the corresponding classifier and regressor are trained with mini-batches that only contain positive and negative anchors of that type?
So that I basically train k different networks with shared weights in the intermediate layer? Then each mini-batch would consist of the training data x = the 3x3x512 sliding window of the conv feature map and y = the ground truth for that specific anchor type,
and at inference time I would put them all together.
I appreciate your help.
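For concreteness, this is roughly the head I have in mind (just a Keras sketch following my description above, not the paper's code):

```python
import tensorflow as tf

k = 9
feature_map = tf.keras.Input(shape=(14, 14, 512))    # VGG16 conv5_3 output in my example

# 3x3 "sliding window" giving the 512-d intermediate vector at each position
inter = tf.keras.layers.Conv2D(512, 3, padding='same', activation='relu')(feature_map)

# Two sibling 1x1 convolutions: 2k objectness scores and 4k box coordinates per position
cls_scores = tf.keras.layers.Conv2D(2 * k, 1)(inter)     # (14, 14, 18)
bbox_deltas = tf.keras.layers.Conv2D(4 * k, 1)(inter)    # (14, 14, 36)
rpn = tf.keras.Model(feature_map, [cls_scores, bbox_deltas])
```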
Not exactly. From what I understand, the RPN predicts W*H*k bounding boxes per feature map, and then 256 of them are randomly sampled per the 1:1 criterion and used in the computation of the loss function for that particular mini-batch. You're still only training one network, not k, since the 256 random samples are not restricted to any particular anchor type.
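Roughly, the sampling looks like this (a sketch with made-up anchor labels, just to illustrate the idea):

```python
import numpy as np

np.random.seed(0)
# Hypothetical anchor labels: 1 = positive, 0 = negative, -1 = ignored, one per anchor (W*H*k of them)
anchor_labels = np.random.choice([-1, 0, 1], size=14 * 14 * 9, p=[0.7, 0.25, 0.05])

pos_idx = np.flatnonzero(anchor_labels == 1)
neg_idx = np.flatnonzero(anchor_labels == 0)

n_pos = min(len(pos_idx), 128)    # at most half of the 256 samples are positive
n_neg = 256 - n_pos               # the rest are filled with negatives
sampled = np.concatenate([
    np.random.choice(pos_idx, n_pos, replace=False),
    np.random.choice(neg_idx, n_neg, replace=False),
])
# Only these 256 anchors contribute to the loss for this mini-batch; they can
# belong to any of the k anchor types, so it is still a single network being trained.
```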
Disclaimer: I only started learning about CNNs a month ago, so I may not understand what I think I understand.
In this tutorial about object detection, Fast R-CNN is mentioned, and so is the ROI (region of interest) layer.
What is happening, mathematically, when region proposals get resized according to final convolution layer activation functions (in each cell)?
Region of Interest (RoI) Pooling:
It is a type of pooling layer which performs max pooling on inputs (here, convnet feature maps) of non-uniform size and produces a small feature map of fixed size (say 7x7). The choice of this fixed size is a network hyper-parameter and is predefined.
The main purpose of this pooling is to speed up training and test time and also to allow the whole system to be trained end-to-end (in a joint manner).
It is because of this pooling layer that training and test time are faster compared to the original (vanilla) R-CNN architecture, hence the name Fast R-CNN.
Simple example (from Region of interest pooling explained by deepsense.io):
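Here is a tiny NumPy sketch of that idea, pooling a proposal down to a 2x2 grid instead of 7x7 just to keep the numbers small:

```python
import numpy as np

def roi_max_pool(feature_map, roi, output_size=(2, 2)):
    """Max-pool a rectangular region of a feature map into a fixed-size grid.
    feature_map: (H, W) array; roi: (x1, y1, x2, y2) in feature-map coordinates."""
    x1, y1, x2, y2 = roi
    region = feature_map[y1:y2, x1:x2]
    out_h, out_w = output_size
    # Split the region into out_h x out_w roughly equal bins and take the max of each
    h_bins = np.array_split(np.arange(region.shape[0]), out_h)
    w_bins = np.array_split(np.arange(region.shape[1]), out_w)
    return np.array([[region[np.ix_(hb, wb)].max() for wb in w_bins] for hb in h_bins])

# Toy example: an 8x8 feature map and a 6x4 proposal pooled down to a fixed 2x2 output
fmap = np.arange(64).reshape(8, 8)
print(roi_max_pool(fmap, roi=(1, 2, 7, 6)))
```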
The ROI (region of interest) layer was introduced in Fast R-CNN and is a special case of the spatial pyramid pooling layer introduced in Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. The main function of the ROI layer is to reshape inputs of arbitrary size into a fixed-length output, because of the size constraint in the fully connected layers.
How the ROI layer works is shown below:
In this image, the input of arbitrary size is fed into this layer, which has 3 different windows: 4x4 (blue), 2x2 (green) and 1x1 (gray), to produce outputs with fixed sizes of 16 x F, 4 x F, and 1 x F respectively, where F is the number of filters. Then those outputs are concatenated into a vector and fed to a fully connected layer.
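And a toy NumPy sketch of that pyramid (the feature map and sizes are made up; F = 5 here):

```python
import numpy as np

def max_pool_to_grid(fmap, grid):
    # Max-pool an (H, W, F) feature map down to a (grid, grid, F) output
    h_bins = np.array_split(np.arange(fmap.shape[0]), grid)
    w_bins = np.array_split(np.arange(fmap.shape[1]), grid)
    return np.array([[fmap[np.ix_(hb, wb)].max(axis=(0, 1)) for wb in w_bins] for hb in h_bins])

fmap = np.random.randn(13, 9, 5)                                     # arbitrary spatial size, F = 5 filters
pooled = [max_pool_to_grid(fmap, g).reshape(-1) for g in (4, 2, 1)]  # 16*F, 4*F and 1*F values
vector = np.concatenate(pooled)                                      # length (16 + 4 + 1) * F = 105
print(vector.shape)                                                  # this vector is fed to the FC layer
```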