Strided convolution & residual connection - deep-learning

I am studying SwishNet, a Fast CNN for Speech, Music and Noise Classification and Segmentation.
In that paper, they use strided convolution and residual connections. After passing through a stride=2 conv layer, the output length is half of the input length.
My question is: how can the output be merged with the input (residual connection) when their array dimensions are mismatched?
G.A. is just a gated activation function, so it does not affect the output dimension!

If you look into the reference paper Conditional Image Generation with PixelCNN Decoders, from which the gated activation (G.A.) is borrowed, you can see that it uses the following formula:
$\mathbf{y} = \tanh(W_{k,f} \ast \mathbf{x}) \odot \sigma(W_{k,g} \ast \mathbf{x})$
Although stride=2 reduces the dimension to half of the input size, a G.A. layer whose weight matrices W have the proper dimensions produces an output with the same dimension as the input, which means no dimension mismatch will occur.
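As a concrete illustration, here is a minimal Keras sketch (my own, with hypothetical layer sizes, not the SwishNet code) of such a PixelCNN-style gated activation. The tanh branch and the sigmoid branch use the same stride, so their shapes always match and the elementwise gate is well-defined; whatever tensor the result is merged with in a residual connection just has to share that shape:

import tensorflow as tf
from tensorflow.keras import layers

x = layers.Input(shape=(64, 16))  # (length, channels), hypothetical sizes
# y = tanh(W_f * x) * sigmoid(W_g * x): both branches strided identically
f = layers.Conv1D(16, 3, strides=2, padding="same", activation="tanh")(x)
g = layers.Conv1D(16, 3, strides=2, padding="same", activation="sigmoid")(x)
y = layers.Multiply()([f, g])     # gated activation output

model = tf.keras.Model(x, y)
print(model.output_shape)         # (None, 32, 16)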

Related

Why is 1x1 conv same as fully connected layer?

I am currently reading the "Network in Network" paper.
And in the paper, it is stated that
"the cross channel parametric pooling layer is also equivalent to convolution layer with
1x1 convolution kernel. "
My question is, first of all, what exactly does the cross channel parametric pooling layer mean? Is it just a fully connected layer?
And why is the cross channel parametric pooling layer the same as a 1x1 convolution kernel?
I would be thankful if you could answer both mathematically and with examples.
Please help me~
I haven't read the paper but I have a fair idea of what this is. First of all
How is a 1x1 convolution like a fully connected layer?
So we have a feature map with dims (C, H, W), where C = (number of channels), H = height, W = width. I'll call positions in (H, W) "pixels". A 1x1 convolution will consist of C' (number of output channels of the convolution) kernels, each with shape (C, 1, 1). So if we consider any pixel in the input feature map, we can apply a single (C, 1, 1) kernel to it to produce a (1, 1, 1) output. Applying C' different kernels will result in a (C', 1, 1) output. This is equivalent to applying a single fully connected layer to one pixel of the input feature map. Have a look at the following diagram to understand the action of a 1x1 convolution on a single pixel of the input feature map.
The different colors represent different kernels of the convolution, corresponding to different output channels. You can see now how the kernels effectively comprise the weights of a single fully connected layer.
What is cross channel parametric pooling?
This is where I'm going to make a guess I'm 90% certain of (not 100% because I didn't read the paper). This is just an extension of the logic above, to whole feature maps rather than individual pixels. You're applying a cross-channel aggregation mechanism. The mechanism is parametric because it's not just a simple mean or sum or max, it's actually a parameterised weighted sum. Also note that the weights are held constant across all pixels (remember, that's how convolution kernels work). So it's essentially the same as applying the weights of a single fully connected layer to the channels of a feature map in order to produce a different set of feature maps. But instead of applying the weights to individual neurons, you are applying them to all the neurons of the feature map at the same time.
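Here is a small numpy check (my own sketch, not from the paper) that a 1x1 convolution over a (C, H, W) feature map gives exactly the same result as applying one fully connected layer C -> C' independently at every pixel:

import numpy as np

C, C_out, H, W = 3, 5, 4, 4
x = np.random.randn(C, H, W)
weights = np.random.randn(C_out, C)        # C' kernels, each of shape (C, 1, 1)
bias = np.random.randn(C_out)

# 1x1 convolution: a weighted sum over channels at each pixel
conv = np.einsum('oc,chw->ohw', weights, x) + bias[:, None, None]

# The same weights applied as a fully connected layer, pixel by pixel
fc = np.empty((C_out, H, W))
for i in range(H):
    for j in range(W):
        fc[:, i, j] = weights @ x[:, i, j] + bias

print(np.allclose(conv, fc))               # True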

caffe softmax with loss layer for semantic segmentation loss calculation

The caffe documentation on the softmax_loss_layer.hpp file seems to be targeted towards classification tasks and not semantic segmentation. However, I have seen this layer being used for the latter.
What would be the dimensions of the input blobs and output blob in the case where you're classifying each pixel (semantic segmentation)?
More importantly, how are the equations for calculating the loss applied to these blobs? That is, how are the matrices/blobs arranged, and what is the equation for the eventual "loss value" that is output?
Thank you.
Edit:
I have referenced this page for understanding concepts of loss equation, just don't know how it's applied to the blobs, which axis, etc.: http://cs231n.github.io/linear-classify/
Here is the documentation from caffe:
Firstly, the input blobs should be of the form data NxKxHxW and label Nx1xHxW, where each value in the label blob is an integer in [0, K-1]. I think there's an error in the caffe documentation where it doesn't consider the case of semantic segmentation, and I'm not sure what K = CHW means. The output blob has shape 1x1x1x1, which is the loss.
Secondly, the loss function is as follows, from softmax_loss_layer.cpp:
loss -= log(std::max(prob_data[i * dim + label_value * inner_num_ + j], Dtype(FLT_MIN)));
Breaking that line down (for semantic segmentation):
std::max is just a guard against invalid input, e.g. taking the log of 0 (prob_data is clamped to at least FLT_MIN)
prob_data is the output from the softmax; as explained in the caffe tutorials, the softmax loss layer can be decomposed into a softmax layer followed by a multinomial logistic loss
i * dim selects the i-th image in your batch, where the batch shape is NxKxHxW and K is the number of classes
label_value * inner_num_ selects the plane of the ground-truth class, because at this stage each one of your classes has its own "image" of probabilities, so to speak
Finally, j is the index of the pixel within that plane
Basically, you want prob_data[i * dim + label_value * inner_num_ + j] for each pixel to be as close to 1 as possible. This means that the negative log of that will be close to 0. Here the log is to base e. And then you do the stochastic gradient descent for that loss.
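To make the indexing concrete, here is a Python re-creation of that line (my own sketch; the variable names mirror the caffe ones and the shapes are hypothetical):

import numpy as np

N, K, H, W = 2, 3, 4, 5
prob = np.random.rand(N, K, H, W)
prob /= prob.sum(axis=1, keepdims=True)    # softmax output: per-pixel distributions
label = np.random.randint(0, K, size=(N, H, W))

inner_num = H * W                          # pixels per image
dim = K * inner_num                        # values per image
prob_flat = prob.ravel()

loss = 0.0
for i in range(N):                         # i-th image in the batch
    for j in range(inner_num):             # j-th pixel
        label_value = label.reshape(N, inner_num)[i, j]
        p = prob_flat[i * dim + label_value * inner_num + j]
        loss -= np.log(max(p, np.finfo(np.float32).tiny))   # FLT_MIN guard
loss /= N * inner_num                      # normalize over all labelled pixels

# The same thing, vectorized
loss_vec = -np.log(np.take_along_axis(prob, label[:, None], axis=1)).mean()
print(np.allclose(loss, loss_vec))         # True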

Does a Convolutional Layer Have an Exact Inverse

...and if so under what circumstances?
A Convolutional Layer usually yields an output of lesser size. Is it possible to reverse/invert such an operation by flipping/transposing the used kernel and providing padding or likewise?
Just looking at the convolutional layer's operation here - without pooling layers, concatenation, non-linear activation functions etc.
I'm not looking for any of the several trainable versions of reverse convolutional operations. Such can be achieved by strides $\geq 1$ in the output space or intrinsic padding in the input space for example. Vincent Dumoulin and Francesco Visin provide very elucidating, animated gifs on their github page. And the Deep Learning community is divided over the naming of these operations: Transpose convolution, fractionally strided convolution and deconvolution are all used (the latter, although widely used, is very misleading since it's no proper mathematical deconvolution).
I believe this is where the difference between a transposed convolution and a deconvolution is essential.
A deconvolution is the mathematical inverse of what a convolution does, whereas a transposed convolution reverses only the spatial transformation between input and output. Meaning, if you want to reverse the changes concerning the shape of the output, a transposed convolution will do the job, but it will not be the mathematical inverse concerning the values it produces. I wrote a few words about this topic in an article.
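A quick sketch of that distinction (my own illustration, with random weights): a Conv2DTranspose undoes the shape change of a Conv2D, but composing the two does not give back the original values, so it is not an inverse in the mathematical sense:

import numpy as np
from tensorflow.keras import layers

x = np.random.randn(1, 8, 8, 1).astype("float32")

conv = layers.Conv2D(1, 3, padding="valid")             # 8x8 -> 6x6
deconv = layers.Conv2DTranspose(1, 3, padding="valid")  # 6x6 -> 8x8

y = conv(x)
x_back = deconv(y)

print(y.shape, x_back.shape)   # (1, 6, 6, 1) (1, 8, 8, 1): the shape is restored
print(np.allclose(x, x_back))  # False: the values are not recovered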
Well, the deconvolution is defined very clearly.
In convolution, you multiply the kernel with a patch of the input frame (like a vector multiplication), sum it, and ASSIGN the value to the output.
In deconvolution, you take one output value, multiply it with the kernel, quasi highlighting the points that influenced that output, and ADD it to the former input layer (which of course must be filled with zeros at the beginning). This produces a layer with the same shape as the "input" of the forward pass.
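A 1-D numpy sketch of that procedure (my own, assuming stride 1 and no padding): each output value is multiplied by the kernel and added back into an input-sized buffer at the positions that produced it:

import numpy as np

x = np.array([1., 2., 3., 4., 5.])
k = np.array([0.5, 1.0, -0.5])
n = len(k)

# Forward pass: multiply, sum, ASSIGN
out = np.array([np.dot(x[i:i + n], k) for i in range(len(x) - n + 1)])

# "Deconvolution" as described: multiply each output value by the kernel
# and ADD it into the former input layer, initialized with zeros
back = np.zeros_like(x)
for i, v in enumerate(out):
    back[i:i + n] += v * k

print(out.shape, back.shape)   # (3,) (5,): back has the same shape as the input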

Determining the values of the filter matrices in a CNN

I am getting started with deep learning and have a basic question on CNN's.
I understand how gradients are adjusted using backpropagation according to a loss function.
But I thought the values of the convolving filter matrices (in CNNs) need to be determined by us.
I'm using Keras and this is how (from a tutorial) the convolution layer was defined:
classifier = Sequential()
classifier.add(Conv2D(32, (3, 3), input_shape = (64, 64, 3), activation = 'relu'))
There are 32 filter matrices with dimensions 3x3 used here.
But how are the values for these 32 3x3 matrices determined?
It's not the gradients that are adjusted; the gradient calculated with the backpropagation algorithm is just the set of partial derivatives with respect to each weight in the network, and these components are in turn used to adjust the network weights in order to minimize the loss.
Take a look at this introductory guide.
The weights in the convolution layer in your example will be initialized to random values (according to a specific method), and then tweaked during training, using the gradient at each iteration to adjust each individual weight. The same goes for weights in a fully connected layer, or any other layer with weights.
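For instance, reusing the exact layer from the question, you can inspect those initial values yourself; this is just a small check I am adding, and the printed numbers will differ on every run because they come from the random initializer (glorot_uniform by default in Keras):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D

classifier = Sequential()
classifier.add(Conv2D(32, (3, 3), input_shape=(64, 64, 3), activation='relu'))

kernels, biases = classifier.layers[0].get_weights()
print(kernels.shape)        # (3, 3, 3, 32): 32 filters of shape 3x3 over 3 input channels
print(kernels[:, :, 0, 0])  # one 3x3 slice: random values, not set by hand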
EDIT: I'm adding some more details about the answer above.
Let's say you have a neural network with a single layer, which has some weights W. Now, during the forward pass, you calculate your output yHat for your network, compare it with your expected output y for your training samples, and compute some cost C (for example, using the quadratic cost function).
Now, you're interested in making the network more accurate, ie. you'd like to minimize C as much as possible. Imagine you want to find the minimum value of a simple function like f(x)=x^2. You can start at some random point (as you did with your network), then compute the slope of the function at that point (ie, the derivative) and move in that downhill direction, until you reach a minimum value (a local minimum at least).
With a neural network it's the same idea, with the difference that your inputs are fixed (the training samples), and you can see your cost function C as having n variables, where n is the number of weights in your network. To minimize C, you need the slope of the cost function C in each direction (ie. with respect to each variable, each weight w), and that vector of partial derivatives is the gradient.
Once you have the gradient, the part where you "move a bit following the slope" is the weights update part, where you update each network weight according to its partial derivative (in general, you subtract some learning rate multiplied by the partial derivative with respect to that weight).
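A tiny sketch of that idea on the f(x)=x^2 example from above: start at a random point and repeatedly step against the slope:

import random

x = random.uniform(-10, 10)   # random starting point, like random weight initialization
lr = 0.1                      # learning rate

for _ in range(100):
    grad = 2 * x              # derivative of x^2 at the current point
    x -= lr * grad            # update: subtract learning rate times the gradient

print(x)                      # very close to 0, the minimum of f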
A trained network is just a network whose weights have been adjusted over many iterations in such a way that the value of the cost function C over the training dataset is as small as possible.
This is the same for a convolutional layer too: you first initialize the weights at random (ie. you place yourself on a random position on the plot for the cost function C), then compute the gradients, then "move downhill", ie. you adjust each weight following the gradient in order to minimize C.
The only difference between a fully connected layer and a convolutional layer is how they calculate their outputs, and how the gradient is in turn computed, but the part where you update each weight with the gradient is the same for every weight in the network.
So, to answer your question, those filters in the convolutional kernels are initially random and are later adjusted with the backpropagation algorithm, as described above.
Hope this helps!
Sergio0694 states, "The weights in the convolution layer in your example will be initialized to random values". So if they are random and, say, I want 10 filters, every execution of the algorithm could find different filters. Also, say I have the MNIST data set. Digits are formed of edges and curves. Is it guaranteed that there will be an edge filter or a curve filter among the 10?
I mean, are the first 10 filters the most meaningful, most distinctive filters we can find?
Best

What is the purpose of the ROI layer in a Fast R-CNN?

In this tutorial about object detection, the fast R-CNN is mentioned. The ROI (region of interest) layer is also mentioned.
What is happening, mathematically, when region proposals get resized according to final convolution layer activation functions (in each cell)?
Region-of-Interest(RoI) Pooling:
It is a type of pooling layer which performs max pooling on inputs (here, convnet feature maps) of non-uniform sizes and produces a small feature map of fixed size (say 7x7). The choice of this fixed size is a network hyper-parameter and is predefined.
The main purpose of doing such pooling is to speed up training and test time and also to train the whole system end-to-end (in a joint manner).
It is because of this pooling layer that training and test time are faster compared to the original (vanilla) R-CNN architecture, hence the name Fast R-CNN.
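Here is a minimal numpy sketch of RoI max pooling (my own illustration, single channel, hypothetical sizes): a region of arbitrary size is divided into a fixed grid of bins and each bin is max-pooled, so the output size is always the same:

import numpy as np

feature_map = np.random.randn(32, 40)   # one channel of a convnet feature map
x0, y0, x1, y1 = 4, 6, 19, 27           # a region proposal, in feature-map coordinates
out_h, out_w = 7, 7                     # fixed output size (a hyper-parameter)

roi = feature_map[y0:y1, x0:x1]
ys = np.linspace(0, roi.shape[0], out_h + 1).astype(int)
xs = np.linspace(0, roi.shape[1], out_w + 1).astype(int)

pooled = np.empty((out_h, out_w))
for i in range(out_h):
    for j in range(out_w):
        pooled[i, j] = roi[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max()

print(pooled.shape)                     # (7, 7), whatever the RoI size was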
A simple example can be found in "Region of interest pooling explained" by deepsense.io.
The ROI (region of interest) layer was introduced in Fast R-CNN and is a special case of the spatial pyramid pooling layer introduced in Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. The main function of the ROI layer is to reshape inputs of arbitrary size into a fixed-length output, because of the size constraint of fully connected layers.
How the ROI layer works is shown below:
In this image, an input of arbitrary size is fed into this layer, which has 3 different pooling grids: 4x4 (blue), 2x2 (green), and 1x1 (gray), producing outputs with fixed sizes of 16 x F, 4 x F, and 1 x F, respectively, where F is the number of filters. Those outputs are then concatenated into a vector to be fed to the fully connected layer.
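Since the image itself is not reproduced here, the following numpy sketch (my own, with hypothetical sizes) mimics the described pyramid: 4x4, 2x2 and 1x1 max-pooling grids are applied to each of the F filter maps and the results are concatenated into one fixed-length vector:

import numpy as np

def grid_max_pool(fmap, bins):
    # Max-pool a single (H, W) map into a bins x bins grid
    H, W = fmap.shape
    ys = np.linspace(0, H, bins + 1).astype(int)
    xs = np.linspace(0, W, bins + 1).astype(int)
    return np.array([[fmap[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max()
                      for j in range(bins)] for i in range(bins)])

F, H, W = 8, 13, 21                     # F filter maps with arbitrary spatial size
feature_maps = np.random.randn(F, H, W)

vector = np.concatenate([grid_max_pool(fm, b).ravel()
                         for b in (4, 2, 1)
                         for fm in feature_maps])
print(vector.shape)                     # ((16 + 4 + 1) * F,) = (168,), independent of H and W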