Why is 1x1 conv same as fully connected layer? - deep-learning

I am currently reading the "Network in Network" paper.
In the paper, it is stated that
"the cross channel parametric pooling layer is also equivalent to convolution layer with 1x1 convolution kernel."
My question is, first of all, what exactly does "cross channel parametric pooling layer" mean? Is it just a fully connected layer?
And why is a cross channel parametric pooling layer the same as a 1x1 convolution kernel?
I would be thankful if you could answer both mathematically and with examples.
Please help me~

I haven't read the paper, but I have a fair idea of what this is. First of all:
How is a 1x1 convolution like a fully connected layer?
So we have a feature map with dims (C, H, W), where C is the number of channels, H the height and W the width. I'll call positions in (H, W) "pixels". A 1x1 convolution will consist of C' kernels (C' being the number of output channels of the convolution), each with shape (C, 1, 1). So if we consider any pixel in the input feature map, we can apply a single (C, 1, 1) kernel to it to produce a (1, 1, 1) output. Applying C' different kernels will result in a (C', 1, 1) output. This is equivalent to applying a single fully connected layer to one pixel of the input feature map. Have a look at the following diagram to understand the action of a 1x1 convolution on a single pixel of the input feature map.
The different colors represent different kernels of the convolution, corresponding to different output channels. You can see now how the kernels effectively comprise the weights of a single fully connected layer.
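As a concrete check, here is a minimal sketch (PyTorch is my choice here, not something from the paper) showing that a 1x1 convolution evaluated at one pixel gives the same numbers as a fully connected layer built from the same kernels:

import torch
import torch.nn as nn

C, C_out, H, W = 8, 4, 5, 5                        # example sizes, chosen arbitrarily
x = torch.randn(1, C, H, W)                        # one feature map of shape (C, H, W)

conv = nn.Conv2d(C, C_out, kernel_size=1, bias=False)
fc = nn.Linear(C, C_out, bias=False)
fc.weight.data = conv.weight.data.view(C_out, C)   # reuse the 1x1 kernels as FC weights

pixel = x[0, :, 2, 3]                 # the C-dimensional vector at one pixel
out_conv = conv(x)[0, :, 2, 3]        # what the 1x1 convolution produces at that pixel
out_fc = fc(pixel)                    # what the fully connected layer produces
print(torch.allclose(out_conv, out_fc, atol=1e-6))  # True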
What is cross channel parametric pooling?
This is where I'm going to make a guess I'm 90% certain of (not 100% because I didn't read the paper). This is just an extension of the logic above, from individual pixels to whole feature maps. You're applying a cross-channel aggregation mechanism. The mechanism is parametric because it's not just a simple mean or sum or max; it's a parameterised weighted sum. Also note that the weights are held constant across all pixels (remember, that's how convolution kernels work). So it's essentially the same as applying the weights of a single fully connected layer to the channels of a feature map in order to produce a different set of feature maps. But instead of applying the weights to individual neurons, you are applying them to all the neurons of the feature map at the same time.
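And a similar self-contained sketch for the whole feature map (again PyTorch, chosen just for illustration): the 1x1 kernels form one (C', C) weight matrix, and applying it to every pixel at once is exactly the 1x1 convolution:

import torch
import torch.nn as nn

C, C_out, H, W = 8, 4, 5, 5
x = torch.randn(1, C, H, W)
conv = nn.Conv2d(C, C_out, kernel_size=1, bias=False)

# Cross-channel parametric pooling: one (C', C) weight matrix (the 1x1 kernels),
# applied as a weighted sum over the C channels of every pixel at once.
weights = conv.weight.view(C_out, C)
pooled = torch.einsum('oc,chw->ohw', weights, x[0])
print(torch.allclose(conv(x)[0], pooled, atol=1e-6))  # True: identical to the 1x1 conv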

Related

Cost of back-propagation for subset of DNN parameters

I am using PyTorch to evaluate the gradients of a feed-forward network, but only for a subset of the parameters, those belonging to the first two layers.
Since backpropagation is carried out backwards layer by layer, I wonder: why is this computationally faster than evaluating the gradients of the whole network?
PyTorch builds a computation graph for backward propagation that only contains the minimum nodes and edges needed to get the accumulated gradients for the leaves that require gradients. Even if the first two layers require gradients, there are many tensors (intermediate tensors or frozen parameter tensors) that are unused, and these are cut from the backward graph. In addition, the built-in AccumulateGrad function that stores the gradients in the .grad attribute is called fewer times, which also reduces the total computation time.
Here you can see an example for an "AddBackward" node where, for instance, A is an intermediate tensor computed from the first two layers and B is the 3rd (constant) layer, which can be ignored.
Another example: a matrix-matrix product ("MmBackward" node) that uses an intermediate tensor which does not depend on the first two layers. In this case the tensor itself is required to compute the backprop, but the "previous" tensors that were used to compute it can be ignored in the graph.
To visualize the sub-graph that is actually computed (and compare it with the graph when the model is unfrozen), you can use torchviz.
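A small illustration of the point above (the three-layer network here is made up for the example): freezing the last layer means its AccumulateGrad node is never called, while gradients still flow through it to reach the first two layers:

import torch
import torch.nn as nn

# Made-up three-layer network; we only want gradients for the first two layers.
model = nn.Sequential(
    nn.Linear(10, 20), nn.ReLU(),
    nn.Linear(20, 20), nn.ReLU(),
    nn.Linear(20, 5),
)
for p in model[4].parameters():      # freeze the last layer's weight and bias
    p.requires_grad_(False)

loss = model(torch.randn(32, 10)).sum()
loss.backward()                      # backprop still passes *through* the last layer,
                                     # but no gradient is accumulated for its parameters

print(model[0].weight.grad is not None)  # True: gradient stored for the first layer
print(model[4].weight.grad)              # None: frozen layer, no AccumulateGrad call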

Determining the values of the filter matrices in a CNN

I am getting started with deep learning and have a basic question on CNN's.
I understand how gradients are adjusted using backpropagation according to a loss function.
But I thought the values of the convolving filter matrices (in CNNs) needed to be determined by us.
I'm using Keras and this is how (from a tutorial) the convolution layer was defined:
classifier = Sequential()
classifier.add(Conv2D(32, (3, 3), input_shape = (64, 64, 3), activation = 'relu'))
There are 32 filter matrices with dimensions 3x3.
But how are the values for these 32 3x3 matrices determined?
It's not the gradients that are adjusted; the gradient calculated with the backpropagation algorithm is just the set of partial derivatives with respect to each weight in the network, and these components are in turn used to adjust the network weights in order to minimize the loss.
Take a look at this introductory guide.
The weights in the convolution layer in your example will be initialized to random values (according to a specific method), and then tweaked during training, using the gradient at each iteration to adjust each individual weight. Same goes for weights in a fully connected layer, or any other layer with weights.
EDIT: I'm adding some more details about the answer above.
Let's say you have a neural network with a single layer, which has some weights W. Now, during the forward pass, you calculate your output yHat for your network, compare it with your expected output y for your training samples, and compute some cost C (for example, using the quadratic cost function).
Now, you're interested in making the network more accurate, i.e. you'd like to minimize C as much as possible. Imagine you want to find the minimum value of a simple function like f(x) = x^2. You can start at some random point (as you did with your network), then compute the slope of the function at that point (i.e. the derivative) and move in the downhill direction, until you reach a minimum value (a local minimum at least).
With a neural network it's the same idea, with the difference that your inputs are fixed (the training samples), and you can see your cost function C as having n variables, where n is the number of weights in your network. To minimize C, you need the slope of the cost function C in each direction (i.e. with respect to each variable, each weight w), and that vector of partial derivatives is the gradient.
Once you have the gradient, the part where you "move a bit following the slope" is the weights update part, where you update each network weight according to its partial derivative (in general, you subtract some learning rate multiplied by the partial derivative with respect to that weight).
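As a toy illustration of "moving a bit following the slope" for f(x) = x^2:

# Gradient descent on f(x) = x**2, starting from an arbitrary point.
x = 5.0
learning_rate = 0.1
for _ in range(100):
    gradient = 2 * x              # derivative of x**2 at the current point
    x = x - learning_rate * gradient
print(x)                          # very close to 0, the minimum of f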
A trained network is just a network whose weights have been adjusted over many iterations in such a way that the value of the cost function C over the training dataset is as small as possible.
This is the same for a convolutional layer too: you first initialize the weights at random (i.e. you place yourself at a random position on the plot of the cost function C), then compute the gradients, then "move downhill", i.e. you adjust each weight following the gradient in order to minimize C.
The only difference between a fully connected layer and a convolutional layer is how they calculate their outputs, and how the gradient is in turn computed, but the part where you update each weight with the gradient is the same for every weight in the network.
So, to answer your question, those filters in the convolutional kernels are initially random and are later adjusted with the backpropagation algorithm, as described above.
Hope this helps!
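If you want to see this with the layer from your example, here is a quick sketch (the Flatten/Dense head and the random data are only there so the model can take a training step; they are not from your tutorial):

from keras.models import Sequential
from keras.layers import Conv2D, Flatten, Dense
import numpy as np

model = Sequential()
model.add(Conv2D(32, (3, 3), input_shape=(64, 64, 3), activation='relu'))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='sgd', loss='binary_crossentropy')

before = model.layers[0].get_weights()[0].copy()   # shape (3, 3, 3, 32): random initial filters
x = np.random.rand(8, 64, 64, 3)                   # dummy images, just for the demo
y = np.random.randint(0, 2, size=(8, 1))
model.fit(x, y, epochs=1, verbose=0)
after = model.layers[0].get_weights()[0]

print(np.allclose(before, after))                  # False: the filters were adjusted by the gradient step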
Sergio0694 states, "The weights in the convolution layer in your example will be initialized to random values". So if they are random and, say, I want 10 filters, every execution of the algorithm could find different filters. Also, say I have the MNIST data set. Digits are formed of edges and curves. Is it guaranteed that there will be an edge filter or a curve filter among the 10?
I mean, are the first 10 filters the most meaningful, most distinctive filters we can find?
best

Why are my Keras Conv2D kernels 3-dimensional?

In a typical CNN, a conv layer will have Y filters of size NxM, and thus it has N x M x Y trainable parameters (not including bias).
Accordingly, in the following simple keras model, I expect the second conv layer to have 16 kernels of size (7x7), and thus kernel weights of size (7x7x16). Why then are its weights actually size (7x7x8x16)?
I understand the mechanics of what is happening: the Conv2D layers are actually doing a 3D convolution, treating the output maps of the previous layer as channels. It has 16 3D kernels of size (7x7x8). What I don't understand is:
Why is this Keras's default behavior?
how do I get a "traditional" convolutional layer without dropping down into the low-level API (avoiding that is my reason for using Keras in the first place)?
from keras.models import Sequential
from keras.layers import InputLayer, Conv2D

model = Sequential([
    InputLayer((101, 101, 1)),
    Conv2D(8, (11, 11)),
    Conv2D(16, (7, 7))
])
model.weights
Q1: "...and thus kernel weights of size (7x7x16). Why then are its weights actually size (7x7x8x16)?"
No, the kernel weights are not of size (7x7x16).
from cs231n:
Example 2. Suppose an input volume had size [16x16x20]. Then using an example receptive field size of 3x3, every neuron in the Conv Layer would now have a total of 3*3*20 = 180 connections to the input volume. Notice that, again, the connectivity is local in space (e.g. 3x3), but full along the input depth (20).
Note the word 'every'.
In your model, 7x7 is the spatial size of a single filter, and it connects to the previous conv layer across all of its 8 channels, so a single filter has 7x7x8 parameters; you have 16 such filters, so the total number of parameters is 7x7x8x16.
Q2: why is this Keras's default behavior?
See Q1.
In the typical jargon, when someone refers to a conv layer with N kernels of size (x, y), it is implied that the kernels actually have size (x, y, z), where z is the depth of the input volume to that layer.
Imagine what happens when the input image to the network has R, G, and B channels: each of the initial kernels itself has 3 channels. Subsequent layers are the same, treating the input volume as a multi-channel image, where the channels are now maps of some other feature.
The motion of that 3D kernel as it "sweeps" across the input is only 2D, so it is still referred to as a 2D convolution, and the output of that convolution is a 2D feature map.
Edit:
I found a good quote about this in a recent paper, https://arxiv.org/pdf/1809.02601v1.pdf
"In a convolutional layer, the input feature map X is a W1 × H1 × D1 cube, with W1, H1 and D1 indicating its width, height and depth (also referred to as the number of channels), respectively. The output feature map, similarly, is a cube Z with W2 × H2 × D2 entries. The convolution Z = f(X) is parameterized by D2 convolutional kernels, each of which is a S × S × D1 cube."

Dimensions issues with deep learning model

I seem to have some problems understanding how the model described in this paper has been designed.
This is what is written about the model dimensions:
...In these experiments we used one convolution ply, one pooling ply and two fully connected hidden layers on the top. The fully connected layers had 1000 units in each. The convolution and pooling parameters were: pooling size of 6, shift size of 2, filter size of 8, 150 feature maps for FWS...
So, according to the above, does the model consist of
Input
Convolution
Pooling
Input being the 150 feature maps (each with shape (8, 3)),
Convolution being 1D, as the kernel size is 8,
and pooling with size 6 and stride 2.
What I expected as output would be a shape of (1, number of filters), but what I get is (14, number of filters).
I understand why I get this, but I don't understand how the paper suggests it can give an output shape of (1, number of filters).
When using 100 filters I get these output shapes from each layer:
convolution1d gives me (33, 100)
pooling gives (14, 100).
Why I expect the output to be 1 instead of 14: the model is supposed to recognise phones; it takes in 50 frames (150 including deltas) as input, these being context frames, meaning that they are used as support to detect one single frame... That is usually why context windows are used.
As I understand from your question, the shape (14, number of filters) comes out after the pooling layer. That is expected.
What you have to do is flatten the result into a single vector before feeding it to the two fully connected layers.
Marcin Morzejko's answer to my question here would help.
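A minimal sketch of that (the input length of 40 and the 3 input channels are assumptions chosen to reproduce the (33, 100) and (14, 100) shapes you quoted):

from keras.models import Sequential
from keras.layers import Conv1D, MaxPooling1D, Flatten, Dense

model = Sequential([
    Conv1D(100, 8, input_shape=(40, 3)),   # -> (33, 100), as in your numbers
    MaxPooling1D(pool_size=6, strides=2),  # -> (14, 100)
    Flatten(),                             # -> a single 1400-dimensional vector
    Dense(1000, activation='relu'),
    Dense(1000, activation='relu'),
])
model.summary()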

What is the purpose of the ROI layer in a Fast R-CNN?

In this tutorial about object detection, the fast R-CNN is mentioned. The ROI (region of interest) layer is also mentioned.
What is happening, mathematically, when region proposals get resized according to final convolution layer activation functions (in each cell)?
Region-of-Interest(RoI) Pooling:
It is a type of pooling layer which performs max pooling on inputs (here, convnet feature maps) of non-uniform sizes and produces a small feature map of fixed size (say 7x7). The choice of this fixed size is a network hyper-parameter and is predefined.
The main purpose of doing such a pooling is to speed up the training and test time and also to train the whole system from end-to-end (in a joint manner).
It's because of this pooling layer that the training and test time is faster compared to the original (vanilla) R-CNN architecture, hence the name Fast R-CNN.
Simple example (from Region of interest pooling explained by deepsense.io):
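That post illustrates the operation with pictures; as a rough code sketch of the same thing (using torchvision's roi_pool, with a made-up feature map and one made-up proposal):

import torch
from torchvision.ops import roi_pool

# Max-pool a region of a feature map into a fixed 7x7 output, as in Fast R-CNN.
feature_map = torch.randn(1, 256, 32, 32)        # (batch, channels, H, W) from a convnet
# One proposal in (batch_index, x1, y1, x2, y2) format, in feature-map coordinates.
rois = torch.tensor([[0, 4.0, 4.0, 20.0, 28.0]])
pooled = roi_pool(feature_map, rois, output_size=(7, 7))
print(pooled.shape)  # torch.Size([1, 256, 7, 7]), regardless of the proposal's size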
The ROI (region of interest) layer was introduced in Fast R-CNN and is a special case of the spatial pyramid pooling layer introduced in Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. The main function of the ROI layer is to reshape inputs of arbitrary size into a fixed-length output, because of the size constraint of the fully connected layers.
How the ROI layer works is shown below:
In that image, an input of arbitrary size is fed into this layer, which has 3 different windows: 4x4 (blue), 2x2 (green) and 1x1 (gray), producing outputs with fixed sizes of 16 x F, 4 x F and 1 x F, respectively, where F is the number of filters. Those outputs are then concatenated into a vector to be fed to the fully connected layer.
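A rough sketch of that pyramid of windows (the channel count and spatial size below are made up; PyTorch's adaptive max pooling stands in for the fixed-grid pooling):

import torch
from torch.nn.functional import adaptive_max_pool2d

feature_map = torch.randn(1, 64, 23, 37)   # 64 channels (the F above), arbitrary spatial size
levels = [4, 2, 1]                         # the 4x4, 2x2 and 1x1 windows
pooled = [adaptive_max_pool2d(feature_map, (k, k)).flatten(1) for k in levels]
vector = torch.cat(pooled, dim=1)          # fixed length: (16 + 4 + 1) * 64 = 1344
print(vector.shape)                        # torch.Size([1, 1344]), whatever the input size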