Questions about convolution (in CNN) - deep-learning

I came up with a question about convolution and want to check whether I'm missing something. The question is whether the two operations below are identical.
Case1)
Suppose we have a feature map of size C^2 x H x W and a K x K x C^2 Conv weight with stride S. (To be clear, C^2 is the channel dimension; I just wanted it to be a square number. K is the kernel size.)
Case2)
Suppose we have a feature map of size 1 x CH x CW and a CK x CK x 1 Conv weight with stride CS.
So, basically, Case2 is a pixel-shuffled (upsampled) version of Case1, for both the feature map and the Conv weight. As convolutions are simply element-wise multiplications followed by a sum, both operations seem identical to me.
# given a feature map and a conv weight, namely f_map and conv_weight
# case 1)
convLayer = Conv(conv_weight)
result = convLayer(f_map, stride=1)
# case 2)
f_map = pixelshuffle(f_map, scale=C)
conv_weight = pixelshuffle(conv_weight, scale=C)  # shuffle the weight, not f_map
convLayer = Conv(conv_weight)                     # rebuild the layer with the shuffled weight
result = convLayer(f_map, stride=C)
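The pseudocode above can be checked numerically. The following is a minimal sketch, assuming PyTorch and its `F.pixel_shuffle` channel ordering; the sizes `C`, `K`, `H`, `W` are illustrative choices, not values from the question, and a single output channel is assumed.

```python
import torch
import torch.nn.functional as F

C, K, H, W = 4, 3, 8, 8                      # illustrative sizes; C*C input channels
torch.manual_seed(0)
f_map = torch.rand(1, C * C, H, W, dtype=torch.float64)        # (batch, C^2, H, W)
conv_weight = torch.rand(1, C * C, K, K, dtype=torch.float64)  # (out=1, C^2, K, K)

# Case 1: ordinary K x K convolution over C^2 channels, stride 1.
result1 = F.conv2d(f_map, conv_weight, stride=1)

# Case 2: move the channels into space for both tensors, then stride by C.
f_map_up = F.pixel_shuffle(f_map, C)         # (1, 1, C*H, C*W)
weight_up = F.pixel_shuffle(conv_weight, C)  # (1, 1, C*K, C*K)
result2 = F.conv2d(f_map_up, weight_up, stride=C)

same = torch.allclose(result1, result2)
print(result1.shape, result2.shape, same)
```

Under these assumptions the two results agree, because the stride of C makes the large kernel land only on positions where each shuffled weight lines up with the input channel it came from.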
But this means that, for example, given a 256xHxW feature map with a 3x3 Conv (as in many deep learning models), performing the convolution is the same as applying a HUUUGE 48x48 Conv with stride 16 to a 1 x 16*H x 16*W feature map.
But this doesn't match my basic intuition of CNNs: stacking multiple layers with the smallest 3x3 Conv, resulting in a somewhat large receptive field, with each channel holding different (possibly redundant) information.

You can, in a sense, think of it as "folding" spatial information into the channel dimension. This is the rationale behind ResNet's trade-off between spatial resolution and feature dimension: whenever ResNet downsamples by x2 in space, it increases the number of feature channels by x2. However, since there are two spatial dimensions and both are downsampled by x2, the spatial size shrinks by x4 while the channels grow by only x2, so the "volume" of the feature map is effectively reduced by x0.5.
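The "folding" and the volume arithmetic can be illustrated with a short sketch. It assumes PyTorch's `F.pixel_unshuffle` as the folding operation and uses illustrative sizes (64 channels at 56x56, doubling to 128 channels at 28x28); the helper `volume` is hypothetical.

```python
import torch
import torch.nn.functional as F

x = torch.rand(1, 64, 56, 56)            # (batch, C, H, W), illustrative sizes

# Lossless folding: halve each spatial side, multiply the channels by 4.
folded = F.pixel_unshuffle(x, 2)         # (1, 256, 28, 28)

def volume(c, h, w):
    """Number of values in a C x H x W feature map (hypothetical helper)."""
    return c * h * w

# ResNet-style stage change: halve each spatial side, only double the channels.
ratio = volume(128, 28, 28) / volume(64, 56, 56)
print(folded.shape, ratio)
```

Folding keeps every value, while the ResNet-style change keeps only half as many values as the stage before it.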

Related

Why is 1x1 conv same as fully connected layer?

I am currently reading the "Network in Network" paper.
In the paper, it is stated that
"the cross channel parametric pooling layer is also equivalent to convolution layer with 1x1 convolution kernel."
My first question is: what exactly does "cross channel parametric pooling layer" mean? Is it just a fully connected layer?
And why is a cross channel parametric pooling layer the same as a 1x1 convolution kernel?
I would be grateful if you could answer both mathematically and with examples.
Please help me!
I haven't read the paper, but I have a fair idea of what this is. First of all:
How is a 1x1 convolution like a fully connected layer?
So we have a feature map with dims (C, H, W), where C = (number of channels), H = height, W = width. I'll call positions in (H, W) "pixels". A 1x1 convolution will consist of C' (number of output channels of the convolution) kernels each with shape (C, 1, 1). So if we consider any pixel in the input feature map, we can apply a single (C, 1, 1) kernel to it to produce a (1, 1, 1) output. Applying C' different kernels will result in a (C', 1, 1) output. This is equivalent to applying a single fully connected layer to one pixel of the input feature map. Have a look at the following diagram to understand the action of a 1x1 convolution to a single pixel of the input feature map
The different colors represent different kernels of the convolution, corresponding to different output channels. You can see now how the kernels effectively comprise the weights of a single fully connected layer.
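The per-pixel equivalence described above can be checked with a small sketch. It assumes PyTorch; the sizes `C_in`, `C_out`, `H`, `W` are illustrative, and the weights of the linear layer are copied from the 1x1 convolution so that both compute the same function.

```python
import torch
import torch.nn as nn

C_in, C_out, H, W = 5, 7, 4, 6            # illustrative sizes
torch.manual_seed(0)
conv = nn.Conv2d(C_in, C_out, kernel_size=1)
fc = nn.Linear(C_in, C_out)

with torch.no_grad():
    # Conv weight has shape (C', C, 1, 1); the linear layer expects (C', C).
    fc.weight.copy_(conv.weight.reshape(C_out, C_in))
    fc.bias.copy_(conv.bias)

    x = torch.rand(1, C_in, H, W)
    y_conv = conv(x)                       # (1, C', H, W)
    # Move channels last so the linear layer acts on each pixel's C-vector.
    y_fc = fc(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

matches = torch.allclose(y_conv, y_fc, atol=1e-5)
print(matches)
```

The same `fc` weights are reused at every pixel, which is exactly the weight sharing that a convolution kernel provides.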
What is cross channel parametric pooling?
This is where I'm going to make a guess that I'm 90% certain of (not 100%, because I didn't read the paper). This is just an extension of the logic above to whole feature maps rather than individual pixels. You're applying a cross-channel aggregation mechanism. The mechanism is parametric because it's not just a simple mean, sum, or max; it's a parameterised weighted sum. Also note that the weights are held constant across all pixels (remember, that's how convolution kernels work). So it's essentially the same as applying the weights of a single fully connected layer to the channels of a feature map in order to produce a different set of feature maps. But instead of applying the weights to individual neurons, you are applying them to all the neurons of the feature map at the same time.

Question on the kernel dimensions for convolutions on mel filter bank features

I am currently trying to understand the following paper: https://arxiv.org/pdf/1703.08581.pdf. I am struggling to understand a part about how a convolution is performed on an input of log mel filterbank features:
We train seq2seq models for both end-to-end speech translation, and a baseline model for speech recognition. We found that the same architecture, a variation of that from [10], works well for both tasks. We use 80 channel log mel filterbank features extracted from 25ms windows with a hop size of 10ms, stacked with delta and delta-delta features. The output softmax of all models predicts one of 90 symbols, described in detail in Section 4, that includes English and Spanish lowercase letters. The encoder is composed of a total of 8 layers. The input features are organized as a T × 80 × 3 tensor, i.e. raw features, deltas, and delta-deltas are concatenated along the 'depth' dimension. This is passed into a stack of two convolutional layers with ReLU activations, each consisting of 32 kernels with shape 3 × 3 × depth in time × frequency. These are both strided by 2 × 2, downsampling the sequence in time by a total factor of 4, decreasing the computation performed in the following layers. Batch normalization [26] is applied after each layer.
As I understand it, the input to the convolutional layer is 3-dimensional: number of 25 ms windows (T) x 80 (features for each window) x 3 (features, delta features and delta-delta features). However, the kernels used on those inputs seem to have 4 dimensions, and I do not understand why that is. Wouldn't a 4-dimensional kernel need a 4-dimensional input? In my head, the input has the same dimensions as an RGB picture: width (time) x height (frequency) x color channels (features, delta features and delta-delta features). Therefore I would think of a kernel for a 2D convolution as a filter of size a (filter width) x b (filter height) x 3 (depth of the input). Am I missing something here? What is wrong with my idea, or what is done differently in this paper?
Thanks in advance for your answer!
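I figured it out; it was just a misunderstanding on my side. "3 × 3 × depth in time × frequency" describes a 3-dimensional kernel: 3 (time) x 3 (frequency) x depth, where depth is the number of input channels to that layer. The authors are using 32 such kernels per layer, which results (after two layers with 2x2 striding) in an output of shape t/4 x 20 x 32, where t stands for the time dimension.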
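The shapes can be traced with a short sketch. It assumes PyTorch, which stores the depth axis first, so the T × 80 × 3 input becomes (batch, 3, T, 80). The value `T = 100` and `padding=1` are assumptions chosen so that each layer halves the size exactly; the paper excerpt does not state the padding. Batch normalization is omitted for brevity, and `encoder_front` is a hypothetical name.

```python
import torch
import torch.nn as nn

T = 100                                    # number of frames, illustrative
x = torch.rand(1, 3, T, 80)                # (batch, depth, time, frequency)

encoder_front = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),   # 32 kernels of 3 x 3 x 3
    nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1),  # 32 kernels of 3 x 3 x 32
    nn.ReLU(),
)

y = encoder_front(x)
w1 = encoder_front[0].weight.shape         # (32, 3, 3, 3): (kernels, depth, time, freq)
w2 = encoder_front[2].weight.shape         # (32, 32, 3, 3)
print(y.shape, w1, w2)
```

The weight tensors are 4-dimensional only because the 32 three-dimensional kernels of a layer are stacked along one extra axis.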

Can someone explain the correlation layer in FlowNet in a simple way?

I am currently reading the paper "FlowNet: Learning Optical Flow with Convolutional Networks" and having trouble understanding the correlation layer.
I can't seem to find any explanation on Google, so I thought I should ask here:
When the paper talks about comparing each patch from f_1 to each patch from f_2, where f_1 and f_2 are feature maps of dimension w x h x c, what do they mean by patch? Are we talking about a patch of features from a feature map or a patch of pixels from one of the original images?
What are x_1 and x_2? Are they feature pixels (1*1*c) in the feature maps? Are they coordinate values?
What does f_1(x_1 + o) mean exactly?
Many thanks!
From feature map 2, a patch of size 21x21x256 is extracted only once, and then each 1x1x256 kernel from feature map 1 is convolved with this 21x21x256 patch.
More explanation:
Each 1x1x256 kernel from feature map 1 is convolved with only pixel 1 of the 21x21x256 patch to get one feature map, and then all 1x1x256 kernels of feature map 1 are convolved with pixel 2 of the 21x21x256 patch to get the second feature map.
This process is continued for all pixels of the 21x21x256 patch until we get 441 feature maps, which is equal to the number of pixels (21 x 21) in the extracted patch. Please look at this figure.
The way I understand it, suppose you have two feature maps (ignoring batches for the moment):
f_1 of shape (w, h, c),
f_2 of shape (w, h, c)
Then there are two stride values s_1 and s_2. The first stride s_1 is applied to f_1 in the sense that we only consider feature map patches x_i of f_1 at strided patch centers. For instance if the stride was 5 (in both the height and width direction), we would consider patches at locations:
(0,0), (0,5), ..., (0, w)
(5,0), (5,5), ..., (5, w)
...
(h,0), (h, 5), ..., (h, w)
** (supposing the width/height are divisible by 5 for simplicity, otherwise you have to do some padding arithmetic)
For a given patch center x_i, the patch centers of f_2 considered in the correlation operation around x_i, call them {y_i}, are only those within a neighborhood of size D := 2d+1, and those are strided as well, with stride value s_2. There will be D^2 of these, according to the authors. (This part is not well described in my opinion, as there are many ways of interpreting what the stride value s_2 means. If s_2 = 1, then there will be D^2 patches {y_i} of f_2 to consider, but if it is larger, there should be fewer, and hence the last axis of the final tensor will not necessarily have size D^2.)
The correlation operation itself is a simple sum of dot products. Each dot product is taken between two vectors of shape (1, c), and K^2 of them are summed, where K = 2k+1 (an odd-sized filter for some non-negative integer k).
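The neighborhood of D^2 displacements can be sketched as follows. This is a simplified illustration, not the paper's implementation: it assumes k = 0 (single-pixel patches), strides s_1 = s_2 = 1, zero padding at the borders, and PyTorch's channel-first layout (c, h, w). The function name `correlation_k0` is hypothetical.

```python
import torch
import torch.nn.functional as F

def correlation_k0(f1, f2, d):
    """Correlate each pixel of f1 with every pixel of f2 within displacement d.

    f1, f2: tensors of shape (c, h, w). Returns a tensor of shape (D*D, h, w)
    with D = 2*d + 1, one output map per displacement.
    """
    c, h, w = f1.shape
    f2_padded = F.pad(f2, (d, d, d, d))        # zero-pad width and height by d
    maps = []
    for dy in range(2 * d + 1):                # displacement in rows is dy - d
        for dx in range(2 * d + 1):            # displacement in columns is dx - d
            shifted = f2_padded[:, dy:dy + h, dx:dx + w]
            maps.append((f1 * shifted).sum(dim=0))   # dot product over channels
    return torch.stack(maps)

torch.manual_seed(0)
f1 = torch.rand(3, 6, 7)
f2 = torch.rand(3, 6, 7)
out = correlation_k0(f1, f2, d=2)
print(out.shape)
```

With d = 10 and stride 1 this layout would give 21 x 21 = 441 output maps, one per displacement.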
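A patch is a patch of features from a feature map, not a patch of pixels from the original images.
x_1 and x_2 are feature pixels (1*1*c) in the feature maps.
f_1(x_1 + o) is the feature pixel located at offset o from x_1.
The correlation layer in FlowNet compares patches taken from the two feature maps (the first feature map and the second feature map).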
To calculate the correlation between feature pixel x1 and feature pixel x2, the correlation layer computes the dot product between the two windows of size (2k+1, 2k+1) centered at x1 and x2. In other words, it multiplies the corresponding elements of the two windows and adds them all up.
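The windowed dot product described above can be written out as a minimal sketch. It assumes PyTorch with channel-first feature maps of shape (c, h, w) and patch centers at least k pixels away from the border; the function name `patch_correlation` and the sizes are hypothetical.

```python
import torch

def patch_correlation(f1, f2, x1, x2, k):
    """Correlation between the window of f1 centered at x1 and of f2 at x2.

    f1, f2: feature maps of shape (c, h, w); x1, x2: (row, col) centers.
    The window is (2k+1) x (2k+1) and spans all c channels.
    """
    r1, c1 = x1
    r2, c2 = x2
    patch1 = f1[:, r1 - k:r1 + k + 1, c1 - k:c1 + k + 1]
    patch2 = f2[:, r2 - k:r2 + k + 1, c2 - k:c2 + k + 1]
    # Multiply matching elements and add them up: a sum of K*K channel dot products.
    return (patch1 * patch2).sum()

torch.manual_seed(0)
f1 = torch.rand(4, 9, 9)
f2 = torch.rand(4, 9, 9)
corr = patch_correlation(f1, f2, x1=(3, 3), x2=(4, 5), k=1)
print(corr)
```

Note that nothing here is learned: unlike a convolution, one feature map is compared against the other rather than against trainable weights.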

How to reshape a pytorch matrix without mixing elements of items in a batch

In my neural network model, I represent an 8-word sentence with an 8x256-dimensional embedding matrix. I want to give it to an LSTM as input, where the LSTM takes a single word embedding at a time and processes it. According to the PyTorch documentation, the input should have the shape (seq_len, batch, input_size). What is the correct way to convert my input to the desired shape? I don't want to mix up the numbers by mistake. I am quite new to PyTorch and row-major calculations, so I wanted to ask here. I do it as follows; is it correct?
import torch

x = torch.rand(8, 256)
lstm_input = torch.reshape(x, (8, 1, 256))
Your solution is correct: you added a singleton dimension for the "batch" dimension, leaving x with temporal dimension 8 and input dimension 256.
Since you are new to PyTorch, here are a few equivalent ways of doing the same thing:
x = x[:, None, :]
Putting None at dim=1 tells PyTorch to add a singleton dimension there.
Another way is to use view:
x = x.view(8, 1, 256)
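The variants above can be compared directly, and the result fed to an LSTM, with the sketch below. It assumes PyTorch; `x.unsqueeze(1)` is one more equivalent that is not mentioned above, and `hidden_size=64` is an arbitrary illustrative choice.

```python
import torch
import torch.nn as nn

x = torch.rand(8, 256)                     # 8 words, 256-dim embeddings

a = torch.reshape(x, (8, 1, 256))
b = x[:, None, :]
c = x.view(8, 1, 256)
d = x.unsqueeze(1)                         # another way to add the batch dimension
all_equal = all(torch.equal(a, t) for t in (b, c, d))

# Each word keeps its own embedding: row i of x is item (i, 0) of the input.
rows_intact = all(torch.equal(a[i, 0], x[i]) for i in range(8))

lstm = nn.LSTM(input_size=256, hidden_size=64)   # default layout: (seq_len, batch, input_size)
out, (h_n, c_n) = lstm(a)
print(all_equal, rows_intact, out.shape)
```

Adding a dimension of size 1 never reorders elements, so no values can get mixed between words here; mixing only becomes a risk when a reshape changes the sizes of existing dimensions.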

Why are my Keras Conv2D kernels 3-dimensional?

In a typical CNN, a conv layer will have Y filters of size NxM, and thus it has N x M x Y trainable parameters (not including bias).
Accordingly, in the following simple keras model, I expect the second conv layer to have 16 kernels of size (7x7), and thus kernel weights of size (7x7x16). Why then are its weights actually size (7x7x8x16)?
I understand the mechanics of what is happening: the Conv2D layers are actually doing a 3D convolution, treating the output maps of the previous layer as channels. The second layer has 16 3D kernels of size (7x7x8). What I don't understand is:
Why is this Keras's default behavior?
How do I get a "traditional" convolutional layer without dropping down into the low-level API (avoiding that is my reason for using Keras in the first place)?
from keras.models import Sequential
from keras.layers import InputLayer, Conv2D

model = Sequential([
    InputLayer((101, 101, 1)),
    Conv2D(8, (11, 11)),
    Conv2D(16, (7, 7)),
])
model.weights
Q1: "...and thus kernel weights of size (7x7x16). Why then are its weights actually size (7x7x8x16)?"
No, the kernel weights do not have size (7x7x16).
From cs231n:
Example 2. Suppose an input volume had size [16x16x20]. Then using an example receptive field size of 3x3, every neuron in the Conv Layer would now have a total of 3*3*20 = 180 connections to the input volume. Notice that, again, the connectivity is local in space (e.g. 3x3), but full along the input depth (20).
Note the word "every".
In your model, 7x7 is the spatial size of a single filter, and each filter connects to all 8 output maps of the previous conv layer, so a single filter has 7x7x8 parameters. You have 16 such filters, so the total number of parameters is 7 x 7 x 8 x 16 = 6272, stored as a weight tensor of shape (7, 7, 8, 16).
Q2: "Why is this Keras's default behavior?"
See Q1.
In the typical jargon, when someone refers to a conv layer with N kernels of size (x, y), it is implied that the kernels actually have size (x, y, z), where z is the depth of the input volume to that layer.
Imagine what happens when the input image to the network has R, G, and B channels: each of the initial kernels itself has 3 channels. Subsequent layers are the same, treating the input volume as a multi-channel image, where the channels are now maps of some other feature.
The motion of that 3D kernel as it "sweeps" across the input is only 2D, so it is still referred to as a 2D convolution, and the output of that convolution is a 2D feature map.
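A small sketch can make this concrete. It uses PyTorch rather than Keras (an assumption made here so the example stays in one framework); PyTorch stores the weights as (out_channels, in_channels, height, width) instead of Keras's (height, width, in_channels, out_channels), but the parameter count is the same. The input size 91x91 is what the first 11x11 layer would produce from a 101x101 image without padding.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv2 = nn.Conv2d(in_channels=8, out_channels=16, kernel_size=7, bias=False)
x = torch.rand(1, 8, 91, 91)               # 8 feature maps from the previous layer

with torch.no_grad():
    out = conv2(x)                          # (1, 16, 85, 85): one 2D map per kernel
    # One output value = one 7 x 7 x 8 kernel times the matching input volume.
    manual = (x[0, :, :7, :7] * conv2.weight[0]).sum()
    agrees = torch.allclose(out[0, 0, 0, 0], manual, atol=1e-4)

n_params = conv2.weight.numel()             # 16 * 8 * 7 * 7
print(conv2.weight.shape, out.shape, n_params, agrees)
```

Each of the 16 kernels spans all 8 input maps yet slides only along height and width, which is why the layer is still called a 2D convolution.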
Edit:
I found a good quote about this in a recent paper, https://arxiv.org/pdf/1809.02601v1.pdf
"In a convolutional layer, the input feature map X is a W1 × H1 × D1 cube, with W1, H1 and D1 indicating its width, height and depth (also referred to as the number of channels), respectively. The output feature map, similarly, is a cube Z with W2 × H2 × D2 entries. The convolution Z = f(X) is parameterized by D2 convolutional kernels, each of which is a S × S × D1 cube."