I have time series data where each sample is 32x32, and I used a sliding window to form a 3D array of shape timesteps x 32 x 32. So my input has shape (batchsize x timesteps x 32 x 32).
I tried a (1,3,3) Conv3D filter and a (3,3) TimeDistributed Conv2D filter separately. The performance is different.
Can I say that a convolution kernel of size 1 can extract features? What's the difference between these 2 kernels?
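As a rough check of what the two layers compute, here is a minimal NumPy sketch. It assumes a single input channel, no padding, stride 1, and that both layers hold the same 3x3 weights; the helper names `conv2d_valid` and `conv3d_valid` and the 5-timestep clip are made up for illustration. Under those assumptions, a 3D kernel with extent 1 along time gives the same result as applying one shared 2D kernel to every timestep: it still extracts spatial features, it just never mixes information across timesteps.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """'Valid' 2D cross-correlation of one 2D image with one 2D kernel."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * image[i:i + out_h, j:j + out_w]
    return out

def conv3d_valid(volume, kernel):
    """'Valid' 3D cross-correlation of a (T, H, W) volume with a (kt, kh, kw) kernel."""
    kt, kh, kw = kernel.shape
    out_t = volume.shape[0] - kt + 1
    out_h = volume.shape[1] - kh + 1
    out_w = volume.shape[2] - kw + 1
    out = np.zeros((out_t, out_h, out_w))
    for t in range(kt):
        for i in range(kh):
            for j in range(kw):
                out += kernel[t, i, j] * volume[t:t + out_t, i:i + out_h, j:j + out_w]
    return out

rng = np.random.default_rng(0)
clip = rng.standard_normal((5, 32, 32))  # hypothetical: 5 timesteps of 32x32
k2d = rng.standard_normal((3, 3))        # one shared 3x3 kernel

# "Time-distributed" 2D conv: the same 2D kernel applied to each timestep independently
per_frame = np.stack([conv2d_valid(frame, k2d) for frame in clip])

# 3D conv whose kernel has extent 1 along the time axis
as_3d = conv3d_valid(clip, k2d[np.newaxis, :, :])

print(per_frame.shape, as_3d.shape)
print(np.allclose(per_frame, as_3d))
```

If the two operations match like this, a difference in measured performance would have to come from something other than the kernel shape itself, for example random initialization or how the input was arranged into channels and timesteps for each layer.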
3D convolution is computationally expensive, so I prefer to use 2D convolution. My motivation is to use 2D convolution on volumetric images to reduce this cost.
I want to apply 2D convolution along the three orthogonal planes to get 3 results, one per plane. More precisely, suppose I have a 3D volumetric image. Instead of applying a 3D convolution, I want to apply 2D convolutions over the xy, xz, and yz planes. I then expect 3 different volumetric results, each representing one of the three orthogonal planes.
Is there a way to do that? Thanks for the help.
You can permute your images. (Some frameworks, such as NumPy, call it transpose.)
Assume we use a 3 x 3 convolutional kernel.
import torch
# A batch of 16 3-channel images (channels first)
a = torch.zeros(16, 3, 1920, 1080)
# 2D conv will slide over a `1920 x 1080` image, kernel size is `3 x 3 x 3`
print(a.shape)  # torch.Size([16, 3, 1920, 1080])
# permute returns a new view; it does not modify `a` in place
# 2D conv will slide over a `3 x 1080` image, kernel size is `1920 x 3 x 3`
b = a.permute(0, 2, 1, 3)
print(b.shape)  # torch.Size([16, 1920, 3, 1080])
# 2D conv will slide over a `1920 x 3` image, kernel size is `1080 x 3 x 3`
c = a.permute(0, 3, 2, 1)
print(c.shape)  # torch.Size([16, 1080, 1920, 3])
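For the volumetric case in the question, a related idea can be sketched in NumPy. Note that this is a variation, not the same thing as the permute approach above: instead of folding the remaining axis into channels, it treats that axis as a stack of independent slices and filters each slice with the same 2D kernel, which yields one volume per plane. The helper `conv2d_last_two_axes`, the tiny (z, y, x) volume, and the mean filter are all hypothetical.

```python
import numpy as np

def conv2d_last_two_axes(volume, kernel):
    """'Valid' 2D cross-correlation over the last two axes, repeated for every index of axis 0."""
    kh, kw = kernel.shape
    out_h = volume.shape[1] - kh + 1
    out_w = volume.shape[2] - kw + 1
    out = np.zeros((volume.shape[0], out_h, out_w))
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * volume[:, i:i + out_h, j:j + out_w]
    return out

volume = np.arange(4 * 5 * 6, dtype=float).reshape(4, 5, 6)  # hypothetical (z, y, x) volume
kernel = np.ones((3, 3)) / 9.0                                # hypothetical 3x3 mean filter

# Slide over (y, x), one slice per z
out_xy = conv2d_last_two_axes(volume, kernel)
# Slide over (z, x), one slice per y: move y to the front
out_xz = conv2d_last_two_axes(volume.transpose(1, 0, 2), kernel)
# Slide over (z, y), one slice per x: move x to the front
out_yz = conv2d_last_two_axes(volume.transpose(2, 0, 1), kernel)

print(out_xy.shape)  # (4, 3, 4)
print(out_xz.shape)  # (5, 2, 4)
print(out_yz.shape)  # (6, 2, 3)
```

Each result stays in its transposed layout; applying the inverse transpose would bring it back to (z, y, x) order. Without padding the three volumes have different shapes, so combining them would need "same" padding or cropping.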
I'm confused about the multi-channel scenario in convolution neural network.
Say I have a 10(width) * 5(height) * 6(channels) image, and I feed it into a default 2-D convolution layer with stride=1 and padding=0 and expect the output to be 8(width) * 3(height) * 16(channels).
I know the size of the kernel is 3(width) * 3(height), but I don't know how many kernels there are exactly, or how they are applied to the input data to give the final 16 channels.
Can someone help me, please?
A 2D convolution layer contains one kernel per input channel, per output channel. So in your case, this will be 6*16=96 kernels. For 3x3 kernels, this corresponds to 3*3*96 = 864 parameters.
>>> import torch
>>> conv = torch.nn.Conv2d(6, 16, (3, 3))
>>> torch.numel(conv.weight)
864
For one image, one kernel per input channel is applied first. In your case, this results in 6 feature maps, which are summed together (plus an optional bias) to form 1 of the output channels. Then you repeat this 15 more times, each time with a different set of 6 kernels, to form the other 15 output channels.
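The procedure described above can be written out by hand. This is only a sketch with random weights, assuming stride 1 and no padding as in the question; the weight array is laid out as (output channels, input channels, kernel height, kernel width), which is the layout PyTorch uses for `Conv2d.weight`.

```python
import numpy as np

rng = np.random.default_rng(0)
in_ch, out_ch, k = 6, 16, 3
image = rng.standard_normal((in_ch, 5, 10))          # channels x height x width
weight = rng.standard_normal((out_ch, in_ch, k, k))  # one 3x3 kernel per (output, input) pair
bias = rng.standard_normal(out_ch)

out_h, out_w = 5 - k + 1, 10 - k + 1                 # stride 1, no padding
output = np.zeros((out_ch, out_h, out_w))
for o in range(out_ch):
    # Apply one kernel per input channel and sum the 6 resulting maps
    for c in range(in_ch):
        for i in range(k):
            for j in range(k):
                output[o] += weight[o, c, i, j] * image[c, i:i + out_h, j:j + out_w]
    output[o] += bias[o]

print(weight.shape[0] * weight.shape[1])  # 96 kernels of size 3x3
print(weight.size)                        # 864 weights
print(output.shape)                       # (16, 3, 8): channels x height x width
```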
I am currently trying to understand the following paper: https://arxiv.org/pdf/1703.08581.pdf. I am struggling to understand a part about how a convolution is performed on an input of log mel filterbank features:
We train seq2seq models for both end-to-end speech translation, and a baseline model for speech recognition. We found
that the same architecture, a variation of that from [10], works
well for both tasks. We use 80 channel log mel filterbank features extracted from 25ms windows with a hop size of 10ms,
stacked with delta and delta-delta features. The output softmax
of all models predicts one of 90 symbols, described in detail in
Section 4, that includes English and Spanish lowercase letters.
The encoder is composed of a total of 8 layers. The input
features are organized as a T × 80 × 3 tensor, i.e. raw features,
deltas, and delta-deltas are concatenated along the ’depth’ dimension. This is passed into a stack of two convolutional layers
with ReLU activations, each consisting of 32 kernels with shape
3 × 3 × depth in time × frequency. These are both strided by
2 × 2, downsampling the sequence in time by a total factor of 4,
decreasing the computation performed in the following layers.
Batch normalization [26] is applied after each layer.
As I understand it, the input to the convolutional layer is 3-dimensional: number of 25 ms windows (T) x 80 (features for each window) x 3 (features, delta features and delta-delta features). However, the kernels used on those inputs seem to have 4 dimensions, and I do not understand why that is. Wouldn't a 4-dimensional kernel need a 4-dimensional input? In my head, the input has the same dimensions as an RGB picture: width (time) x height (frequency) x color channels (features, delta features and delta-delta features). Therefore I would think of a kernel for a 2D convolution as a filter of size a (filter width) x b (filter height) x 3 (depth of the input). Am I missing something here? What is wrong with my idea, or what is done differently in this paper?
Thanks in advance for your answer!
I figured it out; it was just a misunderstanding on my side. The authors are using 32 kernels of shape 3x3 (each spanning the full depth of its input), which results (after two layers with 2x2 striding) in an output of shape t/4 x 20 x 32, where t stands for the time dimension.
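The shape bookkeeping can be sketched in a few lines of plain Python. This assumes "same"-style padding, so that each stride-2 layer maps a length n to ceil(n / 2); the quoted passage does not state the padding, but this assumption reproduces the 20 frequency bins mentioned above. The function names and the example of 400 input frames are hypothetical, and kernel shapes are written as (height, width, input depth, number of kernels).

```python
import math

def same_padded_length(n, stride):
    """Output length of a strided convolution with 'same' padding (assumed here)."""
    return math.ceil(n / stride)

def encoder_front_end_shape(time_steps, freq_bins=80, depth=3,
                            num_kernels=32, stride=2, num_layers=2):
    t, f, d = time_steps, freq_bins, depth
    kernel_shapes = []
    for _ in range(num_layers):
        # Each kernel is 3 x 3 x (depth of this layer's input)
        kernel_shapes.append((3, 3, d, num_kernels))
        t = same_padded_length(t, stride)
        f = same_padded_length(f, stride)
        d = num_kernels  # the next layer sees one channel per kernel
    return (t, f, d), kernel_shapes

shape, kernels = encoder_front_end_shape(time_steps=400)
print(shape)    # (100, 20, 32), i.e. T/4 x 20 x 32
print(kernels)  # [(3, 3, 3, 32), (3, 3, 32, 32)]
```

So "depth" is 3 for the first layer and 32 for the second; the fourth number is just the count of kernels, not an extra input dimension.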
I'm new to Caffe. I have trained a convolutional neural network whose first layer has 64 feature maps with 7x7 filters; when I get the weights of a filter, I get a 7x7 matrix. However, my second layer has 32 feature maps with 3x3 filters, and when I get the weights of any filter of the second layer, I get 64 matrices of size 3x3.
Does anybody know why?
TL;DR:
The filters of a convolutional layer must match the number of
channels of that layer's input.
Let's see an example:
So let's say your network receives 3-channel colored images (RGB, for example) with dimensions 128x128 (height and width of 128 pixels) as input. So the input to your first convolution layer (let's call it conv1) would be 3x128x128 (channels x width x height).
Now suppose conv1 has 64 filters of size 7x7. In order to process all values from the input, a single filter must match the number of input channels being fed to that layer (or else some of the channels would not be taken into account during the convolution). So it must also be a 3-channel filter and, in the end, we will have 64 filters of dimension 3x7x7 for conv1.
Conv1 will output maps of dimension 64x128x128 (number of filters x width x height). If this is not clear to you, please check this demo [1].
The filters of the next conv layer (conv2) will in turn have to match the number of channels of conv1's output. For example, 32 filters of size 64x5x5 (for filters with a spatial dimension of 5x5). And so on...
(For the sake of simplicity, we supposed that we zero-pad the input before convolving. Zero-padding is a "border" of zeros wrapped around the input map, chosen here so that the spatial dimensions, i.e. width and height, do not change. If there is no padding, the output is smaller than the input. E.g., for 7x7 filters with stride 1 and an input of size 128x128, the output would end up having size 122x122. The spatial dimension shrinks by floor(filter_size / 2) on each side, i.e. by filter_size - 1 in total for odd filter sizes.)
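The size arithmetic follows the usual formula output = floor((input + 2 * padding - kernel) / stride) + 1. A small helper (the name `conv_output_size` is made up for this sketch) makes it easy to check specific cases:

```python
def conv_output_size(input_size, kernel_size, padding=0, stride=1):
    """Spatial output size of a convolution along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

print(conv_output_size(128, 7))             # 122: no padding
print(conv_output_size(128, 7, padding=3))  # 128: zero-padding of floor(7 / 2) per side
```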
[1] CS231n Convolutional Neural Networks for Visual Recognition
In a typical CNN, a conv layer will have Y filters of size NxM, and thus it has N x M x Y trainable parameters (not including bias).
Accordingly, in the following simple keras model, I expect the second conv layer to have 16 kernels of size (7x7), and thus kernel weights of size (7x7x16). Why then are its weights actually size (7x7x8x16)?
I understand the mechanics of what is happening: the Conv2D layers are actually doing a 3D convolution, treating the output maps of the previous layer as channels. It has 16 3D kernels of size(7x7x8). What I don't understand is:
why this is Keras's default behavior?
how do I get a "traditional" convolutional layer without dropping down into the low-level API (avoiding that is my reason for using Keras in the first place)?
from keras.models import Sequential
from keras.layers import InputLayer, Conv2D
model = Sequential([
    InputLayer((101, 101, 1)),
    Conv2D(8, (11, 11)),
    Conv2D(16, (7, 7))
])
model.weights
Q1: "and thus kernel weights of size (7x7x16). Why then are its weights actually size (7x7x8x16)?"
No, the kernel weights are not of size (7x7x16).
from cs231n:
Example 2. Suppose an input volume had size [16x16x20]. Then using an example receptive field size of 3x3, every neuron in the Conv Layer would now have a total of 3*3*20 = 180 connections to the input volume. Notice that, again, the connectivity is local in space (e.g. 3x3), but full along the input depth (20).
Pay attention to the word 'every'.
In your model, 7x7 is the spatial size of a single filter, and each filter connects to all 8 output maps of the previous conv layer, so a single filter has 7x7x8 parameters. You have 16 such filters, so the total number of weights is 7x7x8x16 (not counting biases).
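To make the counting concrete, here is a small plain-Python sketch that tracks the kernel shape of each layer in the model from the question. The helper `conv2d_weight_shapes` is hypothetical; shapes are written as (height, width, input depth, number of filters), matching the (7x7x8x16) reported in the question, and biases are left out.

```python
def conv2d_weight_shapes(input_channels, layers):
    """layers is a list of (num_filters, (kernel_height, kernel_width)) pairs."""
    shapes = []
    depth = input_channels
    for filters, (kh, kw) in layers:
        shapes.append((kh, kw, depth, filters))
        depth = filters  # the next layer sees one channel per filter
    return shapes

shapes = conv2d_weight_shapes(1, [(8, (11, 11)), (16, (7, 7))])
for kh, kw, depth, filters in shapes:
    print((kh, kw, depth, filters), kh * kw * depth * filters)
# (11, 11, 1, 8) 968
# (7, 7, 8, 16) 6272
```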
Q2: "why this is Keras's default behavior?"
See Q1.
In the typical jargon, when someone refers to a conv layer with N kernels of size (x, y), it is implied that the kernels actually have size (x, y, z), where z is the depth of the input volume to that layer.
Imagine what happens when the input image to the network has R, G, and B channels: each of the initial kernels itself has 3 channels. Subsequent layers are the same, treating the input volume as a multi-channel image, where the channels are now maps of some other feature.
The motion of that 3D kernel as it "sweeps" across the input is only 2D, so it is still referred to as a 2D convolution, and the output of that convolution is a 2D feature map.
Edit:
I found a good quote about this in a recent paper, https://arxiv.org/pdf/1809.02601v1.pdf
"In a convolutional layer, the input feature map X is a W1 × H1 × D1 cube, with W1, H1 and D1 indicating its width, height and depth (also referred to as the number of channels), respectively. The output feature map, similarly, is a cube Z with W2 × H2 × D2 entries. The convolution Z = f(X) is parameterized by D2 convolutional kernels, each of which is a S × S × D1 cube."