How to use conv2d in this case

I want to create an NN layer such that:
for an input of size 100, every 5 consecutive samples form a "block"
the layer should compute, say, 3 values for every block
so the input/output sizes of this layer should be: 100 -> 20*3
every block of size 5 (and only this block) is fully connected to its result block of size 3
If I understand it correctly I can use Conv2d for this problem. But I'm not sure how to correctly choose conv2d parameters.
Is Conv2d suitable for this task? If so, what are the correct parameters? Is that
input channels = 100
output channels = 20*3
kernel = (5,1)
?

You can use either Conv2D or Conv1D.
With the data shaped like batch x 100 x n_features you can use Conv1D with this setup:
Input channels: n_features
Output channels: 3 * output_features
kernel: 5
strides: 5
Thereby, the kernel is applied to 5 samples and generates 3 outputs. The values for n_features and output_features can be anything you like and might as well be 1. Setting the strides to 5 results in a non-overlapping convolution so that each block uniquely contributes to one output.
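As a rough illustration of the Conv1D setup above, here is a minimal sketch. It assumes PyTorch (the answer does not name a framework) and picks n_features = 1 and output_features = 1 for simplicity. Note that PyTorch's nn.Conv1d expects channels-first input (batch x channels x length), so the 100 samples sit on the last axis here rather than the middle one.

```python
import torch
import torch.nn as nn

# Assumed for illustration: n_features = 1, output_features = 1.
n_features, output_features = 1, 1

# kernel_size == stride == 5 gives non-overlapping blocks of 5 samples.
layer = nn.Conv1d(
    in_channels=n_features,
    out_channels=3 * output_features,
    kernel_size=5,
    stride=5,
)

x = torch.randn(8, n_features, 100)  # batch of 8, channels-first layout
y = layer(x)
print(y.shape)  # torch.Size([8, 3, 20]): 3 values for each of the 20 blocks
```

Because this is a convolution, the same 5-to-3 weights are shared by all 20 blocks; each output position still depends only on its own block of 5 inputs.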

Understanding DINO (object classifier) model architecture

I am trying to understand the model architecture of DINO https://arxiv.org/pdf/2203.03605.pdf
These are the last few layers I see when I execute model.children()
Question 1)
In class_embed, (0) is of dimension 256 by 91, and if it's feeding into (1) of class_embed, shouldn't the first dimension be 91?
So, I realize (0) of class_embed is not actually feeding into (1) of class_embed. Could someone explain this to me?
Question 2)
Also, the last layer (2) of the MLP (see the first picture, which says (5): MLP) has dimension 256 by 4. So shouldn't the first dimension of class_embed (0) be 4?
Now, when I use a different function to print the layers, the layers shown above appear merged together. For example, there is only one layer of
Linear(in_features=256, out_features=91, bias=True)]
Why does this function give me a different architecture?
Question 3)
Now, I went on to create a hook for the 3rd last layer.
When I print the size, I get 1 by 900 by 256. Shouldn't I be getting something like 1 by 256 by 256?
Code to find dimension:
Output:
especially since layer 4 is :

Confusion about the output channels of convolution neural network

I'm confused about the multi-channel scenario in convolution neural network.
Say I have a 10(width) * 5(height) * 6(channels) image, and I feed it into a default 2-D convolution layer with stride=1 and padding=0 and expect the output to be 8(width) * 3(height) * 16(channels).
I know the size of the kernel is 3(width) * 3(height), but I don't know exactly how many kernels there are, and how they are applied to the input data to give the final 16 channels.
Can someone help me, please?
A 2D convolution layer contains one kernel per input channel, per output channel. So in your case, this will be 6*16=96 kernels. For 3x3 kernels, this corresponds to 3*3*96 = 864 parameters.
>>> import torch
>>> conv = torch.nn.Conv2d(6, 16, (3, 3))
>>> torch.numel(conv.weight)
864
For one image, one kernel per input channel is applied first. In your case, this results in 6 feature maps, which are summed together (plus a possible bias) to form one of the output channels. Then you repeat this 15 more times, each time with a different set of 6 kernels, to form the 15 other output channels.
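The per-channel filtering and summation described above can be checked numerically. The following sketch uses a random input as a stand-in for the 10(width) * 5(height) * 6(channels) image and rebuilds one output channel by hand; the variable names are only illustrative.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
conv = torch.nn.Conv2d(6, 16, (3, 3))   # stride=1, padding=0 by default
x = torch.randn(1, 6, 5, 10)            # batch x channels x height x width

out = conv(x)
print(conv.weight.shape)  # torch.Size([16, 6, 3, 3]): 16 * 6 kernels of 3 x 3
print(out.shape)          # torch.Size([1, 16, 3, 8])

# Rebuild output channel 0 by hand: filter each input channel with its own
# 3 x 3 kernel, sum the 6 resulting feature maps, then add the bias.
maps = [
    F.conv2d(x[:, c:c + 1], conv.weight[0:1, c:c + 1])
    for c in range(6)
]
manual = torch.stack(maps).sum(dim=0) + conv.bias[0]
print(torch.allclose(manual, out[:, 0:1], atol=1e-5))  # True
```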

Shape of ground truth in multiclass image segmentation with pytorch

I'm working on 128 x 128 x 3 cell images and want to segment them into 5 classes including backgrounds. I first made target images to be 128 x 128 and values are in {0,1,2,3,4}. But I found I have to make my target ground truth as 5-channel image, and all the values are 0 or 1: if a pixel has 1 in the nth channel, then it should be classified to nth class.
But when I feed my data into a U-Net model which I forked from GitHub, I get an error while calculating the cross-entropy loss.
I initially set up the number of channels in the input to be 3 and the number of classes in the output to be 5. The batch size is 2.
Here is my code:
for i, (x, y) in batch_iter:
    input, target = x.to(self.device), y.to(self.device)  # send to device (GPU or CPU)
    self.optimizer.zero_grad()  # zerograd the parameters
    out = self.model(input)  # one forward pass
    loss = self.criterion(out, target)  # calculate loss
    loss_value = loss.item()
    train_losses.append(loss_value)
    loss.backward()  # one backward pass
    self.optimizer.step()  # update the parameters
    batch_iter.set_description(f'Training: (loss {loss_value:.4f})')  # update progressbar
self.training_loss.append(np.mean(train_losses))
self.learning_rate.append(self.optimizer.param_groups[0]['lr'])
batch_iter.close()
And the error message:
RuntimeError: 1only batches of spatial targets supported (3D tensors) but got targets of size: : [2, 5, 128, 128]
How can I solve this?
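It seems you are using either nn.CrossEntropyLoss or nn.functional.cross_entropy.
I also faced the same error.
CrossEntropyLoss is usually used for classification with one class index per element. For an output of shape [2, 5, 128, 128], the version raising this error expects a target of shape [2, 128, 128] holding class indices, not a 5-channel mask.
If your targets are normalized tensors with values in [0, 1], you could use nn.BCELoss (on probabilities, i.e. after a sigmoid) or nn.functional.binary_cross_entropy_with_logits (on raw model outputs). This worked in my case: since we are using a separate mask for each class, it becomes a binary cross-entropy problem.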
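For comparison, here is a minimal sketch of both routes. It assumes the model returns raw logits of shape [2, 5, 128, 128]; the random tensors are stand-ins for the real data, and the argmax conversion (Option A) is an alternative that is not part of the answer above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-ins for the real data (assumed shapes, random values).
logits = torch.randn(2, 5, 128, 128)           # model output: N x C x H x W
labels = torch.randint(0, 5, (2, 128, 128))    # class index per pixel
# 5-channel 0/1 target as described in the question: N x C x H x W
one_hot = F.one_hot(labels, num_classes=5).permute(0, 3, 1, 2).float()

# Option A: keep CrossEntropyLoss and convert the 5-channel target back to
# class indices of shape N x H x W.
target_idx = one_hot.argmax(dim=1)
loss_ce = nn.CrossEntropyLoss()(logits, target_idx)

# Option B: treat every channel as a binary mask, as suggested in the answer.
loss_bce = F.binary_cross_entropy_with_logits(logits, one_hot)

print(target_idx.shape, loss_ce.item(), loss_bce.item())
```

Option A keeps the classes mutually exclusive per pixel, while Option B scores each class mask independently; which one fits depends on how the model's final layer is meant to be interpreted.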

Question on the kernel dimensions for convolutions on mel filter bank features

I am currently trying to understand the following paper: https://arxiv.org/pdf/1703.08581.pdf. I am struggling to understand a part about how a convolution is performed on an input of log mel filterbank features:
We train seq2seq models for both end-to-end speech translation, and a baseline model for speech recognition. We found that the same architecture, a variation of that from [10], works well for both tasks. We use 80 channel log mel filterbank features extracted from 25ms windows with a hop size of 10ms, stacked with delta and delta-delta features. The output softmax of all models predicts one of 90 symbols, described in detail in Section 4, that includes English and Spanish lowercase letters.
The encoder is composed of a total of 8 layers. The input features are organized as a T × 80 × 3 tensor, i.e. raw features, deltas, and delta-deltas are concatenated along the 'depth' dimension. This is passed into a stack of two convolutional layers with ReLU activations, each consisting of 32 kernels with shape 3 × 3 × depth in time × frequency. These are both strided by 2 × 2, downsampling the sequence in time by a total factor of 4, decreasing the computation performed in the following layers. Batch normalization [26] is applied after each layer.
As I understand it, the input to the convolutional layer is 3-dimensional: number of 25 ms windows (T) x 80 (features for each window) x 3 (features, delta features and delta-delta features). However, the kernels used on those inputs seem to have 4 dimensions and I do not understand why that is. Wouldn't a 4-dimensional kernel need a 4-dimensional input? In my head, the input has the same dimensions as an RGB picture: width (time) x height (frequency) x color channels (features, delta features and delta-delta features). Therefore I would think of a kernel for a 2D convolution as a filter of size a (filter width) x b (filter height) x 3 (depth of the input). Am I missing something here? What is wrong with my idea, or what is done differently in this paper?
Thanks in advance for your answer!
I figured it out; it turns out it was just a misunderstanding on my side: the authors are using 32 kernels, each of shape 3x3 in time x frequency and spanning the full input depth, which results (after two layers with 2x2 striding) in an output of shape t/4 x 20 x 32, where t stands for the time dimension.
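The shape arithmetic can be sketched in a few lines. This assumes a padding of 1 on each side, which the quoted excerpt does not state but which reproduces the 80 -> 40 -> 20 reduction along the frequency axis; T = 100 is an arbitrary example value and conv_out_size is a hypothetical helper.

```python
def conv_out_size(n, kernel=3, stride=2, padding=1):
    """Output length of a convolution along one axis (floor convention)."""
    return (n + 2 * padding - kernel) // stride + 1


T, freq, depth = 100, 80, 3   # T = 100 frames is an arbitrary example value
channels = depth
for layer in (1, 2):
    T, freq = conv_out_size(T), conv_out_size(freq)
    channels = 32             # 32 kernels per layer -> 32 output channels
    print(f"after layer {layer}: {T} x {freq} x {channels}")
# after layer 1: 50 x 40 x 32
# after layer 2: 25 x 20 x 32
```

Each individual kernel is 3-dimensional (3 x 3 x depth); stacking the 32 kernels of a layer into one weight tensor is what adds a fourth dimension.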

Pytorch: How do I deal with different input sizes within one batch?

I am implementing something closely related to the DeepSets architecture on point clouds:
https://arxiv.org/abs/1703.06114
That means I am working with a set of inputs (coordinates), have fully connected layers process each of those separately and then perform average pooling over them (to then do further processing).
The input for each sample i is a tensor of shape [L_i, 3] where L_i is the number of points and the last dimension is 3 because each point has x,y,z coordinates. Crucially, L_i depends on the sample. So I have a different number of points per instance. When I put everything into a batch, I currently have the input in the shape [B, L, 3] where L is larger than L_i for all i. The individual samples are padded with 0's. The issue is that 0's are not ignored by the network, they are processed and fed into the average pooling. Instead I would like the average pooling to only consider actual points (not padded 0's). I do have another array which stores the lengths [L_1, L_2, L_3, L_4...], but I am not sure how to use it.
My question is: how do you handle different input sizes within one batch in the most graceful manner?
This is how the model is defined:
encoder = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, 128))
x = self.encoder(x)
x = x.max(dim=1)[0]
decoder = ...
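One common way to use the lengths array is to build a boolean mask from it and average only over the real points. The sketch below is an illustration under assumptions, not necessarily the approach the asker ended up with: masked_mean is a hypothetical helper, the batch is random data, and every length is assumed to be at least 1.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, 128))


def masked_mean(x, lengths):
    """Average features over the first lengths[i] points of each sample."""
    B, L, _ = x.shape
    # mask[i, j] is True when point j of sample i is a real (non-padded) point
    mask = torch.arange(L, device=x.device)[None, :] < lengths[:, None]  # B x L
    mask = mask.unsqueeze(-1).to(x.dtype)                                # B x L x 1
    return (x * mask).sum(dim=1) / lengths[:, None].to(x.dtype)


points = torch.randn(4, 10, 3)           # B x L x 3 padded batch (hypothetical data)
lengths = torch.tensor([10, 7, 3, 5])    # true number of points per sample
pooled = masked_mean(encoder(points), lengths)
print(pooled.shape)  # torch.Size([4, 128])
```

For max pooling, as in the x.max(dim=1) line of the snippet, the padded positions could instead be filled with -inf (for example via masked_fill) before taking the maximum.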