Xavier uniform initialisation - deep-learning

I have a general idea of how Xavier initialisation works. But lately I was reading some code that split the weights of an LSTM (the weights related to the input and to the hidden state) into 4 chunks before applying Xavier initialisation. Here is part of the code:
from torch.nn import init

def _init_lstm(self, weight):
    # weight is e.g. self.lstm.weight_ih_l0, of shape (4 * hidden_size, input_size)
    for w in weight.chunk(4, 0):
        init.xavier_uniform_(w)

self._init_lstm(self.lstm.weight_ih_l0)
self._init_lstm(self.lstm.weight_hh_l0)
The only effect I can see is that the fan_out in the Xavier formula is computed per chunk, i.e. divided by 4 (fan_in is unchanged, since the split is along dimension 0).
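For concreteness, here is a minimal sketch (with hypothetical sizes, not from the original code) of how the chunking changes the bound:

import torch.nn as nn

# Hypothetical sizes, just for illustration.
input_size, hidden_size = 300, 512
lstm = nn.LSTM(input_size, hidden_size)

# nn.LSTM stacks the four gate matrices (input, forget, cell, output) along
# dimension 0, so weight_ih_l0 has shape (4 * hidden_size, input_size).
print(lstm.weight_ih_l0.shape)          # torch.Size([2048, 300])

for gate_w in lstm.weight_ih_l0.chunk(4, 0):
    print(gate_w.shape)                 # torch.Size([512, 300])

# xavier_uniform_ draws from U(-a, a) with a = sqrt(6 / (fan_in + fan_out)).
# On the full matrix: fan_in = 300, fan_out = 2048.
# On each chunk:      fan_in = 300, fan_out = 512, so each gate gets a larger bound.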
Can anyone explain why the author split the weights into 4 chunks?
Thanks!

Related

Harmonizing regression and classification losses

I'm investigating the task of training a neural network to predict one future value given a sinusoidal input. For example, as seen in the Figure, the input signal is x and the expected output signal is y; the model's output is y^. The regression task is fairly straightforward, and there are many choices for this problem. I'm using a simple recurrent neural network with mean-squared error (MSE) loss between y and y^.
Additionally, suppose I know that the sinusoid is made up of N modalities, e.g., at some points, the wave oscillates at 5 Hz, then 10 Hz, then back to 5 Hz, then up to 15 Hz maybe—i.e., N=3.
In this case, I have ground-truth class labels in a vector k and the model does both regression and classification, additionally outputting a vector k^. An example is shown in the Figure. As this is a multi-class problem with exclusivity, I figured binary cross entropy (BCE) loss should be relevant here.
I'm sure there is a lot of research about combining loss functions, but does just adding MSE and BCE make sense? Scaling one up or down by a factor of 10 doesn't seem to change the learning outcome too much. So I was wondering what is considered the standard approach to problems where there is a joint classification and regression objective.
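To make the question concrete, here is a hedged sketch (PyTorch, hypothetical tensor shapes) of the weighted-sum approach described above; note that it uses categorical cross-entropy rather than BCE, since the classes are exclusive:

import torch.nn as nn

mse = nn.MSELoss()
ce = nn.CrossEntropyLoss()      # exclusive multi-class; expects integer labels
lambda_cls = 1.0                # relative weight (the "factor of 10" knob)

def joint_loss(y_hat, y, k_logits, k):
    # y_hat, y:  (batch, T) regression output and target
    # k_logits:  (batch, num_classes, T) unnormalised class scores
    # k:         (batch, T) integer class labels
    return mse(y_hat, y) + lambda_cls * ce(k_logits, k)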
Additionally, on top of just BCE, I want to penalize k^ for quickly jumping around between classes; for example, if the model guesses one class, I'd like it to stay in that one class and switch only when it's necessary. See how in the Figure, there are fast dark blue blips in k^. I would like the same solid bands as seen in k, and naive BCE loss doesn't account for that.
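One possible formulation of that penalty (my assumption, not something from the post) is a total-variation-style term on the predicted class probabilities, which grows when the prediction flips between classes from one time step to the next:

import torch.nn.functional as F

def transition_penalty(k_logits):
    # k_logits: (batch, num_classes, T)
    probs = F.softmax(k_logits, dim=1)
    # L1 difference between consecutive time steps; large when predictions flip often.
    return (probs[:, :, 1:] - probs[:, :, :-1]).abs().mean()

# Added to the joint loss with its own weight, e.g.
# loss = mse(y_hat, y) + lambda_cls * ce(k_logits, k) + lambda_smooth * transition_penalty(k_logits)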
Appreciate any and all advice!

Understanding DINO (object classifier) model architecture

I am trying to understand the model architecture of DINO https://arxiv.org/pdf/2203.03605.pdf
These are the last few layers I see when I execute model.children() (shown in a screenshot that is not reproduced here).
Question 1)
In class_embed, (0) has dimensions 256 by 91, and if it's feeding into (1) of class_embed, shouldn't the first dimension of (1) be 91?
So, I realize (0) of class_embed is not actually feeding into (1) of class_embed. Could someone explain this to me?
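For what it's worth, here is a hedged illustration (not DINO's actual code) of why consecutively listed modules need not be chained: an nn.ModuleList just stores parallel heads, whereas only something like nn.Sequential implies that (0) feeds into (1).

import torch
import torch.nn as nn

# e.g. one classification head per decoder layer, all taking the same-sized input
heads = nn.ModuleList([nn.Linear(256, 91) for _ in range(6)])

x = torch.randn(1, 900, 256)        # (batch, num_queries, hidden_dim), hypothetical
outputs = [h(x) for h in heads]     # every head sees the same input size
print(outputs[0].shape)             # torch.Size([1, 900, 91])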
Question 2)
Also, the last layer (2) of the MLP (see the first picture, which says (5): MLP) has dimensions 256 by 4. So shouldn't the first dimension of class_embed (0) be 4?
Now, when I use a different function to print the layers, I see that the layers shown above appear grouped together. For example, there is only one layer of
Linear(in_features=256, out_features=91, bias=True)
Why does this function give me a different architecture?
Question 3)
Now, I went on to create a hook for the 3rd-last layer.
When I print the size, I get 1 by 900 by 256. Shouldn't I be getting something like 1 by 256 by 256?
The code I used to find the dimension, its output, and layer 4 were shown in screenshots that are not reproduced here.
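In case it helps, this is roughly how such a forward hook is typically registered in PyTorch (a toy module stands in for DINO here; the sizes are hypothetical):

import torch
import torch.nn as nn

toy = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 91))
captured = {}

def save_output(name):
    def hook(module, inputs, output):
        captured[name] = output.detach()
    return hook

handle = toy[2].register_forward_hook(save_output("head"))

x = torch.randn(1, 900, 256)       # (batch, num_queries, hidden_dim)
toy(x)
print(captured["head"].shape)      # torch.Size([1, 900, 91])
handle.remove()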

Question on the kernel dimensions for convolutions on mel filter bank features

I am currently trying to understand the following paper: https://arxiv.org/pdf/1703.08581.pdf. I am struggling to understand a part about how a convolution is performed on an input of log mel filterbank features:
We train seq2seq models for both end-to-end speech translation, and a baseline model for speech recognition. We found that the same architecture, a variation of that from [10], works well for both tasks. We use 80 channel log mel filterbank features extracted from 25ms windows with a hop size of 10ms, stacked with delta and delta-delta features. The output softmax of all models predicts one of 90 symbols, described in detail in Section 4, that includes English and Spanish lowercase letters.

The encoder is composed of a total of 8 layers. The input features are organized as a T × 80 × 3 tensor, i.e. raw features, deltas, and delta-deltas are concatenated along the 'depth' dimension. This is passed into a stack of two convolutional layers with ReLU activations, each consisting of 32 kernels with shape 3 × 3 × depth in time × frequency. These are both strided by 2 × 2, downsampling the sequence in time by a total factor of 4, decreasing the computation performed in the following layers. Batch normalization [26] is applied after each layer.
As I understand it, the input to the convolutional layer is 3-dimensional: number of 25 ms windows (T) x 80 (features for each window) x 3 (features, delta features and delta-delta features). However, the kernels used on those inputs seem to have 4 dimensions, and I do not understand why that is. Wouldn't a 4-dimensional kernel need a 4-dimensional input? In my head, the input has the same dimensions as an RGB picture: width (time) x height (frequency) x colour channels (features, delta features and delta-delta features). Therefore I would think of a kernel for a 2D convolution as a filter of size a (filter width) x b (filter height) x 3 (depth of the input). Am I missing something here? What is wrong with my idea, or what is done differently in this paper?
Thanks in advance for your answer!
I figured it out; it turns out it was just a misunderstanding on my side: the authors are using 32 kernels of shape 3x3, which results (after two layers with 2x2 striding) in an output of shape T/4 x 20 x 32, where T stands for the time dimension.
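To make the shape arithmetic concrete, here is a hedged sketch (PyTorch, hypothetical batch/time sizes; the padding is my assumption, since the paper does not state it) of the two convolutional layers described in the quote:

import torch
import torch.nn as nn

T = 400                                   # hypothetical number of frames
x = torch.randn(1, 3, T, 80)              # (batch, depth, time, frequency)

conv_stack = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.BatchNorm2d(32),
    nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.BatchNorm2d(32),
)

print(conv_stack(x).shape)                # torch.Size([1, 32, 100, 20]): T/4 in time, 80 -> 20 in frequency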

Convergence failure while training GAN for 128x128 images

Thanks for looking at this question!
I attempted to train a simple DCGAN to generate room designs from a dataset of 215 coloured images of size 128x128. My attempt can be summarised as follows:
Generator: 5 deconvolution layers from (100x1) noise input to (128x128x1) grayscale image output
Discriminator: 4 convolution layers from (128x128x1) grayscale image input
Optimizer: Adam at learning rate of 0.002 for both Generator and Discriminator
Batch size: 21 images/batch
Epoch: 100 epochs with 10 batches/epoch
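A hedged sketch of a 5-layer generator matching this description (the channel counts are my assumptions; the 100-dim noise is reshaped to 100 x 1 x 1):

import torch
import torch.nn as nn

G = nn.Sequential(
    nn.ConvTranspose2d(100, 512, 8, 1, 0), nn.BatchNorm2d(512), nn.ReLU(True),   # 1x1   -> 8x8
    nn.ConvTranspose2d(512, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.ReLU(True),   # 8x8   -> 16x16
    nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(True),   # 16x16 -> 32x32
    nn.ConvTranspose2d(128, 64, 4, 2, 1),  nn.BatchNorm2d(64),  nn.ReLU(True),   # 32x32 -> 64x64
    nn.ConvTranspose2d(64, 1, 4, 2, 1),    nn.Tanh(),                            # 64x64 -> 128x128
)

z = torch.randn(21, 100, 1, 1)       # one batch of 21 noise vectors
print(G(z).shape)                    # torch.Size([21, 1, 128, 128])

(For reference, the original DCGAN paper uses Adam with a learning rate of 0.0002 and beta1 = 0.5.)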
Results:
1. D-loss is close to 0 and G-loss is close to 1. After seeing this, I cut my discriminator down by 2 convolution layers and reduced the Adam learning rate to 0.00002, hoping that the discriminator would not overpower my generator.
2. After the changes in (1), D-loss and G-loss hover around 0.5 - 1.0. However, the generated images still look like noise even after 100 epochs.
Questions:
Is there something wrong in terms of how I trained my GAN?
How should I modify my approach to successfully train the GAN?
Thank you so much everyone for your help, really looking forward to your replies!

My ResNet-32, VGG16, VGG19 and DenseNet models do not converge

I have an interesting problem. I am working on a project in which I am trying to classify 15 classes of logos (14 logo classes + 1 non-logo class). The dataset is our own. I am using DIGITS 5/6, which employs Caffe; my Caffe is NVIDIA's 0.15.14 build.
I have trained with AlexNet and GoogLeNet, which ship with DIGITS. The models built both from scratch and by fine-tuning seem OK (GoogLeNet: 90% accuracy, AlexNet: 80%); the fine-tuned models were created from the ImageNet-pretrained weights.
My problem is that I wanted to extend my study to cover ResNet-32, DenseNet-121 and VGG16/19. Whenever I try to train these models, their top-1 accuracies are very poor (generally 0). You might guess (as I did) that this stems from building them from scratch. However, as far as I know, the model should still converge to some limit, but I always see a flat accuracy line (generally 0) after 2-3 epochs, and the loss value increases to 87 after a few epochs.
I have searched for possible reasons and tried the following:
1. I changed the weight_filler param to "xavier", but nothing changed (a PyTorch analogue of this is sketched after this list).
2. I increased the learning rate, but nothing changed.
3. I even used a pretrained model to fine-tune VGG16, but the result is still the same.
4. I used the CIFAR-10 dataset, upscaling the images to 224x224, and tried again, but the results are very similar to those on my logo dataset.
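Purely for illustration (this uses PyTorch/torchvision rather than Caffe, so it is an analogue of item 1 rather than the actual configuration), applying Xavier initialisation to every conv and fully connected layer would look like:

import torch.nn as nn
import torch.nn.init as init
from torchvision.models import vgg16

def xavier_init(module):
    # Apply Xavier (Glorot) uniform initialisation to conv and linear weights.
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        init.xavier_uniform_(module.weight)
        if module.bias is not None:
            init.zeros_(module.bias)

model = vgg16(num_classes=15)   # 14 logo classes + 1 non-logo class
model.apply(xavier_init)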
I am struggling to find the right approach. I am not an expert, but it seems so odd to me to get such bad results after getting good ones with AlexNet and GoogLeNet.
Why do my models not converge on these more recent networks? I need your advice.
By the way, my training data contains 400 images per logo class, and for the non-logo class I have collected 1200 images. The validation data contains a different number of images per logo class and a separate set of 1200 non-logo images. In total I have 5204 training, 579 test (10% of training) and 4729 validation images.
Here I am attaching the train_val.prototxt for my ResNet-32 model.
So what is my problem?
Thanks in advance
resnet32_train_val.prototxt