Are there any downsides to using Pytorch's Lazy* modules? - deep-learning

PyTorch 1.8 introduced the Lazy* modules (e.g., LazyLinear), which remove the need to specify input sizes. This is very convenient when optimizing hyperparameters of deep neural network models during the design and architecture phases.
Are there any downsides to using these modules versus the traditional implementations that require specifying input sizes? Why would one not use the Lazy version of the implementations, other than to be overly explicit?
Example with Lazy* layers
model = nn.Sequential(
    nn.Conv1d(in_channels=1, out_channels=32, kernel_size=2048, stride=128),
    nn.ReLU(),
    nn.MaxPool1d(kernel_size=8, stride=2),
    # note: input channels not required
    nn.LazyConv1d(out_channels=32, kernel_size=8, stride=2, padding=1),
    nn.ReLU(),
    nn.MaxPool1d(kernel_size=2, stride=2),
    nn.Flatten(),
    # note: input features not required
    nn.LazyLinear(8),
)
PyTorch 1.8 documentation
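One practical caveat worth keeping in mind: lazy layers hold uninitialized parameters until the first forward pass, so anything that inspects parameters (parameter counts, weight initialization, optimizer construction) should happen after a dry run. A minimal sketch of that, assuming a made-up input shape of (batch, 1, 16384):

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv1d(in_channels=1, out_channels=32, kernel_size=2048, stride=128),
    nn.ReLU(),
    nn.MaxPool1d(kernel_size=8, stride=2),
    nn.LazyConv1d(out_channels=32, kernel_size=8, stride=2, padding=1),
    nn.ReLU(),
    nn.MaxPool1d(kernel_size=2, stride=2),
    nn.Flatten(),
    nn.LazyLinear(8),
)

# Before the first forward pass the lazy layers hold uninitialized parameters,
# so counting them or handing them to an optimizer here would fail or mislead.
dummy = torch.randn(2, 1, 16384)   # hypothetical input shape for the dry run
model(dummy)                       # materializes the lazy layers

print(sum(p.numel() for p in model.parameters()))            # now well-defined
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)    # safe after the dry run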

Related

Initializing the forget gate in LSTM network using Pytorch

I have read that initializing the forget-gate bias is important to improve the performance of LSTM networks. Here are some sources:
https://www.exxactcorp.com/blog/Deep-Learning/5-types-of-lstm-recurrent-neural-networks-and-what-to-do-with-them
http://proceedings.mlr.press/v37/jozefowicz15.pdf
Does anyone know how to actually implement this in Pytorch?
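For reference, a minimal sketch of the common trick of setting the forget-gate bias to 1 in PyTorch (the hidden size and input size are placeholders; nn.LSTM stores its biases as concatenated input/forget/cell/output gate chunks):

import torch
import torch.nn as nn

hidden_size = 128
lstm = nn.LSTM(input_size=64, hidden_size=hidden_size, num_layers=1)

# nn.LSTM keeps bias_ih_l* / bias_hh_l* as vectors of shape (4 * hidden_size),
# ordered as input, forget, cell, output gates.
for name, param in lstm.named_parameters():
    if "bias" in name:
        with torch.no_grad():
            # the slice covering the forget gate
            param[hidden_size:2 * hidden_size].fill_(1.0)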

Using WordPiece tokenization with RoBERTa

As far as I understand, the RoBERTa model implemented in the huggingface library uses a BPE tokenizer. Here is the relevant note from the documentation:
RoBERTa has the same architecture as BERT, but uses a byte-level BPE as a tokenizer (same as GPT-2) and uses a different pretraining scheme.
However, I have a custom tokenizer based on WordPiece tokenization and I used the BertTokenizer.
Because my customized tokenizer is much more relevant for my task, I prefer not to use BPE.
When I pre-trained RoBERTa from scratch (RobertaForMaskedLM) with my custom tokenizer, the loss for the MLM task was much better than the loss with BPE. However, when it comes to fine-tuning, the model (RobertaForSequenceClassification) performs poorly. I am almost sure the problem is not the tokenizer itself. I wonder whether RobertaForSequenceClassification in the huggingface library is somehow incompatible with my tokenizer.
Details about the fine-tuning:
task: multilabel classification with imbalanced labels.
epochs: 20
loss: BCEWithLogitsLoss()
optimizer: Adam, weight_decay_rate:0.01, lr: 2e-5, correct_bias: True
The F1 and AUC were very low because the output probabilities for the labels were not in accordance with the actual labels (even with a very low threshold), which suggests the model couldn't learn anything.
Note: the RoBERTa pre-trained and fine-tuned with the BPE tokenizer performs better than the one pre-trained and fine-tuned with the custom tokenizer, although the MLM loss with the custom tokenizer was better than with BPE.
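For context, a rough sketch of how the fine-tuning setup described above might be wired up. The number of labels and model path are placeholders, torch.optim.AdamW stands in for the optimizer settings listed above, and problem_type="multi_label_classification" makes RobertaForSequenceClassification apply BCEWithLogitsLoss internally (assuming a recent transformers version):

import torch
from transformers import RobertaConfig, RobertaForSequenceClassification

num_labels = 10  # placeholder: number of labels in the multilabel task

config = RobertaConfig.from_pretrained(
    "path/to/custom-pretrained-roberta",         # hypothetical path to the custom-pretrained model
    num_labels=num_labels,
    problem_type="multi_label_classification",   # BCEWithLogitsLoss is used internally
)
model = RobertaForSequenceClassification.from_pretrained(
    "path/to/custom-pretrained-roberta", config=config
)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)

# labels must be float multi-hot vectors of shape (batch_size, num_labels), e.g.:
# loss = model(input_ids=..., attention_mask=..., labels=labels).loss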

Should dropout be deactivated when training a model with some freezed modules?

I have a deep neural network made of a combination of modules, such as an encoder, a decoder, etc. Before training, I load part of its parameters from a pretrained model, but only for a subset of the modules. For instance, I could load a pretrained encoder. Then I want to freeze the parameters of the pretrained modules so that they are not trained along with the rest. In Pytorch:
for param in submodel.parameters():
    param.requires_grad = False
Now, should I keep applying dropout to these frozen modules while training, or should I deactivate it (see example below)? Why?
class MyModel(nn.Module):
    ...
    def forward(self, x):
        if self.freeze_submodule:
            self.submodule.eval()  # disable dropout when submodule is frozen
        x = self._forward(x)
        if self.freeze_submodule:
            self.submodule.train()
        return x
Freezing module
You can freeze parameters by setting requires_grad_(False), which is less verbose:
submodel.requires_grad_(False)
This will freeze all submodel parameters.
You could also use the torch.no_grad() context manager over the submodule's forward pass, but that is less common.
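A minimal sketch of both options (the tiny Linear layers below just stand in for the pretrained submodule and the trainable head):

import torch
import torch.nn as nn

submodel = nn.Linear(16, 32)   # stands in for the pretrained submodule
head = nn.Linear(32, 4)        # the part that is still being trained
x = torch.randn(8, 16)

# Option 1: freeze every parameter of the submodule in one call
submodel.requires_grad_(False)
out = head(submodel(x))

# Option 2: skip gradient tracking for the submodule's forward pass only
with torch.no_grad():
    features = submodel(x)
out = head(features)           # the trainable head still builds a graph as usual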
eval
Running submodule.eval() puts certain layers into evaluation mode (e.g., BatchNorm or Dropout). For Dropout (inverted dropout, actually) you can check how it works in this answer.
Q: should dropout still be applied to frozen parameters?
No, as the frozen weights will be unable to compensate for dropout's effect, which is one of its goals (making the network more robust and spreading information flow across more paths). They cannot adapt because they are untrainable.
On the other hand, leaving dropout on adds more noise and error and might force the trainable part of the network to compensate for it, so I'd go for experimenting.
freezing pretrained submodules is useful to avoid their weights being messed up by the gradients that will result from training non-pretrained submodules
It depends; the fastai community uses smaller learning rates for pretrained modules while still leaving them trainable (see this blog post for example), which makes intuitive sense: the task's distribution is somewhat different from the one your backbone was pretrained on, hence it's reasonable to assume the weights need to be adjusted by some (possibly small) amount as well.
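A minimal sketch of that alternative using per-parameter-group learning rates (the layers and learning-rate values are purely illustrative):

import torch
import torch.nn as nn

backbone = nn.Linear(16, 32)   # stands in for the pretrained backbone
head = nn.Linear(32, 4)        # stands in for the newly added head

# Keep the pretrained part trainable, but with a much smaller learning rate
optimizer = torch.optim.Adam([
    {"params": backbone.parameters(), "lr": 1e-5},
    {"params": head.parameters(), "lr": 1e-3},
])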

Is the Xception model in Keras the best model described in the paper?

I read the Xception paper, and in section 4.7 it is mentioned that the best results are achievable without any activation. Now I want to use this network on videos with the Keras toolbox, but the model in Keras uses the ReLU activation function. Does the Keras model correspond to the best variant, or is it better to omit the ReLU layers?
You are confusing the normal activations used for convolutional and dense layers with the ones mentioned in the paper. Section 4.7 only deals with varying the activation between the depth-wise and point-wise convolutions; the rest of the activations in the architecture are kept unchanged.
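To make the distinction concrete, here is a rough Keras-style sketch of the two variants section 4.7 compares (the layer sizes are illustrative and this is not the exact Xception block):

from tensorflow import keras
from tensorflow.keras import layers

def separable_block(x, filters, intermediate_activation=None):
    # depthwise convolution
    x = layers.DepthwiseConv2D(kernel_size=3, padding="same")(x)
    if intermediate_activation:  # this is the activation that section 4.7 varies
        x = layers.Activation(intermediate_activation)(x)
    # pointwise (1x1) convolution
    x = layers.Conv2D(filters, kernel_size=1, padding="same")(x)
    # the usual activation after the block is kept in all variants
    return layers.Activation("relu")(x)

inputs = keras.Input(shape=(64, 64, 3))
x = separable_block(inputs, 32, intermediate_activation=None)      # no intermediate activation (best in the paper)
# x = separable_block(inputs, 32, intermediate_activation="relu")  # variant with ReLU in between
model = keras.Model(inputs, x)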

Caffe Autoencoder

I want to compare the performance of a CNN and an autoencoder in Caffe. I am familiar with CNNs in Caffe, but does the autoencoder also have a deploy.prototxt file? Are there any differences in using these two models other than the architecture?
Yes, it also has a deploy.prototxt.
Both train_val.prototxt and deploy.prototxt are network architecture description files. The sole difference between them is that train_val.prototxt takes training data and produces a loss, while deploy.prototxt takes a test image as input and produces a prediction as output.
Here is an example of a CNN and an autoencoder for MNIST: Caffe Examples. (I have not tried the examples.) Using the two models is generally the same; learning rates etc. depend on the model.
You would need to implement an autoencoder example yourself using Python or MATLAB. The example in Caffe is not a true autoencoder because it doesn't use a layer-wise training stage, and during training it doesn't tie the weights, i.e., fix W_{L->L+1} = W_{L+1->L+2}^T. It is easy to find a 1D autoencoder on GitHub, but a 2D autoencoder may be hard to find.
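As an aside, the tied-weight constraint mentioned above is easy to express in a few lines of Python; here is a purely illustrative PyTorch sketch (not Caffe), where the decoder reuses the transposed encoder weights:

import torch
import torch.nn as nn
import torch.nn.functional as F

class TiedAutoencoder(nn.Module):
    """Autoencoder whose decoder reuses the transposed encoder weights."""
    def __init__(self, in_dim=784, hidden_dim=128):
        super().__init__()
        self.encoder = nn.Linear(in_dim, hidden_dim)
        self.decoder_bias = nn.Parameter(torch.zeros(in_dim))

    def forward(self, x):
        h = torch.sigmoid(self.encoder(x))
        # decoder weight is tied: W_decode = W_encode^T
        return F.linear(h, self.encoder.weight.t(), self.decoder_bias)

model = TiedAutoencoder()
x = torch.randn(4, 784)
loss = F.mse_loss(model(x), x)   # the input is also the reconstruction target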
The main differences between autoencoders and a conventional network are:
In an autoencoder, the input is also the label (target) image during training.
An autoencoder tries to make its output approximate its input.
Autoencoders do not have a softmax layer during training.
An autoencoder can be used as a pretrained model for your network, which then converges faster compared to other pretrained models, because the network has already extracted the features of your data.
You can then perform conventional training and testing on the pretrained autoencoder network for faster convergence and better accuracy.
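A rough sketch of that pretrain-then-reuse pattern (again in PyTorch rather than Caffe, purely illustrative):

import torch
import torch.nn as nn

# 1) train an autoencoder on unlabeled data (training loop omitted)
encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU())
decoder = nn.Linear(128, 784)
autoencoder = nn.Sequential(encoder, decoder)

# 2) reuse the trained encoder as the feature extractor of a classifier;
#    the classification (softmax) head is added only at this stage
classifier = nn.Sequential(encoder, nn.Linear(128, 10))

x = torch.randn(4, 784)
logits = classifier(x)   # fine-tune with a standard cross-entropy loss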