Initializing the forget gate in an LSTM network using PyTorch - deep-learning

I have read that controlling the bias term (in particular, how the forget-gate bias is initialized) is important for improving the performance of LSTM networks. Here are some sources:
https://www.exxactcorp.com/blog/Deep-Learning/5-types-of-lstm-recurrent-neural-networks-and-what-to-do-with-them
http://proceedings.mlr.press/v37/jozefowicz15.pdf
Does anyone know how to actually implement this in PyTorch?
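One possible way to do this is sketched below. It relies on the fact that nn.LSTM packs each bias vector in the gate order input, forget, cell, output, so the forget-gate slice is [hidden_size : 2 * hidden_size]. The helper name init_forget_gate_bias and the value 1.0 are illustrative assumptions, not something taken from the linked sources' code. Because PyTorch adds bias_ih and bias_hh together, the sketch puts the value in bias_ih and zeros the matching slice of bias_hh so that the effective forget bias equals the requested value.

```python
import torch
from torch import nn


def init_forget_gate_bias(lstm: nn.LSTM, value: float = 1.0) -> None:
    """Hypothetical helper: set the effective forget-gate bias of `lstm` to `value`."""
    h = lstm.hidden_size
    with torch.no_grad():
        for name, param in lstm.named_parameters():
            # Packed bias layout in nn.LSTM: [input | forget | cell | output],
            # each chunk of length hidden_size.
            if name.startswith("bias_ih"):
                param[h:2 * h].fill_(value)
            elif name.startswith("bias_hh"):
                # bias_ih and bias_hh are summed, so zero this slice
                # to keep the effective forget bias equal to `value`.
                param[h:2 * h].fill_(0.0)


lstm = nn.LSTM(input_size=8, hidden_size=16, num_layers=2)
init_forget_gate_bias(lstm, value=1.0)
```

The same loop also covers bidirectional LSTMs, since their reverse-direction parameters are named with the same bias_ih / bias_hh prefixes.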

Related

How to compute the Hessian of a large neural network in PyTorch?

How can I compute the Hessian matrix of a large neural network or transformer model such as BERT in PyTorch? I know about torch.autograd.functional.hessian, but it seems to compute the Hessian of a function, not of a neural network. I also saw the answer to "How to compute hessian matrix for all parameters in a network in pytorch?". The problem is that I want the Hessian with respect to the weights, and for a large neural network it is very inefficient to rewrite the model as a function of its weights. Is there a better way to do this? Any suggestion is appreciated. Thanks.
After some time I finally found a new feature in the PyTorch nightly build that solves this problem. The details are described in this comment: https://github.com/pytorch/pytorch/issues/49171#issuecomment-933814662. The solution uses the function torch.autograd.functional.hessian together with the new feature torch.nn.utils._stateless. Note that you have to install the nightly version of PyTorch to use this feature.
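The idea can be sketched as follows. This is not the code from the linked comment: it assumes PyTorch 2.0 or later, where the stateless call is available as torch.func.functional_call, and it uses a tiny nn.Linear model with random data as a stand-in for a real network. For a model the size of BERT, the full Hessian would be far too large to materialize this way.

```python
import torch
from torch import nn
from torch.func import functional_call  # assumption: PyTorch >= 2.0

torch.manual_seed(0)
model = nn.Linear(3, 1)   # stand-in for a real network
x = torch.randn(5, 3)     # illustrative inputs
y = torch.randn(5, 1)     # illustrative targets

names = [n for n, _ in model.named_parameters()]
params = tuple(p.detach() for p in model.parameters())


def loss_fn(*flat_params):
    # Run the module with the given tensors substituted for its parameters,
    # which makes the loss an explicit function of the weights.
    out = functional_call(model, dict(zip(names, flat_params)), (x,))
    return nn.functional.mse_loss(out, y)


# H[i][j] has shape params[i].shape + params[j].shape.
H = torch.autograd.functional.hessian(loss_fn, params)
```

Here H is a tuple of tuples with one block per pair of parameters, so H[0][0] is the weight-weight block and H[0][1] the weight-bias block.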

Pretrained model or training from scratch for object detection?

I have a dataset composed of 10k-15k pictures for supervised object detection which is very different from Imagenet or Coco (pictures are much darker and represent completely different things, industrial related).
The model currently used is a FasterRCNN which extracts features with a Resnet used as a backbone.
Could training the backbone of the model from scratch in one stage and then training the whole network in another stage be beneficial for the task, compared with loading the network pretrained on Coco and then retraining all the layers of the whole network in a single stage?
From my experience, here are some important points:
your training set is not big enough to train the detector from scratch (though this depends on the network configuration; fasterrcnn+resnet18 can work). It is better to use a network pre-trained on ImageNet;
the domain the network was pre-trained on is not really that important. A network, especially a big one, needs to learn all those arcs, circles, and other primitive shapes before it can use that knowledge to detect more complex objects;
the brightness of your training images can be important, but it is not something that should stop you from using a pre-trained network;
training from scratch requires many more epochs and much more data. The longer the training, the more sophisticated your LR control algorithm should be. At a minimum, the LR should not be constant but should change based on the cumulative loss. The initial settings depend on multiple factors, such as network size, augmentations, and the number of epochs;
I have experimented a lot with fasterrcnn+resnet (with various numbers of layers) and other networks. I recommend using maskrcnn instead of fasterrcnn: just configure it not to use the masks and not to do segmentation. I don't know why, but it gives much better results;
don't spend time on mobilenet; with your training set size you will not be able to train it to a reasonable AP and AR. Start with maskrcnn with a resnet18 backbone.
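As a rough illustration of the point about a non-constant, loss-driven learning rate, here is a minimal sketch using torch.optim.lr_scheduler.ReduceLROnPlateau. Note that this scheduler reacts to a plateau in the monitored loss rather than to a cumulative loss, so it is a stand-in for the idea rather than the exact rule described above. The tiny model, the random data, and the val_losses list are all hypothetical.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(4, 1)   # stand-in for a detector
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# Halve the LR when the monitored loss has not improved for more than 2 epochs.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=2
)

inputs, targets = torch.randn(32, 4), torch.randn(32, 1)
# Hypothetical per-epoch validation losses that stop improving.
val_losses = [1.0, 0.8, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7]

lrs = []
for val_loss in val_losses:
    # One dummy training step per "epoch".
    optimizer.zero_grad()
    nn.functional.mse_loss(model(inputs), targets).backward()
    optimizer.step()
    # The scheduler lowers the LR once the loss plateaus.
    scheduler.step(val_loss)
    lrs.append(optimizer.param_groups[0]["lr"])
```

The list lrs records the learning rate after each epoch, which makes it easy to see when the reduction happens.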

Training model in eval() mode gives better result in PyTorch?

I have a model with Dropout layers (with p=0.6). I trained the model once in .eval() mode and then again in .train() mode, and found that training in .eval() mode gave better accuracy and a quicker loss reduction on the training data:
train(): Train loss : 0.832, Validation Loss : 0.821
eval(): Train loss : 0.323, Validation Loss : 0.251
Why is this so?
It seems that the model architecture is too simple: in train mode it cannot capture the features in the data and therefore underfits.
eval() disables dropout and switches Batch Normalization layers to their running statistics, among other mode-dependent modules.
This means the model trains better without dropout, since more active neurons help it learn. Increasing the layer size, increasing the number of layers, or decreasing the dropout probability should also help.
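The effect of the two modes on a Dropout layer can be checked directly. The sketch below uses a standalone nn.Dropout(p=0.6) and a vector of ones as illustrative input: in train mode about 60% of the entries are zeroed and the survivors are scaled by 1 / (1 - p) = 2.5, while in eval mode the layer passes its input through unchanged.

```python
import torch
from torch import nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.6)
x = torch.ones(1000)   # illustrative input

drop.train()
y_train = drop(x)      # entries are either 0 or 1 / (1 - 0.6) = 2.5

drop.eval()
y_eval = drop(x)       # dropout is a no-op in eval mode

zero_fraction = (y_train == 0).float().mean().item()
```

With p=0.6, more than half of the activations are removed on every training step, which is consistent with the underfitting explanation above.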

Difference between keras.layers.Dense(32) and keras.layers.SimpleRNN(32)?

What is the difference between keras.layers.Dense() and keras.layers.SimpleRNN()? I understand what a neural network and an RNN are, but the intuition behind the API is not clear to me. When I see keras.layers.Dense(32), I understand it as a layer with 32 neurons, but it is not clear whether SimpleRNN(32) means the same thing. I am a newbie to Keras.
How do Dense() and SimpleRNN() differ from each other?
Do Dense() and SimpleRNN() function the same way at any point in time?
If so, when? If not, what is the difference between SimpleRNN() and Dense()?
It would be great if someone could help visualize it.
What exactly is happening in this example?
https://github.com/fchollet/keras/blob/master/examples/addition_rnn.py
Definitely different.
According to the Keras documentation, Dense implements the operation output = activation(dot(input, kernel) + bias); it is the basic building block of a neural network.
SimpleRNN, according to the Keras documentation, is a fully-connected RNN where the output is fed back to the input.
The structures of a feed-forward neural network and a recurrent neural network are different.
To answer your questions:
1. The difference between Dense() and SimpleRNN() is the difference between a traditional neural network and a recurrent neural network.
2. No. Each just defines the structure of its network, and the two work in different ways.
3. Same as 1.
Check resources about neural networks and recurrent neural networks; there are lots of them on the internet.
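To make the difference concrete, here is a minimal NumPy sketch of the two computations. It is not the Keras implementation: the function names, shapes, and random weights are illustrative, and tanh is used as the activation for both layers so they can be compared (in Keras, Dense has no activation by default). Dense applies the same transformation to each timestep independently, whereas SimpleRNN also mixes in the previous output through a recurrent kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
timesteps, features, units = 5, 4, 32
x = rng.normal(size=(timesteps, features))   # one illustrative sequence

kernel = rng.normal(size=(features, units))
recurrent_kernel = rng.normal(size=(units, units))
bias = np.zeros(units)


def dense(x, kernel, bias, activation=np.tanh):
    # output = activation(dot(input, kernel) + bias), per timestep, no memory.
    return activation(x @ kernel + bias)


def simple_rnn(x, kernel, recurrent_kernel, bias, activation=np.tanh):
    # h_t = activation(x_t @ kernel + h_{t-1} @ recurrent_kernel + bias)
    h = np.zeros(kernel.shape[1])
    states = []
    for x_t in x:
        h = activation(x_t @ kernel + h @ recurrent_kernel + bias)
        states.append(h)
    return np.stack(states)


dense_out = dense(x, kernel, bias)                      # shape (5, 32)
rnn_out = simple_rnn(x, kernel, recurrent_kernel, bias)  # shape (5, 32)
# With a zero recurrent kernel the feedback term vanishes.
rnn_no_feedback = simple_rnn(x, kernel, np.zeros((units, units)), bias)
```

In this sketch the two outputs agree only at the first timestep, where the previous state is zero, or everywhere if the recurrent kernel is set to zero. In both cases 32 is the size of the output vector; the RNN simply has an extra 32 x 32 recurrent weight matrix.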

Caffe Autoencoder

I want to compare the performance of a CNN and an autoencoder in Caffe. I am completely familiar with CNNs in Caffe, but I want to know whether an autoencoder also has a deploy.prototxt file. Are there any differences in using these two models other than the architecture?
Yes, it also has a deploy.prototxt.
Both train_val.prototxt and deploy.prototxt are network architecture description files. The sole difference between them is that train_val.prototxt takes training data as input and produces a loss as output, whereas deploy.prototxt takes a test image as input and produces the predicted value as output.
Here is an example of a CNN and an autoencoder for MNIST: Caffe Examples. (I have not tried the examples.) Using the models is generally the same. Learning rates etc. depend on the model.
You need to implement an autoencoder example using Python or MATLAB. The example in Caffe is not a true autoencoder because it does not include a layer-wise training stage, and during training it does not enforce W{L->L+1} = W{L+1->L+2}^T. It is easy to find a 1D autoencoder on GitHub, but a 2D autoencoder may be hard to find.
The main differences between autoencoders and conventional networks are:
In an autoencoder, your input is also your label image for training.
An autoencoder tries to produce an output that approximates its input.
An autoencoder does not have a softmax layer during training.
It can be used as a pre-trained model for your network, which then converges faster than with other pre-trained models, because the network has already learned to extract features from your data.
You can then perform conventional training and testing on the pre-trained autoencoder network for faster convergence and better accuracy.
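The points above can be sketched in a few lines. The example is written in PyTorch rather than Caffe, purely for illustration: the class name TiedAutoencoder, the layer sizes, and the random input batch are assumptions. It shows the input being used as its own training target, a reconstruction loss in place of a softmax, and a decoder that reuses the transposed encoder weight, which is one way to realize the tied-weight constraint mentioned earlier.

```python
import torch
from torch import nn
import torch.nn.functional as F


class TiedAutoencoder(nn.Module):
    """Hypothetical single-hidden-layer autoencoder with tied weights."""

    def __init__(self, n_in: int = 784, n_hidden: int = 64):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_hidden, n_in) * 0.01)
        self.enc_bias = nn.Parameter(torch.zeros(n_hidden))
        self.dec_bias = nn.Parameter(torch.zeros(n_in))

    def forward(self, x):
        # Encoder: code = sigmoid(x @ W^T + b_enc)
        code = torch.sigmoid(F.linear(x, self.weight, self.enc_bias))
        # Decoder reuses the same matrix transposed: recon = code @ W + b_dec
        return F.linear(code, self.weight.t(), self.dec_bias)


torch.manual_seed(0)
model = TiedAutoencoder()
x = torch.rand(16, 784)       # illustrative batch of flattened images
recon = model(x)
loss = F.mse_loss(recon, x)   # the input is its own label; no softmax involved
loss.backward()
```

After training, the encoder part can serve as the initialization for a conventional supervised network.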