Fine-tuning with a very low learning rate. Any sign that something is not good? - reinforcement-learning

I have working with deep reinforcement learning and in the literature, usually the learning rates are lower than I found in other settings.
My model is the following one:
def create_model(self):
model = Sequential()
model.add(LSTM(HIDDEN_NODES, input_shape=(STATE_SIZE, STATE_SPACE), return_sequences=False))
model.add(Dense(HIDDEN_NODES, activation='relu', kernel_regularizer=regularizers.l2(0.000001)))
model.add(Dense(HIDDEN_NODES, activation='relu', kernel_regularizer=regularizers.l2(0.000001)))
model.add(Dense(ACTION_SPACE, activation='linear'))
# Compile the model
model.compile(loss=tf.keras.losses.Huber(delta=1.0), optimizer=Adam(lr=LEARNING_RATE, clipnorm=1))
return model
Where the initial learning rate (lr) is 3e-5. For the fine-tuning, I freeze the first two layers (this step is essential in my settings) and decrease the learning rate to 3e-9. During the fine-tuning, the model might suffer from a distributional shift once the source of samples is perturbed data. Is there another source of problems besides this for such a low learning rate to keep my model improving?

First, Show me your data sample.
Theoretical Answer:
We have learned how perturbation helps in solving various issues related to neural network training or trained model. Here, we have seen perturbation in three components (gradients, weights, inputs) associated with neural-network training and trained model; perturbation, in gradients is to tackle vanishing gradient problem, in weights for escaping saddle point, and in inputs to avoid malicious attacks. Overall, perturbations in different ways play the role of strengthening the model against various instabilities, for example, it can avoid staying at correctness wreckage point since such position will be tested with perturbation (input, weight, gradient) which will make the model approach towards correctness attraction point.
As of now, perturbation is mainly contingent to empirical-experimentation designed from intuition to solve encountering problems. One needs to experiment if perturbing a component of the training process makes sense intuitively, and further verify empirically if it helps mitigate the problem. Nevertheless, in future, we will see more on perturbation theory in deep learning or machine learning in general which might also be backed by a theoretical guarantee.

Related

Keras Applications (Transfer Learning)

I am a student and currently studying deep learning by myself. Here I would like to ask for clarification regarding the transfer learning.
For example MobileNetv2 (https://keras.io/api/applications/mobilenet/#mobilenetv2-function), if the weights parameter is set to None, then I am not doing transfer learning as the weights are random initialized. If I would like to do transfer learning, then I should set the weights parameter to imagenet. Is this concept correct?
Clarification and explanation regarding deep learning
Yes, when you initialize the weights with random values, you are just using the architecture and training the model from scratch. The goal of transfer learning is to use the previously gained knowledge by another trained model to get better results or to use less computational resources.
There are different ways to use transfer learning:
You can freeze the learned weights of the base model and replace the last layer of the model base on your problem and just train the last layer
You can start with the learned weights and fine-tune them (let them change in the learning process). Many people do that because sometimes it makes the training faster and gives better results because the weights already contain so much information.
You can use the first layers to extract basic features like colors, edges, circles... and add your desired layers after them. In this way, you can use your resources to learn high-level features.
There are more cases, but I hope it could give you an idea.

Why is WGAN considered an adversarial network?

I finished going over the WGAN paper: WGAN Paper Link
After reading the algorithm provided by the writers I find it odd that they would refer to the network as an adversarial network.
In the first part of the algorithm a 'critic' is trained to optimality and they show this critic approximates the Wasserstein distance between our generator distribution and the real distribution. We then take this approximation and update the parameters of the generator distribution in the direction of the gradient of the critic. So in a sense we're just approximating a loss function and then we tell the generator in what direction is best to go. so a critic is a very good name for this, but calling it an adversarial network implies that the generator and the critic are at odds. Any ideas why this should still be nicknamed an adversarial network?
The name "adversarial" does not come from this paper, it comes from the GAN itself, this paper is merely an incremental work on top (and thus is not renaming anything). The reason why the "original" GAN is called Generative Adversarial Network is because it is trained in a form of a two-player, competitive game, where a generator task is to fool discriminator, and discriminators task is to well, not be fooled. This is the "at odds" part. And it is indeed critical to the whole system, vast majority of problems of GANs, that spawned hundreds of papers (like the one above) comes from the fact that greedy optimization of 2 player games has much more chaotic dynamics, and will not "just converge with small enough learning rate" that normal minimization of (smooth enough) loss function would. From math perspective, the subtle difference that makes things chaotic is that gradients that train discriminator are not back-propagated to the generator. Otherwise generator would be "helping" discriminator. Because of this stop gradient the emerging dynamics is no longer a gradient vector field of any loss, and instead it is a dynamical system emerging from simultaneous minimization of 2 functions (also called 2 player games).

In deep learning, can I change the weight of loss dynamically?

Call for experts in deep learning.
Hey, I am recently working on training images using tensorflow in python for tone mapping. To get the better result, I focused on using perceptual loss introduced from this paper by Justin Johnson.
In my implementation, I made the use of all 3 parts of loss: a feature loss that extracted from vgg16; a L2 pixel-level loss from the transferred image and the ground true image; and the total variation loss. I summed them up as the loss for back propagation.
From the function
yˆ=argminλcloss_content(y,yc)+λsloss_style(y,ys)+λTVloss_TV(y)
in the paper, we can see that there are 3 weights of the losses, the λ's, to balance them. The value of three λs are probably fixed throughout the training.
My question is that does it make sense if I dynamically change the λ's in every epoch(or several epochs) to adjust the importance of these losses?
For instance, the perceptual loss converges drastically in the first several epochs yet the pixel-level l2 loss converges fairly slow. So maybe the weight λs should be higher for the content loss, let's say 0.9, but lower for others. As the time passes, the pixel-level loss will be increasingly important to smooth up the image and to minimize the artifacts. So it might be better to adjust it higher a bit. Just like changing the learning rate according to the different epochs.
The postdoc supervises me straightly opposes my idea. He thought it is dynamically changing the training model and could cause the inconsistency of the training.
So, pro and cons, I need some ideas...
Thanks!
It's hard to answer this without knowing more about the data you're using, but in short, dynamic loss should not really have that much effect and may have opposite effect altogether.
If you are using Keras, you could simply run a hyperparameter tuner similar to the following in order to see if there is any effect (change the loss accordingly):
https://towardsdatascience.com/hyperparameter-optimization-with-keras-b82e6364ca53
I've only done this on smaller models (way too time consuming) but in essence, it's best to keep it constant and also avoid angering off your supervisor too :D
If you are running a different ML or DL library, there are optimizer for each, just Google them. It may be best to run these on a cluster and overnight, but they usually give you a good enough optimized version of your model.
Hope that helps and good luck!

caffe - how to properly train alexnet with only 7 classes

I have a small dataset collect from imagenet(7 classes each class with 1000 training data). I try to train it with alexnet model. But somehow the accuracy just cant go any higher(about 68% maximum). I remove conv4 and conv5 layer to prevent model overfitting also decrease the number of neuron in each layer(conv and fc). here is my setup.
Did i do anything wrong so that the accuracy is so low?
I want to sort out a few terms:
(1) A perceptron is an individual cell in a neural net.
(2) In a CNN, we generally focus on the kernel (filter) as a unit; this is the square matrix of perceptrons that forms a psuedo-visual unit.
(3) The only place it usually makes sense to focus on an individual perceptron is in the FC layers. When you talk about removing some of the perceptrons, I think you mean kernels.
The most important part of training a model is to make sure that your model is properly fitted to the problem at hand. AlexNet (and CaffeNet, the BVLC implementation) is fitted to the full ImageNet data set. Alex Krizhevsky and his colleagues spent a lot of research effort in tuning their network to the problem. You are not going to get similar accuracy -- on a severely reduced data set -- by simply removing layers and kernels at random.
I suggested that you start from CONVNET (the CIFAR-10 net) because it's much better tuned to this scale of problem. Most of all, I strongly recommend that you make constant use of your visualization tools, so that you can detect when the various kernel layers begin to learn their patterns, and to see the effects of small changes in the topology.
You need to run some experiments to tune and understand your topology. Record the kernel visualizations at chosen times during the training -- perhaps at intervals of 10% of expected convergence -- and compare the visual acuity as you remove a few kernels, or delete an entire layer, or whatever else you choose.
For instance, I expect that if you do this with your current amputated CaffeNet, you'll find that the severe losses in depth and breadth greatly change the feature recognition it's learning. The current depth of building blocks is not enough to recognize edges, then shapes, then full body parts. However, I could be wrong -- you do have three remaining layers. That's why I asked you to post the visualizations you got, to compare with published AlexNet features.
edit: CIFAR VISUALIZATION
CIFAR is much better differentiated between classes than is ILSVRC-2012. Thus, the training requires less detail per layer and fewer layers. Training is faster, and the filters are not nearly as interesting to the human eye. This is not a problem with the Gabor (not Garbor) filter; it's just that the model doesn't have to learn so many details.
For instance, for CONVNET to discriminate between a jonquil and a jet, we just need a smudge of yellow inside a smudge of white (the flower). For AlexNet to tell a jonquil from a cymbidium orchid, the network needs to learn about petal count or shape.

Overview for Deep Learning Networks

I am fairly new to Deep Learning and get quite overwhelmed by the many different Nets and their field of application. Thus, I want to know if there is some kind of overview which kind of different networks exist, what there key-features are and what kind of purpose they have.
For example I know abut LeNet, ConvNet, AlexNet - and somehow they are the same but still differ?
There are basically two types of neural networks, supervised and unsupervised learning. Both need a training set to "learn". Imagine training set as a massive book where you can learn specific information. In supervised learning, the book is supplied with answer key but without the solution manual, in contrast, unsupervised learning comes without answer key or solution manual. But the goal is the same, which is that to find patterns between the questions and answers (supervised learning) and questions (unsupervised learning).
Now we have differentiate between those two, we can go into the models. Let's discuss about supervised learning, which basically has 3 main models:
artificial neural network (ANN)
convolutional neural network (CNN)
recurrent neural network (RNN)
ANN is the simplest of all three. I believe that you have understand it, so we can move forward to CNN.
Basically in CNN all you have to do is to convolve our input with feature detectors. Feature detectors are matrices which have the dimension of (row,column,depth(number of feature detectors). The goal of convolving our input is to extract informations related to spatial data. Let's say you want to distinguish between cats and dogs. Cats have whiskers but dogs does not. Cats also have different eyes than dogs and so on. But the downside is, the more convolution layers will result in slower computation time. To mitigate that, we do some kind of processing called pooling or downsampling. Basically, this reduce the size of feature detectors while minimizing lost features or information. Then the next step would be flattening or squashing all those 3d matrix into (n,1) dimension so you can input it into ANN. Then the next step is self explanatory, which is normal ANN. Because CNN is inherently able to detect certain features, it mostly(maybe always) used for classification, for example image classification, time series classification, or maybe even video classification. For a crash course in CNN, check out this video by Siraj Raval. He's my favourite youtuber of all time!
Arguably the most sophisticated of all three, RNN is bestly described as neural networks that have "memory" by introducing "loops" within them which allow information to persist. Why is this important? As you are reading this, your brain use previous memory to comprehend all of this information. You don't seem to rethink everything from scratch again and this is what traditional neural networks do, which is to forget everything and re-learn again. But native RNN aren't effective so when people talk about RNN they mostly refer to LSTM which stands for Long Short-Term Memory. If that seems confusing to you, Cristopher Olah will give you in depth explanation in a very simple way. I advice you to check out his link for complete understanding about how RNN, especially LSTM variant
As for unsupervised learning, I'm so sorry that I haven't got the time to learn them, so this is the best I can do. Good luck and have fun!
They are the same type of Networks. Convolutional Neural Networks. The problem with the overview is that as soon as you post something it is already outdated. Most of the networks you describe are already old, even though they are only a few years old.
Nevertheless you can take a look at the networks supplied by caffe (https://github.com/BVLC/caffe/tree/master/models).
In my personal view the most important concepts in deep Learning are recurrent networks (https://keras.io/layers/recurrent/), residual connections, inception blocks (see https://arxiv.org/abs/1602.07261). The rest are largely theoretical concepts, which would not fit in a stack overflow answer.