Why `ep_rew_mean` much larger than the reward evaluated by the `evaluate_policy()` fuction - reinforcement-learning

I write a custom gym environment, and trained with PPO provided by stable-baselines3. The ep_rew_mean recorded by tensorboard is as follow:
the ep_rew_mean curve for total 100 million steps, each episode has 50 steps
As shown in the figure, the reward is around 15.5 after training, and the model converges. However, I use the function evaluate_policy() for the trained model, and the reward is much smaller than the ep_rew_mean value. The first value is mean reward, the second value is std of reward:
4.349947246664763 1.1806464511030819
the way I use function evaluate_policy() is:
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10000)
According to my understanding, the initial environment is randomly distributed in an area when using reset() fuction, so there should not be overfitting problem.
I have also tried different learning rate or other parameters, and this problem is not solved.
I have checked my environment, and I think there is no error.
I have searched on the internet, read the doc of stable-baselines3 and issues on github, but did not find the solution.

evaluate_policy has deterministic to True by default (https://stable-baselines3.readthedocs.io/en/master/common/evaluation.html).
If you sample from the distribution during training, it may help to evaluate the policy without it selecting the actions with an argmax (by passing in deterministic=False).

Related

Most wierd loss function shape (because of weight decay parameter)

I am training a large neural network model (1 module Hourglass) for a facial landmark recognition task. Database used for training is WFLW.
Loss function used is MSELoss() between the predicted output heatmaps, and the ground-truth heatmaps.
- Batch size = 32
- Adam Optimizer
- Learning rate = 0.0001
- Weight decay = 0.0001
As I am building a baseline model, I have launched a basic experiment with the parameters shown above. I previously had executed a model with the same exact parameters, but with weight-decay=0. The model converged successfully. Thus, the problem is with the weight-decay new value.
I was expecting to observe a smooth loss function that slowly decreased. As it can be observed in the image below, the loss function has a very very wierd shape.
This will probably be fixed by changing the weight decay parameter (decreasing it, maybe?).
I would highly appreciate if someone could provide a more in-depth explanation into the strange shape of this loss function, and its relation with the weight-decay parameter.
In addition, to explain why this premature convergence into a very specific value of 0.000415 with a very narrow standard deviation? Is it a strong local minimum?
Thanks in advance.
Loss should not consistently increase when using gradient descent. It does not matter if you use weight decay or not, there is either a bug in your code (e.g. worth checking what happens with normal gradient descent, not Adam, as there are ways in which one can wrongly implement weight decay with Adam), or your learning rate is too large.

LSTM Evolution Forecast

I have a confusion about the way the LSTM networks work when forecasting with an horizon that is not finite but I'm rather searching for a prediction in whatever time in future. In physical terms I would call it the evolution of the system.
Suppose I have a time series $y(t)$ (output) I want to forecast, and some external inputs $u_1(t), u_2(t),\cdots u_N(t)$ on which the series $y(t)$ depends.
It's common to use the lagged value of the output $y(t)$ as input for the network, such that I schematically have something like (let's consider for simplicity just lag 1 for the output and no lag for the external input):
[y(t-1), u_1(t), u_2(t),\cdots u_N(t)] \to y(t)
In this way of thinking the network, when one wants to do recursive forecast it is forced to use the predicted value at the previous step as input for the next step. In this way we have an effect of propagation of error that makes the long term forecast badly behaving.
Now, my confusion is, I'm thinking as a RNN as a kind of an (simple version) implementation of a state space model where I have the inputs, my output and one or more state variable responsible for the memory of the system. These variables are hidden and not observed.
So now the question, if there is this kind of variable taking already into account previous states of the system why would I need to use the lagged output value as input of my network/model ?
Getting rid of this does my long term forecast would be better, since I'm not expecting anymore the propagation of the error of the forecasted output. (I guess there will be anyway an error in the internal state propagating)
Thanks !
Please see DeepAR - a LSTM forecaster more than one step into the future.
The main contributions of the paper are twofold: (1) we propose an RNN
architecture for probabilistic forecasting, incorporating a negative
Binomial likelihood for count data as well as special treatment for
the case when the magnitudes of the time series vary widely; (2) we
demonstrate empirically on several real-world data sets that this
model produces accurate probabilistic forecasts across a range of
input characteristics, thus showing that modern deep learning-based
approaches can effective address the probabilistic forecasting
problem, which is in contrast to common belief in the field and the
mixed results
In this paper, they forecast multiple steps into the future, to negate exactly what you state here which is the error propagation.
Skipping several steps allows to get more accurate predictions, further into the future.
One more thing done in this paper is predicting percentiles, and interpolating, rather than predicting the value directly. This adds stability, and an error assessment.
Disclaimer - I read an older version of this paper.

QAT output nodes for Quantized Model got the same min max range

Recently, I have worked on quantization aware training on tf1.x to push the model to Coral Dev Board. However, when I finished training the model, why is my min max of my 2 outputs fake quantization is the same?
Should it be different when one's maximum target is 95 and one is 2pi?
I have figured out the problem. It is the problem when that part of the model is not really trained QAT. This happens for the output node that somehow forgets to QAT when training. The -6 and 6 values come from the default source of the quantization of tf1.x as mention here
To overcome the problem, we should provide some op to trigger the QAT for the output nodes. In my regression case, I add a dummy op: tf.maximum(output,0) in the model to make the node QAT. If your output is strictly between 0-1, applying "sigmoid" activation at output instead of relu can also solve the problems.

Training and test diverge while running catboost

When I run the catboost regressor my training and test plots diverge with weird kinks at ~1000 iterations. The plot is appended below and my regressor setup is as follows:
cat_model=CatBoostRegressor(iterations=2500, depth=4, learning_rate=0.01, loss_function='RMSE', thread_count=-1, use_best_model = True, random_seed=12, random_strength=10, rsm=0.5)
I tried different values of leaf_estimation_iterations & bagging_temperature but did not get any success. Any suggestions on what i should try to get better results.
Model Fit Plot
The diverge is normal. you will always perform better on the train set, as the model overfits the training set, and your objective is to regulate it with the validation set.
First I would recommend to read on bias vs variance tradeoff for a general intuition on how to tackle this issue.
specifically for catboost, you would like to regularize the training procedure so it would generalize better.
you can start with adding more data, and set higher l2_leaf_reg parameter.
The official documentation have much more good suggestions on model tuning:
https://catboost.ai/docs/concepts/parameter-tuning.html

Reinforce Learning: Do I have to ignore hyper parameter(?) after training done in Q-learning?

Learner might be in training stage, where it update Q-table for bunch of epoch.
In this stage, Q-table would be updated with gamma(discount rate), learning rate(alpha), and action would be chosen by random action rate.
After some epoch, when reward is getting stable, let me call this "training is done". Then do I have to ignore these parameters(gamma, learning rate, etc) after that?
I mean, in training stage, I got an action from Q-table like this:
if rand_float < rar:
action = rand.randint(0, num_actions - 1)
else:
action = np.argmax(Q[s_prime_as_index])
But after training stage, Do I have to remove rar, which means I have to get an action from Q-table like this?
action = np.argmax(self.Q[s_prime])
Once the value function has converged (values stop changing), you no longer need to run Q-value updates. This means gamma and alpha are no longer relevant, because they only effect updates.
The epsilon parameter is part of the exploration policy (e-greedy) and helps ensure that the agent visits all states infinitely many times in the limit. This is an important factor in ensuring that the agent's value function eventually converges to the correct value. Once we've deemed the value function converged however, there's no need to continue randomly taking actions that our value function doesn't believe to be best; we believe that the value function is optimal, so we extract the optimal policy by greedily choosing what it says is the best action in every state. We can just set epsilon to 0.
Although the answer provided by #Nick Walker is correct, here it's some additional information.
What you are talking about is closely related with the concept technically known as "exploration-exploitation trade-off". From Sutton & Barto book:
The agent has to exploit what it already knows in order to obtain
reward, but it also has to explore in order to make better action
selections in the future. The dilemma is that neither exploration nor
exploitation can be pursued exclusively without failing at the task.
The agent must try a variety of actions and progressively favor those
that appear to be best.
One way to implement the exploration-exploitation trade-off is using epsilon-greedy exploration, that is what you are using in your code sample. So, at the end, once the agent has converged to the optimal policy, the agent must select only those that exploite the current knowledge, i.e., you can forget the rand_float < rar part. Ideally you should decrease the epsilon parameters (rar in your case) with the number of episodes (or steps).
On the other hand, regarding the learning rate, it worths noting that theoretically this parameter should follow the Robbins-Monro conditions:
This means that the learning rate should decrease asymptotically. So, again, once the algorithm has converged you can (or better, you should) safely ignore the learning rate parameter.
In practice, sometimes you can simply maintain a fixed epsilon and alpha parameters until your algorithm converges and then put them as 0 (i.e., ignore them).