Hyperparameter search for lunarlander continuous of openAI gym - reinforcement-learning

I'm trying to solve the LunarLander continuous environment from open AI gym (Solving the LunarLanderContinuous-v2 means getting an average reward of 200 over 100 consecutive trials.) With best reward average possible for 100 straight episodes from this environment.
The difficulty is that I refer to the Lunar-lander with uncertainty. (explanation: observations in the real physical world are sometimes noisy). Specifically, I add a zero-mean
Gaussian noise with mean=0 and std = 0.05 to PositionX and PositionY observation of the location of the lander.
I also discretise the LunarLander actions to a finite number of actions instead of the continuous range the environment enables.
So far I'm using DQN, double-DQN and Duelling DDQN.
My hyperparameters are:
gamma,
epsilon start
epsilon end
epsilon decay
learning rate
number of actions (discretisation)
target update
batch size
optimizer
number of episodes
network architecture.
I'm having difficulty to reach good or even mediocre results.
Does someone have an advice about the hyperparameters changes I should make to improve my results?
Thanks!

Related

Harmonizing regression and classification losses

I'm investigating the task of training a neural network to predict one future value given a sinusoidal input. So for example, as seen in the Figure, the input signal is x and the expected output signal y. The model's output is y^. Doing the regression task is fairly straightforward, and there are a lot of choices for this problem. I'm using a simple recurrent neural network with mean-squared error (MSE) loss between y and y^.
Additionally, suppose I know that the sinusoid is made up of N modalities, e.g., at some points, the wave oscillates at 5 Hz, then 10 Hz, then back to 5 Hz, then up to 15 Hz maybe—i.e., N=3.
In this case, I have ground-truth class labels in a vector k and the model does both regression and classification, additionally outputting a vector k^. An example is shown in the Figure. As this is a multi-class problem with exclusivity, I figured binary cross entropy (BCE) loss should be relevant here.
I'm sure there is a lot of research about combining loss functions, but does just adding MSE and BCE make sense? Scaling one up or down by a factor of 10 doesn't seem to change the learning outcome too much. So I was wondering what is considered the standard approach to problems where there is a joint classification and regression objective.
Additionally, on top of just BCE, I want to penalize k^ for quickly jumping around between classes; for example, if the model guesses one class, I'd like it to stay in that one class and switch only when it's necessary. See how in the Figure, there are fast dark blue blips in k^. I would like the same solid bands as seen in k, and naive BCE loss doesn't account for that.
Appreciate any and all advice!

Why `ep_rew_mean` much larger than the reward evaluated by the `evaluate_policy()` fuction

I write a custom gym environment, and trained with PPO provided by stable-baselines3. The ep_rew_mean recorded by tensorboard is as follow:
the ep_rew_mean curve for total 100 million steps, each episode has 50 steps
As shown in the figure, the reward is around 15.5 after training, and the model converges. However, I use the function evaluate_policy() for the trained model, and the reward is much smaller than the ep_rew_mean value. The first value is mean reward, the second value is std of reward:
4.349947246664763 1.1806464511030819
the way I use function evaluate_policy() is:
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10000)
According to my understanding, the initial environment is randomly distributed in an area when using reset() fuction, so there should not be overfitting problem.
I have also tried different learning rate or other parameters, and this problem is not solved.
I have checked my environment, and I think there is no error.
I have searched on the internet, read the doc of stable-baselines3 and issues on github, but did not find the solution.
evaluate_policy has deterministic to True by default (https://stable-baselines3.readthedocs.io/en/master/common/evaluation.html).
If you sample from the distribution during training, it may help to evaluate the policy without it selecting the actions with an argmax (by passing in deterministic=False).

Relationship of Horizon and Discount factor in Reinforcement Learning

What is the connection between discount factor gamma and horizon in RL.
What I have learned so far is that the horizon is the agent`s time to live. Intuitively, agents with finite horizon will choose actions differently than if it has to live forever. In the latter case, the agent will try to maximize all the expected rewards it may get far in the future.
But the idea of the discount factor is also the same. Are the values of gamma near zero makes the horizon finite?
Horizon refers to how many steps into the future the agent cares about the reward it can receive, which is a little different from the agent's time to live. In general, you could potentially define any arbitrary horizon you want as the objective. You could define a 10 step horizon, in which the agent makes a decision that will enable it to maximize the reward it will receive in the next 10 time steps. Or we could choose a 100, or 1000, or n step horizon!
Usually, the n-step horizon is defined using n = 1 / (1-gamma).
Therefore, 10 step horizon will be achieved using gamma = 0.9, while 100 step horizon can be achieved with gamma = 0.99
Therefore, any value of gamma less than 1 imply that the horizon is finite.

Atari score vs reward in rllib DQN implementation

I'm trying to replicate DQN scores for Breakout using RLLib. After 5M steps the average reward is 2.0 while the known score for Breakout using DQN is 100+. I'm wondering if this is because of reward clipping and therefore actual reward does not correspond to score from Atari. In OpenAI baselines, the actual score is placed in info['r'] the reward value is actually the clipped value. Is this the same case for RLLib? Is there any way to see actual average score while training?
According to the list of trainer parameters, the library will clip Atari rewards by default:
# Whether to clip rewards prior to experience postprocessing. Setting to
# None means clip for Atari only.
"clip_rewards": None,
However, the episode_reward_mean reported on tensorboard should still correspond to the actual, non-clipped scores.
While the average score of 2 is not much at all relative to the benchmarks for Breakout, 5M steps may not be large enough for DQN unless you are employing something akin to the rainbow to significantly speed things up. Even then, DQN is notoriously slow to converge, so you may want to check your results using a longer run instead and/or consider upgrading your DQN configurations.
I've thrown together a quick test and it looks like the reward clipping doesn't have much of an effect on Breakout, at least early on in the training (unclipped in blue, clipped in orange):
I don't know too much about Breakout to comment on its scoring system, but if higher rewards become available later on as we get better performance (as opposed to getting the same small reward but with more frequency, say), we should start seeing the two diverge.
In such cases, we can still normalize the rewards or convert them to logarithmic scale.
Here's the configurations I used:
lr: 0.00025
learning_starts: 50000
timesteps_per_iteration: 4
buffer_size: 1000000
train_batch_size: 32
target_network_update_freq: 10000
# (some) rainbow components
n_step: 10
noisy: True
# work-around to remove epsilon-greedy
schedule_max_timesteps: 1
exploration_final_eps: 0
prioritized_replay: True
prioritized_replay_alpha: 0.6
prioritized_replay_beta: 0.4
num_atoms: 51
double_q: False
dueling: False
You may be more interested in their rl-experiments where they posted some results from their own library against the standard benchmarks along with the configurations where you should be able to get even better performance.

Sarsa and Q Learning (reinforcement learning) don't converge optimal policy

I have a question about my own project for testing reinforcement learning technique. First let me explain you the purpose. I have an agent which can take 4 actions during 8 steps. At the end of this eight steps, the agent can be in 5 possible victory states. The goal is to find the minimum cost. To access of this 5 victories (with different cost value: 50, 50, 0, 40, 60), the agent don't take the same path (like a graph). The blue states are the fail states (sorry for quality) and the episode is stopped.
enter image description here
The real good path is: DCCBBAD
Now my question, I don't understand why in SARSA & Q-Learning (mainly in Q learning), the agent find a path but not the optimal one after 100 000 iterations (always: DACBBAD/DACBBCD). Sometime when I compute again, the agent falls in the good path (DCCBBAD). So I would like to understand why sometime the agent find it and why sometime not. And there is a way to look at in order to stabilize my agent?
Thank you a lot,
Tanguy
TD;DR;
Set your epsilon so that you explore a bunch for a large number of episodes. E.g. Linearly decaying from 1.0 to 0.1.
Set your learning rate to a small constant value, such as 0.1.
Don't stop your algorithm based on number of episodes but on changes to the action-value function.
More detailed version:
Q-learning is only garranteed to converge under the following conditions:
You must visit all state and action pairs infinitely ofter.
The sum of all the learning rates for all timesteps must be infinite, so
The sum of the square of all the learning rates for all timesteps must be finite, that is
To hit 1, just make sure your epsilon is not decaying to a low value too early. Make it decay very very slowly and perhaps never all the way to 0. You can try , too.
To hit 2 and 3, you must ensure you take care of 1, so that you collect infinite learning rates, but also pick your learning rate so that its square is finite. That basically means =< 1. If your environment is deterministic you should try 1. Deterministic environment here that means when taking an action a in a state s you transition to state s' for all states and actions in your environment. If your environment is stochastic, you can try a low number, such as 0.05-0.3.
Maybe checkout https://youtu.be/wZyJ66_u4TI?t=2790 for more info.