Relationship of Horizon and Discount factor in Reinforcement Learning

What is the connection between the discount factor gamma and the horizon in RL?
What I have learned so far is that the horizon is the agent's time to live. Intuitively, an agent with a finite horizon will choose actions differently than one that has to live forever. In the latter case, the agent will try to maximize all the expected rewards it may get far in the future.
But the idea of the discount factor seems to be the same. Do values of gamma near zero make the horizon finite?

Horizon refers to how many steps into the future the agent cares about the reward it can receive, which is a little different from the agent's time to live. In general, you could define any arbitrary horizon you want as the objective: a 10-step horizon, in which the agent makes decisions that maximize the reward it will receive over the next 10 time steps, or a 100, 1000, or n-step horizon.
Usually, the effective n-step horizon is related to the discount factor by n = 1 / (1 - gamma).
So a 10-step horizon corresponds to gamma = 0.9, while a 100-step horizon corresponds to gamma = 0.99.
In that sense, any value of gamma less than 1 implies that the (effective) horizon is finite.
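As a quick sanity check of that rule of thumb, the short sketch below (plain Python, with a few gamma values I picked) compares 1 / (1 - gamma) with the number of steps it takes the discount weight gamma^t to fall below 1/e:

import math

for gamma in (0.9, 0.99, 0.999):
    effective_horizon = 1.0 / (1.0 - gamma)
    # first step t at which the discount weight gamma^t drops below 1/e
    steps = next(t for t in range(10**6) if gamma**t < math.exp(-1))
    print(f"gamma={gamma}: 1/(1-gamma) = {effective_horizon:.0f}, "
          f"gamma^t < 1/e after t = {steps} steps")

Rewards beyond roughly that many steps are discounted so heavily that they barely influence the agent's decisions, which is why gamma < 1 behaves like a finite horizon.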

Related

Hyperparameter search for lunarlander continuous of openAI gym

I'm trying to solve the LunarLander continuous environment from OpenAI Gym (solving LunarLanderContinuous-v2 means getting an average reward of 200 over 100 consecutive trials), with the best average reward possible over 100 straight episodes from this environment.
The difficulty is that I work with a Lunar Lander under uncertainty (explanation: observations in the real physical world are sometimes noisy). Specifically, I add zero-mean Gaussian noise with std = 0.05 to the PositionX and PositionY observations of the lander's location.
I also discretise the LunarLander actions to a finite number of actions instead of the continuous range the environment enables.
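Roughly, the noise injection part looks like this (a minimal sketch using a gym ObservationWrapper; the wrapper class name is my own, and I'm assuming indices 0 and 1 of the observation hold the lander's x/y position):

import gym
import numpy as np

class NoisyPositionWrapper(gym.ObservationWrapper):
    """Add zero-mean Gaussian noise (std = 0.05) to the PositionX/PositionY observations."""
    def __init__(self, env, std=0.05):
        super().__init__(env)
        self.std = std

    def observation(self, obs):
        obs = np.array(obs, dtype=np.float32)
        obs[0] += np.random.normal(0.0, self.std)  # PositionX
        obs[1] += np.random.normal(0.0, self.std)  # PositionY
        return obs

env = NoisyPositionWrapper(gym.make("LunarLanderContinuous-v2"))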
So far I'm using DQN, double-DQN and Duelling DDQN.
My hyperparameters are:
gamma
epsilon start
epsilon end
epsilon decay
learning rate
number of actions (discretisation)
target update
batch size
optimizer
number of episodes
network architecture.
I'm having difficulty reaching good or even mediocre results.
Does anyone have advice about the hyperparameter changes I should make to improve my results?
Thanks!

Atari score vs reward in rllib DQN implementation

I'm trying to replicate DQN scores for Breakout using RLlib. After 5M steps the average reward is 2.0, while the known score for Breakout using DQN is 100+. I'm wondering if this is because of reward clipping, so that the reported reward does not correspond to the Atari score. In OpenAI baselines, the actual score is placed in info['r'] while the reward value is actually the clipped value. Is this the same case for RLlib? Is there any way to see the actual average score while training?
According to the list of trainer parameters, the library will clip Atari rewards by default:
# Whether to clip rewards prior to experience postprocessing. Setting to
# None means clip for Atari only.
"clip_rewards": None,
However, the episode_reward_mean reported on tensorboard should still correspond to the actual, non-clipped scores.
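If you want to check that number outside TensorBoard, something like the following prints it per training iteration (a sketch against an older ray[rllib] API; the module path and config keys may differ between versions):

import ray
from ray.rllib.agents.dqn import DQNTrainer  # older RLlib layout; newer releases moved these modules

ray.init()
trainer = DQNTrainer(
    env="BreakoutNoFrameskip-v4",
    config={"clip_rewards": None},  # None => clip for Atari only, as in the default above
)
for i in range(10):
    result = trainer.train()
    # episode_reward_mean is computed from full episode returns, so it should
    # track the actual game score rather than the clipped training reward
    print(i, result["episode_reward_mean"])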
While the average score of 2 is not much at all relative to the benchmarks for Breakout, 5M steps may not be enough for DQN unless you are employing something akin to Rainbow to significantly speed things up. Even then, DQN is notoriously slow to converge, so you may want to check your results with a longer run instead and/or consider upgrading your DQN configuration.
I've thrown together a quick test, and it looks like the reward clipping doesn't have much of an effect on Breakout, at least early on in training (in my run, unclipped in blue, clipped in orange).
I don't know enough about Breakout's scoring system to comment in detail, but if higher rewards become available later on as performance improves (as opposed to getting the same small reward more frequently, say), we should start seeing the two curves diverge.
In such cases, we can still normalize the rewards or convert them to logarithmic scale.
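For the log-scale option, one common transform (a generic sketch, not an RLlib setting) compresses large rewards while preserving their sign:

import numpy as np

def log_scale_reward(r):
    # sign(r) * log(1 + |r|): small rewards are nearly unchanged,
    # large rewards are compressed instead of being clipped away
    return np.sign(r) * np.log1p(np.abs(r))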
Here are the configurations I used:
lr: 0.00025
learning_starts: 50000
timesteps_per_iteration: 4
buffer_size: 1000000
train_batch_size: 32
target_network_update_freq: 10000
# (some) rainbow components
n_step: 10
noisy: True
# work-around to remove epsilon-greedy
schedule_max_timesteps: 1
exploration_final_eps: 0
prioritized_replay: True
prioritized_replay_alpha: 0.6
prioritized_replay_beta: 0.4
num_atoms: 51
double_q: False
dueling: False
You may be more interested in their rl-experiments, where they posted results from their own library against the standard benchmarks, along with the configurations with which you should be able to get even better performance.

SARSA and Q-Learning (reinforcement learning) don't converge to the optimal policy

I have a question about my own project for testing reinforcement learning techniques. First let me explain the purpose. I have an agent which can take 4 actions during 8 steps. At the end of these eight steps, the agent can be in 5 possible victory states. The goal is to find the minimum cost. To reach these 5 victory states (with different cost values: 50, 50, 0, 40, 60), the agent does not take the same path (like in a graph). The blue states are the fail states (sorry for the quality) and the episode stops there.
The actual optimal path is: DCCBBAD.
Now my question: I don't understand why, with SARSA & Q-Learning (mainly Q-Learning), the agent finds a path but not the optimal one after 100,000 iterations (always DACBBAD/DACBBCD). Sometimes when I run it again, the agent finds the good path (DCCBBAD). So I would like to understand why it sometimes finds it and sometimes not. Is there something I should look at in order to stabilize my agent?
Thank you a lot,
Tanguy
TL;DR:
Set your epsilon so that you explore a lot for a large number of episodes, e.g. linearly decaying from 1.0 to 0.1.
Set your learning rate to a small constant value, such as 0.1.
Don't stop your algorithm based on number of episodes but on changes to the action-value function.
More detailed version:
Q-learning is only guaranteed to converge under the following conditions:
1. You must visit all state-action pairs infinitely often.
2. The sum of the learning rates over all timesteps must be infinite, i.e. ∑_t α_t = ∞.
3. The sum of the squares of the learning rates over all timesteps must be finite, i.e. ∑_t α_t² < ∞.
To hit 1, just make sure your epsilon does not decay to a low value too early. Make it decay very, very slowly, and perhaps never all the way to 0. You can also try a GLIE-style schedule such as epsilon_t = 1/t.
To hit 2 and 3, you must take care of 1, so that you accumulate infinitely many learning-rate terms, but you must also pick your learning rate α so that the sum of its squares stays finite. That basically means α ≤ 1. If your environment is deterministic, you should try α = 1. Deterministic here means that taking an action a in a state s always transitions to the same state s', for all states and actions in your environment. If your environment is stochastic, you can try a low number, such as 0.05-0.3.
Maybe check out https://youtu.be/wZyJ66_u4TI?t=2790 for more info.
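Putting the TL;DR advice together, here is a minimal tabular Q-learning sketch (it assumes a gym-style environment with discrete states and actions; the linear epsilon schedule and the stopping rule are just one reasonable choice):

import numpy as np

def q_learning(env, n_states, n_actions, gamma=0.99, alpha=0.1,
               eps_start=1.0, eps_end=0.1, decay_episodes=50_000, tol=1e-6):
    Q = np.zeros((n_states, n_actions))
    episode = 0
    while True:
        # linearly decay epsilon from eps_start to eps_end over decay_episodes
        eps = max(eps_end, eps_start - (eps_start - eps_end) * episode / decay_episodes)
        s = env.reset()
        done, max_delta = False, 0.0
        while not done:
            a = env.action_space.sample() if np.random.rand() < eps else int(np.argmax(Q[s]))
            s2, r, done, _ = env.step(a)
            target = r if done else r + gamma * np.max(Q[s2])
            delta = alpha * (target - Q[s, a])
            Q[s, a] += delta
            max_delta = max(max_delta, abs(delta))
            s = s2
        episode += 1
        # stop on (near-)convergence of the action-value function, not on a fixed episode count
        if eps <= eps_end and max_delta < tol:
            return Q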

RL Policy Gradient: How to deal with rewards that are strictly positive?

In short:
In the policy gradient method, if the reward is always positive (never negative), the policy gradient will always be positive, hence it will keep making our parameters larger. This makes the learning algorithm meaningless. How do we get around this problem?
In detail:
In "RL Course by David Silver" lecture 7 (on YouTube), he introduced the REINFORCE algorithm for policy gradient (here just showing 1 step):
The actual policy update is:
Note that v_t here stands for the reward we get. Let's say we're playing a game where the reward is always positive (eg. accumulating a score), and there are never any negative rewards, the gradient will always be positive, hence theta will keep increasing! So how do we deal with rewards that never change sign?
Theta isn't one number, but rather a vector of numbers that parameterize your model. The gradient with respect to each parameter may be positive or negative. For example, consider that your parameters are just the probabilities for each action. They are constrained to add to 1.0, so increasing the probability of one action requires that at least one of the other actions decrease in probability.
Hi, in the formula there is also the gradient of the log probability of the action, which can be positive or negative. With policy gradients, the policy will increase or decrease the probability of taking a specific action in a given state, and the return v_t just scales how much that probability changes. So it is totally fine if all the rewards are positive.
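To make the first answer concrete, here is a tiny numerical sketch (a softmax policy over 3 actions with made-up numbers) showing that, even with a strictly positive return, the REINFORCE gradient is negative for the parameters of the actions that were not taken:

import numpy as np

theta = np.array([0.2, -0.1, 0.5])             # logits: one parameter per action
probs = np.exp(theta) / np.sum(np.exp(theta))  # softmax policy pi_theta

a = 1          # the sampled action
v_t = 10.0     # strictly positive return

# gradient of log pi_theta(a) with respect to the logits: one_hot(a) - probs
grad_log_pi = -probs
grad_log_pi[a] += 1.0

grad = v_t * grad_log_pi
print(grad)    # the entries for the two actions that were not taken are negative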

How to do reinforcement learning with regression instead of classification

I'm trying to apply reinforcement learning to a problem where the agent interacts through continuous numerical outputs using a recurrent network. Basically, it is a control problem where two outputs control how an agent behaves.
I define a policy as epsilon-greedy: (1 - eps) of the time it uses the output control values directly, and eps of the time it uses the output values plus or minus a small Gaussian perturbation.
In this sense the agent can explore.
In most of the reinforcement learning literature I see that policy learning requires discrete actions, which can be learned with the REINFORCE (Williams 1992) algorithm, but I'm unsure what method to use here.
At the moment, what I do is use masking to only learn from the top choices, using an algorithm based on Metropolis-Hastings to decide whether a transition goes toward the optimal policy. Pseudo-code:
import numpy as np

def build_target_mask(rewards, time_indices, std=0.05):
    # rewards in (0, 1) and optimal is 1
    # relate rewards to likelihood via L(r) = exp(-|r - 1| / std)
    # r <= 1  =>  |r - 1| = 1 - r
    target_mask = np.zeros(len(time_indices))
    neglog_li = (1 - np.mean(rewards)) / std
    # go through the rewards in a random order to approximate a Markov process
    order = np.random.permutation(len(rewards))
    for r, idx in zip(np.asarray(rewards)[order], np.asarray(time_indices)[order]):
        neglog_lj = (1 - r) / std
        # accept if the reward is better, or with probability exp(neglog_li - neglog_lj)
        if neglog_lj < neglog_li or np.log(np.random.uniform()) < neglog_li - neglog_lj:
            # accept the transition, i.e. learn this action
            target_mask[idx] = 1
            neglog_li = neglog_lj
    return target_mask
This provides a target mask with ones for the actions that will be learned using standard backprop.
Can someone point me to the proper or a better way to do this?
Policy gradient methods are good for learning continuous control outputs. If you look at http://rll.berkeley.edu/deeprlcourse/#lectures, the Feb 13 lecture as well as the March 8 through March 15 lectures might be useful to you. Actor Critic methods are covered there, as well.
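As a rough starting point for that direction, a policy gradient for continuous actions usually parameterizes a distribution over the outputs rather than the raw outputs themselves. Below is a minimal REINFORCE-style sketch for a single continuous output with a Gaussian policy; the linear mean function, fixed sigma, and dummy data are simplifying assumptions of mine, not something taken from the lectures linked above:

import numpy as np

def reinforce_step(theta, states, actions, returns, sigma=0.1, lr=1e-3):
    # Gaussian policy: a ~ Normal(mu(s), sigma^2) with a linear mean mu(s) = theta @ s
    # grad_theta log N(a | mu, sigma^2) = ((a - mu) / sigma^2) * s
    grad = np.zeros_like(theta)
    for s, a, G in zip(states, actions, returns):
        mu = theta @ s
        grad += G * ((a - mu) / sigma**2) * s
    return theta + lr * grad / len(states)

# usage sketch with dummy data: 4 state features, one episode of 8 steps
theta = np.zeros(4)
states = [np.random.randn(4) for _ in range(8)]
actions = [np.random.randn() for _ in range(8)]
returns = [1.0] * 8
theta = reinforce_step(theta, states, actions, returns)

For two control outputs you would use a two-dimensional Gaussian (one mean per output) in the same way.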