Sarsa and Q Learning (reinforcement learning) don't converge optimal policy - reinforcement-learning

I have a question about my own project for testing reinforcement learning technique. First let me explain you the purpose. I have an agent which can take 4 actions during 8 steps. At the end of this eight steps, the agent can be in 5 possible victory states. The goal is to find the minimum cost. To access of this 5 victories (with different cost value: 50, 50, 0, 40, 60), the agent don't take the same path (like a graph). The blue states are the fail states (sorry for quality) and the episode is stopped.
enter image description here
The real good path is: DCCBBAD
Now my question, I don't understand why in SARSA & Q-Learning (mainly in Q learning), the agent find a path but not the optimal one after 100 000 iterations (always: DACBBAD/DACBBCD). Sometime when I compute again, the agent falls in the good path (DCCBBAD). So I would like to understand why sometime the agent find it and why sometime not. And there is a way to look at in order to stabilize my agent?
Thank you a lot,
Tanguy

TD;DR;
Set your epsilon so that you explore a bunch for a large number of episodes. E.g. Linearly decaying from 1.0 to 0.1.
Set your learning rate to a small constant value, such as 0.1.
Don't stop your algorithm based on number of episodes but on changes to the action-value function.
More detailed version:
Q-learning is only garranteed to converge under the following conditions:
You must visit all state and action pairs infinitely ofter.
The sum of all the learning rates for all timesteps must be infinite, so
The sum of the square of all the learning rates for all timesteps must be finite, that is
To hit 1, just make sure your epsilon is not decaying to a low value too early. Make it decay very very slowly and perhaps never all the way to 0. You can try , too.
To hit 2 and 3, you must ensure you take care of 1, so that you collect infinite learning rates, but also pick your learning rate so that its square is finite. That basically means =< 1. If your environment is deterministic you should try 1. Deterministic environment here that means when taking an action a in a state s you transition to state s' for all states and actions in your environment. If your environment is stochastic, you can try a low number, such as 0.05-0.3.
Maybe checkout https://youtu.be/wZyJ66_u4TI?t=2790 for more info.

Related

Relationship of Horizon and Discount factor in Reinforcement Learning

What is the connection between discount factor gamma and horizon in RL.
What I have learned so far is that the horizon is the agent`s time to live. Intuitively, agents with finite horizon will choose actions differently than if it has to live forever. In the latter case, the agent will try to maximize all the expected rewards it may get far in the future.
But the idea of the discount factor is also the same. Are the values of gamma near zero makes the horizon finite?
Horizon refers to how many steps into the future the agent cares about the reward it can receive, which is a little different from the agent's time to live. In general, you could potentially define any arbitrary horizon you want as the objective. You could define a 10 step horizon, in which the agent makes a decision that will enable it to maximize the reward it will receive in the next 10 time steps. Or we could choose a 100, or 1000, or n step horizon!
Usually, the n-step horizon is defined using n = 1 / (1-gamma).
Therefore, 10 step horizon will be achieved using gamma = 0.9, while 100 step horizon can be achieved with gamma = 0.99
Therefore, any value of gamma less than 1 imply that the horizon is finite.

Atari score vs reward in rllib DQN implementation

I'm trying to replicate DQN scores for Breakout using RLLib. After 5M steps the average reward is 2.0 while the known score for Breakout using DQN is 100+. I'm wondering if this is because of reward clipping and therefore actual reward does not correspond to score from Atari. In OpenAI baselines, the actual score is placed in info['r'] the reward value is actually the clipped value. Is this the same case for RLLib? Is there any way to see actual average score while training?
According to the list of trainer parameters, the library will clip Atari rewards by default:
# Whether to clip rewards prior to experience postprocessing. Setting to
# None means clip for Atari only.
"clip_rewards": None,
However, the episode_reward_mean reported on tensorboard should still correspond to the actual, non-clipped scores.
While the average score of 2 is not much at all relative to the benchmarks for Breakout, 5M steps may not be large enough for DQN unless you are employing something akin to the rainbow to significantly speed things up. Even then, DQN is notoriously slow to converge, so you may want to check your results using a longer run instead and/or consider upgrading your DQN configurations.
I've thrown together a quick test and it looks like the reward clipping doesn't have much of an effect on Breakout, at least early on in the training (unclipped in blue, clipped in orange):
I don't know too much about Breakout to comment on its scoring system, but if higher rewards become available later on as we get better performance (as opposed to getting the same small reward but with more frequency, say), we should start seeing the two diverge.
In such cases, we can still normalize the rewards or convert them to logarithmic scale.
Here's the configurations I used:
lr: 0.00025
learning_starts: 50000
timesteps_per_iteration: 4
buffer_size: 1000000
train_batch_size: 32
target_network_update_freq: 10000
# (some) rainbow components
n_step: 10
noisy: True
# work-around to remove epsilon-greedy
schedule_max_timesteps: 1
exploration_final_eps: 0
prioritized_replay: True
prioritized_replay_alpha: 0.6
prioritized_replay_beta: 0.4
num_atoms: 51
double_q: False
dueling: False
You may be more interested in their rl-experiments where they posted some results from their own library against the standard benchmarks along with the configurations where you should be able to get even better performance.

How does score function help in policy gradient?

I'm trying to learn policy gradient methods for reinforcement learning but I stuck at the score function part.
While searching for maximum or minimum points in a function, we take the derivative and set it to zero, then look for the points that holds this equation.
In policy gradient methods, we do it by taking the gradient of the expectation of trajectories and we get:
Objective function image
Here I could not get how this gradient of log policy shifts the distribution (through its parameters θ) to increase the scores of its samples mathematically? Don't we look for something that make this objective function's gradient zero as I explained above?
What you want to maximize is
J(theta) = int( p(tau;theta)*R(tau) )
The integral is over tau (the trajectory) and p(tau;theta) is its probability (i.e., of seeing the sequence state, action, next state, next action, ...), which depends on both the dynamics of the environment and the policy (parameterized by theta). Formally
p(tau;theta) = p(s_0)*pi(a_0|s_0;theta)*P(s_1|s_0,a_0)*pi(a_1|s_1;theta)*P(s_2|s_1,a_1)*...
where P(s'|s,a) is the transition probability given by the dynamics.
Since we cannot control the dynamics, only the policy, we optimize w.r.t. its parameters, and we do it by gradient ascent, meaning that we take the direction given by the gradient. The equation in your image comes from the log-trick df(x)/dx = f(x)*d(logf(x))/dx.
In our case f(x) is p(tau;theta) and we get your equation. Then since we have access only to a finite amount of data (our samples) we approximate the integral with an expectation.
Step after step, you will (ideally) reach a point where the gradient is 0, meaning that you reached a (local) optimum.
You can find a more detailed explanation here.
EDIT
Informally, you can think of learning the policy which increases the probability of seeing high return R(tau). Usually, R(tau) is the cumulative sum of the rewards. For each state-action pair (s,a) you therefore maximize the sum of the rewards you get from executing a in state s and following pi afterwards. Check this great summary for more details (Fig 1).

Reinforce Learning: Do I have to ignore hyper parameter(?) after training done in Q-learning?

Learner might be in training stage, where it update Q-table for bunch of epoch.
In this stage, Q-table would be updated with gamma(discount rate), learning rate(alpha), and action would be chosen by random action rate.
After some epoch, when reward is getting stable, let me call this "training is done". Then do I have to ignore these parameters(gamma, learning rate, etc) after that?
I mean, in training stage, I got an action from Q-table like this:
if rand_float < rar:
action = rand.randint(0, num_actions - 1)
else:
action = np.argmax(Q[s_prime_as_index])
But after training stage, Do I have to remove rar, which means I have to get an action from Q-table like this?
action = np.argmax(self.Q[s_prime])
Once the value function has converged (values stop changing), you no longer need to run Q-value updates. This means gamma and alpha are no longer relevant, because they only effect updates.
The epsilon parameter is part of the exploration policy (e-greedy) and helps ensure that the agent visits all states infinitely many times in the limit. This is an important factor in ensuring that the agent's value function eventually converges to the correct value. Once we've deemed the value function converged however, there's no need to continue randomly taking actions that our value function doesn't believe to be best; we believe that the value function is optimal, so we extract the optimal policy by greedily choosing what it says is the best action in every state. We can just set epsilon to 0.
Although the answer provided by #Nick Walker is correct, here it's some additional information.
What you are talking about is closely related with the concept technically known as "exploration-exploitation trade-off". From Sutton & Barto book:
The agent has to exploit what it already knows in order to obtain
reward, but it also has to explore in order to make better action
selections in the future. The dilemma is that neither exploration nor
exploitation can be pursued exclusively without failing at the task.
The agent must try a variety of actions and progressively favor those
that appear to be best.
One way to implement the exploration-exploitation trade-off is using epsilon-greedy exploration, that is what you are using in your code sample. So, at the end, once the agent has converged to the optimal policy, the agent must select only those that exploite the current knowledge, i.e., you can forget the rand_float < rar part. Ideally you should decrease the epsilon parameters (rar in your case) with the number of episodes (or steps).
On the other hand, regarding the learning rate, it worths noting that theoretically this parameter should follow the Robbins-Monro conditions:
This means that the learning rate should decrease asymptotically. So, again, once the algorithm has converged you can (or better, you should) safely ignore the learning rate parameter.
In practice, sometimes you can simply maintain a fixed epsilon and alpha parameters until your algorithm converges and then put them as 0 (i.e., ignore them).

Explain process noise terminology in Kalman Filter

I am just learning Kalman filter. In the Kalman Filter terminology, I am having some difficulty with process noise. Process noise seems to be ignored in many concrete examples (most focused on measurement noise). If someone can point me to some introductory level link that described process noise well with examples, that’d be great.
Let’s use a concrete scalar example for my question, given:
x_j = a x_j-1 + b u_j + w_j
Let’s say x_j models the temperature within a fridge with time. It is 5 degrees and should stay that way, so we model with a = 1. If at some point t = 100, the temperature of the fridge becomes 7 degrees (ie. hot day, poor insulation), then I believe the process noise at this point is 2 degrees. So our state variable x_100 = 7 degrees, and this is the true value of the system.
Question 1:
If I then paraphrase the phrase I often see for describing Kalman filter, “we filter the signal x so that the effects of the noise w are minimized “, http://www.swarthmore.edu/NatSci/echeeve1/Ref/Kalman/ScalarKalman.html if we minimize the effects of the 2 degrees, are we trying to get rid of the 2 degree difference? But the true state at is x_100 == 7 degrees. What are we doing to the process noise w exactly when we Kalmen filter?
Question 2:
The process noise has a variance of Q. In the simple fridge example, it seems easy to model because you know the underlying true state is 5 degrees and you can take Q as the deviation from that state. But if the true underlying state is fluctuating with time, when you model, what part of this would be considered state fluctuation vs. “process noise”. And how do we go about determining a good Q (again example would be nice)?
I have found that as Q is always added to the covariance prediction no matter which time step you are at, (see Covariance prediction formula from http://greg.czerniak.info/guides/kalman1/) that if you select an overly large Q, then it doesn’t seem like the Kalman filter would be well-behaved.
Thanks.
EDIT1 My Interpretation
My interpretation of the term process noise is the difference between the actual state of the system and the state modeled from the state transition matrix (ie. a * x_j-1). And what Kalman filter tries to do, is to bring the prediction closer to the actual state. In that sense, it actually partially "incorporate" the process noise into the prediction through the residual feedback mechanism, rather than "eliminate" it, so that it can predict the actual state better. I have not read such an explanation anywhere in my search, and I would appreciate anyone commenting on this view.
In Kalman filtering the "process noise" represents the idea/feature that the state of the system changes over time, but we do not know the exact details of when/how those changes occur, and thus we need to model them as a random process.
In your refrigerator example:
the state of the system is the temperature,
we obtain measurements of the temperature on some time interval, say hourly,
by looking the thermometer dial. Note that you usually need to
represent the uncertainties involved in the measurement process
in Kalman filtering, but you didn't focus on this in your question.
Let's assume that these errors are small.
At time t you look at the thermometer, see that it says 7degrees;
since we've assumed the measurement errors are very small, that means
that the true temperature is (very close to) 7 degrees.
Now the question is: what is the temperature at some later time, say 15 minutes
after you looked?
If we don't know if/when the condenser in the refridgerator turns on we could have:
1. the temperature at the later time is yet higher than 7degrees (15 minutes manages
to get close to the maximum temperature in a cycle),
2. Lower if the condenser is/has-been running, or even,
3. being just about the same.
This idea that there are a distribution of possible outcomes for the real state of the
system at some later time is the "process noise"
Note: my qualitative model for the refrigerator is: the condenser is not running, the temperature goes up until it reaches a threshold temperature a few degrees above the nominal target temperature (note - this is a sensor so there may be noise in terms of the temperature at which the condenser turns on), the condenser stays on until the temperature
gets a few degrees below the set temperature. Also note that if someone opens the door, then there will be a jump in the temperature; since we don't know when someone might do this, we model it as a random process.
Yeah, I don't think that sentence is a good one. The primary purpose of a Kalman filter is to minimize the effects of observation noise, not process noise. I think the author may be conflating Kalman filtering with Kalman control (where you ARE trying to minimize the effect of process noise).
The state does not "fluctuate" over time, except through the influence of process noise.
Remember, a system does not generally have an inherent "true" state. A refrigerator is a bad example, because it's already a control system, with nonlinear properties. A flying cannonball is a better example. There is some place where it "really is", but that's not intrinsic to A. In this example, you can think of wind as a kind of "process noise". (Not a great example, since it's not white noise, but work with me here.) The wind is a 3-dimensional process noise affecting the cannonball's velocity; it does not directly affect the cannonball's position.
Now, suppose that the wind in this area always blows northwest. We should see a positive covariance between the north and west components of wind. A deviation of the cannonball's velocity northwards should make us expect to see a similar deviation to westward, and vice versa.
Think of Q more as covariance than as variance; the autocorrelation aspect of it is almost incidental.
Its a good discussion going over here. I would like to add that the concept of process noise is that what ever prediction that is made based on the model is having some errors and it is represented using the Q matrix. If you note the equations in KF for prediction of Covariance matrix (P_prediction) which is actually the mean squared error of the state being predicted, the Q is simply added to it. PPredict=APA'+Q . I suggest, it would give a good insight if you could find the derivation of KF equations.
If your state-transition model is exact, process noise would be zero. In real-world, it would be nearly impossible to capture the exact state-transition with a mathematical model. The process noise captures that uncertainty.