RL Policy Gradient: How to deal with rewards that are strictly positive? - reinforcement-learning

In short:
In the policy gradient method, if the reward is always positive (never negative), the policy gradient will always be positive, hence it will keep making our parameters larger. This makes the learning algorithm meaningless. How do we get around this problem?
In detail:
In "RL Course by David Silver" lecture 7 (on YouTube), he introduced the REINFORCE algorithm for policy gradient (here just showing 1 step):
The actual policy update is:
Note that v_t here stands for the reward we get. Let's say we're playing a game where the reward is always positive (eg. accumulating a score), and there are never any negative rewards, the gradient will always be positive, hence theta will keep increasing! So how do we deal with rewards that never change sign?

Theta isn't one number, but rather a vector of numbers that parameterize your model. The gradient with respect to your parameter may be positive or negative. For example, consider that your parameters are just the probabilities for each action. They are constrained to add to 1.0. Increasing the probability of one action requires at least one of the other actions decrease in probability.

Hi in the formula there is also a log probability for the action which can be positive or negative. By doing policy gradients, the policy will increase or decrease the probability of doing a specific action under give states and the value function just gives how much the probability gonna to change. So it is totally fine all rewards are positive.

Related

How does score function help in policy gradient?

I'm trying to learn policy gradient methods for reinforcement learning but I stuck at the score function part.
While searching for maximum or minimum points in a function, we take the derivative and set it to zero, then look for the points that holds this equation.
In policy gradient methods, we do it by taking the gradient of the expectation of trajectories and we get:
Objective function image
Here I could not get how this gradient of log policy shifts the distribution (through its parameters θ) to increase the scores of its samples mathematically? Don't we look for something that make this objective function's gradient zero as I explained above?
What you want to maximize is
J(theta) = int( p(tau;theta)*R(tau) )
The integral is over tau (the trajectory) and p(tau;theta) is its probability (i.e., of seeing the sequence state, action, next state, next action, ...), which depends on both the dynamics of the environment and the policy (parameterized by theta). Formally
p(tau;theta) = p(s_0)*pi(a_0|s_0;theta)*P(s_1|s_0,a_0)*pi(a_1|s_1;theta)*P(s_2|s_1,a_1)*...
where P(s'|s,a) is the transition probability given by the dynamics.
Since we cannot control the dynamics, only the policy, we optimize w.r.t. its parameters, and we do it by gradient ascent, meaning that we take the direction given by the gradient. The equation in your image comes from the log-trick df(x)/dx = f(x)*d(logf(x))/dx.
In our case f(x) is p(tau;theta) and we get your equation. Then since we have access only to a finite amount of data (our samples) we approximate the integral with an expectation.
Step after step, you will (ideally) reach a point where the gradient is 0, meaning that you reached a (local) optimum.
You can find a more detailed explanation here.
EDIT
Informally, you can think of learning the policy which increases the probability of seeing high return R(tau). Usually, R(tau) is the cumulative sum of the rewards. For each state-action pair (s,a) you therefore maximize the sum of the rewards you get from executing a in state s and following pi afterwards. Check this great summary for more details (Fig 1).

Sarsa and Q Learning (reinforcement learning) don't converge optimal policy

I have a question about my own project for testing reinforcement learning technique. First let me explain you the purpose. I have an agent which can take 4 actions during 8 steps. At the end of this eight steps, the agent can be in 5 possible victory states. The goal is to find the minimum cost. To access of this 5 victories (with different cost value: 50, 50, 0, 40, 60), the agent don't take the same path (like a graph). The blue states are the fail states (sorry for quality) and the episode is stopped.
enter image description here
The real good path is: DCCBBAD
Now my question, I don't understand why in SARSA & Q-Learning (mainly in Q learning), the agent find a path but not the optimal one after 100 000 iterations (always: DACBBAD/DACBBCD). Sometime when I compute again, the agent falls in the good path (DCCBBAD). So I would like to understand why sometime the agent find it and why sometime not. And there is a way to look at in order to stabilize my agent?
Thank you a lot,
Tanguy
TD;DR;
Set your epsilon so that you explore a bunch for a large number of episodes. E.g. Linearly decaying from 1.0 to 0.1.
Set your learning rate to a small constant value, such as 0.1.
Don't stop your algorithm based on number of episodes but on changes to the action-value function.
More detailed version:
Q-learning is only garranteed to converge under the following conditions:
You must visit all state and action pairs infinitely ofter.
The sum of all the learning rates for all timesteps must be infinite, so
The sum of the square of all the learning rates for all timesteps must be finite, that is
To hit 1, just make sure your epsilon is not decaying to a low value too early. Make it decay very very slowly and perhaps never all the way to 0. You can try , too.
To hit 2 and 3, you must ensure you take care of 1, so that you collect infinite learning rates, but also pick your learning rate so that its square is finite. That basically means =< 1. If your environment is deterministic you should try 1. Deterministic environment here that means when taking an action a in a state s you transition to state s' for all states and actions in your environment. If your environment is stochastic, you can try a low number, such as 0.05-0.3.
Maybe checkout https://youtu.be/wZyJ66_u4TI?t=2790 for more info.

Reward function for Policy Gradient Descent in Reinforcement Learning

I'm currently learning about Policy Gradient Descent in the context of Reinforcement Learning. TL;DR, my question is: "What are the constraints on the reward function (in theory and practice) and what would be a good reward function for the case below?"
Details:
I want to implement a Neural Net which should learn to play a simple board game using Policy Gradient Descent. I'll omit the details of the NN as they don't matter. The loss function for Policy Gradient Descent, as I understand it is negative log likelihood: loss = - avg(r * log(p))
My question now is how to define the reward r? Since the game can have 3 different outcomes: win, loss, or draw - it seems rewarding 1 for a win, 0 for a draw, -1 for a loss (and some discounted value of those for action leading to those outcomes) would be a natural choice.
However, mathematically I have doubts:
Win Reward: 1 - This seems to make sense. This should push probabilities towards 1 for moves involved in wins with diminishing gradient the closer the probability gets to 1.
Draw Reward: 0 - This does not seem to make sense. This would just cancel out any probabilities in the equation and no learning should be possible (as the gradient should always be 0).
Loss Reward: -1 - This should kind of work. It should push probabilities towards 0 for moves involved in losses. However, I'm concerned about the asymmetry of the gradient compared to the win case. The closer to 0 the probability gets, the steeper the gradient gets. I'm concerned that this would create an extremely strong bias towards a policy that avoids losses - to the degree where the win signal doesn't matter much at all.
You are on the right track. However, I believe you are confusing rewards with action probabilities. In case of draw, it learns that the reward itself is zero at the end of the episode. However, in case of loss, the loss function is discounted reward (which should be -1) times the action probabilities. So it will get you more towards actions which end in win and away from loss with actions ending in draw falling in the middle. Intuitively, it is very similar to supervised deep learning only with an additional weighting parameter (reward) attached to it.
Additionally, I believe this paper from Google DeepMind would be useful for you: https://arxiv.org/abs/1712.01815.
They actually talk about solving the chess problem using RL.

State value and state action values with policy - Bellman equation with policy

I am just getting start with deep reinforcement learning and i am trying to crasp this concept.
I have this deterministic bellman equation
When i implement stochastacity from the MDP then i get 2.6a
My equation is this assumption correct. I saw this implementation 2.6a without a policy sign on the state value function. But to me this does not make sense due to i am using the probability of which different next steps i could end up in. Which is the same as saying policy, i think. and if yes 2.6a is correct, can i then assume that the rest (2.6b and 2.6c) because then i would like to write the action state function like this:
The reason why i am doing it like this is because i would like to explain myself from a deterministic point of view to a non-deterministic point of view.
I hope someone out there can help on this one!
Best regards Søren Koch
No, the value function V(s_t) does not depend on the policy. You see in the equation that it is defined in terms of an action a_t that maximizes a quantity, so it is not defined in terms of actions as selected by any policy.
In the nondeterministic / stochastic case, you will have that sum over probabilities multiplied by state-values, but this is still independent from any policy. The sum only sums over different possible future states, but every multiplication involves exactly the same (policy-independent) action a_t. The only reason why you have these probabilities is because in the nondeterministic case a specific action in a specific state can lead to one of multiple different possible states. This is not due to policies, but due to stochasticity in the environment itself.
There does also exist such a thing as a value function for policies, and when talking about that a symbol for the policy should be included. But this is typically not what is meant by just "Value function", and also does not match the equation you have shown us. A policy-dependent function would replace the max_{a_t} with a sum over all actions a, and inside the sum the probability pi(s_t, a) of the policy pi selecting action a in state s_t.
Yes, your assumption is completely right. In the Reinforcement Learning field, a value function is the return obtained by starting for a particular state and following a policy π . So yes, strictly speaking, it should be accompained by the policy sign π .
The Bellman equation basically represents value functions recursively. However, it should be noticed that there are two kinds of Bellman equations:
Bellman optimality equation, which characterizes optimal value functions. In this case, the value function it is implicitly associated with the optimal policy. This equation has the non linear maxoperator and is the one you has posted. The (optimal) policy dependcy is sometimes represented with an asterisk as follows:
Maybe some short texts or papers omit this dependency assuming it is obvious, but I think any RL text book should initially include it. See, for example, Sutton & Barto or Busoniu et al. books.
Bellman equation, which characterizes a value function, in this case associated with any policy π:
In your case, your equation 2.6 is based on the Bellman equation, therefore it should remove the max operator and include the sum over all actions and possible next states. From Sutton & Barto (sorry by the notation change wrt your question, but I think it's understable):

Reinforce Learning: Do I have to ignore hyper parameter(?) after training done in Q-learning?

Learner might be in training stage, where it update Q-table for bunch of epoch.
In this stage, Q-table would be updated with gamma(discount rate), learning rate(alpha), and action would be chosen by random action rate.
After some epoch, when reward is getting stable, let me call this "training is done". Then do I have to ignore these parameters(gamma, learning rate, etc) after that?
I mean, in training stage, I got an action from Q-table like this:
if rand_float < rar:
action = rand.randint(0, num_actions - 1)
else:
action = np.argmax(Q[s_prime_as_index])
But after training stage, Do I have to remove rar, which means I have to get an action from Q-table like this?
action = np.argmax(self.Q[s_prime])
Once the value function has converged (values stop changing), you no longer need to run Q-value updates. This means gamma and alpha are no longer relevant, because they only effect updates.
The epsilon parameter is part of the exploration policy (e-greedy) and helps ensure that the agent visits all states infinitely many times in the limit. This is an important factor in ensuring that the agent's value function eventually converges to the correct value. Once we've deemed the value function converged however, there's no need to continue randomly taking actions that our value function doesn't believe to be best; we believe that the value function is optimal, so we extract the optimal policy by greedily choosing what it says is the best action in every state. We can just set epsilon to 0.
Although the answer provided by #Nick Walker is correct, here it's some additional information.
What you are talking about is closely related with the concept technically known as "exploration-exploitation trade-off". From Sutton & Barto book:
The agent has to exploit what it already knows in order to obtain
reward, but it also has to explore in order to make better action
selections in the future. The dilemma is that neither exploration nor
exploitation can be pursued exclusively without failing at the task.
The agent must try a variety of actions and progressively favor those
that appear to be best.
One way to implement the exploration-exploitation trade-off is using epsilon-greedy exploration, that is what you are using in your code sample. So, at the end, once the agent has converged to the optimal policy, the agent must select only those that exploite the current knowledge, i.e., you can forget the rand_float < rar part. Ideally you should decrease the epsilon parameters (rar in your case) with the number of episodes (or steps).
On the other hand, regarding the learning rate, it worths noting that theoretically this parameter should follow the Robbins-Monro conditions:
This means that the learning rate should decrease asymptotically. So, again, once the algorithm has converged you can (or better, you should) safely ignore the learning rate parameter.
In practice, sometimes you can simply maintain a fixed epsilon and alpha parameters until your algorithm converges and then put them as 0 (i.e., ignore them).