My question is simple: will RL methods converge to an optimal policy when epsilon equals zero (with a reward function that takes both negative and positive values)?
Thanks,
No, it does not. With epsilon=0 there is no exploration, and without exploration there is no convergence guarantee. This is also intuitively sound: without exploration you cannot learn the environment well enough to find the optimal policy.
For example, for the Q-learning algorithm you can see the formal proof in
Watkins, Christopher JCH, and Peter Dayan. "Q-learning." Machine learning 8.3-4 (1992): 279-292.
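which shows that Q-learning converges to the optimal values provided that every state-action pair keeps being visited and the learning rate is decayed appropriately. With an epsilon-greedy policy this means \epsilon may go to zero only gradually, as the number of observations goes to infinity.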
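As a small illustration of why exploration matters, here is a minimal sketch (not taken from the paper). It assumes a deterministic two-armed bandit, value estimates initialised to zero, and ties broken in favour of the lowest index; `run_bandit` and its parameters are hypothetical names chosen for this example.

```python
import numpy as np

def run_bandit(epsilon, rewards, steps=2000, alpha=0.5, seed=0):
    """Epsilon-greedy value learning on a deterministic bandit (illustrative only)."""
    rng = np.random.default_rng(seed)
    q = np.zeros(len(rewards))  # assumption: estimates start at zero
    for _ in range(steps):
        if rng.random() < epsilon:
            action = int(rng.integers(len(rewards)))  # explore: uniform random arm
        else:
            action = int(np.argmax(q))  # exploit: ties go to the lowest index
        # Move the estimate of the chosen arm towards the observed reward.
        q[action] += alpha * (rewards[action] - q[action])
    return q

rewards = [1.0, 2.0]  # arm 1 is the optimal arm
q_greedy = run_bandit(epsilon=0.0, rewards=rewards)
q_explore = run_bandit(epsilon=0.1, rewards=rewards)
```

In this setup the purely greedy agent never tries arm 1, so its estimate for that arm stays at its initial value, while the exploring agent ends up preferring arm 1. The outcome depends on the initial values and the tie-breaking rule, so treat it as an illustration rather than a general proof.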
I have just started an MSc in Scientific Computing, and being an engineer my knowledge of real analysis is somewhat limited.
When rewriting f(x) = 0 as a fixed point formulation Phi(x) = x, it is stressed that we must check that Phi maps the interval [a,b] into itself, i.e. that Phi(x) lies in [a,b] for every x in [a,b].
Is there a general real analysis method of checking this, using the Mean Value Theorem for example, or do I need to use the simpler calculus method of checking the minimum and maximum values of Phi(x)? The simpler calculus method doesn't seem satisfactory or formal enough in a real analysis sense.
Any assistance would be appreciated.
Kind regards
John
In short:
In the policy gradient method, if the reward is always positive (never negative), the policy gradient will always be positive, hence it will keep making our parameters larger. This makes the learning algorithm meaningless. How do we get around this problem?
In detail:
In "RL Course by David Silver" lecture 7 (on YouTube), he introduced the REINFORCE algorithm for policy gradient (here just showing 1 step):
The actual policy update is:
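In the notation commonly used for REINFORCE, and assuming this matches the lecture slide, the per-step update can be written as follows, where \alpha is the step size and \pi_\theta the parameterised policy:

```latex
\Delta\theta_t = \alpha \, \nabla_\theta \log \pi_\theta(s_t, a_t) \, v_t
```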
Note that v_t here stands for the reward we get. Let's say we're playing a game where the reward is always positive (e.g. accumulating a score) and there are never any negative rewards. Then the gradient will always be positive, hence theta will keep increasing! So how do we deal with rewards that never change sign?
Theta isn't one number, but rather a vector of numbers that parameterize your model. The gradient with respect to each parameter may be positive or negative. For example, consider that your parameters are just the probabilities for each action. They are constrained to add to 1.0. Increasing the probability of one action requires that at least one of the other actions decreases in probability.
Hi, in the formula there is also the gradient of the log probability of the action, whose components can be positive or negative. With policy gradients, the policy increases or decreases the probability of taking a specific action in a given state, and the value term only determines how much that probability changes. So it is totally fine if all rewards are positive.
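To make this concrete, here is a minimal sketch assuming a state-independent softmax policy over three actions, with theta as the vector of logits. The function names are hypothetical and this is not the code from the lecture.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def reinforce_step(theta, action, v_t, alpha=0.1):
    """One REINFORCE step for a softmax policy whose logits are theta."""
    probs = softmax(theta)
    # Gradient of log pi(action) with respect to the logits: one_hot(action) - probs.
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0
    return theta + alpha * v_t * grad_log_pi, grad_log_pi

theta = np.zeros(3)
new_theta, grad = reinforce_step(theta, action=0, v_t=5.0)  # strictly positive reward
```

Even though the reward is positive, the gradient components sum to zero: the logit of the sampled action goes up and the others go down, so the update shifts probability mass rather than growing every parameter.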
I am working on a multi-class image recognition problem. The task is to have the correct answer among the top 3 output probabilities. So I was thinking that maybe there exists a clever cost function that prioritizes the correct answer being in the top K and doesn't penalize much within these top K.
This can be achieved by a weighted cross-entropy loss, which essentially assigns a weight to the errors associated with each class. This loss is used in research, e.g. see the paper "Multi-task learning and Weighted Cross-entropy for DNN-based Keyword" by S. Panchapagesan et al. Before computing the cross-entropy, you can check whether the predicted distribution satisfies your condition (e.g., the ground truth class is in the top-k of the predicted classes) and, if it does, assign zero (or near-zero) weights accordingly.
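The check-then-weight idea can be sketched as below. This is only one possible reading of the description above, using plain NumPy on already-normalised probabilities; `topk_weighted_cross_entropy` and `in_topk_weight` are hypothetical names, not part of any library or of the cited paper.

```python
import numpy as np

def topk_weighted_cross_entropy(probs, labels, k=3, in_topk_weight=0.0):
    """Cross-entropy where samples whose true class is already in the top k get a small weight."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    # Indices of the k largest probabilities in each row.
    topk = np.argsort(-probs, axis=1)[:, :k]
    in_topk = (topk == labels[:, None]).any(axis=1)
    weights = np.where(in_topk, in_topk_weight, 1.0)
    # Standard per-sample cross-entropy for the true class.
    ce = -np.log(probs[np.arange(len(labels)), labels] + 1e-12)
    return float(np.mean(weights * ce)), in_topk

probs = [[0.90, 0.05, 0.03, 0.01, 0.01],   # true class 2 is in the top 3
         [0.40, 0.30, 0.20, 0.06, 0.04]]   # true class 3 is not
loss, in_topk = topk_weighted_cross_entropy(probs, labels=[2, 3], k=3)
```

With `in_topk_weight=0.0` only the second sample contributes to the loss, so the network receives no signal to improve predictions that already satisfy the top-k condition.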
There are open questions though: when the correct class is in top-k, should you penalize the k-1 incorrectly predicted classes? What if, for example, the prediction is (0.9, 0.05, 0.01, ...), the third class is correct and it is in top-3 -- is this prediction good enough or not? Should you care what exactly k-1 incorrect classes are?
All these questions arise because this kind of loss doesn't have a probabilistic interpretation, unlike standard cross-entropy. That's why I wouldn't recommend using it in practice, and would reformulate the goal instead.
E.g., if the original problem is that for some inputs several classes are equally good, the best way to deal with it is to use soft labels, e.g. (0.33, 0.33, 0.33, 0, 0, 0, ...) instead of one-hot (note that this totally agrees with probabilistic interpretation). It will force the network to learn features associated with all three good classes, and generally lead to the same goal, but with better control over target classes.
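A minimal sketch of cross-entropy against soft labels, written in plain NumPy with a hypothetical helper name; a deep learning framework would normally provide its own version of this.

```python
import numpy as np

def soft_label_cross_entropy(logits, target_dist):
    """Cross-entropy between a target distribution and softmax(logits)."""
    logits = np.asarray(logits, dtype=float)
    z = logits - logits.max()                 # shift for numerical stability
    log_probs = z - np.log(np.exp(z).sum())   # log-softmax
    return float(-(np.asarray(target_dist) * log_probs).sum())

target = np.array([1/3, 1/3, 1/3, 0.0, 0.0, 0.0])  # three equally good classes
spread = soft_label_cross_entropy([2, 2, 2, 0, 0, 0], target)
peaked = soft_label_cross_entropy([6, 0, 0, 0, 0, 0], target)
```

Here `spread` comes out lower than `peaked`: the loss rewards spreading probability over all three acceptable classes instead of committing to just one of them.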
The learner starts in a training stage, where it updates the Q-table for a number of epochs.
In this stage, the Q-table is updated using gamma (the discount rate) and alpha (the learning rate), and actions are chosen according to a random action rate.
After some epochs, when the reward becomes stable, let me say that "training is done". Do I have to ignore these parameters (gamma, learning rate, etc.) after that?
I mean, in the training stage, I get an action from the Q-table like this:
    if rand_float < rar:
        action = rand.randint(0, num_actions - 1)
    else:
        action = np.argmax(Q[s_prime_as_index])
But after the training stage, do I have to remove rar, meaning I get an action from the Q-table like this?
    action = np.argmax(self.Q[s_prime])
Once the value function has converged (values stop changing), you no longer need to run Q-value updates. This means gamma and alpha are no longer relevant, because they only affect updates.
The epsilon parameter is part of the exploration policy (e-greedy) and helps ensure that the agent visits all states infinitely many times in the limit. This is an important factor in ensuring that the agent's value function eventually converges to the correct value. Once we've deemed the value function converged however, there's no need to continue randomly taking actions that our value function doesn't believe to be best; we believe that the value function is optimal, so we extract the optimal policy by greedily choosing what it says is the best action in every state. We can just set epsilon to 0.
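As a sketch of what this looks like in code, assume the Q-table is a NumPy array indexed as `Q[state, action]`. `select_action` is a hypothetical helper, not the asker's actual method.

```python
import numpy as np

def select_action(Q, state, epsilon, rng):
    """Epsilon-greedy selection; with epsilon=0 this is purely greedy."""
    num_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(num_actions))  # random exploratory action
    return int(np.argmax(Q[state]))            # best action under the current Q

rng = np.random.default_rng(0)
Q = np.array([[0.1, 0.9],
              [0.7, 0.2]])
# After training: epsilon=0, so the random branch is never taken.
greedy_actions = [select_action(Q, s, epsilon=0.0, rng=rng) for s in (0, 1)]
```

Keeping one function for both phases and passing `epsilon=0.0` at evaluation time avoids maintaining two separate code paths.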
Although the answer provided by @Nick Walker is correct, here is some additional information.
What you are talking about is closely related to the concept technically known as the "exploration-exploitation trade-off". From the Sutton & Barto book:
The agent has to exploit what it already knows in order to obtain
reward, but it also has to explore in order to make better action
selections in the future. The dilemma is that neither exploration nor
exploitation can be pursued exclusively without failing at the task.
The agent must try a variety of actions and progressively favor those
that appear to be best.
One way to implement the exploration-exploitation trade-off is epsilon-greedy exploration, which is what you are using in your code sample. So, in the end, once the agent has converged to the optimal policy, it should select only those actions that exploit its current knowledge, i.e., you can drop the rand_float < rar part. Ideally you should decrease the epsilon parameter (rar in your case) with the number of episodes (or steps).
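One common way to decrease epsilon is an exponential schedule with a floor. The sketch below is only an illustration; the function name and the default values are assumptions, not recommendations from the book.

```python
def decayed_epsilon(episode, eps_start=1.0, eps_min=0.01, decay=0.99):
    """Exponentially decay epsilon per episode, never going below eps_min."""
    return max(eps_min, eps_start * decay ** episode)

# Example: the exploration rate at a few points during training.
schedule = [decayed_epsilon(e) for e in (0, 100, 1000)]
```

In the training loop, `rar` would be set to `decayed_epsilon(episode)` at the start of each episode, and to 0 once training is finished.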
On the other hand, regarding the learning rate, it is worth noting that theoretically this parameter should follow the Robbins-Monro conditions:
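These conditions are usually stated as follows, writing \alpha_t for the learning rate at step t. A schedule such as \alpha_t = 1/t satisfies both.

```latex
\sum_{t=1}^{\infty} \alpha_t = \infty, \qquad \sum_{t=1}^{\infty} \alpha_t^2 < \infty
```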
This means that the learning rate should decrease asymptotically. So, again, once the algorithm has converged you can (or better, you should) safely ignore the learning rate parameter.
In practice, you can sometimes simply keep epsilon and alpha fixed until your algorithm converges and then set them to 0 (i.e., ignore them).
I've been studying up on reinforcement learning, but the thing I don't understand is how a Q value is ever calculated. If you use the Bellman equation Q(s,a) = r + γ*max(Q(s',a')), wouldn't it just go on forever? Because Q(s',a') would need the Q value of one timestep further, and that would just continue on and on. How does it end?
In Reinforcement Learning you normally try to find a policy (the best action to take in a specific state), and the learning process ends when the policy does not change anymore or the value function (representing the expected reward) has converged.
You seem to confuse Q-learning and Value Iteration using the Bellman equation. Q-learning is a model-free technique where you use obtained reward to update Q:
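In standard notation (assumed here to match the formula this answer refers to), the update reads:

```latex
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]
```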
Here the direct reward r_{t+1} is the reward obtained after having done action a_t in state s_t. α is the learning rate, which should be between 0 and 1: if it is 0 no learning is done, and if it is 1 the old estimate is completely replaced by the newest target (the reward plus the discounted estimate of the next state).
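A minimal tabular sketch of a single update step shows why the recursion does not "go on forever": the target uses the current table entry for the next state rather than expanding it further. The function name, the `terminal` flag and the default values are assumptions made for this example.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9, terminal=False):
    """Apply one Q-learning update in place and return the table."""
    # The target bootstraps from the *current* estimate of the next state.
    target = r if terminal else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

Q = np.zeros((2, 2))                           # 2 states, 2 actions, all estimates start at 0
q_learning_update(Q, s=0, a=1, r=1.0, s_next=1)
first = Q[0, 1]                                # 0.1 * (1.0 + 0.9 * 0 - 0)
Q[1, 0] = 2.0                                  # pretend state 1 has since been learned
q_learning_update(Q, s=0, a=1, r=1.0, s_next=1)
second = Q[0, 1]                               # 0.1 + 0.1 * (1.0 + 0.9 * 2.0 - 0.1)
```

Each call does a constant amount of work, and the estimates improve gradually as the same transitions are revisited.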
Value iteration with the Bellman equation:
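A common way to write this update is shown below. The reward term R_a(s, s') is an assumption about the notation, chosen to be consistent with the transition model P_a(s, s').

```latex
V_{t+1}(s) = \max_{a} \sum_{s'} P_a(s, s') \left[ R_a(s, s') + \gamma V_t(s') \right]
```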
Here a model P_a(s,s') is required, also written P(s'|s,a): the probability of going from state s to s' using action a. To check whether the value function has converged, V_{t+1} is normally compared to V_t for all states, and if the largest difference is smaller than a small value (ε) the value function, and hence the policy, is said to have converged:
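The stopping rule, max over s of |V_{t+1}(s) - V_t(s)| < ε, can be sketched together with the iteration itself. The array layout `P[a, s, s2]` and `R[a, s, s2]`, the function name and the toy MDP are all assumptions made for this example.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, eps=1e-8, max_iters=10_000):
    """P[a, s, s2]: transition probabilities; R[a, s, s2]: rewards (both assumed known)."""
    V = np.zeros(P.shape[1])
    for _ in range(max_iters):
        # Q[a, s] = sum over s2 of P[a, s, s2] * (R[a, s, s2] + gamma * V[s2])
        Q = np.einsum("ast,ast->as", P, R + gamma * V[None, None, :])
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < eps:  # converged: values stopped changing
            return V_new, Q.argmax(axis=0)
        V = V_new
    return V, Q.argmax(axis=0)

# Toy MDP: action 0 stays in place, action 1 moves to the other state.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
R = np.zeros((2, 2, 2))
R[:, :, 1] = 1.0  # reward 1 for landing in state 1
V, policy = value_iteration(P, R)
```

For this toy problem the resulting policy moves from state 0 to state 1 and then stays there.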
See also:
Difference between Q-learning and Value Iteration
How do I know when a Q-learning algorithm converges?
Sutton et al.: RL