When selecting reward values in DQN, Actor-Critic, or A3C, are there any common rules for choosing them?
As far as I have heard, rewards in the range -1 to +1 are quite an efficient choice.
Can you give me any suggestions and the reasoning behind them?
Ideally, you want to normalize your rewards (i.e., zero mean and unit variance). In your example, the reward lies between -1 and 1, which comes close to satisfying this condition. I believe the reason is that it speeds up gradient descent when updating the parameters of your neural network, and it also allows your RL agent to distinguish good and bad actions more effectively.
An example: imagine we are trying to build an agent to cross the street. If it crosses the street, it gains a reward of 1; if it gets hit by a car, it gets a reward of -1; and each step yields a reward of 0. Relative to each other, the reward for success is massively above the reward for failure (getting hit by a car).
However, if we give the agent a reward of 1,000,000,001 for successfully crossing the road and a reward of 999,999,999 for getting hit by a car (this scenario and the one above are identical once normalized), the success is no longer as pronounced as before. Also, if you discount such large rewards, the distinction between the two outcomes becomes even harder to identify.
This is especially a problem in DQN and other function-approximation methods, because these methods generalize over the state, action, and reward spaces. Rewards of -1 and 1 are massively different, whereas rewards of 1,000,000,001 and 999,999,999 are essentially identical once a function generalizes over them.
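To make the scale argument concrete, here is a minimal sketch (my own illustration, not part of the original answer) of two common ways implementations keep rewards in a friendly range: clipping to [-1, 1], as in the original DQN setup, and rescaling with running statistics. The class and function names are made up for this example.

```python
import numpy as np

class RunningRewardNormalizer:
    """Tracks a running mean/std of observed rewards (Welford's algorithm)
    and rescales new rewards to roughly zero mean and unit variance."""

    def __init__(self, eps=1e-8):
        self.count, self.mean, self.m2, self.eps = 0, 0.0, 0.0, eps

    def update(self, reward):
        self.count += 1
        delta = reward - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (reward - self.mean)

    def normalize(self, reward):
        std = np.sqrt(self.m2 / max(self.count, 1)) + self.eps
        return (reward - self.mean) / std

# Simpler alternative used by many DQN implementations: clip raw rewards.
def clip_reward(reward):
    return float(np.clip(reward, -1.0, 1.0))

# The two huge rewards from the example above become clearly separable
# once normalized, instead of being numerically almost identical.
norm = RunningRewardNormalizer()
for r in (1_000_000_001, 999_999_999):
    norm.update(r)
print(norm.normalize(1_000_000_001), norm.normalize(999_999_999))
```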
Related
I'm designing the reward function for a DQN model, which is the trickiest part of deep reinforcement learning. I looked at several cases and noticed that the reward is usually set within [-1, 1]. If the negative reward is triggered less often, i.e., it is sparser than the positive reward, the positive reward could be set lower than 1.
I would like to know why I should always try to set the reward within this range (sometimes it is [0, 1], other times [-1, 0], or simply -1). What is the theory or principle behind this range?
I went through this answer; it mentions that setting 500 as the positive reward and -1 as the negative reward would destroy the network. But how would it destroy the model?
I can vaguely understand that this is related to gradient descent, and that it is actually the gap between rewards that matters, not the sign or the absolute value. But I am still missing a clear explanation of how it can destroy the model, and why the rewards should stay in such a range.
Besides, when should I use rewards in [0, 1], and when should I use only negative rewards? Within a given number of timesteps, both approaches seem able to push the agent toward the highest total reward. Only when I want the agent to reach the final point as soon as possible does a negative reward seem more appropriate than a positive one.
Is there a criterion to measure whether a reward design is reasonable? For example, if I sum the Q-values of good and bad actions and the design is symmetric, should the final Q-values end up around zero, indicating convergence?
I would like to know why I should always try to set the reward within this range (sometimes it is [0, 1], other times [-1, 0], or simply -1).
Essentially it is the same whether you define your reward function in the [0, 1] or the [-1, 0] range. It will just result in your action values being positive or negative, but it won't affect the convergence of your neural network.
I went through this answer; it mentions that setting 500 as the positive reward and -1 as the negative reward would destroy the network. But how would it destroy the model?
I wouldn't really agree with that answer. Such a reward function wouldn't "destroy" the model; however, it is incapable of providing a balanced positive and negative reward for the agent's actions. It gives the agent an incentive not to crash, but it doesn't encourage it to cut off opponents.
Besides, when should I use rewards in [0, 1], and when should I use only negative rewards?
As mentioned previously, it doesn't matter whether you use positive or negative rewards; what matters is how the rewards relate to each other. For example, as you said, if you want the agent to reach the terminal state as soon as possible and therefore introduce negative step rewards, this only works if no positive reward is available during the episode. If the agent could pick up a positive reward midway through the episode, it would not be incentivized to end the episode as soon as possible. So it is the relative reward structure that matters, as the toy computation below illustrates.
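Here is a small, self-contained illustration of that point (my own example, with made-up numbers): under a pure -1-per-step scheme, shorter episodes earn higher returns, but adding a positive mid-episode pickup can make a longer detour preferable.

```python
# Toy return calculation comparing two reward schemes (made-up numbers).

def episode_return(step_rewards, gamma=1.0):
    """Sum of (optionally discounted) rewards over one episode."""
    return sum(r * gamma**t for t, r in enumerate(step_rewards))

# Scheme A: -1 per step, 0 at the goal -> finishing sooner is strictly better.
print(episode_return([-1] * 5 + [0]))        # -5  (short episode)
print(episode_return([-1] * 20 + [0]))       # -20 (long episode)

# Scheme B: same -1 per step, but a +10 pickup is reachable via a detour.
# The longer episode now has the higher return, so the incentive to reach
# the terminal state as soon as possible is lost.
print(episode_return([-1] * 5 + [0]))        # -5
print(episode_return([-1] * 8 + [10, 0]))    # +2
```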
What is the principle for designing the reward function of a DQN?
As you said, this is the tricky part of RL. In my humble opinion, the reward is "just" the way to lead your system toward the (state, action) pairs that you value most. So, if you consider one (state, action) pair to be 500x more valuable than another, why not?
About the range of values: suppose you know all the rewards that can be assigned, so you know the range of possible values; then you could easily normalize it, say to [0, 1]. So the range itself does not mean much; it is the relative values you assign that say a lot.
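For what it's worth, a minimal sketch of that rescaling (illustrative names, assuming the minimum and maximum possible rewards are known):

```python
def normalize_reward(reward, r_min, r_max):
    """Linearly rescale a reward from the known range [r_min, r_max] to [0, 1]."""
    return (reward - r_min) / (r_max - r_min)

# Example: raw rewards originally lie in [-500, 500].
print(normalize_reward(-500, -500, 500))  # 0.0
print(normalize_reward(0, -500, 500))     # 0.5
print(normalize_reward(500, -500, 500))   # 1.0
```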
About negative reward values: in general, I find them in problems where the objective is to minimize costs. For instance, suppose you have a robot whose goal is to collect trash in a room and which from time to time has to recharge itself to continue the task. You could assign negative rewards for battery consumption, with the goal of minimizing it. On the other hand, in many games the goal is to score more and more points, so it can be natural to assign positive values.
I am currently learning reinforcement learning and have built a blackjack game.
There is an obvious reward at the end of the game (the payout), but some actions do not directly lead to rewards (e.g., hitting on a count of 5), and these should be encouraged even if the end result is negative (losing the hand).
My question is: what should the reward be for those actions?
I could hard-code a positive reward (a fraction of the reward for winning the hand) for hits that do not lead to busting, but it feels like I am not approaching the problem correctly.
Also, when I assign a reward for a win (after the hand is over), I update the Q-value corresponding to the last action/state pair, which seems suboptimal, as that action may not have directly led to the win.
Another option I thought of is to assign the same final reward to all of the action/state pairs in the sequence; however, some actions (like hitting on a count below 10) should be encouraged even if the hand is eventually lost.
Note: my end goal is to use deep RL with an LSTM, but I am starting with Q-learning.
I would say start simple and use the rewards the game dictates: if you win, you receive a reward of +1; if you lose, -1.
It seems you'd like to reward some actions based on human knowledge. Maybe start with epsilon-greedy exploration and let the agent discover all actions on its own. Play around with the discount hyperparameter, which determines the importance of future rewards, and see whether interesting strategies emerge.
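As a rough, hedged sketch of that setup (the env object and its reset()/step() interface are assumed here, not the asker's actual implementation), tabular Q-learning with epsilon-greedy exploration and only terminal rewards could look like this:

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=10_000, alpha=0.1, gamma=1.0, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration; rewards are 0 on
    every step and +1 / -1 only when the hand ends (win / loss)."""
    Q = defaultdict(float)  # Q[(state, action)] -> estimated value
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            if random.random() < epsilon:
                action = random.choice(actions)                      # explore
            else:
                action = max(actions, key=lambda a: Q[(state, a)])   # exploit
            next_state, reward, done = env.step(action)
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
            # One-step Q-learning update; gamma controls how strongly the
            # terminal payout propagates back to earlier hit/stand decisions.
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```

With gamma close to 1, the terminal payout propagates back to early decisions such as hitting on a count of 5, so there is no need to hand-craft intermediate rewards.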
This blog is about RL and Blackjack.
https://towardsdatascience.com/playing-blackjack-using-model-free-reinforcement-learning-in-google-colab-aa2041a2c13d
I think the title says it: a "game" takes a number of moves to complete, at which point a total score is computed. The goal is to maximize this score, and there are no rewards for individual moves during the game. Is there an existing algorithm geared toward this type of problem?
EDIT: By "continuously variable" reward, I mean it is a floating point number, not a win/loss binary. So you can't, for example, respond to "winning" by reinforcing the moves made to get there. All you have is a number. You can rank different runs in order of preference, but a single result is not especially meaningful.
First of all, in my opinion, the title of your question is a little confusing when you talk about a "continuously variable reward". Maybe you could clarify this aspect.
On the other hand, leaving that point aside, it looks like you are talking about the temporal credit-assignment problem: how do you distribute credit over a sequence of actions that only obtains a reward (positive or negative) at the end?
For example, consider a Tic-tac-toe game where the agent doesn't receive any reward until the game ends. In this case, almost any RL algorithm tries to solve the temporal credit-assignment problem. See, for example, Section 1.5 of Sutton and Barto's RL book, where they explain the working principles of RL and its advantages over other approaches using a Tic-tac-toe game as an example.
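As an illustrative sketch (my own, not from the book) of the simplest way to spread an end-of-game score over the whole sequence, a Monte Carlo-style backup credits every (state, action) pair visited in the episode with the discounted final score:

```python
from collections import defaultdict

def monte_carlo_update(Q, counts, episode, final_score, gamma=1.0):
    """episode is a list of (state, action) pairs in the order they were
    visited; final_score is the single number observed at the end."""
    for steps_from_end, (state, action) in enumerate(reversed(episode)):
        target = (gamma ** steps_from_end) * final_score
        counts[(state, action)] += 1
        # Incremental average of the returns observed for this pair.
        Q[(state, action)] += (target - Q[(state, action)]) / counts[(state, action)]
    return Q

Q, counts = defaultdict(float), defaultdict(int)
monte_carlo_update(Q, counts, [("s0", "a1"), ("s1", "a0")], final_score=3.5)
```

Temporal-difference methods such as Q-learning or Sarsa achieve the same effect incrementally through bootstrapping.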
I have read several materials about deep Q-learning, and I am not sure I understand it completely. From what I learned, it seems that deep Q-learning calculates the Q-values with a neural network rather than storing them in a table, performing a regression, calculating a loss, and backpropagating the error to update the weights. Then, at test time, the NN takes a state and returns a Q-value for each action possible in that state, and the action with the highest Q-value is chosen.
My only question is how the weights are updated. According to this site, the weights are updated as follows:

Δw = α [R + γ max_a' Q(s', a', w) − Q(s, a, w)] ∇_w Q(s, a, w)
I understand that the weights are initialized randomly, that R is returned by the environment, and that gamma and alpha are set manually, but I don't understand how Q(s', a, w) and Q(s, a, w) are initialized and calculated. Does it mean that we should build a table of Q-values and update them as in tabular Q-learning, or are they calculated automatically at each NN training epoch? What am I not understanding here? Can somebody explain this equation to me?
In Q-learning, we are concerned with learning the Q(s, a) function, which maps a state to a value for each action. Say you have an arbitrary state space and an action space of 3 actions; each state then maps to three different values, one per action. In tabular Q-learning, this is done with a physical table. Consider the following case:
Here, we have a Q table covering each state in the game (upper left), and after each time step the Q-value for the specific action taken is updated according to some reward signal. The reward signal can be discounted by some factor between 0 and 1.
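In code, such a table is just an array indexed by state and action; the sizes and numbers below are arbitrary, for illustration only:

```python
import numpy as np

n_states, n_actions = 16, 3
Q = np.zeros((n_states, n_actions))   # one row per state, one column per action

alpha, gamma = 0.1, 0.99              # step-size and discount factor
s, a, r, s_next = 0, 2, 1.0, 5        # a single observed transition

# Tabular one-step Q-learning update: only the visited (state, action)
# entry changes, everything else stays untouched.
Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
```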
In deep Q-learning, we discard the table and instead create a parametrized "table" such as this:
Here, all of the weights form combinations, given the input, that should approximately match the values seen in the tabular case (this is still actively researched).
The equation you presented is the Q-learning update rule expressed as a gradient update rule.
alpha is the step-size
R is the reward
gamma is the discount factor
You do inference with the network to retrieve the value of the "discounted future state" and subtract the "current" state's value from it. If this is unclear, I recommend looking up bootstrapping, which is basically what is happening here.
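Here is a hedged sketch of that update in code, assuming a PyTorch-style setup (layer sizes and hyperparameters are arbitrary, and the network/function names are made up for the example). The Q-values are never stored in a table; they are simply the network's outputs, recomputed for every state that is fed in:

```python
import torch
import torch.nn as nn

state_dim, n_actions = 4, 2
q_net = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(), nn.Linear(32, n_actions))
optimizer = torch.optim.SGD(q_net.parameters(), lr=0.01)  # lr plays the role of alpha
gamma = 0.99

def q_learning_step(s, a, r, s_next, done):
    """One semi-gradient Q-learning update on a single transition."""
    q_sa = q_net(s)[a]                       # Q(s, a, w): just a forward pass
    with torch.no_grad():                    # bootstrapped target, no gradient through it
        target = r + (0.0 if done else gamma * q_net(s_next).max())
    loss = (target - q_sa) ** 2              # squared TD error
    optimizer.zero_grad()
    loss.backward()                          # gradient flows only through Q(s, a, w)
    optimizer.step()                         # step proportional to TD error * grad Q(s, a, w)
    return float(loss)

s, s_next = torch.randn(state_dim), torch.randn(state_dim)
q_learning_step(s, a=1, r=1.0, s_next=s_next, done=False)
```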
I use n-step Sarsa, and sometimes Sarsa(lambda).
After experimenting a bit with different epsilon schedules, I found that the agent learns faster when I change epsilon during an episode based on the number of steps already taken and the mean length of the last 10 episodes:
Low number of steps/beginning of episode => Low epsilon
High number of steps/end of episode => High epsilon
This works far better than just an epsilon decay over time from episode to episode.
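For concreteness, a rough sketch of the kind of schedule described above (the bounds and normalization here are arbitrary, not the exact ones used):

```python
def epsilon_for_step(steps_taken, mean_recent_length, eps_min=0.05, eps_max=0.5):
    """Low epsilon early in an episode, high epsilon late in an episode,
    measured against the mean length of the last few episodes."""
    if mean_recent_length <= 0:
        return eps_min
    progress = min(steps_taken / mean_recent_length, 1.0)
    return eps_min + progress * (eps_max - eps_min)

print(epsilon_for_step(5, mean_recent_length=100))    # near eps_min
print(epsilon_for_step(120, mean_recent_length=100))  # capped at eps_max
```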
Does the theory allow this?
I think yes because all states are still visited regularly.
Yes, the SARSA algorithm converges even when you update the epsilon parameter within each episode. The requirement is that epsilon should eventually tend to zero or to a small value.
In your case, where you start with a small epsilon in each episode and increase it as the number of steps grows, it is not clear to me that your algorithm will converge to an optimal policy. At some point, epsilon should decrease.
The "best" epsilon schedule is highly problem dependent, and there is not a schedule that works fine in all problems. So, at the end, it's required some experience in the problem and probably some trial and error adjustment.