My problem is episodic (an episode stops early when the agent reaches the goal state or the avoid state) and has delayed rewards (the agent is rewarded +1 when it reaches the goal state and penalized -1 when it reaches the avoid state). The state space is continuous and the action space is discrete. I found that DQN, DDQN, and averaged DQN learn well: after I shape the rewards with a potential-based function, the agent reaches the goal state for the first time in 200 episodes. However, PPO-clip, implemented as in https://github.com/ChintanTrivedi/rl-bot-football, learns very slowly (the agent reaches the goal state for the first time in 3000 episodes). Is it possible to tune PPO-clip for episodic problems with early stopping and delayed rewards?
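For readers unfamiliar with the potential-based shaping mentioned in the question, here is a minimal sketch. It uses the standard form F(s, s') = gamma * phi(s') - phi(s). The potential function below (negative Euclidean distance to a goal point) and all names are assumptions for illustration, not the asker's actual setup.

```python
import math


def potential(state, goal):
    # Hypothetical potential: the closer to the goal, the higher the value.
    return -math.dist(state, goal)


def shaped_reward(reward, phi_s, phi_next, gamma=0.99, done=False):
    # Potential-based shaping: r + gamma * phi(s') - phi(s).
    # The potential of a terminal state is treated as zero.
    next_potential = 0.0 if done else phi_next
    return reward + gamma * next_potential - phi_s


goal = (3.0, 4.0)
phi_a = potential((0.0, 0.0), goal)   # distance 5 from the goal
phi_b = potential((3.0, 3.0), goal)   # distance 1 from the goal
# Moving closer to the goal earns a positive shaped reward
# even though the environment reward is still 0.
step_reward = shaped_reward(0.0, phi_a, phi_b, gamma=1.0)
```

The same shaping term can be added to the environment reward before it is passed to PPO, exactly as it would be for DQN.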
I was studying the Markov property in reinforcement learning, which is supposed to be one of the important assumptions of the field. It says that when considering the probability of the future, we consider only the present state and action, not those of the past. An important corollary arises when we consider the probability of the present state given a future state/action: the future state/action can't be ignored, as it carries valuable information for computing the present probability.
I do not understand this second statement. From the point of view of the future event, the present event seems to be the past for this future event. Then why are we considering this past event?
Let's focus on these two sentences individually. The Markov property (which should apply in your problem, but in reality doesn't have to) says that the current state is all you need to look at to make your decision (e.g. a "screenshot", i.e. an observation, of the chess board is all you need to choose an optimal action). On the other hand, if you need to look at some old state (or observation) to understand something that is not implied by your current state, then the Markov property is not satisfied (e.g. you usually can't use a single frame of a video game as a state, since you may be missing information about the velocity and acceleration of moving objects; this is also why people use frame-stacking to "solve" video games with RL).
Now, regarding the future events that seem to be treated as past events: when the agent takes an action, it moves from one state to another. Remember that in RL you want to maximize the cumulative reward, that is, the sum of all rewards over the long term. This also means that you may want to take an action that sacrifices a "good" instantaneous reward if doing so yields a better "future" (long-term) reward (e.g. sometimes you don't want to take the enemy queen if this allows the enemy to checkmate you on the next move). This is why in RL we try to estimate value functions (state and/or action). A state value function assigns to each state a value that represents how good it is to be in that state from a long-term perspective.
How is an agent supposed to know the future reward (aka calculate these value functions)? By exploring a lot of states and taking random actions (literally trial and error). Therefore, when an agent is in a certain "state1" and has to choose between taking action A and action B, he will NOT choose the one that has given him the best instantaneous reward, but the one which has made him get better rewards "long-term", that is the action with the bigger action-value, which will take into account not only the instantaneous rewards he gets from the transition from state1 to the next state, but also the value-function of that next state!
Therefore, future events in that sentence may seem to be treated as past events because estimating the value function requires that you have visited those "future states" many times during past iterations!
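The choice between two actions described above can be sketched with a one-step lookahead: the action value is the immediate reward plus the discounted value of the next state. The numbers below are made up for the chess example and are only an illustration.

```python
def action_value(reward, next_state_value, gamma=0.9):
    # Immediate reward plus the discounted value of the state we land in.
    return reward + gamma * next_state_value


# Hypothetical numbers: taking the queen pays off now but leads to a
# state from which we get checkmated; the quiet move pays nothing now
# but leads to a good position.
values = {
    "take_queen": action_value(reward=9.0, next_state_value=-100.0),
    "quiet_move": action_value(reward=0.0, next_state_value=5.0),
}

greedy_on_reward = "take_queen"            # best instantaneous reward
best_action = max(values, key=values.get)  # best long-term value
```

The next-state values used here are exactly what the agent has to estimate from past visits to those states.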
Hope I've been helpful
I am currently learning reinforcement learning and have built a blackjack game.
There is an obvious reward at the end of the game (the payout). However, some actions do not directly lead to rewards (hitting on a count of 5) and should still be encouraged, even if the end result is negative (losing the hand).
My question is: what should the reward be for those actions?
I could hard code a positive reward (fraction of the reward for winning the hand) for hits which do not lead to busting, but it feels like I am not approaching the problem correctly.
Also, when I assign a reward for a win (after the hand is over), I update the Q-value corresponding to the last state/action pair, which seems suboptimal, as this action may not have directly led to the win.
Another option I thought of is to assign the same end reward to all of the state/action pairs in the sequence. However, some actions (like hitting on a count < 10) should be encouraged even if they lead to a lost hand.
Note: My end goal is to use deep-RL with an LSTM, but I am starting with q-learning.
I would say to start simple and use the rewards the game dictates. If you win, you'll receive a reward +1, if you lose -1.
It seems you'd like to reward some actions based on human knowledge. Maybe start with epsilon-greedy exploration and let the agent discover all actions. Play around with the discount hyperparameter, which determines the importance of future rewards, and see whether the agent comes up with some interesting strategies.
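To illustrate why rewarding only the end of the hand can be enough, here is a minimal tabular Q-learning sketch with epsilon-greedy action selection. The hand below (5, then 15, then 20, then a win) is a made-up example, and the state is reduced to the player's total, which is a simplifying assumption.

```python
import random
from collections import defaultdict

ACTIONS = ("hit", "stick")


def epsilon_greedy(q, state, epsilon, rng):
    # Explore with probability epsilon, otherwise act greedily.
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q[(state, a)])


def q_update(q, state, action, reward, next_state, done, alpha=0.5, gamma=1.0):
    # Standard Q-learning target: bootstrap from the best next action
    # unless the hand is over.
    if done:
        target = reward
    else:
        target = reward + gamma * max(q[(next_state, a)] for a in ACTIONS)
    q[(state, action)] += alpha * (target - q[(state, action)])


q = defaultdict(float)
# Hypothetical winning hand; only the last transition carries a reward.
episode = [
    (5, "hit", 0.0, 15, False),
    (15, "hit", 0.0, 20, False),
    (20, "stick", 1.0, None, True),
]
for _ in range(50):
    for transition in episode:
        q_update(q, *transition)

greedy_at_5 = epsilon_greedy(q, 5, epsilon=0.0, rng=random.Random(0))
```

Although only the final `stick` is rewarded, the value is bootstrapped backwards, so `q[(5, "hit")]` also grows over repeated hands without any hand-coded intermediate reward.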
This blog is about RL and Blackjack.
https://towardsdatascience.com/playing-blackjack-using-model-free-reinforcement-learning-in-google-colab-aa2041a2c13d
I think the title says it. A "game" takes a number of moves to complete, at which point a total score is computed. The goal is to maximize this score, and there are no rewards provided for specific moves during the game. Is there an existing algorithm that is geared toward this type of problem?
EDIT: By "continuously variable" reward, I mean it is a floating point number, not a win/loss binary. So you can't, for example, respond to "winning" by reinforcing the moves made to get there. All you have is a number. You can rank different runs in order of preference, but a single result is not especially meaningful.
First of all, in my opinion, the title of your question seems a little confusing when you talk about "continuously variable reward". Maybe you could clarify this aspect.
On the other hand, setting the previous point aside, it looks like you are talking about the temporal credit-assignment problem: how do you distribute credit over a sequence of actions that only obtains a reward (positive or negative) at the end of the sequence?
E.g., a Tic-tac-toe game where the agent doesn't receive any reward until the game ends. In this case, almost any RL algorithm tries to solve the temporal credit-assignment problem. See, for example, Section 1.5 of the Sutton and Barto RL book, where they explain the working principles of RL and its advantages over other approaches using a Tic-tac-toe game as an example.
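One common way to spread a single end-of-game score over the moves that led to it is to compute discounted returns backwards through the episode. The sketch below assumes a floating-point final score and zero reward for every earlier move; the score and discount factor are arbitrary illustrative values.

```python
def discounted_returns(rewards, gamma=0.9):
    # Walk the episode backwards so each move is credited with the
    # discounted sum of everything that came after it.
    returns = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


final_score = 7.5                        # hypothetical total score
rewards = [0.0, 0.0, 0.0, final_score]   # no per-move rewards
returns = discounted_returns(rewards, gamma=0.9)
```

Each move receives a share of the final score, with earlier moves discounted more. Monte Carlo and policy-gradient methods use such returns as their learning signal.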
When selecting reward values in DQN, Actor-Critic, or A3C, are there any common rules to follow?
I have heard briefly that a reward range of -1 to +1 is quite an efficient choice.
Can you give me any suggestions, and the reasons behind them?
Ideally, you want to normalize your rewards (i.e., zero mean and unit variance). In your example, the reward is between -1 and 1, which is in the spirit of this condition: the values are small and centered around zero. I believe the reason is that it speeds up gradient descent when updating the parameters of your neural network, and it also allows your RL agent to distinguish good and bad actions more effectively.
An example: Imagine we are trying to build an agent to cross the street, and if it crosses the street, it gains a reward of 1. If it gets hit by a car, it gets a reward of -1, and each step yields a reward of 0. Percentage-wise, the reward for success is massively above the reward for failure (getting hit by a car).
However, if we give the agent a reward of 1,000,000,001 for successfully crossing the road and a reward of 999,999,999 for getting hit by a car (this scenario and the one above are identical when normalized), success is no longer as pronounced as before. Also, if you discount such high rewards, the two scenarios become even harder to tell apart.
This is especially a problem in DQN and other function approximation methods, because these methods generalize over the state, action, and reward spaces. Rewards of -1 and 1 are massively different, whereas rewards of 1,000,000,001 and 999,999,999 are basically identical if we use a function to generalize over them.
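The claim that the two reward schemes are identical once normalized can be checked with a few lines. This sketch standardizes each pair of rewards to zero mean and unit variance; the helper name `normalize` is just an illustrative choice.

```python
from statistics import mean, pstdev


def normalize(rewards):
    # Standardize to zero mean and unit (population) variance.
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


small = [1.0, -1.0]                       # cross the street / get hit
large = [1_000_000_001.0, 999_999_999.0]  # same outcomes, shifted

small_norm = normalize(small)
large_norm = normalize(large)

# Relative gap between success and failure before normalization.
relative_gap_large = (large[0] - large[1]) / abs(large[0])
```

Before normalization, the relative gap in the shifted scheme is about two parts in a billion, which a function approximator can easily smooth away. After normalization, both schemes give the same targets.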
In DeepMind's DQN paper, there are two loops: one over episodes and one over the time steps within each episode (one for training and one for the different time steps of a run). Am I right?
Since nothing is done in the outer loop except initialization and resetting to the conditions of the first step, what is the difference between them?
For instance, if in case 1 we run 1000 episodes of 400 time steps each, what differences should we expect in case 2, where we run 4000 episodes of 100 time steps each?
(Is the difference that the second one has a better chance of escaping a local minimum, or something similar? Or are both the same?)
Another question is where updating the experience replay is investigated?
For your first question: the answer is yes, there are two loops, and they do have differences.
You have to think about what an episode really means. In most cases, we can consider each episode a 'game'. A 'game' needs to have an end, and we should do our best to let every game end within the length of an episode (imagine what you could learn if you could never get out of a labyrinth game). The Q-values of DQN are an approximation of 'current reward' + 'discounted future rewards', and you need to know when the future ends to make a better approximation.
So assume we usually take 200 steps to finish the game, then an episode of 100 time steps has a huge difference from an episode of 400 time steps.
For experience replay update, it happens in every time step. I don't get what you're asking. If you can explain your question in detail I think I could answer it.
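The two loops and the per-step replay update can be sketched as follows. This is not the paper's implementation: the corridor environment, the random policy standing in for epsilon-greedy, and all names are assumptions made to keep the example self-contained.

```python
import random
from collections import deque


class ToyEnv:
    """Hypothetical corridor: start at 0, the game ends at position `goal`."""

    def __init__(self, goal=5):
        self.goal = goal
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        self.pos += 1 if action == 1 else -1
        done = self.pos >= self.goal
        reward = 1.0 if done else 0.0
        return self.pos, reward, done


def run(num_episodes, max_steps, seed=0, capacity=1000, batch_size=8):
    rng = random.Random(seed)
    env = ToyEnv()
    replay = deque(maxlen=capacity)
    finished = 0
    for _ in range(num_episodes):        # outer loop: one pass per episode
        state = env.reset()
        for _ in range(max_steps):       # inner loop: one pass per time step
            action = rng.choice((0, 1))  # random policy as a stand-in
            next_state, reward, done = env.step(action)
            # The replay memory is updated at every time step.
            replay.append((state, action, reward, next_state, done))
            if len(replay) >= batch_size:
                # A real DQN would take a gradient step on this minibatch.
                batch = rng.sample(list(replay), batch_size)
            state = next_state
            if done:
                finished += 1
                break
    return finished, len(replay)


# Episodes capped below the 5 steps needed to reach the goal never finish.
too_short = run(num_episodes=5, max_steps=3)
long_enough = run(num_episodes=20, max_steps=50)
```

With a cap of 3 steps, no episode can reach the goal, so the agent never sees a terminal reward however many episodes are run. That is the difference between many short episodes and fewer long ones.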