I am currently reading Sutton's book Reinforcement Learning: An Introduction. After reading chapter 6.1 I wanted to implement a TD(0) RL algorithm for this setting:
To do this, I tried to implement the pseudo-code presented here:
Doing this, I wondered how to perform the step A <- action given by π for S: how can I choose the optimal action A for my current state S? As the value function V(S) depends only on the state and not on the action, I do not really know how this can be done.
I found this question (where I got the images from) which deals with the same exercise - but there the action is just picked randomly and not chosen by a policy π.
Edit: Or is this pseudo-code incomplete, so that I also have to approximate the action-value function Q(s, a) in some other way?
You are right, you cannot choose an action (nor derive a policy π) from a value function V(s) alone because, as you noticed, it depends only on the state s.
The key concept you are probably missing here is that TD(0) learning is an algorithm to compute the value function of a given policy. Thus, you are assuming that your agent is following a known policy. In the case of the Random Walk problem, the policy consists of choosing actions randomly.
If you want to be able to learn a policy, you need to estimate the action-value function Q(s, a). There exist several methods to learn Q(s, a) based on temporal-difference learning, such as SARSA and Q-learning (see the sketch after the quote below).
In Sutton's RL book, the authors distinguish between two kinds of problems: prediction problems and control problems. The former refers to the process of estimating the value function of a given policy, the latter to estimating policies (often by means of action-value functions). You can find a reference to these concepts at the beginning of Chapter 6:
As usual, we start by focusing on the policy evaluation or prediction
problem, that of estimating the value function v_π for a given policy π.
For the control problem (finding an optimal policy), DP, TD, and Monte
Carlo methods all use some variation of generalized policy iteration
(GPI). The differences in the methods are primarily differences in
their approaches to the prediction problem.
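To make the prediction setting concrete, here is a minimal sketch of tabular TD(0) prediction under a fixed random policy, assuming a hypothetical `env` with `reset()`, `step(action)` and `actions(state)` methods (adapt the names to your own Random Walk implementation):

```python
import random

def td0_prediction(env, num_episodes=1000, alpha=0.1, gamma=1.0):
    """Estimate V(s) for the fixed (random) policy using TD(0)."""
    V = {}  # state -> estimated value, default 0.0
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = random.choice(env.actions(s))   # "action given by pi": the fixed random policy
            s_next, r, done = env.step(a)
            v_s = V.get(s, 0.0)
            v_next = 0.0 if done else V.get(s_next, 0.0)
            # TD(0) update: V(S) <- V(S) + alpha * [R + gamma * V(S') - V(S)]
            V[s] = v_s + alpha * (r + gamma * v_next - v_s)
            s = s_next
    return V
```

If you wanted to learn a policy instead, you would estimate Q(s, a) and pick actions with respect to it (e.g. ε-greedily), which is exactly what SARSA and Q-learning do.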
I have a more general question regarding deep reinforcement learning. I always struggle a bit with what exactly the difference between on-policy and off-policy is. Sure, one can say that off-policy methods sample actions from a different distribution during trajectory collection, while on-policy methods use the actual policy for trajectory generation. Or that on-policy methods cannot benefit from old data, while off-policy methods can. But neither really answers what the exact difference is; they rather describe the consequence.
In my understanding, DDPG and PPO are both built upon A2C and train an actor and a critic in parallel. The critic is usually trained with an MSE loss using the observed reward of the next timestep and the network's own estimate at the next timestep (possibly unrolled over multiple steps, but let's neglect multi-step returns for now). I do not see a difference between off-policy DDPG and on-policy PPO here (TD3 does it slightly differently, but that's neglected for now since the idea is identical).
In both cases the actor's loss function is based on the value produced by the critic. PPO uses a ratio of the policies to limit the step size, while DDPG uses the policy to predict the action whose value the critic then computes. So in both methods (PPO and DDPG) the CURRENT policy is used in the loss functions of both the critic and the actor.
So now to my actual question: why is DDPG able to benefit from old data, or rather, why does PPO not benefit from old data? One can argue that the ratio of the policies in PPO limits the distance between the policies and therefore needs fresh data. But how is A2C on-policy and unable to benefit from old data, in comparison to DDPG?
I do understand that Q-learning is far more off-policy than policy learning. But I do not get the difference between those policy gradient methods. Does it only come down to the fact that DDPG is deterministic? Does DDPG have any off-policy correction that makes it able to profit from old data?
I would be really happy if someone could bring me closer to understanding these methods.
Cheers
PPO's actor-critic objective functions are based on a set of trajectories obtained by running the current policy over T timesteps. After the policy is updated, trajectories generated from old/stale policies are no longer applicable, i.e. it needs to be trained "on-policy".
[Why? Because PPO uses a stochastic policy (i.e. a conditional probability distribution of actions given states), and the policy's objective function is based on sampling trajectories from a probability distribution that depends on the current policy's probability distribution (i.e. you need to use the current policy to generate the trajectories)... NOTE #1: this is true for any policy gradient approach using a stochastic policy, not just PPO.]
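As a rough illustration of why fresh samples are needed, here is a sketch of PPO's clipped surrogate loss (tensor names are illustrative): the importance ratio is only meaningful for actions that were actually sampled from the data-collecting policy, and the clipping explicitly keeps the new policy close to it.

```python
import torch

def ppo_actor_loss(logp_new, logp_old, adv, clip_eps=0.2):
    """Clipped PPO surrogate (to be minimized).

    logp_new: log-probs of the sampled actions under the policy being optimized
    logp_old: log-probs under the policy that collected the trajectories
    adv:      advantage estimates for those samples
    """
    ratio = torch.exp(logp_new - logp_old)                 # pi_new(a|s) / pi_old(a|s)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    return -torch.min(ratio * adv, clipped * adv).mean()
```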
DDPG/TD3 only need a single timestep for each actor/critic update (via the Bellman equation), and it is straightforward to apply the current deterministic policy to old data tuples (s_t, a_t, r_t, s_{t+1}), i.e. they are trained "off-policy".
[Why? Because DDPG/TD3 use a deterministic policy, and Silver, David, et al. "Deterministic policy gradient algorithms." 2014. proved that the policy's objective function is an expectation over state trajectories from the Markov Decision Process state transition function... but does not depend on a probability distribution over actions induced by the policy, which after all is deterministic, not stochastic.]
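For contrast, here is a sketch of a single DDPG-style update from a replay buffer (module and batch names are assumptions; target-network soft updates are omitted). Nothing in it requires the transitions to have been generated by the current policy:

```python
import torch
import torch.nn.functional as F

def ddpg_update(actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, batch, gamma=0.99):
    """One off-policy update from a minibatch of (possibly old) transitions."""
    s, a, r, s_next, done = batch  # tensors; `done` is a float mask (0./1.)

    # Critic: regress Q(s, a) onto the one-step Bellman target.
    with torch.no_grad():
        q_target = r + gamma * (1 - done) * target_critic(s_next, target_actor(s_next))
    critic_loss = F.mse_loss(critic(s, a), q_target)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: deterministic policy gradient through the critic.
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```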
Hello StackOverflow Community!
I have a question about Actor-Critic Models in Reinforcement Learning.
While listening to the policy gradient lectures from Berkeley, it is said that in actor-critic algorithms, where we optimize both our policy with some policy parameters and our value function with some value-function parameters, some algorithms (e.g. A2C/A3C) use the same parameters in both optimization problems (i.e. policy parameters = value-function parameters).
I could not understand how this works. I was thinking that we should optimize them separately. How does this shared-parameter solution help us?
Thanks in advance :)
You can do it by sharing some (or all) layers of their networks. If you do, however, you are assuming that there is a common state representation (the intermediate layer output) that is optimal w.r.t. both. This is a very strong assumption and it usually doesn't hold. It has been shown to work for learning from images, where you put (for instance) an autoencoder on top of both the actor and the critic network and train it using the sum of their loss functions.
This is mentioned in the PPO paper (just before Eq. (9)). However, they only say that they share layers for learning Atari games, not for continuous control problems. They don't say why, but this can be explained as I said above: Atari games have a low-dimensional state representation that is optimal for both the actor and the critic (e.g., the encoded image learned by an autoencoder), while for continuous control you usually pass a low-dimensional state directly (coordinates, velocities, ...).
A3C, which you mentioned, was also used mostly for games (Doom, I think).
In my experience, sharing layers in control tasks never worked if the state is already compact.
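For completeness, here is a minimal PyTorch sketch of what "sharing parameters" typically looks like in practice: a shared trunk with separate policy and value heads (class name and layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class SharedActorCritic(nn.Module):
    """Shared body; both heads' gradients shape the same state representation."""
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, n_actions)  # actor: action logits
        self.value_head = nn.Linear(hidden, 1)           # critic: V(s)

    def forward(self, obs):
        h = self.body(obs)
        return self.policy_head(h), self.value_head(h)

# Training then minimizes something like
#   loss = actor_loss + value_coef * critic_loss
# so both objectives backpropagate into the shared body.
```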
As part of Q-learning, an objective is to maximize the expected utility. Reading Wikipedia:
https://en.wikipedia.org/wiki/Q-learning describes expected utility in the following contexts:
It works by learning an action-value function that ultimately gives
the expected utility of taking a given action in a given state and
following the optimal policy thereafter.
One of the strengths of Q-learning is that it is able to compare the
expected utility of the available actions without requiring a model of
the environment.
But it does not define what utility is. What is meant by utility?
When maximizing utility, what exactly is being maximized?
In this case, "utility" means functionality or usefulness. So "maximum functionality" or "maximum usefulness".
Plugging the word into Google gives you:
the state of being useful, profitable, or beneficial.
In general terms, utility means something profitable or beneficial (as @Rob posted in his response).
In the Q-learning context, utility is closely related to (they can be viewed as synonyms for) the action-value function, as you read in the Wikipedia explanation. Here, the action-value function of policy π is an estimate of the return (long-term reward) that the agent will obtain if it performs action a in a given state s and follows policy π afterwards. So, when you maximize the utility, you are actually maximizing the rewards your agent will obtain. As rewards are defined to achieve a goal, you are maximizing the "quantity" of goal achieved.
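In code, this is what "maximizing the utility" amounts to in tabular Q-learning (a sketch using a plain dict for Q; names are illustrative): the update moves Q(s, a) toward the estimated return, and acting greedily with respect to Q picks the action with the highest estimated utility.

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One tabular Q-learning backup: move Q(s,a) toward the bootstrapped return."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    td_target = r + gamma * best_next
    q_sa = Q.get((s, a), 0.0)
    Q[(s, a)] = q_sa + alpha * (td_target - q_sa)

def greedy_action(Q, s, actions):
    """Maximize the (estimated) utility: pick the action with the highest Q-value."""
    return max(actions, key=lambda a: Q.get((s, a), 0.0))
```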
I am trying to build a temporal-difference learning agent for Othello. While the rest of my implementation seems to run as intended, I am wondering about the loss function used to train my network. In Sutton's book "Reinforcement Learning: An Introduction", the Mean Squared Value Error (MSVE) is presented as the standard loss function. It is basically a mean squared error weighted by the on-policy distribution. (Sum over all states s ( onPolicyDistribution(s) * [V(s) - V'(s,w)]² ) )
My question is now: how do I obtain this on-policy distribution when my policy is an ε-greedy function of a learned value function? Is it even necessary, and what's the issue if I just use an MSE loss instead?
I'm implementing all of this in pytorch, so bonus points for an easy implementation there :)
As you mentioned, in your case it sounds like you are doing Q-learning, so you do not need to do policy gradient as described in Sutton's book. That is needed when you are learning a policy. You are not learning a policy; you are learning a value function and using that to act.
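If it helps, here is a minimal PyTorch sketch of a semi-gradient TD(0) update with a plain MSE loss, assuming a `value_net` module and tensor-valued transitions (names are illustrative). Since the states in your batches are collected by acting with your ε-greedy policy, they are already sampled roughly according to the on-policy distribution, so the plain MSE implicitly applies that weighting.

```python
import torch
import torch.nn as nn

def td_update(value_net, optimizer, s, r, s_next, done, gamma=1.0):
    """Semi-gradient TD(0): fit V(s) to the bootstrapped target, gradient only through V(s)."""
    with torch.no_grad():
        target = r + gamma * (1 - done) * value_net(s_next)  # `done` is a float mask (0./1.)
    loss = nn.functional.mse_loss(value_net(s), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```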
How do rewards in those two RL techniques work? I mean, they both improve the policy and the evaluation of it, but not the rewards.
Do I need to guess them from the beginning?
You don't need to guess the rewards. The reward is feedback from the environment, and rewards are parameters of the environment. The algorithm works under the condition that the agent can only observe the feedback, the state space, and the action space.
The key idea of Q-learning and TD is asynchronous stochastic approximation, where we approximate the Bellman operator's fixed point using noisy evaluations of the expected long-term reward.
For example, if we want to estimate the expectation of a Gaussian distribution, we can sample from it and average the samples.
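The same idea in a few lines of Python (purely illustrative): an incremental sample average converges toward the expectation, which is the same pattern TD/Q-learning use with noisy Bellman targets.

```python
import random

estimate, alpha = 0.0, 0.05
for _ in range(10_000):
    sample = random.gauss(3.0, 1.0)             # noisy observation
    estimate += alpha * (sample - estimate)     # move the estimate toward the sample
print(estimate)  # hovers around the true mean (3.0)
```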
Reinforcement Learning is for problems where the AI agent has no information about the world it is operating in. Reinforcement learning algorithms not only give you a policy/optimal action at each state, but also navigate a completely foreign environment (with no knowledge of which action will lead to which resulting state) and learn the parameters of this new environment. Algorithms that learn such a model are model-based reinforcement learning algorithms.
Q-learning and temporal-difference learning, on the other hand, are model-free reinforcement learning algorithms. Meaning, the AI agent does the same things as in the model-based case, but it does not have to learn the model (things like transition probabilities) of the world it is operating in. Through many iterations it comes up with a mapping from each state to the optimal action to be performed in that state.
Now coming to your question, you do not have to guess the rewards at different states. Initially, when the agent is new to the environment, it just chooses a random action to perform from the state it is in and gives it to the simulator. The simulator, based on the transition function, returns the resulting state of that state-action pair and also the reward for being in that state.
The simulator is analogous to Nature in the real world. For example, if you find something unfamiliar in the world and perform some action, like touching it, and the thing turns out to be a hot object, Nature gives a reward in the form of pain, so that the next time you know what happens when you try that action. While programming this, it is important to note that the workings of the simulator are not visible to the AI agent that is trying to learn the environment.
Now, depending on the reward that the agent senses, it backs up its Q-value (in the case of Q-learning) or utility value (in the case of TD-learning). Over many iterations these Q-values converge, and you are able to choose an optimal action for every state depending on the Q-values of the state-action pairs.
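Putting the loop described above into a sketch (the `env` interface with `reset()`, `step()` and `actions()` is hypothetical, standing in for the simulator): the agent only ever sees the simulator's outputs (next state and reward), never its internals.

```python
import random

def train(env, episodes=5000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Epsilon-greedy Q-learning driven purely by simulator feedback."""
    Q = {}
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            acts = env.actions(s)
            if random.random() < epsilon:                      # explore
                a = random.choice(acts)
            else:                                              # exploit current Q
                a = max(acts, key=lambda x: Q.get((s, x), 0.0))
            s_next, r, done = env.step(a)                      # simulator returns state + reward
            best_next = 0.0 if done else max(Q.get((s_next, x), 0.0) for x in env.actions(s_next))
            q_sa = Q.get((s, a), 0.0)
            Q[(s, a)] = q_sa + alpha * (r + gamma * best_next - q_sa)  # back up the Q-value
            s = s_next
    return Q
```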