Why does DDPG/TD3 benefit from old data and PPO not?

I have a more general question about deep reinforcement learning. I always struggle a little with what exactly the difference between on-policy and off-policy is. Sure, one can say that off-policy means sampling actions from a different distribution than the current policy during trajectory collection, while on-policy means using the current policy itself to generate the trajectories. Or that on-policy methods cannot benefit from old data, while off-policy methods can. But neither really answers what the exact difference is; they rather describe the consequence.
In my understanding, DDPG and PPO are both built upon A2C and train an actor and a critic in parallel. The critic is usually trained with an MSE loss based on the observed reward of the next timestep (possibly using n-step rollouts, but let's neglect multi-step returns for now) and the critic's own estimate at the next timestep. I do not see a difference between off-policy DDPG and on-policy PPO here (TD3 does it slightly differently, but I neglect that for now since the idea is identical).
In both cases the actor has a loss function based on the value produced by the critic. While PPO uses a ratio of the policies to limit the step size, DDPG uses the policy to predict the action whose value the critic computes. So in both methods (PPO and DDPG), the CURRENT policy is used in the loss functions of the critic and the actor.
So now to my actual question: why is DDPG able to benefit from old data, or rather, why does PPO not benefit from old data? One can argue that the ratio of the policies in PPO limits the distance between the policies and therefore needs fresh data. But then how is A2C on-policy and unable to benefit from old data, in comparison to DDPG?
I do understand the difference between Q-learning being far more off-policy than policy learning. But I do not get the difference between these policy-gradient methods. Does it only come down to the fact that DDPG is deterministic? Does DDPG have some off-policy correction that makes it able to profit from old data?
I would be really happy if someone could bring me closer to understanding these methods.
Cheers

PPO's actor-critic objective functions are based on a set of trajectories obtained by running the current policy for T timesteps. After the policy is updated, trajectories generated from old/stale policies are no longer applicable, i.e. it needs to be trained "on-policy".
[Why? Because PPO uses a stochastic policy (i.e. a conditional probability distribution of actions given states), and the policy's objective function is based on sampling trajectories from a probability distribution that depends on the current policy's probability distribution (i.e. you need to use the current policy to generate the trajectories)... NOTE #1: this is true for any policy-gradient approach using a stochastic policy, not just PPO.]
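To make the dependence on the data-collecting policy concrete, here is a minimal sketch of PPO's clipped surrogate objective with made-up numbers (the array names and values are illustrative only):

```python
import numpy as np

# Log-probabilities of the sampled actions under the policy that COLLECTED the
# data (fixed) and under the policy currently being optimized (recomputed each
# gradient step), plus advantage estimates from the critic. All values made up.
logp_old = np.array([-1.2, -0.8, -2.1])
logp_new = np.array([-1.0, -0.9, -1.8])
advantages = np.array([0.5, -0.3, 1.2])

ratio = np.exp(logp_new - logp_old)            # pi_new(a|s) / pi_old(a|s)
clipped = np.clip(ratio, 1 - 0.2, 1 + 0.2)     # epsilon = 0.2
surrogate = np.minimum(ratio * advantages, clipped * advantages).mean()
print(surrogate)
```

The importance ratio is only meaningful while the data-collecting policy and the optimized policy stay close, which is exactly why the trajectories have to be re-collected after a few updates.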
DDPG/TD3 only need a single timestep for each actor/critic update (via the Bellman equation), and it is straightforward to apply the current deterministic policy to old data tuples (s_t, a_t, r_t, s_{t+1}), i.e. they are trained "off-policy".
[WHY? Because DDPG/TD3 use a deterministic policy, and Silver, David, et al., "Deterministic policy gradient algorithms", 2014, proved that the policy's objective is an expectation over states visited under the Markov Decision Process's state-transition dynamics, but it does not depend on an action-probability distribution induced by the policy, which after all is deterministic rather than stochastic.]
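A minimal sketch of how such an update looks in practice, assuming PyTorch; target networks, exploration noise and all hyperparameters are omitted or invented here, and the random tensors stand in for a replay-buffer batch:

```python
import torch
import torch.nn as nn

# Tiny illustrative networks: 4-dimensional state, 1-dimensional action in [-1, 1].
actor = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1), nn.Tanh())
critic = nn.Sequential(nn.Linear(4 + 1, 64), nn.ReLU(), nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
gamma = 0.99

# A batch of old (s, a, r, s') tuples sampled from a replay buffer; it does not
# matter which (possibly stale) policy produced the stored actions.
s = torch.randn(32, 4)
a = torch.rand(32, 1) * 2 - 1
r = torch.randn(32, 1)
s2 = torch.randn(32, 4)

# Critic: regress Q(s, a) onto the one-step Bellman target, evaluating the
# CURRENT deterministic policy at s'.
with torch.no_grad():
    target = r + gamma * critic(torch.cat([s2, actor(s2)], dim=1))
critic_loss = nn.functional.mse_loss(critic(torch.cat([s, a], dim=1)), target)
critic_opt.zero_grad()
critic_loss.backward()
critic_opt.step()

# Actor: move the deterministic policy toward actions the critic rates highly.
actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
actor_opt.zero_grad()
actor_loss.backward()
actor_opt.step()
```

Nothing in these two updates refers to the probability of the stored action under any policy, which is why arbitrarily old tuples remain usable.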

Related

Fitting step in Deep Q Network

I am confused about why the DQN-with-experience-replay algorithm performs a gradient descent step for every step in a given episode. That fits only one sample's worth of data at a time, right? This would make it extremely slow. Why not update after each episode ends, or every time the model is cloned?
In the original paper, the authors push one sample to the experience replay buffer per environment step and randomly sample 32 transitions to train the model in minibatch fashion. The samples taken from interacting with the environment are not fed directly to the model. To increase training speed, the authors store a sample every step but update the model only every four steps.
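A toy, self-contained sketch of that schedule (push one transition per step, sample a random minibatch of 32 every 4 steps, clone the target every C steps). To keep it dependency-free, the "network" is just a Q-table on an invented chain environment; the schedule is the point here, not the model:

```python
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", "s a r s2 done")
n_states, n_actions = 5, 2
q = [[0.0] * n_actions for _ in range(n_states)]   # online Q-values
q_target = [row[:] for row in q]                   # cloned target copy
buffer = deque(maxlen=10_000)
gamma, lr, eps, C = 0.99, 0.1, 0.1, 500

def env_step(s, a):                                # toy chain: reach the right end
    s2 = min(max(s + (1 if a == 1 else -1), 0), n_states - 1)
    done = s2 == n_states - 1
    return s2, float(done), done

s = 0
for step in range(1, 5001):
    a = random.randrange(n_actions) if random.random() < eps else max(
        range(n_actions), key=lambda i: q[s][i])
    s2, r, done = env_step(s, a)
    buffer.append(Transition(s, a, r, s2, done))   # store every single step
    s = 0 if done else s2

    if step % 4 == 0 and len(buffer) >= 32:        # update every 4 env steps
        for t in random.sample(buffer, 32):        # random minibatch of 32
            target = t.r + (0.0 if t.done else gamma * max(q_target[t.s2]))
            q[t.s][t.a] += lr * (target - q[t.s][t.a])
    if step % C == 0:                              # clone the target every C steps
        q_target = [row[:] for row in q]
```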
Using OpenAI's Baselines project, this single-process method can master easy games like Atari Pong (Pong-v4) in about 2.5 hours on a single GPU. Of course, training in this single-process way leaves the resources of a multi-core, multi-GPU (or single-GPU) system underutilised. So newer publications have decoupled action selection from model optimisation: they use multiple "actors" to interact with environments simultaneously and either a single GPU "learner" to optimise the model, or multiple learners with multiple models on several GPUs. The multi-actor, single-learner setup is described in DeepMind's Ape-X DQN (Distributed Prioritized Experience Replay, D. Horgan et al., 2018), and the multi-actor, multi-learner setup in Accelerated Methods for Deep Reinforcement Learning (Stooke and Abbeel, 2018). When using multiple learners, parameter sharing across processes becomes essential. An older attempt is described in DeepMind's PDQN (Massively Parallel Methods for Deep Reinforcement Learning, Nair et al., 2015), proposed in the period between DQN and A3C. However, that work ran entirely on CPUs, so although it used massive resources, its results are easily outperformed by PAAC's batched action selection on a GPU.
You can't optimise only at each episode end, because episode length isn't fixed: the better the model, the longer the episodes usually get. The model would then be updated less often just as it starts performing a little better, and the learning progress would become unstable.
We also don't train the model only when the target model is cloned, because the target network is introduced to stabilise training by keeping an older set of parameters. If you updated only at parameter clones, the target model's parameters would always equal the model's, and this causes instability: with identical parameters, one model update also raises the value of the next state, and hence the target.
In Deepmind's 2015 Nature paper, it states that:
The second modification to online Q-learning aimed at further improving the stability of our method with neural networks is to use a separate network for generating the target y_j in the Q-learning update. More precisely, every C updates we clone the network Q to obtain a target network Q' and use Q' for generating the Q-learning targets y_j for the following C updates to Q.
This modification makes the algorithm more stable compared to standard online Q-learning, where an update that increases Q(s_t, a_t) often also increases Q(s_{t+1}, a) for all a and hence also increases the target y_j, possibly leading to oscillations or divergence of the policy. Generating the targets using the older set of parameters adds a delay between the time an update to Q is made and the time the update affects the targets y_j, making divergence or oscillations much more unlikely.
Human-level control through deep reinforcement learning, Mnih et al., 2015
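Written out, the target the quote refers to is y_j = r_j + γ · max_a' Q'(s_{j+1}, a') for a non-terminal s_{j+1} (and y_j = r_j when the episode terminates at step j+1), where Q' is the cloned target network that is only refreshed every C updates.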

Difference between Evolutionary Strategies and Reinforcement Learning?

I am learning about the approach employed in Reinforcement Learning for robotics and I came across the concept of Evolutionary Strategies. But I couldn't understand how RL and ES are different. Can anyone please explain?
To my understanding, I know of two main ones.
1) Reinforcement learning uses the concept of one agent, and the agent learns by interacting with the environment in different ways. Evolutionary algorithms usually start with many "agents", and only the "strong ones survive" (the agents with characteristics that yield the lowest loss).
2) A reinforcement learning agent learns from both positive and negative actions, whereas evolutionary algorithms only learn from the optimal ones; the information about negative or suboptimal solutions is discarded and lost.
Example
You want to build an algorithm to regulate the temperature in the room.
The room is 15 °C, and you want it to be 23 °C.
Using Reinforcement learning, the agent will try a bunch of different actions to increase and decrease the temperature. Eventually, it learns that increasing the temperature yields a good reward. But it also learns that reducing the temperature will yield a bad reward.
For evolutionary algorithms, the process starts with a bunch of random agents that each have a preprogrammed set of actions they are going to take. The agents that have the "increase temperature" action survive and move on to the next generation. Eventually, only agents that increase the temperature survive and are deemed the best solution. However, the algorithm never learns what happens if you decrease the temperature.
TL;DR: RL is usually one agent trying different actions, and learning and remembering all information (positive or negative). ES uses many agents that guess many actions, and only the agents with the optimal actions survive; it is basically a brute-force way to solve a problem.
I think the biggest difference between Evolutionary Strategies and Reinforcement Learning is that ES is a global optimization technique while RL is a local optimization technique. So RL tends to converge faster, but possibly to a local optimum, while ES converges more slowly towards a global optimum.
Evolution Strategies optimization happens at the population level. An evolution strategy algorithm iteratively (i) samples a batch of candidate solutions from the search space, (ii) evaluates them, and (iii) discards the ones with low fitness values. The sampling for a new iteration (or generation) happens around the mean of the best-scoring candidate solutions from the previous iteration. Doing so enables evolution strategies to direct the search towards a promising region of the search space.
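A minimal sketch of that loop (a simple elitist evolution strategy on an invented toy objective; the population size, elite count and sigma are arbitrary choices):

```python
import numpy as np

def fitness(x):
    return -np.sum(x ** 2)        # toy objective to maximise; optimum at the zero vector

rng = np.random.default_rng(0)
mean, sigma = rng.normal(size=5), 0.5
for generation in range(100):
    pop = mean + sigma * rng.normal(size=(50, 5))   # (i) sample candidates around the mean
    scores = np.array([fitness(p) for p in pop])    # (ii) evaluate them
    elite = pop[np.argsort(scores)[-10:]]           # (iii) keep the 10 fittest
    mean = elite.mean(axis=0)                       # recentre the next generation
print(mean)                                         # drifts towards the optimum at zero
```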
Reinforcement learning requires the problem to be formulated as a Markov Decision Process (MDP). An RL agent optimizes its behavior (or policy) by maximizing a cumulative reward signal received on transitions from one state to another. Since the problem is abstracted as an MDP, learning can happen at the step or episode level. Learning per step (or per N steps) is done via temporal-difference (TD) learning, and learning per episode via Monte Carlo methods. So far I have been talking about learning via action-value functions (learning the values of actions). Another way of learning is to optimize the parameters of a neural network representing the agent's policy directly via gradient ascent. This approach was introduced in the REINFORCE algorithm, and the general approach is known as policy-based RL.
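For reference, the REINFORCE update mentioned above nudges the policy parameters θ by gradient ascent on the expected return, θ <- θ + α Σ_t ∇_θ log π_θ(a_t | s_t) G_t, where G_t is the return collected from time step t onwards and α is the learning rate.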
For a comprehensive comparison check out this paper https://arxiv.org/pdf/2110.01411.pdf

How to choose action in TD(0) learning

I am currently reading Sutton's Reinforcement Learning: An introduction book. After reading chapter 6.1 I wanted to implement a TD(0) RL algorithm for this setting:
To do this, I tried to implement the pseudo-code presented here:
Doing this I wondered how to do the step "A <- action given by π for S": how can I choose the optimal action A for my current state S? Since the value function V(S) depends only on the state and not on the action, I do not really know how this can be done.
I found this question (where I got the images from) which deals with the same exercise - but there the action is just picked randomly and not chosen by an action policy π.
Edit: Or is this pseudo-code incomplete, so that I also have to approximate the action-value function Q(s, a) in some other way?
You are right: you cannot choose an action (nor derive a policy π) from a value function V(s) alone because, as you noticed, it depends only on the state s.
The key concept you are probably missing here is that TD(0) is an algorithm to compute the value function of a given policy. Thus, you assume that your agent is following a known policy. In the case of the Random Walk problem, the policy consists of choosing actions randomly.
If you want to be able to learn a policy, you need to estimate the action-value function Q(s,a). There exist several methods to learn Q(s,a) based on temporal-difference learning, such as SARSA and Q-learning.
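For comparison, the two kinds of updates look like this (with step size α and discount γ, written in the same style as the pseudo-code in the book):

TD(0) prediction: V(S) <- V(S) + α [R + γ V(S') - V(S)]
Q-learning control: Q(S, A) <- Q(S, A) + α [R + γ max_a Q(S', a) - Q(S, A)]

The first only evaluates the given policy π; the second also lets you derive a policy by acting (near-)greedily with respect to Q.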
In Sutton's RL book, the authors distinguish between two kinds of problems: prediction problems and control problems. The former refers to estimating the value function of a given policy, and the latter to estimating policies (often by means of action-value functions). You can find a reference to these concepts at the beginning of Chapter 6:
As usual, we start by focusing on the policy evaluation or prediction problem, that of estimating the value function for a given policy π. For the control problem (finding an optimal policy), DP, TD, and Monte Carlo methods all use some variation of generalized policy iteration (GPI). The differences in the methods are primarily differences in their approaches to the prediction problem.

What is action and reward in a neural network which learns weights by reinforcement learning

My goal is to predict customer churn. I want to use reinforcement learning to train a recurrent neural network which predicts a target response for its input.
I understand that the state is represented by the input to the network at each time step, but I don't understand how the action is represented. Is it the values of the weights, which the neural network should choose by some formula?
Also, how should we create a reward or punishment to teach the neural network its weights, given that we don't know the target response for each input?
The aim of reinforcement learning is typically to maximize long-term reward for an agent playing a game of sorts (a Markov Decision Process). In typical reinforcement learning usage, neural networks are used to approximate the Q-function. So, the network's input is the state and action (or feature representations thereof), and the output is the value of taking that action in that state. Reinforcement learning algorithms like Q-learning provide the details on how to choose actions at a given time step, and also dictate how updates to the value function should be done.
It isn't clear how your specific goal of building a customer churn model might be formulated as a Markov Decision Problem. You could define your states to be statistics about customers' interactions with the company website, but it isn't clear what the actions might be, because it isn't clear what the agent is and what it can do. This is also why you are finding it difficult to define a reward function. The reward function should tell the agent if it's doing a good job. So, if we're imagining an MDP where the agent is trying to minimize customer churn, we might provide a negative reward proportional to the number of customers that turn over.
I don't think you want to learn a Q-function. I think it's more likely that you are interested simply in supervised learning, where you have some sample data and you want to learn a function that will tell you how much churn there will be. For this, you should be looking towards gradient descent methods and forward/backward propagation for training your neural network.
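If it helps, a bare-bones version of that supervised setup might look like the following; every feature, label and hyperparameter here is invented, and in practice you would plug in your real customer data:

```python
import torch
import torch.nn as nn

# Made-up data: 100 customers, 8 hypothetical features each, 0/1 churn labels.
X = torch.randn(100, 8)
y = torch.randint(0, 2, (100, 1)).float()

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.BCEWithLogitsLoss()

for epoch in range(100):          # forward pass, loss, backpropagation, update
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()
```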

Rewards in Q-Learning and in TD(lambda)

How do rewards work in these two RL techniques? I mean, they both improve the policy and its evaluation, but not the rewards.
Do I need to guess them from the beginning?
You don't need to guess the rewards. The reward is feedback from the environment, and rewards are parameters of the environment. The algorithm works under the condition that the agent can only observe this feedback, the state space, and the action space.
The key idea of Q-learning and TD is asynchronous stochastic approximation, where we approximate the fixed point of the Bellman operator using noisy evaluations of the expected long-term reward.
For example, if we want to estimate the expectation of a Gaussian distribution, we can sample from it and average the samples.
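That "sample and average" idea in its simplest form, for illustration (the mean and standard deviation here are arbitrary):

```python
import numpy as np

# Estimate the mean of N(mu, sigma^2) from noisy samples with a running average;
# TD/Q-learning do the analogous thing with noisy Bellman backups instead.
rng = np.random.default_rng(0)
mu, sigma = 3.0, 2.0
estimate = 0.0
for n in range(1, 10_001):
    sample = rng.normal(mu, sigma)
    estimate += (sample - estimate) / n   # incremental (stochastic-approximation) average
print(estimate)                           # close to mu = 3.0
```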
Reinforcement Learning is for problems where the AI agent has no information about the world it is operating in. So Reinforcement Learning algorithms not only give you a policy/optimal action at each state, they also navigate a completely foreign environment (with no knowledge about which action will lead to which resulting state) and learn the parameters of this new environment. Algorithms that learn those parameters are model-based Reinforcement Learning algorithms.
Now, Q-learning and Temporal Difference learning are model-free Reinforcement Learning algorithms. That means the AI agent does the same things as in a model-based algorithm, but it does not have to learn the model (things like transition probabilities) of the world it operates in. Through many iterations it comes up with a mapping from each state to the optimal action to perform in that state.
Now coming to your question: you do not have to guess the rewards at different states. Initially, when the agent is new to the environment, it just chooses a random action to perform in its current state and gives it to the simulator. The simulator, based on the transition function, returns the resulting state of that state-action pair and also the reward for being in that state.
The simulator is analogous to Nature in the real world. For example, if you find something unfamiliar in the world and perform some action on it, like touching it, and the thing turns out to be a hot object, Nature gives a reward in the form of pain, so that the next time you know what happens when you try that action. When programming this, it is important to note that the workings of the simulator are not visible to the AI agent that is trying to learn the environment.
Now, depending on the reward the agent senses, it backs up its Q-value (in the case of Q-learning) or utility value (in the case of TD learning). Over many iterations these Q-values converge, and you are able to choose an optimal action for every state based on the Q-values of the state-action pairs.