For example, I have tried to run lambda iteration on a random MDP and noticed that I get different policies depending on the value of lambda. Can TD(1) and TD(0) give different optimal policies?
Update: Increasing my initial value function gave me the same result for both cases.
Yes, in general, RL methods with convergence guarantees are only guaranteed to converge to an optimal policy, not to a particular one. So, if an MDP has several optimal policies, algorithms (including policy iteration methods) could converge to any one of them.
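To see how this can happen, here is a minimal sketch (not from the original question) of a single-state MDP with two equally good actions: both greedy policies are optimal, and which one you obtain depends on tie-breaking and initialization.

```python
import numpy as np

# Minimal sketch: one state, two actions with identical expected reward,
# so both greedy policies are optimal and the result depends on tie-breaking.
gamma = 0.9
R = np.array([1.0, 1.0])      # both actions yield the same reward
V = 0.0                       # value of the single state

for _ in range(1000):         # value iteration
    Q = R + gamma * V         # action values
    V = Q.max()

print(Q)                      # approximately [10. 10.] -> the argmax is ambiguous
print(np.argmax(Q))           # the tie is broken arbitrarily (index 0 here)
```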
I am trying to use Reinforcement Learning for traffic signal phase optimization for improving traffic flow at intersections.
I am aware that in practice we won't be able to get the information about all the vehicles in each of the lanes.
If we use a camera to get information about the queue length, then we can get accurate data only up to, say, 200 meters.
Should I take this into consideration while defining my observation space or can I directly use the data from sumo?
Furthermore, what should be the ideal observation space for such a task?
sumo_rl allows using various metrics for reward calculation, such as the pressure metric, the queue length metric, etc. What would be a good choice of reward for my use case, and what factors should I consider while defining my reward?
I have tried getting metrics such as throughput, lane delay, and queue length from the E2 detector's output file. However, I might not be able to use them for the agent (do the traci/sumo wrappers offer better implementations?). So how do I use traci to get this modified information?
Yes, you should try to match your observation space to the real world as closely as possible. SUMO can also filter the data directly (for instance with an E3 detector).
If you want to maximize flow, then the reward should also include the flow metric (throughput). It's quite easy to get it via traci (as you already noticed), but I cannot tell how it integrates with your framework since you did not give details about it.
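For reference, a minimal sketch of how such quantities could be queried through TraCI (the config file name and lane IDs are placeholders, and the reward weighting is only an example, not a recommendation):

```python
import traci

# Minimal sketch: query queue length and throughput directly via TraCI.
traci.start(["sumo", "-c", "intersection.sumocfg"])   # placeholder config

incoming_lanes = ["north_in_0", "south_in_0"]          # placeholder lane IDs

for step in range(3600):
    traci.simulationStep()
    # Queue proxy: number of halting vehicles on the incoming lanes.
    queue = sum(traci.lane.getLastStepHaltingNumber(l) for l in incoming_lanes)
    # Throughput proxy: vehicles that finished their trip during this step.
    throughput = traci.simulation.getArrivedNumber()
    reward = throughput - 0.1 * queue                  # example weighting only

traci.close()
```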
I have N agents/users accessing a single wireless channel, and at each time step only one agent can access the channel and receive a reward.
Each user has a buffer that can store B packets (I assume the buffer is infinite).
Each user n gets an observation from the environment indicating whether the packet transmitted in time slot t was a success or a failure (collision). If more than one user accesses the channel, they get a penalty.
This feedback from the channel is the same for all users since there is only one channel. The reward is -B_n (the negative of the number of packets in the buffer). Each user wants to maximize its own reward and tries to empty its buffer.
Packets arrive at each user following a Poisson process with an average of $\lambda$ packets per time slot.
Each user has a history of the previous 10 time slots that it uses as input to a DQN, which outputs the probability of taking action A_n: stay silent or transmit. The history consists of tuples (A_n, F, B_n).
Each user is unaware of the action and buffer status of other users.
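For reference, a minimal sketch (class and variable names are made up) of how the per-user history described above could be flattened into a vector for the DQN input:

```python
from collections import deque
import numpy as np

HISTORY_LEN = 10  # previous 10 time slots, as described above

class UserHistory:
    """Keeps the last (action, feedback, buffer_size) tuples for one user."""
    def __init__(self):
        self.slots = deque([(0, 0, 0)] * HISTORY_LEN, maxlen=HISTORY_LEN)

    def push(self, action, feedback, buffer_size):
        # action: 0 = stay silent, 1 = transmit
        # feedback: 1 = success, 0 = failure/collision (shared by all users)
        self.slots.append((action, feedback, buffer_size))

    def observation(self):
        # Flatten to a fixed-length vector that a DQN can take as input.
        return np.asarray(list(self.slots), dtype=np.float32).flatten()

h = UserHistory()
h.push(action=1, feedback=0, buffer_size=3)
print(h.observation().shape)   # (30,) = 10 slots x 3 features
```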
I am trying to model my problem with multi-agent reinforcement learning. So far I have tried DQN, but the results are more or less the same as a random scheme. Could it be that the users don't have enough contextual information to learn the behaviour of the other users? Or could there be another reason?
I would like to know how I can model my environment, since the state (in the RL sense) is static and the environment doesn't change; the only thing that changes is each user's history at each time slot. So I am not sure whether it is a partially observable MDP or whether it should be modelled as a multi-agent single-arm bandit problem, which I am not sure is correct either.
My second concern is that I have tried DQN and it has not worked, and I would like to know whether such a problem can be tackled with tabular Q-learning. I have not seen multi-agent work in which anyone has used tabular Q-learning. Any insights would be helpful.
Your problem can be modeled as a Decentralized POMDP (see an overview here).
Summarizing this approach: you consider a multi-agent system where each agent models its own policy, and then you try to build a joint policy from these individual ones. Of course, the complexity grows as the number of agents, states, and actions increases, so there are several approaches, mainly based on heuristics, to prune branches of the joint policy tree that are not "good" in comparison with others. A well-known example using this approach is packet routing, where it is possible to define a discrete action/state space.
But be aware that even for tiny systems, the complexity often becomes infeasible!
I have a more general question regarding deep reinforcement learning. I always struggle a little with what exactly the difference between on-policy and off-policy is. Sure, one can say that off-policy means sampling actions from a different distribution during trajectory generation, while on-policy means using the actual policy to generate trajectories. Or that on-policy methods cannot benefit from old data, while off-policy methods can. But neither really tells me what the exact difference is; they rather describe the consequence.
In my understanding, DDPG and PPO are both built upon A2C and train an actor and a critic in parallel. The critic is usually trained with an MSE loss using the observed reward of the next time step and the network's own estimate for that time step (possibly with multi-step rollouts, but I am neglecting those for now). I do not see a difference between off-policy DDPG and on-policy PPO here (TD3 does it slightly differently, but that is also neglected for now since the idea is identical).
In both cases, the actor has a loss function based on the value produced by the critic. While PPO uses a ratio of the policies to limit the step size, DDPG uses the policy to predict the action at which the critic's value is computed. Therefore, in both methods (PPO and DDPG), the current policy is used in the loss functions of the critic and the actor.
So now to my actual question: why is DDPG able to benefit from old data, or rather, why can PPO not benefit from old data? One can argue that the ratio of the policies in PPO limits the distance between the policies and therefore requires fresh data. But how is A2C on-policy and unable to benefit from old data in comparison to DDPG?
I do understand that Q-learning is far more off-policy than policy-gradient learning. But I do not get the difference between those PG methods. Does it only come down to the fact that DDPG is deterministic? Does DDPG have any off-policy correction that makes it able to profit from old data?
I would be really happy if someone could bring me closer to understanding these methods.
Cheers
PPO's actor-critic objective functions are based on a set of trajectories obtained by running the current policy over T time steps. After the policy is updated, trajectories generated from old/stale policies are no longer applicable, i.e. it needs to be trained "on-policy".
[Why? Because PPO uses a stochastic policy (i.e. a conditional probability distribution over actions given states), and the policy's objective function is based on sampling trajectories from a probability distribution that depends on the current policy's probability distribution (i.e. you need to use the current policy to generate the trajectories). NOTE #1: this is true for any policy-gradient approach using a stochastic policy, not just PPO.]
DDPG/TD3 only needs a single time step for each actor/critic update (via the Bellman equation), and it is straightforward to apply the current deterministic policy to old data tuples (s_t, a_t, r_t, s_t+1), i.e. it is trained "off-policy".
[Why? Because DDPG/TD3 use a deterministic policy, and Silver, David, et al., "Deterministic Policy Gradient Algorithms" (2014), proved that the policy's objective function is an expectation over states visited under the Markov decision process's state transition function, but does not depend on any probability distribution over actions induced by the policy, which after all is deterministic rather than stochastic.]
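To make the contrast concrete, here are the two objectives in their standard forms from the PPO and DPG papers (not taken from the question). PPO's clipped surrogate explicitly contains the probabilities of the policy that generated the samples, while the deterministic policy gradient contains no action distribution at all:

$$L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \operatorname{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}$$

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho^\beta}\left[\nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu}(s,a)\big|_{a=\mu_\theta(s)}\right]$$

The key point: $r_t(\theta)$ only makes sense relative to the $\pi_{\theta_\text{old}}$ that actually generated the samples, so once the policy has moved far from it the surrogate is no longer trustworthy and fresh data is needed; the deterministic gradient, by contrast, only needs states drawn from some behaviour distribution $\rho^\beta$, which is why a replay buffer works for DDPG/TD3.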
I am currently reading Sutton's Reinforcement Learning: An Introduction. After reading Chapter 6.1, I wanted to implement a TD(0) RL algorithm for this setting:
To do this, I tried to implement the pseudo-code presented here:
Doing this, I wondered how to perform the step "A <- action given by π for S": how can I choose the optimal action A for my current state S? Since the value function V(S) depends only on the state and not on the action, I do not really know how this can be done.
I found this question (where I got the images from), which deals with the same exercise, but there the action is just picked randomly and not chosen by a policy π.
Edit: Or is this pseudo-code not complete, so that I also have to approximate the action-value function Q(s, a) in some other way?
You are right: you cannot choose an action (nor derive a policy π) from a value function V(s) alone because, as you noticed, it depends only on the state s.
The key concept that you are probably missing here is that TD(0) is an algorithm to compute the value function of a given policy. Thus, you are assuming that your agent follows a known policy. In the case of the Random Walk problem, the policy consists of choosing actions randomly.
If you want to be able to learn a policy, you need to estimate the action-value function Q(s,a). There exist several methods to learn Q(s,a) based on temporal-difference learning, for example SARSA and Q-learning.
In Sutton's RL book, the authors distinguish between two kinds of problems: prediction problems and control problems. The former refers to estimating the value function of a given policy, and the latter to estimating (optimal) policies, often by means of action-value functions. You can find a reference to these concepts at the beginning of Chapter 6:
As usual, we start by focusing on the policy evaluation or prediction
problem, that of estimating the value function $v_\pi$ for a given policy $\pi$.
For the control problem (finding an optimal policy), DP, TD, and Monte
Carlo methods all use some variation of generalized policy iteration
(GPI). The differences in the methods are primarily differences in
their approaches to the prediction problem.
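To make the prediction setting concrete, here is a minimal sketch of tabular TD(0) on the 5-state random walk mentioned above, where π is simply the equiprobable random policy, so the step "A <- action given by π for S" is just a coin flip:

```python
import random

# Minimal sketch: tabular TD(0) prediction on the 5-state random walk.
# States 1..5 are the non-terminal states; 0 and 6 are terminal.
ALPHA, GAMMA = 0.1, 1.0
V = [0.0] + [0.5] * 5 + [0.0]            # value estimates; terminals stay at 0

for episode in range(1000):
    s = 3                                # start in the centre state
    while s not in (0, 6):
        a = random.choice([-1, +1])      # "A <- action given by pi for S": pi is random
        s_next = s + a
        r = 1.0 if s_next == 6 else 0.0  # reward 1 only on reaching the right terminal
        # TD(0) update: V(S) <- V(S) + alpha * [R + gamma * V(S') - V(S)]
        V[s] += ALPHA * (r + GAMMA * V[s_next] - V[s])
        s = s_next

print([round(v, 2) for v in V[1:6]])     # approaches [1/6, 2/6, 3/6, 4/6, 5/6]
```

Note that the policy itself never changes here; TD(0) only evaluates it. To improve the policy, you would estimate Q(s,a) instead, e.g. with SARSA or Q-learning as mentioned above.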
I am working on a car-following problem, and the measurements I am receiving are uncertain (I know that the noise model is Gaussian and its variance is also known). How do I select my next action under this kind of uncertainty?
Basically, how should I change my cost function so that I can optimize my plan by selecting an appropriate action?
Vanilla reinforcement learning is meant for Markov decision processes, where it is assumed that you can fully observe the state. Because your state observations are noisy, you have a partially observable Markov decision process (POMDP). Theoretically speaking, you should be looking at a different category of RL approaches.
Practically, since you have so much information about the parameters of the uncertainty, you should consider using a Kalman or particle filter to perform state estimation. Then, use the most likely state estimate as the true state in your RL problem. The estimate will be wrong at times, of course, but if you're using a function approximation approach for the value function, the experience can generalize across similar states and you'll be able to learn. The learning performance is going to be proportional to the quality of your state estimate.
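As a concrete illustration (the dynamics, dimensions, and numbers below are made up, not taken from the question), a simple Kalman filter that turns noisy gap measurements into the state estimate fed to the agent might look like this:

```python
import numpy as np

# Minimal sketch: 1-D constant-velocity Kalman filter for the gap to the lead car.
# The filtered estimate, not the raw noisy measurement, is what the RL agent sees.
dt = 0.1
F = np.array([[1.0, dt], [0.0, 1.0]])   # state transition for [gap, relative speed]
H = np.array([[1.0, 0.0]])              # we only measure the gap
Q = 0.01 * np.eye(2)                    # process noise (assumed)
R = np.array([[0.5]])                   # measurement noise variance (known, per the question)

x = np.array([[10.0], [0.0]])           # initial state estimate
P = np.eye(2)                           # initial covariance

def kalman_step(x, P, z):
    # Predict
    x = F @ x
    P = F @ P @ F.T + Q
    # Update with measurement z
    y = z - H @ x                       # innovation
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)      # Kalman gain
    x = x + K @ y
    P = (np.eye(2) - K @ H) @ P
    return x, P

z = np.array([[9.3]])                   # one noisy gap measurement
x, P = kalman_step(x, P, z)
state_for_agent = x.ravel()             # use this estimate as the "state" in the RL problem
print(state_for_agent)
```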