How to handle uncertainty in position? - reinforcement-learning

I am working on a car-following problem and the measurements I receive are uncertain (I know that the noise model is Gaussian, and its variance is also known). How do I select my next action under this kind of uncertainty?
Basically, how should I change my cost function so that I can optimize my plan by selecting an appropriate action?

Vanilla reinforcement learning is meant for Markov decision processes (MDPs), where it's assumed that you can fully observe the state. Because your observations of the state are noisy, you have a partially observable Markov decision process (POMDP). Theoretically speaking, you should be looking at a different category of RL approaches.
Practically, since you have so much information about the parameters of the uncertainty, you should consider using a Kalman or particle filter to perform state estimation. Then, use the most likely state estimate as the true state in your RL problem. The estimate will be wrong at times, of course, but if you're using a function approximation approach for the value function, the experience can generalize across similar states and you'll still be able to learn. Learning performance will depend heavily on the quality of your state estimate.
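As a rough illustration of the state-estimation step, here is a minimal one-dimensional Kalman filter sketch. It assumes a constant-velocity model for the gap to the lead car; the variable names, the numbers, and the process-noise value are all made up for the example, and only the known measurement variance corresponds to something stated in the question.

```python
import random

def kalman_predict(mean, var, velocity, dt, process_var):
    """Propagate the estimate through an assumed constant-velocity model."""
    return mean + velocity * dt, var + process_var

def kalman_update(mean, var, measurement, meas_var):
    """Fuse the prior N(mean, var) with one noisy measurement."""
    gain = var / (var + meas_var)              # Kalman gain, between 0 and 1
    new_mean = mean + gain * (measurement - mean)
    new_var = (1.0 - gain) * var
    return new_mean, new_var

random.seed(0)
true_gap, rel_velocity, dt = 30.0, -0.5, 0.1   # hypothetical gap (m) and relative speed (m/s)
meas_var, process_var = 4.0, 0.01              # known sensor variance, assumed model noise
mean, var = 0.0, 1000.0                        # deliberately vague initial belief

for _ in range(200):
    true_gap += rel_velocity * dt                      # the real system evolves
    z = random.gauss(true_gap, meas_var ** 0.5)        # noisy measurement of the gap
    mean, var = kalman_predict(mean, var, rel_velocity, dt, process_var)
    mean, var = kalman_update(mean, var, z, meas_var)

print(f"true gap {true_gap:.2f} m, estimate {mean:.2f} m, variance {var:.3f}")
```

The filtered `mean` is what you would hand to the RL agent as its state, and `var` tells you how much to trust it.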

Related

Convergence time of Q-learning Vs Deep Q-learning

I want to know about the convergence time of Deep Q-learning vs Q-learning when run on the same problem. Can anyone give me an idea of how they compare? It would be better if it were explained with a graph.
In short, the more complicated the state space is, the bigger the advantage of DQN over Q-learning (by complicated, I mean the number of possible states). When the state space is too large, Q-learning becomes nearly impossible to converge because of time and hardware limitations.
Note that DQN is in fact a kind of Q-learning: it uses a neural network in place of a Q-table, and both the Q-network and the Q-table output a Q-value with the state as input. I will continue using "Q-learning" to refer to the simple version with a Q-table, and "DQN" to refer to the neural network version.
You can't state a convergence time without a specific problem, because it really depends on what you are doing:
For example, if you are working with a simple environment like FrozenLake: https://gym.openai.com/envs/FrozenLake-v0/
Q-learning will converge faster than DQN as long as you have a reasonable reward function.
This is because FrozenLake has only 16 states, and the Q-learning algorithm is very simple and efficient, so it runs a lot faster than training a neural network.
However, if you are working with something like Atari: https://gym.openai.com/envs/Assault-v0/
there are millions of states (note that even a single-pixel difference counts as a totally new state). Q-learning has to enumerate all states in the Q-table to actually converge, so it will probably require a very large amount of memory plus a very long training time to enumerate and explore all possible states. In fact, I am not sure it would ever converge in some of the more complicated games, simply because there are so many states.
This is where DQN becomes useful. Neural networks can generalize across states and find a function from state to action (or, more precisely, from state to Q-value). DQN no longer needs to enumerate states; instead it learns the information implied in them. Even if you have never explored a certain state in training, as long as your neural network has been trained to learn the relationship on other similar states, it can still generalize and output a Q-value. That is why it converges much more easily.
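To put rough numbers on the table-size argument, here is a small back-of-the-envelope sketch. The FrozenLake figures are for the 4x4 map with its four moves. The pixel figures are an assumption for illustration: an 84x84 grayscale frame with 256 intensity levels, which gives only an upper bound on distinct inputs, not the number of frames a game can actually produce.

```python
import math

# FrozenLake 4x4: 16 grid cells and 4 moves (left, down, right, up).
frozenlake_entries = 16 * 4

# Assumed pixel input: one 84x84 grayscale frame with 256 intensity levels.
# 256 ** pixels is an upper bound on distinct frames; we only count its decimal digits.
pixels = 84 * 84
digits = math.floor(pixels * math.log10(256)) + 1

print(f"FrozenLake Q-table entries: {frozenlake_entries}")
print(f"Upper bound on distinct frames has about {digits} decimal digits")
```

Even if only a tiny fraction of those frames is ever reachable, a table with one row per frame is out of the question, while a network with a fixed number of weights is not.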

Deep Value-only Reinforcement Learning: Train V(s) instead of Q(s,a)?

Is there a value-based (deep) reinforcement learning algorithm available that is centred fully around learning only the state-value function V(s), rather than the state-action-value function Q(s,a)?
If not, why not, or could it easily be implemented?
Are any implementations available in Python, say in PyTorch or TensorFlow, or at an even higher level in RLlib or similar?
I ask because I have a multi-agent problem to simulate in which, in reality, the agents' actions are defined by an efficient centralized decision-making mechanism that (i) successfully incentivizes truth-telling on the part of the decentralized agents, and (ii) essentially depends on the value functions of the various actors $i$ (on $V_i(s_{i,t+1})$ for the different achievable post-period states $s_{i,t+1}$ of all actors $i$). From an individual agent's point of view, the multi-agent nature with gradual learning means the system looks non-stationary as long as training is not finished. Because of the nature of the problem, I'm rather convinced that learning any natural Q(s,a) function for my problem is significantly less efficient than simply learning the terminal value function V(s), from which the centralized mechanism can readily derive the eventual actions for all agents by solving a separate sub-problem based on all agents' values.
The math of the typical DQN with temporal-difference learning seems to be naturally adaptable to state-only, value-based training of a deep network for V(s) instead of the combined Q(s,a). Yet within the value-based RL subdomain, everybody seems to focus on learning Q(s,a), and I have not found any purely V(s)-learning algorithms so far (other than analytical, non-deep, traditional Bellman-equation dynamic programming methods).
I am aware of Dueling DQN (DDQN), but it does not seem to be exactly what I am searching for. DDQN at least has a separate learner for V(s), but overall it still aims to learn Q(s,a) in a decentralized way, which does not seem conducive in my case.
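For reference, the state-only temporal-difference update the question alludes to can be sketched in tabular form as TD(0). This is only an illustration on a toy random walk (five non-terminal states, reward 1 for exiting on the right), not a deep or multi-agent implementation; a deep variant would replace the `values` list with a network trained on the same target. The environment and all parameter values are assumptions.

```python
import random

def td0_state_values(num_states=5, episodes=20000, alpha=0.02, gamma=1.0, seed=0):
    """Tabular TD(0) on a random walk: learns V(s) only, never Q(s, a)."""
    rng = random.Random(seed)
    # States 1..num_states are non-terminal; 0 and num_states + 1 are terminal.
    values = [0.0] * (num_states + 2)
    for _ in range(episodes):
        state = (num_states + 1) // 2              # start in the middle
        while 0 < state < num_states + 1:
            next_state = state + rng.choice((-1, 1))
            reward = 1.0 if next_state == num_states + 1 else 0.0
            # TD(0): move V(s) toward r + gamma * V(s'); terminal values stay at 0.
            target = reward + gamma * values[next_state]
            values[state] += alpha * (target - values[state])
            state = next_state
    return values[1:-1]

learned = td0_state_values()
print([round(v, 2) for v in learned])
```

For this walk the exact values are 1/6, 2/6, ..., 5/6, so the printed estimates can be checked by eye. Turning V(s) into actions still needs something that predicts successor states, which in the question's setting is the job of the centralized mechanism.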

Why is a target network required?

I am having trouble understanding why a target network is necessary in DQN. I'm reading the paper “Human-level control through deep reinforcement learning”.
I understand Q-learning. Q-learning is a value-based reinforcement learning algorithm that learns an “optimal” probability distribution over state-action pairs that will maximize its long-term discounted reward over a sequence of timesteps.
Q-learning is updated using the Bellman equation, and a single step of the Q-learning update is given by
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ R_{t+1} + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$
where $\alpha$ is the learning rate and $\gamma$ is the discount factor.
I can understand that the reinforcement learning algorithm will become unstable and diverge.
The experience replay buffer is used so that we do not forget past experiences and to de-correlate datasets provided to learn the probability distribution.
This is where I fail.
Let me break the paragraph from the paper down here for discussion
“The fact that small updates to $Q$ may significantly change the policy and therefore change the data distribution”: I understood this part. Periodic changes to the Q-network may lead to instability and changes in the distribution, for example if we always take a left turn or something like that.
“and the correlations between the action-values ($Q$) and the target values $r + \gamma \max_{a'} Q(s', a')$”: this says that the target is the reward plus $\gamma$ times my prediction of the return, given that I take what I think is the best action in the next state and follow my policy from then on.
“We used an iterative update that adjusts the action-values ($Q$) towards target values that are only periodically updated, thereby reducing correlations with the target.”
So, in summary, is a target network required because the network keeps changing at each timestep and the “target values” are being updated at each timestep?
But I do not understand how it is going to solve that.
The difference between Q-learning and DQN is that you have replaced an exact value function with a function approximator. With Q-learning you are updating exactly one state/action value at each timestep, whereas with DQN you are updating many, which you understand. The problem this causes is that you can affect the action values for the very next state you will be in instead of guaranteeing them to be stable as they are in Q-learning.
This happens basically all the time with DQN when using a standard deep network (a bunch of fully connected layers of the same size). The effect you typically see is referred to as "catastrophic forgetting", and it can be quite spectacular. If you are doing something like moon lander with this sort of network (the simple one, not the pixel one) and track the rolling average score over the last 100 games or so, you will likely see a nice upward curve in score, and then all of a sudden it completely craps out and starts making awful decisions again, even as your alpha gets small. This cycle will continue endlessly regardless of how long you let it run.
Using a stable target network as your error measure is one way of combating this effect. Conceptually it's like saying, "I have an idea of how to play this well, I'm going to try it out for a bit until I find something better" as opposed to saying "I'm going to retrain myself how to play this entire game after every move". By giving your network more time to consider many actions that have taken place recently instead of updating all the time, it hopefully finds a more robust model before you start using it to make actions.
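The mechanics of a periodically synced target can be sketched without a neural network. In the toy below, `online` and `target` are plain tables standing in for the two sets of network weights, so it shows only the bookkeeping (bootstrap from the frozen copy, hard-copy every `sync_every` steps), not the instability itself. The chain environment and every parameter value are made up for the example.

```python
import copy
import random

N_STATES, ACTIONS = 6, (0, 1)      # hypothetical chain; action 1 moves right, 0 moves left
GOAL = N_STATES - 1

def step(state, action):
    """Toy environment: reward 1 only when the right end of the chain is reached."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL

def train(steps=20000, alpha=0.1, gamma=0.9, sync_every=100, seed=0):
    rng = random.Random(seed)
    online = [[0.0, 0.0] for _ in range(N_STATES)]   # updated on every step
    target = copy.deepcopy(online)                   # frozen copy used for TD targets
    state = 0
    for t in range(1, steps + 1):
        # Q-learning is off-policy, so a uniformly random behaviour policy is enough here.
        action = rng.choice(ACTIONS)
        nxt, reward, done = step(state, action)
        # The bootstrap term comes from the frozen copy, not from `online`.
        bootstrap = 0.0 if done else gamma * max(target[nxt])
        online[state][action] += alpha * (reward + bootstrap - online[state][action])
        if t % sync_every == 0:
            target = copy.deepcopy(online)           # periodic hard update
        state = 0 if done else nxt
    return online, target

online, target = train()
greedy = [max(ACTIONS, key=lambda a: online[s][a]) for s in range(GOAL)]
print(greedy)
```

Between syncs the regression target for every state is fixed, which is the "try this idea out for a bit" behaviour described above. With a real network you would copy the weights instead of deep-copying a list.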
On a side note, DQN is essentially obsolete at this point, but the themes from that paper were the fuse leading up to the RL explosion of the last few years.

Difference between Evolutionary Strategies and Reinforcement Learning?

I am learning about the approach employed in Reinforcement Learning for robotics and I came across the concept of Evolutionary Strategies. But I couldn't understand how RL and ES are different. Can anyone please explain?
To my understanding, there are two main differences.
1) Reinforcement learning uses the concept of one agent, and the agent learns by interacting with the environment in different ways. Evolutionary algorithms usually start with many "agents", and only the "strong ones survive" (the agents with characteristics that yield the lowest loss).
2) A reinforcement learning agent learns from both positive and negative actions, but evolutionary algorithms only learn the optimal ones; the information about negative or suboptimal solutions is discarded and lost.
Example
You want to build an algorithm to regulate the temperature in the room.
The room is 15 °C, and you want it to be 23 °C.
Using Reinforcement learning, the agent will try a bunch of different actions to increase and decrease the temperature. Eventually, it learns that increasing the temperature yields a good reward. But it also learns that reducing the temperature will yield a bad reward.
An evolutionary algorithm starts with a bunch of random agents, each of which has a preprogrammed set of actions it is going to perform. The agents that have the "increase temperature" action survive and move on to the next generation. Eventually, only agents that increase the temperature survive and are deemed the best solution. However, the algorithm does not know what happens if you decrease the temperature.
TL;DR: RL is usually one agent, trying different actions, and learning and remembering all info (positive or negative). Evolutionary algorithms use many agents that guess many actions, and only the agents that have the optimal actions survive. It is basically a brute-force way to solve a problem.
I think the biggest difference between Evolutionary Strategies and Reinforcement Learning is that ES is a global optimization technique while RL is a local optimization technique. So RL can converge faster, but to a local optimum, while ES converges more slowly to a global optimum.
Evolution Strategies optimization happens on a population level. An evolution strategy algorithm in an iterative fashion (i) samples a batch of candidate solutions from the search space (ii) evaluates them and (iii) discards the ones with low fitness values. The sampling for a new iteration (or generation) happens around the mean of the best scoring candidate solutions from the previous iteration. Doing so enables evolution strategies to direct the search towards a promising location in the search space.
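The sample, evaluate, discard loop described above can be sketched in a few lines. This is a bare-bones truncation-selection variant in one dimension with a fixed sampling width, not any particular published ES algorithm. The fitness function borrows the room-temperature example from the earlier answer, and the population size, `sigma`, and the other parameters are arbitrary choices.

```python
import random

def evolve(fitness, mean, sigma=1.0, population=20, elite=5, generations=50, seed=0):
    """Minimal truncation-selection evolution strategy in one dimension."""
    rng = random.Random(seed)
    for _ in range(generations):
        # (i) sample candidate solutions around the current mean
        candidates = [rng.gauss(mean, sigma) for _ in range(population)]
        # (ii) evaluate them and (iii) discard all but the fittest
        survivors = sorted(candidates, key=fitness, reverse=True)[:elite]
        # the next generation is sampled around the mean of the survivors
        mean = sum(survivors) / elite
    return mean

# Hypothetical fitness: how close a thermostat setpoint is to 23 degrees C.
setpoint = evolve(lambda temp: -abs(temp - 23.0), mean=15.0)
print(round(setpoint, 2))
```

Note that nothing about the discarded candidates is kept: only the survivors' mean carries over to the next generation.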
Reinforcement learning requires the problem to be formulated as a Markov Decision Process (MDP). An RL agent optimizes its behavior (or policy) by maximizing a cumulative reward signal received on transitions from one state to another. Since the problem is abstracted as an MDP, learning can happen at the step or episode level. Learning per step (or per N steps) is done via temporal-difference (TD) learning, and learning per episode is done via Monte Carlo methods. So far I have been talking about learning via action-value functions (learning the values of actions). Another way of learning is to optimize the parameters of a neural network that represents the agent's policy directly, via gradient ascent. This approach was introduced in the REINFORCE algorithm, and the general approach is known as policy-based RL.
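To make the policy-based idea concrete, here is a minimal REINFORCE sketch on the smallest possible problem: a single decision between two actions, so the return is just the immediate reward. The policy is a softmax over two preference parameters rather than a neural network, and the rewards, step size, and step count are assumptions chosen for illustration.

```python
import math
import random

def softmax(prefs):
    """Turn action preferences into probabilities (shifted for numerical stability)."""
    top = max(prefs)
    exps = [math.exp(p - top) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_bandit(rewards=(0.0, 1.0), steps=2000, alpha=0.1, seed=0):
    """REINFORCE on a one-step problem: the return G is the immediate reward."""
    rng = random.Random(seed)
    theta = [0.0] * len(rewards)                  # policy parameters (action preferences)
    for _ in range(steps):
        probs = softmax(theta)
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        G = rewards[action]
        # d/d theta[k] of log pi(action) is 1[k == action] - probs[k] for a softmax policy
        for k in range(len(theta)):
            grad_log = (1.0 if k == action else 0.0) - probs[k]
            theta[k] += alpha * G * grad_log      # gradient ascent on expected return
    return softmax(theta)

final_probs = reinforce_bandit()
print([round(p, 3) for p in final_probs])
```

Unlike the value-based methods above, nothing here estimates how good an action is; the parameters are simply pushed toward actions that were followed by high return.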
For a comprehensive comparison, check out this paper: https://arxiv.org/pdf/2110.01411.pdf

Rewards in Q-Learning and in TD(lambda)

How do rewards in those two RL techniques work? I mean, they both improve the policy and the evaluation of it, but not the rewards.
Do I need to guess them at the beginning, and if so, how?
You don't need to guess the rewards. The reward is feedback from the environment, and rewards are parameters of the environment. The algorithm works under the condition that the agent can observe only the feedback, the state space, and the action space.
The key idea of Q-learning and TD is asynchronous stochastic approximation, where we approximate the fixed point of the Bellman operator using noisy evaluations of the expected long-term reward.
For example, if we want to estimate the expectation of a Gaussian distribution, we can sample from it and average the samples.
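The sample-and-average idea can be written as an incremental update, which makes the link to TD-style updates easier to see. This is only a sketch, and the Gaussian parameters are arbitrary.

```python
import random

def running_mean(samples):
    """Incremental average: estimate += (sample - estimate) / n."""
    estimate = 0.0
    for n, x in enumerate(samples, start=1):
        # Same shape as a TD update: old estimate + step size * (noisy target - old estimate).
        estimate += (x - estimate) / n
    return estimate

rng = random.Random(0)
mu, sigma = 2.0, 1.0                       # hypothetical Gaussian parameters
estimate = running_mean(rng.gauss(mu, sigma) for _ in range(10000))
print(round(estimate, 3))
```

Q-learning and TD do the same thing, except that the noisy sample is a reward plus a discounted estimate of the next state's value rather than a direct draw from the distribution.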
Reinforcement learning is for problems where the AI agent has no information about the world it is operating in. So reinforcement learning algorithms not only give you a policy (the optimal action at each state) but also navigate a completely foreign environment (with no knowledge of which action will lead to which resulting state) and learn the parameters of this new environment. These are model-based reinforcement learning algorithms.
Q-learning and temporal-difference learning, on the other hand, are model-free reinforcement learning algorithms. This means the AI agent does the same things as in a model-based algorithm, but it does not have to learn the model (things like transition probabilities) of the world it is operating in. Through many iterations it comes up with a mapping from each state to the optimal action to perform in that state.
Now, coming to your question: you do not have to guess the rewards at different states. Initially, when the agent is new to the environment, it just chooses a random action to perform from the state it is in and gives it to the simulator. The simulator, based on the transition functions, returns the resulting state for that state-action pair and also returns the reward for being in that state.
The simulator is analogous to Nature in the real world. For example, you find something unfamiliar in the world and perform some action on it, like touching it. If the thing turns out to be a hot object, Nature gives a reward in the form of pain, so that next time you know what happens when you try that action. While programming this, it is important to note that the workings of the simulator are not visible to the AI agent that is trying to learn the environment.
Depending on the reward that the agent senses, it backs up its Q-value (in the case of Q-learning) or utility value (in the case of TD learning). Over many iterations these Q-values converge, and you are able to choose an optimal action for every state based on the Q-values of the state-action pairs.
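The agent and simulator loop described above might look roughly like the sketch below. The `Simulator` class is a made-up four-state world with a "hot" state on the left and a goal on the right. The agent never reads its internals and only sees what `reset()` and `step()` return. All names and parameter values are assumptions for the example.

```python
import random

class Simulator:
    """Hypothetical environment; the agent only ever calls reset() and step()."""
    def __init__(self, seed=0):
        self._rng = random.Random(seed)
        self._state = 1

    def reset(self):
        self._state = 1
        return self._state

    def step(self, action):
        # Hidden dynamics: states 0..3 on a line; 0 is "hot" (reward -1), 3 is the goal (+1).
        # The chosen move succeeds with probability 0.9; otherwise the agent slips the other way.
        move = action if self._rng.random() < 0.9 else 1 - action
        self._state += 1 if move == 1 else -1
        reward = {0: -1.0, 3: 1.0}.get(self._state, 0.0)
        return self._state, reward, self._state in (0, 3)

def q_learning(env, episodes=5000, alpha=0.1, gamma=0.95, epsilon=0.1, seed=1):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(4) for a in (0, 1)}    # no rewards guessed up front
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            if rng.random() < epsilon:
                action = rng.choice((0, 1))                # explore
            else:                                          # exploit, breaking ties at random
                best = max(q[(state, 0)], q[(state, 1)])
                action = rng.choice([a for a in (0, 1) if q[(state, a)] == best])
            nxt, reward, done = env.step(action)           # the simulator supplies the reward
            future = 0.0 if done else gamma * max(q[(nxt, 0)], q[(nxt, 1)])
            q[(state, action)] += alpha * (reward + future - q[(state, action)])  # back up
            state = nxt
    return q

q = q_learning(Simulator())
policy = {s: max((0, 1), key=lambda a: q[(s, a)]) for s in (1, 2)}
print(policy)
```

The Q-table starts at zero everywhere, and every number in it ends up there because of rewards the simulator returned, which is the sense in which nothing has to be guessed.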