How can we approximate infinite horizon MDP with finite horizon MDP in the context of reinforcement learning? - reinforcement-learning

For a given discount factor (and a given range of reward values), up to how many time steps do we have to extend a finite-horizon Markov decision process (MDP) so that it approximates the corresponding infinite-horizon MDP?
I am working on a research project that uses risk-averse MDPs (RA-MDPs) with dynamic risk measures, which are infinite horizon by nature (this may be irrelevant to answering the question, but I give it as motivation and context).
I want to solve this using online optimization. For this, I am using the risk-averse actor-critic algorithm proposed by Coache et al. in "CONDITIONALLY ELICITABLE DYNAMIC RISK MEASURES FOR DEEP REINFORCEMENT LEARNING", which, as far as I know, is the latest and the only RL algorithmic framework for risk-averse MDPs, but it is unfortunately restricted to finite-horizon MDPs.
My problem, on the other hand, is infinite horizon. So I want to approximate this "infinite horizon" case with a "finite horizon" one.
Please help me with this if you can (references to authoritative sources would be appreciated).
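For the standard (risk-neutral) discounted setting, a common way to reason about this is the truncation bound: if every reward satisfies |r| <= R_max and the discount factor is gamma < 1, the rewards collected after step T contribute at most gamma^T * R_max / (1 - gamma) to the return. A minimal sketch that computes the smallest T bringing this tail under a tolerance is given here. The function name truncation_horizon and the tolerance parameter eps are illustrative only, and whether the same bound carries over to a dynamic risk measure depends on that measure's properties, so this is a starting point rather than a result for the RA-MDP case.

```python
import math


def truncation_horizon(gamma: float, r_max: float, eps: float) -> int:
    """Smallest T such that gamma**T * r_max / (1 - gamma) <= eps.

    Assumes a risk-neutral discounted return with |reward| <= r_max.
    """
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie strictly between 0 and 1")
    if r_max <= 0.0 or eps <= 0.0:
        raise ValueError("r_max and eps must be positive")

    full_bound = r_max / (1.0 - gamma)  # bound on the whole infinite sum
    if eps >= full_bound:
        return 0  # even truncating immediately is within tolerance

    t = math.ceil(math.log(eps / full_bound) / math.log(gamma))
    # Guard against floating-point rounding in the logarithms.
    while gamma ** t * full_bound > eps:
        t += 1
    return t


if __name__ == "__main__":
    for g in (0.9, 0.99, 0.999):
        print(g, truncation_horizon(g, r_max=1.0, eps=0.01))
```

The required horizon grows roughly like 1 / (1 - gamma), which is why 1 / (1 - gamma) is often called the effective horizon.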

Related

Supervised learning v.s. offline (batch) reinforcement learning

Most materials (e.g., David Silver's online course) I can find offer discussions about the relationship between supervised learning and reinforcement learning. However, it is actually a comparison between supervised learning and online reinforcement learning where the agent runs in the environment (or simulates interactions) to get feedback given limited knowledge about the underlying dynamics.
I am more curious about the offline (batch) setting for reinforcement learning, where the dataset (collected learning experiences) is given a priori. What are the differences compared to supervised learning in that case, and what similarities do they share?
In the online setting, the fundamental difference between supervised learning and reinforcement learning is the need for exploration and the exploration/exploitation trade-off in RL. However, in the offline setting too there are several differences that make RL a more difficult and richer problem than supervised learning. A few differences I can think of off the top of my head:
In reinforcement learning the agent receives what is termed "evaluative feedback" in the form of a scalar reward, which gives the agent some feedback on the quality of the action that was taken but does not tell the agent whether this action was the optimal one. Contrast this with supervised learning, where the learner receives what is termed "instructive feedback": for each prediction that the learner makes, it receives feedback (a label) that says what the optimal action/prediction was. The differences between instructive and evaluative feedback are detailed in the first chapters of Rich Sutton's book. Essentially, reinforcement learning is optimization with sparse labels: for some actions you may not get any feedback at all, and in other cases the feedback may be delayed, which creates the credit-assignment problem.
In reinforcement learning you have a temporal aspect where the goal is to find an optimal policy that maps states to actions over some horizon (number of time-steps). If the horizon T=1, then it is just a one-off prediction problem like in supervised learning, but if T>1 then it is a sequential optimization problem where you have to find the optimal action not just in a single state but in multiple states and this is further complicated by the fact that the actions taken in one state can influence which actions should be taken in future states (i.e. it is dynamic).
In supervised learning there is a fixed distribution from which the data points are drawn i.i.d. (this is the common assumption at least). In RL there is no fixed distribution; rather, the distribution of the data depends on the policy that is followed, and the samples are often not i.i.d. but correlated.
Hence, RL is a much richer problem than supervised learning. In fact, it is possible to convert any supervised learning task into a reinforcement learning task: the loss function of the supervised task can be used to define a reward function, with smaller losses mapping to larger rewards. It is not clear why one would want to do this, though, because it converts the supervised problem into a more difficult reinforcement learning problem. Reinforcement learning makes fewer assumptions than supervised learning and is therefore in general a harder problem to solve. However, the converse does not hold: in general, it is not possible to convert a reinforcement learning problem into a supervised learning problem.
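The conversion described above can be sketched as a one-step environment in which the input is the state, the predicted label is the action, and the reward is the negative 0-1 loss. The class name SupervisedAsBandit and its reset/step interface are hypothetical, chosen only to mirror common RL environment conventions.

```python
import random


class SupervisedAsBandit:
    """One-step episodic environment built from (input, label) pairs.

    The agent only ever sees the reward for the action it chose
    (evaluative feedback); the true label is never revealed.
    """

    def __init__(self, examples, seed=0):
        self.examples = list(examples)
        self.rng = random.Random(seed)
        self._label = None

    def reset(self):
        """Sample an example and return its input as the state."""
        x, y = self.rng.choice(self.examples)
        self._label = y
        return x

    def step(self, action):
        """Return (reward, done); every episode ends after one step."""
        reward = 0.0 if action == self._label else -1.0  # negative 0-1 loss
        return reward, True


if __name__ == "__main__":
    env = SupervisedAsBandit([((0.1, 0.2), "cat"), ((0.9, 0.8), "dog")])
    state = env.reset()
    print(state, env.step("cat"))
```

With more than two classes, a reward of -1.0 says the guess was wrong but not which label was right, which is exactly the information a supervised learner would have received.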

Why introduce Markov property to reinforcement learning?

As a beginner in deep reinforcement learning, I am confused about why we should use a Markov process in reinforcement learning, and what benefits it brings. In addition, a Markov process requires that, once the "present" is known, the "future" has nothing to do with the "past". Why can some deep reinforcement learning algorithms use RNNs and LSTMs? Does this violate the Markov process's assumption?
The Markov property is used for the math to work out in the optimization process. Do keep in mind, however, that it is much more generally applicable than you might think. For example, if in a certain board game you need to know the last three states of the game, this might seem to violate the Markov property; however, if you simply redefine your "state" to be the concatenation of the last three states, you are back in an MDP.
This assumption says that the current state gives all the information about the past agent-environment interaction that makes a difference for the future of the system. It is an important definition because it lets you define the dynamics of the process as p(s',r | s, a). In practical terms, you don't need to look at all the previous states of the system to determine the next possible states.
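The state-redefinition trick from the board-game example can be sketched as a small wrapper that keeps the last k observations and exposes their concatenation as the state. The class name HistoryState and the choice to pad the initial history with copies of the first observation are assumptions made for illustration.

```python
from collections import deque


class HistoryState:
    """Expose the last k raw observations as a single augmented state."""

    def __init__(self, k, initial_obs):
        # Pad with the first observation so the state always has length k.
        self.buffer = deque([initial_obs] * k, maxlen=k)

    def update(self, obs):
        """Append a new observation and return the augmented state."""
        self.buffer.append(obs)  # the oldest observation is dropped
        return tuple(self.buffer)


if __name__ == "__main__":
    history = HistoryState(k=3, initial_obs="s0")
    for obs in ("s1", "s2", "s3"):
        print(history.update(obs))
```

A recurrent network plays a similar role: its hidden state is a learned summary of the history, so feeding it to the policy is a way to recover an approximately Markov state rather than a violation of the assumption.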

Deep Value-only Reinforcement Learning: Train V(s) instead of Q(s,a)?

Is there a value-based (deep) reinforcement learning (RL) algorithm that is centred fully on learning only the state-value function V(s), rather than the state-action value function Q(s,a)?
If not, why not, or could it easily be implemented?
Are there any implementations available in Python, say in PyTorch or TensorFlow, or at a higher level in RLlib or similar?
I ask because
I have a multi-agent problem to simulate in which, in reality, an efficient centralized decision-making mechanism defines the agents' actions. This mechanism (i) successfully incentivizes truth-telling on the part of the decentralized agents, and (ii) essentially depends on the value functions of the various actors i (on V_i(s_{i,t+1}) for the different achievable post-period states s_{i,t+1} of all actors i). From an individual agent's point of view, the multi-agent nature with gradual learning means the system looks non-stationary as long as training is not finished. Because of the nature of the problem, I'm rather convinced that learning any natural Q(s,a) function for my problem is significantly less efficient than simply learning the terminal value function V(s), from which the centralized mechanism can readily derive the eventual actions for all agents by solving a separate sub-problem based on all agents' values.
The math of the typical DQN with temporal-difference learning seems to be naturally adaptable to a state-only, value-based training of a deep network for V(s) instead of the combined Q(s,a). Yet, within the value-based RL subdomain, everybody seems to focus on learning Q(s,a), and I have not found any purely V(s)-learning algos so far (other than analytical, non-deep, traditional Bellman-equation dynamic programming methods).
I am aware of Dueling DQN (DDQN), but it does not seem to be exactly what I am searching for. DDQN does at least have a separate learner for V(s), but overall it still aims to learn Q(s,a) in a decentralized way, which does not seem suitable in my case.
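For reference, the state-only temporal-difference update that the question alludes to is TD(0): V(s) is moved towards r + gamma * V(s'). A minimal tabular sketch is given here; the function name td0_update and the dictionary-based value table are illustrative, and a deep variant would replace the table with a network trained on the same target.

```python
from collections import defaultdict


def td0_update(V, s, r, s_next, done, alpha=0.1, gamma=0.99):
    """Apply one TD(0) update to the state-value table V and return V[s].

    Target: r if the episode ended, otherwise r + gamma * V[s_next].
    """
    target = r if done else r + gamma * V[s_next]
    V[s] += alpha * (target - V[s])
    return V[s]


if __name__ == "__main__":
    V = defaultdict(float)  # unseen states start at value 0
    # A hypothetical two-step episode: a -> b -> terminal, reward 1 at the end.
    for _ in range(200):
        td0_update(V, "a", 0.0, "b", done=False)
        td0_update(V, "b", 1.0, None, done=True)
    print(dict(V))
```

Note that V(s) alone does not say which action to take; some other component (here, the centralized mechanism described in the question) has to turn values of successor states into actions.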

Difference between Evolutionary Strategies and Reinforcement Learning?

I am learning about the approach employed in Reinforcement Learning for robotics and I came across the concept of Evolutionary Strategies. But I couldn't understand how RL and ES are different. Can anyone please explain?
To my understanding, there are two main differences.
1) Reinforcement learning uses the concept of one agent, and the agent learns by interacting with the environment in different ways. Evolutionary algorithms usually start with many "agents", and only the "strong ones survive" (the agents with characteristics that yield the lowest loss).
2) Reinforcement learning agents learn from both positive and negative actions, but evolutionary algorithms only learn the optimal solution; information about negative or suboptimal solutions is discarded and lost.
Example
You want to build an algorithm to regulate the temperature in the room.
The room is 15 °C, and you want it to be 23 °C.
Using Reinforcement learning, the agent will try a bunch of different actions to increase and decrease the temperature. Eventually, it learns that increasing the temperature yields a good reward. But it also learns that reducing the temperature will yield a bad reward.
An evolutionary algorithm starts with a bunch of random agents, each of which has a preprogrammed set of actions it is going to perform. The agents that have the "increase temperature" action survive and move on to the next generation. Eventually, only agents that increase the temperature survive and are deemed the best solution. However, the algorithm does not know what happens if you decrease the temperature.
TL;DR: RL is usually one agent, trying different actions, and learning and remembering all info (positive or negative). Evolutionary algorithms use many agents that guess many actions, and only the agents that have the optimal actions survive. Basically a brute-force way to solve a problem.
I think the biggest difference between Evolutionary Strategies and Reinforcement Learning is that ES is a global optimization technique while RL is a local optimization technique. So RL converges faster but may end up in a local optimum, while ES converges more slowly towards a global optimum.
Evolution Strategies optimization happens on a population level. An evolution strategy algorithm in an iterative fashion (i) samples a batch of candidate solutions from the search space (ii) evaluates them and (iii) discards the ones with low fitness values. The sampling for a new iteration (or generation) happens around the mean of the best scoring candidate solutions from the previous iteration. Doing so enables evolution strategies to direct the search towards a promising location in the search space.
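The sample, evaluate, discard loop described above can be sketched as follows, using the room-temperature example from the earlier answer as the fitness function. This is a deliberately simplified variant with a fixed sampling width sigma and no step-size adaptation; the function name evolution_strategy and all default parameter values are assumptions.

```python
import random


def evolution_strategy(fitness, mean, sigma=0.5, pop_size=20,
                       elite=5, generations=50, seed=0):
    """Simple elite-mean evolution strategy; returns the final mean."""
    rng = random.Random(seed)
    for _ in range(generations):
        # (i) sample candidate solutions around the current mean
        population = [[m + rng.gauss(0.0, sigma) for m in mean]
                      for _ in range(pop_size)]
        # (ii) evaluate them and (iii) keep only the fittest candidates
        population.sort(key=fitness, reverse=True)
        best = population[:elite]
        # the next generation is sampled around the mean of the survivors
        mean = [sum(c[i] for c in best) / elite for i in range(len(mean))]
    return mean


if __name__ == "__main__":
    # Fitness is highest when the target temperature of 23 degrees is hit.
    def temperature_fitness(candidate):
        return -(candidate[0] - 23.0) ** 2

    print(evolution_strategy(temperature_fitness, mean=[15.0]))
```

Only the fitness of whole candidates is used; no per-step reward, state, or value estimate appears anywhere in the loop.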
Reinforcement learning requires the problem to be formulated as a Markov Decision Process (MDP). An RL agent optimizes its behavior (or policy) by maximizing a cumulative reward signal received on transitions from one state to another. Since the problem is abstracted as an MDP, learning can happen at the step or episode level. Learning per step (or per N steps) is done via temporal-difference (TD) learning, and learning per episode is done via Monte Carlo methods. So far I have been talking about learning via action-value functions (learning the values of actions). Another way of learning is to optimize the parameters of a neural network representing the agent's policy directly via gradient ascent. This approach is introduced in the REINFORCE algorithm, and the general approach is known as policy-based RL.
For a comprehensive comparison check out this paper https://arxiv.org/pdf/2110.01411.pdf

Inverted Pendulum: model-based or model-free?

This is my first post here, and I came here to discuss or get clarifications on something that I have trouble understanding, namely model-free vs model-based RL methods. I am currently implementing Q-learning, but am not certain I am doing it correctly.
Example: Say I am applying Q-learning to an inverted pendulum, where the reward is given as the absolute distance between the pendulum's current position and the upward position, and the terminal state (or goal state) is defined to be when the pendulum is very close to the upward position.
Would this mean that I have a model-free or a model-based setup? As I understand it, this would be model-based, since I have a model of the environment that gives me the reward (R=abs(pos-wantedPos)). But then I saw an implementation of this using Q-learning (https://medium.com/#tuzzer/cart-pole-balancing-with-q-learning-b54c6068d947), which is a model-free algorithm. Now I am clueless...
Thankful for all responses.
Vanilla Q-learning is model-free.
The idea behind model-free reinforcement learning is that an agent learns an optimal policy directly from the states, actions, and rewards it experiences, in contrast to trying to model the environment. Knowing the reward formula does not by itself make the setup model-based: Q-learning only uses the reward values it observes and never uses or learns the transition dynamics.
If you took a model-based approach, you would model the environment (its transition and reward functions) and then plan with that model, for example by performing value iteration or policy iteration on the Markov decision process.
In model-free reinforcement learning, it is assumed you do not have the MDP, and thus must find an optimal policy based on the rewards you receive from your experience.
For a longer explanation, check out this post.
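To make the model-free point concrete, here is a minimal sketch of the tabular Q-learning update. It consumes a single observed transition (s, a, r, s') and never queries a transition or reward model. The function name q_learning_update and the dictionary keyed by (state, action) are illustrative choices.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions, done,
                      alpha=0.1, gamma=0.99):
    """Apply one Q-learning update from a sampled transition.

    Only the observed reward r and next state s_next are used;
    no model of the environment dynamics is required.
    """
    best_next = 0.0 if done else max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]


if __name__ == "__main__":
    Q = defaultdict(float)  # unseen (state, action) pairs start at 0
    actions = ["left", "right"]
    # A hypothetical transition from a discretized pendulum state 3 to state 4.
    print(q_learning_update(Q, 3, "right", -0.2, 4, actions, done=False))
```

For the pendulum, the states would have to be discretized (or Q replaced by a function approximator) before a table like this can be used.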