Reinforcement Learning without Successor State - reinforcement-learning

I'm attempting to pose a problem as a reinforcement learning problem. My difficulty is that the state which an agent is in changes randomly. They must simply choose an action within the state they are in. I want to learn appropriate actions for all states based on the reward they receive for performing actions.
Question:
Is this a specific type of RL problem?
If there is no successor state, how would one calculate the value of a state?

If the state really changes randomly, if there is no relationship between the action and the following state, then all you can do is record and average the rewards for each action and each state.
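A minimal sketch of that "record and average" idea, assuming states arrive independently of the chosen action (a setting often described as a contextual bandit). The toy environment, the names `toy_state` and `toy_reward`, and the uniform random exploration are illustrative assumptions, not part of the original question.

```python
import random
from collections import defaultdict

def learn_action_values(sample_state, reward_fn, actions, steps=20000, seed=0):
    """Average the observed reward for every (state, action) pair."""
    rng = random.Random(seed)
    counts = defaultdict(int)
    means = defaultdict(float)
    for _ in range(steps):
        state = sample_state(rng)          # the state changes at random
        action = rng.choice(actions)       # explore uniformly (assumption)
        reward = reward_fn(state, action, rng)
        key = (state, action)
        counts[key] += 1
        # Incremental form of the sample mean.
        means[key] += (reward - means[key]) / counts[key]
    return means

def best_action(means, state, actions):
    """Pick the action with the highest average reward in this state."""
    return max(actions, key=lambda a: means[(state, a)])

# Hypothetical toy environment: action 0 is good in state "A", action 1 in "B".
ACTIONS = [0, 1]

def toy_state(rng):
    return rng.choice(["A", "B"])

def toy_reward(state, action, rng):
    good = 0 if state == "A" else 1
    return (1.0 if action == good else 0.0) + rng.gauss(0.0, 0.1)

means = learn_action_values(toy_state, toy_reward, ACTIONS)
print(best_action(means, "A", ACTIONS), best_action(means, "B", ACTIONS))
```

Because the next state does not depend on the action, no bootstrapping from a successor state is needed; the per-pair average is the whole value estimate.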

So I've discovered that this would be called a Monte Carlo reinforcement learning problem. Rather than associating value with a state based on the value of the states one can transition to, value is associated with a state according to the outcome of a policy given that state directly. This is useful for instances when the dynamics of the state transition function are unknown or highly stochastic and difficult to model.
https://en.wikipedia.org/wiki/Reinforcement_learning

Related

How to guarantee that the actor would select a correct action?

In the training phase of Deep Deterministic Policy Gradient (DDPG) algorithm, the action selection would be simply
action = actor(state)
where state is the current state of the environment and actor is a deep neural network.
I do not understand how to guarantee that the returned action belongs to the action space of the considered environment.
For example, a state could be a vector of size 4 and the action space could be the interval [-1,1] of real numbers or the Cartesian product [-1,1]x[-2,2]. Why would the action returned by action = actor(state) belong to [-1,1] or [-1,1]x[-2,2], depending on the environment?
I have been reading some DDPG source code on GitHub, but I am missing something here and cannot figure out the answer.
The actor is usually a neural network, and its output typically lies in [-1,1] because the output layer uses an activation function such as tanh. These outputs can then be rescaled so that the action falls in any desired range.
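As a rough illustration of that rescaling, the sketch below squashes unbounded values with tanh and maps them affinely onto per-dimension bounds. The function name `scale_action` and the example bounds are assumptions for illustration; actual DDPG implementations usually do this inside the network or in an environment wrapper.

```python
import math

def scale_action(raw_outputs, low, high):
    """Squash each raw output with tanh, then map [-1, 1] onto [low_i, high_i]."""
    scaled = []
    for x, lo, hi in zip(raw_outputs, low, high):
        squashed = math.tanh(x)                      # always in [-1, 1]
        scaled.append(lo + 0.5 * (squashed + 1.0) * (hi - lo))
    return scaled

# Action space [-1, 1] x [-2, 2]; the raw values stand in for pre-activation outputs.
action = scale_action([3.7, -12.0], low=[-1.0, -2.0], high=[1.0, 2.0])
print(action)
```

Whatever the raw outputs are, the result stays inside the box, which is why the actor cannot produce an out-of-range action.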
The actor can learn to choose good actions for the environment because, in an MDP (Markov decision process), it interacts with the environment by trial and error and receives a reward or a penalty depending on how good its actions are; that is, the actor network receives gradients that push it toward better actions.
Note that algorithms such as PPG, PPO, SAC, and DDPG can, in theory, guarantee that the actor selects the best action for all states (i.e., assuming infinite learning time, infinite actor network capacity, etc.). In practice, there is usually no such guarantee unless the action space is discrete and the environment is very simple.
Understanding the idea behind RL algorithms will greatly help you understand their source code; after all, the code is an implementation of the idea.

Inverted Pendulum: model-based or model-free?

This is my first post here, and I came here to discuss or get clarifications on something that I have trouble understanding, namely model-free vs model-based RL methods. I am currently implementing Q-learning, but am not certain I am doing it correctly.
Example: Say I am applying Q-learning to an inverted pendulum, where the reward is given as the absolute distance between the pendulum's position and the upward position, and the terminal state (or goal state) is defined to be when the pendulum is very close to the upward position.
Would this mean that I have a model-free or a model-based setup? As I understand it, this would be model-based, since I have a model of the environment that gives me the reward (R=abs(pos-wantedPos)). But then I saw an implementation of this using Q-learning (https://medium.com/#tuzzer/cart-pole-balancing-with-q-learning-b54c6068d947), which is a model-free algorithm. Now I am clueless...
Thankful for all responses.
Vanilla Q-learning is model-free.
The idea behind reinforcement learning is that an agent is trained to learn an optimal policy based on pairs of states and rewards--this is in contrast to trying to model the environment.
If you took a model-based approach, you would be trying to model the environment and ultimately perform value iteration or policy iteration of the Markov decision process.
In reinforcement learning, it is assumed you do not have the MDP, and thus must try to find an optimal policy based on the various rewards you receive from your experiences.
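To make the model-free point concrete, here is a small tabular Q-learning sketch. The agent only ever calls `step` and sees the returned next state and reward; it never inspects the transition or reward function. The corridor environment and all names are hypothetical stand-ins for the pendulum, chosen to keep the example short.

```python
import random

def q_learning(step, n_states, n_actions, start, episodes=500,
               alpha=0.1, gamma=0.95, epsilon=0.1, seed=0):
    """Tabular Q-learning that treats `step` as a black box."""
    rng = random.Random(seed)
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        state, done = start, False
        while not done:
            if rng.random() < epsilon:
                action = rng.randrange(n_actions)        # explore
            else:
                best = max(q[state])                     # exploit, random tie-break
                action = rng.choice(
                    [a for a in range(n_actions) if q[state][a] == best])
            next_state, reward, done = step(state, action)
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q

# Hypothetical environment: corridor of states 0..4, action 0 = left, 1 = right,
# reward 1 on reaching the goal state 4.
def corridor_step(state, action):
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == 4
    return next_state, (1.0 if done else 0.0), done

q = q_learning(corridor_step, n_states=5, n_actions=2, start=0)
greedy = [max(range(2), key=lambda a: q[s][a]) for s in range(4)]
print(greedy)
```

The reward formula lives inside the environment, just as R=abs(pos-wantedPos) does in the pendulum setup; having it there does not make the learning algorithm model-based.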
For a longer explanation, check out this post.

How to choose action in TD(0) learning

I am currently reading Sutton's Reinforcement Learning: An introduction book. After reading chapter 6.1 I wanted to implement a TD(0) RL algorithm for this setting:
To do this, I tried to implement the pseudo-code presented here:
While doing this, I wondered how to carry out the step A <- action given by π for S: how can I choose the optimal action A for my current state S? As the value function V(S) depends only on the state and not on the action, I do not really know how this can be done.
I found this question (where I got the images from), which deals with the same exercise, but there the action is just picked randomly and not chosen by an action policy π.
Edit: Or is this pseudo-code incomplete, so that I also have to approximate the action-value function Q(s, a) in some other way?
You are right: you cannot choose an action (nor derive a policy π) from a value function V(s) alone because, as you noticed, it depends only on the state s.
The key concept that you are probably missing here is that TD(0) learning is an algorithm for computing the value function of a given policy. Thus, you are assuming that your agent is following a known policy. In the case of the Random Walk problem, the policy consists of choosing actions randomly.
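A sketch of TD(0) used purely for prediction, assuming the standard random-walk layout from the book: five non-terminal states, a start in the middle, terminal states at both ends, and a reward of 1 only when exiting on the right. The step size, the episode count, and the function name are illustrative choices.

```python
import random

def td0_random_walk(n_states=5, episodes=5000, alpha=0.02, gamma=1.0, seed=0):
    """Estimate V for the fixed, uniformly random policy with TD(0)."""
    rng = random.Random(seed)
    # Indices 0 and n_states + 1 are terminal; their value stays 0.
    v = [0.0] + [0.5] * n_states + [0.0]
    for _ in range(episodes):
        s = (n_states + 1) // 2                    # start in the middle
        while 0 < s < n_states + 1:
            action = rng.choice([-1, 1])           # A <- action given by pi for S
            s_next = s + action
            reward = 1.0 if s_next == n_states + 1 else 0.0
            # TD(0) update: move V(S) toward R + gamma * V(S').
            v[s] += alpha * (reward + gamma * v[s_next] - v[s])
            s = s_next
    return v[1:-1]

values = td0_random_walk()
print([round(x, 2) for x in values])
```

The policy is supplied from outside (here, a coin flip), and V is never used to pick the action; the known true values for this walk are 1/6, 2/6, ..., 5/6.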
If you want to be able to learn a policy, you need to estimate the action-value function Q(s,a). There exist several methods for learning Q(s,a) based on temporal-difference learning, such as SARSA and Q-learning.
In Sutton's RL book, the authors distinguish between two kinds of problems: prediction problems and control problems. The former refers to estimating the value function of a given policy, and the latter to estimating policies (often by means of action-value functions). You can find a reference to these concepts at the start of Chapter 6:
As usual, we start by focusing on the policy evaluation or prediction
problem, that of estimating the value function v_π for a given policy π.
For the control problem (finding an optimal policy), DP, TD, and Monte
Carlo methods all use some variation of generalized policy iteration
(GPI). The differences in the methods are primarily differences in
their approaches to the prediction problem.

How to handle uncertainty in position?

I am working on a car-following problem and the measurements I am receiving are uncertain (I know that the noise model is Gaussian and its variance is also known). How do I select my next action under this kind of uncertainty?
Basically how should I change my cost function so that I can optimize my plan by selecting appropriate action?
Vanilla reinforcement learning is meant for Markov decision processes, where it's assumed that you can fully observe the state. Because your states are noisy, you have a partially observable Markov decision process (POMDP). Theoretically speaking, you should be looking at a different category of RL approaches.
Practically, since you have so much information about the parameters of the uncertainty, you should consider using a Kalman or particle filter to perform state estimation. Then, use the most likely state estimate as the true state in your RL problem. The estimate will be wrong at times, of course, but if you're using a function approximation approach for the value function, the experience can generalize across similar states and you'll be able to learn. The learning performance is going to be proportional to the quality of your state estimate.
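A one-dimensional Kalman filter sketch of that state-estimation step, assuming a scalar position, a known Gaussian measurement variance, and a simple additive motion model. The function names and the numbers are illustrative; a real car-following setup would likely track position and velocity jointly.

```python
def kalman_predict(mean, var, motion, motion_var):
    """Propagate the belief through the motion model; uncertainty grows."""
    return mean + motion, var + motion_var

def kalman_update(mean, var, measurement, meas_var):
    """Fuse the predicted belief with a noisy measurement; uncertainty shrinks."""
    gain = var / (var + meas_var)                 # Kalman gain in [0, 1]
    new_mean = mean + gain * (measurement - mean)
    new_var = (1.0 - gain) * var
    return new_mean, new_var

# Prior belief about the position, then one noisy measurement of equal variance.
mean, var = 0.0, 1.0
mean, var = kalman_update(mean, var, measurement=2.0, meas_var=1.0)
print(mean, var)  # 1.0 0.5

# Move forward by 1.0 with some process noise before the next measurement.
mean, var = kalman_predict(mean, var, motion=1.0, motion_var=0.25)
print(mean, var)  # 2.0 0.75
```

The filtered `mean` is what would be handed to the RL agent as its state, and `var` indicates how far that estimate can be trusted.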

Rewards in Q-Learning and in TD(lambda)

How do rewards work in those two RL techniques? I mean, they both improve the policy and its evaluation, but not the rewards.
Do I need to guess them from the beginning?
You don't need to guess the rewards. The reward is feedback from the environment, and the rewards are parameters of the environment. The algorithm works under the condition that the agent can observe only the feedback, the state space, and the action space.
The key idea of Q-learning and TD is asynchronous stochastic approximation: we approximate the fixed point of the Bellman operator using noisy evaluations of the expected long-term reward.
For example, if we want to estimate the expectation of a Gaussian distribution, we can draw samples from it and average them.
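The averaging example can be written in the same incremental form as a TD update, which may make the analogy easier to see. This is only a sketch; the distribution parameters and the name `running_mean` are arbitrary choices.

```python
import random

def running_mean(samples):
    """Stochastic approximation of an expectation: nudge the estimate toward each sample."""
    estimate = 0.0
    for n, x in enumerate(samples, start=1):
        step_size = 1.0 / n                 # decaying step size gives the exact sample mean
        estimate += step_size * (x - estimate)
    return estimate

rng = random.Random(0)
samples = [rng.gauss(3.0, 1.0) for _ in range(10000)]
estimate = running_mean(samples)
print(round(estimate, 2))
```

Q-learning and TD use the same shape of update, `estimate += step_size * (target - estimate)`, except that the noisy target is built from an observed reward plus a bootstrapped value of the next state.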
Reinforcement learning is for problems where the AI agent has no information about the world it is operating in. So reinforcement learning algorithms not only give you a policy (an optimal action at each state) but also navigate a completely foreign environment (with no knowledge of which action leads to which resulting state) and learn the parameters of this new environment. These are model-based reinforcement learning algorithms.
Q-learning and temporal-difference learning, on the other hand, are model-free reinforcement learning algorithms. This means the AI agent does the same things as in a model-based algorithm, but it does not have to learn the model (things like transition probabilities) of the world it is operating in. Through many iterations it comes up with a mapping from each state to the optimal action to perform in that state.
Now, coming to your question: you do not have to guess the rewards at different states. Initially, when the agent is new to the environment, it just chooses a random action from the state it is in and passes it to the simulator. The simulator, based on the transition functions, returns the resulting state for that state-action pair and also returns the reward for being in that state.
The simulator is analogous to Nature in the real world. For example, you find something unfamiliar in the world and perform some action, like touching it; if the thing turns out to be a hot object, Nature gives a reward in the form of pain, so that next time you know what happens when you try that action. When programming this, it is important to note that the workings of the simulator are not visible to the AI agent that is trying to learn the environment.
Depending on the reward that the agent senses, it backs up its Q-value (in the case of Q-learning) or utility value (in the case of TD learning). Over many iterations these Q-values converge, and you are able to choose an optimal action for every state based on the Q-values of the state-action pairs.