How do rewards in those two RL techniques work? I mean, they both improve the policy and the evaluation of it, but not the rewards.
How do I need to guess them from the beginning?
You don't need guess the rewards. Reward is a feedback from the enviroment and rewards are parameters of the enviroment. Algorithm works in condition that agent can observe only feedback, state space and action space.
The key idea of Q-learning and TD is asynchronous stochastic approximation where we approximate Bellman operator's fixed point using noisy evaluations of longterm reward expectation.
For example, if we want to estimate expectation Gaussian distribution then we can sample and average it.
Reinforcement Learning is for problems where the AI agent has no information about the world it is operating in. So Reinforcement Learning algos not only give you a policy/ optimal action at each state but also navigate in a completely foreign environment( with no knoledge about what action will result in which result state) and learns the parameters of this new environment. These are model-based Reinforcement Learning Algorithm
Now Q Learning and Temporal Difference Learning are model-free reinforcement Learning algorithms. Meaning, the AI agent does the same things as in model-based Algo but it does not have to learn the model( things like transition probabilities) of the world it is operating in. Through many iterations it comes up with a mapping of each state to the optimal action to be performed in that state.
Now coming to your question, you do not have to guess the rewards at different states. Initially when the agent is new to the environment, it just chooses a random action to be performed from the state it is in and gives it to the simulator. The simulator, based on the transition functions, returns the result state of that state action pair and also returns the reward for being in that state.
The simulator is analogous to Nature in the real world. For example you find something unfamiliar in the world, you perform some action, like touching it, if the thing turns out to be a hot object Nature gives a reward in the form of pain, so that the next time you know what happens when you try that action. While programming this it is important to note that the working of the simulator is not visible to the AI agent that is trying to learn the environment.
Now depending on this reward that the agent senses, it backs up it's Q-value( in the case of Q-Learning) or utility value( in the case of TD-Learning). Over many iterations these Q-values converge and you are able to choose an optimal action for every state depending on the Q-value of the state-action pairs.
Related
I want to know about the convergence time of Deep Q-learning vs Q-learning when run on same problem. Can anyone give me an idea about the pattern between them? It will be better if it is explained with a graph.
In short, the more complicated the state is, the better DQN is over Q-Learning (by complicated, I mean the number of all possible states). When the state is too complicated, Q-learning becomes nearly impossible to converge due to time and hardware limitation.
note that DQN is in fact a kind of Q-Learning, it uses a neural network to act like a q table, both Q-network and Q-table are used to output a Q value with the state as input. I will continue using Q-learning to refer the simple version with Q-table, DQN with the neural network version
You can't tell convergence time without specifying a specific problem, because it really depends on what you are doing:
For example, if you are doing a simple environment like FrozenLake:https://gym.openai.com/envs/FrozenLake-v0/
Q-learning will converge faster than DQN as long as you have a reasonable reward function.
This is because FrozenLake has only 16 states, Q-Learning's algorithm is just very simple and efficient, so it runs a lot faster than training a neural network.
However, if you are doing something like atari:https://gym.openai.com/envs/Assault-v0/
there are millions of states (note that even a single pixel difference is considered totally new state), Q-Learning requires enumerating all states in Q-table to actually converge (so it will probably require a very large memory plus a very long training time to be able to enumerate and explore all possible states). In fact, I am not sure if it is ever going to converge in some more complicated game, simply because of so many states.
Here is when DQN becomes useful. Neural networks can generalize the states and find a function between state and action (or more precisely state and Q-value). It no longer needs to enumerate, it instead learns information implied in states. Even if you have never explored a certain state in training, as long as your neural network has been trained to learn the relationship on other similar states, it can still generalize and output the Q-value. And therefore it is a lot easier to converge.
In the training phase of Deep Deterministic Policy Gradient (DDPG) algorithm, the action selection would be simply
action = actor(state)
where state is the current state of the environment and actor is a deep neural network.
I do not understand how to guarantee that the returned action belongs to the action space of the considered environment.
For example, a state could be a vector of size 4 and the action space could be the interval [-1,1] of real numbers or the Cartesian product of [-1,1]x[-2,2]. Why, after doing action = actor(state), the returned action would belong to [-1,1] or [-1,1]x[-2,2], depending on the environment?
I was reading some source codes of DDPG on GitHub but I am missing something here and I cannot figure out the answer.
The actor usually is a neural network, and the reason of actor's action restrict in [-1,1] is usually because the output layer of the actor net using activation function like Tanh, and one can process this outputs to let action belong to any range.
The reason of actor can choose the good action depending on environment, is because in MDP(Markov decision process), the actor doing trial and error in the environment, and get reward or penalty for actor doing good or bad, i.e the actor net get gradients towards better action.
Note algorithms like PPG, PPO, SAC, DDPG, can guarantee the actor would select the best action for all states in theory! (i.e assume infinite learning time, infinite actor net capacity, etc.) in practice, there usually no guarantee unless action space is discrete and environment is very simple.
Understand the idea behind RL algorithms will greatly help you understand source codes of those algorithms, after all, code is implementation of the idea.
This is my first post here, and I came here to discuss or get clarifications on something that I have trouble understanding, namely model-free vs model-based RL methods. I am currently implementing Q-learning, but am not certain I am doing it correctly.
Example: Say I am applying Q-learning to an inverted pendulum, where the reward is given as the absolute distance between the pendulum upward position, and terminal state (or goal state) is defined to be when the pendulum is very close to upward position.
Would this setup mean that I have a model-free or model-based setup? From how I have understood, this would be model-based as I have a model of the environment that is giving me the reward (R=abs(pos-wantedPos)). But then I saw an implementation of this using Q-learning (https://medium.com/#tuzzer/cart-pole-balancing-with-q-learning-b54c6068d947), which is a model-free algorithm. Now I am clueless...
Thankful for all responses.
Vanilla Q-learning is model-free.
The idea behind reinforcement learning is that an agent is trained to learn an optimal policy based on pairs of states and rewards--this is in contrast to trying to model the environment.
If you took a model-based approach, you would be trying to model the environment and ultimately perform value iteration or policy iteration of the Markov decision process.
In reinforcement learning, it is assumed you do not have the MDP, and thus must try to find an optimal policy based on the various rewards you receive from your experiences.
For a longer explanation, check out this post.
I am working on a car following problem and the measurements I am receiving are uncertain ( I know that the noise model is gaussian and it's variance is also known). How do I select my next action in such kind of uncertainty?
Basically how should I change my cost function so that I can optimize my plan by selecting appropriate action?
Vanilla reinforcement learning is meant for Markov decision processes, where it's assumed that you can fully observe the state. Because your states are noisy, you have a Partially observable Markov decision process. Theoretically speaking you should be looking at a different category of RL approaches.
Practically, since you have so much information about the parameters of the uncertainty, you should consider using a Kalman or particle filter to perform state estimation. Then, use the most likely state estimate as the true state in your RL problem. The estimate will be wrong at times, of course, but if you're using a function approximation approach for the value function, the experience can generalize across similar states and you'll be able to learn. The learning performance is going to be proportional to the quality of your state estimate.
My goal is to predict customer churn. I want to use reinforcement learning to train a recurrent neural network which predicts a target response for its input.
I understand that the state is represented by the input to the network at each time, but I don't understand how the action is represented. Is it the values of weights which the neural network should decide to choose by some formulas?
Also, how should we create a reward or punishment to teach the neural network its weights as we don't know the target response for each input neurons?
The aim of reinforcement learning is typically to maximize long term reward for an agent playing a game of sorts (a Markov Decision Process). In typical reinforcement learning usage, neural networks are used to approximate the Q-function. So, the network's input is the state and action (or a feature representations thereof), and the output is the value of taking that action in that state. Reinforcement learning algorithms like Q-learning provide the details on how to choose actions at a given time step, and also dictate how updates to the value function should be done.
It isn't clear how your specific goal of building a customer churn model might be formulated as a Markov Decision Problem. You could define your states to be statistics about customers' interactions with the company website, but it isn't clear what the actions might be, because it isn't clear what the agent is and what it can do. This is also why you are finding it difficult to define a reward function. The reward function should tell the agent if it's doing a good job. So, if we're imagining an MDP where the agent is trying to minimize customer churn, we might provide a negative reward proportional to the number of customers that turn over.
I don't think you want to learn a Q-function. I think it's more likely that you are interested simply in supervised learning, where you have some sample data and you want to learn a function that will tell you how much churn there will be. For this, you should be looking towards gradient descent methods and forward/backward propagation for training your neural network.