I'm trying to build a reinforcement learning model using a Markov decision process. When I'm at state t and apply an action x, I move to state t+1, and my system stays in that state for a period of time. I want to know: is it possible to calculate the reward at the end of the new state?
This seems like it should be obvious, but I can't find resources on it anywhere. I am building a reinforcement learning model with OpenAI's gym-anytrading environment and stable-baselines3. There are a ton of online tutorials and documentation for training and evaluating the model, but almost nothing on actually using it in practice.
e.g. I want the model to constantly look at today's data and predict what action I should take to lock in tomorrow's profits.
Reinforcement learning algorithms all seem to have a model.predict() method, but you have to pass the environment, which is just more historical data. What if I want it to use today's data to predict tomorrow's values? Do I just include data up to today in the test set and retrain the model from scratch each time I want it to make a prediction?
e.g. The original training data ranges from 2014-01-01 to today (2023-02-12), and I run through the whole train and test process. Then tomorrow I start from scratch and train/test using the date range 2014-01-01 to today (2023-02-13), then the next day 2014-01-01 to today (2023-02-14), etc.? How do I actually make real-time predictions with a reinforcement learning model, as opposed to continually evaluating how it would have performed on past data?
Thanks.
This is a very good and practical question. I assume that in practice you use all the historical data to train your RL agent in stable-baselines3 and then apply the trained agent to predict tomorrow's action. The short answer is NO, you don't need to train your agent from scratch every day.
First, you need to understand the procedures for training and for prediction.
In the learning or training process (sketched in code after the steps):
1. Initialize your RL agent's policy or value network.
2. Input the observation for day 2014-01-01 to your RL agent.
3. Your agent makes a decision based on the observation.
4. Calculate the observation and reward/profit for day 2014-01-02 and send them back to your agent.
5. Depending on the RL algorithm you use, your agent might update its policy or value network based on this observation-reward pair, or it might save the pair into a buffer and only update its policy or value network after a certain number of days (e.g., 30 days, 180 days).
6. Repeat steps 2-5 until you reach the last day of your database (e.g., 2023-02-12).
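To make this concrete, here is a minimal sketch (my own illustration, not part of the original answer) of the training process with gym-anytrading and stable-baselines3; the CSV path, window size, and timestep budget are made-up placeholders, and the library drives steps 1-6 internally once you call learn():

```python
import gym
import gym_anytrading  # registers the 'stocks-v0' trading environment
import pandas as pd
from stable_baselines3 import PPO

# Hypothetical historical data; gym-anytrading's stocks env expects OHLC-style
# columns such as 'Close'.
df = pd.read_csv("prices_2014-01-01_to_2023-02-12.csv",
                 parse_dates=["Date"], index_col="Date")

# The environment replays the history: each step feeds the agent a window of
# recent prices (step 2), takes its action (step 3), and returns the next day's
# observation and profit/reward (step 4).
env = gym.make("stocks-v0", df=df, window_size=10, frame_bound=(10, len(df)))

model = PPO("MlpPolicy", env, verbose=0)   # step 1: initialize the policy/value network
model.learn(total_timesteps=200_000)       # steps 2-5 repeated over the history (step 6)
model.save("ppo_trading_agent")            # keep the trained agent for later prediction
```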
In the prediction process (which uses only steps 2 and 3 from the training process; see the sketch below):
2. Input the current day's observation (e.g., for 2023-02-12) to your RL agent.
3. Your agent makes a decision based on the observation.
That's it.
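As a sketch of those two prediction steps (again my own illustration, with a hypothetical saved-model name and helper function): you load the already-trained agent once and then just call predict() on today's observation. The observation must match the windowed format the agent saw during training; by default gym-anytrading feeds a window of prices and their day-over-day differences, but check your own env's signal features.

```python
import numpy as np
from stable_baselines3 import PPO

model = PPO.load("ppo_trading_agent")   # trained once on the full history, loaded daily

def build_observation(prices, window_size=10):
    # Hypothetical helper: reproduce the (window_size, 2) observation of price and
    # day-over-day difference that the training environment produced.
    window = np.asarray(prices[-window_size:], dtype=np.float32)
    diffs = np.diff(window, prepend=window[0])
    return np.column_stack([window, diffs])

recent_closes = [101.2, 102.0, 101.7, 103.1, 104.0,
                 103.5, 104.2, 105.0, 104.8, 105.6]   # dummy data for the last 10 days

obs = build_observation(recent_closes)                    # step 2: today's observation
action, _state = model.predict(obs, deterministic=True)   # step 3: the agent decides
print("Action for tomorrow:", action)                     # 0 = Sell, 1 = Buy in gym-anytrading
```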
You can repeatedly train your model on the historical data until you are satisfied with its performance during training. In this retraining process, after each pass through the entire history, you can save the model and load the saved model as the initialization for the next pass.
Once you have that good model, you don't need to keep training it on the new data arriving after 2023-02-12. It is still valid.
You may think that new data is generated every day and that the most recent data is the most valuable. In that case, you can periodically update your existing model with the new data using the following procedure (sketched in code after the steps):
1. Load your existing RL agent model (the trained model).
2. Input the observation for day one of your most recent new data to your RL agent.
3. Your agent makes a decision based on the observation.
4. Calculate the observation and reward/profit for day two of your new data and send them back to your agent.
5. Depending on the RL algorithm you use, your agent might update its policy or value network based on this observation-reward pair, or it might save the pair into a buffer and only update its policy or value network after a certain number of days (e.g., 30 days).
6. Repeat steps 2-5 until you reach the last day of your new data.
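A sketch of this periodic update with stable-baselines3 (again my own illustration; file names and timestep counts are placeholders) simply loads the saved agent, points it at an environment built from the recent data only, and continues training rather than starting from scratch:

```python
import gym
import gym_anytrading
import pandas as pd
from stable_baselines3 import PPO

new_df = pd.read_csv("prices_last_30_days.csv", parse_dates=["Date"], index_col="Date")
new_env = gym.make("stocks-v0", df=new_df, window_size=10, frame_bound=(10, len(new_df)))

model = PPO.load("ppo_trading_agent")      # step 1: load the already-trained model
model.set_env(new_env)                     # replay only the most recent data
model.learn(total_timesteps=10_000,        # steps 2-5 on the new data (step 6)
            reset_num_timesteps=False)     # continue from the existing training state
model.save("ppo_trading_agent")            # overwrite with the refreshed agent
```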
If lots of iterations in a simulated environment are needed before a reinforcement learning (RL) algorithm can work in the real world, why don't we use the same simulated environment to generate labeled data and then use supervised learning methods instead of RL?
The reason is that the two fields have a fundamental difference:
One tries to replicate previous results and the other tries to be better than previous results.
There are four fields in machine learning:
Supervised learning
Unsupervised learning
Semi-supervised learning
Reinforcement learning
Let's talk about the two fields you asked about and explore them intuitively with a real-life example: archery.
Supervised Learning
For supervised learning, we would observe a master archer in action for maybe a week and record how far they pulled the bow string back, the angle of the shot, etc. Then we go home and build a model. In the most ideal scenario, our model becomes as good as the master archer. It cannot get better, because the loss function in supervised learning is usually MSE or cross-entropy, so we simply try to replicate the feature-to-label mapping. After building the model, we deploy it. And let's say we're extra fancy and make it learn online, so we continually take data from the master archer and keep learning to be exactly the same as the master archer.
The biggest takeaway:
We're trying to replicate the master archer simply because we think he is the best. Therefore we can never beat him.
Reinforcement Learning
In reinforcement learning, we simply build a model and let it try many different things, and we give it a reward/penalty depending on how far the arrow landed from the bullseye. We are not trying to replicate any behaviour; instead, we try to find our own optimal behaviour. Because of this, we do not build in any bias towards what we think the optimal shooting strategy is.
Because RL does not have any prior knowledge, it may be difficult for RL to converge on hard problems. Therefore, there is a method called apprenticeship learning / imitation learning, where we give the RL agent some trajectories of master archers just so it has a starting point and can begin to converge. But after that, RL will still sometimes explore by taking random actions to try to find other optimal solutions. This is something that supervised learning cannot do: if you explore using supervised learning, you are effectively asserting that taking this action in this state is optimal and then making your model replicate it, when such an exploratory action should really be treated as an outlier in the data.
Key differences between supervised learning and RL:
Supervised Learning replicates what's already done
Reinforcement learning can explore the state space, and do random actions. This then allows RL to be potentially better than the current best.
Why don't we use the same simulated environment to generate labeled data and then use supervised learning methods instead of RL?
We effectively do this in deep RL through the experience replay buffer, but it is not possible for supervised learning, because the concept of a reward is missing: the generated data only tells you which actions were taken, not which ones were good.
Example: Walking in a maze.
Reinforcement Learning
Taking a right in square 3: Reward = 5
Taking a left in square 3: Reward = 0
Going up in square 3: Reward = -5
Supervised Learning
Taking a right in square 3
Taking a left in square 3
Going up in square 3
When you try to make a decision in square 3, RL knows to go right. Supervised learning will be confused, because one example in your data says to take a right in square 3, a second says to take a left, and a third says to go up, so it will never converge.
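A toy sketch (my own, not from the original answer) of why the reward resolves this: with rewards attached, the agent can rank the three actions in square 3, whereas the bare state-action "labels" give a classifier three contradictory targets for the same input.

```python
# RL view: rewards attached to each (state, action) pair observed in square 3.
rl_data = {
    ("square 3", "right"): 5,
    ("square 3", "left"):  0,
    ("square 3", "up"):   -5,
}
best = max((a for (s, a) in rl_data if s == "square 3"),
           key=lambda a: rl_data[("square 3", a)])
print(best)   # -> 'right': the reward signal tells us which action was best

# Supervised view: the same experience as labeled examples, with no reward column.
sl_data = [("square 3", "right"),
           ("square 3", "left"),
           ("square 3", "up")]
# A classifier trained on sl_data sees one input ("square 3") mapped to three
# different labels, so there is no consistent mapping for it to learn.
```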
In short, supervised learning is passive learning, that is, all the data is collected before you start training your model.
However, reinforcement learning is active learning. In RL, usually, you don't have much data at first and you collect new data as you are training your model. Your RL algorithm and model decide what specific data samples you can collect while training.
Supervised Learning is about the generalization of the knowledge given by the supervisor (training data) to use in an uncharted area (test data). It is based on instructive feedback where the agent is provided with correct actions (labels) to take given a situation (features).
Reinforcement Learning is about learning through interaction by trial-and-error. There is no instructive feedback but only evaluative feedback that evaluates the action taken by an agent by informing how good the action taken was instead of saying the correct action to take.
In supervised learning we have labeled target data, which is assumed to be correct.
In RL that's not the case: we have nothing but rewards. The agent needs to figure out by itself which actions to take by interacting with the environment and observing the rewards it gets.
Reinforcement learning is an area of machine learning. It is about taking suitable actions to maximize reward in a particular situation. It is employed by various software and machines to find the best possible behavior or path to take in a specific situation. Reinforcement learning differs from supervised learning in that in supervised learning the training data comes with the answer key, so the model is trained on the correct answers themselves, whereas in reinforcement learning there is no answer: the reinforcement agent decides what to do to perform the given task. In the absence of a training data set, it is bound to learn from its own experience.
My goal is to predict customer churn. I want to use reinforcement learning to train a recurrent neural network which predicts a target response for its input.
I understand that the state is represented by the input to the network at each time step, but I don't understand how the action is represented. Is it the values of the weights, which the neural network should choose by some formula?
Also, how should we create a reward or punishment to teach the neural network its weights, given that we don't know the target response for each input?
The aim of reinforcement learning is typically to maximize the long-term reward for an agent playing a game of sorts (a Markov decision process). In typical reinforcement learning usage, neural networks are used to approximate the Q-function, so the network's input is the state and action (or feature representations thereof), and the output is the value of taking that action in that state. Reinforcement learning algorithms like Q-learning provide the details on how to choose actions at a given time step, and also dictate how updates to the value function should be done.
It isn't clear how your specific goal of building a customer churn model might be formulated as a Markov Decision Problem. You could define your states to be statistics about customers' interactions with the company website, but it isn't clear what the actions might be, because it isn't clear what the agent is and what it can do. This is also why you are finding it difficult to define a reward function. The reward function should tell the agent if it's doing a good job. So, if we're imagining an MDP where the agent is trying to minimize customer churn, we might provide a negative reward proportional to the number of customers that turn over.
I don't think you want to learn a Q-function. I think it's more likely that you are interested simply in supervised learning, where you have some sample data and you want to learn a function that will tell you how much churn there will be. For this, you should be looking towards gradient descent methods and forward/backward propagation for training your neural network.
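For completeness, here is a minimal sketch (my own, with synthetic data and made-up feature meanings) of that supervised route: a logistic-regression churn model trained with plain gradient descent, i.e., the forward/backward-propagation recipe stripped down to a single layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic customer features, e.g. visits, tenure, support tickets (standardized),
# and a binary churn label generated from a noisy linear rule.
X = rng.normal(size=(1000, 3))
true_w = np.array([1.5, -2.0, 0.8])
y = ((X @ true_w + rng.normal(scale=1.0, size=1000)) > 0).astype(float)

w, b, lr = np.zeros(3), 0.0, 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # forward pass: predicted churn probability
    grad_w = X.T @ (p - y) / len(y)          # backward pass: cross-entropy gradients
    grad_b = np.mean(p - y)
    w -= lr * grad_w                         # gradient-descent update
    b -= lr * grad_b

print("learned weights:", w.round(2))        # roughly aligned with the generating weights
```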
I'm attempting to pose a problem as a reinforcement learning problem. My difficulty is that the state the agent is in changes randomly; the agent must simply choose an action within the state it is in. I want to learn appropriate actions for all states based on the reward received for performing actions.
Question:
Is this a specific type of RL problem?
If there is no successor state, how would one calculate the value of a state?
If the state really changes randomly, i.e., there is no relationship between the action taken and the following state, then all you can do is record and average the rewards for each action in each state.
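A small sketch (my own) of that "record and average" idea, which is essentially a per-state bandit with incremental running averages:

```python
from collections import defaultdict

value = defaultdict(float)   # (state, action) -> running average of observed rewards
count = defaultdict(int)     # (state, action) -> how many times it has been observed

def update(state, action, reward):
    key = (state, action)
    count[key] += 1
    value[key] += (reward - value[key]) / count[key]   # incremental mean update

def best_action(state, actions):
    return max(actions, key=lambda a: value[(state, a)])

# Usage: after each interaction, record the reward; to act, pick the best average.
update("s1", "left", 1.0)
update("s1", "left", 0.0)
update("s1", "right", 2.0)
print(best_action("s1", ["left", "right"]))   # -> 'right'
```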
So I've discovered that this would be called a Monte Carlo reinforcement learning problem. Rather than associating value with a state based on the value of the states one can transition to, value is associated with a state according to the outcome of a policy given that state directly. This is useful for instances when the dynamics of the state transition function are unknown or highly stochastic and difficult to model.
https://en.wikipedia.org/wiki/Reinforcement_learning
How do rewards in these two RL techniques (Q-learning and TD learning) work? I mean, they both improve the policy and its evaluation, but not the rewards.
Do I need to guess them from the beginning?
You don't need to guess the rewards. The reward is feedback from the environment, and the rewards are parameters of the environment. The algorithm works under the condition that the agent can observe only this feedback, the state space, and the action space.
The key idea of Q-learning and TD is asynchronous stochastic approximation, where we approximate the Bellman operator's fixed point using noisy evaluations of the expected long-term reward.
For example, if we want to estimate the expectation of a Gaussian distribution, we can sample from it and average the samples.
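As a tiny illustration of that stochastic-approximation idea (my own sketch), estimating a Gaussian mean from noisy samples uses the same incremental-update form that TD methods apply to long-term reward estimates: estimate <- estimate + step_size * (sample - estimate).

```python
import numpy as np

rng = np.random.default_rng(0)
estimate = 0.0
for n in range(1, 10_001):
    sample = rng.normal(loc=3.0, scale=1.0)   # one noisy evaluation of the unknown mean
    estimate += (sample - estimate) / n       # step size 1/n gives the running sample average
print(round(estimate, 3))                     # close to the true mean, 3.0
```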
Reinforcement learning is for problems where the AI agent has no information about the world it is operating in. So reinforcement learning algorithms not only give you a policy/optimal action at each state, but can also navigate a completely foreign environment (with no knowledge about which action will result in which state) and learn the parameters of this new environment; algorithms that learn those parameters are model-based reinforcement learning algorithms.
Q-learning and temporal-difference (TD) learning, on the other hand, are model-free reinforcement learning algorithms. That means the AI agent does the same things as in a model-based algorithm, but it does not have to learn the model (things like transition probabilities) of the world it is operating in. Through many iterations it comes up with a mapping from each state to the optimal action to perform in that state.
Now, coming to your question: you do not have to guess the rewards at different states. Initially, when the agent is new to the environment, it just chooses a random action to perform from the state it is in and gives it to the simulator. The simulator, based on the transition function, returns the resulting state for that state-action pair and also returns the reward for being in that state.
The simulator is analogous to nature in the real world. For example, you find something unfamiliar in the world and perform some action, like touching it; if the thing turns out to be a hot object, nature gives you a reward in the form of pain, so that the next time you know what happens when you try that action. While programming this, it is important to note that the inner workings of the simulator are not visible to the AI agent that is trying to learn the environment.
Now, depending on the reward the agent senses, it backs up its Q-value (in the case of Q-learning) or utility value (in the case of TD learning). Over many iterations these Q-values converge, and you are able to choose an optimal action for every state based on the Q-values of the state-action pairs (a minimal sketch of this backup is below).
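Here is a minimal sketch (my own, not the answerer's code; the simulator call in the final comment is hypothetical) of that backup for tabular Q-learning, where the environment supplies the next state and reward and the agent never has to guess the reward function:

```python
import random
from collections import defaultdict

Q = defaultdict(float)                     # (state, action) -> estimated long-term reward
actions = ["left", "right", "up", "down"]
alpha, gamma, epsilon = 0.1, 0.99, 0.1     # learning rate, discount, exploration rate

def choose_action(state):
    # Epsilon-greedy: usually exploit the current Q-values, sometimes explore randomly.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def backup(state, action, reward, next_state):
    # Q-learning backup: move Q(s, a) toward reward + gamma * max_a' Q(s', a').
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

# Per interaction (simulator.step is a hypothetical environment call):
#   a = choose_action(s); s_next, r = simulator.step(s, a); backup(s, a, r, s_next)
```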