Plotting reward curve in reinforcement learning

I'm wondering how to plot reward curves in reinforcement learning.
In particular, my simulated environment has significant randomness, so the raw reward data shows many zig-zag patterns even though the output policy has converged.
Is there a good way to plot the curve in this case?

I am afraid I don't get your problem. Why not just plot the reward you receive at each episode? If the policy converges, after a while you should see an increase in the reward, even though there might be those zig-zags at the start.
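If the zig-zag pattern makes the raw curve hard to read, one common option (an assumption on my part, not something stated above) is to overlay a moving average on top of the raw per-episode rewards, for example:

```python
import numpy as np
import matplotlib.pyplot as plt

def moving_average(values, window=50):
    """Sliding-window mean, used only to smooth the plotted curve."""
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")

window = 50
episode_rewards = np.random.randn(1000).cumsum()   # placeholder data standing in for your real reward log

smoothed = moving_average(episode_rewards, window)
plt.plot(episode_rewards, alpha=0.3, label="raw per-episode reward")
plt.plot(np.arange(window - 1, window - 1 + len(smoothed)), smoothed,
         label=f"{window}-episode moving average")
plt.xlabel("episode")
plt.ylabel("reward")
plt.legend()
plt.show()
```

The raw curve keeps the information about variance, while the smoothed curve makes the overall trend (and convergence) visible.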

Related

Convergence time of Q-learning Vs Deep Q-learning

I want to know about the convergence time of Deep Q-learning vs. Q-learning when run on the same problem. Can anyone give me an idea of the pattern between them? It would be better if it were explained with a graph.
In short, the more complicated the state space is, the more DQN outperforms Q-learning (by complicated, I mean the number of possible states). When the state space is too large, Q-learning becomes nearly impossible to converge due to time and hardware limitations.
Note that DQN is in fact a kind of Q-learning: it uses a neural network to act like a Q-table, and both the Q-network and the Q-table output a Q-value with the state as input. I will continue using Q-learning to refer to the simple version with a Q-table, and DQN for the neural-network version.
You can't tell the convergence time without specifying a concrete problem, because it really depends on what you are doing:
For example, if you are working with a simple environment like FrozenLake: https://gym.openai.com/envs/FrozenLake-v0/
Q-learning will converge faster than DQN as long as you have a reasonable reward function.
This is because FrozenLake has only 16 states, and the Q-learning algorithm is very simple and efficient, so it runs a lot faster than training a neural network.
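As a concrete illustration of how little machinery the tabular version needs, here is a minimal Q-learning sketch on a made-up 16-state chain (it only stands in for FrozenLake; it is not the gym environment, and the hyperparameters are arbitrary):

```python
import numpy as np

# Minimal tabular Q-learning sketch on a toy 16-state chain.
# This is NOT the gym FrozenLake env; it only illustrates the Q-table update.
n_states, n_actions = 16, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.99, 0.1
rng = np.random.default_rng(0)

def step(state, action):
    """Toy dynamics: action 1 moves right, action 0 moves left; the goal is the last state."""
    next_state = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == n_states - 1 else 0.0
    return next_state, reward, next_state == n_states - 1

for episode in range(500):
    state = 0
    for t in range(500):                       # cap episode length
        if rng.random() < epsilon:             # epsilon-greedy exploration
            action = rng.integers(n_actions)
        else:                                  # greedy action with random tie-breaking
            best = np.flatnonzero(Q[state] == Q[state].max())
            action = rng.choice(best)
        next_state, reward, done = step(state, action)
        # The Q-learning update: exactly one table cell changes per transition.
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        state = next_state
        if done:
            break

print(np.round(Q.max(axis=1), 2))  # learned state values along the chain
```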
However, if you are doing something like Atari: https://gym.openai.com/envs/Assault-v0/
there are millions of states (note that even a single-pixel difference counts as a totally new state). Q-learning requires enumerating all states in the Q-table to actually converge, so it would need a very large amount of memory plus a very long training time to enumerate and explore every possible state. In fact, I am not sure it would ever converge in more complicated games, simply because there are so many states.
This is where DQN becomes useful. A neural network can generalize over states and learn a function from state to action (or more precisely, from state to Q-value). It no longer needs to enumerate states; instead it learns the information implied by them. Even if a certain state has never been visited during training, as long as the network has learned the relationship on other, similar states, it can still generalize and output a Q-value. That makes it much easier to converge.
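To make the contrast concrete, here is a minimal sketch of such a Q-network (PyTorch is used only because it appears elsewhere on this page; the layer sizes and dimensions are arbitrary assumptions). It maps a state vector to one Q-value per action instead of looking values up in a table:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per action (the table is replaced by a function)."""
    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

# Even a state never seen during training still gets a Q-value estimate:
q_net = QNetwork(state_dim=8, n_actions=4)
unseen_state = torch.randn(1, 8)
print(q_net(unseen_state))  # tensor of shape (1, 4), one value per action
```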

Deep Q-learning: How to visualize convergence?

I have trained an RL agent in an environment similar to Puckworld. There's no puck, though! The agent is in continuous space and wants to reach a fixed target. Each episode the agent is born at a random location, and noise is added to each action to make learning less trivial.
The reward is given every step as a scaled version of the distance to the target.
I want to plot the convergence of the neural network. For the same problem in discrete space using Q-learning, I would plot the sum of all elements in the Q-matrix vs. episode number, which gave me a good understanding of the network's performance. How can I do the same for a neural network?
Plotting the reward collected in an episode vs. episode number is not ideal here.
I use PyTorch. Any help is appreciated.
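One possible analogue of the Q-matrix sum (this is my own assumption, not something stated in the question) is to evaluate the network on a fixed batch of probe states after every episode and track the mean predicted Q-value. A minimal PyTorch sketch, assuming 4-dimensional states and a `q_net` module that maps states to per-action Q-values:

```python
import torch

# Fixed probe states, sampled once before training (4-dimensional states assumed here).
probe_states = torch.rand(256, 4) * 2 - 1
mean_q_per_episode = []

def log_convergence(q_net):
    """Call once per episode: record the mean Q-value the network assigns to the probe states."""
    with torch.no_grad():
        q_values = q_net(probe_states)      # shape (256, n_actions)
    mean_q_per_episode.append(q_values.mean().item())

# After training, plot mean_q_per_episode against episode number; a flattening
# curve plays the same role as the Q-matrix sum did in the tabular case.
```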

Inverted Pendulum: model-based or model-free?

This is my first post here, and I came here to discuss or get clarification on something that I have trouble understanding, namely model-free vs. model-based RL methods. I am currently implementing Q-learning, but am not certain I am doing it correctly.
Example: Say I am applying Q-learning to an inverted pendulum, where the reward is given as the absolute distance between the pendulum and the upward position, and the terminal state (or goal state) is defined as the pendulum being very close to the upward position.
Would this setup mean that I have a model-free or model-based setup? From how I have understood it, this would be model-based, as I have a model of the environment that gives me the reward (R = abs(pos - wantedPos)). But then I saw an implementation of this using Q-learning (https://medium.com/#tuzzer/cart-pole-balancing-with-q-learning-b54c6068d947), which is a model-free algorithm. Now I am clueless...
Thankful for all responses.
Vanilla Q-learning is model-free.
The idea behind reinforcement learning is that an agent is trained to learn an optimal policy from pairs of states and rewards; this is in contrast to trying to model the environment.
If you took a model-based approach, you would try to model the environment and ultimately perform value iteration or policy iteration on the Markov decision process.
In reinforcement learning, it is assumed you do not know the MDP's dynamics, and thus must try to find an optimal policy based on the various rewards you receive from your experiences.
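To make the distinction concrete, here is a minimal sketch contrasting the two (the 3-state MDP and all numbers are made-up assumptions, not part of the original question): value iteration needs the full transition and reward model, while the Q-learning update only needs one sampled transition.

```python
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9

# Model-based: value iteration needs the full MDP, i.e. P[s, a, s'] and R[s, a].
P = np.random.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # transition model
R = np.random.rand(n_states, n_actions)                                  # reward model
V = np.zeros(n_states)
for _ in range(100):
    V = np.max(R + gamma * P @ V, axis=1)    # Bellman optimality backup over ALL states

# Model-free: Q-learning touches only what the agent actually experienced,
# a single sampled transition (s, a, r, s'); no P or R is required.
Q = np.zeros((n_states, n_actions))
alpha = 0.1
s, a, r, s_next = 0, 1, 0.5, 2               # one experienced transition (example values)
Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
```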
For a longer explanation, check out this post.

Why does the neural network tend to output the 'mean value'?

I am using keras to build a simple neural network for a regression task.
But the output always tends toward the 'mean value' of the ground-truth y data.
See the first figure: blue is the ground truth, red is the predicted value (very close to the constant mean of the ground truth).
The model also stops learning very early, even though I set the number of epochs to 100.
Does anyone have ideas about the conditions under which a neural network stops learning early, and why the regression output tends toward 'the mean' of the ground truth?
Thanks!
Possibly because the data are unpredictable? Do you know for certain that the data set has some kind of Nth-order predictability?
Just eyeballing your data set: it lacks periodicity, lacks homoscedasticity, and lacks any slope, skew, trend, or pattern... I can't really tell whether there is anything wrong with your network. In the absence of any pattern, the mean is always the best prediction, and it is entirely possible (although not certain) that the neural net is doing its job.
I suggest you find an easier data set, and see if you can tackle that first.
The model is not learning from the data. Think of a basic linear regression: the 'null' prediction, i.e. the prediction if you didn't have any predictors at all, is just the expected value, i.e. the mean. This could be caused by many different issues, but initialization comes to mind; bad initialization leads to no learning. This blog post has good practical advice that may help.
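The 'null prediction' point can be checked numerically: under squared error, the best constant prediction is the mean of y. A tiny sketch with made-up data:

```python
import numpy as np

y = np.random.randn(1000) * 3 + 7                      # made-up ground truth

candidates = np.linspace(y.min(), y.max(), 501)
mse = [np.mean((y - c) ** 2) for c in candidates]      # MSE of each constant prediction
best_constant = candidates[int(np.argmin(mse))]

print(best_constant, y.mean())  # the MSE-minimizing constant sits (approximately) at the mean
```

So a network that predicts a near-constant value close to the mean is behaving exactly like the null model: it has found the loss-minimizing answer that ignores the inputs.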

What are the action and reward in a neural network that learns weights by reinforcement learning?

My goal is to predict customer churn. I want to use reinforcement learning to train a recurrent neural network which predicts a target response for its input.
I understand that the state is represented by the input to the network at each time step, but I don't understand how the action is represented. Is it the values of the weights, which the neural network should choose by some formula?
Also, how should we create a reward or punishment to teach the neural network its weights when we don't know the target response for each input?
The aim of reinforcement learning is typically to maximize long term reward for an agent playing a game of sorts (a Markov Decision Process). In typical reinforcement learning usage, neural networks are used to approximate the Q-function. So, the network's input is the state and action (or a feature representations thereof), and the output is the value of taking that action in that state. Reinforcement learning algorithms like Q-learning provide the details on how to choose actions at a given time step, and also dictate how updates to the value function should be done.
It isn't clear how your specific goal of building a customer churn model might be formulated as a Markov Decision Problem. You could define your states to be statistics about customers' interactions with the company website, but it isn't clear what the actions might be, because it isn't clear what the agent is and what it can do. This is also why you are finding it difficult to define a reward function. The reward function should tell the agent if it's doing a good job. So, if we're imagining an MDP where the agent is trying to minimize customer churn, we might provide a negative reward proportional to the number of customers that turn over.
I don't think you want to learn a Q-function. I think it's more likely that you are interested simply in supervised learning, where you have some sample data and you want to learn a function that will tell you how much churn there will be. For this, you should be looking towards gradient descent methods and forward/backward propagation for training your neural network.
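If you do go the supervised route, a minimal sketch might look like the following (the feature matrix, labels, and layer sizes are all hypothetical placeholders; PyTorch is used only because it appears elsewhere on this page):

```python
import torch
import torch.nn as nn

# Hypothetical data: 10 customer features per row; churned = 1, stayed = 0.
X = torch.randn(500, 10)
y = torch.randint(0, 2, (500, 1)).float()

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
loss_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)   # forward pass
    loss.backward()               # backpropagation
    optimizer.step()              # gradient descent update

churn_probability = torch.sigmoid(model(X[:1]))  # predicted churn probability for one customer
```

Here the "target response" is simply the observed churn label for each customer, so no reward function is needed at all.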