In Reinforcement Learning, why should we select actions according to an ϵ-greedy approach rather than always selecting the optimal action?
We use an epsilon-greedy method for exploration during training. This means that when an action is selected during training, it is either the action with the highest Q-value or a random one, and the choice between the two is governed by a factor (epsilon).
Choosing between these two is random and based on the value of epsilon. Initially, lots of random actions are taken, which means we start by exploring the space; but as training progresses, more actions with the maximum Q-value are taken and we gradually pay less attention to actions with low Q-values.
During testing, we still use this epsilon-greedy method, but with epsilon at a very low value, so that there is a strong bias towards exploitation over exploration, favouring the action with the highest Q-value over a random action. However, random actions are still sometimes chosen.
All this is because we want to eliminate the negative effects of over-fitting or under-fitting.
Using an epsilon of 0 (always choosing the greedy action) is a fully exploitative choice. For example, consider a labyrinth game where the agent's current Q-estimates have converged to the optimal policy except for one grid cell, where the greedy choice is to move toward a boundary (which the current estimates say is optimal), leaving the agent in the same cell. If the agent reaches any such state and always picks the maximum-Q action, it will be stuck there. Keeping a small epsilon factor in its policy, however, allows it to get out of such states.
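For concreteness, here is a minimal sketch of epsilon-greedy action selection with a simple decay schedule; the dummy Q-values and the decay constants are illustrative assumptions, not anything prescribed by a particular algorithm.

```python
import numpy as np

rng = np.random.default_rng()

def epsilon_greedy_action(q_values, epsilon):
    """Explore with probability epsilon, otherwise exploit the highest Q-value."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore: uniform random action
    return int(np.argmax(q_values))               # exploit: greedy action

# Illustrative decay schedule: start exploratory, become greedier as training progresses.
epsilon, epsilon_min, decay = 1.0, 0.05, 0.995
for step in range(1000):
    action = epsilon_greedy_action(np.array([0.1, 0.5, 0.2]), epsilon)  # dummy Q-values
    epsilon = max(epsilon_min, epsilon * decay)
```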
There wouldn't be much learning happening if you already knew what the best action was, right? :)
Exploring with an ϵ-greedy policy while learning the value of that same ϵ-greedy policy is "on-policy" learning. You can also learn "off-policy" by selecting moves that are not aligned with the policy you are learning; an example is always exploring randomly (the same as ϵ = 1).
I know this can be confusing at first: how can you learn anything if you just move randomly? The key bit of knowledge here is that the policy you learn is not defined by how you explore, but by how you calculate the sum of future rewards (in the case of regular Q-learning it's the max(Q[next_state]) piece in the Q-value update).
This all works assuming you are exploring enough; if you don't try out new actions, the agent will never be able to figure out which ones are the best in the first place.
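A tabular sketch of that point, under the assumption that Q is a per-state list of action values: the action actually taken may come from any exploration scheme, but the update always bootstraps from the greedy value max(Q[next_state]), which is what makes regular Q-learning off-policy.

```python
# Sketch of a single tabular Q-learning update. The action taken may have been
# random (exploration), but the target uses the greedy value of the next state,
# so the policy being learned is the greedy one, not the exploring one.
def q_learning_update(Q, state, action, reward, next_state, done,
                      alpha=0.1, gamma=0.99):
    target = reward if done else reward + gamma * max(Q[next_state])
    Q[state][action] += alpha * (target - Q[state][action])
```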
Related
I am building a custom environment for a reinforcement learning (RL) project.
In examples such as Pong, Atari, and Super Mario, the action and observation spaces are really small.
But in my project, the action and observation spaces are much larger than in those examples.
The spaces will contain at least 5000+ actions and observations.
So how can I effectively handle this massive number of actions and observations?
Currently I am using Q-table learning, with a wrapper function to handle it.
But this seems to be very inefficient.
Yes, Q-table learning is quite old and requires an extremely large amount of memory, since it stores a Q-value for every entry in a table. In your case, Q-table learning does not seem good enough. A better choice would be a Deep Q-Network (DQN), which replaces the table with a network, though even that is not especially efficient here.
As for the huge observation space, it is fine. But the action space (5000+) seems too large; it will require a lot of time to converge. To reduce the time spent on training, I would recommend PPO.
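As a rough illustration of "replacing the table with a network", the sketch below maps an observation vector to one Q-value per action. The layer sizes and the assumption that observations can be encoded as fixed-length float vectors are mine, not something prescribed by DQN.

```python
import torch
import torch.nn as nn

# Toy Q-network: one forward pass produces a Q-value for every action,
# instead of looking each (state, action) pair up in a table.
class QNetwork(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, n_actions),   # e.g. n_actions = 5000+
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)
```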
I am working in a setting where an agent gets to make one decision, and one decision only, for each state in a batch of states; in other words, it is presented with multiple states, and no matter what decision is made for a state, the next state is always terminal. After making the decision, the agent doesn't get to find out the rewards for its actions for ~20-30 minutes, making the learning process a bit slow as well. The state space is countably infinite (although there are 16 discrete actions), so I have already ruled tabular methods out.
Here are some theoretical questions I'm grappling with:
Because there is never a non-terminal next state, this makes any method relying on TD (e.g. DQN) useless, correct? (See the one-step target written out after this list.)
Am I correct in classifying this as a model-based setting, where the full transition matrix is described by all states -> terminal state with probability 1?
Are there any algorithms that fit the single-time-step, non-tabular, off-line, model-based setting?
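Regarding the first question, it may help to write out what the usual one-step bootstrapped target reduces to when the next state is always terminal; the notation below is just the standard Q-learning target, nothing specific to this setting:

$$y = r + \gamma \max_{a'} Q(s', a') \quad\longrightarrow\quad y = r \quad \text{(since } Q(s', \cdot) = 0 \text{ when } s' \text{ is terminal)}$$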
I am having trouble understanding why a target network is necessary in DQN. I'm reading the paper "Human-level control through deep reinforcement learning".
I understand Q-learning. Q-learning is a value-based reinforcement learning algorithm that learns an "optimal" mapping from state-action pairs to values so as to maximize its long-term discounted reward over a sequence of timesteps.
Q-learning is updated using the Bellman equation, and a single step of the Q-learning update is given by
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ R_{t+1} + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$
where $\alpha$ is the learning rate and $\gamma$ is the discount factor.
I can understand that the reinforcement learning algorithm will become unstable and diverge.
The experience replay buffer is used so that we do not forget past experiences and to de-correlate the data used for learning.
This is where I fail.
Let me break the paragraph from the paper down here for discussion.
"The fact that small updates to $Q$ may significantly change the policy and therefore change the data distribution": I understood this part. Changes to the Q-network may lead to instability and to changes in the data distribution, for example if the updated policy suddenly always takes a left turn or something like this.
"and the correlations between the action-values (Q) and the target values $r + \gamma \max_{a'} Q(s', a')$": this says that the target is the reward plus gamma times my prediction of the return, given that I take what I think is the best action in the next state and follow my policy from then on.
We used an iterative update that adjusts the action-values (Q) towards target values that are only periodically updated, thereby reducing correlations with the target.
So, in summary, a target network is required because the network keeps changing at each timestep and the "target values" would otherwise also be updated at each timestep?
But I do not understand how it is going to solve this.
The difference between Q-learning and DQN is that you have replaced an exact value function with a function approximator. With Q-learning you are updating exactly one state/action value at each timestep, whereas with DQN you are updating many, which you understand. The problem this causes is that you can affect the action values for the very next state you will be in instead of guaranteeing them to be stable as they are in Q-learning.
This happens basically all the time with DQN when using a standard deep network (a bunch of fully connected layers of the same size). The effect you typically see is referred to as "catastrophic forgetting", and it can be quite spectacular. If you are doing something like moon lander with this sort of network (the simple one, not the pixel one) and track the rolling average score over the last 100 games or so, you will likely see a nice curve up in score, and then all of a sudden it completely craps out and starts making awful decisions again, even as your alpha gets small. This cycle will continue endlessly regardless of how long you let it run.
Using a stable target network as your error measure is one way of combating this effect. Conceptually it's like saying, "I have an idea of how to play this well, I'm going to try it out for a bit until I find something better", as opposed to saying, "I'm going to retrain myself on how to play this entire game after every move". By giving your network more time to consider many actions that have taken place recently instead of updating all the time, it hopefully finds a more robust model before you start using it to select actions.
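A rough sketch of where the stable target network enters the picture, assuming the transitions come in batches from a replay buffer; the tiny example network, the sync interval, and the tensor shapes are illustrative choices, not taken from the paper.

```python
import copy
import torch
import torch.nn as nn

# Illustrative online network (any Q-network would do here).
q_net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 4))

# The target network is a frozen copy of the online network; it computes the
# regression targets and is only refreshed occasionally, so the "goalposts"
# stay still between refreshes.
target_net = copy.deepcopy(q_net)

def td_targets(rewards, next_obs, dones, gamma=0.99):
    with torch.no_grad():                                  # no gradients through the target
        next_q = target_net(next_obs).max(dim=1).values    # greedy value of the next state
    return rewards + gamma * next_q * (1.0 - dones)

# Every so often (a tunable choice), copy the online weights into the target:
# target_net.load_state_dict(q_net.state_dict())
```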
On a side note, DQN is essentially obsolete at this point, but the themes from that paper were the fuse leading up to the RL explosion of the last few years.
I wonder if there is any research on RL problems with new actions. Think of a video game: as the game goes on, the agent learns more skills/maneuvers and thus has more available actions to choose from, so the action set expands over time. A related question:
State dependent action set in reinforcement learning
But there is no sufficient answer to that question either. Thanks!
All of the recent research and papers in deep reinforcement learning use environments with a small, static set of potential actions. However, there are a couple of ways which you could try to compensate for having a variable action space.
Let's say we have a game environment where the agent can perform different attacks. One of the attacks, the fireball, is only unlocked later in the game. Maybe you have to do something special to unlock this attack, but for the purposes of this argument, let's just assume your agent will unlock this ability at some point in the course of the game.
1. You could add the unlocked actions to the action space and assign a large negative reward if the agent tries to take an action that has not yet been unlocked. So if your agent tries to use the fireball and it has not been unlocked yet, it gets a negative reward. However, this makes it quite likely that the agent will "learn" to never use the fireball, even after it is unlocked.
2. You could also vary the action space by adding new actions as they become available. In this scenario, the agent would not have the fireball attack in its action space until it is unlocked. You would have to vary your epsilon (rate of random action) to do more exploration when new actions are added to the action space.
3. You could track the agent's available actions as part of the "state". If the agent has the ability to use a fireball in one part of the game, but not in another part, that could be considered a different state, which might inform the agent. The vector representing the state could have a binary value for each unlockable ability, and combined with the approach mentioned above in #1, your agent could learn to use unlocked abilities effectively (see the sketch after this list).
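Here is a hypothetical sketch of approach #3 (ability flags appended to the state vector) combined with a simple mask that restricts the greedy choice to currently unlocked actions, in the spirit of #2; the function names and array layout are illustrative assumptions rather than part of any library.

```python
import numpy as np

def augment_state(base_state: np.ndarray, unlocked: np.ndarray) -> np.ndarray:
    """Append one binary flag per unlockable ability to the state vector."""
    return np.concatenate([base_state, unlocked.astype(np.float32)])

def greedy_available_action(q_values: np.ndarray, unlocked: np.ndarray) -> int:
    """Pick the highest-Q action among those that are currently unlocked."""
    masked = np.where(unlocked.astype(bool), q_values, -np.inf)
    return int(np.argmax(masked))

# Example: three actions, fireball (index 2) still locked.
q_values = np.array([0.4, 0.1, 0.9])
unlocked = np.array([1, 1, 0])
print(greedy_available_action(q_values, unlocked))   # -> 0, the best unlocked action
```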
This research paper discusses reinforcement learning in continuous action spaces, which isn't quite the same thing but might give you some additional thoughts.
How do rewards work in these two RL techniques (Q-learning and TD learning)? I mean, they both improve the policy and the evaluation of it, but not the rewards.
Do I need to guess the rewards from the beginning?
You don't need to guess the rewards. The reward is feedback from the environment, and the rewards are parameters of the environment. The algorithm works under the condition that the agent can observe only this feedback, the state space, and the action space.
The key idea of Q-learning and TD is asynchronous stochastic approximation, where we approximate the Bellman operator's fixed point using noisy evaluations of the expected long-term reward.
For example, if we want to estimate the expectation of a Gaussian distribution, we can sample from it and average the samples.
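A tiny illustration of that sampling idea, written as an incremental update so that it has the same shape as the TD/Q-learning rule; the mean, standard deviation, and step-size schedule are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimate the mean of a Gaussian from noisy samples with the incremental rule
#   estimate <- estimate + alpha * (sample - estimate),
# which has the same form as the TD/Q-learning update.
estimate = 0.0
for n in range(1, 10_001):
    sample = rng.normal(loc=5.0, scale=2.0)   # one noisy observation
    alpha = 1.0 / n                           # decaying step size
    estimate += alpha * (sample - estimate)

print(round(estimate, 3))                     # close to the true mean 5.0
```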
Reinforcement Learning is for problems where the AI agent has no information about the world it is operating in. So Reinforcement Learning algorithms not only give you a policy/optimal action at each state, but also navigate a completely foreign environment (with no knowledge of which action will lead to which resulting state) and learn the parameters of this new environment. Algorithms that learn such a model are model-based Reinforcement Learning algorithms.
Q-learning and Temporal Difference learning, on the other hand, are model-free reinforcement learning algorithms. This means the AI agent does the same things as in a model-based algorithm, but it does not have to learn the model (things like transition probabilities) of the world it is operating in. Through many iterations it comes up with a mapping from each state to the optimal action to be performed in that state.
Now, coming to your question: you do not have to guess the rewards at different states. Initially, when the agent is new to the environment, it just chooses a random action to perform from the state it is in and gives it to the simulator. The simulator, based on the transition functions, returns the resulting state of that state-action pair and also returns the reward for being in that state.
The simulator is analogous to Nature in the real world. For example, when you find something unfamiliar in the world and perform some action, like touching it, and the thing turns out to be a hot object, Nature gives a reward in the form of pain, so that the next time you know what happens when you try that action. While programming this, it is important to note that the inner workings of the simulator are not visible to the AI agent that is trying to learn the environment.
Now, depending on the reward that the agent senses, it backs up its Q-value (in the case of Q-learning) or utility value (in the case of TD learning). Over many iterations these Q-values converge, and you are able to choose an optimal action for every state based on the Q-values of the state-action pairs.
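A minimal sketch of that interaction loop, assuming a hypothetical simulator object whose `reset` and `step` methods return the state, reward, and a done flag; the Q-table, epsilon, and hyperparameters are placeholders for illustration.

```python
import random
from collections import defaultdict

alpha, gamma, epsilon, n_actions = 0.1, 0.99, 0.1, 4
Q = defaultdict(lambda: [0.0] * n_actions)      # Q-values, all states start at zero

def run_episode(env):
    state = env.reset()                         # hypothetical simulator interface
    done = False
    while not done:
        if random.random() < epsilon:           # explore
            action = random.randrange(n_actions)
        else:                                   # exploit current Q-estimates
            action = max(range(n_actions), key=lambda a: Q[state][a])
        next_state, reward, done = env.step(action)   # feedback comes from the simulator
        target = reward if done else reward + gamma * max(Q[next_state])
        Q[state][action] += alpha * (target - Q[state][action])  # back up the Q-value
        state = next_state
```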