Reinforcement learning with new actions / an expanding action set

I wonder if there is any research on RL problems with new actions. Think of a video game: as the game progresses, the agent learns more skills/maneuvers and thus has more available actions to choose from, so the action set expands over time. A related question:
State dependent action set in reinforcement learning
But there is no sufficient answer to that question either. Thanks!

Most of the recent research and papers in deep reinforcement learning use environments with a small, static set of possible actions. However, there are a couple of ways you could try to compensate for having a variable action space.
Let's say we have a game environment where the agent can perform different attacks. One of the attacks, the fireball, is only unlocked later in the game. Maybe you have to do something special to unlock this attack, but for the purposes of this argument, let's just assume your agent will unlock this ability at some point in the course of the game.
1. You could add the unlocked actions to the action space and assign a large negative reward if the agent tries to take an action that has not yet been unlocked. So if your agent tries to use the fireball and it has not been unlocked yet, it gets a negative reward. However, this has a high likelihood of the agent "learning" to never use the fireball, even once it is unlocked.
2. You could also vary the action space by adding new actions as they become available. In this scenario, the agent would not have the fireball attack in its action space until it is unlocked. You would have to vary your epsilon (rate of random action) to do more exploration when new actions are added to the action space.
3. You could track the agent's available actions as part of the "state". If the agent has the ability to use a fireball in one part of the game but not in another, that could be considered a different state, which might inform the agent. The vector representing the state could have a binary value for each unlockable ability, and combined with the approach in #1 your agent could learn to use unlocked abilities effectively (see the sketch after this list).
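To make idea #3 concrete, here is a minimal Python sketch (hypothetical ability names and made-up numbers, not from the original answer) where the state carries one binary flag per unlockable ability, and locked actions are masked out of the greedy choice in the spirit of #2 rather than punished as in #1:

```python
import numpy as np

# Hypothetical action set: the first three are always available,
# "fireball" only becomes usable once it is unlocked.
ACTIONS = ["move_left", "move_right", "jump", "fireball"]
UNLOCKABLE = ["fireball"]

def build_state(base_features, unlocked):
    """Idea #3: append one binary flag per unlockable ability to the state."""
    flags = [1.0 if ability in unlocked else 0.0 for ability in UNLOCKABLE]
    return np.concatenate([base_features, np.array(flags)])

def greedy_action(q_values, unlocked):
    """In the spirit of #2: mask locked actions out of the greedy choice
    instead of punishing them (the pitfall noted in #1)."""
    available = [a not in UNLOCKABLE or a in unlocked for a in ACTIONS]
    masked = np.where(available, q_values, -np.inf)
    return int(np.argmax(masked))

# Example with made-up numbers, before the fireball is unlocked:
unlocked = set()
state = build_state(np.array([0.2, -1.3]), unlocked)
action = greedy_action(np.array([0.1, 0.5, 0.3, 9.9]), unlocked)
print(state, ACTIONS[action])   # fireball is ignored until it appears in `unlocked`
```

Masking keeps the network's output layer a fixed size, so nothing about the architecture has to change when a new ability unlocks.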
This research paper discusses reinforcement learning in continuous action spaces, which isn't quite the same thing but might give you some additional thoughts.


Why is a target network required?

I am having trouble understanding why a target network is necessary in DQN. I am reading the paper "Human-level control through deep reinforcement learning".
I understand Q-learning. Q-learning is a value-based reinforcement learning algorithm that learns "optimal" state-action values which maximize the long-term discounted reward over a sequence of timesteps.
Q-learning is updated using the Bellman equation; a single Q-learning update step is given by
$$Q(s, a) \leftarrow Q(s, a) + \alpha \big[ R_{t+1} + \gamma \max_{a'} Q(s', a') - Q(s, a) \big]$$
where $\alpha$ and $\gamma$ are the learning rate and the discount factor.
I can understand that the reinforcement learning algorithm can become unstable and diverge.
The experience replay buffer is used so that we do not forget past experiences and to de-correlate the samples used to learn the value function.
This is where I fail.
Let me break the paragraph from the paper down here for discussion
The fact that small updates to $Q$ may significantly change the policy and therefore change the data distribution — I understood this part. Periodic changes to the Q-network may lead to instability and to changes in the data distribution, for example if we suddenly always take a left turn or something like that.
and the correlations between the action-values $Q$ and the target values $r + \gamma \max_{a'} Q(s', a')$ — This says that the target is the reward plus $\gamma$ times my prediction of the return, given that I take what I think is the best action in the next state and follow my policy from then on.
We used an iterative update that adjusts the action-values (Q) towards target values that are only periodically updated, thereby reducing correlations with the target.
So, in summary, a target network is required because the network keeps changing at each timestep and the "target values" are being updated at each timestep?
But I do not understand how a target network is going to solve this.
The difference between Q-learning and DQN is that you have replaced an exact value function with a function approximator. With Q-learning you are updating exactly one state/action value at each timestep, whereas with DQN you are updating many, which you understand. The problem this causes is that you can affect the action values for the very next state you will be in instead of guaranteeing them to be stable as they are in Q-learning.
This happens basically all the time with DQN when using a standard deep network (a bunch of fully connected layers of the same size). The effect you typically see with this is referred to as "catastrophic forgetting", and it can be quite spectacular. If you are doing something like moon lander with this sort of network (the simple one, not the pixel one) and track the rolling average score over the last 100 games or so, you will likely see a nice curve up in score, then all of a sudden it completely craps out and starts making awful decisions again, even as your alpha gets small. This cycle will continue endlessly regardless of how long you let it run.
Using a stable target network as your error measure is one way of combating this effect. Conceptually it's like saying, "I have an idea of how to play this well, I'm going to try it out for a bit until I find something better", as opposed to saying "I'm going to retrain myself how to play this entire game after every move". By giving your network more time to learn from many recent transitions instead of chasing a target that changes after every update, it hopefully finds a more robust model before you start using it to select actions.
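To make this concrete, here is a minimal sketch of how the target network enters the update, assuming a PyTorch-style setup with hypothetical names (q_net, target_net, and a batch of transition tensors); it illustrates the idea rather than the paper's exact implementation:

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """One gradient step; `batch` is assumed to hold tensors of transitions."""
    states, actions, rewards, next_states, dones = batch

    # Q(s, a) from the online network, for the actions actually taken.
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Targets come from the *frozen* copy, so they only move when we
    # deliberately sync the networks, not after every gradient step.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        target = rewards + gamma * (1.0 - dones) * next_q

    loss = F.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Every N environment steps, "try out the new idea of how to play":
# target_net.load_state_dict(q_net.state_dict())
```

The commented-out sync line is the "try it out for a bit" part: the targets only move when you deliberately copy the online weights across.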
On a side note, DQN is essentially obsolete at this point, but the themes from that paper were the fuse leading up to the RL explosion of the last few years.

How to design the reward for an action which is the only legal action at some state

I am working on an RL project, but got stuck at one point: the task is continuing (non-episodic). Following a suggestion from Sutton's RL book, I am using a value-function approximation method with average reward (differential return instead of discounted return). For some states (represented by some features), only one action is legal. I am not sure how to design a reward for such an action. Is it OK to just assign the reward from the previous step? Or assign the average reward (the average of all rewards collected so far)? Could anyone tell me the best way to decide the reward for the only legal action? Thank you!
UPDATE:
To give more details, I added one simplified example:
Let me explain with a simplified example: the state space consists of a job queue of fixed size and a single server. The queue state is represented by the durations of the jobs, and the server state is represented by the time left to finish the currently running job. When the queue is not full and the server is idle, the agent can SCHEDULE a job to the server for execution and see a state transition (the next job is taken into the queue), or the agent can TAKE NEXT JOB into the queue. But when the job queue is full and the server is still running a job, the agent can do nothing except take a BLOCKING action and witness a state transition (the time left to finish the running job decreases by one unit of time). The BLOCKING action is the only action the agent can take in that state.
Designing the reward is part of the problem setup. Do you want to encourage the agent to get into states where the only action is BLOCKING? Or should it avoid such states?
There can be no correct answer without knowing your optimization goal. It doesn't have anything to do with how many legal actions the agent has, and it doesn't have anything to do with value functions either. The decision is equally important if you train your agent via random search or a GA directly in the policy space.
A different problem is how to deal with invalid actions during learning. If the "BLOCKING" action can only be taken in a state where there are no other decisions, then you could re-design the environment such that it automatically skips over those states. It would have to accumulate all the rewards for the "no decision" states and give them as a combined reward for the last real decision, and present the agent with the next real decision. If you are using discounted rewards you'd have to take the discounting factor into account too, in order to not modify the cost function that the agent is optimizing.
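As a rough sketch of that redesign (assuming a Gym-style step() method and a hypothetical legal_actions() helper; BLOCKING is the action from the question), the wrapper below auto-applies BLOCKING whenever it is the only legal move and credits the accumulated, discount-corrected reward to the last real decision:

```python
def step_skipping_forced_states(env, action, gamma=1.0):
    """Apply `action`, then auto-apply BLOCKING while it is the only legal move.

    Returns the next state in which the agent has a real choice, the accumulated
    (discount-corrected) reward, and the total discount to apply to whatever
    value estimate follows, so the optimized objective stays unchanged.
    """
    obs, reward, done, info = env.step(action)
    total_reward, discount = reward, 1.0

    while not done and env.legal_actions() == ["BLOCKING"]:
        obs, reward, done, info = env.step("BLOCKING")
        discount *= gamma                 # no-op when gamma = 1.0
        total_reward += discount * reward

    return obs, total_reward, done, info, discount * gamma
```

In an average-reward setting like the one described, there is no discounting, so gamma is simply 1 and only the reward accumulation matters.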
Another way to deal with invalid actions is to make the agent learn to avoid them. You see this in most gridworld examples: when the agent tries to move into a wall, it just doesn't happen. Some default action happens instead. The reward function is then structured such that it will always yield a worse return (e.g. more steps or negative reward). The only disadvantage is that this requires extra exploration. The function approximator faces a more difficult task; it needs enough capacity and more data to recognize that in some states, some actions have a different effect.

Reinforcement Learning, ϵ-greedy approach vs optimal action

In Reinforcement Learning, why should we select actions according to an ϵ-greedy approach rather than always selecting the optimal action?
We use an epsilon-greedy method for exploration during training. This means that when an action is selected during training, it is either the action with the highest Q-value or a random action, chosen with some probability (epsilon).
Choosing between these two is random and based on the value of epsilon. Initially, lots of random actions are taken, which means we start by exploring the space, but as training progresses more actions with the maximum Q-value are taken and we gradually pay less attention to actions with low Q-values.
During testing, we use this epsilon-greedy method, but with epsilon at a very low value, such that there is a strong bias towards exploitation over exploration, favoring choosing the action with the highest q-value over a random action. However, random actions are still sometimes chosen.
All this is done to balance exploitation with exploration, so the agent neither over-commits to its current Q-estimates nor keeps acting essentially at random.
Using an epsilon of 0 (always choosing the greedy action) is a fully exploitative choice. For example, consider a labyrinth game where the agent's current Q-estimates have converged to the optimal policy everywhere except in one cell, where the greedy action is to move toward a boundary, which results in the agent remaining in the same cell. If the agent reaches any such state and always chooses the maximum-Q action, it will be stuck there. Keeping a small epsilon factor in its policy allows it to get out of such states.
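To make the selection rule concrete, here is a minimal sketch (hypothetical Q-values and decay schedule, not part of the original answer) of ϵ-greedy selection with a decaying ϵ:

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Hypothetical decay schedule: start almost fully random, end mostly greedy.
epsilon, epsilon_min, decay = 1.0, 0.05, 0.995
for episode in range(1000):
    # ... run the episode, picking actions via epsilon_greedy(Q[state], epsilon) ...
    epsilon = max(epsilon_min, epsilon * decay)
```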
There wouldn't be much learning happening if you already knew what the best action was, right ? :)
With ϵ-greedy "on-policy" learning, you are learning the optimal ϵ-greedy policy while exploring with an ϵ-greedy policy. You can also learn "off-policy" by selecting moves that are not aligned with the policy that you are learning; an example is always exploring randomly (the same as ϵ=1).
I know this can be confusing at first, how can you learn anything if you just move randomly? The key bit of knowledge here is that the policy that you learn is not defined by how you explore, but by how you calculate the sum of future rewards (in the case of regular Q-Learning it's the max(Q[next_state]) piece in the Q-Value update).
This all works assuming you are exploring enough; if you don't try out new actions, the agent will never be able to figure out which ones are the best in the first place.
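A tiny tabular sketch of that point (with a hypothetical env.step() that returns state, reward, and done): the behaviour here is completely random (ϵ = 1), yet the update target still uses the max over the next state's Q-values, so it is the greedy policy that is being learned:

```python
import random
from collections import defaultdict

Q = defaultdict(float)   # Q[(state, action)] -> value estimate

def q_learning_step(env, state, alpha=0.1, gamma=0.99, n_actions=4):
    """Behave randomly (epsilon = 1) but learn about the greedy policy."""
    action = random.randrange(n_actions)            # exploration policy
    next_state, reward, done = env.step(action)     # hypothetical env API

    # The target uses max over next actions, regardless of how we explored.
    best_next = 0.0 if done else max(Q[(next_state, a)] for a in range(n_actions))
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
    return next_state, done
```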

Why should continuous actions be clamped?

In Deep Reinforcement Learning, using continuous action spaces, why does it seem to be common practice to clamp the action right before the agent's execution?
Examples:
OpenAI Gym Mountain Car
https://github.com/openai/gym/blob/master/gym/envs/classic_control/continuous_mountain_car.py#L57
Unity 3DBall
https://github.com/Unity-Technologies/ml-agents/blob/master/unity-environment/Assets/ML-Agents/Examples/3DBall/Scripts/Ball3DAgent.cs#L29
Isn't information lost by doing so? For example, if the model outputs +10 for velocity, which is then clamped to +1, the executed action behaves almost discretely. For fine-grained movement, wouldn't it make more sense to multiply the output by something like 0.1?
This is probably simply done to enforce constraints on what the agent can do. Maybe the agent would like to put out an action that increases velocity by 1,000,000. But if the agent is a self-driving car with a weak engine that can at most accelerate by 1 unit, we don't care if the agent would hypothetically like to accelerate by more units. The car's engine has limited capabilities.
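As a small illustration (made-up numbers, using NumPy): clamping simply saturates out-of-range outputs at the environment's limits, while a smooth squashing such as tanh is an alternative some implementations use when graded behaviour near the boundary matters:

```python
import numpy as np

raw_action = np.array([10.0])          # unbounded network output (hypothetical)
action_low, action_high = -1.0, 1.0    # limits imposed by the environment

# Hard clamp, as in the linked environments: anything outside the range
# saturates at the boundary.
clamped = np.clip(raw_action, action_low, action_high)

# Alternative sometimes used instead: squash the output smoothly with tanh,
# so large raw values still map into the valid range without a hard cutoff.
squashed = action_high * np.tanh(raw_action)

print(clamped, squashed)   # [1.] [0.99999...]
```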

Rewards in Q-Learning and in TD(lambda)

How do rewards in those two RL techniques work? I mean, they both improve the policy and its evaluation, but not the rewards.
Do I need to guess them from the beginning?
You don't need to guess the rewards. The reward is feedback from the environment, and the rewards are parameters of the environment. The algorithm works under the condition that the agent can observe only this feedback, the state space, and the action space.
The key idea of Q-learning and TD is asynchronous stochastic approximation, where we approximate the Bellman operator's fixed point using noisy evaluations of the expected long-term reward.
For example, if we want to estimate the expectation of a Gaussian distribution, we can draw samples from it and average them.
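As a toy illustration of that idea (made-up numbers), a running sample average has exactly the same "nudge the estimate toward a noisy target" form as the Q-learning update:

```python
import random

# Estimate the mean of a Gaussian purely from noisy samples, using the same
# incremental form as a TD/Q-learning update:
#   estimate <- estimate + alpha * (noisy_target - estimate)
estimate = 0.0
for n in range(1, 10001):
    sample = random.gauss(5.0, 2.0)   # noisy evaluation of the quantity we want
    alpha = 1.0 / n                   # step size; a small constant also works
    estimate += alpha * (sample - estimate)

print(estimate)   # close to the true mean of 5.0
```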
Reinforcement learning is for problems where the AI agent has no prior information about the world it is operating in. So reinforcement learning algorithms not only give you a policy/optimal action at each state, but also navigate a completely unknown environment (with no knowledge of which action leads to which resulting state) and learn the parameters of this new environment. Algorithms that explicitly learn those parameters are model-based reinforcement learning algorithms.
Q-learning and temporal-difference learning, on the other hand, are model-free reinforcement learning algorithms. That means the agent does the same things as in the model-based case, but it does not have to learn a model (things like transition probabilities) of the world it is operating in. Through many iterations it comes up with a mapping from each state to the optimal action to be performed in that state.
Now, coming to your question: you do not have to guess the rewards at different states. Initially, when the agent is new to the environment, it just chooses a random action to perform from the state it is in and hands it to the simulator. The simulator, based on its transition function, returns the resulting state of that state-action pair and also the reward for being in that state.
The simulator is analogous to nature in the real world. For example, you find something unfamiliar in the world and perform some action, like touching it; if the thing turns out to be a hot object, nature gives you a reward in the form of pain, so that the next time you know what happens when you try that action. While programming this, it is important to note that the inner workings of the simulator are not visible to the AI agent that is trying to learn the environment.
Now, depending on the reward that the agent senses, it backs up its Q-value (in the case of Q-learning) or utility value (in the case of TD learning). Over many iterations these values converge, and you are able to choose an optimal action for every state based on the Q-values of the state-action pairs.
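Putting that loop into a minimal sketch (with a hypothetical simulator object whose step() returns the next state, the reward, and a done flag; its transition function stays hidden from the agent):

```python
import random
from collections import defaultdict

def train(simulator, n_actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """The agent only ever sees (next_state, reward, done); the simulator's
    transition function is never exposed to it."""
    Q = defaultdict(float)
    for _ in range(episodes):
        state, done = simulator.reset(), False
        while not done:
            if random.random() < epsilon:                       # sometimes probe the unknown
                action = random.randrange(n_actions)
            else:                                               # otherwise act greedily
                action = max(range(n_actions), key=lambda a: Q[(state, a)])

            next_state, reward, done = simulator.step(action)   # "nature" answers

            # Back up the Q-value from the observed reward and next state.
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in range(n_actions))
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```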