Reinforcement learning: deterministic policies worse than non-deterministic policies

We have a custom reinforcement learning environment in which we run a PPO agent from Stable Baselines3 on a multi-action selection problem. The agent learns as expected, but when we evaluate the learned policy from trained agents, the agents achieve worse results (around 50% lower rewards) with deterministic=True than with deterministic=False. The goal of the study is to find new policies for a real-world problem, so it would be desirable to find a deterministic policy, as this is much easier for most people to understand... And it seems counterintuitive that more random actions result in better performance.
The documentation says only "deterministic (bool) – Whether or not to return deterministic actions.".
I understand this as: deterministic=False means that the actions are drawn from a learned distribution with a certain stochasticity (i.e. one specific state can result in several different actions), and deterministic=True means that the most likely action under the learned distribution is always chosen (i.e. one specific state always results in one specific action).
The question is: what does it say about the agent and/or the environment when performance is better with deterministic=False than with deterministic=True?
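For context, here is a minimal sketch of how such a comparison can be run with Stable Baselines3's evaluate_policy helper; the CartPole environment, timestep budget, and episode count are placeholders and not taken from the question:

```python
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

# Placeholder environment; the question uses a custom multi-action environment.
env = gym.make("CartPole-v1")
model = PPO("MlpPolicy", env, verbose=0).learn(total_timesteps=10_000)

# Evaluate the same trained policy with and without deterministic actions.
for det in (True, False):
    mean_r, std_r = evaluate_policy(model, env, n_eval_episodes=50, deterministic=det)
    print(f"deterministic={det}: mean reward {mean_r:.1f} +/- {std_r:.1f}")
```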

You need to be very careful before making stochastic agents deterministic. This is because they can become unable to achieve certain goals. Consider the following over-simplified example with 8 states:
| | # | | # | |
| X |---| G |---| X |
'G' is the goal, 'X' is a pit, '-' is a wall. The two '#' states look identical to the agent, so a deterministic policy has to pick the same action in both, and no single choice works. For instance, if the policy at '#' is left, then from the two states in the top left the agent will never reach the goal. The strength of stochastic policies is that they can prevent this kind of issue and let the agent find a way to the goal.
Additionally, the stochasticity of the actions should decrease over training to reflect growing certainty that a particular action is correct, though of course there may be states (such as '#' above) where significant uncertainty remains.
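To make the aliasing issue concrete, here is a small, self-contained simulation of a corridor version of that grid; this is my own toy construction, not code from the original question or answer. The two '#' cells produce the same observation, so a deterministic policy must use one fixed action for both, while a 50/50 stochastic choice still reaches the goal from every start:

```python
import random

# Cells 0..4 in the top row; below cell 2 is the goal, below 0 and 4 are pits,
# below 1 and 3 are walls. Cells 1 and 3 are the aliased '#' states.
def run_episode(aliased_action, max_steps=100):
    pos = random.choice([0, 4])              # start in one of the top corners
    for _ in range(max_steps):
        if pos in (1, 3):                    # '#': both cells look identical
            a = aliased_action()
        elif pos == 2:
            a = "down"                       # unambiguous: go to the goal
        else:
            a = "right" if pos == 0 else "left"  # corners: head inwards
        if a == "down":
            return 1 if pos == 2 else -1     # goal (+1) or pit (-1)
        pos = max(0, min(4, pos + (1 if a == "right" else -1)))
    return 0                                 # never reached the goal

deterministic = lambda: "left"                          # one fixed choice at '#'
stochastic = lambda: random.choice(["left", "right"])   # 50/50 at '#'

for name, policy in [("deterministic", deterministic), ("stochastic", stochastic)]:
    returns = [run_episode(policy) for _ in range(1000)]
    print(name, sum(returns) / len(returns))
```

With the deterministic "left" rule, episodes that start in the left corner never reach the goal, so the average return is roughly half of what the stochastic policy achieves.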

Related

Simulating a matrix of variables with predefined correlation structure

For a simulation study I am working on, we are trying to test an algorithm that aims to identify specific culprit factors that predict a binary outcome of interest from a large mixture of possible exposures that are mostly unrelated to the outcome. To test this algorithm, I am trying to simulate the following data:
A binary dependent variable
A set of, say, 1000 variables, most binary and some continuous, that are not associated with the outcome (that is, are completely independent from the binary dependent variable, but that can still be correlated with one another).
A group of 10 or so binary variables which will be associated with the dependent variable. I will determine a priori the magnitude of the correlation with the binary dependent variable, as well as their frequency in the data.
Generating a random set of binary variables is easy. But is there a way of doing this while ensuring that none of these variables are correlated with the dependent outcome?
Thank you!
"But is there a way of doing this while ensuring that none of these variables are correlated with the dependent outcome?"
With statistical sampling you can't ensure anything, you can only adjust the acceptable risk. Finding an acceptable level of risk may be harder than many people think.
Spurious correlations are a very real phenomenon. Genuinely independent variables will often show correlations in a finite sample, and if you want to actually test your algorithm to see how it will perform in reality, then your tests should reproduce this phenomenon in a manner similar to the real world: generate independent candidate factors and allow spurious correlations to occur.
If you are performing ~1000 independent tests of candidate factors and you are targeting a per-test risk level of α = 0.05, you can expect around 50 truly unrelated factors to appear significant purely by chance. To control this, you need to adjust your testing threshold using something along the lines of a Bonferroni correction. Recall that statistical discriminating power depends on the standard error, which is inversely proportional to the square root of the sample size. Bonferroni says that 1000 simultaneous tests need their individual significance threshold tightened by a factor of 1000; keeping the same power at that stricter threshold pushes up the required critical value and therefore demands a substantially larger sample size than a single test for significance.
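A quick simulation (my own illustration, not part of the original answer) shows that expectation: with 1000 truly independent binary predictors, a naive per-test α of 0.05 produces on the order of 50 spurious hits, while the Bonferroni-adjusted threshold removes almost all of them:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, k, alpha = 2000, 1000, 0.05

# Binary outcome and 1000 binary predictors, all generated independently.
y = rng.binomial(1, 0.5, size=n)
X = rng.binomial(1, 0.3, size=(n, k))

# Test each predictor against the outcome with a simple correlation test.
pvals = np.empty(k)
for j in range(k):
    r, p = stats.pearsonr(X[:, j], y)
    pvals[j] = p

print("naive alpha = 0.05:   ", int((pvals < alpha).sum()), "spurious hits")
print("Bonferroni alpha / k: ", int((pvals < alpha / k).sum()), "spurious hits")
```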
So in summary I'd say that you shouldn't attempt to ensure lack of correlation, it's going to occur in the real world. You can mitigate the risk of non-predictive factors being included due to spurious correlation by generating massive amounts of data. In practice there will be non-predictors that leak through unless you can obtain enough data, so I'd suggest that your testing should address the rates of occurrence as a function of number of candidate factors and the sample size.

Is the reward related to previous state or next state?

In the reinforcement learning framework, I am a little bit confused about the reward and how it is related to states. For example, in Q-learning we have the following formula for updating the Q table:

Q(s_t, a_t) <- Q(s_t, a_t) + alpha * (r_{t+1} + gamma * max_a Q(s_{t+1}, a) - Q(s_t, a_t))

That means that the reward is obtained from the environment at time t+1: after applying the action a_t, the environment returns s_{t+1} and r_{t+1}.
However, the reward is often associated with the previous time step instead, i.e. written as r_t in the formula above. See, for example, the English Wikipedia page for Q-learning (https://en.wikipedia.org/wiki/Q-learning). Why is this?
Incidentally, some Wikipedia pages on the same topic in other languages use r_{t+1} (or, unexpectedly, R_{t+1}). See, for example, the Italian and Japanese pages:
https://it.wikipedia.org/wiki/Q-learning
https://ja.wikipedia.org/wiki/Q%E5%AD%A6%E7%BF%92
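To make the indexing convention concrete, here is a minimal tabular Q-learning loop; it assumes a Gymnasium environment (FrozenLake-v1 is just a stand-in) and is my own sketch rather than part of the original question. The reward returned by step(a_t) is the r_{t+1} in the formula above: it arrives together with s_{t+1}.

```python
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1")
n_states, n_actions = env.observation_space.n, env.action_space.n
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.99, 0.1

s_t, _ = env.reset()
for _ in range(10_000):
    # epsilon-greedy choice of a_t in state s_t
    a_t = env.action_space.sample() if np.random.rand() < eps else int(Q[s_t].argmax())
    # the reward returned here is r_{t+1}: it comes with the transition into s_{t+1}
    s_t1, r_t1, terminated, truncated, _ = env.step(a_t)
    td_target = r_t1 + gamma * np.max(Q[s_t1]) * (not terminated)
    Q[s_t, a_t] += alpha * (td_target - Q[s_t, a_t])
    s_t = s_t1 if not (terminated or truncated) else env.reset()[0]
```

Whether one writes this reward as r_t or r_{t+1} is purely a notational convention; the quantity itself is the same.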

Difference between optimisation algorithms and reinforcement learning methods

I have a sense that a one-step reinforcement learning task is essentially the same as some optimisation algorithms.
For example, suppose there is only one parameter α and we try to optimise y using gradient descent. Then in each iteration (or step), α moves slightly in the direction given by δy. The step looks exactly the same in reinforcement learning, where δy is called the temporal difference and y is the value of that state.
So I wonder: for one-step reinforcement learning problems, is it actually an optimisation method, or can it be used to optimise parameters (based on the context above)?
I might have some misunderstanding here; corrections are welcome.
First of all, reinforcement learning is very general. Almost any optimization problem can be transformed into an RL problem. It's usually not worth it, because an RL agent would select sub-optimal actions, doing trial and error just to confirm things you already know by design.
To your question: I think the similarity you found is that both algorithms make use of a (noisy) gradient step. Temporal difference is just one RL method of many. If I remember correctly, it computes the difference between the predicted value and the (noisy) value estimate formed from the observed reward. It cannot simply set the correct value, because in general there is a complicated dependency between the values of different states, so instead it just takes a small step to reduce the difference.
Sure, you could set up an RL task somehow to optimize reward = y(α). Now α can either be the agent's "state", in which case you need actions that decrement or increment it (you learn state-values), or α can be the action, in which case there is only a single state (you learn action-values). With the right exploration strategy it might even work if you are patient. But in both cases you waste your knowledge about the gradient δy(α)/δα, because the RL algorithm does not know about it. Yes, it takes gradient steps, but those gradients reduce the difference between the learned value and the actual value. If the true values are exactly the rewards (which is true if the episode ends after one step and there is no randomness when you evaluate y(α)), then this is wasted effort: instead of taking a small step to smooth out the non-existent influence on other states, you could have just set it to the true value directly.
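To illustrate the "wasted gradient" point on a toy objective y(α) = -(α - 3)² (my own sketch, not from the original answer): a plain gradient step uses dy/dα directly, while a gradient-free, single-state trial-and-error scheme only samples y at perturbed values of α.

```python
import random

def y(a):
    return -(a - 3.0) ** 2   # toy objective with its maximum at a = 3

# 1) Gradient ascent (maximising y): uses the known derivative dy/da = -2 * (a - 3).
a = 0.0
for _ in range(100):
    a += 0.1 * (-2.0 * (a - 3.0))
print("gradient ascent:", round(a, 3))

# 2) Gradient-free trial and error (closer in spirit to a one-state bandit):
a = 0.0
for _ in range(1000):
    cand = a + random.gauss(0.0, 0.1)   # explore a nearby value of a
    if y(cand) > y(a):                  # keep it only if the "reward" improved
        a = cand
print("trial and error:", round(a, 3))
```

Both end up near α = 3, but the second approach needs many more evaluations because it ignores the gradient information that is available by design.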
You mentioned "one-step reinforcement learning": what comes to mind is the contextual bandit setup. It's a simplification of the full-blown RL setup where your actions do not influence the next state (=context). The next simplification is the multi-armed bandit, which only has actions but no state/context.
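And for completeness, a minimal ε-greedy multi-armed bandit, the simplest of the setups mentioned above; the arm means and parameters are made up for illustration:

```python
import random

true_means = [0.2, 0.5, 0.8]      # unknown to the agent; invented values
estimates = [0.0, 0.0, 0.0]
counts = [0, 0, 0]
eps = 0.1

for _ in range(5000):
    # Explore with probability eps, otherwise exploit the current best estimate.
    if random.random() < eps:
        a = random.randrange(len(true_means))
    else:
        a = max(range(len(estimates)), key=lambda i: estimates[i])
    reward = random.gauss(true_means[a], 0.1)
    counts[a] += 1
    estimates[a] += (reward - estimates[a]) / counts[a]   # incremental mean

print("estimated arm values:", [round(v, 2) for v in estimates])
```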

Q-learning: how about picking the action that actually gives the most reward?

So in Q-learning, you update the Q function by Q_new(s,a) = Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).
Now, if I were to use the same principle but change Q to V function, instead of performing the action based on the current V function, you actually perform all actions (assuming you can reset the simulated environment), and select the best action out of those, and update the V function for that state. Would this yield a better result?
Of course, the training time would probably increase because you actually do all the actions once for each update, but since you are guaranteed to select the best action each time (except when exploring), it would give you a global optimum policy in the end?
This is a bit similar to value iteration, except that I don't have a model of the problem and I'm not building one.
Now, if I were to use the same principle but change Q to V function, instead of performing the action based on the current V function, you actually perform all actions (assuming you can reset the simulated environment), and select the best action out of those, and update the V function for that state. Would this yield a better result?
It is typically assumed in Reinforcement Learning that we do not have the ability to reset the (simulated) environment. Sure, when we're working on simulations it often may technically be possible, but generally we hope that work in RL can also extend to "real-world" problems outside of simulations afterwards, where that would no longer be possible.
If you do have that possibility, it would generally be recommended to look into search algorithms like Monte Carlo Tree Search rather than Reinforcement Learning algorithms like Sarsa, Q-learning, etc. I suspect your suggestion might indeed work slightly better than Q-learning in this case, but things like MCTS would be even better.
Now, if I were to use the same principle but change Q to V function, instead of performing the action based on the current V function, you actually perform all actions (assuming you can reset the simulated environment), and select the best action out of those, and update the V function for that state. Would this yield a better result?
Given that you don't have access to the model, you have to resort to model-free methods. What you are suggesting is basically a dynamic programming backup. See slides 28-31 of David Silver's lecture notes for various backup strategies to iterate on the value function.
However, note that this is just for prediction (i.e. estimating the value function for a given policy) and not for control (figuring out the best policy); there is no max involved in prediction. To do control, you can combine the above policy evaluation with greedy policy improvement to arrive at a policy-iteration method based on dynamic-programming-style policy evaluation.
The other options for model-free control are SARSA [+ greedy policy improvement] (on policy) and Q-learning (off-policy). These are Q-function based methods, though.
If you are just trying to win the game, and not necessarily interested in RL techniques discussed above, then you also have the choice of using purely planning based methods (like Monte Carlo Tree Search). Finally, you can combine planning and learning with methods such as Dyna.
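For concreteness, here is a sketch of the "try every action and back up the best outcome" update that the question describes. The snapshot() and restore() methods are hypothetical stand-ins for the assumed ability to reset the simulated environment to a given state; they are not part of any standard RL library API.

```python
import numpy as np

def full_backup_v_update(env, V, s, gamma=0.99, alpha=0.1):
    """Try every action from state s, take the best backed-up value, update V(s).

    Assumes a hypothetical env with snapshot()/restore() and a Gym-style step().
    """
    snapshot = env.snapshot()                 # hypothetical: save the sim state
    best_target = -np.inf
    for a in range(env.action_space.n):
        env.restore(snapshot)                 # hypothetical: rewind to state s
        s_next, r, terminated, truncated, _ = env.step(a)
        target = r + gamma * V[s_next] * (not terminated)
        best_target = max(best_target, target)
    V[s] += alpha * (best_target - V[s])
    env.restore(snapshot)                     # leave the env back at s
    return V
```

This is essentially a sampled version of the max-backup used in value iteration, which is why the answers above point towards dynamic programming and planning methods when such resets are available.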

AI for a Final Fantasy Tactics-like game

I am implementing a small grid-based, turn-based strategy game along the lines of Final Fantasy Tactics.
Do you have any ideas on how I can approach the target selection, movement, and skill selection process?
I am considering keeping the decisions separate, but all three decisions are strongly coupled.
(E.g. I can't decide where to move unless I know whom I am going to attack and what range the skill I will use has; and vice versa, I can't decide whom to attack unless I know how many turns it will take me to reach each target.)
I want to move towards a unified system, but my experiments with potential-field research, used in a manner similar to the Killzone 1 AI, keep getting stuck in local maxima.
=== Update 1
I am currently trying to use potential fields / influence maps to generate the data I base decisions on.
I have no idea how to handle having many skills, and skills that don't do damage but rather buff/debuff or alter the world.
Someone elsewhere suggested using Monte Carlo Tree Search, which is currently used in Go programs.
I believe the space my actors will be operating in is not a good fit for it, as many, many moves in the game don't lead to a position from which you can attack and affect the world (my world is bigger than that of Final Fantasy Tactics).
In Final Fantasy Tactics it might be applied successfully, although the branching factor is much bigger than that of 9x9 Go (from what I understand).
===
Thanks in advance, Xtapodi.
P.S. 1: A problem is that to know accurately how far away an enemy is, I would need to pathfind to him, because although the enemy is near, an impassable cliff might separate us that takes 4 turns to go around. Or worse, a unit might be blocking the way on, let's say, a bridge, so there is actually no way to reach him.
One approach I've used is to do a two-pass system.
First, find out where your unit can go. Use A* or whatever to flag out the terrain to see how far the unit can move this turn.
Once you know that, step through your available tactics (melee attack, heal friendly unit, whatever), and assign a fitness function for all available uses of the tactic. If you pass in the flagged terrain, you can very quickly determine what your space of possible tactics are.
This gives you a list of available tactics and their fitness functions for each move. Select the best one or randomize from the top. If there aren't any tactics available, repeat the process with flagging the terrain for two moves, and so on.
What I mean by fitness function is to decide on the "value" of performing the tactic on a certain unit or location. For instance, your "heal a friendly unit" tactical decision phase might step through all friendly units. If a friendly unit is within range (i.e., is reachable from a location your unit can reach), add it to the list of possible tactics and give it a fitness rating equal to, say, 100 * (1.0 - unit health), where unit health ranges from 0 to 1. Thus, healing a character down to only 10% health remaining would be worth 90 points, while a unit only down 5% would only be worth 5, and the unit wouldn't even consider healing an undamaged unit. Special units (i.e., "protect the boss" scenario units required to retain victory conditions) could be given a higher base number, so that they are given more attention by friendly units.
Similarly, your "melee attack" decision phase would step through all reachable enemy units, compute the likely damage, and compare that to the unit's health. Give each unit a "desirability" to attack, and multiply it by the percentage of remaining health you'd likely do, and you've got a pretty detailed fitness function that favors eliminating units when you can, but still goes after high-value targets.
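A rough sketch of those two scoring passes; the unit fields and constants are invented for illustration and not prescribed by the answer:

```python
from dataclasses import dataclass

@dataclass
class Unit:
    name: str
    health: float         # remaining health as a fraction, 0.0 .. 1.0
    desirability: float   # how valuable this unit is as a heal/attack target
    special: bool = False # e.g. "protect the boss" victory-condition units

def heal_fitness(ally: Unit) -> float:
    # 100 * (1 - health), with a higher base for must-protect units.
    base = 200.0 if ally.special else 100.0
    return base * (1.0 - ally.health)

def attack_fitness(enemy: Unit, expected_damage: float) -> float:
    # Desirability scaled by the fraction of remaining health we expect to remove.
    fraction = min(expected_damage, enemy.health) / max(enemy.health, 1e-6)
    return enemy.desirability * fraction

# Example: score every reachable tactic and pick the best one.
allies = [Unit("cleric", 0.10, 10.0), Unit("boss", 0.95, 50.0, special=True)]
enemies = [Unit("goblin", 0.30, 20.0), Unit("warlock", 1.00, 60.0)]

tactics = [(f"heal {u.name}", heal_fitness(u)) for u in allies]
tactics += [(f"attack {u.name}", attack_fitness(u, expected_damage=0.4)) for u in enemies]
print(max(tactics, key=lambda t: t[1]))   # -> ('heal cleric', 90.0)
```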
Using a process like this, you'll get a list of options like "Move to location A and heal friendly unit B : 50 points", "Move to location C and attack hostile unit D : 15 points", etc. Suddenly, it's really easy to choose a tactic.
Further detail may be added by multiplying the fitness of the tactic by a fitness for the path you'd have to take to implement it. For instance, if the place you'd have to move to in order to heal a friendly unit puts you in severe danger (i.e., standing on a lava space or something), you might factor that in by multiplying the fitness of that tactic by .2 or so, so that the unit may still consider it, but only if it's really important. All this takes is writing an algorithm to assess the fitness of a given location, and could be as simple as a pre-computed "terrain desirability" number or as complex as maintaining "threat maps" of enemy units.
The hard part, of course, is finding the right measures to make the engine smart. But that's the fun part of your system to tweak.
If the terrain where the battle occurs is pre-determined, or not too large, there is an article on terrain reasoning in FPS games that can be used as a basis for a turn-based game.
In short, you pre-calculate for each cell of the map a set of values, such as suitability for shooting in a given direction, protection, visibility, and so on. The AI can then use these values to choose an appropriate action. For example, a fighter will walk as quickly as possible toward the enemy, using protection if available, while a thief will take a path where visibility from the enemy's direction is as low as possible, with the goal of attacking from the flank or rear.
If the terrain is randomized and/or too large, however, the pre-calculation may take too long to be useful.
regards
Guillaume
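A tiny influence-map-style pre-calculation in that spirit; the grid size, enemy positions, and weighting below are invented for illustration:

```python
# Pre-compute a per-cell "threat" value from enemy positions so the AI can
# prefer low-threat cells when picking a destination.
W, H = 8, 6
enemies = [(1, 1), (6, 4)]     # enemy (x, y) positions, made up

def threat(x, y):
    # Closer enemies contribute more; Manhattan distance keeps it simple.
    return sum(1.0 / (1 + abs(x - ex) + abs(y - ey)) for ex, ey in enemies)

threat_map = [[threat(x, y) for x in range(W)] for y in range(H)]

# A "thief" would favour destinations with low threat, while a "fighter"
# might largely ignore it and simply minimise distance to its target.
safest = min(((x, y) for y in range(H) for x in range(W)),
             key=lambda c: threat_map[c[1]][c[0]])
print("safest cell:", safest)
```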
A good question; the answers can be all over the place. Personally, I don't have a lot of experience with this, but I would build the strategy around concepts rather than distance.
You are going to create a state machine for each NPC. It will pick a character to attack based on some settings.
For example, an NPC could be flagged as Attack Weakest, Attack Strongest, or Attack Most Injured. Then I would attempt to position it so that it can damage its desired target.
If you also have healers, you can do the same thing in reverse for the healer's target.
Target changing will be an important part of this system too, so you will want to think about that. A simple version is to re-evaluate the target on a given percentage of turns.
And finally, I would add random chance into the system. For example, a character could be set up as follows:
Attack Weakest .25
Attack Strongest .50
Attack Most Injured .25
Change target .1
When it's time to attack, you generate a random number from 0 to 1. If it's under your Change Target value, you change targets by generating another random number to decide which target to attack.
You can begin to factor distance into your system by adjusting the attack-mode percentages.
For example, if it would take 3 turns to reach the most injured target, decrease its chance of being targeted by dividing that value by 3 and distributing the difference to the other two options.
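A small sketch of that weighted selection, including the distance adjustment just described; the weights and helper names are illustrative rather than anything specified in the answer:

```python
import random

# Per-NPC attack-mode weights, as in the example above.
weights = {"weakest": 0.25, "strongest": 0.50, "most_injured": 0.25}
change_target_chance = 0.1

def adjust_for_distance(weights, mode, turns_away):
    """Divide a far-away mode's weight by the number of turns needed to reach it
    and hand the removed weight evenly to the other modes."""
    adjusted = dict(weights)
    removed = adjusted[mode] - adjusted[mode] / turns_away
    adjusted[mode] /= turns_away
    others = [m for m in adjusted if m != mode]
    for m in others:
        adjusted[m] += removed / len(others)
    return adjusted

def pick_mode(weights):
    modes, w = zip(*weights.items())
    return random.choices(modes, weights=w)[0]

# Example turn: the most-injured candidate is 3 turns away.
if random.random() < change_target_chance:
    w = adjust_for_distance(weights, "most_injured", turns_away=3)
    print("new attack mode:", pick_mode(w))
```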