Action masking for continuous action space in reinforcement learning - reinforcement-learning

Is there a way to model action masking for continuous action spaces? I want to model economic problems with reinforcement learning. These problems often have continuous action and state spaces. In addition, the state often influences what actions are possible and, thus, the allowed actions change from step to step.
Simple example:
The agent has a wealth (continuous state) and decides about spending (continuous action). The next period's wealth is then wealth minus spending. But he is restricted by the budget constraint: he is not allowed to spend more than his wealth. What is the best way to model this?
What I tried:
For discrete actions it is possible to use action masking. So in each time step, I provided the agent with information about which actions are allowed and which are not. I also tried to do this with a continuous action space by providing lower and upper bounds on the allowed actions and clipping the actions sampled from the actor network (e.g. DDPG).
I am wondering if this is a valid thing to do (it works in a simple toy model) because I did not find any RL library that implements this. Or is there a smarter way/best practice to include the information about allowed actions to the agent?

I think you are on the right track. I've looked into masked actions and found two possible approaches: give a negative reward when trying to take an invalid action (without letting the environment evolve), or dive deeper into the neural network code and let the neural network output only valid actions.
I've always considered this last approach as the most efficient, and your approach of introducing boundaries seems very similar to it. So as long as this is the type of mask (boundaries) you are looking for, I think you are good to go.
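As a minimal sketch of the "boundaries" idea for the wealth/spending example, the snippet below shows two ways of forcing a raw actor output into the state-dependent interval [0, wealth]: clipping (what the question describes) and squashing with a sigmoid. The function names and the value of `raw` are hypothetical and only for illustration; this is not taken from any RL library. One known difference is that clipping has zero gradient outside the bounds, while the sigmoid mapping stays differentiable everywhere.

```python
import numpy as np

def clip_to_bounds(raw_action, low, high):
    """Clip an unbounded actor output to the allowed interval [low, high]."""
    return float(np.clip(raw_action, low, high))

def squash_to_bounds(raw_action, low, high):
    """Map an unbounded actor output into [low, high] with a sigmoid."""
    fraction = 1.0 / (1.0 + np.exp(-raw_action))  # lies between 0 and 1
    return float(low + fraction * (high - low))

wealth = 10.0  # continuous state
raw = 3.7      # hypothetical raw output of the actor network

spend_clipped = clip_to_bounds(raw, 0.0, wealth)
spend_squashed = squash_to_bounds(raw, 0.0, wealth)

# Budget constraint holds by construction: spending never exceeds wealth.
next_wealth = wealth - spend_squashed
```

Because the upper bound is passed in at every step, the same actor can be reused while the allowed range changes with the state.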

Related

Deep Reinforcement Learning, how to make an agent that control many machines

Good morning, I'm facing an RL problem with many constraints. The main idea is that my agent will control many different machines, for example by ordering them to go out on their missions (the missions themselves don't matter here), or by ordering them to return to the depot and choosing the right place for each of them to park (depending on constraints).
The problem is: the agent takes decisions at predefined points in time, and for each period we know which actions (go out, go in) are allowed. For example, at 8 o'clock it decides to order 4 machines to go out, and at 14 o'clock it decides to bring back 2 machines (choosing the right place for them).
In the literature I saw many ideas that refer to BDQ, but is it required for my problem? I'm thinking about having actions like [chooseMachine1, chooseMachine2, chooseMachine3 ... chooseMachineN, goOut, goInPlace1, goInPlace2, goInPlace3, goInPlace4]. In the code I would specify the logic so that, depending on the current period, I first expose a number M<=N of machines to choose from (giving 0 probability to the actions that aren't possible at the moment; if it is 14 o'clock, only the machines that are out are concerned by the agent's decision). If the agent chooses Machine1, it then only has access to the actions that are possible for that machine.
So, my question is: do you think my ideas are right? (I am a beginner.) My idea is to build a DQN and supply the logic for the possible/impossible actions.
Do you think a BDQ is better suited to my problem? For example, having N branches for N machines, which all have the same possible actions (branch1 (Machine1): go out, goPlace1, goPlace2 ...).
If that is the case, are there any implementation examples?
If you have resources to recommend, I will be glad to check them.
Thank you
What would an agent navigating a maze do in case the chosen action would run it into a wall?
I think the usual approach in RL is to allow the move and then handle the result in the environment. That way the environment can simply make nothing happen, or even give a negative reward, when an action is "disallowed".
At training convergence the agent will hopefully have learned not to choose ineffective actions.
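The maze analogy can be sketched as follows: the environment accepts every action, but a move into a wall leaves the position unchanged and returns a penalty. The grid layout, the `step` function, and the reward values are all assumptions chosen for illustration.

```python
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(grid, position, action, step_penalty=-0.01, wall_penalty=-1.0):
    """Return (new_position, reward); '#' cells and the grid border act as walls."""
    row, col = position
    d_row, d_col = MOVES[action]
    new_row, new_col = row + d_row, col + d_col
    inside = 0 <= new_row < len(grid) and 0 <= new_col < len(grid[0])
    if not inside or grid[new_row][new_col] == "#":
        # Disallowed move: nothing happens, and the agent is penalised.
        return position, wall_penalty
    return (new_row, new_col), step_penalty

grid = ["..#",
        "..."]

blocked = step(grid, (0, 1), "right")  # runs into the wall at (0, 2)
allowed = step(grid, (0, 0), "right")  # free cell, normal step cost
```

The same pattern would carry over to the machine problem: an order that is not allowed in the current period simply leaves the depot state untouched and costs reward.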

State definition in Reinforcement learning

When defining the state for a specific problem in reinforcement learning, how do I decide what to include in the definition and what to leave out, and how do I distinguish between an observation and a state?
For example, assume the agent operates in a human-resources planning context where it needs to hire workers based on the demand for jobs, taking into account the cost of hiring them (assuming the budget is limited). Is a state of the form (# workers, cost) a good definition?
Overall, I don't know what information needs to be in the state and what should be left out because it is rather an observation.
Thank you
I am assuming you are formulating this as an RL problem because the demand is an unknown quantity. And, maybe [this is optional criteria] the Cost of hiring them may take into account a worker's contribution towards the job which is unknown initially. If however, both these quantities are known or can be approximated beforehand then you can just run a Planning algorithm to solve the problem [or just some sort of Optimization].
Having said this, the state in this problem could be something as simple as (#workers). Note I'm not including the cost, because cost must be experienced by the agent, and therefore is unknown to the agent until it reaches a specific state. Depending on the problem, you might need to add another factor of "time", or the "job-remaining".
Most of the theoretical results in RL hinge on the key assumption that the environment is Markovian. There are several works that get by without this assumption, but if you can formulate your environment so that it exhibits this property, you have many more tools to work with. The key idea is that the agent can decide which action to take (in your case, an action could be: hire 1 more person; another action could be: fire a person) based on the current state alone, say (#workers = 5, time = 6). Note that we are not distinguishing between workers yet, so the action is firing "a" person rather than firing "a specific" person x. If the workers have differing capabilities, you may need to add further factors representing which workers are currently hired and which are still in the pool, yet to be hired, for example as a boolean array of fixed length. (I hope you get the idea of how to form a state representation; it can vary based on the specifics of the problem, which are missing from your question.)
Now, once we have the State definition S, the action definition A (hire / fire), we have the "known" quantities for an MDP-setup in an RL framework. We also need an environment that can supply us with the cost function when we query it (Reward Function / Cost Function), and tell us the outcome of taking a certain action on a certain state (Transition). Note that we don't necessarily need to know these Reward / Transition function beforehand, but we should have a means of getting these values when we query for a specific (state, action).
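To make the pieces concrete, here is a toy sketch of such an environment with state (#workers, time) and actions hire / fire / hold. The class name `HiringEnv`, the wage, the penalty for unmet demand, and the uniform demand distribution are all made-up assumptions; the point is only that the agent never sees the reward or transition rules directly and obtains them by querying `step`.

```python
import random

class HiringEnv:
    """Toy hiring MDP: state = (workers, time); demand is hidden from the agent."""
    ACTIONS = ("hire", "fire", "hold")

    def __init__(self, horizon=12, wage=1.0, unmet_cost=3.0, seed=0):
        self.horizon = horizon
        self.wage = wage              # assumed cost per employed worker per step
        self.unmet_cost = unmet_cost  # assumed cost per unit of unmet demand
        self.rng = random.Random(seed)
        self.workers = 0
        self.time = 0

    def reset(self):
        self.workers, self.time = 0, 0
        return (self.workers, self.time)

    def step(self, action):
        # Transition: the action changes the head count.
        if action == "hire":
            self.workers += 1
        elif action == "fire":
            self.workers = max(0, self.workers - 1)
        # Reward: the cost is only experienced after acting.
        demand = self.rng.randint(0, 5)
        unmet = max(0, demand - self.workers)
        reward = -(self.wage * self.workers + self.unmet_cost * unmet)
        self.time += 1
        done = self.time >= self.horizon
        return (self.workers, self.time), reward, done

env = HiringEnv()
state = env.reset()
state, reward, done = env.step("hire")
```

Note that the cost is returned as a reward and is not part of the state tuple, in line with the argument above.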
Coming to your final part, the difference between observation and state. There are much better resources to dig deep into it, but in a crude sense, observation is an agent's (any agent, AI, human etc) sensory data. For example, in your case the agent has the ability to count number of workers currently employed (but it does not have an ability to distinguish between workers).
A state, or more formally a true MDP state, must be something that is Markovian and captures the environment at its fundamental level. So, in order to determine the true cost to the company, the agent may need to be able to differentiate between workers, the working hours of each worker, the jobs they are working on, interactions between workers, and so on. Note that many of these factors may not be relevant to your task, for example a worker's gender. Typically one would like to form a good hypothesis beforehand about which factors are relevant.
Now, even though we can agree that a worker's assignment (to a specific job) may be a relevant feature when making a decision to hire or fire them, your observation does not contain this information. So you have two options: either you ignore the fact that this information is important and work with what you have available, or you try to infer these features. If your observation is incomplete for the decision making in your formulation, the environment is typically classified as partially observable (and POMDP frameworks are used for it).
I hope I clarified a few points, however, there is huge theory behind all of this and the question you asked about "coming up with a state definition" is a matter of research. (Much like feature engineering & feature selection in Machine Learning).

concerns regarding exploring starts given my state is not the same as my observation space in gym

My state for a custom Gym environment is not the same as my observation space. The observation is calculated from the state.
How will an RL algorithm that requires exploring starts work in this case? Or am I getting something wrong?
I imagine the algorithm sampling from my observation space, then setting the state of the environment and trying an action. But this will not work with my environment.
As you can see from the question, I'm a newbie with RL and with Gym. Which RL algorithm should I use in this case? How would you address such a situation?
Any tips?
My custom Gym environment is now selecting a random start state. Therefore, by using this environment, one can achieve "Exploring Starts". So, I do not need to worry any more that my observation is not the same as my state. For example, implementing Monte Carlo ES for Black Jack, as described in the RLbook2018, the state of my environment includes the hidden card of the dealer, while an observation does not.
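A rough sketch of that arrangement, written without the Gym API to keep it self-contained: `reset` draws a random full state (including the dealer's hidden card), and the observation is derived from it. The class `HiddenCardEnv` and the ranges used are simplified assumptions, not a faithful Blackjack implementation.

```python
import random

class HiddenCardEnv:
    """The state includes the dealer's hidden card; the observation does not."""

    def __init__(self, seed=None):
        self.rng = random.Random(seed)
        self.state = None

    def reset(self):
        # The environment, not the algorithm, picks the random start state.
        player_sum = self.rng.randint(12, 21)
        dealer_showing = self.rng.randint(1, 10)
        dealer_hidden = self.rng.randint(1, 10)
        self.state = (player_sum, dealer_showing, dealer_hidden)
        return self._observe()

    def _observe(self):
        player_sum, dealer_showing, _hidden = self.state
        return (player_sum, dealer_showing)

def exploring_start(env, actions, rng):
    """Random start state from the environment plus a random first action."""
    observation = env.reset()
    first_action = rng.choice(actions)
    return observation, first_action

env = HiddenCardEnv(seed=1)
observation, first_action = exploring_start(env, ["hit", "stick"], random.Random(0))
```

The learning algorithm only ever indexes its value estimates by the observation, so it never needs to set the hidden part of the state itself.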
I was confused at the time as I wanted the algorithm itself to pick the random state and set it into the environment.
PS,
If you need to save states of previous "alternative realities", search SO or Google for wrappers, and how they do that for MCTS (Monte Carlo Tree Search).

I'm confused with how to determine output probabilities and picking action in Policy Optimization

I'm currently learning PPO for my game and have most of the basics down. I've watched several YouTube videos and tried to understand a few code examples, but there's something I'm confused about.
In my understanding, PPO (and maybe policy optimization in general) uses softmax as the activation function to turn the outputs into probabilities, which are then fed into a Gaussian distribution. As I learned it, all the output probabilities sum to 1, which implies that only one action is taken. How does this translate to something that may require multiple actions at the same time (e.g. pressing two or more buttons at the same time in a game)?
Do I need to map out all possible actions (including the combinations)?
Or did I miss something, and is it possible for the model to compute output probabilities separately (e.g. movement probabilities and weapon-action probabilities being different)?
You would want to map out all possible combinations of actions if you specifically want two actions to be taken at exactly the same time. At any given time step you can only pick one action from your output distribution, so combinations would have to be included.
However, your agent could learn to alternate between shooting and moving, but these actions would occur in different steps.
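Mapping out the combinations could look roughly like this: every (movement, weapon) pair becomes one entry of a single discrete action set, and the softmax runs over that joint set. The button names are invented, and the random logits merely stand in for the output of a policy network.

```python
import itertools
import numpy as np

MOVEMENT = ["stay", "left", "right", "jump"]
WEAPON = ["idle", "shoot"]

# One joint action per combination: 4 * 2 = 8 entries.
JOINT_ACTIONS = list(itertools.product(MOVEMENT, WEAPON))

def softmax(logits):
    """Numerically stable softmax; the result sums to 1."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

rng = np.random.default_rng(0)
logits = rng.normal(size=len(JOINT_ACTIONS))  # stand-in for network output
probs = softmax(logits)

# Sampling one index selects a movement and a weapon action for the same step.
index = rng.choice(len(JOINT_ACTIONS), p=probs)
movement, weapon = JOINT_ACTIONS[index]
```

The cost of this approach is that the output layer grows with the product of the individual action counts.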

Order-issuing neural network?

I'm interested in writing certain software that uses machine learning, and performs certain actions based on external data.
However, I've run into a problem (one that has always interested me):
how is it possible to write machine learning software that issues orders or sequences of orders?
The problem is that, as I understand it, a neural network gets a bunch of inputs and "recalls" an output based on the results of previous training. Instantly (well, more or less). So I'm not sure how "issuing orders" could fit into that system, especially when the actions performed by the system affect the system with a certain delay. I'm also a bit unsure how it is possible to train this thing.
Examples of such system:
1. First-person shooter enemy controller. As I understand it, it is possible to implement a neural network controller for the bot that will switch between bot behavior strategies (well, assign priorities to them) based on some inputs (probably something like health, ammo, etc.). But I don't see a way to make a higher-order controller that could issue a sequence of commands like "go there, then turn left". Also, the bot's actions will affect the variables that control the bot's behavior, i.e. shooting reduces ammo, falling from heights reduces health, etc.
2. Automated market trader. It is certainly possible to make a system that will try to predict the next market price of something. However, I don't see how it is possible to make a system that would issue an order to buy something, watch the trend, then sell it back to gain profit or cover losses.
3. Car driver. Again, (as I understand it) it is possible to make a system that will maintain a desired movement vector based on position/velocity/torque data and the results of previous training. However, I don't see a way to make such a system (learn to) perform a sequence of actions.
I.e. as I understand it, a neural net is technically a matrix: you give it input, it produces output. But what about generating sequences of actions that could change the environment the program operates in?
If such tasks are not entirely suitable for neural networks, what else could be used?
P.S. I understand that the question isn't exactly clear, and I suspect that I'm missing some knowledge. So I'll appreciate some pointers (i.e. books/resources to read, etc).
You could try to connect the output neurons to controllers directly, e.g. moving forward, turning, or shooting in the first-person shooter, or placing buy orders for the trader. However, I think that the best results nowadays are obtained when you let the neural net solve one rather specific subproblem, and then let a "normal" program interpret its answer. For example, you could let the neural net construct a map overlay of "where do I want to be", which the bot then translates into movements. The neural network for the trader could produce a "how much do I want which security", which the bot then translates into buy or sell orders.
The decision about which subproblem the neural network should solve is central to its design. The important thing is that good solutions can be taught to the neural network.
Edit: Expanding on this with the examples: when the shooter bot gets shot, it should not have wanted to be there; when it gets to shoot someone else, it should have wanted to be there more. When the trader loses money on a security, it should have wanted it less before; if it gains, it should have wanted it more. These things can be taught.
The problem you are describing is known as Reinforcement Learning. Reinforcement learning is essentially a machine learning algorithm (such as a neural network) coupled with a controller. It has been used for all of the applications you mention, even to drive real cars.
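The "network coupled with a controller" idea can be illustrated with a toy loop: the policy is literally a matrix that maps the current state to action scores, and a sequence of actions emerges because each action changes the state that is fed back in. The bot dynamics and the hand-picked weights below are assumptions for illustration; no learning happens here, whereas a reinforcement learning algorithm would adjust the weights from rewards.

```python
import numpy as np

def policy(weights, state):
    """A one-layer 'network': the matrix maps the state to one score per action."""
    return int(np.argmax(weights @ state))

def environment_step(state, action):
    """Toy bot with state [health, ammo]; action 0 = shoot, 1 = retreat."""
    health, ammo = state
    if action == 0 and ammo > 0:
        ammo -= 1                      # shooting reduces ammo
    else:
        health = min(health + 1, 10)   # retreating restores health
    return np.array([health, ammo], dtype=float)

def run_episode(weights, state, steps):
    """Feed each new state back into the policy to obtain a sequence of actions."""
    actions = []
    for _ in range(steps):
        action = policy(weights, state)
        actions.append(action)
        state = environment_step(state, action)
    return actions, state

# Row 0 scores "shoot" by ammo, row 1 scores "retreat" by health (hand-picked).
weights = np.array([[0.0, 1.0],
                    [0.2, 0.0]])
actions, final_state = run_episode(weights, np.array([6.0, 3.0]), steps=5)
```

With these weights the bot shoots while it has plenty of ammo and then switches to retreating, even though the policy itself only ever answers "what now?" for a single state.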