(Reinforcement Learning) My DDPG agent converges to a suboptimal policy - deep-learning

I am trying to learn an optimal policy for a capacity planning problem driven by 2 time-evolving uncertainties. I am using a DDPG agent so that I do not have to define the capacity increments arbitrarily.
https://i.imgur.com/XXTRbdY.png
As you can see in the linked picture above, Issue #1 is that the DDPG agent converges to a suboptimal policy.
By the way, the red line (NPV) is the reward of the single-period planning model.
Another issue is that the agent is very sensitive to each run. On one run of my code, all episodes stay at 0 action and 0 reward. When I run the same code again, it tries various actions. Could this be an initialization issue?
I can send the ipynb file if someone experienced wants to take a look at it.
Yes, I normalized my states and actions between 0 and 1.
I am trying out different hyperparameters now.
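One way to check the initialization question is to fix the random seed and compare what the untrained actor outputs for several seeds. The sketch below is only an illustration under assumptions: `init_weights` and `initial_action` are hypothetical stand-ins for a single linear actor layer whose output is clipped to the normalized range [0, 1], not the code from the notebook. In a real setup the framework's own seed would also need to be set (for example `np.random.seed`, `torch.manual_seed` or `tf.random.set_seed`, depending on the library used).

```python
import random


def init_weights(n_in, n_out, seed):
    """Draw a small uniform weight matrix from a generator seeded with `seed`."""
    rng = random.Random(seed)  # local generator, so runs are reproducible per seed
    limit = 1.0 / (n_in ** 0.5)
    return [[rng.uniform(-limit, limit) for _ in range(n_out)] for _ in range(n_in)]


def initial_action(weights, state):
    """Hypothetical one-output linear actor, clipped to the [0, 1] action range."""
    raw = sum(row[0] * s for row, s in zip(weights, state))
    return min(1.0, max(0.0, raw))


if __name__ == "__main__":
    state = [0.5, 0.5]  # a normalized state, values chosen only for illustration
    for seed in range(5):
        action = initial_action(init_weights(2, 1, seed), state)
        print(f"seed={seed} initial action={action:.3f}")
```

With a clip at 0, any seed whose raw output starts out negative begins at exactly 0 action, so some runs can look stuck from the first episode while others explore. Whether this is what happens in the code above is not known from the post; comparing a handful of fixed seeds would at least show if the behaviour follows the seed.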

Related

Deep Reinforcement Learning, how to make an agent that control many machines

Good morning. I'm facing an RL problem that has many constraints. The main idea is that my agent controls many different machines, for example by ordering them to go out on their missions (the missions themselves don't matter), or by ordering them to enter the depot and choosing the right place for each of them to park (depending on the constraints).
The problem is that the agent makes decisions at predefined times, and for each period we know which actions (go out, go in) are allowed. For example, at 8 o'clock it decides to send 4 machines out, and at 14 o'clock it decides to bring 2 machines back (choosing the right place for each of them).
In the literature I saw many ideas that refer to BDQ, but is it required for my problem? I'm thinking about having actions like [chooseMachine1, chooseMachine2, chooseMachine3, ..., chooseMachineN, goOut, goInPlace1, goInPlace2, goInPlace3, goInPlace4]. The code would then specify the logic: depending on the current period, I first expose a number M<=N of machines to choose from, giving 0 probability to the actions that aren't possible at the moment (if it is 14 o'clock, only the machines that are out are concerned by the agent's decision). If the agent chooses Machine1, it then only has access to the actions that are possible after choosing it.
So my question is: do you think my ideas are right? (I am a beginner.) My idea is to build a DQN and supply the logic for the possible/impossible actions.
Do you think a BDQ is better suited to my problem? For example, having N branches for N machines which all have the same possible actions (branch1 (Machine1): go out, goPlace1, goPlace2, ...).
If that is the case, are there any implementation examples?
If you have resources to recommend, I will be glad to check them.
Thank You
What would an agent navigating a maze do in case the chosen action would run it into a wall?
I think the usual approach in RL is to allow the move and then handle the result in the environment. That way the environment can simply make nothing happen, or even give a negative reward, when an action is "disallowed".
By the time training converges, the agent will hopefully have learned not to choose ineffective actions.
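The maze case can be sketched as a tiny environment whose `step` leaves the agent where it is and returns a penalty when the move would hit a wall. This is a minimal sketch under assumptions: the class name `GridMaze`, the layout characters and the reward values are all made up for illustration and do not come from any library.

```python
class GridMaze:
    """Tiny grid maze: '#' is a wall, 'G' is the goal, anything else is free."""

    # Action index -> (row delta, column delta): up, down, left, right.
    MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}

    def __init__(self, layout, start, wall_penalty=-1.0, step_reward=-0.1, goal_reward=10.0):
        self.layout = layout
        self.start = start
        self.pos = start
        self.wall_penalty = wall_penalty
        self.step_reward = step_reward
        self.goal_reward = goal_reward

    def reset(self):
        self.pos = self.start
        return self.pos

    def _blocked(self, row, col):
        inside = 0 <= row < len(self.layout) and 0 <= col < len(self.layout[0])
        return (not inside) or self.layout[row][col] == "#"

    def step(self, action):
        """Return (position, reward, done); disallowed moves change nothing."""
        d_row, d_col = self.MOVES[action]
        row, col = self.pos[0] + d_row, self.pos[1] + d_col
        if self._blocked(row, col):
            # The move is accepted by the API but has no effect, only a penalty.
            return self.pos, self.wall_penalty, False
        self.pos = (row, col)
        if self.layout[row][col] == "G":
            return self.pos, self.goal_reward, True
        return self.pos, self.step_reward, False


if __name__ == "__main__":
    env = GridMaze(["S.#", "..G"], start=(0, 0))
    print(env.step(0))  # moving up from the top row is blocked
    print(env.step(3))  # moving right onto a free cell
```

The penalty is a design choice: with a penalty of 0 the agent only wastes a step, while a negative value pushes it away from walls faster but adds one more reward term to tune.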

Action masking for continuous action space in reinforcement learning

Is there a way to model action masking for continuous action spaces? I want to model economic problems with reinforcement learning. These problems often have continuous action and state spaces. In addition, the state often influences what actions are possible and, thus, the allowed actions change from step to step.
Simple example:
The agent has a wealth (continuous state) and decides how much to spend (continuous action). The next period's wealth is then wealth minus spending. But the agent is restricted by the budget constraint: it is not allowed to spend more than its wealth. What is the best way to model this?
What I tried:
For discrete actions it is possible to use action masking, so in each time step I provided the agent with information about which actions are allowed and which are not. I also tried to do this with a continuous action space by providing lower and upper bounds on the allowed actions and clipping the actions sampled from the actor network (e.g. DDPG).
I am wondering if this is a valid thing to do (it works in a simple toy model) because I did not find any RL library that implements this. Or is there a smarter way/best practice to include the information about allowed actions to the agent?
I think you are on the right track. I've looked into masked actions and found two possible approaches: give a negative reward when trying to take an invalid action (without letting the environment evolve), or dive deeper into the neural network code and let the neural network output only valid actions.
I've always considered this last approach as the most efficient, and your approach of introducing boundaries seems very similar to it. So as long as this is the type of mask (boundaries) you are looking for, I think you are good to go.
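For the wealth example, the two options (clipping versus having the network output only valid actions) could look roughly like this. It is a minimal sketch under assumptions: `raw_action` stands for an unbounded actor output, and the helper names `squash_to_bounds`, `clip_to_bounds` and `step_wealth` are hypothetical, not part of DDPG or of any RL library.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def squash_to_bounds(raw_action, low, high):
    """Map an unbounded actor output smoothly into [low, high]."""
    return low + sigmoid(raw_action) * (high - low)


def clip_to_bounds(action, low, high):
    """Hard clipping, as tried in the question."""
    return min(high, max(low, action))


def step_wealth(wealth, raw_action):
    """One period of the toy model: spending can never exceed current wealth."""
    spending = squash_to_bounds(raw_action, 0.0, wealth)
    return wealth - spending, spending


if __name__ == "__main__":
    wealth = 10.0
    for raw in (-3.0, 0.0, 3.0):
        next_wealth, spending = step_wealth(wealth, raw)
        print(f"raw={raw:+.1f} spending={spending:.3f} next wealth={next_wealth:.3f}")
```

One difference worth keeping in mind: with hard clipping, every raw output beyond the bound produces the same executed action, so the agent gets no signal about how far outside it was, whereas the squashed version stays sensitive to the raw output everywhere. Which of the two trains better on a given problem is an empirical question.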

Gym (OpenAI) environment action space depends on the current state

I'm using the gym toolkit to create my own env and keras-rl to use my env with an agent.
The problem is that my action space changes: it depends on the current state.
For example, I have 46 possible actions, but in a given state only 7 are available, and I can't find a way to model that.
I've read the question open-ai-enviroment-with-changing-action-space-after-each-step
but it did not resolve my problem.
The Gym documentation has no instructions for doing this, only an issue on their GitHub repo (still open).
I can't understand how the agent (keras-rl, DQN agent) picks an action. Is it chosen randomly? And from where?
Can somebody help me? Any ideas?
I've handled this by just ignoring any invalid actions and letting the exploration mechanics keep the agent from getting stuck. It is quick and simple, but there are likely better ways to do it.
I think the better option is to somehow set the probability of selecting that action to zero, but I've had trouble figuring out how to do that.
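Setting the probability of an invalid action to zero can be done at selection time, outside the network, by only ever choosing among the currently valid indices. The sketch below is library-independent and does not show how to plug it into keras-rl; the function name `masked_epsilon_greedy` and the boolean `valid_mask` are assumptions for illustration.

```python
import random


def masked_epsilon_greedy(q_values, valid_mask, epsilon, rng=random):
    """Pick an action index, never returning one whose mask entry is False.

    q_values   -- one estimated value per action (e.g. the DQN output)
    valid_mask -- booleans, True where the action is allowed in this state
    epsilon    -- probability of exploring uniformly among valid actions
    """
    valid = [i for i, allowed in enumerate(valid_mask) if allowed]
    if not valid:
        raise ValueError("at least one action must be valid")
    if rng.random() < epsilon:
        return rng.choice(valid)  # explore, but only among valid actions
    return max(valid, key=lambda i: q_values[i])  # exploit among valid actions


if __name__ == "__main__":
    q = [5.0, 1.0, 3.0, 9.0]
    mask = [False, True, True, False]  # actions 0 and 3 are not available now
    print(masked_epsilon_greedy(q, mask, epsilon=0.0))
```

If this is used with Q-learning, the same mask would normally also be applied to the next state's values when the bootstrap target takes its maximum, so that the target does not rely on an action the agent could never take.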

NullPointerExceptions while executing LoadTest on WSO2BPS

While running load tests on WSO2 BPS 3.2.0 we ran into a problem.
Let me tell you more about our project and what we did.
Our BPS process is designed to manage interactions with 3 systems. Basically it is split into two parts: the first is to CREATE INSTANCE in one of the systems, then there is a short wait, and then the second is to SELECT OFFER in the instance context.
In real life it looks like this: the user wants to get a product, the application asks the system for offers, and then the user selects one of the available offers.
In BPS the first part is a straightforward process. The second part is split into two flows: one refreshes the information with new offers, and the other waits for the user to choose one of them.
Our aim is to sustain about 1000-1500 simultaneous threads in the load test. The external systems are simulated by mockups executed by LoadUI.
We can achieve our goal if we disable "Process-Level Monitoring Events" in the deployment descriptor of our process (set it to "none"). Everything then runs smoothly for hours.
But if we enable this feature (and we need to), everything fails with an error very soon (after about 100-200 runs):
[2015-07-28 17:47:02,573] ERROR {org.wso2.carbon.bpel.core.ode.integration.BPELProcessProxy} - Error processing response for MEX null
java.lang.NullPointerException
at org.wso2.carbon.bpel.core.ode.integration.BPELProcessProxy.onResponse(BPELProcessProxy.java:402)
at org.wso2.carbon.bpel.core.ode.integration.BPELProcessProxy.onAxisServiceInvoke(BPELProcessProxy.java:187)
at
[....Et cetera....]
After the first appearance of this error, another type appears: other threads simply fail after the timeout.
The database seems to be OK (by the way, it is MySQL 5.6.25). The dashboard shows no extreme levels of input or output.
So I think BPS itself is the bottleneck. We gave it an 8 GB heap, and its configuration options are set for extreme numbers of threads (negative values where they are allowed, and otherwise ridiculously large ones like 100000).
Has anyone ever faced this problem? I would appreciate any help very much.
Solved in BPS 3.5.0; refer to the release notes.

Order-issuing neural network?

I'm interested in writing software that uses machine learning and performs certain actions based on external data.
However, I've run into a problem that has always interested me:
how is it possible to write machine learning software that issues orders or sequences of orders?
The problem is that, as I understand it, a neural network gets a bunch of inputs and "recalls" an output based on the results of previous training. Instantly (well, more or less). So I'm not sure how "issuing orders" could fit into that system, especially when the actions performed by the system affect the system itself with a certain delay. I'm also a bit unsure how it is possible to train this thing.
Examples of such systems:
1. First-person shooter enemy controller. As I understand it, it is possible to implement a neural network controller for the bot that switches between bot behavior strategies (well, assigns priorities to them) based on some inputs (probably something like health, ammo, etc). But I don't see a way to make a higher-order controller that could issue a sequence of commands like "go there, then turn left". Also, the bot's actions will affect the variables that control the bot's behavior, i.e. shooting reduces ammo, falling from heights reduces health, etc.
2. Automated market trader. It is certainly possible to make a system that tries to predict the next market price of something. However, I don't see how it is possible to make a system that would issue an order to buy something, watch the trend, and then sell it back to make a profit or cover losses.
3. Car driver. Again, (as I understand it) it is possible to make a system that maintains a desired movement vector based on position/velocity/torque data and the results of previous training. However, I don't see a way to make such a system (learn to) perform a sequence of actions.
I.e. as I understand it, a neural net is technically a matrix: you give it input, it produces output. But what about generating sequences of actions that could change the environment the program operates in?
If such tasks are not entirely suitable for neural networks, what else could be used?
P.S. I understand that the question isn't exactly clear, and I suspect that I'm missing some knowledge. So I'd appreciate some pointers (i.e. books/resources to read, etc).
You could try to connect the output neurons to controllers directly, e.g. moving forward, turning, or shooting in the first-person shooter, or placing buy orders for the trader. However, I think that the best results nowadays are achieved when you let the neural net solve one rather specific subproblem, and then let a "normal" program interpret its answer. For example, you could let the neural net construct a map overlay of "where do I want to be", which the bot then translates into movements. The neural network for the trader could produce a "how much do I want which security", which the bot then translates into buy or sell orders.
The decision about which subproblem the neural network should solve is central to its design. The important thing is that good solutions can be taught to the neural network.
Edit: Expanding on this with the examples: when the shooter bot gets shot, it should not have wanted to be there; when it gets to shoot someone else, it should have wanted to be there more. When the trader loses money on a security, it should have wanted it less before; if it gains, it should have wanted it more. These things can be taught.
The problem you are describing is known as Reinforcement Learning. Reinforcement learning is essentially a machine learning algorithm (such as a neural network) coupled with a controller. It has been used for all of the applications you mention, even to drive real cars.
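To make the "sequence of orders" part concrete: in reinforcement learning the learner still maps one input to one output, but it is called again after every action, so a sequence emerges from repeated one-step decisions, and delayed consequences are passed back through the learned values. Below is a minimal sketch under assumptions, using tabular Q-learning on a made-up five-cell corridor; a table stands in for the neural network, and the corridor, reward and hyperparameters are chosen only for illustration.

```python
import random

N_STATES = 5        # corridor cells 0..4; cell 4 is the goal
ACTIONS = (-1, +1)  # action index 0 = step left, index 1 = step right


def env_step(state, action_index):
    """Deterministic corridor: the ends clamp the position, the goal pays 1."""
    next_state = min(N_STATES - 1, max(0, state + ACTIONS[action_index]))
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def choose_action(q_row, epsilon, rng):
    """Epsilon-greedy selection with random tie-breaking."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    best = max(q_row)
    return rng.choice([i for i, q in enumerate(q_row) if q == best])


def train(episodes=200, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn action values by acting, observing the result, and updating."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = choose_action(q[state], epsilon, rng)
            next_state, reward, done = env_step(state, action)
            # One-step target: reward now plus the discounted best value later.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


def greedy_rollout(q, max_steps=20):
    """Issue one order per step by always taking the highest-valued action."""
    state, orders = 0, []
    for _ in range(max_steps):
        action = max(range(len(ACTIONS)), key=lambda i: q[state][i])
        orders.append("left" if action == 0 else "right")
        state, _, done = env_step(state, action)
        if done:
            break
    return orders


if __name__ == "__main__":
    print(greedy_rollout(train()))
```

The only reward arrives at the last step, yet the earlier cells also learn to prefer "right", because each update copies a discounted share of the next cell's value backwards. Replacing the table with a neural network that predicts these values is the step from this toy to the deep variants discussed in the other threads.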