Suppose the environment is a game with 5 doors, and at each step you must choose exactly 2 of them. How could that be represented as an action_space?
self.action_space = spaces.Box(np.array([0, 0, 0, 0, 0]), np.array([+1, +1, +1, +1, +1]))
With the above definition, an action could select none of the doors [0 0 0 0 0], all of them [1 1 1 1 1], or anything in between. I am trying to force the action to choose exactly 2 doors.
Examples of correct actions:
[1 1 0 0 0]
[1 0 1 0 0]
etc.
Probably, the simplest solution would be to list all the possible actions, i.e., all the allowed combinations of two doors, and assign a number to each one. Then the environment must "decode" each number to the corresponding combination of two doors.
In this way, the agent simply chooses from a discrete action space (spaces.Discrete(n) in OpenAI Gym).
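For example, a minimal sketch of that encoding (the helper name decode_action is just illustrative):

import itertools
import numpy as np
from gym import spaces

# Enumerate all allowed combinations of 2 doors out of 5 and give each an index.
door_pairs = list(itertools.combinations(range(5), 2))  # 10 combinations
action_space = spaces.Discrete(len(door_pairs))

def decode_action(a):
    # Map a discrete action index back to a 5-element 0/1 door vector.
    doors = np.zeros(5, dtype=np.int64)
    doors[list(door_pairs[a])] = 1
    return doors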
I'm trying to do multi-agent reinforcement learning on the grid world navigation task where multiple agents try to collectively reach multiple goals while avoiding collisions with stationary obstacles and each other. As a constraint, each agent can only see within a limited range around itself.
So on a high level, the state of each agent should contain both information to help it avoid collisions and information to guide it towards the goals. I'm thinking of implementing the former by including in the agent's state a matrix consisting of the grid cells surrounding the agent, which would show the agent where the obstacles are. However, I'm not sure how to include goal-navigation information on top of this matrix. Currently I just flatten the matrix and append all relative goal locations at the end, and use this as the state.
For example, take the grid world shown below (0 means an empty cell, 1 an agent, 2 an obstacle, and 3 a goal):
[[0 0 0 0 0 0 2 2 0 0]
[0 0 0 0 0 0 0 0 0 0]
[0 0 2 2 0 0 0 0 0 0]
[0 3 2 2 0 0 0 0 0 2]
[0 0 0 0 0 0 0 0 0 2]
[0 0 0 0 1 0 0 0 0 2]
[2 0 0 0 0 2 2 0 3 0]
[2 0 0 0 0 2 2 0 0 0]
[0 0 0 0 0 0 0 0 0 0]
[0 0 2 0 0 1 0 0 0 0]]
The agent at row 5, col 4 sees the following cells within distance 1 of itself:
[[0. 0. 0.]
[0. 1. 0.]
[0. 0. 2.]]
Flattened, the matrix becomes:
[0,0,0,0,1,0,0,0,2]
The location of the goal at row 3, col 1 relative to that agent is (5-3=2, 4-1=3).
The location of the goal at row 6, col 8 relative to that agent is (5-6=-1, 4-8=-4).
So after appending the relative locations, the state of the agent becomes:
[0,0,0,0,1,0,0,0,2,2,3,-1,-4]
(Similar process for the other agent)
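For reference, this is roughly how I construct such a state (a simplified sketch; border handling is omitted):

import numpy as np

def build_state(grid, agent_pos, goal_positions, view=1):
    # Flattened local view plus relative goal locations.
    # Assumes the agent is not on the grid border (padding omitted).
    r, c = agent_pos
    local = grid[r - view:r + view + 1, c - view:c + view + 1]
    rel_goals = [(r - gr, c - gc) for gr, gc in goal_positions]
    return np.concatenate([local.flatten(), np.array(rel_goals).flatten()])

# build_state(grid, (5, 4), [(3, 1), (6, 8)]) -> [0 0 0 0 1 0 0 0 2 2 3 -1 -4]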
Is this a reasonable way of designing the state? My primary concern is that RL might not be able to tell the difference between the flattened matrix and the relative distances. If my concerns are founded, could you give some suggestions on how I should design the state?
Thanks in advance!
Edit: To validate my concern, I trained an agent using the REINFORCE policy-gradient algorithm. As I feared, the agent learned to avoid obstacles but otherwise just moved randomly without navigating towards the goals.
I didn't find a way to improve state design, but I did find a workaround in making my PG network modular.
I simply separated my PG network into two parts -- one taking in just the flattened grid matrix part from the aforementioned state, and the other taking in just the relative goal locations. Then I concatenated the outputs from the two sub-networks and passed them through a softmax layer to get the final policy.
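A rough sketch of that architecture (layer sizes and names here are illustrative, not the exact code from the linked files):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ModularPolicy(nn.Module):
    def __init__(self, grid_size=9, n_goals=2, n_actions=5, hidden=32):
        super().__init__()
        # One branch for the flattened local grid, one for the relative goal locations.
        self.grid_branch = nn.Sequential(nn.Linear(grid_size, hidden), nn.ReLU())
        self.goal_branch = nn.Sequential(nn.Linear(2 * n_goals, hidden), nn.ReLU())
        self.head = nn.Linear(2 * hidden, n_actions)

    def forward(self, grid_flat, goal_rel):
        h = torch.cat([self.grid_branch(grid_flat), self.goal_branch(goal_rel)], dim=-1)
        return F.softmax(self.head(h), dim=-1)  # final policy over the actions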
If you want more details you can check out my code here (the relevant files are MARL_PolicyGradient.py, MARL_env.py, and MARL_networks.py). Good luck!
I want to create an NN layer such that:
for an input of size 100, assume every 5 samples form a "block"
the layer should compute, let's say, 3 values for every block
so the input/output sizes of this layer should be: 100 -> 20*3
every block of size 5 (and only this block) is fully connected to its result block of size 3
If I understand it correctly I can use Conv2d for this problem. But I'm not sure how to correctly choose conv2d parameters.
Is Conv2d suitable for this task? If so, what are the correct parameters? Is that
input channels = 100
output channels = 20*3
kernel = (5,1)
?
You can use either Conv2D or Conv1D.
With the data shaped like batch x 100 x n_features you can use Conv1D with this setup:
Input channels: n_features
Output channels: 3 * output_features
kernel: 5
strides: 5
This way, the kernel is applied to each block of 5 samples and generates 3 outputs per block. The values of n_features and output_features can be anything you like and might as well be 1. Setting the stride to 5 makes the convolution non-overlapping, so each block contributes to exactly one output position.
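For instance, a PyTorch sketch of this setup with n_features = output_features = 1 (note that PyTorch's Conv1d expects input shaped batch x channels x length):

import torch
import torch.nn as nn

# Length-100 input -> 20 non-overlapping blocks of 5 -> 3 values per block.
layer = nn.Conv1d(in_channels=1, out_channels=3, kernel_size=5, stride=5)

x = torch.randn(8, 1, 100)  # batch x channels x length
y = layer(x)                # shape (8, 3, 20): 3 values for each of the 20 blocks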
I am working on a problem that I want to implement as a reinforcement learning problem and integrate into OpenAI Gym. My states are lists of length n, in which each element is chosen from the discrete interval [0, m].
For example, for n=6 and m=3, this is a sample from the observation space:
[0 2 1 3 3 2]
The states accessible from a given state are the other lists obtained by changing k of the elements in the list to another value from the same interval [0, m].
For example, for k=1 the following are two possible successor states of the previous state:
[0 2 2 3 3 2]
or
[0 3 1 3 3 2]
My question is: what is an efficient way to represent the "actions" in OpenAI Gym for such a scenario?
One way that comes to mind is to just use the next state as the action itself. For example, if I write:
action = env.action_space.sample()
the action would be the next state (which implicitly contains the action), and then env.step(action) would simply set the state equal to the next state.
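Here is a rough sketch of what I mean (the spaces and the legality check are just illustrative):

import numpy as np
from gym import spaces

n, m, k = 6, 3, 1

# The action *is* the proposed next state.
observation_space = spaces.MultiDiscrete([m + 1] * n)
action_space = spaces.MultiDiscrete([m + 1] * n)

def step(state, action):
    # Optionally reject "illegal" actions that change more than k elements.
    if np.count_nonzero(np.asarray(action) != np.asarray(state)) > k:
        raise ValueError("action changes more than k elements")
    return np.asarray(action)  # the next state is simply the action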
Does anyone know a better way, or is the implicit action representation via the next state the best approach?
Does anyone know a predefined gym environment that also has the same representation?
What are the cons of the implicit action representation that I just explained?
Cellular automaton
A cellular automaton can be seen as an array of bits, plus a computation table that dictates how the bits must be continuously updated as a function of their neighbors. For example,
111 -> 0
110 -> 0
101 -> 0
100 -> 1
011 -> 1
010 -> 1
001 -> 1
000 -> 0
That table dictates that whenever the array contains a 110 sequence, the middle bit must flip. This is repeated over and over globally, causing the array to evolve in interesting ways. Such computation can be performed efficiently on GPUs, since one can easily pre-load slices into the shared memory of a Streaming Multiprocessor.
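For instance, a plain (CPU-side) sketch of one synchronous update step, ignoring the GPU details:

import numpy as np

# rule[(left, center, right)] gives the new value of the center bit.
rule = {(1, 1, 1): 0, (1, 1, 0): 0, (1, 0, 1): 0, (1, 0, 0): 1,
        (0, 1, 1): 1, (0, 1, 0): 1, (0, 0, 1): 1, (0, 0, 0): 0}

def step(bits):
    padded = np.pad(bits, 1)  # treat cells beyond the ends as 0
    triples = zip(padded[:-2], padded[1:-1], padded[2:])
    return np.array([rule[(int(l), int(c), int(r))] for l, c, r in triples])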
Cellular automaton with insertions and deletions
Now, suppose we have a different kind of automaton, where the array size can change dynamically, and in which certain rules cause a new bit to be inserted. For example:
111 -> 0
110 -> 00
101 ->
100 -> 1
011 -> 00
010 ->
001 -> 1
000 -> 0
This is similar to the previous computation, except that now, whenever there is a 110 sequence in the array, not only must the middle bit flip, but a new bit, 0, must be inserted right next to it. Moreover, when we have the 101 sequence, the middle bit must be removed.
Obviously, implementing this new problem using the same data structure, an array, would be prohibitive, since inserting a bit in the middle of an array requires shifting all the subsequent elements one index to the right, which would be extremely expensive.
Question
Is there any clever data structure or general approach that allows this computation to be performed efficiently on the GPU?
The first thing that comes to mind is a linked list: an insertion or deletion only affects the neighboring elements, while the others can keep their references.
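For example, a minimal (CPU-side) sketch of the idea, with the GPU memory layout left out:

class Node:
    # Doubly linked list node holding a single bit.
    def __init__(self, bit, prev=None, nxt=None):
        self.bit, self.prev, self.nxt = bit, prev, nxt

def insert_after(node, bit):
    # Only the node and its right neighbor are touched.
    new = Node(bit, prev=node, nxt=node.nxt)
    if node.nxt:
        node.nxt.prev = new
    node.nxt = new

def remove(node):
    # Only the two neighbors are touched.
    if node.prev:
        node.prev.nxt = node.nxt
    if node.nxt:
        node.nxt.prev = node.prev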
This is a screenshot of the applet LogiCell 1.0, a link to which I found here.
As the bottom-left corner shows, this is computing the sum 0+1, and the result is 01b (bottom right-hand side).
I am not able to link what is displayed to what the inputs and outputs are. For example, in this case, looking at the screenshot, how do you determine that the inputs are 0 and 1 and the output is 01?
From the documentation:
An eater manages the output. The red displayed cell only is activated if an eater absorbs a glider. This cell is the output.
Yet do note that this is a transient situation you have to be measuring for, with a certain periodicity. If you keep running the automaton after that square is set, the eater is designed to return to its original form. From the PDF:
To design efficient circuits we need to somehow stop a stream of gliders to prevent the gliders from "polluting" the computational space. There are compact stable patterns, called eaters, that consume gliders and then recover back to their original form.
Since we have two bits of output (MSB and LSB) I've highlighted their "eaters"/outputs:
The addition is defined according to boolean operations:
A B | A+B
---------
0 0 | 0 0
1 0 | 0 1
0 1 | 0 1
1 1 | 1 0
MSB = A and B
LSB = (A or B) and (not (A and B))
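In code, the two output bits are simply (a small illustrative snippet):

def half_adder(a, b):
    msb = a and b                     # carry bit
    lsb = (a or b) and not (a and b)  # equivalent to a XOR b
    return int(msb), int(lsb)

# half_adder(0, 1) -> (0, 1), matching the 0 + 1 = 01b example above.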
It makes sense that you'd be able to compute the MSB faster than the LSB, hence it can be gathered "earlier" (closer to the top of the screen). Just watch the simulation and see that when the bits should be one, the corresponding eater consumes a glider - when they should be zero, the glider streams are stopped before they can reach the eater.
As for how to set up the inputs, it really comes down to whether a single square is on or off in the input construction. You can see this yourself by clicking an input (say A) and then OK, and then clicking it again: