Custom OpenAI Gym Environment with Stable-Baselines - reinforcement-learning

I am trying to create a simple 2D grid world OpenAI Gym environment in which the agent navigates to a terminal cell from anywhere in the grid. For example, in the 5x5 grid world below, X is the agent's current location and O is the terminal cell the agent is headed to.
.....
.....
..X..
.....
....O
My action space is defined as a discrete value in [0, 4), representing up, left, down and right respectively. The observation space is a 1D box denoting the agent's current position in the grid world, for example [12] (indices run from 0 to size*size-1). I am wondering what the differences are between the various ways of defining the observation space. For example, besides my current definition, the observation space for the same environment could be defined as follows, to name a few:
a discrete value i, where i represents the current location of the agent;
a 2D matrix of all zeros except the agent's current location, which is 1;
maybe others. How are these different in terms of Stable-Baselines algorithms or otherwise?
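For concreteness, here is how these variants could be declared with gym.spaces (a minimal sketch; the dtypes and variable names are illustrative choices, not from the question):

```python
import numpy as np
from gym import spaces

size = 5  # the 5x5 grid from the question

# Variant from the question: a 1D Box holding the flat cell index.
obs_box_index = spaces.Box(low=0, high=size * size - 1, shape=(1,), dtype=np.int64)

# Alternative 1: a single discrete value i for the agent's cell.
obs_discrete = spaces.Discrete(size * size)

# Alternative 2: a 2D one-hot grid, all zeros except a 1 at the agent's cell.
obs_one_hot = spaces.Box(low=0, high=1, shape=(size, size), dtype=np.int8)

# Building each observation from an agent position (row, col):
row, col = 2, 2
obs1 = np.array([row * size + col], dtype=np.int64)
obs2 = row * size + col
obs3 = np.zeros((size, size), dtype=np.int8)
obs3[row, col] = 1
```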

What are the differences between ways of defining the observation space?
I think the better question is:
What are the reasons for the differences between ways of defining the observation space?
In order to define the observation space, two things need to be determined:
What information does the algorithm need?
This is largely determined by what information you can collect and by the agent's objective. E.g. if you want the agent to reach a target in a maze, you may provide info like the agent's current location, the direction of obstacles around the agent, the target direction, etc.
What form should the input information take?
This is largely determined by the information you use and by the agent's solution (i.e. the algorithm itself). Sometimes you have multiple choices and need to run experiments to find out which works best for a given algorithm, just like the few you listed.
So, in general, the reason for the different ways of defining the observation space is to better suit different objectives and algorithms.

Related

Deep reinforcement learning for similar observations that need totally different actions: how to solve it?

For DRL using neural networks, like DQN, if there is a task that needs totally different actions for similar observations, is the NN going to show a weakness here? Will two nearby inputs to the NN generate similar outputs? If so, can it produce the distinct actions the task needs?
For instance:
the agent can choose a discrete action from [A,B,C,D,E]; the observation is given by a set of plugs as a binary list [0,0,0,0,0,0,0].
The observations [1,1,1,1,1,1,1] and [1,1,1,1,1,1,0] are quite similar, but the agent should conduct action A at [1,1,1,1,1,1,1] and action D at [1,1,1,1,1,1,0]. These two observations are very close in distance, so may the DQN struggle to learn the proper action for each? How can this be solved?
One more thing:
One-hot encoding is a way to increase the distance between observations, and it is also a common and useful technique for many supervised learning tasks. But one-hot encoding also increases the dimensionality heavily.
Will two nearby inputs to the NN generate similar outputs?
Artificial neural networks are, by nature, non-linear function approximators, meaning that for two given similar inputs, the outputs can be very different.
You can get an intuition for this from adversarial examples: two very similar pictures, where one just has some light noise added to it, can give very different results for a model.
The observations [1,1,1,1,1,1,1] and [1,1,1,1,1,1,0] are quite similar, but the agent should conduct action A at [1,1,1,1,1,1,1] and action D at [1,1,1,1,1,1,0]. These two observations are very close in distance, so may the DQN struggle to learn the proper action for each?
I see no problem with this example: a properly trained NN should be able to map the desired action to both inputs. Furthermore, in your example the input vectors contain binary values, so a single difference between these vectors (meaning they have a Hamming distance of 1) is big enough for the neural net to classify them properly.
Also, note that the non-linearity in neural networks comes from the activation functions. Hope this helps!
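As a quick sanity check of this claim (my own illustrative sketch, not part of the original answer), a tiny Keras MLP can learn to map the two inputs from the question, which differ in a single bit, to completely different actions:

```python
import numpy as np
from tensorflow import keras

# The two observations from the question (Hamming distance 1).
X = np.array([[1, 1, 1, 1, 1, 1, 1],
              [1, 1, 1, 1, 1, 1, 0]], dtype=np.float32)
# Desired actions: A -> index 0, D -> index 3 (of the action list [A,B,C,D,E]).
y = np.array([0, 3])

model = keras.Sequential([
    keras.layers.Dense(16, activation="relu", input_shape=(7,)),
    keras.layers.Dense(5, activation="softmax"),  # one output unit per action
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(X, y, epochs=500, verbose=0)

print(model.predict(X).argmax(axis=1))  # expected: [0 3]
```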

Regression with given upper and lower bounds for the target value

I am using several regressors like xgboost, gradient boosting, random forest or decision tree to predict a continuous target value.
I have some complementary information: I know that my prediction (target value), based on all the features I have, should lie in a given range.
Is there any way to more effectively take these bounds into consideration in any of these algorithms, instead of just verifying the range on already-predicted values and doing some post-processing?
Note that simply supplying the lower and upper bounds for my target value does not mean these algorithms will necessarily learn to keep predictions within the given range. I am looking for a more effective way to take these bounds into consideration as given data.
Thanks
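For reference, the post-processing baseline the question wants to improve on could look like this (a minimal sketch with made-up data; in practice X, y and the bounds come from the actual problem):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Illustrative data: 200 samples, 4 features, a noisy linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=200)
lower, upper = -2.0, 2.0  # the known target range (assumed given)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
# Post-processing step: clip predictions into the known range afterwards.
y_pred = np.clip(model.predict(X), lower, upper)
```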

Recurrent NNs: what's the point of parameter sharing? Doesn't padding do the trick anyway?

The following is how I understand the point of parameter sharing in RNNs:
In regular feed-forward neural networks, every input unit is assigned its own parameters, which means the number of input units (features) determines the number of parameters to learn. For image data, for example, the number of input units is the same over all training examples (usually a constant: pixel height * pixel width * RGB channels).
However, sequential input data like sentences can come in highly varying lengths, which means the number of parameters would differ depending on which example sentence is processed. That is why parameter sharing is necessary for efficiently processing sequential data: it ensures the model always has the same input size regardless of sequence length, since it is specified in terms of the transition from one state to another. It is thus possible to use the same transition function with the same weights (input-to-hidden, hidden-to-output, hidden-to-hidden) at every time step. The big advantage is that this allows generalization to sequence lengths that did not appear in the training set.
My questions are:
Is my understanding of RNNs, as summarized above, correct?
In the actual Keras code example I looked at for LSTMs, they padded the sentences to equal length beforehand. Doesn't doing so defeat the whole purpose of parameter sharing in RNNs?
Parameter Sharing
Being able to efficiently process sequences of varying length is not the only advantage of parameter sharing. As you said, you can achieve that with padding. The main purpose of parameter sharing is to reduce the number of parameters the model has to learn. This is the whole purpose of using an RNN.
If you learned a different network for each time step and fed the output of the first model to the second, and so on, you would end up with a regular feed-forward network. For 20 time steps, you would have 20 models to learn. In convolutional nets, parameters are shared by the convolutional filters because we can assume that similar interesting patterns occur in different regions of the picture (for example a simple edge). This drastically reduces the number of parameters we have to learn. Analogously, in sequence learning we can often assume that similar patterns occur at different time steps. Compare 'Yesterday I ate an apple' and 'I ate an apple yesterday'. These two sentences mean the same, but the 'I ate an apple' part occurs at different time steps. By sharing parameters, you only have to learn what that part means once. Otherwise, you'd have to learn it for every time step at which it could occur in your model.
There is a drawback to sharing the parameters. Because our model applies the same transformation to the input at every time step, it now has to learn a transformation that makes sense for all time steps. So it has to remember which word came in which time step, i.e. 'chocolate milk' should not lead to the same hidden and memory state as 'milk chocolate'. But this drawback is small compared to using a large feed-forward network.
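To make the sharing concrete, here is a minimal vanilla-RNN forward pass in NumPy (my own sketch, with arbitrary dimensions): the same weight matrices are applied at every time step, so the parameter count is independent of sequence length:

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 8, 16

# One set of weights, shared across all time steps.
W_xh = rng.normal(scale=0.1, size=(input_dim, hidden_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

def rnn_forward(xs):
    """Run the same transition function over a sequence of any length."""
    h = np.zeros(hidden_dim)
    for x in xs:  # xs has shape (T, input_dim); T can vary freely
        h = np.tanh(x @ W_xh + h @ W_hh + b_h)  # identical weights each step
    return h

h7 = rnn_forward(rng.normal(size=(7, input_dim)))    # length-7 sequence
h30 = rnn_forward(rng.normal(size=(30, input_dim)))  # length-30: same parameters
```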
Padding
As for padding the sequences: the main purpose is not directly to let the model handle sequences of varying length; as you said, that is achieved by parameter sharing. Padding is used for efficient training - specifically, to keep the number of computational graphs needed during training low. Without padding, we have two options for training:
We unroll the model for each training sample. So when we have a sequence of length 7, we unroll the model to 7 time steps, feed the sequence, do back-propagation through the 7 time steps and update the parameters. This seems intuitive in theory, but in practice it is inefficient, because TensorFlow's (static) computational graphs don't allow recurrency; they are feed-forward.
The other option is to create the computational graphs before starting training. We let them share the same weights and create one computational graph for every sequence length in our training data. But when our dataset has 30 different sequence lengths, this means 30 different graphs during training, so for large models this is not feasible.
This is why we need padding. We pad all sequences to the same length and then only need to construct one computational graph before starting training. When you have both very short and very long sequence lengths (5 and 100, for example), you can use bucketing and padding. This means you pad the sequences to different bucket lengths, for example [5, 20, 50, 100], and then create a computational graph for each bucket. The advantage is that you don't have to pad a sequence of length 5 up to 100, which would waste a lot of time on "learning" the 95 padding tokens in there.
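For illustration, padding in Keras (the framework mentioned in the question) is typically done with pad_sequences; the sample sequences here are made up:

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

seqs = [[3, 7, 2], [9, 1], [4, 4, 4, 4, 4, 4, 4]]  # token ids of varying length
padded = pad_sequences(seqs, maxlen=7, padding="post", value=0)
# array([[3, 7, 2, 0, 0, 0, 0],
#        [9, 1, 0, 0, 0, 0, 0],
#        [4, 4, 4, 4, 4, 4, 4]])
```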

The number of output units and the loss function in a binary classification network

Say I have a binary classification task, and I build a neural network to do this.
There are two different setups to choose from. In the first, the network has one output unit indicating the probability of belonging to one of the classes, so I can use binary cross-entropy to compute the loss. In the second, the network has two output units indicating the probabilities of belonging to each of the two classes separately, and I can use softmax cross-entropy to compute the loss.
Some suggest using the first option. My confusion is about what the pros and cons of the two options are, and what the most severe problem is if I choose the second setup. Can anyone explain this to me in detail? Thanks in advance.
If you use one output unit, then you should understand that you are choosing strictly between two classes: if the probability is high enough, your network chooses class A; otherwise, it chooses class B. If you have two output units, your network may produce a rather low probability for both units, so you can end up with neither A nor B. You should choose between these two approaches depending on the real system you are trying to model with your network.
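For reference, the two setups could be wired up in Keras like this (a minimal sketch; the hidden layer size and the input dimension of 10 are arbitrary choices of mine):

```python
from tensorflow import keras

# Setup 1: one sigmoid output unit + binary cross-entropy.
one_unit = keras.Sequential([
    keras.layers.Dense(32, activation="relu", input_shape=(10,)),
    keras.layers.Dense(1, activation="sigmoid"),
])
one_unit.compile(optimizer="adam", loss="binary_crossentropy")

# Setup 2: two softmax output units + categorical cross-entropy.
two_units = keras.Sequential([
    keras.layers.Dense(32, activation="relu", input_shape=(10,)),
    keras.layers.Dense(2, activation="softmax"),
])
two_units.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```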

Randomly Assigning Positions

Here's my basic problem. Let's say I have 50 employees working on a certain day, and I want my program to randomly distribute them to positions (i.e. front desk, phones, etc.) based on what they have been trained on. The program already knows what each employee has been trained on. What is the best method, pragmatically, to go through and assign an employee to each of the 50 positions?
P.S. I am programming this in Access using VBA, but this is more a question of process than actual code.
Hi lukewarm,
You are looking for a maximum bipartite matching. This is a problem from graph theory. It boils down to determining the maximum flow in a bipartite graph with constant edge capacities of 1:
Divide all vertices in your graph into two separate sets. The first set contains all your workers, the second one all available positions.
Now insert an edge from every worker to every position she/he is able to work.
Insert two more vertices: a source and a sink. Connect the source with every worker vertex and the sink with every position vertex.
Determine the maximum flow from source to sink.
Hope I could help, greetings.
EDIT: Support for randomness
Since maximum bipartite matching/maximum flow is found by a deterministic algorithm, it would always return the same result. To change that, you could shuffle the order of the edges in the graph before applying the algorithm.
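As a sketch of this idea in Python (the asker is using VBA, so this is purely illustrative, and the qualification data is made up): networkx's Hopcroft-Karp algorithm finds a maximum bipartite matching, and shuffling the edges beforehand can vary which of several equally valid assignments comes out:

```python
import random
import networkx as nx

# Illustrative qualification data: worker -> positions they are trained on.
qualified = {
    "alice": ["front desk", "phones"],
    "bob": ["phones"],
    "carol": ["front desk", "stock"],
}

# Shuffle the edges before building the graph to randomize the result.
edges = [(w, p) for w, ps in qualified.items() for p in ps]
random.shuffle(edges)

G = nx.Graph()
G.add_nodes_from(qualified, bipartite=0)  # worker side of the bipartition
G.add_edges_from(edges)                   # positions are added implicitly

matching = nx.bipartite.hopcroft_karp_matching(G, top_nodes=qualified.keys())
print({w: p for w, p in matching.items() if w in qualified})
# e.g. {'alice': 'front desk', 'bob': 'phones', 'carol': 'stock'}
```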
In your position table, have a sequence (1, 2, 3, 4) and a count of positions to be filled. Then look at what the person did yesterday and add 1 to the position sequence, so they're assigned to the next position. If there are already enough people for that position today, then go to the next priority position.
Not random, but maybe close enough.