In OpenAI Gym environments, is the initial state random or specific? - reinforcement-learning

Is the initial state randomly selected in reinforcement learning environments like OpenAI Gym? In other words, does calling env.reset() produce a randomly selected initial state or a specific one?

Usually yes, it is random. However, it is best to look at the source code of the environment to be sure. For instance, the pendulum's initial state is drawn uniformly from the whole state space, while for the mountain car the initial position is drawn uniformly from [-0.6, -0.4] and the initial velocity is always 0.
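A quick way to check this yourself is to reset the environment a few times and look at the initial observations. A minimal sketch, assuming the standard Gym IDs below are available in your installation (the exact version suffix and whether reset() returns obs or (obs, info) depend on your gym release):

```python
import gym  # pip install gym; env IDs/versions may differ by release

# Empirically inspect the reset distribution by resetting a few times
# and printing the initial observation.
for env_id in ["Pendulum-v1", "MountainCar-v0"]:
    env = gym.make(env_id)
    print(env_id)
    for _ in range(3):
        obs = env.reset()  # newer gym versions return (obs, info) instead
        print("  initial observation:", obs)
    env.close()
```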

Related

avoiding illegal states in openai gym

I'm trying to make a gym environment for a simulation problem. In my gym environment, I have a set of illegal states that I don't want my agent to enter. What is the easiest way to add such logic to my environment? Should I use the wrapper classes? I didn't quite get them. I tried extending the MultiDiscrete space by inheriting from it and overriding the MultiDiscrete.sample function to stop the environment from entering the illegal states, but is there a more efficient way to do it?
I had a similar problem where I needed to make a gym environment with a sort of pool in the center of a grid world where I didn't want the agent to go.
I represented the grid world as a matrix, and the pool had different depths that the agent could fall into, so the values at those locations were negative, proportional to the depth of the puddle.
When training agents, this negative reward discouraged the agent from falling into the puddle.
The code for the above environment is here and its usage is here.
Hope this helps.
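As an illustration of that approach (not the linked code, just a minimal sketch with made-up grid sizes, rewards, and names), a custom environment can simply assign a large negative reward to the puddle/illegal cells in its step() method:

```python
import gym
import numpy as np
from gym import spaces

class PuddleGridEnv(gym.Env):
    """Illustrative grid world: stepping onto a puddle cell yields a negative
    reward proportional to its depth, which discourages the agent from entering it."""

    def __init__(self):
        # 0 = free cell, negative values = puddle penalty (deeper = more negative)
        self.grid = np.zeros((5, 5))
        self.grid[2, 2] = -10.0                  # deep puddle in the center
        self.observation_space = spaces.MultiDiscrete([5, 5])
        self.action_space = spaces.Discrete(4)   # up, down, left, right
        self.pos = np.array([0, 0])

    def reset(self):
        self.pos = np.array([0, 0])
        return self.pos.copy()

    def step(self, action):
        moves = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}
        self.pos = np.clip(self.pos + moves[action], 0, 4)
        reward = -1.0 + self.grid[tuple(self.pos)]  # step cost plus puddle penalty
        done = bool((self.pos == [4, 4]).all())     # episode ends at the far corner
        return self.pos.copy(), reward, done, {}
```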

openai-gym pong: How can I make reset() more random

I implemented a DQN agent, and after some hours of learning the reward steadies at 20-21.
When I watch the agent play, I can see the same move played again and again: on reset the env always shoots the ball in the same direction, and my agent learned to play that exact move and never lose.
Is this the expected behavior of the gym Pong env? How can I make the env reset more random?
I'm using the NoopResetEnv wrapper but it doesn't help!
The agent acting the same way can be traced to two causes: the model itself and the Pong env.
For the model: if you are training a DQN model, the vanilla DQN model is deterministic, which means it will give the same action in the same situation. What you can try is to add a little randomness to the model, such as taking a random action with probability 0.1. For example, in Stable Baselines you can control this with the 'deterministic' argument to predict (True for deterministic actions).
From the env's perspective, I have not tried it myself, but there is a seed parameter in the OpenAI gym Atari env: you can set a seed for it with env.seed(your_seed). Check here and GitHub for more information.
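A minimal sketch of both suggestions, assuming the Atari dependencies are installed and using the older gym API (env.seed() and a 4-tuple step()); greedy_action is a placeholder for your trained model's prediction:

```python
import random
import gym

# Sketch: add epsilon-greedy randomness on top of a deterministic policy
# and seed the Atari env for reproducibility.
def greedy_action(obs):
    return 0  # placeholder for something like model.predict(obs, deterministic=True)

env = gym.make("PongNoFrameskip-v4")
env.seed(42)          # older gym API; newer releases use env.reset(seed=42)

epsilon = 0.1
obs = env.reset()
done = False
while not done:
    if random.random() < epsilon:
        action = env.action_space.sample()   # occasional random action
    else:
        action = greedy_action(obs)
    obs, reward, done, info = env.step(action)
env.close()
```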

Why is a target network required?

I have trouble understanding why a target network is necessary in DQN. I'm reading the paper "Human-level control through deep reinforcement learning".
I understand Q-learning. Q-learning is a value-based reinforcement learning algorithm that learns an estimate of the "optimal" state-action values, which maximize the long-term discounted reward over a sequence of timesteps.
Q-learning is updated using the Bellman equation, and a single step of the Q-learning update is given by
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ R_{t+1} + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$
where $\alpha$ and $\gamma$ are the learning rate and discount factor.
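In code, this tabular update is a single line. A minimal sketch with a NumPy Q-table; the state/action counts and hyperparameters are illustrative:

```python
import numpy as np

n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99

def q_update(s, a, r, s_next):
    # One step of the tabular Q-learning update shown above.
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
```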
I can understand the claim that the reinforcement learning algorithm can become unstable and diverge.
The experience replay buffer is used so that we do not forget past experiences and so that the samples used for training are de-correlated.
This is where I fail.
Let me break the paragraph from the paper down here for discussion
The fact that small updates to $Q$ may significantly change the policy and therefore change the data distribution — I understood this part. Repeated changes to the Q-network may lead to instability and to shifts in the data distribution, for example if we end up always taking a left turn or something like that.
and the correlations between the action-values ($Q$) and the target values $r + \gamma \max_{a'} Q(s', a')$ — this says the target is the reward plus $\gamma$ times my prediction of the return, given that I take what I think is the best action in the next state and follow my policy from then on.
We used an iterative update that adjusts the action-values (Q) towards target values that are only periodically updated, thereby reducing correlations with the target.
So, in summary, a target network is required because the network keeps changing at each timestep and the "target values" are being updated at each timestep?
But I do not understand how the target network solves this.
The difference between Q-learning and DQN is that you have replaced an exact value function with a function approximator. With Q-learning you are updating exactly one state/action value at each timestep, whereas with DQN you are updating many, which you understand. The problem this causes is that you can affect the action values for the very next state you will be in instead of guaranteeing them to be stable as they are in Q-learning.
This happens basically all the time with DQN when using a standard deep network (a bunch of fully connected layers of the same size). The effect you typically see is referred to as "catastrophic forgetting", and it can be quite spectacular. If you are doing something like lunar lander with this sort of network (the simple state version, not the pixel one) and track the rolling average score over the last 100 games or so, you will likely see a nice curve up in score, and then all of a sudden it completely craps out and starts making awful decisions again, even as your alpha gets small. This cycle will continue endlessly regardless of how long you let it run.
Using a stable target network as your error measure is one way of combating this effect. Conceptually it's like saying, "I have an idea of how to play this well, I'm going to try it out for a bit until I find something better," as opposed to, "I'm going to retrain myself how to play this entire game after every move." By giving your network more time to consider many recent actions instead of updating the target all the time, it hopefully finds a more robust model before you start using it to select actions.
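A minimal runnable sketch of the mechanism, with a linear Q-function standing in for the deep network (all sizes, names, and the sync period are illustrative, not the paper's): the TD target is computed from a frozen copy of the weights that is only refreshed every sync_every steps.

```python
import numpy as np

# Linear Q-function: W has shape [n_actions, n_features].
n_features, n_actions = 8, 4
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(n_actions, n_features))   # online network
W_target = W.copy()                                        # frozen copy used only for targets

alpha, gamma, sync_every = 0.01, 0.99, 1000

def dqn_update(step, s, a, r, s_next, done):
    global W_target
    # The TD target uses the *frozen* weights, so it does not shift after
    # every gradient step on the online weights W.
    target = r if done else r + gamma * np.max(W_target @ s_next)
    td_error = target - W[a] @ s
    W[a] += alpha * td_error * s          # gradient step on the online network only
    if step % sync_every == 0:
        W_target = W.copy()               # periodically sync the target network

# Illustrative call with random transition data.
s, s2 = rng.normal(size=n_features), rng.normal(size=n_features)
dqn_update(step=1, s=s, a=2, r=1.0, s_next=s2, done=False)
```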
On a side note, DQN is essentially obsolete at this point, but the themes from that paper were the fuse leading up to the RL explosion of the last few years.

Choosing the active features for function approx with radial basis functions in reinforcement learning?

I don't understand how eligibility traces fit in with reinforcement learning when using radial basis functions (RBFs) to approximate the value function with continuous state variables. In particular, how do you decide which features are 'active' for a given state?
When using tile coding, or coarse coding, each tile (not each tiling) is essentially a feature, and so the eligibility trace for each tile is incremented (how depends on whether you're using replacing or accumulating traces) when the state passes through that tile, while the other tiles do not have their traces incremented. However, when using radial basis functions the features are the distances between the state and the centers of the RBF network, evaluated by the chosen kernel. These can be evaluated for any position of the state and any position of the center, so there's not a clear picture of which features are activated for a given state (they are all essentially activated to a greater or lesser degree), and so it's not clear which features should have their traces incremented.
How should one adjust eligibility traces of features generated by RBFs at each time step of a simulation?
Do I need to assume the kernels of the RBFs are truncated?
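For reference, a minimal sketch of how Gaussian RBF features are typically computed for a continuous state, together with the standard accumulating-trace update for linear function approximation, where the trace vector is incremented by the (graded) feature activations rather than by a binary "active" indicator. Centers, widths, and hyperparameters below are illustrative:

```python
import numpy as np

# Illustrative RBF feature map: Gaussian kernels with fixed centers and width.
centers = np.linspace(-1.0, 1.0, 10).reshape(-1, 1)   # 10 centers for a 1-D state
sigma = 0.2

def rbf_features(state):
    # Every feature is active to some degree; activations decay with distance.
    return np.exp(-np.sum((state - centers) ** 2, axis=1) / (2 * sigma ** 2))

# Standard linear-FA trace update (Sutton & Barto style): the trace is a vector
# over all features and is incremented by the feature activations themselves.
gamma, lam = 0.99, 0.9
w = np.zeros(len(centers))          # value-function weights
z = np.zeros(len(centers))          # eligibility trace

def td_lambda_step(s, r, s_next, alpha=0.05):
    global z
    phi = rbf_features(np.atleast_1d(s))
    phi_next = rbf_features(np.atleast_1d(s_next))
    delta = r + gamma * w @ phi_next - w @ phi
    z = gamma * lam * z + phi       # accumulating traces; graded, not binary
    w += alpha * delta * z
```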

Rewards in Q-Learning and in TD(lambda)

How do rewards work in these two RL techniques? I mean, they both improve the policy and its evaluation, but not the rewards.
Do I need to guess the rewards from the beginning?
You don't need to guess the rewards. The reward is feedback from the environment; rewards are parameters of the environment. The algorithm works under the condition that the agent can only observe this feedback, the state space, and the action space.
The key idea of Q-learning and TD is asynchronous stochastic approximation, where we approximate the Bellman operator's fixed point using noisy evaluations of the expected long-term reward.
For example, if we want to estimate the expectation of a Gaussian distribution, we can sample from it and average the samples.
Reinforcement Learning is for problems where the AI agent has no information about the world it is operating in. So Reinforcement Learning algorithms not only give you a policy / optimal action at each state, they also navigate a completely unfamiliar environment (with no knowledge about which action will lead to which resulting state) and learn the parameters of this new environment. Those are model-based Reinforcement Learning algorithms.
Q-Learning and Temporal Difference Learning, on the other hand, are model-free reinforcement learning algorithms. The AI agent does the same thing as in the model-based case, but it does not have to learn the model (things like transition probabilities) of the world it is operating in. Through many iterations it comes up with a mapping from each state to the optimal action to be performed in that state.
Now, coming to your question: you do not have to guess the rewards at different states. Initially, when the agent is new to the environment, it just chooses a random action to perform from the state it is in and gives it to the simulator. The simulator, based on the transition functions, returns the resulting state of that state-action pair and also returns the reward for being in that state.
The simulator is analogous to Nature in the real world. For example, if you find something unfamiliar in the world and perform some action, like touching it, and the thing turns out to be a hot object, Nature gives a reward in the form of pain, so that the next time you know what happens when you try that action. When programming this, it is important to note that the workings of the simulator are not visible to the AI agent that is trying to learn the environment.
Depending on the reward the agent senses, it backs up its Q-value (in the case of Q-Learning) or utility value (in the case of TD-Learning). Over many iterations these Q-values converge, and you are able to choose an optimal action for every state based on the Q-values of the state-action pairs.
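To make the "rewards come from the environment, not from you" point concrete, here is a minimal tabular Q-learning loop against a gym environment. FrozenLake is used purely as an example, the hyperparameters are illustrative, and the older 4-tuple step() API is assumed; the agent never specifies rewards, it only reads them from env.step():

```python
import numpy as np
import gym

env = gym.make("FrozenLake-v1")          # any discrete-state env works here
Q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, epsilon = 0.1, 0.99, 0.1

for episode in range(5000):
    s = env.reset()                       # newer gym versions return (obs, info)
    done = False
    while not done:
        # epsilon-greedy action selection
        a = env.action_space.sample() if np.random.rand() < epsilon else np.argmax(Q[s])
        s_next, r, done, info = env.step(a)   # the reward r is given by the environment
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next
```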