openai-gym pong: How can I make reset() more random? - reinforcement-learning

I implemented a DQN agent, and after some hours of learning the reward is steady at 20-21.
When I watch the agent play, I can see the same move played again and again: on reset, the env always shoots the ball in the same direction, and my agent learned to play that exact move and never lose.
Is this the expected behavior of the gym Pong env? How can I make the env reset more randomly?
I'm using the NoopResetEnv wrapper, but it doesn't help!

The agent acting the same way every time can be traced to two sources: the model itself and the Pong env.
For the model: if you are training a DQN, the vanilla DQN model is deterministic, which means it will always pick the same action in the same situation. What you can try is to add a little randomness to the action selection, such as taking a random action with probability 0.1. For example, in Stable Baselines the predict method takes a 'deterministic' argument; set it to False to sample stochastic actions instead of always taking the greedy one.
From the env's perspective: I have not tried it myself, but there is a seed parameter in the OpenAI gym Atari envs, so you can vary the seed between runs (env.seed(your_seed)). Check here and GitHub for more information.
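As a sketch of the first suggestion, epsilon-random action selection can be layered on top of any deterministic policy. The function below is a hypothetical stand-in, not part of gym or Stable Baselines:

```python
import random

def epsilon_random_action(greedy_action, n_actions, epsilon=0.1, rng=random):
    """With probability epsilon, discard the greedy choice and act randomly."""
    if rng.random() < epsilon:
        return rng.randrange(n_actions)  # uniform random exploratory action
    return greedy_action  # otherwise keep the model's deterministic choice

# epsilon=0 reduces to the deterministic policy:
assert epsilon_random_action(3, n_actions=6, epsilon=0.0) == 3
# epsilon=1 always explores, but stays within the action space:
assert 0 <= epsilon_random_action(3, n_actions=6, epsilon=1.0) < 6
```

Even a small epsilon at evaluation time is usually enough to break out of a single memorized rally.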

Related

How to train PPO using actions from matches already played?

The idea is to initially calibrate the neural network with some prior knowledge before releasing the algorithm to evolve on its own.
To make the question simpler, imagine that an agent can take 10 actions (discrete space). Instead of training the PPO algorithm to figure out by itself which actions are best for each state, I would like to perform a training by considering that some actions were performed in some states.
I'm using Stable Baselines with Gym.
I thought about creating an action wrapper like this:
class RandomActionWrapper(gym.ActionWrapper):
    def __init__(self, env):
        super(RandomActionWrapper, self).__init__(env)

    def action(self, action):
        # Ignore the model's chosen action and substitute a random one
        a = self.env.action_space.sample()
        return a
PS: this wrapper is just a proof of concept, choosing random actions all the time, but the model just doesn't learn that way (I simulated many iterations in ridiculously simple-to-learn custom environments, something like: "action 2 always results in reward=1 while other actions result in reward=0").
Apparently the network updates are made considering the actions the model chose (the model always predicts actions by itself), while the rewards are calculated based on the actions defined in my wrapper. This mismatch makes learning impossible.
I think you are looking for some kind of action mask implementation. In several games/environments, some actions are invalid in a particular state (it is not your case, but it could be a first approach). You can check this paper and the GitHub repo.
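As a sketch of the action-masking idea (the mask layout is illustrative, not the paper's exact formulation): invalid actions have their logits pushed to negative infinity before the softmax, so they receive exactly zero probability.

```python
import math

def masked_softmax(logits, valid_mask):
    """Softmax over logits, with invalid actions forced to probability 0."""
    masked = [l if ok else float("-inf") for l, ok in zip(logits, valid_mask)]
    m = max(x for x in masked if x != float("-inf"))  # for numerical stability
    exps = [math.exp(x - m) if x != float("-inf") else 0.0 for x in masked]
    total = sum(exps)
    return [e / total for e in exps]

probs = masked_softmax([2.0, 1.0, 0.5], [True, False, True])
assert probs[1] == 0.0                 # masked action gets zero probability
assert abs(sum(probs) - 1.0) < 1e-9    # remaining mass renormalizes to 1
```

Sampling from the masked distribution guarantees the agent never emits an invalid action, without altering the learning update for the valid ones.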
As PPO is an on-policy method, there is a mismatch between my generated data and the algorithm's cost function. There's no reason to insist on PPO here; I'll look into off-policy algorithms.
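One common way to use pre-recorded matches with an off-policy method is to pre-fill the replay buffer with the logged transitions before training starts. A minimal sketch, with a hypothetical buffer class rather than any specific library's implementation:

```python
import random
from collections import deque

class ReplayBuffer:
    """Tiny FIFO replay buffer storing (s, a, r, s', done) tuples."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        return rng.sample(self.buffer, batch_size)

# Seed the buffer with transitions recovered from already-played matches
logged_transitions = [(0, 2, 1.0, 1, False), (1, 2, 1.0, 2, False)]
buf = ReplayBuffer()
for t in logged_transitions:
    buf.add(*t)

assert len(buf.buffer) == 2
assert len(buf.sample(2)) == 2
```

An off-policy learner such as DQN can then train on these batches immediately, which amounts to the "calibrate with prior knowledge" step the question asks about.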

Decreasing action sampling frequency for one agent in a multi-agent environment

I'm using RLlib for the first time, trying to train a custom multi-agent RL environment, and I would like to train a couple of PPO agents on it. The implementation hiccup I need to figure out is how to alter the training for one special agent so that it only takes an action every X timesteps. Is it best to only call compute_action() every X timesteps? Or, on the other steps, to mask the policy selection so it has to re-sample until a no-op is chosen? Or to modify the action fed into the environment, plus the previous actions in the training batches, to be no-ops?
What's the easiest way to implement this that still takes advantage of RLlib's training features? Do I need to create a custom training loop, or is there a way to configure PPOTrainer to do this?
Thanks
Let t := timesteps so far. Give the special agent the feature t (mod X), and don't process its actions in the environment when t (mod X) != 0. This accomplishes two things:
the agent is, in effect, only taking actions every X timesteps, because you are ignoring all the others;
the agent can learn that only the actions taken every X timesteps affect future rewards.
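A minimal sketch of this idea (the environment interface and no-op id are stand-ins, not RLlib's actual API): keep a timestep counter, replace the special agent's action with a no-op whenever t mod X != 0, and expose t mod X as an observation feature.

```python
NOOP = 0  # hypothetical no-op action id for this environment

def filter_special_action(t, action, X, noop=NOOP):
    """Only let the special agent's action through every X timesteps."""
    return action if t % X == 0 else noop

def phase_feature(t, X):
    """Extra observation feature telling the agent where it is in the cycle."""
    return t % X

# Actions pass through at t = 0, X, 2X, ... and are ignored otherwise:
applied = [filter_special_action(t, action=3, X=4) for t in range(8)]
assert applied == [3, 0, 0, 0, 3, 0, 0, 0]
assert phase_feature(6, X=4) == 2
```

Because the filtering lives in the environment's step logic, the policy can still be sampled every timestep and RLlib's standard training loop is untouched.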

avoiding illegal states in openai gym

I'm trying to make a gym environment for a simulation problem. In my gym environment, I have a set of illegal states which I don't want my agent to enter. What is the easiest way to add such logic to my environment? Should I use the wrapper classes? I didn't quite get them. I tried to extend the MultiDiscrete space by inheriting from it and overriding the MultiDiscrete.sample function to stop the environment from going into the illegal states, but is there a more efficient way to do it?
I had a similar problem where I needed to make a gym environment with a sort of pool in the center of a grid world where I didn't want the agent to go.
So I represented the grid world as a matrix. The pool had different depths the agent could fall into, so the values at those locations were negative, proportional to the depth of the puddle.
When training agents, this negative reward discouraged the agent from falling into the puddle.
The code for the above environment is here and its usage is here.
Hope this helps.
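A tiny sketch of that reward layout (the grid values and penalty scale here are illustrative assumptions, not the linked code): each puddle cell carries a negative reward proportional to its depth.

```python
# 4x4 grid world: 0 = normal cell, positive numbers = puddle depth
depths = [
    [0, 0, 0, 0],
    [0, 1, 2, 0],
    [0, 2, 3, 0],
    [0, 0, 0, 0],
]

PENALTY_PER_DEPTH = -0.5  # hypothetical scale factor

def step_reward(row, col, base_reward=0.0):
    """Reward for entering a cell: base reward plus a depth-proportional penalty."""
    return base_reward + PENALTY_PER_DEPTH * depths[row][col]

assert step_reward(0, 0) == 0.0    # normal cell: no penalty
assert step_reward(2, 2) == -1.5   # deepest puddle cell: largest penalty
```

This keeps the illegal/undesirable states legal in the MDP; the agent simply learns to avoid them through the reward signal, with no change to the action space.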

Invalid moves in reinforcement learning

I have implemented a custom OpenAI gym environment for a game similar to http://curvefever.io/, but with discrete actions instead of continuous. So my agent can, at each step, go in one of four directions: left/up/right/down. However, one of these actions will always lead to the agent crashing into itself, since it can't "reverse".
Currently I just let the agent take any move and let it die if it makes an invalid move, hoping that it will eventually learn not to take that action in that state. I have, however, read that one can set the probability of making an illegal move to zero and then sample an action. Is there any other way to tackle this problem?
You can try to solve this with two changes:
1: Give the current direction as an input, and give a reward of maybe +0.1 if the agent takes a move that does not make it crash, and -0.7 if it makes a backward move that directly makes it crash.
2: If you are using a neural network with a softmax activation on the last layer, multiply all outputs of the neural network by a positive integer (a confidence factor) before passing them to the softmax. It can be in the range 0 to 100; in my experience, values above 100 do not change much. The larger the integer, the more confident the agent will be in its action for a given state.
If you are not using a neural network (i.e. deep learning), I suggest you learn the concepts of deep learning, as your game environment seems complex and a neural network will give the best results.
Note: It will take a huge amount of time, so you have to wait long enough for the algorithm to train. I suggest you not hurry and let it train. And I played the game, it's really interesting :) My best wishes for making an AI for the game :)
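The confidence factor in point 2 is effectively an inverse-temperature scaling of the softmax: multiplying the logits by a larger constant sharpens the distribution toward the best action. A sketch (the specific values are illustrative):

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def confident_softmax(logits, confidence):
    """Multiply logits by a confidence factor before the softmax.
    Larger confidence concentrates probability on the argmax."""
    return softmax([confidence * x for x in logits])

soft = confident_softmax([1.0, 0.5, 0.0], confidence=1)
sharp = confident_softmax([1.0, 0.5, 0.0], confidence=50)
assert sharp[0] > soft[0]   # higher confidence concentrates probability
assert sharp[0] > 0.99      # nearly deterministic at confidence 50
```

At confidence values around 100 the distribution is already close to one-hot, which matches the observation that going higher changes little.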

Openai gym environment for multi-agent games

Is it possible to use OpenAI's gym environments for multi-agent games? Specifically, I would like to model a card game with four players (agents). The player who scores in a turn starts the next turn. How would I model the necessary coordination between the players (e.g. whose turn it is next)? Ultimately, I would like to use reinforcement learning on four agents that play against each other.
Yes, it is possible to use OpenAI gym environments for multi-agent games. Although in the OpenAI gym community there is no standardized interface for multi-agent environments, it is easy enough to build an OpenAI gym that supports this. For instance, in OpenAI's recent work on multi-agent particle environments they make a multi-agent environment that inherits from gym.Env which takes the following form:
class MultiAgentEnv(gym.Env):
    def step(self, action_n):
        obs_n = list()
        reward_n = list()
        done_n = list()
        info_n = {'n': []}
        # ...
        return obs_n, reward_n, done_n, info_n
We can see that the step function takes a list of actions (one for each agent) and returns a list of observations, a list of rewards, and a list of dones, while stepping the environment forward. This interface is representative of a Markov Game, in which all agents take actions at the same time and each observes its own subsequent observation and reward.
However, this kind of Markov Game interface may not be suitable for all multi-agent environments. In particular, turn-based games (such as card games) might be better cast as an alternating Markov Game, in which agents take turns (i.e. actions) one at a time. For this kind of environment, you may need to include which agent's turn it is in the representation of state; your step function would then just take a single action, and return a single observation, reward, and done.
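A minimal sketch of the alternating (turn-based) variant. The game rules here are placeholders invented for illustration; only the turn-tracking interface matters:

```python
class TurnBasedEnv:
    """Toy alternating env: the state includes whose turn it is."""

    def __init__(self, n_players=4):
        self.n_players = n_players
        self.current_player = 0

    def reset(self):
        self.current_player = 0
        return self._obs()

    def _obs(self):
        # The acting player's id is part of the observation/state
        return {"current_player": self.current_player}

    def step(self, action):
        # A single action from the current player only
        reward = float(action)  # placeholder scoring rule
        self.current_player = (self.current_player + 1) % self.n_players
        done = False
        return self._obs(), reward, done, {}

env = TurnBasedEnv()
obs = env.reset()
assert obs["current_player"] == 0
obs, reward, done, info = env.step(1)
assert obs["current_player"] == 1  # the turn passes to the next player
```

For the card game in the question, the "player who scored starts the next turn" rule would simply set `current_player` from the scoring logic instead of rotating round-robin.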
A multi-agent deep deterministic policy gradient (MADDPG) approach has also been implemented by the OpenAI team.
This is the repo to get started.
https://github.com/openai/multiagent-particle-envs
What you are looking for is PettingZoo; it's a set of environments with a multi-agent setting, and it has a specific class/syntax to handle multi-agent environments.
It's an interesting library because you can also use it with Ray/RLlib to run already-implemented algorithms like PPO/Q-learning, as in this example.
RLlib also has an implementation for multi-agent environments, but you will have to dig deeper into the documentation to understand it.
There is a specific multi-agent environment for reinforcement learning here. It supports any number of agents written in any programming language. An example game is already implemented which happens to be a card game.