Is it possible to use OpenAI's Gym environments for multi-agent games? Specifically, I would like to model a card game with four players (agents). The player scoring a turn starts the next turn. How would I model the necessary coordination between the players (e.g. whose turn it is next)? Ultimately, I would like to use reinforcement learning on four agents that play against each other.
Yes, it is possible to use OpenAI Gym environments for multi-agent games. Although there is no standardized interface for multi-agent environments in the OpenAI Gym community, it is easy enough to build a Gym environment that supports this. For instance, in OpenAI's recent work on multi-agent particle environments they implement a multi-agent environment that inherits from gym.Env and takes the following form:
import gym

class MultiAgentEnv(gym.Env):
    def step(self, action_n):
        # One entry per agent: actions in, observations/rewards/dones out.
        obs_n = list()
        reward_n = list()
        done_n = list()
        info_n = {'n': []}
        # ... step the world forward given all agents' actions ...
        return obs_n, reward_n, done_n, info_n
We can see that the step function takes a list of actions (one for each agent) and returns a list of observations, a list of rewards, and a list of dones while stepping the environment forward. This interface is representative of a Markov Game, in which all agents take actions at the same time and each observes its own subsequent observation and reward.
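For illustration only, a rollout loop against such an interface might look like the sketch below; the environment and the per-agent policies (callables mapping an agent's observation to its action) are hypothetical and assumed to exist.

def rollout(env, policies):
    # One episode with simultaneous actions; policies is a list of callables,
    # one per agent, each mapping that agent's observation to its action.
    obs_n = env.reset()
    done_n = [False] * len(policies)
    while not all(done_n):
        action_n = [pi(obs) for pi, obs in zip(policies, obs_n)]
        obs_n, reward_n, done_n, info_n = env.step(action_n)
    return reward_n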
However, this kind of Markov Game interface may not be suitable for all multi-agent environments. In particular, turn-based games (such as card games) might be better cast as an alternating Markov Game, in which agents take turns (i.e. actions) one at a time. For this kind of environment, you may need to include which agent's turn it is in the representation of state, and your step function would then just take a single action, and return a single observation, reward and done.
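For concreteness, here is a rough, non-authoritative sketch of such a turn-based environment for a four-player card game. There is no official Gym interface for this; the class name, spaces, and the stand-in game logic (a random "scored" flag instead of real card rules) are placeholders for your own game.

import gym
from gym import spaces
import numpy as np

class TurnBasedCardGameEnv(gym.Env):
    # The state includes whose turn it is; step() takes one action for the
    # current player only, and the scoring player leads the next turn.
    def __init__(self, num_players=4, num_actions=52):
        self.num_players = num_players
        self.action_space = spaces.Discrete(num_actions)
        self.observation_space = spaces.Dict({
            "hand": spaces.MultiBinary(num_actions),
            "current_player": spaces.Discrete(num_players),
        })

    def reset(self):
        self.hands = np.ones((self.num_players, self.action_space.n), dtype=np.int8)
        self.current_player = 0
        self.turns_played = 0
        return self._obs()

    def step(self, action):
        # Placeholder rules: the acting player "scores" with probability 0.25
        # and, if so, leads the next turn; otherwise play passes to the left.
        self.hands[self.current_player][action] = 0
        scored = np.random.rand() < 0.25
        reward = 1.0 if scored else 0.0
        if not scored:
            self.current_player = (self.current_player + 1) % self.num_players
        self.turns_played += 1
        done = self.turns_played >= 20
        return self._obs(), reward, done, {"current_player": self.current_player}

    def _obs(self):
        return {"hand": self.hands[self.current_player].copy(),
                "current_player": self.current_player}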
A multi-agent deep deterministic policy gradient (MADDPG) approach has also been implemented by the OpenAI team.
This is the repo to get started.
https://github.com/openai/multiagent-particle-envs
What you are looking for is PettingZoo. It's a set of environments with multi-agent settings, and it has a specific class / syntax to handle multi-agent environments.
It's an interesting library because you can also use it with Ray / RLlib to run already-implemented algorithms like PPO / Q-learning, as in this example.
RLlib also has an implementation for multi-agent environments, but you will have to dig deeper into the documentation to understand it.
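As a rough sketch of what the PettingZoo agent-iteration loop looks like (the exact call signatures and the environment's version suffix depend on your PettingZoo release; older releases return (obs, reward, done, info) from env.last() instead of five values):

import numpy as np
from pettingzoo.classic import texas_holdem_v4  # any AEC card-game env works similarly

env = texas_holdem_v4.env()
env.reset(seed=42)
for agent in env.agent_iter():
    observation, reward, termination, truncation, info = env.last()
    if termination or truncation:
        action = None                                       # finished agents step with None
    else:
        legal = np.flatnonzero(observation["action_mask"])  # classic envs expose legal moves
        action = int(np.random.choice(legal))               # stand-in for a learned policy
    env.step(action)
env.close()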
There is a specific multi-agent environment for reinforcement learning here. It supports any number of agents written in any programming language, and an example game, which happens to be a card game, is already implemented.
Related
We are planning to do a multi-agent setup in an OpenAI Gym Super Mario. Is it possible to have multiple agents in one level, to see what each agent will accomplish (with a mixture of a genetic algorithm and deep Q-learning)? And for the levels, is it possible to use level scene stitching in OpenAI Gym?
In short, we want to know if it's possible to have multiple agents in one level of the game and simulate it.
The idea is to initially calibrate the neural network with some prior knowledge before releasing the algorithm to evolve on its own.
To make the question simpler, imagine that an agent can take 10 actions (discrete space). Instead of training the PPO algorithm to figure out by itself which actions are best for each state, I would like to run an initial training phase in which the algorithm is told that certain actions were performed in certain states.
I'm using Stable Baselines with Gym.
I thought about creating an action wrapper like this:
import gym

class RandomActionWrapper(gym.ActionWrapper):
    def __init__(self, env):
        super(RandomActionWrapper, self).__init__(env)

    def action(self, action):
        # Ignore the model's chosen action and return a random one instead.
        a = self.env.action_space.sample()
        return a
PS: this wrapper is just a proof of concept, choosing random actions all the time, but the model just doesn't learn that way (I simulated many iterations in custom environments that are ridiculously simple to learn, something like: "action 2 always results in reward=1 while other actions result in reward=0").
Apparently the network updates are made considering the actions that the model chose (the model always predicts actions by itself), while the rewards are calculated based on the actions defined in my wrapper. This mismatch makes learning impossible.
I think you are looking for some kind of action mask implementation. In several games/environments, some actions are invalid in a particular state (that is not your case, but it could be a first approach). You can check this paper and the GitHub repo.
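The core idea is simple; here is a minimal sketch of it (PyTorch assumed, not the linked implementation): invalid actions, or in your case "not allowed during calibration" actions, get a logit of -inf, so the policy can only sample the actions you permit in that state.

import torch

def mask_logits(logits, allowed_mask):
    # logits: tensor of shape (n_actions,); allowed_mask: bool tensor, True = allowed.
    return logits.masked_fill(~allowed_mask, float("-inf"))

logits = torch.tensor([0.2, 1.5, -0.3, 0.7])
allowed = torch.tensor([False, True, False, True])
probs = torch.softmax(mask_logits(logits, allowed), dim=-1)  # zero probability on masked actions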
As PPO is an on-policy method, there is a mismatch between my generated data and the algorithm's cost function. There's no reason to insist on PPO here, so I'll look into off-policy algorithms.
Is there a value-based (deep) reinforcement learning algorithm available that is centred fully around learning only the state-value function V(s), rather than the state-action-value function Q(s,a)?
If not, why not, or, could it easily be implemented?
Are any implementations available in Python, say PyTorch or TensorFlow, or even at a higher level in RLlib or so?
I ask because
I have a multi-agent problem to simulate in which, in reality, the agents' actions are defined by an efficient centralized decision-making mechanism that (i) successfully incentivizes truth-telling on behalf of the decentralized agents and (ii) essentially depends on the value functions of the various actors i (on V_i(s_{i,t+1}) for the different achievable post-period states s_{i,t+1} of each actor i). From an individual agent's point of view, the multi-agent nature combined with gradual learning means the system looks non-stationary as long as training is not finished. Because of the nature of the problem, I'm rather convinced that learning any natural Q(s,a) function would be significantly less efficient than simply learning the terminal value function V(s), from which the centralized mechanism can readily derive the eventual actions for all agents by solving a separate sub-problem based on all agents' values.
The math of the typical DQN with temporal-difference learning seems naturally adaptable to state-only value-based training of a deep network for V(s) instead of the combined Q(s,a). Yet within the value-based RL subdomain everybody seems to focus on learning Q(s,a), and I have not found any purely V(s)-learning algorithms so far (other than analytical, non-deep, traditional Bellman-equation dynamic programming methods).
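For what it's worth, the adaptation is indeed small. A minimal sketch of a TD(0) update for V(s) with a neural network, analogous to the DQN target but without the max over actions, could look like the following (value_net, target_net and the (s, r, s_next, done) batch are assumptions, not part of any particular library):

import torch
import torch.nn as nn

def td0_update(value_net, target_net, optimizer, batch, gamma=0.99):
    # batch: tensors of states, rewards, next states, and done flags (0/1 floats).
    s, r, s_next, done = batch
    with torch.no_grad():
        # Bootstrapped TD(0) target: r + gamma * V(s') on non-terminal transitions.
        target = r + gamma * (1 - done) * target_net(s_next).squeeze(-1)
    loss = nn.functional.mse_loss(value_net(s).squeeze(-1), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()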
I am aware of Dueling DQN (DDQN), but it does not seem to be exactly what I am searching for. At least DDQN has a separate learner for V(s), but overall it still aims to learn the full Q(s,a) in a decentralized way, which seems not conducive in my case.
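For reference, the dueling architecture does learn a V(s) head, but only as a component that is immediately recombined into Q(s,a); a rough sketch of that decomposition (PyTorch assumed):

import torch.nn as nn

class DuelingHead(nn.Module):
    # Q(s,a) = V(s) + A(s,a) - mean_a A(s,a): V(s) never stands on its own.
    def __init__(self, feat_dim, n_actions):
        super().__init__()
        self.value = nn.Linear(feat_dim, 1)
        self.advantage = nn.Linear(feat_dim, n_actions)

    def forward(self, features):
        v = self.value(features)
        a = self.advantage(features)
        return v + a - a.mean(dim=-1, keepdim=True)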
I implemented a DQN agent and, after some hours of learning, the reward is steady at 20-21.
When I watch the agent play, I can see that the same move is played again and again. The env on reset always shoots the ball in the same direction, and my agent learned to play that exact move and never lose.
Is this the expected behavior of the Gym Pong env? How can I make the env reset more random?
I'm using the NoopResetEnv wrapper, but it doesn't help!
The fact that the agent acts the same way every time can be traced to two causes: the model itself and the Pong env.
For the model: in case you are training a DQN model, the vanilla DQN model is actually deterministic, which means it will give the same action in the same situation. What you can try is to add a little randomness to the model, such as taking a random action with probability 0.1. For example, in Stable Baselines you can control this at prediction time via the 'deterministic' argument of model.predict (set it to False to keep some randomness).
From the env's perspective: I have not tried it myself, but there is a seed parameter in the OpenAI Gym Atari envs, so you can set a seed for the env (env.seed(your_seed)). Check here and GitHub for more information.
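Putting both together, an evaluation loop might look roughly like this sketch (old Gym API and a Stable Baselines-style model.predict assumed; the trained model itself is not shown):

import random
import gym

def evaluate(model, env, seed=123, epsilon=0.1):
    # Vary the seed to vary the initial serve; epsilon adds a small chance of
    # a random action even though the learned policy is deterministic.
    env.seed(seed)
    obs, done, total = env.reset(), False, 0.0
    while not done:
        if random.random() < epsilon:
            action = env.action_space.sample()
        else:
            action, _ = model.predict(obs, deterministic=True)
        obs, reward, done, info = env.step(action)
        total += reward
    return total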
Hello StackOverflow Community!
I have a question about Actor-Critic Models in Reinforcement Learning.
While listening to Berkeley's lectures on policy gradient methods, it is said that in actor-critic algorithms, where we optimize both our policy (with some policy parameters) and our value function (with some value function parameters), some algorithms (e.g. A2C/A3C) use the same parameters in both optimization problems (i.e. policy parameters = value function parameters).
I could not understand how this works. I was thinking that we should optimize them separately. How does this shared-parameter solution help us?
Thanks in advance :)
You can do it by sharing some (or all) layers of their networks. If you do, however, you are assuming that there is a common state representation (the intermediate layer output) that is optimal w.r.t. both. This is a very strong assumption and it usually doesn't hold. It has been shown to work for learning from images, where you put (for instance) an autoencoder on top of both the actor and the critic network and train it using the sum of their loss functions.
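A minimal sketch of what "sharing layers" means in code (PyTorch assumed; the layer sizes and discrete action space are placeholders):

import torch.nn as nn

class SharedActorCritic(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        # Shared trunk: both heads rely on this common state representation.
        self.body = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh())
        self.policy_head = nn.Linear(64, n_actions)  # actor: action logits
        self.value_head = nn.Linear(64, 1)           # critic: state value V(s)

    def forward(self, obs):
        h = self.body(obs)
        return self.policy_head(h), self.value_head(h)

# Training then minimizes a single loss, e.g. policy_loss + c * value_loss,
# so gradients from both objectives flow into the shared trunk.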
This is mentioned in PPO paper (just before Eq. (9)). However, they just say that they share layers only for learning Atari games, not for continuous control problems. They don't say why, but this can be explained as I said above: Atari games have a low-dimensional state representation that is optimal for both the actor and the critic (e.g., the encoded image learned by an autoencoder), while for continuous control you usually pass directly a low-dimensional state (coordinates, velocities, ...).
A3C, which you mentioned, was also used mostly for games (Doom, I think).
From my experience, in control tasks sharing layers never worked if the state is already compact.