Log actions from training - reinforcement-learning

I am using SB3 PPO for training on a discrete env (4 discrete actions) and I would like to log, in Tensorboard, the (discrete) actions taken by the model.
Is there any way to access the actions taken by the model through a callback?
If so, is there an example of how to access them?
Thank you very much
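For reference, here is a minimal sketch of such a callback (my own illustration, not an official SB3 recipe). For on-policy algorithms like PPO, the actions of the current step are exposed in self.locals during rollout collection, so they can be pushed to TensorBoard through the callback's logger; the exact keys available in self.locals may vary between SB3 versions.

from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import BaseCallback

class ActionLoggingCallback(BaseCallback):
    """Log the discrete action taken at each rollout step to TensorBoard."""

    def _on_step(self) -> bool:
        # For on-policy algorithms the current step's actions are available
        # in self.locals (the key name can differ between SB3 versions).
        actions = self.locals.get("actions")
        if actions is not None:
            # record_mean averages the values between two logger dumps;
            # with a single env, actions is an array of length 1.
            self.logger.record_mean("rollout/action", float(actions[0]))
        return True

model = PPO("MlpPolicy", "CartPole-v1", tensorboard_log="./ppo_tb", verbose=1)
model.learn(total_timesteps=10_000, callback=ActionLoggingCallback())

For a discrete action space you could also keep one counter per action and record each counter separately, instead of logging a mean action value.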

Related

When we validate the model after one epoch of training in PyTorch Lightning, can we use just one GPU?

I am trying to run code written with PyTorch Lightning on one machine with multiple GPUs.
The run settings are:
--gpus=1,2,3,4
--strategy=ddp
There is no problem during training, but when we validate the model after one epoch of training, it still runs on multiple GPUs (multiple processes), so the validation dataset is split and assigned to different GPUs. When the code then tries to write the prediction file and compute the scores against the gold file, it runs into problems. Source Code
So I just want to disable DDP when I validate the model and run the validation only on local_rank 0.
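One common pattern (my own sketch, not from the original post) avoids disabling DDP entirely: keep validating on all GPUs, gather the per-rank predictions, and let only global rank 0 write the prediction file and compute the scores. The hook name (validation_epoch_end vs. on_validation_epoch_end) depends on the Lightning version, and write_predictions / compute_scores below are hypothetical helpers.

import torch
import pytorch_lightning as pl

class LitModel(pl.LightningModule):
    # ... training_step, configure_optimizers, etc. omitted; only the
    #     validation hooks relevant to the DDP issue are sketched here ...

    def validation_step(self, batch, batch_idx):
        preds = self(batch["input"])          # hypothetical forward pass
        return preds

    def validation_epoch_end(self, outputs):
        preds = torch.cat(outputs)
        # all_gather collects the predictions from every DDP process,
        # returning a tensor of shape (world_size, n_local, ...).
        gathered = self.all_gather(preds).flatten(0, 1)
        # Only global rank 0 writes the prediction file and scores it,
        # so the processes no longer clash over the output file.
        if self.trainer.is_global_zero:
            write_predictions(gathered)                    # hypothetical helper
            compute_scores("predict.txt", "gold.txt")      # hypothetical helper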

How to train PPO using actions from matches already played?

The idea is to initially calibrate the neural network with some prior knowledge before releasing the algorithm to evolve on its own.
To make the question simpler, imagine that an agent can take 10 actions (discrete space). Instead of training the PPO algorithm to figure out by itself which actions are best for each state, I would like to perform training by telling it that certain actions were taken in certain states.
I'm using Stable Baselines with Gym.
I thought about creating an action wrapper like this:
import gym

class RandomActionWrapper(gym.ActionWrapper):
    def __init__(self, env):
        super(RandomActionWrapper, self).__init__(env)

    def action(self, action):
        # Ignore the action chosen by the policy and sample a random one instead
        a = self.env.action_space.sample()
        return a
PS: this wrapper is just a proof of concept, choosing random actions all the time, but the model simply doesn't learn that way (I ran many iterations on ridiculously easy-to-learn custom environments, something like "action 2 always results in reward=1 while the other actions result in reward=0").
Apparently the network updates are computed from the actions the model chose (the model always predicts actions by itself), while the rewards are calculated from the actions substituted by my wrapper. This mismatch makes learning impossible.
I think you are looking for some kind of action-mask implementation. In several games/environments, some actions are invalid in a particular state (that is not your case, but it could be a first approach). You can check this paper and the GitHub repository.
As PPO is an on-policy method, there is a mismatch between my generated data and the algorithm's cost function. There's no reason to insist on PPO here; I'll look into off-policy algorithms.
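As a follow-up sketch on the off-policy route (my own illustration, not from the thread): with an off-policy algorithm such as SB3's DQN, the transitions recorded from already-played matches can be pushed into the replay buffer before training, so the first gradient updates are driven by the demonstration data. make_your_env and load_logged_transitions are hypothetical placeholders, and the exact replay_buffer.add signature depends on the SB3 version.

import numpy as np
from stable_baselines3 import DQN

env = make_your_env()                # hypothetical: your Gym env
logged = load_logged_transitions()   # hypothetical: list of recorded
                                     # (obs, action, reward, next_obs, done)

model = DQN("MlpPolicy", env, learning_starts=0, verbose=1)

# Pre-fill the replay buffer with the recorded transitions.
# Recent SB3 versions expect an `infos` list as the last argument of add().
for obs, action, reward, next_obs, done in logged:
    model.replay_buffer.add(
        np.array([obs]), np.array([next_obs]), np.array([action]),
        np.array([reward]), np.array([done]), [{}],
    )

# Continue training normally; fresh experience is mixed in as it is collected.
model.learn(total_timesteps=100_000)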

Multiagent (not deep) reinforcement learning? Modeling the problem

I have N agents/users accessing a single wireless channel, and at each time step only one agent can access the channel and receive a reward.
Each user has a buffer that can store B packets, and I assume the buffer is infinite.
Each user n gets an observation from the environment indicating whether its packet in time slot t was a success or a failure (collision). If more than one user accesses the channel, they all get a penalty.
This feedback from the channel is the same for all users since there is only one channel. The reward is -B_n (the negative of the number of packets in the buffer). Each user wants to maximize its own reward and tries to empty its buffer.
Packets arrive at each user following a Poisson process with an average of $\lambda$ packets per time slot.
Each user has a history of the previous 10 time slots that it uses as input to the DQN, which outputs the probability of taking action A_n: stay silent or transmit. The history consists of (A_n, F, B_n).
Each user is unaware of the action and buffer status of other users.
I am trying to model my problem with multi-agent reinforcement learning, and so far I have tried it with DQN, but the results are more or less the same as a random scheme. Could it be that the users don't have enough contextual information to learn the behaviour of the other users? Or could there be some other reason?
I would like to know how I can model my environment, since the state (in the RL sense) is static and the environment doesn't change; the only thing that changes is each user's history at each time slot. So I am not sure whether it is a partially observable MDP or whether it should be modelled as a multi-agent single-arm bandit problem, which I don't know is correct or not.
The second concern is that I have tried DQN and it has not worked, and I would like to know whether such a problem can be tackled with tabular Q-learning. I have not seen multi-agent work in which anyone has used tabular QL. Any insights would be helpful.
Your problem can be modeled as a Decentralized POMDP (see an overview here).
Summarizing this approach: you consider a multi-agent system where each agent models its own policy, and then you try to build a joint policy from these individual ones. Of course, the complexity grows as the number of agents, states and actions increases, so there are several approaches, mainly based on heuristics, to prune branches of this joint-policy tree that are not "good" in comparison with others. A well-known example using this approach is precisely packet routing, where it is possible to define a discrete action space.
But be aware that even for tiny systems, the complexity often becomes infeasible!
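On the tabular Q-learning question: a minimal sketch of independent Q-learners (my own illustration, with made-up names), where each user keeps its own table keyed by its local history and updates it only from its own reward, could look like this.

import random
from collections import defaultdict

N_USERS = 4
ACTIONS = [0, 1]                      # 0 = stay silent, 1 = transmit
ALPHA, GAMMA, EPS = 0.1, 0.99, 0.1

# One Q-table per user, keyed by that user's (hashable) local history,
# e.g. a tuple of the last 10 (A_n, F, B_n) entries.
q_tables = [defaultdict(lambda: [0.0] * len(ACTIONS)) for _ in range(N_USERS)]

def select_action(user, history):
    # epsilon-greedy over the user's own Q-table
    if random.random() < EPS:
        return random.choice(ACTIONS)
    values = q_tables[user][history]
    return max(ACTIONS, key=lambda a: values[a])

def update(user, history, action, reward, next_history):
    # standard one-step Q-learning update, done independently per user
    best_next = max(q_tables[user][next_history])
    td_target = reward + GAMMA * best_next
    q_tables[user][history][action] += ALPHA * (td_target - q_tables[user][history][action])

Independent learners like this treat the other users as part of the environment, so the environment each table sees is non-stationary while everyone learns at once; that can be one reason both tabular QL and DQN end up looking close to a random scheme in this kind of setting.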

Decreasing action sampling frequency for one agent in a multi-agent environment

I'm using RLlib for the first time on a custom multi-agent RL environment, and I would like to train a couple of PPO agents on it. The implementation hiccup I need to figure out is how to alter the training for one special agent such that this one only takes an action every X timesteps. Is it best to only call compute_action() every X timesteps? Or, on the other steps, to mask the policy selection so that it has to re-sample an action until a No-Op is chosen? Or to modify the action that gets fed into the environment, plus the previous actions in the training batches, to be No-Ops?
What's the easiest way to implement this that still takes advantage of RLlib's training features? Do I need to create a custom training loop, or is there a way to configure PPOTrainer to do this?
Thanks
Let t := the number of timesteps so far. Give the special agent t (mod X) as a feature, and don't process its actions in the environment when t (mod X) != 0. This accomplishes the following:
the agent is, in effect, only taking actions every X timesteps, because you ignore all the others
the agent can learn that only the actions taken every X timesteps will affect future rewards
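A sketch of how that could look inside a custom MultiAgentEnv (my own illustration, assuming the older gym-style RLlib API where step() returns obs/rewards/dones/infos dicts; newer versions use a 5-tuple and gymnasium spaces). The agent ids "fast" and "slow" and the reward logic are placeholders.

import numpy as np
import gym
from ray.rllib.env.multi_agent_env import MultiAgentEnv

class SlowAgentEnv(MultiAgentEnv):
    """'fast' acts every step; 'slow' only has its action applied every X steps."""

    def __init__(self, config=None):
        super().__init__()
        self.X = (config or {}).get("X", 5)
        self.observation_space = gym.spaces.Box(0.0, 1.0, shape=(1,), dtype=np.float32)
        self.action_space = gym.spaces.Discrete(2)
        self.t = 0

    def _obs(self):
        # Expose the phase t mod X so the slow agent can tell when its
        # actions actually matter.
        phase = np.array([(self.t % self.X) / self.X], dtype=np.float32)
        return {"fast": phase, "slow": phase}

    def reset(self):
        self.t = 0
        return self._obs()

    def step(self, action_dict):
        apply_slow = (self.t % self.X == 0)
        # ... apply action_dict["fast"] every step, and apply
        #     action_dict["slow"] only when apply_slow is True ...
        self.t += 1
        rewards = {"fast": 0.0, "slow": 0.0}      # placeholder rewards
        dones = {"__all__": self.t >= 100}
        return self._obs(), rewards, dones, {}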

Sagemaker model evaluation

The Amazon documentation lists several approaches to evaluate a model (e.g. cross-validation), but these methods do not seem to be available in the SageMaker Java SDK.
Currently, if we want to do 5-fold cross-validation, it seems the only option is to create 5 models (and also deploy 5 endpoints), one model for each subset of the data, and manually compute the performance metrics (recall, precision, etc.).
This approach is not very efficient and can also be expensive, since it requires deploying k endpoints, one for each fold of the k-fold validation.
Is there another way to test the performance of a model?
Amazon SageMaker is a set of components, and you can choose which ones to use.
The built-in algorithms are designed for (infinite) scale, which means that you can have huge datasets and still build a model with them quickly and at low cost. Once you have large datasets, you usually don't need techniques such as cross-validation; the recommendation is to have a clear split between training data and validation data. Each of these parts is defined with an input channel when you submit a training job.
If you have a small amount of data and want to train on all of it, using cross-validation to make that possible, you can use a different part of the service (an interactive notebook instance). You can bring your own algorithm, or even a container image, to be used for development, training or hosting. You can run any code based on any machine learning library or framework, including scikit-learn, R, TensorFlow, MXNet, etc. In your code you can define cross-validation based on the training data that you copy from S3 to the worker instances.
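As a concrete (hedged) illustration of that bring-your-own-code option, k-fold cross-validation run inside a notebook instance with scikit-learn could look like the following, after copying the training data from S3 to the instance; the file name and estimator are placeholders. The point is that the folds are evaluated locally, so no extra endpoints need to be deployed.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical: training data already copied from S3 onto the notebook instance.
df = pd.read_csv("train.csv")
X, y = df.drop(columns=["label"]), df["label"]

# 5-fold cross-validation computed locally on the instance.
model = RandomForestClassifier(n_estimators=100)
scores = cross_val_score(model, X, y, cv=5, scoring="recall")
print("Recall per fold:", scores, "Mean:", scores.mean())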