How to train PPO using actions from matches already played? - deep-learning

The idea is to initially calibrate the neural network with some prior knowledge before releasing the algorithm to evolve on its own.
To keep the question simple, imagine that an agent can take 10 actions (discrete space). Instead of training the PPO algorithm to figure out by itself which actions are best for each state, I would like to run an initial training phase that takes into account which actions were actually taken in which states during matches already played.
I'm using Stable Baselines with Gym.
I thought about creating an action wrapper like this:
import gym

class RandomActionWrapper(gym.ActionWrapper):
    def __init__(self, env):
        super(RandomActionWrapper, self).__init__(env)

    def action(self, action):
        # Ignore whatever the agent chose and execute a random action instead.
        a = self.env.action_space.sample()
        return a
PS: this wrapper is just a proof of concept that picks random actions all the time, but the model simply doesn't learn that way. I ran many iterations in ridiculously easy custom environments, along the lines of "action 2 always results in reward=1 while every other action results in reward=0".
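For reference, the kind of toy environment I mean looks roughly like this (a minimal sketch using the classic Gym API; the constant dummy observation is just a placeholder):

import gym
import numpy as np
from gym import spaces

class TrivialEnv(gym.Env):
    """Toy environment: action 2 always gives reward 1, every other action gives 0."""

    def __init__(self):
        super(TrivialEnv, self).__init__()
        self.action_space = spaces.Discrete(10)
        self.observation_space = spaces.Box(low=0.0, high=1.0, shape=(1,), dtype=np.float32)
        self.steps = 0

    def reset(self):
        self.steps = 0
        return np.zeros(1, dtype=np.float32)

    def step(self, action):
        self.steps += 1
        reward = 1.0 if action == 2 else 0.0
        done = self.steps >= 100
        return np.zeros(1, dtype=np.float32), reward, done, {}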
Apparently the network updates are computed from the actions the model itself predicted (the model always predicts actions by itself), while the rewards are computed from the actions substituted by my wrapper. This mismatch makes learning impossible.

I think you are looking for some kind of action mask implementation. In several games/environments, some actions are invalid in a particular state (that is not exactly your case, but it could be a first approach). You can check this paper and the accompanying GitHub repository.
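Independent of that specific paper/repo, the usual way to implement such a mask is to push the logits of invalid actions to a very large negative value before the softmax, so they are effectively never sampled. A minimal sketch (PyTorch, names are mine):

import torch

def masked_logits(logits, action_mask):
    # action_mask holds 1.0 for valid actions and 0.0 for invalid ones.
    # Invalid actions get a huge negative logit, so after the softmax their
    # probability is effectively zero and they are (almost) never sampled.
    return logits + (1.0 - action_mask) * (-1e8)

logits = torch.tensor([1.2, 0.3, -0.5, 2.0])
mask = torch.tensor([1.0, 0.0, 1.0, 0.0])   # here actions 1 and 3 are invalid
probs = torch.softmax(masked_logits(logits, mask), dim=-1)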

Since PPO is an on-policy method, there is a mismatch between my pre-generated data and the algorithm's cost function. There's no reason to insist on PPO here, so I'll look into off-policy algorithms.


How to guarantee that the actor would select a correct action?

In the training phase of Deep Deterministic Policy Gradient (DDPG) algorithm, the action selection would be simply
action = actor(state)
where state is the current state of the environment and actor is a deep neural network.
I do not understand how to guarantee that the returned action belongs to the action space of the considered environment.
For example, a state could be a vector of size 4 and the action space could be the interval [-1,1] of real numbers or the Cartesian product [-1,1]x[-2,2]. Why, after doing action = actor(state), would the returned action belong to [-1,1] or [-1,1]x[-2,2], depending on the environment?
I was reading some source codes of DDPG on GitHub but I am missing something here and I cannot figure out the answer.
The actor is usually a neural network, and the reason its actions are restricted to [-1,1] is usually that the output layer of the actor net uses an activation function like Tanh; one can then rescale this output so the action falls in any desired range.
The reason the actor can choose good actions for a given environment is that, in an MDP (Markov decision process), the actor does trial and error in the environment and receives rewards or penalties for doing well or badly, i.e. the actor net gets gradients pushing it toward better actions.
Note that algorithms like PPG, PPO, SAC, and DDPG can only guarantee that the actor selects the best action for all states in theory (i.e. assuming infinite learning time, infinite actor net capacity, etc.); in practice there is usually no such guarantee unless the action space is discrete and the environment is very simple.
Understanding the idea behind RL algorithms will greatly help you understand their source code; after all, the code is an implementation of the idea.
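To make the Tanh-plus-rescaling idea concrete, here is a minimal PyTorch sketch (layer sizes and names are arbitrary; real DDPG implementations differ in the details):

import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic actor: tanh squashes the output to [-1, 1],
    then we rescale to the environment's action bounds."""

    def __init__(self, state_dim, action_low, action_high):
        super(Actor, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, len(action_low)), nn.Tanh(),   # output in [-1, 1]
        )
        self.low = torch.as_tensor(action_low, dtype=torch.float32)
        self.high = torch.as_tensor(action_high, dtype=torch.float32)

    def forward(self, state):
        squashed = self.net(state)                        # in [-1, 1]
        # Affine rescaling from [-1, 1] to [low, high] per dimension,
        # e.g. [-1, 1] x [-2, 2] as in the question.
        return self.low + (squashed + 1.0) * 0.5 * (self.high - self.low)

# e.g. a state of size 4 and the action space [-1, 1] x [-2, 2]
actor = Actor(state_dim=4, action_low=[-1.0, -2.0], action_high=[1.0, 2.0])
action = actor(torch.zeros(4))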

What are backend weights in deep learning models (yolo)?

Pretty new to deep learning, but I couldn't seem to find/figure out what backend weights are, such as
full_yolo_backend.h5
squeezenet_backend.h5
From what I have found and experimented with, these backend weights have fundamentally different model architectures, such as:
the yolov2 model has 40+ layers but the backend only 20+ layers (?)
you can build on top of the backend model with your own networks (?)
using backend models tends to yield poorer results (?)
I was hoping to seek some explanation on backend weights vs actual models for learning purposes. Thank you so much!
I'm not sure which implementation you are using, but in many applications you can consider a deep model as a feature extractor whose output is more or less task-agnostic, followed by a number of task-specific heads.
The choice of backend depends on your specific constraints in terms of tradeoff between accuracy and computational complexity. Examples of classical but time-consuming choices for backends are resnet-101, resnet-50 or VGG that can be coupled with FPN (feature pyramid networks) to yield multiscale features. However, if speed is your main concern then you can use smaller backends such as different MobileNet architectures or even the vanilla networks such as the ones used in the original Yolov1/v2 papers (tinyYolo is an extreme case).
Once you have chosen your backend (you can use a pretrained one), you can load its weights (that is what your *h5 files are). On top of that, you will add a small head that will carry the tasks that you need: this can be classification, bbox regression, or like in MaskRCNN forground/background segmentation. For Yolov2, you can just add very few, for example 3 convolutional layers (with non-linearities of course) that will output a tensor of size
B x C1 x C2 x A x P
#B == batch size
#C1 == number of vertical cells
#C2 == number of horizontal cells
#A == number of anchors
#P == number of parameters per anchor (i.e. bbox parameters, class predictions, confidence)
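As a rough sketch of what such a head can look like in Keras (the grid size, anchor count and class count below are made up, and I am assuming the .h5 file stores a complete model; if it only stores weights, you would rebuild the backend architecture first and call load_weights instead):

from tensorflow.keras import layers, models

GRID, N_ANCHORS, N_CLASSES = 13, 5, 20       # made-up numbers, just to fix the shapes
P = 4 + 1 + N_CLASSES                        # bbox params + confidence + class scores

# The pretrained feature extractor, loaded from the .h5 file.
backend = models.load_model("full_yolo_backend.h5")

x = backend.output                           # e.g. a (B, GRID, GRID, channels) feature map;
                                             # GRID must match the backend's output resolution
x = layers.Conv2D(512, 3, padding="same", activation="relu")(x)
x = layers.Conv2D(512, 3, padding="same", activation="relu")(x)
x = layers.Conv2D(N_ANCHORS * P, 1, padding="same")(x)    # last layer left linear
out = layers.Reshape((GRID, GRID, N_ANCHORS, P))(x)       # B x C1 x C2 x A x P
detector = models.Model(backend.input, out)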
Then, you can just save/load the weights of this head separately. When you are happy with your results though, training jointly (end-to-end) will usually give you a small boost in accuracy.
Finally, to come back to your last questions, I assume that you are getting poor results with the backends because you are only loading the backend weights but not the weights of the heads. Another possibility is that you are using a head trained with a backend X but are switching the backend to Y. In that case, since the head expects different features, it's natural to see a drop in performance.

How do shared parameters in actor-critic models work?

Hello StackOverflow Community!
I have a question about Actor-Critic Models in Reinforcement Learning.
While watching the policy gradient lectures from Berkeley, it is said that in actor-critic algorithms, where we optimize both the policy with some policy parameters and the value function with some value-function parameters, some algorithms (e.g. A2C/A3C) use the same parameters for both optimization problems (i.e. policy parameters = value function parameters).
I could not understand how this works. I was thinking that we should optimize them separately. How does this shared-parameter solution help us?
Thanks in advance :)
You can do it by sharing some (or all) layers of their networks. If you do, however, you are assuming that there is a common state representation (the intermediate layer output) that is optimal w.r.t. both. This is a very strong assumption and it usually doesn't hold. It has been shown to work for learning from images, where you put (for instance) an autoencoder on top of both the actor and the critic network and train it using the sum of their loss functions.
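For concreteness, "sharing layers" usually looks something like the following minimal PyTorch sketch (not any specific A2C/A3C implementation):

import torch
import torch.nn as nn

class SharedActorCritic(nn.Module):
    """Actor and critic built on one shared trunk: the trunk receives
    gradients from both the policy loss and the value loss."""

    def __init__(self, obs_dim, n_actions):
        super(SharedActorCritic, self).__init__()
        self.trunk = nn.Sequential(              # shared state representation
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
        )
        self.policy_head = nn.Linear(128, n_actions)   # actor: action logits
        self.value_head = nn.Linear(128, 1)            # critic: state value

    def forward(self, obs):
        features = self.trunk(obs)
        return self.policy_head(features), self.value_head(features)

# Training then minimizes one combined objective, roughly
#   loss = policy_loss + value_coef * value_loss (+ entropy bonus),
# so both terms push gradients through the shared trunk.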
This is mentioned in the PPO paper (just before Eq. (9)). However, they just say that they share layers only for learning Atari games, not for continuous control problems. They don't say why, but this can be explained as above: Atari games have a low-dimensional state representation that is optimal for both the actor and the critic (e.g., the encoded image learned by an autoencoder), while for continuous control you usually pass a low-dimensional state (coordinates, velocities, ...) directly.
A3C, which you mentioned, was also used mostly for games (Doom, I think).
In my experience, sharing layers never worked in control tasks where the state is already compact.

deep learning: How do I know my net is not memorizing

I have a convolutional neural network and my input data are 10,000 images of the same object from different views (angles in 3D around the object). My network converges, but I am not sure whether the network has memorized all the different angles/views or not. Since I only have one object, I cannot really test it with different data.
My training/test plot looks like this (red: training, green: test):
Since the test curve is lower than the training curve, should I expect the network to have learned all the images by heart, even though I have 10,000 somewhat different images?
First, "memorize" is not a term we apply to the learning process, since it's not exact regurgitation of prior examples.
This is a matter of your experimental process. You get to define the success criteria. Is 95% accuracy good enough for your intended application? What, to you, is good enough performance to declare success?
One way to build a more convincing argument is to make the typical third partition: besides training and test sets, save part of your data for validation. You do the training and testing as you've already done. When the model has converged, you apply it to the validation set to predict results. If that test passes your success criterion, then you have a finished model.
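A minimal sketch of such a three-way split, with small placeholder arrays standing in for the real images (scikit-learn used purely for convenience):

import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays standing in for the real images and their labels.
X = np.random.rand(1000, 32, 32, 3).astype(np.float32)
y = np.random.randint(0, 36, size=1000)       # e.g. one view/angle label per image

# Hold out a validation set that is never used during training or tuning.
X_rest, X_val, y_rest, y_val = train_test_split(X, y, test_size=0.15, random_state=0)
# Split the remainder into the usual training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest, test_size=0.15, random_state=0)

# Train/tune on the train and test sets as before; evaluate the converged
# model exactly once on (X_val, y_val) against your success criterion.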

Rewards in Q-Learning and in TD(lambda)

How do rewards in those two RL techniques work? I mean, they both improve the policy and the evaluation of it, but not the rewards.
Do I need to guess them from the beginning?
You don't need to guess the rewards. The reward is feedback from the environment, and the rewards are parameters of the environment. The algorithm works under the condition that the agent can only observe this feedback, the state space, and the action space.
The key idea of Q-learning and TD is asynchronous stochastic approximation, where we approximate the Bellman operator's fixed point using noisy evaluations of the expected long-term reward.
For example, if we want to estimate the expectation of a Gaussian distribution, we can draw samples from it and average them.
Reinforcement Learning is for problems where the AI agent has no information about the world it is operating in. So Reinforcement Learning algorithms not only give you a policy/optimal action at each state, but also navigate a completely foreign environment (with no knowledge of which action will lead to which resulting state) and learn the parameters of this new environment. These are model-based Reinforcement Learning algorithms.
Now, Q-Learning and Temporal Difference learning are model-free Reinforcement Learning algorithms. That means the AI agent does the same things as in the model-based case, but it does not have to learn the model (things like transition probabilities) of the world it is operating in. Through many iterations it comes up with a mapping from each state to the optimal action to be performed in that state.
Now, coming to your question: you do not have to guess the rewards at different states. Initially, when the agent is new to the environment, it just chooses a random action to perform from the state it is in and gives it to the simulator. The simulator, based on the transition functions, returns the resulting state of that state-action pair and also returns the reward for being in that state.
The simulator is analogous to Nature in the real world. For example, if you find something unfamiliar in the world and perform some action, like touching it, and the thing turns out to be a hot object, Nature gives a reward in the form of pain, so that next time you know what happens when you try that action. While programming this, it is important to note that the workings of the simulator are not visible to the AI agent that is trying to learn the environment.
Now, depending on this reward that the agent senses, it backs up its Q-value (in the case of Q-Learning) or utility value (in the case of TD-Learning). Over many iterations these Q-values converge, and you are able to choose an optimal action for every state based on the Q-values of the state-action pairs.
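For reference, the backup described above is, in the tabular Q-learning case, just the following (a minimal sketch; TD(lambda) additionally spreads the update over recently visited states via eligibility traces):

import numpy as np

# Tabular Q-learning backup for a single observed transition (s, a, r, s_next).
# alpha is the learning rate, gamma the discount factor; the reward r comes
# straight from the environment/simulator, it is never guessed.
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    td_target = r + gamma * np.max(Q[s_next])     # best value achievable from the next state
    Q[s, a] += alpha * (td_target - Q[s, a])      # move Q(s, a) toward the target

# Example: 5 states, 2 actions, all Q-values start at zero.
Q = np.zeros((5, 2))
q_update(Q, s=0, a=1, r=1.0, s_next=3)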