In PPO's objective function, the second term introduces the squared-error loss of the value-function neural network. Is that term essentially the squared advantage values?
No, that's the squared TD error used to train V. You can separate the two losses and nothing changes, as long as the policy and value networks do not share parameters. In that case the policy is trained on the first term of the equation, while V is trained on the second.
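For reference, here is the combined objective as written in the PPO paper, which makes the two terms explicit. The symbols c_1, c_2 (loss coefficients), S (entropy bonus) and V_t^targ (the value target, typically an estimate of the return) follow the paper's notation and are not taken from the question itself.

```latex
% Combined PPO objective (to be maximized)
L_t^{CLIP+VF+S}(\theta) = \hat{\mathbb{E}}_t\!\left[ L_t^{CLIP}(\theta) - c_1\, L_t^{VF}(\theta) + c_2\, S[\pi_\theta](s_t) \right]

% Second term: squared error between the value prediction and its target
L_t^{VF}(\theta) = \left( V_\theta(s_t) - V_t^{\mathrm{targ}} \right)^2
```

The advantage estimate appears only inside L^CLIP; the second term regresses V toward its target rather than squaring the advantage.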
I wrote a custom gym environment and trained on it with the PPO implementation provided by stable-baselines3. The ep_rew_mean recorded by TensorBoard is as follows:
[Figure: ep_rew_mean curve over 100 million total steps; each episode has 50 steps]
As shown in the figure, the reward is around 15.5 after training, and the model has converged. However, when I run evaluate_policy() on the trained model, the reward is much smaller than the ep_rew_mean value. The first value is the mean reward and the second is the standard deviation of the reward:
4.349947246664763 1.1806464511030819
This is how I call evaluate_policy():
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10000)
As I understand it, the initial state is sampled randomly from a region whenever reset() is called, so overfitting should not be the problem.
I have also tried different learning rates and other parameters, but this did not solve the problem.
I have checked my environment, and I think there is no error in it.
I have searched the internet and read the stable-baselines3 docs and the issues on GitHub, but did not find a solution.
evaluate_policy sets deterministic=True by default (https://stable-baselines3.readthedocs.io/en/master/common/evaluation.html).
If you sample from the action distribution during training, it may help to evaluate the policy without having it select actions with an argmax (by passing deterministic=False).
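To illustrate why the two evaluation modes can disagree, here is a minimal NumPy sketch. It does not use stable-baselines3; the toy task (reward 1 whenever the action differs from the previous one), the fixed action probabilities, and the helper names run_episode and evaluate are all assumptions made up for this example.

```python
import numpy as np

def run_episode(probs, deterministic, rng, n_steps=50):
    """Toy episode: reward 1 whenever the action differs from the previous one."""
    total, prev = 0.0, None
    for _ in range(n_steps):
        if deterministic:
            action = int(np.argmax(probs))                 # always the most likely action
        else:
            action = int(rng.choice(len(probs), p=probs))  # sample, as during training
        if prev is not None and action != prev:
            total += 1.0
        prev = action
    return total

def evaluate(probs, deterministic, n_episodes=200, seed=0):
    """Return (mean, std) of the episode returns, like an evaluation helper would."""
    rng = np.random.default_rng(seed)
    returns = [run_episode(probs, deterministic, rng) for _ in range(n_episodes)]
    return float(np.mean(returns)), float(np.std(returns))

probs = np.array([0.55, 0.45])  # hypothetical policy output for every state
det_mean, det_std = evaluate(probs, deterministic=True)
sto_mean, sto_std = evaluate(probs, deterministic=False)
print("deterministic:", det_mean, det_std)
print("stochastic:   ", sto_mean, sto_std)
```

In this toy case the argmax policy repeats one action forever and collects nothing, while the sampled policy does much better. Whether something similar is happening in your environment is something only a comparison of deterministic=True against deterministic=False can tell you.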
How is the Deep Q-learning loss function defined? What I can't understand is this: every time we are in a state and select an action based on a policy derived from the Deep Q-network, only one of the Q-values is considered, which means that instead of a vector we have a scalar. In other words, the loss function consists of only one term. Am I correct? The notation used in papers is ambiguous.
I am just getting started with deep reinforcement learning and I am trying to grasp this concept.
I have this deterministic Bellman equation:
When I add the stochasticity of the MDP, I get 2.6a:
My question is: is this assumption correct? I saw this form of 2.6a without a policy symbol on the state-value function, but to me this does not make sense, because I am using the probabilities of the different next states I could end up in, which I think is the same as specifying a policy. And if 2.6a is correct, can I assume the rest (2.6b and 2.6c) is correct as well? I would then like to write the action-value function like this:
The reason I am doing it this way is that I would like to work my way from the deterministic point of view to the non-deterministic one.
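For what it's worth, here is a minimal NumPy sketch of the usual DQN loss on a batch of transitions, assuming the standard one-step target r + gamma * max_a' Q_target(s', a'). The function name dqn_loss and the array layout are assumptions for illustration, not any library's API. It shows that for each transition only the Q-value of the action actually taken enters the squared error, and the per-transition errors are then averaged over the batch.

```python
import numpy as np

def dqn_loss(q_values, actions, rewards, next_q_values, dones, gamma=0.99):
    """Mean squared TD error over a batch.

    q_values:      (batch, n_actions) online-network outputs for states s
    actions:       (batch,) integer actions that were actually taken
    rewards:       (batch,) observed rewards
    next_q_values: (batch, n_actions) target-network outputs for next states s'
    dones:         (batch,) 1.0 where the episode ended, else 0.0
    """
    batch = np.arange(len(actions))
    q_taken = q_values[batch, actions]  # one scalar per transition
    targets = rewards + gamma * (1.0 - dones) * next_q_values.max(axis=1)
    return float(np.mean((targets - q_taken) ** 2))

# Tiny made-up batch of two transitions
q_values = np.array([[1.0, 2.0], [3.0, 4.0]])
actions = np.array([0, 1])
rewards = np.array([1.0, 0.0])
next_q_values = np.array([[0.0, 2.0], [5.0, 1.0]])
dones = np.array([0.0, 1.0])
loss = dqn_loss(q_values, actions, rewards, next_q_values, dones, gamma=0.5)
print(loss)
```

So per transition the loss is indeed a single scalar term; the network still outputs a vector, but the other entries receive no gradient from that transition.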
I hope someone out there can help on this one!
Best regards Søren Koch
No, the value function V(s_t) does not depend on the policy. You see in the equation that it is defined in terms of an action a_t that maximizes a quantity, so it is not defined in terms of actions as selected by any policy.
In the nondeterministic / stochastic case, you will have that sum over probabilities multiplied by state-values, but this is still independent from any policy. The sum only sums over different possible future states, but every multiplication involves exactly the same (policy-independent) action a_t. The only reason why you have these probabilities is because in the nondeterministic case a specific action in a specific state can lead to one of multiple different possible states. This is not due to policies, but due to stochasticity in the environment itself.
There does also exist such a thing as a value function for policies, and when talking about that a symbol for the policy should be included. But this is typically not what is meant by just "Value function", and also does not match the equation you have shown us. A policy-dependent function would replace the max_{a_t} with a sum over all actions a, and inside the sum the probability pi(s_t, a) of the policy pi selecting action a in state s_t.
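Written out, the two variants look roughly like this. The asker's equation images are not reproduced here, so the reward term r(s_t, a_t), the discount factor gamma and the transition probabilities P are assumed notation.

```latex
% Policy-independent (optimal) value function: max over the action,
% sum over the next states the environment may produce
V^{*}(s_t) = \max_{a_t} \sum_{s_{t+1}} P(s_{t+1} \mid s_t, a_t)
    \bigl[ r(s_t, a_t) + \gamma\, V^{*}(s_{t+1}) \bigr]

% Value function of a policy \pi: the max is replaced by a sum over actions,
% weighted by the policy's probability of selecting each action
V^{\pi}(s_t) = \sum_{a} \pi(s_t, a) \sum_{s_{t+1}} P(s_{t+1} \mid s_t, a)
    \bigl[ r(s_t, a) + \gamma\, V^{\pi}(s_{t+1}) \bigr]
```

In both cases the inner sum over s_{t+1} comes from the environment's stochasticity; only the second equation involves a policy.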
Yes, your assumption is completely right. In the Reinforcement Learning field, a value function is the expected return obtained by starting from a particular state and following a policy π. So yes, strictly speaking, it should be accompanied by the policy symbol π.
The Bellman equation basically represents value functions recursively. However, it should be noted that there are two kinds of Bellman equations:
Bellman optimality equation, which characterizes optimal value functions. In this case, the value function is implicitly associated with the optimal policy. This equation has the non-linear max operator and is the one you have posted. The (optimal) policy dependency is sometimes represented with an asterisk as follows:
Maybe some short texts or papers omit this dependency, assuming it is obvious, but I think any RL textbook should initially include it. See, for example, the books by Sutton & Barto or Busoniu et al.
Bellman equation, which characterizes a value function, in this case associated with any policy π:
In your case, your equation 2.6 is based on the Bellman equation, so it should drop the max operator and include the sum over all actions and possible next states. From Sutton & Barto (sorry for the change of notation with respect to your question, but I think it's understandable):
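To make the difference between the two equations concrete, here is a small NumPy sketch that iterates both backups to their fixed points. The two-state, two-action MDP (arrays P and R), the uniform policy, and the helper names are all made up for illustration.

```python
import numpy as np

# Hypothetical MDP: P[s, a, s'] are transition probabilities, R[s, a] are rewards
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

def q_from_v(v):
    """Expected one-step return for every (s, a): sum over next states s'."""
    return R + gamma * (P @ v)

def optimal_backup(v):
    """Bellman optimality equation: max over actions, no policy involved."""
    return q_from_v(v).max(axis=1)

def policy_backup(v, pi):
    """Bellman equation for a policy: sum over actions weighted by pi[s, a]."""
    return (pi * q_from_v(v)).sum(axis=1)

def fixed_point(backup, n_iters=1000):
    """Apply a backup repeatedly starting from V = 0."""
    v = np.zeros(P.shape[0])
    for _ in range(n_iters):
        v = backup(v)
    return v

pi_uniform = np.full((2, 2), 0.5)  # picks each action with probability 0.5
v_star = fixed_point(optimal_backup)
v_pi = fixed_point(lambda v: policy_backup(v, pi_uniform))
print("V*  :", v_star)
print("V^pi:", v_pi)
```

Both backups contain the same sum over next states; they differ only in how the actions are aggregated, which is why V* is at least as large as V^pi in every state.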
I am trying to implement a CNN in Tensorflow (quite similar architecture to VGG), which then splits into two branches after the first fully connected layer. It follows this paper: https://arxiv.org/abs/1612.01697
Each of the two branches of the network outputs a set of 32 numbers. I want to write a joint loss function, which will take 3 inputs:
The predictions of branch 1 (y)
The predictions of branch 2 (alpha)
The labels Y (ground truth) (q)
and calculate a weighted loss, as in the image below:
Loss function definition
q_hat = tf.divide(tf.reduce_sum(tf.multiply(alpha, y),0), tf.reduce_sum(alpha,0))
loss = tf.abs(tf.subtract(q_hat, q))
I understand the fact that I need to use the tf functions in order to implement this loss function. Having implemented the above function, the network is training, but once trained, it is not outputting the expected results.
Has anyone ever tried combining outputs of two branches of a network in one joint loss function? Is this something TensorFlow supports? Maybe I am making a mistake somewhere here? Any help whatsoever would be greatly appreciated. Let me know if you would like me to add any further details.
From TensorFlow's perspective, there is absolutely no difference between a "regular" CNN graph and a "branched" graph. For TensorFlow, it is just a graph that needs to be executed, so TensorFlow certainly supports this. "Combining two branches into a joint loss" is also nothing special. In fact, it is good that the loss depends on both branches: it means that when you ask TensorFlow to compute the loss, it has to do the forward pass through both branches, which is what you want.
One thing I noticed is that your code for the loss differs from the image. Your code appears to compute this: https://ibb.co/kbEH95
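One thing that may be worth checking is which axis the reductions run over. The NumPy sketch below mirrors the weighted average in the question's code, under the assumption that alpha and y have shape (batch, 32) and q has shape (batch,); those shapes, the random data and the helper name weighted_quality are guesses for illustration, not something stated in the question.

```python
import numpy as np

def weighted_quality(alpha, y, axis):
    """Weighted average sum(alpha * y) / sum(alpha) along the given axis."""
    return np.sum(alpha * y, axis=axis) / np.sum(alpha, axis=axis)

rng = np.random.default_rng(0)
batch, n_outputs = 4, 32
alpha = rng.uniform(0.1, 1.0, size=(batch, n_outputs))  # positive weights (branch 2)
y = rng.uniform(0.0, 10.0, size=(batch, n_outputs))     # predictions (branch 1)
q = rng.uniform(0.0, 10.0, size=(batch,))               # ground-truth labels

# Reducing over axis 1 gives one weighted estimate per sample
q_hat_per_sample = weighted_quality(alpha, y, axis=1)    # shape (4,)

# Reducing over axis 0 (as in the question) mixes different samples together
q_hat_axis0 = weighted_quality(alpha, y, axis=0)         # shape (32,)

loss = np.mean(np.abs(q_hat_per_sample - q))             # scalar L1 loss
print(q_hat_per_sample.shape, q_hat_axis0.shape, loss)
```

If your tensors really are laid out as (batch, 32), then reducing over axis 0 would average across samples rather than across the 32 outputs, which could explain unexpected results. If the layout is different, this does not apply.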
After going through the Caffe tutorial here: http://caffe.berkeleyvision.org/gathered/examples/mnist.html
I am really confused about the different (and efficient) model used in this tutorial, which is defined here: https://github.com/BVLC/caffe/blob/master/examples/mnist/lenet_train_test.prototxt
As I understand it, a Convolution layer in Caffe simply calculates Wx+b for each input, without applying any activation function. If we would like to add an activation function, we should add another layer immediately below that convolutional layer, such as a Sigmoid, TanH, or ReLU layer. Every paper/tutorial I have read on the internet applies an activation function to the neuron units.
It leaves me a big question mark as we only can see the Convolutional layers and Pooling layers interleaving in the model. I hope someone can give me an explanation.
As a side note, another doubt of mine is the max_iter in this solver:
https://github.com/BVLC/caffe/blob/master/examples/mnist/lenet_solver.prototxt
We have 60,000 images for training and 10,000 images for testing. So why is max_iter here only 10,000 (and how can it still reach an accuracy of > 99%)? What does Caffe do in each iteration?
Actually, I'm not so sure whether the accuracy rate is the number of correct predictions divided by the test-set size.
I'm amazed by this example, as I haven't found any other example or framework that can achieve such a high accuracy in such a short time (only 5 minutes to get >99% accuracy). Hence, I suspect there is something I have misunderstood.
Thanks.
Caffe uses batch processing. max_iter is 10,000 because the batch_size is 64, and each iteration processes one batch. Number of epochs = (batch_size x max_iter) / number of training samples, so the number of epochs is roughly 10 (about 10.7). The accuracy is calculated on the test data. And yes, the accuracy of the model is indeed >99%, as the dataset is not very complicated.
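The arithmetic can be checked with a couple of lines of Python, using the numbers from the question and the solver (batch size 64, 10,000 iterations, 60,000 training images). The helper name epochs_seen is just for illustration.

```python
def epochs_seen(batch_size, max_iter, n_train):
    """Number of passes over the training set when each iteration uses one batch."""
    return batch_size * max_iter / n_train

epochs = epochs_seen(batch_size=64, max_iter=10_000, n_train=60_000)
print(f"{epochs:.2f} passes over the training set")  # about 10.67
```

So the network sees every training image roughly ten times, not just a fraction of the data.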
For your question about the missing activation layers, you are correct. The model in the tutorial is missing activation layers. This seems to be an oversight of the tutorial. For the real LeNet-5 model, there should be activation functions following the convolution layers. For MNIST, the model still works surprisingly well without the additional activation layers.
For reference, in Le Cun's 2001 paper, it states:
As in classical neural networks, units in layers up to F6 compute a dot product between their input vector and their weight vector, to which a bias is added. This weighted sum, denoted a_i, for unit i, is then passed through a sigmoid squashing function to produce the state of unit i ...
F6 is the "blob" between the two fully connected layers. Hence the first fully connected layer should have an activation function applied (the tutorial uses a ReLU activation function instead of a sigmoid).
MNIST is the hello world example for neural networks. It is very simple by today's standards. A single fully connected layer can solve the problem with an accuracy of about 92%. LeNet-5 is a big improvement over that.