Hello StackOverflow Community!
I have a question about model-free prediction/control algorithms in Reinforcement Learning.
In David Silver's lectures, the bias/variance trade-off analysis is done for MC and TD (i.e. MC has no bias and high variance whereas TD(0) has some bias and low variance), but these comparisons are made under the assumption that the states of the environment have the Markov property.
Can you comment on what happens to bias and variance:
1. when we use MC in an environment whose states do not have the Markov property
2. and likewise for the TD algorithm
compared to applying them to states that do have the Markov property?
"Neural nets have a weight space symmetry: we can permute all the hidden units in a given layer and obtain an equivalent solution" (From CSC321, lecture 10, Optimation)
I don't think it make sense, is there something wrong with my understanding?
For example, there is a simple DNN with 2 units in the only hidden layer. And there is one local optima and one global optima like this:
Obviously 2 symmetric points will result in different solution, they will go into different optima(the right-bottom one is the global optima).
Please tell me where it goes wrong?
I think you are missing the definition of symmetry.
Geometry is the branch of mathematics that studies invariants under some class of transformations. The invariants of a geometry are called its symmetries. For instance, the symmetries of Euclidean geometry are lengths and angles, because rotations and translations (the group of Euclidean transformations) preserve them. Simply put, in Euclidean geometry, lengths and angles are the symmetries of the geometry. In the same vein, a symmetry of affine geometry is parallelism.
In the context of deep learning, weight space symmetry means that non-identifiable models are invariant under random permutations of their weight layers. This symmetry is relevant because in deep learning there are generally not enough training samples to rule out all parameter settings but one: there usually exists a large number of possible weight combinations for a given dataset that yield similar model performance.
Sure, if you randomly permute the weights of the input layer, you will not get the same result, because the order of the input elements matters.
The permutation symmetry is about permuting the neurons of the hidden layers, not about permuting the weights of a single neuron.
For example, say your hidden layer has 2 neurons with weights w11, w12, w13 and w21, w22, w23.
So the permutation principle states that you can permute
w11 <-> w21, w12 <-> w22 and w13 <-> w23 (together with the two neurons' biases and their outgoing weights in the next layer) and the result will remain the same.
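To make this concrete, here is a minimal NumPy sketch (my own illustration, not from the lecture): swapping the two hidden units, i.e. their incoming weights, biases and outgoing weights, leaves the network's output unchanged.

```python
import numpy as np

# Tiny network: 3 inputs -> 2 hidden units (tanh) -> 1 output
rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 3))   # row i holds the incoming weights wi1, wi2, wi3 of hidden unit i
b1 = rng.normal(size=2)        # hidden biases
W2 = rng.normal(size=(1, 2))   # outgoing weights of the two hidden units
b2 = rng.normal(size=1)

def forward(x, W1, b1, W2, b2):
    h = np.tanh(W1 @ x + b1)
    return W2 @ h + b2

x = rng.normal(size=3)

# Swap the two hidden units: permute the rows of W1, the biases, and the columns of W2
perm = [1, 0]
print(forward(x, W1, b1, W2, b2))                       # original network
print(forward(x, W1[perm], b1[perm], W2[:, perm], b2))  # permuted network, identical output
```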
The weight symmetry here means that there is an equivalent set of weights that maps the input to the output; it does not mean geometrical symmetry in coordinate space. You can have a deeper look in Bishop, Ch. 5.1.
My question is simple: will RL methods with an epsilon equal to zero converge to an optimal policy (with negative and positive reward function values)?
Thanks,
No, it does not. With epsilon = 0 there is no exploration, and without exploration there is no convergence guarantee. It also makes sense intuitively: without exploration you cannot learn the environment well enough to find the optimal policy.
For the Q-learning algorithm, for example, you can see the formal proof in
Watkins, Christopher J. C. H., and Peter Dayan. "Q-learning." Machine Learning 8.3-4 (1992): 279-292.
which shows that Q-learning converges to the optimal values as the number of observations goes to infinity, provided that every state-action pair keeps being visited; epsilon may therefore decay towards zero, but it must stay positive along the way.
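For intuition, here is a minimal ε-greedy sketch (my own toy code, not from the paper) with a decaying schedule; with ε fixed at 0 the agent only ever exploits its current, possibly wrong, Q estimates and never discovers better actions.

```python
import numpy as np

def epsilon_greedy(Q, state, epsilon, rng):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore
    return int(np.argmax(Q[state]))           # exploit the current estimate

def epsilon_schedule(episode):
    """Decays towards 0 but stays positive, so every action keeps being tried."""
    return 1.0 / (episode + 1)

rng = np.random.default_rng(0)
Q = np.zeros((5, 2))                          # toy table: 5 states, 2 actions
a = epsilon_greedy(Q, state=0, epsilon=epsilon_schedule(0), rng=rng)
```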
I'm currently learning about Policy Gradient Descent in the context of Reinforcement Learning. TL;DR, my question is: "What are the constraints on the reward function (in theory and practice) and what would be a good reward function for the case below?"
Details:
I want to implement a Neural Net which should learn to play a simple board game using Policy Gradient Descent. I'll omit the details of the NN as they don't matter. The loss function for Policy Gradient Descent, as I understand it, is the negative log-likelihood: loss = - avg(r * log(p))
My question now is how to define the reward r. Since the game can have 3 different outcomes: win, loss, or draw - it seems that rewarding 1 for a win, 0 for a draw, and -1 for a loss (and some discounted value of those for the actions leading to those outcomes) would be a natural choice.
However, mathematically I have doubts:
Win Reward: 1 - This seems to make sense. It should push probabilities towards 1 for moves involved in wins, with a diminishing gradient the closer the probability gets to 1.
Draw Reward: 0 - This does not seem to make sense. It would simply cancel the corresponding terms out of the equation, and no learning should be possible (as the gradient would always be 0).
Loss Reward: -1 - This should kind of work. It should push probabilities towards 0 for moves involved in losses. However, I'm concerned about the asymmetry of the gradient compared to the win case: the closer the probability gets to 0, the steeper the gradient gets. I'm worried that this would create an extremely strong bias towards a policy that avoids losses, to the degree where the win signal doesn't matter much at all.
You are on the right track. However, I believe you are confusing rewards with action probabilities. In the case of a draw, the agent learns that the return at the end of the episode is zero. In the case of a loss, the loss function is the discounted return (which should be -1) times the log-probabilities of the actions taken. So it will push you towards actions that end in a win and away from actions that end in a loss, with actions ending in a draw falling in the middle. Intuitively, it is very similar to supervised deep learning, only with an additional weighting factor (the return) attached to it.
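To make the weighting concrete, here is a small NumPy sketch with made-up action probabilities (my own toy numbers, not a full training loop), evaluating the same loss = - avg(r * log(p)) from the question for moves from won, drawn and lost games.

```python
import numpy as np

def pg_loss(rewards, action_probs):
    """REINFORCE-style loss: -mean(r * log p) over the sampled moves."""
    r = np.asarray(rewards, dtype=float)
    p = np.asarray(action_probs, dtype=float)
    return -np.mean(r * np.log(p))

# (return, probability of the chosen move) for a few sampled moves:
print(pg_loss([+1, +1], [0.3, 0.6]))  # win:  loss = -log(p) > 0, minimising it raises these probabilities
print(pg_loss([0, 0],   [0.4, 0.5]))  # draw: loss = 0, no gradient through these moves
print(pg_loss([-1, -1], [0.2, 0.7]))  # loss: loss = +log(p) < 0, minimising it lowers these probabilities
```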
Additionally, I believe this paper from Google DeepMind would be useful for you: https://arxiv.org/abs/1712.01815.
They actually talk about solving the chess problem using RL.
I've been studying up on reinforcement learning, but the thing I don't understand is how a Q value is ever calculated. If you use the Bellman equation Q(s,a) = r + γ*max(Q(s',a')), wouldn't it just go on forever? Because Q(s',a') would need the Q value one timestep further, and that would just continue on and on. How does it end?
In Reinforcement Learning you normally try to find a policy (the best action to take in a specific state), and the learning process ends when the policy does not change anymore or the value function (representing the expected reward) has converged.
You seem to be confusing Q-learning with value iteration using the Bellman equation. Q-learning is a model-free technique where you use the obtained reward to update Q:
Q(st, at) ← Q(st, at) + α * (rt+1 + γ * max(Q(st+1, a')) - Q(st, at))
Here the direct reward rt+1 is the reward obtained after having taken action at in state st. α is the learning rate, which should be between 0 and 1: if it is 0 no learning is done, and if it is 1 only the newest reward is taken into account.
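A minimal tabular sketch of that update (my own toy numbers, just to show the mechanics):

```python
import numpy as np

alpha, gamma = 0.1, 0.9
Q = np.zeros((4, 2))                      # toy table: 4 states, 2 actions

def q_update(Q, s, a, r, s_next):
    """One Q-learning step: move Q(s,a) towards r + gamma * max(Q(s',a'))."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

q_update(Q, s=0, a=1, r=1.0, s_next=2)    # one observed transition (s, a, r, s')
```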
Value iteration with the Bellman equation:
Vt+1(s) = max_a Σ_s' Pa(s,s') * (Ra(s,s') + γ * Vt(s'))
Here a model Pa(s,s') is required, also written P(s'|s,a), which is the probability of going from state s to s' using action a, and Ra(s,s') is the corresponding expected reward. To check whether the value function has converged, Vt+1 is normally compared to Vt for all states, and if the largest change is smaller than a small value ε the value function (and with it the policy) is said to have converged:
max_s |Vt+1(s) - Vt(s)| < ε
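And a minimal value-iteration sketch with that stopping criterion (my own toy model; P[a, s, s2] plays the role of Pa(s,s') and R[a, s] is the expected immediate reward):

```python
import numpy as np

n_states, n_actions, gamma, eps = 3, 2, 0.9, 1e-6
rng = np.random.default_rng(0)
P = np.full((n_actions, n_states, n_states), 1.0 / n_states)  # P[a, s, s2] = P(s2|s,a), toy uniform model
R = rng.normal(size=(n_actions, n_states))                    # R[a, s]: expected immediate reward

V = np.zeros(n_states)
while True:
    V_new = np.max(R + gamma * (P @ V), axis=0)   # Bellman backup: best action for every state
    if np.max(np.abs(V_new - V)) < eps:           # largest change below eps for all states
        break
    V = V_new

policy = np.argmax(R + gamma * (P @ V), axis=0)   # greedy policy w.r.t. the converged values
```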
See also:
Difference between Q-learning and Value Iteration
How do I know when a Q-learning algorithm converges?
Sutton & Barto: Reinforcement Learning: An Introduction
After going through the Caffe tutorial here: http://caffe.berkeleyvision.org/gathered/examples/mnist.html
I am really confused by the different (and efficient) model used in this tutorial, which is defined here: https://github.com/BVLC/caffe/blob/master/examples/mnist/lenet_train_test.prototxt
As I understand it, a Convolution layer in Caffe simply computes Wx+b for each input, without applying any activation function. If we want to add an activation function, we have to add another layer immediately after that convolution layer, such as a Sigmoid, TanH, or ReLU layer. Every paper/tutorial I have read applies an activation function to the neuron units.
That leaves me with a big question mark, as in this model we only see Convolution layers and Pooling layers interleaved. I hope someone can give me an explanation.
As a side note, another doubt of mine is the max_iter in this solver:
https://github.com/BVLC/caffe/blob/master/examples/mnist/lenet_solver.prototxt
We have 60,000 images for training and 10,000 images for testing. So why is max_iter here only 10,000 (and it still achieves a > 99% accuracy rate)? What does Caffe do in each iteration?
Actually, I'm not so sure whether the accuracy rate is the number of correct predictions divided by the test set size.
I'm very amazed by this example, as I haven't found any other example or framework that can achieve such a high accuracy rate in such a short time (only 5 minutes to get > 99% accuracy). Hence, I suspect there is something I have misunderstood.
Thanks.
Caffe uses batch processing. max_iter is 10,000 because the batch_size is 64. Number of epochs = (batch_size × max_iter) / number of training samples, so here the number of epochs is about 64 × 10,000 / 60,000 ≈ 10.7. The accuracy is calculated on the test data. And yes, the accuracy of the model is indeed > 99%, as the dataset is not very complicated.
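In numbers, just checking that formula:

```python
batch_size, max_iter, n_train = 64, 10_000, 60_000
iters_per_epoch = n_train / batch_size       # 937.5 iterations cover the training set once
n_epochs = batch_size * max_iter / n_train   # ~10.7 passes over the training data
print(iters_per_epoch, n_epochs)
```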
For your question about the missing activation layers, you are correct: the model in the tutorial has no explicit activation layers after the convolutions. This seems to be an oversight in the tutorial. In the real LeNet-5 model, there are activation functions following the convolution layers. For MNIST, the model still works surprisingly well without those additional activation layers.
For reference, LeCun's 1998 LeNet-5 paper states:
As in classical neural networks, units in layers up to F6 compute a dot product between their input vector and their weight vector, to which a bias is added. This weighted sum, denoted a_i, for unit i, is then passed through a sigmoid squashing function to produce the state of unit i ...
F6 is the "blob" between the two fully connected layers. Hence the first fully connected layers should have an activation function applied (the tutorial uses ReLU activation functions instead of sigmoid).
MNIST is the "hello world" example for neural networks. It is very simple by today's standards: a single fully connected layer can already solve the problem with an accuracy of about 92%. LeNet-5 is a big improvement over that baseline.