Explanation behind actor-critic algorithm in pytorch example? - reinforcement-learning

PyTorch provides a good example of using actor-critic to play CartPole in the OpenAI Gym environment.
I'm confused about several of their equations in the code snippet found at https://github.com/pytorch/examples/blob/master/reinforcement_learning/actor_critic.py#L67-L79:
R = 0
saved_actions = model.saved_actions
value_loss = 0
rewards = []
for r in model.rewards[::-1]:
    R = r + args.gamma * R
    rewards.insert(0, R)
rewards = torch.Tensor(rewards)
rewards = (rewards - rewards.mean()) / (rewards.std() + np.finfo(np.float32).eps)
for (action, value), r in zip(saved_actions, rewards):
    action.reinforce(r - value.data.squeeze())
    value_loss += F.smooth_l1_loss(value, Variable(torch.Tensor([r])))
optimizer.zero_grad()
final_nodes = [value_loss] + list(map(lambda p: p.action, saved_actions))
gradients = [torch.ones(1)] + [None] * len(saved_actions)
autograd.backward(final_nodes, gradients)
optimizer.step()
What do r and value mean in this case? Why do they run REINFORCE on the action space with the reward equal to r - value? And why do they try to set the value so that it matches r?
Thanks for your help!

First, the rewards are collected for a while, along with the state/action pairs that produced them.
Then r - value is the difference between the actual (discounted) return and the expected return.
That difference is used to adjust the expected value of taking that action from that state.
So if in state "middle" the expected reward for action "jump" was 10 and the actual reward was only 2, the AI was off by -8 (2 - 10). Reinforce means "adjust expectations". If we adjust by half, the new expected reward is 10 - (8 * .5) = 6; the AI really thought it would get 10, but now it's less confident and thinks 6 is a better guess. If the AI is off by only a little, say an error of 2, it adjusts by a smaller amount: 10 - (2 * .5) = 9.
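Note that Variable and the .reinforce() call in that snippet come from an old PyTorch API that has since been removed. A minimal sketch of the same update in current PyTorch, assuming saved_actions holds (log_prob, value) pairs (which is roughly how later revisions of that example are written) and rewards are the normalised returns computed above:
policy_losses = []
value_losses = []
for (log_prob, value), R in zip(saved_actions, rewards):
    advantage = R - value.item()                      # same quantity as r - value in the snippet
    policy_losses.append(-log_prob * advantage)       # REINFORCE with the value as a baseline
    value_losses.append(F.smooth_l1_loss(value, torch.tensor([R])))  # push the value towards the return
optimizer.zero_grad()
loss = torch.stack(policy_losses).sum() + torch.stack(value_losses).sum()
loss.backward()
optimizer.step()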

Related

Is learning and cumulative reward a good metrics to evaluate a RL model?

I am new to reinforcement learning.
I have a problem here that I am using DQN on. I have plotted a cumulative reward curve while learning and taking actions. After 100 episodes it shows a lot of fluctuations, which does not tell me whether the agent has learnt anything.
However, instead of relying on the cumulative reward during learning, I also put the model through the whole simulation without the learning step after each episode, and this shows me that the model is actually learning well. It extended the program runtime by quite a bit, though.
In addition, I have to extract the best model along the way, because the final model sometimes performs badly.
Any advice or explanation for this?
Try using the average return; it's usually a good metric for knowing whether the agent is improving or not.
If you're using tf_agents you can do something like this:
...
checkpoint_dir = os.path.join('./', 'checkpoint')
train_checkpointer = common.Checkpointer(
    ckpt_dir=checkpoint_dir,
    max_to_keep=1,
    agent=agent,
    policy=agent.policy,
    replay_buffer=replay_buffer,
    global_step=train_step
)
policy_dir = os.path.join('./', 'policy')
tf_policy_saver = policy_saver.PolicySaver(agent.policy)

def train_agent(n_iterations):
    best_AverageReturn = 0
    time_step = None
    policy_state = agent.collect_policy.get_initial_state(tf_env.batch_size)
    iterator = iter(dataset)
    for iteration in range(n_iterations):
        time_step, policy_state = collect_driver.run(time_step, policy_state)
        trajectories, buffer_info = next(iterator)
        train_loss = agent.train(trajectories)
        if iteration % 10 == 0:
            print("\r{} loss:{:.5f}".format(iteration, train_loss.loss.numpy()), end="")
        if iteration % 1000 == 0 and averageReturnMetric.result() > best_AverageReturn:
            best_AverageReturn = averageReturnMetric.result()
            train_checkpointer.save(train_step)
            tf_policy_saver.save(policy_dir)
Every 1,000 iterations the train function evaluates the average return and creates a checkpoint if it has improved.
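If you are not using tf_agents, the same idea is easy to reproduce by hand: every so often, run the current greedy policy for a few full episodes with learning switched off and average the episode returns. A rough sketch (env and policy are stand-ins for your own environment and greedy policy, using the classic Gym step API):
def average_return(env, policy, n_episodes=10):
    """Run the greedy policy for n_episodes and return the mean episode return."""
    total = 0.0
    for _ in range(n_episodes):
        state = env.reset()
        done, episode_return = False, 0.0
        while not done:
            action = policy(state)                    # greedy action, no exploration
            state, reward, done, _ = env.step(action)
            episode_return += reward
        total += episode_return
    return total / n_episodes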

Initialisation of weights for deeplearning model

I am going through a book on deep learning which initializes weights between two layers of neurons as:
w = np.random.randn(layers[i] + 1, layers[i + 1] + 1)
self.W.append(w / np.sqrt(layers[i]))
As per the book, the division by np.sqrt(layers[i]) in the second line of code is done for the following reason:
scale w by dividing by the square root of the number of nodes in the current layer, thereby
normalizing the variance of each neuron’s output
What does it exactly mean? And how would it impact if we don't do it?
Weight initialization is very important for tackling vanishing/exploding gradients. For the outputs (and, in the reverse direction, the gradients) to flow properly, the variance of each layer's outputs needs to be equal to the variance of its inputs, and likewise for the gradients flowing backwards. The number of input and output connections of a layer are called its fan-in and fan-out.
To better explain what I mean, here is an example. Assume we have a hundred consecutive layers and apply a feed-forward calculation with linear activations (after all, it is just matrix multiplication); the data is 500 samples with 100 features:
neurons, features = 100, 100
n_layers = 100
X = np.random.normal(size=(500, features)) # your input
mean, var = 0, 0
for layer in range(n_layers):
    W = np.random.normal(size=(features, neurons))
    X = np.dot(X, W)
    mean = mean + X.mean()
    var = var + X.var()
mean/n_layers, np.sqrt(var/n_layers)
# output:
(-4.055498760574568e+95, 8.424477240271639e+98)
You will see that the activations end up with a huge mean and standard deviation. Let's break this problem down: the result of a matrix multiplication has a standard deviation very close to the square root of the number of fan-in (input) connections. This property can be verified with this snippet of code:
fan_in = 1000 # change it to any number
X = np.random.normal(size=(100, fan_in))
W = np.random.normal(size=(fan_in, 1))
np.dot(X, W).std()
# result:
32.764359213560454
This happens because each output element is the sum of fan_in (1000 in the above case) products of one row of the input X with one column of W. Therefore, if we scale every weight by 1/sqrt(fan_in), we maintain the distribution of the activations as they flow through the network, as seen in the following snippet:
neurons, features = 100, 100
n_layers = 100
X = np.random.normal(size=(500, features)) # your input
mean, var = 0, 0
for layer in range(n_layers):
    W = np.random.normal(size=(features, neurons), scale=np.sqrt(1 / neurons))  # scale the weights by 1/sqrt(fan-in); here fan-in == features == neurons == 100
    X = np.dot(X, W)
    mean = mean + X.mean()
    var = var + X.var()
mean/n_layers, np.sqrt(var/n_layers)
# output:
(0.0002608301398189543, 1.021452570914829)
You can read more about kernel initialization in the following blog

Implementing WNGrad in Pytorch?

I'm trying to implement the WNGrad optimizer (technically WN-Adam, algorithm 4 in the paper) (WNGrad) in PyTorch. I've never implemented an optimizer in PyTorch before, so I don't know if I've done it correctly (I started from the Adam implementation). The optimizer does not make much progress and falls flat, as I would expect (the bj values can only increase monotonically, which happens quickly, so no progress is made), but I'm guessing I have a bug. Standard optimizers (Adam, SGD) work fine on the same model I'm trying to optimize.
Does this implementation look correct?
import torch
from torch.optim import Optimizer

class WNAdam(Optimizer):
    """Implements WNAdam algorithm.
    It has been proposed in `WNGrad: Learn the Learning Rate in Gradient Descent`_.
    Arguments:
        params (iterable): iterable of parameters to optimize or dicts defining
            parameter groups
        lr (float, optional): learning rate (default: 0.1)
        beta1 (float, optional): exponential smoothing coefficient for gradient.
            When beta=0 this implements WNGrad.
    .. _WNGrad\: Learn the Learning Rate in Gradient Descent:
        https://arxiv.org/abs/1803.02865
    """
    def __init__(self, params, lr=0.1, beta1=0.9):
        if not 0.0 <= beta1 < 1.0:
            raise ValueError("Invalid beta1 parameter: {}".format(beta1))
        defaults = dict(lr=lr, beta1=beta1)
        super().__init__(params, defaults)

    def step(self, closure=None):
        """Performs a single optimization step.
        Arguments:
            closure (callable, optional): A closure that reevaluates the model
                and returns the loss.
        """
        loss = None
        if closure is not None:
            loss = closure()
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad.data
                state = self.state[p]
                # State initialization
                if len(state) == 0:
                    state['step'] = 0
                    # Exponential moving average of gradient values
                    state['exp_avg'] = torch.zeros_like(p.data)
                    # Learning rate adjustment
                    state['bj'] = 1.0
                exp_avg = state['exp_avg']
                beta1 = group['beta1']
                state['step'] += 1
                state['bj'] += (group['lr']**2)/(state['bj'])*grad.pow(2).sum()
                # update exponential moving average
                exp_avg.mul_(beta1).add_(1 - beta1, grad)
                bias_correction = 1 - beta1 ** state['step']
                p.data.sub_(group['lr'] / state['bj'] / bias_correction, exp_avg)
        return loss
The paper's author has an open sourced implementation on GitHub.
The WNGrad paper states it is inspired by batch (and weight) normalization. You should use the L2 norm with respect to the weight dimensions (don't sum over everything), as shown in this algorithm.
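As a sketch of one possible reading of that advice (keep bj per weight entry instead of a single scalar per parameter tensor), the update could look like the following; this is an assumption about the intended fix, not the paper's reference implementation:
import torch

def wngrad_like_step(p, lr, state):
    """One illustrative WNGrad-style step with a per-entry bj (assumption, not reference code)."""
    grad = p.grad.data
    if 'bj' not in state:
        state['bj'] = torch.ones_like(p.data)          # one bj value per weight entry
    # b <- b + lr^2 / b * g^2, applied elementwise
    state['bj'] += (lr ** 2) / state['bj'] * grad.pow(2)
    # effective per-entry learning rate is lr / bj
    p.data -= (lr / state['bj']) * grad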

How to work with the CatBoost overfitting detector

I am trying to understand the catboost overfitting detector. It is described here:
https://tech.yandex.com/catboost/doc/dg/concepts/overfitting-detector-docpage/#overfitting-detector
Other gradient boosting packages like lightgbm and xgboost use a parameter called early_stopping_rounds, which is easy to understand (it stops the training once the validation error hasn't decreased in early_stopping_rounds rounds).
However I have a hard time understanding the p_value approach used by catboost. Can anyone explain how this overfitting detector works and when it stops the training?
It's not documented on the Yandex website or at the github repository, but if you look carefully through the python code posted to github (specifically here), you will see that the overfitting detector is activated by setting "od_type" in the parameters. Reviewing the recent commits on github, the catboost developers also recently implemented a tool similar to the "early_stopping_rounds" parameter used by lightGBM and xgboost, called "Iter."
To set the number of rounds after the most recent best iteration to wait before stopping, provide a numeric value in the "od_wait" parameter.
For example:
fit_param <- list(
iterations = 500,
thread_count = 10,
loss_function = "Logloss",
depth = 6,
learning_rate = 0.03,
od_type = "Iter",
od_wait = 100
)
I am using the catboost library with R 3.4.1. I have found that setting the "od_type" and "od_wait" parameters in the fit_param list works well for my purposes.
I realize this is not answering your question about the way to use the p_value approach also implemented by the catboost developers; unfortunately I cannot help you there. Hopefully someone else can explain that setting to the both of us.
CatBoost now supports early_stopping_rounds: see the fit method parameters.
Sets the overfitting detector type to Iter and stops the training
after the specified number of iterations since the iteration with the
optimal metric value.
This works very much like early_stopping_rounds in xgboost.
Here is an example:
from catboost import CatBoostRegressor, Pool
from sklearn.model_selection import train_test_split
import numpy as np
y = np.random.normal(0, 1, 1000)
X = np.random.normal(0, 1, (1000, 1))
X[:, 0] += y * 2
X_train, X_eval, y_train, y_eval = train_test_split(X, y, test_size=0.1)
train_pool = Pool(X_train, y_train)
eval_pool = Pool(X_eval, y_eval)
model = CatBoostRegressor(iterations=1000, learning_rate=0.1)
model.fit(train_pool, eval_set=eval_pool, early_stopping_rounds=10)
The result should be something like this:
522: learn: 0.3994718 test: 0.4294720 best: 0.4292901 (514) total: 957ms remaining: 873ms
523: learn: 0.3994580 test: 0.4294614 best: 0.4292901 (514) total: 958ms remaining: 870ms
524: learn: 0.3994495 test: 0.4294806 best: 0.4292901 (514) total: 959ms remaining: 867ms
Stopped by overfitting detector (10 iterations wait)
bestTest = 0.4292900745
bestIteration = 514
Shrink model to first 515 iterations.
early_stopping_rounds takes into account both the od_type='Iter' and od_wait parameters. There is no need to set od_type and od_wait individually; just set the early_stopping_rounds parameter.
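For comparison, a roughly equivalent configuration spelled out with the explicit detector parameters (a sketch that reuses train_pool and eval_pool from the example above):
model = CatBoostRegressor(
    iterations=1000,
    learning_rate=0.1,
    od_type='Iter',   # stop a fixed number of iterations after the best one
    od_wait=10,       # same role as early_stopping_rounds=10
)
model.fit(train_pool, eval_set=eval_pool)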

Summing Tensors

I'm implementing the system detailed in this paper.
On page 3, section 4 it shows the form that tensors take within the system:
R [ cos(2t), sin(2t); sin(2t), -cos(2t) ]
In my system, I only store R and t, since everything can be calculated from them.
However, I've got to the point where I need to sum two of these tensors (page 4, section 5.2). How can I find values for R and t after summing two tensors of this form?
I guess that's what you are looking for:
x = R_1*cos(2*t_1) + R_2*cos(2*t_2)
y = R_1*sin(2*t_1) + R_2*sin(2*t_2)
R_result = sqrt(x*x+y*y)
t_result = atan2(y,x)/2
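A quick way to convince yourself this is right is to check it against a direct element-by-element matrix addition; a small NumPy sketch (the R and t test values below are arbitrary):
import numpy as np

def tensor(R, t):
    """Build the 2x2 tensor R * [[cos 2t, sin 2t], [sin 2t, -cos 2t]]."""
    return R * np.array([[np.cos(2*t), np.sin(2*t)],
                         [np.sin(2*t), -np.cos(2*t)]])

R_1, t_1, R_2, t_2 = 1.5, 0.3, 2.0, 1.1

x = R_1*np.cos(2*t_1) + R_2*np.cos(2*t_2)
y = R_1*np.sin(2*t_1) + R_2*np.sin(2*t_2)
R_result = np.hypot(x, y)
t_result = np.arctan2(y, x) / 2

# the reconstructed tensor matches the element-wise sum of the two originals
print(np.allclose(tensor(R_result, t_result), tensor(R_1, t_1) + tensor(R_2, t_2)))  # True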
Each term reduces to
R_1 trg(2 t_1) + R_2 trg(2 t_2) = R_1 trg_1 + R_2 trg_2
where trg represents either sin or cos and the indexed version takes the obvious meaning. So this is just an ordinary problem in trigonometric identities, repeated a couple of times.
Let
Q = (R_1 + R_2)/2
S = (R_1 - R_2)/2
then
R_1 trg(2 t_1) + R_2 trg(2 t_2) = Q (trg_1 + trg_2) + S (trg_1 - trg_2)
which involves identities you can look up.
Sorry, adding two tensors is nothing more than algebra. The two matrices have to be the same size, and you add them term by term.
You can't just add the radii and angles and plug them back into the tensor. Do the addition properly and it'll work. Here's the first term:
R1*cos(2t1) + R2*cos(2t2) = ?
Here's the answer from Wolfram Alpha. As you can see, it doesn't simplify into a nice, neat expression with an R and a T for you.
In case you haven't thought of it, put the tensor sum into Wolfram Alpha and see what it gives you. They're better at algebra than anyone at this site. Why not get an independent check of your work?