How does the soft actor-critic algorithm deal with the policy gradient? - reinforcement-learning

So I was reading the soft actor-critic paper https://arxiv.org/pdf/1801.01290.pdf
The actor uses a stochastic policy that samples from a distribution, and a neural network is used to approximate the policy. Instead of actually "sampling" the action, the authors expand the input of the network to the state plus a noise vector:
at = fφ(x; st)
in which x is the noise vector, say [x1, x2].
Then the probability πφ(at|st) is p(x1)*p(x2), I think.
Which would mean that log πφ(at|st) (the term used for the entropy) is independent of the parameters φ and of at.
Thus the policy gradient given in the paper,
∇φJπ(φ) = ∇φ log πφ(at|st)
+ (∇at log πφ(at|st) − ∇at Q(st, at)) * ∇φ fφ(x; st)
can be simplified to
∇φJπ(φ) = −∇at Q(st, at) * ∇φ fφ(x; st)
which is identical to DDPG.
So where did I make a mistake? Can someone help me?
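For reference, here is a minimal PyTorch-style sketch of what the reparameterized sampling at = fφ(x; st) and the corresponding log πφ(at|st) typically look like for a diagonal Gaussian policy with tanh squashing. The class and variable names are illustrative, not the authors' code:

import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Linear(state_dim, 2 * action_dim)  # outputs mean and log_std

    def forward(self, state):
        mean, log_std = self.net(state).chunk(2, dim=-1)
        std = log_std.exp()
        x = torch.randn_like(mean)             # the noise vector x
        pre_tanh = mean + std * x              # f_phi(x; s_t) before squashing
        action = torch.tanh(pre_tanh)          # squashed action a_t
        dist = torch.distributions.Normal(mean, std)
        # change-of-variables correction for the tanh squashing
        log_prob = dist.log_prob(pre_tanh) - torch.log(1 - action.pow(2) + 1e-6)
        return action, log_prob.sum(-1)

In this sketch log_prob is computed from mean and std, which are outputs of the network, so it is a function of both φ and at rather than of the noise density alone.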

Related

What do BatchNorm2d's running_mean / running_var mean in PyTorch?

I'd like to know exactly what the running_mean and running_var are that I can access from nn.BatchNorm2d.
Example code is here, where bn means nn.BatchNorm2d:
vector = torch.cat([
    torch.mean(self.conv3.bn.running_mean).view(1), torch.std(self.conv3.bn.running_mean).view(1),
    torch.mean(self.conv3.bn.running_var).view(1),  torch.std(self.conv3.bn.running_var).view(1),
    torch.mean(self.conv5.bn.running_mean).view(1), torch.std(self.conv5.bn.running_mean).view(1),
    torch.mean(self.conv5.bn.running_var).view(1),  torch.std(self.conv5.bn.running_var).view(1)
])
I couldn't figure out what running_mean and running_var mean from the PyTorch official documentation or the user community.
What do nn.BatchNorm2d.running_mean and nn.BatchNorm2d.running_var mean?
From the original BatchNorm paper:
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, Sergey Ioffe and Christian Szegedy, ICML 2015
you can see in Algorithm 1 how the statistics of a given batch are computed.
However, what is kept in memory across batches are the running stats, i.e. statistics that are updated iteratively on each batch seen in training mode. The computation of the running mean and running variance is actually quite well explained in the documentation page of nn.BatchNorm2d:
By default, the momentum coefficient is set to 0.1; it regulates how much the current batch statistics affect the running statistics (new_running_stat = (1 - momentum)*running_stat + momentum*batch_stat):
closer to 1 means the new running stat is closer to the current batch statistics, whereas
closer to 0 means the current batch stats will not contribute much to updating the new running stats.
It's worth pointing out that BatchNorm2d is applied across the spatial dimensions in addition, of course, to the batch dimension. Given a batch of shape (b, c, h, w), it will compute the statistics across (b, h, w). This means the running statistics have shape (c,), i.e. there are as many statistics components as there are input channels (for both mean and variance).
Here is a minimal example:
>>> bn = nn.BatchNorm2d(10)
>>> x = torch.rand(2,10,2,2)
Since track_running_stats is set to True by default on BatchNorm2d, it will update the running stats whenever a forward pass is done in training mode.
The running mean and variance are initialized to zeros and ones, respectively.
>>> running_mean, running_var = torch.zeros(x.size(1)), torch.ones(x.size(1))
Let's do a forward pass on bn in training mode and check its running stats:
>>> bn(x)
>>> bn.running_mean, bn.running_var
(tensor([0.0650, 0.0432, 0.0373, 0.0534, 0.0476,
0.0622, 0.0651, 0.0660, 0.0406, 0.0446]),
tensor([0.9027, 0.9170, 0.9162, 0.9082, 0.9087,
0.9026, 0.9136, 0.9043, 0.9126, 0.9122]))
Now let's compute those stats by hand. The batch statistics are taken over the (b, h, w) dimensions, and the running variance is updated with the unbiased batch variance (torch.var's default):
>>> momentum = 0.1
>>> xmean = x.mean(dim=(0, 2, 3))
>>> xvar = x.var(dim=(0, 2, 3))
>>> (1-momentum)*running_mean + momentum*xmean
tensor([0.0650, 0.0432, 0.0373, 0.0534, 0.0476,
        0.0622, 0.0651, 0.0660, 0.0406, 0.0446])
>>> (1-momentum)*running_var + momentum*xvar
tensor([0.9027, 0.9170, 0.9162, 0.9082, 0.9087,
        0.9026, 0.9136, 0.9043, 0.9126, 0.9122])

Does the MLM loss include the non-masked tokens' loss too?

In BERT, I understand what the Masked Language Model (MLM) pretraining task does, but how exactly is the loss for this task calculated?
It is obvious that the loss (e.g. cross-entropy loss) for the masked tokens is included in the final loss.
But what about the other tokens, which aren't masked? Is a loss calculated for these tokens and included in the final loss as well?
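For what it's worth, a common way to implement this in PyTorch-style code (a sketch of the usual convention, not necessarily what every BERT implementation does) is to set the labels of non-masked positions to the ignore index, so that only the masked tokens contribute to the cross-entropy loss:

import torch
import torch.nn.functional as F

batch, seq_len, vocab = 2, 8, 30522
logits = torch.randn(batch, seq_len, vocab)          # dummy model output
labels = torch.randint(0, vocab, (batch, seq_len))   # original token ids
mask = torch.rand(batch, seq_len) < 0.15             # True where a token was masked

mlm_labels = labels.clone()
mlm_labels[~mask] = -100   # F.cross_entropy ignores targets equal to -100 by default

loss = F.cross_entropy(logits.view(-1, vocab), mlm_labels.view(-1))
# only the masked positions contribute to the loss; everything else is ignored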

Loss function in Faster-RCNN

I read many articles online today about Fast R-CNN and Faster R-CNN. From what I understand, in Faster R-CNN we train an RPN network to choose "the best region proposals", something Fast R-CNN does in a non-learned way. We have a smooth L1 loss and a log loss in this case to better train the network parameters during backpropagation. Now, I would like to understand a point regarding the RPN:
If, for a given region proposal, we had 2 (weird case) different objects in the original image, with two different related bounding boxes (both with IoU > 0.7), should we use in the loss function the ground-truth bounding box that has the highest IoU with the predicted anchor box?
Thanks.
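Not a full answer, but to make the matching step concrete, here is a minimal sketch (assuming the usual IoU-based assignment, using torchvision's box_iou; names and thresholds are illustrative) of picking, for each anchor, the ground-truth box with the highest IoU and restricting the regression loss to the positive anchors:

from torchvision.ops import box_iou

def match_anchors_to_gt(anchors_xyxy, gt_boxes_xyxy, pos_iou=0.7):
    # pairwise IoU between anchors (N, 4) and ground-truth boxes (M, 4)
    ious = box_iou(anchors_xyxy, gt_boxes_xyxy)   # (N, M)
    best_iou, best_gt = ious.max(dim=1)           # highest-IoU GT box per anchor
    positive = best_iou > pos_iou                 # anchors treated as positives
    return best_gt, positive

# the smooth L1 regression loss is then computed only on positive anchors,
# against the box deltas of their best-matching ground-truth box, e.g.
# reg_loss = F.smooth_l1_loss(pred_deltas[positive], target_deltas[positive])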

How does the score function help in policy gradient methods?

I'm trying to learn policy gradient methods for reinforcement learning, but I'm stuck at the score function part.
While searching for maximum or minimum points of a function, we take the derivative, set it to zero, and look for the points that satisfy this equation.
In policy gradient methods, we do it by taking the gradient of the expectation over trajectories and we get:
[image: gradient of the objective function]
Here I can't see how this gradient of the log policy shifts the distribution (through its parameters θ) to increase the scores of its samples, mathematically. Don't we look for something that makes this objective function's gradient zero, as I explained above?
What you want to maximize is
J(theta) = int( p(tau;theta)*R(tau) )
The integral is over tau (the trajectory) and p(tau;theta) is its probability (i.e., of seeing the sequence state, action, next state, next action, ...), which depends on both the dynamics of the environment and the policy (parameterized by theta). Formally
p(tau;theta) = p(s_0)*pi(a_0|s_0;theta)*P(s_1|s_0,a_0)*pi(a_1|s_1;theta)*P(s_2|s_1,a_1)*...
where P(s'|s,a) is the transition probability given by the dynamics.
Since we cannot control the dynamics, only the policy, we optimize w.r.t. its parameters, and we do it by gradient ascent, meaning that we take the direction given by the gradient. The equation in your image comes from the log-trick df(x)/dx = f(x)*d(logf(x))/dx.
In our case f(x) is p(tau;theta) and we get your equation. Then, since we only have access to a finite amount of data (our samples), we approximate the expectation (the integral) with a sample average.
Step after step, you will (ideally) reach a point where the gradient is 0, meaning that you reached a (local) optimum.
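To see the estimator in action, here is a tiny self-contained sketch (a one-step problem with two actions; the names are illustrative) showing that averaging R(tau)*d(log p(tau;theta))/dtheta over samples gives a usable gradient to ascend:

import torch

theta = torch.zeros(2, requires_grad=True)   # policy parameters (2 actions)
rewards = torch.tensor([1.0, 0.0])           # R(tau) for each action

probs = torch.softmax(theta, dim=0)          # pi(a; theta)
actions = torch.multinomial(probs, num_samples=1000, replacement=True)

# surrogate whose gradient is the score-function estimate:
#   (1/N) * sum_i R(tau_i) * d/dtheta log pi(tau_i; theta)
surrogate = (rewards[actions] * torch.log(probs)[actions]).mean()
surrogate.backward()
print(theta.grad)   # ascending this direction increases the probability of the rewarding action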
You can find a more detailed explanation here.
EDIT
Informally, you can think of learning the policy which increases the probability of seeing high return R(tau). Usually, R(tau) is the cumulative sum of the rewards. For each state-action pair (s,a) you therefore maximize the sum of the rewards you get from executing a in state s and following pi afterwards. Check this great summary for more details (Fig 1).

How to do reinforcement learning with regression instead of classification

I'm trying to apply reinforcement learning to a problem where the agent interacts through continuous numerical outputs using a recurrent network. Basically, it is a control problem where two outputs control how the agent behaves.
I define a policy as epsilon-greedy: (1 - eps) of the time the output control values are used as-is, and eps of the time the output values are perturbed by a small Gaussian perturbation.
In this sense the agent can explore.
In most of the reinforcement learning literature I see that policy learning requires discrete actions, which can be learned with the REINFORCE (Williams 1992) algorithm, but I'm unsure what method to use here.
At the moment what I do is use masking to only learn the top choices, using an algorithm based on Metropolis-Hastings to decide whether a transition goes toward the optimal policy. Pseudo code:
input: rewards, timeIndices
// rewards in (0,1) and optimal is 1
// relate rewards to likelihood via L(r) = exp(-|r - 1|/std)
// r <= 1 => |r - 1| = 1 - r
targetMask = zeros(timeIndices.length)
neglogLi = (1 - mean(rewards)) / std
// Go through the rewards in random order to approximate a Markov process
for r, idx in shuffle(rewards, timeIndices):
    neglogLj = (1 - r)/std
    if neglogLj < neglogLi || log(random.uniform()) < neglogLi - neglogLj:
        // Accept transition, i.e. learn this action
        targetMask[idx] = 1
        neglogLi = neglogLj
This provides a targetMask with ones for the actions that will be learned using standard backprop.
Can someone point me to a proper or better way of doing this?
Policy gradient methods are good for learning continuous control outputs. If you look at http://rll.berkeley.edu/deeprlcourse/#lectures, the Feb 13 lecture as well as the March 8 through March 15 lectures might be useful to you. Actor Critic methods are covered there, as well.
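To make that concrete, here is a minimal PyTorch sketch (illustrative names and dummy data, not tied to your network) of a REINFORCE-style update with a Gaussian policy over continuous control values, where exploration comes from sampling the policy itself rather than from an epsilon-greedy perturbation:

import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Outputs a Gaussian distribution over continuous control values."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.mean = nn.Linear(obs_dim, act_dim)
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        return torch.distributions.Normal(self.mean(obs), self.log_std.exp())

policy = GaussianPolicy(obs_dim=4, act_dim=2)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# one REINFORCE update on a (dummy) batch of observations, actions and returns
obs = torch.randn(32, 4)
dist = policy(obs)
actions = dist.sample()                     # stochastic exploration
returns = torch.randn(32)                   # placeholder for observed returns

loss = -(dist.log_prob(actions).sum(-1) * returns).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()

With an advantage estimate in place of the raw return, this becomes the actor part of the actor-critic methods mentioned above.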