Cross entropy loss function with teacher forcing - deep-learning

I'm currently working on a presentation about RNNs and I'm trying to find the formula for cross entropy loss when using teacher forcing.
In Deep Learning by Goodfellow teacher forcing is described using the maximum likelihood criterion:
log p(y^(1), y^(2) | x^(1), x^(2)) = log p(y^(2) | y^(1), x^(1), x^(2)) + log p(y^(1) | x^(1), x^(2))
But I do not fully understand what the loss function would look like, and why the term for y^(2) also conditions on x^(1) while the term for y^(1) also conditions on x^(2). From what I have understood, the output at timestep t only depends on y^(t-1) and x^(t).
So I thought the loss function should look like this: L^(t) = -log p(y^(t) | y^(t-1), x^(t))
Is that correct, or how should it look otherwise?
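For what it's worth, a minimal PyTorch sketch (my own, not from the book) of the summed per-timestep cross entropy under teacher forcing could look like the following, where teacher_forcing_loss, logits and targets are hypothetical names:
import torch.nn.functional as F

# Hypothetical shapes: logits is (T, vocab_size), where the logits at step t
# were produced with the ground-truth y^(t-1) fed back as input (teacher
# forcing); targets is (T,) holding the ground-truth y^(t).
def teacher_forcing_loss(logits, targets):
    # Sum over timesteps of L^(t) = -log p(y^(t) | y^(t-1), x^(t)); note that
    # through the hidden state the prediction at step t in general conditions
    # on all of x^(1..t), not just x^(t).
    return F.cross_entropy(logits, targets, reduction='sum')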

Related

DDQN for Connect 4: Sudden explosion of Loss

I am trying to solve Connect 4 with DDQN through the self-play regime that was used for AlphaZero. That means I let a student version play against a teacher version of itself and replace the teacher with the student once the student wins more than 60% of the games. I actually get good results fairly quickly: after only ~5k games played, the agent is able to win more than 95% of games against a random player. Also, from interacting with the agent, one can see that it learns to prevent "easy wins" and finds some nice strategies.
However, after about 100,000 games played, the loss steadily increases until it eventually blows up. It is not clear to me why exactly this behaviour occurs. I have tried different learning rates (1e-5 up to 1e-3) and different replay buffer sizes (10,000 up to 1,000,000). My update rule looks as follows:
# Get predicted Q-values for the actions that were taken
q_pred = self.Q_eval.forward(state_batch).to(self.Q_eval.device)
q_pred = q_pred[batch_index, action_indices]
# Swap -1 and 1 so new_state_batch is seen from the opponent's perspective
new_state_batch *= -1.
# Online network evaluates the next state (used to select the best action)
q_eval = self.Q_eval.forward(new_state_batch).to(self.Q_eval.device)
# Mask out invalid moves (columns that are already full)
move_validity = torch.Tensor(new_state_batch[:, :self.n_actions] == 0).to(self.Q_eval.device)
discard_values = self.discard_value * torch.ones([self.batch_size, self.n_actions]).to(self.Q_eval.device)
# Target network provides the Q-values of the next state (DDQN)
q_next = self.Q_target.forward(new_state_batch).to(self.Q_eval.device)
q_eval = torch.where(move_validity == 1., q_eval, discard_values)
max_actions = torch.argmax(q_eval, dim=1)
reward_batch = torch.Tensor(reward_batch).to(self.Q_eval.device)
terminal_batch = torch.Tensor(terminal_batch).to(self.Q_eval.device)
# Negamax-style target: the opponent's best value enters with a minus sign
q_target = reward_batch + self.gamma * (-q_next[batch_index, max_actions]) * terminal_batch
loss = self.Q_eval.loss(q_pred, q_target.detach()).to(self.Q_eval.device)
loss.backward()
Notice that, since it is a two-player game, the next state is from the opponent's perspective. Therefore I reverse the signs (i.e. let the agent make a move for the opponent) and calculate the target value by subtracting the max Q-value of the next state. Consequently, if I choose an action a that allows the opponent to win the game, this action should have negative value.
Some other information about the hyperparameters:
I use a starting epsilon of 1 and end epsilon of 0.15 with a decay of 0.9999
I update the target network every 1000 steps
As the neural net I use a simple CNN with 6 layers and decreasing kernel sizes
Loss function is MSE
Optimizer is Adam with no scheduler
Has anyone run into similar problems and can give me advice on how to debug this? Are there any ways to make DDQN more stable (such as prioritised experience replay)?
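Not from the original post, but two commonly used stabilisers worth trying are a Huber loss in place of MSE and gradient-norm clipping. A hedged sketch, with q_network standing in for self.Q_eval and q_pred/q_target as in the snippet above:
import torch
import torch.nn.functional as F

loss = F.smooth_l1_loss(q_pred, q_target.detach())  # Huber loss: bounded gradients for large TD errors
loss.backward()
torch.nn.utils.clip_grad_norm_(q_network.parameters(), max_norm=10.0)  # cap the gradient norm before the optimizer step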

regarding one code segment in computing log_sum_exp

In this tutorial on using PyTorch to implement BiLSTM-CRF, the author implements the following function. Specifically, I don't quite understand what max_score_broadcast = max_score.view(1, -1).expand(1, vec.size()[1]) is trying to do, or which kind of math formula it corresponds to.
# Compute log sum exp in a numerically stable way for the forward algorithm
def log_sum_exp(vec):
    max_score = vec[0, argmax(vec)]
    max_score_broadcast = max_score.view(1, -1).expand(1, vec.size()[1])
    return max_score + \
        torch.log(torch.sum(torch.exp(vec - max_score_broadcast)))
Looking at the code, it seems like vec has a shape of (1, n).
Now we can follow the code line by line:
max_score = vec[0, argmax(vec)]
Indexing vec at [0, argmax(vec)] is just a fancy way of taking the maximal value of vec. So, max_score is (as the name suggests) the maximal value of vec.
max_score_broadcast = max_score.view(1, -1).expand(1, vec.size()[1])
Next, we want to subtract max_score from each of the elements of vec.
To do so the code creates a vector of the same shape as vec with all elements equal to max_score.
First, max_score is reshaped to have two dimensions using the view command; then the reshaped 2-d tensor is "stretched" to have length n using the expand command.
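A tiny illustration (not part of the tutorial) of what these two calls do to the shapes, assuming n = 4:
import torch

max_score = torch.tensor(3.5)                   # 0-d tensor, like vec[0, argmax(vec)]
broadcast = max_score.view(1, -1).expand(1, 4)  # (1, 1) -> (1, 4)
print(broadcast)                                # tensor([[3.5000, 3.5000, 3.5000, 3.5000]])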
Finally, the log sum exp is computed robustly:
return max_score + \
torch.log(torch.sum(torch.exp(vec - max_score_broadcast)))
The validity of this computation rests on the identity log(sum_i exp(x_i)) = m + log(sum_i exp(x_i - m)), which holds for any constant m (here, the maximum of vec).
The rationale behind it is that exp(x) can "explode" for x > 0; therefore, for numerical stability, it is best to subtract the maximal value before taking exp.
As a side note, I think a slightly more elegant way to do the same computation, taking advantage of broadcasting, would be
max_score, _ = vec.max(dim=1, keepdim=True) # take max along second dimension
lse = max_score + torch.log(torch.sum(torch.exp(vec - max_score), dim=1))
return lse
Also note that log sum exp is already implemented by pytorch: torch.logsumexp.
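For instance, assuming vec has shape (1, n) as above, the built-in gives the same value in one call:
lse = torch.logsumexp(vec, dim=1)  # shape (1,), computed in a numerically stable way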

Backpropagation on Two Layered Networks

I have been following Stanford's cs231n lectures and trying to complete the assignments on my own, sharing my solutions both on GitHub and on my blog. But I'm having a hard time understanding how to model backpropagation. I mean, I can code modular forward and backward passes, but what bothers me is that if I have the model below: Two Layered Neural Network
Let's assume that our loss function here is a softmax loss function. In my modular softmax_loss() function I am calculating the loss and the gradient with respect to the scores (dSoft = dL/dY). After that, when I'm following backwards, let's say for b2, db2 would be equal to dSoft*1, or dW2 would be equal to dSoft*dX2 (outputs of the ReLU gate). What's the chain rule here? Why isn't dSoft equal to 1? Because dL/dL would be 1?
The softmax function outputs a number given an input x.
What dSoft means is that you're computing the derivative of the function softmax(x) with respect to the input x. Then to calculate the derivative with respect to W of the last layer you use the chain rule, i.e. dL/dW = dsoftmax/dx * dx/dW. Note that x = W*x_prev + b, where x_prev is the input to the last node. Therefore dx/dW is just x_prev and dx/db is just 1, which means that dL/dW (or simply dW) is dsoftmax/dx * x_prev, and dL/db (or simply db) is dsoftmax/dx * 1. Note that here dsoftmax/dx is the dSoft we defined earlier.
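To make that concrete, here is a minimal NumPy sketch (my own, using the poster's names) of the backward pass through the last affine layer, where X2 is the ReLU output and dSoft is the upstream gradient dL/dY:
import numpy as np

def last_layer_backward(dSoft, X2, W2):
    # Forward was Y = X2 @ W2 + b2, with X2 (N, H), W2 (H, C), dSoft (N, C)
    dW2 = X2.T @ dSoft       # chain rule: dL/dW2 = x_prev^T * dL/dY
    db2 = dSoft.sum(axis=0)  # dY/db2 = 1, so just sum the upstream gradient
    dX2 = dSoft @ W2.T       # gradient flowing back into the ReLU gate
    return dW2, db2, dX2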

How to do reinforcement learning with regression instead of classification

I'm trying to apply reinforcement learning to a problem where the agent acts through continuous numerical outputs produced by a recurrent network. Basically, it is a control problem where two outputs control how an agent behaves.
I define a policy as epsilon-greedy, with (1-eps) of the time using the output control values, and eps of the time using the output values +/- a small Gaussian perturbation.
In this sense the agent can explore.
In most of the reinforcement learning literature I see that policy learning requires discrete actions, which can be learned with the REINFORCE (Williams 1992) algorithm, but I'm unsure what method to use here.
At the moment what I do is use masking to only learn the top choices, using an algorithm based on Metropolis-Hastings to decide whether a transition goes toward the optimal policy. Pseudocode:
input: rewards, timeIndices
// rewards in (0,1) and optimal is 1
// relate rewards to likelihood via L(r) = exp(-|r - 1|/std)
// r <= 1 => |r - 1| = 1 - r
targetMask = zeros(timeIndices.length)
neglogLi = (1 - mean(rewards)) / std
// Go through the rewards in random order to approximate a Markov process
for r, idx in shuffle(rewards, timeIndices):
    neglogLj = (1 - r) / std
    if neglogLj < neglogLi || log(random.uniform()) < neglogLi - neglogLj:
        // Accept the transition, i.e. learn this action
        targetMask[idx] = 1
        neglogLi = neglogLj
This provides a targetMask with ones for the actions that will be learned using standard backprop.
Can someone tell me the proper, or a better, way to do this?
Policy gradient methods are good for learning continuous control outputs. If you look at http://rll.berkeley.edu/deeprlcourse/#lectures, the Feb 13 lecture as well as the March 8 through March 15 lectures might be useful to you. Actor Critic methods are covered there, as well.
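As a rough illustration of the policy-gradient idea for continuous actions (my own sketch, not from the lectures): REINFORCE with a Gaussian policy only needs the log-density of the actions that were taken. Here policy_net, states, actions and returns are hypothetical placeholders:
import torch

mean = policy_net(states)                      # (T, action_dim), hypothetical policy network
dist = torch.distributions.Normal(mean, 0.1)   # fixed exploration std for simplicity
log_prob = dist.log_prob(actions).sum(dim=-1)  # log pi(a_t | s_t) per timestep
loss = -(log_prob * returns).sum()             # REINFORCE: weight log-probs by returns
loss.backward()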

Multivariate Bisection Method

I need an algorithm to perform a 2D bisection method for solving a 2x2 non-linear problem. Example: two equations f(x,y)=0 and g(x,y)=0 which I want to solve simultaneously. I am very familiar with 1D bisection (as well as other numerical methods). Assume I already know the solution lies between the bounds x1 < x < x2 and y1 < y < y2.
In a grid the starting bounds are:
^
| C D
y2 -+ o-------o
| | |
| | |
| | |
y1 -+ o-------o
| A B
o--+------+---->
x1 x2
and I know the values f(A), f(B), f(C) and f(D) as well as g(A), g(B), g(C) and g(D). To start the bisection I guess we need to divide the points out along the edges as well as the middle.
^
| C F D
y2 -+ o---o---o
| | |
|G o o M o H
| | |
y1 -+ o---o---o
| A E B
o--+------+---->
x1 x2
Now, considering all the possible combinations, such as checking whether f(G)*f(M)<0 AND g(G)*g(M)<0, seems overwhelming. Maybe I am making this a little too complicated, but I think there should be a multidimensional version of bisection, just as Newton-Raphson can easily be extended to multiple dimensions using gradient operators.
Any clues, comments, or links are welcomed.
Sorry, while bisection works in 1-d, it fails in higher dimensions. You simply cannot break a 2-d region into subregions using only information about the function at the corners of the region and a point in the interior. In the words of Mick Jagger, "You can't always get what you want".
I just stumbled upon the answer to this from geometrictools.com and C++ code.
edit: the code is now on github.
I would split the area along a single dimension only, alternating dimensions. The condition you have for the existence of a zero of a single function is "you have two points of different sign on the boundary of the region", so I'd just check that for the two functions. However, I don't think it would work well, since zeros of both functions in a particular region don't guarantee a common zero (a common zero might even exist in a different region that doesn't meet the criterion).
For example, look at this image:
There is no way you can distinguish the squares ABED and EFIH given only f() and g()'s behaviour on their boundaries. However, ABED doesn't contain a common zero and EFIH does.
This would be similar to region queries using e.g. kD-trees, if you could positively identify that a region doesn't contain a zero of, e.g., f. Still, this can be slow under some circumstances.
If you can assume (per your comment to woodchips) that f(x,y)=0 defines a continuous monotone function y=f2(x), i.e. for each x1<=x<=x2 there is a unique solution for y (you just can't express it analytically due to the messy form of f), and similarly y=g2(x) is a continuous monotone function, then there is a way to find the joint solution.
If you could calculate f2 and g2, then you could use 1-d bisection on [x1,x2] to solve f2(x)-g2(x)=0. And you can calculate them by using 1-d bisection on [y1,y2] to solve f(x,y)=0 for y at any fixed x you need to consider (x1, x2, (x1+x2)/2, etc.) - that's where the continuous monotonicity is helpful - and similarly for g. You have to make sure to update x1-x2 and y1-y2 after each step.
This approach might not be efficient, but should work. Of course, lots of two-variable functions don't intersect the z-plane as continuous monotone functions.
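A minimal sketch of that nested scheme (my own code, assuming the required sign changes exist at the interval endpoints):
def bisect1d(h, lo, hi, tol=1e-10):
    # Standard 1-d bisection; assumes h(lo) and h(hi) have opposite signs
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if h(lo) * h(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def solve_joint(f, g, x1, x2, y1, y2):
    # f2(x) is the unique y in [y1, y2] with f(x, y) = 0 (continuous monotonicity)
    f2 = lambda x: bisect1d(lambda y: f(x, y), y1, y2)
    g2 = lambda x: bisect1d(lambda y: g(x, y), y1, y2)
    # Outer 1-d bisection on f2(x) - g2(x) = 0 over [x1, x2]
    x = bisect1d(lambda x: f2(x) - g2(x), x1, x2)
    return x, f2(x)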
I'm not very experienced in optimization, but I built a solution to this problem with a bisection algorithm like the question describes. I think it is necessary to fix a bug in my solution, because in some cases it computes a root twice, but I think it's simple and I will try it later.
EDIT: I saw the comment of jpalecek, and now I understand that some premises I assumed are wrong, but the method still works in most cases. More specifically, a zero is guaranteed only if the two functions change sign in opposite directions, and the cases of zeros at the vertices need to be handled. I think it is possible to build a justified and satisfactory heuristic for that, but it is a little complicated, and for now I consider it more promising to take the function f_abs = abs(f, g) and build a heuristic to find the local minima, looking at the gradient direction at the midpoints of the edges.
Introduction
Consider the configuration in the question:
^
| C D
y2 -+ o-------o
| | |
| | |
| | |
y1 -+ o-------o
| A B
o--+------+---->
x1 x2
There are many ways to do this, but I chose to use only the corner points (A, B, C, D) and not the middle or center points like the question suggests. Assume I have two functions f(x,y) and g(x,y) as you describe. In truth it's generally one function (x,y) -> (f(x,y), g(x,y)).
The steps are the following, and there is a summary (with Python code) at the end.
Step by step explanation
Calculate the product of each scalar function (f and g) with itself at adjacent points, and compute the minimum product for each direction of variation (axes x and y):
Fx = min(f(C)*f(B), f(D)*f(A))
Fy = min(f(A)*f(B), f(D)*f(C))
Gx = min(g(C)*g(B), g(D)*g(A))
Gy = min(g(A)*g(B), g(D)*g(C))
This takes the products across two opposite sides of the rectangle and computes their minimum; a negative value signals a sign change. It's a bit redundant, but it works well. Alternatively, you can try another configuration, such as using the points E, F, G and H shown in the question, but I think it makes sense to use the corner points because they better account for the whole area of the rectangle - though that is only an impression.
Compute the minimum over the two axes for each function:
F = min(Fx, Fy)
G = min(Gx, Gy)
Each of these values, if negative, indicates the existence of a zero of the corresponding function (f or g) within the rectangle.
Compute the maximum of them:
max(F, G)
If max(F, G) < 0, then there is a root inside the rectangle. Additionally, if f(C) = 0 and g(C) = 0, there is a root too and we proceed the same way; but if the root is at another corner, we ignore it, because another rectangle will account for it (I want to avoid computing roots twice). The statement below summarizes this:
guaranteed_contain_zeros = max(F, G) < 0 or (f(C) == 0 and g(C) == 0)
In this case we have to proceed by breaking the region recursively until the rectangles are as small as we want.
Otherwise, a root may still exist inside the rectangle. Because of that, we have to use some criterion to keep splitting regions until we reach a minimum granularity. The criterion I used is to require that the largest dimension of the current rectangle be smaller than the smallest dimension of the original rectangle (delta in the code sample below).
Summary
This Python code summarizes the procedure:
def balance_points(x_min, x_max, y_min, y_max, delta, eps=2e-32):
    width = x_max - x_min
    height = y_max - y_min
    x_middle = (x_min + x_max) / 2
    y_middle = (y_min + y_max) / 2
    # Evaluate f and g at the corners (A bottom-left, B bottom-right,
    # C top-left, D top-right, as in the diagram above)
    fA, fB = f(x_min, y_min), f(x_max, y_min)
    fC, fD = f(x_min, y_max), f(x_max, y_max)
    gA, gB = g(x_min, y_min), g(x_max, y_min)
    gC, gD = g(x_min, y_max), g(x_max, y_max)
    Fx = min(fC * fB, fD * fA)
    Fy = min(fA * fB, fD * fC)
    Gx = min(gC * gB, gD * gA)
    Gy = min(gA * gB, gD * gC)
    F = min(Fx, Fy)
    G = min(Gx, Gy)
    largest_dim = max(width, height)
    guaranteed_contain_zeros = max(F, G) < 0 or (fC == 0 and gC == 0)
    if guaranteed_contain_zeros and largest_dim <= eps:
        return [(x_middle, y_middle)]
    elif guaranteed_contain_zeros or largest_dim > delta:
        # Split along the longer dimension and recurse on both halves
        if width >= height:
            return (balance_points(x_min, x_middle, y_min, y_max, delta, eps)
                    + balance_points(x_middle, x_max, y_min, y_max, delta, eps))
        else:
            return (balance_points(x_min, x_max, y_min, y_middle, delta, eps)
                    + balance_points(x_min, x_max, y_middle, y_max, delta, eps))
    else:
        return []
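A hypothetical usage example (my own; it relies on f and g being defined at module level, as the code above assumes):
f = lambda x, y: x**2 + y**2 - 1   # unit circle
g = lambda x, y: y - x             # diagonal line
roots = balance_points(-2.0, 2.0, -2.0, 2.0, delta=0.05, eps=1e-6)
print(roots)  # points clustered near (0.707, 0.707) and (-0.707, -0.707), possibly with duplicates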
Results
I have used similar code in a personal project (GitHub here), which draws the rectangles explored by the algorithm and the roots (the system has a balance point at the origin):
[Figure: rectangles explored by the algorithm]
It works well.
Improvements
In some cases the algorithm computes the same zero twice. I think there can be two reasons:
The case where the functions give exactly zero in neighbouring rectangles (because of numerical truncation). In this case the remedy is to increase eps (enlarge the final rectangles). I chose eps=2e-32 because 32 bits is half of the precision (on a 64-bit architecture), so it is probable that the function does not give an exact zero... but that was more of a guess; I don't know if it is the best choice. Note that if we decrease eps too much, we exceed the recursion limit of the Python interpreter.
The case in which f(A), f(B), etc. are near zero and their product underflows to zero. I think this can be reduced by using the product of the signs of f and g instead of the product of the function values.
I think it is possible to improve the criterion for discarding a rectangle. It could take into account how much the functions vary over the rectangle and how far they are from zero - perhaps a simple relation between the mean and the variance of the function values at the corners. Alternatively (and more complicated), we could use a stack to store the values at each recursion level and stop the recursion once these values converge.
This is a similar problem to finding critical points in vector fields (see http://alglobus.net/NASAwork/topology/Papers/alsVugraphs93.ps).
If you have the values of f(x,y) and g(x,y) at the vertices of your quadrilateral and you are in a discrete problem (such that you don't have an analytical expression for f(x,y) and g(x,y), nor their values at other locations inside the quadrilateral), then you can use bilinear interpolation to get two equations (for f and g). For the 2D case the analytical solution will be a quadratic equation which, according to its roots (1 root, 2 real roots, 2 imaginary roots), may give you 1 solution, 2 solutions, or no solutions, inside or outside your quadrilateral.
If instead you have analytic functions of f(x,y) and g(x,y) and want to use them, this is not useful. Instead you could divide your quadrilateral recursively, however as it was already pointed out by jpalecek (2nd post), you would need a way to stop your divisions by figuring out a test that would assure you would have no zeros inside a quadrilateral.
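As a sketch of that idea (my own code, not from the answer): on the unit square, bilinear interpolation gives f(x,y) ~ a0 + a1*x + a2*y + a3*x*y; setting f = 0 expresses y as a rational function of x, and substituting into g = 0 leaves a quadratic in x:
import numpy as np

def bilinear_roots(f00, f10, f01, f11, g00, g10, g01, g11):
    # Bilinear coefficients of f (a*) and g (b*) from the corner values
    a0, a1, a2, a3 = f00, f10 - f00, f01 - f00, f11 - f10 - f01 + f00
    b0, b1, b2, b3 = g00, g10 - g00, g01 - g00, g11 - g10 - g01 + g00
    # From f = 0: y = -(a0 + a1*x) / (a2 + a3*x); substituted into g = 0:
    # (b0 + b1*x)(a2 + a3*x) - (b2 + b3*x)(a0 + a1*x) = 0
    coeffs = [b1 * a3 - b3 * a1,
              b0 * a3 + b1 * a2 - b2 * a1 - b3 * a0,
              b0 * a2 - b2 * a0]
    roots = []
    for x in np.roots(coeffs):
        if abs(x.imag) > 1e-12:
            continue                  # imaginary root: no intersection
        x = x.real
        denom = a2 + a3 * x
        if abs(denom) < 1e-12:
            continue                  # degenerate: y not determined by f = 0
        y = -(a0 + a1 * x) / denom
        if 0 <= x <= 1 and 0 <= y <= 1:
            roots.append((x, y))      # keep only solutions inside the square
    return roots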
Let f_1(x,y), f_2(x,y) be two functions which are continuous and monotonic with respect to x and y. The problem is to solve the system f_1(x,y) = 0, f_2(x,y) = 0.
The alternating-direction algorithm is illustrated below; the lines depict the sets {f_1 = 0} and {f_2 = 0}. It is easy to see that the direction of movement of the algorithm (right-down or left-up) depends on the order in which the equations f_i(x,y) = 0 are solved (e.g., solve f_1(x,y) = 0 w.r.t. x and then f_2(x,y) = 0 w.r.t. y, OR first solve f_1(x,y) = 0 w.r.t. y and then f_2(x,y) = 0 w.r.t. x).
Given the initial guess, we don't know where the root is. So, in order to find all roots of the system, we have to move in both directions.
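A rough sketch of that iteration (my own code, assuming each 1-d slice has the sign change bisection needs, and run from both directions as suggested above):
def bisect1d(h, lo, hi, tol=1e-10):
    # Standard 1-d bisection; assumes h(lo) and h(hi) have opposite signs
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if h(lo) * h(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def alternating_directions(f1, f2, x1, x2, y1, y2, y0, iters=50):
    # Alternately solve f1(x, y) = 0 w.r.t. x and f2(x, y) = 0 w.r.t. y
    x, y = x1, y0
    for _ in range(iters):
        x = bisect1d(lambda t: f1(t, y), x1, x2)
        y = bisect1d(lambda t: f2(x, t), y1, y2)
    return x, y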