How to do reinforcement learning with regression instead of classification - reinforcement-learning

I'm trying to apply reinforcement learning to a problem where the agent interacts with continuous numerical outputs using a recurrent network. Basically, it is a control problem where two outputs control how an agent behave.
I define an policy as epsilon greedy with (1-eps) of the time using the output control values, and eps of the time using the output values +/- a small Gaussian perturbation.
In this sense the agent can explore.
In most of the reinforcement literature I see that policy learning requires discrete actions which can be learned with the REINFORCE (Williams 1992) algorithm, but I'm unsure what method to use here.
At the moment what I do is use masking to only learn the top choices using an algorithm based on Metropolis Hastings to decide if a transition is goes toward the optimal policy. Pseudo code:
input: rewards, timeIndices
// rewards in (0,1) and optimal is 1
// relate rewards to likelihood via L(r) = exp(-|r - 1|/std)
// r <= 1 => |r - 1| = 1 - r
timeMask = zeros(timeIndices.length)
neglogLi = (1 - mean(rewards)) / std
// Go through random order of reward to approximate Markov process
for r,idx in shuffle(rewards, timeIndices):
neglogLj = (1 - r)/std
if neglogLj < neglogLi || log(random.uniform()) < neglogLi - neglogLj:
// Accept transition, i.e. learn this action
targetMask[idx] = 1
neglogLi = neglogLj
This provides a targetMask with ones for the actions that will be learned using standard backprop.
Can someone inform me the proper or better way?

Policy gradient methods are good for learning continuous control outputs. If you look at http://rll.berkeley.edu/deeprlcourse/#lectures, the Feb 13 lecture as well as the March 8 through March 15 lectures might be useful to you. Actor Critic methods are covered there, as well.

Related

DDQN for Connect 4: Sudden explosion of Loss

I am trying to solve Connect 4 with DDQN through the self-play regime that was used for AlphaZero. That means, I let a student version play against a teacher version of itself and replace the teacher with the student, once the student wins more than 60% of the games. I actually fairly quickly receive good results. After already ~5k games played, the agent is able to win more than 95% of games against a random player. Also, from the interaction with the agent, one can see that it learns to prevent "easy wins" and finds some nice strategies.
However, after about 100.000 games played, the loss steadily increases until it eventually blows up. It is not clear to me, why exactly this behaviour occurs. I have tried different learning rates (1e-5 up to 1e-3) and different replay buffer sizes (10.000 up to 1.000.000). My update rules look as follows:
# Get predicted q values for the actions that were taken
q_pred = self.Q_eval.forward(state_batch).to(self.Q_eval.device)
q_pred = q_pred[batch_index, action_indices]
# Replace -1 and 1 for new_state_batch
new_state_batch *= -1.
q_eval = self.Q_eval.forward(new_state_batch).to(self.Q_eval.device)
# Get target q values for the actions that were taken
move_validity = torch.Tensor(new_state_batch[:, :self.n_actions] == 0).to(self.Q_eval.device)
discard_values = self.discard_value * torch.ones([self.batch_size, self.n_actions]).to(self.Q_eval.device)
q_next = self.Q_target.forward(new_state_batch).to(self.Q_eval.device)
q_eval = torch.where(move_validity == 1., q_eval, discard_values)
max_actions = torch.argmax(q_eval, dim=1)
reward_batch = torch.Tensor(reward_batch).to(self.Q_eval.device)
terminal_batch = torch.Tensor(terminal_batch).to(self.Q_eval.device)
# Using minimax algorithm
q_target = reward_batch + self.gamma * (-q_next[batch_index, max_actions]) * terminal_batch
loss = self.Q_eval.loss(q_pred, q_target.detach()).to(self.Q_eval.device)
loss.backward()
Notice, that since it is a two-player game, the next state is from the opponents perspective. Therefore I reverse signs (i.e. let the agent make a move for the opponent) and calculate the target value by subtracting the max q-value of the next state. Consequently, if I choose action a that allows the opponent to win the game, this action should be of negative value.
Some other information about the hyperparameters:
I use a starting epsilon of 1 and end epsilon of 0.15 with a decay of 0.9999
I update the target network every 1000 steps
As the neural net I use a simple CNN with 6 layers and decreasing kernel sizes
Loss function is MSE
Optimizer is Adam with no scheduler
Has anyone run into similar problems and may give my advice on how to debug this? Are there any ways to make DDQN more stable (such as prioritised experience replay)?

Sarsa and Q Learning (reinforcement learning) don't converge optimal policy

I have a question about my own project for testing reinforcement learning technique. First let me explain you the purpose. I have an agent which can take 4 actions during 8 steps. At the end of this eight steps, the agent can be in 5 possible victory states. The goal is to find the minimum cost. To access of this 5 victories (with different cost value: 50, 50, 0, 40, 60), the agent don't take the same path (like a graph). The blue states are the fail states (sorry for quality) and the episode is stopped.
enter image description here
The real good path is: DCCBBAD
Now my question, I don't understand why in SARSA & Q-Learning (mainly in Q learning), the agent find a path but not the optimal one after 100 000 iterations (always: DACBBAD/DACBBCD). Sometime when I compute again, the agent falls in the good path (DCCBBAD). So I would like to understand why sometime the agent find it and why sometime not. And there is a way to look at in order to stabilize my agent?
Thank you a lot,
Tanguy
TD;DR;
Set your epsilon so that you explore a bunch for a large number of episodes. E.g. Linearly decaying from 1.0 to 0.1.
Set your learning rate to a small constant value, such as 0.1.
Don't stop your algorithm based on number of episodes but on changes to the action-value function.
More detailed version:
Q-learning is only garranteed to converge under the following conditions:
You must visit all state and action pairs infinitely ofter.
The sum of all the learning rates for all timesteps must be infinite, so
The sum of the square of all the learning rates for all timesteps must be finite, that is
To hit 1, just make sure your epsilon is not decaying to a low value too early. Make it decay very very slowly and perhaps never all the way to 0. You can try , too.
To hit 2 and 3, you must ensure you take care of 1, so that you collect infinite learning rates, but also pick your learning rate so that its square is finite. That basically means =< 1. If your environment is deterministic you should try 1. Deterministic environment here that means when taking an action a in a state s you transition to state s' for all states and actions in your environment. If your environment is stochastic, you can try a low number, such as 0.05-0.3.
Maybe checkout https://youtu.be/wZyJ66_u4TI?t=2790 for more info.

autoencoder only provides linear output

Hello Guys,
I am working right now on an Autoencoder reducing some simple 2D Data to 1D. The architecture is 2 - 10 - 1 - 10 - 2 Neurons/Layer. As Activation Function I use sigmoid in every layer but the output-layer, where I use the identity.
I am using the Accord.NET Framework to build that.
I am Pre-Training the Autoencoder with RBMs and CD-Algorithm, where I can change the initial weights, the learning rate, the momentum and the weight decay.
The Fine-Tuning is accomplished by backpropagation where I can configure the learning rate and the momentum.
The data is some artificially created shape and is marked green in the picture:
data + reconstruction
The reconstruction of the autoencoder is the yellow line. Which leads to my problem. Somehow the encoder is not able to create a non-linear shape as output.
Although I tested arround a lot and changed values a dozen times, I am not getting better results. Maybe someone here has an idea how I could find the problem.
Thanks!
Look in general any neural network is based on a linear representation for your feature against the output so what the net is actually doing (consider two features) is [w1*x1 + w2*x2 = output].
What you need to do to achieve a non-linear representation is to use extra feature(s) which is a non-linear representation of the old feature(s). Let's say for example use x1^2 as an extra feature or x2^2 or both of them. Hence, the net will give this global equation [w1*x1 + w2*x2 + w3*x1^2 = output] which in nature is a non-linear equation and then you can have a nonlinear representation.
The extra feature equation depends mainly on your data. I have used a quadratic equation in my example but it is not always the correct thing to do. Referring to
Your data I think you need to use a cos(x) or sin(x) representation.

Determining edge weights given a list of walks in a graph

These questions regard a set of data with lists of tasks performed in succession and the total time required to complete them. I've been wondering whether it would be possible to determine useful things about the tasks' lengths, either as they are or with some initial guesstimation based on appropriate domain knowledge. I've come to think graph theory would be the way to approach this problem in the abstract, and have a decent basic grasp of the stuff, but I'm unable to know for certain whether I'm on the right track. Furthermore, I think it's a pretty interesting question to crack. So here we go:
Is it possible to determine the weights of edges in a directed weighted graph, given a list of walks in that graph with the lengths (summed weights) of said walks? I recognize the amount and quality of permutations on the routes taken by the walks will dictate the quality of any possible answer, but let's assume all possible walks and their lengths are given. If a definite answer isn't possible, what kind of things can be concluded about the graph? How would you arrive at those conclusions?
What if there were several similar walks with possibly differing lengths given? Can you calculate a decent average (or other illustrative measure) for each edge, given enough permutations on different routes to take? How will discounting some permutations from the available data set affect the calculation's accuracy?
Finally, what if you had a set of initial guesses as to the weights and had to refine those using the walks given? Would that improve upon your guesstimation ability, and how could you apply the extra information?
EDIT: Clarification on the difficulties of a plain linear algebraic approach. Consider the following set of walks:
a = 5
b = 4
b + c = 5
a + b + c = 8
A matrix equation with these values is unsolvable, but we'd still like to estimate the terms. There might be some helpful initial data available, such as in scenario 3, and in any case we can apply knowledge of the real world - such as that the length of a task can't be negative. I'd like to know if you have ideas on how to ensure we get reasonable estimations and that we also know what we don't know - eg. when there's not enough data to tell a from b.
Seems like an application of linear algebra.
You have a set of linear equations which you need to solve. The variables being the lengths of the tasks (or edge weights).
For instance if the tasks lengths were t1, t2, t3 for 3 tasks.
And you are given
t1 + t2 = 2 (task 1 and 2 take 2 hours)
t1 + t2 + t3 = 7 (all 3 tasks take 7 hours)
t2 + t3 = 6 (tasks 2 and 3 take 6 hours)
Solving gives t1 = 1, t2 = 1, t3 = 5.
You can use any linear algebra techniques (for eg: http://en.wikipedia.org/wiki/Gaussian_elimination) to solve these, which will tell you if there is a unique solution, no solution or an infinite number of solutions (no other possibilities are possible).
If you find that the linear equations do not have a solution, you can try adding a very small random number to some of the task weights/coefficients of the matrix and try solving it again. (I believe falls under Perturbation Theory). Matrices are notorious for radically changing behavior with small changes in the values, so this will likely give you an approximate answer reasonably quickly.
Or maybe you can try introducing some 'slack' task in each walk (i.e add more variables) and try to pick the solution to the new equations where the slack tasks satisfy some linear constraints (like 0 < s_i < 0.0001 and minimize sum of s_i), using Linear Programming Techniques.
Assume you have an unlimited number of arbitrary characters to represent each edge. (a,b,c,d etc)
w is a list of all the walks, in the form of 0,a,b,c,d,e etc. (the 0 will be explained later.)
i = 1
if #w[i] ~= 1 then
replace w[2] with the LENGTH of w[i], minus all other values in w.
repeat forever.
Example:
0,a,b,c,d,e 50
0,a,c,b,e 20
0,c,e 10
So:
a is the first. Replace all instances of "a" with 50, -b,-c,-d,-e.
New data:
50, 50
50,-b,-d, 20
0,c,e 10
And, repeat until one value is left, and you finish! Alternatively, the first number can simply be subtracted from the length of each walk.
I'd forget about graphs and treat lists of tasks as vectors - every task represented as a component with value equal to it's cost (time to complete in this case.
In tasks are in different orderes initially, that's where to use domain knowledge to bring them to a cannonical form and assign multipliers if domain knowledge tells you that the ratio of costs will be synstantially influenced by ordering / timing. Timing is implicit initial ordering but you may have to make a function of time just for adjustment factors (say drivingat lunch time vs driving at midnight). Function might be tabular/discrete. In general it's always much easier to evaluate ratios and relative biases (hardnes of doing something). You may need a functional language to do repeated rewrites of your vectors till there's nothing more that romain knowledge and rules can change.
With cannonical vectors consider just presence and absence of task (just 0|1 for this iteratioon) and look for minimal diffs - single task diffs first - that will provide estimates which small number of variables. Keep doing this recursively, be ready to back track and have a heuristing rule for goodness or quality of estimates so far. Keep track of good "rounds" that you backtraced from.
When you reach minimal irreducible state - dan't many any more diffs - all vectors have the same remaining tasks then you can do some basic statistics like variance, mean, median and look for big outliers and ways to improve initial domain knowledge based estimates that lead to cannonical form. If you finsd a lot of them and can infer new rules, take them in and start the whole process from start.
Yes, this can cost a lot :-)

Generalization functions for Q-Learning

I have to do some work with Q Learning, about a guy that has to move furniture around a house (it's basically that). If the house is small enough, I can just have a matrix that represents actions/rewards, but as the house size grows bigger that will not be enough. So I have to use some kind of generalization function for it, instead. My teacher suggests I use not just one, but several ones, so I could compare them and so. What you guys recommend?
I heard that for this situation people are using Support Vector Machines, also Neural Networks. I'm not really inside the field so I can't tell. I had in the past some experience with Neural Networks, but SVM seem a lot harder subject to grasp. Are there any other methods that I should look for? I know there must be like a zillion of them, but I need something just to start.
Thanks
Just as a refresher of terminology, in Q-learning, you are trying to learn the Q-functions, which depend on the state and action:
Q(S,A) = ????
The standard version of Q-learning as taught in most classes tells you that you for each S and A, you need to learn a separate value in a table and tells you how to perform Bellman updates in order to converge to the optimal values.
Now, lets say that instead of table you use a different function approximator. For example, lets try linear functions. Take your (S,A) pair and think of a bunch of features you can extract from them. One example of a feature is "Am I next to a wall," another is "Will the action place the object next to a wall," etc. Number these features f1(S,A), f2(S,A), ...
Now, try to learn the Q function as a linear function of those features
Q(S,A) = w1 * f1(S,A) + w2*f2(S,A) ... + wN*fN(S,A)
How should you learn the weights w? Well, since this is a homework, I'll let you think about it on your own.
However, as a hint, lets say that you have K possible states and M possible actions in each state. Lets say you define K*M features, each of which is an indicator of whether you are in a particular state and are going to take a particular action. So
Q(S,A) = w11 * (S==1 && A == 1) + w12 * (S == 1 && A == 2) + w21 * (S==2 && A==3) ...
Now, notice that for any state/action pair, only one feature will be 1 and the rest will be 0, so Q(S,A) will be equal to the corresponding w and you are essentially learning a table. So, you can think of the standard, table Q-learning as a special case of learning with these linear functions. So, think of what the normal Q-learning algorithm does, and what you should do.
Hopefully you can find a small basis of features, much fewer than K*M, that will allow you to represent your space well.