I am trying to implement Q-learning in an environment where the rewards R are stochastic, time-dependent variables that arrive in real time at a constant time interval deltaT. The states S (scalars) also arrive at the same constant interval deltaT. The agent's task is to produce the optimal action after it receives (S(n*deltaT), R(n*deltaT)).
My problem is that I am very new to RL and I don't understand how this algorithm should be implemented; most papers describing Q-learning are written in dense academic language, which is not helping me.
OnTimer() executes after a fixed interval:
double a = 0.95;
double g = 0.95;
double old_state = 0;
action new_action = null;
action old_action = random_action;

void OnTimer()
{
    double new_state = environment.GetNewState();
    double Qmax = 0;

    foreach(action a in Actions)
    {
        if(Q(new_state, a) > Qmax)
            Qmax = Q(new_state, a);
        new_action = a;
    }

    double reward = environment.Reward(old_state, old_action);

    Q(old_state, old_action) = Q(old_state, old_action) + a*(reward + g*Qmax - Q(old_state, old_action));

    old_state = new_state;
    old_action = new_action;

    agent.ExecuteInEnvironment(new_action);
}
Question:
Is this a proper implementation of online Q-learning? It does not seem to work. Why does it not behave optimally as n*deltaT -> infinity? Please help, this is very important.
It's hard to say exactly what's going wrong without more information, but it doesn't look like you've implemented the algorithm correctly. Generally, the algorithm is as follows (a code sketch comes after the list):
1. Start out in an initial state as the current state.
2. Select the next action from the current state using a learning policy (such as epsilon-greedy). The learning policy picks the action that will cause the transition from the current state to the next state.
3. The (current state, action) pair tells you what the next state is.
4. Find Qmax over the actions available in the next state (which I think you're doing correctly). One exception is that Qmax should be 0 if the next state is a terminal state, but you might not have one.
5. Get the reward for the (current state, action, next state) tuple. You seem to be ignoring the transition to the next state in your calculation.
6. Update the Q value for (old state, old action). I think you're doing this correctly.
7. Set the current state to the next state.
8. Return to step 2, unless the current state is terminal.
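To make those steps concrete, here is a minimal sketch in Python. It assumes a generic episodic environment with reset() and step() methods and a finite list of actions (a hypothetical interface, not your timer-driven setup):

import random
from collections import defaultdict

def q_learning(env, actions, alpha=0.95, gamma=0.95, epsilon=0.1, episodes=1000):
    Q = defaultdict(float)  # Q[(state, action)], initialised to 0
    for _ in range(episodes):
        state = env.reset()  # step 1: start in an initial state
        done = False
        while not done:  # step 8: loop until a terminal state is reached
            # step 2: epsilon-greedy action selection from the current state
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            # steps 3 and 5: the environment returns the next state and the
            # reward for the (state, action, next_state) transition
            next_state, reward, done = env.step(action)
            # step 4: Qmax over the next state (0 if the next state is terminal)
            q_max = 0.0 if done else max(Q[(next_state, a)] for a in actions)
            # step 6: update the Q-value of the (state, action) pair just taken
            Q[(state, action)] += alpha * (reward + gamma * q_max - Q[(state, action)])
            # step 7: the next state becomes the current state
            state = next_state
    return Q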
Do you know the probability of your selected action actually causing your agent to move to the intended state, or is that something you have to estimate by observation? If states are just arriving arbitrarily and you don't have any control over what happens, this might not be an appropriate environment to apply reinforcement learning.
I am trying to solve Connect 4 with DDQN through the self-play regime that was used for AlphaZero. That means I let a student version play against a teacher version of itself and replace the teacher with the student once the student wins more than 60% of the games. I get good results fairly quickly: after only ~5k games played, the agent is able to win more than 95% of games against a random player. Also, from interaction with the agent, one can see that it learns to prevent "easy wins" and finds some nice strategies.
However, after about 100,000 games played, the loss steadily increases until it eventually blows up. It is not clear to me why exactly this behaviour occurs. I have tried different learning rates (1e-5 up to 1e-3) and different replay buffer sizes (10,000 up to 1,000,000). My update rules look as follows:
# Get predicted q values for the actions that were taken
q_pred = self.Q_eval.forward(state_batch).to(self.Q_eval.device)
q_pred = q_pred[batch_index, action_indices]
# Flip the sign of the board (swap -1 and 1) so new_state_batch is seen from the opponent's perspective
new_state_batch *= -1.
q_eval = self.Q_eval.forward(new_state_batch).to(self.Q_eval.device)
# Get target q values for the actions that were taken
move_validity = torch.Tensor(new_state_batch[:, :self.n_actions] == 0).to(self.Q_eval.device)
discard_values = self.discard_value * torch.ones([self.batch_size, self.n_actions]).to(self.Q_eval.device)
q_next = self.Q_target.forward(new_state_batch).to(self.Q_eval.device)
q_eval = torch.where(move_validity == 1., q_eval, discard_values)
max_actions = torch.argmax(q_eval, dim=1)
reward_batch = torch.Tensor(reward_batch).to(self.Q_eval.device)
terminal_batch = torch.Tensor(terminal_batch).to(self.Q_eval.device)
# Using minimax algorithm
q_target = reward_batch + self.gamma * (-q_next[batch_index, max_actions]) * terminal_batch
loss = self.Q_eval.loss(q_pred, q_target.detach()).to(self.Q_eval.device)
loss.backward()
Notice that, since it is a two-player game, the next state is from the opponent's perspective. Therefore I reverse the signs (i.e. let the agent make a move for the opponent) and calculate the target value by subtracting the max Q-value of the next state. Consequently, if I choose an action a that allows the opponent to win the game, this action should have a negative value.
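Spelled out in isolation, the intended target computation is the following (an illustrative helper, not my actual class method; non_terminal here plays the role of terminal_batch above, i.e. 1 for non-terminal next states and 0 otherwise):

import torch

def negamax_ddqn_target(reward, q_next_eval, q_next_target, non_terminal, gamma):
    # Double DQN: the online network picks the greedy move in the next state...
    # (invalid-move masking, as in the snippet above, would be applied to
    # q_next_eval before this argmax)
    best_actions = torch.argmax(q_next_eval, dim=1)
    # ...and the target network evaluates it. The minus sign encodes that the
    # next state is seen from the opponent's perspective (negamax).
    opponent_value = q_next_target.gather(1, best_actions.unsqueeze(1)).squeeze(1)
    # non_terminal: 1.0 for non-terminal next states, 0.0 for terminal ones
    return reward + gamma * (-opponent_value) * non_terminal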
Some other information about the hyperparameters:
I use a starting epsilon of 1 and end epsilon of 0.15 with a decay of 0.9999
I update the target network every 1000 steps
As the neural net I use a simple CNN with 6 layers and decreasing kernel sizes
Loss function is MSE
Optimizer is Adam with no scheduler
Has anyone run into similar problems and can give me advice on how to debug this? Are there any ways to make DDQN more stable (such as prioritised experience replay)?
It is my understanding that Q-learning attempts to find the actual state-action values for all states and actions. However, my hypothetical example below seems to indicate that this is not necessarily the case.
Imagine a Markov decision process (MDP) with the following attributes:
a state space S = {s_1} with only one possible state,
an action space A = {a_1} with a single possible action,
a reward function R: S × A × S → ℝ with R(s_1, a_1, s_1) = 4,
and finally a state transition function T: S × A × S → [0,1] which assigns probability 1 to the transition from s_1 back to s_1 under a_1.
Now assume that we have a single agent which has been initialized using optimistic initialization. For all possible states and actions we set the Q-value equal to 5 (i.e. Q(s_1, a_1) = 5). Q-values will be updated using the Bellman equation:
Q(S,A) := Q(S,A) + α( R + γQ(S',A') - Q(S,A) )
Here α and γ are chosen such that α ∈ (0,1] and γ ∈ (0,1]. Notice that we require α and γ to be non-zero.
When the agent selects its action (a_1) in state s_1, the update formula becomes:
Q(s_1, a_1) := 5 + α( 4 + γ·5 - 5 )
Notice that the Q-value does not change when γ·5 = 1 (i.e. γ = 0.2), or more generally when γQ(S,A) = Q(S,A) - R. Also, the Q-value will increase when γQ(S,A) > Q(S,A) - R, which would further increase the difference between the actual state-action value and the expected state-action value.
This seems to indicate that in some cases, it is possible for the difference between the actual and expected state-action values to increase over time. In other words, it is possible for the expected value to diverge from the actual value.
If we were to initialize the Q-values to 0 for all states and actions, we surely would not end up in this situation. However, I do believe it is possible that a stochastic reward/transition function may cause the agent to overestimate its state-action values in a similar fashion, causing the above behavior to take effect. This would require a highly improbable situation where the MDP transitions to a high-payoff state often, even though this transition has a very low likelihood.
Perhaps there are any assumptions I made here that actually do not hold. Maybe the target goal is not to precisely estimate the true state-action value, but rather that convergence to optimal state-action values is sufficient. That being said, I do find it rather odd that the divergence behavior between actual and expected returns is possible.
Any thoughts on this would be appreciated.
The problem with the above assumption is that I expected Q(s,a) to converge to R(s,a,s'). This is not the case. As described in the RL book by Sutton and Barto:
Q(s,a) = sum_{s',r} p(s',r|s,a) * r = E[r]
The Q-values, in this case, actually represent the expected one-step reward and should converge to R + γQ(S',A') and not R(s,a,s'). It is therefore unsurprising that the state-action values can move away from the deterministic immediate reward R and that the value at which Q(s,a) converges is dependent on γ.
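A quick numerical check makes this concrete (plain Python, independent of any particular implementation): solving Q = R + γQ gives the fixed point Q = R/(1 - γ), and iterating the update in the one-state, one-action MDP converges exactly there, not to R.

# Iterate Q <- Q + alpha * (R + gamma * Q - Q); the fixed point is R / (1 - gamma).
R, alpha = 4.0, 0.1
for gamma in (0.2, 0.5, 0.95):
    Q = 5.0  # optimistic initialization from the question
    for _ in range(10_000):
        Q += alpha * (R + gamma * Q - Q)
    print(gamma, round(Q, 3), R / (1 - gamma))
# gamma = 0.2 stays exactly at 5 (the case where gamma*5 = 1 above);
# gamma = 0.95 converges to 80, far above the immediate reward of 4.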
Furthermore, the hypothetical situation where Q(s,a) is overestimated when using a stochastic reward/transition function is possible. Convergence to the actual state-action values is only guaranteed when all state-action pairs (s,a) are visited an infinite number of times. Therefore, this is an issue related to exploration and exploitation. (In this case, the agent should have been allowed to explore more.)
A learner might be in a training stage, where it updates the Q-table for a number of epochs.
In this stage, the Q-table is updated using gamma (the discount rate) and alpha (the learning rate), and actions are chosen according to a random action rate.
After some epochs, when the reward is getting stable, let me call this "training is done". Do I then have to ignore these parameters (gamma, learning rate, etc.)?
I mean, during the training stage I get an action from the Q-table like this:
if rand_float < rar:
    action = rand.randint(0, num_actions - 1)
else:
    action = np.argmax(Q[s_prime_as_index])
But after the training stage, do I have to remove rar, which means I get an action from the Q-table like this?
action = np.argmax(self.Q[s_prime])
Once the value function has converged (values stop changing), you no longer need to run Q-value updates. This means gamma and alpha are no longer relevant, because they only affect updates.
The epsilon parameter is part of the exploration policy (epsilon-greedy) and helps ensure that the agent visits all states infinitely many times in the limit. This is an important factor in ensuring that the agent's value function eventually converges to the correct value. Once we've deemed the value function converged, however, there's no need to continue randomly taking actions that our value function doesn't believe to be best; we believe that the value function is optimal, so we extract the optimal policy by greedily choosing what it says is the best action in every state. We can just set epsilon to 0.
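Concretely, assuming the Q-table is a 2-D NumPy array of shape [num_states, num_actions] (I'm guessing at your representation here), the post-training behaviour is just a table lookup:

import numpy as np

num_states, num_actions = 10, 4          # illustrative sizes
Q = np.zeros((num_states, num_actions))  # stands in for your learned Q-table
s_prime = 3                              # stands in for the current state index

# No more updates (alpha and gamma unused) and no more exploration (epsilon = 0):
greedy_policy = np.argmax(Q, axis=1)     # best known action for every state
action = greedy_policy[s_prime]          # same as np.argmax(Q[s_prime])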
Although the answer provided by Nick Walker is correct, here is some additional information.
What you are talking about is closely related to the concept technically known as the "exploration-exploitation trade-off". From the Sutton & Barto book:
The agent has to exploit what it already knows in order to obtain reward, but it also has to explore in order to make better action selections in the future. The dilemma is that neither exploration nor exploitation can be pursued exclusively without failing at the task. The agent must try a variety of actions and progressively favor those that appear to be best.
One way to implement the exploration-exploitation trade-off is epsilon-greedy exploration, which is what you are using in your code sample. So, at the end, once the agent has converged to the optimal policy, it must select only the actions that exploit the current knowledge, i.e., you can drop the rand_float < rar part. Ideally you should decrease the epsilon parameter (rar in your case) with the number of episodes (or steps).
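For example, a simple multiplicative decay toward a small floor (the constants here are arbitrary):

num_episodes = 1000                          # illustrative
epsilon, epsilon_end, decay = 1.0, 0.01, 0.995
for episode in range(num_episodes):
    # ... run one episode, selecting actions epsilon-greedily with `epsilon` ...
    epsilon = max(epsilon_end, epsilon * decay)  # decay after every episode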
On the other hand, regarding the learning rate, it is worth noting that theoretically this parameter should follow the Robbins-Monro conditions: the sequence of learning rates α_t must satisfy sum_t α_t = ∞ and sum_t α_t² < ∞ (for example, α_t = 1/t). This means that the learning rate should decrease asymptotically towards zero. So, again, once the algorithm has converged you can (or better, you should) safely ignore the learning rate parameter.
In practice, sometimes you can simply keep epsilon and alpha fixed until your algorithm converges and then set them to 0 (i.e., ignore them).
I'm trying to implement an Inertial Navigation System using an Indirect Kalman Filter. I've found many publications and theses on this topic, but not much example code. For my implementation I'm using the Master's thesis available at the following link:
https://fenix.tecnico.ulisboa.pt/downloadFile/395137332405/dissertacao.pdf
As reported on page 47, the measured values from the inertial sensors equal the true values plus a series of other terms (bias, scale factors, ...).
For my question, let's consider only bias.
So:
Wmeas = Wtrue + BiasW (Gyro meas)
Ameas = Atrue + BiasA (Accelerometer meas)
Therefore,
when I propagate the Mechanization equations (equations 3-29, 3-37 and 3-41)
I should use the "true" values, or better:
Wmeas - BiasW
Ameas - BiasA
where BiasW and BiasA are the last available estimates of the biases. Right?
Concerning the update phase of the EKF,
if the measurement equation is
dzV = VelGPS_est - VelGPS_meas
the H matrix should have an identity block corresponding to the velocity error state variables dx(VEL) and zeros elsewhere. Right?
That said, I'm not sure how I have to propagate the state variables after the update phase.
In my opinion, the propagation of the state variables should be:
POSk|k = POSk|k-1 + dx(POS);
VELk|k = VELk|k-1 + dx(VEL);
...
But this didn't work. Therefore I've tried:
POSk|k = POSk|k-1 - dx(POS);
VELk|k = VELk|k-1 - dx(VEL);
That didn't work either. I tried both solutions, even though in my opinion the "+" should be used. Since neither works (I probably have some other error elsewhere),
I would like to ask if you have any suggestions.
You can see a snippet of code at the following link: http://pastebin.com/aGhKh2ck.
Thanks.
The difficulty you're running into is the difference between the theory and the practice. Taking your code from the snippet instead of the symbolic version in the question:
% Apply corrections
Pned = Pned + dx(1:3);
Vned = Vned + dx(4:6);
In theory, when you use the Indirect form you are freely integrating the IMU (the process called Mechanization in that paper) and occasionally running the IKF to update its correction. In theory the unchecked double integration of the accelerometer produces large (or, for cheap MEMS IMUs, enormous) error values in Pned and Vned. That, in turn, causes the IKF to produce correspondingly large values of dx(1:6) as time evolves and the unchecked IMU integration runs farther and farther away from the truth. In theory you then sample your position at any time as Pned +/- dx(1:3) (the sign isn't important -- you can set that up either way). The important part here is that you are not modifying Pned from the IKF, because both are running independently of each other and you add them together when you need the answer.
In practice you do not want to take the difference between two enormous double values because you will lose precision (because many of the bits of the significand were needed to represent the enormous part instead of the precision you want). You have grasped that in practice you want to recursively update Pned on each update. However, when you diverge from the theory this way, you have to take the corresponding (and somewhat unobvious) step of zeroing out your correction value from the IKF state vector. In other words, after you do Pned = Pned + dx(1:3) you have "used" the correction, and you need to balance the equation with dx(1:3) = dx(1:3) - dx(1:3) (simplified: dx(1:3) = 0) so that you don't inadvertently integrate the correction over time.
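In code, that bookkeeping looks roughly like this (a NumPy-flavoured sketch mirroring your MATLAB variables, with 0-based slicing; the function name is made up):

import numpy as np

def apply_ikf_correction(Pned, Vned, dx):
    # Fold the estimated errors into the mechanised position and velocity...
    Pned = Pned + dx[0:3]
    Vned = Vned + dx[3:6]
    # ...then zero those error states so the same correction is not
    # integrated again on the next filter cycle.
    dx[0:3] = 0.0
    dx[3:6] = 0.0
    return Pned, Vned, dx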
Why does this work? Why doesn't it mess up the rest of the filter? As it turns out, the KF process covariance P does not actually depend on the state x. It depends on the update function and the process noise Q and so on. So the filter doesn't care what the data is. (Now that's a simplification, because often Q and R include rotation terms, and R might vary based on other state variables, etc, but in those cases you are actually using state from outside the filter (the cumulative position and orientation) not the raw correction values, which have no meaning by themselves).
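As a reminder of why, here is the generic covariance update in a short sketch; the state x appears nowhere in it (setting aside the rotation-dependent Q and R cases mentioned above):

import numpy as np

def covariance_update(P, H, R):
    S = H @ P @ H.T + R                       # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)            # Kalman gain
    P_new = (np.eye(P.shape[0]) - K @ H) @ P  # updated covariance
    return K, P_new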
Firstly: My apologies for putting "Pro.blem" in the title - SO won't let me put "Problem" there.
Anyway, during trig class yesterday, I had an idea. Suppose I want to write a program that uses artificial intelligence to solve problems. I stripped the idea down to an implementation of Dijkstra's algorithm on a directed graph, using actions as nodes and requirements/results as paths. For example, let's say Fred is in the living room and he is hungry. Some of Fred's possible actions are listed below (with a sketch of how they might be encoded as data right after the list):
Get up:
Requirements: state = sitting or lying down.
Results: state = standing.
Lie down:
Requirements: state = standing or sitting.
Results: state = lying down.
Fall asleep:
Requirements: state = lying down, location = bedroom.
Results: state = asleep.
Walk to kitchen:
Requirements: state = standing, location is not kitchen.
Results: location = kitchen.
Walk to bedroom:
Requirements: state = standing, location is not bedroom.
Results: location = bedroom.
Prepare food:
Requirements: state = standing, location = kitchen.
Results: hasfood = true.
Eat:
Requirements: hasfood = true.
Results: hungry = false, hasfood = false.
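Here is one way these actions could be encoded as data, purely for illustration (the Python structure below is my own invention):

# Each action carries a requirements test and the results it applies to the world state.
actions = {
    "get up":          {"requires": lambda s: s["state"] in ("sitting", "lying down"),
                        "results": {"state": "standing"}},
    "lie down":        {"requires": lambda s: s["state"] in ("standing", "sitting"),
                        "results": {"state": "lying down"}},
    "fall asleep":     {"requires": lambda s: s["state"] == "lying down" and s["location"] == "bedroom",
                        "results": {"state": "asleep"}},
    "walk to kitchen": {"requires": lambda s: s["state"] == "standing" and s["location"] != "kitchen",
                        "results": {"location": "kitchen"}},
    "walk to bedroom": {"requires": lambda s: s["state"] == "standing" and s["location"] != "bedroom",
                        "results": {"location": "bedroom"}},
    "prepare food":    {"requires": lambda s: s["state"] == "standing" and s["location"] == "kitchen",
                        "results": {"hasfood": True}},
    "eat":             {"requires": lambda s: s["hasfood"],
                        "results": {"hungry": False, "hasfood": False}},
}

start = {"state": "sitting", "location": "living room", "hasfood": False, "hungry": True}
# A search (BFS/Dijkstra) over world states would expand, from each state,
# every action whose requirements pass, applying its results to get the next state.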
Actions such as "get up" and "lie down" are easy because there is one requirement and one result. Actions such as "walk to kitchen" and "walk to bedroom" present more of a problem, because they have more than one requirement. How can I use requirements/results as a path if paths intertwine with each other?
Ultimately, the question(s):
Could problem solving + pathfinding work in practice (or has it worked already)? Would it make more sense to use requirements/results as nodes and actions as paths? If you think this approach is promising, please respond with pseudocode or an explanation for implementation.