How is the Deep Q-learning loss function defined? What I can't understand is that every time we are in a state and select an action based on a policy derived from the Deep Q-network, only one of the Q-values is considered, which means that instead of a vector we have a scalar. In other words, the loss function only consists of one term. Am I correct? The notation used in papers is ambiguous.
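For what it's worth, here is a minimal PyTorch-style sketch of the standard DQN loss (the network and batch tensor names are my own placeholders). It shows that for each transition only the Q-value of the action actually taken enters the squared TD error, so the per-transition loss is indeed a scalar:

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, states, actions, rewards, next_states, dones, gamma=0.99):
    # actions: LongTensor of shape (batch,); dones: float tensor of 0/1 flags.
    # Q(s, a) for the action that was actually taken: one scalar per transition.
    q_taken = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Bootstrap target: r + gamma * max_a' Q_target(s', a'), zeroed at terminal states.
        next_max = target_net(next_states).max(dim=1).values
        target = rewards + gamma * (1.0 - dones) * next_max
    # Squared TD error, averaged over the batch.
    return F.mse_loss(q_taken, target)
```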
It is my understanding that Q-learning attempts to find the actual state-action values for all states and actions. However, my hypothetical example below seems to indicate that this is not necessarily the case.
Imagine a Markov decision process (MDP) with the following attributes:
a state space S = {s_1} with only one possible state,
an action space A = {a_1} with a single possible action,
a reward function R: S × A × S → ℝ with R(s_1, a_1, s_1) = 4,
and finally a state transition function T: S × A × S → [0,1] which assigns probability 1 to the only possible transition: taking a_1 in s_1 always leads back to s_1.
Now assume that we have a single agent which has been initialized using optimistic initialization. For all possible states and actions we set the Q-value equal to 5 (i.e. Q(s_1, a_1) = 5). Q-values will be updated using the Bellman equation:
Q(S,A) := Q(S,A) + α( R + γQ(S',A') - Q(S,A) )
Here α ∈ (0,1] and γ ∈ (0,1]; notice that both are required to be non-zero.
When the agent selects its action (a_1) in state s_1, the update formula becomes:
Q(s_1, a_1) := 5 + α( 4 + 5γ - 5 )
Notice that the Q-value does not change when 5γ = 1, or more generally when γQ(S,A) = Q(S,A) - R. Also, the Q-value will increase when γQ(S,A) > Q(S,A) - R, which would further increase the difference between the actual state-action value and the expected state-action value.
This seems to indicate that in some cases, it is possible for the difference between the actual and expected state-action values to increase over time. In other words, it is possible for the expected value to diverge from the actual value.
If we were to initialize the Q-values to 0 for all states and actions, we surely would not end up in this situation. However, I do believe it possible that a stochastic reward/transition function may cause the agent to overestimate its state-action values in a similar fashion, causing the above behavior to take effect. This would require a highly improbable situation where the MDP transitions to a high-payoff state often, even though this transition has a very low likelihood.
Perhaps some of the assumptions I made here do not actually hold. Maybe the goal is not to precisely estimate the true state-action value, but rather convergence to the optimal state-action values is sufficient. That being said, I do find it rather odd that this divergence between actual and expected returns is possible.
Any thoughts on this would be appreciated.
The problem with the above assumption is that I expected Q(s,a) to converge to R(s,a,s'). This is not the case. As described in the RL book by Sutton and Barto:
Q(s,a) = Σ_{s',r} p(s',r|s,a) [ r + γ Q(s',a') ] = E[ r + γ Q(s',a') ]
The Q-values, in this case, represent the expected return (the immediate reward plus the discounted value of the next state-action pair) and should converge to R + γQ(S',A'), not to R(s,a,s'). It is therefore unsurprising that the state-action values can move away from the deterministic immediate reward R, and that the value to which Q(s,a) converges depends on γ.
Furthermore, the hypothetical situation where Q(s,a) is overestimated when using a stochastic reward/transition function is possible. Convergence to the actual state-action values is only guaranteed when all state-action pairs (s,a) are visited an infinite number of times. This is therefore an issue of exploration versus exploitation: in this case, the agent should have been allowed to explore more.
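As a quick numerical check of the single-state example (a throwaway sketch in Python; the constants R = 4, Q_0 = 5 and the update rule are taken from the question), the Q-value settles at the fixed point R/(1 - γ) = 4/(1 - γ), so it rises above the optimistic initial value of 5 exactly when γ > 0.2:

```python
# Single-state, single-action MDP from the example: R = 4, Q initialized to 5.
# The update Q := Q + alpha * (R + gamma * Q - Q) has fixed point R / (1 - gamma).
R, alpha = 4.0, 0.1

for gamma in (0.1, 0.2, 0.5, 0.9):
    Q = 5.0
    for _ in range(10_000):
        Q += alpha * (R + gamma * Q - Q)
    print(f"gamma = {gamma}: Q -> {Q:.3f},  R / (1 - gamma) = {R / (1 - gamma):.3f}")
```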
In PPO's objective function, the second term introduces the squared-error loss of the value-function neural network. Is that term essentially the squared advantage values?
No, that's the TD error for training V. You can separate the two losses and nothing changes, because the networks do not share parameters. In practice, the policy is trained on the first term of the equation, while V is trained on the second.
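As a rough sketch of that separation (assuming separate policy and value networks; the tensor names below are placeholders), the two losses can be computed and optimized independently:

```python
import torch
import torch.nn.functional as F

def ppo_losses(new_logp, old_logp, advantages, values, returns, clip_eps=0.2):
    # First term: clipped surrogate objective, used to train the policy network.
    ratio = torch.exp(new_logp - old_logp)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()

    # Second term: squared error between V(s_t) and its regression target
    # (the estimated return), not the squared advantage.
    value_loss = F.mse_loss(values, returns)
    return policy_loss, value_loss
```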
I am just getting started with deep reinforcement learning and I am trying to grasp this concept.
I have this deterministic Bellman equation:
When I introduce stochasticity from the MDP, I get equation 2.6a.
Is this assumption correct? I saw this implementation of 2.6a without a policy symbol on the state-value function, but to me this does not make sense, because I am using the probabilities of the different next states I could end up in, which is the same as saying a policy, I think. And if 2.6a is correct, can I then assume the rest (2.6b and 2.6c)? Because then I would like to write the state-action value function like this:
The reason I am doing it this way is that I would like to work my way from a deterministic point of view to a non-deterministic point of view.
I hope someone out there can help on this one!
Best regards Søren Koch
No, the value function V(s_t) does not depend on the policy. You see in the equation that it is defined in terms of an action a_t that maximizes a quantity, so it is not defined in terms of actions as selected by any policy.
In the nondeterministic / stochastic case, you will have that sum over probabilities multiplied by state-values, but this is still independent from any policy. The sum only sums over different possible future states, but every multiplication involves exactly the same (policy-independent) action a_t. The only reason why you have these probabilities is because in the nondeterministic case a specific action in a specific state can lead to one of multiple different possible states. This is not due to policies, but due to stochasticity in the environment itself.
There does also exist such a thing as a value function for policies, and when talking about that, a symbol for the policy should be included. But this is typically not what is meant by just "value function", and it also does not match the equation you have shown us. A policy-dependent value function would replace the max_{a_t} with a sum over all actions a, weighting each term inside the sum by the probability pi(s_t, a) of the policy pi selecting action a in state s_t.
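To make the distinction concrete, here is a small NumPy sketch with a made-up two-state, two-action MDP: the Bellman optimality backup takes a max over actions, while the backup for a fixed policy pi takes an expectation over actions. The transition probabilities appear in both, precisely because of the environment's stochasticity:

```python
import numpy as np

# Made-up MDP: P[s, a, s'] are transition probabilities, R[s, a] immediate rewards.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9
V = np.zeros(2)              # current state-value estimates
pi = np.array([[0.5, 0.5],   # pi[s, a]: an arbitrary stochastic policy
               [0.3, 0.7]])

# One-step lookahead values Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] * V[s']
Q = R + gamma * (P @ V)

V_optimal = Q.max(axis=1)          # Bellman *optimality* backup: max over actions
V_policy = (pi * Q).sum(axis=1)    # Bellman backup for policy pi: expectation over actions
print(V_optimal, V_policy)
```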
Yes, your assumption is completely right. In the reinforcement learning field, a value function is the (expected) return obtained by starting from a particular state and following a policy π. So yes, strictly speaking, it should be accompanied by the policy symbol π.
The Bellman equation basically represents value functions recursively. However, it should be noted that there are two kinds of Bellman equations:
Bellman optimality equation, which characterizes optimal value functions. In this case, the value function is implicitly associated with the optimal policy. This equation has the nonlinear max operator and is the one you have posted. The (optimal) policy dependency is sometimes represented with an asterisk as follows:
Maybe some short texts or papers omit this dependency, assuming it is obvious, but I think any RL textbook should initially include it. See, for example, the books by Sutton & Barto or Busoniu et al.
Bellman equation, which characterizes a value function, in this case associated with any policy π:
In your case, your equation 2.6 is based on the Bellman equation, so you should remove the max operator and include the sum over all actions and possible next states. In Sutton & Barto's notation (sorry for the notation change with respect to your question, but I think it's understandable), it reads:

v_π(s) = Σ_a π(a|s) Σ_{s',r} p(s',r|s,a) [ r + γ v_π(s') ]
I've been studying up on reinforcement learning, but the thing I don't understand is how a Q-value is ever calculated. If you use the Bellman equation Q(s,a) = r + γ*max(Q(s',a')), wouldn't it just go on forever? Because Q(s',a') would need the Q-value one timestep further, and that would just continue on and on. How does it end?
In Reinforcement Learning you normally try to find a policy (the best action to take in a specific state), and the learning process ends when the policy does not change anymore or the value function (representing the expected reward) has converged.
You seem to be confusing Q-learning with value iteration using the Bellman equation. Q-learning is a model-free technique where you use the obtained reward to update Q:

Q(s_t, a_t) := Q(s_t, a_t) + α( r_{t+1} + γ max_a Q(s_{t+1}, a) - Q(s_t, a_t) )

Here the direct reward r_{t+1} is the reward obtained after having taken action a_t in state s_t. α is the learning rate and should be between 0 and 1; if it is 0, no learning is done, and if it is 1, only the newest reward is taken into account.
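As a tiny illustration of that update (a sketch with an invented 2-state, 2-action table, not tied to any particular environment):

```python
import numpy as np

Q = np.zeros((2, 2))          # tabular Q-values for 2 states x 2 actions
alpha, gamma = 0.1, 0.9

# Suppose the agent took action a_t = 1 in state s_t = 0, observed reward
# r_{t+1} = 4 and landed in state s_{t+1} = 1.
s, a, r, s_next = 0, 1, 4.0, 1

# Move Q(s_t, a_t) toward r_{t+1} + gamma * max_a' Q(s_{t+1}, a').
td_target = r + gamma * Q[s_next].max()
Q[s, a] += alpha * (td_target - Q[s, a])
```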
Value iteration with the Bellman equation:

V_{t+1}(s) = max_a Σ_{s'} P_a(s,s') ( R_a(s,s') + γ V_t(s') )

Here a model P_a(s,s') is required, also written P(s'|s,a), which is the probability of going from state s to s' using action a. To check whether the value function has converged, V_{t+1} is compared with V_t for all states, and if the largest difference is smaller than a small value ε, the value function (and hence the policy) is said to have converged:

max_s | V_{t+1}(s) - V_t(s) | < ε
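A minimal NumPy sketch of that loop, with an invented 2-state, 2-action model P[s, a, s'] and reward table R[s, a], using the ε test described above:

```python
import numpy as np

# Invented model: P[s, a, s'] = P(s'|s, a), R[s, a] = expected immediate reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[1.0, 0.0], [0.5, 0.5]]])
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])
gamma, eps = 0.9, 1e-6

V = np.zeros(2)
while True:
    # Bellman backup: V_new(s) = max_a [ R(s, a) + gamma * sum_s' P(s'|s, a) * V(s') ]
    V_new = (R + gamma * (P @ V)).max(axis=1)
    if np.abs(V_new - V).max() < eps:   # sup-norm convergence test
        break
    V = V_new

policy = (R + gamma * (P @ V)).argmax(axis=1)   # greedy policy from the converged V
print(V, policy)
```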
See also:
Difference between Q-learning and Value Iteration
How do I know when a Q-learning algorithm converges?
Sutton & Barto: Reinforcement Learning: An Introduction
I have a function f(x) = 1/(x + a + b*I*sign(x)) and I want to calculate the
integral of
dx dy dz f(x) f(y) f(z) f(x+y+z) f(x-y-z)
over the entire R^3 (b > 0, and a and b are of order unity). This is just a representative example -- in practice I have n < 7 variables and 2n-1 instances of f(): n of them involving the n integration variables and n-1 of them involving some linear combination of the integration variables. At this stage I'm only interested in a rough estimate with a relative error of 1e-3 or so.
I have tried the following libraries:
Steven Johnson's cubature code: the hcubature algorithm works but is abysmally slow, taking hundreds of millions of integrand evaluations for even n=2.
HintLib: I tried adaptive integration with a Genz-Malik rule, the cubature routines, and VEGAS and MISER with the Mersenne twister RNG. For n=3 only the first seems to be a somewhat viable option, but it again takes hundreds of millions of integrand evaluations for n=3 and relerr = 1e-2, which is not encouraging.
For the region of integration I have tried both approaches: Integrating over [-200, 200]^n (i.e. a region so large that it essentially captures most of the integral) and the substitution x = sinh(t) which seems to be a standard trick.
I do not have much experience with numerical analysis but presumably the difficulty lies in the discontinuities from the sign() term. For n=2 and f(x)f(y)f(x-y) there are discontinuities along x=0, y=0, x=y. These create a very sharp peak around the origin (with a different sign in the various quadrants) and sort of 'ridges' at x=0,y=0,x=y along which the integrand is large in absolute value and changes sign as you cross them. So at least I know which regions are important. I was thinking that maybe I could do Monte Carlo but somehow "tell" the algorithm in advance where to focus. But I'm not quite sure how to do that.
I would be very grateful if you had any advice on how to evaluate the integral with a reasonable amount of computing power or how to make my Monte Carlo "idea" work. I've been stuck on this for a while so any input would be welcome. Thanks in advance.
One thing you can do is to use a guiding function for your Monte Carlo integration: given an integral (I am writing it in 1D for simplicity) ∫ f(x) dx, write it as ∫ f(x)/g(x) g(x) dx, and use g(x) as a distribution from which you sample x.
Since g(x) is arbitrary, construct it such that (1) it has peaks where you expect them to be in f(x), and (2) you can sample x from g(x) (e.g., a Gaussian, or 1/(1+x^2)).
Alternatively, you can use a Metropolis-type Markov chain MC. It will find the relevant regions of the integrand (almost) by itself.
Here are a couple of trivial examples.
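For instance, a minimal 1D importance-sampling sketch in Python (my own toy integrand and guiding density, not the f from the question) could look like this:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy target: integrate h(x) = 1 / (1 + x**4) over the whole real line;
# the exact value is pi / sqrt(2) ~ 2.2214.
def h(x):
    return 1.0 / (1.0 + x**4)

# Guiding density g(x) = 1 / (pi * (1 + x**2)) (standard Cauchy): heavy-tailed,
# peaked at the origin like h, and trivial to sample from.
def g(x):
    return 1.0 / (np.pi * (1.0 + x**2))

x = rng.standard_cauchy(size=1_000_000)   # samples drawn from g
w = h(x) / g(x)                           # integrand divided by the guiding density
estimate = w.mean()
stderr = w.std(ddof=1) / np.sqrt(x.size)
print(f"estimate = {estimate:.4f} +/- {stderr:.4f}  (exact = {np.pi / np.sqrt(2):.4f})")
```

A Metropolis-type chain would similarly concentrate samples in the peaked regions, at the cost of only giving the integral up to a normalization constant that has to be determined separately.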