Is this true? What about Expected SARSA and Double Q-Learning? - reinforcement-learning

I'm studying Reinforcement Learning and I'm having trouble understanding the differences between SARSA, Q-Learning, Expected SARSA, Double Q-Learning and temporal difference. Can you please explain the differences and tell me when to use each? And what is the effect of ε-greedy versus greedy action selection?
SARSA :
I'm in state St; an action At is chosen with the help of the policy, which moves me to state St+1. In St+1 the policy again chooses an action At+1, and the value of (St, At) is updated using the observed reward plus the estimated value of the look-ahead pair (St+1, At+1).
Q(S, A) ← Q(S, A) + α[R + γ Q(S', A') − Q(S, A)]
Q-Learning:
I'm in state St; an action is chosen with the help of the policy, which moves me to state St+1. This time the update does not depend on the action the policy will actually take next; instead it uses the maximum estimated value over actions in St+1 (the greedy value), and with that the value of (St, At) is updated.
Q(S, A) ← Q(S, A) + α[R + γ max_a Q(S', a) − Q(S, A)]
Expected SARSA:
It's the same idea as Q-Learning, except that instead of updating with the greedy value in St+1, I take the expected value over all actions under the policy:
Q(St, At) ← Q(St, At) + α[Rt+1 + γ E[Q(St+1, At+1) | St+1] − Q(St, At)]
Temporal difference :
The current value estimate is updated using the observed reward Rt+1 and the estimated value V(St+1) at time point t+1:
V (St) ← V (St) + α[Rt+1 + γV (St+1) − V (St)]
Is what I have correct, or am I missing something? And what about Double Q-Learning?
With 0.5 probability:
Q1(S, A) ← Q1(S, A) + α[R + γ Q2(S', argmax_a Q1(S', a)) − Q1(S, A)]
else:
Q2(S, A) ← Q2(S, A) + α[R + γ Q1(S', argmax_a Q2(S', a)) − Q2(S, A)]
Can someone explain it, please?
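For reference, the four action-value updates above can be sketched side by side in code. This is only an illustrative sketch for a tabular Q; the table shapes and the α = 0.1, γ = 0.99 defaults are made up, not taken from any particular source:

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy: bootstrap from the action actually chosen in s_next.
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Off-policy: bootstrap from the greedy (maximum) value in s_next.
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])

def expected_sarsa_update(Q, s, a, r, s_next, policy_probs, alpha=0.1, gamma=0.99):
    # Bootstrap from the expectation of Q over the policy's action probabilities.
    expected = np.dot(policy_probs, Q[s_next])
    Q[s, a] += alpha * (r + gamma * expected - Q[s, a])

def double_q_update(Q1, Q2, s, a, r, s_next, alpha=0.1, gamma=0.99, rng=np.random):
    # Decouple action selection (argmax in one table) from evaluation (the other table).
    if rng.random() < 0.5:
        a_star = np.argmax(Q1[s_next])
        Q1[s, a] += alpha * (r + gamma * Q2[s_next, a_star] - Q1[s, a])
    else:
        a_star = np.argmax(Q2[s_next])
        Q2[s, a] += alpha * (r + gamma * Q1[s_next, a_star] - Q2[s, a])
```

Note that the only difference between them is the bootstrap target: the next action actually chosen (SARSA), the greedy value (Q-Learning), the policy expectation (Expected SARSA), or the other table's value at this table's argmax (Double Q-Learning).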


Plotting decision boundary line in Octave

I have been working on a machine learning course and currently on Classification. I implemented the classification algorithm and obtained the parameters as well as the cost. The assignment already has a function for plotting the decision boundary and it worked but I was trying to read their code and cannot understand these lines.
plot_x = [min(X(:,2))-2, max(X(:,2))+2];
% Calculate the decision boundary line
plot_y = (-1./theta(3)).*(theta(2).*plot_x + theta(1));
Can anyone explain?
I'm also taking the same course as you. I guess what the code does is to generate two points on the decision line.
As you know you have the function:
theta0 + theta1 * x1 + theta2 * x2 = 0
Which it can be rewritten as:
c + mx + ky = 0
where x and y are the axes corresponding to x1 and x2, c is the intercept term theta0, m is theta1, and k is theta2.
This equation (c + mx + ky = 0) corresponds to the decision boundary, so the code is finding two values for x (or x1) which cover the whole dataset (-2 and +2 in plot_x min and max functions) and then uses the equation to find the corresponding y (or x2) values. Finally, a decision boundary can be plotted -- plot(plot_x, plot_y).
In other words, it uses the equation to generate two points and draws the line through them; the reason for doing this is that Octave cannot plot a line directly from an equation.
Hope this can help you, sorry for any mistake in grammar or unclear explanation ^.^
Rearranging equations helped me, so adding those here:
plot_y = (-1/theta2) * (theta1*plot_x + theta0)
note that index in Octave starts at 1, not at 0, so theta(3) = theta2, theta(2) = theta1 and theta(1) = theta0.
This plot_y equation is equivalent to:
c + mx + ky = 0 <=>
-ky = mx + c <=>
y = -1/k (mx + c)
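The same two-point construction can be sketched outside Octave as well. Here is a minimal NumPy version; the theta values below are made up purely for illustration:

```python
import numpy as np

# Hypothetical fitted parameters [theta0, theta1, theta2] for the
# boundary theta0 + theta1*x1 + theta2*x2 = 0 (values made up).
theta = np.array([-3.0, 1.0, 1.0])

# Two x1 values spanning the plot range, as in the Octave snippet.
plot_x = np.array([0.0, 4.0])

# Solve the boundary equation for x2: x2 = -(theta1*x1 + theta0) / theta2
plot_y = (-1.0 / theta[2]) * (theta[1] * plot_x + theta[0])
```

Plotting plot_x against plot_y (e.g. with matplotlib) draws the straight line through those two points, exactly as Octave's plot(plot_x, plot_y) does.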

What's the opposite of 'hardcoding'?

If rewriting
x = 2
y = 2
z = x + y
to
z = 2 + 2
can be described as 'hardcoding' ('hard-coding'?) the values of x and y,
what do you call its opposite, e.g.
z = 2 + 2
to
x = 2
y = 2
z = x + y
(I've never heard of 'softcoding', so I assume it's inappropriate)?
@melpomene's comment is correct: both of your examples are hard-coding.
Anyway, in my team we call this process parameterization, e.g. "we need to parameterize the threshold value used in function X", though I'm not sure if this is a common term.
For what it's worth, there's actually a Wikipedia entry on Softcoding, which means what you asked for, but in a negative connotation:
The term is generally used where softcoding becomes an anti-pattern. Abstracting too many values and features can introduce more complexity and maintenance issues than would be experienced with changing the code when required.

Explanation behind actor-critic algorithm in pytorch example?

Pytorch provides a good example of using actor-critic to play Cartpole in the OpenAI gym environment.
I'm confused about several of their equations in the code snippet found at https://github.com/pytorch/examples/blob/master/reinforcement_learning/actor_critic.py#L67-L79:
saved_actions = model.saved_actions
value_loss = 0
rewards = []
R = 0
for r in model.rewards[::-1]:
    R = r + args.gamma * R
    rewards.insert(0, R)
rewards = torch.Tensor(rewards)
rewards = (rewards - rewards.mean()) / (rewards.std() + np.finfo(np.float32).eps)
for (action, value), r in zip(saved_actions, rewards):
    action.reinforce(r - value.data.squeeze())
    value_loss += F.smooth_l1_loss(value, Variable(torch.Tensor([r])))
optimizer.zero_grad()
final_nodes = [value_loss] + list(map(lambda p: p.action, saved_actions))
gradients = [torch.ones(1)] + [None] * len(saved_actions)
autograd.backward(final_nodes, gradients)
optimizer.step()
What do r and value mean in this case? Why do they run REINFORCE on the action space with the reward equal to r - value? And why do they try to set the value so that it matches r?
Thanks for your help!
First the rewards are collected for a while, along with the state/action that produced each reward.
Then r - value is the difference between the actual (discounted) reward and the expected one.
That difference is used to adjust the expected value of that action from that state.
So if in state "middle" the expected reward for action "jump" was 10 and the actual reward was only 2, then the AI was off by -8 (2 - 10). "Reinforce" means "adjust expectations". If we adjust them by half, the new expected reward is 10 - (8 * .5), or 6: the AI really thought it would get 10 for that, but now it's less confident and thinks 6 is a better guess. If the AI is off by only a little, say by 2, it adjusts by a smaller amount: 10 - (2 * .5) = 9.
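The backward discounting loop and the r - value difference from the snippet can be sketched in isolation. The episode rewards and critic values below are made up for illustration:

```python
def discounted_returns(rewards, gamma):
    # Walk backwards through time: R_t = r_t + gamma * R_{t+1}
    R = 0.0
    returns = []
    for r in reversed(rewards):
        R = r + gamma * R
        returns.insert(0, R)
    return returns

# Made-up episode: three steps, reward 1 each, gamma = 0.9
returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.9)

# r - value in the snippet: how much better the actual return was
# than the critic's prediction (the values here are made up).
values = [2.0, 1.5, 0.5]
advantages = [R - v for R, v in zip(returns, values)]
```

A positive advantage makes the taken action more likely; a negative one makes it less likely, which is exactly what scaling REINFORCE by r - value achieves.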

Defining a Differential Equation in Octave

I am attempting to use Octave to solve for a differential equation using Euler's method.
The Euler method was given to me (and is correct), which works for the given Initial Value Problem,
y*y'' + (y')^2 + 1 = 0; y(1) = 1;
That initial value problem is defined in the following Octave function:
function [YDOT] = f(t, Y)
  YDOT(1) = Y(2);
  YDOT(2) = -(1 + Y(2)^2)/Y(1);
endfunction
The question I have is about this function definition. Why is YDOT(1) != 1? What is Y(2)?
I have not found any documentation on the definition of a function using function [YDOT] instead of simply function YDOT, and I would appreciate any clarification on what the Octave code is doing.
First things first: you have a (nonlinear) differential equation of order two, which requires two initial conditions. Thus the information given above is not enough.
The following is defined for further explanations: A==B means A is identical to B; A=>B means B follows from A.
It seems you are mixing a few things. The guy who gave you the files rewrote the equation in the following way:
y*y'' + (y')^2 + 1 = 0; y(1) = 1; | (I) y := y1 & (II) y' := y2
(I) & (II)=>(III): y' = y2 = y1' | y2==Y(2) & y1'==YDOT(1)
Octave is "matrix/vector oriented", so we write everything as vectors or matrices. Rather than writing y1=alpha and y2=beta, we write y=[alpha; beta], where y(1)==y1=alpha and y(2)==y2=beta. You will soon realize the tremendous advantage of using this mathematical formalization for ALL of your problems.
(III) & f(t,Y)=>(IV): y2' == YDOT(2) = y'' = (-1 -(y')^2) / y
Now recall what is y' and y from the definitions in (I) and (II)!
y' = y2 == Y(2) & y = y1 == Y(1)
So we can rewrite equation (IV)
(IV): y2' == YDOT(2) = (-1 -(y')^2) / y == -(1 + Y(2)^2)/Y(1)
So from equation (III) and (IV) we can derive what you already know:
YDOT(1) = Y(2)
YDOT(2) = -(1 + Y(2)^2)/Y(1)
These equations are passed to the solver. Differential equations of all types are solved numerically by computing the "next" value in a small neighborhood of some "previously known" value. (The step size within this neighborhood is one of the key questions when writing solvers!) So your solver uses your initial condition Y(1)==y(1)=1 to take the first step and calculate the "next" value. Right at the start, YDOT(1)=Y(2)==y(2), but you didn't tell us this value! From then on, YDOT(1) is varied by the solver, depending on the shape of the function, to give you ONE unique solution y(t).
It seems you are using Octave for the first time so let's make a last comment on function [YDOT] = f(t, Y). In general a function is defined in this way:
function [retVal1, retVal2, ...] = myArbitraryName(arg1, arg2, ...)
Where retVal is the return value or output and arg is the argument or input.
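To make the roles of the two components of Y concrete, here is a minimal Euler loop for this system in Python. Note that the question only gives y(1) = 1; the second initial condition y'(1) = 0 used below is made up purely for illustration, since (as noted above) the problem statement doesn't provide one:

```python
def f(t, Y):
    # Y = [y, y']; returns [y', y''] (the YDOT vector from the question)
    return [Y[1], -(1.0 + Y[1] ** 2) / Y[0]]

def euler(f, t0, Y0, h, n_steps):
    # Explicit Euler: Y_{k+1} = Y_k + h * f(t_k, Y_k)
    t, Y = t0, list(Y0)
    for _ in range(n_steps):
        dY = f(t, Y)
        Y = [y + h * dy for y, dy in zip(Y, dY)]
        t += h
    return t, Y

# y(1) = 1 is given; y'(1) = 0 is an assumed second initial condition.
t_end, Y_end = euler(f, 1.0, [1.0, 0.0], h=0.01, n_steps=10)
```

Each step updates both components at once, which is why YDOT(1) = Y(2): the derivative of the first component is, by construction, the second component.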

Summing Tensors

I'm implementing the system detailed in this paper.
On page 3, section 4 it shows the form that tensors take within the system:
R [ cos(2t), sin(2t); sin(2t), -cos(2t) ]
In my system, I only store R and t, since everything can be calculated from them.
However, I've got to the point where I need to sum two of these tensors (page 4, section 5.2). How can I find values for R and t after summing two tensors of this form?
I guess that's what you are looking for:
x = R_1*cos(2*t_1) + R_2*cos(2*t_2)
y = R_1*sin(2*t_1) + R_2*sin(2*t_2)
R_result = sqrt(x*x+y*y)
t_result = atan2(y,x)/2
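A quick numerical check (with arbitrary made-up R and t values) confirms that the (R, t) returned by these formulas reproduces the element-wise matrix sum:

```python
import numpy as np

def tensor(R, t):
    # The matrix form from the paper: R * [[cos 2t, sin 2t], [sin 2t, -cos 2t]]
    return R * np.array([[np.cos(2 * t), np.sin(2 * t)],
                         [np.sin(2 * t), -np.cos(2 * t)]])

def sum_params(R1, t1, R2, t2):
    # The formulas above: treat each tensor as the 2D vector (R cos 2t, R sin 2t)
    x = R1 * np.cos(2 * t1) + R2 * np.cos(2 * t2)
    y = R1 * np.sin(2 * t1) + R2 * np.sin(2 * t2)
    return np.hypot(x, y), np.arctan2(y, x) / 2

R1, t1, R2, t2 = 1.5, 0.3, 0.7, 1.1   # arbitrary example values
R, t = sum_params(R1, t1, R2, t2)

# The reconstructed tensor equals the term-by-term matrix sum:
assert np.allclose(tensor(R, t), tensor(R1, t1) + tensor(R2, t2))
```

This works because R cos(2t) and R sin(2t) fully determine the matrix, so summing those two numbers per tensor is the same as summing the matrices.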
Each term reduces to
R_1 trg(2 t_1) + R_2 trg(2 t_2) = R_1 trg_1 + R_2 trg_2
where trg represents either sin or cos and the indexed version takes the obvious meaning. So this is just an ordinary problem in trigonometric identities, repeated a couple of times.
Let
Q = (R_1 + R_2)/2
S = (R_1 - R_2)/2
then
R_1 trg(2 t_1) + R_2 trg(2 t_2) = Q (trg_1 + trg_2) + S (trg_1 - trg_2)
which involves identities you can look up.
Sorry, adding two tensors is nothing more than algebra. The two matrices have to be the same size, and you add them term by term.
You can't just add the radii and angles and plug them back into the tensor. Do the addition properly and it'll work. Here's the first term:
R1*cos(2t1) + R2*cos(2t2) = ?
Here's the answer from Wolfram Alpha. As you can see, it doesn't simplify into a nice, neat expression with an R and a T for you.
In case you haven't thought of it, put the tensor sum into Wolfram Alpha and see what it gives you. They're better at algebra than anyone at this site. Why not get an independent check of your work?