I'm studying CS231N, lecture 14, "Reinforcement Learning". In the lecture, the instructor mentioned the value function, which is shown in the picture:
I am wondering what is that bar between rt and s0? I thought it was something like conditional probability, but I'm not sure about it. Or is it just a division?
It's the conditional probability. It literally means the reward at time t, given state s, following policy pi.
Related
While watching the Reinforcement Learning course by David Silver on youtube (and the slide: Lecture 2 MDP), I found the "Reward" and "Value Function" really confusing.
I tried to understand the "given rewards" marked on the slide (P11), but I cannot figure out why it is the case. Like, the "Class 1: R = -2" but "Pub: R = +1"
why the negative reward for Class and the positive reward for Pub? why the different value?
How to calculate the reward with the Discount Factor? (P17 and P18)
I think the lack of intuition for Reinforcement Learning is the main reason why I have encountered this kind of problem...
So, I'd really appreciate it if someone can give me a little hint.
You usually set the reward and the discount such that using RL you will drive the agent to solve a task.
In the student example the goal is to pass the exam. The student can spend his time attending a class, sleeping, on Facebook or at the pub. Attending a class is something "boring", so the student doesn't see the immediate benefits of doing it. Hence the negative reward. On the contrary, going to the pub is fun and gives a positive reward. However, only by attending all 3 classes the student can pass the exam and get the big final reward.
Now the question is: how much does the student value immediate vs future rewards? The discount factor tells you that: a small discount gives more importance to immediate rewards, because future rewards just "fade" in the long run. If we use a small discount, the student may prefer to always go to the pub or to sleep. With a discount close to 0, already after one step all rewards get close to 0 as well, so at each state the student will try to maximize the immediate reward, because after that "nothing else matter".
On the contrary, high discounts (max 1) value long-term rewards more: in this case the optimal student will attend all classes and pass the exam.
Choosing the discount can be tricky, especially if there is no terminal state (in this case "sleep" is terminal), because with a discount of 1 the agent may ignore the number of steps used to reach the highest reward. For instance, if classes would give a reward of -1 instead of -2, for the agent would be the same to spend time alternating between "class" and "pub" forever and at some point to pass the exam, because with discount 1 the rewards never fade, so even after 10 years the students will still get +10 for passing the exam.
Think also of a virtual agent having to reach a goal position. With discount 1, the agent would not learn to reach it in the least amount of steps: as long as it reaches it, it's the same for him.
Beside that, there is also a numerical problem with discount 1. Since the goal is to maximize the cumulative sum of the discounted reward, if rewards are not discounted (and the horizon is infinite) the sum will not converge.
Q1) First of all you should not forget that there rewards are given by the environment. The actions taken by the agent do not have an effect on the rewards of the environment, but of course it affects the reward gained by the followed trajectory.
In the example these +1 and -2 are just funny examples :) "As a student" you get bored during the class, so the reward of it is -2, while you have fun in the pub, so the reward is +1. Don't get confused with the reasons behind these numbers, they are environment given.
Q2) Let's do the calculation for the state with the value 4.1 in "Example: State-Value Function for Student MRP (2)":
v(s) = (-2) + 0.9 * [(0.4 * 1.9) + (0.6 * 10)] = (-2) + 6.084 =~ 4.1
Here David is using the Bellman Equation for MRPs. You can find it on the same slide.
I am taking lectures of course CS231 from Stanford university. I am unable to understand the point from RNN, Why Softmax unable to select the highest probability which is 0.84 for character o (in the attached example) instead of 0.13 for character e. Explanation will be highly appreciated.
I have not really watched the lecture, but I think the 'e' at the top is the expected output(and 'l', 'l', 'o' too). The initial weights are not giving good enough results (giving 'o' instead of 'e'). As you train the network, the weights will become more mature and ultimately you will see the change in probabilities and the the first prediction will result in 'e' ultimately
I have a question about my own project for testing reinforcement learning technique. First let me explain you the purpose. I have an agent which can take 4 actions during 8 steps. At the end of this eight steps, the agent can be in 5 possible victory states. The goal is to find the minimum cost. To access of this 5 victories (with different cost value: 50, 50, 0, 40, 60), the agent don't take the same path (like a graph). The blue states are the fail states (sorry for quality) and the episode is stopped.
enter image description here
The real good path is: DCCBBAD
Now my question, I don't understand why in SARSA & Q-Learning (mainly in Q learning), the agent find a path but not the optimal one after 100 000 iterations (always: DACBBAD/DACBBCD). Sometime when I compute again, the agent falls in the good path (DCCBBAD). So I would like to understand why sometime the agent find it and why sometime not. And there is a way to look at in order to stabilize my agent?
Thank you a lot,
Tanguy
TD;DR;
Set your epsilon so that you explore a bunch for a large number of episodes. E.g. Linearly decaying from 1.0 to 0.1.
Set your learning rate to a small constant value, such as 0.1.
Don't stop your algorithm based on number of episodes but on changes to the action-value function.
More detailed version:
Q-learning is only garranteed to converge under the following conditions:
You must visit all state and action pairs infinitely ofter.
The sum of all the learning rates for all timesteps must be infinite, so
The sum of the square of all the learning rates for all timesteps must be finite, that is
To hit 1, just make sure your epsilon is not decaying to a low value too early. Make it decay very very slowly and perhaps never all the way to 0. You can try , too.
To hit 2 and 3, you must ensure you take care of 1, so that you collect infinite learning rates, but also pick your learning rate so that its square is finite. That basically means =< 1. If your environment is deterministic you should try 1. Deterministic environment here that means when taking an action a in a state s you transition to state s' for all states and actions in your environment. If your environment is stochastic, you can try a low number, such as 0.05-0.3.
Maybe checkout https://youtu.be/wZyJ66_u4TI?t=2790 for more info.
I really need some help
I have to see if there is an effect of holidays, special days, and weather on my daily data. The daily data has clearly a seasonal cycle.
So for this,I am trying to run linear regression with around 1100 observations. The first step was to transformer the number of sells as described in the papers I read.
Rt=Ln(Pt/Pt-1)*100
Where Pt is the the sales of today, and Pt-1 from the previous day
To account for the seasonality, I simply have a first linear regression using the day of the week as dummy variable
fit<- lm(log_return~D1+D3+D4+D5+D6+D7,data=mydata)
which with OLS gives the following result: Model Plot
The residuals are non-normal, Q-Q Plot shows some heavy legs. I think it might be because the Residuals have significant heteroscedacity and autocorrelation.
From the papers I read, the authors always used Standard White and Newey West for correction of heterscedacity and autocorrelation in the residuals. variables, I have both problem making my OLS certainly wrong.
From the sandwich documentation for R, it appears that Newey West can find automaticly the right amount of lags, which is for me great, because I tried several ARMA(p,q) variants without success.
So now I run the vcovHAC(fit), I am not really sure what it is. Does it correct my problem of unknown autocorrelation and heteroscedacity by itself completly also with White SE,and find the necessary lag(s) needed to correct the regression?
How do I apply this then on my regression to get the same summary as summary(fit), but with the corrected values in R, including Rsquared?
Does it magicly correct everything ? What are the limits?
Thanks a lot guys
Maybe this can help:
R and Newey-West
R and ARMA
R and vcovHAC
R and Linear Regression
I've been studying up on reinforcement learning, but the thing I don't understand is how a Q value is ever calculated. If you use the Bellman equation Q(s,a) = r + γ*max(Q(s',a')), would't it just go on forever? Because Q(s',a') would need the Q value of one timestep further, and that would just continue on and on. How does it end?
In Reinforcement Learning you normally try to find a policy (the best action to take in a specific state), and the learning process ends when the policy does not change anymore or the value function (representing the expected reward) has converged.
You seem to confuse Q-learning and Value Iteration using the Bellman equation. Q-learning is a model-free technique where you use obtained reward to update Q:
Here the direct reward rt+1 is the reward obtained after having done action at in state st. α is the learning rate that should be between 0 and 1, if it is 0 no learning is done, if it is 1 only the newest reward is taken into account.
Value iteration with the Bellman equation:
Where a model Pa(s,s') is required, also defined as P(s'|s,a), which is the probability of going from state s to s' using action a. To check if the value function is converged, normally the value function Vt+1 is compared to Vt for all states and if it is smaller than a small value (ε) the policy is said to be converged:
See also:
Difference between Q-learning and Value Iteration
How do I know when a Q-learning algorithm converges?
Sutton et al.: RL