Future state in Deep Q learning in stock trading - reinforcement-learning

In Q-learning, Q(S,a) is updated via the Bellman equation: Q(S,a) = r + γ*max(Q(S',a')), where S' is the next (future) state. Let's say S denotes the price of the stock: does this mean we are using future information (S') in the Q-learning process? After training, when the strategy is put into use, how is it possible to apply it without knowing the future price (the future state)?
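For intuition, here is a minimal, hypothetical tabular sketch (the state indices and update function are placeholders, not part of the question): during training, s_next is simply the state observed after the action has been taken on historical or live data, and at deployment only the current state is needed to act.

import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # s_next is the state actually observed *after* acting; it is never a forecast
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])

def act(Q, s):
    # at deployment we only look up the learned table for the *current* state
    return np.argmax(Q[s])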

Related

Confused about Rewards in David Silver Lecture 2

While watching the Reinforcement Learning course by David Silver on YouTube (and following the slides for Lecture 2 on MDPs), I found "Reward" and "Value Function" really confusing.
I tried to understand the given rewards marked on the slide (P11), but I cannot figure out why they take those values, e.g. "Class 1: R = -2" but "Pub: R = +1".
Why the negative reward for Class and the positive reward for Pub? Why the different values?
How do you calculate the reward with the discount factor? (P17 and P18)
I think my lack of intuition for reinforcement learning is the main reason I have run into this kind of problem...
So I'd really appreciate it if someone could give me a little hint.
You usually set the rewards and the discount such that, using RL, the agent is driven to solve the task.
In the student example the goal is to pass the exam. The student can spend his time attending a class, sleeping, on Facebook, or at the pub. Attending a class is "boring", so the student doesn't see the immediate benefit of doing it, hence the negative reward. On the contrary, going to the pub is fun and gives a positive reward. However, only by attending all 3 classes can the student pass the exam and get the big final reward.
Now the question is: how much does the student value immediate vs. future rewards? The discount factor tells you that: a small discount gives more importance to immediate rewards, because future rewards just "fade" in the long run. If we use a small discount, the student may prefer to always go to the pub or to sleep. With a discount close to 0, all rewards already shrink close to 0 after one step, so at each state the student will try to maximize the immediate reward, because after that "nothing else matters".
On the contrary, high discounts (at most 1) value long-term rewards more: in this case the optimal student will attend all classes and pass the exam.
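As a toy illustration (the numbers below only roughly mirror the student example and are not taken from the slides), here is how the discount changes the value of the same trajectory:

# three "boring" classes followed by passing the exam
rewards = [-2, -2, -2, 10]

def discounted_return(rewards, gamma):
    return sum(gamma**t * r for t, r in enumerate(rewards))

print(discounted_return(rewards, 0.1))  # about -2.21: the final +10 barely counts
print(discounted_return(rewards, 0.9))  # about +1.87: the exam reward dominates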
Choosing the discount can be tricky, especially if there is no terminal state (in this case "sleep" is terminal), because with a discount of 1 the agent may ignore the number of steps used to reach the highest reward. For instance, if classes gave a reward of -1 instead of -2, it would be the same for the agent to spend time alternating between "class" and "pub" forever and pass the exam at some point, because with discount 1 the rewards never fade, so even after 10 years the student would still get +10 for passing the exam.
Think also of a virtual agent that has to reach a goal position. With discount 1, the agent would not learn to reach it in the least number of steps: as long as it reaches it eventually, it's all the same to it.
Besides that, there is also a numerical problem with discount 1: since the goal is to maximize the cumulative sum of discounted rewards, if rewards are not discounted (and the horizon is infinite) the sum will not converge.
Q1) First of all, you should not forget that these rewards are given by the environment. The actions taken by the agent do not change the environment's rewards, but of course they affect the reward collected along the trajectory that is followed.
In the example these +1 and -2 are just funny examples :) "As a student" you get bored during class, so its reward is -2, while you have fun at the pub, so its reward is +1. Don't get confused by the reasons behind these numbers; they are given by the environment.
Q2) Let's do the calculation for the state with the value 4.1 in "Example: State-Value Function for Student MRP (2)":
v(s) = (-2) + 0.9 * [(0.4 * 1.9) + (0.6 * 10)] = (-2) + 6.084 =~ 4.1
Here David is using the Bellman Equation for MRPs. You can find it on the same slide.
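For completeness, the same computation written against the MRP Bellman equation v(s) = R_s + γ * Σ_s' P(s'|s) * v(s') (the probabilities and successor values below are the ones quoted above from the slide):

R_s, gamma = -2.0, 0.9
successors = [(0.4, 1.9), (0.6, 10.0)]  # (P(s'|s), v(s')) pairs from the slide
v = R_s + gamma * sum(p * v_next for p, v_next in successors)
print(v)  # 4.084, which the slide rounds to 4.1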

How to do reinforcement learning with regression instead of classification

I'm trying to apply reinforcement learning to a problem where the agent interacts with continuous numerical outputs using a recurrent network. Basically, it is a control problem where two outputs control how an agent behaves.
I define a policy as epsilon-greedy: (1 - eps) of the time the output control values are used directly, and eps of the time the output values are perturbed by +/- a small Gaussian perturbation.
In this sense the agent can explore.
In most of the reinforcement learning literature I see that policy learning requires discrete actions, which can be learned with the REINFORCE (Williams 1992) algorithm, but I'm unsure what method to use here.
At the moment what I do is use masking to learn only the top choices, using an algorithm based on Metropolis-Hastings to decide whether a transition moves toward the optimal policy. Pseudo-code:
# input: rewards, timeIndices, std
# rewards in (0, 1) and the optimal reward is 1
# relate rewards to likelihood via L(r) = exp(-|r - 1| / std)
# since r <= 1, |r - 1| = 1 - r
import numpy as np

targetMask = np.zeros(len(timeIndices))
neglogLi = (1 - np.mean(rewards)) / std
# go through the rewards in a random order to approximate a Markov process
for i in np.random.permutation(len(rewards)):
    r, idx = rewards[i], timeIndices[i]
    neglogLj = (1 - r) / std
    # Metropolis-Hastings step: always accept a lower negative log-likelihood,
    # otherwise accept with probability exp(neglogLi - neglogLj)
    if neglogLj < neglogLi or np.log(np.random.uniform()) < neglogLi - neglogLj:
        # accept the transition, i.e. learn this action
        targetMask[idx] = 1
        neglogLi = neglogLj
This provides a targetMask with ones for the actions that will be learned using standard backprop.
Can someone point me to the proper, or a better, way to do this?
Policy gradient methods are good for learning continuous control outputs. If you look at http://rll.berkeley.edu/deeprlcourse/#lectures, the Feb 13 lecture as well as the March 8 through March 15 lectures might be useful to you. Actor Critic methods are covered there, as well.
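As a rough illustration of that suggestion, here is a minimal REINFORCE sketch with a Gaussian policy over continuous actions (the linear policy, the dimensions, and the env object with reset()/step() are hypothetical placeholders, not something from the lectures):

import numpy as np

state_dim, action_dim = 4, 2
W = np.zeros((action_dim, state_dim))   # linear policy: mean action = W @ state
std = 0.1                               # fixed exploration noise
alpha, gamma = 1e-3, 0.99

def run_episode(env):
    states, actions, rewards = [], [], []
    s, done = env.reset(), False
    while not done:
        a = W @ s + std * np.random.randn(action_dim)   # sample from the Gaussian policy
        s_next, r, done = env.step(a)
        states.append(s); actions.append(a); rewards.append(r)
        s = s_next
    return states, actions, rewards

def reinforce_update(states, actions, rewards):
    global W
    G, returns = 0.0, []
    for r in reversed(rewards):                 # discounted returns, computed backwards
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    for s, a, G in zip(states, actions, returns):
        grad_logpi = np.outer((a - W @ s) / std**2, s)   # d log N(a; W s, std) / dW
        W += alpha * G * grad_logpi                      # ascend the policy-gradient estimate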

In Q Learning, how can you ever actually get a Q value? Wouldn't Q(s,a) just go on forever?

I've been studying up on reinforcement learning, but the thing I don't understand is how a Q value is ever actually calculated. If you use the Bellman equation Q(s,a) = r + γ*max(Q(s',a')), wouldn't it just go on forever? Because Q(s',a') would need the Q value one timestep further, and that would just continue on and on. How does it end?
In reinforcement learning you normally try to find a policy (the best action to take in each state), and the learning process ends when the policy does not change anymore or the value function (representing the expected reward) has converged.
You seem to be confusing Q-learning with value iteration using the Bellman equation. Q-learning is a model-free technique where you use the obtained reward to update Q:
Q(s_t, a_t) ← Q(s_t, a_t) + α * [r_{t+1} + γ * max_a' Q(s_{t+1}, a') - Q(s_t, a_t)]
Here the direct reward r_{t+1} is the reward obtained after having taken action a_t in state s_t. α is the learning rate and should be between 0 and 1: if it is 0, no learning is done; if it is 1, only the newest reward is taken into account.
Value iteration with the Bellman equation:
V_{t+1}(s) = max_a Σ_{s'} P_a(s, s') * [R_a(s, s') + γ * V_t(s')]
Here a model P_a(s, s') is required, also written P(s'|s, a), which is the probability of going from state s to s' using action a. To check whether the value function has converged, V_{t+1} is normally compared to V_t for all states, and if the largest difference is smaller than a small value ε the policy is said to have converged:
max_s |V_{t+1}(s) - V_t(s)| < ε
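To make the contrast concrete, here is a small sketch of both (the array shapes are assumptions for illustration): the Q-learning update needs only an observed transition, while value iteration needs the full model P and sweeps all states until the ε check passes.

import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    # model-free: one observed transition (s, a, r, s_next) updates one entry
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])

def value_iteration(P, R, gamma=0.9, eps=1e-6):
    # model-based: P[a, s, s'] = P(s'|s, a), R[a, s, s'] = immediate reward
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        V_new = np.max(np.einsum('ast,ast->as', P, R + gamma * V[None, None, :]), axis=0)
        if np.max(np.abs(V_new - V)) < eps:   # the convergence check described above
            return V_new
        V = V_new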
See also:
Difference between Q-learning and Value Iteration
How do I know when a Q-learning algorithm converges?
Sutton & Barto: Reinforcement Learning: An Introduction

Regression when dep. variable is a proportion in Stata

I am doing an analysis in Stata of the determinants of census tract unemployment rates. Some of the previous literature on my topic has used straight OLS regression, and I started with this type of analysis, but it seems to me, after further reading, that a generalized linear model is better. This is especially because I am interested in presenting predicted values for the census tracts' unemployment rates based on my regression, and I would like these to be appropriately bounded (between 0% and 100% inclusive). My unemployment rates include 0s for some census tracts, so I would need to take this into account.
My questions are:
whether Stata's fracreg logit is equivalent to the program's glm with a logit link and binomial family? (I have read about using the glm version in a few places, including here, but see that fracreg is a newer command which seems to serve the same purpose.) Can I specify an equivalent to the robust option when using fracreg logit?
if using fracreg, on what basis should I decide between a fractional probit (fracreg probit) and a fractional logit (fracreg logit) regression?
a simple (probably ignorant) question of interpretation: I see that the fracreg and glm regressions mentioned above don't report an R-squared value. Is there an equivalent measure I can calculate for these regressions? My OLS R-squared values have been reasonably high, and this has been a point of reassurance for me, so I'd like to see how these models compare (though I know R-squared isn't everything!).
if using these models, are there any additional restrictions or assumptions (beyond the BLUE assumptions of OLS) that I should keep in mind? With my OLS regressions I have taken the natural log of the unemployment rate (it makes my residuals more normal, gives a higher R-squared, and has a convenient interpretation). Could I do the same with the fracreg or glm regressions above?
It's been a while since I formally studied limited dependent variables so please excuse my ignorance on these issues.
I have cross-posted this question at Statalist here.
This isn't Stata-specific, but check out Paolino's 2001 "Maximum Likelihood Estimation of Models with Beta-Distributed Dependent Variables"; at a minimum it will highlight a literature review for why OLS gives biased estimators.
Hey, follow-up: someone did make a Stata solution; check out Buckley, Jack. 2003. "Estimation of Models with Beta-Distributed Dependent Variables: A Replication and Extension of Paolino's Study." Political Analysis 11(2): 204-205.

Explain process noise terminology in Kalman Filter

I am just learning the Kalman filter. In the Kalman filter terminology, I am having some difficulty with process noise. Process noise seems to be ignored in many concrete examples (most focus on measurement noise). If someone can point me to an introductory-level link that describes process noise well with examples, that'd be great.
Let’s use a concrete scalar example for my question, given:
x_j = a*x_{j-1} + b*u_j + w_j
Let’s say x_j models the temperature within a fridge with time. It is 5 degrees and should stay that way, so we model with a = 1. If at some point t = 100, the temperature of the fridge becomes 7 degrees (ie. hot day, poor insulation), then I believe the process noise at this point is 2 degrees. So our state variable x_100 = 7 degrees, and this is the true value of the system.
Question 1:
If I then paraphrase the phrase I often see describing the Kalman filter, "we filter the signal x so that the effects of the noise w are minimized" (http://www.swarthmore.edu/NatSci/echeeve1/Ref/Kalman/ScalarKalman.html): if we minimize the effects of the 2 degrees, are we trying to get rid of the 2-degree difference? But the true state at t = 100 is x_100 = 7 degrees. What exactly are we doing to the process noise w when we Kalman filter?
Question 2:
The process noise has a variance of Q. In the simple fridge example, it seems easy to model because you know the underlying true state is 5 degrees, and you can take Q as the deviation from that state. But if the true underlying state fluctuates with time, what part of this should be considered state fluctuation vs. "process noise" when you model it? And how do we go about determining a good Q (again, an example would be nice)?
I have found that, since Q is always added to the covariance prediction no matter which time step you are at (see the covariance prediction formula at http://greg.czerniak.info/guides/kalman1/), if you select an overly large Q the Kalman filter does not seem to behave well.
Thanks.
EDIT 1: My interpretation
My interpretation of the term process noise is the difference between the actual state of the system and the state predicted from the state-transition model (i.e. a*x_{j-1}). What the Kalman filter tries to do is bring the prediction closer to the actual state. In that sense, it actually partially "incorporates" the process noise into the prediction through the residual-feedback mechanism, rather than "eliminating" it, so that it can track the actual state better. I have not read such an explanation anywhere in my search, and I would appreciate anyone commenting on this view.
In Kalman filtering the "process noise" represents the idea/feature that the state of the system changes over time, but we do not know the exact details of when/how those changes occur, and thus we need to model them as a random process.
In your refrigerator example:
The state of the system is the temperature, and we obtain measurements of the temperature at some time interval, say hourly, by looking at the thermometer dial. Note that you usually also need to represent the uncertainties involved in the measurement process in Kalman filtering, but you didn't focus on this in your question, so let's assume these errors are small.
At time t you look at the thermometer and see that it says 7 degrees; since we've assumed the measurement errors are very small, that means the true temperature is (very close to) 7 degrees. Now the question is: what is the temperature at some later time, say 15 minutes after you looked?
If we don't know if/when the condenser in the refrigerator turns on, we could have:
1. the temperature at the later time is even higher than 7 degrees (15 minutes manages to get close to the maximum temperature in a cycle),
2. lower, if the condenser is or has been running, or even
3. just about the same.
This idea that there is a distribution of possible outcomes for the real state of the system at some later time is the "process noise".
Note: my qualitative model for the refrigerator is: while the condenser is not running, the temperature goes up until it reaches a threshold a few degrees above the nominal target temperature (this threshold comes from a sensor, so there may be noise in the temperature at which the condenser turns on); the condenser then stays on until the temperature gets a few degrees below the set temperature. Also note that if someone opens the door there will be a jump in the temperature; since we don't know when someone might do this, we model it as a random process.
Yeah, I don't think that sentence is a good one. The primary purpose of a Kalman filter is to minimize the effects of observation noise, not process noise. I think the author may be conflating Kalman filtering with Kalman control (where you ARE trying to minimize the effect of process noise).
The state does not "fluctuate" over time, except through the influence of process noise.
Remember, a system does not generally have an inherent "true" state. A refrigerator is a bad example, because it's already a control system, with nonlinear properties. A flying cannonball is a better example. There is some place where it "really is", but that's not intrinsic to A. In this example, you can think of wind as a kind of "process noise". (Not a great example, since it's not white noise, but work with me here.) The wind is a 3-dimensional process noise affecting the cannonball's velocity; it does not directly affect the cannonball's position.
Now, suppose that the wind in this area always blows northwest. We should see a positive covariance between the north and west components of wind. A deviation of the cannonball's velocity northwards should make us expect to see a similar deviation to westward, and vice versa.
Think of Q more as covariance than as variance; the autocorrelation aspect of it is almost incidental.
There's a good discussion going on here. I would like to add that the concept of process noise is that whatever prediction is made based on the model has some error, and that error is represented using the Q matrix. If you look at the KF equation for predicting the covariance matrix P (which is actually the mean squared error of the predicted state), Q is simply added to it: P_predict = A*P*A' + Q. I suggest it would give good insight if you could work through the derivation of the KF equations.
If your state-transition model were exact, the process noise would be zero. In the real world, it is nearly impossible to capture the exact state transition with a mathematical model; the process noise captures that uncertainty.
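To tie the threads above together, here is a minimal scalar Kalman filter sketch for the fridge example (the values for Q, R, and the measurements are made up for illustration): note how Q is added in every covariance prediction, and how the residual feedback pulls the estimate toward the jump to 7 degrees rather than "removing" it.

a, b = 1.0, 0.0        # x_j = a*x_{j-1} + b*u_j + w_j: temperature tends to stay put
Q, R = 0.5, 1.0        # process-noise and measurement-noise variances (tuning choices)

def kalman_step(x_est, P, z, u=0.0):
    # predict: propagate the estimate through the model; Q is added at every step
    x_pred = a * x_est + b * u
    P_pred = a * P * a + Q
    # update: blend the prediction with the measurement z via the residual
    K = P_pred / (P_pred + R)            # Kalman gain
    x_new = x_pred + K * (z - x_pred)    # residual feedback
    P_new = (1 - K) * P_pred
    return x_new, P_new

x_est, P = 5.0, 1.0                      # start at the nominal 5 degrees
for z in [5.1, 4.9, 7.2, 7.0, 6.8]:      # a hot spell shows up in the measurements
    x_est, P = kalman_step(x_est, P, z)
    print(round(x_est, 2), round(P, 2))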