Why is Q-Learning Off-Policy Learning? - reinforcement-learning

Hello Stack Overflow Community!
Currently, I am following David Silver's Reinforcement Learning lectures and I am really confused by one point in his "Model-Free Control" slides.
In the slides, Q-Learning is considered off-policy learning, and I could not get the reason behind that. He also mentions that we have both a target and a behaviour policy. What is the role of the behaviour policy in Q-Learning?
When I look at the algorithm, it seems simple: update your Q(s,a) estimate using the maximum of Q(s',a') over the next actions. The slides say "we choose the next action using the behaviour policy", but here we only take the maximum.
I am so confused about the Q-Learning algorithm. Can you help me please?
Link to the slides (pages 36-38):
http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/control.pdf

Check this answer first: https://stats.stackexchange.com/a/184794
To my knowledge, the target policy is what we set as our policy; it could be epsilon-greedy or something else. For the behaviour policy, we just use a greedy policy to select the action, without even considering what our target policy is. So it estimates our Q assuming a greedy policy were being followed, despite the fact that it is not following a greedy policy.

Ideally, you want to learn the true Q-function, i.e., the one that satisfies the Bellman equation
Q(s,a) = R(s,a) + gamma*E[Q(s',a')]   for all s,a
where the expectation is over a' w.r.t. the policy.
First, we approximate the problem and get rid of the "for all", because we only have access to a few samples (especially with continuous actions, where the "for all" results in infinitely many constraints). Second, say you want to learn a deterministic policy (if there is an optimal policy, there is a deterministic optimal policy). Then the expectation disappears, but you need to collect samples somehow. This is where the "behavior" policy comes in; it is usually just a noisy version of the policy you want to optimize (the most common choices are epsilon-greedy, or adding Gaussian noise if the action is continuous).
So now you have samples collected from a behavior policy and a target policy (deterministic) which you want to optimize.
The resulting equation is
Q(s,a) = R(s,a) + gamma*Q(s',pi(s'))
The difference between the two sides is the TD error, and you want to drive it to zero given samples collected from the behavior policy, for instance by minimizing the squared error
min E[(R(s,a) + gamma*Q(s',pi(s')) - Q(s,a))^2]
where the expectation is approximated with samples (s,a,s') collected using the behavior policy.
If we consider the pseudocode of Soroush: if actions are discrete, then pi(s') = argmax_a Q(s',a), and the update rule follows the gradient of this squared TD(0) error (with the bootstrap target treated as fixed).
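To make this concrete, here is a minimal tabular Q-learning sketch (my own illustration, not the pseudocode referenced above, and assuming a discrete Gymnasium-style environment): actions are chosen with an epsilon-greedy behaviour policy, while the bootstrap target uses the greedy (target-policy) action.

```python
import numpy as np

def q_learning(env, n_episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    """Tabular Q-learning sketch (assumes a discrete Gymnasium-style env)."""
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(n_episodes):
        s, _ = env.reset()
        done = False
        while not done:
            # Behaviour policy: epsilon-greedy w.r.t. the current Q estimate.
            if np.random.rand() < eps:
                a = env.action_space.sample()
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # Target policy: greedy -> bootstrap from max_a' Q(s', a'),
            # no matter which action the behaviour policy takes next.
            td_target = r + gamma * (0.0 if terminated else np.max(Q[s_next]))
            Q[s, a] += alpha * (td_target - Q[s, a])
            s = s_next
    return Q
```

Note that the update never looks at the action the behaviour policy will actually execute next, which is exactly what makes it off-policy.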
EDIT
Just to underline the difference between on-policy and off-policy learning: SARSA is on-policy because the TD error used to update Q is
min E[(R(s,a) + gamma*Q(s',a') - Q(s,a))^2]
where a' is the action actually taken while sampling data with the behavior policy, not pi(s') (the action that the target policy would choose in state s').
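For contrast, here is a minimal SARSA sketch under the same assumptions as the Q-learning sketch above (discrete Gymnasium-style environment, names of my own choosing): the bootstrap uses the action the epsilon-greedy behaviour policy actually takes next.

```python
import numpy as np

def epsilon_greedy(Q, s, eps, action_space):
    # Same epsilon-greedy selection as in the Q-learning sketch above.
    if np.random.rand() < eps:
        return action_space.sample()
    return int(np.argmax(Q[s]))

def sarsa(env, n_episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    """Tabular SARSA sketch (on-policy counterpart of the code above)."""
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(n_episodes):
        s, _ = env.reset()
        a = epsilon_greedy(Q, s, eps, env.action_space)
        done = False
        while not done:
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            a_next = epsilon_greedy(Q, s_next, eps, env.action_space)
            # On-policy target: bootstrap from Q(s', a'), the action the
            # behaviour policy actually takes, not from max_a Q(s', a).
            td_target = r + gamma * (0.0 if terminated else Q[s_next, a_next])
            Q[s, a] += alpha * (td_target - Q[s, a])
            s, a = s_next, a_next
    return Q
```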

Soroush's answer is only correct if the roles of the target and behaviour policies are exchanged. Off-policy learning means you try to learn the optimal policy $\pi$ using trajectories sampled from another policy (or policies). This means $\pi$ is not used to generate the actions that are actually executed in the environment. Since A is the action executed by the $\epsilon$-greedy algorithm, it comes not from $\pi$ (the target policy) but from another policy (the behaviour policy, hence the name "behaviour").

Related

State definition in Reinforcement learning

When defining the state for a specific problem in reinforcement learning, how do you decide what to include and what to leave out, and how do you tell the difference between an observation and a state?
For example, assume the agent operates in a human-resources planning context where it needs to hire workers based on the demand for jobs, considering the cost of hiring them (and assuming the budget is limited). Is a state of the form (# workers, cost) a good definition of state?
In short, I don't know what information needs to be in the state and what should be left out because it is rather an observation.
Thank you
I am assuming you are formulating this as an RL problem because the demand is an unknown quantity. And, maybe [this is an optional criterion], the cost of hiring a worker may take into account the worker's contribution towards the job, which is unknown initially. If, however, both of these quantities are known or can be approximated beforehand, then you can just run a planning algorithm to solve the problem [or just some sort of optimization].
Having said this, the state in this problem could be something as simple as (#workers). Note that I'm not including the cost, because the cost must be experienced by the agent and is therefore unknown to the agent until it reaches a specific state. Depending on the problem, you might need to add another factor such as "time" or "jobs remaining".
Most theoretical results in RL hinge on the key assumption that the environment is Markovian. There are several works where you can get by without this assumption, but if you can formulate your environment in a way that exhibits this property, you will have many more tools to work with. The key idea is that the agent can decide which action to take (in your case, an action could be "hire one more person"; another could be "fire a person") based on the current state alone, say (#workers = 5, time = 6). Note that we are not distinguishing between workers yet, so the action is to fire "a" person rather than a specific person x. If the workers have differing capabilities, you may need to add several other factors, each representing which workers are currently hired and which are still in the pool waiting to be hired, e.g. a boolean array of fixed length. (I hope you get the idea of how to form a state representation; this can vary based on the specifics of the problem, which are missing in your question.)
Now, once we have the state definition S and the action definition A (hire/fire), we have the "known" quantities for an MDP setup in an RL framework. We also need an environment that can supply the cost when we query it (the reward/cost function) and tell us the outcome of taking a certain action in a certain state (the transition). Note that we don't necessarily need to know these reward/transition functions beforehand, but we should have a means of getting their values when we query a specific (state, action).
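Purely as an illustration (the class name, state fields, reward numbers, and demand model below are my own assumptions, since the question leaves the specifics open), such an environment could be sketched with the Gymnasium API like this:

```python
import gymnasium as gym
import numpy as np
from gymnasium import spaces

class HiringEnv(gym.Env):
    """Hypothetical hiring environment with state = (#workers, time step)."""

    MAX_WORKERS = 20
    HORIZON = 12

    def __init__(self):
        # State/observation: number of workers currently hired, current period.
        self.observation_space = spaces.MultiDiscrete(
            [self.MAX_WORKERS + 1, self.HORIZON + 1])
        # Actions: 0 = do nothing, 1 = hire one worker, 2 = fire one worker.
        self.action_space = spaces.Discrete(3)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.workers, self.t = 0, 0
        return np.array([self.workers, self.t]), {}

    def step(self, action):
        if action == 1:
            self.workers = min(self.workers + 1, self.MAX_WORKERS)
        elif action == 2:
            self.workers = max(self.workers - 1, 0)
        self.t += 1
        demand = int(self.np_random.poisson(5))   # unknown to the agent a priori
        # Reward: value of jobs served minus the wage bill (made-up numbers).
        reward = 10.0 * min(self.workers, demand) - 3.0 * self.workers
        terminated = self.t >= self.HORIZON
        return np.array([self.workers, self.t]), reward, terminated, False, {}
```

The agent only ever sees (#workers, t); the reward and transition are experienced through step(), matching the point above that they need not be known in advance.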
Coming to your final part, the difference between an observation and a state: there are much better resources to dig into this, but in a crude sense, an observation is an agent's (any agent: AI, human, etc.) sensory data. For example, in your case the agent has the ability to count the number of workers currently employed (but it does not have the ability to distinguish between workers).
A state, or more formally a true MDP state, must be Markovian and capture the environment at its fundamental level. So, in order to determine the true cost to the company, the agent might need to be able to differentiate between workers, the working hours of each worker, the jobs they are working on, interactions between workers, and so on. Note that many of these factors may not be relevant to your task, for example a worker's gender. Typically one would like to form a good hypothesis beforehand about which factors are relevant.
Now, even though we can agree that a worker's assignment (to a specific job) may be a relevant feature when making a decision to hire or fire them, your observation does not have this information. So you have two options: either you ignore the fact that this information is important and work with what you have available, or you try to infer these features. If your observation is incomplete for the decision making in your formulation, we typically classify the problem as a partially observable environment (and use the POMDP framework for it).
I hope I clarified a few points; however, there is a huge body of theory behind all of this, and the question you asked about coming up with a state definition is a matter of research in itself (much like feature engineering and feature selection in machine learning).

Action masking for continuous action space in reinforcement learning

Is there a way to model action masking for continuous action spaces? I want to model economic problems with reinforcement learning. These problems often have continuous action and state spaces. In addition, the state often influences what actions are possible and, thus, the allowed actions change from step to step.
Simple example:
The agent has wealth (continuous state) and decides about spending (continuous action). The next period's wealth is then wealth minus spending. But the agent is restricted by a budget constraint: it is not allowed to spend more than its wealth. What is the best way to model this?
What I tried:
For discrete actions it is possible to use action masking, so in each time step I provided the agent with information about which actions are allowed and which are not. I also tried to do this with a continuous action space by providing lower and upper bounds on the allowed actions and clipping the actions sampled from the actor network (e.g. DDPG).
I am wondering whether this is a valid thing to do (it works in a simple toy model), because I did not find any RL library that implements it. Or is there a smarter way / best practice to pass the information about allowed actions to the agent?
I think you are on the right track. I've looked into masked actions and found two possible approaches: give a negative reward when the agent tries to take an invalid action (without letting the environment evolve), or dive deeper into the neural network code and let the network output only valid actions.
I've always considered the latter approach the most efficient, and your approach of introducing boundaries seems very similar to it. So as long as boundaries are the type of mask you are looking for, I think you are good to go.
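To make the boundary idea concrete, one common trick is to let the actor output an unconstrained value and squash/rescale it into the state-dependent feasible interval, here [0, wealth]. A minimal PyTorch sketch (network layout and sizes are just placeholder assumptions):

```python
import torch
import torch.nn as nn

class BoundedActor(nn.Module):
    """Actor that maps the state to a spending level in [0, wealth]."""

    def __init__(self, state_dim=1, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, wealth):
        # Squash the raw output into (0, 1), then rescale by the current
        # wealth, so the budget constraint holds by construction.
        fraction = torch.sigmoid(self.net(state))
        return fraction * wealth

actor = BoundedActor()
state = torch.tensor([[5.0]])           # here the state is just the wealth
spending = actor(state, wealth=state)   # always <= 5.0
```

Unlike hard clipping, the rescaled action stays differentiable in the actor's output everywhere, which usually plays more nicely with actor-critic updates such as DDPG's.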

concerns regarding exploring starts given my state is not the same as my observation space in gym

My state for a custom Gym environment is not the same as my observation space. The observation is calculated from the state.
How will RL methods that require exploring starts work in this case? Or am I getting this wrong?
I imagine the algorithm sampling from my observation space, setting the state of the environment, and then trying an action. But this will not work with my environment.
From the question above you can see I'm a newbie with RL and with Gym. Which RL approach should I use in this case? How would you address such a situation?
Any tips?
My custom Gym environment now selects a random start state. Therefore, by using this environment, one can achieve "exploring starts", so I no longer need to worry that my observation is not the same as my state. For example, when implementing Monte Carlo ES for Blackjack, as described in RLbook2018, the state of my environment includes the dealer's hidden card, while an observation does not.
I was confused at the time because I wanted the algorithm itself to pick the random state and set it in the environment.
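A minimal sketch of that pattern (names and the Blackjack-like details are only illustrative): the environment keeps the full state internally, samples it at random in reset() to get exploring starts, and exposes only the derived observation.

```python
import gymnasium as gym
import numpy as np
from gymnasium import spaces

class HiddenCardEnv(gym.Env):
    """Illustrative env whose observation omits part of the internal state."""

    def __init__(self):
        # Observation: (player sum, dealer's visible card). The dealer's
        # hidden card is part of the state but not of the observation.
        self.observation_space = spaces.MultiDiscrete([32, 11])
        self.action_space = spaces.Discrete(2)   # 0 = stick, 1 = hit

    def _obs(self):
        return np.array([self.player_sum, self.dealer_visible])

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        # Exploring starts: sample the *full* internal state at random.
        self.player_sum = int(self.np_random.integers(12, 22))
        self.dealer_visible = int(self.np_random.integers(1, 11))
        self.dealer_hidden = int(self.np_random.integers(1, 11))  # never exposed
        return self._obs(), {}

    def step(self, action):
        # ...game dynamics would go here and may read self.dealer_hidden...
        terminated, reward = True, 0.0
        return self._obs(), reward, terminated, False, {}
```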
PS,
If you need to save states of previous "alternative realities", search SO or Google for wrappers, and how they do that for MCTS (Monte Carlo Tree Search).

Is this example of off policy correct?

I am reading Sutton and Barto and want to make sure I am clear.
For off-policy learning, can we think of a robot's policy for a particular terrain, say walking on sand, as the target policy, and use the robot's policy for walking on snow as the behaviour policy? That is, we use our experience of walking on snow to approximate the optimal policy for walking on sand?
Your example works, but I think that it's a bit restrictive. In an off-policy method the behavioural policy is just a function that is used to explore state-action space while another function (the target, as you say) is being optimized. This means that as long as the behaviour function is defined on the same domain as the target policy, it doesn't really matter whether it's a random process or the result of previous learning (e.g. your robot's previously learned policy for walking on snow). It explores the state-action space, so it meets the definition. Whether it does it well or not is a different story.

Interesting NLP/machine-learning style project -- analyzing privacy policies

I wanted some input on an interesting problem I've been assigned. The task is to analyze hundreds, and eventually thousands, of privacy policies and identify their core characteristics. For example: do they collect the user's location? Do they share/sell data with third parties? And so on.
I've talked to a few people, read a lot about privacy policies, and thought about this myself. Here is my current plan of attack:
First, read a lot of privacy policies and find the major "cues" or indicators that a certain characteristic is met. For example, if hundreds of privacy policies contain the same line, "We will take your location.", that line could be a cue with 100% confidence that the privacy policy involves collecting the user's location. Other cues would give much smaller degrees of confidence about a certain characteristic. For example, the presence of the word "location" might increase the likelihood that the user's location is stored by 25%.
The idea would be to keep developing these cues, and their appropriate confidence intervals to the point where I could categorize all privacy policies with a high degree of confidence. An analogy here could be made to email-spam catching systems that use Bayesian filters to identify which mail is likely commercial and unsolicited.
I wanted to ask whether you guys think this is a good approach to this problem. How exactly would you approach a problem like this? Furthermore, are there any specific tools or frameworks you'd recommend using? Any input is welcome. This is my first time doing a project which touches on artificial intelligence, specifically machine learning and NLP.
This is text classification. Given that you have multiple output categories per document, it's actually multilabel classification. The standard approach is to manually label a set of documents with the classes/labels that you want to predict, then train a classifier on features of the documents; typically word or n-gram occurrences or counts, possibly weighted by tf-idf.
The popular learning algorithms for document classification include naive Bayes and linear SVMs, though other classifier learners may work too. Any classifier can be extended to a multilabel one by the one-vs.-rest (OvR) construction.
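A minimal scikit-learn sketch of that pipeline (the example texts and label names are invented; in practice you would use your manually labelled policies):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Toy data: each policy is annotated with the characteristics it exhibits.
docs = [
    "We collect your location and share data with third parties.",
    "We never sell or share your personal information.",
]
labels = [{"location", "third_party_sharing"}, set()]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)            # one binary column per label

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),      # word + bigram features, tf-idf weighted
    OneVsRestClassifier(LinearSVC()),         # one binary SVM per label
)
clf.fit(docs, Y)

predicted = clf.predict(["Your location may be stored on our servers."])
print(mlb.inverse_transform(predicted))       # labels assigned to the new policy
```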
A very interesting problem indeed!
On a higher level, what you want is summarization: a document has to be reduced to a few key phrases. This is far from being a solved problem. A simple approach would be to search for keywords as opposed to key phrases. You can try something like LDA for topic modelling to find out what each document is about. You can then search for topics which are present in all documents; I suspect what will come up is stuff to do with licenses, location, copyright, etc. MALLET has an easy-to-use implementation of LDA.
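If you prefer to stay in Python rather than use MALLET, the same kind of LDA exploration can be sketched with gensim (a different library from the one mentioned above; the tiny corpus and parameters here are just placeholders):

```python
from gensim import corpora
from gensim.models import LdaModel

# Placeholder corpus: in practice, tokenize each full privacy policy.
texts = [
    ["collect", "location", "share", "third", "parties"],
    ["store", "personal", "information", "location"],
    ["license", "copyright", "terms", "use"],
]

dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(text) for text in texts]

lda = LdaModel(bow_corpus, num_topics=2, id2word=dictionary, passes=10)
for topic_id, words in lda.print_topics():
    print(topic_id, words)   # inspect which themes the policies share
```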
I would approach this as a machine learning problem where you are trying to classify things in multiple ways, i.e. wants location, wants SSN, etc.
You'll need to enumerate the characteristics you want to use (location, SSN), and then for each document say whether that document uses that information or not. Choose your features, train on your data, and then classify and test.
I think simple features like words and n-grams would probably get you pretty far, and a dictionary of words related to things like SSN or location would finish it off nicely.
Use the machine learning algorithm of your choice; Naive Bayes is very easy to implement and use and would work OK as a first stab at the problem.
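As a sketch of that first stab (the example texts and labels below are invented), a bag-of-words Naive Bayes classifier for a single characteristic such as "collects location" could look like this in scikit-learn:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data for one binary characteristic: does the policy collect location?
docs = [
    "We will take your location.",
    "Location data may be shared with our partners.",
    "We only store your email address.",
    "No personal data is collected by this service.",
]
collects_location = [1, 1, 0, 0]

clf = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),   # word and bigram counts
    MultinomialNB(),                       # simple, fast baseline
)
clf.fit(docs, collects_location)

print(clf.predict(["Your location will be recorded."]))  # predicted label for a new policy
```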