Is this example of off policy correct? - reinforcement-learning

I am reading Sutton and Barto and want to make sure I am clear.
For Off Policy learning can we think of a robot in a particular terrain - say on sand - as the target policy but use the robot's policy for walking in snow as the behaviour policy? We are using our experience of walking on snow to approximate the optimal policy for walking on sand?

Your example works, but I think that it's a bit restrictive. In an off-policy method the behavioral policy is just a function that is used to explore state-action space while another function (the target, as you say) is being optimized. This means that as long as the behavior function is defined on the same domain as the target policy, it doesn't really matter whether it's a random process or whether it is the result of previous learning (e.g. your robot that walks on sand). It explores the state-action space, so it meets the definition. Whether it does it well or not is a different story.
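As a minimal sketch of that separation (my own toy example, not from Sutton and Barto): the behaviour policy below is uniformly random, yet the Q-learning update still recovers the greedy target policy, because all the behaviour policy has to do is keep visiting the state-action space.

import random

# Toy 2-state MDP: action 1 always gives reward 1, action 0 gives reward 0,
# and the next state cycles regardless of the action taken.
n_states, n_actions = 2, 2
Q = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma = 0.1, 0.9

s = 0
for _ in range(5000):
    a = random.randrange(n_actions)                  # behaviour policy: pure random exploration
    r = 1.0 if a == 1 else 0.0
    s_next = (s + 1) % n_states
    # Off-policy update: bootstrap with the greedy (target) action in s_next.
    Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
    s = s_next

target_policy = [max(range(n_actions), key=lambda a: Q[st][a]) for st in range(n_states)]
print(target_policy)   # [1, 1]: the target policy is greedy w.r.t. Q, not the random behaviour

How well the behaviour policy explores (here, purely at random) only affects how quickly the target estimate improves, which is the "whether it does it well or not" point above.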


What is differentiable programming?

Native support for differentiable programming has been added to Swift for the Swift for TensorFlow project, and Julia has similar support via Zygote.
What exactly is differentiable programming?
What does it enable? Wikipedia says
the programs can be differentiated throughout
but what does that mean?
How would one use it (e.g. a simple example)?
And how does it relate to automatic differentiation (the two seem conflated a lot of the time)?
I like to think about this question in terms of user-facing features (differentiable programming) vs implementation details (automatic differentiation).
From a user's perspective:
"Differentiable programming" is APIs for differentiation. An example is a def gradient(f) higher-order function for computing the gradient of f. These APIs may be first-class language features, or implemented in and provided by libraries.
"Automatic differentiation" is an implementation detail for automatically computing derivative functions. There are many techniques (e.g. source code transformation, operator overloading) and multiple modes (e.g. forward-mode, reverse-mode).
Explained in code:
def f(x):
    return x * x * x

df = gradient(f)   # `gradient` is the higher-order differentiation API described above
print(df(4))       # 48.0
# Using the `gradient` API:
# ▶ differentiable programming.
# How `gradient` works to compute the gradient of `f`:
# ▶ automatic differentiation.
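For a concrete, runnable version of the same sketch, here is my own illustration using JAX's jax.grad, which is one real implementation of such a gradient API (the function f is the same as above):

import jax

def f(x):
    return x * x * x

df = jax.grad(f)   # differentiable programming: ask for the derivative as a function
print(df(4.0))     # 48.0, computed via automatic differentiation (reverse mode) under the hood

Here jax.grad plays the role of the gradient API, and JAX's tracing machinery is the automatic-differentiation implementation detail behind it.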
I had never heard the term "differentiable programming" before reading your question, but having used the concepts noted in your references, both writing code that computes derivatives with symbolic differentiation and with automatic differentiation, and having written interpreters and compilers, to me this just means that they have made it easier to calculate the numeric value of the derivative of a function. I don't know if they made it a first-class citizen, but the new way doesn't require the use of a function/method call; it is done with syntax, and the compiler/interpreter hides the translation into calls.
If you look at the Zygote example, it clearly shows the use of prime notation:
julia> f(10), f'(10)
Most seasoned programmers would have guessed what I just noted even without a research paper explaining it. In other words, it is just that obvious.
Another way to think about it: if you have ever tried to calculate a derivative in a programming language, you know how hard it can be at times, and you have probably asked yourself why they (the language designers and programmers) don't just add it to the language. In these cases they did.
What surprises me is how long it took before derivatives became available via syntax instead of calls, but if you have ever worked with scientific code or coded neural networks at that level, then you will understand why this is a concept that is being touted as something of value.
Also I would not view this as another programming paradigm, but I am sure it will be added to the list.
How does it relate to automatic differentiation (the two seem conflated a lot of the time)?
In both cases that you referenced, automatic differentiation is used to calculate the derivative rather than symbolic differentiation. I do not view differentiable programming and automatic differentiation as two distinct sets; rather, differentiable programming needs a means of being implemented, and the way they chose is automatic differentiation. They could have chosen symbolic differentiation or some other means.
It seems you are trying to read more into what differentiable programming is than it really is. It is not a new way of programming, but just a nice feature added for computing derivatives.
Perhaps if they had named it differentiable syntax it would have been clearer. The use of the word programming gives it more panache than I think it deserves.
EDIT
After skimming the Swift Differentiable Programming Mega-Proposal and trying to compare that with the Julia example using Zygote, I would have to split the answer into parts that talk about Zygote and then switch gears to talk about Swift. They each took a different path, but the commonality and bottom line is that the languages know something about differentiation, which makes the job of coding derivatives easier and hopefully produces fewer errors.
About the Wikipedia quote that
the programs can be differentiated throughout
At first reading it seems like nonsense, or at least it lacks enough detail to understand it in context, which is why I am sure you asked.
After many years of digging into what others are trying to communicate, one learns to take a source with a grain of salt unless it has been peer reviewed, and to just ignore it unless it is absolutely necessary to understand. In this case, if you ignore that sentence, most of your reference makes sense. However, I take it that you want an answer, so let's try to figure out what it means.
The key word that has me perplexed is throughout. Since you note the statement came from Wikipedia, and Wikipedia gives three references for the statement, a search shows that the word throughout appears in only one of them:
∂P: A Differentiable Programming System to Bridge Machine Learning and Scientific Computing
Thus, since our ∂P system does not require primitives to handle new
types, this means that almost all functions and types defined
throughout the language are automatically supported by Zygote, and
users can easily accelerate specific functions as they deem necessary.
So my take is that by going back to the source, i.e. the paper, you can better understand how the statement percolated up into Wikipedia, but it seems the meaning was lost along the way.
If you really want to know the meaning of that statement, you should raise it on the Wikipedia talk page and ask the author of the statement directly.
Also note that the paper referenced is not peer reviewed, so its statements may not carry much weight amongst peers at present. As I said, I would just ignore it and get on with writing wonderful code.
You can guess its definition from the applications of differentiability.
It is used for optimization, i.e. to calculate a minimum or maximum value.
Many of these problems can be solved by finding an appropriate function and then using calculus techniques to find the required maximum or minimum value.

Why is Q-Learning Off-Policy Learning?

Hello Stack Overflow Community!
Currently, I am following the Reinforcement Learning lectures of David Silver and I am really confused by one point in his "Model-Free Control" slides.
In the slides, Q-Learning is considered off-policy learning, and I could not get the reason behind that. He also mentions that we have both target and behaviour policies. What is the role of the behaviour policy in Q-Learning?
When I look at the algorithm, it looks simple: update your Q(s,a) estimate using the maximum Q(s',a') value. In the slides it is said that "we choose the next action using the behaviour policy", but here we only ever choose the maximum one.
I am so confused about the Q-Learning algorithm. Can you help me please?
Link to the slides (pages 36-38):
http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/control.pdf
Check this answer first: https://stats.stackexchange.com/a/184794
According to my understanding, the behaviour policy is the policy we actually follow to generate actions; it could be epsilon-greedy or something else. The target policy, on the other hand, is the greedy policy used in the update, which is applied without considering the behaviour policy. So Q-learning estimates Q assuming a greedy policy is followed, despite the fact that the agent is not actually following a greedy policy.
Ideally, you want to learn the true Q-function, i.e., the one that satisfies the Bellman equation
Q(s,a) = R(s,a) + gamma*E[Q(s',a')] forall s,a
where the expectation is over a' w.r.t the policy.
First, we approximate the problem and get rid of the "forall", because we have access to only a few samples (especially with continuous actions, where the "forall" results in infinitely many constraints). Second, say you want to learn a deterministic policy (if there is an optimal policy, there is a deterministic optimal policy). Then the expectation disappears, but you need to collect samples somehow. This is where the "behavior" policy comes in, and it usually is just a noisy version of the policy you want to optimize (the most common choices are e-greedy exploration, or adding Gaussian noise if the action is continuous).
So now you have samples collected from a behavior policy and a target policy (deterministic) which you want to optimize.
The resulting equation is
Q(s,a) = R(s,a) + gamma*Q(s',pi(s'))
The difference between the two sides is the TD error, and you want to minimize it given samples collected from the behavior policy:
min E[R(s,a) + gamma*Q(s',pi(s')) - Q(s,a)]
where the expectation is approximated with samples (s,a,s') collected using the behavior policy.
If we consider the pseudocode of Soroush, if actions are discrete, then pi(s') = argmax_A Q(s',A) and the update rule is the derivative of the TD(0) error.
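In code, the discrete-action case above looks like the following sketch (my own; Q is a plain table indexed by state and action, and the transition (s, a, r, s2) can come from any behaviour policy such as e-greedy):

# One tabular Q-learning update for a transition (s, a, r, s2).
def q_update(Q, s, a, r, s2, alpha=0.1, gamma=0.99):
    td_target = r + gamma * max(Q[s2])   # bootstrap with the greedy (target) policy's action
    td_error = td_target - Q[s][a]
    Q[s][a] += alpha * td_error          # gradient-style step that shrinks the TD(0) error
    return td_error

Q = [[0.0, 0.0], [0.0, 0.0]]
print(q_update(Q, 0, 1, 1.0, 1))   # 1.0 on the first update; Q[0][1] becomes 0.1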
These are some good easy reads to learn more about TD: 1, 2, 3, 4.
EDIT
Just to underline the difference between on- and off-policy: SARSA is on-policy, because the TD error used to update the policy is
min E[R(s,a) + gamma*Q(s',a') - Q(s,a)]
a' is the action collected while sampling data using the behavior policy, and it's not pi(s') (the action that the target policy would choose in state s').
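Written as code in the same tabular style (again my own sketch), the only difference is which action appears inside the bootstrap term; a2 is the action actually executed at s2 by the behaviour policy:

# SARSA (on-policy): bootstrap with the action a2 the behaviour policy actually took.
def sarsa_target(Q, r, s2, a2, gamma=0.99):
    return r + gamma * Q[s2][a2]

# Q-learning (off-policy): bootstrap with the greedy target policy's action, max_a Q[s2][a].
def q_learning_target(Q, r, s2, gamma=0.99):
    return r + gamma * max(Q[s2])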
@Soroush's answer is only right if the red text is exchanged. Off-policy learning means you try to learn the optimal policy $\pi$ using trajectories sampled from another policy or policies. This means $\pi$ is not used to generate the actual actions that are being executed in the environment. Since A is the executed action from the $\epsilon$-greedy algorithm, it is not from $\pi$ (the target policy) but from another policy (the behavior policy, hence the name "behavior").

Interesting NLP/machine-learning style project -- analyzing privacy policies

I wanted some input on an interesting problem I've been assigned. The task is to analyze hundreds, and eventually thousands, of privacy policies and identify core characteristics of them. For example: do they collect the user's location? Do they share or sell data with third parties? And so on.
I've talked to a few people, read a lot about privacy policies, and thought about this myself. Here is my current plan of attack:
First, read a lot of privacy policies and find the major "cues" or indicators that a certain characteristic is met. For example, if hundreds of privacy policies contain the same line, "We will take your location.", that line could be a cue with 100% confidence that the policy involves taking the user's location. Other cues would give much smaller degrees of confidence about a certain characteristic. For example, the presence of the word "location" might increase the likelihood that the user's location is stored by 25%.
The idea would be to keep developing these cues, and their appropriate confidence intervals to the point where I could categorize all privacy policies with a high degree of confidence. An analogy here could be made to email-spam catching systems that use Bayesian filters to identify which mail is likely commercial and unsolicited.
I wanted to ask whether you guys think this is a good approach to this problem. How exactly would you approach a problem like this? Furthermore, are there any specific tools or frameworks you'd recommend using. Any input is welcome. This is my first time doing a project which touches on artificial intelligence, specifically machine learning and NLP.
The idea would be to keep developing these cues, and their appropriate confidence intervals to the point where I could categorize all privacy policies with a high degree of confidence. An analogy here could be made to email-spam catching systems that use Bayesian filters to identify which mail is likely commercial and unsolicited.
This is text classification. Given that you have multiple output categories per document, it's actually multilabel classification. The standard approach is to manually label a set of documents with the classes/labels that you want to predict, then train a classifier on features of the documents; typically word or n-gram occurrences or counts, possibly weighted by tf-idf.
The popular learning algorithms for document classification include naive Bayes and linear SVMs, though other classifier learners may work too. Any classifier can be extended to a multilabel one by the one-vs.-rest (OvR) construction.
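As a minimal sketch of that standard pipeline (my own toy data; the label names "location" and "third_party" and the example sentences are made up), using scikit-learn's tf-idf features and a one-vs-rest linear SVM:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Toy manually labelled policies; in practice you would label hundreds of documents.
docs = [
    "We collect your location to improve the service.",
    "We share your data with third parties for advertising.",
    "We collect your location and share it with third parties.",
    "We do not collect personal information.",
]
labels = [["location"], ["third_party"], ["location", "third_party"], []]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)                 # one binary column per label

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),      # word and bigram tf-idf features
    OneVsRestClassifier(LinearSVC()),         # one binary SVM per label
)
clf.fit(docs, Y)

pred = clf.predict(["Your location may be shared with third parties."])
print(mlb.inverse_transform(pred))            # e.g. [('location', 'third_party')]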
A very interesting problem indeed!
On a higher level, what you want is summarization: a document has to be reduced to a few key phrases. This is far from being a solved problem. A simple approach would be to search for keywords as opposed to key phrases. You can try something like LDA for topic modelling to find what each document is about. You can then search for topics which are present in all documents; I suspect what will come up is stuff to do with licenses, location, copyright, etc. MALLET has an easy-to-use implementation of LDA.
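MALLET is a Java toolkit; as a rough Python sketch of the same LDA idea (my own toy documents, not a recommendation over MALLET), scikit-learn's LatentDirichletAllocation can be used like this:

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "we collect your location and device location data",
    "we share information with third party advertising partners",
    "location data may be shared with partners",
]

counts = CountVectorizer(stop_words="english")
X = counts.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

# Show the top words per topic to see what the documents are "about".
terms = counts.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[-4:][::-1]]
    print(f"topic {k}: {top}")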
I would approach this as a machine learning problem where you are trying to classify documents in multiple ways, i.e. wants location, wants SSN, etc.
You'll need to enumerate the characteristics you want to use (location, SSN), and then for each document say whether that document uses that info or not. Choose your features, train on your data, and then classify and test.
I think simple features like words and n-grams would probably get you pretty far, and a dictionary of words related to things like SSN or location would finish it nicely.
Use the machine learning algorithm of your choice; Naive Bayes is very easy to implement and use and would work OK as a first stab at the problem.
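A quick sketch of that combination (my own toy labels and keyword dictionary; the dictionary is passed to the vectorizer as its vocabulary):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = [
    "We collect your location to provide the service.",
    "We never ask for your ssn or sell data.",
    "Your location may be stored on our servers.",
    "Payment details are processed by a third party.",
]
wants_location = [1, 0, 1, 0]                  # toy labels for one characteristic

# Hand-built dictionary of giveaway terms, used as the feature vocabulary.
keywords = ["location", "gps", "geolocation", "ssn", "third party"]

model = make_pipeline(
    CountVectorizer(vocabulary=keywords, ngram_range=(1, 2)),
    MultinomialNB(),
)
model.fit(docs, wants_location)
print(model.predict(["We use GPS to determine your location."]))   # [1]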

Determine how fine-grained an interface should be?

The more detail I put in an interface, the less reusable it is. On the other hand the less detail the more ethereal and useless it seems to become. Is there a standard set of recommendations about how to weigh this for various situations?
I'm a big fan of SOLID principles. The "I" in SOLID leads me to believe that clients shouldn't be forced to implement interfaces they do not need or use. In other words, if you have an abstract class or an interface, then the implementer should not be forced to implement parts that they don't care about.
Ray Houston wrote a good article on it (looking at the Membership Provider) here.
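To make the "I" concrete, here is a minimal Python sketch (my own illustration, with made-up UserReader/UserWriter names, not taken from Ray Houston's article): the fat interface is split so that a read-only implementer is never forced to stub out methods it doesn't care about.

from abc import ABC, abstractmethod

class UserReader(ABC):
    @abstractmethod
    def get_user(self, user_id: str): ...

class UserWriter(ABC):
    @abstractmethod
    def create_user(self, name: str): ...

class ReadOnlyDirectory(UserReader):          # only opts into the reading role
    def get_user(self, user_id: str):
        return {"id": user_id, "name": "example"}

class FullDirectory(UserReader, UserWriter):  # opts into both roles explicitly
    def __init__(self):
        self._users = {}
    def get_user(self, user_id: str):
        return self._users.get(user_id)
    def create_user(self, name: str):
        self._users[name] = {"name": name}

Clients that only read depend on UserReader alone, which keeps each interface small and reusable without becoming so abstract that it says nothing.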
I have just co-authored a paper on granularity (size) of components and one of our conclusions is that there is no simple way to determine "what's right". So no, there is no standard set of recommendations.
I can give you a couple of academic references on the subject just in case you're interested:
Genero, M., Piattini, M., Calero, C. (eds.): Metrics for Software Conceptual Models. Imperial College Press, London, UK (2005)
Shekhovtsov, V.A.: On Conceptualization of Quality. Paper presented at the Dagstuhl Seminar on Conceptual Modelling, April 27-30, 2008 (preprint on the conference website)
Consider the Human genome as a class.
Each instance (cell object) has available to it all the functions of the genome.
(Although not all cell objects have access to all functions; except perhaps stem cells).
I'm bringing up this point because I have seen many instances of single classes trying to perform many functions, instead of having multiple classes, each performing a single function.
This is equivalent to a grain of sand having the instructions encoded in it to build a castle. Evolution has had the benefit of billions of years to work out the bugs. Engineers just don't have the capacity or the time to do this.

Do formal methods of program verification have a place in industry?

I got a glimpse of Hoare Logic in college. What we did was really simple. Most of what I did was proving the correctness of simple programs consisting of while loops, if statements, and sequences of instructions, but nothing more. These methods seem very useful!
Are formal methods used in industry widely?
Are these methods used to prove mission-critical software?
Well, Sir Tony Hoare joined Microsoft Research about 10 years ago, and one of the things he started was a formal verification of the Windows NT kernel. Indeed, this was one of the reasons for the long delay of Windows Vista: starting with Vista, large parts of the kernel are actually formally verified with respect to certain properties, like absence of deadlocks and absence of information leaks.
This is certainly not typical, but it is probably the single most important application of formal program verification, in terms of its impact (after all, almost every human being is in some way, shape or form affected by a computer running Windows).
This is a question close to my heart (I'm a researcher in Software Verification using formal logics), so you'll probably not be surprised when I say I think these techniques have a useful place, and are not yet used enough in the industry.
There are many levels of "formal methods", so I'll assume you mean those resting on a rigorous mathematical basis (as opposed to, say, following some 6-Sigma style process). Some types of formal methods have had great success - type systems being one example. Static analysis tools based on data flow analysis are also popular, model checking is almost ubiquitous in hardware design, and computational models like Pi-Calculus and CCS seem to be inspiring some real change in practical language design for concurrency. Termination analysis is one that's had a lot of press recently - the SDV project at Microsoft and work by Byron Cook are recent examples of research/practice crossover in formal methods.
Hoare Reasoning has not, so far, made great inroads in the industry - this is for more reasons than I can list, but I suspect is mostly around the complexity of writing then proving specifications for real programs (they tend to get big, and fail to express properties of many real world environments). Various sub-fields in this type of reasoning are now making big inroads into these problems - Separation Logic being one.
This is partially the nature of ongoing (hard) research. But I must confess that we, as theorists, have entirely failed to educate the industry on why our techniques are useful, to keep them relevant to industry needs, and to make them approachable to software developers. At some level, that's not our problem - we're researchers, often mathematicians, and practical usage is not foremost in our minds. Also, the techniques being developed are often too embryonic for use in large scale systems - we work on small programs, on simplified systems, get the math working, and move on. I don't much buy these excuses though - we should be more active in pushing our ideas, and getting a feedback loop between the industry and our work (one of the main reasons I went back to research).
It's probably a good idea for me to resurrect my weblog, and make some more posts on this stuff...
I cannot comment much on mission-critical software, although I know that the avionics industry uses a wide variety of techniques to validate software, including Hoare-style methods.
Formal methods have suffered because early advocates like Edsger Dijkstra insisted that they ought to be used everywhere. Neither the formalisms nor the software support were up to the job. More sensible advocates believe that these methods should be used on problems that are hard. They are not widely used in industry, but adoption is increasing. Probably the greatest inroads have been in the use of formal methods to check safety properties of software. Some of my favorite examples are the SPIN model checker and George Necula's proof-carrying code.
Moving away from practice and into research, Microsoft's Singularity operating-system project is about using formal methods to provide safety guarantees that ordinarily require hardware support. This in turn leads to faster performance and stronger guarantees. For example, in Singularity they have proved that if a third-party device driver is allowed into the system (which means basic verification conditions have been proved), then it cannot possibly bring down the whole OS; the worst it can do is hose its own device.
Formal methods are not yet widely used in industry, but they are more widely used than they were 20 years ago, and 20 years from now they will be more widely used still. So you are future-proofed :-)
Yes, they are used, but not widely in all areas. There are more methods than just Hoare logic; some are used more, some less, depending on suitability for a given task. The common problem is that software is biiiiiiig and verifying that all of it is correct is still too hard a problem.
For example, the theorem prover ACL2 (software that aids humans in proving program correctness) has been used to prove that a certain floating-point processing unit does not have a certain type of bug. It was a big task, so this technique is not too common.
Model checking, another kind of formal verification, is used rather widely nowadays, for example Microsoft provides a type of model checker in the driver development kit and it can be used to verify the driver for a set of common bugs. Model checkers are also often used in verifying hardware circuits.
Rigorous testing can also be thought of as formal verification - there are formal specifications of which paths of the program should be tested, and so on.
"Are formal methods used in industry?"
Yes.
The assert statement in many programming languages is related to formal methods for verifying a program.
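As a tiny illustration (my own example) of that connection: an assert is a run-time check of the kind of precondition and postcondition a Hoare triple states.

def integer_sqrt(n: int) -> int:
    assert n >= 0, "precondition: n must be non-negative"
    r = int(n ** 0.5)
    while (r + 1) * (r + 1) <= n:
        r += 1
    assert r * r <= n < (r + 1) * (r + 1), "postcondition: r is the floor of sqrt(n)"
    return r

print(integer_sqrt(10))   # 3

Formal verification proves such conditions hold for every input; the assert only checks them for the inputs you actually run.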
"Are formal methods used in industry widely ?"
No.
"Are these methods used to prove mission-critical software ?"
Sometimes. More often, they're used to prove that the software is secure. More formally, they're used to prove certain security-related assertions about the software.
There are two different approaches to formal methods in the industry.
One approach is to change the development process completely. The Z notation and the B method that were mentioned are in this first category. B was applied to the development of the driverless subway line 14 in Paris (if you get a chance, climb in the front wagon. It's not often that you get a chance to see the rails in front of you).
Another, more incremental, approach is to preserve the existing development and verification processes and to replace only one of the verification tasks at a time by a new method. This is very attractive, but it means developing static analysis tools for existing languages already in use, which are often not easy to analyse (because they were not designed to be).
If you go to (for instance)
http://dblp.uni-trier.de/db/indices/a-tree/d/Delmas:David.html
(sorry, only one hyperlink allowed for new users :( )
you will find instances of practical applications of formal methods to the verification of C programs (with static analyzers Astrée, Caveat, Fluctuat, Frama-C) and binary code (with tools from AbsInt GmbH).
By the way, since you mentioned Hoare Logic, in the above list of tools, only Caveat is based on Hoare logic (and Frama-C has a Hoare logic plug-in). The others rely on abstract interpretation, a different technique with a more automatic approach.
My area of expertise is the use of formal methods for static code analysis to show that software is free of run-time errors. This is implemented using a formal methods technique known as "abstract interpretation". The technique essentially enables you to prove certain attributes of a software program, e.g. prove that a+b will not overflow or that x/(x-y) will not result in a divide by zero. An example of a static analysis tool that uses this technique is Polyspace.
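To make the idea concrete, here is a toy sketch of the interval domain that abstract interpretation commonly uses (my own illustration, not how Polyspace works internally): each variable is tracked as a range, and a division is flagged as potentially unsafe only if the divisor's range contains zero.

# Toy interval-domain analysis: prove x/(x - y) cannot divide by zero
# when x is known to be in [5, 10] and y in [1, 3].
class Interval:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
    def __sub__(self, other):
        return Interval(self.lo - other.hi, self.hi - other.lo)
    def contains_zero(self):
        return self.lo <= 0 <= self.hi

x = Interval(5, 10)
y = Interval(1, 3)
divisor = x - y                        # abstractly: [2, 9]
print(divisor.lo, divisor.hi)          # 2 9
print(divisor.contains_zero())         # False, so the division is provably safe here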
With respect to your question: "Are formal methods used in industry widely?" and "Are these methods used to prove mission-critical software?"
The answer is yes. This opinion is based on my experience and supporting the Polyspace tool for industries that rely on the use of embedded software to control safety critical systems such as electronic throttle in an automobile, braking system for a train, jet engine controller, drug delivery infusion pump, etc. These industries do indeed use these types of formal methods tools.
I don't believe 100% of these industry segments are using these tools, but the use is increasing. My opinion is that the Aerospace and Automotive industries lead, with the Medical Device industry quickly ramping up use.
Polyspace is a (hideously expensive, but very good) commercial product based on program verification. It's fairly pragmatic, in that it scales up from 'enhanced unit testing that will probably find some bugs' to 'the next three years of your life will be spent showing these 10 files have zero defects'.
It is based more on negative verification ('this program won't corrupt your stack') than on positive verification ('this program will do precisely what these 50 pages of equations say it will').
To add to Jorg's answer, here's an interview with Tony Hoare. The tools Jorg's referring to, I think, are PREfast and PREfix. See here for more information.
Besides other, more procedural approaches, Hoare logic was the basis of Design by Contract, introduced as an object-oriented technique by Bertrand Meyer in Eiffel (see Meyer's article of 1992, page 4). While Design by Contract is not the same as formal verification methods (for one thing, DbC doesn't prove anything until the software is executed), in my opinion it provides a more practical use.
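As a minimal sketch of the contract idea in Python (my own illustration; Eiffel expresses this with built-in require/ensure clauses rather than a decorator):

# Run-time checked pre- and postconditions, in the spirit of Hoare triples.
def contract(pre, post):
    def decorate(fn):
        def wrapper(*args):
            assert pre(*args), "precondition violated"
            result = fn(*args)
            assert post(result, *args), "postcondition violated"
            return result
        return wrapper
    return decorate

@contract(pre=lambda x, y: y != 0,
          post=lambda result, x, y: abs(result * y - x) < 1e-9 * max(1.0, abs(x)))
def divide(x, y):
    return x / y

print(divide(10.0, 4.0))   # 2.5, both contracts hold
# divide(1.0, 0.0)         # would raise AssertionError: precondition violated

Unlike Hoare-style proofs, these checks only fire on the inputs you actually execute, which is the practical trade-off described above.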