Monte Carlo pseudocode by Richard S. Sutton and Andrew G. Barto - reinforcement-learning

I'm a reading a book, "Reinforcement Learning" by S. Sutton and Andrew G. Barto
Sutton's Book explains:
"For Monte Carlo policy iteration it is natural to alternate between evaluation and improvement on an episode-by-episode basis. After each episode, the observed returns are used for policy evaluation, and then the policy is improved at all the states visited in the episode. A complete simple algorithm along these lines, which we call Monte Carlo ES, for Monte Carlo with Exploring Starts, is given in pseudo code in the box on the next page"
However, in this pseudo code, it seems last two lines, which correspond to policy evaluation and policy improvement are done for each step of episode. According to Sutton's explanation, should't last two lines be moved 2 indentations to left so that policy evaluation and policy improvement are performed for each episode?

Related

In Diffusion Models literature, what do they mean by "the reversal of the diffusion process has the identical functional form as the forward process"?

I am trying to understand Diffusion Probabilistic Models, in particular the paper by (Sohl-Dickstein et al. 2015 https://arxiv.org/abs/1503.03585). When defining the backward diffusion model, they take the conditional probability of one step de-noising to be normally distributed:
screenshot from paper
They argue that this can be done because the reversal of the diffusion process has the identical functional form as the forward diffusion. What do they mean by this? Does someone have a good reference where this is explained in more detail?
Many thanks for your help!
I tried looking at the reference given in the paper:
Feller, W. On the theory of stochastic processes, with par- ticular reference to applications. In Proceedings of the [First] Berkeley Symposium on Mathematical Statistics and Probability. The Regents of the University of Cali- fornia, 1949.
However, I am struggling to find a satisfactory answer within that particular reference.
It's means that the reversal of the diffusion process can be modeled as a Gaussian (binomial) distribution

Difference between Evolutionary Strategies and Reinforcement Learning?

I am learning about the approach employed in Reinforcement Learning for robotics and I came across the concept of Evolutionary Strategies. But I couldn't understand how RL and ES are different. Can anyone please explain?
To my understanding, I know of two main ones.
1) Reinforcement learning uses the concept of one agent, and the agent learns by interacting with the environment in different ways. In evolutionary algorithms, they usually start with many "agents" and only the "strong ones survive" (the agents with characteristics that yield the lowest loss).
2) Reinforcement learning agent(s) learns both positive and negative actions, but evolutionary algorithms only learns the optimal, and the negative or suboptimal solution information are discarded and lost.
Example
You want to build an algorithm to regulate the temperature in the room.
The room is 15 °C, and you want it to be 23 °C.
Using Reinforcement learning, the agent will try a bunch of different actions to increase and decrease the temperature. Eventually, it learns that increasing the temperature yields a good reward. But it also learns that reducing the temperature will yield a bad reward.
For evolutionary algorithms, it initiates with a bunch of random agents that all have a preprogrammed set of actions it is going to do. Then the agents that has the "increase temperature" action survives, and moves onto the next generation. Eventually, only agents that increase the temperature survive and are deemed the best solution. However, the algorithm does not know what happens if you decrease the temperature.
TL;DR: RL is usually one agent, trying different actions, and learning and remembering all info (positive or negative). EM uses many agents that guess many actions, only the agents that have the optimal actions survive. Basically a brute force way to solve a problem.
I think the biggest difference between Evolutionary Strategies and Reinforcement Learning is that ES is a global optimization technique while RL is a local optimization technique. So RL can converge to a local optima converging faster while ES converges slower to a global minima.
Evolution Strategies optimization happens on a population level. An evolution strategy algorithm in an iterative fashion (i) samples a batch of candidate solutions from the search space (ii) evaluates them and (iii) discards the ones with low fitness values. The sampling for a new iteration (or generation) happens around the mean of the best scoring candidate solutions from the previous iteration. Doing so enables evolution strategies to direct the search towards a promising location in the search space.
Reinforcement learning requires the problem to be formulated as a Markov Decision Process (MDP). An RL agent optimizes its behavior (or policy) by maximizing a cumulative reward signal received on a transition from one state to another. Since the problem is abstracted as an MDP learning can happen on a step or episode level. Learning per step (or N steps) is done via temporal-Difference learning (TD) and per episode is done via Monte Carlo methods. So far I am talking about learning via action-value functions (learning the values of actions). Another way of learning is by optimizing the parameters of a neural network representing the policy of the agent directly via gradient ascent. This approach is introduced in the REINFORCE algorithm and the general approach known as policy-based RL.
For a comprehensive comparison check out this paper https://arxiv.org/pdf/2110.01411.pdf

How to choose action in TD(0) learning

I am currently reading Sutton's Reinforcement Learning: An introduction book. After reading chapter 6.1 I wanted to implement a TD(0) RL algorithm for this setting:
To do this, I tried to implement the pseudo-code presented here:
Doing this I wondered how to do this step A <- action given by π for S: I can I choose the optimal action A for my current state S? As the value function V(S) is just depending on the state and not on the action I do not really know, how this can be done.
I found this question (where I got the images from) which deals with the same exercise - but here the action is just picked randomly and not choosen by an action policy π.
Edit: Or this is pseudo-code not complete, so that I have to approximate the action-value function Q(s, a) in another way, too?
You are right, you cannot choose an action (neither derive a policy π) only from a value function V(s) because, as you notice, it depends only on the state s.
The key concept that you are probably missing here, it's that TD(0) learning is an algorithm to compute the value function of a given policy. Thus, you are assuming that your agent is following a known policy. In the case of the Random Walk problem, the policy consists in choosing actions randomly.
If you want to be able to learn a policy, you need to estimate the action-value function Q(s,a). There exists several methods to learn Q(s,a) based on Temporal-difference learning, such as for example SARSA and Q-learning.
In the Sutton's RL book, the authors distinguish between two kind of problems: prediction problems and control problems. The former refers to the process of estimating the value function of a given policy, and the latter to estimate policies (often by means of action-value functions). You can find a reference to these concepts in the starting part of Chapter 6:
As usual, we start by focusing on the policy evaluation or prediction
problem, that of estimating the value function for a given policy .
For the control problem (finding an optimal policy), DP, TD, and Monte
Carlo methods all use some variation of generalized policy iteration
(GPI). The differences in the methods are primarily differences in
their approaches to the prediction problem.

What techniques exist for the software-driven locomotion of a bipedal robot?

I'm programming a software agent to control a robot player in a simulated game of soccer. Ultimately I hope to enter it in the RoboCup competition.
Amongst the various challenges involved in creating such an agent, the motion of it's body is one of the first I'm facing. The simulation I'm targeting uses a Nao robot body with 22 hinge to control. Six in each leg, four in each arm and two in the neck:
(source: sourceforge.net)
I have an interest in machine learning and believe there must be some techniques available to control this guy.
At any point in time, it is known:
The angle of all 22 hinges
The X,Y,Z output of an accelerometer located in the robot's chest
The X,Y,Z output of a gyroscope located in the robot's chest
The location of certain landmarks (corners, goals) via a camera in the robot's head
A vector for the force applied to the bottom of each foot, along with a vector giving the position of the force on the foot's sole
The types of tasks I'd like to achieve are:
Running in a straight line as fast as possible
Moving at a defined speed (that is, one function that handles fast and slow walking depending upon an additional input)
Walking backwards
Turning on the spot
Running along a simple curve
Stepping sideways
Jumping as high as possible and landing without falling over
Kicking a ball that's in front of your feet
Making 'subconscious' stabilising movements when subjected to unexpected forces (hit by ball or another player), ideally in tandem with one of the above
For each of these tasks I believe I could come up with a suitable fitness function, but not a set of training inputs with expected outputs. That is, any machine learning approach would need to offer unsupervised learning.
I've seen some examples in open-source projects of circular functions (sine waves) wired into each hinge's angle with differing amplitudes and phases. These seem to walk in straight lines ok, but they all look a bit clunky. It's not an approach that would work for all of the tasks I mention above though.
Some teams apparently use inverse kinematics, though I don't know much about that.
So, what approaches are there for robot biped locomotion/ambulation?
As an aside, I wrote and published a .NET library called TinMan that provides basic interaction with the soccer simulation server. It has a simple programming model for the sensors and actuators of the robot's 22 hinges.
You can read more about RoboCup's 3D Simulated Soccer League:
http://en.wikipedia.org/wiki/RoboCup_3D_Soccer_Simulation_League
http://simspark.sourceforge.net/wiki/index.php/Main_Page
http://code.google.com/p/tin-man/
There is a significant body of research literature on robot motion planning and robot locomotion.
General Robot Locomotion Control
For bipedal robots, there are at least two major approaches to robot design and control (whether the robot is simulated or physically real):
Zero Moment Point - a dynamics-based approach to locomotion stability and control.
Biologically-inspired locomotion - a control approach modeled after biological neural networks in mammals, insects, etc., that focuses on use of central pattern generators modified by other motor control programs/loops to control overall walking and maintain stability.
Motion Control for Bipedal Soccer Robot
There are really two aspects to handling the control issues for your simulated biped robot:
Basic walking and locomotion control
Task-oriented motion planning
The first part is just about handling the basic control issues for maintaining robot stability (assuming you are using some physics-based model with gravity), walking in a straight-line, turning, etc. The second part is focused on getting your robot to accomplish specific tasks as a soccer player, e.g., run toward the ball, kick the ball, block an opposing player, etc. It is probably easiest to solve these separately and link the second part as a higher-level controller that sends trajectory and goal directives to the first part.
There are a lot of relevant papers and books which could be suggested, but I've listed some potentially useful ones below that you may wish to include in whatever research you have already done.
Reading Suggestions
LaValle, Steven Michael (2006). Planning Algorithms, Cambridge University Press.
Raibert, Marc (1986). Legged Robots that Balance. MIT Press.
Vukobratovic, Miomir and Borovac, Branislav (2004). "Zero-Moment Point - Thirty Five Years of its Life", International Journal of Humanoid Robotics, Vol. 1, No. 1, pp 157–173.
Hirose, Masato and Takenaka, T (2001). "Development of the humanoid robot ASIMO", Honda R&D Technical Review, vol 13, no. 1.
Wu, QiDi and Liu, ChengJu and Zhang, JiaQi and Chen, QiJun (2009). "Survey of locomotion control of legged robots inspired by biological concept ", Science in China Series F: Information Sciences, vol 52, no. 10, pp 1715--1729, Springer.
Wahde, Mattias and Pettersson, Jimmy (2002) "A brief review of bipedal robotics research", Proceedings of the 8th Mechatronics Forum International Conference, pp 480-488.
Shan, J., Junshi, C. and Jiapin, C. (2000). "Design of central pattern generator for
humanoid robot walking based on multi-objective GA", In: Proc. of the IEEE/RSJ
International Conference on Intelligent Robots and Systems, pp. 1930–1935.
Chestnutt, J., Lau, M., Cheung, G., Kuffner, J., Hodgins, J., and Kanade, T. (2005). "Footstep planning for the Honda ASIMO humanoid", Proceedings of the 2005 IEEE International Conference on Robotics and Automation (ICRA 2005), pp 629-634.
I was working on a project not that dissimilar from this (making a robotic tuna) and one of the methods we were exploring was using a genetic algorithm to tune the performance of an artificial central pattern generator (in our case the pattern was a number of sine waves operating on each joint of the tail). It might be worth giving a shot, Genetic Algorithms are another one of those tools that can be incredibly powerful, if you are careful about selecting a fitness function.
Here's a great paper from 1999 by Peter Nordin and Mats G. Nordahl that outlines an evolutionary approach to controlling a humanoid robot, based on their experience building the ELVIS robot:
An Evolutionary Architecture for a Humanoid Robot
I've been thinking about this for quite some time now and I realized that you need at least two intelligent "agents" to make this work properly. The basic idea is that you have two types intelligent activity here:
Subconscious Motor Control (SMC).
Conscious Decision Making (CDM).
Training for the SMC could be done on-line... if you really think about it: defining success within motor control is basically done when you provide a signal to your robot, it evaluates that signal and either accepts it or rejects it. If your robot accepts a signal and it results in a "failure", then your robot goes "offline" and it can't accept any more signals. Defining "failure" and "offline" could be tricky, but I was thinking that it would be a failure if, for example, a sensor on the robot indicates that the robot is immobile (laying on the ground).
So your fitness function for the SMC might be something of the sort: numAcceptedSignals/numGivenSignals + numFailure
The CDM is another AI agent that generates signals and the fitness function for it could be: (numSignalsAccepted/numSignalsGenerated)/(numWinGoals/numLossGoals)
So what you do is you run the CDM and all the output that comes out of it goes to the SMC... at the end of a game you run your fitness functions. Alternately you can combine the SMC and the CDM into a single agent and you can make a composite fitness function based on the other two fitness functions. I don't know how else you could do it...
Finally, you have to determine what constitutes a learning session: is it half a game, full game, just a few moves, etc. If a game lasts 1 minute and you have a total of 8 players on the field, then the process of training could be VERY slow!
Update
Here is a quick reference to a paper that used genetic programming to create "softbots" that play soccer: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.36.136&rep=rep1&type=pdf
With regards to your comments: I was thinking that for the subconscious motor control (SMC), the signals would come from the conscious decision maker (CDM). This way you're evolving your SMC agent to properly handle the CDM agent's commands (signals). You want to maximize the up-time of the SMC agent regardless of what the CDM agent says.
The SMC agent receives an input, for example a vector force on a joint, and it then runs it through its processing unit to determine if it should execute that input or if it should reject it. The SMC should only execute inputs that it doesn't "think" it will recover from and it should reject inputs that it "thinks" would lead to a "catastrophic failure".
Now the SMC agent has an output: accept or reject a signal (1 or 0). The CDM can use that signal for its own training... the CDM wants to maximize the number of signals that the SMC accepts and it also wants to satisfy a goal: a high score for its own team and a low score for the opposing team. So the CDM has its own processing unit that is being evolved to satisfy both of those needs. Your reference provided a 3-layer design, while mine is only a 2-layer... I think mine was a right step in towards the 3-layer design.
One more thing to note here: is falling really a "catastrophic failure"? What if your robot falls, but the CDM makes it stand up again? I think that would be a valid behavior, so you shouldn't penalize the robot for falling... perhaps a better thing to do is penalize it for the amount of time it takes in order to perform a goal (not necessarily a soccer goal).
There is this tutorial on humanoid locomotion control that describes the software stack used on the HRP-4 humanoid (which can walk or climb stairs). It consists mainly of:
Linear inverted pendulum: a simplified model for balancing. It involves only the center of mass (COM) and ZMP already mentioned in other answers.
Trajectory optimization: the robot computes what it wants to do, ideally, for the next 2 seconds or so. It keeps recomputing this trajectory as it moves, which is known as model predictive control.
Balance control: the last stage that corrects the robot's posture based on sensor measurements and the desired trajectory.
Follow links to the academic papers and source code to learn more.

Basis for claim that the number of bugs per line of code is constant regardless of the language used

I've heard people say (although I can't recall who in particular) that the number of bugs per line of code is roughly constant regardless of what language is used. What is the research that backs this up?
Edited to add: I don't have access to it, but apparently the authors of this paper
"asked the question whether the number of bugs per lines of code (LOC) is the same for programs written in different programming languages or not."
In his book Code Complete (quoting from the 2nd Edition), in the chapter "Developer Testing," Steve McConnell cites a handful of studies across a variety of languages:
Industry average experience is about 1-25 errors per 1000 lines of code for delivered software. The software has usually been developed using a hodgepodge of techniques (Boehm 1981, Gremillion 1984, Yourdon 1989a, Jones 1998, Jones 2000, Weber 2003). Cases that have one-tenth as many errors as this are rare; cases that have 10 times more tend not to be reported. (They probably aren't ever completed!)
The Applications Division at Microsoft experiences about 10–20 defects per 1000 lines of code during in-house testing and 0.5 defects per 1000 lines of code in released product (Moore 1992). The technique used to achieve this level is a combination of the code-reading techniques described in Other Kinds of Collaborative Development Practices, and independent testing.
Harlan Mills pioneered "cleanroom development," a technique that has been able to achieve rates as low as 3 defects per 1000 lines of code during in-house testing and 0.1 defects per 1000 lines of code in released product (Cobb and Mills 1990).
These studies ranged from high-level languages like Java, down to C++ and C, all the way down to assembly. Considering the massive impact of Code Complete on software engineering as a discipline, I suspect it is responsible for popularizing this idea.
One possible source would be Les Hatton's 1995 paper "Computer programming languages and safety-related systems", in which he concludes that language choice is at least close to irrelevant and other factors (chiefly fluency in the chosen language) are the controlling factors.
About all I could add to that would be to summarize various other papers, in which defect rates for individual projects (and such) are given. I've done a bit of looking, and never found a correlation between language and defect rate, but that's not really the same as saying the defect rate is constant across languages (i.e., they may be different, but they vary so widely within each language that I've never been able to prove a difference).