Confused about simulated annealing

Simulated annealing is a meta-heuristic for optimization. Essentially it does hill-climbing with the possibility of jumping from one hill to another -- even if the jump is to a lower place on the second hill.
Such negative jumps are allowed less and less often as the "temperature" cools and the system settles into a solution. See this pseudo-code. The point of allowing the jumps is to enable the system to move from a hill leading to a local maximum to one leading to a global maximum.
My confusion concerns how the system keeps itself from jumping from a hill leading to a global maximum to one leading to only a local maximum.

I think your understanding of simulated annealing (SA) is slightly flawed.
When solving a problem with SA, at any point in time your feasible solution will be somewhere on a hill in the solution space. Unless you do standard hill climbing on every hill you visit (you're not), there is no way for you to know where the hill you're currently on leads. All you know is your current height. So in any iteration, you could very well be jumping from a hill leading to a global optimum to a hill leading to a local optimum.
That being said, with SA you will tend to end up on higher hills, since the lowering temperature gradually drives you towards hills where there is still the possibility of taking a step up.
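For reference, here is a minimal sketch of the acceptance rule most SA implementations use (the Metropolis criterion); neighbour() and score() are placeholders you would supply for your own problem:

    import math
    import random

    def simulated_annealing(initial, neighbour, score, t0=1.0, cooling=0.995, steps=10000):
        # Maximise score(x) starting from initial.  neighbour(x) returns a
        # random nearby solution; score(x) is the "height" of solution x.
        current, current_score = initial, score(initial)
        best, best_score = current, current_score
        t = t0
        for _ in range(steps):
            candidate = neighbour(current)
            candidate_score = score(candidate)
            delta = candidate_score - current_score
            # Always accept an uphill move; accept a downhill move with
            # probability exp(delta / t), which shrinks as t cools.
            if delta >= 0 or random.random() < math.exp(delta / t):
                current, current_score = candidate, candidate_score
                if current_score > best_score:
                    best, best_score = current, current_score
            t *= cooling  # geometric cooling schedule
        return best, best_score

The acceptance line is the part relevant to the question: a downhill jump is possible at any temperature, so nothing actively prevents leaving a hill that leads to the global maximum; such jumps simply become increasingly unlikely as t shrinks.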

Related

Deep Q learning with CNN - How can I know if my reinforcement learning model is actually learning?

I am trying to train a bot in a game like Curve Fever. It is like Snake, except the snake moves with a precise turn radius (not 90°), it leaves random holes (which it can pass through), and, as in Snake, it dies if it goes out of the map or hits itself. The difference in scoring is that the snake has to survive as long as possible and there is no food; the tail grows by 1 at every step. It looks like this:
So I use a deep Q-learning algorithm with a CNN, inspired by Flappy Bird deep Q-learning, which is itself inspired by DeepMind's paper Playing Atari with Deep Reinforcement Learning.
My input images are thresholded, like the one above, so that everything is black or white.
At every step I grant a reward of +0.1 for staying alive and -1 for dying by hitting the border of the map or its own tail.
I trained my agent for hours, and after 4,000,000 iterations I end up with an agent which almost never goes out of the map but crashes into itself very quickly.
So it seems it learnt how to avoid crashing into the border of the map but not into itself; what could explain this?
Some examples:
My suppositions are:
I used a replay memory size of 25,000 instead of 50,000 because of an OOM error; is that enough?
I did not train it long enough, but how could I know?
The border of the map never changes, so it is easy to learn, but the agent's own tail changes every game; should I give a larger negative reward for crashing into itself so the agent takes it more into account? (See the sketch below.)
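For concreteness, here is a minimal sketch of the reward scheme described above, with a heavier self-collision penalty as in the last supposition; the outcome labels and the values other than +0.1/-1 are hypothetical:

    ALIVE_REWARD = 0.1
    WALL_PENALTY = -1.0
    SELF_PENALTY = -2.0   # hypothetical: penalise self-collision more heavily

    def compute_reward(outcome):
        # outcome would come from the game engine after applying the action
        if outcome == "alive":
            return ALIVE_REWARD
        if outcome == "hit_wall":
            return WALL_PENALTY
        if outcome == "hit_self":
            return SELF_PENALTY
        raise ValueError("unknown outcome: " + outcome)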
Here are my learning curves:
I am asking for your help because it takes a lot of time to train my agent and I can't be sure what I should do.
Thanks in advance for any help.

Is a "single cycle cpu" possible if asynchronous components are used?

I have heard the term "single cycle CPU" and was trying to understand what it actually means. Is there a clear, agreed definition of what it means?
Some homebrew "single cycle CPUs" I've come across seem to use both the rising and the falling edges of the clock to complete a single instruction. Typically, the rising edge acts as fetch/decode and the falling edge as execute.
However, in my reading I came across the reasonable point made here ...
https://zipcpu.com/blog/2017/08/21/rules-for-newbies.html
"Do not transition on any negative (falling) edges.
Falling edge clocks should be considered a violation of the one clock principle,
as they act like separate clocks.".
This rings true to me.
Needing both the rising and falling edges (or high and low phases) is effectively the same as needing the rising edges of two cycles of a single clock running twice as fast; and that would be a "two cycle" CPU, wouldn't it?
So is it honest to state that a design is a "single cycle CPU" when both the rising and falling edges are actively used for state change?
It would seem that a true single cycle cpu must perform all state changing operations on a single clock edge of a single clock cycle.
I can imagine such a thing is possible provided the data storage is all synchronous. If we have a synchronous system that has settled, then on the next clock edge we can clock the results into a synchronous data store and simultaneously clock the program counter on to the next address.
But if the target data store is, for example, async RAM, then surely the control lines would be changing while the data is being stored, leading to unintended behaviour.
Am I wrong? Are there any examples of a "single cycle CPU" that include async storage in the mix?
It would seem that using async RAM in one's design means one must use at least two logical clock cycles to achieve the state change.
Of course, with some more complexity one could perhaps have a CPU that uses a single edge for instructions that involve solely synchronous components, but relies on an extra cycle when storing to async data; but then that still wouldn't be a single cycle CPU, rather a mostly single cycle CPU.
So no CPU that writes to async RAM (or another async component) can honestly be considered a single cycle CPU, because the entire instruction cannot be carried out on a single clock edge. The RAM write needs two edges (i.e. falling and rising), and this breaks the single clock principle.
So is there a commonly accepted single cycle CPU and are we applying the term consistently?
What's the story?
(Also posted in my hackday log https://hackaday.io/project/166922-spam-1-8-bit-cpu/log/181036-single-cycle-cpu-confusion and also on a private group in hackaday)
=====
Update: Looking at simple MIPS designs, it seems the models use synchronous memory and so can probably operate off a single edge, and maybe they do - therefore warranting the category "single cycle".
And perhaps FPGA memory is always synchronous - I don't know about that.
But is the term used inconsistently elsewhere - i.e. by most homebrew TTL computers out there?
Or am I just plain wrong?
====
Update:
Some may have misunderstood my point.
Numerous homebrew TTL CPUs claim "single cycle CPU" status (for the purposes of this discussion I'm not interested in more complex beasts that do pipelining or whatever).
By single cycle, these CPUs typically mean that they do something like advancing the PC on one edge of the clock and then using the opposing edge of the clock to update flip-flops with the result. Or they will use the other phase of the clock to update async components like latches and SRAM.
However, the ZipCPU reference I provided suggests that using the opposing clock edge is akin to using a second clock cycle, or even a second clock. BTW, Ben Eater in his videos even compares the inverted clock he uses to update his SRAM to a second clock.
My objection to the use of "single cycle CPU" for such CPUs (basically most/all homebrew TTL CPUs I've seen, as they all work that way) is that I agree with ZipCPU: using the opposing edge (or phase) of the clock for the commit is effectively the same as using a second clock, and this makes a mockery of the "single cycle" claim.
If using the opposing edge is effectively the same as using a single edge of two clock cycles, then I think that makes use of the term questionable. So I take ZipCPU's point to heart and tighten the term to mean use of a single edge.
On the other hand, it seems perfectly possible to build a CPU that uses only sync components (i.e. edge-triggered flip-flops) and only a single edge, where on each edge we clock whatever is on the bus into whatever device is selected for write and at the same moment advance the PC.
Between one edge and the next same direction edge, settling occurs.
In this manner we end up with CPI=1 and use of only a single edge - which is very distinctly different to the common TTL CPU pattern of using both edges of the clock.
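To illustrate that single-edge commit, here is a toy software analogy (not HDL, and not any real CPU): the "combinational" function computes every register's next value from the settled current state, and one "clock edge" commits them all at once.

    def combinational(state, program):
        # Compute the next value of every register from the settled current state.
        opcode, operand = program[state["pc"]]
        nxt = dict(state)
        nxt["pc"] = state["pc"] + 1
        if opcode == "LOADI":
            nxt["acc"] = operand
        elif opcode == "ADD":
            nxt["acc"] = state["acc"] + state["regs"][operand]
        elif opcode == "STORE":
            nxt["regs"] = dict(state["regs"], **{operand: state["acc"]})
        return nxt

    def clock_edge(state, program):
        # The single edge: acc, regs and pc all take their new values together.
        return combinational(state, program)

    state = {"pc": 0, "acc": 0, "regs": {"r0": 0}}
    program = [("LOADI", 5), ("STORE", "r0"), ("ADD", "r0")]
    for _ in program:
        state = clock_edge(state, program)
    # state is now {"pc": 3, "acc": 10, "regs": {"r0": 5}}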
BTW, my impression of FPGAs (which I'm not referring to here) is that the storage elements in an FPGA are all synchronous flip-flops. I don't know, but that's what my reading suggests. Anyway, if this is true, then a trivial FPGA-based CPU probably has a CPI=1 and uses only, say, the +ve edge, and so might well meet my narrow definition of "single cycle CPU". Also, my reading suggests that various MIPS implementations (educational efforts, probably) probably meet my definition.
This seems mostly a question of definitions and terminology, moreso than how to actually build simple CPUs.
If you insist on that strict definition of "single cycle CPU", meaning to truly use only one clock edge to set everything in motion for that instruction, then yes, that would exclude real-world toy/hobby CPUs that use a 2nd clock edge to give a consistent interval for memory access.
But they certainly still fulfil the spirit of a single-cycle CPU, which is that every instruction runs in 1 clock cycle, with no pipelining and no multi-cycle microcode.
A whole clock cycle does have 2 clock edges, and it's normal for real-world (non-single-cycle) CPUs to use the "other" edge for internal timing in some cases. We still talk about their frequency in whole cycles, not the edge frequency, except in cases like DDR memory where we do talk about the transfer rate being twice the memory clock frequency. What sets DDR apart is that it always uses both edges, and for approximately equal things, not just some extra timing / synchronization within a clock cycle.
Now could you build a CPU that keeps a store value on a memory bus for some minimum time, without using a clock edge? Maybe.
Perhaps make sure the critical path leading to store-data is short enough that the data is always ready. And possibly propagate some "data-ready" signal along with your computations (or just from the longest critical path of any instruction), and a couple of gate delays after the data is on the bus, flip the memory clock. (And on the next CPU clock edge, flip it back.) If your memory doesn't mind its clock not having a uniform duty cycle, this might be fine as long as each half of the memory clock is long enough.
For loading from memory, you can maybe do something similar by initiating a memory load cycle some gate-delays after the CPU clock edge that starts this "cycle" of your single-cycle CPU. This might even involve building a long chain of gate delays intentionally with inverters dedicated to that purpose. Or perhaps even an analog RC time delay, but either way that's obviously worse than just using the other edge of the main clock, and you'd only ever do this as an exercise in single-cycle dogmatic purity. (It can also be flaky because gate-delay isn't constant, depending on voltage and temperature, so one side of the chip running hotter than the other could change relative timing.)
The definition says that a single cycle CPU executes just one instruction per cycle. So in theory one can conclude that there are other CPUs that take more or fewer instructions per cycle. You can look into concepts like the multi-cycle processor and the pipelined processor (instruction pipelining). "Pipelining attempts to keep every part of the processor busy with some instruction by dividing incoming instructions into a series of sequential steps," according to Wikipedia. I don't know exactly how it works, but maybe it just uses the available registers (maybe instead of using, for example, EAX, ECX is used like EAX, or maybe it works some other way; one thing that is certainly true is that the number of registers keeps increasing, so maybe that's one of the main purposes). Source: https://en.wikipedia.org/wiki/Instruction_pipelining
I think the answer to the question "is a single cycle CPU possible if asynchronous components are used" depends on the CPU controller that controls both CPU and RAM with opcodes. I found interesting information about this here: http://people.cs.pitt.edu/~cho/cs1541/current/handouts/lect-pipe_4up.pdf
https://ibb.co/tKy6sR2
CONCLUSION: In my opinion, if we consider the term "single cycle CPU", it should be the simplest possible construction. The term "asynchronous" implies something more complex than "synchronous", so the two terms are not equivalent. It's something like asking "Can a basic data type be considered a structure?". In my opinion the word "single" means the simplest possible, and "asynchronous" means some modification, hence more complex, so I just think it's not possible. But maybe "are used" can be read as "are used at the time" - if some switch or controller can turn off asynchronous mode and make everything as simple as possible. But generally, I just think it's not possible.

Berkeley Pac-Man Project: features divided through by 10

I am busy coding reinforcement learning agents for the game Pac-Man and came across Berkeley's CS course's Pac-Man Projects, specifically the reinforcement learning section.
For the approximate Q-learning agent, feature approximation is used. A simple extractor is implemented in this code. What I am curious about is why, before the features are returned, they are divided by 10. If you run the solution without the factor of 10, you can see that Pac-Man does significantly worse - but why?
After running multiple tests, it turns out that the estimated Q-values can diverge wildly. In fact, the weights can all become negative, even the one which would usually incline Pac-Man to eat pills. So he just stands there, and eventually tries to run from ghosts, but never tries to finish a level.
I speculate that this happens when he loses during training: the negative reward is propagated through the system, and since the number of ghosts can be greater than one, this has a heavy bearing on the weights, causing everything to become very negative, and the system can't "recover" from this.
I confirmed this by adjusting the feature extractor to scale only the #-of-ghosts-one-step-away feature, after which Pac-Man manages to get a much better result.
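For anyone puzzling over the same thing, here is a minimal sketch of the standard approximate Q-learning update (not the Berkeley project's actual code); it shows that each weight update is proportional to the feature values, so features that are 10x larger amplify every TD error - including the large negative one received on losing - by the same factor:

    def q_value(weights, features):
        # Q(s, a) = sum_i w_i * f_i(s, a)
        return sum(weights.get(name, 0.0) * value for name, value in features.items())

    def update(weights, features, reward, next_q_max, alpha=0.2, gamma=0.8):
        # difference = (r + gamma * max_a' Q(s', a')) - Q(s, a)
        # w_i <- w_i + alpha * difference * f_i(s, a)
        difference = (reward + gamma * next_q_max) - q_value(weights, features)
        for name, value in features.items():
            weights[name] = weights.get(name, 0.0) + alpha * difference * value
        return weights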
In retrospect this question is now more mathsy and might fit better on another stackexchange.

Does anyone know what "Quantum Computing" is?

In physics, it's the ability of particles to exist in multiple/parallel dynamic states at a particular point in time. In computing, would it be the ability of a data bit to equal 1 or 0 at the same time, a third value like NULL [unknown], or multiple values? How can this technology be applied to computer processors, programming, security, etc.? Has anyone built a practical quantum computer or developed a quantum programming language where, for example, the program code dynamically changes or is autonomous?
I have done research in quantum computing, and here is what I hope is an informed answer.
It is often said that qubits as you see them in a quantum computer can exist in a "superposition" of 0 and 1. This is true, but in a more subtle way than you might first guess. Even with a classical computer with randomness, a bit can exist in a superposition of 0 and 1, in the sense that it is 0 with some probability and 1 with some probability. Just as when you roll a die and don't look at the outcome, or receive e-mail that you haven't yet read, you can view its state as a superposition of the possibilities. Now, this may sound like just flim-flam, but the fact is that this type of superposition is a kind of parallelism and that algorithms that make use of it can be faster than other algorithms. It is called randomized computation, and instead of superposition you can say that the bit is in a probabilistic state.
The difference between that and a qubit is that a qubit can have a fat set of possible superpositions with more properties. The set of probabilistic states of an ordinary bit is a line segment, because all you have is a probability of 0 or 1. The set of states of a qubit is a round 3-dimensional ball. Now, probabilistic bit strings are more complicated and more interesting than just individual probabilistic bits, and the same is true of strings of qubits. If you can make qubits like this, then actually some computational tasks wouldn't be any easier than before, just as randomized algorithms don't help with all problems. But some computational problems, for example factoring numbers, have new quantum algorithms that are much faster than any known classical algorithm. It is not a matter of clock speed or Moore's law, because the first useful qubits could be fairly slow and expensive. It is only sort-of parallel computation, just as an algorithm that makes random choices is only in a weak sense making all choices in parallel. But it is "randomized algorithms on steroids"; that's my favorite summary for outsiders.
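In standard notation (my addition, not part of the answer above): a probabilistic bit is fully described by a single number $p = \Pr(\text{bit} = 1)$ with $0 \le p \le 1$, which is why its state space is a line segment, whereas a pure qubit state is $|\psi\rangle = \alpha|0\rangle + \beta|1\rangle$ with complex amplitudes satisfying $|\alpha|^2 + |\beta|^2 = 1$; allowing statistical mixtures of such states fills out the 3-dimensional (Bloch) ball mentioned above.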
Now the bad news. In order for a classical bit to be in a superposition, it has to be a random choice that is secret from you. Once you look at a flipped coin, the coin "collapses" to either heads for sure or tails for sure. The difference between that and a qubit is that in order for a qubit to work as one, its state has to be secret from the rest of the physical universe, not just from you. It has to be secret from wisps of air, from nearby atoms, etc. On the other hand, for qubits to be useful for a quantum computer, there has to be a way to manipulate them while keeping their state a secret. Otherwise their quantum randomness or quantum coherence is wrecked. Making qubits at all isn't easy, but it is done routinely. Making qubits that you can manipulate with quantum gates, without revealing what is in them to the physical environment, is incredibly difficult.
People don't know how to do that except in very limited toy demonstrations. But if they could do it well enough to make quantum computers, then some hard computational problems would be much easier for these computers. Others wouldn't be easier at all, and a great deal is unknown about which ones can be accelerated and by how much. It would definitely have various effects on cryptography; it would break the widely used forms of public-key cryptography. But other kinds of public-key cryptography have been proposed that could be okay. Moreover, quantum computing is related to the quantum key distribution technique, which looks very safe, and secret-key cryptography would almost certainly still be fairly safe.
The other context where the word "quantum" computing comes up concerns an "entangled pair". Essentially, if you can create an entangled pair of particles which have a physical "spin", quantum physics dictates that the spin on each particle will always be opposite.
If you could create an entangled pair and then separate them, you could use the device to transmit data without interception, by changing the spin on one of the particles. You could then create a signal modulated by the particles' information which is theoretically unbreakable, as you cannot know what spin was on the particles at any given time by intercepting the information between the two signal points.
A whole lot of very interested organisations are researching this technique for secure communications.
Yes, there is quantum encryption, by which if someone tries to spy on your communication, it destroys the datastream such that neither they nor you can read it.
However, the real power of quantum computing lies in the fact that a qubit can be in a superposition of 0 and 1. Big deal. But if you have, say, eight qubits, you can represent a superposition of all integers from 0 to 255. This lets you do some rather interesting things in polynomial instead of exponential time. Factorization of large numbers (i.e., breaking RSA, etc.) is one of them.
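To make the eight-qubit example concrete (standard notation, not from the answer itself): the register's state is $|\psi\rangle = \sum_{x=0}^{255} \alpha_x |x\rangle$ with $\sum_x |\alpha_x|^2 = 1$, so describing it takes $2^8 = 256$ complex amplitudes, one for every integer from 0 to 255, even though a measurement returns only a single one of those integers.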
There are a number of applications of quantum computing.
One huge one is the ability to solve NP-hard problems in P-time, by using the indeterminacy of qubits to essentially brute-force the problem in parallel.
(The struck-out sentence is false. Quantum computers do not work by brute-forcing all solutions in parallel, and they are not believed to be able to solve NP-complete problems in polynomial time. See e.g. here.)
Just an update on the quantum computing industry, building on Greg Kuperberg's answer:
The D-Wave 2 system uses quantum annealing.
A superposition of quantum states will collapse to a unique state when an observation happens. The current technology of quantum annealing applies a physical force to two quantum bits; the force adds constraints to the qubits, so that when an observation happens, each qubit has a higher probability of collapsing to a result that we are willing to see.
Reference:
How does a quantum machine work
I monitor recent non-peer-reviewed articles on the subject; this is what I extrapolate from what I have read. A qubit, in addition to what has been said above - namely that it can hold values in superposition - can also hold multiple bits, for example spin up/+, spin down/+, spin -/vertical (to abbreviate: +H, -H, +V, -V, L+, LH, LV). Also, not all of the combinations are valid, and there are additional values that can be placed depending on the type of qubit used,
each behaving somewhat like RAM vs ROM, etc.: a photon with a wavelength, an electron with a charge, a photon with a charge, a photon with a spin - you get the idea. Some combinations are not valid, and some require additional algorithms in order to pass the argument to the next variable (location where data is stored) or qubit (location of the superposition of values to be returned), if you will, simply because the use of wires is by necessity limited due to size and space. One of the greatest challenges is controlling or removing quantum decoherence. This usually means isolating the system from its environment, as interactions with the external world cause the system to decohere. In November 2011, researchers factorised 143 using 4 qubits. That same year, D-Wave Systems announced the first commercial quantum annealer on the market, by the name D-Wave One; the company claims this system uses a 128-qubit processor chipset. In May 2013, Google Inc. announced that it was launching the Quantum AI Lab, hopefully to boost AI. I really do hope I didn't waste anyone's time with things they already knew. If you learned something, please upvote.
As I cannot yet comment: it really depends on what type of qubit you are working with to know the number of states - for example, the UNSW silicon qubit vs a diamond nitrogen-vacancy qubit, or a solid-state NMR phosphorus-in-silicon qubit vs liquid NMR of the same.

What techniques exist for the software-driven locomotion of a bipedal robot?

I'm programming a software agent to control a robot player in a simulated game of soccer. Ultimately I hope to enter it in the RoboCup competition.
Amongst the various challenges involved in creating such an agent, the motion of its body is one of the first I'm facing. The simulation I'm targeting uses a Nao robot body with 22 hinges to control: six in each leg, four in each arm, and two in the neck:
(source: sourceforge.net)
I have an interest in machine learning and believe there must be some techniques available to control this guy.
At any point in time, it is known:
The angle of all 22 hinges
The X,Y,Z output of an accelerometer located in the robot's chest
The X,Y,Z output of a gyroscope located in the robot's chest
The location of certain landmarks (corners, goals) via a camera in the robot's head
A vector for the force applied to the bottom of each foot, along with a vector giving the position of the force on the foot's sole
The types of tasks I'd like to achieve are:
Running in a straight line as fast as possible
Moving at a defined speed (that is, one function that handles fast and slow walking depending upon an additional input)
Walking backwards
Turning on the spot
Running along a simple curve
Stepping sideways
Jumping as high as possible and landing without falling over
Kicking a ball that's in front of your feet
Making 'subconscious' stabilising movements when subjected to unexpected forces (hit by ball or another player), ideally in tandem with one of the above
For each of these tasks I believe I could come up with a suitable fitness function, but not a set of training inputs with expected outputs. That is, any machine learning approach would need to offer unsupervised learning.
I've seen some examples in open-source projects of circular functions (sine waves) wired into each hinge's angle with differing amplitudes and phases. These seem to walk in straight lines OK, but they all look a bit clunky. It's not an approach that would work for all of the tasks I mention above, though.
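For reference, here is a minimal sketch of that sine-wave approach (the joint names and constants below are made up, not taken from any team's code): each hinge follows its own sinusoid, and the gait comes from the phase offsets between hips and knees.

    import math

    # Hypothetical per-hinge gait parameters: amplitude (deg), phase (rad), offset (deg).
    GAIT = {
        "left_hip_pitch":   (20.0, 0.0,            -10.0),
        "right_hip_pitch":  (20.0, math.pi,        -10.0),
        "left_knee_pitch":  (30.0, math.pi / 2,     30.0),
        "right_knee_pitch": (30.0, 3 * math.pi / 2, 30.0),
    }

    def hinge_targets(t, period=0.8):
        # Target angle (degrees) for each hinge at time t (seconds).
        omega = 2 * math.pi / period
        return {
            joint: offset + amplitude * math.sin(omega * t + phase)
            for joint, (amplitude, phase, offset) in GAIT.items()
        }

    # Example: the hinge targets over one second at a 50 Hz control loop.
    trajectory = [hinge_targets(step * 0.02) for step in range(50)]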
Some teams apparently use inverse kinematics, though I don't know much about that.
So, what approaches are there for robot biped locomotion/ambulation?
As an aside, I wrote and published a .NET library called TinMan that provides basic interaction with the soccer simulation server. It has a simple programming model for the sensors and actuators of the robot's 22 hinges.
You can read more about RoboCup's 3D Simulated Soccer League:
http://en.wikipedia.org/wiki/RoboCup_3D_Soccer_Simulation_League
http://simspark.sourceforge.net/wiki/index.php/Main_Page
http://code.google.com/p/tin-man/
There is a significant body of research literature on robot motion planning and robot locomotion.
General Robot Locomotion Control
For bipedal robots, there are at least two major approaches to robot design and control (whether the robot is simulated or physically real):
Zero Moment Point - a dynamics-based approach to locomotion stability and control.
Biologically-inspired locomotion - a control approach modeled after biological neural networks in mammals, insects, etc., that focuses on use of central pattern generators modified by other motor control programs/loops to control overall walking and maintain stability.
Motion Control for Bipedal Soccer Robot
There are really two aspects to handling the control issues for your simulated biped robot:
Basic walking and locomotion control
Task-oriented motion planning
The first part is just about handling the basic control issues for maintaining robot stability (assuming you are using some physics-based model with gravity), walking in a straight-line, turning, etc. The second part is focused on getting your robot to accomplish specific tasks as a soccer player, e.g., run toward the ball, kick the ball, block an opposing player, etc. It is probably easiest to solve these separately and link the second part as a higher-level controller that sends trajectory and goal directives to the first part.
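A minimal structural sketch of that split could look like the following; all class and method names here are hypothetical, just to show the layering:

    class WalkController:
        # Low-level layer: keeps the robot stable and follows velocity commands.
        def set_velocity(self, forward, sideways, turn_rate):
            # Translate the commanded body velocity into per-hinge targets.
            ...

        def tick(self, sensors):
            # Read gyro/accelerometer/foot forces and update all 22 hinge targets.
            ...

    class SoccerBrain:
        # High-level layer: decides what to do and sends goals to the walker.
        def __init__(self, walker):
            self.walker = walker

        def tick(self, world_state):
            if world_state.ball_is_near:
                self.walker.set_velocity(forward=0.0, sideways=0.0, turn_rate=0.0)
                # ...then trigger a kick motion
            else:
                self.walker.set_velocity(forward=1.0, sideways=0.0, turn_rate=0.2)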
There are a lot of relevant papers and books which could be suggested, but I've listed some potentially useful ones below that you may wish to include in whatever research you have already done.
Reading Suggestions
LaValle, Steven Michael (2006). Planning Algorithms, Cambridge University Press.
Raibert, Marc (1986). Legged Robots that Balance. MIT Press.
Vukobratovic, Miomir and Borovac, Branislav (2004). "Zero-Moment Point - Thirty Five Years of its Life", International Journal of Humanoid Robotics, Vol. 1, No. 1, pp 157–173.
Hirose, Masato and Takenaka, T (2001). "Development of the humanoid robot ASIMO", Honda R&D Technical Review, vol 13, no. 1.
Wu, QiDi and Liu, ChengJu and Zhang, JiaQi and Chen, QiJun (2009). "Survey of locomotion control of legged robots inspired by biological concept ", Science in China Series F: Information Sciences, vol 52, no. 10, pp 1715--1729, Springer.
Wahde, Mattias and Pettersson, Jimmy (2002) "A brief review of bipedal robotics research", Proceedings of the 8th Mechatronics Forum International Conference, pp 480-488.
Shan, J., Junshi, C. and Jiapin, C. (2000). "Design of central pattern generator for humanoid robot walking based on multi-objective GA", In: Proc. of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 1930–1935.
Chestnutt, J., Lau, M., Cheung, G., Kuffner, J., Hodgins, J., and Kanade, T. (2005). "Footstep planning for the Honda ASIMO humanoid", Proceedings of the 2005 IEEE International Conference on Robotics and Automation (ICRA 2005), pp 629-634.
I was working on a project not that dissimilar from this (making a robotic tuna), and one of the methods we were exploring was using a genetic algorithm to tune the performance of an artificial central pattern generator (in our case the pattern was a number of sine waves operating on each joint of the tail). It might be worth giving it a shot; genetic algorithms are another one of those tools that can be incredibly powerful, if you are careful about selecting a fitness function.
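As a rough illustration of that idea (not the tuna project's actual code): a genetic algorithm over the amplitude/phase/offset parameters of a sine-wave controller could look like this, where evaluate_gait is a placeholder for however you score a simulated walk (e.g. distance covered before falling):

    import random

    N_PARAMS = 12        # e.g. amplitude, phase and offset for four hinges
    POP_SIZE = 50
    MUTATION_STD = 0.1

    def evaluate_gait(params):
        # Placeholder fitness: run the simulator with these gait parameters
        # and return a score such as distance walked before falling over.
        raise NotImplementedError

    def evolve(generations=100):
        population = [[random.uniform(-1, 1) for _ in range(N_PARAMS)]
                      for _ in range(POP_SIZE)]
        for _ in range(generations):
            ranked = sorted(population, key=evaluate_gait, reverse=True)
            parents = ranked[:POP_SIZE // 5]                 # keep the top 20%
            children = []
            while len(parents) + len(children) < POP_SIZE:
                a, b = random.sample(parents, 2)
                cut = random.randrange(N_PARAMS)             # one-point crossover
                child = [g + random.gauss(0, MUTATION_STD)   # Gaussian mutation
                         for g in a[:cut] + b[cut:]]
                children.append(child)
            population = parents + children
        return max(population, key=evaluate_gait)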
Here's a great paper from 1999 by Peter Nordin and Mats G. Nordahl that outlines an evolutionary approach to controlling a humanoid robot, based on their experience building the ELVIS robot:
An Evolutionary Architecture for a Humanoid Robot
I've been thinking about this for quite some time now, and I realized that you need at least two intelligent "agents" to make this work properly. The basic idea is that you have two types of intelligent activity here:
Subconscious Motor Control (SMC).
Conscious Decision Making (CDM).
Training for the SMC could be done online. If you really think about it, defining success within motor control comes down to this: you provide a signal to your robot, it evaluates that signal, and it either accepts it or rejects it. If your robot accepts a signal and it results in a "failure", then your robot goes "offline" and can't accept any more signals. Defining "failure" and "offline" could be tricky, but I was thinking that it would be a failure if, for example, a sensor on the robot indicates that the robot is immobile (lying on the ground).
So your fitness function for the SMC might be something of the sort: numAcceptedSignals/numGivenSignals + numFailure
The CDM is another AI agent that generates signals and the fitness function for it could be: (numSignalsAccepted/numSignalsGenerated)/(numWinGoals/numLossGoals)
So what you do is run the CDM, and all the output that comes out of it goes to the SMC... at the end of a game you run your fitness functions. Alternatively, you can combine the SMC and the CDM into a single agent and make a composite fitness function based on the other two fitness functions. I don't know how else you could do it...
Finally, you have to determine what constitutes a learning session: is it half a game, full game, just a few moves, etc. If a game lasts 1 minute and you have a total of 8 players on the field, then the process of training could be VERY slow!
Update
Here is a quick reference to a paper that used genetic programming to create "softbots" that play soccer: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.36.136&rep=rep1&type=pdf
With regards to your comments: I was thinking that for the subconscious motor control (SMC), the signals would come from the conscious decision maker (CDM). This way you're evolving your SMC agent to properly handle the CDM agent's commands (signals). You want to maximize the up-time of the SMC agent regardless of what the CDM agent says.
The SMC agent receives an input, for example a vector force on a joint, and then runs it through its processing unit to determine whether it should execute that input or reject it. The SMC should only execute inputs that it "thinks" it can recover from, and it should reject inputs that it "thinks" would lead to a "catastrophic failure".
Now the SMC agent has an output: accept or reject a signal (1 or 0). The CDM can use that signal for its own training... the CDM wants to maximize the number of signals that the SMC accepts, and it also wants to satisfy a goal: a high score for its own team and a low score for the opposing team. So the CDM has its own processing unit that is being evolved to satisfy both of those needs. Your reference provided a 3-layer design, while mine is only a 2-layer one... I think mine was a step in the right direction towards the 3-layer design.
One more thing to note here: is falling really a "catastrophic failure"? What if your robot falls, but the CDM makes it stand up again? I think that would be a valid behavior, so you shouldn't penalize the robot for falling... perhaps a better thing to do is penalize it for the amount of time it takes in order to perform a goal (not necessarily a soccer goal).
There is this tutorial on humanoid locomotion control that describes the software stack used on the HRP-4 humanoid (which can walk or climb stairs). It consists mainly of:
Linear inverted pendulum: a simplified model for balancing. It involves only the center of mass (COM) and the ZMP already mentioned in other answers (see the note after this list).
Trajectory optimization: the robot computes what it wants to do, ideally, for the next 2 seconds or so. It keeps recomputing this trajectory as it moves, which is known as model predictive control.
Balance control: the last stage that corrects the robot's posture based on sensor measurements and the desired trajectory.
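As a flavour of the first item (the standard textbook form, not taken from that tutorial): with the centre of mass kept at a constant height $h$, the linear inverted pendulum relates the horizontal COM position $c$ and the ZMP position $p$ by $\ddot{c} = \frac{g}{h}\,(c - p)$, so the COM accelerates away from wherever the ZMP is placed, and balance control comes down to choosing $p$ within the foot's support area so that $c$ follows the planned trajectory.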
Follow links to the academic papers and source code to learn more.