I have a sense that one step task of reinforcement learning is essentially the same with some optimisation algorithms.
For example, suppose there is only one parameter α and we try to optimise y using gradient descent for optimisation, then in each iteration(or step), α is actually moving slightly towards the direction with δy. The step is exactly the same in reinforcement learning, where δy is named as temporal difference and y is the value of that state S(a).
So, I wonder for 1 step reinforcement learning problems, is it actually a optimisation method, or can it be used to optimise parameters?(based on the context above)
I might have some misunderstanding on this, welcome to correctify.
First of all, reinforcement learning is very general. Almost any optimization problem can be transformed into a RL problem. It's usually not worth it, because a RL agent would select sub-optimal actions, doing trial and error just to confirm things you already know by design.
To your question: I think the similarity you found is that both algorithms make use of a (noisy) gradient step. Temporal difference is just one RL method of many. If I remember correctly it calculates the difference between the predicted value and the (noisy) value estimate made with the observed reward. It cannot simply set the correct value, because in general there is a complicated dependency between the values of other states, so instead it makes just one a small step to reduce the difference.
Sure, you could set up a RL task somehow to optimize reward = y(α). Now α can either be the agent's "state", in which case you need actions decrement or increment it (you learn state-values) or α can be the action in which case there is only a single state (you learn action-values). With the right exploration strategy it might even work if you are patient. But in both cases you waste your knowledge about the gradient δy(α)/δα because the RL algorithm does not know about it. Yes it takes gradient-steps, but those gradients reduce the difference between the learned value and the actual value. If the true values are exactly the rewards (which is true if the agent dies after one step, and if there is no randomness when you evaluate y(α)) then this is wasted effort. Instead of taking a small step to smooth out the non-existing influence on other states, you could have just set it to the true value directly.
You mentioned "one-step reinforcement learning": what comes to mind is the contextual bandit setup. It's a simplification of the full-blown RL setup where your actions do not influence the next state (=context). The next simplification is the multi-armed bandit, which only has actions but no state/context.
Related
I think the title says it. A "game" takes a number of moves to complete, at which point a total score is computed. The goal is to maximize this score, and there are no rewards provided for specific moves during the game. Is there an existing algorithm that is geared toward this type of problem?
EDIT: By "continuously variable" reward, I mean it is a floating point number, not a win/loss binary. So you can't, for example, respond to "winning" by reinforcing the moves made to get there. All you have is a number. You can rank different runs in order of preference, but a single result is not especially meaningful.
First of all, in my opinion, the title of your question seems a little confusing when you talk about "continuously variable reward". Maybe you could clarify this aspect.
On the other hand, without taking into account the previous point, it looks your are talking about the temporal credit-assigment problem: How do you distribute credit for a sequence of actions which only obtain a reward (positive or negative) at the end of the sequence?
E.g., a Tic-tac-toe game where the agent doesn't recive any reward until the game ends. In this case, almost any RL algorithm tries to solve the temporal credit-assigment problem. See, for example, Section 1.5 of Sutton and Barto RL book, where they explain the working principles of RL and its advantages over other approaches using as example a Tic-tac-toe game.
I read several materials about deep q-learning and I'm not sure if I understand it completely. From what I learned, it seems that Deep Q-learning calculates faster the Q-values rather than putting them on a table by using NN to perform a regression, calculating loss and backpropagating the error to update the weights. Then, in a testing scenario, it takes a state and the NN will return several Q-values for each action possible for that state. Then, the action with the highest Q-value will be chosen to be done at that state.
My only question is how the weights are updated. According to this site the weights are updated as follows:
I understand that the weights are initialized randomly, R is returned by the environment, gamma and alpha are set manually, but I dont understand how Q(s',a,w) and Q(s,a,w) are initialized and calculated. Does it seem that we should build a table of Q-values and update them as with Q-learning or they are calculated automatically at each NN training epoch? what I am not understanding here? can somebody explain to me better such an equation?
In Q-Learning, we are concerned with learning the Q(s, a) function which is a mapping between a state to all actions. Say you have an arbitrary state space and an action space of 3 actions, each of these states will compute to three different values, each an action. In tabular Q-Learning, this is done with a physical table. Consider the following case:
Here, we have a Q table for each state in the game (upper left). And after each time step, the Q value for that specific action is updated according to some reward signal. The reward signal can be discounted by some value between 0 and 1.
In Deep Q-Learning, we disregard the use of tables and create a parametrized "table" such as this:
Here, all of the weights will form combinations given on the input that should appromiately match the value seen in the tabular case (Still actively researched).
The equation you presented is the Q-learning update rule set in a gradient update rule.
alpha is the step-size
R is the reward
Gamma is the discounting factor
You do inference of the network to retrieve the value of the "discounted future state" and subtract this with the "current" state. If this is unclear, I recommend you to look up boostrapping which is basicly what is happening here.
So in Q learning, you update the Q function by Qnew(s,a) = Q(s,a) + alpha(r + gamma*MaxQ(s',a) - Q(s,a).
Now, if I were to use the same principle but change Q to V function, instead of performing the action based on the current V function, you actually perform all actions (assuming you can reset the simulated environment), and select the best action out of those, and update the V function for that state. Would this yield a better result?
Of course, the training time would probably increase because you actually do all the actions once for each update, but since you are guaranteed to select the best action each time (except when exploring), it would give you a global optimum policy in the end?
This is a bit similar to value iteration, except I'm don't have and not building a model for the problem.
Now, if I were to use the same principle but change Q to V function, instead of performing the action based on the current V function, you actually perform all actions (assuming you can reset the simulated environment), and select the best action out of those, and update the V function for that state. Would this yield a better result?
It is typically assumed in Reinforcement Learning that we do not have the ability to reset the (simulated) environment. Sure, when we're working on simulations it often may technically be possible, but generally we hope that work in RL can also extend to "real-world" problems outside of simulations afterwards, where that would no longer be possible.
If you do have that possibility, it would generally be recommended to look into search algorithms like Monte-Carlo Tree Search, rather than Reinforcement Learning like Sarsa, Q-learning, etc. I suspect your suggestion might work slightly better than Q-learning indeed in this case, but things like MCTS would be even better.
Now, if I were to use the same principle but change Q to V function, instead of performing the action based on the current V function, you actually perform all actions (assuming you can reset the simulated environment), and select the best action out of those, and update the V function for that state. Would this yield a better result?
Given that you don't have access to the model, you have to resort to model free methods. What you are suggesting is basically a Dynamics programming backup. See the slides 28 - 31 in David Silver's lecture notes for various backup strategies to iterate on the value function.
However, note that this is just for prediction (i.e. estimating the value function for a given policy) and not for control (figuring out the best policy). There won't be a Max involved in prediction. To do control, you can use the above policy evaluation + greedy policy improvement to arrive at a "policy iteration based on Dynamic prog backup policy evaluation" method.
The other options for model-free control are SARSA [+ greedy policy improvement] (on policy) and Q-learning (off-policy). These are Q-function based methods, though.
If you are just trying to win the game, and not necessarily interested in RL techniques discussed above, then you also have the choice of using purely planning based methods (like Monte Carlo Tree Search). Finally, you can combine planning and learning with methods such as Dyna.
I'm tackling an interesting machine learning problem and would love to hear if anyone knows a good algorithm to deal with the following:
The algorithm must learn to approximate a function of N inputs and M outputs
N is quite large, e.g. 1,000-10,000
M is quite small, e.g. 5-10
All inputs and outputs are floating point values, could be positive or negative, likely to be relatively small in absolute value but no absolute guarantees on bounds
Each time period I get N inputs and need to predict the M outputs, at the end of the time period the actual values for the M outputs are provided (i.e. this is a supervised learning situation where learning needs to take place online)
The underlying function is non-linear, but not too nasty (e.g. I expect it will be smooth and continuous over most of the input space)
There will be a small amount of noise in the function, but signal/noise is likely to be good - I expect the N inputs will expain 95%+ of the output values
The underlying function is slowly changing over time - unlikely to change drastically in a single time period but is likely to shift slightly over the 1000s of time periods range
There is no hidden state to worry about (other than the changing function), i.e. all the information required is in the N inputs
I'm currently thinking some kind of back-propagation neural network with lots of hidden nodes might work - but is that really the best approach for this situation and will it handle the changing function?
With your number of inputs and outputs, I'd also go for a neural network, it should do a good approximation. The slight change is good for a back-propagation technique, it should not have to 'de-learn' stuff.
I think stochastic gradient descent (http://en.wikipedia.org/wiki/Stochastic_gradient_descent) would be a straight forward first step, it will probably work nicely given the operating conditions you have.
I'd also go for an ANN. Single layer might do fine since your input space is large. You might wanna give it a shot before adding a lot of hidden layers.
#mikera What is it going to be used for? Is it an assignment in a ML course?
Here's the background... in my free time I'm designing an artillery warfare game called Staker (inspired by the old BASIC games Tank Wars and Scorched Earth) and I'm programming it in MATLAB. Your first thought might be "Why MATLAB? There are plenty of other languages/software packages that are better for game design." And you would be right. However, I'm a dork and I'm interested in learning the nuts and bolts of how you would design a game from the ground up, so I don't necessarily want to use anything with prefab modules. Also, I've used MATLAB for years and I like the challenge of doing things with it that others haven't really tried to do.
Now to the problem at hand: I want to incorporate AI so that the player can go up against the computer. I've only just started thinking about how to design the algorithm to choose an azimuth angle, elevation angle, and projectile velocity to hit a target, and then adjust them each turn. I feel like maybe I've been overthinking the problem and trying to make the AI too complex at the outset, so I thought I'd pause and ask the community here for ideas about how they would design an algorithm.
Some specific questions:
Are there specific references for AI design that you would suggest I check out?
Would you design the AI players to vary in difficulty in a continuous manner (a difficulty of 0 (easy) to 1 (hard), all still using the same general algorithm) or would you design specific algorithms for a discrete number of AI players (like an easy enemy that fires in random directions or a hard enemy that is able to account for the effects of wind)?
What sorts of mathematical algorithms (pseudocode description) would you start with?
Some additional info: the model I use to simulate projectile motion incorporates fluid drag and the effect of wind. The "fluid" can be air or water. In air, the air density (and thus effect of drag) varies with height above the ground based on some simple atmospheric models. In water, the drag is so great that the projectile usually requires additional thrust. In other words, the projectile can be affected by forces other than just gravity.
In a real artillery situation all these factors would be handled either with formulas or simply brute-force simulation: Fire an electronic shell, apply all relevant forces and see where it lands. Adjust and try again until the electronic shell hits the target. Now you have your numbers to send to the gun.
Given the complexity of the situation I doubt there is any answer better than the brute-force one. While you could precalculate a table of expected drag effects vs velocity I can't see it being worthwhile.
Of course a game where the AI dropped the first shell on your head every time wouldn't be interesting. Once you know the correct values you'll have to make the AI a lousy shot. Apply a random factor to the shot and then walk to towards the target--move it say 30+random(140)% towards the true target each time it shoots.
Edit:
I do agree with BCS's notion of improving it as time goes on. I said that but then changed my mind on how to write a bunch of it and then ended up forgetting to put it back in. The tougher it's supposed to be the smaller the random component should be.
Loren's brute force solution is appealing as because it would allow easy "Intelligence adjustments" by adding more iterations. Also the adjustment factors for the iteration could be part of the intelligence as some value will make it converge faster.
Also for the basic system (no drag, wind, etc) there is a closed form solution that can be derived from a basic physics text. I would make the first guess be that and then do one or more iteration per turn. You might want to try and come up with an empirical correction correlation to improve the first shot (something that will make the first shot distributions average be closer to correct)
Thanks Loren and BCS, I think you've hit upon an idea I was considering (which prompted question #2 above). The pseudocode for an AIs turn would look something like this:
nSims; % A variable storing the numbers of projectile simulations
% done per turn for the AI (i.e. difficulty)
prevParams; % A variable storing the previous shot parameters
prevResults; % A variable storing some measure of accuracy of the last shot
newParams = get_new_guess(prevParams,prevResults);
loop for nSims times,
newResults = simulate_projectile_flight(newParams);
newParams = get_new_guess(newParams,newResults);
end
fire_projectile(newParams);
In this case, the variable nSims is essentially a measure of "intelligence" for the AI. A "dumb" AI would have nSims=0, and would simply make a new guess each turn (based on results of the previous turn). A "smart" AI would refine its guess nSims times per turn by simulating the projectile flight.
Two more questions spring from this:
1) What goes into the function get_new_guess? How should I adjust the three shot parameters to minimize the distance to the target? For example, if a shot falls short of the target, you can try to get it closer by adjusting the elevation angle only, adjusting the projectile velocity only, or adjusting both of them together.
2) Should get_new_guess be the same for all AIs, with the nSims value being the only determiner of "intelligence"? Or should get_new_guess be dependent on another "intelligence" parameter (like guessAccuracy)?
A difference between artillery games and real artillery situations is that all sides have 100% information, and that there are typically more than 2 opponents.
As a result, your evaluation function should consider which opponent it would be more urgent to try and eliminate. For example, if I have an easy kill at 90%, but a 50% chance on someone who's trying to kill me and just missed two shots near me, it's more important to deal with that chance.
I think you would need some way of evaluating the risk everyone poses to you in terms of ammunition, location, activity, past history, etc.
I'm now addressing the response you posted:
While you have the general idea I don't believe your approach will be workable--it's going to converge way too fast even for a low value of nSims. I doubt you want more than one iteration of get_new_guess between shells and it very well might need some randomizing beyond that.
Even if you can use multiple iterations they wouldn't be good at making a continuously increasing difficulty as they will be big steps. It seems to me that difficulty must be handled by randomness.
First, get_initial_guess:
To start out I would have a table that divides the world up into zones--the higher the difficulty the more zones. The borders between these zones would have precalculated power for 45, 60 & 75 degrees. Do a test plot, if a shell smacks terrain try again at a higher angle--if 75 hits terrain use it anyway.
The initial shell should be fired at a random power between the values given for the low and high bounds.
Now, for get_new_guess:
Did the shell hit terrain? Increase the angle. I think there will be a constant ratio of how much power needs to be increased to maintain the same distance--you'll need to run tests on this.
Assuming it didn't smack a mountain, note if it's short or long. This gives you a bound. The new guess is somewhere between the two bounds (if you're missing a bound, use the value from the table in get_initial_guess in it's place.)
Note what percentage of the way between the low and high bound impact points the target is and choose a power that far between the low and high bound power.
This is probably far too accurate and will likely require some randomizing. I've changed my mind about adding a simple random %. Rather, multiple random numbers should be used to get a bell curve.
Another thought: Are we dealing with a system where only one shell is active at once? Long ago I implemented an artillery game where you had 5 barrels, each with a fixed reload time that was above the maximum possible flight time.
With that I found myself using a strategy of firing shells spread across the range between my current low bound and high bound. It's possible that being a mere human I wasn't using an optimal strategy, though--this was realtime, getting a round off as soon as the barrel was ready was more important than ensuring it was aimed as well as possible as it would converge quite fast, anyway. I would generally put a shell on target on the second salvo and the third would generally all be hits. (A kill required killing ALL pixels in the target.)
In an AI situation I would model both this and a strategy of holding back some of the barrels to fire more accurate rounds later. I would still fire a spread across the target range, the only question is whether I would use all barrels or not.
I have personally created such a system - for the web-game Zwok, using brute force. I fired lots of shots in random directions and recorded the best result. I wouldn't recommend doing it any other way as the difference between timesteps etc will give you unexpected results.