I am learning about Markov decision processes, and I don't know where to mark the terminal states.
In the 4x3 grid world, I marked the terminal states that I think are correct (I might be wrong) with T.
I saw an instruction that marks the terminal states as follows:
terminals=[(3, 2), (3, 1)]
Can someone explain how this works?
In the given grid world, you start at "start", which is (0,0). Then you move around from that point. If you reach "end +1" at (3,2), the reward is +1 and the game ends. Likewise, if you reach "end -1" at (3,1), the reward is -1 and the game ends. However, while you are moving around, you can't move into (1,1), as it is an invalid state. Also, if you reach either of the terminal states "T", which are at (2,0) and (2,1), the game ends with zero reward.
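To make that concrete, here is a minimal sketch of how such a terminals list might be wired up. The coordinates are assumed to be (column, row) with (0, 0) at the bottom-left, and the names terminals, rewards, blocked, step_outcome and is_valid are just illustrative, not from any particular library:

    terminals = [(3, 2), (3, 1)]            # reaching either of these ends the episode
    rewards = {(3, 2): +1.0, (3, 1): -1.0}  # "end +1" and "end -1"
    blocked = {(1, 1)}                      # the invalid (wall) square

    def step_outcome(state):
        """Reward and whether the episode ends when the agent arrives in `state`."""
        if state in terminals:
            return rewards[state], True
        return 0.0, False

    def is_valid(state):
        col, row = state
        return 0 <= col < 4 and 0 <= row < 3 and state not in blocked

The point is simply that the episode ends, with the corresponding reward, whenever the agent arrives at one of the listed terminal coordinates.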
I have implemented a Q-learning algorithm in which the agent tries to travel as far as possible. I am using instantaneous rewards as well as a final episode reward. When the agent collides, I give it a large negative collision reward and I do not stop the episode. Is it OK to do it like this, or must the episode be ended once the agent collides?
In my case I have defined a minimum reward threshold; if the cumulative reward drops below it, I end the episode.
Case 1: End episode on invalid action
If you end the game before penalizing an invalid move, there is no way for the network to understand that the move was invalid.
Case 2: End the episode after N invalid actions
This gives the agent room to take a few invalid actions before the episode ends. It's analogous to playing a game: you have N lives to beat the level or you lose the game.
Case 3: Not ending the episode at all after invalid actions
This may cause the agent to get lost in the environment, sometimes doing only invalid actions, so you need a good termination condition to stop the episode; see the sketch below.
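For what it's worth, here is a rough sketch of how the reward threshold and the "N lives" idea could be combined. The env/agent interface, the "invalid" flag, and the specific reward numbers are placeholders, not a real API:

    MAX_INVALID = 5          # the "N lives" of Case 2
    MIN_RETURN = -100.0      # the minimum-reward threshold mentioned above

    def run_episode(env, agent):
        state = env.reset()
        total_reward, invalid_count, done = 0.0, 0, False
        while not done:
            action = agent.act(state)
            next_state, reward, done, info = env.step(action)
            if info.get("invalid", False):    # assumes the env reports invalid moves
                invalid_count += 1
                reward = -10.0                # penalize the move but keep playing
            total_reward += reward
            agent.learn(state, action, reward, next_state, done)
            state = next_state
            if invalid_count >= MAX_INVALID or total_reward < MIN_RETURN:
                done = True                   # extra termination conditions
        return total_reward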
Hope this helps
I have a question about my own project for testing reinforcement learning techniques. First let me explain the purpose. I have an agent that can take 4 actions over 8 steps. At the end of these eight steps, the agent can be in 5 possible victory states. The goal is to find the minimum cost. To reach these 5 victory states (with different cost values: 50, 50, 0, 40, 60), the agent doesn't take the same path (like a graph). The blue states are the fail states (sorry for the quality), and the episode stops there.
The actual optimal path is: DCCBBAD
Now my question: I don't understand why in SARSA & Q-learning (mainly in Q-learning) the agent finds a path, but not the optimal one, after 100,000 iterations (always DACBBAD or DACBBCD). Sometimes when I run it again, the agent does find the good path (DCCBBAD). So I would like to understand why the agent sometimes finds it and sometimes does not. Is there anything I can look at in order to stabilize my agent?
Thank you a lot,
Tanguy
TL;DR:
Set your epsilon so that you explore a bunch for a large number of episodes. E.g. Linearly decaying from 1.0 to 0.1.
Set your learning rate to a small constant value, such as 0.1.
Don't stop your algorithm based on number of episodes but on changes to the action-value function.
More detailed version:
Q-learning is only guaranteed to converge under the following conditions:
You must visit all state-action pairs infinitely often.
The sum of all the learning rates over all timesteps must be infinite, that is, ∑_t α_t = ∞.
The sum of the squares of all the learning rates over all timesteps must be finite, that is, ∑_t α_t² < ∞.
To hit 1, just make sure your epsilon is not decaying to a low value too early. Make it decay very, very slowly, and perhaps never all the way to 0. You can try a schedule like ε_t = 1/t, too.
To hit 2 and 3, you must ensure you take care of 1, so that you collect infinitely many learning-rate updates, but you must also pick your learning rate so that the sum of its squares is finite. That basically means α ≤ 1. If your environment is deterministic, you should try α = 1. A deterministic environment here means that when you take an action a in a state s, you always transition to the same state s', for all states and actions in your environment. If your environment is stochastic, you can try a low number, such as 0.05 to 0.3.
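As a concrete (and simplified) illustration of the schedule above, here is a tabular Q-learning sketch with a linearly decaying epsilon, a constant alpha, and a stopping test based on changes to the action-value table. The env interface (reset/step returning next state, reward, done) is an assumption, not a given library:

    import random
    from collections import defaultdict

    def q_learning(env, actions, episodes=100_000, alpha=0.1, gamma=0.99, tol=1e-6):
        Q = defaultdict(float)
        for ep in range(episodes):
            epsilon = max(0.1, 1.0 - ep / episodes)        # linear decay from 1.0, floored at 0.1
            state, done, max_delta = env.reset(), False, 0.0
            while not done:
                if random.random() < epsilon:
                    action = random.choice(actions)                         # explore
                else:
                    action = max(actions, key=lambda a: Q[(state, a)])      # exploit
                next_state, reward, done = env.step(action)
                best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
                delta = alpha * (reward + gamma * best_next - Q[(state, action)])
                Q[(state, action)] += delta
                max_delta = max(max_delta, abs(delta))
                state = next_state
            if max_delta < tol:        # stop on changes to Q, not on the episode count
                break
        return Q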
Maybe check out https://youtu.be/wZyJ66_u4TI?t=2790 for more info.
I was testing SARSA with lambda = 1 on the Windy Grid World, and if exploration causes the same state-action pair to be visited many times before reaching the goal, the eligibility trace gets incremented each time without any decay, so it explodes and causes everything to overflow.
How can this be avoided?
If I've understood your question correctly, the problem is that the trace for a given state-action pair gets incremented too much. In this case, a potential solution is to use replacing traces instead of the classic accumulating traces.
The idea in replacing traces is to reset the trace to a fixed value (typically 1) each time the state is visited, instead of adding 1 to whatever is already there, so repeated visits cannot make the trace grow without bound.
You can find more information in the classic Sutton & Barto book Reinforcement Learning: An Introduction, specifically in Section 7.8.
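As a rough illustration (not the book's exact pseudocode), here is how the two update rules differ for a tabular SARSA(lambda) trace table E keyed by (state, action):

    def decay_traces(E, gamma, lam):
        for key in E:
            E[key] *= gamma * lam          # applied to every entry on every step

    def accumulate(E, state, action):
        # accumulating trace: can grow without bound when gamma * lam = 1
        E[(state, action)] = E.get((state, action), 0.0) + 1.0

    def replace(E, state, action):
        # replacing trace: capped at 1, so repeated visits cannot explode
        E[(state, action)] = 1.0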
I'm trying to understand and implement the simplest Turing machine and would like feedback on whether I'm making sense.
We have an infinite tape (let's say an array called T, with a pointer at 0 at the start) and an instruction table:
( S , R , W , D , N )
S->STEP (Start at step 1)
R->READ (0 or 1)
W->WRITE (0 or 1)
D->DIRECTION (0=LEFT 1=RIGHT)
N->NEXTSTEP (a non-existent step means HALT)
My understanding is that a 3-state, 2-symbol machine is the simplest machine. I don't understand the 3-state part. 2-symbol because we use 0 and 1 for READ/WRITE.
For example:
(1,0,1,1,2)
(1,1,0,1,2)
Starting at step 1: if the Read is 0, then {Write 1, Move Right}, else {Write 0, Move Right}, and then go to step 2, which does not exist, so the program halts.
What does 3-state mean? Does this machine count as a Turing machine? Can we simplify it more?
I think the confusion might come from your use of "steps" instead of "states". You can think of a machine's state as the value it has in its memory (although, as a previous poster noted, some people also take a machine's state to include the contents of the tape; however, I don't think that definition is relevant to your question). It's possible that this change in terminology is at the heart of your confusion. Let me explain what I think you're thinking. :)
You gave lists of five numbers -- for example, (1,0,1,1,2). As you correctly state, this should be interpreted (reading from left to right) as "If the machine is in state 1 AND the current square contains a 0, print a 1, move right, and change to state 2." However, your use of the word "step" seems to suggest that "step 2" must be followed by "step 3", etc., when in reality a Turing machine can go back and forth between states (and of course, there can only be finitely many possible states).
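To illustrate the difference, here is a small sketch (my own, not a standard implementation) that interprets your 5-tuples with the table keyed on (state, symbol) rather than on a sequential step number:

    from collections import defaultdict

    # (state, read) -> (write, direction, next_state); direction: 1 = right, 0 = left
    table = {
        (1, 0): (1, 1, 2),
        (1, 1): (0, 1, 2),
    }

    def run(table, start_state=1, max_steps=100):
        tape = defaultdict(int)      # unbounded tape filled with 0s
        head, state = 0, start_state
        for _ in range(max_steps):
            key = (state, tape[head])
            if key not in table:     # no rule for this (state, symbol): halt
                break
            write, direction, state = table[key]
            tape[head] = write
            head += 1 if direction == 1 else -1
        return state, dict(tape)

With your two rules, the machine writes once, moves right, enters state 2, finds no rule for state 2, and halts, exactly as you described.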
So to answer your questions:
Turing machines keep track of "states" not "steps";
What you've described is a legitimate Turing machine;
A simpler (albeit otherwise uninteresting) Turing machine would be one that starts in the HALT state.
Edits: Grammar, Formatting, and removed a needless description of Turing machines.
Response to comment:
Correct me if I'm misinterpreting your comment, but I did not mean to suggest a Turing machine could be in more than one state at a time, only that the number of possible states can be any finite number. For example, for a 3-state machine, you might label the possible states A, B, and C. (In the example you provided, you labeled the two possible states as '1' and '2') At any given time, exactly one of those values (states) would be in the machine's memory. We would say, "the machine is in state A" or "the machine is in state B", etc. (Your machine starts in state '1' and terminates after it enters state '2').
Also, it's no longer clear to me what you mean by a "simpler/est" machine. The smallest known Universal Turing machine (i.e., a Turing machine that can simulate another Turing machine, given an appropriate tape) requires 2 states and 5 symbols (see the relevant Wikipedia article).
On the other hand, if you're looking for something simpler than a Turing machine with the same computation power, Post-Turing machines might be of interest.
I believe that the concept of state is basically the same as in finite state machines. If I recall correctly, you need a separate termination state, to which the Turing machine can transition after it has finished running the program. As for why 3 states, I'd guess that the other two states are for initialisation and execution respectively.
Unfortunately none of that is guaranteed to be correct, but I thought I'd post my thoughts anyway since the question had been unanswered for 5 hours. I suspect that if you were to re-ask this question on cstheory.stackexchange.com you might get a better/more definitive answer.
"State" in the context of Turing machines should be clarified as to which is being described: (i) the current instruction, or (ii) the list of symbols on the tape together with the current instruction, or (iii) the list of symbols on the tape together with the current instruction placed to the left of the scanned symbol or to the right of the scanned symbol. Reference
Given an arbitrary peg solitaire board configuration, what is the most efficient way to compute any series of moves that results in the "end game" position?
For example, the standard starting position is:
..***..
..***..
*******
***O***
*******
..***..
..***..
And the "end game" position is:
..OOO..
..OOO..
OOOOOOO
OOO*OOO
OOOOOOO
..OOO..
..OOO..
Peg solitaire is described in more detail on Wikipedia; we are considering the "English board" variant.
I'm pretty sure that it is possible to solve any given starting board in just a few seconds on a reasonable computer, say a P4 3GHz.
Currently this is my best strategy:
def solve():
    for every possible move:
        make the move
        if we haven't seen a rotation or flip of this board before:
            solve()
            if solved: return
        undo the move
The Wikipedia article you link to already mentions that there are only 3,626,632 possible board positions, so it is easy for any modern computer to do an exhaustive search of the space.
Your algorithm above is right; the trick is implementing the "haven't seen a rotation or flip of this board before" check, which you can do using a hash table. You probably don't need the "undo the move" line, as a real implementation would pass the board state as an argument to the recursive call, so you would use the stack for storing the state.
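For example, the symmetry check might look something like this sketch, assuming the board is stored as a 7x7 tuple of tuples of characters and using the lexicographically smallest of its 8 rotations/reflections as the hash key (the names here are made up for illustration):

    def rotate(board):
        return tuple(zip(*board[::-1]))              # rotate the 7x7 grid 90 degrees

    def flip(board):
        return tuple(row[::-1] for row in board)     # mirror it left-to-right

    def canonical(board):
        forms, b = [], board
        for _ in range(4):
            forms.append(b)
            forms.append(flip(b))
            b = rotate(b)
        return min(forms)                            # smallest of the 8 symmetric forms

    seen = set()

    def unseen(board):
        key = canonical(board)
        if key in seen:
            return False
        seen.add(key)
        return True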
Also, it is not clear what you might mean by "efficient".
If you want to find all sequences of moves that lead to a winning state, then you need to do the exhaustive search.
If you want to find the shortest sequence then you could use a branch-and-bound algorithm to cut off some search trees early on. If you can come up with a good static heuristic then you could try A* or one of its variants.
Start from the completed state, and walk backwards in time. Each move is a hop that leaves an additional peg on the board.
At every point in time, there may be multiple unmoves you can make, so you'll be generating a tree of moves. Traverse that tree (either depth-first or breadth-first), stopping a branch when it reaches the starting state or no longer has any possible unmoves. Output the list of paths that led to the original starting state.
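A sketch of the unmove generation, assuming the board is a dict mapping (row, col) to 'peg' or 'hole' for the 33 valid squares (the representation and names are just for illustration):

    DIRS = [(0, 1), (0, -1), (1, 0), (-1, 0)]

    def unmoves(board):
        """Yield boards that are one reverse jump earlier in time."""
        for (r, c), value in board.items():
            if value != 'peg':
                continue
            for dr, dc in DIRS:
                over, src = (r + dr, c + dc), (r + 2 * dr, c + 2 * dc)
                # the peg at (r, c) could have jumped from `src` over `over`
                if board.get(over) == 'hole' and board.get(src) == 'hole':
                    new = dict(board)
                    new[(r, c)], new[over], new[src] = 'hole', 'peg', 'peg'
                    yield new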