I'd like to build an autonomous ship in a virtual environment using DDPG.
However, the problem is that there's an action space of (-180', +180') for steering, and DDPG would be able to choose -180' at (t-1) and +180' at (t+1), which is impossible in the real world. (basically, you can't rotate a steering wheel that fast.)
The possible solution that I thought was this.
Set a maximum steering rate (e.g. 10' per step)
If the taken action gets out of an available action range of (current_steeringWheel_angle - 10', current_steeringWheel_angle + 10'), change the taken action to the end value in the available action range
Take a step with the changed action in the virtual environment.
(1st option) update the DDPG with the changed action.
(2nd option) update the DDPG with the originally taken action.
I think I found the solution.
1st reference :
(src: https://stats.stackexchange.com/questions/378008/how-to-handle-a-changing-action-space-in-reinforcement-learning/378025#378025?newreg=09ef385b87a54f27b5011f983dbf0270)
2nd reference (basically, it'stalking about the same thing as above.) :
https://stats.stackexchange.com/questions/328835/enforcing-game-rules-in-alpha-go-zero
Related
I'm a beginner in RL and want to know what is the advantage of having a state value function as well as an action-value function in RL algorithms, for example, Markov Design Process. What is the use of having both of them in prediction and control problems?
I think you mean state-value function and state-action-value function.
Quoting this answer by James MacGlashan:
To explain, lets first add a point of clarity. Value functions
(either V or Q) are always conditional on some policy π. To emphasize
this fact, we often write them as ππ(π ) and ππ(π ,π). In the
case when weβre talking about the value functions conditional on the
optimal policy πβ, we often use the shorthand πβ(π ) and πβ(π ,π).
Sometimes in literature we leave off the π or * and just refer to V
and Q, because itβs implicit in the context, but ultimately, every
value function is always with respect to some policy.
Bearing that in mind, the definition of these functions should clarify
the distinction for you.
ππ(π ) expresses the expected value of following policy π forever
when the agent starts following it from state π .
ππ(π ,π) expresses the expected value of first taking action π
from state π and then following policy π forever.
The main difference then, is the Q-value lets you play a hypothetical
of potentially taking a different action in the first time step than
what the policy might prescribe and then following the policy from the
state the agent winds up in.
For example, suppose in state π Iβm one step away from a terminating
goal state and I get -1 reward for every transition until I reach the
goal. Suppose my policy is the optimal policy so that it always tells
to me walk toward the goal. In this case, ππ(π )=β1 because Iβm just
one step away. However, if I consider the Q-value for an action π
that walks 1 step away from the goal, then ππ(π ,π)=β3 because
first I walk 1 step away (-1), and then I follow the policy which will
now take me two steps to get to the goal: one step to get back to
where I was (-1), and one step to get to the goal (-1), for a total of
-3 reward.
Learner might be in training stage, where it update Q-table for bunch of epoch.
In this stage, Q-table would be updated with gamma(discount rate), learning rate(alpha), and action would be chosen by random action rate.
After some epoch, when reward is getting stable, let me call this "training is done". Then do I have to ignore these parameters(gamma, learning rate, etc) after that?
I mean, in training stage, I got an action from Q-table like this:
if rand_float < rar:
action = rand.randint(0, num_actions - 1)
else:
action = np.argmax(Q[s_prime_as_index])
But after training stage, Do I have to remove rar, which means I have to get an action from Q-table like this?
action = np.argmax(self.Q[s_prime])
Once the value function has converged (values stop changing), you no longer need to run Q-value updates. This means gamma and alpha are no longer relevant, because they only effect updates.
The epsilon parameter is part of the exploration policy (e-greedy) and helps ensure that the agent visits all states infinitely many times in the limit. This is an important factor in ensuring that the agent's value function eventually converges to the correct value. Once we've deemed the value function converged however, there's no need to continue randomly taking actions that our value function doesn't believe to be best; we believe that the value function is optimal, so we extract the optimal policy by greedily choosing what it says is the best action in every state. We can just set epsilon to 0.
Although the answer provided by #Nick Walker is correct, here it's some additional information.
What you are talking about is closely related with the concept technically known as "exploration-exploitation trade-off". From Sutton & Barto book:
The agent has to exploit what it already knows in order to obtain
reward, but it also has to explore in order to make better action
selections in the future. The dilemma is that neither exploration nor
exploitation can be pursued exclusively without failing at the task.
The agent must try a variety of actions and progressively favor those
that appear to be best.
One way to implement the exploration-exploitation trade-off is using epsilon-greedy exploration, that is what you are using in your code sample. So, at the end, once the agent has converged to the optimal policy, the agent must select only those that exploite the current knowledge, i.e., you can forget the rand_float < rar part. Ideally you should decrease the epsilon parameters (rar in your case) with the number of episodes (or steps).
On the other hand, regarding the learning rate, it worths noting that theoretically this parameter should follow the Robbins-Monro conditions:
This means that the learning rate should decrease asymptotically. So, again, once the algorithm has converged you can (or better, you should) safely ignore the learning rate parameter.
In practice, sometimes you can simply maintain a fixed epsilon and alpha parameters until your algorithm converges and then put them as 0 (i.e., ignore them).
I've installed the Proton GE and preformed a simple condition verification on an input event.
My goal is to verify more complex conditions. For example: If the rain level on a period of 48 hours exceeds a limit.
How can I define this verification? Can someone show me an example?
Thank you
please refer to the fraud sample : https://github.com/ishkin/Proton/tree/master/documentation/sample/fraud
It demonstrates more complex situations and the appropriate definitions. The folder containts the description and explanation of the application: https://github.com/ishkin/Proton/blob/master/documentation/sample/fraud/SuspiciousAccountExample.pdf, and the appropriate artifacts.
On a high level - you need to define a temporal context, lasting 48 hours (you need to decide when this context begins - usually some initiator event indicates the beginning of the temporal window) and you need to define an EPA - if its just filtering out an event based on threshold it can be EPA of type "Basic" with a threshold condition.
I'm trying to implement an Inertial Navigation System using an Indirect Kalman Filter. I've found many publications and thesis on this topic, but not too much code as example. For my implementation I'm using the Master Thesis available at the following link:
https://fenix.tecnico.ulisboa.pt/downloadFile/395137332405/dissertacao.pdf
As reported at page 47, the measured values from inertial sensors equal the true values plus a series of other terms (bias, scale factors, ...).
For my question, let's consider only bias.
So:
Wmeas = Wtrue + BiasW (Gyro meas)
Ameas = Atrue + BiasA. (Accelerometer meas)
Therefore,
when I propagate the Mechanization equations (equations 3-29, 3-37 and 3-41)
I should use the "true" values, or better:
Wmeas - BiasW
Ameas - BiasA
where BiasW and BiasA are the last available estimation of the bias. Right?
Concerning the update phase of the EKF,
if the measurement equation is
dzV = VelGPS_est - VelGPS_meas
the H matrix should have an identity matrix in corrispondence of the velocity error state variables dx(VEL) and 0 elsewhere. Right?
Said that I'm not sure how I have to propagate the state variable after update phase.
The propagation of the state variable should be (in my opinion):
POSk|k = POSk|k-1 + dx(POS);
VELk|k = VELk|k-1 + dx(VEL);
...
But this didn't work. Therefore I've tried:
POSk|k = POSk|k-1 - dx(POS);
VELk|k = VELk|k-1 - dx(VEL);
that didn't work too... I tried both solutions, even if in my opinion the "+" should be used. But since both don't work (I have some other error elsewhere)
I would ask you if you have any suggestions.
You can see a snippet of code at the following link: http://pastebin.com/aGhKh2ck.
Thanks.
The difficulty you're running into is the difference between the theory and the practice. Taking your code from the snippet instead of the symbolic version in the question:
% Apply corrections
Pned = Pned + dx(1:3);
Vned = Vned + dx(4:6);
In theory when you use the Indirect form you are freely integrating the IMU (that process called the Mechanization in that paper) and occasionally running the IKF to update its correction. In theory the unchecked double integration of the accelerometer produces large (or for cheap MEMS IMUs, enormous) error values in Pned and Vned. That, in turn, causes the IKF to produce correspondingly large values of dx(1:6) as time evolves and the unchecked IMU integration runs farther and farther away from the truth. In theory you then sample your position at any time as Pned +/- dx(1:3) (the sign isn't important -- you can set that up either way). The important part here is that you are not modifying Pned from the IKF because both are running independent from each other and you add them together when you need the answer.
In practice you do not want to take the difference between two enourmous double values because you will lose precision (because many of the bits of the significand were needed to represent the enormous part instead of the precision you want). You have grasped that in practice you want to recursively update Pned on each update. However, when you diverge from the theory this way, you have to take the corresponding (and somewhat unobvious) step of zeroing out your correction value from the IKF state vector. In other words, after you do Pned = Pned + dx(1:3) you have "used" the correction, and you need to balance the equation with dx(1:3) = dx(1:3) - dx(1:3) (simplified: dx(1:3) = 0) so that you don't inadvertently integrate the correction over time.
Why does this work? Why doesn't it mess up the rest of the filter? As it turns out, the KF process covariance P does not actually depend on the state x. It depends on the update function and the process noise Q and so on. So the filter doesn't care what the data is. (Now that's a simplification, because often Q and R include rotation terms, and R might vary based on other state variables, etc, but in those cases you are actually using state from outside the filter (the cumulative position and orientation) not the raw correction values, which have no meaning by themselves).
I'm trying to understand and implement the simplest turing machine and would like feedback if I'm making sense.
We have a infinite tape (lets say an array called T with pointer at 0 at the start) and instruction table:
( S , R , W , D , N )
S->STEP (Start at step 1)
R->READ (0 or 1)
W->WRITE (0 or 1)
D->DIRECTION (0=LEFT 1=RIGHT)
N->NEXTSTEP (Non existing step is HALT)
My understanding is that a 3-state 2-symbol is the simplest machine. 3-state i don't understand. 2-symbol because we use 0 and 1 for READ/WRITE.
For example:
(1,0,1,1,2)
(1,1,0,1,2)
Starting at step 1, if Read is 0 then { Write 1, Move Right) else {Write 0, Move Right) and then go to step 2 - which does not exist which halts program.
What does 3-state mean? Does this machine pass as turing machine? Can we simplify more?
I think the confusion might come from your use of "steps" instead of "states". You can think of a machine's state as the value it has in its memory (although as a previous poster noted, some people also take a machine's state to include the contents of the tape -- however, I don't think that definition is relevant to your question). It's possible that this change in terminology might be at the heart of your confusion. Let me explain what I think it is you're thinking. :)
You gave lists of five numbers -- for example, (1,0,1,1,2). As you correctly state, this should be interpreted (reading from left to right) as "If the machine is in state 1 AND the current square contains a 0, print a 1, move right, and change to state 2." However, your use of the word "step" seems to suggest that that "step 2" must be followed by "step 3", etc., when in reality a Turing machine can go back and forth between states (and of course, there can only be finitely many possible states).
So to answer your questions:
Turing machines keep track of "states" not "steps";
What you've described is a legitimate Turing machine;
A simpler (albeit otherwise uninteresting) Turing machine would be one that starts in the HALT state.
Edits: Grammar, Formatting, and removed a needless description of Turing machines.
Response to comment:
Correct me if I'm misinterpreting your comment, but I did not mean to suggest a Turing machine could be in more than one state at a time, only that the number of possible states can be any finite number. For example, for a 3-state machine, you might label the possible states A, B, and C. (In the example you provided, you labeled the two possible states as '1' and '2') At any given time, exactly one of those values (states) would be in the machine's memory. We would say, "the machine is in state A" or "the machine is in state B", etc. (Your machine starts in state '1' and terminates after it enters state '2').
Also, it's no longer clear to me what you mean by a "simpler/est" machine. The smallest known Universal Turing machine (i.e., a Turing machine that can simulate another Turing machine, given an appropriate tape) requires 2 states and 5 symbols (see the relevant Wikipedia article).
On the other hand, if you're looking for something simpler than a Turing machine with the same computation power, Post-Turing machines might be of interest.
I believe that the concept of state is basically the same as in Finite State Machines. If I recall, you need a separate termination state, to which the turing machine can transition after it has finished running the program. As for why 3 states I'd guess that the other two states are for intialisation and execution respectively.
Unfortunately none of that is guaranteed to be correct, but I thought I'd post my thoughts anyway since the question was unanswered for 5 hours. I suspect if you were to re-ask this question on cstheory.stackexchange.com you might get a better/more definative answer.
"State" in the context of Turing machines should be clarified as to which is being described: (i) the current instruction, or (ii) the list of symbols on the tape together with the current instruction, or (iii) the list of symbols on the tape together with the current instruction placed to the left of the scanned symbol or to the right of the scanned symbol. Reference