Is AI Gym's action and state data normalized? - deep-learning

I am trying to implement a DDPG agent to control Gym's Pendulum.
Since I am new to Gym, I was wondering whether the state data collected via env.step(action) is already normalized or whether I should do that manually. Also, should the action be normalized or be in the [-2, 2] range?
Thanks

env.step(action) returns a tuple (observation, reward, done, info). If you mean the data in observation, then the answer is no, it is not normalized. According to the environment's observation space section, it has three coordinates, with values in [-1, 1] for the first two and [-8, 8] for the last one. The action should lie in the [-2, 2] range; values outside it will additionally be clipped to this range.
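If you do want normalized inputs for the networks, a minimal sketch could look like the following. The helper names `normalize_observation` and `scale_action` are made up for illustration, and the sketch assumes the actor ends in a tanh-style output in [-1, 1]; the bounds are the ones quoted above.

```python
import math

# Bounds quoted in the answer above: two coordinates in [-1, 1],
# one in [-8, 8]; the action (torque) lies in [-2, 2].
OBS_HIGH = (1.0, 1.0, 8.0)
ACTION_HIGH = 2.0


def normalize_observation(obs):
    """Scale each observation component into [-1, 1] (hypothetical helper)."""
    return tuple(x / high for x, high in zip(obs, OBS_HIGH))


def scale_action(actor_output):
    """Map an actor output assumed to be in [-1, 1] to a torque in [-2, 2]."""
    clipped = max(-1.0, min(1.0, actor_output))  # guard against overshoot
    return clipped * ACTION_HIGH


obs = (math.cos(0.5), math.sin(0.5), 4.0)
print(normalize_observation(obs))  # the third component becomes 0.5
print(scale_action(0.25))          # 0.5
print(scale_action(3.0))           # 2.0 (clipped first)
```

Whether normalizing helps depends on the agent; here only the last observation component actually changes scale.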

Related

What is the best way to model an environment to force an agent to select `x out of n` choices?

I have an RL problem where I want the agent to make a selection of x out of an array of size n.
I.e. if I have [0, 1, 2, 3, 4, 5] then n = 6 and if x = 3 a valid action could be
[2, 3, 5].
So far I have tried two approaches.
The first uses n scores: output n continuous numbers and select the x highest ones. This works quite OK.
The second uses a MultiDiscrete action with x values, each anywhere from 0 to n-1, and iteratively replaces duplicates.
Is there some other optimal action space I am missing that would force the agent to make unique choices?
Many thanks for your valuable insights and tips in advance! I am happy to try all!
Since reinforcement learning is mostly about interacting with the environment, you can approach it like this:
Your agent chooses its actions one at a time. After each choice, you can either update the set of possible choices by removing the action just chosen (using a temporary action list), or you can update the value of the chosen action (giving a negative reward, that is, punishing the agent for picking it again). I think this could solve your problem.
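A minimal sketch of the first option (the shrinking temporary action list), assuming the policy produces one non-negative score per item. The function name and the sampling rule are illustrative choices, not part of any particular RL library:

```python
import random


def select_without_replacement(scores, x, rng=random):
    """Pick x distinct indices, one at a time, masking earlier picks.

    `scores` are non-negative preferences for each of the n items
    (a stand-in for whatever the policy outputs).
    """
    available = list(range(len(scores)))
    chosen = []
    for _ in range(x):
        weights = [scores[i] for i in available]
        pick = rng.choices(available, weights=weights, k=1)[0]
        chosen.append(pick)
        available.remove(pick)  # the temporary action list shrinks
    return chosen


rng = random.Random(0)
picked = select_without_replacement([0.1, 0.2, 0.3, 0.1, 0.2, 0.1], 3, rng)
print(sorted(picked))  # three distinct indices out of 0..5
```

Because a chosen index is removed before the next draw, duplicates cannot occur, so no repair step is needed afterwards.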

What is the use of having both a state value function and an action value function?

I'm a beginner in RL and want to know what the advantage is of having a state value function as well as an action-value function in RL algorithms, for example in a Markov Decision Process. What is the use of having both of them in prediction and control problems?
I think you mean state-value function and state-action-value function.
Quoting this answer by James MacGlashan:
To explain, lets first add a point of clarity. Value functions
(either V or Q) are always conditional on some policy π. To emphasize
this fact, we often write them as V^π(s) and Q^π(s,a). In the
case when we're talking about the value functions conditional on the
optimal policy π*, we often use the shorthand V*(s) and Q*(s,a).
Sometimes in literature we leave off the π or * and just refer to V
and Q, because it's implicit in the context, but ultimately, every
value function is always with respect to some policy.
Bearing that in mind, the definition of these functions should clarify
the distinction for you.
V^π(s) expresses the expected value of following policy π forever
when the agent starts following it from state s.
Q^π(s,a) expresses the expected value of first taking action a
from state s and then following policy π forever.
The main difference then, is the Q-value lets you play a hypothetical
of potentially taking a different action in the first time step than
what the policy might prescribe and then following the policy from the
state the agent winds up in.
For example, suppose in state s I'm one step away from a terminating
goal state and I get -1 reward for every transition until I reach the
goal. Suppose my policy is the optimal policy so that it always tells
to me walk toward the goal. In this case, V^π(s) = -1 because I'm just
one step away. However, if I consider the Q-value for an action a
that walks 1 step away from the goal, then Q^π(s,a) = -3 because
first I walk 1 step away (-1), and then I follow the policy which will
now take me two steps to get to the goal: one step to get back to
where I was (-1), and one step to get to the goal (-1), for a total of
-3 reward.
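The example in the quote can be checked with a few lines of code. This is only a sketch: the corridor layout (states are non-negative integers with the goal at 0) and the function names are assumptions made here to mirror the description, and no discounting is applied.

```python
GOAL = 0          # terminating goal state
STEP_REWARD = -1  # reward for every transition


def policy(state):
    """Optimal policy for this corridor: always step toward the goal."""
    return -1


def v_pi(state):
    """Return from following the policy from `state` until the goal."""
    total = 0
    while state != GOAL:
        state += policy(state)
        total += STEP_REWARD
    return total


def q_pi(state, action):
    """Return from taking `action` once, then following the policy."""
    next_state = state + action
    return STEP_REWARD + v_pi(next_state)


s = 1  # one step away from the goal
print(v_pi(s))      # -1
print(q_pi(s, -1))  # -1, same as V because this is the policy's action
print(q_pi(s, +1))  # -3, one step away and two steps back
```

Q and V agree for the action the policy would take anyway; they differ only when the first action deviates from the policy.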

How is dividing into minibatches implemented in batch normalization for deeper layers?

Suppose, we have dataset X (2D array), and we divide it into batches X_1, ..., X_k.
Then we normalize each batch, multiply the i-th component of every batch element by a parameter gamma_i, and add beta_i to it.
A batch normalization layer can be repeated several times, and I didn't find anything about how it is implemented deeper in the network.
In the following BN layers, do we use the same division into batches as at the beginning (using the same rows of X as in the first BN layer), just adding new gamma and beta parameters, or do we do it from scratch for every layer's input?
I hope my question is clear.
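One way to picture the setup described in the question is the sketch below. It assumes the usual arrangement in which a minibatch is drawn once and then passes through the whole network, so a deeper BN layer sees the activations of the same rows and only has its own gamma and beta. The toy `dense` layer and all numbers are made up for illustration; this is not how any particular library is implemented.

```python
import math


def batch_norm(batch, gamma, beta, eps=1e-5):
    """Normalize each feature over the rows of `batch`, then scale and shift."""
    n = len(batch)
    n_features = len(batch[0])
    out = [[0.0] * n_features for _ in range(n)]
    for j in range(n_features):
        column = [row[j] for row in batch]
        mean = sum(column) / n
        var = sum((v - mean) ** 2 for v in column) / n
        for i in range(n):
            out[i][j] = gamma[j] * (column[i] - mean) / math.sqrt(var + eps) + beta[j]
    return out


def dense(batch, weights):
    """Toy linear layer: weights[k][j] maps input feature k to output feature j."""
    return [[sum(x * w for x, w in zip(row, col)) for col in zip(*weights)]
            for row in batch]


X_1 = [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]]  # one minibatch of 3 rows
h1 = batch_norm(X_1, gamma=[1.0, 1.0], beta=[0.0, 0.0])  # first BN layer
h2 = dense(h1, weights=[[1.0, 0.5], [0.0, 2.0]])
h3 = batch_norm(h2, gamma=[2.0, 2.0], beta=[1.0, 1.0])   # deeper BN layer
print(len(h3), len(h3[0]))  # 3 2: still the same three rows
```

In this sketch the second BN layer computes its mean and variance from h2, the current activations of the same three rows, rather than from X_1 itself.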

State value and state action values with policy - Bellman equation with policy

I am just getting started with deep reinforcement learning and I am trying to grasp this concept.
I have this deterministic Bellman equation
When I add the stochasticity of the MDP, I get 2.6a
My question is: is this assumption correct? I saw this form of 2.6a without a policy sign on the state value function. To me this does not make sense, because I am using the probabilities of the different next states I could end up in, which I think is the same as saying policy. And if 2.6a is correct, can I then assume that the rest (2.6b and 2.6c) is correct as well? I would then like to write the state-action value function like this:
The reason I am doing it this way is that I would like to work my way from the deterministic case to the non-deterministic case.
I hope someone out there can help on this one!
Best regards Søren Koch
No, the value function V(s_t) does not depend on the policy. You see in the equation that it is defined in terms of an action a_t that maximizes a quantity, so it is not defined in terms of actions as selected by any policy.
In the nondeterministic / stochastic case, you will have that sum over probabilities multiplied by state-values, but this is still independent from any policy. The sum only sums over different possible future states, but every multiplication involves exactly the same (policy-independent) action a_t. The only reason why you have these probabilities is because in the nondeterministic case a specific action in a specific state can lead to one of multiple different possible states. This is not due to policies, but due to stochasticity in the environment itself.
There does also exist such a thing as a value function for policies, and when talking about that a symbol for the policy should be included. But this is typically not what is meant by just "Value function", and also does not match the equation you have shown us. A policy-dependent function would replace the max_{a_t} with a sum over all actions a, and inside the sum the probability pi(s_t, a) of the policy pi selecting action a in state s_t.
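To make the distinction concrete, here is a small sketch of a single backup for one state. The two-action model, the transition probabilities, the rewards and the next-state values are all made-up numbers; only the structure (max over actions versus a sum weighted by the policy) reflects the point above.

```python
# transitions[action] = list of (probability, next_state, reward)
transitions = {
    "left":  [(0.8, "A", 0.0), (0.2, "B", 1.0)],
    "right": [(0.1, "A", 0.0), (0.9, "B", 1.0)],
}
next_values = {"A": 2.0, "B": 5.0}  # current estimates of V(s_{t+1})
gamma = 0.9


def action_value(action):
    """Expected one-step return: the sum runs over next states only."""
    return sum(p * (r + gamma * next_values[s2])
               for p, s2, r in transitions[action])


def optimality_backup():
    """V(s_t) with a max over actions: no policy appears anywhere."""
    return max(action_value(a) for a in transitions)


def policy_backup(pi):
    """V^pi(s_t): the max is replaced by a sum weighted by pi(s_t, a)."""
    return sum(pi[a] * action_value(a) for a in transitions)


print(round(optimality_backup(), 3))                          # about 5.13
print(round(policy_backup({"left": 0.5, "right": 0.5}), 3))   # about 3.835
```

The transition probabilities appear in both backups because they belong to the environment; only `policy_backup` needs `pi`.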
Yes, your assumption is completely right. In the Reinforcement Learning field, a value function is the expected return obtained by starting from a particular state and following a policy π. So yes, strictly speaking, it should be accompanied by the policy sign π.
The Bellman equation basically represents value functions recursively. However, it should be noted that there are two kinds of Bellman equations:
Bellman optimality equation, which characterizes optimal value functions. In this case, the value function is implicitly associated with the optimal policy. This equation has the nonlinear max operator and is the one you have posted. The (optimal) policy dependency is sometimes represented with an asterisk as follows:
Maybe some short texts or papers omit this dependency, assuming it is obvious, but I think any RL textbook should initially include it. See, for example, the books by Sutton & Barto or Busoniu et al.
Bellman equation, which characterizes a value function, in this case associated with any policy π:
In your case, equation 2.6 is based on the Bellman equation, so it should drop the max operator and include the sum over all actions and possible next states. From Sutton & Barto (sorry for the change of notation with respect to your question, but I think it is understandable):
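For reference, a sketch of the two forms in the notation of the second edition of Sutton & Barto, which may differ slightly from the notation the answer originally showed. Here p(s', r | s, a) is the environment's transition model, γ the discount factor, and π(a | s) the probability that the policy selects a in s:

```latex
% Bellman equation for the value function of an arbitrary policy \pi
v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)
           \bigl[ r + \gamma \, v_\pi(s') \bigr]

% Bellman optimality equation (the asterisk marks the optimal policy)
v_*(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a)
         \bigl[ r + \gamma \, v_*(s') \bigr]
```

The only structural difference is the outer operator: a sum weighted by π(a | s) in the first equation and a max over a in the second.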

Indirect Kalman Filter for Inertial Navigation System

I'm trying to implement an Inertial Navigation System using an Indirect Kalman Filter. I've found many publications and theses on this topic, but not much example code. For my implementation I'm using the Master's thesis available at the following link:
https://fenix.tecnico.ulisboa.pt/downloadFile/395137332405/dissertacao.pdf
As reported on page 47, the measured values from the inertial sensors equal the true values plus a series of other terms (bias, scale factors, ...).
For my question, let's consider only bias.
So:
Wmeas = Wtrue + BiasW (Gyro meas)
Ameas = Atrue + BiasA (Accelerometer meas)
Therefore,
when I propagate the Mechanization equations (equations 3-29, 3-37 and 3-41)
I should use the "true" values, or better:
Wmeas - BiasW
Ameas - BiasA
where BiasW and BiasA are the last available estimation of the bias. Right?
Concerning the update phase of the EKF,
if the measurement equation is
dzV = VelGPS_est - VelGPS_meas
the H matrix should have an identity block in the columns corresponding to the velocity error state variables dx(VEL) and 0 elsewhere. Right?
That said, I'm not sure how I should propagate the state variables after the update phase.
The propagation of the state variable should be (in my opinion):
POSk|k = POSk|k-1 + dx(POS);
VELk|k = VELk|k-1 + dx(VEL);
...
But this didn't work. Therefore I've tried:
POSk|k = POSk|k-1 - dx(POS);
VELk|k = VELk|k-1 - dx(VEL);
that didn't work either. I tried both options, even though in my opinion the "+" should be used. Since neither works, I probably have some other error elsewhere,
so I would like to ask whether you have any suggestions.
You can see a snippet of code at the following link: http://pastebin.com/aGhKh2ck.
Thanks.
The difficulty you're running into is the difference between the theory and the practice. Taking your code from the snippet instead of the symbolic version in the question:
% Apply corrections
Pned = Pned + dx(1:3);
Vned = Vned + dx(4:6);
In theory, when you use the Indirect form you are freely integrating the IMU (the process called Mechanization in that thesis) and occasionally running the IKF to update its correction. In theory the unchecked double integration of the accelerometer produces large (or for cheap MEMS IMUs, enormous) error values in Pned and Vned. That, in turn, causes the IKF to produce correspondingly large values of dx(1:6) as time evolves and the unchecked IMU integration runs farther and farther away from the truth. In theory you then sample your position at any time as Pned +/- dx(1:3) (the sign isn't important; you can set that up either way). The important part here is that you are not modifying Pned from the IKF, because the two run independently of each other and you add them together when you need the answer.
In practice you do not want to take the difference between two enormous double values, because you will lose precision (many of the bits of the significand are needed to represent the enormous part instead of the precision you want). You have grasped that in practice you want to recursively update Pned on each update. However, when you diverge from the theory this way, you have to take the corresponding (and somewhat unobvious) step of zeroing out your correction value in the IKF state vector. In other words, after you do Pned = Pned + dx(1:3) you have "used" the correction, and you need to balance the equation with dx(1:3) = dx(1:3) - dx(1:3) (simplified: dx(1:3) = 0) so that you don't inadvertently integrate the correction over time.
Why does this work? Why doesn't it mess up the rest of the filter? As it turns out, the KF covariance P does not actually depend on the state x. It depends on the update function, the process noise Q and so on. So the filter doesn't care what the data is. (Now that's a simplification, because often Q and R include rotation terms, and R might vary based on other state variables, etc., but in those cases you are actually using state from outside the filter (the cumulative position and orientation), not the raw correction values, which have no meaning by themselves.)