How to check the actions available in an OpenAI gym environment?

When using OpenAI gym, after importing the library with import gym, the action space can be checked with env.action_space. But this gives only the size of the action space. I would like to know what kind of action each element of the action space corresponds to. Is there a simple way to do it?

If your action space is discrete and one-dimensional, env.action_space will give you a Discrete object. You can access the number of actions available (which is simply an integer) like this:
env = gym.make("Acrobot-v1")
a = env.action_space
print(a) #prints Discrete(3)
print(a.n) #prints 3
If your action space is discrete and multi-dimensional, you'd get a MultiDiscrete object (instead of Discrete), whose nvec attribute (instead of n) gives an array with the number of available actions for each dimension. But note that this is not a very common case.
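For illustration, here is a minimal sketch of such a space constructed directly (no standard environment is assumed; the sizes [3, 5] are made up for the example):
from gym.spaces import MultiDiscrete

space = MultiDiscrete([3, 5])  # 3 actions in the first dimension, 5 in the second
print(space.nvec)      # prints [3 5]
print(space.sample())  # e.g. [2 0] (a random action, one entry per dimension)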
If you have a continuous action space, env.action_space will give you a Box object. Here is how to access its properties:
env = gym.make("MountainCarContinuous-v0")
a = env.action_space
print(a) #prints Box(1,)
print(a.shape) #prints (1,), note that you can do a.shape[0] which is 1 here
print(a.is_bounded()) #prints True if your action space is bounded
print(a.high) #prints [1.] an array with the maximum value for each dim
print(a.low) #prints [-1.] same for minimum value
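Whatever the type of the space, you can also draw a random valid action from it and test membership (shown here for the Box environment above, but Discrete and MultiDiscrete work the same way):
action = env.action_space.sample()        # a random action within the space
print(env.action_space.contains(action))  # prints True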

Related

Having an issue with the max_norm parameter of torch.nn.Embedding

I use torch.nn.Embedding to embed my model’s categorical input features; however, I face problems when I set the max_norm parameter to something other than None.
There is a note on the PyTorch docs page that explains how to use the max_norm parameter through the following example:
import torch
import torch.nn as nn

n, d, m = 3, 5, 7
embedding = nn.Embedding(n, d, max_norm=True)
W = torch.randn((m, d), requires_grad=True)
idx = torch.tensor([1, 2])
a = embedding.weight.clone() @ W.t()  # weight must be cloned for this to be differentiable
b = embedding(idx) @ W.t()  # modifies weight in-place
out = (a.unsqueeze(0) + b.unsqueeze(1))
loss = out.sigmoid().prod()
loss.backward()
I can’t easily understand this example from the docs. What is the purpose of having both ‘a’ and ‘b’, and why is ‘out’ defined as out = (a.unsqueeze(0) + b.unsqueeze(1))?
Do we need to first clone the entire embedding tensor as in ‘a’, and then find the embeddings for our desired indices as in ‘b’? And then how do ‘a’ and ‘b’ need to be added?
In my code, I don’t have W explicitly; I am assuming that W is representative of the weights applied by the torch.nn.Linear layers. So I just need to prepare the input (which includes the embeddings for categorical features) that goes into my network.
I greatly appreciate any instructions on this, as understanding this example would help me adapt my code accordingly.
Because W in the line computing a requires gradients, autograd must save embedding.weight in order to compute those gradients in the backward pass. However, in the line computing b, executing embedding(idx) rescales embedding.weight in place to satisfy max_norm. So, without the clone in line a, embedding.weight would be modified when line b is executed, changing what was saved for the backward pass to update W. Hence the requirement to clone embedding.weight: to save its value before it gets rescaled in line b.
If you don't use embedding.weight outside of the normal forward pass, you don't need to worry about all this.
If you get an error, post it (and your code).
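To see the in-place renormalization concretely, here is a quick check (the sizes are arbitrary; with the default normal initialization, the rows of a 5-dimensional embedding almost certainly have norm greater than 1, so they get rescaled on lookup):
import torch
import torch.nn as nn

emb = nn.Embedding(3, 5, max_norm=1.0)
w_before = emb.weight.clone()
emb(torch.tensor([1, 2]))                 # forward pass renormalizes rows 1 and 2 in place
print(torch.equal(w_before, emb.weight))  # prints False: the looked-up rows were rescaled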

How to define observation and action space for an array-like input

I am working on a problem that I want to implement as a reinforcement learning problem and integrate into OpenAI gym. My states are lists of length n in which each element is chosen from the discrete interval [0, m].
For example, for n = 6 and m = 3, this is a sample from the observation space:
[0 2 1 3 3 2]
The states accessible from a given state are the other lists obtained by changing k of its elements to another number from the same interval [0, m].
For example, for k = 1 we can have the following two states as possible successors of the previous state:
[0 2 2 3 3 2]
or
[0 3 1 3 3 2]
My question is: what is an efficient way to represent the "actions" in OpenAI gym for such a scenario?
One way that comes to mind is to just use the next state as the action itself; for example, if I write:
action = env.action_space.sample()
the action would be the next state (which also implicitly contains the action), and then env.step(action) would set the state equal to that next state.
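For concreteness, here is a minimal sketch of that representation, assuming gym's spaces API (the class name ListEnv and the trivial reward are made up for illustration; note that MultiDiscrete does not by itself enforce the constraint that at most k elements change):
import gym
import numpy as np
from gym import spaces

class ListEnv(gym.Env):  # hypothetical environment
    def __init__(self, n=6, m=3):
        # Each of the n slots takes a value in {0, ..., m}; the action is
        # simply the desired next state, so both spaces are identical.
        self.observation_space = spaces.MultiDiscrete([m + 1] * n)
        self.action_space = spaces.MultiDiscrete([m + 1] * n)
        self.state = np.zeros(n, dtype=np.int64)

    def reset(self):
        self.state = self.observation_space.sample()
        return self.state

    def step(self, action):
        self.state = np.asarray(action)  # the action *is* the next state
        reward, done, info = 0.0, False, {}
        return self.state, reward, done, info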
Does anyone know a better way, or is the implicit action representation via the next state the optimal one?
Does anyone know a predefined gym environment that uses the same representation?
What are the cons of the implicit representation of the actions that I just explained?

Understanding when to use a Python list in PyTorch

Basically, as this thread discusses, you cannot use a Python list to wrap your sub-modules (for example, your layers); otherwise, PyTorch is not going to update the parameters of the sub-modules inside the list. Instead, you should use nn.ModuleList to wrap your sub-modules to make sure their parameters are going to be updated. Now I have also seen code like the following, where the author uses a Python list to collect the loss terms and then calls backward() on their sum to do the update (in the REINFORCE algorithm of RL). Here is the code:
policy_loss = []
for log_prob in self.controller.log_probability_slected_action_list:
    policy_loss.append(-log_prob * (average_reward - b))
self.optimizer.zero_grad()
final_policy_loss = torch.cat(policy_loss).sum() * gamma
final_policy_loss.backward()
self.optimizer.step()
Why does using a list in this way work for updating the parameters of the modules, while the first case does not? I am very confused now. If I change policy_loss = [] to policy_loss = nn.ModuleList([]) in the previous code, it throws an exception saying that a float tensor is not a sub-module.
You are misunderstanding what Modules are. A Module stores parameters and defines an implementation of the forward pass.
You're allowed to perform arbitrary computations with tensors and parameters, resulting in other new tensors, and Modules need not be aware of those tensors. You're also allowed to store tensors in Python lists. When calling backward, it needs to be called on a scalar tensor, hence the sum of the concatenation. These tensors are losses, not parameters, so they should neither be attributes of a Module nor wrapped in a ModuleList.
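For illustration, a minimal sketch of the sub-module case that thread warns about (the class names are made up):
import torch.nn as nn

class BrokenNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Plain Python list: the layers are not registered as sub-modules,
        # so their parameters are invisible to .parameters() and optimizers.
        self.layers = [nn.Linear(4, 4) for _ in range(2)]

class WorkingNet(nn.Module):
    def __init__(self):
        super().__init__()
        # nn.ModuleList registers each layer, so the parameters are tracked.
        self.layers = nn.ModuleList([nn.Linear(4, 4) for _ in range(2)])

print(len(list(BrokenNet().parameters())))   # prints 0
print(len(list(WorkingNet().parameters())))  # prints 4 (weight and bias per layer)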

How does rstan store posterior samples for separate chains?

I would like to understand how the output of extract in rstan orders the posterior samples. I understand that I can view the posterior samples from each chain by using as.array,
stanfit <- sampling(
  model,
  data = stan.data)

fitarray <- as.array(stanfit)
For example, fitarray[, 2, 1] will give me the samples for the second chain of the first parameter. One way to store the posterior samples in the output of extract would be just to concatenate them. When I do,
fit <- extract(stanfit)
mean(fitarray[,2,1]) == mean(fit$ss[1001:2000])
for several chains and parameters, I always get TRUE (ss is the first parameter). This makes it seem like the posterior samples are concatenated chain by chain in fit. However, when I do,
fitarray[,2,1] == fit$ss[1001:2000]
I get FALSE (I have confirmed it is not just a precision difference). It appears that fitarray and fit store the iterations differently. How do I view the iterations (in order) of each posterior sample chain separately?
As can be seen from rstan:::as.array.stanfit, the as.array method is essentially defined as
extract(x, permuted = FALSE, inc_warmup = FALSE)
Your default use of extract permutes the post-warmup draws randomly, which is why the indices do not line up with the as.array output. To view the iterations of each chain in order, call extract with permuted = FALSE yourself.
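For example, reusing the objects above (fit_ordered is a made-up variable name), this returns an iterations × chains × parameters array with each chain's draws in iteration order, matching the as.array output elementwise:
fit_ordered <- extract(stanfit, permuted = FALSE, inc_warmup = FALSE)
all(fitarray[, 2, 1] == fit_ordered[, 2, 1])  # TRUE: same chain, same order
Since as.array is defined in terms of this very call, the two arrays are identical.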

Need help in reading the diagram

I am trying to understand the diagram for the register write operation in MIPS (single-cycle datapath). I do not get why we need to AND the output of the decoder with the write-enable signal. How would that enable the specific register? Please help me out with it.
Thanks.
There are several inconsistencies in the diagram. The "n-to-2^n" decoder should have n inputs and 2^n outputs. With such a decoder, the number of registers should be 2^n.
The decoder inputs specify the address (i.e. the register) to be written to. For any of the 2^n possible register numbers, the corresponding output of the decoder will be set to 1, with all other outputs set to 0.
The "write" signal is probably driven off a clock.
The purpose of the AND gates is to make the "write" signal propagate to the correct register (just that one!). The register is chosen by the address fed into the decoder, as described above.
The selected register will latch the "register data", most probably on the rising edge of the clock. All the remaining registers will keep their present values, since their C inputs will remain at 0 throughout.
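As a quick illustration, here is a tiny Python sketch of the gating logic as a behavioural model (not hardware; all names are made up):
def decoder(addr, n):
    # n-to-2^n decoder: output i is 1 iff addr == i
    return [1 if addr == i else 0 for i in range(2 ** n)]

def write_port(registers, addr, data, write, n):
    # AND each decoder output with the shared write signal: only the
    # selected register sees enable = 1; all others keep their values.
    enables = [sel & write for sel in decoder(addr, n)]
    return [data if en else old for en, old in zip(enables, registers)]

regs = [0, 0, 0, 0]  # 2^2 registers
regs = write_port(regs, addr=2, data=7, write=1, n=2)
print(regs)  # prints [0, 0, 7, 0]: only register 2 was written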