Reinforcement Learning for an environment that cannot be affected by the agent - reinforcement-learning

The RL model is defined by the transition probabilities P^a_ss', and the action space is continuous. What should I do to make the agent learn that the environment behaves in its own way regardless of what the agent does?
It is also desirable to learn the state transitions of the environment; would RL suffice for that job at all? If yes: the environment has only one continuous variable x_0 in the observation space, and numerous hidden factors x_1, x_2, ... that affect x_0; should x_1, x_2, ... be in the observation space too? If no: what should I try next, besides an RNN?

Related

Would this be a valid implementation of an ordinal cross-entropy?

Would this be a valid implementation of a cross entropy loss that takes the ordinal structure of the GT y into consideration? y_hat is the prediction from a neural network.
import torch
import torch.nn.functional as F

ce_loss = F.cross_entropy(y_hat, y, reduction="none")      # per-example cross entropy
distance_weight = torch.abs(y_hat.argmax(1) - y) + 1       # ordinal distance of the predicted class from y, plus 1
ordinal_ce_loss = torch.mean(distance_weight * ce_loss)    # distance-weighted mean
I'll attempt to answer this question by first fully defining the task, since the question is a bit sparse on details.
I have a set of ordinal classes (e.g. first, second, third, fourth,
etc.) and I would like to predict the class of each data example from
among this set. I would like to define an entropy-based loss-function
for this problem. I would like this loss function to weight the loss
between a predicted class torch.argmax(y_hat) and the true class y
according to the ordinal distance between the two classes. Does the
given loss expression accomplish this?
Short answer: sure, it is "valid". You've roughly implemented L1-norm ordinal class weighting. I'd question whether this is truly the correct weighting strategy for this problem.
For instance, consider that for a true label n, the bin n response is weighted by 1, but the bin n+1 and n-1 responses are weighted by 2. This means that a lot more emphasis will be placed on NOT predicting false positives than on correctly predicting true positives, which may imbue your model with some strange bias.
It also means that examples on the edge will result in a larger total sum of weights, meaning that you'll be weighting examples where the true label is, say, "first" or "last" more highly than the intermediate classes. (Say you have 5 classes: 1,2,3,4,5. A true label of 1 will require a distance_weight of [1,2,3,4,5], the sum of which is 15. A true label of 3 will require a distance_weight of [3,2,1,2,3], the sum of which is 11.)
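To make the weight asymmetry concrete, here is a small sketch (using class indices 0..4, which is what y and argmax(y_hat) would contain, mirroring the 5-class example above) that prints the per-bin weight |bin - y| + 1 and its sum for each possible true label:
import torch

num_classes = 5
bins = torch.arange(num_classes)
for y in range(num_classes):
    weights = (bins - y).abs() + 1    # per-bin weight |bin - y| + 1
    print(f"true label {y}: weights {weights.tolist()}, sum {weights.sum().item()}")
Edge labels (0 and 4) accumulate a total weight of 15, while the middle label (2) accumulates only 11.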
In general, classification problems and entropy-based losses are underpinned by the assumption that no set of classes or categories is any more or less related than any other set of classes. In essence, the input data is embedded into an orthogonal feature space where each class represents one vector in the basis. This is quite plainly a bad assumption in your case, meaning that this embedding space is probably not particularly elegant: thus, you have to correct for it with sort of a hack-y weight fix. And in general, this assumption of class non-correlation is probably not true in a great many classification problems (consider e.g. the classic ImageNet classification problem, wherein the class pairs [bus,car], and [bus,zebra] are treated as equally dissimilar. But this is probably a digression into the inherent lack of usefulness of strict ontological structuring of information which is outside the scope of this answer...)
Long answer: I'd highly suggest moving into a space where the ordinal value you care about is instead expressed in a continuous space. (In the first/second/third example, you might for instance output a continuous value over the range [1, max_place].) This allows you to benefit from loss functions that already capture well the notion that predictions closer in an ordered space are better than predictions farther away in an ordered space (e.g. MSE, Smooth-L1, etc.).
Let's consider one more time the case of the [first,second,third,etc.] ordinal class example, and say that we are trying to predict the places of a set of runners in a race. Consider two races, one in which the first place runner wins by 30% relative to the second place runner, and the second in which the first place runner wins by only 1%. This nuance is entirely discarded by the ordinal discrete classification. In essence, the selection of an ordinal set of classes truncates the amount of information conveyed in the prediction, which means not only that the final prediction is less useful, but also that the loss function encodes this strange truncation and binarization, which is then reflected (perhaps harmfully) in the learned model. This problem could likely be much more elegantly solved by regressing the finishing position, or perhaps instead by regressing the finishing time, of each athlete, and then performing the final ordinal classification into places OUTSIDE of the network training.
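As a rough illustration of that suggestion (the model architecture, feature dimension, and data below are made up for the example, not taken from the question), one could regress a continuous finishing time with Smooth-L1 and only convert to places outside of training:
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))   # toy regressor
criterion = nn.SmoothL1Loss()             # distance-aware loss by construction

features = torch.randn(8, 16)             # 8 athletes, 16 made-up features each
finish_times = torch.rand(8) * 100        # ground-truth finishing times (seconds)

pred_times = model(features).squeeze(1)
loss = criterion(pred_times, finish_times)
loss.backward()

# The ordinal "classification" into places happens outside of network training:
places = torch.argsort(torch.argsort(pred_times)) + 1   # 1 = first place (smallest predicted time)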
In conclusion, you might expect a well-trained ordinal classifier to produce essentially a normal distribution of responses across the class bins, with the distribution peak on the true value: a binned discretization of a space that almost certainly could, and likely should, be treated as a continuous space.

How to define an observation space with one int value and 2 double values in OpenAI gym?

I have an environment in OpenAI Gym where the observation looks like [12, 12.5, 16.7]: one value is discrete and the other two are continuous. How can I define this in Gym?
I have tried MultiDiscrete and Discrete, but they don't cover the continuous values, and I also tried Box, but the first (integer) value was problematic.
In reinforcement learning you usually want to normalize the observations in the range 0-1 (especially if you are using neural networks as function approximators). Therefore, it makes sense to use Boxes in the range 0-1.
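For example, here is a sketch of two possible definitions (the Discrete(20) bound and the [0, 1] ranges are placeholders, not values from the question):
import numpy as np
from gym import spaces

# Option 1: keep the structure explicit with a Dict space.
observation_space = spaces.Dict({
    "id": spaces.Discrete(20),                                               # the integer component
    "values": spaces.Box(low=0.0, high=1.0, shape=(2,), dtype=np.float32),   # the two continuous components
})

# Option 2: normalize everything into [0, 1] and treat the (scaled) integer
# as just another float next to the continuous values, as suggested above.
observation_space = spaces.Box(low=0.0, high=1.0, shape=(3,), dtype=np.float32)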

When is it safe to use torch.no_grad() in forward propagation? Why does it hurt my model badly?

I have trained a CNN model whose forward-prop is like:
*Part1*: learnable preprocessing
*Part2*: Mixup, which does not need to calculate gradients
*Part3*: CNN backbone and classifier head
Both part1 and part3 need to calculate gradients and have their weights updated during back-prop, but part2 is just a simple mixup and doesn't need gradients, so I wrapped this mixup in torch.no_grad() to save computational resources and speed up training. It did indeed speed up my training a lot, but the model's prediction accuracy dropped a lot.
I'm wondering: if the mixup does not need to calculate gradients, why does wrapping it in torch.no_grad() hurt the model's ability so much? Is it due to losing the learned weights of part1, or something like breaking the chain between part1 and part2?
Edit:
Thanks @Ivan for your reply; it sounds reasonable. I had the same thought but didn't know how to prove it.
In my experiment, when I apply torch.no_grad() to Part2, the GPU memory consumption drops a lot and training is much faster, so I guess this Part2 still needs gradients even though it has no learnable parameters.
So can we conclude that torch.no_grad() should not be applied between 2 or more learnable blocks, otherwise it would drop the learning ability of blocks before this no_grad() part?
but part2 is just simple mixup and don't need gradient
It actually does! In order to compute the gradient flow and backpropagate successfully to part1 of your model (which is learnable, according to you) you need to compute the gradients on part2 as well. Even though there are no learnable parameters on part2 of your model.
What I'm assuming happened when you applied torch.no_grad() on part2 is that only part3 of your model was able to learn while part1 stayed untouched.
Edit
So can we conclude that torch.no_grad() should not be applied between 2 or more learnable blocks, otherwise it would drop the learning ability of blocks before this no_grad() part?
The reasoning is simple: to compute the gradient on part1 you need to compute the gradient on intermediate results, irrespective of the fact that you won't use those gradients to update the tensors on part2. So indeed, you are correct.
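A minimal sketch of that reasoning, with a made-up two-layer stand-in rather than the asker's actual CNN: wrapping the intermediate step in torch.no_grad() leaves the first block without any gradient.
import torch
import torch.nn as nn

part1 = nn.Linear(4, 4)        # stand-in for the learnable preprocessing
part3 = nn.Linear(4, 1)        # stand-in for the backbone/head
x = torch.randn(2, 4)

h = part1(x)
with torch.no_grad():
    h = h * 0.5                # stand-in for the mixup step; graph recording stops here
out = part3(h).sum()
out.backward()

print(part1.weight.grad)       # None: no path back to part1 was recorded
print(part3.weight.grad)       # populated: part3 can still learn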

What is the use of having both a state value function and an action value function?

I'm a beginner in RL and want to know what the advantage is of having a state value function as well as an action-value function in RL algorithms, for example in a Markov Decision Process. What is the use of having both of them in prediction and control problems?
I think you mean state-value function and state-action-value function.
Quoting this answer by James MacGlashan:
To explain, let's first add a point of clarity. Value functions (either V or Q) are always conditional on some policy π. To emphasize this fact, we often write them as V^π(s) and Q^π(s,a). In the case when we're talking about the value functions conditional on the optimal policy π*, we often use the shorthand V*(s) and Q*(s,a). Sometimes in the literature we leave off the π or * and just refer to V and Q, because it's implicit in the context, but ultimately every value function is always with respect to some policy.
Bearing that in mind, the definition of these functions should clarify the distinction for you.
V^π(s) expresses the expected value of following policy π forever when the agent starts following it from state s.
Q^π(s,a) expresses the expected value of first taking action a from state s and then following policy π forever.
The main difference, then, is that the Q-value lets you play out a hypothetical of potentially taking a different action in the first time step than what the policy might prescribe, and then following the policy from the state the agent winds up in.
For example, suppose in state s I'm one step away from a terminating goal state and I get -1 reward for every transition until I reach the goal. Suppose my policy is the optimal policy, so that it always tells me to walk toward the goal. In this case, V^π(s) = -1 because I'm just one step away. However, if I consider the Q-value for an action a that walks 1 step away from the goal, then Q^π(s,a) = -3 because first I walk 1 step away (-1), and then I follow the policy, which will now take me two steps to get to the goal: one step to get back to where I was (-1), and one step to get to the goal (-1), for a total of -3 reward.
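A tiny sketch of that worked example, assuming a hypothetical 1D corridor with the goal at position 0, a reward of -1 per step, an optimal policy that always steps toward the goal, and no discounting:
def V(s):
    # expected return of following the optimal policy from s steps away: -1 per step
    return -s

def Q(s, step):
    # take `step` first (+1 = away from the goal, -1 = toward it), then follow the policy
    return -1 + V(s + step)

print(V(1))       # -1: one step from the goal
print(Q(1, +1))   # -3: walk away first (-1), then two steps back to the goal (-2)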

Reinforcement Learning: Do I have to ignore hyperparameters after training is done in Q-learning?

The learner might be in the training stage, where it updates the Q-table for a bunch of epochs.
In this stage, the Q-table is updated using gamma (the discount rate) and the learning rate (alpha), and actions are chosen according to a random action rate.
After some epochs, when the reward is getting stable, let me call this "training is done". Do I then have to ignore these parameters (gamma, learning rate, etc.)?
I mean, in the training stage, I got an action from the Q-table like this:
if rand_float < rar:
    action = rand.randint(0, num_actions - 1)
else:
    action = np.argmax(Q[s_prime_as_index])
But after the training stage, do I have to remove rar, which means I would get an action from the Q-table like this?
action = np.argmax(self.Q[s_prime])
Once the value function has converged (values stop changing), you no longer need to run Q-value updates. This means gamma and alpha are no longer relevant, because they only affect updates.
The epsilon parameter is part of the exploration policy (e-greedy) and helps ensure that the agent visits all states infinitely many times in the limit. This is an important factor in ensuring that the agent's value function eventually converges to the correct value. Once we've deemed the value function converged however, there's no need to continue randomly taking actions that our value function doesn't believe to be best; we believe that the value function is optimal, so we extract the optimal policy by greedily choosing what it says is the best action in every state. We can just set epsilon to 0.
Although the answer provided by @Nick Walker is correct, here is some additional information.
What you are talking about is closely related to the concept known as the "exploration-exploitation trade-off". From the Sutton & Barto book:
The agent has to exploit what it already knows in order to obtain
reward, but it also has to explore in order to make better action
selections in the future. The dilemma is that neither exploration nor
exploitation can be pursued exclusively without failing at the task.
The agent must try a variety of actions and progressively favor those
that appear to be best.
One way to implement the exploration-exploitation trade-off is epsilon-greedy exploration, which is what you are using in your code sample. So, in the end, once the agent has converged to the optimal policy, it must select only actions that exploit the current knowledge, i.e., you can forget the rand_float < rar part. Ideally you should decrease the epsilon parameter (rar in your case) with the number of episodes (or steps), as in the sketch below.
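A small sketch of such a decay schedule (the table size, decay constant, and episode count below are arbitrary placeholders, not values from the question):
import numpy as np

num_states, num_actions = 10, 4
num_episodes = 500
epsilon, epsilon_min, decay = 1.0, 0.01, 0.995

rng = np.random.default_rng(0)
Q = rng.random((num_states, num_actions))     # toy Q-table standing in for the learned one

for episode in range(num_episodes):
    s = rng.integers(num_states)
    if rng.random() < epsilon:                # explore with probability epsilon
        action = rng.integers(num_actions)
    else:                                     # otherwise exploit the Q-table
        action = int(np.argmax(Q[s]))
    epsilon = max(epsilon_min, epsilon * decay)   # decay epsilon each episode

# Once training has converged, drop exploration entirely (epsilon = 0):
greedy_action = int(np.argmax(Q[s]))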
On the other hand, regarding the learning rate, it is worth noting that theoretically this parameter should satisfy the Robbins-Monro conditions: the learning rates must sum to infinity over time while the sum of their squares stays finite (sum_t alpha_t = infinity, sum_t alpha_t^2 < infinity). This means that the learning rate should decrease asymptotically. So, again, once the algorithm has converged you can (or better, you should) safely ignore the learning rate parameter.
In practice, sometimes you can simply keep epsilon and alpha fixed until your algorithm converges and then set them to 0 (i.e., ignore them).