How to combine A2C with BPTT?

I'm having a little difficulty understanding how I can apply backpropagation through time to the A2C method, or any reinforcement learning method for that matter.
As I understand it, BPTT conceptually unrolls a recurrent network, performs a forward pass, computes a loss from the output, and then backpropagates that loss through the unrolled network, taking the previous hidden states into account.
However, I'm slightly unsure how I would go about combining this with A2C. Should I calculate the final actor and critic losses from an epoch and use these to backpropagate, or should I accumulate the total losses at each step and do the same, or have I misunderstood entirely and need to do something else?
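For concreteness, here is roughly what I currently have in mind. It is only a rough sketch (assuming PyTorch, a gym-style env, and a placeholder recurrent actor-critic model with an initial_hidden() method), and I'm not sure whether accumulating the per-step losses like this and calling backward once is the right way to do it:

```python
# Rough sketch of one A2C update with BPTT over a short rollout (PyTorch).
# Names like model.initial_hidden() and env are placeholders, not a real library.
import torch

def a2c_bptt_update(model, optimizer, env, rollout_len=20, gamma=0.99):
    obs = env.reset()
    hidden = model.initial_hidden()          # LSTM state, kept across steps
    log_probs, values, rewards, entropies = [], [], [], []

    for _ in range(rollout_len):             # forward pass: unroll the RNN
        obs_t = torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0)
        policy_logits, value, hidden = model(obs_t, hidden)
        dist = torch.distributions.Categorical(logits=policy_logits)
        action = dist.sample()

        obs, reward, done, _ = env.step(action.item())
        log_probs.append(dist.log_prob(action))
        values.append(value.squeeze())
        rewards.append(reward)
        entropies.append(dist.entropy())
        if done:
            break

    # Bootstrap from the critic if the rollout ended mid-episode.
    with torch.no_grad():
        obs_t = torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0)
        _, next_value, _ = model(obs_t, hidden)
    R = 0.0 if done else next_value.item()

    # Accumulate the per-step actor and critic losses over the whole rollout.
    policy_loss, value_loss = 0.0, 0.0
    for t in reversed(range(len(rewards))):
        R = rewards[t] + gamma * R
        advantage = R - values[t]
        policy_loss = policy_loss - log_probs[t] * advantage.detach() - 0.01 * entropies[t]
        value_loss = value_loss + advantage.pow(2)

    optimizer.zero_grad()
    (policy_loss + 0.5 * value_loss).backward()   # one backward pass = BPTT through the rollout
    optimizer.step()
```
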
Thanks in advance for any advice.

Related

Deep reinforcement learning when similar observations need totally different actions: how to solve it?

For DRL with neural networks, such as DQN, if a task requires totally different actions for similar observations, will the NN show a weakness here? Will two nearby inputs to the NN produce similar outputs? If so, can it still produce the different actions the task needs?
For instance:
the agent can choose a discrete action from [A,B,C,D,E], and the observation is the state of a set of plugs given as a binary list [0,0,0,0,0,0,0].
The observations [1,1,1,1,1,1,1] and [1,1,1,1,1,1,0] are quite similar, but the agent should take action A at [1,1,1,1,1,1,1] and action D at [1,1,1,1,1,1,0]. The two observations are very close in distance, so might the DQN struggle to learn the proper action? How can this be solved?
One more thing:
One-hot encoding is a way to increase the distance between observations, and it is a common and useful approach in many supervised learning tasks, but it also greatly increases the dimensionality.
Will two nearby inputs to the NN produce similar outputs?
Artificial neural networks are, by nature, non-linear function approximators, meaning that for two similar inputs the outputs can be very different.
You can get an intuition for this from the classic adversarial-example illustration: two nearly identical pictures, one of which just has some light noise added, produce very different predictions from the model.
The observations [1,1,1,1,1,1,1] and [1,1,1,1,1,1,0] are quite similar, but the agent should take action A at [1,1,1,1,1,1,1] and action D at [1,1,1,1,1,1,0]. The two observations are very close in distance, so might the DQN struggle to learn the proper action?
I see no problem with this example: a properly trained NN should be able to map both inputs to their desired actions. Furthermore, since your input vectors contain binary values, a single differing entry (a Hamming distance of 1) is enough for the neural net to separate them.
Also, note that the non-linearity in neural networks comes from the activation functions. Hope this helps!
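If it helps, here is a quick toy check (just a sketch in PyTorch, using the two observations from your example and action indices A=0 and D=3). A tiny MLP fitted on only these two inputs learns to output different actions for them:

```python
# Toy check: a small network can map near-identical binary inputs to different actions.
import torch
import torch.nn as nn

# Observations from the question (Hamming distance 1), target actions A (index 0) and D (index 3).
X = torch.tensor([[1, 1, 1, 1, 1, 1, 1],
                  [1, 1, 1, 1, 1, 1, 0]], dtype=torch.float32)
y = torch.tensor([0, 3])  # A=0, B=1, C=2, D=3, E=4

net = nn.Sequential(nn.Linear(7, 16), nn.ReLU(), nn.Linear(16, 5))
opt = torch.optim.Adam(net.parameters(), lr=0.05)
loss_fn = nn.CrossEntropyLoss()

for _ in range(200):
    opt.zero_grad()
    loss = loss_fn(net(X), y)
    loss.backward()
    opt.step()

print(net(X).argmax(dim=1))  # expected: tensor([0, 3]) -> actions A and D
```
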

Difference between optimisation algorithms and reinforcement learning methods

I have a sense that a one-step reinforcement learning task is essentially the same as some optimisation algorithms.
For example, suppose there is only one parameter α and we try to optimise y using gradient descent. In each iteration (or step), α moves slightly in the direction given by δy. This step looks exactly like the one in reinforcement learning, where δy is called the temporal difference and y is the value of that state S(α).
So I wonder, for one-step reinforcement learning problems, is RL actually an optimisation method, and can it be used to optimise parameters (based on the context above)?
I might have misunderstood something here; corrections are welcome.
First of all, reinforcement learning is very general: almost any optimization problem can be transformed into an RL problem. It's usually not worth it, because an RL agent would select sub-optimal actions, doing trial and error just to confirm things you already know by design.
To your question: I think the similarity you found is that both algorithms make use of a (noisy) gradient step. Temporal difference is just one RL method of many. If I remember correctly, it calculates the difference between the predicted value and the (noisy) value estimate made with the observed reward. It cannot simply set the correct value, because in general there is a complicated dependency between the values of different states, so instead it takes just a small step to reduce the difference.
Sure, you could set up an RL task somehow to optimize reward = y(α). Now α can either be the agent's "state", in which case you need actions to decrement or increment it (you learn state-values), or α can be the action, in which case there is only a single state (you learn action-values). With the right exploration strategy it might even work if you are patient. But in both cases you waste your knowledge of the gradient ∂y(α)/∂α, because the RL algorithm does not know about it. Yes, it takes gradient steps, but those gradients reduce the difference between the learned value and the actual value. If the true values are exactly the rewards (which is true if the agent dies after one step, and if there is no randomness when you evaluate y(α)), then this is wasted effort: instead of taking a small step to account for a non-existent influence on other states, you could have just set the value directly.
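To make the contrast concrete, here is a toy sketch (my own illustration, assuming a simple quadratic y(α) and a discretised range of α values treated as bandit arms). Gradient ascent uses ∂y/∂α directly, while the bandit has to estimate the value of each candidate α by trial and error:

```python
# Toy contrast: maximising y(alpha) = -(alpha - 2)^2 by gradient ascent
# versus a bandit-style RL agent that treats each discretised alpha as an arm.
import numpy as np

def y(alpha):
    return -(alpha - 2.0) ** 2             # maximum at alpha = 2

# 1) Gradient ascent: uses the known derivative dy/dalpha = -2 * (alpha - 2).
alpha = 0.0
for _ in range(100):
    alpha += 0.1 * (-2.0 * (alpha - 2.0))
print("gradient ascent:", alpha)            # converges to ~2.0 in a handful of steps

# 2) Epsilon-greedy bandit: learns an action-value for every candidate alpha.
rng = np.random.default_rng(0)
arms = np.linspace(0.0, 4.0, 41)            # candidate alpha values
q = np.zeros(len(arms))                     # estimated value of each arm
counts = np.zeros(len(arms))
for step in range(2000):
    if step < len(arms):                    # pull every arm once to initialise
        a = step
    elif rng.random() < 0.1:                # explore with probability 0.1
        a = rng.integers(len(arms))
    else:                                   # otherwise exploit the best estimate so far
        a = int(np.argmax(q))
    reward = y(arms[a])                     # every "pull" is one evaluation of y
    counts[a] += 1
    q[a] += (reward - q[a]) / counts[a]     # incremental sample-average update
print("bandit best arm:", arms[int(np.argmax(q))])   # also ~2.0, but after far more evaluations
```
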
You mentioned "one-step reinforcement learning": what comes to mind is the contextual bandit setup. It's a simplification of the full-blown RL setup where your actions do not influence the next state (=context). The next simplification is the multi-armed bandit, which only has actions but no state/context.

Which reinforcement learning algorithm is applicable to a problem with a continuously variable reward and no intermediate rewards?

I think the title says it. A "game" takes a number of moves to complete, at which point a total score is computed. The goal is to maximize this score, and there are no rewards provided for specific moves during the game. Is there an existing algorithm that is geared toward this type of problem?
EDIT: By "continuously variable" reward, I mean it is a floating point number, not a win/loss binary. So you can't, for example, respond to "winning" by reinforcing the moves made to get there. All you have is a number. You can rank different runs in order of preference, but a single result is not especially meaningful.
First of all, in my opinion, the title of your question seems a little confusing when you talk about "continuously variable reward". Maybe you could clarify this aspect.
On the other hand, without taking into account the previous point, it looks like you are talking about the temporal credit-assignment problem: how do you distribute credit over a sequence of actions that only obtains a reward (positive or negative) at the end of the sequence?
E.g., a tic-tac-toe game where the agent doesn't receive any reward until the game ends. In this sense, almost any RL algorithm tries to solve the temporal credit-assignment problem. See, for example, Section 1.5 of Sutton and Barto's RL book, where they explain the working principles of RL and its advantages over other approaches using a tic-tac-toe game as an example.
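To make the credit-assignment idea concrete, here is a minimal sketch (my own example rather than the book's, assuming a gym-style env whose reward is 0 for every move except the last, and a placeholder policy_net) of a Monte Carlo policy-gradient (REINFORCE) update in which every move is credited with the final score:

```python
# REINFORCE with a terminal-only reward: every move in the episode is credited
# with the (discounted) final score. policy_net and env are placeholders.
import torch

def reinforce_episode(policy_net, optimizer, env, baseline=0.0, gamma=1.0):
    obs = env.reset()
    log_probs, rewards, done = [], [], False
    while not done:
        obs_t = torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0)
        dist = torch.distributions.Categorical(logits=policy_net(obs_t))
        action = dist.sample()
        obs, reward, done, _ = env.step(action.item())   # reward is 0 until the final move
        log_probs.append(dist.log_prob(action))
        rewards.append(reward)                           # e.g. [0, 0, ..., 0, final_score]

    # Return for each step; with a terminal-only reward this is just the final
    # score, discounted back to every earlier move.
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns, dtype=torch.float32)

    # A single score is not meaningful on its own, so compare it against a
    # baseline (e.g. a running average of previous final scores) before reinforcing.
    loss = -(torch.stack(log_probs).squeeze(-1) * (returns - baseline)).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return rewards[-1]   # the final score, useful for updating the baseline
```
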

Is BatchNorm turned off during inference?

I have read several sources that implicitly suggest batch norm is turned off for inference, but I haven't found a definite answer.
The most common approach is to use a moving average of the mean and std for batch normalization, as Keras does for example (https://github.com/keras-team/keras/blob/master/keras/layers/normalization.py). If you simply turn it off, the network will perform worse on the same data, because the inputs are processed differently.
This is done by storing a moving average of the mean and std over all the batches seen while training the network. At inference time, these moving averages are used for normalization.
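As a quick illustration (a sketch in PyTorch; the Keras layer linked above behaves analogously), the same layer normalises with the current batch's statistics in training mode and with the stored moving averages in evaluation mode:

```python
# BatchNorm uses batch statistics in training mode and running (moving-average)
# statistics in eval mode -- it is not "turned off" at inference.
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(num_features=4)
x = torch.randn(32, 4) * 3 + 5           # a batch with non-zero mean and non-unit std

bn.train()
y_train = bn(x)                          # normalised with this batch's mean/std;
                                         # running_mean / running_var are updated as a side effect

bn.eval()
y_eval = bn(x)                           # normalised with the stored running_mean / running_var

print(bn.running_mean)                   # moving average of the batch means seen so far
print(torch.allclose(y_train, y_eval))   # False: the two modes use different statistics
```
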

Computational considerations for different Caffe network topologies (difference in number of outputs)

I would like to use one of Caffe's reference models, i.e. bvlc_reference_caffenet. My target class, i.e. person, is one of the classes in the ILSVRC dataset that the model was trained on. As my goal is to classify whether a test image contains a person or not, I could achieve this in one of the following ways:
1. Use the model directly for inference with its 1000 outputs. This requires no training/learning.
2. Change the network topology slightly so that the final FC layer's number of outputs (num_output) is 2 instead of 1000, and retrain it as a binary classification problem.
My concern is the computational effort at deployment/prediction time (testing). The first option looks more expensive computationally than the second, because at prediction time it has to compute all 1000 output scores in order to find the one with the highest value. What I'm not sure about is whether there is some heuristic (which I'm not aware of) that simplifies this computation.
Could somebody please help cross-check my understanding of this?
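For a rough sense of scale, here is a back-of-the-envelope sketch (assuming the final FC layer takes the 4096-dimensional fc7 output as input, as in the reference CaffeNet). The extra cost of the 1000-way layer is real but confined to that single layer, and everything before it is identical in both options:

```python
# Back-of-the-envelope cost of the final FC layer only; everything before it
# (conv1 .. fc7) is identical in both options. 4096 is the fc7 output size
# in bvlc_reference_caffenet.
fc7_dim = 4096

macs_1000_way = fc7_dim * 1000      # option 1: keep the original 1000-way fc8
macs_2_way = fc7_dim * 2            # option 2: retrain fc8 with num_output = 2

print(macs_1000_way)                # 4,096,000 multiply-accumulates
print(macs_2_way)                   # 8,192 multiply-accumulates
print(macs_1000_way // macs_2_way)  # the 1000-way head does 500x more work, but only in this layer
```

The argmax over the 1000 scores is negligible next to this, so the difference between the two options is small relative to the convolutional layers that dominate the forward pass.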