What is the decay rate of the weightage given to past rewards in the computation of the Q function in the stationary and non-stationary updates in the multi-armed bandit problem?
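For reference, the standard recency-weighting result (Sutton & Barto, ch. 2) is: with the sample-average (stationary) update, the estimate after n rewards is
Q_n = (1/n) * (R_1 + R_2 + ... + R_n)
so every past reward carries the same weight 1/n, which shrinks only like 1/n as more rewards arrive. With a constant step size α (the non-stationary update Q_{n+1} = Q_n + α*(R_n - Q_n)), unrolling the recursion gives
Q_{n+1} = (1-α)^n * Q_1 + Σ_{i=1..n} α*(1-α)^(n-i) * R_i
so the weight given to a reward received k steps in the past decays geometrically (exponentially) as α*(1-α)^k.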
I'm trying to solve the LunarLander continuous environment from OpenAI Gym (solving LunarLanderContinuous-v2 means getting an average reward of 200 over 100 consecutive trials), aiming for the best possible average reward over 100 straight episodes in this environment.
The difficulty is that I deal with the Lunar Lander under uncertainty (observations in the real physical world are often noisy). Specifically, I add zero-mean Gaussian noise with std = 0.05 to the PositionX and PositionY observations of the lander's location.
I also discretise the LunarLander actions to a finite number of actions instead of the continuous range the environment enables.
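For illustration, a wrapper along these lines could inject the observation noise and discretise the actions (the class name, the 5x5 action grid, and the old Gym step API are assumptions, not my exact setup):

import gym
import numpy as np

class NoisyDiscreteLunarLander(gym.Wrapper):
    # Illustrative wrapper: adds Gaussian noise to the x/y position
    # observations and maps a discrete action index onto a fixed grid of the
    # continuous 2-D action space (main engine, side engines).
    def __init__(self, env, noise_std=0.05, bins_per_dim=5):
        super().__init__(env)
        self.noise_std = noise_std
        grid = np.linspace(-1.0, 1.0, bins_per_dim)
        self.actions = np.array([[a, b] for a in grid for b in grid])
        self.action_space = gym.spaces.Discrete(len(self.actions))

    def _noisy(self, obs):
        obs = np.array(obs, dtype=np.float32)
        obs[0] += np.random.normal(0.0, self.noise_std)  # PositionX
        obs[1] += np.random.normal(0.0, self.noise_std)  # PositionY
        return obs

    def reset(self, **kwargs):
        return self._noisy(self.env.reset(**kwargs))

    def step(self, action_idx):
        obs, reward, done, info = self.env.step(self.actions[action_idx])
        return self._noisy(obs), reward, done, info

env = NoisyDiscreteLunarLander(gym.make("LunarLanderContinuous-v2"))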
So far I'm using DQN, double-DQN and Duelling DDQN.
My hyperparameters are:
gamma
epsilon start
epsilon end
epsilon decay
learning rate
number of actions (discretisation)
target update
batch size
optimizer
number of episodes
network architecture.
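For concreteness, the configuration has roughly this shape (the specific values below are only common starting points shown for illustration, not settings I claim will solve the noisy variant):

# Illustrative hyperparameter configuration for a (Dueling D)DQN agent on
# LunarLander; values are typical starting points, not tuned results.
config = {
    "gamma": 0.99,               # discount factor
    "epsilon_start": 1.0,        # initial exploration rate
    "epsilon_end": 0.01,         # final exploration rate
    "epsilon_decay": 0.995,      # multiplicative decay per episode
    "learning_rate": 1e-3,
    "num_actions": 25,           # e.g. a 5x5 grid over the 2-D action space
    "target_update": 1000,       # env steps between target-network syncs
    "batch_size": 64,
    "optimizer": "Adam",
    "num_episodes": 2000,
    "hidden_layers": [256, 256], # network architecture
}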
I'm having difficulty reaching good or even mediocre results.
Does anyone have advice about the hyperparameter changes I should make to improve my results?
Thanks!
I am reading the following paper, and it uses EMA decay for the variables.
Bidirectional Attention Flow for Machine Comprehension
During training, the moving averages of all weights of the model are
maintained with the exponential decay rate of 0.999.
They use TensorFlow, and I found the related EMA code:
https://github.com/allenai/bi-att-flow/blob/master/basic/model.py#L229
In PyTorch, how do I apply EMA to Variables?
A moving average is the key concept behind momentum in gradient descent.
In the PyTorch documentation you can find:
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
Change the parameter momentum to the value you want.
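If you want to reproduce the paper's per-weight moving averages directly (analogous to tf.train.ExponentialMovingAverage with decay 0.999), a minimal sketch in PyTorch could look like this; the EMA helper class is an illustrative assumption, not a built-in API:

import copy
import torch

class EMA:
    # Keeps an exponential moving average (shadow copy) of a model's
    # parameters, updated after each optimizer step.
    def __init__(self, model, decay=0.999):
        self.decay = decay
        self.shadow = {name: p.detach().clone()
                       for name, p in model.named_parameters() if p.requires_grad}

    def update(self, model):
        with torch.no_grad():
            for name, p in model.named_parameters():
                if name in self.shadow:
                    # shadow = decay * shadow + (1 - decay) * current_weight
                    self.shadow[name].mul_(self.decay).add_(p.detach(), alpha=1.0 - self.decay)

    def copy_to(self, model):
        # Load the averaged weights into (a copy of) the model for evaluation.
        with torch.no_grad():
            for name, p in model.named_parameters():
                if name in self.shadow:
                    p.copy_(self.shadow[name])

# Usage sketch:
# ema = EMA(model, decay=0.999)
# ... loss.backward(); optimizer.step(); ema.update(model) ...
# eval_model = copy.deepcopy(model); ema.copy_to(eval_model)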
I've been studying up on reinforcement learning, but the thing I don't understand is how a Q value is ever calculated. If you use the Bellman equation Q(s,a) = r + γ*max(Q(s',a')), wouldn't it just go on forever? Because Q(s',a') would need the Q value of one timestep further, and that would just continue on and on. How does it end?
In Reinforcement Learning you normally try to find a policy (the best action to take in a specific state), and the learning process ends when the policy does not change anymore or the value function (representing the expected reward) has converged.
You seem to be confusing Q-learning and Value Iteration using the Bellman equation. Q-learning is a model-free technique where you use the obtained reward to update Q:
Q(s_t, a_t) ← Q(s_t, a_t) + α * (r_{t+1} + γ * max_a' Q(s_{t+1}, a') - Q(s_t, a_t))
Here the direct reward r_{t+1} is the reward obtained after having taken action a_t in state s_t. α is the learning rate, which should be between 0 and 1: if it is 0 no learning is done, and if it is 1 only the newest reward is taken into account.
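As a minimal sketch of one such update (tabular case, states and actions indexed by integers; the surrounding environment loop is assumed):

import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # One tabular Q-learning step: move Q[s, a] toward the bootstrapped
    # target r + gamma * max_a' Q[s_next, a'].
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

# Q = np.zeros((n_states, n_actions)); apply the update after every step in
# the environment. The recursion never "goes on forever": Q[s_next] is just
# the current table entry, not an infinite expansion.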
Value Iteration with the Bellman equation:
V_{t+1}(s) = max_a Σ_{s'} P_a(s, s') * (R_a(s, s') + γ * V_t(s'))
Here a model P_a(s, s') is required, also written P(s' | s, a): the probability of going from state s to s' using action a. To check whether the value function has converged, V_{t+1} is normally compared to V_t for all states, and if the largest difference max_s |V_{t+1}(s) - V_t(s)| is smaller than a small value ε, the value function (and the policy derived from it) is said to have converged.
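A minimal tabular sketch of that loop (here P is assumed to be an array P[a, s, s'] of transition probabilities and R the matching per-transition rewards):

import numpy as np

def value_iteration(P, R, gamma=0.99, eps=1e-6):
    # P[a, s, s1]: probability of moving from s to s1 under action a.
    # R[a, s, s1]: reward for that transition.
    n_states = P.shape[1]
    V = np.zeros(n_states)
    while True:
        # Bellman optimality backup for all states at once.
        Q_sa = np.einsum("ast,ast->as", P, R + gamma * V[None, None, :])
        V_new = Q_sa.max(axis=0)
        if np.max(np.abs(V_new - V)) < eps:  # convergence check against epsilon
            return V_new
        V = V_new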
See also:
Difference between Q-learning and Value Iteration
How do I know when a Q-learning algorithm converges?
Sutton & Barto: Reinforcement Learning: An Introduction
I am trying to formulate a function to estimate the probability that a given 1-D vector is a periodic signal.
For instance, a sine or a cosine wave should result in a probability of 1, while a white-noise signal should result in a probability close to 0.
Can anyone help me come up with this function? Thanks in advance.
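One possible heuristic (an assumption on my part, not the only reasonable definition): look at how concentrated the signal's power spectrum is in its strongest frequency bin. A pure sinusoid puts nearly all of its power in one bin, while white noise spreads it across all bins:

import numpy as np

def periodicity_score(x):
    # Fraction of non-DC power sitting in the single strongest frequency bin:
    # close to 1 for a pure sinusoid, small for white noise.
    spectrum = np.abs(np.fft.rfft(x - np.mean(x))) ** 2
    power = spectrum[1:]              # drop the DC component
    total = power.sum()
    return 0.0 if total == 0 else power.max() / total

t = np.linspace(0, 1, 1000, endpoint=False)
print(periodicity_score(np.sin(2 * np.pi * 5 * t)))   # close to 1.0
print(periodicity_score(np.random.randn(1000)))       # small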
I have a fundamental question:
I would like to know why the FFT of this time series:
k<-c(4,5,6,2,3,1)
is equal to:
21.0+0.000000i 0.5-6.062178i -1.5-0.866025i 5.0-0.000000i -1.5+0.866025i 0.5+6.062178i
In a time series I have a set of points, but what is the result of fft? Are these also points?
Fourier says that any (non-pathological) waveform can be decomposed into a bunch of sinewaves. The FFT does that for reasonable samples of a given waveform.
So your FFT results are the coefficients of each sinewave sub-component: the first for 0 Hz (or DC, i.e. the sum), the second for a sinewave of 1 period per aperture, the next for 2 cycles per aperture, and so on. You can consider each coefficient x+iy either as a vector in the complex plane giving a sinewave's magnitude and phase, or as multipliers for a cosine and a sine that sum up to another sinewave of a specified phase.
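For the concrete vector above, a quick check (shown here with numpy; R's fft uses the same unnormalised convention) makes that interpretation visible:

import numpy as np

k = np.array([4, 5, 6, 2, 3, 1], dtype=float)
X = np.fft.fft(k)
print(X)  # approx: 21+0j, 0.5-6.0622j, -1.5-0.866j, 5+0j, -1.5+0.866j, 0.5+6.0622j

# X[0] is the 0 Hz (DC) term, i.e. the plain sum 4+5+6+2+3+1 = 21.
# X[m] is the coefficient of the component with m cycles per aperture;
# abs(X[m]) and angle(X[m]) give that sinewave's magnitude and phase.
# The original points come straight back from the coefficients:
print(np.fft.ifft(X).real)  # [4. 5. 6. 2. 3. 1.]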