I use n-step Sarsa/sometimes Sarsa(lambda)
After experimenting a bit with different epsilon schedules I found out that the agent learns faster when I change the epsilon during an episode based on the number of steps already taken and the mean length of the last 10 episodes.
Low number of steps/beginning of episode => Low epsilon
High number of steps/end of episode => High epsilon
This works far better than just an epsilon decay over time from episode to episode.
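A minimal sketch of the schedule described above, assuming a linear ramp between two bounds. The function name, `eps_min`, `eps_max` and the ramp shape are illustrative assumptions, not the exact values used in the experiments.

```python
from collections import deque


def step_epsilon(step, recent_lengths, eps_min=0.01, eps_max=0.3):
    """Epsilon for the current step of the current episode.

    Ramps linearly from eps_min at step 0 up to eps_max once the step
    count reaches the mean length of the recent episodes, then stays there.
    """
    if not recent_lengths:
        return eps_max  # no history yet, so explore
    mean_len = max(sum(recent_lengths) / len(recent_lengths), 1.0)
    progress = min(step / mean_len, 1.0)
    return eps_min + (eps_max - eps_min) * progress


# Keep only the last 10 episode lengths (hypothetical lengths shown here).
recent_lengths = deque([20, 24, 16], maxlen=10)
eps_start = step_epsilon(0, recent_lengths)   # low at the start of an episode
eps_mid = step_epsilon(10, recent_lengths)    # halfway through a typical episode
eps_late = step_epsilon(40, recent_lengths)   # high once the episode runs long
```

After each episode, the episode length would be appended to `recent_lengths`, and the deque drops the oldest entry automatically.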
Does the theory allow this?
I think yes because all states are still visited regularly.
Yes, the SARSA algorithm still converges when you update the epsilon parameter within each episode. The requirement is that epsilon should eventually tend to zero or to a small value.
In your case, if you start each episode with a small epsilon and increase it as the number of steps grows, it is not clear to me that your algorithm will converge towards an optimal policy. At some point epsilon has to decrease.
The "best" epsilon schedule is highly problem dependent, and no single schedule works well for all problems. In the end, it takes some experience with the problem and probably some trial-and-error tuning.
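One possible way to keep a within-episode ramp while still letting exploration die out is to shrink its ceiling from episode to episode, so the policy becomes greedy in the limit. This is only a sketch under assumptions: the function name, the `1 / (1 + decay * episode)` form and the constants are illustrative, and it is not claimed to be the schedule from the question.

```python
def decayed_step_epsilon(step, mean_len, episode, eps_max0=0.3, decay=0.01):
    """Within-episode ramp whose ceiling shrinks as episodes accumulate."""
    eps_cap = eps_max0 / (1.0 + decay * episode)    # tends to 0 as episode grows
    progress = min(step / max(mean_len, 1.0), 1.0)  # 0 at episode start, 1 late
    return eps_cap * progress


early_training = decayed_step_epsilon(step=20, mean_len=20, episode=0)
later_training = decayed_step_epsilon(step=20, mean_len=20, episode=100)
much_later = decayed_step_epsilon(step=20, mean_len=20, episode=100_000)
```

The ramp keeps the "low early, high late" shape inside an episode, but the late-episode value itself gets smaller over the course of training.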
The MALLET documentation mentions the following:
--num-iterations [NUMBER]
The number of sampling iterations should be a trade off between the time taken to complete sampling and the quality of the topic model.
MALLET furthermore provides an example:
// Run the model for 50 iterations and stop (this is for testing only,
// for real applications, use 1000 to 2000 iterations)
model.setNumIterations(50);
It is obvious that too few iterations lead to bad topic models.
However, does increasing the number of Gibbs sampling iterations necessarily benefit the quality of the topic model (measured by perplexity, topic coherence or on a downstream task)?
Or is it possible that the model quality decreases with the --num-iterations set to a too high value?
On a personal project, averaged over 10-fold cross-validation, increasing the number of iterations from 100 to 1000 did not affect the average accuracy (measured as Mean Reciprocal Rank) on a downstream task. However, within the individual cross-validation splits the performance changed significantly, although the random seed was fixed and all other parameters were kept the same. What background knowledge about Gibbs sampling am I missing to explain this behavior?
I am using a symmetric prior for alpha and beta without hyperparameter optimization and the parallelized LDA implementation provided by MALLET.
The 1000 iteration setting is designed to be a safe number for most collection sizes, and also to communicate "this is a large, round number, so don't think it's very precise". It's likely that smaller numbers will be fine. I once ran a model for 1000000 iterations, and fully half the token assignments never changed from the 1000 iteration model.
Could you be more specific about the cross validation results? Was it that different folds had different MRRs, which were individually stable over iteration counts? Or that individual fold MRRs varied by iteration count, but they balanced out in the overall mean? It's not unusual for different folds to have different "difficulty". Fixing the random seed also wouldn't make a difference if the data is different.
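To tell those two cases apart, it may help to tabulate the MRR per fold and per iteration count and compare the spread across folds with the spread within each fold. A minimal sketch; the `summarize` helper and the nested-dict layout are assumptions, and the numbers are placeholders for illustration only, not real results.

```python
from statistics import mean, pstdev


def summarize(mrr):
    """mrr[fold][n_iterations] -> MRR. Compare spread across and within folds."""
    per_fold = [list(by_iter.values()) for by_iter in mrr.values()]
    fold_means = [mean(values) for values in per_fold]
    iteration_counts = sorted(next(iter(mrr.values())))
    return {
        # How much folds differ from each other ("difficulty" of a fold).
        "spread_across_folds": pstdev(fold_means),
        # How much a single fold moves when only the iteration count changes.
        "mean_spread_within_fold": mean(pstdev(values) for values in per_fold),
        # The cross-validated mean for each iteration count.
        "mean_by_iterations": {
            n: mean(by_iter[n] for by_iter in mrr.values())
            for n in iteration_counts
        },
    }


# Placeholder numbers, NOT real results: folds move in opposite directions.
example = {
    "fold0": {100: 0.40, 1000: 0.50},
    "fold1": {100: 0.60, 1000: 0.50},
}
summary = summarize(example)
```

In this made-up example the per-fold MRRs change with the iteration count, yet the mean per iteration count stays the same, which is the "balanced out" case.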
I'm designing the reward function for a DQN model, which is the trickiest part of deep reinforcement learning. I looked at several cases and noticed that the reward is usually set in [-1, 1]. If the negative reward is triggered less often (that is, it is more "sparse" than the positive reward), the positive reward could be lower than 1.
I would like to know why I should always try to keep the reward within this range (sometimes it is [0,1], other times [-1,0] or simply -1). What is the theory or principle behind the range?
I went through this answer; it mentioned that setting 500 as the positive reward and -1 as the negative reward will destroy the network. But how would it destroy the model?
I vaguely understand that this is related to gradient descent, and that it is the gap between rewards that matters, not the sign or absolute value. But I am still missing a clear explanation of how it can destroy the model, and why this range in particular.
Besides, when should I use a reward in [0,1], and when only negative rewards? Within a given number of timesteps, both seem able to push the agent towards the highest total reward. Only when I want the agent to reach the final point as soon as possible does a negative reward seem more appropriate than a positive one.
Is there a criterion to measure whether the reward is designed reasonably? For example, summing the Q-values of the good action and the bad action: if they are symmetrical, the final Q should be around zero, which would mean it has converged?
I would like to know why I should always try to keep the reward within this range (sometimes it is [0,1], other times [-1,0] or simply -1).
Essentially it's the same if you define your reward function in either [0,1] or [-1,0] range. It will just result in your action values being positive or negative, but it wouldn't affect the convergence of your neural network.
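A small check of this on a toy problem, assuming a continuing (never terminating) discounted task: subtracting a constant `c` from every reward shifts every action value by `c / (1 - gamma)` and leaves the greedy policy unchanged. The two-state MDP, its rewards and the helper name are made up for illustration. With episodes of variable length, a constant shift can change which behaviour is preferred, so the sketch does not cover that case.

```python
import numpy as np


def q_value_iteration(P, R, gamma=0.9, iters=2000):
    """Tabular Q-value iteration.

    P[s, a, s2] are transition probabilities, R[s, a] are expected rewards.
    """
    Q = np.zeros_like(R, dtype=float)
    for _ in range(iters):
        Q = R + gamma * P @ Q.max(axis=1)
    return Q


# Hypothetical deterministic MDP: action 0 leads to state 0, action 1 to state 1.
P = np.zeros((2, 2, 2))
P[:, 0, 0] = 1.0
P[:, 1, 1] = 1.0

R_pos = np.array([[0.0, 0.2], [0.0, 1.0]])  # rewards in [0, 1]
R_neg = R_pos - 1.0                         # same rewards shifted into [-1, 0]

Q_pos = q_value_iteration(P, R_pos)
Q_neg = q_value_iteration(P, R_neg)
same_policy = np.array_equal(Q_pos.argmax(axis=1), Q_neg.argmax(axis=1))
shift = Q_pos - Q_neg  # expected to be 1 / (1 - 0.9) = 10 everywhere
```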
I went through this answer; it mentioned that setting 500 as the positive reward and -1 as the negative reward will destroy the network. But how would it destroy the model?
I wouldn't really agree with the answer. Such a reward function wouldn't "destroy" the model; however, it cannot provide balanced positive and negative rewards for the agent's actions. It gives the agent an incentive not to crash, but it doesn't encourage it to cut off opponents.
Besides, when should I use a reward in [0,1], and when only negative rewards?
As mentioned previously, it doesn't matter whether you use positive or negative rewards. What matters is how the rewards compare with each other. For example, as you said, if you want the agent to reach the terminal state as soon as possible and introduce negative rewards for that purpose, it will only work if no positive reward is available during the episode. If the agent could pick up positive rewards midway through the episode, it would not be incentivized to end the episode as soon as possible. Therefore, it's the relative values that matter.
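The effect of the per-step reward on episode length can be sketched with a few lines of arithmetic. The helper below is hypothetical and assumes a constant reward on every step plus an optional bonus on the last step, undiscounted by default.

```python
def episode_return(n_steps, step_reward, final_reward=0.0, gamma=1.0):
    """Return of an episode with a constant per-step reward.

    final_reward is added on the last step of the episode.
    """
    g = sum(step_reward * gamma**t for t in range(n_steps))
    return g + final_reward * gamma ** (n_steps - 1)


# Negative per-step reward: the shorter episode has the higher return.
short_penalized = episode_return(5, step_reward=-1.0)
long_penalized = episode_return(20, step_reward=-1.0)

# Positive per-step reward: the longer episode has the higher return,
# so the agent has no incentive to finish quickly.
short_rewarded = episode_return(5, step_reward=1.0)
long_rewarded = episode_return(20, step_reward=1.0)
```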
What's the principle to design the reward function, of DQN?
As you said, this is the tricky part of RL. In my humble opinion, the reward is "just" the way to lead your system to the (state, action) pairs that you value most. So, if you consider one (state, action) pair to be 500x more valuable than another, why not?
About the range of values: suppose you know all the rewards that can be assigned. Then you know the range of values, and you could easily normalize it, let's say to [0,1]. So the range itself doesn't mean too much, but the values that you assign say a lot.
About negative reward values: in general, I find them in problems where the objective is to minimize costs. For instance, say you have a robot whose goal is to collect trash in a room, and from time to time it has to recharge itself to continue doing this task. You could assign negative rewards for battery consumption, and your goal is to minimize it. On the other hand, in many games the goal is to score more and more points, so it can be natural to assign positive values.
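As a sketch of that normalization, assuming the smallest and largest possible rewards are known in advance (the function name and the example range of -1 to 500 are only illustrative):

```python
def rescale_reward(r, r_min, r_max):
    """Map a reward from the known range [r_min, r_max] onto [0, 1]."""
    if r_max <= r_min:
        raise ValueError("r_max must be greater than r_min")
    return (r - r_min) / (r_max - r_min)


lowest = rescale_reward(-1.0, r_min=-1.0, r_max=500.0)
highest = rescale_reward(500.0, r_min=-1.0, r_max=500.0)
neutral = rescale_reward(0.0, r_min=-1.0, r_max=500.0)
```

The rescaling changes the range but not the ordering of the rewards, nor the ratios between reward differences.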
I am currently learning reinforcement learning and have built a blackjack game.
There is an obvious reward at the end of the game (the payout). However, some actions do not directly lead to rewards (hitting on a count of 5) but should still be encouraged, even if the end result is negative (losing the hand).
My question is: what should the reward be for those actions?
I could hard code a positive reward (fraction of the reward for winning the hand) for hits which do not lead to busting, but it feels like I am not approaching the problem correctly.
Also, when I assign a reward for a win (after the hand is over), I update the q-value corresponding to the last action/state pair, which seems suboptimal, as this action may not have directly led to the win.
Another option I thought is to assign the same end reward to all of the action/state pairs in the sequence, however, some actions (like hitting on count <10) should be encouraged even if it leads to a lost hand.
Note: My end goal is to use deep-RL with an LSTM, but I am starting with q-learning.
I would say to start simple and use the rewards the game dictates. If you win, you'll receive a reward +1, if you lose -1.
It seems you'd like to reward some actions based on human knowledge. Maybe start by using epsilon-greedy and let the agent discover all actions. Play around with the discount hyperparameter, which determines the importance of future rewards, and see whether the agent comes up with some interesting strategies.
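With only the end-of-hand reward, earlier actions still receive credit because each Q-learning update bootstraps from the value of the next state. A minimal tabular sketch, assuming the state is just the player's total (a real blackjack state would also include the dealer's card and whether there is a usable ace) and replaying one hypothetical winning hand over and over:

```python
from collections import defaultdict


def q_update(Q, s, a, r, s_next, next_actions, alpha=0.5, gamma=1.0):
    """One tabular Q-learning update.

    An empty next_actions list marks s_next as terminal (value 0).
    """
    best_next = max((Q[(s_next, a2)] for a2 in next_actions), default=0.0)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])


Q = defaultdict(float)
# Toy trajectory: hit on 5, reach 15, stand, win. Reward +1 only at the end.
for _ in range(50):
    q_update(Q, s=5, a="hit", r=0.0, s_next=15, next_actions=["hit", "stand"])
    q_update(Q, s=15, a="stand", r=1.0, s_next=None, next_actions=[])

value_of_early_hit = Q[(5, "hit")]
value_of_final_stand = Q[(15, "stand")]
```

The hit on 5 never receives a reward of its own, yet its value rises towards the value of the state it leads to, so no hand-coded intermediate reward is needed in this toy case.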
This blog is about RL and Blackjack.
https://towardsdatascience.com/playing-blackjack-using-model-free-reinforcement-learning-in-google-colab-aa2041a2c13d
I think the title says it. A "game" takes a number of moves to complete, at which point a total score is computed. The goal is to maximize this score, and there are no rewards provided for specific moves during the game. Is there an existing algorithm that is geared toward this type of problem?
EDIT: By "continuously variable" reward, I mean it is a floating point number, not a win/loss binary. So you can't, for example, respond to "winning" by reinforcing the moves made to get there. All you have is a number. You can rank different runs in order of preference, but a single result is not especially meaningful.
First of all, in my opinion, the title of your question seems a little confusing when you talk about "continuously variable reward". Maybe you could clarify this aspect.
On the other hand, leaving the previous point aside, it looks like you are talking about the temporal credit-assignment problem: how do you distribute credit over a sequence of actions that only obtains a reward (positive or negative) at the end of the sequence?
Take, for example, a Tic-tac-toe game where the agent doesn't receive any reward until the game ends. Almost any RL algorithm tries to solve the temporal credit-assignment problem in this setting. See, for example, Section 1.5 of the Sutton and Barto RL book, where they explain the working principles of RL and its advantages over other approaches using a Tic-tac-toe game as an example.
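As a concrete illustration, a floating-point score given only at the end can be turned into a learning target for every move by computing discounted returns backwards through the game (the Monte Carlo view of credit assignment). This is a minimal sketch; the discount of 0.9 and the final score of 7.5 are arbitrary assumptions.

```python
def discounted_returns(rewards, gamma=0.9):
    """Return G_t for every step t, given the per-step rewards of one game."""
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g  # G_t = r_t + gamma * G_{t+1}
        returns.append(g)
    returns.reverse()
    return returns


# Three moves, no reward until the final score arrives on the last move.
returns = discounted_returns([0.0, 0.0, 7.5], gamma=0.9)
```

Every move gets a nonzero target even though only the last one was rewarded, and the score does not have to be a win/loss flag for this to work.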
When selecting reward values in DQN, Actor-Critic or A3C, are there any common rules for choosing them?
I have heard, briefly, that a reward in the range -1 to +1 is quite an efficient choice.
Can you give me any suggestions, and the reasons behind them?
Ideally, you want to normalize your rewards (i.e., zero mean and unit variance). In your example, the reward lies between -1 and 1, which keeps it on roughly that scale. I believe the reason is that this speeds up gradient descent when updating the parameters of your neural network, and it also lets your RL agent distinguish good and bad actions more effectively.
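One way to see the link to gradient descent, as a sketch under simplifying assumptions: for a squared TD loss on a terminal transition, the gradient is proportional to the gap between the reward and the current prediction, so the reward scale directly sets the step size. The one-weight linear model below is hypothetical and only meant to show the scaling.

```python
def td_gradient(reward, q_pred, feature=1.0):
    """Gradient of 0.5 * (reward - q_pred)**2 with respect to the weight w,
    where q_pred = w * feature and the transition is terminal
    (so the target is just the reward)."""
    return -(reward - q_pred) * feature


grad_small = td_gradient(reward=1.0, q_pred=0.0)
grad_large = td_gradient(reward=500.0, q_pred=0.0)
scale_ratio = grad_large / grad_small
```

With the same learning rate, the update for the larger reward is larger by the same factor as the reward itself, which is why unnormalized rewards tend to require retuning the learning rate.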
An example: Imagine we are trying to build an agent to cross the street, and if it crosses the street, it gains a reward of 1. If it gets hit by a car, it gets a reward of -1, and each step yields a reward of 0. Percentage-wise, the reward for success is massively above the reward for failure (getting hit by a car).
However, if we give the agent a reward of 1,000,000,001 for successfully crossing the road and a reward of 999,999,999 for getting hit by a car (this scenario and the one above are identical when normalized), the success is no longer as pronounced as before. Also, discounting such large rewards makes the two scenarios even harder to tell apart.
This is especially a problem in DQN and other function approximation methods, because these methods generalize across the state, action, and reward spaces. Rewards of -1 and 1 are massively different, whereas rewards of 1,000,000,001 and 999,999,999 are basically identical to a function that has to generalize over them.
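The two numbers from this example can be checked directly. A short sketch with NumPy, assuming the network works in 32-bit floats (a common default, but an assumption here):

```python
import numpy as np

raw = np.array([1_000_000_001, 999_999_999], dtype=np.float64)

# Standardizing recovers the same +1 / -1 picture as the first scenario.
standardized = (raw - raw.mean()) / raw.std()

# Relative to their size, the two raw rewards are almost indistinguishable.
relative_gap = (raw[0] - raw[1]) / raw.mean()

# In 32-bit floating point, both values round to the same number.
same_in_float32 = bool(np.float32(raw[0]) == np.float32(raw[1]))
```

So in single precision the two rewards are not merely close but literally the same value, while after standardization they are as far apart as -1 and 1.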