Q learning for blackjack, reward function? - reinforcement-learning

I am currently learning reinforcement learning and am have built a blackjack game.
There is an obvious reward at the end of the game (payout), however some actions do not directly lead to rewards (hitting on a count of 5), which should be encouraged, even if the end result is negative (loosing the hand).
My question is what should the reward be for those actions ?
I could hard code a positive reward (fraction of the reward for winning the hand) for hits which do not lead to busting, but it feels like I am not approaching the problem correctly.
Also, when I assign a reward for a win (after the hand is over), I update the q-value corresponding to the last action/state pair, which seems suboptimal, as this action may not have directly lead to the win.
Another option I thought is to assign the same end reward to all of the action/state pairs in the sequence, however, some actions (like hitting on count <10) should be encouraged even if it leads to a lost hand.
Note: My end goal is to use deep-RL with an LSTM, but I am starting with q-learning.

I would say to start simple and use the rewards the game dictates. If you win, you'll receive a reward +1, if you lose -1.
It seems you'd like to reward some actions based on human knowledge. Maybe start with using epsilon greedy and let the agent discover all actions. Play along with the discount hyperparameter which determines the importance of future rewards, and look if it comes with some interesting strategies.
This blog is about RL and Blackjack.
https://towardsdatascience.com/playing-blackjack-using-model-free-reinforcement-learning-in-google-colab-aa2041a2c13d

Related

Understanding Markov Property further

I was studying about the markov property in reinforcement learning, which is supposed to be one of the important assumptions of this field. In that it says, that while considering the probability of the future, we consider only the present state and actions and not that of the past. An important corollary that arises when we consider the probability of the present state given future state/action, the future state/action can't be ignored as it has valuable information in the computation of the present probability.
I do not understand this second statement. From the point of view of the future event, the present event seems to be the past for this future event. Then why are we considering this past event?
Let's focus on these two sentences individually. The Markov Property (which should apply in your problem, but in reality doesn't have to) says that the current state is all you need to look at to make your decision (e.g. a "screenshot" -aka observation- of the chess board is all you need to look at to make an optimal action). On the other hand, if you need to look at some old state (or observation) to understang something that is not implied in your current state, then the Markov property is not satisfied (e.g. you can't usually use a single frame of a videogame as a state, since you may be missing info regarding the velocity and acceleration of some moving objects. This is also why people use frame-stacking to "solve" video games using RL).
Now, regarding the future events which seems to be considered as past events: when the agent takes an action, it moves from one state to another. Remember that in RL you want to maximize the cumulative reward, that is the sum of all the rewards long-term. This also mean that you basically want to take action even sacrifying instantaneous "good" reward if this means obtaining better "future" (long-term) reward (e.g. sometimes you don't want to take the enemy queen if this allows the enemy to check-mate you in the next move). This is why in RL we try to estimate value-functions (state and/or action). State value-functions is a value assigned to a state which should represent how good is being in that state in a long-term perspective.
How is an agent supposed to know the future reward (aka calculate these value functions)? By exploring a lot of states and taking random actions (literally trial and error). Therefore, when an agent is in a certain "state1" and has to choose between taking action A and action B, he will NOT choose the one that has given him the best instantaneous reward, but the one which has made him get better rewards "long-term", that is the action with the bigger action-value, which will take into account not only the instantaneous rewards he gets from the transition from state1 to the next state, but also the value-function of that next state!
Therefore, future events in that sentence may seem to be considered as past events because estimating the value function require that you have been in those "future states" a lot of times during past iterations!
Hope I've been helpful

What's the principle to design the reward function, of DQN?

I'm designing a reward function of a DQN model, the most tricky part of Deep reinforcement learning part. I referred several cases, and noticed usually the reward will set in [-1, 1]. Considering if the negative reward is triggered less times, more "sparse" compared with positive reward, the positive reward could be lower than 1.
I wish to know why should I set always try to set the reward within this range (sometimes it can be [0,1], other times could be [-1,0] or simply -1)? What's the theory or principle behind the range?
I went through this answer; it mentioned set the 500 as positive reward and -1 as negative reward will destroy the network. But how would it destroy the model?
I can vaguely understand that correlated with gradient descent, and actually it's the gap between rewards matters, not the sign or absolute value. But I'm still missing clear hint how it can destroy, and why in such range.
Besides, when should I utilize reward like [0,1] or use only negative reward? I mean, within given timestep, both methods seems can push the agent to find the highest total reward. Only in situation like I want to let the agent reach the final point asap, negative reward will seems more appropriate than positive reward.
Is there a criteria to measure if the reward is designed reasonable? Like use the Sum the Q value of good action and bad action, it it's symmetrical, the final Q should around zero which means it converge?
I wish to know why should I set always try to set the reward within this range (sometimes it can be [0,1], other times could be [-1,0] or simply -1)?
Essentially it's the same if you define your reward function in either [0,1] or [-1,0] range. It will just result in your action values being positive or negative, but it wouldn't affect the convergence of your neural network.
I went through this answer; it mentioned set the 500 as positive reward and -1 as negative reward will destroy the network. But how would it destroy the model?
I wouldn't really agree with the answer. Such a reward function wouldn't "destroy" the model, however it is incapable of providing a balanced positive and negative reward for the agent's action. It provides incentive for the agent not to crash, however doesn't encourage it to cut off opponents.
Besides, when should I utilize reward like [0,1] or use only negative reward?
As mentioned previously, it doesn't matter if you use positive or negative reward. What matters is the relativity of your reward. For example as you said if you want the agent to reach the terminal state asap, thus introducing negative rewards, it will only work if no positive reward is present during the episode. If the agent could pick up positive reward midway through the episode, it would not be incentivized to end the episode asap. Therefore, it's the relativity that matters.
What's the principle to design the reward function, of DQN?
As you said, this is the tricky part of RL. In my humble opinion, the reward is "just" the way to leads your system to the (state, action) pairs that you valuate most. So, if you consider that one pair (state, action) is 500x greater than the other, why not?
About the range of values... suppose that you know all the rewards that can be assigned, thus you know the range of values, and you could easily normalize it, let's say to [0,1]. So, the range doesn't mean to much, but the values that you assigned says a lot.
About negative reward values. In general, I find it in problems where the objective is to minimize costs. For instance, if you have a robot that has the goal do collect trash in a room, and from time to time he has to recharge himself to continue doing this task. You could have negative rewards regarding battery consumption, and your goal is to minimize it. On another hand, in many games the goal is to score more and more points, so can be natural to assign positive values.

Difference between optimisation algorithms and reinforcement learning methods

I have a sense that one step task of reinforcement learning is essentially the same with some optimisation algorithms.
For example, suppose there is only one parameter α and we try to optimise y using gradient descent for optimisation, then in each iteration(or step), α is actually moving slightly towards the direction with δy. The step is exactly the same in reinforcement learning, where δy is named as temporal difference and y is the value of that state S(a).
So, I wonder for 1 step reinforcement learning problems, is it actually a optimisation method, or can it be used to optimise parameters?(based on the context above)
I might have some misunderstanding on this, welcome to correctify.
First of all, reinforcement learning is very general. Almost any optimization problem can be transformed into a RL problem. It's usually not worth it, because a RL agent would select sub-optimal actions, doing trial and error just to confirm things you already know by design.
To your question: I think the similarity you found is that both algorithms make use of a (noisy) gradient step. Temporal difference is just one RL method of many. If I remember correctly it calculates the difference between the predicted value and the (noisy) value estimate made with the observed reward. It cannot simply set the correct value, because in general there is a complicated dependency between the values of other states, so instead it makes just one a small step to reduce the difference.
Sure, you could set up a RL task somehow to optimize reward = y(α). Now α can either be the agent's "state", in which case you need actions decrement or increment it (you learn state-values) or α can be the action in which case there is only a single state (you learn action-values). With the right exploration strategy it might even work if you are patient. But in both cases you waste your knowledge about the gradient δy(α)/δα because the RL algorithm does not know about it. Yes it takes gradient-steps, but those gradients reduce the difference between the learned value and the actual value. If the true values are exactly the rewards (which is true if the agent dies after one step, and if there is no randomness when you evaluate y(α)) then this is wasted effort. Instead of taking a small step to smooth out the non-existing influence on other states, you could have just set it to the true value directly.
You mentioned "one-step reinforcement learning": what comes to mind is the contextual bandit setup. It's a simplification of the full-blown RL setup where your actions do not influence the next state (=context). The next simplification is the multi-armed bandit, which only has actions but no state/context.

Which reinforcement learning algorithm is applicable to a problem with a continuously variable reward and no intermediate rewards?

I think the title says it. A "game" takes a number of moves to complete, at which point a total score is computed. The goal is to maximize this score, and there are no rewards provided for specific moves during the game. Is there an existing algorithm that is geared toward this type of problem?
EDIT: By "continuously variable" reward, I mean it is a floating point number, not a win/loss binary. So you can't, for example, respond to "winning" by reinforcing the moves made to get there. All you have is a number. You can rank different runs in order of preference, but a single result is not especially meaningful.
First of all, in my opinion, the title of your question seems a little confusing when you talk about "continuously variable reward". Maybe you could clarify this aspect.
On the other hand, without taking into account the previous point, it looks your are talking about the temporal credit-assigment problem: How do you distribute credit for a sequence of actions which only obtain a reward (positive or negative) at the end of the sequence?
E.g., a Tic-tac-toe game where the agent doesn't recive any reward until the game ends. In this case, almost any RL algorithm tries to solve the temporal credit-assigment problem. See, for example, Section 1.5 of Sutton and Barto RL book, where they explain the working principles of RL and its advantages over other approaches using as example a Tic-tac-toe game.

Efficient reward range in deep reinforcement learning

When selecting reward value in DQN, Actor-Critic or A3C, is there any common rules to select reward value??
As I heard briefly, (-1 ~ +1) reward is quite efficient selection.
Can you tell me any suggestion and the reason ??
Ideally, you want to normalize your rewards (i.e., 0 mean and unit variance). In your example, the reward is between -1 to 1, which satisfies this condition. I believe the reason was because it speeds up gradient descent when updating your parameters for your neural network and also it allows your RL agent to distinguish good and bad actions more effectively.
An example: Imagine we are trying to build an agent to cross the street, and if it crosses the street, it gains a reward of 1. If it gets hit by a car, it gets a reward of -1, and each step yields a reward of 0. Percentage-wise, the reward for success is massively above the reward for failure (getting hit by a car).
However, if we give the agent a reward of 1,000,000,001 for successfully crossing the road, and giving it a reward of 999,999,999 for getting hit by a car (this scenario and the above are identical when normalized), the success is no longer as pronounced as previously. Also, if you discount such high rewards, it will make the distinction of the two scenarios even harder to identify.
This is especially a problem in DQN and other function approximation methods because these methods generalize the state, action, and reward spaces. So a reward of -1 and 1 are massively different, however, a reward of 1,000,000,001 and 999,999,999 are basically identical if we were to use a function to generalize it.