I'm coding a simple Q-learning example, and to update the Q-values you need a maxQ'.
I'm not sure whether maxQ' refers to the sum of all possible rewards or the highest possible reward.
That is the maximum Q-value among all possible actions for the state s'. Basically, you need to take a max over Q(s', a') for all valid actions a' in state s'.
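For concreteness, here is a minimal sketch of how that max enters the tabular update; the table size, alpha, gamma, and the sample transition are made-up illustrations, not taken from your code.

import numpy as np

n_states, n_actions = 4, 2
Q = np.zeros((n_states, n_actions))  # tabular Q-values
alpha, gamma = 0.1, 0.9              # learning rate, discount factor

def q_update(s, a, r, s_next, done):
    # maxQ' is the largest Q-value over all actions available in state s'
    max_q_next = 0.0 if done else np.max(Q[s_next])
    Q[s, a] += alpha * (r + gamma * max_q_next - Q[s, a])

# example transition: state 0, action 1, reward 1.0, next state 2
q_update(s=0, a=1, r=1.0, s_next=2, done=False)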
I have a dataset with 6 numeric inputs and a numeric target column to predict. I'm using LightGBM and getting considerably good results. As far as I know, the predicted values are the learning-rate-weighted averages of the tree leaves, and each leaf's value is the mean of the observations it contains. But I suspect that the observations in each leaf may not be normally distributed, so I want to use the median, or maybe a trimmed mean, of each leaf instead of the mean. Is there an option for this?
I'm designing the reward function of a DQN model, the trickiest part of deep reinforcement learning. I referred to several cases and noticed that the reward is usually set in [-1, 1]. Considering that the negative reward is triggered fewer times, i.e. it is more "sparse" than the positive reward, the positive reward could be set lower than 1.
I wish to know why I should always try to set the reward within this range (sometimes it can be [0, 1], other times [-1, 0], or simply -1). What is the theory or principle behind the range?
I went through this answer; it mentioned that setting 500 as the positive reward and -1 as the negative reward will destroy the network. But how would it destroy the model?
I can vaguely understand that this is correlated with gradient descent, and that it is actually the gap between rewards that matters, not the sign or absolute value. But I'm still missing a clear hint as to how it can destroy the model, and why such a range is used.
Besides, when should I use a reward like [0, 1] or use only negative rewards? I mean, within a given number of timesteps, both methods seem able to push the agent to find the highest total reward. Only in a situation where I want the agent to reach the final point asap does a negative reward seem more appropriate than a positive one.
Is there a criterion to measure whether the reward is designed reasonably? For example, sum the Q-values of good actions and bad actions; if they are symmetrical, the final Q should be around zero, which would mean it has converged?
I wish to know why I should always try to set the reward within this range (sometimes it can be [0, 1], other times [-1, 0], or simply -1)?
Essentially it's the same whether you define your reward function in the [0, 1] or the [-1, 0] range. It will just result in your action values being positive or negative, but it won't affect the convergence of your neural network.
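A tiny numerical sketch of that point (made-up two-step episodes in which both actions from the start state lead to episodes of the same length, so the shift affects every action equally):

gamma = 1.0  # undiscounted for simplicity

# action A earns 0.8 then 0.5; action B earns 0.3 then 0.5 (rewards in [0, 1])
returns_01 = {"A": 0.8 + gamma * 0.5, "B": 0.3 + gamma * 0.5}
# the same episodes with every reward shifted by -1 (rewards in [-1, 0])
returns_neg = {"A": -0.2 + gamma * -0.5, "B": -0.7 + gamma * -0.5}

print(max(returns_01, key=returns_01.get))   # 'A' (values 1.3 vs 0.8)
print(max(returns_neg, key=returns_neg.get)) # 'A' (values -0.7 vs -1.2)

The greedy choice is identical; only the sign and scale of the action values change.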
I went through this answer; it mentioned that setting 500 as the positive reward and -1 as the negative reward will destroy the network. But how would it destroy the model?
I wouldn't really agree with that answer. Such a reward function wouldn't "destroy" the model; however, it is incapable of providing a balanced positive and negative reward for the agent's actions. It provides an incentive for the agent not to crash, but doesn't encourage it to cut off opponents.
Besides, when should I use a reward like [0, 1] or use only negative rewards?
As mentioned previously, it doesn't matter whether you use positive or negative rewards. What matters is the relativity of your rewards. For example, as you said, if you want the agent to reach the terminal state asap and therefore introduce a negative per-step reward, it will only work if no positive reward is present during the episode. If the agent could pick up positive rewards midway through the episode, it would not be incentivized to end the episode asap. Therefore, it's the relativity that matters.
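A toy calculation of that incentive effect (made-up numbers, undiscounted returns):

# per-step reward of -1 until the terminal state: shorter episodes score higher,
# so the agent is pushed to reach the goal asap
print(-1 * 3, -1 * 10)              # -3 vs -10 -> finishing in 3 steps wins

# add a +2 pickup available on every step: the net per-step reward becomes +1,
# so longer episodes now score higher and the "finish asap" incentive is gone
print((-1 + 2) * 3, (-1 + 2) * 10)  # 3 vs 10 -> loitering for 10 steps wins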
What's the principle for designing the reward function of a DQN?
As you said, this is the tricky part of RL. In my humble opinion, the reward is "just" the way to lead your system to the (state, action) pairs that you value most. So, if you consider one (state, action) pair to be 500x more valuable than another, why not?
About the range of values: suppose that you know all the rewards that can be assigned; then you know the range of values and could easily normalize it, say to [0, 1]. So the range doesn't mean too much, but the values that you assign say a lot.
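As a sketch of that normalization (the reward list is just an assumed example, and it presumes you really do know every reward the environment can emit):

rewards = [-10.0, -1.0, 0.0, 5.0, 500.0]
lo, hi = min(rewards), max(rewards)

def normalize(r):
    # map the known reward range linearly onto [0, 1]; relative gaps are preserved
    return (r - lo) / (hi - lo)

print([round(normalize(r), 3) for r in rewards])  # [0.0, 0.018, 0.02, 0.029, 1.0]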
About negative reward values: in general, I find them in problems where the objective is to minimize costs. For instance, suppose you have a robot whose goal is to collect trash in a room, and from time to time it has to recharge itself to continue the task. You could have negative rewards for battery consumption, and your goal would be to minimize them. On the other hand, in many games the goal is to score more and more points, so it can be natural to assign positive values.
I read several materials about deep Q-learning and I'm not sure I understand it completely. From what I learned, it seems that deep Q-learning computes Q-values on the fly rather than storing them in a table, by using a NN to perform a regression, calculating the loss, and backpropagating the error to update the weights. Then, in a testing scenario, it takes a state and the NN returns a Q-value for each action possible in that state, and the action with the highest Q-value is chosen to be performed in that state.
My only question is how the weights are updated. According to this site, the weights are updated as follows:

Δw = α * [R + γ * max_a' Q(s', a', w) − Q(s, a, w)] * ∇_w Q(s, a, w)
I understand that the weights are initialized randomly, R is returned by the environment, and gamma and alpha are set manually, but I don't understand how Q(s', a, w) and Q(s, a, w) are initialized and calculated. Does it mean we should build a table of Q-values and update them as in Q-learning, or are they calculated automatically at each NN training epoch? What am I not understanding here? Can somebody explain this equation to me better?
In Q-learning, we are concerned with learning the Q(s, a) function, which is a mapping from a state to a value for each action. Say you have an arbitrary state space and an action space of 3 actions; each state then maps to three different values, one per action. In tabular Q-learning, this is done with a physical table. Consider the following case:
Here, we have a Q-table entry for each state in the game (upper left). After each time step, the Q-value for the specific action taken is updated according to some reward signal. The reward signal can be discounted by some value between 0 and 1.
In Deep Q-Learning, we disregard the use of tables and create a parametrized "table" such as this:
Here, all of the weights form combinations of the given input that should approximately match the values seen in the tabular case (this is still actively researched).
The equation you presented is the Q-learning update rule set in a gradient update rule.
alpha is the step size
R is the reward
gamma is the discount factor
You do inference with the network to retrieve the value of the "discounted future state" and subtract the "current" state's value from it. If this is unclear, I recommend you look up bootstrapping, which is basically what is happening here.
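To make those two network evaluations concrete, here is a compact sketch of the update with a linear approximator standing in for the deep network; the feature encoding, sizes, and hyperparameters are illustrative assumptions, not anything prescribed by DQN.

import numpy as np

n_features, n_actions = 4, 3
w = np.random.randn(n_actions, n_features) * 0.01  # the "parametrized table": weights, not stored Q-values
alpha, gamma = 0.01, 0.99

def q_values(state):
    # one forward pass returns a Q-value for every action; no table lookup anywhere
    return w @ np.asarray(state, dtype=float)

def update(s, a, r, s_next, done):
    # bootstrapped target: reward plus the discounted best Q-value of the next state
    target = r if done else r + gamma * np.max(q_values(s_next))
    td_error = target - q_values(s)[a]
    # semi-gradient step: for a linear model, dQ(s, a, w)/dw is just the feature vector
    w[a] += alpha * td_error * np.asarray(s, dtype=float)

# example transition with a 4-dimensional state encoding
update(s=[1, 0, 0, 0], a=2, r=1.0, s_next=[0, 1, 0, 0], done=False)

So Q(s, a, w) and Q(s', a', w) are never initialized or stored anywhere; both are recomputed by forward passes of the (randomly initialized) network each time the update is applied.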
First time here, so forgive me for any faux pas. I have a question about the limitations of SQL, as I am new to the language, and what I need is, I believe, rather complex.
Is it possible to automate finding the optimal data for a specific query? For example, say I have the following columns:
1) Vehicle type (Text) e.g. car, bike, bus
2) Number of passengers (Numeric) e.g. 0-7
3) Was in an accident (Boolean) e.g. t or f
From here, I would like to get percentages. So if I were to select only cars with 3 passengers, what percentage of the total accidents does that account for?
I understand how to get this as a one-off or calculate it mathematically; my question is about how to automate this process to get the optimum number.
So, keeping with this example, say I look at just cars, what number of passengers covers the highest percentage of accidents?
At the moment I am going through and testing number by number; is there a way to 'find' the optimal number? It is easy when it is just 0-7 as in the example, but I would naturally like to deal with a larger range and even multiple ranges. For example, say we add another variable:
4) Number of doors (Numeric) e.g. 0-3
Would there be a way of finding the best combination of numbers from these two variables that cover the highest percentage of accidents?
So say we took: Car, >2 passengers, <3 doors on the vehicle. Out of the accident variable, 50% were true.
But if we change that to: Car, >4 passengers, <3 doors, out of the accident variable, 80% were true.
I hope I have explained this well. I understand that this is most likely not possible with SQL, however is there another way to find these optimum numbers?
Thanks in advance
Here's an example that will give you an answer for all possibilities. You could add a LIMIT clause to show only the top answer, or add a WHERE clause to limit it to specific terms.
SELECT
  `vehicle_type`,
  `num_passengers`,
  SUM(IF(`in_accident`, 1, 0)) AS `num_accidents`,
  COUNT(*) AS `num_in_group`,
  -- accident rate within each vehicle-type / passenger-count group
  SUM(IF(`in_accident`, 1, 0)) / COUNT(*) AS `percent_accidents`
FROM `accidents`
GROUP BY
  `vehicle_type`,
  `num_passengers`
-- highest rate first, so adding LIMIT 1 returns the single best combination
ORDER BY SUM(IF(`in_accident`, 1, 0)) / COUNT(*) DESC
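If you can pull the table out of the database, the same search over combinations of two variables can be done in one pass, for instance with pandas; the file name and column names below are assumptions mirroring the example.

import pandas as pd

# columns assumed: vehicle_type, num_passengers, num_doors, in_accident (boolean)
df = pd.read_csv("accidents.csv")

stats = (
    df[df["vehicle_type"] == "car"]
    .groupby(["num_passengers", "num_doors"])["in_accident"]
    .agg(num_in_group="count", percent_accidents="mean")  # mean of a boolean = accident rate
    .sort_values("percent_accidents", ascending=False)
)
print(stats.head(1))  # the passenger/door combination with the highest accident rate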
The goal is to make the frequency not so dominating.
Suppose A has an attack frequency of 100, and B's is 2.
I don't want to see such a big difference; I want to reduce it. How?
The goal is that A ends up at most 5 times faster than B, not 100/2 = 50 times.
But the mechanism should still make sure A is faster than B.
So I need a mechanism to achieve this.
Use the logarithm function to reduce the scale. For example in log base 2, A's score is between 6 and 7, while B has a score of 1. Multiply by a constant afterwards if you wish to scale the values up again. You can change the base of the logarithm to adjust how much you want to even out the differences.
Update: You will probably also want to add 1 to the score before taking the logarithm to ensure that scores below 1 don't get converted to large negative numbers.
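A quick sketch of that transformation (the +1 offset as described above; the base and the optional scaling constant are knobs you'd tune):

import math

def dampen(freq, base=2.0, scale=1.0):
    # log compresses large ratios; +1 keeps frequencies below 1 from going negative
    return scale * math.log(freq + 1, base)

a, b = dampen(100), dampen(2)
print(a, b, a / b)  # ~6.66 and ~1.58: A is now ~4.2x B instead of 50x, yet still faster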
You might consider using a Gaussian centered around 100 for A and around 2 for B. Dig into non-uniform random generators.
Or you can determine another attribute for your game and use the frequency as a factor!