Question about reinforcement learning action and observation space size

I am trying to build a custom environment for a reinforcement learning (RL) project.
In examples such as Pong, other Atari games, and Super Mario, the action and observation spaces are really small.
My project's action and observation spaces, however, are much larger than in those examples: I will need at least 5000+ actions and observations.
How can I handle such a massive number of actions and observations effectively?
Currently I am using Q-table learning with a wrapper function to handle the spaces, but this seems very ineffective.

Yes, Q-table learning is quite old and requires an extremely large amount of memory, since it stores a Q-value for every state-action pair in a table. In your case, Q-table learning does not seem good enough. A better choice would be a Deep Q-Network (DQN), which replaces the table with a neural network, although even that is not especially efficient here.
The huge observation space is fine, but an action space of 5000+ actions is very large and will require a lot of time to converge. To reduce the training time, I would recommend PPO.
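To make the DQN suggestion concrete, here is a minimal sketch (assuming PyTorch; obs_dim, n_actions, and the hidden size are made-up placeholders for whatever your environment actually exposes) of a Q-network that replaces the Q-table. One forward pass scores every action at once, so memory no longer grows with the number of state-action pairs the way a table does.

```python
# Minimal sketch, not a drop-in implementation: a Q-network in place of a Q-table.
# Assumes observations are encoded as a fixed-length float vector of size obs_dim
# and the 5000+ actions form a single discrete space of size n_actions.
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),  # one Q-value per action
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

# Example usage with made-up sizes:
q_net = QNetwork(obs_dim=128, n_actions=5000)
obs = torch.randn(32, 128)           # batch of 32 observations
q_values = q_net(obs)                # shape (32, 5000)
greedy_actions = q_values.argmax(dim=1)
```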


Train neural network with unlimited training data

Does fitting a neural network with generated, and thus infinite training samples work?
Will the batch size still matter? Should training samples be repeated over some batches or is it ok to use every sample only once?
Does the term "epoch" make any sense if there is no complete dataset to iterate over?
Does validation make any sense if every sample from the training dataset is already a new one? If not, will the training loss behave like the validation loss would?
Does fitting a neural network with generated, and thus infinite, training samples work?
Yes, it is completely fine; in fact, it will likely be a much better setup than the one you are used to.
Will the batch size still matter?
Yes, the batch size controls the noise in the gradient estimate: the bigger the batch, the smaller the error.
Should training samples be repeated over some batches, or is it ok to use every sample only once?
If you can avoid repeating them and just keep generating, you will be in a cleaner mathematical setup; in practice it likely won't matter much.
Does the term "epoch" make any sense if there is no complete dataset to iterate over?
The term "epoch" is one of the big mistakes we made as a community; it really is meaningless even when the dataset is finite. Avoiding it completely will simplify your life: just think in terms of gradient updates/samples consumed and forget about epochs.
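As a minimal sketch of that mindset (assuming PyTorch; generate_batch is a hypothetical stand-in for whatever process produces your infinite stream of samples), a training loop budgeted in gradient updates rather than epochs might look like this:

```python
# Sketch: count gradient updates / samples consumed instead of epochs.
import torch
import torch.nn as nn

def generate_batch(batch_size: int):
    # Placeholder generator: y = sum(x) plus a little noise.
    x = torch.randn(batch_size, 10)
    y = x.sum(dim=1, keepdim=True) + 0.01 * torch.randn(batch_size, 1)
    return x, y

model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

num_updates = 10_000   # budget in gradient updates, not epochs
batch_size = 64        # bigger batch -> lower-variance gradient estimate

for step in range(num_updates):
    x, y = generate_batch(batch_size)   # every sample is fresh, never repeated
    loss = loss_fn(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```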
Does validation make any sense if every sample from the training dataset is already a new one? If not, will the training loss behave like the validation loss would?
It does still make sense as an additional check that you are making progress; just make sure you do not "generate" your validation set during training. That being said, it is much less important than in other cases, as long as your test scenario is also going to be generated in the same way. For example, this is a reason why many RL papers (especially from the Atari era) did not have validation sets: the training and test "environments" were exactly the same.

Why is a target network required?

I am having trouble understanding why a target network is necessary in DQN. I am reading the paper "Human-level control through deep reinforcement learning".
I understand Q-learning. Q-learning is a value-based reinforcement learning algorithm that learns the "optimal" state-action value function, which maximizes its long-term discounted reward over a sequence of timesteps.
Q-learning is updated using the Bellman equation, and a single step of the Q-learning update is given by
$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ R_{t+1} + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$
where $\alpha$ and $\gamma$ are the learning rate and discount factor.
I can understand that the reinforcement learning algorithm can become unstable and diverge when a nonlinear function approximator such as a neural network is used.
The experience replay buffer is used so that we do not forget past experiences and to de-correlate the samples used for learning.
This is where I fail.
Let me break the paragraph from the paper down here for discussion:
"The fact that small updates to $Q$ may significantly change the policy and therefore change the data distribution": I understood this part. Periodic changes to the Q-network may lead to instability and changes in the data distribution, for example if we end up always taking a left turn or something like that.
"and the correlations between the action-values ($Q$) and the target values $r + \gamma \max_{a'} Q(s', a')$": this says that the target is the reward plus $\gamma$ times my prediction of the return, given that I take what I think is the best action in the next state and follow my policy from then on.
We used an iterative update that adjusts the action-values (Q) towards target values that are only periodically updated, thereby reducing correlations with the target.
So, in summary, is a target network required because the network keeps changing at each timestep and the "target values" are being updated at each timestep?
But I do not understand how the target network is going to solve this.
So, in summary, is a target network required because the network keeps changing at each timestep and the "target values" are being updated at each timestep?
The difference between Q-learning and DQN is that you have replaced an exact value function with a function approximator. With Q-learning you are updating exactly one state/action value at each timestep, whereas with DQN you are updating many, which you understand. The problem this causes is that you can affect the action values for the very next state you will be in, instead of them being guaranteed to be stable as they are in Q-learning.
This happens basically all the time with DQN when using a standard deep network (a bunch of fully connected layers of the same size). The effect you typically see is referred to as "catastrophic forgetting" and it can be quite spectacular. If you are doing something like Lunar Lander with this sort of network (the simple state-based version, not the pixel one) and track the rolling average score over the last 100 games or so, you will likely see a nice curve up in score, then all of a sudden it completely craps out and starts making awful decisions again, even as your alpha gets small. This cycle will continue endlessly regardless of how long you let it run.
Using a stable target network as your error measure is one way of combating this effect. Conceptually it's like saying, "I have an idea of how to play this well, I'm going to try it out for a bit until I find something better," as opposed to saying, "I'm going to retrain myself how to play this entire game after every move." By giving your network more time to consider many actions that have taken place recently instead of updating all the time, it hopefully finds a more robust model before you start using it to act.
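As a rough sketch of that idea (assuming a PyTorch-style setup with made-up sizes; this is not the paper's actual code), the bootstrap target is computed from a frozen copy of the network that is only synced every so often:

```python
# Sketch of the target-network mechanism: train the online net every step,
# but compute TD targets from a copy that is only updated periodically.
import copy
import torch
import torch.nn as nn

obs_dim, n_actions = 8, 4                        # made-up sizes for illustration
online_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = copy.deepcopy(online_net)           # frozen copy used for targets
optimizer = torch.optim.Adam(online_net.parameters(), lr=1e-3)
gamma, sync_every = 0.99, 1000

def td_loss(obs, actions, rewards, next_obs, done):
    # Q(s, a) from the network being trained
    q = online_net(obs).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                        # targets come from the frozen network
        next_q = target_net(next_obs).max(dim=1).values
        target = rewards + gamma * (1.0 - done) * next_q
    return nn.functional.mse_loss(q, target)

# inside the training loop (replay-buffer sampling omitted):
#     loss = td_loss(*batch)
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
#     if step % sync_every == 0:                 # periodic sync, not every step
#         target_net.load_state_dict(online_net.state_dict())
```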
On a side note, DQN is essentially obsolete at this point, but the themes from that paper were the fuse leading up to the RL explosion of the last few years.

Reinforcement Learning, ϵ-greedy approach vs optimal action

In reinforcement learning, why should we select actions according to an ϵ-greedy approach rather than always selecting the optimal action?
We use an epsilon-greedy method for exploration during training. This means that when an action is selected during training, it is either the action with the highest Q-value or a random action, chosen according to some factor (epsilon).
The choice between these two is random and based on the value of epsilon. Initially, lots of random actions are taken, which means we start by exploring the space, but as training progresses, more actions with the maximum Q-value are taken and we gradually pay less attention to actions with low Q-values.
During testing, we still use this epsilon-greedy method, but with epsilon at a very low value, so that there is a strong bias towards exploitation over exploration, favoring the action with the highest Q-value over a random action. However, random actions are still sometimes chosen.
All this is because we want to eliminate the negative effects of over-fitting or under-fitting.
Using an epsilon of 0 (always choosing the optimal action) is a fully exploitative choice. For example, consider a labyrinth game where the agent's current Q-estimates have converged to the optimal policy everywhere except for one grid cell, where they greedily tell it to move toward a boundary (which is currently the "optimal" action according to those estimates), with the result that it remains in the same cell. If the agent reaches such a state and always chooses the maximum-Q action, it will be stuck there. Keeping a small epsilon factor in its policy, however, allows it to get out of such states.
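A minimal sketch of ϵ-greedy selection with a decaying ϵ (the Q-values and the schedule here are placeholders, not from any particular implementation):

```python
# Sketch: pick a random action with probability epsilon, else the greedy one.
import random

def epsilon_greedy(q_values, epsilon):
    n_actions = len(q_values)
    if random.random() < epsilon:
        return random.randrange(n_actions)                        # explore
    return max(range(n_actions), key=lambda a: q_values[a])       # exploit

# typical schedule: start exploratory, end mostly greedy
epsilon, epsilon_min, decay = 1.0, 0.05, 0.999
for step in range(10_000):
    q_values = [0.1, 0.5, 0.2, 0.2]        # would come from the learner
    action = epsilon_greedy(q_values, epsilon)
    epsilon = max(epsilon_min, epsilon * decay)
```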
There wouldn't be much learning happening if you already knew the best action, right? :)
ϵ-greedy is "on-policy" learning, meaning that you are learning the optimal ϵ-greedy policy while exploring with an ϵ-greedy policy. You can also learn "off-policy" by selecting moves that are not aligned with the policy you are learning; an example is exploring completely at random (the same as ϵ = 1).
I know this can be confusing at first: how can you learn anything if you just move randomly? The key bit of knowledge here is that the policy you learn is not defined by how you explore, but by how you calculate the sum of future rewards (in the case of regular Q-learning, it's the max(Q[next_state]) piece in the Q-value update).
This all works assuming you explore enough; if you don't try out new actions, the agent will never be able to figure out which ones are the best in the first place.
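To make that last point concrete, here is a minimal tabular sketch (a hypothetical 4-action setup): the update target uses max(Q[next_state]) regardless of which, possibly random, action the behaviour policy actually took, which is what makes Q-learning off-policy.

```python
# Sketch: the learned policy is defined by the update target, not by exploration.
import random
from collections import defaultdict

Q = defaultdict(lambda: [0.0] * 4)   # 4 actions, hypothetical environment
alpha, gamma = 0.1, 0.99

def q_update(state, action, reward, next_state):
    # Off-policy target: best action in the next state, even if we explored randomly.
    target = reward + gamma * max(Q[next_state])
    Q[state][action] += alpha * (target - Q[state][action])

# Even with fully random behaviour (epsilon = 1), the update above still
# moves Q towards the values of the greedy policy:
state, action, reward, next_state = 0, random.randrange(4), 1.0, 1
q_update(state, action, reward, next_state)
```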

Adding noise to image for deep learning, yes or no?

I've been thinking that adding noise to an image can prevent overfitting and also "increase" the dataset by adding variations to it. I'm just adding some random 1s to images of shape (256, 256, 3) that use uint8 to represent their color values. I don't think this affects the visualization at all (I showed both images with matplotlib and they look almost the same), and there is only a ~0.01 mean difference in their values.
But it doesn't seem to bring any advantage. After training for a long time, the model is still not as good as the one that doesn't use noise.
Has anyone tried using noise for image classification tasks like this? Does it eventually work better?
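A minimal sketch of the kind of noise injection described above (assuming NumPy and uint8 RGB images of shape (256, 256, 3); the 1% noise fraction is an assumption chosen to roughly match the ~0.01 mean difference mentioned):

```python
# Sketch: bump a small random fraction of pixel values up by 1, clipped to 255.
import numpy as np

def add_noise(image: np.ndarray, amount: float = 0.01) -> np.ndarray:
    mask = np.random.rand(*image.shape) < amount            # ~1% of values
    noisy = image.astype(np.int16) + mask.astype(np.int16)  # avoid uint8 overflow
    return np.clip(noisy, 0, 255).astype(np.uint8)

image = np.random.randint(0, 256, size=(256, 256, 3), dtype=np.uint8)
noisy_image = add_noise(image)
# mean absolute change per value is roughly `amount`
print(np.abs(noisy_image.astype(int) - image.astype(int)).mean())
```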
I wouldn't add noise to your data. Some papers employ input deformations during training to increase the robustness and convergence speed of models. However, these deformations are statistically inefficient (not just for images, but for any kind of data).
You can read "Intriguing properties of neural networks" by Szegedy et al. for more details (and refer to references 9 & 13 therein for papers that use deformations).
If you want to avoid overfitting, you might be interested in reading about regularization instead.
Yes, you may add noise to extend your dataset and avoid overfitting your training set, but make sure it is random; otherwise your network will take this noise as something it should learn (and that's not something you want). I wouldn't use this method first, though; I would first rotate and/or flip my samples.
In any case, your network should perform better than, or at least as well as, your previous network.
The first things I would check are: how do you measure your performance? What was the performance before and after? And did you change anything else?
There are a couple of works that deal with this problem. Because you make the training set harder, the training error will be higher, but your generalization might be better. It has been shown that adding noise can have a stabilizing effect when training Generative Adversarial Networks (adversarial training).
For classification tasks it is not as cut and dried. Not many works have actually dealt with this topic. The closest one, to the best of my knowledge, is this one from Google (https://arxiv.org/pdf/1412.6572.pdf), where they show the limitations of training without noise. They report a regularization effect, but not actually better results than other methods.

What kind of learning algorithm would you use to build a model of how long it takes a human to solve a given Sudoku situation?

I don't have much experience in machine learning, pattern recognition, data mining, etc. and in their underlying theory and systems.
I would like to develop an artificial model of the time it takes a human to make a move in a given Sudoku puzzle.
So what I'm looking for as an output from the machine learning process is a model that can predict how long it takes a target human to make a move in a given Sudoku situation.
The same input doesn't always map to the same outcome: it takes the human different amounts of time to make a move in the same situation, but my hypothesis is that there is a tendency in the resulting probability distribution. (My educated guess is that it is roughly normal.)
I have ideas about the factors that influence the distribution (like the number of empty slots) but would prefer to leave it to the system to figure these patterns out. Please note that I'm not interested in the patterns themselves, just the model.
I can generate training and test data easily by running Sudoku puzzles and measuring the time it takes to make each move.
What kind of learning algorithm would you suggest to use for this?
I was thinking NNs, but I'm not sure if they can have the desired property of giving weighted random outcomes for the same input.
If I understand this correctly you have an input vector of length 81, which contains 1 if the square is filled in and 0 otherwise. You want to learn a function which returns a probability distribution which models the response time of a human to that board position.
My first response would be that this is a regression problem and you should try straightforward linear regression. This will not provide you with a distribution of response times, but a single 'best-guess' response time.
I'm not clear on why you want to model a distribution of response times. However, if you really do want to output a distribution, then it sounds like you want to look at Bayesian methods. I'm not really an expert on Bayesian inference, so I can't help you much further here.
However, I don't really think your approach is going to work, because I agree with your intuition that features such as the number of empty slots are important. There are also other obvious features, such as the number of empty slots per row/column, that are likely to be important. Explicitly putting these features in your representation will probably be much more successful than expecting the learning algorithm to infer something similar on its own.
The Monte Carlo method seems like it would work well here, but it would require a stack of solutions the size of the moon to really do it. And it wouldn't give you the time per person, just the time on average.
My understanding of it, tenuous as it is, is that you have a database with board positions and the time it took a human to make the next move. At the very least you have a starting point for most moves. Even if a position is not in the database, you could start to calculate how long it would take to make a move based on some algorithm. Though I know you specified that you wanted machine learning to do this, it might be worth segmenting the problem into something a little smaller and then building on it.
If you have some guesstimate as to what influences the function (number of empty cells, etc.), try to train a classifier on a vector of those features, and not on the 81-cell vector (0/1 or 0..9, it doesn't really matter for my argument).
I think that your claim:
we wouldn't necessarily have to know the underlying patterns; the "trained patterns" in a learning system automatically encode these sometimes quite delicate and subtle patterns -- that's one of their great powers
is wrong. You do have to give the network the right domain. For example, when trying to detect objects in an image, working in the pixel domain is pointless; you'll only get results if you first run some feature detection to find edges, corners, etc.
Theoretically, with enough non-linearity (in an NN, enough layers in the network) it could detect such things, but in practice I have never seen that work without giving the classifier the right features to work with.
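Coming back to the feature-vector suggestion above, here is a minimal sketch (assuming NumPy and scikit-learn; the three features and the generated data are purely illustrative, not a proven feature set): each 9x9 board is reduced to a few hand-crafted numbers and an ordinary regressor is fit on (features, observed move time) pairs.

```python
# Sketch: hand-crafted board features + a plain regressor for move time.
import numpy as np
from sklearn.linear_model import LinearRegression

def board_features(board: np.ndarray) -> np.ndarray:
    """board: 9x9 array with 0 for empty cells, 1-9 for filled ones."""
    empty = (board == 0)
    return np.array([
        empty.sum(),              # total empty cells
        empty.sum(axis=0).max(),  # emptiest column
        empty.sum(axis=1).max(),  # emptiest row
    ])

# boards and move_times would come from the logged human games;
# random placeholders are used here just so the sketch runs.
boards = [np.random.randint(0, 10, size=(9, 9)) for _ in range(100)]
move_times = np.random.gamma(shape=2.0, scale=5.0, size=100)

X = np.stack([board_features(b) for b in boards])
model = LinearRegression().fit(X, move_times)
predicted_time = model.predict(board_features(boards[0]).reshape(1, -1))
```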
I was thinking NNs, but I'm not sure if they can have the desired property of giving weighted random outcomes for the same input.
You're just trying to learn a function from 2^81 or 10^81 board states (or a much smaller feature space, as I suggest) to R (a response time between 0 and Inf), or some discretization of that. So NNs and other classifiers can do that.