Deep Q learning with CNN - How can I know if my reinforcement learning model is actually learning? - deep-learning

I am trying to train a bot for a game like Curve Fever. It is like a snake that moves with a precise turn radius (not 90° turns) and leaves random holes in its trail (which it can pass through). As in a snake game, it dies if it leaves the map or hits itself. The difference in scoring is that the snake just has to survive as long as possible; there is no food. The snake's tail grows by 1 at every step. It looks like this:
So I use a deep Q-learning algorithm with a CNN, inspired by Flappy Bird deep Q-learning, which is itself inspired by DeepMind's paper Playing Atari with Deep Reinforcement Learning.
My input images are thresholded, like the one above, so every pixel is either black or white.
At every step I give a reward of +0.1 for staying alive and -1 for dying, whether by hitting the border of the map or by hitting itself.
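To make the reward scheme concrete, here is a minimal sketch of it together with the discounted return it produces. The function names and the discount factor `gamma=0.99` are assumptions for illustration; the post does not state which discount was used.

```python
def step_reward(died: bool) -> float:
    """Reward scheme from the post: +0.1 per surviving step, -1 on death."""
    return -1.0 if died else 0.1


def episode_rewards(n_alive: int) -> list:
    """Rewards for an episode that survives n_alive steps and then dies."""
    return [step_reward(False)] * n_alive + [step_reward(True)]


def discounted_return(rewards, gamma=0.99):
    """Sum of gamma**t * r_t. gamma=0.99 is an assumed value, not from the post."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Compare the return seen from the first step of a short and a long episode.
short_run = discounted_return(episode_rewards(5))
long_run = discounted_return(episode_rewards(50))
print(short_run, long_run)
```

With this scheme the death penalty is the same whatever the cause, so any difference between border crashes and self crashes has to come from what the network can see in the image, not from the reward.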
I trained my agent for hours, and after 4,000,000 iterations I end up with an agent that almost never leaves the map but crashes into itself very quickly.
So it seems to have learned not to crash into the border of the map, but not to avoid itself. What could explain this?
Some examples:
My suppositions are:
I used a replay memory size of 25,000 instead of 50,000 because of an out-of-memory (OOM) error. Is that enough?
I did not train it long enough, but how could I know?
The border of the map never changes, so it is easy to learn from, but the agent's tail changes in every new game. Should I give a larger penalty for crashing into itself so that the agent takes it more into account?
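For the replay memory question, a bounded buffer and a rough memory estimate can be sketched as follows. The class name, the helper `buffer_bytes`, and the 80x80x4 frame shape are assumptions for illustration; the post does not give the actual input size.

```python
import math
import random
from collections import deque


class ReplayMemory:
    """Fixed-size replay memory; the oldest transitions are dropped first."""

    def __init__(self, capacity=25_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement, as in the original DQN setup.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)


def buffer_bytes(capacity, frame_shape, bytes_per_value):
    """Rough size of the stored images: state and next_state per transition."""
    return capacity * 2 * math.prod(frame_shape) * bytes_per_value


# Assumed input shape (80, 80, 4); not stated in the post.
print(buffer_bytes(25_000, (80, 80, 4), 1))  # frames stored as 1-byte values
print(buffer_bytes(25_000, (80, 80, 4), 4))  # frames stored as 4-byte floats
```

Under these assumptions, storing thresholded frames as single bytes instead of 32-bit floats cuts the buffer to a quarter of the size, which may be enough to fit the larger memory.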
Here are my learning curves:
I am asking for help because training my agent takes a lot of time and I am not sure what I should do next.
Thanks in advance for any help.

Related

Deep Q Learning agent finds solution then diverges again

I am trying to train a DQN agent to solve OpenAI Gym's CartPole-v0 environment. I started with this person's implementation just to get some hands-on experience. What I noticed is that during training, after many episodes, the agent finds the solution and is able to keep the pole upright for the maximum number of timesteps. However, after further training, the policy seems to become more stochastic: it can't keep the pole upright anymore and goes in and out of a good policy. I'm pretty confused by this. Why wouldn't further training and experience help the agent? By the later episodes my epsilon for random actions has become very low, so the agent should be acting almost entirely on its own predictions. So why does it fail to keep the pole upright on some training episodes and succeed on others?
Here is a picture of my reward-episode curve during the training process of the above linked implementation.
This actually looks fairly normal to me, in fact I guessed your results were from CartPole before reading the whole question.
I have a few suggestions:
When you're plotting results, you should plot averages over a few random seeds. Not only is this generally good practice (it shows how sensitive your algo is to seeds), it'll smooth out your graphs and give you a better understanding of the "skill" of your agent. Don't forget, the environment and the policy are stochastic, so it's not completely crazy that your agent exhibits this type of behavior.
Assuming you're implementing e-greedy exploration, what's your epsilon value? Are you decaying it over time? The issue could also be that your agent is still exploring a lot even after it found a good policy.
Have you played around with hyperparameters, like learning rate, epsilon, network size, replay buffer size, etc? Those can also be the culprit.
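As a rough sketch of the first suggestion, averaging reward curves over several seeds and then smoothing them could look like this. The function names and the toy numbers are made up for illustration; real curves would come from separate training runs.

```python
from statistics import mean


def average_over_seeds(runs):
    """runs: one list of per-episode rewards per seed, all of equal length."""
    return [mean(episode) for episode in zip(*runs)]


def moving_average(values, window):
    """Trailing moving average; returns len(values) - window + 1 points."""
    out = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]
        if i >= window - 1:
            out.append(running / window)
    return out


# Toy data: three seeds, six episodes each (illustrative values only).
runs = [
    [10, 20, 200, 30, 200, 190],
    [12, 25, 40, 200, 60, 200],
    [9, 18, 150, 200, 200, 80],
]
curve = moving_average(average_over_seeds(runs), window=3)
print(curve)
```

A single run can swing between a perfect and a poor episode, while the averaged and smoothed curve shows whether the agent is improving overall.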

Why is a target network required?

I am having trouble understanding why a target network is necessary in DQN. I'm reading the paper “Human-level control through deep reinforcement learning”.
I understand Q-learning. Q-learning is a value-based reinforcement learning algorithm that learns an “optimal” probability distribution over state-action pairs that will maximize its long-term discounted reward over a sequence of timesteps.
Q-learning is updated using the Bellman equation, and a single step of the Q-learning update is given by
$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ R_{t+1} + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$
where $\alpha$ and $\gamma$ are the learning rate and the discount factor.
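For reference, the tabular form of this update can be sketched in a few lines. The function name, the state and action labels, and the default values of `alpha` and `gamma` are assumptions for illustration.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions,
                      alpha=0.1, gamma=0.99, done=False):
    """One tabular Q-learning step on the table Q[(state, action)]."""
    # Bootstrap from the best action in the next state (0 if terminal).
    best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
    td_target = r + gamma * best_next
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
    return Q[(s, a)]


Q = defaultdict(float)  # unseen state-action pairs start at 0
actions = ["left", "right"]
q_learning_update(Q, "s0", "left", 1.0, "s1", actions)
```

Note that only the single entry `Q[(s, a)]` changes here; every other entry, including those of `s_next`, is left untouched.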
I can understand that the reinforcement learning algorithm will become unstable and diverge.
The experience replay buffer is used so that we do not forget past experiences and to de-correlate the data used to learn the probability distribution.
This is where I fail.
Let me break the paragraph from the paper down here for discussion:
“The fact that small updates to $Q$ may significantly change the policy and therefore change the data distribution”: I understood this part. Periodic changes to the Q-network may lead to instability and changes in the distribution, for example if we always take a left turn or something like that.
“and the correlations between the action-values ($Q$) and the target values $r + \gamma \max_{a'} Q(s', a')$”: this says that the target is the reward plus $\gamma$ times my prediction of the return, given that I take what I think is the best action in the next state and follow my policy from then on.
“We used an iterative update that adjusts the action-values ($Q$) towards target values that are only periodically updated, thereby reducing correlations with the target.”
So, in summary, is a target network required because the network keeps changing at each timestep and the “target values” are being updated at each timestep?
But I do not understand how it is going to solve this.
So, in summary, is a target network required because the network keeps changing at each timestep and the “target values” are being updated at each timestep?
The difference between Q-learning and DQN is that you have replaced an exact value function with a function approximator. With Q-learning you are updating exactly one state/action value at each timestep, whereas with DQN you are updating many, which you understand. The problem this causes is that you can affect the action values for the very next state you will be in instead of guaranteeing them to be stable as they are in Q-learning.
This happens basically all the time with DQN when using a standard deep network (a bunch of fully connected layers of the same size). The effect you typically see with this is referred to as "catastrophic forgetting" and it can be quite spectacular. If you are doing something like moon lander with this sort of network (the simple one, not the pixel one) and track the rolling average score over the last 100 games or so, you will likely see a nice curve up in score, then all of a sudden it completely craps out and starts making awful decisions again, even as your alpha gets small. This cycle will continue endlessly regardless of how long you let it run.
Using a stable target network as your error measure is one way of combating this effect. Conceptually it's like saying, "I have an idea of how to play this well, I'm going to try it out for a bit until I find something better" as opposed to saying "I'm going to retrain myself how to play this entire game after every move". By giving your network more time to consider many actions that have taken place recently instead of updating all the time, it hopefully finds a more robust model before you start using it to make actions.
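A minimal sketch of the idea, using a tiny linear Q-function in place of a deep network: the online weights are updated at every step, while the bootstrap value comes from a frozen copy that is only synced every `sync_every` steps. The class `LinearQ`, the function names, and all hyperparameter values are assumptions for illustration, not the paper's implementation.

```python
import copy


class LinearQ:
    """Q(x, a) = w[a] . x, a stand-in for the deep network."""

    def __init__(self, n_features, n_actions):
        self.w = [[0.0] * n_features for _ in range(n_actions)]

    def predict(self, x):
        return [sum(wi * xi for wi, xi in zip(row, x)) for row in self.w]


def dqn_step(online, target, transition, alpha=0.01, gamma=0.99):
    """One gradient step on the online weights; targets use the frozen copy."""
    x, a, r, x_next, done = transition
    bootstrap = 0.0 if done else max(target.predict(x_next))
    td_error = r + gamma * bootstrap - online.predict(x)[a]
    online.w[a] = [wi + alpha * td_error * xi for wi, xi in zip(online.w[a], x)]
    return td_error


def train(online, target, transitions, sync_every=100):
    for step, transition in enumerate(transitions, start=1):
        dqn_step(online, target, transition)
        if step % sync_every == 0:
            # Only now does the target start reflecting the recent updates.
            target.w = copy.deepcopy(online.w)
```

Between syncs, updating the online weights cannot move the values it is regressing towards, which is the stabilizing effect described above.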
On a side note, DQN is essentially obsolete at this point, but the themes from that paper were the fuse leading up to the RL explosion of the last few years.

Understanding reinforcement learning on game 2048 example

So I want to learn reinforcement learning by working through some examples. I wrote a 2048 game, but I do not know if I am training it right. As I understand it, I have to create a neural network. I created 16 inputs, one for each number. Then hidden layers 12x8 and 4 outputs for the moves (up, right, down, left). (The activation function is linear for the last layer and ReLU for the rest.) Then I run one full game and save all the moves and rewards (0 when nothing happened, -2 for moves that do nothing, -1 when the move lost the game, and the score earned when the move does something). When the game ends, I run backpropagation starting from the last move. Am I doing it right? I know there are libraries like TensorFlow, but I want to understand it all.
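The network described above can be sketched without any library. This reads "12x8" as two hidden layers of 12 and 8 units, which is an assumption, as are the log2 input scaling, the initialization, and all names; only the forward pass is shown, not the training step.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (an assumed initialization)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return {"W": weights, "b": [0.0] * n_out}


def dense(layer, x, relu):
    out = [sum(w * xi for w, xi in zip(row, x)) + b
           for row, b in zip(layer["W"], layer["b"])]
    return [max(0.0, v) for v in out] if relu else out


def encode_board(board):
    """16 tile values (0 = empty) scaled via log2; the scaling is an assumption."""
    return [math.log2(t) / 16.0 if t > 0 else 0.0 for t in board]


def q_values(layers, board):
    x = encode_board(board)
    for layer in layers[:-1]:
        x = dense(layer, x, relu=True)       # ReLU on hidden layers
    return dense(layers[-1], x, relu=False)  # linear output layer


rng = random.Random(0)
sizes = [16, 12, 8, 4]  # inputs, two hidden layers, outputs (up, right, down, left)
layers = [init_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]

board = [2, 4, 0, 0, 0, 2, 0, 0, 0, 0, 8, 0, 0, 0, 0, 16]
print(q_values(layers, board))  # one value per move
```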
I would consult this GitHub repo, as it accomplishes exactly what you are trying to do.
You can actually use the above solution live here.
If you want to actually learn the fundamentals of how that all works, that's beyond the scope of what a single post on StackOverflow can provide.

Invalid moves in reinforcement learning

I have implemented a custom OpenAI Gym environment for a game similar to http://curvefever.io/, but with discrete actions instead of continuous ones. So in each step my agent can go in one of four directions: left, up, right, or down. However, one of these actions will always lead to the agent crashing into itself, since it can't "reverse".
Currently I just let the agent take any move and let it die if it makes an invalid move, hoping that it will eventually learn not to take that action in that state. I have, however, read that one can set the probabilities of illegal moves to zero and then sample an action. Is there any other way to tackle this problem?
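The masking approach mentioned in the question can be sketched as follows: zero the probability of the reverse move, renormalize, and sample. The action names, the `OPPOSITE` table, and the function names are assumptions for illustration.

```python
import random

ACTIONS = ["left", "up", "right", "down"]
OPPOSITE = {"left": "right", "right": "left", "up": "down", "down": "up"}


def masked_probs(probs, current_direction):
    """Zero out the reverse move and renormalize the remaining probabilities."""
    invalid = OPPOSITE[current_direction]
    masked = [0.0 if a == invalid else p for a, p in zip(ACTIONS, probs)]
    total = sum(masked)
    if total == 0.0:
        # All mass was on the invalid move: fall back to uniform over valid ones.
        masked = [0.0 if a == invalid else 1.0 for a in ACTIONS]
        total = sum(masked)
    return [p / total for p in masked]


def sample_action(probs, current_direction, rng=random):
    weights = masked_probs(probs, current_direction)
    return rng.choices(ACTIONS, weights=weights, k=1)[0]


# The agent is heading left, so "right" is never sampled.
print(sample_action([0.1, 0.2, 0.6, 0.1], "left"))
```

For a Q-learning agent that picks the argmax instead of sampling, the same idea applies by setting the invalid action's value to negative infinity before taking the maximum.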
You can try to solve this with 2 changes:
1: Give the current direction as an input, give a reward of maybe +0.1 if it takes a move that does not make it crash, and give -0.7 if it makes a backward move that immediately makes it crash.
2: If you are using a neural network with softmax as the activation function of the last layer, multiply all outputs of the network by a positive integer (a "confidence") before passing them to the softmax. It can be in the range 0 to 100; in my experience, values above 100 do not change much. The larger the integer, the more confident the agent will be in the action it takes for a given state.
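The second change amounts to scaling the logits before the softmax. A minimal sketch, where the function name and the example logits are made up for illustration:

```python
import math


def softmax(logits, confidence=1.0):
    """Softmax over confidence * logits; a larger confidence sharpens the output."""
    scaled = [confidence * z for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [1.0, 2.0, 3.0, 0.5]  # illustrative network outputs for 4 actions
for confidence in (1, 5, 20):
    print(confidence, softmax(logits, confidence))
```

As the multiplier grows, the distribution approaches a one-hot choice of the largest output, so very large values make the agent act almost greedily.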
If you are not using a neural network, or deep learning in general, I suggest you learn the concepts of deep learning, as your game environment seems complex and a neural network is likely to give the best results.
Note: It will take a huge amount of time, so you have to wait long enough for the algorithm to train. I suggest you do not hurry and let it train. I played the game, and it's really interesting :) Good luck making an AI for it :)

How to set up the appropriate model setting and layers for high intra-class variation

Dear experts,
I am new to CNNs and Caffe. I have a classification task between 2 classes. The data set that I have collected is very small: about 50 images for class A and 50 for class B (I know that is very, very small). They are images of humans.
I took the BVLC model and changed things such as the batch size for testing and training and the maximum number of iterations. I tried many different setups, but the model doesn't work.
Any idea how to come up with an appropriate model or setting, or other solutions?
Remark: I once randomly changed the BVLC model setup and it worked, but I lost the setup file.
I got the train.prototxt and Solve.prototxt from Adil Moujahid.
I tried training batch sizes of 32, 64, 128, and 256 and testing batch sizes of 5, 20, and 30, but it failed.
The data set consists of images of normal women and beautiful women, which I want to classify, but Stack Overflow does not allow me to add more than 2 links.
I wonder whether there is any formula, equation, or set of steps that I can follow to choose the right model setting.
Thank you in advance.
What do you mean by "doesn't work"? Does the loss stay too high? Does training converge but accuracy stay low? Andrew Ng has an excellent session on "debugging" CNNs: Nuts and Bolts of Building Applications using Deep Learning (NIPS slides, summary, additional summary).
My humble guess is that your network has an overfitting problem: it learns the specific examples and can't generalize. A larger training set, regularization, or data augmentation can help.
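As one simple form of data augmentation, horizontally flipping each image doubles a small data set. The sketch below uses plain nested lists as a stand-in for images; in practice this would be done in the image loading pipeline, and the function names are assumptions for illustration.

```python
def hflip(image):
    """Mirror an image given as a list of pixel rows."""
    return [list(reversed(row)) for row in image]


def augment(dataset):
    """Return the original (image, label) pairs plus a flipped copy of each."""
    out = []
    for image, label in dataset:
        out.append((image, label))
        out.append((hflip(image), label))
    return out


# Toy 2x3 "images" with class labels (illustrative values only).
dataset = [([[1, 2, 3], [4, 5, 6]], "A"), ([[7, 8, 9], [0, 1, 2]], "B")]
print(len(augment(dataset)))
```

Flipped copies are not independent samples, so this helps against overfitting only to a limited extent and does not replace collecting more images.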