Understanding reinforcement learning on game 2048 example - reinforcement-learning

I want to learn reinforcement learning by working through some examples. I wrote a 2048 game, but I do not know if I'm training the agent correctly. As I understand it, I have to create a neural network. I created 16 inputs, one for each number on the board, then hidden layers of 12 and 8 units, and 4 outputs for the moves (up, right, down, left). The activation function is linear for the last layer and ReLU for the rest. Then I run one full game and save all the moves and rewards (0 when nothing happened, -2 for moves that do nothing, -1 when the move lost the game, and the earned score when the move does something). When the game ends, I run the backpropagation algorithm starting from the last move. Am I doing it right? I know there are libraries like TensorFlow, but I want to understand it all.

I would consult this GitHub repo, as it accomplishes exactly what you are trying to do.
You can actually use the above solution live here.
If you want to actually learn the fundamentals of how that all works, that's beyond the scope of what a single post on StackOverflow can provide.
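For reference, the setup described in the question can be written in a few lines of NumPy. This is only a minimal sketch, not the code from the linked repo: the weight initialization, the discount factor `gamma`, and the helper names `init_params`, `forward`, and `discounted_returns` are assumptions made for illustration, and discounting rewards backwards from the last move is just one common way to turn a saved game into training targets.

```python
import numpy as np

# Layer sizes from the question: 16 inputs, hidden layers of 12 and 8, 4 outputs.
SIZES = [16, 12, 8, 4]


def init_params(sizes, rng):
    """Small random weights and zero biases per layer (an arbitrary choice)."""
    return [
        (rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]


def forward(params, board):
    """ReLU on the hidden layers, linear output: one value per move."""
    x = np.asarray(board, dtype=float).reshape(-1)
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:  # no activation on the last layer
            x = np.maximum(x, 0.0)
    return x


def discounted_returns(rewards, gamma=0.9):
    """Walk backwards from the last move, accumulating discounted reward.

    gamma is a hypothetical value; the question does not mention discounting.
    """
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


rng = np.random.default_rng(0)
params = init_params(SIZES, rng)
move_values = forward(params, np.zeros(16))  # four values: up, right, down, left
```

The returns computed this way could then serve as regression targets for the output that corresponds to the move actually taken at each step.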

Related

Grad-cam for CNNs with GAP layer

I'm new to deep learning, so maybe this is a silly question...
Do any adjustments need to be made for applying Grad-CAM on CNNs that use a Global Average Pooling (GAP) layer right before fully connected ones?
I understand that the GAP layer aggregates the activations of an intermediate layer in order to produce a compact representation of the image, removing information regarding the features location. Is this an obstacle to grad-cam backpropagation?
I imagine that for a CNN that uses, for example, a Max Pooling layer followed by a Flatten layer, Grad-CAM is capable of retrieving the exact location of the relevant features.
I'm sorry if it is a silly question, but I couldn't find the answer anywhere.
Thanks in advance!
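One way to check this is to write the GAP case out by hand. The sketch below assumes a single linear layer directly after GAP with the bias omitted, and the names `gap_head_scores` and `grad_cam_gap` are made up for illustration. Under that assumption the gradient of a class score with respect to a feature map is the same constant at every location, so the Grad-CAM channel weights reduce to the linear-layer weights divided by the map size, and the spatial information in the heatmap comes from the feature maps themselves.

```python
import numpy as np


def gap_head_scores(feature_maps, fc_weights):
    """Class scores for GAP followed by one linear layer (bias omitted).

    feature_maps: array of shape (K, H, W); fc_weights: array of shape (C, K).
    """
    pooled = feature_maps.mean(axis=(1, 2))  # GAP: one number per channel
    return fc_weights @ pooled


def grad_cam_gap(feature_maps, fc_weights, class_idx):
    """Grad-CAM heatmap for the GAP + linear head, with analytic gradients."""
    k, h, w = feature_maps.shape
    # d score_c / d A[k, i, j] = fc_weights[c, k] / (h * w) at every (i, j).
    grads = np.broadcast_to(
        (fc_weights[class_idx] / (h * w))[:, None, None], (k, h, w)
    )
    # Grad-CAM channel weights: spatial average of the gradients.
    alphas = grads.mean(axis=(1, 2))
    # Weighted sum of feature maps over channels, followed by ReLU.
    cam = np.tensordot(alphas, feature_maps, axes=1)
    return np.maximum(cam, 0.0)
```

The heatmap still varies across locations because `feature_maps` is only averaged inside the score computation, not inside the weighted sum that forms the map.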
I have been experimenting with grad-cam with some VGGNets and ResNets in different tasks. It could be something in my head, but apparently ResNet tends to highlight larger regions in the image. Both models classify correctly, but the ResNet activation map usually highlights a larger area.
This also happens in the original Grad-CAM paper, as shown below. However, I can't find any comments about it, and I would like to know why.
Grad-CAM for VGGNet
Grad-CAM for ResNet

Deep Q learning with CNN - How can l know if my reinforcement learning model is actually learning?

I am trying to train a bot in a game like Curve Fever. It is like a snake that moves with a very precise turn radius (not 90°) and leaves random holes in its trail (which it can pass through). As in a snake game, it dies if it goes off the map or hits itself. The difference is that there is no food: the snake simply has to survive as long as possible. The tail of the snake grows by 1 at every step. It looks like this:
So I use a deep q learning algorithm with a CNN network, inspired by: Flappy bird deep q learning, which is itself inspired by the DeepMind's paper Playing Atari with Deep Reinforcement Learning.
My input images are thresholded images like the one above, where everything is black or white.
At every step I give a reward of +0.1 for staying alive and -1 for dying by hitting the border of the map or itself.
I trained my agent for hours, and after 4,000,000 iterations I end up with an agent that almost never goes off the map but crashes into itself very quickly.
So it seems to have learned not to crash into the border of the map, but not to avoid itself. What could explain this?
Some examples:
My suppositions are:
I used a replay memory size of 25000 instead of 50000 because of an OOM error. Is that enough?
I did not train it long enough, but how could I know?
The border of the map never changes, so it is easy to learn from it, but the tail of the agent changes in every new game. Should I give a worse reward for crashing into itself so the agent takes it more into account?
Here are my learning curves:
I am asking for your help because it takes a lot of time to train my agent and I can't be sure what I should do.
Thanks in advance for any help.
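For concreteness, the reward scheme and the bounded replay memory described above might look roughly like this. It is a sketch only: the class name `ReplayMemory`, the helper `step_reward`, and the transition layout are assumptions, not the code from the Flappy Bird project. Only the capacity of 25000 and the +0.1 / -1 rewards come from the question.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size buffer: once full, the oldest transitions are dropped."""

    def __init__(self, capacity=25_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement; copying to a list keeps
        # the call simple at the cost of an O(n) copy per batch.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def step_reward(died):
    """+0.1 for surviving a step, -1 for dying (border or own tail)."""
    return -1.0 if died else 0.1
```

With uniform sampling like this, rare events such as a self-collision make up only a small share of each batch, which may be one thing to look at when comparing the two kinds of death.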

In deep learning, can I change the weight of loss dynamically?

Call for experts in deep learning.
Hey, I have recently been working on training a model on images for tone mapping, using TensorFlow in Python. To get better results, I focused on the perceptual loss introduced in this paper by Justin Johnson.
In my implementation, I use all three parts of the loss: a feature loss extracted from VGG16; an L2 pixel-level loss between the transferred image and the ground-truth image; and the total variation loss. I sum them up as the loss for backpropagation.
From the function
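$$\hat{y} = \arg\min_{y} \; \lambda_c \, \mathrm{loss}_{\mathrm{content}}(y, y_c) + \lambda_s \, \mathrm{loss}_{\mathrm{style}}(y, y_s) + \lambda_{TV} \, \mathrm{loss}_{TV}(y)$$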
in the paper, we can see that there are three weights for the losses, the λ's, to balance them. The values of the three λ's are presumably fixed throughout training.
My question is: does it make sense to dynamically change the λ's every epoch (or every few epochs) to adjust the importance of these losses?
For instance, the perceptual loss converges quickly in the first several epochs, yet the pixel-level L2 loss converges fairly slowly. So maybe the weight λ should be higher for the content loss, let's say 0.9, but lower for the others. As time passes, the pixel-level loss becomes increasingly important for smoothing the image and minimizing artifacts, so it might be better to raise its weight a bit. It would be just like changing the learning rate across epochs.
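To make the idea concrete, here is a rough sketch of such a schedule in plain Python. Only the starting weight of 0.9 comes from the text above. The linear shape, the other weight values, and the names `linear_schedule` and `combined_loss` are hypothetical choices for illustration.

```python
def linear_schedule(start, end, epoch, total_epochs):
    """Linearly interpolate a weight from start (epoch 0) to end (last epoch)."""
    if total_epochs <= 1:
        return end
    frac = min(max(epoch / (total_epochs - 1), 0.0), 1.0)
    return start + frac * (end - start)


def combined_loss(feature_loss, pixel_loss, tv_loss, epoch, total_epochs):
    """Weighted sum of the three losses with epoch-dependent weights.

    The end points below are made-up values chosen so that the three
    weights sum to 1 at every epoch.
    """
    w_feature = linear_schedule(0.9, 0.5, epoch, total_epochs)
    w_pixel = linear_schedule(0.05, 0.45, epoch, total_epochs)
    w_tv = 0.05  # kept constant in this sketch
    return w_feature * feature_loss + w_pixel * pixel_loss + w_tv * tv_loss
```

Note that with changing weights the total loss values from different epochs are no longer directly comparable, so the individual loss terms would need to be tracked separately.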
The postdoc who supervises me strongly opposes my idea. He thinks it amounts to dynamically changing the training model and could make the training inconsistent.
So, pros and cons: I need some ideas...
Thanks!
It's hard to answer this without knowing more about the data you're using, but in short, dynamically weighting the losses should not have that much effect and may even have the opposite effect.
If you are using Keras, you could simply run a hyperparameter tuner similar to the following in order to see if there is any effect (change the loss accordingly):
https://towardsdatascience.com/hyperparameter-optimization-with-keras-b82e6364ca53
I've only done this on smaller models (it is way too time consuming), but in essence, it's best to keep the weights constant and also avoid angering your supervisor :D
If you are using a different ML or DL library, there are hyperparameter optimizers for each; just search for them. It may be best to run these on a cluster overnight, but they usually give you a good enough optimized version of your model.
Hope that helps and good luck!

Invalid moves in reinforcement learning

I have implemented a custom OpenAI Gym environment for a game similar to http://curvefever.io/, but with discrete actions instead of continuous ones. So in each step my agent can go in one of four directions: left/up/right/down. However, one of these actions will always lead to the agent crashing into itself, since it can't "reverse".
Currently I let the agent take any move and let it die if it makes an invalid move, hoping that it will eventually learn not to take that action in that state. I have, however, read that one can set the probabilities of making an illegal move to zero and then sample an action. Is there any other way to tackle this problem?
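The approach of zeroing out illegal moves before sampling could be sketched as follows. This is a minimal NumPy example, assuming the policy produces one logit per action. The function names `masked_softmax` and `sample_action` and the way the mask is built are assumptions for illustration, not part of Gym.

```python
import numpy as np


def masked_softmax(logits, valid_mask):
    """Softmax over valid actions only; invalid actions get probability 0."""
    logits = np.asarray(logits, dtype=float)
    valid_mask = np.asarray(valid_mask, dtype=bool)
    if not valid_mask.any():
        raise ValueError("at least one action must be valid")
    # exp(-inf) is 0, so masked actions drop out of the normalization.
    masked = np.where(valid_mask, logits, -np.inf)
    masked = masked - masked[valid_mask].max()  # for numerical stability
    exp = np.exp(masked)
    return exp / exp.sum()


def sample_action(logits, valid_mask, rng):
    """Sample an action index from the masked distribution."""
    probs = masked_softmax(logits, valid_mask)
    return int(rng.choice(len(probs), p=probs))


# Hypothetical example: actions are (left, up, right, down) and the agent is
# currently moving left, so "right" (index 2) would be a reversal.
rng = np.random.default_rng(0)
action = sample_action([1.0, 2.0, 3.0, 4.0], [True, True, False, True], rng)
```

The environment then never sees the reversing move, so the agent does not have to spend training time learning that it is fatal.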
You can try to solve this with two changes:
1: Give the current direction as an input, and give a reward of maybe +0.1 if the agent takes a move that does not make it crash and -0.7 if it makes a backward move that directly makes it crash.
2: If you are using a neural network with a softmax function as the activation of the last layer, multiply all outputs of the network by a positive integer (a confidence factor) before passing them to the softmax. It can be in the range 0 to 100; in my experience, values above 100 do not change much. The larger the integer, the more confident the agent will be in the action it takes for a given state.
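The second change amounts to scaling the logits before the softmax, which is the same as dividing by a temperature. Here is a small sketch of the effect, with the function name `scaled_softmax` and the example values chosen only for illustration:

```python
import numpy as np


def scaled_softmax(outputs, confidence=1.0):
    """Softmax of confidence * outputs; larger confidence gives a sharper distribution."""
    z = confidence * np.asarray(outputs, dtype=float)
    z = z - z.max()  # for numerical stability
    exp = np.exp(z)
    return exp / exp.sum()


outputs = [1.0, 2.0, 3.0]
soft = scaled_softmax(outputs, confidence=1)
sharp = scaled_softmax(outputs, confidence=10)
flat = scaled_softmax(outputs, confidence=0)  # a factor of 0 gives a uniform distribution
```

A higher factor makes the agent pick its top-rated action more often, but it also reduces exploration, so it does not by itself rule out the reversing move.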
If you are not using a neural network, or deep learning in general, I suggest you learn the concepts of deep learning, as your game environment seems complex and a neural network will likely give the best results.
Note: training will take a huge amount of time, so do not hurry and let it train long enough. I played the game and it's really interesting :) Best wishes for building an AI for it :)

What kind of learning algorithm would you use to build a model of how long it takes a human to solve a given Sudoku situation?

I don't have much experience in machine learning, pattern recognition, data mining, etc. and in their underlying theory and systems.
I would like to develop an artificial model of the time it takes a human to make a move in a given Sudoku puzzle.
So what I'm looking for as an output from the machine learning process is a model that can predict how long it takes a target human to make a move in a given Sudoku situation.
Same input doesn't always map to same outcome. It takes different times for the human to make a move with the same situation, but my hypothesis is that there's a tendency in the resulting probability distribution. (My educated guess is that it is ~normal.)
I have ideas about the factors that influence the distribution (like #empty slots) but would preferably leave it to the system to figure these patterns out. Please notice, that I'm not interested in the patterns, just the model.
I can generate sample and test data easily by running sudoku puzzles and measuring the times it takes to make the moves.
What kind of learning algorithm would you suggest to use for this?
I was thinking NNs, but I'm not sure if they can have the desired property of giving weighted random outcomes for the same input.
If I understand this correctly you have an input vector of length 81, which contains 1 if the square is filled in and 0 otherwise. You want to learn a function which returns a probability distribution which models the response time of a human to that board position.
My first response would be that this is a regression problem and you should try straightforward linear regression. This will not provide you with a distribution of response times, but a single 'best-guess' response time.
I'm not clear on why you want to model a distribution of response times. However, if you really do want to output a distribution, then it sounds like you want to look at Bayesian methods. I'm not really an expert on Bayesian inference, so I can't help you much further here.
However, I don't really think your approach is going to work because I agree with your intuition about features such as the number of empty slots being important. There are also other obvious features, such as the number of empty slots per row/column that are likely to be important. Explicitly putting these features in your representation will probably be much more successful than expecting that the learning algorithm will infer something similar on its own.
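As a rough illustration of the two suggestions above (explicit features plus plain linear regression), something like the following could be a starting point. It is a sketch under assumptions: boards are given as 81 values with 1 for a filled square and 0 for an empty one, and the function names `board_features`, `fit_linear`, and `predict` are made up here.

```python
import numpy as np


def board_features(filled):
    """Bias term, total empty count, and empty counts per row and per column."""
    filled = np.asarray(filled, dtype=float).reshape(9, 9)
    empty = 1.0 - filled
    return np.concatenate(
        ([1.0, empty.sum()], empty.sum(axis=1), empty.sum(axis=0))
    )


def fit_linear(boards, times):
    """Ordinary least squares from board features to measured move times."""
    X = np.stack([board_features(b) for b in boards])
    y = np.asarray(times, dtype=float)
    # The features are linearly dependent (the total equals the sum of the
    # row counts), so lstsq returns the minimum-norm solution.
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def predict(coef, filled):
    """Single best-guess move time for one board."""
    return float(board_features(filled) @ coef)
```

This gives one best-guess time per board rather than a distribution. The spread of the residuals on held-out data would give a first impression of how variable the times are.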
The Monte Carlo method seems like it would work well here, but it would require a stack of solutions the size of the moon to really do it. And it wouldn't give you the time per person, just the average time.
My understanding of it, tenuous as it is, is that you have a database with a board position and the time it took a human to make the next move. At the very least you have a starting point for most moves. Even if it's not in the database you could start to calculate how long it would take to make a move based on some algorithm. Though I know you had specified you wanted machine learning to do this it might be worth segmenting the problem into something a little smaller then building on it.
If you have some guesstimate as to what influences the function (number of empty cells, etc.), try to train a classifier on a vector of features, and not on the 81-cell vector (0/1 or 0..9, it doesn't really matter for my argument).
I think that your claim:
we wouldn't have to necessary know the underlying patterns, the "trained patterns" in a learning system automatically encodes these sometimes quite delicate and subtle patterns inside them -- that's one of their great power
is wrong. You do have to give the network the right domain. For example, when trying to detect an object in an image, working in the pixel domain is pointless. You'll only get results if you first run some feature detection to detect edges, corners, etc.
Theoretically, with enough non-linearity (in NN - enough layers in the network) it can detect such things, but in practice, I have never seen that work, without giving the classifier the right features to work with.
I was thinking NNs, but I'm not sure if they can have the desired property of giving weighted random outcomes for the same input.
You're just trying to learn a function from 2^81 or 10^81 (or a much smaller feature space as I suggest) to R (response time between 0 and Inf) or some discretization of that. So NN and other classifiers can do that.