Deep Q-Learning for grid world - reinforcement-learning

Has anyone implemented Deep Q-learning to solve a grid-world problem where the state is the [x, y] coordinates of the player and the goal is to reach a certain coordinate [A, B]? The reward setting could be -1 for each step and +10 for reaching [A, B]. [A, B] is always fixed.
Surprisingly enough, I did not find such an implementation on Google. I tried DQN on Taxi-v3 myself and it didn't work, so I'm looking for a reference implementation to work my way up to my own problem.

For grid worlds, deep Q-learning isn't needed (a tabular method suffices), which is probably why few people implement it. However, I found a tutorial that uses deep Q-learning with a grid world: https://livebook.manning.com/book/deep-reinforcement-learning-in-action/chapter-3/1
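For reference, a minimal DQN sketch for exactly this setup could look like the following. This is illustrative only: the grid size, network width, and hyperparameters are assumptions, and it omits the replay buffer and target network a real DQN would use.

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

GRID = 5                                       # assumed 5x5 grid
GOAL = np.array([4, 4])                        # the fixed goal [A, B]
ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]   # right, left, down, up
GAMMA, EPS, LR = 0.99, 0.1, 1e-3               # illustrative hyperparameters

# Q-network: maps the normalized (x, y) state to one Q-value per action.
q_net = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),
    layers.Dense(32, activation='relu'),
    layers.Dense(32, activation='relu'),
    layers.Dense(len(ACTIONS), activation='linear'),
])
opt = tf.keras.optimizers.Adam(LR)

def env_step(state, action):
    # Environment: -1 per step, +10 on reaching the goal.
    nxt = np.clip(state + ACTIONS[action], 0, GRID - 1)
    done = np.array_equal(nxt, GOAL)
    return nxt, (10.0 if done else -1.0), done

def q_values(state):
    return q_net(state[None].astype('float32') / GRID)

for episode in range(500):
    s = np.array([0, 0])
    for t in range(100):
        # Epsilon-greedy action selection.
        a = (np.random.randint(len(ACTIONS)) if np.random.rand() < EPS
             else int(tf.argmax(q_values(s)[0])))
        s2, r, done = env_step(s, a)
        # One-step TD target (no replay buffer / target net, for brevity).
        target = r if done else r + GAMMA * float(tf.reduce_max(q_values(s2)))
        with tf.GradientTape() as tape:
            loss = (q_values(s)[0, a] - target) ** 2
        grads = tape.gradient(loss, q_net.trainable_variables)
        opt.apply_gradients(zip(grads, q_net.trainable_variables))
        s = s2
        if done:
            break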

Related

Fine-tuning with a very low learning rate. Any sign that something is not good?

I have been working with deep reinforcement learning, and in the literature the learning rates are usually lower than those I have found in other settings.
My model is the following one:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import regularizers

def create_model(self):
    # HIDDEN_NODES, STATE_SIZE, STATE_SPACE, ACTION_SPACE and LEARNING_RATE
    # are constants defined elsewhere in my module.
    model = Sequential()
    model.add(LSTM(HIDDEN_NODES, input_shape=(STATE_SIZE, STATE_SPACE), return_sequences=False))
    model.add(Dense(HIDDEN_NODES, activation='relu', kernel_regularizer=regularizers.l2(0.000001)))
    model.add(Dense(HIDDEN_NODES, activation='relu', kernel_regularizer=regularizers.l2(0.000001)))
    model.add(Dense(ACTION_SPACE, activation='linear'))
    # Compile with Huber loss, gradient-norm clipping, and the Adam optimizer
    model.compile(loss=tf.keras.losses.Huber(delta=1.0),
                  optimizer=Adam(learning_rate=LEARNING_RATE, clipnorm=1))
    return model
Here the initial learning rate is 3e-5. For the fine-tuning, I freeze the first two layers (this step is essential in my setting) and decrease the learning rate to 3e-9. During fine-tuning, the model might suffer from distributional shift, since the source of samples is perturbed data. Besides this, is there another source of problems that would keep my model from improving at such a low learning rate?
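For concreteness, freezing the first two layers and recompiling at the lower rate might look like this in Keras (a sketch; the checkpoint path is a placeholder, and the layer indices assume the architecture above):

model = self.create_model()
model.load_weights('pretrained.h5')   # hypothetical checkpoint path
for layer in model.layers[:2]:        # freeze the LSTM and the first Dense layer
    layer.trainable = False
# Recompiling is required for the new trainable flags and learning rate to take effect.
model.compile(loss=tf.keras.losses.Huber(delta=1.0),
              optimizer=Adam(learning_rate=3e-9, clipnorm=1))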
First, show us a sample of your data.
Theoretical Answer:
We have seen how perturbation helps in solving various issues related to training neural networks and to trained models. Perturbation shows up in three components associated with training and trained models (gradients, weights, inputs): perturbing gradients helps tackle the vanishing-gradient problem, perturbing weights helps escape saddle points, and perturbing inputs hardens the model against malicious (adversarial) attacks. Overall, perturbation in its different forms strengthens the model against various instabilities; for example, it can keep the model from settling at a spurious point, since such a position will be tested with perturbed inputs, weights, or gradients, pushing the model toward a genuinely good solution.
As of now, perturbation is mainly contingent on empirical experimentation, designed from intuition to solve the problems one encounters. One needs to check whether perturbing a component of the training process makes sense intuitively, and then verify empirically whether it mitigates the problem. Nevertheless, in the future we will likely see more perturbation theory in deep learning, or in machine learning in general, perhaps backed by theoretical guarantees.
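As a concrete illustration of the input case, perturbation is often just small additive noise applied to each training batch (a minimal sketch; the noise scale sigma is an arbitrary assumption):

import numpy as np

def perturb_inputs(batch, sigma=0.01):
    # Add small Gaussian noise to each training sample; sigma is illustrative.
    return batch + np.random.normal(0.0, sigma, size=batch.shape)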

PyTorch Model's gradients are converging to zero

I'm currently working on a personal implementation of the Transformer architecture. The code I've written is here.
The problem that I'm facing is that I believe my model isn't training properly and I'm not sure what kind of measures I should take to fix that. I've come to this conclusion after using Weights & Biases to visualize the model's gradient histograms and they look something like this:
The gradients seem to converge to zero quickly. One portion of the code contains a feedforward neural network that uses ReLU activations; I changed these to Leaky ReLU under the suspicion that dying ReLUs might be the problem. However, using Leaky ReLU doesn't help and merely delays the convergence to zero.
Any feedback on what else I may try is appreciated.
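One cheap first diagnostic alongside the histograms is to print per-parameter gradient norms right after the backward pass (a sketch; 'model' and 'step' stand for the asker's own training objects):

import torch

def log_grad_norms(model: torch.nn.Module, step: int):
    # Call after loss.backward(); near-zero norms concentrated in the early
    # layers are the classic signature of vanishing gradients.
    for name, p in model.named_parameters():
        if p.grad is not None:
            print(f'step {step} | {name}: {p.grad.norm().item():.3e}')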

Inverted Pendulum: model-based or model-free?

This is my first post here, and I came to discuss or get clarification on something I have trouble understanding: model-free vs. model-based RL methods. I am currently implementing Q-learning, but am not certain I am doing it correctly.
Example: say I am applying Q-learning to an inverted pendulum, where the reward is given as the absolute distance between the pendulum and the upward position, and the terminal state (or goal state) is defined as the pendulum being very close to the upward position.
Would this setup mean that I have a model-free or model-based setup? From how I have understood it, this would be model-based, as I have a model of the environment that is giving me the reward (R = abs(pos - wantedPos)). But then I saw an implementation of this using Q-learning (https://medium.com/@tuzzer/cart-pole-balancing-with-q-learning-b54c6068d947), which is a model-free algorithm. Now I am clueless...
Thankful for all responses.
Vanilla Q-learning is model-free.
The idea behind Q-learning is that the agent learns an optimal policy from observed state-action-reward experience; this is in contrast to trying to model the environment itself.
If you took a model-based approach, you would try to learn the environment's transition dynamics and ultimately perform value iteration or policy iteration on the resulting Markov decision process.
In model-free reinforcement learning, it is assumed you do not have access to the MDP's dynamics, so you must find an optimal policy purely from the rewards you receive through experience. Note that computing the reward as R = abs(pos - wantedPos) does not make your setup model-based: the "model" in model-based RL refers to the transition dynamics, which Q-learning never uses.
For a longer explanation, check out this post.
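The model-free character is visible in the tabular Q-learning update itself: it uses only the sampled transition (s, a, r, s'), and transition probabilities never appear (a minimal sketch; the state/action counts and hyperparameters are illustrative):

import numpy as np

n_states, n_actions = 100, 2      # illustrative discretization
alpha, gamma = 0.1, 0.99          # illustrative hyperparameters
Q = np.zeros((n_states, n_actions))

def q_update(s, a, r, s_next):
    # Only the observed transition (s, a, r, s_next) is used;
    # no transition model P(s'|s, a) appears anywhere, hence model-free.
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])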

Why neural network tends to output 'mean value'?

I am using Keras to build a simple neural network for a regression task.
But the output always tends toward the 'mean value' of the ground-truth y data.
See the first figure: blue is the ground truth, red is the predicted value (very close to the constant mean of the ground truth).
The model also stops learning very early, even though I set epochs=100.
Does anyone have ideas about the conditions under which a neural network stops learning early, and why the regression output tends toward 'the mean' of the ground truth?
Thanks!
Possibly because the data are unpredictable? Do you know for certain that the data set has some kind of N-th order predictability?
Just eyeballing your data set: it lacks periodicity, it lacks homoscedasticity, and it lacks any slope, skew, trend, or pattern. I can't really tell whether there is anything wrong with your net. In the absence of any pattern, the mean is always the best prediction, and it is entirely possible (although not certain) that the neural net is doing its job.
I suggest you find an easier data set, and see if you can tackle that first.
The model is not learning from the data. Think of basic linear regression: the 'null' prediction, i.e. the prediction you would make with no predictors at all, is just the expected value, the mean. This behavior could be caused by many different issues, but initialization comes to mind; bad initialization can lead to no learning at all. This blog post has good practical advice that may help.
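A quick way to check whether the model has learned anything beyond the null prediction is to compare it against a constant mean predictor (a sketch; X, y, and model stand for the asker's own data and Keras model):

import numpy as np

baseline_mse = np.mean((y - y.mean()) ** 2)               # always predict the mean
model_mse = np.mean((y - model.predict(X).ravel()) ** 2)  # the trained network
print(f'mean-baseline MSE: {baseline_mse:.4f}, model MSE: {model_mse:.4f}')
# If the two are nearly equal, the network has only learned the null prediction.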

Any visualizations of neural network decision process when recognizing images?

I'm enrolled in Coursera ML class and I just started learning about neural networks.
One thing that truly mystifies me is how recognizing something so “human”, like a handwritten digit, becomes easy once you find good weights for linear combinations.
It is even crazier when you understand that something seemingly abstract (like a car) can be recognized just by finding some really good parameters for linear combinations, combining them, and feeding them to each other.
Combinations of linear combinations (passed through nonlinearities) are much more expressive than I once thought.
This led me to wonder whether it is possible to visualize an NN's decision process, at least in simple cases.
For example, if my input is a 20x20 greyscale image (i.e. 400 features in total) and the output is one of 10 classes corresponding to recognized digits, I would love to see some kind of visual explanation of which cascades of linear combinations led the NN to its conclusion.
I naïvely imagine this might be implemented as a visual cue over the image being recognized, maybe a temperature map showing “pixels that affected the decision the most”, or anything else that helps to understand how the neural network worked in a particular case.
Is there some neural network demo that does just that?
This is not a direct answer to your question, but I would suggest you take a look at convolutional neural networks (CNNs). In CNNs you can almost see the concepts that are learned. You should read this publication:
Y. LeCun, L. Bottou, Y. Bengio and P. Haffner: Gradient-Based Learning Applied to Document Recognition, Proceedings of the IEEE, 86(11):2278-2324, November 1998
CNNs are often called "trainable feature extractors". In fact, CNNs implement 2D filters with trainable coefficients, which is why the activations of the first layers are usually shown as 2D images (see Fig. 13). In this paper the authors use another trick to make the network even more transparent: the last layer is a radial basis function layer (with Gaussian functions), i.e. the distance to an (adjustable) prototype for each class is calculated. You can really see the learned concepts by looking at the parameters of the last layer (see Fig. 3).
CNNs are still artificial neural networks, but the layers are not fully connected and some neurons share the same weights.
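In practice, "almost seeing" the learned concept in the first layer amounts to plotting its filter kernels as small images (a sketch for a hypothetical trained Keras CNN whose first layer is a Conv2D):

import matplotlib.pyplot as plt

# 'model' is a hypothetical trained Keras CNN; layers[0] is assumed to be Conv2D.
kernels = model.layers[0].get_weights()[0]   # shape: (kh, kw, in_channels, n_filters)
n_filters = kernels.shape[-1]
fig, axes = plt.subplots(1, n_filters, figsize=(2 * n_filters, 2))
for i, ax in enumerate(axes):
    ax.imshow(kernels[:, :, 0, i], cmap='gray')  # first input channel of filter i
    ax.axis('off')
plt.show()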
Maybe it doesn't answer the question directly, but I found this interesting piece in this paper by Andrew Ng, Jeff Dean, Quoc Le, Marc’Aurelio Ranzato, Rajat Monga, Matthieu Devin, Kai Chen and Greg Corrado (emphasis mine):
In this section, we will present two visualization techniques to verify if the optimal stimulus of the neuron is indeed a face. The first method is visualizing the most responsive stimuli in the test set. Since the test set is large, this method can reliably detect near optimal stimuli of the tested neuron. The second approach is to perform numerical optimization to find the optimal stimulus
...
These visualization methods have complementary strengths and weaknesses. For instance, visualizing the most responsive stimuli may suffer from fitting to noise. On the other hand, the numerical optimization approach can be susceptible to local minima. Results, shown [below], confirm that the tested neuron indeed learns the concept of faces.
In other words, they take the neuron that is best at recognizing faces and (1) select the images from the dataset that cause it to output the highest confidence, and (2) mathematically find an image (not in the dataset) that would get the highest confidence.
It's fun to see that it actually “captures” features of the human face.
The learning is unsupervised, i.e. the input data didn't say whether an image is a face or not.
Interestingly, the paper also shows generated “optimal input” images for cat heads and human bodies.
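The second technique, numerical optimization of the input, is essentially gradient ascent on the image while the network stays fixed. A minimal sketch in PyTorch ('model' and the neuron index 'unit' are hypothetical placeholders):

import torch

def optimal_stimulus(model, unit, steps=200, lr=0.1):
    # Start from a blank image and adjust its pixels to maximize one neuron.
    x = torch.zeros(1, 1, 20, 20, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        activation = model(x)[0, unit]
        (-activation).backward()       # minimizing the negative = gradient ascent
        opt.step()
        with torch.no_grad():
            x.clamp_(0, 1)             # keep pixel values in a valid range
    return x.detach()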