Evaluating DQN, Vehicle Routing Problem (VRP)

I am running this DQN algorithm that is trying to minimize the total distance traveled by a vehicle (VRP).
In training, as you can see in the images, everything works fine: the loss is decreasing, the average length is decreasing, and the reward is increasing.
However, in the evaluation phase the model behaves in a weird way. I run evaluation in batches of 100 episodes: the first batch of 100 gives good results, but subsequent batches sometimes give good results and sometimes very bad ones.
In the good runs I get a min total distance (min length) value of 4, but sometimes the evaluation returns a min value of 13, even though the evaluation is done on the same trained model.
So my question is: is this behavior normal? And is there a way to improve these evaluation results?
P.S.:
the number of episodes in training is 4000 (I also tried 10000 and it's the same thing)
the data is a random array of coordinates and an adjacency matrix of the Euclidean distances between the coordinates; for every new episode there are new random coordinate and distance arrays (see the sketch below)
the same goes for evaluation: I do 100 iterations of evaluation, and for each iteration new random data
in the evaluation I don't use any penalties or rewards; I only use them in the training
I am using PyTorch in this project
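A minimal sketch of that instance generation, assuming NumPy and 2D coordinates (the node count is illustrative):
import numpy as np

rng = np.random.default_rng()

def generate_instance(num_nodes=11):
    # New random coordinates for each episode / evaluation iteration.
    coords = rng.random((num_nodes, 2))
    # Adjacency matrix of pairwise Euclidean distances.
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return coords, dist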
Here's an example of the evaluation output:
shortest avg length found: 5.406301895156503 (this is the value from the training)
Here are two example solutions from the evaluation:
solution 1 [0, 1, 9, 4, 2, 3, 5, 0, 6, 7, 8, 10]
length 4.955087028443813
solution 2 [0, 4, 9, 3, 13, 0, 7, 13, 0, 10, 0, 6, 11, 5, 12, 1, 12, 0, 2, 12, 0, 8, 0]
length 10.15813521668315
The first 100 evaluations give results similar to solution 1; when I rerun evaluation for another 100, I get results similar to solution 2.

Related

Output dimension in PyTorch LSTM

I have time series data of shape about (10000, 6).
I made sequences of 10 time steps each, so the whole dataset has shape (9990, 10, 6).
If I use a batch size of 20 and feed this to an LSTM(batch_first=True), is the output shape (20, 10, n) correct?
I don't understand why n comes out as hidden_size in PyTorch.
How is the output of an RNN unit determined?
Is it 6 inputs and 1 output per unit?
If I'm right, the output should be (batch_size, hidden_size, 1)
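For reference, a minimal sketch that checks these shapes empirically (the hidden_size value is an arbitrary illustration):
import torch
import torch.nn as nn

batch_size, seq_len, num_features = 20, 10, 6
hidden_size = 32  # illustrative; this is the n you observe

lstm = nn.LSTM(input_size=num_features, hidden_size=hidden_size, batch_first=True)
x = torch.rand(batch_size, seq_len, num_features)

output, (h_n, c_n) = lstm(x)
print(output.shape)  # torch.Size([20, 10, 32]): (batch, seq_len, hidden_size)
print(h_n.shape)     # torch.Size([1, 20, 32]): (num_layers, batch, hidden_size)
The last dimension is hidden_size because each time step emits the full hidden-state vector, not a single scalar per unit.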

What are the differences between aggregation and concatenation in convolutional neural networks?

When I read some classical papers about CNNs, like the Inception family, ResNet, VGGNet and so on, I notice the terms concatenation, summation and aggregation, which confuse me (summation is easy for me to understand). Could someone tell me what the differences are among them? Maybe in a more specific way, like using examples to illustrate the differences in dimensionality and representational ability.
Concatenation generally consists of taking 2 or more output tensors from different network layers and concatenating them along the channel dimension
Aggregation consists of taking 2 or more output tensors from different network layers and applying a chosen multivariate function to them to aggregate the results
Summation is a special case of aggregation where the function is a sum
This implies that we lose information by doing aggregation. On the other hand, concatenation will make it possible to retain information at the cost of greater memory usage.
E.g. in PyTorch:
import torch
batch_size = 8
num_channels = 3
h, w = 512, 512
t1 = torch.rand(batch_size, num_channels, h, w) # A tensor with shape [8, 3, 512, 512]
t2 = torch.rand(batch_size, num_channels, h, w) # A tensor with shape [8, 3, 512, 512]
torch.cat((t1, t2), dim=1) # A tensor with shape [8, 6, 512, 512]
t1 + t2 # A tensor with shape [8, 3, 512, 512]
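Other aggregations are possible as well, e.g. an element-wise maximum (a sketch reusing the tensors above):
torch.maximum(t1, t2) # A tensor with shape [8, 3, 512, 512]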

What is the difference between performing upsampling together with strided transpose convolution and transpose convolution with stride 1 only?

I noticed in a number of places that people use something like this, usually in fully convolutional networks, autoencoders, and similar:
model.add(UpSampling2D(size=(2,2)))
model.add(Conv2DTranspose(kernel_size=k, padding='same', strides=(1,1)))
I am wondering what is the difference between that and simply:
model.add(Conv2DTranspose(kernel_size=k, padding='same', strides=(2,2)))
Links towards any papers that explain this difference are welcome.
You can find really nice explanations of how transposed convolutions work elsewhere online. To sum up both of these approaches:
In your first approach, you are first upsampling your feature map:
[[1, 2], [3, 4]] -> [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]]
and then you apply a classical convolution (as Conv2DTranspose with stride=1 and padding='same' is equivalent to Conv2D).
In your second approach you are first un(max)pooling your feature map:
[[1, 2], [3, 4]] -> [[1, 0, 2, 0], [0, 0, 0, 0], [3, 0, 4, 0], [0, 0, 0, 0]]
and then apply a classical convolution with the given kernel_size, filters, etc.
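For illustration, a minimal sketch comparing the output shapes of the two approaches (assuming tensorflow.keras; the filter count and kernel size are arbitrary):
import numpy as np
from tensorflow.keras import layers, models

x = np.random.rand(1, 2, 2, 1).astype("float32")

# Approach 1: fixed upsampling followed by a stride-1 transposed convolution.
m1 = models.Sequential([
    layers.UpSampling2D(size=(2, 2), input_shape=(2, 2, 1)),
    layers.Conv2DTranspose(1, kernel_size=3, padding="same", strides=(1, 1)),
])

# Approach 2: a single learnable stride-2 transposed convolution.
m2 = models.Sequential([
    layers.Conv2DTranspose(1, kernel_size=3, padding="same", strides=(2, 2),
                           input_shape=(2, 2, 1)),
])

print(m1.predict(x).shape)  # (1, 4, 4, 1)
print(m2.predict(x).shape)  # (1, 4, 4, 1)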
A fun fact: although these approaches are different, they share something in common. A transposed convolution is meant to approximate the gradient of a convolution, so the first approach approximates the gradient of sum pooling, whereas the second approximates the gradient of max pooling. This makes the first approach produce slightly smoother results.
Other reasons why you might see the first approach are:
Conv2DTranspose (and its equivalents) are relatively new in keras, so the only way to perform learnable upsampling was using UpSampling2D,
the author of keras, Francois Chollet, used this approach in one of his tutorials,
in the past, equivalents of transposed convolution seemed to work awfully in keras due to some API inconsistencies.
I just want to point out a couple of things that you mentioned. UpSampling2D is not a learnable layer, since it has literally zero parameters.
Also, the fact that Francois Chollet used this approach in one of his examples does not by itself justify using it.

How to get the probability of each vector belonging to each cluster?

I use the following code to create clusters. I would like to get the probability of each vector belonging to each cluster. How to do this?
import numpy as np
from nltk import cluster
from nltk.cluster import euclidean_distance
vectors = [np.array(f) for f in [[3, 3], [1, 2], [4, 2], [4, 0]]]
clusterer = cluster.KMeansClusterer(2, euclidean_distance)
clusters = clusterer.cluster(vectors, assign_clusters=True, trace=False)
import numpy as np
from sklearn import mixture

data = np.array([[3, 3], [1, 2], [4, 2], [4, 0]])  # the vectors from the question
model = mixture.GaussianMixture(n_components=2)  # GMM was renamed GaussianMixture in modern scikit-learn
model.fit(data)
model.predict_proba(data)
this returns the posterior probability of each mixture component for each observation.
But of course this won't help if the clustering doesn't converge for your data.
Are you talking about:
the assignments kmeans made to vectors from your vectors variable or
the assignment of a new vector to an existing cluster?
1. The K-means assignments
Simply print the clusters variables. If you see [0, 0, 1, 1], then it means [3, 3] and [1, 2] (the first two) got assigned to the cluster 0, and [4, 2] and [4, 0] (the last two) to the cluster 1. There's no probability here.
2. Assigning a new vector to an existing cluster
Since you're using KMeans, you first need to know the centroid of each cluster. The nltk API treats this as private information: the interesting variable (_means) is prefixed with an underscore. The variable could change in the future, but you can still access the value if you want to.
The NLTK algorithm is randomized, so you will get different centroids each time. As I said before, you can see the assignments with print(clusters) and the centroids with print(clusterer._means). Let's say you got the assignment [0, 0, 1, 1] with centroids [2, 2.5] and [4, 1]. A new vector (say [1, 2]) would be assigned to the cluster with the closest centroid. Again, it makes little sense to talk about probability here; if you really wanted to, you could compute the distance to every centroid and then apply a softmax to turn those scores into probabilities.
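For example, a minimal sketch of that distance-plus-softmax idea (the helper name is made up, and softmax over negative distances is only one possible scoring choice):
import numpy as np

def soft_assignments(vector, centroids):
    # Euclidean distance to every centroid; smaller means closer.
    dists = np.array([np.linalg.norm(vector - c) for c in centroids])
    # Softmax over negative distances: closer clusters get higher scores.
    scores = np.exp(-dists)
    return scores / scores.sum()

centroids = [np.array([2.0, 2.5]), np.array([4.0, 1.0])]  # centroids from the example above
print(soft_assignments(np.array([1.0, 2.0]), centroids))  # approx. [0.89, 0.11]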

What is the name of this data structure or technique of using relative difference between sequence members

Let's say I have a sequence of values (e.g., 3, 5, 8, 12, 15) and I want to occasionally decrease all of them by a certain value.
If I store them as the sequence (0, 2, 3, 4, 3) and keep a base variable equal to 3, I now only have to change the base (and check the first items) whenever I want to decrease them, instead of actually going over all the values.
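In code, the idea looks roughly like this (a minimal sketch; the class name is made up):
class DeltaSequence:
    def __init__(self, values):
        self.base = values[0]
        # Store successive differences instead of absolute values.
        self.deltas = [b - a for a, b in zip(values, values[1:])]

    def decrease_all(self, amount):
        # O(1): shifting the base shifts every reconstructed value.
        self.base -= amount

    def values(self):
        out, current = [self.base], self.base
        for d in self.deltas:
            current += d
            out.append(current)
        return out

seq = DeltaSequence([3, 5, 8, 12, 15])
seq.decrease_all(3)
print(seq.values())  # [0, 2, 5, 9, 12]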
I know there's an official term for this, but when I literally translate from my native language to English it doesn't come out right.
Differential Coding / Delta Encoding?
I don't know a name for the data structure, but it's basically just base+offset :-)
An offset?
If I understand your question right, you're rebasing. That's normally used in reference to patching up addresses in DLLs from a load address.
I'm not sure that's what you're doing, because your example seems to be incorrect. In order to come out with { 3, 5, 8, 12, 15 }, with a base of 3, you'd need { 0, 2, 5, 9, 12 }.
I'm not sure. If you imagine your first array as providing the results of some function of an index value f(i), where f(0) is 3, f(1) is 5, and so forth, then your second array is describing the function f'(i) where f(i+1) = f(i) + f'(i+1), given f(0) = 3.
I'd call it something like a derivative function, where the process of retrieving your original data is simply the summation function.
What will happen more often, will you be changing f(0) or retrieving values from f(i)? Is this technique rooted in a desire to optimize?
Perhaps you're looking for a term like "Inductive Sequence" or "Induction Sequence." (I just made that up.)