How does the wrapper normalizeFeatures behave with a validation set? - mlr

I am wondering how the function normalizeFeatures works along with a resampling strategy. Which of these statements is true?
1) The whole task data is normalized.
2) The training data is normalized, and the parameters of that normalization (say, mean and sd in a classic standardization) are used to normalize the validation data (roughly what mlrCPO::retrafo does).
Thank you for your help!

The function normalizeFeatures() can be called on a data.frame or a Task object.
In both cases it does the same thing: it simply normalizes all the data it is given. So statement 1) is true.
If you want to achieve the second, you have two options:
a) preprocWrapperCaret
The wrapper puts the scaling step in front of both training and prediction. During training, the scaling parameters are estimated on the training data, saved, and applied.
During prediction, the saved scaling parameters are applied to the new data.
library(mlr)
lrn = makeLearner("classif.svm")
lrn = makePreprocWrapperCaret(lrn, ppc.center = TRUE, ppc.scale = TRUE)
set.seed(1)
res = resample(lrn, iris.task, resampling = hout, models = TRUE)
# the scaling parameters learnt on the training split
res$models[[1]]$learner.model$control$mean
Sepal.Length Sepal.Width Petal.Length Petal.Width
5.831 3.030 3.782 1.222
res$models[[1]]$learner.model$control$std
Sepal.Length Sepal.Width Petal.Length Petal.Width
0.8611356 0.4118203 1.7487877 0.7710127
b) mlrCPO
A slightly more elegant and flexible approach is to build a preprocessing pipeline with the mlrCPO package, which has the same effect as the wrapper in this case.
library(mlr)
library(mlrCPO)
lrn = cpoScale(center = TRUE, scale = TRUE) %>>% makeLearner("classif.svm")
set.seed(1)
res = resample(lrn, iris.task, resampling = hout, models = TRUE)
# the scaling parameters learnt on the training split
res$models[[1]]$learner.model$retrafo$element$state
$center
Sepal.Length Sepal.Width Petal.Length Petal.Width
5.831 3.030 3.782 1.222
$scale
Sepal.Length Sepal.Width Petal.Length Petal.Width
0.8611356 0.4118203 1.7487877 0.7710127
I set the seed to obtain the same training split for both cases so that the learnt scaling parameters are the same for both approaches.
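For readers who want to see the difference between the two statements outside of mlr, here is a minimal sketch in plain Python. It is not mlr code, and the helper names fit_scaler and apply_scaler are made up for illustration. It learns the mean and standard deviation on the training split only and reuses them on the validation split, which is the behaviour both options above give you inside each resampling iteration.

```python
from statistics import mean, stdev

def fit_scaler(values):
    """Learn centering/scaling parameters from the given values only."""
    return {"mean": mean(values), "sd": stdev(values)}

def apply_scaler(values, params):
    """Standardize values with previously learnt parameters."""
    return [(v - params["mean"]) / params["sd"] for v in values]

train = [4.9, 5.1, 5.8, 6.3, 7.0]
valid = [5.0, 6.9]

# Statement 2: the parameters come from the training split ...
params = fit_scaler(train)
train_scaled = apply_scaler(train, params)
# ... and the validation data reuses them unchanged.
valid_scaled = apply_scaler(valid, params)

# Statement 1 corresponds to fitting on all rows at once, so the
# validation rows influence the preprocessing parameters.
leaky_params = fit_scaler(train + valid)
```

The two parameter sets differ, which is exactly the leakage you avoid by wrapping the scaling into the learner.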

Difference between WGAN and WGAN-GP (Gradient Penalty)

I just find that in the code here:
https://github.com/NUS-Tim/Pytorch-WGAN/tree/master/models
The "generator" loss, G, between WGAN and WGAN-GP is different, for WGAN:
g_loss = self.D(fake_images)
g_loss = g_loss.mean().mean(0).view(1)
g_loss.backward(one) # !!!
g_cost = -g_loss
But for WGAN-GP:
g_loss = self.D(fake_images)
g_loss = g_loss.mean()
g_loss.backward(mone) # !!!
g_cost = -g_loss
Why does one version use one = 1 and the other mone = -1?
You might have misread the source code: the first sample you gave does not average the result of D to compute its loss, it uses the binary cross-entropy instead.
To be more precise:
The first method ("GAN") uses the BCE loss to compute the loss terms for D and G. The standard GAN objective for D is to maximize E_x[log(D(x))] + E_z[log(1-D(G(z)))], which is the same as minimizing the sum of the two BCE terms. Source code:
outputs = self.D(images)
d_loss_real = self.loss(outputs.flatten(), real_labels) # <- bce loss
real_score = outputs
# Compute BCELoss using fake images
fake_images = self.G(z)
outputs = self.D(fake_images)
d_loss_fake = self.loss(outputs.flatten(), fake_labels) # <- bce loss
fake_score = outputs
# Optimize discriminator
d_loss = d_loss_real + d_loss_fake
self.D.zero_grad()
d_loss.backward()
self.d_optimizer.step()
For d_loss_real you optimize towards 1s (output is considered real), while d_loss_fake optimizes towards 0s (output is considered fake).
The second ("WCGAN") uses the Wasserstein loss, whereby D maximizes E_x[D(x)] - E_z[D(G(z))]. Source code:
# Train discriminator
# WGAN - Training discriminator more iterations than generator
# Train with real images
d_loss_real = self.D(images)
d_loss_real = d_loss_real.mean()
d_loss_real.backward(mone)
# Train with fake images
z = self.get_torch_variable(torch.randn(self.batch_size, 100, 1, 1))
fake_images = self.G(z)
d_loss_fake = self.D(fake_images)
d_loss_fake = d_loss_fake.mean()
d_loss_fake.backward(one)
# [...]
Wasserstein_D = d_loss_real - d_loss_fake
By calling d_loss_real.backward(mone) you backpropagate a gradient of opposite sign, i.e. it is gradient ascent, and you end up maximizing d_loss_real.
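To see why passing -1 to backward() amounts to gradient ascent, here is a small sketch in plain Python rather than PyTorch. It assumes a toy one-parameter critic D(x) = w * x, which is purely illustrative and not the repository's model, and applies an ordinary descent update to a gradient that was scaled by mone or one first.

```python
def critic(w, xs):
    """Toy critic D(x) = w * x, averaged over a batch."""
    return sum(w * x for x in xs) / len(xs)

def grad_w(xs):
    """d/dw of the batch mean of w * x is the batch mean of x."""
    return sum(xs) / len(xs)

real = [1.0, 2.0, 3.0]
w, lr = 0.5, 0.1
mone, one = -1.0, 1.0

before = critic(w, real)

# backward(mone) scales the gradient by -1; the optimizer then still
# performs the usual descent step w <- w - lr * grad.
w_ascent = w - lr * (mone * grad_w(real))
after_ascent = critic(w_ascent, real)     # D(real) went up

# backward(one) leaves the sign untouched, so the same step lowers the value.
w_descent = w - lr * (one * grad_w(real))
after_descent = critic(w_descent, real)   # D(real) went down
```

So the sign passed to backward() only decides whether the following optimizer step increases or decreases that term.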
To update the D network:
lossD = E[D(fake data)] - E[D(real data)] + gradient penalty
Minimizing lossD (lossD ↓) requires D(real data) to go up (D(real data) ↑),
so the real-data term has to be backpropagated with a factor of minus one.

How to estimate the parameters of a mixture model in OpenTURNS?

I would like to estimate the parameters of a mixture model of normal distributions in OpenTURNS (that is, a distribution whose PDF is a weighted sum of Gaussian PDFs). OpenTURNS can create such a mixture, but it cannot estimate its parameters. Moreover, I need to create the mixture as an OpenTURNS distribution in order to propagate uncertainty through a function.
For example, I know how to create a mixture of two normal distributions:
import openturns as ot
mu1 = 1.0
sigma1 = 0.5
mu2 = 3.0
sigma2 = 2.0
weights = [0.3, 0.7]
n1 = ot.Normal(mu1, sigma1)
n2 = ot.Normal(mu2, sigma2)
m = ot.Mixture([n1, n2], weights)
In this example, I would like to estimate mu1, sigma1, mu2, sigma2 on a given sample. In order to create a working example, it is easy to generate a sample by simulation.
s = m.getSample(100)
You can rely on scikit-learn's GaussianMixture to estimate the parameters and then use them to define a Mixture model in OpenTURNS.
The script below defines a Python class MixtureFactory that fits a scikit-learn GaussianMixture and outputs an OpenTURNS Mixture distribution:
from sklearn.mixture import GaussianMixture
from sklearn.utils.validation import check_is_fitted
import openturns as ot
import numpy as np
class MixtureFactory(GaussianMixture):
"""
Representation of a Gaussian mixture model probability distribution.
This class allows to estimate the parameters of a Gaussian mixture
distribution using scikit algorithms & provides openturns Mixture object.
Read more in scikit learn user guide & openturns theory.
Parameters:
-----------
n_components : int, defaults to 2.
The number of mixture components.
covariance_type : {'full' (default), 'tied', 'diag', 'spherical'}
String describing the type of covariance parameters to use.
Must be one of:
'full'
each component has its own general covariance matrix
'tied'
all components share the same general covariance matrix
'diag'
each component has its own diagonal covariance matrix
'spherical'
each component has its own single variance
tol : float, defaults to 1e-6.
The convergence threshold. EM iterations will stop when the
lower bound average gain is below this threshold.
reg_covar : float, defaults to 1e-6.
Non-negative regularization added to the diagonal of covariance.
Allows to assure that the covariance matrices are all positive.
max_iter : int, defaults to 1000.
The number of EM iterations to perform.
n_init : int, defaults to 1.
The number of initializations to perform. The best results are kept.
init_params : {'kmeans', 'random'}, defaults to 'kmeans'.
The method used to initialize the weights, the means and the
precisions.
Must be one of::
'kmeans' : responsibilities are initialized using kmeans.
'random' : responsibilities are initialized randomly.
weights_init : array-like, shape (n_components, ), optional
The user-provided initial weights, defaults to None.
If it is None, weights are initialized using the `init_params` method.
means_init : array-like, shape (n_components, n_features), optional
The user-provided initial means, defaults to None,
If it is None, means are initialized using the `init_params` method.
precisions_init : array-like, optional.
The user-provided initial precisions (inverse of the covariance
matrices), defaults to None.
If it is None, precisions are initialized using the 'init_params' method.
The shape depends on 'covariance_type'::
(n_components,) if 'spherical',
(n_features, n_features) if 'tied',
(n_components, n_features) if 'diag',
(n_components, n_features, n_features) if 'full'
random_state : int, RandomState instance or None, optional (default=41)
If int, random_state is the seed used by the random number generator;
If RandomState instance, random_state is the random number generator;
If None, the random number generator is the RandomState instance used
by `np.random`.
warm_start : bool, default to False.
If 'warm_start' is True, the solution of the last fitting is used as
initialization for the next call of fit(). This can speed up
convergence when fit is called several times on similar problems.
In that case, 'n_init' is ignored and only a single initialization
occurs upon the first call.
See :term:`the Glossary <warm_start>`.
verbose : int, default to 0.
Enable verbose output. If 1 then it prints the current
initialization and each iteration step. If greater than 1 then
it prints also the log probability and the time needed
for each step.
verbose_interval : int, default to 10.
Number of iteration done before the next print.
"""
def __init__(self, n_components=2, covariance_type='full', tol=1e-6,
reg_covar=1e-6, max_iter=1000, n_init=1, init_params='kmeans',
weights_init=None, means_init=None, precisions_init=None,
random_state=41, warm_start=False,
verbose=0, verbose_interval=10):
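# pass everything by keyword: recent scikit-learn versions make all
# arguments after n_components keyword-only
super().__init__(n_components=n_components, covariance_type=covariance_type,
tol=tol, reg_covar=reg_covar, max_iter=max_iter, n_init=n_init,
init_params=init_params, weights_init=weights_init, means_init=means_init,
precisions_init=precisions_init, random_state=random_state,
warm_start=warm_start, verbose=verbose, verbose_interval=verbose_interval)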
def fit(self, X):
"""
Fit the mixture model parameters.
EM algorithm is applied here to estimate the model parameters and build a
Mixture distribution (see openturns mixture).
The method fits the model ``n_init`` times and sets the parameters with
which the model has the largest likelihood or lower bound. Within each
trial, the method iterates between E-step and M-step for ``max_iter``
times until the change of likelihood or lower bound is less than
``tol``, otherwise, a ``ConvergenceWarning`` is raised.
If ``warm_start`` is ``True``, then ``n_init`` is ignored and a single
initialization is performed upon the first call. Upon consecutive
calls, training starts where it left off.
Parameters
----------
X : array-like, shape (n_samples, n_features)
List of n_features-dimensional data points. Each row
corresponds to a single data point.
Returns
-------
self : the fitted estimator; call get_mixture() to obtain the OpenTURNS Mixture.
"""
data = np.array(X)
# Evaluate the model parameters.
super().fit(data)
# openturns mixture
# n_components ==> weight of size n_components
weights = self.weights_
n_components = len(weights)
# Create ot distribution
collection = n_components * [0]
# Covariance matrices
cov = self.covariances_
mu = self.means_
# means : n_components x n_features
n_components, n_features = mu.shape
# Following the type of covariance, we define the collection of gaussians
# Spherical : C_k = Identity * sigma_k
if self.covariance_type == 'spherical':
c = ot.CorrelationMatrix(n_features)
for l in range(n_components):
sigma = np.sqrt(cov[l])
collection[l] = ot.Normal(list(mu[l]), [ sigma ] * n_features , c)
elif self.covariance_type == 'diag' :
for l in range(n_components):
c = ot.CovarianceMatrix(n_features)
for i in range(n_features):
c[i,i] = cov[l, i]
collection[l] = ot.Normal(list(mu[l]), c)
elif self.covariance_type == 'tied':
# Same covariance for all clusters
c = ot.CovarianceMatrix(n_features)
for i in range(n_features):
for j in range(0, i+1):
c[i,j] = cov[i,j]
# Define the collection with the same covariance
for l in range(n_components):
collection[l] = ot.Normal(list(mu[l]), c)
else:
n_features = cov.shape[1]
for l in range(n_components):
c = ot.CovarianceMatrix(n_features)
for i in range(n_features):
for j in range(0, i+1):
c[i,j] = cov[l][i,j]
collection[l] = ot.Normal(list(mu[l]), c)
self._mixture = ot.Mixture(collection, weights)
return self
def get_mixture(self):
"""
Returns the Mixture object
"""
check_is_fitted(self)
return self._mixture
if __name__ == "__main__":
mu1 = 1.0
sigma1 = 0.5
mu2 = 3.0
sigma2 = 2.0
weights = [0.3, 0.7]
n1 = ot.Normal(mu1, sigma1)
n2 = ot.Normal(mu2, sigma2)
m = ot.Mixture([n1, n2], weights)
x = m.getSample(1000)
est_dist = MixtureFactory(random_state=1)
est_dist.fit(x)
print(est_dist.get_mixture())
I have actually tried this method and it works perfectly. On top of that, both fitting the model with the scikit-learn GMM and the subsequent conversion to OpenTURNS are very fast. I recommend that future users test several numbers of components and covariance matrix structures, as it will not take much time and can substantially improve the goodness of fit to the data.
Thanks for the answer.
Here is a pure OpenTURNS solution. It is probably slower than the scikit-learn-based method, but it is more generic: you could use it to estimate the parameters of any mixture model, not necessarily a mixture of normal distributions.
The idea is to compute the log-likelihood with the Mixture object and maximize it (in practice, we minimize its opposite).
In the following, let us assume that s is the sample we want to fit the mixture on.
First, we need to build the mixture we want to estimate the parameters of. We can specify any valid set of parameters, it does not matter. In your example, you want a mixture of 2 normal distributions.
mixture = ot.Mixture([ot.Normal()]*2, [0.5]*2)
There is a small hurdle. All weights sum to 1, thus one of them is determined by the others: the solver must not be allowed to freely set it. The order of the parameters of an OpenTURNS Mixture is as follows:
weight of the first distribution;
parameters of the first distribution;
weight of the second distribution;
parameters of the second distribution;
...
You can view all parameters with mixture.getParameter() and their names with mixture.getParameterDescription(). The following is a helper function that:
takes as input the list containing all mixture parameters except the weight of its first distribution;
outputs a Point containing all parameters including the weight of the first distribution.
def full(params):
    """
    Point of all mixture parameters from a list that omits the first weight.
    """
    params = ot.Point(params)
    aux_mixture = ot.Mixture(mixture)
    dist_number = aux_mixture.getDistributionCollection().getSize()
    index = aux_mixture.getDistributionCollection()[0].getParameter().getSize()
    list_weights = []
    for num in range(1, dist_number):
        list_weights.append(params[index])
        index += 1 + aux_mixture.getDistributionCollection()[num].getParameter().getSize()
    complementary_weight = ot.Point([abs(1.0 - sum(list_weights))])
    complementary_weight.add(params)
    return complementary_weight
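If the indexing in full() looks opaque, the same bookkeeping can be sketched without OpenTURNS. The function below is only an illustration on plain lists: restore_first_weight and its sizes argument (the number of parameters of each component) are names I made up, and unlike full() it does not take the absolute value of the restored weight.

```python
def restore_first_weight(params, sizes):
    """Prepend the first weight to a flat parameter list that omits it.

    params: [p1..., w2, p2..., w3, p3..., ...]
    sizes:  number of parameters of each component, e.g. [2, 2] for two normals
    """
    index = sizes[0]                  # position of the second component's weight
    other_weights = []
    for size in sizes[1:]:
        other_weights.append(params[index])
        index += 1 + size             # skip this weight and the component's parameters
    first_weight = 1.0 - sum(other_weights)
    return [first_weight] + list(params)

# Two normal components: (mu1, sigma1), then w2, (mu2, sigma2).
reduced = [1.0, 0.5, 0.7, 3.0, 2.0]
full_params = restore_first_weight(reduced, sizes=[2, 2])
```

The first entry of full_params is the weight that makes all weights sum to 1, followed by the reduced list unchanged.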
The next function computes the opposite of the log-likelihood of a given list of parameters (except the first weight).
For the sake of numerical stability, it divides this value by the number of observations.
We will minimize this function in order to find the Maximum Likelihood Estimate.
def minus_log_pdf(params):
    """
    - log-likelihood of a list of parameters excepting the first weight
    divided by the number of observations
    """
    aux_mixture = ot.Mixture(mixture)
    full_params = full(params)
    try:
        aux_mixture.setParameter(full_params)
    except TypeError:
        # case where the proposed parameters are invalid:
        # return a huge value
        return [ot.SpecFunc.LogMaxScalar]
    res = - aux_mixture.computeLogPDF(s).computeMean()
    return res
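For reference, here is what that objective computes, written out by hand for the two-normal case in plain Python. This is only a sketch under my own naming (normal_pdf, mean_neg_log_likelihood) and it returns infinity for invalid parameters where the OpenTURNS version returns a huge finite value.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def mean_neg_log_likelihood(sample, w1, mu1, sigma1, mu2, sigma2):
    """Opposite of the log-likelihood of a two-component normal mixture,
    divided by the number of observations."""
    if not (0.0 <= w1 <= 1.0) or sigma1 <= 0.0 or sigma2 <= 0.0:
        return float("inf")           # invalid parameters: penalize heavily
    total = 0.0
    for x in sample:
        density = (w1 * normal_pdf(x, mu1, sigma1)
                   + (1.0 - w1) * normal_pdf(x, mu2, sigma2))
        total += math.log(density)
    return -total / len(sample)

sample = [0.8, 1.1, 1.3, 2.9, 3.5, 4.2]
# Parameters close to the data give a lower value than parameters far away.
good = mean_neg_log_likelihood(sample, 0.5, 1.0, 0.5, 3.5, 1.0)
bad = mean_neg_log_likelihood(sample, 0.5, 10.0, 0.5, -10.0, 1.0)
```

Minimizing this quantity over the parameters is the same as maximizing the likelihood; dividing by the sample size only rescales it.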
To use OpenTURNS optimization facilities, we need to turn this function into a PythonFunction object.
OT_minus_log_pdf = ot.PythonFunction(mixture.getParameter().getSize()-1, 1, minus_log_pdf)
Cobyla is usually good at likelihood optimization.
problem = ot.OptimizationProblem(OT_minus_log_pdf)
algo = ot.Cobyla(problem)
In order to decrease chances of Cobyla being stuck on a local minimum, we are going to use MultiStart. We pick a starting set of parameters and randomly change the weights. The following helper function makes it easy:
def random_weights(params, nb):
    """
    List of nb Points representing mixture parameters with randomly varying weights.
    """
    aux_mixture = ot.Mixture(mixture)
    full_params = full(params)
    aux_mixture.setParameter(full_params)
    list_params = []
    for num in range(nb):
        dirichlet = ot.Dirichlet([1.0] * aux_mixture.getDistributionCollection().getSize()).getRealization()
        dirichlet.add(1.0 - sum(dirichlet))
        aux_mixture.setWeights(dirichlet)
        list_params.append(aux_mixture.getParameter()[1:])
    return list_params
We pick 10 starting points and increase the maximum number of evaluations of the log-likelihood from 100 (the default) to 10000.
init = mixture.getParameter()[1:]
starting_points = random_weights(init, 10)
algo_multistart = ot.MultiStart(algo, starting_points)
algo_multistart.setMaximumEvaluationNumber(10000)
Let's run the solver and retrieve the result.
algo_multistart.run()
result = algo_multistart.getResult()
All that remains is to set the mixture's parameters to the optimal value.
We must not forget to add the first weight back!
optimal_parameters = result.getOptimalPoint()
mixture.setParameter(full(optimal_parameters))
Below is an alternative.
The first step creates a new GaussianMixture class, derived from PythonDistribution. The key point is to implement the computeLogPDF method and the setParameter/getParameter methods. Notice that this parametrization of a mixture has only a single weight w.
class GaussianMixture(ot.PythonDistribution):
    def __init__(self, mu1=-5.0, sigma1=1.0,
                 mu2=5.0, sigma2=1.0,
                 w=0.5):
        super(GaussianMixture, self).__init__(1)
        if w < 0.0 or w > 1.0:
            raise ValueError('The weight is not in [0, 1]. w=%s.' % (w))
        self.mu1 = mu1
        self.sigma1 = sigma1
        self.mu2 = mu2
        self.sigma2 = sigma2
        self.w = w
        collDist = [ot.Normal(mu1, sigma1), ot.Normal(mu2, sigma2)]
        weight = [w, 1.0 - w]
        self.distribution = ot.Mixture(collDist, weight)

    def computeCDF(self, x):
        p = self.distribution.computeCDF(x)
        return p

    def computePDF(self, x):
        p = self.distribution.computePDF(x)
        return p

    def computeQuantile(self, prob, tail=False):
        quantile = self.distribution.computeQuantile(prob, tail)
        return quantile

    def getSample(self, size):
        X = self.distribution.getSample(size)
        return X

    def getParameter(self):
        parameter = ot.Point([self.mu1, self.sigma1,
                              self.mu2, self.sigma2,
                              self.w])
        return parameter

    def setParameter(self, parameter):
        [mu1, sigma1, mu2, sigma2, w] = parameter
        self.__init__(mu1, sigma1, mu2, sigma2, w)
        return parameter

    def computeLogPDF(self, sample):
        logpdf = self.distribution.computeLogPDF(sample)
        return logpdf
In order to create the distribution, we use the Distribution class:
gm = ot.Distribution(GaussianMixture())
Estimating the parameters of this distribution is straightforward with MaximumLikelihoodFactory. However, we must set bounds, because sigma cannot be negative and w must lie in (0, 1).
factory = ot.MaximumLikelihoodFactory(gm)
lowerBound = [0.0, 1.e-6, 0.0, 1.e-6, 0.01]
upperBound = [0.0, 0.0, 0.0, 0.0, 0.99]
finiteLowerBound = [False, True, False, True, True]
finiteUpperBound = [False, False, False, False, True]
bounds = ot.Interval(lowerBound, upperBound, finiteLowerBound, finiteUpperBound)
factory.setOptimizationBounds(bounds)
Then we configure the optimization solver.
solver = factory.getOptimizationAlgorithm()
startingPoint = [-4.0, 1.0, 7.0, 1.5, 0.3]
solver.setStartingPoint(startingPoint)
factory.setOptimizationAlgorithm(solver)
Estimating the parameters is based on the build method.
distribution = factory.build(sample)
There are two limitations with this implementation.
First, it is not as fast as it should be, because of a limitation of the PythonDistribution (see https://github.com/openturns/openturns/issues/1391).
Second, estimating the parameters may be difficult, because the problem may have local optima that the default algorithm in MaximumLikelihoodFactory cannot escape. This kind of task is generally done with the EM algorithm.
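Since EM is mentioned as the usual tool, here is a minimal sketch of it for the same 1D two-normal parametrization (mu1, sigma1, mu2, sigma2, w). It is written in plain Python without OpenTURNS, the function name em_two_normals is my own, and it has no safeguards against a collapsing component, so take it as an illustration of the E-step and M-step rather than a replacement for a library implementation.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def em_two_normals(sample, mu1, sigma1, mu2, sigma2, w=0.5, n_iter=50):
    """Plain EM for the mixture w * N(mu1, sigma1) + (1 - w) * N(mu2, sigma2)."""
    n = len(sample)
    for _ in range(n_iter):
        # E-step: responsibility of the first component for each observation.
        resp = []
        for x in sample:
            a = w * normal_pdf(x, mu1, sigma1)
            b = (1.0 - w) * normal_pdf(x, mu2, sigma2)
            resp.append(a / (a + b))
        # M-step: responsibility-weighted means, standard deviations and weight.
        n1 = sum(resp)
        n2 = n - n1
        mu1 = sum(r * x for r, x in zip(resp, sample)) / n1
        mu2 = sum((1.0 - r) * x for r, x in zip(resp, sample)) / n2
        sigma1 = math.sqrt(sum(r * (x - mu1) ** 2 for r, x in zip(resp, sample)) / n1)
        sigma2 = math.sqrt(sum((1.0 - r) * (x - mu2) ** 2 for r, x in zip(resp, sample)) / n2)
        w = n1 / n
    return mu1, sigma1, mu2, sigma2, w

# Simulated data: 30% from N(-5, 1) and 70% from N(5, 1).
rng = random.Random(0)
sample = [rng.gauss(-5.0, 1.0) if rng.random() < 0.3 else rng.gauss(5.0, 1.0)
          for _ in range(2000)]

# Same starting point as used for the solver above, with w started at 0.3.
mu1, sigma1, mu2, sigma2, w = em_two_normals(sample, -4.0, 1.0, 7.0, 1.5, w=0.3)
```

The estimated values could then be passed to ot.Mixture to get an OpenTURNS distribution, as in the scikit-learn-based answer.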

Keras' ImageDataGenerator.flow() results in very low training/validation accuracy as opposed to flow_from_directory()

I am trying to train a very simple model for image recognition, nothing spectacular. My first attempt worked just fine, when I used image rescaling:
# this is the augmentation configuration to enhance the training dataset
train_datagen = ImageDataGenerator(
rescale=1. / 255,
shear_range=0.2,
zoom_range=0.2,
horizontal_flip=True)
# validation generator, only rescaling
test_datagen = ImageDataGenerator(rescale=1. / 255)
train_generator = train_datagen.flow_from_directory(
train_data_dir,
target_size=(img_width, img_height),
batch_size=batch_size,
class_mode='categorical')
validation_generator = test_datagen.flow_from_directory(
validation_data_dir,
target_size=(img_width, img_height),
batch_size=batch_size,
class_mode='categorical')
Then I simply trained the model as such:
model.fit_generator(
train_generator,
steps_per_epoch=nb_train_samples // batch_size,
epochs=epochs,
validation_data=validation_generator,
validation_steps=nb_validation_samples // batch_size)
This works perfectly fine and leads to a reasonable accuracy. Then I thought it might be a good idea to try mean subtraction, as the VGG16 model does. Instead of doing it manually, I chose to use ImageDataGenerator.fit(). For that, however, you need to supply the training images as numpy arrays, so I first read the images, convert them, and then feed them to it:
train_datagen = ImageDataGenerator(
featurewise_center=True,
shear_range=0.2,
zoom_range=0.2,
horizontal_flip=True)
test_datagen = ImageDataGenerator(featurewise_center=True)
def process_images_from_directory(data_dir):
x = []
y = []
for root, dirs, files in os.walk(data_dir, topdown=False):
class_names = sorted(dirs)
global class_indices
if len(class_indices) == 0:
class_indices = dict(zip(class_names, range(len(class_names))))
for dir in class_names:
filenames = os.listdir(os.path.join(root,dir))
for file in filenames:
img_array = img_to_array(load_img(os.path.join(root,dir,file), target_size=(224, 224)))[np.newaxis]
if len(x) == 0:
x = img_array
else:
x = np.concatenate((x,img_array))
y.append(class_indices[dir])
# this step converts an array of classes [0,1,2,3...] into one-hot vectors [1,0,0,0], [0,1,0,0], etc.
y = np.eye(len(class_names))[y]
return x, y
x_train, y_train = process_images_from_directory(train_data_dir)
x_valid, y_valid = process_images_from_directory(validation_data_dir)
nb_train_samples = x_train.shape[0]
nb_validation_samples = x_valid.shape[0]
train_datagen.fit(x_train)
test_datagen.mean = train_datagen.mean
train_generator = train_datagen.flow(
x_train,
y_train,
batch_size=batch_size,
shuffle=False)
validation_generator = test_datagen.flow(
x_valid,
y_valid,
batch_size=batch_size,
shuffle=False)
Then, I train the model the same way, simply giving it both iterators. After the training completes, the accuracy is basically stuck at ~25% even after 50 epochs:
80/80 [==============================] - 77s 966ms/step - loss: 12.0886 - acc: 0.2500 - val_loss: 12.0886 - val_acc: 0.2500
When I run predictions on the above model, it classifies only 1 out of 4 classes correctly; all images from the other 3 classes are classified as belonging to the first class. Clearly the 25% has something to do with this fact, I just can't figure out what I am doing wrong.
I realize that I could calculate the mean manually and then simply set it for both generators, or that I could use ImageDataGenerator.fit() and then still go with flow_from_directory, but that would be a waste of already processed images, I would be doing the same processing twice.
Any opinions on how to make it work with flow() all the way?
Did you try setting shuffle=True in your generators?
You did not specify shuffling in the first case (it should be True by default) and set it to False in the second case.
Your input data might be sorted by classes. Without shuffling, your model first only sees class #1 and simply learns to predict class #1 always. It then sees class #2 and learns to always predict class #2 and so on. At the end of one epoch your model learns to always predict class #4 and thus gives a 25% accuracy on validation.
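To make the effect visible without Keras, here is a small sketch in plain Python. The batches helper is hypothetical and only mimics what a generator does with the sample order; the labels are laid out class by class, as they would be after walking the class directories one after another.

```python
import random

def batches(labels, batch_size, shuffle, seed=0):
    """Yield batches of labels, optionally shuffling the sample order first."""
    order = list(range(len(labels)))
    if shuffle:
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        yield [labels[i] for i in order[start:start + batch_size]]

# 4 classes with 8 images each, stored class by class.
labels = [c for c in range(4) for _ in range(8)]

sorted_batches = list(batches(labels, batch_size=8, shuffle=False))
mixed_batches = list(batches(labels, batch_size=8, shuffle=True))

# Number of distinct classes the model sees within each batch.
classes_per_sorted_batch = [len(set(b)) for b in sorted_batches]
classes_per_mixed_batch = [len(set(b)) for b in mixed_batches]
```

Without shuffling every batch contains a single class, so each gradient step pushes the model towards predicting just that class.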

Function approximator and q-learning

I am trying to implement Q-learning with an action-value function approximator. I am using openai-gym and the "MountainCar-v0" environment to test my algorithm. My problem is that it does not converge or find the goal at all.
Basically, the approximator works as follows: you feed in the 2 features, position and velocity, and one of the 3 actions in a one-hot encoding: 0 -> [1,0,0], 1 -> [0,1,0] and 2 -> [0,0,1]. The output is the action-value approximation Q_approx(s,a) for one specific action.
I know that usually the input is the state (2 features) and the output layer contains 1 output for each action. The big difference that I see is that I have to run the feed-forward pass 3 times (once for each action) and take the max, while in the standard implementation you run it once and take the max over the outputs.
Maybe my implementation is just completely wrong and I am thinking about this the wrong way. I am going to paste the code here; it is a mess, but I am just experimenting a bit:
import gym
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation
env = gym.make('MountainCar-v0')
# The mean reward over 20 episodes
mean_rewards = np.zeros(20)
# Feature numpy holder
features = np.zeros(5)
# Q_a value holder
qa_vals = np.zeros(3)
one_hot = {
0 : np.asarray([1,0,0]),
1 : np.asarray([0,1,0]),
2 : np.asarray([0,0,1])
}
model = Sequential()
model.add(Dense(20, activation="relu",input_dim=(5)))
model.add(Dense(10,activation="relu"))
model.add(Dense(1))
model.compile(optimizer='rmsprop',
loss='mse',
metrics=['accuracy'])
epsilon_greedy = 0.1
discount = 0.9
batch_size = 16
# Experience replay containing features and target
experience = np.ones((10*300,5+1))
# Ring buffer
def add_exp(features,target,index):
if index % experience.shape[0] == 0:
index = 0
global filled_once
filled_once = True
experience[index,0:5] = features
experience[index,5] = target
index += 1
return index
for e in range(0,100000):
obs = env.reset()
old_obs = None
new_obs = obs
rewards = 0
loss = 0
for i in range(0,300):
if old_obs is not None:
# Find q_a max for s_(t+1)
features[0:2] = new_obs
for i,pa in enumerate([0,1,2]):
features[2:5] = one_hot[pa]
qa_vals[i] = model.predict(features.reshape(-1,5))
rewards += reward
target = reward + discount*np.max(qa_vals)
features[0:2] = old_obs
features[2:5] = one_hot[a]
fill_index = add_exp(features,target,fill_index)
# Find new action
if np.random.random() < epsilon_greedy:
a = env.action_space.sample()
else:
a = np.argmax(qa_vals)
else:
a = env.action_space.sample()
obs, reward, done, info = env.step(a)
old_obs = new_obs
new_obs = obs
if done:
break
if filled_once:
samples_ids = np.random.choice(experience.shape[0],batch_size)
loss += model.train_on_batch(experience[samples_ids,0:5],experience[samples_ids,5].reshape(-1))[0]
mean_rewards[e%20] = rewards
print("e = {} and loss = {}".format(e,loss))
if e % 50 == 0:
print("e = {} and mean = {}".format(e,mean_rewards.mean()))
Thanks in advance!
There shouldn't be much difference between feeding the actions as inputs to your network and having one output per action. It does make a huge difference if your states are images, for example, because conv nets work very well with images and there would be no obvious way of integrating the actions into the input.
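As a rough illustration of the two designs, here is a sketch with a toy linear model in plain Python. The weights are made-up, untrained numbers, and for a linear model the two forms happen to be exactly equivalent; with a neural network they differ in how parameters are shared, but the way the greedy action is selected is the same.

```python
def q_action_as_input(state, one_hot_action, weights):
    """Q(s, a) from a model fed with state and one-hot action: one call per action."""
    features = list(state) + list(one_hot_action)
    return sum(w * f for w, f in zip(weights, features))

def q_action_as_output(state, weights):
    """Q(s, .) from a model that returns one value per action in a single call."""
    shared = sum(w * s for w, s in zip(weights[:2], state))
    return [shared + bias for bias in weights[2:]]

ONE_HOT = {0: [1, 0, 0], 1: [0, 1, 0], 2: [0, 0, 1]}
weights = [0.4, -1.2, 0.1, 0.3, -0.2]   # hypothetical parameters
state = [-0.5, 0.02]                    # (position, velocity)

# Design from the question: three forward passes, then take the max.
q_three_calls = [q_action_as_input(state, ONE_HOT[a], weights) for a in range(3)]
# Standard design: one forward pass, then take the max over the outputs.
q_one_call = q_action_as_output(state, weights)

greedy_a = max(range(3), key=lambda a: q_three_calls[a])
greedy_b = max(range(3), key=lambda a: q_one_call[a])
```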
Have you tried the cartpole balancing environment? It is a better choice for checking whether your model is working correctly.
Mountain Car is pretty hard. The reward is the same (-1) on every step until you reach the top, which often doesn't happen at all, so the model only starts learning something useful once you have reached the top at least once. If you never get to the top, you should probably spend more time exploring, in other words take more random actions, a lot more.
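One common way to "take a lot more random actions" early on is to start with a large epsilon and anneal it. The sketch below is plain Python with made-up names and constants (epsilon_at, a linear schedule from 1.0 to 0.1 over 500 episodes); the right schedule for MountainCar is something you would have to tune.

```python
import random

def epsilon_at(episode, start=1.0, end=0.1, decay_episodes=500):
    """Linearly anneal epsilon from start to end over decay_episodes episodes."""
    fraction = min(episode / decay_episodes, 1.0)
    return start + fraction * (end - start)

def choose_action(q_values, epsilon, rng):
    """Epsilon-greedy: random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

rng = random.Random(0)
q_values = [0.1, 0.5, 0.2]   # hypothetical Q estimates; action 1 is greedy

early = [choose_action(q_values, epsilon_at(0), rng) for _ in range(1000)]
late = [choose_action(q_values, epsilon_at(1000), rng) for _ in range(1000)]

# Share of non-greedy actions: high at the start, low once epsilon has decayed.
non_greedy_early = sum(a != 1 for a in early) / len(early)
non_greedy_late = sum(a != 1 for a in late) / len(late)
```

In the code from the question this would replace the fixed epsilon_greedy = 0.1 with epsilon_at(e).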

Simple LSTM in PyTorch with Sequential module

In PyTorch, we can define architectures in multiple ways. Here, I'd like to create a simple LSTM network using the Sequential module.
In Lua's torch I would usually go with:
model = nn.Sequential()
model:add(nn.SplitTable(1,2))
model:add(nn.Sequencer(nn.LSTM(inputSize, hiddenSize)))
model:add(nn.SelectTable(-1)) -- last step of output sequence
model:add(nn.Linear(hiddenSize, classes_n))
However, in PyTorch, I can't find the equivalent of SelectTable to get the last output.
nn.Sequential(
nn.LSTM(inputSize, hiddenSize, 1, batch_first=True),
# what to put here to retrieve last output of LSTM ?,
nn.Linear(hiddenSize, classes_n))
Define a class to extract the last cell output:
# nn.LSTM returns a tuple: (output, (h_n, c_n))
class extract_tensor(nn.Module):
    def forward(self, x):
        # output has shape (batch, seq_len, hidden) because batch_first=True
        tensor, _ = x
        # keep only the last time step: shape (batch, hidden)
        return tensor[:, -1, :]

nn.Sequential(
    nn.LSTM(inputSize, hiddenSize, 1, batch_first=True),
    extract_tensor(),
    nn.Linear(hiddenSize, classes_n)
)
According to the nn.LSTM documentation, the output tensor has shape (seq_len, batch, hidden_size * num_directions) by default (batch_first=False), so you can easily take the last element of the sequence this way:
import torch
import torch.nn as nn

rnn = nn.LSTM(10, 20, 2)
input = torch.randn(5, 3, 10)  # (seq_len, batch, input_size)
h0 = torch.randn(2, 3, 20)
c0 = torch.randn(2, 3, 20)
output, (hn, cn) = rnn(input, (h0, c0))
print(output[-1])  # last element of the sequence, shape (batch, hidden_size)
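Note that the question uses batch_first=True, where the time axis is the second one. The following sketch uses nested Python lists as stand-ins for tensors (so no PyTorch is needed, and the helper names are my own) to show which index selects the last time step in each layout; with real tensors these correspond to output[-1] and output[:, -1, :].

```python
def last_step_time_major(output):
    """output[t][b] is the hidden vector at step t for item b: (seq_len, batch, hidden)."""
    return output[-1]

def last_step_batch_first(output):
    """output[b][t] is the hidden vector for item b at step t: (batch, seq_len, hidden)."""
    return [sequence[-1] for sequence in output]

seq_len, batch = 5, 3
# Each "hidden vector" is [t, b] so that its origin is easy to recognise.
time_major = [[[t, b] for b in range(batch)] for t in range(seq_len)]
batch_first = [[[t, b] for t in range(seq_len)] for b in range(batch)]

# Both calls return one vector per batch item, all taken from the last step.
from_time_major = last_step_time_major(time_major)
from_batch_first = last_step_batch_first(batch_first)
```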
Tensor manipulation and neural network design are much easier in PyTorch than in Torch, so you rarely have to use containers. In fact, as stated in the tutorial "PyTorch for former Torch users", PyTorch is built around autograd, so you no longer need to worry about containers. However, if you want to reuse your old Lua Torch code, you can have a look at the Legacy package.
As far as I know, there is nothing like a SplitTable or a SelectTable in PyTorch. That said, you are allowed to chain an arbitrary number of modules or blocks within a single architecture, and you can use this property to retrieve the output of a certain layer. Let's make it clearer with a simple example.
Suppose I want to build a simple two-layer MLP and retrieve the output of each layer. I can build a custom class inheriting from nn.Module:
from torch import nn

class MyMLP(nn.Module):
    def __init__(self, in_channels, out_channels_1, out_channels_2):
        # first of all, call the base class constructor
        super().__init__()
        # now I can build my modular network
        self.block1 = nn.Linear(in_channels, out_channels_1)
        self.block2 = nn.Linear(out_channels_1, out_channels_2)

    # you MUST implement a forward(input) method whenever inheriting from nn.Module
    def forward(self, x):
        # first_out will now be your output of the first block
        first_out = self.block1(x)
        x = self.block2(first_out)
        # by returning both x and first_out, you can now access the first layer's output
        return x, first_out
In your main file you can now declare the custom architecture and use it:
import torch
from myFile import MyMLP

in_ch = out_ch_1 = out_ch_2 = 64
# some fake input instance (nn.Linear expects a torch tensor, not a numpy array)
x = torch.rand(in_ch)
my_mlp = MyMLP(in_ch, out_ch_1, out_ch_2)
# get your outputs
final_out, first_layer_out = my_mlp(x)
Moreover, you could concatenate two MyMLP in a more complex model definition and retrieve the output of each one in a similar way.
I hope this is enough to clarify, but in case you have more questions, please feel free to ask, since I may have omitted something.