Understanding log_prob for Normal distribution in pytorch - reinforcement-learning

I'm currently trying to solve Pendulum-v0 from the OpenAI Gym environment, which has a continuous action space. As a result, I need to use a normal distribution to sample my actions. What I don't understand is the dimension of log_prob when using it:
import torch
from torch.distributions import Normal
means = torch.tensor([[0.0538],
                      [0.0651]])
stds = torch.tensor([[0.7865],
                     [0.7792]])
dist = Normal(means, stds)
a = torch.tensor([1.2,3.4])
d = dist.log_prob(a)
print(d.size())
I was expecting a tensor of size 2 (one log_prob for each action), but it outputs a tensor of size (2, 2).
However, when using a Categorical distribution for a discrete environment, the log_prob has the expected size:
logits = torch.tensor([[-0.0657, -0.0949],
                       [-0.0586, -0.1007]])
dist = Categorical(logits = logits)
a = torch.tensor([1, 1])
print(dist.log_prob(a).size())
gives me a tensor of size (2).
Why is the log_prob for the Normal distribution of a different size?

If one takes a look at the source code of torch.distributions.Normal and finds the definition of the log_prob(value) function, one can see that the main part of the calculation is:
return -((value - self.loc) ** 2) / (2 * var) - some other part
where value is a variable containing the values for which you want to calculate the log probability (in your case, a), self.loc is the mean of the distribution (in your case, means) and var is the variance, that is, the square of the standard deviation (in your case, stds**2). One can see that this is indeed the logarithm of the probability density function of the normal distribution, minus some constants and the logarithm of the standard deviation, which I omit above.
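For reference, here is a minimal sketch (not the library source itself; the variable names are my own) of the full log-density written out explicitly, which you can compare against dist.log_prob:
import math
import torch
from torch.distributions import Normal

means = torch.tensor([[0.0538], [0.0651]])
stds = torch.tensor([[0.7865], [0.7792]])
a = torch.tensor([1.2, 3.4])

# log N(a; mean, std) = -(a - mean)^2 / (2 * var) - log(std) - 0.5 * log(2 * pi)
var = stds ** 2
manual = -((a - means) ** 2) / (2 * var) - torch.log(stds) - 0.5 * math.log(2 * math.pi)

dist = Normal(means, stds)
print(torch.allclose(manual, dist.log_prob(a)))  # True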
In the first example, you define means and stds as column vectors, while the values form a row vector:
means = torch.tensor([[0.0538],
                      [0.0651]])
stds = torch.tensor([[0.7865],
                     [0.7792]])
a = torch.tensor([1.2,3.4])
But subtracting a row vector from a column vector, which is what the code does in value - self.loc, broadcasts to a matrix (try it!). Thus the result you obtain is a log_prob value for each of your two defined distributions and for each of the values in a.
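A quick sketch of that broadcasting behaviour (the variable names here are only for illustration):
import torch
col = torch.tensor([[0.0538], [0.0651]])  # shape (2, 1), like means
row = torch.tensor([1.2, 3.4])            # shape (2,), like a
print((row - col).size())                 # torch.Size([2, 2]): every value minus every mean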
If you want to obtain a log_prob without the cross terms, then define the variables consistently, i.e., either
means = torch.tensor([[0.0538],
                      [0.0651]])
stds = torch.tensor([[0.7865],
                     [0.7792]])
a = torch.tensor([[1.2],[3.4]])
or
means = torch.tensor([0.0538,
                      0.0651])
stds = torch.tensor([0.7865,
                     0.7792])
a = torch.tensor([1.2,3.4])
This is what you do in your second example, which is why you obtain the result you expected.
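As a quick check, both consistent variants now give one log-probability per action, with shapes (2, 1) and (2,) respectively:
import torch
from torch.distributions import Normal

# column-vector variant: one action per row
dist = Normal(torch.tensor([[0.0538], [0.0651]]), torch.tensor([[0.7865], [0.7792]]))
print(dist.log_prob(torch.tensor([[1.2], [3.4]])).size())  # torch.Size([2, 1])

# flat-vector variant, as in the Categorical example
dist = Normal(torch.tensor([0.0538, 0.0651]), torch.tensor([0.7865, 0.7792]))
print(dist.log_prob(torch.tensor([1.2, 3.4])).size())      # torch.Size([2])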

Related

Can gtsummary be used to predict an ordinal variable (several predictors of all kinds in one model), adjusted for confounding factors

I am trying to build a prediction model for an ordinal variable. I know that the MASS::polr() function targets this issue, but I want to present it in a more approachable way. I thought the gtsummary package may be suitable.
My code:
reg_tb <- tbl_uvregression(
  reg_df,
  include = c(a, b, c, d),
  method = polr,
  y = e,
  exponentiate = TRUE,
  pvalue_fun = ~style_pvalue(.x, digits = 2))
I know that tbl_uvregression() fits univariate models, but under 'method =' I used the 'polr' option. I suspect polr can't be used in tbl_uvregression() to build an adjusted prediction model, because after including 15 predictors they all remained significant when running the model (which is not reasonable, since several of the factors are strongly associated with each other).

torch.nn.DataParallel with torch.autograd.grad in loss function fails

I have a neural network model that represents the surface of an object. For this to work, the gradients are calculated in the loss function (because for example it's a property of signed distance fields (sdfs) that the gradient is always unit length).
The loss function is the one from SIREN for sdfs and defined as
def sdf(model_output, gt):
    gt_sdf = gt['sdf']
    gt_normals = gt['normals']
    coords = model_output['model_in']
    pred_sdf = model_output['model_out'].to(torch.float32)
    gradient = diff_operators.gradient(pred_sdf, coords)
    # Wherever boundary_values is not equal to zero, we interpret it as a boundary constraint.
    sdf_constraint = torch.where(gt_sdf != -1, pred_sdf, torch.zeros_like(pred_sdf))
    inter_constraint = torch.where(gt_sdf != -1, torch.zeros_like(pred_sdf), torch.exp(-1e2 * torch.abs(pred_sdf)))
    normal_constraint = torch.where(gt_sdf != -1, 1 - F.cosine_similarity(gradient, gt_normals, dim=-1)[..., None],
                                    torch.zeros_like(gradient[..., :1]))
    grad_constraint = torch.abs(gradient.norm(dim=-1) - 1)
    return {'sdf': torch.abs(sdf_constraint).mean() * 3e3,
            'inter': inter_constraint.mean() * 1e2,
            'normal_constraint': normal_constraint.mean() * 1e2,
            'grad_constraint': grad_constraint.mean() * 5e1}
and the gradient calculation uses torch.autograd.grad:
def gradient(y, x, grad_outputs=None):
    if grad_outputs is None:
        grad_outputs = torch.ones_like(y)
    grad = torch.autograd.grad(y, [x], grad_outputs=grad_outputs, create_graph=True)[0]
    return grad
Now I wanted to parallelise the training by implementing torch.nn.DataParallel. I get the following error:
RuntimeError: One of the differentiated Tensors appears to not have been used in the graph. Set allow_unused=True if this is the desired behavior.
Is it possible to use torch.nn.DataParallel with gradient calculation in the loss function and what do I need to change to make it work?
Looking at the documentation of nn.parallel.DistributedDataParallel:
This module doesn’t work with torch.autograd.grad() (i.e. it will only work if gradients are to be accumulated in .grad attributes of parameters).
It also recommends using torch.distributed.autograd.backward and torch.distributed.optim.DistributedOptimizer.
Also, the documentation of torch.distributed recommends using the gloo backend:
Please notice that currently the only backend where all the functions are guaranteed to work is gloo.
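As an aside, the error message's own suggestion (allow_unused=True) can be illustrated with a slightly modified gradient helper. This is only a sketch that makes the failure explicit rather than a fix: a None gradient means the coordinates tensor was not part of the graph that produced the replica's output.
import torch

def gradient(y, x, grad_outputs=None):
    if grad_outputs is None:
        grad_outputs = torch.ones_like(y)
    # allow_unused=True returns None instead of raising when x is not in y's graph
    grad = torch.autograd.grad(y, [x], grad_outputs=grad_outputs,
                               create_graph=True, allow_unused=True)[0]
    if grad is None:
        # x was never used to compute y (e.g. it is the pre-scatter tensor rather
        # than the replica's copy), so there is no gradient to return
        raise RuntimeError("x is not part of the graph that produced y")
    return grad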

About the Softmax function as output layer in predictions

I know the softmax activation function: the sum of the output layer with a softmax activation is always equal to one, that is, the output vector is normalized. This is also necessary because the maximum accumulated probability cannot exceed one. OK, this is clear.
But my question is the following: when softmax is used as a classifier, the argmax function is used to get the index of the class. So, what is the difference between obtaining an accumulated probability of one or higher, if the important parameter is the index that selects the correct class?
An example in Python, where I made another softmax (it is really not a softmax function), but the classifier works in the same way as the one with the real softmax function:
import numpy as np
classes = 10
classes_list = ['dog', 'cat', 'monkey', 'butterfly', 'donkey',
                'horse', 'human', 'car', 'table', 'bottle']
# This simulates an NN with its weights and the previous
# layer with a ReLU activation
a = np.random.normal(0, 0.5, (classes, 512))  # Output from previous layer
w = np.random.normal(0, 0.5, (512, 1))        # weights
b = np.random.normal(0, 0.5, (classes, 1))    # bias
# correct solution:
def softmax(a, w, b):
    a = np.maximum(a, 0)  # ReLU simulation
    x = np.matmul(a, w) + b
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0), np.argsort(e_x.flatten())[::-1]
# approx solution (probability is higher than one):
def softmax_app(a, w, b):
    a = np.maximum(a, 0)  # ReLU simulation
    w_exp = np.exp(w)
    coef = np.sum(w_exp)
    matmul = np.exp(np.matmul(a, w) + b)
    res = matmul / coef
    return res, np.argsort(res.flatten())[::-1]
teor = softmax(a, w, b)
approx = softmax_app(a, w, b)
class_teor = classes_list[teor[-1][0]]
class_approx = classes_list[approx[-1][0]]
print(np.array_equal(teor[-1], approx[-1]))
print(class_teor == class_approx)
The class obtained by both methods is always the same (I'm talking about predictions, not training). I ask this because I'm implementing the softmax on an FPGA device, and with the second method I don't need two passes to calculate the softmax function: one to compute the exponentiated matrix and its sum, and a second one to perform the division.
Let's review the uses of softmax:
You should use softmax if:
You are training a NN and want to limit the range of output values during training (you could use other activation functions instead). This can marginally help towards clipping the gradient.
You are performing inference on a NN and you want to obtain a metric on the "degree of confidence" of your classification result (in the range of 0-1).
You are performing inference on a NN and wish to get the top K results. In this case it is recommended as a way to have a "degree of confidence" metric to compare them.
You are performing inference on several NNs (ensemble methods) and wish to average them out (otherwise their results wouldn't be easily comparable).
You should not use (or remove) softmax if:
You are performing inference on a NN and you only care about the top class. Note that the NN could have been trained with Softmax (for better accuracy, faster convergence, etc..).
In your case, your insight is right: softmax as an activation function in the last layer is meaningless if your problem only requires you to get the index of the maximum value during the inference phase. Besides, since you are targeting an FPGA implementation, this would only give you extra headaches.
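To make that concrete, here is a small sketch (the names are my own) showing that softmax never changes which index wins, since it is a strictly increasing transform of the logits:
import numpy as np

def softmax(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

logits = np.random.normal(0, 1, 10)           # raw outputs of the last layer
probs = softmax(logits)

print(np.argmax(logits) == np.argmax(probs))  # True: the top class is unchanged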

Initialisation of weights for a deep learning model

I am going through a book on deep learning which initializes weights between two layers of neurons as:
w = np.random.randn(layers[i] + 1, layers[i + 1] + 1)
self.W.append(w / np.sqrt(layers[i]))
As per the book, the division by np.sqrt(layers[i]) in the second line of code is done for the following reason:
scale w by dividing by the square root of the number of nodes in the current layer, thereby
normalizing the variance of each neuron’s output
What does it exactly mean? And how would it impact if we don't do it?
Weight initialization is very important to tackle vanishing/exploding gradients. For the outputs (and the gradients flowing in the reverse direction) to propagate properly, the variance of each layer's outputs should equal the variance of its inputs, and likewise for the gradients. The number of input and output connections of a layer are called its fan-in and fan-out.
To better explain what I mean, let me give you an example. Assume that we have a hundred consecutive layers and we apply a feed-forward calculation with linear activation (after all, it is just matrix multiplication); the data is 500 samples of 100 features:
neurons, features = 100, 100
n_layers = 100
X = np.random.normal(size=(500, features))  # your input
mean, var = 0, 0
for layer in range(n_layers):
    W = np.random.normal(size=(features, neurons))
    X = np.dot(X, W)
    mean = mean + X.mean()
    var = var + X.var()
mean/n_layers, np.sqrt(var/n_layers)
# output:
(-4.055498760574568e+95, 8.424477240271639e+98)
You will see that it has a huge mean and standard deviation. Let's break this problem down: a property of matrix multiplication is that the result has a standard deviation very close to the square root of the number of fan-in (input) connections. This property can be verified with this snippet of code:
fan_in = 1000 # change it to any number
X = np.random.normal(size=(100, fan_in))
W = np.random.normal(size=(fan_in, 1))
np.dot(X, W).std()
# result:
32.764359213560454
This happens because we sum fan_in (1000 in the above case) products of elements of the inputs X with one column of W. Therefore, we scale every weight by 1/sqrt(fan_in) to maintain the distribution of the flow, as seen in the following snippet:
neurons, features = 100, 100
n_layers = 100
X = np.random.normal(size=(500, features))  # your input
mean, var = 0, 0
for layer in range(n_layers):
    W = np.random.normal(size=(features, neurons), scale=np.sqrt(1 / neurons))  # scaled the weights with the fan-in
    X = np.dot(X, W)
    mean = mean + X.mean()
    var = var + X.var()
mean/n_layers, np.sqrt(var/n_layers)
# output:
(0.0002608301398189543, 1.021452570914829)
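Tying this back to the code in the question, dividing np.random.randn weights by np.sqrt(layers[i]) is exactly this 1/sqrt(fan_in) scaling; a quick sanity check with made-up layer sizes:
import numpy as np

fan_in, fan_out = 64, 32              # made-up layer sizes for illustration
X = np.random.randn(500, fan_in)      # unit-variance inputs

w = np.random.randn(fan_in, fan_out)  # book-style initialization ...
w = w / np.sqrt(fan_in)               # ... scaled by 1/sqrt(fan_in)

print(np.dot(X, w).std())                                 # close to 1.0
print(np.dot(X, np.random.randn(fan_in, fan_out)).std())  # close to sqrt(64) = 8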
You can read more about kernel initialization in the following blog

Defining a seed value in BrainScript for CNTK sequential machine learning models

This is with respect to CNTK BrainScript. I went through [1] to figure out whether there is an option to specify the random seed value, but I couldn't find any. (Yes, there is an option to set the 'random seed' parameter through the ParameterTensor() function, but if I followed that approach I might have to explicitly initialize all the LSTM weights separately, defining separate weights for the input gate, forget gate, etc., instead of using the model sequence below.) Is there any other option available to set the random seed value while preserving the following RNN layered sequence?
nn_Train = {
    action = train
    BrainScriptNetworkBuilder = {
        model = Sequential (
            RecurrentLSTMLayer {$stateDim$, usePeepholes = true}:
            DenseLayer {$labelDim$, bias=false}
        )
        z = model (inputs)
        inputs = Input($inputDim$) # features
        labels = Input($labelDim$)
        # loss and metric
        ce = SquareError(labels, z)
        # node assignment
        featureNodes = (inputs)
        labelNodes = (labels)
        criterionNodes = (ce)
        evaluationNodes = (ce)
        outputNodes = (z)
    }
}
[1] https://github.com/microsoft/cntk/wiki/Parameters-And-Constants#random-initialization
Unfortunately, there isn't a global random seed option for parameters. However, you can modify the cntk.core.bs file next to cntk.exe, where all the layers are defined, to support a random seed for the layers you want.