torch.nn.DataParallel with torch.autograd.grad in loss function fails - deep-learning

I have a neural network model that represents the surface of an object. For this to work, gradients are computed inside the loss function (for example, a signed distance field (SDF) has the property that its gradient always has unit length).
The loss function is the SDF loss from SIREN, defined as:
def sdf(model_output, gt):
    gt_sdf = gt['sdf']
    gt_normals = gt['normals']
    coords = model_output['model_in']
    pred_sdf = model_output['model_out'].to(torch.float32)
    gradient = diff_operators.gradient(pred_sdf, coords)
    # Wherever boundary_values is not equal to zero, we interpret it as a boundary constraint.
    sdf_constraint = torch.where(gt_sdf != -1, pred_sdf, torch.zeros_like(pred_sdf))
    inter_constraint = torch.where(gt_sdf != -1, torch.zeros_like(pred_sdf), torch.exp(-1e2 * torch.abs(pred_sdf)))
    normal_constraint = torch.where(gt_sdf != -1, 1 - F.cosine_similarity(gradient, gt_normals, dim=-1)[..., None],
                                    torch.zeros_like(gradient[..., :1]))
    grad_constraint = torch.abs(gradient.norm(dim=-1) - 1)
    return {'sdf': torch.abs(sdf_constraint).mean() * 3e3,
            'inter': inter_constraint.mean() * 1e2,
            'normal_constraint': normal_constraint.mean() * 1e2,
            'grad_constraint': grad_constraint.mean() * 5e1}
and the gradient calculation uses torch.autograd.grad:
def gradient(y, x, grad_outputs=None):
    if grad_outputs is None:
        grad_outputs = torch.ones_like(y)
    grad = torch.autograd.grad(y, [x], grad_outputs=grad_outputs, create_graph=True)[0]
    return grad
Now I want to parallelise the training with torch.nn.DataParallel, but I get the following error:
RuntimeError: One of the differentiated Tensors appears to not have been used in the graph. Set allow_unused=True if this is the desired behavior.
Is it possible to use torch.nn.DataParallel when gradients are computed in the loss function, and what do I need to change to make it work?

Looking at the documentation of nn.parallel.DistributedDataParallel:
This module doesn’t work with torch.autograd.grad() (i.e. it will only work if gradients are to be accumulated in .grad attributes of parameters).
It also recommends using torch.distributed.autograd.backward and torch.distributed.optim.DistributedOptimizer.
The documentation of torch.distributed also recommends the gloo backend:
Please notice that currently the only backend where all the functions are guaranteed to work is gloo.
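It may help to see what the error message from the question means in isolation. The sketch below is not the DataParallel setup itself: it assumes a sphere SDF in place of the model, and it uses a detached copy of the coordinates (coords_copy, a hypothetical name) as a stand-in for any tensor that is no longer connected to the output in the autograd graph.

```python
import torch

def gradient(y, x, grad_outputs=None):
    # Same helper as in the question.
    if grad_outputs is None:
        grad_outputs = torch.ones_like(y)
    return torch.autograd.grad(y, [x], grad_outputs=grad_outputs, create_graph=True)[0]

torch.manual_seed(0)
coords = torch.randn(4, 3, requires_grad=True)
# Sphere SDF of radius 0.5, standing in for the network output (assumption).
pred_sdf = coords.norm(dim=-1, keepdim=True) - 0.5

# coords produced pred_sdf, so autograd can differentiate with respect to it.
grad_ok = gradient(pred_sdf, coords)
unit_norms = grad_ok.norm(dim=-1)  # an exact SDF has unit-length gradients

# A copy with the same values but no link to pred_sdf in the graph.
coords_copy = coords.detach().clone().requires_grad_(True)
try:
    gradient(pred_sdf, coords_copy)
    error_message = None
except RuntimeError as exc:
    error_message = str(exc)

print(unit_norms)
print(error_message)
```

If the tensor passed as x is not the exact tensor the output was computed from, torch.autograd.grad raises this error, so checking which tensor ends up in model_output['model_in'] is a reasonable first debugging step.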

Related

Catboost custom loss function

I'm trying to implement a custom loss function. While investigating a drop in prediction quality, I noticed that a custom loss function performs worse (or at least differently) on cross-validation, even with the Logloss implementation provided as an example in the docs. I expected it to match the "native" CatBoost Logloss.
Here is the example I'm using:
https://catboost.ai/docs/concepts/python-usages-examples.html#user-defined-loss-function
import numpy as np
class LoglossObjective(object):
    def calc_ders_range(self, approxes, targets, weights):
        assert len(approxes) == len(targets)
        if weights is not None:
            assert len(weights) == len(approxes)
        result = []
        for index in range(len(targets)):
            e = np.exp(approxes[index])
            p = e / (1 + e)
            der1 = targets[index] - p
            der2 = -p * (1 - p)
            if weights is not None:
                der1 *= weights[index]
                der2 *= weights[index]
            result.append((der1, der2))
        return result
Can anyone explain why user-defined logloss is different from catboost "native" logloss? And how to make user-defined prediction quality as good as "native"?
Found an answer: with the "native" Logloss, CatBoostClassifier adjusts learning_rate automatically, whereas with a custom loss the default learning_rate is used. Hence the different results.
Setting learning_rate explicitly led to identical training results.
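To rule out the objective itself as the source of the difference, the derivatives in the docs example can be checked without CatBoost. The sketch below assumes the example differentiates the per-object log-likelihood with respect to the raw score; the function names are illustrative and not part of any library.

```python
import math

def log_likelihood(approx, target):
    # Per-object log-likelihood for a raw score `approx` and a 0/1 target.
    p = 1.0 / (1.0 + math.exp(-approx))
    return target * math.log(p) + (1.0 - target) * math.log(1.0 - p)

def analytic_ders(approx, target):
    # Same formulas as in LoglossObjective.calc_ders_range (weights omitted).
    e = math.exp(approx)
    p = e / (1 + e)
    return target - p, -p * (1 - p)

def numeric_ders(approx, target, h=1e-4):
    # Central finite differences for the first and second derivative.
    f_plus = log_likelihood(approx + h, target)
    f_mid = log_likelihood(approx, target)
    f_minus = log_likelihood(approx - h, target)
    return (f_plus - f_minus) / (2 * h), (f_plus - 2 * f_mid + f_minus) / (h * h)

checks = []
for approx, target in [(0.3, 1.0), (-1.2, 0.0), (2.0, 1.0)]:
    a1, a2 = analytic_ders(approx, target)
    n1, n2 = numeric_ders(approx, target)
    checks.append((abs(a1 - n1), abs(a2 - n2)))
    print(approx, target, a1, n1, a2, n2)
```

The analytic and numeric values agree closely, which is consistent with the finding above that the difference came from learning_rate rather than from the loss.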

How to define a custom loss function for a multi-dimensional target

I'm working with TensorFlow 2.0 and I'm using a plain Sequential model.
I'm trying to define a custom loss function that does the following:
takes some elements of the input
computes their sum and inverts the result
multiplies the result with a part of y_pred
constrains the result to be as close to 1 as possible
Thus the loss function would be L() = MSE() + (the penalty described above)
My code follows:
import tensorflow as tf
from tensorflow.keras import backend as K
def custom_loss_wrapper(input_train):
    #tf.function
    def summing(row):
        return tf.math.reduce_sum(row, 1, keepdims=True)
    #tf.function
    def custom_loss(y_true, y_pred):
        row_M = input_train
        row_M = row_M[:, 2:5]
        sum_M = summing(row_M)
        inv_M = (1/sum_M)
        row_B = y_pred[:, :3]
        sum_B = summing(row_B)
        sum_Q = tf.math.multiply(inv_M, sum_B)
        alpha = 0.01
        penalty = K.mean(K.square(sum_Q - 1))
        return K.mean(K.square(y_true - y_pred)) + (1/alpha) * penalty
    return custom_loss
I would like to understand whether what I'm doing is right. There are no errors and the training runs, but I do not know whether this code does what I intend. In particular, does it work correctly on batches of data rather than single records?

Understanding log_prob for Normal distribution in pytorch

I'm currently trying to solve Pendulum-v0 from the OpenAI Gym environments, which has a continuous action space. As a result, I need to use a Normal distribution to sample my actions. What I don't understand is the dimension of log_prob when using it:
import torch
from torch.distributions import Normal
means = torch.tensor([[0.0538],
[0.0651]])
stds = torch.tensor([[0.7865],
[0.7792]])
dist = Normal(means, stds)
a = torch.tensor([1.2,3.4])
d = dist.log_prob(a)
print(d.size())
I was expecting a tensor of size 2 (one log_prob for each action), but it outputs a tensor of size (2, 2).
However, when using a Categorical distribution for a discrete environment, log_prob has the expected size:
from torch.distributions import Categorical
logits = torch.tensor([[-0.0657, -0.0949],
                       [-0.0586, -0.1007]])
dist = Categorical(logits=logits)
a = torch.tensor([1, 1])
print(dist.log_prob(a).size())
gives me a tensor of size (2).
Why does log_prob for the Normal distribution have a different size?
If one takes a look in the source code of torch.distributions.Normal and finds the definition of the log_prob(value) function, one can see that the main part of the calculation is:
return -((value - self.loc) ** 2) / (2 * var) - some other part
where value is a variable containing the values for which you want to calculate the log probability (in your case, a), self.loc is the mean of the distribution (in your case, means), and var is the variance, that is, the square of the standard deviation (in your case, stds**2). This is indeed the logarithm of the probability density function of the normal distribution, minus some constants and the logarithm of the standard deviation, which I have omitted above.
In the first example, you define means and stds as column vectors, while the values form a row vector:
means = torch.tensor([[0.0538],
[0.0651]])
stds = torch.tensor([[0.7865],
[0.7792]])
a = torch.tensor([1.2,3.4])
But subtracting a column vector from a row vector, which the code does in value - self.loc, broadcasts to a matrix (try it!). The result you obtain is therefore a log_prob value for each of your two distributions and for each of the values in a.
If you want to obtain a log_prob without the cross terms, then define the variables consistently, i.e., either
means = torch.tensor([[0.0538],
[0.0651]])
stds = torch.tensor([[0.7865],
[0.7792]])
a = torch.tensor([[1.2],[3.4]])
or
means = torch.tensor([0.0538,
0.0651])
stds = torch.tensor([0.7865,
0.7792])
a = torch.tensor([1.2,3.4])
This is what you do in your second example, which is why you obtain the result you expected.
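The broadcasting described above can be seen directly by comparing shapes. This is a minimal sketch using the values from the question; reshaping a with unsqueeze is one possible way to align the shapes, not the only one.

```python
import torch
from torch.distributions import Normal

means = torch.tensor([[0.0538], [0.0651]])  # shape (2, 1)
stds = torch.tensor([[0.7865], [0.7792]])   # shape (2, 1)
a = torch.tensor([1.2, 3.4])                # shape (2,)

dist = Normal(means, stds)

# (2, 1) parameters against a (2,) value broadcast to (2, 2):
# entry [i, j] is the log density of a[j] under distribution i.
cross = dist.log_prob(a)

# Giving `a` the same (2, 1) shape keeps one value per distribution.
per_action = dist.log_prob(a.unsqueeze(-1))  # shape (2, 1)
flat = per_action.squeeze(-1)                # shape (2,)

print(cross.shape, per_action.shape, flat.shape)
```

The per-action values are the diagonal of the (2, 2) result, so the extra entries are the cross terms between each distribution and the other action.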

Tensor shape mismatch error in PyTorch on MNIST dataset, but no error on synthetic data

I am trying to implement a deep learning paper (https://github.com/kiankd/corel2019) and I get a weird error when supplying real data (MNIST) to it, but no error when using the same synthetic data as the authors used.
The error happens in this function:
def get_armask(shape, labels, device=None):
    mask = torch.zeros(shape).to(device)
    arr = torch.arange(0, shape[0]).long().to(device)
    mask[arr, labels] = -1.
    return mask
More specifically this line:
mask[arr, labels] = -1.
The error is:
RuntimeError: The shape of the mask [500] at index 0 does not match the shape of the indexed tensor [500, 10] at index 1
The weird thing is, that if I use the synthetic data, there is no error and it works perfectly. If I print out the shapes, I get the following (both with synthetic data and with MNIST):
mask torch.Size([500, 10])
arr torch.Size([500])
labels torch.Size([500])
The code used to generate the synthetic data is the following:
X_data = (torch.rand(N_samples, D_input) * 10.).to(device)
labels = torch.LongTensor([i % N_classes for i in range(N_samples)]).to(device)
While the code to load MNIST is this:
train_images = mnist.train_images()
X_data_all = train_images.reshape((train_images.shape[0], train_images.shape[1] * train_images.shape[2]))
X_data = torch.tensor(X_data_all[:500,:]).to(device)
X_data = X_data.type(torch.FloatTensor)
labels = torch.tensor(mnist.train_labels()[:500]).to(device)
get_armask is used the following way:
def forward(self, predictions, labels):
    mask = get_armask(predictions.shape, labels, device=self.device)
    # make the attractor and repulsor, mask them!
    attraction_tensor = mask * predictions
    repulsion_tensor = (mask + 1) * predictions
    # now, apply the special cosine-COREL rules, taking the argmax and squaring the repulsion
    repulsion_tensor, _ = repulsion_tensor.max(dim=1)
    repulsion_tensor = repulsion_tensor ** 2
    return arloss(attraction_tensor, repulsion_tensor, self.lam)
The actual error seems to be different from what is in the error message, but I have no idea where to look. I tried a few things, like changing the learning rate, normalizing the MNIST data to be more or less in the same range as the test data but nothing seems to work.
Any suggestions? Thanks a lot in advance!
After exchanging some emails with the author of the paper, we figured out the problem: the labels were of type Byte instead of Long. PyTorch at the time treated a Byte (uint8) index tensor as a mask rather than as integer indices, which is why the message complains about mask shapes. The error message is very misleading; the actual problem has nothing to do with the sizes.
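A minimal sketch of the fix, assuming the labels arrive as a uint8 tensor (as they would when built from a uint8 NumPy array): casting them to long before indexing makes them act as column indices. The device handling is simplified compared with the original function.

```python
import torch

def get_armask(shape, labels, device=None):
    mask = torch.zeros(shape, device=device)
    arr = torch.arange(shape[0], device=device)
    # Cast to long so labels are used as integer column indices.
    mask[arr, labels.long()] = -1.0
    return mask

# Hypothetical uint8 labels for four samples and ten classes.
labels_uint8 = torch.tensor([3, 0, 9, 1], dtype=torch.uint8)
mask = get_armask((4, 10), labels_uint8)

print(labels_uint8.dtype)
print(mask)
```

Converting once at load time, for example by calling .long() on the label tensor right after torch.tensor(...), has the same effect and leaves get_armask unchanged.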

Theano - how to override gradient for part of op graph

I have a rather complex model at hand. The model has multiple parts with a linear structure:
y = theano.tensor.dot(W,x) + b
I want to build an optimizer that uses a custom rule to compute the gradient for all linear structures, while keeping other operations intact. What's the easiest way to override gradient ops for all linear parts of my model? Preferably without writing a new Op.
So, I spent some time working on a PR (already merged; it was unmerged as of Jan 13 2017) for Theano, which gives users the ability to partially override the gradient of a theano.OpFromGraph instance. The override is done with a symbolic graph, so you still get the full benefit of Theano's optimizations.
Typical use cases:
Numerical safety consideration
Rescale/clipping gradient
Specialized gradient routine like Riemannian natural gradient
To make an Op with an overridden gradient:
Make the needed compute graph
Make an OpFromGraph instance (or a Python function) for the gradient of your Op
Make an OfG instance of your Op, and set the grad_overrides argument
Call the OfG instance to build your model
Defining an OpFromGraph is like compiling a Theano function, with some differences:
No support for updates and givens (as of Jan 2017)
You get a symbolic Op instead of a numerical function
Example:
'''
This creates an atan2_safe Op with smoothed gradient at (0,0)
'''
import theano as th
import theano.tensor as T
# Turn this on if you want theano to build one large graph for your model instead of precompiling the small graph.
USE_INLINE = False
# In a real case you would set EPS to a much smaller value
EPS = 0.01
# define a graph for needed Op
s_x, s_y = T.scalars('xy')
s_darg = T.scalar()  # backpropagated gradient
s_arg = T.arctan2(s_y, s_x)
s_abs2 = T.sqr(s_x) + T.sqr(s_y) + EPS
s_dx = -s_y / s_abs2
s_dy = s_x / s_abs2
# construct OfG with gradient overrides
# NOTE: there are unused inputs in the gradient expression,
# however the input count must match, so we pass
# on_unused_input='ignore'
atan2_safe_grad = th.OpFromGraph([s_x, s_y, s_darg], [s_dx, s_dy], inline=USE_INLINE, on_unused_input='ignore')
atan2_safe = th.OpFromGraph([s_x, s_y], [s_arg], inline=USE_INLINE, grad_overrides=atan2_safe_grad)
# build graph using the new Op
x, y = T.scalar(), T.scalar()
arg = atan2_safe(x, y)
dx, dy = T.grad(arg, [x, y])
fn = th.function([x, y], [dx, dy])
fn(1., 0.) # gives approximately [-0.0, 0.990099]
fn(0., 0.) # gives [0.0, 0.0], no more annoying nan!
NOTE: the theano.OpFromGraph is still largely experimental, expect bugs.
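The arithmetic behind the two outputs in the example can be reproduced without Theano. The sketch below only evaluates the smoothed gradient expressions (s_dx, s_dy) in plain Python; it does not use or emulate OpFromGraph, and the function names are illustrative.

```python
EPS = 0.01  # same smoothing constant as in the example above

def atan2_grad_smoothed(x, y, eps=EPS):
    # Gradient of atan2(y, x) w.r.t. (x, y), with eps added to the denominator.
    denom = x * x + y * y + eps
    return -y / denom, x / denom

def atan2_grad_exact(x, y):
    # Unsmoothed gradient; the denominator is zero at the origin.
    denom = x * x + y * y
    return -y / denom, x / denom

at_unit = atan2_grad_smoothed(1.0, 0.0)
at_origin = atan2_grad_smoothed(0.0, 0.0)

# Plain Python floats raise on 0/0, where an array library would return nan.
try:
    atan2_grad_exact(0.0, 0.0)
    exact_fails_at_origin = False
except ZeroDivisionError:
    exact_fails_at_origin = True

print(at_unit, at_origin, exact_fails_at_origin)
```

At (1, 0) the y-component is 1 / 1.01, slightly below the exact value of 1, which is the price of the smoothing; a smaller EPS narrows that gap.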