Theano - how to override gradient for part of op graph - deep-learning

I have a rather complex model at hand. The model has multiple parts with a linear structure:
y = theano.tensor.dot(W, x) + b
I want to build an optimizer that uses a custom rule to compute the gradient for all of the linear structures, while keeping the other operations intact. What's the easiest way to override the gradient for all of the linear parts of my model? Preferably without having to write a new Op.

So, I spent some time working on a PR for Theano (already merged as of Jan 13 2017), which gives the user the ability to partially override the gradient of a theano.OpFromGraph instance. The override is done with a symbolic graph, so you still gain the full benefit of Theano's graph optimizations.
Typical use cases:
Numerical safety considerations
Rescaling/clipping the gradient
Specialized gradient routines like the Riemannian natural gradient
To make an Op with an overridden gradient:
Make the compute graph you need
Make an OpFromGraph instance (or a python function) for the gradient of your Op
Make an OfG instance for your Op, and set the grad_overrides argument
Call the OfG instance to build your model
Defining an OpFromGraph is like compiling a theano function, with some differences:
No support for updates and givens (as of Jan 2017)
You get a symbolic Op instead of a numerical function
Example:
'''
This creates an atan2_safe Op with smoothed gradient at (0,0)
'''
import theano as th
import theano.tensor as T
# Turn this on if you want theano to build one large graph for your model instead of precompiling the small graph.
USE_INLINE = False
# In a real case you would set EPS to a much smaller value
EPS = 0.01
# define a graph for needed Op
s_x, s_y = T.scalars('xy')
s_darg = T.scalar()  # backpropagated gradient
s_arg = T.arctan2(s_y, s_x)
s_abs2 = T.sqr(s_x) + T.sqr(s_y) + EPS
s_dx = -s_y / s_abs2
s_dy = s_x / s_abs2
# construct OfG with gradient overrides
# NOTE: there are unused inputs in the gradient expression,
# however the input count must match, so we pass
# on_unused_input='ignore'
atan2_safe_grad = th.OpFromGraph([s_x, s_y, s_darg], [s_dx, s_dy], inline=USE_INLINE, on_unused_input='ignore')
atan2_safe = th.OpFromGraph([s_x, s_y], [s_arg], inline=USE_INLINE, grad_overrides=atan2_safe_grad)
# build graph using the new Op
x, y = T.scalar(), T.scalar()
arg = atan2_safe(x, y)
dx, dy = T.grad(arg, [x, y])
fn = th.function([x, y], [dx, dy])
fn(1., 0.) # gives [-0.0, 0.990099]
fn(0., 0.) # gives [0.0, 0.0], no more annoying nan!
NOTE: theano.OpFromGraph is still largely experimental; expect bugs.
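Applied to the linear structure from the question, the same pattern looks roughly like this. This is only a sketch: the 0.5 rescaling stands in for whatever custom rule you actually want (e.g. a natural-gradient correction), and I assume shapes W: (m, n), x: (n,), b: (m,):
import numpy as np
import theano as th
import theano.tensor as T

# Graph of the linear part y = W.x + b
s_W = T.matrix('W')
s_x = T.vector('x')
s_b = T.vector('b')
s_y = T.dot(s_W, s_x) + s_b

# Backpropagated gradient w.r.t. the output y
s_dy = T.vector('dy')

# Hypothetical custom rule: here just a rescaling of the true gradients.
# Replace these three expressions with your own rule.
SCALE = 0.5
s_dW = SCALE * T.outer(s_dy, s_x)
s_dx = SCALE * T.dot(s_W.T, s_dy)
s_db = SCALE * s_dy

# s_b is unused in the gradient graph, hence on_unused_input='ignore'
linear_grad = th.OpFromGraph([s_W, s_x, s_b, s_dy], [s_dW, s_dx, s_db],
                             on_unused_input='ignore')
linear = th.OpFromGraph([s_W, s_x, s_b], [s_y], grad_overrides=linear_grad)

# Use the new Op in a model; only the linear part gets the custom gradient.
W = th.shared(np.random.randn(3, 4).astype(th.config.floatX), name='W')
b = th.shared(np.zeros(3, dtype=th.config.floatX), name='b')
x = T.vector('x')
y = linear(W, x, b)
loss = T.sum(T.sqr(y))
gW, gb = T.grad(loss, [W, b])
fn = th.function([x], [gW, gb])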

Related

torch.nn.DataParallel with torch.autograd.grad in loss function fails

I have a neural network model that represents the surface of an object. For this to work, the gradients are calculated in the loss function (because, for example, it is a property of signed distance fields (SDFs) that the gradient is always of unit length).
The loss function is the one from SIREN for SDFs and is defined as:
import torch
import torch.nn.functional as F

def sdf(model_output, gt):
    gt_sdf = gt['sdf']
    gt_normals = gt['normals']
    coords = model_output['model_in']
    pred_sdf = model_output['model_out'].to(torch.float32)
    gradient = diff_operators.gradient(pred_sdf, coords)
    # Wherever boundary_values is not equal to zero, we interpret it as a boundary constraint.
    sdf_constraint = torch.where(gt_sdf != -1, pred_sdf, torch.zeros_like(pred_sdf))
    inter_constraint = torch.where(gt_sdf != -1, torch.zeros_like(pred_sdf), torch.exp(-1e2 * torch.abs(pred_sdf)))
    normal_constraint = torch.where(gt_sdf != -1, 1 - F.cosine_similarity(gradient, gt_normals, dim=-1)[..., None],
                                    torch.zeros_like(gradient[..., :1]))
    grad_constraint = torch.abs(gradient.norm(dim=-1) - 1)
    return {'sdf': torch.abs(sdf_constraint).mean() * 3e3,
            'inter': inter_constraint.mean() * 1e2,
            'normal_constraint': normal_constraint.mean() * 1e2,
            'grad_constraint': grad_constraint.mean() * 5e1}
and the gradient calculation uses torch.autograd.grad:
def gradient(y, x, grad_outputs=None):
    if grad_outputs is None:
        grad_outputs = torch.ones_like(y)
    grad = torch.autograd.grad(y, [x], grad_outputs=grad_outputs, create_graph=True)[0]
    return grad
Now I wanted to parallelise the training by implementing torch.nn.DataParallel. I get the following error:
RuntimeError: One of the differentiated Tensors appears to not have been used in the graph. Set allow_unused=True if this is the desired behavior.
Is it possible to use torch.nn.DataParallel with gradient calculation in the loss function, and what do I need to change to make it work?
Looking at the documentation of nn.parallel.DistributedDataParallel:
This module doesn’t work with torch.autograd.grad() (i.e. it will only work if gradients are to be accumulated in .grad attributes of parameters).
It also recommends using torch.distributed.autograd.backward and torch.distributed.optim.DistributedOptimizer.
The documentation of torch.distributed also recommends using the gloo backend:
Please notice that currently the only backend where all the functions are guaranteed to work is gloo.
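To make the quoted restriction concrete, here is a tiny standalone illustration (not tied to the SIREN code above) of the two gradient APIs: the functional torch.autograd.grad, which the DistributedDataParallel docs rule out, versus backward(), which accumulates into .grad and is what the parallel wrappers hook into:
import torch

x = torch.randn(4, requires_grad=True)

# Functional form: gradients are returned directly and nothing is written
# to x.grad. This is the form DistributedDataParallel does not support.
y = (x ** 2).sum()
(gx,) = torch.autograd.grad(y, [x], create_graph=True)
print(gx, x.grad)   # x.grad is still None

# Accumulating form: gradients land in the .grad attribute, which is what
# DataParallel / DistributedDataParallel expect to work with.
y2 = (x ** 2).sum()
y2.backward()
print(x.grad)       # now populated with 2 * x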

About the Softmax function as output layer in predictions

I know the softmax activation function: the sum of the output layer with a softmax activation is always equal to one, that is, the output vector is normalized. This is also necessary because the maximum accumulated probability cannot exceed one. OK, this is clear.
But my question is the following: when softmax is used as a classifier, the argmax function is used to get the index of the class. So, what is the difference between getting an accumulated probability of one or higher, if the important result is the index of the correct class?
An example in Python, where I made another "softmax" (it is really not a softmax function) but the classifier works in the same way as the classifier with the real softmax function:
import numpy as np

classes = 10
classes_list = ['dog', 'cat', 'monkey', 'butterfly', 'donkey',
                'horse', 'human', 'car', 'table', 'bottle']

# This simulates a NN with its weights and the previous
# layer with a ReLU activation
a = np.random.normal(0, 0.5, (classes, 512))  # output from previous layer
w = np.random.normal(0, 0.5, (512, 1))        # weights
b = np.random.normal(0, 0.5, (classes, 1))    # bias

# correct solution:
def softmax(a, w, b):
    a = np.maximum(a, 0)  # ReLU simulation
    x = np.matmul(a, w) + b
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0), np.argsort(e_x.flatten())[::-1]

# approx solution (the "probabilities" sum to more than one):
def softmax_app(a, w, b):
    a = np.maximum(a, 0)  # ReLU simulation
    w_exp = np.exp(w)
    coef = np.sum(w_exp)
    matmul = np.exp(np.matmul(a, w) + b)
    res = matmul / coef
    return res, np.argsort(res.flatten())[::-1]

teor = softmax(a, w, b)
approx = softmax_app(a, w, b)
class_teor = classes_list[teor[-1][0]]
class_approx = classes_list[approx[-1][0]]
print(np.array_equal(teor[-1], approx[-1]))
print(class_teor == class_approx)
The class obtained by both methods is always the same (I'm talking about predictions, not training). I ask this because I'm implementing the softmax on an FPGA device, and with the second method two passes are not needed to calculate the softmax function: one to compute the exponentiated matrix and its sum, and a second to perform the division.
Let's review the uses of softmax:
You should use softmax if:
You are training a NN and want to limit the range of output values during training (you could use other activation functions instead). This can marginally help towards clipping the gradient.
You are performing inference on a NN and you want to obtain a metric on the "degree of confidence" of your classification result (in the range of 0-1).
You are performing inference on a NN and wish to get the top K results. In this case it is recommended as a way to have a "degree of confidence" metric to compare them.
You are performing inference on several NNs (ensemble methods) and wish to average them out (otherwise their results wouldn't be easily comparable).
You should not use (or remove) softmax if:
You are performing inference on a NN and you only care about the top class. Note that the NN could have been trained with Softmax (for better accuracy, faster convergence, etc..).
In your case, your insight is right: softmax as an activation function in the last layer is meaningless if your problem only requires you to get the index of the maximum value during the inference phase. Besides, since you are targeting an FPGA implementation, this would only give you extra headaches.
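For illustration, a short NumPy check of that last point: softmax is strictly monotonic, so the ranking of the raw logits (and therefore the argmax) is identical to the ranking of the softmax outputs:
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.random.normal(0, 1, 10)   # raw scores from the last layer
probs = softmax(logits)

# exp() is monotonic and the normalization is a positive constant,
# so the ordering of the classes never changes.
assert np.argmax(logits) == np.argmax(probs)
assert np.array_equal(np.argsort(logits), np.argsort(probs))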

Catboost custom loss function

I'm trying to implement my own custom loss function. While analyzing the worsened prediction quality I noticed that the custom loss function performs worse (or at least differently) in cross-validation, even with the Logloss implementation provided as an example in the docs. I expected it to be equal to the "native" CatBoost Logloss.
Here is the example I'm using:
https://catboost.ai/docs/concepts/python-usages-examples.html#user-defined-loss-function
import numpy as np

class LoglossObjective(object):
    def calc_ders_range(self, approxes, targets, weights):
        assert len(approxes) == len(targets)
        if weights is not None:
            assert len(weights) == len(approxes)
        result = []
        for index in range(len(targets)):
            e = np.exp(approxes[index])
            p = e / (1 + e)
            der1 = targets[index] - p
            der2 = -p * (1 - p)
            if weights is not None:
                der1 *= weights[index]
                der2 *= weights[index]
            result.append((der1, der2))
        return result
Can anyone explain why the user-defined Logloss is different from CatBoost's "native" Logloss? And how can the user-defined version reach the same prediction quality as the "native" one?
Found the answer: when running with the "native" Logloss, CatBoostClassifier automatically adjusts the learning_rate, whereas with a custom Logloss the default learning_rate is used, hence the different results.
Setting learning_rate explicitly led to equal training results.
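For reference, a minimal sketch of that comparison with the learning rate pinned on both sides. The specific values (iterations, learning_rate, random_seed) and X_train/y_train are placeholders, not taken from the question; LoglossObjective is the class shown above:
from catboost import CatBoostClassifier

# Same explicit learning_rate for both models, so the automatic
# learning-rate selection of the "native" Logloss no longer differs
# from the default used with a custom objective.
common = dict(iterations=500, learning_rate=0.03, random_seed=42)

native = CatBoostClassifier(loss_function='Logloss', **common)
custom = CatBoostClassifier(loss_function=LoglossObjective(),
                            eval_metric='Logloss', **common)

native.fit(X_train, y_train)
custom.fit(X_train, y_train)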

Implementing WNGrad in Pytorch?

I'm trying to implement the WNGrad optimizer (technically WN-Adam, Algorithm 4 in the WNGrad paper) in PyTorch. I've never implemented an optimizer in PyTorch before, so I don't know if I've done it correctly (I started from the Adam implementation). The optimizer does not make much progress and stalls out, as I would expect (the bj values can only increase monotonically, which happens quickly, so no progress is made), but I'm guessing I have a bug. Standard optimizers (Adam, SGD) work fine on the same model I'm trying to optimize.
Does this implementation look correct?
import torch
from torch.optim import Optimizer

class WNAdam(Optimizer):
    """Implements WNAdam algorithm.
    It has been proposed in `WNGrad: Learn the Learning Rate in Gradient Descent`_.
    Arguments:
        params (iterable): iterable of parameters to optimize or dicts defining
            parameter groups
        lr (float, optional): learning rate (default: 0.1)
        beta1 (float, optional): exponential smoothing coefficient for gradient.
            When beta=0 this implements WNGrad.
    .. _WNGrad\: Learn the Learning Rate in Gradient Descent:
        https://arxiv.org/abs/1803.02865
    """
    def __init__(self, params, lr=0.1, beta1=0.9):
        if not 0.0 <= beta1 < 1.0:
            raise ValueError("Invalid beta1 parameter: {}".format(beta1))
        defaults = dict(lr=lr, beta1=beta1)
        super().__init__(params, defaults)

    def step(self, closure=None):
        """Performs a single optimization step.
        Arguments:
            closure (callable, optional): A closure that reevaluates the model
                and returns the loss.
        """
        loss = None
        if closure is not None:
            loss = closure()
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad.data
                state = self.state[p]
                # State initialization
                if len(state) == 0:
                    state['step'] = 0
                    # Exponential moving average of gradient values
                    state['exp_avg'] = torch.zeros_like(p.data)
                    # Learning rate adjustment
                    state['bj'] = 1.0
                exp_avg = state['exp_avg']
                beta1 = group['beta1']
                state['step'] += 1
                state['bj'] += (group['lr']**2) / (state['bj']) * grad.pow(2).sum()
                # update exponential moving average
                exp_avg.mul_(beta1).add_(1 - beta1, grad)
                bias_correction = 1 - beta1 ** state['step']
                p.data.sub_(group['lr'] / state['bj'] / bias_correction, exp_avg)
        return loss
The paper's author has an open-sourced implementation on GitHub.
The WNGrad paper states that it is inspired by batch (and weight) normalization. You should use the L2 norm with respect to the weight dimensions (don't sum over all of them), as shown in the paper's algorithm.
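One way to read that suggestion (an interpretation, not the paper author's reference code) is to keep bj with the same shape as the parameter and accumulate squared gradients elementwise instead of summing them into a single scalar:
import torch
from torch.optim import Optimizer

class WNAdamPerDim(Optimizer):
    """Sketch of a per-dimension bj variant of the optimizer above."""

    def __init__(self, params, lr=0.1, beta1=0.9):
        super().__init__(params, dict(lr=lr, beta1=beta1))

    def step(self, closure=None):
        loss = closure() if closure is not None else None
        for group in self.param_groups:
            lr, beta1 = group['lr'], group['beta1']
            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad.data
                state = self.state[p]
                if len(state) == 0:
                    state['step'] = 0
                    state['exp_avg'] = torch.zeros_like(p.data)
                    # bj has the same shape as the parameter rather than
                    # being a single scalar summed over all elements
                    state['bj'] = torch.ones_like(p.data)
                state['step'] += 1
                state['bj'] += (lr ** 2) / state['bj'] * grad.pow(2)
                state['exp_avg'].mul_(beta1).add_(grad, alpha=1 - beta1)
                bias_correction = 1 - beta1 ** state['step']
                p.data.sub_(lr / state['bj'] / bias_correction * state['exp_avg'])
        return loss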

Defining a seed value in BrainScript for CNTK sequential machine learning models

This is with respect to CNTK BrainScript. I went through [1] to figure out whether there is an option to specify the random seed value, but I couldn't find one. (Yes, there is an option to set the 'random seed' parameter through the ParameterTensor() function, but if I followed that approach I would have to explicitly initialize all the LSTM weights separately (defining separate weights for the input gate, forget gate, etc.) instead of using the model sequence below.) Is there any other option available to set the random seed value while preserving the following RNN layered sequence?
nn_Train = {
    action = train
    BrainScriptNetworkBuilder = {
        model = Sequential (
            RecurrentLSTMLayer {$stateDim$, usePeepholes = true} :
            DenseLayer {$labelDim$, bias = false}
        )
        z = model (inputs)

        inputs = Input ($inputDim$)  # features
        labels = Input ($labelDim$)

        # loss and metric
        ce = SquareError (labels, z)

        # node assignment
        featureNodes    = (inputs)
        labelNodes      = (labels)
        criterionNodes  = (ce)
        evaluationNodes = (ce)
        outputNodes     = (z)
    }
}
[1] https://github.com/microsoft/cntk/wiki/Parameters-And-Constants#random-initialization
Unfortunately there isn't a global random seed option for parameters. However, you can modify the cntk.core.bs file next to cntk.exe, where all the layers are defined, to support a random seed for the layers you want.