Catboost custom loss function - catboost

I'm trying to implement my own custom loss function. While analyzing the worsened prediction quality, I noticed that a custom loss function performs worse (or at least differently) in cross-validation, even with the Logloss implementation provided as an example in the docs. I expected it to be equal to the "native" CatBoost Logloss.
Here is the example I'm using:
https://catboost.ai/docs/concepts/python-usages-examples.html#user-defined-loss-function
import numpy as np

class LoglossObjective(object):
    def calc_ders_range(self, approxes, targets, weights):
        assert len(approxes) == len(targets)
        if weights is not None:
            assert len(weights) == len(approxes)

        result = []
        for index in range(len(targets)):
            e = np.exp(approxes[index])
            p = e / (1 + e)
            der1 = targets[index] - p
            der2 = -p * (1 - p)

            if weights is not None:
                der1 *= weights[index]
                der2 *= weights[index]

            result.append((der1, der2))
        return result
Can anyone explain why the user-defined Logloss gives different results from CatBoost's "native" Logloss? And how can I make the user-defined version's prediction quality as good as the "native" one?

Found the answer: when running with the "native" Logloss, CatBoostClassifier automatically adjusts the learning_rate, whereas when running with a custom Logloss the default learning_rate is used, hence the different results.
Setting learning_rate explicitly led to equal training results.
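For illustration, a minimal sketch of what that looks like (the iteration count and learning rate below are placeholder values; the point is only that both models get the same explicit learning_rate):
from catboost import CatBoostClassifier

# Same explicit learning_rate for both models, so the built-in Logloss and the
# custom objective above are trained with identical settings (values are placeholders).
native_model = CatBoostClassifier(iterations=500, learning_rate=0.03,
                                  loss_function='Logloss')
custom_model = CatBoostClassifier(iterations=500, learning_rate=0.03,
                                  loss_function=LoglossObjective(),
                                  eval_metric='Logloss')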

Related

torch.nn.DataParallel with torch.autograd.grad in loss function fails

I have a neural network model that represents the surface of an object. For this to work, the gradients are calculated in the loss function (because, for example, it is a property of signed distance fields (SDFs) that the gradient is always unit length).
The loss function is the one from SIREN for SDFs and is defined as
import torch
import torch.nn.functional as F
import diff_operators  # helper module from the SIREN repository

def sdf(model_output, gt):
    gt_sdf = gt['sdf']
    gt_normals = gt['normals']
    coords = model_output['model_in']
    pred_sdf = model_output['model_out'].to(torch.float32)

    gradient = diff_operators.gradient(pred_sdf, coords)

    # Wherever boundary_values is not equal to zero, we interpret it as a boundary constraint.
    sdf_constraint = torch.where(gt_sdf != -1, pred_sdf, torch.zeros_like(pred_sdf))
    inter_constraint = torch.where(gt_sdf != -1, torch.zeros_like(pred_sdf),
                                   torch.exp(-1e2 * torch.abs(pred_sdf)))
    normal_constraint = torch.where(gt_sdf != -1,
                                    1 - F.cosine_similarity(gradient, gt_normals, dim=-1)[..., None],
                                    torch.zeros_like(gradient[..., :1]))
    grad_constraint = torch.abs(gradient.norm(dim=-1) - 1)

    return {'sdf': torch.abs(sdf_constraint).mean() * 3e3,
            'inter': inter_constraint.mean() * 1e2,
            'normal_constraint': normal_constraint.mean() * 1e2,
            'grad_constraint': grad_constraint.mean() * 5e1}
and the gradient calculation uses torch.autograd.grad:
def gradient(y, x, grad_outputs=None):
    if grad_outputs is None:
        grad_outputs = torch.ones_like(y)
    grad = torch.autograd.grad(y, [x], grad_outputs=grad_outputs, create_graph=True)[0]
    return grad
Now I wanted to parallelise the training by wrapping the model in torch.nn.DataParallel, but I get the following error:
RuntimeError: One of the differentiated Tensors appears to not have been used in the graph. Set allow_unused=True if this is the desired behavior.
Is it possible to use torch.nn.DataParallel with gradient calculation in the loss function and what do I need to change to make it work?
Looking at the documentation of nn.parallel.DistributedDataParallel:
This module doesn’t work with torch.autograd.grad() (i.e. it will only work if gradients are to be accumulated in .grad attributes of parameters).
It also recommends using torch.distributed.autograd.backward and torch.distributed.optim.DistributedOptimizer.
Also, the documentation of torch.distributed recommends using the gloo backend:
Please notice that currently the only backend where all the functions are guaranteed to work is gloo.
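A minimal sketch of what that recommendation might look like in code (the tiny model, random data, and single-process RPC setup are placeholders for illustration, not the setup from the question):
import os
import torch
import torch.distributed.autograd as dist_autograd
import torch.distributed.rpc as rpc

# Single-process RPC setup, just so the distributed autograd context can be used.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
rpc.init_rpc("worker0", rank=0, world_size=1)

model = torch.nn.Linear(3, 1)                    # placeholder model
coords = torch.randn(8, 3, requires_grad=True)   # placeholder data
target = torch.randn(8, 1)

with dist_autograd.context() as context_id:
    loss = torch.nn.functional.mse_loss(model(coords), target)
    # Gradients are accumulated inside the distributed autograd context
    # instead of in the .grad attributes of the parameters.
    dist_autograd.backward(context_id, [loss])

rpc.shutdown()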

About the Softmax function as an output layer in predictions

I know how the softmax activation function works: the sum of the output layer with a softmax activation is always equal to one, which means the output vector is normalized. This is also necessary because the maximum accumulated probability cannot exceed one. OK, this is clear.
But my question is the following: when the softmax is used as a classifier, the argmax function is used to get the index of the class. So what is the difference between getting an accumulated probability of one or higher, if the important result is the index that gives the correct class?
Here is an example in Python where I made another "softmax" (it is really not a softmax function), yet the classifier works in the same way as the classifier with the real softmax function:
import numpy as np

classes = 10
classes_list = ['dog', 'cat', 'monkey', 'butterfly', 'donkey',
                'horse', 'human', 'car', 'table', 'bottle']

# This simulates an NN with its weights and the previous
# layer with a ReLU activation
a = np.random.normal(0, 0.5, (classes, 512))  # Output from previous layer
w = np.random.normal(0, 0.5, (512, 1))        # weights
b = np.random.normal(0, 0.5, (classes, 1))    # bias

# correct solution:
def softmax(a, w, b):
    a = np.maximum(a, 0)  # ReLU simulation
    x = np.matmul(a, w) + b
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0), np.argsort(e_x.flatten())[::-1]

# approx solution (sum of "probabilities" can be greater than one):
def softmax_app(a, w, b):
    a = np.maximum(a, 0)  # ReLU simulation
    w_exp = np.exp(w)
    coef = np.sum(w_exp)
    matmul = np.exp(np.matmul(a, w) + b)
    res = matmul / coef
    return res, np.argsort(res.flatten())[::-1]

teor = softmax(a, w, b)
approx = softmax_app(a, w, b)
class_teor = classes_list[teor[-1][0]]
class_approx = classes_list[approx[-1][0]]
print(np.array_equal(teor[-1], approx[-1]))
print(class_teor == class_approx)
The class obtained by both methods is always the same (I'm talking about predictions, not training). I ask this because I'm implementing the softmax on an FPGA device, and with the second method it does not take two passes to compute the softmax function: a first pass to compute the exponentiated matrix and its sum, and a second to perform the division.
Let's review the uses of softmax:
You should use softmax if:
You are training a NN and want to limit the range of output values during training (you could use other activation functions instead). This can marginally help towards clipping the gradient.
You are performing inference on a NN and you want to obtain a metric on the "degree of confidence" of your classification result (in the range of 0-1).
You are performing inference on a NN and wish to get the top K results. In this case it is recommended as a way to have a "degree of confidence" metric to compare them.
You are performing inference on several NNs (ensemble methods) and wish to average them out (otherwise their results wouldn't be easily comparable).
You should not use (or remove) softmax if:
You are performing inference on a NN and you only care about the top class. Note that the NN could have been trained with Softmax (for better accuracy, faster convergence, etc..).
In your case, your insights are right: Softmax as an activation function in the last layer is meaningless if your problem only requires you to get the index of the maximum value during the inference phase. Besides, since you are targeting an FPGA implementation, this would only give you extra headaches.
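A tiny sketch of that point: softmax is a monotonic transformation of the logits, so the argmax of the raw outputs is always the same as the argmax of the softmax probabilities.
import numpy as np

# The predicted class index is the same whether or not softmax is applied.
logits = np.array([2.0, -1.0, 0.5, 3.2])
probs = np.exp(logits - logits.max())
probs /= probs.sum()
assert np.argmax(logits) == np.argmax(probs)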

Implementing WNGrad in PyTorch?

I'm trying to implement the WNGrad optimizer (technically WN-Adam, Algorithm 4 in the WNGrad paper) in PyTorch. I've never implemented an optimizer in PyTorch before, so I don't know if I've done it correctly (I started from the Adam implementation). The optimizer does not make much progress and tails off in the way I would expect if something were wrong (the bj values can only increase monotonically, which happens quickly, so no progress is made), so I'm guessing I have a bug. Standard optimizers (Adam, SGD) work fine on the same model I'm trying to optimize.
Does this implementation look correct?
import torch
from torch.optim import Optimizer

class WNAdam(Optimizer):
    """Implements the WNAdam algorithm.

    It has been proposed in `WNGrad: Learn the Learning Rate in Gradient Descent`_.

    Arguments:
        params (iterable): iterable of parameters to optimize or dicts defining
            parameter groups
        lr (float, optional): learning rate (default: 0.1)
        beta1 (float, optional): exponential smoothing coefficient for the gradient.
            When beta1=0 this implements WNGrad.

    .. _WNGrad\: Learn the Learning Rate in Gradient Descent:
        https://arxiv.org/abs/1803.02865
    """

    def __init__(self, params, lr=0.1, beta1=0.9):
        if not 0.0 <= beta1 < 1.0:
            raise ValueError("Invalid beta1 parameter: {}".format(beta1))
        defaults = dict(lr=lr, beta1=beta1)
        super().__init__(params, defaults)

    def step(self, closure=None):
        """Performs a single optimization step.

        Arguments:
            closure (callable, optional): A closure that reevaluates the model
                and returns the loss.
        """
        loss = None
        if closure is not None:
            loss = closure()
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad.data
                state = self.state[p]
                # State initialization
                if len(state) == 0:
                    state['step'] = 0
                    # Exponential moving average of gradient values
                    state['exp_avg'] = torch.zeros_like(p.data)
                    # Learning rate adjustment
                    state['bj'] = 1.0
                exp_avg = state['exp_avg']
                beta1 = group['beta1']
                state['step'] += 1
                state['bj'] += (group['lr'] ** 2) / state['bj'] * grad.pow(2).sum()
                # update exponential moving average
                exp_avg.mul_(beta1).add_(1 - beta1, grad)
                bias_correction = 1 - beta1 ** state['step']
                p.data.sub_(group['lr'] / state['bj'] / bias_correction, exp_avg)
        return loss
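For reference, this is roughly how I exercise it (a toy linear model with random data, not my actual model):
import torch
import torch.nn as nn

# Toy setup: fit a small linear model on random data with the WNAdam optimizer above.
model = nn.Linear(10, 1)
optimizer = WNAdam(model.parameters(), lr=0.1, beta1=0.9)
criterion = nn.MSELoss()

x = torch.randn(32, 10)
y = torch.randn(32, 1)
for _ in range(100):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()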
The paper's author has an open-sourced implementation on GitHub.
The WNGrad paper states that it is inspired by batch (and weight) normalization. You should use the L2 norm with respect to the weight dimensions (don't sum over everything), as shown in this algorithm.

How to work with the CatBoost overfitting detector

I am trying to understand the catboost overfitting detector. It is described here:
https://tech.yandex.com/catboost/doc/dg/concepts/overfitting-detector-docpage/#overfitting-detector
Other gradient boosting packages like lightgbm and xgboost use a parameter called early_stopping_rounds, which is easy to understand (it stops the training once the validation error hasn't decreased in early_stopping_rounds steps).
However I have a hard time understanding the p_value approach used by catboost. Can anyone explain how this overfitting detector works and when it stops the training?
It's not documented on the Yandex website or at the github repository, but if you look carefully through the python code posted to github (specifically here), you will see that the overfitting detector is activated by setting "od_type" in the parameters. Reviewing the recent commits on github, the catboost developers also recently implemented a tool similar to the "early_stopping_rounds" parameter used by lightGBM and xgboost, called "Iter."
To set the number of rounds after the most recent best iteration to wait before stopping, provide a numeric value in the "od_wait" parameter.
For example:
fit_param <- list(
  iterations = 500,
  thread_count = 10,
  loss_function = "Logloss",
  depth = 6,
  learning_rate = 0.03,
  od_type = "Iter",
  od_wait = 100
)
I am using the catboost library with R 3.4.1. I have found that setting the "od_type" and "od_wait" parameters in the fit_param list works well for my purposes.
I realize this is not answering your question about the way to use the p_value approach also implemented by the catboost developers; unfortunately I cannot help you there. Hopefully someone else can explain that setting to both of us.
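For anyone using the Python package instead of R, the equivalent settings would look something like this (a sketch with the same parameter values as above):
from catboost import CatBoostClassifier

# Same overfitting-detector settings as the R fit_param list above,
# expressed through the Python API.
model = CatBoostClassifier(
    iterations=500,
    thread_count=10,
    loss_function='Logloss',
    depth=6,
    learning_rate=0.03,
    od_type='Iter',
    od_wait=100,
)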
CatBoost now supports early_stopping_rounds (see the fit method parameters):
Sets the overfitting detector type to Iter and stops the training
after the specified number of iterations since the iteration with the
optimal metric value.
This works very much like early_stopping_rounds in xgboost.
Here is an example:
from catboost import CatBoostRegressor, Pool
from sklearn.model_selection import train_test_split
import numpy as np
y = np.random.normal(0, 1, 1000)
X = np.random.normal(0, 1, (1000, 1))
X[:, 0] += y * 2
X_train, X_eval, y_train, y_eval = train_test_split(X, y, test_size=0.1)
# fit on the training split only, so the eval set stays held out
train_pool = Pool(X_train, y_train)
eval_pool = Pool(X_eval, y_eval)
model = CatBoostRegressor(iterations=1000, learning_rate=0.1)
model.fit(train_pool, eval_set=eval_pool, early_stopping_rounds=10)
The result should be something like this:
522: learn: 0.3994718 test: 0.4294720 best: 0.4292901 (514) total: 957ms remaining: 873ms
523: learn: 0.3994580 test: 0.4294614 best: 0.4292901 (514) total: 958ms remaining: 870ms
524: learn: 0.3994495 test: 0.4294806 best: 0.4292901 (514) total: 959ms remaining: 867ms
Stopped by overfitting detector (10 iterations wait)
bestTest = 0.4292900745
bestIteration = 514
Shrink model to first 515 iterations.
early_stopping_rounds takes care of both the od_type='Iter' and od_wait parameters; there is no need to set them individually, just set early_stopping_rounds.
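In other words, these two configurations should stop at the same point (a sketch reusing the train_pool and eval_pool objects from the example above):
from catboost import CatBoostRegressor

# Stop 10 iterations after the best eval-metric value: via explicit od_* parameters...
model_a = CatBoostRegressor(iterations=1000, learning_rate=0.1,
                            od_type='Iter', od_wait=10)
model_a.fit(train_pool, eval_set=eval_pool)

# ...or via the early_stopping_rounds shortcut in fit().
model_b = CatBoostRegressor(iterations=1000, learning_rate=0.1)
model_b.fit(train_pool, eval_set=eval_pool, early_stopping_rounds=10)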

Theano - how to override gradient for part of op graph

I have a rather complex model at hand. The model has multiple parts with a linear structure:
y = theano.tensor.dot(W, x) + b
I want to build an optimizer that uses a custom rule to compute the gradient for all the linear structures, while keeping the other operations intact. What is the easiest way to override the gradient ops for all linear parts of my model? Preferably without writing a new Op.
So, I spent some time working on a PR for Theano (not merged as of Jan 13 2017; it has since been merged) which gives the user the ability to partially override the gradient of a theano.OpFromGraph instance. The override is done with a symbolic graph, so you still gain the full benefit of Theano's graph optimizations.
Typical use cases:
Numerical safety considerations
Rescaling or clipping the gradient
Specialized gradient routines like the Riemannian natural gradient
To make an Op with an overridden gradient:
Make the needed compute graph
Make an OpFromGraph instance (or a python function) for the gradient of your Op
Make an OfG instance of your Op, and set the grad_overrides argument
Call the OfG instance to build your model
Defining an OpFromGraph is like compiling a theano function, with some differences:
No support for updates and givens (as of Jan 2017)
You get a symbolic Op instead of a numerical function
Example:
'''
This creates an atan2_safe Op with smoothed gradient at (0,0)
'''
import theano as th
import theano.tensor as T
# Turn this on if you want theano to build one large graph for your model instead of precompiling the small graph.
USE_INLINE = False
# In a real case you would set EPS to a much smaller value
EPS = 0.01
# define a graph for needed Op
s_x, s_y = T.scalars('xy')
s_darg = T.scalar()  # backpropagated gradient
s_arg = T.arctan2(s_y, s_x)
s_abs2 = T.sqr(s_x) + T.sqr(s_y) + EPS
s_dx = -s_y / s_abs2
s_dy = s_x / s_abs2
# construct OfG with gradient overrides
# NOTE: there are unused inputs in the gradient expression,
# however the input count must match, so we pass
# on_unused_input='ignore'
atan2_safe_grad = th.OpFromGraph([s_x, s_y, s_darg], [s_dx, s_dy], inline=USE_INLINE, on_unused_input='ignore')
atan2_safe = th.OpFromGraph([s_x, s_y], [s_arg], inline=USE_INLINE, grad_overrides=atan2_safe_grad)
# build graph using the new Op
x, y = T.scalar(), T.scalar()
arg = atan2_safe(x, y)
dx, dy = T.grad(arg, [x, y])
fn = th.function([x, y], [dx, dy])
fn(1., 0.) # gives [-0.0, 0.99099]
fn(0., 0.) # gives [0.0, 0.0], no more annoying nan!
NOTE: the theano.OpFromGraph is still largely experimental, expect bugs.
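To connect this back to the question's linear parts, here is a hedged sketch of the same pattern applied to y = dot(W, x) + b; the gradient expressions below simply reproduce the standard gradient and stand in for whatever custom rule you want:
import theano as th
import theano.tensor as T

# Forward graph for the linear op from the question
s_W = T.matrix('W')
s_x = T.vector('x')
s_b = T.vector('b')
s_y = T.dot(s_W, s_x) + s_b
s_dy = T.vector('dy')  # gradient backpropagated into y

# Stand-ins for a custom gradient rule (these reproduce the standard one)
s_dW = T.outer(s_dy, s_x)
s_dx = T.dot(s_W.T, s_dy)
s_db = s_dy

# b is unused in the gradient expressions, hence on_unused_input='ignore'
linear_grad = th.OpFromGraph([s_W, s_x, s_b, s_dy], [s_dW, s_dx, s_db],
                             on_unused_input='ignore')
linear_dot = th.OpFromGraph([s_W, s_x, s_b], [s_y], grad_overrides=linear_grad)

# Use it like any other symbolic op wherever the model has a linear part
W, x, b = T.matrix('W2'), T.vector('x2'), T.vector('b2')
y = linear_dot(W, x, b)
dW, dx, db = T.grad(T.sum(y), [W, x, b])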