I have just begun using Lasagne and Theano to do some machine learning in Python.
I am trying to modify the softmax class in Theano. I want to change how the activation function (softmax) is calculated. Instead of dividing e_x by e_x.sum(axis=1), I want to divide e_x by the sum of every three consecutive entries.
For instance, the result will be as follows:
sm[0] = e_x[0]/(e_x[0]+e_x[1]+e_x[2])
sm[1] = e_x[1]/(e_x[0]+e_x[1]+e_x[2])
sm[2] = e_x[2]/(e_x[0]+e_x[1]+e_x[2])
sm[3] = e_x[3]/(e_x[3]+e_x[4]+e_x[5])
sm[4] = e_x[4]/(e_x[3]+e_x[4]+e_x[5])
sm[5] = e_x[5]/(e_x[3]+e_x[4]+e_x[5])
and so on...
The problem is that I cannot quite grasp how Theano carries out the computation.
Here is my main question: does it suffice to just change the perform() function in the softmax class?
Here is the original perform() function:
def perform(self, node, input_storage, output_storage):
    x, = input_storage
    e_x = numpy.exp(x - x.max(axis=1)[:, None])
    sm = e_x / e_x.sum(axis=1)[:, None]
    output_storage[0][0] = sm
Here is my modified perform():
def myPerform(self, node, input_storage, output_storage):
    x, = input_storage
    e_x = numpy.exp(x - x.max(axis=1)[:, None])
    sm = numpy.zeros_like(e_x)
    for i in range(0, symbolCount):
        total = e_x[3*i] + e_x[3*i+1] + e_x[3*i+2]
        sm[3*i] = e_x[3*i]/total
        sm[3*i+1] = e_x[3*i+1]/total
        sm[3*i+2] = e_x[3*i+2]/total
    output_storage[0][0] = sm
With the current code, I am getting an 'unorderable types:int()>str()' error when I use the predict method in Lasagne.
For something like this you're probably better off constructing a custom softmax via symbolic expressions rather than creating (or modifying) an operation.
Defining your custom softmax symbolically gives you gradients (and the other Theano operation bits and pieces) "for free", but it might run slightly slower than a hand-written operation could.
Here's an example:
import numpy
import theano
import theano.tensor as tt
x = tt.matrix()
# Use the built in softmax operation
y1 = tt.nnet.softmax(x)
# A regular softmax operation defined via ordinary Theano symbolic expressions
y2 = tt.exp(x)
y2 = y2 / y2.sum(axis=1)[:, None]
# Custom softmax operation
def custom_softmax(a):
    b = tt.exp(a)
    b1 = b[:, :3] / b[:, :3].sum(axis=1)[:, None]
    b2 = b[:, 3:] / b[:, 3:].sum(axis=1)[:, None]
    return tt.concatenate([b1, b2], axis=1)
y3 = custom_softmax(x)
f = theano.function([x], outputs=[y1, y2, y3])
x_value = [[.1, .2, .3, .4, .5, .6], [.1, .3, .5, .2, .4, .6]]
y1_value, y2_value, y3_value = f(x_value)
assert numpy.allclose(y1_value, y2_value)
assert y3_value.shape == y1_value.shape
a = numpy.exp(.1) + numpy.exp(.2) + numpy.exp(.3)
b = numpy.exp(.4) + numpy.exp(.5) + numpy.exp(.6)
c = numpy.exp(.1) + numpy.exp(.3) + numpy.exp(.5)
d = numpy.exp(.2) + numpy.exp(.4) + numpy.exp(.6)
assert numpy.allclose(y3_value, [
    [numpy.exp(.1) / a, numpy.exp(.2) / a, numpy.exp(.3) / a, numpy.exp(.4) / b, numpy.exp(.5) / b, numpy.exp(.6) / b],
    [numpy.exp(.1) / c, numpy.exp(.3) / c, numpy.exp(.5) / c, numpy.exp(.2) / d, numpy.exp(.4) / d, numpy.exp(.6) / d]
]), y3_value
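The custom_softmax above hard-codes exactly two groups of three columns. As a sketch only (my addition, not part of the original answer), the same grouped normalization can be written with reshape so it handles any number of columns that is a multiple of the group size, assuming Theano's reshape and keepdims behave as documented:
import theano.tensor as tt

def grouped_softmax(a, group_size=3):
    # Sketch: split the columns into blocks of `group_size` and normalize within each block
    b = tt.exp(a)
    b = b.reshape((b.shape[0], -1, group_size))   # (rows, groups, group_size)
    b = b / b.sum(axis=2, keepdims=True)          # normalize within each group
    return b.reshape((a.shape[0], a.shape[1]))    # back to the original 2-D shape

y4 = grouped_softmax(x)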
Related
I implemented a custom loss function, but its gradient is always zero and I don't understand why.
Here is the code for the objective function:
def objective(p, output):
    x, y = p
    a = minA
    b = minB
    r = 0.1
    XA = 1/2 - 1/2 * torch.tanh(100*((x - a[0])**2 + (y - a[1])**2 - (r + 0.02)**2))
    XB = 1/2 - 1/2 * torch.tanh(100*((x - b[0])**2 + (y - b[1])**2 - (r + 0.02)**2))
    q = (1-XA)*((1-XB)*output + XB)
    output_grad, _ = torch.autograd.grad(q, (x, y))
    output_grad.requires_grad_()
    q = output_grad**2
    return q
And the code for training the model (which is a simple, fully connected NN):
model = NN(input_size)
optimizer = optim.SGD(model.parameters(), lr=learning_rate)

for e in range(epochs):
    for configuration in total:
        print("Train for configuration", configuration)
        # Training pass
        optimizer.zero_grad()
        # output is q~
        output = model(configuration)
        # loss is the objective function we defined
        loss = objective(configuration, output.item())
        loss.backward()
        optimizer.step()
I really think the problem is in the line output_grad, _ = torch.autograd.grad(q, (x, y)).
(During the training, "configuration" is a point sampled from a distribution and identified by the coordinates x and y.)
Thanks!!
Here is the code in a Google Colab session:
Google Colab
Tanh is a bounded function and converges quite quickly to 1. Your XA and XB points are defined as
XA = 1/2 - 1/2 * torch.tanh(100*(z1 + z2 - z0))
XB = 1/2 - 1/2 * torch.tanh(100*(z3 + z4 - z0))
Since z1 + z2 - z0 and z3 + z4 - z0 are rather close to 1, you will end up with an input close to 100. This means the tanh will output 1, resulting in XA and XB being zero. You might not want that 100 coefficient if you want non-zero outputs.
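A minimal illustration of the saturation (my addition, not the original model; the value 100.0 just stands in for the large argument):
import torch

# For an argument around 100, tanh is numerically 1 and its gradient is 0, so
# any term multiplied by (1 - tanh(...)) stops contributing to the gradient.
z = torch.tensor(100.0, requires_grad=True)
out = torch.tanh(z)
out.backward()
print(out)     # prints tensor(1., ...)
print(z.grad)  # prints tensor(0.)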
I am trying to implement a custom loss function in a Pytorch Autoencoder.
The loss function tries to maximize the cosine similarity between a given output tensor U (a vector) and 100 random vectors J where both U and J have the same dimension of [300]. This is repeated for each batch.
Suppose we have 30 items per batch; then the shapes are:
train_Y.shape = [30,300]
Random_vectors.shape = [30,100,300]
I can implement the loss function in two ways:
All_Y = []
for Y, z_r in zip(train_y, random_vectors):
    Y_cosine_list = []
    for z in z_r:
        cosi = torch.dot(Y, z) / (torch.norm(Y)*torch.norm(z))
        Y_cosine_list.append(cosi)
    All_Y.append(Y_cosine_list)
All_Y = torch.tensor(All_Y).to(device)
train_loss = torch.sum(torch.abs(All_Y))/dim_0
train_loss = torch.tensor(train_loss.data, requires_grad=True)
or
train_Y = torch.zeros([dim_0, 100])
for i, (Y, z_r) in enumerate(zip(train_Y, random_vectors)):
    for j, z in enumerate(z_r):
        train_Y[i, j] = cos(Y, z)
train_Y = train_Y.to(device)
train_loss = torch.sum(torch.abs(train_Y))/dim_0
The second one is more elegant and to the point. However, it gives a "CUDA illegal memory access" error. I have checked that the memory is not exceeded in either case. Is there anything wrong with the second implementation?
The first implementation is inelegant and I am not sure that it makes sense from a neural net optimization perspective, but it does not give errors and I am able to complete training for all the epochs.
PS: I have tried encapsulating this code block in a loss_fn method but I get the same illegal memory access error.
I have tried everything that I could find for the illegal memory access error - changing GPUs, removing a torch.stack block etc. But I can't seem to get rid of the problem.
Here is a vectorized way to do it:
import torch
import torch.nn as nn

class CosineLoss(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x, y):
        """
        Args:
            x (torch.tensor): [batchsize, N, M] - tensor.
            y (torch.tensor): [batchsize, M] - tensor.

        Returns:
            torch.tensor: scalar mean cosine loss
        """
        # dot product along dimension 'm', i.e. multiply and sum along 'm'
        dotp = torch.einsum("bm, bnm -> bn", y, x)
        # L2 norm along dimension 'm', combined by broadcasting
        length = torch.norm(y, dim=-1)[:, None]*torch.norm(x, dim=-1)
        # cosine = dot product of unit vectors
        cos = dotp/length
        return cos.mean()

def test():
    b, n, m = 30, 100, 300
    train_Y = torch.randn(b, m, device='cuda')
    random_vectors = torch.randn(b, n, m, requires_grad=True, device='cuda')
    print(f'{random_vectors.grad = }')
    cosineloss = CosineLoss()
    loss = cosineloss(random_vectors, train_Y)
    print(f'{loss = }')
    loss.backward()
    print(f'{random_vectors.grad.shape = }')
References:
einsum
broadcasting
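As an extra sanity check (my addition, not part of the original answer), the einsum-based cosine agrees with torch.nn.functional.cosine_similarity on random data:
import torch
import torch.nn.functional as F

b, n, m = 4, 5, 7
x = torch.randn(b, n, m)
y = torch.randn(b, m)

# einsum-based cosine, as in CosineLoss.forward
dotp = torch.einsum("bm, bnm -> bn", y, x)
length = torch.norm(y, dim=-1)[:, None] * torch.norm(x, dim=-1)
cos_vec = dotp / length

# reference: cosine_similarity with y repeated over the 'n' dimension
cos_ref = F.cosine_similarity(x, y[:, None, :].expand_as(x), dim=-1)
assert torch.allclose(cos_vec, cos_ref, atol=1e-6)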
I have a tensor X of shape BxNxD and a tensor Y of shape BxMxD.
I want to compute the pairwise distances for each element in the batch, i.e. I want a BxMxN tensor.
How do I do this?
There is some discussion on this topic here: https://github.com/pytorch/pytorch/issues/9406, but I don't understand it as there are many implementation details while no actual solution is highlighted.
A naive approach would be to use the answer for non-batched pairwise distances as discussed here: https://discuss.pytorch.org/t/efficient-distance-matrix-computation/9065, i.e.
import torch
import numpy as np
B = 32
N = 128
M = 256
D = 3
X = torch.from_numpy(np.random.normal(size=(B, N, D)))
Y = torch.from_numpy(np.random.normal(size=(B, M, D)))
def pairwise_distances(x, y=None):
    x_norm = (x**2).sum(1).view(-1, 1)
    if y is not None:
        y_t = torch.transpose(y, 0, 1)
        y_norm = (y**2).sum(1).view(1, -1)
    else:
        y_t = torch.transpose(x, 0, 1)
        y_norm = x_norm.view(1, -1)
    dist = x_norm + y_norm - 2.0 * torch.mm(x, y_t)
    return torch.clamp(dist, 0.0, np.inf)

out = []
for b in range(B):
    out.append(pairwise_distances(X[b], Y[b]))
print(torch.stack(out).shape)
How can I do this without looping over B?
Thanks
I had a similar issue and spent some time finding the easiest and fastest solution. Now you can compute batched distances by using PyTorch's cdist, which will give you a BxMxN tensor:
torch.cdist(Y, X)
It also works well if you just want to compute distances between each pair of rows of two matrices.
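A quick check with the shapes from the question (my addition; values are random):
import torch

B, N, M, D = 32, 128, 256, 3
X = torch.randn(B, N, D)
Y = torch.randn(B, M, D)

dists = torch.cdist(Y, X)   # Euclidean distances, shape (B, M, N)
print(dists.shape)          # torch.Size([32, 256, 128])

# same values as applying cdist slice by slice over the batch
looped = torch.stack([torch.cdist(Y[b], X[b]) for b in range(B)])
print(torch.allclose(dists, looped, atol=1e-6))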
So, I have applied a Kalman filter to this dataset. The following is my code (please note I am including the entire code so that one can reproduce the results on their machine).
# Multi dimensional Kalman filter
import os
from math import *
class matrix:

    # implements basic operations of a matrix class

    def __init__(self, value):
        if isinstance(value, basestring):
            print "lol"
        self.value = value
        self.dimx = len(value)
        self.dimy = len(value[0])
        if value == [[]]:
            self.dimx = 0

    def zero(self, dimx, dimy):
        # check if valid dimensions
        if dimx < 1 or dimy < 1:
            raise ValueError, "Invalid size of matrix"
        else:
            self.dimx = dimx
            self.dimy = dimy
            self.value = [[0 for row in range(dimy)] for col in range(dimx)]

    def identity(self, dim):
        # check if valid dimension
        if dim < 1:
            raise ValueError, "Invalid size of matrix"
        else:
            self.dimx = dim
            self.dimy = dim
            self.value = [[0 for row in range(dim)] for col in range(dim)]
            for i in range(dim):
                self.value[i][i] = 1

    def show(self):
        for i in range(self.dimx):
            print self.value[i]
        print ' '
    def __add__(self, other):
        # check if correct dimensions
        if self.dimx != other.dimx or self.dimy != other.dimy:
            raise ValueError, "Matrices must be of equal dimensions to add"
        else:
            # add if correct dimensions
            res = matrix([[]])
            res.zero(self.dimx, self.dimy)
            for i in range(self.dimx):
                for j in range(self.dimy):
                    res.value[i][j] = self.value[i][j] + other.value[i][j]
            return res

    def __sub__(self, other):
        # check if correct dimensions
        if self.dimx != other.dimx or self.dimy != other.dimy:
            raise ValueError, "Matrices must be of equal dimensions to subtract"
        else:
            # subtract if correct dimensions
            res = matrix([[]])
            res.zero(self.dimx, self.dimy)
            for i in range(self.dimx):
                for j in range(self.dimy):
                    res.value[i][j] = self.value[i][j] - other.value[i][j]
            return res

    def __mul__(self, other):
        # check if correct dimensions
        if self.dimy != other.dimx:
            raise ValueError, "Matrices must be m*n and n*p to multiply"
        else:
            # multiply if correct dimensions
            res = matrix([[]])
            res.zero(self.dimx, other.dimy)
            for i in range(self.dimx):
                for j in range(other.dimy):
                    for k in range(self.dimy):
                        res.value[i][j] += self.value[i][k] * other.value[k][j]
            return res

    def transpose(self):
        # compute transpose
        res = matrix([[]])
        res.zero(self.dimy, self.dimx)
        for i in range(self.dimx):
            for j in range(self.dimy):
                res.value[j][i] = self.value[i][j]
        return res
    # Thanks to Ernesto P. Adorio for use of Cholesky and CholeskyInverse functions

    def Cholesky(self, ztol=1.0e-5):
        # Computes the upper triangular Cholesky factorization of
        # a positive definite matrix.
        res = matrix([[]])
        res.zero(self.dimx, self.dimx)
        for i in range(self.dimx):
            S = sum([(res.value[k][i])**2 for k in range(i)])
            d = self.value[i][i] - S
            if abs(d) < ztol:
                res.value[i][i] = 0.0
            else:
                if d < 0.0:
                    raise ValueError, "Matrix not positive-definite"
                res.value[i][i] = sqrt(d)
            for j in range(i+1, self.dimx):
                S = sum([res.value[k][i] * res.value[k][j] for k in range(self.dimx)])
                if abs(S) < ztol:
                    S = 0.0
                res.value[i][j] = (self.value[i][j] - S)/res.value[i][i]
        return res

    def CholeskyInverse(self):
        # Computes inverse of matrix given its Cholesky upper triangular
        # decomposition.
        res = matrix([[]])
        res.zero(self.dimx, self.dimx)
        # Backward step for inverse.
        for j in reversed(range(self.dimx)):
            tjj = self.value[j][j]
            S = sum([self.value[j][k]*res.value[j][k] for k in range(j+1, self.dimx)])
            res.value[j][j] = 1.0/tjj**2 - S/tjj
            for i in reversed(range(j)):
                res.value[j][i] = res.value[i][j] = -sum([self.value[i][k]*res.value[k][j] for k in range(i+1, self.dimx)])/self.value[i][i]
        return res

    def inverse(self):
        aux = self.Cholesky()
        res = aux.CholeskyInverse()
        return res

    def __repr__(self):
        return repr(self.value)
########################################

# filter function

def kalman_filter(x, P):
    # measurement update
    y = measurements - H * x
    s = H * P * H.transpose() + R
    K = P * H.transpose() * s.inverse()
    x = x + K * y
    P = (I - K * H) * P

    # prediction
    x = F * x
    P = F * P * F.transpose()

    return x, P
files = []

x = matrix([[0.], [0.], [0.]])  # initial state (location and velocity)
P = matrix([[1000., 0., 0.], [0., 1000., 0.], [0., 0., 1000.]])  # initial uncertainty
u = matrix([[0.], [0.]])  # external motion
F = matrix([[1., 0., 0.], [0., 1., 0.], [0., 0., 1.]])  # next state function
H = matrix([[1., 0., 0.], [0., 1., 0.], [0., 0., 1.]])  # measurement function
R = matrix([[1., 0., 0.], [0., 1., 0.], [0., 0., 1.]])  # measurement uncertainty
I = matrix([[1., 0., 0.], [0., 1., 0.], [0., 0., 1.]])  # identity matrix

for i in os.listdir("/home/fatima/Downloads/thesis/HMP_Dataset/Climb_stairs"):
    if i.endswith('.txt'):
        files.append(i)

for j in range(len(files)-1):
    print "iteration number"
    print j
    with open("/home/fatima/Downloads/thesis/HMP_Dataset/Climb_stairs/"+files[j]) as f:
        content = f.readlines()
        for e in range(len(content)-1):
            content1 = [z.strip() for z in content]
            content2 = content1[e].split(" ")
            content2[0] = -14.709 + (float(content2[0])/63)*(2*14.709)
            content2[2] = -14.709 + (float(content2[2])/63)*(2*14.709)
            content2[1] = -14.709 + (float(content2[1])/63)*(2*14.709)
            measurements = matrix([[float(content2[0])], [float(content2[1])], [float(content2[2])]])
            print measurements
            x, P = kalman_filter(x, P)
            print x
The Kalman filter itself is at the end of the code. I haven't added any noise in the prediction of the state matrix, as I am unsure how to.
About the dataset:
It is an ADL wrist-worn accelerometer dataset. There are a total of 14 activities, and the acceleration is recorded in the x, y and z directions. Acceleration values range from 0 to 63; for the calculation they are normalized to -14.7 to +14.7.
Question:
My question is whether I am headed in the right direction or not. Does the output seem to be correct? Any improvements?
The code looks good and you are definitely heading in the right direction!
Here are some things you can do to check your implementation:
consider simple cases for which you know the solution, for example H = I, P = sigma² * I and R = sigma'² * I. x should tend toward the x of the previous time step if sigma tends to zero, and x should tend toward the measurements if sigma' tends to zero. Or, if sigma = sigma', then the Kalman filter analysis should be the average of the previous state and the measurements, and P should be reduced by half. You might want to write a unit test for these cases (see the sketch after this list).
check that the matrix P always stays symmetric and positive definite (all eigenvalues are positive)
implement an alternate Kalman filter update step, see equation (28) on page 39 of the document http://modb.oce.ulg.ac.be/mediawiki/upload/Alex/AssimLecture/assim_lecture.pdf. Essentially you can also compute the Kalman gain as:
K = inv(inv(P) + H' * inv(R) * H) * H' * inv(R)
where inv(P) is the inverse of the matrix P. Both forms should be equal up to numerical precision.
For the model noise, you can just add a small covariance matrix Q to the prediction equation, P = F * P * F' + Q. It is often a matrix proportional to the identity matrix. If Q is too small, measurements will no longer affect the analysis after some iterations. If Q is too large, your model state x will be quite noisy.
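Here is a rough numpy sketch of these checks (my addition; the 3x3 shapes and values are arbitrary, not taken from the dataset):
import numpy as np

inv = np.linalg.inv

# 1) Unit-test idea: with H = I and R = P = sigma^2 * I, the analysis is the
#    average of the prior state and the measurement, and P is halved.
sigma2 = 2.0
P0 = sigma2 * np.eye(3)
x0 = np.array([1.0, 2.0, 3.0])
z = np.array([3.0, 2.0, 1.0])
K = P0 @ inv(P0 + sigma2 * np.eye(3))
x_an = x0 + K @ (z - x0)
P_an = (np.eye(3) - K) @ P0
assert np.allclose(x_an, (x0 + z) / 2)
assert np.allclose(P_an, P0 / 2)

# 2) The standard Kalman gain and the alternate form from the bullet above agree.
P = np.diag([2.0, 3.0, 4.0])
H = np.eye(3)
R = 0.5 * np.eye(3)
K1 = P @ H.T @ inv(H @ P @ H.T + R)                  # standard form
K2 = inv(inv(P) + H.T @ inv(R) @ H) @ H.T @ inv(R)   # alternate form
assert np.allclose(K1, K2)

# 3) Process noise enters the prediction step as P = F * P * F' + Q.
F = np.eye(3)
Q = 0.01 * np.eye(3)
P_pred = F @ P @ F.T + Q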
Some coding remarks:
If measurements, H and R are also parameters of the kalman_filter function, then your function would be more easily portable to a different case (see the sketch below).
Are you aware of numpy for matrix operations? It has extensive support for matrix and array operations and uses highly optimized libraries for the computation.
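A sketch of that refactor (my own parameter names), reusing the matrix class from the question:
# Same update as before, but everything it needs is passed in explicitly
# instead of being read from globals.
def kalman_filter(x, P, measurement, H, R, F, I):
    # measurement update
    y = measurement - H * x
    s = H * P * H.transpose() + R
    K = P * H.transpose() * s.inverse()
    x = x + K * y
    P = (I - K * H) * P
    # prediction
    x = F * x
    P = F * P * F.transpose()
    return x, P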
Let me know if this helps!
I want to compare the performance of two models using the F statistic. Here is a reproducible example and the expected results:
load carbig
tbl = table(Acceleration,Cylinders,Horsepower,MPG);
% Testing both models separately
mdl1 = fitlm(tbl,'MPG~1+Acceleration+Cylinders+Horsepower');
mdl2 = fitlm(tbl,'MPG~1+Acceleration');
% Comparing both models using the F-test and p-value
numerator = (mdl2.SSE-mdl1.SSE)/(mdl1.NumCoefficients-mdl2.NumCoefficients);
denominator = mdl1.SSE/mdl1.DFE;
F = numerator/denominator;
p = 1-fcdf(F,mdl1.NumCoefficients-mdl2.NumCoefficients,mdl1.DFE);
We end up with F = 298.75 and p = 0, indicating mdl1 is significantly better than mdl2, as assessed by the F statistic.
Is there any way to obtain the F and p values without running fitlm twice and doing all the computation by hand?
I tried to run coefTest, as suggested by #Glen_b; however, the function is poorly documented and the results are not the ones I'm expecting.
[p,F] = coefTest(mdl1); % p = 0, F = 262.508 (this F test mdl1 vs constant mdl)
[p,F] = coefTest(mdl1,[0,0,1,1]); % p = 0, F = 57.662 (not sure what this is testing)
[p,F] = coefTest(mdl1,[1,1,0,0]); % p = 0, F = 486.810 (idem)
I believe I should carry out the test with a different null hypothesis (C) using the function [p,F] = coefTest(mdl1,H,C), but I don't really know how to do that and there's no example.
This answer concerns comparing two linear regression models where one model is a restricted version of the other.
Short answer:
To do an F-test on the restriction that the 3rd and 4th elements of your estimated coefficient vector b are zero:
[p, F] = coefTest(mdl1, [0, 0, 1, 0; 0, 0, 0, 1]);
Further explanation:
Let b be our estimated coefficient vector. Linear restrictions on b are typically written in matrix form: R*b = r. The restriction that the 3rd and 4th elements of b are zero would be written:
[0, 0, 1, 0; 0, 0, 0, 1] * b = [0; 0]
The matrix [0, 0, 1, 0; 0, 0, 0, 1] is what coefTest calls the H matrix in the docs.
P = coefTest(M,H), with H a numeric matrix having one column for each
coefficient, performs an F test that H*B=0, where B represents the
coefficient vector.
Long version
Sometimes with these econometric routines it's nice just to write it out yourself so you know what's really going on.
Remove rows with NaN because they just add unrelated complexity:
tbl_dirty = table(Acceleration,Cylinders,Horsepower,MPG);
tbl = tbl_dirty(~any(ismissing(tbl_dirty),2),:);
Do the estimation etc...
n = height(tbl); % number of observations
y = tbl.MPG;
X = [ones(n, 1), tbl.Acceleration, tbl.Cylinders, tbl.Horsepower];
k = size(X,2); % number of variables (including constant)
b = X \ y; % estimate b with least squares
u = y - X * b; % calculates residuals
s2 = u' * u / (n - k); % estimate variance of error term (assuming homoskedasticity, independent observations)
BCOV = inv(X'*X) * s2; % get covariance matrix of b assuming homoskedasticity of error term etc...
bse = diag(BCOV).^.5; % standard errors
R = [0, 0, 1, 0;
0, 0, 0, 1];
r = [0; 0]; % Testing restriction: R * b = r
num_restrictions = size(R, 1);
F = (R*b - r)'*inv(R * BCOV * R')*(R*b - r) / num_restrictions; % F-stat (see Hiyashi for reference)
Fp = 1 - fcdf(F, num_restrictions, n - k); % F p-val
For reference, you can look at p. 65 of Hayashi's book Econometrics.
No, there is not.
fitlm fits an arbitrary model; in your case, a regression model with an intercept and either one or three regressors. It might seem that the model with three regressors could reuse information from the model with one regressor, but this is only true if there are some restrictions on the model, and even then the overlapping information is limited.
fitlm is a very general framework that can be used for arbitrary models. Doing multiple regressions at the same time while sharing information would therefore get quite complex, and it is not implemented.
It is possible to implement this yourself for these two specific models. Usually such a linear regression is solved using the covariance matrix:
Beta = (X' * X)^-1 * X' * y
where X is the data with the variables as columns and y is the target variable. In this case you could reuse the part of the covariance matrix for which you only need the columns from the smaller regression: the variation in Acceleration. Since adding 2 new variables adds 8 values to the covariance matrix, you only save 1/9 of the time. Furthermore, the heaviest part is the inversion, so the time improvement is very small.
In short, just do two separate regressions.