I have a loss to which every layer's output contributes. Which of the following approaches is correct for making sure the weights of all layers are updated properly?
# option 1
x2 = self.layer1(x1)
x3 = self.layer2(x2)
x4 = self.layer3(x3)
In the second option, I detach the output before feeding it into each subsequent block:
# option 2
# x2 = self.layer1(x1.detach())
# x3 = self.layer2(x2.detach())
# x4 = self.layer3(x3.detach())
Both options are then followed by shared ops which calculate 4 losses and sum them:
x4 = F.relu(self.bn1(x4))
loss = some_loss([x1, x2, x3, x4])
Option 1 is correct. When you detach a tensor, its computation history (the autograd graph) is discarded, so gradients will not be propagated to the inputs or through any computation done before the detach.
This can also be seen in the following toy experiment:
In [14]: import torch
In [15]: x = torch.rand(10,10).requires_grad_()
In [16]: y = x**2
In [19]: z = torch.sum(y)
In [20]: z.backward()
In [23]: x.grad is not None
Out[23]: True
Using detach
In [26]: x = torch.rand(10,10).requires_grad_()
In [27]: y = x**2
In [28]: z = torch.sum(y)
In [29]: z_ = z.detach()
In [30]: z_.backward()
# this raises a RuntimeError: z_ has no grad_fn, so there is nothing to backpropagate through
This is because detach returns a new tensor that shares the same data but is cut off from the computation graph, so the information about previous computations is lost.
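To tie this back to the question, here is a minimal sketch (two stand-in Linear layers, not your actual model) contrasting the two options:
import torch

# option 1 style: keep the graph intact
layer1 = torch.nn.Linear(4, 4)
layer2 = torch.nn.Linear(4, 4)
x1 = torch.randn(2, 4)
x2 = layer1(x1)
x3 = layer2(x2)
(x2.sum() + x3.sum()).backward()
print(layer1.weight.grad is None)   # False: gradient reaches layer1 from both loss terms

# option 2 style: detaching before the next block cuts the graph
layer1b = torch.nn.Linear(4, 4)
layer2b = torch.nn.Linear(4, 4)
x2b = layer1b(x1)
x3b = layer2b(x2b.detach())
x3b.sum().backward()
print(layer1b.weight.grad is None)  # True: nothing flows back through the detach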
I implemented a custom loss function. However, the gradient of this function is always zero and I don't understand why.
The code for the objective function:
def objective(p, output):
    x, y = p
    a = minA
    b = minB
    r = 0.1
    XA = 1/2 - 1/2 * torch.tanh(100*((x - a[0])**2 + (y - a[1])**2 - (r + 0.02)**2))
    XB = 1/2 - 1/2 * torch.tanh(100*((x - b[0])**2 + (y - b[1])**2 - (r + 0.02)**2))
    q = (1 - XA)*((1 - XB)*output + XB)
    output_grad, _ = torch.autograd.grad(q, (x, y))
    output_grad.requires_grad_()
    q = output_grad**2
    return q
And the code for training the model (which is a simple, fully connected NN):
model = NN(input_size)
optimizer = optim.SGD(model.parameters(), lr=learning_rate)

for e in range(epochs):
    for configuration in total:
        print("Train for configuration", configuration)
        # Training pass
        optimizer.zero_grad()
        # output is q~
        output = model(configuration)
        # loss is the objective function we defined
        loss = objective(configuration, output.item())
        loss.backward()
        optimizer.step()
I really think the problem is in the output_grad, _ = torch.autograd.grad(q, (x,y)).
(During the training, "configuration" is a point sampled from a distribution and identified by the coordinates x and y.)
Thanks!!
Here I provide the code in a Google Colab session:
Google colab
Tanh is a bounded function and converges quite quickly to 1. Your XA and XB points are defined as
XA = 1/2 - 1/2 * torch.tanh(100*(z1 + z2 - z0))
XB = 1/2 - 1/2 * torch.tanh(100*(z3 + z4 - z0))
Since z1 + z2 - z0 and z3 + z4 - z0 are rather close to 1, you will end up with an input close to 100. This means the tanh will output 1, resulting in XA and XB being zero. You might not want this 100 coefficient if you want non-zero outputs.
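A quick toy check of the saturation (not your code, just torch.tanh on a large input):
import torch

z = torch.tensor(100.0, requires_grad=True)
out = torch.tanh(z)
out.backward()
print(out)     # saturates at 1.0
print(z.grad)  # 1 - tanh(100)**2 == 0, so no gradient flows through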
There is an unknown function f(x) = k*exp(l*x),
and there are unknown coefficients k, l. The task is to estimate k and l using linear regression, through the data table.
 x      f(x)
-2.0    1.719334581463762
-1.0    1.900158577875515
 0.0    2.1
 1.0    2.3208589279588603
 2.0    2.5649457921363568
So far, mathematically, I have taken the logarithm of both sides, which gives log(f(x)) = log(k) + l*x.
Then, using the data table, five such equations can be formed.
Now apply a linear regressor to this logarithm-transformed data to estimate the coefficients k and l.
I have built a linear regressor:
using DataFrames, GLM

function LinearRegression(X)
    x = X[:,1]
    y = X[:,2]
    data = DataFrame(y = y, x = x)
    reg = lm(@formula(y ~ x), data)
    return coef(reg)[2], coef(reg)[1]
end
Any suggestions on how to find the l and k values using this technique?
You're almost there, but I think there is a mathematical misconception in your code. You are right that taking the log of f(x) makes this essentially a linear fit (of the form y = mx + b), but you haven't told the code that, i.e. your LinearRegression function should read:
function LinearRegression(X)
    x = X[:,1]
    y = X[:,2]
    data = DataFrame(y = log.(y), x = x)
    reg = lm(@formula(y ~ x), data)
    return coef(reg)[2], coef(reg)[1]
end
Note that I have written y = log.(y) to match the formula; otherwise you would be fitting a line to exponential data. We don't take the log of x because it has negative values. Your function will then return the correct coefficients l and log(k) (so if you want k itself you need to take the exponential) -- see this plot as proof that it fits the data perfectly!
You need to convert the intercept with exp; the slope stays as it is.
using Statistics #mean
#Data
X = [-2.0 1.719334581463762
-1.0 1.900158577875515
0.0 2.1
1.0 2.3208589279588603
2.0 2.5649457921363568]
x = X[:,1]
y = X[:,2]
yl = log.(y)
#Get a and b for: log.(y) = a + b*x
b = x \ yl
a = mean(yl) - b * mean(x)
l = b
#0.10000000000000005
k = exp(a)
#2.1
k*exp.(l.*x)
#5-element Vector{Float64}:
# 1.719334581463762
# 1.900158577875515
# 2.1
# 2.3208589279588603
# 2.5649457921363568
I am trying to implement a custom loss function in a PyTorch autoencoder.
The loss function tries to maximize the cosine similarity between a given output tensor U (a vector) and 100 random vectors J where both U and J have the same dimension of [300]. This is repeated for each batch.
Suppose we have 30 items per batch, then the output tensor is
train_Y.shape = [30,300]
Random_vectors.shape = [30,100,300]
I can implement the loss function in two ways:
All_Y = []
for Y, z_r in zip(train_y, random_vectors):
    Y_cosine_list = []
    for z in z_r:
        cosi = torch.dot(Y, z) / (torch.norm(Y)*torch.norm(z))
        Y_cosine_list.append(cosi)
    All_Y.append(Y_cosine_list)

All_Y = torch.tensor(All_Y).to(device)
train_loss = torch.sum(torch.abs(All_Y))/dim_0
train_loss = torch.tensor(train_loss.data, requires_grad=True)
or
train_Y = torch.zeros([dim_0, 100])
for i, (Y, z_r) in enumerate(zip(train_Y, random_vectors)):
    for j, z in enumerate(z_r):
        train_Y[i, j] = cos(Y, z)

train_Y = train_Y.to(device)
train_loss = torch.sum(torch.abs(train_Y))/dim_0
The second one is more elegant and to the point. However, it gives a "CUDA illegal memory access" error. I have checked that memory is not exceeded in either case. Is there anything wrong with the second implementation?
The first implementation is inelegant and I am not sure that it makes sense from a neural-net optimization perspective. But it does not give errors, and I am able to complete training for all the epochs.
PS: I have tried encapsulating this code block in a loss_fn method, but I get the same illegal memory access error.
I have tried everything that I could find for the illegal memory access error (changing GPUs, removing a torch.stack block, etc.), but I can't seem to get rid of the problem.
Here is a vectorized way to do it
class CosineLoss(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x, y):
        """
        Args:
            x (torch.tensor): [batchsize, N, M] - tensor.
            y (torch.tensor): [batchsize, M] - tensor.

        Returns:
            torch.tensor: scalar mean cosine loss
        """
        # dot product along dimension 'm', i.e. multiply and sum along 'm'
        dotp = torch.einsum("bm, bnm -> bn", y, x)
        # L2 norm along dimension 'm' and multiply by broadcasting
        length = torch.norm(y, dim=-1)[:, None]*torch.norm(x, dim=-1)
        # cosine = dot product of unit vectors
        cos = dotp/length
        return cos.mean()


def test():
    b, n, m = 30, 100, 300
    train_Y = torch.randn(b, m, device='cuda')
    random_vectors = torch.randn(b, n, m, requires_grad=True, device='cuda')
    print(f'{random_vectors.grad = }')

    cosineloss = CosineLoss()
    loss = cosineloss(random_vectors, train_Y)
    print(f'{loss = }')

    loss.backward()
    print(f'{random_vectors.grad.shape = }')
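As a side note (my own variant, not part of the vectorized answer above): under the same [batchsize, N, M] and [batchsize, M] shape assumptions, F.cosine_similarity gives essentially the same [batchsize, N] cosine matrix once the smaller tensor is expanded:
import torch
import torch.nn.functional as F

b, n, m = 30, 100, 300
x = torch.randn(b, n, m)   # the random vectors
y = torch.randn(b, m)      # the model outputs

# expand y to [b, n, m] so it lines up with x, then take the cosine along the last dim
cos = F.cosine_similarity(y.unsqueeze(1).expand_as(x), x, dim=-1)  # shape [b, n]
loss = cos.mean()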
References:
einsum
broadcasting
I have solved a differential equation with a neural net. I leave code below with an example. I want to be able to compute the first derivative of this neural net with respect to its input "x" and evaluate this derivative for any "x".
1- Notice that I compute der = discretization.derivative. Is that the derivative of the neural net with respect to "x"? With this expression, if I type [first(der(phi, u, [x], 0.00001, 1, res.minimizer)) for x in xs] I get something that may be the derivative, but I cannot find a way to extract it into an array, let alone plot it. How can I evaluate this derivative at any point, say for all points in the array defined below as "xs"? In the Update below I give a more straightforward approach I took to try to compute the derivative (but still did not succeed).
2- Is there any other way that I could take the derivative with respect to x of the neural net?
I am new to Julia, so I am struggling a bit with how to manipulate the data types. Thanks for any suggestions!
Update: I found a way to see the symbolic expression for the neural net doing the following:
predict(x) = first(phi(x,res.minimizer))
df(x) = gradient(predict, x)[1]
After running those two lines of code, typing predict(x) or df(x) in the REPL spits out the full neural net with the weights and biases of the solution. However, I cannot evaluate the gradient; it throws an error. How can I evaluate the gradient of my function predict(x) with respect to x?
The original code creating the neural net and solving the equation
using NeuralPDE, Flux, ModelingToolkit, GalacticOptim, Optim, DiffEqFlux
import ModelingToolkit: Interval, infimum, supremum
@parameters x
@variables u(..)
Dx = Differential(x)
a = 0.5
eq = Dx(u(x)) ~ -log(x*a)
# Initial and boundary conditions
bcs = [u(0.) ~ 0.01]
# Space and time domains
domains = [x ∈ Interval(0.01,1.0)]
# Neural network
n = 15
chain = FastChain(FastDense(1,n,tanh),FastDense(n,1))
discretization = PhysicsInformedNN(chain, QuasiRandomTraining(100))
@named pde_system = PDESystem(eq,bcs,domains,[x],[u(x)])
prob = discretize(pde_system,discretization)
const losses = []
cb = function (p,l)
    push!(losses, l)
    if length(losses)%100==0
        println("Current loss after $(length(losses)) iterations: $(losses[end])")
    end
    return false
end
res = GalacticOptim.solve(prob, ADAM(0.01); cb = cb, maxiters=300)
prob = remake(prob,u0=res.minimizer)
res = GalacticOptim.solve(prob,BFGS(); cb = cb, maxiters=1000)
phi = discretization.phi
der = discretization.derivative
using Plots
analytic_sol_func(x) = (1.0+log(1/a))*x-x*log(x)
dx = 0.05
xs = LinRange(0.01,1.0,50)
u_real = [analytic_sol_func(x) for x in xs]
u_predict = [first(phi(x,res.minimizer)) for x in xs]
x_plot = collect(xs)
xconst = analytic_sol_func(1)*ones(size(xs))
plot(x_plot ,u_real,title = "Solution",linewidth=3)
plot!(x_plot ,u_predict,line =:dashdot,linewidth=2)
The solution I found consists of differentiating the approximation with the help of ForwardDiff.
So if the neural-network approximation to the unknown function is called "funcres", then we take its derivative with respect to x as shown below.
using ForwardDiff
funcres(x) = first(phi(x,res.minimizer))
dxu = ForwardDiff.derivative.(funcres, Array(x_plot))
display(plot(x_plot,dxu,title = "Derivative",linewidth=3))
I recently read this paper, which introduces a process called "Warm-Up" (WU). It consists of multiplying the KL-divergence term of the loss by a variable whose value depends on the epoch number (it evolves linearly from 0 to 1).
I was wondering if this is the right way to do that:
beta = K.variable(value=0.0)

def vae_loss(x, x_decoded_mean):
    # cross entropy
    xent_loss = K.mean(objectives.categorical_crossentropy(x, x_decoded_mean))

    # kl divergence
    for k in range(n_sample):
        epsilon = K.random_normal(shape=(batch_size, latent_dim), mean=0.,
                                  std=1.0)  # used for every z_i sampling
        # Sample several layers of latent variables
        for mean, var in zip(means, variances):
            z_ = mean + K.exp(K.log(var) / 2) * epsilon

            # build z
            try:
                z = tf.concat([z, z_], -1)
            except NameError:
                z = z_
            except TypeError:
                z = z_

            # sum loss (using a MC approximation)
            try:
                loss += K.sum(log_normal2(z_, mean, K.log(var)), -1)
            except NameError:
                loss = K.sum(log_normal2(z_, mean, K.log(var)), -1)

        print("z", z)
        loss -= K.sum(log_stdnormal(z), -1)
        z = None

    kl_loss = loss / n_sample
    print('kl loss:', kl_loss)

    # result
    result = beta*kl_loss + xent_loss
    return result
# define callback to change the value of beta at each epoch
def warmup(epoch):
    value = (epoch/10.0) * (epoch <= 10.0) + 1.0 * (epoch > 10.0)
    print("beta:", value)
    beta = K.variable(value=value)

from keras.callbacks import LambdaCallback
wu_cb = LambdaCallback(on_epoch_end=lambda epoch, log: warmup(epoch))

# train model
vae.fit(
    padded_X_train[:last_train,:,:],
    padded_X_train[:last_train,:,:],
    batch_size=batch_size,
    nb_epoch=nb_epoch,
    verbose=0,
    callbacks=[tb, wu_cb],
    validation_data=(padded_X_test[:last_test,:,:], padded_X_test[:last_test,:,:])
)
This will not work. I tested it to figure out exactly why it was not working. The key thing to remember is that Keras creates a static graph once at the beginning of training.
Therefore, the vae_loss function is called only once to create the loss tensor, which means that the reference to the beta variable will remain the same every time the loss is calculated. However, your warmup function reassigns beta to a new K.variable. Thus, the beta that is used for calculating loss is a different beta than the one that gets updated, and the value will always be 0.
It is an easy fix. Just change this line in your warmup callback:
beta = K.variable(value=value)
to:
K.set_value(beta, value)
This way the actual value in beta gets updated "in place" rather than creating a new variable, and the loss will be properly re-calculated.
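For completeness, a minimal toy sketch of this pattern (the model and loss below are stand-ins I made up, not the VAE from the question): beta is created once, the compiled loss closes over it, and the callback only updates its value in place with K.set_value.
import numpy as np
from keras import backend as K
from keras.callbacks import LambdaCallback
from keras.layers import Dense, Input
from keras.models import Model

beta = K.variable(0.0)  # created once; the compiled loss keeps a reference to it

def toy_loss(y_true, y_pred):
    mse = K.mean(K.square(y_true - y_pred))
    penalty = K.mean(K.square(y_pred))      # stand-in for the KL term
    return mse + beta * penalty

def warmup(epoch):
    K.set_value(beta, min(epoch / 10.0, 1.0))  # linear ramp from 0 to 1 over 10 epochs

inp = Input(shape=(4,))
out = Dense(4)(inp)
model = Model(inp, out)
model.compile(optimizer='sgd', loss=toy_loss)

X = np.random.rand(32, 4)
wu_cb = LambdaCallback(on_epoch_end=lambda epoch, logs: warmup(epoch))
model.fit(X, X, epochs=3, verbose=0, callbacks=[wu_cb])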