Variational auto-encoder: implementing warm-up in Keras - deep-learning

I recently read this paper, which introduces a process called "Warm-Up" (WU). It consists of multiplying the KL-divergence term of the loss by a variable whose value depends on the epoch number (it increases linearly from 0 to 1).
I was wondering if this is the right way to do that:
beta = K.variable(value=0.0)

def vae_loss(x, x_decoded_mean):
    # cross entropy
    xent_loss = K.mean(objectives.categorical_crossentropy(x, x_decoded_mean))
    # kl divergence
    for k in range(n_sample):
        epsilon = K.random_normal(shape=(batch_size, latent_dim), mean=0.,
                                  std=1.0)  # used for every z_i sampling
        # Sample several layers of latent variables
        for mean, var in zip(means, variances):
            z_ = mean + K.exp(K.log(var) / 2) * epsilon
            # build z
            try:
                z = tf.concat([z, z_], -1)
            except NameError:
                z = z_
            except TypeError:
                z = z_
            # sum loss (using a MC approximation)
            try:
                loss += K.sum(log_normal2(z_, mean, K.log(var)), -1)
            except NameError:
                loss = K.sum(log_normal2(z_, mean, K.log(var)), -1)
        print("z", z)
        loss -= K.sum(log_stdnormal(z), -1)
        z = None
    kl_loss = loss / n_sample
    print('kl loss:', kl_loss)
    # result
    result = beta*kl_loss + xent_loss
    return result

# define callback to change the value of beta at each epoch
def warmup(epoch):
    value = (epoch/10.0) * (epoch <= 10.0) + 1.0 * (epoch > 10.0)
    print("beta:", value)
    beta = K.variable(value=value)

from keras.callbacks import LambdaCallback
wu_cb = LambdaCallback(on_epoch_end=lambda epoch, log: warmup(epoch))

# train model
vae.fit(
    padded_X_train[:last_train,:,:],
    padded_X_train[:last_train,:,:],
    batch_size=batch_size,
    nb_epoch=nb_epoch,
    verbose=0,
    callbacks=[tb, wu_cb],
    validation_data=(padded_X_test[:last_test,:,:], padded_X_test[:last_test,:,:])
)

This will not work. I tested it to figure out exactly why it was not working. The key thing to remember is that Keras creates a static graph once at the beginning of training.
Therefore, the vae_loss function is called only once to create the loss tensor, which means that the reference to the beta variable will remain the same every time the loss is calculated. However, your warmup function reassigns beta to a new K.variable. Thus, the beta that is used for calculating loss is a different beta than the one that gets updated, and the value will always be 0.
It is an easy fix. Just change this line in your warmup callback:
beta = K.variable(value=value)
to:
K.set_value(beta, value)
This way the actual value in beta gets updated "in place" rather than creating a new variable, and the loss will be properly re-calculated.
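The difference between rebinding the name and updating the value in place can be illustrated without Keras. The following is only a pure-Python analogy: `Var`, `build_loss`, `schedule`, `warmup_rebind` and `warmup_in_place` are hypothetical names introduced for this sketch, with `Var` standing in for a backend variable.

```python
class Var:
    """Hypothetical stand-in for a backend variable such as K.variable."""
    def __init__(self, value):
        self.value = value

beta = Var(0.0)

def build_loss(beta_ref):
    # Called once, like vae_loss when the graph is built:
    # the returned function keeps pointing at this exact object.
    def loss(kl_loss, xent_loss):
        return beta_ref.value * kl_loss + xent_loss
    return loss

loss_fn = build_loss(beta)

def schedule(epoch):
    # Linear warm-up from 0 to 1 over the first 10 epochs.
    return min(epoch / 10.0, 1.0)

def warmup_rebind(epoch):
    # Mirrors `beta = K.variable(value=value)`: a new object is bound
    # to a local name, so loss_fn never sees it.
    beta = Var(schedule(epoch))

def warmup_in_place(epoch):
    # Mirrors `K.set_value(beta, value)`: the shared object is mutated.
    beta.value = schedule(epoch)

warmup_rebind(5)
print(loss_fn(2.0, 1.0))   # 1.0, beta is still 0.0
warmup_in_place(5)
print(loss_fn(2.0, 1.0))   # 2.0, beta is now 0.5
```

Only the in-place update changes what the already-built loss computes, which is the behaviour the fix above relies on.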

Related

Non-linear fit Gnu Octave

I have a problem performing a non-linear fit with GNU Octave. Basically, I need to perform a global fit with some shared parameters while keeping others fixed.
The following code works perfectly in Matlab, but Octave returns an error:
error: operator *: nonconformant arguments (op1 is 34x1, op2 is 4x1)
Here are my code and the data to play with:
clear
close all
clc
pkg load optim
D = dlmread('hd', ';'); % raw data
bkg = D(1,2:end); % 4 sensors bkg
x = D(2:end,1); % input signal
Y = D(2:end,2:end); % 4 sensors response
W = 1./Y; % weights
b0 = [7 .04 .01 .1 .5 2 1]; % educated guess to start the fit
%% model function
F = @(b) ((bkg + (b(1) - bkg).*(1-exp(-(b(2:5).*x).^b(6))).^b(7)) - Y) .* W;
opts = optimset("Display", "iter");
lb = [5 .001 .001 .001 .001 .01 1];
ub = [];
[b, resnorm, residual, exitflag, output, lambda, Jacob] = ...
lsqnonlin(F,b0,lb,ub,opts)
To give more detail: in the array b0, the parameters b0(1), b0(6) and b0(7) are shared among the 4 datasets, while b0(2:5) are specific to each dataset.
Thank you for your help and suggestions! ;)
Raw data:
0,0.3105,0.31342,0.31183,0.31117
0.013229,0.329,0.3295,0.332,0.372
0.013229,0.328,0.33,0.33,0.373
0.021324,0.33,0.3305,0.33633,0.399
0.021324,0.325,0.3265,0.333,0.397
0.037763,0.33,0.3255,0.34467,0.461
0.037763,0.327,0.3285,0.347,0.456
0.069405,0.338,0.3265,0.36533,0.587
0.069405,0.3395,0.329,0.36667,0.589
0.12991,0.357,0.3385,0.41333,0.831
0.12991,0.358,0.3385,0.41433,0.837
0.25368,0.393,0.347,0.501,1.302
0.25368,0.3915,0.3515,0.498,1.278
0.51227,0.458,0.3735,0.668,2.098
0.51227,0.47,0.3815,0.68467,2.124
1.0137,0.61,0.4175,1.008,3.357
1.0137,0.599,0.422,1,3.318
2.0162,0.89,0.5335,1.645,5.006
2.0162,0.872,0.5325,1.619,4.938
4.0192,1.411,0.716,2.674,6.595
4.0192,1.418,0.7205,2.691,6.766
8.0315,2.34,1.118,4.195,7.176
8.0315,2.33,1.126,4.161,6.74
16.04,3.759,1.751,5.9,7.174
16.04,3.762,1.748,5.911,7.151
32.102,5.418,2.942,7.164,7.149
32.102,5.406,2.941,7.164,7.175
64.142,7.016,4.478,7.174,7.176
64.142,7.018,4.402,7.175,7.175
128.32,7.176,6.078,7.175,7.176
128.32,7.175,6.107,7.175,7.173
255.72,7.165,7.162,7.165,7.165
255.72,7.165,7.164,7.166,7.166
511.71,7.165,7.165,7.165,7.165
511.71,7.165,7.165,7.166,7.164
Given the function definition above, calling F(b0) in the command window returns a 34x4 matrix, which is correct since the variable Y has the same size.
That way I can (in theory) compute the standard lsqnonlin objective, the sum of (fit - measured)^2.
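To make the shapes and the role of each parameter concrete, here is a pure-Python sketch of the same weighted residual, evaluated on three rows taken from the raw data above. It is a port for illustration only, not a fix for the Octave error; the function name `residuals` and the 0-based indexing (b(1) becomes b[0], b(2:5) becomes b[1:5]) are choices made for this sketch.

```python
import math

# Background row and three (x, sensor readings) rows from the raw data.
bkg = [0.3105, 0.31342, 0.31183, 0.31117]
x = [0.013229, 0.021324, 0.037763]
Y = [
    [0.329, 0.3295, 0.332, 0.372],
    [0.33, 0.3305, 0.33633, 0.399],
    [0.33, 0.3255, 0.34467, 0.461],
]
b0 = [7, .04, .01, .1, .5, 2, 1]

def residuals(b, x, Y, bkg):
    # b[0], b[5], b[6] are shared by all sensors; b[1:5] holds one rate per sensor.
    top, rates, p, q = b[0], b[1:5], b[5], b[6]
    R = []
    for xi, y_row in zip(x, Y):
        row = []
        for k, bg, y in zip(rates, bkg, y_row):
            model = bg + (top - bg) * (1.0 - math.exp(-(k * xi) ** p)) ** q
            row.append((model - y) / y)  # weight W = 1/Y
        R.append(row)
    return R

R = residuals(b0, x, Y, bkg)
# The least-squares objective is the sum of squares over all entries.
resnorm = sum(r * r for row in R for r in row)
print(len(R), len(R[0]), resnorm)
```

Each row of the result corresponds to one input value and each column to one sensor, so with the full data set the residual has the 34x4 shape described above.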

Iterative loss function Autoencoders

I am trying to implement a custom loss function in a Pytorch Autoencoder.
The loss function tries to maximize the cosine similarity between a given output tensor U (a vector) and 100 random vectors J where both U and J have the same dimension of [300]. This is repeated for each batch.
Suppose we have 30 items per batch, then the output tensor is
train_Y.shape = [30,300]
Random_vectors.shape = [30,100,300]
I can implement the loss function in two ways:
All_Y = []
for Y, z_r in zip(train_y, random_vectors):
    Y_cosine_list = []
    for z in z_r:
        cosi = torch.dot(Y, z) / (torch.norm(Y)*torch.norm(z))
        Y_cosine_list.append(cosi)
    All_Y.append(Y_cosine_list)
All_Y = torch.tensor(All_Y).to(device)
train_loss = torch.sum(torch.abs(All_Y))/dim_0
train_loss = torch.tensor(train_loss.data, requires_grad = True)
or
train_Y = torch.zeros([dim_0, 100])
for i, (Y, z_r) in enumerate(zip(train_Y, random_vectors)):
    for j, z in enumerate(z_r):
        train_Y[i,j] = cos(Y, z)
train_Y = train_Y.to(device)
train_loss = torch.sum(torch.abs(train_Y))/dim_0
The second one is more elegant and to the point. However, it gives a "CUDA illegal memory access" error. I have checked that the memory is not exceeded in either case. Is there anything wrong with the second implementation?
The first implementation is inelegant, and I am not sure that it makes sense from a neural net optimization perspective. But it does not give errors, and I am able to complete training for all the epochs.
PS: I have tried encapsulating this code block in a loss_fn method, but I get the same illegal memory access error.
I have tried everything that I could find for the illegal memory access error: changing GPUs, removing a torch.stack block, etc. But I can't seem to get rid of the problem.
Here is a vectorized way to do it:
import torch
from torch import nn

class CosineLoss(nn.Module):
    def __init__(self, ):
        super().__init__()
        pass

    def forward(self, x, y):
        """
        Args:
            x (torch.tensor): [batchsize, N, M] - tensor.
            y (torch.tensor): [batchsize, M] - tensor.
        Returns:
            torch.tensor: scalar mean cosine loss
        """
        # dot product along dimension 'm' i.e multiply and sum along 'm'.
        dotp = torch.einsum("bm, bnm -> bn", y, x)
        # L2 norm along dimension 'm' and multiply by broadcasting
        length = torch.norm(y, dim=-1)[:, None]*torch.norm(x, dim=-1)
        # cosine = dotproduct of unit vectors
        cos = dotp/length
        return cos.mean()

def test():
    b, n, m = 30, 100, 300
    train_Y = torch.randn(b, m, device='cuda')
    random_vectors = torch.randn(b, n, m, requires_grad=True, device='cuda')
    print(f'{random_vectors.grad = }')
    cosineloss = CosineLoss()
    loss = cosineloss(random_vectors, train_Y)
    print(f'{loss = }')
    loss.backward()
    print(f'{random_vectors.grad.shape = }')
References:
einsum
broadcasting
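For readers unfamiliar with the einsum notation, the following pure-Python sketch spells out with explicit loops what "bm, bnm -> bn" followed by the norm division computes. It is only a reference for checking small cases, not a replacement for the vectorized version; `batched_cosine` is a name made up for this sketch.

```python
import math

def batched_cosine(y, x):
    """y: nested list [b][m], x: nested list [b][n][m] -> cosines [b][n]."""
    out = []
    for y_b, x_b in zip(y, x):                 # loop over the batch index b
        norm_y = math.sqrt(sum(v * v for v in y_b))
        row = []
        for x_bn in x_b:                       # loop over the n vectors
            dot = sum(a * c for a, c in zip(y_b, x_bn))   # sum over m
            norm_x = math.sqrt(sum(v * v for v in x_bn))
            row.append(dot / (norm_y * norm_x))
        out.append(row)
    return out

y = [[1.0, 0.0], [3.0, 4.0]]
x = [
    [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]],
    [[3.0, 4.0], [4.0, -3.0], [6.0, 8.0]],
]
cos = batched_cosine(y, x)
mean_cos = sum(c for row in cos for c in row) / (len(cos) * len(cos[0]))
print(cos, mean_cos)
```

Parallel vectors give a cosine of 1, orthogonal ones 0 and opposite ones -1, which makes small hand-built inputs like these convenient for comparing against the tensor implementation.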

How can I evaluate and take the derivative of a neural net in Julia

I have solved a differential equation with a neural net. I leave code below with an example. I want to be able to compute the first derivative of this neural net with respect to its input "x" and evaluate this derivative for any "x".
1- Notice that I compute der = discretization.derivative. Is that the derivative of the neural net with respect to "x"? With this expression, if I type [first(der(phi, u, [x], 0.00001, 1, res.minimizer)) for x in xs] I get something that may be the derivative, but I cannot find a way to extract it into an array, let alone plot it. How can I evaluate this derivative at any point, let's say at all points in the array defined below as "xs"? In the Update below I give a more straightforward approach I took to compute the derivative (still without success).
2- Is there any other way that I could take the derivative with respect to x of the neural net?
I am new to Julia, so I am struggling a bit with how to manipulate the data types. Thanks for any suggestions!
Update: I found a way to see the symbolic expression for the neural net doing the following:
predict(x) = first(phi(x,res.minimizer))
df(x) = gradient(predict, x)[1]
After running these two lines of code, type predict(x) or df(x) in the REPL and it will print the full neural net with the weights and biases of the solution. However, I cannot evaluate the gradient; it throws an error. How can I evaluate the gradient of my function predict(x) with respect to x?
The original code creating the neural net and solving the equation
using NeuralPDE, Flux, ModelingToolkit, GalacticOptim, Optim, DiffEqFlux
import ModelingToolkit: Interval, infimum, supremum
@parameters x
@variables u(..)
Dx = Differential(x)
a = 0.5
eq = Dx(u(x)) ~ -log(x*a)
# Initial and boundary conditions
bcs = [u(0.) ~ 0.01]
# Space and time domains
domains = [x ∈ Interval(0.01,1.0)]
# Neural network
n = 15
chain = FastChain(FastDense(1,n,tanh),FastDense(n,1))
discretization = PhysicsInformedNN(chain, QuasiRandomTraining(100))
@named pde_system = PDESystem(eq,bcs,domains,[x],[u(x)])
prob = discretize(pde_system,discretization)
const losses = []
cb = function (p,l)
    push!(losses, l)
    if length(losses)%100==0
        println("Current loss after $(length(losses)) iterations: $(losses[end])")
    end
    return false
end
res = GalacticOptim.solve(prob, ADAM(0.01); cb = cb, maxiters=300)
prob = remake(prob,u0=res.minimizer)
res = GalacticOptim.solve(prob,BFGS(); cb = cb, maxiters=1000)
phi = discretization.phi
der = discretization.derivative
using Plots
analytic_sol_func(x) = (1.0+log(1/a))*x-x*log(x)
dx = 0.05
xs = LinRange(0.01,1.0,50)
u_real = [analytic_sol_func(x) for x in xs]
u_predict = [first(phi(x,res.minimizer)) for x in xs]
x_plot = collect(xs)
xconst = analytic_sol_func(1)*ones(size(xs))
plot(x_plot ,u_real,title = "Solution",linewidth=3)
plot!(x_plot ,u_predict,line =:dashdot,linewidth=2)
The solution I found is to differentiate the approximation with the help of ForwardDiff.
So if the neural network approximation to the unknown function is called "funcres", we take its derivative with respect to x as shown below.
using ForwardDiff
funcres(x) = first(phi(x,res.minimizer))
dxu = ForwardDiff.derivative.(funcres, Array(x_plot))
display(plot(x_plot,dxu,title = "Derivative",linewidth=3))
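ForwardDiff works by forward-mode automatic differentiation with dual numbers. As a rough illustration of that idea, here is a minimal pure-Python sketch that differentiates the analytic solution from the question (not the trained network) and compares the result with the right-hand side -log(a*x) of the equation. The `Dual` class and the helpers `dlog` and `derivative` are hypothetical names written for this sketch and support only the operations used here.

```python
import math

class Dual:
    """Value plus derivative, propagated together through arithmetic."""
    def __init__(self, val, der=0.0):
        self.val, self.der = val, der

    @staticmethod
    def lift(o):
        return o if isinstance(o, Dual) else Dual(o)

    def __add__(self, o):
        o = Dual.lift(o)
        return Dual(self.val + o.val, self.der + o.der)
    __radd__ = __add__

    def __sub__(self, o):
        o = Dual.lift(o)
        return Dual(self.val - o.val, self.der - o.der)

    def __mul__(self, o):
        o = Dual.lift(o)
        # product rule
        return Dual(self.val * o.val, self.der * o.val + self.val * o.der)
    __rmul__ = __mul__

def dlog(d):
    # chain rule for the natural logarithm
    return Dual(math.log(d.val), d.der / d.val)

def derivative(f, x):
    # Seed the input with derivative 1 and read off the output derivative.
    return f(Dual(x, 1.0)).der

a = 0.5

def analytic_sol(x):
    return (1.0 + math.log(1 / a)) * x - x * dlog(x)

for x0 in (0.1, 0.3, 0.9):
    print(x0, derivative(analytic_sol, x0), -math.log(a * x0))
```

The two printed columns should agree, since the analytic solution satisfies the differential equation; the same check can be applied to the network's derivative to judge how well it was trained.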

GNU Octave: 1/N Octave Smoothing of actual FFT Data (not the representation of it)

I would like to smooth an Impulse Response audio file. The FFT of the file shows that it is very spiky. I would like to smooth out the audio file, not just its plot, so that I have a smoother IR file.
I have found a function that shows the FFT plot smoothed out. How could this smoothing be applied to the actual FFT data and not just to the plot of it?
[y,Fs] = audioread('test\test IR.wav');
function x_oct = smoothSpectrum(X,f,Noct)
%SMOOTHSPECTRUM Apply 1/N-octave smoothing to a frequency spectrum
%% Input checking
assert(isvector(X), 'smoothSpectrum:invalidX', 'X must be a vector.');
assert(isvector(f), 'smoothSpectrum:invalidF', 'F must be a vector.');
assert(isscalar(Noct), 'smoothSpectrum:invalidNoct', 'NOCT must be a scalar.');
assert(isreal(X), 'smoothSpectrum:invalidX', 'X must be real.');
assert(all(f>=0), 'smoothSpectrum:invalidF', 'F must contain positive values.');
assert(Noct>=0, 'smoothSpectrum:invalidNoct', 'NOCT must be greater than or equal to 0.');
assert(isequal(size(X),size(f)), 'smoothSpectrum:invalidInput', 'X and F must be the same size.');
%% Smoothing
% calculates a Gaussian function for each frequency, deriving a
% bandwidth for that frequency
x_oct = X; % initial spectrum
if Noct > 0 % don't bother if no smoothing
for i = find(f>0,1,'first'):length(f)
g = gauss_f(f,f(i),Noct);
x_oct(i) = sum(g.*X); % calculate smoothed spectral coefficient
end
% remove undershoot when X is positive
if all(X>=0)
x_oct(x_oct<0) = 0;
end
end
endfunction
function g = gauss_f(f_x,F,Noct)
% GAUSS_F calculate frequency-domain Gaussian with unity gain
%
% G = GAUSS_F(F_X,F,NOCT) calculates a frequency-domain Gaussian function
% for frequencies F_X, with centre frequency F and bandwidth F/NOCT.
sigma = (F/Noct)/pi; % standard deviation
g = exp(-(((f_x-F).^2)./(2.*(sigma^2)))); % Gaussian
g = g./sum(g); % normalise magnitude
endfunction
% take fft
Y = fft(y);
% keep only meaningful frequencies
NFFT = length(y);
if mod(NFFT,2)==0
Nout = (NFFT/2)+1;
else
Nout = (NFFT+1)/2;
end
Y = Y(1:Nout);
f = ((0:Nout-1)'./NFFT).*Fs;
% put into dB
Y = 20*log10(abs(Y)./NFFT);
% smooth
Noct = 12;
Z = smoothSpectrum(Y,f,Noct);
% plot
semilogx(f,Y,'LineWidth',0.7,f,Z,'LineWidth',2.2);
xlim([20,20000])
grid on
PS. I have GNU Octave, so I don't have the functions that are available with Matlab Toolboxes.
Here is the test IR audio file.
I think I found it. Since the FFT of the audio file (which is real numbers) is symmetric, with the same real part on both sides but opposite imaginary part, I thought of doing this:
take the FFT, keep the half of it, and apply the smoothing function without converting the magnitudes to dB
then make a copy of that smoothed FFT, and invert just the imaginary part
combine the two parts so that I have the same symmetric FFT as I had in the beginning, but now it is smoothed
apply inverse FFT to this and take the real part and write it to file.
Here is the code:
[y,Fs] = audioread('test IR.wav');
function x_oct = smoothSpectrum(X,f,Noct)
x_oct = X; % initial spectrum
if Noct > 0 % don't bother if no smoothing
for i = find(f>0,1,'first'):length(f)
g = gauss_f(f,f(i),Noct);
x_oct(i) = sum(g.*X); % calculate smoothed spectral coefficient
end
% remove undershoot when X is positive
if all(X>=0)
x_oct(x_oct<0) = 0;
end
end
endfunction
function g = gauss_f(f_x,F,Noct)
sigma = (F/Noct)/pi; % standard deviation
g = exp(-(((f_x-F).^2)./(2.*(sigma^2)))); % Gaussian
g = g./sum(g); % normalise magnitude
endfunction
% take fft
Y = fft(y);
% keep only meaningful frequencies
NFFT = length(y);
if mod(NFFT,2)==0
Nout = (NFFT/2)+1;
else
Nout = (NFFT+1)/2;
end
Y = Y(1:Nout);
f = ((0:Nout-1)'./NFFT).*Fs;
% smooth
Noct = 12;
Z = smoothSpectrum(Y,f,Noct);
% plot
semilogx(f,Y,'LineWidth',0.7,f,Z,'LineWidth',2.2);
xlim([20,20000])
grid on
#Apply the smoothing to the actual data
Zreal = real(Z); # real part
Zimag_neg = Zreal - Z; # opposite of imaginary part
Zneg = Zreal + Zimag_neg; # will be used for the symmetric Z
# Z + its symmetry with same real part but opposite imaginary part
reconstructed = [Z ; Zneg(end-1:-1:2)];
# Take the real part of the inverse FFT
reconstructed = real(ifft(reconstructed));
#Write to file
audiowrite ('smoothIR.wav', reconstructed, Fs, 'BitsPerSample', 24);
Seems to work! :) It would be nice if someone more knowledgeable could confirm that the thinking and code are good :)
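The mirroring step can be checked in isolation. Below is a small pure-Python sketch (with a naive DFT, only meant for short test signals) that rebuilds a full spectrum from its first half by appending the conjugated mirror, and verifies that the inverse transform is real and reproduces the input. The function names `dft` and `full_spectrum` are made up for this sketch. One point worth noting: the index range `end-1:-1:2` used above matches the even-length case, where the last kept bin is the Nyquist bin and must not be duplicated; for an odd-length signal the mirror would have to start at the last kept bin instead.

```python
import cmath

def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (forward or inverse)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(x[k] * cmath.exp(sign * 2j * cmath.pi * j * k / n) for k in range(n))
        for j in range(n)
    ]
    return [v / n for v in out] if inverse else out

def full_spectrum(half, nfft):
    """Rebuild a conjugate-symmetric spectrum of length nfft from bins 0..nfft//2."""
    if nfft % 2 == 0:
        mirror = half[-2:0:-1]   # skip DC and the Nyquist bin
    else:
        mirror = half[:0:-1]     # skip DC only
    return list(half) + [z.conjugate() for z in mirror]

for signal in ([1.0, 2.0, 0.5, -1.0], [1.0, 2.0, 0.5, -1.0, 3.0]):
    n = len(signal)
    half = dft(signal)[: n // 2 + 1]
    rebuilt = dft(full_spectrum(half, n), inverse=True)
    print(n, [round(v.real, 6) for v in rebuilt], max(abs(v.imag) for v in rebuilt))
```

If the imaginary part of the inverse transform is not negligible after smoothing, that usually indicates the spectrum handed to the inverse FFT is not conjugate-symmetric.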

mIoU for multi-class

I would like to understand how mIoU is calculated for multi-class classification. The formula for each class is
IoU = |A ∩ B| / |A ∪ B| = TP / (TP + FP + FN)
and then the average is taken over the classes to get the mIoU. However, I don't understand what happens for the classes that are not represented. The formula becomes a division by 0, so I ignore them and the average is only computed over the classes that are represented.
The problem is that when a prediction is wrong, the score drops sharply, because it adds another class to the average. For instance, in semantic segmentation, the ground truth of an image is made of 4 classes (0,1,2,3) and 6 classes are represented over the dataset. The prediction is also made of 4 classes (0,1,4,5), but all the items classified as 2 and 3 in the ground truth are classified as 4 and 5 in the prediction. In this case, should we calculate the mIoU over 6 classes, even if 4 classes are totally wrong and their respective IoU is 0? So the problem is that if just one pixel is predicted in a class that is not in the ground truth, we have to divide by a higher denominator, and it lowers the score a lot.
Is this the correct way to compute the mIoU for multi-class problems (and for semantic segmentation)?
Instead of calculating the mIoU of each image and then calculating the "mean" mIoU over all the images, I calculate the mIoU as if the dataset were one big image. If a class is not in the image and is not predicted, I set its IoU equal to 1.
From scratch:
import numpy as np

def miou(gt, pred, nbr_mask):
    # gt and pred hold integer labels with shape (n_images, height, width)
    height, width = len(gt[0]), len(gt[0][0])
    intersection = np.zeros(nbr_mask)  # int = (A and B)
    den = np.zeros(nbr_mask)  # den = A + B = (A or B) + (A and B)
    for i in range(len(gt)):
        for j in range(height):
            for k in range(width):
                if pred[i][j][k] == gt[i][j][k]:
                    intersection[gt[i][j][k]] += 1
                den[pred[i][j][k]] += 1
                den[gt[i][j][k]] += 1
    mIoU = 0
    for i in range(nbr_mask):
        if den[i] != 0:
            mIoU += intersection[i]/(den[i]-intersection[i])
        else:
            # class absent from both gt and pred: counted as a perfect score
            mIoU += 1
    mIoU = mIoU/nbr_mask
    return mIoU
Here gt is the array of ground-truth labels and pred is the prediction for the associated images (they have to correspond in the array and be the same size).
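To see how the scenario from the question plays out numerically, here is a small pure-Python sketch on flattened label lists. It accumulates per-class intersection and union over all pixels and then averages the per-class IoU, skipping classes that appear in neither the ground truth nor the prediction (the choice described in the question; the answer above assigns them an IoU of 1 instead). The function names `class_counts` and `mean_iou` are made up for this sketch.

```python
def class_counts(gt, pred, num_classes):
    """Per-class intersection and union counts over flat label sequences."""
    inter = [0] * num_classes
    union = [0] * num_classes
    for g, p in zip(gt, pred):
        if g == p:
            inter[g] += 1
            union[g] += 1
        else:
            # a mismatch enlarges the union of both classes involved
            union[g] += 1
            union[p] += 1
    return inter, union

def mean_iou(inter, union):
    # Classes with an empty union are absent everywhere and are skipped.
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious)

# Ground truth uses classes 0-3; the prediction sends 2 -> 4 and 3 -> 5.
gt = [0, 0, 1, 1, 2, 2, 3, 3]
pred = [0, 0, 1, 1, 4, 4, 5, 5]
inter, union = class_counts(gt, pred, 6)
print(inter, union, mean_iou(inter, union))
```

Classes 0 and 1 get an IoU of 1 and the other four get 0, so the mean over the six classes is 1/3: the wrongly predicted classes 4 and 5 do enter the average, as the question suspects.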
Adding to the previous answer, this is a fast and efficient PyTorch GPU implementation for calculating the mIoU and class-wise IoU for a batch of size (N, H, W) (both predicted mask and labels), taken from the NeurIPS 2021 paper "Few-Shot Segmentation via Cycle-Consistent Transformer", github repo available here.
import torch

def intersectionAndUnionGPU(output, target, K, ignore_index=255):
    # 'K' classes, output and target sizes are N or N * L or N * H * W, each value in range 0 to K - 1.
    assert (output.dim() in [1, 2, 3])
    assert output.shape == target.shape
    output = output.view(-1)
    target = target.view(-1)
    # note: this writes into the caller's tensor, since view shares memory
    output[target == ignore_index] = ignore_index
    intersection = output[output == target]
    area_intersection = torch.histc(intersection, bins=K, min=0, max=K-1)
    area_output = torch.histc(output, bins=K, min=0, max=K-1)
    area_target = torch.histc(target, bins=K, min=0, max=K-1)
    area_union = area_output + area_target - area_intersection
    return area_intersection, area_union, area_target
Example usage:
output = torch.rand(4, 5, 224, 224) # model output; batch size=4; channels=5, H,W=224
preds = F.softmax(output, dim=1).argmax(dim=1) # (4, 224, 224)
labels = torch.randint(0,5, (4, 224, 224))
i, u, _ = intersectionAndUnionGPU(preds, labels, 5) # 5 is num_classes
classwise_IOU = i/u # tensor of size (num_classes)
mIOU = i.sum()/u.sum() # mean IOU, taking (i/u).mean() is wrong
Hope this helps everyone!
(A non-GPU implementation is available as well in the repo!)