Interpreting libsvm epsilon-SVR result - regression

I tried to train and cross-validate a set of data with 8616 samples using epsilon-SVR.
Among these samples, I take 4368 for test and 4248 for CV.
Kernel type = RBF kernel. libsvm reports the result shown below:
optimization finished, #iter = 502363
nu = 0.689607
obj = -6383530527604706.000000, rho = 2884789.960212
nSV = 3023, nBSV = 3004
This result was obtained with the settings
-s 3 -t 2 -c 2^28 -g 2^-13 -p 2^12
(a) What does "nu" mean? Sometimes I get nu = 0.99xx for different parameters.
(b) It seems that "obj" is surprisingly large. Does that sound correct? The libsvm FAQ says this is the "optimal objective value of the dual SVM problem". Does that mean this is the minimum value of f(alpha)?
(c) "rho" is large too. This is the bias term, b. The dataset labels (y) consist of values between 82672 and 286026, so I guess this is reasonable. Am I right?
For training set,
Mean squared error = 1.26991e+008 (regression)
Squared correlation coefficient = 0.881112 (regression)
For cross-validation set,
Mean squared error = 1.38909e+008 (regression)
Squared correlation coefficient = 0.883144 (regression)
Using the selected parameters, I produced the results below:
kernel_type=2 (best c:2^28=2.68435e+008, g:2^-13=0.00012207, e:2^12=4096)
NRMS: 0.345139, best_gap:0.00199433
Mean Absolute Percent Error (MAPE): 5.39%
Mean Absolute Error (MAE): 8956.12 MWh
Daily Peak MAPE: 5.30%
The CV set MAPE is low (5.39%). Using a bias-variance check, the difference between the training set MAPE and the CV set MAPE is only 0.00199433, which suggests the parameters are set correctly. But I wonder whether the extremely large "obj" and "rho" values are correct...
I am very new to SVR, so please correct me if my interpretation or validation method is incorrect/insufficient.
Method to calculate MAPE
% Train on the training set and predict back on the same set
train_model = svmtrain(train_label, train_data, cmd);
[result_label, train_accuracy, train_dec_values] = svmpredict(train_label, train_data, train_model);
% Absolute percentage error per sample; Inf entries (from zero labels) are excluded from the mean
train_err = train_label-result_label;
train_errpct = abs(train_err)./train_label*100;
train_MAPE = mean(train_errpct(~isinf(train_errpct)));

The objective and rho values are high because (most probably) the data were not scaled. Scaling is highly recommended to avoid overflow; the overflow risk also depends on the type of kernel. By the way, when scaling the training data, do not forget to also scale the test data, which is most easily accomplished by scaling all the data first and then splitting it into a training and a test set.
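As a rough sketch of that scale-then-split workflow (in Python/NumPy purely for illustration; the feature matrix, split sizes, and value ranges below are placeholders, not the poster's actual data):
import numpy as np

X = np.random.rand(8616, 10) * 1e5           # stand-in for the unscaled features
y = np.random.uniform(82672, 286026, 8616)   # labels in the range mentioned in the question

# Scale every feature column to [0, 1] using the statistics of the full data set ...
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# ... and only then split into the training and test portions.
X_train, X_test = X_scaled[:4368], X_scaled[4368:]
y_train, y_test = y[:4368], y[4368:]
The svm-scale utility that ships with libsvm does the same job for files in libsvm format.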

Pytorch : different behaviours in GAN training with different, but conceptually equivalent, code

I'm trying to implement a simple GAN in Pytorch. The following training code works:
for epoch in range(max_epochs):  # loop over the dataset multiple times
    print(f'epoch: {epoch}')
    running_loss = 0.0
    for batch_idx, (data, _) in enumerate(data_gen_fn):
        # data preparation
        real_data = data
        input_shape = real_data.shape
        inputs_generator = torch.randn(*input_shape).detach()

        # generator forward
        fake_data = generator(inputs_generator).detach()

        # discriminator forward
        optimizer_generator.zero_grad()
        optimizer_discriminator.zero_grad()

        #################### ALERT CODE #######################
        predictions_on_real = discriminator(real_data)
        predictions_on_fake = discriminator(fake_data)
        predictions = torch.cat((predictions_on_real,
                                 predictions_on_fake), dim=0)
        #########################################################

        # loss discriminator
        labels_real_fake = torch.tensor([1]*batch_size + [0]*batch_size)
        loss_discriminator_batch = criterion_discriminator(predictions,
                                                           labels_real_fake)

        # update discriminator
        loss_discriminator_batch.backward()
        optimizer_discriminator.step()

        # generator
        # zero the parameter gradients
        optimizer_discriminator.zero_grad()
        optimizer_generator.zero_grad()

        fake_data = generator(inputs_generator)  # make again fake data but without detaching
        predictions_on_fake = discriminator(fake_data)  # D(G(encoding))

        # loss generator
        labels_fake = torch.tensor([1]*batch_size)
        loss_generator_batch = criterion_generator(predictions_on_fake,
                                                   labels_fake)
        loss_generator_batch.backward()  # dL(D(G(encoding)))/dW_{G,D}
        optimizer_generator.step()
If I plot the generated images for each iteration, I see that the generated images look like the real ones, so the training procedure seems to work well.
However, if I try to change the code in the ALERT CODE part, i.e., instead of:
#################### ALERT CODE #######################
predictions_on_real = discriminator(real_data)
predictions_on_fake = discriminator(fake_data)
predictions = torch.cat((predictions_on_real,
predictions_on_fake), dim=0)
#########################################################
I use the following:
#################### ALERT CODE #######################
predictions = discriminator(torch.cat( (real_data, fake_data), dim=0))
#######################################################
This is conceptually the same: in a nutshell, instead of doing two separate forward passes through the discriminator (the former on the real data, the latter on the fake data) and then concatenating the results, the new code first concatenates the real and fake data and then makes just one forward pass on the concatenated batch.
However, this version does not work; the generated images always look like random noise.
Is there any explanation for this behavior? Why do we get different results?
Supplying inputs in either the same batch, or separate batches, can make a difference if the model includes dependencies between different elements of the batch. By far the most common source in current deep learning models is batch normalization. As you mentioned, the discriminator does include batchnorm, so this is likely the reason for different behaviors. Here is an example. Using single numbers and a batch size of 4:
import numpy as np

features = np.array([1., 2., 5., 6.])
print("mean {}, std {}".format(np.mean(features), np.std(features)))
print("normalized features", (features - np.mean(features)) / np.std(features))
>>>mean 3.5, std 2.0615528128088303
>>>normalized features [-1.21267813 -0.72760688 0.72760688 1.21267813]
Now we split the batch into two parts. First part:
features = np.array([1., 2.])
print("mean {}, std {}".format(np.mean(features), np.std(features)))
print("normalized features", (features - np.mean(features)) / np.std(features))
>>>mean 1.5, std 0.5
>>>normalized features [-1. 1.]
Second part:
features = np.array([5., 6.])
print("mean {}, std {}".format(np.mean(features), np.std(features)))
print("normalized features", (features - np.mean(features)) / np.std(features))
>>>mean 5.5, std 0.5
>>>normalized features [-1. 1.]
As we can see, in the split-batch version, the two batches are normalized to the exact same numbers, even though the inputs are very different. In the joint-batch version, on the other hand, the larger numbers are still larger than the smaller ones as they are normalized using the same statistics.
Why does this matter?
With deep learning, it's always hard to say, and especially with GANs and their complex training dynamics. A possible explanation is that, as we can see in the example above, the separate batches result in more similar features after normalization even if the original inputs are quite different. This may help early in training, as the generator tends to output "garbage" which has very different statistics from real data.
With a joint batch, these differing statistics make it easy for the discriminator to tell the real and generated data apart, and we end up in a situation where the discriminator "overpowers" the generator.
By using separate batches, however, the different normalizations result in the generated and real data looking more similar, which makes the task less trivial for the discriminator and allows the generator to learn.
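To see the same effect with an actual batch-norm layer, here is a small PyTorch check (not from the original post; it simply reruns the numbers above through torch.nn.BatchNorm1d in training mode):
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(1, affine=False)   # training mode by default, so batch statistics are used
x = torch.tensor([[1.], [2.], [5.], [6.]])

joint = bn(x)                                     # one forward pass over the whole batch
split = torch.cat((bn(x[:2]), bn(x[2:])), dim=0)  # two forward passes with separate statistics

print(joint.squeeze())   # roughly [-1.21, -0.73, 0.73, 1.21]: ordering preserved across the batch
print(split.squeeze())   # roughly [-1.00, 1.00, -1.00, 1.00]: each half squashed to the same values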

"iteration limit reached" in lme4 GLMM - what does it mean?

I constructed several glmer.nb models with different combinations of random intercepts, and for one of the models (nested random intercepts, with the lowest AICc), I consistently get: "iteration limit reached", without the usual "Warning message:
In theta.ml(Y, mu, weights = object@resp$weights, limit = limit, :..."
Here's what I know:
it is a warning (from the color) but not labeled as such
you can also have that warning with GLMs and LMERs
Here's what I don't know:
does it mean the model is invalid?
what causes that issue?
what could I do to resolve that issue?
Here's what I searched:
https://stats.stackexchange.com/questions/67287/very-large-theta-values-using-glm-nb-in-r-alternative-approaches (no explanation as to the why and how)
GLMM FAQ: no mention
I am not the only one regularly running into this or similar problems: Using glmer.nb(), the error message: (maxstephalfit) PIRLS step-halvings failed to reduce deviance in pwrssUpdate is returned
https://stats.stackexchange.com/questions/40647/lme-error-iteration-limit-reached/40664
Here's what would be highly appreciated:
A more informative warning message: Did the model converge? What caused this? What can one do to fix it? Can we read more about this (a link to the GLMM FAQ, brms-style)?
This is a general question. I did not provide reproducible code because an answer that is generalisable would be most useful.
library(lme4)
dd <- data.frame(f = factor(rep(1:20, each = 20)))
dd$y <- simulate(~ 1 + (1|f), family = "poisson",
                 newdata = dd,
                 newparam = list(beta = 1, theta = 1),
                 seed = 101)[[1]]
m1 <- glmer.nb(y ~ 1 + (1|f), data = dd)
Warning message:
In theta.ml(Y, mu, weights = object@resp$weights, limit = limit, :
  iteration limit reached
It's a bit hard to tell, but this warning occurs in MASS::theta.ml(), which is called to get an initial estimate of the dispersion parameter. (If you set options(error = recover, warn = 2), warnings will be converted to errors and errors will dump you into a debugger, where you can see the sequence of calls that were active when the warning/error occurred).
This generally occurs when the data (specifically, the conditional distribution of the data) is actually equidispersed (variance == mean) or underdispersed (i.e. variance < mean), which can't be achieved by a negative binomial distribution. If you run getME(m1, "glmer.nb.theta") you'll generally get a very large value (in this case it's 62376), which indicates where the optimizer gave up while it was trying to send the dispersion parameter to infinity.
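For intuition: in the parameterization used here the negative binomial variance is mu + mu^2/theta, so pushing theta towards infinity is exactly how the fit approximates a Poisson (variance equal to the mean). A quick numerical illustration in Python/SciPy (not part of the original answer; the value of mu is arbitrary):
from scipy.stats import nbinom, poisson

mu, theta = 2.7, 62376.0       # theta of the order reported by getME(m1, "glmer.nb.theta")
p = theta / (theta + mu)       # SciPy's nbinom takes (size, prob), with prob = theta / (theta + mu)
print(nbinom.mean(theta, p), nbinom.var(theta, p))   # approx. 2.7 and 2.7001: variance = mu + mu^2/theta
print(poisson.mean(mu), poisson.var(mu))             # 2.7 and 2.7: the Poisson limit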
You can:
ignore the warning (the negative binomial isn't a good choice, but the model is effectively converging to a Poisson solution anyway).
revert to a Poisson model (the CV question you link to does say "a Poisson model might be a better choice")
People often worry less about underdispersion than overdispersion (because underdispersion makes results of a Poisson model conservative), but if you want to take underdispersion into account you can fit your model with a conditional distribution that allows underdispersion as well as overdispersion (not directly possible within lme4, but see here)
PS the "iteration limit reached without convergence" warning in one of your linked answers, from nlminb within lme, is a completely different issue (except that both situations involve some form of iterative solution scheme with a set maximum number of iterations ...)

PyTorch find keypoints: output nodes to be in a range and negative loss

I am a beginner in deep learning.
I am using this dataset and I want my network to detect keypoints of a hand.
How can I make my output layer's nodes to be in range [-1, 1] (range of normalized 2D points)?
Another problem is that when I train for more than 1 epoch, the loss takes negative values.
criterion: torch.nn.MultiLabelSoftMarginLoss() and optimizer: torch.optim.SGD()
Here you can find my repo:
net = nnModel.Net()
net = net.to(device)
criterion = nn.MultiLabelSoftMarginLoss()
optimizer = optim.SGD(net.parameters(), lr=learning_rate)
lr_scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer=optimizer, gamma=decay_rate)
You can use the Tanh activation function, since the image of the function lies in [-1, 1].
The problem of predicting key-points in an image is more of a regression problem than a classification problem (especially if you're making your model outputs + targets fall within a continuous interval). Therefore, I suggest you use the L2 Loss.
In fact, it could be a good exercise for you to determine, using cross-validation, which of the loss functions appropriate for regression problems gives the lowest expected generalization error. There are several such functions available in PyTorch.
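A minimal sketch of that combination (Tanh outputs plus an L2 loss); the layer sizes, keypoint count, and dummy tensors here are made up purely for illustration:
import torch
import torch.nn as nn

class KeypointHead(nn.Module):
    # maps backbone features to num_keypoints (x, y) pairs in [-1, 1]
    def __init__(self, in_features=512, num_keypoints=21):
        super().__init__()
        self.fc = nn.Linear(in_features, num_keypoints * 2)
        self.act = nn.Tanh()              # squashes every output into [-1, 1]

    def forward(self, x):
        return self.act(self.fc(x))

head = KeypointHead()
criterion = nn.MSELoss()                  # L2 loss for regression targets
optimizer = torch.optim.SGD(head.parameters(), lr=1e-2)

features = torch.randn(8, 512)            # dummy backbone output for a batch of 8
targets = torch.rand(8, 42) * 2 - 1       # normalized keypoints in [-1, 1]

loss = criterion(head(features), targets)
loss.backward()
optimizer.step()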
One way I can think of is to use torch.nn.Sigmoid, which produces outputs in the [0, 1] range, and then scale the outputs to [-1, 1] with the transformation 2*x - 1.

Wasserstein GAN implementation in pytorch. How to implement the loss?

I'm currently working on a project in pytorch on Wasserstein GAN (https://arxiv.org/pdf/1701.07875.pdf).
In Wasserstein GAN a new objective function is defined using the Wasserstein distance: the critic D is trained to maximize E_x[D(x)] - E_z[D(G(z))] over weight-clipped (approximately 1-Lipschitz) critics. This leads to Algorithm 1 of the paper for training the GAN.
My question is:
When implementing lines 5 and 6 of the algorithm in pytorch, should I be multiplying my loss by -1? As in my code (I use RMSprop as my optimizer for both the generator and critic):
############################
# (1) Update D network: maximize (D(x)) + (D(G(x)))
###########################
for n in range(n_critic):
    D.zero_grad()
    real_cpu = data[0].to(device)
    b_size = real_cpu.size(0)
    output = D(real_cpu)
    #errD_real = -criterion(output, label) #DCGAN
    errD_real = torch.mean(output)
    # Calculate gradients for D in backward pass
    errD_real.backward()
    D_x = output.mean().item()

    ## Train with all-fake batch
    # Generate batch of latent vectors
    noise = torch.randn(b_size, 100, device=device)  #Careful here we changed shape of input (original : torch.randn(4, 100, 1, 1, device=device))
    # Generate fake image batch with G
    fake = G(noise)
    # Classify all fake batch with D
    output = D(fake.detach())
    # Calculate D's loss on the all-fake batch
    errD_fake = torch.mean(output)
    # Calculate the gradients for this batch
    errD_fake.backward()
    D_G_z1 = output.mean().item()

    # Add the gradients from the all-real and all-fake batches
    errD = -(errD_real - errD_fake)
    # Update D
    optimizerD.step()

    #Clipping weights
    for p in D.parameters():
        p.data.clamp_(-0.01, 0.01)
As you can see, I do the operation errD = -(errD_real - errD_fake), with errD_real and errD_fake being respectively the mean of the predictions of the critic on real and fake samples.
To my understanding, RMSprop should optimize the weights of the critic in the following way:
w <- w - alpha*gradient(w)
(alpha being the learning rate divided by the square root of the weighted moving average of the squared gradient)
Since the optimization problem requires us to "go" in the same direction as the gradient, it should be necessary to multiply gradient(w) by -1 before updating the weights.
Do you think that my reasoning is right?
The program runs, but my results are quite poor.
I follow the same logic for the generator's weights but this time in order to go in the opposite direction of the gradient:
############################
# (2) Update G network: minimize -D(G(x))
###########################
G.zero_grad()
noise = torch.randn(b_size, 100, device=device)
fake = G(noise)
#label.fill_(fake_label) # fake labels are real for generator cost
# Since we just updated D, perform another forward pass of all-fake batch through D
output = D(fake).view(-1)
# Calculate G's loss based on this output
#errG = criterion(output, label) #DCGAN
errG = -torch.mean(output)
# Calculate gradients for G
errG.backward()
D_G_z2 = output.mean().item()
# Update G
optimizerG.step()
Sorry for the long question; I tried to explain my doubt as clearly as possible. Thank you everyone.
I noticed some errors in the implementation of your discriminator training protocol. You call your backward functions twice, with both the real and fake value losses being backpropagated at different time steps.
Technically an implementation using this scheme is possible but highly unreadable. There was a mistake with your errD_real, in which your output is going to be positive instead of negative, since an optimal D(G(z)) > 0, and so you penalize it for being correct. Overall your model converges simply by predicting D(x) < 0 for all inputs.
To fix this, do not call errD_real.backward() or errD_fake.backward(). Simply calling errD.backward() after you define errD works perfectly fine. Otherwise, your generator seems to be correct.
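Here is a minimal, self-contained sketch of that single-backward-pass critic update (toy networks and shapes, purely for illustration):
import torch
import torch.nn as nn

# Toy critic and generator just so the snippet runs; the point is the single
# backward call on the combined critic loss errD.
D = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1))
G = nn.Sequential(nn.Linear(100, 16), nn.ReLU(), nn.Linear(16, 2))
optimizerD = torch.optim.RMSprop(D.parameters(), lr=5e-5)

real = torch.randn(64, 2)                  # stand-in for a real batch
noise = torch.randn(64, 100)
fake = G(noise).detach()                   # detach: only the critic is updated here

D.zero_grad()
errD = -(torch.mean(D(real)) - torch.mean(D(fake)))  # minimize the negated Wasserstein estimate
errD.backward()                            # one backward pass instead of two
optimizerD.step()

for p in D.parameters():
    p.data.clamp_(-0.01, 0.01)             # weight clipping as in the WGAN paper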

Octave -inf and NaN

I searched the forum and found this thread, but it does not cover my question
Two ways around -inf
From a Machine Learning class, week 3, I am getting -inf when using log(0), which later turns into a NaN. The NaN results in no answer being given in a sum formula, so no scalar for J (a cost function which is the result of matrix math).
Here is a test of my function
>> sigmoid([-100;0;100])
ans =
3.7201e-44
5.0000e-01
1.0000e+00
This is as expected, but the hypothesis requires ans = 1 - sigmoid:
>> 1-ans
ans =
1.00000
0.50000
0.00000
and log(0) gives -Inf:
>> log(ans)
ans =
0.00000
-0.69315
-Inf
-Inf rows do not add to the cost function, but the -Inf carries through to NaN, and I do not get a result. I cannot find any material on -Inf, but am thinking there is a problem with my sigmoid function.
Can you provide any direction?
The typical way to avoid infinity in these cases is to add eps to the operand:
log(ans + eps)
eps is a very, very small value, and won't affect the output for values of ans unless ans is zero:
>> z = [-100;0;100];
>> g = 1 ./ (1+exp(-z));
>> log(1-g + eps)
ans =
0.0000
-0.6931
-36.0437
Adding to the answers here, I really do hope you will provide some more context to your question (in particular, what are you actually trying to do?).
I will go out on a limb and guess the context, just in case this is useful. You are probably doing machine learning, and trying to define a cost function based on the negative log likelihood of a model, and then trying to differentiate it to find the point where this cost is at its minimum.
In general, for a reasonable model with a useful likelihood that adheres to Cromwell's rule, you shouldn't have these problems, but in practice it happens. Presumably, in the process of trying to calculate the negative log likelihood of a zero probability you get inf, and trying to calculate a differential between two points produces inf / inf = nan.
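Just to make that arithmetic concrete, here is a small Python/NumPy illustration (not part of the original answer):
import numpy as np

with np.errstate(divide='ignore'):    # silence the divide-by-zero warning from log(0)
    nll = -np.log(0.0)                # negative log likelihood of a zero probability
print(nll)                            # inf
print(np.inf / np.inf)                # nan -- the indeterminate form mentioned above
print(np.inf - np.inf)                # nan as well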
In this case, this is an 'edge case', and generally in computer science edge cases need to be spotted as exceptional circumstances and dealt with appropriately. The reality is that you can reasonably expect that inf isn't going to be your function's minimum! Therefore, whether you remove it from the calculations, or replace it by a very large number (whether arbitrarily or via machine precision) doesn't really make a difference.
So in practice you can do either of the two things suggested by others here, or even just detect such instances and skip them from the calculation. The practical result should be the same.
-inf means negative infinity, which is the correct answer because log(0) is minus infinity by definition.
The easiest thing to do is to check your intermediate results and if the number is below some threshold (like 1e-12) then just set it to that threshold. The answers won't be perfect but they will still be pretty close.
Using the following as the sigmoid function:
function g = sigmoid(z)
g = 1 ./ (1 + e.^-z);
end
Then the following code runs with no issues. Choose the threshold value in the max statement to be less than the expected noise in your measurements, and then you're good to go:
>> a = sigmoid([-100, 0, 100])
a =
3.7201e-44 5.0000e-01 1.0000e+00
>> b = 1-a
b =
1.00000 0.50000 0.00000
>> c = max(b, 1e-12)
c =
1.0000e+00 5.0000e-01 1.0000e-12
>> d = log(c)
d =
0.00000 -0.69315 -27.63102