PyTorch: different behaviours in GAN training with different, but conceptually equivalent, code

I'm trying to implement a simple GAN in PyTorch. The following training code works:
for epoch in range(max_epochs): # loop over the dataset multiple times
    print(f'epoch: {epoch}')
    running_loss = 0.0
    for batch_idx, (data, _) in enumerate(data_gen_fn):
        # data preparation
        real_data = data
        input_shape = real_data.shape
        inputs_generator = torch.randn(*input_shape).detach()
        # generator forward
        fake_data = generator(inputs_generator).detach()
        # discriminator forward
        optimizer_generator.zero_grad()
        optimizer_discriminator.zero_grad()
        #################### ALERT CODE #######################
        predictions_on_real = discriminator(real_data)
        predictions_on_fake = discriminator(fake_data)
        predictions = torch.cat((predictions_on_real,
                                 predictions_on_fake), dim=0)
        #########################################################
        # loss discriminator
        labels_real_fake = torch.tensor([1]*batch_size + [0]*batch_size)
        loss_discriminator_batch = criterion_discriminator(predictions,
                                                           labels_real_fake)
        # update discriminator
        loss_discriminator_batch.backward()
        optimizer_discriminator.step()
        # generator
        # zero the parameter gradients
        optimizer_discriminator.zero_grad()
        optimizer_generator.zero_grad()
        fake_data = generator(inputs_generator) # make again fake data but without detaching
        predictions_on_fake = discriminator(fake_data) # D(G(encoding))
        # loss generator
        labels_fake = torch.tensor([1]*batch_size)
        loss_generator_batch = criterion_generator(predictions_on_fake,
                                                   labels_fake)
        loss_generator_batch.backward() # dL(D(G(encoding)))/dW_{G,D}
        optimizer_generator.step()
If I plot the generated images for each iteration, I see that the generated images look like the real ones, so the training procedure seems to work well.
However, if I try to change the code in the ALERT CODE part, i.e., instead of:
#################### ALERT CODE #######################
predictions_on_real = discriminator(real_data)
predictions_on_fake = discriminator(fake_data)
predictions = torch.cat((predictions_on_real,
                         predictions_on_fake), dim=0)
#########################################################
I use the following:
#################### ALERT CODE #######################
predictions = discriminator(torch.cat( (real_data, fake_data), dim=0))
#######################################################
That is conceptually the same: in a nutshell, instead of doing two separate forward passes on the discriminator (the former on the real data, the latter on the fake data) and then concatenating the results, with the new code I first concatenate real and fake data and then make just one forward pass on the concatenated batch.
However, this version of the code does not work; that is, the generated images always look like random noise.
Any explanation for this behavior?

Why do we get different results?
Supplying inputs in either the same batch, or separate batches, can make a difference if the model includes dependencies between different elements of the batch. By far the most common source in current deep learning models is batch normalization. As you mentioned, the discriminator does include batchnorm, so this is likely the reason for different behaviors. Here is an example. Using single numbers and a batch size of 4:
import numpy as np

features = [1., 2., 5., 6.]
print("mean {}, std {}".format(np.mean(features), np.std(features)))
print("normalized features", (features - np.mean(features)) / np.std(features))
>>>mean 3.5, std 2.0615528128088303
>>>normalized features [-1.21267813 -0.72760688 0.72760688 1.21267813]
Now we split the batch into two parts. First part:
features = [1., 2.]
print("mean {}, std {}".format(np.mean(features), np.std(features)))
print("normalized features", (features - np.mean(features)) / np.std(features))
>>>mean 1.5, std 0.5
>>>normalized features [-1. 1.]
Second part:
features = [5., 6.]
print("mean {}, std {}".format(np.mean(features), np.std(features)))
print("normalized features", (features - np.mean(features)) / np.std(features))
>>>mean 5.5, std 0.5
>>>normalized features [-1. 1.]
As we can see, in the split-batch version, the two batches are normalized to the exact same numbers, even though the inputs are very different. In the joint-batch version, on the other hand, the larger numbers are still larger than the smaller ones as they are normalized using the same statistics.
Why does this matter?
With deep learning, it's always hard to say, and especially with GANs and their complex training dynamics. A possible explanation is that, as we can see in the example above, the separate batches result in more similar features after normalization even if the original inputs are quite different. This may help early in training, as the generator tends to output "garbage" which has very different statistics from real data.
With a joint batch, these differing statistics make it easy for the discriminator to tell the real and generated data apart, and we end up in a situation where the discriminator "overpowers" the generator.
By using separate batches, however, the different normalizations make the generated and real data look more similar, which makes the task less trivial for the discriminator and allows the generator to learn.
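To see the same effect with an actual batch-norm layer, here is a minimal sketch (using a bare nn.BatchNorm1d in training mode rather than your discriminator) that normalizes the same four values jointly and then in two separate halves:
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(1, affine=False)  # pure normalization, no learned scale/shift
bn.train()                            # use batch statistics, as during GAN training
real = torch.tensor([[1.], [2.]])
fake = torch.tensor([[5.], [6.]])
# joint batch: real and fake share one set of statistics
print(bn(torch.cat((real, fake), dim=0)).flatten())      # roughly [-1.21, -0.73, 0.73, 1.21]
# separate batches: each half is normalized with its own statistics
print(torch.cat((bn(real), bn(fake)), dim=0).flatten())  # roughly [-1.0, 1.0, -1.0, 1.0]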

Related

Augmentation on the same sample

I am working on an audio classification task where my inputs are raw audio samples and the outputs are class labels, and for this particular question I want to augment only the Trainset split samples.
Q: is it good practice to augment the same audio sample more than once?
E.g., to apply to the same record x first aug1, which yields record_x_aug1_sample, and later aug2, which yields record_x_aug2_sample.
Then the Trainset will hold both: [record_x_aug1_sample, record_x_aug2_sample], and a model will train on this Trainset.
Q2: is it good practice to also add the original record x to the Trainset?
It is perfectly fine to augment the same audio more than once. Moreover, it is good practice for reducing overfitting, since your model sees slightly different versions of the same sample each time.
Yes, it's fine. Also, you can construct two datasets: 1. the original samples without augmentation, 2. a dataset with augmentations. Comparing the quality on those two datasets, you can get a grasp of how strong your augmentations are. It can also show the benefits of adding augmentations to your training process.
Also, you may consider augmenting your samples on-the-fly if you are using an iterative training process (like a neural network fitted with SGD), so the samples will be slightly different all the time. Pseudo-code:
for sample in dataset:
    augmented_sample = augment(sample)
    model.train(augmented_sample)
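In PyTorch, this on-the-fly idea usually lives in the Dataset itself. Here is a minimal sketch under that assumption (the AudioDataset class and the waveforms, labels and augment arguments are hypothetical placeholders, not taken from the question):
import torch
from torch.utils.data import Dataset, DataLoader

class AudioDataset(Dataset):
    """Applies the augmentation each time a sample is requested."""
    def __init__(self, waveforms, labels, augment=None):
        self.waveforms = waveforms
        self.labels = labels
        self.augment = augment

    def __len__(self):
        return len(self.waveforms)

    def __getitem__(self, idx):
        x = self.waveforms[idx]
        if self.augment is not None:
            x = self.augment(x)  # a different random augmentation every time
        return torch.as_tensor(x), self.labels[idx]

# loader = DataLoader(AudioDataset(waveforms, labels, augment), batch_size=32, shuffle=True)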
Another approach that may improve performance is to first train on the augmented dataset, then fine-tune the model on the clean original samples for a short time.
Some libraries for audio augmentation:
https://github.com/iver56/audiomentations
https://github.com/asteroid-team/torch-audiomentations
Usage:
from audiomentations import Compose, AddGaussianNoise, TimeStretch, PitchShift, Shift
import numpy as np
augment = Compose([
    AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.5),
    TimeStretch(min_rate=0.8, max_rate=1.25, p=0.5),
    PitchShift(min_semitones=-4, max_semitones=4, p=0.5),
    Shift(min_fraction=-0.5, max_fraction=0.5, p=0.5),
])
# Generate 2 seconds of dummy audio for the sake of example
samples = np.random.uniform(low=-0.2, high=0.2, size=(32000,)).astype(np.float32)
# Augment/transform/perturb the audio data
augmented_samples = augment(samples=samples, sample_rate=16000)

PyTorch find keypoints: output nodes to be in a range and negative loss

I am a beginner in deep learning.
I am using this dataset and I want my network to detect keypoints of a hand.
How can I make my output layer's nodes be in the range [-1, 1] (the range of normalized 2D points)?
Another problem is that when I train for more than 1 epoch the loss takes negative values.
criterion: torch.nn.MultiLabelSoftMarginLoss() and optimizer: torch.optim.SGD()
Here you can find my repo.
net = nnModel.Net()
net = net.to(device)
criterion = nn.MultiLabelSoftMarginLoss()
optimizer = optim.SGD(net.parameters(), lr=learning_rate)
lr_scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer=optimizer, gamma=decay_rate)
You can use the Tanh activation function, since the image of the function lies in [-1, 1].
The problem of predicting key-points in an image is more of a regression problem than a classification problem (especially if you're making your model outputs + targets fall within a continuous interval). Therefore, I suggest you use the L2 Loss.
In fact, it could be a good exercise for you to determine, using cross-validation, which of the loss functions appropriate for regression problems provides the lowest expected generalization error. There are several such functions available in PyTorch.
One way I can think of is to use torch.nn.Sigmoid, which produces outputs in the [0, 1] range, and scale the outputs to [-1, 1] using the transformation 2*x - 1.
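As a minimal sketch of both suggestions (the KeypointNet module, its layer sizes, and the 21-keypoint assumption are made up for illustration, not taken from the linked repo), the output range can be constrained with a final Tanh, or with a Sigmoid rescaled via 2*x - 1, and the coordinates can be regressed with nn.MSELoss, which cannot go negative:
import torch
import torch.nn as nn

class KeypointNet(nn.Module):
    def __init__(self, in_features=128, num_keypoints=21):
        super().__init__()
        self.fc = nn.Linear(in_features, num_keypoints * 2)  # (x, y) per keypoint

    def forward(self, x):
        return torch.tanh(self.fc(x))  # outputs in [-1, 1]
        # alternative: return 2 * torch.sigmoid(self.fc(x)) - 1

net = KeypointNet()
criterion = nn.MSELoss()                 # L2 loss for coordinate regression
features = torch.randn(4, 128)           # dummy batch of image features
targets = torch.rand(4, 42) * 2 - 1      # normalized keypoints in [-1, 1]
loss = criterion(net(features), targets) # always >= 0
loss.backward()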

How to train two pytorch networks with different inputs together?

I'm totally new to PyTorch, so it might be a very basic question. I have two networks that should be trained together.
The first one takes data as input and returns its embedding as output.
The second one takes pairs of embedded datapoints and returns their 'similarity' as output.
A partial loss is then computed for every datapoint, and then all the losses are combined.
This final loss should be backpropagated through both networks.
What should the code for that look like? I'm thinking something like this:
def train_models(inputs, targets):
    network1.train()
    network2.train()
    embeddings = network1(inputs)
    paired_embeddings = pair_embeddings(embeddings)
    similarities = network2(paired_embeddings)
    """
    I don't know how the loss should be calculated here.
    I have a loss formula for every embedded datapoint,
    but not for every similarity.
    But if I only calculate loss for every embedding (using similarities),
    won't backpropagate() only modify network1,
    since embeddings are network1's outputs
    and have not been modified in network2?
    """
    optimizer1.step()
    optimizer2.step()
    scheduler1.step()
    scheduler2.step()
    network1.eval()
    network2.eval()
I hope this is specific enough. I'll gladly share more details if necessary. I'm just so inexperienced with PyTorch and deep learning in general that I'm not even sure how to ask this question.
You can use a single optimizer for this purpose, and even pass a different learning rate for each network.
optimizer = optim.Adam([
    {'params': network1.parameters()},
    {'params': network2.parameters(), 'lr': 1e-3}
], lr=1e-4)
# ...
loss = loss1 + loss2
loss.backward()
optimizer.step()
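To make the gradient flow explicit, here is a minimal self-contained sketch of one training step (the toy layer sizes, the pair_embeddings helper and the MSE loss are placeholders standing in for your own versions). Because the similarities are computed from network1's embeddings, calling backward() on a loss built from them populates gradients in both networks:
import torch
import torch.nn as nn
import torch.optim as optim

network1 = nn.Linear(10, 4)  # toy stand-in for the embedding network
network2 = nn.Linear(8, 1)   # toy stand-in for the similarity network

def pair_embeddings(emb):
    # placeholder pairing: concatenate consecutive embeddings
    return torch.cat((emb[0::2], emb[1::2]), dim=1)

optimizer = optim.Adam([
    {'params': network1.parameters()},
    {'params': network2.parameters(), 'lr': 1e-3}
], lr=1e-4)

inputs = torch.randn(6, 10)
targets = torch.randn(3, 1)

embeddings = network1(inputs)                         # network1 forward
similarities = network2(pair_embeddings(embeddings))  # network2 forward
loss = nn.functional.mse_loss(similarities, targets)  # placeholder loss
optimizer.zero_grad()
loss.backward()                                       # gradients reach BOTH networks
optimizer.step()
print(network1.weight.grad is not None)               # True: network1 was updated too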

Wasserstein GAN implementation in PyTorch. How to implement the loss?

I'm currently working on a project in PyTorch on Wasserstein GAN (https://arxiv.org/pdf/1701.07875.pdf).
In Wasserstein GAN a new objective function is defined using the Wasserstein distance: the critic f_w is trained to maximize E_{x~P_r}[f_w(x)] - E_{z~p(z)}[f_w(g_θ(z))].
This leads to Algorithm 1 of the paper for training the GAN.
My question is:
When implementing lines 5 and 6 of the algorithm in PyTorch, should I be multiplying my loss by -1? As in my code (I use RMSprop as my optimizer for both the generator and the critic):
############################
# (1) Update D network: maximize (D(x)) + (D(G(x)))
###########################
for n in range(n_critic):
    D.zero_grad()
    real_cpu = data[0].to(device)
    b_size = real_cpu.size(0)
    output = D(real_cpu)
    #errD_real = -criterion(output, label) #DCGAN
    errD_real = torch.mean(output)
    # Calculate gradients for D in backward pass
    errD_real.backward()
    D_x = output.mean().item()
    ## Train with all-fake batch
    # Generate batch of latent vectors
    noise = torch.randn(b_size, 100, device=device) #Careful here we changed shape of input (original : torch.randn(4, 100, 1, 1, device=device))
    # Generate fake image batch with G
    fake = G(noise)
    # Classify all fake batch with D
    output = D(fake.detach())
    # Calculate D's loss on the all-fake batch
    errD_fake = torch.mean(output)
    # Calculate the gradients for this batch
    errD_fake.backward()
    D_G_z1 = output.mean().item()
    # Add the gradients from the all-real and all-fake batches
    errD = -(errD_real - errD_fake)
    # Update D
    optimizerD.step()
    #Clipping weights
    for p in D.parameters():
        p.data.clamp_(-0.01, 0.01)
As you can see, I do the operation errD = -(errD_real - errD_fake), with errD_real and errD_fake being respectively the mean of the predictions of the critic on real and fake samples.
To my understanding, RMSprop should optimize the weights of the critic in the following way:
w <- w - alpha*gradient(w)
(alpha being the learning rate divided by the square root of the weighted moving average of the squared gradient)
Since the optimization problem requires "going" in the same direction as the gradient, gradient(w) should be multiplied by -1 before the weights are updated.
Do you think that my reasoning is right?
The program runs but my results are quite poor.
I follow the same logic for the generator's weights but this time in order to go in the opposite direction of the gradient:
############################
# (2) Update G network: minimize -D(G(x))
###########################
G.zero_grad()
noise = torch.randn(b_size, 100, device=device)
fake = G(noise)
#label.fill_(fake_label) # fake labels are real for generator cost
# Since we just updated D, perform another forward pass of all-fake batch through D
output = D(fake).view(-1)
# Calculate G's loss based on this output
#errG = criterion(output, label) #DCGAN
errG = -torch.mean(output)
# Calculate gradients for G
errG.backward()
D_G_z2 = output.mean().item()
# Update G
optimizerG.step()
Sorry for the long question, I tried to explain my doubt as clearly as possible. Thank you everyone.
I noticed some errors in the implementation of your discriminator training protocol: you call backward twice, so the real and fake losses are backpropagated at different time steps.
Technically, an implementation using this scheme is possible but highly unreadable. There is also a mistake with your errD_real, which is backpropagated with a positive sign instead of a negative one (for an optimal critic D(G(z)) > 0), so you penalize it for being correct. Overall, your model converges simply by predicting D(x) < 0 for all inputs.
To fix this, do not call errD_real.backward() or errD_fake.backward(). Simply calling errD.backward() after you define errD would work perfectly fine. Otherwise, your generator seems to be correct.
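As a minimal sketch of that fix, reusing the variables from the question (D, G, optimizerD, data, device and n_critic are assumed to be defined as in your code), the critic update would compute both means first, build the single scalar loss, and call backward() only once:
for n in range(n_critic):
    D.zero_grad()
    real = data[0].to(device)
    b_size = real.size(0)
    noise = torch.randn(b_size, 100, device=device)
    fake = G(noise)
    errD_real = torch.mean(D(real))           # E[D(x)]
    errD_fake = torch.mean(D(fake.detach()))  # E[D(G(z))]
    errD = -(errD_real - errD_fake)           # minimize -(E[D(x)] - E[D(G(z))])
    errD.backward()                           # single backward pass on the full loss
    optimizerD.step()
    for p in D.parameters():
        p.data.clamp_(-0.01, 0.01)            # weight clipping as in WGAN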

Can I use autoencoder for clustering?

In the code below, they use an autoencoder for supervised clustering or classification, because they have data labels.
http://amunategui.github.io/anomaly-detection-h2o/
But can I use an autoencoder to cluster data if I do not have labels?
The deep-learning autoencoder is always unsupervised learning. The "supervised" part of the article you link to is to evaluate how well it did.
The following example (taken from ch.7 of my book, Practical Machine Learning with H2O, where I try all the H2O unsupervised algorithms on the same data set - please excuse the plug) takes 563 features, and tries to encode them into just two hidden nodes.
m <- h2o.deeplearning(
  2:564, training_frame = tfidf,   # columns 2:564 hold the 563 features
  hidden = c(2), autoencoder = TRUE, activation = "Tanh"
)
f <- h2o.deepfeatures(m, tfidf, layer = 1)
The second command there extracts the hidden layer features (the activations of the two hidden nodes). f is a data frame, with two numeric columns, and one row for every row in the tfidf source data. I chose just two hidden nodes so that I could plot the clusters.
Results will change on each run. You can (maybe) get better results with stacked auto-encoders, or using more hidden nodes (but then you cannot plot them). Here I felt the results were limited by the data.
BTW, I made the above plot with this code:
d <- as.matrix(f[1:30,]) #Just first 30, to avoid over-cluttering
labels <- as.vector(tfidf[1:30, 1])
plot(d, pch = 17) #Triangle
text(d, labels, pos = 3) #pos=3 means above
(P.S. The original data came from Brandon Rose's excellent article on using NLTK.)
In some respects, encoding data and clustering data share overlapping theory. As a result, you can use autoencoders to cluster (encode) data.
A simple example to visualize is a set of training data that you suspect has two primary classes, such as voter history data for Republicans and Democrats. If you take an autoencoder, encode the data to two dimensions, and then plot it on a scatter plot, the clustering becomes clearer. Below is a sample result from one of my models; you can see a noticeable split between the two classes, as well as a bit of expected overlap.
The code can be found here
This method does not require only two binary classes; you could train on as many different classes as you wish. Two polarized classes are just easier to visualize.
This method is not limited to two output dimensions either; that was just for plotting convenience. In fact, you may find it difficult to meaningfully map certain large-dimensional spaces to such a small space.
In cases where the encoded (clustered) layer has a larger dimension, it is not as easy to "visualize" feature clusters. This is where it gets a bit more difficult, as you'll have to use some form of supervised learning to map the encoded (clustered) features to your training labels.
A couple of ways to determine which class the features belong to are to feed the data into a knn-clustering algorithm, or, what I prefer to do, take the encoded vectors and pass them to a standard back-propagation neural network. Note that, depending on your data, you may find that just feeding the data straight into your back-propagation neural network is sufficient.
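For a PyTorch reader, here is a minimal sketch of the general recipe described above (the layer sizes, the two-dimensional bottleneck, the toy data, and the use of k-means are illustrative choices, not taken from either answer): train an autoencoder on reconstruction, then cluster the bottleneck codes.
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

x = torch.randn(200, 20)  # toy data: 200 samples with 20 features

encoder = nn.Sequential(nn.Linear(20, 8), nn.Tanh(), nn.Linear(8, 2))  # 2-D bottleneck
decoder = nn.Sequential(nn.Linear(2, 8), nn.Tanh(), nn.Linear(8, 20))
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

for epoch in range(100):  # plain reconstruction training
    optimizer.zero_grad()
    codes = encoder(x)
    loss = nn.functional.mse_loss(decoder(codes), x)
    loss.backward()
    optimizer.step()

with torch.no_grad():
    codes = encoder(x).numpy()  # the learned 2-D "deep features"
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(codes)  # unsupervised grouping
print(clusters[:10])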