Difference between WGAN and WGAN-GP (Gradient Penalty) - deep-learning

I just found that in the code here:
https://github.com/NUS-Tim/Pytorch-WGAN/tree/master/models
the generator loss G is computed differently for WGAN and WGAN-GP. For WGAN:
g_loss = self.D(fake_images)
g_loss = g_loss.mean().mean(0).view(1)
g_loss.backward(one) # !!!
g_cost = -g_loss
But for WGAN-GP:
g_loss = self.D(fake_images)
g_loss = g_loss.mean()
g_loss.backward(mone) # !!!
g_cost = -g_loss
Why does one use one = 1 and the other mone = -1?

You might have misread the source code: the first sample you gave does not average the result of D to compute its loss, but instead uses the binary cross-entropy.
To be more precise:
The first method ("GAN") uses the BCE loss to compute the loss terms for D and G. The standard GAN objective for D is to maximize E_x[log(D(x))] + E_z[log(1-D(G(z)))], which the code does by minimizing the corresponding BCE losses. Source code:
outputs = self.D(images)
d_loss_real = self.loss(outputs.flatten(), real_labels) # <- bce loss
real_score = outputs
# Compute BCELoss using fake images
fake_images = self.G(z)
outputs = self.D(fake_images)
d_loss_fake = self.loss(outputs.flatten(), fake_labels) # <- bce loss
fake_score = outputs
# Optimize discriminator
d_loss = d_loss_real + d_loss_fake
self.D.zero_grad()
d_loss.backward()
self.d_optimizer.step()
For d_loss_real you optimize towards 1s (output is considered real), while d_loss_fake optimizes towards 0s (output is considered fake).
The second method ("WCGAN") uses the Wasserstein loss (ref), whereby D is trained to maximize E_x[D(x)] - E_z[D(G(z))]. Source code:
# Train discriminator
# WGAN - Training discriminator more iterations than generator
# Train with real images
d_loss_real = self.D(images)
d_loss_real = d_loss_real.mean()
d_loss_real.backward(mone)
# Train with fake images
z = self.get_torch_variable(torch.randn(self.batch_size, 100, 1, 1))
fake_images = self.G(z)
d_loss_fake = self.D(fake_images)
d_loss_fake = d_loss_fake.mean()
d_loss_fake.backward(one)
# [...]
Wasserstein_D = d_loss_real - d_loss_fake
By calling d_loss_real.backward(mone) you backpropagate with a gradient of opposite sign, i.e. it is gradient ascent, and you end up maximizing d_loss_real.
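As a quick sanity check (not from the linked repo), you can verify that passing -1 as the gradient to backward simply flips the sign of the accumulated gradients, i.e. it is equivalent to negating the scalar loss before calling backward:
import torch
x = torch.tensor([2.0], requires_grad=True)
loss = (x ** 2).mean()
loss.backward(torch.tensor(-1.0))   # same as (-loss).backward()
print(x.grad)                       # tensor([-4.]) instead of the usual tensor([4.])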

To update the D network:
lossD = E[D(fake data)] - E[D(real data)] + gradient penalty
Minimizing lossD pushes D(real data) up, so you need to pass minus one into the backward call for the real-data term.
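For reference, here is a minimal sketch of how such a critic loss with gradient penalty is typically computed; this is an illustrative implementation with assumed inputs (D, real, fake, lambda_gp), not the exact code of the linked repository:
import torch

def critic_loss(D, real, fake, lambda_gp=10.0):
    fake = fake.detach()  # the critic update should not backpropagate into G
    # random interpolation between real and fake samples
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    d_interp = D(interp)
    # gradient of D w.r.t. the interpolated samples
    grads = torch.autograd.grad(outputs=d_interp, inputs=interp,
                                grad_outputs=torch.ones_like(d_interp),
                                create_graph=True)[0]
    gp = ((grads.view(grads.size(0), -1).norm(2, dim=1) - 1) ** 2).mean()
    # lossD = E[D(fake)] - E[D(real)] + lambda * gradient penalty
    return D(fake).mean() - D(real).mean() + lambda_gp * gp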

Related

PyTorch: Multi-class segmentation loss value != 0 when using target image as the prediction

I was performing semantic segmentation using PyTorch. There are a total of 103 different classes in the dataset, and the targets are RGB images with only the red channel containing the labels. I was using nn.CrossEntropyLoss as my loss function. For sanity, I wanted to check whether using nn.CrossEntropyLoss is correct for this problem and whether it has the expected behaviour.
I pick a random mask from my dataset and create a categorical version of it using this custom transform:
class ToCategorical:
def __init__(self, n_classes: int) -> None:
self.n_classes = n_classes
def __call__(self, sample: torch.Tensor):
mask = sample.permute(1, 2, 0)
categories = torch.unique(mask).tolist()[1:] # get all categories other than 0
# build a tensor with `n_classes` channels
one_hot_image = torch.zeros(self.n_classes, *mask.shape[:-1])
for category in categories:
# get spacial locs where the categ is present
rows, cols, _ = torch.where(mask == category)
# in same spacial loc but in `categ` channel fill 1
one_hot_image[category, rows, cols] = 1
return one_hot_image
I then send this image as the output (prediction) and use the ground-truth mask as the target for the loss function.
import torch.nn as nn
mask = T.PILToTensor()(Image.open("path_to_image").convert("RGB"))
categorical_mask = ToCategorical(103)(mask).unsqueeze(0)
mask = mask[0].unsqueeze(0) # get only the red channel, add fake batch_dim
loss_fn = nn.CrossEntropyLoss()
target = mask
output = categorical_mask
print(output.shape, target.shape)
print(loss_fn(output, target.to(torch.long)))
I expected the loss to be zero, but to my surprise the output is as follows:
torch.Size([1, 103, 600, 800]) torch.Size([1, 600, 800])
tensor(4.2836)
I verified with other samples in the dataset and I obtained similar values for other masks as well. Am I doing something wrong? I expect the loss to be = 0 when the output is the same as the target.
PS: I also know that nn.CrossEntropyLoss is the same as using log_softmax followed by nn.NLLLoss(), but I obtained the same value using NLLLoss as well.
For Reference
Dataset used: UECFoodPixComplete
I would like to address this:
I expect the loss to be = 0 when the output is the same as the target.
Even if the prediction matches the target, i.e. the prediction corresponds to a one-hot encoding of the labels contained in the dense target tensor, the loss itself is not supposed to equal zero. In fact, it can never be equal to zero, because the nn.CrossEntropyLoss function is strictly positive by definition.
Let us take a minimal example with #C classes, a target y_true, and a prediction y_pred consisting of perfect predictions:
As a quick reminder:
The softmax is applied on the logits q_i as p_i = exp(q_i) / sum_j(exp(q_j)):
>>> p = F.softmax(y_pred, 1)
Similarly if you are using the log-softmax, defined as logp_i = log(p_i):
>>> logp = F.log_softmax(y_pred, 1)
Then comes the negative log-likelihood, computed between the input x and the target y: -y*x. In combination with the softmax it comes down to -y*p, or -y*logp respectively. In either case, whether you apply the log or not, only the predictions corresponding to the true classes remain, since the other ones are zeroed out.
That being said, applying NLLLoss directly on y_pred would indeed result in 0, as you expected in your question. However, here it is applied on the probability distribution or log-probability: p, or logp respectively!
In our specific case, the logit q_i = 1 for the true class and q_i = 0 for all other classes (there are #C - 1 of those). The softmax of the logit associated with the true class is therefore exp(1)/sum_i(exp(q_i)), and sum_i(exp(q_i)) = (#C - 1)*exp(0) + exp(1). We therefore have:
softmax(q)_true = e / (#C - 1 + e)
Similarly for log-softmax:
log-softmax(q)_true = log(e / (#C - 1 + e)) = 1 - log(#C - 1 + e)
If we now apply the negative log-likelihood, we get cross-entropy(y_pred, y_true) = (nllloss o log-softmax)(y_pred, y_true). This results in:
loss = - (1 - log(#C - 1 + e)) = log(#C - 1 + e) - 1
This is effectively the minimum of the nn.CrossEntropyLoss function when the logits are a one-hot encoding.
Regarding your specific case where #C = 103, you may have an issue in your code... since the average loss should equal log(102 + e) - 1, i.e. around 3.65.
>>> y_true = torch.randint(0,103,(1,1,2,5))
>>> y_pred = torch.zeros(1,103,2,5).scatter(1, y_true, value=1)
You can see for yourself with one of the provided methods:
the builtin function nn.functional.cross_entropy:
>>> F.cross_entropy(y_pred, y_true[:,0])
tensor(3.6513)
manually computing the quantity:
>>> logp = F.log_softmax(y_pred, 1)
>>> -logp.gather(1, y_true).mean()
tensor(3.6513)
analytical result:
>>> from math import log, e
>>> log(102 + e) - 1
3.6513
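For completeness, here is a self-contained version of the same check (assuming 103 classes and random dense targets); it also illustrates that the loss only approaches 0 when the one-hot logits are scaled up, and never reaches it exactly:
import torch
import torch.nn.functional as F
from math import log, e

C = 103
y_true = torch.randint(0, C, (1, 2, 5))                     # dense targets
y_pred = F.one_hot(y_true, C).permute(0, 3, 1, 2).float()   # one-hot logits
print(F.cross_entropy(y_pred, y_true).item())               # ~3.6513
print(log(C - 1 + e) - 1)                                   # analytical minimum
print(F.cross_entropy(100 * y_pred, y_true).item())         # close to 0, never exactly 0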

How to normalize pytorch model output to be in range [0,1]

Let's say I have a model called UNet:
output = UNet(input)
that output is a batch of grayscale images, shape: (batch_size, 1, 128, 128).
What I want to do is to normalize each image to be in range [0,1].
I did it like this:
for i in range(batch_size):
output[i,:,:,:] = output[i,:,:,:]/torch.amax(output,dim=(1,2,3))[i]
Now every image in the output is normalized, but when I'm training such a model, PyTorch claims it cannot calculate the gradients in this procedure, and I understand why.
My question is: what is the right way to normalize the images without killing the backpropagation flow?
something like
output = UNet(input)
output = output.normalize
output2 = some_model(output)
loss = ..
loss.backward()
optimize.step()
My only option right now is adding a sigmoid activation at the end of the UNet, but I don't think that's a good idea.
Update - code (gen2, disc = UNet and discriminator models; est_bias is some output):
Update 2 - code:
with torch.no_grad():
est_bias_for_disc = gen2(input_img)
est_bias_for_disc /= est_bias_for_disc.amax(dim=(1,2,3), keepdim=True)
disc_fake_hat = disc(est_bias_for_disc.detach())
disc_fake_loss = BCE(disc_fake_hat, torch.zeros_like(disc_fake_hat))
disc_real_hat = disc(bias_ref)
disc_real_loss = BCE(disc_real_hat, torch.ones_like(disc_real_hat))
disc_loss = (disc_fake_loss + disc_real_loss) / 2
if epoch<=epochs_till_gen2_stop:
disc_loss.backward(retain_graph=True) # Update gradients
opt_disc.step() # Update optimizer
Then there's separate training:
opt_gen2.zero_grad()
est_bias = gen2(input_img)
est_bias /= est_bias.amax(dim=(1,2,3), keepdim=True)
disc_fake = disc(est_bias)
ADV_loss = BCE(disc_fake, torch.ones_like(disc_fake))
gen2_loss = ADV_loss
gen2_loss.backward()
opt_gen2.step()
You can use the normalize function:
>>> import torch
>>> import torch.nn.functional as F
>>> x = torch.tensor([[3.,4.],[5.,6.],[7.,8.]])
>>> x = F.normalize(x, dim = 0)
>>> print(x)
tensor([[0.3293, 0.3714],
[0.5488, 0.5571],
[0.7683, 0.7428]])
This will give a differentiable tensor as long as the out argument is not used.
You are overwriting the tensor's value because of the indexing on the batch dimension. Instead, you can perform the operation in vectorized form:
output = output / output.amax(dim=(1,2,3), keepdim=True)
The keepdim=True argument keeps the number of dimensions of torch.Tensor.amax's output equal to that of its input, which allows the division to broadcast over each image.
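For completeness, a minimal sketch (with a random tensor standing in for the UNet output) showing that this vectorized per-image max normalization keeps the computation graph intact:
import torch

output = torch.rand(4, 1, 128, 128, requires_grad=True)   # stand-in for UNet(input)
normalized = output / output.amax(dim=(1, 2, 3), keepdim=True)
normalized.sum().backward()       # the backward pass goes through the normalization
print(output.grad is not None)    # True: gradients still flow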

Deep Q Learning - Cartpole Environment

I have trouble understanding the Cartpole code as an example of Deep Q Learning. The DQL Agent part of the code is as follows:
class DQLAgent:
def __init__(self, env):
# parameter / hyperparameter
self.state_size = env.observation_space.shape[0]
self.action_size = env.action_space.n
self.gamma = 0.95
self.learning_rate = 0.001
self.epsilon = 1 # explore
self.epsilon_decay = 0.995
self.epsilon_min = 0.01
self.memory = deque(maxlen = 1000)
self.model = self.build_model()
def build_model(self):
# neural network for deep q learning
model = Sequential()
model.add(Dense(48, input_dim = self.state_size, activation = "tanh"))
model.add(Dense(self.action_size,activation = "linear"))
model.compile(loss = "mse", optimizer = Adam(lr = self.learning_rate))
return model
def remember(self, state, action, reward, next_state, done):
# storage
self.memory.append((state, action, reward, next_state, done))
def act(self, state):
# acting: explore or exploit
if random.uniform(0,1) <= self.epsilon:
return env.action_space.sample()
else:
act_values = self.model.predict(state)
return np.argmax(act_values[0])
def replay(self, batch_size):
# training
if len(self.memory) < batch_size:
return
minibatch = random.sample(self.memory,batch_size)
for state, action, reward, next_state, done in minibatch:
if done:
target = reward
else:
target = reward + self.gamma*np.amax(self.model.predict(next_state)[0])
train_target = self.model.predict(state)
train_target[0][action] = target
self.model.fit(state,train_target, verbose = 0)
def adaptiveEGreedy(self):
if self.epsilon > self.epsilon_min:
self.epsilon *= self.epsilon_decay
In the training section we compute target and train_target. Why do we set train_target[0][action] = target here?
Every prediction made while learning is imperfect, and thanks to the error calculation and backpropagation the predictions should get closer and closer over time. But when we set train_target[0][action] = target, doesn't the error become 0? In that case, how does the network learn?
self.model.predict(state) will return a tensor of shape (1, 2) containing the estimated Q values for each action (in CartPole the action space is {0, 1}).
As you know the Q value is a measure of the expected reward.
By setting train_target[0][action] = target (where target is the expected sum of rewards) it creates a target Q value on which to train the model. By then calling model.fit(state, train_target) it uses that target Q value to train the model to approximate better Q values for each state.
I don't understand why you are saying that the loss becomes 0: the target is set to the current reward plus the discounted estimate of future rewards,
target = reward + self.gamma*np.amax(self.model.predict(next_state)[0])
while the network's own prediction for that action,
self.model.predict(state)[0][action]
is in general different. The loss between the target and the predicted values is what is used to train the model.
Edit - more detailed explanation
(you can ignore the [0] on the predicted values; it just selects the single row of the batch and is unimportant for the explanation)
The target variable is set to the sum between the current reward and the estimated sum of future rewards, or the Q value. Note that this variable is called target but it is not the target of the network, but the target Q value for the chosen action.
The train_target variable is used as what you call the "dataset". It represents the target of the network.
train_target = self.model.predict(state)
train_target[0][action] = target
You can clearly see that:
train_target[<taken action>] = reward + self.gamma*np.amax(self.model.predict(next_state)[0])
train_target[<any other action>] = <prediction from the model>
the loss (mean squared error):
prediction = self.model.predict(state)
loss = (train_target - prediction)^2
For any entry that was not overwritten, the loss is 0. For the one entry that has been set (the taken action), the loss is
(target - prediction[action])^2
or
((reward + self.gamma*np.amax(self.model.predict(next_state)[0])) - self.model.predict(state)[0][action])^2
which is clearly different from 0.
Note that this agent is not ideal. I would strongly recommend using a separate target model instead of creating target Q values this way. Check out this answer for why.
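As a rough illustration of that recommendation (a sketch only, assuming model and target_model are Keras models with the same architecture as above), the replay step would bootstrap from a separate, periodically synced target network:
import numpy as np

def replay_with_target_model(model, target_model, minibatch, gamma=0.95):
    for state, action, reward, next_state, done in minibatch:
        if done:
            target = reward
        else:
            # bootstrap from the frozen target network, not the online model
            target = reward + gamma * np.amax(target_model.predict(next_state)[0])
        train_target = model.predict(state)
        train_target[0][action] = target
        model.fit(state, train_target, verbose=0)

def sync_target_model(model, target_model):
    # copy the online weights into the target network every few episodes
    target_model.set_weights(model.get_weights())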

Network values go to 0 after linear layers

I designed a Graph Attention Network.
However, during the operations inside the layer, the values of the features become equal.
class GraphAttentionLayer(nn.Module):
## in_features = out_features = 1024
def __init__(self, in_features, out_features, dropout):
super(GraphAttentionLayer, self).__init__()
self.dropout = dropout
self.in_features = in_features
self.out_features = out_features
self.W = nn.Parameter(torch.zeros(size=(in_features, out_features)))
self.a1 = nn.Parameter(torch.zeros(size=(out_features, 1)))
self.a2 = nn.Parameter(torch.zeros(size=(out_features, 1)))
nn.init.xavier_normal_(self.W.data, gain=1.414)
nn.init.xavier_normal_(self.a1.data, gain=1.414)
nn.init.xavier_normal_(self.a2.data, gain=1.414)
self.leakyrelu = nn.LeakyReLU()
def forward(self, input, adj):
h = torch.mm(input, self.W)
a_input1 = torch.mm(h, self.a1)
a_input2 = torch.mm(h, self.a2)
a_input = torch.mm(a_input1, a_input2.transpose(1, 0))
e = self.leakyrelu(a_input)
zero_vec = torch.zeros_like(e)
attention = torch.where(adj > 0, e, zero_vec) # most of values is close to 0
attention = F.softmax(attention, dim=1) # all values are 0.0014 which is 1/707 (707^2 is the dimension of attention)
attention = F.dropout(attention, self.dropout)
return attention
The dimension of 'attention' is (707 x 707), and I observed that the values of attention are near 0 before the softmax.
After the softmax, all values are 0.0014 which is 1/707.
I wonder how to keep the values normalized and prevent this situation.
Thanks
Since you say this happens during training, I would assume it is at the start. With random initialization you often get near-identical values at the end of the network early in the training process.
When all values are more or less equal, the output of the softmax will be 1/num_elements for every element, so they sum up to 1 over the dimension you chose. So in your case you get 1/707 for all the values, which suggests that your weights are freshly initialized and the outputs are mostly random at this stage.
I would let it train for a while and observe if this changes.
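As a quick standalone illustration of that point (not taken from your model), a softmax over near-identical values returns a uniform distribution of 1/n per element:
import torch
import torch.nn.functional as F

# 707 logits that are all almost the same value
logits = torch.full((1, 707), 1e-4) + 1e-6 * torch.randn(1, 707)
attention = F.softmax(logits, dim=1)
print(attention.min().item(), attention.max().item())   # both ~1/707 ≈ 0.0014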

Function approximator and q-learning

I am trying to implement Q-learning with an action-value approximation function. I am using openai-gym and the "MountainCar-v0" environment to test my algorithm. My problem is that it does not converge or find the goal at all.
Basically the approximator works as follows: you feed in the 2 features, position and velocity, and one of the 3 actions in a one-hot encoding: 0 -> [1,0,0], 1 -> [0,1,0] and 2 -> [0,0,1]. The output is the action-value approximation Q_approx(s,a) for that specific action.
I know that usually the input is just the state (2 features) and the output layer contains 1 output for each action. The big difference I see is that I have to run the feed-forward pass 3 times (once for each action) and take the max, while in the standard implementation you run it once and take the max over the outputs.
Maybe my implementation is just completely wrong and I am thinking about it wrong. I'll paste the code here; it is a mess but I am just experimenting a bit:
import gym
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation
env = gym.make('MountainCar-v0')
# The mean reward over 20 episodes
mean_rewards = np.zeros(20)
# Feature numpy holder
features = np.zeros(5)
# Q_a value holder
qa_vals = np.zeros(3)
one_hot = {
0 : np.asarray([1,0,0]),
1 : np.asarray([0,1,0]),
2 : np.asarray([0,0,1])
}
model = Sequential()
model.add(Dense(20, activation="relu",input_dim=(5)))
model.add(Dense(10,activation="relu"))
model.add(Dense(1))
model.compile(optimizer='rmsprop',
loss='mse',
metrics=['accuracy'])
epsilon_greedy = 0.1
discount = 0.9
batch_size = 16
# Experience replay containing features and target
experience = np.ones((10*300,5+1))
# Ring buffer
def add_exp(features,target,index):
if index % experience.shape[0] == 0:
index = 0
global filled_once
filled_once = True
experience[index,0:5] = features
experience[index,5] = target
index += 1
return index
# ring-buffer bookkeeping used by add_exp and the replay sampling below
filled_once = False
fill_index = 0
for e in range(0,100000):
obs = env.reset()
old_obs = None
new_obs = obs
rewards = 0
loss = 0
for i in range(0,300):
if old_obs is not None:
# Find q_a max for s_(t+1)
features[0:2] = new_obs
for i,pa in enumerate([0,1,2]):
features[2:5] = one_hot[pa]
qa_vals[i] = model.predict(features.reshape(-1,5))
rewards += reward
target = reward + discount*np.max(qa_vals)
features[0:2] = old_obs
features[2:5] = one_hot[a]
fill_index = add_exp(features,target,fill_index)
# Find new action
if np.random.random() < epsilon_greedy:
a = env.action_space.sample()
else:
a = np.argmax(qa_vals)
else:
a = env.action_space.sample()
obs, reward, done, info = env.step(a)
old_obs = new_obs
new_obs = obs
if done:
break
if filled_once:
samples_ids = np.random.choice(experience.shape[0],batch_size)
loss += model.train_on_batch(experience[samples_ids,0:5],experience[samples_ids,5].reshape(-1))[0]
mean_rewards[e%20] = rewards
print("e = {} and loss = {}".format(e,loss))
if e % 50 == 0:
print("e = {} and mean = {}".format(e,mean_rewards.mean()))
Thanks in advance!
There shouldn't be much difference between feeding the actions as inputs to your network and having them as separate outputs of your network. It does make a huge difference if your states are images, for example, because conv nets work very well with images and there would be no obvious way of integrating the actions into the input.
Have you tried the CartPole balancing environment? It is better suited for testing whether your model is working correctly.
MountainCar is pretty hard. It has no positive reward until you reach the top, which often doesn't happen at all. The model will only start learning something useful once you get to the top once. If you are never getting to the top, you should probably increase the time spent exploring; in other words, take more random actions, a lot more...
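For reference, a minimal sketch of the "standard" architecture you describe, where the state is the only input and the network outputs one Q value per action so that a single forward pass gives all three values (illustrative only; the layer sizes are simply taken from your current model):
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(20, activation="relu", input_dim=2))   # state: position, velocity
model.add(Dense(10, activation="relu"))
model.add(Dense(3))                                    # one linear output per action
model.compile(optimizer="rmsprop", loss="mse")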