My code does not run on multiple GPUs in PyTorch - deep-learning

I used PyTorch's Faster R-CNN implementation to train on a dataset. It works well with one GPU, but I have access to a system with 4 GPUs and want to use all of them. However, when I check GPU usage, only one GPU is busy.
I select device in this manner:
if torch.cuda.is_available() == False and device_name == 'gpu':
    raise ValueError('GPU is not available!')
elif device_name == 'cpu':
    device = torch.device('cpu')
elif device_name == 'gpu':
    if batch_size % torch.cuda.device_count() != 0:
        raise ValueError('Batch size is not divisible by the number of GPUs')
    device = torch.device('cuda')
After that I do this:
# multi GPUs
if torch.cuda.device_count() > 1 and device_name == 'gpu':
    print('=' * 50)
    print('=' * 50)
    print('=' * 50)
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    # dim = 0 [30, xxx] -> [10, ...], [10, ...], [10, ...] on 3 GPUs
    # model = nn.DataParallel(model, device_ids=[i for i in range(torch.cuda.device_count())])
    model = nn.DataParallel(model)
    print('=' * 50)
    print('=' * 50)
    print('=' * 50)

# transfer model to selected device
model.to(device)
I move data to the device in this way:
# iterate over all batches
counter_batches = 0
for images, targets in metric_logger.log_every(data_loader, print_freq, header):
    # transfer tensors to device (GPU, or CPU if not available)
    images = list(image.to(device) for image in images)
    targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
    # in train mode, faster r-cnn gives losses
    loss_dict = model(images, targets)
    # sum of losses
    losses = sum(loss for loss in loss_dict.values())
I do not know what I did wrong.
Also, I get this warning:
/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
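A side note on that warning: nn.DataParallel gathers each replica's scalar losses into a 1-D tensor with one entry per GPU, so summing the gathered values directly produces a vector rather than a scalar. Below is a minimal, self-contained sketch (with a toy loss_dict, not the Faster R-CNN one above) of how such gathered losses could be reduced:
import torch

# Toy stand-in for the gathered loss dict: under nn.DataParallel each replica's
# scalar loss comes back as a 1-D tensor with one entry per GPU, which is
# exactly what the warning above describes.
num_gpus = 4
loss_dict = {
    'loss_classifier': torch.rand(num_gpus, requires_grad=True),
    'loss_box_reg': torch.rand(num_gpus, requires_grad=True),
}

# Reduce each per-GPU vector to a scalar before summing, so `losses` stays a scalar.
losses = sum(loss.mean() if loss.dim() > 0 else loss for loss in loss_dict.values())
print(losses)  # a single scalar; safe to call losses.backward() on during training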

Related

Iterative loss function Autoencoders

I am trying to implement a custom loss function in a Pytorch Autoencoder.
The loss function tries to maximize the cosine similarity between a given output tensor U (a vector) and 100 random vectors J where both U and J have the same dimension of [300]. This is repeated for each batch.
Suppose we have 30 items per batch, then the output tensor is
train_Y.shape = [30,300]
Random_vectors.shape = [30,100,300]
I can implement the loss function in two ways:
All_Y = []
for Y, z_r in zip(train_y, random_vectors):
    Y_cosine_list = []
    for z in z_r:
        cosi = torch.dot(Y, z) / (torch.norm(Y) * torch.norm(z))
        Y_cosine_list.append(cosi)
    All_Y.append(Y_cosine_list)

All_Y = torch.tensor(All_Y).to(device)
train_loss = torch.sum(torch.abs(All_Y)) / dim_0
train_loss = torch.tensor(train_loss.data, requires_grad=True)
or
train_Y = torch.zeros([dim_0, 100])
for i, (Y, z_r) in enumerate(zip(train_Y, random_vectors)):
    for j, z in enumerate(z_r):
        train_Y[i, j] = cos(Y, z)

train_Y = train_Y.to(device)
train_loss = torch.sum(torch.abs(train_Y)) / dim_0
The second one is more elegant and to the point. However, it gives a "CUDA illegal memory access" error. I have checked that the memory is not exceeded in either case. Is there anything wrong with the second implementation?
The first implementation is inelegant and I am not sure that it makes sense from a neural net optimization perspective, but it does not give errors and I am able to complete training for all the epochs.
PS: I have tried encapsulating this code block in a loss_fn method, but I get the same illegal memory access error.
I have tried everything that I could find for the illegal memory access error (changing GPUs, removing a torch.stack block, etc.), but I can't seem to get rid of the problem.
Here is a vectorized way to do it:
import torch
import torch.nn as nn

class CosineLoss(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x, y):
        """
        Args:
            x (torch.tensor): [batchsize, N, M] - tensor.
            y (torch.tensor): [batchsize, M] - tensor.

        Returns:
            torch.tensor: scalar mean cosine loss
        """
        # dot product along dimension 'm', i.e. multiply and sum along 'm'
        dotp = torch.einsum("bm, bnm -> bn", y, x)
        # L2 norm along dimension 'm', combined by broadcasting
        length = torch.norm(y, dim=-1)[:, None] * torch.norm(x, dim=-1)
        # cosine = dot product of unit vectors
        cos = dotp / length
        return cos.mean()

def test():
    b, n, m = 30, 100, 300
    train_Y = torch.randn(b, m, device='cuda')
    random_vectors = torch.randn(b, n, m, requires_grad=True, device='cuda')
    print(f'{random_vectors.grad = }')
    cosineloss = CosineLoss()
    loss = cosineloss(random_vectors, train_Y)
    print(f'{loss = }')
    loss.backward()
    print(f'{random_vectors.grad.shape = }')
References:
einsum
broadcasting
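As a quick sanity check of the einsum (a small illustration with random CPU tensors, not from the original answer), the batched dot product above can be compared against an explicit loop:
import torch

b, n, m = 4, 5, 3
x = torch.randn(b, n, m)
y = torch.randn(b, m)

# einsum: dot product of y[i] with every row x[i, j] along 'm'
dotp = torch.einsum("bm, bnm -> bn", y, x)

# explicit loop computing the same thing
loop = torch.stack([torch.stack([torch.dot(y[i], x[i, j]) for j in range(n)])
                    for i in range(b)])

print(torch.allclose(dotp, loop))  # True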

Calculate output size in CNN

What will be the output size of each layer in the following model?
'''
model = Sequential()
model.add(Conv2D(32, (8, 8), padding='same', strides=(4, 4), input_shape=(80,80,4)))
model.add(Activation('relu'))
model.add(Conv2D(64, (4, 4), padding='same', strides=(2, 2)))
model.add(Activation('relu'))
model.add(Conv2D(64, (3, 3), padding='same', strides=(1, 1)))
model.add(Activation('relu'))
model.add(Flatten())
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dense(2))
'''
TL;DR: If "output size" means the number of parameters, here's that, plus the memory size of the model and the memory required to train:
-------------------------------------
Layer # of Params
=====================================
Conv2D 544
-------------------------------------
Conv2D 8256
-------------------------------------
Conv2D 36928
-------------------------------------
Dense 13107712
-------------------------------------
Dense 1026
=====================================
Total # of Params: 13154466
Model Size: 59.47 MiB
Memory Required to Train: 1.95 GiB
-------------------------------------
and here are the relevant equations to compute each:
Model:
2D Conv Layer:
nw = kw * kh * d * d_prev
nb = d
n = nw + nb
Dense Layer:
nw = no * ni
nb = no
n = nw + nb
Training:
2D Conv Layer:
n = wi * hi * dl
Dense Layer:
n = no
Legend:
nw = Number of Weight Parameters
nb = Number of Bias Parameters
kw = Kernel Width
kh = Kernel Height
d = Depth of Current Layer
d_prev = Depth of Previous Layer
no = Number of outputs
ni = Number of inputs
nt = Number of training parameters
n = Total number of parameters
For a batch size of 1, 59.47 MiB to train. For a batch size of 32, 1.95 GiB to train.
A great resource is Memory Usage Computational Considerations, by Kevin McGuinness. You can watch his presentation on YouTube here. He gives a link to the slides in the info portion of the YouTube post, but they are here for your reference (look for D2L1, Day 2, Lecture 1).
It depends on four things:
The size of your data types,
The number of parameters in your model,
The number of parameters required to store training outputs, and
Your batch size
By default, TensorFlow uses 32-bit floating-point data types (these are 4 bytes in size, since there are 8 bits to a byte).
Here's the code I wrote to calculate it. It's pretty much the same as what Keras will output, but it also includes the memory requirements.
Source Code:
def model_params_Conv2D(d_n, k_w, k_h, d_n_prev, s_x=1, s_y=1):
    """Calculate the number of model parameters in a 2D convolution layer

    Args:
        d_n ([int]): Depth of current layer
        k_w ([int]): Kernel width
        k_h ([int]): Kernel height
        d_n_prev ([int]): Depth of previous layer
        s_x ([int]): Strides in x-direction
        s_y ([int]): Strides in y-direction

    Returns:
        [int]: Number of layer parameters
    """
    n_w = d_n * k_w * k_h * d_n_prev // (s_x * s_y)  # Number of weight parameters
    n_b = d_n  # Number of bias parameters
    return n_w + n_b

def model_params_Dense(n_o, n_i):
    """Calculate the number of model parameters in a dense layer

    Args:
        n_o ([int]): Number of outputs
        n_i ([int]): Number of inputs

    Returns:
        [int]: Number of layer parameters
    """
    n_w = n_o * n_i  # Number of weight parameters
    n_b = n_o  # Number of bias parameters
    return n_w + n_b
def training_params_Conv2D(w_i, h_i, d_l):
    """Calculate the number of training parameters in a 2D convolution layer

    Args:
        w_i (int): Input width
        h_i (int): Input height
        d_l (int): Layer depth
    """
    return w_i * h_i * d_l

def training_params_Dense(n_o):
    """Calculate the number of training parameters in a Dense layer

    Args:
        n_o (int): Number of outputs
    """
    return n_o
def memory_requirement(n_p, m_dt=4):
    """Size of neural network model in bytes

    Args:
        n_p ([int]): Number of parameters
        m_dt ([int]): Memory size of data type in bytes

    Returns:
        [int]: Memory consumption in bytes
    """
    return n_p * m_dt

def SI2ibi(mem):
    """Convert from SI bytes to ibibytes

    Computers use powers of 2, so memory is represented in ibibytes, but
    SI prefixes are powers of 1000:
        kibi (KiB) = (2^10)^1, kilo (KB) = (10^3)^1 (1.024 KiB = 1 KB)
        mebi (MiB) = (2^10)^2, mega (MB) = (10^3)^2 (1.048576 MiB = 1 MB)
        gibi (GiB) = (2^10)^3, giga (GB) = (10^3)^3 (1.073741824 GiB = 1 GB)

    Args:
        mem ([int]): Memory size in bytes
    """
    KB = 10 ** 3
    MB = KB ** 2
    GB = KB ** 3
    KB2KiB = 1 / 1.024
    MB2MiB = 1 / 1.048576
    GB2GiB = 1 / 1.073741824
    if mem >= GB:
        mem /= GB * GB2GiB
        units = "GiB"
    elif mem >= MB:
        mem /= MB * MB2MiB
        units = "MiB"
    else:  # mem >= KB
        mem /= KB * KB2KiB
        units = "KiB"
    return mem, units
if __name__ == "__main__":
    # NOTE: Activation layers don't require any parameters. Use depth of
    # input as d_n_prev of first layer.
    input_shape = (80, 80, 4)
    w_i = input_shape[0]  # Input width
    h_i = input_shape[1]  # Input height
    d_i = input_shape[2]  # Input depth

    conv01_params = model_params_Conv2D(
        d_n=32, k_w=8, k_h=8, d_n_prev=d_i, s_x=4, s_y=4
    )
    conv02_params = model_params_Conv2D(d_n=64, k_w=4, k_h=4, d_n_prev=32, s_x=2, s_y=2)
    conv03_params = model_params_Conv2D(d_n=64, k_w=3, k_h=3, d_n_prev=64)
    dense01_params = model_params_Dense(n_i=w_i * h_i * d_i, n_o=512)
    dense02_params = model_params_Dense(n_i=512, n_o=2)
    num_model_params = (
        conv01_params + conv02_params + conv03_params + dense01_params + dense02_params
    )

    header_ = "Layer\t\t\t# of Params"
    len_header_ = len(repr(header_.expandtabs()))
    bar_eq_ = "=" * len_header_
    bar_dash_ = "-" * len_header_

    num_training_params = training_params_Conv2D(w_i, h_i, 32)
    num_training_params += training_params_Conv2D(w_i, h_i, 64)
    num_training_params += training_params_Conv2D(w_i, h_i, 64)
    num_training_params += training_params_Dense(512)
    num_training_params += training_params_Dense(2)

    model_memory = memory_requirement(num_model_params)
    training_memory = memory_requirement(num_training_params)
    total_memory = model_memory + training_memory
    batch_size = 32
    mem, units = SI2ibi(total_memory)
    mem32, units32 = SI2ibi(total_memory * batch_size)

    print(f"{bar_dash_}")
    print(f"{header_}")
    print(f"{bar_eq_}")
    print(f"Conv2D\t\t\t{conv01_params}")
    print(f"{bar_dash_}")
    print(f"Conv2D\t\t\t{conv02_params}")
    print(f"{bar_dash_}")
    print(f"Conv2D\t\t\t{conv03_params}")
    print(f"{bar_dash_}")
    print(f"Dense\t\t\t{dense01_params}")
    print(f"{bar_dash_}")
    print(f"Dense\t\t\t{dense02_params}")
    print(f"{bar_eq_}")
    print(f"Total # of Params: {num_model_params}")
    print(f"Model Size: {mem:.2f} {units}")
    print(f"Memory Required to Train: {mem32:.2f} {units32}")
    print(f"{bar_dash_}")

mIoU for multi-class

I would like to understand how mIoU is calculated for multi-class classification. The formula for each class is
IoU formula
and then the average is done over the classes to get the mIoU. However, I don't understand what happens for the classes that are not represented. The formula becomes a division by 0, so I ignore them and the average is only computed for the classes represented.
The problem is that when a prediction is wrong, the score drops a lot, because another class is added to the average. For instance: in semantic segmentation, the ground truth of an image contains 4 classes (0, 1, 2, 3) and 6 classes are represented over the dataset. The prediction also contains 4 classes (0, 1, 4, 5), but all the items classified as 2 and 3 in the ground truth are classified as 4 and 5 in the prediction. In this case, should we calculate the mIoU over 6 classes, even if 4 classes are totally wrong and their respective IoU is 0? So the problem is that if just one pixel is predicted in a class that is not in the ground truth, we have to divide by a higher denominator, which lowers the score a lot.
Is this the correct way to compute the mIoU for multi-class classification (and semantic segmentation)?
Instead of calculating the mIoU of each image and then averaging over all the images, I calculate the mIoU as if the whole dataset were one big image. If a class is not in the image and is not predicted, I set its IoU equal to 1.
From scratch:
import numpy as np

def miou(gt, pred, nbr_mask):
    # height and width are assumed to be defined globally (the image dimensions)
    intersection = np.zeros(nbr_mask)  # int = (A and B)
    den = np.zeros(nbr_mask)  # den = A + B = (A or B) + (A and B)
    for i in range(len(gt)):
        for j in range(height):
            for k in range(width):
                if pred[i][j][k] == gt[i][j][k]:
                    intersection[gt[i][j][k]] += 1
                den[pred[i][j][k]] += 1
                den[gt[i][j][k]] += 1
    mIoU = 0
    for i in range(nbr_mask):
        if den[i] != 0:
            mIoU += intersection[i] / (den[i] - intersection[i])
        else:
            mIoU += 1
    mIoU = mIoU / nbr_mask
    return mIoU
Here gt is the array of ground-truth labels and pred is the array of predictions for the associated images (they have to correspond in the array and be the same size).
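For reference, here is a vectorized sketch of the same counting scheme using numpy (an illustration under the same conventions: integer label arrays of shape (N, H, W) and IoU = 1 for classes absent from both the ground truth and the prediction):
import numpy as np

def miou_vectorized(gt, pred, nbr_mask):
    gt = gt.ravel()
    pred = pred.ravel()
    # Per-class pixel counts in ground truth, prediction, and their agreement.
    area_gt = np.bincount(gt, minlength=nbr_mask)
    area_pred = np.bincount(pred, minlength=nbr_mask)
    intersection = np.bincount(gt[gt == pred], minlength=nbr_mask)
    union = area_gt + area_pred - intersection
    # Per-class IoU; classes absent from both gt and pred get IoU = 1.
    iou = np.where(union > 0, intersection / np.maximum(union, 1), 1.0)
    return iou.mean()

gt = np.random.randint(0, 4, size=(2, 8, 8))
pred = np.random.randint(0, 4, size=(2, 8, 8))
print(miou_vectorized(gt, pred, nbr_mask=6))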
Adding to the previous answer, this is a great, fast and efficient PyTorch GPU implementation for calculating the mIoU and class-wise IoU for a batch of size (N, H, W) (both the predicted mask and the labels), taken from the NeurIPS 2021 paper "Few-Shot Segmentation via Cycle-Consistent Transformer"; the GitHub repo is available here.
def intersectionAndUnionGPU(output, target, K, ignore_index=255):
    # 'K' classes; output and target sizes are N or N * L or N * H * W, each value in range 0 to K - 1.
    assert (output.dim() in [1, 2, 3])
    assert output.shape == target.shape
    output = output.view(-1)
    target = target.view(-1)
    output[target == ignore_index] = ignore_index
    intersection = output[output == target]
    area_intersection = torch.histc(intersection, bins=K, min=0, max=K-1)
    area_output = torch.histc(output, bins=K, min=0, max=K-1)
    area_target = torch.histc(target, bins=K, min=0, max=K-1)
    area_union = area_output + area_target - area_intersection
    return area_intersection, area_union, area_target
Example usage:
import torch
import torch.nn.functional as F

output = torch.rand(4, 5, 224, 224)  # model output; batch size=4, channels=5, H,W=224
preds = F.softmax(output, dim=1).argmax(dim=1)  # (4, 224, 224)
labels = torch.randint(0, 5, (4, 224, 224))
i, u, _ = intersectionAndUnionGPU(preds, labels, 5)  # 5 is num_classes
classwise_IOU = i / u  # tensor of size (num_classes)
mIOU = i.sum() / u.sum()  # mean IOU; taking (i/u).mean() is wrong
Hope this helps everyone!
(A non-GPU implementation is available as well in the repo!)
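A small usage note (a sketch, not from the paper's repo): for a dataset-level score, you would typically accumulate the intersection and union counts across batches and only divide at the end, e.g.:
import torch

num_classes = 5
total_i = torch.zeros(num_classes)
total_u = torch.zeros(num_classes)

# `loader` is a hypothetical iterable yielding (preds, labels) pairs of shape (N, H, W).
for preds, labels in loader:
    i, u, _ = intersectionAndUnionGPU(preds, labels, num_classes)
    total_i += i.cpu()
    total_u += u.cpu()

classwise_iou = total_i / total_u      # per-class IoU over the whole dataset
score = total_i.sum() / total_u.sum()  # aggregate score, same convention as the batch example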

pytorch nllloss function target shape mismatch

I'm training an LSTM model using PyTorch with a batch size of 256 and NLLLoss() as the loss function.
The loss function is having a problem with the data shape.
The softmax output from the forward pass has shape torch.Size([256, 4, 1181]), where 256 is the batch size, 4 is the sequence length, and 1181 is the vocab size.
The target has shape torch.Size([256, 4]), where 256 is the batch size and 4 is the output sequence length.
When I was testing earlier with a batch size of 1, the model worked fine, but when I increased the batch size, it broke. I read that NLLLoss() can take class targets as input instead of one-hot encoded targets.
Am I misunderstanding it? Or did I not format the shape of the target correctly?
class LSTM(nn.Module):
    def __init__(self, embed_size=100, hidden_size=100, vocab_size=1181, embedding_matrix=...):
        super(LSTM, self).__init__()
        self.hidden_size = hidden_size
        self.word_embeddings = nn.Embedding(vocab_size, embed_size)
        self.word_embeddings.load_state_dict({'weight': torch.Tensor(embedding_matrix)})
        self.word_embeddings.weight.requires_grad = False
        self.lstm = nn.LSTM(embed_size, hidden_size)
        self.hidden2out = nn.Linear(hidden_size, vocab_size)

    def forward(self, tokens):
        batch_size, num_steps = tokens.shape
        embeds = self.word_embeddings(tokens)
        lstm_out, _ = self.lstm(embeds.view(batch_size, num_steps, -1))
        out_space = self.hidden2out(lstm_out.view(batch_size, num_steps, -1))
        out_scores = F.log_softmax(out_space, dim=1)
        return out_scores

model = LSTM(self.config.embed_size, self.config.hidden_size, self.config.vocab_size, self.embedding_matrix)
loss_function = nn.NLLLoss()
optimizer = optim.Adam(model.parameters(), lr=self.config.lr)
Error:
~/anaconda3/lib/python3.7/site-packages/torch/nn/functional.py in nll_loss(input, target, weight, size_average, ignore_index, reduce, reduction)
1846 if target.size()[1:] != input.size()[2:]:
1847 raise ValueError('Expected target size {}, got {}'.format(
-> 1848 out_size, target.size()))
1849 input = input.contiguous().view(n, c, 1, -1)
1850 target = target.contiguous().view(n, 1, -1)
ValueError: Expected target size (256, 554), got torch.Size([256, 4])
Your input shape to the loss function is (N, d, C) = (256, 4, 1181) and your target shape is (N, d) = (256, 4), however, according to the docs on NLLLoss the input should be (N, C, d) for a target of (N, d).
Supposing x is your network output and y is the target, you can compute the loss by transposing the incorrect dimensions of x as follows:
loss = loss_function(x.transpose(1, 2), y)
Alternatively, since NLLLoss is just averaging all the responses anyway, you can reshape x and y to be (N*d, C) and (N*d). This gives the same result without creating temporary copies of your tensors.
loss = loss_function(x.reshape(N*d, C), y.reshape(N*d))
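As a quick, self-contained check (a sketch using random tensors with the shapes from the question), both fixes produce the same loss value:
import torch
import torch.nn as nn
import torch.nn.functional as F

N, d, C = 256, 4, 1181
x = F.log_softmax(torch.randn(N, d, C), dim=-1)  # network output, shape (N, d, C)
y = torch.randint(0, C, (N, d))                  # class-index targets, shape (N, d)

loss_function = nn.NLLLoss()

loss_a = loss_function(x.transpose(1, 2), y)                   # (N, C, d) vs (N, d)
loss_b = loss_function(x.reshape(N * d, C), y.reshape(N * d))  # flattened

print(torch.allclose(loss_a, loss_b))  # True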

Variational auto-encoder: implementing warm-up in Keras

I recently read this paper, which introduces a process called "warm-up" (WU) that consists of multiplying the KL-divergence term of the loss by a variable whose value depends on the epoch number (it evolves linearly from 0 to 1).
I was wondering if this is the right way to do that:
beta = K.variable(value=0.0)

def vae_loss(x, x_decoded_mean):
    # cross entropy
    xent_loss = K.mean(objectives.categorical_crossentropy(x, x_decoded_mean))

    # kl divergence
    for k in range(n_sample):
        epsilon = K.random_normal(shape=(batch_size, latent_dim), mean=0.,
                                  std=1.0)  # used for every z_i sampling
        # Sample several layers of latent variables
        for mean, var in zip(means, variances):
            z_ = mean + K.exp(K.log(var) / 2) * epsilon

            # build z
            try:
                z = tf.concat([z, z_], -1)
            except NameError:
                z = z_
            except TypeError:
                z = z_

            # sum loss (using a MC approximation)
            try:
                loss += K.sum(log_normal2(z_, mean, K.log(var)), -1)
            except NameError:
                loss = K.sum(log_normal2(z_, mean, K.log(var)), -1)
        print("z", z)
        loss -= K.sum(log_stdnormal(z), -1)
        z = None
    kl_loss = loss / n_sample
    print('kl loss:', kl_loss)

    # result
    result = beta * kl_loss + xent_loss
    return result

# define callback to change the value of beta at each epoch
def warmup(epoch):
    value = (epoch / 10.0) * (epoch <= 10.0) + 1.0 * (epoch > 10.0)
    print("beta:", value)
    beta = K.variable(value=value)

from keras.callbacks import LambdaCallback
wu_cb = LambdaCallback(on_epoch_end=lambda epoch, log: warmup(epoch))

# train model
vae.fit(
    padded_X_train[:last_train, :, :],
    padded_X_train[:last_train, :, :],
    batch_size=batch_size,
    nb_epoch=nb_epoch,
    verbose=0,
    callbacks=[tb, wu_cb],
    validation_data=(padded_X_test[:last_test, :, :], padded_X_test[:last_test, :, :])
)
This will not work. I tested it to figure out exactly why it was not working. The key thing to remember is that Keras creates a static graph once at the beginning of training.
Therefore, the vae_loss function is called only once to create the loss tensor, which means that the reference to the beta variable will remain the same every time the loss is calculated. However, your warmup function reassigns beta to a new K.variable. Thus, the beta that is used for calculating loss is a different beta than the one that gets updated, and the value will always be 0.
It is an easy fix. Just change this line in your warmup callback:
beta = K.variable(value=value)
to:
K.set_value(beta, value)
This way the actual value in beta gets updated "in place" rather than creating a new variable, and the loss will be properly re-calculated.
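Putting it together, here is a minimal runnable sketch of the warm-up pattern with the in-place update (a condensed illustration with a toy autoencoder and stand-in loss terms, assuming a Keras 2-style API; the question uses Keras 1 names like nb_epoch, so adjust accordingly):
import numpy as np
from keras import backend as K
from keras.models import Model
from keras.layers import Input, Dense
from keras.callbacks import LambdaCallback

beta = K.variable(0.0)  # created once; the loss below keeps a reference to it

def toy_loss(y_true, y_pred):
    recon = K.mean(K.square(y_true - y_pred))  # stand-in reconstruction term
    kl = K.mean(K.square(y_pred))              # stand-in "KL" term, for illustration only
    return recon + beta * kl

def warmup(epoch):
    value = min(epoch / 10.0, 1.0)  # linear ramp from 0 to 1 over 10 epochs
    print("beta:", value)
    K.set_value(beta, value)        # update in place; do NOT rebind beta to a new variable

inp = Input(shape=(8,))
out = Dense(8)(Dense(4, activation='relu')(inp))
model = Model(inp, out)
model.compile(optimizer='adam', loss=toy_loss)

x = np.random.rand(64, 8).astype('float32')
wu_cb = LambdaCallback(on_epoch_end=lambda epoch, log: warmup(epoch))
model.fit(x, x, epochs=3, batch_size=16, callbacks=[wu_cb], verbose=0)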