PyTorch: Inference on a single very large image using multiple GPUs? - deep-learning

I want to perform inference (i.e. semantic segmentation) on a very large satellite image without splitting it into pieces. I have access to 4 GPUs (each having 15 GBs of memory) and was wondering if it is possible to somehow use all the memory of these GPUs combined (i.e. 60 GB) for inference on the image in PyTorch?

You are looking for model parallelism.
Basically, you can assign different parts of your model to different GPUs, and then you have to take care of the "bookkeeping" yourself.
This solution is very model- and task-specific; therefore, there are no "generic" wrappers for it (as opposed to data parallelism).
For example:
class MyModelParallelNetwork(nn.Module):
    def __init__(self, ...):
        super().__init__()
        # define the network
        self.part_one = ...    # some nn.Module
        self.part_two = ...    # additional nn.Module
        self.part_three = ...
        self.part_four = ...
        # important part - "send" the different parts to different GPUs
        self.part_one.to(torch.device('cuda:0'))
        self.part_two.to(torch.device('cuda:1'))
        self.part_three.to(torch.device('cuda:2'))
        self.part_four.to(torch.device('cuda:3'))

    def forward(self, x):
        # forward through the model parts, moving the data from GPU to GPU:
        p1 = self.part_one(x.to(torch.device('cuda:0')))
        p2 = self.part_two(p1.to(torch.device('cuda:1')))
        p3 = self.part_three(p2.to(torch.device('cuda:2')))
        y = self.part_four(p3.to(torch.device('cuda:3')))
        return y  # result is on the cuda:3 device
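A minimal usage sketch, assuming each part (and the activations for your image size) fits on one 15 GB GPU; load_image_as_tensor is a hypothetical loader returning a CPU tensor, and the argmax assumes a (N, C, H, W) segmentation output:

model = MyModelParallelNetwork(...)     # builds and places the four parts as in __init__
model.eval()

with torch.no_grad():                   # no gradients needed for inference
    x = load_image_as_tensor()          # hypothetical: the satellite image as a CPU tensor
    y = model(x)                        # forward() moves the data across cuda:0 .. cuda:3
    prediction = y.argmax(dim=1).cpu()  # bring the segmentation map back to host memory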

Related

Splitting Dataset over Multiple GPUs

I'm training a large network whose inputs and outputs are 512x512 images. At the moment, I have 2 Tesla A100 GPUs with 40 GB of memory each, and a dataset comprising 10,000 input and output pairs. This adds up to roughly 38 GB of training data, which makes me run out of memory when sending it all to the "cuda" device to create my dataset. I am simply using DataParallel to distribute the training.
How can I split my dataset up over the two GPUs to avoid running out of memory?
Here is my solution. Open to others, especially more memory-efficient options!
import torch
from torch.utils.data import Dataset

to_t = lambda array: torch.tensor(array, device=device)  # `device` is defined elsewhere

class CustomDataset(Dataset):
    def __init__(self, image, label):
        self.image = image
        self.label = label

    def __len__(self):
        return len(self.label)

    def __getitem__(self, idx):
        image = self.image[idx]
        label = self.label[idx]
        return to_t(image).float(), to_t(label).float()
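For completeness, a hedged sketch of how such a dataset might be consumed; images, labels, model, and device are assumed to exist as in the question, and the arrays stay in host RAM until __getitem__ moves each sample to the GPU:

from torch.utils.data import DataLoader

dataset = CustomDataset(images, labels)  # images/labels: NumPy arrays kept in host RAM
loader = DataLoader(dataset, batch_size=8, shuffle=True,
                    num_workers=0)       # workers=0 since __getitem__ touches CUDA

for x, y in loader:
    # x and y are already on `device` because __getitem__ calls to_t(...)
    output = model(x)                    # model assumed to be wrapped in nn.DataParallel
    ...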

PyTorch: different behaviours in GAN training with different, but conceptually equivalent, code

I'm trying to implement a simple GAN in Pytorch. The following training code works:
for epoch in range(max_epochs):  # loop over the dataset multiple times
    print(f'epoch: {epoch}')
    running_loss = 0.0

    for batch_idx, (data, _) in enumerate(data_gen_fn):
        # data preparation
        real_data = data
        input_shape = real_data.shape
        inputs_generator = torch.randn(*input_shape).detach()

        # generator forward
        fake_data = generator(inputs_generator).detach()

        # discriminator forward
        optimizer_generator.zero_grad()
        optimizer_discriminator.zero_grad()

        #################### ALERT CODE #######################
        predictions_on_real = discriminator(real_data)
        predictions_on_fake = discriminator(fake_data)
        predictions = torch.cat((predictions_on_real,
                                 predictions_on_fake), dim=0)
        #########################################################

        # loss discriminator
        labels_real_fake = torch.tensor([1] * batch_size + [0] * batch_size)
        loss_discriminator_batch = criterion_discriminator(predictions,
                                                           labels_real_fake)
        # update discriminator
        loss_discriminator_batch.backward()
        optimizer_discriminator.step()

        # generator
        # zero the parameter gradients
        optimizer_discriminator.zero_grad()
        optimizer_generator.zero_grad()

        fake_data = generator(inputs_generator)         # make fake data again, but without detaching
        predictions_on_fake = discriminator(fake_data)  # D(G(encoding))

        # loss generator
        labels_fake = torch.tensor([1] * batch_size)
        loss_generator_batch = criterion_generator(predictions_on_fake,
                                                   labels_fake)
        loss_generator_batch.backward()  # dL(D(G(encoding)))/dW_{G,D}
        optimizer_generator.step()
If I plot the generated images at each iteration, they look like the real ones, so the training procedure seems to work well.
However, if I change the ALERT CODE part, i.e., instead of:
#################### ALERT CODE #######################
predictions_on_real = discriminator(real_data)
predictions_on_fake = discriminator(fake_data)
predictions = torch.cat((predictions_on_real,
predictions_on_fake), dim=0)
#########################################################
I use the following:
#################### ALERT CODE #######################
predictions = discriminator(torch.cat( (real_data, fake_data), dim=0))
#######################################################
This is conceptually the same (in a nutshell, instead of doing two separate forward passes through the discriminator, one on the real and one on the fake data, and then concatenating the results, the new code first concatenates the real and fake data and then makes a single forward pass on the concatenated batch).
However, this version does not work: the generated images always look like random noise.
Any explanation for this behaviour?
Why do we get different results?
Supplying inputs in the same batch or in separate batches can make a difference if the model includes dependencies between different elements of the batch. By far the most common source of such a dependency in current deep learning models is batch normalization. As you mentioned, the discriminator does include batchnorm, so this is likely the reason for the different behaviours. Here is an example, using single numbers and a batch size of 4:
import numpy as np

features = [1., 2., 5., 6.]
print("mean {}, std {}".format(np.mean(features), np.std(features)))
print("normalized features", (features - np.mean(features)) / np.std(features))
>>>mean 3.5, std 2.0615528128088303
>>>normalized features [-1.21267813 -0.72760688 0.72760688 1.21267813]
Now we split the batch into two parts. First part:
features = [1., 2.]
print("mean {}, std {}".format(np.mean(features), np.std(features)))
print("normalized features", (features - np.mean(features)) / np.std(features))
>>>mean 1.5, std 0.5
>>>normalized features [-1. 1.]
Second part:
features = [5., 6.]
print("mean {}, std {}".format(np.mean(features), np.std(features)))
print("normalized features", (features - np.mean(features)) / np.std(features))
>>>mean 5.5, std 0.5
>>>normalized features [-1. 1.]
As we can see, in the split-batch version, the two batches are normalized to the exact same numbers, even though the inputs are very different. In the joint-batch version, on the other hand, the larger numbers are still larger than the smaller ones as they are normalized using the same statistics.
Why does this matter?
With deep learning, it's always hard to say, and especially with GANs and their complex training dynamics. A possible explanation is that, as we can see in the example above, the separate batches result in more similar features after normalization even if the original inputs are quite different. This may help early in training, as the generator tends to output "garbage" which has very different statistics from real data.
With a joint batch, these differing statistics make it easy for the discriminator to tell the real and generated data apart, and we end up in a situation where the discriminator "overpowers" the generator.
By using separate batches, however, the different normalizations result in the generated and real data looking more similar, which makes the task less trivial for the discriminator and allows the generator to learn.
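The same effect can be reproduced directly with a PyTorch batchnorm layer. This is a minimal sketch using the numbers from the example above; the layer configuration and toy tensors are made up purely for illustration:

import torch
import torch.nn as nn

# A single batchnorm layer in training mode, so statistics are computed per batch.
bn = nn.BatchNorm1d(num_features=1, affine=False)
bn.train()

real = torch.tensor([[1.], [2.]])   # stand-ins for "real" features
fake = torch.tensor([[5.], [6.]])   # stand-ins for "generated" features

# Joint batch: real and fake share one mean/std, so the difference survives.
print(bn(torch.cat((real, fake), dim=0)).flatten())   # ~[-1.21, -0.73, 0.73, 1.21]

# Separate batches: each half is normalized with its own statistics
# and both collapse to roughly [-1, 1].
print(bn(real).flatten())   # ~[-1, 1]
print(bn(fake).flatten())   # ~[-1, 1]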

Incremental Neural Network Training

I am working with a very large dataset of hundreds of long videos to be used for training, and I'm using Google Colab to perform some tests. The code I wrote is quite simple and uses PyTorch.
When I try to perform the training with more than 200 videos at a time, the RAM fills up during training and Colab crashes. I noticed that this does not happen if I train with a lower number of training videos.
For that reason I thought that my model could be trained incrementally, creating a structure as follows:
model = torch.nn.Sequential(  # create a model
    ...
    nn.Softmax(dim=1)
)

MAX_VIDEOS_PER_BATCH = 100

for current_batch in range(0, TOTAL_BATCHES):  # perform TOTAL_BATCHES trainings
    videos = []
    labels = []
    for index, video_file_name in enumerate(os.listdir(VIDEOS_DIR)):  # read 100 videos as the training set
        if index < MAX_VIDEOS_PER_BATCH * current_batch:
            continue
        ...  # read the video and add it to videos
        ...  # add the corresponding labels to the labels list
    video_training = torch.tensor(np.asarray(videos)).float()  # (batch x frames x channels x height x width)

    learning_rate = 1e-4
    for t in range(ITERATIONS):  # train the model; if it was already trained, the model is not reset
        y_pred = model(torch.FloatTensor(np.asarray(video_training)))
        loss = loss_fn(y_pred, torch.tensor(labels))
        print("#" + str(t), " loss:" + str(loss.item()))

        model.zero_grad()
        loss.backward()
        with torch.no_grad():
            for param in model.parameters():
                param -= learning_rate * param.grad
My question is: is this method correct? Am I training the network in the correct manner, or will this chunk-by-chunk approach introduce some damage or bias to the model?
When I go from batch 1 to batch 2, the model won't lose the previously learned knowledge, correct?
This is correct, but the best way is to do the reverse: instead of running all ITERATIONS on one chunk of videos before moving to the next, loop over all chunks within each training iteration, so the model keeps seeing the whole dataset rather than drifting toward the most recently loaded chunk.
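A hedged sketch of what "the reverse" could look like with the code above; load_video_chunk is a hypothetical helper standing in for the elided video-reading code:

for t in range(ITERATIONS):                         # outer loop: training iterations
    for current_batch in range(TOTAL_BATCHES):      # inner loop: visit every chunk of videos
        videos, labels = load_video_chunk(current_batch, MAX_VIDEOS_PER_BATCH)  # hypothetical
        video_training = torch.tensor(np.asarray(videos)).float()

        y_pred = model(video_training)
        loss = loss_fn(y_pred, torch.tensor(labels))

        model.zero_grad()
        loss.backward()
        with torch.no_grad():
            for param in model.parameters():
                param -= learning_rate * param.grad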

How to manually compute the number of FLOPS in backward pass for a CNN like ResNet?

I've been trying to figure out how to compute the number of FLOPs in the backward pass of ResNet. For the forward pass it seems straightforward: apply the conv filters to the input for each layer. But how does one count the FLOPs for the gradient computation and the update of all weights during the backward pass?
Specifically,
how do you compute the FLOPs of the gradient computation for each layer?
which gradients need to be computed, so that the FLOPs for each of them can be counted?
how many FLOPs does computing the gradient of Pool, BatchNorm, and ReLU layers take?
I understand the chain rule for gradient computation, but I'm having a hard time formulating how it would apply to the weight filters in the conv layers of ResNet and how many FLOPs each of those would take. It would be very useful to get any comments about a method to compute the total FLOPs for the backward pass. Thanks.
You can definitely count the number of multiplications and additions in the backward pass manually, but that's an exhausting process for complex models.
Usually, models are benchmarked with FLOPs for a forward pass rather than a backward FLOP count. I guess the reason is that inference matters more in practice when comparing different CNN variants and other deep learning models.
The backward pass only matters during training, and for most simple models the backward and forward FLOPs should differ by a small constant factor.
So I tried a hacky approach: add the gradient computation for the whole ResNet model to the graph, profile the FLOP count for forward plus gradient calculation, and then subtract the forward FLOPs. It's not an exact measurement and may miss many operations for a complex graph/model.
But it should give a reasonable FLOP estimate for most models.
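Before the profiler numbers below, here is a rough back-of-the-envelope sketch of why the backward cost of a convolution is roughly twice its forward cost; the layer shape is an arbitrary illustration (stride 1, "same" padding), not taken from ResNet:

# Rough FLOP estimate for a single conv layer, counting a multiply-add as 2 FLOPs.
c_in, c_out, k, h, w = 64, 128, 3, 56, 56   # hypothetical layer shape

# Forward: one multiply-add per (output pixel, output channel, input channel, kernel tap).
forward_flops = 2 * c_out * c_in * k * k * h * w

# Backward: dL/dx (a "transposed" convolution with the same weights) and dL/dW
# (a correlation of the input with dL/dy), each roughly as expensive as the forward pass.
grad_input_flops = 2 * c_out * c_in * k * k * h * w
grad_weight_flops = 2 * c_out * c_in * k * k * h * w
backward_flops = grad_input_flops + grad_weight_flops

print(forward_flops, backward_flops, backward_flops / forward_flops)  # ratio ~2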
[Following code snippet works with tensorflow 2.0]
import tensorflow as tf

def get_flops():
    for_flop = 0
    total_flop = 0
    session = tf.compat.v1.Session()
    graph = tf.compat.v1.get_default_graph()

    # forward
    with graph.as_default():
        with session.as_default():
            model = tf.keras.applications.ResNet50()  # change your model here
            run_meta = tf.compat.v1.RunMetadata()
            opts = tf.compat.v1.profiler.ProfileOptionBuilder.float_operation()
            # We use the Keras session graph in the call to the profiler.
            flops = tf.compat.v1.profiler.profile(graph=graph,
                                                  run_meta=run_meta, cmd='op', options=opts)
            for_flop = flops.total_float_ops
            # print(for_flop)

    # forward + backward
    with graph.as_default():
        with session.as_default():
            model = tf.keras.applications.ResNet50()  # change your model here
            outputTensor = model.output
            listOfVariableTensors = model.trainable_weights
            gradients = tf.gradients(outputTensor, listOfVariableTensors)
            run_meta = tf.compat.v1.RunMetadata()
            opts = tf.compat.v1.profiler.ProfileOptionBuilder.float_operation()
            # We use the Keras session graph in the call to the profiler.
            flops = tf.compat.v1.profiler.profile(graph=graph,
                                                  run_meta=run_meta, cmd='op', options=opts)
            total_flop = flops.total_float_ops
            # print(total_flop)

    return for_flop, total_flop
for_flops, total_flops = get_flops()
print(f'forward: {for_flops}')
print(f'backward: {total_flops - for_flops}')
Out:
51112224
102224449
forward: 51112224
backward: 51112225

Proper way to extract embedding weights for CBOW model?

I'm currently trying to implement the CBOW model and have managed to get the training and testing working, but I'm facing some confusion as to the "proper" way to finally extract the weights from the model to use as our word embeddings.
Model
class CBOW(nn.Module):
    def __init__(self, config, vocab):
        super().__init__()
        self.config = config  # basic config file to hold arguments
        self.vocab = vocab
        self.vocab_size = len(self.vocab.token2idx)
        self.window_size = self.config.window_size

        self.embed = nn.Embedding(num_embeddings=self.vocab_size, embedding_dim=self.config.embed_dim)
        self.linear = nn.Linear(in_features=self.config.embed_dim, out_features=self.vocab_size)

    def forward(self, x):
        x = self.embed(x)
        x = torch.mean(x, dim=0)  # average out the embedding values
        x = self.linear(x)
        return x
Main process
After I run my model through a Solver with the training and testing data, I basically told the train and test functions to also return the model that's used. Then I assigned the embedding weights to a separate variable and used those as the word embeddings.
Training and testing was conducted using cross entropy loss, and each training and testing sample is of the form ([context words], target word).
def run(solver, config, vocabulary):
    for epoch in range(config.num_epochs):
        loss_train, model_train = solver.train()
        loss_test, model_test = solver.test()

    embeddings = model_train.embed.weight
I'm not sure if this is the correct way of going about extracting and using the embeddings. Is there usually another way to do this? Thanks in advance.
Yes, model_train.embed.weight will give you a torch tensor that stores the embedding weights. Note, however, that this tensor is still attached to autograd (and carries the latest gradients in its .grad field). If you don't want/need that, model_train.embed.weight.data will give you the weights only.
A more generic option is to call model_train.embed.parameters(). This will give you a generator over all the weight tensors of the layer. In general, a layer can have multiple weight tensors, and weight will give you only one of them. Embedding happens to have only one, so here it doesn't matter which option you use.
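For illustration, a small sketch of how the extracted weights might be used for lookups; model_train and vocab are assumed to come from the training run above, and the example words are placeholders:

import torch
import torch.nn.functional as F

# (vocab_size, embed_dim) tensor, detached from autograd and moved to the CPU
embeddings = model_train.embed.weight.detach().cpu()

def word_vector(word):
    idx = vocab.token2idx[word]   # token2idx assumed from the question's vocab object
    return embeddings[idx]

# e.g. cosine similarity between two (placeholder) words
similarity = F.cosine_similarity(word_vector("king"), word_vector("queen"), dim=0)
print(similarity.item())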