Cross-posted from: https://discuss.pytorch.org/t/how-to-reset-gpu-memory-usage-when-figuring-out-max-batch-size/171348.
I have a function that does binary search to find the maximum batch size that a model can support for a given GPU. Here's an example of the logs:
Batch size 1 succeeded. Increasing to 2...
Batch size 2 succeeded. Increasing to 4...
Batch size 4 succeeded. Increasing to 8...
Batch size 8 succeeded. Increasing to 16...
Batch size 16 succeeded. Increasing to 32...
Batch size 32 succeeded. Increasing to 64...
Batch size 64 succeeded. Increasing to 128...
Batch size 128 succeeded. Increasing to 256...
Batch size 256 succeeded. Increasing to 512...
Batch size 512 failed. Binary searching...
# We start with the bounds as (256 - 50) to 512 to detect the bug
# detailed later in this post
Batch size 359 failed. New bounds: [206, 359]
Batch size 282 failed. New bounds: [206, 282]
Batch size 244 failed. New bounds: [206, 244]
Batch size 225 failed. New bounds: [206, 225]
Batch size 215 failed. New bounds: [206, 215]
Batch size 210 failed. New bounds: [206, 210]
Batch size 208 failed. New bounds: [206, 208]
Batch size 207 failed. New bounds: [206, 207]
Notice something odd about these logs: initially, a batch size of 256 succeeds, yet later, during the binary search, much smaller batch sizes (all the way down to 207) fail.
I think this indicates some sort of bug on my end, where the GPU memory is not being reclaimed properly.
Right now, before each call to the function that does the forward/backward pass, I call torch.cuda.empty_cache(), but I think that might not be enough (a sketch of a fuller cleanup follows the code below).
What else should I be doing to reset the GPU memory state?
Code for reference:
def binary_search_batch_size(cfg: Settings):
    def _is_cuda_oom(e: RuntimeError):
        """Determines whether the error is a CUDA out-of-memory error."""
        return 'CUDA out of memory' in str(e)

    cfg = Settings.parse_obj(cfg)
    model = ModelWrapper(model=cfg.model.model(), out_dim=cfg.model.output_dim).cuda()
    optimizer = cfg.optimizer.create_optimizer(model)

    def run():
        with torch.cuda.amp.autocast(enabled=True):
            torch.cuda.empty_cache()
            run_batch(model, optimizer, shape)

    batch_size = 1
    max_batch_size = 1
    while True:
        try:
            batch_size = max_batch_size
            shape = (batch_size, 3, 224, 224)
            run()
            max_batch_size *= 2
            print(f"Batch size {batch_size} succeeded. Increasing to {max_batch_size}...")
        except RuntimeError as e:
            if not _is_cuda_oom(e):
                raise e
            print(f"Batch size {batch_size} failed. Binary searching...")
            # the 50 acts as a sanity check to make sure we haven't regressed somehow
            low = batch_size // 2 - 50
            high = batch_size
            while low + 1 < high:
                batch_size = (low + high) // 2
                shape = (batch_size, 3, 224, 224)
                try:
                    run()
                    low = batch_size
                    print(f"Batch size {batch_size} succeeded. New bounds: [{low}, {high}]")
                except RuntimeError as e:
                    if not _is_cuda_oom(e):
                        raise e
                    high = batch_size
                    print(f"Batch size {batch_size} failed. New bounds: [{low}, {high}]")
            max_batch_size = low
            break
    return max_batch_size


def run_batch(model, optimizer, shape):
    batch = dict(
        aug1=torch.randint(0, 256, shape, dtype=torch.uint8).cuda(),
        aug2=torch.randint(0, 256, shape, dtype=torch.uint8).cuda(),
    )
    image_logits1, image_logits2 = model(batch)
    loss_val = loss(image_logits1, image_logits2, 0)
    optimizer.zero_grad()
    loss_val.backward()
    optimizer.step()
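For context, here is a sketch of the fuller per-attempt cleanup I am considering; it is an untested idea rather than a known fix, and it assumes the only long-lived objects are model and optimizer (plus deleting the caught exception with del e inside the except block, since the exception's traceback keeps the failing frame's tensors alive):

import gc

def release_memory(model, optimizer):
    # drop gradients without keeping zero-filled tensors around
    optimizer.zero_grad(set_to_none=True)
    # collect Python garbage first so tensors referenced only by dead frames
    # can actually be freed before asking the allocator to release its cache
    gc.collect()
    torch.cuda.empty_cache()
    # reset the bookkeeping so later measurements reflect only the next attempt
    torch.cuda.reset_peak_memory_stats()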
Related
Problem description:
I want to train a GRU using a small dataset with 40 training samples. The input sequence length is 3 and the output sequence length is 1. Below is the configuration:
hidden units: 128
input size(input feature num): 11
layer: 2
dropout: 0.5
bidirectional: False
lr: 0.0001
optimizer: Adam
batch size: 8
Below is my model; a short sketch of how the configuration above maps onto it follows the class. Thanks to Azhar Khan for helping me format the code.
class GRU(nn.Module):
    def __init__(self, GRU_input_num, GRU_hidden_num, GRU_layer_num, dropout, bidirectional, seed):
        super(GRU, self).__init__()
        torch.manual_seed(seed)
        self.GRU_input_num = GRU_input_num
        self.GRU_hidden_num = GRU_hidden_num
        self.GRU_layer_num = GRU_layer_num
        self.dropout = dropout
        self.bidirectional = bidirectional
        self.direction = 2 if self.bidirectional else 1
        if GRU_layer_num > 1:
            self.gru = nn.GRU(GRU_input_num, GRU_hidden_num, GRU_layer_num, dropout=dropout, batch_first=True,
                              bidirectional=bidirectional)
        else:
            self.gru = nn.GRU(GRU_input_num, GRU_hidden_num, GRU_layer_num, batch_first=True,
                              bidirectional=bidirectional)
        self.predict = nn.Linear(in_features=self.GRU_hidden_num * self.direction * self.GRU_layer_num,
                                 out_features=constants.CATEGORY_NUM * constants.BACTERIA_PER_CATEGORY)

    def forward(self, x):
        """GRU forward. Input shape: [batch, input_step, features]"""
        # shape of hidden state: [layer * direction, batch, hidden]
        batch = x.shape[0]
        _, hidden_state = self.gru(x)
        hidden_state = hidden_state.permute(1, 0, 2)  # [batch, layer * direction, hidden]
        hidden_state = hidden_state.contiguous().view(batch, 1, self.GRU_hidden_num * self.direction * self.GRU_layer_num).squeeze(1)  # [batch, layer * direction * hidden_num]
        return self.predict(hidden_state), hidden_state.cpu().detach().numpy()  # [batch, features]
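For reference, a minimal sketch of how the configuration above maps onto the model and optimizer (the seed value is only illustrative; the loss function and training loop are omitted):

model = GRU(GRU_input_num=11, GRU_hidden_num=128, GRU_layer_num=2,
            dropout=0.5, bidirectional=False, seed=42)  # seed value here is illustrative
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# each training batch has shape [8, 3, 11]: batch size 8, 3 time steps, 11 features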
I am curious about how the hidden state changes during training. Here are my observations.
My observations:
Heatmap of hidden state.
I printed a heatmap of the hidden states every 25 epochs. The x axis is the sample index within one batch, the y axis is the hidden state dimension, and the title is the sum of all hidden states. The figures are shown below.
[Figures: hidden-state heatmaps after 25, 75, 150, 300, and 550 epochs]
The sum keeps shrinking as training progresses.
Weights of GRU.
I also took a glance at the GRU weights; below is a snapshot of part of them.
[Figure: snapshot of GRU weights]
Almost all the weights end in 'e-40' (or e-41, etc.), which means nearly every weight is close to zero.
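To put numbers on that observation, here is a minimal sketch of how the weight magnitudes can be inspected (assuming the model instance is named model):

for name, param in model.named_parameters():
    print(f"{name}: mean |w| = {param.data.abs().mean().item():.3e}, "
          f"max |w| = {param.data.abs().max().item():.3e}")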
My question:
Is this a normal phenomenon? If not, what could be the cause of this issue?
Thanks to anyone who can give me some comments on this, and do let me know what other information you need.
I am a newbie in PyTorch, running the WGAN-GP algorithm on Google Colab with the GPU runtime. I encountered the error below. The algorithm works fine on the None runtime, i.e., CPU.
Error generated during training
0%| | 0/3 [00:00<?, ?it/s]
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-18-7e1d4849a60a> in <module>
19 # Calculate gradient penalty on real and fake images
20 # generated by generator
---> 21 gp = gradient_penalty(netCritic, real_image, fake, device)
22 critic_loss = -(torch.mean(critic_real_pred)
23 - torch.mean(critic_fake_pred)) + LAMBDA_GP * gp
<ipython-input-15-f84354d74f37> in gradient_penalty(netCritic, real_image, fake_image, device)
8 # image
9 # interpolated image ← alpha *real image + (1 − alpha) fake image
---> 10 interpolated_image = (alpha*real_image) + (1-alpha) * fake_image
11
12 # calculate the critic score on the interpolated image
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
Snippet of my WGAN-GP code:
def gradient_penalty(netCritic, real_image, fake_image, device=device):
    batch_size, channel, height, width = real_image.shape
    # alpha is selected randomly between 0 and 1
    alpha = torch.rand(batch_size, 1, 1, 1).repeat(1, channel, height, width)
    # interpolated image=randomly weighted average between a real and fake
    # image
    # interpolated image ← alpha *real image + (1 − alpha) fake image
    interpolated_image = (alpha*real_image) + (1-alpha) * fake_image
    # calculate the critic score on the interpolated image
    interpolated_score = netCritic(interpolated_image)
    # take the gradient of the score wrt to the interpolated image
    gradient = torch.autograd.grad(inputs=interpolated_image,
                                   outputs=interpolated_score,
                                   retain_graph=True,
                                   create_graph=True,
                                   grad_outputs=torch.ones_like(interpolated_score)
                                   )[0]
    gradient = gradient.view(gradient.shape[0], -1)
    gradient_norm = gradient.norm(2, dim=1)
    gradient_penalty = torch.mean((gradient_norm - 1)**2)
    return gradient_penalty
n_epochs = 2000
cur_step = 0
LAMBDA_GP = 10
display_step = 50
CRITIC_ITERATIONS = 5
nz = 100

for epoch in range(n_epochs):
    # Dataloader returns the batches
    for real_image, _ in tqdm(dataloader):
        cur_batch_size = real_image.shape[0]
        real_image = real_image.to(device)

        for _ in range(CRITIC_ITERATIONS):
            fake_noise = get_noise(cur_batch_size, nz, device=device)
            fake = netG(fake_noise)
            critic_fake_pred = netCritic(fake).reshape(-1)
            critic_real_pred = netCritic(real_image).reshape(-1)
            # Calculate gradient penalty on real and fake images
            # generated by generator
            gp = gradient_penalty(netCritic, real_image, fake, device)
            critic_loss = -(torch.mean(critic_real_pred)
                            - torch.mean(critic_fake_pred)) + LAMBDA_GP * gp
            netCritic.zero_grad()
            # To make a backward pass and retain the intermediary results
            critic_loss.backward(retain_graph=True)
            optimizerCritic.step()

        # Train Generator: max E[critic(gen_fake)] <-> min -E[critic(gen_fake)]
        gen_fake = netCritic(fake).reshape(-1)
        gen_loss = -torch.mean(gen_fake)
        netG.zero_grad()
        gen_loss.backward()
        # Update optimizer
        optimizerG.step()

        # Visualization code ##
        if cur_step % display_step == 0 and cur_step > 0:
            print(f"Step{cur_step}: GenLoss: {gen_loss}: CLoss: {critic_loss}")
            display_images(fake)
            display_images(real_image)
            gen_loss = 0
            critic_loss = 0
        cur_step += 1
I tried introducing cuda() at lines 10 and 21 indicated in the error output, but it did not work.
Here is one approach to solve this kind of error:
Read the error message and locate the exact line where it occurred:
... in gradient_penalty(netCritic, real_image, fake_image, device)
8 # image
9 # interpolated image ← alpha *real image + (1 − alpha) fake image
---> 10 interpolated_image = (alpha*real_image) + (1-alpha) * fake_image
11
12 # calculate the critic score on the interpolated image
RuntimeError: Expected all tensors to be on the same device,
but found at least two devices, cuda:0 and cpu!
Look for input tensors that have not been properly transferred to the correct device, then look for intermediate tensors that have not been transferred.
Here, alpha is assigned a random tensor, but it is never moved to the GPU!
>>> alpha = torch.rand(batch_size, 1, 1, 1) \
.repeat(1, channel, height, width)
Fix the issue and test:
>>> alpha = torch.rand(batch_size, 1, 1, 1, device=fake_image.device) \
.repeat(1, channel, height, width)
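Since gradient_penalty already receives a device argument, creating alpha directly on that device is an equivalent fix:

>>> alpha = torch.rand(batch_size, 1, 1, 1, device=device) \
            .repeat(1, channel, height, width)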
I'm trying to implement a simple multi-label image classifier using PyTorch Lightning. Here's the model definition:
import torch
from torch import nn
import pytorch_lightning as pl

# creates network class
class Net(pl.LightningModule):
    def __init__(self):
        super().__init__()

        # defines conv layers
        self.conv_layer_b1 = nn.Sequential(
            nn.Conv2d(in_channels=3, out_channels=32,
                      kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Flatten(),
        )

        # passes dummy x matrix to find the input size of the fc layer
        x = torch.randn(1, 3, 800, 600)
        self._to_linear = None
        self.forward(x)

        # defines fc layer
        self.fc_layer = nn.Sequential(
            nn.Linear(in_features=self._to_linear,
                      out_features=256),
            nn.ReLU(),
            nn.Linear(256, 5),
        )

        # defines accuracy metric
        self.accuracy = pl.metrics.Accuracy()
        self.confusion_matrix = pl.metrics.ConfusionMatrix(num_classes=5)

    def forward(self, x):
        x = self.conv_layer_b1(x)
        if self._to_linear is None:
            # does not run fc layer if input size is not determined yet
            self._to_linear = x.shape[1]
        else:
            x = self.fc_layer(x)
        return x

    def cross_entropy_loss(self, logits, y):
        criterion = nn.CrossEntropyLoss()
        return criterion(logits, y)

    def training_step(self, train_batch, batch_idx):
        x, y = train_batch
        logits = self.forward(x)
        train_loss = self.cross_entropy_loss(logits, y)
        train_acc = self.accuracy(logits, y)
        train_cm = self.confusion_matrix(logits, y)
        self.log('train_loss', train_loss)
        self.log('train_acc', train_acc)
        self.log('train_cm', train_cm)
        return train_loss

    def validation_step(self, val_batch, batch_idx):
        x, y = val_batch
        logits = self.forward(x)
        val_loss = self.cross_entropy_loss(logits, y)
        val_acc = self.accuracy(logits, y)
        return {'val_loss': val_loss, 'val_acc': val_acc}

    def validation_epoch_end(self, outputs):
        avg_val_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        avg_val_acc = torch.stack([x['val_acc'] for x in outputs]).mean()
        self.log("val_loss", avg_val_loss)
        self.log("val_acc", avg_val_acc)

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=0.0008)
        return optimizer
The issue is probably not the machine, since I'm using a cloud instance with 60 GB of RAM and 12 GB of VRAM. Whenever I run this model, even for a single epoch, I get an out-of-memory error. On the CPU it looks like this:
RuntimeError: [enforce fail at CPUAllocator.cpp:64] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 1966080000 bytes. Error code 12 (Cannot allocate memory)
and on the GPU it looks like this:
RuntimeError: CUDA out of memory. Tried to allocate 7.32 GiB (GPU 0; 11.17 GiB total capacity; 4.00 KiB already allocated; 2.56 GiB free; 2.00 MiB reserved in total by PyTorch)
Clearing the cache and reducing the batch size did not work. I'm a novice so clearly something here is exploding but I can't tell what. Any help would be appreciated.
Thank you!
Indeed, it's not a machine issue; the model itself is simply unreasonably big. Typically, if you take a look at common CNN models, the fc layers occur near the end, after the inputs already pass through quite a few convolutional blocks (and have their spatial resolutions reduced).
Assuming inputs are of shape (batch, 3, 800, 600), after passing through conv_layer_b1 the feature map has shape (batch, 32, 400, 300) following the MaxPool operation. After flattening, the inputs become (batch, 32 * 400 * 300), i.e., (batch, 3840000).
The immediately following fc_layer thus contains nn.Linear(3840000, 256), which is simply absurd. This single linear layer contains ~983 million trainable parameters! For reference, popular image classification CNNs roughly have 3 to 30 million parameters on average, with larger variants reaching 60 to 80 million. Few ever really cross the 100 million mark.
You can count your model params with this:
def count_params(model):
    return sum(map(lambda p: p.data.numel(), model.parameters()))
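For instance, instantiating the model exactly as defined above and counting its parameters should confirm the estimate (merely building the fc layer may already strain memory, which is exactly the problem):

net = Net()
print(count_params(net))  # on the order of 983 million, dominated by the first Linear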
My advice: 800 x 600 is really a massive input size. Reduce it to something like 400 x 300, if possible. Furthermore, add several convolutional blocks similar to conv_layer_b1, before the FC layer. For example:
def get_conv_block(C_in, C_out):
    return nn.Sequential(
        nn.Conv2d(in_channels=C_in, out_channels=C_out,
                  kernel_size=3, padding=1),
        nn.ReLU(),
        nn.MaxPool2d(kernel_size=2, stride=2)
    )


class Net(pl.LightningModule):
    def __init__(self):
        super().__init__()

        # defines conv layers
        self.conv_layer_b1 = get_conv_block(3, 16)
        self.conv_layer_b2 = get_conv_block(16, 32)
        self.conv_layer_b3 = get_conv_block(32, 64)
        self.conv_layer_b4 = get_conv_block(64, 128)
        self.conv_layer_b5 = get_conv_block(128, 256)

        # passes dummy x matrix to find the input size of the fc layer
        x = torch.randn(1, 3, 800, 600)
        self._to_linear = None
        self.forward(x)

        # defines fc layer
        self.fc_layer = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_features=self._to_linear,
                      out_features=256),
            nn.ReLU(),
            nn.Linear(256, 5)
        )

        # defines accuracy metric
        self.accuracy = pl.metrics.Accuracy()
        self.confusion_matrix = pl.metrics.ConfusionMatrix(num_classes=5)

    def forward(self, x):
        x = self.conv_layer_b1(x)
        x = self.conv_layer_b2(x)
        x = self.conv_layer_b3(x)
        x = self.conv_layer_b4(x)
        x = self.conv_layer_b5(x)
        if self._to_linear is None:
            # does not run fc layer if input size is not determined yet
            self._to_linear = nn.Flatten()(x).shape[1]
        else:
            x = self.fc_layer(x)
        return x
Here, because more conv-relu-pool layers are applied, the input is reduced to a feature map of a much smaller shape, (batch, 256, 25, 18), and the overall number of trainable parameters drops to roughly 30 million.
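As a quick sanity check on that figure, keeping the 800 x 600 input:

# five stride-2 max-pools: 800 // 2**5 = 25, 600 // 2**5 = 18 (floored)
flattened = 256 * 25 * 18            # 115,200 features entering the first Linear
fc_params = flattened * 256 + 256    # ~29.5 million parameters, versus ~983 million before
print(flattened, fc_params)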
I am trying to fit a UNet CNN to a task very similar to image-to-image translation. The input to the network is a binary matrix of size (64, 256) and the output is of size (64, 32). Each column represents the status of a communication channel, where each entry in the column is the status of a subchannel: 1 means the subchannel is occupied and 0 means it is vacant. The horizontal axis represents the flow of time, so the first column is the status of the channel at time slot 1, the second column is the status at time slot 2, and so forth. The task is to predict the status of the channel over the next 32 time slots given the previous 256 time slots, which I treated as image-to-image translation.
The accuracy on the training data is around 90%, while the accuracy on the test data is around 50%. By accuracy here, I mean the average percentage of correct entries in each image (a sketch of this metric follows the samples below). Also, during training the validation loss increases while the training loss decreases, which is a clear sign of overfitting. I have tried most of the regularization techniques and also tried reducing the capacity of the model, but this only reduces the training error while not improving the generalization error. Any advice or ideas? In the next part I included the learning curve for training on 1000 samples, the implementation of the network, and samples from the training and test sets.
Learning curves of training on 1000 samples
3 samples from the training set
3 samples from the test set
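To be concrete about the metric, the "average percentage of correct entries" is computed roughly as follows; the array names and shapes are illustrative, not the exact evaluation code:

import numpy as np

def entry_accuracy(y_true, y_pred, threshold=0.5):
    """y_true, y_pred: arrays of shape (batch, 64, 32, 1); y_true holds 0/1 entries."""
    pred_bits = (y_pred >= threshold).astype(np.uint8)
    return float((pred_bits == y_true).mean())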
Here is the implementation of the network:
# imports assumed by the snippets below (the tf.keras paths would work the same way)
from keras.models import Model
from keras.layers import (Input, Conv2D, Conv2DTranspose, BatchNormalization,
                          LeakyReLU, Activation, Dropout, Concatenate)
from keras.initializers import RandomNormal


# define an encoder block
def define_encoder_block(layer_in, n_filters, batchnorm=True):
    # weight initialization
    init = RandomNormal(stddev=0.02)
    # add downsampling layer
    g = Conv2D(n_filters, (4,4), strides=(2,2), padding='same',
               kernel_initializer=init)(layer_in)
    # conditionally add batch normalization
    if batchnorm:
        g = BatchNormalization()(g, training=True)
    # leaky relu activation
    g = LeakyReLU(alpha=0.2)(g)
    return g


# define a decoder block
def decoder_block(layer_in, skip_in, n_filters, filter_strides, dropout=True, skip=True):
    # weight initialization
    init = RandomNormal(stddev=0.02)
    # add upsampling layer
    g = Conv2DTranspose(n_filters, (4,4), strides=filter_strides, padding='same',
                        kernel_initializer=init)(layer_in)
    # add batch normalization
    g = BatchNormalization()(g, training=True)
    # conditionally add dropout
    if dropout:
        g = Dropout(0.5)(g, training=True)
    if skip:
        g = Concatenate()([g, skip_in])
    # relu activation
    g = Activation('relu')(g)
    return g


# define the standalone generator model
def define_generator(image_shape=(64,256,1)):
    # weight initialization
    init = RandomNormal(stddev=0.02)
    # image input
    in_image = Input(shape=image_shape)
    e1 = define_encoder_block(in_image, 64, batchnorm=False)
    e2 = define_encoder_block(e1, 128)
    e3 = define_encoder_block(e2, 256)
    e4 = define_encoder_block(e3, 512)
    e5 = define_encoder_block(e4, 512)
    e6 = define_encoder_block(e5, 512)
    e7 = define_encoder_block(e6, 512)
    # bottleneck, no batch norm and relu
    b = Conv2D(512, (4,4), strides=(2,2), padding='same', kernel_initializer=init)(e7)
    b = Activation('relu')(b)
    # decoder model
    d1 = decoder_block(b, e7, 512, (1,2))
    d2 = decoder_block(d1, e6, 512, (1,2))
    d3 = decoder_block(d2, e5, 512, (2,2))
    d4 = decoder_block(d3, e4, 512, (2,2), dropout=False)
    d5 = decoder_block(d4, e3, 256, (2,2), dropout=False)
    d6 = decoder_block(d5, e2, 128, (2,1), dropout=False, skip=False)
    d7 = decoder_block(d6, e1, 64, (2,1), dropout=False, skip=False)
    # output
    g = Conv2DTranspose(1, (4,4), strides=(2,1), padding='same', kernel_initializer=init)(d7)
    out_image = Activation('sigmoid')(g)
    # define model
    model = Model(in_image, out_image)
    return model
This issue seems to have existed for a long time, and lots of users are facing it.
stream_executor/cuda/cuda_dnn.cc:444] could not convert BatchDescriptor {count: 0 feature_map_count: 64 spatial: 7 264 value_min: 0.000000 value_max: 0.000000 layout: BatchDepthYX} to cudnn tensor descriptor: CUDNN_STATUS_BAD_PARAM
The message is so cryptic that I do not know what went wrong in my code; however, my code works fine on CPU TensorFlow.
I heard that we can use tf.cond to get around this, but I'm new to tensorflow-gpu, so can someone please help me? My code uses Keras and takes a generator as input; this is to avoid any out-of-memory issues. The generator is built with a while True loop that yields data in some batch size, roughly as sketched below.
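In outline, the generator looks something like this (a generic sketch, not the actual project code):

def batch_generator(x, y, batch_size):
    # loop forever, yielding (features, labels) slices of a fixed batch size;
    # note the last slice of each pass can be shorter than batch_size
    i = 0
    while True:
        yield x[i:i + batch_size], y[i:i + batch_size]
        i += batch_size
        if i >= len(x):
            i = 0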
def resnet_model(bin_multiple):
    # input and reshape
    inputs = Input(shape=input_shape)
    reshape = Reshape(input_shape_channels)(inputs)
    # normal convnet layer (have to do one initially to get 64 channels)
    conv = Conv2D(64, (1, bin_multiple*note_range), padding="same", activation='relu')(reshape)
    pool = MaxPooling2D(pool_size=(1,2))(conv)
    for i in range(int(np.log2(bin_multiple))-1):
        print(i)
        # residual block
        bn = BatchNormalization()(pool)
        re = Activation('relu')(bn)
        freq_range = int((bin_multiple/(2**(i+1)))*note_range)
        print(freq_range)
        conv = Conv2D(64, (1, freq_range), padding="same", activation='relu')(re)
        # add and downsample
        ad = add([pool, conv])
        pool = MaxPooling2D(pool_size=(1,2))(ad)
    flattened = Flatten()(pool)
    fc = Dense(1024, activation='relu')(flattened)
    do = Dropout(0.5)(fc)
    fc = Dense(512, activation='relu')(do)
    do = Dropout(0.5)(fc)
    outputs = Dense(note_range, activation='sigmoid')(do)
    model = Model(inputs=inputs, outputs=outputs)
    return model


model = resnet_model(bin_multiple)
init_lr = float(args['init_lr'])

model.compile(loss='binary_crossentropy',
              optimizer=SGD(lr=init_lr, momentum=0.9),
              metrics=['accuracy', 'mae', 'categorical_accuracy'])
model.summary()

history = model.fit_generator(trainGen.next(), trainGen.steps(), epochs=epochs,
                              verbose=1, validation_data=valGen.next(), validation_steps=valGen.steps(),
                              callbacks=callbacks, workers=8, use_multiprocessing=True)
The problem occurs when your model receives a batch of size 0. In my case, I hit the error because I had 1000 examples and ran on multiple GPUs (2 GPUs) with a batch size of 32. In my graph I split the batch into mini-batches so that each GPU takes 16 examples. At step 31 (31 * 32) I had finished 992 examples, so only 8 examples were left; they went to GPU 1, and GPU 2 ended up with a zero batch size, which is why I received the error above.
I still couldn't solve it and am still searching for a proper solution.
I hope this helps you discover where in your code you received a zero batch size.
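One thing that may help in the meantime is wrapping the data generator so an empty batch is at least detected (or skipped) before it reaches the GPUs; this assumes a Python generator that yields (x, y) tuples, which is not shown above:

def non_empty_batches(gen):
    # yield batches unchanged, but skip any empty batch rather than passing it on
    for x, y in gen:
        if len(x) == 0:
            print("warning: skipping an empty batch")
            continue
        yield x, y

Alternatively, trimming the dataset or choosing the steps so that the number of examples divides evenly by batch_size times the number of GPUs avoids the zero-sized remainder in the first place.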