I am working on a Deep Learning project where I am trying different CNN architectures on CIFAR-10. I've built some custom functions and use nested for-loops to iterate over my different architectures. The problem is that the 12 GB of RAM get close to 100% and I cannot free that space to continue. I would like a solution other than "reset your runtime environment": I want to free that space, given that 12 GB should be enough for what I am doing if it is managed correctly.
What I've done so far:
Added gc.collect() at the end of each training epoch
Added keras.backend.clear_session() after each model is trained
I've also tried to inspect locals() using
import sys

def sizeof_fmt(num, suffix='B'):
    ''' by Fred Cirera, https://stackoverflow.com/a/1094933/1870254, modified'''
    for unit in ['', 'Ki', 'Mi', 'Gi', 'Ti', 'Pi', 'Ei', 'Zi']:
        if abs(num) < 1024.0:
            return "%3.1f %s%s" % (num, unit, suffix)
        num /= 1024.0
    return "%.1f %s%s" % (num, 'Yi', suffix)

for name, size in sorted(((name, sys.getsizeof(value)) for name, value in locals().items()),
                         key=lambda x: -x[1])[:10]:
    print("{:>30}: {:>8}".format(name, sizeof_fmt(size)))
Which yields
xtrain: 1.1 GiB
xtest: 234.4 MiB
_i13: 3.4 KiB
So I cannot understand how the other 10GB are allocated in my current session.
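For context, here is a simplified, self-contained sketch of the kind of nested loop I mean (toy data and toy architectures, not my actual code); the clear_session() and gc.collect() calls mentioned above go at the end of each iteration:

import gc
import numpy as np
from tensorflow import keras

# Toy stand-in for the real CIFAR-10 data and architecture grid.
xtrain = np.random.rand(512, 32, 32, 3).astype("float32")
ytrain = keras.utils.to_categorical(np.random.randint(0, 10, 512), 10)

def build_model(n_filters):
    model = keras.Sequential([
        keras.layers.Conv2D(n_filters, 3, activation="relu", input_shape=(32, 32, 3)),
        keras.layers.GlobalAveragePooling2D(),
        keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy")
    return model

results = []
for n_filters in [16, 32, 64]:          # iterate over architectures
    model = build_model(n_filters)
    model.fit(xtrain, ytrain, epochs=1, batch_size=64, verbose=0)
    results.append(model.evaluate(xtrain, ytrain, verbose=0))
    # Attempts to release memory before the next iteration:
    del model
    keras.backend.clear_session()
    gc.collect()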
I am trying to train a DQN on the OpenAI LunarLander environment. I included an argument parser to control which device I use in different runs (CPU or GPU computing, with PyTorch's to("cpu") or to("dml") calls).
Here is my code:
# Putting networks to either CPU or DML e.g. .to("cpu") for CPU .to("dml") for Microsoft DirectML GPU computing.
self.Q = self.Q.to(self.args.device)
self.Q_target = self.Q_target.to(self.args.device)
However, in pytorch-directml some methods are not supported yet, such as .gather(), .max(), MSE_Loss(), etc. That is why I need to move the data from the GPU to the CPU, do the computations, calculate the loss, and put it back on the GPU for further actions. See below.
Q_targets_next = self.Q_target(next_states.to("cpu")).detach().max(1)[0].unsqueeze(1).to("cpu") # Calculate target value from bellman equation
Q_targets = (rewards.to("cpu") + self.args.gamma * Q_targets_next.to("cpu") * (1-dones.to("cpu"))) # Calculate expected value from local network
Q_expected = self.Q(states).contiguous().to("cpu").gather(1, actions.to("cpu"))
# Calculate loss (on CPU)
loss = F.mse_loss(Q_expected, Q_targets)
self.optimizer.zero_grad()
loss.backward()
self.optimizer.step()
# Put the networks back to DML
self.Q = self.Q.to(self.args.device)
self.Q_target = self.Q_target.to(self.args.device)
The strange thing is this:
The code is bug-free; when I run it with args.device = "cpu" it works perfectly, yet when I run the exact same code with args.device = "dml" the results are terrible and the network does not learn anything.
I noticed that in every iteration the results between CPU and GPU differ just a little bit (around 1e-5), but over many iterations this adds up and the GPU and CPU results end up almost completely different.
What am I missing here? Is there something I need to pay attention to when moving tensors between CPU and GPU? Should I make them contiguous()? Or is this simply a bug in the pytorch-dml library?
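For reference, one way to narrow this down could be to compare a single forward pass on both devices directly. This is just a diagnostic sketch (it assumes the "dml" device string from the snippet above is available via pytorch-directml, and uses a made-up toy network, not my Q-network):

import copy
import torch
import torch.nn as nn

# Diagnostic sketch: run the same weights on the same input on both devices
# and measure how far the outputs drift apart.
torch.manual_seed(0)
net_cpu = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 4))
net_dml = copy.deepcopy(net_cpu).to("dml")

x = torch.randn(32, 8)
out_cpu = net_cpu(x)
out_dml = net_dml(x.to("dml")).to("cpu")

print("max abs difference:", (out_cpu - out_dml).abs().max().item())
print("allclose at 1e-5:", torch.allclose(out_cpu, out_dml, atol=1e-5))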
I'd like to know what exactly the running_mean and running_var are that I can access from nn.BatchNorm2d.
Example code is here where bn means nn.BatchNorm2d.
vector = torch.cat([
    torch.mean(self.conv3.bn.running_mean).view(1), torch.std(self.conv3.bn.running_mean).view(1),
    torch.mean(self.conv3.bn.running_var).view(1), torch.std(self.conv3.bn.running_var).view(1),
    torch.mean(self.conv5.bn.running_mean).view(1), torch.std(self.conv5.bn.running_mean).view(1),
    torch.mean(self.conv5.bn.running_var).view(1), torch.std(self.conv5.bn.running_var).view(1)
])
I couldn't figure out what running_mean and running_var mean from the PyTorch official documentation or the user community.
What do nn.BatchNorm2d.running_mean and nn.BatchNorm2d.running_var mean?
From the original Batchnorm paper:
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, Sergey Ioffe and Christian Szegedy, ICML 2015
Algorithm 1 there shows how the statistics of a given batch are measured.
However, what is kept in memory across batches are the running stats, i.e. the statistics measured iteratively at each batch during training-mode forward passes. The computation of the running mean and running variance is actually quite well explained on the documentation page of nn.BatchNorm2d: the update rule is running_stat = (1 - momentum) * running_stat + momentum * batch_stat.
By default, the momentum coefficient is set to 0.1; it regulates how much of the current batch statistics will affect the running statistics:
closer to 1 means the new running stat is closer to the current batch statistics, whereas
closer to 0 means the current batch stats will not contribute much to updating the new running stats.
It's worth pointing out that BatchNorm2d is applied across spatial dimensions in addition to the batch dimension, of course. Given a batch of shape (b, c, h, w), it will compute the statistics across (b, h, w). This means the running statistics are shaped (c,), i.e. there are as many statistics components as there are input channels (for both mean and variance).
Here is a minimal example:
>>> bn = nn.BatchNorm2d(10)
>>> x = torch.rand(2,10,2,2)
Since track_running_stats is set to True by default on BatchNorm2d, it will track the running stats when forward passes are performed in training mode.
The running mean and variance are initialized to zeros and ones, respectively.
>>> running_mean, running_var = torch.zeros(x.size(1)),torch.ones(x.size(1))
Let's perform inference on bn in training mode and check its running stats:
>>> bn(x)
>>> bn.running_mean, bn.running_var
(tensor([0.0650, 0.0432, 0.0373, 0.0534, 0.0476,
0.0622, 0.0651, 0.0660, 0.0406, 0.0446]),
tensor([0.9027, 0.9170, 0.9162, 0.9082, 0.9087,
0.9026, 0.9136, 0.9043, 0.9126, 0.9122]))
Now let's compute those stats by hand. The batch statistics are taken over the (b, h, w) dimensions, and PyTorch uses the unbiased variance estimate when updating running_var:
>>> momentum = bn.momentum  # 0.1 by default
>>> xmean = x.mean(dim=(0, 2, 3))
>>> xvar = x.var(dim=(0, 2, 3), unbiased=True)
>>> (1-momentum)*running_mean + momentum*xmean
tensor([0.0650, 0.0432, 0.0373, 0.0534, 0.0476,
        0.0622, 0.0651, 0.0660, 0.0406, 0.0446])
>>> (1-momentum)*running_var + momentum*xvar
tensor([0.9027, 0.9170, 0.9162, 0.9082, 0.9087,
        0.9026, 0.9136, 0.9043, 0.9126, 0.9122])
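To double-check, the hand-computed values can be compared against the module's buffers (continuing the example above, just as a sanity check):
>>> torch.allclose((1 - momentum) * running_mean + momentum * xmean, bn.running_mean)
True
>>> torch.allclose((1 - momentum) * running_var + momentum * xvar, bn.running_var)
True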
First, I want to thank anyone who considers reading this question, and I want to apologize if my question sounds naive, and for my poor English.
So currently I'm working on a recommendation system problem, and my approach is to use matrix factorization with implicit feedback using BPR (arXiv:1205.2618). I discovered that training my model (BPRMF) with a large batch size (in this case 4096) resulted in a poorer BPR loss compared to a smaller batch size (1024); see my training log over a few epochs.
I noted that a higher batch size resulted in faster training, as it can utilize GPU memory more efficiently, but the higher loss is something I'm not so willing to trade. As far as I know, a large batch size brings more information to each gradient descent step, so it should help with convergence; usually the problem with a large batch size is memory and resources, not loss.
I have done some research on this and saw that Large Batch Training Results in Poor Generalization (and here is another reference), but in my case the loss was poor during training itself.
My best guess is that using a large batch size and then taking the mean of the loss scales the gradient flowing to the user and item embeddings down by a 1/batch_size coefficient, making it hard to escape local minima during training (see the toy sketch after the code below). Is that the answer in this case? (However, I've seen recent studies suggesting that local minima are not necessarily bad, so ...)
I'd really appreciate anybody helping me answer why a large batch size ends up with such anomalous results.
Side note: this might be another stupid question, but as you can see in the code below, the l2 loss is not normalized by the batch size, so I expected it to at least double or quadruple when I multiply the batch size by 4, but that does not seem to be the case in the log above.
Here is my code
from typing import Tuple

import torch
from torch.nn.parameter import Parameter
import torch.nn.functional as F

from .PretrainedModel import PretrainedModel


class BPRMFModel(PretrainedModel):
    def __init__(self, n_users: int, n_items: int, u_embed: int, l2: float,
                 dataset: str, u_i_pretrained_dir, use_pretrained=0, **kwargs) -> None:
        super().__init__(n_users=n_users, n_items=n_items, u_embed=u_embed, dataset=dataset,
                         u_i_pretrained_dir=u_i_pretrained_dir, use_pretrained=use_pretrained,
                         **kwargs)
        self.l2 = l2
        self.reset_parameters()
        self.items_e = Parameter(self._items_e)
        self.users_e = Parameter(self._users_e)

    def forward(self, u: torch.Tensor, i: torch.Tensor) -> torch.Tensor:
        u = F.embedding(u, self.users_e)
        i = F.embedding(i, self.items_e)
        return torch.matmul(u, i.T)

    def CF_loss(self, u: torch.Tensor, i_pos: torch.Tensor, i_neg: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        # u, i_pos, i_neg shape is [batch_size,]
        u = F.embedding(u, self.users_e)
        i_pos = F.embedding(i_pos, self.items_e)
        i_neg = F.embedding(i_neg, self.items_e)
        pos_scores = torch.einsum("ij,ij->i", u, i_pos)
        neg_scores = torch.einsum("ij,ij->i", u, i_neg)
        # loss = torch.mean(
        #     F.softplus(-(pos_scores - neg_scores))
        # )
        loss = torch.neg(
            torch.mean(
                F.logsigmoid(pos_scores - neg_scores)
            )
        )
        l2_loss = (
            u.pow(2).sum() +
            i_pos.pow(2).sum() +
            i_neg.pow(2).sum()
        )
        return loss, self.l2 * l2_loss

    def get_users_rating_for_each_items(self, u: torch.Tensor, i: torch.Tensor) -> torch.Tensor:
        return self(u, i)

    def save_pretrained(self):
        self._items_e = self.items_e.data
        self._users_e = self.users_e.data
        return super().save_pretrained()
PretrainedModel is just a base class that helps me save and load model weights.
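To make my guess about the mean reduction concrete, here is a toy sketch (made-up names and numbers, completely separate from the model above) that compares the gradient norm on an embedding table when a mean-reduced ranking loss is combined with a summed l2 term, at the two batch sizes I used:

import torch
import torch.nn.functional as F

# Toy sketch, not my training code: a mean-reduced loss keeps the per-example
# gradient contribution ~1/batch_size, while a summed l2 term grows with the
# batch size, so the balance between the two shifts as the batch size grows.
torch.manual_seed(0)
table = torch.randn(1000, 8)

def grad_norm(batch_size):
    w = table.clone().requires_grad_(True)
    idx = torch.randint(0, 1000, (batch_size,))
    rows = F.embedding(idx, w)
    scores = rows.sum(dim=1)                 # stand-in for pos_scores - neg_scores
    ranking = -F.logsigmoid(scores).mean()   # mean over the batch, like CF_loss above
    l2 = rows.pow(2).sum()                   # summed over the batch, like l2_loss above
    (ranking + 1e-4 * l2).backward()
    return w.grad.norm().item()

for bs in (1024, 4096):
    print(bs, grad_norm(bs))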
I really appreciate anyone who bears with me to the end.
Below is the code that configures the TrainingArguments from the Hugging Face transformers library to fine-tune the GPT-2 language model.
training_args = TrainingArguments(
    output_dir="./gpt2-language-model",  # the output directory
    num_train_epochs=100,                # number of training epochs
    per_device_train_batch_size=8,       # batch size for training  #32, 10
    per_device_eval_batch_size=8,        # batch size for evaluation #64, 10
    save_steps=100,                      # model is saved every # steps
    warmup_steps=500,                    # number of warmup steps for the learning rate scheduler
    prediction_loss_only=True,
    metric_for_best_model="eval_loss",
    load_best_model_at_end=True,
    evaluation_strategy="epoch",
    learning_rate=0.00004,               # learning rate
)

early_stop_callback = EarlyStoppingCallback(early_stopping_patience=3)

trainer = Trainer(
    model=gpt2_model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    callbacks=[early_stop_callback],
)
The number of epochs is 100, the learning_rate is 0.00004, and early stopping is configured with a patience value of 3.
The model ran for 5/100 epochs and I noticed that the difference in loss value is negligible. The latest checkpoint is saved as checkpoint-latest.
Can I now modify the learning_rate, say to 0.01 from 0.00004, and resume the training from the latest saved checkpoint, checkpoint-latest? Would doing that be efficient?
Or, to train with the new learning_rate value, should I start the training from the beginning?
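(To be concrete, by "resume" I mean something like the call below. The resume_from_checkpoint argument is my assumption about the right way to do this; depending on the transformers version it may be named resume_from_checkpoint or model_path, and the path is just the latest checkpoint inside the output_dir above.)

# Hypothetical resume call (argument name assumed, see note above):
trainer.train(resume_from_checkpoint="./gpt2-language-model/checkpoint-latest")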
No, you don't have to restart your training.
Changing the learning rate is like changing how big a step your model takes in the direction determined by your loss function.
You can also think of it as transfer learning, where the model has some experience (no matter how little or irrelevant) and its weights are most likely in a better state than randomly initialised ones.
As a matter of fact, changing the learning rate mid-training is considered an art in deep learning, and you should only change it if you have a very good reason to do so.
You would probably want to write down when (and why, what, etc.) you did it in case you or someone else wants to reproduce the results of your model.
Pytorch provides several methods to adjust the learning_rate: torch.optim.lr_scheduler.
Check the docs for usage https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
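As an illustration of the PyTorch schedulers (a minimal standalone sketch, not tied to the Trainer setup above), a StepLR scheduler that decays the learning rate by a factor of 10 every 30 epochs looks like this:

import torch

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.00004)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(100):
    # ... run the training and validation steps for this epoch here ...
    optimizer.step()    # update the weights (a no-op here, since no loss was backpropagated)
    scheduler.step()    # then let the scheduler adjust the learning rate
    print(epoch, optimizer.param_groups[0]["lr"])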
I have worked with the alternative optimization MATLAB code before; currently I am trying to get joint learning running. I am able to run the test demo with my Tesla 2070 GPU. For training, I have set all the batch sizes to 1:
__C.TRAIN.IMS_PER_BATCH = 1
__C.TRAIN.BATCH_SIZE = 1
__C.TRAIN.RPN_BATCHSIZE = 1
(updated yaml to 1 as well since it was overridden)
But I still get the out-of-memory error: == cudaSuccess (2 vs. 0) out of memory.
I have tried to experiment with lowering the number of proposals (the originals are below):
train:
# Number of top scoring boxes to keep before applying NMS to RPN proposals
__C.TRAIN.RPN_PRE_NMS_TOP_N = 12000
# Number of top scoring boxes to keep after applying NMS to RPN proposals
__C.TRAIN.RPN_POST_NMS_TOP_N = 2000
test:
# Number of top scoring boxes to keep before applying NMS to RPN proposals
__C.TEST.RPN_PRE_NMS_TOP_N = 6000
# Number of top scoring boxes to keep after applying NMS to RPN proposals
__C.TEST.RPN_POST_NMS_TOP_N = 300
I tried going as low as pre: 100, post: 50 as a sanity check.
And I am still not able to run without the out-of-memory problem. What am I missing here? My Tesla has 5376 MB of dedicated memory and I use it only for this (I have a separate GPU for my screen). I am positive I read from an author himself that 5376 MB should be enough.
Thanks.