PyTorch DirectML computational inconsistency - deep-learning

I am trying to train a DQN on the OpenAI LunarLander environment. I included an argument parser to control which device I use in different runs (CPU or GPU computing, via PyTorch's .to("cpu") or .to("dml") calls).
Here is my code:
# Put the networks on either the CPU or DML, e.g. .to("cpu") for CPU, .to("dml") for Microsoft DirectML GPU computing.
self.Q = self.Q.to(self.args.device)
self.Q_target = self.Q_target.to(self.args.device)
However, pytorch-directml does not yet support some methods, such as .gather(), .max(), MSE_Loss(), etc. That is why I need to move the data from the GPU to the CPU, do the computations, calculate the loss, and put it back on the GPU for further operations. See below.
Q_targets_next = self.Q_target(next_states.to("cpu")).detach().max(1)[0].unsqueeze(1).to("cpu") # Max predicted Q values for next states from the target network
Q_targets = (rewards.to("cpu") + self.args.gamma * Q_targets_next.to("cpu") * (1-dones.to("cpu"))) # Q targets for current states (Bellman equation)
Q_expected = self.Q(states).contiguous().to("cpu").gather(1, actions.to("cpu")) # Expected Q values from the local network for the taken actions
# Calculate loss (on CPU)
loss = F.mse_loss(Q_expected, Q_targets)
self.optimizer.zero_grad()
loss.backward()
self.optimizer.step()
# Put the networks back to DML
self.Q = self.Q.to(self.args.device)
self.Q_target = self.Q_target.to(self.args.device)
The strange thing is this:
The code is bug-free: when I run it with args.device = "cpu" it works perfectly. However, when I run the exact same code with args.device = "dml", the results are terrible and the network does not learn anything.
I noticed that in every iteration the CPU and GPU results differ by only a tiny amount (~1e-5), but over many iterations this adds up, until the GPU and CPU results are almost completely different.
What am I missing here? Is there something I need to pay attention to when moving tensors between the CPU and the GPU? Should I make them contiguous()? Or is this simply a bug in the pytorch-directml library?
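To illustrate the drift, this is roughly the kind of check I use to compare the two backends (a rough sketch; it assumes the Q-network takes the 8-dimensional LunarLander observation, and the device strings follow my setup above):
import copy
import torch

# Run the same input through CPU and DML copies of the Q-network and
# measure how far the outputs differ for a single forward pass.
state = torch.rand(1, 8)                    # LunarLander observations are 8-dimensional

q_cpu = copy.deepcopy(self.Q).to("cpu")
q_dml = copy.deepcopy(self.Q).to("dml")

with torch.no_grad():
    out_cpu = q_cpu(state)
    out_dml = q_dml(state.to("dml")).to("cpu")

print((out_cpu - out_dml).abs().max())      # differences of roughly 1e-5 per forward pass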

Related

Is it possible to use pytorch for real-time input data with post-processing

I'm constructing a projector-camera system, and I want to build radiometric compensation for it using deep learning.
Is it possible to use a network as below? (I guess the gradient does not flow, so the weights will not be updated, but I cannot be sure.)
0. I have a ground-truth image GT; set Input_image = GT
While True:
    1. Encoder-decoder network: projection_image = network(Input_image)
    2. Project projection_image and capture it as Cap
    3. Loss calculation: loss = RMSE(Cap, GT)
    4. Input_image = projection_image
For this situation,
In ordinary deep learning, the loss would be calculated between the direct output of the network (projection_image) and the ground-truth data GT. Of course, that works.
However, in my case I want to calculate the loss between the post-processed network output (network output image -> projection -> capture) and GT.
Here, the post-processing is done on the CPU, so I guess the loss does not affect the network weights. Indeed, in my code the network was not updated.
Is it possible to solve my problem?
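Here is a minimal sketch of why the gradient stops flowing once the output leaves PyTorch (a toy NumPy operation stands in for the project-and-capture step; all names and shapes are illustrative):
import torch
import torch.nn as nn

net = nn.Linear(4, 4)                 # toy stand-in for the encoder-decoder network
GT = torch.rand(1, 4)                 # toy stand-in for the ground-truth image

projection_image = net(GT)
print(projection_image.grad_fn)       # not None: still connected to the autograd graph

# Stand-in for "project and capture": leaving PyTorch (NumPy, a camera driver,
# any CPU-side post-processing) yields a tensor with no autograd history.
Cap = torch.from_numpy(projection_image.detach().numpy() * 0.9)
print(Cap.grad_fn)                    # None: the graph is broken at this point

loss = torch.sqrt(torch.mean((Cap - GT) ** 2))   # RMSE
# loss.backward() would now raise an error, because there is no path
# from the loss back to net's weights, so they can never be updated.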

What do BatchNorm2d's running_mean / running_var mean in PyTorch?

I'd like to know what exactly the running_mean and running_var attributes that I can access on nn.BatchNorm2d are.
Example code is here, where bn is an nn.BatchNorm2d instance.
vector = torch.cat([
    torch.mean(self.conv3.bn.running_mean).view(1), torch.std(self.conv3.bn.running_mean).view(1),
    torch.mean(self.conv3.bn.running_var).view(1), torch.std(self.conv3.bn.running_var).view(1),
    torch.mean(self.conv5.bn.running_mean).view(1), torch.std(self.conv5.bn.running_mean).view(1),
    torch.mean(self.conv5.bn.running_var).view(1), torch.std(self.conv5.bn.running_var).view(1)
])
I couldn't figure out what running_mean and running_var mean from the PyTorch official documentation or the user community.
What do nn.BatchNorm2d.running_mean and nn.BatchNorm2d.running_var mean?
From the original BatchNorm paper:
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, Sergey Ioffe and Christian Szegedy, ICML 2015
You can see in Algorithm 1 how the statistics of a given batch are computed.
However, what is kept in memory across batches are the running stats, i.e. the statistics measured iteratively over each batch that passes through the layer. The computation of the running mean and running variance is actually quite well explained in the documentation page of nn.BatchNorm2d: each running statistic is updated as
running_stat_new = (1 - momentum) * running_stat + momentum * batch_stat
By default, the momentum coefficient is set to 0.1; it regulates how much the current batch statistics affect the running statistics:
closer to 1 means the new running stat is closer to the current batch statistics, whereas
closer to 0 means the current batch stats will not contribute much to updating the new running stats.
It's worth pointing out that BatchNorm2d is applied across the spatial dimensions in addition to the batch dimension, of course. Given a batch of shape (b, c, h, w), it will compute the statistics across (b, h, w). This means the running statistics have shape (c,), i.e. there are as many statistics components as there are input channels (for both mean and variance).
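A quick way to check that per-channel shape (just a small sketch):
>>> import torch
>>> x = torch.rand(2, 10, 4, 4)          # (b, c, h, w)
>>> x.mean(dim=(0, 2, 3)).shape          # statistics are taken across (b, h, w)
torch.Size([10])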
Here is a minimal example:
>>> bn = nn.BatchNorm2d(10)
>>> x = torch.rand(2,10,2,2)
Since track_running_stats is set to True by default on BatchNorm2d, it will update the running stats whenever a forward pass is done in training mode.
The running mean and variance are initialized to zeros and ones, respectively.
>>> running_mean, running_var = torch.zeros(x.size(1)),torch.ones(x.size(1))
Let's perform inference on bn in training mode and check its running stats:
>>> bn(x)
>>> bn.running_mean, bn.running_var
(tensor([0.0650, 0.0432, 0.0373, 0.0534, 0.0476,
0.0622, 0.0651, 0.0660, 0.0406, 0.0446]),
tensor([0.9027, 0.9170, 0.9162, 0.9082, 0.9087,
0.9026, 0.9136, 0.9043, 0.9126, 0.9122]))
Now let's compute those stats by hand (momentum is 0.1; xmean and xvar are the per-channel batch statistics, with running_var updated using the unbiased variance):
>>> momentum = 0.1
>>> xmean = x.mean(dim=(0, 2, 3))
>>> xvar = x.var(dim=(0, 2, 3), unbiased=True)
>>> (1-momentum)*running_mean + momentum*xmean
tensor([0.0650, 0.0432, 0.0373, 0.0534, 0.0476,
        0.0622, 0.0651, 0.0660, 0.0406, 0.0446])
>>> (1-momentum)*running_var + momentum*xvar
tensor([0.9027, 0.9170, 0.9162, 0.9082, 0.9087,
        0.9026, 0.9136, 0.9043, 0.9126, 0.9122])

Is it possible to execute from the point where the neural network model is interrupted?

Assume that I am training a neural network model. I am saving the model's weights every 15 epochs in .pth format.
I need to run 1000 epochs in total. Suppose I stopped my program during the 501st epoch; then I have the following files:
15.pth, 30.pth, 45.pth, 60.pth, 75.pth,.... 420.pth, 435.pth, 450.pth, 465.pth, 480.pth, 495.pth
My question is:
Is it possible to load the last stored model, 495.pth, and continue execution as if training had never been interrupted? In short, I am asking for something like "resuming" the training phase with a few modifications to the existing code; I just want to know whether such a possibility exists.
I am asking about general practice, not about any particular code. If such a method exists, I will be free to stop any program under execution and resume it later. Currently, I cannot use the resources for shorter programs while longer programs are running, hence this question.
In order to resume training from a checkpoint, you need to save the entire state of your training process. This includes:
Current weights of the model.
State of the optimizer: most optimizers keep track of different statistics of the updates, e.g., momentum, variance etc.
State of the learning rate scheduler.
Additional "state" variables unique to your code.
If you saved all this information, you should be able to fully restore the "state" of your training process and resume from that point.
So what I do is the following:
After each epoch I save my model's weights into a .pt file, and each time I run my program I check whether the resume argument is set to True. If so, I initialize the model using the weights in the .pt file and just continue training; if not, I initialize random weights as usual. This could look like this:
def train(resume: bool = False):
    model = Model()
    if resume:
        model.load_state_dict(torch.load("weights.pt"))
    criterion = Loss()
    optimizer = Optimizer()
    for epoch in range(100):
        for data, targets in dataloader:
            optimizer.zero_grad()
            predictions = model.train()(data)
            loss = criterion(predictions, targets)
            loss.backward()
            optimizer.step()
        # Save the weights after each epoch
        torch.save(model.state_dict(), "weights.pt")
So if I interrupt the training, I can still continue from the last epoch that I saved.
Normally you log more than just the weights, for example the state of the learning-rate scheduler or simply the loss and accuracy history. For that, you could save the training history into a JSON file and read it back in if resume is True.
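For example, a fuller checkpoint could look something like this (just a sketch; the optimizer, scheduler and history objects are assumed to exist in your training script):
checkpoint = {
    "epoch": epoch,                             # last finished epoch
    "model_state": model.state_dict(),
    "optimizer_state": optimizer.state_dict(),
    "scheduler_state": scheduler.state_dict(),  # learning-rate scheduler, if you use one
    "history": history,                         # e.g. loss/accuracy per epoch
}
torch.save(checkpoint, "checkpoint.pt")

# ...and when resuming:
checkpoint = torch.load("checkpoint.pt")
model.load_state_dict(checkpoint["model_state"])
optimizer.load_state_dict(checkpoint["optimizer_state"])
scheduler.load_state_dict(checkpoint["scheduler_state"])
start_epoch = checkpoint["epoch"] + 1
history = checkpoint["history"]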

Modifying the Learning Rate in the middle of the Model Training in Deep Learning

Below is the code configuring the TrainingArguments from the HuggingFace transformers library to fine-tune the GPT-2 language model.
training_args = TrainingArguments(
    output_dir="./gpt2-language-model",  # the output directory
    num_train_epochs=100,                # number of training epochs
    per_device_train_batch_size=8,       # batch size for training
    per_device_eval_batch_size=8,        # batch size for evaluation
    save_steps=100,                      # the model is saved every # steps
    warmup_steps=500,                    # number of warmup steps for the learning rate scheduler
    prediction_loss_only=True,
    metric_for_best_model="eval_loss",
    load_best_model_at_end=True,
    evaluation_strategy="epoch",
    learning_rate=0.00004,               # learning rate
)
early_stop_callback = EarlyStoppingCallback(early_stopping_patience=3)

trainer = Trainer(
    model=gpt2_model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    callbacks=[early_stop_callback],
)
The number of epochs is set to 100, the learning_rate to 0.00004, and early stopping is configured with a patience of 3.
The model ran for 5 of the 100 epochs, and I noticed that the change in the loss value is negligible. The latest checkpoint is saved as checkpoint-latest.
Can I now change the learning_rate, say from 0.00004 to 0.01, and resume training from the latest saved checkpoint, checkpoint-latest? Would doing that be efficient?
Or, to train with the new learning_rate value, should I start the training from the beginning?
No, you don't have to restart your training.
Changing the learning rate is like changing how big a step your model takes in the direction determined by your loss function.
You can also think of it as transfer learning, where the model has some experience (no matter how little or irrelevant) and its weights are in a state most likely better than a randomly initialised one.
As a matter of fact, changing the learning rate mid-training is considered an art in deep learning, and you should only do it if you have a very good reason to.
You would probably want to write down when (and why, what, etc.) you did it, in case you or someone else wants to "reproduce" the results of your model.
PyTorch provides several methods to adjust the learning rate via torch.optim.lr_scheduler.
Check the docs for usage: https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
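For example, both ways of changing the learning rate look roughly like this (a sketch with a toy model standing in for the GPT-2 model; the values are purely illustrative):
import torch
import torch.nn as nn

model = nn.Linear(10, 10)                  # toy stand-in for gpt2_model
optimizer = torch.optim.AdamW(model.parameters(), lr=0.00004)

# Option 1: overwrite the learning rate in place before resuming training
for param_group in optimizer.param_groups:
    param_group["lr"] = 0.01

# Option 2: attach a scheduler that adjusts the learning rate over time,
# e.g. multiply it by 0.1 every 5 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)
for epoch in range(10):
    # ... run one epoch of training here ...
    scheduler.step()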

non-linear optimization on the GPU (CUDA) without data transfer latency

I am trying to perform a non-linear optimization problem entirely on the GPU. Computation of the objective function and data transfer from the GPU to CPU are the bottlenecks. To solve this, I want to
heavily parallelize computation of the objective and
perform the entire optimization on the GPU.
More specifically, the problem is as follows in pseudo-code:
x = x0  // initial guess of the vector of unknowns, typically of size ~10,000
for iteration = 1 : max_iter
    D = compute_search_direction(x)
    alpha = compute_step_along_direction(x)
    x = x + D * alpha  // update
end for loop
The functions compute_search_direction(x) and compute_step_along_direction(x) both call the objective function f0(x) dozens of times per iteration. The objective function is a complicated CUDA kernel; basically, it is a forward Bloch simulation (= the set of equations that describes the dynamics of nuclear spins in a magnetic field). The outputs of f0(x) are F (the value of the objective function, a scalar) and DF (the Jacobian, or vector of first derivatives, with the same size as x, i.e. ~10,000). On the GPU, f0(x) is really fast, but transferring x from the CPU to the GPU and then transferring F and DF back from the GPU to the CPU takes a while (~1 second total). Because the function is called dozens of times per iteration, this leads to a pretty slow overall optimization.
Ideally, I would want to have the entire pseudo code above on the GPU. The only solution I can think of now is recursive kernels. The pseudo code above would be the "outer kernel", launched with a number of threads = 1 and a number of blocks = 1 (i.e., this kernel is not really parallel...). This kernel would then call the objective function (i.e., the "inner kernel", this one massively parallel) every time it needs to evaluate the objective function and the vector of first derivatives. Since kernel launches are asynchronous, I can force the GPU to wait until the f0 inner kernel is fully evaluated to move to the next instruction of the outer kernel (using a synchronization point).
In a sense, this is really the same as regular CUDA programming, where the CPU controls kernel launches for evaluation of the objective function f0, except that the CPU is replaced by an outer kernel that is not parallelized (1 thread, 1 block). However, since everything is on the GPU, there is no data transfer latency anymore.
I am now trying the idea out on a simple example to check feasibility. However, this seems quite cumbersome... My questions are:
Does this make any sense to anyone else?
Is there a more direct way to achieve the same result without the added complexity of nested kernels?
It seems you are mixing up "reducing memory transfers between GPU and CPU" and "having the entire code run on the device (i.e. on the GPU)".
In order to reduce memory transfers, you do not need to have the entire code run on GPU.
You can copy your data to the GPU once, and then switch back and forth between GPU code and CPU code. As long as you don't try to access any GPU memory from your CPU code (and vice-versa), you should be fine.
Here's pseudo-code for a correct approach to what you want to do.
// CPU code
cudaMalloc(&x, ...)                              // allocate memory for x on the GPU
cudaMemcpy(x, x0, size, cudaMemcpyHostToDevice)  // copy x0 into the freshly allocated array
cudaMalloc(&D, ...)                              // allocate D and alpha before the loop
cudaMalloc(&alpha, ...)
for iteration = 1 : max_iter
    compute_search_direction<<<...>>>(x, D)      // kernel that does the computation and stores the result in D
    compute_step_along_direction<<<...>>>(x, alpha)
    combine_result<<<...>>>(x, D, alpha)         // x = x + D * alpha
end for loop
// Eventually copy x back to the CPU, if need be
Hope it helps!