what happens to model weights and how does checkpointing work? - deep-learning

I have a basic question about model weights and checkpoints.
When training a model, each layer in the model graph calls a kernel that executes on the GPU. These weights remain on the GPU for the forward pass and the backward pass. Once the weights are updated during the backward pass, where are all the updated weights stored? Are they moved back to CPU memory? When does this move happen?
When checkpointing is done, do we get the weights from CPU memory?
Can someone explain the whole execution flow?

In most cases, the updated weights from the backward pass remain in GPU memory. The weights are typically stored in GPU memory as floating-point tensors, which allows for fast matrix operations and helps optimize training. They are updated on each iteration of the training loop and stay on the GPU until the end of the training process.
When checkpointing is done, the weights are copied off the GPU and saved to disk, either locally or to remote storage if execution is stopped. These weights are usually loaded back into CPU memory when they are needed again. This is the general process, but it can vary with the architecture and hardware.
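A minimal PyTorch sketch of that flow (the model, sizes, and file name are illustrative, not from the question):
import torch
import torch.nn as nn

# Weights live on the GPU while training; checkpointing copies them out and writes them to disk.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(128, 10).to(device)
# ... forward pass, backward pass and optimizer.step() all run on the GPU ...
# Checkpoint: copy the state_dict to CPU memory, then serialize it to disk.
cpu_state = {k: v.cpu() for k, v in model.state_dict().items()}
torch.save(cpu_state, "checkpoint.pt")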

The weights stay on the GPU unless they are explicitly moved somewhere else.
When you save a checkpoint, the weights are serialized to disk using pickle without first being moved to the CPU. That's why if you, for example, pickle a model's state_dict that's on the GPU and try to load it on a system without a GPU, it will fail.
Also note that pickle itself has to copy the data it needs to dump into system RAM and do its processing there, but it doesn't change the underlying attributes of the objects while doing so, so your model's weights are stored in their original form with their attributes intact.
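A short sketch of that behaviour and the usual workaround (the checkpoint path is hypothetical): torch.load accepts a map_location argument that remaps CUDA tensors onto the CPU when loading on a machine without a GPU.
import torch
import torch.nn as nn

model = nn.Linear(128, 10)
# If "checkpoint.pt" was saved while the tensors were still on the GPU, a plain
# torch.load("checkpoint.pt") on a CPU-only machine fails with a CUDA error.
state = torch.load("checkpoint.pt", map_location="cpu")  # remap CUDA tensors to CPU
model.load_state_dict(state)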

Related

Best way to handle batch during training and inference in Pytorch with GPU

I am learning the best ways to manage batches and other best practices during model training and inference and I have the following question:
If I have a batch that I move to the GPU, should I move it back to the CPU after the training step? If not, why?
batch, label = batch.to(device), label.to(device)
model(batch)
# ..Training pass...
batch, label = batch.cpu(), label.cpu()
If I cache my data in my Dataset class, how can I ensure I can reuse the same batches on the GPU to avoid transferring to and from the CPU multiple times?
You shouldn't have to move your data back to the CPU; PyTorch handles data allocation on the GPU. You should use a torch.utils.data.DataLoader to handle loading data from the dataset. However, you will have to send the inputs to the GPU yourself: essentially, every time you need to infer some output, you send your batch and labels to the GPU and compute the result (and the loss); that's all.
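A minimal sketch of that loop (the dataset, model, and sizes are made up):
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
dataset = TensorDataset(torch.randn(1000, 20), torch.randint(0, 2, (1000,)))
loader = DataLoader(dataset, batch_size=32, shuffle=True, pin_memory=True)
model = nn.Linear(20, 2).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

for batch, label in loader:
    batch, label = batch.to(device), label.to(device)  # send inputs to the GPU
    optimizer.zero_grad()
    loss = criterion(model(batch), label)
    loss.backward()
    optimizer.step()
    # no need to move batch/label back to the CPU; they are freed when they go out of scope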

PyTorch: Move Weights Between GPU and CPU on the fly

I have a large architecture which does not fit into GPU memory, but there is a nice property of this architecture where only subsets of the architecture run at any given time for a stretch of time. Therefore, I would like to dynamically load/unload the weights of layers which are not being utilized between the CPU and GPU. How can this be achieved?
The first thing one might try is to call .cpu() or .cuda() on the parameters I wish to move. Unfortunately, that would cause training problems with the optimizer, as stated in the docs:
cuda(device=None)
Moves all model parameters and buffers to the GPU.
This also makes associated parameters and buffers different objects. So it should be called before constructing optimizer if the module will live on GPU while being optimized.
One example use case would be implementing ProxylessNAS; however, only final trained models are available at the time of writing, and the architecture search implementation is not available.
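For what it's worth, here is a minimal sketch of the manual approach the question describes, assuming each sub-network fits on the GPU on its own (the model and sizes are made up):
import torch
import torch.nn as nn

# Hypothetical model where only one branch is active at a time.
class BigModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.branch_a = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU())
        self.branch_b = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU())

device = torch.device("cuda")
model = BigModel()                      # parameters start out in CPU memory

# Stretch of iterations that only uses branch_a:
model.branch_a.to(device)               # load just this subset onto the GPU
x = torch.randn(8, 1024, device=device)
out = model.branch_a(x)
model.branch_a.to("cpu")                # unload it again to free GPU memory
As the docs quote above warns, doing this after the optimizer has been constructed can leave the optimizer (and any momentum buffers it holds) pointing at tensors on the wrong device, so the optimizer may need to be reconstructed or have its state moved along with the parameters.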

mxnet: parameters are always on CPU

When using mxnet, after building and training a module mod, I called the method mod.get_params() to inspect the weights and bias of the model.
However, I found that even if I set the context to mx.gpu(0) when creating the module, the outputs of the get_params method always show that the parameters (weights and bias) are on cpu(0). See below:
I wonder whether the weights were really on cpu, so I timed the program and found that, it was in fact much faster if I set the context to gpu(0) than to cpu(0). Therefore, I think the weights were in fact on gpu, otherwise the training wouldn't be so fast. But, why did the get_params method show that my weights were on cpu?
Calling mod.get_params() synchronizes the parameters in GPU memory with a copy placed in CPU memory. You're seeing that copy, which is in the cpu context, so there's no need for concern.
Under the hood, _sync_params_from_devices is called if the parameters are 'dirty' (i.e. out of sync), where 'devices' here means the GPU(s).
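A small sketch of what this looks like with the Module API (the symbol names and shapes are illustrative):
import mxnet as mx

# Tiny network bound to the GPU context.
data = mx.sym.Variable('data')
fc = mx.sym.FullyConnected(data, num_hidden=10, name='fc')
net = mx.sym.SoftmaxOutput(fc, name='softmax')

mod = mx.mod.Module(net, context=mx.gpu(0))
mod.bind(data_shapes=[('data', (32, 100))],
         label_shapes=[('softmax_label', (32,))])
mod.init_params()

# get_params() returns copies that have been synced back into CPU memory.
arg_params, aux_params = mod.get_params()
print(arg_params['fc_weight'].context)   # cpu(0), even though computation runs on gpu(0)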

Memory-compute overlap affects kernel duration?

Profiling my solution, I see dependencies between memory transfers and kernel computation. For a 60 MB data transfer, I get 2 ms of overhead for each overlapped kernel computation.
I'm comparing my basic solution and the enhanced (overlapped) one to see the differences. They process the same amount of data with the same kernels (which do not depend on the data values).
So am I wrong or missing something somewhere, or does the overlap really use a "significant" part of the GPU?
I think the overlapping process has to order the data transfers and control their issuing, plus perhaps some context switching, but 2 ms seems like a lot to attribute to that?
When you overlap a data copy with compute, both operations compete for GPU memory bandwidth. If your kernel is memory-bandwidth bound, then it's possible that overlapping the operations will cause both the compute and the memory copy to run longer than if either were running alone.
60 megabytes of data on a PCIe Gen2 link will take roughly 10 ms if there is no contention. An extra 2 ms when there is contention doesn't sound out of range to me, but it will depend to a significant degree on which GPU you are using. It's also not clear whether the "overhead" you're referring to is an extension of the length of the transfer, of the kernel compute, or of the overall program. Different GPUs have different memory bandwidth figures.
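For reference, the kind of overlap being discussed can be reproduced with pinned host memory and a separate copy stream; here is a sketch using PyTorch streams rather than the asker's raw CUDA code (buffer and matrix sizes are illustrative):
import torch

device = torch.device("cuda")
copy_stream = torch.cuda.Stream()

host_buf = torch.empty(60 * 1024 * 1024 // 4, pin_memory=True)   # ~60 MB of pinned host memory
dev_buf = torch.empty_like(host_buf, device=device)
a = torch.randn(4096, 4096, device=device)

torch.cuda.synchronize()
with torch.cuda.stream(copy_stream):
    dev_buf.copy_(host_buf, non_blocking=True)   # async host-to-device copy on its own stream
b = a @ a                                        # compute kernel on the default stream, overlapping the copy
torch.cuda.synchronize()
# Timing the copy and the kernel separately vs. overlapped shows the bandwidth
# contention described above: both can take longer when they share the GPU.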

CUDA and web development

It seems apparent that each core of the GPU could handle a request, rather than one main processor (the system's CPU) handling all requests. On the surface it seems like it is possible, perhaps with templates in GPU memory plus a Redis database in GPU GDDR5?
Is it possible and worthwhile?
How would the GPU access disks, databases, etc.?
Requests are usually short, sharp processing snippets. You'd have to load each request from main memory into GPU memory, do a computation, and send the result back again. There's an overhead when transferring data from main memory to GPU memory, so it's only worth doing a GPU computation if the calculation is long enough and the problem is amenable to parallel processing on a GPU.
In essence, GPUs are good at stream processing. Not for lots of small requests.
The previous answer is valid. There is also another caveat regarding GPUs: the instruction set is smaller and data is processed in matrices, i.e. the same operation is applied to every element in a set. So you will need to be very clever in designing what these replicated operations are.
I take it you are considering a GPU HTTP server.