PyTorch: Move Weights Between GPU and CPU on the fly

I have a large architecture which does not fit into GPU memory, but there is a nice property of this architecture where only subsets of the architecture run at any given time for a stretch of time. Therefore, I would like to dynamically load/unload the weights of layers which are not being utilized between the CPU and GPU. How can this be achieved?
The first thing one might try is to call .cpu() or .cuda() on the parameters I wish to move. Unfortunately, that would cause training problems with the optimizer, as stated in the docs:
cuda(device=None)
Moves all model parameters and buffers to the GPU.
This also makes associated parameters and buffers different objects. So it should be called before constructing optimizer if the module will live on GPU while being optimized.
One example use case would be implementing ProxylessNAS, however only final trained models are available at the time of writing and the architecture search implementation is not available.
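One workaround to sketch (not from the original question; the helper name move_module_ is hypothetical and this is only a sketch): instead of replacing the Parameter objects via .cpu()/.cuda(), swap each parameter's underlying .data tensor in place and move the matching optimizer state along with it, so the optimizer keeps referring to the same objects.

```python
import torch

def move_module_(module, optimizer, device):
    # Hypothetical helper: move a submodule's weights without creating
    # new Parameter objects, so existing optimizer references stay valid.
    for param in module.parameters():
        param.data = param.data.to(device)
        if param.grad is not None:
            param.grad = param.grad.to(device)
        # Per-parameter optimizer state (e.g. Adam's exp_avg) has to follow,
        # otherwise the next optimizer.step() mixes CPU and GPU tensors.
        for key, value in optimizer.state.get(param, {}).items():
            if torch.is_tensor(value):
                optimizer.state[param][key] = value.to(device)
    # Buffers (e.g. BatchNorm running stats) carry no optimizer state and
    # can be moved the same way via each submodule's _buffers dict.
    for submodule in module.modules():
        for name, buf in submodule._buffers.items():
            if buf is not None:
                submodule._buffers[name] = buf.to(device)

# Usage sketch: park an idle block on the CPU, bring it back before it runs again.
#   move_module_(model.block3, optimizer, "cpu")
#   ... train the active sub-network ...
#   move_module_(model.block3, optimizer, "cuda:0")
```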

Related

what happens to model weights and how does checkpointing work?

I have a basic question about model weights and checkpoints.
When training a model, each layer in the model graph invokes a kernel that executes on the GPU. The weights remain on the GPU for the forward and backward passes. Once the weights are updated during the backward pass, where are the updated weights stored? Are they moved back to CPU memory, and if so, when does this move happen?
When checkpointing is done, do we get the weights from CPU memory?
Can someone explain the whole execution flow?
In most cases, the weights updated by the backward pass remain in GPU memory. The weights are typically stored in the GPU's memory as floating-point tensors, which allows for fast matrix operations and keeps training efficient. They are updated on each iteration of the training loop and stay on the GPU until the end of the training process.
When checkpointing is done, the weights are saved to disk, either locally or to remote storage, in case execution is stopped. These weights are usually loaded into CPU memory when they are needed again. This is the general process, but it can vary with the architecture and hardware.
The weights stay on the GPU unless they are explicitly moved somewhere else.
When you save a checkpoint, the weights are serialized to disk using pickle without first being moved to the CPU. That's why, if you pickle a model's state_dict that lives on the GPU and try to load it on a system without a GPU, it will fail.
Also note that pickle itself has to copy the data it dumps into system RAM and do its own processing first, but it does not change the objects' underlying attributes while doing so, so your model's weights are stored in their original form with their attributes intact.
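A minimal sketch of that flow (the tiny model, optimizer, and file name are stand-ins): state_dict tensors are saved on whatever device they live on, and torch.load's map_location argument is the usual way to bring GPU-saved weights onto a CPU-only machine.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)                      # stand-in for the real network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Save: state_dict tensors are serialized as-is, on whatever device they live on.
torch.save({"model": model.state_dict(),
            "optimizer": optimizer.state_dict()}, "checkpoint.pt")

# Load on a CPU-only machine: map_location remaps any CUDA tensors to the CPU,
# which is the usual fix for the failure described above.
checkpoint = torch.load("checkpoint.pt", map_location="cpu")
model.load_state_dict(checkpoint["model"])
optimizer.load_state_dict(checkpoint["optimizer"])
```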

mxnet: parameters are always on CPU

When using mxnet, after building and training a module mod, I called the method mod.get_params() to inspect the weights and bias of the model.
However, I found that even if I set the context to mx.gpu(0) when creating the module, the output of the get_params method always shows that the parameters (weights and bias) are on cpu(0).
I wondered whether the weights were really on the CPU, so I timed the program and found that it was in fact much faster with the context set to gpu(0) than to cpu(0). I therefore think the weights were actually on the GPU, otherwise training wouldn't be so fast. But why does the get_params method show that my weights are on cpu(0)?
Calling mod.get_params synchronizes the parameters in GPU memory with a copy that is kept in CPU memory. You're seeing that copy, which is in the cpu context, so there's no need for concern.
Under the hood, _sync_params_from_devices is called if the parameters are 'dirty' (i.e. out of sync), where 'devices' refers to the GPU(s).
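A small sketch of what that looks like in practice, assuming the old mx.mod.Module API; the one-layer symbol, shapes, and names here are purely illustrative.

```python
import mxnet as mx

# Toy symbol; the point is only to inspect where get_params() says the weights live.
data = mx.sym.Variable("data")
net = mx.sym.FullyConnected(data=data, num_hidden=10, name="fc1")

mod = mx.mod.Module(symbol=net, context=mx.gpu(0),
                    data_names=["data"], label_names=None)
mod.bind(data_shapes=[("data", (32, 100))], for_training=False)
mod.init_params()

arg_params, aux_params = mod.get_params()
# get_params() returns the synchronized CPU-side copy, so every NDArray
# reports cpu(0) even though the executors actually run on gpu(0).
print({name: arr.context for name, arr in arg_params.items()})
```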

Using CUDA GPUs at prediction time for high throughput streams

We're trying to develop a Natural Language Processing application that has a user-facing component. The user can call models through an API and get the results back.
The models are pretrained using Keras with Theano. We use GPUs to speed up training, but prediction is also sped up significantly by using the GPU. Currently, we have a machine with two GPUs. However, at runtime (i.e. when running the user-facing bits) there is a problem: multiple Python processes sharing the GPUs via CUDA do not seem to give any parallel speed-up.
We're using nvidia-docker with libgpuarray (pygpu), Theano and Keras.
The GPUs are still mostly idle, but adding more Python workers does not speed up the process.
What is the preferred way of solving the problem of running GPU models behind an API? Ideally we'd utilize the existing GPUs more efficiently before buying new ones.
I imagine we want some sort of batching buffer before sending work off to the GPU, rather than acquiring a lock for each HTTP call?
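A rough sketch of that batching-buffer idea (all names here are illustrative, not an existing API): HTTP handlers enqueue requests, and a single GPU-owning thread drains the queue and issues one batched predict() call instead of one GPU call per request. `model` stands in for the loaded Keras model.

```python
import queue

import numpy as np

REQUESTS = queue.Queue()
MAX_BATCH = 32
WAIT_S = 0.01


def gpu_worker(model):
    while True:
        batch = [REQUESTS.get()]                      # block until one request arrives
        while len(batch) < MAX_BATCH:
            try:                                      # then gather whatever arrives shortly after
                batch.append(REQUESTS.get(timeout=WAIT_S))
            except queue.Empty:
                break
        features = np.stack([item[0] for item in batch])
        predictions = model.predict(features)         # one GPU launch for the whole batch
        for (_, reply), pred in zip(batch, predictions):
            reply.put(pred)                           # hand each caller its own row


def handle_request(features):
    # Called from each HTTP worker; blocks until the batched result is ready.
    reply = queue.Queue(maxsize=1)
    REQUESTS.put((features, reply))
    return reply.get()

# Start once at server startup, e.g.:
#   threading.Thread(target=gpu_worker, args=(model,), daemon=True).start()
```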
This is not an answer to your more general question, but rather an answer based on how I understand the scenario you described.
If someone has coded a system which uses a GPU for some computational task, they have (hopefully) taken the time to parallelize its execution so as to benefit from the full resources the GPU can offer, or something close to that.
That means that if you add a second similar task - even in parallel - the total amount of time to complete them should be similar to the amount of time to complete them serially, i.e. one after the other, since there are very few underutilized GPU resources for the second task to benefit from. In fact, it could even be the case that both tasks will be slower (if, say, they both make heavy use of the L2 cache and thrash it when running together).
At any rate, when you want to improve performance, a good thing to do is profile your application - in this case, using the nvprof profiler or its nvvp visual frontend.

Multiple GPUs in OptiX (asynchronous launches possible?)

I have some challenges with my Master's thesis that I hope you can help me with, or at least point me in the right direction on.
I'm implementing Progressive Photon Mapping in OptiX, following the approach by Knaus and Zwicker (http://www.cs.jhu.edu/~misha/ReadingSeminar/Papers/Knaus11.pdf). This approach makes each iteration/frame of PPM independent and therefore better suited to multi-GPU.
What I do (with a single GPU) is trace a number of photons using OptiX and store them in a buffer. The photons are then sorted into a spatial hash map using CUDA and Thrust, never leaving the GPU. I want to build the spatial hash map on the GPU since it is the bottleneck of my renderer. Finally, this buffer is used during indirect radiance estimation. So this is a multi-pass algorithm consisting of ray tracing, photon tracing, photon-map generation and, finally, image creation.
I understand that OptiX can support multiple GPUs. Each context launch is divided up across the GPUs. Any writes to buffers seem to be serialized and broadcast to each device so that their buffer contents are the same.
What I would like to do is let one GPU render one frame while the second GPU renders the next frame. I can then combine the results, for instance on the CPU or on one of the GPUs in a combine pass. It would also be acceptable to do each pass in parallel on each device (synchronizing between passes). Is this somehow possible?
For instance, could I create two OptiX contexts, one per device, on two different host threads? This would allow me to do the CUDA/Thrust spatial hash map generation as before, assuming the photons stay on one device, and merge the two generated images at the end of the pipeline. However, the programming guide states that it does not support multi-threaded context handling. I could use multiple processes, but then there is a lot of mess with inter-process communication. This approach also requires duplicating work such as creating the scene geometry, compiling PTX files, and so on.
Thanks!
OptiX already splits the workload across your GPUs according to their relative power, so your approach will likely not be faster than letting OptiX manage all the GPUs itself.
If you want to force your data to remain on the device (note that in such a situation writes from different devices will not be coherent), you can use the RT_BUFFER_GPU_LOCAL flag, as described in the programming guide:
https://developer.nvidia.com/optix-documentation

GPGPU Applications other than Image processing?

I am searching for a few CPU applications that could be ported to GPGPU for better efficiency.
Where else can GPGPU be used, other than in image processing?
This is actually for my graduate project.
The specialized processing architectures of GPU compute engines are useful for just about any data crunching problem where you have:
a non-trivial amount of data,
a non-trivial computation to perform on every element of that data, and
input data for each output element that fits in GPU memory, or can be choreographed to arrive in GPU memory when it is needed.
It helps if the computation can be performed independently on all data elements at the same time, but this is not strictly required.
Image processing happens to be one example of that scenario - a finite (but large) number of pixels to process, and many image algorithms can be executed on each pixel in parallel.
Other examples include:
generalized signal analysis, such as processing audio signals (image processing is just a specialized form of signal analysis),
pattern recognition, where much of the challenge is to separate the signal from the noise (voice recognition, anyone?),
three-dimensional surface matching, such as figuring out the shapes of organic compounds based on the flex angles of their chemical bonds, or figuring out whether two organic compounds are likely to interact in interesting ways (e.g. bioreceptors),
physical modelling of all kinds (collision simulations, seismic analysis, etc.),
and of course cryptography, where you can always spend more compute time going over the same data again and again.
GPU compute engines are not well suited for problems where the volume of data significantly dwarfs the computations to be performed. GPUs work well on stuff in memory. Moving data into or out of GPU memory is often the most expensive step of an entire computation, so you want to make sure you have enough computation going on to "make up" for the cost of loading the data into memory. If the data is too big to fit into memory you have to adopt distributed computing tactics.
For example, calculating a primary key index of a petabyte database probably isn't a great fit for a GPU since most of the effort will probably be spent just getting the data off the hard disk into memory. The index computation itself is fairly trivial, which doesn't make for a very interesting GPU win, and while I'm sure the data could be carved up into chunks and the chunks indexed independently by a boatload of GPU cores, variability in the data will likely prevent the GPU from operating at its full capacity. (GPU code works best when all "oarsmen" (processor cores / threads) are pulling in the same direction - uniform execution on separate data) While database indexing might see some benefit by using a GPU approach, it certainly won't be as big of a performance improvement over CPU baseline as something better suited to the GPU execution model constraints - like signal processing.
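To make the transfer-cost point concrete, here is a rough timing sketch using PyTorch's CUDA event timers (it assumes a CUDA-capable GPU is available; the matrix size is arbitrary).

```python
import torch

assert torch.cuda.is_available()  # sketch assumes a CUDA-capable GPU

x = torch.randn(8192, 8192)       # ~256 MB of float32 on the host

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

# Time the host-to-device copy.
start.record()
x_gpu = x.cuda()
end.record()
torch.cuda.synchronize()
copy_ms = start.elapsed_time(end)

# Time a reasonably heavy computation on the same data.
start.record()
y = x_gpu @ x_gpu
end.record()
torch.cuda.synchronize()
compute_ms = start.elapsed_time(end)

print(f"host-to-device copy: {copy_ms:.1f} ms, matmul: {compute_ms:.1f} ms")
# If the copy dominates, the workload is transfer-bound and the GPU offload
# only pays off when more computation is done per byte moved.
```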
Brute force crypto attacks? MD5 has been done by Whitepixel and SHA-256 has been done by all the Bitcoin miners. On the other hand, I am not aware of any GPU implementations of bcrypt() or scrypt(), but an academic working in the area is probably a better person to ask.
The easiest way to figure out what kinds of applications are well suited for GPGPU is to look at the speedups other groups have achieved. Here are a couple of links with that information:
NVIDIA's case studies
AccelerEyes' case studies
Looks like military/aerospace, life science, energy, finance, manufacturing, media, and a few other smaller industries have examples of strong speedups.