mxnet: parameters are always on CPU

When using mxnet, after building and training a module mod, I called the method mod.get_params() to inspect the weights and bias of the model.
However, I found that even if I set the context to mx.gpu(0) when creating the module, the outputs of the get_params method always show that the parameters (weights and bias) are on cpu(0). See below:
I wondered whether the weights were really on the cpu, so I timed the program and found that it was in fact much faster when I set the context to gpu(0) than to cpu(0). Therefore, I think the weights were in fact on the gpu, otherwise training wouldn't be so fast. But why did the get_params method show that my weights were on the cpu?

Calling mod.get_params synchronizes the parameters in GPU memory with a copy that is placed in CPU memory. You're seeing that copy, which lives in the cpu context, so there's no need for concern.
Under the hood, _sync_params_from_devices is called if the parameters are 'dirty' (i.e. out of sync), where 'devices' refers to the GPU(s).
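For concreteness, here is a minimal sketch (using the legacy Module API with a made-up one-layer symbol, on a machine with a GPU) that reproduces the observation:

import mxnet as mx

# Hypothetical one-layer network bound to gpu(0).
data = mx.sym.Variable('data')
net = mx.sym.FullyConnected(data, num_hidden=10, name='fc1')

mod = mx.mod.Module(symbol=net, label_names=None, context=mx.gpu(0))
mod.bind(data_shapes=[('data', (32, 100))], for_training=False)
mod.init_params()

arg_params, aux_params = mod.get_params()
# get_params() syncs the device-side values into host NDArrays, so the copies
# it returns live in the cpu(0) context even though execution runs on gpu(0).
print(arg_params['fc1_weight'].context)   # cpu(0)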

Related

What happens to model weights and how does checkpointing work?

I have a basic question about model weights and checkpoints.
When training a model, each layer in the model graph launches kernels that execute on the GPU. The weights remain on the GPU for the forward pass and backward pass. Once the weights are updated during the backward pass, where are all the updated weights stored? Are they moved back to CPU memory, and when does this move happen?
When checkpointing is done, do we get the weights from CPU memory?
Can someone explain the whole execution flow?
In most cases, the updated weights from the backward pass remain in GPU memory. The weights are typically stored in the GPU's memory as floating-point numbers, which allows for fast matrix operations and helps to optimize the training process. The weights are updated during each iteration of the training loop and remain on the GPU until the end of the training process.
When checkpointing is done, the weights are saved to disk, either locally or to remote storage, so they are preserved if execution stops. They are usually loaded back into CPU memory when needed again. This is the general process, but it can vary with the architecture and hardware.
The weights stay on the GPU unless they are explicitly moved somewhere else.
When you save a checkpoint, the weights are serialized to disk using pickle without first being moved to the CPU. That's why, if you pickle a model's state_dict that's on the GPU and try to load it on a system without a GPU, it will fail.
Also note that pickle itself has to stage the data it needs to dump in system RAM and do its processing there, but it doesn't change the objects' underlying attributes when doing so, so your model's weights get stored in their original form with their attributes intact.
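A small sketch of that failure mode and the usual way around it, torch.load's map_location argument (the linear model and file name are just placeholders, and the saving side assumes a GPU machine):

import torch
import torch.nn as nn

model = nn.Linear(128, 10).cuda()           # parameters live in GPU memory
torch.save(model.state_dict(), 'ckpt.pt')   # tensors are serialized together with their cuda device

# On a CPU-only machine, a plain torch.load('ckpt.pt') would try to restore the
# CUDA tensors and fail; map_location remaps them into host memory instead.
state = torch.load('ckpt.pt', map_location='cpu')
cpu_model = nn.Linear(128, 10)
cpu_model.load_state_dict(state)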

PyTorch: Move Weights Between GPU and CPU on the fly

I have a large architecture which does not fit into GPU memory, but there is a nice property of this architecture where only subsets of the architecture run at any given time for a stretch of time. Therefore, I would like to dynamically load/unload the weights of layers which are not being utilized between the CPU and GPU. How can this be achieved?
The first thing one might try is to call .cpu() or .cuda() on the parameters I wish to move. Unfortunately, that would cause training problems with the optimizer, as stated in the docs:
cuda(device=None)
Moves all model parameters and buffers to the GPU.
This also makes associated parameters and buffers different objects. So it should be called before constructing optimizer if the module will live on GPU while being optimized.
One example use case would be implementing ProxylessNAS; however, only final trained models are available at the time of writing, and the architecture search implementation is not.
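For concreteness, a rough sketch of the naive per-submodule move described above; block_a and block_b are hypothetical stand-ins for the subsets of the architecture, and the optimizer caveat from the quoted docs still applies:

import torch
import torch.nn as nn

model = nn.ModuleDict({
    'block_a': nn.Linear(1024, 1024),
    'block_b': nn.Linear(1024, 1024),
})
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

model['block_a'].cuda()   # keep only the currently active subset on the GPU
model['block_b'].cpu()

# Per the quoted docs, moving modules like this can leave the optimizer holding
# references to the old parameter copies, so it generally has to be constructed
# (or reconstructed) after the parameters are on their final device.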

What is the use of task graphs in CUDA 10?

CUDA 10 added runtime API calls for putting streams (= queues) into "capture mode", so that instead of executing, the work enqueued on them is recorded into a "graph". These graphs can then be instantiated and actually executed, or they can be cloned.
But what is the rationale behind this feature? Isn't it unlikely to execute the same "graph" twice? After all, even if you do run the "same code", at least the data is different, i.e. the parameters the kernels take likely change. Or - am I missing something?
PS - I skimmed this slide deck, but still didn't get it.
My experience with graphs is indeed that they are not so mutable. You can change the parameters with 'cudaGraphHostNodeSetParams', but in order for the change of parameters to take effect, I had to rebuild the graph executable with 'cudaGraphInstantiate'. This call takes so long that any gain from using graphs is lost (in my case). Setting the parameters only worked for me when I built the graph manually. When getting the graph through stream capture, I was not able to set the parameters of the nodes, as you do not have the node pointers. You would think the call 'cudaGraphGetNodes' on a stream-captured graph would return the nodes, but the node pointer returned was NULL for me even though the 'numNodes' variable had the correct number. The documentation explicitly mentions this as a possibility but fails to explain why.
Task graphs are quite mutable.
There are API calls for changing/setting the parameters of task graph nodes of various kinds, so one can use a task graph as a template: instead of enqueueing the individual nodes before every execution, one changes the parameters of the nodes that need it before every execution (and perhaps not all nodes actually need their parameters changed).
For example, see the documentation for cudaGraphHostNodeGetParams and cudaGraphHostNodeSetParams.
Another useful feature is concurrent kernel execution. When building a graph manually, one can add nodes with dependencies between them, and the runtime will exploit the available concurrency automatically using multiple streams. The feature itself is not new, but making it automatic is useful for certain applications.
When training a deep learning model, it is common to re-run the same set of kernels in the same order, just with updated data. I would also expect CUDA to perform optimizations by knowing statically which kernels come next; one can imagine that CUDA could prefetch more instructions or adapt its scheduling strategy when it knows the whole graph.
CUDA Graphs try to solve the problem that, in the presence of many small kernel invocations, you see quite some time spent on the CPU dispatching work for the GPU (launch overhead).
They allow you to trade resources (time, memory, etc.) to construct a graph of kernels that you can then launch with a single invocation from the CPU instead of doing multiple invocations. If you don't have enough invocations, or your algorithm is different each time, then it won't be worth it to build a graph.
This works really well for anything iterative that uses the same computation underneath (e.g., algorithms that need to converge to something) and it's pretty prominent in a lot of applications that are great for GPUs (e.g., think of the Jacobi method).
You are not going to see great results if you have an algorithm that you invoke once or if your kernels are big; in that case the CPU invocation overhead is not your bottleneck. A succinct explanation of when you need it is given in the Getting Started with CUDA Graphs blog post.
Where task-graph-based paradigms really shine, though, is when you define your program as tasks with dependencies between them. You give a lot of flexibility to the driver / scheduler / hardware to do the scheduling itself without much fine-tuning on the developer's part. There's a reason why we have been spending years exploring the ideas of dataflow programming in HPC.
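As a rough illustration of the capture-and-replay idea for such iterative workloads, here is a minimal sketch using PyTorch's torch.cuda.CUDAGraph wrapper around this runtime feature; the elementwise computation and buffer size are placeholders, not anything from the question:

import torch

static_x = torch.randn(1 << 20, device='cuda')
static_y = torch.empty_like(static_x)

# Warm up once outside the graph so lazy one-time initialization isn't captured.
static_y.copy_(torch.sin(static_x) * 2 + 1)
torch.cuda.synchronize()

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    # Work launched here is recorded into the graph, not executed.
    static_y.copy_(torch.sin(static_x) * 2 + 1)

for _ in range(1000):
    static_x.normal_()   # new data written into the same captured buffer
    g.replay()           # one host-side call relaunches the whole kernel sequence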

How do the nodes in a CUDA graph connect?

CUDA graphs are a new way to synthesize complex operations from multiple operations. With "stream capture", it appears that you can run a mix of operations, including cuBLAS and similar library operations, and capture them as a single "meta-kernel".
What's unclear to me is how the data flow works for these graphs. In the capture phase, I allocate memory A for the input, memory B for the temporary values, and memory C for the output. But when I capture this in a graph, I don't capture the memory allocations. So when I then instantiate multiple copies of these graphs, they cannot share the input memory A, temporary workspace B or output memory C.
How then does this work? I.e. when I call cudaGraphLaunch, I don't see a way to provide input parameters. My captured graph basically starts with a cudaMemcpyHostToDevice, how does the graph know which host memory to copy and where to put it?
Background: I found that CUDA is heavily bottlenecked on kernel launches; my AVX2 code was 13x slower when ported to CUDA. The kernels themselves seem fine (according to Nsight); it's just the overhead of scheduling several hundred thousand kernel launches.
A memory allocation would typically be done outside of a graph definition/instantiation or "capture".
However, graphs provide for "memory copy" nodes, where you would typically do cudaMemcpy type operations.
At the time of graph definition, you pass a set of arguments for each graph node (which will depend on the node type, e.g. arguments for the cudaMemcpy operation, if it is a memory copy node, or kernel arguments if it is a kernel node). These arguments determine the actual memory allocations that will be used when that graph is executed.
If you wanted to use a different set of allocations, one method would be to instantiate another graph with different arguments for the nodes where there are changes. This could be done by repeating the entire process, or by starting with an existing graph, making changes to node arguments, and then instantiating a graph with those changes.
Currently, in CUDA Graphs, it is not possible to perform runtime binding (i.e. at the point of graph "launch") of node arguments to a particular graph/node. It's possible that new features may be introduced in future releases, of course.
Note that there is a CUDA sample code called simpleCudaGraphs available in CUDA 10 which demonstrates the use of both memory copy nodes, and kernel nodes, and also how to create dependencies (effectively execution dependencies) between nodes.
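To make the "arguments are baked in at definition time" point concrete, here is a rough sketch using PyTorch's capture wrapper rather than the raw runtime API (buffer names and sizes are made up): new input is copied into the same device allocation that was captured, because the launch itself takes no parameters.

import torch

host_in = torch.empty(1 << 20, pin_memory=True)   # reusable pinned staging buffer
dev_in = torch.empty(1 << 20, device='cuda')
dev_out = torch.empty_like(dev_in)

# Warm up so one-time lazy initialization happens outside the capture.
dev_out.copy_(dev_in * 3)
torch.cuda.synchronize()

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    dev_out.copy_(dev_in * 3)   # captured kernels read/write these fixed addresses

def run(batch_on_host):
    host_in.copy_(batch_on_host)              # stage the new batch in pinned memory
    dev_in.copy_(host_in, non_blocking=True)  # H2D copy into the captured allocation
    g.replay()                                # no per-launch arguments exist
    return dev_out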

Matlab - Memory usage of a program

I'm currently implementing different signal processing algorithms in MATLAB, to later implement one of them in C++. To choose between them, I'm going to perform a number of tests, one being a memory usage check. That is, I want to see how much memory the different algorithms use. Since the implementations are divided into sub-functions, I'm having problems collecting information about the actual memory usage.
This is what I've tried so far:
I've used the profiler to check memory usage of every function.
Problem: It only shows allocated memory usage. It doesn't show e.g. memory usage of variables in every function.
I've used whos at the end of every function to collect information about all the variables in the workspace of the functions. I then added these to a global variable.
Problem: The global variable keeps increasing even after the execution is done and it seems to never stop.
Now to my question. How can I, in a rather simple way, get information about the memory usage of my program, all functions included?
Best regards
I think your strategy of calling whos at the end of every function (just before it returns) is a good one; but maybe you want to print the result to the screen rather than accumulate it in a global. If it "keeps increasing", then maybe you have a callback function that is being called unbeknownst to you, and that includes one of your whos calls. By printing to the screen (and maybe including a disp('**** memory usage at the end of <function name> ***') just before it), you will find out why it "keeps going".
The alternative of using memory is somewhat helpful, but it gives information about "available" memory, as well as all the memory used by Matlab (not just the variables).
Of course any snapshot of memory usage doesn't necessarily grab the peak - it's possible that a statement like
x = sum(repmat(A, [1000 1]));
would require quite a large peak memory usage (as you replicate the matrix A 1000 times), yet a snapshot of memory (or running whos) right before or after won't tell you what just happened...
The best way to monitor memory usage is to use the profiler, with the memory option turned on:
profile -memory on
% run your code
profreport
The profiler returns memory usage and function call statistics. Note that the memory option has an impact on your execution speed.
You can use the memory function. Also, see the memory management functions and the MATLAB documentation on memory usage.