I have a feed-forward neural network that is essentially a composition of N functions. I want to pipeline the training procedure of this network in a multi-device environment by executing some of these functions on one device, forwarding the result to a second device, executing some more functions there, and so on. So far, I think something like the following would work:
subfunctions = [...]  # a list of jit-ed functions, each of which executes one or more network layers
x = some_provided_input
for f in subfunctions:
    x = f(x)  # these get called asynchronously, right?
In addition, I need the final device to send back a "message" with backpropagated gradients to the device before it, which will in turn send its own gradients back (after applying the chain rule).
I also need these things to happen concurrently, i.e., calling the function of device 1 again while device 2 is just beginning to process the input it got from device 1 (think of a pipelined execution). A sketch of the schedule I mean follows.
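To make the intended schedule concrete, here is an illustrative sketch of that pipelining, with plain C++ threads and a queue standing in for the two devices; nothing here is JAX-specific, it only shows the overlap I'm after:

#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>

std::queue<int> handoff;          // activations waiting for "device 2"
std::mutex m;
std::condition_variable cv;
bool done = false;

int main() {
    std::thread device2([] {
        for (;;) {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [] { return !handoff.empty() || done; });
            if (handoff.empty()) return;          // producer finished
            int x = handoff.front();
            handoff.pop();
            lock.unlock();
            std::printf("device 2 processing %d\n", x);
        }
    });
    for (int i = 0; i < 4; ++i) {                 // "device 1" keeps going
        int y = i * i;                            // stand-in for its layers
        {
            std::lock_guard<std::mutex> lock(m);
            handoff.push(y);                      // forward to device 2
        }
        cv.notify_one();
    }
    {
        std::lock_guard<std::mutex> lock(m);
        done = true;
    }
    cv.notify_one();
    device2.join();
    return 0;
}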
Is there native support in Jax for such operations, or should I be looking into something like mpi4jax? Would that even work for me if I'm looking into managing, say, GPU devices and not CPU processes?
There is an external C++ function that is called from Tcl/Tk and does some work that takes a noticeable amount of time. The Tcl caller has to get the result of that function, so it waits until the function is finished. To avoid blocking the GUI, that C++ function has a kind of event loop implemented in its body:
while (m_curSyncProc.isRunning()) {
    const clock_t tm = clock();
    while (Tcl_DoOneEvent(TCL_ALL_EVENTS | TCL_DONT_WAIT) > 0) {}  // <- stuck here in case of tkwait/vwait

    // If the queue drained in under 5 ms, sleep ~10 ms to avoid 100% CPU usage
    if (double(clock() - tm) / CLOCKS_PER_SEC < 0.005) {
        const timespec delay = {0, 10 * 1000 * 1000};  // 10 ms, in ns (needs <ctime>)
        nanosleep(&delay, nullptr);
    }
}
Everything works great unless tkwait/vwait is in action in Tcl code.
For example, for dialogs, tkwait variable someVariable is used to wait until the Ok/Close/<whatever> button is pressed. I see that even the standard Tk bgerror uses the same method (it uses vwait).
The problem is that, once called, Tcl_DoOneEvent does not return while the Tcl code is waiting at a tkwait/vwait line; otherwise it works well. Is it possible to fix this in the event loop without totally redesigning the C++ code? That code is rather old and complicated, and its author is no longer reachable.
Beware! This is a complex topic!
The Tcl_DoOneEvent() call is essentially what vwait, tkwait and update are thin wrappers around (passing different flags and setting up different callbacks). Nested calls to any of them create nested event loops; you don't really want those unless you're supremely careful. An event loop only terminates when it is not processing any active event callbacks, and if those event callbacks create inner event loops, the outer event loop will not get to do anything at all until the inner one has finished.
As you're taking control of the outer event loop (in a very inefficient way, but oh well), you really want the inner event loops to not run at all. There are three possible ways to deal with this; I suspect that the third (coroutines) will be most suitable for you and that the first is what you're really trying to avoid, but that's definitely your call.
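As an aside on that inefficiency: if you drop TCL_DONT_WAIT, Tcl_DoOneEvent blocks inside the notifier until an event is actually ready, so the poll-and-sleep dance is unnecessary. A minimal, hypothetical embedding showing the shape (linked against -ltcl; not your application's code):

#include <tcl.h>

int main(int argc, char** argv) {
    Tcl_FindExecutable(argv[0]);
    Tcl_Interp* interp = Tcl_CreateInterp();

    int done = 0;
    Tcl_LinkVar(interp, "done", (char*)&done, TCL_LINK_INT);
    Tcl_Eval(interp, "after 100 {set done 1}");   // stand-in for async work

    while (!done)
        Tcl_DoOneEvent(TCL_ALL_EVENTS);           // blocks; no busy-wait

    Tcl_DeleteInterp(interp);
    return 0;
}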
1. Continuation Passing
You can rewrite the inner code into continuation-passing style — a big pile of procedures that hands off from step to step through a state machine/workflow — so that it doesn't actually call vwait (and friends). The only one of the family that tends to be vaguely safe is update idletasks (which is really just Tcl_DoOneEvent(TCL_IDLE_EVENTS | TCL_DONT_WAIT)) to process Tk internally-generated alterations.
This option was your main choice up to Tcl 8.5, and it was a lot of work.
2. Threads
You can move to a multi-threaded application. This can be easy… or very difficult; the details depend on an examination of what you're doing throughout the application.
If going this route, remember that Tcl interpreters and Tcl values are totally thread-bound; they internally use thread-specific data so that they can avoid big global locks. This means that threads in Tcl are comparatively expensive to set up, but actually use multiple CPUs very efficiently afterwards; thread pooling is a very common approach.
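If you do go this route, the usual shape is one interpreter per worker thread. A hedged sketch with the C API (assumes a threads-enabled Tcl build; the names and the trivial script are purely illustrative):

#include <tcl.h>

static Tcl_ThreadCreateType worker(ClientData clientData) {
    (void)clientData;
    Tcl_Interp* interp = Tcl_CreateInterp();      // owned by this thread only
    Tcl_Eval(interp, "puts {hello from a worker thread's own interp}");
    Tcl_DeleteInterp(interp);
    Tcl_FinalizeThread();
    TCL_THREAD_CREATE_RETURN;
}

int main(int argc, char** argv) {
    Tcl_FindExecutable(argv[0]);
    Tcl_ThreadId tid;
    Tcl_CreateThread(&tid, worker, nullptr,
                     TCL_THREAD_STACK_DEFAULT, TCL_THREAD_JOINABLE);
    int result;
    Tcl_JoinThread(tid, &result);
    return 0;
}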
3. Coroutines
Starting in 8.6, you can put the inner code in a coroutine. Almost everything in 8.6 is coroutine-aware (“non-recursive” in our internal lingo) by default (including commands you wouldn't normally think of, such as source) and once you've done that, you can replace the vwait calls with equivalents from the Tcllib coroutine package and things will typically “just work”. (For example, vwait var becomes coroutine::vwait var, and after 123 becomes coroutine::after 123.)
The only things that don't have direct replacements are tkwait window and tkwait visibility. You'll need to simulate those by waiting for a <Destroy> or <Visibility> event (the latter is uncommon, as it is unsupported on some platforms), which you do by binding a trivial callback on those events that just sets a variable you can coroutine::vwait on; that is essentially all tkwait does internally anyway.
Coroutines can become messy in a few cases, such as when you've got C code that is not coroutine-aware. The main places in Tcl where these come into play are in trace callbacks, inter-interpreter calls, and the scripted implementations of channels; the issue there is that the internal APIs these sit behind are rather complicated already (especially channels) and nobody's felt up to wading in and enabling a non-recursive implementation.
CUDA 10 added runtime API calls for putting streams (= queues) into "capture mode", so that instead of executing, the operations enqueued on them are recorded into a "graph". These graphs can then be instantiated and actually executed, or they can be cloned.
But what is the rationale behind this feature? Isn't it unlikely to execute the same "graph" twice? After all, even if you do run the "same code", at least the data is different, i.e. the parameters the kernels take likely change. Or - am I missing something?
PS - I skimmed this slide deck, but still didn't get it.
My experience with graphs is indeed that they are not so mutable. You can change the parameters with cudaGraphHostNodeSetParams, but in order for the change of parameters to take effect, I had to rebuild the graph executable with cudaGraphInstantiate. This call takes so long that any gain from using graphs is lost (in my case).

Setting the parameters only worked for me when I built the graph manually. When getting the graph through stream capture, I was not able to set the parameters of the nodes, as you do not have the node pointers. You would think that calling cudaGraphGetNodes on a stream-captured graph would return the nodes, but the node pointer returned was NULL for me even though the numNodes variable had the correct count. The documentation explicitly mentions this as a possibility but fails to explain why.
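For what it's worth, cudaGraphGetNodes follows the usual two-call idiom (first call with a null buffer to query the count, second call to fill the handles). A minimal sketch, hedged in that it uses a manually built graph with an empty node rather than a captured one, and omits error checks:

#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

int main() {
    cudaGraph_t graph;
    cudaGraphCreate(&graph, 0);
    cudaGraphNode_t added;
    cudaGraphAddEmptyNode(&added, graph, nullptr, 0);   // one node to find

    size_t numNodes = 0;
    cudaGraphGetNodes(graph, nullptr, &numNodes);       // 1st call: count only
    std::vector<cudaGraphNode_t> nodes(numNodes);
    cudaGraphGetNodes(graph, nodes.data(), &numNodes);  // 2nd call: handles
    std::printf("%zu node(s); first handle %p\n", numNodes, (void*)nodes[0]);

    cudaGraphDestroy(graph);
    return 0;
}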
Task graphs are quite mutable.
There are API calls for changing/setting the parameters of task graph nodes of various kinds, so one can use a task graph as a template, so that instead of enqueueing the individual nodes before every execution, one changes the parameters of every node before every execution (and perhaps not all nodes actually need their parameters changed).
For example, see the documentation for cudaGraphHostNodeGetParams and cudaGraphHostNodeSetParams.
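A hedged sketch of that template pattern using a host node (error checking omitted; cudaGraphInstantiate is shown with its pre-CUDA-12 five-argument signature):

#include <cuda_runtime.h>
#include <cstdio>

static void hostFn(void* data) { std::printf("value = %d\n", *(int*)data); }

int main() {
    int first = 1, second = 2;
    cudaHostNodeParams params{};
    params.fn = hostFn;
    params.userData = &first;

    cudaGraph_t graph;
    cudaGraphCreate(&graph, 0);
    cudaGraphNode_t node;
    cudaGraphAddHostNode(&node, graph, nullptr, 0, &params);

    // Treat the graph as a template: re-point the node at new data...
    params.userData = &second;
    cudaGraphHostNodeSetParams(node, &params);

    // ...then instantiate and launch. This prints "value = 2".
    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);
    cudaStream_t s;
    cudaStreamCreate(&s);
    cudaGraphLaunch(exec, s);
    cudaStreamSynchronize(s);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(s);
    return 0;
}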
Another useful feature is concurrent kernel execution. When building a graph manually, one can add nodes with explicit dependencies, and the runtime will exploit the available concurrency automatically using multiple streams. The capability itself is not new, but having it applied automatically is useful for certain applications.
When training a deep learning model, you often re-run the same set of kernels in the same order, just with updated data. I would also expect CUDA to optimize by knowing statically which kernels come next; one can imagine it prefetching more instructions or adapting its scheduling strategy when it knows the whole graph.
CUDA Graphs is trying to solve the problem that, in the presence of many small kernel invocations, quite some time is spent on the CPU just dispatching work for the GPU (launch overhead).
It allows you to trade resources (time, memory, etc.) to construct a graph of kernels that you can then launch with a single invocation from the CPU instead of many separate ones. If you don't have enough invocations, or your algorithm is different each time, then building a graph won't be worth it.
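A minimal stream-capture sketch of that trade (hedged: error checks omitted, cudaMemsetAsync stands in for real kernel launches, and cudaGraphInstantiate is shown with its pre-CUDA-12 signature):

#include <cuda_runtime.h>

int main() {
    cudaStream_t s;
    cudaStreamCreate(&s);
    float* buf;
    cudaMalloc(&buf, 1024 * sizeof(float));

    // Record the work once instead of executing it...
    cudaGraph_t graph;
    cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal);
    cudaMemsetAsync(buf, 0, 1024 * sizeof(float), s);   // recorded, not run
    cudaStreamEndCapture(s, &graph);

    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);

    // ...then replay it many times with one cheap CPU-side call per iteration.
    for (int i = 0; i < 1000; ++i)
        cudaGraphLaunch(exec, s);
    cudaStreamSynchronize(s);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaFree(buf);
    cudaStreamDestroy(s);
    return 0;
}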
This works really well for anything iterative that uses the same computation underneath (e.g., algorithms that need to converge to something) and it's pretty prominent in a lot of applications that are great for GPUs (e.g., think of the Jacobi method).
You are not going to see great results if you have an algorithm that you invoke only once, or if your kernels are big; in that case the CPU launch overhead is not your bottleneck. A succinct explanation of when you need it is given in Getting Started with CUDA Graphs.
Where task-graph-based paradigms really shine, though, is when you define your program as tasks with dependencies between them. You give a lot of flexibility to the driver/scheduler/hardware to do the scheduling itself without much fine-tuning on the developer's part. There's a reason we have been spending years exploring the ideas of dataflow programming in HPC.
I know this question seems very generic, as the answer can depend on the platform, but I understand that with procedure/function calls, the assembly code that pushes the return address, local variables, etc. onto the stack can be part of either the caller or the callee.
When a hardware exception or interrupt occurs, though, the program counter gets the address of the exception handler via the exception table, but where is the actual code that stores the state, return address, etc.? Or is this done automatically at the hardware level for interrupts and exceptions?
Thanks in advance
Since you are asking about ARM and tagged microcontroller, you might be talking about the ARM7TDMI, but are probably talking about one of the Cortex-Ms. These work differently from the full-sized ARM architecture. As documented in the architectural reference manual associated with these cores (the ARMv6-M or ARMv7-M, depending on the core), the hardware conforms to the ABI on exception entry: the return address, the xPSR, and registers r0-r3 plus r12 and lr are all put on the stack, which is unusual for an architecture to do. R14, instead of getting the return address, gets an invalid address with a specific pattern (0xFFFFFFFx), which is all part of the architecture. Unlike other processor IP, the address spaces on the Cortex-Ms are encouraged or dictated by ARM; that is why you usually see RAM start at 0x20000000 on these and flash sit below that, with some exceptions where vendors place RAM in the "executable" range, pretending to be Harvard when it is really modified Harvard. That carved-up address map is what lets the 0xFFFFFFFx link-register value be recognized as an exception return rather than a real address; depending on the manual, they either yada yada over the return address or go into detail about what the patterns you find mean.
Likewise, the layout of the vector table is spelled out: roughly, the first 16 entries are system/ARM exceptions, and interrupts follow after that, where there can be up to 128 or 256 possible interrupts; you have to look at the chip vendor's (not ARM's) documentation to see how many they exposed and what is tied to what. If you are not using those interrupts, you don't have to leave a huge hole in your flash for vectors; just use that flash for your program (so long as you ensure you are never going to fire that exception or interrupt).
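To make that concrete, here is a hypothetical minimal Cortex-M vector table in C (GNU toolchain assumed; the linker script must define _estack and place .isr_vector at the start of flash; handler bodies are stubs):

/* The hardware reads entry 0 as the initial SP and entry 1 as the reset PC;
   further entries are exception/interrupt handler addresses. */
extern unsigned int _estack;                  /* top of RAM, from linker script */

void Default_Handler(void) { for (;;); }      /* catch-all for unused vectors */
void Reset_Handler(void)   { for (;;); }      /* would init .data/.bss, call main */

__attribute__((section(".isr_vector"), used))
void (* const vector_table[])(void) = {
    (void (*)(void))&_estack,   /* entry 0: initial stack pointer */
    Reset_Handler,              /* entry 1: reset */
    Default_Handler,            /* entry 2: NMI */
    Default_Handler,            /* entry 3: HardFault */
    /* ... entries 4..15: remaining system exceptions, then vendor IRQs ... */
};

/* Note no software prologue in the handlers: on exception entry the core
   itself stacks r0-r3, r12, lr, pc and xPSR, so plain C functions work. */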
For function calls, which occur at well defined (synchronous) locations in the program, the compiler generates executable instructions to manage the return address, registers and local variables. These instructions are integrated with your function code. The details are hardware and compiler specific.
For a hardware exception or interrupt, which can occur at any location (asynchronous) in the program, managing the return address and registers is all done in hardware. The details are hardware specific.
Think about how a hardware exception/interrupt can occur at any point during the execution of a program. And then consider that if a hardware exception/interrupt required special instructions integrated into the executable code then those special instructions would have to be repeated everywhere throughout the program. That doesn't make sense. Hardware exception/interrupt management is handled in hardware.
The "code" isn't software at all; by definition the CPU has to do it itself internally because interrupts happen asynchronously. (Or for synchronous exceptions caused by instructions being executed, then the internal handling of that instruction is what effectively triggers it).
So it's microcode or hardwired logic inside the CPU that generates the stores of a return address on an exception, and does any other stuff that the architecture defines as happening as part of taking an exception / interrupt.
You might as well ask where the code is that pushes a return address when a call instruction executes, on x86 for example, where call pushes return info onto the stack instead of overwriting a link register (the way most RISCs do).
When using mxnet, after building and training a module mod, I called the method mod.get_params() to inspect the weights and bias of the model.
However, I found that even though I set the context to mx.gpu(0) when creating the module, the output of the get_params method always shows that the parameters (weights and bias) are on cpu(0). See below:
I wondered whether the weights were really on the CPU, so I timed the program and found that it was in fact much faster with the context set to gpu(0) than to cpu(0). So I think the weights actually were on the GPU; otherwise training wouldn't have been so fast. But why does the get_params method show that my weights are on the CPU?
Calling mod.get_params synchronizes the parameters held in GPU memory with a copy placed in CPU memory. You're seeing that copy, which is in the cpu context, so there's no cause for concern.
Under the hood, _sync_params_from_devices is called if the parameters are 'dirty' (i.e. out of sync), where the 'devices' are the GPU(s).
On a bare-metal system (embedded microcontroller, no MMU, no paging), which is more expensive: a full context switch (register save & restore) or a function call (activation record allocation)?
I understand that this is highly dependent on calling convention and hardware capability, but how would I go about evaluating this?
EDIT:
To provide more context: I'm trying to model two scheduling schemes. The first is a pre-emptive scheduler with context switching between tasks. The second is a function-pointer run queue, where tasks are state machines broken into several enqueue-able function calls (and enqueueing happens on an IO-event-driven basis); see the sketch after this edit.
For the most part, I can gather good data on how long my tasks take (both IO and CPU time) but I need some help figuring out the additional overhead costs to add as constants in my model.
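For the second scheme, a hedged sketch of what I mean: a fixed-size ring of function pointers dispatched by plain calls. All names are illustrative, not from any particular RTOS, and the single-producer/single-consumer queue here glosses over the ISR-safety a real system would need:

#include <stdio.h>

typedef void (*task_fn)(void);

#define QCAP 16
static task_fn queue[QCAP];
static volatile unsigned head = 0, tail = 0;   /* single producer/consumer */

static int enqueue(task_fn f) {                /* would be called on IO events */
    unsigned next = (tail + 1) % QCAP;
    if (next == head) return 0;                /* queue full: drop or flag */
    queue[tail] = f;
    tail = next;
    return 1;
}

static void step_b(void) { puts("task state B"); }
static void step_a(void) { puts("task state A"); enqueue(step_b); }

int main(void) {
    enqueue(step_a);
    while (head != tail) {         /* the "scheduler": no register-set
                                      save/restore, just a function call */
        task_fn f = queue[head];
        head = (head + 1) % QCAP;
        f();
    }
    return 0;
}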
Since the system calls that trigger context switches are themselves function calls, and the hardware interrupts that can trigger context switches behave similarly (they require a signal to an event/semaphore and a jump/call to the scheduler entry point to signal the context switch), I would say that a function call would be cheaper CPU-cycle-wise, unless an unreasonable number of parameters were passed.
This smells like an XY problem - why do you ask this? Context switches and function calls are almost orthogonal - one is a stack-based mechanism, the other selects a different stack entirely.
You'd go about evaluating this by contrasting the techniques and their actual effects on overall data movement.
For example, on a 6502, an interrupt pushes the program counter and the status register: 3 bytes of actual data, taking 7 CPU cycles (saving A, X, and Y is left to the handler's own code).
Granted, a 6502 is a much simpler CPU than modern designs, but sits as a fundamental example of the problem.
Now, a function call can arguably be as little as a Jump to Subroutine (JSR), which simply pushes the current PC onto the stack and then changes the PC to the new location. On the 6502, a JSR costs 6 cycles.
If you consider JSR and BRK (the software interrupt on the 6502) as the primitives, JSR is cheaper than BRK by 1 cycle. This is outside the cost of standing up the call frame.
Most context switches are done automatically (via a timer or whatever) to simulate multiprocessing. But some systems use the CPU's trap primitive for system calls (like INT in MS-DOS and TRAP in the old Mac OS), so a soft interrupt still has to stand up a stack frame, just like a normal subroutine would.
In the end, a JSR is likely cheaper than any of the higher-level switch mechanisms, simply because it's so lightweight. The interrupt mechanisms usually involve an indirection (which is why they're used for system calls so much) that a subroutine does not need: the compiler resolves subroutine addresses at compile time.
But those are the considerations to look at to evaluate raw performance.