If I have a __constant__ value

```cpp
__constant__ float constVal;
```

which may or may not be initialized by MPI ranks on non-blocking streams:

```cpp
cudaMemcpyToSymbolAsync(constVal, deviceValue, sizeof(float), 0, cudaMemcpyDeviceToDevice, stream);
```
Is this:
Safe to be accessed by multiple MPI ranks simultaneously within kernels? I.e., do ranks share the same instance of constVal, or do MPI semantics (each rank has its own private copy) still hold?
If the above is safe, is it safe for it to be initialized by multiple MPI ranks?
Safe to be accessed by multiple MPI ranks simultaneously within kernels? I.e., do ranks share the same instance of constVal, or do MPI semantics (each rank has its own private copy) still hold?
Neither. CUDA contexts are not shared amongst processes. If you have multiple processes you get multiple contexts, and each context has its own copy of all the statically defined symbols and code. This behaviour is independent of MPI semantics. If you are imagining that multiple processes in an MPI communicator are sharing the same GPU context and state, they aren't.
If the above is safe, is it safe for it to be initialized by multiple MPI ranks?
It isn't only safe, it is mandatory: since each rank has its own context, and therefore its own copy of the symbol, each rank must perform its own initialization.
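To make that concrete, here is a minimal sketch of the per-rank pattern (not from the original post). It assumes a simple rank-to-device mapping, and the deviceValue source buffer is a placeholder:

```cpp
#include <mpi.h>
#include <cuda_runtime.h>

__constant__ float constVal;

__global__ void useConstVal(float *out)
{
    *out = constVal;                         // reads this rank's private copy of the symbol
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0, nDevices = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaGetDeviceCount(&nDevices);
    cudaSetDevice(rank % nDevices);          // each rank/process gets its own context

    cudaStream_t stream;
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

    float *deviceValue = nullptr;            // placeholder source buffer
    cudaMalloc(&deviceValue, sizeof(float));

    // Every rank performs its own copy-to-symbol; the symbol lives in this
    // rank's context, so ranks cannot interfere with each other.
    cudaMemcpyToSymbolAsync(constVal, deviceValue, sizeof(float), 0,
                            cudaMemcpyDeviceToDevice, stream);

    float *out = nullptr;
    cudaMalloc(&out, sizeof(float));
    useConstVal<<<1, 1, 0, stream>>>(out);   // stream-ordered after the copy
    cudaStreamSynchronize(stream);

    cudaFree(out);
    cudaFree(deviceValue);
    cudaStreamDestroy(stream);
    MPI_Finalize();
    return 0;
}
```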
Related
In CUDA programming, I am trying to reduce the synchronization overhead between off-chip and on-chip memory when there is a data dependency between two kernels. What are the differences between these two techniques (kernel fusion and persistent threads/kernels)?
The idea behind kernel fusion is to take two (or more) discrete operations, that could be realized (and might already be realized) in separate kernels, and combine them so the operations all happen in a single kernel.
The benefits of this may or may not seem obvious, so I refer you to this writeup.
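As a rough illustration of the idea (the kernels below are made up for this answer, not taken from the writeup): instead of launching two dependent element-wise kernels, where the intermediate result must round-trip through global memory, you combine them so the intermediate value stays in a register:

```cpp
// Unfused: two kernel launches; a*x is written to global memory by the first
// kernel and read back by the second.
__global__ void scaleKernel(float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = a * x[i];
}

__global__ void addKernel(float *x, const float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] + y[i];
}

// Fused: one kernel launch; the intermediate value a*x[i] never leaves a register.
__global__ void scaleAddKernel(float *x, const float *y, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = a * x[i] + y[i];
}
```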
Persistent threads/Persistent kernel is a kernel design strategy that allows the kernel to continue execution indefinitely. Typical "ordinary" kernel design focuses on solving a particular task, and when that task is done, the kernel exits (at the closing curly-brace of your kernel code).
A persistent kernel however has a governing loop in it that only ends when signaled - otherwise it runs indefinitely. People often connect this with the producer-consumer model of application design. Something (host code) produces data, and your persistent kernel consumes that data and produces results. This producer-consumer model can run indefinitely. When there is no data to consume, the consumer (your persistent kernel) simply waits in a loop, for new data to be presented.
Persistent kernel design has a number of important considerations, which I won't try to list here but instead refer you to this longer writeup/example.
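For flavour only, here is a highly simplified, single-block sketch of such a governing loop. It assumes the host (the producer) communicates through flags placed in mapped or managed memory; the flag names are my own, and a production version needs far more care with memory fences, grid sizing and forward-progress guarantees, as the linked writeup discusses:

```cpp
// Simplified persistent-kernel consumer (single block, illustrative only).
// quitFlag and dataReady are assumed to live in mapped/managed memory written by the host.
__global__ void persistentConsumer(volatile int *quitFlag,
                                   volatile int *dataReady,
                                   const float *workBuffer,
                                   float *resultBuffer,
                                   int n)
{
    __shared__ int quit;
    while (true) {
        if (threadIdx.x == 0) {
            // Wait for the producer to publish a batch, or for the quit signal.
            while (*dataReady == 0 && *quitFlag == 0) { /* spin */ }
            quit = *quitFlag;
        }
        __syncthreads();
        if (quit) break;                              // the governing loop only ends when signaled

        for (int i = threadIdx.x; i < n; i += blockDim.x)
            resultBuffer[i] = 2.0f * workBuffer[i];   // "consume" this batch
        __threadfence_system();                       // make results visible to the host

        __syncthreads();
        if (threadIdx.x == 0) *dataReady = 0;         // tell the producer this batch is done
        __syncthreads();
    }
}
```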
Benefits:
Kernel fusion may combine work into a single kernel so as to increase performance by reduction of unnecessary loads and stores - because the data being operated on can be preserved in-place in device registers or shared memory.
Persistent kernels may have a variety of benefits. They may possibly reduce the latency associated with processing data, because the CUDA kernel launch overhead is no longer necessary. However another possible performance factor may be the ability to retain state (similar to kernel fusion) in device registers or shared memory.
Kernel fusion doesn't necessarily imply a persistent kernel. You may simply be combining a set of tasks into a single kernel. A persistent kernel doesn't necessarily imply fusion of separate computation tasks - there may be only 1 "task" that you are performing in a governing "consumer" loop.
But there is obviously considerable conceptual overlap between the two ideas.
I am planning to reuse a single cuBLAS handle for calls issued to multiple streams. I think this saves some configuration time, but I am not sure whether it will cause unexpected behaviour.
If you need to issue calls in any sort of thread concurrency scenario, it's recommended to use independent handles:
https://docs.nvidia.com/cuda/cublas/index.html#thread-safety2
The library is thread safe and its functions can be called from multiple host threads, as long as threads do not share the same cuBLAS handle simultaneously.
Also note that the device associated with a particular cuBLAS handle is expected to remain unchanged for the duration of handle use:
https://docs.nvidia.com/cuda/cublas/index.html#cublas-context
The device associated with a particular cuBLAS context is assumed to remain unchanged between the corresponding cublasCreate() and cublasDestroy() calls.
Otherwise, using a single handle should be fine amongst cublas calls belonging to the same device and host thread, even if shared amongst multiple streams.
An example of using a single "global" handle with multiple streamed CUBLAS calls (from the same host thread, on the same GPU device) is given in the CUDA batchCUBLAS sample code.
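A condensed sketch of that pattern (the SAXPY operation, array-of-pointers layout and sizes are placeholders of mine, not the sample's code): one handle is created once, and cublasSetStream() redirects each call to a different stream:

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>

// One cuBLAS handle, one device, one host thread, several streams.
void streamedSaxpy(float *d_x[], float *d_y[], int n, int nStreams)
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    cudaStream_t *streams = new cudaStream_t[nStreams];
    const float alpha = 2.0f;

    for (int i = 0; i < nStreams; ++i) {
        cudaStreamCreate(&streams[i]);
        cublasSetStream(handle, streams[i]);   // subsequent calls on this handle go to this stream
        cublasSaxpy(handle, n, &alpha, d_x[i], 1, d_y[i], 1);
    }

    for (int i = 0; i < nStreams; ++i) {
        cudaStreamSynchronize(streams[i]);
        cudaStreamDestroy(streams[i]);
    }
    delete[] streams;
    cublasDestroy(handle);
}
```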
Glancing through the official NVIDIA Multi-Process Service (MPS) docs, it is unclear to me how it interacts with CUDA streams.
Here's an example:
App 0: issues kernels to logical stream 0;
App 1: issues kernels to (its own) logical stream 0.
In this case,
1) Does MPS "hijack" these CUDA calls, and if so, how? Does it have full knowledge, for each application, of what streams are used and what kernels are in which streams?
2) Does MPS create its own 2 streams, and place the respective kernels into the right streams? Or does MPS potentially enable kernel concurrency via mechanisms other than streams?
If it helps, I'm interested in how MPS works on Volta, but information with respect to older architectures is appreciated as well.
A way to think about MPS is that it acts as a funnel for CUDA activity, emanating from multiple processes, to take place on the GPU as if they emanated from a single process. One of the specific benefits of MPS is that it is theoretically possible for kernel concurrency even if the kernels emanate from separate processes. The "ordinary" CUDA multi-process execution model would serialize such kernel executions.
Since kernel concurrency in a single process implies that the kernels in question are issued to separate streams, it stands to reason that conceptually, MPS is treating the streams from the various client processes as being completely separate. Naturally, then, if you profile such a MPS setup, the streams will show up as being separate from each other, whether they are separate streams associated with a single client process, or streams across several client processes.
In the pre-Volta case, MPS did not guarantee process isolation between kernel activity from separate processes. In this respect, it was very much like a funnel, taking activity from several processes and issuing it to the GPU as if it were issued from a single process.
In the Volta case, activity from separate processes behaves from an execution standpoint (e.g. concurrency, etc.) as if it were from a single process, but activity from separate processes still carry process isolation (e.g. independent address spaces).
1) Does MPS "hijack" these CUDA calls, and if so, how? Does it have full knowledge, for each application, of what streams are used and what kernels are in which streams?
Yes, CUDA MPS understands separate streams from a given process, as well as the activity issued to each, and maintains such stream semantics when issuing work to the GPU. The exact details of how CUDA calls are handled by MPS are unpublished, to my knowledge.
2) Does MPS create its own 2 streams, and place the respective kernels into the right streams? Or does MPS potentially enable kernel concurrency via mechanisms other than streams?
MPS maintains all stream activity, as well as CUDA stream semantics, across all clients. Activity issued into a particular CUDA stream will be serialized. Activity issued to independent streams may possibly run concurrently. This is true regardless of the origin of the streams in question, be they from one process or several.
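To underline that last point: each client's code is just ordinary CUDA. The sketch below (the kernel and sizes are made up for illustration) is what App 0 and App 1 might each run; under MPS, the kernels they issue to their respective per-process streams may also run concurrently with each other, subject to resource availability:

```cpp
#include <cuda_runtime.h>

__global__ void clientKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Runs unchanged in each MPS client process.
void clientWork(float *d_data, int n)
{
    cudaStream_t myStream;                     // this process's "logical stream 0"
    cudaStreamCreate(&myStream);
    clientKernel<<<(n + 255) / 256, 256, 0, myStream>>>(d_data, n);
    cudaStreamSynchronize(myStream);
    cudaStreamDestroy(myStream);
}
```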
Is it possible to share a cudaMalloc'ed GPU buffer between different contexts (CPU threads) which use the same GPU? Each context allocates an input buffer which needs to be filled up by a pre-processing kernel that will use the entire GPU and then distribute the output to them.
This scenario is ideal for avoiding multiple data transfers to and from the GPUs. The application is a beamformer, which will combine multiple antenna signals and generate multiple beams, where each beam will be processed by a different GPU context. The entire processing pipeline for the beams is already in place; I just need to add the beamforming part. Having each thread generate its own beam would duplicate the input data, so I'd like to avoid that (also, it's much more efficient to generate multiple beams in one go).
Each CUDA context has its own virtual memory space, therefore you cannot use a pointer from one context inside another context.
That being said, as of CUDA 4.0 by default there is one context created per process and not per thread. If you have multiple threads running with the same CUDA context, sharing device pointers between threads should work without problems.
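Here is a minimal runtime-API sketch of that case (the kernel, sizes and use of std::thread are placeholders of mine): one host thread allocates the buffer, and a second thread in the same process launches a kernel on the same pointer, because both threads share the process-wide context on that device:

```cpp
#include <cuda_runtime.h>
#include <thread>

__global__ void consumeBuffer(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

int main()
{
    const int n = 1 << 20;
    float *d_in = nullptr, *d_out = nullptr;
    cudaSetDevice(0);
    cudaMalloc(&d_in, n * sizeof(float));      // allocated by the "main" host thread
    cudaMalloc(&d_out, n * sizeof(float));

    // Another host thread in the same process can use the same device pointers,
    // because both threads share the same (primary) context on device 0.
    std::thread worker([=] {
        cudaSetDevice(0);
        consumeBuffer<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
        cudaDeviceSynchronize();
    });
    worker.join();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```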
I don't think multiple threads can run with the same CUDA context. I have done an experiment: the parent CPU thread creates a context and then forks a child thread. The child thread launches a kernel using the context created by the parent thread (via cuCtxPushCurrent(ctx)). The program just hangs there.
I need some advice on a project that I am going to undertake. I am planning to run simple kernels (yet to decide, but I am leaning towards embarrassingly parallel ones) on a multi-GPU node using CUDA 4.0, following the strategies listed below. The intention is to profile the node by launching kernels using the different strategies that CUDA provides in a multi-GPU environment.
1) Single host thread - multiple devices (shared context)
2) Single host thread - concurrent execution of kernels on a single device (shared context)
3) Multiple host threads - (equal) multiple devices (independent contexts)
4) Single host thread - sequential kernel execution on one device
5) Multiple host threads - concurrent execution of kernels on one device (independent contexts)
6) Multiple host threads - sequential execution of kernels on one device (independent contexts)
Am I missing any categories? What is your opinion of the test categories I have chosen? Any general advice w.r.t. multi-GPU programming is welcome.
Thanks,
Sayan
EDIT:
I thought that the previous categorization involved some redundancy, so I modified it.
Most workloads are light enough on CPU work that you can juggle multiple GPUs from a single thread, but that only became easily possible starting with CUDA 4.0. Before CUDA 4.0, you would call cuCtxPopCurrent()/cuCtxPushCurrent() to change the context that is current to a given thread. But starting with CUDA 4.0, you can just call cudaSetDevice() to set the current context to correspond to a given device.
Your option 1) is a misnomer, though, because there is no "shared context" - the GPU contexts are still separate and device memory and objects such as CUDA streams and CUDA events are affiliated with the GPU context in which they were created.
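As a sketch of the single-host-thread, multiple-devices style using the runtime API (the kernel and sizes are placeholders of my own, not part of the original answer): fan the work out to each device, then come back around to synchronize:

```cpp
#include <cuda_runtime.h>

__global__ void work(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    int nDevices = 0;
    cudaGetDeviceCount(&nDevices);
    const int n = 1 << 20;
    float **d_data = new float *[nDevices];

    // cudaSetDevice() selects which device (context) subsequent calls apply to;
    // kernel launches are asynchronous, so this loop fans work out to all GPUs.
    for (int dev = 0; dev < nDevices; ++dev) {
        cudaSetDevice(dev);
        cudaMalloc(&d_data[dev], n * sizeof(float));
        work<<<(n + 255) / 256, 256>>>(d_data[dev], n);
    }

    // Then come back around to wait for, and clean up after, each device.
    for (int dev = 0; dev < nDevices; ++dev) {
        cudaSetDevice(dev);
        cudaDeviceSynchronize();
        cudaFree(d_data[dev]);
    }
    delete[] d_data;
    return 0;
}
```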
"Multiple host threads - (equal) multiple devices (independent contexts)" (your option 3) is a winner if you can get away with it. This assumes that you can get truly independent units of work, which should be the case since your problem is embarrassingly parallel.
Caveat emptor: I have not personally built a large-scale multi-GPU system. I have built a successful single-GPU system with 3 orders of magnitude acceleration relative to CPUs. Thus, the advice is a generalization from the synchronization costs I've seen, as well as from discussions with colleagues who have built multi-GPU systems.