Should I make a cuBLAS handle global and reuse it in different (host) functions?

I think this saves some configuration time, but I am not sure whether this will cause unexpected behaviours.

If you need to issue calls in any sort of thread-concurrency scenario, it's recommended to use independent handles:
https://docs.nvidia.com/cuda/cublas/index.html#thread-safety2
The library is thread safe and its functions can be called from multiple host threads, as long as threads do not share the same cuBLAS handle simultaneously.
Also note that the device associated with a particular cuBLAS handle is expected to remain unchanged for the duration of the handle's use:
https://docs.nvidia.com/cuda/cublas/index.html#cublas-context
The device associated with a particular cuBLAS context is assumed to remain unchanged between the corresponding cublasCreate() and cublasDestroy() calls.
Otherwise, using a single handle should be fine amongst cublas calls belonging to the same device and host thread, even if shared amongst multiple streams.
An example of using a single "global" handle with multiple streamed CUBLAS calls (from the same host thread, on the same GPU device) is given in the CUDA batchCUBLAS sample code.

Related

Optimal use of GPU resources in case of many interdependent tasks

In my use case, the global GPU memory has many chunks of data. Preferably, the number of these could change, but assuming the number and sizes of these chunks of data to be constant is fine as well. Now, there are a set of functions that take as input some of the chunks of data and modify some of them. Some of these functions should only start processing if others completed already. In other words, these functions could be drawn in graph form with the functions being the nodes and edges being dependencies between them. The ordering of these tasks is quite weak though.
My question is now the following: What is (on a conceptual level) a good way to implement this in CUDA?
An idea that I had, which could serve as a starting point, is the following: A single kernel is launched. That single kernel creates a grid of blocks with the blocks corresponding to the functions mentioned above. Inter-block synchronization ensures that blocks only start processing data once their predecessors completed execution.
I looked up how this could be implemented, but I failed to figure out how inter-block synchronization can be done (if this is possible at all).
For any of the solutions, I would create an array in memory of 500 node blocks * 10,000 floats (= 20 MB), with each 10,000 floats stored as one contiguous block. (The number of floats should preferably be divisible by 32, e.g. 10,016 floats, for memory alignment reasons.)
Solution 1: Runtime Compilation (sequential, but optimized)
Use Python code to generate a sequential order of functions according to the graph and create (printing out the source code into a string) a small program which calls the functions in turn. Each function should read the input from its predecessor blocks in memory and store the output in its own output block. Python should output the glue code (as string) which calls all functions in the correct order.
Use NVRTC (https://docs.nvidia.com/cuda/nvrtc/index.html, https://github.com/NVIDIA/pynvrtc) for runtime compilation and the compiler will optimize a lot.
A further optimization would be to store the intermediate results not in memory but in local variables. They will be enough for all your specified cases (maximum of 255 registers per thread), but of course this makes the program (a little) more complicated. The variables can be freely named, and you can have 500 of them: the compiler will optimize the assignment to registers and reuse registers. So have one variable for each node output, e.g. float node352 = f_352(node45, node182, node416);
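The code-generation step of Solution 1 can be sketched as follows. The answer suggests doing this in Python; this is a host-side C++ version of the same idea, assuming the graph is given as a list of input nodes per node. The function name generate_glue and the deps layout are hypothetical. It orders the nodes with Kahn's algorithm and emits one line of glue code per node in the style of the example above. The resulting string would then be handed to NVRTC, which is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <sstream>
#include <string>
#include <vector>

// deps[i] lists the nodes whose outputs node i reads (hypothetical layout).
// Returns one line of glue code per node, ordered so that every node
// appears after all of its inputs (Kahn's algorithm).
std::vector<std::string> generate_glue(const std::vector<std::vector<int>>& deps) {
    const int n = static_cast<int>(deps.size());
    std::vector<int> pending(n);             // number of unmet inputs per node
    std::vector<std::vector<int>> users(n);  // reverse edges: who reads node i
    for (int i = 0; i < n; ++i) {
        pending[i] = static_cast<int>(deps[i].size());
        for (int d : deps[i]) users[d].push_back(i);
    }
    // Min-heap so that the smallest ready index is emitted first,
    // which makes the output deterministic.
    std::priority_queue<int, std::vector<int>, std::greater<int>> ready;
    for (int i = 0; i < n; ++i)
        if (pending[i] == 0) ready.push(i);

    std::vector<std::string> lines;
    while (!ready.empty()) {
        const int i = ready.top();
        ready.pop();
        std::ostringstream line;
        line << "float node" << i << " = f_" << i << "(";
        for (std::size_t k = 0; k < deps[i].size(); ++k)
            line << (k ? ", " : "") << "node" << deps[i][k];
        line << ");";
        lines.push_back(line.str());
        for (int u : users[i])
            if (--pending[u] == 0) ready.push(u);
    }
    // Fewer lines than nodes means some nodes never became ready: a cycle.
    assert(static_cast<int>(lines.size()) == n && "dependency graph has a cycle");
    return lines;
}
```

For a three-node graph where node 0 reads node 2 and node 2 reads node 1, this emits the lines for node1, node2 and node0 in that order.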
Solution 2: Controlled run on device (sequential)
The python program creates a list with the order, in which the functions have to be called. The individual functions know, from what memory blocks to read and in what block to write (either hard-coded, or you have to submit it to them in a memory structure).
In the device kernel, a for loop runs through the order list sequentially and calls each function from the list.
How to specify, which functions to call?
The function pointers in the list can be created on the CPU as in the following code: https://leimao.github.io/blog/Pass-Function-Pointers-to-Kernels-CUDA/ (not sure whether it works from Python).
Or, regardless of the host programming language, a separate kernel can create a translation table of device function pointers (assign_kernel). The list from Python would then contain indices into this table.
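The translation-table idea can be illustrated with a host-only C++ sketch. All names here (NodeFn, kTable, run_in_order and the three node functions) are hypothetical. On the GPU the node functions would be __device__ functions, the table would be filled on the device, and the loop would run inside a kernel. This version only shows how an order list of plain indices drives the calls.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical node functions: each reads and writes fixed slots of `mem`.
using NodeFn = void (*)(float* mem);

inline void load_inputs(float* mem) { mem[0] = 2.0f; mem[1] = 3.0f; }
inline void add_node(float* mem)    { mem[2] = mem[0] + mem[1]; }
inline void square_node(float* mem) { mem[3] = mem[2] * mem[2]; }

// Translation table: the order list stores indices into this array, so the
// code that builds the schedule never has to handle raw function pointers.
constexpr NodeFn kTable[] = { load_inputs, add_node, square_node };
constexpr std::size_t kTableSize = sizeof(kTable) / sizeof(kTable[0]);

// Walk the order list sequentially and call the function behind each index.
void run_in_order(const std::vector<int>& order, float* mem) {
    for (int idx : order) {
        assert(idx >= 0 && static_cast<std::size_t>(idx) < kTableSize);
        kTable[idx](mem);
    }
}
```

Because the schedule is just a list of integers, it can be produced by any host language and copied to the device as an ordinary array.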
Solution 3: Dynamic Parallelism (parallel)
With Dynamic Parallelism kernels themselves start other kernels (grids).
https://developer.nvidia.com/blog/cuda-dynamic-parallelism-api-principles/
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-dynamic-parallelism
There is a maximum depth of 24.
The state of the parent grid could be swapped to memory (which could take a maximum of 860 MB per level, probably not for your program). But this could be a limitation.
All this swapping could make the parallel version slower again.
But the advantage would be that nodes can really be run in parallel.
Solution 4: Use CUDA Streams and Events (parallel)
Each kernel just calls one function. The synchronization and scheduling is done from Python. But the kernels run asynchronously and call a callback as soon as they are finished. Each kernel running in parallel has to be run on a separate stream.
Optimization: You can use the CUDA graph API, with which CUDA learns the order of the kernels and can do additional optimizations, when replaying (with possibly other float input data, but the same graph).
For all methods
You can try different launch configurations from 32 or better 64 threads per block up to 1024 threads per block.
Let's assume that most, or all, of your chunks of data are large; and that you have many distinct functions. If the former does not hold it's not clear you will even benefit from having them on a GPU in the first place. Let's also assume that the functions are black boxes to you, and you don't have the ability to identify fine-grained dependencies between individual values in your different buffers, with simple, local dependency functions.
Given these assumptions - your workload is basically the typical case of GPU work, which CUDA (and OpenCL) have catered for since their inception.
Traditional plain-vanilla approach
You define multiple streams (queues) of tasks; you schedule kernels on these streams for your various functions; and schedule event-fires and event-waits corresponding to your functions' inter-dependencies (or the buffer processing dependencies). The event-waits before kernel launches ensure no buffer is processed until all preconditions have been satisfied. Then you have different CPU threads wait/synchronize with these streams, to get your work going.
Now, as far as the CUDA APIs go - this is bread-and-butter stuff. If you've read the CUDA Programming Guide, or at least the basic sections of it, you know how to do this. You could avail yourself of convenience libraries, like my API wrapper library, or if your workload fits, a higher-level offering such as NVIDIA Thrust might be more appropriate.
The multi-threaded synchronization is a bit less trivial, but this still isn't rocket-science. What is tricky and delicate is choosing how many streams to use and what work to schedule on what stream.
Using CUDA task graphs
With CUDA 10.x, NVIDIA added API functions for explicitly creating task graphs, with kernels and memory copies as nodes and edges for dependencies; and when you've completed the graph-construction API calls, you "schedule the task graph", so to speak, on any stream, and the CUDA runtime essentially takes care of what I've described above, automagically.
For an elaboration on how to do this, please read:
Getting Started with CUDA Graphs
on the NVIDIA developer blog. Or, for a deeper treatment - there's actually a section about them in the programming guide, and a small sample app using them, simpleCudaGraphs.
White-box functions
If you actually do know a lot about your functions, then perhaps you can create larger GPU kernels which perform some dependent processing, by keeping parts of intermediate results in registers or in block shared memory, and continuing to the part of a subsequent function applied to such local results. For example, if your first kernel does c[i] = a[i] + b[i] and your second kernel does e[i] = d[i] * c[i], you could instead write a kernel which performs the second action after the first, with inputs a,b,d (no need for c). Unfortunately I can't be less vague here, since your question was somewhat vague.
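The fusion example can be made concrete with a host-only C++ sketch, taking the second step to consume the first step's output, i.e. e[i] = d[i] * c[i]. The function names two_pass and fused are hypothetical, and plain loops stand in for kernels: in CUDA each loop body would be the per-thread body of a kernel, and the local variable in the fused version would live in a register.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Unfused: two passes, with the intermediate c written to memory and read back.
void two_pass(const std::vector<float>& a, const std::vector<float>& b,
              const std::vector<float>& d, std::vector<float>& c,
              std::vector<float>& e) {
    assert(a.size() == b.size() && a.size() == d.size());
    c.resize(a.size());
    e.resize(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) c[i] = a[i] + b[i];  // "kernel 1"
    for (std::size_t i = 0; i < a.size(); ++i) e[i] = d[i] * c[i];  // "kernel 2"
}

// Fused: one pass; the intermediate stays in a local variable, so no
// buffer for c is needed at all.
void fused(const std::vector<float>& a, const std::vector<float>& b,
           const std::vector<float>& d, std::vector<float>& e) {
    assert(a.size() == b.size() && a.size() == d.size());
    e.resize(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        const float c_i = a[i] + b[i];  // would be a register on the GPU
        e[i] = d[i] * c_i;
    }
}
```

Both versions produce the same e; the fused one saves one write and one read of the intermediate buffer per element, plus one launch.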

Share GPU buffers across different CUDA contexts

Is it possible to share a cudaMalloc'ed GPU buffer between different contexts (CPU threads) which use the same GPU? Each context allocates an input buffer which need to be filled up by a pre-processing kernel which will use the entire GPU and then distribute the output to them.
This scenario is ideal to avoid multiple data transfers to and from the GPUs. The application is a beamformer, which will combine multiple antenna signals and generate multiple beams, where each beam will be processed by a different GPU context. The entire processing pipeline for the beams is already in place, I just need to add the beamforming part. Having each thread generate its own beam would duplicate the input data, so I'd like to avoid that (also, it's much more efficient to generate multiple beams in one go).
Each CUDA context has its own virtual memory space, therefore you cannot use a pointer from one context inside another context.
That being said, as of CUDA 4.0 by default there is one context created per process and not per thread. If you have multiple threads running with the same CUDA context, sharing device pointers between threads should work without problems.
I don't think multiple threads can run with the same CUDA context. I have done the experiment: a parent CPU thread creates a context and then spawns a child thread. The child thread then launches a kernel using the context created by the parent thread (via cuCtxPushCurrent(ctx)). The program just hangs there.

Are cudaMallocHost() and cudaEventCreate() asynchronous with executing kernels?

I am running into a very strange issue with the CUDA Runtime API. Calls to functions like cudaMallocHost(), cudaEventCreate(), cudaFree() etc. seem to be executed only when kernels finish execution on the GPU. These kernels are all launched on a stream created with the cudaStreamNonBlocking flag. What is the problem? Do I have to set some other flags somewhere?
They could be made asynchronous, but it wouldn't be surprising if they are not.
With respect to cudaMallocHost(), which requires that the host memory be mapped for the GPU: if the allocation can't be satisfied from a preallocated pool, the GPU's page tables must be edited. It would not surprise me in the least if the driver had a restriction where it could not edit the page tables of an executing kernel. (Esp. since the page table editing must be done by kernel mode driver code.)
With respect to cudaEventCreate(), that really should be asynchronous since those allocations generally can be satisfied from a preallocated pool. The main impediment there is that changing the behavior would break existing applications that rely on its current, synchronous behavior.
Freeing objects asynchronously requires the driver to track which objects are referenced in the command buffers submitted to the GPU, and defer the actual free operation until after the GPU has finished processing them. It is doable but I am not sure NVIDIA has done the work.
For cudaFree(), it is not possible to track references as you could for CUDA events (because pointers can be stored for running kernels to read and chase). So for large virtual address ranges that should be deallocated and unmapped, the free must be deferred until after all pending GPU operations have executed. Again, doable but I am not sure NVIDIA has done the work.
I think NVIDIA generally expects developers to work around the lack of asynchrony in these entry points.

CUDA inter-kernel communication between different streams

Has anyone successfully run 2 different kernels in 2 different CUDA streams and gotten them to synchronize? Basically I want to have 1 kernel A send data to another concurrently running kernel B (in a different stream), then get results back. The reason: kernel A is running in 1 CUDA thread and I want a multiple GPU thread implementation for kernel B.
This is with high end GPUs (Fermi/Tesla), CUDA 4.2
Same GPU, different streams. So the data should be able to be communicated thru device memory, but how to sync them?
The CUDA Programming Model only supports communication between threads in the same thread block (CUDA C Programming Guide at the end of section 2.2 Thread Hierarchy). This cannot be reliably implemented through the current CUDA API. If you try you may find partial success. However, this will fail on different OSes, different executions of your application, and this will be broken by future driver updates and new hardware (GK110 supports enhanced concurrency model).
If I correctly caught your question, you have two problems:
Inter-Kernel data exchange
Inter-Kernel synchronization
1) Inter-Kernel Data Exchange can be achieved through sharing data in global device memory.
2) As far as I know, there are no reliable facilities for inter-kernel synchronization provided by CUDA, and I'm unaware of any suitable trick that can be applied here.
The CUDA C Programming Guide v7.5 tells us:
"Applications manage the concurrent operations described above through streams. A stream is a sequence of commands (possibly issued by different host threads) that execute in order. Different streams, on the other hand, may execute their commands out of order with respect to one another or concurrently; this behavior is not guaranteed and should therefore not be relied upon for correctness (e.g., inter-kernel communication is undefined)."
You will need to synchronize on the host. Off the top of my head, calling cudaStreamSynchronize for every stream in turn (or cudaDeviceSynchronize once for the whole device) should do the trick, but it may not be that easy.
Your data must be in global memory
You need to get the data address on the host
You must pass this address on to the second kernel
Your code should look something like this:
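// T stands for your element type; the _d buffer is shared by both kernels
T *dataToExchange_h, *dataToExchange_d;
cudaMalloc((void**)&dataToExchange_d, sizeof(data));
kernel1<<<M1, N1, 0, stream1>>>(dataToExchange_d);
cudaStreamSynchronize(stream1);  // host blocks until kernel1 has finished
kernel2<<<M2, N2, 0, stream2>>>(dataToExchange_d);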
But note that stream synchronization slows down the process, so you should avoid it as much as possible.
You can also get stream synchronization through CUDA events; it is less obvious and does not give any special advantage, but it's useful to know about ;-)

Check context of given resource

Let's imagine a situation in which I have a lot of initialized resources, for example streams, host and device memory, and events. Some of them were initialized in the context of one GPU and the rest belong to the other GPU's context.
Is there a way to check whether a given resource (event, stream or memory) belongs to a certain GPU context?
In some cases it would be worthwhile to assert such things before issuing a memory copy or kernel execution, rather than getting cudaErrorInvalidArgument afterwards.
I am not really aware of such an option in the CUDA API itself. It is just a low-level set of commands that you can issue to your GPU.
What I would do is to wrap the CUDA API functions into some nice class which would track what is where and what is initialised. A class representing a GPU might be useful as well.
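A minimal sketch of such a tracking class, in host-only C++. The names ResourceRegistry, ResourceKind, add, device_of and belongs_to are hypothetical and not part of CUDA. It assumes each resource can be identified by a pointer-sized handle, and that the wrapper code calls add() right after every creation call and remove() before every destruction call. No CUDA calls are made here; the registry only remembers what it was told.

```cpp
#include <cassert>
#include <unordered_map>

enum class ResourceKind { Stream, Event, Memory };

// Remembers which device each resource was created on, so ownership can be
// asserted before a copy or launch is issued.
class ResourceRegistry {
public:
    // Record that `handle` was created while `device` was the current device.
    void add(const void* handle, ResourceKind kind, int device) {
        entries_[handle] = Entry{kind, device};
    }

    // Forget a resource; call this when the resource is destroyed or freed.
    void remove(const void* handle) { entries_.erase(handle); }

    // Returns the owning device, or -1 if the handle was never registered.
    int device_of(const void* handle) const {
        auto it = entries_.find(handle);
        return it == entries_.end() ? -1 : it->second.device;
    }

    // True only if the handle is registered and owned by `device`.
    bool belongs_to(const void* handle, int device) const {
        auto it = entries_.find(handle);
        return it != entries_.end() && it->second.device == device;
    }

private:
    struct Entry {
        ResourceKind kind;
        int device;
    };
    std::unordered_map<const void*, Entry> entries_;
};
```

A wrapper around a copy or launch could then check belongs_to(ptr, device) for each argument and fail early with a readable message instead of waiting for the runtime error.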