Is it possible to share a cudaMalloc'ed GPU buffer between different contexts (CPU threads) which use the same GPU? Each context allocates an input buffer which need to be filled up by a pre-processing kernel which will use the entire GPU and then distribute the output to them.
This scenario is ideal to avoid multiple data transfer to and from the GPUs. The application is a beamformer, which will combine multiple antenna signals and generate multiple beams, where each beam will be processed by a different GPU context. The entire processing pipeline for the beams is already in place, I just need to add the beamforming part. Having each thread generate it's own beam would duplicate the input data so I'd like to avoid that (also, the it's much more efficient to generate multiple beams at one go).
Each CUDA context has it's own virtual memory space, therefore you cannot use a pointer from one context inside another context.
That being said, as of CUDA 4.0 by default there is one context created per process and not per thread. If you have multiple threads running with the same CUDA context, sharing device pointers between threads should work without problems.
I don't think multiple threads can run with the same CUDA context. I have done the experiments, parent cpu thread create a context and then fork a child thread. The child thread will launch a kernel using the context(cuCtxPushCurrent(ctx) ) created by the parent thread. The program just hang there.
Related
In my use case, the global GPU memory has many chunks of data. Preferably, the number of these could change, but assuming the number and sizes of these chunks of data to be constant is fine as well. Now, there are a set of functions that take as input some of the chunks of data and modify some of them. Some of these functions should only start processing if others completed already. In other words, these functions could be drawn in graph form with the functions being the nodes and edges being dependencies between them. The ordering of these tasks is quite weak though.
My question is now the following: What is (on a conceptual level) a good way to implement this in CUDA?
An idea that I had, which could serve as a starting point, is the following: A single kernel is launched. That single kernel creates a grid of blocks with the blocks corresponding to the functions mentioned above. Inter-block synchronization ensures that blocks only start processing data once their predecessors completed execution.
I looked up how this could be implemented, but I failed to figure out how inter-block synchronization can be done (if this is possible at all).
I would create for any solution an array in memory 500 node blocks * 10,000 floats (= 20 MB) with each 10,000 floats being stored as one continuous block. (The number of floats be better divisible by 32 => e.g. 10,016 floats for memory alignment reasons).
Solution 1: Runtime Compilation (sequential, but optimized)
Use Python code to generate a sequential order of functions according to the graph and create (printing out the source code into a string) a small program which calls the functions in turn. Each function should read the input from its predecessor blocks in memory and store the output in its own output block. Python should output the glue code (as string) which calls all functions in the correct order.
Use NVRTC (https://docs.nvidia.com/cuda/nvrtc/index.html, https://github.com/NVIDIA/pynvrtc) for runtime compilation and the compiler will optimize a lot.
A further optimization would be to not store the intermediate results in memory, but in local variables. They will be enough for all your specified cases (Maximum of 255 registers per thread). But of course makes the program (a small bit) more complicated. The variables can be freely named. And you can have 500 variables. The compiler will optimize the assignment to registers and reusing registers. So have one variable for each node output. E.g. float node352 = f_352(node45, node182, node416);
Solution 2: Controlled run on device (sequential)
The python program creates a list with the order, in which the functions have to be called. The individual functions know, from what memory blocks to read and in what block to write (either hard-coded, or you have to submit it to them in a memory structure).
On the device kernel a for loop is run, where the order list is went through sequentially and the kernel from the list is called.
How to specify, which functions to call?
The function pointers in the list can be created on the CPU like the following code: https://leimao.github.io/blog/Pass-Function-Pointers-to-Kernels-CUDA/ (not sure, if it works in Python).
Or regardless of host programming language a separate kernel can create a translation table: device function pointers (assign_kernel). Then the list from Python would contain indices into this table.
Solution 3: Dynamic Parallelism (parallel)
With Dynamic Parallelism kernels themselves start other kernels (grids).
https://developer.nvidia.com/blog/cuda-dynamic-parallelism-api-principles/
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-dynamic-parallelism
There is a maximum depth of 24.
The state of the parent grid could be swapped to memory (which could take a maximum of 860 MB per level, probably not for your program). But this could be a limitation.
All this swapping could make the parallel version slower again.
But the advantage would be that nodes can really be run in parallel.
Solution 4: Use Cuda Streams and Events (parallel)
Each kernel just calls one function. The synchronization and scheduling is done from Python. But the kernels run asynchronously and call a callback as soon as they are finished. Each kernel running in parallel has to be run on a separate stream.
Optimization: You can use the CUDA graph API, with which CUDA learns the order of the kernels and can do additional optimizations, when replaying (with possibly other float input data, but the same graph).
For all methods
You can try different launch configurations from 32 or better 64 threads per block up to 1024 threads per block.
Let's assume that most, or all, of your chunks of data are large; and that you have many distinct functions. If the former does not hold it's not clear you will even benefit from having them on a GPU in the first place. Let's also assume that the functions are black boxes to you, and you don't have the ability to identify fine-graines dependencies between individual values in your different buffers, with simple, local dependency functions.
Given these assumptions - your workload is basically the typical case of GPU work, which CUDA (and OpenCL) have catered for since their inception.
Traditional plain-vanilla approach
You define multiple streams (queues) of tasks; you schedule kernels on these streams for your various functions; and schedule event-fires and event-waits corresponding to your function's inter-dependency (or the buffer processing dependency). The event-waits before kernel launches ensure no buffer is processed until all preconditions have been satisfied. Then you have different CPU threads wait/synchronize with these streams, to get your work going.
Now, as far as the CUDA APIs go - this is bread-and-butter stuff. If you've read the CUDA Programming Guide, or at least the basic sections of it, you know how to do this. You could avail yourself of convenience libraries, like my API wrapper library, or if your workload fits, a higher-level offering such as NVIDIA Thrust might be more appropriate.
The multi-threaded synchronization is a bit less trivial, but this still isn't rocket-science. What is tricky and delicate is choosing how many streams to use and what work to schedule on what stream.
Using CUDA task graphs
With CUDA 10.x, NVIDIA add API functions for explicitly creating task graphs, with kernels and memory copies as nodes and edges for dependencies; and when you've completed the graph-construction API calls, you "schedule the task graph", so to speak, on any stream, and the CUDA runtime essentially takes care of what I've described above, automagically.
For an elaboration on how to do this, please read:
Getting Started with CUDA Graphs
on the NVIDIA developer blog. Or, for a deeper treatment - there's actually a section about them in the programming guide, and a small sample app using them, simpleCudaGraphs .
White-box functions
If you actually do know a lot about your functions, then perhaps you can create larger GPU kernels which perform some dependent processing, by keeping parts of intermediate results in registers or in block shared memory, and continuing to the part of a subsequent function applied to such local results. For example, if your first kernels does c[i] = a[i] + b[i] and your second kernel does e[i] = d[i] * e[i], you could instead write a kernel which performs the second action after the first, with inputs a,b,d (no need for c). Unfortunately I can't be less vague here, since your question was somewhat vague.
I am a newbie in cuda. According to my knowledge I must use global memory to make blocks communicate with each other, but my understanding of the stream concept and memory capabilities stuck somewhere. After searching I figured that streams queue multiple kernels in sequence and can be used to apply different kernels on different blocks.
Now I NEED to exchange arrays between 2 blocks or more. Can kernel be used to swap or exchange data within shared memory between blocks without involving global/device memory.
if I allocated block for each sub population to calculate fitness using some kernel and shared memory, can I transfer data between blocks
No. Shared memory has block scope. It is not portable between blocks. Global memory or heap memory is portable and could potentially be used to hold data to be accessed by multiple blocks.
However, the standard execution model in CUDA doesn't support grid level synchronization. Since CUDA 9, and with the newest generations of hardware, there is support for a grid level synchronization mechanism if you use cooperative groups, however neither PyCUDA nor Numba expose that facility as far as I am aware.
I am writing a simplistic raytracer. The idea is that for every pixel there is a thread that traverses a certain structure (geometry) that resides in global memory.
I invoke my kernel like so:
trace<<<gridDim, blockDim>>>(width, height, frameBuffer, scene)
Where scene is a structure that was previously allocated with cudaMalloc. Every thread has to start traversing this structure starting from the same node, and chances are that many concurrent threads will attempt to read the same nodes many times. Does that mean that when such reads take place, it cripples the degree of parallelism?
Given that geometry is large, I would assume that replicating it is not an option. I mean the whole processing still happens fairly fast, but I was wondering whether it is something that has to be dealt with, or simply left flung to the breeze.
First of all I think you got the wrong idea when you say concurrent reads may or may not cripple the degree of parallelism. Because that is what it means to be parallel. Each thread is reading concurrently. Instead you should be thinking if it affects the performance due to more memory accesses when each thread basically wants the same thing i.e. the same node.
Well according to the article here, Memory accesses can be coalesced if data locality is present and within warps only.
Which means if threads within a warp are trying to access memory locations near each other they can be coalesced. In your case each thread is trying to access the "same" node until it meets an endpoint where they branch.
This means the memory accesses will be coalesced within the warp till the threads branch off.
Efficient access to global memory from each thread depends on both your device architecture and your code. Arrays allocated on global memory are aligned to 256-byte memory segments by the CUDA driver. The device can access global memory via 32-, 64-, or 128-byte transactions that are aligned to their size. The device coalesces global memory loads and stores issued by threads of a warp into as few transactions as possible to minimize DRAM bandwidth. A misaligned data access for devices with a compute capability of less than 2.0 affects the effective bandwidth of accessing data. This is not a serious issue when working with a device that has a compute capability of > 2.0. That being said, pretty much regardless of your device generation, when accessing global memory with large strides, the effective bandwidth becomes poor (Reference). I would assume that for random access the the same behavior is likely.
Unless you are not changing the structure while reading, which I assume you do (if it's a scene you probably render each frame?) then yes, it cripples performance and may cause undefined behaviour. This is called a race condition. You can use atomic operations to overcome this type of problem. Using atomic operations guarantees that the race conditions don't happen.
You can try, stuffing the 'scene' to the shared memory if you can fit it.
You can also try using streams to increase concurrency which also brings some sort of synchronization to the kernels that are run in the same stream.
I have read some papers talking about "persistent threads" for GPGPU, but I don't really understand it. Can any one give me an example or show me the use of this programming fashion?
What I keep in my mind after reading and googling "persistent threads":
Presistent Threads it's no more than a while loop that keep thread running and computing a lot of bunch of works.
Is this correct? Thanks in advance
Reference: http://www.idav.ucdavis.edu/publications/print_pub?pub_id=1089
http://developer.download.nvidia.com/GTC/PDF/GTC2012/PresentationPDF/S0157-GTC2012-Persistent-Threads-Computing.pdf
CUDA exploits the Single Instruction Multiple Data (SIMD) programming model. The computational threads are organized in blocks and the thread blocks are assigned to a different Streaming Multiprocessor (SM). The execution of a thread block on a SM is performed by arranging the threads in warps of 32 threads: each warp operates in lock-step and executes exactly the
same instruction on different data.
Generally, to fill up the GPU, the kernel is launched with much more blocks that can actually be hosted on the SMs. Since not all the blocks can be hosted on a SM, a work scheduler performs a context switch when a block has finished computing. It should be noticed that the switching of the blocks is managed entirely in hardware by the scheduler, and the programmer has no means of influencing how blocks are scheduled onto the SM. This exposes a limitation for all those algorithms that do not perfectly fit a SIMD programming model and for which there is work imbalance. Indeed, a block A will not be replaced by another block B on the same SM until the last thread of block A will not have finished to execute.
Although CUDA does not expose the hardware scheduler to the programmer, the persistent threads style bypasses the hardware scheduler by relying on a work queue. When a block finishes, it checks the queue for more work and continues doing so until no work is left, at which point the block retires. In this way, the kernel is launched with as many blocks as the number of available SMs.
The persistent threads technique is better illustrated by the following example, which has been taken from the presentation
“GPGPU” computing and the CUDA/OpenCL Programming Model
Another more detailed example is available in the paper
Understanding the efficiency of ray traversal on GPUs
// Persistent thread: Run until work is done, processing multiple work per thread
// rather than just one. Terminates when no more work is available
// count represents the number of data to be processed
__global__ void persistent(int* ahead, int* bhead, int count, float* a, float* b)
{
int local_input_data_index, local_output_data_index;
while ((local_input_data_index = read_and_increment(ahead)) < count)
{
load_locally(a[local_input_data_index]);
do_work_with_locally_loaded_data();
int out_index = read_and_increment(bhead);
write_result(b[out_index]);
}
}
// Launch exactly enough threads to fill up machine (to achieve sufficient parallelism
// and latency hiding)
persistent<<numBlocks,blockSize>>(ahead_addr, bhead_addr, total_count, A, B);
Quite easy to understand. Usually each work item processed a small amount of work. If you want to save save workgroup switch time, you can let one work item process a lot of work using a loop. For instance, you have one image, and it is 1920x1080, you have 1920 workitem, and each work item processes one column of 1080 pixels using loop.
I am trying to "map" a few tasks to CUDA GPU. There are n tasks to process. (See the pseudo-code)
malloc an boolean array flag[n] and initialize it as false.
for each work-group in parallel do
while there are still unfinished tasks do
Do something;
for a few j_1, j_2, .. j_m (j_i<k) do
Wait until task j_i is finished; [ while(flag[j_i]) ; ]
Do Something;
end for
Do something;
Mark task k finished; [ flag[k] = true; ]
end while
end for
For some reason, I will have to use threads in different thread block.
The question is how to implement the Wait until task j_i is finished; and Mark task k finished; in CUDA. My implementation is to use an boolean array as the flag. Then set flag once a task is done, and read the flag to check if a task is done.
But it only works on small case, one large case, the GPU get crashed with unknown reason. Is there any better way to implement the Wait and Mark in CUDA.
That's basically a problem of inter-thread communication on CUDA.
Synchronising within a threadblock is straightforward using __syncthreads(). However synchronising between threadblocks is more tricky - the programming model method is to break into two kernels.
If you think about it, it makes sense. The execution model (for both CUDA and OpenCL) is for a whole bunch of blocks executing on processing units, but says nothing about when. This means that some blocks will be executing but others will not (they'll be waiting). So if you have a __syncblocks() then you would risk deadlock, since those already executing will stop, but those not executing will never reach the barrier.
You can share information between blocks (using global memory and atomics, for example), but not global synchronisation.
Depending on what you're trying to do, there is frequently another way of solving or breaking down the problem.
What you're asking for is not easily done since thread blocks can be scheduled in any order, and there is no easy way to synchronize or communicate between them. From the CUDA Programming Guide:
For the parallel workloads, at points in the algorithm where parallelism is broken because some threads need to synchronize in order to share data with each other, there are two cases: Either these threads belong to the same block, in which case they should use __syncthreads() and share data through shared memory within the same kernel invocation, or they belong to different blocks, in which case they must share data through global memory using two separate kernel invocations, one for writing to and one for reading from global memory. The second case is much less optimal since it adds the overhead of extra kernel invocations and global memory traffic. Its occurrence should therefore be minimized by mapping the algorithm to the CUDA programming model in such a way that the computations that require inter-thread communication are performed within a single thread block as much as possible.
So if you can't fit all the communication you need within a thread block, you would need to have multiple kernel calls in order to accomplish what you want.
I don't believe there is any difference with OpenCL, but I also don't work in OpenCL.
This kind of problems is best solved by a slightly different approach:
Don't assign fixed tasks to your threads, forcing your threads to wait until their task becomes available (which isn't possible in CUDA since threads can't block).
Instead, keep a list of available tasks (using atomic operations) and have each thread grab a task from that list.
This is still tricky to implement and get the corner cases right, but at least it's possible.
I think you dont need to implement in CUDA. Every thing can be implemented on CPU. You are waiting for a task to complete, then doing another task randomly. If you want to implement in CUDA, you dont need to wait for all the flags to be true. You know initially that all the flags are false. So just implement Do something in parallel for all the thread and change the flag to true.
If you want to implement in CUDA, take int flag and keep on adding 1 it after finishing Do something so that you can know the change in flag before and after doing Do something.
If i got your question wrong, please comment. I'll try to improve the answer.