In CUDA programming, I try to reduce the synchronization overhead between the off-chip memory and on-chip memory if there is data dependency between two kernels? What's the differences between these two techniques?
The idea behind kernel fusion is to take two (or more) discrete operations, that could be realized (and might already be realized) in separate kernels, and combine them so the operations all happen in a single kernel.
The benefits of this may or may not seem obvious, so I refer you to this writeup.
Persistent threads/Persistent kernel is a kernel design strategy that allows the kernel to continue execution indefinitely. Typical "ordinary" kernel design focuses on solving a particular task, and when that task is done, the kernel exits (at the closing curly-brace of your kernel code).
A persistent kernel however has a governing loop in it that only ends when signaled - otherwise it runs indefinitely. People often connect this with the producer-consumer model of application design. Something (host code) produces data, and your persistent kernel consumes that data and produces results. This producer-consumer model can run indefinitely. When there is no data to consume, the consumer (your persistent kernel) simply waits in a loop, for new data to be presented.
Persistent kernel design has a number of important considerations, which I won't try to list here but instead refer you to this longer writeup/example.
Benefits:
Kernel fusion may combine work into a single kernel so as to increase performance by reduction of unnecessary loads and stores - because the data being operated on can be preserved in-place in device registers or shared memory.
Persistent kernels may have a variety of benefits. They may possibly reduce the latency associated with processing data, because the CUDA kernel launch overhead is no longer necessary. However another possible performance factor may be the ability to retain state (similar to kernel fusion) in device registers or shared memory.
Kernel fusion doesn't necessarily imply a persistent kernel. You may simply be combining a set of tasks into a single kernel. A persistent kernel doesn't necessarily imply fusion of separate computation tasks - there may be only 1 "task" that you are performing in a governing "consumer" loop.
But there is obviously considerable conceptual overlap between the two ideas.
Related
In my use case, the global GPU memory has many chunks of data. Preferably, the number of these could change, but assuming the number and sizes of these chunks of data to be constant is fine as well. Now, there are a set of functions that take as input some of the chunks of data and modify some of them. Some of these functions should only start processing if others completed already. In other words, these functions could be drawn in graph form with the functions being the nodes and edges being dependencies between them. The ordering of these tasks is quite weak though.
My question is now the following: What is (on a conceptual level) a good way to implement this in CUDA?
An idea that I had, which could serve as a starting point, is the following: A single kernel is launched. That single kernel creates a grid of blocks with the blocks corresponding to the functions mentioned above. Inter-block synchronization ensures that blocks only start processing data once their predecessors completed execution.
I looked up how this could be implemented, but I failed to figure out how inter-block synchronization can be done (if this is possible at all).
I would create for any solution an array in memory 500 node blocks * 10,000 floats (= 20 MB) with each 10,000 floats being stored as one continuous block. (The number of floats be better divisible by 32 => e.g. 10,016 floats for memory alignment reasons).
Solution 1: Runtime Compilation (sequential, but optimized)
Use Python code to generate a sequential order of functions according to the graph and create (printing out the source code into a string) a small program which calls the functions in turn. Each function should read the input from its predecessor blocks in memory and store the output in its own output block. Python should output the glue code (as string) which calls all functions in the correct order.
Use NVRTC (https://docs.nvidia.com/cuda/nvrtc/index.html, https://github.com/NVIDIA/pynvrtc) for runtime compilation and the compiler will optimize a lot.
A further optimization would be to not store the intermediate results in memory, but in local variables. They will be enough for all your specified cases (Maximum of 255 registers per thread). But of course makes the program (a small bit) more complicated. The variables can be freely named. And you can have 500 variables. The compiler will optimize the assignment to registers and reusing registers. So have one variable for each node output. E.g. float node352 = f_352(node45, node182, node416);
Solution 2: Controlled run on device (sequential)
The python program creates a list with the order, in which the functions have to be called. The individual functions know, from what memory blocks to read and in what block to write (either hard-coded, or you have to submit it to them in a memory structure).
On the device kernel a for loop is run, where the order list is went through sequentially and the kernel from the list is called.
How to specify, which functions to call?
The function pointers in the list can be created on the CPU like the following code: https://leimao.github.io/blog/Pass-Function-Pointers-to-Kernels-CUDA/ (not sure, if it works in Python).
Or regardless of host programming language a separate kernel can create a translation table: device function pointers (assign_kernel). Then the list from Python would contain indices into this table.
Solution 3: Dynamic Parallelism (parallel)
With Dynamic Parallelism kernels themselves start other kernels (grids).
https://developer.nvidia.com/blog/cuda-dynamic-parallelism-api-principles/
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-dynamic-parallelism
There is a maximum depth of 24.
The state of the parent grid could be swapped to memory (which could take a maximum of 860 MB per level, probably not for your program). But this could be a limitation.
All this swapping could make the parallel version slower again.
But the advantage would be that nodes can really be run in parallel.
Solution 4: Use Cuda Streams and Events (parallel)
Each kernel just calls one function. The synchronization and scheduling is done from Python. But the kernels run asynchronously and call a callback as soon as they are finished. Each kernel running in parallel has to be run on a separate stream.
Optimization: You can use the CUDA graph API, with which CUDA learns the order of the kernels and can do additional optimizations, when replaying (with possibly other float input data, but the same graph).
For all methods
You can try different launch configurations from 32 or better 64 threads per block up to 1024 threads per block.
Let's assume that most, or all, of your chunks of data are large; and that you have many distinct functions. If the former does not hold it's not clear you will even benefit from having them on a GPU in the first place. Let's also assume that the functions are black boxes to you, and you don't have the ability to identify fine-graines dependencies between individual values in your different buffers, with simple, local dependency functions.
Given these assumptions - your workload is basically the typical case of GPU work, which CUDA (and OpenCL) have catered for since their inception.
Traditional plain-vanilla approach
You define multiple streams (queues) of tasks; you schedule kernels on these streams for your various functions; and schedule event-fires and event-waits corresponding to your function's inter-dependency (or the buffer processing dependency). The event-waits before kernel launches ensure no buffer is processed until all preconditions have been satisfied. Then you have different CPU threads wait/synchronize with these streams, to get your work going.
Now, as far as the CUDA APIs go - this is bread-and-butter stuff. If you've read the CUDA Programming Guide, or at least the basic sections of it, you know how to do this. You could avail yourself of convenience libraries, like my API wrapper library, or if your workload fits, a higher-level offering such as NVIDIA Thrust might be more appropriate.
The multi-threaded synchronization is a bit less trivial, but this still isn't rocket-science. What is tricky and delicate is choosing how many streams to use and what work to schedule on what stream.
Using CUDA task graphs
With CUDA 10.x, NVIDIA add API functions for explicitly creating task graphs, with kernels and memory copies as nodes and edges for dependencies; and when you've completed the graph-construction API calls, you "schedule the task graph", so to speak, on any stream, and the CUDA runtime essentially takes care of what I've described above, automagically.
For an elaboration on how to do this, please read:
Getting Started with CUDA Graphs
on the NVIDIA developer blog. Or, for a deeper treatment - there's actually a section about them in the programming guide, and a small sample app using them, simpleCudaGraphs .
White-box functions
If you actually do know a lot about your functions, then perhaps you can create larger GPU kernels which perform some dependent processing, by keeping parts of intermediate results in registers or in block shared memory, and continuing to the part of a subsequent function applied to such local results. For example, if your first kernels does c[i] = a[i] + b[i] and your second kernel does e[i] = d[i] * e[i], you could instead write a kernel which performs the second action after the first, with inputs a,b,d (no need for c). Unfortunately I can't be less vague here, since your question was somewhat vague.
I am writing a simplistic raytracer. The idea is that for every pixel there is a thread that traverses a certain structure (geometry) that resides in global memory.
I invoke my kernel like so:
trace<<<gridDim, blockDim>>>(width, height, frameBuffer, scene)
Where scene is a structure that was previously allocated with cudaMalloc. Every thread has to start traversing this structure starting from the same node, and chances are that many concurrent threads will attempt to read the same nodes many times. Does that mean that when such reads take place, it cripples the degree of parallelism?
Given that geometry is large, I would assume that replicating it is not an option. I mean the whole processing still happens fairly fast, but I was wondering whether it is something that has to be dealt with, or simply left flung to the breeze.
First of all I think you got the wrong idea when you say concurrent reads may or may not cripple the degree of parallelism. Because that is what it means to be parallel. Each thread is reading concurrently. Instead you should be thinking if it affects the performance due to more memory accesses when each thread basically wants the same thing i.e. the same node.
Well according to the article here, Memory accesses can be coalesced if data locality is present and within warps only.
Which means if threads within a warp are trying to access memory locations near each other they can be coalesced. In your case each thread is trying to access the "same" node until it meets an endpoint where they branch.
This means the memory accesses will be coalesced within the warp till the threads branch off.
Efficient access to global memory from each thread depends on both your device architecture and your code. Arrays allocated on global memory are aligned to 256-byte memory segments by the CUDA driver. The device can access global memory via 32-, 64-, or 128-byte transactions that are aligned to their size. The device coalesces global memory loads and stores issued by threads of a warp into as few transactions as possible to minimize DRAM bandwidth. A misaligned data access for devices with a compute capability of less than 2.0 affects the effective bandwidth of accessing data. This is not a serious issue when working with a device that has a compute capability of > 2.0. That being said, pretty much regardless of your device generation, when accessing global memory with large strides, the effective bandwidth becomes poor (Reference). I would assume that for random access the the same behavior is likely.
Unless you are not changing the structure while reading, which I assume you do (if it's a scene you probably render each frame?) then yes, it cripples performance and may cause undefined behaviour. This is called a race condition. You can use atomic operations to overcome this type of problem. Using atomic operations guarantees that the race conditions don't happen.
You can try, stuffing the 'scene' to the shared memory if you can fit it.
You can also try using streams to increase concurrency which also brings some sort of synchronization to the kernels that are run in the same stream.
How do you keep data in fast GPU memory across kernel invocations?
Let's suppose, I need to answer 1 million queries, each of which has about 1.5MB of data that's reusable across invocations and about 8KB of data that's unique to each query.
One approach is to launch a kernel for each query, copying the 1.5MB + 8KB of data to shared memory each time. However, then I spend a lot of time just copying 1.5MB of data that really could persist across queries.
Another approach is to "recycle" the GPU threads (see https://stackoverflow.com/a/49957384/3738356). That involves launching one kernel that immediately copies the 1.5MB of data to shared memory. And then the kernel waits for requests to come in, waiting for the 8KB of data to show up before proceeding with each iteration. It really seems like CUDA wasn't meant to be used this way. If one just uses managed memory, and volatile+monotonically increasing counters to synchronize, there's still no guarantee that the data necessary to compute the answer will be on the GPU when you go to read it. You can seed the values in the memory with dummy values like -42 that indicate that the value hasn't yet made its way to the GPU (via the caching/managed memory mechanisms), and then busy wait until the values become valid. Theoretically, that should work. However, I had enough memory errors that I've given up on it for now, and I've pursued....
Another approach still uses recycled threads but instead synchronizes data via cudaMemcpyAsync, cuda streams, cuda events, and still a couple of volatile+monotonically increasing counters. I hear I need to pin the 8KB of data that's fresh with each query in order for the cudaMemcpyAsync to work correctly. But, the async copy isn't blocked -- its effects just aren't observable. I suspect with enough grit, I can make this work too.
However, all of the above makes me think "I'm doing it wrong." How do you keep extremely re-usable data in the GPU caches so it can be accessed from one query to the next?
First of all to observe the effects of the streams and Async copying
you definitely need to pin the host memory. Then you can observe
concurrent kernel invocations "almost" happening at the same time.
I'd rather used Async copying since it makes me feel in control of
the situation.
Secondly you could just hold on to the data in global memory and load
it in the shared memory whenever you need it. To my knowledge shared
memory is only known to the kernel itself and disposed after
termination. Try using Async copies while the kernel is running and
synchronize the streams accordingly. Don't forget to __syncthreads()
after loading to the shared memory. I hope it helps.
I want to copy a bunch of data to device from host either at once or as a series of chunks in turn, it'll affect my algorithm. My question is that which one has more overhead? I think, invoking more than one copy operations has more overhead than the other one does; however i just want to know this concept in more details(thinking generally doesn't square with the reality:)). If you may show a reference to sort of a document, it would be really appreciated!
Kind Regards,
Ilker
If the copies are synchronous (the default), they do incur more overhead because each copy function waits until the GPU is idle before returning. If the copies are asynchronous, the overhead is a few microseconds, which will only be visible for small memcpy's.
Note that you can specify an asynchronous memcpy with the NULL stream, and you will still get the benefits of CPU/GPU concurrency. (i.e. the CPU can kick off the next memcpy while the GPU is processing the previous ones.)
The CUDA Handbook (which, in the interests of full disclosure, I should say that I wrote) covers this issue in Chapter 6, and the source code includes an app that measures the size of a memcpy that hides the overhead of invoking a memcpy. Check out breakevenMemcpyHtoD.cu in the repository:
https://github.com/ArchaeaSoftware/cudahandbook/tree/master/concurrency
It seems apparent that each core of the GPU could allow for handling of a request, rather than one main processor (the system's CPU) handling all requests. On the surface, it seems like it is possible, perhaps with Templates in GPU + Redis database in GPU GDDR5?
Is it possible and worthwhile?
How would the GPU access disks, databases, etc.?
Requests are usually short sharp processing snippets. You'd have to load each request off main memory, into GPU memory, do a computation and fire it back again. There's an overhead when transferring data from main memory to GPU memory. Therefore, it's only worth doing a GPU computation if the calculation is long enough and the problem is ammenable to parallel processing on a GPU.
In essence, GPUs are good at stream processing. Not for lots of small requests.
The previous answer is valid. There is also another ceveat regarding GPU's, the instruction set is smaller and data is processed in matrices. I.e. same operation applied to every element in a set. So you will need to be very clever in designing what this replicated operations are.
I take it you are considering a GPU HTTP server.