I want to know which merge algorithm thrust::merge uses internally.
For example, mgpu::merge (Modern GPU Merge) is based on GPU Merge Path - A GPU Merging Algorithm.
Which algorithm does thrust::merge use and how does it differ from GPU Merge Path?
Sadly, Thrust does not explicitly describe the algorithm they actually use. However, you can find the CUDA implementation here and understand what the algorithm does. At the time of writing, they use a tiled-based partitioned merge.
The algorithm is divided in 3 steps:
partitioning: use a binary search in shared memory to find merge path for each of thread;
merging: execute an independent serial merge in each thread;
joining: write data back into the global memory.
The current Thrust implementation seems to be close to the one in your provided research paper.
Related
In my use case, the global GPU memory has many chunks of data. Preferably, the number of these could change, but assuming the number and sizes of these chunks of data to be constant is fine as well. Now, there are a set of functions that take as input some of the chunks of data and modify some of them. Some of these functions should only start processing if others completed already. In other words, these functions could be drawn in graph form with the functions being the nodes and edges being dependencies between them. The ordering of these tasks is quite weak though.
My question is now the following: What is (on a conceptual level) a good way to implement this in CUDA?
An idea that I had, which could serve as a starting point, is the following: A single kernel is launched. That single kernel creates a grid of blocks with the blocks corresponding to the functions mentioned above. Inter-block synchronization ensures that blocks only start processing data once their predecessors completed execution.
I looked up how this could be implemented, but I failed to figure out how inter-block synchronization can be done (if this is possible at all).
I would create for any solution an array in memory 500 node blocks * 10,000 floats (= 20 MB) with each 10,000 floats being stored as one continuous block. (The number of floats be better divisible by 32 => e.g. 10,016 floats for memory alignment reasons).
Solution 1: Runtime Compilation (sequential, but optimized)
Use Python code to generate a sequential order of functions according to the graph and create (printing out the source code into a string) a small program which calls the functions in turn. Each function should read the input from its predecessor blocks in memory and store the output in its own output block. Python should output the glue code (as string) which calls all functions in the correct order.
Use NVRTC (https://docs.nvidia.com/cuda/nvrtc/index.html, https://github.com/NVIDIA/pynvrtc) for runtime compilation and the compiler will optimize a lot.
A further optimization would be to not store the intermediate results in memory, but in local variables. They will be enough for all your specified cases (Maximum of 255 registers per thread). But of course makes the program (a small bit) more complicated. The variables can be freely named. And you can have 500 variables. The compiler will optimize the assignment to registers and reusing registers. So have one variable for each node output. E.g. float node352 = f_352(node45, node182, node416);
Solution 2: Controlled run on device (sequential)
The python program creates a list with the order, in which the functions have to be called. The individual functions know, from what memory blocks to read and in what block to write (either hard-coded, or you have to submit it to them in a memory structure).
On the device kernel a for loop is run, where the order list is went through sequentially and the kernel from the list is called.
How to specify, which functions to call?
The function pointers in the list can be created on the CPU like the following code: https://leimao.github.io/blog/Pass-Function-Pointers-to-Kernels-CUDA/ (not sure, if it works in Python).
Or regardless of host programming language a separate kernel can create a translation table: device function pointers (assign_kernel). Then the list from Python would contain indices into this table.
Solution 3: Dynamic Parallelism (parallel)
With Dynamic Parallelism kernels themselves start other kernels (grids).
https://developer.nvidia.com/blog/cuda-dynamic-parallelism-api-principles/
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-dynamic-parallelism
There is a maximum depth of 24.
The state of the parent grid could be swapped to memory (which could take a maximum of 860 MB per level, probably not for your program). But this could be a limitation.
All this swapping could make the parallel version slower again.
But the advantage would be that nodes can really be run in parallel.
Solution 4: Use Cuda Streams and Events (parallel)
Each kernel just calls one function. The synchronization and scheduling is done from Python. But the kernels run asynchronously and call a callback as soon as they are finished. Each kernel running in parallel has to be run on a separate stream.
Optimization: You can use the CUDA graph API, with which CUDA learns the order of the kernels and can do additional optimizations, when replaying (with possibly other float input data, but the same graph).
For all methods
You can try different launch configurations from 32 or better 64 threads per block up to 1024 threads per block.
Let's assume that most, or all, of your chunks of data are large; and that you have many distinct functions. If the former does not hold it's not clear you will even benefit from having them on a GPU in the first place. Let's also assume that the functions are black boxes to you, and you don't have the ability to identify fine-graines dependencies between individual values in your different buffers, with simple, local dependency functions.
Given these assumptions - your workload is basically the typical case of GPU work, which CUDA (and OpenCL) have catered for since their inception.
Traditional plain-vanilla approach
You define multiple streams (queues) of tasks; you schedule kernels on these streams for your various functions; and schedule event-fires and event-waits corresponding to your function's inter-dependency (or the buffer processing dependency). The event-waits before kernel launches ensure no buffer is processed until all preconditions have been satisfied. Then you have different CPU threads wait/synchronize with these streams, to get your work going.
Now, as far as the CUDA APIs go - this is bread-and-butter stuff. If you've read the CUDA Programming Guide, or at least the basic sections of it, you know how to do this. You could avail yourself of convenience libraries, like my API wrapper library, or if your workload fits, a higher-level offering such as NVIDIA Thrust might be more appropriate.
The multi-threaded synchronization is a bit less trivial, but this still isn't rocket-science. What is tricky and delicate is choosing how many streams to use and what work to schedule on what stream.
Using CUDA task graphs
With CUDA 10.x, NVIDIA add API functions for explicitly creating task graphs, with kernels and memory copies as nodes and edges for dependencies; and when you've completed the graph-construction API calls, you "schedule the task graph", so to speak, on any stream, and the CUDA runtime essentially takes care of what I've described above, automagically.
For an elaboration on how to do this, please read:
Getting Started with CUDA Graphs
on the NVIDIA developer blog. Or, for a deeper treatment - there's actually a section about them in the programming guide, and a small sample app using them, simpleCudaGraphs .
White-box functions
If you actually do know a lot about your functions, then perhaps you can create larger GPU kernels which perform some dependent processing, by keeping parts of intermediate results in registers or in block shared memory, and continuing to the part of a subsequent function applied to such local results. For example, if your first kernels does c[i] = a[i] + b[i] and your second kernel does e[i] = d[i] * e[i], you could instead write a kernel which performs the second action after the first, with inputs a,b,d (no need for c). Unfortunately I can't be less vague here, since your question was somewhat vague.
I'm working with some large data using the cublas library for matrix multiplication. To save memory space, I want something like A=A*B where A and B are both n-by-n square matrices, i.e. using the same memory space for the output and one of the input matrices.
While some old posts say this is not allowed in the cublas library, I actually implemented it using the cublasZgemmStridedBatched() function. Surprisingly the calculation is totally correct, and is stable with repeated run. So I'm wondering if the overlapped input and output is supported by the current cublas library. If yes, how much memory does it actually save? I mean intuitively the function at least needs some extra memory to store intermediate calculations, since Aij = AikBkj is dependent on a whole row of A. Is this particularly memory saving for batched gemms?
While some old posts say this is not allowed in the cublas library,
And they are completely correct (noting that the "old posts" were referring to the standard GEMM calls, not the batched implementations you are asking about).
I actually implemented it using the cublasZgemmStridedBatched() function. Surprisingly the calculation is totally correct, and is stable with repeated run
This isn't documented as being safe and I suspect you are probably only getting stable results by luck, given that small matrices are probably preloaded into shared memory or registers and so an in-place operation works. If you went to larger matrices, I guess you would see failures, because eventually there would be a case where a single GEMM could not be performed without multiple trips to the source matrix after a write cycle, which would corrupt the source matrix.
I would not recommend in-place operations even if you find it works for one case. Different problem sizes, library versions, and hardware could produce failures which you simply haven't tested. The choice and associated risk is up to you.
I have basically the same question as posed in this discussion. In particular I want to refer to this final response:
I think there are two different questions mixed together in this
thread:
Is there a performance benefit to using a 2D or 3D mapping of input or output data to threads? The answer is "absolutely" for all the
reasons you and others have described. If the data or calculation has
spatial locality, then so should the assignment of work to threads in
a warp.
Is there a performance benefit to using CUDA's multidimensional grids to do this work assignment? In this case, I don't think so since
you can do the index calculation trivially yourself at the top of the
kernel. This burns a few arithmetic instructions, but that should be
negligible compared to the kernel launch overhead.
This is why I think the multidimensional grids are intended as a
programmer convenience rather than a way to improve performance. You
do absolutely need to think about each warp's memory access patterns,
though.
I want to know if this situation still holds today. I want to know the reason why there is a need for a multidimensional "outer" grid.
What I'm trying to understand is whether or not there is a significant purpose to this (e.g. an actual benefit from spatial locality) or is it there for convenience (e.g. in an image processing context, is it there only so that we can have CUDA be aware of the x/y "patch" that a particular block is processing so it can report it to the CUDA Visual Profiler or something)?
A third option is that this nothing more than a holdover from earlier versions of CUDA where it was a workaround for hardware indexing limits.
There is definitely a benefit in the use of multi-dimensional grid. The different entries (tid, ctaid) are read-only variables visible as special registers. See PTX ISA
PTX includes a number of predefined, read-only variables, which are visible as special registers and accessed through mov or cvt instructions.
The special registers are:
%tid
%ntid
%laneid
%warpid
%nwarpid
%ctaid
%nctaid
If some of this data may be used without further processing, not-only you may gain arithmetic instructions - potentially at each indexing step of multi-dimension data, but more importantly you are saving registers which is a very scarce resource on any hardware.
On my application I need to transform each line of an image, apply a filter and transform it back.
I want to be able to make multiple FFT at the same time using the GPU. More precisely, I'm using NVIDIA's CUDA. Now, some considerations:
CUDA's FFT library, CUFFT is only able to make calls from the host ( https://devtalk.nvidia.com/default/topic/523177/cufft-device-callable-library/).
On this topic (running FFTW on GPU vs using CUFFT), Robert Corvella says
"cufft routines can be called by multiple host threads".
I believed that doing all this FFTs in parallel would increase performance, but Robert comments
"the FFT operations are of reasonably large size, then just calling the cufft library routines as indicated should give you good speedup and approximately fully utilize the machine"
So,
Is this it? Is there no gain in performing more than one FFT at a time?
Is there any library that supports calls from the device?
Shoud I just use cufftPlanMany() instead (as refered in "is-there-a-method-of-fft-that-will-run-inside-cuda-kernel" by hang or as referred in the previous topic, by Robert)?
Or the best option is to call mutiple host threads?
(this 2 links limit is killing me...)
My objective is to get some discussion on what's the best solution to this problem, since many have faced similar situations.
This might be obsolete once NVIDIA implements device calls on CUFFT.
(something they said they are working on but there is no expected date for the release - something said on the discussion at the NVIDIA forum (first link))
So, Is this it? Is there no gain in performing more than one FFT at a time?
If the individual FFT's are large enough to fully utilize the device, there is no gain in performing more than one FFT at a time. You can still use standard methods like overlap of copy and compute to get the most performance out of the machine.
If the FFT's are small then the batched plan is a good way to get the most performance. If you go this route, I recommend using CUDA 5.5, as there have been some API improvements.
Is there any library that supports calls from the device?
cuFFT library cannot be used by making calls from device code.
There are other CUDA libraries, of course, such as ArrayFire, which may have options I'm not familiar with.
Shoud I just use cufftPlanMany() instead (as refered in "is-there-a-method-of-fft-that-will-run-inside-cuda-kernel" by hang or as referred in the previous topic, by Robert)?
Or the best option is to call mutiple host threads?
Batched plan is preferred over multiple host threads - the API can do a better job of resource management that way, and you will have more API-level visibility (such as through the resource estimation functions in CUDA 5.5) as to what is possible.
I have some challenges with my Master's thesis I hope you can help me with or maybe point me in the correct direction.
I'm implementing Progressive Photon Mapping using the new approach by Knaus and Zwicker (http://www.cs.jhu.edu/~misha/ReadingSeminar/Papers/Knaus11.pdf) using OptiX. This approach makes each iteration/frame of PPM independent and more suitable for multi-GPU.
What i do (with a single GPU) is trace a number of photons using OptiX and then store them in a buffer. Then, the photons are then sorted into a spatial hash map using CUDA and thrust, never leaving the GPU. I want to do the spatial hash map creation on GPU since it is the bottleneck of my renderer. Finally, this buffer is used during indirect radiance estimation. So this is a several pass algorithm, consisting of ray-tracing, photon-tracing, photon map generation and finally create image.
I understand that OptiX can support multiple GPU. Each context launch is divided up across the GPUs. Any writes to buffers seems to be serialized and broadcasted to each device so that their buffer contents are the same.
What i would like to do is let one GPU do one frame, while second GPU does the next frame. I can then combine the results, for instance on the CPU or on one of the GPU's in a combine pass. It is also acceptable if i can do each pass in parallel on each device (synchronize between each pass). Is this somehow possible?
For instance, could I create two OptiX contexts mapping to each device on two different host threads. This would allow me to do the CUDA/thrust spatial hash map generation as before, assuming the photons are on one device, and merge the two generated images at the end of the pipeline. However, the programming guide states it does not support multi-thread context handling. I could use multiple processes but then there is a lot of mess with inter-process communication. This approach also requires duplicate work with creating the scene geometry, compiling PTX files and so on.
Thanks!
OptiX already splits the workload accordingly to your GPUs power so your approach will likely not be faster than having OptiX dispose of all the GPUs.
If you want to force your data to remain on the device (notice that in such a situation writes from different devices will not be coherent) you can use the RT_BUFFER_GPU_LOCAL flag as indicated into the programming guide
https://developer.nvidia.com/optix-documentation