I am wondering if it is possible in CUDA or OptiX to accelerate the computation of the minimum and maximum value along a line/ray cast from one point to another in a 3D volume.
If not, is there any special hardware on NVIDIA GPUs that can accelerate this function (particularly on Volta GPUs or Tesla K80s)?
The short answer to the title question is: yes, hardware-accelerated ray casting is available in CUDA & OptiX. The longer answer is that the question has multiple interpretations, so I'll try to outline the different possibilities.
The different axes of your question that I'm seeing are: CUDA vs OptiX, pre-RTX GPUs vs RTX GPUs (e.g., Volta vs Ampere), min ray queries vs max ray queries, and possibly surface representations vs volume representations.
pre-RTX vs RTX GPUs:
To perhaps state the obvious, a K80 or a GV100 GPU can be used to accelerate ray casting compared to a CPU, due to the highly parallel nature of the GPU. However, these pre-RTX GPUs don't have any hardware that is specifically dedicated to ray casting. There are bits of somewhat special-purpose hardware, not dedicated to ray casting, that you could probably leverage in various ways, but it's up to you to identify and design those kinds of hardware-acceleration hacks.
The RTX GPUs starting with the Turing architecture do have specialized hardware dedicated to ray casting, so they accelerate ray queries even further than the acceleration you get from using just any GPU to parallelize the ray queries.
CUDA vs OptiX:
CUDA can be used for parallel ray tracing on any GPU, but it does not currently (as I write this) provide access to the specialized RTX hardware for ray tracing. When using CUDA, you are responsible for writing all the code to build an acceleration structure (e.g., a BVH) and traverse rays through it, and you would also need to write the intersection and shading or hit-processing programs.
OptiX, DirectX, and Vulkan all allow you to access the specialized ray-tracing hardware in RTX GPUs. By using these APIs, you can achieve higher speeds with lower power requirements, and they also require much less effort because intersection and ray traversal through an acceleration structure are provided for you. These APIs also provide other commonly needed features for production-level ray casting, such as instancing, transforms, and motion blur, as well as a single-threaded programming model for processing ray hits & misses.
Min vs Max ray queries:
OptiX has built-in functionality to return the surface intersection closest to the ray origin, i.e., a 'min query'. OptiX does not provide a similar single query for the furthest intersection (which is what I assume you mean by "max"). To find the maximum-distance hit, or the closest hit to a second point on your ray, you would need to traverse through multiple hits and keep track of the one you want.
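For example, here is a rough sketch of what tracking the farthest hit could look like as an OptiX 7 any-hit program (the program name is made up, and the pipeline setup, raygen program, and shader binding table are omitted). Payload register 0 is assumed to be initialized to __float_as_uint(0.0f) by the raygen program before optixTrace is called:

```cuda
#include <optix.h>

// Sketch: visit every intersection along the ray, remember the largest t seen so far
// in payload register 0, and reject each hit so traversal keeps going.
extern "C" __global__ void __anyhit__track_farthest()
{
    const float tHit      = optixGetRayTmax();                   // t of the hit currently being examined
    const float tMaxSoFar = __uint_as_float(optixGetPayload_0());
    if (tHit > tMaxSoFar)
        optixSetPayload_0(__float_as_uint(tHit));                // remember the farthest hit so far
    optixIgnoreIntersection();                                   // don't accept the hit, so traversal continues
}
```

Note that this only works if any-hit invocation isn't disabled on the geometry, and any-hit programs may be invoked in an arbitrary order along the ray, which is fine here since we only keep a running maximum.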
In CUDA you're on your own for detecting both min and max queries, so you can do whatever you want as long as you can write all the code.
Surfaces vs Volumes:
Your question mentioned a "3D volume", which has multiple meanings, so just to clarify things:
OptiX (+ DirectX + Vulkan) are APIs for ray tracing of surfaces, for example triangle meshes. The RTX specialty hardware is dedicated to accelerating ray tracing of surface-based representations.
If your "3D volume" is referring to a volumetric representation such as voxel data or a tetrahedral mesh, then surface-based ray tracing might not be the fastest or most appropriate way to cast ray queries. In this case, you might want to use "ray marching" techniques in CUDA, or look at volumetric ray casting APIs for GPUs like NanoVDB.
Related
I'm trying to figure out if I can use OpenACC in place of normal CPU serial execution calls. Usually my programming is all about 3D graphics, or uses the GPU in some conventional way, i.e., image processing or some other type of rendering that requires the use of shaders. I'm trying to figure out whether this library would benefit me or not.
The reason I ask is that if I'm rendering 3D graphics (as fast as possible), would it slow down that process in any way? Or is it able to maintain its (in theory) high frame rates or not?
If so, what's the trade-off, and how much? I'm not willing to lose 3D graphics (display) performance to enhance operations that could be done serially on the CPU.
Edit:
This is a C++ context.
On the AMD and NVIDIA GPUs that I am familiar with, OpenACC programs will make use of compute resources that would also be used to some degree by shader programs. There are many other pieces of graphics hardware in a GPU that are not shared between compute and graphics, but there are some shared resources. Likewise, the GPU may be connected to the system by PCIe, so this can also present a shared resource or contention point (however, it's the rare compute or graphics program that would come close to using up the bandwidth of a modern Gen3 x16 PCIe connection).
So if you were using both graphics (or compute) shaders as well as OpenACC acceleration, there would be contention for resources, to some degree. The level of contention, or the trade-off, is not something I can generalize about. It will depend very much on the specifics of your program, and on the extent and detailed sequencing of the compute and graphics work.
GPU designers have these types of use-cases in mind, and so GPUs are generally pretty good at rapid context switching between the various tasks that may compete for resources.
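For reference, replacing a serial CPU loop with OpenACC in a C++ context looks roughly like the sketch below (function and variable names are made up; the data clauses and compiler flags, e.g. nvc++ -acc, depend on your toolchain). This is the kind of compute work that would share GPU resources with your rendering:

```cpp
#include <vector>

// Offload a simple element-wise operation that would otherwise run serially on the CPU.
void scaleValues(std::vector<float>& data, float factor)
{
    float* p = data.data();
    const long n = static_cast<long>(data.size());

    // Copy the array to the device, run the loop in parallel, copy it back.
    #pragma acc parallel loop copy(p[0:n])
    for (long i = 0; i < n; ++i)
        p[i] *= factor;
}
```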
It looks like my application is becoming (i)FFT-bound; it does a lot of 2D correlations for rectangles with average sizes of about 500x200 (width and height are always even). The scenario is the usual one: do two FFTs (one per field), multiply the complex fields, then do one inverse FFT.
So, on the CPU (Intel Q6600, with the JTransforms library) the FFT transforms eat about 70% of the time according to the profiler; on the GPU (GTX 670, cuFFT library) it's about 50% (so there is some performance increase with CUDA, but not what I want). I realize it may be the case that the GPU isn't fully saturated (bandwidth limited), but on the other hand, doing the calculations in batches would significantly increase application complexity.
Questions:
What can I do further to decrease the time spent on FFTs by at least several times?
Should I try the FFTW library (at this moment I am not sure it would give a significant gain compared to JTransforms)?
Is there any specialized hardware that can be plugged into a PC for FFT computations?
I'm answering your first question: what can you do further to decrease the time spent on the FFTs?
Quoting the cuFFT Library User's Guide:
Restrict the size along all dimensions to be representable as 2^a*3^b*5^c*7^d. The CUFFT library has highly optimized kernels for transforms whose dimensions have these prime factors.
Restrict the size along each dimension to use fewer distinct prime factors. For example, a transform of size 3^n will usually be faster than one of size 2^i*3^j even if the latter is slightly smaller.
Restrict the power-of-two factorization term of the x dimension to be a multiple of either 256 for single-precision transforms or 64 for double-precision transforms. This further aids with memory coalescing.
Restrict the x dimension of single-precision transforms to be strictly a power of two either between 2 and 8192 for Fermi-class, Kepler-class, and more recent GPUs or between 2 and 2048 for earlier architectures. These transforms are implemented as specialized hand-coded kernels that keep all intermediate results in shared memory.
Use native compatibility mode for in-place complex-to-real or real-to-complex transforms. This scheme reduces the write/read of padding bytes hence helping with coalescing of the data.
Starting with version 3.1 of the CUFFT Library, the conjugate symmetry property of real-to-complex output data arrays and complex-to-real input data arrays is exploited when the power-of-two factorization term of the x dimension is at least a multiple of 4. Large 1D sizes (powers-of-two larger than 65,536), 2D, and 3D transforms benefit the most from the performance optimizations in the implementation of real-to-complex or complex-to-real transforms.
Other things you can do are (Quoting Robert Crovella's answer to running FFTW on GPU vs using CUFFT):
cuFFT routines can be called by multiple host threads, so it is possible to make multiple calls into cufft for multiple independent transforms. It's unlikely you would see much speedup from this if the individual transforms are large enough to utilize the machine.
cufft also supports batched plans which is another way to execute multiple transforms "at once".
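As a rough sketch of the batched approach just mentioned (sizes and names are illustrative, and error checking is omitted), a single cufftPlanMany plan can transform many padded fields in one call, e.g. your ~500x200 rectangles padded to 512x224, which keeps the prime factors to 2 and 7:

```cuda
#include <cufft.h>
#include <cuda_runtime.h>

int main()
{
    const int nx = 512, ny = 224;        // padded field size (prime factors 2 and 7 only)
    const int batch = 32;                // number of independent fields transformed per call
    int n[2] = { ny, nx };               // cuFFT wants { slowest dimension, fastest dimension }

    cufftComplex* data = nullptr;
    cudaMalloc(&data, sizeof(cufftComplex) * nx * ny * batch);

    cufftHandle plan;
    cufftPlanMany(&plan, 2, n,
                  nullptr, 1, nx * ny,   // input: contiguous, one field after another
                  nullptr, 1, nx * ny,   // output: same layout
                  CUFFT_C2C, batch);

    // All 'batch' forward transforms are executed with a single call.
    cufftExecC2C(plan, data, data, CUFFT_FORWARD);
    cudaDeviceSynchronize();

    cufftDestroy(plan);
    cudaFree(data);
    return 0;
}
```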
Please note that:
cuFFT may not be convenient compared to an optimized sequential or multicore FFT if the dimensions of the transform are not large enough;
You can get a rough idea of the performance of cuFFT compared to Intel MKL from the CUDA Toolkit 4.0 Performance Report.
I'm trying to solve the Poisson equation in real space on a multi-GPU architecture using a code in C/CUDA with the MPI library. For the moment, I'm only interested in solving the problem in a periodic box. But in the future, I may want to look at spherical geometry.
Is there an existing routine to solve this problem?
Comments dated from August 2012 seem to indicate that the Thrust library is not adapted to multi-GPU architectures. Is that still correct?
If the routine exists, what method does it use (Jacobi, SOR, Gauss-Seidel, Krylov)?
Please express your opinion about its speed and the problems you may have encountered.
Thanks for your time.
Solving the Poisson equation by a Multi-GPU approach, with GPUs located on different cluster nodes interacting by using the MPI protocol, is a relatively recent research topic. The basic idea is to use domain decomposition, so that each GPU solves for one part of the computational domain, and MPI is used to exchange boundary data.
You may wish to have a look at the papers Towards a multi-GPU solver for the three-dimensional two-phase incompressible Navier-Stokes equations, presented at GTC 2012, and An MPI-CUDA Implementation for Massively Parallel Incompressible Flow Computations on Multi-GPU Clusters. Particularly in the first approach, the Navier-Stokes equations are solved by Chorin's projection approach, which in turn requires the solution of a Poisson equation; this is the most demanding task and is solved by a multi-GPU/MPI strategy exploiting a Jacobi-preconditioned conjugate gradient solver.
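To give an idea of what the per-GPU piece of such a domain-decomposition scheme looks like, below is a sketch of a single plain Jacobi sweep for the 3D Poisson equation on one GPU's subdomain. It is not the preconditioned conjugate gradient solver from the paper, just the simplest relaxation kernel; in a multi-GPU/MPI setting, the outermost layers of u would be halo cells exchanged with neighboring ranks before each sweep (the exchange is omitted here):

```cuda
#include <cuda_runtime.h>

#define IDX(x, y, z) (((z) * ny + (y)) * nx + (x))

// One plain Jacobi sweep on an nx x ny x nz subdomain with grid spacing h:
// each interior point becomes the average of its six neighbors minus h^2 * f / 6.
__global__ void jacobiSweep(const float* u, float* uNew, const float* f,
                            int nx, int ny, int nz, float h)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.z * blockDim.z + threadIdx.z;

    // Only interior points are updated; the outermost layer holds boundary or halo values.
    if (i < 1 || j < 1 || k < 1 || i >= nx - 1 || j >= ny - 1 || k >= nz - 1)
        return;

    float sum = u[IDX(i - 1, j, k)] + u[IDX(i + 1, j, k)]
              + u[IDX(i, j - 1, k)] + u[IDX(i, j + 1, k)]
              + u[IDX(i, j, k - 1)] + u[IDX(i, j, k + 1)];

    uNew[IDX(i, j, k)] = (sum - h * h * f[IDX(i, j, k)]) / 6.0f;
}
```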
Concerning available routines, in the past I have bumped into GAMER, a downloadable software for astrophysics applications. The authors claim that the code contains a variety of GPU-accelerated Poisson solvers and hybrid OpenMP/MPI/GPU parallelization. However, I have never had the chance to download it.
Thrust can be used in a multi-GPU environment. You can use the runtime API, i.e., cudaSetDevice, to switch devices. Since Thrust handles allocations and deallocations for vectors implicitly, care must be taken to make sure that the correct device is selected when device vectors are declared and when they are deallocated, i.e., when they go out of scope.
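A minimal sketch of that pattern (device ids and sizes are just for illustration): select the device before constructing the device vectors, and make sure they go out of scope while that same device is still current, so the implicit deallocation happens on the right device.

```cuda
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <cuda_runtime.h>

void runOnDevice(int dev, int n)
{
    cudaSetDevice(dev);                          // all Thrust allocations below land on 'dev'
    thrust::device_vector<float> x(n, 1.0f);
    thrust::device_vector<float> y(n, 2.0f);
    thrust::transform(x.begin(), x.end(), y.begin(), y.begin(),
                      thrust::plus<float>());    // y = x + y, executed on the selected device
}   // x and y are deallocated here, while 'dev' is still the current device

int main()
{
    runOnDevice(0, 1 << 20);                     // work for GPU 0
    runOnDevice(1, 1 << 20);                     // work for GPU 1
    return 0;
}
```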
I am investigating if GPGPUs could be used for accelerating simulation of hardware.
My reasoning is this: As hardware by nature is very parallel, why simulate on highly sequential CPUs?
GPUs would be excellent for this, if not for their restrictive style of programming: You have a single kernel running, etc.
I have little experience with GPGPU-programming, but is it possible to use events or queues in OpenCL / CUDA?
Edit: By hardware simulation I don't mean emulation, but bit-accurate behavioral simulation (as in VHDL behavioral simulation).
I am not aware of any approaches regarding VHDL simulation on GPUs (or a general scheme to map discrete-event simulations), but there are certain application areas where discrete-event simulation is typically applied and which can be simulated efficiently on GPUs (e.g. transportation networks, as in this paper or this one, or stochastic simulation of chemical systems, as done in this paper).
Is it possible to re-formulate the problem in a way that makes a discrete time-stepped simulator feasible? In this case, simulation on a GPU should be much simpler (and still faster, even if it seems wasteful because the time steps have to be sufficiently small - see this paper on the GPU-based simulation of cellular automata, for example).
Note, however, that this is still most likely a non-trivial (research) problem, and the reason why there is no general scheme (yet) is what you already assumed: implementing an event queue on a GPU is difficult, and most simulation approaches on GPUs gain speed-up due to clever memory layout and application-specific optimizations and problem modifications.
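To illustrate the time-stepped (cycle-based) reformulation, here is a deliberately simplified CUDA sketch in which each thread evaluates one two-input gate per simulated step, reading the previous step's signal values and writing into a second buffer, so no event queue is needed at all. The netlist encoding is purely hypothetical, and real behavioral simulation would need far richer semantics (delays, processes, signal resolution):

```cuda
#include <cuda_runtime.h>

enum GateType { GATE_AND, GATE_OR, GATE_XOR, GATE_NOT };

struct Gate {
    GateType type;
    int in0, in1;    // indices into the signal arrays (in1 unused for GATE_NOT)
    int out;         // index of the signal driven by this gate
};

// Evaluate every gate once for one simulated time step, double-buffered:
// read from 'prev', write to 'next'; the host swaps the buffers between steps.
__global__ void evalStep(const Gate* gates, int numGates,
                         const unsigned char* prev, unsigned char* next)
{
    int g = blockIdx.x * blockDim.x + threadIdx.x;
    if (g >= numGates) return;

    const Gate gate = gates[g];
    unsigned char a = prev[gate.in0];
    unsigned char r;
    switch (gate.type) {
        case GATE_AND: r = a & prev[gate.in1]; break;
        case GATE_OR:  r = a | prev[gate.in1]; break;
        case GATE_XOR: r = a ^ prev[gate.in1]; break;
        default:       r = (unsigned char)(!a); break;   // GATE_NOT uses only in0
    }
    next[gate.out] = r;
}
```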
This is outside my area of expertise, but it seems that while the following paper discusses gate-level simulation rather than behavioral simulation, it may contain some useful ideas:
Debapriya Chatterjee, Andrew Deorio, Valeria Bertacco.
Gate-Level Simulation with GPU Computing
http://web.eecs.umich.edu/~valeria/research/publications/TODAES0611.pdf
CUDA vs DirectX 10 for parallel mathematics. Any thoughts you have about it?
CUDA is probably a better option if you know your target architecture is using NVIDIA chips. You have complete control over your data transfers, instruction paths, and order of operations. You can also get by with a lot fewer __syncthreads calls when you're working at the lower level.
DirectX 10 will be easier to interface against, I should think, but if you really want to push your speed optimization, you have to bypass the extra layer. DirectX 10 will also not know when to use texture memory versus constant memory versus shared memory as well as you will depending on your particular algorithm.
If you have access to a Tesla C1060 or something like that, CUDA is by far the better choice hands down. You can really speed things up if you know the specifics of your GPGPU - I've seen 188x speedups in one particular algorithm on a Tesla versus my desktop.
I find CUDA awkward. It's not C, but a subset of it. It doesn't support double precision floating point natively and is emulated. For single precision it's okay though. It depends on the type of task you throw at it. You have to spend more time computing in parallel than you spend passing the data around for it to be worth using. But that issue is not unique to CUDA.
I'd wait for Apple's OpenCL which seems like it will be the industry standard for parallel computing.
Well, CUDA is portable... That's a big win if you ask me...
CUDA has nothing to do with whether double-precision floating-point operations are supported.
This is dependent on the hardware available. The 9, 100, 200, and Tesla series support double-precision floating-point operations.
It should be easy to decide between them.
If your app can tolerate being Windows specific, you can still consider DirectX Compute. Otherwise, use CUDA or OpenCL.
If your app cannot tolerate a vendor lock on NVIDIA, you cannot use CUDA, you must use OpenCL or DirectX Compute.
If your app is doing DirectX interop, consider that CUDA/OpenCL will incur context switch overhead doing graphics API interop, and DirectX Compute will not.
Unless one or more of those criteria affect your application, use the great granddaddy of massively parallel toolchains: CUDA.