Solving the Poisson equation on multiple GPUs located on different cluster nodes interacting by the MPI protocol - cuda

I'm trying to solve the Poisson equation in real space on a multi GPUs architecture using a code in C/CUDA with the MPI library. For the moment, I'm only interested in solving the problem in a periodic box. But in the future, I may want to look at spherical geometry.
Is there an existing routine to solve this problem ?
Comments dated from August 2012 seem to indicate that the thrust library in not adapted for multi GPUs architectures. Is that still correct ?
If the routine exists, what method does it use (Jacobi, SOR, Gauss-Seidel, Krylov) ?
Please express your opinion about its speed and the problems you may have encountered.
Thanks for your time.

Solving the Poisson equation by a Multi-GPU approach, with GPUs located on different cluster nodes interacting by using the MPI protocol, is a relatively recent research topic. The basic idea is to use domain decomposition, so that each GPU solves for one part of the computational domain, and MPI is used to exchange boundary data.
You may wish to have a look at the papers Towards a multi-GPU solver for the three-dimensional
two-phase incompressible Navier-Stokes equations, presented at GTC 2012, and An MPI-CUDA Implementation for Massively
Parallel Incompressible Flow Computations on
Multi-GPU Clusters. Particularly in the first approach, Navier-Stokes equations are solved by Chorin’s projection approach which in turn requires the solution of a Poisson equation, which is the most demanding task and is solved by a MultiGPU/MPI strategy exploiting a Jacobi preconditioned conjugate gradient solver.
Concerning available routines, in the past I have bumped into GAMER, a downloadable software for astrophysics applications. The authors claim that the code contains a variety of GPU-accelerated Poisson solvers and hybrid OpenMP/MPI/GPU parallelization. However, I have never had the chance to download it.

Thrust can be used in a multi GPU environment. You can use the runtime api i.e. cudaSetDevice, to switch devices. Since thrust handles allocations and deallocations for vectors implicitly, care must be taken to make sure that the correct device is selected when device vectors are declared and when they are deallocated i.e. go out of scope.

Related

When does it make sense to use a GPU?

I have code doing a lot of operations with objects which can be represented as arrays.
When does it make to sense to use GPGPU environments (like CUDA) in an application? Can I predict performance gains before writing real code?
The convenience depends on a number of factors. Elementwise independent operations on large arrays/matrices are a good candidate.
For your particular problem (machine learning/fuzzy logic), I would recommend reading some related documents, as
Large Scale Machine Learning using NVIDIA CUDA
and
Fuzzy Logic-Based Image Processing Using Graphics Processor Units
to have a feeling on the speedup achieved by other people.
As already mentioned, you should specify your problem. However, if large parts of your code involve operations on your objects that are independent in a sense that object n does not have to wait for the results of the operations objects 0 to n-1, GPUs may enhance performance.
You could go to CUDA Zone to get yourself a general idea about what CUDA can do and do better than CPU.
https://developer.nvidia.com/category/zone/cuda-zone
CUDA has already provided lots of performance libraries, tools and ecosystems to reduce the development difficulty. It could also help you understand what kind of operations CUDA are good at.
https://developer.nvidia.com/cuda-tools-ecosystem
Further more, CUDA provided benchmark report on some of the most common and representative operations. You could find if your code can benefit from that.
https://developer.nvidia.com/sites/default/files/akamai/cuda/files/CUDADownloads/CUDA_5.0_Math_Libraries_Performance.pdf

Paralelizing FFT (using CUDA)

On my application I need to transform each line of an image, apply a filter and transform it back.
I want to be able to make multiple FFT at the same time using the GPU. More precisely, I'm using NVIDIA's CUDA. Now, some considerations:
CUDA's FFT library, CUFFT is only able to make calls from the host ( https://devtalk.nvidia.com/default/topic/523177/cufft-device-callable-library/).
On this topic (running FFTW on GPU vs using CUFFT), Robert Corvella says
"cufft routines can be called by multiple host threads".
I believed that doing all this FFTs in parallel would increase performance, but Robert comments
"the FFT operations are of reasonably large size, then just calling the cufft library routines as indicated should give you good speedup and approximately fully utilize the machine"
So,
Is this it? Is there no gain in performing more than one FFT at a time?
Is there any library that supports calls from the device?
Shoud I just use cufftPlanMany() instead (as refered in "is-there-a-method-of-fft-that-will-run-inside-cuda-kernel" by hang or as referred in the previous topic, by Robert)?
Or the best option is to call mutiple host threads?
(this 2 links limit is killing me...)
My objective is to get some discussion on what's the best solution to this problem, since many have faced similar situations.
This might be obsolete once NVIDIA implements device calls on CUFFT.
(something they said they are working on but there is no expected date for the release - something said on the discussion at the NVIDIA forum (first link))
So, Is this it? Is there no gain in performing more than one FFT at a time?
If the individual FFT's are large enough to fully utilize the device, there is no gain in performing more than one FFT at a time. You can still use standard methods like overlap of copy and compute to get the most performance out of the machine.
If the FFT's are small then the batched plan is a good way to get the most performance. If you go this route, I recommend using CUDA 5.5, as there have been some API improvements.
Is there any library that supports calls from the device?
cuFFT library cannot be used by making calls from device code.
There are other CUDA libraries, of course, such as ArrayFire, which may have options I'm not familiar with.
Shoud I just use cufftPlanMany() instead (as refered in "is-there-a-method-of-fft-that-will-run-inside-cuda-kernel" by hang or as referred in the previous topic, by Robert)?
Or the best option is to call mutiple host threads?
Batched plan is preferred over multiple host threads - the API can do a better job of resource management that way, and you will have more API-level visibility (such as through the resource estimation functions in CUDA 5.5) as to what is possible.

GPU-accelerated hardware simulation?

I am investigating if GPGPUs could be used for accelerating simulation of hardware.
My reasoning is this: As hardware by nature is very parallel, why simulate on highly sequential CPUs?
GPUs would be excellent for this, if not for their restrictive style of programming: You have a single kernel running, etc.
I have little experience with GPGPU-programming, but is it possible to use events or queues in OpenCL / CUDA?
Edit: By hardware simulation I don't mean emulation, but bit-accurate behavorial simulation (as in VHDL behavioral simulation).
I am not aware of any approaches regarding VHDL simulation on GPUs (or a general scheme to map discrete-event simulations), but there are certain application areas where discrete-event simulation is typically applied and which can be simulated efficiently on GPUs (e.g. transportation networks, as in this paper or this one, or stochastic simulation of chemical systems, as done in this paper).
Is it possible to re-formulate the problem in a way that makes a discrete time-stepped simulator feasible? In this case, simulation on a GPU should be much simpler (and still faster, even if it seems wasteful because the time steps have to be sufficiently small - see this paper on the GPU-based simulation of cellular automata, for example).
Note, however, that this is still most likely a non-trivial (research) problem, and the reason why there is no general scheme (yet) is what you already assumed: implementing an event queue on a GPU is difficult, and most simulation approaches on GPUs gain speed-up due to clever memory layout and application-specific optimizations and problem modifications.
This is outside my area of expertise, but it seems that while the following paper discusses gate-level simulation rather than behavioral simulation, it may contain some useful ideas:
Debapriya Chatterjee, Andrew Deorio, Valeria Bertacco.
Gate-Level Simulation with GPU Computing
http://web.eecs.umich.edu/~valeria/research/publications/TODAES0611.pdf

CUDA - Implementing Device Hash Map?

Does anyone have any experience implementing a hash map on a CUDA Device? Specifically, I'm wondering how one might go about allocating memory on the Device and copying the result back to the Host, or whether there are any useful libraries that can facilitate this task.
It seems like I would need to know the maximum size of the hash map a priori in order to allocate Device memory. All my previous CUDA endeavors have used arrays and memcpys and therefore been fairly straightforward.
Any insight into this problem are appreciated. Thanks.
There is a GPU Hash Table implementation presented in "CUDA by example", from Jason Sanders and Edward Kandrot.
Fortunately, you can get information on this book and download the examples source code freely on this page:
http://developer.nvidia.com/object/cuda-by-example.html
In this implementation, the table is pre-allocated on CPU and safe multithreaded access is ensured by a lock function based upon the atomic function atomicCAS (Compare And Swap).
Moreover, newer hardware generation (from 2.0) combined with CUDA >= 4.0 are supposed to be able to use directly new/delete operators on the GPU ( http://developer.nvidia.com/object/cuda_4_0_RC_downloads.html?utm_source=http://forums.nvidia.com&utm_medium=http://forums.nvidia.com&utm_term=Developers&utm_content=Developers&utm_campaign=CUDA4 ), which could serve your implementation. I haven't tested these features yet.
cuCollections is a relatively new open-source library started by NVIDIA engineers aiming at implementing efficient containers on the GPU.
cuCollections (cuco) is an open-source, header-only library of GPU-accelerated, concurrent data structures.
Similar to how Thrust and CUB provide STL-like, GPU accelerated algorithms and primitives, cuCollections provides STL-like concurrent data structures. cuCollections is not a one-to-one, drop-in replacement for STL data structures like std::unordered_map. Instead, it provides functionally similar data structures tailored for efficient use with GPUs.
cuCollections is still under heavy development. Users should expect breaking changes and refactoring to be common.
At the moment it provides a fixed size hashtable cuco::static_map and one that can grow cuco::dynamic_map.
I recall someone developed a straightforward hash map implementation on top of thrust. There is some code for it here, although whether it works with current thrust releases is something I don't know. It might at least give you some ideas.
AFAIK, the hash table given in "Cuda by Example" does not perform too well.
Currently, I believe, the fastest hash table on CUDA is given in Dan Alcantara's PhD dissertation. Look at chapter 6.
BTW, warpcore is a framework for creating high-throughput, purpose-built hashing data structures on CUDA-accelerators. Hashing at the speed of light on modern CUDA-accelerators. You can find it here:
https://github.com/sleeepyjack/warpcore

best way of using cuda

There are ways of using cuda:
auto-paralleing tools such as PGI workstation;
wrapper such as Thrust(in STL style)
NVidia GPUSDK(runtime/driver API)
Which one is better for performance or learning curve or other factors?
Any suggestion?
Performance rankings will likely be 3, 2, 1.
Learning curve is (1+2), 3.
If you become a CUDA expert, then it will be next to impossible to beat the performance of your hand-rolled code using all the tricks in the book using the GPU SDK due to the control that it gives you.
That said, a wrapper like Thrust is written by NVIDIA engineers and shown on several problems to have 90-95+% efficiency compared with hand-rolled CUDA. The reductions, scans, and many cool iterators they have are useful for a wide class of problems too.
Auto-parallelizing tools tend to not do quite as good a job with the different memory types as karlphillip mentioned.
My preferred workflow is using Thrust to write as much as I can and then using the GPU SDK for the rest. This is largely a factor of not trading away too much performance to reduce development time and increase maintainability.
Go with the traditional CUDA SDK, for both performance and smaller learning curve.
CUDA exposes several types of memory (global, shared, texture) which have a dramatic impact on the performance of your application, there are great articles about it on the web.
This page is very interesting and mentions the great series of articles about CUDA on Dr. Dobb's.
I believe that the NVIDIA GPU SDK is the best, with a few caveats. For example, try to avoid using the cutil.h functions, as these were written solely for use with the SDK, and I've personally, as well as many others, have run into some problems and bugs in them, that are hard to fix (There also is no documentation for this "library" and I've heard that NVIDIA does not support it at all)
Instead, as you mentioned, use the one of the two provided APIs. In particular I recommend the Runtime API, as it is a higher level API, and so you don't have to worry quite as much about all of the low level implementation details as you do in the Device API.
Both APIs are fully documented in the CUDA Programming Guide and CUDA Reference Guide, both of which are updated and provided with each CUDA release.
It depends on what you want to do on the GPU. If your algorithm would highly benefit from the things thrust can offer, like reduction, prefix, sum, then thrust is definitely worth a try and I bet you can't write the code faster yourself in pure CUDA C.
However if you're porting already parallel algorithms from the CPU to the GPU, it might be easier to write them in plain CUDA C. I had already successful projects with a good speedup going this route, and the CPU/GPU code that does the actual calculations is almost identical.
You can combine the two paradigms to some extend, but as far as I know you're launching new kernels for each thrust call, if you want to have all in one big fat kernel (taking too frequent kernel starts out of the equation), you have to use plain CUDA C with the SDK.
I find the pure CUDA C actually easier to learn, as it gives you quite a good understanding on what is going on on the GPU. Thrust adds a lot of magic between your lines of code.
I never used auto-paralleing tools such as PGI workstation, but I wouldn't advise to add even more "magic" into the equation.