cuGraph on Multi-GPU - cuda

Recently, I am reading the code of cuGraph. I notice that it is mentioned that Louvain and Katz algorithms support multi-GPU. However, when I read the C++ code of Louvain, I cannot find code that is related to multi-GPU. Specifically, according to a prior post, multi-GPU can be implemented by calling cudaSetDevice. I cannot find this function in the code of Louvain, however. Am I missing anything?

cuGraph supports multi-GPU by leveraging Dask. I encourage you to read the Dask cuGraph documentation that shows an example using PageRank.
For a Louvain example, I recommend looking at the docstring of the cugraph.dask.louvain function.
For completeness, under the hood cuGraph is using RAFT to manage underlying NCCL and UCX communication.

Related

Check if GPU is shared

When the GPU is shared with other processes (e.g. Xorg or other CUDA procs), a CUDA process better should not consume all remaining memory but dynamically grow its usage instead.
(There are various errors you might get indirectly from this, like Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR. But this question is not about that.)
(In TensorFlow, you would use allow_growth=True in the GPU options to accomplish this. But this question is not about that.)
Is there a simple way to check if the GPU is currently used by other processes? (I'm not asking whether it is configured to be used for exclusive access.)
I could parse the output nvidia-smi and look for other processes. But that seems somewhat hacky and maybe not so reliable, and not simple enough.
(My software is using TensorFlow, so if TensorFlow provides such a function, nice. But if not, I don't care if this would be a C API or Python function. I would prefer to avoid other external dependencies though, except those I'm anyway using, like CUDA itself, or TensorFlow. I'm not afraid to use ctypes. So consider this question language invariant.)
There is nvmlDeviceGetComputeRunningProcesses and nvmlDeviceGetGraphicsRunningProcesses. (Documentation.)
This is a C API, but I could use pynvml if I don't care about the extra dependency.
Example usage (via).

How to disable or remove numba and cuda from python project?

i've cloned a "PointPillars" repo for 3D detection using just point cloud as input. But when I came to run it, I noted it use cuda and numba. With any prior knowledge about these two, I'm asking if there is any way to remove or disable numba and cuda. I want to run it on local server with CPU only, so I want your advice to solve.
The actual code matters here.
If the usage is only of vectorize or guvectorize using the target=cuda parameter, then "removal" of CUDA should be trivial. Just remove the target parameter.
However if there is use of the #cuda.jit decorator, or explicit copying of data between host and device, then other code refactoring would be involved. There is no simple answer here in that case, the code would have to be converted to an alternate serial or parallel realization via refactoring or porting.

Normal Cuda Vs CuBLAS?

Just of curiosity. CuBLAS is a library for basic matrix computations. But these computations, in general, can also be written in normal Cuda code easily, without using CuBLAS. So what is the major difference between the CuBLAS library and your own Cuda program for the matrix computations?
We highly recommend developers use cuBLAS (or cuFFT, cuRAND, cuSPARSE, thrust, NPP) when suitable for many reasons:
We validate correctness across every supported hardware platform, including those which we know are coming up but which maybe haven't been released yet. For complex routines, it is entirely possible to have bugs which show up on one architecture (or even one chip) but not on others. This can even happen with changes to the compiler, the runtime, etc.
We test our libraries for performance regressions across the same wide range of platforms.
We can fix bugs in our code if you find them. Hard for us to do this with your code :)
We are always looking for which reusable and useful bits of functionality can be pulled into a library - this saves you a ton of development time, and makes your code easier to read by coding to a higher level API.
Honestly, at this point, I can probably count on one hand the number of developers out there who actually implement their own dense linear algebra routines rather than calling cuBLAS. It's a good exercise when you're learning CUDA, but for production code it's usually best to use a library.
(Disclosure: I run the CUDA Library team)
There's several reasons you'd chose to use a library instead of writing your own implementation. Three, off the top of my head:
You don't have to write it. Why do work when somebody else has done it for you?
It will be optimised. NVIDIA supported libraries such as cuBLAS are likely to be optimised for all current GPU generations, and later releases will be optimised for later generations. While most BLAS operations may seem fairly simple to implement, to get peak performance you have to optimise for hardware (this is not unique to GPUs). A simple implementation of SGEMM, for example, may be many times slower than an optimised version.
They tend to work. There's probably less chance you'll run up against a bug in a library then you'll create a bug in your own implementation which bites you when you change some parameter or other in the future.
The above isn't just relevent to cuBLAS: if you have a method that's in a well supported library you'll probably save a lot of time and gain a lot of performance using it relative to using your own implementation.

Standard Fortran interface for cuBLAS

I am using a commercial simulation software on Linux that does intensive matrix manipulation. The software uses Intel MKL by default, but it allows me to replace it with a custom BLAS/LAPACK library. This library must be a shared object (.so) library and must export both BLAS and LAPACK standard routines. The software requires the standard Fortran interface for all of them.
To verify that I can use a custom library, I compiled ATLAS and linked LAPACK (from netlib) inside it. The software was able to use my compiled ATLAS version without any problems.
Now, I want to make the software use cuBLAS in order to enhance the simulation speed. I was confronted by the problem that cuBLAS doesn't export the standard BLAS function names (they have a cublas prefix). Moreover, the library cuBLAS library doesn't include LAPACK routines.
I use readelf -a to check for the exported function.
On another hand, I tried to use MAGMA to solve this problem. I succeeded to compile and link it against all of ATLAS, LAPACK and cuBLAS. But still it doesn't export the correct functions and doesn't include LAPACK in the final shared object. I am not sure if this is the way it is supposed to be or I did something wrong during the build process.
I have also found CULA, but I am not sure if this will solve the problem or not.
Did anybody tried to get cuBLAS/LAPACK (or a proper wrapper) linked into a single (.so) exporting the standard Fortran interface with the correct function names? I believe it is conceptually possible, but I don't know how to do it!
Updated
As indicated by #talonmies, CUDA has provided a fortran thunking wrapper interface.
http://docs.nvidia.com/cuda/cublas/index.html#appendix-b-cublas-fortran-bindings
You should be able to run your application with it. But you probably will not get any performance improvement due to the mem alloc/copy issue described below.
Old
It may not easy. CUBLAS and other CUDA library interfaces assume all the data are already stored in device memory, however in your case, all the data are still in CPU RAM before calling.
You may have to write your own wrapper to deal with it like
void dgemm(...) {
copy_data_from_cpu_ram_to_gpu_mem();
cublas_dgemm(...);
copy_data_from_gpu_mem_to_cpu_ram();
}
On the other hand, you probably have noticed that every single BLAS call requires 2 data copies. This may introduce huge overhead and slow down the overall performance, unless most of your callings are BLAS 3 operations.

CUDA - Implementing Device Hash Map?

Does anyone have any experience implementing a hash map on a CUDA Device? Specifically, I'm wondering how one might go about allocating memory on the Device and copying the result back to the Host, or whether there are any useful libraries that can facilitate this task.
It seems like I would need to know the maximum size of the hash map a priori in order to allocate Device memory. All my previous CUDA endeavors have used arrays and memcpys and therefore been fairly straightforward.
Any insight into this problem are appreciated. Thanks.
There is a GPU Hash Table implementation presented in "CUDA by example", from Jason Sanders and Edward Kandrot.
Fortunately, you can get information on this book and download the examples source code freely on this page:
http://developer.nvidia.com/object/cuda-by-example.html
In this implementation, the table is pre-allocated on CPU and safe multithreaded access is ensured by a lock function based upon the atomic function atomicCAS (Compare And Swap).
Moreover, newer hardware generation (from 2.0) combined with CUDA >= 4.0 are supposed to be able to use directly new/delete operators on the GPU ( http://developer.nvidia.com/object/cuda_4_0_RC_downloads.html?utm_source=http://forums.nvidia.com&utm_medium=http://forums.nvidia.com&utm_term=Developers&utm_content=Developers&utm_campaign=CUDA4 ), which could serve your implementation. I haven't tested these features yet.
cuCollections is a relatively new open-source library started by NVIDIA engineers aiming at implementing efficient containers on the GPU.
cuCollections (cuco) is an open-source, header-only library of GPU-accelerated, concurrent data structures.
Similar to how Thrust and CUB provide STL-like, GPU accelerated algorithms and primitives, cuCollections provides STL-like concurrent data structures. cuCollections is not a one-to-one, drop-in replacement for STL data structures like std::unordered_map. Instead, it provides functionally similar data structures tailored for efficient use with GPUs.
cuCollections is still under heavy development. Users should expect breaking changes and refactoring to be common.
At the moment it provides a fixed size hashtable cuco::static_map and one that can grow cuco::dynamic_map.
I recall someone developed a straightforward hash map implementation on top of thrust. There is some code for it here, although whether it works with current thrust releases is something I don't know. It might at least give you some ideas.
AFAIK, the hash table given in "Cuda by Example" does not perform too well.
Currently, I believe, the fastest hash table on CUDA is given in Dan Alcantara's PhD dissertation. Look at chapter 6.
BTW, warpcore is a framework for creating high-throughput, purpose-built hashing data structures on CUDA-accelerators. Hashing at the speed of light on modern CUDA-accelerators. You can find it here:
https://github.com/sleeepyjack/warpcore