How do I use Thrust with multiple GPUs?
Is it simply a matter of using cudaSetDevice(deviceId)
and then running the relevant Thrust code?
With CUDA 4.0 or later, cudaSetDevice(deviceId) followed by your Thrust code should work.
Just keep in mind that you will need to create and operate on separate vectors on each device (unless you have devices that support peer-to-peer memory access and PCI-express bandwidth is sufficient for your task).
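As a minimal sketch of that pattern (the helper name sum_on_each_device is hypothetical and not part of Thrust), the loop below selects each GPU in turn with cudaSetDevice, creates a separate thrust::device_vector on it, and reduces it there. It assumes CUDA 4.0 or later and processes the devices one after another; overlapping work across GPUs would need host threads or streams, which are not shown.

```cuda
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/sequence.h>
#include <vector>

// Hypothetical helper: on every visible GPU, fill a vector with 0..n-1
// and return the per-device sums (one entry per GPU).
std::vector<long long> sum_on_each_device(int n)
{
    int device_count = 0;
    if (cudaGetDeviceCount(&device_count) != cudaSuccess) device_count = 0;

    std::vector<long long> sums;
    for (int dev = 0; dev < device_count; ++dev) {
        // Later allocations and Thrust calls target this GPU.
        if (cudaSetDevice(dev) != cudaSuccess) continue;

        // The vector lives on the current device and is destroyed at the
        // end of this iteration, while that device is still current.
        thrust::device_vector<long long> v(n);
        thrust::sequence(v.begin(), v.end());                     // 0, 1, ..., n-1
        sums.push_back(thrust::reduce(v.begin(), v.end(), 0LL));  // sum on this GPU
    }
    return sums;
}
```

Calling sum_on_each_device(1000) from main should give one entry per GPU, each equal to 1000 * 999 / 2.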
I am trying to implement a CUDA program that uses Unified Memory. I have two unified arrays and sometimes they need to be updated atomically.
The question below has an answer for a single-GPU environment, but I am not sure how to extend that answer to multi-GPU platforms.
Question: cuda atomicAdd example fails to yield correct output
I have four Tesla K20 cards, in case this information is needed, and each of them updates a part of those arrays; these updates must be done atomically.
I would appreciate any help/recommendations.
To summarize comments into an answer:
You can perform this sort of address-space-wide atomic operation using atomicAdd_system.
However, you can only do this on devices of compute capability 6.x or newer (7.2 or newer if using Tegra).
Specifically, this means you have to compile for the correct compute capability, such as -arch=sm_60 or similar.
You state in the question that you are using Tesla K20 cards. These are compute capability 3.5 and do not support any of the system-wide atomic functions.
As always, this information is neatly summarized in the relevant section of the Programming Guide.
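For hardware that does qualify, the idea can be sketched roughly as follows. This is an illustration under stated assumptions, not a tested multi-GPU recipe: the kernel and helper names are hypothetical, the code must be built with something like -arch=sm_60, and it assumes every device reports concurrent managed access. It would not compile for, or run on, a K20.

```cuda
#include <cuda_runtime.h>

// Every thread adds 1 to a counter that lives in unified (managed) memory.
// atomicAdd_system needs compute capability 6.0+ (build with e.g. -arch=sm_60).
__global__ void count_kernel(int *counter)
{
    atomicAdd_system(counter, 1);
}

// Hypothetical helper: launches the kernel on every GPU and returns the
// final counter value, or -1 if the setup is not supported.
int count_across_gpus(int blocks, int threads)
{
    int device_count = 0;
    if (cudaGetDeviceCount(&device_count) != cudaSuccess) return -1;

    // Assumption: all GPUs must allow concurrent access to managed memory.
    for (int dev = 0; dev < device_count; ++dev) {
        int concurrent = 0;
        cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentManagedAccess, dev);
        if (!concurrent) return -1;
    }

    int *counter = nullptr;
    if (cudaMallocManaged(&counter, sizeof(int)) != cudaSuccess) return -1;
    *counter = 0;

    // Launch on every device; the kernels may run at the same time.
    for (int dev = 0; dev < device_count; ++dev) {
        cudaSetDevice(dev);
        count_kernel<<<blocks, threads>>>(counter);
    }

    // Wait for all devices before reading the result on the host.
    bool ok = true;
    for (int dev = 0; dev < device_count; ++dev) {
        cudaSetDevice(dev);
        ok = (cudaDeviceSynchronize() == cudaSuccess) && ok;
    }

    int total = ok ? *counter : -1;
    cudaFree(counter);
    return total;
}
```

If everything is supported, the returned value should equal blocks * threads multiplied by the number of GPUs.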
Can we assign a number of processes (i.e. 100-500 processes) to GPU, each process running on a GPU core?
In my video-processing application, I have to use the ffmpeg library to process video and audio. If there are more than 100, or even 500, such independent processes, I guess it would be faster to assign each process to a GPU. However, I don't know whether this is possible and, if it is, which libraries and tools are necessary. CUDA?
Can we assign a number of processes (i.e. 100-500 processes) to GPU, each process running on a GPU core?
No, you can't. In general it's not possible to schedule anything on a GPU core per se. This level of "scheduling" is handled mainly by the mechanics of the CUDA architecture and runtime system.
The basic idea is to expose parallelism at a fairly low level in your code (e.g. at the loop level) and with proper use of a GPU acceleration syntax (such as CUDA, OpenACC, OpenCL, etc.) the GPU can often make such elements of your program run faster.
But the GPU is not designed to be a drop-in replacement for CPU cores. There is the scheduling factor that I mentioned already, as well as the fact that codes generally need to be compiled for the GPU specifically.
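To make "exposing parallelism at the loop level" concrete, here is a small sketch in CUDA. The names scale_kernel and scale_on_gpu are made up for this illustration. Note that the code never picks a core: it only describes how many threads it wants, and the hardware and runtime decide where each one runs.

```cuda
#include <cuda_runtime.h>
#include <vector>

// GPU version of the CPU loop:  for (i = 0; i < n; ++i) data[i] *= factor;
// Each thread handles one iteration; the hardware decides where it runs.
__global__ void scale_kernel(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical helper: scales `host` in place on the GPU.
// Returns false if a CUDA call fails (for example, when no GPU is present).
bool scale_on_gpu(std::vector<float> &host, float factor)
{
    const int n = static_cast<int>(host.size());
    if (n == 0) return true;
    const size_t bytes = n * sizeof(float);

    float *dev = nullptr;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return false;

    bool ok = cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        const int threads = 256;
        const int blocks = (n + threads - 1) / threads;  // enough blocks to cover n
        scale_kernel<<<blocks, threads>>>(dev, factor, n);
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dev);
    return ok;
}
```

This is the kind of fine-grained, data-parallel work a GPU suits; an independent ffmpeg process does not map onto it in the same way.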
I am using CUDA programming for efficient and fast computation. During my study I found that multi-GPU systems and GPU clusters are further ways to speed up computation, but I am confused about the difference between these two terms.
What is the actual difference between the two in terms of CUDA programming?
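I assume that you mean a PC with multiple GPUs versus many PCs with a single GPU each (a cluster).
If this is the case, for a multi-GPU PC you can simply use the CUDA library itself: each GPU appears as a separate device that you select with cudaSetDevice. Note that an SLI bridge is a graphics-rendering feature; CUDA does not use it to spread compute work across GPUs, so do not expect it to improve the performance of CUDA code.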
If you want to use a cluster with GPUs, you may use CUDA-aware MPI. It is a combined solution of the MPI standard and the CUDA library. I suggest checking this blog post: https://devblogs.nvidia.com/parallelforall/introduction-cuda-aware-mpi/
I have some experience with NVIDIA CUDA and am now thinking about learning OpenCL too. I would like to be able to run my programs on any GPU. My question is: does every GPU use the same architecture as NVIDIA (multiprocessors, SIMT structure, global memory, local memory, registers, caches, ...)?
Thank you very much!
Starting with your stated goal:
"I would like to be able to run my programs on any GPU."
Then yes, you should learn OpenCL.
In answer to your overall question, other GPU vendors do use different architectures than Nvidia GPUs. In fact, GPU designs from a single vendor can vary by quite a bit, depending on the model.
This is one reason that a given OpenCL code may perform quite differently (depending on your performance metric) from one GPU to the next. In fact, to achieve optimized performance on any GPU, an algorithm should be "profiled" by varying, for example, local memory size, to find the best algorithm settings for a given hardware design.
But even with these hardware differences, the goal of OpenCL is to provide a level of core functionality that is supported by all devices (CPUs, GPUs, FPGAs, etc) and include "extensions" which allow vendors to expose unique hardware features. Although OpenCL cannot hide significant differences in hardware, it does guarantee portability. This makes it much easier for a developer to start with an OpenCL program tuned for one device and then develop a program optimized for another architecture.
To complicate matters when identifying hardware differences, the terminology used by CUDA is different from that used by OpenCL. For example, the following terms are roughly equivalent in meaning:
CUDA               OpenCL
Thread             Work-item
Thread block       Work-group
Global memory      Global memory
Constant memory    Constant memory
Shared memory      Local memory
Local memory       Private memory
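To see these terms side by side, here is a small CUDA sketch with the corresponding OpenCL names in the comments. The kernel and the gpu_sum helper are hypothetical examples written for this comparison; they assume a block size of 256 threads.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Sums each thread block's slice of `in` into out[blockIdx.x].
// Assumes blockDim.x is a power of two no larger than 256.
__global__ void block_sum(const float *in, float *out, int n)  // in/out: global memory in both APIs
{
    __shared__ float tile[256];                       // CUDA shared memory = OpenCL local memory
    int lid = threadIdx.x;                            // OpenCL: get_local_id(0)
    int gid = blockIdx.x * blockDim.x + threadIdx.x;  // OpenCL: get_global_id(0)
    float x = (gid < n) ? in[gid] : 0.0f;             // per-thread value = OpenCL private memory
    tile[lid] = x;
    __syncthreads();                                  // OpenCL: barrier(CLK_LOCAL_MEM_FENCE)
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (lid < stride) tile[lid] += tile[lid + stride];
        __syncthreads();
    }
    if (lid == 0) out[blockIdx.x] = tile[0];          // one result per block (work-group)
}

// Hypothetical helper: sums `host` on the GPU. Returns false on any CUDA error.
bool gpu_sum(const std::vector<float> &host, float *result)
{
    const int threads = 256;
    const int n = static_cast<int>(host.size());
    if (n == 0) { *result = 0.0f; return true; }
    const int blocks = (n + threads - 1) / threads;

    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, blocks * sizeof(float)) != cudaSuccess) { cudaFree(d_in); return false; }

    std::vector<float> partial(blocks);
    bool ok = cudaMemcpy(d_in, host.data(), n * sizeof(float), cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        block_sum<<<blocks, threads>>>(d_in, d_out, n);
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(partial.data(), d_out, blocks * sizeof(float), cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_in);
    cudaFree(d_out);
    if (!ok) return false;

    *result = 0.0f;
    for (float p : partial) *result += p;  // finish the reduction on the host
    return true;
}
```

The per-thread variable x normally sits in a register; what CUDA calls "local memory" is per-thread storage used when registers run out, and OpenCL covers both under "private memory".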
More comparisons and discussion can be found here.
You will find that the kinds of abstraction provided by OpenCL and CUDA are very similar. You can also usually count on your hardware having similar features: global mem, local mem, streaming multiprocessors, etc...
Switching from CUDA to OpenCL, you may be confused by the fact that many of the same concepts have different names (for example, a CUDA "warp" corresponds to what AMD's OpenCL documentation calls a "wavefront").
Where can I find information / changesets / suggestions for using the new enhancements in CUDA 4.0? I'm especially interested in learning about Unified Virtual Addressing.
Note: I would really like to see an example where we can access the RAM directly from the GPU.
Yes, using host memory (if that is what you mean by RAM) will most likely slow your program down, because transfers to/from the GPU take some time and are limited by RAM and PCI bus transfer rates. Try to keep everything in GPU memory. Upload once, execute kernel(s), download once. If you need anything more complicated try to use asynchronous memory transfers with streams.
As far as I know, "Unified Virtual Addressing" is really more about using multiple devices and abstracting away explicit memory management. Think of it as a single virtual GPU; everything else is still valid.
Accessing host memory directly from a kernel is already possible with mapped (zero-copy) host memory. See cudaHostAlloc with the cudaHostAllocMapped flag, and cudaHostGetDevicePointer, in the reference manual found on the NVIDIA CUDA website.
CUDA 4.0 UVA (Unified Virtual Addressing) does not help you access main memory from CUDA threads. As in previous versions of CUDA, you still have to map the main memory using the CUDA API for direct access from GPU threads, and this will slow down performance as mentioned above. Similarly, you cannot access GPU device memory from a CPU thread just by dereferencing a pointer to device memory. UVA only guarantees that the address spaces do not overlap across multiple devices (including CPU memory); it does not provide coherent accessibility.
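Since the question asked for an example of reaching RAM from the GPU, here is a rough sketch of the mapping approach. The kernel and helper names are hypothetical. It assumes device 0 supports mapped host memory, and every access in the kernel crosses the PCI-express bus, so treat it as an illustration rather than a fast path.

```cuda
#include <cuda_runtime.h>

// Each thread increments one element. `data` points at mapped host RAM,
// so these reads and writes travel across the bus.
__global__ void increment(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Hypothetical helper: fills a mapped host buffer with 0..n-1, increments it
// from the GPU, and reports the first and last elements. Requires n >= 1.
// Returns false if a CUDA call fails or mapping is not supported.
bool increment_in_host_memory(int n, int *first, int *last)
{
    // Older CUDA versions required this flag before the context was created;
    // the result is deliberately ignored here.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess || !prop.canMapHostMemory)
        return false;

    int *host = nullptr;
    if (cudaHostAlloc(reinterpret_cast<void **>(&host), n * sizeof(int),
                      cudaHostAllocMapped) != cudaSuccess)
        return false;
    for (int i = 0; i < n; ++i) host[i] = i;

    // Ask the runtime for the device-side address of the same buffer.
    int *dev_view = nullptr;
    bool ok = cudaHostGetDevicePointer(reinterpret_cast<void **>(&dev_view),
                                       host, 0) == cudaSuccess;
    if (ok) {
        increment<<<(n + 255) / 256, 256>>>(dev_view, n);
        // Synchronize before the CPU reads what the GPU wrote.
        ok = cudaDeviceSynchronize() == cudaSuccess;
    }
    if (ok) {
        *first = host[0];
        *last = host[n - 1];
    }
    cudaFreeHost(host);
    return ok;
}
```

No cudaMemcpy appears anywhere: the kernel works on the host buffer in place, which is convenient but usually slower than copying the data to device memory once.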