How does NCCL AllReduce work in detail?

I'm confused about how exactly NCCL performs the AllReduce operation in distributed training. Let's assume there are two nodes with one GPU each, running a DNN training task in data-parallel mode using Horovod. In the back-propagation phase they need to AllReduce their gradients, so the GPU on each node needs to send its local gradient to the other node. I want to understand the data flow during this AllReduce operation. Here is my understanding:
1. The CPU copies the gradient from the GPU's DRAM to host DRAM.
2. The CPU sends the gradient to the other node via the NIC.
Here, I'm curious what the CUDA kernels launched by NCCL are doing. I don't see any GPU involvement during this communication (the memory copy is performed by the CPU), so it looks like the GPU is doing nothing here. Could anyone give some insight on this? Thanks!
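As far as I understand, under the hood each rank ends up issuing roughly the following NCCL call (a minimal sketch in CUDA C; the communicator setup via ncclCommInitRank, which Horovod handles, is omitted, and the helper name is made up):

    #include <nccl.h>
    #include <cuda_runtime.h>

    // Hypothetical helper: AllReduce one rank's local gradients in place.
    // `comm` is assumed to have been created earlier with ncclCommInitRank.
    void allreduce_gradients(float* d_grad, size_t count,
                             ncclComm_t comm, cudaStream_t stream) {
        // NCCL enqueues its own CUDA kernels on `stream`: they read chunks of
        // d_grad, combine them with chunks received from the peer rank, and
        // write the summed result back into d_grad (in-place AllReduce).
        ncclAllReduce(d_grad, d_grad, count, ncclFloat, ncclSum, comm, stream);
        // Wait until the reduced gradients are available on the GPU.
        cudaStreamSynchronize(stream);
    }

The call takes device pointers and a CUDA stream, which is part of why I expected the GPU to be involved somewhere.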

Related

How to reduce CUDA context size (Multi-Process Service)

I followed Robert Crovella's example on how to use Nvidia's Multi-Process Service. According to the docs:
2.1.2. Reduced on-GPU context storage
Without MPS each CUDA process using a GPU allocates separate storage and scheduling resources on the GPU. In contrast, the MPS server allocates one copy of GPU storage and scheduling resources shared by all its clients.
I understood this to mean that each process's context size is reduced, which is possible because the storage is shared. This would increase free GPU memory and thus enable running more processes in parallel.
Now, back to the example. Comparing the per-process GPU memory usage without MPS and with MPS (screenshots not reproduced here), each process unfortunately still takes virtually the same amount of memory (~300MB). Isn't this in contradiction to the docs? Is there a way to decrease per-process memory consumption?
Oops, I over-eagerly asked before checking the memory usage on the other (pre-Volta) card, and yes, there actually is a difference. Let me just post it here for future reference in case anyone else stumbles on this problem too:
MPS off vs. MPS on: (per-process memory readings not reproduced here)
Indeed, as seen here, with the Volta architecture the client processes communicate directly with the GPU, without the MPS server in the middle:
Volta MPS clients submit work directly to the GPU without passing through the MPS server.
This can be easily seen from your first screenshot where the t1034 processes are listed as using the GPU.
By contrast, on pre-Volta architectures the client processes communicate with the GPU through the MPS server, which is why only the MPS server process appears to be communicating directly with the GPU in the latter screenshot.

GPUDirect peer-to-peer over the PCIe bus: if I need to access a lot of data on the other GPU, will it result in deadlocks?

I have a simulation program which requires a lot of data.
I load the data into the GPUs for the calculation, and there is a lot of dependency within the data.
Since one GPU was not enough for the data, I upgraded to two GPUs.
The limitation was that whenever I needed data that lives on the other GPU, it had to be copied to the host first.
So, if I use GPUDirect P2P, will the PCIe bus handle that much back-and-forth communication between the GPUs? Won't it result in deadlocks?
I am new to this, so I need some help and insight.
PCI Express has full speed in both directions. There should be no "deadlock" like you might experience with synchronous MPI communication that needs a handshake before proceeding.
As Robert mentioned in a comment, "accessing data over the PCIE bus is a lot slower than accessing it from on-board memory". However, it should still be significantly faster than transferring data from GPU1 to the CPU and then from the CPU to GPU2, since you don't have to copy it twice.
You should try to minimize the number of GPU-to-GPU transfers, especially if you have to sync data before you do them (which can happen in some algorithms). However, you could also try to do some concurrent execution while transferring data. You can look at the Peer-to-Peer Memory Copy section of the CUDA C Programming Guide.
http://docs.nvidia.com/cuda/cuda-c-programming-guide/#peer-to-peer-memory-copy
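For what it's worth, a minimal sketch of enabling peer access and doing a direct GPU-to-GPU copy (device IDs 0 and 1 and the helper name are assumptions; error checking is omitted, and in real code you would enable peer access once at startup):

    #include <cuda_runtime.h>

    // Hypothetical helper: copy `nbytes` from a buffer on GPU 0 to a buffer on GPU 1.
    void copy_gpu0_to_gpu1(void* d_dst_on_1, const void* d_src_on_0,
                           size_t nbytes, cudaStream_t stream) {
        int can01 = 0, can10 = 0;
        cudaDeviceCanAccessPeer(&can01, 0, 1);   // can GPU 0 reach GPU 1 directly?
        cudaDeviceCanAccessPeer(&can10, 1, 0);
        if (can01 && can10) {
            cudaSetDevice(0);
            cudaDeviceEnablePeerAccess(1, 0);    // second argument (flags) must be 0
            cudaSetDevice(1);
            cudaDeviceEnablePeerAccess(0, 0);
        }
        // With peer access enabled this copy goes directly over the PCIe bus
        // without staging through host memory; without it, the runtime falls
        // back to an internal device-to-host-to-device copy.
        cudaMemcpyPeerAsync(d_dst_on_1, 1, d_src_on_0, 0, nbytes, stream);
    }

There is no protocol-level handshake between the GPUs here that could deadlock; the copy is a DMA transfer that the driver schedules.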

cuda-mpi programming model without GPUDirect

I am using a GPU cluster without GPUDirect support. From this briefing, the following is done when transferring GPU data across nodes:
1. GPU writes to pinned sysmem1
2. CPU copies from sysmem1 to sysmem2
3. Infiniband driver copies from sysmem2
Now I am not sure whether the second step is an implicit step when I transfer sysmem1 across Infiniband using MPI. Assuming it is, my current programming model is something like this:
cudaMemcpy(hostmem, devicemem, size, cudaMemcpyDeviceToHost);
MPI_Send(hostmem, ...);
Is my above assumption true and will my programming model work without causing communication issues?
Yes, you can use CUDA and MPI independently (i.e. without GPUDirect), just as you describe:
1. Move the data from device to host
2. Transfer the data as you ordinarily would, using MPI
You might be interested in this presentation, which explains CUDA-aware MPI and gives a side-by-side example on slide 11 of non-CUDA MPI and CUDA-aware MPI.
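A minimal sketch of that non-GPUDirect pattern, staging through a pinned host buffer (the helper name and the rank roles are assumptions; error checking omitted):

    #include <mpi.h>
    #include <cuda_runtime.h>

    // Hypothetical helper: rank 0 sends a device buffer to rank 1 by staging it
    // through pinned host memory (no GPUDirect / CUDA-aware MPI assumed).
    void exchange_device_buffer(float* d_buf, int count, int rank) {
        float* h_buf;                                           // staging buffer ("sysmem1")
        cudaMallocHost((void**)&h_buf, count * sizeof(float));  // pinned, so D2H/H2D copies are fast
        if (rank == 0) {
            cudaMemcpy(h_buf, d_buf, count * sizeof(float), cudaMemcpyDeviceToHost);
            // MPI may internally copy h_buf into its own buffers ("sysmem2") before
            // the Infiniband driver picks it up; that step is hidden from you.
            MPI_Send(h_buf, count, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(h_buf, count, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            cudaMemcpy(d_buf, h_buf, count * sizeof(float), cudaMemcpyHostToDevice);
        }
        cudaFreeHost(h_buf);
    }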

GPU reads from CPU or CPU writes to the GPU?

I am a beginner in parallel programming. I have a query which might seem silly, but I didn't get a definitive answer when I googled it.
In GPU computing there is a device, i.e. the GPU, and a host, i.e. the CPU. I wrote a simple hello-world program which allocates some memory on the GPU, passes two parameters (say src[] and dest[]) to the kernel, copies the src string, i.e. "Hello world", to the dest string, and gets the dest string back from the GPU to the host.
Is the string src read by the GPU, or does the CPU write it to the GPU? Also, when we get the string back from the GPU, is the GPU writing to the CPU or is the CPU reading from the GPU?
In transferring the data back and forth there are four possibilities:
1. CPU to GPU
- CPU writes to the GPU
- GPU reads from the CPU
2. GPU to CPU
- GPU writes to the CPU
- CPU reads from the GPU
Can someone please explain which of these are possible and which are not?
In earlier versions of CUDA and corresponding hardware models, the GPU was more strictly a coprocessor owned by the CPU; the CPU wrote information to the GPU, and read the information back when the GPU was ready. At the lower level, this meant that really all four things were happening: the CPU wrote data to PCIe, the GPU read data from PCIe, the GPU then wrote data to PCIe, and the CPU read back the result. But transactions were initiated by the CPU.
More recently (CUDA 3? 4? maybe even beginning in 2?), some of these details are hidden from the application level, so that, effectively, GPU code can cause transfers to be initiated in much the same way as the CPU can. Consider unified virtual addressing, whereby programmers can access a unified virtual address space for CPU and GPU memory. When the GPU requests memory in the CPU space, this must initiate a transfer from the CPU, essentially reading from the CPU. The ability to put data onto the GPU from the CPU side is also retained. Basically, all ways are possible now, at the top level (at low levels, it's largely the same sort of protocol as always: both read from and write to the PCIe bus, but now, GPUs can initiate transactions as well).
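As a concrete illustration of a GPU-initiated access, here is a minimal zero-copy sketch using mapped pinned memory under UVA (the kernel and the sizes are made up; on older setups without UVA you would fetch the device pointer with cudaHostGetDevicePointer instead of passing the host pointer directly):

    #include <cuda_runtime.h>

    __global__ void scale(float* data, int n, float s) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= s;   // each access is a GPU-initiated read/write of host memory over PCIe
    }

    int main() {
        cudaSetDeviceFlags(cudaDeviceMapHost);   // allow mapping pinned host memory into the device address space
        const int n = 1 << 20;
        float* h_data;
        // Pinned, mapped allocation: with UVA the same pointer is valid on host and device.
        cudaHostAlloc((void**)&h_data, n * sizeof(float), cudaHostAllocMapped);
        for (int i = 0; i < n; ++i) h_data[i] = 1.0f;        // the CPU writes the data
        scale<<<(n + 255) / 256, 256>>>(h_data, n, 2.0f);    // the GPU reads and writes the same memory
        cudaDeviceSynchronize();
        cudaFreeHost(h_data);
        return 0;
    }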
Actually, none of these.
Your CPU code initiates the copy of the data, but the data itself is transferred by the memory controller to the GPU's memory over whatever bus you have on your system. Meanwhile, the CPU can process other data.
Similarly, when the GPU has finished running the kernels you launched, your CPU code initiates the copy of data, but meanwhile both GPU and CPU can handle other data or run other code.
Such copies are called asynchronous or non-blocking. You can optionally do blocking copies, in which case the CPU waits for the copy to complete.
When launching asynchronous tasks, you usually register an "event", which is some kind of flag that you can check later on, to see if the task is finished or not.
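A minimal sketch of such a non-blocking copy plus an event to poll (pinned host memory is assumed, since cudaMemcpyAsync only overlaps with CPU work when the host buffer is page-locked; the helper name is made up):

    #include <cuda_runtime.h>

    // d_buf is a device buffer, h_buf pinned host memory (cudaMallocHost), both nbytes long.
    void start_async_download(float* h_buf, const float* d_buf, size_t nbytes,
                              cudaStream_t stream, cudaEvent_t done) {
        cudaMemcpyAsync(h_buf, d_buf, nbytes, cudaMemcpyDeviceToHost, stream); // returns immediately
        cudaEventRecord(done, stream);   // the "flag" that fires when the copy has finished
    }

    // Later, from CPU code that is doing other work:
    //   if (cudaEventQuery(done) == cudaSuccess) { /* copy finished, h_buf is ready */ }
    // or block until it completes:
    //   cudaEventSynchronize(done);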
In OpenCL the host (CPU) exclusively controls all transfers of data between the host and the GPU. The host transfers data to the GPU using buffers, and reads results back from the GPU using buffers. On some systems and devices the transfer doesn't physically copy any bytes, because the host and the GPU use the same physical memory; this is called zero copy.
I just found out in this forum http://devgurus.amd.com/thread/129897 that using CL_MEM_ALLOC_HOST_PTR | CL_MEM_COPY_HOST_PTR in clCreateBuffer allocates memory on the host and that it won't be copied to the device.
There may be a performance hit, but this is what I am looking for. Your comments, please.
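A minimal sketch of that buffer setup and of mapping it for host access (the helper name is made up, the context and queue are assumed to exist already, and whether it is truly zero copy depends on the device and driver):

    #include <CL/cl.h>

    // Hypothetical helper: create a buffer backed by host-allocated memory,
    // initialized from host_data, and map it so the host can touch it directly.
    cl_mem make_host_backed_buffer(cl_context context, cl_command_queue queue,
                                   void* host_data, size_t size, void** mapped) {
        cl_int err;
        cl_mem buf = clCreateBuffer(context,
                                    CL_MEM_ALLOC_HOST_PTR | CL_MEM_COPY_HOST_PTR,
                                    size, host_data, &err);
        // On devices that share physical memory with the host, mapping returns a
        // pointer to the same storage the kernels see (zero copy); on discrete
        // GPUs the runtime may still move data behind the scenes.
        *mapped = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_READ | CL_MAP_WRITE,
                                     0, size, 0, NULL, NULL, &err);
        return buf;
    }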

Prefetching in Nvidia CUDA

I'm working on data prefetching in Nvidia CUDA. I have read some documents about prefetching on the device itself, i.e. prefetching from shared memory to cache.
But I'm interested in data prefetching between the CPU and the GPU. Can anyone point me to some documents or other material on this topic? Any help would be appreciated.
Answer based on your comment:
When we want to perform computation on large data, ideally we'll send as much data as possible to the GPU, perform the computation, and send it back to the CPU, i.e. SEND, COMPUTE, SEND (back to CPU). While it sends results back to the CPU, the GPU has to stall. My plan is: given a CUDA program, say it normally runs in the entire global memory; I'll force it to run in half of the global memory so that I can use the other half for data prefetching. While computation is being performed in one half, I can simultaneously prefetch data into the other half, so there will be no stalls. Is this feasible to do? Will performance be degraded or improved? It should improve.
CUDA streams were introduced to enable exactly this approach.
If your computation is rather intensive, then yes --- it can greatly speed up your performance. On the other hand, if data transfers take, say, 90% of your time, you will save only on computation time - that is - 10% tops...
The details, including examples, of how to use streams are provided in the CUDA Programming Guide.
For version 4.0, that will be section "3.2.5.5 Streams", and in particular "3.2.5.5.5 Overlapping Behavior" --- there, they launch another asynchronous memory copy while a kernel is still running.
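A minimal double-buffered sketch of that idea with two streams (process_chunk is a hypothetical kernel, and the host input is assumed to be pinned, since the copies only overlap with kernels when the host memory is page-locked):

    #include <cuda_runtime.h>

    __global__ void process_chunk(float* d, int n) {        // hypothetical kernel
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= 2.0f;
    }

    // h_in: pinned host input of num_chunks * chunk floats (cudaMallocHost).
    // d_buf[0], d_buf[1]: two device buffers of `chunk` floats each, i.e. the two
    // "halves" of global memory described in the comment above.
    void pipeline(float* h_in, float* d_buf[2], int chunk, int num_chunks) {
        cudaStream_t s[2];
        cudaStreamCreate(&s[0]);
        cudaStreamCreate(&s[1]);
        for (int c = 0; c < num_chunks; ++c) {
            int b = c % 2;   // alternate between the two halves
            // The upload of chunk c on stream s[b] can overlap with the kernel
            // still processing chunk c-1 on the other stream.
            cudaMemcpyAsync(d_buf[b], h_in + (size_t)c * chunk,
                            chunk * sizeof(float), cudaMemcpyHostToDevice, s[b]);
            process_chunk<<<(chunk + 255) / 256, 256, 0, s[b]>>>(d_buf[b], chunk);
        }
        cudaStreamSynchronize(s[0]);
        cudaStreamSynchronize(s[1]);
        cudaStreamDestroy(s[0]);
        cudaStreamDestroy(s[1]);
    }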
Perhaps you would be interested in the asynchronous host/device memory transfer capabilities of CUDA 4.0? You can overlap host/device memory transfers and kernels by using page-locked host memory. You could use this to...
Copy working set #1 & #2 from host to device.
Process #i, promote #i+1, and load #i+2 - concurrently.
So you could be streaming data in and out of the GPU and computing on it all at once (!). Please refer to the CUDA 4.0 Programming Guide and CUDA 4.0 Best Practices Guide for more detailed information. Good luck!
CUDA 6 will eliminate the need to copy, i.e. the copying will be automatic.
However, you may still benefit from prefetching.
In a nutshell, you want the data for the "next" computation to be transferring while you complete the current computation. To achieve that you need at least two threads on the CPU and some kind of signalling scheme (to know when to send the next data). Chunking will of course play a big role and affect performance.
The above may be easier on an APU (CPU+GPU on the same die), as the need to copy is eliminated because both processors can access the same memory.
If you want to find papers on GPU prefetching, just use Google Scholar.
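On the unified-memory point, a minimal sketch with managed memory plus an explicit prefetch (cudaMemPrefetchAsync arrived in a later CUDA release than 6, and the kernel and sizes here are made up):

    #include <cuda_runtime.h>

    __global__ void work(float* d, size_t n) {               // hypothetical kernel
        size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        if (i < n) d[i] += 1.0f;
    }

    int main() {
        const size_t n = 1 << 24;
        float* data;
        cudaMallocManaged(&data, n * sizeof(float));          // no explicit cudaMemcpy needed
        for (size_t i = 0; i < n; ++i) data[i] = 0.0f;        // first touched on the CPU

        cudaStream_t s;
        cudaStreamCreate(&s);
        // Migrate the pages to GPU 0 ahead of the kernel instead of letting it
        // fault them in on demand; this is the "prefetch" in managed-memory terms.
        cudaMemPrefetchAsync(data, n * sizeof(float), 0, s);
        int blocks = (int)((n + 255) / 256);
        work<<<blocks, 256, 0, s>>>(data, n);
        cudaStreamSynchronize(s);

        cudaFree(data);
        return 0;
    }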