What is the difference between cudaMemcpy() and cudaMemcpyPeer() for P2P copy?

I want to copy data from GPU0-DDR to GPU1-DDR directly without CPU-RAM.
As said here on the page-15: http://people.maths.ox.ac.uk/gilesm/cuda/MultiGPU_Programming.pdf
Peer-to-Peer Memcpy
- Direct copy from pointer on GPU A to pointer on GPU B
- With UVA, just use cudaMemcpy(…, cudaMemcpyDefault)
- Or cudaMemcpyAsync(…, cudaMemcpyDefault)
- Also non-UVA explicit P2P copies:
  cudaError_t cudaMemcpyPeer(void* dst, int dstDevice, const void* src, int srcDevice, size_t count)
  cudaError_t cudaMemcpyPeerAsync(void* dst, int dstDevice, const void* src, int srcDevice, size_t count, cudaStream_t stream = 0)
If I use cudaMemcpy(), do I first have to set the flag cudaSetDeviceFlags(cudaDeviceMapHost)?
Do I have to pass to cudaMemcpy() the pointers obtained from cudaHostGetDevicePointer(&uva_ptr, ptr, 0)?
Does cudaMemcpyPeer() have any advantages, and if not, why is it needed?

Unified Virtual Addressing (UVA) enables one address space for all CPU and GPU memories since it allows determining physical memory location from pointer value.
Peer-to-peer memcpy with UVA
When UVA is available, cudaMemcpy can be used for peer-to-peer memcpy, since CUDA can infer which device "owns" which memory. The instructions you typically need to perform a peer-to-peer memcpy with UVA are the following:
//Check for peer access between participating GPUs:
cudaDeviceCanAccessPeer(&can_access_peer_0_1, gpuid_0, gpuid_1);
cudaDeviceCanAccessPeer(&can_access_peer_1_0, gpuid_1, gpuid_0);
//Enable peer access between participating GPUs:
cudaSetDevice(gpuid_0);
cudaDeviceEnablePeerAccess(gpuid_1, 0);
cudaSetDevice(gpuid_1);
cudaDeviceEnablePeerAccess(gpuid_0, 0);
//UVA memory copy:
cudaMemcpy(gpu0_buf, gpu1_buf, buf_size, cudaMemcpyDefault);
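(For completeness, a sketch of where gpu0_buf and gpu1_buf might come from; the variable names are illustrative and error checking is omitted:)
float *gpu0_buf, *gpu1_buf;
size_t buf_size = 1024 * sizeof(float);
cudaSetDevice(gpuid_0);
cudaMalloc(&gpu0_buf, buf_size);     // buffer in GPU 0 memory
cudaSetDevice(gpuid_1);
cudaMalloc(&gpu1_buf, buf_size);     // buffer in GPU 1 memory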
Peer-to-peer memcpy without UVA
When UVA is not available, peer-to-peer memcpy is done via cudaMemcpyPeer. Here is an example:
// Set device 0 as current
cudaSetDevice(0);
float* p0;
size_t size = 1024 * sizeof(float);
// Allocate memory on device 0
cudaMalloc(&p0, size);
// Set device 1 as current
cudaSetDevice(1);
float* p1;
// Allocate memory on device 1
cudaMalloc(&p1, size);
// Set device 0 as current
cudaSetDevice(0);
// Launch kernel on device 0
MyKernel<<<1000, 128>>>(p0);
// Set device 1 as current
cudaSetDevice(1);
// Copy p0 to p1
cudaMemcpyPeer(p1, 1, p0, 0, size);
// Launch kernel on device 1
MyKernel<<<1000, 128>>>(p1);
As you can see, while in the former case (UVA possible) you don't need to specify which device the different pointers refer to, in the latter case (UVA not possible) you have to explicitly mention which device the pointers refer to.
The instruction
cudaSetDeviceFlags(cudaDeviceMapHost);
is used to enable mapping of pinned host memory into the device address space. That is a different thing: it concerns host<->device memory movements, not the peer-to-peer memory movements that are the topic of your post.
In conclusion, the answers to your questions are:
NO;
NO;
When UVA is available, use cudaMemcpy (you don't need to specify the devices); otherwise, use cudaMemcpyPeer (and you need to specify the devices).
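(A small sketch of how one might pick the path at runtime; it assumes two devices 0 and 1 with gpu0_buf/gpu1_buf already allocated on them as above, and omits error checking:)
cudaDeviceProp prop0, prop1;
cudaGetDeviceProperties(&prop0, 0);
cudaGetDeviceProperties(&prop1, 1);
if (prop0.unifiedAddressing && prop1.unifiedAddressing) {
    // UVA: the runtime infers the owning device from each pointer value
    cudaMemcpy(gpu0_buf, gpu1_buf, buf_size, cudaMemcpyDefault);
} else {
    // No UVA: the devices must be named explicitly
    cudaMemcpyPeer(gpu0_buf, 0, gpu1_buf, 1, buf_size);
}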

Related

Is synchronization required before releasing device memory back to an allocator?

For some CUDA runtime API calls for memory management, such as cudaFree, explicit synchronization is not necessary (see: Is cudaDeviceSynchronize() required before cudaFree()?).
Now, assume that instead of CUDA API calls, a user-defined allocator object is used to allocate and deallocate device memory.
template<class Alloc>
void reduce(Alloc& alloc, ...){
    // First call: query the required temporary storage size
    size_t temp_storage_bytes = 0;
    cub::DeviceReduce::Sum(nullptr, temp_storage_bytes, d_in, d_out, num_items);
    // Allocate temporary storage through the user-defined allocator
    void* d_temp_storage = alloc.allocate(temp_storage_bytes);
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);
    //synchronize before deallocate?
    alloc.deallocate(d_temp_storage);
}
Should explicit synchronization be used before deallocate?
Must explicit synchronization be used before deallocate?
Is there a possible implementation of the allocator which can cause problems if the memory is still in use by a kernel at the point of calling deallocate?
For example, in the linked question, the argument for implicit synchronization is the modification of virtual address space.
Would the above code sample break if, for example, the allocator were implemented using driver-side virtual memory management (https://developer.nvidia.com/blog/introducing-low-level-gpu-virtual-memory-management/) and deallocate merely unmapped the physical memory from the virtual address range while keeping the virtual address range intact?
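(For reference, a minimal allocator satisfying the interface used above could simply wrap cudaMalloc/cudaFree; the class name below is hypothetical, and it deliberately performs no synchronization, which is exactly what the question is about:)
struct CudaMallocAllocator {
    void* allocate(size_t bytes) {
        void* p = nullptr;
        cudaMalloc(&p, bytes);   // raw device allocation
        return p;
    }
    void deallocate(void* p) {
        cudaFree(p);             // no explicit cudaDeviceSynchronize() here
    }
};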

Transferring data from CPU to GPU and vice versa where exactly?

I understand that cudaMalloc and cudaMemcpy transfer CPU (host) data to the GPU (device), but I want to know exactly from which memory to which memory (if it is indeed memory and not a register; I'm not sure), because I read that a GPU has more than one kind of memory.
The cudaMalloc function allocates the requested number of bytes in the device global memory of the GPU and gives back a pointer to that chunk of memory (the memory itself is not initialised).
cudaMemcpy takes 4 parameters:
Destination address, where the copy is to be written
Source address
Number of bytes
Direction of the copy, i.e. host to device or device to host
For example:
void Add(float *A, float *B, float *C, int n)
{
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;
    cudaMalloc((void**) &d_A, size);
    cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**) &d_B, size);
    cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**) &d_C, size);
    cudaMemcpy(d_C, C, size, cudaMemcpyHostToDevice);
    // further processing code
    ........
    cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);
    .......
}
cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost are constants defined in the CUDA programming environment.
In CUDA, host and device have separate memory spaces. GPUs have on-board DRAM, known as device global memory (some boards have more than 4 GB of it). To execute a kernel on a device, the programmer needs to allocate device global memory and transfer the relevant data from host to device memory. After the GPU processing is done, the result is transferred back to the host. These operations are shown in the code snippet above.
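(Putting the pieces together, a complete version of the pattern above might look like this; the kernel AddKernel and its launch configuration are illustrative placeholders, and error checking is omitted:)
__global__ void AddKernel(const float* A, const float* B, float* C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}

void Add(float* A, float* B, float* C, int n)
{
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;
    // Allocate device global memory and copy the inputs host -> device
    cudaMalloc((void**) &d_A, size);
    cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**) &d_B, size);
    cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**) &d_C, size);
    // Compute on the device
    AddKernel<<<(n + 255) / 256, 256>>>(d_A, d_B, d_C, n);
    // Copy the result device -> host and release the device memory
    cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}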

Getting an unexpected value in global device memory when multiple threads write to it

Here is a problem with CUDA threads and memory management: the code returns a single thread's result, "100", but I would expect the result of 9 threads, "900".
#include <stdio.h>
#include <assert.h>
#include <cuda_runtime.h>
#include <helper_functions.h>
#include <helper_cuda.h>
#include <windows.h> // for Sleep()

__global__
void test(int in1, int* ptr){
    int e = 0;
    for (int i = 0; i < 100; i++){
        e++;
    }
    *ptr += e;
}

int main(int argc, char **argv)
{
    int devID = 0;
    cudaError_t error;
    error = cudaGetDevice(&devID);
    if (error == cudaSuccess)
    {
        printf("GPU Device fine\n");
    }
    else{
        printf("GPU Device problem, aborting");
        abort();
    }

    int* d_A;
    cudaMalloc(&d_A, sizeof(int));
    int res = 0;
    //cudaMemcpy(d_A, &res, sizeof(int), cudaMemcpyHostToDevice);
    test <<<3, 3>>>(0, d_A);
    cudaDeviceSynchronize();
    cudaMemcpy(&res, d_A, sizeof(int), cudaMemcpyDeviceToHost);
    printf("res is : %i", res);
    Sleep(10000);
    return 0;
}
It returns:
GPU Device fine
res is : 100
I would expect it to return a higher number because of the 3x3 (blocks, threads) launch, instead of just one thread's result.
What is done wrong, and where do the numbers get lost?
You can't accumulate your sum into global memory this way.
You have to use an atomic function so that the read-modify-write is performed atomically.
In general, when multiple device threads write to the same location in global memory, you have to use either atomic functions:
float atomicAdd(float* address, float val);
double atomicAdd(double* address, double val);
reads the 32-bit or 64-bit word old located at the address address in
global or shared memory, computes (old + val), and stores the result
back to memory at the same address. These three operations are
performed in one atomic transaction. The function returns old.
or thread synchronization:
Throughput for __syncthreads() is 16 operations per clock cycle for
devices of compute capability 2.x, 128 operations per clock cycle for
devices of compute capability 3.x, 32 operations per clock cycle for
devices of compute capability 6.0 and 64 operations per clock cycle
for devices of compute capability 5.x, 6.1 and 6.2.
Note that __syncthreads() can impact performance by forcing the
multiprocessor to idle as detailed in Device Memory Accesses.
(adapting another answer of mine:)
You are experiencing the effects of the read-modify-write update of *ptr not being atomic (see a C++-oriented description of what that means). What's happening, chronologically, is the following sequence of events (though not necessarily in this exact thread order):
...(other work)...
block 0 thread 0 issues a LOAD instruction with address ptr into register r
block 0 thread 1 issues a LOAD instruction with address ptr into register r
...
block 2 thread 0 issues a LOAD instruction with address ptr into register r
block 0 thread 0 completes the LOAD, now having 0 in register r
...
block 2 thread 2 completes the LOAD, now having 0 in register r
block 0 thread 0 adds 100 to r
...
block 2 thread 2 adds 100 to r
block 0 thread 0 issues a STORE instruction from register r to address ptr
...
block 2 thread 2 issues a STORE instruction from register r to address ptr
Thus every thread sees the initial value of *ptr, which is 0; adds 100; and stores 0+100=100 back. The order of the stores doesn't matter here, since all of the threads store the same incorrect value.
What you need to do is either:
Use atomic operations - the least amount of modification to your code, but very inefficient, since it serializes your work to a great extent (a sketch follows this list), or
Use a block-level reduction primitive. This will ensure some partial ordering of the computational activity vis-a-vis shared block memory - using __syncthreads() or other mechanisms. Thus it might first have each thread add its own two elements up; then synchronize block threads; then have fewer threads add up pairs of pair-sums, and so on. Here's an NVIDIA blog post on implementing fast reductions on their more modern GPU architectures. Or
Compute block-local or warp-local and/or work-group-specific partial results, which require less/cheaper synchronization, and combine them into the full sum only after having done a lot of work on them.
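For reference, the atomic route applied to the kernel in the question is a one-line change (a sketch; note that atomicAdd also has an int overload, which is the one needed here, and that d_A should still be zero-initialised before the launch, e.g. by restoring the commented-out cudaMemcpy):
__global__
void test(int in1, int* ptr){
    int e = 0;
    for (int i = 0; i < 100; i++){
        e++;
    }
    atomicAdd(ptr, e);   // serializes the 9 concurrent updates, so all contributions land
}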

How to avoid memcpy if number of blocks depends on device variable?

I am computing a number, X, on the device. Now I need to launch a kernel with X threads. I can set the blockSize to 1024. Is there a way to set the number of blocks to ceil(X / 1024) without performing a memcpy?
I see two possibilities:
Use dynamic parallelism (if feasible). Rather than copying the result back to determine the execution parameters of the next launch, just have the device perform the next launch itself.
Use zero-copy or managed memory. In that case the GPU writes directly to CPU memory over the PCI-e bus, rather than requiring an explicit memory transfer.
Of those options, dynamic parallelism and managed memory require hardware features which are not available on all GPUs. Zero-copy memory is supported by all GPUs with compute capability >= 1.1, which in practice is just about every CUDA compatible device ever made.
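For the dynamic parallelism route, the launch of the second kernel happens on the device itself; a minimal sketch (assuming compute capability >= 3.5 and compilation with relocatable device code, -rdc=true; computeX() is a hypothetical device function standing in for the computation of X):
__device__ int computeX();                   // hypothetical: X is computed on the device

__global__ void childKernel() {
    // does the real work; X threads in total
}

__global__ void parentKernel() {
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        int X = computeX();
        int blocks = (X + 1023) / 1024;      // ceil(X / 1024)
        childKernel<<<blocks, 1024>>>();     // device-side launch, no copy back to the host
    }
}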
An example of using managed memory, as outlined by @talonmies, allowing kernel1 to determine the number of blocks for kernel2 without an explicit memcpy:
#include <stdio.h>
#include <cuda.h>

__device__ __managed__ int kernel2_blocks;

__global__ void kernel1() {
    if (threadIdx.x == 0) {
        kernel2_blocks = 42;
    }
}

__global__ void kernel2() {
    if (threadIdx.x == 0) {
        printf("block: %d\n", blockIdx.x);
    }
}

int main() {
    kernel1<<<1, 1024>>>();
    cudaDeviceSynchronize();
    kernel2<<<kernel2_blocks, 1024>>>();
    cudaDeviceSynchronize();
    return 0;
}
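The zero-copy alternative mentioned above could look roughly like this (a sketch only: it assumes kernel1 is modified to take an int* and write the block count through it, and error checking is omitted):
int* h_blocks;                                          // host pointer to pinned, mapped memory
int* d_blocks;                                          // device alias of the same allocation
cudaSetDeviceFlags(cudaDeviceMapHost);                  // must run before the device is first used
cudaHostAlloc((void**)&h_blocks, sizeof(int), cudaHostAllocMapped);
cudaHostGetDevicePointer((void**)&d_blocks, h_blocks, 0);

kernel1<<<1, 1024>>>(d_blocks);                         // kernel writes the block count over the bus
cudaDeviceSynchronize();                                // make the write visible to the host
kernel2<<<*h_blocks, 1024>>>();                         // host reads it directly, no cudaMemcpy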
