I understand that cudaMalloc and cudaMemcpy move CPU (host) data to the GPU (device), but I want to know exactly from which memory to which memory (if it is indeed memory and not a register, because I'm not sure), since I have read that a GPU has more than one kind of memory.
The cudaMalloc function allocates the requested number of bytes in the device global memory of the GPU and returns, through its first argument, a pointer to that chunk of memory.
cudaMemcpy takes 4 parameters:
Destination address, i.e. the pointer to the memory the copy writes to
Source address
Number of bytes to copy
The direction of the copy, i.e. host to device or device to host
For example
void Add(float *A, float *B, float *C, int n)
{
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    // Allocate device global memory and copy the inputs from host to device
    cudaMalloc((void**) &d_A, size);
    cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**) &d_B, size);
    cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**) &d_C, size);
    cudaMemcpy(d_C, C, size, cudaMemcpyHostToDevice);

    // further processing code
    ........

    // Copy the result back from device to host
    cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);
    .......
}
cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost are constants (values of the cudaMemcpyKind enumeration) defined by the CUDA runtime.
In CUDA, the host and the device have separate memory spaces. GPUs have on-board DRAM, known as device global memory, and some boards carry more than 4 GB of it. To execute a kernel on a device, the programmer needs to allocate device global memory and transfer the relevant data from host memory to device memory. After the GPU processing is done, the result is transferred back to the host. These operations are shown in the code snippet above.
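For completeness, here is a hedged sketch (not part of the original snippet) of the error checking and cleanup a real host routine would add around those calls; the function name AddChecked is illustrative:

#include <stdio.h>
#include <cuda_runtime.h>

// Illustrative sketch: same allocate/copy pattern as above, plus return-code
// checks and cudaFree, which the original example omits for brevity.
void AddChecked(float *A, float *B, float *C, int n)
{
    int size = n * sizeof(float);
    float *d_A = NULL, *d_B = NULL, *d_C = NULL;

    if (cudaMalloc((void**)&d_A, size) == cudaSuccess &&
        cudaMalloc((void**)&d_B, size) == cudaSuccess &&
        cudaMalloc((void**)&d_C, size) == cudaSuccess) {
        cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
        cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);
        // ... kernel launch / further processing as in the snippet above ...
        cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);
    } else {
        fprintf(stderr, "device allocation failed: %s\n",
                cudaGetErrorString(cudaGetLastError()));
    }

    // Device global memory stays allocated until it is explicitly freed.
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}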
Related
I am computing a number, X, on the device. Now I need to launch a kernel with X threads. I can set the blockSize to 1024. Is there a way to set the number of blocks to ceil(X / 1024) without performing a memcpy?
I see two possibilities:
Use dynamic parallelism (if feasible). Rather than copying the result back to determine the execution parameters of the next launch, just have the device perform the next launch itself.
Use zero-copy or managed memory. In that case the GPU writes directly to CPU memory over the PCI-e bus, rather than requiring an explicit memory transfer.
Of those options, dynamic parallelism and managed memory require hardware features which are not available on all GPUs. Zero-copy memory is supported by all GPUs with compute capability >= 1.1, which in practice is just about every CUDA compatible device ever made.
Here is an example of using managed memory, as outlined by @talonmies, that lets kernel1 determine the number of blocks for kernel2 without an explicit memcpy:
#include <stdio.h>
#include <cuda.h>

// kernel2_blocks lives in managed memory, visible to both host and device
__device__ __managed__ int kernel2_blocks;

__global__ void kernel1() {
    if (threadIdx.x == 0) {
        kernel2_blocks = 42;
    }
}

__global__ void kernel2() {
    if (threadIdx.x == 0) {
        printf("block: %d\n", blockIdx.x);
    }
}

int main() {
    kernel1<<<1, 1024>>>();
    // Synchronize before the host reads the managed variable
    cudaDeviceSynchronize();
    kernel2<<<kernel2_blocks, 1024>>>();
    cudaDeviceSynchronize();
    return 0;
}
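For comparison, here is a hedged sketch of the first option, dynamic parallelism, where kernel1 launches kernel2 from the device so the block count never travels back to the host. It assumes a device of compute capability 3.5 or higher and compilation with relocatable device code (-rdc=true); the kernel names and the value X are illustrative, not taken from the question:

#include <stdio.h>

__global__ void kernel2(int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // ... work on element i ...
    }
}

__global__ void kernel1(int n) {
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        int X = n;                        // stands in for the value computed on the device
        int blocks = (X + 1023) / 1024;   // ceil(X / 1024)
        kernel2<<<blocks, 1024>>>(X);     // device-side launch: no memcpy, no host round trip
    }
}

int main() {
    kernel1<<<1, 1024>>>(100000);
    cudaDeviceSynchronize();
    return 0;
}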
I want to copy data from GPU0's DDR to GPU1's DDR directly, without going through CPU RAM.
As stated on page 15 here: http://people.maths.ox.ac.uk/gilesm/cuda/MultiGPU_Programming.pdf
Peer-to-Peer Memcpy
Direct copy from pointer on GPU A to pointer on GPU B
With UVA, just use cudaMemcpy(…, cudaMemcpyDefault)
Or cudaMemcpyAsync(…, cudaMemcpyDefault)
Also non-UVA explicit P2P copies:
cudaError_t cudaMemcpyPeer( void* dst, int dstDevice, const void* src,
                            int srcDevice, size_t count )
cudaError_t cudaMemcpyPeerAsync( void* dst, int dstDevice, const void* src,
                                 int srcDevice, size_t count, cudaStream_t stream = 0 )
If I use cudaMemcpy(), do I first have to set the flag cudaSetDeviceFlags( cudaDeviceMapHost )?
Do I have to pass cudaMemcpy() the pointers obtained from cudaHostGetDevicePointer(& uva_ptr, ptr, 0)?
Does cudaMemcpyPeer() have any advantages, and if not, why is it needed?
Unified Virtual Addressing (UVA) provides a single address space for all CPU and GPU memories: the physical location of an allocation can be determined from the pointer value alone.
Peer-to-peer memcpy with UVA
When UVA is available, cudaMemcpy can be used for peer-to-peer copies, since CUDA can infer which device "owns" which memory. The calls you typically need to perform a peer-to-peer memcpy with UVA are the following:
//Check for peer access between participating GPUs:
int can_access_peer_0_1 = 0, can_access_peer_1_0 = 0;
cudaDeviceCanAccessPeer(&can_access_peer_0_1, gpuid_0, gpuid_1);
cudaDeviceCanAccessPeer(&can_access_peer_1_0, gpuid_1, gpuid_0);
//Enable peer access between participating GPUs:
cudaSetDevice(gpuid_0);
cudaDeviceEnablePeerAccess(gpuid_1, 0);
cudaSetDevice(gpuid_1);
cudaDeviceEnablePeerAccess(gpuid_0, 0);
//UVA memory copy:
cudaMemcpy(gpu0_buf, gpu1_buf, buf_size, cudaMemcpyDefault);
Peer-to-peer memcpy without UVA
When UVA is not available, peer-to-peer memcpy is done via cudaMemcpyPeer. Here is an example:
// Set device 0 as current
cudaSetDevice(0);
float* p0;
size_t size = 1024 * sizeof(float);
// Allocate memory on device 0
cudaMalloc(&p0, size);
// Set device 1 as current
cudaSetDevice(1);
float* p1;
// Allocate memory on device 1
cudaMalloc(&p1, size);
// Set device 0 as current
cudaSetDevice(0);
// Launch kernel on device 0
MyKernel<<<1000, 128>>>(p0);
// Set device 1 as current
cudaSetDevice(1);
// Copy p0 to p1
cudaMemcpyPeer(p1, 1, p0, 0, size);
// Launch kernel on device 1
MyKernel<<<1000, 128>>>(p1);
As you can see, in the former case (UVA available) you do not need to specify which device each pointer refers to, whereas in the latter case (no UVA) you have to state explicitly which device each pointer belongs to.
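As a hedged aside on how the runtime infers ownership under UVA: the attributes of any pointer can be queried, which you can also do yourself. The buffer name gpu1_buf simply reuses the variable from the UVA snippet above.

// With UVA, the runtime can tell which device owns a pointer; this is what
// lets cudaMemcpy(..., cudaMemcpyDefault) route the copy correctly.
cudaPointerAttributes attr;
if (cudaPointerGetAttributes(&attr, gpu1_buf) == cudaSuccess) {
    printf("gpu1_buf resides on device %d\n", attr.device);
}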
The instruction
cudaSetDeviceFlags(cudaDeviceMapHost);
enables the mapping of pinned host memory into the device address space. That is a different mechanism: it concerns host<->device memory access, not the peer-to-peer transfers your question is about.
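To make the distinction concrete, here is a hedged sketch of what cudaDeviceMapHost is actually for, namely zero-copy mapping of pinned host memory into the device address space; the buffer names are illustrative:

cudaSetDeviceFlags(cudaDeviceMapHost);                 // must be set before the context is created
float *h_buf, *d_alias;
cudaHostAlloc((void**)&h_buf, 1024 * sizeof(float), cudaHostAllocMapped);
cudaHostGetDevicePointer((void**)&d_alias, h_buf, 0);  // device-side alias of the pinned host buffer
// a kernel can now read and write h_buf through d_alias over the PCIe bus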
In conclusion, the answers to your questions are:
NO;
NO;
When possible, enable UVA and use cudaMemcpy (you don't need to specify the devices); otherwise, use cudaMemcpyPeer (and you need to specify the devices).
In my code, I need to call a CUDA kernel to parallelize some matrix computation. However, this computation must be done iteratively for ~60,000 times (kernel is called inside a 60,000 iteration for loop).
That means that if I do cudaMalloc/cudaMemcpy around every single kernel call, most of the time will be spent on memory allocation and transfer, and I get a significant slowdown.
Is there a way to say, allocate a piece of memory before the for loop, use that memory in each iteration of the kernel, and then after the for loop, copy that memory back from device to host?
Thanks.
Yes, you can do exactly what you describe:
int *h_data, *d_data;
cudaMalloc((void **)&d_data, DSIZE*sizeof(int));
h_data = (int *)malloc(DSIZE*sizeof(int));
// fill up h_data[] with data
cudaMemcpy(d_data, h_data, DSIZE*sizeof(int), cudaMemcpyHostToDevice);
// the device allocation persists across kernel launches, so allocate and
// copy once, then reuse d_data in every iteration
for (int i = 0; i < 60000; i++)
    my_kernel<<<grid_dim, block_dim>>>(d_data);
// copy the result back only after the loop has finished
cudaMemcpy(h_data, d_data, DSIZE*sizeof(int), cudaMemcpyDeviceToHost);
...
Is there any application-level API to free shared memory allocated by a CTA in CUDA? I want to reuse my CTA for another task, and before starting that task I need to clear the memory used by the previous one.
Shared memory is allocated at kernel launch time, based on the kernel's static declarations. You can optionally declare an unsized shared allocation in the kernel:
__global__ void MyKernel()
{
    __shared__ int fixedShared;
    extern __shared__ int extraShared[];
    ...
}
The third kernel launch parameter then specifies how much shared memory corresponds to that unsized allocation.
MyKernel<<<blocks, threads, numInts*sizeof(int)>>>( ... );
The total amount of shared memory allocated for the kernel launch is the sum of the amount declared in the kernel, plus the shared memory kernel parameter, plus alignment overhead. You cannot "free" it - it stays allocated for the duration of the kernel launch.
For kernels that go through multiple phases of execution and need to use the shared memory for different purposes, what you can do is reuse the memory with shared memory pointers - use pointer arithmetic on the unsized declaration.
Something like:
__global__ void MyKernel()
{
    __shared__ int fixedShared;
    extern __shared__ int extraShared[];
    ...
    __syncthreads();
    // reinterpret the same dynamically allocated bytes for the next phase
    char *nowINeedChars = (char *) extraShared;
    ...
}
I don't know of any SDK samples that use this idiom, though the threadFenceReduction sample declares a __shared__ bool and also uses shared memory to hold the partial sums of the reduction.
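One practical detail worth noting when reusing the extern allocation this way: the third launch parameter has to cover the largest of the phases, since every phase reinterprets the same bytes. A hedged sketch, where numInts and numChars are illustrative sizes rather than anything from the original code:

size_t phase1Bytes = numInts * sizeof(int);    // used as int[] in the first phase
size_t phase2Bytes = numChars * sizeof(char);  // reused as char[] after __syncthreads()
size_t shmemBytes  = phase1Bytes > phase2Bytes ? phase1Bytes : phase2Bytes;
MyKernel<<<blocks, threads, shmemBytes>>>();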