Within a CUDA kernel, copying from device to host without pinned memory?

Is there a way to copy from device to host within the kernel?
Something like the following code:
__global__ void kernel(int n, double *devA, double *hostA) {
    double x = 1.0;
    do_computation();
    cudaMemcpy(hostA, &x, sizeof(double), cudaMemcpyDeviceToHost);
    do_computation();
    cudaMemcpy(hostA, devA, sizeof(double), cudaMemcpyDeviceToHost);
}
Is it possible? Based on the CUDA documentation, cudaMemcpy is not callable from the device, right?
NOTE: I don't want to use pinned memory. It performs poorly because I will constantly be checking the host variable (memory), so using pinned memory will issue page faults (at best, on post-Pascal hardware), and that will definitely happen! If both host and device access the same location, it basically becomes a ping-pong effect!

Is it possible?
In one word, no.
Based on the CUDA documentation, cudaMemcpy is not callable from the device, right?
In fact, if you read the documentation, you will see that cudaMemcpy is supported in device code, but only for device-to-device transfers, and not with local variables as the source or destination.
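As an illustration of what is possible instead (a minimal sketch, not from the answer above; devA and hostA follow the question's naming): have the kernel write its result to device global memory and let the host perform the device-to-host copy once the kernel has completed.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel(double *devA)
{
    double x = 1.0;
    // do_computation();
    // A device-side copy to a host pointer is not possible, so stage
    // the result in device global memory instead.
    devA[0] = x;
}

int main()
{
    double *devA;
    double hostA = 0.0;
    cudaMalloc(&devA, sizeof(double));
    kernel<<<1, 1>>>(devA);
    // The host copies the result back after the kernel has completed.
    cudaMemcpy(&hostA, devA, sizeof(double), cudaMemcpyDeviceToHost);
    printf("hostA = %f\n", hostA);
    cudaFree(devA);
    return 0;
}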

Related

Persistent buffers in CUDA

I have an application where I need to allocate and maintain a persistent buffer which can be used by successive launches of multiple kernels in CUDA. I will eventually need to copy the contents of this buffer back to the host.
I had the idea to declare a global scope device symbol which could be directly used in different kernels without being passed as an explicit kernel argument, something like
__device__ char* buffer;
but then I am uncertain how I should allocate memory and assign the address to this device pointer so that the memory has the persistent scope I require. So my question is really in two parts:
What is the lifetime of the various methods of allocating global memory?
How should I allocate memory and assign a value to the global scope pointer? Is it necessary to use device code malloc and run a setup kernel to do this, or can I use some combination of host side APIs to achieve this?
[Postscript: this question has been posted as a Q&A in response to this earlier SO question on a similar topic]
What is the lifetime of the various methods of allocating global memory?
All global memory allocations have a lifetime of the context in which they are allocated. This means that any global memory your applications allocates is "persistent" by your definition, irrespective of whether you use host side APIs or device side allocation on the GPU runtime heap.
How should I allocate memory and assign a value to the global scope
pointer? Is it necessary to use device code malloc and run a setup
kernel to do this, or can I use some combination of host side APIs to
achieve this?
Either method will work as you require, although host APIs are much simpler to use. There are also some important differences between the two approaches.
Memory allocations made with malloc or new in device code come from a device runtime heap. This heap must be sized appropriately using the cudaDeviceSetLimit API before running malloc in device code, otherwise the call may fail. And the device heap is not accessible to host side memory management APIs, so you also require a copy kernel to transfer the memory contents to host API accessible memory before you can transfer the contents back to the host.
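For illustration only (not part of the original answer), a minimal sketch of that device-side route might look like the following; the setup and copy_out kernel names, the sizes, and the staging buffer are assumptions made for this example.

#include <vector>
#include <cuda_runtime.h>

__device__ char* buffer;                      // global-scope device pointer

__global__ void setup(size_t n)
{
    // A single thread allocates from the device runtime heap.
    buffer = static_cast<char*>(malloc(n));
}

__global__ void copy_out(char* dst, size_t n)
{
    // The device heap is invisible to host APIs, so copy its contents
    // into a cudaMalloc'd buffer the host can read from.
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += (size_t)gridDim.x * blockDim.x)
        dst[i] = buffer[i];
}

int main()
{
    const size_t n = 800 * 600;

    // The heap must be sized before any kernel calls malloc.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 8 * n);
    setup<<<1, 1>>>(n);

    // ... kernels using buffer go here ...

    char* d_staging;
    cudaMalloc(&d_staging, n);
    copy_out<<<64, 256>>>(d_staging, n);

    std::vector<char> results(n);
    cudaMemcpy(&results[0], d_staging, n, cudaMemcpyDeviceToHost);

    // The heap allocation persists until freed in device code or
    // until the context is destroyed.
    cudaFree(d_staging);
    return 0;
}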
The host API case, on the other hand, is extremely straightforward and has none of the limitations of device side malloc. A simple example would look something like:
#include <vector>
#include <cuda_runtime.h>

__device__ char* buffer;

int main()
{
    char* d_buffer;
    const size_t buffer_sz = 800 * 600 * sizeof(char);

    // Allocate memory
    cudaMalloc(&d_buffer, buffer_sz);

    // Zero memory and assign to global device symbol
    cudaMemset(d_buffer, 0, buffer_sz);
    cudaMemcpyToSymbol(buffer, &d_buffer, sizeof(char*));

    // Kernels go here using buffer

    // Copy to host
    std::vector<char> results(800 * 600);
    cudaMemcpy(&results[0], d_buffer, buffer_sz, cudaMemcpyDeviceToHost);

    // buffer has a lifespan until it is freed here
    cudaFree(d_buffer);

    return 0;
}
[Standard disclaimer: code written in browser, not compiled or tested, use at own risk]
So basically you can achieve what you want with standard host side APIs: cudaMalloc, cudaMemcpyToSymbol, and cudaMemcpy. Nothing else is required.

Why do I need to declare CUDA variables on the Host before allocating them on the Device

I've just started trying to learn CUDA again and came across some code I don't fully understand.
// declare GPU memory pointers
float * d_in;
float * d_out;
// allocate GPU memory
cudaMalloc((void**) &d_in, ARRAY_BYTES);
cudaMalloc((void**) &d_out, ARRAY_BYTES);
When the GPU memory pointers are declared, they allocate memory on the host. The cudaMalloc calls throw away the information that d_in and d_out are pointers to floats.
I can't think why cudaMalloc would need to know about where in host memory d_in & d_out have originally been stored. It's not even clear why I need to use the host bytes to store whatever host address d_in & d_out point to.
So, what is the purpose of the original variable declarations on the host?
======================================================================
I would've thought something like this would make more sense:
// declare GPU memory pointers
cudaFloat * d_in;
cudaFloat * d_out;
// allocate GPU memory
cudaMalloc((void**) &d_in, ARRAY_BYTES);
cudaMalloc((void**) &d_out, ARRAY_BYTES);
This way, everything GPU related takes place on the GPU. If d_in or d_out are accidentally used in host code, an error can be thrown at compile time, since those variables wouldn't be defined on the host.
I guess what I also find confusing is that by storing device memory addresses on the host, it feels like the device isn't fully in charge of managing its own memory. It feels like there's a risk of host code accidentally overwriting the value of either d_in or d_out, either by accidentally assigning to them in host code or through some more subtle error, which could cause the GPU to lose access to its own memory. Also, it seems strange that the addresses assigned to d_in & d_out are chosen by the host, instead of the device. Why should the host know anything about which addresses are/are not available on the device?
What am I failing to understand here?
I can't think why cudaMalloc would need to know about where in host memory d_in & d_out have originally been stored
That is just the C pass by reference idiom.
It's not even clear why I need to use the host bytes to store whatever host address d_in & d_out point to.
Ok, so let's design the API your way. Here is a typical sequence of operations on the host -- allocate some memory on the device, copy some data to that memory, launch a kernel to do something to that memory. You can think for yourself how it would be possible to do this without having the pointers to the allocated memory stored in a host variable:
cudaMalloc(somebytes);
cudaMemcpy(?????, hostdata, somebytes, cudaMemcpyHostToDevice);
kernel<<<1,1>>>(?????);
If you can explain what should be done with ????? if we don't have the address of the memory allocation on the device stored in a host variable, then you are really onto something. If you can't, then you have deduced the basic reason why we store the return address of memory allocated on the GPU in host variables.
Further, because typed host pointers are used to store the addresses of device allocations, the compiler can type-check kernel arguments. So this:
__global__ void kernel(double *data, int N);
// .....
int N = 1 << 20;
float * d_data;
cudaMalloc((void **)&d_data, N * sizeof(float));
kernel<<<1,1>>>(d_data, N);
is reported as a type mismatch at compile time, which is very useful.
Your fundamental conceptual failure is mixing up host-side code and device-side code. If you call cudaMalloc() from code executing on the CPU then, well, it happens on the CPU: it's you who wants the arguments in CPU memory and the result in CPU memory. You asked for it. cudaMalloc tells the GPU/device how much of its (the device's) memory to allocate, but if the CPU/host wants to access that memory, it needs a way of referring to it that the device will understand. The address of the allocation in device memory is that way of referring to it.
Alternatively, you can call it from device-side code; then everything takes place on the GPU. (Although, frankly, I've never done it myself and it's not such a great idea except in special cases).

Ensure that thrust doesn't memcpy from host to device

I have used the following method, expecting to avoid a memcpy from host to device. Does the thrust library ensure that there won't be a memcpy from host to device in the process?
#include <thrust/device_ptr.h>
#include <thrust/scan.h>

void EScanThrust(float * d_in, float * d_out)
{
    // 'size' is the number of elements in d_in; it is defined elsewhere
    // in the original code.
    thrust::device_ptr<float> dev_ptr(d_in);
    thrust::device_ptr<float> dev_out_ptr(d_out);
    thrust::exclusive_scan(dev_ptr, dev_ptr + size, dev_out_ptr);
}
Here d_in and d_out are prepared using cudaMalloc, and d_in is filled with data using cudaMemcpy, before calling this function.
Does the thrust library ensure that there won't be a memcpy from host to device in the process?
The code you've shown shouldn't involve any host->device copying. (How could it? There are no references anywhere to any host data in the code you have shown.)
For actual codes, it's easy enough to verify the underlying CUDA activity using a profiler, for example:
nvprof --print-gpu-trace ./my_exe
If you keep your profiled code sequences short, it's pretty easy to line up the underlying CUDA activity with the thrust code that generated that activity. If you want to profile just a short segment of a longer sequence, then you can turn profiling on and off or else use NVTX markers to identify the desired range in the profiler output.
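For example, a minimal sketch of the NVTX marker approach mentioned above (illustrative only; the sizes and range name are assumptions, and the program must be linked against -lnvToolsExt):

#include <cuda_runtime.h>
#include <nvToolsExt.h>            // link with -lnvToolsExt
#include <thrust/device_ptr.h>
#include <thrust/scan.h>

int main()
{
    const size_t size = 1 << 20;
    float *d_in, *d_out;
    cudaMalloc(&d_in, size * sizeof(float));
    cudaMalloc(&d_out, size * sizeof(float));
    cudaMemset(d_in, 0, size * sizeof(float));

    // Bracket the region of interest with a named NVTX range so the
    // corresponding GPU activity is easy to pick out in the profiler trace.
    nvtxRangePushA("exclusive_scan");
    thrust::device_ptr<float> dev_ptr(d_in);
    thrust::device_ptr<float> dev_out_ptr(d_out);
    thrust::exclusive_scan(dev_ptr, dev_ptr + size, dev_out_ptr);
    cudaDeviceSynchronize();
    nvtxRangePop();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}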

How to use shared memory between kernel call of CUDA?

I want to use shared memory across successive calls of one kernel.
Can I use shared memory between kernel calls?
No, you can't. Shared memory has the lifetime of a thread block: a variable stored in it is accessible by all the threads belonging to that block, but only for the duration of one __global__ function invocation.
Try page-locked (pinned) host memory instead, but bear in mind it will be much slower than on-board device memory.
cudaHostAlloc(&ptr, size, cudaHostAllocMapped);
then pass ptr to the kernel (or, on systems without unified addressing, the device pointer obtained from cudaHostGetDevicePointer).
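For what it's worth, a minimal sketch of that page-locked suggestion (kernel and variable names are illustrative, not from the answer): allocate mapped, pinned memory once on the host and let successive kernel launches read and write it through the corresponding device pointer.

#include <cuda_runtime.h>

__global__ void mark(int *flags, int slot)
{
    // Writes go straight to the mapped host allocation.
    if (threadIdx.x == 0)
        flags[slot] = 1;
}

int main()
{
    // Needed on older setups before mapped allocations can be used.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    int *h_flags, *d_flags;
    cudaHostAlloc(&h_flags, 2 * sizeof(int), cudaHostAllocMapped);
    h_flags[0] = h_flags[1] = 0;
    cudaHostGetDevicePointer(&d_flags, h_flags, 0);

    mark<<<1, 32>>>(d_flags, 0);   // first launch
    mark<<<1, 32>>>(d_flags, 1);   // second launch
    cudaDeviceSynchronize();

    // h_flags[0] and h_flags[1] are now both 1 on the host.
    cudaFreeHost(h_flags);
    return 0;
}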
Previously you could do it in a non-standard way, where you would have a unique id for each shared memory block and the next kernel would check the id and carry out the required processing on that shared memory block. This was hard to implement, as you needed to ensure full occupancy for each kernel and deal with various corner cases. In addition, without formal support you could not rely on compatibility across compute devices and CUDA versions.

CUDA how to create arrays in runtime in kernel in shared memory?

I have a task with a large number of threads running, each doing a small matrix multiplication. All the small matrices have been loaded into global memory. I wish to improve performance by letting each thread load its small matrices into shared memory and then compute the product. But the problem is that I do not know the sizes of the matrices at compile time, so I cannot create variables as in __shared__ double mat1[XSIZE][YSIZE]. On a PC I would have made a dynamic allocation, but I do not know whether I can do that in shared memory. If calling malloc in a kernel allocates only in global memory (assuming such a call is possible), that does not help either.
Is there a way to declare arrays at runtime in a kernel? Is there any other way to resolve this problem?
You can declare dynamically sized shared memory allocations in CUDA, like this
__global__ void kernel()
{
    extern __shared__ double mat1[];
}
And then launch your kernel like this
kernel<<<grid,block,XSIZE*YSIZE*sizeof(double)>>>();
This is discussed in more detail in the CUDA programming guide.
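As a further illustration (an assumed extension, not part of the answer above), a kernel might stage one small matrix per block into that dynamically sized allocation like this; gmat, xsize and ysize are hypothetical names.

#include <cuda_runtime.h>

// Each block copies one xsize-by-ysize matrix from global memory into
// dynamic shared memory, then works on it there.
__global__ void kernel(const double *gmat, int xsize, int ysize)
{
    extern __shared__ double mat1[];          // sized by the launch configuration
    const int n = xsize * ysize;
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        mat1[i] = gmat[blockIdx.x * n + i];
    __syncthreads();
    // ... compute the product using mat1 in shared memory ...
}

// Launched with the shared memory extent as the third configuration argument:
// kernel<<<grid, block, XSIZE * YSIZE * sizeof(double)>>>(d_mats, XSIZE, YSIZE);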