Ensure that thrust doesn't memcpy from host to device - cuda

I have used the following method, expecting to avoid a memcpy from host to device. Does the thrust library ensure that there won't be a memcpy from host to device in the process?
#include <thrust/device_ptr.h>
#include <thrust/scan.h>

void EScanThrust(float * d_in, float * d_out, size_t size)
{
    // wrap the raw device pointers so thrust treats them as device iterators
    thrust::device_ptr<float> dev_ptr(d_in);
    thrust::device_ptr<float> dev_out_ptr(d_out);
    thrust::exclusive_scan(dev_ptr, dev_ptr + size, dev_out_ptr);
}
Here, d_in and d_out are prepared using cudaMalloc, d_in is filled with data using cudaMemcpy before calling this function, and size is the number of elements to scan.

Does the thrust library ensure that there won't be a memcpy from host to device in the process?
The code you've shown shouldn't involve any host->device copying. (How could it? There are no references anywhere to any host data in the code you have shown.)
For actual code, it's easy enough to verify the underlying CUDA activity using a profiler, for example:
nvprof --print-gpu-trace ./my_exe
If you keep your profiled code sequences short, it's pretty easy to line up the underlying CUDA activity with the thrust code that generated that activity. If you want to profile just a short segment of a longer sequence, then you can turn profiling on and off or else use NVTX markers to identify the desired range in the profiler output.
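For example, to capture just the scan in the profiler output, a wrapper along these lines would work (a sketch; profiledEScan is a hypothetical name, and it assumes nvprof is run with --profile-from-start off so that only the region between cudaProfilerStart/Stop is recorded):

#include <cuda_profiler_api.h>   // cudaProfilerStart / cudaProfilerStop
#include <nvToolsExt.h>          // NVTX markers (link with -lnvToolsExt)

void profiledEScan(float * d_in, float * d_out, size_t size)
{
    cudaProfilerStart();
    nvtxRangePushA("thrust_exclusive_scan");   // named range in the timeline

    EScanThrust(d_in, d_out, size);            // the function from the question

    nvtxRangePop();
    cudaProfilerStop();
}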

Related

Persistent buffers in CUDA

I have an application where I need to allocate and maintain a persistent buffer which can be used by successive launches of multiple kernels in CUDA. I will eventually need to copy the contents of this buffer back to the host.
I had the idea to declare a global scope device symbol which could be directly used in different kernels without being passed as an explicit kernel argument, something like
__device__ char* buffer;
but then I am uncertain how I should allocate memory and assign the address to this device pointer so that the memory has the persistent scope I require. So my question is really in two parts:
What is the lifetime of the various methods of allocating global memory?
How should I allocate memory and assign a value to the global scope pointer? Is it necessary to use device code malloc and run a setup kernel to do this, or can I use some combination of host side APIs to achieve this?
[Postscript: this question has been posted as a Q&A in response to this earlier SO question on a similar topic]
What is the lifetime of the various methods of allocating global memory?
All global memory allocations have a lifetime of the context in which they are allocated. This means that any global memory your application allocates is "persistent" by your definition, irrespective of whether you use host side APIs or device side allocation on the GPU runtime heap.
How should I allocate memory and assign a value to the global scope pointer? Is it necessary to use device code malloc and run a setup kernel to do this, or can I use some combination of host side APIs to achieve this?
Either method will work as you require, although host APIs are much simpler to use. There are also some important differences between the two approaches.
Memory allocations made using malloc or new in device code are allocated on the device runtime heap. This heap must be sized appropriately using the cudaDeviceSetLimit API before running malloc in device code, otherwise the call may fail. And the device heap is not accessible to the host side memory management APIs, so you also require a copy kernel to move the memory contents into host-API-accessible device memory before you can transfer them back to the host.
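For reference, the device-side route would look roughly like this (a sketch only: the kernel names and heap size are illustrative, not taken from the question):

__device__ char* buffer;

__global__ void setup_buffer(size_t buffer_sz)
{
    // run with a single thread; allocates from the device runtime heap
    buffer = static_cast<char*>(malloc(buffer_sz));
}

__global__ void copy_out(char* dst, size_t buffer_sz)
{
    // the device heap is invisible to cudaMemcpy, so stage the contents
    // into a cudaMalloc'd buffer that the host APIs can read
    for (size_t i = threadIdx.x; i < buffer_sz; i += blockDim.x)
        dst[i] = buffer[i];
}

// Host side:
//   cudaDeviceSetLimit(cudaLimitMallocHeapSize, 8 * 1024 * 1024); // size the heap first
//   setup_buffer<<<1,1>>>(buffer_sz);
//   ... kernels using buffer ...
//   copy_out<<<1,256>>>(d_staging, buffer_sz);
//   cudaMemcpy(h_data, d_staging, buffer_sz, cudaMemcpyDeviceToHost);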
The host API case, on the other hand, is extremely straightforward and has none of the limitations of device side malloc. A simple example would look something like:
#include <vector>

__device__ char* buffer;

int main()
{
    char* d_buffer;
    const size_t buffer_sz = 800 * 600 * sizeof(char);

    // Allocate memory
    cudaMalloc(&d_buffer, buffer_sz);

    // Zero memory and assign to global device symbol
    cudaMemset(d_buffer, 0, buffer_sz);
    cudaMemcpyToSymbol(buffer, &d_buffer, sizeof(char*));

    // Kernels go here using buffer

    // copy to host
    std::vector<char> results(800 * 600);
    cudaMemcpy(&results[0], d_buffer, buffer_sz, cudaMemcpyDeviceToHost);

    // buffer has a lifespan until it is freed here
    cudaFree(d_buffer);
    return 0;
}
[Standard disclaimer: code written in browser, not compiled or tested, use at own risk]
So basically you can achieve what you want with standard host side APIs: cudaMalloc, cudaMemcpyToSymbol, and cudaMemcpy. Nothing else is required.
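To show the "no explicit kernel argument" part in action, a kernel using the global symbol could look like this (a hypothetical fill_buffer kernel, assuming the setup code above and the same translation unit as the buffer declaration):

__global__ void fill_buffer(size_t n)
{
    // the global __device__ symbol is visible here directly;
    // no pointer needs to be passed as a kernel argument
    size_t idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        buffer[idx] = static_cast<char>(idx % 128);
}

It would be launched between the cudaMemcpyToSymbol call and the device-to-host copy, e.g. fill_buffer<<<(800*600 + 255)/256, 256>>>(800*600);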

Why do I need to declare CUDA variables on the Host before allocating them on the Device

I've just started trying to learn CUDA again and came across some code I don't fully understand.
// declare GPU memory pointers
float * d_in;
float * d_out;
// allocate GPU memory
cudaMalloc((void**) &d_in, ARRAY_BYTES);
cudaMalloc((void**) &d_out, ARRAY_BYTES);
When the GPU memory pointers are declared, they allocate memory on the host. The cudaMalloc calls throw away the information that d_in and d_out are pointers to floats.
I can't think why cudaMalloc would need to know about where in host memory d_in & d_out have originally been stored. It's not even clear why I need to use the host bytes to store whatever host address d_in & d_out point to.
So, what is the purpose of the original variable declarations on the host?
======================================================================
I would've thought something like this would make more sense:
// declare GPU memory pointers
cudaFloat * d_in;
cudaFloat * d_out;
// allocate GPU memory
cudaMalloc((void**) &d_in, ARRAY_BYTES);
cudaMalloc((void**) &d_out, ARRAY_BYTES);
This way, everything GPU related takes place on the GPU. If d_in or d_out are accidentally used in host code, an error can be thrown at compile time, since those variables wouldn't be defined on the host.
I guess what I also find confusing is that by storing device memory addresses on the host, it feels like the device isn't fully in charge of managing its own memory. It feels like there's a risk of host code accidentally overwriting the value of either d_in or d_out, either by accidentally assigning to them in host code or through some other more subtle error, which could cause the GPU to lose access to its own memory. Also, it seems strange that the addresses assigned to d_in & d_out are chosen by the host, instead of the device. Why should the host know anything about which addresses are/are not available on the device?
What am I failing to understand here?
I can't think why cudaMalloc would need to know about where in host memory d_in & d_out have originally been stored
That is just the C pass by reference idiom.
It's not even clear why I need to use the host bytes to store whatever host address d_in & d_out point to.
Ok, so let's design the API your way. Here is a typical sequence of operations on the host -- allocate some memory on the device, copy some data to that memory, launch a kernel to do something to that memory. You can think for yourself how it would be possible to do this without having the pointers to the allocated memory stored in a host variable:
cudaMalloc(somebytes);
cudaMemcpy(?????, hostdata, somebytes, cudaMemcpyHostToDevice);
kernel<<<1,1>>>(?????);
If you can explain what should be done with ????? if we don't have the address of the memory allocation on the device stored in a host variable, then you are really onto something. If you can't, then you have deduced the basic reason why we store the address of memory allocated on the GPU in a host variable.
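With the address stored in a host variable, that same sequence becomes well defined (a minimal sketch reusing the placeholder names somebytes, hostdata, and kernel from above):

float *d_data;                                   // host variable that will hold a device address
cudaMalloc((void **)&d_data, somebytes);         // cudaMalloc writes the device address into it
cudaMemcpy(d_data, hostdata, somebytes, cudaMemcpyHostToDevice);
kernel<<<1,1>>>(d_data);                         // the same device address is handed to the kernel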
Further, because typed host pointers are used to store the addresses of device allocations, the compiler can type check kernel arguments. So this:
__global__ void kernel(double *data, int N);
// .....
int N = 1 << 20;
float * d_data;
cudaMalloc((void **)&d_data, N * sizeof(float));
kernel<<<1,1>>>(d_data, N); // error: float * passed where double * is expected
can report type mismatches at compile time, which is very useful.
Your fundamental conceptual failure is mixing up host-side code and device-side code. If you call cudaMalloc() from code executing on the CPU, then, well, it's on the CPU: it's you who wants the arguments in CPU memory, and the result in CPU memory. You asked for it. cudaMalloc has told the GPU/device how much of its (the device's) memory to allocate, but if the CPU/host wants to access that memory, it needs a way to refer to it that the device will understand. The memory location on the device is a way to do this.
Alternatively, you can call it from device-side code; then everything takes place on the GPU. (Although, frankly, I've never done it myself and it's not such a great idea except in special cases).

Set the number of blocks and threads in calling a device function in CUDA?

I have a basic question about calling a device function from a global CUDA kernel. Can we specify the number of blocks and threads when calling a device function?
I posted a question earlier about min reduction (here) and I want to call this function inside another global kernel. However, the reduction code needs a certain number of blocks and threads.
There are two types of functions that can be called on the device:
__device__ functions are like ordinary C or C++ functions: they operate in the context of a single (CUDA) thread. It's possible to call these from any number of threads in a block, but from the standpoint of the function itself, it does not automatically create a set of threads like a kernel launch does.
__global__ functions or "kernels" can only be called using a kernel launch method (e.g. my_kernel<<<...>>>(...); in the CUDA runtime API). When calling a __global__ function via a kernel launch, you specify the number of blocks and threads to launch as part of the kernel configuration (<<<...>>>). If your GPU is of compute capability 3.5 or higher, then you can also call a __global__ function from device code (using essentially the same kernel launch syntax, which allows you to specify blocks and threads for the "child" kernel). This employs CUDA Dynamic Parallelism which has a whole section of the programming guide dedicated to it.
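As a minimal illustration of that distinction (hypothetical square / square_all functions; the launch configuration applies only to the __global__ kernel):

__device__ float square(float x)        // executes in the context of a single thread
{
    return x * x;
}

__global__ void square_all(float *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] = square(data[idx]);  // ordinary function call, no <<<...>>>
}

// Host side: blocks and threads are specified only when launching the kernel:
// square_all<<<(n + 255) / 256, 256>>>(d_data, n);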
There are many CUDA sample codes that demonstrate:
calling a __device__ function, such as simpleTemplates
calling a __global__ function from the device, such as cdpSimplePrint

Difference in creating a CUDA context

I have a program that uses three kernels. In order to get the speedups, I was doing a dummy memory copy to create a context, as follows:
__global__ void warmStart(int* f)
{
    *f = 0;
}
which is launched before the kernels I want to time as follows:
int *dFlag = NULL;
cudaMalloc( (void**)&dFlag, sizeof(int) );
warmStart<<<1, 1>>>(dFlag);
Check_CUDA_Error("warmStart kernel");
I also read about other, simpler ways to create a context, such as cudaFree(0) or cudaDeviceSynchronize(). But using these API calls gives worse times than using the dummy kernel.
The execution times of the program, after forcing the context, are 0.000031 seconds for the dummy kernel and 0.000064 seconds for both cudaDeviceSynchronize() and cudaFree(0). The times were obtained as the mean of 10 individual executions of the program.
Therefore, the conclusion I've reached is that launching a kernel initializes something that is not initialized when creating a context in the canonical way.
So, what's the difference of creating a context in these two ways, using a kernel and using an API call?
I ran the test on a GTX 480, using CUDA 4.0 under Linux.
Each CUDA context has memory allocations that are required to execute a kernel, but that are not required in order to synchronize, allocate memory, or free memory. The initial allocation of this context memory, and the resizing of these allocations, is deferred until a kernel requires these resources. Examples of these allocations include the local memory buffer, the device heap, and the printf heap.
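In practical terms, a warm-up that is meant to hide this one-time cost therefore needs an actual kernel launch rather than just an API call; a minimal sketch along the lines of the warmStart code in the question:

__global__ void warm_up(int *f) { *f = 0; }   // trivial kernel, as in the question

void initialize_device()
{
    cudaFree(0);                    // establishes the context, but nothing more
    int *d_flag = NULL;
    cudaMalloc((void**)&d_flag, sizeof(int));
    warm_up<<<1, 1>>>(d_flag);      // the first launch pays the deferred setup cost
    cudaDeviceSynchronize();        // make sure it has completed before any timing starts
    cudaFree(d_flag);
}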

Calling a kernel from a kernel

A follow-up question to: CUDA: Calling a __device__ function from a kernel
I'm trying to speed up a sort operation. A simplified pseudo version follows:
// some costly swap operation
__device__ void swap(float* Adata, float* Bdata){
    float saveData;      // swap some
    saveData = *Adata;   // big complex
    *Adata   = *Bdata;   // data chunk
    *Bdata   = saveData;
}
// a rather simple sort operation
__global__ void sort(float data[]){
    for (int i = 0; i < limit; i++){
        // find left swap point
        // find right swap point
        swap<<<1,1>>>(left, right);
    }
}
(Note: This simple version doesn't show the reduction techniques in the blocks.)
The idea is that it is easy (fast) to identify the swap points. The swap operation is costly (slow). So use one block to find/identify the swap points. Use other blocks to do the swap operations. i.e. Do the actual swapping in parallel.
This sounds like a decent plan. But if the compiler inlines the device calls, then there is no parallel swapping taking place.
Is there a way to tell the compiler NOT to inline a device call?
It has been a long time since this question was asked. When I googled the same problem, I got to this page. It seems I have found the solution.
Solution:
I somehow reached [here][1] and saw the cool approach of launching a kernel from within another kernel.
__global__ void kernel_child(float *var1, int N){
    //do data operations here
}

__global__ void kernel_parent(float *var1, int N)
{
    kernel_child<<<1,2>>>(var1, N);
}
Dynamic parallelism, available in CUDA 5.0 and later, makes this possible. Also, make sure you compile for the compute_35 architecture or above.
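For completeness, a host-side driver for the snippet above could look something like this (a minimal sketch; the allocation size is arbitrary):

int main()
{
    const int N = 1024;
    float *d_var1;
    cudaMalloc(&d_var1, N * sizeof(float));

    kernel_parent<<<1, 1>>>(d_var1, N);   // the parent launches kernel_child on the device
    cudaDeviceSynchronize();              // wait for parent and child to finish

    cudaFree(d_var1);
    return 0;
}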
Terminal way
You can run the above parent kernel (which will eventually launch the child kernel) from the terminal. Verified on a Linux machine.
$ nvcc -arch=sm_35 -rdc=true yourFile.cu
$ ./a.out
Hope it helps. Thank you!
[1]: http://developer.download.nvidia.com/assets/cuda/docs/TechBrief_Dynamic_Parallelism_in_CUDA_v2.pdf
Edit (2016):
Dynamic parallelism was introduced with the second generation of Kepler architecture GPUs. Launching kernels from the device is supported on devices of compute capability 3.5 and higher.
Original Answer:
You will have to wait until the end of the year when the next generation of hardware is available. No current CUDA devices can launch kernels from other kernels - it is presently unsupported.