Should we reuse the cublasHandle_t across different calls?

I'm using the latest version, CUDA 5.5, and the new cuBLAS API is stateful: every function needs a cublasHandle_t, e.g.
cublasHandle_t handle;
cublasCreate_v2(&handle);
cublasDgemm_v2(handle, A_trans, B_trans, m, n, k, &alpha, d_A, lda, d_B, ldb, &beta, d_C, ldc);
cublasDestroy_v2(handle);
Is it good practice to reuse this handle instance as much as possible, like some sort of session, or would the performance impact be so small that it makes more sense to lower code complexity by having short-lived handle instances and therefore create/destroy them continuously?

I think it is a good practice for two reasons:
From the cuBLAS Library User Guide, "cublasCreate() [...] allocates hardware resources on the host", which makes me think that there is some overhead on its call.
Repeated cuBLAS handle creation/destruction can break concurrency by introducing unneeded context synchronizations.

As the CUDA Toolkit documentation states:
The application must initialize the handle to the cuBLAS library context by calling the cublasCreate() function. Then, the context is explicitly passed to every subsequent library function call. Once the application finishes using the library, it must call the function cublasDestroy() to release the resources associated with the cuBLAS library context.
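
For illustration, a minimal sketch of the reuse pattern; num_batches, the matrix dimensions (m, n, k, lda, ldb, ldc) and the device pointers d_A/d_B/d_C are placeholders assumed to be set up elsewhere:
#include <cublas_v2.h>

// Sketch: create one handle up front and reuse it for every call.
cublasHandle_t handle;
cublasCreate(&handle);                      // pay the setup cost once

const double alpha = 1.0, beta = 0.0;
for (int i = 0; i < num_batches; ++i) {
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k,
                &alpha, d_A, lda,
                d_B, ldb,
                &beta, d_C, ldc);           // same handle on every iteration
}

cublasDestroy(handle);                      // release resources once, at the end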

Related

CUDA cuFFT and cuBLAS libraries require of several handles if working with several non-blocking streams?

Does anyone know whether cuBLAS and cuFFT require creating a different handle/plan for each stream involved? Or can they re-use the same handle/plan and set the stream before the invocation?
//re-using the same handle/plan for cuBLAS and cuFFT invocations with several non-blocking streams
fftcu_plan3d_z(planBWD, ioverflow, ns);
cublasCreate(&handle);
...
cublasSetStream(handle, stream[i]);
cublasZcopy(handle, count, x, incx, data, incy);
cufftSetStream(planBWD, stream[i]);
cufftExecZ2Z(planBWD, data, data, CUFFT_INVERSE);
Is this correct? I am seeing some performance issues and am wondering whether they could be related to this assumption of re-using the same handle/plan over several streams. Thanks!

Ensure that thrust doesn't memcpy from host to device

I have used the following method, expecting to avoid a memcpy from host to device. Does the Thrust library ensure that there won't be a memcpy from host to device in the process?
void EScanThrust(float * d_in, float * d_out)
{
    thrust::device_ptr<float> dev_ptr(d_in);
    thrust::device_ptr<float> dev_out_ptr(d_out);
    thrust::exclusive_scan(dev_ptr, dev_ptr + size, dev_out_ptr);
}
Here d_in and d_out are prepared using cudaMalloc, and d_in is filled with data using cudaMemcpy before calling this function.
Does the Thrust library ensure that there won't be a memcpy from host to device in the process?
The code you've shown shouldn't involve any host->device copying. (How could it? There are no references anywhere to any host data in the code you have shown.)
For actual codes, it's easy enough to verify the underlying CUDA activity using a profiler, for example:
nvprof --print-gpu-trace ./my_exe
If you keep your profiled code sequences short, it's pretty easy to line up the underlying CUDA activity with the thrust code that generated that activity. If you want to profile just a short segment of a longer sequence, then you can turn profiling on and off or else use NVTX markers to identify the desired range in the profiler output.
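As a sketch of turning profiling on and off around a short segment, the scan from the question can be bracketed with the CUDA profiler API (EScanThrustProfiled is a hypothetical wrapper name); running with nvprof --profile-from-start off --print-gpu-trace ./my_exe then captures only the bracketed region:
#include <cuda_profiler_api.h>   // cudaProfilerStart / cudaProfilerStop
#include <thrust/device_ptr.h>
#include <thrust/scan.h>

void EScanThrustProfiled(float *d_in, float *d_out, int size)
{
    thrust::device_ptr<float> dev_ptr(d_in);
    thrust::device_ptr<float> dev_out_ptr(d_out);

    cudaProfilerStart();                    // begin capturing GPU activity here
    thrust::exclusive_scan(dev_ptr, dev_ptr + size, dev_out_ptr);
    cudaProfilerStop();                     // stop capturing
}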

Asynchrony and memory ownership in CUBLAS

CUBLAS is an asynchronous library. What are the requirements on memory ownership for parameters passed to CUBLAS?
It seems clear that matrices being operated on by CUBLAS should not be freed until the asynchronous calls complete - but what about the scalar parameters?
For example, is the following code sound:
//...
float alpha = compute_alpha();
cublasSaxpy(handle, n,
//Taking the address of an automatic variable!
&alpha, //and handing it to an asynchronous function!
x, incx,
y, incy);
return;
I'm worried that alpha might not exist by the time Saxpy actually gets launched: if we return from the function before Saxpy launches, and the stack space for alpha gets overwritten with other stuff, it's possible Saxpy could get the wrong answer (or even crash).
I don't want to have to copy my scalar parameters to some sort of heap memory and ensure they don't get destructed until after an asynchronous call to CUBLAS - tracking this would be complicated.
It'd be great if CUBLAS explicitly guaranteed that scalar parameters do not need to live after a call to CUBLAS, but the documentation doesn't seem super clear about this.
If the pointer mode is HOST, alpha and beta can be on the stack or allocated on the heap. Underneath, the kernel(s) will be launched with the value of alpha and beta. So if they were allocated on the heap, they can be freed just after the return of the call (even though the kernel launch is asynchronous).
If the pointer mode is DEVICE, alpha and beta MUST be accessible on the device and their values should not be modified until the kernel is done. Note that since cudaFree does an implicit cudaDeviceSynchronize(), cudaFree of alpha/beta can still be called just after the call, but it defeats the purpose of DEVICE pointer mode in this case.
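
A minimal sketch of the HOST pointer-mode case described above; the handle, x, y and n are assumed to be set up elsewhere, and scaled_add is just an illustrative wrapper name:
#include <cublas_v2.h>

void scaled_add(cublasHandle_t handle, int n, const float *x, float *y)
{
    // HOST pointer mode: cuBLAS reads the scalar by value when the call is made,
    // so a stack variable is fine even though the launch is asynchronous.
    cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_HOST);

    float alpha = 2.0f;                     // automatic (stack) variable
    cublasSaxpy(handle, n, &alpha, x, 1, y, 1);
    // Safe to return here: alpha's value has already been captured.
}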

Alternatives to malloc for dynamic memory allocations in CUDA kernel functions

I'm trying to compile my CUDA C code for a GPU with the sm_10 architecture, which does not support invoking malloc from __global__ functions.
I need to keep a tree for which the nodes are created dynamically in GPU memory. Unfortunately, without malloc apparently I can't do that.
Is there a way to copy an entire tree using cudaMalloc? I think that such an approach will just copy the root of my tree.
Quoting the CUDA C Programming Guide:
Dynamic global memory allocation and operations are only supported by devices of compute capability 2.x and higher.
For compute capability earlier than 2.0, the only possibilities are:
Use cudaMalloc from the host side to allocate as much global memory as you need in your __global__ function (see the sketch after this list);
Use static allocation if you know the required memory size at compile time.
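
As a rough sketch of the first option, the entire node pool can be allocated from the host with cudaMalloc, and the kernel then hands out nodes by pool index rather than calling device-side malloc; the Node layout, MAX_NODES and init_nodes are placeholders for illustration:
#include <cuda_runtime.h>

// Hypothetical node layout; children are pool indices instead of pointers.
struct Node {
    int value;
    int left;     // index of left child in the pool, -1 if none
    int right;    // index of right child in the pool, -1 if none
};

#define MAX_NODES 4096

__global__ void init_nodes(Node *pool)
{
    // Each thread initializes one pre-allocated slot; no device-side malloc needed.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < MAX_NODES) {
        pool[idx].value = idx;
        pool[idx].left  = -1;
        pool[idx].right = -1;
    }
}

int main()
{
    Node *d_pool;
    cudaMalloc(&d_pool, MAX_NODES * sizeof(Node));  // whole pool up front, from the host

    init_nodes<<<(MAX_NODES + 255) / 256, 256>>>(d_pool);
    cudaDeviceSynchronize();

    cudaFree(d_pool);
    return 0;
}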

How to free __device__ memory in CUDA

__device__ int data;
__constant__ int var1;
How do I free "data" and "var1" in CUDA?
Thank you
With device compute capability sm_20 and above you can simply use the new and delete keywords;
even better would be to use the CUDA Thrust API (an implementation of the standard template library on top of the GPU), really cool stuff.
http://code.google.com/p/thrust/
You can't free it. It gets automatically freed when the program ends.
Similarly, just as in host code, you don't free global variables.
As @CygnusX1 said, you can't free it. As you have declared it, the memory will be allocated for the life of your program; note that this holds even if you never call the kernel.
You can, however, use cudaMalloc and cudaFree (or new/delete in device code as of CUDA 4.0) to allocate and free memory temporarily. Of course you must manipulate everything with pointers, but this is a huge savings if you need to store several large objects, free them, and then store several more large objects...
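
A minimal sketch of that pattern, with placeholder sizes:
#include <cuda_runtime.h>

int main()
{
    // Unlike a __device__ variable, which lives for the whole program,
    // this buffer exists only between cudaMalloc and cudaFree.
    float *d_buf;
    cudaMalloc(&d_buf, 1 << 20);
    // ... launch kernels that use d_buf ...
    cudaFree(d_buf);

    // The freed memory can now back the next large object.
    double *d_next;
    cudaMalloc(&d_next, 1 << 22);
    // ... launch kernels that use d_next ...
    cudaFree(d_next);
    return 0;
}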