Do the CUDA cuFFT and cuBLAS libraries require several handles when working with several non-blocking streams?

Does anyone know whether cuBLAS and cuFFT require creating a different handle/plan for each stream involved? Or can the same handle/plan be re-used, setting the stream before each invocation?
// Re-using the same handle/plan for cuBLAS and cuFFT invocations with several non-blocking streams
fftcu_plan3d_z(planBWD, ioverflow, ns);              // cuFFT 3D Z2Z plan created once (user wrapper)
cublasCreate(&handle);                               // cuBLAS handle created once
...
cublasSetStream(handle, stream[i]);                  // point the shared handle at stream i before the call
cublasZcopy(handle, count, x, incx, data, incy);
cufftSetStream(planBWD, stream[i]);                  // point the shared plan at stream i before the call
cufftExecZ2Z(planBWD, data, data, CUFFT_INVERSE);
Is this correct? I am seeing some performance issues and I'm wondering whether they could be related to re-using the same handle/plan across several streams. Thanks!
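For comparison, here is a rough sketch of the per-stream alternative I am considering, with one handle and one plan per non-blocking stream. NSTREAMS and the plan dimensions are placeholders, and I call cufftPlan3d directly instead of my wrapper:

#include <cublas_v2.h>
#include <cufft.h>

const int NSTREAMS = 4;                        // placeholder stream count
const int nx = 64, ny = 64, nz = 64;           // placeholder plan dimensions

cudaStream_t   stream[NSTREAMS];
cublasHandle_t handle[NSTREAMS];
cufftHandle    planBWD[NSTREAMS];

for (int i = 0; i < NSTREAMS; ++i) {
    cudaStreamCreateWithFlags(&stream[i], cudaStreamNonBlocking);
    cublasCreate(&handle[i]);
    cublasSetStream(handle[i], stream[i]);     // bind handle i to stream i once
    cufftPlan3d(&planBWD[i], nx, ny, nz, CUFFT_Z2Z);
    cufftSetStream(planBWD[i], stream[i]);     // bind plan i to stream i once
}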

Related

CUDA pipeline asynchronous memory copy from global to shared memory

I'm currently learning how to write fast CUDA kernels. I implemented a tiled matrix multiplication (block size 32x32) which only does coalesced reads/writes from/to global memory and has no bank conflicts when writing/reading from shared memory (it reaches ~50% of the speed of the PyTorch matrix multiplication implementation). Now I tried to use pipelining (two stages) and copy memory asynchronously from global to shared memory (see here, and here).
torch::PackedTensorAccessor32<float,2,torch::RestrictPtrTraits> a; // input to the kernel
constexpr unsigned stages_count = 2;
__shared__ float s_a[stages_count][32][32];
auto block = cooperative_groups::this_thread_block();
__shared__ cuda::pipeline_shared_state<cuda::thread_scope::thread_scope_block, stages_count> shared_state;
auto pipeline = cuda::make_pipeline(block, &shared_state);

for (int step = 0; step < a.size(1); step += 32) {
    for (int stage = 0; stage < stages_count; stage++) {
        pipeline.producer_acquire();

        // what I would like to do (this works but is not asynchronous)
        s_a[stage][threadIdx.y][threadIdx.x] =
            a[blockIdx.x * stages_count * 32 + stage * 32 + threadIdx.y][step + threadIdx.x];

        // this does not work
        cuda::memcpy_async(block,
                           &s_a[stage][threadIdx.y][0],
                           &a[blockIdx.x * stages_count * 32 + stage * 32 + threadIdx.y][step],
                           sizeof(float) * 32,
                           pipeline);

        pipeline.producer_commit();
    }
    for (int stage = 0; stage < stages_count; stage++) {
        pipeline.consumer_wait();
        // use shared memory
        pipeline.consumer_release();
    }
}
However, I don't know how to make the asynchronous memory copy work. The problem is, I think, that I don't want to copy 32*32 consecutive floats from global memory, but one tile of the matrix (32 chunks of 32 consecutive floats). Also, is it possible to somehow transpose (or, e.g., use a permuted shared-memory layout) while asynchronously loading, to prevent later bank conflicts?
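One idea would be to issue one block-wide cuda::memcpy_async per tile row, so that every thread passes identical arguments to the collective call. A rough sketch, reusing the names from the code above (block, s_a, a, stage, stages_count, step, pipeline) and with no claim about correctness or performance of the surrounding kernel:

for (int row = 0; row < 32; ++row) {
    // one collective copy per contiguous 32-float row of the tile
    cuda::memcpy_async(block,
                       &s_a[stage][row][0],
                       &a[blockIdx.x * stages_count * 32 + stage * 32 + row][step],
                       sizeof(float) * 32,
                       pipeline);
}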
The new Hopper architecture (H100 GPU) has a new hardware feature for this, called the tensor memory accelerator (TMA). Software support will come with CUDA 12 later this year.
As far as I understand it, this will allow tensor tiles to be copied asynchronously with a single command. But, if it works at all on Ampere and older architectures, it might be quite slow, in the same way that the emulated cuda::memcpy_async is quite slow on pre-Ampere GPUs in my experience, due to missing hardware support.
Not sure if the transpose you mention will be part of that new API but it might:
TMA significantly reduces addressing overhead and improves efficiency with support for different tensor layouts (1D-5D tensors), different memory access modes, reductions, and other features.
When asynchronicity is not required, CUB provides some useful functionality for "transposing" data in cub::BlockLoad (and cub::BlockStore). A downside of these is that they use shared memory only as an intermediary and ultimately write to registers or local memory, so for this kind of tiled matrix-multiplication kernel they are probably not of any help. They might add a feature for copying only into shared memory in the future. Maybe these new containers will even support asynchronicity.
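For illustration, here is a minimal sketch of cub::BlockLoad with the BLOCK_LOAD_TRANSPOSE policy; the block size, items per thread, kernel name, and the d_in pointer are placeholders I chose, not anything from the question:

#include <cub/cub.cuh>

__global__ void block_load_example(const float* d_in)
{
    constexpr int BLOCK_THREADS    = 128;
    constexpr int ITEMS_PER_THREAD = 4;

    using BlockLoadT = cub::BlockLoad<float, BLOCK_THREADS, ITEMS_PER_THREAD,
                                      cub::BLOCK_LOAD_TRANSPOSE>;
    __shared__ typename BlockLoadT::TempStorage temp_storage;    // shared memory used only as staging

    float items[ITEMS_PER_THREAD];                                // final destination: registers
    int block_offset = blockIdx.x * BLOCK_THREADS * ITEMS_PER_THREAD;
    BlockLoadT(temp_storage).Load(d_in + block_offset, items);    // striped global loads, then "transpose" to a blocked arrangement
}

Note that, as said above, the data ends up in registers (or local memory), not in shared memory.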

Nvidia CUDA: Profiler indicates memory transfer operations are not performed asynchronously

I have profiled my CUDA application and the profiling results are not as I would expect them to be.
Here's a summary of how my application works:
There are 4 streams used
A CPU loop polls the state of each stream
If a stream is found to be idle, then a function is called: launch_job
This function looks like this:
void launch_job(cudaStream_t stream, ...)
{
    cudaMemcpyAsync(..., stream);                          // H2D copy
    cuda_process_kernel<<<grid, block, 0, stream>>>(...);
    cudaError_t err = cudaGetLastError();
    if (err) ...
    cudaMemcpyAsync(..., stream);                          // D2H copy
}
For the first block of four kernel launches seen in the profiler screenshot, the stream is different each time launch_job is called.
However, there is no overlap of the memory transfers or the kernel executions.
I would have expected to see at least one memory transfer overlapped with a kernel execution, if not both memory transfers. (One copy is in the H2D direction, the other in the D2H direction, but that was probably obvious.)
Have I fundamentally misunderstood something about the way in which streams work? Or is there some other reason why my launch_job function does not produce parallelized memory transfer and kernel function execution?
Please try this:
For each stream, do cudaMemcpyAsync(..., stream) to copy H2D.
For each stream, launch the kernels on that stream;
For each stream, do cudaMemcpyAsync(..., stream) to copy D2H.
Note that you have three for-loops here. If your GPU supports it, your profiler should show some overlap among the different streams.
Also, if your data is really small, say only 1 MB, you may not see much overlap; it would be more obvious if you had a 100 MB copy on each stream.
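For example, a minimal sketch of that issue order; num_streams, the buffer sizes, and the kernel launch parameters are placeholders (grid/block as in your launch_job), and the host buffers must be pinned with cudaMallocHost for the copies to be truly asynchronous:

const int num_streams = 4;
const size_t bytes = 100 << 20;                      // 100 MB per stream (placeholder)
cudaStream_t stream[num_streams];
float *h_in[num_streams], *h_out[num_streams], *d_in[num_streams], *d_out[num_streams];

for (int i = 0; i < num_streams; ++i) {
    cudaStreamCreate(&stream[i]);
    cudaMallocHost(&h_in[i], bytes);                 // pinned host memory
    cudaMallocHost(&h_out[i], bytes);
    cudaMalloc(&d_in[i], bytes);
    cudaMalloc(&d_out[i], bytes);
}

for (int i = 0; i < num_streams; ++i)                // 1) all H2D copies
    cudaMemcpyAsync(d_in[i], h_in[i], bytes, cudaMemcpyHostToDevice, stream[i]);

for (int i = 0; i < num_streams; ++i)                // 2) all kernel launches
    cuda_process_kernel<<<grid, block, 0, stream[i]>>>(d_in[i], d_out[i]);

for (int i = 0; i < num_streams; ++i)                // 3) all D2H copies
    cudaMemcpyAsync(h_out[i], d_out[i], bytes, cudaMemcpyDeviceToHost, stream[i]);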

Does cudnnCreate() call create multiple streams internally?

I am writing a simple multi-stream CUDA application. Following is the part of code where I create cuda-streams, cublas-handle and cudnn-handle:
cudaSetDevice(0);

int num_streams = 1;
cudaStream_t   streams[num_streams];
cudnnHandle_t  mCudnnHandle[num_streams];
cublasHandle_t mCublasHandle[num_streams];

for (int ii = 0; ii < num_streams; ii++) {
    cudaStreamCreateWithFlags(&streams[ii], cudaStreamNonBlocking);
    cublasCreate(&mCublasHandle[ii]);
    cublasSetStream(mCublasHandle[ii], streams[ii]);
    cudnnCreate(&mCudnnHandle[ii]);
    cudnnSetStream(mCudnnHandle[ii], streams[ii]);
}
Now, my stream count is 1. But when I profile the executable of above application using Nvidia Visual Profiler I get following:
For every stream I create, it creates 4 additional streams. I tested it with num_streams = 8, and it showed 40 streams in the profiler. This raised the following questions in my mind:
Does cudnn internally create streams? If yes, then why?
If it implicitly creates streams then what is the way to utilize it?
In such case does explicitly creating streams make any sense?
Does cudnn internally create streams?
Yes.
If yes, then why?
Because it is a library, and it may need to organize CUDA concurrency; streams are used to organize CUDA concurrency. If you want a detailed explanation of exactly what the streams are used for, there isn't one: the library internals are not documented.
If it implicitly creates streams then what is the way to utilize it?
Those streams are not intended for you to utilize separately/independently. They are for usage by the library, internal to the library routines.
In such case does explicitly creating streams make any sense?
You would still need to explicitly create any streams you needed to manage CUDA concurrency outside of the library usage.
I would like to point out that this statement is a bit misleading:
"For every stream I create it creates additional 4 more streams."
What you are doing is going through a loop, and at each loop iteration you are creating a new handle. Your observation is tied to the number of handles you create, not the number of streams you create.

Should we reuse the cublasHandle_t across different calls?

I'm using the latest version, CUDA 5.5, and the new cuBLAS API is stateful: every function needs a cublasHandle_t, e.g.
cublasHandle_t handle;
cublasCreate_v2(&handle);
cublasDgemm_v2(handle, A_trans, B_trans, m, n, k, &alpha, d_A, lda, d_B, ldb, &beta, d_C, ldc);
cublasDestroy_v2(handle);
Is it good practice to reuse this handle instance as much as possible, like some sort of session, or would the performance impact be so small that it makes more sense to lower code complexity by using short-lived handle instances, creating and destroying them continuously?
I think it is a good practice for two reasons:
From the cuBLAS Library User Guide, "cublasCreate() [...] allocates hardware resources on the host", which makes me think that there is some overhead on its call.
Multiple cuBLAS handle creation/destruction can break concurrency by unneeded context synchronizations.
As the CUDA Toolkit documentation states:
The application must initialize the handle to the cuBLAS library context by calling the cublasCreate() function. Then, the context is explicitly passed to every subsequent library function call. Once the application finishes using the library, it must call the function cublasDestroy() to release the resources associated with the cuBLAS library context.
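A minimal sketch of that create-once / reuse / destroy-once pattern; the GEMM arguments, the iteration count, and the device pointers are placeholders assumed to be set up elsewhere:

#include <cublas_v2.h>

cublasHandle_t handle;
cublasCreate(&handle);                       // pay the setup cost once

for (int iter = 0; iter < num_iterations; ++iter) {
    // reuse the same handle for every call (optionally switching streams in between)
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                &alpha, d_A, lda, d_B, ldb, &beta, d_C, ldc);
}

cublasDestroy(handle);                       // release resources once, at shutdown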

CUDA overlap of data transfer and kernel execution, implicit synchronization for streams

After reading the "overlap of data transfer and kernel execution" section in the "CUDA C Programming Guide", I have a question: what exactly does data transfer refer to? Does it include cudaMemsetAsync, cudaMemcpyAsync, cudaMemset, and cudaMemcpy? Of course, the memory allocated for the memcpy is pinned.
In the implicit synchronization (streams) section, the guide says "a device memory set" may serialize the streams. So, does that refer to cudaMemsetAsync, cudaMemcpyAsync, cudaMemset, and cudaMemcpy? I am not sure.
Any function call with Async at the end takes a stream parameter. Additionally, some of the libraries provided by the CUDA toolkit also have the option of setting a stream. By using this, you can have multiple streams running concurrently.
This means that unless you specifically create and set a stream, you will be using the default stream. There are, for example, no separate default streams for data transfers and kernel execution; you have to create two (or more) streams yourself and assign each one a task of your choice.
A common use case is to have the two streams mentioned in the programming guide. Keep in mind this is only useful if you have multiple kernel launches: you can fetch the data needed for the next (independent) kernel, or for the next iteration of the current kernel, while computing the results of the current kernel. This can maximize both compute and bandwidth utilization.
Of the function calls you mention, cudaMemcpy and cudaMemcpyAsync are the only ones performing data transfers; I don't think cudaMemset and cudaMemsetAsync can be termed data transfers.
Both cudaMemcpyAsync and cudaMemsetAsync can be used with streams, while cudaMemset and cudaMemcpy are blocking calls that do not make use of streams.
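For example, a minimal sketch using the asynchronous variants on a non-default stream, with pinned host memory so the H2D copy can actually overlap; the buffer size is a placeholder:

#include <cuda_runtime.h>

float *h_buf, *d_buf;
size_t bytes = 1 << 20;                        // 1 MB placeholder

cudaMallocHost(&h_buf, bytes);                 // pinned host allocation
cudaMalloc(&d_buf, bytes);

cudaStream_t s;
cudaStreamCreate(&s);

cudaMemsetAsync(d_buf, 0, bytes, s);                              // device memset issued on stream s
cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, s);  // async H2D copy on stream s
cudaStreamSynchronize(s);                                         // wait for the stream's work to finish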